
MEASURES OF CENTRAL TENDENCY, DISPERSION AND CORRELATION

MEASURES OF CENTRAL TENDENCY (AVERAGES)

Introduction

We saw how data can be summarised and presented in tabular, chart and

graphical formats. Sometimes you might need more information than that provided by

diagrammatic representations of data. In such circumstances you may need to apply some

sort of numerical analysis, for example you might wish to calculate a measure of

centrality and a measure of dispersion.

An average is a representative figure that is used to give some impression of the size of

all the items in the population. There are three main types of average.

• Arithmetic mean

• Mode

• Median

We will be looking at each of these averages in turn, their calculation, advantages and

disadvantages.

The arithmetic mean

Arithmetic mean of ungrouped data

The arithmetic mean is the best known type of average and is widely understood. It is

used for further statistical analysis.

Example: The arithmetic mean

The demand for a product on each of 20 days was as follows (in units).

3 12 7 17 3 14 9 6 11 10 1 4

19 7 15 6 9 12 12 8

The arithmetic mean of daily demand is

$$\bar{x} = \frac{\sum x}{n} = \frac{185}{20} = 9.25 \text{ units}$$
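As a quick check of the arithmetic, the mean of the 20 daily demand figures can be computed directly. This is only a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the notes.

```python
from statistics import mean

# Daily demand (units) for the 20 days listed in the example
demand = [3, 12, 7, 17, 3, 14, 9, 6, 11, 10, 1, 4,
          19, 7, 15, 6, 9, 12, 12, 8]

total_demand = sum(demand)    # sum of all observations
mean_demand = mean(demand)    # sum divided by the number of days

print(total_demand, len(demand), mean_demand)
```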


Arithmetic mean of data in a frequency distribution

It is more likely in an assessment that you will be asked to calculate the arithmetic mean
of a frequency distribution.

Example:

Consider the following table and complete the (fx) column and calculate the value of the

arithmetic mean.

ARITHMETIC MEAN OF GROUPED DATA IN CLASS INTERVALS

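The arithmetic mean of grouped data is determined as:

$$\bar{x} = \frac{\sum fx}{n} \quad \text{or} \quad \bar{x} = \frac{\sum fx}{\sum f}$$

where n is the number of values recorded, or the number of items measured, f is the
frequency of each class and x is the value taken to represent that class.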

The mid-point of class intervals

To calculate the arithmetic mean of grouped data we therefore need to decide on a value

which best represents all of the values in a particular class interval. This value is known

as the mid-point.

The mid-point of each class interval is conventionally taken, on the assumption that the

frequencies occur evenly over the class interval range.
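The mid-point method can be sketched as follows. The class intervals and frequencies here are hypothetical, chosen only to illustrate the steps (they are not the table from these notes): take the mid-point x of each class, multiply by the class frequency f, and divide the total of fx by the total frequency.

```python
# Hypothetical class intervals (lower, upper) and frequencies, for illustration only
classes = [(0, 10), (10, 20), (20, 30), (30, 40)]
frequencies = [4, 6, 8, 2]

# Mid-point of each class represents every value in that class
midpoints = [(lower + upper) / 2 for lower, upper in classes]

n = sum(frequencies)                                         # total number of items
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))  # sum of f times x
grouped_mean = sum_fx / n

print(midpoints, sum_fx, n, grouped_mean)
```

Because the mid-points stand in for the real values, the result is an estimate of the mean rather than its exact value.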


Example: calculate the value of the arithmetic mean from the following frequency

distribution table.

The arithmetic mean of combined data

Suppose that the mean age of a group of five people is 27 and the mean age of another

group of eight people is 32. How would we find the mean age of the whole group of 13

people?

The sum of the ages in the first group is 5 × 27 = 135

The sum of the ages in the second group is 8 × 32 = 256

The sum of all 13 ages is 135 + 256 = 391

The mean age of the whole group is therefore 391/13 = 30.08 years (to two decimal places).
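The same steps can be written as a short calculation. This is a sketch only, with illustrative variable names: each group mean is multiplied back up by its group size to recover the group total, and the combined total is divided by the combined number of people.

```python
group_sizes = [5, 8]     # number of people in each group
group_means = [27, 32]   # mean age of each group

# Recover each group's total age, then add the totals together
total_age = sum(size * m for size, m in zip(group_sizes, group_means))
combined_mean = total_age / sum(group_sizes)

print(total_age, round(combined_mean, 2))
```

Note that simply averaging 27 and 32 would give 29.5, which ignores the different group sizes.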

The advantages and disadvantages of the arithmetic mean

Advantages of the arithmetic mean

• It is easy to calculate

• It is widely understood


• It is representative of the whole set of data

• It is supported by mathematical theory and is suited to further statistical analysis

Disadvantages of the arithmetic mean

• Its value may not correspond to any actual value. For example, the 'average' family

might have 2.3 children, but no family has exactly 2.3 children.

• An arithmetic mean might be distorted by extremely high or low values. For example,

the mean of 3, 4, 4 and 6 is 4.25, but the mean of 3, 4, 4, 6 and 15 is 6.4. The high

value, 15, distorts the average and in some circumstances the mean would be a

misleading and inappropriate figure.

Question

Weighted Mean

A firm owns six factories at which the basic weekly wages are given in column 2 of

Table 4.4. Find the mean basic wage earned by employees of the firm.


This result, which takes account of the number of employees, is a much more realistic

measure of location for the distribution of the basic wage than the straight mean we found

first. The second result is called the weighted mean of the basic wage, where the weights

are the numbers of employees at each factory.
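The difference between the straight mean and the weighted mean can be illustrated with a small sketch. The wages and employee numbers below are hypothetical (the original table is not reproduced here); only the method follows the text, with the number of employees at each factory used as the weight.

```python
# Hypothetical data for six factories, for illustration only
basic_wage = [500, 520, 540, 560, 580, 600]   # basic weekly wage at each factory
employees = [10, 20, 40, 50, 50, 30]          # number of employees at each factory

# Straight (unweighted) mean treats every factory as equally important
straight_mean = sum(basic_wage) / len(basic_wage)

# Weighted mean: sum of (weight x value) divided by the sum of the weights
weighted_mean = sum(w * x for w, x in zip(employees, basic_wage)) / sum(employees)

print(straight_mean, weighted_mean)
```

In this made-up data the larger factories pay more, so the weighted mean comes out above the straight mean.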

Geometric Mean

The geometric mean is seldom used outside of specialist applications. It is appropriate

when dealing with a set of data such as that which shows exponential growth (that is

where the rate of growth depends on the value of the variable itself), for example

population levels arising from indigenous birth rates, or that which follows a geometric

progression, such as changes in an index number over time, for example the Retail Price

Index.

It is sometimes quite difficult to decide where the use of the geometric mean over the

arithmetic mean is the best choice. We will return to the use of geometric means in the

next chapter. The geometric mean (GM) is evaluated by taking the nth root of the product

of all n observations, that is:

$$GM = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n}$$

Example:

In the year 2000 the population of a town is 300,000. In 2010 a new census reveals it has
risen to 410,000. Estimate the population in 2005. If we assume that there was no net
immigration or emigration then the birth rate will depend on the size of the population
(exponential growth) so the geometric mean is appropriate.

$$GM = \sqrt{300{,}000 \times 410{,}000} \approx 350{,}714$$

(Note that this is appreciably less than the arithmetic mean which is 355,000.)
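The geometric mean of the two census figures can be checked with a couple of lines. This is a minimal sketch with illustrative variable names; with two observations the nth root is simply a square root.

```python
import math

pop_2000 = 300_000
pop_2010 = 410_000

# Geometric mean of two values: square root of their product
geometric = math.sqrt(pop_2000 * pop_2010)

# Arithmetic mean of the same two values, for comparison
arithmetic = (pop_2000 + pop_2010) / 2

print(round(geometric), arithmetic)
```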


Harmonic Mean

Another measure of central tendency which is only occasionally used is the harmonic

mean. It is most frequently employed for averaging speeds where the distances for each

section of the journey are equal.

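If the speeds are x then:

$$HM = \frac{n}{\sum \frac{1}{x}}$$

where n is the number of speeds (one for each equal section of the journey).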

Example:

An aeroplane travels a distance of 900 miles. If it covers the first third and the last third

of the trip at a speed of 250 mph and the middle third at a speed of 300 mph, find the

average speed.
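One way to work this example is sketched below, assuming the harmonic mean of the three section speeds is the right average because the three sections are equal in length. The cross-check divides total distance by total time, which should agree. Variable names are illustrative.

```python
from statistics import harmonic_mean

# Speeds (mph) over the three equal sections of the 900-mile trip
speeds = [250, 300, 250]

average_speed = harmonic_mean(speeds)

# Cross-check: total distance divided by total time
section_length = 900 / 3
total_time = sum(section_length / s for s in speeds)   # hours
check = 900 / total_time

print(round(average_speed, 1), round(check, 1))
```

The arithmetic mean of the three speeds (about 266.7 mph) would overstate the average, because more time is spent travelling at the slower speed.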

THE MODE

The mode or modal value is an average which means 'the most frequently occurring

value'.

Mode of a Simple Frequency Distribution

Consider the following frequency distribution:

Table 8.12 Accident distribution data

In this case the most frequently occurring value is 1 (it occurred 39 times) and so the

mode of this distribution is 1.


Example 1:

The following is an ordered list of the number of complaints received by a telephone

supervisor per day over a period of a fortnight:

3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 8, 9, 10, 12.

The value which occurs most frequently is 6, therefore:

mode = 6
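The mode of the complaints data can be found mechanically by counting occurrences. The sketch below uses `multimode` from Python's standard `statistics` module, which returns a list so that distributions with more than one mode are handled as well.

```python
from statistics import multimode

# Complaints received per day over the fortnight
complaints = [3, 4, 4, 5, 5, 6, 6, 6, 6, 7, 8, 9, 10, 12]

modes = multimode(complaints)   # list of the most frequently occurring value(s)

print(modes)
```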

The mode of a grouped frequency distribution

There are various methods of estimating the modal value (including a graphical one). A

satisfactory result is obtained easily by using the following formula:

$$\text{Mode} = L + \frac{f_m - f_{m-1}}{2f_m - f_{m-1} - f_{m+1}} \times i$$

where: L = lower boundary of the modal class

i = width of the modal class interval

$f_m$ = the frequency of the modal class

$f_{m-1}$ = the frequency of the pre-modal class

$f_{m+1}$ = the frequency of the post-modal class

Example:

Find the modal value of the height of employees from the data shown in the Table.

Solution:

The largest frequency is 20, in the fourth class, so this is the modal class, and the

value of the mode lies between 175 and 180 cm. So, using the above formula:


Advantages and Disadvantages of the Mode

(a) Advantages

(i) It is not distorted by extreme values of the observations.

(ii) It is easy to calculate.

(b) Disadvantages

(i) It cannot be used to calculate any further statistic.

(ii) It may have more than one value (although this feature helps to show the

shape of the distribution).

MEDIAN

The median is the value of the middle member of an array. The middle item of an odd

number of items is calculated as the $\frac{n+1}{2}$th item.

If a set of n observations is arranged in order of size then, if n is odd, the median is the

value of the middle observation; if n is even, the median is the value of the arithmetic

mean of the two middle observations.

Note that the same value is obtained whether the set is arranged in ascending or

descending order of size, though the ascending order is most commonly used. This

arrangement in order of size is often called ranking.

The rules for calculating the median are:

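(a) If n is odd and M is the value of the median then:

M = the value of the $\frac{n+1}{2}$th observation

(b) If n is even then M is the arithmetic mean of the values of the $\frac{n}{2}$th and
$\left(\frac{n}{2} + 1\right)$th observations.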

Example: The median

(a) The median of the following nine values:

8 6 9 12 15 6 3 20 11

is found by taking the middle item (the fifth one) in the array:

3 6 6 8 9 11 12 15 20

The median is 9.

(b) Consider the following array.

1 2 2 2 3 5 6 7 8 11

The median is 4 because, with an even number of items, we have to take the arithmetic

mean of the two middle ones (in this example, (3 + 5)/2 = 4).
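Both cases can be checked with Python's standard `statistics.median`, which sorts the data and then takes the middle item (odd n) or the mean of the two middle items (even n). This is just a sketch; the variable names are illustrative.

```python
from statistics import median

values_a = [8, 6, 9, 12, 15, 6, 3, 20, 11]     # nine values (odd n)
values_b = [1, 2, 2, 2, 3, 5, 6, 7, 8, 11]     # ten values (even n)

median_a = median(values_a)   # middle (fifth) item of the sorted array
median_b = median(values_b)   # mean of the fifth and sixth items

print(sorted(values_a), median_a)
print(median_b)
```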

Finding the median of an ungrouped frequency distribution

The median of an ungrouped frequency distribution is found in a similar way. Consider

the following distribution.

The median would be the (35 + 1)/2 = 18th item. The 18th item has a value of 16, as we

can see from the cumulative frequencies in the right hand column of the above table.

Finding the median of a grouped frequency distribution

If the sample size is large and the data continuous, it is possible to find the median of

grouped data by estimating the position of the (n/2)th value. In this case the following

formula may be used:

$$\text{Median} = L + \left(\frac{\frac{n}{2} - F}{f}\right) \times i$$

where: L is the lower class boundary of the median class

F is the cumulative frequency up to but not including the median class

f is the frequency of the median class

i is the width of the class interval.
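The interpolation can be sketched as a small function that locates the median class from the cumulative frequencies and then applies L + ((n/2 - F) / f) × i, using the symbols defined above. The function name and the data are hypothetical, for illustration only.

```python
def grouped_median(lower_bounds, widths, freqs):
    """Estimate the median of a grouped frequency distribution by interpolation."""
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no observations")
    target = n / 2                  # position of the (n/2)th value
    cumulative = 0                  # F: cumulative frequency before the current class
    for L, i, f in zip(lower_bounds, widths, freqs):
        if f > 0 and cumulative + f >= target:
            # The median lies in this class: interpolate within it
            return L + (target - cumulative) / f * i
        cumulative += f
    raise ValueError("median class not found")


# Hypothetical distribution: classes 0-10, 10-20, 20-30, 30-40
lower_bounds = [0, 10, 20, 30]
widths = [10, 10, 10, 10]
freqs = [5, 15, 20, 10]

estimate = grouped_median(lower_bounds, widths, freqs)
print(estimate)
```

Here n = 50, so the 25th value is needed. The first two classes hold 20 items, so the median is one quarter of the way into the 20-30 class.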

Example:

From the following frequency distribution establish the median.

Advantages and Disadvantages of the Median

(a) Advantages

(i) Its value is not distorted by extreme values, open-ended classes or classes of

irregular width.

(ii) All the observations are used to order the data even though only the middle one

or two observations are used in the calculation.

(iii) It can be illustrated graphically in a very simple way.

(b) Disadvantages

(i) In a grouped frequency distribution the value of the median within the median

class can only be an estimate, whether it is calculated or read from a graph.

(ii) Although the median is easy to calculate it is difficult to manipulate arithmetically.

It is of little use in calculating other statistical measures.

QUANTILES

Definitions

If a set of data is arranged in ascending order of size, quantiles are the values of the

observations which divide the number of observations into a given number of equal parts.


They cannot really be called measures of central tendency, but they are measures of

location in that they give the position of specified observations on the x-axis.

The most commonly used quantiles are:

(a) Quartiles

These are denoted by the symbols Q1, Q2 and Q3 and they divide the observations into

four equal parts:

Q1 has 25% below it and 75% above it.

Q2 has 50% below it and 50% above, i.e. it is the median and is more usually

denoted by M.

Q3 has 75% below it and 25% above.

(b) Deciles

These values divide the observations into 10 equal parts and are denoted by D1, D2, ...

D9, e.g. D1 has 10% below it and 90% above, and D2 has 20% below it and 80% above.

(c) Percentiles

These values divide the observations into 100 equal parts and are denoted by P1, P2,

P3, ... P99, e.g. P1 has 1% below it and 99% above.

Note that D5 and P50 are both equal to the median (M).

Calculation of Quantiles

Example:

Table 4.7 shows the grouped distribution of the overdraft sizes of 400 bank customers.
Find the quartiles, the 4th decile and the 95th percentile of this distribution.


Size of overdraft of bank customers

Using appropriately amended versions of the formula for the median given previously,

the arithmetic calculations are as follows:

The formula for the first quartile (Q1) may be written as:

$$Q_1 = L + \left(\frac{\frac{n}{4} - F}{f}\right) \times i$$

where: L is the lower class boundary of the class which contains Q1

F is the cumulative frequency up to but not including the class which contains Q1

f is the frequency of the class which contains Q1 and

i is the width of the class interval.

This gives:


MEASURES OF DISPERSION

Measures of dispersion give some idea of the spread of a variable about its average. The

main measures are as follows.

• The range

• The semi-interquartile range

• The standard deviation

• The variance

• The coefficient of variation

1 The range

The range of a distribution is the difference between the largest and the smallest values in
the set of data.

If the data is given in the form of a grouped frequency distribution, the range is the

difference between the highest upper class boundary and the lowest lower class

boundary.

Example:

Calculate the mean and the range of the following set of data.

4 8 7 3 5 16 24 5
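A worked version of this example, as a minimal sketch with illustrative variable names:

```python
from statistics import mean

data = [4, 8, 7, 3, 5, 16, 24, 5]

data_mean = mean(data)                # sum of the values divided by their number
data_range = max(data) - min(data)    # largest value minus smallest value

print(data_mean, data_range)
```

The range (21) is large relative to the mean (9) because it depends entirely on the two extreme values, 24 and 3.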

Advantages and Disadvantages

(a) Advantages

(i) It is easy to understand.

(ii) It is simple to calculate.


(iii) It is a good measure for comparison as it spans the whole distribution.

(b) Disadvantages

(i) It uses only two of the observations and so can be distorted by extreme values.

(ii) It does not indicate any concentrations of the observations.

(iii) It cannot be used in calculating other functions of the observations.

2 QUARTILE DEVIATION (THE SEMI-INTERQUARTILE RANGE)

The semi-interquartile range is half the difference between the upper and lower quartiles:

$$\text{Semi-interquartile range} = \frac{Q_3 - Q_1}{2}$$

The inter-quartile range

The inter-quartile range is the difference between the values of the upper and lower
quartiles (Q3 – Q1) and hence shows the range of values of the middle half of the
population.

Example:

Construct an ogive of the following frequency distribution and hence establish the
semi-interquartile range.

Advantages and Disadvantages

(a) Advantages

(i) The calculations are simple and quite quick to do.

(ii) It covers the central 50% of the observations and so is not distorted by extreme
values.

(iii) It can be illustrated graphically.

(b) Disadvantages

(i) The lower and upper 25% of the observations are not used in the calculation so it

may not be representative of all the data.

(ii) Although it is related to the median, there is no direct arithmetic connection

between the two.

(iii) It cannot be used to calculate any other functions of the data.

3 THE MEAN DEVIATION

The mean deviation is a measure of the average amount by which the values in a
distribution differ from the arithmetic mean.
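For a frequency distribution, the definition above amounts to taking the absolute difference between each value and the mean, weighting it by its frequency, and dividing by the total frequency. The sketch below assumes that reading of the definition and uses hypothetical mid-points and frequencies, not the overtime data from the example.

```python
# Hypothetical class mid-points and frequencies, for illustration only
midpoints = [5, 15, 25]
frequencies = [2, 6, 2]

n = sum(frequencies)
mean_value = sum(f * x for f, x in zip(frequencies, midpoints)) / n

# Sum of f x |x - mean|, ignoring the sign of each difference
sum_abs_dev = sum(f * abs(x - mean_value) for f, x in zip(frequencies, midpoints))
mean_deviation = sum_abs_dev / n

print(mean_value, mean_deviation)
```

The absolute value matters: without it the positive and negative differences from the mean would cancel out to zero.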

Explaining the mean deviation formula

Example: The mean deviation

The hours of overtime worked in a particular quarter by the 60 employees of ABC Co are

as follows.


Required

Calculate the mean deviation of the frequency distribution shown above.

Summary of the mean deviation

(a) It is a measure of dispersion which shows by how much, on average, each item in the

distribution differs in value from the arithmetic mean of the distribution.

(b) Unlike quartiles, it uses all values in the distribution to measure the dispersion, but it

is not greatly affected by a few extreme values because an average is taken.

(c) It is not, however, suitable for further statistical analysis.

4 STANDARD DEVIATION AND VARIANCE

THE STANDARD DEVIATION

The standard deviation measures the spread of data around the mean. In general, the

larger the standard deviation value in relation to the mean, the more dispersed the data.

The standard deviation, which is the square root of the variance, is the most important

measure of spread used in statistics. Make sure you understand how to calculate the

standard deviation of a set of data.


There are a number of formulae which you may use to calculate the standard deviation;

use whichever one you feel comfortable with.

EXAMPLE 1

Voditel International own a large fleet of company cars. The mileages, in thousands of

miles, of a sample of 17 of their cars over the last financial year were:

11 31 27 26 27 35 23 19 28 25 15 36 29 27 26 22 20

Calculate the mean and standard deviation of these mileage figures.
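A sketch of the calculation for these 17 mileages is given below. The notes do not say here whether to divide by n or by n - 1, so both versions are shown: `pstdev` divides by n (treating the figures as a whole population) and `stdev` divides by n - 1 (treating them as a sample). Which one is expected is an assumption to check against the formula you have been given.

```python
from statistics import mean, pstdev, stdev

# Mileages (thousands of miles) for the sample of 17 cars
mileages = [11, 31, 27, 26, 27, 35, 23, 19, 28, 25,
            15, 36, 29, 27, 26, 22, 20]

mileage_mean = mean(mileages)
sd_population = pstdev(mileages)   # divisor n
sd_sample = stdev(mileages)        # divisor n - 1

print(round(mileage_mean, 2), round(sd_population, 2), round(sd_sample, 2))
```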

EXAMPLE 2

The kilocalories per portion in a sample of 32 different breakfast cereals were recorded

and collated into the following grouped frequency distribution:

(a) Obtain an approximate value for the median of the distribution.

(b) Calculate approximate values for the mean and standard deviation of the distribution.

The variance

The variance, σ², is the average of the squared deviations from the mean for each value in a
distribution. σ is the Greek letter sigma (in lower case). The variance is therefore called
'sigma squared'.


The main properties of the standard deviation

The standard deviation's main properties are as follows.

(a) It is based on all the values in the distribution and so is more comprehensive than

dispersion measures based on quartiles, such as the quartile deviation.

(b) It is suitable for further statistical analysis.

(c) It is more difficult to understand than some other measures of dispersion.

The importance of the standard deviation lies in its suitability for further statistical
analysis.

The coefficient of variation

The spreads of two distributions can be compared using the coefficient of variation.

$$\text{Coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}}$$

The bigger the coefficient of variation, the wider the spread. For example, suppose that

two sets of data, A and B, have the following means and standard deviations.

Although B has a higher standard deviation in absolute terms (51 compared to 50) its

relative spread is less than A's since the coefficient of variation is smaller.
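The comparison can be sketched numerically, taking the coefficient of variation as the standard deviation divided by the mean. The standard deviations (50 and 51) come from the text; the means below are hypothetical, since the original table is not reproduced here.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Relative spread: standard deviation as a proportion of the mean."""
    return std_dev / mean_value


# Standard deviations from the text; means are hypothetical, for illustration only
cv_a = coefficient_of_variation(50, 120)
cv_b = coefficient_of_variation(51, 125)

print(round(cv_a, 3), round(cv_b, 3))
```

With these assumed means, B has the larger standard deviation but the smaller coefficient of variation, which is the situation the text describes.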

Advantages and Disadvantages of the Standard Deviation

(a) Advantages

(i) It uses all the observations.

(ii) It is closely related to the most commonly used measure of location, i.e. the

mean.

(iii) It is easy to manipulate arithmetically.

(b) Disadvantages

(i) It is rather complicated to define and calculate.

(ii) Its value can be distorted by extreme values.


SKEWNESS

1 Skewed distributions

As well as being able to calculate the average and spread of a frequency distribution, you

should be aware of the skewness of a distribution.

Skewness is the asymmetry of a frequency distribution curve. When the items are not

symmetrically dispersed on each side of the mean, we say that the distribution is skewed

or asymmetric.

Symmetrical frequency distributions

A symmetrical frequency distribution (such as a normal distribution) can be drawn as follows.

Properties of a symmetrical distribution

• Its mean, mode and median all have the same value, M

• Its two halves are mirror images of each other

If a distribution is symmetrical, the mean, mode and the median all occur at the same

point, i.e. right in the middle. But in a skew distribution, the mean and the median lie

somewhere along the side with the "tail", although the mode is still at the point where the

curve is highest. The more skew the distribution, the greater the distance from the mode

to the mean and the median.

Positively skewed distributions

A positively skewed distribution's graph will lean towards the left hand side, with a tail

stretching out to the right, and can be drawn as follows.


Properties of a positively skewed distribution

• Its mean, mode and median all have different values

• The mode will have a lower value than the median

• Its mean will have a higher value than the median (and than most of the
distribution)

• It does not have two halves which are mirror images of each other

Negatively skewed distributions

A negatively skewed distribution's graph will lean towards the right hand side, with a tail

stretching out to the left, and can be drawn as follows.

Properties of a negatively skewed distribution

• Its mean, median and mode all have different values

• The mode will be higher than the median

• The mean will have a lower value than the median (and than most of the
distribution)

Since the mean is affected by extreme values, it may not be representative of the items in

a very skewed distribution.


Measures of Skewness

The more skew the distribution, the more spread out are these three measures of location,

and so we can use the amount of this spread to measure the amount of skewness. The

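most usual way of doing this is to calculate:

$$\text{Pearson's coefficient of skewness} = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$

The value of the coefficient of skewness is between +3 and –3.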
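A sketch of the calculation, assuming the coefficient intended is Pearson's coefficient in the form 3 × (mean - median) / standard deviation, which is consistent with the stated range of +3 to -3. The input figures are hypothetical, not the package weights from the example.

```python
def pearson_skewness(mean_value, median_value, std_dev):
    """Pearson's coefficient of skewness: 3 x (mean - median) / standard deviation."""
    return 3 * (mean_value - median_value) / std_dev


# Hypothetical summary statistics, for illustration only
positive_skew = pearson_skewness(52, 50, 4)    # mean above median
negative_skew = pearson_skewness(48, 50, 4)    # mean below median
symmetrical = pearson_skewness(50, 50, 4)      # mean equals median

print(positive_skew, negative_skew, symmetrical)
```

A positive result indicates a tail stretching to the right, a negative result a tail stretching to the left, and zero a symmetrical distribution.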

Example: Skewness

In a quality control test, the weights of standard packages were measured to give the

following grouped frequency table.

Required

(a) Calculate the mean, standard deviation and median of the weights of the packages.

(b) Calculate Pearson's coefficient of skewness and explain whether or not the

distribution is symmetrical.


CORRELATION

When the value of one variable is related to the value of another, they are said to be

correlated.

Two variables are said to be correlated if a change in the value of one variable is

accompanied by a change in the value of another variable. This is what is meant by

correlation.

Examples of variables which might be correlated

• A person's height and weight

• The distance of a journey and the time it takes to make it

Scatter diagrams

One way of showing the correlation between two related variables is on a scatter

diagram, plotting a number of pairs of data on the graph. For example, a scatter diagram

showing monthly selling costs against the volume of sales for a 12-month period might

be as follows.

The independent variable (the cause) is plotted on the horizontal (x) axis and the
dependent variable (the effect) is plotted on the vertical (y) axis.

This scattergraph suggests that there is some correlation between selling costs and sales

volume, so that as sales volume rises, selling costs tend to rise as well.

Degrees of correlation

Two variables might be perfectly correlated, partly correlated or uncorrelated. Correlation

can be positive or negative.


Positive and negative correlation

Correlation, whether perfect or partial, can be positive or negative.

Positive correlation means that low values of one variable are associated with

low values of the other, and high values of one variable are associated with high

values of the other.

Negative correlation means that low values of one variable are associated with

high values of the other, and high values of one variable with low values of the

other.

The correlation coefficient and the coefficient of determination

The degree of correlation between two variables is measured by Pearson's correlation

coefficient, r. The nearer r is to +1 or –1, the stronger the relationship.

The correlation coefficient

Pearson's correlation coefficient, r (also known as the product moment correlation

coefficient) is used to measure how strong the connection is between two variables,

known as the degree of correlation.

The correlation coefficient range

The correlation coefficient, r, must always fall between –1 and +1. If you get a value

outside this range you have made a mistake.

• r = +1 means that the variables are perfectly positively correlated

• r = –1 means that the variables are perfectly negatively correlated

• r = 0 means that the variables are uncorrelated
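The sketch below computes r from the usual sums (n, Σx, Σy, Σxy, Σx² and Σy²), which is one standard way of writing Pearson's product moment coefficient. The function name and the data are hypothetical; the data are deliberately chosen to lie exactly on straight lines so that the extreme values of r can be seen.

```python
import math


def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y_rising = [5, 7, 9, 11, 13, 15]     # y = 3 + 2x, a perfect positive relationship
y_falling = [15, 13, 11, 9, 7, 5]    # a perfect negative relationship

r_rising = pearson_r(x, y_rising)
r_falling = pearson_r(x, y_falling)

print(r_rising, r_falling)
```

The function fails with a division by zero if either variable takes the same value throughout, because correlation is undefined when one variable does not vary.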


Example: The correlation coefficient

The cost of output at a factory is thought to depend on the number of units produced.

Data have been collected for the number of units produced each month in the last six

months, and the associated costs, as follows.

Required

Assess whether there is any correlation between output and cost.

Solution


There is perfect positive correlation between the volume of output at the factory and costs

which means that there is a perfect linear relationship between output and costs.

Required

Calculate Pearson's correlation coefficient for the data and explain the result.

THE COEFFICIENT OF DETERMINATION, r²

The coefficient of determination, r², measures the proportion of the total variation in the
value of one variable that can be explained by variations in the value of the other
variable. Unless the correlation coefficient r is exactly or very nearly +1, –1 or 0, its
meaning or significance is a little unclear. For example, if the correlation coefficient for
two variables is +0.8, this would tell us that the variables are positively correlated, but the
correlation is not perfect. It would not really tell us much else. A more meaningful
analysis is available from the square of the correlation coefficient, r, which is called the
coefficient of determination, r².

Interpreting r²

In the question above, r = –0.992, therefore r² = 0.984. This means that over 98% of
variations in sales can be explained by the passage of time, leaving 0.016 (less than 2%)
of variations to be explained by other factors.

Similarly, if the correlation coefficient between a company's output volume and
maintenance costs was 0.9, r² would be 0.81, meaning that 81% of variations in
maintenance costs could be explained by variations in output volume, leaving only 19%
of variations to be explained by other factors (such as the age of the equipment).

SPEARMAN'S RANK CORRELATION COEFFICIENT

Coefficient of rank correlation

In the examples considered above, the data were given in terms of the values of the

relevant variables, such as the number of hours. Sometimes however, they are given in

terms of order or rank rather than actual values.

Spearman's rank correlation coefficient is used when data is given in terms of order or
rank, rather than actual values.

$$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where n = number of pairs of data

d = the difference between the rankings in each set of data.

The coefficient of rank correlation can be interpreted in exactly the same way as the

ordinary correlation coefficient. Its value can range from –1 to +1.
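A sketch of the rank calculation is shown below, assuming the usual form of Spearman's coefficient, 1 - 6Σd² / (n(n² - 1)), with n and d as defined above and no tied ranks. The placings are hypothetical and are used only to illustrate the method.

```python
def spearman_rank(ranks_a, ranks_b):
    """Spearman's rank correlation coefficient for two sets of untied ranks."""
    n = len(ranks_a)
    # d is the difference between the two rankings for each item
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical placings of seven students in two subjects, for illustration only
first_subject = [2, 1, 4, 6, 5, 3, 7]
second_subject = [1, 3, 7, 5, 6, 2, 4]

coefficient = spearman_rank(first_subject, second_subject)
print(round(coefficient, 3))
```

Identical rankings give +1 and exactly reversed rankings give -1, matching the range of the ordinary correlation coefficient.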


Example: The rank correlation coefficient

The examination placings of seven students were as follows.

Required

Judge whether the placings of the students in statistics correlate with their placings in

economics.

Solution

Correlation must be measured by Spearman's coefficient because we are given the

placings of students, and not their actual marks.

where d is the difference between the rank in statistics and the rank in economics for each

student.

The correlation is positive (0.536) but not strong.