Review of Measures of Central Tendency, Dispersion & Association


• Graphical Excellence
• Measures of Central Tendency
  – Mean, Median, Mode
• Measures of Dispersion
  – Variance, Standard Deviation, Range
• Measures of Association
  – Covariance, Correlation Coefficient
• Relationship of basic stats to OLS


Graphical Excellence

• Learning from Monkeys

http://www.ifilm.com/video/2691852/collection/18380/minisite/superbowl


Why Graphs & Stats?

• Graphs and descriptive statistics, when used properly, can summarize many lines of data effectively for the reader. What’s a good approximation of the age of students in this class?

• We use graphs and basic stats (mean, variance, covariance, etc.) to highlight trends and to motivate the research question.

• We use other tools for analysis: regression, case study, content analysis, etc.


What story does this graph tell? What questions does the graph raise?

[Figure: "HHI Over Time". Line chart of the HHI index (vertical axis, 0.034 to 0.048) against year (horizontal axis, 1970 to 2000).]


• The graph presents large data sets concisely and coherently – label your axes

• The ideas and concepts to be delivered are clearly understood by the viewer – state the units used (e.g., $ or $ in millions)

What’s the problem here?

[Figure: chart of 1950 versus 1991 values (data labels 30.50, 26.09, 43.51, 77.67; vertical axis from 0.00 to 90.00).]


• The display induces the viewer to address the substance of the data and not the form of the graph.
  – Select the appropriate type of graph (bar chart for levels, scatter plot for trends, etc.)

• There is no distortion of what the data reveal.
  – Make sure the axes are not stretched or compressed to make a point.

Do New Stadiums Bring People in?

[Figure: bar chart of average annual attendance, in millions: MLB average 2.18, new stadiums 3.01, old stadiums 2.04. Vertical axis runs from 0 to 3.5.]

Do New Stadiums Bring People in?

[Figure: bar chart of average annual attendance, in millions: MLB average 2.18, new stadiums 3.01, old stadiums 2.04. Vertical axis runs from 0 to 20.]

Things to be cautious about when observing a graph:

– Is there a missing scale on one axis?

– Do not be influenced by a graph’s caption.

– Are changes presented in absolute values only, or in percent form too?

Numerical Descriptive Measures

• Measures of Central Tendency
  – Mean, Median, Mode

• Measures of Dispersion
  – Variance, Standard Deviation

• Measures of Association
  – Covariance, Correlation Coefficient

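Arithmetic mean

– This is the most popular and useful measure of central location.

$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$

$$\text{Sample mean: } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad (n = \text{sample size})$$

$$\text{Population mean: } \mu = \frac{\sum_{i=1}^{N} x_i}{N} \quad (N = \text{population size})$$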

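• Example 1
The mean of the sample of six measurements 7, 3, 9, -2, 4, 6 is given by

$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{6} = \frac{7 + 3 + 9 - 2 + 4 + 6}{6} = 4.5$$

• Example 2
Suppose the telephone bills of example 2.1 represent a population of measurements. The population mean is

$$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \dots + x_{200}}{200} = \frac{42.19 + 15.30 + \dots + 53.21}{200} = 43.59$$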

• Example 3
When many of the measurements have the same value, the measurements can be summarized in a frequency table. Suppose the number of children in a sample of 16 employees was recorded as follows:

NUMBER OF CHILDREN     0   1   2   3
NUMBER OF EMPLOYEES    3   4   7   2    (16 employees)

$$\bar{x} = \frac{\sum_{i=1}^{16} x_i}{16} = \frac{x_1 + x_2 + \dots + x_{16}}{16} = \frac{3(0) + 4(1) + 7(2) + 2(3)}{16} = 1.5$$
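The two mean calculations in Examples 1 and 3 can be checked with a short Python sketch. The variable names are illustrative and not part of the slides; the frequency table is stored as a dictionary purely for convenience.

```python
from statistics import mean

# Example 1: six sample measurements
sample = [7, 3, 9, -2, 4, 6]
example1_mean = mean(sample)

# Example 3: frequency table (number of children -> number of employees)
children_counts = {0: 3, 1: 4, 2: 7, 3: 2}
n_employees = sum(children_counts.values())

# Weighted mean: each value is multiplied by how often it occurs
example3_mean = sum(value * freq for value, freq in children_counts.items()) / n_employees

print(example1_mean)  # 4.5
print(example3_mean)  # 1.5
```

Weighting each value by its frequency gives the same result as listing all 16 observations and averaging them.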

The median

– The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.

Example 4
Seven employee salaries were recorded (in 1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.

Odd number of observations
First, sort the salaries. Then, locate the value in the middle:
26, 26, 28, 29, 30, 32, 60
The median is 29.

Suppose one employee’s salary of $31,000 was added to the group recorded before. Find the median salary.

Even number of observations
First, sort the salaries. Then, locate the values in the middle:
26, 26, 28, 29, 30, 31, 32, 60
There are two middle values, 29 and 30. The median is their average, 29.5.
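As a rough check of Example 4, the sketch below uses Python's standard `statistics.median`, which sorts the data and averages the two middle values when the count is even. The variable names are illustrative.

```python
from statistics import median

salaries = [28, 60, 26, 32, 30, 26, 29]    # in $1000s, odd number of observations
odd_median = median(salaries)              # the single middle value after sorting

salaries_with_new = salaries + [31]        # add the $31,000 salary, now an even count
even_median = median(salaries_with_new)    # average of the two middle values

print(sorted(salaries), odd_median)            # median is 29
print(sorted(salaries_with_new), even_median)  # median is 29.5
```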

The mode

– The mode of a set of measurements is the value that occurs most frequently.

– A set of data may have one mode (or modal class), or two or more modes.

The modal class: for large data sets, the modal class is much more relevant than a single-value mode.

– Example 5
  • The manager of a men’s store observes the waist size (in inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
  • The mode of this data set is 34 in.

This information seems valuable (for example, for the design of a new display in the store), much more than “the mean is 33.2 in.”
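A minimal sketch of Example 5 in Python follows, using the standard `statistics` module (`multimode` requires Python 3.8 or later). Running it also shows that 33.2 is the mean of these waist sizes, while the median is 33.5.

```python
from statistics import mean, median, multimode

waist_sizes = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]  # inches

modes = multimode(waist_sizes)       # list of the most frequent value(s)
avg_size = mean(waist_sizes)
mid_size = median(waist_sizes)

print(modes)     # [34]
print(avg_size)  # 33.2
print(mid_size)  # 33.5
```

`multimode` returns a list, so a data set with two or more modes is reported in full rather than reduced to a single value.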

Relationship among Mean, Median, and Mode

• If a distribution is symmetrical, the mean, median and mode coincide.

• If a distribution is not symmetrical, and is skewed to the left or to the right, the three measures differ.

A positively skewed distribution (“skewed to the right”): the long tail is on the right, and typically mode < median < mean.

A negatively skewed distribution (“skewed to the left”): the long tail is on the left, and typically mean < median < mode.

Measures of variability (Looking beyond the average)

• Measures of central location fail to tell the whole story about the distribution.

• A question of interest still remains unanswered:

How typical is the average value of all the measurements in the data set?

or

How spread out are the measurements about the average value?

Observe two hypothetical data sets

Low variability data set: the average value provides a good representation of the values in the data set.

High variability data set: the same average value does not provide as good a representation of the values in the data set as before.

The range

– The range of a set of measurements is the difference between the largest and smallest measurements.

– Its major advantage is the ease with which it can be computed.

– Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points.

[Figure: a line running from the smallest measurement to the largest measurement, with the range marked between them and question marks in between.]

But how do all the measurements spread out? The range cannot assist in answering this question.

The variance

– This measure of dispersion reflects the values of all the measurements.

– The variance of a population of N measurements x1, x2, …, xN having a mean $\mu$ is defined as

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

(Excel uses the Varp formula.)

– The variance of a sample of n measurements x1, x2, …, xn having a mean $\bar{x}$ is defined as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

(Excel uses the Var formula.)
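The only difference between the two definitions is the divisor: N for a population, n - 1 for a sample. The sketch below writes both out by hand and compares them with Python's `statistics.pvariance` and `statistics.variance`, which play roughly the role of Excel's Varp and Var. The data values are made up for illustration and do not come from the slides.

```python
from math import isclose
from statistics import pvariance, variance

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical measurements, mean 5

print(population_variance(data), pvariance(data))  # both equal 4 (32 / 8)
print(sample_variance(data), variance(data))       # both equal 32 / 7
```

Because the sample version divides by a smaller number, it is always somewhat larger than the population version computed on the same values.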

Consider two small populations:
Population A: 8, 9, 10, 11, 12
Population B: 4, 7, 10, 13, 16

The mean of both populations is 10, but the measurements in B are much more dispersed than those in A.

Thus, a measure of dispersion is needed that agrees with this observation.

Let us start by calculating the sum of deviations from the mean.

A: (8-10) + (9-10) + (10-10) + (11-10) + (12-10) = -2 - 1 + 0 + 1 + 2 = 0

B: (4-10) + (7-10) + (10-10) + (13-10) + (16-10) = -6 - 3 + 0 + 3 + 6 = 0

The sum of deviations is zero in both cases; therefore, another measure is needed.

The sum of squared deviations is used in calculating the variance. See the example next.

Let us calculate the variance of the two populations

$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$

$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$

Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead?

After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!
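The calculations for populations A and B can be reproduced with a few lines of Python. This is a sketch only; the helper `summarize` is a hypothetical name introduced here for illustration.

```python
pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]

def summarize(values):
    """Return (sum of deviations, sum of squared deviations, population variance)."""
    mu = sum(values) / len(values)
    deviations = [x - mu for x in values]
    sum_dev = sum(deviations)                  # always zero around the mean
    sum_sq = sum(d ** 2 for d in deviations)   # grows with dispersion
    return sum_dev, sum_sq, sum_sq / len(values)

for name, pop in (("A", pop_a), ("B", pop_b)):
    print(name, summarize(pop))
# A (0.0, 10.0, 2.0)
# B (0.0, 90.0, 18.0)
```

The deviations cancel out in both populations, while the squared deviations separate A from B.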

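Which data set has a larger dispersion?

Data set A: the values 1 and 3, each occurring 5 times (mean 2)
Data set B: the values 1 and 5 (mean 3)

Data set B is more dispersed around the mean.

Let us calculate the sum of squared deviations for both data sets:

$$\text{Sum}_A = \underbrace{(1-2)^2 + \dots + (1-2)^2}_{5\ \text{times}} + \underbrace{(3-2)^2 + \dots + (3-2)^2}_{5\ \text{times}} = 10$$

$$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$$

The sum ranks A above B, even though B is more dispersed, simply because A has more observations.

However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked:

$$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$$

$$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$$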
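To see why the sum of squared deviations misleads when data sets differ in size, here is a small Python sketch. It treats data set A as ten observations (five 1s and five 3s), which is what the "5 times" annotations imply, so its per-observation variance is 10/10 = 1. The function name is illustrative.

```python
set_a = [1] * 5 + [3] * 5   # five 1s and five 3s, mean 2
set_b = [1, 5]              # mean 3

def sum_sq_dev(values):
    """Sum of squared deviations from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)

sum_a, sum_b = sum_sq_dev(set_a), sum_sq_dev(set_b)
var_a, var_b = sum_a / len(set_a), sum_b / len(set_b)

print(sum_a, sum_b)  # 10.0 8.0 -> the raw sum ranks A as more dispersed
print(var_a, var_b)  # 1.0 4.0  -> per observation, B is more dispersed
```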

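– Example 6
  • Find the mean and the variance of the following sample of measurements (in years): 3.4, 2.5, 4.1, 1.2, 2.8, 3.7
– Solution

$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{3.4 + 2.5 + 4.1 + 1.2 + 2.8 + 3.7}{6} = \frac{17.7}{6} = 2.95$$

A shortcut formula:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$

$$s^2 = \frac{1}{5}\left[\left(3.4^2 + 2.5^2 + \dots + 3.7^2\right) - \frac{(17.7)^2}{6}\right] = 1.075\ \text{(years)}^2$$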
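The shortcut formula and the definition should give the same sample variance. A minimal Python check on the Example 6 data follows; the variable names are illustrative.

```python
from math import isclose

years = [3.4, 2.5, 4.1, 1.2, 2.8, 3.7]
n = len(years)

x_bar = sum(years) / n

# Definition: squared deviations from the mean, divided by n - 1
s2_definition = sum((x - x_bar) ** 2 for x in years) / (n - 1)

# Shortcut: needs only the sum of x and the sum of x squared
s2_shortcut = (sum(x * x for x in years) - sum(years) ** 2 / n) / (n - 1)

print(round(x_bar, 2))          # 2.95
print(round(s2_definition, 3))  # 1.075
print(round(s2_shortcut, 3))    # 1.075
```

The shortcut is convenient for hand calculation because the mean does not have to be subtracted from each observation first.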

– The standard deviation of a set of measurements is the square root of the variance of the measurements.

$$\text{Sample standard deviation: } s = \sqrt{s^2}$$

$$\text{Population standard deviation: } \sigma = \sqrt{\sigma^2}$$

– Example 4.9
  • Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk?
    Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5
    Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4

– Solution
– Let us use the Excel printout that is run from the “Descriptive statistics” sub-menu (use file Xm04-10)

                     Fund A    Fund B
Mean                 16        12
Standard Error       5.295     3.152
Median               14.6      11.75
Mode                 #N/A      #N/A
Standard Deviation   16.74     9.969
Sample Variance      280.3     99.37
Kurtosis             -1.34     -0.46
Skewness             0.217     0.107
Range                49.1      30.6
Minimum              -6.2      -2.8
Maximum              42.9      27.8
Sum                  160       120
Count                10        10

Fund A should be considered riskier because its standard deviation is larger.
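The key figures in the printout can be reproduced without Excel. The sketch below uses Python's `statistics` module and assumes the last Fund A return is 30.5, the value consistent with the printout's sum of 160 and sample variance of 280.3.

```python
from statistics import mean, median, stdev

# Assumption: last Fund A return taken as 30.5, consistent with the printout's Sum = 160
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

for name, returns in (("Fund A", fund_a), ("Fund B", fund_b)):
    print(
        name,
        round(mean(returns), 2),    # sample mean
        round(median(returns), 2),  # median
        round(stdev(returns), 2),   # sample standard deviation (n - 1 divisor)
    )

riskier = "Fund A" if stdev(fund_a) > stdev(fund_b) else "Fund B"
print(riskier)  # Fund A
```

`stdev` uses the n - 1 divisor, matching the "Sample Variance" row of the printout.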

The coefficient of variation

– The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.

– This coefficient provides a proportionate measure of variation.

$$\text{Sample coefficient of variation: } cv = \frac{s}{\bar{x}}$$

$$\text{Population coefficient of variation: } CV = \frac{\sigma}{\mu}$$

A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
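The comparison in the last sentence can be made concrete with a short Python sketch. The function name is a hypothetical helper, not something defined in the slides.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a fraction of the mean."""
    return std_dev / mean_value

cv_mean_100 = coefficient_of_variation(10, 100)
cv_mean_500 = coefficient_of_variation(10, 500)

print(cv_mean_100)  # 0.1  -> variation is 10% of the mean
print(cv_mean_500)  # 0.02 -> variation is 2% of the mean
```

The same standard deviation is five times larger relative to the mean in the first case, which is why the coefficient is useful when comparing data sets measured on different scales.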

Interpreting Standard Deviation

• The standard deviation can be used to
  – compare the variability of several distributions
  – make a statement about the general shape of a distribution.

Measures of Association

• Two numerical measures are presented for describing the linear relationship between two variables depicted in a scatter diagram.
  – Covariance: is there any pattern to the way two variables move together?
  – Correlation coefficient: how strong is the linear relationship between two variables?

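The covariance

$$\text{Population covariance: } COV(X,Y) = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

$$\text{Sample covariance: } cov(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y), and $\bar{x}$ ($\bar{y}$) is the corresponding sample mean.

N is the population size. n is the sample size.

Excel uses the population formula to calculate Cov.

NOTE: The formula in Excel does not give you the sample covariance.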

• If the two variables move in opposite directions (one increases when the other decreases), the covariance is a large negative number.

• If the two variables are unrelated, the covariance will be close to zero.

• If the two variables move in the same direction (both increase or both decrease), the covariance is a large positive number.

The coefficient of correlation

– This coefficient answers the question: How strong is the association between X and Y?

$$\text{Population coefficient of correlation: } \rho = \frac{COV(X,Y)}{\sigma_x \sigma_y}$$

$$\text{Sample coefficient of correlation: } r = \frac{cov(X,Y)}{s_x s_y}$$

[Figure: scale of the coefficient of correlation, $\rho$ or $r$]

+1: strong positive linear relationship (COV(X,Y) > 0)

0: no linear relationship (COV(X,Y) = 0)

-1: strong negative linear relationship (COV(X,Y) < 0)

• If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).

• If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).

• No straight line relationship is indicated by a coefficient close to zero.

– Example 7
  • Compute the covariance and the coefficient of correlation to measure how advertising expenditure and sales level are related to one another.
  • Base your calculation on the data provided in example 2.3

Advert   Sales
1        30
3        40
5        40
4        50
2        35
5        50
3        35
2        25

Shortcut formulas:

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}$$

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

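• Use the procedure below to obtain the required summations

Month   x    y     xy     x²   y²
1       1    30    30     1    900
2       3    40    120    9    1600
3       5    40    200    25   1600
4       4    50    200    16   2500
5       2    35    70     4    1225
6       5    50    250    25   2500
7       3    35    105    9    1225
8       2    25    50     4    625
Sum     25   305   1025   93   12175

$$cov(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right] = \frac{1}{7}\left[1025 - \frac{25 \times 305}{8}\right] = 10.268$$

$$s_x^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] = \frac{1}{7}\left[93 - \frac{25^2}{8}\right] = 2.125, \qquad s_x = \sqrt{2.125} = 1.458$$

Similarly, $s_y = 8.839$

$$r = \frac{cov(X,Y)}{s_x s_y} = \frac{10.268}{1.458 \times 8.839} = .797$$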
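The summations and the resulting covariance and correlation for Example 7 can be verified with the following Python sketch, which applies the shortcut formulas directly. The variable names are illustrative.

```python
import math

advert = [1, 3, 5, 4, 2, 5, 3, 2]
sales = [30, 40, 40, 50, 35, 50, 35, 25]
n = len(advert)

# Column sums from the summation table
sum_x = sum(advert)
sum_y = sum(sales)
sum_xy = sum(x * y for x, y in zip(advert, sales))
sum_x2 = sum(x * x for x in advert)
sum_y2 = sum(y * y for y in sales)

# Shortcut formulas with the sample (n - 1) divisor
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
s_x = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_y = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))
r = cov_xy / (s_x * s_y)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2)  # 25 305 1025 93 12175
print(round(cov_xy, 3), round(s_x, 3), round(s_y, 3), round(r, 3))
# 10.268 1.458 8.839 0.797
```

Note that the sample variance of advertising implied by these sums is 2.125, whose square root is the 1.458 used in the correlation.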

• Excel printout

Covariance matrix
             Advertsmnt   Sales
Advertsmnt   2.125
Sales        10.2679      78.125

Correlation matrix
             Advertsmnt   Sales
Advertsmnt   1
Sales        0.7969       1

• Interpretation
  – The covariance (10.2679) indicates that advertisement expenditure and sales level are positively related.
  – The coefficient of correlation (.797) indicates that there is a strong positive linear relationship between advertisement expenditure and sales level.

• The Least Squares Method
  – We are seeking a line that best fits the data.
  – We define the “best fit line” as a line for which the sum of squared differences between it and the data points is minimized.

$$\text{Minimize } \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual y value of point i, and $\hat{y}_i$ is the y value of point i calculated from the equation of the line:

$$\hat{y}_i = b_0 + b_1 x_i$$

Errors

Different lines generate different errors, and thus different sums of squared errors.

[Figure: scatter plot of Y against X showing the errors between the data points and a fitted line.]

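The coefficients b0 and b1 of the line that minimizes the sum of squares of errors are calculated from the data.

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

$$\text{where } \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \text{ and } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$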
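To show how these coefficients relate to the basic statistics, the sketch below fits a line to the advertising and sales data from Example 7. Reusing that data set here is an assumption made for illustration; the slides do not fit a line to it. The slope is the sum of cross-deviations over the sum of squared x deviations, which equals the sample covariance divided by the sample variance of x.

```python
advert = [1, 3, 5, 4, 2, 5, 3, 2]           # x, from Example 7
sales = [30, 40, 40, 50, 35, 50, 35, 25]    # y, from Example 7
n = len(advert)

x_bar = sum(advert) / n
y_bar = sum(sales) / n

# Numerator and denominator of the slope formula
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advert, sales))
s_xx = sum((x - x_bar) ** 2 for x in advert)

b1 = s_xy / s_xx           # slope
b0 = y_bar - b1 * x_bar    # intercept

def sse(intercept, slope):
    """Sum of squared errors of the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(advert, sales))

print(round(b1, 3), round(b0, 3))           # 4.832 23.025
print(sse(b0, b1) < sse(b0, b1 + 0.5))      # True: changing the slope increases the error
```

Dividing both `s_xy` and `s_xx` by n - 1 turns them into the sample covariance and the sample variance of x, so b1 = cov(X,Y) / s_x².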
