1
Review of Measures of Central Tendency, Dispersion & Association
• Graphical Excellence
• Measures of Central Tendency
– Mean, Median, Mode
• Measures of Dispersion
– Variance, Standard Deviation, Range
• Measures of Association
– Covariance, Correlation Coefficient
• Relationship of basic stats to OLS
2
Graphical Excellence
• Learning from Monkeys
3
Why Graphs & Stats?
• Graphs and descriptive statistics when used properly can summarize lines of data effectively for the reader. What’s a good approximation of the age of students in this class?
• We use graphs and basic statistics (mean, variance, covariance, etc.) to highlight trends and to motivate the research question.
• We use other tools for analysis – Regression, Case Study, Content Analysis etc.
4
What story does this graph tell? What questions does the graph raise?
[Figure: line chart "HHI Over Time"; horizontal axis: YEAR, 1970 to 2000; vertical axis: HHI index, 0.034 to 0.048]
5
Graphical Excellence
• The graph presents large data sets concisely and coherently.
– Label your axes.
• The ideas and concepts to be delivered are clearly understood by the viewer.
– State the units used (e.g., $ or $ in millions).
6
What’s the problem here?
[Figure: bar chart for 1950 and 1991 with data labels 30.50, 26.09, 43.51 and 77.67; vertical axis from 0.00 to 90.00]
7
Graphical Excellence
• The display induces the viewer to address the substance of the data and not the form of the graph.
– Select the appropriate type of graph (bar chart for levels, scatter plot for trends, etc.).
• There is no distortion of what the data reveal.
– Make sure the axes are not stretched or compressed to make a point.
8
Do New Stadiums Bring People in?
[Figure: bar chart of average annual attendance, in millions: MLB Average 2.18, New Stadiums 3.01, Old Stadiums 2.04; vertical axis from 0 to 3.5]
9
Do New Stadiums Bring People in?
[Figure: the same bar chart of average annual attendance, in millions (MLB Average 2.18, New Stadiums 3.01, Old Stadiums 2.04), redrawn with the vertical axis running from 0 to 20]
10
Things to be cautious about when observing a graph:
– Is there a missing scale on one axis?
– Do not be influenced by a graph’s caption.
– Are changes presented in absolute values only, or in percent form too?
11
Numerical Descriptive Measures
• Measures of Central Tendency
– Mean, Median, Mode
• Measures of Dispersion
– Variance, Standard Deviation
• Measures of Association
– Covariance, Correlation Coefficient
12
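Arithmetic mean
– This is the most popular and useful measure of central location.
$$\text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}}$$
Sample mean (n is the sample size):
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Population mean (N is the population size):
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$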
13
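• Example 1
The mean of the sample of six measurements 7, 3, 9, -2, 4, 6 is given by
$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{x_1 + x_2 + x_3 + x_4 + x_5 + x_6}{6} = \frac{7 + 3 + 9 - 2 + 4 + 6}{6} = 4.5$$
• Example 2
Suppose the telephone bills of example 2.1 represent the population of measurements. The population mean is
$$\mu = \frac{\sum_{i=1}^{200} x_i}{200} = \frac{x_1 + x_2 + \dots + x_{200}}{200} = \frac{42.19 + 15.30 + \dots + 53.21}{200} = 43.59$$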
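The arithmetic in Example 1 can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and `statistics.mean` from the standard library is used as a cross-check.

```python
from statistics import mean

# The six measurements from Example 1.
sample = [7, 3, 9, -2, 4, 6]

# Sample mean: sum of the measurements divided by the number of measurements.
x_bar = sum(sample) / len(sample)

# The standard-library helper gives the same value.
assert x_bar == mean(sample)
print(x_bar)  # 4.5
```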
14
• Example 3
When many of the measurements have the same value, the measurements can be summarized in a frequency table. Suppose the number of children in a sample of 16 employees were recorded as follows:

NUMBER OF CHILDREN     0   1   2   3
NUMBER OF EMPLOYEES    3   4   7   2   (16 employees)

$$\bar{x} = \frac{\sum_{i=1}^{16} x_i}{16} = \frac{x_1 + x_2 + \dots + x_{16}}{16} = \frac{3(0) + 4(1) + 7(2) + 2(3)}{16} = 1.5$$
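The frequency-table calculation in Example 3 amounts to a weighted sum. A minimal sketch, with list names chosen here for illustration:

```python
# Frequency table from Example 3.
children = [0, 1, 2, 3]    # number of children
employees = [3, 4, 7, 2]   # number of employees reporting that value

n = sum(employees)  # total number of employees: 16

# Each value is weighted by how many employees reported it.
total_children = sum(c * f for c, f in zip(children, employees))  # 24

mean_children = total_children / n
print(mean_children)  # 1.5
```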
15
The median
– The median of a set of measurements is the value that falls in the middle when the measurements are arranged in order of magnitude.
Example 4
Seven employee salaries were recorded (in 1000s): 28, 60, 26, 32, 30, 26, 29. Find the median salary.
Odd number of observations
First, sort the salaries. Then, locate the value in the middle.
26, 26, 28, 29, 30, 32, 60
The median is 29.
Suppose one employee’s salary of $31,000 was added to the group recorded before. Find the median salary.
Even number of observations
First, sort the salaries. Then, locate the values in the middle.
26, 26, 28, 29, 30, 31, 32, 60
There are two middle values, 29 and 30. The median is their average, 29.5.
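The sort-then-pick-the-middle procedure for odd and even counts might be sketched as follows. The helper name `middle_value` is hypothetical; `statistics.median` does the same job in practice.

```python
from statistics import median

def middle_value(values):
    """Median by sorting and taking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value.
        return ordered[mid]
    # Even count: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

salaries = [28, 60, 26, 32, 30, 26, 29]        # Example 4, in 1000s
median_odd = middle_value(salaries)            # 29
median_even = middle_value(salaries + [31])    # 29.5

print(median_odd, median_even)
```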
16
The mode
– The mode of a set of measurements is the value that occurs most frequently.
– A set of data may have one mode (or modal class), or two or more modes.
– The modal class: for large data sets, the modal class is much more relevant than a single-value mode.
17
– Example 5
• The manager of a men’s store observes the waist size (in inches) of trousers sold yesterday: 31, 34, 36, 33, 28, 34, 30, 34, 32, 40.
• The mode of this data set is 34 in.
This information seems valuable (for example, for the design of a new display in the store), much more so than “the mean is 33.2 in.”
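For the waist sizes in Example 5, all three measures of central tendency can be computed side by side. A short sketch using Python's standard `statistics` module (`multimode` requires Python 3.8 or later):

```python
from statistics import mean, median, multimode

# Waist sizes (in inches) of trousers sold, from Example 5.
waists = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]

modes = multimode(waists)      # every most-frequent value; here just [34]
waist_mean = mean(waists)      # 33.2
waist_median = median(waists)  # 33.5

print(modes, waist_mean, waist_median)
```

Only the mode corresponds to a size the store actually sells, which is why it is the more useful summary in this setting.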
18
Relationship among Mean, Median, and Mode
• If a distribution is symmetrical, the mean, median and mode coincide
• If a distribution is non symmetrical, and skewed to the left or to the right, the three measures differ.
[Figure: a positively skewed distribution (“skewed to the right”), with the positions of the mode, median and mean marked]
19
[Figure: a negatively skewed distribution (“skewed to the left”), with the positions of the mean, median and mode marked]
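A small numerical illustration of how the three measures separate under skew. As an assumption made here for illustration, the seven salaries from Example 4 serve as a right-skewed sample (one large value, 60, pulls the mean up), and negating them gives a mirror-image left-skewed sample.

```python
from statistics import mean, median, mode

# Right-skewed sample: the salaries from Example 4 (in 1000s).
right_skewed = [28, 60, 26, 32, 30, 26, 29]

r_mode = mode(right_skewed)      # 26
r_median = median(right_skewed)  # 29
r_mean = mean(right_skewed)      # 33

# In this sample the long right tail orders them mode < median < mean.
print(r_mode, r_median, r_mean)

# Mirror image (illustrative): negating the values reverses the ordering.
left_skewed = [-x for x in right_skewed]
l_mode, l_median, l_mean = mode(left_skewed), median(left_skewed), mean(left_skewed)
print(l_mean, l_median, l_mode)  # mean < median < mode
```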
20
Measures of variability (Looking beyond the average)
• Measures of central location fail to tell the whole story about the distribution.
• A question of interest still remains unanswered: How typical is the average value of all the measurements in the data set? Or, put differently: How spread out are the measurements about the average value?
21
Observe two hypothetical data sets
Low variability data set: the average value provides a good representation of the values in the data set.
High variability data set: the same average value does not provide as good a representation of the values in the data set as before.
22
The range
– The range of a set of measurements is the difference between the largest and smallest measurements.
– Its major advantage is the ease with which it can be computed.
– Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points.
[Figure: a line from the smallest measurement to the largest measurement, with the range spanning the two end points and question marks in between]
But how do all the measurements spread out? The range cannot assist in answering this question.
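The shortcoming of the range can be seen with two made-up data sets (hypothetical values, not from the slides) that share the same end points but are distributed very differently in between.

```python
def value_range(values):
    """Range: largest measurement minus smallest measurement."""
    return max(values) - min(values)

# Hypothetical data sets with identical end points (0 and 10).
clustered = [0, 5, 5, 5, 10]    # most values sit at the center
polarized = [0, 0, 5, 10, 10]   # most values sit at the extremes

range_clustered = value_range(clustered)
range_polarized = value_range(polarized)

# Both ranges are 10, although the data sets are clearly different.
print(range_clustered, range_polarized)
```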
23
The variance
– This measure of dispersion reflects the values of all the measurements.
– The variance of a population of N measurements $x_1, x_2, \dots, x_N$ having a mean $\mu$ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
(Excel uses the Varp formula.)
– The variance of a sample of n measurements $x_1, x_2, \dots, x_n$ having a mean $\bar{x}$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
(Excel uses the Var formula.)
24
Consider two small populations:
Population A: 8, 9, 10, 11, 12
Population B: 4, 7, 10, 13, 16
The mean of both populations is 10, but measurements in B are much more dispersed than those in A. Thus, a measure of dispersion is needed that agrees with this observation.
Let us start by calculating the sum of deviations.
A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0
The sum of deviations is zero in both cases; therefore, another measure is needed.
25
The sum of squared deviations is used in calculating the variance. See the example next.
26
Let us calculate the variance of the two populations
$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$
$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$
Why is the variance defined as the average squared deviation? Why not use the sum of squared deviations as a measure of dispersion instead?
After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases!
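The two populations can be run through the same steps in code: the deviations cancel out, while the average squared deviation separates A from B. A sketch with illustrative helper names; `statistics.pvariance` is the standard-library population variance.

```python
from statistics import pvariance

pop_a = [8, 9, 10, 11, 12]
pop_b = [4, 7, 10, 13, 16]

def deviations(values):
    """Deviation of each measurement from the population mean."""
    mu = sum(values) / len(values)
    return [x - mu for x in values]

def population_variance(values):
    """Average squared deviation from the population mean."""
    return sum(d * d for d in deviations(values)) / len(values)

# The plain sum of deviations is zero for both populations...
sum_dev_a = sum(deviations(pop_a))
sum_dev_b = sum(deviations(pop_b))

# ...but the variance is not.
var_a = population_variance(pop_a)  # 2.0
var_b = population_variance(pop_b)  # 18.0

print(sum_dev_a, sum_dev_b, var_a, var_b)
```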
27
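Which data set has a larger dispersion?
Data set A: the values 1 and 3, each occurring 5 times (mean 2)
Data set B: the values 1 and 5, each occurring once (mean 3)
Data set B is more dispersed around the mean.
Let us calculate the sum of squared deviations for both data sets:
$$\text{Sum}_A = \underbrace{(1-2)^2 + \dots + (1-2)^2}_{5 \text{ times}} + \underbrace{(3-2)^2 + \dots + (3-2)^2}_{5 \text{ times}} = 10$$
$$\text{Sum}_B = (1-3)^2 + (5-3)^2 = 8$$
Judged by these sums alone, A would appear to be the more dispersed data set. However, when calculated on a “per observation” basis (variance), the data set dispersions are properly ranked:
$$\sigma_A^2 = \text{Sum}_A / N = 10/10 = 1$$
$$\sigma_B^2 = \text{Sum}_B / N = 8/2 = 4$$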
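A quick check of this comparison, as a sketch: data set A has ten observations (five 1s and five 3s), so its sum of squared deviations is divided by N = 10, while data set B has only two observations.

```python
data_a = [1] * 5 + [3] * 5   # ten observations, mean 2
data_b = [1, 5]              # two observations, mean 3

def sum_squared_deviations(values):
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values)

sum_sq_a = sum_squared_deviations(data_a)  # 10.0
sum_sq_b = sum_squared_deviations(data_b)  # 8.0

# Dividing by the number of observations ranks the two sets correctly.
var_a = sum_sq_a / len(data_a)  # 1.0
var_b = sum_sq_b / len(data_b)  # 4.0

print(sum_sq_a, sum_sq_b, var_a, var_b)
```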
28
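– Example 6
• Find the mean and the variance of the following sample of measurements (in years): 3.4, 2.5, 4.1, 1.2, 2.8, 3.7
– Solution
$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{3.4 + 2.5 + 4.1 + 1.2 + 2.8 + 3.7}{6} = \frac{17.7}{6} = 2.95$$
A shortcut formula:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$
$$s^2 = \frac{1}{5}\left\{\left[3.4^2 + 2.5^2 + \dots + 3.7^2\right] - \left[\frac{(17.7)^2}{6}\right]\right\} = 1.075 \text{ (years)}^2$$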
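Both routes to the sample variance in Example 6, the definition and the shortcut formula, can be compared numerically. A minimal sketch; results are floating-point values, so they are compared with a small tolerance rather than for exact equality.

```python
sample = [3.4, 2.5, 4.1, 1.2, 2.8, 3.7]   # measurements in years
n = len(sample)

x_bar = sum(sample) / n   # approximately 2.95

# Definition: sum of squared deviations divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Shortcut: [sum of squares - (sum)^2 / n] divided by n - 1.
s2_shortcut = (sum(x * x for x in sample) - sum(sample) ** 2 / n) / (n - 1)

print(round(x_bar, 2), round(s2_definition, 3), round(s2_shortcut, 3))
```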
29
– The standard deviation of a set of measurements is the square root of the variance of the measurements.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
– Example 4.9
• Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk?
Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5
Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4
30
– Solution
– Let us use the Excel printout that is run from the “Descriptive statistics” sub-menu (use file Xm04-10)

Statistic             Fund A    Fund B
Mean                  16        12
Standard Error        5.295     3.152
Median                14.6      11.75
Mode                  #N/A      #N/A
Standard Deviation    16.74     9.969
Sample Variance       280.3     99.37
Kurtosis              -1.34     -0.46
Skewness              0.217     0.107
Range                 49.1      30.6
Minimum               -6.2      -2.8
Maximum               42.9      27.8
Sum                   160       120
Count                 10        10

Fund A should be considered riskier because its standard deviation is larger.
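The key figures in the printout can be reproduced without Excel. The sketch below uses Python's `statistics` module and assumes the last Fund A return is 30.5, the value consistent with the printout's Sum of 160 and Sample Variance of 280.3.

```python
from statistics import mean, stdev, variance

# Annual rates of return. The final Fund A value (30.5) is an assumption
# chosen to match the Excel printout's Sum of 160.
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.5]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

# Print mean, sample variance and sample standard deviation for each fund.
for name, returns in (("Fund A", fund_a), ("Fund B", fund_b)):
    print(name,
          round(mean(returns), 2),
          round(variance(returns), 2),
          round(stdev(returns), 3))

# The fund with the larger sample standard deviation is the riskier one.
riskier = "Fund A" if stdev(fund_a) > stdev(fund_b) else "Fund B"
print(riskier)
```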
31
The coefficient of variation
– The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
– This coefficient provides a proportionate measure of variation.
Population coefficient of variation: $CV = \dfrac{\sigma}{\mu}$
Sample coefficient of variation: $cv = \dfrac{s}{\bar{x}}$
A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
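The comparison in the last sentence reduces to two divisions. A tiny sketch, with a hypothetical helper name:

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a proportion of the mean."""
    return std_dev / mean_value

cv_mean_100 = coefficient_of_variation(10, 100)  # 0.1, i.e. 10% of the mean
cv_mean_500 = coefficient_of_variation(10, 500)  # 0.02, i.e. 2% of the mean

print(cv_mean_100, cv_mean_500)
```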
32
Interpreting Standard Deviation
• The standard deviation can be used to
– compare the variability of several distributions
– make a statement about the general shape of a distribution.
33
Measures of Association
• Two numerical measures are presented for the description of the linear relationship between two variables depicted in the scatter diagram.
– Covariance: is there any pattern to the way two variables move together?
– Correlation coefficient: how strong is the linear relationship between two variables?
34
The covariance
Population covariance (Excel uses this formula to calculate Cov):
$$COV(X,Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
Sample covariance:
$$cov(X,Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y).
N is the population size. n is the sample size.
NOTE: The formula in Excel does not give you the sample covariance.
• If the two variables move in two opposite directions, (one increases when the other one decreases), the covariance is a large negative number.
• If the two variables are unrelated, the covariance will be close to zero.
• If the two variables move the same direction, (both increase or both decrease), the covariance is a large positive number.
36
The coefficient of correlation
– This coefficient answers the question: How strong is the association between X and Y?
Population coefficient of correlation:
$$\rho = \frac{COV(X,Y)}{\sigma_x \sigma_y}$$
Sample coefficient of correlation:
$$r = \frac{cov(X,Y)}{s_x s_y}$$
37
Value of ρ or r    Covariance       Interpretation
+1                 COV(X,Y) > 0     Strong positive linear relationship
0                  COV(X,Y) = 0     No linear relationship
-1                 COV(X,Y) < 0     Strong negative linear relationship
38
• If the two variables are very strongly positively related, the coefficient value is close to +1 (strong positive linear relationship).
• If the two variables are very strongly negatively related, the coefficient value is close to -1 (strong negative linear relationship).
• No straight line relationship is indicated by a coefficient close to zero.
39
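– Example 7
• Compute the covariance and the coefficient of correlation to measure how advertising expenditure and sales level are related to one another.
• Base your calculation on the data provided in example 2.3

Advert   Sales
1        30
3        40
5        40
4        50
2        35
5        50
3        35
2        25

Shortcut Formulas
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}$$
$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$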
40
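• Use the procedure below to obtain the required summations

Month   x    y     xy     x²   y²
1       1    30    30     1    900
2       3    40    120    9    1600
3       5    40    200    25   1600
4       4    50    200    16   2500
5       2    35    70     4    1225
6       5    50    250    25   2500
7       3    35    105    9    1225
8       2    25    50     4    625
Sum     25   305   1025   93   12175

$$cov(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n}\right] = \frac{1}{7}\left[1025 - \frac{25 \times 305}{8}\right] = 10.268$$
$$s_x^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right] = \frac{1}{7}\left[93 - \frac{25^2}{8}\right] = 2.125$$
$$s_x = \sqrt{2.125} = 1.458$$
Similarly, $s_y = 8.839$
$$r = \frac{cov(X,Y)}{s_x s_y} = \frac{10.268}{1.458 \times 8.839} = .797$$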
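The summation table and the shortcut formulas translate directly into code. A sketch for checking the Example 7 figures (variable names are illustrative):

```python
from math import sqrt

advert = [1, 3, 5, 4, 2, 5, 3, 2]
sales = [30, 40, 40, 50, 35, 50, 35, 25]
n = len(advert)

# Column sums from the summation table.
sum_x = sum(advert)                                # 25
sum_y = sum(sales)                                 # 305
sum_xy = sum(x * y for x, y in zip(advert, sales)) # 1025
sum_x2 = sum(x * x for x in advert)                # 93
sum_y2 = sum(y * y for y in sales)                 # 12175

# Shortcut formulas for the sample covariance and sample variances.
cov_xy = (sum_xy - sum_x * sum_y / n) / (n - 1)
var_x = (sum_x2 - sum_x ** 2 / n) / (n - 1)
var_y = (sum_y2 - sum_y ** 2 / n) / (n - 1)

s_x, s_y = sqrt(var_x), sqrt(var_y)
r = cov_xy / (s_x * s_y)

print(round(cov_xy, 4), round(s_x, 3), round(s_y, 3), round(r, 4))
```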
41
• Excel printout

Covariance matrix
              Advertsmnt   Sales
Advertsmnt    2.125
Sales         10.2679      78.125

Correlation matrix
              Advertsmnt   Sales
Advertsmnt    1
Sales         0.7969       1

• Interpretation
– The covariance (10.2679) indicates that advertisement expenditure and sales level are positively related.
– The coefficient of correlation (.797) indicates that there is a strong positive linear relationship between advertisement expenditure and sales level.
42
• The Least Squares Method
– We are seeking a line that best fits the data.
– We define the “best fit line” as a line for which the sum of squared differences between it and the data points is minimized.
$$\text{Minimize} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Here $y_i$ is the actual y value of point i, and $\hat{y}_i$ is the y value of point i calculated from the equation of the line:
$$\hat{y}_i = b_0 + b_1 x_i$$
Errors
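Different lines generate different errors, and thus different sums of squared errors.
[Figure: scatter plot of Y against X with candidate lines and the errors between the points and each line]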
44
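The coefficients $b_0$ and $b_1$ of the line that minimizes the sum of squares of errors are calculated from the data.
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
where
$$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} \quad \text{and} \quad \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$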
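These formulas tie the regression line back to the basic statistics: dividing the numerator and denominator of $b_1$ by n - 1 shows that the slope is the sample covariance over the sample variance of x. The sketch below applies them to the advertising and sales data of Example 7; reusing that data here is an illustrative assumption, since the slides do not fit a line to it.

```python
advert = [1, 3, 5, 4, 2, 5, 3, 2]          # x
sales = [30, 40, 40, 50, 35, 50, 35, 25]   # y
n = len(advert)

x_bar = sum(advert) / n   # 3.125
y_bar = sum(sales) / n    # 38.125

# Least squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(advert, sales))
s_xx = sum((x - x_bar) ** 2 for x in advert)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Same slope from the basic statistics: sample covariance / sample variance of x.
cov_xy = s_xy / (n - 1)
var_x = s_xx / (n - 1)
slope_from_stats = cov_xy / var_x

# With an intercept in the model, the fitted errors sum to (numerically) zero.
residuals = [y - (b0 + b1 * x) for x, y in zip(advert, sales)]

print(round(b1, 3), round(b0, 3))
```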