Descriptive Statistics
Prerequisite Material
MGS 8110 Regression & Forecasting
L00D MGS 8110 - Descriptive Statistics
Descriptive Analysis of Data
Given a set of data (simple data for only one variable), what is the best way to summarize the data?
Measures of Central Tendency
• Mean: $\bar{x} = \frac{\sum x_i}{n}$
• Median: $M = x_{(n+1)/2}$ if n is odd; $M = (x_{n/2} + x_{n/2+1})/2$ if n is even
• Mode: the $x_i$ value that occurs most frequently
• Mid Range: $M_R = (x_{\min} + x_{\max})/2$
• Mid InterQuartile: $M_Q = (x_{.25} + x_{.75})/2$
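Measures of Variability
• Av Deviate: $\frac{\sum (x_i - \bar{x})}{n}$
• Av Absolute Deviate: $\frac{\sum |x_i - \bar{x}|}{n}$
• Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
• Standard Deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
• Range: $R = x_{\max} - x_{\min}$
• Inter Quartile Range (IQR): $R_Q = x_{.75} - x_{.25}$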
Let's Review - Statistical Precepts
Use the Mean for quantitative, symmetric data.
Use the Median for quantitative, non-symmetric data.
Use the mode for categorical data.
Use the Variance when doing calculations.
Use the Standard Deviation when presenting the results of the calculations.
Major Teaching Points are frequently shown in green boxes
Appropriate Statistics (discussed in chapter 2)

Type of data                     Central tendency   Variability      Display
Quantitative - Bell Shaped       Mean               St Dev           Histogram
Quantitative - NOT Bell Shaped   Median             Quartile Range   Histogram
Ordinal                          Median             Quartile Range   Tables or Bar charts
Nominal                          Mode               NA               Tables or Bar charts
More Review – Need to memorize formulas
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$ Used for calculation, but not for presentations. Units are squared (e.g., inches squared).
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$ Used for presentations. Common units (e.g., inches).
Divide by n-1 instead of n in order to get an unbiased estimate of the variance.
Interpretation of Standard Deviation (1 of 2)
If the data is normally distributed:
$(\bar{x} - s, \bar{x} + s)$ contains 68.27% of the population
$(\bar{x} - 2s, \bar{x} + 2s)$ contains 95.45% of the population
$(\bar{x} - 3s, \bar{x} + 3s)$ contains 99.73% of the population
Statistical Precepts
Two-thirds of the data is contained within plus or minus one sigma of the mean.
95% of the data is contained within plus or minus two sigma of the mean.
Almost all of the data is contained within plus or minus three sigma of the mean.
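The percentages quoted for one, two and three sigma can be checked with a short sketch. It assumes a normal distribution and uses Python's standard-library `NormalDist`; the helper name `coverage` is made up for this illustration.

```python
from statistics import NormalDist


def coverage(k: float) -> float:
    """Share of a normal population within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.2%}")
```

The printed shares should match the 68.27%, 95.45% and 99.73% figures on the slide, which is where the "two-thirds, 95%, almost all" precepts come from.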
Interpretation of Standard Deviation (2 of 2)
If ever asked to explain what the Standard Deviation means, say “two-thirds of the data will be within plus or minus one Standard Deviation from the mean”.
If ever asked for the “worst case” or “best case” outcome calculate “mean – (2)sigma” and/or “mean + (2)sigma”.
Statistical Precepts
Definition of Standard Deviation - two-thirds of the data is contained in a range of values that is two sigma wide (plus or minus one sigma around the mean).
Worst case outcome is $\mu - 2\sigma$.
Best case outcome is $\mu + 2\sigma$.
Other Measures
• Percentiles (measure of tails): $x_p$ is the $x_i$ such that p% of the $x_i < x_p$
• Quartiles (measure of tails): percentiles where p = 25, 50 or 75; the lower, middle or upper quartile.
• $C_0 = x_{\min}$ (minimum)
  $C_1 = x_{.25}$ (lower quartile)
  $C_2 = x_{.50} = x_{(n+1)/2}$ or $(x_{n/2} + x_{n/2+1})/2$ (middle quartile)
  $C_3 = x_{.75}$ (upper quartile)
  $C_4 = x_{\max}$ (maximum)
Other Measures
• Coefficient of Variation (percentage measure of variability): $C_v = \frac{s}{\bar{x}}$
• Correlation Coefficient (measure of linear association): $\rho = \frac{\sum (x_i - \bar{x})(y_i - \bar{y}) / (n-1)}{s_x s_y}$, with $-1 \le \rho \le 1$
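A minimal sketch of both measures, written directly from the definitions (sample standard deviation with n-1 in the denominator). The data values and all function names are hypothetical and only serve the illustration.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sample_stdev(values):
    """Standard deviation with n-1 in the denominator."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def coefficient_of_variation(values):
    """Standard deviation expressed as a fraction of the mean."""
    return sample_stdev(values) / mean(values)


def correlation(xs, ys):
    """Sum of cross-products over (n-1), divided by the two standard deviations."""
    mx, my = mean(xs), mean(ys)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cross / (sample_stdev(xs) * sample_stdev(ys))


# Hypothetical data: ys is an exact linear function of xs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

print(coefficient_of_variation(xs))
print(correlation(xs, ys))                 # perfectly increasing line
print(correlation(xs, [-y for y in ys]))   # perfectly decreasing line
```

Because the hypothetical ys lie exactly on a straight line, the two correlations sit at the ends of the allowed range, +1 and -1.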
Interpretation of Correlation Coefficient (n=25)
$-1 \le \rho \le 1$
[Figure: three scatter plots of Y variable against X variable, both axes from -4 to 4, showing $\rho = 0$, $\rho = .8$ and $\rho = 1$]
Interpretation of Correlation Coefficient (n=25)
$-1 \le \rho \le 1$
[Figure: three scatter plots of Y variable against X variable, both axes from -4 to 4, showing $\rho = 0$, $\rho = -.8$ and $\rho = -1$]
Interpretation of Correlation Coefficient
[Figure: Percentage of Variability Explained (0 to 1) plotted against Correlation Coefficient (0 to 1)]
$\rho = \frac{\sum (x_i - \bar{x})(y_i - \bar{y}) / (n-1)}{s_x s_y}$
Interpretation of Correlation Coefficient (1 of 2)
Statistical tests of correlation coefficients are relatively meaningless. These tests are based on the hypothesis that “$\rho = 0$”. Based on the previous graph, knowing that a correlation coefficient is greater than zero is not necessarily a valuable piece of information.
In terms of “Practical Significance” (compared to “Statistical Significance”), the correlation coefficient has to be greater than .5. From the previous graph it can be seen that $\rho = .5$ only explains 25% of the variability in the data.
Statistical Precept
$\rho$ must be greater than .5 to be of Practical Significance
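The 25% figure is the square of the correlation coefficient. A small sketch of that arithmetic follows; the function name `variability_explained` is hypothetical, and the value 0.783 is the Height/Weight correlation used in the deck's example calculations.

```python
def variability_explained(correlation: float) -> float:
    """Fraction of variability explained: the squared correlation coefficient."""
    return correlation ** 2


for r in (0.1, 0.3, 0.5, 0.7, 0.783, 0.9):
    print(f"correlation {r:.3f} explains {variability_explained(r):.0%} of the variability")
```

Small correlations explain very little: a correlation of .3 accounts for under a tenth of the variability, which is the reasoning behind the .5 threshold.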
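Other Measures (page 3 of 3)
• Skew (measure of symmetry): $\text{Skew} = \frac{\sum (x_i - \bar{x})^3 / n}{\left( \sum (x_i - \bar{x})^2 / n \right)^{3/2}}$, more typically in $(-3, 3)$
• Kurtosis (measure of peakedness): $\text{Kurt} = \frac{\sum (x_i - \bar{x})^4 / n}{\left( \sum (x_i - \bar{x})^2 / n \right)^{2}} - 3$, more typically in $(-3, 3)$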
Verifying Bell Shape (Normal Distribution)
• Negative Skew if the distribution has a ‘long’ tail to the left, measured as skewness < -1
• Positive Skew if the distribution has a ‘long’ tail to the right (more common situation), measured as skewness > +1
• Symmetric if -1 < skewness < +1
• Peaked Distribution if Kurtosis is a large positive number (> +1).
• Flat Distribution if Kurtosis is a large negative number (< -1).
• Normal shape (proportionally S-shaped sides) if Kurtosis near zero.
Verifying Bell Shape (Normal Distribution)
Statistical Precept
Bell shaped (Normally distributed)
if -1 < skewness < +1 and
if -1 < kurtosis < +1
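The rule of thumb can be sketched in a few lines, using the moment-based skew and kurtosis formulas from the "Other Measures" slide. Note that Excel's SKEW and KURT functions apply small-sample adjustments, so their values will differ somewhat from this sketch. The function names and both data sets are hypothetical.

```python
def central_moment(values, k):
    """Average of (x_i - mean)^k, dividing by n."""
    m = sum(values) / len(values)
    return sum((v - m) ** k for v in values) / len(values)


def skewness(values):
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Excess kurtosis: zero for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


def looks_bell_shaped(values):
    """Rule of thumb: both skewness and kurtosis strictly between -1 and +1."""
    return -1 < skewness(values) < 1 and -1 < kurtosis(values) < 1


flat = [1, 2, 3, 4, 5]                  # hypothetical, evenly spread
humped = [1, 2, 2, 3, 3, 3, 4, 4, 5]    # hypothetical, piled up in the middle

print(skewness(flat), kurtosis(flat), looks_bell_shaped(flat))
print(skewness(humped), kurtosis(humped), looks_bell_shaped(humped))
```

Both samples are symmetric, so only the kurtosis separates them: the evenly spread sample is too flat to pass the test, while the humped one passes.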
What is the best way to summarize data? (our original question)
• Central Tendency (Mean, Median & Mode)
• Variability (Variance & Standard Deviation)
• Shape (Percentiles, Skewness & Kurtosis)
• Association (correlation)
Notations
Parameter            Population (True Value)   Sample Value
Mean                 $\mu$                     $\bar{X}$ (average)
Standard Deviation   $\sigma$                  $s$
Variance             $\sigma^2$                $s^2$
Correlation          $\rho$                    $p$
Standard Deviation of Sample Mean
$s_{\bar{X}} = \frac{s_X}{\sqrt{n}}$
Called the “Standard Error” of the Mean
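As a quick check of the formula, the sketch below recomputes the Standard Error from the summary figures shown in the deck's Descriptive Statistics example output (Mean 47.84, Standard Deviation 11.46, Count 19). The function name is hypothetical.

```python
from math import sqrt


def standard_error(stdev: float, n: int) -> float:
    """Standard deviation of the sample mean: s divided by the square root of n."""
    return stdev / sqrt(n)


# Figures taken from the example Descriptive Statistics output in this deck.
mean, stdev, count = 47.84, 11.46, 19

se = standard_error(stdev, count)
low, high = mean - 2 * se, mean + 2 * se   # "mean plus or minus two sigma" applied to the mean itself

print(f"standard error = {se:.2f}")
print(f"the true mean is plausibly between {low:.1f} and {high:.1f}")
```

The result agrees with the 2.63 that Excel reports for the same data. Because n sits under a square root, quadrupling the sample size only halves the Standard Error.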
Insert / Function examples (1 of 3)
• Mean Average(A1:A10)
• Median Median(A1:A10)
• Mode Mode(A1:A10)
• Mid Range ( MAX(A1:A10) + MIN(A1:A10) ) / 2
• Mid InterQuartile ( Quartile(A1:A10,1) + Quartile(A1:A10,3) ) / 2
Insert / Function examples (2 of 3)
• Av Deviate NA
• Av Absolute Deviate AveDev(A1:A10)
• Variance Var(A1:A10)
• Standard Deviation StDev(A1:A10)
• Range MAX(A1:A10) - MIN(A1:A10)
• Inter Quartile Range (IQR) Quartile(A1:A10,3) - Quartile(A1:A10,1)
Insert / Function examples (3 of 3)
• Percentiles Percentile(A1:A10,.05)
• Quartiles Quartile(A1:A10,q) where q=0,1,2,3,4
• Coef. of Variation StDev(A1:A10)/Average(A1:A10)
• Correlation Correl(A1:A10,B1:B10)
• Skew Skew(A1:A10)
• Kurtosis Kurt(A1:A10)
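For readers working outside Excel, most of the functions on these three slides have close counterparts in Python's standard `statistics` module. The sketch below is an assumption-laden illustration rather than a definitive mapping: the data values are hypothetical, and `method="inclusive"` is used on the understanding that it interpolates the same way as Excel's Quartile and Percentile.

```python
import statistics as st

data = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 12.0]  # hypothetical A1:A10

mean = st.mean(data)                                         # Average(A1:A10)
median = st.median(data)                                     # Median(A1:A10)
mode = st.mode(data)                                         # Mode(A1:A10)
mid_range = (max(data) + min(data)) / 2                      # (MAX + MIN) / 2

q1, q2, q3 = st.quantiles(data, n=4, method="inclusive")     # Quartile(A1:A10, 1..3)
mid_interquartile = (q1 + q3) / 2
iqr = q3 - q1                                                # Inter Quartile Range
p05 = st.quantiles(data, n=100, method="inclusive")[4]       # Percentile(A1:A10, .05)

avg_abs_dev = sum(abs(x - mean) for x in data) / len(data)   # AveDev(A1:A10)
variance = st.variance(data)                                 # Var(A1:A10), n-1 denominator
std_dev = st.stdev(data)                                     # StDev(A1:A10), n-1 denominator
data_range = max(data) - min(data)                           # MAX - MIN
coef_var = std_dev / mean                                    # StDev / Average

print(mean, median, mode, mid_range)
print(q1, q3, iqr, p05)
print(avg_abs_dev, variance, std_dev, data_range, coef_var)
```

Skew, Kurt and Correl are left out here because the standard library has no direct equivalents on older Python versions.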
Example Calculations
Q. Should I use the Mean or the Median to state the Central value of this data?
-0.31 =SKEW(Height)
-1.27 =KURT(Height)
0.17 =SKEW(Weight)
-1.03 =KURT(Weight)
Answer – Both variables have somewhat flat distributions (Kurtosis less than -1), but both variables have very symmetric distributions (non-skewed distributions); hence, use the Mean.
Example Calculations
Q. The Standard Deviation for Height is almost 2 inches, what is the practical interpretation of this value?
Answer – The heights of 2/3 of the population fall within a range less than 4 inches wide (3.87”), from 67.1” to 71.0”.
67.1 =B14-B15
71.0 =B14+B15
3.87 =F10-F9
Example Calculations
Q. What is the height of the shortest person and the tallest person that I may meet today (worst case and best case)?
Answer – The shortest person will be 5’-5” (65.2”) and the tallest person will be 6’-1” (72.9”).
65.2 =AvHt-2*StDevHt
72.9 =AvHt+2*StDevHt
Example Calculations
Q. What is the height of the shortest person that I may meet over the next year?
Answer – The shortest person that I am likely to meet in the foreseeable future will be 5’-6” (66.0”).
66.0 =PERCENTILE(Height,0.01)
Example Calculations
Q. The answers to the two previous questions are not consistent. The lower value calculated as Mean – 2(Sigma), roughly the 2.5% point, was 5’-5”, whereas the 1% value calculated as a Percentile was 5’-6”. Why is the 1% value not the smaller of the two?
Answer – These types of inconsistencies (i.e., errors) will occur with small samples. The procedure used by the PERCENTILE function is based on an interpolated calculation with the two smallest values in the sample.
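The interpolation can be sketched as follows. This assumes PERCENTILE uses the inclusive rank rule, position = p × (n − 1) counted from the smallest value, which is how the function is generally documented; the heights below are hypothetical and the function name is made up.

```python
def percentile_inclusive(values, p):
    """Percentile by linear interpolation between neighbouring sorted values."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)          # fractional rank, 0 = smallest value
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


heights = [66.0, 67.0, 68.5, 69.0, 70.0, 71.5]   # hypothetical small sample

print(percentile_inclusive(heights, 0.01))       # barely above the smallest value
print(percentile_inclusive(heights, 0.50))       # the median
```

With fewer than about 100 observations the 1% position falls between the two smallest values, so the result can never be lower than the shortest height actually observed. Mean – 2(Sigma) has no such floor, which is why the two answers can disagree.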
Example Calculations
Q. Which variable, Height or Weight, has the greatest relative variability?
Answer – In agreement with our intuition, Weight is 3 to 4 times more variable than height (11/3 = 3.67).
Coef of Variation
3% =StDevHt/AvHt
11% =StDevWt/AvWt
Example Calculations
Q. Is there a relationship between Height and Weight and if so how large is the relationship?
Answer – The correlation between Height and Weight is .78 which means that about 60% (.61) of the variability in weight is due to differences in Height.
0.783 =CORREL(Height,Weight)
0.61 =G23^2
[Figure: Percentage of Variability Explained (0 to 1) plotted against Correlation Coefficient (0 to 1)]
Example Calculations
Q. Given that there is a relationship between Height and Weight, is the relationship linear or non-linear?
Answer – Simple statistics cannot be used to determine linear versus non-linear; you would need to plot the data. The correlation indicates that there is a relatively strong linear relationship, but a plot of the data (Weight vs. Height) may indicate that there is an even stronger non-linear relationship.
0.783 =CORREL(Height,Weight)
0.61 =G23^2
Example Calculations
Q. Are Height and Weight Normally distributed?
Answer – Based on our Rule-of-Thumb test (-1 < Skew < +1 and -1 < Kurt < +1), neither of these variables is normally distributed.
-0.31 =SKEW(Height)
-1.27 =KURT(Height)
0.17 =SKEW(Weight)
-1.03 =KURT(Weight)
Example Calculations
Q. Given that the variables are NOT Normally distributed, why do I care?
Answer – Your previous interpretation of the Standard Deviation may be somewhat inaccurate (“The height of 2/3 of the population will vary by less than 4 inches”). Also, your previous interpretation of Worst Case and Best Case may be somewhat inaccurate (“The shortest person will be 5’-5” and the tallest person will be 6’-1””).
Example Calculations
Q. The average Height is estimated to be 69.1”, how good is that estimate?
Answer – The true average height could be anywhere between 67.8 inches and 70.3 inches. A better estimate could be obtained if a larger sample were available.
67.8 =AvHt-2*StErrorHt
70.3 =AvHt+2*StErrorHt
More about Variability
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$ Use StDev (or Var) in Excel
Alternative formulation
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$ Use StDevP (or VarP) in Excel
1) if every item in the Universe is included in the Sample or
2) The Mean is known with certainty: $s = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$
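The difference between the two denominators can be seen in a short sketch using Python's standard library, where `stdev` plays the role of StDev and `pstdev` the role of StDevP. The data values and the "known" mean of 5.0 are hypothetical.

```python
from math import sqrt
from statistics import pstdev, stdev

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical values, mean 5

sample_sd = stdev(data)        # divides by n-1, like StDev
population_sd = pstdev(data)   # divides by N, like StDevP

# Case 2: the mean is known with certainty, so deviations are taken from it directly.
known_mean = 5.0               # assumed to be known in advance
known_mean_sd = sqrt(sum((x - known_mean) ** 2 for x in data) / len(data))

print(sample_sd, population_sd, known_mean_sd)
```

The n-1 version is always the larger of the two, and the gap shrinks as the sample grows.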
[Figure: Cumulative Distribution Function for Male Height, heights 59 to 79 on the horizontal axis, with .95 marked]
[Figure: Probability Density Function for Male Height, heights 59 to 79 on the horizontal axis, with .95 marked]
Normal Calculations
NORMINV(0.95,68.8,2.6)=73.08
[Figure: Probability Density Function for Male Height, heights 59 to 79 on the horizontal axis, with .244 marked]
[Figure: Cumulative Distribution Function for Male Height, heights 59 to 79 on the horizontal axis, with .244 marked]
Normal Calculations
NORMDIST(67,68.8,2.6,TRUE)=.244
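Both Normal Calculations (the NORMINV and the NORMDIST example) can be reproduced with Python's standard-library `NormalDist`, taking the mean 68.8 and standard deviation 2.6 from the Excel formulas. The variable names are hypothetical.

```python
from statistics import NormalDist

male_height = NormalDist(mu=68.8, sigma=2.6)

# Height below which 95% of the population falls: NORMINV(0.95, 68.8, 2.6)
height_95 = male_height.inv_cdf(0.95)

# Share of the population shorter than 67 inches: NORMDIST(67, 68.8, 2.6, TRUE)
share_below_67 = male_height.cdf(67)

print(f"{height_95:.2f}")
print(f"{share_below_67:.3f}")
```

The two functions are inverses of each other: one goes from a probability to a height, the other from a height to a probability.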
Standardized Normal Calculations
NORMSINV(0.95)=1.645
NORMSINV(.05)=-1.645
$z = \frac{x_i - \bar{x}}{s}$
X variable has mean and StDev of $\mu$ and $\sigma$, which are estimated by x bar and s.
Z variable has mean=0 and StDev=1. Z is a “standardized normal”.
[Figure: Probability Density Function for Male Height, standardized axis from -5 to 5, with .95 marked]
[Figure: Cumulative Distribution Function for Male Height, standardized axis from -5 to 5, with .95 marked]
Standardized Normal Calculations
NORMSDIST(-1)=.159
NORMSDIST(+1)=.841
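A sketch of the standardized calculations, again with Python's `NormalDist`. The `standardize` helper is a hypothetical name; the height example (67 inches, mean 68.8, standard deviation 2.6) reuses the figures from the earlier NORMDIST slide to show that standardizing first gives the same probability.

```python
from statistics import NormalDist

z = NormalDist()   # standardized normal: mean 0, StDev 1


def standardize(x: float, mean: float, stdev: float) -> float:
    """Convert a raw value to a z value: distance from the mean in standard deviations."""
    return (x - mean) / stdev


print(f"{z.inv_cdf(0.95):.3f}")   # NORMSINV(0.95)
print(f"{z.inv_cdf(0.05):.3f}")   # NORMSINV(.05)
print(f"{z.cdf(-1):.3f}")         # NORMSDIST(-1)
print(f"{z.cdf(1):.3f}")          # NORMSDIST(+1)

# Same question as NORMDIST(67, 68.8, 2.6, TRUE), answered on the z scale.
print(f"{z.cdf(standardize(67, 68.8, 2.6)):.3f}")
```

The difference between the last two NORMSDIST values, .841 minus .159, is the familiar two-thirds of the data within one sigma.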
[Figure: Probability Density Function for Male Height, standardized axis from -5 to 5, with .159 marked]
[Figure: Cumulative Distribution Function for Male Height, standardized axis from -5 to 5, with .159 marked]
t-distribution
[Figure: Probability Density Function for Male Height, axis from -5 to 5, comparing t-Dist df=4, t-Dist df=16 and z Normal, with .95 marked]
The t-distribution is needed if $\sigma$ is not known and is estimated by s, and n<30.
t-distribution Calculations
one-tail
TDIST(2,4,1)=.058
TDIST(X, d.f., # tails)
[Figure: Probability Density Function for Male Height, axis from -5 to 5, with the upper tail area .058 marked]
[Figure: Cumulative Distribution Function for Male Height, axis from -5 to 5, with .058 marked]
“t” with tails=1 sums from + infinity (upper tail).
“Z” and “Normal” sum from –infinity (lower tail).
t-distribution Calculations
two-tail
TDIST(2,4,2)=.116
TDIST(X, d.f., # tails)
[Figure: Probability Density Function for Male Height, axis from -5 to 5, with .058 marked in each tail]
[Figure: Cumulative Distribution Function for Male Height, axis from -5 to 5, with .116 marked]
“t (2 tail)” sums simultaneously from both – infinity and + infinity. Undefined for negative values of t.
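Python's standard library has no t-distribution, so the sketch below integrates the t density numerically with Simpson's rule to reproduce the one-tail and two-tail TDIST examples. This is an illustration of what the function computes, not Excel's actual algorithm; the function names are hypothetical and the step count is an arbitrary choice.

```python
from math import gamma, pi, sqrt


def t_density(t: float, df: int) -> float:
    """Probability density of Student's t with df degrees of freedom."""
    coefficient = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coefficient * (1 + t * t / df) ** (-(df + 1) / 2)


def upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """P(T > t) for t >= 0: one half minus the area between 0 and t (Simpson's rule)."""
    if t < 0:
        raise ValueError("undefined for negative values of t, as with TDIST")
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 0.5 - total * h / 3


def tdist(t: float, df: int, tails: int) -> float:
    """Sketch of TDIST(X, d.f., # tails): the two-tail value is twice the one-tail value."""
    return tails * upper_tail(t, df)


print(f"{tdist(2, 4, 1):.3f}")   # one-tail
print(f"{tdist(2, 4, 2):.3f}")   # two-tail
```

The sketch rejects negative t for the same reason the slide gives: the tail area is measured outward from a positive value.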
Loading “Data Analysis” in Office 2003
/ Tools / Add-Ins /
Will need to have original MS Office CD.
[Screenshot: Add-Ins dialog with the callouts “Both” and “NO”]
Example of Tools / Data Analysis / Descriptive Statistics
Example Output of Tools / Data Analysis / Descriptive Statistics

Mean                 47.84
Standard Error       2.63
Median               48.00
Mode                 41.00
Standard Deviation   11.46
Sample Variance      131.25
Kurtosis             -0.73
Skewness             0.29
Range                41
Minimum              30
Maximum              71
Sum                  909
Count                19
Loading “Data Analysis” in Office 2007
1) Click the Office button in the upper left-hand corner of Excel.
2) Click the “Excel Options” tab in the bottom right-hand corner of the drop-down menu opened in step #1.
[Screenshot with callouts for step 1)]
Loading “Data Analysis” in Office 2007
3) Click Add-Ins in the left banner of the Excel Options menu.
4) Click “Analysis ToolPak” in the Add-ins menu. Then, select BOTH “Analysis ToolPak” and “Analysis ToolPak – VBA”.
5) Click the Go button at the bottom right-hand corner of the Excel Options menu. Don’t click the “OK” button.
[Screenshot with callouts for steps 3) through 6)]
Loading “Data Analysis” in Office 2007
1) click
Example of Tools / Data Analysis / Descriptive Statistics
Precision of numerical results – state “3 Significant Digits”
Wt                   Too Many Digits   Too Few Digits   Correct 3+ significant digits
Mean                 76.506            76.5             76.5
Standard Error       4.780314          4.8              4.78
Median               73.58             73.6             73.6
Mode                 #N/A              #N/A             #N/A
Standard Deviation   15.11668          15.1             15.1
Sample Variance      228.514           228.5            228.5
Kurtosis             -0.67472          -0.7             -0.675
Skewness             0.362102          0.4              0.362
Range                47.98             48.0             48.0
Minimum              54.28             54.3             54.3
Maximum              102.26            102.3            102.3
Sum                  765.06            765.1            765.1
Count                10                10.0             10

Slide callouts: “meaningless extra digits” (Too Many Digits), “too few digits (.651 to .744)” (Too Few Digits), “OK” (Correct 3+ significant digits).
Precision of numerical results – state “3 Significant Digits” (continued)
Wt                   3 significant digits (CORRECT)   3 decimal places (INCORRECT)
Mean                 76.5                             76.506
Standard Error       4.78                             4.780
Median               73.6                             73.580
Mode                 #N/A                             #N/A
Standard Deviation   15.1                             15.117
Sample Variance      229                              228.514
Kurtosis             -0.675                           -0.675
Skewness             0.362                            0.362
Range                48.0                             47.980
Minimum              54.3                             54.280
Maximum              102                              102.260
Sum                  765                              765.060
Count                10                               10.000
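The difference between "3 significant digits" and "3 decimal places" can be made concrete with a small formatting helper. This is a sketch only: the function name is hypothetical, and it does not handle edge cases such as values that round up to the next power of ten.

```python
from math import floor, log10


def three_significant(value: float, digits: int = 3) -> str:
    """Format a number so that it shows the requested number of significant digits."""
    if value == 0:
        return "0"
    magnitude = floor(log10(abs(value)))       # 2 for 228.514, 0 for 4.78, -1 for 0.36
    decimals = digits - 1 - magnitude          # decimal places needed after the point
    if decimals > 0:
        return f"{value:.{decimals}f}"
    return str(int(round(value, decimals)))    # no decimal places left to show


for raw in (76.506, 4.780314, 228.514, -0.67472, 47.98, 102.26, 765.06):
    print(raw, "->", three_significant(raw))
```

The number of decimal places varies from value to value, which is exactly why a fixed "3 decimal places" format gives too many digits for large numbers such as the Sum.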
Let’s Review: Some Great Rules of Thumb
Data is a potential outlier
Symmetric distribution: $x_i$ < mean – 3s or $x_i$ > mean + 3s
Skewed distribution: $x_i$ < Q1 – (1.5)$R_Q$ or $x_i$ > Q3 + (1.5)$R_Q$
Data is Normally distributed (Bell shaped)
if -1 < skewness < +1 and
if -1 < kurtosis < +1
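The two outlier rules can be sketched as follows. The data set is hypothetical, the function names are made up, and the quartiles use `method="inclusive"` on the assumption that it matches Excel's Quartile function.

```python
import statistics as st


def outliers_symmetric(values):
    """Rule for symmetric data: more than 3 standard deviations from the mean."""
    m, s = st.mean(values), st.stdev(values)
    return [v for v in values if v < m - 3 * s or v > m + 3 * s]


def outliers_skewed(values):
    """Rule for skewed data: more than 1.5 interquartile ranges beyond Q1 or Q3."""
    q1, _, q3 = st.quantiles(values, n=4, method="inclusive")
    rq = q3 - q1
    return [v for v in values if v < q1 - 1.5 * rq or v > q3 + 1.5 * rq]


data = [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 40.0]   # hypothetical, one extreme value

print(outliers_symmetric(data))
print(outliers_skewed(data))
```

In this small, skewed sample the extreme value inflates the standard deviation so much that the 3s rule does not flag it, while the quartile-based rule does. That is the practical reason for keeping a separate rule for skewed data.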
Prerequisite Spreadsheet Skills
• Cut, Copy, Paste & Paste Special
• Cell corner Copy
• Add or delete Rows or Columns
• Change width/height of row/column
• Font, alignment, border & number format of cell
• Referencing and calculations with cells.
• Data / Sort
• Naming cell or range of cells
• Insert / Function / Average
• Sum, Max, Min, Count, Small and Large
• (Tools / Add-ins / Data analysis)
• Tools / Data Analysis / Descriptive statistics
• Single quote for equation statement.
• REPLACE command.
• DATA / Group & FORMAT: Column, Hide.
• Grab an entire column of data (CTRL+SHIFT, down arrow).
See also “L99A MBA7025.ppt” in folder “L00A MGS8110”