Stor 155, Section 2, Last Time
• Distributions (how are data “spread out”?)
• Visual Display: Histograms – Binwidth is critical
• Time Plots = Time Series
• Course Organization & Website:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155-07Home.html
Reading In Textbook
Approximate Reading for Today’s Material:
Pages 40-55
Approximate Reading for Next Class:
Pages 64-83
And now for something completely different
Is this class too “monotone”?
• Easier to understand?
• Calm environment enhances learning?
• Or does it induce somnolence?
What is “somnolence”?
Google definition:
Sleepiness, a condition of
semiconsciousness approaching coma.
And now for something completely different
An experiment:
• Pull out any coins you have with you
• How many of you have:
– >= 1 penny?
– >= 1 nickel?
– >= 1 dime?
– >= 1 quarter?
• Choose most frequent denomination
And now for something completely different
Collect data (into Spreadsheet):
• Years stamped on coins
(chosen denomination)
• Many as person has
• Enter into spreadsheet
• Look at “distribution” using histogram
And now for something completely different
• Predicted Answer
– From Text Book, Problem 1.32
• Distribution is Left Skewed
• Works out as predicted?
• Why?
• Note: most skewed dist’ns seem to be:
Right Skewed
Exploratory Data Analysis 4
Numerical Summaries of Quant. Variables:
Idea: Summarize distributional information
(“center”, “spread”, “skewed”)
In Text, Sec. 1.2
for data $x_1, x_2, \ldots, x_n$
(subscripts allow “indexing numbers” in list)
Numerical Summaries
A. “Centers” (note there are several)
1. “Mean” = Average = $\bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• Greek letter “Sigma” ($\Sigma$), for “sum”
In EXCEL, use “AVERAGE” function
Numerical Summaries of Center
2. “Median” = Value in middle (of sorted list)
Unsorted E.g: Sorted E.g:
3 0
1 1
27 “in middle”? (no) 2 better “middle”!
2 3
0 27
EXCEL: use function “MEDIAN”
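A minimal sketch of the same two computations outside Excel, assuming Python and its standard `statistics` module (neither is part of the course materials). It uses the toy list from the slide:

```python
import statistics

data = [3, 1, 27, 2, 0]   # unsorted toy list from the slide

mean = sum(data) / len(data)        # (1/n) * sum of the x_i
median = statistics.median(data)    # middle value of the sorted list

print(sorted(data))   # [0, 1, 2, 3, 27]
print(mean)           # 6.6
print(median)         # 2
```

The mean (6.6) is larger than four of the five values, while the median (2) sits in the middle of the sorted list.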
Difference Betw’n Mean & Median
Symmetric Distribution: Essentially no difference
Right Skewed Distribution:
median $M$ splits the area: 50% area | 50% area
mean $\bar{x}$ bigger since “feels tails more strongly”
Difference Betw’n Mean & Median
Outliers (unusual values):
Simple Web Example:
http://www.stat.sc.edu/~west/applets/box.html
• Mean feels outliers much more strongly
– Leaves “range of most of data”
– Good notion of “center”? (perhaps not)
• Median affected very minimally
– Robustness Terminology:
Median is “resistant to the effect of outliers”
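The applet's point can also be checked numerically. This is only an illustrative sketch in Python with made-up numbers (not data from the applet or the text): one value is replaced by an outlier and both centers are recomputed.

```python
import statistics

bulk = [10, 11, 12, 13, 14]          # made-up data, no outlier
with_outlier = bulk[:-1] + [140]     # largest value replaced by an outlier

mean_bulk = statistics.mean(bulk)
median_bulk = statistics.median(bulk)
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print(mean_bulk, median_bulk)   # 12 12
print(mean_out, median_out)     # 37.2 12
```

The mean moves from 12 to 37.2, outside the range of the other four values, while the median does not move at all.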
Difference Betw’n Mean & Median
A richer web example:
Publisher’s Web Site: Statistical Applets: Mean & Median
• For Symmetric distributions:
– Both are same
• Add an outlier:
– Mean feels it much more strongly
– Implication for “bad data”: can be very bad
• Two Clusters:
– Median jumps more quickly
– Mean more stable (better?)
Computation using Excel
Some Toy Examples:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg3Done.xls
• Compute Using Excel Functions
• Mean feels location of data on number line
• Median feels location of data in sorted list
• Median breaks tie by averaging center points
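The tie-breaking rule in the last bullet can be seen in a small sketch, again assuming Python's `statistics` module rather than the Excel file linked above. The lists are toy values chosen for illustration:

```python
import statistics

odd = [0, 1, 2, 3, 27]   # five values: a single middle point
even = [0, 1, 2, 3]      # four values: two center points, 1 and 2

median_odd = statistics.median(odd)     # middle of the sorted list
median_even = statistics.median(even)   # average of the two center points

print(median_odd)    # 2
print(median_even)   # 1.5
```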
Numerical Centerpoint HW
HW: 1.46 a, 1.47, 1.49
• Use EXCEL
And now for something completely different
Check out this small quick movie clip:
And now for something completely different
Suggestions for other things to show here are very welcome….
• Movie Clips…
• Music…
• Jokes…
• Cartoons…
• …
Numerical Summaries (cont.)
B. “Spreads” (again there are several)
1. Range = biggest $x_i$ - smallest $x_i$
Problems:
• Feels only “outliers”
• Not “bulk of data”
• Very non-resistant to outliers
Numerical Summaries of Spread
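2. Variance = $s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
= “average squared distance to $\bar{x}$”
EXCEL: VAR
Drawback: units are wrong
e.g. for $x_i$ in feet, $s^2$ is in square feet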
Numerical Summaries of Spread
3. Standard Deviation $s = \sqrt{s^2}$
EXCEL: STDEV
• Scale is right
• But not resistant to outliers
• Will use quite a lot later
(for reasons described later)
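As a rough check of the variance and standard deviation definitions (dividing by n - 1), here is a sketch in Python. The heights are made-up numbers, and the comparison against `statistics.variance` and `statistics.stdev` is only an illustration, not part of the course's Excel workflow.

```python
import math
import statistics

heights_ft = [5.0, 5.5, 6.0, 6.5]   # made-up heights, in feet

n = len(heights_ft)
xbar = sum(heights_ft) / n
s2 = sum((x - xbar) ** 2 for x in heights_ft) / (n - 1)   # units: square feet
s = math.sqrt(s2)                                          # units: feet again

# The by-hand values agree with the library's sample variance and SD.
print(math.isclose(s2, statistics.variance(heights_ft)))
print(math.isclose(s, statistics.stdev(heights_ft)))
```

Taking the square root is what puts the spread back on the scale of the data.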
Interactive View of S. D.
Interesting web example (manipulate histogram):http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
• Note SD range centered at mean
• Can put SD “right near middle” (densely packed data)
• Can put SD at “edges of data” (U shaped data)
• Can put SD “outside of data” (big spike + outlier)
• But generally “sensible measure of spread”
Variance – S. D. HW
C3: For the data set in 1.46 (i.e. 1.37), find the:
i. Variance (1620)
ii. Standard Deviation (40.2)
• Use EXCEL
Numerical Summaries of Spread
4. Interquartile Range = IQR
Based on “quartiles”, Q1 and Q3
(idea: shows where are 25% & 75% “through the data”)
25% 25% 25% 25%
Q1 Q2 = median Q3
IQR = Q3 – Q1
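A small sketch of quartiles and the IQR, assuming Python 3.8+ and `statistics.quantiles`. Quartile conventions differ between textbooks and software; the `"inclusive"` method used here is one common interpolation rule and is an assumption, not necessarily the text's rule.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # toy data, already sorted

# Cut points that split the sorted data into four equal parts.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)   # 3.0 5.0 7.0
print(iqr)          # 4.0
```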
Quartiles Example
Revisit Hidalgo Stamp Thickness example:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Done.xls
Right skewness gives:
– Median < Mean
(mean “feels farther points more strongly”)
– Q1 near median
– Q3 quite far
(makes sense from histogram)
Quartiles Example
A look under the hood:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Raw.xls
• Can compute as separate functions for each
• Or use:
Tools → Data Analysis → Descriptive Stats
• Which gives many other measures as well
• Use “k-th largest & smallest” to get quartiles
5 Number Summary
1. Minimum
2. Q1 - 1st Quartile
3. Median
4. Q3 - 3rd Quartile
5. Maximum
Summarize Information About:
a) Center - from 3
b) Spread - from 2 & 4 (maybe 1 & 5)
c) Skewness - from 2, 3 & 4
d) Outliers - from 1 & 5
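The five numbers can be collected in one helper. This is a sketch only: `five_number_summary` is a hypothetical name, and the `"inclusive"` quartile rule from Python's `statistics.quantiles` is an assumed convention that may differ slightly from the textbook's.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

summary = five_number_summary([3, 1, 27, 2, 0])   # toy list from the median slide
print(summary)   # (0, 1.0, 2.0, 3.0, 27)
```

Here Q1, the median and Q3 are evenly spaced, while the maximum (27) sits far above Q3, which hints at an outlier at the high end.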
5 Number Summary
How to Compute?
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Done.xls
• EXCEL function QUARTILE
• “One stop shopping”
• IQR seems to need explicit calculation
Rule for Defining “Outliers”
Caution: There are many of these
Textbook version:
Above Q3 + 1.5 * IQR
Below Q1 – 1.5 * IQR
For stamps data:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg6Done.xls
– No outliers at “low end”
– Some at “high end”
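The textbook rule can be sketched as a short function. The name `iqr_outliers`, the toy data, and the `"inclusive"` quartile convention are all assumptions for illustration; this is not the stamps data from the spreadsheet.

```python
import statistics

def iqr_outliers(values):
    """Flag values above Q3 + 1.5*IQR or below Q1 - 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

flagged = iqr_outliers([0, 1, 2, 3, 27])   # Q1 = 1, Q3 = 3, fences at -2 and 6
print(flagged)   # [27]
```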
5 Number Sum. & Outliers HW
1.43
Box Plot
• Additional Visual Display Device
• Again legacy from pencil & paper days
• Not supported in EXCEL
• So we won’t do
• Main use: comparing populations
– Example: Figure from text
• Want to do this?
Find better software package than Excel
And now for something completely different
Recall distribution of majors of students in this course:
[Bar chart: “Stat 155, Section 2, Majors”. Vertical axis: Frequency (0 to 0.4). Categories: Business / Man., Biology, Public Policy / Health, Pharm / Nursing, Journalism / Comm., Env. Sci., Other, Undecided]
And now for something completely different
How about a business manager joke?
How many managers does it take to replace a light bulb?
Two. One to find out if it needs changing, and one to tell an employee to change it.
Source: http://www.joblatino.com/jokes/managers.html
Linear Transformations
Idea: What happens to data & summaries,
when data are:
“shifted and scaled”
i.e. “panned and zoomed”
Math:
Scaled by $a$
Shifted by $b$
$x_1, \ldots, x_n \mapsto a x_1 + b, \ldots, a x_n + b$
Linear Transformations
Effect on linear summaries:
• Centerpoints, $\bar{x}$ and $M$,
“follow data”: $a\bar{x} + b$, $aM + b$.
• Spreads, $s$ and $IQR$,
“feel scale, not shift”: $a s$, $a \cdot IQR$
(for $a > 0$; in general the factor is $|a|$).
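These rules are easy to check numerically. The sketch below assumes Python, made-up data, a positive scale `a`, and a hypothetical helper `iqr` built on the `"inclusive"` quartile convention.

```python
import math
import statistics

def iqr(values):
    """Q3 - Q1 using the 'inclusive' quartile convention (an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

x = [2.0, 4.0, 4.0, 5.0, 10.0]   # made-up data
a, b = 3.0, 7.0                  # scale by a, shift by b (a > 0 here)
y = [a * xi + b for xi in x]

# Centers follow the data: both pick up the scale and the shift.
centers_follow = (
    math.isclose(statistics.mean(y), a * statistics.mean(x) + b)
    and math.isclose(statistics.median(y), a * statistics.median(x) + b)
)
# Spreads feel the scale only: the shift b drops out.
spreads_scale = (
    math.isclose(statistics.stdev(y), a * statistics.stdev(x))
    and math.isclose(iqr(y), a * iqr(x))
)
print(centers_follow, spreads_scale)   # True True
```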
Most Useful Linear Transfo.
“Standardization”
Goal: put data sets on “common scale”
Approach:
1. Subtract Mean $\bar{x}$,
to “center at 0”
2. Divide by S.D. $s$,
to “give common SD = 1”
Standardization
Result is called “z-score”: $z_i = \frac{x_i - \bar{x}}{s}$
Note that $s z_i = x_i - \bar{x}$, i.e. $x_i = \bar{x} + s z_i$
Thus $z_i$ is interpreted as:
“number of SDs from the mean”
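A minimal sketch of the z-score formula and its inverse, assuming Python and made-up data chosen so that the mean is 5 and the SD is 3:

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 5.0, 10.0]   # made-up data: mean 5, SD 3

xbar = statistics.mean(data)
s = statistics.stdev(data)

z = [(x - xbar) / s for x in data]        # z_i = (x_i - xbar) / s
recovered = [xbar + s * zi for zi in z]   # x_i = xbar + s * z_i

# Each z_i counts SDs from the mean: 2.0 is one SD below, 5.0 is at the mean.
for x, zi in zip(data, z):
    print(x, round(zi, 2))
```

Undoing the transformation with `xbar + s * z` returns the original values, up to round-off.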
Standardization Example
Next time: work in Excel command:
STANDARDIZE
Standardization Example
Buffalo Snowfall Data:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg7Done.xls
• Standardized data have same (EXCEL default) histogram shape as raw data.
(Since axes and bin edges just
follow the transformation)
• i.e. “shape” doesn’t depend on “scaling”
Standardization Example
A look under the hood:
http://stat-or.unc.edu/webspace/postscript/marron/Teaching/stor155-2007/Stor155Eg7Raw.xls
1. Compute AVERAGE and SD
2. Standardize by:
a. Create Formula in cell B2
b. Drag downwards
c. Keep Mean and SD cells fixed using “$” signs (absolute references)
3. Check stand’d data have mean 0 & SD 1
note that “8.247E-16 = 0” (round-off error only)
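The same three steps can be sketched in Python. The values below are made up (they are not the Buffalo snowfall data), and the tolerance `1e-12` is an arbitrary choice for treating round-off as zero.

```python
import math
import statistics

raw = [25.0, 39.8, 71.5, 110.5, 126.4]   # made-up values, not the Buffalo data

# Step 1: compute the average and SD once.
avg = statistics.mean(raw)
sd = statistics.stdev(raw)

# Step 2: apply the same formula to every value (the "drag downwards" step).
standardized = [(x - avg) / sd for x in raw]

# Step 3: check mean 0 and SD 1, allowing for floating-point round-off.
mean_z = statistics.mean(standardized)
sd_z = statistics.stdev(standardized)
print(math.isclose(mean_z, 0.0, abs_tol=1e-12))   # True
print(math.isclose(sd_z, 1.0))                    # True
```

A tiny nonzero mean such as 8.247E-16 comes from finite-precision arithmetic, which is why the check uses a tolerance instead of exact equality.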
Standardization HW
C4: For data in 1.17, use EXCEL to:
a. Give the list of standardized scores
b. Give the Z-score for:
(i) the mean (0)
(ii) the median (-0.223)
(iii) the smallest (-1.21)
(iv) the largest (2.77)
1.59a, 1.73