Exploratory Data Analysis
Hal Varian, 20 March 2006
What is EDA?
  Goals
    Examine and summarize data
    Look for patterns and suggest hypotheses
    Provide guidance for more systematic analysis
  Methods of analysis
    Primarily graphics and tables
  Online reference
    http://www.itl.nist.gov/div898/handbook/eda/eda.htm
    http://www.math.yorku.ca/SCS/Courses/eda/
Tools for EDA
  We will use R (= open source S)
    Very widely used by statisticians
    Libraries for all sorts of things are available
    Download from
      cran.stat.ucla.edu
      http://www.r-project.org/
    Recommend ESS (= Emacs Speaks Statistics) for interactive use
    Windows interface is not bad
Interactive R session
> library("foreign")
> dat <- read.spss("GSS93 subset.sav")
> attach(dat)
> summary(AGE)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   18.0    33.0    43.0    46.4    59.0    99.0
> hist(AGE)
[Figure: Histogram of AGE; x-axis AGE, y-axis Frequency]
Recode missing data
  AGE[AGE>90] <- NA
  plot(density(AGE,na.rm=T))
  #plot both together
  hist(AGE,freq=F)
  lines(density(AGE,na.rm=T))
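The recode-then-plot steps can be tried without the GSS file. The sketch below is only an illustration: the AGE vector is simulated, and treating 99 as the missing-value code is an assumption suggested by the summary output, not something the slides state.

```r
# Hypothetical stand-in for the GSS AGE variable (simulated, not the real data).
set.seed(42)
AGE <- c(sample(18:89, 500, replace = TRUE), rep(99, 5))  # assume 99 codes "missing"

summary(AGE)              # the maximum still shows the missing-value code

AGE[AGE > 90] <- NA       # recode the missing-value code to NA

pdf(NULL)                 # null graphics device, so no window or file is needed
hist(AGE, freq = FALSE)   # histogram on the density scale
d <- density(AGE, na.rm = TRUE)
lines(d)                  # overlay the kernel density estimate
invisible(dev.off())
```

Plotting the histogram with freq = FALSE is what puts both curves on the same vertical scale.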
Density and density + hist
[Figure, left panel: density(x = AGE, na.rm = T); N = 1495, Bandwidth = 3.633; y-axis Density]
[Figure, right panel: Histogram of AGE; x-axis AGE, y-axis Density]
Boxplot
[Figure: boxplot with labeled parts, top to bottom: Outlier; 1.5 interquartile range; 3rd quartile; Median; 1st quartile; Smallest value]
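The parts of a boxplot can also be read off numerically. A minimal sketch using base R's boxplot.stats on a small made-up vector (the numbers are illustrative only):

```r
# Made-up data with one large value.
x <- c(2, 4, 5, 6, 7, 8, 9, 10, 12, 40)

b <- boxplot.stats(x)   # default coef = 1.5, i.e. the 1.5 * IQR rule

b$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out    # points beyond 1.5 * IQR from the hinges, drawn as outliers
```

Here the hinges are 5 and 10, so any point above 10 + 1.5 * 5 = 17.5 is flagged; the upper whisker stops at 12, the largest value inside that fence.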
Boxplot enhancements
  Notches: confidence interval for median
  Varwidth=T: width of box is proportional to sqrt(n)
  Useful for comparisons
Comparing distributions
  boxplot(AGE~RACE)
  boxplot(AGE~RACE,notch=T,varwidth=T)
  There doesn't seem to be a big difference in the age distributions
[Figure: boxplots of AGE by RACE (white, black, other)]
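A self-contained version of the grouped comparison, as a sketch on simulated data. The group sizes and age range below are assumptions chosen for illustration, not the GSS values.

```r
# Simulated stand-ins for AGE and RACE (hypothetical data).
set.seed(1)
grp <- factor(sample(c("white", "black", "other"), 300, replace = TRUE,
                     prob = c(0.80, 0.12, 0.08)),
              levels = c("white", "black", "other"))
age <- round(runif(300, 18, 89))

pdf(NULL)  # null graphics device
b <- boxplot(age ~ grp, notch = TRUE, varwidth = TRUE)
invisible(dev.off())

b$n     # observations per group; box widths are proportional to sqrt(n)
b$conf  # lower and upper notch limits around each group's median
```

Overlapping notches between two groups are usually read as weak evidence of a difference in medians. With small groups R may warn that notches extend past the hinges.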
EDUC v RACE
  boxplot(EDUC[EDUC<90]~RACE[EDUC<90],notch=T,varwidth=T)
[Figure: notched boxplots of EDUC by RACE (other, black, white)]
Violin plot
  Combines density plot and boxplot
  Good for weirdly shaped distributions…
Back to Back Histogram
  library("Hmisc")
  histbackback(EDUC[RACE=="black"],EDUC[RACE=="white"],probability=T)
[Figure: back-to-back histogram; left side EDUC[RACE == "black"], right side EDUC[RACE == "white"]]
Two-way table
  GT12 <- EDUC>12
  temp <- table(GT12,RACE)

  GT12    white black other
    FALSE   614   100    37
    TRUE    640    67    38

  prop.table(temp,2)

  GT12        white     black     other
    FALSE 0.4896332 0.5988024 0.4933333
    TRUE  0.5103668 0.4011976 0.5066667
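The column proportions can be reproduced from the counts alone. This sketch rebuilds the table from the numbers printed on the slide instead of reading the GSS file; the name counts is introduced here for illustration.

```r
# Counts copied from the slide; rows are GT12 (EDUC > 12), columns are RACE.
counts <- matrix(c(614, 640, 100, 67, 37, 38), nrow = 2,
                 dimnames = list(GT12 = c("FALSE", "TRUE"),
                                 RACE = c("white", "black", "other")))
temp <- as.table(counts)

p <- prop.table(temp, 2)  # margin 2: each column is divided by its own total
p
colSums(p)                # every column sums to 1
```

Using margin 1 instead would give the racial composition within each education group, which answers a different question.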
Comparing distributions
  qqplot = quantile-quantile plot
    Fraction of data less than k in x
    Fraction of data less than k in y
  Shapes
    Straight line: same distribution
    Vertical intercepts differ: different mean
    Slopes differ: different variance
  Reference distribution can be theoretical distribution
    qnorm: compare to standardized normal
    Skew to right: both tails below straight line
    Heavy tails: lower tail above, upper tail below line
    (These directions hold when sample quantiles are on the horizontal axis; R's qqnorm puts sample quantiles on the vertical axis, which swaps above and below)
qqplot(x,y) examples
[Figure: three qqplot(x,y) panels with axes x and y; captions: "Mean1=0, Mean2=2"; "σ1=1, σ2=2"; "identical"]
[Figure: Normal Q-Q Plot; x-axis Theoretical Quantiles, y-axis Sample Quantiles; caption: Sample v N(0,1), with refline]
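The intercept and slope readings of a Q-Q plot can be checked by simulation. In this sketch the samples are simulated normals, and qq_line is a hypothetical helper defined here that fits a straight line through the Q-Q points.

```r
set.seed(1)
n <- 2000
x        <- rnorm(n)                      # reference sample, N(0, 1)
y_shift  <- rnorm(n, mean = 2, sd = 1)    # same spread, different mean
y_spread <- rnorm(n, mean = 0, sd = 2)    # same mean, different spread

# Hypothetical helper: least-squares line through the Q-Q points.
qq_line <- function(x, y) {
  q <- qqplot(x, y, plot.it = FALSE)      # sorted quantile pairs, no drawing
  unname(coef(lm(q$y ~ q$x)))             # c(intercept, slope)
}

shift  <- qq_line(x, y_shift)    # intercept near 2, slope near 1
spread <- qq_line(x, y_spread)   # intercept near 0, slope near 2
```

A mean shift moves the line up without tilting it; a change in spread tilts it, with the slope close to the ratio of the standard deviations.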
More qqnorm examples
[Figure: two panels: Skewed to right; Heavy tails]
  www.maths.murdoch.edu.au/units/statsnotes/samplestats/qqplot.html
Pairs of variables
  Is one variable related to another?
  Scatterplot
    Basic: plot(x,y)
    Enhanced from library("car"): scatterplot(x,y)
  Scatterplot matrix
    Basic: pairs(data.frame(x,y,z))
    Enhanced: scatterplot.matrix(data.frame(x,y,z)) (named scatterplotMatrix in recent versions of car)
Basic and enhanced scatterplot
Scatterplot matrix
Labeling points in scatterplots
  identify(x,y,labels="foo")
  Color is also useful
[Figure: scatterplot of y against x with points 90, 98, 110, and 175 labeled]
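identify() needs an interactive graphics device and mouse clicks. For scripted use, a similar effect can be had with text(). The sketch below is an alternative technique on simulated data; picking the four points with the largest residuals is an arbitrary choice made here for illustration.

```r
set.seed(2)
x <- rnorm(200)
y <- 2 * x + rnorm(200)

fit <- lm(y ~ x)
# Row indices of the four points farthest from the fitted line.
far <- order(abs(resid(fit)), decreasing = TRUE)[1:4]

pdf(NULL)  # null graphics device
plot(x, y)
points(x[far], y[far], col = "red", pch = 19)        # color highlights them
text(x[far], y[far], labels = far, pos = 3, col = "red")  # label by row index
invisible(dev.off())
```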
Cigarettes and taxes
  Discussant on paper by Austan Goolsbee, “Playing with Fire”
  Question: did Internet purchases of cigarettes affect state tobacco tax revenues?
Cigarette Prices in 1990s
[Figure: Price of cigarettes, 1990 to 2000]
Internet usage
[Figure: Internet usage, 1990 to 2000]
Price elasticity of use/sales
  Across all states and years
    Taxable sales elasticity: -0.802
    Use elasticity: -0.440
  Sales are much more responsive to price than usage, suggesting that there is some cross-border trade (aka “buttlegging”)
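The slides do not say how these elasticities were estimated. One common approach, shown here purely as an assumption, is a log-log regression in which the coefficient on log price is the elasticity. The data are simulated with a built-in elasticity of -0.8, so the result is not the slide's estimate.

```r
# Simulated prices and quantities (hypothetical, not the cigarette data).
set.seed(3)
n <- 500
log_p <- rnorm(n, mean = log(250), sd = 0.2)
log_q <- 8 - 0.8 * log_p + rnorm(n, sd = 0.1)   # built-in elasticity of -0.8

fit <- lm(log_q ~ log_p)
elasticity <- unname(coef(fit)["log_p"])  # percent change in q per percent change in p
elasticity
```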
Use vs Sales in 2000
[Figure: scatterplot; x-axis q.p[year == 2000], y-axis ciguse.p[year == 2000]; labeled points: DE, KY, NH, CA, UT]
Reduced form
  dp = log(p2001) - log(p1995)
  dq = log(q2001) - log(q1995)
  Regress dq/dp on internet penetration in 2000
  See next slide for result
Elasticity v Internet penetration
[Figure: scatterplot; x-axis i, y-axis dq/dp; labeled points: CA, DC, DE, MI, NH, NY, OK, WA]
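The reduced-form steps can be sketched as follows. Everything here is simulated: the number of states, the price and quantity levels, and the assumed relationship between the ratio and internet penetration are all hypothetical, so the fitted slope says nothing about the actual result.

```r
set.seed(4)
n_states <- 51
internet <- runif(n_states, 0.25, 0.45)          # hypothetical penetration in 2000

p1995 <- runif(n_states, 150, 250)
p2001 <- p1995 * runif(n_states, 1.3, 1.8)
# Assumed data-generating rule, for illustration only.
true_ratio <- -0.1 - 1.5 * internet + rnorm(n_states, sd = 0.05)
q1995 <- runif(n_states, 60, 140)
q2001 <- q1995 * exp(true_ratio * (log(p2001) - log(p1995)))

# The steps on the slide.
dp <- log(p2001) - log(p1995)
dq <- log(q2001) - log(q1995)
ratio <- dq / dp                                 # state-level elasticity proxy

fit <- lm(ratio ~ internet)
coef(fit)
```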
What is Internet providing?
  It was always a good deal for some to buy cigarettes out-of-state (in high tax states)
  Mail order has been around for a long time and is certainly cost-effective
  Internet makes it easier to find merchants: just type into search engine
  Internet is great at matching buyers and sellers
Price of a match
  Google doesn’t accept cigarette advertisements, but Overture does
  Price for top listing: $1.20 per click
  Avg price for click on Overture is 40 cents
  Conversion rates might be 5%, so at $1.20 per click the advertiser is paying $24 per introduction
  But think of lifetime value…
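The $24 figure is the per-click price divided by the conversion rate. A short check of the arithmetic, using the numbers from the slide (the variable names are introduced here):

```r
cpc_top    <- 1.20   # price per click for the top listing
cpc_avg    <- 0.40   # average price per click on Overture
conversion <- 0.05   # assumed share of clicks that become customers

cost_top <- cpc_top / conversion   # cost per introduction at the top-listing price
cost_avg <- cpc_avg / conversion   # same calculation at the average click price
c(top = cost_top, avg = cost_avg)
```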
Value of a match
Straightening out and scaling data
  Find transform so that data looks linear, or normal, or fits on same scale
    Log10 (easier to interpret than log)
    Square root
    Reciprocal
    Box-Cox transform (x^r - 1)/r, which combines many of above; the limit as r -> 0 is log
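A minimal sketch of the Box-Cox family, to show how it nests the other transforms. The function name box_cox and the tolerance used to switch to the log are choices made here, not part of the slides.

```r
# Box-Cox transform for positive data; r = 0 is defined as its limit, log(x).
box_cox <- function(x, r) {
  stopifnot(all(x > 0))
  if (abs(r) < 1e-8) log(x) else (x^r - 1) / r
}

x <- c(0.5, 1, 2, 10)
box_cox(x, 1)      # x - 1: a shift, shape unchanged
box_cox(x, 0.5)    # 2 * (sqrt(x) - 1): square root up to shift and scale
box_cox(x, 0)      # log(x)
box_cox(x, -1)     # 1 - 1/x: reciprocal up to shift and sign
```

Because each member differs from the plain transform only by a shift and a scale, the choice of r affects the shape of the data in the same way as the corresponding simple transform.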
City sizes: regular & log10
[Figure: Histogram of log10(pop1980); x-axis log10(pop1980), y-axis Density]