Fun With Numbers
Z-scores
R code (for HW 4)
Statistical Inference
The normal distribution is our first choice in most cases because it has nice properties:
Distribution is symmetrical around the mean
Percentage of cases associated with standard deviations
Can identify probability of values under the curve
A linear combination of independent, normally distributed variables is itself distributed normally
Central limit theorem
Great flexibility in using the normal distribution
Normal Distribution
Normal Distribution and areas under it.
68-95-99.7 Percent Rule
In a normal distribution, about 68 percent of the observations will fall within about +/- 1 standard deviation...
A Picture:
Area (with some added stuff)
http://members.aol.com/svennord/ed/normal.htm
Another Picture
What do we know?
Area is useful to determine probabilities.
Fun with Numbers
Gas Prices (Let’s take a sidetrip)
What are some research issues when looking at financial data over time?
Inflation!
2007 dollars vs. 1990 dollars
CPI: Price in 2007 dollars = 1990 Price*(2007 CPI/1990 CPI)
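A minimal sketch of the CPI adjustment in R. The helper name `adjust_price` and the two index values are assumptions made for illustration; they are not taken from the slides or from the course data.

```r
# Hypothetical helper: restate a price from one year in another year's dollars
# by multiplying by the ratio of the two price-index values.
adjust_price <- function(price, cpi_from, cpi_to) {
  price * (cpi_to / cpi_from)
}

# Illustrative index values (assumed for this sketch, not from the slides).
cpi_1990 <- 130.7
cpi_2007 <- 207.3

# A $1.00 gallon in 1990 dollars, expressed in 2007 dollars.
adjusted <- adjust_price(1.00, cpi_from = cpi_1990, cpi_to = cpi_2007)
print(round(adjusted, 2))
```

Applying the same function to every monthly price, with the CPI of that month as `cpi_from`, would put the whole series in constant dollars.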
Visualizing Data is FUNdamental
[Figure: Non-CPI Adjusted Gas Prices. y-axis: Gas Price (Regular, unadjusted), 1 to 3; x-axis: Year (with monthly measurements), 1990 to 2010]
[Figure: CPI Adjusted Gas Prices. y-axis: Gas Price (Regular, CPI-adjusted), 1 to 3; x-axis: Year (with monthly measurements), 1990 to 2010]
[Figure: CPI Adjusted Gas Prices. y-axis: Gas Price (Regular, CPI-adjusted), 1 to 2.5; x-axis: Year (with monthly measures), 1990 to 2005]
Panels: Unadjusted; CPI-Adjusted; CPI-Adjusted (w/o 05/06)
Histograms
[Figure: Histogram of Gas Prices. y-axis: Density, 0 to 4; x-axis: Price of Gas (unadjusted), 1 to 3]
[Figure: Histogram of Gas Prices (CPI-Adjusted). y-axis: Density, 0 to 2.5; x-axis: Price of Gas (CPI-adjusted), 1 to 3]
Using z-scores
Taking advantage of the normal distribution
Area under the normal is probability area.
Probabilities must sum to 1.
Full density under normal is 1.
Since it’s symmetric, we know the probability of “being above” the mean is .50 (ditto on below)
Standard Normal Distribution
Z ~ N(0,1)
Easy to compute: $z = \frac{X - \bar{X}}{\sigma}$
When X=mean, z=0.
Metric of z-score: standard deviations from the mean.
Thus, if z=1, X is 1 s.d. above the mean.
NOW since we know the 68-95-99.7 Rule, we can identify probs.
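The 68-95-99.7 rule can be checked directly against the standard normal. This is a small sketch using `pnorm()`, R's standard normal cumulative distribution function; the variable names are chosen here for illustration.

```r
# Area within k standard deviations of the mean, for k = 1, 2, 3.
k <- 1:3
within_k_sd <- pnorm(k) - pnorm(-k)
names(within_k_sd) <- paste0("within ", k, " s.d.")

# Prints the three areas, which round to roughly 68, 95 and 99.7 percent.
print(round(within_k_sd, 4))
```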
Getting Gas
Let’s look at the adjusted gas prices.
Means (s.d. in parentheses):
2006: 2.57 (.30)
2005: 2.34 (.32)
2004: 1.98 (.15)
2003: 1.71 (.09)
2002: 1.51 (.13)
2001: 1.62 (.20)
2000: 1.74 (.11)
1999: 1.37 (.15)
1998: 1.27 (.04)
1997: 1.51 (.04)
1996: 1.54 (.08)
1995: 1.47 (.06)
1994: 1.46 (.07)
1993: 1.49 (.03)
1992: 1.56 (.07)
1991: 1.62 (.05)
1990: 2.00 (.07) [small n]
(Anything interesting here?)
Compute a z-score
Mean adjusted price: 1.68 (s.d. .37)
To derive z-score for any year, substitute a value X into $z = \frac{X - \bar{X}}{\sigma}$
Suppose X=1.68?
z=(1.68-1.68)/.37=0
The mean is normalized to 0.
1 s.d. above mean? 1.68+.37=2.05
z=(2.05-1.68)/.37=1
The metric of z is in standard deviations.
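The arithmetic above can be reproduced in a few lines of R. This is a sketch using the rounded mean (1.68) and s.d. (.37) quoted on the slide; the function name `z_score` is made up for this example, and 2.50 is the price used on the probability slide.

```r
# Hypothetical helper: distance from the center, in units of the spread.
z_score <- function(x, center, spread) {
  (x - center) / spread
}

mean_price <- 1.68  # mean CPI-adjusted price from the slide
sd_price   <- 0.37  # standard deviation from the slide

# The mean itself, one s.d. above the mean, and a price of 2.50.
prices <- c(1.68, 2.05, 2.50)
z <- z_score(prices, mean_price, sd_price)
print(round(z, 2))
```

Because the slide's mean and s.d. are rounded, z-scores computed this way will differ slightly from those in the tables, which appear to use unrounded values.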
“Standardizing” X allows us to use the “z distribution.”
The Most “Average”
  Price               z   Week     Year
|--------------------------------------|
| 1.680374    -.009361   Feb 12   2001 |
| 1.681257   -.0069663   Nov 03   2003 |
| 1.681329   -.0067707   Apr 24   2000 |
| 1.682352   -.0039966   Aug 04   2003 |
| 1.683292    -.001449   Jun 03   1991 |
|--------------------------------------|
| 1.684771    .0025612   Feb 04   1991 |
| 1.68625     .0065716   May 27   1991 |
| 1.688924    .0138213   Oct 27   2003 |
| 1.689519    .0154355   Apr 17   2000 |
| 1.69062     .0184197   Sep 24   2001 |
|--------------------------------------|
The 10 Most “Below Average”
  Price               z   Week     Year
|--------------------------------------|
| 1.096723    -1.59183   Feb 22   1999 |
| 1.103978   -1.572159   Mar 01   1999 |
| 1.111233   -1.552488   Feb 15   1999 |
| 1.113652   -1.545931   Mar 08   1999 |
| 1.120907    -1.52626   Feb 08   1999 |
|--------------------------------------|
| 1.123325   -1.519703   Feb 01   1999 |
| 1.13058    -1.500032   Jan 04   1999 |
| 1.131789   -1.496754   Jan 25   1999 |
| 1.137835   -1.480361   Jan 11   1999 |
| 1.141463   -1.470526   Jan 18   1999 |
|--------------------------------------|
The 10 Most “Above Average”
  Price               z   Week     Year
|--------------------------------------|
| 2.947       3.424879   May 15   2006 |
| 2.973       3.495373   Jul 10   2006 |
| 2.989       3.538755   Jul 17   2006 |
| 3            3.56858   Aug 14   2006 |
|--------------------------------------|
| 3.003       3.576713   Jul 24   2006 |
| 3.004       3.579425   Jul 31   2006 |
| 3.021628     3.62722   Oct 03   2005 |
| 3.038        3.67161   Aug 07   2006 |
| 3.049491    3.702766   Sep 12   2005 |
| 3.167136    4.021741   Sep 05   2005 |
|--------------------------------------|
[Figure: Z-Score for CPI-Adjusted Gas Price, graphs by year. One density histogram per year, 1990 through 2007; x-axis: z-score, -2 to 4; y-axis: Density, 0 to 4]
Finding Probabilities
What is the probability of a gas price of 2.50 or higher?
The z-score is (2.50-1.68)/.37=2.22.
In the z-distribution, if gas prices were truly normally distributed, a score this high or higher has a probability of occurring of .013, or about 1.3%. It’s an unlikely event.
How computed? 1-.9868 gives the area above (consult a standard normal table).
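The table lookup can be replaced by `pnorm()`, which returns the area below a z-score under the standard normal. A minimal sketch, with variable names chosen for illustration:

```r
z <- 2.22

# Area above z: one minus the area below it.
p_upper <- 1 - pnorm(z)

# Equivalent call that asks for the upper tail directly.
p_upper_alt <- pnorm(z, lower.tail = FALSE)

print(round(p_upper, 4))
```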
Finding Probabilities
What is the probability of a gas-price z-score being between -1.75 and 1.75?
P(above)=.04; P(below)=.04
Therefore, P(in between)=1-.08=.92
The upper tail is .04; the lower tail is .04
Most probability calculations with z are this straightforward.
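The same "in between" calculation, sketched in R with `pnorm()`. The variable names are illustrative only.

```r
z_cut <- 1.75

# Tail areas beyond -1.75 and +1.75; equal because the normal is symmetric.
p_lower_tail <- pnorm(-z_cut)
p_upper_tail <- pnorm(z_cut, lower.tail = FALSE)

# Area between the two cutoffs.
p_between <- pnorm(z_cut) - pnorm(-z_cut)

print(round(c(lower = p_lower_tail, upper = p_upper_tail, between = p_between), 2))
```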
Issues
The “gas price” example is pedagogical.
Serious analysis of gas-pricing effects would require much more sophisticated statistical techniques.
z is useful to compare observations from historical eras or across disparate cases.
Hands-on examples in R
Plots and Z-scores
How to do some of the “stuff” in HW 4
Multiple plots on a single page
Creating z-scores and finding p-values
Visualizing political data
Data: Obama vote share by county
Dot Chart: Obama Vote
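# row.names is assumed to be a vector of county names created earlier (it is not defined on these slides)
dotchart(obamapercent, labels=row.names, cex=.7, xlim=c(0, 100), main="Support for Obama", xlab="Percent Obama")
abline(v=50)  # vertical reference line at 50 percent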
Returns:
[Figure: dot chart “Support for Obama”; x-axis: Percent Obama, 0 to 100. Counties in plotted order: Modoc, Lassen, Shasta, Tehama, Glenn, Sierra, Colusa, Kern, Yuba, Sutter, Tulare, Calaveras, Kings, Amador, Madera, Mariposa, Tuolumne, Plumas, Siskiyou, Inyo, El Dorado, Placer, Del Norte, Orange, Butte, Stanislaus, Fresno, Riverside, Trinity, San Bernardino, Nevada, San Luis Obispo, Merced, San Diego, San Joaquin, Ventura, Mono, Sacramento, Lake, San Benito, Santa Barbara, Imperial, Alpine, Humboldt, Solano, Napa, Yolo, Monterey, Contra Costa, Los Angeles, Santa Clara, Mendocino, San Mateo, Sonoma, Santa Cruz, Marin, Alameda, San Francisco]
Interpretation?
Geographical Patterns?
Central Valley? Coastal? SoCal, NorCal?
Why might you observe these patterns?
Z-scores
NB: we’re doing this for learning purposes
Easy: create mean, standard deviation
Then derive z-score using formula from last slide set:
R code on next slide
Z-scores and R
#Z scores for Obama
meanobama<-mean(obamapercent)
sdobama<-sd(obamapercent)
zobama<-(obamapercent-meanobama)/sdobama
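As a quick check on the by-hand calculation, base R's `scale()` does the same centering and dividing by the standard deviation. The sketch below uses made-up percentages for five hypothetical counties, not the real county data, and the variable names are chosen so they do not overwrite the ones above.

```r
# Made-up vote percentages for five hypothetical counties (not the real data).
votepct <- c(30, 45, 50, 62, 84)

# By hand: subtract the mean, divide by the standard deviation.
zhand <- (votepct - mean(votepct)) / sd(votepct)

# scale() does the same thing; as.numeric() drops the matrix attributes it adds.
zscaled <- as.numeric(scale(votepct))

print(all.equal(zhand, zscaled))

# Standardized values have mean 0 and standard deviation 1.
print(round(c(mean = mean(zhand), sd = sd(zhand)), 10))
```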
Interpretation
Z-scores are in the metric of standard deviations.
A large z (in absolute value) implies the observation is farther from the mean than observations with a small z.
Z=0 means the observation is exactly at the mean.
Dotchart (code):
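par(mfcol=c(1,1))  # back to one plot per page
# The chart shows z-scores (not p-values), so the title and axis label say so
dotchart(zobama, labels=row.names, cex=.7, xlim=c(-3, 3),
main="Obama Vote Z-scores", xlab="Z-score")
abline(v=0)  # the mean
abline(v=1, col="red")  # 1 s.d. above and below the mean
abline(v=-1, col="red")
abline(v=2, col="darkred")  # 2 s.d. above and below the mean
abline(v=-2, col="darkred")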
[Figure: dot chart “Obama Vote Z-scores”; x-axis: Z-score, -3 to 3. Counties are listed in the same order as in the “Support for Obama” dot chart, from Modoc to San Francisco]
Probability Values
Z-scores far from 0 (in either direction) are probabilistically less likely to be observed than z-scores near 0.
Consult a z-distribution table
Probability area is given
Can think about probabilities in the “tails”
One-tail (upper or lower)
Two-tail (upper + lower)
R code
# Two-tail p-value: area in the upper and lower tails of z
twotailp<- 2*pnorm(-abs(zobama))
# One-tail p-value: area in the tail beyond this z score.
# If we subtract this from 1, this gives us the area
# below this z score (if z is positive) or
# above this z score (if z is negative)
onetailp<- pnorm(-abs(zobama))
# data.frame keeps the numeric columns numeric; cbind would turn
# everything into text if county is a character vector
zp<-data.frame(county, onetailp, twotailp, zobama); zp
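To see what the two p-value formulas return, here is a small sketch on a handful of hand-picked z-scores (illustrative values, not the county data):

```r
# A few z-scores chosen for illustration.
z <- c(-2, -1, 0, 1, 1.96)

# One-tail: area in the tail beyond |z|; two-tail: both tails together.
onetail <- pnorm(-abs(z))
twotail <- 2 * onetail

print(data.frame(z, onetail = round(onetail, 4), twotail = round(twotail, 4)))
```

At z=0 the one-tail area is .5 and the two-tail area is 1; at z=1.96 the two-tail area is about .05, the familiar cutoff.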
Plots
4 plots on one page:
par(mfcol=c(2,2))
boxplot(obamapercent, ylab="Vote Percent", main="Obama Vote: Box Plot", col="blue")
hist(zobama, xlab="Obama Vote as Z-Scores", ylab="Frequency",
main="Histogram of Standardized Obama Vote", col="blue")
hist(obamapercent, ylab="Frequency", xlab="Vote Percent", main="Obama Vote: Histogram", col="blue")
plot(zobama, onetailp, ylab="One-Tail p", xlab="Z-score", main="Z-scores and p-values", col="blue")
[Figure: four panels on one page. “Obama Vote: Box Plot” (y-axis: Vote Percent, 30 to 80); “Histogram of Standardized Obama Vote” (x-axis: Obama Vote as Z-Scores, -2 to 2; y-axis: Frequency, 0 to 15); “Obama Vote: Histogram” (x-axis: Vote Percent, 30 to 90; y-axis: Frequency, 0 to 15); “Z-scores and p-values” (x-axis: Z-score, -1 to 2; y-axis: One-Tail p, 0.0 to 0.5)]