Fun With Numbers

Z-scores

R code (wrt hw. 4)

Statistical Inference

The normal distribution is our first choice in most cases because it has nice properties:

Distribution is symmetrical around the mean

Percentage of cases associated with standard deviations

Can identify probability of values under the curve

A linear combination of independent, normally distributed variables is itself distributed normally

Central limit theorem

Great flexibility in using the normal distribution

Normal Distribution

Normal distribution and areas under it.

68-95-99.7 Percent Rule

In a normal distribution, about 68 percent of the observations will fall within about +/- 1 standard deviation...
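The rule can be checked numerically. The sketch below is an illustration, not part of the original slides; it uses `pnorm()`, the standard normal cumulative distribution function in base R, to compute the area within 1, 2, and 3 standard deviations of the mean.

```r
# Area within k standard deviations of the mean of a standard normal
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)

# Rounded shares: roughly 68, 95, and 99.7 percent
round(within_k, 4)
```

The same areas hold for any normal distribution, because standardizing does not change the share of cases within a given number of standard deviations.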

A Picture:

Area (with some added stuff)

http://members.aol.com/svennord/ed/normal.htm

Another Picture

What do we know?

Area is useful to determine probabilities.

Fun with Numbers

Gas Prices (Let’s take a sidetrip)

What are some research issues when looking at financial data over time?

Inflation!

2007 dollars vs. 1990 dollars

CPI: Price in 2007 dollars = 1990 Price × (2007 CPI / 1990 CPI)
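A minimal sketch of the adjustment in R. The index values and the 1990 price below are hypothetical placeholders chosen for easy arithmetic, not actual CPI figures or data from these slides.

```r
# Hypothetical index values, for illustration only (not actual CPI figures)
cpi_1990 <- 130
cpi_2007 <- 208

# Hypothetical nominal 1990 price per gallon
price_1990 <- 1.20

# Express the 1990 price in 2007 dollars by scaling with the ratio of the indexes
price_1990_in_2007_dollars <- price_1990 * (cpi_2007 / cpi_1990)
price_1990_in_2007_dollars
```

With real data, the same line would be applied to a whole vector of prices and a matching vector of index values.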

Visualizing Data is FUNdamental

[Figure: Non-CPI Adjusted Gas Prices. Y-axis: Gas Price (Regular, unadjusted); x-axis: Year (with monthly measurements), 1990 to 2010.]

[Figure: CPI Adjusted Gas Prices. Y-axis: Gas Price (Regular, CPI-adjusted); x-axis: Year (with monthly measurements), 1990 to 2010.]

[Figure: CPI Adjusted Gas Prices. Y-axis: Gas Price (Regular, CPI-adjusted); x-axis: Year (with monthly measures), 1990 to 2005.]

Unadjusted; CPI-Adjusted; CPI-Adjusted (w/o 05/06)

Histograms

[Figure: Histogram of Gas Prices. X-axis: Price of Gas (unadjusted); y-axis: Density.]

[Figure: Histogram of Gas Prices (CPI-Adjusted). X-axis: Price of Gas (CPI-adjusted); y-axis: Density.]

Using z-scores

Taking advantage of the normal distribution

Area under the normal is probability area.

Probabilities must sum to 1.

Full density under normal is 1.

Since it’s symmetric, we know the probability of “being above” the mean is .50 (ditto on below)
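These properties can be confirmed numerically. The sketch below is an illustration using base R only: `integrate()` approximates the total area under the standard normal density `dnorm()`, and `pnorm()` gives the area on each side of the mean.

```r
# Total area under the standard normal density (numerical approximation)
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Probability of falling above or below the mean (the mean is 0 here)
p_above <- pnorm(0, lower.tail = FALSE)
p_below <- pnorm(0)

c(total = total_area, above = p_above, below = p_below)
```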

Standard Normal Distribution

Z ~ N(0,1)

Easy to compute:

$$z = \frac{X - \bar{X}}{s}$$

where $\bar{X}$ is the mean of X and $s$ is its standard deviation.

When X=mean, z=0.

Metric of z-score: standard deviations from the mean.

Thus, if z=1, X is 1 s.d. above the mean.

Now, since we know the 68-95-99.7 Rule, we can identify probabilities.

Getting Gas

Let’s look at the adjusted gas prices.

Means (standard deviations in parentheses):

2006: 2.57 (.30)
2005: 2.34 (.32)
2004: 1.98 (.15)
2003: 1.71 (.09)
2002: 1.51 (.13)
2001: 1.62 (.20)
2000: 1.74 (.11)
1999: 1.37 (.15)
1998: 1.27 (.04)
1997: 1.51 (.04)
1996: 1.54 (.08)
1995: 1.47 (.06)
1994: 1.46 (.07)
1993: 1.49 (.03)
1992: 1.56 (.07)
1991: 1.62 (.05)
1990: 2.00 (.07) [small n]

(Anything interesting here?)

Compute a z-score

Mean adjusted price: 1.68 (.37)

To derive z-score for any year, substitute a value X into

$$z = \frac{X - \bar{X}}{s}$$

Suppose “X”=1.68?

Z=(1.68-1.68)/.37=0

The mean is normalized to 0.

1 s.d. above mean? 1.68+.37=2.05

Z=(2.05-1.68)/.37=1

The metric of z is in standard deviations.
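The arithmetic above can be reproduced in R. This is a sketch using the rounded mean (1.68) and standard deviation (.37) from the slide; the helper name `z_score` is introduced here for illustration and is not from the original code.

```r
# Rounded summary statistics for the CPI-adjusted price, taken from the slide
mean_price <- 1.68
sd_price <- 0.37

# Hypothetical helper: distance from the mean in standard-deviation units
z_score <- function(x, center, spread) (x - center) / spread

# z for the mean itself, for one s.d. above the mean, and for a price of 2.50
z_values <- z_score(c(1.68, 2.05, 2.50), mean_price, sd_price)
round(z_values, 2)
```

The third value, about 2.22, is the z-score for a price of 2.50.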

“Standardizing” X allows us to use “z distribution.”

The Most “Average”

     Price           z     Week   Year
  |--------------------------------------|
  | 1.680374    -.009361   Feb 12   2001 |
  | 1.681257   -.0069663   Nov 03   2003 |
  | 1.681329   -.0067707   Apr 24   2000 |
  | 1.682352   -.0039966   Aug 04   2003 |
  | 1.683292    -.001449   Jun 03   1991 |
  |--------------------------------------|
  | 1.684771    .0025612   Feb 04   1991 |
  |  1.68625    .0065716   May 27   1991 |
  | 1.688924    .0138213   Oct 27   2003 |
  | 1.689519    .0154355   Apr 17   2000 |
  |  1.69062    .0184197   Sep 24   2001 |
  |--------------------------------------|

The 10 Most “Below Average”

     Price           Z     Week   Year
  |--------------------------------------|
  | 1.096723    -1.59183   Feb 22   1999 |
  | 1.103978   -1.572159   Mar 01   1999 |
  | 1.111233   -1.552488   Feb 15   1999 |
  | 1.113652   -1.545931   Mar 08   1999 |
  | 1.120907    -1.52626   Feb 08   1999 |
  |--------------------------------------|
  | 1.123325   -1.519703   Feb 01   1999 |
  |  1.13058   -1.500032   Jan 04   1999 |
  | 1.131789   -1.496754   Jan 25   1999 |
  | 1.137835   -1.480361   Jan 11   1999 |
  | 1.141463   -1.470526   Jan 18   1999 |
  |--------------------------------------|

The 10 Most “Above Average”

     Price          Z     Week   Year
  |-------------------------------------|
  |    2.947   3.424879   May 15   2006 |
  |    2.973   3.495373   Jul 10   2006 |
  |    2.989   3.538755   Jul 17   2006 |
  |        3    3.56858   Aug 14   2006 |
  |    3.003   3.576713   Jul 24   2006 |
  |-------------------------------------|
  |    3.004   3.579425   Jul 31   2006 |
  | 3.021628    3.62722   Oct 03   2005 |
  |    3.038    3.67161   Aug 07   2006 |
  | 3.049491   3.702766   Sep 12   2005 |
  | 3.167136   4.021741   Sep 05   2005 |
  |-------------------------------------|
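Listings like these can be produced by sorting on the absolute value of z. The sketch below is an illustration only: it uses five prices copied from the tables and the rounded mean and standard deviation (1.68 and .37), so its z values differ slightly from the tabled ones, which were presumably computed from unrounded statistics.

```r
# Five weekly prices copied from the tables above
prices <- c(1.680374, 1.096723, 3.167136, 2.947, 1.69062)

# z-scores from the rounded mean and s.d. reported on the slide
z <- (prices - 1.68) / 0.37

# Sort from most "average" (|z| near 0) to most extreme
ranked <- data.frame(price = prices, z = z)[order(abs(z)), ]
ranked
```

Sorting on `z` itself instead of `abs(z)` would put the most below-average weeks first and the most above-average weeks last.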

[Figure: Z-Score for CPI-Adjusted Gas Price, graphs by year. One density panel per year, 1990 to 2007; x-axis: z-score (-2 to 4); y-axis: Density.]

Finding Probabilities

What is the probability of a gas price of 2.50 or higher?

The z-score is (2.50-1.68)/.37=2.22. In the z-distribution, if gas prices were truly normally distributed, a score this high or higher has a probability of occurring of .013, or about 1.3%. It’s an unlikely event.

How computed? 1-.9868 gives area above (consult standard normal table)
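Instead of a printed table, the upper-tail area can be read from `pnorm()`. This sketch assumes the rounded mean and standard deviation from the earlier slide (1.68 and .37).

```r
# z-score for a price of 2.50, using the rounded mean and s.d.
z <- (2.50 - 1.68) / 0.37

# Area above z under the standard normal curve
p_upper <- pnorm(z, lower.tail = FALSE)

# Equivalent "table" calculation: 1 minus the area below z
p_upper_alt <- 1 - pnorm(z)

round(c(z = z, p_upper = p_upper), 3)
```

Setting `lower.tail = FALSE` avoids the subtraction and is slightly more accurate for very small tail areas.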

Finding Probabilities

What is the probability of a gas price having a z-score between -1.75 and 1.75?

P(above)=.04; P(below)=.04

Therefore, P(in between)=1-.08= .92

The upper tail is .04; the lower tail is .04

Any probability calculation is this straightforward.
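The same calculation in R, as a brief sketch: each tail comes from `pnorm()`, and the area in between is whatever remains.

```r
# Area above z = 1.75 and below z = -1.75
p_upper <- pnorm(1.75, lower.tail = FALSE)
p_lower <- pnorm(-1.75)

# Area between the two cut points
p_between <- 1 - (p_upper + p_lower)

round(c(upper = p_upper, lower = p_lower, between = p_between), 2)
```

Because the distribution is symmetric, the two tails are equal, so `1 - 2 * pnorm(-1.75)` gives the same answer.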

Issues

The “gas price” example is pedagogical.

Serious analysis of gas-pricing effects would require much more sophisticated statistical techniques.

z is useful to compare observations from historical eras or across disparate cases.

Hands-on examples in R

Plots and Z-scores

How to do some of the “stuff” in HW 4

Multiple plots on a single page

Creating z-scores and finding p-values

Visualizing political data

Data: Obama vote share by county

Dot Chart: Obama Vote

dotchart(obamapercent, labels=row.names, cex=.7, xlim=c(0, 100), main="Support for Obama", xlab="Percent Obama")

abline(v=50)

Returns:

[Figure: dot chart, “Support for Obama”. X-axis: Percent Obama (0 to 100), with a vertical line at 50. Counties, in the order they appear on the chart: Modoc, Lassen, Shasta, Tehama, Glenn, Sierra, Colusa, Kern, Yuba, Sutter, Tulare, Calaveras, Kings, Amador, Madera, Mariposa, Tuolumne, Plumas, Siskiyou, Inyo, El Dorado, Placer, Del Norte, Orange, Butte, Stanislaus, Fresno, Riverside, Trinity, San Bernardino, Nevada, San Luis Obispo, Merced, San Diego, San Joaquin, Ventura, Mono, Sacramento, Lake, San Benito, Santa Barbara, Imperial, Alpine, Humboldt, Solano, Napa, Yolo, Monterey, Contra Costa, Los Angeles, Santa Clara, Mendocino, San Mateo, Sonoma, Santa Cruz, Marin, Alameda, San Francisco.]

Interpretation?

Geographical Patterns?

Central Valley, Coastal, SoCal, NorCal?

Why might you observe these patterns?

Z-scores NB: we’re doing this for learning purposes

Z-scores

Easy: create mean, standard deviation

Then derive z-score using formula from last slide set:

R code on next slide

Z-scores and R

#Z scores for Obama

meanobama<-mean(obamapercent)

sdobama<-sd(obamapercent)

zobama<-(obamapercent-meanobama)/sdobama

Interpretation Z-scores in metric of standard deviations

A large z (in absolute value) implies the observation is farther from the mean than observations with a small z.

Z=0 means the observation is exactly at the mean.
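As a cross-check on the manual formula, base R's `scale()` performs the same centering and scaling. The vote shares below are made-up numbers for illustration, not the county data used in the slides.

```r
# Hypothetical vote shares (percent), for illustration only
vote <- c(30, 45, 50, 62, 84)

# Manual z-scores, following the formula used for zobama
z_manual <- (vote - mean(vote)) / sd(vote)

# scale() returns a one-column matrix; convert it to a plain vector
z_scale <- as.numeric(scale(vote))

# The two approaches agree; standardized values have mean 0 and s.d. 1
all.equal(z_manual, z_scale)
c(mean = mean(z_manual), sd = sd(z_manual))
```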

Dotchart (code):

par(mfcol=c(1,1))

#Title and axis label match the chart shown: the plot displays z-scores, not p-values

dotchart(zobama, labels=row.names, cex=.7, xlim=c(-3, 3), main="Obama Vote Z-scores", xlab="Z-score")

abline(v=0)

abline(v=1, col="red")

abline(v=-1, col="red")

abline(v=2, col="dark red")

abline(v=-2, col="dark red")

[Figure: dot chart, “Obama Vote Z-scores”. X-axis: Z-score (-3 to 3), with vertical reference lines at 0, ±1, and ±2. Counties are listed in the same order as in the “Support for Obama” dot chart, from Modoc to San Francisco.]

Probability Values

Z-scores far from 0 (large in absolute value) are probabilistically less likely to be observed than z-scores near 0.

Consult a z-distribution table

Probability area is given

Can think about probabilities in the “tails”

One-tail (upper or lower)

Two-tail (upper + lower)

R

R code

twotailp<- 2*pnorm(-abs(zobama))
#Gives us area in the upper and lower tails of z

onetailp<- pnorm(-abs(zobama))
#Gives us 1-tail probability area; if we
#subtract this from 1, this gives us the area
#below this z score (if z is positive) or
#area above this z score (if z is negative)

#data.frame keeps the numeric columns numeric
#(cbind with county names would turn every value into text)
zp<-data.frame(county, onetailp, twotailp, zobama); zp
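To see what these two quantities look like, the sketch below applies the same expressions to three hand-picked z values. The values are chosen for illustration and are not taken from the county data.

```r
# Hand-picked z-scores, for illustration only
z <- c(-2, 0, 1.96)

# One-tail area: probability of a value at least this far out, on one side
onetail <- pnorm(-abs(z))

# Two-tail area: the same distance from the mean in either direction
twotail <- 2 * onetail

data.frame(z, onetail = round(onetail, 4), twotail = round(twotail, 4))
```

At z=0 the one-tail area is .5 and the two-tail area is 1; at z=1.96 the two-tail area is about .05, the familiar cutoff.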

Plots

4 plots on one page:

par(mfcol=c(2,2))

boxplot(obamapercent, ylab="Vote Percent", main="Obama Vote: Box Plot", col="blue")

hist(zobama, xlab="Obama Vote as Z-Scores", ylab="Frequency",

main="Histogram of Standardized Obama Vote", col="blue")

hist(obamapercent, ylab="Frequency", xlab="Vote Percent", main="Obama Vote: Histogram", col="blue")

plot(zobama, onetailp, ylab="One-Tail p", xlab="Z-score", main="Z-scores and p-values", col="blue")

[Figure: four plots on one page. Obama Vote: Box Plot (y-axis: Vote Percent); Histogram of Standardized Obama Vote (x-axis: Obama Vote as Z-Scores; y-axis: Frequency); Obama Vote: Histogram (x-axis: Vote Percent; y-axis: Frequency); Z-scores and p-values (x-axis: Z-score; y-axis: One-Tail p).]
