Fun With Numbers Z-scores R code (wrt hw. 4) Statistical Inference The normal distribution is our first choice in most cases because it has nice properties:

Fun With Numbers

Z-scores

R code (wrt hw. 4)

Statistical Inference

The normal distribution is our first choice in most cases because it has nice properties:

Distribution is symmetrical around the mean

Percentage of cases associated with standard deviations

Can identify probability of values under the curve

A linear combination of normally distributed variables is itself distributed normally

Central limit theorem

Great flexibility in using the normal distribution

Normal Distribution

Normal distribution and areas under it.

68-95-99.7 Percent Rule

In a normal distribution, about 68 percent of the observations will fall within about +/- 1 standard deviation...
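The rule can be checked numerically. A minimal sketch in R, using the built-in `pnorm` (the standard normal cumulative distribution function); the helper name `within_k_sd` is made up for illustration:

```r
# Probability that a standard normal value falls within k standard
# deviations of the mean: area below +k minus area below -k.
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

rule <- sapply(1:3, within_k_sd)
names(rule) <- c("1 sd", "2 sd", "3 sd")

# Roughly 0.68, 0.95, and 0.997
print(round(rule, 4))
```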

A Picture:

Area (with some added stuff)

http://members.aol.com/svennord/ed/normal.htm

Another Picture

What do we know?

Area is useful to determine probabilities.

Fun with Numbers

Gas Prices (Let’s take a sidetrip)

What are some research issues when looking at financial data over time?

Inflation!

2007 dollars vs. 1990 dollars

CPI adjustment: 2007 Price = 1990 Price * (2007 CPI / 1990 CPI)
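A minimal sketch of this kind of adjustment in R. The function name `adjust_price` and all numbers below are hypothetical, chosen only for illustration; they are not the values used for the gas price series in these slides:

```r
# Convert a nominal price into another year's dollars by scaling it with
# the ratio of the two years' price index values.
adjust_price <- function(price, cpi_from, cpi_to) {
  price * (cpi_to / cpi_from)
}

# Hypothetical index values and price, for illustration only.
cpi_1990   <- 130
cpi_2007   <- 207
price_1990 <- 1.20

price_in_2007_dollars <- adjust_price(price_1990, cpi_1990, cpi_2007)
print(round(price_in_2007_dollars, 2))
```

With equal index values the price is unchanged, which is a quick sanity check on the direction of the ratio.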

Visualizing Data is FUNdamental

[Figure: three time-series panels of regular gas price by year (monthly measurements). Panel 1: Non-CPI Adjusted Gas Prices (unadjusted). Panel 2: CPI Adjusted Gas Prices. Panel 3: CPI Adjusted Gas Prices (w/o 05/06).]

Histograms

[Figure: two density histograms. "Histogram of Gas Prices" (x-axis: Price of Gas (unadjusted), y-axis: Density) and "Histogram of Gas Prices (CPI-Adjusted)" (x-axis: Price of Gas (CPI-adjusted), y-axis: Density).]

Using z-scores

Taking advantage of the normal distribution

Area under the normal is probability area.

Probabilities must sum to 1.

Full density under normal is 1.

Since it’s symmetric, we know the probability of “being above” the mean is .50 (ditto on below)
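These facts can be confirmed numerically. A small sketch in R using the built-in `dnorm`, `pnorm`, and `integrate` functions (variable names are illustrative):

```r
# Total area under the standard normal density (numerical integration).
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# Probability of being above and below the mean (z = 0).
above_mean <- pnorm(0, lower.tail = FALSE)
below_mean <- pnorm(0)

print(c(total = total_area, above = above_mean, below = below_mean))
```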

Standard Normal Distribution

Z ~ N(0,1)

Easy to compute:

$z = \frac{X - \bar{X}}{s}$, where $\bar{X}$ is the mean and $s$ is the standard deviation.

When X=mean, z=0.

Metric of z-score: standard deviations from the mean.

Thus, if z=1, X is 1 s.d. above the mean.

Now, since we know the 68-95-99.7 Rule, we can identify probabilities.

Getting Gas

Let’s look at the adjusted gas prices.

Means by year (standard deviations in parentheses):

2006: 2.57 (.30)
2005: 2.34 (.32)
2004: 1.98 (.15)
2003: 1.71 (.09)
2002: 1.51 (.13)
2001: 1.62 (.20)
2000: 1.74 (.11)
1999: 1.37 (.15)
1998: 1.27 (.04)
1997: 1.51 (.04)
1996: 1.54 (.08)
1995: 1.47 (.06)
1994: 1.46 (.07)
1993: 1.49 (.03)
1992: 1.56 (.07)
1991: 1.62 (.05)
1990: 2.00 (.07) [small n]

(Anything interesting here?)

Compute a z-score

Mean adjusted price: 1.68 (.37)

To derive z-score for any year, substitute a value X into

$z = \frac{X - \bar{X}}{s}$

Suppose “X”=1.68?

Z=(1.68-1.68)/.37=0

The mean is normalized to 0.

1 s.d. above mean? 1.68+.37=2.05

Z=(2.05-1.68)/.37=1

The metric of z is in standard deviations.
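The two worked cases can be reproduced in R. A minimal sketch; the helper name `z_score` is made up for illustration, and the mean (1.68) and standard deviation (.37) are the values quoted on this slide:

```r
# z-score: distance of x from the mean, in standard deviation units.
z_score <- function(x, mean_x, sd_x) (x - mean_x) / sd_x

mean_price <- 1.68   # mean adjusted price from the slide
sd_price   <- 0.37   # standard deviation from the slide

z_at_mean <- z_score(1.68, mean_price, sd_price)  # price equal to the mean
z_one_sd  <- z_score(2.05, mean_price, sd_price)  # price 1 s.d. above the mean

print(round(c(z_at_mean, z_one_sd), 2))
```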

“Standardizing” X allows us to use the “z distribution.”

The Most “Average”

  Price       z           Week     Year
  |--------------------------------------|
  | 1.680374  -.009361    Feb 12   2001  |
  | 1.681257  -.0069663   Nov 03   2003  |
  | 1.681329  -.0067707   Apr 24   2000  |
  | 1.682352  -.0039966   Aug 04   2003  |
  | 1.683292  -.001449    Jun 03   1991  |
  |--------------------------------------|
  | 1.684771   .0025612   Feb 04   1991  |
  | 1.68625    .0065716   May 27   1991  |
  | 1.688924   .0138213   Oct 27   2003  |
  | 1.689519   .0154355   Apr 17   2000  |
  | 1.69062    .0184197   Sep 24   2001  |
  |--------------------------------------|

The 10 Most “Below Average”

  Price       Z           Week     Year
  |--------------------------------------|
  | 1.096723  -1.59183    Feb 22   1999  |
  | 1.103978  -1.572159   Mar 01   1999  |
  | 1.111233  -1.552488   Feb 15   1999  |
  | 1.113652  -1.545931   Mar 08   1999  |
  | 1.120907  -1.52626    Feb 08   1999  |
  |--------------------------------------|
  | 1.123325  -1.519703   Feb 01   1999  |
  | 1.13058   -1.500032   Jan 04   1999  |
  | 1.131789  -1.496754   Jan 25   1999  |
  | 1.137835  -1.480361   Jan 11   1999  |
  | 1.141463  -1.470526   Jan 18   1999  |
  |--------------------------------------|

The 10 Most “Above Average”

  Price       Z           Week     Year
  |--------------------------------------|
  | 2.947      3.424879   May 15   2006  |
  | 2.973      3.495373   Jul 10   2006  |
  | 2.989      3.538755   Jul 17   2006  |
  | 3          3.56858    Aug 14   2006  |
  |--------------------------------------|
  | 3.003      3.576713   Jul 24   2006  |
  | 3.004      3.579425   Jul 31   2006  |
  | 3.021628   3.62722    Oct 03   2005  |
  | 3.038      3.67161    Aug 07   2006  |
  | 3.049491   3.702766   Sep 12   2005  |
  | 3.167136   4.021741   Sep 05   2005  |
  |--------------------------------------|

[Figure: density histograms of the z-score for CPI-adjusted gas price, one panel per year, 1990 to 2007 (x-axis: Z-Score for CPI-Adjusted Gas Price, y-axis: Density). Graphs by year.]

Finding Probabilities

What is the probability of a gas price of 2.50 or higher?

The z-score is (2.50-1.68)/.37=2.22.

In the z-distribution, if gas prices were truly normally distributed, a score this high or higher has a probability of occurring of .013, or about 1.3%. It’s an unlikely event.

How computed? 1-.9868 gives the area above (consult a standard normal table).
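The table lookup can be replaced by `pnorm` in R. A short sketch using the mean and standard deviation quoted earlier (1.68 and .37); variable names are illustrative:

```r
# z-score for a price of 2.50, given mean 1.68 and s.d. 0.37.
z <- (2.50 - 1.68) / 0.37

# Upper-tail area using the z-score rounded to two decimals, as a printed
# z table would be read.
p_table_style <- pnorm(round(z, 2), lower.tail = FALSE)

# Upper-tail area using the unrounded z-score.
p_exact <- pnorm(z, lower.tail = FALSE)

print(round(c(z = z, p_table_style = p_table_style, p_exact = p_exact), 4))
```

Both versions round to about .013; the small difference comes only from rounding z before the lookup.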

Finding Probabilities

What is the probability of the z-score for a gas price being between -1.75 and 1.75?

P(above)=.04; P(below)=.04

Therefore, P(in between)=1-.08= .92

The upper tail is .04; the lower tail is .04

Any probability calculation is this straightforward.
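As a sketch, the same tail and in-between areas in R with `pnorm` (variable names are illustrative):

```r
# Tail areas beyond z = +1.75 and z = -1.75 under the standard normal.
upper_tail <- pnorm(1.75, lower.tail = FALSE)
lower_tail <- pnorm(-1.75)

# Area in between: everything that is not in either tail.
p_between <- 1 - (upper_tail + lower_tail)

print(round(c(upper = upper_tail, lower = lower_tail, between = p_between), 2))
```

The in-between area can equally be written as `pnorm(1.75) - pnorm(-1.75)`; the two forms agree.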

Issues

The “gas price” example is pedagogical.

Serious analysis of gas-pricing effects would require much more sophisticated statistical techniques.

z is useful to compare observations from historical eras or across disparate cases.

Hands-on examples in R

Plots and Z-scores

How to do some of the “stuff” in HW 4

Multiple plots on a single page

Creating z-scores and finding p-values

Visualizing political data

Data: Obama vote share by county

Dot Chart: Obama Vote

dotchart(obamapercent, labels=row.names, cex=.7, xlim=c(0, 100), main="Support for Obama", xlab="Percent Obama")

abline(v=50)

Returns:

[Figure: dot chart "Support for Obama", one row per county, listed from Modoc to San Francisco (x-axis: Percent Obama, 0 to 100).]

Interpretation?

Geographical Patterns?

Central Valley, Coastal, SoCal, NorCal?

Why might you observe these patterns?

Z-scores

NB: we’re doing this for learning purposes

Z-scores

Easy: create mean, standard deviation

Then derive z-score using formula from last slide set:

$z = \frac{X - \bar{X}}{s}$

R code on next slide

Z-scores and R

#Z scores for Obama
meanobama<-mean(obamapercent)
sdobama<-sd(obamapercent)
zobama<-(obamapercent-meanobama)/sdobama
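As a cross-check, base R's `scale()` performs the same centering and scaling in one call. A sketch on a made-up vector (the `vote` values are hypothetical, not the county data):

```r
# Hypothetical vote shares (percent), made up for illustration.
vote <- c(32.5, 41.0, 48.2, 55.7, 61.3, 84.0)

# Manual standardization, as in the slide's code.
z_manual <- (vote - mean(vote)) / sd(vote)

# scale() centers on the mean and divides by the sample standard deviation;
# as.numeric() drops the matrix shape and attributes it returns.
z_scale <- as.numeric(scale(vote))

print(round(cbind(vote, z_manual, z_scale), 3))
```

Standardized values always have mean 0 and standard deviation 1, which is a handy check that the computation was done correctly.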

Interpretation

Z-scores are in the metric of standard deviations.

A large z (in absolute value) implies the observation is further away from the mean than observations with small z.

Z=0 means the observation is exactly at the mean.

Dotchart (code):

par(mfcol=c(1,1))

dotchart(zobama, labels=row.names, cex=.7, xlim=c(-3, 3), main="Obama Vote Z-scores", xlab="Z-score")

#Reference lines at the mean (z=0) and at +/- 1 and 2 standard deviations
abline(v=0)

abline(v=1, col="red")

abline(v=-1, col="red")

abline(v=2, col="dark red")

abline(v=-2, col="dark red")

[Figure: dot chart "Obama Vote Z-scores", one row per county, listed from Modoc to San Francisco (x-axis: Z-score, -3 to 3).]

Probability Values

High Z-scores are probabilistically less likely to be observed than smaller scores.

Consult a z-distribution table

Probability area is given

Can think about probabilities in the “tails”

One-tail (upper or lower)

Two-tail (upper + lower)

R

R code

twotailp<- 2*pnorm(-abs(zobama))
#Gives us area in the upper and lower tails of z

onetailp<- pnorm(-abs(zobama))
#Gives us 1-tail probability area; if we
#subtract this from 1, this gives us the area
#below this z score (if z is positive) or the
#area above this z score (if z is negative)

zp<-cbind(county, onetailp, twotailp, zobama ); zp
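To see what these quantities look like, here is a small sketch on a few made-up z-scores (the vector `z` is hypothetical, not the county data):

```r
# Hypothetical z-scores, chosen for illustration.
z <- c(-2, -1, 0, 1, 2)

# One-tail area: probability beyond |z| in a single tail.
one_tail <- pnorm(-abs(z))

# Two-tail area: probability beyond |z| in both tails combined.
two_tail <- 2 * one_tail

# 1 - one_tail is the area on the other side of the cut point,
# which equals the area below |z|.
rest <- 1 - one_tail

result <- data.frame(z, one_tail, two_tail, rest)
print(round(result, 4))
```

One design note: `data.frame()` keeps the numeric columns numeric, whereas `cbind()` with a character vector such as county names returns a character matrix. Either works for printing a table.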

Plots

4 plots on one page:

par(mfcol=c(2,2))

boxplot(obamapercent, ylab="Vote Percent", main="Obama Vote: Box Plot", col="blue")

hist(zobama, xlab="Obama Vote as Z-Scores", ylab="Frequency", main="Histogram of Standardized Obama Vote", col="blue")

hist(obamapercent, ylab="Frequency", xlab="Vote Percent", main="Obama Vote: Histogram", col="blue")

plot(zobama, onetailp, ylab="One-Tail p", xlab="Z-score", main="Z-scores and p-values", col="blue")

[Figure: four panels on one page. "Obama Vote: Box Plot" (y-axis: Vote Percent); "Histogram of Standardized Obama Vote" (x-axis: Obama Vote as Z-Scores, y-axis: Frequency); "Obama Vote: Histogram" (x-axis: Vote Percent, y-axis: Frequency); "Z-scores and p-values" (x-axis: Z-score, y-axis: One-Tail p).]