Statistics: Unlocking the Power of Data Normal Distribution STAT 101 Dr. Kari Lock Morgan Chapter 5 Normal distribution Central limit theorem Normal distribution for confidence intervals Normal distribution for p-values Standard normal

Normal Distribution

Page 1: Normal Distribution

Statistics: Unlocking the Power of Data Lock5

Normal Distribution

STAT 101

Dr. Kari Lock Morgan

Chapter 5
• Normal distribution
• Central limit theorem
• Normal distribution for confidence intervals
• Normal distribution for p-values
• Standard normal

Page 2: Normal Distribution

Re-grade Requests

4e potential grading mistake: 0.025 is correct

Requests for a re-grade must be submitted in writing by class on Wednesday, March 5th

Partial credit will NOT be adjusted

Valid re-grade requests:
• You got points off but believe your answer is correct
• Points were added incorrectly

Warning: scores may go up or down

Page 3: Normal Distribution

Bootstrap and Randomization Distributions

[Figure: six dot plots of bootstrap and randomization distributions]
• Slope: Restaurant tips
• Correlation: Malevolent uniforms
• Mean: Body temperatures
• Diff means: Finger taps
• Mean: Atlanta commutes
• Proportion: Owners/dogs

What do you notice?

Page 4: Normal Distribution

Normal Distribution

• The symmetric, bell-shaped curve we have seen for almost all of our bootstrap and randomization distributions is called a normal distribution

[Figure: bell-shaped histogram; x-axis from -3 to 3, y-axis Frequency]

Page 5: Normal Distribution

Central Limit Theorem!

For a sufficiently large sample size, the distribution of sample statistics for a mean or a proportion is approximately normal

www.lock5stat.com/StatKey
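
The claim above can be checked with a small simulation. The sketch below is illustrative only: the population proportion, sample size, number of repetitions, and function names are assumptions, not values from the slides. It draws many samples, records each sample proportion, and reports the share of those proportions that fall within two standard deviations of their mean, which should be near 95% if the distribution is roughly normal.

```python
import random
import statistics

def simulate_sample_proportions(p, n, reps, seed=0):
    """Draw `reps` samples of size `n` from a population with true
    proportion `p` and return the list of sample proportions."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

def share_within_two_sd(values):
    """Fraction of values within two standard deviations of their mean."""
    center = statistics.mean(values)
    spread = statistics.stdev(values)
    inside = sum(abs(v - center) <= 2 * spread for v in values)
    return inside / len(values)

# Hypothetical settings chosen for illustration.
phats = simulate_sample_proportions(p=0.5, n=100, reps=5000)
print(f"mean of sample proportions: {statistics.mean(phats):.3f}")
print(f"share within 2 SD: {share_within_two_sd(phats):.3f}")
```

Rerunning with a smaller `n` or a `p` close to 0 or 1 is a quick way to see where the normal approximation starts to break down.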

Page 6: Normal Distribution

Distribution of the sample proportion

[Figure: grid of histograms of the sample proportion, each on an axis from 0.0 to 1.0; rows p = 0.1, 0.5, 0.7; columns n = 1, 10, 30, 50, 100]

Page 7: Normal Distribution

CLT for a Mean

[Figure: histograms in three groups (Population, Distribution of Sample Data, Distribution of Sample Means), with sample data and sample means shown for n = 10, n = 30, and n = 50]

Page 8: Normal Distribution

Central Limit Theorem

• The central limit theorem holds for ANY original distribution, although “sufficiently large sample size” varies

• The more skewed the original distribution is (the farther from normal), the larger the sample size has to be for the CLT to work

• For small samples, it is more important that the data itself is approximately normal
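
To illustrate how skewness in the original distribution affects the sample size needed, here is a rough sketch. It assumes a right-skewed exponential population, which is a choice made for illustration and not one used in the slides, and the helper names are hypothetical. It measures the skewness of the simulated sample means for several sample sizes; values closer to 0 indicate a more symmetric, more normal-looking distribution.

```python
import random

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum(((v - mean) / sd) ** 3 for v in values) / n

def sample_means(n, reps, rng):
    """Means of `reps` samples of size `n` from an exponential population."""
    return [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]

rng = random.Random(1)
skew_by_n = {n: skewness(sample_means(n, 2000, rng)) for n in (5, 30, 100)}
for n, sk in skew_by_n.items():
    print(f"n = {n:3d}: skewness of sample means = {sk:.2f}")
```

The skewness should shrink toward 0 as n grows, which is the pattern the bullets describe.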

Page 9: Normal Distribution

Central Limit Theorem

• For distributions of a quantitative variable that are not very skewed and without large outliers, n ≥ 30 is usually sufficient to use the CLT

• For distributions of a categorical variable, counts of at least 10 within each category are usually sufficient to use the CLT

Page 10: Normal Distribution

Accuracy

• The accuracy of intervals and p-values generated using simulation methods (bootstrapping and randomization) depends on the number of simulations (more simulations = more accurate)

• The accuracy of intervals and p-values generated using formulas and the normal distribution depends on the sample size (larger sample size = more accurate)

• If the distribution of the statistic is truly normal and you have generated many simulated randomizations, the p-values should be very close

Page 11: Normal Distribution

Normal Distribution

• The normal distribution is fully characterized by its mean and standard deviation

N(mean, standard deviation)

Page 12: Normal Distribution

Bootstrap Distributions

If a bootstrap distribution is approximately normally distributed, we can write it as

a) N(parameter, sd)
b) N(statistic, sd)
c) N(parameter, se)
d) N(statistic, se)

sd = standard deviation of variable
se = standard error = standard deviation of statistic

Page 13: Normal Distribution

Hearing Loss

• In a random sample of 1771 Americans aged 12 to 19, 19.5% had some hearing loss (this is a dramatic increase from a decade ago!)

• What proportion of Americans aged 12 to 19 have some hearing loss? Give a 95% CI.

Rabin, R. “Childhood: Hearing Loss Grows Among Teenagers,” www.nytimes.com, 8/23/10.

Page 14: Normal Distribution

Hearing Loss

(0.177, 0.214)

Page 15: Normal Distribution

Hearing Loss

N(0.195, 0.0095)
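
The standard error in N(0.195, 0.0095) comes from a bootstrap distribution. The sketch below shows one way such a value could be approximated. It assumes 345 of the 1771 teenagers had hearing loss (19.5% of 1771, rounded), and the function name, number of bootstrap samples, and seed are illustrative choices, not part of the original analysis.

```python
import random
import statistics

def bootstrap_proportions(successes, n, reps, seed=0):
    """Resample n observations with replacement `reps` times and
    return the proportion of successes in each bootstrap sample."""
    rng = random.Random(seed)
    p_hat = successes / n
    # Resampling 0/1 data with replacement is equivalent to drawing n
    # independent values that equal 1 with probability p_hat.
    return [sum(rng.random() < p_hat for _ in range(n)) / n for _ in range(reps)]

# Assumed count: 19.5% of 1771 is about 345.
boot = bootstrap_proportions(successes=345, n=1771, reps=1000)
se = statistics.stdev(boot)
print(f"statistic = {345 / 1771:.3f}, bootstrap SE = {se:.4f}")
```

The exact SE varies slightly from run to run and with the seed, but it should land close to the 0.0095 shown on the slide.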

Page 16: Normal Distribution

Confidence Intervals

If the bootstrap distribution is normal:

To find a P% confidence interval, we just need to find the middle P% of the distribution

N(statistic, SE)

Page 17: Normal Distribution

Area under a Curve

• The area under the curve of a normal distribution over a given range is equal to the proportion of the distribution falling within that range

• Knowing just the mean and standard deviation of a normal distribution allows you to calculate areas in the tails and percentiles

www.lock5stat.com/statkey

Page 18: Normal Distribution

Hearing Loss

(0.176, 0.214)

www.lock5stat.com/statkey
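
The interval above was read off StatKey. As a hedged alternative, the same middle 95% of N(0.195, 0.0095) can be computed with `NormalDist` from Python's standard library; the helper name `middle_interval` is made up for this sketch.

```python
from statistics import NormalDist

def middle_interval(dist, level):
    """Endpoints that capture the middle `level` (e.g. 0.95) of a normal distribution."""
    tail = (1 - level) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

# Bootstrap distribution from the hearing loss example: N(statistic, SE).
bootstrap_dist = NormalDist(mu=0.195, sigma=0.0095)
low, high = middle_interval(bootstrap_dist, 0.95)
print(f"({low:.3f}, {high:.3f})")  # (0.176, 0.214)
```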

Page 19: Normal Distribution

Standardized Data

Often, we standardize the data to have mean 0 and standard deviation 1

This is done with z-scores

From x to z: $z = \frac{x - \text{mean}}{\text{sd}}$

From z to x: $x = \text{mean} + z \cdot \text{sd}$

Places everything on a common scale
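
The two z-score conversions, z = (x - mean) / sd and x = mean + z * sd, can be sketched as a pair of small functions. The function names are hypothetical, and the numbers reuse the hearing loss distribution N(0.195, 0.0095) purely as an example.

```python
def to_z(x, mean, sd):
    """Standardize: how many standard deviations x lies from the mean."""
    return (x - mean) / sd

def to_x(z, mean, sd):
    """Undo the standardization: return to the original scale."""
    return mean + z * sd

z = to_z(0.214, mean=0.195, sd=0.0095)
print(f"z = {z:.2f}")                                    # z = 2.00
print(f"x = {to_x(z, mean=0.195, sd=0.0095):.3f}")       # x = 0.214
```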

Page 20: Normal Distribution

Standard Normal

• The standard normal distribution is the normal distribution with mean 0 and standard deviation 1

N(0, 1)

[Figure: normal curve titled "Distribution of Statistic Assuming Null"; x-axis Statistic, from -3 to 3]

Page 21: Normal Distribution

Standardized Data

Confidence Interval (bootstrap distribution):

mean = sample statistic, sd = SE

From z to x (CI): $x = \text{statistic} + z \cdot SE$

Page 22: Normal Distribution

P% Confidence Interval

1. Find z-scores (–z* and z*) that capture the middle P% of the standard normal

2. Return to original scale with $\text{statistic} \pm z^* \cdot SE$

[Figure: standard normal curve with the middle P% marked between –z* and z*]

Page 23: Normal Distribution

Confidence Interval using N(0,1)

If a statistic is normally distributed, we find a confidence interval for the parameter using

$\text{statistic} \pm z^* \cdot SE$

where the area between –z* and +z* in the standard normal distribution is the desired level of confidence.

Page 24: Normal Distribution

Confidence Intervals

Find z* for a 99% confidence interval.

www.lock5stat.com/statkey

z* = 2.576

Page 25: Normal Distribution

z*

Why use the standard normal?

Common confidence levels:

95%: z* = 1.96 (but 2 is close enough)

90%: z* = 1.645

99%: z* = 2.576
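
These z* values do not have to be memorized or looked up. A minimal sketch, assuming Python's standard-library `NormalDist` and a hypothetical helper named `z_star`, recovers them from the standard normal:

```python
from statistics import NormalDist

def z_star(confidence):
    """z* such that the area between -z* and +z* under N(0, 1) equals `confidence`."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z* = {z_star(level):.3f}")
# 90%: z* = 1.645
# 95%: z* = 1.960
# 99%: z* = 2.576
```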

Page 26: Normal Distribution

Sin Taxes

In March 2011, a random sample of 1000 US adults was asked “Do you favor or oppose ‘sin taxes’ on soda and junk food?”

320 adults responded in favor of sin taxes.

Give a 99% CI for the proportion of all US adults that favor these sin taxes.

From a bootstrap distribution, we find SE = 0.015
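
One way to work the sin taxes problem is to apply statistic ± z* · SE directly. The sketch below uses the values given on the slide (320 of 1000 in favor, SE = 0.015, 99% confidence); the function name `normal_ci` is an assumption made for illustration.

```python
from statistics import NormalDist

def normal_ci(statistic, se, confidence):
    """Confidence interval statistic +/- z* * SE, with z* taken from N(0, 1)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * se
    return statistic - margin, statistic + margin

p_hat = 320 / 1000  # sample proportion in favor
low, high = normal_ci(p_hat, se=0.015, confidence=0.99)
print(f"99% CI: ({low:.3f}, {high:.3f})")  # 99% CI: (0.281, 0.359)
```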

Page 27: Normal Distribution

Sin Taxes

Page 28: Normal Distribution

Sin Taxes

Page 29: Normal Distribution

Randomization Distributions

If a randomization distribution is approximately normally distributed, we can write it as

a) N(null value, se)
b) N(statistic, se)
c) N(parameter, se)

Page 30: Normal Distribution

p-values

If the randomization distribution is normal:

To calculate a p-value, we just need to find the area in the appropriate tail(s) beyond the observed statistic of the distribution

Page 31: Normal Distribution

First Born Children

• Are first born children actually smarter?

• Explanatory variable: first born or not
• Response variable: combined SAT score

• Based on a sample of college students, we find a difference in mean SAT scores of 30.26

• From a randomization distribution, we find SE = 37

Page 32: Normal Distribution

First Born Children

SE = 37

What normal distribution should we use to find the p-value?

a) N(30.26, 37)
b) N(37, 30.26)
c) N(0, 37)
d) N(0, 30.26)

Page 33: Normal Distribution

Hypothesis Testing

[Figure: three normal curves titled "Distribution of Statistic Assuming Null" (x-axis Statistic, from -3 to 3), with the observed statistic marked and the p-value shown as the tail area beyond it]

Page 34: Normal Distribution

First Born Children

N(0, 37)

www.lock5stat.com/statkey

p-value = 0.207
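
The p-value above is the upper-tail area of N(0, 37) beyond the observed statistic. A short sketch of that calculation follows; it assumes the observed difference in means is 30.26 (the value that appears in the answer choices) and that the alternative is one-sided, with first born children scoring higher.

```python
from statistics import NormalDist

# Randomization distribution under the null: centered at 0 with SE = 37.
null_dist = NormalDist(mu=0, sigma=37)

observed = 30.26  # assumed observed difference in mean SAT scores
p_value = 1 - null_dist.cdf(observed)  # area in the upper tail
print(f"p-value = {p_value:.3f}")  # p-value = 0.207
```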

Page 35: Normal Distribution

Standardized Data

Hypothesis test (randomization distribution):

mean = null value, sd = SE

From x to z (test): $z = \frac{x - \text{mean}}{\text{sd}}$

Page 36: Normal Distribution

p-value using N(0,1)

If a statistic is normally distributed under H0, the p-value is the probability a standard normal is beyond $z = \frac{\text{sample statistic} - \text{null value}}{SE}$

Page 37: Normal Distribution

First Born Children

SE = 37

1) Find the standardized test statistic

2) Compute the p-value
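
The two steps can be sketched as follows. The function names and the `tail` argument are hypothetical, the observed difference of 30.26 is assumed from the answer choices on the earlier slide, and the upper tail is used on the assumption that the alternative is "first born children score higher".

```python
from statistics import NormalDist

def z_statistic(sample_statistic, null_value, se):
    """Number of standard errors the sample statistic lies from the null value."""
    return (sample_statistic - null_value) / se

def p_value_from_z(z, tail="upper"):
    """Tail area of N(0, 1) beyond z; `tail` is "upper", "lower", or "two-sided"."""
    std_normal = NormalDist()
    upper = 1 - std_normal.cdf(z)
    lower = std_normal.cdf(z)
    if tail == "upper":
        return upper
    if tail == "lower":
        return lower
    if tail == "two-sided":
        return 2 * min(upper, lower)
    raise ValueError(f"unknown tail: {tail!r}")

z = z_statistic(30.26, null_value=0, se=37)  # step 1
p = p_value_from_z(z, tail="upper")          # step 2
print(f"z = {z:.2f}, p-value = {p:.3f}")     # z = 0.82, p-value = 0.207
```

Standardizing first gives the same p-value as working with N(0, 37) directly; the benefit is that z can be judged on a common scale.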

Page 38: Normal Distribution

First Born Children

Page 39: Normal Distribution

z-statistic

If z = –3, using α = 0.05 we would

(a) Reject the null
(b) Not reject the null
(c) Impossible to tell
(d) I have no idea

Page 40: Normal Distribution

z-statistic

• Calculating the number of standard errors a statistic is from the null value allows us to assess extremity on a common scale

Page 41: Normal Distribution

Confidence Interval Formula

$\text{sample statistic} \pm z^* \cdot SE$

• sample statistic: from original data
• z*: from N(0,1)
• SE: from bootstrap distribution

IF SAMPLE SIZES ARE LARGE…

Page 42: Normal Distribution

Formula for p-values

$z = \frac{\text{sample statistic} - \text{null value}}{SE}$

• sample statistic: from original data
• null value: from H0
• SE: from randomization distribution

Compare z to N(0,1) for p-value

IF SAMPLE SIZES ARE LARGE…

Page 43: Normal Distribution

Standard Error

• Wouldn’t it be nice if we could compute the standard error without doing thousands of simulations?

• We can!!!

• Or at least we’ll be able to next class…

Page 44: Normal Distribution

t-distribution

• For quantitative data, we use a t-distribution instead of the normal distribution

• The t-distribution is very similar to the standard normal, but with slightly fatter tails (to reflect the uncertainty in the sample standard deviations)

Page 45: Normal Distribution

Degrees of Freedom

• The t-distribution is characterized by its degrees of freedom (df)

• Degrees of freedom are based on sample size
    • Single mean: df = n – 1
    • Difference in means: df = min(n1, n2) – 1
    • Correlation: df = n – 2

• The higher the degrees of freedom, the closer the t-distribution is to the standard normal

Page 46: Normal Distribution

t-distribution

Page 47: Normal Distribution

Aside: William Sealy Gosset

Page 48: Normal Distribution

The Pygmalion Effect

Source: Rosenthal, R. and Jacobson, L. (1968). “Pygmalion in the Classroom: Teacher Expectation and Pupils’ Intellectual Development.” Holt, Rinehart and Winston, Inc.

Teachers were told that certain children (chosen randomly) were expected to be intellectual “growth spurters,” based on the Harvard Test of Inflected Acquisition (a test that didn’t actually exist). These children were selected randomly.

The response variable is change in IQ over the course of one year.

Page 49: Normal Distribution

The Pygmalion Effect

                      n      X̄       s*
Control Students     255    8.42    12.0
“Growth Spurters”     65   12.22    13.3

Can this provide evidence that merely expecting a child to do well actually causes the child to do better?

If so, how much better?

*s1 and s2 were not given, so I set them to give the correct p-value

SE = 1.8
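
As a rough sketch of how the first question could be answered with a t-distribution, the code below standardizes the difference in means using SE = 1.8 and the rule df = min(n1, n2) – 1, then finds the upper-tail area. Python's standard library has no t-distribution, so the tail area is approximated here by numerically integrating the t density with Simpson's rule; that numerical approach, the one-sided alternative, and all function names are assumptions of this sketch and not the method used in the slides.

```python
import math

def t_pdf(t, df):
    """Density of the t-distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, using Simpson's rule on the density from 0 to t."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half the area lies above 0; subtract the area between 0 and t.
    return 0.5 - total * h / 3

n_control, mean_control = 255, 8.42
n_spurters, mean_spurters = 65, 12.22
se = 1.8  # standard error given on the slide

t_stat = (mean_spurters - mean_control) / se
df = min(n_control, n_spurters) - 1
p_value = t_upper_tail(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df}, one-sided p-value = {p_value:.3f}")
```

Under these assumptions the test statistic is about 2.11 on 64 degrees of freedom, and the one-sided p-value comes out close to 0.02.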

Page 50: Normal Distribution

Pygmalion Effect

Page 51: Normal Distribution

Pygmalion Effect

From the paper: “The difference in gains could be ascribed to chance about 2 in 100 times”

Page 52: Normal Distribution

Pygmalion Effect

Page 53: Normal Distribution

To Do

Do Project 1 (due 3/7)

Read Chapter 5