Introduction to Basic Statistical Methods Part 1: “Statistics in a Nutshell” UWHC Scholarly Forum March 19, 2014 Ismor Fischer, Ph.D. UW Dept of Statistics [email protected]


Page 1: Introduction to Basic Statistical Methods Part 1:  “Statistics in a Nutshell” UWHC Scholarly Forum

Introduction to Basic Statistical Methods

Part 1: “Statistics in a Nutshell”

UWHC Scholarly Forum, March 19, 2014

Ismor Fischer, Ph.D., UW Dept of Statistics, [email protected]

Page 2

STATISTICS IN A NUTSHELL


All slides posted at http://www.stat.wisc.edu/~ifischer/Intro_Stat/UWHC

Page 3

• Right-click on image for full .pdf article

• Links in article to access datasets

Page 4

POPULATION

“Statistical Inference”

Women in the U.S. who have given birth

Page 5

Study Question: Has the mean (i.e., average) of X = “Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?

Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.

[Figure: Population Distribution of X]

But what does that mean (at least in principle)?

Page 6

[Figure: Population Distribution of X, with individual values $x_1, x_2, x_3, x_4, x_5$ marked along the axis]

Individual ages from the population tend to collect around a single center with a certain amount of spread, but occasional “outliers” are present in left and right symmetric tails.

More precisely…

ad infinitum…

Page 7

~ The Normal Distribution ~

symmetric about its mean

unimodal (i.e., one peak), with left and right “tails”

models many (but not all) naturally-occurring systems

useful mathematical properties…

μ = “population mean”

σ = “population standard deviation”

Example: X = Body Temp (°F), with μ = 98.6 and small σ (low variability)

[Figure: normal density curve f(x)]

Page 8

Example: X = IQ score, with μ = 100 and large σ (high variability), in contrast to body temperature (μ = 98.6, small σ, low variability)

[Figure: two normal density curves f(x), one narrow (small σ) and one wide (large σ)]

Page 9

~ The Normal Distribution ~

Approximately 95% of the population values are contained between $\mu - 2\sigma$ and $\mu + 2\sigma$.

95% is called the confidence level. 5% is called the significance level.
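The “approximately 95%” and “≈ 2σ” figures can be checked numerically. The following is a minimal sketch in R using only the base functions pnorm() and qnorm(); the variable names are illustrative and not part of the slides.

```r
# Area under the standard normal curve within 2 standard deviations of the mean
within_2_sd <- pnorm(2) - pnorm(-2)

# Exact multiplier that captures 95% of the area (2.5% in each tail)
z_95 <- qnorm(0.975)

# Roughly 0.9545 and 1.96: "2 sigma" is a convenient rounding of 1.96 sigma
round(c(within_2_sd = within_2_sd, z_95 = z_95), 4)
```

The rounding from 1.96 to 2 is why the slides say “approximately” throughout.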

[Figure: normal density f(x) centered at the “population mean” μ, with 95% of the area within ≈ 2σ of μ and 2.5% in each tail]

Page 10

POPULATION

“Statistical Inference” via… “Hypothesis Testing”

Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?

Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.

“Null Hypothesis” H0: pop mean age μ = 25.4 (i.e., no change since 2010)

μ cannot be found with 100% certainty, but can be estimated with high confidence (e.g., 95%).

[Figure: Population Distribution of X]

Page 11

Is the difference STATISTICALLY SIGNIFICANT, at the 5% level?

SAMPLE: $x_1, x_2, x_3, x_4, x_5, \ldots, x_{400}$ … etc…

“Null Hypothesis” H0: pop mean age = 25.4 (i.e., no change since 2010)

FORMULA: sample mean age $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$

Do the data tend to support or refute the null hypothesis?

Sample mean age: $\bar{x} = 25.6$

T-test

Page 12

Actually, this is a special case of…

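[Figure: Population Distribution of X (ages), ~ The Normal Distribution ~, with mean μ and standard deviation σ. Samples of size n yield sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4, \bar{x}_5$, … etc…]

“Sampling Distribution” of $\bar{X}$ (mean ages): via mathematical proof, also ~ The Normal Distribution ~, with mean μ and standard deviation $\sigma/\sqrt{n}$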

Page 13

CENTRAL LIMIT THEOREM

[Figure: Population Distribution of X (ages). Samples of size n yield sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4, \bar{x}_5$, … etc…]

“Sampling Distribution” of $\bar{X}$ (mean ages): approaches ~ The Normal Distribution ~, with standard deviation $\sigma/\sqrt{n}$, … as n gets larger

Page 14

[Figure: Population Distribution of X (ages) shown with the narrower “Sampling Distribution” of $\bar{X}$ (mean ages), ~ The Normal Distribution ~, with standard deviation $\sigma/\sqrt{n}$]

The sample mean values have much less variability about μ than the population values!
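The reduced variability of sample means can be seen in a small simulation. This is a sketch only: the population values μ = 25.4 and σ = 1.6 are assumptions chosen to match the numbers used elsewhere in the slides, and the seed and number of replications are arbitrary.

```r
set.seed(1)                     # arbitrary seed, for reproducibility only

mu    <- 25.4                   # assumed population mean (the 2010 value)
sigma <- 1.6                    # assumed population standard deviation
n     <- 400                    # sample size used in the slides

# Draw 2000 samples of size n and record each sample mean
sample_means <- replicate(2000, mean(rnorm(n, mean = mu, sd = sigma)))

sd_of_means <- sd(sample_means)   # empirical spread of the sample means
theoretical <- sigma / sqrt(n)    # sigma / sqrt(n) = 0.08

c(population_sd = sigma, sd_of_means = sd_of_means, theoretical = theoretical)
```

The spread of the simulated means should land close to σ/√n, which is 20 times smaller than σ when n = 400.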

Page 15

“Sampling Distribution” (of mean ages)

Approximately 95% of the sample mean values are contained between $\mu - 2\sigma/\sqrt{n}$ and $\mu + 2\sigma/\sqrt{n}$.

Population Distribution (of ages)

Approximately 95% of the population values are contained between $\mu - 2\sigma$ and $\mu + 2\sigma$.

[Figure: ~ The Normal Distribution ~ for $\bar{X}$, with 95% of the area in the middle and 2.5% in each tail]

Page 16

In principle…

[Figure: sampling distribution of $\bar{X}$, with Sample 1 through Sample 5 yielding sample means $\bar{x}_1, \bar{x}_2, \bar{x}_3, \bar{x}_4, \bar{x}_5$, etc…]

$2\sigma/\sqrt{n}$ is called the 95% margin of error.

Page 17


But from the samples’ point of view…

Page 18

[Figure: intervals centered at the sample means $\bar{x}_1, \ldots, \bar{x}_5$ of Sample 1 through Sample 5, drawn against the sampling distribution of $\bar{X}$]

$2\sigma/\sqrt{n}$ is called the 95% margin of error.

Approximately 95% of the intervals from $\bar{x} - 2\sigma/\sqrt{n}$ to $\bar{x} + 2\sigma/\sqrt{n}$ contain μ, and approx 5% do not.
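The claim that roughly 95% of such intervals capture μ can be illustrated by simulation. The sketch below assumes a normal population with μ = 25.4 and σ = 1.6 (illustrative values, not stated for the population in the slides) and an arbitrary seed.

```r
set.seed(2)                         # arbitrary seed, for reproducibility only

mu <- 25.4; sigma <- 1.6; n <- 400  # assumed population values and sample size
margin <- 2 * sigma / sqrt(n)       # 95% margin of error, here 0.16

# For each of 2000 samples, check whether xbar +/- margin contains mu
covers <- replicate(2000, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  (xbar - margin <= mu) && (mu <= xbar + margin)
})

coverage <- mean(covers)            # proportion of intervals containing mu
coverage
```

The proportion should come out near 0.95; it will not be exactly 0.95, both because of simulation noise and because the multiplier 2 is a rounding of 1.96.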


Page 19

Population Distribution (of ages), ~ The Normal Distribution ~: Approximately 95% of the population values are contained between $\mu - 2\sigma$ and $\mu + 2\sigma$.

“Sampling Distribution” (of mean ages): Approximately 95% of the sample mean values are contained between $\mu - 2\sigma/\sqrt{n}$ and $\mu + 2\sigma/\sqrt{n}$.

Approximately 95% of the intervals from $\bar{x} - 2\sigma/\sqrt{n}$ to $\bar{x} + 2\sigma/\sqrt{n}$ contain μ, and approx 5% do not.

Page 20

SAMPLE: n = 400 ages $x_1, x_2, x_3, x_4, x_5, \ldots, x_{400}$

FORMULA: sample mean $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = 25.6$

95% margin of error: $2\sigma/\sqrt{n}$

Approximately 95% of the intervals from $\bar{x} - 2\sigma/\sqrt{n}$ to $\bar{x} + 2\sigma/\sqrt{n}$ contain μ, and approx 5% do not.

PROBLEM! σ is unknown the vast majority of the time!

Page 21

95% margin of error: $2\sigma/\sqrt{n}$

sample standard deviation $s$

sample variance $s^2$ = modified average of the squared deviations from the mean

Page 22

SAMPLE: n = 400 ages

sample mean $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = 25.6$

sample variance $s^2 = \dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$

sample standard deviation $s = \sqrt{s^2} = 1.6$

95% margin of error: $2\sigma/\sqrt{n} \approx 2\,s/\sqrt{n} = 2(1.6)/\sqrt{400} = 0.16$

H0: pop mean age = 25.4 (i.e., no change since 2010)
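The sample variance formula can be written out term by term and compared against R's built-in functions. The six ages below are hypothetical values invented for illustration; they are not data from the slides.

```r
# Hypothetical small sample of ages, for illustration only
x <- c(22.4, 25.1, 26.3, 24.8, 27.9, 25.5)
n <- length(x)

xbar <- sum(x) / n                     # sample mean
s2   <- sum((x - xbar)^2) / (n - 1)    # sample variance: divide by n - 1, not n
s    <- sqrt(s2)                       # sample standard deviation

# The hand computation agrees with mean(), var(), and sd()
all.equal(c(xbar, s2, s), c(mean(x), var(x), sd(x)))
```

Dividing by n − 1 rather than n is the “modified” part of the “modified average” of squared deviations.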

Page 23

$\bar{x} = 25.6$

Approximately 95% of the intervals from $\bar{x} - 2\sigma/\sqrt{n}$ to $\bar{x} + 2\sigma/\sqrt{n}$ contain μ, and approx 5% do not.

Page 24

BASED ON OUR SAMPLE DATA, the true value of μ today is between 25.44 and 25.76 years, with 95% “confidence” (…akin to “probability”).

$\bar{x} = 25.6$

95% margin of error: $2\,s/\sqrt{n} = 0.16$

[Figure: interval from 25.44 to 25.76, centered at $\bar{x} = 25.6$]
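The interval endpoints follow directly from the sample summary values on the slides (sample mean 25.6, s = 1.6, n = 400). A short sketch of the arithmetic in R, with illustrative variable names:

```r
xbar <- 25.6    # sample mean from the slides
s    <- 1.6     # sample standard deviation from the slides
n    <- 400     # sample size from the slides

margin <- 2 * s / sqrt(n)                           # 2 * 1.6 / 20 = 0.16
ci <- c(lower = xbar - margin, upper = xbar + margin)
ci                                                  # 25.44 and 25.76
```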

Page 25

Two main ways to conduct a formal hypothesis test:

95% CONFIDENCE INTERVAL FOR µ

BASED ON OUR SAMPLE DATA, the true value of μ today is between 25.44 and 25.76 years, with 95% “confidence” (…akin to “probability”).

[Figure: interval from 25.44 to 25.76 centered at $\bar{x} = 25.6$, with the null value μ = 25.4 marked]

“P-VALUE” of our sample

IF H0 is true, then we would expect a random sample mean $\bar{x}$ that is at least 0.2 years away from μ = 25.4 (as ours was) to occur with probability 1.28%.

Very informally, the p-value of a sample is the probability (hence a number between 0 and 1) that it “agrees” with the null hypothesis. More precisely, it is the probability, computed assuming H0 is true, of obtaining a sample at least as extreme as the one observed. Hence a very small p-value indicates strong evidence against the null hypothesis. The smaller the p-value, the stronger the evidence, and the more “statistically significant” the finding (e.g., p < .0001).
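The 1.28% figure can be reproduced from the summary statistics alone. The sketch below computes the T-score and two-sided p-value by hand using the base function pt(); the variable names are illustrative.

```r
xbar <- 25.6    # sample mean from the slides
mu0  <- 25.4    # null value of the population mean
s    <- 1.6     # sample standard deviation
n    <- 400     # sample size

# How many standard errors the sample mean lies from the null value
t_stat <- (xbar - mu0) / (s / sqrt(n))          # 0.2 / 0.08 = 2.5

# Two-sided p-value: probability of a T-score at least this extreme under H0
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

round(c(t_stat = t_stat, p_value = p_value), 5)
```

The result matches the p-value of about 0.0128 reported on the slides.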

Page 26


FORMAL CONCLUSIONS:

• The 95% confidence interval corresponding to our sample mean does not contain the “null value” of the population mean, μ = 25.4 years.

• The p-value of our sample, .0128, is less than the predetermined α = .05 significance level.

• Based on our sample data, we may (moderately) reject the null hypothesis H0: μ = 25.4 in favor of the two-sided alternative hypothesis HA: μ ≠ 25.4, at the α = .05 significance level.

INTERPRETATION: According to the results of this study, there exists a statistically significant difference between the mean ages at first birth in 2010 (25.4 years old) and today, at the 5% significance level. Moreover, the evidence from the sample data would suggest that the population mean age today is significantly older than in 2010, rather than significantly younger.

However, one problem remains…

Page 27

[Figure: recap of the Population Distribution (of ages) and the “Sampling Distribution” (mean ages), both labeled Normal Distribution, with $\sigma/\sqrt{n}$ shown alongside $s/\sqrt{n}$]

Page 28

“Sampling Distribution” (mean ages), with $\sigma/\sqrt{n}$ replaced by $s/\sqrt{n}$: Normal Distribution → T

Alas, this introduces “sampling variability.”

Approximately 95% of the sample mean values are contained between $\mu - 2s/\sqrt{n}$ and $\mu + 2s/\sqrt{n}$.

Approximately 95% of the intervals from $\bar{x} - 2s/\sqrt{n}$ to $\bar{x} + 2s/\sqrt{n}$ contain μ, and approx 5% do not.

…IF n is large, e.g., n ≥ 30

Population Distribution (of ages): Normal Distribution

Approximately 95% of the population values are contained between $\mu - 2s$ and $\mu + 2s$.

Page 29

Edited R code:

# Generates a normally-distributed random sample of 400 age values,
# rescaled so that the sample mean is exactly 25.6 and the sample sd is exactly 1.6.
y = rnorm(400, 0, 1)
z = (y - mean(y)) / sd(y)
x = 25.6 + 1.6*z

sort(round(x, 1))
  [1] 19.6 20.2 20.4 20.5 21.2 22.3 22.3 22.4 22.4 22.4 22.6 22.7 22.7 22.7 22.8
 [16] 23.0 23.0 23.1 23.1 23.2 23.2 23.2 23.2 23.2 23.3 23.4 23.4 23.4 23.5 23.5
etc...
[391] 28.7 28.7 28.9 29.2 29.3 29.4 29.6 29.7 29.9 30.2

# Calculates sample mean and standard deviation.
c(mean(x), sd(x))
[1] 25.6  1.6

# One-sample, two-sided T-test of H0: mu = 25.4
t.test(x, mu = 25.4)

        One Sample t-test

data:  x
t = 2.5, df = 399, p-value = 0.01282
alternative hypothesis: true mean is not equal to 25.4
95 percent confidence interval:
 25.44273 25.75727
sample estimates:
mean of x
     25.6

Page 30

But if n is small…

Page 31

If n is large, T-score ≈ 2.

If n is small, T-score > 2.

… the “T-score” increases (from ≈ 2 to a max of 12.706 for a 95% confidence level) as n decreases ⇒ larger margin of error ⇒ less power to reject, even if a genuine statistically significant difference exists!
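The growth of the T-score as n shrinks can be tabulated with the base function qt(). This is a small sketch; the particular sample sizes listed are chosen only for illustration.

```r
# Illustrative sample sizes, from the smallest possible (n = 2) up to n = 400
n <- c(2, 5, 10, 30, 100, 400)

# 95% two-sided T-score: cuts off 2.5% in each tail, with n - 1 degrees of freedom
t_score <- qt(0.975, df = n - 1)

data.frame(n = n, t_score = round(t_score, 3))
```

The maximum of 12.706 corresponds to n = 2 (one degree of freedom), and for n = 400 the T-score is already very close to 2.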

Page 32

T-test

Two loose ends

Page 33


The reasonableness of the normality assumption is empirically verifiable, and in fact formally testable from the sample data. If it is violated (e.g., skewed data) or the check is inconclusive (e.g., small sample size), then “distribution-free” nonparametric tests can be used instead of the T-test. Examples: Sign Test, Wilcoxon Signed Rank Test. (The Mann-Whitney Test is equivalent to the Wilcoxon Rank Sum Test for two independent samples, not to the Signed Rank Test.)
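One way to carry out these checks in R is sketched below. The data are simulated (an assumption standing in for real ages), the Shapiro-Wilk test is just one of several available normality tests, and the sign test is done here through binom.test() since base R has no dedicated sign test function.

```r
set.seed(3)                                # arbitrary seed, for reproducibility only
x <- rnorm(400, mean = 25.6, sd = 1.6)     # simulated ages, for illustration only

# Formal test of the normality assumption (Shapiro-Wilk)
normality <- shapiro.test(x)

# Nonparametric alternatives to the one-sample T-test of H0: center = 25.4
signed_rank <- wilcox.test(x, mu = 25.4)   # Wilcoxon Signed Rank Test

# Sign Test: under H0, values fall above 25.4 with probability 1/2
n_above   <- sum(x > 25.4)
n_used    <- sum(x != 25.4)                # values equal to 25.4 are dropped
sign_test <- binom.test(n_above, n_used, p = 0.5)

c(shapiro  = normality$p.value,
  wilcoxon = signed_rank$p.value,
  sign     = sign_test$p.value)
```

A graphical check such as qqnorm(x) followed by qqline(x) is a common complement to the formal test.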

Page 34


Sample size n partially depends on the power of the test, i.e., the desired probability of correctly rejecting a false null hypothesis (80% or more).
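The link between sample size and power can be explored with the base function power.t.test(). The sketch below assumes, purely for illustration, that the true difference is 0.2 years and the standard deviation is 1.6; neither is stated as a population value in the slides.

```r
# Sample size needed for 80% power to detect an assumed true difference of 0.2 years
needed <- power.t.test(delta = 0.2, sd = 1.6, sig.level = 0.05, power = 0.80,
                       type = "one.sample", alternative = "two.sided")

# Power actually achieved with n = 400 under the same assumptions
achieved <- power.t.test(n = 400, delta = 0.2, sd = 1.6, sig.level = 0.05,
                         type = "one.sample", alternative = "two.sided")

c(n_needed = ceiling(needed$n), power_at_400 = achieved$power)
```

Under these assumptions, n = 400 falls somewhat short of the conventional 80% target, so a study designed for that power would need a larger sample.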

Page 35

Part 2: Overview of Biostatistics: “Which Test Do I Use??”

Sincere thanks to…

• Judith Payne

• Heidi Miller

• Samantha Goodrich

• Troy Lawrence

• YOU!