Introduction to Basic Statistical Methods. Part 1: Statistics in a Nutshell. Part 2: Overview of Biostatistics: “Which Test Do I Use??” UWHC Scholarly Forum, May 21, 2014. Ismor Fischer, Ph.D., UW Dept of Statistics, [email protected]. All slides posted at http://www.stat.wisc.edu/~ifischer/Intro_Stat/UWHC

Page 1: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

Introduction to Basic Statistical Methods

Part 1: Statistics in a Nutshell

UWHC Scholarly Forum
May 21, 2014

Ismor Fischer, Ph.D.
UW Dept of Statistics
[email protected]

Part 2: Overview of Biostatistics: “Which Test Do I Use??”

All slides posted at http://www.stat.wisc.edu/~ifischer/Intro_Stat/UWHC

Page 2: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

• Right-click on image for full .pdf article

• Links in article to access datasets

Page 3: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

POPULATION

Study Question: Has the mean (i.e., average) of X = “Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?

Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.

[Figure: Population Distribution of X (bell curve).]

“Statistical Inference”

Page 5: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

~ The Normal Distribution ~

symmetric about its mean

unimodal (i.e., one peak), with left and right “tails”

models many (but not all) naturally-occurring systems

useful mathematical properties…

μ = “population mean”

σ = “population standard deviation”

[Figure: bell curve f(x), centered at μ with spread σ.]

Page 6: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

~ The Normal Distribution ~

Approximately 95% of the population values are contained between μ – 2σ and μ + 2σ.

95% is called the confidence level. 5% is called the significance level.

[Figure: bell curve f(x) centered at the “population mean” μ; the central 95% lies within ≈ 2σ of μ, with 2.5% in each tail.]

μ = ?

σ = ?
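The “≈ 2 σ” rule of thumb can be checked numerically. The following is a minimal sketch in base R that uses the standard normal curve (μ = 0, σ = 1) for illustration; the variable names are ours, not from the slides.

```r
# Probability that a normally distributed value falls within 2 standard deviations of its mean
within_2sd <- pnorm(2) - pnorm(-2)

# Exact multiplier that captures the central 95% (slightly less than 2)
z_95 <- qnorm(0.975)

print(round(c(within_2sd = within_2sd, z_95 = z_95), 4))
```

The result does not depend on the particular values of μ and σ, because any normal distribution can be rescaled to the standard one.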

Page 7: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

POPULATION

Study Question: Has “Mean (i.e., average) Age at First Birth” of women in the U.S. changed since 2010 (25.4 yrs old)?

Present Day: Assume X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.

[Figure: Population Distribution of X (bell curve).]

μ = ?

“Null Hypothesis”

H0: pop mean age μ = 25.4 (i.e., no change since 2010)

via… “Hypothesis Testing”

“Statistical Inference”: a random sample x1, x2, x3, x4, x5, …, x400 is drawn from the population.

μ cannot be found with 100% certainty, but can be estimated with high confidence (e.g., 95%) from sample data.

Sample size n partially depends on the power of the test, i.e., the desired probability of correctly rejecting a false null hypothesis (≥ 80%).
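As a rough illustration of how the desired power drives the sample size, base R's `power.t.test` can solve for n. The effect size (0.2 years) and standard deviation (1.6 years) below are assumed planning values chosen for the sketch; the slides do not say how n = 400 was actually determined.

```r
# Solve for the sample size of a one-sample, two-sided t-test.
# delta and sd are assumed planning values, not results of a real power analysis.
plan <- power.t.test(delta = 0.2, sd = 1.6, sig.level = 0.05,
                     power = 0.80, type = "one.sample",
                     alternative = "two.sided")

# Round up, since a sample size must be a whole number
n_required <- ceiling(plan$n)
print(n_required)
```

Smaller effects, noisier data, or a higher target power would each push the required n upward.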

Page 8: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

sample mean age: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = 25.6$

sample variance: $s^2 = \dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$

Page 9: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

sample mean age: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = 25.6$

sample standard deviation (the square root of the sample variance): $s = \sqrt{\dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = 1.6$
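The sample mean, variance, and standard deviation formulas can be traced step by step in R. This is a minimal sketch on five made-up ages (hypothetical values for illustration only, not the study data), compared against R's built-in `mean`, `var`, and `sd`.

```r
# Hypothetical ages at first birth (illustrative values only, not study data)
x <- c(24.1, 26.3, 25.0, 27.2, 25.4)
n <- length(x)

x_bar <- sum(x) / n                    # sample mean
s2    <- sum((x - x_bar)^2) / (n - 1)  # sample variance, with divisor n - 1
s     <- sqrt(s2)                      # sample standard deviation

# The built-in functions use the same definitions
print(c(x_bar = x_bar, mean = mean(x), s2 = s2, var = var(x), s = s, sd = sd(x)))
```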

Page 10: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

sample mean age $\bar{x} = 25.6$, sample standard deviation $s = 1.6$

Is the difference STATISTICALLY SIGNIFICANT, at the 5% level? Do the data tend to support or refute the null hypothesis?

The population distribution of X follows a bell curve, with standard deviation σ.

Page 11: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

The “sampling distribution” of $\bar{X}$ also follows a bell curve, with standard deviation $\sigma / \sqrt{n}$.
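The σ / √n claim can be illustrated by simulation. The sketch below assumes, purely for illustration, a population with μ = 25.4 and σ = 1.6; it draws many samples of size 400 and compares the spread of their means with the theoretical value.

```r
set.seed(2014)  # arbitrary seed so the illustration is reproducible

mu    <- 25.4   # assumed population mean (used here only for simulation)
sigma <- 1.6    # assumed population standard deviation
n     <- 400    # sample size from the slides

# Theoretical standard deviation of the sample mean (the "standard error")
se_theory <- sigma / sqrt(n)

# Empirical check: draw many samples of size n and measure the spread of their means
sample_means <- replicate(2000, mean(rnorm(n, mean = mu, sd = sigma)))
se_simulated <- sd(sample_means)

print(round(c(se_theory = se_theory, se_simulated = se_simulated), 4))
```

The individual ages vary with a standard deviation of 1.6 years, while the mean of 400 of them varies with a standard deviation of only 0.08 years.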

Page 12: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

But estimating σ by s introduces an additional layer of “sampling variability.”

Page 13: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

In order to take this into account, a cousin to the normal distribution called the “T-distribution” is used instead (Gosset, 1908).

Page 14: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

Student’s T-Distribution

William S. Gosset (1876 - 1937)

… is actually a family of distributions, indexed by the degrees of freedom df = n – 1, labeled $t_{df}$.

As n gets large, $t_{df}$ converges to the standard normal distribution. But the heavier tails mean a wider interval is needed to capture 95%, especially if n is small.

[Figure: densities of $t_1$ and $t_{df}$ compared with the “standard” bell curve (μ = 0, σ = 1).]
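The effect of the heavier tails can be seen in the multipliers needed to capture the central 95%. The sketch below uses base R's `qt` and `qnorm`; the particular degrees of freedom are arbitrary choices for illustration, except 399, which corresponds to n = 400.

```r
# Degrees of freedom to compare (arbitrary illustrative choices; 399 matches n = 400)
df <- c(1, 5, 30, 399)

# Multiplier capturing the central 95% under each t-distribution
t_crit <- qt(0.975, df)

# The corresponding multiplier under the standard normal distribution
z_crit <- qnorm(0.975)

print(round(rbind(df = df, t_crit = t_crit), 3))
print(round(z_crit, 3))
```

The t multipliers are always larger than the normal one and shrink toward it as df grows.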

Page 15: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

T-test

sample mean age $\bar{x} = 25.6$, sample standard deviation $s = 1.6$

Is the difference STATISTICALLY SIGNIFICANT, at the 5% level? Do the data tend to support or refute the null hypothesis?
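For reference, the one-sample T-test statistic can be worked out from the summary values on the slides. The symbol $\mu_0$ for the null value 25.4 is our notation, not the slides'.

```latex
t \;=\; \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
  \;=\; \frac{25.6 - 25.4}{1.6 / \sqrt{400}}
  \;=\; \frac{0.2}{0.08}
  \;=\; 2.5,
\qquad df = n - 1 = 399
```

These values agree with the `t = 2.5, df = 399` line of the R output shown later in the deck.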

Page 17: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

Two main ways to conduct a formal hypothesis test:

95% CONFIDENCE INTERVAL FOR µ

BASED ON OUR SAMPLE DATA, the true value of μ today is between 25.44 and 25.76 years, with 95% “confidence” (…akin to “probability”).

[Figure: number line marking the null value μ = 25.4, the sample mean x̄ = 25.6, and the interval from 25.44 to 25.76.]

“P-VALUE” of our sample

IF H0 is true, then we would expect a random sample mean x̄ that is at least 0.2 years away from μ = 25.4 (as ours was), to occur with probability 1.28%.

Very informally, the p-value of a sample is the probability (hence a number between 0 and 1) that it “agrees” with the null hypothesis. Hence a very small p-value indicates strong evidence against the null hypothesis. The smaller the p-value, the stronger the evidence, and the more “statistically significant” the finding (e.g., p < .0001).
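Both quantities can be reproduced from the summary statistics alone. The following is a minimal sketch in base R, assuming the slide values x̄ = 25.6, s = 1.6, n = 400 and the null value 25.4; the variable names are ours.

```r
x_bar <- 25.6   # sample mean
s     <- 1.6    # sample standard deviation
n     <- 400    # sample size
mu0   <- 25.4   # null value of the population mean

se     <- s / sqrt(n)          # estimated standard error of the mean
df     <- n - 1                # degrees of freedom
t_stat <- (x_bar - mu0) / se   # test statistic

# 95% confidence interval: sample mean plus or minus the t multiplier times the standard error
ci <- x_bar + c(-1, 1) * qt(0.975, df) * se

# Two-sided p-value: probability of a t value at least this extreme in either tail, if H0 is true
p_value <- 2 * pt(-abs(t_stat), df)

print(round(ci, 2))
print(round(p_value, 4))
```

The interval excludes 25.4 exactly when the two-sided p-value falls below .05, so the two approaches lead to the same decision.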

Page 18: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

FORMAL CONCLUSIONS:

The 95% confidence interval corresponding to our sample mean does not contain the “null value” of the population mean, μ = 25.4 years.

The p-value of our sample, .0128, is less than the predetermined α = .05 significance level.

Based on our sample data, we may (moderately) reject the null hypothesis H0: μ = 25.4 in favor of the two-sided alternative hypothesis HA: μ ≠ 25.4, at the α = .05 significance level.

INTERPRETATION: According to the results of this study, there exists a statistically significant difference between the mean ages at first birth in 2010 (25.4 years old) and today, at the 5% significance level. Moreover, the evidence from the sample data would suggest that the population mean age today is significantly older than in 2010, rather than significantly younger.

Page 19: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum
Page 20: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

Edited R code:

# Generates a normally-distributed random sample of 400 age values,
# rescaled so that its mean is exactly 25.6 and its standard deviation exactly 1.6.
y = rnorm(400, 0, 1)
z = (y - mean(y)) / sd(y)
x = 25.6 + 1.6*z

sort(round(x, 1))
  [1] 19.6 20.2 20.4 20.5 21.2 22.3 22.3 22.4 22.4 22.4 22.6 22.7 22.7 22.7 22.8
 [16] 23.0 23.0 23.1 23.1 23.2 23.2 23.2 23.2 23.2 23.3 23.4 23.4 23.4 23.5 23.5
etc...
[391] 28.7 28.7 28.9 29.2 29.3 29.4 29.6 29.7 29.9 30.2

# Calculates sample mean and standard deviation.
c(mean(x), sd(x))
[1] 25.6  1.6

t.test(x, mu = 25.4)

        One Sample t-test

data:  x
t = 2.5, df = 399, p-value = 0.01282
alternative hypothesis: true mean is not equal to 25.4
95 percent confidence interval:
 25.44273 25.75727
sample estimates:
mean of x
     25.6

Page 21: Introduction to Basic Statistical Methods Part 1:  Statistics in a Nutshell UWHC Scholarly Forum

Assume: X = “Age at First Birth” follows a normal distribution (i.e., “bell curve”) in the population.

The reasonableness of the normality assumption is empirically verifiable, and in fact formally testable from the sample data. If violated (e.g., skewed) or inconclusive (e.g., small sample size), then “distribution-free” nonparametric tests should be used instead of the T-test… Examples: Sign Test, Wilcoxon Signed Rank Test. (The Mann-Whitney U Test is equivalent to the two-sample Wilcoxon Rank Sum Test, not to the Signed Rank Test.)
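One possible way to carry out these checks in base R is sketched below. The data are simulated stand-ins for real ages, and the Shapiro-Wilk test is our choice of formal normality test; the slides do not name a specific one.

```r
set.seed(1)  # arbitrary seed so the illustration is reproducible

# Simulated ages standing in for real sample data (an assumption of this sketch)
x <- rnorm(400, mean = 25.6, sd = 1.6)

# Formal test of the normality assumption (Shapiro-Wilk)
normality <- shapiro.test(x)

# Nonparametric alternatives to the one-sample T-test of H0: center = 25.4
signed_rank <- wilcox.test(x, mu = 25.4)   # Wilcoxon signed rank test
sign_test   <- binom.test(sum(x > 25.4),   # sign test: values above 25.4 ...
                          sum(x != 25.4),  # ... out of all values not equal to 25.4
                          p = 0.5)

print(c(shapiro  = normality$p.value,
        wilcoxon = signed_rank$p.value,
        sign     = sign_test$p.value))
```

A histogram or normal Q-Q plot (`hist(x)`, `qqnorm(x)`) is a common informal complement to the formal test. Note that the nonparametric tests address the median (or center of symmetry) rather than the mean.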