Biostatistics in Practice Session 3: Testing Hypotheses Peter D. Christenson Biostatistician http://research.LABioMed.org/Biostat


Page 1

Biostatistics in Practice

Session 3: Testing Hypotheses

Peter D. Christenson, Biostatistician

http://research.LABioMed.org/Biostat

Page 2

Session 3 Preparation

We have been using a recent study on hyperactivity to illustrate the concepts in this course. The questions below, based on this paper, are intended to prepare you for Session 3.

Page 3

Session 3 Preparation

1. From Figures 1 and 2, we see that 153/209 = 73% of parents of the younger children and 144/160 = 90% of parents of the older children initially were interested but did not participate. Does it seem logical that the rate is lower for the 3-year-olds? Do you have any intuition on whether the magnitude of the 73% vs. 90% difference is enough to support an age difference, regardless of the logical reason?

Page 4

Session 3 Preparation #1

153/209 144/160

73% ↔ Consented ↔ 90%

Page 5

Session 3 Preparation #1

It is not intuitive whether 73% vs. 90% is a “real” difference, i.e., one that is reproducible or extrapolates to other persons.

153/209 144/160

73% ↔ Consented ↔ 90%

Page 6

Session 3 Preparation #1

Hypothesis testing compares 73% and 90%. It does not say how precise the %s are.

153/209 144/160

73% ↔ Consented ↔ 90%
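As a rough illustration of what a hypothesis test does with these two percentages, here is a minimal Python sketch of a two-proportion z-test using the counts on the slide. The choice of this particular test, the helper name `two_proportion_z`, and the normal approximation are assumptions made for illustration; the session does not say which test would be applied to these figures.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p) for comparing two proportions.

    Hypothetical helper: uses the pooled-proportion normal approximation.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common rate if the groups do not differ
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Counts from the slide: 153/209 (younger children) and 144/160 (older children)
z, p_value = two_proportion_z(153, 209, 144, 160)
print(f"73% vs. 90%: z = {z:.2f}, two-sided p = {p_value:.5f}")
```

The test only answers whether the two percentages differ by more than chance would explain; it says nothing about how precisely each percentage is estimated.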

Page 7

Session 3 Preparation

2. Look at the left side of the bottom panel of Figure 3 and recall what we have said about confidence intervals. Would you conclude that there is a change in hyperactivity under Mix A?

3. Repeat question 2 for placebo.

Page 8

Session 3 Preparation: #2 and #3

Page 9

Session 3 Preparation: #2 and #3

Possible values for real effect.

Zero is “ruled out”.

Page 10

Session 3 Preparation

4. Do you think that the positive conclusion for question #3 has been "proven"?

5. Do you think that the negative conclusion for question #2 has been "proven"?

Page 11

Session 3 Preparation

4. Do you think that the positive conclusion for question #3 has been "proven"? Yes, with 95% confidence.

5. Do you think that the negative conclusion for question #2 has been "proven"? No, since more subjects would give a narrower confidence interval.

Hypothesis testing makes a Yes or No conclusion about whether there is an effect, and quantifies the chances of a correct conclusion either way.

Confidence intervals give possible magnitudes of effects.

Page 12

Session 3 Goals

Statistical testing concepts

Three most common tests

Software

Equivalence of testing and confidence intervals

False positive and false negative conclusions

Page 13

Session 3 Data

For this session, we will focus on another paper for which I have the raw data.

Paper is posted on our class website.

Subjects were hospitalized for many days, with blood samples taken every 8 hours and vital signs recorded every hour.

A subject is classified as adrenal insufficient (AI) if 2 successive serum cortisols are low.

Page 14

Goal: Do Groups Differ By More than is Expected By Chance?

Cohan (2005) Crit Care Med;33:2358-66.

Page 15

Goal: Do Groups Differ By More than is Expected By Chance?

First, need to:

• Specify experimental units (Persons? Blood draws?).

• Specify single outcome for each unit (e.g., Yes/No, mean or min of several measurements?).

• Examine raw data, e.g., histogram, for meeting test requirements.

• Specify group summary measure to be used (e.g., % or mean, median over units).

• Choose particular statistical test for the outcome.

Page 16

Outcome Type → Statistical Test

Cohan (2005) Crit Care Med;33:2358-66.

Medians → Wilcoxon Test

%s → Chi-Square Test

Means → t Test

Page 17

Minimal MAP: Group Distributions of Individual Units

AI Group (N=42)

Stem  Leaf          #
   7  6             1
   7  11334         5
   6  555           3
   6  01112344      8
   5  5566778       7
   5  01222234      8
   4  57788         5
   4  23            2
   3  6             1
   3  13            2
Multiply Stem.Leaf by 10

Non-AI Group (N=38)

Stem  Leaf          #
   7  79            2
   7  00111234      8
   6  5556777888   10
   6  00112234      8
   5  67999         5
   5  3             1
   4  79            2
   4  04            2
Multiply Stem.Leaf by 10

→ Approximately normally distributed

→ Use means to summarize groups.

→ Use t-test to compare means.
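To make the link between the stem-and-leaf displays and the group means concrete, the following Python sketch rebuilds the individual values from the displays and summarizes them. The helper name `expand` is hypothetical, and the displayed values are assumed to be rounded to whole numbers, so the summaries will only approximate the ones software reports from the raw data.

```python
from statistics import mean, stdev

# (stem, leaves) pairs read off the displays; value = 10 * stem + leaf
ai_rows = [(7, "6"), (7, "11334"), (6, "555"), (6, "01112344"),
           (5, "5566778"), (5, "01222234"), (4, "57788"), (4, "23"),
           (3, "6"), (3, "13")]
non_ai_rows = [(7, "79"), (7, "00111234"), (6, "5556777888"),
               (6, "00112234"), (5, "67999"), (5, "3"),
               (4, "79"), (4, "04")]

def expand(rows):
    """Hypothetical helper: turn stem-and-leaf rows into a list of values."""
    return [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]

ai = expand(ai_rows)
non_ai = expand(non_ai_rows)

for name, values in (("AI", ai), ("Non-AI", non_ai)):
    print(f"{name}: N = {len(values)}, mean = {mean(values):.1f}, "
          f"SD = {stdev(values):.1f}")
```

Because both displays look roughly bell-shaped, the mean is a reasonable single summary for each group.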

Page 18

Goal: Do Groups Differ By More than is Expected By Chance?

Next, need to:

1. Calculate a standardized quantity for the particular test, a “test statistic”.

• Often: t=(Diff in Group Means)/SE(Diff)

2. Compare the test statistic to what it is expected to be if (populations represented by) groups do not differ. Often: t approximately follows a normal bell curve.

3. Declare groups to differ if test statistic is too deviant from expectations in (2) above.

• Often: absolute value of t >~2.

Page 19

t-Test for Minimal MAP: Step 1

1. Calculate a standardized quantity for the particular test, a “test statistic”.

Diff in Group Means = 63.4 - 56.2 = 7.2 (“Signal”)

SE(Diff) ≈ sqrt[SEM₁² + SEM₂²] = sqrt(1.66² + 1.41²) ≈ 2.2 (“Noise”)

Group    N    Mean         Std Dev      SE(Mean)
AI       42   56.1666667   10.7824634   1.66 = 10.78/√42
Non-AI   38   63.4122807    8.7141575   1.41 = 8.71/√38

→ Test Statistic = t = (7.2 - 0)/2.2 ≈ 3.28 (Signal to Noise Ratio)
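The signal-to-noise calculation on this slide can be sketched in Python from the summary statistics alone. This is only a sketch of the approximate formula shown above; the variable names are illustrative, and because it uses unrounded inputs the result differs slightly from the slide's rounded arithmetic.

```python
import math

# Summary statistics from the slide
n_ai, mean_ai, sd_ai = 42, 56.1666667, 10.7824634
n_non, mean_non, sd_non = 38, 63.4122807, 8.7141575

# Standard error of each group mean: SD / sqrt(N)
sem_ai = sd_ai / math.sqrt(n_ai)
sem_non = sd_non / math.sqrt(n_non)

signal = mean_non - mean_ai                   # difference in group means
noise = math.sqrt(sem_ai**2 + sem_non**2)     # approximate SE of the difference
t_approx = signal / noise                     # signal-to-noise ratio

print(f"SEM (AI) = {sem_ai:.2f}, SEM (non-AI) = {sem_non:.2f}")
print(f"signal = {signal:.1f}, noise = {noise:.1f}, t = {t_approx:.2f}")
```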

Page 20

t-Test for Minimal MAP: Step 2

2. Compare the test statistic to what it is expected to be if (populations represented by) groups do not differ. Often: t approximately follows a normal bell curve.

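[Figure: bell curve of expected t values; the central region is labeled “Expect, 0.95 chance” and the observed value, 3.28, is marked.]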

Expected values for test statistic if groups do not differ.

Area under sections of curve = probability of values in the interval.

(0.5 for 0 to ∞)

Prob (-2 to -1) is Area = 0.14
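The areas quoted on this slide can be checked with the standard normal curve in Python's standard library. This is a small sketch; the helper name `area` is hypothetical.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal bell curve: mean 0, SD 1

def area(low, high):
    """Hypothetical helper: probability of a value between low and high."""
    return z.cdf(high) - z.cdf(low)

upper_half = 1 - z.cdf(0)      # area from 0 to infinity
minus2_to_minus1 = area(-2, -1)
central = area(-2, 2)          # the region where t is expected to fall

print(f"0 to infinity: {upper_half:.2f}")
print(f"-2 to -1:      {minus2_to_minus1:.2f}")
print(f"-2 to 2:       {central:.3f}")
```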

Page 21

t-Test for Minimal MAP: Step 3

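[Figure: bell curve of expected t values; the central region is labeled “Expect, 95% chance” and the observed value, 3.28, is marked.]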

3. Declare groups to differ if test statistic is too deviant. [How much?]

Convention:

“Too deviant” is < 5% chance → |t| >~2.

“Two-tailed” = the 5% is allocated equally for either group to be superior.

2.5% in each tail.

Conclude: The groups differ, since a value of 3.28 or more extreme has a <5% chance if there is no difference in the entire populations.

Page 22

t-Test for Minimal MAP: p value

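[Figure: bell curve of expected t values; the central region is labeled “Expect, 95% chance” and the observed value, 3.28, is marked.]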

p-value:

Probability of a test statistic at least as deviant as observed, if populations really do not differ.

Smaller values ↔ more evidence of group differences.

Area = 0.0007 in each tail.

p value = 2(0.0007) = 0.0014 << 0.05

3. Declare groups to differ if test statistic is too deviant. [How much?]
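For readers who want to see where a p-value like this comes from, here is a self-contained Python sketch that integrates the t distribution's density numerically. It assumes a t distribution with 42 + 38 - 2 = 78 degrees of freedom, which is a guess about what the software used; the function names `t_pdf` and `two_sided_p` are hypothetical. Under that assumption the result should land close to the slide's 0.0014.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: twice the area beyond |t|.

    Uses Simpson's rule on [0, |t|]; steps must be even.
    """
    b = abs(t)
    h = b / steps
    total = t_pdf(0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 2 * (0.5 - area_0_to_t)

df = 42 + 38 - 2  # assumed degrees of freedom
p_value = two_sided_p(3.28, df)
print(f"two-sided p for t = 3.28 with {df} df: {p_value:.4f}")
```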

Page 23

t-Test: Technical Note

There are actually several types of t-tests:

• Equal vs. unequal variance (variance = SD²), depending on whether the SDs are too different between the groups. [Yes, there is another statistical test for comparing the SDs.]

SE(Diff) ≈ sqrt[SEM₁² + SEM₂²] = sqrt(1.66² + 1.41²) ≈ 2.2 is approximate. There are more complicated exact formulas that software implements.

Group    N    Mean         Std Dev      SE(Mean)
AI       42   56.1666667   10.7824634   1.66 = 10.78/√42
Non-AI   38   63.4122807    8.7141575   1.41 = 8.71/√38
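The two variants mentioned here can be compared directly from the summary statistics. The sketch below uses the standard textbook formulas for the equal-variance (pooled) and unequal-variance (Welch) t statistics; the function name `t_from_summary` is hypothetical. With these inputs the pooled version comes out near 3.28, which suggests, though the slides do not say so, that the reported t came from the equal-variance version.

```python
import math

def t_from_summary(n1, m1, s1, n2, m2, s2):
    """Hypothetical helper: pooled and Welch t statistics from N, mean, SD."""
    diff = m2 - m1

    # Equal-variance version: pool the two variances, weighted by their df
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    se_pooled = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Unequal-variance (Welch) version: add the squared SEMs
    v1, v2 = s1**2 / n1, s2**2 / n2
    se_welch = math.sqrt(v1 + v2)
    df_welch = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

    return {
        "t_pooled": diff / se_pooled,
        "df_pooled": n1 + n2 - 2,
        "t_welch": diff / se_welch,
        "df_welch": df_welch,
    }

result = t_from_summary(42, 56.1666667, 10.7824634, 38, 63.4122807, 8.7141575)
for key, value in result.items():
    print(f"{key}: {value:.2f}")
```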

Page 24

t-Test: Another Note

There are other types of t-tests:

• A two-sided t-test assumes that differences (between groups or pre-to-post) are possible in both directions, e.g., increase or decrease.

• A one-sided t-test assumes that these differences can occur in only one direction (only an increase or only a decrease), or that one group can only have higher (or only lower) responses than the other group. This is very rare, and generally not acceptable.

Page 25

Back to Paper: Normal Range

What is the “normal” range for lowest MAP in AI patients, i.e., 95% of subjects were in approximately what range?

Non-AI: N = 38, SD = 8.7. AI: N = 42, SD = 10.8.

Page 26

Back to Paper: Normal Range

What is the “normal” range for lowest MAP in AI patients, i.e., 95% of subjects were in approximately what range?

Answer: 56.2 ± 2(10.8) ≈ 35 to 78

Non-AI: N = 38, SD = 8.7. AI: N = 42, SD = 10.8.
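The "mean ± 2 SD" rule can be checked against the AI group's own data. In this sketch the 42 values are read off the stem-and-leaf display shown earlier in the session and are assumed to be rounded to whole numbers; the variable names are illustrative.

```python
# AI group values read off the stem-and-leaf display (rounded)
ai_values = [76, 71, 71, 73, 73, 74, 65, 65, 65,
             60, 61, 61, 61, 62, 63, 64, 64,
             55, 55, 56, 56, 57, 57, 58,
             50, 51, 52, 52, 52, 52, 53, 54,
             45, 47, 47, 48, 48, 42, 43, 36, 31, 33]

mean_ai, sd_ai = 56.2, 10.8                    # rounded values from the slide
low, high = mean_ai - 2 * sd_ai, mean_ai + 2 * sd_ai

inside = sum(low <= value <= high for value in ai_values)
print(f"normal range: {low:.0f} to {high:.0f}")
print(f"{inside} of {len(ai_values)} subjects "
      f"({100 * inside / len(ai_values):.0f}%) fall inside it")
```

Note that this range describes individual subjects and uses the SD, not the SE.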

Page 27

Back to Paper: Confidence Intervals

Δ= 63.4-56.2= 7.2 is the best guess for the MAP diff between the means of “all” AI and non-AI patients.

We are 95% sure that diff is within ≈ 7.2±2SE(Diff) = 7.2±2(2.2) = 2.8 to 11.6.

Non-AI: N = 38, SD = 8.7, SE = 1.41. AI: N = 42, SD = 10.8, SE = 1.66.

SE(Diff of Means) = 2.2, from SE(Diff) ≈ sqrt[SEM₁² + SEM₂²]
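The interval on this slide is simple arithmetic, sketched here in Python with the slide's rounded numbers. The function name `ci_for_difference` is hypothetical, and the multiplier of 2 is the slide's approximation rather than an exact critical value.

```python
def ci_for_difference(diff, se_diff, multiplier=2.0):
    """Hypothetical helper: approximate 95% confidence interval for a difference."""
    return diff - multiplier * se_diff, diff + multiplier * se_diff

diff = 63.4 - 56.2       # difference in group means
se_diff = 2.2            # SE of the difference, from the slide

lower, upper = ci_for_difference(diff, se_diff)
print(f"95% CI for the MAP difference: {lower:.1f} to {upper:.1f}")
```

In contrast to the normal range for individual subjects, this interval describes the uncertainty in the difference of means and therefore uses the SE.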

Page 28

Back to Paper: t-test

Δ= 7.2 is statistically significant (p=0.0014); i.e., only about 14 of 10,000 sets of 80 patients would differ so much, if AI and non-AI really don’t differ in MAP.

Is Δ= 7.2 clinically significant?

Page 29

Confidence Intervals ↔ Tests

[Figure from the hyperactivity paper: confidence intervals labeled p>0.05, p≈0.05, and p<0.05.]

Page 30

Confidence Intervals ↔ Tests

The Algebra:

Hypothesis Test (groups not declared different):

|Δ/SE(Δ)| = |t| < 2

is equivalent to:

|Δ| < 2 SE(Δ)

is equivalent to:

-2 SE(Δ) < Δ < 2 SE(Δ)

is equivalent to:

Δ - 2 SE(Δ) < 0 < Δ + 2 SE(Δ)

Confidence Interval (the interval contains 0).
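The algebra above can be spot-checked numerically. This sketch evaluates both forms of the rule on a few differences; apart from the first pair (7.2 and 2.2, from this session), the numbers are made up for illustration, and the function names are hypothetical.

```python
def not_significant(delta, se):
    """Hypothesis-test form: |t| < 2."""
    return abs(delta / se) < 2

def ci_contains_zero(delta, se):
    """Confidence-interval form: 0 lies inside delta ± 2 SE."""
    return delta - 2 * se < 0 < delta + 2 * se

# (difference, SE) pairs; only the first comes from the slides
cases = [(7.2, 2.2), (3.0, 2.2), (-5.0, 2.2), (-1.0, 2.2)]

for delta, se in cases:
    print(f"diff = {delta:5.1f}: |t| < 2 is {not_significant(delta, se)}, "
          f"CI contains 0 is {ci_contains_zero(delta, se)}")

agree = all(not_significant(d, s) == ci_contains_zero(d, s) for d, s in cases)
```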

Page 31

Confidence Intervals ↔ Tests

95% Confidence Intervals

Non-overlapping 95% confidence intervals, as here, are sufficient to conclude a significant (p<0.05) group difference.

However, non-overlap is not necessary: the intervals can overlap and the groups can still differ significantly.
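A small numeric example shows how two intervals can overlap while the difference is still significant. The means and standard errors below are made up purely for illustration, and the ±2 SE rule is the approximation used throughout this session.

```python
import math

def ci(mean, se):
    """Approximate 95% confidence interval for one group mean."""
    return mean - 2 * se, mean + 2 * se

# Hypothetical groups (not from either paper)
mean_a, se_a = 10.0, 1.0
mean_b, se_b = 13.0, 1.0

ci_a, ci_b = ci(mean_a, se_a), ci(mean_b, se_b)
overlap = ci_a[1] >= ci_b[0] and ci_b[1] >= ci_a[0]

# The test uses the SE of the difference, which is smaller than se_a + se_b
se_diff = math.sqrt(se_a**2 + se_b**2)
t = (mean_b - mean_a) / se_diff

print(f"CI A = {ci_a}, CI B = {ci_b}, overlap = {overlap}")
print(f"t = {t:.2f}, significant = {abs(t) > 2}")
```

The intervals overlap because each one is built from its own SE, while the test divides by sqrt(SE₁² + SE₂²), which is less than SE₁ + SE₂.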

Page 32

Back to Paper: Experimental Units

Cannot use t-test for comparing lab data for multiple blood draws per subject.

[Table footnote b: at least 100 μg/kg/min of propofol administered at the time of blood draw, or any pentobarbital in the 48 hrs before the blood draw.]

Generalization of t-test

Page 33

Tests on Percentages

Is 26.3% vs. 61.9% statistically significant (p<0.05), i.e., is the difference so large that it has a <5% chance of occurring if the groups do not really differ?

Solution: Same theme as for means. Find a test statistic and compare to its expected values if groups do not differ.

See next slide.

Page 34

Tests on Percentages

Cannot use t-test for comparing lab data for multiple blood draws per subject.

[Figure: chi-square distribution; expected value 1; the region below 5.99 is labeled “95% chance”; the observed value, 10.2, is marked, with area = 0.002 beyond it.]

Here, the signal in the test statistic is a squared quantity, expected to be 1.

Test statistic=10.2 >> 5.99, so p<0.05. In fact, p=0.002.

Page 35

Tests on Percentages: Chi-Square

The chi-square test statistic (10.2 in the example) is found by first calculating the expected number of AI patients with MAP <60, and the same for non-AI patients, if AI and non-AI really do not differ in this respect.

Then, chi-square is found as the sum of the standardized squared differences, (Observed – Expected)²/Expected.

This should be close to 1, as in the graph on the previous slide, if groups do not differ. The value 10.2 seems too big to have happened by chance (probability=0.002) if there is no difference among “all” TBI subjects.
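The calculation described above can be sketched in Python. The counts are not given on the slides; 26 of 42 AI patients and 10 of 38 non-AI patients are inferred here from the quoted 61.9% and 26.3%. The p-value line assumes a chi-square distribution with 1 degree of freedom and no continuity correction, which is an assumption about the method; it gives a value close to, but not identical with, the slide's p = 0.002.

```python
import math

# Rows: AI, non-AI. Columns: MAP < 60, MAP >= 60. Counts inferred from the %s.
observed = [[26, 42 - 26],
            [10, 38 - 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if AI and non-AI do not differ
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

# Upper-tail area for a chi-square with 1 df (assumed): erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.1f}, p = {p_value:.4f}")
```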

Page 36

Back to t-Test

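[Figure: bell curve of expected t values; the central region is labeled “Expect, 95% chance” and the observed value, 3.28, is marked.]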

Declare groups to differ if test statistic is too deviant.

Convention:

“Too deviant” is < 5% chance → |t| >~2.

Why not choose, say, |t|>3, so that our chances of being wrong are even less, <1%?

2.5% in each tail.

How much “deviance” is enough proof?

Page 37

Back to t-Test

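[Figure: bell curve of expected t values; the central region is labeled “Expect, >99% chance” and the observed value, 3.28, is marked.]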

Convention:

“Too deviant” is < 5% chance → |t| >~2.

Why not choose, say, |t|>3, so that our chances of being wrong are even less, <1%?

<0.5% in each tail.

Answer: Then the chance of missing a real difference, the converse wrong conclusion (a false negative), is increased.

This is analogous to setting the threshold for a diagnostic test of disease.
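The trade-off can be made concrete with the normal approximation used throughout this session. In the sketch below, the true signal-to-noise ratio of 3 is a made-up value for illustration, t is assumed to be approximately normal with SD 1, and the function names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal bell curve

def false_positive_rate(cutoff):
    """Chance that |t| > cutoff when the groups really do not differ."""
    return 2 * (1 - Z.cdf(cutoff))

def chance_of_detection(cutoff, true_ratio):
    """Chance that |t| > cutoff when t is centered at true_ratio (SD 1)."""
    return (1 - Z.cdf(cutoff - true_ratio)) + Z.cdf(-cutoff - true_ratio)

true_ratio = 3.0  # hypothetical real effect, in units of SE(Diff)
for cutoff in (2, 3):
    fp = false_positive_rate(cutoff)
    detect = chance_of_detection(cutoff, true_ratio)
    print(f"|t| > {cutoff}: false positive = {fp:.4f}, "
          f"real effect detected = {detect:.2f}, missed = {1 - detect:.2f}")
```

Raising the cutoff from 2 to 3 lowers the false positive rate, but in this example it also raises the chance of missing the real effect considerably, much as a stricter diagnostic threshold trades false positives for false negatives.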

Page 38

Power of a Study

Statistical power is the sensitivity of a study to detect real effects, if they exist.

It needs to be balanced with the likelihood of wrongly declaring effects when they are non-existent. Today, we have been keeping that error at <5%.

Power is the topic of the next session, #4.