Biostatistics in Practice Session 3: Testing Hypotheses Peter D. Christenson Biostatistician


http://research.LABioMed.org/Biostat

Session 3 Preparation

We have been using a recent study on hyperactivity for the concepts in this course. The questions below based on this paper are intended to prepare you for session 3.


1. From Figures 1 and 2, we see that 153/209 = 73% of parents of the younger children and 144/160 = 90% of parents of the older children initially were interested but did not participate. Does it seem logical that the rate is lower for the 3-year-olds? Do you have any intuition on whether the magnitude of the 73% vs. 90% difference is enough to support an age difference, regardless of the logical reason?

Session 3 Preparation #1

Consented: 153/209 = 73% (younger children) vs. 144/160 = 90% (older children).

It is not intuitive whether 73% vs. 90% is a “real” difference, i.e., one that is reproducible or extrapolates to other persons.



Hypothesis testing compares 73% and 90%. It does not say how precise the %s are.
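As a rough illustration of what such a comparison involves, the sketch below applies a standard two-proportion z test to the counts 153/209 and 144/160. The function name `two_proportion_z` is made up for this example, and the normal approximation is an assumption; it is not necessarily the test the paper or this course uses.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Return (z, two-sided p) for comparing two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # Common rate that both groups would share if they do not really differ.
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # Two-sided tail area under the normal bell curve.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Counts from the preparation question: 153/209 (younger) and 144/160 (older).
z, p_value = two_proportion_z(153, 209, 144, 160)
print(f"z = {z:.2f}, two-sided p = {p_value:.5f}")
```

A |z| well above 2 would suggest that a 73% vs. 90% difference of this size is unlikely to arise by chance alone.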




2. Look at the left side of the bottom panel of Figure 3 and recall what we have said about confidence intervals. Would you conclude that there is a change in hyperactivity under Mix A?

3. Repeat question 2 for placebo.

Session 3 Preparation: #2 and #3


Possible values for real effect.

Zero is “ruled out”.




4. Do you think that the positive conclusion for question #3 has been "proven"? Yes, with 95% confidence.

5. Do you think that the negative conclusion for question #2 has been "proven"? No, since more subjects would give a narrower confidence interval.

Hypothesis testing makes a Yes or No conclusion about whether there is an effect and quantifies the chances of a correct conclusion either way.

Confidence intervals give possible magnitudes of effects.

Session 3 Goals

Statistical testing concepts

Three most common tests

Software

Equivalence of testing and confidence intervals

False positive and false negative conclusions

Session 3 Data

For this session, we will focus on another paper for which I have the raw data.

Paper is posted on our class website.

Subjects were hospitalized for many days, blood samples taken every 8 hours and vital signs recorded every hour.

Subject is adrenal insufficient if 2 successive serum cortisols are low.

Goal: Do Groups Differ By More than is Expected By Chance?

Cohan (2005) Crit Care Med;33:2358-66.


First, need to:

• Specify experimental units (Persons? Blood draws?).

• Specify single outcome for each unit (e.g., Yes/No, mean or min of several measurements?).

• Examine raw data, e.g., histogram, for meeting test requirements.

• Specify group summary measure to be used (e.g., % or mean, median over units).

• Choose particular statistical test for the outcome.

Outcome Type → Statistical Test

Medians → Wilcoxon test
%s → Chi-square test
Means → t-test

Minimal MAP: Group Distributions of Individual Units

AI Group (N=42)

Stem  Leaf          #
7     6             1
7     11334         5
6     555           3
6     01112344      8
5     5566778       7
5     01222234      8
4     57788         5
4     23            2
3     6             1
3     13            2
Multiply Stem.Leaf by 10

Non-AI Group (N=38)

Stem  Leaf          #
7     79            2
7     00111234      8
6     5556777888    10
6     00112234      8
5     67999         5
5     3             1
4     79            2
4     04            2
Multiply Stem.Leaf by 10

→ Approximately normally distributed

→ Use means to summarize groups.

→ Use t-test to compare means.
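The group summaries can be approximately recovered from the stem-and-leaf displays. The sketch below expands each stem/leaf pair into a value (stem × 10 + leaf) and computes the mean and SD. The helper name `expand` is made up for this example, and because the display shows rounded values, the results only approximate the statistics that software reports from the raw data.

```python
from statistics import mean, stdev

def expand(rows):
    """Turn (stem, leaves) rows into values: stem * 10 + each leaf digit."""
    return [stem * 10 + int(leaf) for stem, leaves in rows for leaf in leaves]

# Rows transcribed from the stem-and-leaf displays above.
ai_rows = [(7, "6"), (7, "11334"), (6, "555"), (6, "01112344"),
           (5, "5566778"), (5, "01222234"), (4, "57788"), (4, "23"),
           (3, "6"), (3, "13")]
non_ai_rows = [(7, "79"), (7, "00111234"), (6, "5556777888"),
               (6, "00112234"), (5, "67999"), (5, "3"), (4, "79"), (4, "04")]

ai = expand(ai_rows)
non_ai = expand(non_ai_rows)

for label, values in (("AI", ai), ("Non-AI", non_ai)):
    print(f"{label}: N = {len(values)}, mean = {mean(values):.1f}, "
          f"SD = {stdev(values):.1f}")
```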


Next, need to:

1. Calculate a standardized quantity for the particular test, a “test statistic”.

• Often: t=(Diff in Group Means)/SE(Diff)

2. Compare the test statistic to what it is expected to be if (the populations represented by) the groups do not differ. Often: t approximately follows a normal bell curve.

3. Declare groups to differ if test statistic is too deviant from expectations in (2) above.

• Often: absolute value of t >~2.

t-Test for Minimal MAP: Step 1

1. Calculate a standardized quantity for the particular test, a “test statistic”.

Diff in Group Means = 63.4 - 56.2 = 7.2 (“Signal”)

SE(Diff) ≈ sqrt[SEM₁² + SEM₂²] = sqrt(1.66² + 1.41²) ≈ 2.2 (“Noise”)

AI: N = 42, Mean = 56.1666667, Std Dev = 10.7824634, SE(Mean) = 10.78/√42 = 1.66

Non-AI: N = 38, Mean = 63.4122807, Std Dev = 8.7141575, SE(Mean) = 8.71/√38 = 1.41

→ Test Statistic = t = (7.2 - 0)/2.2 ≈ 3.28

Signal to Noise Ratio
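The Step 1 arithmetic can be sketched from the summary statistics alone. The variable names are made up for this example. Two versions of SE(Diff) are shown: the approximate formula used on this slide, and the equal-variance (pooled) formula, which is included as an assumption because it gives a value close to the 3.28 quoted here.

```python
import math

# Summary statistics as printed on the slide.
n_ai, mean_ai, sd_ai = 42, 56.1666667, 10.7824634
n_non, mean_non, sd_non = 38, 63.4122807, 8.7141575

# Standard error of each group mean: SD / sqrt(N).
sem_ai = sd_ai / math.sqrt(n_ai)
sem_non = sd_non / math.sqrt(n_non)

diff = mean_non - mean_ai                       # "signal"

# Approximate formula from the slide: sqrt(SEM1^2 + SEM2^2).
se_diff = math.sqrt(sem_ai**2 + sem_non**2)     # "noise"
t_approx = diff / se_diff

# Equal-variance (pooled) version, as one of the exact formulas software may use.
pooled_var = ((n_ai - 1) * sd_ai**2 + (n_non - 1) * sd_non**2) / (n_ai + n_non - 2)
se_pooled = math.sqrt(pooled_var * (1 / n_ai + 1 / n_non))
t_pooled = diff / se_pooled

print(f"SEM: {sem_ai:.2f} (AI), {sem_non:.2f} (Non-AI)")
print(f"Approximate: SE(Diff) = {se_diff:.2f}, t = {t_approx:.2f}")
print(f"Pooled:      SE(Diff) = {se_pooled:.2f}, t = {t_pooled:.2f}")
```

Both versions give a signal-to-noise ratio a little above 3, so the conclusion does not depend on which formula is used here.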


2. Compare the test statistic to what it is expected to be if (the populations represented by) the groups do not differ. Often: t approximately follows a normal bell curve.

[Figure: bell curve; central region labeled “Expect, 0.95 Chance”; observed t = 3.28 marked.]

Expected values for test statistic if groups do not differ.

Area under sections of curve = probability of values in the interval.

(0.5 for 0 to ∞)

Prob (-2 to -1) is Area = 0.14
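The areas quoted above can be checked with the standard normal curve. This is a minimal sketch using only the Python standard library; the helper names `normal_cdf` and `area` are made up for this example.

```python
import math

def normal_cdf(x):
    """Area under the standard normal curve to the left of x."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def area(low, high):
    """Probability that a standard normal value falls between low and high."""
    return normal_cdf(high) - normal_cdf(low)

print(f"P(0 to infinity) = {area(0, math.inf):.2f}")
print(f"P(-2 to -1)      = {area(-2, -1):.2f}")
print(f"P(-2 to 2)       = {area(-2, 2):.3f}")
```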


[Figure: bell curve; central region labeled “Expect, 95% Chance”; observed t = 3.28 marked.]

3. Declare groups to differ if test statistic is too deviant. [How much?]

Convention:

“Too deviant” is < 5% chance → |t| >~2.

“Two-tailed” = the 5% is allocated equally for either group to be superior.

2.5% in each tail.

Conclude: Groups differ, since a value ≥3.28 has <5% chance of occurring if there is no difference in the entire populations.

t-Test for Minimal MAP: p value



p-value:

Probability of a test statistic at least as deviant as observed, if populations really do not differ.

Smaller values ↔ more evidence of group differences.

Area = 0.0007 in each tail

p value = 2(0.0007) = 0.0014 <<0.05
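The definition of the p value can be illustrated with the normal-curve approximation from step 2. This is only a sketch: the function name is made up, and the normal curve has slightly thinner tails than the t distribution that software typically uses, so the result is somewhat smaller than the 0.0014 quoted above.

```python
import math

def two_sided_p_normal(t):
    """Probability of a value at least as deviant as t, in either direction,
    under the standard normal curve."""
    return math.erfc(abs(t) / math.sqrt(2))

p = two_sided_p_normal(3.28)
print(f"Two-sided p (normal approximation) = {p:.4f}")
print("Declare groups to differ" if p < 0.05 else "Do not declare a difference")
```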


t-Test: Technical Note

There are actually several types of t-tests:

• Equal vs. unequal variance (variance = SD²), depending on whether the SDs are too different between the groups. [Yes, there is another statistical test for comparing the SDs.]

SE(Diff) ≈ sqrt[SEM₁² + SEM₂²] = sqrt(1.66² + 1.41²) ≈ 2.2 is approximate. There are more complicated exact formulas that software implements.

AI: N = 42, Mean = 56.1666667, Std Dev = 10.7824634, SE(Mean) = 10.78/√42 = 1.66

Non-AI: N = 38, Mean = 63.4122807, Std Dev = 8.7141575, SE(Mean) = 8.71/√38 = 1.41

t-Test: Another Note

There are other types of t-tests:

• A two-sided t-test assumes that differences (between groups or pre-to-post) are possible in both directions, e.g., increase or decrease.

• A one-sided t-test assumes that these differences can only be either an increase or decrease, or one group can only have higher or lower responses than the other group. This is very rare, and generally not acceptable.

Back to Paper: Normal Range

What is the “normal” range for lowest MAP in AI patients, i.e., 95% of subjects were in approximately what range?

Answer: 56.2 ± 2(10.8) ≈ 35 to 78

Non-AI: N = 38, SD = 8.7. AI: N = 42, SD = 10.8.

Back to Paper: Confidence Intervals

Δ= 63.4-56.2= 7.2 is the best guess for the MAP diff between the means of “all” AI and non-AI patients.

We are 95% sure that diff is within ≈ 7.2±2SE(Diff) = 7.2±2(2.2) = 2.8 to 11.6.
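The two "± 2" calculations use different measures of spread: the normal range for individual subjects uses the SD, while the confidence interval for the difference in means uses the SE. A small sketch of both, with the made-up helper name `plus_minus_two`:

```python
def plus_minus_two(center, spread):
    """Return (center - 2*spread, center + 2*spread)."""
    return center - 2 * spread, center + 2 * spread

# Normal range for individual AI patients: mean ± 2 SD.
range_low, range_high = plus_minus_two(56.2, 10.8)

# Approximate 95% confidence interval for the difference in means: diff ± 2 SE(Diff).
ci_low, ci_high = plus_minus_two(7.2, 2.2)

print(f"Normal range (AI): {range_low:.1f} to {range_high:.1f}")
print(f"95% CI for difference: {ci_low:.1f} to {ci_high:.1f}")
```

The confidence interval is much narrower than the normal range because the SE shrinks as the number of subjects grows, while the SD does not.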

Non-AI: N = 38, SD = 8.7, SE = 1.41. AI: N = 42, SD = 10.8, SE = 1.66.

SE(Diff of Means) = 2.2, from SE(Diff) ≈ sqrt[SEM₁² + SEM₂²]

Back to Paper: t-test

Δ= 7.2 is statistically significant (p=0.0014); i.e., only about 14 of 10,000 sets of 80 patients would differ so much, if AI and non-AI really don’t differ in MAP.

Is Δ= 7.2 clinically significant?

Confidence Intervals ↔ Tests

[Figure from the hyperactivity paper: confidence intervals illustrating p>0.05, p≈0.05, and p<0.05.]


The Algebra:

|Δ/SE(Δ)| = |t| < 2 (hypothesis test: no significant difference)

is equivalent to:

|Δ| < 2 SE(Δ)

is equivalent to:

-2 SE(Δ) < Δ < 2 SE(Δ)

is equivalent to:

Δ - 2 SE(Δ) < 0 < Δ + 2 SE(Δ) (confidence interval: includes 0)
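The equivalence can be checked numerically. In this sketch the first (Δ, SE) pair is the MAP example from the paper; the other pairs are hypothetical values chosen only to cover both outcomes, and the function names are made up for illustration.

```python
def not_significant(delta, se):
    """Hypothesis-test view: |t| = |delta / se| < 2."""
    return abs(delta / se) < 2

def ci_includes_zero(delta, se):
    """Confidence-interval view: delta - 2*se < 0 < delta + 2*se."""
    return delta - 2 * se < 0 < delta + 2 * se

# (delta, SE): the first pair is from the paper; the rest are hypothetical.
examples = [(7.2, 2.2), (3.0, 2.2), (-5.0, 2.0), (-1.0, 2.0)]

agreement = []
for delta, se in examples:
    test_view = not_significant(delta, se)
    ci_view = ci_includes_zero(delta, se)
    agreement.append(test_view == ci_view)
    print(f"delta = {delta:5.1f}, SE = {se}: |t| < 2 is {test_view}, "
          f"CI includes 0 is {ci_view}")
```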


95% Confidence Intervals

Non-overlapping 95% confidence intervals, as here, are sufficient for significant (p<0.05) group differences.

However, non-overlapping is not necessary. The intervals can overlap and the groups can still differ significantly.
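A small numeric example shows how overlap and significance can coexist. The means and SEs below are hypothetical, chosen only for illustration; SE(Diff) uses the same approximate formula as earlier in this session.

```python
import math

def approx_ci(mean, se):
    """Approximate 95% confidence interval: mean ± 2 SE."""
    return mean - 2 * se, mean + 2 * se

# Hypothetical group summaries (not from the paper).
mean_a, se_a = 10.0, 1.0
mean_b, se_b = 13.2, 1.0

ci_a = approx_ci(mean_a, se_a)
ci_b = approx_ci(mean_b, se_b)
intervals_overlap = ci_a[1] > ci_b[0] and ci_b[1] > ci_a[0]

# The test uses SE(Diff), which is smaller than the sum of the two SEs.
se_diff = math.sqrt(se_a**2 + se_b**2)
t = (mean_b - mean_a) / se_diff

print(f"CI A: {ci_a}, CI B: {ci_b}, overlap: {intervals_overlap}")
print(f"t = {t:.2f}, significant: {abs(t) > 2}")
```

The intervals overlap because each one spans 2 SE on both sides of its mean, whereas the test only requires the difference to exceed 2 × SE(Diff), which is less than 2 × (SE₁ + SE₂).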

Back to Paper: Experimental Units

Cannot use t-test for comparing lab data for multiple blood draws per subject.

b: at least 100 µg/kg/min of propofol administered at the time of blood draw, or any pentobarbital in the 48 hrs before the blood draw

Generalization of t-test

Tests on Percentages

Is 26.3% vs. 61.9% statistically significant (p<0.05), i.e., is the difference so large that it has a <5% chance of occurring if groups do not really differ?

Solution: Same theme as for means. Find a test statistic and compare to its expected values if groups do not differ.

See next slide.

Tests on Percentages

[Figure: Chi-Square Distribution; labels: Expect 1, 95% Chance, 5.99, Observed = 10.2, Area = 0.002.]

Here, the signal in the test statistic is a squared quantity, expected to be 1.

Test statistic=10.2 >> 5.99, so p<0.05. In fact, p=0.002.

Tests on Percentages: Chi-Square

The chi-square test statistic (10.2 in the example) is found by first calculating the expected number of AI patients with MAP <60, and the same for non-AI patients, if AI and non-AI really do not differ in this respect.

Then, chi-square is found as the sum of standardized (Observed – Expected)², i.e., each (Observed – Expected)² divided by its Expected count.

This should be close to 1, as in the graph on the previous slide, if groups do not differ. The value 10.2 seems too big to have happened by chance (probability=0.002) if there is no difference among “all” TBI subjects.
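The calculation can be sketched as follows. The counts are an assumption: they are inferred from the percentages and group sizes (26/42 = 61.9% of AI and 10/38 = 26.3% of non-AI patients with MAP <60), not taken directly from the paper. The p value uses the chi-square distribution with one degree of freedom and may differ slightly from the 0.002 quoted above, depending on the exact method the software used.

```python
import math

# Assumed counts, inferred from the percentages: (MAP < 60, MAP >= 60).
observed = {"AI": (26, 16), "Non-AI": (10, 28)}

row_totals = {group: sum(counts) for group, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
n = sum(row_totals.values())

chi_square = 0.0
for group, counts in observed.items():
    for j, obs in enumerate(counts):
        # Expected count if AI and non-AI do not differ.
        expected = row_totals[group] * col_totals[j] / n
        chi_square += (obs - expected) ** 2 / expected

# Upper-tail area for chi-square with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.1f}, p = {p_value:.4f}")
```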

Back to t-Test

Declare groups to differ if test statistic is too deviant. How much “deviance” is enough proof?

Convention: “too deviant” is < 5% chance, with 2.5% in each tail → |t| >~2.

Why not choose, say, |t|>3, so that our chances of being wrong are even less, <1%?

[Figure: bell curve; central region labeled “Expect, >99% Chance”; <0.5% in each tail; observed t = 3.28 marked.]

Answer: Then the chances of missing a real difference are increased, which is the converse wrong conclusion (a false negative).

This is analogous to setting the threshold for a diagnostic test of disease.

Power of a Study

Statistical power is the sensitivity of a study to detect real effects, if they exist.

It needs to be balanced with the likelihood of wrongly declaring effects when they are non-existent. Today, we have been keeping that error at <5%.
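The trade-off can be illustrated with the normal-curve approximation. In this sketch, `true_effect` is a hypothetical value for the expected test statistic when a real difference exists; it is not taken from the paper, and the function names are made up for illustration.

```python
import math

def normal_cdf(x):
    """Area under the standard normal curve to the left of x."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def false_positive_rate(cutoff):
    """Chance that |t| > cutoff when groups truly do not differ."""
    return 2 * (1 - normal_cdf(cutoff))

def power(cutoff, true_effect):
    """Chance that |t| > cutoff when the expected value of t is true_effect."""
    return (1 - normal_cdf(cutoff - true_effect)) + normal_cdf(-cutoff - true_effect)

true_effect = 3.0  # hypothetical expected t under a real difference

for cutoff in (2.0, 3.0):
    print(f"cutoff |t| > {cutoff}: "
          f"false positive rate = {false_positive_rate(cutoff):.4f}, "
          f"power = {power(cutoff, true_effect):.2f}, "
          f"chance of missing the effect = {1 - power(cutoff, true_effect):.2f}")
```

Raising the cutoff from 2 to 3 lowers the false positive rate but, for this hypothetical effect, leaves only about an even chance of detecting it.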

Power is the topic of the next session, #4.
