Sample Size and Power Calculations
Andy Avins, MD, MPH
Kaiser Permanente Division of Research
University of California, San Francisco
Why do Power Calculations?
- Underpowered studies have a high chance of missing a real and important result, and the risk of misinterpretation is still very high
- Overpowered studies waste precious research resources and may delay getting the answer

[BTW: “Sample Size Calculations” ≡ “(Statistical) Power Calculations”]
A Few Little Secrets
- Why, yes, power calculations are basically crap
  - Nothing more than educated guesses
  - Playing games with power calculations has a long and glorious history
- So why do them?
  - You got a better idea?
  - Review committees don’t give you a choice
  - Can be enlightening, even if imperfect
Power Calculations
Purpose:
- Try to rationally guess the optimal number of participants to recruit (Goldilocks principle)
- Understand the sample size implications of alternative study designs
The Problem of Uncertainty
- If everyone responded the same way to treatment, we’d only need to study two people (one intervention, one control)
- Uncertainty creeps in:
  - When we draw a sample from a population
  - When we allocate participants to different comparison groups (random or not)
Thought Experiment
- We have 400 participants
- Randomly allocate them to 2 groups
- Test whether the Group 1 tx does better than the Group 2 tx
- Throw all participants back in the pool, re-allocate them, and repeat the experiment
- Truth: the Group 1 tx IS better than the Group 2 tx (we know this)
Thought Experiment
- Clinical Trial Run #1: 1>2 [Correct]
- Clinical Trial Run #2: 1>2 [Correct]
- Clinical Trial Run #3: 1=2 [Incorrect]
- Clinical Trial Run #4: 1>2 [Correct]
- Clinical Trial Run #5: 1=2 [Incorrect]
- Clinical Trial Run #6: 2>1 [Incorrect]
- Clinical Trial Run #7: 1>2 [Correct]
- Clinical Trial Run #8: 1>2 [Correct]
- ………
Thought Experiment
- Repeat the thought experiment an infinite number of times
- Result:
  - 70% of runs show 1>2 (correct result)
  - 30% of runs show 1=2 or 1<2 (incorrect result)
Thought Experiment
- The POWER of doing this clinical trial with 400 participants is 70%
- Any ONE run of this clinical trial (with 400 participants) has about a 70% chance of producing the correct result
- Any ONE run of this clinical trial (with 400 participants) has about a 30% chance of producing the wrong result
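This thought experiment can actually be run on a computer. The sketch below is a minimal Monte Carlo version under assumed numbers (200 participants per group, outcomes on a standardized scale with SD = 1, and a true group difference of 0.25, chosen because it yields roughly 70% power at the usual two-sided alpha of .05):

```python
import random
from statistics import NormalDist, mean

def one_trial(n_per_group=200, true_diff=0.25, rng=random):
    """Simulate one run of the trial: two groups of n, SD = 1,
    analyzed with a known-variance two-sample z-test."""
    g1 = [rng.gauss(true_diff, 1) for _ in range(n_per_group)]  # Group 1 tx truly better
    g2 = [rng.gauss(0, 1) for _ in range(n_per_group)]
    z = (mean(g1) - mean(g2)) / (2 / n_per_group) ** 0.5
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    # "Correct" result: statistically significant AND in the true direction (1 > 2)
    return p < 0.05 and mean(g1) > mean(g2)

random.seed(42)
runs = 2000  # stand-in for "infinite" repetitions
power_est = sum(one_trial() for _ in range(runs)) / runs
print(f"Estimated power with 400 total participants: {power_est:.0%}")
```

With these assumed numbers, roughly 70% of runs come out "correct" and 30% do not, which is exactly what "the power of this trial is 70%" means.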
Bottom Line
If you only have a 70% chance of showing what you want to show (and you only have $$ for 400 participants):
Should you bother doing the study??
Power Calculations
Sample size calculations are all about making educated guesses to help ensure that our study:
A) Has a sufficiently high chance of finding an effect when one exists
B) Is not “over-powered” and wasteful
Error Terminology
Two types of statistical errors:
- “Type I Error” ≡ “Alpha Error”: the probability of finding a statistically significant effect when the truth is that there is no effect
- “Type II Error” ≡ “Beta Error”: the probability of not finding a statistically significant effect when one really does exist

The goal is to minimize both types of errors
Error Terminology
                          Truth of Association
Observed Association      Effect exists         No effect
Reject Ho                 Correct               Type I Error (α)
Do not reject Ho          Type II Error (β)     Correct

- Type I error (α): reject Ho when we shouldn’t (this is fixed at 5%)
- Type II error (β): don’t reject Ho when we should (not fixed; this is a function of sample size)
Hypothesis Testing
- Power calculations are based on the principles of hypothesis testing
- A research question ≠ a hypothesis
- A hypothesis is a device for forcing you to be explicit about what you want to show
Hypothesis Testing: Mechanics
1) Define the “Null Hypothesis” (Ho)
- Generally Ho = “no effect” or “no association”
- Assume it’s true
- Basically, a straw man (set it up so we can knock it down), like reductio ad absurdum in geometry
- Example: There is no difference in the risk of stroke between statin-treated participants and placebo-treated participants.
Hypothesis Testing: Mechanics
2) Define the “Alternative Hypothesis” (HA)
- A finding of some effect
- Can be one-sided or two-sided
  - One-sided: better/greater/more, or worse/less
  - Two-sided: different
- Which to choose?
  - One-sided: only when the other direction is biologically impossible, or when you truly don’t care about the other possibility (careful!)
  - It is easier to get “statistical significance” with a one-sided HA
  - When in doubt, choose a two-sided HA
- Example: Statin treatment results in a different risk of stroke compared to placebo treatment.
Hypothesis Testing: Mechanics
3) Define a decision rule
- Virtually always: reject the null hypothesis if p < .05
- This cutpoint (.05) = the “alpha level”
Hypothesis Testing: Mechanics
4) Calculate the p-value
- Assume that Ho is true
- Do the study / gather the data
- Calculate the probability that we’d see (by chance) something at least as extreme as what we observed IF Ho were true
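These steps can be made concrete with the running statin/stroke example. The counts below are hypothetical, invented purely for illustration, and the analysis shown is a standard two-sample test of proportions (normal approximation), not any specific trial's analysis:

```python
from statistics import NormalDist

# Hypothetical data (illustration only): strokes observed in each arm
strokes_statin, n_statin = 30, 1000
strokes_placebo, n_placebo = 50, 1000

p1 = strokes_statin / n_statin    # observed risk, statin arm
p2 = strokes_placebo / n_placebo  # observed risk, placebo arm
pooled = (strokes_statin + strokes_placebo) / (n_statin + n_placebo)

# Standard error of the risk difference computed under Ho (no true difference)
se = (pooled * (1 - pooled) * (1 / n_statin + 1 / n_placebo)) ** 0.5
z = (p2 - p1) / se

# Two-sided p-value: chance of a result at least this extreme IF Ho were true
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these made-up counts the p-value lands below .05, so the decision rule in the next step would reject Ho.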
Hypothesis Testing: Mechanics
5) Apply the decision rule
- If p < cutpoint (.05), REJECT Ho (i.e., we assert that there is an effect)
- If p ≥ cutpoint (.05), DO NOT REJECT Ho (i.e., we do not assert that there is an effect)
- Note: this is different from asserting that there is no effect (you can never prove the null)
Terminology Review
- Null and Alternative Hypotheses
- One-sided and two-sided HA
- Type I Error (α)
- Type II Error (β)
- Power: 1 − β
- p-value
- Effect size
Ingredients Needed to Calculate the Sample Size
Need to know / decide:
- Effect size: d
- SD(d)
- Standardized effect size: d / SD(d)
- Cutpoint for our decision rule
- Power we want for the study
- What statistical test we’re going to use
We can use all this information to calculate our optimal sample size
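For the common case of comparing two group means, these ingredients combine in a standard closed-form approximation: n per group ≈ 2 · ((z₁₋α/2 + z_power) / (d / SD))². A minimal sketch (the function name is ours; the formula is the usual normal approximation to the two-sample t-test):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison
    of means (normal approximation to the t-test)."""
    z = NormalDist().inv_cdf
    std_effect = d / sd  # standardized effect size
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / std_effect) ** 2)

# A medium standardized effect (0.5) at the usual alpha:
print(n_per_group(d=0.5, sd=1.0))              # 63 per group (80% power)
print(n_per_group(d=0.5, sd=1.0, power=0.90))  # 85 per group (90% power)
```

Note how the formula encodes the slide's warnings: a smaller d (or larger SD) blows up n, and asking for more power does too.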
Where the Pieces Come From: d
- d is the “effect size”
- d should be set as the “minimum clinically important difference” (MCID): the smallest difference between groups that you would care about
- Sources:
  - Colleagues (what’s accepted in the clinical community)
  - Literature (what others have done)
- Smaller differences require larger sample sizes to detect (danger: don’t fool yourself)
Where the Pieces Come From: SD(d)
- The standard deviation of d
- Generally based on prior research:
  - Literature (often not stated, but it can be derived)
  - Contacting other investigators
  - +/- a pilot study
Where the Pieces Come From: Cutpoint
- Easy: written in stone (thanks, RA Fisher)
- Alpha = 0.05
- Need to state if one-sided or two-sided
Where the Pieces Come From: Power
- Higher is better; you decide
- Rules of thumb:
  - 80% is the minimum reviewers will accept
  - You rarely need to propose >90%
- Greater power requires larger samples
Where the Pieces Come From: Statistical Test
A function of the data types you will analyze:

                        Outcome Variable
Predictor Variable      Dichotomous     Continuous
Dichotomous             Chi-square      t-test
Continuous              t-test          Correlation coefficient
Finally, Some Good News
Someone else has done these calculations for us:
- “Sample size tables”
  - DCR, Appendix 6A–6E (pp. 84–91)
  - Entire books
- Power analysis software
  - PASS, nQuery, Power & Precision, etc.
  - Websites (Google search)
Real-Life Example (Steroids for Acute Herniated Lumbar Disc)
- Ho: There is no difference in the change in ODI scores between the two treatment groups.
- Alpha: 0.0471 (two-tailed)
- Beta: 0.1 (power = 90%)
- Clinically relevant difference in ODI change scores: 7.0
- Standard deviation of change in ODI scores: 15.1
- Randomization ratio: 1:1
- Statistical test on which calculations are based: Student’s t-test
- Number of participants required to demonstrate statistical significance: 101 per group; total required (two arms): 202
- Number of participants required after accounting for 20% withdrawals: 244
- Based on a projected accrual rate of 8–10 participants per month, we anticipate requiring approximately 2.25 years to fully recruit to this trial.
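As a cross-check (not the protocol's actual software), plugging this example's numbers into the normal-approximation formula reproduces the stated result closely: it gives about 100 per group, and the small-sample t-test correction the protocol used nudges that to the reported 101:

```python
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf
d, sd = 7.0, 15.1          # MCID and SD of the ODI change score
alpha, power = 0.0471, 0.90

# n per group = 2 * ((z_{1-alpha/2} + z_power) * SD / d)^2
n = ceil(2 * ((z(1 - alpha / 2) + z(power)) * sd / d) ** 2)
total = 2 * n
print(f"~{n} per group, ~{total} total (normal approximation)")
```

The approximation comes out one or two participants low per group, which is exactly the gap between a z-based and a t-based calculation at this sample size.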
Power Calculations for a Descriptive Study
- Goal: estimate a single quantity
- Power determines the precision of the estimate (i.e., the width of the 95% CI)
- Greater power = better precision = narrower CI
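For a descriptive study, the analogous calculation fixes the desired CI width rather than the power. A sketch for estimating a single mean (all numbers hypothetical; `n_for_mean` is our name, and the formula is the standard normal-approximation one, n ≈ (z · SD / half-width)²):

```python
from math import ceil
from statistics import NormalDist

def n_for_mean(sd, half_width, conf=0.95):
    """Sample size so the CI for a mean has the desired half-width
    (normal approximation; sd is an assumed population SD)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z * sd / half_width) ** 2)

# Hypothetical: assumed SD = 15, want the mean pinned down to +/- 2 points
print(n_for_mean(sd=15, half_width=2))
# Halving the CI width roughly quadruples the required n:
print(n_for_mean(sd=15, half_width=1))
```

The second call illustrates why precision is expensive: CI width shrinks only with the square root of n.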
“Aw, crap!”: What to do when you need more participants than you can get?
- Use a more common (e.g., composite) outcome
- Use a more precise outcome
- Use paired measurements
- Use a continuous outcome
- Use unequal group sizes (if there is a cost differential)

Be careful that these changes don’t destroy the clinical relevance of your study