BIOSTAT LECTURE SERIES 2019 SAMPLE SIZE AND POWER


Wei Hou, PhD

Email: wei.hou@stonybrookmedicine.edu

Apr 17th, 2019


Outline

Post-hoc power

Intuition behind sample size and power calculation

Common sample size formula for different tests

What to bring when meeting with a statistician


Question

Have you ever been asked by a reviewer or editor to calculate post-hoc power (observed power) when publishing non-significant results?

Bababekov et al. (2018):

"we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%—with the given sample size and effect size observed in that study."

Bababekov, Y. J., Stapleton, S. M., Mueller, J. L., Fong, Z. V., and Chang, D. C. (2018). A proposal to mitigate the consequences of type 2 error in surgical science. Annals of Surgery 267, 621-622.


Hypothesis

Research question: will taking vitamin D during antibiotic treatment help patients recover faster?

Primary outcome: percentage of recovery after the 1st round of treatment

Hypothesis: H0: P1 ≤ P0 vs H1: P1 > P0

What test to use?


Type I and II errors

Result of statistical test \ Truth | Null hypothesis is TRUE (Vit D is NOT superior) | Null hypothesis is FALSE (Vit D is superior)
Fail to reject null hypothesis (test shows that Vit D is NOT superior) | correct decision | Type II error (false negative), β
Reject null hypothesis (test shows that Vit D is superior) | Type I error (false positive), α | correct decision

Quick review

Type I error: false positive

Type II error: false negative

α = P(Type I error) = P(reject H0 | H0 is true)

β = P(Type II error) = P(fail to reject H0 | H1 is true)

Power = 1 - β = P(reject H0 | H1 is true)

P-value = P(observing a difference as large as or larger than the observed difference | H0 is true)

Reject H0 when P-value < α

Null hypothesis H0 is assumed to be true until proven otherwise.


Why does an editor request post-hoc power?

When you have a non-significant/negative result, the editor wants to know whether the result is a true negative or a false negative (β: concluding there is no effect when there actually is an effect).

Unfortunately, reporting observed power does not answer the question. The reported observed power does not provide any information about whether the result is a true negative or not.

Observed power is not meaningful

Hoenig and Heisey (2001): "For any test the observed power is a 1:1 function of the p value. When a test is marginally significant (P = .05), the estimated power is 50%."

Reporting observed power is just another way of reporting the p value.
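The 1:1 relationship can be checked directly. A minimal sketch (not from the slides), assuming a two-sided z-test, where "observed power" is computed by plugging the observed z-statistic in as if it were the true effect:

```python
from statistics import NormalDist

N01 = NormalDist()                       # standard normal
alpha = 0.05
z_crit = N01.inv_cdf(1 - alpha / 2)      # critical value, ~1.96

def observed_power(z_obs):
    """Post-hoc power, treating the observed z as if it were the true effect."""
    return (1 - N01.cdf(z_crit - z_obs)) + N01.cdf(-z_crit - z_obs)

# A result exactly at p = .05 has z_obs = z_crit, so observed power ~ 50%
power_at_p05 = observed_power(z_crit)
```

Since the p value determines the observed z and the observed z determines observed power, reporting observed power adds nothing beyond the p value itself.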


More about observed power

Yuan and Maxwell (2005): observed power "is almost always a biased estimator of the true power".

Hoenig and Heisey (2001): "higher observed power does not imply stronger evidence for a null hypothesis that is not rejected".

Say this to the editor/reviewer.

When do we need power and sample size calculation?

Fundamentals of Clinical Trials (4th Edition, 2010): Clinical trials should have sufficient statistical power to detect differences between groups considered to be of clinical importance. Therefore, calculation of sample size with provision for adequate levels of significance and power is an essential part of planning.

• Statistical analysis follows study design

• Power and sample size calculation based on the primary

analysis

• A real screenshot from a recent grant review:


Hypothesis Testing, Significance Level & Power

Suppose the primary outcome is binary (e.g. recovery rate), and we want to test whether a true difference exists in the recovery rates of two groups.

H0: P1 = P0

Assume the true difference is δ = P1 - P0; then β, and hence power (1 - β), depends on δ, N, and α.
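As a rough sketch of that dependence (assuming a pooled two-proportion z-test with equal group sizes; the 20%-vs-40% recovery rates are the hypothetical vitamin D effect used later in these slides), power can be computed by inverting the usual sample-size formula:

```python
from statistics import NormalDist

N01 = NormalDist()

def power_two_proportions(p0, p1, n_total, alpha=0.05):
    """Approximate power of a two-sided pooled z-test for two proportions,
    inverted from N = 4 (Z_a/2 + Z_b)^2 pbar(1 - pbar) / delta^2."""
    pbar = (p0 + p1) / 2
    delta = abs(p1 - p0)
    z_alpha = N01.inv_cdf(1 - alpha / 2)
    z_beta = delta * (n_total / (4 * pbar * (1 - pbar))) ** 0.5 - z_alpha
    return N01.cdf(z_beta)

# Power rises with total N (here p0 = 20%, p1 = 40%, alpha = 0.05)
powers = {n: power_two_proportions(0.20, 0.40, n) for n in (100, 220, 400)}
```

Varying alpha instead of n traces out the second family of power curves shown on the next slides.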


Power curve: different N

Power curve: different α

Power Formula

Depends on study design

Not hard, but can be VERY algebra intensive

Consult with a statistician

Use software, e.g. G*Power (free), PASS, R (free), SAS, etc.

Analysis Follows Study Design

Randomized controlled trial (RCT)

Stratified randomized trial

Non-inferiority trial

Cross-over study

Non-randomized intervention study

Observational study

Prevalence study

Measuring sensitivity and specificity


Types of analysis

In a parallel study, we are comparing the HbA1c levels of two randomized groups: two-sample independent t-test.

In a cross-over study of COPD, we are comparing the exercise duration times of the same person (treated and on placebo): paired t-test.

In a cancer study, we want to compare the response rate between a new drug and a placebo

A study examined changes in smoking status after an intervention. The same participants were asked previously and again after the intervention.

A randomized clinical trial to compare the 10-year overall survival between a new drug and a placebo in women with invasive breast cancer.

Types of analysis

Chi-square test or Z test for proportions (for the response-rate comparison)

McNemar's test for the paired data (for the smoking-status study)

Kaplan-Meier curve and log-rank test (for the survival comparison)

Sample Size Formula Based on Analysis

Variables of interest: type of data, e.g. continuous or categorical

Desired power

Desired significance level

Effect/difference of clinical importance

Standard deviations of continuous outcome variables

One- or two-sided tests

Phase I: Dose Escalation

Dose limiting toxicity (DLT) must be defined

Decide on a few dose levels (e.g. 4)

At least three patients will be treated at each dose level (cohort)

Not a power or sample size calculation issue

Entry of patients to a new dose level does not occur until all patients in the previous level are beyond a certain time frame in which DLT is assessed

Phase II Example: Two-Stage Optimal Design

Single arm, two stage, using an optimal design and a predefined response

Rule out response probability of 20% (H0: p ≤ 0.20)

Level that demonstrates useful activity is 40% (H1: p ≥ 0.40)

Let α = 0.1 (10% probability of accepting a poor agent)

Let β = 0.1 (10% probability of rejecting a good agent)

Tables in Simon (1989) give designs for different response probabilities (p0, p1) and varying α and β values

Blow-up: Simon (1989) table

Phase II Example

Initially enroll 17 patients.

If 0-3 of the 17 have a clinical response, then stop accrual and conclude the agent is not active.

If ≥ 4/17 respond, then accrual will continue to 37 patients.

Phase II Example

If 4-10 of the 37 respond, this is insufficient activity to continue.

If ≥ 11/37 respond, then the agent will be considered active.

Under this design, if the null hypothesis were true (20% response probability), there is a 55% probability of early termination.
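The 55% figure can be reproduced from the binomial distribution. A quick check (a sketch, not from the original slides):

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

# Stage 1: 17 patients, stop early if 3 or fewer respond. Under H0 (p = 0.20):
pet = binom_cdf(3, 17, 0.20)         # probability of early termination, ~0.55
expected_n = 17 + (1 - pet) * 20     # otherwise accrue 20 more, to 37 total
```

The expected sample size comes out near 26, matching the "on average 26 subjects enrolled" figure on the next slide.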


Sample Size Differences

If the null hypothesis (H0) is true:

Using the two-stage optimal design: on average 26 subjects enrolled

Using a 1-sample test of proportions: 36 patients, based on a one-sided binomial test

Using a 2-sample randomized test of proportions: 77 patients per group, based on a one-sided Fisher's exact test

Phase III RCT: Continuous Outcomes

Suppose we want to compare a continuous outcome (e.g. HbA1c) between intervention and control groups. Hypotheses:

H0: μ1 - μ0 = 0 vs H1: μ1 - μ0 ≠ 0

Assuming the variance σ^2 is known, the total sample size needed is

Total N = 4 (Z_α/2 + Z_β)^2 σ^2 / δ^2

where δ = μ1 - μ0 denotes the true difference between μ1 and μ0, and Z_α/2 satisfies P(X > Z_α/2) = α/2 for X from the standard normal distribution.

If σ is unknown, the effect size can be expressed as the standardized difference δ/σ.

If a one-sided test is used, substitute Z_α for Z_α/2.

Phase III RCT: Continuous Outcomes

The Effect of Non-surgical Periodontal Therapy on Hemoglobin A1c Levels in Persons with Type 2 Diabetes and Chronic Periodontitis: A Randomized Clinical Trial, Engebretson et al. 2013, JAMA

The treatment group received scaling and root planing plus chlorhexidine oral rinse at baseline, and supportive periodontal therapy at three and six months. The control group received no treatment for six months.

We assume a clinically meaningful difference of 0.6% in HbA1c between the two arms with a standard deviation of 2%.

Phase III RCT: Continuous Outcomes

Two independent samples, two-sided test.

Set α = .05, β = .10 (90% power). Then Z_0.025 = 1.96, Z_0.10 = 1.282.

Set δ = 0.6%, σ = 2%.

Then Total N = 4 (1.96 + 1.282)^2 (2)^2 / (0.6)^2 ≈ 468, i.e. 234 per group.
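A sketch of the same computation in Python (standard library only; the formula and the HbA1c numbers are from this slide):

```python
from math import ceil
from statistics import NormalDist

N01 = NormalDist()

def total_n_continuous(delta, sigma, alpha=0.05, power=0.90):
    """Total N (both arms) for a two-sided two-sample comparison of means:
    N = 4 (Z_alpha/2 + Z_beta)^2 sigma^2 / delta^2."""
    z_a = N01.inv_cdf(1 - alpha / 2)
    z_b = N01.inv_cdf(power)
    return 4 * (z_a + z_b) ** 2 * sigma**2 / delta**2

n_total = total_n_continuous(delta=0.6, sigma=2.0)   # HbA1c example
n_per_arm = ceil(n_total / 2)                         # 234 per arm
```

Small differences from the slide's 468 total reflect whether the z-values are rounded before or after squaring.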


Phase III RCT: Continuous Outcomes

Phase III RCT: Continuous Outcomes G*Power


Phase III RCT: Binary Outcomes

H0: p1 - p0 = 0 vs H1: p1 - p0 ≠ 0

Suppose we want to compare the response rates between vit D treatment and placebo (20% vs 40%). Based on a Z-test, the total sample size needed is

Total N = 4 (Z_α/2 + Z_β)^2 p̄(1 - p̄) / (p1 - p0)^2

where p̄ = (p1 + p0)/2 is the pooled proportion. If a one-sided test is used, substitute Z_α for Z_α/2.

Phase III RCT: Binary Outcomes

Two independent samples, two-sided Z test.

Set α = .05, β = .10 (90% power). Then Z_0.025 = 1.960, Z_0.10 = 1.282.

Set p1 = .40, p0 = .20. Then p̄ = (.4 + .2)/2 = .30.

And Total N = 4 (1.960 + 1.282)^2 (.3)(.7) / (.4 - .2)^2 ≈ 220, i.e. about 110 per group.
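The same computation as a standard-library Python sketch (formula and 20%-vs-40% rates from this slide):

```python
from statistics import NormalDist

N01 = NormalDist()

def total_n_proportions(p0, p1, alpha=0.05, power=0.90):
    """Total N (both arms) for a two-sided z-test of two proportions:
    N = 4 (Z_alpha/2 + Z_beta)^2 pbar(1 - pbar) / (p1 - p0)^2."""
    pbar = (p0 + p1) / 2
    z_a = N01.inv_cdf(1 - alpha / 2)
    z_b = N01.inv_cdf(power)
    return 4 * (z_a + z_b) ** 2 * pbar * (1 - pbar) / (p1 - p0) ** 2

n_total = total_n_proportions(0.20, 0.40)   # ~220 in total
```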


Phase III RCT: Binary Outcomes

Phase III RCT: Binary Outcomes G*Power


Sample Size for Testing Non-inferiority

Suppose we want to test whether a new treatment is equivalent to an established treatment in response rate.

Can we propose the hypotheses as follows:

H0: P1 - P0 ≠ 0 vs H1: P1 - P0 = 0 ???

If we plug the alternative p1 - p0 = 0 into the formula

Total N = 4 (Z_α/2 + Z_β)^2 p̄(1 - p̄) / (p1 - p0)^2

we get N = ∞: we will never reject H0.

Sample Size for Testing Non-inferiority

How about using the original hypotheses:

H0: P1 = P0 vs H1: P1 ≠ P0 ???

Calculate the sample size based on "fail to reject H0"?

However, failure to reject H0 is not sufficient to claim that the two groups are equal; it merely means the evidence is inadequate to say they are different.

Sample Size for Testing Non-inferiority

There is no statistical method to demonstrate complete equivalence.

We can pre-define a margin of difference, δ:

H0: the two groups differ by at least δ

H1: the two groups differ by less than δ

Then we can use the previous formulas (one-sided, so Z_α replaces Z_α/2).

Dichotomous response (p1 = p0 = p): Total N = 4 (Z_α + Z_β)^2 p(1 - p) / δ^2

Continuous response: Total N = 4 (Z_α + Z_β)^2 / (δ/σ)^2
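A sketch for the dichotomous case. The common rate p = 0.30 and margin δ = 0.10 below are illustrative inputs, not from the slides:

```python
from statistics import NormalDist

N01 = NormalDist()

def total_n_noninferiority(p, margin, alpha=0.05, power=0.90):
    """Total N for the dichotomous non-inferiority formula (p1 = p0 = p),
    one-sided: N = 4 (Z_alpha + Z_beta)^2 p(1 - p) / margin^2."""
    z_a = N01.inv_cdf(1 - alpha)    # one-sided: Z_alpha, not Z_alpha/2
    z_b = N01.inv_cdf(power)
    return 4 * (z_a + z_b) ** 2 * p * (1 - p) / margin**2

# Hypothetical example: common response rate 30%, non-inferiority margin 10%
n_total = total_n_noninferiority(p=0.30, margin=0.10)
```

Note how the margin δ, rather than an expected treatment difference, drives the sample size here.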


Multiple response variables

More than one question may be equally important.

More than one primary variable may be used to assess a single primary question.

Multiple response variables are often correlated.

Multiple testing issues: when multiple comparisons are made, the chance of finding a significant difference in one of the comparisons (when, in fact, no real differences exist between groups) is greater than the stated significance level.

α needs to be adjusted to control the familywise error rate.

Interim Analysis

Analysis of the data before the study has ended, with the intention of possibly terminating the study early.

If traditional tests are used at both the middle and the end of the study, the Type I error gets inflated.

To maintain the overall Type I error, α needs to be adjusted at each interim analysis.

# of interim analyses | 0    | 1    | 4    | 9
Overall Type I error  | 0.05 | 0.08 | 0.14 | 0.2
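The inflation for one interim analysis can be checked by simulation. A Monte Carlo sketch (not from the slides), assuming one naive unadjusted look at the halfway point; the interim and final z-statistics share half the data, which the construction below reflects:

```python
import random

random.seed(1)
SIMS = 200_000
Z_CRIT = 1.96                 # nominal two-sided 0.05 critical value
rejections = 0
for _ in range(SIMS):
    z_interim = random.gauss(0, 1)                  # z from the first half
    z_extra = random.gauss(0, 1)                    # independent second half
    z_final = (z_interim + z_extra) / 2 ** 0.5      # z from all the data
    if abs(z_interim) > Z_CRIT or abs(z_final) > Z_CRIT:
        rejections += 1                             # rejected at either look
overall_alpha = rejections / SIMS                   # ~0.08, not 0.05
```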


Non-adherence adjustment

Drop-out and drop-in can both happen in an RCT.

According to ITT, these participants remain in the analysis.

They tend to dilute any difference between the two groups which might be produced by the intervention.

A simple formula for non-adherence adjustment (Lachin, 1981):

N* = N / (1 - R0 - R1)^2

where R0 is the drop-out rate and R1 is the drop-in rate.
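A sketch of the adjustment, using hypothetical rates (10% drop-out, 5% drop-in; not from the slides) applied to the 468-patient HbA1c example from earlier:

```python
from math import ceil

def adjust_for_nonadherence(n, dropout, dropin):
    """Lachin (1981): N* = N / (1 - R0 - R1)^2."""
    return n / (1 - dropout - dropin) ** 2

# Hypothetical: 10% drop-out and 5% drop-in inflate 468 to 648 patients
n_adjusted = ceil(adjust_for_nonadherence(468, dropout=0.10, dropin=0.05))
```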


Sample size based on confidence interval

The desired confidence interval width is used for the sample size calculation.

For testing the null hypothesis of no treatment effect, hypothesis testing and confidence intervals give the same conclusion.

The CI method might yield a power of only 50% to detect a difference of half the width of the confidence interval.

Estimating Sample Size Parameters

Obtaining reliable estimates (e.g. of effect size or standard deviation) can be challenging.

Use pilot studies to refine estimates.

Use adaptive designs, which modify the sample size based on updated estimates.

Effect size

Cohen's d:

Measures the magnitude of a treatment effect

Unlike significance tests, it is independent of sample size

Widely used in meta-analyses

Cohen (1988) hesitantly defined effect sizes as "small", "medium", and "large", stating that "there is a certain risk inherent in offering conventional operational definitions for those terms for use in power analysis in as diverse a field of inquiry as behavioral science" (p. 25).

Effect size d: small = 0.20, medium = 0.50, large = 0.80
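Cohen's d is the mean difference divided by the pooled standard deviation. A minimal sketch (the toy samples are illustrative, not from the slides):

```python
def cohens_d(sample1, sample0):
    """Cohen's d: standardized mean difference, pooled-SD denominator."""
    n1, n0 = len(sample1), len(sample0)
    m1 = sum(sample1) / n1
    m0 = sum(sample0) / n0
    ss1 = sum((x - m1) ** 2 for x in sample1)
    ss0 = sum((x - m0) ** 2 for x in sample0)
    sd_pooled = ((ss1 + ss0) / (n1 + n0 - 2)) ** 0.5
    return (m1 - m0) / sd_pooled

d = cohens_d([2, 4, 6], [1, 3, 5])   # d = 0.5, a "medium" effect
```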

Approximate Nature

Parameters used in the calculation are estimates

Estimate of the relative effectiveness may be based on a population different from that intended to be studied

The effectiveness is often overestimated

Revisions of inclusion and exclusion criteria may influence the type of participants entering the trial

Mathematical models used may only approximate the true, but unknown, distribution of the response variables

So PI should be as conservative as can be justified while still being realistic in estimating the parameters used in the calculation!


More Notes

The study's primary outcome is the variable for which you do the sample size calculation.

If secondary outcome variables are considered important, make sure the sample size is sufficient for them as well.

Increase the 'real' sample size to reflect loss to follow-up, lack of compliance, etc.

What does a statistician need?

Primary hypothesis

Including null and alternative hypotheses

Study design

RCT? Cross-over?

Data types of primary endpoints

Continuous or dichotomous

Significance level

Usually 0.05. Needs to be adjusted for interim tests or multiple endpoints

Value of other parameters

Standard deviation – from pilot study or published data

Smallest effect size that is clinically meaningful

e.g. 0.5D in myopia studies

Intended power

Usually 80%. Sometimes 90% for large studies.

But:

https://www.youtube.com/watch?v=PbODigCZqL8

Research Flow Chart

Questions → Hypotheses → Experimental Design → Samples → Data → Analyses → Conclusions

Take all of your study information to a statistician early and often


Quiz time!

True or False?

Sample size: N ↑ → power ↑

Significance level: α ↑ → power ↓

Effect size: δ ↑ → power ↓

Variation (continuous outcome): σ^2 ↑ → power ↑

One-tailed test power < two-tailed test power

Thank you!

Q. How many statisticians does it take to change a light bulb?

A. That depends. It is really a matter of power.

From: Stuart Howell

https://jcdverha.home.xs4all.nl/scijokes/1_2.html