BIOSTAT LECTURE SERIES 2019
SAMPLE SIZE AND POWER
Wei Hou, PhD
Email: wei.hou@stonybrookmedicine.edu
Apr 17th, 2019
Outline
Post-hoc power
Intuition behind sample size and power calculation
Common sample size formula for different tests
What to bring when meeting with a statistician
Question
Have you ever been asked by your reviewer/editor to calculate post-hoc power (observed power) when you are publishing non-significant results?
Bababekov et al. 2018:
“we argue that the CONSORT and STROBE guidelines should be modified to include the disclosure of power—even if <80%—with the given sample size and effect size observed in that study.”
Bababekov, Y. J., Stapleton, S. M., Mueller, J. L., Fong, Z. V., and Chang, D. C. (2018). A proposal to mitigate the consequences of type 2 error in surgical science. Annals of Surgery 267, 621-622
Hypothesis
Research question: will taking vitamin D during antibiotic treatment help patients recover faster?
Primary outcome: percentage of recovery after the 1st round of treatment
Hypothesis: H0: P1 ≤ P0 vs H1: P1 > P0
What test to use?
Type I and II errors

                                     Result of statistical test
Truth                                Fail to reject null hypothesis       Reject null hypothesis
                                     (test shows Vit D is NOT superior)   (test shows Vit D is superior)
Null hypothesis is TRUE              Correct decision                     Type I error (false positive), α
(Vit D is NOT superior)
Null hypothesis is FALSE             Type II error (false negative), β    Correct decision
(Vit D is superior)
Quick review
Type I error: false positive
Type II error: false negative
α: P(Type I error) = P(reject H0 | H0 is true)
β: P(Type II error) = P(fail to reject H0 | H1 is true)
Power = 1 - β = P(reject H0 | H1 is true)
P-value = P(observing a difference as large as or larger than the observed difference | H0 is true)
Reject H0 when P-value < α
The null hypothesis H0 is assumed to be true until proven otherwise.
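The definitions above can be checked by simulation: when H0 is true and we reject at P-value < α, we should reject in about a fraction α of repeated experiments. A minimal sketch (the one-sample t-test setup and all numbers are illustrative assumptions, not from the slides):

```python
import numpy as np
from scipy import stats

# Illustrative check that P(reject H0 | H0 true) = alpha.
# Assumed setup: one-sample two-sided t-test of mean 0 on N(0, 1) data.
rng = np.random.default_rng(0)
alpha, n, sims = 0.05, 30, 20_000

data = rng.normal(0.0, 1.0, size=(sims, n))       # H0 is true by construction
pvals = stats.ttest_1samp(data, 0.0, axis=1).pvalue
type1_rate = (pvals < alpha).mean()
print(round(type1_rate, 3))                       # close to alpha = 0.05
```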
Why does an editor request post-hoc power?
When you have a non-significant/negative result, the editor wants to know whether the result is a true negative or a false negative (β: concluding there is no effect when there actually is an effect).
Unfortunately, reporting observed power does not answer the question. The reported observed power does not provide any information about whether the result is a true negative or not.
Observed Power is not meaningful
Hoenig and Heisey (2001):
“For any test the observed power
is a 1:1 function of the p value.
When a test is marginally
significant (P = .05), the estimated
power is 50%.”
Reporting observed power is just
another way of reporting the p
value.
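Hoenig and Heisey's 1:1 relationship is easy to verify numerically. A sketch for a two-sided z-test (the helper name is illustrative): plugging the observed z-statistic back in as the assumed true effect makes observed power a deterministic function of the p-value, equal to 50% exactly when p = α.

```python
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    """Post-hoc 'observed power' of a two-sided z-test: the power computed
    by treating the observed z-statistic as the true standardized effect."""
    z_obs = norm.ppf(1 - p_value / 2)        # |z| implied by the p-value
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - z_obs) + norm.cdf(-z_crit - z_obs)

print(round(observed_power(0.05), 3))        # 0.5: p = alpha gives 50% power
```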
More about observed Power
Yuan and Maxwell 2005: Observed power “is
almost always a biased estimator of the true
power”
Hoenig and Heisey (2001): “higher observed
power does not imply stronger evidence for a null
hypothesis that is not rejected”.
This is what you can say to the editor/reviewer.
When do we need power and sample size calculation?
Fundamentals of Clinical Trials (4th Edition, 2010): Clinical trials should have sufficient statistical power to detect differences between groups considered to be of clinical importance. Therefore, calculation of sample size with provision for adequate levels of significance and power is an essential part of planning.
• Statistical analysis follows study design
• Power and sample size calculation based on the primary
analysis
• A real screenshot from a recent grant review:
Hypothesis Testing, Significance Level & Power
Suppose the primary outcome is binary (e.g. recovery rate), and we want to test whether a true difference exists in the recovery rates of two groups.
H0: P1 = P0
Assume the true difference is δ = P1 - P0; then β, and hence power (1 - β), depends on δ, N, and α.
Power curve: different N
Power curve: different α
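The shape of these power curves can be reproduced with the normal-approximation power for comparing two proportions; the rates below reuse the vit D example (20% vs 40%), and the per-group sizes are illustrative:

```python
from math import sqrt
from scipy.stats import norm

def power_two_proportions(p0, p1, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test of proportions
    (pooled-variance normal approximation)."""
    pbar = (p0 + p1) / 2
    se = sqrt(2 * pbar * (1 - pbar) / n_per_group)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - abs(p1 - p0) / se)

# Power rises with N (and would also rise with a larger alpha or delta)
for n in (25, 50, 110):
    print(n, round(power_two_proportions(0.20, 0.40, n), 2))
```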
Power Formula
Depends on study design
Not hard, but can be VERY algebra intensive
Consult with a statistician
Use software, e.g. G*Power (free), PASS, R (free), SAS, etc.
Analysis Follows Study Design
Randomized controlled trial (RCT)
Stratified randomized trial
Non-inferiority trial
Cross-over study
Non-randomized intervention study
Observational study
Prevalence study
Measuring sensitivity and specificity
…
In a parallel study, we are comparing the HbA1c
levels of two randomized groups.
In a cross-over study of COPD, we are comparing
the exercise duration times of the same person
(treated and on placebo)
Types of analysis
Two-sample Independent T-test
Paired T-test
In a cancer study, we want to compare the response rate between a new drug and a placebo
A study examined changes in smoking status after an intervention. The same participants were asked previously and again after the intervention.
A randomized clinical trial to compare the 10-year overall survival between a new drug and a placebo in women with invasive breast cancer.
Types of analysis
Chi-square test or Z test for proportions
McNemar’s test for the paired data
Kaplan-Meier Curve and Log rank test
Sample Size Formula Based on Analysis
Variables of interest
type of data e.g. continuous, categorical
Desired power
Desired significance level
Effect/difference of clinical importance
Standard deviations of continuous outcome
variables
One- or two-sided test
Phase I: Dose Escalation
Dose limiting toxicity (DLT) must be defined
Decide a few dose levels (e.g. 4)
At least three patients will be treated on each dose
level (cohort)
Not a power or sample size calculation issue
Entry of patients to a new dose level does not occur
until all patients in the previous level are beyond a
certain time frame where you look for DLT
Phase II Example:
Two-Stage Optimal Design
Single arm, two stage, using an optimal design & predefined response
Rule out response probability of 20% (H0: p≤0.20)
Level that demonstrates useful activity is 40% (H1: p ≥ 0.40)
Let α = 0.1 (10% probability of accepting a poor agent)
Let β = 0.1 (10% probability of rejecting a good agent)
Tables in the Simon (1989) paper give designs for different response probabilities and varying α and β values
Blow-up: Simon (1989) Table
Phase II Example
Initially enroll 17 patients.
If 0-3 of the 17 have a clinical response, then stop accrual and assume the agent is not active.
If ≥ 4/17 respond, then accrual will continue to 37 patients.
Phase II Example
If only 4-10 of the 37 respond, this is insufficient activity to continue.
If ≥ 11/37 respond, then the agent will be considered active.
Under this design, if the null hypothesis were true (20% response probability), there is a 55% probability of early termination.
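The 55% figure follows directly from the binomial distribution for stage 1: accrual stops early if 3 or fewer of the first 17 patients respond when the true response probability is 20%.

```python
from scipy.stats import binom

# P(stop at stage 1 | H0): at most 3 responses among 17 patients, p = 0.20
p_early_stop = binom.cdf(3, 17, 0.20)
print(round(p_early_stop, 2))   # 0.55
```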
Sample Size Differences
If the null hypothesis (H0) is true
Using two-stage optimal design
On average 26 subjects enrolled
Using a 1-sample test of proportions
36 patients based on one-sided binomial test
Using a 2-sample randomized test of proportions
77 patients per group based on one-sided Fisher’s exact test
Phase III RCT: Continuous Outcomes
Suppose we want to compare a continuous outcome (e.g. HbA1c) between intervention and control groups. Hypotheses:
H0: μ1 - μ0 = 0 vs H1: μ1 - μ0 ≠ 0
Assuming the variance σ² is known, the total sample size needed is
N(Total) = 4(Zα/2 + Zβ)² σ²/δ²
where δ = μ1 - μ0 denotes the true difference between μ1 and μ0, and Zα/2 satisfies P(X > Zα/2) = α/2, with X from the standard normal distribution.
If σ is unknown, the effect size can be expressed as the standardized difference δ/σ.
If a one-sided test is used, substitute Zα for Zα/2.
Phase III RCT: Continuous Outcomes
The Effect of Non-surgical Periodontal Therapy on Hemoglobin A1c
Levels in Persons with Type 2 Diabetes and Chronic Periodontitis: A
Randomized Clinical Trial, Engebretson et al. 2013, JAMA
The treatment group received scaling and root planing plus
chlorhexidine oral rinse at baseline, and supportive periodontal
therapy at three and six months. The control group received no
treatment for six months.
We assume a clinically meaningful difference of 0.6% in HbA1c
between the two arms with a standard deviation of 2%.
Phase III RCT: Continuous Outcomes
Two independent samples, two-sided test
Set α = .05, β = .10 (90% power)
Then Z0.025 = 1.96, Z0.10 = 1.282
Set δ = 0.6%, σ = 2%
Then N = 4(1.96 + 1.282)² (2)²/(0.6)² ≈ 468
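The arithmetic above can be scripted. This sketch uses exact normal quantiles, so it gives 467 rather than the slide's 468 (which uses the rounded values 1.96 and 1.282):

```python
from math import ceil
from scipy.stats import norm

def total_n_continuous(delta, sigma, alpha=0.05, power=0.90):
    """Total N for a two-sided two-sample comparison of means:
    N = 4 * (Z_{alpha/2} + Z_beta)^2 * sigma^2 / delta^2."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    return 4 * (z_a + z_b) ** 2 * (sigma / delta) ** 2

n = total_n_continuous(delta=0.6, sigma=2.0)   # HbA1c example from the slide
print(ceil(n))
```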
Phase III RCT: Continuous Outcomes
Phase III RCT: Continuous Outcomes G*Power
Phase III RCT: Binary Outcomes
Suppose we want to compare the response rates between vit D treatment and placebo (20% vs 40%). Hypotheses:
H0: p1 - p0 = 0 vs H1: p1 - p0 ≠ 0
Based on a Z-test, the total sample size needed is
N = 4(Zα/2 + Zβ)² p̄(1 - p̄)/(p1 - p0)²
• p̄ = (p1 + p0)/2 is the pooled proportion
• If a one-sided test is used, substitute Zα for Zα/2
Phase III RCT: Binary Outcomes
Two independent samples, two-sided Z test
Set α = .05, β = .10 (90% power), p1 = 0.40, p0 = 0.20
Then p̄ = (.4 + .2)/2 = 0.30
Set Z0.025 = 1.960, Z0.10 = 1.282
Then N = 4(1.960 + 1.282)² (.3)(.7)/(.4 - .2)² ≈ 220
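The same calculation can be scripted; with exact normal quantiles it gives 221 rather than the slide's 220 (which comes from rounding):

```python
from math import ceil
from scipy.stats import norm

def total_n_proportions(p0, p1, alpha=0.05, power=0.90):
    """Total N for a two-sided two-sample z-test of proportions:
    N = 4 * (Z_{alpha/2} + Z_beta)^2 * pbar * (1 - pbar) / (p1 - p0)^2."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(power)
    pbar = (p0 + p1) / 2
    return 4 * (z_a + z_b) ** 2 * pbar * (1 - pbar) / (p1 - p0) ** 2

n = total_n_proportions(0.20, 0.40)   # vit D example from the slide
print(ceil(n))
```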
Phase III RCT: Binary Outcomes
Phase III RCT: Binary Outcomes G*Power
Sample Size for Testing Non-inferiority
Suppose we want to test whether a new treatment is equivalent to an established treatment in response rate.
Can we propose the hypotheses as follows:
H0: P1 - P0 ≠ 0 vs H1: P1 - P0 = 0 ???
Using the formula
N = 4(Zα/2 + Zβ)² p̄(1 - p̄)/(P1 - P0)²
with P1 - P0 = 0 gives N = ∞; we will never reject H0.
Sample Size for Testing Non-inferiority
How about using the original hypotheses:
H0: P1 = P 0 vs H1: P1 ≠ P 0 ???
Calculate the sample size based on “Fail to reject
H0”?
However, failure to reject H0 is not sufficient to claim that the two groups are equal; it merely means the evidence is inadequate to say they are different.
Sample Size for Testing Non-inferiority
No statistical method to demonstrate complete equivalence
We can pre-define a margin of difference, δ
H0: the two groups differ by at least δ
H1: the two groups differ by less than δ
Then we can use the previous formula.
Dichotomous response (p1 = p0 = p):
N = 4(Zα + Zβ)² p(1 - p)/δ²
Continuous response:
N = 4(Zα + Zβ)² (σ/δ)²
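A sketch of the dichotomous-response formula; the common response rate of 60% and margin δ = 0.10 below are illustrative assumptions, not values from the slides:

```python
from math import ceil
from scipy.stats import norm

def n_equivalence_binary(p, delta, alpha=0.05, power=0.90):
    """Total N for the dichotomous non-inferiority formula:
    N = 4 * (Z_alpha + Z_beta)^2 * p * (1 - p) / delta^2 (one-sided Z_alpha)."""
    z_a = norm.ppf(1 - alpha)
    z_b = norm.ppf(power)
    return 4 * (z_a + z_b) ** 2 * p * (1 - p) / delta ** 2

# Hypothetical: common response rate p = 0.60, margin delta = 0.10
print(ceil(n_equivalence_binary(0.60, 0.10)))
```

Note how much larger this N is than in the superiority examples: small margins make non-inferiority trials expensive.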
Multiple response variables
More than one question are equally important
More than one primary variable used to assess a single primary question
Multiple response variables are correlated
Multiple testing issues: when multiple comparisons are made, the chance of finding a significant difference in one of the comparisons (when, in fact, no real differences exist between groups) is greater than the stated significance level.
α need to be adjusted to control familywise error.
37
Interim Analysis
Analysis of the data before the study has ended, with the intention of possibly terminating the study early
If traditional tests are used at both the middle and the end of the study, the Type I error gets inflated
To maintain the overall Type I error, α needs to be adjusted at each interim analysis

# of interim analyses   0      1      4      9
Overall Type I error    0.05   0.08   0.14   0.2
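The inflation can be demonstrated by Monte Carlo: test a true-null mean once at an interim look and again at the end, each time at an unadjusted α = 0.05. The look schedule and sample sizes below are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
alpha, n_total, sims = 0.05, 100, 20_000
looks = [50, 100]                     # one interim look plus the final analysis

data = rng.normal(0.0, 1.0, size=(sims, n_total))   # H0 true: mean is 0
rejected = np.zeros(sims, dtype=bool)
for n in looks:
    z = data[:, :n].sum(axis=1) / np.sqrt(n)        # z-statistic at look n
    rejected |= np.abs(z) > norm.ppf(1 - alpha / 2)

print(round(rejected.mean(), 3))      # near 0.08, not the nominal 0.05
```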
Non-adherence adjustment
Drop-out and drop-in can both happen in an RCT
Under intention-to-treat (ITT) analysis, these participants remain in the analysis
They tend to dilute any difference between the two groups which might be produced by the intervention
A simple formula for non-adherence adjustment (Lachin, 1981):
N* = N/(1 - R0 - R1)²
where R0 is the dropout rate and R1 is the drop-in rate.
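Applying Lachin's adjustment to, say, the N = 220 binary-outcome example, with hypothetical 10% drop-out and 5% drop-in rates:

```python
from math import ceil

def adjust_for_nonadherence(n, dropout, dropin):
    """Lachin (1981): N* = N / (1 - R0 - R1)^2."""
    return n / (1 - dropout - dropin) ** 2

# Hypothetical rates: R0 = 0.10 (drop-out), R1 = 0.05 (drop-in)
print(ceil(adjust_for_nonadherence(220, 0.10, 0.05)))   # 305
```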
Sample size based on confidence interval
The desired confidence interval width is used for
sample size calculation
For testing the null hypothesis of no treatment
effect, hypothesis testing and confidence intervals
give the same conclusion
The CI method might yield a power of only 50% to detect a difference equal to the half-width of the confidence interval
Estimating Sample Size Parameters
Obtaining reliable estimates (e.g. effect size or
standard deviation) can be challenging
Use pilot studies to refine estimates
Use adaptive designs, which modify the sample size based on updated estimates
Effect size
Cohen's d:
Measures the magnitude of a treatment effect
Unlike significance tests, it is independent of sample size
Widely used in meta-analyses
Cohen (1988) hesitantly defined effect sizes as "small", "medium", and "large", stating that "there is a certain risk inherent in offering conventional operational definitions for those terms for use in power analysis in as diverse a field of inquiry as behavioral science" (p. 25).

Effect size   d
Small         0.20
Medium        0.50
Large         0.80
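A minimal sketch of computing Cohen's d from two samples (the HbA1c values are made up for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_sd = sqrt(((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2)
                     / (nx + ny - 2))
    return (mean(x) - mean(y)) / pooled_sd

# Hypothetical HbA1c (%) measurements in two small groups
treated = [6.8, 7.1, 6.5, 7.0, 6.9]
control = [7.4, 7.8, 7.2, 7.6, 7.5]
print(round(cohens_d(treated, control), 2))   # large negative effect
```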
Approximate Nature
Parameters used in the calculation are estimates
Estimate of the relative effectiveness may be based on a population different from that intended to be studied
The effectiveness is often overestimated
Revisions of inclusion and exclusion criteria may influence the type of participants entering the trial
Mathematical models used may only approximate the true, but unknown, distribution of the response variables
So PI should be as conservative as can be justified while still being realistic in estimating the parameters used in the calculation!
More Notes
The study’s primary outcome is the variable you do the sample size calculation for
If secondary outcome variables are considered important, make sure the sample size is sufficient for them as well
Increase the ‘real’ sample size to reflect loss to follow-up, lack of compliance, etc.
What does a statistician need?
Primary hypothesis
Including null and alternative hypotheses
Study design
RCT? Cross-over?
Data types of primary endpoints
Continuous or dichotomous
Significance level
Usually 0.05. Needs to be adjusted for interim tests or multiple endpoints
Value of other parameters
Standard deviation – from pilot study or published data
Smallest effect size that is clinically meaningful
e.g. 0.5D in myopia studies
Intended power
Usually 80%. Sometimes 90% for large studies.
Research Flow Chart
Questions → Hypotheses → Experimental Design → Samples → Data →
Analyses → Conclusions
Take all of your study information to a statistician early and often
Quiz time!
True or False?
Sample size: N ↑ → power ↑
Significance level: α ↑ → power ↓
Effect size: δ ↑ → power ↓
Variation (continuous outcome): σ2 ↑ → power ↑
One-tailed test power < Two-tailed test power
Thank you!
Q. How many statisticians does it take to
change a light bulb?
A. That depends. It is really a matter of
power.
From: Stuart Howell
https://jcdverha.home.xs4all.nl/scijokes/1_2.html