
PHARMACEUTICAL STATISTICS

Pharmaceut. Statist. 2002; 1: 97–105 (DOI:10.1002/pst.014)

Design of clinical trials using an adaptive test statistic

John Lawrence*

Food and Drug Administration, U.S.A.

*Correspondence to: John Lawrence, Department of Biometrics I, Food and Drug Administration, HFD-710 Room 2030, Woodmont II, 1451 Rockville Pike, Rockville, MD 20852, U.S.A.

Since the treatment effect of an experimental drug is generally not known at the onset of a clinical trial, it may be wise to allow for an adjustment in the sample size after an interim analysis of the unblinded data. Using a particular adaptive test statistic, a procedure is demonstrated for finding the optimal design. Both the timing of the interim analysis and the way the sample size is adjusted can influence the power of the resulting procedure. It is possible to have a smaller average sample size using the adaptive test statistic, even if the initial estimate of the treatment effect is wrong, compared to the sample size needed using a standard test statistic without an interim look and assuming a correct initial estimate of the effect. Copyright © 2002 John Wiley & Sons, Ltd.

Keywords: interim analysis; sample size adjustment; sequential monitoring

1. INTRODUCTION

The design of a clinical trial requires a reliable estimate of the treatment effect, δ. However, this estimate can be difficult to obtain with a new agent. Moreover, even when there is knowledge about the treatment effect, this effect can vary from study to study because of differences in treatment regimen, exclusion criteria, background therapy, etc.

For these reasons, it is sensible in many cases to adjust the sample size after part of the data has been observed. However, if the analysis does not incorporate the data-driven sample size adjustment, then the type I error can be inflated. Fisher [1], Cui et al. [2], and Lehmacher and Wassmer [3] describe adaptive test procedures that are based on the idea of using a weighted average of the information before the interim look and the information after the interim look. More specifically, in many clinical trials one can use all of the data before the interim analysis to construct a test statistic and a second test statistic using the data after the interim analysis. Under the null hypothesis, these two statistics are independent standard normal random variables even if the sample size depends on the first statistic. Therefore, weights can be chosen to combine the two together into a single test statistic that is standard normal under the null hypothesis.

The weights are arbitrary, but must be determined before any data are observed in order to ensure the type I error rate is controlled. Using this adjustment to the test statistic, the null distribution is not affected by the adjustment to the sample size, so the usual critical values are used. If the estimate of the treatment effect based on the interim look, Δ1, agrees with the initial estimate of the treatment effect, Δ0, then no adjustment to the test is needed, and in fact the adaptive test is identical to the usual test. On the other hand, if Δ1 is much less than Δ0, then the sample size may be increased and the power of the adaptive test is larger than the power of the original test with no adjustment in sample size. If an adjustment to the sample size is made, the naive test does not have the correct type I error rate. See [2] for further discussion on the performance of the adaptive test and the naive test.

In theory, it is possible to postulate an arbitrarily small value of Δ0 or Δ1. If there is a consensus that a treatment effect below, say, ε is of no therapeutic value, then it is possible to set this lower bound on these postulated values. In fact, it is even possible to initially design the study based on Δ0 = ε, the smallest treatment effect of interest. However, if the true treatment effect is much larger than ε, then there are obvious economic and possible ethical arguments against doing this. Nonetheless, if the initial design uses Δ0 = ε and the consensus opinion about ε remains unchanged over time, then it would make no sense to adjust the initial planned sample size upward no matter what is observed at the interim analysis.

For simplicity, we will first consider the case where the observed random variables are independent and identically distributed normal with known (unit) variance. Then, the impact of estimating the variance will be examined. We consider the case where a point estimate of the treatment effect is postulated to design the study, and then separately consider the case where the treatment effect is assumed a priori to be likely to lie within some range. Although this paper focuses on tests for normal means, all of the ideas generalize to more complex clinical trials. For example, the adaptive test statistic can be used with time-to-event endpoints using the logrank or other tests for survival curves, tests for differences in proportions, or longitudinal data analysis.

In this paper, the problem of choosing an initial sample size, a final sample size, and the weights used in this adaptive test based on one interim look is investigated. Although the adaptive design framework allows for much greater flexibility than the fixed sample size design, there is not much guidance available in the current literature regarding how an investigator should best take advantage of this flexibility. The purpose of this paper is to investigate the consequences of using different choices and to provide some general guidance in this area.

2. CHOOSING THE DESIGN PARAMETERS IF δ IS POSTULATED AS A FIXED NUMBER

Suppose Xi ~ N(μT, σ²) and Yi ~ N(μC, σ²) are two random samples observed from the treatment and control groups, respectively. Without loss of generality, we will assume σ² = 1 unless specified otherwise. Initially, we observe n1 random variables from each group. Subsequently, we will observe n2 additional random variables from both groups. The final adaptive test statistic is

$$\sqrt{\lambda}\,\frac{\sum_{i=1}^{n_1} X_i - \sum_{i=1}^{n_1} Y_i}{\sqrt{2n_1}} \;+\; \sqrt{1-\lambda}\,\frac{\sum_{i=n_1+1}^{n_1+n_2} X_i - \sum_{i=n_1+1}^{n_1+n_2} Y_i}{\sqrt{2n_2}}$$

where 0 < λ < 1 is a constant that is chosen before any data are observed. This statistic will be used to test the hypothesis H0: μT = μC against the alternative H1: μT > μC at level 0.025. Implicit in this framework is the assumption that μT is known to be at least as large as μC. One may also consider the scenario where it is unknown in which direction the two parameters may differ and a two-sided test is desired. The fixed-weight adaptive test described here is standard normal under the null hypothesis, so a two-sided test at level 0.05 could be constructed by rejecting H0 if the final adaptive test statistic is sufficiently large in absolute value. This two-sided testing framework is suitable where it is just as important to demonstrate that a study drug is inferior to the control as it is to demonstrate that it is superior. However, in the remainder of this paper we will focus on the one-sided testing scenario. The interested reader is referred to Dunnett and Gent [4] for a discussion of the issues of one-sided versus two-sided testing in drug development.
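To make the construction concrete, the statistic is straightforward to compute from the two stage-wise samples. The following is a minimal sketch in Python (the function name and the use of numpy/scipy are my own; the paper supplies no code):

```python
import numpy as np
from scipy.stats import norm

def adaptive_z(x1, y1, x2, y2, lam):
    """Fixed-weight adaptive test statistic.

    x1, y1: first-stage treatment/control samples
    x2, y2: second-stage treatment/control samples (n2 may depend on stage 1)
    lam:    weight chosen before any data are observed
    """
    n1, n2 = len(x1), len(x2)
    z1 = (np.sum(x1) - np.sum(y1)) / np.sqrt(2 * n1)  # stage-1 standardized difference
    z2 = (np.sum(x2) - np.sum(y2)) / np.sqrt(2 * n2)  # stage-2 standardized difference
    return np.sqrt(lam) * z1 + np.sqrt(1 - lam) * z2

# One-sided test at level 0.025: reject H0 when the statistic exceeds
# norm.ppf(0.975), i.e. about 1.96.
```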

We start with an initial estimate of the treatment effect μT − μC. This initial estimate is called Δ0. After the first n1 observations are obtained, we will form a new estimate of the treatment effect, Δ1. The goal is to choose a real number λ between 0 and 1 and strictly positive integers n1 and n2 so that there is a (1 − β)100% chance of rejecting the null hypothesis. We will assume that the treatment effect is Δ0 when choosing λ and n1, and we will assume the treatment effect is Δ1 when choosing n2.

If we assume that μT − μC = Δ0, then it is reasonable initially to choose the total sample size, n1 + n2, to be N = 2(Φ⁻¹(β) + Φ⁻¹(0.025))²/Δ0², where Φ denotes the standard normal distribution function, and to choose λ = n1/N. With these choices, the power will be (1 − β)100% if in fact μT − μC = Δ0, and we choose n2 = N − n1. Note that n1 should be chosen not on the basis of power, but on the basis of other considerations in this case, because the power will be the same for any choice of n1.
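As a quick check of this formula, a sketch (scipy's norm.ppf plays the role of Φ⁻¹):

```python
from scipy.stats import norm

def fixed_design_n(delta0, beta=0.1, alpha=0.025):
    """Per-group total sample size N = 2(Phi^{-1}(beta) + Phi^{-1}(alpha))^2 / delta0^2."""
    return 2 * (norm.ppf(beta) + norm.ppf(alpha)) ** 2 / delta0 ** 2

print(fixed_design_n(0.2))  # about 525.4, matching the N = 525 used in the example below
```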

After observing n1 variables from each group, $(\sum_{i=1}^{n_1} X_i - \sum_{i=1}^{n_1} Y_i)/\sqrt{2n_1}$ is observed to be t1, say. Using Z to denote a standard normal random variable, the power of the adaptive test statistic when the treatment effect is assumed to be equal to Δ1 can be expressed as

$$P\left[\sqrt{\lambda}\,t_1 + \sqrt{1-\lambda}\left(\frac{\Delta_1\sqrt{n_2}}{\sqrt{2}} + Z\right) > \Phi^{-1}(1-0.025)\right]$$

The second-stage sample size n2 for which the correct conditional power will be achieved is found by solving the equation resulting from setting this expression equal to 1 − β. The solution is

$$n_2 = \frac{2\left(\Phi^{-1}(1-0.025) - \sqrt{1-\lambda}\,\Phi^{-1}(\beta) - \sqrt{\lambda}\,t_1\right)^2}{(1-\lambda)\,\Delta_1^2} \qquad (1)$$

if $t_1 < \{\Phi^{-1}(1-0.025) - \sqrt{1-\lambda}\,\Phi^{-1}(\beta)\}/\sqrt{\lambda}$, and n2 = 1 otherwise.

One choice for Δ1 is the estimated treatment effect at the interim look, $(\sum_{i=1}^{n_1} X_i - \sum_{i=1}^{n_1} Y_i)/n_1$. However, one may borrow information from other sources to choose Δ1. The adaptive test statistic places no restrictions on the procedure used to change the sample size.
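Equation (1) translates directly into code. A sketch under the same conventions (the integer rounding and function name are my additions):

```python
import numpy as np
from scipy.stats import norm

def second_stage_n(t1, delta1, lam, beta=0.1, alpha=0.025):
    """Second-stage per-group sample size from equation (1): n2 is chosen so
    that the conditional power given t1 equals 1 - beta when the effect is delta1."""
    num = norm.ppf(1 - alpha) - np.sqrt(1 - lam) * norm.ppf(beta) - np.sqrt(lam) * t1
    if num <= 0:
        return 1  # t1 exceeds the threshold: conditional power already above target
    return int(np.ceil(2 * num ** 2 / ((1 - lam) * delta1 ** 2)))
```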

As an example, suppose that initially we believe that the treatment effect is 0.2, but in fact the true unknown treatment effect is 0.15. The sample size may be adjusted after half of the data are observed, and it is desired to have 90% power. We plan on observing a total sample size of 2(Φ⁻¹(0.1) + Φ⁻¹(0.025))²/0.2² = 525 per group initially. Since the sample size adjustment will occur after half, approximately 262, of the variables are observed in each group, the weight that we will use is λ = 1/2.

Table I shows the expected total sample size and the power using the adaptive test procedure. In columns (A) and (B), these averages were obtained using Monte Carlo integration from 50 000 simulated trials. In each trial, t1 is simulated, n2 is calculated, and the power for this trial is calculated as

$$P\left[\sqrt{\lambda}\,t_1 + \sqrt{1-\lambda}\,\frac{0.15\sqrt{n_2}}{\sqrt{2}} + \sqrt{1-\lambda}\,Z > \Phi^{-1}(1-0.025)\right]$$

Table I. Average total sample size (n1+n2) and estimated power (target power 0.9, Δ0 = 0.2, n1 = 262, 50 000 simulations). See text for the procedure used in each column.

                    A       B       C       D
Average n1+n2     2791     837     933     525
Power            0.903   0.901   0.884   0.681


Column A corresponds to using $\Delta_1 = (\sum_{i=1}^{n_1} X_i - \sum_{i=1}^{n_1} Y_i)/n_1$, the observed treatment effect at the interim analysis, to calculate n2. In order to avoid unreasonable sample sizes, the estimated treatment effect is truncated to lie above 0.05. Column B corresponds to using Δ1 = 0.15 when choosing n2. In this case, the power is always 90%. Column C corresponds to using n2 = 2(Φ⁻¹(0.1) + Φ⁻¹(0.025))²/0.15² − n1. This corresponds to the optimal total sample size if the true treatment effect were known at the beginning of the study and no adaptive test procedure is used. In column D, n2 = 263, the total sample size that would be used if there were no adjustment based on the interim look.
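The Monte Carlo experiment behind columns A and B is easy to reproduce in outline. A sketch under the setup of this example (unit variance, true effect 0.15, n1 = 262, λ = 1/2; all helper names are mine):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
delta_true, n1, lam = 0.15, 262, 0.5
z_crit, z_beta = norm.ppf(0.975), norm.ppf(0.1)

totals, powers = [], []
for _ in range(50_000):
    # First-stage statistic: t1 ~ N(delta * sqrt(n1/2), 1)
    t1 = rng.normal(delta_true * np.sqrt(n1 / 2), 1)
    # Column A: interim effect estimate, truncated below at 0.05
    delta1 = max(0.05, t1 * np.sqrt(2 / n1))
    num = z_crit - np.sqrt(1 - lam) * z_beta - np.sqrt(lam) * t1
    n2 = 1 if num <= 0 else 2 * num ** 2 / ((1 - lam) * delta1 ** 2)
    totals.append(n1 + n2)
    # Conditional power of this trial given t1, evaluated at the true effect
    powers.append(norm.sf((z_crit - np.sqrt(lam) * t1) / np.sqrt(1 - lam)
                          - delta_true * np.sqrt(n2 / 2)))

print(np.mean(totals), np.mean(powers))  # should land near column A of Table I
```

Replacing delta1 with the true value 0.15 gives column B.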

Column D shows the obvious consequence of overestimating the treatment effect and not adjusting the sample size. In other words, the chances of arriving at the correct conclusion will be lower than expected. Column C shows that although the unadjusted test statistic has 90% power with N = 933, the adaptive test statistic does not, because it weights the information differently. In column B, we see that the average total sample size using the optimal choice for n2 when the treatment effect is known after the interim look is 837. This is smaller than the sample size that would be needed if the true treatment effect were used to design a study with no interim look (933). In column A, the average total sample size is 2791, but this is somewhat misleading because the distribution of the total sample size is skewed: the median is only 749. Since the square of the estimated effect size is in the denominator of formula (1), one would expect to see this type of skewed distribution.

What is shown in column B bears repeating. We started with the wrong estimate of the treatment effect, but later learned the true treatment effect. Using the adaptive test procedure, we were able to design a study that has the same significance level and power with a smaller average total sample size than the design with no interim look using the correct effect size. If we decide not to change the sample size after the interim look, then the test statistic is unaffected. Hence, there is no statistical penalty for allowing the possibility of changing the sample size, and there is much to be gained if our initial guess at the treatment effect was too high.

Under some circumstances, it may be advantageous to abandon the current study in favour of starting a whole new study. For any n2 and treatment effect δ, the power of the adaptive test statistic is

$$P\left[\sqrt{\lambda}\,t_1 + \sqrt{1-\lambda}\left(\frac{\delta\sqrt{n_2}}{\sqrt{2}} + Z\right) > \Phi^{-1}(1-0.025)\right]$$

while the power of the test that ignores all the data before the interim look is

$$P\left[\frac{\delta\sqrt{n_2}}{\sqrt{2}} + Z > \Phi^{-1}(1-0.025)\right]$$

The latter expression will be larger than the former when

$$\frac{\Phi^{-1}(1-0.025) - \sqrt{\lambda}\,t_1 - \sqrt{1-\lambda}\,\delta\sqrt{n_2}/\sqrt{2}}{\sqrt{1-\lambda}} > \Phi^{-1}(1-0.025) - \frac{\delta\sqrt{n_2}}{\sqrt{2}}$$

Solving for t1, we see that it is beneficial to abandon the study and start from scratch when

$$t_1 < \Phi^{-1}(1-0.025)\,\frac{1 - \sqrt{1-\lambda}}{\sqrt{\lambda}}$$

This does not depend on the postulated treatment effect. For instance, in the example above with λ = 1/2, one should abandon the study if $t_1 < \Phi^{-1}(1-0.025)(1 - \sqrt{1-1/2})/\sqrt{1/2} \approx 0.812$. When n1 = 262, this corresponds to an observed treatment effect of $t_1\sqrt{2}/\sqrt{262} \approx 0.071$.
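As a quick numerical check of this rule (a sketch):

```python
from math import sqrt
from scipy.stats import norm

lam, n1 = 0.5, 262
t1_cut = norm.ppf(0.975) * (1 - sqrt(1 - lam)) / sqrt(lam)
print(t1_cut)                 # about 0.812
print(t1_cut * sqrt(2 / n1))  # about 0.071, the corresponding observed effect
```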

The investigator can change the sample size at his or her discretion. The type I error is guaranteed to be controlled in any case with this procedure. In the simulations in column A, the estimated treatment effect used for sample size re-estimation at the interim analysis is truncated below at 0.05. This truncation has the effect of putting an upper bound on the maximum sample size of roughly (0.2/0.05)² = 16 times the initial planned sample size. In a sense, one can think of this as representing a smallest clinically meaningful treatment effect size. The guideline in the previous paragraph suggests that one might want to abandon the study if the observed treatment effect is less than 0.07. Following this guideline would further limit the maximum sample size to roughly nine times the initial planned sample size. The actual decision will be case-specific and may depend on economic factors, the consideration of clinically meaningful effect size, and other factors.

There are many other scenarios of interest. For example, in Table I the interim analysis was done at expected information time 1/2. One unanswered question is the effect of doing the sample size re-estimation at earlier or later information time points (various choices of n1). Also, in Table I the original postulated treatment effect (0.2) was higher than the true treatment effect (0.15). A second unanswered question is the effect of having a postulated treatment effect lower than or exactly equal to the true treatment effect. Finally, what are the consequences of placing different upper bounds on the maximum total sample size, and what is the effect of including a futility boundary? These issues are investigated in a second simulation study.

There are two procedures examined under 31 different scenarios in this more comprehensive simulation study (Table II). The two procedures are identical to those used in columns A and B of Table I. In other words, procedure A uses $\Delta_1 = \max\{0.05,\ (\sum_{i=1}^{n_1} X_i - \sum_{i=1}^{n_1} Y_i)/n_1\}$, while procedure B uses the true effect, Δ1 = 0.15, to calculate n2.

The first 19 scenarios in Table II are similar to the scenario used in Table I, but with different choices of weights (determined by λ), initial postulated effect size (Δ0), and maximum allowable sample size per group (M). For the rows where M is a finite number, the sample size is truncated at M if the projected sample size is larger than M.

The next nine rows in Table II correspond to scenarios where the variance is unknown and initially assumed to be 1, but estimated from the data at the interim analysis by the usual pooled estimate of the common variance, σ̂². For these nine scenarios, the second-stage sample size is defined as before, with Δ1/σ̂ used in place of Δ1.

The final three scenarios in Table II all assume an initial postulated treatment effect, Δ0 = 0.2, larger than the truth. Also, a futility boundary is included and the study is terminated if the estimated treatment effect is less than a prespecified number, Δmin. This futility boundary decreases the chances of a false positive conclusion, resulting in deflation of the type I error. Hence, the critical value used at the final analysis is smaller than would otherwise be used to maintain the same type I error rate. This is statistically valid only if the futility boundary is strictly followed.

Two general comments about Table II are in order. First, the author verified by simulation that in each scenario the type I error is controlled at 0.025. Although this is proven mathematically, it is reassuring to see that it is true even when there are futility boundaries, estimated variances, or upper bounds on the maximum sample size. Secondly, the unconditional power of procedure A is not too far from the target power of 90%, although it varies across scenarios. When the true effect size is known when re-estimating the sample size, as in procedure B, the conditional power is always held to exactly 90% if possible. If it were possible for all first-stage samples to make the conditional power equal to 90%, then the unconditional power would also be exactly 90%. The exceptions are when there is an upper bound on the maximum sample size or when the conditional power is higher than 90% even if only one more observation is taken. In particular, if M is infinite, then the conditional power can always be made at least 90% and therefore the unconditional power will be at least 90%. In scenarios 29–31, the conditional power is 0% for those first-stage samples where the futility boundary is crossed. This explains the drop in unconditional power.

There are several observations about scenarios 1–19. If the postulated effect size is Δ0 = 0.1, then the interim sample size is n1 = 1051. In scenarios 1, 8, and 15, the sample size re-estimation is done after 1051 observations are made in each group. The mean sample size when using procedure A is between 1400 and 1600, depending on the choice of λ, but is nonetheless always smaller than the final sample size of the fixed design (2052). The median sample size is between 1100 and 1300. Considering the mean or median sample size, in these three scenarios the best choice appears to be λ = 0.5 with either procedure. In scenarios 2, 9, and 16, the postulated effect size is the true effect. Although all three choices of λ result in a smaller median sample size than the fixed design (933), the best among the three appears to be λ = 0.75. Likewise, in scenarios 5, 12, and 19, all three have smaller median sample size than 933, but the smallest is achieved when λ = 0.75. Overall, the smallest median sample size appears to be obtained when the postulated effect size is equal to the true effect and λ = 0.75 is used. Underestimating the effect can substantially inflate the median sample size. On the other hand, overestimating the effect can result in a decrease in the median sample size if λ = 0.25 and an increase in the median sample size if λ = 0.75.

Table II. Estimated power, mean and median total sample size (n1+n2) under various scenarios. In all scenarios, the target power is 0.9, δ = 0.15 and 50 000 simulations are used. See text for further information.

                                 Procedure A             Procedure B
      λ     Δ0     M         Power   Mean   Med.     Power   Mean   Med.

n1 = (Φ⁻¹(0.1) + Φ⁻¹(0.025))²/Δ0²
 1   0.25   0.1    ∞         0.915   1596   1268     0.901   1298   1268
 2   0.25   0.15   ∞         0.896   1842    903     0.900    935    905
 3   0.25   0.15   2500      0.896   1215    902     0.900    935    904
 4   0.25   0.15   1250      0.868    924    902     0.894    919    904
 5   0.25   0.2    ∞         0.874   2406    843     0.900    872    842
 6   0.25   0.2    2500      0.875   1225    850     0.900    873    844
 7   0.25   0.2    1250      0.846    848    839     0.895    856    841
 8   0.5    0.1    ∞         0.945   1416   1099     0.921   1172   1108
 9   0.5    0.15   ∞         0.923   1908    745     0.903    833    744
10   0.5    0.15   2500      0.922   1112    742     0.903    832    742
11   0.5    0.15   1250      0.888    843    745     0.888    800    744
12   0.5    0.2    ∞         0.903   2791    749     0.901    837    748
13   0.5    0.2    2500      0.900   1168    742     0.901    836    746
14   0.5    0.2    1250      0.854    800    748     0.878    781    748
15   0.75   0.1    ∞         0.980   1522   1208     0.963   1370   1199
16   0.75   0.15   ∞         0.953   2347    617     0.921    867    662
17   0.75   0.15   2500      0.945   1066    620     0.917    857    663
18   0.75   0.15   1250      0.891    788    616     0.880    767    658
19   0.75   0.2    ∞         0.932   4016    702     0.908    968    712

n1 = (Φ⁻¹(0.1) + Φ⁻¹(0.025))²/Δ0², λ = 0.5, σ unknown
      Δ0    σ
20   0.1    0.9              0.953   1236   1082     0.934   1127   1085
21   0.1    1.0              0.945   1429   1100     0.921   1174   1109
22   0.1    1.1              0.939   1719   1146     0.914   1251   1157
23   0.15   0.9              0.929   1341    625     0.905    700    627
24   0.15   1.0              0.923   1880    741     0.903    831    743
25   0.15   1.1              0.917   2585    878     0.902    990    882
26   0.2    0.9              0.912   1983    593     0.901    668    593
27   0.2    1.0              0.904   2801    753     0.900    840    751
28   0.2    1.1              0.895   3818    932     0.899   1039    930

Futility boundary Δmin, Δ0 = 0.2, n1 = 262, λ = 0.5
      Δmin
29   0                       0.860   2086    649     0.862    757    700
30   0.05                    0.778   1139    507     0.786    641    606
31   0.07                    0.728    793    438     0.741    579    545



Moreover, one can see the impact of including a maximum sample size, M. If λ and Δ0 are fixed, then the median sample size appears to be unaffected by the inclusion of M. The decrease in power is negligible with M = 2500, but is slightly greater with M = 1250. The choice of M appears to have very little effect on the mean sample size in procedure B, but has a dramatic effect in procedure A. Also, when M = 1250, the mean sample size using procedure A is only slightly higher than the mean sample size using procedure B. This is interesting because procedure B uses the true value of δ to re-estimate the sample size.

Assuming the true effect size is known, the fixed sample size design has 933 patients per group. As was observed in Table I, in all of these scenarios the median sample size is close to 933 and is smaller in many cases. The mean sample size can be very large when there is no maximum sample size, owing to a few large outliers. Defining a finite M also reduces the mean sample size to below 933. In scenario 14, for example, the initial postulated effect size was higher than the true δ. Using the interim observed effect size to re-estimate the sample size and placing an upper bound of M = 1250 on the maximum sample size with procedure A results in a mean sample size of 800. In summary, for practical reasons there will always be some upper limit on the sample size. Under the scenarios examined here, the inclusion of a modest upper bound has little effect on the power or median sample size, but can greatly reduce the mean sample size.

In scenarios 20–28, the sample size re-estimation includes an estimate of the variance. When the initial postulated variance is the true variance, the effect of using the estimated variance is negligible. This is seen by comparing rows 21, 24, and 27 with rows 8, 9, and 12, respectively. If the true variance is smaller than 1, then the median sample size is slightly larger. In scenarios 20–22, there is little difference in the median sample size. A possible explanation for this is that in all three cases most of the data needed is already observed at the interim analysis because the postulated effect size was smaller than the true δ. In the other scenarios, the median sample size seems to be roughly 20% smaller if σ = 0.9 and 20% larger if σ = 1.1, relative to the sample size when σ = 1. Also, the median sample sizes compare favourably to the fixed sample size design (757 when σ = 0.9 and 1130 when σ = 1.1).

In scenario 29, the futility boundary was crossed in 4.4% of the runs. When this happened, the total sample size was counted as n1 and the contribution to the estimated power was 0. Hence, this leads to different interpretations of the power and the mean and median sample sizes. For example, among those runs where the futility boundary was not crossed, the median sample size was 693 and the power was 90% for procedure A. In scenarios 30 and 31, the futility boundary was crossed in 12.7% and 17.8% of the runs, and the median sample size in the remaining runs was 601 and 547 for procedure A, respectively. The achieved power using procedure B is approximately equal to 90% of the proportion of runs that do not cross the early futility boundary; for example, in scenario 31, 74.1% ≈ 90% × (1 − 0.178).

3. CHOOSING THE DESIGN PARAMETERS IF δ HAS AN A PRIORI NORMAL DISTRIBUTION

Since there is uncertainty about the treatment effect, it may be worthwhile to think of it as a random variable. The approach that is outlined in this section is not a Bayesian approach to the test of the treatment effect. The Bayesian paradigm is only used here as a tool to help find the design parameters λ and n1. In this section, we will view n2 as a function of the observed data at the interim look. Specifically, n2 will be chosen to make the adaptive test statistic have the desired power assuming the treatment effect is the posterior mean.

Modifying the example used in Section 2, suppose that initially it is felt that the treatment effect is N(0.2, τ²). More specifically, the actual true treatment effect is an unknown number that is chosen from a normal distribution with mean 0.2 and variance τ². This can be viewed as a prior distribution for the unknown parameter. In practice, one can think of it as specifying a best guess at the true treatment effect (in this case 0.2) together with some idea of how uncertain the investigator is about this guess (measured by τ²). Since a normal random variable will differ from its mean by less than two standard deviations about 95% of the time, the prior variance, τ², should be chosen so that the interval (0.2 − 2τ, 0.2 + 2τ) consists of the most likely values of the treatment effect. It is important to realize that since we are allowing for some uncertainty in our estimate of the treatment effect, this should be considered in choosing n1.

After the first n1 observations are obtained, the posterior mean of the treatment effect is

$$\frac{\dfrac{1}{\tau^2}(0.2) + \dfrac{n_1}{2}\left(\dfrac{\sum_{i=1}^{n_1} X_i - \sum_{i=1}^{n_1} Y_i}{n_1}\right)}{\dfrac{1}{\tau^2} + \dfrac{n_1}{2}}$$
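This is the usual precision-weighted average: under unit variance the interim estimate has variance 2/n1, hence precision n1/2, which is combined with the prior precision 1/τ². A minimal sketch (the function name is mine):

```python
import numpy as np

def posterior_mean(x1, y1, prior_mean=0.2, tau=0.02):
    """Posterior mean of the treatment effect after n1 observations per group,
    combining a N(prior_mean, tau^2) prior with the interim estimate."""
    n1 = len(x1)
    delta_hat = (np.sum(x1) - np.sum(y1)) / n1   # interim estimate, variance 2/n1
    w_prior, w_data = 1 / tau ** 2, n1 / 2       # precisions
    return (w_prior * prior_mean + w_data * delta_hat) / (w_prior + w_data)
```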

Although λ and n1 are free parameters, for the purpose of illustration suppose that we define λ = n1/525. The expected sample size as a function of n1 when the treatment effect is N(0.2, 0.02²), using the preceding definitions of λ and n2, is shown in Figure 1. The two curves in the figure correspond to different priors; both are centred at 0.2, but one is more diffuse (τ = 0.02 versus τ = 0.04).

Figure 1. Average total sample size as a function of n1 with different priors.

In general, the optimal design can be found under different assumptions by minimizing the average sample size (or optimizing some other criterion) jointly over λ and n1.
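A curve such as those in Figure 1 can be approximated by simulation. The following sketch estimates the expected total sample size for a given n1 under the prior (the λ = n1/525 convention is from the text; the function name and simulation details are my own):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def expected_total_n(n1, prior_mean=0.2, tau=0.02, beta=0.1, n_sim=20_000):
    """Monte Carlo estimate of E[n1 + n2] when delta ~ N(prior_mean, tau^2) and
    n2 targets conditional power 1 - beta at the posterior mean."""
    lam = n1 / 525
    delta = rng.normal(prior_mean, tau, n_sim)           # true effect drawn from the prior
    t1 = rng.normal(delta * np.sqrt(n1 / 2), 1)          # first-stage statistic
    d_hat = t1 * np.sqrt(2 / n1)                         # interim estimate
    w0, w1 = 1 / tau ** 2, n1 / 2
    d_post = (w0 * prior_mean + w1 * d_hat) / (w0 + w1)  # posterior mean
    num = norm.ppf(0.975) - np.sqrt(1 - lam) * norm.ppf(beta) - np.sqrt(lam) * t1
    n2 = np.where(num <= 0, 1, 2 * num ** 2 / ((1 - lam) * d_post ** 2))
    return n1 + n2.mean()

for n1 in (100, 262, 400):
    print(n1, round(expected_total_n(n1)))
```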

4. CONCLUSIONS

By carefully thinking about the prior knowledge about the treatment effect, it is possible to design studies that maintain the type I and type II error probabilities while allowing for modifications after an interim look. The adaptive test procedure used in this paper is guaranteed to control the type I error rate regardless of how the sample size is adjusted. Many different scenarios were examined in a simulation study. The sample size adjustment advocated here is based on setting the conditional power equal to a specified target (with a possible maximum limit). This showed that under these scenarios the unconditional power varies but is reasonably close to the target, and the average sample size can be smaller than with the fixed sample size design. Power considerations may limit the application of this procedure to these kinds of scenarios. Since there is uncertainty regarding the treatment effect, it is possible to view it as a random variable. It is possible to use this approach to assist in choosing the initial design parameters and to re-estimate the sample size at the interim analysis.

The scenarios all assume that the initial postulated treatment effect is not too far from the true effect. Furthermore, the focus in this paper is on testing for differences in normal means, although the adaptive designs may be used in many other contexts, including survival analysis (see [2,5]). Finally, this paper uses the adaptive test of [2], but other adaptive test procedures have been proposed in the literature (see [1,3,6–9]). Any of these procedures allows flexibility in sample size adjustments based on observed data. These procedures may be expected to have similar power and average sample size characteristics.

REFERENCES

1. Fisher L. Self-designing clinical trials. Statistics in Medicine 1998; 17: 1551–1562.

2. Cui L, Hung HMJ, Wang S. Modification of sample size in group sequential clinical trials. Biometrics 1999; 55: 853–857.

3. Lehmacher W, Wassmer G. Adaptive sample size calculations in group sequential trials. Biometrics 1999; 55: 1286–1290.

4. Dunnett C, Gent M. An alternative to the use of two-sided tests in clinical trials. Statistics in Medicine 1996; 15: 1729–1738.

5. Lawrence J. Strategies for changing a test statistic during a clinical trial. Journal of Biopharmaceutical Statistics 2002; 12: 193–205.

6. Bauer P, Köhne K. Evaluation of experiments with adaptive interim analyses. Biometrics 1994; 50: 1029–1041.

7. Proschan MA, Hunsberger SA. Designed extension of studies based on conditional power. Biometrics 1995; 51: 1315–1324.

8. Lan KKG, Trost DC. Estimation of parameters and sample size re-estimation. Proceedings of the Biopharmaceutical Section, American Statistical Association 1997: 48–51.

9. Liu Q, Chi G. On sample size and inference for two-stage adaptive designs. Biometrics 2001; 57: 185–190.
