Statistics in Clinical Cancer Research

Comparison of Statistical Analysis Plans in Randomize-All Phase III Trials with a Predictive Biomarker

Shigeyuki Matsui1, Yuki Choai2, and Takahiro Nonaka2,3

Authors' Affiliations: 1Department of Biostatistics, Nagoya University Graduate School of Medicine, Nagoya; 2Department of Statistical Science, The Graduate University for Advanced Studies; and 3Pharmaceuticals and Medical Devices Agency, Tokyo, Japan

Note: Supplementary data for this article are available at Clinical Cancer Research Online (http://clincancerres.aacrjournals.org/).

Corresponding Author: Shigeyuki Matsui, Department of Biostatistics, Nagoya University Graduate School of Medicine, 65 Tsurumai-cho, Showa-ku, Nagoya 466-8550, Japan. Phone: 81-52-744-2487; Fax: 81-52-744-2488; E-mail: [email protected]

doi: 10.1158/1078-0432.CCR-13-2698

Published OnlineFirst April 1, 2014. Clin Cancer Res; 20(11) June 1, 2014. ©2014 American Association for Cancer Research.

Abstract

When there are no compelling biologic or early trial data for a candidate predictive biomarker with regard to its ability to predict the effect of an anticancer treatment at the initiation of definitive phase III trials, it is generally reasonable to include all patients as eligible for randomization but to plan for a prospective subgroup analysis based on the biomarker. We assessed such statistical analysis plans, fixed-sequence, fallback, and treatment-by-biomarker interaction approaches, in terms of the probability of asserting treatment efficacy for either the overall patient population or a biomarker-positive subpopulation of patients. If there was some evidence that the treatment would work better in the biomarker-positive subgroup than the biomarker-negative subgroup, then the fixed-sequence approaches would be favored, whereas if evidence was weak that there would be much difference in responsiveness between the two subgroups, then the fallback approach would be favored. If there was substantial uncertainty in the difference in treatment effects between the two subgroups, the treatment-by-biomarker interaction approach could be a reasonable choice as this approach generally provided a high probability of asserting treatment efficacy for the right patient population under homogeneous treatment effects and a qualitative interaction over biomarker-based subgroups. Clin Cancer Res; 20(11); 2820–30. ©2014 AACR.

Introduction

Advances in genomics and biotechnology have revealed substantial molecular heterogeneity among human cancers with the same histologic diagnosis. As this heterogeneity has rendered many cancer treatments beneficial only for a subset of patients with cancer, there is a growing need for the development of biomarkers to predict the responsiveness to new treatments under development (1–3). Eventually, the medical utility of each treatment must be established with the aid of the developed predictive biomarker in a prospective, phase III randomized clinical trial.

When a reliable predictive biomarker is available at the initiation of a phase III trial, an enrichment or targeted design that randomizes only a subset of patients predicted by the biomarker to benefit from the treatment can be an efficient trial design (4). However, it is more common that at the initiation of phase III trials, there are no compelling biologic or early trial data for a candidate predictive biomarker regarding its capability to predict treatment effects or there is uncertainty about a cutoff point of an analytically validated predictive assay. In such situations, it is generally reasonable to include all patients as eligible for randomization, as done in traditional clinical trials, but to plan for prospective subset analysis based on the predictive biomarker with a control of the study-wise type I error rate at the level α, for example, α = 2.5% at a one-sided level, under the global null hypothesis of no treatment effects for any patients (5–14). For such statistical analysis plans that allow this, we can identify three approaches: fixed-sequence, fallback, and treatment-by-biomarker interaction approaches (see Fig. 1).

The fixed-sequence approaches first test treatment efficacy for a biomarker-based subset of patients using significance level α. If this is significant, treatment efficacy for the remaining patients (fixed-sequence-1; ref. 7) or for the overall population (fixed-sequence-2; ref. 10) is tested using the same significance level α. As an example of the fixed-sequence-1 approach, in a randomized phase III trial of panitumumab with infusional 5-fluorouracil, leucovorin, and oxaliplatin (FOLFOX) versus FOLFOX alone, the treatment arms were first compared on the basis of progression-free survival (PFS) for patients with wild-type KRAS tumors (15). Treatment comparisons in patients with mutant KRAS tumors were conditional on a significant difference in the first test for the wild-type KRAS stratum. An example of the fixed-sequence-2 approach is a phase III trial testing cetuximab in addition to FOLFOX as adjuvant therapy in stage III colon cancer (N0147). In the analysis plan, the efficacy of the regimen on disease-free survival is first tested in the patients with wild-type KRAS using a log-rank test at a



[Figure 1: flowcharts of the analysis plans in five panels (fixed-sequence 1, fixed-sequence 2, MaST, fallback, and treatment-by-biomarker interaction). Each flowchart passes through a sequence of significance tests (in the B+ subset, the B− subset, the overall population, or for the treatment-by-biomarker interaction) and ends in one of three outcomes: assert treatment efficacy for the overall population, assert treatment efficacy for the B+ subset, or no assertion.]

Figure 1. Approaches of statistical analysis plans for randomize-all phase III trials with a predictive biomarker. All α levels are one-sided.


significance level of 2.5% (one-sided), followed by an overall test at a significance level of 2.5% (one-sided) if the first subset analysis is statistically significant (10).

The fallback approaches test treatment efficacy in the overall population, followed by a test of treatment efficacy in a biomarker-based subset if the first test is not significant (8). Reduced significance levels are used for these tests to preserve the study-wise error rate α. Parallel testing for the overall population and a biomarker-based subset can also be considered. For example, in the SATURN trial (16) to assess the use of erlotinib as maintenance therapy in patients with nonprogressive disease following first-line platinum-doublet chemotherapy, PFS after randomization was tested in all patients at a significance level of 1.5% (one-sided) and in the patients whose tumors had EGFR protein overexpression at a significance level of 1% (one-sided). The parallel assessment will have the same statistical properties as the fallback assessment when the result of the overall test is prioritized.

The treatment-by-biomarker interaction approaches involve deciding whether to compare treatments overall or within the biomarker-based subsets based on a preliminary test of interaction of treatment and biomarker (6, 7). For example, in the MARVEL trial (17) to compare erlotinib and pemetrexed as second-line treatment for non–small cell lung cancer, the analysis was planned to be conducted separately in EGFR-positive and -negative patients, with the use of an interaction test on the difference in treatment effects between the two subsets of patients.

The important feature of the aforementioned approaches is that they can demonstrate treatment efficacy for either the overall patient population or a biomarker-based subset of patients adaptively based on the observed clinical trial data. When focusing on this feature, it is critical to evaluate the probability of asserting treatment efficacy for the right patient population (i.e., either the overall population or the biomarker-based subset). However, to the best of our knowledge, there is no such evaluation in the literature.

In this article, we provide a benchmark for comparing the three approaches of statistical analysis plans in terms of their ability to assert treatment efficacy for the right patient population. With a discussion on the criteria for clinical validation of predictive biomarkers, we aim to provide some general conclusions about which approach to use.

Materials and Methods

We consider a phase III randomized trial to compare a new treatment and its control on the basis of survival outcomes. We suppose that at the time the trial is initiated, a candidate predictive biomarker is available. In many cases, biomarker values are dichotomous or cutoff points are used to classify the biomarker results as either "positive" or "negative," denoted by B+ and B−, respectively. Typically, B+ represents the subset of patients that is expected to be responsive to the treatment, whereas B− represents the remainder. Let p+ denote the prevalence of B+ in the patient population. Randomization can be either stratified or unstratified on the basis of the predictive biomarker. We suppose a stratified trial because it ensures observation of the biomarker status for all randomly assigned patients.

We evaluated the three approaches of statistical analysis plans: fixed-sequence, fallback, and treatment-by-biomarker interaction approaches. Each approach can demonstrate treatment efficacy for either the overall patient population or the B+ subset of patients with a control of the study-wise type I error rate α (Fig. 1). In the following presentations, all α levels are one-sided.

Fixed-sequence approaches

If evidence from biologic or early trial data suggests the predictive ability of the biomarker, it is reasonable to consider first testing treatment efficacy for the B+ subset of patients. In such a situation, one would not expect the treatment to be effective in the B− patients unless it is effective in the B+ patients. Specifically, as a fixed-sequence-1 approach, we first compare the treatment versus control in the B+ patients at a significance level of 0.025. If this test is significant, we compare the treatment versus control in the B− patients at the significance level of 0.025 (7). In another variation of the fixed-sequence approaches, fixed-sequence-2, the second stage involves testing treatment efficacy for the overall population rather than for the subset of B− patients (10). These sequential approaches control the study-wise type I error at 0.025. When both first and second tests are significant, one may assert treatment efficacy for the overall patient population. When only the first test for the B+ patients is significant, one may assert treatment efficacy only for future patients who are biomarker positive.

As a more complex variation, Freidlin and colleagues (13, 14) recently proposed an analysis plan called the marker sequential test (MaST) that performs a fixed-sequence-1 approach using a reduced significance level, such as 0.022, followed by a test of treatment efficacy for the overall population using a significance level of 0.003 (= 0.025 − 0.022) if the first test for the B+ patients is not significant. The second test is intended to improve the power for detecting homogeneous treatment effects between biomarker-based subsets. They recommend setting the significance level for the first test for the B+ patients as 0.022 and 0.04 for the study-wise type I error rate of α = 0.025 and 0.05, respectively, to control the probability of erroneously asserting treatment efficacy for the B− patients at the level α under the hypothesis that the treatment is effective for the B+ patients but not for the B− patients, in addition to controlling the study-wise type I error rate at the level α under the global null hypothesis of no treatment effects for both B+ and B− patients.
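The three decision rules described above can be written down compactly. The following is a minimal sketch, not the authors' implementation: the function names, the outcome labels, and the convention that a test is "significant" when its one-sided p-value is strictly below the level are illustrative assumptions. The inputs are one-sided p-values for the B+ subset, the B− subset, and the overall population.

```python
def fixed_sequence_1(p_pos, p_neg, alpha=0.025):
    """Test B+ first; test B- only if the B+ test is significant."""
    if p_pos >= alpha:
        return "no assertion"
    return "overall population" if p_neg < alpha else "B+ subset"


def fixed_sequence_2(p_pos, p_overall, alpha=0.025):
    """Test B+ first; test the overall population only if B+ is significant."""
    if p_pos >= alpha:
        return "no assertion"
    return "overall population" if p_overall < alpha else "B+ subset"


def mast(p_pos, p_neg, p_overall, alpha=0.025, alpha_pos=0.022):
    """MaST: fixed-sequence-1 with B+ tested at alpha_pos; if the B+ test
    fails, the overall population is tested at alpha - alpha_pos."""
    if p_pos < alpha_pos:
        return "overall population" if p_neg < alpha else "B+ subset"
    if p_overall < alpha - alpha_pos:
        return "overall population"
    return "no assertion"


# Hypothetical p-values, for illustration only.
print(fixed_sequence_1(p_pos=0.010, p_neg=0.300))
print(mast(p_pos=0.030, p_neg=0.500, p_overall=0.002))
```

In this sketch the MaST rule differs from fixed-sequence-1 only in the last branch, which is what gives it a route to an overall assertion when the B+ test alone falls short.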

Fallback approach

When there is limited confidence in the predictive biomarker, it is generally reasonable to assess treatment efficacy for the overall patient population and prepare the subset analysis as a fallback option. Specifically, we first compare the treatment and control overall at a reduced significance level α1 (<α). If this test is not significant, we test treatment efficacy for the B+ patients at a reduced significance level α2 (<α; ref. 8). The significance level α2 can be specified by taking into account the correlation between the first overall test and the second test in the B+ patients (9, 18, 19), where the correlation depends on p+ (see also Appendix). As noted previously, a parallel implementation of the first and second tests will provide study outcomes that are identical to the outcomes of the fallback analysis. When the first test is significant, one may assert treatment efficacy in the overall population. Meanwhile, when only the second test is significant (following a negative result of the first test), one may assert treatment efficacy only in future B+ patients.
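To illustrate how the correlation enters, the sketch below computes the probabilities of the two assertions under the global null for given α1, α2, and p+. It rests on an assumption that is not stated in this section: that the overall and B+ test statistics are asymptotically standard bivariate normal with correlation sqrt(p+), which is what one obtains when E+ = p+E. The function names are hypothetical, and the integral is evaluated with a plain midpoint rule using only the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def fallback_decision(p_overall, p_pos, alpha1, alpha2):
    """Fallback rule: overall test first, B+ test only if it fails."""
    if p_overall < alpha1:
        return "overall population"
    if p_pos < alpha2:
        return "B+ subset"
    return "no assertion"


def fallback_null_probs(alpha1, alpha2, p_plus, steps=4000):
    """Return (P_overall, P_subset) under the global null.

    ASSUMPTION: (Z_overall, Z_+) is standard bivariate normal with
    correlation sqrt(p_plus); requires 0 < p_plus < 1.
    """
    rho = sqrt(p_plus)
    c1 = STD.inv_cdf(1 - alpha1)  # critical value of the overall test
    c2 = STD.inv_cdf(1 - alpha2)  # critical value of the B+ test
    upper = 8.0  # the normal tail beyond 8 is negligible
    h = (upper - c2) / steps
    p_subset = 0.0
    for i in range(steps):  # midpoint rule over z_+ in (c2, upper)
        z = c2 + (i + 0.5) * h
        # P(Z_overall <= c1 | Z_+ = z) for a bivariate normal pair
        cond = STD.cdf((c1 - rho * z) / sqrt(1 - rho ** 2))
        p_subset += STD.pdf(z) * cond * h
    return alpha1, p_subset


# Levels reported in Table 1 for p+ = 0.3.
p_overall, p_subset = fallback_null_probs(0.0150, 0.0120, 0.3)
print(round(p_overall, 3), round(p_subset, 3))
```

Under these assumptions the subset probability is smaller than α2 itself, because some of the trials in which the B+ test is significant have already been claimed by a significant overall test; this is why α2 can exceed the target subset probability.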

Treatment-by-biomarker interaction approaches

Like the fallback approach, the treatment-by-biomarker interaction approaches are used when there is limited confidence in the predictive biomarker. This approach involves a preliminary test of interaction of treatment and biomarker to assess whether there is no difference in treatment effects (in terms of the HR between treatment arms) between the B+ and B− patients (6, 7). To control for the study-wise type I error rate, we propose the following approach: A preliminary one-sided test of interaction is performed as the first stage using a significance level of αINT to detect larger treatment effects in the B+ subset (7). If this test is not significant, the treatment is compared with the control in all patients using a reduced significance level α3 (<α). Otherwise, the treatment is compared with the control in the B+ patients using a significance level of α4 (<α). Here the significance levels, αINT, α3, and α4, are chosen to control the study-wise type I error rate in testing no treatment effects for the overall population and B+ patients at the level α, based on the asymptotic distribution of the test statistics (see Appendix). Here the significance levels depend on the ratio R = E−/E+, where E+ and E− are numbers of events in the B+ and B− patients, respectively. When the interaction is significant and the test for the B+ patients is significant, one may assert treatment efficacy only for the B+ patients. When the interaction is not significant and the overall test is significant, one may assert treatment efficacy for the overall population.
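The two-stage rule can be sketched as follows. The decision function follows the description above; the helper for the interaction p-value is an illustrative assumption, not the statistic defined in the Appendix: it uses the common approximation that an estimated log HR from a 1:1 randomized comparison has variance of about 4 divided by the number of events. All names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def interaction_p_value(hr_pos_hat, hr_neg_hat, e_pos, e_neg):
    """One-sided p-value for a larger treatment effect in B+ than in B-.

    ASSUMPTION: var(log HR estimate) is approximately 4 / events in each
    subset (1:1 randomization), and the two subsets are independent.
    """
    z = (log(hr_neg_hat) - log(hr_pos_hat)) / sqrt(4 / e_pos + 4 / e_neg)
    return 1 - NormalDist().cdf(z)


def interaction_decision(p_int, p_overall, p_pos, alpha3, alpha4,
                         alpha_int=0.1):
    """Interaction test first; it selects which efficacy test is run."""
    if p_int < alpha_int:  # evidence of a larger effect in B+
        return "B+ subset" if p_pos < alpha4 else "no assertion"
    return "overall population" if p_overall < alpha3 else "no assertion"


# Hypothetical estimates: HR 0.5 in B+ (60 events), HR 1.0 in B- (140 events).
p_int = interaction_p_value(0.5, 1.0, e_pos=60, e_neg=140)
# alpha3 and alpha4 as reported in Table 1 for p+ = 0.3.
print(interaction_decision(p_int, p_overall=0.02, p_pos=0.005,
                           alpha3=0.0167, alpha4=0.0110))
```

Unlike the fallback rule, a non-significant overall test here does not open a second route to a subset assertion; the interaction test alone decides which population is tested.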

Criterion in comparing the approaches: probability of asserting treatment efficacy

The approaches of statistical analysis plans can make either of two kinds of assertions regarding treatment efficacy, one for the overall population and the other for the B+ subset of patients. Which of the two assertions is considered to be valid may depend on the underlying treatment effects in the biomarker-based subsets. Specifically, let HR+ and HR− denote the HRs of the treatment relative to the control in the B+ and B− subsets of patients, respectively. If the treatment truly has clinically meaningful effects in all of the patients, for example, HR+ = HR− = 0.7, the assertion of treatment efficacy for the overall population would be more valid than that for the B+ patients because the latter assertion would deprive the remaining B− patients of the chance of receiving the effective treatment. On the other hand, if the treatment can exert a clinically important effect only in the B+ patients, for example, HR+ = 0.5, and no effect in the remaining B− patients, for example, HR− = 1.0, indicating a qualitative interaction between treatment and biomarker, the assertion of treatment efficacy for the B+ patients would be more valid than that for the overall population because the latter assertion would lead to overtreatment of the remaining B− patients with an ineffective, possibly even toxic, treatment.

However, there can be other scenarios in which it is not clear which of the two assertions is valid. For example, the treatment can exert a clinically important effect for the B+ patients, for example, HR+ = 0.5, but some moderate or small effects for the remaining B− patients, for example, HR− = 0.8, indicating a quantitative interaction between treatment and biomarker. Such a treatment effect profile could be explained by the treatment having multiple mechanisms of action, the misclassification of responsive patients into the B− subset, and so on. Which of the two assertions is considered to be valid will be determined on a case-by-case basis incorporating many factors, including the prevalence of B+, possible adverse effects, treatment costs, prognosis of the disease, availability of other treatment choices, and so on. In such situations, the probability of asserting treatment efficacy for either the overall population or the subset of B+ patients could be another meaningful criterion. From the point of view of treatment developers (e.g., pharmaceutical companies), this probability would always be important because it can be interpreted as the probability of success in treatment development.

Let Poverall, Psubset, and Psuccess denote the probability of asserting treatment efficacy for the overall population, that for the subset of B+ patients, and that of success, respectively. Because the two assertions are mutually exclusive, Poverall + Psubset = Psuccess for the approaches of statistical analysis plans considered here. As such, there is a trade-off between the two probabilities Poverall and Psubset for a given value of Psuccess.
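As a worked illustration of these three probabilities, the sketch below evaluates them for the fixed-sequence-1 approach, the simplest case because its two tests use disjoint subsets of patients. Several ingredients are assumptions rather than the Appendix's formulas: each subset log-rank statistic is taken to be approximately normal with unit variance and mean −log(HR) × sqrt(events/4) (equal allocation), the events split as E+ = p+E and E− = (1 − p+)E, and the two subset statistics are treated as independent. The function name is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def fs1_probs(hr_pos, hr_neg, events, p_plus, alpha=0.025):
    """Return (P_overall, P_subset, P_success) for fixed-sequence-1.

    ASSUMPTION: Z ~ N(-log(HR) * sqrt(events_in_subset / 4), 1) in each
    subset, with the B+ and B- statistics independent.
    """
    c = STD.inv_cdf(1 - alpha)  # one-sided critical value
    mu_pos = -log(hr_pos) * sqrt(p_plus * events / 4)
    mu_neg = -log(hr_neg) * sqrt((1 - p_plus) * events / 4)
    power_pos = 1 - STD.cdf(c - mu_pos)  # B+ test significant
    power_neg = 1 - STD.cdf(c - mu_neg)  # B- test significant
    p_success = power_pos                # any assertion needs B+ significant
    p_overall = power_pos * power_neg    # both tests significant
    p_subset = p_success - p_overall     # B+ significant, B- not
    return p_overall, p_subset, p_success


# Global null: P_success equals alpha and P_overall equals alpha squared.
print(fs1_probs(1.0, 1.0, events=200, p_plus=0.3))
# Qualitative interaction scenario (HR+, HR-) = (0.5, 1.0).
print(fs1_probs(0.5, 1.0, events=200, p_plus=0.3))
```

Under the global null this gives Psuccess = 0.025 and Poverall = 0.025 × 0.025 = 0.000625, which agrees with the rounded fixed-sequence-1 entries for the asymptotic approximations in Table 2 (0.001, 0.024, 0.025).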

Results

We compared the approaches of statistical analysis plans in terms of Poverall, Psubset, and Psuccess under several scenarios. We assessed these probabilities based on asymptotic distributions of the stratified log-rank statistics for the overall population, simple log-rank statistics for the biomarker-based subsets, and the interaction test statistic (see Appendix) for various total numbers of events, E = E+ + E−. We suppose E+ = p+E and E− = (1 − p+)E [or R = (1 − p+)/p+] under both null and non-null treatment effects. This is reasonable for many cases, for example, when the number of events is slightly less than the number of patients under adequate follow-up for advanced diseases or when the event rates are comparable across the biomarker-based subsets. The asymptotic distributions are adequate approximations for a wide range of the underlying survival time distributions. Adequacy of using the approximations for limited sample sizes was checked via simulations with exponential survival times.

We considered the prevalence of B+ in the patient population to be p+ = 0.1, 0.3, or 0.5. As to the underlying treatment effects within biomarker-based subsets, we considered the following scenarios: (HR+, HR−) = (1.0, 1.0), (0.7, 0.7), (0.5, 1.0), or (0.5, 0.8), that is, null effects, constant effects, qualitative interaction, and quantitative interaction as described in the previous section. The study-wise type I error rate was specified as α = 0.025. In the MaST approach, we used significance levels of 0.022 and 0.003 for the B+ subset and the overall population, respectively, according to the recommendation by Freidlin and colleagues (14). The significance level for the one-sided interaction test, αINT, in the treatment-by-biomarker interaction approach was specified as 0.1, a small level such that the interaction test could serve as evidence in clinical validation of the predictive biomarker. For the significance levels in the fallback approach, α1 and α2, and the treatment-by-biomarker interaction approaches, α3 and α4, we specified them so that Poverall and Psubset (= α − Poverall) under the global null hypothesis are identical for these approaches for a fair comparison. We considered setting an intermediate or balanced level Poverall = 0.015 (Psubset = 0.01) under the global null. See Table 1 for resultant significance levels. We also considered more unbalanced levels Poverall = 0.005 or 0.02 (Psubset = 0.02 or 0.005) under the global null, but similar conclusions were obtained (see Supplementary Figs. S1–S3). We also evaluated the traditional approach without use of a biomarker as a reference, for which Poverall = Psuccess and Psubset = 0, because there is no option for asserting treatment efficacy for the B+ subset. Note that Psuccess for the fixed-sequence-1 and -2 approaches is always identical to the probability that the test for the B+ subset (the first test in the fixed-sequence approaches) is statistically significant.

We first checked the control of type I error rates. Under the global null (HR+, HR−) = (1.0, 1.0), the probabilities Poverall, Psubset, and Psuccess calculated on the basis of the asymptotic distributions were constant for any values of E. Here, Psuccess under the global null corresponds to the study-wise type I error rate. Table 2 provides these values as well as those obtained by simulations with exponential survival times. Agreement of these two indicates adequacy of using the asymptotic approximations under the global null.

With regard to the results under non-null treatmenteffects, Figs. 2–4 show Poverall, Psubset, and Psuccess calculatedon the basis of the asymptotic distributions for variousvalues of E under constant effects, qualitative interaction,and quantitative interaction, respectively. For the scenarioswith constant treatment effects, (HRþ, HR�) ¼ (0.7, 0.7),where Poverall would be a relevant criterion, the traditionalapproach provided the greatest values of Poverall, as wasexpected (Fig. 2). The fallback and treatment-by-biomarkerinteraction approaches provided slightly reduced values ofPoverall than those of the traditional approach. On the otherhand, the fixed-sequence-1 and -2 provided much smaller

Table 2. Poverall, Psubset, Psuccess under the global null (HRþ, HR�) ¼ (1.0, 1.0) for pþ ¼ 0.3

Probability TraditionalFixed-sequence-1

Fixed-sequence-2 MaST Fallback

Treatment-by-biomarkerinteraction

Asymptotic approximationsPoverall 0.025 0.001 0.005 0.003 0.015 0.015Psubset 0.000 0.024 0.020 0.021 0.010 0.010Psuccess 0.025 0.025 0.025 0.024 0.025 0.025

Simulationsa

Poverall 0.026 0.001 0.005 0.003 0.015 0.015Psubset 0.000 0.023 0.018 0.020 0.010 0.010Psuccess 0.026 0.023 0.023 0.023 0.026 0.025

aExponential survival timeswere generated in theBþ andB�patientswith the baseline event rates, lþ¼l�¼ 1.0, and treatment effects(HRþ, HR�) for a total number of patients of 200. Survival times were uncensored, so that Eþ ¼ pþE holds. Use of other values of thebaseline hazards lþ and l� (possibly lþ „ l�) may not change the resultswith the use of the stratified statisticZoverall for the overall testgiven in the Appendix. Ten thousand simulations were conducted for each configuration.

Table 1. Significance levels of the fallback andtreatment-by-biomarker interaction approachesthat satisfy Poverall ¼ 0.015 and Psubset ¼ 0.01under the global null effects

Fallback approach

Treatment-by-bio-marker interaction

approach

pþ a1 a2 a3 a4

0.1 0.0150 0.0110 0.0167 0.01000.3 0.0150 0.0120 0.0167 0.01100.5 0.0150 0.0140 0.0167 0.0130

NOTE: The significance level of the interaction test wasspecified as aINT ¼ 0.1. We supposed Eþ ¼ pþE for calcu-lating significance levels a3 and a4 in the treatment-by-biomarker interaction approach.


values of Poverall. The MaST approach also provided smaller values of Poverall but showed some improvement over the fixed-sequence-1 approach. Similar trends were observed for Psuccess.

For the scenarios with a qualitative interaction, (HR+, HR−) = (0.5, 1.0), where Psubset would be relevant, the fixed-sequence-1 and MaST approaches performed best, followed by the treatment-by-biomarker interaction approach with some reduction in Psubset (Fig. 3). The fixed-sequence-2 and fallback approaches provided much smaller values of Psubset, especially when p+ ≥ 0.3. Moreover, as E became large, Psubset of these approaches could decrease (with a corresponding increase in Poverall). The fixed-sequence-1, MaST, and treatment-by-biomarker interaction approaches suppressed the increase in Poverall, which is the probability of over-assertion (overtreatment) in this scenario. With respect to Psuccess, the fallback and treatment-by-biomarker interaction approaches provided slightly smaller values than the fixed-sequence-1, -2, and MaST approaches. The traditional approach provided much smaller Psuccess values because Psubset = 0.

[Figure 2 panels: 3 × 3 grid with columns Poverall, Psubset, and Psuccess and rows p+ = 0.1, 0.3, and 0.5. Horizontal axis: number of events E (100 to 400); vertical axis: probability (0.0 to 1.0). Curves: Traditional, Fixed-sequence 1, Fixed-sequence 2, MaST, Fallback, Treatment-by-biomarker interaction.]

Figure 2. Poverall, Psubset, and Psuccess for various numbers of events, E, under constant effects (HR+, HR−) = (0.7, 0.7). The Psuccess curve for the fixed-sequence-1 approach is always coincident with that for the fixed-sequence-2 approach. The Psubset curves for the fixed-sequence-1 and MaST approaches were nearly identical.


Finally, for the scenarios with a quantitative interaction, (HR+, HR−) = (0.5, 0.8), the characteristics of the respective approaches became clearer (Fig. 4). The fallback and fixed-sequence-2 approaches tended to provide larger Poverall, like the traditional approach, whereas the fixed-sequence-1, MaST, and treatment-by-biomarker interaction approaches tended to provide larger Psubset values. For the fallback, fixed-sequence-1, -2, and MaST approaches, Psubset can decrease (with an increase in Poverall) as E increases. In contrast, for the treatment-by-biomarker interaction approach, Poverall can decrease (with an increase in Psubset) as E increases. With respect to Psuccess, all approaches, except the fixed-sequence-2 approach for p+ = 0.1, provided comparable Psuccess values and could perform slightly better than the traditional approach.

Discussion

We have evaluated the three approaches to statistical analysis plans in randomize-all phase III trials with a predictive biomarker in terms of their ability to assert treatment efficacy for the right population. The numerical evaluations indicated that these approaches have their advantages and

[Figure 3 panels: 3 × 3 grid with columns Poverall, Psubset, and Psuccess and rows p+ = 0.1, 0.3, and 0.5. Horizontal axis: number of events E (100 to 400); vertical axis: probability (0.0 to 1.0). Curves: Traditional, Fixed-sequence 1, Fixed-sequence 2, MaST, Fallback, Treatment-by-biomarker interaction.]

Figure 3. Poverall, Psubset, and Psuccess for various numbers of events, E, under a qualitative interaction (HR+, HR−) = (0.5, 1.0). The Psuccess curve for the fixed-sequence-1 approach is always coincident with that for the fixed-sequence-2 approach.


disadvantages depending on the underlying profile of treatment effects across biomarker-based subsets of patients.

Generally, the fixed-sequence-1 approach would be suitable for cases where there are large treatment effects in the B+ patients (Figs. 3 and 4) but could suffer from a serious lack of power for nearly constant treatment effects with relatively moderate effect sizes in the overall population (Fig. 2). Interestingly, the fixed-sequence-2 approach has quite different properties. This approach had characteristics similar to those of the fallback approach under qualitative and quantitative interactions but suffered from a serious lack of power under constant treatment effects, like the fixed-sequence-1 approach (Fig. 2). As expected, the MaST approach showed some improvement in Poverall over the fixed-sequence-1 approach under constant treatment effects while providing comparable Psubset values under the qualitative interaction. The performance under homogeneous treatment effects improved further when significance levels smaller than 0.022 were used in testing for the B+ patients (with larger significance levels for the overall test; data not shown); the level of 0.022 was specified to control the type I error under the hypothesis that the treatment is effective for the

[Figure 4 panels: 3 × 3 grid with columns Poverall, Psubset, and Psuccess and rows p+ = 0.1, 0.3, and 0.5. Horizontal axis: number of events E (100 to 400); vertical axis: probability (0.0 to 1.0). Curves: Traditional, Fixed-sequence 1, Fixed-sequence 2, MaST, Fallback, Treatment-by-biomarker interaction.]

Figure 4. Poverall, Psubset, and Psuccess for various numbers of events, E, under a quantitative interaction (HR+, HR−) = (0.5, 0.8). The Psuccess curve for the fixed-sequence-1 approach is always coincident with that for the fixed-sequence-2 approach.


B+ patients but not for the B− patients. Whether this type of error rate needs strict control, in addition to the strict control of the type I error rate under the global null, is debatable, especially when there is limited confidence in the predictive biomarker to assume that the treatment will not be effective in the B− patients unless it is effective in the B+ patients.

The fallback approach would be suitable for cases with homogeneous treatment effects in the overall population (Fig. 2) but could suffer from a serious lack of power for qualitative interactions between treatment and biomarker (Fig. 3). In other words, the chance of asserting treatment effects for the B+ patients (or the effect of introducing the fallback test) could be at most moderate. One major concern for the fallback approach (and the fixed-sequence-2 approach) is that under qualitative interactions, the chance of asserting treatment effects for the overall population can be very large when p+ ≥ 0.3 (Fig. 3). This suggests the importance of a subset analysis based on the biomarker to evaluate the treatment effects in the B− and B+ subsets even when the primary analysis ended with a significant result of the overall test, in order to protect against over-assertion (overtreatment). When incorporating possible assertions only for the B+ patients based on an additional subset analysis for the B− patients, higher Psubset (and smaller Poverall) values are expected for the fallback (and fixed-sequence-2) approaches. As the fallback approach provided a high chance of asserting treatment efficacy for the overall population under quantitative interactions (Fig. 4), it can work well when a treatment with moderate or even small effects is acceptable for the B− patients, as is the case where there are no effective treatments for such patients.

The treatment-by-biomarker interaction approach had properties intermediate between those of the fallback and fixed-sequence-1 approaches. This approach performed similarly to the fallback approach under homogeneous treatment effects (Fig. 2) and nearly as well as the fixed-sequence-1 approach under a qualitative interaction (Fig. 3). In addition, it generally provided high Psuccess values under all the scenarios. The good performance of this approach can be explained by the effectiveness of the preliminary interaction test in selecting the appropriate population for testing treatment efficacy based on the observed data. Like the fixed-sequence-1 approach (and MaST approach), the treatment-by-biomarker interaction approach can be effective in detecting large treatment effects in the B+ patients, as seen in Figs. 3 and 4. This is because a larger interaction can be interpreted as a larger treatment effect in the B+ subset and vice versa. In contrast with the fallback approach, the treatment-by-biomarker interaction approach can work well when a treatment with moderate effects is clinically unimportant for the B− patients, for example, when established standard treatments are already available for such patients. In practical application, it is important to note that the treatment-by-biomarker interaction approach can perform well even when there is limited confidence in the predictive biomarker, unlike the fixed-sequence-1 approach.

The treatment-by-biomarker interaction approach has been discussed in the literature as a design for clinical validation of the predictive biomarker itself, although it can suffer from a serious lack of power in detecting an interaction (6, 7, 10). When this type of analysis is positioned as one for assessing the medical utility of a new treatment with the aid of a biomarker, as is the case for the proposed treatment-by-biomarker interaction approach with a strict control of the study-wise type I error rate, it can become efficient, as indicated by our numerical evaluations (see Results). One practical issue is that the choice of the significance levels depends on an unknown value of the event ratio R under the global null hypothesis. Optimality of the chosen significance levels under non-null treatment effects is another issue. These issues are subjects for future studies.

Another important indication from our numerical assessment is that the traditional approach has two critical limitations when there is a moderate to large treatment-by-biomarker interaction, as is the case for many targeted treatments. One is a serious lack of power in terms of Psuccess because Psubset = 0 (Fig. 3). The other is its inability to discern whether a significant result in the overall test is driven by large treatment effects only in the B+ patients (Figs. 3 and 4), because this approach does not incorporate any biomarker. Hence, when a candidate biomarker is available for targeted treatments, it is generally advisable to plan for randomize-all phase III trials using biomarker-based analysis plans, such as the treatment-by-biomarker interaction, fixed-sequence-1, MaST, and fallback approaches, taking account of the aforementioned properties of the respective approaches.

We should note that the prevalence of B+, p+, which pertains to the study patients enrolled in the trial, is not necessarily that of the general population in clinical practice. For example, when sample size calculation is performed for testing treatment efficacy for the overall patients and the B+ subset of patients (or for the B+ and B− subsets of patients) separately, the expected prevalence, p+, in the trial may not be equal to the prevalence in the general population. As such, our results (Figs. 2–4) can apply to a wide range of situations, possibly with modulated values of the prevalence p+, to evaluate which analysis plan is efficient for plausible values of the effect sizes (HR+, HR−) and to calculate required sample sizes for a selected analysis plan. R code is available from the authors upon request to produce figures such as Figs. 2–4, possibly using different values of p+, (HR+, HR−), and R, to help in designing actual clinical trials.
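As a rough illustration of how a single curve in figures such as Figs. 2–4 might be computed, the following Python sketch evaluates Poverall, Psubset, and Psuccess for the fallback approach from the asymptotic normal approximations given in the Appendix. It is not the authors' R code. The function names are hypothetical, and the sketch assumes that E+ = p+E, that negative standardized statistics favor the new treatment, and that the fallback procedure tests the overall population at level α1 and, only if that test is not significant, tests the B+ subset at level α2 (defaults taken from Table 1 for p+ = 0.3).

```python
from math import log, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def bvn_lower(a, b, rho, n=2000):
    """P(X < a, Y < b) for a standard bivariate normal with correlation rho.

    Midpoint-rule evaluation of the integral over x in (a - 10, a) of
    phi(x) * Phi((b - rho * x) / sqrt(1 - rho^2)).
    """
    lo = a - 10.0
    h = (a - lo) / n
    s = sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += STD.pdf(x) * STD.cdf((b - rho * x) / s)
    return total * h


def fallback_probs(E, p_plus, hr_plus, hr_minus, alpha1=0.015, alpha2=0.012):
    """Return (Poverall, Psubset, Psuccess) for the assumed fallback rule."""
    e_plus = p_plus * E            # assumption: E+ = p+ * E
    e_minus = E - e_plus
    # Means of Z+ and Zoverall, following the Appendix formulas
    mu_plus = log(hr_plus) * sqrt(e_plus) / 2.0
    mu_overall = (e_plus * log(hr_plus) + e_minus * log(hr_minus)) / (2.0 * sqrt(E))
    rho = sqrt(p_plus)             # correlation between Z+ and Zoverall
    # Rejection thresholds shifted to the mean-zero scale
    a = STD.inv_cdf(alpha1) - mu_overall
    b = STD.inv_cdf(alpha2) - mu_plus
    p_overall = STD.cdf(a)
    p_subset = STD.cdf(b) - bvn_lower(a, b, rho)   # subset significant, overall not
    return p_overall, p_subset, p_overall + p_subset


if __name__ == "__main__":
    # Qualitative interaction scenario, p+ = 0.3
    for events in (100, 200, 300, 400):
        print(events, fallback_probs(events, 0.3, 0.5, 1.0))
```

Other approaches would need their own decision rules and correlation structures, so this sketch covers only one of the six curves in each panel.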

Clinical validation of predictive biomarkers is important, particularly when asserting treatment efficacy for the B+ subset of patients. Probably one of the most widely accepted methods for clinical validation is to conduct a test of interaction between treatment and biomarker. This process is built into the treatment-by-biomarker interaction approach. Although a one-sided interaction test per se is intended to detect larger treatment effects in the B+ patients, not to detect a qualitative interaction of treatment and biomarker, a significant interaction may suggest a qualitative interaction (or no meaningful effects in the


B− patients) because the power of the test under quantitative interactions is generally much lower than that under qualitative interactions. Another criterion for clinical validation would be to demonstrate that the size of treatment effects for the B+ patients is greater than a clinically important effect size, c1, but that for the B− patients is less than a minimum size of clinical importance, c2, where c is an absolute log-HR between treatment arms and c1 ≥ c2. The fixed-sequence-1 and possibly the MaST approach could use this criterion if the B+ and B− subsets are sized separately on the basis of these thresholds as reference levels of effect size. In this case, a plan for interim futility analysis would be warranted for the B− patients because enrolling a large number of these patients who are unlikely to benefit from the treatment can raise ethical concerns (7). Another possible criterion for clinical validation is to demonstrate treatment efficacy with the aid of the biomarker when an overall test of treatment efficacy without use of the biomarker is not significant. The fallback approach seems to use this criterion. However, this is a rather indirect or informal criterion compared with the aforementioned criteria, so one may argue the need for additional clinical validation based on the other criteria described above, outside the formal fallback analysis plan.

In conclusion, this article intends to provide a benchmark for comparing various statistical analysis plans in randomize-all phase III trials in the codevelopment of a treatment and a companion biomarker, to aid in determining adequate phase III trial planning toward personalized medicine. As general guidance on the choice of statistical analysis plan, if there was some evidence that the treatment would work better in the biomarker-positive subset than the biomarker-negative subset, then the fixed-sequence approaches would be favored, whereas if evidence was weak that there would be much difference in responsiveness between the two subsets, then the fallback approach would be favored. If there was substantial uncertainty in the difference in treatment effects between the two subsets, the treatment-by-biomarker interaction approach could be a reasonable choice.

Appendix: Asymptotic Distributions of the Test Statistics

Assuming proportional hazards between treatment arms, we use the asymptotic distribution of a log-rank test statistic $S$ under equal treatment assignment and follow-up, $S \sim N(\theta, 4/E)$ (20). Here $\theta$ is the logarithm of the ratio of the hazard function under the new treatment relative to that under the control treatment and $E$ is the total number of events observed. For a clinical trial with a given number of events, we express a standardized test statistic for testing treatment efficacy for the B+ patients as $Z_{+} = \hat{\theta}_{+}/\sqrt{V_{+}}$, where $\hat{\theta}_{+}$ is an estimate of $\theta_{+}$ and $V_{+} = 4/E_{+}$. We consider a similar standardized statistic, $Z_{-}$, for the B− patients. We also express a standardized test statistic for testing overall treatment efficacy as $Z_{\text{overall}} = \hat{\theta}_{\text{overall}}/\sqrt{V_{\text{overall}}}$, where $\hat{\theta}_{\text{overall}}$ is an estimate of $\theta_{\text{overall}}$ and $V_{\text{overall}} = 4/E_{\text{overall}} = 4/(E_{+} + E_{-})$. By using an approximation,

$$\hat{\theta}_{\text{overall}} \approx \frac{(1/V_{+})\,\hat{\theta}_{+} + (1/V_{-})\,\hat{\theta}_{-}}{1/V_{+} + 1/V_{-}} = \frac{E_{+}\hat{\theta}_{+} + E_{-}\hat{\theta}_{-}}{E_{+} + E_{-}},$$

we have the following stratified statistic for testing overall treatment effects that incorporates possible prognostic effects of the biomarker:

$$Z_{\text{overall}} = \frac{\sqrt{V_{\text{overall}}}}{V_{+}}\,\hat{\theta}_{+} + \frac{\sqrt{V_{\text{overall}}}}{V_{-}}\,\hat{\theta}_{-}.$$

For a standardized test statistic for testing the interaction between treatment and biomarker used in the treatment-by-biomarker interaction approach, we use the following approximation:

$$Z_{\text{INT}} = \frac{\hat{\theta}_{+} - \hat{\theta}_{-}}{\sqrt{V_{+} + V_{-}}}.$$
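The standardization of the stratified overall statistic and of the interaction statistic can be checked in a few lines. The following worked step is a sketch that uses only the definitions above together with the assumption that the estimates for the B+ and B− subsets are independent with variances $V_{+}$ and $V_{-}$:

```latex
\begin{align*}
\frac{1}{V_{+}} + \frac{1}{V_{-}} &= \frac{E_{+} + E_{-}}{4} = \frac{1}{V_{\text{overall}}}, \\
Z_{\text{overall}} = \frac{\hat{\theta}_{\text{overall}}}{\sqrt{V_{\text{overall}}}}
  &\approx \sqrt{V_{\text{overall}}}
     \left( \frac{\hat{\theta}_{+}}{V_{+}} + \frac{\hat{\theta}_{-}}{V_{-}} \right), \\
\operatorname{Var}(Z_{\text{overall}})
  &\approx V_{\text{overall}} \left( \frac{1}{V_{+}} + \frac{1}{V_{-}} \right) = 1, \\
\operatorname{Var}(\hat{\theta}_{+} - \hat{\theta}_{-}) &= V_{+} + V_{-}
  \quad\Longrightarrow\quad \operatorname{Var}(Z_{\text{INT}}) = 1 .
\end{align*}
```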

We assume normality for the aforementioned standardized statistics with variance 1. The means of $Z_{+}$, $Z_{-}$, $Z_{\text{overall}}$, and $Z_{\text{INT}}$ are

$$\frac{\theta_{+}}{\sqrt{V_{+}}}, \qquad \frac{\theta_{-}}{\sqrt{V_{-}}}, \qquad \sqrt{V_{\text{overall}}}\left(\frac{\theta_{+}}{V_{+}} + \frac{\theta_{-}}{V_{-}}\right), \qquad \text{and} \qquad \frac{\theta_{+} - \theta_{-}}{\sqrt{V_{+} + V_{-}}},$$

respectively.
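To make these means concrete, the following Python sketch evaluates them for given design inputs and uses the mean of the overall statistic to approximate the power of a single overall test. The function names are hypothetical, and the sketch assumes E+ = p+E, a one-sided test that rejects for large negative values of the statistic, and a one-sided level of 2.5% for the traditional approach.

```python
from math import log, sqrt
from statistics import NormalDist


def z_means(E, p_plus, hr_plus, hr_minus):
    """Means of Z+, Z-, Zoverall, and ZINT under the normal approximation."""
    e_plus = p_plus * E          # assumption: E+ = p+ * E
    e_minus = E - e_plus
    th_p, th_m = log(hr_plus), log(hr_minus)      # log hazard ratios
    v_p, v_m, v_o = 4.0 / e_plus, 4.0 / e_minus, 4.0 / E
    return {
        "plus": th_p / sqrt(v_p),
        "minus": th_m / sqrt(v_m),
        "overall": sqrt(v_o) * (th_p / v_p + th_m / v_m),
        "int": (th_p - th_m) / sqrt(v_p + v_m),
    }


def traditional_power(E, p_plus, hr_plus, hr_minus, alpha=0.025):
    """Approximate probability that a single overall test is significant."""
    std = NormalDist()
    mu = z_means(E, p_plus, hr_plus, hr_minus)["overall"]
    # Reject when Zoverall < z_alpha, where Zoverall ~ N(mu, 1)
    return std.cdf(std.inv_cdf(alpha) - mu)


if __name__ == "__main__":
    print(z_means(200, 0.3, 0.7, 0.7))
    print(traditional_power(200, 0.3, 0.7, 0.7))
```

Under constant effects the interaction mean is zero and the overall mean reduces to the common log-HR multiplied by the square root of E and divided by 2, which is a quick sanity check on the formulas.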

With respect to the covariance (or correlation) between the standardized statistics, we first note that independence between $Z_{+}$ and $Z_{-}$ holds in the fixed-sequence-1 approach. The covariance between $Z_{+}$ and $Z_{\text{overall}}$ in the fallback and fixed-sequence-2 approaches reduces to $\sqrt{p_{+}}$ (9, 18, 19).
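As a numerical check of how this correlation enters the control of the type I error, the following Python sketch computes Poverall, Psubset, and their sum under the global null for the fallback approach, using the significance levels listed in Table 1. The function names are hypothetical, and the sketch assumes that the fallback procedure tests the overall population at level α1 and, only if that test is not significant, tests the B+ subset at level α2, with rejection for large negative values of the statistics.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def bvn_lower(a, b, rho, n=4000):
    """P(X < a, Y < b) for a standard bivariate normal with correlation rho.

    Midpoint-rule evaluation of the integral over x in (a - 10, a) of
    phi(x) * Phi((b - rho * x) / sqrt(1 - rho^2)).
    """
    lo = a - 10.0
    h = (a - lo) / n
    s = sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += STD.pdf(x) * STD.cdf((b - rho * x) / s)
    return total * h


def fallback_null_probs(p_plus, alpha1, alpha2):
    """Return (Poverall, Psubset, study-wise type I error) under the global null."""
    z1 = STD.inv_cdf(alpha1)     # threshold for Zoverall
    z2 = STD.inv_cdf(alpha2)     # threshold for Z+
    rho = sqrt(p_plus)           # correlation between Z+ and Zoverall
    p_overall = alpha1
    # Subset test significant while the overall test is not
    p_subset = alpha2 - bvn_lower(z1, z2, rho)
    return p_overall, p_subset, p_overall + p_subset


if __name__ == "__main__":
    # (p+, alpha2) pairs from Table 1, with alpha1 = 0.015
    for p_plus, alpha2 in ((0.1, 0.0110), (0.3, 0.0120), (0.5, 0.0140)):
        print(p_plus, fallback_null_probs(p_plus, 0.015, alpha2))
```

Because the two statistics are positively correlated, part of the level α2 is spent on outcomes in which the overall test is already significant, which is why α2 can exceed 0.01 while Psubset stays near 0.01.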

Regarding the covariance between the test statistics used in the treatment-by-biomarker interaction approach, it can be shown that

$$\operatorname{cov}(Z_{\text{INT}}, Z_{\text{overall}}) = 0$$

and

$$\operatorname{cov}(Z_{\text{INT}}, Z_{+}) = \sqrt{\frac{V_{+}}{V_{+} + V_{-}}} = \sqrt{\frac{E_{-}}{E_{+} + E_{-}}} = \sqrt{\frac{R}{1 + R}},$$

where $R = E_{-}/E_{+}$. We search for the significance levels $\alpha_{3}$ (for $Z_{\text{overall}}$), $\alpha_{4}$ (for $Z_{+}$), and $\alpha_{\text{INT}}$ (for $Z_{\text{INT}}$) that control the study-wise type I error rate under the global null hypothesis of no treatment efficacy for the B+ and B− patients (and thus no effects for the overall population). When $E_{+} = p_{+}E$ or $R = (1 - p_{+})/p_{+}$ is supposed as in Results, we have $\operatorname{cov}(Z_{\text{INT}}, Z_{+}) = \sqrt{1 - p_{+}}$. Generally, we search for the significance levels using the covariance $\operatorname{cov}(Z_{\text{INT}}, Z_{+}) = \sqrt{R/(1 + R)}$ based on an expected event ratio $R$ under the global null effects, which will depend on the respective (baseline) event rates (possibly, with some prognostic effects) and the censoring distributions across biomarker-based subsets, rather than using the approximations $R = (1 - p_{+})/p_{+}$ and $\operatorname{cov}(Z_{\text{INT}}, Z_{+}) = \sqrt{1 - p_{+}}$.
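One way such a search could be carried out is sketched below in Python. The function names and the bisection scheme are illustrative and are not taken from the authors' code. The sketch assumes the following decision rule: the interaction test is performed first at level αINT; if it is not significant, the overall test is performed at level α3, and if it is significant, the B+ subset is tested at level α4. It also assumes rejection for large negative values of the statistics and the targets Poverall = 0.015 and Psubset = 0.01 used in Table 1.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def bvn_lower(a, b, rho, n=2000):
    """P(X < a, Y < b) for a standard bivariate normal with correlation rho.

    Midpoint-rule evaluation of the integral over x in (a - 10, a) of
    phi(x) * Phi((b - rho * x) / sqrt(1 - rho^2)).
    """
    lo = a - 10.0
    h = (a - lo) / n
    s = sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += STD.pdf(x) * STD.cdf((b - rho * x) / s)
    return total * h


def interaction_levels(R, alpha_int=0.10, target_overall=0.015, target_subset=0.010):
    """Return (alpha3, alpha4) meeting the null targets for event ratio R = E-/E+."""
    rho = sqrt(R / (1.0 + R))                      # cov(ZINT, Z+)
    # cov(ZINT, Zoverall) = 0, so Poverall = (1 - alpha_int) * alpha3
    alpha3 = target_overall / (1.0 - alpha_int)
    z_int = STD.inv_cdf(alpha_int)
    lo, hi = 1e-6, 0.5
    for _ in range(40):                            # bisection on alpha4
        mid = 0.5 * (lo + hi)
        # Psubset = P(ZINT significant and Z+ significant at level mid)
        if bvn_lower(z_int, STD.inv_cdf(mid), rho) < target_subset:
            lo = mid
        else:
            hi = mid
    return alpha3, 0.5 * (lo + hi)


if __name__ == "__main__":
    for p_plus in (0.1, 0.3, 0.5):
        R = (1.0 - p_plus) / p_plus                # approximation used in Results
        print(p_plus, interaction_levels(R))
```

In practice R would be replaced by an expected event ratio that reflects the baseline event rates and censoring in each subset, as described above.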

Disclosure of Potential Conflicts of Interest

No potential conflicts of interest were disclosed.

Authors' Contributions

Conception and design: S. Matsui
Development of methodology: S. Matsui
Acquisition of data (provided animals, acquired and managed patients, provided facilities, etc.): S. Matsui, Y. Choai, T. Nonaka


Analysis and interpretation of data (e.g., statistical analysis, biostatistics, computational analysis): S. Matsui, Y. Choai, T. Nonaka
Writing, review, and/or revision of the manuscript: S. Matsui, T. Nonaka
Administrative, technical, or material support (i.e., reporting or organizing data, constructing databases): S. Matsui
Study supervision: S. Matsui

Disclaimer

The views expressed herein are the result of independent work and do not necessarily represent the views of the Pharmaceuticals and Medical Devices Agency.

Acknowledgments

The authors thank the anonymous reviewers for valuable comments that substantially improved this article.

Grant Support

This research was supported by a Grant-in-Aid for Scientific Research (24240042; to S. Matsui) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

Received October 1, 2013; revised February 21, 2014; accepted March 14, 2014; published OnlineFirst April 1, 2014.

References

1. Rothenberg ML, Carbone DP, Johnson DH. Improving the evaluation of new cancer treatments: challenges and opportunities. Nat Rev Cancer 2003;3:303–9.
2. Bradley E. Incorporating biomarkers into clinical trial designs: points to consider. Nat Biotechnol 2012;30:596–9.
3. Rubin EH, Gilliland DG. Drug development and clinical trials–the path to an approved cancer drug. Nat Rev Clin Oncol 2012;9:215–22.
4. Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clin Cancer Res 2005;10:6759–63.
5. Pusztai L, Hess KR. Clinical trial design for microarray predictive marker discovery and assessment. Ann Oncol 2004;15:1731–7.
6. Sargent DJ, Conley BA, Allegra C, Collette L. Clinical trial designs for predictive marker validation in cancer treatment trials. J Clin Oncol 2005;23:2020–7.
7. Simon R. The use of genomics in clinical trial design. Clin Cancer Res 2008;14:5984–93.
8. Simon R, Wang SJ. Use of genomic signatures in therapeutics development in oncology and other diseases. Pharmacogenomics J 2006;6:166–73.
9. Wang SJ, O'Neill RT, Hung HM. Approaches to evaluation of treatment effect in randomized clinical trials with genomic subset. Pharm Stat 2007;6:227–44.
10. Mandrekar SJ, Sargent DJ. Clinical trial designs for predictive biomarker validation: theoretical considerations and practical challenges. J Clin Oncol 2009;27:4027–34.
11. Freidlin B, McShane LM, Korn EL. Randomized clinical trials with biomarkers: design issues. J Natl Cancer Inst 2010;102:152–60.
12. Buyse M, Michiels S, Sargent DJ, Grothey A, Matheson A, de Gramont A. Integrating biomarkers in clinical trials. Expert Rev Mol Diagn 2011;11:171–82.
13. Freidlin B, Sun Z, Gray R, Korn EL. Phase III clinical trials that integrate treatment and biomarker evaluation. J Clin Oncol 2013;31:3158–61.
14. Freidlin B, Korn EL, Gray R. Marker Sequential Test (MaST) design. Clin Trials 2014;11:19–27.
15. Douillard JY, Siena S, Cassidy J, Tabernero J, Burkes R, Barugel M, et al. Randomized, phase III trial of panitumumab with infusional fluorouracil, leucovorin, and oxaliplatin (FOLFOX4) versus FOLFOX4 alone as first-line treatment in patients with previously untreated metastatic colorectal cancer: the PRIME study. J Clin Oncol 2010;28:4697–705.
16. Cappuzzo F, Ciuleanu T, Stelmakh L, Cicenas S, Szczésna A, Juhász E, et al. Erlotinib as maintenance treatment in advanced non-small-cell lung cancer: a multicentre, randomised, placebo-controlled phase 3 study. Lancet Oncol 2010;11:521–9.
17. Wakelee H, Kernstine K, Vokes E, Schiller J, Baas P, Saijo N, et al. Cooperative group research efforts in lung cancer 2008: focus on advanced-stage non-small-cell lung cancer. Clin Lung Cancer 2008;9:346–51.
18. Song Y, Chi GY. A method for testing a prespecified subgroup in clinical trials. Stat Med 2007;26:3535–49.
19. Spiessens B, Debois M. Adjusted significance levels for subgroup analyses in clinical trials. Contemp Clin Trials 2010;31:647–56.
20. Tsiatis AA. The asymptotic joint distribution of the efficient score test for the proportional hazards model calculated over time. Biometrika 1981;68:311–5.
