Use of Candidate Predictive Biomarkers in the Design of Phase III Clinical Trials

Richard Simon, D.Sc.

Chief, Biometric Research Branch
National Cancer Institute
http://brb.nci.nih.gov

Predictive biomarkers

  Measured before treatment to identify who is likely or unlikely to benefit from a particular treatment

  ER, HER2, KRAS, EGFR

Biomarker Validity

Analytical validity
  Measures what it's supposed to
  Reproducible and robust

Clinical validity (correlation)
  It correlates with something clinically

Medical utility
  Actionable resulting in patient benefit

Developing a drug with a companion test increases complexity and cost of development but should improve chance of success and has substantial benefits for patients and for the economics of health care

How can we do it in a way that provides the kind of reliable answers we expect from phase III trials?

When the Biology is Clear

1. Develop a completely specified classifier of the patients likely (or unlikely) to benefit from a new drug
   Classifier is based on either a single gene/protein or composite score

2. Develop an analytically validated test

3. Design a focused clinical trial to evaluate effectiveness of the new treatment and how it relates to the test

Targeted (Enrichment) Design

[Diagram: Using phase II data, develop a predictor of response to the new drug. Patients predicted responsive are randomized to new drug vs. control. Patients predicted non-responsive go off study.]

Evaluating the Efficiency of Targeted Design

Simon R and Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 10:6759-63, 2004; Correction and supplement 12:3229, 2006

Maitournam A and Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005.

Relative efficiency of targeted design depends on
  proportion of patients test positive
  effectiveness of new drug (compared to control) for test negative patients

When less than half of patients are test positive and the drug has little or no benefit for test negative patients, the targeted design requires dramatically fewer randomized patients than the standard design in which the marker is not used

Comparing T vs C on Survival or DFS

5% 2-sided Significance and 90% Power

% Reduction in Hazard    Number of Events Required
25%                      509
30%                      332
35%                      227
40%                      162
45%                      118
50%                      88
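The slides do not say how the event counts were computed. A minimal sketch, assuming the standard Schoenfeld approximation for a log-rank test with 1:1 randomization, events = 4(z_{1-α/2} + z_{power})² / (ln HR)², gives 227, 162, 118 and 88 for the 35% to 50% rows and lands within an event or two of the 25% and 30% rows. The function name `events_required` is illustrative, not from the talk.

```python
import math
from statistics import NormalDist


def events_required(reduction, alpha=0.05, power=0.90):
    """Approximate number of events to detect a proportional reduction in hazard.

    Assumption: Schoenfeld's approximation, two-sided test, 1:1 randomization.
    """
    z = NormalDist().inv_cdf
    hazard_ratio = 1.0 - reduction
    numerator = 4 * (z(1 - alpha / 2) + z(power)) ** 2
    return math.ceil(numerator / math.log(hazard_ratio) ** 2)


for reduction in (0.25, 0.30, 0.35, 0.40, 0.45, 0.50):
    print(f"{reduction:.0%} reduction: {events_required(reduction)} events")
```

Because the required events scale with 1/(ln HR)², halving the detectable effect roughly quadruples the trial.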

Hazard ratio 0.60 for test + patients
  40% reduction in hazard

Hazard ratio 1.0 for test – patients
  0% reduction in hazard

33% of patients test positive

Hazard ratio for unselected population is 0.33*0.60 + 0.67*1 = 0.87
  13% reduction in hazard

To have 90% power for detecting 40% reduction in hazard within a biomarker positive subset
  Number of events within subset = 162

To have 90% power for detecting 13% reduction in hazard overall
  Number of events = 2172
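The dilution arithmetic above can be checked with a short sketch. It assumes the same Schoenfeld approximation for the number of events (an assumption, since the slides do not name a formula) and rounds the overall hazard ratio to 0.87 as the slide does. Under those assumptions the subset needs 162 events and the unselected trial needs a little under 2200, close to the 2172 quoted.

```python
import math
from statistics import NormalDist


def events_required(hazard_ratio, alpha=0.05, power=0.90):
    """Schoenfeld approximation (assumed): two-sided test, 1:1 randomization."""
    z = NormalDist().inv_cdf
    return math.ceil(4 * (z(1 - alpha / 2) + z(power)) ** 2 / math.log(hazard_ratio) ** 2)


prevalence = 0.33                    # proportion of patients test positive
hr_positive, hr_negative = 0.60, 1.0

# Overall hazard ratio approximated as a prevalence-weighted average, as on the slide.
hr_overall = round(prevalence * hr_positive + (1 - prevalence) * hr_negative, 2)

events_subset = events_required(hr_positive)
events_overall = events_required(hr_overall)
print(hr_overall, events_subset, events_overall)
```

The ratio of the two event counts (more than tenfold) is the efficiency argument for the targeted design when test negative patients do not benefit.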

Stratification Design

[Diagram: Develop a predictor of response to the new Rx. Patients predicted responsive and patients predicted non-responsive are each randomized to new Rx vs. control.]

Develop prospective analysis plan for evaluation of treatment effect and how it relates to biomarker
  type I error should be protected for multiple comparisons
  Trial sized for evaluating treatment effect overall and in subsets defined by test

"Stratifying" (balancing) the randomization is useful to ensure that all randomized patients have the test performed but is not necessary for the validity of comparing treatments within marker defined subsets

Post-stratification provides more time for development of analytically validated tests but risks validity of the results if adequate specimens are not collected in nearly 100% of cases

Fallback Analysis Plan

Compare the new drug to the control overall for all patients, ignoring the classifier.
  If p_overall ≤ 0.01, claim effectiveness for the eligible population as a whole

Otherwise perform a single subset analysis evaluating the new drug in the classifier + patients
  If p_subset ≤ 0.04, claim effectiveness for the classifier + patients.
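The two-step rule can be written as a small decision function. This is only a sketch: the function name and return strings are illustrative, and the thresholds are the 0.01 and 0.04 from the plan, which sum to the conventional 0.05 so the overall type I error is bounded by 5%.

```python
def fallback_analysis(p_overall, p_subset, alpha_overall=0.01, alpha_subset=0.04):
    """Return the claim supported by the fallback analysis plan.

    p_overall: p-value comparing new drug vs control in all patients.
    p_subset:  p-value for the same comparison in classifier-positive patients.
    """
    if p_overall <= alpha_overall:
        return "effective in the eligible population as a whole"
    # The subset test is performed only when the overall test is not significant.
    if p_subset <= alpha_subset:
        return "effective in classifier-positive patients"
    return "no claim"


print(fallback_analysis(p_overall=0.03, p_subset=0.02))
```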

Sample size for Analysis Plan

To have 90% power for detecting uniform 33% reduction in overall hazard at 1% two-sided level requires 370 events.

If 33% of patients are positive, then when there are 370 total events there will be approximately 123 events in positive patients
  123 events provides 90% power for detecting a 45% reduction in hazard at a 4% two-sided significance level.

To detect a 40% reduction in hazard in an a-priori defined subset with 90% power and a 5% significance level requires 162 events in the subset.

To detect a 40% reduction in hazard in an a-priori defined subset with 90% power and a 4% two-sided significance level requires 171 events in the subset.

If the prevalence of the marker is 33%, then the trial might be sized for 3*171 = 513 total events.
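These sample-size statements can be sanity-checked from the other direction, by computing approximate power for a given number of events. The sketch below assumes the usual normal approximation for the log-rank test with 1:1 randomization, power ≈ Φ(√d/2 · |ln HR| − z_{1-α/2}); the slides do not state which formula was used, and `approximate_power` is an illustrative name. Under that assumption all three scenarios come out close to 90%.

```python
import math
from statistics import NormalDist


def approximate_power(events, hazard_ratio, alpha):
    """Approximate power of a two-sided log-rank test (assumed normal approximation)."""
    normal = NormalDist()
    drift = math.sqrt(events) / 2 * abs(math.log(hazard_ratio))
    return normal.cdf(drift - normal.inv_cdf(1 - alpha / 2))


scenarios = [
    ("overall, 33% reduction, 1% level, 370 events", 370, 0.67, 0.01),
    ("subset, 45% reduction, 4% level, 123 events", 123, 0.55, 0.04),
    ("subset, 40% reduction, 4% level, 171 events", 171, 0.60, 0.04),
]
for label, events, hazard_ratio, alpha in scenarios:
    print(f"{label}: power ~ {approximate_power(events, hazard_ratio, alpha):.2f}")
```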

R Simon. Using genomics in clinical trial design, Clinical Cancer Research 14:5984-93, 2008

R Simon. Designs and adaptive analysis plans for pivotal clinical trials of therapeutics and companion diagnostics, Expert Opinion in Medical Diagnostics 2:721-29, 2008

Web Based Software for Planning Clinical Trials of Treatments with a Candidate Predictive Biomarker

http://brb.nci.nih.gov

The Biology is Often Not So Clear

Cancer biology is complex and it is not always possible to have the right single completely defined predictive classifier identified and analytically validated by the time the pivotal trial of a new drug is ready to start accrual

K Candidate Biomarkers Design

Based on Adaptive Threshold Design

W Jiang, B Freidlin & R Simon
JNCI 99:1036-43, 2007

K Candidate Biomarkers Design

Have identified K candidate binary classifiers B_1, …, B_K thought to be predictive of patients likely to benefit from T relative to C

Eligibility not restricted by candidate markers

Compare T vs C for all patients
  If results are significant at level .01 claim broad effectiveness of T
  Otherwise proceed as follows

Compare T vs C for the subset of patients positive for marker 1; compute p_1

Similarly compare T vs C for the subset of patients positive for marker 2 (p_2), positive for marker 3 (p_3), …, positive for marker K (p_K)

Compute p* = min{p_1, p_2, …, p_K}

Compute whether a value of p* is statistically significant when adjusted for multiple testing
  Adjust for multiple testing using permutation of treatment labels to adjust for correlation among tests
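The permutation adjustment of p* can be sketched as follows. Two simplifications are assumptions made here for brevity and are not from the slides: the outcome is a binary response compared with a two-proportion z statistic rather than survival compared with a log-rank test, and the largest |z| over the K marker-positive subsets stands in for the smallest p-value (the two are equivalent when p is a decreasing function of |z|). All function names are illustrative.

```python
import math
import random


def z_stat(treat, resp, idx):
    """Two-proportion z statistic (T minus C) among the patients listed in idx."""
    t = [resp[i] for i in idx if treat[i] == 1]
    c = [resp[i] for i in idx if treat[i] == 0]
    if not t or not c:
        return 0.0
    pooled = (sum(t) + sum(c)) / (len(t) + len(c))
    var = pooled * (1 - pooled) * (1 / len(t) + 1 / len(c))
    if var == 0:
        return 0.0
    return (sum(t) / len(t) - sum(c) / len(c)) / math.sqrt(var)


def min_p_permutation(treat, resp, markers, n_perm=1000, seed=0):
    """Multiplicity-adjusted p-value for the best of K marker-positive subsets.

    treat, resp: lists of 0/1 per patient; markers: list of K lists of 0/1.
    """
    rng = random.Random(seed)
    subsets = [[i for i, m in enumerate(marker) if m == 1] for marker in markers]

    def best(labels):
        # Largest |z| across marker-positive subsets, i.e. the smallest p-value.
        return max(abs(z_stat(labels, resp, s)) for s in subsets)

    observed = best(treat)
    permuted = list(treat)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # permute treatment labels; markers and outcomes stay fixed
        if best(permuted) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the same permuted labels are used for all K subsets at once, correlation among the marker-defined subsets is reflected in the null distribution, which is the reason the slides give for preferring permutation.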

To detect a 40% reduction in hazard in an a-priori defined subset with 90% power and a 4% two-sided significance level requires 171 events in the subset.

If the prevalence of the marker is 33%, then the trial might be sized for 3*171 = 513 total events.

To adjust for multiplicity with 4 independent tests, 171 -> 224; 513 -> 672 total events.

Designs When there are Many Candidate Markers and too Much Patient Heterogeneity for any Single Marker

Adaptive Signature Design

Boris Freidlin and Richard Simon

Clinical Cancer Research 11:7872-8, 2005

Biomarker Adaptive Signature Design

Randomized trial of T vs C
Large number of candidate predictive biomarkers available
Eligibility not restricted by any biomarker
This approach can be used with any set of candidate markers

End of Trial Analysis
Fallback Analysis

Compare T to C for all patients at significance level α_0 (e.g. 0.01)
  If overall H_0 is rejected, then claim effectiveness of T for eligible patients
  Otherwise proceed as follows

Using only a randomly selected subset of patients of pre-specified size (e.g. 1/3) to be used as a training set T, develop a binary classifier M of whether a patient is likely to benefit from T relative to C
  The classifier may use multiple markers
  The classifier classifies patients into only 2 subsets: those predicted to benefit from T and those for whom T is not predicted better than C

Apply the classifier M to classify patients in the validation set V = D − T

Compare T vs C in the subset of V who are predicted to benefit from T using a threshold of significance of 0.04
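The key bookkeeping in this design is that the classifier is developed on the training set only and evaluated on the remaining patients only. A minimal sketch of that split is shown below; the function names, the use of patient IDs, and the `classifier` callable are illustrative assumptions, and the classifier-development step itself is deliberately left out.

```python
import random


def split_training_validation(patient_ids, training_fraction=1 / 3, seed=0):
    """Randomly split patients into a training set and the validation set V = D - T."""
    rng = random.Random(seed)
    ids = list(patient_ids)
    rng.shuffle(ids)
    n_train = round(len(ids) * training_fraction)
    return ids[:n_train], ids[n_train:]


def predicted_to_benefit(validation_ids, classifier):
    """Validation patients the classifier M predicts to benefit from T.

    classifier: hypothetical callable mapping a patient ID to True/False,
    developed using the training set only.
    """
    return [pid for pid in validation_ids if classifier(pid)]


training, validation = split_training_validation(range(300))
print(len(training), len(validation))
```

The T vs C comparison at the 0.04 level is then run only among the patients returned by `predicted_to_benefit`, so no patient used to build the classifier contributes to its test.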

This approach can also be used to identify the subset of patients who don't benefit from T in cases where T is superior to C overall at the 0.01 level.

Cross-Validated Adaptive Signature Design

Freidlin B, Jiang W, Simon R
Clinical Cancer Research 16(2) 2010

At the conclusion of the trial randomly partition the patients into K approximately equally sized sets P_1, …, P_K

Let D_-i denote the full dataset minus data for patients in P_i

Omit patients in P_1
  Apply the defined algorithm to analyze the data in D_-1 to obtain a classifier M_-1
  Classify each patient j in P_1 using model M_-1
  Record the treatment recommendation T or C

Repeat the above for all K loops of the cross-validation

All patients have been classified once as what their optimal treatment is predicted to be

Let S_T denote the set of patients for whom treatment T is predicted optimal

Compare outcomes for patients in S_T who actually received T to those in S_T who actually received C
  Compute Kaplan Meier curves of those receiving T and those receiving C
  Let z_T = standardized log-rank statistic

Test of Significance for Effectiveness of T vs C

Compute statistical significance of z_T by randomly permuting treatment labels and repeating the entire cross-validation procedure
  Do this 1000 or more times to generate the permutation null distribution of treatment effect for the patients in each subset
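The cross-validation and permutation steps can be sketched end to end. Several pieces here are assumptions for illustration and are not the authors' algorithm: the outcome is a binary response summarized by a two-proportion z statistic instead of a log-rank statistic, there is a single binary marker, and `fit_toy_classifier` is a hypothetical stand-in for "the defined algorithm". What the sketch is meant to show is the structure: each patient is classified by a model fit without that patient, and the whole cross-validation is repeated inside every permutation.

```python
import math
import random


def z_stat(rows):
    """Two-proportion z statistic (T minus C); rows are (treat, marker, resp) tuples."""
    t = [r for a, _, r in rows if a == 1]
    c = [r for a, _, r in rows if a == 0]
    if not t or not c:
        return 0.0
    pooled = (sum(t) + sum(c)) / (len(t) + len(c))
    var = pooled * (1 - pooled) * (1 / len(t) + 1 / len(c))
    return 0.0 if var == 0 else (sum(t) / len(t) - sum(c) / len(c)) / math.sqrt(var)


def fit_toy_classifier(train):
    """Hypothetical algorithm: recommend T for marker levels where T looks better."""
    good = {m for m in (0, 1) if z_stat([r for r in train if r[1] == m]) > 1.0}
    return lambda marker: marker in good


def cross_validated_z(data, k, rng):
    """Classify each fold P_i with a model fit on D_-i, then return z_T within S_T."""
    order = list(range(len(data)))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    s_t = []
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in order if i not in held_out]
        recommend_t = fit_toy_classifier(train)
        s_t.extend(data[i] for i in fold if recommend_t(data[i][1]))
    return z_stat(s_t)


def cv_asd_p_value(data, k=5, n_perm=200, seed=0):
    """Permutation p-value (one-sided, T better) for the cross-validated z_T."""
    rng = random.Random(seed)
    observed = cross_validated_z(data, k, rng)
    treat = [a for a, _, _ in data]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(treat)  # permute treatment labels only
        permuted = [(a, m, r) for a, (_, m, r) in zip(treat, data)]
        # Repeat the entire cross-validation, including classifier fitting.
        if cross_validated_z(permuted, k, rng) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

Refitting the classifier inside each permutation is what keeps the test valid: the null distribution then includes the optimism introduced by searching for a subset.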

By applying the analysis algorithm to the full RCT dataset D, recommendations are developed for how future patients should be treated

The size of the T vs C treatment effect for the indicated population is (conservatively) estimated by the Kaplan Meier survival curves of T and of C in S_T

70% Response to T in Sensitive Patients
25% Response to T Otherwise
25% Response to C
30% Patients Sensitive

                              ASD      CV-ASD
Overall 0.05 Test             0.830    0.838
Overall 0.04 Test             0.794    0.808
Sensitive Subset 0.01 Test    0.306    0.723
Overall Power                 0.825    0.918

506 prostate cancer patients were randomly allocated to one of four arms: placebo, 0.2 mg of diethylstilbestrol (DES), 1.0 mg DES, or 5.0 mg DES. Placebo and 0.2 mg DES were combined as control arm C; 1.0 mg DES and 5.0 mg DES were combined as T.

The end-point was overall survival (death from any cause).

Covariates:
  Age: In years
  Performance status (pf): Not bed-ridden at all vs other
  Tumor size (sz): Size of the primary tumor (cm^2)
  Index of a combination of tumor stage and histologic grade (sg)
  Serum prostatic acid phosphatase levels (ap)

Figure 1: Overall analysis. The value of the log-rank statistic is 2.9 and the corresponding p-value is 0.09. The new treatment thus shows no benefit overall at the 0.05 level.

Figure 2: Cross-validated survival curves for patients predicted to benefit from the new treatment. Log-rank statistic = 10.0, permutation p-value is .002

Figure 3: Survival curves for cases predicted not to benefit from the new treatment. The value of the log-rank statistic is 0.54.

Acknowledgements

Boris Freidlin
Yingdong Zhao
Wenyu Jiang
Aboubakar Maitournam
