ORBITA: A case study in the analysis and reporting …gelman/research/unpublished/S...ORBITA: A case study in the analysis and reporting of clinical trials Andrew Gelman, John Carlin

ORBITA: A case study in the analysis and reporting of clinical trials

Andrew Gelman, John Carlin and Brahmajee K Nallamothu

14 Mar 2018

Department of Statistics and Political Science, Columbia University, New York City, NY, United States (Andrew Gelman, professor); Clinical Epidemiology & Biostatistics, Murdoch Children’s Research Institute, Melbourne School of Population and Global Health and Department of Paediatrics, University of Melbourne, Melbourne, Australia (John Carlin, professor); Department of Internal Medicine, University of Michigan Medical School, Ann Arbor, MI, United States (Brahmajee K Nallamothu, professor)

Correspondence to: [email protected]

Acknowledgments: We thank Doug Helmreich for bringing this example to our attention, Shira Mitchell for helpful comments, and the Office of Naval Research, Defense Advanced Research Projects Agency, and the National Institutes of Health for partial support of this work.

Competing interests: Dr. Gelman and Dr. Carlin report no competing interests. Dr. Nallamothu is an interventional cardiologist and Editor-in-Chief of a journal of the American Heart Association but otherwise has no competing interests.

Word Count: 2085

Introduction

ORBITA (Objective Randomised Blinded Investigation With Optimal Medical Therapy of Angioplasty in Stable Angina) was a randomized clinical trial of approximately 200 patients in which half the patients received stents and half received a placebo procedure. Its summary finding was that stenting did not “increase exercise time by more than the effect of a placebo procedure” with the mean difference in this primary outcome between treatment and control groups reported as 16.6 sec (95% confidence interval, −8.9 to +42.0 sec) and a p-value of 0.20.

In the New York Times, Kolata (2017) reported the finding as “unbelievable,” remarking that it “stunned leading cardiologists by countering decades of clinical experience.” Indeed, one of us (BKN) was quoted as being humbled by the finding as many had expected a positive result. On the other hand, Kolata noted, “there have long been questions about [stents’] effectiveness.” At the very least, the willingness of doctors and patients to participate in a controlled trial with a placebo procedure suggests some degree of existing skepticism and clinical equipoise.

ORBITA was a landmark trial due to its innovative use of a placebo procedure. However, substantial questions remain even after ORBITA regarding the role of stenting in stable angina. It is a well-known statistical fallacy to take a result that is not statistically significant and report it as zero, as was essentially done here based on the p-value of 0.20 for the primary outcome. Had this comparison happened to produce a p-value of 0.04, would the headline have been, “‘Believable’: Heart Stents Indeed Ease Chest Pain”?

The purpose of this paper is to take a closer look at the lack of statistical significance in ORBITA and the larger questions it raises about statistical analyses, statistically based versus clinical decision-making, and the reporting of clinical trials. This is important because a lot of certainty seems to be hanging on a small bit of data.

Dichotomized thresholds are a big problem; hence, in this paper we will avoid discussing “statistical significance” except when discussing issues of how results are or could be reported.

Statistical analysis of the ORBITA trial

Adjusting for baseline differences. In ORBITA, exercise time in a standardized treadmill test—the primary outcome in the preregistered design—increased on average by 28.4 sec in the treatment group compared to an increase of only 11.8 sec in the control group. As noted above, this difference was associated with a p-value greater than 0.05. Hence, following conventional rules of scientific reporting it was treated as zero—an instance of the regrettably common statistical fallacy of presenting non-statistically-significant results as confirmation of the null hypothesis of no difference.

However, the estimate using gain in exercise time does not make full use of the data that were available on differences between the groups at baseline (Vickers and Altman, 2001; Harrell, 2017a). The treatment and placebo groups differed in their pre-treatment levels of exercise time, with mean values of 528.0 and 490.0 s, respectively (Supplementary Table). This sort of difference is fine—randomization assures balance only in expectation—but it is important to adjust for this discrepancy in estimating the treatment effect. In the published paper, the adjustment was performed by simple subtraction of the pre-treatment values:

$$\text{Gain in exercise time:}\quad (y_{\text{post}} - y_{\text{pre}})_T - (y_{\text{post}} - y_{\text{pre}})_C \qquad (1)$$

But this over-corrects for differences in pre-test scores, because of the familiar phenomenon of “regression to the mean”—just from natural variation, we would expect patients with lower scores at baseline to improve, relative to the average, and patients with higher scores to regress downward.

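The optimal linear estimate of the treatment effect is actually:

$$\text{Gain in exercise time:}\quad (y_{\text{post}} - \beta\, y_{\text{pre}})_T - (y_{\text{post}} - \beta\, y_{\text{pre}})_C \qquad (2)$$

where β is the coefficient of $y_{\text{pre}}$ in a least-squares regression of $y_{\text{post}}$ on $y_{\text{pre}}$, also controlling for the treatment indicator.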

The estimate in (1) is a special case of the regression estimate (2) corresponding to β = 1. Given that the pre-test and post-test measurements have nearly identical variances, we can anticipate that the optimal β will be less than 1, which will reduce the correction for difference in pre-test and thus increase the estimated treatment effect while decreasing the standard error.
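The relation between estimates (1) and (2) can be sketched in a few lines of Python using only the group summaries quoted in the text. This is an illustration, not the authors' code: the function name `adjusted_difference` is hypothetical, the post-treatment means are assumed to equal the baseline mean plus the mean gain, and β = 0.9 is an arbitrary illustrative value rather than the fitted coefficient. Because the inputs are rounded summaries, results may differ slightly from values computed on the full data.

```python
def adjusted_difference(post_t, pre_t, post_c, pre_c, beta):
    """Estimate (2): difference between groups in (post - beta * pre)."""
    return (post_t - beta * pre_t) - (post_c - beta * pre_c)


# Reported summaries in seconds: baseline means and mean gains per group.
pre_t, pre_c = 528.0, 490.0
gain_t, gain_c = 28.4, 11.8

# Assumption: post-treatment mean = baseline mean + mean gain.
post_t, post_c = pre_t + gain_t, pre_c + gain_c

# beta = 1 reproduces the simple gain-score comparison of estimate (1).
gain_score = adjusted_difference(post_t, pre_t, post_c, pre_c, beta=1.0)

# beta < 1 (illustrative value) subtracts less of the baseline imbalance.
partial = adjusted_difference(post_t, pre_t, post_c, pre_c, beta=0.9)

print(f"beta = 1.0: {gain_score:.1f} sec")
print(f"beta = 0.9: {partial:.1f} sec")
```

Since the treatment group started with the higher baseline mean, any β below 1 removes less of that 38 sec imbalance and so yields a larger estimated effect than the gain-score comparison.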

An adjusted analysis using the information available is explained in detail in Box 1. The p-value from this adjusted analysis is 0.09: as expected, lower than the p = 0.20 from the unadjusted analysis. What is relevant is not whether or not this new p-value has become “statistically significant” but rather that the reported p-value is subject to change based on alternative analyses.

Within different conventions for scientific reporting and for different fields, a p-value of 0.09 is considered to be statistically significant; for example, in a recent social science experiment published in the Proceedings of the National Academy of Sciences, Sands (2017) presented a causal effect based on a p-value of less than 0.10, and this was enough for publication in a top journal and in the popular press. Vox mentioned that work uncritically without any concern regarding significance levels (Resnick, 2017). By contrast, Vox reported stents as a prime example of the “epidemic of unnecessary medical treatments” after ORBITA (Belluz, 2017).

These concerns are deepened further when one considers how sensitive results from ORBITA were from a statistical standpoint. To better understand this one can perform a simple bootstrap analysis, computing the results that would have been obtained from reanalyzing the data 1000 times, each time resampling patients from the existing experiment with replacement (Efron, 1979). As raw data were not available to us, we approximated using the normal distribution based on the observed z-score of 1.7. The result was that, in 40% of the simulations, stents outperformed placebo with p-values less than 0.05. This is not to say that stents really are better on average than placebo in improving exercise time—the data also appear consistent with a null effect. The take-home point of this experiment is that the results could easily have gone “the other way”, when reporting is forced into a binary classification of statistical significance.
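A minimal sketch of this normal approximation follows. It assumes that a replicated z-score can be drawn from a normal distribution centered at the observed z = 1.7 with unit standard deviation, and that "stents outperformed placebo with p < 0.05" means the replicated z exceeds the two-sided 5% critical value. The function names and the random seed are hypothetical choices made for this illustration.

```python
import random
from statistics import NormalDist


def simulated_rate(z_obs=1.7, n_sims=1000, alpha=0.05, seed=1):
    """Share of simulated replications with z above the critical value."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    hits = sum(rng.gauss(z_obs, 1.0) > z_crit for _ in range(n_sims))
    return hits / n_sims


def exact_rate(z_obs=1.7, alpha=0.05):
    """Closed-form counterpart of the simulation under the same assumptions."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist(mu=z_obs, sigma=1.0).cdf(z_crit)


print(f"simulated: {simulated_rate():.2f}")
print(f"exact:     {exact_rate():.2f}")
```

Under these assumptions the closed-form probability is close to the 40% quoted above, and the simulated share fluctuates around it by a couple of percentage points with 1000 draws.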

Statistically Based versus Clinical Decision-Making

In justifying their study design and sample size, Al-Lamee et al. (2017) wrote: “Evidence from placebo-controlled randomised controlled trials shows that single antianginal therapies provide improvements in exercise time of 48–55 sec … Given the previous evidence, ORBITA was conservatively designed to be able to detect an effect size of 30 sec.” The estimated effect of 21 sec with standard error 12 sec is consistent with the “conservative” effect size estimate of 30 sec given in the published article. So although the experimental results are consistent with a null effect, they are even more consistent with a small positive effect.

One might ask, however, about the clinical significance of such a treatment effect, which we can discuss without reference to p-values or statistical significance. For simplicity, suppose we take the point estimate from the data at face value. How should we think about an increase in average exercise time of 21 sec? One way to conceptualize this is in terms of percentiles. The data show a pre-randomization distribution (averaging the treatment and control groups) with a mean of 509 sec and a standard deviation of 188 sec. Assuming a normal approximation, an increase in exercise time of 21 sec from 509 to 530 sec would take a patient from the 50th percentile to the 54th percentile of the distribution. Looked at that way, it would be hard to get excited about this effect size, even if it were a real population shift. Indeed, a recent study after ORBITA suggested ironically that such gains are possible during a treadmill test by simply playing music.
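The percentile calculation can be checked directly. The sketch below assumes, as the text does, that baseline exercise time is approximately normal with mean 509 sec and standard deviation 188 sec; the helper name `percentile_after_gain` is hypothetical.

```python
from statistics import NormalDist

# Normal approximation to the pre-randomization exercise-time distribution.
baseline = NormalDist(mu=509.0, sigma=188.0)


def percentile_after_gain(gain_sec):
    """Percentile reached by a median patient who improves by gain_sec."""
    return 100 * baseline.cdf(baseline.mean + gain_sec)


print(f"{percentile_after_gain(21.0):.0f}th percentile after a 21 sec gain")
```

A 21 sec gain is about 0.11 standard deviations, which is why the shift in percentile terms is only a few points.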

Thus, the larger clinical question is how to balance the long-term benefits of stents with risks of the procedure. It does not seem reasonable for a person to receive stents just for a potential benefit of 21 sec of exercise time on a standardized treadmill test—or even a hypothesized larger benefit of 50 sec, which would still only represent a 10% improvement for an average patient in this study. Yet maybe a 5% to 10% increase is consequential in this case as it could improve quality of life for a patient. Perhaps this small gain in exercise time is associated with the need for fewer medications, fewer functional limitations or greater mobility. If so, however, one might postulate this gain would have been apparent in assessments of angina burden, and it was not.

A big concern here is that these patients were already doing pretty well on medications—that is, they already had a low symptom frequency before stenting. For example, angina frequency as measured by the Seattle Angina Questionnaire was 63.2 after optimizing medications and before stenting in the treatment group. This roughly translates as “monthly” angina (John Spertus, personal communication). How does a study with a follow-up of just 6 weeks expect to improve an outcome that happens this infrequently? In fact, one of the great debates surrounding ORBITA is that those who discount the trial suggest it enrolled patients who typically do not receive stents in routine practice. Those who believe ORBITA is a game-changer argue that these less symptomatic patients actually make up a large proportion of those receiving stents.

Finally, are stents really being given to patients with stable angina just to improve fitness or to reduce symptoms? Or is there a continued expectation that stents have long-term benefits for patients, despite earlier data from studies like the Clinical Outcomes Utilizing Revascularization and Aggressive Drug Evaluation (COURAGE) study (Boden, 2007)? This would seem to be the key question, in which case the short-term effects, or lack thereof, found in the ORBITA study are largely irrelevant. Other larger trials, such as International Study of Comparative Health Effectiveness With Medical and Invasive Approaches (ISCHEMIA, see: https://clinicaltrials.gov/ct2/show/NCT01471522) are considering this more fundamental question but will not have a placebo procedure.

Evidence from ORBITA that pointed toward consistent improvements in the physiological parameter of ischemia through endpoints such as fractional flow reserve and stress echo suggests there is little question that some physiological changes are being made by stents, with very large and highly statistically significant differences. As is often the case, the null hypothesis that these physical changes should make absolutely zero difference to any downstream clinical outcomes seems farfetched. Thus, the sensible question to ask is “How large are the clinical differences observed and are they worth it?”—not “How surprising is the observed mean difference under a [spurious] null hypothesis?”

4. Recommendations for statistical reporting of trials

The search for better medical care is an incremental process, with incomplete evidence accumulating over time. There is unfortunately a fundamental incompatibility between that core idea and the common practice, both in medical journals and the news media, of up-or-down reporting of individual studies based on statistical significance. We offer some recommendations to tackle this issue in Box 2.

In the design, evaluation, and reporting of experimental studies, there is a norm of focusing on the statistical significance of a primary outcome—described at times as “significantitis” or “dichotomania” (Greenland, 2017). It leads to an overreliance on phrases like, “We deemed a p value less than 0.05 to be significant,” that are common throughout the published literature. The resulting conclusions from such a process frequently will be fragile because p-values are extremely noisy unless the underlying effect is huge. To their credit, the ORBITA authors themselves have recognized these critical issues (see online: https://twitter.com/ProfDFrancis/status/952008644018753536).

ORBITA was never meant to be definitive in a broad sense—it was designed to find a physiological effect of stenting on mean exercise time, without clarity on the clinical relevance of this outcome. Indeed, a likely reason the study was limited to this endpoint is that this is all that could have passed an ethical board given the novelty of the placebo procedure in this setting. Further background on these topics from Darrel Francis, the senior author on the study, appears at Harrell (2017b). One certain impact of ORBITA is that bigger trials of stenting with placebo procedures, measuring a more meaningful set of outcomes, are now much more likely.

We don’t see any easy answers here—long-term outcomes would require a long-term study, after all, and clinical decisions need to be made right away, every day. But perhaps we can use our examination of this particular study and its reporting to suggest practical directions for improvement in heart treatment studies and in the design and reporting of clinical trials more generally.

References

Al-Lamee, R., Thompson, D., Dehbi, H. M., Sen, S., Tang, K., Davies, J., Keeble, T., Mielewczik, M., Kaprielian, R., Malik, I. S., Nijjer, S. S., Petraco, R., Cook, C., Ahmad, Y., Howard, J., Baker, C., Sharp, A., Gerber, R., Talwar, S., Assomull, R., Mayet, J., Wensel, R., Collier, D., Shun-Shin, M., Thom, S. A., Davies, J. E., and Francis, D. P. (2017). Percutaneous coronary intervention in stable angina (ORBITA): a double-blind, randomised controlled trial. Lancet. http://dx.doi.org/10.1016/S0140-6736(17)32714-9

Allison, D. B., Brown, A. W., George, B. J., Kaiser, K. A. (2016). Reproducibility: A tragedy of errors. Nature 530, 27–29. doi:10.1038/530027a. PubMed PMID: 26842041; PubMed Central PMCID: PMC4831566.

American College of Cardiology (2017). ORBITA: First placebo-controlled randomized trial of PCI in CAD patients. ACC News, 2 Nov. http://www.acc.org/latest-in-cardiology/articles/2017/10/27/13/34/thurs-1150am-orbita-tct-2017

Belluz, J. (2017). Thousands of heart patients get stents that may do more harm than good. Vox.com, 6 Nov. https://www.vox.com/science-and-health/2017/11/3/16599072/stent-chest-pain-treatment-angina-not-effective

Bland, J. M., and Altman, D. G. (2015). Best (but oft forgotten) practices: Testing for treatment effects in randomized trials by separate analyses of changes from baseline in each group is a misleading approach. American Journal of Clinical Nutrition 102, 991–994. doi:10.3945/ajcn.115.119768. Epub 2015 Sep 9. PubMed PMID: 26354536.

Boden, W. E., O'Rourke, R. A., Teo, K. K., Hartigan, P. M., Maron, D. J., Kostuk, W. J., Knudtson, M., Dada, M., Casperson, P., Harris, C. L., Chaitman, B. R., Shaw, L., Gosselin, G., Nawaz, S., Title, L. M., Gau, G., Blaustein, A. S., Booth, D. C., Bates, E. R., Spertus, J. A., Berman, D. S., Mancini, G. B., and Weintraub, W. S.; COURAGE Trial Research Group. (2007). Optimal medical therapy with or without PCI for stable coronary disease. New England Journal of Medicine 356, 1503–16. Epub 2007 Mar 26.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics 7, 1–26.

Gelman, A. (2004). Treatment effects in before-after data. In Applied Bayesian Modeling and Causal Inference from Incomplete-data Perspectives, ed. A. Gelman and X. L. Meng, chapter 18. New York: Wiley.

Gelman, A. (2018). The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Personality and Social Psychology Bulletin 44, 16–23.

Gelman, A., and Carlin, J. B. (2014). Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science 9, 641–651.

Gelman, A., and Stern, H. S. (2006). The difference between “significant” and “not significant” is not itself statistically significant. American Statistician 60, 328–331.

Greenland, S. (2017). The need for cognitive science in methodology. American Journal of Epidemiology 186, 639–645.

Harrell, F. (2017a). Statistical errors in the medical literature. Statistical Thinking blog, 8 Apr. http://www.fharrell.com/2017/04/statistical-errors-in-medical-literature.html

Harrell, F. (2017b). Statistical criticism is easy; I need to remember that real people are involved. Statistical Thinking blog, 5 Nov. http://www.fharrell.com/2017/11/statistical-criticism-is-easy-i-need-to.html

Kolata, G. (2017). ‘Unbelievable’: Heart stents fail to ease chest pain. New York Times, 2 Nov. https://www.nytimes.com/2017/11/02/health/heart-disease-stents.html

Resnick, B. (2017). White fear of demographic change is a powerful psychological force. Vox.com, 28 Jan. https://www.vox.com/science-and-health/2017/1/26/14340542/white-fear-trump-psychology-minority-majority

Sands, M. L. (2017). Exposure to inequality affects support for redistribution. Proceedings of the National Academy of Sciences 114, 663–668.

Schulz, K. F., and Grimes, D. A. (2005). Sample size calculations in randomised trials: Mandatory and mystical. Lancet 365, 1348–1353.

Simmons, J., Nelson, L., and Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allow presenting anything as significant. Psychological Science 22, 1359–1366.

Vickers, A. J., and Altman, D. G. (2001). Analysing controlled trials with baseline and follow up measurements. British Medical Journal 323, 1123–1124.

Wasserstein, R. L., and Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. American Statistician 70, 129–133.

Supplementary Table. Summary data comparing stents to placebo, from Table 3 of Al-Lamee et al. (2017).

Box 1. Using the reported data summaries to obtain the analysis controlling for the pre-treatment measure

For each of the treatment and control groups, we are given the standard deviation of the pre-test measurements, the standard deviation of the post-test measurements, and the standard deviation of their difference, which can be obtained by taking the width of the confidence interval for the difference, dividing by 4 to get the standard error of the difference, and then multiplying by $\sqrt{n}$ to get back to the standard deviation.

Then we use the rule,

$$\mathrm{sd}(y_{\text{post}} - y_{\text{pre}}) = \sqrt{\mathrm{sd}(y_{\text{post}})^2 + \mathrm{sd}(y_{\text{pre}})^2 - 2\rho\,\mathrm{sd}(y_{\text{post}})\,\mathrm{sd}(y_{\text{pre}})}$$

and solve for ρ, the correlation between before and after measurements within each group. The result in this case is ρ = 0.88 within each group. We then convert the correlation to a regression coefficient of $y_{\text{post}}$ on $y_{\text{pre}}$ using the well-known formula, $\beta = \rho\,\mathrm{sd}(y_{\text{post}})/\mathrm{sd}(y_{\text{pre}})$, which yields β = 0.88 for the treated and β = 0.86 for the control group. If these two coefficients were much different from each other, we might want to consider an interaction model (Gelman, 2004), but here they are close enough that we simply take their average.
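The steps in this box can be sketched as three small helper functions. The function names are hypothetical, and because the Supplementary Table values are not reproduced in this text, the numbers in the usage lines are made-up inputs chosen only to show that the algebra round-trips; they are not the ORBITA summaries.

```python
import math


def sd_from_ci(ci_low, ci_high, n):
    """Standard deviation implied by a 95% CI for a mean: width / 4 * sqrt(n)."""
    standard_error = (ci_high - ci_low) / 4
    return standard_error * math.sqrt(n)


def correlation_from_sds(sd_pre, sd_post, sd_diff):
    """Solve var(post - pre) = var(post) + var(pre) - 2*rho*sd(post)*sd(pre) for rho."""
    return (sd_pre**2 + sd_post**2 - sd_diff**2) / (2 * sd_pre * sd_post)


def regression_coefficient(rho, sd_pre, sd_post):
    """Slope of post on pre: beta = rho * sd(post) / sd(pre)."""
    return rho * sd_post / sd_pre


# Hypothetical inputs (NOT the trial's values), built from a known correlation.
sd_pre, sd_post, true_rho = 180.0, 185.0, 0.88
sd_diff = math.sqrt(sd_pre**2 + sd_post**2 - 2 * true_rho * sd_pre * sd_post)

rho = correlation_from_sds(sd_pre, sd_post, sd_diff)
beta = regression_coefficient(rho, sd_pre, sd_post)
print(f"rho = {rho:.2f}, beta = {beta:.2f}")
```

When the pre-test and post-test standard deviations are nearly equal, β is close to ρ, which is why the coefficients reported in this box sit so near the correlation of 0.88.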

We use the average, β = 0.87, in (2) and get an estimate for the adjusted mean difference of 21.3 (indeed, quite a bit higher than the reported difference in gain scores of 16.6) with a standard error of 12.5 (very slightly lower than 12.7, the standard error of the difference in gain scores) and 95% CI −3.2 to 45.8 s. The estimate is not quite two standard errors away from zero: the z-score is 1.7, and the p-value is 0.09.
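The z-score, p-value, and interval quoted here follow from the estimate and its standard error under a normal approximation. The short sketch below assumes a two-sided test and a 95% interval based on the normal distribution; the function name `summarize` is hypothetical.

```python
from statistics import NormalDist


def summarize(estimate, standard_error):
    """Return z-score, two-sided normal p-value, and 95% interval."""
    z = estimate / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    half_width = NormalDist().inv_cdf(0.975) * standard_error
    return z, p_value, (estimate - half_width, estimate + half_width)


# Adjusted estimate and standard error as reported in this box (sec).
z, p_value, (low, high) = summarize(21.3, 12.5)
print(f"z = {z:.1f}, p = {p_value:.2f}, 95% CI {low:.1f} to {high:.1f} sec")
```

The same function applied to the gain-score estimate (16.6 with standard error 12.7) gives a smaller z-score and a larger p-value, in line with the direction of the comparison made in the text.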

Box 2. Recommendations for Analyses and Reporting

Analyses
1. Baseline adjustment for differences: should be prespecified for the primary analysis where strong confounders such as a baseline measure of the outcome are available.
2. Be aware of fragility of inferences. Fragility can be demonstrated using the sampling or posterior distribution as estimated using mathematical modeling, bootstrap simulation, or Bayesian analysis.

Reporting
1. Avoid use of sharp thresholds for p-values and thus eliminate the term “statistical significance” from the reporting of results.
2. Consider the full range (upper and lower ends) of interval estimates for important outcomes and their potential inclusion of clinically important differences.
3. Consider the potential for individual variability in responses (heterogeneity of treatment effects) and not just mean differences.
