An Improved Two-Stage Procedure to Compare Hazard Curves

Zhongxue Chen 1*, Hanwen Huang 2*, and Peihua Qiu 3

1 Department of Epidemiology and Biostatistics, School of Public Health, Indiana University Bloomington. 1025 E. 7th Street, Bloomington, IN 47405, USA.

2 Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia. Athens, GA 30602, USA.

3 Department of Biostatistics, College of Public Health & Health Professions and College of Medicine, University of Florida. Gainesville, FL 32611, USA.

* Authors contributed equally

Running title: Improved two-stage procedure for survival data



Abstract

Qiu and Sheng have proposed a powerful and robust two-stage procedure to compare two hazard rate functions. In this paper we improve their method by using the Fisher test to combine the asymptotically independent p-values obtained from the two stages of their procedure. In addition, we extend the procedure to situations with multiple hazard rate functions. Our comprehensive simulation study shows that the proposed method performs well in terms of controlling the type I error rate and of detecting power. Three real data applications are considered to illustrate the use of the new method.

Keywords: Bonferroni correction; crossing; Fisher test; hazard rate; survival data

1. Introduction

In survival data analysis, one common task is to compare two or more hazard curves. To this end, several nonparametric methods, including the log-rank (LR), Gehan–Wilcoxon, and Peto–Peto tests, among others, have been proposed in the literature (Lee and Wang, 2003). However, the performance of these methods depends on the true alternatives. For instance, the LR test is more powerful than other methods if the hazard functions under investigation are parallel. But when such a proportional hazards assumption is violated, the LR test may lose power dramatically. To circumvent this limitation, some methods that are robust to violation of the proportional hazards assumption have been proposed in the literature (Cheng et al., 2009; Li et al., 2015; Lin and Wang, 2004; Liu et al., 2007; Mantel and Stablein, 1988; Moreau et al., 1992; Park and Qiu, 2014; Qiu and Sheng, 2008).

The method proposed by Qiu and Sheng (called the QS test hereafter) is a two-stage procedure designed for comparing two hazard rate functions (Qiu and Sheng, 2008). In the first stage, the LR test is performed with the hope that two parallel hazards can be distinguished when they are not identical. In the second stage, a test statistic (denoted as NP) is constructed with the following properties: it is asymptotically independent of the one used in the first stage, and it is powerful when the two hazard functions cross each other. Let $\alpha$ be the overall significance level, and let $\alpha_1$ and $\alpha_2$ be the significance levels for stages one and two, respectively. Denote by $p_1$ and $p_2$ the p-values from the first and the second stages, respectively. The QS test proceeds as follows. In stage one, if $p_1 \le \alpha_1$, we reject the null hypothesis that the two hazard functions are the same and stop the testing procedure; otherwise, we go to stage two. In stage two, if $p_2 \le \alpha_2$, we reject the null hypothesis mentioned above; otherwise, the QS test fails to reject the null hypothesis. Therefore, asymptotically we have the following relationship among the significance levels:

$$\alpha_1 + \alpha_2 (1 - \alpha_1) = \alpha. \tag{1}$$

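There are infinitely many choices of $\alpha_1$ and $\alpha_2$ that satisfy (1) for a given overall significance level $\alpha$. In the original QS test, for convenience, $\alpha_1$ and $\alpha_2$ are chosen to be

$$\alpha_1 = \alpha_2 = 1 - \sqrt{1 - \alpha}. \tag{2}$$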
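The relationship between the stage levels and the overall level can be checked numerically. The following is a minimal Python sketch; the function names `qs_default_levels` and `overall_level` are illustrative and not part of any package.

```python
import math

def qs_default_levels(alpha):
    """Default stage levels: alpha1 = alpha2 = 1 - sqrt(1 - alpha), as in (2)."""
    a = 1.0 - math.sqrt(1.0 - alpha)
    return a, a

def overall_level(alpha1, alpha2):
    """Overall level implied by the relationship alpha1 + alpha2 * (1 - alpha1), as in (1)."""
    return alpha1 + alpha2 * (1.0 - alpha1)

a1, a2 = qs_default_levels(0.05)
print(round(a1, 4))                 # default stage level for alpha = 0.05
print(overall_level(a1, a2))        # recovers the overall level (up to rounding)
print(overall_level(0.01, 0.0404))  # an unequal split that also gives about 0.05
```

Any pair of stage levels on the curve defined by (1) yields the same overall level; the default is simply the symmetric point on that curve.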

The overall p-value for the QS test can be defined by

$$p\text{-value} = \begin{cases} p_1, & p_1 \le \alpha_1, \\ \alpha_1 + p_2 (1 - \alpha_1), & \text{otherwise.} \end{cases} \tag{3}$$
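As a small illustration of the piecewise definition, the sketch below evaluates the QS overall p-value for given stage p-values. The function name and the input values are hypothetical and chosen only for illustration.

```python
import math

def qs_overall_p_value(p1, p2, alpha1):
    """Overall p-value of the two-stage test: p1 if stage one rejects,
    otherwise alpha1 + p2 * (1 - alpha1)."""
    if p1 <= alpha1:
        return p1
    return alpha1 + p2 * (1.0 - alpha1)

alpha = 0.05
alpha1 = 1.0 - math.sqrt(1.0 - alpha)  # default stage-one level

# Hypothetical stage p-values, not taken from any data set.
print(qs_overall_p_value(0.01, 0.50, alpha1))  # stage one rejects, so p1 is returned
print(qs_overall_p_value(0.20, 0.01, alpha1))  # stage two is used
```

Note that the result depends on `alpha1`, and therefore on the preset overall level, which is one of the characteristics discussed in the text.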

There are some interesting characteristics associated with the QS test. First, the overall p-value and power depend on the choices of $\alpha_1$ and $\alpha_2$, and they also depend on the preset significance level $\alpha$. Second, from (3), it is obvious that the overall p-value is at least $\min(p_1, \alpha_1)$. This may be a shortcoming if one views the p-value as the degree of evidence against the null hypothesis (e.g., a smaller p-value implies stronger evidence). Third, the p-value and power of this test depend on the order of the two stages. Fourth, the QS test was designed for situations with two hazard curves only. In practice, one may have multiple hazard curves to compare; therefore, it is desirable to extend the QS test to cases with multiple hazard curves.

To overcome the aforementioned shortcomings of the QS test and, more importantly, to increase its detecting power and make it suitable for handling multiple hazard curves, we propose an improvement of the QS test. Through a comprehensive simulation study, we will show that the new method is more powerful than the QS test under many different situations. We will also use three real data examples to illustrate the use of the proposed method.

2. Method

2.1. The QS test

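Suppose we have two treatment groups with $n_j$ ($j = 1, 2$) subjects, and the total number of subjects is $n = n_1 + n_2$. We denote the $D$ distinct ordered failure times as $t_1, t_2, \ldots, t_D$. At each failure time $t_i$ ($i = 1, 2, \ldots, D$), $d_{ij}$ and $Y_{ij}$ are the number of events and the number of subjects at risk for group $j$. Let $d_i = d_{i1} + d_{i2}$ and $Y_i = Y_{i1} + Y_{i2}$. The LR test statistic used in the first stage of the QS test can be written as

$$U = \frac{\sum_{i=1}^{D} w_i \left( d_{i1} - Y_{i1} \frac{d_i}{Y_i} \right)}{\sqrt{\sum_{i=1}^{D} w_i^2 \, \frac{Y_{i1}}{Y_i} \left( 1 - \frac{Y_{i1}}{Y_i} \right) \left( \frac{Y_i - d_i}{Y_i - 1} \right) d_i}}, \tag{4}$$

where $w_i = 1$ here.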
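To make the counting notation concrete, here is a minimal NumPy sketch of a standardized log-rank statistic with unit weights, computed from the quantities $d_i$, $d_{i1}$, $Y_i$, and $Y_{i1}$ at each distinct failure time. It assumes the usual standardized form (numerator divided by the square root of the hypergeometric variance sum); the function name and the toy data are illustrative.

```python
import numpy as np

def log_rank_statistic(time, event, group):
    """Standardized log-rank statistic with unit weights.

    time  : observed times
    event : 1 = failure observed, 0 = censored
    group : 1 or 2
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    group = np.asarray(group, dtype=int)

    num, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):        # distinct ordered failure times t_i
        at_risk = time >= t
        failed = (time == t) & (event == 1)
        y = at_risk.sum()                        # Y_i
        y1 = (at_risk & (group == 1)).sum()      # Y_i1
        d = failed.sum()                         # d_i
        d1 = (failed & (group == 1)).sum()       # d_i1
        num += d1 - y1 * d / y                   # observed minus expected in group 1
        if y > 1:                                # variance term is zero when Y_i = 1
            var += (y1 / y) * (1 - y1 / y) * ((y - d) / (y - 1)) * d
    return num / np.sqrt(var)

# Toy data: four uncensored subjects alternating between the two groups.
u = log_rank_statistic([1, 2, 3, 4], [1, 1, 1, 1], [1, 2, 1, 2])
print(round(u, 4))
```

Swapping the group labels flips the sign of the statistic, which is a quick sanity check on the bookkeeping.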

To construct the test statistic in the second stage, we need the following notation. For each $j$ ($j = 1, 2$), let $T_{kj}$ ($k = 1, 2, \ldots, n_j$) denote the event time of the $k$th subject in group $j$, which has cumulative distribution function $F_j$, and let $C_{kj}$ be the censoring time, which has cumulative distribution function $G_j$. Denote the survival functions of the survival time and the censoring time by $S_j(s) = 1 - F_j(s)$ and $L_j(s) = 1 - G_j(s)$, respectively. Let $X_{kj} = \min(T_{kj}, C_{kj})$ be the observed survival or censored time, $\delta_{kj} = I(T_{kj} < C_{kj})$ be the censoring indicator, and $\pi_j(s) = P(X_{kj} > s) = S_j(s) L_j(s)$ be the survival function of the observed time. Then, under the null hypothesis that the two hazard functions are the same, we have $F_1 = F_2$ (or equivalently, $S_1 = S_2$).

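The test statistic used in the second stage of the QS test is a weighted log-rank test defined by

$$V = \sup_{D_\varepsilon \le m \le D - D_\varepsilon} (V_m), \tag{5}$$

where $m = [Dr]$ denotes the integer part of $Dr$ for any $r \in [\varepsilon, 1 - \varepsilon]$, $0 < \varepsilon < 0.5$ is a given small number,

$$V_m = \frac{\sum_{i=1}^{D} w_i^{(m)} \left( d_{i1} - Y_{i1} \frac{d_i}{Y_i} \right)}{\sqrt{\sum_{i=1}^{D} \left( w_i^{(m)} \right)^2 \frac{Y_{i1}}{Y_i} \left( 1 - \frac{Y_{i1}}{Y_i} \right) \left( \frac{Y_i - d_i}{Y_i - 1} \right) d_i}},$$

$$w_i^{(m)} = \begin{cases} -1, & \text{if } i = 1, 2, \ldots, m, \\ c_m, & \text{otherwise,} \end{cases}$$

$$c_m = \frac{\sum_{i=1}^{m} \dfrac{\hat{L}_1(t_i) \hat{L}_2(t_i)}{\frac{n_1}{n} \hat{L}_1(t_i) + \frac{n_2}{n} \hat{L}_2(t_i)} \, \Delta \hat{S}(t_i)}{\sum_{i=m+1}^{D} \dfrac{\hat{L}_1(t_i) \hat{L}_2(t_i)}{\frac{n_1}{n} \hat{L}_1(t_i) + \frac{n_2}{n} \hat{L}_2(t_i)} \, \Delta \hat{S}(t_i)},$$

and $\hat{L}_1$, $\hat{L}_2$, and $\hat{S}$ are the Kaplan-Meier estimates (Kaplan and Meier, 1958) of $L_1$, $L_2$, and $S$, respectively. The p-value for $V$ can be approximated by a bootstrap procedure.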

For the two statistics $U$ and $V$ used in the two stages of the QS test, we have the following nice property (Qiu and Sheng, 2008).

Theorem 1. The two statistics $U$ and $V$ in (4) and (5) are asymptotically independent under the null hypothesis that the two hazard functions in question are the same.

2.2. The proposed test

Since the two statistics $U$ and $V$ are asymptotically independent, the two p-values obtained from them are also asymptotically independent and identically (uniformly) distributed between 0 and 1 under the null hypothesis that the two hazard curves are equal. Based on this fact, we propose to obtain an overall p-value using the Fisher test, since it is more robust in the sense that it has reasonable power when either one of the two p-values is small (Chen, 2011; Chen and Nadarajah, 2014; Fisher, 1932; Owen, 2009). By this method, the overall p-value is defined by

$$p_F = H(-2 \ln[p_1 p_2]), \tag{6}$$

where $H$ is the survival function of a random variable that has a chi-square distribution with 4 degrees of freedom, based on Fisher's theorem (Fisher, 1932).
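The Fisher combination of two p-values is easy to evaluate without a statistics library, because the chi-square survival function with 4 degrees of freedom has the closed form $e^{-x/2}(1 + x/2)$. The sketch below uses that identity; the function name and the input values are hypothetical.

```python
import math

def fisher_combined_p_value(p1, p2):
    """Fisher combination of two independent p-values.

    Uses the closed-form survival function of the chi-square distribution
    with 4 degrees of freedom: P(X > x) = exp(-x / 2) * (1 + x / 2).
    """
    x = -2.0 * math.log(p1 * p2)
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Hypothetical stage p-values, chosen only for illustration.
print(round(fisher_combined_p_value(0.10, 0.02), 4))
# The result is symmetric in the two p-values, so the stage order is irrelevant.
print(fisher_combined_p_value(0.10, 0.02) == fisher_combined_p_value(0.02, 0.10))
```

Equivalently, the combined value equals $p_1 p_2 (1 - \ln[p_1 p_2])$, which makes clear that it depends only on the product of the two stage p-values and not on any preset significance level.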

Method (6) can be extended easily to cases with $K$ ($K \ge 2$) hazard curves, in which we can make pairwise comparisons for a total of $K(K - 1)/2$ pairs of curves. For each of the comparisons, we can calculate its overall p-value using (6). Then, the Bonferroni procedure for multiple comparisons can be used to determine whether the null hypothesis should be rejected. More specifically, if the smallest p-value from the $K(K - 1)/2$ comparisons is less than the given significance level divided by $K(K - 1)/2$, then the null hypothesis is rejected; otherwise it is not rejected. It should be pointed out that here we do not use a global test for the overall comparison, although the adoption of the Bonferroni correction guarantees control of the family-wise error rate. It is possible that some global tests, if they exist, may reject the null hypothesis at a given significance level while all of the adjusted p-values from our method are greater than that nominal level.
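The pairwise Bonferroni decision rule described above can be sketched as follows. The function name is illustrative, and the six pairwise p-values are hypothetical numbers used only to exercise the rule.

```python
from itertools import combinations

def bonferroni_decision(pairwise_p, alpha=0.05):
    """Reject the overall null hypothesis if the smallest pairwise p-value
    is below alpha divided by the number of pairwise comparisons."""
    threshold = alpha / len(pairwise_p)
    reject = min(pairwise_p.values()) < threshold
    return reject, threshold

K = 4
pairs = list(combinations(range(1, K + 1), 2))         # K(K-1)/2 = 6 pairs
hypothetical = [0.20, 0.004, 0.31, 0.09, 0.47, 0.62]   # illustrative values only
pairwise_p = dict(zip(pairs, hypothetical))

reject, threshold = bonferroni_decision(pairwise_p)
print(len(pairs), round(threshold, 4), reject)
```

In the proposed procedure, each entry of `pairwise_p` would be the overall p-value of one pairwise comparison.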

There are some good properties of the proposed test. Figure 1 shows the acceptance regions of the original QS test and the proposed method based on the Fisher test, both with a significance level of 0.05 in cases when $K = 2$. It clearly shows that the QS test would reject the null hypothesis when either $p_1$ or $p_2$ is less than 0.0253. In contrast, the proposed method based on the Fisher test can reject the null hypothesis in certain cases when both $p_1$ and $p_2$ are larger than 0.0253 (i.e., $(p_1, p_2)$ in the shaded region in Figure 1). In other words, for certain alternative hypotheses that both $U$ and $V$ have considerable power to detect, the proposed method based on the Fisher test would be more powerful than the original QS test. In practice, such alternative hypotheses correspond to cases when the two hazard curves cross at an early or late time, which are common in applications. On the other hand, for alternative hypotheses that one of $U$ and $V$ has little power to detect but the other has good power, the QS test may perform slightly better. These alternative hypotheses usually correspond to cases when the two hazard curves cross near the middle of the time range or are nearly parallel to each other, which are less common in practice. In cases when there are multiple hazard curves, it will be shown that the proposed method is much more powerful than the QS test in most scenarios.
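The difference between the two rejection regions can be illustrated with two hypothetical points $(p_1, p_2)$: one where both p-values are moderately small, and one where only a single p-value is small. The sketch below assumes the default QS stage levels and the closed form of the Fisher combination for two p-values; function names are illustrative.

```python
import math

ALPHA = 0.05
STAGE_LEVEL = 1.0 - math.sqrt(1.0 - ALPHA)   # about 0.0253 for each stage

def qs_rejects(p1, p2):
    """The two-stage test rejects if either stage p-value is at most its stage level."""
    return p1 <= STAGE_LEVEL or p2 <= STAGE_LEVEL

def fisher_rejects(p1, p2):
    """The Fisher-based test rejects if p1 * p2 * (1 - ln(p1 * p2)) is at most ALPHA."""
    prod = p1 * p2
    return prod * (1.0 - math.log(prod)) <= ALPHA

# Both p-values slightly above the stage level: only the Fisher-based test rejects.
print(qs_rejects(0.03, 0.04), fisher_rejects(0.03, 0.04))
# One very small and one large p-value: only the two-stage test rejects.
print(qs_rejects(0.02, 0.90), fisher_rejects(0.02, 0.90))
```

The first point lies in the region where the Fisher-based rule gains power, and the second lies in the region where the original two-stage rule has a slight advantage.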

3. Results

3.1. Simulation study

To assess the performance of the proposed test, we conduct a comprehensive simulation study to compare it with the QS tests. The distribution of the survival time ($T$) is assumed to be uniform (U: $U(\theta - a, \theta + a)$), exponential (E: $\exp(a) + \theta$), or log-normal (LN: $LN(\theta, a)$); the corresponding censoring time ($C$) follows a uniform ($U(\theta - a, \theta + a + 2(1 - 2p)/p)$), exponential (E: $\exp\left(a \frac{p}{1 - p}\right) + \theta$), or log-uniform (LU: $LU(\theta + aU(-2, -2 + 2/p))$) distribution, respectively. Here $p$ is the expected censoring rate. In the simulation we consider different values for $p$: 0, 0.2, 0.4, and 0.6. When estimating the empirical type I error rate, we assume the distributions of the survival times are identical for all groups. When assessing the power, in addition to cases when all groups have the same type of distribution, we also consider cases when treatment groups have different types of distributions. More specifically, we consider the following cases: some groups have uniform distributions and the others have exponential distributions (denoted as U+E), some groups have uniform distributions and the others have log-normal distributions (denoted as U+LN), and some groups have exponential distributions and the others have log-normal distributions (denoted as E+LN). We compare the proposed method (denoted as New) with the QS two-stage test using the default values $\alpha_1 = \alpha_2 = 1 - \sqrt{1 - \alpha}$ (denoted as TS), the log-rank test (denoted as LR), and the test used in the second stage of the QS test (denoted as NP). To see how the significance levels $\alpha_1$ and $\alpha_2$ in the two stages affect the performance of the QS approach, we also consider two QS tests with different values for $\alpha_1$ and $\alpha_2$. Specifically, we choose $\alpha_1 = 0.01$, $\alpha_2 = 0.0404$ (denoted as TS1), and $\alpha_1 = 0.0404$, $\alpha_2 = 0.01$ (denoted as TS2).
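As an illustration of how a target censoring rate can be built into the data-generating mechanism, the sketch below simulates the exponential setting. It assumes, as a reading of the setup above, that the survival time is $\theta$ plus an exponential variable with rate $a$ and the censoring time is $\theta$ plus an exponential variable with rate $a p / (1 - p)$; under that assumption the probability that censoring precedes failure is exactly $p$. The function name, seed, and sample size are illustrative.

```python
import numpy as np

def simulate_exponential_group(n, theta, a, p, rng):
    """Simulate one group under the assumed exponential setting.

    Survival time:  theta + Exp(rate = a)
    Censoring time: theta + Exp(rate = a * p / (1 - p)), or no censoring if p == 0
    Returns observed times X = min(T, C) and indicators delta = I(T < C).
    """
    t = theta + rng.exponential(scale=1.0 / a, size=n)
    if p == 0:
        c = np.full(n, np.inf)                       # no censoring
    else:
        c = theta + rng.exponential(scale=(1.0 - p) / (a * p), size=n)
    x = np.minimum(t, c)
    delta = (t < c).astype(int)                      # 1 = failure observed
    return x, delta

rng = np.random.default_rng(1)
x, delta = simulate_exponential_group(200_000, theta=12.0, a=0.1, p=0.4, rng=rng)
print(round(1.0 - delta.mean(), 3))                  # empirical censoring rate, near 0.4
```

The large sample size here is only for checking the censoring rate; the simulation study itself uses much smaller groups.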

We choose the sample size to be 50 for both groups. We use the R package TSHRC with its default setting (i.e., $\varepsilon = 0.1$ and 1000 bootstrap samples) to get the p-values for TS, LR, and NP. To compute the empirical type I error rate and the power, we use a significance level of 0.05 and compute the proportion of rejections out of 1000 replicated simulations.
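The idea of estimating a rejection rate as a proportion of rejections over replicates can be sketched in an idealized setting. The code below does not run the survival tests themselves; it assumes the two stage p-values are exactly independent and uniform under the null hypothesis (their asymptotic behavior) and checks that both decision rules then reject at a rate close to the nominal level. The function name, seed, and number of replicates are illustrative.

```python
import numpy as np

def idealized_rejection_rates(n_rep=200_000, alpha=0.05, seed=0):
    """Monte Carlo rejection rates of the two-stage rule and the Fisher rule,
    assuming independent Uniform(0, 1) stage p-values under the null."""
    rng = np.random.default_rng(seed)
    # 1 - U lies in (0, 1], which avoids taking log(0) below.
    p1 = 1.0 - rng.uniform(size=n_rep)
    p2 = 1.0 - rng.uniform(size=n_rep)

    stage_level = 1.0 - np.sqrt(1.0 - alpha)
    two_stage_reject = (p1 <= stage_level) | (p2 <= stage_level)

    prod = p1 * p2
    fisher_p = prod * (1.0 - np.log(prod))     # chi-square(4) survival, closed form
    fisher_reject = fisher_p <= alpha

    return two_stage_reject.mean(), fisher_reject.mean()

ts_rate, fisher_rate = idealized_rejection_rates()
print(round(ts_rate, 3), round(fisher_rate, 3))  # both should be near 0.05
```

With only 1000 replicates, as in the simulation study, the Monte Carlo standard error of an estimated rate near 0.05 is roughly 0.007, which is worth keeping in mind when reading small differences between methods.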

First, we consider the cases when there are only two hazard curves. Table 1 reports the empirical type I error rates for each method. It can be seen that all methods considered control the type I error rate well in different cases, although the methods TS, NP, and New tend to have slightly smaller empirical type I error rates in certain cases when the censoring rates are low.

Table 2 gives the empirical power values for each method under different settings. It shows that (i) in cases when both LR and NP have some power to detect the difference, the proposed test (i.e., New) is usually more powerful than the original QS test (i.e., TS), and (ii) in cases when only one of LR and NP has power, the QS test is slightly better than the proposed test. These results are consistent with our expectations demonstrated in Figure 1.

Next, we consider situations when we have four hazard curves. Similar to the proposed test, for comparison of multiple hazard curves, the p-values from the pairwise QS tests can be used to determine whether we should reject the overall null hypothesis. In other words, if the smallest p-value of the QS test obtained from the $K(K - 1)/2$ pairwise comparisons is less than the given significance level divided by $K(K - 1)/2$, then the overall null hypothesis that all hazard curves are the same is rejected; otherwise the null is not rejected. It should be pointed out that the LR test can be applied directly to cases when $K > 2$. The method NP is not included here since it requires an enormous number of bootstrap samples to accurately estimate its p-value, which is not feasible in this example with relatively small sample sizes.

Table 3 reports the empirical sizes of the methods based on the QS tests (TS, TS1, and TS2), the LR test, and the proposed method New. Again, all of the methods control the type I error rate reasonably well.

Table 4 reports the empirical power values for each method under different settings. It can be seen that in many situations, the proposed test has larger power values than the LR test and the QS tests. Sometimes, the differences in power between the proposed test and the QS tests are large. For example, when the survival times are all from uniform distributions for the four hazard curves, regardless of the censoring rates, the proposed method can detect the difference almost all the time. In contrast, the LR and QS methods have much lower power, especially when the censoring rates are high (e.g., $p = 0.4$ and 0.6).

Among the TS tests with different values for $\alpha_1$ and $\alpha_2$, in general, as expected, in situations where the LR test is more powerful than the NP test, TS2 is more powerful than TS1; on the other hand, when LR is less powerful than the NP test, TS1 is more powerful than TS2. If there are two groups to be compared (Tables 2 and S2), we observed that under some conditions, TS1 or TS2 may have larger empirical power values than the proposed test. However, in general, the new test has reasonable power under all conditions considered in the simulation: it has similar or larger power values compared with the maximum power values from TS, TS1, and TS2. When there are four hazard rate functions to be compared (Tables 4 and S4), the new test usually has larger power values than those from TS, TS1, and TS2. In addition, with different values of $\alpha_1$ and $\alpha_2$, the empirical power of the TS tests (e.g., TS1 vs. TS2) may change dramatically.

To see how sample sizes affect the performance of the new test and the TS tests, we also simulated data with sample size 100 per group. We keep all of the other parameters the same but use significance level $\alpha = 0.01$. The simulation results are reported in Supplementary Tables S1-S4. We have similar observations to those from Tables 1-4: in general, the proposed test is preferred as it has reasonable power under all of the situations considered.

3.2. Real data applications

To illustrate the use of the proposed test, we applied it to three data sets. The first data set was obtained from a study about the tumorigenesis of a drug (Mantel et al., 1977), which has been analyzed by Qiu and Sheng (Qiu and Sheng, 2008). In this study, rats were taken from 50 distinct litters. One rat from each litter was randomly selected and given the drug, and another two rats were selected as controls and were given a placebo. There were 29 and 81 censored observations in the treatment group and the control group, respectively. The two p-values in the first and the second stages of the QS test were 0.0034 and 0.051, respectively. If we use the commonly used significance level 0.05, the overall p-value from the QS test will be 0.0034 and the null hypothesis is then rejected. However, if we choose the significance level to be 0.005, then the overall p-value will be 0.053, which is larger than the preset significance level. Therefore, we will not reject the null hypothesis in that case. As a comparison, the proposed test has the overall p-value 0.0017, which does not depend on the preset significance level. Therefore, it is more powerful in this example.

The second data set is taken from a study on kidney dialysis patients to assess the time to first exit site infection (in months) in 119 patients with renal impairment. Among those patients, 43 utilized a surgically placed catheter (group 1), and 76 utilized a percutaneous placement of their catheter (group 2). There were 27 and 65 censored observations in groups 1 and 2, respectively. This data set was analyzed in the papers by Lin and Wang (Lin and Wang, 2004) and Qiu and Sheng (Qiu and Sheng, 2008). Using the significance level of 0.05, we got p-values of 0.11 and 0.00078 in the first stage and the second stage of the QS test, and the overall p-value is 0.026. If we change the significance level to 0.001, the overall p-value of the QS test becomes 0.0013, which is larger than 0.001; therefore, we will not reject the null hypothesis. In contrast, the p-value from the proposed test is 0.00090, which is much smaller than the overall p-values of the QS test in both cases.
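The overall p-values quoted for the first two data sets can be recomputed from the reported stage p-values alone. The sketch below assumes the default stage levels $\alpha_1 = \alpha_2 = 1 - \sqrt{1 - \alpha}$ for the QS test and the closed-form chi-square survival function with 4 degrees of freedom for the Fisher combination; function names are illustrative. Because the inputs are the rounded stage p-values, the last digit can differ slightly from the quoted values.

```python
import math

def qs_overall_p_value(p1, p2, alpha):
    """Two-stage overall p-value with default stage levels 1 - sqrt(1 - alpha)."""
    alpha1 = 1.0 - math.sqrt(1.0 - alpha)
    return p1 if p1 <= alpha1 else alpha1 + p2 * (1.0 - alpha1)

def fisher_combined_p_value(p1, p2):
    """Fisher combination of two p-values: p1*p2*(1 - ln(p1*p2))."""
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

# Stage p-values as reported in the text.
rat_study = (0.0034, 0.051)
dialysis_study = (0.11, 0.00078)

print(qs_overall_p_value(*rat_study, alpha=0.05))         # stage one rejects
print(qs_overall_p_value(*rat_study, alpha=0.005))        # about 0.053
print(fisher_combined_p_value(*rat_study))                # about 0.0017

print(qs_overall_p_value(*dialysis_study, alpha=0.05))    # about 0.026
print(qs_overall_p_value(*dialysis_study, alpha=0.001))   # about 0.0013
print(fisher_combined_p_value(*dialysis_study))           # about 0.00089
```

The two-stage values change with the preset level, while the Fisher-based value is a single number per data set.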

The third data set was from the randomized, double-blinded Digoxin Intervention Trial (The Digitalis Investigation Group, 1997). In the trial, patients with left ventricular ejection fractions of 0.45 or less were randomly assigned to digoxin (3397 patients) or placebo (3403 patients) groups. A primary outcome was the mortality due to worsening heart failure. The subjects were categorized based on their treatments and gender. It is of interest to compare the survival distributions among the four groups. We applied various methods to this data set. Table 5 lists the p-values obtained by the QS test and the proposed test. Here we have six comparisons in total. Using 0.05 as the significance level, a comparison with a p-value less than 0.05/6, i.e., 0.0083, would be claimed significant. None of the p-values from the QS test is less than 0.0083; however, the p-value for comparing the first and second groups from the proposed test is 0.0028. In addition, the p-values from the LR test and the Wilcoxon test are 0.11 and 0.092, respectively.
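The Bonferroni comparison for the four-group example can be checked directly against the pairwise p-values listed in Table 5. The sketch below copies the TS and New columns of that table; the helper name `significant_pairs` is illustrative.

```python
ALPHA = 0.05

pairs = ["1 vs. 2", "1 vs. 3", "1 vs. 4", "2 vs. 3", "2 vs. 4", "3 vs. 4"]
ts_p = [0.019, 0.36, 0.18, 0.041, 0.037, 0.70]      # TS column of Table 5
new_p = [0.0028, 0.21, 0.23, 0.073, 0.035, 0.81]    # New column of Table 5

def significant_pairs(labels, p_values, alpha):
    """Return the pairs whose p-value is below alpha divided by the number of pairs."""
    threshold = alpha / len(p_values)
    return [label for label, p in zip(labels, p_values) if p < threshold]

print(round(ALPHA / len(pairs), 4))            # Bonferroni threshold
print(significant_pairs(pairs, ts_p, ALPHA))   # pairs flagged by the two-stage test
print(significant_pairs(pairs, new_p, ALPHA))  # pairs flagged by the proposed test
```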

4. Discussion and Conclusion

In this paper, we proposed an improvement of the QS test by using the Fisher test in defining the overall p-value of the test. There are several advantages to using the proposed method. First, it is more powerful than the original QS test in various cases. Second, unlike the QS test, the p-value of our proposed method is independent of the preset significance level and of the order of the two stages as well. Third, for comparing multiple hazard curves, the proposed test performs much better than the QS test in many cases.

It should be pointed out that in the original QS test, the authors set the default values as $\alpha_1 = \alpha_2 = 1 - \sqrt{1 - \alpha}$. We should choose appropriate values for $\alpha_1$ and $\alpha_2$ in the TS test if prior information about the hazard curves is available, so that the TS test has optimal detecting power. However, if the values of $\alpha_1$ and $\alpha_2$ are set inappropriately, the TS method may also lose power dramatically.

Besides the Fisher test considered here, several other p-value combining methods have been proposed in the literature. For example, the weighted Z tests and the generalized Fisher tests are commonly used (Chen, 2011; Chen and Nadarajah, 2014; Chen et al., 2014). The method proposed by Chen and Nadarajah (Chen and Nadarajah, 2014) has similar performance to that of the Fisher test. However, the methods based on the weighted Z-test are not recommended since they are not robust to the p-values obtained in the individual steps of the two-stage scheme and can potentially lose power dramatically. In addition, if we have some prior information about the hazard curves, powerful methods of combining p-values can be designed accordingly.

In the literature, there are many multiple comparison methods, such as the Sidak procedure, Tukey's procedure, and Dunnett's method. However, all of them require some assumptions that may not be valid here. As a comparison, the Bonferroni procedure adopted here does not require any assumptions, such as independence among the individual p-values. Furthermore, our simulation study has shown that the proposed method using the Bonferroni procedure works well in terms of controlling the type I error rate.

It should be pointed out that, although the original QS test and the proposed method are designed for comparing hazard rate functions, they can also be used to compare survival curves. In practice, one may want to compare certain quantiles (e.g., the survival median) (Brookmeyer and Crowley, 1982; Chen, 2014a, b; Chen and Zhang, 2016) instead of the whole survival curves. In such cases, neither the QS test nor the proposed new method can be applied directly, and much future research is needed.

References

Brookmeyer, R., Crowley, J., 1982. A k-sample median test for censored data. J. Amer. Statist. Assoc. 77, 433-440.

Chen, Z., 2011. Is the weighted z-test the best method for combining probabilities from independent tests? J. Evol. Biol. 24, 926-930.

Chen, Z., 2014a. Extension of Mood's median test for survival data. Statistics & Probability Letters 95, 77-84.

Chen, Z., 2014b. A Nonparametric Approach to Detecting the Difference of Survival Medians. Commun Stat Simulat, (in press), DOI:10.1080/03610918.03612014.03964804.

Chen, Z., Nadarajah, S., 2014. On the optimally weighted z-test for combining probabilities from independent studies. Comput. Stat. Data Anal. 70, 387-394.

Chen, Z., Yang, W., Liu, Q., Yang, J.Y., Li, J., Yang, M.Q., 2014. A new statistical approach to combining p-values using gamma distribution and its application to genome-wide association study. BMC Bioinformatics 15 (Suppl 17), S3.

Chen, Z., Zhang, G., 2016. Comparing survival curves based on medians. BMC Med. Res. Methodol. 16, 1.

Cheng, M.-Y., Qiu, P., Tan, X., Tu, D., 2009. Confidence intervals for the first crossing point of two hazard functions. Lifetime Data Anal. 15, 441-454.

Fisher, R.A., 1932. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh.

Kaplan, E.L., Meier, P., 1958. Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53, 457-481.

Lee, E.T., Wang, J., 2003. Statistical methods for survival data analysis. Wiley-Interscience.

Li, H., Han, D., Hou, Y., Chen, H., Chen, Z., 2015. Statistical Inference Methods for Two Crossing Survival Curves: A Comparison of Methods. PLoS One 10.

Lin, X., Wang, H., 2004. A new testing approach for comparing the overall homogeneity of survival curves. Biometrical Journal 46, 489-496.

Liu, K., Qiu, P., Sheng, J., 2007. Comparing two crossing hazard rates by Cox proportional hazards modelling. Stat. Med. 26, 375-391.

Mantel, N., Bohidar, N.R., Ciminera, J.L., 1977. Mantel-Haenszel analyses of litter-matched time-to-response data, with modifications for recovery of interlitter information. Cancer Res. 37, 3863-3868.

Mantel, N., Stablein, D.M., 1988. The crossing hazard function problem. The Statistician, 59-64.

Moreau, T., Maccario, J., Lellouch, J., Huber, C., 1992. Weighted log rank statistics for comparing two distributions. Biometrika 79, 195-198.

Owen, A.B., 2009. Karl Pearson's meta-analysis revisited. Ann. Statist. 37, 3867-3892.

Park, K.Y., Qiu, P., 2014. Model selection and diagnostics for joint modeling of survival and longitudinal data with crossing hazard rate functions. Stat. Med. 33, 4532-4546.

Qiu, P., Sheng, J., 2008. A two-stage procedure for comparing hazard rate functions. J. Roy. Stat. Soc. Ser. B. (Stat. Method.) 70, 191-208.

The Digitalis Investigation Group, 1997. The effect of digoxin on mortality and morbidity in patients with heart failure. N. Engl. J. Med. 336, 525-533.


Tables:

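Table 1. Empirical type I error rates of each method under different settings with sample size 50 for each group (two groups) and the significance level 0.05. The results were obtained from 1000 replicates. Column headings give the censoring rates of the two groups.

| Distribution | Method | 0, 0 | 0.2, 0.2 | 0.4, 0.4 | 0.6, 0.6 | 0.4, 0.6 |
|---|---|---|---|---|---|---|
| U: $\theta = (6, 6)$, $a = (2, 2)$ | TS | 0.046 | 0.045 | 0.043 | 0.049 | 0.046 |
| | TS1 | 0.044 | 0.042 | 0.043 | 0.048 | 0.045 |
| | TS2 | 0.049 | 0.049 | 0.050 | 0.052 | 0.050 |
| | LR | 0.053 | 0.055 | 0.052 | 0.052 | 0.049 |
| | NP | 0.039 | 0.036 | 0.041 | 0.046 | 0.043 |
| | New | 0.042 | 0.040 | 0.043 | 0.048 | 0.046 |
| E: $\theta = (12, 12)$, $a = (0.1, 0.1)$ | TS | 0.044 | 0.045 | 0.047 | 0.049 | 0.044 |
| | TS1 | 0.043 | 0.042 | 0.045 | 0.048 | 0.042 |
| | TS2 | 0.049 | 0.049 | 0.054 | 0.052 | 0.049 |
| | LR | 0.051 | 0.051 | 0.055 | 0.050 | 0.050 |
| | NP | 0.039 | 0.039 | 0.039 | 0.045 | 0.039 |
| | New | 0.040 | 0.043 | 0.044 | 0.046 | 0.043 |
| LN: $\theta = (5, 5)$, $a = (1, 1)$ | TS | 0.047 | 0.046 | 0.045 | 0.050 | 0.049 |
| | TS1 | 0.045 | 0.046 | 0.040 | 0.050 | 0.046 |
| | TS2 | 0.051 | 0.049 | 0.052 | 0.052 | 0.053 |
| | LR | 0.054 | 0.050 | 0.053 | 0.051 | 0.053 |
| | NP | 0.040 | 0.042 | 0.035 | 0.046 | 0.043 |
| | New | 0.045 | 0.040 | 0.041 | 0.051 | 0.046 |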

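Table 2. Empirical powers of each method under different settings with sample size 50 for each group (two groups) and the significance level 0.05. The results were obtained from 1000 replicates. Column headings give the censoring rates of the two groups.

| Distribution | Method | 0, 0 | 0.2, 0.2 | 0.4, 0.4 | 0.6, 0.6 | 0.4, 0.6 |
|---|---|---|---|---|---|---|
| U: $\theta = (6, 6)$, $a = (8, 3)/3$ | TS | 0.616 | 0.547 | 0.475 | 0.375 | 0.388 |
| | TS1 | 0.606 | 0.607 | 0.550 | 0.445 | 0.465 |
| | TS2 | 0.549 | 0.444 | 0.355 | 0.298 | 0.287 |
| | LR | 0.243 | 0.105 | 0.049 | 0.067 | 0.068 |
| | NP | 0.569 | 0.605 | 0.570 | 0.469 | 0.488 |
| | New | 0.769 | 0.586 | 0.463 | 0.381 | 0.367 |
| E: $\theta = (12, 10)$, $a = (4, 5)/40$ | TS | 0.730 | 0.743 | 0.749 | 0.774 | 0.780 |
| | TS1 | 0.720 | 0.734 | 0.740 | 0.757 | 0.783 |
| | TS2 | 0.709 | 0.721 | 0.728 | 0.751 | 0.757 |
| | LR | 0.593 | 0.595 | 0.600 | 0.631 | 0.639 |
| | NP | 0.632 | 0.652 | 0.637 | 0.614 | 0.729 |
| | New | 0.810 | 0.825 | 0.832 | 0.846 | 0.854 |
| LN: $\theta = (5, 5.6)$, $a = (1, 1)$ | TS | 0.723 | 0.676 | 0.604 | 0.543 | 0.575 |
| | TS1 | 0.635 | 0.585 | 0.514 | 0.447 | 0.468 |
| | TS2 | 0.762 | 0.721 | 0.648 | 0.598 | 0.630 |
| | LR | 0.773 | 0.733 | 0.666 | 0.619 | 0.642 |
| | NP | 0.265 | 0.243 | 0.204 | 0.145 | 0.120 |
| | New | 0.748 | 0.708 | 0.625 | 0.560 | 0.581 |
| U+E: $\theta = (10, 4)$, $a = (5, 0.1)$ | TS | 0.908 | 0.607 | 0.544 | 0.526 | 0.659 |
| | TS1 | 0.832 | 0.553 | 0.577 | 0.593 | 0.700 |
| | TS2 | 0.927 | 0.622 | 0.456 | 0.409 | 0.547 |
| | LR | 0.899 | 0.485 | 0.161 | 0.069 | 0.115 |
| | NP | 0.156 | 0.342 | 0.557 | 0.613 | 0.715 |
| | New | 0.991 | 0.789 | 0.592 | 0.520 | 0.658 |
| U+LN: $\theta = (5, 1)$, $a = (5, 1)$ | TS | 0.733 | 0.449 | 0.352 | 0.404 | 0.375 |
| | TS1 | 0.793 | 0.462 | 0.296 | 0.303 | 0.295 |
| | TS2 | 0.645 | 0.419 | 0.383 | 0.471 | 0.419 |
| | LR | 0.163 | 0.287 | 0.393 | 0.490 | 0.440 |
| | NP | 0.800 | 0.422 | 0.136 | 0.049 | 0.086 |
| | New | 0.752 | 0.505 | 0.384 | 0.388 | 0.366 |
| E+LN: $\theta = (0, 1)$, $a = (0.2, 2)$ | TS | 0.667 | 0.525 | 0.401 | 0.269 | 0.301 |
| | TS1 | 0.667 | 0.582 | 0.467 | 0.306 | 0.351 |
| | TS2 | 0.602 | 0.425 | 0.296 | 0.213 | 0.228 |
| | LR | 0.247 | 0.097 | 0.053 | 0.096 | 0.080 |
| | NP | 0.616 | 0.588 | 0.491 | 0.308 | 0.369 |
| | New | 0.864 | 0.538 | 0.369 | 0.278 | 0.297 |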

Table 3. Empirical type I error rates of each method under different settings with sample size 50 for each group (four groups) and the significance level 0.05. The results were obtained from 1000 replicates. Column headings give the censoring rates of the four groups.

| Distribution | Method | 0, 0, 0, 0 | (2, 2, 2, 2)/10 | (4, 4, 4, 4)/10 | (6, 6, 6, 6)/10 | (4, 6, 4, 6)/10 |
|---|---|---|---|---|---|---|
| U: $\theta = (6, 6, 6, 6)$, $a = (2, 2, 2, 2)$ | TS | 0.046 | 0.040 | 0.040 | 0.044 | 0.048 |
| | TS1 | 0.046 | 0.040 | 0.040 | 0.044 | 0.048 |
| | TS2 | 0.046 | 0.040 | 0.040 | 0.044 | 0.048 |
| | LR | 0.059 | 0.057 | 0.049 | 0.055 | 0.049 |
| | New | 0.030 | 0.031 | 0.035 | 0.044 | 0.043 |
| E: $\theta = (12, 12, 12, 12)$, $a = (0.1, 0.1, 0.1, 0.1)$ | TS | 0.044 | 0.035 | 0.048 | 0.043 | 0.044 |
| | TS1 | 0.044 | 0.035 | 0.048 | 0.043 | 0.044 |
| | TS2 | 0.044 | 0.035 | 0.048 | 0.043 | 0.044 |
| | LR | 0.053 | 0.046 | 0.059 | 0.063 | 0.055 |
| | New | 0.029 | 0.028 | 0.045 | 0.050 | 0.035 |
| LN: $\theta = (5, 5, 5, 5)$, $a = (1, 1, 1, 1)$ | TS | 0.060 | 0.052 | 0.047 | 0.058 | 0.051 |
| | TS1 | 0.060 | 0.052 | 0.047 | 0.058 | 0.051 |
| | TS2 | 0.060 | 0.052 | 0.047 | 0.058 | 0.051 |
| | LR | 0.060 | 0.051 | 0.076 | 0.073 | 0.054 |
| | New | 0.032 | 0.037 | 0.040 | 0.050 | 0.052 |

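Table 4. Empirical powers of each method under different settings with sample size 50 for each group (four groups) and the significance level 0.05. The results were obtained from 1000 replicates. Column headings give the censoring rates of the four groups.

| Distribution | Method | 0, 0, 0, 0 | (2, 2, 2, 2)/10 | (4, 4, 4, 4)/10 | (6, 6, 6, 6)/10 | (4, 6, 4, 6)/10 |
|---|---|---|---|---|---|---|
| U: $\theta = (6, 6, 6, 6)$, $a = (8, 8, 3, 3)/3$ | TS | 0.676 | 0.380 | 0.069 | 0.286 | 0.217 |
| | TS1 | 0.676 | 0.380 | 0.069 | 0.286 | 0.217 |
| | TS2 | 0.676 | 0.380 | 0.069 | 0.286 | 0.217 |
| | LR | 0.823 | 0.507 | 0.092 | 0.329 | 0.149 |
| | New | 1.000 | 1.000 | 1.000 | 0.994 | 0.999 |
| E: $\theta = (12, 10, 12, 10)$, $a = (4, 4, 5, 5)/40$ | TS | 0.412 | 0.418 | 0.424 | 0.484 | 0.502 |
| | TS1 | 0.412 | 0.418 | 0.424 | 0.484 | 0.502 |
| | TS2 | 0.412 | 0.418 | 0.424 | 0.484 | 0.502 |
| | LR | 0.441 | 0.449 | 0.476 | 0.554 | 0.557 |
| | New | 0.736 | 0.789 | 0.802 | 0.842 | 0.828 |
| LN: $\theta = (5, 5, 5.6, 5.6)$, $a = (1, 1, 1, 1)$ | TS | 0.850 | 0.814 | 0.773 | 0.687 | 0.725 |
| | TS1 | 0.850 | 0.814 | 0.773 | 0.687 | 0.725 |
| | TS2 | 0.850 | 0.814 | 0.773 | 0.687 | 0.725 |
| | LR | 0.913 | 0.876 | 0.857 | 0.765 | 0.803 |
| | New | 0.854 | 0.814 | 0.767 | 0.654 | 0.716 |
| U+E: $\theta = (10, 10, 4, 4)$, $a = (5, 5, 0.1, 0.1)$ | TS | 0.931 | 0.483 | 0.135 | 0.063 | 0.108 |
| | TS1 | 0.931 | 0.483 | 0.135 | 0.063 | 0.108 |
| | TS2 | 0.931 | 0.483 | 0.135 | 0.063 | 0.108 |
| | LR | 0.976 | 0.602 | 0.192 | 0.076 | 0.142 |
| | New | 0.933 | 0.537 | 0.509 | 0.526 | 0.537 |
| U+LN: $\theta = (5, 5, 1, 1)$, $a = (5, 5, 1, 1)$ | TS | 0.166 | 0.293 | 0.445 | 0.550 | 0.511 |
| | TS1 | 0.166 | 0.293 | 0.445 | 0.550 | 0.511 |
| | TS2 | 0.166 | 0.293 | 0.445 | 0.550 | 0.511 |
| | LR | 0.194 | 0.342 | 0.497 | 0.628 | 0.537 |
| | New | 0.842 | 0.602 | 0.437 | 0.470 | 0.455 |
| E+LN: $\theta = (0, 0, 1, 1)$, $a = (0.2, 0.2, 2, 2)$ | TS | 0.198 | 0.079 | 0.039 | 0.113 | 0.065 |
| | TS1 | 0.198 | 0.079 | 0.039 | 0.113 | 0.065 |
| | TS2 | 0.198 | 0.079 | 0.039 | 0.113 | 0.065 |
| | LR | 0.310 | 0.108 | 0.049 | 0.126 | 0.075 |
| | New | 0.610 | 0.471 | 0.369 | 0.291 | 0.352 |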

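Table 5. P-values obtained by the QS test and the proposed method from each pair of groups of the DIT data.

| Group pair | LR | NP | TS | New |
|---|---|---|---|---|
| 1 vs. 2 | 0.019 | 0.026 | 0.019 | 0.0028 |
| 1 vs. 3 | 0.17 | 0.32 | 0.36 | 0.21 |
| 1 vs. 4 | 0.38 | 0.16 | 0.18 | 0.23 |
| 2 vs. 3 | 0.86 | 0.016 | 0.041 | 0.073 |
| 2 vs. 4 | 0.47 | 0.012 | 0.037 | 0.035 |
| 3 vs. 4 | 0.66 | 0.69 | 0.70 | 0.81 |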


Figure Legend:

Figure 1. The acceptance regions for the original QS two-stage test and the proposed method based on the Fisher test.