Lecture 16: Regression Diagnostics I

Proportional Hazards Assumption
– Graphical methods
– Regression methods

Regression Diagnostics

• Most interested in testing proportional hazards assumption

• Also looking for functional form of covariates

• Two types of methods
– Graphical approaches
– Regression approaches

Graphical Approaches

• Recall our graphical checks
– Kernel smoothing
– Smoothing splines

• Both will provide information about whether or not the hazards cross

• Both can be implemented in R
– “muhaz” package: kernel smoothing for survival
– “gss” package: smoothing splines for survival

Graphical Approaches

• Consider CPHM with single binary covariate:
$$h(t \mid Z=1) = e^{b}\, h(t \mid Z=0) \;\Longrightarrow\; H(t \mid Z=1) = e^{b}\, H(t \mid Z=0)$$

• This means we can also consider the following plot:
$$\ln H(t \mid Z=1) - \ln H(t \mid Z=0) \quad \text{vs.} \quad t$$

• If hazards are proportional, this should be approximately equal to $\hat{b}$ at all times

• No package in R
– Calculate NA (Nelson-Aalen) hazard estimates for each condition at each unique time point

Examples

• Let's explore the graphical and regression checks for proportional hazards
– Kidney Infection data
  • Surgical vs. percutaneous
– BMT data
  • FAB classification
  • Methotrexate use

Survival for Kidney

Graphical Checks

### KIDNEY INFECTION EXAMPLE###
# log cumulative hazard plots
library(KMsurv); library(survival)
data(kidney, package="KMsurv")   # survival also ships a data set named kidney, so name the package

coxph(Surv(time, delta)~factor(type), data=kidney)
dat1<-kidney[kidney$type==1, ]
dat2<-kidney[kidney$type==2, ]
fit1<-survfit(coxph(Surv(time, delta)~1, data=dat1), type="aalen")
fit2<-survfit(coxph(Surv(time, delta)~1, data=dat2), type="aalen")

times<-sort(unique(kidney$time))
ch1<--log(fit1$surv)   # cumulative hazard, group 1
ch2<--log(fit2$surv)   # cumulative hazard, group 2

# carry each group's estimates forward onto the common time grid
# (these indices are specific to this data set)
ch1<-c(0, ch1[1:17],ch1[17],ch1[17],ch1[18:20],ch1[20],ch1[21:23],ch1[23])
ch2<-c(ch2[1:13],ch2[13],ch2[14:19],ch2[19],ch2[20],ch2[20],ch2[21:23],ch2[23],ch2[24])
plot(times, log(ch2)-log(ch1), type="s", xlab="time",
     ylab="log(H[t|Z=perc])-log(H[t|Z=surg])", lwd=2)
lines(times, rep(0.613, length(times)), lwd=2, col=2)   # horizontal reference line at 0.613
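The hand-written index alignment above only works for this particular data set. A minimal sketch of a more general version follows, assuming simulated two-group data (the data, the helper `na_on_grid`, and all variable names are hypothetical and not part of the lecture). It evaluates each group's Nelson-Aalen estimate on a shared grid with a step function and compares the log cumulative hazard difference with the Cox estimate.

```r
library(survival)

# Simulated two-group data (hypothetical; NOT the kidney data)
set.seed(1)
n     <- 200
grp   <- rep(0:1, each = n / 2)
t_ev  <- rexp(n, rate = 0.1 * exp(0.7 * grp))   # proportional hazards, log HR = 0.7
t_cen <- rexp(n, rate = 0.03)                   # independent censoring
dat <- data.frame(time  = pmin(t_ev, t_cen),
                  delta = as.numeric(t_ev <= t_cen),
                  grp   = grp)

# Nelson-Aalen cumulative hazard of one group, evaluated on a common grid.
# survfit() stores the Nelson-Aalen estimate in $cumhaz; stepfun() carries
# the last value forward between event times.
na_on_grid <- function(d, grid) {
  fit <- survfit(Surv(time, delta) ~ 1, data = d)
  stepfun(fit$time, c(0, fit$cumhaz))(grid)
}

grid <- sort(unique(dat$time[dat$delta == 1]))
H0 <- na_on_grid(dat[dat$grp == 0, ], grid)
H1 <- na_on_grid(dat[dat$grp == 1, ], grid)

ok    <- H0 > 0 & H1 > 0                 # the log is undefined before the first event
ldiff <- log(H1[ok]) - log(H0[ok])
bhat  <- unname(coef(coxph(Surv(time, delta) ~ grp, data = dat)))

plot(grid[ok], ldiff, type = "s", xlab = "time",
     ylab = "log H(t|Z=1) - log H(t|Z=0)", lwd = 2)
abline(h = bhat, col = 2, lwd = 2)       # under PH the step curve should hover near this line
```

Because the simulated hazards are proportional by construction, the step curve should fluctuate around the horizontal line; a clear trend would suggest non-proportional hazards.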

Graphical Checks

# Smoothing splines
# (kid: kidney data with columns Time, d (event indicator) and cath)
cath<-kid$cath[order(kid$Time)]
event<-kid$d[order(kid$Time)]
times<-sort(kid$Time)

library(gss)
hazfit<-sshzd(Surv(times, event)~cath*times)
haz<-hzdrate.sshzd(hazfit, data.frame(times=times, cath=cath))
h1<-haz[cath==1]; id1<-order(h1); t1<-times[cath==1]
h2<-haz[cath==0]; id2<-order(h2); t2<-times[cath==0]

plot(times[cath==0], haz[cath==0], xlim=c(0, max(times)), ylim=range(haz),
     xlab="Time", type="l", ylab="hazard", lwd=2, col=1)
lines(times[cath==1], haz[cath==1], lwd=2, col=2)
legend(0, .1, c("percutaneous","surgical"), col=1:2, lwd=2, cex=0.8)

Graphical Checks: Kidney

BMT Data

• Let’s conduct graphical checks for
– French/American/British (FAB) disease classification
  • Recall this was significant in our original model
– Methotrexate use
  • Recall this was not

Survival Curves for BMT: FAB and MTX

BMT Graphical Checks: FAB

BMT Graphical Checks: MTX

Graphical Approaches

• Pretty pictures are nice and can be intuitive but…

• We generally prefer a statistical means of determining if an assumption is true

• This leads us to regression approaches

Regression Approaches

• Introduce a time-dependent covariate into the model

• General idea:
– If PHM is valid, the time-dependent covariate will not be significant
– If the time-dependent covariate is significant, then there is “something” going on in terms of the HRs that varies over time

Introduce Time Dependent Covariate

• Create a new variable Z2(t) = Z1×g(t), where g(t) is a function of time

• We don’t know the functional form of g(t)
• Try several possibilities, for example (with $t_0$ a specified time):

$$Z_2(t) = \begin{cases} 0 & \text{if } t \le t_0 \\ Z_1 & \text{if } t > t_0 \end{cases}
\qquad
Z_2(t) = \begin{cases} 0 & \text{if } t \le t_0 \\ Z_1 \ln t & \text{if } t > t_0 \end{cases}$$

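Binary Case

• Consider a binary covariate Z1

• Generate, with $g(t) = \ln t$:
$$Z_2(t) = Z_1 \times g(t) = \begin{cases} 0 & \text{if } Z_1 = 0 \\ \ln t & \text{if } Z_1 = 1 \end{cases}$$

• Model is:
$$h(t \mid Z(t)) = h_0(t)\exp\{b_1 Z_1 + b_2 Z_2(t)\} = h_0(t)\exp\{b_1 Z_1 + b_2 Z_1 \ln t\}$$

• Hazard ratio is:
$$\frac{h(t \mid Z_1 = 1)}{h(t \mid Z_1 = 0)} = \exp\{b_1 + b_2 \ln t\} = e^{b_1}\, t^{b_2}$$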

New Time Dependent Covariate

• Fit proportional hazards model with Z1, Z2(t), and estimate b1, b2

• Test local hypothesis:– H0: b2 = 0 vs. HA: b2 ≠ 0

• If you reject H0, you cannot assume proportional hazards

• Do this for each covariate in question
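The test of H0: b2 = 0 can be sketched with the `tt` argument of `coxph()` in the survival package, which builds Z2(t) = Z1 × g(t) internally instead of requiring a long-format data set. The data below are simulated and every variable name is hypothetical; g(t) = ln t is one choice among the possibilities discussed above.

```r
library(survival)

# Simulated data with a binary covariate (hypothetical; PH holds by construction)
set.seed(2)
n     <- 300
z1    <- rbinom(n, 1, 0.5)
t_ev  <- rexp(n, rate = 0.1 * exp(0.5 * z1))
t_cen <- rexp(n, rate = 0.05)
dat <- data.frame(time  = pmin(t_ev, t_cen),
                  delta = as.numeric(t_ev <= t_cen),
                  z1    = z1)

# tt(z1) adds the time-dependent covariate Z2(t) = z1 * log(t)
fit <- coxph(Surv(time, delta) ~ z1 + tt(z1), data = dat,
             tt = function(x, t, ...) x * log(t))

# Second row of the coefficient table is b2: its Wald z and p-value test H0: b2 = 0
b2_row <- summary(fit)$coefficients[2, ]
print(b2_row)
```

A small p-value for the second coefficient would be evidence against proportional hazards for `z1`; swapping in a different `tt` function tries a different g(t).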

Examples

• Let's explore the regression check for proportional hazards in our two examples…
– Kidney Infection data
  • Surgical vs. percutaneous
– BMT data
  • FAB classification
  • Methotrexate use

Regression Check: Kidney

### Kidney Example (are hazards for percutaneous and surgical proportional?)
times<-sort(unique(kidney$time))
kidney$id<-1:nrow(kidney)

# split each subject's follow-up at the unique times (counting-process format);
# expand.breakpoints() is assumed to be defined already (it is not part of the survival package)
kid.long<-expand.breakpoints(kidney, index="id", status="delta", tevent="time", breakpoints=times)

# time-dependent covariates: g(t) = ln t, t, 1(t > 7.5), t*1(t > 7.5)
kid.long$ttype1<-log(kid.long$Tstop)*(kid.long$type-1)
kid.long$ttype2<-kid.long$Tstop*(kid.long$type-1)
kid.long$ttype3<-ifelse(kid.long$Tstop>7.5, (kid.long$type-1), 0)
kid.long$ttype4<-ifelse(kid.long$Tstop>7.5, kid.long$Tstop*(kid.long$type-1), 0)

m1<-coxph(Surv(Tstart, Tstop, delta)~type+ttype1, data=kid.long)
m2<-coxph(Surv(Tstart, Tstop, delta)~type+ttype2, data=kid.long)
m3<-coxph(Surv(Tstart, Tstop, delta)~type+ttype3, data=kid.long)
m4<-coxph(Surv(Tstart, Tstop, delta)~type+ttype4, data=kid.long)
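If `expand.breakpoints()` is not available, the same long-format layout can be sketched with `survSplit()` from the survival package. This is an illustration on simulated data, not the lecture's code: the data and variable names are hypothetical, and `survSplit()` names the interval columns `tstart` and (here) `time` rather than `Tstart` and `Tstop`.

```r
library(survival)

# Simulated stand-in for the kidney data (hypothetical)
set.seed(4)
n     <- 150
type  <- rbinom(n, 1, 0.5)                  # 0/1 group indicator
t_ev  <- rexp(n, rate = 0.1 * exp(0.4 * type))
t_cen <- rexp(n, rate = 0.05)
dat <- data.frame(id    = 1:n,
                  time  = pmin(t_ev, t_cen),
                  delta = as.numeric(t_ev <= t_cen),
                  type  = type)

# Split follow-up at every observed event time
cuts <- sort(unique(dat$time[dat$delta == 1]))
long <- survSplit(Surv(time, delta) ~ ., data = dat, cut = cuts)

# Time-dependent covariate evaluated at the end of each interval: g(t) = ln t
long$ttype1 <- log(long$time) * long$type

m1_sketch <- coxph(Surv(tstart, time, delta) ~ type + ttype1, data = long)
print(m1_sketch)
```

Each row of `long` covers one interval (tstart, time], so the event indicator is 1 only on the interval in which the subject's event occurs.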

Results: Kidney

> m1
          coef exp(coef) se(coef)      z     p
type      1.44      4.21    1.029   1.40 0.160
ttype1   -1.47      0.23    0.587  -2.51 0.012

> m2
          coef exp(coef) se(coef)      z     p
type     0.961     2.614    0.751   1.28 0.200
ttype2  -0.256     0.774    0.117  -2.18 0.029

> m3
          coef exp(coef) se(coef)      z     p
type      0.35    1.4193    0.549  0.637 0.520
ttype3   -2.89    0.0555    1.184 -2.443 0.015

> m4
          coef exp(coef) se(coef)      z     p
type     0.241     1.272   0.5317  0.453 0.650
ttype4  -0.185     0.831   0.0875 -2.113 0.035
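As a worked reading of model m1 (a sketch, assuming `type` is coded 1 = surgical and 2 = percutaneous, so that `ttype1` equals ln t for percutaneous patients and 0 for surgical patients), the fitted hazard ratio for percutaneous vs. surgical placement is a function of time:

$$\widehat{HR}(t) = \exp\{1.44 - 1.47 \ln t\} = e^{1.44}\, t^{-1.47} \approx 4.2\, t^{-1.47}$$

$$\widehat{HR}(t) = 1 \iff \ln t = \frac{1.44}{1.47} \approx 0.98 \iff t \approx 2.7$$

Under this fit the ratio is about 4.2 at t = 1, crosses 1 near t ≈ 2.7, and falls to about 0.14 by t = 10, which is the kind of reversal a single constant hazard ratio cannot describe.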

BMT Regression Check: FAB

### BMT Example (are hazards for FAB classes proportional?)
bps<-sort(unique(c(bmt$DFS)))
bmt.long<-expand.breakpoints(bmt, index="id", status="Either", tevent="DFS", breakpoints=bps)

# create time-dependent covariates
bmt.long$txfab1<-log(bmt.long$Tstop)*(bmt.long$FAB)
bmt.long$txfab2<-bmt.long$Tstop*(bmt.long$FAB)
bmt.long$txfab3<-ifelse(bmt.long$Tstop>100, (bmt.long$FAB), 0)
bmt.long$txfab4<-ifelse(bmt.long$Tstop>100, bmt.long$Tstop*(bmt.long$FAB), 0)

m1<-coxph(Surv(Tstart, Tstop, Either)~FAB+txfab1, data=bmt.long)
m2<-coxph(Surv(Tstart, Tstop, Either)~FAB+txfab2, data=bmt.long)
m3<-coxph(Surv(Tstart, Tstop, Either)~FAB+txfab3, data=bmt.long)
m4<-coxph(Surv(Tstart, Tstop, Either)~FAB+txfab4, data=bmt.long)

Results: FAB

> m1
             coef exp(coef) se(coef)      z     p
FAB        0.0253      1.03    0.956 0.0264  0.98
txfab1     0.1202      1.13    0.182 0.6605  0.51

> m2
             coef exp(coef) se(coef)      z     p
FAB      0.541241      1.72 0.299949  1.804 0.071
txfab2   0.000341      1.00 0.000706  0.483 0.630

> m3
             coef exp(coef) se(coef)      z     p
FAB         0.782     2.185    0.408  1.914 0.056
txfab3     -0.206     0.814    0.488 -0.421 0.670

> m4
             coef exp(coef) se(coef)      z     p
FAB      0.564772      1.76 0.287651  1.963  0.05
txfab4   0.000274      1.00 0.000681  0.403  0.69

BMT Regression Check: MTX

# create time-dependent covariates for MTX
bmt.long$txmtx1<-log(bmt.long$Tstop)*(bmt.long$MTX)
bmt.long$txmtx2<-bmt.long$Tstop*(bmt.long$MTX)
bmt.long$txmtx3<-ifelse(bmt.long$Tstop>400, (bmt.long$MTX), 0)
bmt.long$txmtx4<-ifelse(bmt.long$Tstop>400, bmt.long$Tstop*(bmt.long$MTX), 0)

m1<-coxph(Surv(Tstart, Tstop, Either)~MTX+txmtx1, data=bmt.long)
m2<-coxph(Surv(Tstart, Tstop, Either)~MTX+txmtx2, data=bmt.long)
m3<-coxph(Surv(Tstart, Tstop, Either)~MTX+txmtx3, data=bmt.long)
m4<-coxph(Surv(Tstart, Tstop, Either)~MTX+txmtx4, data=bmt.long)

Results: MTX

> m1
            coef exp(coef) se(coef)     z      p
MTX        2.682    14.614    1.124  2.39 0.0170
txmtx1    -0.459     0.632    0.222 -2.07 0.0380

> m2
            coef exp(coef) se(coef)     z      p
MTX      1.22592     3.407  0.38088  3.22 0.0013
txmtx2  -0.00377     0.996  0.00154 -2.45 0.0140

> m3
            coef exp(coef) se(coef)     z      p
MTX         0.71     2.033    0.263  2.70 0.0069
txmtx3     -1.73     0.178    0.789 -2.19 0.0290

> m4
            coef exp(coef) se(coef)     z      p
MTX      0.70796     2.030  0.26188  2.70 0.0069
txmtx4  -0.00303     0.997  0.00143 -2.12 0.0340

Checking Model with >1 Covariate?

### Model where MTX is not time-varying
> m5a<-coxph(Surv(Tstart, Tstop, Either)~factor(Disease) + FAB + PtAge + DonAge + PtAge*DonAge+MTX, data=bmt.long)
> m5a
Call:
coxph(formula = Surv(Tstart, Tstop, Either) ~ factor(Disease) + FAB + PtAge + DonAge + PtAge * DonAge + MTX, data = bmt.long)

                      coef exp(coef) se(coef)      z       p
factor(Disease)2  -1.00606     0.366 0.362370 -2.776 0.00550
factor(Disease)3  -0.35406     0.702 0.370966 -0.954 0.34000
FAB                0.84303     2.323 0.279461  3.017 0.00260
PtAge             -0.08522     0.918 0.035708 -2.387 0.01700
DonAge            -0.08390     0.920 0.030318 -2.767 0.00570
MTX                0.30342     1.354 0.252929  1.200 0.23000
PtAge:DonAge       0.00315     1.003 0.000943  3.337 0.00085

Likelihood ratio test=34.2 on 7 df, p=1.58e-05
n= 8665, number of events= 83

Checking Model with >1 Covariate?

### Model where MTX is time-varying
> m5<-coxph(Surv(Tstart, Tstop, Either)~factor(Disease) + FAB + PtAge + DonAge + PtAge*DonAge+MTX+txmtx1, data=bmt.long)
> m5
Call:
coxph(formula = Surv(Tstart, Tstop, Either) ~ factor(Disease) + FAB + PtAge + DonAge + PtAge * DonAge + MTX + txmtx1, data = bmt.long)

                      coef exp(coef) se(coef)      z      p
factor(Disease)2  -1.01485     0.362 0.362198 -2.802 0.0051
factor(Disease)3  -0.32998     0.719 0.368393 -0.896 0.3700
FAB                0.88170     2.415 0.278466  3.166 0.0015
PtAge             -0.08201     0.921 0.035909 -2.284 0.0220
DonAge            -0.08490     0.919 0.030965 -2.742 0.0061
MTX                2.69860    14.859 1.152782  2.341 0.0190
txmtx1            -0.47772     0.620 0.225152 -2.122 0.0340
PtAge:DonAge       0.00305     1.003 0.000955  3.194 0.0014

Alternative Form of Time-Varying Covariate

• So far we’ve guessed at g(t)
• Problem is we don’t necessarily know the correct functional form
• Consider a binary covariate, Z1
• Assume for covariate Z1, the relative risk changes over time
• What if we use the data instead?
– Get “best estimate” from the data

Change Point Model

• Let
$$Z_2(t) = \begin{cases} Z_1 & \text{if } t > \tau \\ 0 & \text{if } t \le \tau \end{cases}$$

• This gives us a proportional hazards model with a change point at τ (Liang et al., 1990)

• Fit proportional hazards model with Z1 and Z2(t)

• We now have a PH model with HR:
– Model: h(t|Z(t)) = h0(t)exp{b1Z1 + b2Z2(t)}
– h(t|Z(t)) = h0(t)exp{b1Z1} if t ≤ τ
– h(t|Z(t)) = h0(t)exp{(b1 + b2)Z1} if t > τ

• So we are fitting a PH model that includes a change point, which allows the HR to change after a specified time

How to Determine the Change Point τ

• A change point for the relative risk was introduced. Where is the best change point?

• Recall the partial likelihood only changes at event times

• Calculate the log partial likelihood with τ set to each event time in turn

• Choose the τ that yields the largest log-likelihood

Kidney Example

# Change point model
> cps<-sort(unique(kidney$time[which(kidney$delta==1)]))
> LL<-c()
> for (i in 1:length(cps))
+ {
+   z2<-ifelse(kid.long$Tstop>cps[i], kid.long$type, 0)
+   mod<-coxph(Surv(Tstart, Tstop, delta)~type+z2, data=kid.long,
+              method="breslow")
+   LL<-append(LL, mod$loglik[2])
+ }

> round(LL, digits=3)
 [1]  -97.878 -100.224  -97.630  -97.501  -99.683 -100.493  -98.856 -100.428
 [9] -101.084 -101.668 -102.168 -100.829 -101.477 -102.059 -102.620 -103.229

Change Point Results

Event Times   Log Partial Likelihood
   0.5             -97.878
   1.5            -100.224
   2.5             -97.630
   3.5             -97.501
   4.5             -99.683
   5.5            -100.493
   6.5             -98.856
   8.5            -100.428
   9.5            -101.084
  10.5            -101.668
  11.5            -102.168
  15.5            -100.829
  16.5            -101.477
  18.5            -102.059
  23.5            -102.620

What About Multiple Comparisons?

• Is it “fishing” to try many cutpoints?
• No, we are conducting diagnostics so we don’t worry so much
• We aren’t sure of the form of the time-dependence, so we are being flexible to identify if we are missing something

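Hazards Not Proportional

• Proportional hazards assumption doesn’t hold… what can we do?
• Single binary covariate… consider a piecewise regression
– Change-point identified by data (alternate coding)
$$Z_2(t) = \begin{cases} Z_1 & \text{if } t > \tau \\ 0 & \text{if } t \le \tau \end{cases}
\qquad
Z_3(t) = \begin{cases} Z_1 & \text{if } t \le \tau \\ 0 & \text{if } t > \tau \end{cases}$$
• Many covariates, consider stratified model on non-proportional covariate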

Kidney: 1-Covariate with Change-Point

> kid.long$z2<-ifelse(kid.long$Tstop>3.5, kid.long$cath, 0)
> kid.long$z3<-ifelse(kid.long$Tstop<=3.5, kid.long$cath, 0)
> mod<-coxph(Surv(Tstart, Tstop, d)~z2+z3, data=kid.long, method="breslow")
> mod
Call:
coxph(formula = Surv(Tstart, Tstop, d) ~ z2 + z3, data = kid.long, method = "breslow")

     coef exp(coef) se(coef)     z     p
z2  -2.09     0.124    0.760 -2.75 0.006
z3   1.08     2.950    0.783  1.38 0.170

Likelihood ratio test=13.9 on 2 df, p=0.000956
n= 1132, number of events= 26

Interpretation?

• Up to 3.5 months, there is not a significant difference in risk of infection between the two groups.

• However, after 3.5 months the hazard of infection for patients with percutaneously placed catheters is 0.12 times that of patients with surgically placed catheters.

• Recall our hazard rate plots
– Hazards crossed at about 3.5 months
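The interpretation above can be made more concrete with approximate 95% Wald confidence intervals computed from the reported coefficients and standard errors. This is a worked calculation added for illustration, using the usual normal approximation; it is not output from the lecture's code.

$$\text{After 3.5 months } (z2):\quad \exp\{-2.09 \pm 1.96 \times 0.760\} = \left(e^{-3.58},\, e^{-0.60}\right) \approx (0.03,\ 0.55)$$

$$\text{Up to 3.5 months } (z3):\quad \exp\{1.08 \pm 1.96 \times 0.783\} = \left(e^{-0.45},\, e^{2.61}\right) \approx (0.63,\ 13.7)$$

The first interval excludes 1 and the second does not, which matches the p-values of 0.006 and 0.170 in the table.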

A Few Points

• A single cutpoint may not be enough
– There are “two” models
– Within each piece, we are still assuming proportional hazards
• Check proportional hazards models within each of the time intervals
• Can generate additional time-varying covariates within each interval

Stratified Cox Regression

• Recall stratification
• Estimates ‘pooled’ association across strata
• Stratification in regression
– Estimates pooled regression coefficient
– Strong assumption that associations between covariate and outcome are the same across strata

Estimation: Partial Likelihood Approach

• Partition dataset based on strata
• Define log-likelihood per stratum, where the hazard in stratum j is
$$h_j(t \mid Z(t)) = h_{0j}(t)\exp\{b^{T} Z(t)\}, \quad j = 1, 2, \ldots, J$$
• Log-likelihood based on J strata
$$LL(b) = LL_1(b) + LL_2(b) + \cdots + LL_J(b)$$
• Maximize LL(b) w.r.t. b
• Notice b is common across all the stratum-specific partial log-likelihoods

BMT: Stratified Model

• Steps:
1) Check proportional hazards assumption
2) Fit stratified Cox model
3) Check model assumptions (i.e. constant b’s… more on this in a moment)

• We’ve already seen that the proportional hazards assumption for Methotrexate use is incorrect.

BMT Data

• Associations between covariates and DFS, stratified by MTX use

> reg2a<-coxph(Surv(Tstart, Tstop, Either)~factor(Disease)+FAB+DonAge+PtAge+DonAge*PtAge+PRt+strata(MTX), data = bmt.long2)
> reg2a
Call:
coxph(formula=Surv(Tstart, Tstop,Either)~factor(Disease)+FAB+DonAge+
    PtAge+DonAge*PtAge+PRt+strata(MTX), data = bmt.long2)

                      coef exp(coef) se(coef)      z      p
factor(Disease)2  -1.00911     0.365 0.364333 -2.770 0.0056
factor(Disease)3  -0.34552     0.708 0.370254 -0.933 0.3500
FAB                0.89008     2.435 0.280684  3.171 0.0015
DonAge            -0.08415     0.919 0.031123 -2.704 0.0069
PtAge             -0.08175     0.921 0.036255 -2.255 0.0240
PRt                2.08909     8.078 1.095274  1.907 0.0560
DonAge:PtAge       0.00301     1.003 0.000957  3.143 0.0017

Is Assumption of Constant b Reasonable?

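• Testing assumption
• Divide dataset into J strata
• Fit model with p covariates in each stratum
• Define
$$\sum_{j=1}^{J} LL_j(\hat{b}_j) = LL_1(\hat{b}_1) + LL_2(\hat{b}_2) + \cdots + LL_J(\hat{b}_J)$$
• Define $\sum_{j=1}^{J} LL_j(\hat{b})$ based on stratified model
• Test significance via LRT
$$X^2_{LRT} = 2\left[\sum_{j=1}^{J} LL_j(\hat{b}_j) - \sum_{j=1}^{J} LL_j(\hat{b})\right] \sim \chi^2_{(J-1)p}$$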

Checking Constant b Assumption

# Testing stratification assumption
# stratified model: common coefficients, separate baseline hazard per MTX group
reg2a<-coxph(Surv(Tstart, Tstop, Either) ~ factor(Disease) + FAB + DonAge + PtAge + DonAge*PtAge +PRt + strata(MTX), data=bmt.long2)

# separate model within each MTX group
dat1<-bmt.long2[bmt.long2$MTX==0,]
reg2b<-coxph(Surv(Tstart, Tstop, Either) ~ factor(Disease) + FAB + DonAge + PtAge + DonAge*PtAge +PRt + strata(MTX), data=dat1)

dat2<-bmt.long2[bmt.long2$MTX==1,]
reg2c<-coxph(Surv(Tstart, Tstop, Either) ~ factor(Disease) + FAB + DonAge + PtAge + DonAge*PtAge +PRt + strata(MTX), data=dat2)

# likelihood ratio test: (J-1)*p = (2-1)*7 = 7 degrees of freedom
LL.strat<-reg2a$loglik[2]
LL.piece<-reg2b$loglik[2]+reg2c$loglik[2]
lrt<-2*(LL.piece-LL.strat)
p.lrt<-1-pchisq(lrt, 7)

> reg2b
                      coef exp(coef) se(coef)       z      p
factor(Disease)2  -1.19655     0.302   0.4585  -2.610 0.0091
factor(Disease)3  -0.29025     0.748   0.4451  -0.652 0.5100
FAB                1.08896     2.971   0.3384   3.218 0.0013
DonAge            -0.08378     0.920   0.0371  -2.258 0.0240
PtAge             -0.03630     0.964   0.0536  -0.677 0.5000
PRt               -0.86747     0.420   0.4793  -1.810 0.0700
DonAge:PtAge       0.00227     1.002   0.0014   1.618 0.1100

> reg2c
                      coef exp(coef) se(coef)       z      p
factor(Disease)2  -0.56372     0.569  0.63847 -0.8829  0.380
factor(Disease)3  -0.85828     0.424  0.91761 -0.9353  0.350
FAB                0.34408     1.411  0.65122  0.5284  0.600
DonAge            -0.00452     0.995  0.08152 -0.0555  0.960
PtAge             -0.02724     0.973  0.08073 -0.3374  0.740
PRt               -1.00688     0.365  0.55118 -1.8268  0.068
DonAge:PtAge       0.00138     1.001  0.00227  0.6098  0.540

Results

> LL.piece<-reg2b$loglik[2]+reg2c$loglik[2]
> LL.strat<-reg2a$loglik[2]

> lrt<-2*(LL.piece-LL.strat)
> lrt
[1] 6.118211

> p.lrt<-1-pchisq(lrt, 7)
> p.lrt
[1] 0.5260164
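The same likelihood ratio test can be sketched for any number of strata by computing the degrees of freedom as (J − 1)p rather than hard-coding them. The example below uses simulated data with hypothetical variable names (`s` for the stratum, `x1` and `x2` for covariates); it is an illustration under those assumptions, not the BMT analysis.

```r
library(survival)

# Simulated data: two strata with different baseline hazards but common coefficients
set.seed(3)
n     <- 400
s     <- rbinom(n, 1, 0.5)                       # stratum indicator
x1    <- rnorm(n)
x2    <- rbinom(n, 1, 0.4)
rate  <- ifelse(s == 1, 0.2, 0.1) * exp(0.5 * x1 - 0.3 * x2)
t_ev  <- rexp(n, rate = rate)
t_cen <- rexp(n, rate = 0.05)
dat <- data.frame(time  = pmin(t_ev, t_cen),
                  delta = as.numeric(t_ev <= t_cen),
                  s = s, x1 = x1, x2 = x2)

# Stratified model: one coefficient vector shared across strata
fit.strat <- coxph(Surv(time, delta) ~ x1 + x2 + strata(s), data = dat)

# Separate model in each stratum: coefficients free to differ
fits <- lapply(split(dat, dat$s),
               function(d) coxph(Surv(time, delta) ~ x1 + x2, data = d))

LL.strat <- fit.strat$loglik[2]
LL.piece <- sum(sapply(fits, function(f) f$loglik[2]))

J  <- length(fits)               # number of strata
p  <- length(coef(fit.strat))    # number of coefficients per model
df <- (J - 1) * p

lrt   <- 2 * (LL.piece - LL.strat)
p.lrt <- pchisq(lrt, df = df, lower.tail = FALSE)
print(c(lrt = lrt, df = df, p = p.lrt))
```

A large p-value gives no evidence that the coefficients differ across strata, which supports the common-b assumption of the stratified model.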

Next Time

• Regression diagnostics using residuals!
