
Survival Analysis and Unsupervised Learning

Nathan Taback

06/02/2020

Survival Analysis

- Outcome variable: time until an event occurs.
- Start of follow-up → Time → Event.
- Event examples: death; disease; recovery; purchase.
- Time is often called survival time, since it's the time that an individual has survived over some follow-up period.

Censored Data

- Censoring occurs when we have some information about individual survival time, but it's not exactly known.

- Consider a person followed in a medical study until they die. If the study ends while the patient is still alive, then the patient's survival time is censored, since the person will die after the study ends.

- Common reasons why censoring may occur:
  - study ends with no event
  - lost to follow-up
  - withdrawal from the study

Survival Data

The data below are from pancreatic cancer patients in TCGA.

days_to_last_follow_up   days_to_death   vital_status
–                        12              Dead
706                      –               Alive
–                        239             Dead
1794                     –               Alive
–                        153             Dead
33                       –               Alive

How can observed survival time be defined?

Survival Data

days_to_last_follow_up   days_to_death   vital_status
–                        12              Dead
706                      –               Alive
–                        239             Dead
1794                     –               Alive
–                        153             Dead
33                       –               Alive

- If vital_status = Dead and days_to_death is not missing, then T = days_to_death.

- If vital_status = Alive and days_to_last_follow_up is not missing, then T = days_to_last_follow_up.

- If a person is still alive during the study period, then they are censored; if the person died, then the event occurred.

Survival Data - Define Survival Time

Define survival time:

clinical$os_days <- ifelse(clinical$vital_status == "Alive",
                           clinical$days_to_last_follow_up,
                           ifelse(clinical$vital_status == "Dead",
                                  clinical$days_to_death, NA))

days_to_last_follow_up   days_to_death   vital_status   os_days
–                        12              Dead           12
706                      –               Alive          706
–                        239             Dead           239
1794                     –               Alive          1794
–                        153             Dead           153
33                       –               Alive          33

Survival Data - Define Censoring

Define the event indicator for subject i:

δ_i = 1 if death, and δ_i = 0 if censored.

clinical$dead <- ifelse(clinical$vital_status == "Alive", 0,
                        ifelse(clinical$vital_status == "Dead", 1, NA))

days_to_last_follow_up   days_to_death   vital_status   os_days   dead
–                        12              Dead           12        1
706                      –               Alive          706       0
–                        239             Dead           239       1
1794                     –               Alive          1794      0
–                        153             Dead           153       1
33                       –               Alive          33        0

Survivor Function

- Let T ≥ 0 be a random variable for a person's survival time.
- S(t) = P(T > t) = 1 − F(t) is the survivor function.
- For example, if T ∼ exp(λ) then S(t) = e^(−λt); the figure below shows S(t) for λ = 1/2.

[Figure: Exponential Survival Function with rate = 0.5; S(t) plotted for t from 0 to 5, falling from 1.00 toward 0.]
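This curve can be reproduced in base R (a sketch, not the original plotting code):

curve(exp(-0.5 * t), from = 0, to = 5, xname = "t",
      ylab = "S(t)", main = "Exponential Survival Function with rate = 0.5")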

- Observed survival function and exponential survival with λ = 1/2.

[Figure: Observed Survival Function; observed S(t) plotted for t from 0 to 4, ranging from 1.0 down to 0.2.]

Hazard Function

The hazard function is defined as:

h(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt.

- h(t)Δt ≈ P(t ≤ T < t + Δt | T ≥ t), the probability of death in (t, t + Δt) given survival up to time t.

- For example, if time is measured in days, then h(t) is the approximate probability that an individual who is alive on day t dies in the following day (worked out below for the exponential case).
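Continuing the exponential example (a worked step not on the original slide): since h(t) = f(t)/S(t) for a continuous survival time, if T ∼ exp(λ) then

h(t) = λe^(−λt) / e^(−λt) = λ,

so the exponential distribution has a constant hazard.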

Non-parametric Estimation of the Survival Function

S(t) = (number of individuals with survival times ≥ t) / (number of individuals in the data set).

- The Kaplan-Meier estimate is an example of a procedure for estimating S(t); a naive version of the estimator above is sketched in the code below.
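A minimal sketch (not from the slides) of the naive estimator above, ignoring censoring and using the os_days values from the earlier table:

times <- c(12, 706, 239, 1794, 153, 33)   # os_days from the table above
S.hat <- function(t) mean(times >= t)     # proportion with survival time >= t
S.hat(200)                                # estimated S(200) = 3/6 = 0.5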

Kaplan-Meier

library(survival)
km <- survfit(Surv(time, status) ~ 1, data = leukemia)
summary(km)

## Call: survfit(formula = Surv(time, status) ~ 1, data = leukemia)
##
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     5     23       2   0.9130  0.0588       0.8049        1.000
##     8     21       2   0.8261  0.0790       0.6848        0.996
##     9     19       1   0.7826  0.0860       0.6310        0.971
##    12     18       1   0.7391  0.0916       0.5798        0.942
##    13     17       1   0.6957  0.0959       0.5309        0.912
##    18     14       1   0.6460  0.1011       0.4753        0.878
##    23     13       2   0.5466  0.1073       0.3721        0.803
##    27     11       1   0.4969  0.1084       0.3240        0.762
##    30      9       1   0.4417  0.1095       0.2717        0.718
##    31      8       1   0.3865  0.1089       0.2225        0.671
##    33      7       1   0.3313  0.1064       0.1765        0.622
##    34      6       1   0.2761  0.1020       0.1338        0.569
##    43      5       1   0.2208  0.0954       0.0947        0.515
##    45      4       1   0.1656  0.0860       0.0598        0.458
##    48      2       1   0.0828  0.0727       0.0148        0.462

Kaplan-Meier

library(survminer)
ggsurvplot(km)

[Figure: Kaplan-Meier curve of survival probability (0.00 to 1.00) versus time (0 to 160), single stratum (All), with + marks indicating censored observations.]

Proportional Hazards Model

- Suppose we have two groups of patients: A and B.
- Assume the hazard at time t for a patient in group B is proportional to the hazard at time t for a patient in group A. That is,

  h_B(t) = ψ h_A(t),

  where t ≥ 0 and ψ is a constant.
- It's convenient to set ψ = exp(β), since the ratio is always positive. Let x_i = 1 if subject i is in B and x_i = 0 if subject i is in A; then

  h_i(t) = exp(β x_i) h_0(t).

  This is the proportional hazards model for comparing two groups, with ψ = exp(β).

General Proportional Hazards Model

- Suppose that we would like to model the hazard of death at a particular time as a linear function of p explanatory variables: x_1, ..., x_p.

- In this case,

  ψ_i = β_1 x_1i + ⋯ + β_p x_pi.

  So,

  h_i(t) = exp(β_1 x_1i + ⋯ + β_p x_pi) h_0(t), or,

  log{ h_i(t) / h_0(t) } = β_1 x_1i + ⋯ + β_p x_pi.

- h_0(t) is called the baseline hazard: the hazard with no x's in the model.
- This is a linear model for the logarithm of the hazard ratio.

Fitting the Proportional Hazards Model

- Cox (1972) derived the appropriate (partial) likelihood for this model.
- The MLE of the β parameters can be found by maximizing the log-likelihood function using numerical methods such as Newton-Raphson.

Fitting the Proportional Hazards Model

cox.mod <- coxph(Surv(time, status) ~ x, data = leukemia)
summary(cox.mod)

## Call:
## coxph(formula = Surv(time, status) ~ x, data = leukemia)
##
##   n= 23, number of events= 18
##
##                  coef exp(coef) se(coef)     z Pr(>|z|)
## xNonmaintained 0.9155    2.4981   0.5119 1.788   0.0737 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##                exp(coef) exp(-coef) lower .95 upper .95
## xNonmaintained     2.498     0.4003    0.9159     6.813
##
## Concordance= 0.619  (se = 0.063 )
## Likelihood ratio test= 3.38  on 1 df,   p=0.07
## Wald test            = 3.2   on 1 df,   p=0.07
## Score (logrank) test = 3.42  on 1 df,   p=0.06

- The estimated hazard ratio of patients not given maintenance chemotherapy vs. patients given maintenance chemotherapy is exp(0.9155) = 2.4981:

  h_non-maintenance(t) = exp(0.9155) h_maintenance(t)

Cox Model Diagnostics

- Graphical approaches, such as log-log plots.
- Goodness-of-fit tests, such as a Schoenfeld residual plot (see the sketch below).
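One common check is based on scaled Schoenfeld residuals. A minimal sketch, assuming the cox.mod fit from the previous slide:

ph.test <- cox.zph(cox.mod)  # tests the PH assumption using scaled Schoenfeld residuals
ph.test                      # per-covariate and global tests
plot(ph.test)                # residuals vs. time; a flat smooth supports proportional hazards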

Book Reference

UofT library link to Applied Survival Analysis Using R

Unsupervised Learning

- Principal Component Analysis
- k-means
- Hierarchical Clustering

Principal Components Analysis (PCA)

- Suppose we have n measurements on p features/covariates, X_1, ..., X_p.
- PCA finds a low-dimensional representation of a data set that contains as much as possible of the variation.
- The first principal component is the linear combination of the features

  Z_1 = φ_11 X_1 + φ_21 X_2 + ⋯ + φ_p1 X_p

  that has the largest variance, subject to ∑_{j=1}^p φ_j1² = 1.

- φ_11, φ_21, ..., φ_p1 are called the loadings of the first principal component.

Computing the first PC

- To compute the first principal component, centre each of X_1, ..., X_p (i.e., so that they have mean 0).

- Then find the linear combination

  z_i1 = φ_11 x_i1 + φ_21 x_i2 + ⋯ + φ_p1 x_ip

  that has the largest variance, subject to ∑_{j=1}^p φ_j1² = 1 (see the sketch below).
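This maximization has a closed form: the loadings are the leading eigenvector of the sample covariance matrix. A minimal sketch (not from the slides), using the built-in iris data for illustration:

X <- scale(iris[, 1:4], center = TRUE, scale = FALSE)  # centre the features
ev <- eigen(cov(X))          # eigendecomposition of the covariance matrix
phi1 <- ev$vectors[, 1]      # loadings of PC1; sum(phi1^2) equals 1
z1 <- X %*% phi1             # scores z_i1 for each observation
all.equal(abs(phi1), unname(abs(prcomp(X)$rotation[, 1])))  # matches prcomp up to sign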

Computing the second PC

The second principal component is the linear combination of X_1, ..., X_p that has maximal variance out of all linear combinations that are uncorrelated with Z_1.

Computing PC

Plot of the score vectors for PC1 and PC2:

library(ggfortify)
library(dplyr)
library(tibble)
# Random sample of patients and genes from pancreas
pca.out <- prcomp(pancreas_gene_sample[, 9:38], center = T)
as.data.frame(pca.out$rotation[, c(1, 2)]) %>%
  rownames_to_column() %>%
  arrange(desc(PC1)) %>%
  head(n = 3)

##              rowname        PC1         PC2
## 1 ENSG00000152969.15 0.15331976 -0.09474669
## 2  ENSG00000183580.9 0.08169159 -0.59653953
## 3 ENSG00000169696.14 0.04970751  0.35904737

autoplot(pca.out, data = pancreas_gene_sample, colour = "dead")

[Figure: PCA score plot, PC1 (72.47%) versus PC2 (11.15%), with points coloured by dead (0 to 1).]

More about principal components

- How many principal components? Scree plot.
- Proportion of variance explained (a sketch of both follows below).
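A minimal sketch, assuming the pca.out object from the earlier pancreas slide:

pve <- pca.out$sdev^2 / sum(pca.out$sdev^2)  # proportion of variance explained by each PC
plot(pve, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")  # scree plot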

PCA using sklearn

- In Python, use the sklearn.decomposition module.

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn.decomposition import PCA
import pandas as pd

data = load_iris()  # load iris data
df = pd.DataFrame(data = data.data, columns = data.feature_names)
x = StandardScaler().fit_transform(df)  # standardize data

pca = PCA(n_components=2)  # 2 components
pcs = pca.fit_transform(x)  # compute pc scores

principalDf = pd.DataFrame(data = pcs,
                           columns = ['principal component 1',
                                      'principal component 2'])

principalDf.head(n=2)

##    principal component 1  principal component 2
## 0              -2.264703               0.480027
## 1              -2.080961              -0.674134

k-means

- Try to define k disjoint subsets such that each observation belongs to exactly one subset and the within-cluster variation is as small as possible. The algorithm is given below, followed by a toy illustration.

- Typically, Euclidean distance is used to define within-cluster variation.

1. Randomly assign a number, from 1 to k, to each of the observations. These serve as initial cluster assignments for the observations.

2. Iterate until the cluster assignments stop changing:

   (a) For each of the k clusters, compute the cluster centroid. The kth cluster centroid is the vector of the p feature means for the observations in the kth cluster.

   (b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
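A toy illustration of the algorithm (not from the slides), on simulated data with two well-separated groups; nstart = 25 reruns the random initialization in step 1 twenty-five times and keeps the best solution:

set.seed(1)
x <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
           matrix(rnorm(50, mean = 3), ncol = 2))  # two groups of 25 points
km2 <- kmeans(x, centers = 2, nstart = 25)
table(km2$cluster)   # cluster sizes
km2$tot.withinss     # total within-cluster sum of squares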

k-means

set.seed(416)
k <- kmeans(x = alldat[, c(9:60491)], centers = 2, nstart = 25)
alldat$cluster <- k$cluster
fit <- survfit(Surv(os_days/365.25, dead) ~ as.factor(cluster),
               data = alldat)
ggsurvplot(fit, risk.table = T)

k-means

table(alldat$cluster, alldat$dead)

##      0  1
##   1  9  0
##   2 77 92

Hierarchical Clustering

set.seed(6)
X1 <- rnorm(9, 0, 2)
X2 <- rnorm(9, 0, 1) + runif(9, 0, 10)
df <- data.frame(X1, X2)
df$rownum <- 1:nrow(df)
plot(X1, X2, type = "n"); text(X1, X2, labels = rownames(df))

[Figure: Scatterplot of the nine simulated points, plotted as the labels 1 to 9 at their (X1, X2) coordinates.]

hc <- hclust(dist(as.matrix(df)))
plot(hc)

[Figure: Cluster Dendrogram of the nine points; hclust (*, "complete") on dist(as.matrix(df)), y-axis Height.]

Hierarchical Clustering

[Figure: The labelled scatterplot and the cluster dendrogram from the previous slides, shown side by side.]

hc$merge

##      [,1] [,2]
## [1,]   -5   -6
## [2,]   -3   -4
## [3,]   -1   -2
## [4,]   -8   -9
## [5,]    1    2
## [6,]    4    5
## [7,]   -7    3
## [8,]    6    7

Negative entries refer to single observations; positive entries refer to clusters formed at earlier merge steps (e.g., step 5 merges the clusters formed at steps 1 and 2).

Hierarchical Clustering

plot(hc)

[Figure: The same Cluster Dendrogram again, for reading off the cuts below; hclust (*, "complete"), y-axis Height.]

cutree(hc, h = 4) # cut tree at height 4

## [1] 1 1 2 2 2 2 3 4 4

cutree(hc, h = 7) # cut tree at height 7

## [1] 1 1 2 2 2 2 3 2 2
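Cutting the tree lower yields more, smaller clusters; cutting higher yields fewer, larger ones. cutree can also cut by the desired number of clusters rather than by height (a usage note, not from the slides):

cutree(hc, k = 2)  # request exactly two clusters instead of cutting at a height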