7/30/2019 MELJUN CORTES IBM SPSS Survival Analysis
Survival Analysis
These are my personal notes only and are therefore not complete.
Most of the content was taken from the IBM SPSS Modeler training
manual. Please refer to the training manual for a complete discussion.
Survival Analysis
Survival analysis studies the length of time
until an event of interest occurs.
Originally, the technique was used in
medical research to study the amount of
time patients survive following the onset of a disease.
It has been used to model:
The length of time a person subscribes to a newspaper or magazine
The time employees spend with a company
The time to failure of an electric or mechanical component
The time to promotion of an employee within an organization
The time to complete a complex transaction such as a loan application
The time it takes to complete a requirement (e.g., a PhD at a university)
Survival analysis can be done with or
without predictors.
Kaplan-Meier: for analysis without predictors
Cox Regression: for analysis with categorical and interval predictors
Censoring vs Missing Data
Missing data: an observation for which no valid value is recorded.
Censored data: the observation contains information, but the final value is unknown. Examples:
Medical studies: at the end of the study some patients are still alive or in remission, some move or refuse to participate, and some die of other causes.
Studying time to promotion: some employees leave the company before being promoted.
Censored data are included in the survival analysis.
Censored data cannot be handled properly by ordinary regression analysis, t-tests, or ANOVA.
Aside from the censoring issue, the residuals of a regression model are unlikely to be normally distributed, because time-to-event distributions are typically non-normal.
Concepts
Length of time is the main outcome in survival analysis.
The mean and median of the length of time can be used to summarize it.
Confidence intervals can also be used as a summary.
Means and Medians for Survival Time
                           Mean(a)                                       Median
group                      Estimate  Std. Error  95% CI Lower  95% CI Upper  Estimate  Std. Error  95% CI Lower  95% CI Upper
Control                      72.545      14.839        43.462       101.629    40.000      12.899        14.719        65.281
Treatment (Prednisolone)    125.264      13.402        98.996       151.532   146.000      28.786        89.580       202.420
Overall                      98.925      10.812        77.733       120.117    89.000      21.232        47.385       130.615
a. Estimation is limited to the largest survival time if it is censored.
Cumulative survival plot
Graphically, survival data can be summarized using the survival function over time.
The survival function shows the probability of surviving longer than the time displayed in the chart.
For example, the probability that a person in the treatment group survives beyond t=96 is .63.
The survival function can only stay level or decrease over time.
The treatment group has a higher survival rate than the control group. Is this finding significant?
Cumulative Hazard plot
Survival data can also be summarized using the cumulative hazard function over time.
The cumulative hazard plot shows the risk of the event accumulated up to each time point.
For example, the plotted value for the treatment group at t=100 is about .49. This is an accumulated hazard, not the probability of not surviving.
Generally, the cumulative hazard increases over time.
The treatment group has a lower cumulative hazard than the control group. Is this finding significant?
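The link between the two plots can be illustrated with a short sketch. It assumes the standard relationship H(t) = -ln S(t) between cumulative hazard and survival; whether the SPSS plot uses exactly this estimator is an assumption here. The value .626 is the Kaplan-Meier estimate for the treatment group at t=96 from the survival table of this example.

```python
import math

def cumulative_hazard(survival_prob):
    """Cumulative hazard implied by a survival probability: H(t) = -ln S(t)."""
    return -math.log(survival_prob)

def event_probability(cum_hazard):
    """Probability that the event has occurred by time t: 1 - exp(-H(t))."""
    return 1 - math.exp(-cum_hazard)

# Kaplan-Meier survival estimate for the treatment group at t=96.
s_96 = 0.626
h_96 = cumulative_hazard(s_96)

# A cumulative hazard of .49 is not a probability of .49.
p_event = event_probability(0.49)

print(round(h_96, 3))
print(round(p_event, 3))
```

Under this assumption the implied cumulative hazard is close to the value read from the chart, while the corresponding probability of not surviving is noticeably smaller than .49.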
Survival Procedures in SPSS
The Life Tables procedure is appropriate when the time-to-critical-event
measure is recorded in broad ranges (for example, in six-month periods
or whole years) so that there are many ties among the data values, or
if there is no interest in differentiating between small time differences.
The Kaplan-Meier procedure is appropriate when the time-to-critical-event
measure is precise enough that there are relatively few ties in the data.
Examples might be the number of months surviving, or the fractional
number of years a retail space is occupied by a tenant.
Cox Regression (also called a proportional hazards model) posits that
the hazard rate can be a function of both categorical and interval
scale predictor variables.
It assumes that the hazard functions for different groups are
proportional to each other over time. This assumption can be examined,
and a variant of Cox regression (Cox Regression with time-varying
covariates) can be applied when the assumption doesn't hold.
Kaplan-Meier Example: Survival time for patients with chronic active hepatitis
The group variable divides the patients into treatment (prednisolone therapy) and control groups.
The time variable records the time to the critical event (death) or the time when censoring occurred.
The status variable indicates whether the critical event occurred (1 = death) or the case was censored (2 = censored).
Data set: KM.sav
Null Hypothesis
There is no difference between the survival distributions
of the treatment and control groups.
Results
Means and Medians for Survival Time
                           Mean(a)                                       Median
group                      Estimate  Std. Error  95% CI Lower  95% CI Upper  Estimate  Std. Error  95% CI Lower  95% CI Upper
Control                      72.545      14.839        43.462       101.629    40.000      12.899        14.719        65.281
Treatment (Prednisolone)    125.264      13.402        98.996       151.532   146.000      28.786        89.580       202.420
Overall                      98.925      10.812        77.733       120.117    89.000      21.232        47.385       130.615
a. Estimation is limited to the largest survival time if it is censored.
Overall Comparisons
                                 Chi-Square  df  Sig.
Log Rank (Mantel-Cox)                 4.660   1  .031
Breslow (Generalized Wilcoxon)        6.543   1  .011
Tarone-Ware                           6.066   1  .014
Test of equality of survival distributions for the different levels of group.
Decision: Reject Ho
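As a rough check on the Log Rank row, the statistic can be recomputed by hand. The sketch below is a plain two-group log-rank test, not SPSS's own code; it uses the survival times and statuses listed in the two survival tables of this example, and treats a case censored at time t as still at risk at t (an assumption about tie handling).

```python
import math

# (time, event) pairs taken from the survival tables: event 1 = died, 0 = censored.
control = [(2, 1), (3, 1), (4, 1), (7, 1), (10, 1), (22, 1), (28, 1), (29, 1),
           (32, 1), (37, 1), (40, 1), (41, 1), (54, 1), (61, 1), (63, 1), (71, 1),
           (127, 0), (140, 0), (146, 0), (158, 0), (167, 0), (182, 0)]
treatment = [(2, 1), (6, 1), (12, 1), (54, 1), (56, 0), (68, 1), (89, 1), (96, 1),
             (96, 1), (125, 0), (128, 0), (131, 0), (140, 0), (141, 0), (143, 1),
             (145, 0), (146, 1), (148, 0), (162, 0), (168, 1), (173, 0), (181, 0)]

def log_rank(group_a, group_b):
    """Two-group log-rank chi-square (1 df) and its p-value."""
    death_times = sorted({t for t, e in group_a + group_b if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in death_times:
        n_a = sum(1 for time, _ in group_a if time >= t)   # at risk in group a
        n_b = sum(1 for time, _ in group_b if time >= t)   # at risk in group b
        d_a = sum(1 for time, e in group_a if time == t and e == 1)
        d_b = sum(1 for time, e in group_b if time == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group a.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

chi_square, p_value = log_rank(control, treatment)
print(round(chi_square, 3), round(p_value, 3))
```

The Breslow and Tarone-Ware rows use the same observed-minus-expected terms with different weights, which is why they give different chi-square values on the same data.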
Survival Table for the Control Group
The first event, which occurred at time 2 (2 months), has a cumulative survival value of .955, or 1 - (1/22). The estimated probability of surviving beyond two months in this group is 95.5%.
Survival Table
Columns: group, case number, Time, Status, Cumulative Proportion Surviving at the Time (Estimate, Std. Error), N of Cumulative Events, N of Remaining Cases
Control 1 2.000 Died .955 .044 1 21
2 3.000 Died .909 .061 2 20
3 4.000 Died .864 .073 3 19
4 7.000 Died .818 .082 4 18
5 10.000 Died .773 .089 5 17
6 22.000 Died .727 .095 6 16
7 28.000 Died .682 .099 7 15
8 29.000 Died .636 .103 8 14
9 32.000 Died .591 .105 9 13
10 37.000 Died .545 .106 10 12
11 40.000 Died .500 .107 11 11
12 41.000 Died .455 .106 12 10
13 54.000 Died .409 .105 13 9
14 61.000 Died .364 .103 14 8
15 63.000 Died .318 .099 15 7
16 71.000 Died .273 .095 16 6
17 127.000 Censored . . 16 5
18 140.000 Censored . . 16 4
19 146.000 Censored . . 16 3
20 158.000 Censored . . 16 2
21 167.000 Censored . . 16 1
22 182.000 Censored . . 16 0
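The Estimate column can be reproduced with the product-limit (Kaplan-Meier) formula: at each death time, the running survival is multiplied by 1 - (deaths / number at risk). The following is a minimal sketch using the control-group rows above; the function name and data layout are illustrative, not part of SPSS.

```python
# (time, event) pairs for the control group: event 1 = died, 0 = censored.
control = [(2, 1), (3, 1), (4, 1), (7, 1), (10, 1), (22, 1), (28, 1), (29, 1),
           (32, 1), (37, 1), (40, 1), (41, 1), (54, 1), (61, 1), (63, 1), (71, 1),
           (127, 0), (140, 0), (146, 0), (158, 0), (167, 0), (182, 0)]

def kaplan_meier(records):
    """Return [(time, cumulative survival)] at each distinct death time."""
    survival = 1.0
    curve = []
    for t in sorted({time for time, event in records if event == 1}):
        at_risk = sum(1 for time, _ in records if time >= t)
        deaths = sum(1 for time, event in records if time == t and event == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

control_curve = kaplan_meier(control)
for time, estimate in control_curve:
    print(time, round(estimate, 3))
```

Censored cases never lower the estimate themselves; they only reduce the number at risk for later death times, which is why the table shows no estimate on the censored rows.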
Survival Table for the Treatment Group
Survival Table
Columns: group, case number, Time, Status, Cumulative Proportion Surviving at the Time (Estimate, Std. Error), N of Cumulative Events, N of Remaining Cases
Treatment (Prednisolone) 1 2.000 Died .955 .044 1 21
2 6.000 Died .909 .061 2 20
3 12.000 Died .864 .073 3 19
4 54.000 Died .818 .082 4 18
5 56.000 Censored . . 4 17
6 68.000 Died .770 .090 5 16
7 89.000 Died .722 .097 6 15
8 96.000 Died . . 7 14
9 96.000 Died .626 .105 8 13
10 125.000 Censored . . 8 12
11 128.000 Censored . . 8 11
12 131.000 Censored . . 8 10
13 140.000 Censored . . 8 9
14 141.000 Censored . . 8 8
15 143.000 Died .547 .117 9 7
16 145.000 Censored . . 9 6
17 146.000 Died .456 .129 10 5
18 148.000 Censored . . 10 4
19 162.000 Censored . . 10 3
20 168.000 Died .304 .151 11 2
21 173.000 Censored . . 11 1
22 181.000 Censored . . 11 0
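The treatment-group table shows how censoring and ties enter the estimate, and where the reported median of 146 comes from. The sketch below assumes the median is taken as the first time at which the cumulative survival estimate falls to .5 or below; that convention is an assumption here, although it agrees with the medians reported in the results table.

```python
# (time, event) pairs for the treatment group: event 1 = died, 0 = censored.
treatment = [(2, 1), (6, 1), (12, 1), (54, 1), (56, 0), (68, 1), (89, 1), (96, 1),
             (96, 1), (125, 0), (128, 0), (131, 0), (140, 0), (141, 0), (143, 1),
             (145, 0), (146, 1), (148, 0), (162, 0), (168, 1), (173, 0), (181, 0)]

def kaplan_meier(records):
    """Return [(time, cumulative survival)] at each distinct death time."""
    survival = 1.0
    curve = []
    for t in sorted({time for time, event in records if event == 1}):
        at_risk = sum(1 for time, _ in records if time >= t)
        deaths = sum(1 for time, event in records if time == t and event == 1)
        survival *= 1 - deaths / at_risk   # tied deaths are handled in one step
        curve.append((t, survival))
    return curve

def median_survival(curve):
    """First time at which estimated survival is .5 or less (None if never)."""
    for t, s in curve:
        if s <= 0.5 + 1e-9:   # small tolerance for floating-point rounding
            return t
    return None

treatment_curve = kaplan_meier(treatment)
treatment_median = median_survival(treatment_curve)
print(treatment_median)
```

The two deaths at time 96 are processed together, which is why the table shows a single estimate (.626) on the second of the two tied rows.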
The Cox Regression Model
The Cox Regression Model posits that the
hazard rate can be a function of both
categorical and interval scale predictor variables.
The hazard function at time t as a function of predictors $X_1, X_2, \dots, X_p$:
$h(t \mid X_1, X_2, \dots, X_p) = h_0(t)\, e^{B_1 X_1 + B_2 X_2 + \dots + B_p X_p}$
$h_0(t)$ = the baseline hazard function, which changes over time and is independent of the predictors
$e^{B_1 X_1 + B_2 X_2 + \dots + B_p X_p}$ = the factor and covariate effects, which are independent of time and adjust the baseline hazard function
$e^{B_j}$ is the multiplicative change in the hazard function associated with a unit change in the predictor $X_j$, controlling for the other effects in the model.
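A small numerical sketch can make the role of the exponential term concrete. The baseline hazard value and the coefficients below are hypothetical numbers chosen for illustration only; the point is that the ratio of two hazards does not depend on the baseline.

```python
import math

def hazard(baseline_hazard, coefficients, predictors):
    """h(t | X) = h0(t) * exp(B1*X1 + ... + Bp*Xp) for one value of h0(t)."""
    linear_predictor = sum(b * x for b, x in zip(coefficients, predictors))
    return baseline_hazard * math.exp(linear_predictor)

# Hypothetical values for illustration only.
h0_t = 0.01                 # baseline hazard at some time t
betas = [0.5, -0.2]         # B1, B2

person_a = hazard(h0_t, betas, [1.0, 3.0])
person_b = hazard(h0_t, betas, [2.0, 3.0])   # one unit higher on X1 only

# The baseline cancels, leaving exp(B1) as the hazard ratio.
ratio = person_b / person_a
print(round(ratio, 4), round(math.exp(betas[0]), 4))
```

Changing `h0_t` to any other positive value leaves `ratio` unchanged, which is the sense in which the predictor effects are independent of time.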
Assumption:
The effects of the predictors are constant over time. If this assumption is not met, the Cox Regression Model will not provide the best fit to the data.
In that case, use the Cox model with time-dependent covariates.
Cox Regression Example 1: Survival time in a heroin addiction treatment program
Outcome measure (survtime): time (in days) spent in a program for heroin addicts
Terminating event (status): departure from (quitting) the program. Data were censored for participants still in the program when the study was completed.
Predictors:
clinic (there were two clinics whose programs differed)
prison (whether or not the addict had a prison record)
methadone dose (methdose, measured in mg/day)
Data set: addicts.sav
Results
Omnibus Tests of Model Coefficients(a)
-2 Log Likelihood: 1347.345
                              Chi-square  df  Sig.
Overall (score)                   56.193   3  .000
Change From Previous Step         64.351   3  .000
Change From Previous Block        64.351   3  .000
a. Beginning Block Number 1. Method = Enter
Null Hypothesis: The effects of all three predictor variables
are zero in the population.
Reject Ho if chisquare is significant (p
Variables in the Equation
B SE Wald df Sig. Exp(B)
clinic 1.009 .215 22.087 1 .000 2.743
prison -.315 .167 3.543 1 .060 .730
methdose -.035 .006 30.659 1 .000 .965
The B coefficient estimates give the change in the natural log of the hazard per
one-unit change in the predictor, which is difficult to interpret directly.
The Exp(B) column presents the estimated multiplicative change in risk (hazard)
associated with a one-unit change in a predictor, controlling for the other
predictors. When the predictor is categorical and indicator coding is used,
Exp(B) represents the change in hazard when changing from the reference
category to another category and is referred to as relative risk. Exp(B) is also
called the hazard ratio, since it represents the ratio of the hazards for two
individuals who differ by one unit in the predictor of interest.
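The columns of this table are related by simple formulas, sketched below. The sketch assumes the Wald statistic for a single coefficient is (B / SE) squared and uses a normal-approximation 95% interval for Exp(B). Because it starts from the rounded B and SE printed in the table, it reproduces the printed values only approximately.

```python
import math

def wald_and_interval(b, se, z=1.96):
    """Wald chi-square, Exp(B), and an approximate 95% interval for Exp(B)."""
    wald = (b / se) ** 2
    exp_b = math.exp(b)
    lower = math.exp(b - z * se)
    upper = math.exp(b + z * se)
    return wald, exp_b, lower, upper

# Rounded B and SE values from the Variables in the Equation table.
clinic = wald_and_interval(1.009, 0.215)
prison = wald_and_interval(-0.315, 0.167)

for name, row in (("clinic", clinic), ("prison", prison)):
    print(name, [round(value, 3) for value in row])
```

The interval for prison contains 1, which is consistent with its non-significant result (Sig. = .060), while the interval for clinic lies entirely above 1.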
For clinic, Exp(B)=2.743: Other things equal, the hazard in clinic 0 is 2.743
times the hazard in clinic 1. Thus patients in clinic 0 exhibit
greater risk and shorter survival times. Be careful of the reference category!
For methdose, Exp(B)=.965: A one-unit (one mg/day) increase in dosage multiplies the hazard by .965, a decrease of about 3.5%.
For prison, Exp(B)=.730: Not significant (p=.060). If it were significant, it would be
interpreted as: the group with no prison record is at less risk than the group
with a prison record.
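For an interval predictor such as methdose, it is often more informative to look at changes larger than one unit. A minimal sketch, using the rounded coefficient from the table (so results are approximate); the helper name is illustrative:

```python
import math

def percent_change_in_hazard(b, units=1.0):
    """Percent change in hazard for a change of `units` in the predictor."""
    return (math.exp(b * units) - 1) * 100

b_methdose = -0.035   # rounded B for methdose from the table

per_1_mg = percent_change_in_hazard(b_methdose, 1)
per_10_mg = percent_change_in_hazard(b_methdose, 10)
print(round(per_1_mg, 2), round(per_10_mg, 2))
```

Effects compound multiplicatively, so a 10 mg/day increase corresponds to Exp(B) raised to the 10th power rather than ten times the one-unit percentage.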
Cumulative Survival Plot
Cumulative Hazard Plot
Checking the Proportional Hazards Assumption
(The hazard functions of any two individuals or groups remain in constant proportion over time.)
1. Examine the survival or hazard plots (in Kaplan-Meier) with the categorical predictor as the factor.
2. Examine the survival or log-minus-log plot in Cox Regression with the categorical predictor specified as a strata variable.
3. Save partial residuals and plot them against time (see the Cox Regression case study for an example).
4. Fit a Cox Regression model with a time-varying covariate; examine its significance and contribution.
Note: we will use #2.
The survival plots for the two clinics diverge substantially over time, suggesting that the hazard ratio for the two groups is not constant over time.
Log Minus Log (LML) Plot
If the proportional hazards model holds (in our example, this means that over time the hazard functions of the clinics differ by a constant proportion), then the natural log of the negative of the natural log of the survival functions for the different groups will form parallel lines over time.
Here the lines are not parallel, indicating that the proportional hazards assumption does not hold for clinic.
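Why proportional hazards implies parallel LML curves can be checked numerically. Under proportional hazards with hazard ratio HR, the second group's survival function is the first group's raised to the power HR, so the two log-minus-log curves differ by the constant ln(HR). The survival values and hazard ratio below are hypothetical, chosen only to illustrate this.

```python
import math

def log_minus_log(survival_prob):
    """ln(-ln S(t)), the quantity shown in a log-minus-log plot."""
    return math.log(-math.log(survival_prob))

# Hypothetical survival probabilities for a reference group at five time points.
reference = [0.95, 0.85, 0.70, 0.50, 0.30]
hazard_ratio = 2.0

# Under proportional hazards: S_other(t) = S_reference(t) ** hazard_ratio.
other = [s ** hazard_ratio for s in reference]

gaps = [log_minus_log(o) - log_minus_log(r) for o, r in zip(other, reference)]
print([round(g, 4) for g in gaps], round(math.log(hazard_ratio), 4))
```

Every gap equals ln(2), so the two curves are parallel. When the gap changes over time, as in the clinic plot, the hazard ratio is not constant.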
Cox with Time-Dependent Covariate
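Create T_COV_, a time-dependent variable to be used in the extended Cox model.
T_COV_ = (T_>365)*clinic
The indicator (T_>365) is zero (0) when time is 365 days or less and one (1) when time is greater than 365 days, so T_COV_ is 0 during the first year and equals clinic afterwards.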
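The same expression can be written as a small function to see which values it produces. This is only an illustration of the arithmetic, not SPSS syntax, and it assumes clinic is coded 0/1 as in these notes.

```python
def t_cov(time_days, clinic, cutoff=365):
    """Time-dependent covariate: (time > cutoff) * clinic."""
    return int(time_days > cutoff) * clinic

# Evaluate the covariate for both clinics at a few time points.
for time_days in (100, 365, 366, 700):
    print(time_days, t_cov(time_days, clinic=0), t_cov(time_days, clinic=1))
```

The covariate is always 0 for clinic 0 and switches from 0 to 1 for clinic 1 once time passes the cutoff, which lets the model estimate a separate clinic effect after the first year.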
It can be created by SPSS as follows:
Include T_COV_ as covariate in the Model
Results
The clinic predictor (which now represents the clinic effect during the first 365 days) is no longer significant (p=.06). It should be retained in the model since it was used in defining the time by clinic interaction.
In the first year (time <= 365), the estimated clinic effect is a hazard ratio of 1.616. After the first year (time > 365), the hazard in clinic 0 is 6.123 times the hazard in clinic 1. The 6.123 value is obtained by multiplying the clinic effect (1.616) by the clinic by time interaction (3.789); thus after the first year, the clinic effect is estimated to increase by a factor of 3.789.
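The multiplication of the two Exp(B) values corresponds to adding the two B coefficients on the log scale. A short sketch of that arithmetic, using only the two values quoted above (the function name is illustrative):

```python
import math

clinic_effect = 1.616        # Exp(B) for clinic (first 365 days)
interaction_effect = 3.789   # Exp(B) for the clinic by time interaction (T_COV_)

def clinic_hazard_ratio(time_days, cutoff=365):
    """Estimated clinic hazard ratio at a given time under the extended model."""
    log_hr = math.log(clinic_effect)
    if time_days > cutoff:
        log_hr += math.log(interaction_effect)   # add B of T_COV_ after the cutoff
    return math.exp(log_hr)

first_year = clinic_hazard_ratio(200)
later = clinic_hazard_ratio(500)
print(round(first_year, 3), round(later, 3))
```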
Cox Regression Example 2
Consider a database with 5000 records from customers of a telecommunication firm.
The firm has collected a wide variety of information on its customers, including gender, age,
education, income, marital status, card tenure, annual fee for the primary credit card, number of years the primary credit card has been held, and churn (switched providers within the last month).
We are interested in studying the length of time customers retain their primary credit card.
In other words, we will model the time for these customers to churn (not renew) their primary credit card.
Churn rates are initially around 50% in the early
years.
Plot labels: "People who have recently obtained card" and "People who have had their card for a long time"
Cox Regression Results
               B     SE     Wald  df  Sig.  Exp(B)
gender      .048   .056     .739   1  .390   1.050
age        -.080   .003  862.033   1  .000    .923
education   .120   .009  169.987   1  .000   1.128
income     -.003   .001   18.387   1  .000    .997
marital     .385   .057   45.457   1  .000   1.469
cardfee     .046   .073     .398   1  .528   1.047
Increasing income and age are associated
with a reduced hazard for churn.
Increasing education
and being unmarried
are associated with
increasing hazard for
churn.
Survival Table (first ten only)
Columns: Time, Baseline Cum Hazard, and at the mean of covariates: Survival, SE, Cum Hazard
0 .019 .997 .000 .003
1 .100 .986 .001 .014
2 .170 .976 .002 .024
3 .256 .964 .002 .037
4 .327 .954 .003 .047
5 .409 .943 .003 .058
6 .489 .933 .004 .070
7 .565 .922 .004 .081
8 .650 .911 .005 .093
9 .749 .899 .005 .107
10 .848 .886 .005 .121
Baseline cum hazard is the cumulative hazard for the model when
all predictors are zero.
Both survival and cumulative hazard were computed at the
mean of all predictors.
For our model, survival (retaining the primary credit card)
dropped to 93.3% by the sixth year. In other words, the probability
of retaining the primary credit card beyond 6 years is 93.3%.
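The columns of this table are tied together by two relationships that can be checked from the printed numbers. The sketch below assumes survival at the mean equals exp(-cumulative hazard at the mean), and that the cumulative hazard at the mean is the baseline cumulative hazard times a constant factor (the exponential of the linear predictor at the covariate means). The inputs are the rounded values from the table, so agreement is approximate.

```python
import math

# Rows from the table: (time, baseline cum hazard, survival at mean, cum hazard at mean).
rows = [(0, .019, .997, .003), (1, .100, .986, .014), (2, .170, .976, .024),
        (3, .256, .964, .037), (4, .327, .954, .047), (5, .409, .943, .058),
        (6, .489, .933, .070), (7, .565, .922, .081), (8, .650, .911, .093),
        (9, .749, .899, .107), (10, .848, .886, .121)]

# Survival recomputed from the cumulative hazard at the mean of the covariates.
recomputed_survival = [math.exp(-h_mean) for _, _, _, h_mean in rows]

# Ratio of cum hazard at the mean to baseline cum hazard, per time point.
scale_factors = [h_mean / h_base for _, h_base, _, h_mean in rows]

for (time, _, survival, _), s_hat, factor in zip(rows, recomputed_survival, scale_factors):
    print(time, survival, round(s_hat, 3), round(factor, 3))
```

The ratio stays close to .14 at every time point, which is what the proportional hazards form of the model implies; the small variation comes from rounding in the printed table.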
The differences in
survival gradually
increase over time
between groups
The differences in
hazard gradually
increase over time
between groups
Cox Node Model Options
The following options are available for entering predictors into the model:
enter
stepwise
backward stepwise
Cox Node Model Options
Specifying a group field causes the node to
compute separate models for each category
of the field. It can be any categorical field
(Flag or Set) with string or integer storage.
These options allow you to control the parameters for
model convergence.
When you execute the model, the convergence settings control
how many times the parameter estimates are repeatedly updated
to see how well they fit.
The more iterations are run, the closer successive
results will be (that is, the results will converge).
Maximum iterations. Allows you to specify the maximum iterations for the model, which controls how long the procedure will search for a solution.
Log-likelihood convergence. Iterations stop if the relative change in the log-likelihood is less than this value. The criterion is not used if the value is 0.
Parameter convergence. Iterations stop if the absolute change or relative change in the parameter estimates is less than this value. The criterion is not used if the value is 0.
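The way these three settings interact can be sketched as a stopping rule. This is a hypothetical illustration of the descriptions above, not the actual Modeler implementation; the function name, argument names, and default tolerances are all assumptions made for the example.

```python
def should_stop(iteration, prev_loglik, loglik, prev_params, params,
                max_iterations=20, loglik_tol=1e-5, param_tol=1e-4):
    """Return True when any enabled stopping criterion is met."""
    if iteration >= max_iterations:
        return True
    # Log-likelihood criterion: relative change; skipped when the tolerance is 0.
    if loglik_tol > 0:
        relative = abs(loglik - prev_loglik) / max(abs(prev_loglik), 1e-12)
        if relative < loglik_tol:
            return True
    # Parameter criterion: absolute or relative change; skipped when the tolerance is 0.
    if param_tol > 0:
        absolute = max(abs(new - old) for old, new in zip(prev_params, params))
        relative = max(abs(new - old) / max(abs(old), 1e-12)
                       for old, new in zip(prev_params, params))
        if absolute < param_tol or relative < param_tol:
            return True
    return False

# Log-likelihood barely changed, so iteration stops even though a parameter moved.
print(should_stop(3, -700.0, -700.0001, [1.0], [1.5]))
# With the log-likelihood criterion disabled (0), the parameter change decides.
print(should_stop(3, -700.0, -700.0001, [1.0], [1.5], loglik_tol=0))
```

Setting both tolerances to 0 leaves only the maximum number of iterations as a stopping condition.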
These options allow you to request additional statistics and
plots, including the survival curve, that will appear in the
advanced output of the generated model built by the node.
You can obtain statistics for your model parameters,
including confidence intervals for exp(B) and correlation
of estimates. You can request these statistics either at
each step or at the last step only.
Display baseline function. Allows you to display the baseline
hazard function and cumulative survival at the mean of the
covariates.
Plots can help you to evaluate your estimated
model and interpret the results. You can plot
the survival, hazard, log-minus-log, and one-
minus-survival functions.
This option is available only for categorical fields.
Value to use for plots. Because these functions depend on values of the
predictors, you must use constant values for the predictors to plot the
functions versus time.
The default is to use the mean of each predictor as a constant value, but you
can enter your own values for the plot using the grid.
For categorical inputs, indicator coding is used, so there is a regression
coefficient for each category (except the last).
Thus, a categorical input has a mean value for each indicator contrast, equal
to the proportion of cases in the category corresponding to the indicator
contrast.
Predict survival at future times. Specify one or more
future times. Survival, that is, whether each case is likely to
survive for at least that length of time (from now) without the
terminal event occurring, is predicted for each record at
each time value, one prediction per time value.
Note that survival is the false value of the target field.
Regular intervals. Survival time values are generated from the
specified Time interval and Number of time periods to score.
For example, if 3 time periods are requested with an interval
of 2 between each time, survival will be predicted for future times
2, 4, 6. Every record is evaluated at the same time values.
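The scoring times implied by these two settings follow a simple pattern, sketched below. The function and argument names are illustrative and are not Modeler option names.

```python
def regular_scoring_times(time_interval, number_of_periods):
    """Future times at which survival is predicted for every record."""
    return [time_interval * k for k in range(1, number_of_periods + 1)]

times = regular_scoring_times(time_interval=2, number_of_periods=3)
print(times)
```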
Time fields. Survival times are provided for each
record in the time field chosen (one prediction
field is generated), thus each record can be evaluated at different times.
Specify the survival time of the record so far (for
example, the tenure of an existing customer) as a field.
Scoring the likelihood of survival at a future
time will be conditional on past survival time.
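Conditioning on past survival can be illustrated with the standard conditional-survival formula, S(t0 + d) / S(t0). Whether the node computes its scores in exactly this way is an assumption here. The survival values are the "at mean of covariates" figures for years 2 and 6 from the survival table of the credit card example.

```python
def conditional_survival(survival_at_future, survival_so_far):
    """P(survive to the future time | already survived to the current time)."""
    return survival_at_future / survival_so_far

s_2 = 0.976   # survival at year 2 (mean of covariates)
s_6 = 0.933   # survival at year 6 (mean of covariates)

# A customer who has already held the card for 2 years, scored 4 years ahead.
p_keep_to_6 = conditional_survival(s_6, s_2)
print(round(p_keep_to_6, 3))
```

The conditional probability is higher than the unconditional .933 because the customer has already passed the first two years of risk.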
Append all probabilities. Specifies whether probabilities for each
category of the output field are added to each record processed
by the node. If this option is not selected, the probability of only
the predicted category is added. Probabilities are computed for
each future time.
Calculate cumulative hazard function. Specifies whether the value of
the cumulative hazard is added to each record. The cumulative hazard is
computed for each future time.
Use customer_dbase.sav for the demo
Inputs: gender, age, ed, income,
marital, cardfee, cardtenure
Output: churn