MELJUN CORTES IBM SPSS Survival Analysis

  • 7/30/2019 MELJUN CORTES IBM SPSS Survival Analysis

    Survival Analysis

    These are my personal notes only, so they are not complete. Most of the content was taken from the IBM SPSS Modeler training manual. Please refer to the training manual for a complete discussion.

    Survival Analysis

    Survival analysis studies the length of time until an event of interest occurs.

    Originally, the technique was used in medical research to study how long patients survive following the onset of a disease.

    Survival analysis has been used to model:

    The length of time a person subscribes to a newspaper or magazine

    The time employees spend with a company

    The time to failure of an electric or mechanical component

    The time to promotion of an employee within an organization

    The time to complete a complex transaction such as a loan application

    The time it takes to complete a requirement (e.g., a PhD at a university)

    Survival analysis can be done with or without predictors.

    Kaplan-Meier: for analysis without predictors

    Cox Regression: for analysis with categorical and interval predictors

    Censoring vs Missing Data

    Missing data: an observation for which no valid value is recorded

    Censored data: contain information, but the final value is unknown. Examples:

    Medical studies: at the end of the study some patients are still alive or in remission, some move or refuse to participate, and some die of other causes

    Studying time to promotion: some employees left the company before being promoted

    Censored data are included in the survival analysis.

    Censored data cannot be included in regression analysis, t-tests, or ANOVA.

    Aside from the censoring issue, the residuals of a regression model are unlikely to be normally distributed, because time-to-event distributions are usually non-normal.

    Concepts

    Length of time is the main outcome in survival analysis.

    The mean and median of the length of time can be used to summarize it.

    Confidence intervals can also be used to summarize it.

    Means and Medians for Survival Time

    group                      Mean (a)                                          Median
                               Estimate  Std. Error  95% CI Lower  95% CI Upper  Estimate  Std. Error  95% CI Lower  95% CI Upper
    Control                    72.545    14.839      43.462        101.629       40.000    12.899      14.719        65.281
    Treatment (Prednisolone)   125.264   13.402      98.996        151.532       146.000   28.786      89.580        202.420
    Overall                    98.925    10.812      77.733        120.117       89.000    21.232      47.385        130.615

    a. Estimation is limited to the largest survival time if it is censored.

    Cumulative survival plot

    Graphically, survival data can be summarized using the survival function over time.

    The survival function shows the probability of surviving longer than the time displayed in the chart.

    For example, the probability that a person in the treatment group survives beyond t=96 is .63.

    Generally, the survival rate decreases over time.

    The treatment group has a higher survival rate than the control group. Is this finding significant?

    Cumulative Hazard plot

    Survival data can also be summarized using the hazard function over time.

    The cumulative hazard plot shows the risk of the event accumulated up to a particular time.

    For example, the cumulative hazard for a person in the treatment group at t=100 is about .49. Note that the cumulative hazard is an accumulated risk, not a probability, and it can exceed 1.

    Generally, the cumulative hazard increases over time.

    The treatment group has a lower cumulative hazard than the control group. Is this finding significant?


    Survival Procedures in SPSS

    The Life Tables procedure is appropriate when the time to the critical event is recorded in broad ranges (for example, in six-month periods or whole years) so that there are many ties among the data values, or when there is no interest in differentiating between small time differences.

    The Kaplan-Meier procedure is appropriate when the time to the critical event is measured precisely enough that there are relatively few ties in the data. Examples might be the number of months surviving, or the fractional number of years a retail space is occupied by a tenant.

    Cox Regression (also called a proportional hazards model) posits that the hazard rate can be a function of both categorical and interval-scale predictor variables.

    It assumes that the hazard functions for different groups are proportional to each other over time.

    This assumption can be examined, and a variant of Cox regression (Cox Regression with time-varying covariates) can be applied when the assumption does not hold.

    Kaplan-Meier Example: Survival time for patients with chronic active hepatitis

    The group variable divides the patients into treatment (prednisolone therapy) and control patients.

    The time variable records the time to the critical event (death) or the time when censoring occurred.

    The status variable indicates whether the critical event occurred (1 = death) or the case was censored (2 = censored).

    Data set: KM.sav

    Null Hypothesis

    There is no difference between the survival distributions of the treatment and control groups.

    Results

    Means and Medians for Survival Time

    group                      Mean (a)                                          Median
                               Estimate  Std. Error  95% CI Lower  95% CI Upper  Estimate  Std. Error  95% CI Lower  95% CI Upper
    Control                    72.545    14.839      43.462        101.629       40.000    12.899      14.719        65.281
    Treatment (Prednisolone)   125.264   13.402      98.996        151.532       146.000   28.786      89.580        202.420
    Overall                    98.925    10.812      77.733        120.117       89.000    21.232      47.385        130.615

    a. Estimation is limited to the largest survival time if it is censored.

    Overall Comparisons

    Test                             Chi-Square  df  Sig.
    Log Rank (Mantel-Cox)            4.660       1   .031
    Breslow (Generalized Wilcoxon)   6.543       1   .011
    Tarone-Ware                      6.066       1   .014

    Test of equality of survival distributions for the different levels of group.

    Decision: reject H0.

    Survival table for the Control Group

    The first event, which occurred at time 2 (2 months), has a cumulative survival value of .955, or (1 - (1/22)). The estimated probability of surviving beyond two months in this group is 95.5%.

    Survival Table (group: Control)

                             Cumulative Proportion
                             Surviving at the Time    N of Cumulative  N of Remaining
    Case  Time     Status    Estimate  Std. Error     Events           Cases
    1     2.000    Died      .955      .044           1                21
    2     3.000    Died      .909      .061           2                20
    3     4.000    Died      .864      .073           3                19
    4     7.000    Died      .818      .082           4                18
    5     10.000   Died      .773      .089           5                17
    6     22.000   Died      .727      .095           6                16
    7     28.000   Died      .682      .099           7                15
    8     29.000   Died      .636      .103           8                14
    9     32.000   Died      .591      .105           9                13
    10    37.000   Died      .545      .106           10               12
    11    40.000   Died      .500      .107           11               11
    12    41.000   Died      .455      .106           12               10
    13    54.000   Died      .409      .105           13               9
    14    61.000   Died      .364      .103           14               8
    15    63.000   Died      .318      .099           15               7
    16    71.000   Died      .273      .095           16               6
    17    127.000  Censored  .         .              16               5
    18    140.000  Censored  .         .              16               4
    19    146.000  Censored  .         .              16               3
    20    158.000  Censored  .         .              16               2
    21    167.000  Censored  .         .              16               1
    22    182.000  Censored  .         .              16               0
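    The control-group estimates can be reproduced by hand. The block below is a minimal sketch of the product-limit (Kaplan-Meier) calculation, using the times transcribed from the control-group survival table; the function names are illustrative and are not part of SPSS. Exact fractions are used so that the comparison with 0.5 for the median is not affected by floating-point rounding.

```python
from fractions import Fraction

# Control group, transcribed from the survival table (time in months).
control_deaths = [2, 3, 4, 7, 10, 22, 28, 29, 32, 37, 40, 41, 54, 61, 63, 71]
control_censored = [127, 140, 146, 158, 167, 182]


def kaplan_meier(deaths, censored):
    """Return [(time, S(time))] at each distinct death time (product-limit estimate)."""
    at_risk = len(deaths) + len(censored)
    curve, surv = [], Fraction(1)
    for t in sorted(set(deaths) | set(censored)):
        d = deaths.count(t)
        if d:
            # Multiply by the conditional probability of surviving past t.
            surv *= Fraction(at_risk - d, at_risk)
            curve.append((t, surv))
        # Everyone who died or was censored at t leaves the risk set.
        at_risk -= d + censored.count(t)
    return curve


def median_survival(curve):
    """Smallest time at which estimated survival is 0.5 or lower (None if never reached)."""
    return next((t for t, s in curve if s <= Fraction(1, 2)), None)


control_curve = kaplan_meier(control_deaths, control_censored)
for t, s in control_curve[:3]:
    print(t, round(float(s), 3))
print("median:", median_survival(control_curve))
```

    The first estimate is 21/22, which rounds to the .955 shown in the table, and the median is the first time at which the estimate reaches .500.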

    Survival Table for the Treatment Group

    Survival Table (group: Treatment (Prednisolone))

                             Cumulative Proportion
                             Surviving at the Time    N of Cumulative  N of Remaining
    Case  Time     Status    Estimate  Std. Error     Events           Cases
    1     2.000    Died      .955      .044           1                21
    2     6.000    Died      .909      .061           2                20
    3     12.000   Died      .864      .073           3                19
    4     54.000   Died      .818      .082           4                18
    5     56.000   Censored  .         .              4                17
    6     68.000   Died      .770      .090           5                16
    7     89.000   Died      .722      .097           6                15
    8     96.000   Died      .         .              7                14
    9     96.000   Died      .626      .105           8                13
    10    125.000  Censored  .         .              8                12
    11    128.000  Censored  .         .              8                11
    12    131.000  Censored  .         .              8                10
    13    140.000  Censored  .         .              8                9
    14    141.000  Censored  .         .              8                8
    15    143.000  Died      .547      .117           9                7
    16    145.000  Censored  .         .              9                6
    17    146.000  Died      .456      .129           10               5
    18    148.000  Censored  .         .              10               4
    19    162.000  Censored  .         .              10               3
    20    168.000  Died      .304      .151           11               2
    21    173.000  Censored  .         .              11               1
    22    181.000  Censored  .         .              11               0
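    With both survival tables available, the Log Rank (Mantel-Cox) statistic in the Overall Comparisons table can be checked by hand. The following is a sketch of the standard two-group log-rank calculation, not SPSS's own code; the function name and data layout are illustrative. At each distinct death time it compares the observed deaths in the control group with the number expected if both groups shared one survival distribution.

```python
import math

# (deaths, censored) times in months, transcribed from the two survival tables.
control = ([2, 3, 4, 7, 10, 22, 28, 29, 32, 37, 40, 41, 54, 61, 63, 71],
           [127, 140, 146, 158, 167, 182])
treatment = ([2, 6, 12, 54, 68, 89, 96, 96, 143, 146, 168],
             [56, 125, 128, 131, 140, 141, 145, 148, 162, 173, 181])


def log_rank(group_a, group_b):
    """Return (chi_square, p_value) for the two-group log-rank test (df = 1)."""
    deaths_a, cens_a = group_a
    deaths_b, cens_b = group_b
    all_a, all_b = deaths_a + cens_a, deaths_b + cens_b
    observed = expected = variance = 0.0
    for t in sorted(set(deaths_a) | set(deaths_b)):
        n_a = sum(1 for u in all_a if u >= t)      # at risk in group A just before t
        n_b = sum(1 for u in all_b if u >= t)      # at risk in group B just before t
        d_a, d_b = deaths_a.count(t), deaths_b.count(t)
        n, d = n_a + n_b, d_a + d_b
        observed += d_a
        expected += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the number of deaths in group A at time t.
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = (observed - expected) ** 2 / variance
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


chi_square, p_value = log_rank(control, treatment)
print(round(chi_square, 3), round(p_value, 3))
```

    The result should agree with the Log Rank row of the Overall Comparisons table (chi-square 4.660, Sig. .031) up to rounding. The Breslow and Tarone-Ware tests weight the same differences by functions of the number at risk and are not reproduced here.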


    The Cox Regression Model

    The Cox Regression Model posits that the hazard rate can be a function of both categorical and interval-scale predictor variables.

    The hazard function at time t as a function of predictors $X_1, X_2, \ldots, X_p$:

    $$h(t \mid X_1, X_2, \ldots, X_p) = h_0(t)\, e^{B_1 X_1 + B_2 X_2 + \cdots + B_p X_p}$$

    $h_0(t)$ = the base hazard function, which changes over time and is independent of the predictors

    $e^{B_1 X_1 + B_2 X_2 + \cdots + B_p X_p}$ = the factor and covariate effects, which are independent of time and adjust the base hazard function

    $e^{B_j}$ is the multiplicative change in the hazard function associated with a unit change in the predictor $X_j$, controlling for the other effects in the model.

    Assumption:

    The effects of the predictors are constant over time. If this assumption is not met, then the Cox Regression Model will not provide the best fit to the data.

    If the assumption is not met, use the Cox model with time-dependent covariates.

    Cox Regression Example 1: Survival time in heroin addiction treatment program

    Outcome measure (survtime): time (in days) spent in a program for heroin addicts

    Terminating event (status): departure from (quitting) the program. Data were censored for participants still in the program when the study was completed.

    Predictors:

    clinic (there were two clinics whose programs differed)

    prison (whether or not the addict had a prison record)

    methadone dose (methdose, measured in mg/day)

    Data set: addicts.sav

    Results

    Omnibus Tests of Model Coefficients (a)

    -2 Log Likelihood: 1347.345

                                 Chi-square  df  Sig.
    Overall (score)              56.193      3   .000
    Change From Previous Step    64.351      3   .000
    Change From Previous Block   64.351      3   .000

    a. Beginning Block Number 1. Method = Enter

    Null hypothesis: The effects of all three predictor variables are zero in the population.

    Reject H0 if the chi-square is significant (p

    Variables in the Equation

    Variable   B       SE     Wald     df  Sig.   Exp(B)
    clinic     1.009   .215   22.087   1   .000   2.743
    prison     -.315   .167   3.543    1   .060   .730
    methdose   -.035   .006   30.659   1   .000   .965

    The B coefficient estimates give the change in the natural log of the hazard per one-unit change in the predictor. (Difficult to understand?)

    The Exp(B) column presents the estimated change in risk (hazard) associated with a one-unit change in a predictor, controlling for the other predictors. When the predictor is categorical and indicator coding is used, Exp(B) represents the change in hazard when changing from the reference category to another category and is referred to as relative risk. Exp(B) is also called the hazard ratio, since it represents the ratio of the hazards for two individuals who differ by one unit in the predictor of interest.

    For clinic, Exp(B) = 2.743: other things equal, the hazard in clinic 0 is 2.743 times the hazard in clinic 1. Thus patients in clinic 0 exhibit greater risk and shorter survival times. Be careful of the reference category!

    For methdose, Exp(B) = .965: a one-unit (one mg/day) increase in dosage multiplies the hazard by .965, a decrease of about 3.5%.

    For prison, Exp(B) = .730: not significant! If it were significant, it would be interpreted as: the group with no prison record is at less risk than the group with a prison record.
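    The link between the B and Exp(B) columns can be checked directly. The sketch below exponentiates the rounded B values from the Variables in the Equation table; the helper name hazard_multiplier is illustrative. Because the printed B values are rounded to three decimals, the results agree with the printed Exp(B) values only approximately.

```python
import math

# B coefficients as printed in the Variables in the Equation table (rounded).
coefficients = {"clinic": 1.009, "prison": -0.315, "methdose": -0.035}

# Exp(B): hazard ratio for a one-unit increase in each predictor.
hazard_ratios = {name: math.exp(b) for name, b in coefficients.items()}


def hazard_multiplier(b, delta):
    """Factor by which the hazard changes when the predictor changes by delta units."""
    return math.exp(b * delta)


for name, ratio in hazard_ratios.items():
    print(f"{name}: {ratio:.3f}")

# Effect of a 10 mg/day increase in methadone dose, other predictors held fixed.
ten_mg = hazard_multiplier(coefficients["methdose"], 10)
print(f"10 mg/day increase multiplies the hazard by {ten_mg:.3f}")
```

    Effects multiply rather than add: a 10 mg/day increase corresponds to Exp(B) raised to the 10th power, not to ten times the one-unit change.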


    Cumulative Survival Plot


    Cumulative Hazard Plot

    Checking the Proportional Hazards Assumption

    (The hazard functions of any two individuals or groups remain in constant proportion over time.)

    1. Examine the survival or hazard plots (in Kaplan-Meier) with the categorical predictor as the factor.

    2. Examine the survival or log-minus-log plot in Cox Regression with the categorical predictor specified as a strata variable.

    3. Save partial residuals and plot them against time (see the Cox Regression case study for an example).

    4. Fit a Cox Regression model with a time-varying covariate; examine its significance and contribution.

    Note: we will use #2.

    The survival plots for the two clinics diverge substantially over time, suggesting that the hazard ratio for the two groups is not constant over time.

    Log Minus Log (LML) Plot

    If the proportional hazards model holds (in our example, this means that over time the hazard functions of the clinics differ by a constant proportion), then the natural log of the negative of the natural log of the survival functions for the different groups will form parallel lines over time.

    Here the lines are not parallel, indicating that the proportional hazards assumption does not hold for clinic.
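    Why proportional hazards produce parallel LML lines can be seen numerically. The sketch below uses hypothetical baseline survival values and a hypothetical hazard ratio of 2 (neither comes from the addicts data). Under proportional hazards the second group's survival is the baseline survival raised to the power of the hazard ratio, so the two LML curves differ by the constant ln(hazard ratio).

```python
import math


def log_minus_log(survival):
    """ln(-ln(S)) transform plotted in a log-minus-log chart."""
    return math.log(-math.log(survival))


# Hypothetical values for illustration only.
baseline_survival = [0.95, 0.85, 0.70, 0.50, 0.30]
hazard_ratio = 2.0

# Proportional hazards implies S_1(t) = S_0(t) ** hazard_ratio.
other_survival = [s ** hazard_ratio for s in baseline_survival]

# Vertical gap between the two LML curves at each time point.
gaps = [log_minus_log(s1) - log_minus_log(s0)
        for s0, s1 in zip(baseline_survival, other_survival)]
print([round(g, 4) for g in gaps])
```

    Every gap equals ln(2), so the curves are parallel. When the gap changes over time, as in the clinic plot, the hazard ratio is not constant.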

    Cox with Time-Dependent Covariate

    Create T_COV_, a time-dependent variable to be used in the extended Cox model.

    T_COV_ = (T_>365)*clinic

    The term (T_>365) is:

    Zero (0) when survival time is 365 or less

    One (1) when survival time is >365
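    The definition of T_COV_ can be sketched as a small function. This is an illustration in Python, not SPSS syntax; the function name t_cov and the cutoff parameter are assumptions made for the example.

```python
def t_cov(t, clinic, cutoff=365):
    """Time-by-clinic interaction: 0 up to the cutoff, the clinic code afterwards."""
    return int(t > cutoff) * clinic


# The covariate is switched on only after day 365, and only for clinic code 1.
for day in (100, 365, 366, 500):
    print(day, t_cov(day, clinic=0), t_cov(day, clinic=1))
```

    With this coding the clinic coefficient describes the clinic effect during the first 365 days, and the T_COV_ coefficient describes how that effect changes afterwards.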

    It can be created by SPSS as follows:

    Include T_COV_ as a covariate in the model.

    Results

    The clinic predictor (which now represents the clinic effect during the first 365 days) is no longer significant (p=.06). It should be retained in the model since it was used in defining the time-by-clinic interaction.

    In the first year (time ≤ 365), the estimated clinic effect is 1.616. After the first year (time > 365), the hazard in clinic 0 is 6.123 times the hazard in clinic 1. The 6.123 value is obtained by multiplying the clinic effect (1.616) by the clinic-by-time interaction (3.789); thus after the first year, the clinic effect is estimated to increase by a factor of 3.789.
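    The arithmetic behind the 6.123 value can be written as a step function of time. The sketch below uses the two effect values quoted in the text (1.616 and 3.789); the function name is illustrative, and treating the hazard ratio as constant on each side of day 365 follows from the way T_COV_ was defined.

```python
EXP_B_CLINIC = 1.616  # clinic effect quoted in the text (first 365 days)
EXP_B_T_COV = 3.789   # clinic-by-time interaction quoted in the text


def clinic_hazard_ratio(t, cutoff=365):
    """Estimated clinic hazard ratio at time t under the extended Cox model."""
    return EXP_B_CLINIC * (EXP_B_T_COV if t > cutoff else 1.0)


print(round(clinic_hazard_ratio(200), 3))  # within the first year
print(round(clinic_hazard_ratio(400), 3))  # after the first year
```

    Multiplying the two Exp(B) values is equivalent to adding the two B coefficients before exponentiating.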

    Cox Regression Example 2

    Consider a database with 5000 records from customers of a telecommunication firm.

    The firm has collected a wide variety of information on its customers, including gender, age, education, income, marital status, card tenure, annual fee for the primary credit card, number of years the primary credit card has been held, and churn (switched providers within the last month).

    We are interested in studying the length of time customers retain their primary credit card. In other words, we will model the time for these customers to churn (not renew) their primary credit card.

    Churn rates are initially around 50% in the early years.

    Chart annotations: "People who have recently obtained card" and "People who have had their card for a long time"

    Cox Regression Results

    Variable    B       SE     Wald      df  Sig.   Exp(B)
    gender      .048    .056   .739      1   .390   1.050
    age         -.080   .003   862.033   1   .000   .923
    education   .120    .009   169.987   1   .000   1.128
    income      -.003   .001   18.387    1   .000   .997
    marital     .385    .057   45.457    1   .000   1.469
    cardfee     .046    .073   .398      1   .528   1.047

    Increasing income and age lead to reduced hazard for churn.

    Increasing education and being unmarried are associated with increasing hazard for churn.

    Survival Table (first ten only)

          Baseline     At mean of covariates
    Time  Cum Hazard   Survival  SE     Cum Hazard
    0     .019         .997      .000   .003
    1     .100         .986      .001   .014
    2     .170         .976      .002   .024
    3     .256         .964      .002   .037
    4     .327         .954      .003   .047
    5     .409         .943      .003   .058
    6     .489         .933      .004   .070
    7     .565         .922      .004   .081
    8     .650         .911      .005   .093
    9     .749         .899      .005   .107
    10    .848         .886      .005   .121

    Baseline cum hazard is the cumulative hazard for the model when all predictors are zero.

    Both survival and cumulative hazard were computed at the mean of all predictors.

    For our model, survival (retaining the primary credit card) dropped to 93.3% by the sixth year. That is, the probability of retaining the primary credit card beyond 6 years is 93.3%.
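    The Survival and Cum Hazard columns at the mean of the covariates are tied together by the standard identity S(t) = exp(-H(t)). The sketch below checks that identity against the rows of the survival table; it is an illustration of the relationship, not a description of how SPSS computes the columns, and small gaps are expected because the printed values are rounded to three decimals.

```python
import math

# (time, survival, cumulative hazard) at the mean of the covariates,
# transcribed from the survival table.
rows = [
    (0, 0.997, 0.003), (1, 0.986, 0.014), (2, 0.976, 0.024), (3, 0.964, 0.037),
    (4, 0.954, 0.047), (5, 0.943, 0.058), (6, 0.933, 0.070), (7, 0.922, 0.081),
    (8, 0.911, 0.093), (9, 0.899, 0.107), (10, 0.886, 0.121),
]

# Largest difference between the printed survival and exp(-cumulative hazard).
max_gap = max(abs(survival - math.exp(-cum_hazard))
              for _, survival, cum_hazard in rows)

for time, survival, cum_hazard in rows[:3]:
    print(time, survival, round(math.exp(-cum_hazard), 3))
print("largest gap:", round(max_gap, 4))
```

    The same identity explains why a cumulative hazard is not a probability: it is converted to a survival probability by exponentiating its negative.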

    The differences in survival between groups gradually increase over time.

    The differences in hazard between groups gradually increase over time.

    Cox Node Model Options

    The following options are available for entering predictors into the model:

    enter

    stepwise

    backward stepwise


    Cox Node Model Options

    Specifying a group field causes the node to

    compute separate models for each category

    of the field. It can be any categorical field

    (Flag or Set) with string or integer storage.

    These options allow you to control the parameters for model convergence.

    When you execute the model, the convergence settings control how many times the different parameters are repeatedly run through to see how well they fit.

    The more often the parameters are tried, the closer the results will be (that is, the results will converge).

    Maximum iterations. Allows you to specify the maximum iterations for the model, which controls how long the procedure will search for a solution.

    Log-likelihood convergence. Iterations stop if the relative change in the log-likelihood is less than this value. The criterion is not used if the value is 0.

    Parameter convergence. Iterations stop if the absolute change or relative change in the parameter estimates is less than this value. The criterion is not used if the value is 0.

    These options allow you to request additional statistics and plots, including the survival curve, that will appear in the advanced output of the generated model built by the node.


    You can obtain statistics for your model parameters,

    including confidence intervals for exp(B) and correlation

    of estimates. You can request these statistics either at

    each step or at the last step only.

    Display baseline function. Allows you to display the baseline hazard function and cumulative survival at the mean of the covariates.

    Plots can help you to evaluate your estimated model and interpret the results. You can plot the survival, hazard, log-minus-log, and one-minus-survival functions.

    This option is available only for categorical fields.

    Value to use for plots. Because these functions depend on values of the predictors, you must use constant values for the predictors to plot the functions versus time.

    The default is to use the mean of each predictor as a constant value, but you can enter your own values for the plot using the grid.

    For categorical inputs, indicator coding is used, so there is a regression coefficient for each category (except the last).

    Thus, a categorical input has a mean value for each indicator contrast, equal to the proportion of cases in the category corresponding to the indicator contrast.
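    The statement about indicator contrasts can be illustrated with a few made-up records. In the sketch below the category labels and data are hypothetical; the last category (in sorted order) is taken as the reference, matching the "except the last" rule described in the text.

```python
# Hypothetical categorical input with three levels.
values = ["a", "b", "c", "a", "a", "b"]

levels = sorted(set(values))
reference = levels[-1]  # the last category gets no indicator

# Indicator coding: one 0/1 column per non-reference category.
indicators = {level: [int(v == level) for v in values] for level in levels[:-1]}

# The mean of each 0/1 column is the proportion of cases in that category.
indicator_means = {level: sum(column) / len(column)
                   for level, column in indicators.items()}
print(reference, indicator_means)
```

    Plotting "at the mean" of a categorical input therefore uses these proportions, which describes an averaged case rather than any single category.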

    Predict survival at future times. Specify one or more future times. Survival, that is, whether each case is likely to survive for at least that length of time (from now) without the terminal event occurring, is predicted for each record at each time value, one prediction per time value.

    Note that survival is the false value of the target field.


    Regular intervals. Survival time values are generated from the

    specified Time interval and Number of time periods to score.

    For example, if 3 time periods are requested with an interval

    of 2 between each time, survival will be predicted for future times

    2, 4, 6. Every record is evaluated at the same time values.
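    The regular-interval rule amounts to taking multiples of the interval. A minimal sketch, with an illustrative function name that is not part of the Cox node:

```python
def scoring_times(interval, periods):
    """Future times at which survival is scored: interval, 2*interval, ..."""
    return [interval * k for k in range(1, periods + 1)]


print(scoring_times(interval=2, periods=3))
```

    With an interval of 2 and 3 periods this returns the times 2, 4 and 6 from the example in the text.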

    Time fields. Survival times are provided for each record in the time field chosen (one prediction field is generated), thus each record can be evaluated at different times.

    Specify the survival time of the record so far (for example, the tenure of an existing customer) as a field. Scoring the likelihood of survival at a future time will be conditional on past survival time.
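    Conditioning on past survival time can be sketched with the usual conditional-probability identity P(T > s + u | T > s) = S(s + u) / S(s). The example below assumes this identity and uses the survival values at the mean of the covariates from the survival table shown earlier; how the Cox node performs the calculation internally is not stated in these notes, and the function name is illustrative.

```python
# Survival at the mean of the covariates, by year, from the survival table.
survival = {0: 0.997, 1: 0.986, 2: 0.976, 3: 0.964, 4: 0.954, 5: 0.943,
            6: 0.933, 7: 0.922, 8: 0.911, 9: 0.899, 10: 0.886}


def conditional_survival(past, future):
    """P(survive `future` more years | already survived `past` years)."""
    return survival[past + future] / survival[past]


# A customer with 6 years of tenure, scored 4 years ahead.
print(round(conditional_survival(past=6, future=4), 4))
```

    The conditional value is higher than the unconditional 10-year survival of .886, because the customer is already known to have kept the card for 6 years.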

    Append all probabilities. Specifies whether probabilities for each category of the output field are added to each record processed by the node. If this option is not selected, the probability of only the predicted category is added. Probabilities are computed for each future time.


    Calculate cumulative hazard function. Specifies whether the value of

    the cumulative hazard is added to each record. The cumulative hazard is

    computed for each future time.


    Use customer_dbase.sav for the demo

    Inputs: gender, age, ed, income,

    marital, cardfee, cardtenure

    Output: churn