6953 P8 dadLogistic Regression 2013


  • Prepared by:

    Assoc. Prof. Dr Bahaman Abu Samah

    Department of Professional Development and Continuing Education

    Faculty of Educational Studies

    Universiti Putra Malaysia

    Serdang

  • Logistic regression is an alternative to multiple linear regression

    Used to predict an outcome variable that is a categorical

    dichotomy from a set of categorical or continuous predictor

    variables

    Used because a dichotomous categorical outcome variable

    violates the assumption of linearity in ordinary linear regression

    Logistic regression emphasizes the probability of a particular

    outcome for each case

  • Logistic regression calculates the probability of having one

    outcome or another based on the best linear combination of

    predictors, using maximum-likelihood estimation

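    Probability of Y is calculated based on the following formula:

    $$P(Y) = \hat{Y} = \frac{e^{u}}{1 + e^{u}}$$    (Formula 1)

    where

    $P(Y)$ = probability of the outcome Y

    $e$ = the base of natural logarithms (approximately 2.718)

    $u = b_0 + b_1X_1 + b_2X_2 + \dots + b_pX_p$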

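  • With one predictor variable, the formula will be:

    $$P(Y_i) = \hat{Y}_i = \frac{e^{b_0 + b_1X_1}}{1 + e^{b_0 + b_1X_1}}$$

    With multiple predictor variables (p), the formula will be:

    $$P(Y_i) = \hat{Y}_i = \frac{e^{b_0 + b_1X_1 + b_2X_2 + \dots + b_pX_p}}{1 + e^{b_0 + b_1X_1 + b_2X_2 + \dots + b_pX_p}}$$

    where

    $P(Y_i)$ = probability of the outcome for case i

    $e$ = the base of natural logarithms (approximately 2.718)

    $u = b_0 + b_1X_1 + b_2X_2 + \dots + b_pX_p$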
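    As a minimal sketch of the probability formula, the Python function below computes u as the linear combination of the predictors and then returns e^u / (1 + e^u). The function name, the coefficients and the predictor values are hypothetical and chosen only for illustration.

```python
import math

def logistic_probability(intercept, coefficients, predictors):
    """Return P(Y) = e^u / (1 + e^u) for a single case."""
    # u = b0 + b1*X1 + b2*X2 + ... + bp*Xp
    u = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return math.exp(u) / (1.0 + math.exp(u))

# Hypothetical coefficients and predictor values (not taken from the slides)
p_one = logistic_probability(-2.0, [0.5], [3.0])             # one predictor
p_many = logistic_probability(-2.0, [0.5, 1.2], [3.0, 1.0])  # two predictors

print(round(p_one, 3), round(p_many, 3))
```

    When u = 0 the function returns exactly .5, and the result moves toward 1 as u grows and toward 0 as u becomes more negative.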

  • The value resulting from the above computation (the probability)

    ranges between 0 and 1

    :: A value close to 0 means Y is very unlikely to occur

    :: A value close to 1 means Y is very likely to occur

    Example: 1 = Pass, 0 = Fail

  • 1. Predict an outcome variable from a set of

    categorical or continuous predictor variables. Logistic

    regression calculates the probability of success over the

    probability of failure; the result is presented as an

    odds ratio or likelihood ratio

    2. Determine relationships among constructs

  • DV: Dichotomous, coded as 1 and 0

    IV: Continuous or categorical

  • Can outcome be predicted from a set of predictor

    variables?

    Which predictor variables predict the outcome?

    How strong is the relationship between outcome and

    the predictor variables?

  • Assessing Model Fit

    Assessing the Predictor

    Relationship between Predictors and Outcome

    Odds Ratio

    Classification of Cases

  • Use the observed and predicted values of the outcome to assess the

    fit of the model.

    The statistic used to measure the fit of the model is called the

    log-likelihood:

    $$\text{Log-likelihood} = \sum_{i=1}^{N} \left[ Y_i \ln(\hat{Y}_i) + (1 - Y_i) \ln(1 - \hat{Y}_i) \right]$$    (Formula 2)

    where $Y_i$ is the actual outcome (1 or 0) and $\hat{Y}_i$ is the predicted probability for case i

    The log-likelihood sums the logs of the probabilities associated with

    the predicted and actual outcomes

    This log-likelihood statistic is comparable to the residual sum of squares

    (SSE) in multiple regression
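    The log-likelihood formula can be sketched in a few lines of Python. The observed outcomes and predicted probabilities below are hypothetical values used only to show the arithmetic; the function assumes every predicted probability lies strictly between 0 and 1.

```python
import math

def log_likelihood(observed, predicted):
    """Sum of Y*ln(Yhat) + (1 - Y)*ln(1 - Yhat) over all cases."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(observed, predicted)
    )

# Hypothetical data: actual outcomes and model-predicted probabilities
observed = [1, 0, 1, 1]
predicted = [0.8, 0.3, 0.6, 0.9]

ll = log_likelihood(observed, predicted)
print(round(ll, 3))
```

    The value is always negative or zero; values closer to zero indicate predicted probabilities that agree more closely with the actual outcomes.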

  • Log-likelihood will be calculated for two different models (bigger and

    smaller)

    The two models are compared by computing the difference in their

    log-likelihoods, which is tested using chi-square ($\chi^2$):

    $$\chi^2 = 2\left[ LL(B) - LL(0) \right]$$    (Formula 3)

    LL(B) is the log-likelihood for the bigger model, which includes all the

    predictors

    LL(0) is the log-likelihood for the smaller model, which includes only the

    intercept

    degrees of freedom (df) = $k_B - k_0$, where k is the number of parameters in each model
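    A small Python sketch of the model comparison follows. The log-likelihoods and parameter counts are hypothetical; the resulting chi-square would then be compared against a chi-square distribution with the computed degrees of freedom.

```python
def model_chi_square(ll_bigger, ll_smaller, k_bigger, k_smaller):
    """Return (chi-square, df) for comparing a bigger and a smaller model."""
    chi_square = 2.0 * (ll_bigger - ll_smaller)  # 2[LL(B) - LL(0)]
    df = k_bigger - k_smaller                    # difference in number of parameters
    return chi_square, df

# Hypothetical values: intercept-only model (1 parameter) versus a model
# with an intercept and two predictors (3 parameters)
chi_square, df = model_chi_square(-20.5, -25.0, 3, 1)
print(chi_square, df)
```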

  • Test the null hypothesis H0: $\beta_i = 0$ (the coefficient of predictor i is zero)

    Test the individual contribution of predictor variables using the

    Wald statistic

    The Wald statistic is comparable to the t-test in multiple regression

    The Wald statistic is the squared ratio of the unstandardized logistic

    coefficient to its standard error:

    $$\text{Wald} = \left( \frac{b}{SE_b} \right)^2$$

    The Wald statistic and its corresponding probability level (p) are

    part of the SPSS output in the "Variables in the Equation" table.
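    The sketch below computes the Wald statistic as the squared ratio of a coefficient to its standard error. The p-value function rests on an assumption not stated on the slide: that the Wald statistic is referred to a chi-square distribution with 1 degree of freedom, whose upper-tail probability equals erfc(sqrt(Wald / 2)). The coefficient and standard error are hypothetical.

```python
import math

def wald_statistic(b, se):
    """Squared ratio of the unstandardized coefficient to its standard error."""
    return (b / se) ** 2

def wald_p_value(wald):
    """Upper-tail probability, assuming a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(wald / 2.0))

# Hypothetical coefficient and standard error
wald = wald_statistic(1.0, 0.5)
p_value = wald_p_value(wald)
print(wald, round(p_value, 4))
```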

  • A number of statistics can be used as measures of association

    between predictors and outcome

    The measures include:

    1. R-statistic

    2. Cox and Snell R²

    3. Nagelkerke R²

    4. Hosmer and Lemeshow R²

  • R-statistic is comparable to the multiple correlation coefficient

    Formula:

    $$R = \pm\sqrt{\frac{\text{Wald} - (2 \times df)}{-2LL(0)}}$$    (Formula 4)

    R-statistic ranges from -1 to +1

    A positive value: as the predictor increases, the likelihood of the

    outcome occurring increases; a negative value indicates the reverse
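    A rough Python sketch of the R-statistic follows. Two choices here are assumptions rather than part of the slide: the sign of R is taken from the sign of the coefficient b, and R is set to 0 when the Wald statistic does not exceed 2 × df (otherwise the square root would be undefined). All input values are hypothetical.

```python
import math

def r_statistic(b, wald, df, ll_null):
    """R = +/- sqrt((Wald - 2*df) / (-2*LL(0))), signed like the coefficient b."""
    numerator = wald - 2 * df
    if numerator <= 0:
        return 0.0  # assumption: treat a non-positive numerator as R = 0
    r = math.sqrt(numerator / (-2.0 * ll_null))
    return math.copysign(r, b)  # assumption: sign follows the coefficient

# Hypothetical values
r_pos = r_statistic(b=0.8, wald=6.5, df=1, ll_null=-25.0)
r_neg = r_statistic(b=-0.8, wald=6.5, df=1, ll_null=-25.0)
print(round(r_pos, 3), round(r_neg, 3))
```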

  • $R^2_{CS}$ (Cox and Snell) is comparable to $R^2$ in multiple linear regression

    The value is displayed in SPSS Logistic Regression

    Formula:

    $$R^2_{CS} = 1 - e^{-\frac{2}{n}\left[ LL(B) - LL(0) \right]}$$    (Formula 5)

    However, the value of $R^2_{CS}$ never reaches its theoretical

    maximum of 1

  • Nagelkerke suggested an amendment to the earlier $R^2_{CS}$

    The value is displayed in SPSS Logistic Regression

    Formula:

    $$R^2_N = \frac{R^2_{CS}}{1 - e^{\frac{2LL(0)}{n}}}$$    (Formula 6)

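  • Formula to calculate $R^2_L$ (Hosmer and Lemeshow)

    $$R^2_L = \frac{-2LL(0) - \left( -2LL(B) \right)}{-2LL(0)} = \frac{\chi^2}{-2LL(0)}$$    (Formula 7)

    where $\chi^2 = 2\left[ LL(B) - LL(0) \right]$ is the model chi-square (Formula 3)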
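    The three R² measures (Formulas 5 to 7) can be sketched together in Python. The log-likelihoods and sample size are hypothetical, and the function names are illustrative only.

```python
import math

def cox_snell_r2(ll_model, ll_null, n):
    """R2_CS = 1 - exp(-(2/n) * [LL(B) - LL(0)])."""
    return 1.0 - math.exp(-(2.0 / n) * (ll_model - ll_null))

def nagelkerke_r2(ll_model, ll_null, n):
    """R2_N = R2_CS / (1 - exp(2*LL(0)/n)), rescaling R2_CS by its maximum."""
    return cox_snell_r2(ll_model, ll_null, n) / (1.0 - math.exp(2.0 * ll_null / n))

def hosmer_lemeshow_r2(ll_model, ll_null):
    """R2_L = [-2LL(0) - (-2LL(B))] / [-2LL(0)]."""
    return (-2.0 * ll_null - (-2.0 * ll_model)) / (-2.0 * ll_null)

# Hypothetical values: LL(B) = -20, LL(0) = -25, n = 40 cases
r2_cs = cox_snell_r2(-20.0, -25.0, 40)
r2_n = nagelkerke_r2(-20.0, -25.0, 40)
r2_l = hosmer_lemeshow_r2(-20.0, -25.0)
print(round(r2_cs, 3), round(r2_n, 3), round(r2_l, 3))
```

    Because the denominator in the Nagelkerke formula is less than 1, the Nagelkerke value is always larger than the Cox and Snell value for the same model.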

  • Odds ratio is an indicator of the change in odds (likelihood) resulting

    from a unit change in the predictor

    The odds ratio is the factor by which the odds of being in one outcome

    category increase (or decrease, if the ratio is less than 1) when the

    predictor increases by one unit.

    It is similar to the b-coefficient but is easier to interpret (it does not involve

    a logarithmic transformation)

    The odds of an event occurring are defined as the probability of the

    event occurring divided by the probability of the event not occurring

    $$\text{Odds} = \frac{P(\text{event})}{P(\text{no event})}$$    (Formula 8)

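  • The coefficients (b) are the natural logs of the odds ratios, thus the

    odds ratio can be calculated using the following formula:

    $$\text{Odds ratio} = e^{b}$$    (Formula 9)

    Odds ratio indicates the change in odds resulting from a unit

    change in the predictor

    Odds ratio > 1: as the predictor increases, the odds of the outcome increase

    Odds ratio < 1: as the predictor increases, the odds of the outcome decrease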

  • X is income (in RM1,000), used to predict home ownership (1 = Yes, 0 = No)

    if b = 1.25

    $$\text{Odds ratio} = e^{1.25} = 3.49$$

    A 1-unit increase in income (RM1,000) multiplies the odds

    of home ownership by 3.49
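    The home-ownership example can be checked with a short Python sketch. The coefficient b = 1.25 comes from the example; the intercept and the two income values are hypothetical and are used only to confirm that the ratio of the odds after and before a one-unit increase equals e^b.

```python
import math

def odds(p):
    """Odds = P(event) / P(no event)."""
    return p / (1.0 - p)

def probability(b0, b, x):
    """Logistic probability with a single predictor."""
    u = b0 + b * x
    return math.exp(u) / (1.0 + math.exp(u))

b = 1.25                  # coefficient from the example
odds_ratio = math.exp(b)  # odds ratio = e^b

# Hypothetical intercept and income values (in RM1,000)
b0 = -3.0
ratio_of_odds = odds(probability(b0, b, 3.0)) / odds(probability(b0, b, 2.0))

print(round(odds_ratio, 2), round(ratio_of_odds, 2))
```

    The ratio is the same whichever pair of income values one unit apart is chosen, which is why a single odds ratio summarizes the effect of the predictor.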

  • One method of assessing the success of a model is to evaluate its

    ability to predict correctly the outcome

    The cut-off value for classification is .50: a case with a predicted

    probability greater than .5 is classified into the outcome coded 1

    SPSS provides:

    1.

    2.

    3. Overall percentage
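    As an illustration of classifying cases with a .50 cut-off, the Python sketch below builds a simple classification summary. The per-category percentages are an assumption about what such a summary typically reports, the data are hypothetical, and the function assumes both outcome categories are present.

```python
def classification_summary(observed, predicted_prob, cutoff=0.5):
    """Percentage of cases classified correctly, per category and overall."""
    # A case is classified as 1 when its predicted probability exceeds the cut-off
    predicted = [1 if p > cutoff else 0 for p in predicted_prob]
    correct_0 = sum(1 for y, c in zip(observed, predicted) if y == 0 and c == 0)
    correct_1 = sum(1 for y, c in zip(observed, predicted) if y == 1 and c == 1)
    n_0 = observed.count(0)
    n_1 = observed.count(1)
    return {
        "percent_correct_0": 100.0 * correct_0 / n_0,
        "percent_correct_1": 100.0 * correct_1 / n_1,
        "overall_percent": 100.0 * (correct_0 + correct_1) / len(observed),
    }

# Hypothetical outcomes and predicted probabilities
observed = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_prob = [0.9, 0.6, 0.4, 0.2, 0.45, 0.1, 0.8, 0.3]

summary = classification_summary(observed, predicted_prob)
print(summary)
```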

  • 1. Enter

    All variables entered simultaneously

    2. Sequential/Hierarchical

    Variables entered in blocks

    Blocks should be based on past research or theory being

    tested

    3. Stepwise

    Variables entered on the basis of statistical criteria (relative

    contribution to predict outcome)

    Should be employed only for exploratory analysis

  • (From Tabachnick)

    The following data set

    includes three variables:

    1. FALL

    0 - Not falling

    1 - Falling

    2. DIFFICULTY

    Rated on 1 to 3 scale

    3. SEASON

    1 - autumn

    2 - winter

    3 - spring

    Data set:

    Fall  Difficulty  Season
    1     3           1
    1     1           1
    0     1           3
    1     2           3
    1     3           2
    0     2           2
    0     1           2
    1     3           1
    1     2           3
    1     2           1
    0     2           2
    0     2           3
    1     3           2
    1     2           2
    0     3           1

    Data: Logistic Regression Tabachnick SKI

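  • Formula 1:

    $$\text{Prob}(\text{Fall}) = \hat{Y}_i = \frac{e^{-1.776 + (1.010)(\text{DIFF}) + (0.928)(\text{SEAS1}) + (-0.418)(\text{SEAS2})}}{1 + e^{-1.776 + (1.010)(\text{DIFF}) + (0.928)(\text{SEAS1}) + (-0.418)(\text{SEAS2})}}$$

    Formula 2:

    $$\text{Log-likelihood} = \sum_{i=1}^{N} \left[ Y_i \ln(\hat{Y}_i) + (1 - Y_i) \ln(1 - \hat{Y}_i) \right]$$

    Formula 3:

    $$\chi^2 = 2\left[ LL(B) - LL(0) \right]$$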

    Excel Computation

    ../Logistic Regression/6953 Logistic Regression Tabachnick.xlsx
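    The same computation can be sketched in Python instead of a spreadsheet. The sketch applies the ski equation to each case, then Formula 2 and Formula 3. It rests on two assumptions: SEAS1 and SEAS2 are indicator variables for season 1 and season 2 (season 3 as the reference category), and the 15 rows are read from the data set in Fall, Difficulty, Season order. For the intercept-only model, the predicted probability for every case is the observed proportion of falls.

```python
import math

# (fall, difficulty, season) for the 15 cases in the ski data set
ski = [
    (1, 3, 1), (1, 1, 1), (0, 1, 3), (1, 2, 3), (1, 3, 2),
    (0, 2, 2), (0, 1, 2), (1, 3, 1), (1, 2, 3), (1, 2, 1),
    (0, 2, 2), (0, 2, 3), (1, 3, 2), (1, 2, 2), (0, 3, 1),
]

def predicted_prob(difficulty, season):
    """Formula 1 with the coefficients from the ski equation."""
    seas1 = 1 if season == 1 else 0  # assumption: indicator for season 1
    seas2 = 1 if season == 2 else 0  # assumption: indicator for season 2
    u = -1.776 + 1.010 * difficulty + 0.928 * seas1 - 0.418 * seas2
    return math.exp(u) / (1.0 + math.exp(u))

def log_likelihood(observed, predicted):
    """Formula 2."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(observed, predicted)
    )

falls = [fall for fall, _, _ in ski]

# Bigger model: all predictors
ll_b = log_likelihood(falls, [predicted_prob(d, s) for _, d, s in ski])

# Smaller model: intercept only, so every case gets the observed proportion
base_rate = sum(falls) / len(falls)
ll_0 = log_likelihood(falls, [base_rate] * len(falls))

chi_square = 2.0 * (ll_b - ll_0)  # Formula 3
print(round(ll_0, 3), round(ll_b, 3), round(chi_square, 2))
```

    Because the coefficients are rounded to three decimals, the results can differ slightly from software output in the last digit.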
  • Table 1: Logistic Regression Analysis of Falling on a Ski Run as a

    Function of Difficulty of Run and Season

    Variables     B        Wald Test   p      Odds ratio
    Constant     -1.776    0.88        .347    .169
    Difficulty    1.010    1.27        .259   2.747
    Season(1)      .927    0.34        .560   2.527
    Season(2)     -.418    0.09        .763    .658

    Note: R² = .165 (Cox & Snell), .227 (Nagelkerke)

    Model χ²(3) = 2.710, p = .439

    May want to also report CI for Odds ratio

  • (Adapted from Andy Field)

    Variable Label/Value

    PERFORM Performance in Subject

    0 No

    1 Yes

    INTEREST Interest in the Subject

    0 No

    1 Yes

    AGE Age in years

    Data: Logistic Regression PERFORM

  • Table 2: Logistic Regression Analysis of Performance

    as a Function of Interest and Age

    Variables     B        Wald Test   p      Odds ratio
    Constant
    Interest
    Age

    Note: R² = ___ (Cox & Snell), ___ (Nagelkerke)

    Model χ²(_) = _____, p = ___

  • (From Tabachnick)

    Variable Label/Value

    WorkStatus Work status

    1 Working

    2 Housewives

    Children Presence of children

    0 No

    1 Yes

    Control Locus of control

    AttMar Attitudes toward current marriage

    AttHouse Attitudes toward housework

    AttRole Attitudes toward role of women

    Age Age group

    Educ Years of education

    Data: Logistic Regression Tabachnick WORK STATUS

  • Table 3: Logistic Regression Analysis of Work Status as a Function of

    Attitudinal Variables

    Variables                          B        Wald Test   p      Odds ratio
    Constant
    Locus of control
    Attitude towards marital status
    Attitude towards role of women
    Attitude towards housework

    Note: R² = ___ (Cox & Snell), ___ (Nagelkerke)

    Model χ²(_) = _____, p = ___

  • Table 3: Logistic Regression Analysis of Work Status as a Function of

    Attitudinal Variables and Children

    Variables                          B        Wald Test   p      Odds ratio
    Constant
    Presence of children
    Locus of control
    Attitude towards marital status
    Attitude towards role of women
    Attitude towards housework

    Note: R² = ___ (Cox & Snell), ___ (Nagelkerke)

    Model χ²(_) = _____, p = ___