Université d’Ottawa - Bio 4518 - Biostatistiques appliquées © Antoine Morin et Scott Findlay 2016-01-08 04:32 1 Logistic regression


Logistic regression


Logistic regression

• Member of the generalized linear model (GLM) family

• Unlike standard linear regression, the dependent variable is binary (0, 1), so that each case's value is either 0 or 1.

• Normally, 0 is taken to mean the absence of some attribute, 1 its presence.

• Logistic regression can be extended to the case where there are more than two possible values for the dependent variable (e.g. low, medium, high – multinomial regression)


Example: incidence of heart attacks in relation to age

[Figure: cardiaque (heart attack, 0 or 1) plotted against age (10 to 90)]

Linear regression is inappropriate because:

• Residuals are not normal

• Residuals are heteroscedastic

• Predicted values are hard to interpret (e.g. what does a predicted value of 0.3 mean when the outcome is either 0 or 1?)


Logistic regression: dependent variable

• Variable of interest is the probability p of obtaining a one as a function of predictor variables

• The magnitude of regression coefficients in the model depends on the distribution of the predictor variables in the two groups Y = 0 and Y = 1.

[Figure: two panels, each showing Y (0 or 1) plotted against X]


Dependent variable: logit(p)

$$y = \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$

$$p = \frac{e^{y}}{1+e^{y}} = \frac{e^{\operatorname{logit}(p)}}{1+e^{\operatorname{logit}(p)}}$$

[Figure: p (0 to 100%) plotted against logit(p), from -4 to 4]
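The logit transformation and its inverse can be sketched in a few lines. This is a minimal illustration in Python (not S-Plus); the function names `logit` and `inv_logit` are chosen here for convenience and do not come from the slides.

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(y):
    """Probability corresponding to a log-odds value y: e^y / (1 + e^y)."""
    return math.exp(y) / (1.0 + math.exp(y))

# Round trip: transform a probability to the logit scale and back.
for p in (0.1, 0.5, 0.9):
    y = logit(p)
    print(f"p = {p:.1f}   logit(p) = {y:+.3f}   back-transformed p = {inv_logit(y):.3f}")
```

A probability of 0.5 corresponds to a logit of 0; probabilities below 0.5 give negative logits and probabilities above 0.5 give positive ones.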


Logistic regression: model coefficients

• Negative regression coefficient means probability of success decreases with increasing value of predictor.

• Positive regression coefficient means probability of success increases with increasing value of predictor.

[Figure: two panels of Y (0 or 1) against X, one with coefficient > 0 and one with coefficient < 0]


Logistic regression: model coefficients

• The magnitude of the regression coefficient depends on how abruptly p changes with X, with large values indicating abrupt change.

[Figure: two panels of Y (0 or 1) against X, one with coefficient > 0 and small, one with coefficient > 0 and large]
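The effect of the sign and magnitude of the coefficient can be checked numerically. The sketch below assumes a single predictor and the model logit(p) = b0 + b1·x; the names `b0`, `b1`, `p_success` and the coefficient values are illustrative assumptions, not values from the slides.

```python
import math

def p_success(x, b0, b1):
    """Probability of success under logit(p) = b0 + b1 * x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

xs = [-2, -1, 0, 1, 2]

# Hypothetical coefficients: one negative, one small positive, one large positive.
for label, b1 in [("b1 < 0", -1.0), ("b1 > 0, small", 0.5), ("b1 > 0, large", 5.0)]:
    probs = [round(p_success(x, 0.0, b1), 3) for x in xs]
    print(f"{label:15s} {probs}")
```

With a negative coefficient the probabilities fall as x increases; with a positive coefficient they rise, and the larger the coefficient the more abruptly p switches from near 0 to near 1.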


Least squares estimation (LSE)

• An ordinary least squares (OLS) estimate of a model parameter is that which minimizes the sum of squared differences between observed and predicted values:

$$SS_R = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

• Predicted values are derived from some model whose parameters (written θ here) we wish to estimate:

$$\hat{y} = f(x, \theta)$$

[Figure: SS_R plotted against the parameter value, with the OLS estimate marked]
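As a small illustration of the least squares criterion, the sketch below assumes the model is a straight line (intercept plus slope times x) and uses made-up data; the helper names `ss_residual` and `ols_line` are hypothetical. It computes the closed-form OLS estimates and confirms that moving away from them increases the residual sum of squares.

```python
def ss_residual(xs, ys, intercept, slope):
    """Sum of squared differences between observed y and predicted intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def ols_line(xs, ys):
    """Closed-form OLS estimates (intercept, slope) for a straight-line model."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_line(xs, ys)
print("OLS estimates:", b0, b1)
print("SS_R at the OLS estimates:", ss_residual(xs, ys, b0, b1))
print("SS_R with a perturbed slope:", ss_residual(xs, ys, b0, b1 + 0.1))
```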


Maximum likelihood estimation (MLE)

• A maximum likelihood estimate (MLE) of a model parameter for a given distribution is that which maximizes the probability of generating the observed sample data.

• MLEs are obtained by maximizing the likelihood function

$$L = \prod_{i=1}^{n} \phi(x_i; \theta)$$

where φ(x_i; θ) denotes the probability (or density) of observation x_i given the parameter θ

• …or equivalently, by minimizing the negative log-likelihood function -log L, where

$$\log L = \sum_{i=1}^{n} \ln\big(\phi(x_i; \theta)\big)$$

[Figure: L or -log L plotted against the parameter value, with the MLE marked]
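To make the idea concrete, the sketch below finds the MLE of a single probability p from a made-up sample of 0/1 observations by evaluating -log L over a grid of candidate values. The data, the grid, and the name `neg_log_likelihood` are assumptions for illustration; a real fit would use a numerical optimizer rather than a grid.

```python
import math

def neg_log_likelihood(p, data):
    """-log L for independent 0/1 observations, each equal to 1 with probability p."""
    return -sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)

# Made-up sample: 6 ones out of 10 observations.
data = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]

# Candidate values 0.01, 0.02, ..., 0.99.
grid = [i / 100 for i in range(1, 100)]
p_hat = min(grid, key=lambda p: neg_log_likelihood(p, data))

print("grid-search MLE:", p_hat)
print("sample proportion of ones:", sum(data) / len(data))
```

For this simple model the value minimizing -log L coincides with the sample proportion of ones, which is the known closed-form MLE.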


How are the model parameters estimated?

• Estimated not by least squares, but rather by maximum likelihood

– Based on the likelihood of obtaining the observed results, evaluated for different values of the model parameters

– In principle, parameter estimates should converge to those maximizing the log-likelihood or, equivalently, minimizing -log L
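The iterative search for the maximum likelihood estimates can be sketched as follows. This is a minimal Newton-Raphson fit for one predictor, written in plain Python on made-up data; it is not the S-Plus implementation. (For the logit link, Newton-Raphson and Fisher scoring give the same steps.) The function name `fit_logistic` and the data are assumptions.

```python
import math

def fit_logistic(xs, ys, max_iter=50, tol=1e-8):
    """Maximum likelihood fit of logit(p) = b0 + b1 * x by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Gradient of log L with respect to b0 and b1.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Information matrix (negative Hessian of log L), a symmetric 2x2 matrix.
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve information * step = gradient for the 2x2 system.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) < tol and abs(d1) < tol:
            break
    return b0, b1

# Made-up data: the 0s and 1s overlap along x, so finite estimates exist.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print("intercept:", b0, "slope:", b1)
```

If the 0s and 1s were perfectly separated along x, the likelihood would keep increasing as the slope grows and this simple loop would not converge to finite estimates.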


Hypothesis testing

• Likelihood

– Deviance = -2 log L

– Is approximately distributed as chi-square

– Measures the variation unexplained by the fitted model, analogous to the residual sum of squares.

• Model comparison

– Change in deviance when model terms are added (or deleted) is also approximately distributed as chi-square, so one can test hypotheses relating to individual model terms.


Model assumptions

• Observations are independent

• Dependent variable has a binomial distribution

• Little error in measurement of dependent variables.


Logistic regression in SPlus

*** Generalized Linear Model ***

Call: glm(formula = cardiaque ~ age, family = binomial(link = logit), data = SDF12, na.action = na.exclude, control = list(epsilon = 0.0001, maxit = 50, trace = F))

Deviance Residuals:
       Min         1Q    Median         3Q      Max
 -1.545637 -0.5732664 -0.272312 -0.1404323 2.679875

Coefficients:
                  Value  Std. Error   t value
(Intercept) -7.76838060 0.376403465 -20.63844
        age  0.09557905 0.005097055  18.75182

(Dispersion Parameter for Binomial family taken to be 1 )

Null Deviance: 2050.515 on 1999 degrees of freedom

Residual Deviance: 1490.001 on 1998 degrees of freedom

Number of Fisher Scoring Iterations: 4
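The change in deviance reported in this output can be used to test the effect of age. The sketch below, in Python rather than S-Plus, takes the two deviances and their degrees of freedom from the output above and compares their difference to a chi-square distribution with 1 degree of freedom. The helper `chi2_sf_1df` is a hypothetical name; it relies on the identity that, for 1 degree of freedom only, the chi-square upper tail equals erfc(sqrt(x / 2)).

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

null_deviance = 2050.515       # intercept-only model, 1999 degrees of freedom
residual_deviance = 1490.001   # model including age, 1998 degrees of freedom

change = null_deviance - residual_deviance
df = 1999 - 1998

p_value = chi2_sf_1df(change)
print("change in deviance:", round(change, 3), "on", df, "degree of freedom")
print("approximate p-value:", p_value)
```

The drop in deviance is very large relative to the 5% critical value for 1 degree of freedom (about 3.84), so the age term would be retained.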


Incidence of heart attack in relation to age

[Figure: cardiaque (0 or 1) plotted against age (30 to 90)]

$$y = \operatorname{logit}(p) = -7.77 + 0.096\,\text{Age}$$

$$p = \frac{e^{y}}{1+e^{y}} = \frac{e^{\operatorname{logit}(p)}}{1+e^{\operatorname{logit}(p)}} = \frac{e^{-7.77 + 0.096\,\text{Age}}}{1+e^{-7.77 + 0.096\,\text{Age}}}$$
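The fitted model can be turned into predicted probabilities directly. The sketch below, in Python rather than S-Plus, uses the intercept (-7.76838060) and age coefficient (0.09557905) from the S-Plus output for this example; the function name `p_heart_attack` and the ages chosen are illustrative assumptions.

```python
import math

B0 = -7.76838060   # (Intercept) from the S-Plus output
B1 = 0.09557905    # age coefficient from the S-Plus output

def p_heart_attack(age):
    """Predicted probability of a heart attack at a given age under the fitted model."""
    y = B0 + B1 * age          # value on the logit scale
    return math.exp(y) / (1.0 + math.exp(y))

for age in (30, 50, 70, 90):
    print(f"age {age}: predicted p = {p_heart_attack(age):.3f}")

# Age at which the predicted probability reaches 0.5 (where logit(p) = 0).
age_at_half = -B0 / B1
# Multiplicative change in the odds for each additional year of age.
odds_ratio_per_year = math.exp(B1)
print("age at p = 0.5:", age_at_half)
print("odds ratio per year:", odds_ratio_per_year)
```

Under these estimates the predicted probability reaches 0.5 at roughly 81 years, and each additional year multiplies the odds of a heart attack by about 1.10.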


Presence of post-operative kyphosis using logistic regression

• Kyphosis: a binary variable indicating the presence/absence of a postoperative spinal deformity called kyphosis.

• Age: the age of the child in months.

• Number: the number of vertebrae involved in the spinal operation.

• Start: the beginning of the range of the vertebrae involved in the operation.


Evidence that the distribution of predictor variables differs among levels of response variable


The model


Testing hypotheses
