Université d’Ottawa - Bio 4518 - Biostatistiques appliquées © Antoine Morin et Scott Findlay 22-08-24 04:34 1 Logistic regression Logistic regression

Logistic regression

• Member of the GLM (generalized linear model) family.
• Unlike standard linear regression, the dependent variable is binary (0, 1), so each case's value is either 0 or 1.
• Normally, 0 is taken to mean the absence of some attribute, and 1 its presence.
• Logistic regression can be extended to the case where the dependent variable has more than two possible values (e.g. low, medium, high: multinomial regression).

Example: incidence of heart attacks in relation to age

[Figure: cardiaque (heart attack, 0 or 1) plotted against age (10 to 90)]

Linear regression is inappropriate because:

• Residuals are not normal
• Residuals are heteroscedastic
• Predicted values are nonsensical (e.g. what does a predicted value of 0.3 mean when the outcome can only be 0 or 1?)
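To make the last point concrete, here is a minimal sketch that fits a straight line to a 0/1 outcome by least squares. The eight data points are invented for illustration and are not the course data; with them, the fitted line predicts a negative value for the youngest age.

```python
# Hypothetical toy data (not from the course): age and a 0/1 outcome.
ages = [20, 30, 40, 50, 60, 70, 80, 90]
outcome = [0, 0, 0, 0, 1, 0, 1, 1]


def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


intercept, slope = fit_line(ages, outcome)
predictions = [intercept + slope * age for age in ages]

# Predictions are neither 0 nor 1, and the first one is below 0.
for age, pred in zip(ages, predictions):
    print(f"age {age}: predicted {pred:.3f}")
```

A straight line is unbounded, so for a sufficiently wide range of ages it will always leave the 0 to 1 interval.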

Logistic regression: dependent variable

• The variable of interest is the probability p of obtaining a one, as a function of the predictor variables.
• The magnitude of the regression coefficients in the model depends on the distribution of the predictor variables in the two groups Y = 0 and Y = 1.

[Figure: two panels, each plotting Y (0 or 1) against X]

Dependent variable: logit(p)

$$y = \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$

$$p = \frac{e^{y}}{1+e^{y}} = \frac{e^{\operatorname{logit}(p)}}{1+e^{\operatorname{logit}(p)}}$$

[Figure: p (in percent, 0 to 100) plotted against logit(p) from -4 to 4]
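The logit transformation and its inverse can be sketched in a few lines. The function names `logit` and `inverse_logit` are chosen here for illustration; the inverse is written as 1 / (1 + e^(-y)), which is algebraically the same as e^y / (1 + e^y).

```python
import math


def logit(p):
    """Log-odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(y):
    """Probability corresponding to a logit value y."""
    return 1.0 / (1.0 + math.exp(-y))


# Tabulate p (in percent) over the same logit range as the plot, -4 to 4.
for y in (-4, -2, 0, 2, 4):
    print(f"logit = {y:+d}  ->  p = {100 * inverse_logit(y):.1f}%")
```

A logit of 0 corresponds to p = 0.5, and the curve flattens toward 0 and 1 as the logit becomes strongly negative or positive.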

Logistic regression: model coefficients

• A negative regression coefficient means the probability of success decreases with increasing value of the predictor.
• A positive regression coefficient means the probability of success increases with increasing value of the predictor.

[Figure: two panels plotting Y (0 or 1) against X, one for a coefficient > 0 and one for a coefficient < 0]

Logistic regression: model coefficients

• The magnitude of the regression coefficient depends on how abruptly p changes with X, with large values indicating abrupt change.

[Figure: two panels plotting Y (0 or 1) against X, one for a small positive coefficient and one for a large positive coefficient]
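The effect of the sign and size of a coefficient on the fitted curve can be illustrated as follows. This is only a sketch: the intercept of 0 and the slopes -1, 0.5 and 5 are arbitrary values chosen for illustration, not estimates from any data set.

```python
import math


def probability(x, intercept, slope):
    """p = 1 / (1 + exp(-(intercept + slope * x)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def curve(slope, xs=(-2, -1, 0, 1, 2)):
    """Probabilities along a grid of X values, with the intercept fixed at 0."""
    return [probability(x, 0.0, slope) for x in xs]


negative = curve(-1.0)        # p falls as X increases
small_positive = curve(0.5)   # p rises gently
large_positive = curve(5.0)   # p rises abruptly around X = 0

for label, values in [("slope -1.0", negative),
                      ("slope +0.5", small_positive),
                      ("slope +5.0", large_positive)]:
    print(label, [round(p, 3) for p in values])
```

With the large slope, p moves from near 0 to near 1 between X = -1 and X = 1, whereas with the small slope it changes only modestly over the same interval.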

Least squares estimation (LSE)

• An ordinary least squares (OLS) estimate of a model parameter is the value that minimizes the sum of squared differences between observed and predicted values:

$$SS_R = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

• Predicted values are derived from some model whose parameters we wish to estimate:

$$\hat{y} = f(x, \theta)$$

[Figure: SS_R plotted against the parameter value, with the OLS estimate marked]
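As a small illustration of "the value that minimizes SS_R", the sketch below scans candidate slopes for a one-parameter model (a line through the origin) and compares the best candidate with the closed-form least squares slope. The four data points and the model are invented for illustration.

```python
def ss_residual(slope, x, y):
    """Residual sum of squares for the model y_hat = slope * x."""
    return sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

# Scan candidate slopes 1.00, 1.01, ..., 3.00 and keep the one with the smallest SS_R.
candidates = [1.0 + 0.01 * k for k in range(201)]
best_slope = min(candidates, key=lambda b: ss_residual(b, x, y))

# Closed-form OLS slope for a line through the origin, for comparison.
exact_slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

print(f"grid search: {best_slope:.2f}, closed form: {exact_slope:.4f}")
```

For linear models the minimum has a closed form, so no search is needed; the scan is shown only to mirror the picture of SS_R as a curve with a single minimum.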

Maximum likelihood estimation (MLE)

• A maximum likelihood estimate (MLE) of a model parameter for a given distribution is the value that maximizes the probability of generating the observed sample data.
• MLEs are obtained by maximizing the likelihood function

$$L = \prod_{i=1}^{n} f(x_i; \theta)$$

• …or equivalently, by minimizing the negative log-likelihood, -log L, where

$$\log L = \sum_{i=1}^{n} \ln\left(f(x_i; \theta)\right)$$

Here f(x_i; θ) is the probability (or probability density) of observation x_i given the parameter value θ.

[Figure: L or -log L plotted against the parameter value, with the MLE marked]
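A minimal sketch of the idea for the simplest binary case: estimating a single probability p from a sample of zeros and ones by minimizing -log L over a grid of candidate values. The sample below is invented for illustration.

```python
import math

# Hypothetical sample of a binary variable: 6 ones out of 10 observations.
sample = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]


def neg_log_likelihood(p, data):
    """-log L for independent 0/1 observations with P(1) = p."""
    return -sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)


# Evaluate -log L for p = 0.01, 0.02, ..., 0.99 and keep the minimizer.
candidates = [k / 100 for k in range(1, 100)]
p_mle = min(candidates, key=lambda p: neg_log_likelihood(p, sample))

print(f"MLE of p: {p_mle:.2f}, sample proportion: {sum(sample) / len(sample):.2f}")
```

In this one-parameter case the minimizer coincides with the sample proportion of ones, which gives a quick check on the search.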

How are the model parameters estimated?

• Estimated not by least squares, but by maximum likelihood
  – Based on an estimate of the likelihood of obtaining the observed results under different values of the model parameters
  – In principle, parameter estimates should converge to those maximizing the log-likelihood (equivalently, minimizing -log L)
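The iterative search can be sketched for a model with one predictor, logit(p) = b0 + b1*x, using Newton-Raphson steps on the log-likelihood. This is an illustrative sketch, not the routine used by any particular statistics package; the twelve data points and the function names are invented here.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit logit(p) = b0 + b1*x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to b0 and b1.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix)^-1 times the gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def neg_log_likelihood(b0, b1, x, y):
    """-log L of the binary observations under logit(p) = b0 + b1*x."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total -= math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


# Hypothetical data: ages and a 0/1 outcome with overlapping groups.
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
outcome = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]

b0, b1 = fit_logistic(ages, outcome)
print(f"b0 = {b0:.3f}, b1 = {b1:.4f}, "
      f"-log L = {neg_log_likelihood(b0, b1, ages, outcome):.3f}")
```

The search only converges to finite estimates when the two outcome groups overlap along the predictor; with perfectly separated groups the likelihood keeps increasing as the slope grows.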

Hypothesis testing

• Likelihood
  – Deviance = -2 log L
  – Is approximately distributed as chi-square
  – Measures the variation unexplained by the fitted model, analogous to the residual sum of squares.
• Model comparison
  – Change in deviance when model terms are added (or deleted) is also approximately distributed as chi-square, so hypotheses relating to individual model terms can be tested.
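The model comparison can be written out as follows. The notation is introduced here for illustration and does not come from the slides: L_0 and D_0 are the likelihood and deviance of the reduced model, L_1 and D_1 those of the fuller model, and k is the number of parameters added.

```latex
\[
D_0 - D_1 \;=\; -2\log L_0 - \left(-2\log L_1\right)
          \;=\; -2\log\frac{L_0}{L_1}
          \;\stackrel{\text{approx.}}{\sim}\; \chi^2_{k}
\]
```

A large drop in deviance relative to a chi-square distribution with k degrees of freedom is evidence that the added terms improve the fit.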

Model assumptions

• Observations are independent
• Dependent variable has a binomial distribution
• Little error in measurement of dependent variables.

Logistic regression in SPlus

*** Generalized Linear Model ***

Call: glm(formula = cardiaque ~ age, family = binomial(link = logit), data = SDF12, na.action = na.exclude, control = list(epsilon = 0.0001, maxit = 50, trace = F))

Deviance Residuals:
       Min         1Q    Median         3Q      Max
 -1.545637 -0.5732664 -0.272312 -0.1404323 2.679875

Coefficients:
                  Value  Std. Error   t value
(Intercept) -7.76838060 0.376403465 -20.63844
        age  0.09557905 0.005097055  18.75182

(Dispersion Parameter for Binomial family taken to be 1 )

    Null Deviance: 2050.515 on 1999 degrees of freedom
Residual Deviance: 1490.001 on 1998 degrees of freedom

Number of Fisher Scoring Iterations: 4
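The deviances reported above can be used to test the age term. The sketch below takes the numbers straight from the output and computes the change in deviance and its chi-square upper-tail probability; the variable names are chosen here, and the tail formula used is the one specific to 1 degree of freedom.

```python
import math

# Values copied from the output above.
null_deviance, null_df = 2050.515, 1999
residual_deviance, residual_df = 1490.001, 1998

# Change in deviance when age is added to the intercept-only model.
change = null_deviance - residual_deviance
df = null_df - residual_df

# Upper-tail probability of a chi-square variable with 1 degree of freedom:
# P(X > c) = erfc(sqrt(c / 2)). This shortcut holds only for df = 1.
p_value = math.erfc(math.sqrt(change / 2.0))

# Ratio of the age coefficient to its standard error (the "t value" column).
ratio = 0.09557905 / 0.005097055

print(f"change in deviance = {change:.3f} on {df} df, p = {p_value:.3g}")
print(f"coefficient / standard error = {ratio:.3f}")
```

The change in deviance is 560.514 on 1 degree of freedom, far beyond any conventional chi-square critical value, so the age term is strongly supported.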

Incidence of heart attack in relation to age

[Figure: cardiaque (0 or 1) plotted against age (30 to 90)]

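$$y = \operatorname{logit}(p) = -7.77 + 0.096\,\mathrm{Age}$$

$$p = \frac{e^{y}}{1+e^{y}} = \frac{e^{\operatorname{logit}(p)}}{1+e^{\operatorname{logit}(p)}} = \frac{e^{-7.77 + 0.096\,\mathrm{Age}}}{1+e^{-7.77 + 0.096\,\mathrm{Age}}}$$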
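To see what the fitted equation implies, the sketch below converts ages into predicted probabilities using the intercept and age coefficient from the SPlus output (-7.76838060 and 0.09557905). The function and variable names are chosen here for illustration.

```python
import math

# Coefficients copied from the SPlus output for cardiaque ~ age.
INTERCEPT = -7.76838060
AGE_COEF = 0.09557905


def heart_attack_probability(age):
    """Predicted probability p = e^y / (1 + e^y), with y = intercept + coef * age."""
    y = INTERCEPT + AGE_COEF * age
    return math.exp(y) / (1.0 + math.exp(y))


for age in (30, 50, 70, 90):
    print(f"age {age}: p = {heart_attack_probability(age):.3f}")

# Age at which logit(p) = 0, i.e. p = 0.5.
age_at_half = -INTERCEPT / AGE_COEF

# Multiplicative change in the odds p / (1 - p) for each additional year.
odds_ratio_per_year = math.exp(AGE_COEF)

print(f"p = 0.5 at age {age_at_half:.1f}; odds ratio per year = {odds_ratio_per_year:.3f}")
```

With these coefficients the predicted probability reaches 0.5 at about 81 years, and each additional year multiplies the odds of a heart attack by about 1.10.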

Presence of post-operative kyphosis using logistic regression

• Kyphosis: a binary variable indicating the presence/absence of a postoperative spinal deformity called kyphosis.
• Age: the age of the child in months.
• Number: the number of vertebrae involved in the spinal operation.
• Start: the beginning of the range of vertebrae involved in the operation.

Evidence that the distribution of predictor variables differs among levels of response variable

The model

Testing hypotheses