Université d’Ottawa - Bio 4518 - Biostatistiques appliquées© Antoine Morin et Scott Findlay23-04-21 17:52
Logistic regression
• Member of the GLM (generalized linear model) family.
• Unlike standard linear regression, the dependent variable is binary (0, 1), so each case's value is either 0 or 1.
• Normally, 0 is taken to mean the absence of some attribute and 1 its presence.
• Logistic regression can be extended to the case where the dependent variable has more than two possible values (e.g. low, medium, high: multinomial regression).
Example: incidence of heart attacks in relation to age
[Figure: cardiaque (heart attack, 0 or 1) plotted against age; x-axis age from 10 to 90, y-axis from -0.2 to 1.0.]
Linear regression is inappropriate because:
• Residuals are not normal
• Residuals are heteroscedastic
• Predicted values are nonsense (e.g. what does a predicted value of 0.3 mean?)
Logistic regression: dependent variable
• The variable of interest is the probability p of obtaining a one, as a function of the predictor variables.
• The magnitude of the regression coefficients in the model depends on the distribution of the predictor variables in the two groups, Y = 0 and Y = 1.
[Figure: two panels of Y (0 to 1) plotted against X.]
Dependent variable: logit(p)
$$y = \operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$
$$p = \frac{e^{y}}{1+e^{y}} = \frac{e^{\operatorname{logit}(p)}}{1+e^{\operatorname{logit}(p)}}$$
[Figure: p (0 to 100) plotted against logit (-4 to 4).]
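The logit transform and its inverse can be checked numerically. The following is a minimal Python sketch (Python is used here only for illustration; the function names `logit` and `inv_logit` are assumptions, not part of the course material):

```python
import math

def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inv_logit(y):
    """Inverse of logit: maps any real y back to a probability."""
    return math.exp(y) / (1.0 + math.exp(y))

# Round trip: p -> logit(p) -> p
for p in (0.1, 0.5, 0.9):
    y = logit(p)
    print(f"p = {p:.1f}  logit = {y:+.3f}  back-transformed p = {inv_logit(y):.3f}")
```

A probability of 0.5 corresponds to a logit of 0, and probabilities symmetric about 0.5 give logits of equal size and opposite sign.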
Logistic regression: model coefficients
• A negative regression coefficient means the probability of success decreases with increasing value of the predictor.
• A positive regression coefficient means the probability of success increases with increasing value of the predictor.
[Figure: two panels of Y (0 to 1) plotted against X, one labelled "coefficient > 0" and the other "coefficient < 0".]
Logistic regression: model coefficients
• The magnitude of the regression coefficient depends on how abruptly p changes with X, with large values indicating abrupt change.
[Figure: two panels of Y (0 to 1) plotted against X, one labelled "coefficient > 0, small" and the other "coefficient > 0, large".]
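One way to see why a larger coefficient means a more abrupt change is to differentiate p with respect to X. The sketch below assumes a single-predictor model logit(p) = α + βX; the symbols α and β are introduced here for illustration only.

```latex
\[
p = \frac{e^{\alpha + \beta X}}{1 + e^{\alpha + \beta X}}
\quad\Longrightarrow\quad
\frac{dp}{dX} = \beta\, p\,(1 - p)
\]
```

The slope is steepest where p = 0.5, where it equals β/4, so doubling β doubles the maximum rate at which p changes with X.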
Least squares estimation (LSE)
• An ordinary least squares (OLS) estimate of a model parameter is that which minimizes the sum of squared differences between observed and predicted values:
$$SS_R = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
• Predicted values are derived from some model whose parameters we wish to estimate:
$$\hat{y} = f(x, \theta)$$
[Figure: SS_R plotted against θ, with the minimum marked as the OLS estimate.]
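The idea of choosing the parameter that minimizes the residual sum of squares can be sketched in a few lines of Python. The data values and the one-parameter model ŷ = θx below are made up purely for illustration; they do not come from the course.

```python
# Illustrative (made-up) data and a one-parameter model: y_hat = theta * x
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

def ss_r(theta):
    """Residual sum of squares for a candidate value of theta."""
    return sum((yi - theta * xi) ** 2 for xi, yi in zip(x, y))

# For this model the minimizing value has a closed form: sum(x*y) / sum(x^2)
theta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

print(f"theta_hat = {theta_hat:.4f}, SS_R = {ss_r(theta_hat):.4f}")
# Moving theta away from theta_hat in either direction increases SS_R
print(f"SS_R at theta_hat - 0.1: {ss_r(theta_hat - 0.1):.4f}")
print(f"SS_R at theta_hat + 0.1: {ss_r(theta_hat + 0.1):.4f}")
```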
Maximum likelihood estimation (MLE)
• A maximum likelihood estimate (MLE) of a model parameter for a given distribution is that which maximizes the probability of generating the observed sample data.
• MLEs are obtained by maximizing the likelihood function
$$L = \prod_{i=1}^{n} L(x_i; \theta)$$
• …or equivalently, by minimizing the negative log-likelihood function
$$\log L = \sum_{i=1}^{n} \ln\left(L(x_i; \theta)\right)$$
[Figure: L and -log L plotted against θ, with the MLE marked at the maximum of L and the minimum of -log L.]
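As a small worked case, consider estimating a single probability p from a sample of zeros and ones. The Python sketch below uses a made-up sample (not course data) and compares a brute-force search for the minimum of -log L with the known closed-form answer, the sample proportion.

```python
import math

# Illustrative (made-up) binary sample
sample = [1, 0, 0, 1, 1, 0, 1, 1]

def neg_log_lik(p, data):
    """Negative log-likelihood of a Bernoulli(p) model for 0/1 data."""
    return -sum(math.log(p) if xi == 1 else math.log(1.0 - p) for xi in data)

# Brute-force search over candidate values of p
grid = [i / 1000 for i in range(1, 1000)]
p_grid = min(grid, key=lambda p: neg_log_lik(p, sample))

# Closed-form MLE for a Bernoulli probability: the sample proportion
p_closed = sum(sample) / len(sample)

print(f"grid search MLE: {p_grid}, closed form: {p_closed}")
```

Both approaches agree because -log L has a single minimum at the proportion of ones in the sample.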
How are the model parameters estimated?
• Estimated not by least squares, but rather by maximum likelihood
– Based on an estimate of the likelihood of obtaining the observed results under different values of the model parameters
– In principle, parameter estimates should converge to those maximizing the log-likelihood, or equivalently minimizing -log L
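The iterative convergence described above can be sketched as a Newton-Raphson loop, which for the logit link coincides with Fisher scoring. This is a minimal illustration in Python under stated assumptions, not the algorithm as implemented in any particular package: the function name `fit_logistic`, the single-predictor model logit(p) = b0 + b1·x, and the data are all hypothetical.

```python
import math

def fit_logistic(x, y, max_iter=25, tol=1e-10):
    """Fit logit(p) = b0 + b1*x by Newton-Raphson (Fisher scoring)."""
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Score vector: first derivatives of log L with respect to b0 and b1
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Information matrix (2 x 2, symmetric)
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        # Newton step: inverse information matrix times score vector
        d0 = (i11 * g0 - i01 * g1) / det
        d1 = (i00 * g1 - i01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1, iteration

# Illustrative (made-up) data in which the two groups overlap
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1, n_iter = fit_logistic(x, y)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, iterations = {n_iter}")
```

If the two groups did not overlap at all (perfect separation), the likelihood would have no finite maximum and a loop like this would not converge, so the example data deliberately overlap.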
Hypothesis testing
• Likelihood
– Deviance = -2 log L
– Is approximately distributed as chi-square
– Measures the variation unexplained by the fitted model, analogous to the residual sum of squares
• Model comparison
– Change in deviance when model terms are added (or deleted) is also approximately distributed as chi-square, so hypotheses relating to individual model terms can be tested.
Model assumptions
• Observations are independent
• Dependent variable has a binomial distribution
• Little error in measurement of dependent variables
Logistic regression in SPlus
*** Generalized Linear Model ***
Call: glm(formula = cardiaque ~ age, family = binomial(link = logit), data = SDF12, na.action = na.exclude, control
= list(epsilon = 0.0001, maxit = 50, trace = F))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.545637 -0.5732664 -0.272312 -0.1404323 2.679875
Coefficients:
Value Std. Error t value
(Intercept) -7.76838060 0.376403465 -20.63844
age 0.09557905 0.005097055 18.75182
(Dispersion Parameter for Binomial family taken to be 1 )
Null Deviance: 2050.515 on 1999 degrees of freedom
Residual Deviance: 1490.001 on 1998 degrees of freedom
Number of Fisher Scoring Iterations: 4
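The deviance comparison can be applied directly to the numbers in this output. The sketch below is in Python for illustration (the slides themselves use S-Plus) and takes only the null and residual deviances printed above. The tail probability uses the identity P(χ² > x) = erfc(√(x/2)), which holds only for 1 degree of freedom; that shortcut is a choice made here, not something S-Plus reports.

```python
import math

# Values copied from the S-Plus output above
null_deviance, null_df = 2050.515, 1999
resid_deviance, resid_df = 1490.001, 1998

# Change in deviance from adding age to the intercept-only model
change = null_deviance - resid_deviance
df = null_df - resid_df

# Chi-square upper tail probability, valid for 1 degree of freedom only
p_value = math.erfc(math.sqrt(change / 2.0))

print(f"change in deviance = {change:.3f} on {df} df, p = {p_value:.3g}")
```

The drop in deviance of about 560.5 on 1 degree of freedom is far larger than any conventional chi-square critical value, so the age term is retained.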
Incidence of heart attack in relation to age
[Figure: cardiaque plotted against age; x-axis age from 30 to 90, y-axis from -0.1 to 0.9.]
$$y = \operatorname{logit}(p) = -7.77 + 0.096\,\mathrm{Age}$$
$$p = \frac{e^{y}}{1+e^{y}} = \frac{e^{-7.77 + 0.096\,\mathrm{Age}}}{1 + e^{-7.77 + 0.096\,\mathrm{Age}}}$$
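The fitted equation can be turned into predicted probabilities. The following Python sketch uses the intercept and age coefficient printed in the S-Plus output (-7.76838060 and 0.09557905); the function name `predict` and the ages chosen are illustrative assumptions.

```python
import math

# Coefficients copied from the S-Plus output
b0 = -7.76838060   # intercept
b1 = 0.09557905    # age

def predict(age):
    """Predicted probability of a heart attack at a given age."""
    y = b0 + b1 * age          # linear predictor, i.e. logit(p)
    return math.exp(y) / (1.0 + math.exp(y))

for age in (30, 50, 70, 90):
    print(f"age {age}: p = {predict(age):.3f}")

# Age at which the predicted probability is 0.5 (where logit(p) = 0)
age_half = -b0 / b1
# Multiplicative change in the odds for each additional year of age
odds_ratio_per_year = math.exp(b1)

print(f"p = 0.5 at age {age_half:.1f}")
print(f"odds ratio per year = {odds_ratio_per_year:.3f}")
```

Because the model is linear on the logit scale, each extra year multiplies the odds by the same factor, while the change in p itself depends on where along the curve the age falls.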
Presence of post-operative kyphosis using logistic regression
• Kyphosis: a binary variable indicating the presence/absence of a postoperative spinal deformity called kyphosis.
• Age: the age of the child in months.
• Number: the number of vertebrae involved in the spinal operation.
• Start: the beginning of the range of the vertebrae involved in the operation.
Evidence that the distribution of predictor variables differs among levels of response variable
The model
Testing hypotheses