Generalized Linear Models
SJSU
November 17, 2016
GLM Overview
Analysis methods
ANOVA: discrete x, continuous y
Regression: discrete/continuous x, continuous y
GLM: discrete/continuous x, discrete/continuous y
Generalized linear models (GLMs) are the broadest category
GLM types
Logistic regression
Normal regression - needs the usual assumptions (independence, normality, constant variance)
Poisson regression
Negative binomial regression (maybe)
Response (y) type:
Normally distributed
categorical - disease present/absent
categorical - disease low/medium/high
integer valued - number of chocolate chips
Logistic Regression (GLM)
Logistic Regression
Responses (yi) correspond to covariates (xi)
Response is binary data
male-female, healthy-sick, alive-dead, success-failure, win-loss
To simplify notation we denote the response with 0 or 1
We are interested in the probability (p) of y=1
Binary response data (Binomial): F−1(p) = β0 + β1x
Standard regression: y = β0 + β1x + εi (the standard link is the identity, F−1(y) = y)
Link Functions
Link function:
- links the probability (p) to the covariates
- F is a monotone cumulative distribution function
- common choices: logit link, probit link, complementary log-log (cloglog) link

Logit link: F−1(pi) = logit(pi) = log(pi/(1 − pi))

log(pi/(1 − pi)) = β0 + β1xi

pi = exp(β0 + β1xi)/(1 + exp(β0 + β1xi))
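As a quick numeric check (a Python sketch; the slides themselves use R), the logit link and its inverse recover each other, and the inverse always maps the linear predictor into (0, 1):

```python
import math

def logit(p):
    # F^-1(p): maps a probability in (0,1) to the whole real line
    return math.log(p / (1 - p))

def inv_logit(eta):
    # p = exp(eta) / (1 + exp(eta)): the logistic CDF
    return math.exp(eta) / (1 + math.exp(eta))

# round trip: logit and inv_logit are inverses
p = 0.3
assert abs(inv_logit(logit(p)) - p) < 1e-12

# the linear predictor can be any real number; p stays inside (0, 1)
for eta in (-10, -1, 0, 1, 10):
    assert 0 < inv_logit(eta) < 1
assert inv_logit(0) == 0.5
```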
Multiple measurements correspond to the same covariate (xi): turn the multiple measurements into binomial counts

xi covariate
ni number of responses corresponding to xi
yi number of responses equal to 1

log(pi/(1 − pi)) = β0 + β1xi
Solving for βs
Newton-Raphson algorithm (e.g., the same method as for finding the parameters of a Gamma distribution) is used to solve the non-linear equation

log(pi/(1 − pi)) = β0 + β1xi

Once the βs are found, solve for pi:

pi = exp(β0 + β1xi)/(1 + exp(β0 + β1xi))
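To make the Newton-Raphson idea concrete, here is a minimal Python sketch (the course uses R; the data here are made up purely for illustration) of the iteratively reweighted least-squares form of Newton-Raphson for a one-covariate logistic model:

```python
import math

# made-up binary data (x, y); not from the slides
x = [0, 1, 2, 3, 4, 5]
y = [0, 0, 1, 0, 1, 1]

b0, b1 = 0.0, 0.0
for _ in range(25):
    eta = [b0 + b1 * xi for xi in x]                 # linear predictor
    p = [1 / (1 + math.exp(-e)) for e in eta]        # inverse logit
    w = [pi * (1 - pi) for pi in p]                  # Newton/IRLS weights
    # working response for the weighted least-squares step
    z = [e + (yi - pi) / wi for e, yi, pi, wi in zip(eta, y, p, w)]
    # solve the 2x2 weighted normal equations by hand
    sw   = sum(w)
    swx  = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swz  = sum(wi * zi for wi, zi in zip(w, z))
    swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    det = sw * swxx - swx * swx
    b0, b1 = (swxx * swz - swx * swxz) / det, (sw * swxz - swx * swz) / det

# probability of y=1 increases with x for this data
assert b1 > 0
```

glm() performs essentially this iteration (with a full design matrix) until the coefficients stop changing.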
Assumptions
Independence still needs to be checked
There is no Normality assumption
There is no constant variance assumption
* the variance is a function of the mean
E(yi) = pi and V(yi) = pi(1 − pi) = E(yi)(1 − E(yi))
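The variance function can be checked numerically (a Python sketch; the slides use R): V(y) = p(1 − p) peaks at p = 0.5 and shrinks toward 0 at the extremes.

```python
# Bernoulli variance as a function of the mean p
def bernoulli_var(p):
    return p * (1 - p)

ps = [i / 100 for i in range(1, 100)]
vs = [bernoulli_var(p) for p in ps]

assert max(vs) == bernoulli_var(0.5)   # variance is largest at p = 0.5
assert bernoulli_var(0.5) == 0.25
assert bernoulli_var(0.01) < bernoulli_var(0.1) < bernoulli_var(0.5)
```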
Example 1 - logistic regression
C-section infection data: a C-section is major surgery to deliver a baby that can cause excessive bleeding, blood clots, infection, pain, longer hospital stays, and longer recovery. The data is from Example 17.1 and concerns infection from a C-section. The response variable (y) is occurrence or non-occurrence of infection. Three covariates (x), each at two levels:
x1 nonplan - planned=0 and unplanned=1
x2 riskfac - diabetes, overweight, previous C-section: present=1, not=0
x3 antibio - antibiotics were given=1 or not=0
Example 1 - Data
                        Planned              No Plan
                   yes   no  total      yes   no  total
Antibiotics
  Risk (yes)         1   17     18       11   87     98
  Risk (no)          0    2      2        0    0      0
No Antibiotics
  Risk (yes)        28   30     58       23    3     26
  Risk (no)          8   32     40        0    9      9
Example 1 - Code
log(pinfection/pnoinfection) = β0 + β1 ∗ noplan + β2 ∗ riskfac + β3 ∗ antibio
infection=c(1,11,0,0,28,23,8,0)
total=c(18,98,2,0,58,26,40,9)
proportion=infection/total
noplan=c(0,1,0,1,0,1,0,1)
riskfac=c(1,1,0,0,1,1,0,0)
antibio=c(1,1,1,1,0,0,0,0)

reg1=glm(proportion ~ noplan+riskfac+antibio, family="binomial", weights=total)
summary(reg1)
proportion = yi/ni
weights = ni

reg1=glm(proportion ~ noplan+riskfac+antibio, family="binomial", weights=ni)
summary(reg1)
Example 1 - Output
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.8926     0.4124  -4.589 4.45e-06 ***
noplan        1.0720     0.4254   2.520   0.0117 *
riskfac       2.0299     0.4553   4.459 8.25e-06 ***
antibio      -3.2544     0.4813  -6.761 1.37e-11 ***

Null deviance: 83.491 on 6 degrees of freedom
Residual deviance: 10.997 on 3 degrees of freedom
(1 observation deleted due to missingness)
AIC: 36.178
When antibiotics are given, the odds P(infection)/P(no infection) are multiplied by the factor exp(−3.25) = 0.0388; since 1/0.0388 = 25.77, the odds decrease by a factor of about 25.77.
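This arithmetic can be checked directly from the fitted coefficient (a Python sketch; the slide's 25.77 comes from rounding the coefficient to −3.25):

```python
import math

beta_antibio = -3.2544          # antibio coefficient from the summary output
odds_factor = math.exp(beta_antibio)

# giving antibiotics multiplies the infection odds by about 0.039 ...
assert abs(odds_factor - 0.0386) < 0.0005
# ... i.e. the odds are roughly 26 times smaller
assert abs(1 / odds_factor - 25.9) < 0.1
```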
Example 1
Observed (no observed proportion for number 4: 0/0 = NaN):
proportion = 0.0555 0.112 0.000 NaN 0.4827 0.8846 0.200 0.000

Model prediction:
log(pinfection/pnoinfection) = β0 + β1*noplan + β2*riskfac + β3*antibio
pinfection/pnoinfection = exp(β0)·exp(β1*noplan)·exp(β2*riskfac)·exp(β3*antibio)

predict(reg1, type="response")
0.0424 0.1145 0.00578 NA 0.534 0.770 0.1309 0.3056
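The fitted probabilities from predict() can be reproduced by hand from the coefficients (a Python sketch, using the estimates from the summary output):

```python
import math

# coefficients from the summary output
b0, b_noplan, b_riskfac, b_antibio = -1.8926, 1.0720, 2.0299, -3.2544

def fitted_p(noplan, riskfac, antibio):
    eta = b0 + b_noplan * noplan + b_riskfac * riskfac + b_antibio * antibio
    return math.exp(eta) / (1 + math.exp(eta))   # inverse logit

# first two design rows: (noplan, riskfac, antibio) = (0,1,1) and (1,1,1)
assert abs(fitted_p(0, 1, 1) - 0.0424) < 0.0005
assert abs(fitted_p(1, 1, 1) - 0.1145) < 0.0005
```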
Deviance score
The deviance of the model measures goodness-of-fit.
Output:
Null deviance: 83.491 on 6 degrees of freedom
Residual deviance: 10.997 on 3 degrees of freedom
(1 observation deleted due to missingness)
AIC: 36.178

χ2 with 3 degrees of freedom: 7 observations − 4 parameters estimated (βs) = 3
Residual deviance = 10.9967
p-value = 1 − pchisq(10.9967, 3) = 0.0117
Reject H0 and conclude the model does not fit well.
Include some interactions?
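The p-value 1 − pchisq(10.9967, 3) can be reproduced without R: for 3 degrees of freedom the χ2 survival function has a closed form (a Python sketch using only the standard library):

```python
import math

def chisq_sf_df3(x):
    # P(X > x) for X ~ chi-square with 3 degrees of freedom
    # closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

p = chisq_sf_df3(10.9967)
assert abs(p - 0.0117) < 0.0005   # matches 1 - pchisq(10.9967, 3)
assert p < 0.05                   # reject H0: the model does not fit well
```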
Plot Data and Model
plot(1:8, proportion, pch=15, col="dark blue")
fit=predict(reg1, type="response")
points(as.numeric(names(fit)), fit, pch=19, col="dark green")
legend(1, 0.9, legend=c("Obs. Proportions","Logistic Fit"), pch=c(15,19), col=c("dark blue","dark green"))
x-axis index, as (noplan, riskfac, antibio): 1=(0,1,1), 2=(1,1,1), 3=(0,0,1), 4=(1,0,1), 5=(0,1,0), 6=(1,1,0), 7=(0,0,0), 8=(1,0,0)
Goodness of fit:

no R², F, or MSE (use a pseudo-R²)

Deviance: D = −2 log(likelihood of the fitted model / likelihood of the saturated model)

yi is the number of 1s and ni − yi is the number of 0s

D = 2 ∑_{i=1}^{k} { yi log(yi/ŷi) + (ni − yi) log((ni − yi)/(ni − ŷi)) }

Output:
Null deviance: 83.491 on 6 degrees of freedom
Residual deviance: 10.997 on 3 degrees of freedom
(1 observation deleted due to missingness)
AIC: 36.178
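The residual deviance can be recomputed from this formula using the observed counts and the fitted probabilities listed earlier (a Python sketch; the 0/0 cell is dropped, matching the observation deleted due to missingness):

```python
import math

# observed infections y, totals n (0/0 cell dropped), fitted probabilities
y    = [1, 11, 0, 28, 23, 8, 0]
n    = [18, 98, 2, 58, 26, 40, 9]
phat = [0.0424, 0.1145, 0.00578, 0.534, 0.770, 0.1309, 0.3056]

D = 0.0
for yi, ni, pi in zip(y, n, phat):
    fit = ni * pi                                   # fitted count of 1s
    if yi > 0:                                      # y*log(y/yhat) -> 0 as y -> 0
        D += 2 * yi * math.log(yi / fit)
    if ni - yi > 0:
        D += 2 * (ni - yi) * math.log((ni - yi) / (ni - fit))

assert abs(D - 10.997) < 0.05   # matches the reported residual deviance
```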
Example 2
Arrhythmia: patients who undergo coronary artery bypass graft surgery (CABG) have an approximately 19 to 40% chance of developing atrial fibrillation (AF). AF is a quivering, chaotic motion in the upper chambers of the heart, known as the atria. AF can lead to the formation of blood clots, causing greater in-hospital mortality, strokes, and longer hospital stays. While this can be prevented with drugs, prevention is very expensive and sometimes dangerous if not warranted. Ideally, identifying risk factors that indicate an increased risk of developing AF in this population could save lives and money by showing which patients need pharmacological intervention. Researchers collected data from CABG patients during their hospital stay, such as demographics (age and sex) as well as heart rate, cholesterol, operation time, etc. The researchers then recorded which patients developed AF during their hospital stay. The goal was to evaluate the probability of AF given the measured demographic and risk factors.
Example 2: Data
Y fibrillation
X1 age
X2 aortic cross clamp time
X3 cardiopulmonary bypass time
X4 intensive care unit (ICU) time
X5 average heart rate
X6 left ventricle ejection fraction
X7 anamnesis of hypertension
X8 gender (1=female, 0=male)
X9 anamnesis of diabetes
X10 previous MI
R Code - Binomial
proportion = yi/ni
weights = ni

reg1=glm(proportion ~ x1, family="binomial", weights=ni)
summary(reg1)

OR

Y=0 or 1 arrhythmia; ni = 1, so the weights are implied as 1
reg1=glm(Y ~ x1, family="binomial")

OR

ni=die+live
reg1=glm(cbind(die, live) ~ x1, family="binomial")
reg1=glm(die/ni ~ x1, family="binomial", weights=ni)
Example 2
reg1=glm(Y ~ X1+X2+X3+X4+X5+X6+X7+X8+X9+X10, binomial)
summary(reg1)

            Estimate   Std. Error z value Pr(>|z|)
(Intercept) -10.952752   4.539527  -2.413 0.015833 *
X1            0.153628   0.044021   3.490 0.000483 ***
X2            0.024800   0.023960   1.035 0.300635
X3           -0.016837   0.015594  -1.080 0.280272
X4           -0.129457   0.086554  -1.496 0.134737
X5            0.007144   0.029105   0.245 0.806109
X6            0.020674   0.025727   0.804 0.421647
X7           -0.537703   0.613750  -0.876 0.380979
X8           -0.263754   0.631467  -0.418 0.676178
X9            1.093606   0.633264   1.727 0.084179 .
X10           0.341597   0.641249   0.533 0.594237

(Dispersion parameter for binomial family taken to be 1)
Residual deviance: 78.252 on 70 degrees of freedom, AIC: 100.25
Example 2 plot
Too many X values to plot against a single covariate, so use the linear predictor on the x-axis: β0 + β1*X1 + ...
The y-axis shows the observed data (Y=1 fibrillation, Y=0 no fibrillation) and also the fitted probabilities (GLM regression).
XX=cbind(rep(1,length(X1)),X1,X2,X3,X4,X5,X6,X7,X8,X9,X10)
linpred=XX%*%coef(reg1)
plot(linpred, Y, col="dark blue")
fit=predict(reg1, type="response")
points(linpred, fit, pch=19, col="dark green")
legend(-5, 0.9, legend=c("Obs. Proportions","Logistic Fit"), pch=c(1,19), col=c("dark blue","dark green"))
X Correlation Check
round(cor(XX[,2:11]),2)

       X1    X2    X3    X4    X5    X6    X7    X8    X9   X10
X1   1.00 -0.03  0.01 -0.01 -0.18  0.14  0.15  0.21 -0.11 -0.11
X2  -0.03  1.00  0.85  0.34 -0.07 -0.23  0.10 -0.26  0.14 -0.06
X3   0.01  0.85  1.00  0.32  0.13 -0.10  0.12 -0.19  0.15 -0.11
X4  -0.01  0.34  0.32  1.00  0.11 -0.23  0.14 -0.11  0.00  0.04
X5  -0.18 -0.07  0.13  0.11  1.00 -0.13 -0.08 -0.06 -0.02  0.18
X6   0.14 -0.23 -0.10 -0.23 -0.13  1.00  0.03  0.25 -0.18 -0.42
X7   0.15  0.10  0.12  0.14 -0.08  0.03  1.00  0.08  0.18 -0.12
X8   0.21 -0.26 -0.19 -0.11 -0.06  0.25  0.08  1.00 -0.08 -0.09
X9  -0.11  0.14  0.15  0.00 -0.02 -0.18  0.18 -0.08  1.00 -0.10
X10 -0.11 -0.06 -0.11  0.04  0.18 -0.42 -0.12 -0.09 -0.10  1.00

The highest correlation is 0.85, between X2 (clamp time) and X3 (bypass time); remove one of them.
Remove X5 average heart rate
Remove X8 gender
Remove X10 previous MI
Remove X2 clamp time
Remove X3 bypass time
Remove X6 left ejection fraction
Remove X7 hypertension
Remove X9 diabetes

Keep X1 age and X4 ICU time
AIC and Parsimony
AIC = 100.25: X1, ..., X10 (full model)
AIC = 92.46: X1, X4, X6, X7, X9
AIC = 90.58: X1, X4, X7, X9
AIC = 89.44: X1, X4, X9
AIC = 89.48: X1 and X4 (small model)

The smallest AIC indicates the *best* model.
AIC gives a penalty for including too many Xs.
Alternatively, compare residual deviances, keeping in mind that deviance alone always favors the model with more terms.
Other links
Logit, probit, or cloglog links:
reg1=glm(Y ~ X1+X2+X3+...+X10, family=binomial(logit))
reg1=glm(Y ~ X1+X2+X3+...+X10, family=binomial(probit))
reg1=glm(Y ~ X1+X2+X3+...+X10, family=binomial(cloglog))

Very small difference in results.
Complementary log-log is good when y=1 is rare.
Bayesian algorithms prefer the probit link (see lab 5).
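The three inverse links map the linear predictor to a probability in slightly different ways (a Python sketch using only the standard library; the normal CDF for the probit is written via math.erf):

```python
import math

def inv_logit(eta):
    return math.exp(eta) / (1 + math.exp(eta))

def inv_probit(eta):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(eta / math.sqrt(2)))

def inv_cloglog(eta):
    return 1 - math.exp(-math.exp(eta))

# logit and probit are symmetric around 0.5; cloglog is not
assert inv_logit(0) == 0.5
assert abs(inv_probit(0) - 0.5) < 1e-12
assert abs(inv_cloglog(0) - (1 - math.exp(-1))) < 1e-12   # about 0.632

# all three are increasing and stay inside (0, 1)
for f in (inv_logit, inv_probit, inv_cloglog):
    assert 0 < f(-3) < f(0) < f(3) < 1
```

The asymmetry of the cloglog curve is what makes it useful when one outcome is rare.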
Poisson Regression (GLM)
Poisson Regression
Response (y) is count data (Poisson): y = {0, 1, 2, 3, ...}
Tend to be rare events in a large number of trials: accidents, incidents of a rare disease, device failures in a time interval
E(y) = λ and V(y) = λ

yi ∼ Pois(λi)

log(λi) = β0 + β1xi
or with multiple x's: log(λi) = β0 + β1x1,i + β2x2,i + ...

λi = exp(β0 + β1xi)
Model checking and goodness-of-fit:

Deviance: D = 2 ∑_{i=1}^{n} ( yi log(yi/ŷi) − (yi − ŷi) )

Deviance residuals
Pearson residuals
Freeman-Tukey residuals
Plot the residuals
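The Poisson deviance formula can be exercised on a tiny made-up fit (a Python sketch; the numbers here are illustrative, not from the course data):

```python
import math

def poisson_deviance(y, fitted):
    # D = 2 * sum( y_i * log(y_i / yhat_i) - (y_i - yhat_i) )
    D = 0.0
    for yi, fi in zip(y, fitted):
        if yi > 0:
            D += 2 * (yi * math.log(yi / fi) - (yi - fi))
        else:
            D += 2 * fi          # limit of the formula as y_i -> 0
    return D

# a perfect fit has zero deviance
assert poisson_deviance([2, 4], [2, 4]) == 0.0

# a small made-up example of an imperfect fit
D = poisson_deviance([2, 4], [3, 3])
assert abs(D - 0.6796) < 0.001
```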
Example 3
Cellular differentiation (when a cell becomes a more specialized cell): this is a study of TNF (tumor necrosis factor) and IFN (interferon) to induce cell differentiation. The number of cells that exhibited markers of differentiation after exposure to TNF or IFN was recorded. There were 16 dose combinations of TNF/IFN, and 200 cells were examined.
reg1=glm(y ~ tfn*ifn, family=poisson)
Example 3 - Data
y=c(11,18,20,39,22,38,52,69,31,68,69,128,102,171,180,193)
tfn=c(0,0,0,0,1,1,1,1,10,10,10,10,100,100,100,100)
ifn=c(0,4,20,100,0,4,20,100,0,4,20,100,0,4,20,100)
summary(reg1)
            Estimate    Std. Error z value Pr(>|z|)
(Intercept)  3.436e+00   6.377e-02  53.877  < 2e-16 ***
tfn          1.553e-02   8.308e-04  18.689  < 2e-16 ***
ifn          8.946e-03   9.669e-04   9.253  < 2e-16 ***
tfn:ifn     -5.670e-05   1.348e-05  -4.205 2.61e-05 ***

(Dispersion parameter for poisson family taken to be 1)
Null deviance: 707.03 on 15 degrees of freedom
Residual deviance: 142.39 on 12 degrees of freedom
AIC: 243.69
Example 3: More code
confint(reg1)
                     2.5 %        97.5 %
(Intercept)   3.308307e+00  3.558360e+00
tfn           1.390603e-02  1.716434e-02
ifn           7.043823e-03  1.083599e-02
tfn:ifn      -8.318686e-05 -3.031362e-05
What we expect to see:

predict(reg1, type="response")
 35.62746  36.47161  40.05305  63.97916  36.09877  36.95410  40.58291  64.82554
 40.63133  41.59405  45.67849  72.96502 132.60092 135.74276 149.07241 238.12240

Data:
y=c(11,18,20,39,22,38,52,69,31,68,69,128,102,171,180,193)
residuals(reg1, type="deviance")
Example 4
Overdispersion/underdispersion and the quasi-Poisson:
The Poisson distribution has a single parameter governing both the mean and the variance (the dispersion parameter is fixed at 1).
There is a strict assumption that the mean = variance.
What if that is not the case?
Y is the count of faults in the manufacturing of rolls of fabric. X is the length of the roll.

The Poisson model is: log(λi) = β0 + β1xi
glm(y ~ x, family=poisson)
Standard Poisson: glm(y ~ x, family=poisson)

            Estimate  Std. Error z value Pr(>|z|)
(Intercept) 0.9717506  0.2124693   4.574 4.79e-06 ***
x           0.0019297  0.0003063   6.300 2.97e-10 ***

(Dispersion parameter for poisson family taken to be 1)
Null deviance: 103.714 on 31 degrees of freedom
Residual deviance: 61.758 on 30 degrees of freedom
AIC: 189.06
Quasi-Poisson: glm(y ~ x, family=quasipoisson)

            Estimate  Std. Error z value Pr(>|z|)
(Intercept) 0.9717506  0.3095033   3.140 0.003781 **
x           0.0019297  0.0004462   4.325 0.000155 ***

(Dispersion parameter for quasipoisson family taken to be 2.121965)
Null deviance: 103.714 on 31 degrees of freedom
Residual deviance: 61.758 on 30 degrees of freedom
AIC: NA
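The quasi-Poisson standard errors are exactly the Poisson ones inflated by the square root of the estimated dispersion, which can be verified from the two summaries (a Python sketch of the arithmetic):

```python
import math

dispersion = 2.121965
scale = math.sqrt(dispersion)

# standard errors from the Poisson summary
se_intercept_pois, se_x_pois = 0.2124693, 0.0003063
# standard errors from the quasi-Poisson summary
se_intercept_qp, se_x_qp = 0.3095033, 0.0004462

# quasi-Poisson SE = Poisson SE * sqrt(dispersion)
assert abs(se_intercept_pois * scale - se_intercept_qp) < 1e-4
assert abs(se_x_pois * scale - se_x_qp) < 1e-6
```

Bigger standard errors with unchanged estimates is exactly why the p-values differ between the two fits.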
Poisson versus quasi-Poisson:

Same β0 and β1
Same deviance
Dispersion parameter: 1 versus 2.12
Different p-values
Other GLM
Exponential family response (y)
Normal
Binomial/Bernoulli
Poisson
Gamma