Lecture 7: Multiple Linear Regression
Interpretation with different types of predictors
BMTRY 701 Biostatistical Methods II
Interpreting regression coefficients
So far, we’ve considered continuous covariates. Covariates can take other forms:
• binary
• nominal categorical
• quadratics (or other transforms)
• interactions
Interpretations may vary depending on the nature of your covariate
Binary covariates
Considered ‘qualitative’. The ordering of numeric assignments does not matter.
Example: MEDSCHL: 1 = yes; 2 = no
More popular examples:
• gender
• mutation
• pre vs. post-menopausal
• two age categories
How is MEDSCHL related to LOS?
How to interpret β1?
Coding of variables:
• 2 vs. 1
• I prefer 1 vs. 0
• difference? the intercept.
Let’s make a new variable:
• MS = 1 if MEDSCHL = 1 (yes)
• MS = 0 if MEDSCHL = 2 (no)
LOS_i = β0 + β1·MEDSCHL_i + e_i
How is MEDSCHL related to LOS?
What does β1 mean?
Same model form as with a continuous covariate, yet a different interpretation. What if we had used the old coding?
LOS_i = β0 + β1·MS_i + e_i
E[LOS_i | MS_i] = β0 + β1·MS_i
E[LOS_i | MS_i = 0] = β0 + β1·0 = β0
E[LOS_i | MS_i = 1] = β0 + β1·1 = β0 + β1
LOS_i = β0 + β1·MEDSCHL_i + e_i
E[LOS_i | MEDSCHL_i = 1] = β0 + β1·1 = β0 + β1
E[LOS_i | MEDSCHL_i = 2] = β0 + β1·2 = β0 + 2β1
R code
> data$ms <- ifelse(data$MEDSCHL==2, 0, data$MEDSCHL)
> table(data$ms, data$MEDSCHL)
   
     1  2
  0  0 96
  1 17  0

> reg <- lm(LOS ~ ms, data=data)
> summary(reg)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.4105     0.1871  50.290  < 2e-16 ***
ms            1.5807     0.4824   3.276  0.00140 ** 
---
Residual standard error: 1.833 on 111 degrees of freedom
Multiple R-squared: 0.08818, Adjusted R-squared: 0.07997 
F-statistic: 10.73 on 1 and 111 DF, p-value: 0.001404
Scatterplot? Residual Plot?
[Figure: residual plot, res vs. data$ms]
res <- reg$residuals
plot(data$ms, res)
abline(h=0)
[Figure: data$LOS vs. data$ms, with fitted values marked]
Only two fitted values:
Diagnostic plots are not as informative. Extrapolation and interpolation are meaningless! We can estimate LOS for MS = 0.5:
• LOS = 9.41 + 1.58*0.5 = 10.20
• Try to interpret the result…
E[Ŷ | MS = 0] = β̂0 = 9.41
E[Ŷ | MS = 1] = β̂0 + β̂1 = 9.41 + 1.58 = 10.99
“Linear” regression?
But, what about the ‘linear’ assumption?
• we still need to adhere to the model assumptions
• recall that they relate primarily to the residuals
• residuals are independent and identically distributed: e_i ~ N(0, σ²)

The model is still linear in the parameters!
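A quick visual check of the normality part of that assumption (a minimal sketch, reusing the reg fit of LOS on ms from above):

# Normal Q-Q plot of the residuals; points near the line support normality
res <- reg$residuals
qqnorm(res)
qqline(res)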
MLR example: Add infection risk to our model
> reg <- lm(LOS ~ ms + INFRISK, data=data)
> summary(reg)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
ms            0.9717     0.4316   2.251   0.0263 *  
INFRISK       0.6998     0.1156   6.054    2e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036 
F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10
How does interpretation change?
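One way to see the adjusted comparison: predict LOS for MS = 0 and MS = 1 at a common infection risk (a sketch; the value INFRISK = 4 is arbitrary and purely illustrative):

# Predicted LOS for non-med-school vs. med-school hospitals,
# holding infection risk fixed at an illustrative value
predict(reg, newdata = data.frame(ms = c(0, 1), INFRISK = 4))
# the two predictions differ by the ms coefficient, 0.9717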
What about more than two categories?
We looked briefly at region a few lectures back. How to interpret? You need to define a reference category. For med school:
• reference was ms=0
• almost ‘subconscious’ with only two categories
With >2 categories, need to be careful of interpretation
LOS ~ REGION
Note how the ‘indicator’ or ‘dummy’ variable is defined:
• I(condition) = 1 if condition is true
• I(condition) = 0 if condition is false
LOS_i = β0 + β1·I(R_i = 2) + β2·I(R_i = 3) + β3·I(R_i = 4) + e_i
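In R, factor(REGION) constructs these indicators automatically; you can inspect them in the design matrix (a sketch, assuming the same data frame as the earlier examples):

# First rows of the design matrix: an intercept plus one dummy
# for each non-reference region
head(model.matrix(~ factor(REGION), data = data))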
Interpretation
β0 =
β1 =
β2 =
β3 =
E[LOS_i | R_i = 1] = β0
E[LOS_i | R_i = 2] = β0 + β1
E[LOS_i | R_i = 3] = β0 + β2
E[LOS_i | R_i = 4] = β0 + β3
LOS_i = β0 + β1·I(R_i = 2) + β2·I(R_i = 3) + β3·I(R_i = 4) + e_i
Hypothesis tests?
H0: β1 = 0 vs. Ha: β1 ≠ 0. What does that test (in words)?
H0: β2 = 0 vs. Ha: β2 ≠ 0. What does that test (in words)?
What if we want to test region, in general? One of our next topics!
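As a preview, the overall test compares the model with region against the model without it (a minimal sketch; the F-test is covered properly later):

# Overall F-test for REGION: do the three region coefficients jointly equal zero?
reduced <- lm(LOS ~ 1, data = data)
full    <- lm(LOS ~ factor(REGION), data = data)
anova(reduced, full)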
R
> reg <- lm(LOS ~ factor(REGION), data=data)
> summary(reg)
Call:
lm(formula = LOS ~ factor(REGION), data = data)
Residuals:
     Min       1Q   Median       3Q      Max 
-3.05893 -1.03135 -0.02344  0.68107  8.47107 
Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      11.0889     0.3165  35.040  < 2e-16 ***
factor(REGION)2  -1.4055     0.4333  -3.243  0.00157 ** 
factor(REGION)3  -1.8976     0.4194  -4.524 1.55e-05 ***
factor(REGION)4  -2.9752     0.5248  -5.669 1.19e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.675 on 109 degrees of freedom
Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325 
F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07
Interpreting
Is mean LOS different in region 2 vs. 1? What about region 3 vs. 1 and 4 vs. 1?
What about region 4 vs. 3?
How to test that? Two options:
• recode the data so that 3 or 4 is the reference
• use knowledge about the variance of linear combinations to estimate the p-value for the difference in the coefficients (a sketch of this appears below)

For now… we’ll focus on the first.
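For completeness, a sketch of the second option, using the coefficient estimates and covariance matrix from the reg fit of LOS ~ factor(REGION) above; the contrast is region 4 minus region 3:

# Estimate and SE of beta3 - beta2 (region 4 vs. region 3)
# via the variance of a linear combination of coefficients
b <- coef(reg)
V <- vcov(reg)
est <- b["factor(REGION)4"] - b["factor(REGION)3"]
se  <- sqrt(V["factor(REGION)4", "factor(REGION)4"] +
            V["factor(REGION)3", "factor(REGION)3"] -
            2 * V["factor(REGION)4", "factor(REGION)3"])
2 * pt(-abs(est / se), df = reg$df.residual)   # two-sided p-value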
Make REGION=4 the reference
Our model then changes:
LOS_i = β0 + β1·I(R_i = 3) + β2·I(R_i = 2) + β3·I(R_i = 1) + e_i
E[LOS_i | R_i = 4] = β0
E[LOS_i | R_i = 3] = β0 + β1
E[LOS_i | R_i = 2] = β0 + β2
E[LOS_i | R_i = 1] = β0 + β3
R code: recoding so last category is reference
> data$rev.region <- factor(data$REGION, levels=rev(sort(unique(data$REGION))) )
> reg <- lm(LOS ~ rev.region, data=data)
> summary(reg)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.1137     0.4186  19.381  < 2e-16 ***
rev.region3   1.0776     0.5010   2.151  0.03371 *  
rev.region2   1.5697     0.5127   3.061  0.00277 ** 
rev.region1   2.9752     0.5248   5.669 1.19e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.675 on 109 degrees of freedom
Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325 
F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07
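An alternative recoding that keeps the non-reference levels in their natural order is relevel(), which simply moves the chosen reference to the front (a sketch; the name region4ref is made up here):

# Same reference group (region 4); other levels stay ordered 1, 2, 3
data$region4ref <- relevel(factor(data$REGION), ref = "4")
reg4 <- lm(LOS ~ region4ref, data = data)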
Quite a few differences:
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.1137     0.4186  19.381  < 2e-16 ***
rev.region3   1.0776     0.5010   2.151  0.03371 *  
rev.region2   1.5697     0.5127   3.061  0.00277 ** 
rev.region1   2.9752     0.5248   5.669 1.19e-07 ***
Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      11.0889     0.3165  35.040  < 2e-16 ***
factor(REGION)2  -1.4055     0.4333  -3.243  0.00157 ** 
factor(REGION)3  -1.8976     0.4194  -4.524 1.55e-05 ***
factor(REGION)4  -2.9752     0.5248  -5.669 1.19e-07 ***
But the “model” is the same
Model 1: Residual standard error: 1.675 on 109 degrees of freedom
Model 2: Residual standard error: 1.675 on 109 degrees of freedom
The models represent the data equally well. However, the ‘reparameterization’ yields a different interpretation for the model parameters.
Diagnostics
# residual plot
reg <- lm(LOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)
[Figure: residuals (res) vs. fitted values (fit) for LOS ~ factor(REGION)]
Diagnostics
# residual plot
reg <- lm(logLOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)
[Figure: residuals (res) vs. fitted values (fit) for logLOS ~ factor(REGION)]
Next type: polynomials
Most common: quadratic. What does that mean? Including a linear and ‘squared’ term. Why? To adhere to model assumptions! Example:
• last week we saw LOS ~ NURSE
• a quadratic actually made some sense
logLOS_i = β0 + β1·NURSE_i + β2·NURSE_i² + e_i
Scatterplot
[Figure: scatterplot of data$logLOS vs. data$NURSE]
Fitting the model
> data$nurse2 <- data$NURSE^2
> reg <- lm(logLOS ~ NURSE + nurse2, data=data)
> summary(reg)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.090e+00  3.645e-02  57.355  < 2e-16 ***
NURSE        1.430e-03  3.525e-04   4.058 9.29e-05 ***
nurse2      -1.789e-06  6.262e-07  -2.857  0.00511 ** 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.161 on 110 degrees of freedom
Multiple R-squared: 0.1965, Adjusted R-squared: 0.1819 
F-statistic: 13.45 on 2 and 110 DF, p-value: 5.948e-06
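Incidentally, the same model can be fit without creating nurse2 by hand, using I() so the formula treats ^2 arithmetically (an equivalent sketch):

# Identical fit; the squared term is constructed inside the formula
reg.alt <- lm(logLOS ~ NURSE + I(NURSE^2), data = data)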
Interpretable?
How does it fit?
[Figure: data$logLOS vs. data$NURSE with the fitted quadratic curve]
# make regression line
plot(data$NURSE, data$logLOS, pch=16)
coef <- reg$coefficients
nurse.values <- seq(15,650,5)
fit.line <- coef[1] + coef[2]*nurse.values + coef[3]*nurse.values^2
lines(nurse.values, fit.line, lwd=2)
Note: ‘abline’ will only work for simple linear regression. When there is more than one predictor, you need to make the line another way.
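One such way is to let predict() evaluate the fitted curve on a grid rather than assembling it from the coefficients (a sketch, reusing the quadratic fit reg from above):

# Evaluate the fitted quadratic on a grid of NURSE values via predict()
nurse.values <- seq(15, 650, 5)
newdat <- data.frame(NURSE = nurse.values, nurse2 = nurse.values^2)
lines(nurse.values, predict(reg, newdata = newdat), lwd = 2, lty = 2)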
Another approach to the same data
Does it make sense that it increases, and then decreases?
Or, would it make more sense to increase, and then plateau?
Which do you think makes more sense?
How to tell?
• use a data-driven approach
• tells us “what do the data suggest?”
Smoothing
An empirical way to look at the relationship: the data are ‘binned’ by x, and for each ‘bin’ the average y is estimated. But it is a little fancier:
• it is a ‘moving average’
• each x value is in multiple bins
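A toy sketch of that moving-average idea (purely for intuition; the window width of 50 is an arbitrary choice, and lowess() below is what we actually use):

# Crude moving-average smoother: mean of y within a sliding window on x
moving.avg <- function(x, y, width = 50) {
  xs <- sort(unique(x))
  ys <- sapply(xs, function(x0) mean(y[abs(x - x0) <= width / 2]))
  list(x = xs, y = ys)
}
sm <- moving.avg(data$NURSE, data$logLOS)
plot(data$NURSE, data$logLOS, pch = 16)
lines(sm$x, sm$y, lwd = 2, col = 3)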
Modern methods use models within bins:
• Lowess smoothing
• Cubic spline smoothing
Specifics are not so important: the “empirical” result is:
smoother <- lowess(data$NURSE, data$logLOS)
plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
legend(450, 3, c("Quadratic Model","Lowess Smooth"),
       lty=c(1,1), lwd=c(2,2), col=c(1,2))
[Figure: data$logLOS vs. data$NURSE with quadratic model fit and lowess smooth]
Inference?
What do the data say? Looks like a plateau. How can we model that? One option: a spline. Zeger: “broken arrow” model. Example: looks like a “knot” at NURSE = 250
• there is a linear increase in logLOS until about NURSE=250
• then, the relationship is flat
• this implies one slope prior to NURSE=250, and one after NURSE=250
Implementing a spline
A little tricky. We need to define a new variable, NURSE*.
And then we write the model as follows:
NURSE*_i = 0                if NURSE_i ≤ 250
NURSE*_i = NURSE_i - 250    if NURSE_i > 250

logLOS_i = β0 + β1·NURSE_i + β2·NURSE*_i + e_i
How to interpret?
When in doubt, condition on different scenarios. What is E(logLOS) when NURSE ≤ 250? What is E(logLOS) when NURSE > 250?
E[logLOS_i | NURSE_i ≤ 250] = β0 + β1·NURSE_i + β2·0
                            = β0 + β1·NURSE_i

E[logLOS_i | NURSE_i > 250] = β0 + β1·NURSE_i + β2·(NURSE_i - 250)
                            = (β0 - 250·β2) + (β1 + β2)·NURSE_i
R
data$nurse.star <- ifelse(data$NURSE<=250, 0, data$NURSE-250)
data$nurse.star
reg.spline <- lm(logLOS ~ NURSE + nurse.star, data=data)

# make regression line
coef.spline <- reg.spline$coefficients
nurse.values <- seq(15,650,5)
nurse.values.star <- ifelse(nurse.values<=250, 0, nurse.values-250)
spline.line <- coef.spline[1] + coef.spline[2]*nurse.values +
               coef.spline[3]*nurse.values.star

plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
lines(nurse.values, spline.line, col=4, lwd=3)
legend(450, 3, c("Quadratic Model","Lowess Smooth","Spline Model"),
       lty=c(1,1,1), lwd=c(2,2,3), col=c(1,2,4))
Interpreting the output
> summary(reg.spline)
Call:
lm(formula = logLOS ~ NURSE + nurse.star, data = data)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.1073877  0.0332135  63.450  < 2e-16 ***
NURSE        0.0010278  0.0002336   4.399 2.52e-05 ***
nurse.star  -0.0011114  0.0004131  -2.690  0.00825 ** 
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1616 on 110 degrees of freedom
Multiple R-squared: 0.1902, Adjusted R-squared: 0.1754 
F-statistic: 12.91 on 2 and 110 DF, p-value: 9.165e-06
How do we interpret the coefficient on nurse.star?
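From the derivation earlier, the slope after the knot is β1 + β2, which we can compute from the fit (a quick check):

# Post-knot slope: essentially zero, consistent with the plateau
coef(reg.spline)["NURSE"] + coef(reg.spline)["nurse.star"]
# 0.0010278 + (-0.0011114) = -0.0000836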
Why subtract the 250 in defining nurse.star? It ‘calibrates’ where the two pieces of the line meet. If it is not included, then they will not connect.
[Figure: data$logLOS vs. data$NURSE with quadratic model fit, lowess smooth, and spline model]
Why a spline vs. the quadratic?
• it fits well!
• it is more interpretable
• it makes sense
• it is less sensitive to the outliers
Can be generalized to have more ‘knots’
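A sketch of how that generalization might look with two knots, using pmax() to build one truncated-line term per knot (the knot locations 250 and 400 and the variable names are hypothetical, just to show the pattern):

# One 'broken arrow' term per knot
data$ns250 <- pmax(data$NURSE - 250, 0)
data$ns400 <- pmax(data$NURSE - 400, 0)
reg.spline2 <- lm(logLOS ~ NURSE + ns250 + ns400, data = data)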
Next time
ANOVA F-tests