Linear Model
Wim Krijnen
Lector Analyse Technieken voor Praktijkonderzoek
Lectoraat Healthy Ageing, Allied Health Care and Nursing
Hanze University of Applied Sciences
July 1, 2015
Wim Krijnen Lector Analyse Technieken voor Praktijkonderzoek Lectoraat Healthy Ageing, Allied Health Care and Nursing Hanze University of Applied SciencesLinear Model July 1, 2015 1 / 39
Overview
- Correlation between two variables
- Simple linear models; one predictor variable
- Least squares estimation of parameters
- Model evaluation
  - ANOVA table
  - Goodness of fit
  - t-test for model coefficients
  - confidence interval for correlation coefficient
- prediction of new values
- linear model diagnostics
- multiple regression
- model building: finding best subset
Purpose of linear regression analysis
- Determining the amount of variance the predictors can explain in the criterion
- Describing the relationship between a dependent and a set of independent variables
- Estimation of parameters
- Prediction of new values
- Control
Correlation Coefficient
Expresses the degree of linear relationship between two variables (measurements)
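sample variance of $X$:
$$S_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

sample variance of $Y$:
$$S_Y^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$

sample covariance:
$$S_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

sample correlation coefficient:
$$R_{XY} = \frac{S_{XY}}{S_X \cdot S_Y}$$

- $-1 \le R_{XY} \le 1$
- $R_{XY}$ close to $\pm 1$ implies all points lie close to a straight line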
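The definitions above can be checked numerically. The following is a minimal R sketch on simulated data (the variables `x` and `y` are hypothetical and are not the blood pressure data used later in the slides); it computes each quantity by hand and compares it with the built-in functions.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)
n <- length(x)

s2x <- sum((x - mean(x))^2) / (n - 1)                # sample variance of X
s2y <- sum((y - mean(y))^2) / (n - 1)                # sample variance of Y
sxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)  # sample covariance
rxy <- sxy / (sqrt(s2x) * sqrt(s2y))                 # sample correlation coefficient

# Compare the hand-computed values with var(), cov() and cor()
all.equal(c(s2x, s2y, sxy, rxy), c(var(x), var(y), cov(x, y), cor(x, y)))
```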
linear relationship
linear relation between x and y
y = ax + b
the value of y is a function of x
y = f (x) = ax + b
- function is called linear (of first order); deterministic
- $f'(x) = a$; if $x$ increases by one unit, $y$ changes by $a$
- $a$ slope, $b$ intercept
- if $a = 0$, then $y = b$ is constant
- range is determined by the values of $x$
- given $a$, $b$, and a new $x$, the value of $y$ can be computed
- change of name: $\beta_0$ intercept, $\beta_1$ slope
Example of straight line
Linear model with one predictor
we generalize to a statistical model
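$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \dots, n$$

- unknown parameters of model: $\beta_0$, $\beta_1$
- $\beta_0$ intercept, $\beta_1$ slope
- $\varepsilon_i$ random "error" variable
- $\varepsilon_i$ independently normally distributed with mean $E[\varepsilon_i] = 0$ (wlog), $\mathrm{Var}[\varepsilon_i] = \sigma^2$ for all $i$
- independent (explanatory) variables $X_i$
- dependent variables $Y_i$
- $\sigma^2 = 0 \Rightarrow \varepsilon_i = 0$ for all $i$; deterministic model (special case)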
Least Squares (ML) Estimation of parameters
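$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \;\Rightarrow\; \varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$$

minimize the sum $S$ of squared differences between the $Y_i$ and the line

$$S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$

$$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}, \qquad \hat\beta_1 = \frac{\sum X_i Y_i - n \bar{Y} \bar{X}}{\sum X_i^2 - (\sum X_i)^2 / n}, \qquad \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

- computational formulas used in software
- $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ "regression line"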
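As a hedged illustration of the closed-form estimators above, the sketch below evaluates them on simulated data and compares the result with `lm()`. The data and the names `b0`, `b1` are assumptions made for this example only.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)
n <- length(x)

# Least squares estimates from the computational formulas
b1 <- (sum(x * y) - n * mean(y) * mean(x)) / (sum(x^2) - sum(x)^2 / n)
b0 <- mean(y) - b1 * mean(x)

# The same estimates from R's linear model fit
fit <- lm(y ~ x)
all.equal(c(b0, b1), unname(coef(fit)))
```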
Properties of least squares estimator
1. best (minimum variance) linear unbiased (Gauss-Markov)
2. minimal $\mathrm{Var}[\hat\beta] = \sigma^2 (X^T X)^{-1} \to 0$ as $n$ increases
3. consistent in the sense that estimated coefficients tend to the coefficients in the population
4. equals the maximum likelihood estimator
5. makes sense geometrically: predicted values $\hat{y}$ are the orthogonal projection of $y$ onto the space spanned by the predictors $X$
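The geometric property can be made concrete with a small sketch, again on simulated data. The matrix `H` below is the projection matrix $X (X^T X)^{-1} X^T$; the names `X`, `H`, `y_hat` and `res` are chosen for this illustration.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)

X <- cbind(1, x)                            # design matrix: intercept column and predictor
H <- X %*% solve(crossprod(X)) %*% t(X)     # projection matrix X (X'X)^{-1} X'
y_hat <- drop(H %*% y)                      # fitted values as the projection of y
res <- y - y_hat                            # residual vector

max(abs(crossprod(X, res)))                 # numerically zero: residuals are orthogonal to X
all.equal(H %*% H, H)                       # projecting twice changes nothing
all.equal(y_hat, unname(fitted(lm(y ~ x)))) # same fitted values as lm()
```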
Sums of squares and F-test
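$H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$

$$\text{Regression SS} := \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad \text{Residual SS} := \sum_{i=1}^{n} \hat\varepsilon_i^2$$

$$\text{Total SS} := \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \text{Regression SS} + \text{Residual SS}$$

$$\text{Regression MS} := \frac{\text{Regression SS}}{p - 1}, \qquad \text{Residual MS} := \frac{\text{Residual SS}}{n - p}$$

$$F = \frac{\text{Regression MS}}{\text{Residual MS}} \sim F_{1,n-2}; \qquad \text{p-value} = P(F_{1,n-2} > F)$$

Here $p$ is the number of coefficients, so $p = 2$ for one predictor plus intercept.

- If p-value $< \alpha$, then reject $H_0$
- If p-value $> \alpha$, then do not reject $H_0$

Reasoning: if $\hat\beta_1 \approx 0$, then $\hat{Y}_i \approx \hat\beta_0 = \bar{Y}$ $\Rightarrow$ Regression SS small; $F$ not significant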
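A minimal sketch of the decomposition and the F-test, assuming simulated data and hypothetical variable names, computed by hand and compared with `anova()`:

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)
n <- length(y)
p <- 2                                        # intercept and slope

fit <- lm(y ~ x)
reg_ss <- sum((fitted(fit) - mean(y))^2)      # Regression SS
res_ss <- sum(resid(fit)^2)                   # Residual SS
tot_ss <- sum((y - mean(y))^2)                # Total SS

reg_ms <- reg_ss / (p - 1)
res_ms <- res_ss / (n - p)
f_stat <- reg_ms / res_ms
p_val  <- pf(f_stat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)

all.equal(tot_ss, reg_ss + res_ss)            # the decomposition holds
tab <- anova(fit)
all.equal(c(f_stat, p_val), c(tab[["F value"]][1], tab[["Pr(>F)"]][1]))
```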
ANOVA table and how it looks in R

ANOVA table

Source      df      SS             MS        F                            p-value
Regression  1       Regression SS  Reg./df   Regression MS / Residual MS
Residuals   n − 2   Residual SS    Res./df

- SS: sums of squares; MS: mean sum of squares; df: degrees of freedom
- 15 blood pressure measurements by expert and machine
- expert replaced by machine if measurements are equal (up to error)

> model <- lm(Expert ~ Machine)
> anova(model)
Analysis of Variance Table
Response: y
           Df  Sum Sq Mean Sq F value    Pr(>F)
x           1 150.485 150.485  1967.4 < 2.2e-16 ***
Residuals 320  24.477   0.076

Conclusion: Reject $H_0: \beta_1 = 0$
goodness of fit
$$R^2 := \frac{\text{Regression SS}}{\text{Total SS}} = \frac{\text{Regression SS}}{\text{Regression SS} + \text{Residual SS}} = \mathrm{cor}(Y, \hat{Y})^2$$

- $0 \le R^2 \le 1$
- larger values of $R^2$ indicate better goodness of fit
- $R^2 \cdot 100\%$ is the percentage of variance in $Y$ "explained" by linear regression

> cor(blood$Expert, model$fitted.values)^2
[1] 0.822395

Conclusion: 82% of the variance in blood pressure measurements by Expert is explained by Machine
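The three expressions for $R^2$ can be compared directly. This is a sketch on simulated data with hypothetical names; the value differs from the 0.82 reported for the blood pressure data.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)

fit <- lm(y ~ x)
reg_ss <- sum((fitted(fit) - mean(y))^2)
res_ss <- sum(resid(fit)^2)

r2_ss  <- reg_ss / (reg_ss + res_ss)     # ratio of sums of squares
r2_cor <- cor(y, fitted(fit))^2          # squared correlation of observed and fitted
r2_lm  <- summary(fit)$r.squared         # value reported by summary()

all.equal(c(r2_ss, r2_cor), c(r2_lm, r2_lm))
```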
Relation between correlation and slope
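The correlation in the population, $\rho$, is estimated by Pearson's correlation coefficient

$$R_{XY} = \frac{S_{XY}}{S_X S_Y}, \qquad \hat\beta_1 = \frac{S_{XY}}{S_X^2} \;\Rightarrow\; \hat\beta_1 = \frac{S_Y}{S_X} \cdot R_{XY}$$

holds also for the population: $\beta_1 = \frac{\sigma_Y}{\sigma_X} \rho$

- $R_{XY} = 0$ implies $\hat\beta_1 = 0$ (correlation is necessary for nonzero slope)
- $R_{XY}$ and $\hat\beta_1$ have the same sign (e.g. both positive)
- $R_{XY} > 0 \Rightarrow$ increase in $X$ associated with increase in $Y$
- $R_{XY} < 0 \Rightarrow$ increase in $X$ associated with decrease in $Y$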
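A short numerical check of the relation between slope and correlation, assuming simulated data and the hypothetical name `b1` for the fitted slope:

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)

b1 <- coef(lm(y ~ x))[["x"]]                  # fitted slope

all.equal(b1, cov(x, y) / var(x))             # slope = S_XY / S_X^2
all.equal(b1, sd(y) / sd(x) * cor(x, y))      # slope = (S_Y / S_X) * R_XY
sign(b1) == sign(cor(x, y))                   # slope and correlation share their sign
```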
t-test for model coefficients
$H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$

$$\mathrm{SE}[\hat\beta_1] = \sqrt{\frac{\text{Residual MS}}{(n - 1) S_X^2}}, \qquad t = \frac{\hat\beta_1}{\mathrm{SE}[\hat\beta_1]} \sim t_{n-2} \text{ distribution}$$

- If $|t| > t_{n-2,1-\alpha/2}$, then reject $H_0$ in favor of $H_A: \beta_1 \neq 0$
- If the two-sided p-value $2 \cdot P(t_{n-2} \ge |t|) < \alpha$, then reject $H_0$ for $H_A: \beta_1 \neq 0$

> summary(lm(Expert ~ Machine))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.9152     8.8236   1.917   0.0775 .
Machine       0.7907     0.1019   7.759 3.12e-06 ***

Conclusion: Reject $H_0: \beta_1 = 0$
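The standard error, t statistic and two-sided p-value can be reproduced by hand. The sketch below uses simulated data and hypothetical names, and compares the result with the coefficient table from `summary()`.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)
n <- length(y)

fit <- lm(y ~ x)
b1 <- coef(fit)[["x"]]
res_ms <- sum(resid(fit)^2) / (n - 2)                    # Residual MS
se_b1  <- sqrt(res_ms / ((n - 1) * var(x)))              # SE of the slope
t_stat <- b1 / se_b1
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # two-sided p-value

by_hand <- c(se_b1, t_stat, p_val)
from_lm <- unname(coef(summary(fit))["x", c("Std. Error", "t value", "Pr(>|t|)")])
all.equal(by_hand, from_lm)
```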
Confidence intervals for model coefficients
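A confidence interval (CI) expresses the uncertainty in a parameter estimate

$$(1 - \alpha) 100\%\ \mathrm{CI}[\beta_1] = \hat\beta_1 \pm t_{n-2,1-\alpha/2} \cdot \mathrm{SE}[\hat\beta_1], \qquad \mathrm{SE}[\hat\beta_1] = \sqrt{\frac{\text{Residual MS}}{(n - 1) S_X^2}}$$

> confint(model)
                2.5 %    97.5 %
(Intercept) -2.146983 35.977336
Machine      0.570539  1.010882

Conclusion: With 95% confidence, $\beta_1$ lies within [0.57, 1.01]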
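The interval formula can be evaluated directly and compared with `confint()`. This is a sketch on simulated data; the names `se_b1` and `ci_hand` are assumptions for the example.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)
n <- length(y)
alpha <- 0.05

fit <- lm(y ~ x)
b1 <- coef(fit)[["x"]]
se_b1 <- sqrt(sum(resid(fit)^2) / (n - 2) / ((n - 1) * var(x)))

# estimate plus/minus t quantile times standard error
ci_hand <- b1 + c(-1, 1) * qt(1 - alpha / 2, df = n - 2) * se_b1
all.equal(ci_hand, unname(confint(fit, level = 1 - alpha)["x", ]))
```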
Prediction/Confidence interval
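- New value $X$; predicted $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ (average predicted value)
- construct a 95% confidence interval around the predicted value

$$(1 - \alpha) 100\%\ \mathrm{CI}[\hat{Y}] = \hat{Y} \pm \sqrt{2 F_{2,n-2,1-\alpha}} \cdot \mathrm{SE}[\hat{Y}]$$

$$\mathrm{SE}[\hat{Y}] = \sqrt{S_\varepsilon^2 \left( \frac{1}{n} + \frac{(X - \bar{X})^2}{(n - 1) S_X^2} \right)}, \qquad S_\varepsilon^2 = \frac{\sum_{i=1}^{n} \hat\varepsilon_i^2}{n - 2}$$

This interval is for the average predicted value. The prediction interval for a single new observation, which R returns with interval = "prediction", uses the quantile $t_{n-2,1-\alpha/2}$ and adds 1 inside the parentheses of the standard error.

Example: Machine measures diastolic blood pressure 95

> predict(model, newdata = data.frame(Machine = 95),
+         interval = "prediction")
       fit      lwr      upr
1 92.03268 80.41466 103.6507

Conclusion: With 95% certainty the predicted expert blood pressure $\in [80.4, 103.7]$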
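To show how the pieces fit together, the sketch below computes the interval that `predict()` returns for `interval = "prediction"` by hand: a t quantile combined with a standard error that has an extra 1 inside the parentheses. It also evaluates the F-based band from the formula above for the average predicted value. The data and all variable names are assumptions for this illustration.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)
n <- length(y)
x_new <- 95

fit <- lm(y ~ x)
s2 <- sum(resid(fit)^2) / (n - 2)                     # residual variance estimate
y_hat <- sum(coef(fit) * c(1, x_new))                 # predicted value at x_new
lever <- 1 / n + (x_new - mean(x))^2 / ((n - 1) * var(x))

se_mean <- sqrt(s2 * lever)                           # SE of the average predicted value
se_pred <- sqrt(s2 * (1 + lever))                     # SE for a single new observation

pi_hand <- y_hat + c(-1, 1) * qt(0.975, df = n - 2) * se_pred
band    <- y_hat + c(-1, 1) * sqrt(2 * qf(0.95, 2, n - 2)) * se_mean

pr <- predict(fit, newdata = data.frame(x = x_new), interval = "prediction")
all.equal(c(y_hat, pi_hand), unname(pr[1, ]))
```

The interval for a new observation is wider than the band for the average because it also accounts for the error variance of the new measurement itself.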
plot of prediction confidence intervals
Model diagnostics
Recall $Y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$

Diagnose the extent to which the model assumptions hold:

- same error variance for all values of $X$
- residuals are nearly normally distributed
- there are no extreme outliers
- only a small proportion of residuals is relatively large
- there are no data points highly influential on residuals or regression coefficients

This increases our confidence in making valid statistical inferences
Standardized residuals
$$\text{Residual MS} := \sum_{i=1}^{n} \hat\varepsilon_i^2 / (n - p)$$

$h_i$ = element $ii$ of the hat matrix $X (X^T X)^{-1} X^T$

standardized $i$th residual:
$$\varepsilon'_i = \frac{\hat\varepsilon_i}{\sqrt{\text{Residual MS}} \sqrt{1 - h_i}}$$

> round(rstandard(model),3)
     1      2      3      4      5      6      7      8
 0.285  0.455 -0.460 -0.163  1.496  1.592  1.441 -1.129
     9     10     11     12     13     14     15
-0.209 -1.319  0.219 -1.185 -1.337  0.741 -0.632
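A minimal sketch, assuming simulated data, that computes the hat values and standardized residuals from the formulas and compares them with `hatvalues()` and `rstandard()`:

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)
n <- length(y)

fit <- lm(y ~ x)
X <- model.matrix(fit)                              # design matrix including intercept
p <- ncol(X)
h <- diag(X %*% solve(crossprod(X)) %*% t(X))       # diagonal of X (X'X)^{-1} X'
res_ms <- sum(resid(fit)^2) / (n - p)               # Residual MS
r_std <- resid(fit) / (sqrt(res_ms) * sqrt(1 - h))  # standardized residuals

all.equal(unname(h), unname(hatvalues(fit)))
all.equal(unname(r_std), unname(rstandard(fit)))
sum(h)                                              # hat values add up to p
```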
Studentized residuals
- studentized residual $\varepsilon^*_i = \hat\varepsilon_i / \widehat{\mathrm{SD}}[\hat\varepsilon_i]$
- $\varepsilon^*_{(i)}$: studentized residual leaving out the $i$th data point $(x_i, y_i)$

$$\varepsilon^*_{(i)} = \frac{\hat\varepsilon_i}{\sqrt{\text{Residual MS}_{(i)}} \sqrt{1 - h_i}}, \qquad \text{Residual MS}_{(i)} = \frac{\sum_{j=1, j \neq i}^{n} \hat\varepsilon_{j(i)}^2}{n - p - 1}$$

where $\hat\varepsilon_{j(i)}$ is the residual of data point $j$ in the fit without data point $i$.

$$\varepsilon^*_{(i)} \sim t_{n-p-1}, \qquad \varepsilon^*_i \text{ approximately } t_{n-p}$$

- $p$ number of coefficients, $h_i$ hat value $i$
- Detect unusually large studentized residuals (absolute value > 2)

> round(rstudent(model),2)
    1     2     3     4     5     6     7     8     9    10
 0.28  0.44 -0.45 -0.16  1.58  1.70  1.51 -1.14 -0.20 -1.36
   11    12    13    14    15
 0.21 -1.21 -1.38  0.73 -0.62

Conclusion: There are no extreme studentized residuals
Computation of studentized residuals
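$$\text{Residual MS}_{(i)} = \frac{\sum_{j=1, j \neq i}^{n} \hat\varepsilon_{j(i)}^2}{n - p - 1}, \qquad \varepsilon^*_{(i)} = \frac{\hat\varepsilon_i}{\sqrt{\text{Residual MS}_{(i)}} \sqrt{1 - h_i}}$$

where $\hat\varepsilon_{j(i)}$ is the residual of data point $j$ in the fit without data point $i$.

> round(rstudent(model),3)
     1      2      3      4      5      6      7      8
 0.275  0.441 -0.445 -0.156  1.579  1.705  1.510 -1.142
     9     10     11     12     13     14     15
-0.201 -1.361  0.211 -1.205 -1.383  0.727 -0.616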
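One way to make the leave-one-out idea explicit is to refit the model once per data point. The sketch below does this on simulated data and compares the result with `rstudent()`; the loop is written for clarity, not speed, and the names `dat`, `ms_i` and `r_stud` are assumptions.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
dat <- data.frame(x = rnorm(20, mean = 90, sd = 10))
dat$y <- 17 + 0.8 * dat$x + rnorm(20, sd = 5)
n <- nrow(dat)

fit <- lm(y ~ x, data = dat)
p <- length(coef(fit))
h <- hatvalues(fit)

r_stud <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])            # fit without data point i
  ms_i <- sum(resid(fit_i)^2) / (n - p - 1)       # Residual MS(i)
  resid(fit)[i] / (sqrt(ms_i) * sqrt(1 - h[i]))   # studentized residual of point i
})

all.equal(unname(r_stud), unname(rstudent(fit)))
which(abs(r_stud) > 2)                            # candidates for unusually large residuals
```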
Difference in fitted values (DFFITS)
- Difference in fit leaving out the $i$th data point: $\hat{Y}_i - \hat{Y}_{(i)}$
- Scaled difference in fitted values leaving out the $i$th data point:

$$\Delta \hat{Y}^*_{(i)} = \frac{\hat\varepsilon_i \sqrt{h_i}}{\sqrt{\text{Residual MS}_{(i)}}\,(1 - h_i)} = \varepsilon^*_{(i)} \sqrt{\frac{h_i}{1 - h_i}}$$

that is, the studentized residual $\varepsilon^*_{(i)} \sim t_{n-p-1}$ scaled by $\sqrt{h_i / (1 - h_i)}$.

cutoff criterion $|\Delta \hat{Y}^*_{(i)}| \ge 3 \sqrt{\frac{p}{n - p}}$ to diagnose large influence of $x_i$ on overall model fit

> n <- 15; p <- 2   # 15 observations, 2 coefficients
> (cutoff <- 3*sqrt(p/(n-p)))
[1] 1.176697
> round(dffits(model),3)
     1      2      3      4      5      6      7      8
 0.133  0.123 -0.144 -0.086  0.475  0.500  0.512 -0.422
     9     10     11     12     13     14     15
-0.132 -0.424  0.056 -0.667 -0.511  0.196 -0.258

Conclusion: There are no extreme differences in fit
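The DFFITS values can be rebuilt from residuals, hat values and leave-one-out residual standard deviations. The following sketch assumes simulated data; `influence(fit)$sigma` supplies the square root of Residual MS(i) for every data point.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)
n <- length(y)

fit <- lm(y ~ x)
p <- length(coef(fit))
h <- hatvalues(fit)
s_i <- influence(fit)$sigma                      # residual SD with data point i left out

dff <- resid(fit) * sqrt(h) / (s_i * (1 - h))    # scaled difference in fit

all.equal(unname(dff), unname(dffits(fit)))
all.equal(unname(dff), unname(rstudent(fit) * sqrt(h / (1 - h))))

cutoff <- 3 * sqrt(p / (n - p))                  # cutoff criterion from the slide
which(abs(dff) >= cutoff)                        # data points flagged as influential
```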
Influence of ith data point on estimated beta (DFBETAS)

Scaled difference in estimated coefficients leaving out the $i$th data point

$$\Delta \hat\beta^*_{1(i)} = \frac{\hat\beta_1 - \hat\beta_{1(i)}}{\sqrt{\text{Residual MS}_{(i)}} \sqrt{c_{11}}}$$

where $c_{11}$ is the diagonal element of $(X^T X)^{-1}$ that belongs to $\beta_1$.

cutoff criterion $|\Delta \hat\beta^*_{1(i)}| > 1$

> round(dfbetas(model),3)
  (Intercept) Machine
1       0.118  -0.107
2       0.050  -0.033
3       0.059  -0.078
# etc

Conclusion: No data points with extreme influence on the betas

Remark: Valid inference is impossible in case of an extremely influential data point
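As a hedged illustration, DFBETAS can be obtained by refitting the model without each data point in turn and scaling the change in the coefficients. The sketch uses simulated data and hypothetical names and compares the result with `dfbetas()`.

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
dat <- data.frame(x = rnorm(20, mean = 90, sd = 10))
dat$y <- 17 + 0.8 * dat$x + rnorm(20, sd = 5)
n <- nrow(dat)

fit <- lm(y ~ x, data = dat)
c_diag <- diag(solve(crossprod(model.matrix(fit))))   # diagonal of (X'X)^{-1}

dfb <- t(sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = dat[-i, ])                # fit without data point i
  s_i <- summary(fit_i)$sigma                         # sqrt of Residual MS(i)
  (coef(fit) - coef(fit_i)) / (s_i * sqrt(c_diag))    # scaled change per coefficient
}))

all.equal(unname(dfb), unname(dfbetas(fit)))
which(abs(dfb) > 1, arr.ind = TRUE)                   # data points exceeding the cutoff
```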
Cook's distance

Cook's distance $D_i$ is another measure of the influence of the $i$th observation on the estimated coefficients (Chatterjee & Hadi, 1988)

$$D_i = \frac{h_i}{p} \left( \frac{\hat\varepsilon_i}{\sqrt{\text{Residual MS}}\,(1 - h_i)} \right)^2, \quad \text{compared with the } F_{p,n-p} \text{ distribution}$$

- $p$ number of beta coefficients
- Interpretation: scaled distance between $\hat\beta$ and $\hat\beta_{(i)}$
- cutoff criterion $D_i \ge 0.5$

> round(cooks.distance(model),3)
    1     2     3     4     5     6     7     8
0.010 0.008 0.011 0.004 0.101 0.109 0.119 0.087
    9    10    11    12    13    14    15
0.009 0.084 0.002 0.215 0.122 0.020 0.035
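Cook's distance follows from the residuals, the hat values and the Residual MS of the full fit. A minimal sketch on simulated data, compared with `cooks.distance()`:

```r
# Simulated data for illustration only (not the data from the slides)
set.seed(1)
x <- rnorm(20, mean = 90, sd = 10)
y <- 17 + 0.8 * x + rnorm(20, sd = 5)
n <- length(y)

fit <- lm(y ~ x)
p <- length(coef(fit))
h <- hatvalues(fit)
res_ms <- sum(resid(fit)^2) / (n - p)                        # Residual MS of the full fit

d_hand <- (h / p) * (resid(fit) / (sqrt(res_ms) * (1 - h)))^2

all.equal(unname(d_hand), unname(cooks.distance(fit)))
all.equal(unname(d_hand), unname(rstandard(fit)^2 / p * h / (1 - h)))
which(d_hand >= 0.5)                                         # cutoff criterion from the slide
```

The second comparison shows that Cook's distance is the squared standardized residual, divided by $p$ and weighted by $h_i / (1 - h_i)$.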
Diagnostic plots
A very simple way to produce a two-by-two panel with 4 diagnostic plots

par(mfrow = c(2,2))
plot(model)
par(mfrow = c(1,1))

- upper left: residuals versus fitted values, for detecting heteroscedasticity (increase in sd) or nonlinearity in the data
- upper right: normal QQ plot
- lower left: scale-location plot
- lower right: residuals versus leverage, for detecting influential data points (Cook's distance)

influential data points are identified
Multiple Regression
linear model with more than one predictor variable
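$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$

- unknown parameters of model: $\beta_0, \beta_1, \beta_2, \dots, \beta_p$
- $\beta_0$ intercept; $\beta_1, \beta_2, \dots, \beta_p$ regression coefficients
- $\varepsilon_i$ random "error" variable
- $\varepsilon_i$ independently normally distributed with mean $E[\varepsilon_i] = 0$ (wlog), $\mathrm{Var}[\varepsilon_i] = \sigma^2$ for all $i$
- independent (explanatory) variables $X_i$
- dependent variables $Y_i$
- $\sigma^2 = 0 \Rightarrow \varepsilon_i = 0$ for all $i$; deterministic model (special case)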
Topics Multiple Regression
linear model with more than one predictor variable
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon$$

- Estimates significantly different from zero
- Which predictor variables to choose?
- Too many predictors: good fit (large $R^2$), but estimates insignificant (overfitting, not generalizable)
- Too few predictors: bad fit (low $R^2$), estimates significant (underfitting)
- Desideratum 1: Best estimate of correct model
- Desideratum 2: Valid statistical inferences
Criteria to find best model
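$$R^2 := 1 - \frac{SSE}{SST} = \mathrm{cor}(Y, \hat{Y})^2 \quad \text{never decreases with increasing } p$$

$$R^2_{adj} = 1 - \frac{SSE / (n - p)}{SST / (n - 1)}; \qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$

$$\text{Mallows' } C_p = \frac{SSE}{\hat\sigma^2} - n + 2p$$

$$\text{maximum likelihood} = \left( \frac{1}{\sqrt{2 \pi \hat\sigma^2}} \right)^n \exp\left( -\frac{SSE}{2 \hat\sigma^2} \right)$$

$$AIC = -2 \ln(\text{maximum likelihood}) + 2p$$

$$BIC = -2 \ln(\text{maximum likelihood}) + \ln(n) \cdot p$$

- $C_p$, AIC, BIC prevent overfitting (invalid inference)
- choose the model with the smallest $C_p$, AIC, BIC
- AIC tends to choose too complex models, BIC does not (= consistent)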
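The criteria can be compared on a small set of nested models. The sketch below uses simulated data in which `x3` is pure noise; all names are assumptions for the example. Two choices are specific to this sketch: R's `AIC()` and `BIC()` count the error variance as an additional parameter, so the penalty uses `p + 1`, and $\hat\sigma^2$ in Mallows' $C_p$ is taken from the largest model.

```r
# Simulated data for illustration only: y depends on x1 and x2, x3 is pure noise
set.seed(2)
n <- 40
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)

fits <- list(m1 = lm(y ~ x1), m2 = lm(y ~ x1 + x2), m3 = lm(y ~ x1 + x2 + x3))
sigma2_full <- summary(fits$m3)$sigma^2        # error variance estimate of the largest model
sst <- sum((y - mean(y))^2)

crit <- sapply(fits, function(f) {
  sse <- sum(resid(f)^2)
  p <- length(coef(f))                         # number of regression coefficients
  ll <- as.numeric(logLik(f))                  # maximized log-likelihood
  k <- p + 1                                   # R also counts the error variance
  c(R2    = 1 - sse / sst,
    R2adj = 1 - (sse / (n - p)) / (sst / (n - 1)),
    Cp    = sse / sigma2_full - n + 2 * p,
    AIC   = -2 * ll + 2 * k,
    BIC   = -2 * ll + log(n) * k)
})
round(crit, 3)
```

In this table $R^2$ cannot decrease from `m1` to `m3`, while the adjusted $R^2$, $C_p$, AIC and BIC are free to favor a smaller model.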
lung function data for cystic fibrosis patients
Data frame with 25 rows (patients aged 7-23 years) and 10 columns

Variables (columns):

- age: in years.
- sex: 0: male, 1: female.
- height (cm).
- weight (kg).
- bmp: body mass (percentage of normal).
- fev1: forced expiratory volume.
- rv: residual volume.
- frc: functional residual capacity.
- tlc: total lung capacity.
- pemax: maximum expiratory pressure.
Predicting maximum expiratory pressure for cystic fibrosis patients

> summary(lm(pemax ~ ., data=cystfibr))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 176.0582   225.8912   0.779    0.448
age          -2.5420     4.8017  -0.529    0.604
sex          -3.7368    15.4598  -0.242    0.812
height       -0.4463     0.9034  -0.494    0.628
weight        2.9928     2.0080   1.490    0.157
bmp          -1.7449     1.1552  -1.510    0.152
fev1          1.0807     1.0809   1.000    0.333
rv            0.1970     0.1962   1.004    0.331
frc          -0.3084     0.4924  -0.626    0.540
tlc           0.1886     0.4997   0.377    0.711
Multiple R-squared: 0.64, Adjusted R-squared: 0.42
F-statistic: 2.929 on 9 and 15 DF, p-value: 0.03195

Conclusion: No parameters significantly different from zero
Minimum AIC by backward selection
> model <- lm(pemax ~ ., data=cystfibr)
> model1 <- step(model, direction = "backward")
Step: AIC=160.66
pemax ~ weight + bmp + fev1 + rv
> summary(model1)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.94669   53.27673   1.200 0.244057
weight       1.74891    0.38063   4.595 0.000175 ***
bmp         -1.37724    0.56534  -2.436 0.024322 *
fev1         1.54770    0.57761   2.679 0.014410 *
rv           0.12572    0.08315   1.512 0.146178

Conclusions:

- The model with minimum AIC has more predictors
- rv is not significant
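Backward selection with `step()` can be tried out without the cystic fibrosis data. The sketch below uses a simulated data frame in which `x3` and `x4` are pure noise (an assumption of this example); `trace = 0` only suppresses the step-by-step log.

```r
# Simulated data for illustration only: y depends on x1 and x2, x3 and x4 are noise
set.seed(3)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

full <- lm(y ~ ., data = dat)                           # start from all predictors
sel <- step(full, direction = "backward", trace = 0)    # drop terms while AIC decreases

formula(sel)                                            # predictors that remain
extractAIC(full)                                        # equivalent df and AIC, full model
extractAIC(sel)                                         # equivalent df and AIC, selected model
```

`step()` compares models with `extractAIC()`, which differs from `AIC()` by an additive constant for a fixed data set, so both lead to the same ranking of models.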
Estimates of best subset selection
> summary(lm(pemax ~ weight + bmp + fev1, data=cystfibr))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 126.3336    34.7199   3.639 0.001536 **
weight        1.5365     0.3644   4.216 0.000387 ***
bmp          -1.4654     0.5793  -2.530 0.019486 *
fev1          1.1086     0.5144   2.155 0.042893 *

Residual standard error: 23.44 on 21 degrees of freedom
Multiple R-squared: 0.57, Adjusted R-squared: 0.51
F-statistic: 9.279 on 3 and 21 DF, p-value: 0.0004180

Conclusion: A smaller best model is found, which seems well interpretable
Confidence interval for parameters
> round(confint(model),2)
            2.5 % 97.5 %
(Intercept) 54.13 198.54
weight       0.78   2.29
bmp         -2.67  -0.26
fev1         0.04   2.18

- Conclusion: Large confidence intervals indicate uncertainty
- Advice: Increase the number of patients
- Note: Pilot type of study