Linear Model

Wim Krijnen
Lector Analyse Technieken voor Praktijkonderzoek

Lectoraat Healthy Ageing, Allied Health Care and Nursing
Hanze University of Applied Sciences

July 1, 2015

Overview

Correlation between two variables
Simple linear models; one predictor variable
Least squares estimation of parameters
Model evaluation
  ANOVA table
  Goodness of fit
  t-test for model coefficients
  Confidence interval for correlation coefficient
Prediction of new values
Linear model diagnostics
Multiple regression
Model building: finding best subset


Purpose of linear regression analysis

Determining the amount of variance in the criterion that the predictors can explain
Describing the relationship between a dependent variable and a set of independent variables
Estimation of parameters
Prediction of new values
Control


Correlation Coefficient

Expresses the degree of linear relationship between two variables (measurements)

sample variance of X: $S_X^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n - 1}$

sample variance of Y: $S_Y^2 = \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n - 1}$

sample covariance: $S_{XY} = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

sample correlation coefficient: $R_{XY} = \frac{S_{XY}}{S_X \cdot S_Y}$

$-1 \le R_{XY} \le 1$
$R_{XY}$ close to $\pm 1$ implies that all points lie close to a straight line
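The definitions above can be checked numerically. The following is a minimal sketch on simulated data (the vectors `x` and `y` and their parameters are made up for illustration and are not the blood pressure data); it computes the sample variances, the covariance and the correlation from the formulas and compares them with R's built-in functions.

```r
# Simulated data; x and y are illustrative assumptions, not data from the slides
set.seed(1)
n <- 20
x <- rnorm(n, mean = 100, sd = 10)
y <- 5 + 0.8 * x + rnorm(n, sd = 4)

# Sample variances, covariance and correlation from the definitions
s2_x <- sum((x - mean(x))^2) / (n - 1)
s2_y <- sum((y - mean(y))^2) / (n - 1)
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
r_xy <- s_xy / (sqrt(s2_x) * sqrt(s2_y))

# Each comparison should return TRUE
all.equal(s2_x, var(x))
all.equal(s_xy, cov(x, y))
all.equal(r_xy, cor(x, y))
```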



linear relationship

linear relation between x and y

y = ax + b

the value of y is a function of x

y = f (x) = ax + b

the function is called linear (of first order); deterministic
f′(x) = a; if x increases by one unit, y changes by a
a is the slope, b is the intercept
if a = 0, then y = b is constant
the range is determined by the values of x
given a, b, and a new x, the value of y can be computed
change of names: β0 intercept, β1 slope


Example of straight line



Linear model with one predictor

we generalize to a statistical model

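$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \dots, n$

unknown parameters of the model: $\beta_0$, $\beta_1$

$\beta_0$ intercept, $\beta_1$ slope
$\varepsilon_i$ random "error" variable
$\varepsilon_i$ independently normally distributed with mean $E[\varepsilon_i] = 0$ (without loss of generality) and $\mathrm{Var}[\varepsilon_i] = \sigma^2$ for all $i$
independent (explanatory) variables $X_i$
dependent variables $Y_i$
$\sigma^2 = 0 \Rightarrow \varepsilon_i = 0$ for all $i$; deterministic model (special case)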


Least Squares (ML) Estimation of parameters

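$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \Rightarrow \varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$

minimize the sum of squared differences $S$ between the $Y_i$ and the line

$S = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 X_i)^2$

$\hat\beta_0 = \bar{Y} - \hat\beta_1 \bar{X}, \qquad \hat\beta_1 = \frac{\sum X_i Y_i - n \bar{Y} \bar{X}}{\sum X_i^2 - (\sum X_i)^2 / n}$

$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$

computational formulas used in software
$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ "regression line"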
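As a sanity check of the closed-form estimators, the sketch below evaluates them on simulated data and compares the result with `lm()`. The data and the names `b0`, `b1` are assumptions chosen for illustration only.

```r
# Simulated data; the values are illustrative assumptions
set.seed(2)
n <- 15
x <- runif(n, min = 60, max = 120)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)

# Least squares estimates from the computational formulas
b1 <- (sum(x * y) - n * mean(y) * mean(x)) / (sum(x^2) - sum(x)^2 / n)
b0 <- mean(y) - b1 * mean(x)

# The same estimates from lm(); the comparison should return TRUE
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
```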


Properties of least squares estimator

1. best (minimum variance) linear unbiased estimator (Gauss-Markov)
2. minimal $\mathrm{Var}[\hat\beta] = \sigma^2 (X^T X)^{-1} \to O$ as $n$ increases
3. consistent in the sense that the estimated coefficients tend to the coefficients in the population
4. equals the maximum likelihood estimator
5. makes sense geometrically: the predicted values $\hat{y}$ are the orthogonal projection of $y$ onto the space spanned by the predictors $X$


Sums of squares and F-test

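$H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$

Regression SS $:= \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2$, Residual SS $:= \sum_{i=1}^n \hat\varepsilon_i^2$

Total SS $:= \sum_{i=1}^n (Y_i - \bar{Y})^2$ = Regression SS + Residual SS

Regression MS $:= \frac{\text{Regression SS}}{p - 1}$, Residual MS $:= \frac{\text{Residual SS}}{n - p}$ (with one predictor, $p = 2$ coefficients)

$F = \frac{\text{Regression MS}}{\text{Residual MS}} \sim F_{1,n-2}$; p-value $= P(F_{1,n-2} > F)$

If p-value < α, then reject $H_0$
If p-value > α, then do not reject $H_0$

Reasoning: if $\beta_1 \approx 0$, then $\hat{Y}_i \approx \hat\beta_0 = \bar{Y}$ ⇒ Regression SS small; F not significant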
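The decomposition and the F-test can be reproduced by hand. This is a minimal sketch on simulated data (all object names and data values are illustrative assumptions); it computes the sums of squares and the F statistic from the definitions and compares them with the output of `anova()`.

```r
# Simulated data; the values are illustrative assumptions
set.seed(3)
n <- 15
x <- runif(n, min = 60, max = 120)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)
p <- length(coef(fit))            # number of coefficients (here 2)

# Sums of squares from the definitions
reg_ss <- sum((fitted(fit) - mean(y))^2)
res_ss <- sum(residuals(fit)^2)
tot_ss <- sum((y - mean(y))^2)

# F statistic and its p-value
f_stat  <- (reg_ss / (p - 1)) / (res_ss / (n - p))
p_value <- pf(f_stat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)

# Comparison with anova(); each line should return TRUE
tab <- anova(fit)
all.equal(tot_ss, reg_ss + res_ss)
all.equal(f_stat, tab[["F value"]][1])
all.equal(p_value, tab[["Pr(>F)"]][1])
```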


ANOVA table and how it looks in R

ANOVA table

             df      SS              MS                 F                           p-value
Regression   1       Regression SS   Regression SS/df   Regression MS/Residual MS
Residuals    n − 2   Residual SS     Residual SS/df

SS: sums of squares; MS: mean sum of squares; df: degrees of freedom
15 blood pressure measurements by expert and machine
expert replaced by machine if measurements are equal (up to error)

> model <- lm(Expert ~ Machine, data = blood)
> anova(model)
Analysis of Variance Table
Response: y
           Df  Sum Sq Mean Sq F value    Pr(>F)
x           1 150.485 150.485  1967.4 < 2.2e-16 ***
Residuals 320  24.477   0.076

Conclusion: Reject H0 : β1 = 0


goodness of fit

$R^2 := \frac{\text{Regression SS}}{\text{Total SS}} = \frac{\text{Regression SS}}{\text{Regression SS} + \text{Residual SS}} = \mathrm{cor}(Y, \hat{Y})^2$

$0 \le R^2 \le 1$
larger values of $R^2$ indicate better goodness of fit
$R^2 \cdot 100\%$ is the percentage of variance in $Y$ "explained" by linear regression

> cor(blood$Expert, model$fitted.values)^2
[1] 0.822395

Conclusion: 82% of the variance in blood pressure measurements by Expert is explained by Machine


Relation between correlation and slope

The correlation in the population, ρ, is estimated by Pearson's sample correlation coefficient

$R_{XY} = \frac{S_{XY}}{S_X S_Y}, \quad \hat\beta_1 = \frac{S_{XY}}{S_X^2} \Rightarrow \hat\beta_1 = \frac{S_Y}{S_X} \cdot R_{XY}$

holds also for the population: $\beta_1 = \frac{\sigma_Y}{\sigma_X} \rho$

$R_{XY} = 0$ implies $\hat\beta_1 = 0$ (correlation is necessary for nonzero slope)
$R_{XY}$ and $\hat\beta_1$ have the same sign (e.g. both positive)
$R_{XY} > 0 \Rightarrow$ increase in X associated with increase in Y
$R_{XY} < 0 \Rightarrow$ increase in X associated with decrease in Y
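The link between slope, correlation and goodness of fit can be verified numerically. The sketch below uses simulated data (an assumption, not the blood pressure data) and checks that the slope equals $(S_Y / S_X) R_{XY}$ and that, with one predictor, $R^2$ equals both $\mathrm{cor}(Y, \hat{Y})^2$ and $R_{XY}^2$.

```r
# Simulated data; the values are illustrative assumptions
set.seed(4)
n <- 15
x <- runif(n, min = 60, max = 120)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

# Slope expressed through the correlation coefficient
r_xy  <- cor(x, y)
slope <- sd(y) / sd(x) * r_xy

# Each comparison should return TRUE
all.equal(slope, unname(coef(fit)["x"]))
all.equal(summary(fit)$r.squared, cor(y, fitted(fit))^2)
all.equal(summary(fit)$r.squared, r_xy^2)
```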


t-test for model coefficients

$H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$

$\mathrm{SE}[\hat\beta_1] = \sqrt{\frac{\text{Residual MS}}{(n - 1) S_X^2}}, \quad t = \frac{\hat\beta_1}{\mathrm{SE}[\hat\beta_1]} \sim t_{n-2}$ distribution

If $|t| > t_{n-2,1-\alpha/2}$, then reject $H_0$ in favor of $H_A: \beta_1 \neq 0$
If the two-sided p-value $2 P(t_{n-2} \ge |t|) < \alpha$, then reject $H_0$ in favor of $H_A: \beta_1 \neq 0$

> summary(lm(Expert ~ Machine, data = blood))
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  16.9152     8.8236   1.917   0.0775 .
Machine       0.7907     0.1019   7.759 3.12e-06 ***

Conclusion: Reject H0 : β1 = 0


Confidence intervals for model coefficients

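express our knowledge with respect to parameter estimation in a confidence interval (CI)

$(1 - \alpha)100\%\ \mathrm{CI}[\beta_1] = \hat\beta_1 \pm t_{n-2,1-\alpha/2} \cdot \mathrm{SE}[\hat\beta_1], \quad \mathrm{SE}[\hat\beta_1] = \sqrt{\frac{\text{Residual MS}}{(n - 1) S_X^2}}$

> confint(model)
                2.5 %    97.5 %
(Intercept) -2.146983 35.977336
Machine      0.570539  1.010882

Conclusion: With 95% confidence, β1 lies within [0.57, 1.01]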
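The standard error, the t-test and the confidence interval for the slope can be computed directly from the formulas. The following sketch does so on simulated data (the data and object names are illustrative assumptions) and compares the results with `summary()` and `confint()`.

```r
# Simulated data; the values are illustrative assumptions
set.seed(5)
n <- 15
x <- runif(n, min = 60, max = 120)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

# Standard error of the slope from Residual MS and the sample variance of x
res_ms <- sum(residuals(fit)^2) / (n - 2)
b1     <- unname(coef(fit)["x"])
se_b1  <- sqrt(res_ms / ((n - 1) * var(x)))

# Two-sided t-test and 95% confidence interval
t_stat  <- b1 / se_b1
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
alpha   <- 0.05
ci      <- b1 + c(-1, 1) * qt(1 - alpha / 2, df = n - 2) * se_b1

# Comparison with R's output; each line should return TRUE
coef_tab <- coef(summary(fit))
all.equal(se_b1, coef_tab["x", "Std. Error"])
all.equal(t_stat, coef_tab["x", "t value"])
all.equal(p_value, coef_tab["x", "Pr(>|t|)"])
all.equal(ci, unname(confint(fit)["x", ]))
```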


Prediction/Confidence interval

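New value $X$; predicted $\hat{Y} = \hat\beta_0 + \hat\beta_1 X$ (average predicted value)
construct a 95% prediction confidence interval

$(1 - \alpha)100\%\ \mathrm{CI}[\hat{Y}] = \hat{Y} \pm \sqrt{2 F_{2,n-2,1-\alpha}} \cdot \mathrm{SE}[\hat{Y}]$

$\mathrm{SE}[\hat{Y}] = \sqrt{S_\varepsilon^2 \left( \frac{1}{n} + \frac{(X - \bar{X})^2}{(n - 1) S_X^2} \right)}, \quad S_\varepsilon^2 = \frac{\sum_{i=1}^n \hat\varepsilon_i^2}{n - 2}$

Note: predict() with interval = "prediction" uses the quantile $t_{n-2,1-\alpha/2}$ and the standard error $\sqrt{S_\varepsilon^2 + \mathrm{SE}[\hat{Y}]^2}$, so its interval refers to a single new observation.

Example: Machine measures diastolic blood pressure 95

> predict.lm(model, newdata = data.frame(Machine = 95),
+            interval = "prediction")
       fit      lwr      upr
1 92.03268 80.41466 103.6507

Conclusion: With 95% certainty, the predicted expert blood pressure lies in [80.4, 103.7]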
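To make the two kinds of interval concrete, the sketch below computes them by hand on simulated data (the data, the new value 95 and all object names are illustrative assumptions). It reproduces the interval returned by `predict()` with `interval = "prediction"` and also evaluates the half-width of the band based on $\sqrt{2F}$ for the mean response.

```r
# Simulated data; the values are illustrative assumptions
set.seed(6)
n <- 15
x <- runif(n, min = 60, max = 120)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

x_new <- 95
y_hat <- sum(coef(fit) * c(1, x_new))       # b0 + b1 * x_new
s2    <- sum(residuals(fit)^2) / (n - 2)    # estimated error variance

# Standard error of the mean response and of a single new observation
se_mean <- sqrt(s2 * (1 / n + (x_new - mean(x))^2 / ((n - 1) * var(x))))
se_pred <- sqrt(s2 + se_mean^2)

# 95% prediction interval as computed by predict()
t_q     <- qt(0.975, df = n - 2)
by_hand <- c(y_hat, y_hat - t_q * se_pred, y_hat + t_q * se_pred)
from_r  <- predict(fit, newdata = data.frame(x = x_new),
                   interval = "prediction")
all.equal(by_hand, unname(from_r[1, ]))     # should return TRUE

# Half-width of the band for the mean response based on sqrt(2 F)
band_half_width <- sqrt(2 * qf(0.95, df1 = 2, df2 = n - 2)) * se_mean
band_half_width
```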


plot of prediction confidence intervals


Model diagnostics

Recall $Y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$
Diagnose the extent to which the model assumptions hold

same error variance for all values of X
residuals are nearly normally distributed
there are no extreme outliers
only a small proportion of residuals is relatively large
there are no data points highly influential on residuals or regression coefficients

This increases our confidence in making valid statistical inferences


Standardized residuals

Residual MS $:= \sum_{i=1}^n \hat\varepsilon_i^2 / (n - p)$

$h_i$ = element $ii$ of the hat matrix $X (X^T X)^{-1} X^T$

standardized $i$th residual $\varepsilon'_i = \frac{\hat\varepsilon_i}{\sqrt{\text{Residual MS}} \sqrt{1 - h_i}}$

> round(rstandard(model),3)
     1      2      3      4      5      6      7      8
 0.285  0.455 -0.460 -0.163  1.496  1.592  1.441 -1.129
     9     10     11     12     13     14     15
-0.209 -1.319  0.219 -1.185 -1.337  0.741 -0.632


Studentized residuals

studentized residual $\varepsilon^*_i = \varepsilon'_i / \mathrm{SD}[\varepsilon_i]$
$\varepsilon^*_{(i)}$ studentized residual leaving out the $i$th data point $(x_i, y_i)$

$\varepsilon^*_{(i)} = \frac{\hat\varepsilon_i}{\sqrt{\text{Residual MS}_{(i)}} \sqrt{1 - h_i}}, \quad \text{Residual MS}_{(i)} = \frac{\sum_{j=1, j \neq i}^n \hat\varepsilon_j^2}{n - p - 1}$ (residuals from the fit without the $i$th data point)

$\varepsilon^*_{(i)} \sim t_{n-p-1}$; $\varepsilon^*_i$ approximately $t_{n-p}$

$p$ number of coefficients, $h_i$ hat value $i$
Detect unusually large change (> 2) in predicted value

> round(rstudent(model),2)
    1     2     3     4     5     6     7     8     9    10
 0.28  0.44 -0.45 -0.16  1.58  1.70  1.51 -1.14 -0.20 -1.36
   11    12    13    14    15
 0.21 -1.21 -1.38  0.73 -0.62

Conclusion: There are no extreme studentized residuals

Computation of studentized residuals

$\text{Residual MS}_{(i)} = \frac{\sum_{j=1, j \neq i}^n \hat\varepsilon_j^2}{n - p - 1}, \quad \varepsilon^*_{(i)} = \frac{\hat\varepsilon_i}{\sqrt{\text{Residual MS}_{(i)}} \sqrt{1 - h_i}}$

> round(rstudent(model),3)
     1      2      3      4      5      6      7      8
 0.275  0.441 -0.445 -0.156  1.579  1.705  1.510 -1.142
     9     10     11     12     13     14     15
-0.201 -1.361  0.211 -1.205 -1.383  0.727 -0.616
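The standardized and leave-one-out studentized residuals can be computed from the hat values without refitting the model n times. The sketch below uses simulated data (an assumption) and relies on the standard identity $(n - p - 1)\,\text{Residual MS}_{(i)} = (n - p)\,\text{Residual MS} - \hat\varepsilon_i^2 / (1 - h_i)$, which is not stated on the slides; the results are compared with `hatvalues()`, `rstandard()` and `rstudent()`.

```r
# Simulated data; the values are illustrative assumptions
set.seed(7)
n <- 15
x <- runif(n, min = 60, max = 120)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

X <- model.matrix(fit)                       # n x p design matrix
p <- ncol(X)
h <- diag(X %*% solve(t(X) %*% X) %*% t(X))  # diagonal of the hat matrix
e <- residuals(fit)

# Standardized residuals
res_ms <- sum(e^2) / (n - p)
r_std  <- e / (sqrt(res_ms) * sqrt(1 - h))

# Leave-one-out residual MS via the identity named in the lead-in
res_ms_i <- ((n - p) * res_ms - e^2 / (1 - h)) / (n - p - 1)
r_stud   <- e / (sqrt(res_ms_i) * sqrt(1 - h))

# Each comparison should return TRUE
all.equal(unname(h), unname(hatvalues(fit)))
all.equal(unname(r_std), unname(rstandard(fit)))
all.equal(unname(r_stud), unname(rstudent(fit)))
```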


Difference in fitted values (DFFITS)

Difference in fit leaving out the $i$th data point: $\hat{Y}_i - \hat{Y}_{(i)}$
Scaled difference in fitted values leaving out the $i$th data point

$\Delta \hat{Y}^*_{(i)} = \frac{\hat\varepsilon_i \sqrt{h_i}}{\sqrt{\text{Residual MS}_{(i)}}\,(1 - h_i)} \sim F_{p,n-p-1}$

cutoff criterion $|\Delta \hat{Y}^*_{(i)}| \ge 3 \sqrt{\frac{p}{n - p}}$

to diagnose large influence of $x_i$ on overall model fit

> n <- 15; p <- 2   # 15 observations, 2 coefficients
> (cutoff <- 3*sqrt(p/(n-p)))
[1] 1.176697
> round(dffits(model),3)
     1      2      3      4      5      6      7      8
 0.133  0.123 -0.144 -0.086  0.475  0.500  0.512 -0.422
     9     10     11     12     13     14     15
-0.132 -0.424  0.056 -0.667 -0.511  0.196 -0.258

Conclusion: There are no extreme differences in fit

Influence of ith data point on estimated beta (DFBETAS)

Scaled difference in the estimated coefficient leaving out the $i$th data point

$\Delta \hat\beta^*_{1(i)} = \frac{\hat\beta_1 - \hat\beta_{1(i)}}{\sqrt{\text{Residual MS}_{(i)}} \sqrt{[(X^T X)^{-1}]_{jj}}}$, where $jj$ denotes the diagonal element belonging to $\beta_1$

cutoff criterion $|\Delta \hat\beta^*_{1(i)}| > 1$

> round(dfbetas(model),3)
  (Intercept) Machine
1       0.118  -0.107
2       0.050  -0.033
3       0.059  -0.078
# etc

Conclusion: No data points with extreme influence on the betas
Remark: Valid inference is impossible in the presence of an extremely influential data point


Cook's distance

Cook's distance $D_i$ is another measure of the influence of the $i$th observation on the estimated coefficients (Chatterjee & Hadi, 1988)

$D_i = \frac{h_i}{p} \left( \frac{\hat\varepsilon_i}{\sqrt{\text{Residual MS}}\,(1 - h_i)} \right)^2 \sim F_{p,n-p}$

$p$ number of beta coefficients
Interpretation: Scaled distance between $\hat\beta$ and $\hat\beta_{(i)}$

cutoff criterion $D_i \ge 0.5$

> round(cooks.distance(model),3)
    1     2     3     4     5     6     7     8
0.010 0.008 0.011 0.004 0.101 0.109 0.119 0.087
    9    10    11    12    13    14    15
0.009 0.084 0.002 0.215 0.122 0.020 0.035
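The influence measures can be reproduced from residuals and hat values. The sketch below is a hedged illustration on simulated data (not the blood pressure data); it computes DFFITS and Cook's distance from the formulas, using the full-sample Residual MS for Cook's distance as R's `cooks.distance()` does, and then applies the two cutoff criteria.

```r
# Simulated data; the values are illustrative assumptions
set.seed(8)
n <- 15
x <- runif(n, min = 60, max = 120)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

p <- length(coef(fit))
h <- hatvalues(fit)
e <- residuals(fit)
res_ms   <- sum(e^2) / (n - p)
res_ms_i <- ((n - p) * res_ms - e^2 / (1 - h)) / (n - p - 1)  # leave-one-out MS

# DFFITS and Cook's distance from the formulas
dffits_hand <- e * sqrt(h) / (sqrt(res_ms_i) * (1 - h))
cook_hand   <- (h / p) * (e / (sqrt(res_ms) * (1 - h)))^2

# Each comparison should return TRUE
all.equal(unname(dffits_hand), unname(dffits(fit)))
all.equal(unname(cook_hand), unname(cooks.distance(fit)))

# Observations exceeding the cutoff criteria (possibly none)
which(abs(dffits_hand) >= 3 * sqrt(p / (n - p)))
which(cook_hand >= 0.5)
```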


Diagnostic plots

A very simple way to produce a two-by-two panel with 4 diagnostic plots

par(mfrow = c(2, 2))  # 2 x 2 panel
plot(model)           # the four standard diagnostic plots
par(mfrow = c(1, 1))  # reset to a single panel

upper left: residuals versus fitted values, for detecting heteroscedasticity (increase in sd) or nonlinearity in the data
upper right: normal QQ plot
lower left: scale-location plot
lower right: residuals versus leverage, for detecting influential data points (Cook's distance)

influential data points are identified



Multiple Regression

linear model with more than one predictor variable

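$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$

unknown parameters of the model: $\beta_0, \beta_1, \beta_2, \dots, \beta_p$

$\beta_0$ intercept, $\beta_1, \beta_2, \dots, \beta_p$ regression coefficients
$\varepsilon_i$ random "error" variable
$\varepsilon_i$ independently normally distributed with mean $E[\varepsilon_i] = 0$ (without loss of generality) and $\mathrm{Var}[\varepsilon_i] = \sigma^2$ for all $i$
independent (explanatory) variables $X_i$
dependent variables $Y_i$
$\sigma^2 = 0 \Rightarrow \varepsilon_i = 0$ for all $i$; deterministic model (special case)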


Topics Multiple Regression

linear model with more than one predictor variable

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$

Estimates significantly different from zero
Which predictor variables to choose?
Too many predictors: good fit (large $R^2$), but estimates insignificant (overfitting, not generalizable)
Too few predictors: bad fit (low $R^2$), estimates significant (underfitting)
Desideratum 1: Best estimate of correct model
Desideratum 2: Valid statistical inferences


Criteria to find best model

$R^2 := 1 - \frac{SSE}{SST} = \mathrm{cor}(Y, \hat{Y})^2$ increases with increasing $p$

$R^2_{adj} = 1 - \frac{SSE/(n - p)}{SST/(n - 1)}; \quad SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2, \quad SST = \sum_{i=1}^n (y_i - \bar{y})^2$

Mallows' $C_p = \frac{SSE}{\hat\sigma^2} - n + 2p$

maximum likelihood $= \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^n \exp\left( -\frac{SSE}{2\sigma^2} \right)$

$AIC = -2 \ln(\text{maximum likelihood}) + 2p$

$BIC = -2 \ln(\text{maximum likelihood}) + \ln(n) \cdot p$

$C_p$, AIC, BIC prevent overfitting (invalid inference)
choose the model with the smallest $C_p$, AIC, BIC
AIC tends to choose too complex models, BIC does not (it is consistent)
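The criteria above can be evaluated by hand for a fitted model. The following sketch uses simulated data with three hypothetical predictors `x1`, `x2`, `x3` (assumptions for illustration). Two points are assumptions about conventions rather than statements from the slides: $p$ counts all coefficients including the intercept, and R's `AIC()` and `BIC()` count the error variance as one additional parameter, so the penalty uses $k = p + 1$.

```r
# Simulated data; predictors and coefficients are illustrative assumptions
set.seed(9)
n <- 40
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)        # x3 has no effect on y
fit <- lm(y ~ x1 + x2 + x3, data = d)

p   <- length(coef(fit))                     # coefficients incl. intercept
sse <- sum(residuals(fit)^2)
sst <- sum((d$y - mean(d$y))^2)

# R squared and adjusted R squared
r2     <- 1 - sse / sst
r2_adj <- 1 - (sse / (n - p)) / (sst / (n - 1))

# Maximized log-likelihood (sigma^2 replaced by its ML estimate sse / n)
loglik <- -n / 2 * (log(2 * pi) + log(sse / n) + 1)
k   <- p + 1                                 # R also counts the error variance
aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + log(n) * k

# Each comparison should return TRUE
all.equal(r2, summary(fit)$r.squared)
all.equal(r2_adj, summary(fit)$adj.r.squared)
all.equal(aic, AIC(fit))
all.equal(bic, BIC(fit))
```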


lung function data for cystic fibrosis patients

Data frame with 25 rows (patients aged 7-23 years) and 10 columns
Variables (columns):

age in years.
sex 0: male, 1: female.
height (cm).
weight (kg).
bmp: body mass (percentage of normal).
fev1: forced expiratory volume.
rv: residual volume.
frc: functional residual capacity.
tlc: total lung capacity.
pemax: maximum expiratory pressure.


Predicting maximum expiratory pressure for cystic fibrosis patients

> summary(lm(pemax ~ ., data = cystfibr))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 176.0582   225.8912   0.779    0.448
age          -2.5420     4.8017  -0.529    0.604
sex          -3.7368    15.4598  -0.242    0.812
height       -0.4463     0.9034  -0.494    0.628
weight        2.9928     2.0080   1.490    0.157
bmp          -1.7449     1.1552  -1.510    0.152
fev1          1.0807     1.0809   1.000    0.333
rv            0.1970     0.1962   1.004    0.331
frc          -0.3084     0.4924  -0.626    0.540
tlc           0.1886     0.4997   0.377    0.711
Multiple R-squared: 0.64, Adjusted R-squared: 0.42
F-statistic: 2.929 on 9 and 15 DF, p-value: 0.03195

Conclusion: No parameters significantly different from zero

Minimum AIC by backward selection

> model <- lm(pemax ~ ., data = cystfibr)
> model1 <- step(model, direction = "backward")
Step: AIC=160.66
pemax ~ weight + bmp + fev1 + rv
> summary(model1)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 63.94669   53.27673   1.200 0.244057
weight       1.74891    0.38063   4.595 0.000175 ***
bmp         -1.37724    0.56534  -2.436 0.024322 *
fev1         1.54770    0.57761   2.679 0.014410 *
rv           0.12572    0.08315   1.512 0.146178

Conclusions:
Model with minimum AIC has more predictors
rv not significant
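Backward selection with `step()` can be tried without the cystic fibrosis data. The sketch below is a hedged illustration on simulated data in which only the hypothetical predictors `x1` and `x2` affect the response, while `x3` and `x4` are pure noise; all names and values are assumptions.

```r
# Simulated data; predictors and coefficients are illustrative assumptions
set.seed(10)
n <- 40
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)  # x3 and x4 are pure noise

# Full model, then backward elimination by AIC (trace = 0 hides the steps)
full     <- lm(y ~ ., data = d)
selected <- step(full, direction = "backward", trace = 0)

formula(selected)        # predictors retained by the procedure
extractAIC(full)         # equivalent df and AIC of the full model
extractAIC(selected)     # equivalent df and AIC of the selected model
```

The AIC value reported by `step()` comes from `extractAIC()`, which drops additive constants; for a fixed data set it therefore differs from `AIC()` by a constant, and the ranking of models is the same.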



Estimates of best subset selection

> summary(lm(pemax ~ weight + bmp + fev1, data = cystfibr))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 126.3336    34.7199   3.639 0.001536 **
weight        1.5365     0.3644   4.216 0.000387 ***
bmp          -1.4654     0.5793  -2.530 0.019486 *
fev1          1.1086     0.5144   2.155 0.042893 *

Residual standard error: 23.44 on 21 degrees of freedom
Multiple R-squared: 0.57, Adjusted R-squared: 0.51
F-statistic: 9.279 on 3 and 21 DF, p-value: 0.0004180

Conclusion: A smaller best model is found, which seems well interpretable



Confidence interval for parameters

> round(confint(model), 2)
             2.5 % 97.5 %
(Intercept)  54.13 198.54
weight        0.78   2.29
bmp          -2.67  -0.26
fev1          0.04   2.18

Conclusion: Large confidence intervals indicate uncertainty
Advice: Increase number of patients
Note: Pilot type of study

