Linear Model

Wim Krijnen
Lector Analyse Technieken voor Praktijkonderzoek
Lectoraat Healthy Ageing, Allied Health Care and Nursing
Hanze University of Applied Sciences

July 1, 2015


Overview

  • Correlation between two variables
  • Simple linear models; one predictor variable
  • Least squares estimation of parameters
  • Model evaluation
      • ANOVA table
      • Goodness of fit
      • t-test for model coefficients
      • Confidence interval for correlation coefficient
  • Prediction of new values
  • Linear model diagnostics
  • Multiple regression
  • Model building: finding best subset


Purpose of linear regression analysis

  • Determining the amount of variance the predictors can explain in the criterion
  • Describing the relationship between a dependent variable and a set of independent variables
  • Estimation of parameters
  • Prediction of new values
  • Control


Correlation Coefficient

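Expresses the degree of linear relationship between two variables (measurements)

\[ \text{sample variance of } X:\quad S_X^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} \]

\[ \text{sample variance of } Y:\quad S_Y^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1} \]

\[ \text{sample covariance:}\quad S_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} \]

\[ \text{sample correlation coefficient:}\quad R_{XY} = \frac{S_{XY}}{S_X \cdot S_Y} \]

  • $-1 \le R_{XY} \le 1$
  • $R_{XY}$ close to $\pm 1$ implies that all points lie close to a straight line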
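The definitions above can be checked numerically. The following is a minimal sketch on simulated data (the variables `x` and `y` and all parameter values are hypothetical stand-ins, not the data used in these slides); it computes each quantity from its formula and compares the result with R's built-in `var()`, `cov()` and `cor()`.

```r
# Hypothetical data: simulated stand-ins for two paired measurements
set.seed(1)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)

s2x <- sum((x - mean(x))^2) / (n - 1)                 # sample variance of X
s2y <- sum((y - mean(y))^2) / (n - 1)                 # sample variance of Y
sxy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)   # sample covariance
rxy <- sxy / (sqrt(s2x) * sqrt(s2y))                  # sample correlation coefficient

# Compare the hand computation with the built-in function
c(manual = rxy, builtin = cor(x, y))
```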


linear relationship

linear relation between x and y

\[ y = ax + b \]

the value of y is a function of x

\[ y = f(x) = ax + b \]

  • the function is called linear (of first order); deterministic
  • $f'(x) = a$; if x increases by one unit, y changes by a
  • a: slope, b: intercept
  • if a = 0, then y = b is constant
  • the range is determined by the values of x
  • given a, b, and a new x, the value of y can be computed
  • change of name: $\beta_0$ intercept, $\beta_1$ slope


Example of straight line


Linear model with one predictor

we generalize to a statistical model

\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \dots, n \]

unknown parameters of the model: $\beta_0$, $\beta_1$

  • $\beta_0$ intercept, $\beta_1$ slope
  • $\varepsilon_i$ random "error" variable
  • $\varepsilon_i$ independently normally distributed with mean $E[\varepsilon_i] = 0$ (wlog) and $\mathrm{Var}[\varepsilon_i] = \sigma^2$ for all i
  • independent (explanatory) variables $X_i$
  • dependent variables $Y_i$
  • $\sigma^2 = 0 \Rightarrow \varepsilon_i = 0$ for all i; deterministic model (special case)
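To make the role of the error term concrete, here is a small simulation sketch. The parameter values `beta0`, `beta1` and the helper `simulate_y()` are hypothetical choices for illustration only; with `sigma = 0` the simulated values fall exactly on the line, which is the deterministic special case.

```r
set.seed(2)
n <- 15
beta0 <- 17    # hypothetical intercept
beta1 <- 0.8   # hypothetical slope
x <- runif(n, 60, 110)

# Draw Y_i = beta0 + beta1 * X_i + eps_i with eps_i ~ N(0, sigma^2)
simulate_y <- function(sigma) beta0 + beta1 * x + rnorm(n, mean = 0, sd = sigma)

y_noisy <- simulate_y(sigma = 5)   # statistical model
y_exact <- simulate_y(sigma = 0)   # deterministic special case

# Largest deviation from the line in the deterministic case
max(abs(y_exact - (beta0 + beta1 * x)))
```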


Least Squares (ML) Estimation of parameters

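\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \;\Rightarrow\; \varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i) \]

minimize the sum S of squared differences between the $Y_i$ and the line

\[ S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2 \]

\[ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \qquad \hat{\beta}_1 = \frac{\sum X_i Y_i - n \bar{Y} \bar{X}}{\sum X_i^2 - \left(\sum X_i\right)^2 / n} \]

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

  • computational formulas used in software
  • $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$: "regression line"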
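The closed-form estimators can be evaluated directly and compared with `lm()`. This is a sketch on simulated data (the data and the names `b0`, `b1` are assumptions made for illustration, not the blood pressure data of the later slides).

```r
# Hypothetical data
set.seed(3)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)

# Computational formulas for the least squares estimates
b1 <- (sum(x * y) - n * mean(y) * mean(x)) / (sum(x^2) - sum(x)^2 / n)
b0 <- mean(y) - b1 * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
rbind(manual = c(b0, b1), lm = unname(coef(fit)))
```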


Properties of least squares estimator

  1. best (minimum variance) linear unbiased (Gauss-Markov)
  2. minimal $\mathrm{Var}[\hat{\beta}] = \sigma^2 (X^T X)^{-1} \to O$ as n increases
  3. consistent in the sense that the estimated coefficients tend to the coefficients in the population
  4. equals the maximum likelihood estimator
  5. makes sense geometrically: the predicted values $\hat{y}$ are the orthogonal projection of y onto the space spanned by the predictors X
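Properties 2 and 5 can be illustrated in matrix form. The sketch below uses simulated data (an assumption) and replaces the unknown $\sigma^2$ by the residual mean square, so the covariance matrix shown is an estimate; it should agree with `vcov()`.

```r
# Hypothetical data
set.seed(4)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

X <- model.matrix(fit)                               # n x 2 design matrix (intercept, x)
beta_hat <- solve(crossprod(X), crossprod(X, y))     # solves (X^T X) beta = X^T y
y_hat <- X %*% beta_hat                              # projection of y onto the columns of X

# Residuals are orthogonal to every column of X (zero up to rounding error)
crossprod(X, y - y_hat)

# Estimated Var[beta_hat]: sigma^2 replaced by the residual mean square
s2 <- sum((y - y_hat)^2) / (n - ncol(X))
XtX_inv <- solve(crossprod(X))
s2 * XtX_inv
```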


Sums of squares and F-test

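\[ H_0: \beta_1 = 0 \quad \text{versus} \quad H_A: \beta_1 \neq 0 \]

\[ \text{Regression SS} := \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2, \qquad \text{Residual SS} := \sum_{i=1}^{n} \hat{\varepsilon}_i^2 \]

\[ \text{Total SS} := \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \text{Regression SS} + \text{Residual SS} \]

\[ \text{Regression MS} := \frac{\text{Regression SS}}{p - 1}, \qquad \text{Residual MS} := \frac{\text{Residual SS}}{n - p} \]

\[ F = \frac{\text{Regression MS}}{\text{Residual MS}} \sim F_{1,n-2}; \qquad \text{p-value} = P(F_{1,n-2} > F) \]

Here p = 2 is the number of coefficients (intercept and slope).

  • If p-value < α, then reject H0
  • If p-value > α, then do not reject H0

Reasoning: if $\beta_1 \approx 0$, then $\hat{Y}_i \approx \hat{\beta}_0 = \bar{Y}$ ⇒ Regression SS small; F not significant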
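The sums of squares and the F-test can be reproduced step by step. The following sketch uses simulated data (hypothetical, for illustration) and compares the hand-computed F statistic and p-value with the output of `anova()`.

```r
# Hypothetical data
set.seed(5)
n <- 15
p <- 2                                     # number of coefficients: intercept and slope
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

ss_reg <- sum((fitted(fit) - mean(y))^2)   # Regression SS
ss_res <- sum(resid(fit)^2)                # Residual SS
ss_tot <- sum((y - mean(y))^2)             # Total SS

f_stat  <- (ss_reg / (p - 1)) / (ss_res / (n - p))
p_value <- pf(f_stat, df1 = p - 1, df2 = n - p, lower.tail = FALSE)

c(F = f_stat, p.value = p_value)
anova(fit)                                 # same F value and Pr(>F)
```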


ANOVA table and how it looks in R

ANOVA table

              df      SS              MS        F                            p-value
  Regression  1       Regression SS   Reg./df   Regression MS / Residual MS
  Residuals   n − 2   Residual SS     Res./df

  • SS: sums of squares; MS: mean sum of squares; df: degrees of freedom
  • 15 blood pressure measurements by expert and machine
  • expert replaced by machine if measurements are equal (up to error)

  > model <- lm(Expert ~ Machine)
  > anova(model)
  Analysis of Variance Table
  Response: y
             Df  Sum Sq Mean Sq F value    Pr(>F)
  x           1 150.485 150.485  1967.4 < 2.2e-16 ***
  Residuals 320  24.477   0.076

Conclusion: Reject H0 : β1 = 0


goodness of fit

\[ R^2 := \frac{\text{Regression SS}}{\text{Total SS}} = \frac{\text{Regression SS}}{\text{Regression SS} + \text{Residual SS}} = \mathrm{cor}(Y, \hat{Y})^2 \]

  • $0 \le R^2 \le 1$
  • larger values of $R^2$ indicate better goodness of fit
  • $R^2 \cdot 100\%$: percentage of variance in Y "explained" by the linear regression

  > cor(blood$Expert, model$fitted.values)^2
  [1] 0.822395

Conclusion: 82% of the variance in blood pressure measurements by Expert is explained by Machine
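The three expressions for $R^2$ agree numerically. A short sketch on simulated data (hypothetical, not the `blood` data frame used above):

```r
# Hypothetical data
set.seed(6)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

r2_ss  <- sum((fitted(fit) - mean(y))^2) / sum((y - mean(y))^2)  # Regression SS / Total SS
r2_cor <- cor(y, fitted(fit))^2                                  # cor(Y, Y-hat)^2
r2_lm  <- summary(fit)$r.squared                                 # value reported by summary()

c(r2_ss, r2_cor, r2_lm)
```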


Relation between correlation and slope

The population correlation ρ is estimated by Pearson's sample correlation coefficient

\[ R_{XY} = \frac{S_{XY}}{S_X S_Y}, \qquad \hat{\beta}_1 = \frac{S_{XY}}{S_X^2} \;\Rightarrow\; \hat{\beta}_1 = \frac{S_Y}{S_X} \cdot R_{XY} \]

holds also for the population: $\beta_1 = \frac{\sigma_Y}{\sigma_X} \rho$

  • $R_{XY} = 0$ implies $\hat{\beta}_1 = 0$ (correlation is necessary for a nonzero slope)
  • $R_{XY}$ and $\hat{\beta}_1$ have the same sign (e.g. both positive)
  • $R_{XY} > 0$ ⇒ increase in X associated with increase in Y
  • $R_{XY} < 0$ ⇒ increase in X associated with decrease in Y
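The identity between slope and correlation is easy to verify. A sketch with simulated data (an assumption for illustration):

```r
# Hypothetical data
set.seed(7)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)

b1_cov <- cov(x, y) / var(x)              # S_XY / S_X^2
b1_cor <- sd(y) / sd(x) * cor(x, y)       # (S_Y / S_X) * R_XY
b1_lm  <- coef(lm(y ~ x))[["x"]]          # slope from lm()

c(b1_cov, b1_cor, b1_lm)
```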


t-test for model coefficients

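\[ H_0: \beta_1 = 0 \quad \text{versus} \quad H_A: \beta_1 \neq 0 \]

\[ \mathrm{SE}[\hat{\beta}_1] = \sqrt{\frac{\text{Residual MS}}{(n - 1) S_X^2}}, \qquad t = \frac{\hat{\beta}_1}{\mathrm{SE}[\hat{\beta}_1]} \sim t_{n-2} \text{ distribution} \]

  • If $|t| > t_{n-2,1-\alpha/2}$, then reject H0 in favor of $H_A: \beta_1 \neq 0$
  • If the two-sided p-value $P(|t_{n-2}| \ge |t|) < \alpha$, then reject H0 for $H_A: \beta_1 \neq 0$

  > summary(lm(Expert ~ Machine))
              Estimate Std. Error t value Pr(>|t|)
  (Intercept)  16.9152     8.8236   1.917   0.0775 .
  Machine       0.7907     0.1019   7.759 3.12e-06 ***

Conclusion: Reject H0 : β1 = 0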
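The standard error, t statistic and two-sided p-value can be computed from the formulas and compared with the coefficient table of `summary()`. The sketch below uses simulated data (hypothetical) and a significance level of 0.05 chosen for illustration.

```r
# Hypothetical data
set.seed(8)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

b1     <- coef(fit)[["x"]]
res_ms <- sum(resid(fit)^2) / (n - 2)                 # Residual MS
se_b1  <- sqrt(res_ms / ((n - 1) * var(x)))           # SE of the slope
t_val  <- b1 / se_b1
p_val  <- 2 * pt(abs(t_val), df = n - 2, lower.tail = FALSE)   # two-sided p-value

alpha  <- 0.05
reject <- abs(t_val) > qt(1 - alpha / 2, df = n - 2)  # critical-value rule

c(estimate = b1, se = se_b1, t = t_val, p = p_val)
coef(summary(fit))["x", ]                             # same four numbers
```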


Confidence intervals for model coefficients

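express our knowledge with respect to parameter estimation in a confidence interval (CI)

\[ (1 - \alpha)100\%\ \mathrm{CI}[\beta_1] = \hat{\beta}_1 \pm t_{n-2,1-\alpha/2} \cdot \mathrm{SE}[\hat{\beta}_1], \qquad \mathrm{SE}[\hat{\beta}_1] = \sqrt{\frac{\text{Residual MS}}{(n - 1) S_X^2}} \]

  > confint(model)
                  2.5 %    97.5 %
  (Intercept) -2.146983 35.977336
  Machine      0.570539  1.010882

Conclusion: We are 95% confident that β1 lies within [0.57, 1.01]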
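The interval is the estimate plus or minus a t quantile times the standard error. A sketch on simulated data (hypothetical) that reproduces the `confint()` result for the slope:

```r
# Hypothetical data
set.seed(9)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

b1    <- coef(fit)[["x"]]
se_b1 <- sqrt(sum(resid(fit)^2) / (n - 2) / ((n - 1) * var(x)))
alpha <- 0.05

# beta1-hat +/- t_{n-2, 1-alpha/2} * SE
ci_manual <- b1 + c(-1, 1) * qt(1 - alpha / 2, df = n - 2) * se_b1

ci_manual
confint(fit, "x", level = 1 - alpha)
```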


Prediction/Confidence interval

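  • New value X; predicted $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$ (average predicted value)
  • construct a 95% prediction confidence interval

\[ (1 - \alpha)100\%\ \mathrm{CI}[\hat{Y}] = \hat{Y} \pm \sqrt{2 F_{2,1-\alpha,n-2}} \cdot \mathrm{SE}[\hat{Y}] \]

\[ \mathrm{SE}[\hat{Y}] = \sqrt{S_\varepsilon^2 \left( \frac{1}{n} + \frac{(X - \bar{X})^2}{(n - 1) S_X^2} \right)}, \qquad S_\varepsilon^2 = \frac{\sum_{i=1}^{n} \hat{\varepsilon}_i^2}{n - 2} \]

Example: Machine measures diastolic blood pressure 95

  > predict.lm(model, newdata = data.frame(Machine = 95),
  +            int = "prediction")
         fit      lwr      upr
  1 92.03268 80.41466 103.6507

Conclusion: With 95% certainty the predicted expert blood pressure ∈ [80.4, 103.7]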
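The sketch below, on simulated data (hypothetical), puts three intervals side by side: the pointwise t-based interval for the mean response, the band with the $\sqrt{2F}$ multiplier from the formula above, and the interval for a single new observation. As far as the documented behavior of `predict.lm()` goes, `interval = "confidence"` and `interval = "prediction"` both use t quantiles, and the prediction interval adds the residual variance to the squared standard error of the mean.

```r
# Hypothetical data
set.seed(10)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

x_new <- 95
y_new <- sum(coef(fit) * c(1, x_new))                 # predicted value at x_new
s2    <- sum(resid(fit)^2) / (n - 2)                  # S_eps^2
se_mean <- sqrt(s2 * (1 / n + (x_new - mean(x))^2 / ((n - 1) * var(x))))
se_pred <- sqrt(s2 + se_mean^2)                       # SE for a single new observation

t_q  <- qt(0.975, df = n - 2)                         # pointwise multiplier
wh_q <- sqrt(2 * qf(0.95, df1 = 2, df2 = n - 2))      # sqrt(2F) multiplier

rbind(mean_t     = y_new + c(-1, 1) * t_q  * se_mean,
      mean_2F    = y_new + c(-1, 1) * wh_q * se_mean,
      prediction = y_new + c(-1, 1) * t_q  * se_pred)

conf_int <- predict(fit, data.frame(x = x_new), interval = "confidence")
pred_int <- predict(fit, data.frame(x = x_new), interval = "prediction")
```

The $\sqrt{2F}$ multiplier is larger than the t quantile, so that band is wider than the pointwise interval for the mean.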


plot of prediction confidence intervals


Model diagnostics

Recall $Y = X\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I)$

Diagnose the extent to which the model assumptions hold:

  • same error variance for all values of X
  • residuals are nearly normally distributed
  • there are no extreme outliers
  • only a small proportion of residuals is relatively large
  • there are no data points highly influential on residuals or regression coefficients

This increases our confidence in making valid statistical inferences


Standardized residuals

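\[ \text{Residual MS} := \sum_{i=1}^{n} \hat{\varepsilon}_i^2 / (n - p) \]

\[ h_i = \text{element } ii \text{ of } X (X^T X)^{-1} X^T \]

\[ \text{standardized } i\text{th residual:}\quad \varepsilon'_i = \frac{\hat{\varepsilon}_i}{\sqrt{\text{Residual MS}}\,\sqrt{1 - h_i}} \]

  > round(rstandard(model), 3)
       1      2      3      4      5      6      7      8
   0.285  0.455 -0.460 -0.163  1.496  1.592  1.441 -1.129
       9     10     11     12     13     14     15
  -0.209 -1.319  0.219 -1.185 -1.337  0.741 -0.632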
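The hat values and standardized residuals can be built from the design matrix. A sketch on simulated data (hypothetical), compared with `hatvalues()` and `rstandard()`:

```r
# Hypothetical data
set.seed(11)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)
p <- length(coef(fit))                         # number of coefficients

X <- model.matrix(fit)
H <- X %*% solve(crossprod(X)) %*% t(X)        # hat matrix X (X^T X)^{-1} X^T
h <- diag(H)                                   # hat values h_i

res_ms <- sum(resid(fit)^2) / (n - p)          # Residual MS
r_std  <- resid(fit) / (sqrt(res_ms) * sqrt(1 - h))

round(r_std, 3)
```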


Studentized residuals

  • studentized residual $\varepsilon^*_i = \varepsilon'_i / \mathrm{SD}[\varepsilon_i]$
  • $\varepsilon^*_{(i)}$: studentized residual leaving out the ith data point $(x_i, y_i)$

\[ \varepsilon^*_{(i)} = \frac{\hat{\varepsilon}_i}{\sqrt{\text{Residual MS}_{(i)}}\,\sqrt{1 - h_i}}, \qquad \text{Residual MS}_{(i)} = \frac{\sum_{j=1, j \neq i}^{n} \hat{\varepsilon}_j^2}{n - p - 1} \]

where the residuals $\hat{\varepsilon}_j$ in $\text{Residual MS}_{(i)}$ are taken from the fit without data point i.

\[ \varepsilon^*_{(i)} \sim t_{n-p-1}, \qquad \varepsilon^*_i \text{ approximately } t_{n-p} \]

  • p: number of coefficients, $h_i$: hat value i
  • Detect unusually large deviations (absolute value > 2) from the predicted value

  > round(rstudent(model), 2)
      1     2     3     4     5     6     7     8     9    10
   0.28  0.44 -0.45 -0.16  1.58  1.70  1.51 -1.14 -0.20 -1.36
     11    12    13    14    15
   0.21 -1.21 -1.38  0.73 -0.62

Conclusion: There are no extreme studentized residuals


Computation of studentized residuals

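\[ \text{Residual MS}_{(i)} = \frac{\sum_{j=1, j \neq i}^{n} \hat{\varepsilon}_j^2}{n - p - 1}, \qquad \varepsilon^*_{(i)} = \frac{\hat{\varepsilon}_i}{\sqrt{\text{Residual MS}_{(i)}}\,\sqrt{1 - h_i}} \]

  > round(rstudent(model), 3)
       1      2      3      4      5      6      7      8
   0.275  0.441 -0.445 -0.156  1.579  1.705  1.510 -1.142
       9     10     11     12     13     14     15
  -0.201 -1.361  0.211 -1.205 -1.383  0.727 -0.616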
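A direct, if inefficient, way to obtain these values is to refit the model n times, each time leaving out one data point. The sketch below does this on simulated data (hypothetical) and compares the result with `rstudent()`; the data frame `d` and the helper vector `ms_loo` are names introduced here for illustration.

```r
# Hypothetical data
set.seed(12)
n <- 15
d <- data.frame(x = runif(n, 60, 110))
d$y <- 17 + 0.8 * d$x + rnorm(n, sd = 5)
fit <- lm(y ~ x, data = d)
p <- length(coef(fit))
h <- hatvalues(fit)

# Residual MS_(i): residual mean square of the fit without data point i
ms_loo <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = d[-i, ])
  sum(resid(fit_i)^2) / (n - p - 1)
})

r_stud <- resid(fit) / (sqrt(ms_loo) * sqrt(1 - h))

round(r_stud, 3)
which(abs(r_stud) > 2)     # data points flagged by the cutoff of 2
```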


Difference in fitted values (DFFITS)

  • Difference in fit leaving out the ith data point: $\hat{Y}_i - \hat{Y}_{(i)}$
  • Scaled difference in fitted values leaving out the ith data point:

\[ \Delta \hat{Y}^*_{(i)} = \frac{\hat{\varepsilon}_i \sqrt{h_i}}{\sqrt{\text{Residual MS}_{(i)}}\,(1 - h_i)} \sim F_{p,n-p-1} \]

  • cutoff criterion $|\Delta \hat{Y}^*_{(i)}| \ge 3\sqrt{\frac{p}{n - p}}$ to diagnose large influence of $x_i$ on overall model fit

  > (cutoff <- 3*sqrt(p/(n-p)))
  [1] 1.176697
  > round(dffits(model), 3)
       1      2      3      4      5      6      7      8
   0.133  0.123 -0.144 -0.086  0.475  0.500  0.512 -0.422
       9     10     11     12     13     14     15
  -0.132 -0.424  0.056 -0.667 -0.511  0.196 -0.258

Conclusion: There are no extreme differences in fit
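DFFITS can be obtained from its definition, the change in the ith fitted value when point i is left out, scaled by the leave-one-out residual standard deviation times $\sqrt{h_i}$. The sketch below uses simulated data (hypothetical) and leave-one-out refits; `loo` is a helper matrix introduced for illustration.

```r
# Hypothetical data
set.seed(13)
n <- 15
d <- data.frame(x = runif(n, 60, 110))
d$y <- 17 + 0.8 * d$x + rnorm(n, sd = 5)
fit <- lm(y ~ x, data = d)
p <- length(coef(fit))
h <- hatvalues(fit)

# For each i: prediction for point i and residual SD from the fit without point i
loo <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = d[-i, ])
  c(pred = unname(predict(fit_i, newdata = d[i, ])), s = summary(fit_i)$sigma)
})

dffits_loo <- (fitted(fit) - loo["pred", ]) / (loo["s", ] * sqrt(h))

# Closed form from the slide: eps_i * sqrt(h_i) / (sqrt(MS_(i)) * (1 - h_i))
dffits_formula <- resid(fit) * sqrt(h) / (loo["s", ] * (1 - h))

cutoff <- 3 * sqrt(p / (n - p))
which(abs(dffits_loo) >= cutoff)   # data points flagged by the cutoff
```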


Influence of ith data point on estimated beta (DFBETAS)

Scaled difference in the estimated coefficient when leaving out the ith data point

\[ \Delta \beta^*_{1(i)} = \frac{\hat{\beta}_1 - \hat{\beta}_{1(i)}}{\sqrt{\text{Residual MS}_{(i)}}\,\sqrt{(X^T X)^{-1}_{jj}}}, \]

where $(X^T X)^{-1}_{jj}$ is the diagonal element belonging to $\beta_1$

cutoff criterion $|\Delta \beta^*_{1(i)}| > 1$

  > round(dfbetas(model), 3)
    (Intercept) Machine
  1       0.118  -0.107
  2       0.050  -0.033
  3       0.059  -0.078   # etc

Conclusion: No data points with extreme influence on the betas
Remark: Valid inference is impossible in the presence of an extremely influential data point
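The same leave-one-out idea gives DFBETAS. The sketch below, on simulated data (hypothetical), refits the model without each point, scales the coefficient differences as in the formula above, and compares the result with `dfbetas()`; `scale_b` and `dfbetas_loo` are names introduced here.

```r
# Hypothetical data
set.seed(14)
n <- 15
d <- data.frame(x = runif(n, 60, 110))
d$y <- 17 + 0.8 * d$x + rnorm(n, sd = 5)
fit <- lm(y ~ x, data = d)

X <- model.matrix(fit)
scale_b <- sqrt(diag(solve(crossprod(X))))     # sqrt of the diagonal of (X^T X)^{-1}

# One row per data point: (beta-hat - beta-hat_(i)) / (s_(i) * scale_b)
dfbetas_loo <- t(sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = d[-i, ])
  (coef(fit) - coef(fit_i)) / (summary(fit_i)$sigma * scale_b)
}))

head(round(dfbetas_loo, 3), 3)
which(abs(dfbetas_loo[, "x"]) > 1)             # points flagged for the slope
```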


Cook's distance

Cook's distance $D_i$ is another measure of the influence of the ith observation on the estimated coefficients (Chatterjee & Hadi, 1988)

\[ D_i = \frac{h_i}{p} \left( \frac{\hat{\varepsilon}_i}{\sqrt{\text{Residual MS}}\,(1 - h_i)} \right)^2 \sim F_{p,n-p} \]

  • p: number of beta coefficients
  • Interpretation: scaled distance between $\hat{\beta}$ and $\hat{\beta}_{(i)}$
  • cutoff criterion $D_i \ge 0.5$

  > round(cooks.distance(model), 3)
      1     2     3     4     5     6     7     8
  0.010 0.008 0.011 0.004 0.101 0.109 0.119 0.087
      9    10    11    12    13    14    15
  0.009 0.084 0.002 0.215 0.122 0.020 0.035
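Cook's distance needs no refitting: it depends only on the residuals, the hat values and the residual mean square of the full model. A sketch on simulated data (hypothetical), compared with `cooks.distance()`:

```r
# Hypothetical data
set.seed(15)
n <- 15
x <- runif(n, 60, 110)
y <- 17 + 0.8 * x + rnorm(n, sd = 5)
fit <- lm(y ~ x)

p <- length(coef(fit))                     # number of beta coefficients
h <- hatvalues(fit)
res_ms <- sum(resid(fit)^2) / (n - p)      # Residual MS of the full model

D <- (h / p) * (resid(fit) / (sqrt(res_ms) * (1 - h)))^2

round(D, 3)
which(D >= 0.5)                            # data points flagged by the cutoff
```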


Diagnostic plots

A very simple way to produce a two-by-two panel with 4 diagnostic plots

  par(mfrow = c(2,2))
  plot(model)
  par(mfrow = c(1,1))

  • left upper: residuals versus fitted values, for detecting heteroscedasticity (increase in sd) or nonlinearity in the data
  • right upper: normal QQ plot
  • left lower: scale-location plot
  • right lower: residuals versus leverage, for detecting influential data points (Cook's distance)

influential data points are identified


Multiple Regression

linear model with more than one predictor variable

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \]

unknown parameters of the model: $\beta_0, \beta_1, \beta_2, \dots, \beta_p$

  • $\beta_0$ intercept; $\beta_1, \beta_2, \dots, \beta_p$ regression coefficients
  • $\varepsilon_i$ random "error" variable
  • $\varepsilon_i$ independently normally distributed with mean $E[\varepsilon_i] = 0$ (wlog) and $\mathrm{Var}[\varepsilon_i] = \sigma^2$ for all i
  • independent (explanatory) variables $X_i$
  • dependent variables $Y_i$
  • $\sigma^2 = 0 \Rightarrow \varepsilon_i = 0$ for all i; deterministic model (special case)


Topics Multiple Regression

linear model with more than one predictor variable

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \]

  • Estimates significantly different from zero
  • Which predictor variables to choose?
  • Too many predictors: good fit (large $R^2$), but estimates insignificant (overfitting, not generalizable)
  • Too few predictors: bad fit (low $R^2$), estimates significant (underfitting)
  • Desideratum 1: Best estimate of correct model
  • Desideratum 2: Valid statistical inferences


Criteria to find best model

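\[ R^2 := 1 - \frac{SSE}{SST} = \mathrm{cor}(Y, \hat{Y})^2 \quad \text{never decreases with increasing } p \]

\[ R^2_{adj} = 1 - \frac{SSE/(n - p)}{SST/(n - 1)}; \qquad SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

\[ \text{Mallows' } C_p = \frac{SSE}{\hat{\sigma}^2} - n + 2p \]

\[ \text{maximum likelihood} = \left( \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \right)^n \exp\left( -\frac{SSE}{2\hat{\sigma}^2} \right) \]

\[ AIC = -2 \ln(\text{maximum likelihood}) + 2p \]

\[ BIC = -2 \ln(\text{maximum likelihood}) + \ln(n) \cdot p \]

  • Cp, AIC, BIC prevent overfitting (invalid inference)
  • choose the model with the smallest Cp, AIC, BIC
  • AIC tends to choose too complex models, BIC does not (= consistent)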
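These criteria can be computed by hand and compared with R's built-in functions. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption; it is not the data of these slides) and an arbitrary choice of predictors. One caveat: `AIC()` and `BIC()` count the error variance as an additional parameter, so their penalty uses p + 1 rather than p; this shifts every model with the same n by the same constant and leaves the ranking of models unchanged.

```r
# Stand-in data: built-in mtcars, with an arbitrary set of predictors
fit  <- lm(mpg ~ wt + hp + qsec, data = mtcars)
full <- lm(mpg ~ ., data = mtcars)           # largest model, used for sigma^2 in Cp

n   <- nobs(fit)
p   <- length(coef(fit))                     # number of beta coefficients
sse <- sum(resid(fit)^2)
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r2     <- 1 - sse / sst
r2_adj <- 1 - (sse / (n - p)) / (sst / (n - 1))

# Mallows' Cp with sigma^2 estimated from the largest model
sigma2_full <- sum(resid(full)^2) / df.residual(full)
cp <- sse / sigma2_full - n + 2 * p

# Maximized log-likelihood, using the ML variance estimate SSE / n
sigma2_ml <- sse / n
log_lik <- -n / 2 * log(2 * pi * sigma2_ml) - sse / (2 * sigma2_ml)

k   <- p + 1                                 # R also counts sigma^2 as a parameter
aic <- -2 * log_lik + 2 * k
bic <- -2 * log_lik + log(n) * k

c(R2 = r2, R2adj = r2_adj, Cp = cp, AIC = aic, BIC = bic)
```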


lung function data for cystic fibrosis patients

Data frame: 25 rows (patients 7-23 years), 10 columns

Variables (columns):

  • age: age in years
  • sex: 0: male, 1: female
  • height: height (cm)
  • weight: weight (kg)
  • bmp: body mass (percentage of normal)
  • fev1: forced expiratory volume
  • rv: residual volume
  • frc: functional residual capacity
  • tlc: total lung capacity
  • pemax: maximum expiratory pressure


Predicting maximum expiratory pressure for cystic fibrosis patients

  > summary(lm(pemax ~ ., data = cystfibr))
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) 176.0582   225.8912   0.779    0.448
  age          -2.5420     4.8017  -0.529    0.604
  sex          -3.7368    15.4598  -0.242    0.812
  height       -0.4463     0.9034  -0.494    0.628
  weight        2.9928     2.0080   1.490    0.157
  bmp          -1.7449     1.1552  -1.510    0.152
  fev1          1.0807     1.0809   1.000    0.333
  rv            0.1970     0.1962   1.004    0.331
  frc          -0.3084     0.4924  -0.626    0.540
  tlc           0.1886     0.4997   0.377    0.711
  Multiple R-squared: 0.64, Adjusted R-squared: 0.42
  F-statistic: 2.929 on 9 and 15 DF, p-value: 0.03195

Conclusion: No parameters significantly different from zero


Minimum AIC by backward selection

  > model <- lm(pemax ~ ., data = cystfibr)
  > model1 <- step(model, direction = "backward")
  Step: AIC=160.66
  pemax ~ weight + bmp + fev1 + rv
  > summary(model1)
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) 63.94669   53.27673   1.200 0.244057
  weight       1.74891    0.38063   4.595 0.000175 ***
  bmp         -1.37724    0.56534  -2.436 0.024322 *
  fev1         1.54770    0.57761   2.679 0.014410 *
  rv           0.12572    0.08315   1.512 0.146178

Conclusions:
  • Model with minimum AIC has more predictors
  • rv not significant
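Backward selection can also be run with a BIC-type penalty by passing `k = log(n)` to `step()`. The sketch below uses the built-in `mtcars` data with an arbitrary starting model as a stand-in (an assumption; the `cystfibr` data are not bundled with base R), and `trace = 0` suppresses the step-by-step printout.

```r
# Stand-in data: built-in mtcars, arbitrary starting set of predictors
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)

# Backward elimination with the AIC penalty (k = 2, the default)
by_aic <- step(full, direction = "backward", trace = 0)

# Backward elimination with the BIC penalty (k = log(n))
by_bic <- step(full, direction = "backward", trace = 0, k = log(nobs(full)))

formula(by_aic)
formula(by_bic)
```

Because the BIC penalty per coefficient is larger than the AIC penalty once n exceeds about 7, the BIC-selected model contains no more predictors than the AIC-selected one in this setting.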


Estimates of best subset selection

  > summary(lm(pemax ~ weight + bmp + fev1, data = cystfibr))
              Estimate Std. Error t value Pr(>|t|)
  (Intercept) 126.3336    34.7199   3.639 0.001536 **
  weight        1.5365     0.3644   4.216 0.000387 ***
  bmp          -1.4654     0.5793  -2.530 0.019486 *
  fev1          1.1086     0.5144   2.155 0.042893 *

  Residual standard error: 23.44 on 21 degrees of freedom
  Multiple R-squared: 0.57, Adjusted R-squared: 0.51
  F-statistic: 9.279 on 3 and 21 DF, p-value: 0.0004180

Conclusion: A smaller best model is found, which seems well interpretable


Confidence interval for parameters

  > round(confint(model), 2)
              2.5 % 97.5 %
  (Intercept) 54.13 198.54
  weight       0.78   2.29
  bmp         -2.67  -0.26
  fev1         0.04   2.18

  • Conclusion: Large confidence intervals indicate uncertainty
  • Advice: Increase number of patients
  • Note: Pilot type of study
