
Multiple Regression: Model checking
(staff.pubhealth.ku.dk/~lts/engelsk_basal/overheads...)

Multiple Regression

→ One outcome, many explaining variables

Example: Ultrasound scanning, shortly before birth (1-3 days before)

OBS WEIGHT BPD AD

1 2350 88 92

2 2450 91 98

3 3300 94 110

. . . .

. . . .

. . . .

105 3550 92 116

106 1173 72 73

107 2900 92 104

(BPD: Head diameter; AD: Stomach circumference)

Objectives could be:

• Prediction, construction of normal regions for diagnostic use (as here)

• Calculation of causal relationships for intervention use

• Scientific insight

First we look at a single covariate, bpd:

The statistical model for a simple linear regression was

Y_i = α + βX_i + ε_i,   ε_i ~ N(0, σ²) independent.

Here there is a marked deviation from linearity!

How does that look in model checking?

Model checking

Statistical model:

Y_i = α + βX_i + ε_i,   ε_i ~ N(0, σ²) independent.

What do we have to check here?

• linearity

• variance homogeneity

• deviations from normality (distance to the line)

Note:

• No assumption of normality for the x_i!

• Independence between the Y_i is checked by inspecting:

  – Are there several observations from the same individual?

  – Are there persons from the same family? Twins?
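A minimal way to fit this simple regression outside Analyst is PROC REG; a sketch, assuming the data set is called secher (the overheads do not name it):

proc reg data=secher;        /* data set name is an assumption */
   model weight = bpd;       /* simple linear regression of weight on bpd */
run;
quit;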

Model checking consists of

• graphical checks, typically with the residuals

• perhaps formal tests

Residual: a quantity which expresses the discrepancy between the observed and the expected (predicted, fitted) value.

There are 4 types of residuals to choose from:

1. ordinary: vertical distance of the observation to the line, observed minus fitted value:

   ε̂_i = y_i − ŷ_i

2. standardized (student): ordinary, normalized with the standard deviation

3. press: observed minus predicted, but in a model where the current observation has been excluded from the estimation

4. rstudent (studentized): normalized press residuals
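All four types can be saved from PROC REG in one go via an OUTPUT statement; a sketch, with assumed data set and variable names:

proc reg data=secher;
   model weight = bpd;
   output out=resdata
          r=res              /* 1. ordinary residual        */
          student=stres      /* 2. standardized (student)   */
          press=pres         /* 3. press residual           */
          rstudent=rstres;   /* 4. studentized (rstudent)   */
run;
quit;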

Problems with the ordinary residuals

We have assumed that

ε_i ~ N(0, σ²), independent,

so we would assume that the same holds for the residuals ε̂_i = y_i − ŷ_i.

→ This is not so!

• They are not independent (they sum to 0)
  – not so important, if there are sufficiently many

• They don't all have the same variance:

  Var(ε̂_i) = σ²(1 − h_ii)

where

  h_ii = 1/n + (x_i − x̄)²/S_xx

denotes the leverage of the ith observation.

Standardized residuals

r_i = ε̂_i / (s·√(1 − h_ii)),   Var(r_i) ≈ 1

Diagnostic residuals

Here the observations (x_i, y_i) are excluded one after another. For calculating the ith residual, the resulting fitted value (from the model without (x_i, y_i)) is used, either in raw form (press) or in normalized form (rstudent).

Advantages and disadvantages:

• Nice to have residuals which preserve the units/scale (types 1 and 3)

• Easiest to find outliers, if observations are excluded one after another (types 3 and 4)

• Best to normalize, if the observations are included and one cannot draw...

Thus, in multiple regression type 2 should be preferred to type 1.

Residual plots

Residuals (of a suitably chosen type) are plotted vs.

• the explaining variables x_i
  – to check linearity

• the fitted values ŷ_i
  – to check variance homogeneity and normality of the errors

• 'normal scores', i.e. probability plot or histogram
  – to check normality

→ The first two plots should look disordered, i.e. unsystematic.

→ The probability plot should lie on a straight line.
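From a saved output data set these plots can also be produced programmatically; a sketch (data set and variable names are assumptions):

proc reg data=secher;
   model weight = bpd;
   output out=resdata p=pred student=stres;   /* fitted values and standardized residuals */
run;
quit;

proc sgplot data=resdata;     /* residuals vs. fitted: should look unsystematic */
   scatter x=pred y=stres;
   refline 0 / axis=y;
run;

proc univariate data=resdata noprint;
   qqplot stres / normal(mu=est sigma=est);   /* probability plot: should be a straight line */
run;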

Residual plots in ANALYST

Many of the plots can be produced via Statistics/Regression/Linear, by clicking Plots/Residual, where, for example, Ordinary Residual vs. Predicted is chosen.

[Figure: ordinary residuals vs. predicted value of weight]

Several graphs for model checking

[Figure: model checking plots vs. predicted value of weight]

Linearity

If linearity does not hold, the model will be misleading and uninterpretable.

Ways out:

• add more covariates, e.g.

  – a quadratic term, bpd²:

    weight = α + β1·bpd + β2·bpd²

    Test of linearity: H0: β2 = 0

  – ad (multiple regression)

• transform variables by

  – logarithms

  – square root

  – inverse

• non-linear regression

A clear deviation from linearity can be seen with the test of the quadratic term:

New variable: cbpd2=(bpd-90)**2

Statistics/Regression/Linear, choose weight as Dependent, bpd and cbpd2 as Explanatory

(or use Statistics/Regression/Simple and choose Quadratic)
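In program form the same quadratic fit might look like this (a sketch; the data set name is an assumption):

data secher2;
   set secher;
   cbpd2 = (bpd - 90)**2;      /* centered quadratic term */
run;

proc reg data=secher2;
   model weight = bpd cbpd2;   /* test of linearity: H0: coefficient of cbpd2 = 0 */
run;
quit;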

Dependent Variable: weight

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Prob>F

Model 2 34103611.113 17051805.556 108.081 0.0001

Error 104 16407889.953 157768.17262

C Total 106 50511501.065

Root MSE 397.20042 R-square 0.6752

Dep Mean 2739.09346 Adj R-sq 0.6689

C.V. 14.50116

Parameter Estimates

Parameter Standard T for H0:

Variable DF Estimate Error Parameter=0 Prob > |T|

INTERCEP 1 2720.981236 42.94411387 63.361 0.0001

CBPD 1 117.631510 9.72368306 12.097 0.0001

CBPD2 1 2.232942 0.63718640 3.504 0.0007

Quadratic regression

weight = 2720 + 117.63·(bpd−90) + 2.23·(bpd−90)²

('−90': to avoid numerical instability; 90 is 'in the middle'...)

Prediction limits (chosen under Plots in Statistics/Regression/Simple):

[Figure: observations with fitted quadratic curve and prediction limits]

Variance homogeneity

(constant variance / constant standard deviation)

Var(ε_i) = σ², i = 1, . . . , n

If there is no (rough) variance homogeneity, the estimation will be inefficient (we obtain an unnecessarily large variance).

Which alternatives do we have?

• constant relative standard deviation = constant coefficient of variation

  CV(X) = SD(X)/E(X)

  – often constant, if small positive quantities, e.g. concentrations, are measured

  – will lead to a trumpet shaped residual plot

  – way out: transform the outcome (Y_i) by logarithm

• Compound experiment

  – e.g., several instruments or laboratories

Normality assumption

Remember: Only the error terms are assumed to be normally distributed, neither the outcome nor the covariates!

The normality assumption

• is not crucial for the fit itself: the least squares method yields the 'best' estimates at any rate

• is a formal prerequisite for the t distribution of the test statistics, but really only a normality assumption for the estimate β̂ is needed, and this is often (approximately) given, if there are sufficiently many observations, due to:

The central limit theorem,

which states that sums or other functions of many observations get 'more and more' normally distributed.

Transformations

• logarithms, square root, inverse

Why take logarithms?

• of the explaining variable

  – for obtaining linearity: if there are successive doublings which have a constant effect, use logarithms to base 2!

• of the response / outcome

  – for obtaining linearity

  – for obtaining variance homogeneity:

    Var(log(Y)) ≈ Var(Y)/Y²

    i.e., a constant coefficient of variation of Y means a constant variance of log(Y), the natural logarithm.
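This approximation is just a first-order Taylor expansion (the delta method) of the natural logarithm around μ = E(Y):

log(Y) ≈ log(μ) + (Y − μ)/μ   ⇒   Var(log(Y)) ≈ Var(Y)/μ² = CV(Y)²

so a constant CV on the original scale becomes a constant variance on the log scale.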

After log2-transformation of weight:

[Figure: residual plots vs. predicted value of lweight]

After log2-transformation of both weight and bpd:

[Figure: residual plots vs. predicted value of lweight]

Multiple regression

DATA: n persons, p measurements for each:

person x1 . . . xp y

1 x11 . . . x1p y1

2 x21 . . . x2p y2

3 x31 . . . x3p y3

. . . . . . . .

n xn1 . . . xnp yn

The linear regression model with p explaining variables is given by:

y_i = β0 + β1·x_1i + · · · + βp·x_pi + ε_i

where y_i is the response, β0 + β1·x_1i + · · · + βp·x_pi is the mean value (the regression function), and ε_i is the biological variation.

Parameters:

β0: intercept

β1, . . . , βp: regression coefficients

Graphical Illustration

Graphs/Scatter Plot/Three-Dimensional, under Display choose Needles/Pillar

proc g3d;
   scatter bpd*ad=weight / shape='pillar' size=0.5;   /* weight as pillars over the (bpd, ad) plane */
run;

[Figure: 3D scatter plot of weight vs. bpd and ad]

Regression model

y_i = β0 + β1·x_i1 + · · · + βp·x_ip + ε_i,   i = 1, . . . , n

Usual assumptions:

ε_i ~ N(0, σ²), independent

Least squares method:

S(β0, β1, . . . , βp) = Σ_{i=1}^{n} (y_i − β0 − β1·x_i1 − . . . − βp·x_ip)²

→ minimize with respect to β0, . . . , βp

Example: Secher's data with birth weight as a linear function of both bpd and ad

Analyst: Statistics/Regression/Linear, with weight as Dependent, bpd and ad as Explanatory
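The corresponding program step is a one-liner (a sketch; data set name assumed):

proc reg data=secher;
   model weight = bpd ad;   /* multiple regression: two explaining variables */
run;
quit;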

The REG Procedure

Dependent Variable: weight

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Pr > F

Model 2 40736854 20368427 216.72 <.0001

Error 104 9774647 93987

Corrected Total 106 50511501

Root MSE 306.57298 R-Square 0.8065

Dependent Mean 2739.09346 Adj R-Sq 0.8028

Coeff Var 11.19250

Parameter Estimates

Parameter Standard

Variable DF Estimate Error t Value Pr > |t|

Intercept 1 -4628.11813 455.98980 -10.15 <.0001

bpd 1 37.13292 7.61510 4.88 <.0001

ad 1 39.76305 4.16394 9.55 <.0001

→ Strongly significant effect of both covariates.
→ But: Are the model assumptions fulfilled?

Model checks for the untransformed model

[Figure: residual plots vs. predicted value of weight]

Assessment of the model:

• Normality holds roughly, but some single quite large positive deviations, which could argue for a logarithmic transformation of weight.

• Perhaps a light trumpet shape in the plot of residuals vs. predicted values, but note that the observations are not equally distributed over the x axis.

• Linearity does not hold well, mainly due to the earliest born babies.

• Theoretical arguments from clinical experts suggest a logarithmic transformation of both covariates.

Logarithmic transformation of the data:

lweight=log2(weight)
lbpd=log2(bpd)
lad=log2(ad)

Statistics/Regression/Linear, choose lweight as Dependent, lbpd and lad as Explanatory
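As a program, the transformation and the refit could be done like this (a sketch; data set names are assumptions; LOG2 is the base-2 logarithm in SAS):

data lsecher;
   set secher;
   lweight = log2(weight);
   lbpd    = log2(bpd);
   lad     = log2(ad);
run;

proc reg data=lsecher;
   model lweight = lbpd lad;
run;
quit;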

Dependent Variable: Lweight

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Prob>F

Model 2 14.95054 7.47527 314.925 0.0001

Error 104 2.46861 0.02374

C Total 106 17.41915

Root MSE 0.15407 R-square 0.8583

Dep Mean 11.36775 Adj R-sq 0.8556

C.V. 1.35530

Parameter Estimates

Parameter Standard T for H0:

Variable DF Estimate Error Parameter=0 Prob > |T|

INTERCEP 1 -8.456359 0.95456918 -8.859 0.0001

LBPD 1 1.551943 0.22944935 6.764 0.0001

LAD 1 1.466662 0.14669097 9.998 0.0001

Test of hypotheses

Is AD of importance, if BPD is already in the model?

H0: β2 = 0

Here we have β̂2 = 1.467 (SE(β̂2) = 0.147), and thus the t test yields

t = β̂2 / SE(β̂2) = 9.998 ~ t(104),   p < 0.0001

95% confidence interval for β2:

c.i. = β̂2 ± t(97.5%, n−p−1) · SE(β̂2)
     = 1.467 ± 1.984 × 0.147 = (1.175, 1.759)
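The same numbers can be reproduced in a small data step (a sketch; the estimates are copied from the output above):

data ci;
   beta2 = 1.467;  se2 = 0.147;  df = 104;
   t     = beta2 / se2;                 /* approx. 9.98 (9.998 with unrounded input) */
   p     = 2*(1 - probt(abs(t), df));   /* two-sided p value, < 0.0001 */
   tq    = tinv(0.975, df);             /* 97.5% quantile of t(104), approx. 1.98 */
   lower = beta2 - tq*se2;              /* approx. 1.175 */
   upper = beta2 + tq*se2;              /* approx. 1.759 */
run;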

But:

The β̂j are correlated – unless the explaining variables are independent.

Fitted values

log2(weight) = −8.46 + 1.47·log2(ad) + 1.55·log2(bpd)

i.e.

weight = 0.0028 × ad^1.47 × bpd^1.55

→ If bpd is increased by 10%, this corresponds to multiplying the weight by

1.1^1.55 = 1.16,

i.e. an increase by 16%, if ad is kept fixed.

Example for calculations

For ad=113 and bpd=88, we would expect

log2(weight) = −8.46 + 1.47 × log2(113) + 1.55 × log2(88)
             = −8.46 + 1.47 × 6.82 + 1.55 × 6.46
             = 11.58

→ Expected birth weight: 2^11.58 = 3061 g

• Actually observed birth weight: 3400 g

• Residual: 3400 g − 3061 g = 339 g
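The prediction can be checked in a data step (a sketch, using the rounded coefficients from above):

data pred;
   lpred  = -8.46 + 1.47*log2(113) + 1.55*log2(88);   /* predicted log2(weight), approx. 11.58 */
   weight = 2**lpred;                                 /* back-transformed: approx. 3060 g */
run;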

Uncertainty in prediction

Note: The log scale results in a constant relative uncertainty:

2^(±1.96×0.154) = (0.81, 1.23)

This means that with 95% probability the birth weight lies somewhere between 19% under and 23% over the predicted value.

(Here we have cheated a bit: we have neglected the estimation uncertainty in the β̂'s themselves.)

Marginal vs. multiple models

Marginal models:

The response is considered with each single explaining variable on its own.

Multiple regression model:

The response is considered with both explaining variables together.

Estimates for these models (with corresponding standard errors in parentheses):

β̂0, int.   β̂1, lbpd        β̂2, lad         s       R²
-10.223    3.332 (0.202)   -               0.215   0.72
 -3.527    -               2.237 (0.111)   0.184   0.80
 -8.456    1.552 (0.229)   1.467 (0.147)   0.154   0.86

Interpretation of the coefficient β1 for lbpd

• Marginal model: change in lweight, if the covariate lbpd is changed by 1 unit, i.e. if bpd is doubled

• Multiple regression model: change in lweight, if the covariate lbpd is changed by 1 unit, but where all other covariates (here only ad) are kept fixed

  We say that we have corrected (or adjusted) for the effects of the other covariates in the model.

The difference between the two models can be quite drastic, since the covariates are typically related:

  – If one of them is changed, the others are also changed.

Goodness-of-fit Measure

R² = SS_Model / SS_Total

"How large is the proportion of variation explained by the model?"

(here: 0.8583, i.e. 85.83%)

Problem of interpretation if the covariates are controlled (as for the correlation coefficient).

R² increases with the number of covariates, even if these are not important!

Adjusted R²:

R²_adj = 1 − MS_Residual / MS_Total

(here: 0.8556)
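Since the mean squares are just SS/df, the two measures are linked by

R²_adj = 1 − (SS_Residual/(n−p−1)) / (SS_Total/(n−1)) = 1 − (1 − R²)·(n−1)/(n−p−1),

which with n = 107 and p = 2 reproduces 1 − (1 − 0.8583)·106/104 = 0.8556.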

Model checking

• Plots:

  – residuals vs. each covariate separately (linearity)

  – residuals vs. predicted values (variance homogeneity)

  – probability plot (normality)

• Tests:

  – generalized vs. simple models

  – curvature: square term, cubic term, ...

  – interaction: product term?

• Influencing observations

  – modified residuals

  – Cook's distance

Model checks for the log2-transformed model

[Figure: residual plots vs. predicted value of lweight]

Regression diagnostics

Are the conclusions supported by the whole data set?

Or are there observations with rather large influence on the results?

Leverage = potential influence (hat matrix, in SAS called Hat Diag or H). If there is only one covariate, this is simply:

h_ii = 1/n + (x_i − x̄)²/S_xx

Observations with extreme x values can have a large influence on the results,

... but not necessarily!

→ no problem if they lie 'nicely' in relation to the regression line, i.e. if they have a small residual

→ For example:

[Figure: scatter plot with regression line; a high-leverage point lying close to the line]

Influencing observations

→ are those which have a combination of

• high leverage

• large residual

Regression diagnostics

• Leave out the ith person and find new estimates β̂0^(i), β̂1^(i) and β̂2^(i)

• Calculate Cook's distance, a compound measure for the change in the parameter estimates

• Split Cook's distance into its coordinates and state: by how many SE's does β̂1 (for example) change, if the ith person is left out?

What to do with influencing observations?

• leave them out?

• quote a measure of their influence?
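In PROC REG these quantities are available through the INFLUENCE option and the OUTPUT statement; a sketch (data set names assumed):

proc reg data=lsecher;
   model lweight = lbpd lad / influence;    /* prints hat values, rstudent and dfbetas */
   output out=diag cookd=cook h=leverage;   /* Cook's distance and leverage per person */
run;
quit;

The DFBETAS printed by INFLUENCE are exactly the 'coordinates' above: the change in each estimate, in SE units, when the ith person is left out.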

Diagnostics: Cook's distance

[Figure: Cook's distance for each observation]

Outliers

Observations which don't fit in the relationship

• not necessarily influencing

• not necessarily with a large residual

What to do with outliers?

• look closer at them, they are often quite interesting

When can we leave them out?

• if they lie far outside, i.e. have a high leverage

  – keep in mind to distinguish the corresponding conclusions!

• if one can find a reason

  – and then all observations with this reason would be left out!

Model checking and Diagnostics in Analyst

Many graphics can be produced directly in the regression setting in Analyst, under Plots/Residual or Plots/Diagnostics.

If further plots are wanted (e.g. a plot of Cook's distance), one should create a new data set. In the regression setting, go into

• Save Data

• tick Create and save diagnostics data

• choose (click Add) the quantities to be saved (typically Predicted, Residual, Student, Rstudent, Cookd, Press)

• double-click at Diagnostics Table in the project tree

• save this by clicking File/Save as By SAS Name

• open it for further use, by File/Open By SAS Name

Example (DGA p.338)

Which explaining variables have a marginal effect on the response PEmax?

Are these (Age, Height, Weight, FEV1, FRC) the variables which should be included in the multiple regression model?


Correlations

Correlation Analysis

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 25

AGE SEX HEIGHT WEIGHT BMP

AGE 1.00000 -0.16712 0.92605 0.90587 0.37776

0.0 0.4246 0.0001 0.0001 0.0626

SEX -0.16712 1.00000 -0.16755 -0.19044 -0.13756

0.4246 0.0 0.4234 0.3619 0.5120

HEIGHT 0.92605 -0.16755 1.00000 0.92070 0.44076

0.0001 0.4234 0.0 0.0001 0.0274

WEIGHT 0.90587 -0.19044 0.92070 1.00000 0.67255

0.0001 0.3619 0.0001 0.0 0.0002

BMP 0.37776 -0.13756 0.44076 0.67255 1.00000

0.0626 0.5120 0.0274 0.0002 0.0

FEV1 0.29449 -0.52826 0.31666 0.44884 0.54552

0.1530 0.0066 0.1230 0.0244 0.0048

RV -0.55194 0.27135 -0.56952 -0.62151 -0.58237

0.0042 0.1895 0.0030 0.0009 0.0023

FRC -0.63936 0.18361 -0.62428 -0.61726 -0.43439

0.0006 0.3797 0.0009 0.0010 0.0300

TLC -0.46937 0.02423 -0.45708 -0.41847 -0.36490

0.0179 0.9085 0.0216 0.0374 0.0729

PEMAX 0.61347 -0.28857 0.59922 0.63522 0.22951

0.0011 0.1618 0.0015 0.0006 0.2698


Correlation Analysis

Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 25

FEV1 RV FRC TLC PEMAX

AGE 0.29449 -0.55194 -0.63936 -0.46937 0.61347

0.1530 0.0042 0.0006 0.0179 0.0011

SEX -0.52826 0.27135 0.18361 0.02423 -0.28857

0.0066 0.1895 0.3797 0.9085 0.1618

HEIGHT 0.31666 -0.56952 -0.62428 -0.45708 0.59922

0.1230 0.0030 0.0009 0.0216 0.0015

WEIGHT 0.44884 -0.62151 -0.61726 -0.41847 0.63522

0.0244 0.0009 0.0010 0.0374 0.0006

BMP 0.54552 -0.58237 -0.43439 -0.36490 0.22951

0.0048 0.0023 0.0300 0.0729 0.2698

FEV1 1.00000 -0.66586 -0.66511 -0.44299 0.45338

0.0 0.0003 0.0003 0.0266 0.0228

RV -0.66586 1.00000 0.91060 0.58914 -0.31555

0.0003 0.0 0.0001 0.0019 0.1244

FRC -0.66511 0.91060 1.00000 0.70440 -0.41721

0.0003 0.0001 0.0 0.0001 0.0380

TLC -0.44299 0.58914 0.70440 1.00000 -0.18162

0.0266 0.0019 0.0001 0.0 0.3849

PEMAX 0.45338 -0.31555 -0.41721 -0.18162 1.00000

0.0228 0.1244 0.0380 0.3849 0.0

Note in particular the correlations between age, height and weight!

Model selection

(chosen in Model under Regression/Linear):

• Forward selection: include each time the most significant
  → Final model: WEIGHT BMP FEV1

• Backward elimination: start with all covariates, then drop each time the least significant
  → Final model: WEIGHT BMP FEV1

This looks quite stable!?

But:

What if WEIGHT had been log-transformed from the start?

→ Then we would have obtained the final model: AGE FEV1

Rule of thumb:

There should be at least 10 times as many observations as parameters in the model.
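In PROC REG the two strategies are requested with the SELECTION= option; a sketch, assuming the data set is called cystfib:

proc reg data=cystfib;
   model pemax = age sex height weight bmp fev1 rv frc tlc
         / selection=backward slstay=0.05;   /* or: selection=forward slentry=0.05 */
run;
quit;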


If all 9 covariates are included:

Dependent: pemax

Explanatory:

age sex height weight bmp fev1 rv frc tlc

Dependent Variable: PEMAX

Analysis of Variance

Sum of Mean

Source DF Squares Square F Value Prob>F

Model 9 17101.39040 1900.15449 2.929 0.0320

Error 15 9731.24960 648.74997

C Total 24 26832.64000

Root MSE 25.47057 R-square 0.6373

Dep Mean 109.12000 Adj R-sq 0.4197

C.V. 23.34180

Parameter Estimates

Parameter Standard T for H0:

Variable DF Estimate Error Parameter=0 Prob > |T|

INTERCEP 1 176.058206 225.89115895 0.779 0.4479

AGE 1 -2.541960 4.80169881 -0.529 0.6043

SEX 1 -3.736781 15.45982182 -0.242 0.8123

HEIGHT 1 -0.446255 0.90335490 -0.494 0.6285

WEIGHT 1 2.992816 2.00795743 1.490 0.1568

BMP 1 -1.744944 1.15523751 -1.510 0.1517

FEV1 1 1.080697 1.08094746 1.000 0.3333

RV 1 0.196972 0.19621362 1.004 0.3314

FRC 1 -0.308431 0.49238994 -0.626 0.5405

TLC 1 0.188602 0.49973514 0.377 0.7112

Backward elimination

Table of successive p values (columns = elimination steps 1 to 9):

age 0.604 0.632 0.519 0.616 - - - - -

sex 0.812 - - - - - - - -

height 0.628 0.649 0.550 0.600 0.557 - - - -

weight 0.157 0.143 0.072 0.072 0.040 0.000 0.000 0.000 0.001

bmp 0.152 0.140 0.060 0.056 0.035 0.024 0.019 0.098 -

fev1 0.333 0.108 0.103 0.036 0.024 0.014 0.043 - -

rv 0.331 0.323 0.347 0.326 0.228 0.146 - - -

frc 0.540 0.555 0.638 - - - - - -

tlc 0.711 0.669 - - - - - - -

(Altman stops with step no. 7.)

Graph of successive p values

[Figure: p values of the covariates across the elimination steps]

Remarks on model selection

• Problem of multiple testing!

• Avoid unclear problem formulations (many covariates which express more or less the same)

• Linearly related covariates:

  – What can we say about the 'winners'?

  – Were they significant all the time, or did they come in suddenly?

  – In the latter case they could also have been excluded while they were insignificant . . .

  – Estimates become very unstable

• Usual recommendations:

  – Backward elimination

  – Calculation of all models

  – Cross-validation: estimate the model with a part of the data, then try it out on the rest.

What happens if an explaining variable is excluded?

• The fit gets worse, i.e. the residual sum of squares gets larger.

• The number of degrees of freedom (for the residual SS) increases.

• The estimate s² of the residual variance σ² can therefore either increase or decrease:

  s² = Σ (y_i − ŷ_i)² / (n − p − 1)

• The proportion of variation which is explained by the model, R², decreases. This is compensated in the adjusted coefficient of determination, R²_adj.

→ As criteria for the model fit we can use both s² and R²_adj.

Marginal models:

• Model 1: pemax vs. height

• Model 2: pemax vs. weight

Multiple regression model:

• Model 3: pemax vs. height and weight

β̂0       β̂1 (height)     β̂2 (weight)     s       R²       R²_adj
-33.276   0.932 (0.260)   -               27.34   0.3591   0.33
 63.546   -               1.187 (0.301)   26.38   0.4035   0.38
 47.355   0.147 (0.655)   1.024 (0.787)   26.94   0.4049   0.35

• Each of the two explaining variables has some importance, as seen from the marginal models.

• In the multiple regression model it looks as if none of them has any importance.

• This means that at least one of them is important, but it is difficult to say which. It looks rather as if it would be weight...

Options in Statistics/Regression/Linear:

• Model:

  – Forward selection

  – Backward elimination

• Statistics:

  – clb: confidence limits for estimates

  – corrb: correlation between estimates

  – stb: standardized coefficients: effect of changing the covariate by 1 SD

• Statistics/Tests:

  – collin: collinearity diagnostics

  – vif: variance inflation factor: variance increase due to collinearity

  – tol: tolerance factor: 1 − R² for the regression of one covariate on the others
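Programmatically, these are options on the MODEL statement in PROC REG; a sketch (data set name assumed):

proc reg data=cystfib;
   model pemax = age sex height weight bmp fev1 rv frc tlc
         / clb stb tol vif collin;   /* confidence limits, standardized coefficients,
                                        tolerance, variance inflation, collinearity diagnostics */
run;
quit;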

If we add clb, stb, vif and tol, we obtain:

Parameter Estimates

Standardized Variance

Variable DF Estimate Tolerance Inflation

Intercept 1 0 . 0

age 1 -0.38460 0.04581 21.82984

sex 1 -0.05662 0.44064 2.26941

height 1 -0.28694 0.07166 13.95493

weight 1 1.60200 0.02093 47.78130

bmp 1 -0.62651 0.14053 7.11575

fev1 1 0.36190 0.18452 5.41951

rv 1 0.50671 0.09489 10.53805

frc 1 -0.40327 0.05833 17.14307

tlc 1 0.09571 0.37594 2.65999

Parameter Estimates

Variable DF 95% Confidence Limits

Intercept 1 -305.41740 657.53381

age 1 -12.77654 7.69262

sex 1 -36.68861 29.21505

height 1 -2.37171 1.47920

weight 1 -1.28704 7.27268

bmp 1 -4.20727 0.71739

fev1 1 -1.22329 3.38468

rv 1 -0.22125 0.61519

frc 1 -1.35794 0.74107

tlc 1 -0.87656 1.25376

The quantities calculated for each observation are best saved in a new data set, and then one can look at descriptive statistics for these, e.g.:

The MEANS Procedure

Variable Label Mean

--------------------------------------------------------------------

resid Residual 2.50111E-14

stresid Studentized Residual 0.0193870

press Residual without Current Observation 1.2483399

residud Studentized Residual without Current Obs 0.0073219

leverage Leverage 0.4000000

cook Cook’s D Influence Statistic 0.0643761

inflpred Standard Influence on Predicted Value 0.0477590

--------------------------------------------------------------------

Variable Label Minimum

--------------------------------------------------------------------

resid Residual -37.3376860

stresid Studentized Residual -1.7680347

press Residual without Current Observation -60.7098868

residud Studentized Residual without Current Obs -1.9197970

leverage Leverage 0.1925968

cook Cook’s D Influence Statistic 0.000558647

inflpred Standard Influence on Predicted Value -1.7428452

--------------------------------------------------------------------

Variable Label Maximum

--------------------------------------------------------------------

resid Residual 33.4051731

stresid Studentized Residual 1.7053874

press Residual without Current Observation 56.4819549

residud Studentized Residual without Current Obs 1.8350344

leverage Leverage 0.5806599

cook Cook’s D Influence Statistic 0.2582067

inflpred Standard Influence on Predicted Value 1.5251936

--------------------------------------------------------------------

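A table like the one above could be produced with PROC MEANS on the saved diagnostics data set; a sketch (the data set name diag is an assumption; the variable names are those shown in the output):

proc means data=diag mean min max;
   var resid stresid press residud leverage cook inflpred;
run;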

Selected diagnostic plots

[Figure: diagnostic quantities plotted against the observation number]

Collinearity

→ The covariates are linearly related. There will always be some relationship between them (hopefully not too strong), except in designed trials (e.g. in agricultural trials).

Symptoms of collinearity:

• Some of the covariates are strongly correlated

• Some parameter estimates have quite large standard errors

• All covariates in the multiple regression model are insignificant, but R² is nevertheless large

• There are large changes in the estimates, if one covariate is excluded from the model

• There are large changes in the estimates, if an observation is excluded from the model

• The results differ from the expectations.
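The vif and tol options from above quantify this directly: if R_j² denotes R² from regressing the jth covariate on all the others, then

tol_j = 1 − R_j²,   vif_j = 1 / (1 − R_j²),

so the variance inflation of 47.8 for weight in the output above corresponds to R_j² ≈ 0.98, i.e. weight is almost fully explained by the other covariates.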