Lecture 9: ANOVA tables, F-tests

BMTRY 701 Biostatistical Methods II

ANOVA

Analysis of Variance
• Similar in derivation to the ANOVA that generalizes the two-sample t-test
• Partitioning of variance into several parts
  • that due to the ‘model’: SSR
  • that due to ‘error’: SSE
• The sum of the two parts is the total sum of squares: SST

Total Deviations: $Y_i - \bar{Y}$

[Figure: data$logLOS (y-axis, 2.0 to 3.0) plotted against data$BEDS (x-axis, 0 to 800)]

Regression Deviations: $\hat{Y}_i - \bar{Y}$

[Figure: data$logLOS (y-axis, 2.0 to 3.0) plotted against data$BEDS (x-axis, 0 to 800)]

Error Deviations: $Y_i - \hat{Y}_i$

[Figure: data$logLOS (y-axis, 2.0 to 3.0) plotted against data$BEDS (x-axis, 0 to 800)]

Definitions

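$$SST = \sum_i (Y_i - \bar{Y})^2$$

$$SSR = \sum_i (\hat{Y}_i - \bar{Y})^2$$

$$SSE = \sum_i (Y_i - \hat{Y}_i)^2$$

$$SST = SSR + SSE$$

$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$$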
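Why the sums of squares add up exactly is not spelled out on the slide. A short derivation sketch follows. It relies on the standard least-squares normal equations (the residuals $e_i = Y_i - \hat{Y}_i$ sum to zero and are orthogonal to $X_i$), which are assumed here rather than taken from this lecture.

```latex
\begin{align*}
\sum_i (Y_i - \bar{Y})^2
  &= \sum_i \left[(\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)\right]^2 \\
  &= \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i (Y_i - \hat{Y}_i)^2
     + 2 \sum_i (\hat{Y}_i - \bar{Y})(Y_i - \hat{Y}_i) \\
  &= SSR + SSE + 2 \sum_i (\hat{Y}_i - \bar{Y})\, e_i .
\end{align*}
% The cross-product term is zero for a least-squares fit with an intercept:
% \sum_i e_i = 0 and \sum_i X_i e_i = 0, so
% \sum_i \hat{Y}_i e_i = b_0 \sum_i e_i + b_1 \sum_i X_i e_i = 0
% and \bar{Y} \sum_i e_i = 0. Hence SST = SSR + SSE.
```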

Example: logLOS ~ BEDS

> ybar <- mean(data$logLOS)
> yhati <- reg$fitted.values
> sst <- sum((data$logLOS - ybar)^2)
> ssr <- sum((yhati - ybar)^2)
> sse <- sum((data$logLOS - yhati)^2)
>
> sst
[1] 3.547454
> ssr
[1] 0.6401715
> sse
[1] 2.907282
> sse+ssr
[1] 3.547454

Degrees of Freedom

Degrees of freedom for SST: n - 1
• one df is lost because it is used to estimate the mean of Y

Degrees of freedom for SSR: 1
• only one df because all estimates are based on the same fitted regression line

Degrees of freedom for SSE: n - 2
• two are lost due to estimating the regression line (slope and intercept)

Mean Squares

• “Scaled” version of the Sum of Squares
• Mean Square = SS/df
• MSR = SSR/1
• MSE = SSE/(n-2)
• Notes:
  • mean squares are not additive! That is, MSR + MSE ≠ SST/(n-1)
  • MSE is the same as we saw previously

Standard ANOVA Table

Source       SS    df    MS
Regression   SSR   1     MSR
Error        SSE   n-2   MSE
Total        SST   n-1

ANOVA for logLOS ~ BEDS

> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619
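As a check on how the table entries relate to each other, the mean squares and the F value can be recomputed by hand from the sums of squares. The arithmetic below uses the rounded values printed in the table and takes n = 113, which is inferred from the 111 residual degrees of freedom.

```latex
\begin{align*}
MSR &= \frac{SSR}{1} = 0.64017 \\
MSE &= \frac{SSE}{n-2} = \frac{2.90728}{111} \approx 0.02619 \\
F^* &= \frac{MSR}{MSE} = \frac{0.64017}{0.02619} \approx 24.44
\end{align*}
```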

Inference?

What is of interest and how do we interpret?
We’d like to know if BEDS is related to logLOS. How do we do that using the ANOVA table?

We need to know the expected values of MSR and MSE:

$$E(MSE) = \sigma^2$$

$$E(MSR) = \sigma^2 + \beta_1^2 \sum_i (X_i - \bar{X})^2$$

Implications

• The mean of the sampling distribution of MSE is σ² regardless of whether or not β1 = 0
• If β1 = 0, E(MSE) = E(MSR)
• If β1 ≠ 0, E(MSE) < E(MSR)
• To test the significance of β1, we can test whether MSR and MSE are of the same magnitude.

$$E(MSE) = \sigma^2$$

$$E(MSR) = \sigma^2 + \beta_1^2 \sum_i (X_i - \bar{X})^2$$
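The behaviour of the two expected values can be checked with a small simulation. This is only a sketch on made-up data: the sample size, error SD, slope, and X grid are illustrative assumptions, and `sim_ms` is a hypothetical helper, not part of the lecture code.

```r
# Simulation sketch: average MSR and MSE under beta1 = 0 and beta1 != 0.
# n, sigma, the X grid and the slope 0.5 are illustrative assumptions.
set.seed(701)
n     <- 50
sigma <- 2
x     <- seq(1, 10, length.out = n)
sxx   <- sum((x - mean(x))^2)

# Hypothetical helper: simulate nsim data sets and average the mean squares.
sim_ms <- function(beta1, nsim = 1000) {
  ms <- replicate(nsim, {
    y   <- 1 + beta1 * x + rnorm(n, sd = sigma)
    tab <- anova(lm(y ~ x))
    c(MSR = tab["x", "Mean Sq"], MSE = tab["Residuals", "Mean Sq"])
  })
  rowMeans(ms)
}

null_means <- sim_ms(beta1 = 0)    # both averages should be near sigma^2
alt_means  <- sim_ms(beta1 = 0.5)  # MSE near sigma^2, MSR much larger

null_means
alt_means
sigma^2                  # theoretical E(MSE)
sigma^2 + 0.5^2 * sxx    # theoretical E(MSR) when beta1 = 0.5
```

With β1 = 0 both averages sit close to σ², while with β1 ≠ 0 only the average MSR moves, which is what motivates comparing the two mean squares.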

F-test

Derived naturally from the arguments just made

Hypotheses:
• H0: β1 = 0
• H1: β1 ≠ 0

Test statistic: F* = MSR/MSE

Based on the earlier argument, we expect F* > 1 if H1 is true.

This implies a one-sided test: only large values of F* count as evidence against H0.

F-test

The distribution of F* under the null has two sets of degrees of freedom (df)
• numerator degrees of freedom
• denominator degrees of freedom

These correspond to the df shown in the ANOVA table
• numerator df = 1
• denominator df = n-2

The test is based on

$$F^* = \frac{MSR}{MSE} \sim F(1, n-2)$$

Implementing the F-test

The decision rule

If F* > F(1-α; 1, n-2), then reject Ho
If F* ≤ F(1-α; 1, n-2), then fail to reject Ho
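The decision rule can be wrapped in a few lines of R. The function name `f_test_decision` is hypothetical, α = 0.05 is an assumed default, and the inputs in the usage line (F* = 24.442, n = 113) are read off the logLOS ~ BEDS ANOVA table shown earlier.

```r
# Sketch of the decision rule for the simple linear regression F-test.
# f_test_decision is a hypothetical helper; alpha = 0.05 is an assumed default.
f_test_decision <- function(f_star, n, alpha = 0.05) {
  crit <- qf(1 - alpha, df1 = 1, df2 = n - 2)              # F(1 - alpha; 1, n - 2)
  p    <- pf(f_star, df1 = 1, df2 = n - 2, lower.tail = FALSE)  # upper-tail area
  list(critical_value = crit, p_value = p, reject_H0 = f_star > crit)
}

# F* and n taken from the logLOS ~ BEDS table (111 residual df, so n = 113)
res <- f_test_decision(f_star = 24.442, n = 113)
res
```

Using `lower.tail = FALSE` is equivalent to `1 - pf(...)` and reflects that the rejection region lies only in the upper tail.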

F-distributions

[Figure: density curves of the F(1,10), F(1,1000), F(5,10), and F(5,1000) distributions for x from 0 to 6]
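A plot along these lines can be drawn with the base R density function `df()`. This is a sketch, not the code used for the slide; the grid, colours, and axis limits are assumptions.

```r
# Densities of four F distributions on a common grid.
# The grid starts just above 0 because the F(1, .) density is unbounded at 0.
x   <- seq(0.01, 6, length.out = 400)
dfs <- list(c(1, 10), c(1, 1000), c(5, 10), c(5, 1000))

# One column of density values per (numerator df, denominator df) pair
dens <- sapply(dfs, function(d) df(x, df1 = d[1], df2 = d[2]))
colnames(dens) <- sapply(dfs, function(d) paste0("F(", d[1], ",", d[2], ")"))

matplot(x, dens, type = "l", lty = 1, col = 1:4, ylim = c(0, 0.8),
        xlab = "x", ylab = "density")
legend("topright", legend = colnames(dens), lty = 1, col = 1:4)
```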

ANOVA for logLOS ~ BEDS

> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619

> qf(0.95, 1, 111)
[1] 3.926607

> 1-pf(24.44,1,111)
[1] 2.739016e-06

More interesting: MLR

• You can test that several coefficients are zero at the same time
• Otherwise, the F-test gives the same result as a t-test
• That is: for testing the significance of ONE covariate in a linear regression model, an F-test and a t-test give the same result:
  • H0: β1 = 0
  • H1: β1 ≠ 0
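The equivalence for a single covariate can be seen numerically: the F statistic is the square of the t statistic, and the two p-values agree. The sketch below uses simulated data (not the lecture data set), so all values and variable names are illustrative.

```r
# Simulated illustration (not the lecture data): with one covariate, t^2 = F.
set.seed(9)
x   <- runif(40, 0, 800)
y   <- 2.2 + 0.0005 * x + rnorm(40, sd = 0.15)
fit <- lm(y ~ x)

t_val <- summary(fit)$coefficients["x", "t value"]
p_t   <- summary(fit)$coefficients["x", "Pr(>|t|)"]
f_val <- anova(fit)["x", "F value"]
p_f   <- anova(fit)["x", "Pr(>F)"]

c(t_squared = t_val^2, F = f_val)   # identical up to floating-point error
c(p_t = p_t, p_F = p_f)             # identical p-values
```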

General F testing approach

• The previous test seems simple
• It is in this case, but it can be generalized to be more useful
• Imagine a more general test:
  • Ho: small model
  • Ha: large model
• Constraint: the small model must be ‘nested’ in the large model
• That is, the small model must be a ‘subset’ of the large model
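The slides do not write out the test statistic for this general comparison. The usual textbook form is sketched below; the notation is an assumption introduced here: $SSE_S$ and $df_S$ are the error sum of squares and error df of the small model, $SSE_L$ and $df_L$ those of the large model.

```latex
\[
F^* = \frac{(SSE_S - SSE_L)/(df_S - df_L)}{SSE_L / df_L}
\;\sim\; F(df_S - df_L,\; df_L) \quad \text{under } H_0 .
\]
% Special case: simple linear regression versus the intercept-only model.
% Then SSE_S = SST, df_S = n - 1, SSE_L = SSE, df_L = n - 2, so
% F^* = (SST - SSE)/1 divided by SSE/(n-2), which is MSR/MSE.
```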

Example of ‘nested’ models

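Model 1: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$

Model 2: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$

Model 3: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + e_i$

• Models 2 and 3 are nested in Model 1
• Model 2 is not nested in Model 3
• Model 3 is not nested in Model 2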

Testing: Models must be nested!

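To test Model 1 vs. Model 2
• we are testing that β2 = 0
• Ho: β2 = 0 vs. Ha: β2 ≠ 0
• If we fail to reject the null hypothesis (β2 = 0 is plausible), we prefer the simpler Model 2
• If we reject the null hypothesis, we conclude that Model 1 is superior to Model 2

Model 1: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$

Model 2: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$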

R

reg1 <- lm(LOS ~ INFRISK + ms + NURSE + nurse2, data=data)
reg2 <- lm(LOS ~ INFRISK + NURSE + nurse2, data=data)
reg3 <- lm(LOS ~ INFRISK + ms, data=data)

> anova(reg1)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.4043 8.115e-10 ***
ms          1  12.897  12.897  5.0288   0.02697 *
NURSE       1   1.097   1.097  0.4277   0.51449
nurse2      1   1.789   1.789  0.6976   0.40543
Residuals 108 276.981   2.565
---

R

> anova(reg2)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 44.8865 9.507e-10 ***
NURSE       1   8.212   8.212  3.1653     0.078 .
nurse2      1   1.782   1.782  0.6870     0.409
Residuals 109 282.771   2.594
---

> anova(reg1, reg2)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + NURSE + nurse2
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    109 282.771 -1    -5.789 2.2574 0.1359
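The F value in the model comparison can be reproduced by hand from the two residual sums of squares. The arithmetic below uses the rounded values printed in the output, so it matches the reported 2.2574 only up to rounding.

```latex
\[
F^* = \frac{(282.771 - 276.981)/(109 - 108)}{276.981/108}
    = \frac{5.790}{2.565} \approx 2.26,
\qquad \text{compared against } F(1, 108).
\]
```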

R

> summary(reg1)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.355e+00  5.266e-01  12.068  < 2e-16 ***
INFRISK      6.289e-01  1.339e-01   4.696 7.86e-06 ***
ms           7.829e-01  5.211e-01   1.502    0.136
NURSE        4.136e-03  4.093e-03   1.010    0.315
nurse2      -5.676e-06  6.796e-06  -0.835    0.405
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.601 on 108 degrees of freedom
Multiple R-squared: 0.3231, Adjusted R-squared: 0.2981
F-statistic: 12.89 on 4 and 108 DF, p-value: 1.298e-08
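The t-test for ms in this summary and the F-test from anova(reg1, reg2) are the same test, since both ask whether β2 = 0 given the other covariates. A quick check with the printed values:

```latex
\[
t^2 = (1.502)^2 \approx 2.256 \approx F = 2.2574,
\qquad p_t = 0.136 \approx p_F = 0.1359 .
\]
```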

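Testing more than one covariate

To test Model 1 vs. Model 3
• we are testing that β3 = 0 AND β4 = 0
• Ho: β3 = β4 = 0 vs. Ha: β3 ≠ 0 or β4 ≠ 0
• If we fail to reject the null hypothesis (β3 = β4 = 0 is plausible), we prefer the simpler Model 3
• If we reject the null hypothesis, we conclude that Model 1 is superior to Model 3

Model 1: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + \beta_3 NURSE_i + \beta_4 NURSE_i^2 + e_i$

Model 3: $LOS_i = \beta_0 + \beta_1 INFRISK_i + \beta_2 MS_i + e_i$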

R

> anova(reg3)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.7683 6.724e-10 ***
ms          1  12.897  12.897  5.0691   0.02634 *
Residuals 110 279.867   2.544
---

> anova(reg1, reg3)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + ms
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    110 279.867 -2    -2.886 0.5627 0.5713
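Here the numerator has 2 degrees of freedom because two coefficients are set to zero under the null. Reproducing the F value from the printed (rounded) residual sums of squares:

```latex
\[
F^* = \frac{(279.867 - 276.981)/(110 - 108)}{276.981/108}
    = \frac{1.443}{2.565} \approx 0.563,
\qquad \text{compared against } F(2, 108).
\]
```

With p = 0.5713 there is no evidence against β3 = β4 = 0, so the smaller Model 3 is retained.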

R

> summary(reg3)

Call:
lm(formula = LOS ~ INFRISK + ms, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9037 -0.8739 -0.1142  0.5965  8.5568

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
INFRISK       0.6998     0.1156   6.054    2e-08 ***
ms            0.9717     0.4316   2.251   0.0263 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036
F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10

Testing multiple coefficients simultaneously

Region: it is a ‘factor’ variable with 4 categories

$$LOS_i = \beta_0 + \beta_1 I(R_i = 2) + \beta_2 I(R_i = 3) + \beta_3 I(R_i = 4) + e_i$$
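Testing whether region matters at all means testing β1 = β2 = β3 = 0 simultaneously, which is a nested-model F-test with 3 numerator df. The sketch below uses simulated data as a stand-in for the lecture data set; the variable names `los` and `region` and all simulated effect sizes are assumptions.

```r
# Simulated stand-in for the lecture data; 'los' and 'region' are hypothetical names.
set.seed(11)
n      <- 120
region <- factor(sample(1:4, n, replace = TRUE))
los    <- 9 + c(0, -0.5, -1, -1.5)[as.integer(region)] + rnorm(n, sd = 1.5)
sim    <- data.frame(los, region)

small <- lm(los ~ 1, data = sim)        # model under H0: beta1 = beta2 = beta3 = 0
large <- lm(los ~ region, data = sim)   # region 1 is the reference category

# One F-test for the three indicator coefficients together
cmp <- anova(small, large)
cmp
```

R builds the three indicator variables automatically when `region` is a factor, so the single `anova()` call tests all three coefficients at once rather than one at a time.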
