Lecture 9: ANOVA tables, F-tests
BMTRY 701: Biostatistical Methods II
ANOVA
Analysis of Variance
Similar in derivation to the ANOVA that is a generalization of the two-sample t-test
Partitioning of variance into several parts
• that due to the ‘model’: SSR
• that due to ‘error’: SSE
The sum of the two parts is the total sum of squares: SST
Total Deviations: $Y_i - \bar{Y}$
[Figure: logLOS (vertical axis) plotted against BEDS (horizontal axis), illustrating the total deviations]
Regression Deviations: $\hat{Y}_i - \bar{Y}$
[Figure: logLOS (vertical axis) plotted against BEDS (horizontal axis), illustrating the regression deviations]
Error Deviations: $Y_i - \hat{Y}_i$
[Figure: logLOS (vertical axis) plotted against BEDS (horizontal axis), illustrating the error deviations]
Definitions
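$SST = \sum_i (Y_i - \bar{Y})^2$
$SSR = \sum_i (\hat{Y}_i - \bar{Y})^2$
$SSE = \sum_i (Y_i - \hat{Y}_i)^2$
$SST = SSR + SSE$
$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$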
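The slide states that the sums of squares add up without showing why. The following is a short sketch of the standard argument, assuming the line is fit by least squares with an intercept; the residual symbol $e_i = Y_i - \hat{Y}_i$ is used here for convenience.

```latex
\begin{align*}
\sum_i (Y_i - \bar{Y})^2
  &= \sum_i \bigl[(\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)\bigr]^2 \\
  &= \sum_i (\hat{Y}_i - \bar{Y})^2 + \sum_i (Y_i - \hat{Y}_i)^2
     + 2 \sum_i (\hat{Y}_i - \bar{Y})\, e_i .
\end{align*}
% The least-squares normal equations give \sum_i e_i = 0 and \sum_i X_i e_i = 0,
% so \sum_i \hat{Y}_i e_i = 0 and \bar{Y} \sum_i e_i = 0.
% The cross term therefore vanishes, leaving SST = SSR + SSE.
```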
Example: logLOS ~ BEDS
> ybar <- mean(data$logLOS)
> yhati <- reg$fitted.values
> sst <- sum((data$logLOS - ybar)^2)
> ssr <- sum((yhati - ybar)^2)
> sse <- sum((data$logLOS - yhati)^2)
> sst
[1] 3.547454
> ssr
[1] 0.6401715
> sse
[1] 2.907282
> sse+ssr
[1] 3.547454
Degrees of Freedom
Degrees of freedom for SST: n - 1
• one df is lost because it is used to estimate mean Y
Degrees of freedom for SSR: 1
• only one df because all estimates are based on the same fitted regression line
Degrees of freedom for SSE: n - 2
• two lost due to estimating the regression line (slope and intercept)
Mean Squares
“Scaled” version of Sum of Squares
Mean Square = SS/df
MSR = SSR/1
MSE = SSE/(n-2)
Notes:
• mean squares are not additive! That is, MSR + MSE ≠ SST/(n-1)
• MSE is the same as we saw previously
Standard ANOVA Table
             SS    df    MS
Regression   SSR   1     MSR
Error        SSE   n-2   MSE
Total        SST   n-1
ANOVA for logLOS ~ BEDS
> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619
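As a check on how the table entries relate, the mean squares and F value can be recomputed by hand from the sums of squares and degrees of freedom printed above. The residual df of 111 implies n = 113.

```latex
\begin{align*}
MSR &= \frac{SSR}{1} = 0.64017, \\
MSE &= \frac{SSE}{n-2} = \frac{2.90728}{111} \approx 0.02619, \\
F^* &= \frac{MSR}{MSE} \approx \frac{0.64017}{0.02619} \approx 24.44 .
\end{align*}
```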
Inference?
What is of interest and how do we interpret?
We’d like to know if BEDS is related to logLOS.
How do we do that using the ANOVA table?
We need to know the expected value of the MSR and MSE:
$E(MSE) = \sigma^2$
$E(MSR) = \sigma^2 + \beta_1^2 \sum_i (X_i - \bar{X})^2$
Implications
The mean of the sampling distribution of MSE is σ² regardless of whether or not β1 = 0
If β1 = 0, E(MSE) = E(MSR)
If β1 ≠ 0, E(MSE) < E(MSR)
To test significance of β1, we can test if MSR and MSE are of the same magnitude.
$E(MSE) = \sigma^2$
$E(MSR) = \sigma^2 + \beta_1^2 \sum_i (X_i - \bar{X})^2$
F-test
Derived naturally from the arguments just made
Hypotheses:
• H0: β1 = 0
• H1: β1 ≠ 0
Test statistic: F* = MSR/MSE
Based on the earlier argument, we expect F* > 1 if H1 is true, so only large values of F* count as evidence against H0.
This implies a one-sided (upper-tail) test.
F-test
The distribution of F under the null has two sets of degrees of freedom (df)
• numerator degrees of freedom
• denominator degrees of freedom
These correspond to the df as shown in the ANOVA table
• numerator df = 1
• denominator df = n-2
Test is based on
$F^* = \frac{MSR}{MSE} \sim F(1, n-2)$
Implementing the F-test
The decision rule
If F* > F(1-α; 1, n-2), then reject Ho
If F* ≤ F(1-α; 1, n-2), then fail to reject Ho
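The decision rule can be sketched end to end in R. The data below are simulated stand-ins, not the lecture's data set: the names `x`, `y`, `fit` and the coefficients used to generate `y` are assumptions made only so that the example runs on its own.

```r
# Simulated data standing in for the lecture's data (hypothetical values).
set.seed(701)
n <- 113
x <- runif(n, 50, 800)
y <- 2.2 + 0.0004 * x + rnorm(n, sd = 0.16)

fit  <- lm(y ~ x)
yhat <- fitted(fit)
ybar <- mean(y)

ssr <- sum((yhat - ybar)^2)   # regression sum of squares, 1 df
sse <- sum((y - yhat)^2)      # error sum of squares, n - 2 df
msr <- ssr / 1
mse <- sse / (n - 2)

f_star    <- msr / mse
f_crit    <- qf(0.95, 1, n - 2)                        # F(1 - alpha; 1, n - 2), alpha = 0.05
p_value   <- pf(f_star, 1, n - 2, lower.tail = FALSE)  # upper-tail area
reject_h0 <- f_star > f_crit

# The hand computation should agree with R's own ANOVA table.
tab <- anova(fit)
stopifnot(isTRUE(all.equal(f_star, tab[["F value"]][1])))
stopifnot(isTRUE(all.equal(p_value, tab[["Pr(>F)"]][1])))
```

Comparing F* with the critical value and comparing the p-value with α are two ways of stating the same rule.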
F-distributions
[Figure: density curves of F(1,10), F(1,1000), F(5,10) and F(5,1000), plotted for x from 0 to 6]
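A figure of this kind can be redrawn with R's F density function `df()`. This is a sketch rather than the original plotting code; the grid, line types and legend placement are choices made here.

```r
# Grid starts just above 0 because the F(1, .) density is unbounded at 0.
x <- seq(0.05, 6, by = 0.01)

dens <- cbind(
  "F(1,10)"   = df(x, 1, 10),
  "F(1,1000)" = df(x, 1, 1000),
  "F(5,10)"   = df(x, 5, 10),
  "F(5,1000)" = df(x, 5, 1000)
)

# One curve per column; curves above 0.8 are clipped at the top of the plot.
matplot(x, dens, type = "l", lty = 1:4, col = 1:4, ylim = c(0, 0.8),
        xlab = "x", ylab = "density")
legend("topright", legend = colnames(dens), lty = 1:4, col = 1:4)
```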
ANOVA for logLOS ~ BEDS
> anova(reg)
Analysis of Variance Table

Response: logLOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
BEDS        1 0.64017 0.64017  24.442 2.737e-06 ***
Residuals 111 2.90728 0.02619

> qf(0.95, 1, 111)
[1] 3.926607
> 1-pf(24.44,1,111)
[1] 2.739016e-06
More interesting: MLR
You can test that several coefficients are zero at the same time
Otherwise, F-test gives the same result as a t-test
That is: for testing the significance of ONE covariate in a linear regression model, an F-test and a t-test give the same result:
• H0: β1 = 0
• H1: β1 ≠ 0
general F testing approach
The previous test seems simple. It is in this case, but it can be generalized to be more useful.
Imagine a more general test:
• Ho: small model
• Ha: large model
Constraint: the small model must be ‘nested’ in the large model
That is, the small model must be a ‘subset’ of the large model
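The slides do not write out the test statistic for this general comparison. The standard form is sketched below; the labels R (reduced, the small model) and F (full, the large model) are notation introduced here, and df denotes each model's residual degrees of freedom.

```latex
\[
F^* = \frac{\bigl(SSE(R) - SSE(F)\bigr) / (df_R - df_F)}{SSE(F) / df_F}
\;\sim\; F(df_R - df_F,\; df_F) \quad \text{under } H_0 .
\]
% Simple linear regression is the special case where the reduced model is
% Y_i = \beta_0 + e_i, so SSE(R) = SST with df_R = n - 1, and the full model has
% SSE(F) = SSE with df_F = n - 2. The statistic then reduces to
% F^* = (SST - SSE)/1 \div SSE/(n-2) = MSR/MSE.
```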
Example of ‘nested’ models
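Model 1: $LOS_i = \beta_0 + \beta_1 \mathrm{INFRISK} + \beta_2 \mathrm{MS} + \beta_3 \mathrm{NURSE} + \beta_4 \mathrm{NURSE}^2 + e_i$
Model 2: $LOS_i = \beta_0 + \beta_1 \mathrm{INFRISK} + \beta_3 \mathrm{NURSE} + \beta_4 \mathrm{NURSE}^2 + e_i$
Model 3: $LOS_i = \beta_0 + \beta_1 \mathrm{INFRISK} + \beta_2 \mathrm{MS} + e_i$
• Models 2 and 3 are nested in Model 1
• Model 2 is not nested in Model 3
• Model 3 is not nested in Model 2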
Testing: Models must be nested!
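To test Model 1 vs. Model 2
• we are testing that β2 = 0
• Ho: β2 = 0 vs. Ha: β2 ≠ 0
• If β2 = 0, then we conclude that Model 2 is superior to Model 1
• That is, if we fail to reject the null hypothesis
Model 1: $LOS_i = \beta_0 + \beta_1 \mathrm{INFRISK} + \beta_2 \mathrm{MS} + \beta_3 \mathrm{NURSE} + \beta_4 \mathrm{NURSE}^2 + e_i$
Model 2: $LOS_i = \beta_0 + \beta_1 \mathrm{INFRISK} + \beta_3 \mathrm{NURSE} + \beta_4 \mathrm{NURSE}^2 + e_i$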
R
reg1 <- lm(LOS ~ INFRISK + ms + NURSE + nurse2, data=data)
reg2 <- lm(LOS ~ INFRISK + NURSE + nurse2, data=data)
reg3 <- lm(LOS ~ INFRISK + ms, data=data)
> anova(reg1)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.4043 8.115e-10 ***
ms          1  12.897  12.897  5.0288   0.02697 *
NURSE       1   1.097   1.097  0.4277   0.51449
nurse2      1   1.789   1.789  0.6976   0.40543
Residuals 108 276.981   2.565
---
R
> anova(reg2)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 44.8865 9.507e-10 ***
NURSE       1   8.212   8.212  3.1653     0.078 .
nurse2      1   1.782   1.782  0.6870     0.409
Residuals 109 282.771   2.594
---
> anova(reg1, reg2)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + NURSE + nurse2
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    109 282.771 -1    -5.789 2.2574 0.1359
R
> summary(reg1)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.355e+00  5.266e-01  12.068  < 2e-16 ***
INFRISK      6.289e-01  1.339e-01   4.696 7.86e-06 ***
ms           7.829e-01  5.211e-01   1.502    0.136
NURSE        4.136e-03  4.093e-03   1.010    0.315
nurse2      -5.676e-06  6.796e-06  -0.835    0.405
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.601 on 108 degrees of freedom
Multiple R-squared: 0.3231, Adjusted R-squared: 0.2981
F-statistic: 12.89 on 4 and 108 DF, p-value: 1.298e-08
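The numbers printed in the `anova(reg1, reg2)` and `summary(reg1)` output can be tied together by hand. The arithmetic below uses only values from that output and agrees with it up to rounding; it illustrates the earlier point that an F-test on one coefficient matches the t-test for that coefficient.

```latex
\begin{align*}
F^* &= \frac{(282.771 - 276.981)/(109 - 108)}{276.981/108}
     \approx \frac{5.790}{2.565} \approx 2.26, \\
t_{ms}^2 &= 1.502^2 \approx 2.26 .
\end{align*}
% Both tests report p of about 0.136 (0.1359 from the F-test, 0.136 from the t-test),
% so at alpha = 0.05 we fail to reject H0: beta_2 = 0.
```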
Testing more than one covariate
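To test Model 1 vs. Model 3
• we are testing that β3 = 0 AND β4 = 0
• Ho: β3 = β4 = 0 vs. Ha: β3 ≠ 0 or β4 ≠ 0
• If β3 = β4 = 0, then we conclude that Model 3 is superior to Model 1
• That is, if we fail to reject the null hypothesis
Model 1: $LOS_i = \beta_0 + \beta_1 \mathrm{INFRISK} + \beta_2 \mathrm{MS} + \beta_3 \mathrm{NURSE} + \beta_4 \mathrm{NURSE}^2 + e_i$
Model 3: $LOS_i = \beta_0 + \beta_1 \mathrm{INFRISK} + \beta_2 \mathrm{MS} + e_i$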
R
> anova(reg3)
Analysis of Variance Table

Response: LOS
           Df  Sum Sq Mean Sq F value    Pr(>F)
INFRISK     1 116.446 116.446 45.7683 6.724e-10 ***
ms          1  12.897  12.897  5.0691   0.02634 *
Residuals 110 279.867   2.544
---
> anova(reg1, reg3)
Analysis of Variance Table

Model 1: LOS ~ INFRISK + ms + NURSE + nurse2
Model 2: LOS ~ INFRISK + ms
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1    108 276.981
2    110 279.867 -2    -2.886 0.5627 0.5713
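The F value in the `anova(reg1, reg3)` comparison can be reproduced from the two residual sums of squares and their degrees of freedom, all taken from the output above.

```latex
\[
F^* = \frac{(279.867 - 276.981)/(110 - 108)}{276.981/108}
    \approx \frac{1.443}{2.565} \approx 0.563 .
\]
% With p = 0.5713 we fail to reject H0: beta_3 = beta_4 = 0, so the data give no
% evidence that NURSE and its square add to a model that already has INFRISK and ms.
```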
R
> summary(reg3)

Call:
lm(formula = LOS ~ INFRISK + ms, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-2.9037 -0.8739 -0.1142  0.5965  8.5568

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
INFRISK       0.6998     0.1156   6.054    2e-08 ***
ms            0.9717     0.4316   2.251   0.0263 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036
F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10
Testing multiple coefficients simultaneously
Region: it is a ‘factor’ variable with 4 categories
$LOS_i = \beta_0 + \beta_1 I(R_i = 2) + \beta_2 I(R_i = 3) + \beta_3 I(R_i = 4) + e_i$
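A four-category factor enters the model as three indicator variables, so asking whether Region matters means testing β1 = β2 = β3 = 0 together. The sketch below shows one way to run that test in R. It assumes this joint null is the test intended here, and it uses simulated data: `region`, `los` and the effect sizes are hypothetical, not the lecture's data.

```r
# Simulated stand-in data (hypothetical values, not the lecture's data set).
set.seed(9)
n <- 113
region <- factor(sample(rep(1:4, length.out = n)))  # 4 categories; level 1 is the reference
los <- 9 + c(0, -0.5, -1, -1.5)[as.integer(region)] + rnorm(n, sd = 1.5)

small <- lm(los ~ 1)        # model under H0: beta1 = beta2 = beta3 = 0
large <- lm(los ~ region)   # R builds the three indicators automatically

# Nested-model F-test with 3 numerator df, one per indicator.
cmp <- anova(small, large)
cmp
```

Because the small model here contains only an intercept, this F statistic coincides with the overall F-statistic reported by `summary(large)`.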