Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.1
Lecture 4Simple Linear Regression IVReading: Chapter 11
STAT 8020 Statistical Methods IISeptember 1, 2020
Whitney HuangClemson University
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.2
Agenda
1 Analysis of Variance (ANOVA) Approach to Regression
2 Correlation and Coefficient of Determination
3 Residual Analysis: Model Diagnostics and Remedies
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.3
ANOVA Approach toLinear Regression
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.4
Analysis of Variance (ANOVA) Approach to Regression
Partitioning Sums of SquaresTotal sums of squares in response
SST =
n∑i=1
(Yi − Y)2
We can rewrite SST asn∑
i=1
(Yi − Y)2 =
n∑i=1
(Yi − Yi + Yi − Y)2
=
n∑i=1
(Yi − Yi)2
︸ ︷︷ ︸Error
+
n∑i=1
(Yi − Y)2
︸ ︷︷ ︸Model
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.5
Partitioning Total Sums of Squares
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
20 30 40 50 60 70
160
170
180
190
200
Age
Max
Hea
rtR
ate
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.6
Total Sum of Squares: SST
If we ignored the predictor X, the Y would be the best(linear unbiased) predictor
Yi = β0 + εi (1)
SST is the sum of squared deviations for this predictor(i.e., Y)
The total mean square is SST/(n− 1) and represents anunbiased estimate of σ2 under the model (1).
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.7
Regression Sum of Squares: SSR
SSR:∑n
i=1(Yi − Y)2
Degrees of freedom is 1 due to the inclusion of the slope,i.e.,
Yi = β0 + β1Xi + εi (2)
“Large” MSR = SSR/1 suggests a linear trend, because
E[MSE] = σ2 + β21
n∑i=1
(Xi − X)2
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.8
Error Sum of Squares: SSE
SSE is simply the sum of squared residuals
SSE =
n∑i=1
(Yi − Yi)2
Degrees of freedom is n− 2 (Why?)
SSE large when |residuals| are “large"⇒ Yi’s varysubstantially around fitted regression line
MSE = SSE/(n− 2) and represents an unbiased estimateof σ2 when taking X into account
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.9
ANOVA Table and F test
Source df SS MSModel 1 SSR =
∑ni=1(Yi − Y)2 MSR = SSR/1
Error n− 2 SSE =∑n
i=1(Yi − Yi)2 MSE = SSE/(n-2)
Total n− 1 SST =∑n
i=1(Yi − Y)2
Goal: To test H0 : β1 = 0
Test statistics F∗ = MSRMSE
If β1 = 0 then F∗ should be near one⇒ reject H0 when F∗
“large"
We need sampling distribution of F∗ under H0 ⇒ F1,n−2,where F(d1, d2) denotes a F distribution with degrees offreedom d1 and d2
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.10
F Test: H0 : β1 = 0 vs. Ha : β1 6= 0
0 50 100 150
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Null distribution of F test statistic
Test statistic
Den
sity
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.11
SLR: F-Test vs. T-test
ANOVA Table and F-Test
Parameter Estimation and T-Test
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.12
Correlation andCoefficient ofDetermination
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.13
Correlation and Simple Linear Regression
Pearson Correlation: r =∑n
i=1(Xi−X)(Yi−Y)√∑ni=1(Xi−X)2
∑ni=1(Yi−Y)2
−1 ≤ r ≤ 1 measures the strength of the linearrelationship between Y and X
We can show
r = β1
√∑ni=1(Xi − X)2∑ni=1(Yi − Y)2 ,
this impliesβ1 = 0 in SLR ⇔ ρ = 0
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.14
Coefficient of Determination R2
Defined as the proportion of total variation explained bySLR
R2 =
∑ni=1(Yi − Y)2∑ni=1(Yi − Y)2 =
SSRSST
= 1− SSESST
We can show r2 = R2:
r2 =
(β1,LS
√∑ni=1(Xi − X)2∑ni=1(Yi − Y)2
)2
=β2
1,LS∑n
i=1(Xi − X)2∑ni=1(Yi − Y)2
=SSRSST
= R2
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.15
Maximum Heart Rate vs. Age: r and R2
Interpretation:
There is a strong negative linear relationship betweenMaxHeartRate and Age. Furthermore, ∼ 91% of thevariation in MaxHeartRate can be explained by Age.
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.16
Residual Analysis:Model Diagnostics and
Remedies
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.17
Residuals
The residuals are the differences between the observedand fitted values:
ei = Yi − Yi,
where Yi = β0 + β1Xi
ei is NOT the error term εi = Yi − E[Yi]
Residuals are very useful in assessing theappropriateness of the assumptions on εi. Recall
E[εi] = 0
Var[εi] = σ2
Cov[εi, εj] = 0, i 6= j
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.18
Maximum Heart Rate vs. Age Residual Plot: ε vs. X
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
20 30 40 50 60 70
−10
−5
0
5
10
Age
Res
idua
ls
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.19
Interpreting Residual Plots
Figure: Figure courtesy of Faraway’s Linear Models with R (2005, p.59).
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.20
Model Diagnostics and Remedies
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
1 2 3 4
−40
−20
0
20
40
60
x
Res
idua
ls
⇒ Nonlinear relationship
Transform X
Nonlinear regression
●●
● ●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
1 2 3 4
−4
−2
0
2
4
6
x
Res
idua
ls
⇒ Non-constant variance
Transform Y
Weighted least squares
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.21
Extrapolation in SLR
●●
●
●●
●
●
●
●●
●
●●●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●●●●
●●●●
●
●
●
●
●
●●
●
●●●●
●
●
●
●
●
●●●
●●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●
●●●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●●
●
●●
●●●
●
●●
●●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●
●●●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●
●
●●
●
●●●
●
●
●
●●●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●
●●
●
●
●●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●●
●
●
●●●
●
●
●●
●●
●●
●
●●●●
●●
●●
●
●
●
●
●●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●●
●●
●
●
●●●
●●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●●●
●●●
●
●●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●●
●●
●●●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●●●
●
●
●
●
●
●●●
●
●
●
●●●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●●●
●
●
●
●
●
●●
●
●
●
●●●
●●
●
●●
●
●●
●
●●●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●●
●
●●●●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
−4 −2 0 2 4
−0.5
0.0
0.5
1.0
1.5
x
y
TrueSLR Fit
Extrapolation beyond the range of the given data canlead to seriously biased estimates if the assumed re-lationship does not hold the region of extrapolation
Simple LinearRegression IV
Analysis of Variance(ANOVA) Approach toRegression
Correlation andCoefficient ofDetermination
Residual Analysis:Model Diagnostics andRemedies
4.22
Summary of SLR
Model: Yi = β0 + β1Xi + εi
Estimation: Use the method of least squares to estimatethe parameters
Inference
Hypothesis Testing
Confidence/prediction Intervals
ANOVA
Model Diagnostics and Remedies