
Simple Linear Regression IV Lecture 4


Page 1: Simple Linear Regression IV Lecture 4


Lecture 4: Simple Linear Regression IV
Reading: Chapter 11

STAT 8020 Statistical Methods II
September 1, 2020

Whitney Huang
Clemson University

Page 2: Simple Linear Regression IV Lecture 4


Agenda

1 Analysis of Variance (ANOVA) Approach to Regression

2 Correlation and Coefficient of Determination

3 Residual Analysis: Model Diagnostics and Remedies

Page 3: Simple Linear Regression IV Lecture 4


ANOVA Approach to Linear Regression

Page 4: Simple Linear Regression IV Lecture 4


Analysis of Variance (ANOVA) Approach to Regression

Partitioning Sums of Squares

Total sum of squares in the response:

SST = Σ_{i=1}^{n} (Yi − Ȳ)²

We can rewrite SST as

Σ_{i=1}^{n} (Yi − Ȳ)² = Σ_{i=1}^{n} (Yi − Ŷi + Ŷi − Ȳ)²

                      = Σ_{i=1}^{n} (Yi − Ŷi)²  +  Σ_{i=1}^{n} (Ŷi − Ȳ)²
                           (Error)                    (Model)

(the cross-product term vanishes when summed over the least squares fit)
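A quick numerical check of this partition can be done in R. The sketch below uses illustrative age and maximum heart rate values (made up to resemble the running example, not taken from the lecture) and verifies that SST = SSE + SSR for the fitted line.

## Illustrative data (assumed values, not necessarily the lecture's data set)
age    <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 49, 29, 60)
max_hr <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 183, 171, 154)

fit <- lm(max_hr ~ age)                       # simple linear regression fit

SST <- sum((max_hr - mean(max_hr))^2)         # total sum of squares
SSE <- sum(residuals(fit)^2)                  # error (residual) sum of squares
SSR <- sum((fitted(fit) - mean(max_hr))^2)    # regression (model) sum of squares

c(SST = SST, SSE_plus_SSR = SSE + SSR)        # the two numbers agree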

Page 5: Simple Linear Regression IV Lecture 4


Partitioning Total Sums of Squares

[Scatterplot: MaxHeartRate (roughly 160 to 200) versus Age (20 to 70).]

Page 6: Simple Linear Regression IV Lecture 4


Total Sum of Squares: SST

If we ignored the predictor X, then Ȳ would be the best (linear unbiased) predictor, i.e., we would use the model

Yi = β0 + εi    (1)

SST is the sum of squared deviations for this predictor (i.e., Ȳ)

The total mean square, SST/(n − 1), is an unbiased estimate of σ² under model (1)

Page 7: Simple Linear Regression IV Lecture 4


Regression Sum of Squares: SSR

SSR = Σ_{i=1}^{n} (Ŷi − Ȳ)²

Its degrees of freedom is 1, due to the inclusion of the slope, i.e.,

Yi = β0 + β1 Xi + εi    (2)

A "large" MSR = SSR/1 suggests a linear trend, because

E[MSR] = σ² + β1² Σ_{i=1}^{n} (Xi − X̄)²

Page 8: Simple Linear Regression IV Lecture 4


Error Sum of Squares: SSE

SSE is simply the sum of squared residuals:

SSE = Σ_{i=1}^{n} (Yi − Ŷi)²

Its degrees of freedom is n − 2 (Why?)

SSE is large when the |residuals| are "large", i.e., when the Yi's vary substantially around the fitted regression line

MSE = SSE/(n − 2) is an unbiased estimate of σ² when taking X into account
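To see MSE as an estimate of σ² in practice, one can compare SSE/(n − 2) with the squared residual standard error that R reports. A minimal sketch, reusing the illustrative age and max_hr vectors from the earlier sketch:

## Assumes the illustrative age / max_hr vectors defined earlier
fit <- lm(max_hr ~ age)
n   <- length(max_hr)

MSE <- sum(residuals(fit)^2) / (n - 2)      # SSE / (n - 2)
c(MSE = MSE, sigma_hat_sq = sigma(fit)^2)   # sigma(fit) is the residual standard error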

Page 9: Simple Linear Regression IV Lecture 4


ANOVA Table and F test

Source   df      SS                             MS
Model    1       SSR = Σ_{i=1}^{n} (Ŷi − Ȳ)²    MSR = SSR/1
Error    n − 2   SSE = Σ_{i=1}^{n} (Yi − Ŷi)²   MSE = SSE/(n − 2)
Total    n − 1   SST = Σ_{i=1}^{n} (Yi − Ȳ)²

Goal: to test H0 : β1 = 0

Test statistic: F* = MSR/MSE

If β1 = 0, then F* should be near one ⇒ reject H0 when F* is "large"

We need the sampling distribution of F* under H0, which is F_{1, n−2}, where F_{d1, d2} denotes an F distribution with degrees of freedom d1 and d2
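In R, anova() on an lm fit prints exactly this table. The sketch below (continuing with the illustrative fit from before) also computes F* = MSR/MSE by hand together with its p-value from the F(1, n − 2) distribution.

## Assumes the illustrative age / max_hr vectors defined earlier
fit <- lm(max_hr ~ age)
anova(fit)                                   # ANOVA table: df, SS, MS, F, p-value

n     <- length(max_hr)
MSR   <- sum((fitted(fit) - mean(max_hr))^2) / 1
MSE   <- sum(residuals(fit)^2) / (n - 2)
Fstar <- MSR / MSE
c(F = Fstar,
  p_value = pf(Fstar, df1 = 1, df2 = n - 2, lower.tail = FALSE))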

Page 10: Simple Linear Regression IV Lecture 4


F Test: H0 : β1 = 0 vs. Ha : β1 ≠ 0

[Figure: null distribution of the F test statistic (density curve); horizontal axis: test statistic, vertical axis: density.]

Page 11: Simple Linear Regression IV Lecture 4


SLR: F-Test vs. T-test

ANOVA Table and F-Test

Parameter Estimation and T-Test
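For simple linear regression the two tests are equivalent: the slope's t statistic squared equals the F statistic. A minimal R sketch of this equivalence, assuming the illustrative age / max_hr fit from the earlier sketches:

## Assumes the illustrative age / max_hr vectors defined earlier
fit <- lm(max_hr ~ age)

anova(fit)      # F test for H0: beta_1 = 0
summary(fit)    # t test for the same hypothesis (plus estimates, R^2, etc.)

t_slope <- summary(fit)$coefficients["age", "t value"]
F_stat  <- summary(fit)$fstatistic["value"]
c(t_squared = t_slope^2, F = unname(F_stat))   # identical in SLR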

Page 12: Simple Linear Regression IV Lecture 4


Correlation and Coefficient of Determination

Page 13: Simple Linear Regression IV Lecture 4


Correlation and Simple Linear Regression

Pearson correlation:

r = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / sqrt( Σ_{i=1}^{n} (Xi − X̄)² · Σ_{i=1}^{n} (Yi − Ȳ)² )

−1 ≤ r ≤ 1 measures the strength of the linear relationship between Y and X

We can show

r = β̂1 · sqrt( Σ_{i=1}^{n} (Xi − X̄)² / Σ_{i=1}^{n} (Yi − Ȳ)² ),

which implies that β1 = 0 in SLR ⇔ ρ = 0
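A small numerical check of this identity in R: the sample correlation equals the estimated slope scaled by sd(X)/sd(Y), since the (n − 1) factors cancel in the ratio of sums of squares. This again reuses the illustrative age / max_hr vectors.

## Assumes the illustrative age / max_hr vectors defined earlier
fit <- lm(max_hr ~ age)

r            <- cor(age, max_hr)                          # Pearson correlation
r_from_slope <- coef(fit)["age"] * sd(age) / sd(max_hr)   # slope times sd(X)/sd(Y)
c(r = r, r_from_slope = unname(r_from_slope))             # the two agree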

Page 14: Simple Linear Regression IV Lecture 4


Coefficient of Determination R2

Defined as the proportion of the total variation explained by the SLR:

R² = Σ_{i=1}^{n} (Ŷi − Ȳ)² / Σ_{i=1}^{n} (Yi − Ȳ)² = SSR/SST = 1 − SSE/SST

We can show r² = R²:

r² = ( β̂_{1,LS} · sqrt( Σ_{i=1}^{n} (Xi − X̄)² / Σ_{i=1}^{n} (Yi − Ȳ)² ) )²

   = β̂_{1,LS}² Σ_{i=1}^{n} (Xi − X̄)² / Σ_{i=1}^{n} (Yi − Ȳ)²

   = SSR/SST    (since Ŷi − Ȳ = β̂_{1,LS}(Xi − X̄), so SSR = β̂_{1,LS}² Σ_{i=1}^{n} (Xi − X̄)²)

   = R²
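A one-line numerical check of r² = R² in R, again assuming the illustrative age, max_hr, and fit objects from the earlier sketches:

## Assumes age, max_hr, and fit from the earlier sketches
fit <- lm(max_hr ~ age)
c(R_squared = summary(fit)$r.squared,      # R^2 reported by the fit
  r_squared = cor(age, max_hr)^2)          # squared sample correlation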

Page 15: Simple Linear Regression IV Lecture 4


Maximum Heart Rate vs. Age: r and R²

Interpretation:

There is a strong negative linear relationship between MaxHeartRate and Age. Furthermore, ∼91% of the variation in MaxHeartRate can be explained by Age.

Page 16: Simple Linear Regression IV Lecture 4


Residual Analysis: Model Diagnostics and Remedies

Page 17: Simple Linear Regression IV Lecture 4


Residuals

The residuals are the differences between the observed and fitted values:

ei = Yi − Ŷi,  where Ŷi = β̂0 + β̂1 Xi

ei is NOT the error term εi = Yi − E[Yi]

Residuals are very useful in assessing the appropriateness of the assumptions on εi. Recall:

E[εi] = 0
Var[εi] = σ²
Cov[εi, εj] = 0, i ≠ j
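In R, the residuals and fitted values come straight from the lm object. A minimal sketch for producing a residual-versus-predictor plot like the one on the next slide, reusing the illustrative age / max_hr fit:

## Assumes the illustrative age / max_hr vectors defined earlier
fit <- lm(max_hr ~ age)

e <- residuals(fit)            # e_i = Y_i - Yhat_i
plot(age, e,
     xlab = "Age", ylab = "Residuals",
     main = "Residuals vs. Age")
abline(h = 0, lty = 2)         # residuals should scatter evenly around zero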

Page 18: Simple Linear Regression IV Lecture 4


Maximum Heart Rate vs. Age Residual Plot: e vs. X

[Residual plot: residuals (roughly −10 to 10) versus Age (20 to 70).]

Page 19: Simple Linear Regression IV Lecture 4


Interpreting Residual Plots

Figure courtesy of Faraway's Linear Models with R (2005, p. 59).

Page 20: Simple Linear Regression IV Lecture 4


Model Diagnostics and Remedies

[Residual plot: residuals versus x]
⇒ Nonlinear relationship. Remedies: transform X, or use nonlinear regression.

[Residual plot: residuals versus x]
⇒ Non-constant variance. Remedies: transform Y, or use weighted least squares.
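As a rough sketch of the variance-related remedies (on simulated data, since the plotted examples are not reproduced here): a log transformation of Y when the error is roughly multiplicative, and weighted least squares with weights proportional to 1/x² when the error standard deviation is roughly proportional to x. Both choices are illustrative assumptions, not the lecture's prescription for any particular data set.

## Simulated example of non-constant variance (illustrative only)
set.seed(8020)
x <- runif(100, 1, 4)
y <- exp(1 + 0.5 * x + rnorm(100, sd = 0.3))   # multiplicative errors: log(y) is linear in x

fit_raw <- lm(y ~ x)             # residual spread grows with x
fit_log <- lm(log(y) ~ x)        # remedy 1: transform Y

## remedy 2: weighted least squares, assuming Var(eps_i) proportional to x_i^2
fit_wls <- lm(y ~ x, weights = 1 / x^2)

par(mfrow = c(1, 2))
plot(x, residuals(fit_raw), main = "Raw fit");    abline(h = 0, lty = 2)
plot(x, residuals(fit_log), main = "log(Y) fit"); abline(h = 0, lty = 2)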

Page 21: Simple Linear Regression IV Lecture 4


Extrapolation in SLR

[Figure: simulated data (x from −4 to 4) with the true relationship and the SLR fit overlaid. Legend: True, SLR Fit.]

Extrapolation beyond the range of the given data can lead to seriously biased estimates if the assumed relationship does not hold in the region of extrapolation
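A sketch in the spirit of this figure on simulated data (the quadratic true curve and the x ranges below are assumptions, not the lecture's simulation): the straight-line fit tracks the data reasonably well inside the observed x range, but its predictions are badly biased outside it when the true relationship is nonlinear.

## Illustrative simulation: true relationship is nonlinear
set.seed(8020)
x <- runif(200, -2, 2)                        # observed range of the predictor
y <- x^2 / 4 + rnorm(200, sd = 0.1)           # assumed true (quadratic) relationship

fit <- lm(y ~ x)                              # SLR fit within the observed range

x_new <- c(-4, 4)                             # points outside the observed range
data.frame(x = x_new,
           slr_prediction = predict(fit, newdata = data.frame(x = x_new)),
           true_mean      = x_new^2 / 4)      # SLR predictions miss the true mean here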

Page 22: Simple Linear Regression IV Lecture 4


Summary of SLR

Model: Yi = β0 + β1Xi + εi

Estimation: Use the method of least squares to estimate the parameters

Inference

Hypothesis Testing

Confidence/prediction Intervals

ANOVA

Model Diagnostics and Remedies
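As a closing sketch tying these pieces together in R (again using the illustrative age / max_hr vectors from the first sketch; the age value 40 below is an arbitrary example):

## Assumes the illustrative age / max_hr vectors defined earlier
fit <- lm(max_hr ~ age)

summary(fit)                       # estimates, t tests, R^2
confint(fit, level = 0.95)         # confidence intervals for beta_0 and beta_1
anova(fit)                         # ANOVA table and F test

new_age <- data.frame(age = 40)
predict(fit, new_age, interval = "confidence")   # CI for the mean response at age 40
predict(fit, new_age, interval = "prediction")   # PI for a new observation at age 40

plot(age, residuals(fit)); abline(h = 0, lty = 2)   # quick diagnostic check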