Chapter 11: SIMPLE LINEAR REGRESSION (SLR) AND CORRELATION
Part 4: Adequacy of the regression model, Correlation (Sections 11-7 and 11-8)
• Is a linear model the correct model? (Is simple linear regression complex enough to capture the relationship between X and Y?)
• Are the assumptions we’re making for our model reasonable, or are they violated?
• To answer these questions, we will use the residuals of the model.
The residual for observation i (observed value minus fitted value):
$e_i = y_i - \hat{y}_i$
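As a minimal sketch of how the residuals could be computed, the following fits a least-squares line and subtracts the fitted values from the observed ones. The helper names `fit_line` and `residuals` and the toy data are hypothetical, made up for illustration; they do not come from the course data sets.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def residuals(x, y):
    """Residuals e_i = y_i - yhat_i from the fitted straight line."""
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    return [yi - fi for yi, fi in zip(y, fitted)]


# Hypothetical toy data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
e = residuals(x, y)
print([round(ei, 3) for ei in e])
```

With an intercept in the model, least-squares residuals sum to zero, which is why the horizontal reference line in a residual plot sits at 0.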
Residuals are informative
• Plot residuals vs. the explanatory variable (or vs. the fitted values $\hat{y}$ in SLR).
• If the plot is a random scatter of points above and below the horizontal reference line, then the linear model is reasonable and adequate.
• If not (i.e., if there is a pattern in the residual plot), then there may be issues with our linearity assumption or perhaps other assumptions in our model.
• Example showing inadequacy: Kentucky Derby data set on year of race and speed of horse.
[Figure: Kentucky Derby: Horse speed vs. Year. Scatterplot of Speed (MPH), axis from 32 to 37, against Year, axis from 1880 to 2000.]
The form of the scatterplot looks a bit non-linear, but we’ll go ahead and fit a straight-line model first to get the following residual plot...
Residual Plot of ‘residuals vs. fitted values’
• Residuals have a bit of a pattern (e.g., below the line, above the line, below the line), not randomly scattered around the horizontal line.
• Linear form may not be reasonable or adequate.
⇒ Quadratic may fit better.
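The below/above/below pattern can be reproduced on made-up data. The sketch below fits a straight line to a noise-free concave curve and prints the sign of each residual in order of x. The curve and the helper `fit_line` are hypothetical illustrations, not the Kentucky Derby data.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for the line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: a concave (quadratic) relationship with no noise.
x = list(range(11))
y = [10 * xi - 0.5 * xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# One character per observation: "+" above the reference line, "-" below.
signs = "".join("+" if ei > 0 else "-" for ei in e)
print(signs)  # --+++++++--
```

A run of negative residuals, then positive, then negative again is the kind of systematic pattern that suggests adding a quadratic term; a random mix of signs would not.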
Beyond Adequacy
• Besides checking that our model fits the general (linear) relationship between X and Y, we also need to consider the assumptions we made in our model.
• The basic model
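$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, with $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
where $\beta_0 + \beta_1 x_i$ is the linear relationship and $\varepsilon_i$ is the random error term.
– Constant variance of errors (only one $\sigma^2$ for all errors)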
– Normality of errors
– Independence of errors
Constant Variance Assumption
• We’ll check this assumption by plotting the residuals vs. the fitted values (or vs. the explanatory variable in SLR).
• Look for a constant ‘spread’ above and below the horizontal reference line.
[Figure: Residuals vs. Fitted. Residuals, axis from −0.15 to 0.15, plotted against fitted values, axis from 0.5 to 2.0.]
• NOTE: This same residual plot was also used to check linearity.
• Constant variance and adequacy are both checked with the same residual plot in SLR.
• Plot residuals against x or against the fitted values $\hat{y}$.
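As an informal numeric companion to the residual plot (not a formal test, and not something the slides prescribe), one could compare the spread of the residuals for small and large fitted values. The helper `spread_by_half` and both residual lists below are hypothetical.

```python
import statistics


def spread_by_half(fitted, resid):
    """Standard deviation of residuals in the lower and upper half of fitted values."""
    pairs = sorted(zip(fitted, resid))  # order observations by fitted value
    mid = len(pairs) // 2
    low = [r for _, r in pairs[:mid]]
    high = [r for _, r in pairs[mid:]]
    return statistics.pstdev(low), statistics.pstdev(high)


# Hypothetical fitted values and two made-up sets of residuals.
fitted = [0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7, 1.9]
resid_even = [0.05, -0.04, 0.06, -0.05, 0.04, -0.06, 0.05, -0.05]  # constant spread
resid_fan = [0.01, -0.01, 0.02, -0.03, 0.08, -0.10, 0.12, -0.15]   # spread grows

print(spread_by_half(fitted, resid_even))
print(spread_by_half(fitted, resid_fan))
```

Similar spreads in the two halves are consistent with a single $\sigma^2$; a much larger spread in one half corresponds to the "fan" shape that signals non-constant variance in the plot.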
Normality Assumption
• Use a normal probability plot of the residuals to check normality of errors (see Section 6-6 for non-normal patterns like those below).
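The coordinates of a normal probability plot can be sketched as follows: sort the residuals and pair each with a standard normal quantile. The plotting position $(i - 0.5)/n$ used here is one common convention and an assumption on my part; the textbook or software may use a slightly different one. The function name `normal_plot_points` and the sample residuals are hypothetical.

```python
from statistics import NormalDist


def normal_plot_points(resid):
    """Pairs (theoretical normal quantile, ordered residual) for a normal probability plot."""
    n = len(resid)
    ordered = sorted(resid)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # ASSUMPTION: plotting positions (i - 0.5) / n for i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Hypothetical residuals, for illustration only.
pts = normal_plot_points([0.3, -0.1, 0.0, 0.2, -0.4])
for q, r in pts:
    print(round(q, 3), r)
```

If the errors are roughly normal, these points fall close to a straight line; strong curvature or isolated points far from the line suggest non-normality.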
Independence Assumption
• Verify that the observations are independent.
• Check how the data were collected (talk to the researcher or client).
• If the data were collected over time, plot the residuals against time to make sure there isn’t a dependence (or trend) across time.
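Alongside the residuals-vs-time plot, a rough numeric check is the least-squares slope of the residuals against collection order. This is only a sketch under the assumption that observations are listed in the order they were collected; `trend_slope` and both residual sequences are hypothetical, and the plot remains the primary tool.

```python
def trend_slope(resid):
    """Least-squares slope of residuals against collection order 1..n."""
    n = len(resid)
    t = list(range(1, n + 1))
    t_bar = sum(t) / n
    e_bar = sum(resid) / n
    stt = sum((ti - t_bar) ** 2 for ti in t)
    ste = sum((ti - t_bar) * (ei - e_bar) for ti, ei in zip(t, resid))
    return ste / stt


# Hypothetical residuals listed in time order.
drifting = [-0.30, -0.22, -0.08, 0.01, 0.09, 0.21, 0.29]      # steady upward drift
patternless = [0.10, -0.20, 0.15, 0.00, -0.15, 0.20, -0.10]   # no visible trend

print(round(trend_slope(drifting), 4))
print(round(trend_slope(patternless), 4))
```

A slope near zero is consistent with no trend over time. It does not rule out other kinds of dependence, such as neighboring residuals tending to share a sign, so the plot should still be inspected.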
• Predictions and Extrapolation
– We can use our fitted model to make predictions.
– e.g., What is the expected longevity of a fruitfly with a thorax of length 0.80 mm?
[Figure: Scatterplot of fruitfly Longevity (days), axis from 20 to 100, against Thorax length (mm), axis from 0.65 to 0.95, with the fitted line $\hat{Y} = -61.05 + 144.33x$.]
Prediction:
$\hat{Y}_{x=0.80} = -61.05 + 144.33(0.80) = 54.414$ days
– If we try to predict Y outside of the range of observed x-values, we are using the model to extrapolate (predict outside the range of the observed data).
– You should be very careful when using extrapolation. In general it should be avoided, as we don’t have a feel for what is going on outside the observed range.
– Predicting Y for x = 1.50 mm (which is not a value near the observed x-values) would be an extrapolation in this fruitfly example.
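The prediction and the extrapolation warning can be combined in a small sketch. The coefficients are the fitted values from the fruitfly example; the observed range of 0.65 to 0.95 mm is an assumption read off the plot's axis ticks, not the exact minimum and maximum of the data, and the function name `predict_longevity` is hypothetical.

```python
B0, B1 = -61.05, 144.33    # fitted intercept and slope from the fruitfly example
X_MIN, X_MAX = 0.65, 0.95  # ASSUMPTION: approximate observed thorax range (mm) from the plot axis


def predict_longevity(thorax_mm):
    """Return (predicted longevity in days, True if x is outside the observed range)."""
    is_extrapolation = not (X_MIN <= thorax_mm <= X_MAX)
    return B0 + B1 * thorax_mm, is_extrapolation


for x in (0.80, 1.50):
    pred, flag = predict_longevity(x)
    note = "extrapolation, treat with caution" if flag else "within observed range"
    print(f"x = {x:.2f} mm: predicted longevity {pred:.3f} days ({note})")
```

For x = 0.80 mm this reproduces the 54.414 days computed above. For x = 1.50 mm the formula still returns a number, but the flag marks it as an extrapolation that the data cannot support.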