Diagnostics – Part I Using plots to check to see if the assumptions we made about the model are realistic




Diagnostic methods

• Some simple (but subjective) plots. (Now)

• Formal statistical tests. (Next)


Review of some simple plots …

while checking scope of model


Dot Plot

[Figure: dot plot of fastest ever driving speed (horizontal axis: Speed, 70 to 160), shown separately for 126 women and 100 men; 226 Stat 100 students, Fall '98.]


Dot Plot

• Summarizes quantitative data.

• Horizontal axis represents measurement scale.

• Plot one dot for each data point.


Stem-and-Leaf Plot

Stem-and-leaf of Shoes   N = 139   Leaf Unit = 1.0

   12   0  223334444444
   63   0  555555555555566666666677777778888888888888999999999
  (33)  1  000000000000011112222233333333444
   43   1  555555556667777888
   25   2  0000000000023
   12   2  5557
    8   3  0023
    4   3
    4   4  00
    2   4
    2   5  0
    1   5
    1   6
    1   6
    1   7
    1   7  5


Stem-and-Leaf Plot

• Summarizes quantitative data.

• Each data point is broken down into a “stem” and a “leaf.”

• First, “stems” are aligned in a column.

• Then, “leaves” are attached to the stems.


Box Plot

[Figure: box plot of the amount of sleep in the past 24 hours for Spring 1998 Stat 250 students (vertical axis: Hours of sleep, 0 to 10).]


Box Plot

• Summarizes quantitative data.

• Vertical (or horizontal) axis represents measurement scale.

• Lines in box represent the 25th percentile (“first quartile”), the 50th percentile (“median”), and the 75th percentile (“third quartile”), respectively.


Box Plot (cont’d)

• “Whiskers” are drawn to the most extreme data points that are not more than 1.5 times the length of the box beyond either quartile.
  – Whiskers are useful for identifying outliers.

• “Outliers,” or extreme observations, are denoted by asterisks.
  – Generally, data points falling beyond the whiskers are considered outliers.
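
The whisker rule can be checked numerically. The following is a minimal sketch, assuming hypothetical sleep data (the slides do not list the Stat 250 values) and Python's `statistics.quantiles` with its default method; other software may compute quartiles slightly differently.

```python
import statistics

# Hypothetical hours of sleep, for illustration only.
hours = [4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 13]

q1, median, q3 = statistics.quantiles(hours, n=4)
box_length = q3 - q1  # length of the box (interquartile range)

# Whiskers may reach at most 1.5 box lengths beyond each quartile.
lower_limit = q1 - 1.5 * box_length
upper_limit = q3 + 1.5 * box_length

inside = [h for h in hours if lower_limit <= h <= upper_limit]
whiskers = (min(inside), max(inside))  # most extreme points within the limits
outliers = [h for h in hours if h < lower_limit or h > upper_limit]

print("quartiles:", q1, median, q3)
print("whiskers end at:", whiskers)
print("outliers:", outliers)
```

In this made-up sample only the 13-hour value falls beyond the whiskers, so it is the one point that would be drawn as an asterisk.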


Okay, now the really new stuff…


Simple linear regression model

The response $Y_i$ is a function of a systematic linear component and a random error component:

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$

with assumptions that:

• Error terms have mean 0, i.e., $E(\epsilon_i) = 0$.

• Error terms $\epsilon_i$ and $\epsilon_j$ are uncorrelated (independent).

• Error terms have the same variance, i.e., $\mathrm{Var}(\epsilon_i) = \sigma^2$.

• Error terms $\epsilon_i$ are normally distributed.


Why should we keep nagging ourselves about the model?

• All of the estimates, confidence intervals, prediction intervals, hypothesis tests, etc. have been developed assuming that the model is correct.

• If the model is incorrect, then the formulas and methods we use are at risk of being incorrect. (Some are more forgiving than others.)


Things that can go wrong with the model

• Regression function is not linear.

• Error terms do not have constant variance.

• Error terms are not independent.

• The model fits all but one or a few outlier observations.

• Error terms are not normally distributed.

• Important predictor variable(s) has been left out of the model.


Residual analysis … the basic idea

We would think the observed residuals:

$$e_i = Y_i - \hat{Y}_i$$

would reflect the properties assumed for the unknown true error terms:

$$\epsilon_i = Y_i - E(Y_i)$$

So, investigate the observed residuals to see if they behave “properly.”


Some points of clarification about residuals

• The mean of the residuals, e-bar, is 0. So, no need to check that the mean of the residuals is 0 – the LS estimation method has made it so.

• The residuals are not independent, since they are all a function of the same estimated regression function.
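
Both points follow from the least-squares normal equations. A short derivation, writing $b_0$ and $b_1$ for the least-squares estimates (notation assumed here; the slides do not name them):

```latex
\begin{align*}
Q(b_0, b_1) &= \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right)^2 \\
\frac{\partial Q}{\partial b_0} = 0
  &\;\Rightarrow\; \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_i \right) = \sum_{i=1}^{n} e_i = 0
  \;\Rightarrow\; \bar{e} = 0 \\
\frac{\partial Q}{\partial b_1} = 0
  &\;\Rightarrow\; \sum_{i=1}^{n} X_i \left( Y_i - b_0 - b_1 X_i \right) = \sum_{i=1}^{n} X_i e_i = 0
\end{align*}
```

The $n$ residuals therefore satisfy two linear constraints, which is why they cannot be independent. When $n$ is large relative to the two estimated parameters, the dependence is usually small enough to ignore in the plots.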


Example: Alcohol consumption (X) and Arm muscle strength (Y)

[Figure: regression plot of strength versus alcohol with fitted line strength = 26.3695 - 0.295868 alcohol; S = 3.87372, R-Sq = 41.2%, R-Sq(adj) = 39.9%.]


A well-behaved “residuals vs. fits” plot

[Figure: residuals versus the fitted values (response is strength); horizontal axis: Fitted Value, vertical axis: Residual.]


Characteristics of a well-behaved “residual versus fits” plot

• The residuals “bounce randomly” around the 0 line. (Linear is reasonable).

• No one residual “stands out” from the basic random pattern of residuals. (No outliers).

• The residuals roughly form a “horizontal band” around 0 line. (Constant variance).


“Residuals versus predictor” plot offers nothing different.

[Figure: residuals versus alcohol (response is strength); horizontal axis: alcohol, vertical axis: Residual.]


Example: Is tire tread wear linearly related to mileage?

mileage   groove
    0     394.33
    4     329.50
    8     291.00
   12     255.17
   16     229.33
   20     204.83
   24     179.00
   28     163.83
   32     150.33

X = mileage in 1000 miles

Y = groove depth in mils (0.001 inches)


Example: Is tire tread wear linearly related to mileage?

[Figure: regression plot of groove versus mileage with fitted line groove = 360.637 - 7.28062 mileage; S = 19.0170, R-Sq = 95.3%, R-Sq(adj) = 94.6%.]


A “residual versus fits” plot suggesting relationship is not linear

[Figure: residuals versus the fitted values (response is groove); horizontal axis: Fitted Value, vertical axis: Residual.]


How a non-linear function shows up on a “residual versus fits” plot

• The residuals depart from 0 in a systematic fashion, such as being positive for small X values, negative for medium X values, and positive again for large X values.
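
The tire tread example shows this pattern numerically. The sketch below refits the line to the mileage and groove data from the slides with a small hand-written least-squares helper (`fit_line` is a name introduced here, not a Minitab function) and prints the sign of each residual in order of mileage.

```python
# Tire tread data from the slides: mileage (1000 miles), groove depth (mils).
mileage = [0, 4, 8, 12, 16, 20, 24, 28, 32]
groove = [394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33]

def fit_line(x, y):
    """Return the least-squares intercept and slope for y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

b0, b1 = fit_line(mileage, groove)
residuals = [y - (b0 + b1 * x) for x, y in zip(mileage, groove)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"groove = {b0:.3f} + ({b1:.5f}) mileage")
print("residual signs by increasing mileage:", signs)
```

The residuals are positive at the lowest mileage, negative through the middle of the range, and positive again at the highest mileages, even though R-Sq is above 95%.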


Example: How is plutonium activity related to alpha particle counts?

[Figure: regression plot of alpha versus plutonium with fitted line alpha = 0.0070331 + 0.0055370 plutonium; S = 0.0125713, R-Sq = 91.6%, R-Sq(adj) = 91.2%.]


A residual versus fits plot suggesting non-constant error variance

[Figure: residuals versus the fitted values (response is alpha); horizontal axis: Fitted Value, vertical axis: Residual.]


How non-constant error variance shows up on a “residual vs. fits” plot

• The plot has a “fanning” effect, such as the residuals being close to 0 for small X values and being much more spread out for large X values.

• The “fanning” effect can also be in the reverse direction.

• Or, the spread of the residuals can vary in some complex fashion.
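
A rough numeric companion to the plot is to compare the spread of the residuals at low and high fitted values. The sketch below uses simulated data, since the slides do not list the plutonium measurements; the model, the seed, and the helper `fit_line` are all assumptions made for illustration.

```python
import random
import statistics

def fit_line(x, y):
    """Return the least-squares intercept and slope for y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

rng = random.Random(1)  # fixed seed so the run is repeatable
x = [i / 10 for i in range(1, 201)]
# Hypothetical model: the error standard deviation grows in proportion to x.
y = [1.0 + 0.5 * xi + rng.gauss(0, 0.1 * xi) for xi in x]

b0, b1 = fit_line(x, y)
# Pair each fitted value with its residual and sort by fitted value.
pairs = sorted((b0 + b1 * xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))
resid_by_fit = [e for _, e in pairs]

half = len(resid_by_fit) // 2
sd_low = statistics.pstdev(resid_by_fit[:half])   # spread at small fits
sd_high = statistics.pstdev(resid_by_fit[half:])  # spread at large fits
ratio = sd_high / sd_low

print(f"residual SD, low fits:  {sd_low:.3f}")
print(f"residual SD, high fits: {sd_high:.3f}")
print(f"ratio: {ratio:.2f}")
```

A ratio well above 1 corresponds to the fan opening to the right; a ratio well below 1 corresponds to the reverse fan. This is only a summary of what the plot shows, not a formal test.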


Example: Relationship between tobacco use and alcohol use?

Region             Alcohol   Tobacco
North                6.47      4.03
Yorkshire            6.13      3.76
Northeast            6.19      3.77
East Midlands        4.89      3.34
West Midlands        5.63      3.47
East Anglia          4.52      2.92
Southeast            5.89      3.20
Southwest            4.79      2.71
Wales                5.27      3.53
Scotland             6.08      4.51
Northern Ireland     4.02      4.56

• Family Expenditure Survey of British Dept. of Employment

• X = average weekly expenditure on tobacco

• Y = average weekly expenditure on alcohol


Example: Relationship between tobacco use and alcohol use?

[Figure: regression plot of Alcohol versus Tobacco with fitted line Alcohol = 4.35117 + 0.301938 Tobacco; S = 0.819630, R-Sq = 5.0%, R-Sq(adj) = 0.0%.]


A “residual versus fits” plot suggesting an outlier exists.

[Figure: residuals versus the fitted values (response is Alcohol); horizontal axis: Fitted Value, vertical axis: Residual; one point is labeled “outlier.”]


How large does a residual need to be before being flagged?

• The magnitude of the residuals depends on the units of the response variable.

• Make the residuals “unitless” by dividing by their standard deviation. That is, use “standardized residuals.”

• Then, an observation with a standardized residual greater than 2 or smaller than -2 should be flagged for further investigation.


Standardized residuals versus fits plot

[Figure: standardized residuals versus the fitted values (response is Alcohol); horizontal axis: Fitted Value, vertical axis: Standardized Residual.]


Minitab identifies observations with large standardized residuals …

Unusual Observations

Obs   Tobacco   Alcohol     Fit   SE Fit    Resid   St Resid
 11      4.56     4.020   5.728    0.482   -1.708     -2.58R

R denotes an observation with a large standardized residual.
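
The Minitab output can be reproduced by hand. The sketch below assumes that a standardized residual is the residual divided by its estimated standard deviation, s·sqrt(1 − h_i), where h_i is the leverage of observation i; that formula is not stated on the slides, but it is consistent with the St Resid value reported for observation 11.

```python
import math

# Tobacco (X) and alcohol (Y) data from the slides, in the order listed.
tobacco = [4.03, 3.76, 3.77, 3.34, 3.47, 2.92, 3.20, 2.71, 3.53, 4.51, 4.56]
alcohol = [6.47, 6.13, 6.19, 4.89, 5.63, 4.52, 5.89, 4.79, 5.27, 6.08, 4.02]

n = len(tobacco)
xbar, ybar = sum(tobacco) / n, sum(alcohol) / n
sxx = sum((x - xbar) ** 2 for x in tobacco)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(tobacco, alcohol))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

resid = [y - (b0 + b1 * x) for x, y in zip(tobacco, alcohol)]
s = math.sqrt(sum(e * e for e in resid) / (n - 2))  # Minitab's "S"

# Leverage of each observation in simple linear regression.
leverage = [1 / n + (x - xbar) ** 2 / sxx for x in tobacco]
std_resid = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]

# Flag observations (numbered from 1) beyond the +/-2 guideline.
flagged = [i + 1 for i, r in enumerate(std_resid) if abs(r) > 2]

print(f"S = {s:.4f}")
print("flagged observations:", flagged)
print(f"standardized residual for obs 11: {std_resid[10]:.2f}")
```

Only observation 11 (Northern Ireland) exceeds the guideline, matching the single row in the Unusual Observations table.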


Anscombe data set #3

[Figure: regression plot of y3 versus x3 with fitted line y3 = 3.00245 + 0.499727 x3; S = 1.23631, R-Sq = 66.6%, R-Sq(adj) = 62.9%.]


A “residual versus fits” plot suggesting an outlier exists.

[Figure: residuals versus the fitted values (response is y3); horizontal axis: Fitted Value, vertical axis: Residual.]


How an outlier shows up on a “residuals vs. fits” plot

• The observation’s residual stands apart from the basic random pattern of the rest of the residuals.

• The random pattern of the residual plot can even disappear if one outlier really deviates from the straight line of the rest of the data.
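
Anscombe data set #3 illustrates the second point. The slides show only the plots, so the values below are taken from Anscombe's published third data set (an addition here, not part of the slides); `fit_line` is again a small helper introduced for illustration.

```python
# Anscombe's third data set (values not listed in the slides).
x3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

def fit_line(x, y):
    """Return the least-squares intercept and slope for y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Fit to all 11 points and locate the largest residual.
b0, b1 = fit_line(x3, y3)
resid = [y - (b0 + b1 * x) for x, y in zip(x3, y3)]
worst = max(range(len(resid)), key=lambda i: abs(resid[i]))

# Refit without that point to see how much it pulled the line.
x_rest = [x for i, x in enumerate(x3) if i != worst]
y_rest = [y for i, y in enumerate(y3) if i != worst]
c0, c1 = fit_line(x_rest, y_rest)
max_rest = max(abs(y - (c0 + c1 * x)) for x, y in zip(x_rest, y_rest))

print(f"all points:      y3 = {b0:.5f} + {b1:.6f} x3")
print(f"largest residual at x3 = {x3[worst]}, y3 = {y3[worst]}")
print(f"without outlier: y3 = {c0:.4f} + {c1:.4f} x3")
print(f"largest remaining |residual|: {max_rest:.4f}")
```

With the outlier removed the remaining ten points lie almost exactly on a line. In the full fit those same points therefore trace a systematic trend in the residual plot rather than a random scatter.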


Other simple plots that might help spot an outlier

• Boxplots

• Stem-n-leaf plots

• Dotplots


Boxplot of residuals for Alcohol (Y) and Tobacco (X) example

[Figure: box plot of the residuals; horizontal axis: Residuals, -2 to 1.]


Dotplot of residuals for Alcohol (Y) and Tobacco (X) example

[Figure: dot plot of the residuals (RESI1); horizontal axis: Residuals, -1.5 to 0.5.]


“Residuals vs. order” plots to assess non-independence of error terms

• If the data are obtained in a time (or space) sequence, a “residuals vs. order” plot helps to see if there is any correlation between error terms that are near each other in the sequence.

• A horizontal band bouncing randomly around 0 suggests errors are independent, while a systematic pattern suggests not.
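
One way to put a number on "systematic pattern" is the correlation between each residual and the one before it in the sequence. The sketch below uses two made-up residual sequences and a helper named `lag1_autocorr` (both assumptions for illustration); it is a descriptive aid, not one of the formal tests.

```python
import math

def lag1_autocorr(resid):
    """Correlation between consecutive residuals, in collection order."""
    mean = sum(resid) / len(resid)
    d = [e - mean for e in resid]
    num = sum(d[t] * d[t - 1] for t in range(1, len(d)))
    den = sum(v * v for v in d)
    return num / den

n = 40
# Hypothetical residuals that drift slowly up and down over time.
drifting = [math.sin(2 * math.pi * t / n) for t in range(n)]
# Hypothetical residuals that flip sign at every step.
alternating = [(-1) ** t for t in range(n)]

r_drift = lag1_autocorr(drifting)
r_alt = lag1_autocorr(alternating)

print(f"drifting residuals:    lag-1 correlation = {r_drift:.3f}")
print(f"alternating residuals: lag-1 correlation = {r_alt:.3f}")
```

Values near +1 correspond to long runs above and below 0 on the plot, values near -1 to a zigzag, and values near 0 to the random bouncing expected of independent errors.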


“Residuals vs order” plots suggesting non-independence of error terms


Normal probability plot to assess normality of error terms

• Plot of residuals on horizontal axis against expected values of the residuals under normality (normal scores) on vertical axis.

• Plot that is nearly linear suggests normality of error terms.
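
The normal scores can be computed directly. The sketch below assumes Blom's plotting positions, (i − 0.375)/(n + 0.25), which may differ slightly from the convention Minitab uses, and measures "nearly linear" by the correlation between the sorted residuals and their normal scores. It needs Python 3.8 or later for `statistics.NormalDist`; the two samples are made up for illustration.

```python
import math
from statistics import NormalDist

def normal_scores(n):
    """Expected standard normal order statistics (Blom approximation)."""
    return [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
            for i in range(1, n + 1)]

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    ubar, vbar = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def qq_correlation(resid):
    """Correlation between sorted residuals and their normal scores."""
    return pearson(sorted(resid), normal_scores(len(resid)))

# Hypothetical residuals: one symmetric sample, one skewed to the right.
symmetric = normal_scores(25)
skewed_right = [math.exp(z) for z in symmetric]

r_sym = qq_correlation(symmetric)
r_skew = qq_correlation(skewed_right)

print(f"symmetric sample:    r = {r_sym:.3f}")
print(f"right-skewed sample: r = {r_skew:.3f}")
```

A correlation close to 1 corresponds to a nearly straight plot; the skewed sample bends away from the line and gives a visibly smaller value.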


Normal probability plot interpretation

[Figure: three normal probability plots, labeled “skewed right,” “skewed left,” and “normal.”]


Normal probability plot for Alcohol (X) and Strength (Y) example

[Figure: normal probability plot of the residuals (response is strength); horizontal axis: Residual, vertical axis: Normal Score.]


Normal probability plot for Tree diameter (X) and C-dating Age (Y)

[Figure: normal probability plot of the residuals (response is Age); horizontal axis: Residual, vertical axis: Normal Score.]


“Residuals vs omitted predictors” plots

• Used to determine whether any other key variables could provide additional power for predicting the response.

• Look for systematic patterns.

• If the plot reveals that the residuals vary systematically, we don’t say the original model is wrong. It’s just that it can be improved.
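
A small made-up example shows what a systematic pattern against an omitted predictor looks like in numbers. In the sketch below the response depends on two predictors but only the first is fitted; all data and helper names are hypothetical.

```python
import math

def fit_line(x, y):
    """Return the least-squares intercept and slope for y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    ubar, vbar = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical data: the response depends on x1 and on a 0/1 variable x2.
x1 = list(range(20))
x2 = [i % 2 for i in range(20)]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

# Fit the model that leaves x2 out, then compare its residuals with x2.
b0, b1 = fit_line(x1, y)
resid = [yi - (b0 + b1 * a) for a, yi in zip(x1, y)]
r_omitted = pearson(resid, x2)

print(f"fitted: y = {b0:.3f} + {b1:.3f} x1")
print(f"correlation of residuals with omitted x2: {r_omitted:.3f}")
```

The fit on x1 alone is still usable, but the residuals split cleanly by the value of x2, which is the signal that adding x2 would improve the model.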


“Residuals vs omitted” plot


In summary, …


Nonlinearity of regression function

• Scatter plot of response versus predictor

• (Standardized) residuals versus fits plot

• (Standardized) residuals versus predictor plot


Nonconstancy of error variance

• (Standardized) residuals versus fits plot

• (Standardized) residuals versus predictor plot


Presence of outliers

• (Standardized) residuals versus fits plot

• (Standardized) residuals versus predictor plot

• Box plots, stem-n-leaf plots, dot plots of (standardized) residuals


Non-independence of error terms

• (Standardized) residuals versus order plot


Non-normality of error terms

• Normal probability plots

• Box plots, dotplots, stem-n-leaf plots

• Mean far from median?


Residual vs … plots in Minitab

• Stat >> Regression >> Regression.

• Specify predictor and response.

• Under Graphs…, specify whether regular or standardized residuals are desired. Select which residual plots are desired. If a residual versus predictor plot is desired, specify the predictor in the box.

• Select OK. Select OK.


Boxplots, dotplots, etc. of residuals

• Stat >> Regression >> Regression …

• Specify predictor and response.

• Under Storage…, select residuals and/or standardized residuals. They will be stored in worksheet. Then …

• Graph >> Boxplot… or Graph >> Dotplot… or Graph >> Stemleaf…