
Chapter 8: Linear Regression

© 2010 Pearson Education


8.1 The Linear Model

Example: Consider a scatterplot of monthly computer usage for Best Buy versus the number of stores. How can the company predict computer usage if it decides to expand the number of stores?


The points don't all line up, but a straight line can summarize the general pattern. We call this line a linear model, and it can be used to predict computer usage for a larger number of stores.


Residuals

A linear model can be written in the form $\hat{y} = b_0 + b_1 x$, where $b_0$ and $b_1$ are numbers estimated from the data and $\hat{y}$ is the predicted value.

The difference between the observed value $y$ and the predicted value $\hat{y}$ is called the residual and is denoted $e$:

$$e = y - \hat{y}$$


For 301 stores, the computer usage model predicts 262.2 MIPS (Millions of Instructions Per Second), while the actual value is 218.9 MIPS. We may compute the residual for 301 stores:

$$e = y - \hat{y} = 218.9 - 262.2 = -43.3 \text{ MIPS}$$
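As a quick arithmetic check, here is a minimal Python sketch of this residual computation (the variable names are ours, not from the text):

```python
# Residual for the 301-store prediction: e = y - y_hat
y_actual = 218.9      # observed Monthly Use, in MIPS
y_predicted = 262.2   # value the linear model predicts, in MIPS

residual = y_actual - y_predicted
print(round(residual, 1))  # -43.3
```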


The Line of “Best Fit”

Some residuals will be positive and some negative, so adding up all the residuals is not a good assessment of how well the line fits the data.

If we consider the sum of the squares of the residuals, then the smaller the sum, the better the fit.

The line of best fit is the line for which the sum of the squared residuals is smallest – often called the least squares line.
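To make "smallest sum of squared residuals" concrete, here is a short sketch with made-up data (not the Best Buy numbers): it compares the least squares line found by numpy against a nearby candidate line and confirms that the candidate's sum of squared residuals is larger.

```python
import numpy as np

# Toy data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sum_sq_residuals(b0, b1):
    """Sum of squared residuals for the line y_hat = b0 + b1*x."""
    e = y - (b0 + b1 * x)
    return np.sum(e ** 2)

# np.polyfit with degree 1 returns [slope, intercept] of the least squares line.
b1, b0 = np.polyfit(x, y, 1)

print(sum_sq_residuals(b0, b1))        # the smallest achievable sum
print(sum_sq_residuals(b0, b1 + 0.1))  # any other line does worse
```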


8.2 Correlation and the Line

Straight lines can be written as

$$y = b_0 + b_1 x.$$

The scatterplot of real data won't fall exactly on a line, so we denote the model of predicted values by the equation

$$\hat{y} = b_0 + b_1 x.$$

The "hat" on the y is used to indicate a predicted value.


For the Best Buy data, the fitted line has the equation

$$\hat{y} = -833.4 + 3.64 \times Stores$$

A slope of 3.64 says that each additional store is associated with an additional 3.64 MIPS, on average.

An intercept of –833.4 is the value of the line when the x-variable (Stores) is zero. The intercept is interpreted only when x = 0 has a physical meaning.


We can find the slope of the least squares line using the correlation and the standard deviations:

$$b_1 = r\,\frac{s_y}{s_x}$$

The slope gets its sign from the correlation. If the correlation is positive, the scatterplot runs from lower left to upper right and the slope of the line is positive.

The slope gets its units from the ratio of the two standard deviations, so the units of the slope are a ratio of the units of the variables.
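A sketch of the slope formula, reusing the made-up data from above (the ratio of standard deviations is unaffected by the ddof choice, so numpy's defaults are fine):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]        # correlation between x and y
b1 = r * np.std(y) / np.std(x)     # slope: b1 = r * s_y / s_x

# Matches the slope numpy finds by least squares.
print(b1, np.polyfit(x, y, 1)[0])
```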


To find the intercept of our line, we use the means. If our line estimates the data, then it should predict $\bar{y}$ for the x-value $\bar{x}$. Thus we get the following relationship from our line:

$$\bar{y} = b_0 + b_1 \bar{x}$$

Solving this equation for the intercept gives

$$b_0 = \bar{y} - b_1 \bar{x}$$
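Continuing the sketch, the intercept comes from the means, and we can check that the fitted line passes through the point of means $(\bar{x}, \bar{y})$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
b1 = r * np.std(y) / np.std(x)       # slope
b0 = np.mean(y) - b1 * np.mean(x)    # intercept: b0 = y_bar - b1 * x_bar

# The least squares line passes through (x_bar, y_bar).
print(b0 + b1 * np.mean(x), np.mean(y))  # the two values agree
```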


Least squares lines are commonly called regression lines. We'll need to check the same conditions for regression as we did for correlation:

1) Quantitative Variables Condition
2) Linearity Condition
3) Outlier Condition


Getting from Correlation to the Line

If we consider finding the least squares line for the standardized variables $z_x$ and $z_y$, the formula for the slope can be simplified. Standardized variables have standard deviation 1, so

$$b_1 = r\,\frac{s_{z_y}}{s_{z_x}} = r\cdot\frac{1}{1} = r$$

The intercept formula can be rewritten as well. Standardized variables have mean 0, so

$$b_0 = \bar{z}_y - b_1\,\bar{z}_x = 0 - r\cdot 0 = 0$$


From the values of the slope and intercept for the standardized variables, we may rewrite the regression equation:

$$\hat{z}_y = r\,z_x$$

From this we see that for an observation 1 SD above the mean in x, you'd expect y to have a z-score of r.
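A sketch confirming this with the same made-up data: after standardizing both variables, the fitted slope is the correlation and the intercept is (numerically) zero.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Standardize: subtract the mean, divide by the standard deviation.
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

slope, intercept = np.polyfit(zx, zy, 1)
r = np.corrcoef(x, y)[0, 1]

print(slope, r)    # the slope equals the correlation
print(intercept)   # essentially 0 (up to floating-point error)
```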


For the Best Buy data, the correlation between Monthly Use and Stores is 0.979, so we can express the relationship for the standardized variables:

$$\hat{z}_{Monthly\ Use} = 0.979\,z_{Stores}$$

So, for every SD the value of Stores is above (or below) its mean, we predict that the corresponding value of Monthly Use is 0.979 SD above (or below) its mean.


8.3 Regression to the Mean

The equation below shows that if x is 2 SDs above its mean, the predicted y won't ever be more than 2 SDs above its mean, since r can't be bigger than 1:

$$\hat{z}_y = r\,z_x$$

So, each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was.

This property of the linear model is called regression to the mean.


8.4 Checking the Model

Models are useful only when specific assumptions are reasonable. We check conditions that provide information about those assumptions.

1) Quantitative Data Condition – linear models only make sense for quantitative data, so don’t be fooled by categorical data recorded as numbers.

2) Linearity Condition – two variables must have a linear association, or a linear model won’t mean a thing.

3) Outlier Condition – outliers can dramatically change a regression model.


8.5 Learning More from the Residuals

The residuals are the part of the data that hasn’t been modeled.

$$\text{Residual} = \text{Data} - \text{Predicted}$$

We have written this in symbols previously.

$$e = y - \hat{y}$$


Residuals help us see whether the model makes sense.

A scatterplot of residuals against predicted values should show nothing interesting – no patterns, no direction, no shape.

If nonlinearities, outliers, or clusters in the residuals are seen, then we must try to determine what the regression model missed.
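A minimal residual-plot sketch with matplotlib, reusing the toy fit from above (illustrative only):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)
predicted = b0 + b1 * x
residuals = y - predicted

# A healthy residual plot shows no pattern: no direction, no shape, no clusters.
plt.scatter(predicted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```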


A plot of the Best Buy residuals against the predicted values shows nothing interesting occurring.


The standard deviation of the residuals, $s_e$, gives us a measure of how much the points spread around the regression line.

We estimate the standard deviation of the residuals as shown below:

$$s_e = \sqrt{\frac{\sum e^2}{n-2}}$$

The standard deviation around the line should be the same wherever we apply the model – this is called the Equal Spread Condition.
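A sketch of the residual standard deviation on the toy data, dividing by n − 2 because two parameters (the slope and intercept) were estimated from the data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)                      # residuals

n = len(x)
s_e = np.sqrt(np.sum(e ** 2) / (n - 2))    # s_e = sqrt(sum(e^2) / (n - 2))
print(s_e)
```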


In the Best Buy example, we used a linear model to make a prediction for 301 stores.

The residual for this prediction is –43.3 MIPS, while the residual standard deviation is 24.07 MIPS.

This indicates that our prediction is about –43.3/24.07 ≈ –1.8 standard deviations away from the actual value.

This is a typical size since it is within 2 SDs.
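In code, with the numbers quoted above:

```python
# How many residual SDs is the actual value from the prediction?
residual = -43.3   # MIPS, from the 301-store prediction
s_e = 24.07        # residual standard deviation, in MIPS

print(round(residual / s_e, 1))  # -1.8, within 2 SDs, so a typical miss
```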


8.6 Variation in the Model and R²

The variation in the residuals is the key to assessing how well a model fits. Consider our Best Buy example.

Monthly Use has a standard deviation of 117.0 MIPS. Using the mean to summarize the data, we may expect to be wrong by roughly twice the SD, or plus or minus 234.0 MIPS.

The residuals have a SD of only 24.07 MIPS, so knowing the number of stores allows a much better prediction.


All regression models fall somewhere between the two extremes of zero correlation and perfect correlation of plus or minus 1.

We consider the square of the correlation coefficient r to get r², a value between 0 and 1.

r² gives the fraction of the data's variation accounted for by the model, and 1 – r² is the fraction of the original variation left in the residuals.


By tradition, r² is written R² and called "R squared."

The Best Buy model has an R² of (0.979)² = 0.959. Thus 95.9% of the variation in Monthly Use is accounted for by the number of stores, and 1 – 0.959 = 0.041, or 4.1%, of the variability in Monthly Use is left in the residuals.
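A sketch checking R² two ways on the toy data: as the squared correlation, and as the fraction of the total variation not left in the residuals:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)
e = y - (b0 + b1 * x)

r = np.corrcoef(x, y)[0, 1]

sse = np.sum(e ** 2)                  # variation left in the residuals
sst = np.sum((y - y.mean()) ** 2)     # total variation in y

print(r ** 2, 1 - sse / sst)          # the two agree
```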


How Big Should R² Be?

There is no value of R² that automatically determines that a regression is "good".

Data from scientific experiments often have R² in the 80% to 90% range.

Data from observational studies may have an acceptable R² in the 30% to 50% range.


8.7 Reality Check: Is the Regression Reasonable?

• The results of a statistical analysis should reinforce common sense.

• Is the slope reasonable?

• Does the direction of the slope seem right?

• Always be skeptical and ask yourself if the answer is reasonable.


What Can Go Wrong?

• Don’t fit a straight line to a nonlinear relationship.

• Beware of extraordinary points.
  • Look for y-values that stand off from the linear pattern.
  • Look for x-values that exert a strong influence.

• Don’t extrapolate far beyond the data.

• Don’t infer that x causes y just because there is a good linear model for their relationship.

• Don’t choose a model based on R² alone.


What Have We Learned?

• Linear models can help summarize the relationship between quantitative variables that are linearly related.

• The slope of a regression line is based on the correlation, adjusted for the standard deviations in x and y.

• For each SD a case is away from the mean of x, we expect its predicted y to be r SDs away from the mean of y.

• Since r is between –1 and +1, each predicted y is fewer SDs away from its mean than the corresponding x was.

• R² gives us the fraction of the variation of the response accounted for by the regression model.