Regression line – Fitting a line to data


Page 1

Regression line – Fitting a line to data

If the scatter plot shows a clear linear pattern, a straight line through the points can describe the overall pattern.

Fitting a line means drawing a line that is as close as possible to the points: the “best” straight line is the regression line.

Page 2

Simple Example: Productivity level

To see how productivity was related to the level of maintenance, a firm randomly selected 5 of its high-speed machines for an experiment. Each machine was randomly assigned a different level of maintenance X (hours per week), and its average number of stoppages Y was then recorded.

[Figure: scatter plot of the number of interruptions (Y, from 0 to 1.8) against hours of maintenance (X, from 2 to 16).]

Hours X    Average interruptions Y
 4         1.6
 6         1.2
 8         1.1
10         0.5
12         0.6

Ave(x) = 8, s(x) = 3.16
Ave(y) = 1, s(y) = 0.45
r = –0.94
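The summary statistics above can be reproduced from the five data points. The following is a minimal sketch in plain Python. The helper names (`mean`, `sample_sd`, `correlation`) are chosen here for illustration, and the sample standard deviation (divisor n – 1) is assumed because it matches the slide's s(x) = 3.16.

```python
import math

hours = [4, 6, 8, 10, 12]               # X: hours of maintenance
stoppages = [1.6, 1.2, 1.1, 0.5, 0.6]   # Y: average number of interruptions

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    # Sample standard deviation (divisor n - 1), an assumption that
    # reproduces the slide's s(x) = 3.16 and s(y) = 0.45.
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    # Pearson correlation from sums of squares and cross-products.
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ave_x, ave_y = mean(hours), mean(stoppages)
s_x, s_y = sample_sd(hours), sample_sd(stoppages)
r = correlation(hours, stoppages)
print(f"Ave(x)={ave_x:.2f} s(x)={s_x:.2f} Ave(y)={ave_y:.2f} s(y)={s_y:.2f} r={r:.2f}")
```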

Page 3

Least squares regression line

Definition: The regression line of y on x is the line that makes the sum of the squares of the vertical distances (deviations) of the data points from the line as small as possible.

It is defined as $E(y \mid x) = a + bx$, where $E(y \mid x)$ is the average value of y at x.

Slope b = r*sd(y)/sd(x)
Intercept a = ave(y) – b*ave(x)

We use $E(y \mid x)$ to distinguish between the values predicted from the regression line and the observed values.

Note: b has the same sign as r.
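As a sketch of what "as small as possible" means, the least squares criterion and its solution can be written out as follows. The symbols $(x_i, y_i)$, $n$, $\bar{x}$, $\bar{y}$, $s_x$ and $s_y$ are notation introduced here for illustration; they stand for the data points, the number of points, ave(x), ave(y), sd(x) and sd(y).

```latex
\min_{a,\,b} \; \sum_{i=1}^{n} \bigl( y_i - (a + b x_i) \bigr)^2
\quad\Longrightarrow\quad
b = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}
  = r\,\frac{s_y}{s_x},
\qquad
a = \bar{y} - b\,\bar{x}
```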

Page 4

Example: cont.

The regression line of the number of interruptions on the hours of maintenance per week is calculated as follows. The descriptive statistics for x and y are: Ave(x) = 8, s(x) = 3.16; Ave(y) = 1, s(y) = 0.45; and r = –0.94.

Slope: $b = r \, \dfrac{s_y}{s_x} = -0.94 \times \dfrac{0.45}{3.16} = -0.135$ (computed from the unrounded statistics; the rounded values shown give –0.134)

Intercept: $a = \text{ave}(y) - b \, \text{ave}(x) = 1 - (-0.135) \times 8 = 2.08$

Regression line: $E(y \mid x) = 2.08 - 0.135\,x = 2.08 - 0.135 \, \text{hours}$
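The slope and intercept can also be checked directly from the data. This is a minimal sketch, not the slides' own procedure: it uses the fact that r*s(y)/s(x) simplifies algebraically to the ratio of the cross-product sum to the sum of squares of x. The variable names (`sxx`, `sxy`, `b_rounded`) are illustrative.

```python
hours = [4, 6, 8, 10, 12]
stoppages = [1.6, 1.2, 1.1, 0.5, 0.6]

n = len(hours)
ave_x = sum(hours) / n
ave_y = sum(stoppages) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - ave_x) ** 2 for x in hours)
sxy = sum((x - ave_x) * (y - ave_y) for x, y in zip(hours, stoppages))

# b = r * s(y) / s(x) simplifies to sxy / sxx, which avoids rounding r and the s's.
b = sxy / sxx
a = ave_y - b * ave_x
print(f"slope b = {b:.3f}, intercept a = {a:.2f}")

# Plugging in the rounded statistics instead gives a slightly different slope.
b_rounded = -0.94 * 0.45 / 3.16
print(f"slope from rounded statistics = {b_rounded:.3f}")
```

Working from the raw sums is the safer route when the summary statistics have been rounded to two decimals.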

Page 5

Regression line: $\hat{y} = 2.08 - 0.135 \, \text{hours}$

If the slope is positive, Y increases linearly with X. The slope value is the increase in Y for an increase of one unit in X.

If the slope is negative, Y decreases linearly with X. The absolute value of the slope is the decrease in Y for an increase of one unit in X.

[Figure: scatter plot of the number of interruptions against hours of maintenance with the fitted regression line (r = –0.94); a residual and the point of averages are marked.]

The slope is b = –0.135. If you increase the maintenance schedule by one hour, the average number of stoppages will decrease by 0.135.

Page 6

Residuals

For a given x, use the regression line to predict the response $\hat{y}$.

The accuracy of the prediction depends on how spread out the observations are around the line.

[Figure: scatter plot of Y against x showing, for one point, the observed value $y$, the predicted value $\hat{y}$, and the error $y - \hat{y}$.]

Page 7

Example: CPU Usage

A study was conducted to examine what factors affect CPU usage.

A set of 38 processes written in a programming language was considered. For each program, data were collected on the CPU usage in seconds, and the number of lines (in thousands) of the program.

[Figure: scatter plot of CPU usage against number of lines.]

The scatter plot shows a clear positive association.

We’ll fit a regression line to model the association!

Page 8

Variable    N    Mean      Std Dev   Sum         Minimum   Maximum
Y time      38   0.15710   0.13129   5.96980     0.01960   0.46780
X linet     38   3.16195   3.96094   120.15400   0.10200   14.87200

Correlation Coefficient = 0.89802

The regression line is $E(y \mid x) = 0.063 + 0.0297\,x$
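As a worked check, the coefficients follow from the summary table using the slope and intercept formulas given earlier. The subscripted symbols are notation introduced here. The slope works out to about 0.02977, which the slide reports as 0.0297.

```latex
b = r\,\frac{s_y}{s_x}
  = 0.89802 \times \frac{0.13129}{3.96094}
  \approx 0.02977,
\qquad
a = \bar{y} - b\,\bar{x}
  = 0.15710 - 0.02977 \times 3.16195
  \approx 0.063
```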

Page 9

Goodness of fit measures

1. Coefficient of determination

$R^2 = (\text{correlation coefficient})^2$

$R^2$:

  • describes how good the regression line is at explaining the response y;

  • is the fraction of the variation in the values of y that is explained by the regression line of y on x;

  • varies between 0 and 1. If the value is close to 1, the regression line provides a good explanation of the data; if it is close to zero, the regression line is not able to capture the variability in the data.

Page 10

EXAMPLE (cont.): The correlation coefficient is r = –0.94, so

$R^2 = (-0.94)^2 = 0.883$

The regression line is able to capture 88.3% of the variability in the data.

$R^2$ is computed by the Excel function RSQ.
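The "fraction of variation explained" reading can be checked on the maintenance data. The sketch below computes R² as one minus the ratio of left-over variation to total variation, which for a least squares line equals the squared correlation. The variable names are illustrative. Because it works from the unrounded data, it gives about 0.889 rather than the 0.883 obtained by squaring the rounded r = –0.94.

```python
hours = [4, 6, 8, 10, 12]
stoppages = [1.6, 1.2, 1.1, 0.5, 0.6]
a, b = 2.08, -0.135  # fitted line from the example

ave_y = sum(stoppages) / len(stoppages)
predicted = [a + b * x for x in hours]

# Variation left over after the fit, and total variation about the mean.
ss_residual = sum((y - p) ** 2 for y, p in zip(stoppages, predicted))
ss_total = sum((y - ave_y) ** 2 for y in stoppages)

r_squared = 1 - ss_residual / ss_total
print(f"R^2 = {r_squared:.3f}")
```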

Page 11

2. Residuals

The vertical distances between the observed points and the regression line can be regarded as the “left-over” variation in the response after fitting the regression line.

A residual is the difference between an observed value of the response variable y and the value predicted by the regression line.

Residual e = observed y – predicted y = y – (intercept + slope × x) = y – (a + b x)

A special property: the average of the residuals is always zero.

Page 12

EXAMPLE: Residuals for the regression line $E(y \mid x) = 2.08 - 0.135\,x$ for the number of interruptions Y on the hours of maintenance X.

Hours X   Average interr. Y   Predicted interr.          Residual y – (a + bx)
 4        1.6                 2.08 – 0.135*4 = 1.54      1.6 – 1.54 = 0.06
 6        1.2                 2.08 – 0.135*6 = 1.27      1.2 – 1.27 = –0.07
 8        1.1                 2.08 – 0.135*8 = 1         1.1 – 1 = 0.1
10        0.5                 2.08 – 0.135*10 = 0.73     0.5 – 0.73 = –0.23
12        0.6                 2.08 – 0.135*12 = 0.46     0.6 – 0.46 = 0.14

Average of the residuals = 0
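The residual table can be reproduced in a few lines. This is a minimal sketch using the fitted coefficients from the example; the variable names are illustrative. The final check tests the average against a small tolerance because floating-point arithmetic rarely returns an exact zero.

```python
hours = [4, 6, 8, 10, 12]
stoppages = [1.6, 1.2, 1.1, 0.5, 0.6]
a, b = 2.08, -0.135  # fitted line from the example

# Residual = observed y minus the value predicted by the line.
residuals = [y - (a + b * x) for x, y in zip(hours, stoppages)]
for x, y, e in zip(hours, stoppages, residuals):
    print(f"x={x:2d}  y={y:.1f}  predicted={a + b * x:.2f}  residual={e:+.2f}")

average_residual = sum(residuals) / len(residuals)
print("average residual is zero:", abs(average_residual) < 1e-9)
```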

Page 13

3. Accuracy of the predictions

If the cloud of points is football-shaped, the prediction errors are similar along the regression line. One possible measure of the accuracy of the regression predictions is given by the root mean square error (r.m.s. error).

The r.m.s. error is defined as the square root of the average squared residual, with n – 2 in the denominator:

$$\text{r.m.s. error} = \sqrt{\frac{(\text{resid } \#1)^2 + (\text{resid } \#2)^2 + \dots + (\text{resid } \#n)^2}{n - 2}}$$

This is an estimate of the variation of y about the regression line.

It is computed by the Excel function STEYX.

Page 14

[Figure: bands around the regression line. Roughly 68% of the points lie within 1 r.m.s. error of the line; roughly 95% of the points lie within 2 r.m.s. errors.]

Page 15

Hours X   Average interr. Y   Predicted interr.   Residual   Squared residual
 4        1.6                 1.54                 0.06      0.0036
 6        1.2                 1.27                –0.07      0.0049
 8        1.1                 1                    0.1       0.01
10        0.5                 0.73                –0.23      0.053
12        0.6                 0.46                 0.14      0.0196
Total                                                        0.0911

Computing the r.m.s. error:

The r.m.s. error is $\sqrt{0.0911 / 3} = 0.174$

If the company schedules 7 hours of maintenance per week, the predicted weekly number of interruptions of the machine will be $\hat{y} = 2.08 - 0.135 \times 7 = 1.135$ on average.

Using the r.m.s. error, the number of interruptions will most likely be between 1.135 – 2*0.174 = 0.787 and 1.135 + 2*0.174 = 1.483.
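The r.m.s. error and the rough two-r.m.s.-error band around a prediction can be computed as below. This is a sketch under the slides' own convention of dividing by n – 2; the names `x_new`, `low` and `high` are illustrative.

```python
import math

hours = [4, 6, 8, 10, 12]
stoppages = [1.6, 1.2, 1.1, 0.5, 0.6]
a, b = 2.08, -0.135  # fitted line from the example

residuals = [y - (a + b * x) for x, y in zip(hours, stoppages)]
n = len(residuals)

# Divide by n - 2, matching the slide's 0.0911 / 3.
rms_error = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Prediction at 7 hours of maintenance, with a band of 2 r.m.s. errors.
x_new = 7
prediction = a + b * x_new
low, high = prediction - 2 * rms_error, prediction + 2 * rms_error
print(f"r.m.s. error = {rms_error:.3f}")
print(f"prediction at x={x_new}: {prediction:.3f} (roughly {low:.3f} to {high:.3f})")
```

The band is only a rough guide: it relies on the football-shaped cloud described earlier, and with five points the estimate of the spread is itself quite uncertain.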

Page 16

Detect problems in the regression analysis: the residual plots

The analysis of the residuals is useful to detect possible problems and anomalies in the regression.

A residual plot is a scatter plot of the regression residuals against the explanatory variable.

Points should be randomly scattered inside a band centered around the horizontal line at zero (the mean of the residuals).

Page 17

[Figure: residual plots (residual against X). One panel shows the “Good case”; two panels show “Bad cases”: a non-linear relationship, and variation of y changing with x.]

Page 18

Anomalies in the regression analysis

• If the residual plot displays a curve, the straight line is not a good description of the association between x and y.

• If the residual plot is fan-shaped, the variation of y is not constant. In the fan-shaped residual plot shown in the figure, predictions of y will be less precise as x increases, since y shows a higher variability for higher values of x.
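Both anomalies can be seen numerically, without drawing the plot. The sketch below fits a least squares line to two small hypothetical data sets, invented here purely for illustration and not taken from the slides: one curved, one fan-shaped. The helpers `fit_line` and `residuals` are likewise illustrative.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for the regression of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def residuals(xs, ys):
    a, b = fit_line(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]

xs = list(range(1, 11))

# Hypothetical curved data: y = x^2, which a straight line cannot follow.
curved = residuals(xs, [x ** 2 for x in xs])
signs = "".join("+" if e > 0 else "-" for e in curved)
print("curved data, residual signs:", signs)

# Hypothetical fan-shaped data: deviations from y = 2x grow in proportion to x.
fan = residuals(xs, [2 * x + (0.3 * x if x % 2 else -0.3 * x) for x in xs])
half = len(xs) // 2
spread_low = max(abs(e) for e in fan[:half])
spread_high = max(abs(e) for e in fan[half:])
print(f"fan data, largest |residual|: small x = {spread_low:.2f}, large x = {spread_high:.2f}")
```

For the curved data the residual signs come in runs (positive, then negative, then positive) rather than being randomly mixed. For the fan-shaped data the residuals are larger in the upper half of the x range.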

Page 19

Example of CPU usage data: Residual plot

[Figure: residual plot for the CPU usage regression.]

Do you see any striking pattern?

Page 20

Example: 100 meter dash

At the 1987 World Championship in Rome, Ben Johnson set a new world record in the 100-meter dash.

The data: Y = the elapsed time from the start of the race, recorded in 10-meter increments, for Ben Johnson; X = meters.

            Meters   Johnson
Average     55       5.83
St. dev.    30.27    2.52

Correlation = 0.999

[Figure: scatter plot for Johnson’s times, elapsed time against meters.]

Page 21

Regression Line

[Figure: elapsed time against meters with the fitted regression line.]

The fitted regression line is $\hat{y} = 1.11 + 0.09 \, \text{meters}$.

The value of $R^2$ is 0.999; therefore 99.9% of the variability in the data is explained by the regression line.

Page 22

Residual Plot

[Figure: residual plot, residual against meters.]

Does the graph show any anomaly?

Page 23

Outliers and Influential points

An outlier is an observation that lies outside the overall pattern of the other observations.

[Figure: scatter plot with an outlier marked; the outlier has a large residual.]

Page 24

Influential Point

An observation is influential for the regression line if removing it would considerably change the fitted line. An influential point pulls the regression line towards itself.

[Figure: scatter plot marking the influential point and the regression line if the influential point is omitted.]
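The effect of an influential point can be illustrated by fitting the line with and without it. The data below are hypothetical, made up for this sketch and unrelated to the slides' examples: five points lying close to a line of slope 2, plus one point far to the right and well below that line. The helper `fit_line` is also illustrative.

```python
def fit_line(points):
    """Least squares intercept and slope for a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    b = sxy / sxx
    return my - b * mx, b

# Hypothetical data: close to a line of slope 2.
bulk = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]
# Hypothetical influential point: extreme in x and far below the trend.
influential = (15, 5.0)

a_with, b_with = fit_line(bulk + [influential])
a_without, b_without = fit_line(bulk)
print(f"with the point:    slope = {b_with:.2f}")
print(f"without the point: slope = {b_without:.2f}")
```

Because the extra point is extreme in x, it pulls the fitted line strongly towards itself, and the slope changes substantially once it is omitted.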

Page 25

Example: house prices in Albuquerque.

[Figure: scatter plot of annual tax and selling price with the fitted regression line.]

The regression line is $\hat{y} = 365.66 + 0.5488 \, \text{price}$. The coefficient of determination is $R^2 = 0.4274$.

What does the value of $R^2$ say?

Are there any influential points?

Page 26

New analysis: omitting the influential points

[Figure: scatter plot of annual tax and selling price showing the new regression line and the previous regression line.]

The regression line is $\hat{y} = -55.364 + 0.8483 \, \text{price}$.

The coefficient of determination is $R^2 = 0.8273$.

The new regression line explains 82.7% of the variation in y.

Page 27

Summary – Warnings

1. Correlation measures linear association; the regression line should be used only when the association is linear.

2. Extrapolation: do not use the regression line to predict values outside the observed range, because such predictions are not reliable.

3. Correlation and the regression line are sensitive to influential or extreme points.

4. Check residual plots to detect anomalies and “hidden” patterns that are not captured by the regression line.