Regression: Using Correlation To Make Predictions

Anthony Greene

Making a Prediction

ẑy = r·zx

This gives the predicted value of y based on a known value of x and a known correlation.

Note what happens for positive and negative values of r, for high and low values of r, and for near-zero values of r.
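The prediction rule can be sketched in a few lines; the z-scores and correlations below are illustrative values, not data from the slides:

```python
# Minimal sketch of the standardized prediction rule z-hat_y = r * z_x.
def predict_z(z_x, r):
    """Predicted z-score of y, given the z-score of x and the correlation r."""
    return r * z_x

# High |r|: the prediction stays far from the mean; near-zero r pulls the
# prediction toward 0 (the mean, in z-units); negative r flips the sign.
print(predict_z(1.5, 0.9))    # -> 1.35
print(predict_z(1.5, -0.9))   # -> -1.35
print(predict_z(1.5, 0.05))   # close to 0
```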


Graph of y = 5 − 3x


y-Intercept and Slope

For a linear equation y = a + bx, the constant a is the y-intercept and the constant b is the slope.

x and y are related variables

Straight-line graphs of three linear equations

y = a + bx, where a is the y-intercept and b is the slope (rise/run)


Graphical Interpretation of Slope

The straight-line graph of the linear equation y = a + bx slopes upward if b > 0, slopes downward if b < 0, and is horizontal if b = 0.


Graphical interpretation of slope


Four data points


Scatter plot


Two possible straight-line fits to the data points

Determining how well the data points are fit by Line A vs. Line B


Least-Squares Criterion

The straight line that best fits a set of data points is the one having the smallest possible sum of squared errors. Recall that the sum of squared errors, divided by n, gives the error variance.
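A small sketch of the criterion; the four data points and the two candidate lines here are made up for illustration, not the ones in the figures:

```python
# Compare two candidate lines by their sum of squared errors (SSE);
# the least-squares criterion picks the line with the smaller SSE.
points = [(1, 2), (2, 3), (3, 5), (4, 4)]  # illustrative (x, y) pairs

def sse(a, b, pts):
    """Sum of squared vertical errors of the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in pts)

print(sse(1.0, 1.0, points))  # "Line A": y = 1 + x   -> 2.0
print(sse(0.0, 1.5, points))  # "Line B": y = 1.5x    -> 4.5
# Line A wins here: a smaller SSE means a better least-squares fit.
```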

Regression Line and Regression Equation

Regression line: the straight line that best fits a set of data points according to the least-squares criterion.

Regression equation: the equation of the regression line.


The best-fit line minimizes the distances between the actual data points and the predicted values


Residual, e, of a data point

Notation Used in Regression and Correlation

We define SSx, SP, and SSy by

SSx = Σ(x − Mx)²
SP = Σ(x − Mx)(y − My)
SSy = Σ(y − My)²

Or the computational formulas:

SSx = Σx² − (Σx)²/n
SP = Σxy − (Σx)(Σy)/n
SSy = Σy² − (Σy)²/n

Regression Equation

The regression equation for a set of n data points is

ŷ = a + bx

where

b = SP/SSx  and  a = My − b·Mx = (1/n)(Σy − bΣx)
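Assuming the formulas above, the coefficients can be computed directly from deviation scores; the data below are toy values for illustration:

```python
# b = SP / SSx and a = My - b*Mx, computed from deviation scores.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
n = len(xs)

m_x, m_y = sum(xs) / n, sum(ys) / n                      # means Mx, My
ss_x = sum((x - m_x) ** 2 for x in xs)                   # SSx
sp = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys))  # SP

b = sp / ss_x      # slope
a = m_y - b * m_x  # intercept
print(b, a)        # the regression equation is y-hat = a + b*x
```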

The relationship between b and r

b = r·(sy/sx)

which follows because

r = SP/√(SSx·SSy)  and  b = SP/SSx

and because sx and sy have the same denominator (n − 1), which cancels:

r·(sy/sx) = [SP/√(SSx·SSy)]·√(SSy/SSx) = SP/SSx = b

• That is, the regression slope is just the correlation coefficient scaled up to the right size for the variables x and y.
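The identity is easy to check numerically; with sample standard deviations the n − 1 factors cancel, so r·(sy/sx) reduces to SP/SSx (toy data again):

```python
import math

# Verify b = r * (s_y / s_x) against b = SP / SSx on small toy data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
n = len(xs)
m_x, m_y = sum(xs) / n, sum(ys) / n

ss_x = sum((x - m_x) ** 2 for x in xs)
ss_y = sum((y - m_y) ** 2 for y in ys)
sp = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys))

b = sp / ss_x                        # regression slope
r = sp / math.sqrt(ss_x * ss_y)      # correlation coefficient
s_x = math.sqrt(ss_x / (n - 1))      # sample standard deviations
s_y = math.sqrt(ss_y / (n - 1))

print(abs(b - r * (s_y / s_x)) < 1e-12)  # -> True: the two forms agree
```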

From ŷ = a + bx to ẑy = r·zx:

ŷ = bx + a

recall a = My − b·Mx, so

ŷ = bx + My − b·Mx = b(x − Mx) + My

recall b = r·(sy/sx), so

ŷ − My = r·(sy/sx)(x − Mx)

(ŷ − My)/sy = r·(x − Mx)/sx

ẑy = r·zx


Criterion for Finding a Regression Line

Before finding a regression line for a set of data points, draw a scatter diagram. If the data points do not appear to be scattered about a straight line, do not determine a regression line.


Linear regression requires linear data: (a) data points scattered about a curve; (b) an inappropriate straight-line fit to the data. Higher-order regression equations exist but are outside the scope of this course.


Uniform Variance

[Bar chart: Math Proficiency By Grade, proficiency (0–100) for grades 1–5, illustrating roughly uniform variance across grades.]


Assumptions for Regression Inferences


Table for obtaining the three sums of squares for the used car data


Regression line and data points for used car data

What is a fair asking price for a 2.5-year-old car?

ŷ = 195.47 − 20.26x

ŷ = 195.47 − 20.26(2.5) = 144.82

So, since the price unit is $100s, the best prediction is about $14,482.


Extrapolation in the used car example


Sums of Squares in Regression

Total sum of squares, SST: the variation in the observed values of the response variable:

SST = Σ(y − My)² = SSy

Regression sum of squares, SSR: the variation in the observed values of the response variable that is explained by the regression:

SSR = Σ(ŷ − My)² = SP²/SSx

Error sum of squares, SSE: the variation in the observed values of the response variable that is not explained by the regression:

SSE = Σ(y − ŷ)² = SSy − SP²/SSx


Regression Identity

The total sum of squares equals the regression sum of squares plus the error sum of squares. In symbols,

SST = SSR + SSE.
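The identity is easy to verify numerically; this sketch uses toy data rather than the used-car numbers:

```python
# Check SST = SSR + SSE for a least-squares line fit to toy data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
n = len(xs)
m_x, m_y = sum(xs) / n, sum(ys) / n

ss_x = sum((x - m_x) ** 2 for x in xs)
sp = sum((x - m_x) * (y - m_y) for x, y in zip(xs, ys))
b = sp / ss_x
a = m_y - b * m_x
preds = [a + b * x for x in xs]                       # y-hat values

sst = sum((y - m_y) ** 2 for y in ys)                 # total variation
ssr = sum((p - m_y) ** 2 for p in preds)              # explained
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))    # unexplained
print(abs(sst - (ssr + sse)) < 1e-9)                  # -> True
```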


Graphical portrayal of regression for used cars

y = a + bx


What sort of things could regression be used for?

Wherever a known correlation exists, regression can be used to predict a new score. Examples:

1. If you knew that there was a past correlation between the amount of study time and the grade on an exam, you could make a good prediction about the grade before it happened.

2. If you knew that certain features of a stock correlate with its price, you can use regression to predict the price before it happens.

Regression Example: Low Correlation

[Scatter plot of height (inches) against weight (0–350 lb) for the data below.]

Find the regression equation for predicting height based on knowledge of weight. The existing data are for 10 male stats students.

X        Y       XY          X²          Y²
287.00   75.00   21,525.00   82,369.00   5,625.00
300.00   71.00   21,300.00   90,000.00   5,041.00
255.00   80.00   20,400.00   65,025.00   6,400.00
180.00   69.00   12,420.00   32,400.00   4,761.00
130.00   70.00   9,100.00    16,900.00   4,900.00
215.00   77.00   16,555.00   46,225.00   5,929.00
165.00   71.00   11,715.00   27,225.00   5,041.00
240.00   71.00   17,040.00   57,600.00   5,041.00
160.00   72.00   11,520.00   25,600.00   5,184.00
150.00   65.00   9,750.00    22,500.00   4,225.00
Σ: 2,082.00  721.00  151,325.00  465,844.00  52,147.00

Computing the regression equation from the column totals (n = 10):

Σx = 2,082   Σy = 721   Σxy = 151,325   Σx² = 465,844   Σy² = 52,147

SSx = Σx² − (Σx)²/n = 465,844 − 433,472.4 = 32,371.6

SP = Σxy − (Σx)(Σy)/n = 151,325 − 150,112.2 = 1,212.8

b = SP/SSx = 1,212.8/32,371.6 ≈ 0.037

a = (1/n)(Σy − bΣx) = (1/10)(721 − 78.0) ≈ 64.3

So ŷ = 0.037x + 64.3

[Scatter plot of height against weight (0–350 lb) with the regression line ŷ = 0.037x + 64.3: a shallow slope through widely scattered points, reflecting the low correlation.]
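The whole computation can be replayed from the table's raw columns, and the rounded slope and intercept can be checked against the worked values:

```python
# Height-by-weight regression recomputed from the raw table columns
# (weights in pounds, heights in inches, n = 10).
weights = [287, 300, 255, 180, 130, 215, 165, 240, 160, 150]
heights = [75, 71, 80, 69, 70, 77, 71, 71, 72, 65]
n = len(weights)

sum_x, sum_y = sum(weights), sum(heights)               # 2,082 and 721
sum_xy = sum(x * y for x, y in zip(weights, heights))   # 151,325
sum_x2 = sum(x * x for x in weights)                    # 465,844

ss_x = sum_x2 - sum_x ** 2 / n    # SSx = Σx² − (Σx)²/n
sp = sum_xy - sum_x * sum_y / n   # SP  = Σxy − (Σx)(Σy)/n
b = sp / ss_x
a = (sum_y - b * sum_x) / n
print(round(b, 3), round(a, 1))   # -> 0.037 64.3
```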


Regression Example: High Correlation

Find the regression equation for predicting the probability of a teenage suicide attempt based on weekly heroin usage.

[Scatter plot of attempt probability (y-axis) against weekly heroin usage (x-axis, 1–7), with separate series for 2000, 2001, and 2002.]

X   Y      XY     X²   Y²
1   0.2    0.2    1    0.04
1   0.31   0.31   1    0.0961
1   0.18   0.18   1    0.0324
2   0.27   0.54   4    0.0729
2   0.38   0.76   4    0.1444
2   0.46   0.92   4    0.2116
3   0.9    2.7    9    0.81
3   0.58   1.74   9    0.3364
3   0.45   1.35   9    0.2025
4   0.84   3.36   16   0.7056
4   0.74   2.96   16   0.5476
4   0.68   2.72   16   0.4624
5   0.85   4.25   25   0.7225
5   0.78   3.9    25   0.6084
5   0.73   3.65   25   0.5329
6   0.88   5.28   36   0.7744
6   0.82   4.92   36   0.6724
6   0.78   4.68   36   0.6084
7   0.92   6.44   49   0.8464
7   0.85   5.95   49   0.7225
7   0.91   6.37   49   0.8281
Σ: 84  13.51  63.18  420  9.9779

n = 21

SSx = Σx² − (Σx)²/n = 420 − 336 = 84

SP = Σxy − (Σx)(Σy)/n = 63.18 − 54.04 = 9.14

b = SP/SSx = 9.14/84 = 0.109

a = (1/n)(Σy − bΣx) = (1/21)(13.51 − 9.156) = 0.207

So ŷ = 0.109x + 0.207
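The computation can be replayed from the column totals. Note one rounding detail: carrying b at full precision gives an intercept of about 0.208, while rounding b to 0.109 before computing a gives 0.207.

```python
# High-correlation example recomputed from the column totals above.
n, sum_x, sum_y, sum_xy, sum_x2 = 21, 84, 13.51, 63.18, 420

ss_x = sum_x2 - sum_x ** 2 / n    # 420 - 336 = 84
sp = sum_xy - sum_x * sum_y / n   # 63.18 - 54.04 = 9.14
b = sp / ss_x                     # ~0.109
a = (sum_y - b * sum_x) / n       # ~0.208 at full precision
print(round(b, 3), round(a, 3))
```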


Why Is It Called Regression?

• For low correlations, the predicted value is close to the mean

• For zero correlations, the prediction is the mean

• Only for perfect correlations (R² = 1.0) do the predicted scores show as much variation as the actual scores

• Since perfect correlations are rare, we say that the predicted scores show regression toward the mean
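A quick sketch of why this happens: in z-units the predictions are r·zx, so their spread is only |r| times the spread of the actual scores (the z-scores below are illustrative):

```python
import math

# Predicted z-scores are r * z_x, so their spread shrinks by a factor of |r|.
def sd(vals):
    """Population standard deviation."""
    m = sum(vals) / len(vals)
    return math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals))

z_x = [-2.0, -1.0, 0.0, 1.0, 2.0]    # standardized predictor scores
for r in (1.0, 0.5, 0.0):
    preds = [r * z for z in z_x]      # z-hat_y = r * z_x
    print(r, round(sd(preds), 3))     # spread shrinks toward 0 as |r| drops
```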
