
Stat 112 Notes 10

• Today: Fitting Curvilinear Relationships (Chapter 5)

• Homework 3 due Thursday.


Curvilinear Relationship

• Reconsider the simple regression problem of estimating the conditional mean of y given x, E(y|x).

• For many problems, E(y|x) is not linear.

• The linear regression model makes the restrictive assumption that the increase in the mean of y|x for a one unit increase in x equals β1 for every x.

• Curvilinear relationship: E(y|x) is a curve, not a straight line; the increase in the mean of y|x is not the same for all x.

• When the relationship is curvilinear, the residual plot from a simple linear regression will violate linearity: there will be ranges of x for which the mean of the residuals is not approximately zero (see the sketch below).
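As a concrete illustration of this diagnostic, here is a minimal sketch in Python (assuming numpy; the data are simulated for illustration, not taken from the course data sets): fit a simple linear regression to a curvilinear relationship and check the mean residual within ranges of x.

```python
# Minimal sketch: simulate a curvilinear relationship, fit a simple linear
# regression, and check the mean residual within ranges of x.
# (Simulated data for illustration only; not from the course data sets.)
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(6, 17, 200)
y = 20 + 3 * x - 0.12 * x**2 + rng.normal(0, 1, 200)   # made-up curved mean function

# Least-squares fit of the simple linear regression y = b0 + b1*x
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - (b0 + b1 * x)

# Under linearity, the mean residual is near zero in every range of x;
# with a curvilinear relationship, some ranges sit systematically above or below zero.
for lo, hi in [(6, 9), (9, 12), (12, 15), (15, 17)]:
    in_range = (x >= lo) & (x < hi)
    print(f"x in [{lo}, {hi}): mean residual = {residuals[in_range].mean():.2f}")
```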


Example 1: How does rainfall affect yield?

• Data on average corn yield and rainfall in six U.S. states (1890-1927), cornyield.JMP

[Figure: Bivariate Fit of YIELD By RAINFALL (scatterplot of YIELD vs. RAINFALL)]


[Figure: residual plot of residuals vs. RAINFALL from the simple linear regression]

The residual plot indicates a violation of linearity: the mean of the residuals is above zero for rainfall between about 10 and 12 and below zero for rainfall between about 13 and 17.


Example 2: How do people’s incomes change as they age?

• Weekly wages and age of 200 randomly chosen males between ages 18 and 70 from the 1998 March Current Population Survey.

[Figure: Bivariate Fit of wage By age (scatterplot of wage vs. age)]


Example 3: Display.JMP

• A large chain of liquor stores would like to know how much display space in its stores to devote to a new wine. It collects sales and display space data from 47 of its stores.

[Figure: Bivariate Fit of Sales By DisplayFeet (scatterplot of Sales vs. DisplayFeet)]


Polynomial Regression

• Add polynomial terms in x as additional explanatory variables in a multiple regression model:

E(Y|X=x) = β0 + β1x + β2x^2 + ... + βKx^K

• In JMP, the centered value (x − x̄) is used in place of x in the higher-order terms, so the fitted model is

E(Y|X=x) = β0 + β1x + β2(x − x̄)^2 + ... + βK(x − x̄)^K

This does not affect the predicted values ŷ obtained from the multiple regression model (illustrated in the sketch below).

• The quadratic model (K=2) is often sufficient.
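A minimal sketch of this last point, assuming Python with numpy (simulated data, not cornyield.JMP): fitting the quadratic with the raw x^2 term and with the JMP-style centered (x − x̄)^2 term gives different coefficients but identical predicted values ŷ.

```python
# Minimal sketch: a quadratic fit using x^2 versus the JMP-style centered (x - xbar)^2
# term gives different coefficients but identical predicted values.
# (Simulated data for illustration; not cornyield.JMP.)
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(6, 17, 50)
y = 22 + 1.0 * x - 0.2 * (x - x.mean())**2 + rng.normal(0, 2, 50)

def ols(X, y):
    """Least-squares coefficients and fitted values."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, X @ beta

X_raw      = np.column_stack([np.ones_like(x), x, x**2])
X_centered = np.column_stack([np.ones_like(x), x, (x - x.mean())**2])

beta_raw, yhat_raw           = ols(X_raw, y)
beta_centered, yhat_centered = ols(X_centered, y)

print(beta_raw, beta_centered)                  # coefficients differ
print(np.allclose(yhat_raw, yhat_centered))     # True: same y-hat either way
```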


Polynomial Regression in JMP

• Two ways to fit the model:

– Create the variables (x − x̄)^2, (x − x̄)^3, ..., (x − x̄)^K and use Fit Model with the explanatory variables x, (x − x̄)^2, (x − x̄)^3, ..., (x − x̄)^K (we will illustrate this method when we apply polynomial regression with more than one explanatory variable).

– Use Fit Y by X. Click on the red triangle next to Bivariate Fit of ... and click Fit Polynomial instead of the usual Fit Line. This method produces nicer plots.


[Figure: Bivariate Fit of YIELD By RAINFALL]

Linear Fit
YIELD = 23.552103 + 0.7755493 RAINFALL
Summary of Fit: RSquare 0.16211, RSquare Adj 0.138835, Root Mean Square Error 4.049471, Mean of Response 31.91579, Observations (or Sum Wgts) 38

Polynomial Fit Degree=2
YIELD = 21.660175 + 1.0572654 RAINFALL - 0.2293639 (RAINFALL-10.7842)^2
Summary of Fit: RSquare 0.296674, RSquare Adj 0.256484, Root Mean Square Error 3.762707, Mean of Response 31.91579, Observations (or Sum Wgts) 38

Parameter Estimates
Term                    Estimate     Std Error   t Ratio   Prob>|t|
Intercept               21.660175    3.094868    7.00      <.0001
RAINFALL                1.0572654    0.293956    3.60      0.0010
(RAINFALL-10.7842)^2    -0.229364    0.088635    -2.59     0.0140


[Figure: Bivariate Fit of wage By age with Linear Fit and Polynomial Fit Degree=2]

Linear Fit
wage = 407.72321 + 6.5370642 age
Summary of Fit: RSquare 0.049778, RSquare Adj 0.044979, Root Mean Square Error 345.4422

Polynomial Fit Degree=2
wage = 356.39651 + 9.6873755 age - 0.4769883 (age-38.22)^2
Summary of Fit: RSquare 0.095328, RSquare Adj 0.086143, Root Mean Square Error 337.9155, Mean of Response 657.5698, Observations (or Sum Wgts) 200

Parameter Estimates
Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept        356.39651    81.21184    4.39      <.0001
age              9.6873755    2.223264    4.36      <.0001
(age-38.22)^2    -0.476988    0.151453    -3.15     0.0019


Interpretation of coefficients in polynomial regression

• The usual interpretation of multiple regression coefficients doesn’t make sense in polynomial regression.

• For the quadratic model E(Y|X) = β0 + β1X + β2(X − X̄)^2, we can’t hold X fixed and change (X − X̄)^2.

• The effect of increasing X by one unit depends on the starting value X = x*:

E(Y|X = x* + 1) − E(Y|X = x*) = [β0 + β1(x* + 1) + β2(x* + 1 − X̄)^2] − [β0 + β1x* + β2(x* − X̄)^2]
                              = β1 + β2[2(x* − X̄) + 1]


Interpretation of coefficients in wage data

Polynomial Fit Degree=2
wage = 356.39651 + 9.6873755 age - 0.4769883 (age-38.22)^2

Parameter Estimates
Term             Estimate     Std Error   t Ratio   Prob>|t|
Intercept        356.39651    81.21184    4.39      <.0001
age              9.6873755    2.223264    4.36      <.0001
(age-38.22)^2    -0.476988    0.151453    -3.15     0.0019

Change in Mean Wage Associated with a One-Year Increase in Age

From 29 to 30:  18.00
From 39 to 40:   8.47
From 49 to 50:  -1.07
From 59 to 60: -10.61
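The entries in this table can be reproduced directly from the fitted coefficients; here is a short Python sketch using the estimates from the JMP output above:

```python
# Change in estimated mean wage for a one-year increase in age, from the fitted model
# wage-hat = 356.39651 + 9.6873755*age - 0.4769883*(age - 38.22)^2
b1, b2, age_bar = 9.6873755, -0.4769883, 38.22

def change_in_mean_wage(a):
    """E-hat(wage | age = a+1) - E-hat(wage | age = a)."""
    return b1 + b2 * ((a + 1 - age_bar)**2 - (a - age_bar)**2)

for a in (29, 39, 49, 59):
    print(f"From {a} to {a + 1}: {change_in_mean_wage(a):6.2f}")
```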


Choosing the order in polynomial regression

• Is it necessary to include a kth order term (X − X̄)^k in the model

E(Y|X) = β0 + β1X + β2(X − X̄)^2 + ... + βk(X − X̄)^k ?

• Test H0: βk = 0 vs. Ha: βk ≠ 0.

• Choose the largest k for which the test still rejects H0 (at the 0.05 level); a sketch of this rule follows this list.

• If we use (X − X̄)^k, always keep the lower order terms in the model.

• For the corn yield data, use the K=2 polynomial regression model.

• For the income data, use the K=2 polynomial regression model.
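A minimal sketch of this selection rule, assuming Python with numpy and statsmodels (the data here are simulated as a stand-in for the corn yield data): fit polynomials of increasing order and keep the largest k whose highest-order term still rejects H0 at the 0.05 level.

```python
# Minimal sketch: choose the polynomial order by testing the highest-order term.
# (Simulated data standing in for cornyield.JMP.)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(6, 17, 38)
y = 21.7 + 1.06 * x - 0.23 * (x - x.mean())**2 + rng.normal(0, 3.8, 38)

def fit_poly(x, y, k):
    """OLS fit of y on x, (x - xbar)^2, ..., (x - xbar)^k (JMP-style centering)."""
    cols = [x] + [(x - x.mean())**j for j in range(2, k + 1)]
    X = sm.add_constant(np.column_stack(cols))
    return sm.OLS(y, X).fit()

for k in (2, 3, 4):
    fit = fit_poly(x, y, k)
    p = fit.pvalues[-1]          # p-value for H0: beta_k = 0 on the order-k term
    print(f"k = {k}: p-value for the order-{k} term = {p:.4f}")
```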


[Figure: Bivariate Fit of YIELD By RAINFALL with Linear Fit, Polynomial Fit Degree=2, and Polynomial Fit Degree=3]

Polynomial Fit Degree=3

Parameter Estimates
Term                    Estimate     Std Error   t Ratio   Prob>|t|
Intercept               29.281281    5.625537    5.21      <.0001
RAINFALL                0.376709     0.511817    0.74      0.4668
(RAINFALL-10.7842)^2    -0.349335    0.114401    -3.05     0.0044
(RAINFALL-10.7842)^3    0.0517568    0.032202    1.61      0.1172

The third-order term is not significant at the 0.05 level (p = 0.1172), so the quadratic (K=2) model is kept for the corn yield data.


Transformations

• Curvilinear relationship: E(Y|X) is not a straight line.

• Another approach to fitting curvilinear relationships is to transform Y or x.

• Transformations: Perhaps E(f(Y)|g(X)) is a straight line, where f(Y) and g(X) are transformations of Y and X, and a simple linear regression model holds for the response variable f(Y) and explanatory variable g(X).


Curvilinear Relationship

[Figure: Bivariate Fit of Life Expectancy By Per Capita GDP (scatterplot of Life Expectancy vs. Per Capita GDP), with the residual plot of residuals vs. Per Capita GDP below]

Y = Life Expectancy in 1999. X = Per Capita GDP (in US dollars) in 1999. Data in gdplife.JMP.

The linearity assumption of simple linear regression is clearly violated. The increase in mean life expectancy for each additional dollar of GDP is less for large GDPs than for small GDPs: decreasing returns to increases in GDP.


Bivariate Fit of Life Expectancy By log Per Capita GDP

40

50

60

70

80

Life

Exp

ecta

ncy

6 7 8 9 10

log Per Capita GDP

Linear Fit Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP

-25

-15

-5

5

15

Res

idua

l

6 7 8 9 10

log Per Capita GDP

The mean of Life Expectancy given log Per Capita GDP appears to be approximately a straight line.


How do we use the transformation?

• Testing for association between Y and X: If the simple linear regression model holds for f(Y) and g(X), then Y and X are associated if and only if the slope in the regression of f(Y) on g(X) does not equal zero. The p-value for the test that the slope is zero is <.0001: strong evidence that per capita GDP and life expectancy are associated.

• Prediction and mean response: What would you predict the life expectancy to be for a country with a per capita GDP of $20,000?

Linear Fit
Life Expectancy = -7.97718 + 8.729051 log Per Capita GDP

Parameter Estimates
Term                  Estimate    Std Error   t Ratio   Prob>|t|
Intercept             -7.97718    3.943378    -2.02     0.0454
log Per Capita GDP    8.729051    0.474257    18.41     <.0001

Ê(Y | X = 20,000) = Ê(Y | log X = log 20,000) = Ê(Y | log X = 9.9035)
                  = -7.9772 + 8.7291 × 9.9035 ≈ 78.47
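A short sketch of this prediction in Python, using the intercept and slope from the JMP output (the log here is the natural log, consistent with log 20,000 ≈ 9.9035 above):

```python
# Predicted life expectancy for a country with per capita GDP of $20,000,
# from the fit Life Expectancy = -7.97718 + 8.729051 * log(Per Capita GDP).
import numpy as np

b0, b1 = -7.97718, 8.729051
gdp = 20000
prediction = b0 + b1 * np.log(gdp)   # log(20000) ~ 9.9035 (natural log)
print(round(prediction, 2))          # ~ 78.47
```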


How do we choose a transformation?

• Tukey’s Bulging Rule.

• See Handout.

• Match curvature in data to the shape of one of the curves drawn in the four quadrants of the figure in the handout. Then use the associated transformations, selecting one for either X, Y or both.


Transformations in JMP

1. Use Tukey’s Bulging Rule (see handout) to determine transformations which might help.

2. After Fit Y by X, click the red triangle next to Bivariate Fit and click Fit Special. Experiment with transformations suggested by Tukey’s Bulging Rule.

3. Make residual plots of the residuals for the transformed model vs. the original X by clicking the red triangle next to Transformed Fit to ... and clicking Plot Residuals. Choose transformations which make the residual plot have no pattern in the mean of the residuals vs. X.

4. Compare different transformations by looking for the transformation with the smallest root mean square error on the original y-scale. If a transformation involves transforming y, look at the root mean square error for the fit measured on the original scale (a sketch of this comparison follows this list).
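A minimal sketch of step 4, assuming Python with numpy and simulated data standing in for gdplife.JMP: each candidate transformation is fit by least squares and the fits are compared by the RMSE of their predictions on the original y-scale.

```python
# Minimal sketch: compare transformations by RMSE on the original y-scale.
# (Simulated data for illustration; not gdplife.JMP. This RMSE divides by n,
# whereas JMP's Root Mean Square Error divides by n minus the number of parameters.)
import numpy as np

rng = np.random.default_rng(3)
gdp = rng.uniform(500, 30000, 115)
life = -8 + 8.7 * np.log(gdp) + rng.normal(0, 6, 115)   # made-up curvilinear data

def rmse_original_scale(y, x, f=lambda v: v, g=lambda v: v, f_inv=lambda v: v):
    """Fit f(y) = b0 + b1*g(x) by least squares; RMSE of f_inv(fitted values) vs. y."""
    X = np.column_stack([np.ones_like(x), g(x)])
    beta = np.linalg.lstsq(X, f(y), rcond=None)[0]
    yhat = f_inv(X @ beta)               # predictions back on the original y-scale
    return np.sqrt(np.mean((y - yhat) ** 2))

print("linear   :", rmse_original_scale(life, gdp))
print("log x    :", rmse_original_scale(life, gdp, g=np.log))
print("sqrt x   :", rmse_original_scale(life, gdp, g=np.sqrt))
print("square y :", rmse_original_scale(life, gdp, f=np.square, f_inv=np.sqrt))
```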


[Figure: Bivariate Fit of Life Expectancy By Per Capita GDP showing four fits: Linear Fit, Transformed Fit to Log, Transformed Fit to Sqrt, Transformed Fit Square]



Linear Fit
Life Expectancy = 56.176479 + 0.0010699 Per Capita GDP
Summary of Fit: RSquare 0.515026, RSquare Adj 0.510734, Root Mean Square Error 8.353485, Mean of Response 63.86957, Observations (or Sum Wgts) 115

Transformed Fit to Log
Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)
Summary of Fit: RSquare 0.749874, RSquare Adj 0.74766, Root Mean Square Error 5.999128, Mean of Response 63.86957, Observations (or Sum Wgts) 115

Transformed Fit to Sqrt
Life Expectancy = 47.925383 + 0.2187935 Sqrt(Per Capita GDP)
Summary of Fit: RSquare 0.636551, RSquare Adj 0.633335, Root Mean Square Error 7.231524, Mean of Response 63.86957, Observations (or Sum Wgts) 115

Transformed Fit Square
Square(Life Expectancy) = 3232.1292 + 0.1374831 Per Capita GDP
Fit Measured on Original Scale: Sum of Squared Error 7597.7156, Root Mean Square Error 8.1997818, RSquare 0.5327083, Sum of Residuals -70.29942

By looking at the root mean square error on the original y-scale, we see that all of the transformations improve upon the untransformed model and that the transformation to log x is by far the best.


[Figure: residual plots of residuals vs. Per Capita GDP for four fits: Linear Fit, Transformation to Log X, Transformation to √X, Transformation to Y²]

The transformation to Log X appears to have mostly removed the trend in the mean of the residuals. This means that E(Y|X) ≈ β0 + β1 log X. There is still a problem of nonconstant variance.


Comparing models for curvilinear relationships

• In comparing two transformations, use the transformation with the lower RMSE; if y was transformed, use the RMSE for the fit measured on the original y-scale.

• In comparing transformations to polynomial regression models, compare the RMSE of the best transformation to the RMSE of the best polynomial regression model.

• If the transformation’s RMSE is larger than the polynomial regression’s RMSE but is within 1% of it, it is still a good idea to use the transformation on the grounds of parsimony.


Transformations and Polynomial Regression for Display.JMP

Model                      RMSE
Linear                     51.59
log x                      41.31
1/x                        40.04
√x                         46.02
Fourth order polynomial    37.79

The fourth order polynomial is the best polynomial regression model according to the criterion on the "Choosing the order in polynomial regression" slide.

The fourth order polynomial is the best model overall: it has the smallest RMSE by a considerable amount (more than a 1% advantage over the best transformation, 1/x).