Fixing problems with the model

Transforming the data so that the simple linear regression model is okay for the transformed data.

Options for fixing problems with the model

• Abandon the simple linear regression model and find a more appropriate, but typically more complex, model.

• Transform the data so that the simple linear regression model works for the transformed data.

Abandoning the model

• If not linear: try a different function, like a quadratic (Ch. 7) or an exponential function (Ch. 13).

• If unequal error variances: use weighted least squares (Ch. 10).

• If error terms are not independent: try fitting a time series model (Ch. 12).

• If important predictor variables are omitted: try fitting a multiple regression model (Ch. 6).

• If there are outliers: use a robust estimation procedure (Ch. 10).

Choices for transforming the data

• Transform X values only.

• Transform Y values only.

• Transform both the X and the Y values.

Transforming the X values only

• Appropriate when non-linearity is the only problem with the model, that is, when the normality and equal variance assumptions appear okay.

• Transforming the Y values would likely change the well-behaved error terms into badly-behaved error terms.

Memory retention

time     prop
1        0.84
5        0.71
15       0.61
30       0.56
60       0.54
120      0.47
240      0.45
480      0.38
720      0.36
1440     0.26
2880     0.20
5760     0.16
10080    0.08

• Subjects were asked to memorize a list of disconnected items, then asked to recall them at various times up to a week later.

• Predictor time = time, in minutes, since the subject initially memorized the list.

• Response prop = proportion of items recalled correctly.

Example 1: Fitted line plot (figure: prop versus time)

prop = 0.525870 - 0.0000557 time

S = 0.152284, R-Sq = 57.1%, R-Sq(adj) = 53.2%

Example 1: Residuals vs. fits plot (figure: residuals versus fitted values; response is prop)

Example 1: Normal probability plot of the residuals (RESI1)

N = 13, Average = -0.0000000, StDev = 0.145801

W-test for Normality: R = 0.9751, P-Value (approx) > 0.1000

Example 1: Transform the X values

time     prop    log10_time
1        0.84    0.00000
5        0.71    0.69897
15       0.61    1.17609
30       0.56    1.47712
60       0.54    1.77815
120      0.47    2.07918
240      0.45    2.38021
480      0.38    2.68124
720      0.36    2.85733
1440     0.26    3.15836
2880     0.20    3.45939
5760     0.16    3.76042
10080    0.08    4.00346

Change (“transform”) the predictor time to log10(time).
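
The same transformation can be sketched outside Minitab. The block below is a minimal illustration in Python with NumPy (an assumption; the slides themselves use Minitab). It refits the line before and after replacing time with log10(time), using the thirteen (time, prop) pairs from the memory retention data. The helper name fit_line is illustrative only and is not part of any library.

```python
import numpy as np

# Memory retention data from the slides: time in minutes, proportion recalled.
time = np.array([1, 5, 15, 30, 60, 120, 240, 480, 720,
                 1440, 2880, 5760, 10080], dtype=float)
prop = np.array([0.84, 0.71, 0.61, 0.56, 0.54, 0.47, 0.45,
                 0.38, 0.36, 0.26, 0.20, 0.16, 0.08])

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1, r_squared)."""
    b1, b0 = np.polyfit(x, y, 1)  # polyfit returns the highest-degree term first
    resid = y - (b0 + b1 * x)
    r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    return b0, b1, r2

# Fit on the original predictor, then on the log10-transformed predictor.
b0_raw, b1_raw, r2_raw = fit_line(time, prop)
b0_log, b1_log, r2_log = fit_line(np.log10(time), prop)

print(f"time:        prop = {b0_raw:.4f} + ({b1_raw:.7f}) time, R-Sq = {r2_raw:.3f}")
print(f"log10(time): prop = {b0_log:.4f} + ({b1_log:.4f}) log10time, R-Sq = {r2_log:.3f}")
```

Only the predictor is transformed here; the response prop is left on its original scale, so the error terms are not disturbed.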

Example 1: Fitted line plot using transformed X values (figure: prop versus log10time)

prop = 0.846415 - 0.182427 log10time

S = 0.0233881, R-Sq = 99.0%, R-Sq(adj) = 98.9%

Example 1: Residuals vs. fits plot using transformed X values (figure: residuals versus fitted values; response is prop)

Example 1: Normal probability plot using transformed X values (residuals RESI1)

N = 13, Average = -0.0000000, StDev = 0.0223924

W-test for Normality: R = 0.9786, P-Value (approx) > 0.1000

Example 1: Predicting new proportion

Estimated regression function:

$\hat{Y} = 0.846 - 0.182 \log_{10}(\text{time})$

Therefore, we predict the proportion of words recalled after 1000 minutes is:

$\hat{Y} = 0.846 - 0.182 \log_{10}(1000) = 0.846 - 0.182(3) = 0.30$

Example 1: Predicting new proportion (Minitab output)

Predicted Values for New Observations

New Obs   Fit     SE Fit    95.0% CI          95.0% PI
1         0.299   0.00765   (0.282, 0.316)    (0.245, 0.353)

Values of Predictors for New Observations

New Obs   log10tim
1         3.00

We can be 95% confident that a person will recall between 24.5% and 35.3% of the words after 1000 minutes.
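
The two intervals can be checked by hand from the reported Fit, SE Fit, and S values. The sketch below does that arithmetic in Python (an assumption; the slides use Minitab). The critical value 2.201 is not shown in the slides; it is the 0.975 quantile of the t distribution with n - 2 = 11 degrees of freedom and is hard-coded here as an assumption.

```python
import math

fit = 0.299        # predicted proportion at log10(time) = 3 (Minitab "Fit")
se_fit = 0.00765   # standard error of the fit (Minitab "SE Fit")
s = 0.0233881      # S from the regression of prop on log10time
t_crit = 2.201     # assumed: t quantile 0.975 with 13 - 2 = 11 degrees of freedom

# Confidence interval for the mean response uses SE Fit alone.
ci = (fit - t_crit * se_fit, fit + t_crit * se_fit)

# Prediction interval for one new response adds the error variance s^2.
se_pred = math.sqrt(s**2 + se_fit**2)
pi = (fit - t_crit * se_pred, fit + t_crit * se_pred)

print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI: ({pi[0]:.3f}, {pi[1]:.3f})")
```

The prediction interval is wider than the confidence interval because it has to cover the scatter of an individual response, not just the uncertainty in the estimated mean.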

Transforming the Y values only

• Appropriate when non-normality and/or unequal variances are the problems.

• The transformation on Y may also help to “straighten out” a curved relationship.

Gestation time and birth weight for mammals

Mammal      Birthwgt   Gestation
Goat        2.75       155
Sheep       4.00       175
Deer        0.48       190
Porcupine   1.50       210
Bear        0.37       213
Hippo       50.00      243
Horse       30.00      340
Camel       40.00      380
Zebra       40.00      390
Giraffe     98.00      457
Elephant    113.00     670

• Predictor Birthwgt = birth weight, in kg, of mammal.

• Response Gestation = number of days until birth

Example 2: Fitted line plot (figure: Gestation versus Birthwgt)

Gestation = 187.084 + 3.59137 Birthwgt

S = 66.0943, R-Sq = 83.9%, R-Sq(adj) = 82.1%

Example 2: Residuals vs. fits plot (figure: residuals versus fitted values; response is Gestation)

Example 2: Normal probability plot of the residuals (RESI1)

N = 11, Average = -0.0000000, StDev = 62.7025

W-test for Normality: R = 0.9703, P-Value (approx) > 0.1000

Example 2: Transform the Y values

Mammal      Birthwgt   Gestation   log10Gest
Goat        2.75       155         2.19033
Sheep       4.00       175         2.24304
Deer        0.48       190         2.27875
Porcupine   1.50       210         2.32222
Bear        0.37       213         2.32838
Hippo       50.00      243         2.38561
Horse       30.00      340         2.53148
Camel       40.00      380         2.57978
Zebra       40.00      390         2.59106
Giraffe     98.00      457         2.65992
Elephant    113.00     670         2.82607

Change (“transform”) the response Gestation to log10(Gestation).
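
As a rough illustration of this step, the sketch below applies log10 to the response and refits the line in Python with NumPy (an assumption; the slides use Minitab). The data are the eleven mammals listed in the table; the variable names are illustrative.

```python
import numpy as np

# Birth weight (kg) and gestation (days) for the eleven mammals in the slides.
birthwgt = np.array([2.75, 4.00, 0.48, 1.50, 0.37, 50.0,
                     30.0, 40.0, 40.0, 98.0, 113.0])
gestation = np.array([155, 175, 190, 210, 213, 243,
                      340, 380, 390, 457, 670], dtype=float)

# Transform the response only; the predictor stays on its original scale.
log10_gest = np.log10(gestation)

# Least squares fit of log10(Gestation) on Birthwgt.
slope, intercept = np.polyfit(birthwgt, log10_gest, 1)

print(f"log10Gest = {intercept:.5f} + {slope:.7f} Birthwgt")
```

Because the response is now on a log10 scale, any prediction from this line has to be raised to the power of 10 to get back to days.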

Example 2: Fitted line plot using transformed Y values (figure: log10Gest versus Birthwgt)

log10Gest = 2.29256 + 0.0045211 Birthwgt

S = 0.0939425, R-Sq = 80.3%, R-Sq(adj) = 78.1%

Example 2: Residuals vs. fits plot using transformed Y values (figure: residuals versus fitted values; response is log10Gest)

Example 2: Normal probability plot using transformed Y values (residuals RESI2)

N = 11, Average = -0.0000000, StDev = 0.0891217

W-test for Normality: R = 0.9743, P-Value (approx) > 0.1000

Example 2: Predicting new gestation

Estimated regression function:

$\log_{10}(\widehat{Gest}) = 2.29 + 0.0045\,\text{Birthwgt}$

Therefore, since:

$\log_{10}(\widehat{Gest}) = 2.29 + 0.0045(50) = 2.515$

we predict the gestation length of another mammal weighing 50 kg to be:

$\widehat{Gest} = 10^{\log_{10}(\widehat{Gest})} = 10^{2.515} = 327.3$

Example 2: Predicting new gestation (Minitab output)

Predicted Values for New Observations

New Obs   Fit      SE Fit   95.0% CI            95.0% PI
1         2.5186   0.0306   (2.4494, 2.5878)    (2.2951, 2.7421)

Values of Predictors for New Observations

New Obs   Birthwgt
1         50.0

$10^{2.2951} = 197.3$

$10^{2.7421} = 552.2$

We can be 95% confident that the gestation length for a new mammal weighing 50 kg will be between 197.3 and 552.2 days.
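
The back-transformation is plain arithmetic: each log10-scale value is used as an exponent of 10. A short sketch in Python (an assumption; the slides use Minitab), using the Fit and 95% PI values from the Minitab output:

```python
# Values reported on the log10(Gestation) scale for Birthwgt = 50.
fit_log10 = 2.5186
pi_log10 = (2.2951, 2.7421)

# Undo the log10 transformation to return to days.
fit_days = 10 ** fit_log10
pi_days = tuple(10 ** v for v in pi_log10)

print(f"predicted gestation: {fit_days:.1f} days")
print(f"95% PI: ({pi_days[0]:.1f}, {pi_days[1]:.1f}) days")
```

The point prediction here starts from Minitab's unrounded fit of 2.5186, so it comes out slightly above the 327.3 days obtained from the rounded coefficients 2.29 and 0.0045.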

Transforming both the X and Y values

• Appropriate when the error terms are not normal, have unequal variances, and the function is not linear.

• Transforming the Y values corrects the problems with the error terms (and may help the non-linearity).

• Transforming the X values corrects the non-linearity.

Diameter (inches) and volume (cu. ft.) of 70 shortleaf pines

Example 3: Fitted line plot (figure: Volume versus Diameter)

Volume = -41.5681 + 6.83672 Diameter

S = 9.87485, R-Sq = 89.3%, R-Sq(adj) = 89.1%

Example 3: Residuals vs. fits plot (figure: standardized residuals versus fitted values; response is Volume)

Example 3: Normal probability plot of the standardized residuals (SRES1)

N = 70, Average = 0.0085024, StDev = 1.02852

W-test for Normality: R = 0.9409, P-Value (approx) < 0.0100

Example 3: Transform the Y values only

Diameter   Volume   logVol
4.4        2.0      0.69315
4.6        2.2      0.78846
5.0        3.0      1.09861
5.1        4.3      1.45862
5.1        3.0      1.09861
5.2        2.9      1.06471
5.2        3.5      1.25276
5.5        3.4      1.22378
5.5        5.0      1.60944
5.6        7.2      1.97408
5.9        6.4      1.85630
5.9        5.6      1.72277
7.5        7.7      2.04122
7.6        10.3     2.33214
… and so on …

Transform the response volume to loge(volume).

Example 3: Fitted line plot using transformed Y values (figure: logVol versus Diameter)

logVol = 0.451703 + 0.239531 Diameter

S = 0.322919, R-Sq = 90.5%, R-Sq(adj) = 90.4%

Example 3: Residuals vs. fits plot using transformed Y values (figure: standardized residuals versus fitted values; response is logVol)

Example 3: Normal probability plot using transformed Y values (standardized residuals SRES4)

N = 70, Average = -0.0077969, StDev = 1.01888

W-test for Normality: R = 0.9610, P-Value (approx) < 0.0100

Example 3: Transform both the X and Y values

Diameter   Volume   logDiam   logVol
4.4        2.0      1.48160   0.69315
4.6        2.2      1.52606   0.78846
5.0        3.0      1.60944   1.09861
5.1        4.3      1.62924   1.45862
5.1        3.0      1.62924   1.09861
5.2        2.9      1.64866   1.06471
5.2        3.5      1.64866   1.25276
5.5        3.4      1.70475   1.22378
5.5        5.0      1.70475   1.60944
5.6        7.2      1.72277   1.97408
5.9        6.4      1.77495   1.85630
5.9        5.6      1.77495   1.72277
7.5        7.7      2.01490   2.04122
7.6        10.3     2.02815   2.33214
… and so on …

Transform the predictor diameter to loge(diameter).

Transform the response volume to loge(volume).

Example 3: Fitted line plot using transformed X and Y values (figure: logVol versus logDiam)

logVol = -2.87179 + 2.56442 logDiam

S = 0.170263, R-Sq = 97.4%, R-Sq(adj) = 97.3%
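
A line fitted on the log-log scale corresponds to a power curve on the original scale. The sketch below rewrites the reported fit, logVol = -2.87179 + 2.56442 logDiam, in that form using Python (an assumption; the slides use Minitab). The function name predict_volume and the example diameter of 10 inches are illustrative and do not come from the slides.

```python
import math

# Coefficients reported for the fit of loge(Volume) on loge(Diameter).
b0 = -2.87179
b1 = 2.56442

def predict_volume(diameter):
    """Back-transform the log-log fit to a volume in cubic feet."""
    return math.exp(b0 + b1 * math.log(diameter))

# Equivalent power form: Volume = alpha * Diameter ** b1, with alpha = exp(b0).
alpha = math.exp(b0)

d = 10.0  # hypothetical diameter in inches
print(f"power form: Volume = {alpha:.4f} * Diameter^{b1}")
print(f"predicted volume at {d} inches: {predict_volume(d):.2f} cu. ft.")
```

A prediction back-transformed this way is usually read as an estimate of the median volume at that diameter rather than the mean.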

Example 3: Residual plot using transformed X and Y values (figure: standardized residuals versus fitted values; response is logVol)

Example 3: Normal probability plot using transformed X and Y values (standardized residuals SRES5)

N = 70, Average = -0.0028401, StDev = 1.00930

W-test for Normality: R = 0.9896, P-Value (approx) > 0.1000

Transformation strategies

Effects of transformations

• Transforming the Y values corrects the problems with the error terms – and may simultaneously help non-linearity.

• Transforming the X values can only correct non-linearity.

Transformation strategies

• If the form of the relationship between x and y is known, then it may be possible to find a linearizing transformation analytically.

• Fitting a regression model empirically generally requires trial and error – try different transformations to see which does best.

Finding a linearizing transformation analytically

Knowing functional relationship is of the power form

If the relationship between x and y is of the power form:

$y = \alpha x^{\beta}$

taking the log of both sides transforms it into a linear form:

$\log_e y = \log_e \alpha + \beta \log_e x$

Knowing functional relationship is of the exponential form

If the relationship between x and y is of the exponential form:

$y = \alpha e^{\beta x}$

taking the log of both sides transforms it into a linear form:

$\log_e y = \log_e \alpha + \beta x$
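
Both linearizations can be checked numerically. The sketch below uses Python with NumPy (an assumption; the slides use Minitab) on noise-free synthetic data. The constants alpha = 2.0 and beta = 1.5 are hypothetical values chosen for illustration; they stand for the multiplier and the exponent in the power and exponential forms.

```python
import numpy as np

x = np.linspace(1.0, 10.0, 25)
alpha, beta = 2.0, 1.5  # hypothetical constants, not taken from the slides

# Power form y = alpha * x**beta: regress log(y) on log(x).
y_power = alpha * x**beta
slope_pow, icept_pow = np.polyfit(np.log(x), np.log(y_power), 1)

# Exponential form y = alpha * exp(beta * x): regress log(y) on x itself.
y_exp = alpha * np.exp(beta * x)
slope_exp, icept_exp = np.polyfit(x, np.log(y_exp), 1)

# In both cases the slope recovers beta and exp(intercept) recovers alpha.
print(f"power form:       alpha = {np.exp(icept_pow):.4f}, beta = {slope_pow:.4f}")
print(f"exponential form: alpha = {np.exp(icept_exp):.4f}, beta = {slope_exp:.4f}")
```

The difference between the two cases is which side is logged: the power form needs both x and y transformed, while the exponential form needs only y transformed.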

Finding a transformation by trial and error

Family of power transformations

The most common transformation involves transforming the response by raising it to some power λ. That is:

$y' = y^{\lambda}$

Most commonly, for interpretation reasons, λ is a number between -1 and 2, such as -1, -0.5, 0, 0.5, 1 (no transformation), 1.5, and 2.

When λ = 0, the transformation is taken to be the log transformation. That is:

$y' = \log_e y$
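
The family can be written as one small function. The sketch below is an illustration in Python with NumPy (an assumption; the slides use Minitab); the name power_transform is hypothetical, and the requirement that the response be strictly positive is a simplifying assumption so that the log, square root, and reciprocal cases are all defined.

```python
import numpy as np

def power_transform(y, lam):
    """Return y**lam, with lam = 0 taken to mean the natural log of y."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        # Assumption of this sketch: only strictly positive responses are handled.
        raise ValueError("y must be strictly positive")
    return np.log(y) if lam == 0 else y**lam

y = [1.0, 4.0, 9.0]
for lam in (-1, -0.5, 0, 0.5, 1, 1.5, 2):
    print(lam, power_transform(y, lam))
```

In a trial-and-error search, each candidate λ would be applied to the response, the line refitted, and the residual plots checked again.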

Effect of loge transformation

(Two figures: the natural log function f(x) = loge(x), plotted for x from 0 to 1000 and for x from 0 to 5.)

Some guidelines for specifying λ

• To make smaller values more spread out, use a smaller λ.

• To make larger values more spread out, use a larger λ.
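
One way to see these guidelines is to compare the gaps between transformed values. The sketch below is a small Python illustration (an assumption; the slides use Minitab) on three made-up values, 1, 10, and 100. It reports the gap between the two smaller values relative to the gap between the two larger ones for several choices of λ.

```python
import numpy as np

y = np.array([1.0, 10.0, 100.0])  # illustrative values, not from the slides

def power_transform(values, lam):
    """y**lam, with lam = 0 taken to mean the natural log."""
    return np.log(values) if lam == 0 else values**lam

gap_ratio = {}
for lam in (2, 1, 0.5, 0):
    t = power_transform(y, lam)
    lower_gap, upper_gap = np.diff(t)  # gaps 1 -> 10 and 10 -> 100 after transforming
    gap_ratio[lam] = lower_gap / upper_gap
    print(f"lambda = {lam}: lower gap / upper gap = {gap_ratio[lam]:.3f}")
```

The ratio grows as λ shrinks, meaning the smaller values take up relatively more of the transformed scale; with larger λ the larger values dominate instead.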

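Possible transformations

(Four figures, each showing a curved relationship between x and y together with transformations that may straighten it; the curve shapes are not reproduced here.)

• Transform x to $x^2$ or $x^3$, or transform y to $\sqrt{y}$, $\log y$, or $1/y$.

• Transform y to $y^2$ or $y^3$, or transform x to $\sqrt{x}$, $\log x$, or $1/x$.

• Transform y to $\sqrt{y}$, $\log y$, or $1/y$, or transform x to $\sqrt{x}$, $\log x$, or $1/x$.

• Transform x to $x^2$ or $x^3$, or transform y to $y^2$ or $y^3$.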

Variance stabilizing transformations

Common variance stabilizing transformations

If the response is a Poisson count, so that the variance is proportional to the mean, use the square root transformation:

$y' = \sqrt{y} = y^{1/2}$

If the response is a binomial proportion, use the arcsine square root transformation:

$\hat{p}' = \sin^{-1}\left(\sqrt{\hat{p}}\right)$

If the variance is proportional to the mean squared, use the natural log transformation:

$y' = \log_e y$

If the variance is proportional to the mean to the fourth power, use the reciprocal transformation:

$y' = 1/y$
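
The Poisson case is easy to check by simulation. The sketch below uses Python with NumPy (an assumption; the slides use Minitab) and draws Poisson counts at three hypothetical means, 10, 50, and 200, then compares the variance of the raw counts with the variance of their square roots. The seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
means = [10, 50, 200]            # hypothetical Poisson means

raw_var = {}
sqrt_var = {}
for m in means:
    counts = rng.poisson(lam=m, size=100_000)
    raw_var[m] = counts.var()             # grows roughly in proportion to the mean
    sqrt_var[m] = np.sqrt(counts).var()   # stays roughly constant
    print(f"mean = {m}: var(y) = {raw_var[m]:.2f}, var(sqrt(y)) = {sqrt_var[m]:.3f}")
```

On the raw scale the variance tracks the mean, while on the square root scale it settles near one quarter at every mean, which is the sense in which the transformation stabilizes the variance.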

Transforming data in Minitab

• Select Calc >> Calculator …

• In the box labeled “Store result in variable,” tell Minitab in which column (variable) you want the transformed data stored.

• Type (input) the expression for the desired transformation in the box labeled Expression. (Use the available functions.)

• Select OK. The data will appear in the column of the worksheet that you specified.