19.0 Practical Issues in Regression
• Answer Questions
• Nonparametric Regression
• Residual Plots
• Extrapolation
19.1 Nonparametric Regression
Recall that the multiple linear regression model is
Y = β0 + β1X1 + . . . + βpXp + ε
where IE[ε] = 0, Var[ε] = σ², and the ε are independent.
The model is useful because:
• it is interpretable—the effect of each explanatory variable is captured
by a single coefficient
• statistical theory supports inference, and prediction is easy
• simple interactions and transformations are easy (how?)
• dummy variables allow use of categorical information
• computation is fast.
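To illustrate the model and the speed of computation, here is a minimal sketch that fits a multiple linear regression by least squares. The data are simulated for illustration; the true coefficients are an assumption of the example, not from the notes.

```python
import numpy as np

# Simulate Y = b0 + b1*X1 + b2*X2 + noise, then recover the coefficients
# by ordinary least squares. All numbers here are illustrative choices.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(size=(n, 2))
beta_true = np.array([1.0, 2.0, -3.0])           # b0, b1, b2 (assumed)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.5, n)

A = np.column_stack([np.ones(n), X])             # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)                                  # roughly [1, 2, -3]
```

Each coefficient estimate captures the effect of one explanatory variable, holding the others fixed, which is the interpretability claim above.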
We extended the multiple linear regression model to nonlinear regression,
in which we fit a model of the form:
g0(Y) = β0 + β1g1(X1) + . . . + βpgp(Xp) + ε
where the gi are known transformations of the data, such as log Y or
1/x1, and, as before, IE[ε] = 0, Var[ε] = σ², and the ε are independent.
This model can be further extended to nonparametric regression, in
which case one does not know the functions g1, . . . , gp but instead must
estimate these by smoothing the data.
In applications, the linear regression model is usually only a locally
correct approximation. And it is rare that one has a strong theoretical
model that prescribes specific nonlinear transformations. Thus
nonparametric regression is a practical tool in many cases.
As a running example for the next several pages, assume we have data
generated from the following function by adding N(0, 0.25) noise.
The x values were chosen to be spaced out at the left and right sides of
the domain, and the raw data are shown below.
19.1.1 Bin Smoothing
Here one partitions the x-axis into disjoint bins; e.g., take
{[i, i + 1), i ∈ Z}. Within each bin average the Y values to obtain a
smooth that is a step function.
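A minimal sketch of a bin smoother, using simulated data (the sine function and noise level are illustrative assumptions, not the function from the running example):

```python
import numpy as np

# Bin smoother: partition the x-axis into unit-width bins [i, i+1) and
# replace each y by the mean of the y-values falling in its bin.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 5, 100))
y = np.sin(x) + rng.normal(0, 0.5, 100)

bins = np.floor(x).astype(int)                  # point in [i, i+1) gets label i
smooth = np.array([y[bins == b].mean() for b in bins])
# 'smooth' is a step function: constant within each bin, jumping at integers.
```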
19.1.2 Moving Averages
Moving averages use variable bins containing a fixed number of
observations, rather than fixed-width bins with a variable number of
observations. They tend to wiggle near the center of the data, but flatten
out near the boundary of the data.
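The moving average can be sketched as follows; the window size k is an illustrative choice, and the data are again simulated:

```python
import numpy as np

# Moving-average smoother: each fitted value is the mean of the y's in a
# window of (up to) k nearby observations, so bins have variable width.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 5, 100))
y = np.sin(x) + rng.normal(0, 0.5, 100)

k = 15                                           # window size (assumption)
half = k // 2
smooth = np.array([y[max(0, i - half): i + half + 1].mean()
                   for i in range(len(y))])
# Near the boundary the window is truncated on one side, which is why the
# moving average flattens out there.
```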
19.1.3 Running Line
This improves on the moving average by fitting a line rather than an
average to the data within a variable-width bin. But it still tends to be
rough.
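A sketch of the running-line smoother, under the same illustrative simulated data and window size as before:

```python
import numpy as np

# Running-line smoother: within each window of nearby points, fit a line
# by least squares and evaluate it at the center point x[i].
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 5, 100))
y = np.sin(x) + rng.normal(0, 0.5, 100)

half = 7                                         # half-window (assumption)
smooth = np.empty_like(y)
for i in range(len(y)):
    lo, hi = max(0, i - half), i + half + 1
    b, a = np.polyfit(x[lo:hi], y[lo:hi], 1)     # slope, intercept
    smooth[i] = a + b * x[i]
```

Fitting a line instead of a constant reduces the boundary flattening of the moving average, though the result is still rough.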
Regression becomes much harder as the number of explanatory variables
increases. This is called the Curse of Dimensionality (COD). The term
was coined by Richard Bellman in the context of approximation theory.
The COD applies to all multivariate regressions that do not impose
strong modeling assumptions—especially the nonparametric regressions,
but also those in which one tests whether a specific variable or
transformed variable should be included in the model.
In terms of the sample size n and dimension p, the COD has three nearly
equivalent descriptions:
• For fixed n, as p increases, the data become sparse.
• As p increases, the number of possible models explodes.
• For large p, most datasets are multicollinear.
For the sparsity description of the COD, let n points be uniformly
distributed in the unit cube in IRp. What is the side-length ℓ of a subcube
that is expected to contain a fraction d of the data? Ans: ℓ = d^(1/p).
This means that for large p, the amount of local information available
to fit the bumps and wiggles of the regression function is too small.
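The calculation ℓ = d^(1/p) follows because a subcube of side ℓ has volume ℓ^p, and under the uniform distribution the expected fraction of points it contains equals its volume. A quick numerical check:

```python
# To capture a fraction d of uniform data in the unit cube in R^p,
# a subcube must satisfy l**p = d, i.e. l = d**(1/p).
d = 0.01                      # want 1% of the data
for p in (1, 2, 10, 100):
    print(p, d ** (1 / p))    # e.g. p = 10 gives l ≈ 0.63; p = 100 gives ≈ 0.96
```

So in high dimensions, even a "local" neighborhood holding 1% of the data spans nearly the whole range of every coordinate: nothing is local.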
To explain the model explosion aspect, suppose we restrict attention to
just linear models of degree 2 or fewer. For p = 1 these are:
IE[Y] = β0    IE[Y] = β1x1    IE[Y] = β2x1²
IE[Y] = β0 + β1x1    IE[Y] = β0 + β2x1²    IE[Y] = β1x1 + β2x1²
IE[Y] = β0 + β1x1 + β2x1²
For p = 2 this set is extended to include expressions with the terms α1x2,
α2x2², and γ12x1x2.
For general p, combinatorics shows that the number of possible models is
2^(1 + 2p + C(p,2)) − 1,
where the exponent counts the intercept, the p linear terms, the p squared
terms, and the C(p,2) = p(p−1)/2 cross-product terms.
This increases superexponentially in p, and there is not enough sample
to enable the data to discriminate among these models.
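The count can be sketched and sanity-checked in a few lines; each model is a nonempty subset of the available terms:

```python
from math import comb

# Number of linear models of degree <= 2 in p variables.
# Terms: intercept (1), linear (p), squares (p), cross products C(p, 2);
# each nonempty subset of terms is a distinct model.
def n_models(p):
    return 2 ** (1 + 2 * p + comb(p, 2)) - 1

print(n_models(1))   # 7, matching the enumeration for p = 1
print(n_models(2))   # 63
print(n_models(10))  # already astronomically large
```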
For the multicollinearity issue, we note that multicollinearity occurs
when two or more of the explanatory variables are highly correlated. This
implies that the predictive value of the fitted model breaks down quickly
as one moves away from the subspace in which the data concentrate.
In this class, we shall agree that multicollinearity occurs whenever
the absolute value of the correlation between two of the explanatory
variables exceeds 0.9. But this is a judgment call, and one can have
multicollinearity that arises in more complex ways.
For large p with finite n, it is almost certain that two explanatory
variables will have high correlation, just by chance.
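This chance effect is easy to demonstrate by simulation. In the sketch below the column count p and sample size n are illustrative choices; the columns are generated independently, yet some pair is nonetheless highly correlated:

```python
import numpy as np

# With many independent noise variables and a small sample, the maximum
# pairwise correlation is large purely by chance.
rng = np.random.default_rng(4)
n, p = 20, 200
X = rng.normal(size=(n, p))                  # columns are truly independent
R = np.corrcoef(X, rowvar=False)
off = np.abs(R[np.triu_indices(p, k=1)])     # off-diagonal correlations
print(off.max())                             # large despite independence
```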
19.2 Variable Selection
One wants to select a multiple regression model that only includes useful
variables. Some methods are:
• Forward Selection. One starts with no variables in the model, and
sequentially adds the one that best explains the current residuals
(or the raw data, at the initial step). One stops when none of the
remaining variables provide significant explanation.
• Backwards Elimination. One starts with all the variables in the model,
and sequentially removes the variable that explains the least, until a
t-test shows that no further variables should be removed.
• Stepwise Regression. Alternate use of forward selection and
backwards elimination.
None of these is bulletproof.
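Forward selection can be sketched as follows. The simulated data, the greedy criterion (residual sum of squares), and the crude stopping rule are all illustrative assumptions; real implementations typically use F- or t-tests, as described above.

```python
import numpy as np

# Forward selection: greedily add the variable that most reduces the
# residual sum of squares (RSS), stopping when the improvement is small.
rng = np.random.default_rng(5)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(0, 0.5, n)  # only x0, x2 matter

def rss(cols):
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return ((y - A @ beta) ** 2).sum()

selected, current = [], rss([])
while len(selected) < p:
    best = min((j for j in range(p) if j not in selected),
               key=lambda j: rss(selected + [j]))
    if current - rss(selected + [best]) < 0.05 * current:  # crude stop rule
        break
    selected.append(best)
    current = rss(selected)
print(selected)   # expected to pick columns 2 and 0 first
```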
19.3 Cross-Validation
To assess model fit in complex, computer-intensive situations, the ideal
strategy is to hold out a random portion of the data, fit a model to the
rest, then use the fitted model to predict the response values from the
values of the explanatory variables in the hold-out sample.
This allows a straightforward estimate of the error in prediction using
regression. But we usually need to compare fits among many models.
If the same hold-out sample is re-used, then the comparisons are not
independent and (worse) the model selection process will tend to choose
a model that overfits the hold-out sample, causing spurious optimism.
Cross-validation is a procedure that balances the need to use data to
select a model and the need to use data to assess prediction.
Specifically, v-fold cross-validation is as follows:
• randomly divide the sample into v portions;
• for i = 1, . . . , v, hold out portion i and fit the model from the rest of
the data;
• for i = 1, . . . , v, use the fitted model to predict the hold-out sample;
• average the prediction mean squared error (PMSE) over the v different fits.
One repeats these steps (including the random division of the sample!)
each time a new model is assessed.
The choice of v requires judgment. Often v = 10.
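The steps above can be sketched as follows, here for a simple line fit with simulated data (the model and noise level are illustrative assumptions):

```python
import numpy as np

# v-fold cross-validation: randomly split the sample into v folds, fit on
# v-1 folds, predict the held-out fold, and average the PMSE over folds.
rng = np.random.default_rng(6)
n = 100
x = rng.uniform(0, 5, n)
y = 1 + 2 * x + rng.normal(0, 0.5, n)

v = 10
idx = rng.permutation(n)                      # random division of the sample
folds = np.array_split(idx, v)

pmse = 0.0
for hold in folds:
    train = np.setdiff1d(idx, hold)
    b, a = np.polyfit(x[train], y[train], 1)  # fit on the remaining folds
    pred = a + b * x[hold]                    # predict the hold-out fold
    pmse += ((y[hold] - pred) ** 2).mean()
pmse /= v
print(pmse)   # roughly the noise variance, 0.25
```

To compare models, one repeats the whole procedure, including the random division, for each candidate model.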
19.5 Case Study
You should never believe your model. Personally, I’m sometimes willing
to believe the binomial model applies, but for nearly every other
situation, the mechanisms that generate the data just do not quite match
the simple assumptions that underlie the named probability distributions.
George Box said:
All models are wrong, but some are useful.
Economists look at a lot of data and often attempt to fit models to it.
Be wary. Always plot your data. One can do goodness-of-fit tests to see
whether the data conform with a particular model, but this has dangers
too, especially with very large samples.
When deciding on which model to use to describe a data set, one should
consider:
• Do you believe that a simple, single probability distribution
generated the data?
• Do the data have some natural support set? (The support set is the
set on which the probability mass function or density function is
positive.)
• Do you believe the data are roughly symmetrically distributed about
the mean? Or is there skewness?
• Will the data have “fat tails”? (That is, are there likely to be some
exceptionally large or small values, compared to what one would see
in a sample from a normal distribution?)
• Do you understand the measurement process that acquired the data?
Beware of premature framing of a problem.
In January 1985, a team of engineers at Morton Thiokol was tasked
to study O-ring failures in Challenger launches. They were given
information on all the launches in which O-ring failures occurred, and
related data on temperature, manufacturing history, and so forth.
The engineers looked at all the variables. Temperature did not stand out.
On January 28, 1986, when the executives at Morton Thiokol were asked
by NASA whether they objected to greenlighting the launch given the
unusual cold weather at Cape Kennedy, they contacted their engineers
and asked their opinion.
The engineers, led by Roger Boisjoly, were nervous and tried to stop
the flight. The Morton Thiokol management agreed that the issue was
serious enough to recommend delaying the flight, and they arranged a
telephone conference with NASA. However, during the call, the Morton
Thiokol managers asked for a few minutes off the phone to discuss their
final position again.
The Morton Thiokol managers decided to advise NASA that their data
was inconclusive. NASA asked if there were objections. Hearing none,
the decision to launch was made.
The engineers should have looked at all the data, not just the data on
failures.
Roger Boisjoly was one of the witnesses at the Rogers Commission. After
the Commission gave its findings, Boisjoly found himself shunned by
colleagues and managers and he resigned from Morton Thiokol.
Subsequently, Roger Boisjoly wrote:
... [S]ome may argue that sufficient funds or schedule were
not available and that may be so, but MTI contracted for
that condition. The Shuttle program was declared operational
by NASA after the fourth flight, but the technical problems
in producing and maintaining the reusable boosters were
escalating rapidly as the program matured, instead of
decreasing as one would normally expect. Many opportunities
were available to structure the work force for corrective action,
but the MTI Management style would not let anything compete
or interfere with the production and shipping of boosters.