19.0 Practical Issues in Regression
• Answer Questions
• Nonparametric Regression
• Residual Plots
• Extrapolation
19.1 Nonparametric Regression
Recall that the multiple linear regression model is
Y = β0 + β1X1 + . . . + βpXp + ε
where IE[ε] = 0, Var[ε] = σ², and the ε are independent.
The model is useful because:
• it is interpretable—the effect of each explanatory variable is captured
by a single coefficient
• statistical theory supports inference, and prediction is easy
• simple interactions and transformations are easy (how?)
• dummy variables allow use of categorical information
• computation is fast.
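To illustrate the model and the speed of computation, here is a minimal sketch that fits a multiple linear regression by least squares. The data are simulated for illustration; the true coefficients are an assumption of the example, not from the notes.

```python
import numpy as np

# Simulate Y = b0 + b1*X1 + b2*X2 + noise, then recover the coefficients
# by ordinary least squares. All numbers here are illustrative choices.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(size=(n, 2))
beta_true = np.array([1.0, 2.0, -3.0])           # b0, b1, b2 (assumed)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 0.5, n)

A = np.column_stack([np.ones(n), X])             # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)                                  # roughly [1, 2, -3]
```

Each coefficient estimate captures the effect of one explanatory variable, holding the others fixed, which is the interpretability claim above.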
We extended the multiple linear regression model to nonlinear regression,
in which we fit a model of the form:
g0(Y) = β0 + β1g1(X1) + . . . + βpgp(Xp) + ε
where the gi are known transformations of the data, such as log Y or
1/x1, and, as before, IE[ε] = 0, Var[ε] = σ², and the ε are independent.
This model can be further extended to nonparametric regression, in
which case one does not know the functions g1, . . . , gp but instead must
estimate these by smoothing the data.
In applications, the linear regression model is usually only a locally
correct approximation. And it is rare that one has a strong theoretical
model that prescribes specific nonlinear transformations. Thus
nonparametric regression is a practical tool in many cases.
As a running example for the next several pages, assume we have data
generated from the following function by adding N(0, 0.25) noise.
The x values were chosen to be spaced out at the left and right sides of
the domain, and the raw data are shown below.
19.1.1 Bin Smoothing
Here one partitions the x-axis into disjoint bins; e.g., take
{[i, i + 1), i ∈ Z}. Within each bin average the Y values to obtain a
smooth that is a step function.
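A minimal sketch of a bin smoother, using simulated data (the sine function and noise level are illustrative assumptions, not the function from the running example):

```python
import numpy as np

# Bin smoother: partition the x-axis into unit-width bins [i, i+1) and
# replace each y by the mean of the y-values falling in its bin.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 5, 100))
y = np.sin(x) + rng.normal(0, 0.5, 100)

bins = np.floor(x).astype(int)                  # point in [i, i+1) gets label i
smooth = np.array([y[bins == b].mean() for b in bins])
# 'smooth' is a step function: constant within each bin, jumping at integers.
```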
19.1.2 Moving Averages
Moving averages use variable bins containing a fixed number of
observations, rather than fixed-width bins with a variable number of
observations. They tend to wiggle near the center of the data, but flatten
out near the boundary of the data.
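The moving average can be sketched as follows; the window size k is an illustrative choice, and the data are again simulated:

```python
import numpy as np

# Moving-average smoother: each fitted value is the mean of the y's in a
# window of (up to) k nearby observations, so bins have variable width.
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 5, 100))
y = np.sin(x) + rng.normal(0, 0.5, 100)

k = 15                                           # window size (assumption)
half = k // 2
smooth = np.array([y[max(0, i - half): i + half + 1].mean()
                   for i in range(len(y))])
# Near the boundary the window is truncated on one side, which is why the
# moving average flattens out there.
```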
19.1.3 Running Line
This improves on the moving average by fitting a line rather than an
average to the data within a variable-width bin. But it still tends to be
rough.
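A sketch of the running-line smoother, under the same illustrative simulated data and window size as before:

```python
import numpy as np

# Running-line smoother: within each window of nearby points, fit a line
# by least squares and evaluate it at the center point x[i].
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 5, 100))
y = np.sin(x) + rng.normal(0, 0.5, 100)

half = 7                                         # half-window (assumption)
smooth = np.empty_like(y)
for i in range(len(y)):
    lo, hi = max(0, i - half), i + half + 1
    b, a = np.polyfit(x[lo:hi], y[lo:hi], 1)     # slope, intercept
    smooth[i] = a + b * x[i]
```

Fitting a line instead of a constant reduces the boundary flattening of the moving average, though the result is still rough.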
Regression becomes much harder as the number of explanatory variables
increases. This is called the Curse of Dimensionality (COD). The term
was coined by Richard Bellman in the context of approximation theory.
The COD applies to all multivariate regressions that do not impose
strong modeling assumptions—especially the nonparametric regressions,
but also those in which one tests whether a specific variable or
transformed variable should be included in the model.
In terms of the sample size n and dimension p, the COD has three nearly
equivalent descriptions:
• For fixed n, as p increases, the data become sparse.
• As p increases, the number of possible models explodes.
• For large p, most datasets are multicollinear.
For the sparsity description of the COD, let n points be uniformly
distributed in the unit cube in IRp. What is the side-length ℓ of a subcube
that is expected to contain a fraction d of the data? Ans: ℓ = d^(1/p).
This means that for large p, the amount of local information available
to fit the bumps and wiggles of the regression function is too small.
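The calculation ℓ = d^(1/p) follows because a subcube of side ℓ has volume ℓ^p, and under the uniform distribution the expected fraction of points it contains equals its volume. A quick numerical check:

```python
# To capture a fraction d of uniform data in the unit cube in R^p,
# a subcube must satisfy l**p = d, i.e. l = d**(1/p).
d = 0.01                      # want 1% of the data
for p in (1, 2, 10, 100):
    print(p, d ** (1 / p))    # e.g. p = 10 gives l ≈ 0.63; p = 100 gives ≈ 0.96
```

So in high dimensions, even a "local" neighborhood holding 1% of the data spans nearly the whole range of every coordinate: nothing is local.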
To explain the model explosion aspect, suppose we restrict attention to
just linear models of degree 2 or fewer. For p = 1 these are:
IE[Y] = β0    IE[Y] = β1x1    IE[Y] = β2x1²
IE[Y] = β0 + β1x1    IE[Y] = β0 + β2x1²    IE[Y] = β1x1 + β2x1²
IE[Y] = β0 + β1x1 + β2x1²
For p = 2 this set is extended to include expressions with the terms α1x2,
α2x2², and γ12x1x2.
For general p, combinatorics shows that the number of possible models is
2^(1 + 2p + C(p,2)) − 1,
where the exponent counts the intercept, the p linear terms, the p squared
terms, and the C(p,2) = p(p−1)/2 cross-product terms.
This increases superexponentially in p, and there is not enough sample
to enable the data to discriminate among these models.
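The count can be sketched and sanity-checked in a few lines; each model is a nonempty subset of the available terms:

```python
from math import comb

# Number of linear models of degree <= 2 in p variables.
# Terms: intercept (1), linear (p), squares (p), cross products C(p, 2);
# each nonempty subset of terms is a distinct model.
def n_models(p):
    return 2 ** (1 + 2 * p + comb(p, 2)) - 1

print(n_models(1))   # 7, matching the enumeration for p = 1
print(n_models(2))   # 63
print(n_models(10))  # already astronomically large
```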
For the multicollinearity issue, we note that multicollinearity occurs
when two or more of the explanatory variables are highly correlated. This
implies that the predictive value of the fitted model breaks down quickly
as one moves away from the subspace in which the data concentrate.
In this class, we shall agree that multicollinearity occurs whenever
the absolute value of the correlation between two of the explanatory
variables exceeds 0.9. But this is a judgment call, and one can have
multicollinearity that arises in more complex ways.
For large p with finite n, it is almost certain that two explanatory
variables will have high correlation, just by chance.
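This chance effect is easy to demonstrate by simulation. In the sketch below the column count p and sample size n are illustrative choices; the columns are generated independently, yet some pair is nonetheless highly correlated:

```python
import numpy as np

# With many independent noise variables and a small sample, the maximum
# pairwise correlation is large purely by chance.
rng = np.random.default_rng(4)
n, p = 20, 200
X = rng.normal(size=(n, p))                  # columns are truly independent
R = np.corrcoef(X, rowvar=False)
off = np.abs(R[np.triu_indices(p, k=1)])     # off-diagonal correlations
print(off.max())                             # large despite independence
```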
19.2 Variable Selection
One wants to select a multiple regression model that only includes useful
variables. Some methods are:
• Forward Selection. One starts with no variables in the model, and
sequentially adds the one that best explains the current residuals
(or the raw data, at the initial step). One stops when none of the
remaining variables provide significant explanation.
• Backwards Elimination. One starts with all the variables in the model,
and sequentially removes the variable that explains the least, until a
t-test shows that no further variables should be removed.
• Stepwise Regression. Alternate use of forward selection and
backwards elimination.
None of these is bulletproof.
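Forward selection can be sketched as follows. The simulated data, the greedy criterion (residual sum of squares), and the crude stopping rule are all illustrative assumptions; real implementations typically use F- or t-tests, as described above.

```python
import numpy as np

# Forward selection: greedily add the variable that most reduces the
# residual sum of squares (RSS), stopping when the improvement is small.
rng = np.random.default_rng(5)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] - 3 * X[:, 2] + rng.normal(0, 0.5, n)  # only x0, x2 matter

def rss(cols):
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return ((y - A @ beta) ** 2).sum()

selected, current = [], rss([])
while len(selected) < p:
    best = min((j for j in range(p) if j not in selected),
               key=lambda j: rss(selected + [j]))
    if current - rss(selected + [best]) < 0.05 * current:  # crude stop rule
        break
    selected.append(best)
    current = rss(selected)
print(selected)   # expected to pick columns 2 and 0 first
```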
19.3 Cross-Validation
To assess model fit in complex, computer-intensive situations, the ideal
strategy is to hold out a random portion of the data, fit a model to the
rest, then use the fitted model to predict the response values from the
values of the explanatory variables in the hold-out sample.
This allows a straightforward estimate of the error in prediction using
regression. But we usually need to compare fits among many models.
If the same hold-out sample is re-used, then the comparisons are not
independent and (worse) the model selection process will tend to choose
a model that overfits the hold-out sample, causing spurious optimism.
Cross-validation is a procedure that balances the need to use data to
select a model and the need to use data to assess prediction.
Specifically, v-fold cross-validation is as follows:
• randomly divide the sample into v portions;
• for i = 1, . . . , v, hold out portion i and fit the model from the rest of
the data;
• for i = 1, . . . , v, use the fitted model to predict the hold-out sample;
• average the prediction mean squared error (PMSE) over the v different fits.
One repeats these steps (including the random division of the sample!)
each time a new model is assessed.
The choice of v requires judgment. Often v = 10.
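The steps above can be sketched as follows, here for a simple line fit with simulated data (the model and noise level are illustrative assumptions):

```python
import numpy as np

# v-fold cross-validation: randomly split the sample into v folds, fit on
# v-1 folds, predict the held-out fold, and average the PMSE over folds.
rng = np.random.default_rng(6)
n = 100
x = rng.uniform(0, 5, n)
y = 1 + 2 * x + rng.normal(0, 0.5, n)

v = 10
idx = rng.permutation(n)                      # random division of the sample
folds = np.array_split(idx, v)

pmse = 0.0
for hold in folds:
    train = np.setdiff1d(idx, hold)
    b, a = np.polyfit(x[train], y[train], 1)  # fit on the remaining folds
    pred = a + b * x[hold]                    # predict the hold-out fold
    pmse += ((y[hold] - pred) ** 2).mean()
pmse /= v
print(pmse)   # roughly the noise variance, 0.25
```

To compare models, one repeats the whole procedure, including the random division, for each candidate model.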
19.5 Case Study
You should never believe your model. Personally, I’m sometimes willing
to believe the binomial model applies, but for nearly every other
situation, the mechanisms that generate the data just do not quite match
the simple assumptions that underlie the named probability distributions.
George Box said:
All models are wrong, but some are useful.
Economists look at a lot of data and often attempt to fit models to it.
Be wary. Always plot your data. One can do goodness-of-fit tests to see
whether the data conform with a particular model, but this has dangers
too, especially with very large samples.
When deciding on which model to use to describe a data set, one should
consider:
• Do you believe that a simple, single probability distribution
generated the data?
• Do the data have some natural support set? (The support set is the
set on which the probability mass function or density function is
positive.)
• Do you believe the data are roughly symmetrically distributed about
the mean? Or is there skewness?
• Will the data have “fat tails”? (That is, are there likely to be some
exceptionally large or small values, compared to what one would see
in a sample from a normal distribution?)
• Do you understand the measurement process that acquired the data?
Beware of premature framing of a problem.
In January 1985, a team of engineers at Morton Thiokol was tasked
to study O-ring failures in Challenger launches. They were given
information on all the launches in which O-ring failures occurred, and
related data on temperature, manufacturing history, and so forth.
The engineers looked at all the variables. Temperature did not stand out.
On January 28, 1986, when the executives at Morton Thiokol were asked
by NASA whether they objected to greenlighting the launch given the
unusual cold weather at Cape Kennedy, they contacted their engineers
and asked their opinion.
The engineers, led by Roger Boisjoly, were nervous and tried to stop
the flight. The Morton Thiokol management agreed that the issue was
serious enough to recommend delaying the flight, and they arranged a
telephone conference with NASA. However, during the call, the Morton
Thiokol managers asked for a few minutes off the phone to discuss their
final position again.
The Morton Thiokol managers decided to advise NASA that their data
was inconclusive. NASA asked if there were objections. Hearing none,
the decision to launch was made.
The engineers should have looked at all the data, not just the data on
failures.
Roger Boisjoly was one of the witnesses at the Rogers Commission. After
the Commission gave its findings, Boisjoly found himself shunned by
colleagues and managers and he resigned from Morton Thiokol.
Subsequently, Roger Boisjoly wrote:
... [S]ome may argue that sufficient funds or schedule were
not available and that may be so, but MTI contracted for
that condition. The Shuttle program was declared operational
by NASA after the fourth flight, but the technical problems
in producing and maintaining the reusable boosters were
escalating rapidly as the program matured, instead of
decreasing as one would normally expect. Many opportunities
were available to structure the work force for corrective action,
but the MTI Management style would not let anything compete
or interfere with the production and shipping of boosters.