
Linear model selection and regularization

Problems with least squares linear regression

1. Prediction Accuracy: linear regression has low bias but suffers from high variance, especially when n ≈ p. It cannot handle n < p.

2. Model Interpretability: It is often the case that some or many of the variables used in a multiple regression model are in fact not associated with the response. Including such irrelevant variables leads to unnecessary complexity in the resulting model.


Selected alternatives to LS

1. Subset Selection. This approach involves identifying a subset of the p predictors that we believe to be related to the response. We then fit a model using least squares on the reduced set of variables.

2. Shrinkage. This approach involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. This shrinkage (also known as regularization) has the effect of reducing variance. Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.

3. Dimension Reduction. This approach involves projecting the p predictors into an M-dimensional subspace, where M < p. This is achieved by computing M different linear combinations, or projections, of the variables. Then these M projections are used as predictors to fit a linear regression model by least squares.


Figure: For each possible model containing a subset of the ten predictors in the Credit data set, the RSS (left) and R² (right) are displayed against the number of predictors. The red frontier tracks the best model for a given number of predictors, according to RSS and R². Though the data set contains only ten predictors, the x-axis ranges from 1 to 11, since one of the variables is categorical and takes on three values, leading to the creation of two dummy variables.


• Unlike best subset selection, which involved fitting 2^p models, forward stepwise selection involves fitting one null model, along with p − k models in the kth iteration, for k = 0, ..., p − 1. This amounts to a total of 1 + ∑_{k=0}^{p−1} (p − k) = 1 + p(p + 1)/2 models. This is a substantial difference: when p = 20, best subset selection requires fitting 1,048,576 models, whereas forward stepwise selection requires fitting only 211 models. (A sketch of the greedy fitting loop follows this list.)

• Forward stepwise selection can be applied even in the high-dimensional setting where n < p; however, in this case, it is possible to construct submodels M0, ..., Mn−1 only, since each submodel is fit using least squares, which will not yield a unique solution if p ≥ n.
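The greedy fitting loop described above is short to write out. Below is a minimal sketch (not from the slides), assuming a NumPy feature matrix X and response vector y; the helper names rss and forward_stepwise are illustrative.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least squares fit of y on X (with an intercept)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def forward_stepwise(X, y):
    """Greedy forward selection: at each step add the predictor that most reduces the training RSS.
    Returns the nested sequence of selected index sets M_1, ..., M_p (M_0 is the null model)."""
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    for _ in range(p):
        best_j = min(remaining, key=lambda j: rss(X[:, selected + [j]], y))
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(list(selected))
    return path   # p + (p-1) + ... + 1 = p(p+1)/2 least squares fits, plus the null model
```

Among the models in the returned path, the final choice of size is then made with a criterion such as Cp, BIC, adjusted R², or cross-validation, as discussed below.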


Backward selection requires that the number of samples n is larger than the number of variables p (so that the full model can be fit). In contrast, forward stepwise can be used even when n < p, and so is the only viable subset method when p is very large. As another alternative, hybrid versions of forward and backward stepwise selection are available, in which variables are added to the model sequentially, in analogy to forward selection. However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit. Such an approach attempts to more closely mimic best subset selection while retaining the computational advantages of forward and backward stepwise selection.


Choosing the optimal model

In order to select the best model with respect to test error, we need to estimate this test error. There are two common approaches:

1. We can indirectly estimate the test error by making an adjustment to the training error to account for the bias due to overfitting (e.g. Cp, BIC, or adjusted R²; see the sketch after this list).

2. We can directly estimate the test error, using either a validation set approach or a cross-validation approach, as discussed in Chapter 5.
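For the indirect route, the usual adjusted training-error criteria can be computed from training quantities alone. The sketch below uses the standard textbook definitions of Cp, BIC, and adjusted R²; rss and tss denote the residual and total sums of squares of a fitted model with d predictors, and sigma2_hat is an estimate of the error variance (for example from the full least squares fit). All names are illustrative.

```python
import numpy as np

def cp(rss, n, d, sigma2_hat):
    """Mallows' Cp estimate of test MSE: (RSS + 2*d*sigma^2) / n; smaller is better."""
    return (rss + 2 * d * sigma2_hat) / n

def bic(rss, n, d, sigma2_hat):
    """BIC (up to irrelevant constants): (RSS + log(n)*d*sigma^2) / n; penalizes model size
    more heavily than Cp whenever log(n) > 2, i.e. for n > 7."""
    return (rss + np.log(n) * d * sigma2_hat) / n

def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2: 1 - (RSS/(n - d - 1)) / (TSS/(n - 1)); larger is better."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```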


Figure: Cp, BIC, and adjusted R² are shown against the number of predictors for the best models of each size for the Credit data set. Cp and BIC are estimates of test MSE. In the middle plot we see that the BIC estimate of test error shows an increase after four variables are selected. The other two plots are rather flat after four variables are included.


Figure: For the Credit data set, three quantities are displayed for the best model containing d predictors, for d ranging from 1 to 11. The overall best model, based on each of these quantities, is shown as a blue cross. Left: Square root of BIC. Center: Validation set errors. Right: Cross-validation errors.


One-standard-error rule

We first calculate the standard error of the estimated test MSE for each model size, and then select the smallest model for which the estimated test error is within one standard error of the lowest point on the curve. The rationale here is that if a set of models appear to be more or less equally good, then we might as well choose the simplest model, that is, the model with the smallest number of predictors. In this case, applying the one-standard-error rule to the validation set or cross-validation approach leads to selection of the three-variable model.
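A minimal sketch of the rule, assuming arrays cv_mean and cv_se that hold the estimated test error and its standard error for models ordered from smallest to largest (the names are illustrative):

```python
import numpy as np

def one_standard_error_rule(cv_mean, cv_se):
    """Select the smallest model whose estimated test error is within one SE of the minimum."""
    cv_mean, cv_se = np.asarray(cv_mean), np.asarray(cv_se)
    best = np.argmin(cv_mean)                    # model size with the lowest estimated test error
    threshold = cv_mean[best] + cv_se[best]      # one standard error above that minimum
    return int(np.argmax(cv_mean <= threshold))  # first (smallest) model under the threshold
```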


Shrinkage method I: Ridge regression

Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity. In particular, the ridge regression coefficient estimates βR are the values that minimize

RSS + λ ∑_{j=1}^{p} βj²,

where λ ≥ 0 is a tuning parameter, to be determined separately. The above equation trades off two different criteria. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the second term, λ ∑_j βj², called a shrinkage penalty, is small when β1, ..., βp are close to zero, and so it has the effect of shrinking the estimates of βj towards zero.
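To make the criterion concrete, here is a small NumPy sketch that minimizes RSS + λ∑βj² in closed form; it centers the predictors inside the function so that the intercept is left unpenalized, and the function name is illustrative rather than anything from the slides.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2): beta = (Xc'Xc + lam*I)^{-1} Xc'y on centered data."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # center the predictors so the intercept is not penalized
    yc = y - y.mean()
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y.mean()               # with centered predictors, the intercept estimate is mean(y)
    return intercept, beta             # predict with intercept + (X - X.mean(axis=0)) @ beta
```

Setting lam = 0 recovers the least squares fit (when it is unique), while larger values of lam shrink the coefficients towards zero.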


Figure: The standardized ridge regression coefficients (Income, Limit, Rating, Student) are displayed for the Credit data set, as a function of λ (left) and ‖βRλ‖2/‖β‖2 (right).


• Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates, βRλ, for each value of λ. Selecting a good value for λ is critical.

• We want to shrink the estimated association of each variable with the response; however, we do not want to shrink the intercept, which is simply a measure of the mean value of the response when xi1 = xi2 = ... = xip = 0. If we assume that the variables, that is, the columns of the data matrix X, have been centered to have mean zero before ridge regression is performed, then the estimated intercept will take the form β0 = ȳ.

• The shrinkage penalty is not scale invariant. Therefore, it is best to apply ridge regression after standardizing the predictors; a sketch of this workflow follows.
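A common workflow consistent with this advice is to standardize the predictors and then trace the coefficient estimates over a grid of λ values, as in the figure above. Below is a minimal scikit-learn sketch; note that scikit-learn calls the tuning parameter alpha, and the grid and function name here are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

def ridge_coefficient_path(X, y, lambdas):
    """Fit ridge regression on standardized predictors for each tuning parameter value."""
    X_std = StandardScaler().fit_transform(X)      # each column: mean 0, standard deviation 1
    coefs = []
    for lam in lambdas:
        model = Ridge(alpha=lam).fit(X_std, y)     # alpha plays the role of lambda
        coefs.append(model.coef_)
    return np.array(coefs)                         # shape (len(lambdas), p): one row per lambda

lambdas = np.logspace(-2, 4, 100)                  # a grid spanning several orders of magnitude
```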


Why Does Ridge Regression Improve Over Least Squares?

Ridge regression's advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of ridge regression decreases, leading to decreased variance but increased bias.

Figure: Squared bias (black), variance (green), and test mean squared error (purple) for the ridge regression predictions on a simulated data set, as a function of λ (left) and ‖βRλ‖2/‖β‖2 (right). The horizontal dashed lines indicate the minimum possible MSE. The purple crosses indicate the ridge regression models for which the MSE is smallest. The results are based on a simulated data set containing p = 45 predictors and n = 50 observations.


In general, in situations where the relationship between the response and the predictors is close to linear, the least squares estimates will have low bias but may have high variance. This means that a small change in the training data can cause a large change in the least squares coefficient estimates. In particular, when the number of variables p is almost as large as the number of observations n, the least squares estimates will be extremely variable. And if p > n, then the least squares estimates do not even have a unique solution, whereas ridge regression can still perform well by trading off a small increase in bias for a large decrease in variance. Hence, ridge regression works best in situations where the least squares estimates have high variance.


Ridge regression also has substantial computational advantages over best subset selection, which requires searching through 2^p models. As we discussed previously, even for moderate values of p, such a search can be computationally infeasible. In contrast, for any fixed value of λ, ridge regression only fits a single model, and the model-fitting procedure can be performed quite quickly. In fact, one can show that the computations required to solve the penalized least squares problem, simultaneously for all values of λ, are almost identical to those for fitting a model using least squares.
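One standard way to see this claim (a linear algebra identity, not spelled out on the slides) is via the singular value decomposition: a single SVD of the centered design matrix gives the ridge solution for every value of λ at little extra cost, since βλ = V diag(dj/(dj² + λ)) Uᵀy. A minimal NumPy sketch, with illustrative names:

```python
import numpy as np

def ridge_path_svd(X, y, lambdas):
    """Ridge coefficients for many lambda values from a single SVD of the centered design matrix."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # the expensive step, done only once
    Uty = U.T @ yc
    coefs = []
    for lam in lambdas:
        shrink = d / (d ** 2 + lam)                      # per-component shrinkage factors
        coefs.append(Vt.T @ (shrink * Uty))              # (Xc'Xc + lam*I)^{-1} Xc'y via the SVD
    return np.array(coefs)
```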