Special Chapters on Artificial Intelligence
Lecture 6. Regression variable selection
Cristian Gatu
Faculty of Computer Science, “Alexandru Ioan Cuza” University of Iasi, Romania
MCO, 2018–2019
Content
Variable selection
Effects of deleting variables
Variable selection procedures
    Forward Selection
    Backward Selection
    Stepwise Selection
Variable selection
- In many regression problems there will be:
  1. A fixed sample size to work with.
  2. A moderate to large number of potential predictor variables.
- Generally, adding more variables to a regression model that already contains a small number of variables will improve predictive accuracy.
- Continuing to add variables (without adding more sample) will often lead to a deterioration in predictive accuracy (over-fitting).
- The goal is to find the best subset of variables: a bias/variance trade-off.
Variable selection
- There are many subsets of variables that are likely to do well.
- Finding the best subset of variables is often referred to as variable selection.
- For experiments, variable selection is done before data collection, as important variables are chosen for study based on theoretical considerations.
- For observational studies there is little control over how the data are collected. In this case variable selection is done after data collection, as part of the data analysis.
Content
Variable selection
Effects of deleting variables
Variable selection procedures
    Forward Selection
    Backward Selection
    Stepwise Selection
Some effects of dropping variables

Assume that the correct model is given by

    yi = β0 + β1 xi1 + · · · + βn xin + εi

and consider the sub-model which includes the first p − 1 < n independent variables:

    yi = β0 + β1 xi1 + · · · + β(p−1) xi(p−1) + εi.
- Deleting independent variables usually biases the estimates of the parameters left in the model;
- Deleting independent variables usually increases the expected value of s² and decreases the covariance matrix of the estimates β0, . . . , β(p−1). Note that we are referring to the covariance matrix defined in terms of σ² and not its estimate s².
Some effects of dropping variables

- A measure of the bias in the predicted values is called Mallows' Cp, where

      Cp+1 = RSSp+1 / s² − (m − 2(p + 1)),

  m is the sample size (number of observations), RSSp+1 is the residual sum of squares of the p-variable model, and the estimator of σ² is given by

      s² = RSSn+1 / (m − n − 1).

  The key property for applications is that if the submodel does not lead to much bias in the predicted values, then

      Cp ≈ p.
Some effects of dropping variables

- Notice that

      Cn+1 = RSSn+1 / s² − (m − 2(n + 1))
           = RSSn+1 / (RSSn+1 / (m − n − 1)) − (m − 2(n + 1))
           = (m − n − 1) − (m − 2(n + 1))
           = n + 1.

  Thus, Cp assumes that the complete model has been carefully chosen so as to give reasonable assurance of negligible bias.
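Since the statistic only needs the two residual sums of squares, it is straightforward to compute. Below is a minimal sketch in Python (the function and argument names are illustrative, not from the lecture):

    def mallows_cp(rss_sub, rss_full, m, n, p):
        # rss_sub:  RSS of the submodel with p variables (RSS_{p+1} above)
        # rss_full: RSS of the full model with n variables (RSS_{n+1} above)
        # m:        sample size; sigma^2 is estimated from the full model
        s2 = rss_full / (m - n - 1)
        return rss_sub / s2 - (m - 2 * (p + 1))

As a sanity check, calling it with rss_sub = rss_full and p = n reproduces the identity Cn+1 = n + 1 derived above.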
Effects on estimates of βj
Assume that the correct model is given by

    y = Xβ + ε,  or  y = (X1 X2)(β1; β2) + ε,

where β = (β1 β2)ᵀ, X = (X1 X2), β1 is a p-element vector and the other dimensions are chosen appropriately. Thus, the correct model can be written as:

    y = X1β1 + X2β2 + ε,

where E(ε) = 0, E(y) = X1β1 + X2β2 and Var(ε) = σ²I.
Effects on estimates of βj

Assume that we leave out X2β2 and obtain the sub-model:

    y = X1β1 + ε.
The estimator of β1 in the sub-model is given by:

    β̂1 = (X1ᵀX1)⁻¹X1ᵀy

and

    E(β̂1) = (X1ᵀX1)⁻¹X1ᵀE(y)
           = (X1ᵀX1)⁻¹X1ᵀ(X1β1 + X2β2)
           = β1 + (X1ᵀX1)⁻¹X1ᵀX2β2.

Thus, the estimate of β1 after deleting X2β2 is biased by

    E(β̂1) − β1 = (X1ᵀX1)⁻¹X1ᵀX2β2.
Effects on estimates of βj

The variance of β̂1 in the sub-model is given by:

    Var(β̂1) = σ²(X1ᵀX1)⁻¹.

However, based on the full (correct) model:

    Var(β̂) = σ²(XᵀX)⁻¹
            = σ² ( X1ᵀX1   X1ᵀX2 )⁻¹
                 ( X2ᵀX1   X2ᵀX2 )
            ≡ σ² ( V11   V12 )
                 ( V21   V22 ).

If β̂(1) corresponds to the first (p − 1) independent variables of the full model, then Var(β̂(1)) = σ²V11 with V11 ≥ (X1ᵀX1)⁻¹, and thus

    Var(β̂1) ≤ Var(β̂(1)).
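Both effects (bias up, variance down) can be checked numerically. The following small simulation is an added illustration, not part of the lecture; it uses NumPy with a fixed design in which x2 is correlated with x1, so that the bias term (X1ᵀX1)⁻¹X1ᵀX2β2 is non-zero:

    import numpy as np

    rng = np.random.default_rng(0)
    m = 200                                 # sample size
    x1 = rng.normal(size=m)
    x2 = 0.7 * x1 + rng.normal(size=m)      # correlated with x1, so bias appears
    beta0, beta1, beta2 = 1.0, 2.0, 3.0

    est_full, est_sub = [], []
    for _ in range(2000):
        y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=m)
        X_full = np.column_stack([np.ones(m), x1, x2])
        X_sub = np.column_stack([np.ones(m), x1])
        est_full.append(np.linalg.lstsq(X_full, y, rcond=None)[0][1])
        est_sub.append(np.linalg.lstsq(X_sub, y, rcond=None)[0][1])

    # The sub-model estimate of beta1 is biased away from 2 ...
    print("mean, full:", np.mean(est_full), " mean, sub:", np.mean(est_sub))
    # ... but it has the smaller variance, exactly as claimed above.
    print("var,  full:", np.var(est_full), " var,  sub:", np.var(est_sub))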
Content
Variable selection
Effects of deleting variables
Variable selection procedures
    Forward Selection
    Backward Selection
    Stepwise Selection
Variable selection procedures
When presented with a data set containing a large number of potential regressors, it can be a formidable task to identify a useful, well-fitting and parsimonious regression model. In the 1950s and 1960s statisticians spent a lot of time inventing ingenious ways of automating variable selection in multiple linear regression. The four most popular methods are forward selection, backward selection, stepwise regression and best subset regression.
Variable selection procedures
The idea of the first three is to reduce the number of possible models that need to be fitted. Often these methods do not produce the same model and, in most cases, the selected models are not the optimal ones¹. The fourth procedure generates the best one-regressor model, the best two-regressor model, the best three-regressor model, etc. The best of these generated models is selected based on some criterion (R², adjusted R², Cp, etc.). Often an exhaustive search (which considers all possible submodels) is used for generating all subset models.
¹Not optimal in the sense that there are better models that could be selected using the same selection criteria.
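For small n, best subset regression can be written down directly. The sketch below is an added illustration (all names are hypothetical), enumerating every non-empty subset with itertools.combinations and scoring each by adjusted R²:

    from itertools import combinations
    import numpy as np

    def adjusted_r2(X, y):
        m, k = X.shape                      # k counts the intercept column too
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ beta) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        return 1.0 - (rss / (m - k)) / (tss / (m - 1))

    def best_subset(X, y):
        # Exhaustive search. X is an (m, n) predictor matrix without intercept.
        m, n = X.shape
        best, best_score = None, -np.inf
        for size in range(1, n + 1):
            for subset in combinations(range(n), size):
                Xs = np.column_stack([np.ones(m), X[:, subset]])
                score = adjusted_r2(Xs, y)
                if score > best_score:
                    best, best_score = subset, score
        return best, best_score

Note the cost: 2ⁿ − 1 subsets are fitted, which is precisely why the first three procedures exist.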
Content
Variable selection
Effects of deleting variables
Variable selection procedures
    Forward Selection
    Backward Selection
    Stepwise Selection
Forward Selection
- Unlike exhaustive search, forward selection is always computationally tractable. Even in the worst case, it checks a much smaller number of subsets before finishing.
- This technique adds predictor variables and never deletes them.
- The starting subset in forward selection is the empty set.
- For a regression model with n possible predictor variables, the first step involves evaluating n predictor-variable subsets, each consisting of a single predictor variable, and selecting the one with the highest evaluation criterion.
- The next step selects from among (n − 1) subsets, the next step from (n − 2) subsets, and so on.
- Even if all predictor variables are selected, at most n(n + 1)/2 subsets are evaluated before the search ends.
Forward Selection
1. Start with no variables.
2. For each variable NOT in the model, check the p-value it would have if it were added to the model.
3. Add the one with the lowest p-value, provided it is less than α.
4. Continue until no new variable can be added (a sketch of this loop is given below).
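A minimal sketch of the loop above, assuming the statsmodels library and a pandas DataFrame data whose columns are the predictors plus a response column (the names forward_select, data and response are illustrative, not from the lecture):

    import pandas as pd
    import statsmodels.api as sm

    def forward_select(data: pd.DataFrame, response: str, alpha: float = 0.05):
        remaining = [c for c in data.columns if c != response]
        selected = []
        while remaining:
            # Step 2: the p-value each candidate would get if added to the model.
            pvals = {}
            for cand in remaining:
                X = sm.add_constant(data[selected + [cand]])
                pvals[cand] = sm.OLS(data[response], X).fit().pvalues[cand]
            best = min(pvals, key=pvals.get)
            if pvals[best] >= alpha:      # Step 4: no variable can be added.
                break
            selected.append(best)         # Step 3: add the lowest p-value.
            remaining.remove(best)
        return selected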
Forward Selection
[Flowchart] Start: fit each xi separately to predict y. Are any xi significant? If NO: STOP. If YES: add the best single variable. Then loop: test each xj NOT in the model, given all variables currently in the model. Are any of them significant? If NO: STOP. If YES: add the most significant predictor NOT in the model (e.g. lowest p-value) and repeat the loop.
Example: Forward selection using the House data
Fit each variable separately                            P-Value
----------------------------------------------------------------
Number of bedrooms                                      0.039
Floor space in square feet                              0.00002
Number of fireplaces                                    0.545
Number of rooms                                         0.002
Storm windows (1 if present, 0 if absent)               0.019
Front footage of lot in feet                            0.031
Annual Taxes                                            0.010
Number of bathrooms                                     0.003
Construction (0 if frame, 1 if brick)                   0.504
Garage Size (0 no car, 1 one car)                       0.004
Condition (1 need work, 0 otherwise)                    0.799
Location 1 (1 if property is in zone A, 0 otherwise)    0.010
Location 2 (1 if property is in zone B, 0 otherwise)    0.994

At the 5% significance level the most significant variable is Floor space in square feet. This variable (FLR) is introduced into the model (and will never be deleted).
Example: Forward selection using the House data
We try to find the second explanatory variable by examining the significance of each remaining variable in the model which already contains FLR.

1)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.557   0.518
Number of bedrooms               0.388

2)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.543   0.503
Number of fireplaces             0.0427

3)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.544   0.505
Number of rooms                  0.707

4)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.675   0.646
Storm windows                    0.005

5)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.575   0.538
Front footage of lot in feet     0.194

6)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.544   0.504
Annual Taxes                     0.726

7)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.580   0.543
Number of bathrooms              0.162
Example: Forward selection using the House data

8)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.609   0.575
Construction                     0.059

9)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.614   0.580
Garage size                      0.049

10)                              P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.542   0.502
Condition                        0.990

11)                              P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.567   0.530
Location (zone A)                0.254

12)                              P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.556   0.517
Location (zone B)                0.403

At the 5% significance level the second most significant explanatory variable (conditional on the FLR variable) is Storm windows (ST). This variable enters the model, which now consists of the constant, FLR and ST. The remaining variables are fitted to this model one at a time. The most significant (if any) will enter the model.
Example: Forward selection using the House data
Each row shows the p-values of FLR, ST and the candidate variable, together with the adjusted R² of the resulting three-variable model:

 1)  FLR 0.000   ST 0.002   BDR 0.099   Adj R²: 0.67
 2)  FLR 0.000   ST 0.006   FP  0.9     Adj R²: 0.63
 3)  FLR 0.001   ST 0.007   RMS 0.854   Adj R²: 0.63
 4)  FLR 0.000   ST 0.003   LOT 0.069   Adj R²: 0.68
 5)  FLR 0.001   ST 0.005   TAX 0.493   Adj R²: 0.64
 6)  FLR 0.000   ST 0.015   BTH 0.526   Adj R²: 0.64
 7)  FLR 0.000   ST 0.006   CON 0.062   Adj R²: 0.69
 8)  FLR 0.000   ST 0.007   GAR 0.058   Adj R²: 0.64
 9)  FLR 0.000   ST 0.006   GDN 0.671   Adj R²: 0.63
10)  FLR 0.000   ST 0.012   L1  0.625   Adj R²: 0.63
11)  FLR 0.000   ST 0.006   L2  0.318   Adj R²: 0.65
Example: Forward selection using the House data
None of these variables is significant at the 5% level and thus the previous model is kept. That is,
Coefficients:
Estimate SE t value Pr(>|t|)
(Intercept) 32.594 3.907 8.43 2.1e-08
FLR 0.019 0.003 5.74 7.6e-06
ST 10.226 3.337 3.07 0.00549
---
Residual standard error: 7.484 on 23 DF
Multiple R-Squared: 0.67, Adjusted R-squared: 0.65
F-statistic: 23.83 on 2 and 23 DF, p-value: 2.5e-06
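Output of this form can be reproduced with statsmodels. The snippet below is an added illustration only: the actual House data are not reproduced in the slides, so it generates a small synthetic data set whose generating coefficients are borrowed from the fitted model above, just to make the summary comparable:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    # Hypothetical stand-in for the House data (26 observations, as above).
    house_data = pd.DataFrame({
        "FLR": rng.uniform(800, 2500, size=26),
        "ST": rng.integers(0, 2, size=26),
    })
    house_data["Price"] = (32.6 + 0.019 * house_data["FLR"]
                           + 10.2 * house_data["ST"]
                           + rng.normal(0, 7.5, size=26))

    # Fit Price ~ const + FLR + ST and print a table like the one above.
    X = sm.add_constant(house_data[["FLR", "ST"]])
    print(sm.OLS(house_data["Price"], X).fit().summary())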
Content
Variable selection
Effects of deleting variables
Variable selection procedures
    Forward Selection
    Backward Selection
    Stepwise Selection
Backward Selection
- Backward selection has computational properties similar to those of forward selection. The starting subset in backward selection includes all possible predictor variables.
- Predictor variables are deleted one at a time as long as this results in a subset with a higher evaluation criterion.
- Again, in the worst case, at most n(n + 1)/2 subsets must be evaluated before the search ends. Like forward selection, backward selection is not guaranteed to find the subset with the highest evaluation criterion.
Backward Selection
- The disadvantage of backward selection is that one's confidence in subset evaluation criterion values tends to be lower than with forward selection. This is especially true when the number of rows in the predictor matrix is close to the number of possible predictor variables. In such a case, there are very few points that the regression model can use to determine its parameter values, and the evaluation criterion will be sensitive to small changes in the predictor matrix data.
- When the ratio of predictor matrix rows to predictor variables is small, it is usually a better idea to use forward selection than backward selection.
Backward Selection
1. Start with all variables in the model.
2. Remove the variable with the highest p-value greater than α.
3. Refit the model and go to step 2.
4. Stop when ALL p-values are less than α (a matching sketch follows below).
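A sketch of this elimination loop, under the same assumptions as the forward-selection sketch earlier (statsmodels, a pandas DataFrame data, a named response column; the function name is illustrative):

    import pandas as pd
    import statsmodels.api as sm

    def backward_select(data: pd.DataFrame, response: str, alpha: float = 0.05):
        selected = [c for c in data.columns if c != response]  # Step 1: all variables.
        while selected:
            X = sm.add_constant(data[selected])
            pvals = sm.OLS(data[response], X).fit().pvalues.drop("const")
            worst = pvals.idxmax()
            if pvals[worst] <= alpha:     # Step 4: all p-values below alpha.
                break
            selected.remove(worst)        # Step 2: drop the highest p-value.
        return selected                   # Step 3 is the next loop iteration.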
Backward Selection
[Flowchart] Start: fit the FULL model containing all variables. Then loop: test each xj IN the model, given all variables currently in the model. Are ALL variables significant? If YES: STOP. If NO: drop the most non-significant predictor in the model (e.g. highest p-value) and repeat the loop.
Example: Backward selection on the House data

Fit the model comprising all variables
--------------------------------------
Variable Coefficients SE t Sig.
(Constant) 11.608 6.469 1.794 0.098
num of bedrooms -4.502 2.243 -2.007 0.068
floor space 1.583E-02 0.006 2.641 0.022
num of fireplaces -2.139 2.930 -0.730 0.479
number of rooms 2.380 1.911 1.246 0.237
storm windows 8.931 2.430 3.675 0.003
front footage 0.385 0.130 2.953 0.012
annual taxes 2.344E-04 0.005 0.051 0.960
num of bathrooms 1.187 2.849 0.417 0.684
construction 4.625 2.387 1.938 0.077
garage size 4.985 1.514 3.292 0.006
condition -0.212 2.653 -0.080 0.938
location(zone A) 1.565 3.132 0.500 0.626
location(zone B) 7.383 2.953 2.500 0.028
R² = 0.936   Adj R² = 0.867
The least significant variable is Annual taxes. This variable is deleted from the model (and never re-considered) and the model is fitted again.
Example: Backward selection on the House data

Fit the model with all variables excluding Annual taxes
------------------------------------------------------
Variable Coefficients SE t Sig.
(Constant) 11.802 5.009 2.356 0.035
num of bedrooms -4.524 2.115 -2.139 0.052
floor space 1.608E-02 0.003 4.984 0.000
num of fireplaces -2.120 2.793 -0.759 0.461
number of rooms 2.353 1.760 1.337 0.204
storm windows 8.946 2.318 3.860 0.002
front footage 0.386 0.122 3.170 0.007
num of bathrooms 1.132 2.526 0.448 0.662
construction 4.596 2.226 2.065 0.059
garage size 5.000 1.429 3.498 0.004
condition -0.254 2.419 -0.105 0.918
location(zone A) 1.599 2.937 0.545 0.595
location(zone B) 7.419 2.757 2.691 0.019
R² = 0.936   Adj R² = 0.877
The least significant variable is now Condition. This variable is deleted from the model (and never re-considered) and the model is fitted again. The procedure is repeated until all the variables in the model are significant.
Example: Backward selection on the House data
The model comprising only significant variables
----------------------------------------------------
Variable Coefficients SE t Sig.
(Intercept) 14.517 4.212 3.446 0.002706
floor space 0.015 0.002 6.652 2.31e-06
storm windows 8.657 2.065 4.191 0.000496
front footage 0.335 0.113 2.962 0.007999
construction 7.272 1.867 3.895 0.000974
garage size 5.897 1.316 4.480 0.000257
location(zone B) 8.647 2.155 4.013 0.000744
R² = 0.901   Adj R² = 0.870
In this case the model is given by:
Price = 14.517 + 0.015 FLR + 8.657 ST + 0.335 LOT
+ 7.272 CON + 5.897 GAR + 8.647 L2.
Content
Variable selection
Effects of deleting variables
Variable selection procedures
    Forward Selection
    Backward Selection
    Stepwise Selection
Stepwise Selection
- Stepwise selection has been proposed as a technique that combines the advantages of forward and backward selection.
- At any point in the search, a single predictor variable may be added or deleted.
- Commonly, the starting subset is the empty set.
- At most n² subsets are evaluated before stepwise selection ends.
- There is, however, no guarantee that each predictor will be chosen at most one time.
Stepwise Selection
- No strong theoretical results exist for comparing the effectiveness of stepwise selection against forward or backward selection.
- Stepwise selection evaluates more subsets than the other two techniques, so in practice it tends to produce better subsets. Of course, the price that stepwise selection pays for finding better subsets is reduced computational speed: usually more subsets must be evaluated before the search ends.
Stepwise Selection
[Flowchart] START: fit each xi separately to predict y. Are any significant? If NO: STOP. If YES: add the best single variable. Then loop: test each variable IN the model, given all other variables currently in the model. Are all significant? If NO: drop the most non-significant and re-test. If YES: test each variable NOT in the model, given all variables currently in the model. Are any significant? If NO: STOP. If YES: add the most significant variable and return to the IN-model test.
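The flowchart amounts to the two previous sketches combined: a forward step followed by backward checks. A possible implementation under the same assumptions (statsmodels, a pandas DataFrame); the separate entry/exit thresholds are an added convention, not from the lecture:

    import pandas as pd
    import statsmodels.api as sm

    def stepwise_select(data: pd.DataFrame, response: str,
                        alpha_in: float = 0.05, alpha_out: float = 0.10):
        remaining = [c for c in data.columns if c != response]
        selected = []
        while remaining:
            # Forward step: best candidate NOT in the model.
            pvals_in = {}
            for cand in remaining:
                X = sm.add_constant(data[selected + [cand]])
                pvals_in[cand] = sm.OLS(data[response], X).fit().pvalues[cand]
            best = min(pvals_in, key=pvals_in.get)
            if pvals_in[best] >= alpha_in:
                break                     # no significant candidate: STOP
            selected.append(best)
            remaining.remove(best)
            # Backward step: drop variables that became non-significant.
            while True:
                X = sm.add_constant(data[selected])
                pvals_out = sm.OLS(data[response], X).fit().pvalues.drop("const")
                worst = pvals_out.idxmax()
                if pvals_out[worst] <= alpha_out:
                    break
                selected.remove(worst)
                remaining.append(worst)
        return selected

With alpha_out ≥ alpha_in, a variable that has just entered cannot be dropped in the same iteration, which is why the two thresholds are usually ordered this way.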
Stepwise Selection
For the House data it has already been observed that the most significant variable is the floor space (FLR). This variable enters the model. The remaining variables are fitted one at a time to the model comprising the constant term and FLR. Again, from the earlier results of the forward selection it is known that the strongest variable is ST (Storm windows). Thus, this variable enters the model:
4)                               P-Value   R²      Adj. R²
Floor space in square feet       0.000     0.675   0.646
Storm windows                    0.005
Both variables (FLR and ST) are highly significant. Thus, none of them is deleted. The remaining variables are now fitted one at a time to the model:
Price = 32.594 + 0.0189 FLR + 10.226 ST.
Example: Stepwise selection using the House data

1) P-Value
Floor in square feet 0.000
Storm windows 0.002
Number of bedrooms 0.099
2) P-Value
Floor in square feet 0.000
Storm windows 0.007
Number of fireplaces 0.903
3) P-Value
Floor in square feet 0.000
Storm windows 0.007
Number of rooms 0.854
4) P-Value
Floor in square feet 0.000
Storm windows 0.003
Front footage 0.069
5) P-Value
Floor in square feet 0.000
Storm windows 0.005
Annual Taxes 0.493
6) P-Value
Floor in square feet 0.000
Storm windows 0.015
Number of bathrooms 0.526
7) P-Value
Floor in square feet 0.000
Storm windows 0.006
Construction 0.062
8) P-Value
Floor in square feet 0.000
Storm windows 0.007
Garage size 0.05
9) P-Value
Floor in square feet 0.000
Storm windows 0.006
Condition 0.671
10) P-Value
Floor in square feet 0.000
Storm windows 0.012
Location 1 0.625
11) P-Value
Floor in square feet 0.000
Storm windows 0.006
Location 2 0.318
Example: Stepwise selection using the House data
At the 5% significance level none of the remaining variables is found to be significant. Thus, the procedure terminates and the previously selected model, which comprises the variables FLR and ST, is chosen. This is the same model derived using the forward selection method:
Variable        Coefficient   SE      t       Sig.
(Constant)      32.594        3.907   8.343   0.000
floor space     1.891E-02     0.003   5.740   0.000
storm windows   10.226        3.337   3.065   0.005
R² = 0.675   Adj. R² = 0.646   s = 7.48354
Price = 32.594 + 0.0189 FLR + 10.226 ST.