
Page 1

UNIVERSITY OF HAIFA, DEPARTMENT OF STATISTICS, SEMINAR FOR M.A.

Model Assessment and Selection

April 2015

Hastie, Tibshirani and Friedman, The Elements of Statistical Learning (2nd edition, 2009).

Presentation by: Ayellet Jehassi

Page 2

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 3

Introduction

The process of evaluating a model's performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection.

For example:

- Repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain additional information about the fitted model.

- Fitting the same statistical method multiple times using different subsets of the training data.

Page 4

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 5

Bias, Variance and Model Complexity

We have many different potential predictors. Why not base the model on all of them?

Two sides of one coin: bias and variance

– The model with more predictors can describe the phenomenon better – less bias.

– When we estimate more parameters, the variance of the estimators grows – we "fit the noise", i.e. we overfit.

A clever model selection strategy should resolve the bias–variance trade–off.

Page 6

Bias, Variance and Model Complexity

Typically our model will have a tuning parameter or parameters $\alpha$, and so we can write our predictions as $\hat f_\alpha(x)$. The tuning parameter varies the complexity of our model, and we wish to find the value of $\alpha$ that minimizes the error.

We have two separate goals that we might have in mind:

Model selection: estimating the performance of different models in order to choose the best one.

Model assessment: having chosen a final model, estimating its prediction error on new data.

Page 7

Bias, Variance and Model Complexity

The best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set.

The training set is used to fit the models.

The validation set is used to estimate prediction error for model selection. 

The test set is used for assessment of the generalization error of the final chosen model.

The typical proportions are 50%, 25%, and 25%, respectively.

The methods of this chapter approximate the validation step either analytically or by efficient sample re-use.
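As an illustration, here is a minimal sketch of such a three-way split in Python with scikit-learn, using the 50%/25%/25% proportions mentioned above; the simulated data and variable names are only placeholders, not anything from the original slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                        # hypothetical design matrix
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000)    # hypothetical response

# First carve off 50% for training, then split the remainder 50/50
# into validation (25% of the data) and test (25% of the data).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit candidate models on (X_train, y_train), compare them on (X_val, y_val),
# and report the error of the final chosen model on (X_test, y_test) only once.
```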

Page 8

Bias, Variance and Model Complexity

Test error, also called generalization error or prediction risk, is the expected error over an independent test sample drawn from the same distribution as that of the training data:

$$\mathrm{Err}_{\mathcal{T}} = E\big[L(Y, \hat f(X)) \mid \mathcal{T}\big],$$

where the training set $\mathcal{T}$ is held fixed; averaging over training sets gives the expected prediction error $\mathrm{Err} = E\big[L(Y, \hat f(X))\big]$.

Training error is the average loss over the training data:

$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big).$$

Page 9

Bias, Variance and Model Complexity

Our goal is to find the model $M$ that minimizes the test error.

Too-simple models give too much bias, and too-complex models give too much variance.

The training error is a downward-biased estimate of the prediction risk: $\overline{\mathrm{err}} < \mathrm{Err}$.

Page 10

Bias, Variance and Model Complexity

We assume that $Y = f(X) + \varepsilon$, where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$. The expected prediction error of a regression fit $\hat f(X)$ at an input point $X = x_0$, using squared-error loss, is

$$\mathrm{Err}(x_0) = E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big] = \sigma_\varepsilon^2 + \big[E\hat f(x_0) - f(x_0)\big]^2 + E\big[\hat f(x_0) - E\hat f(x_0)\big]^2 = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big) + \mathrm{Var}\big(\hat f(x_0)\big).$$

The more complex we make the model $\hat f$, the lower the bias but the higher the variance.
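The decomposition above can be seen by adding and subtracting $E\hat f(x_0)$ inside the square; the cross terms vanish because $\varepsilon$ is independent of the training data and $E[\hat f(x_0) - E\hat f(x_0)] = 0$. A short derivation sketch:

```latex
\begin{aligned}
\mathrm{Err}(x_0)
  &= E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big] \\
  &= E\Big[\big(\varepsilon + \{f(x_0) - E\hat f(x_0)\} + \{E\hat f(x_0) - \hat f(x_0)\}\big)^2\Big] \\
  &= \sigma_\varepsilon^2
     + \big[E\hat f(x_0) - f(x_0)\big]^2
     + E\big[\hat f(x_0) - E\hat f(x_0)\big]^2
     \qquad \text{(cross terms have expectation zero)} \\
  &= \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big) + \mathrm{Var}\big(\hat f(x_0)\big).
\end{aligned}
```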

Page 11

Bias, Variance and Model Complexity

For example, for the k-nearest-neighbor regression fit, the expected prediction error at $x_0$ has the form

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}.$$

As we increase $k$, the bias will typically increase while the variance decreases.
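A small simulation sketch of this trade-off (Python with scikit-learn; the true function, noise level, and evaluation point below are illustrative assumptions, not part of the original slides). It estimates the squared bias and variance of the k-NN fit at a single point $x_0$ over many simulated training sets:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # hypothetical true regression function
sigma, n, x0 = 0.3, 100, 0.5               # noise sd, training size, target point

for k in (1, 5, 25):
    preds = []
    for _ in range(500):                    # many independent training sets
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        fit = KNeighborsRegressor(n_neighbors=k).fit(x.reshape(-1, 1), y)
        preds.append(fit.predict([[x0]])[0])
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2     # squared bias at x0
    var = preds.var()                       # variance of the fit at x0
    print(f"k={k:2d}  bias^2={bias2:.4f}  variance={var:.4f}")
```

Larger k averages over more (and more distant) neighbors, so the variance shrinks roughly like $\sigma_\varepsilon^2/k$ while the squared bias grows.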

We then select a single best model from among $M_0, \dots, M_p$ using cross-validated prediction error, Cp, AIC, or BIC.

Page 12

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 13

Cp, AIC and BIC

These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.

Mallows' $C_p$:

$$C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2,$$

where $d$ is the number of parameters and $\hat\sigma_\varepsilon^2$ is an estimate of the noise variance, obtained from a low-bias model.

AIC (Akaike information criterion): given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$,

$$\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}, \qquad \mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat\sigma_\varepsilon^2,$$

and we choose the model giving the smallest AIC.

In the case of the linear model with Gaussian errors, Cp and AIC are equivalent.
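A hedged numpy sketch of the Gaussian-error Cp formula above, applied to a sequence of nested linear models on simulated data (the data-generating model and variable names are illustrative assumptions; since Cp and AIC coincide here, only Cp is printed):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 10
X = rng.normal(size=(N, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=N)       # only the first 2 predictors matter

# Estimate sigma^2 from the full (low-bias) model, as is customary.
X_full = np.column_stack([np.ones(N), X])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
sigma2_hat = np.sum((y - X_full @ beta_full) ** 2) / (N - X_full.shape[1])

for d in range(1, p + 1):                               # model using the first d predictors
    Xd = np.column_stack([np.ones(N), X[:, :d]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    err_bar = np.mean((y - Xd @ beta) ** 2)             # training error
    n_par = d + 1                                       # intercept + d slopes
    cp = err_bar + 2 * (n_par / N) * sigma2_hat         # Cp (= AIC here, up to constants)
    print(f"d={d:2d}  training error={err_bar:.3f}  Cp={cp:.3f}")
```

The training error keeps decreasing as d grows, while Cp typically bottoms out near the true model size because of the $2\,d/N$ penalty.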

Page 14

Cp, AIC and BIC

BIC (Bayesian information criterion): $\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\, d$. Under the Gaussian model,

$$\mathrm{BIC} = \frac{N}{\hat\sigma_\varepsilon^2}\Big[\overline{\mathrm{err}} + (\log N)\,\frac{d}{N}\,\hat\sigma_\varepsilon^2\Big].$$

Like Cp, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.

For model selection purposes, there is no clear choice between AIC and BIC.
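A self-contained sketch comparing the two criteria on simulated data (it assumes statsmodels is available and uses its likelihood-based AIC and BIC; the data are the same kind of nested-model toy example as above, not anything from the slides). Because BIC replaces the factor 2 with $\log N$, its penalty is heavier whenever $N > e^2 \approx 7.4$, so it tends to pick smaller models:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
N, p = 200, 10
X = rng.normal(size=(N, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=N)

results = []
for d in range(1, p + 1):
    Xd = sm.add_constant(X[:, :d])            # intercept + first d predictors
    fit = sm.OLS(y, Xd).fit()
    results.append((d, fit.aic, fit.bic))     # likelihood-based AIC and BIC

best_aic = min(results, key=lambda r: r[1])[0]
best_bic = min(results, key=lambda r: r[2])[0]
print("AIC chooses d =", best_aic, " BIC chooses d =", best_bic)
```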

Page 15

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 16

Cross-Validation

We randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.

The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.

The resulting validation-set error provides an estimate of the test error. 

Page 17

K-fold Cross-Validation

Widely used approach for estimating test error. 

Estimates can be used to select best model, and to give an idea of the test error of the final chosen model. 

The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K − 1 parts (combined), and then obtain predictions for the left-out kth part.

This is done in turn for each part k = 1, 2, . . . , K, and then the results are combined.
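A minimal sketch of these mechanics with scikit-learn's KFold; the linear-regression model and simulated data are only placeholders for whatever model is being assessed:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=150)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on K-1 parts
    pred = model.predict(X[test_idx])                            # predict the left-out part
    fold_errors.append(np.mean((y[test_idx] - pred) ** 2))       # squared-error loss

cv_error = np.mean(fold_errors)
print("5-fold CV estimate of prediction error:", round(cv_error, 3))
```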

Page 18

Cross-Validation: How to use it?

1. For each model complexity, obtain a cross-validated estimate of the error for that complexity: the average error over all folds.

2. Now you have a number (an average error) for each model complexity. Choose the best (lowest-error) model complexity.

3. Finally, train a model using all of the available data with that choice of model complexity.
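A sketch of these three steps in Python with scikit-learn, using polynomial degree as the complexity knob; the degrees, pipeline, and simulated data are illustrative assumptions rather than the slides' own example:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=(120, 1))
y = np.sin(3 * x[:, 0]) + 0.2 * rng.normal(size=120)

cv_errors = {}
for degree in range(1, 9):                                    # step 1: CV error per complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    cv_errors[degree] = -scores.mean()

best_degree = min(cv_errors, key=cv_errors.get)               # step 2: lowest CV error
final_model = make_pipeline(PolynomialFeatures(best_degree),  # step 3: refit on all the data
                            LinearRegression()).fit(x, y)
print("chosen degree:", best_degree)
```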

Page 19

K-fold Cross-Validation

Let the K parts be $C_1, \dots, C_K$, where $C_k$ denotes the indices of the observations in part $k$. There are $n_k$ observations in part $k$; if $N$ is a multiple of $K$, then $n_k = N/K$.

The cross-validation estimate of the prediction error is

$$\mathrm{CV}(\hat f) = \frac{1}{N}\sum_{k=1}^{K}\sum_{i \in C_k} L\big(y_i, \hat f^{-k}(x_i)\big),$$

where $\hat f^{-k}$ denotes the model fit with the $k$th part of the data removed.

Given a set of models $f(x, \alpha)$ indexed by a tuning parameter $\alpha$, we compute $\mathrm{CV}(\hat f, \alpha)$ in the same way for each $\alpha$.

This provides an estimate of the test error curve, and we find the tuning parameter $\hat\alpha$ that minimizes it.

Our final chosen model is $f(x, \hat\alpha)$, fit to all the data.

Page 20

K-fold Cross-Validation

The case $K = N$ is known as leave-one-out cross-validation (LOOCV). The LOOCV estimator is approximately unbiased for the true prediction error, but can have high variance.

LOOCV is sometimes useful, but it typically doesn't shake up the data enough: the estimates from each fold are highly correlated, and hence their average can have high variance.

A better choice is K = 5 or 10.

Page 21

K-fold Cross-Validation

Consider a simple classifier for microarrays:

1) Starting with 5,000 genes, find the top 200 genes having the largest correlation with the class label. 

2) Carry out nearest-centroid classification using these top 200 genes.

How do we select the tuning parameter in the classifier? 

Way 1: apply cross-validation to step 2 

Way 2: apply cross-validation to steps 1 and 2 

Which is right? Way 2: we must cross-validate the whole procedure, so that the gene screening in step 1 is repeated inside every fold.
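A sketch contrasting the two ways on purely random data, where the true error rate is 50%. The screening step below uses simple correlation with the labels, and the problem sizes are scaled down from the 5,000/200 genes in the slide for speed; these choices are illustrative assumptions, not the slides' exact setup:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(4)
n, p, top = 50, 1000, 20                        # samples, "genes", genes kept by screening
X = rng.normal(size=(n, p))                     # pure noise: there is no real signal
y = np.repeat([0, 1], n // 2)

def screen(X_sub, y_sub, top):
    """Indices of the `top` features most correlated with the class label."""
    corr = np.abs([np.corrcoef(X_sub[:, j], y_sub)[0, 1] for j in range(X_sub.shape[1])])
    return np.argsort(corr)[-top:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Way 1 (wrong): screen on ALL the data, then cross-validate only the classifier.
keep = screen(X, y, top)
errs1 = []
for tr, te in skf.split(X, y):
    clf = NearestCentroid().fit(X[tr][:, keep], y[tr])
    errs1.append(np.mean(clf.predict(X[te][:, keep]) != y[te]))

# Way 2 (right): redo the screening inside every fold, using only that fold's training data.
errs2 = []
for tr, te in skf.split(X, y):
    keep_k = screen(X[tr], y[tr], top)
    clf = NearestCentroid().fit(X[tr][:, keep_k], y[tr])
    errs2.append(np.mean(clf.predict(X[te][:, keep_k]) != y[te]))

print(f"wrong way: {np.mean(errs1):.2f}   right way: {np.mean(errs2):.2f}  (truth is 0.50)")
```

Because the wrong way lets the held-out observations influence the gene screening, its CV error is optimistically low even though there is no signal; the right way stays near 50%.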

Page 22

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 23

Bootstrap Methods

The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.

Page 24

Bootstrap Methods

A simple example:

Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of $X$ and $Y$, respectively, where $X$ and $Y$ are random quantities.

We will invest a fraction $\alpha$ of our money in $X$, and the remaining $1 - \alpha$ in $Y$.

We wish to choose $\alpha$ to minimize the total risk, or variance, of our investment. In other words, we want to minimize $\mathrm{Var}\big(\alpha X + (1 - \alpha)Y\big)$; the minimizing value is $\alpha = \dfrac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}$, which in practice must be estimated from the data.
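A minimal bootstrap sketch for this example in numpy; the covariance of the simulated returns is an assumed value chosen for illustration, and $\hat\alpha$ uses the plug-in formula above:

```python
import numpy as np

rng = np.random.default_rng(5)
cov = np.array([[1.0, 0.5], [0.5, 1.25]])               # assumed covariance of (X, Y)
data = rng.multivariate_normal([0, 0], cov, size=100)   # one observed sample of returns

def alpha_hat(sample):
    x, y = sample[:, 0], sample[:, 1]
    vx, vy, cxy = x.var(ddof=1), y.var(ddof=1), np.cov(x, y, ddof=1)[0, 1]
    return (vy - cxy) / (vx + vy - 2 * cxy)

B = 1000
# Each bootstrap dataset resamples the 100 observations with replacement.
boot = np.array([alpha_hat(data[rng.integers(0, len(data), len(data))]) for _ in range(B)])

print("alpha_hat on the original sample:", round(alpha_hat(data), 3))
print("bootstrap estimate of SE(alpha_hat):", round(boot.std(ddof=1), 3))
```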

Page 25

Bootstrap Methods

The bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.

Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.

Page 26

Bootstrap Methods

Each of these "bootstrap data sets" is created by sampling with replacement, and is the same size as our original dataset.

As a result, some observations may appear more than once in a given bootstrap data set and some not at all.

Page 27

Bootstrap Methods

To estimate prediction error using the bootstrap, we could think about using each bootstrap dataset as our training sample, and the original sample as our validation sample.

But each bootstrap sample has significant overlap with the original data. About two-thirds of the original data points appear in each bootstrap sample.
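The "about two-thirds" figure follows from a short calculation: the probability that a given original observation appears in a bootstrap sample of size $N$ is

```latex
\Pr(\text{observation } i \in \text{bootstrap sample})
  = 1 - \left(1 - \frac{1}{N}\right)^{N}
  \;\longrightarrow\; 1 - e^{-1} \approx 0.632
  \qquad (N \to \infty).
```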

Page 28

Bootstrap Methods

If $\hat f^{*b}(x_i)$ is the predicted value at $x_i$ from the model fitted to the $b$th bootstrap dataset, our estimate is

$$\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\big(y_i, \hat f^{*b}(x_i)\big).$$

$\widehat{\mathrm{Err}}_{\mathrm{boot}}$ does not provide a good estimate in general: the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two sets have observations in common.

Page 29

Summary

In order to resolve the bias–variance trade-off, one can apply different methods such as AIC, BIC, cross-validation, and the bootstrap.

AIC and BIC are model selection methods: they adjust the training error for the model size and allow us to choose the subset of variables that minimizes the estimated prediction error.

The CV method holds out one fold of the data set at a time and uses the remaining observations to fit the model and predict the held-out data points. LOOCV gives an approximately unbiased estimate of the prediction error, at the cost of higher variance.

K = 5 to 10 is frequently a good choice that balances bias and variance.

The bootstrap method uses resampling with replacement to quantify the uncertainty (standard error or confidence interval) of estimators.

When used to estimate prediction error directly, the bootstrap has an inherent bias that comes from the overlap between the bootstrap samples and the original data.

Page 30

Thank you!