
Page 1

UNIVERSITY OF HAIFA, DEPARTMENT OF STATISTICS, SEMINAR FOR M.A.

Model Assessment and Selection

April 2015

Hastie, Tibshirani and Friedman, The Elements of Statistical Learning (2nd edition, 2009).

Presentation by: Ayellet Jehassi

Page 2

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 3

Introduction

The process of evaluating a model's performance is known as model assessment, whereas the process of selecting the proper level of flexibility for a model is known as model selection.

For example:

- Repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain additional information about the fitted model.

- Fitting the same statistical method multiple times using different subsets of the training data.

Page 4

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 5

Bias, Variance and Model Complexity

We have many different potential predictors. Why not base the model on all of them?

Two sides of one coin: bias and variance

– The model with more predictors can describe the phenomenon better – less bias.

– When we estimate more parameters, the variance of the estimators grows – we "fit the noise", i.e. we overfit.

A clever model selection strategy should resolve the bias–variance trade–off.

Page 6

Bias, Variance and Model Complexity

Typically our model will have a tuning parameter or parameters $\alpha$, and so we can write our predictions as $\hat f_\alpha(x)$. The tuning parameter varies the complexity of our model, and we wish to find the value of $\alpha$ that minimizes the error.

We have two separate goals that we might have in mind:

Model selection: estimating the performance of different models in order to choose the best one.

Model assessment: having chosen a final model, estimating its prediction error on new data.

Page 7

Bias, Variance and Model Complexity

The best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set.

The training set is used to fit the models.

The validation set is used to estimate prediction error for model selection. 

The test set is used for assessment of the generalization error of the final chosen model.

The typical proportions are 50%, 25%, and 25%, respectively.

The methods of this chapter approximate the validation step either analytically or by efficient sample re-use.
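As an illustration, here is a minimal sketch of such a three-way split in Python with scikit-learn, using the 50%/25%/25% proportions mentioned above; the simulated data and variable names are only placeholders, not anything from the original slides.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))                        # hypothetical design matrix
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000)    # hypothetical response

# First carve off 50% for training, then split the remainder 50/50
# into validation (25% of the data) and test (25% of the data).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit candidate models on (X_train, y_train), compare them on (X_val, y_val),
# and report the error of the final chosen model on (X_test, y_test) only once.
```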

Page 8

Bias, Variance and Model Complexity

Test error, also called generalization error or prediction risk, is the expected error over an independent test sample drawn from the same distribution as that of the training data:

$$\mathrm{Err}_{\mathcal{T}} = E\big[L(Y, \hat f(X)) \mid \mathcal{T}\big],$$

where the training set $\mathcal{T}$ is held fixed; averaging over training sets gives the expected prediction error $\mathrm{Err} = E\big[L(Y, \hat f(X))\big]$.

Training error is the average loss over the training data:

$$\overline{\mathrm{err}} = \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big).$$

Page 9

Bias, Variance and Model Complexity

Our goal is to find the model $M$ that minimizes the test error.

Too-simple models give too much bias, and too-complex models give too much variance.

The training error is a downward-biased estimate of the prediction risk: $\overline{\mathrm{err}} < \mathrm{Err}$.

Page 10

Bias, Variance and Model Complexity

We assume that $Y = f(X) + \varepsilon$, where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$. The expected prediction error of a regression fit $\hat f(X)$ at an input point $X = x_0$, using squared-error loss, is

$$\mathrm{Err}(x_0) = E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big] = \sigma_\varepsilon^2 + \big[E\hat f(x_0) - f(x_0)\big]^2 + E\big[\hat f(x_0) - E\hat f(x_0)\big]^2 = \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big) + \mathrm{Var}\big(\hat f(x_0)\big).$$

The more complex we make the model $\hat f$, the lower the bias but the higher the variance.
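The decomposition above can be seen by adding and subtracting $E\hat f(x_0)$ inside the square; the cross terms vanish because $\varepsilon$ is independent of the training data and $E[\hat f(x_0) - E\hat f(x_0)] = 0$. A short derivation sketch:

```latex
\begin{aligned}
\mathrm{Err}(x_0)
  &= E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big] \\
  &= E\Big[\big(\varepsilon + \{f(x_0) - E\hat f(x_0)\} + \{E\hat f(x_0) - \hat f(x_0)\}\big)^2\Big] \\
  &= \sigma_\varepsilon^2
     + \big[E\hat f(x_0) - f(x_0)\big]^2
     + E\big[\hat f(x_0) - E\hat f(x_0)\big]^2
     \qquad \text{(cross terms have expectation zero)} \\
  &= \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big) + \mathrm{Var}\big(\hat f(x_0)\big).
\end{aligned}
```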

Page 11

Bias, Variance and Model Complexity

For example, for the k-nearest-neighbor regression fit, the expected prediction error at $x_0$ has the form

$$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2 + \frac{\sigma_\varepsilon^2}{k}.$$

As we increase $k$, the bias will typically increase while the variance decreases.
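A small simulation sketch of this trade-off (Python with scikit-learn; the true function, noise level, and evaluation point below are illustrative assumptions, not part of the original slides). It estimates the squared bias and variance of the k-NN fit at a single point $x_0$ over many simulated training sets:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # hypothetical true regression function
sigma, n, x0 = 0.3, 100, 0.5               # noise sd, training size, target point

for k in (1, 5, 25):
    preds = []
    for _ in range(500):                    # many independent training sets
        x = rng.uniform(0, 1, n)
        y = f(x) + rng.normal(0, sigma, n)
        fit = KNeighborsRegressor(n_neighbors=k).fit(x.reshape(-1, 1), y)
        preds.append(fit.predict([[x0]])[0])
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2     # squared bias at x0
    var = preds.var()                       # variance of the fit at x0
    print(f"k={k:2d}  bias^2={bias2:.4f}  variance={var:.4f}")
```

Larger k averages over more (and more distant) neighbors, so the variance shrinks roughly like $\sigma_\varepsilon^2/k$ while the squared bias grows.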

We then select a single best model from among $M_0, \dots, M_p$ using cross-validated prediction error, Cp, AIC, or BIC.

Page 12

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 13

Cp, AIC and BIC

These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.

Mallows' $C_p$:

$$C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2,$$

where $d$ is the number of parameters and $\hat\sigma_\varepsilon^2$ is an estimate of the noise variance, obtained from a low-bias model.

AIC (Akaike information criterion): given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$,

$$\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}, \qquad \mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat\sigma_\varepsilon^2,$$

and we choose the model giving the smallest AIC.

In the case of the linear model with Gaussian errors, Cp and AIC are equivalent.
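A hedged numpy sketch of the Gaussian-error Cp formula above, applied to a sequence of nested linear models on simulated data (the data-generating model and variable names are illustrative assumptions; since Cp and AIC coincide here, only Cp is printed):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 10
X = rng.normal(size=(N, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=N)       # only the first 2 predictors matter

# Estimate sigma^2 from the full (low-bias) model, as is customary.
X_full = np.column_stack([np.ones(N), X])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
sigma2_hat = np.sum((y - X_full @ beta_full) ** 2) / (N - X_full.shape[1])

for d in range(1, p + 1):                               # model using the first d predictors
    Xd = np.column_stack([np.ones(N), X[:, :d]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    err_bar = np.mean((y - Xd @ beta) ** 2)             # training error
    n_par = d + 1                                       # intercept + d slopes
    cp = err_bar + 2 * (n_par / N) * sigma2_hat         # Cp (= AIC here, up to constants)
    print(f"d={d:2d}  training error={err_bar:.3f}  Cp={cp:.3f}")
```

The training error keeps decreasing as d grows, while Cp typically bottoms out near the true model size because of the $2\,d/N$ penalty.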

Page 14

Cp, AIC and BIC

BIC (Bayesian information criterion): $\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\, d$. Under the Gaussian model,

$$\mathrm{BIC} = \frac{N}{\hat\sigma_\varepsilon^2}\Big[\overline{\mathrm{err}} + (\log N)\,\frac{d}{N}\,\hat\sigma_\varepsilon^2\Big].$$

Like Cp, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.

For model selection purposes, there is no clear choice between AIC and BIC.
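A self-contained sketch comparing the two criteria on simulated data (it assumes statsmodels is available and uses its likelihood-based AIC and BIC; the data are the same kind of nested-model toy example as above, not anything from the slides). Because BIC replaces the factor 2 with $\log N$, its penalty is heavier whenever $N > e^2 \approx 7.4$, so it tends to pick smaller models:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
N, p = 200, 10
X = rng.normal(size=(N, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=N)

results = []
for d in range(1, p + 1):
    Xd = sm.add_constant(X[:, :d])            # intercept + first d predictors
    fit = sm.OLS(y, Xd).fit()
    results.append((d, fit.aic, fit.bic))     # likelihood-based AIC and BIC

best_aic = min(results, key=lambda r: r[1])[0]
best_bic = min(results, key=lambda r: r[2])[0]
print("AIC chooses d =", best_aic, " BIC chooses d =", best_bic)
```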

Page 15

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 16

Cross-Validation

We randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.

The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set.

The resulting validation-set error provides an estimate of the test error. 

Page 17

K-fold Cross-Validation

Widely used approach for estimating test error. 

Estimates can be used to select best model, and to give an idea of the test error of the final chosen model. 

The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K − 1 parts (combined), and then obtain predictions for the left-out kth part.

This is done in turn for each part k = 1, 2, . . . , K, and then the results are combined.
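A minimal sketch of these mechanics with scikit-learn's KFold; the linear-regression model and simulated data are only placeholders for whatever model is being assessed:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=150)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # fit on K-1 parts
    pred = model.predict(X[test_idx])                            # predict the left-out part
    fold_errors.append(np.mean((y[test_idx] - pred) ** 2))       # squared-error loss

cv_error = np.mean(fold_errors)
print("5-fold CV estimate of prediction error:", round(cv_error, 3))
```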

Page 18

Cross-Validation: How to use it?

1. For each model complexity, obtain a cross-validated estimate of the error for that complexity: the average error over all folds.

2. Now you have a number (an average error) for each model complexity. Choose the best (lowest-error) model complexity.

3. Finally, train a model using all of the available data with that choice of model complexity.
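A sketch of these three steps in Python with scikit-learn, using polynomial degree as the complexity knob; the degrees, pipeline, and simulated data are illustrative assumptions rather than the slides' own example:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=(120, 1))
y = np.sin(3 * x[:, 0]) + 0.2 * rng.normal(size=120)

cv_errors = {}
for degree in range(1, 9):                                    # step 1: CV error per complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    cv_errors[degree] = -scores.mean()

best_degree = min(cv_errors, key=cv_errors.get)               # step 2: lowest CV error
final_model = make_pipeline(PolynomialFeatures(best_degree),  # step 3: refit on all the data
                            LinearRegression()).fit(x, y)
print("chosen degree:", best_degree)
```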

Page 19

K-fold Cross-Validation

Let the K parts be $C_1, \dots, C_K$, where $C_k$ denotes the indices of the observations in part $k$. There are $n_k$ observations in part $k$; if $N$ is a multiple of $K$, then $n_k = N/K$.

The cross-validation estimate of the prediction error is

$$\mathrm{CV}(\hat f) = \frac{1}{N}\sum_{k=1}^{K}\sum_{i \in C_k} L\big(y_i, \hat f^{-k}(x_i)\big),$$

where $\hat f^{-k}$ denotes the model fit with the $k$th part of the data removed.

Given a set of models $f(x, \alpha)$ indexed by a tuning parameter $\alpha$, we compute $\mathrm{CV}(\hat f, \alpha)$ in the same way for each $\alpha$.

This provides an estimate of the test error curve, and we find the tuning parameter $\hat\alpha$ that minimizes it.

Our final chosen model is $f(x, \hat\alpha)$, fit to all the data.

Page 20

K-fold Cross-Validation

The case $K = N$ is known as leave-one-out cross-validation (LOOCV). The LOOCV estimator is approximately unbiased for the true prediction error, but can have high variance.

LOOCV is sometimes useful, but it typically doesn't shake up the data enough: the estimates from each fold are highly correlated, and hence their average can have high variance.

A better choice is K = 5 or 10.

Page 21

K-fold Cross-Validation

Consider a simple classifier for microarrays:

1) Starting with 5,000 genes, find the top 200 genes having the largest correlation with the class label. 

2) Carry out nearest-centroid classification using these top 200 genes.

How do we select the tuning parameter in the classifier? 

Way 1: apply cross-validation to step 2 

Way 2: apply cross-validation to steps 1 and 2 

Which is right? Way 2: we must cross-validate the whole procedure, so that the gene screening in step 1 is repeated inside every fold.
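A sketch contrasting the two ways on purely random data, where the true error rate is 50%. The screening step below uses simple correlation with the labels, and the problem sizes are scaled down from the 5,000/200 genes in the slide for speed; these choices are illustrative assumptions, not the slides' exact setup:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(4)
n, p, top = 50, 1000, 20                        # samples, "genes", genes kept by screening
X = rng.normal(size=(n, p))                     # pure noise: there is no real signal
y = np.repeat([0, 1], n // 2)

def screen(X_sub, y_sub, top):
    """Indices of the `top` features most correlated with the class label."""
    corr = np.abs([np.corrcoef(X_sub[:, j], y_sub)[0, 1] for j in range(X_sub.shape[1])])
    return np.argsort(corr)[-top:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Way 1 (wrong): screen on ALL the data, then cross-validate only the classifier.
keep = screen(X, y, top)
errs1 = []
for tr, te in skf.split(X, y):
    clf = NearestCentroid().fit(X[tr][:, keep], y[tr])
    errs1.append(np.mean(clf.predict(X[te][:, keep]) != y[te]))

# Way 2 (right): redo the screening inside every fold, using only that fold's training data.
errs2 = []
for tr, te in skf.split(X, y):
    keep_k = screen(X[tr], y[tr], top)
    clf = NearestCentroid().fit(X[tr][:, keep_k], y[tr])
    errs2.append(np.mean(clf.predict(X[te][:, keep_k]) != y[te]))

print(f"wrong way: {np.mean(errs1):.2f}   right way: {np.mean(errs2):.2f}  (truth is 0.50)")
```

Because the wrong way lets the held-out observations influence the gene screening, its CV error is optimistically low even though there is no signal; the right way stays near 50%.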

Page 22

Outline

1) Introduction

2) Bias, Variance and Model Complexity

3) Cp, AIC and BIC

4) Cross-Validation

5) Bootstrap Methods

Page 23

Bootstrap Methods

The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval for that coefficient.

Page 24

Bootstrap Methods

A simple example:

Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of $X$ and $Y$, respectively, where $X$ and $Y$ are random quantities.

We will invest a fraction $\alpha$ of our money in $X$, and the remaining $1 - \alpha$ in $Y$.

We wish to choose $\alpha$ to minimize the total risk, or variance, of our investment. In other words, we want to minimize $\mathrm{Var}\big(\alpha X + (1 - \alpha)Y\big)$; the minimizing value is $\alpha = \dfrac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}$, which in practice must be estimated from the data.
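A minimal bootstrap sketch for this example in numpy; the covariance of the simulated returns is an assumed value chosen for illustration, and $\hat\alpha$ uses the plug-in formula above:

```python
import numpy as np

rng = np.random.default_rng(5)
cov = np.array([[1.0, 0.5], [0.5, 1.25]])               # assumed covariance of (X, Y)
data = rng.multivariate_normal([0, 0], cov, size=100)   # one observed sample of returns

def alpha_hat(sample):
    x, y = sample[:, 0], sample[:, 1]
    vx, vy, cxy = x.var(ddof=1), y.var(ddof=1), np.cov(x, y, ddof=1)[0, 1]
    return (vy - cxy) / (vx + vy - 2 * cxy)

B = 1000
# Each bootstrap dataset resamples the 100 observations with replacement.
boot = np.array([alpha_hat(data[rng.integers(0, len(data), len(data))]) for _ in range(B)])

print("alpha_hat on the original sample:", round(alpha_hat(data), 3))
print("bootstrap estimate of SE(alpha_hat):", round(boot.std(ddof=1), 3))
```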

Page 25

Bootstrap Methods

The bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets, so that we can estimate the variability of our estimate without generating additional samples.

Rather than repeatedly obtaining independent data sets from the population, we instead obtain distinct data sets by repeatedly sampling observations from the original data set with replacement.

Page 26

Bootstrap Methods

Each of these "bootstrap data sets" is created by sampling with replacement, and is the same size as our original dataset.

As a result, some observations may appear more than once in a given bootstrap data set and some not at all.

Page 27

Bootstrap Methods

To estimate prediction error using the bootstrap, we could think about using each bootstrap dataset as our training sample, and the original sample as our validation sample.

But each bootstrap sample has significant overlap with the original data. About two-thirds of the original data points appear in each bootstrap sample.
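The "about two-thirds" figure follows from a short calculation: the probability that a given original observation appears in a bootstrap sample of size $N$ is

```latex
\Pr(\text{observation } i \in \text{bootstrap sample})
  = 1 - \left(1 - \frac{1}{N}\right)^{N}
  \;\longrightarrow\; 1 - e^{-1} \approx 0.632
  \qquad (N \to \infty).
```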

Page 28

Bootstrap Methods

If $\hat f^{*b}(x_i)$ is the predicted value at $x_i$ from the model fitted to the $b$th bootstrap dataset, our estimate is

$$\widehat{\mathrm{Err}}_{\mathrm{boot}} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^{B}\sum_{i=1}^{N} L\big(y_i, \hat f^{*b}(x_i)\big).$$

$\widehat{\mathrm{Err}}_{\mathrm{boot}}$ does not provide a good estimate in general: the bootstrap datasets are acting as the training samples, while the original training set is acting as the test sample, and these two sets have observations in common.

Page 29

Summary

In order to resolve the bias–variance trade-off, one can apply different methods such as AIC, BIC, cross-validation, and the bootstrap.

AIC and BIC are model selection methods: they adjust the training error for the model size and allow us to choose the subset of variables that minimizes the estimated prediction error.

The CV method holds out one fold of the data set at a time and uses the remaining observations to fit the model and predict the held-out data points. LOOCV gives an approximately unbiased estimate of the prediction error, at the cost of higher variance.

K = 5 to 10 is frequently a good choice that balances bias and variance.

The bootstrap method uses resampling with replacement to quantify the uncertainty (standard error or confidence interval) of estimators.

When used to estimate prediction error directly, the bootstrap has an inherent bias that comes from the overlap between the bootstrap samples and the original data.

Page 30

Thank you!