Model Assessment and Selection

University of Haifa, Department of Statistics, Seminar for M.A.
April 2015

Based on: Hastie, Tibshirani and Friedman, The Elements of Statistical Learning (2nd edition, 2009).

Presentation by: Ayellet Jehassi
Outline
1) Introduction
2) Bias, Variance and Model Complexity
3) Cp, AIC and BIC
4) Cross-Validation
5) Bootstrap Methods
Introduction
The process of evaluating a model's performance is known as model assessment, while the process
of selecting the proper level of flexibility for a model is known as model selection.
For example:
- Repeatedly drawing samples from a training set and refitting a model of interest on each sample in
order to obtain additional information about the fitted model.
- Fitting the same statistical method multiple times using different subsets of the training data.
Outline
1) Introduction
2) Bias, Variance and Model Complexity
3) Cp, AIC and BIC
4) Cross-Validation
5) Bootstrap Methods
Bias, Variance and Model Complexity
We have many different potential predictors. Why not base the model on all of them?
Two sides of one coin: bias and variance.
- A model with more predictors can describe the phenomenon better: less bias.
- When we estimate more parameters, the variance of the estimators grows; we "fit the noise", i.e. we overfit.
A clever model selection strategy should resolve the bias-variance trade-off.
Bias, Variance and Model Complexity
Typically our model will have a tuning parameter (or parameters) \alpha, so we can write our predictions as \hat{f}_\alpha(x). The tuning parameter varies the complexity of our model, and we wish to find the value of \alpha that minimizes the error.
There are two separate goals that we might have in mind:
Model selection: estimating the performance of different models in order to choose the best one.
Model assessment: having chosen a final model, estimating its prediction error on new data.
Bias, Variance and Model Complexity
The best approach for both problems is to randomly divide the dataset into three parts: a training set, a
validation set, and a test set.
The training set is used to fit the models.
The validation set is used to estimate prediction error for model selection.
The test set is used for assessment of the generalization error of the final chosen model.
The typical proportions are, respectively, 50%, 25% and 25%.
The methods of this chapter approximate the validation step either analytically or by efficient sample re-use.
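As an illustration, here is a minimal sketch of such a split (assuming NumPy is available; the 50/25/25 proportions follow the slide above):

import numpy as np

def train_val_test_split(X, y, seed=0):
    # Randomly split (X, y) into 50% training, 25% validation, 25% test.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train, n_val = int(0.50 * len(y)), int(0.25 * len(y))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])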
Bias, Variance and Model Complexity
Test error, also known as generalization error or prediction risk, is the expected error over an independent test sample (X, Y) drawn from the same distribution as that of the training data T:

Err_T = E[L(Y, \hat{f}(X)) | T].

Training error is the average loss over the training data:

\bar{err} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}(x_i)).
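A small simulated illustration of the gap between the two quantities (a sketch, assuming NumPy; the sine function, noise level and polynomial degree are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)                        # true regression function
x_tr = rng.uniform(0, 1, 20); y_tr = f(x_tr) + rng.normal(0, 0.3, 20)
x_te = rng.uniform(0, 1, 10_000); y_te = f(x_te) + rng.normal(0, 0.3, 10_000)

coef = np.polyfit(x_tr, y_tr, deg=9)                       # flexible polynomial fit
train_err = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)  # \bar{err}
test_err = np.mean((y_te - np.polyval(coef, x_te)) ** 2)   # estimate of Err_T
print(train_err, test_err)   # training error is typically far below test error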
Bias, Variance and Model Complexity

Our goal is to find the model M which minimizes the test error.

Too-simple models give too much bias, and too-complex models give too much variance.

The training error is a downward-biased estimate of the prediction risk: \bar{err} < Err_T.
Bias, Variance and Model Complexity
We assume that Y = f(X) + \varepsilon, where E(\varepsilon) = 0 and Var(\varepsilon) = \sigma^2_\varepsilon. The expected prediction error of a regression fit \hat{f}(X) at an input point X = x_0, using squared-error loss, is

Err(x_0) = E[(Y - \hat{f}(x_0))^2 | X = x_0]
         = \sigma^2_\varepsilon + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2
         = Irreducible Error + Bias^2 + Variance.

The more complex we make the model \hat{f}, the lower the bias but the higher the variance.
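The decomposition can be checked by simulation. A minimal sketch (assuming NumPy; a cubic polynomial fit at a single point x_0, over repeated training sets):

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
sigma, x0, n, reps, deg = 0.3, 0.5, 30, 2000, 3

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    preds[r] = np.polyval(np.polyfit(x, y, deg), x0)  # \hat{f}(x0) on this training set

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print(sigma**2, bias2, variance)   # the three terms of Err(x0)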
Bias, Variance and Model Complexity
For example, for a k-nearest-neighbor regression fit, the prediction error at x_0 has the following form:

Err(x_0) = \sigma^2_\varepsilon + \Big[ f(x_0) - \frac{1}{k} \sum_{\ell=1}^{k} f(x_{(\ell)}) \Big]^2 + \frac{\sigma^2_\varepsilon}{k}.

As we increase k, the bias will typically increase while the variance decreases. We can then select a single best model from among M_0, . . . , M_p using cross-validated prediction error, Cp, AIC or BIC.
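The trade-off is easy to see empirically. A short sketch (assuming scikit-learn is available; the data and the grid of k values are arbitrary):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)
x_tr = rng.uniform(0, 1, (100, 1)); y_tr = f(x_tr[:, 0]) + rng.normal(0, 0.3, 100)
x_te = rng.uniform(0, 1, (5000, 1)); y_te = f(x_te[:, 0]) + rng.normal(0, 0.3, 5000)

for k in (1, 5, 20, 50):
    fit = KNeighborsRegressor(n_neighbors=k).fit(x_tr, y_tr)
    print(k, np.mean((y_te - fit.predict(x_te)) ** 2))  # test error as k grows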
Outline
1) Introduction
2) Bias, Variance and Model Complexity
3) Cp, AIC and BIC
4) Cross-Validation
5) Bootstrap Methods
Cp, AIC and BIC
These techniques adjust the training error for the model size, and can be used to select among a set of models with different numbers of variables.

Mallows' Cp: for a model with d parameters,

C_p = \bar{err} + 2 \, \frac{d}{N} \, \hat{\sigma}^2_\varepsilon,

where \hat{\sigma}^2_\varepsilon is an estimate of the noise variance.

AIC (Akaike information criterion): given a set of models,

AIC = -\frac{2}{N} \, loglik + 2 \, \frac{d}{N},

where loglik is the maximized log-likelihood.

In the case of the linear model with Gaussian errors, Cp and AIC are equivalent.
Cp, AIC and BIC
BIC (Bayesian Information Criterion): under the Gaussian model,

BIC = -2 \, loglik + (\log N) \, d.

Like Cp, the BIC will tend to take on a small value for a model with a low test error, and so generally we select the model that has the lowest BIC value.

For model selection purposes, there is no clear choice between AIC and BIC; BIC penalizes model complexity more heavily once log N > 2 (i.e. N > 7 or so).
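A sketch of computing all three criteria for an OLS fit (assuming NumPy; Gaussian errors, with \hat{\sigma}^2_\varepsilon supplied, e.g. from a low-bias full model):

import numpy as np

def criteria(X, y, sigma2_hat):
    # Cp, AIC and BIC for an OLS fit; X is the design matrix (intercept included).
    n, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    err_bar = np.mean((y - X @ beta) ** 2)               # training error \bar{err}
    cp = err_bar + 2 * d / n * sigma2_hat
    loglik = -n / 2 * (np.log(2 * np.pi * sigma2_hat) + err_bar / sigma2_hat)
    aic = -2 / n * loglik + 2 * d / n
    bic = -2 * loglik + np.log(n) * d
    return cp, aic, bic                                  # smaller is better for each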
Outline
1) Introduction
2) Bias, Variance and Model Complexity
3) Cp, AIC and BIC
4) Cross-Validation
5) Bootstrap Methods
Cross-Validation
We randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.
The model is fit on the training set, and the fitted model is used to predict the responses for the observations in
the validation set.
The resulting validation-set error provides an estimate of the test error.
K-fold Cross-Validation
A widely used approach for estimating test error.
The estimates can be used to select the best model, and to give an idea of the test error of the final chosen model.
The idea is to randomly divide the data into K equal-sized parts. We leave out part k, fit the model to the other K − 1
parts (combined), and then obtain predictions for the left-out kth part.
This is done in turn for each part k = 1, 2, . . . K, and then the results are combined.
Cross-Validation: How to use it?
1. For each model complexity, obtain a cross-validated estimate of the error for that complexity: the average
error over all folds.
2. Now you have a number (an average error) for each model complexity. Choose the best (lowest error) model
complexity.
3. Finally, train a model on all of the available data using that choice of model complexity.
K-fold Cross-Validation
Let the K parts be C_1, . . . , C_K, where C_k denotes the indices of the observations in part k. There are n_k observations in part k; if N is a multiple of K, then n_k = N/K.

The cross-validation estimate of prediction error is

CV(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-\kappa(i)}(x_i)),

where \kappa(i) denotes the fold containing observation i and \hat{f}^{-k} is the model fit with the kth part removed.

Given a set of models f(x, \alpha) indexed by a tuning parameter \alpha,

CV(\hat{f}, \alpha) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha)),

which provides an estimate of the test error curve, and we find the tuning parameter \hat{\alpha} that minimizes it.

Our final chosen model is f(x, \hat{\alpha}), fit to all the data.
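A from-scratch sketch of this estimator (assuming NumPy and scikit-learn; k-NN regression as the model family, with the number of neighbors playing the role of \alpha):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def cv_error(X, y, alpha, K=10, seed=0):
    # K-fold CV estimate of prediction error for a k-NN fit with k = alpha.
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K          # kappa(i): the fold of observation i
    losses = np.empty(len(y))
    for k in range(K):
        tr, te = folds != k, folds == k
        fit = KNeighborsRegressor(n_neighbors=alpha).fit(X[tr], y[tr])
        losses[te] = (y[te] - fit.predict(X[te])) ** 2
    return losses.mean()

# alpha_hat = min(range(1, 30), key=lambda a: cv_error(X, y, a))
# then refit with alpha_hat on all of the data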
K-fold Cross-Validation
The case K = N is known as leave-one-out cross-validation (LOOCV). The CV estimator is approximately unbiased for the true prediction error, but can have high variance.

LOOCV is sometimes useful, but typically doesn't shake up the data enough: the N fits are trained on nearly identical data, so the estimates from each fold are highly correlated and hence their average can have high variance.

A better choice is K = 5 or 10.
K-fold Cross-Validation
Consider a simple classifier for microarrays:
1) Starting with 5,000 genes, find the top 200 genes having the largest correlation with the class label.
2) Carry out nearest-centroid classification using the top 200 genes.
How do we select the tuning parameter in the classifier?
Way 1: apply cross-validation to step 2.
Way 2: apply cross-validation to steps 1 and 2.
Which is right? Way 2: we must cross-validate the whole procedure, repeating the gene-screening step inside every fold; otherwise the selected genes have already seen the held-out labels, and the error estimate is biased downward. A sketch of the right way follows.
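A compressed sketch of the right way (assuming NumPy and scikit-learn; the screening of the top genes is redone inside every fold):

import numpy as np
from sklearn.neighbors import NearestCentroid

def right_way_cv(X, y, n_genes=200, K=5, seed=0):
    # Cross-validate the WHOLE procedure: screening happens inside each fold.
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % K
    correct = 0
    for k in range(K):
        tr, te = folds != k, folds == k
        # Step 1, inside the fold: top genes by |correlation| with the class label.
        r = np.array([np.corrcoef(X[tr, j], y[tr])[0, 1] for j in range(X.shape[1])])
        top = np.argsort(-np.abs(r))[:n_genes]
        # Step 2: nearest-centroid classification using those genes.
        fit = NearestCentroid().fit(X[tr][:, top], y[tr])
        correct += (fit.predict(X[te][:, top]) == y[te]).sum()
    return 1 - correct / len(y)                  # CV misclassification error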
Outline
1) Introduction
2) Bias, Variance and Model Complexity
3) Cp, AIC and BIC
4) Cross-Validation
5) Bootstrap Methods
Bootstrap Methods
The bootstrap is a flexible and powerful statistical tool that can be used to quantify the uncertainty
associated with a given estimator or statistical learning method.
For example, it can provide an estimate of the standard error of a coefficient, or a confidence interval
for that coefficient.
Bootstrap Methods
A simple example:
Suppose that we wish to invest a fixed sum of money in two financial assets that yield returns of X and
Y , respectively, where X and Y are random quantities.
We will invest a fraction α of our money in X, and will invest the remaining 1 − α in Y .
We wish to choose α to minimize the total risk, or variance, of our investment. In other words, we want to minimize

Var(\alpha X + (1 - \alpha) Y).

The minimizing value is \alpha = \frac{\sigma^2_Y - \sigma_{XY}}{\sigma^2_X + \sigma^2_Y - 2\sigma_{XY}}, which we estimate by plugging in the sample variances and covariance.
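A sketch of bootstrapping the standard error of \hat{\alpha} (assuming NumPy; the simulated returns below are placeholders for real data):

import numpy as np

def alpha_hat(x, y):
    # Plug-in estimate of the variance-minimizing allocation.
    vx, vy, cxy = x.var(), y.var(), np.cov(x, y)[0, 1]
    return (vy - cxy) / (vx + vy - 2 * cxy)

rng = np.random.default_rng(3)
x = rng.normal(0.05, 0.10, 100)         # placeholder returns for asset X
y = rng.normal(0.03, 0.20, 100)         # placeholder returns for asset Y

B = 1000
boots = np.empty(B)
for b in range(B):
    i = rng.integers(0, len(x), len(x))   # resample observations with replacement
    boots[b] = alpha_hat(x[i], y[i])
print(alpha_hat(x, y), boots.std())       # the estimate and its bootstrap SE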
Bootstrap Methods
The bootstrap approach allows us to use a computer to mimic the process of obtaining new data sets,
so that we can estimate the variability of our estimate without generating additional samples.
Rather than repeatedly obtaining independent data sets from the population, we instead obtain
distinct data sets by repeatedly sampling observations from the original data set with replacement.
Bootstrap Methods
Each of these “bootstrap data sets” is created by sampling
with replacement, and is the same size as our original
dataset.
As a result some observations may appear more than once in
a given bootstrap data set and some not at all.
Bootstrap Methods
To estimate prediction error using the bootstrap, we could think about using each bootstrap dataset
as our training sample, and the original sample as our validation sample.
But each bootstrap sample has significant overlap with the original data: the probability that a given observation appears in a particular bootstrap sample is 1 - (1 - 1/N)^N ≈ 1 - e^{-1} ≈ 0.632, so about two-thirds of the original data points appear in each bootstrap sample.
Bootstrap Methods
Let \hat{f}^{*b}(x_i) be the predicted value at x_i from the model fitted to the bth bootstrap dataset. Our estimate is

\widehat{Err}_{boot} = \frac{1}{B} \frac{1}{N} \sum_{b=1}^{B} \sum_{i=1}^{N} L(y_i, \hat{f}^{*b}(x_i)).

\widehat{Err}_{boot} does not provide a good estimate in general: the bootstrap datasets act as the training samples while the original training set acts as the test sample, and the two share common observations, so the estimate is optimistically biased.
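For concreteness, a direct sketch of \widehat{Err}_{boot} (assuming NumPy and scikit-learn; plain linear regression as the model):

import numpy as np
from sklearn.linear_model import LinearRegression

def err_boot(X, y, B=200, seed=0):
    # Naive bootstrap estimate of prediction error; optimistically biased,
    # because each bootstrap training set overlaps the original "test" data.
    rng = np.random.default_rng(seed)
    losses = np.empty(B)
    for b in range(B):
        i = rng.integers(0, len(y), len(y))       # bth bootstrap dataset
        fit = LinearRegression().fit(X[i], y[i])  # train on the bootstrap sample
        losses[b] = np.mean((y - fit.predict(X)) ** 2)  # evaluate on the original
    return losses.mean()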
Summary

To resolve the bias-variance trade-off, one can apply several methods, such as Cp, AIC, BIC, cross-validation and the bootstrap.

Cp, AIC and BIC are model selection methods: they adjust the training error for model size, allowing us to choose the subset of variables that minimizes the estimated prediction error.

Cross-validation holds out one fold of the data at a time and uses the remaining observations to fit the model and predict the held-out fold.

LOOCV gives an approximately unbiased estimate of the prediction error, at the cost of higher variance.

K = 5 to 10 is frequently a good choice, balancing both the bias and the variance.

The bootstrap uses resampling with replacement to quantify the uncertainty (standard errors or confidence intervals) of estimators.

The naive bootstrap estimate of prediction error has an inherent bias, built into the resampling mechanism, because bootstrap samples overlap with the original data.
Thank you!