Feature Selection [slides taken from the UC Berkeley course cs294-10 (2006 / 2009)] http://www.cs.berkeley.edu/~jordan/courses/294-fall09


Page 1:

Feature Selection

[slides taken from the UC Berkeley course cs294-10 (2006 / 2009)]
http://www.cs.berkeley.edu/~jordan/courses/294-fall09

Page 2:

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Summary recommendations

Page 3:

Review
• Data: pairs $(x^{(i)}, y^{(i)})$
• $x$: vector of features
• Features can be real ($x_j \in \mathbb{R}$), categorical, or more structured
• $y$: response (dependent) variable
  – $y \in \{-1, +1\}$: binary classification
  – $y \in \mathbb{R}$: regression
  – Typically, this is what we want to be able to predict, having observed some new $x$.

Page 4:

Featurization
• Data is often not originally in vector form
  – Have to choose how to featurize
  – Features often encode expert knowledge of the domain
  – Can have a huge effect on performance
• Example: documents
  – "Bag of words" featurization: throw out order, keep a count of how many times each word appears (sketched in code below).
    • Surprisingly effective for many tasks
  – Sequence featurization: one feature for the first letter in the document, one for the second letter, etc.
    • Poor feature set for most purposes—similar documents are not close to one another in this representation.
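A minimal bag-of-words featurization sketch in Python (the vocabulary and document below are made up for illustration, not taken from the slides):

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Throw out word order; keep only the count of each vocabulary word."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

# Illustrative vocabulary and document (not from the slides).
vocab = ["cat", "kitten", "feline", "lemon"]
doc = "the cat chased the kitten while the kitten ignored the cat"
print(bag_of_words(doc, vocab))  # [2, 2, 0, 0]
```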

Page 5:

What is feature selection?
• Reducing the feature space by throwing out some of the features (covariates)
  – Also called variable selection
• Motivating idea: try to find a simple, "parsimonious" model
  – Occam's razor: the simplest explanation that accounts for the data is best

Page 6:

What is feature selection?

Example 1. Task: classify whether a document is about cats. Data: word counts in the document.
  X:         cat 2, and 35, it 20, kitten 8, electric 2, trouble 4, then 5, several 9, feline 2, while 4, lemon 2
  Reduced X: cat 2, kitten 8, feline 2

Example 2. Task: predict chances of lung disease. Data: medical history survey.
  X:         Vegetarian No, Plays video games Yes, Family history No, Athletic No, Smoker Yes, Sex Male, Lung capacity 5.8 L, Hair color Red, Car Audi, Weight 185 lbs
  Reduced X: Family history No, Smoker Yes

Page 7:

Why do it?
• Case 1: We're interested in the features themselves—we want to know which are relevant. If we fit a model, it should be interpretable.
• Case 2: We're interested in prediction; features are not interesting in themselves, we just want to build a good classifier (or other kind of predictor).

Page 8:

Why do it? Case 1.
• What causes lung cancer?
  – Features are aspects of a patient's medical history
  – Binary response variable: did the patient develop lung cancer?
  – Which features best predict whether lung cancer will develop? Might want to legislate against these features.
• What causes a program to crash? [Alice Zheng '03, '04, '05]
  – Features are aspects of a single program execution
    • Which branches were taken?
    • What values did functions return?
  – Binary response variable: did the program crash?
  – Features that predict crashes well are probably bugs.
• What stabilizes protein structure? (my research)
  – Features are structural aspects of a protein
  – Real-valued response variable—protein energy
  – Features that give rise to low energy are stabilizing.

We want to know which features are relevant; we don't necessarily want to do prediction.

Page 9:

Why do it? Case 2.
• Text classification
  – Features for all $10^5$ English words, and maybe all word pairs
  – Common practice: throw in every feature you can think of, let feature selection get rid of the useless ones
  – Training is too expensive with all features
  – The presence of irrelevant features hurts generalization.
• Classification of leukemia tumors from microarray gene expression data [Xing, Jordan, Karp '01]
  – 72 patients (data points)
  – 7130 features (expression levels of different genes)
• Disease diagnosis
  – Features are outcomes of expensive medical tests
  – Which tests should we perform on a patient?
• Embedded systems with limited resources
  – Classifier must be compact
  – Voice recognition on a cell phone
  – Branch prediction in a CPU (4K code limit)

We want to build a good predictor.

Page 10:

Get at Case 1 through Case 2
• Even if we just want to identify features, it can be useful to pretend we want to do prediction.
• Relevant features are (typically) exactly those that most aid prediction.
• But not always. Highly correlated features may be redundant, yet both interesting as "causes".
  – e.g. smoking in the morning, smoking at night

Page 11:

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Kernel methods
• Miscellaneous topics
• Summary

Page 12:

Filtering
• Simple techniques for weeding out irrelevant features without fitting a model

Page 13:

Filtering
• Basic idea: assign a score $S_f$ to each feature $f$ indicating how "related" $x_f$ and $y$ are.
  – Intuition: if $x_f^{(i)} = y^{(i)}$ for all $i$, then $x_f$ is good no matter what our model is—it contains all the information about $y$.
  – Many popular scores [see Yang and Pedersen '97]
    • Classification with categorical data: chi-squared, information gain
      – Can use binning to make continuous data categorical
    • Regression: correlation, mutual information
    • Markov blanket [Koller and Sahami '96]
• Then somehow pick how many of the highest-scoring features to keep (nested models); a minimal sketch follows below.
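A minimal filtering sketch with scikit-learn's univariate scoring (the synthetic data, the mutual-information score, and the choice of keeping the top 10 features are all illustrative assumptions):

```python
# Score each feature independently against y, then keep the top-k.
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                  # 200 points, 50 features (synthetic)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # only features 0 and 1 matter

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)

print("first few scores:", np.round(selector.scores_[:5], 3))
print("kept feature indices:", np.flatnonzero(selector.get_support()))
```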

Page 14:

Comparison of filtering methods for text categorization [Yang and Pedersen '97]

Page 15:

Filtering
• Advantages:
  – Very fast
  – Simple to apply
• Disadvantages:
  – Doesn't take into account which learning algorithm will be used.
  – Doesn't take into account correlations between features
    • This can be an advantage if we're only interested in ranking the relevance of features, rather than performing prediction.
    • Also a significant disadvantage—see homework
• Suggestion: use light filtering as an efficient initial step if there are many obviously irrelevant features
  – Caveat here too—apparently useless features can be useful when grouped with others

Page 16:

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Summary

Page 17:

Model Selection
• Choosing between possible models of varying complexity
  – In our case, a "model" means a set of features
• Running example: linear regression model

Page 18:

Linear Regression Model

Data: $x \in \mathbb{R}^n$;  Response: $y \in \mathbb{R}$;  Parameters: $w$
Model / prediction rule: $\hat{y} = w \cdot x$
(Assume $x$ is augmented to $(1, x_1, \dots, x_n)$ so the constant term is absorbed into $w$ as $w_0$.)

• Recall that we can fit it by minimizing the squared error:
    $\hat{w} = \arg\min_w \sum_i \left(y^{(i)} - w \cdot x^{(i)}\right)^2$
• Can be interpreted as maximum likelihood with Gaussian noise:
    $y = w \cdot x + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$

(A minimal fitting sketch follows below.)
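A minimal least-squares fitting sketch in NumPy, with the constant term absorbed into $w$ as above (the data is synthetic, for illustration):

```python
# Fit w by minimizing the sum of squared errors, with a column of ones
# prepended so the constant term is absorbed into w as w_0.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5, 3.0])          # [w_0, w_1, w_2, w_3]
X_aug = np.hstack([np.ones((n, 1)), X])           # augment with constant feature
y = X_aug @ true_w + 0.1 * rng.normal(size=n)     # Gaussian noise

w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None) # minimizes sum of squared errors
print("estimated w:", np.round(w_hat, 2))
```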

Page 19:

Least Squares Fitting

[figure: data points with a fitted line; the vertical gap between an observation and its prediction is the error or "residual"]

Sum squared error: $\sum_i \left(y^{(i)} - w \cdot x^{(i)}\right)^2$

Page 20:

Model Selection

• Consider a reduced model with only those features $x_f$ for $f \in s$, where $s \subseteq \{1, \dots, n\}$
• Squared error is now
    $\hat{E}(s) = \min_{w_s} \sum_i \left(y^{(i)} - w_s \cdot x_s^{(i)}\right)^2$
• We want to pick out the "best" $s$. Maybe this means the one with the lowest training error $\hat{E}(s)$?
• Note $\hat{E}(s) \ge \hat{E}(\{1, \dots, n\})$
  – The full model can always match $w_s$ by zeroing out its extra terms, so its training error is no larger.
• Generally speaking, training error will only go up in a simpler model (a quick numerical check follows below). So why should we use one?
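A quick numerical check of the last point (synthetic data; the subset {0, 2} is arbitrary): the reduced model's training error is never below the full model's.

```python
# Training error of a reduced model is never lower than that of the full model.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)

def training_sse(X_s, y):
    w, *_ = np.linalg.lstsq(X_s, y, rcond=None)
    return np.sum((y - X_s @ w) ** 2)

full = training_sse(X, y)
reduced = training_sse(X[:, [0, 2]], y)       # keep only features 0 and 2
print(f"full model SSE: {full:.3f}, reduced model SSE: {reduced:.3f}")  # reduced >= full
```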

Page 21:

Overfitting example 1
• This model is too rich for the data
• Fits the training data well, but doesn't generalize.

[figure: a degree-15 polynomial fit to a small set of data points]

Page 22:

Overfitting example 2
• Generate 2000 feature vectors $x^{(i)}$, i.i.d.
• Generate 2000 responses $y^{(i)}$, i.i.d., completely independent of the $x^{(i)}$'s
  – We shouldn't be able to predict $y$ at all from $x$
• Find $\hat{w} = \arg\min_w \sum_i \left(y^{(i)} - w \cdot x^{(i)}\right)^2$
• Use this to predict $y^{(i)}$ for each $x^{(i)}$ by $\hat{y}^{(i)} = \hat{w} \cdot x^{(i)}$

It really looks like we've found a relationship between $x$ and $y$! But no such relationship exists, so $\hat{w}$ will do no better than random on new data. (A simulation sketch follows below.)
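A sketch of this experiment (the exact dimensions from the slide are not recoverable here, so the sizes below are illustrative; the effect appears when the number of features is close to the number of data points):

```python
# Fit ordinary least squares on purely random data and observe a spuriously
# good in-sample fit, even though y is independent of X by construction.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 80            # assumed sizes; p close to n drives the effect
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)             # independent of X

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
y_hat = X @ w

in_sample_corr = np.corrcoef(y, y_hat)[0, 1]
print(f"in-sample correlation between y and y_hat: {in_sample_corr:.2f}")
# High correlation on training data, yet w is useless on new data.
```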

Page 23:

Model evaluation
• Moral 1: In the presence of many irrelevant features, we might just fit noise.
• Moral 2: Training error can lead us astray.
• To evaluate a feature set $s$, we need a better scoring function $K(s)$
  – We've seen that the training error $\hat{E}(s)$ is not appropriate.
• We're not ultimately interested in training error; we're interested in test error (error on new data).
• We can estimate test error by pretending we haven't seen some of our data.
  – Keep some data aside as a validation set. If we don't use it in training, then it's a fair test of our model.

Page 24:

K-fold cross validation
• A technique for estimating test error
• Uses all of the data to validate
• Divide the data into K groups $X_1, \dots, X_K$.
• Use each group in turn as a validation set (learning on the others), then average all K validation errors.

[figure: data split into groups X1 ... X7; in each round one group is held out as the test set and the rest are used for learning]

(A minimal sketch follows below.)
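A minimal K-fold cross-validation sketch with scikit-learn (the synthetic data, the linear model, and K = 5 are illustrative assumptions):

```python
# Estimate test error by averaging validation errors over K held-out folds.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

fold_errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    residuals = y[val_idx] - model.predict(X[val_idx])
    fold_errors.append(np.mean(residuals ** 2))     # validation MSE for this fold

print("estimated test MSE:", np.mean(fold_errors))
```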


Page 28:

Model Search
• We have an objective function $K(s)$
  – Time to search for a good model.
• This is known as a "wrapper" method
  – The learning algorithm is a black box
  – Just use it to compute the objective function, then do search
• Exhaustive search is expensive
  – $2^n$ possible subsets $s$
• Greedy search is common and effective

Page 29:

Model search

Forward selection:
    Initialize s = {}
    Do:
        add the feature to s which improves K(s) most
    While K(s) can be improved

Backward elimination:
    Initialize s = {1, 2, ..., n}
    Do:
        remove the feature from s which improves K(s) most
    While K(s) can be improved

• Backward elimination tends to find better models
  – Better at finding models with interacting features
  – But it is frequently too expensive to fit the large models at the beginning of the search
• Both can be too greedy.

(A forward-selection sketch follows below.)
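A minimal greedy forward-selection sketch in Python, using cross-validated squared error as the objective K(s) (the data, the linear model, and the 5-fold scoring are illustrative assumptions, not from the slides):

```python
# Greedy forward selection: repeatedly add the feature that most improves K(s),
# where K(s) is the cross-validated mean squared error of a linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def K(X, y, s):
    """Cross-validated MSE of the model restricted to feature subset s."""
    scores = cross_val_score(LinearRegression(), X[:, sorted(s)], y,
                             scoring="neg_mean_squared_error", cv=5)
    return -scores.mean()

def forward_selection(X, y):
    s, remaining = set(), set(range(X.shape[1]))
    best_err = np.inf
    while remaining:
        # Evaluate adding each remaining feature; keep the best one.
        errs = {f: K(X, y, s | {f}) for f in remaining}
        f_best = min(errs, key=errs.get)
        if errs[f_best] >= best_err:       # no improvement: stop
            break
        best_err = errs[f_best]
        s.add(f_best)
        remaining.remove(f_best)
    return s

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=100)
print("selected features:", forward_selection(X, y))   # typically {0, 3}
```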

Page 30:

Model search
• More sophisticated search strategies exist
  – Best-first search
  – Stochastic search
  – See "Wrappers for Feature Subset Selection", Kohavi and John 1997
• For many models, search moves can be evaluated quickly without refitting
  – E.g. linear regression model: add the feature that has the most covariance with the current residuals
• RapidMiner can do feature selection with cross-validation and either forward selection or backward elimination.
• Other objective functions exist which add a model-complexity penalty to the training error (a small numerical sketch follows below)
  – AIC: add a penalty of $k$ (the number of parameters) to the log-likelihood.
  – BIC: add a penalty of $\frac{k}{2}\ln m$ (for $m$ data points).
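A minimal sketch of computing AIC and BIC for a Gaussian linear model (as a simplifying assumption, only the regression weights are counted as parameters):

```python
# AIC/BIC for a Gaussian linear model: penalize the log-likelihood by k and (k/2) ln m.
import numpy as np

def aic_bic(X, y):
    m, k = X.shape
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ w) ** 2)
    sigma2 = rss / m
    log_lik = -0.5 * m * (np.log(2 * np.pi * sigma2) + 1)   # maximized Gaussian log-likelihood
    aic = -2 * (log_lik - k)                                # equivalently 2k - 2 log L
    bic = -2 * (log_lik - 0.5 * k * np.log(m))              # equivalently k ln m - 2 log L
    return aic, bic

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)
print("full model (AIC, BIC):", aic_bic(X, y))
print("2-feature model (AIC, BIC):", aic_bic(X[:, :2], y))
```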

Page 31:

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Summary

Page 32:

Regularization
• In certain cases, we can move model selection into the induction algorithm
  – Only have to fit one model; more efficient.
• This is sometimes called an embedded feature selection algorithm

Page 33:

Regularization
• Regularization: add a model-complexity penalty to the training error:
    $\hat{w} = \arg\min_w \sum_i \left(y^{(i)} - w \cdot x^{(i)}\right)^2 + C\,\|w\|$  for some constant $C$
  (where $\|w\|$ is, e.g., the L1 or L2 norm; the next slides compare the two)
• Now the fit trades off training error against the size of the weights.
• Regularization forces weights to be small, but does it force weights to be exactly zero?
  – $w_f = 0$ is equivalent to removing feature $f$ from the model

Page 34:

L1 vs L2 regularization

Page 35:

L1 vs L2 regularization
• To minimize the regularized objective $\sum_i \left(y^{(i)} - w \cdot x^{(i)}\right)^2 + C\,\|w\|$, we can solve by (e.g.) gradient descent (a proximal-gradient sketch follows below).
• Minimization is a tug-of-war between the two terms


Page 38:

L1 vs L2 regularization
• With the L1 penalty, w is forced into the corners—many components become exactly 0
  – Solution is sparse


Page 40:

L1 vs L2 regularization
• L2 regularization does not promote sparsity
• Even without sparsity, regularization promotes generalization—limits the expressiveness of the model

(A sketch comparing the two penalties follows below.)
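A minimal sketch contrasting the two penalties on the same synthetic data (the data and the penalty strengths are illustrative assumptions; scikit-learn calls the penalty constant alpha):

```python
# With an L1 penalty many coefficients land exactly at zero; with L2 they only shrink.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)   # only 2 relevant features

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

print("L1: nonzero coefficients:", np.count_nonzero(lasso.coef_))   # few
print("L2: nonzero coefficients:", np.count_nonzero(ridge.coef_))   # all 30
```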

Page 41:

Lasso Regression [Tibshirani '94]
• Simply linear regression with an L1 penalty for sparsity:
    $\hat{w} = \arg\min_w \sum_i \left(y^{(i)} - w \cdot x^{(i)}\right)^2 + C\,\|w\|_1$
• Two big questions:
  – 1. How do we perform this minimization?
    • With an L2 penalty it's easy—saw this in a previous lecture
    • With L1 it's not a least-squares problem any more
  – 2. How do we choose C? (A cross-validation sketch follows below.)
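One standard answer to question 2 is cross-validation over a grid of penalty values; a minimal sketch with scikit-learn's LassoCV (synthetic data; note that scikit-learn calls the constant alpha rather than C):

```python
# Choose the L1 penalty strength by cross-validation over a grid of candidates.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

model = LassoCV(cv=5).fit(X, y)           # fits the whole regularization path internally
print("chosen penalty:", model.alpha_)
print("selected features:", np.flatnonzero(model.coef_))
```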

Page 42:

Least-Angle Regression
• Up until a few years ago this was not trivial
  – Fitting the model: an optimization problem, harder than least squares
  – Cross validation to choose C: must fit the model for every candidate C value
• Not with LARS! (Least Angle Regression; Efron, Hastie, Johnstone & Tibshirani, 2004)
  – Finds the trajectory of w for all possible C values simultaneously, as efficiently as least squares
  – Can choose exactly how many features are wanted

[Figure taken from the LARS paper (Efron et al., 2004). See also sklearn.linear_model.Lars; a path-computation sketch follows below.]
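A minimal sketch of computing the full coefficient path with scikit-learn's LARS implementation (synthetic data; method='lasso' traces the lasso path):

```python
# Compute the whole trajectory of w for all penalty values in one pass.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

alphas, active, coef_path = lars_path(X, y, method="lasso")
# coef_path has shape (n_features, n_steps): one column of coefficients per breakpoint.
print("active features at the end of the path:", active)
print("coefficients at the least-squares end of the path:", np.round(coef_path[:, -1], 2))
```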

Page 43:

Outline
• Review/introduction
  – What is feature selection? Why do it?
• Filtering
• Model selection
  – Model evaluation
  – Model search
• Regularization
• Summary

Page 44:

Summary
The remaining slides comment on each family of methods, listed here in roughly increasing order of computational cost:
• Filtering
• L1 regularization (embedded methods)
• Kernel methods
• Wrappers
  – Forward selection
  – Backward selection
  – Other search
  – Exhaustive

Page 45:

Summary: Filtering
• Good preprocessing step
• Information-based scores seem most effective
  – Information gain
  – More expensive: Markov blanket [Koller & Sahami '96]
• Fails to capture relationships between features

Page 46:

Summary: L1 regularization (embedded methods)
• Fairly efficient
  – LARS-type algorithms now exist for many linear models
  – Ideally, use cross-validation to determine the regularization coefficient
• Not applicable for all models
• Linear methods can be limited
  – Common: fit a linear model initially to select features, then fit a nonlinear model with the new feature set

Page 47:

Summary: Kernel methods
• Expand the expressiveness of linear models
• Very effective in practice
• Useful when a similarity kernel is natural to define
• Not as interpretable
  – They don't really perform feature selection as such
• Achieve parsimony through a different route
  – Sparsity in data

Page 48:

Summary: Wrappers
• Most directly optimize prediction performance
• Can be very expensive, even with greedy search methods
• Cross-validation is a good objective function to start with

Page 49:

Summary: Forward and backward selection
• Too greedy—ignore relationships between features
• Easy baseline
• Can be generalized in many interesting ways
  – Stagewise forward selection
  – Forward-backward search
  – Boosting

Page 50:

Summary: Other search strategies
• Generally more effective than greedy search

Page 51:

Summary: Exhaustive search
• The "ideal"
• Very seldom done in practice
• With a cross-validation objective, there's a chance of over-fitting
  – Some subset might randomly perform quite well in cross-validation