
Lecture Notes on Feature Selection

Rossella Blatt, [email protected]

Department of Electronics and Information, Politecnico di Milano

Methods for Intelligent Systems


Dimensionality Reduction


DIMENSIONALITY REDUCTION: to control the dimensionality of our pattern analysis problem and to improve the classification accuracy

• Feature Extraction: project the data into a lower-dimensional space, obtaining new features
  • PCA: seeks the axes with maximum variance
  • LDA: seeks the axes with maximum between-class distance and minimum within-class distance
• Feature Selection: choose the best subset of the original features


Feature Selection
• It concerns the control of the dimensionality of our pattern analysis problem
• The dimensionality of the problem is mainly given by the sample size and the feature set size
• It would not make sense to reduce the number of examples because:
  • usually we never have enough examples
  • we can assume that the examples are correct
• On the contrary, we have no guarantee that all our features are needed for the classification


Feature Selection
• Refers to algorithms that select (hopefully) the best subset of the initial feature set
• Selected features maintain their original physical interpretation (useful to understand the physical process that generates the patterns)
• It leads to savings in measurement cost
• In the literature, Feature Selection is also called "Feature Subset Selection"
• Given a feature set X = {x_i, i = 1, ..., N}, find a subset Y_M = {x_{i1}, x_{i2}, ..., x_{iM}}, with M < N, that optimizes an objective function J(Y) (in some way related to the probability of correct classification)


$$
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}
\xrightarrow{\text{Feature Selection}}
\begin{bmatrix} x_{i_1} \\ x_{i_2} \\ \vdots \\ x_{i_M} \end{bmatrix},
\qquad
\{x_{i_1}, x_{i_2}, \ldots, x_{i_M}\} = \arg\max_{M,\, i_m} J\big(\{x_i \mid i = 1, \ldots, N\}\big)
$$


Feature Selection

Given a feature set X = {x_i, i = 1, ..., N}, find a subset Y_M = {x_{i1}, x_{i2}, ..., x_{iM}}, with M < N, that optimizes an objective function J(Y) (in some way related to the probability of correct classification)

1. Find a subset → find a search strategy to select candidate subsets (the algorithm)
2. Optimize an objective function → define a measure of the goodness of the considered subsets (classification accuracy, interclass distance, and so on)

[Diagram: FEATURE SUBSET SELECTION: Training Data → Complete Feature Set → (Search Strategy ⇄ Objective Function) → Final Feature Subset]


Feature Selection: Example

• Goal: recognize oranges from mandarins…


• We collect 10 measurements of 3 features:
  • Weight
  • Color Intensity
  • Diameter


[Figure: ORANGES VS MANDARINS: Feature Weight; x-axis: Weight [grams]; classes: Oranges, Mandarins]

Feature Selection: Example
• We obtain (first row: oranges, second row: mandarins):


Weight [grams] =
  oranges:   124 133 164 127 153 159 160  99 135 120
  mandarins:  78  79  91  92  85  93  78  99  85  80


[Figure: ORANGES VS MANDARINS: Feature Color Intensity; x-axis: Color Intensity; classes: Oranges, Mandarins]

Feature Selection: Example
• We obtain (first row: oranges, second row: mandarins):


Color Intensity =
  oranges:   0.72 0.78 0.77 0.75 0.72 0.78 0.69 0.71 0.75 0.70
  mandarins: 0.78 0.79 0.71 0.72 0.75 0.73 0.78 0.79 0.75 0.78


Feature Selection: Example
• We obtain (first row: oranges, second row: mandarins):


Diameter [cm] =
  oranges:   12.0  8.0  8.7 10.7  9.7  7.8  9.7 11.9 11.2 10.5
  mandarins:  8.1  7.9  7.1  6.7  5.7  7.3  6.8  7.0  7.5  8.0

[Figure: ORANGES VS MANDARINS: Feature Diameter; x-axis: Diameter [cm]; classes: Oranges, Mandarins]


[Figure: ORANGES VS MANDARINS: Features Plot (3D); axes: Weight [grams], Color Intensity, Diameter [cm]; classes: Oranges, Mandarins]

Feature Selection: Example
• The feature space that we obtain is a 3-dimensional space


[Figure: ORANGES VS MANDARINS: Features Plot (3D); axes: Weight [grams], Color Intensity, Diameter [cm]; classes: Oranges, Mandarins]

Feature Selection: Example
• We could perform our classification task in the obtained 3-dimensional feature space, or...


Feature Selection: Example
• We could perform our classification task in the obtained 3-dimensional feature space, or in a reduced 2-dimensional space, or...


[Figures: ORANGES VS MANDARINS: 2D feature plots: Color Intensity vs Diameter, Weight vs Diameter, Weight vs Color Intensity; classes: Oranges, Mandarins]


[Figure: ORANGES VS MANDARINS: Feature Color Intensity; x-axis: Color Intensity; classes: Oranges, Mandarins]

Feature Selection: Example
• We could perform our classification task in the obtained 3-dimensional feature space, or in a reduced 2-dimensional space, or even in a mono-dimensional space...

[Figures: ORANGES VS MANDARINS: Feature Weight (x-axis: Weight [grams]) and Feature Diameter (x-axis: Diameter [cm]); classes: Oranges, Mandarins]

• Which one is the best for our classification purpose?!


Best Firsts: a naïve approach
• One possible Feature Selection technique is to rank the features individually and then keep only the first M features (a small scoring sketch follows below)
• In our previous example the best features are (in decreasing order):
  1. Weight (very good)
  2. Diameter (quite good)
  3. Color (very bad)
• If we want to keep only 2 features, 'Weight' and 'Diameter' will be the winners
• But... this approach almost always fails, because it does not consider features with complementary information

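The notes do not fix a particular ranking criterion; as a minimal illustration, the sketch below scores each feature individually with a two-class Fisher-style ratio (squared difference of class means over the sum of class variances) on the oranges/mandarins measurements listed above. The fisher_score helper and the variable names are illustrative, not part of the original handout.

```python
import numpy as np

# Measurements from the example slides (first row: oranges, second row: mandarins)
weight = np.array([[124, 133, 164, 127, 153, 159, 160, 99, 135, 120],
                   [78, 79, 91, 92, 85, 93, 78, 99, 85, 80]], dtype=float)
color = np.array([[0.72, 0.78, 0.77, 0.75, 0.72, 0.78, 0.69, 0.71, 0.75, 0.70],
                  [0.78, 0.79, 0.71, 0.72, 0.75, 0.73, 0.78, 0.79, 0.75, 0.78]])
diameter = np.array([[12.0, 8.0, 8.7, 10.7, 9.7, 7.8, 9.7, 11.9, 11.2, 10.5],
                     [8.1, 7.9, 7.1, 6.7, 5.7, 7.3, 6.8, 7.0, 7.5, 8.0]])

def fisher_score(x):
    """Two-class Fisher-style ratio: (difference of class means)^2 / (sum of class variances)."""
    oranges, mandarins = x[0], x[1]
    return (oranges.mean() - mandarins.mean()) ** 2 / (oranges.var() + mandarins.var())

scores = {name: fisher_score(x)
          for name, x in [("Weight", weight), ("Color Intensity", color), ("Diameter", diameter)]}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")
# Expected ordering: Weight > Diameter > Color Intensity, matching the slide's ranking
```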


Best Firsts: example
• The figures show a 4-dimensional pattern recognition problem (features are shown in pairs of 2D scatter plots)
• Objective: find the best subset composed of M = 2 features
• We rank the goodness of the features:
  1. x1 is the best feature, because it is able to discriminate all clusters (except for ω4 and ω5)
  2. x3 is the second best feature (it separates the space into the groups {ω1}, {ω2, ω3}, {ω4, ω5})
  3. x2 is the third best feature (it is very similar to x3: it separates the space into the groups {ω1, ω2}, {ω3}, {ω4, ω5})
  4. x4 is the worst feature, because there is a lot of overlap in its space, but it is the only one able to discriminate between ω4 and ω5
• The feature subset chosen by the Best Firsts approach would therefore be x1 and x3, which does not allow any discrimination between ω4 and ω5!!
• The real best 2D subset is x1 and x4, as x4 provides the only information that x1 lacks: the discrimination between ω4 and ω5!!


Objective Function

• We need to define:
  • a rule to analyse each possible subset → Search Algorithm
  • a way to evaluate each subset → Objective Function
• Filters: the objective function evaluates feature subsets by their information content, typically interclass distance, statistical dependence or information-theoretic measures
• Wrappers: the objective function is a pattern classifier, which evaluates feature subsets by their predictive accuracy (recognition rate on test data), estimated by statistical resampling or cross-validation


Filter Objective Function
• Distance between classes:
  • Euclidean
  • Mahalanobis
  • determinant of S_W^(-1) S_B
  • ...
• Correlation and information-theoretic measures:
  • these methods are based on the rationale that good feature subsets contain features highly correlated with the class and highly uncorrelated with each other
  • linear measure → Correlation Coefficient
  • non-linear measure → Mutual Information, which measures the amount by which the uncertainty in the class C is decreased by knowledge of the feature vector (it is based on the entropy function)

$$
J(Y_M) = \frac{\sum_{i=1}^{M} \rho_{ic}}{\sum_{i=1}^{M} \sum_{j=i+1}^{M} \rho_{ij}}
$$

where ρ_ic is the correlation coefficient between feature i and the class label, and ρ_ij is the correlation coefficient between feature i and feature j.
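A minimal sketch of this correlation-based filter, assuming the formula above and Pearson correlation as the measure; absolute values are used so that strongly negative correlations still count, which is a small deviation from the bare formula. The function name and the toy data are illustrative only.

```python
import numpy as np

def correlation_filter_j(X, y):
    """Correlation-based filter objective J(Y_M) for a candidate subset:
    sum of feature-class correlations divided by the sum of pairwise
    feature-feature correlations (absolute Pearson correlations).

    X: (n_samples, M) array holding the M candidate features.
    y: (n_samples,) array of numerically encoded class labels.
    """
    M = X.shape[1]
    rho_ic = sum(abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(M))
    rho_ij = sum(abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                 for i in range(M) for j in range(i + 1, M))
    return rho_ic / rho_ij if rho_ij > 0 else rho_ic

# Toy usage: two informative features plus one noise feature
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = np.column_stack([y + 0.3 * rng.standard_normal(100),
                     y + 0.5 * rng.standard_normal(100),
                     rng.standard_normal(100)])
print(correlation_filter_j(X[:, :2], y), correlation_filter_j(X, y))
```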


Filters vs Wrappers Objective Function

Filters
☺ Fast execution: they generally involve a non-iterative computation on the dataset
☺ Generality: they evaluate intrinsic properties of the data (the solution will be good for a large family of classifiers)
☹ Tendency to select large subsets: since the filter objective functions are generally monotonic, these methods tend to select the full feature set as the optimal solution; this forces the user to select a cutoff on the number of features to be selected

Wrappers
☺ Accuracy: wrappers generally achieve better recognition rates, since they are tuned to the specific interactions between the classifier and the dataset
☺ Generalization: using techniques such as cross-validation, they can avoid overfitting
☹ Slow execution: the wrapper must train a classifier for each feature subset (or several classifiers, if cross-validation is used)
☹ Problem dependent
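As an illustration of a wrapper objective (the classifier choice is not prescribed by the notes), the sketch below scores a candidate subset by the cross-validated accuracy of a k-nearest-neighbour classifier, assuming scikit-learn is available; the function and variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_j(X, y, subset, cv=5):
    """Wrapper objective: mean cross-validated accuracy of a classifier
    trained only on the candidate subset (a list of column indices)."""
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, list(subset)], y, cv=cv).mean()

# Toy usage: compare an informative subset against a noisy one
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = np.column_stack([y + 0.3 * rng.standard_normal(100),   # informative feature
                     rng.standard_normal(100),              # noise
                     rng.standard_normal(100)])             # noise
print(wrapper_j(X, y, [0]), wrapper_j(X, y, [1, 2]))
```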


Search Strategies
• There is a large number of search algorithms, which can be grouped into 3 categories:
  • Exponential Algorithms
    • these algorithms evaluate a number of subsets that grows exponentially with the dimensionality of the search space
    • Exhaustive Search
    • Branch & Bound
    • Approximate Monotonicity with Branch & Bound
  • Sequential Algorithms
    • these algorithms add and remove features sequentially
    • Sequential Forward Selection
    • Sequential Backward Selection
    • Bidirectional Selection
    • Plus-L Minus-R Selection
    • Sequential Floating Selection
  • Randomized Algorithms
    • these algorithms incorporate randomness into their search procedure to escape local minima
    • Random Generation plus Sequential Selection
    • Simulated Annealing
    • Genetic Algorithms


Exponential Search: Exhaustive Search
• A very naive approach
• We consider all possible combinations of features
• The number of combinations is unfeasible, even for moderate values of M and N
• It is guaranteed to find the optimal subset
• In our previous example (oranges vs mandarins) with 3 features, the total number of possible subsets was equal to:

$$
\binom{N}{M} = \frac{N!}{M!\,(N-M)!}
$$

$$
\sum_{M=1}^{3} \binom{3}{M}
= \frac{3!}{1!\,(3-1)!} + \frac{3!}{2!\,(3-2)!} + \frac{3!}{3!\,(3-3)!}
= \frac{6}{2} + \frac{6}{2} + \frac{6}{6}
= 3 + 3 + 1 = 7
$$

• That is, we obtain 3 subsets of 1 feature, 3 subsets of 2 features and 1 subset of 3 features

• With a more realistic number of features (let's say 10), we would obtain 1023 possible subsets!! (see the enumeration sketch below)
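A minimal sketch of exhaustive search over all non-empty subsets, assuming a generic objective function J (any filter or wrapper objective could be plugged in); the toy scores are invented purely to show the call.

```python
from itertools import combinations

def exhaustive_search(feature_names, J):
    """Evaluate J on every non-empty feature subset and return the best one.
    Feasible only for small N, since there are 2**N - 1 subsets."""
    best_subset, best_score = None, float("-inf")
    for m in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, m):
            score = J(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Toy usage: 3 features as in the oranges vs mandarins example -> 7 subsets evaluated
toy_scores = {
    ("Weight",): 0.90, ("Diameter",): 0.70, ("Color",): 0.10,
    ("Weight", "Diameter"): 0.95, ("Weight", "Color"): 0.90,
    ("Diameter", "Color"): 0.72, ("Weight", "Diameter", "Color"): 0.95,
}
print(exhaustive_search(["Weight", "Diameter", "Color"], lambda s: toy_scores[s]))
```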


Exponential Search: Branch & Bound

• It uses the well-known Branch & Bound search method: only a fraction of all possible feature subsets needs to be enumerated to find the optimal subset
• It is based on the monotonicity assumption: "the addition of features can only increase the value of the objective function":
• Branch & Bound starts from the full set and removes features using a depth-first strategy
• Nodes whose objective function is lower than the current best are not explored, since the monotonicity assumption ensures that their children will not contain a better solution

$$
J(x_{i_1}) < J(x_{i_1}, x_{i_2}) < J(x_{i_1}, x_{i_2}, x_{i_3}) < \cdots < J(x_{i_1}, x_{i_2}, x_{i_3}, \ldots, x_{i_N})
$$
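A compact, unoptimized sketch of the idea, assuming a monotonic objective J: it removes features depth-first from the full set and prunes any node that can no longer beat the best subset of the target size found so far. The function and variable names are illustrative.

```python
def branch_and_bound(all_features, J, target_size):
    """Branch & Bound sketch under the monotonicity assumption: start from the
    full set, remove one feature at a time (depth-first), and prune any node
    whose objective is already no better than the current best bound."""
    best = {"subset": None, "score": float("-inf")}

    def search(subset, start):
        score = J(subset)
        if score <= best["score"]:
            return                                  # pruned: children cannot do better
        if len(subset) == target_size:
            best["subset"], best["score"] = subset, score
            return
        for k in range(start, len(subset)):
            search(subset[:k] + subset[k + 1:], k)  # branch: drop the k-th feature

    search(tuple(all_features), 0)
    return best["subset"], best["score"]

# Toy usage with a monotonic objective (sum of individual feature "values")
values = {"x1": 0.9, "x2": 0.5, "x3": 0.6, "x4": 0.2}
print(branch_and_bound(list(values), lambda s: sum(values[f] for f in s), target_size=2))
# -> (('x1', 'x3'), 1.5)
```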


Exponential Search: Approximate Monotonicity with Branch & Bound (AMB&B)

• AMB&B is a variation of the classic Branch & Bound algorithm
• It allows non-monotonic functions to be used, by relaxing the cutoff condition that terminates the search on a specific node
• For example, we can replace the limit on the number of features with a threshold error rate.


Sequential Search: Sequential Forward Selection (SFS)

• It is the simplest greedy search algorithm
• Select the best single feature and then add one feature at a time; the added feature is the one that maximizes the objective function in combination with the previously selected feature set
• Starting from the empty set, sequentially add the feature x+ that results in the highest objective function J(Y_k + x+) when combined with the features Y_k that have already been selected (see the sketch below)
• It performs best when the optimal subset has a small number of features
• Once a feature is retained, it cannot be discarded anymore
• Suboptimal solution!
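A minimal sketch of the greedy forward loop, assuming a generic objective J such as the filter or wrapper objectives sketched earlier; names are illustrative.

```python
def sfs(all_features, J, target_size):
    """Sequential Forward Selection: start from the empty set and greedily
    add the feature that maximizes J on the enlarged subset."""
    selected, remaining = [], list(all_features)
    while len(selected) < target_size and remaining:
        best_f = max(remaining, key=lambda f: J(tuple(selected) + (f,)))
        selected.append(best_f)
        remaining.remove(best_f)
    return tuple(selected)

# e.g. sfs(["Weight", "Color", "Diameter"], J, target_size=2) with any objective J
```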


Sequential Search: Sequential Backward Selection (SBS)

• Similar to SFS, but it works in the opposite direction
• Starting from the full feature set, sequentially delete the feature x- that results in the smallest decrease of the objective function J(Y_k - x-) (see the sketch below)
• It performs best when the optimal subset has a large number of features
• Once a feature is deleted, it cannot be brought back anymore
• Suboptimal solution!
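The backward counterpart of the previous sketch, again assuming a generic objective J; names are illustrative.

```python
def sbs(all_features, J, target_size):
    """Sequential Backward Selection: start from the full set and greedily
    remove the feature whose removal hurts J the least."""
    selected = list(all_features)
    while len(selected) > target_size:
        least_useful = max(selected,
                           key=lambda f: J(tuple(x for x in selected if x != f)))
        selected.remove(least_useful)
    return tuple(selected)
```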


Sequential Search: Bidirectional Search (BDS)

• BDS is a parallel implementation of SFS and SBS:
  • SFS is performed from the empty set
  • SBS is performed from the full set
• In order to guarantee that SFS and SBS converge to the same solution, we must ensure that:
  • features already selected by SFS are not removed by SBS
  • features already deleted by SBS are not selected by SFS
• Suboptimal solution!


Sequential Search: Plus-L Minus-R Selection (LRS)

• Plus-L Minus-R Selection is a generalization of SFS and SBS
• If L > R, LRS starts from the empty set and repeatedly adds 'L' features and removes 'R' features (see the sketch below)
• If L < R, LRS starts from the full set and repeatedly removes 'R' features followed by 'L' feature additions
• LRS attempts to compensate for the weaknesses of SFS and SBS with some backtracking capabilities
• Its main limitation is the lack of a theory to help predict the optimal values of L and R
• Suboptimal solution!
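A sketch of the L > R case, assuming target_size ≤ N and a generic objective J; the early return keeps the loop from overshooting the requested size. Names and the default values of L and R are illustrative.

```python
def plus_l_minus_r(all_features, J, target_size, L=2, R=1):
    """Plus-L Minus-R sketch (L > R): starting from the empty set, repeat L
    greedy forward steps followed by R greedy backward steps until the
    target size is reached; for L < R one would start from the full set."""
    assert R < L and target_size <= len(all_features)
    selected = []
    while True:
        for _ in range(L):                               # plus-L (as in SFS)
            remaining = [f for f in all_features if f not in selected]
            if not remaining:
                break
            selected.append(max(remaining, key=lambda f: J(tuple(selected) + (f,))))
            if len(selected) == target_size:
                return tuple(selected)
        for _ in range(R):                               # minus-R (as in SBS)
            least_useful = max(selected,
                               key=lambda f: J(tuple(x for x in selected if x != f)))
            selected.remove(least_useful)
```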


Sequential Search: Sequential Floating Selection (SFFS and SFBS)

• Sequential Floating Selection methods are an extension of the LRS algorithms with flexible values for L and R (see the sketch below)
• These values are determined automatically from the data and updated dynamically
• It gets very close to the optimal solution, at an affordable computational cost
• No guarantee to reach the optimal solution!
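A simplified sketch of the forward ("floating") variant, assuming a generic objective J: after each forward step, features are removed as long as the reduced subset improves on the best score previously recorded at that size. Full SFFS implementations track the best subset at every size; this condensed version only returns the subset reached at the target size.

```python
def sffs(all_features, J, target_size):
    """Sequential Floating Forward Selection (simplified): SFS forward steps
    interleaved with conditional backward steps, so L and R are effectively
    chosen by the data rather than fixed in advance."""
    selected, best_at_size = [], {}
    while len(selected) < target_size:
        # forward step (as in SFS)
        remaining = [f for f in all_features if f not in selected]
        selected.append(max(remaining, key=lambda f: J(tuple(selected) + (f,))))
        best_at_size[len(selected)] = max(best_at_size.get(len(selected), float("-inf")),
                                          J(tuple(selected)))
        # conditional backward steps (the "floating" part)
        while len(selected) > 2:
            candidate = max(selected,
                            key=lambda f: J(tuple(x for x in selected if x != f)))
            reduced = tuple(x for x in selected if x != candidate)
            if J(reduced) > best_at_size.get(len(reduced), float("-inf")):
                selected.remove(candidate)
                best_at_size[len(selected)] = J(reduced)
            else:
                break
    return tuple(selected)
```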


Randomized Search: Random Generation + Sequential Selection (RGSS)

• RGSS introduces randomness into SFS and SBS
• In this way it avoids falling into local minima
• We consider a number of random combinations of features and select the best one (see the sketch below)
• Obviously, there is no guarantee of finding the best solution

1. Repeat for a number of iterations:
   1a. Generate a random feature subset
   1b. Perform SFS on this subset
   1c. Perform SBS on this subset
2. Choose the best subset found
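A sketch of the procedure above, reusing the sfs and sbs helpers sketched earlier and assuming a generic objective J; the number of iterations and the random-subset sizes are arbitrary illustrative choices.

```python
import random

def rgss(all_features, J, target_size, n_iterations=20, seed=0):
    """Random Generation plus Sequential Selection: run SFS and SBS from
    several random starting subsets and keep the best subset found."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iterations):
        k = rng.randint(target_size, len(all_features))       # random subset size
        start = tuple(rng.sample(list(all_features), k))       # 1a. random subset
        for candidate in (sfs(start, J, target_size),           # 1b. SFS on it
                          sbs(start, J, target_size)):          # 1c. SBS on it
            score = J(candidate)
            if score > best_score:
                best_subset, best_score = candidate, score
    return best_subset, best_score                              # 2. best subset found
```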


Randomized Search: Genetic Algorithms

• Genetic Algorithms are optimization techniques that mimic the evolutionary process of survival of the fittest
• They explore the solution space looking for the solution that maximizes the objective function
• The choice of the parameters is not so simple...
• We will see more about Genetic Algorithms during the next lesson!


Feature Selection Search Strategies: Summary

Exponential Algorithms
• Methods: Exhaustive Search; Branch & Bound; Approximate Monotonicity with Branch & Bound
• Accuracy: always find the optimal solution (B&B under the monotonicity assumption)
• Complexity: exponential
• Notes: high accuracy, but high complexity

Sequential Algorithms
• Methods: Sequential Forward Selection (SFS); Sequential Backward Selection (SBS); Plus-L Minus-R Selection; Bidirectional Selection; Sequential Floating Selection
• Accuracy: no guarantee to find the optimal solution
• Complexity: quadratic, O(N_EX^2)
• Notes: simple and fast, but not optimal

Randomized Algorithms
• Methods: Random Generation plus Sequential Selection; Genetic Algorithms
• Accuracy: usually they find the optimal solution
• Complexity: generally low
• Notes: escape local minima, but it is difficult to choose good parameters