
CE-725: Statistical Pattern Recognition, Sharif University of Technology, Spring 2013

Soleymani

Feature Selection

Outline
Dimensionality reduction
Feature selection vs. feature extraction
Filter univariate methods
Multivariate filter & wrapper methods
Evaluation criteria
Search strategies


Feature Selection
Data may contain many irrelevant and redundant variables, and often comparably few training examples.
Applications for which datasets with tens or hundreds of thousands of variables are available: text or document processing, gene expression array analysis.


[Figure: data matrix with N samples as rows and d feature columns indexed 1, 2, 3, 4, …, d − 1, d]

Feature Selection


A way to select the most relevant features in order to obtain more accurate, faster, and easier-to-understand classifiers.
Performance: enhancing generalization ability and alleviating the effect of the curse of dimensionality (the higher the ratio of the number of training patterns to the number of free classifier parameters, the better the generalization of the learned classifier).
Efficiency: speeding up the learning process.
Interpretability: resulting in a model that is easier to understand.

Peaking Phenomenon


If the class-conditional pdfs are unknown: for a finite number of training samples $N$, increasing the number of features $d$ initially decreases the probability of error, but beyond a critical point the error increases again.
The minimum is typically attained around $d = N/\alpha$, usually with $\alpha \in [2, 10]$.
[Theodoridis et al., 2008]

Noise (or irrelevant) features


Eliminating irrelevant features can decrease the classification error on test data.
[Figure: SVM decision boundary before and after adding a noise feature]

Feature Selection


Supervised feature selection: given a labeled set of data points, select a subset of features for data representation.
If class information is not available, unsupervised feature selection approaches are used [we will discuss supervised ones in this lecture].

Feature Selection

$$X = \begin{bmatrix} x_1^{(1)} & \cdots & x_d^{(1)} \\ \vdots & \ddots & \vdots \\ x_1^{(N)} & \cdots & x_d^{(N)} \end{bmatrix}, \qquad y = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$
The selected features: $x_{i_1}, x_{i_2}, \ldots, x_{i_{d'}}$

Dimensionality Reduction: Feature Selection vs. Feature Extraction

Feature selection: select a subset of a given feature set.
Feature extraction (e.g., PCA, LDA): a linear or non-linear transform on the original feature space.


Feature selection: $\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} x_{i_1} \\ \vdots \\ x_{i_{d'}} \end{bmatrix}$ with $d' < d$.
Feature extraction: $\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \rightarrow \begin{bmatrix} y_1 \\ \vdots \\ y_{d'} \end{bmatrix} = f\!\left(\begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix}\right)$.

Gene Expression


Usually only a small number of examples (patients), but the number of features (genes) ranges from 6,000 to 60,000.
[Golub et al., Science, 1999]

[Figure: Leukemia diagnosis, expression values of the $d'$ selected features for samples labeled +1 and −1]

Face Male/Female Classification

[Figure: features (pixels) selected by Relief and Simba when keeping 100, 500, and 1000 features]
[Navot, Bachrach, and Tishby, ICML, 2004]
$N = 1000$ training images, $d = 60 \times 85 = 5100$ features


Good Features
Discriminative: different values for samples of different classes, and almost the same values for very similar samples.
Interpretable: features that are easier to understand by human experts (corresponding to object characteristics).
Representative: provide a concise description.


Some Definitions


One categorization of feature selection methods:
Univariate method: considers one variable (feature) at a time.
Multivariate method: considers subsets of features together.
Another categorization:
Filter method: ranks features or feature subsets independently of the classifier, as a preprocessing step.
Wrapper method: uses a classifier to evaluate the score of features or feature subsets.
Embedded method: feature selection is done during the training of a classifier, e.g., by adding a regularization term to the cost function of linear classifiers (see the sketch below).
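To make the embedded idea concrete, here is a minimal sketch (assuming scikit-learn and a purely synthetic dataset; the regularization strength C is an illustrative choice) in which an L1 penalty on logistic regression drives the weights of irrelevant features to exactly zero, so selection happens as a by-product of training:

```python
# Embedded selection sketch: L1-regularized logistic regression.
# Assumes scikit-learn; the data below is synthetic and only illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_redundant=5, random_state=0)

# The L1 penalty pushes the weights of useless features to exactly zero,
# so feature selection happens during classifier training itself.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)

selected = np.flatnonzero(clf.coef_[0] != 0)
print("features kept by the L1 penalty:", selected)
```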

Filter – Univariate Method


The score of each feature $x_j$ is found based on the $j$-th column of the data matrix and the label vector.
Rank the features according to their score values and select the ones with the highest scores (see the sketch below).
Advantage: computational and statistical scalability.
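A minimal sketch of this score-and-rank scheme, assuming scikit-learn and synthetic data; the ANOVA F-score used by SelectKBest is just one possible univariate scoring function, and k = 10 is an arbitrary cutoff:

```python
# Univariate filter sketch: score each feature independently, keep the top k.
# Assumes scikit-learn; the data and the choice k = 10 are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=50, n_informative=4,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=10)   # ANOVA F-score per column
X_reduced = selector.fit_transform(X, y)

ranking = np.argsort(selector.scores_)[::-1]         # features sorted by score
print("top-10 features:", ranking[:10])
print("reduced data shape:", X_reduced.shape)
```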

Filter – Univariate Method


Filter methods are usually used in the univariate case: when a feature is considered irrespective of the others, it is not very meaningful to train a classifier just to evaluate it.
Criteria for feature selection:
Relevance of the feature for predicting the labels: can the feature discriminate the patterns of different classes?
Reducing redundancy: do not select features that are redundant with respect to the previously selected variables.

Irrelevance
$X_j$: random variable corresponding to the $j$-th component of the input feature vectors.
$Y$: random variable corresponding to the labels.
Irrelevance of feature $X_j$ for predicting $Y$ (with $C = 2$ classes): $p(X_j \mid Y = 1) = p(X_j \mid Y = -1)$.
Using the KL divergence to find a distance between $p(X_j \mid Y = 1)$ and $p(X_j \mid Y = -1)$ (sketched after the figure below):
$$D_j = \mathrm{KL}\!\left(p(X_j \mid Y = 1)\,\|\,p(X_j \mid Y = -1)\right) + \mathrm{KL}\!\left(p(X_j \mid Y = -1)\,\|\,p(X_j \mid Y = 1)\right)$$


[Figure: class-conditional densities of a feature for $Y = 1$ and $Y = -1$]
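A minimal sketch of this relevance measure on synthetic data, estimating the two class-conditional densities with histograms; the bin count and the smoothing constant eps are illustrative choices, not part of the lecture:

```python
# Symmetric KL divergence between p(x_j | y = 1) and p(x_j | y = -1),
# estimated with smoothed histograms. Purely illustrative choices throughout.
import numpy as np

def symmetric_kl(x, y, bins=20, eps=1e-9):
    edges = np.histogram_bin_edges(x, bins=bins)     # shared bins for both classes
    p, _ = np.histogram(x[y == 1], bins=edges)
    q, _ = np.histogram(x[y == -1], bins=edges)
    p = (p + eps) / (p + eps).sum()                  # smoothed class-conditional pmfs
    q = (q + eps) / (q + eps).sum()
    return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=500)
relevant = rng.normal(loc=y.astype(float))           # class-dependent mean
irrelevant = rng.normal(size=500)                    # same distribution in both classes
print(symmetric_kl(relevant, y), symmetric_kl(irrelevant, y))
```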

Pearson Correlation Criteria


$$R(j) = \frac{\mathrm{Cov}(X_j, Y)}{\sqrt{\mathrm{Var}(X_j)\,\mathrm{Var}(Y)}} \approx \frac{\sum_{i=1}^{N}\left(x_j^{(i)} - \bar{x}_j\right)\left(y^{(i)} - \bar{y}\right)}{\sqrt{\sum_{i=1}^{N}\left(x_j^{(i)} - \bar{x}_j\right)^2}\,\sqrt{\sum_{i=1}^{N}\left(y^{(i)} - \bar{y}\right)^2}}$$
In the slide's example, $R(1) \gg R(2)$.
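A minimal NumPy sketch of this criterion on synthetic data: each feature is scored by the absolute value of its sample correlation with the label and the features are ranked accordingly (the data and helper name are illustrative):

```python
# Pearson correlation filter: score(j) = |corr(x_j, y)|, then rank features.
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=300)
X = rng.normal(size=(300, 5))
X[:, 0] += 2.0 * y                         # feature 0 is strongly class-dependent

def pearson_scores(X, y):
    Xc = X - X.mean(axis=0)                # center each feature column
    yc = y - y.mean()
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)

scores = pearson_scores(X, y)
print("ranking (best first):", np.argsort(scores)[::-1])
```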

Univariate Dependence


Independence: $p(X_j, Y) = p(X_j)\,p(Y)$.
Mutual information as a measure of dependence:
$$I(X_j; Y) = \sum_{x_j} \sum_{y} p(x_j, y)\,\log\frac{p(x_j, y)}{p(x_j)\,p(y)}$$
Score of $X_j$ based on its mutual information with $Y$: $S(j) = I(X_j; Y)$.
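A minimal sketch of mutual-information scoring, assuming scikit-learn's nonparametric estimator mutual_info_classif and synthetic data:

```python
# Mutual-information filter: estimate I(x_j ; y) for every feature and rank.
# Assumes scikit-learn; the dataset is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # one MI estimate per feature
print("MI ranking (best first):", np.argsort(mi)[::-1][:5])
```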

ROC Curve

[Figure: class-conditional densities of two features $x_1$ and $x_2$ for $Y = 1$ and $Y = -1$, together with the corresponding ROC curves plotting sensitivity (TPR) against 1 − specificity (FPR); the Area Under the Curve (AUC) of each feature can serve as its univariate score]
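A minimal sketch of AUC-based univariate scoring on synthetic data, assuming scikit-learn's roc_auc_score; the raw feature value is treated as a score for the positive class, and the score is flipped when the feature is oriented the other way:

```python
# ROC/AUC filter: how well does a single feature separate the two classes?
# Assumes scikit-learn; the data is synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=400)
x_good = rng.normal(loc=2.0 * y)          # class means differ -> AUC near 1
x_noise = rng.normal(size=400)            # irrelevant feature -> AUC near 0.5

for name, x in [("good feature", x_good), ("noise feature", x_noise)]:
    auc = roc_auc_score(y, x)
    print(name, max(auc, 1.0 - auc))      # orientation-independent score
```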

Statistical T-test


T statistic: based on the experimental data, we reject or accept the null hypothesis H0 at a certain significance level.
H0: the values of the feature for the different classes do not differ significantly, i.e., $\mu_1 = \mu_2$.
We select the features for which the class means are significantly different.

[Figure: class-conditional densities of a feature for $Y = 1$ and $Y = -1$]

Statistical T-test


Assumptions: normally distributed classes with equal but unknown variance; $s^2$ is the variance estimated from the data.
Null hypothesis H0: $\mu_1 = \mu_2$.
$T$ has a Student's $t$ distribution with $\nu = N_1 + N_2 - 2$ degrees of freedom.
If $t > t_{\alpha/2,\nu}$ or $t < -t_{\alpha/2,\nu}$, then reject the null hypothesis.

Statistical T-test


Assumptions: normally distributed classes with unknown variance, estimated from the data through the per-class sample variances $s_1^2$ and $s_2^2$ (pooled estimate):
$$s^2 = \frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}$$
The statistic
$$t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\frac{1}{N_1} + \frac{1}{N_2}}}$$
has a Student's $t$ distribution with $N_1 + N_2 - 2$ degrees of freedom.

Statistical T-test: Example


Feature $x_1$: class means $1$ and $0$, $s_1 = s_2 = 1$ $\Rightarrow$ $t = 1.224$, $\nu = 4$ $\Rightarrow$ $t_{0.025,4} = 2.78 > 1.224$. Then $x_1$ is not a proper feature.
Feature $x_2$: class means $2$ and $-1$, $s_1 = s_2 = 1$ $\Rightarrow$ $t = 3.674$, $\nu = 4$ $\Rightarrow$ $t_{0.025,4} = 2.78 < 3.674$. Then $x_2$ is a proper feature.
(significance level $\alpha = 0.05$)
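A short numeric check of this example, assuming equal class sizes $N_1 = N_2 = 3$ (consistent with the 4 degrees of freedom and the quoted t values, though not stated explicitly on the slide) and SciPy for the critical value:

```python
# Re-derive the slide's t statistics and the alpha = 0.05 critical value.
# N1 = N2 = 3 is an assumption consistent with nu = N1 + N2 - 2 = 4.
import numpy as np
from scipy import stats

def pooled_t(mean1, mean2, s2, n1, n2):
    # t = (mean1 - mean2) / sqrt(s^2 * (1/n1 + 1/n2)), s^2 = pooled variance
    return (mean1 - mean2) / np.sqrt(s2 * (1.0 / n1 + 1.0 / n2))

n1 = n2 = 3
t_crit = stats.t.ppf(1 - 0.025, df=n1 + n2 - 2)   # ~= 2.776

t_x1 = pooled_t(1.0, 0.0, 1.0, n1, n2)   # ~= 1.224 < t_crit: H0 kept, x1 discarded
t_x2 = pooled_t(2.0, -1.0, 1.0, n1, n2)  # ~= 3.674 > t_crit: H0 rejected, x2 kept
print(t_crit, t_x1, t_x2)
```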

Filter – Univariate: Disadvantage


Redundant subset: the same performance could possibly be achieved with a smaller subset of complementary variables that does not contain redundant features.
What is the relation between redundancy and correlation? Are highly correlated features necessarily redundant? What about completely correlated ones?

Univariate Methods: Failure


Examples where univariate feature analysis and scoring fail, e.g., features that are individually uninformative but jointly discriminate the classes (an XOR-like arrangement):
[Guyon-Elisseeff, JMLR 2004; Springer 2006]

Multivariate Methods: General Procedure
Subset generation: select a candidate feature subset for evaluation.
Subset evaluation: compute the score (relevancy value) of the subset.
Stopping criterion: decide when to stop the search in the space of feature subsets.
Validation: verify that the selected subset is valid.


[Flowchart: original feature set → subset generation → subset evaluation → stopping criterion (No: generate another subset; Yes: validation)]

Search Space for Feature Selection ($d = 4$)


0,0,0,0

1,0,0,0 0,1,0,0 0,0,1,0 0,0,0,1

1,1,0,0 1,0,1,0 0,1,1,0 1,0,0,1 0,1,0,1 0,0,1,1

1,1,1,0 1,1,0,1 1,0,1,1 0,1,1,1

1,1,1,1

[Kohavi-John, 1997]

Stopping Criteria
A predefined number of features is selected.
A predefined number of iterations is reached.
Addition (or deletion) of any feature does not result in a better subset.
An optimal subset (according to the evaluation criterion) is obtained.


Multi-variate Feature Selection


Search in the space of all possible combinations of features: for $d$ features there are $2^d$ possible subsets, so exhaustive search has high computational and statistical complexity.
Wrappers use the classifier performance to evaluate the feature subset utilized in the classifier. Training $2^d$ classifiers is infeasible for large $d$, so most wrapper algorithms use a heuristic search (a brute-force sketch for small $d$ follows).
Filters use an evaluation function that is cheaper to compute than the performance of the classifier, e.g., the correlation coefficient.
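As a minimal illustration of the wrapper cost, here is a brute-force sketch that evaluates every non-empty subset of a toy problem with cross-validated accuracy; scikit-learn, the synthetic data, and the logistic-regression classifier are all assumptions, and exactly this $2^d$ enumeration is what becomes infeasible for large $d$:

```python
# Exhaustive wrapper sketch: train/evaluate a classifier on every feature subset.
# Only feasible because d = 6 here (2^6 - 1 = 63 subsets). Assumes scikit-learn.
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=150, n_features=6, n_informative=3,
                           random_state=0)
d = X.shape[1]

best_score, best_subset = -np.inf, None
for k in range(1, d + 1):
    for subset in combinations(range(d), k):
        cols = list(subset)
        score = cross_val_score(LogisticRegression(max_iter=1000),
                                X[:, cols], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, subset

print("best subset:", best_subset, "cv accuracy:", round(best_score, 3))
```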

Filters vs. Wrappers

[Diagram: Filter: original feature set → filter ranks subsets of useful features → feature subset → classifier. Wrapper: original feature set → multiple feature subsets, ranked by taking the classifier into account → classifier.]

Filter Methods: Evaluation Criteria
Distance (e.g., Euclidean distance): class separability; prefer features for which instances of the same class are closer to each other than to instances of different classes.
Information (e.g., information gain): select S1 if IG(S1, Y) > IG(S2, Y).
Dependency (e.g., correlation coefficient): good feature subsets contain features highly correlated with the class, yet uncorrelated with each other.
Consistency (min-features bias): selects features that guarantee no inconsistency in the data, where inconsistent instances have the same feature vector but different class labels; prefers the smaller consistent subset (min-features). A small check is sketched after the table below.


             f1   f2   class
instance 1:  a    b    c1
instance 2:  a    b    c2   (inconsistent with instance 1)
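A minimal sketch of the inconsistency check, mirroring the two-instance example above (the data and helper name are illustrative):

```python
# Consistency criterion sketch: a subset is inconsistent if two instances share
# the same feature values on that subset but carry different class labels.
import numpy as np

def is_consistent(X_subset, y):
    seen = {}
    for row, label in zip(map(tuple, X_subset), y):
        if row in seen and seen[row] != label:
            return False              # same feature vector, different labels
        seen[row] = label
    return True

X = np.array([["a", "b"],
              ["a", "b"]])
y = np.array(["c1", "c2"])
print(is_consistent(X, y))            # False: {f1, f2} cannot separate c1 from c2
```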

Subset Selection or Generation (Search)
Search direction: forward, backward, random.
Search strategies:
Exhaustive / complete: branch & bound, best first.
Heuristic: sequential forward selection, sequential backward elimination, plus-l minus-r selection, bidirectional search, sequential floating selection.
Non-deterministic: simulated annealing, genetic algorithms.


Search Strategies
Complete: examines all combinations of feature subsets; the optimal subset is achievable, but this is too expensive if $d$ is large.
Heuristic: selection is directed under certain guidelines, with incremental generation of subsets; smaller search space and thus faster search, but feature sets of high importance may be missed.
Non-deterministic or random: no predefined way to select feature candidates (i.e., a probabilistic approach); the obtained subset depends on the number of trials, and more user-defined parameters are needed.


Sequential Forward Selection


0,0,0,0

1,0,0,0 0,1,0,0 0,0,1,0 0,0,0,1

1,1,0,0 1,0,1,0 0,1,1,0 1,0,0,1 0,1,0,1 0,0,1,1

1,1,1,0 1,1,0,1 1,0,1,1 0,1,1,1

1,1,1,1

Start
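A minimal sketch of sequential forward selection used as a wrapper: starting from the empty set, greedily add the feature that most improves cross-validated accuracy. scikit-learn, the logistic-regression classifier, and the stopping size are assumptions (scikit-learn also provides a SequentialFeatureSelector that implements the same greedy idea):

```python
# Sequential forward selection (SFS) sketch with a cross-validated wrapper score.
# Assumes scikit-learn; classifier, cv folds, and n_select are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=15, n_informative=4,
                           random_state=0)

def sfs(X, y, n_select):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        scores = [cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining]
        best = remaining[int(np.argmax(scores))]   # feature giving the best gain
        selected.append(best)
        remaining.remove(best)
    return selected

print("features selected by SFS:", sfs(X, y, n_select=4))
```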

Filters vs. Wrappers
Filters
Fast execution: the evaluation function is faster to compute than training a classifier.
Generality: they evaluate intrinsic properties of the data rather than their interactions with a particular classifier ("good" for a larger family of classifiers).
Tendency to select large subsets: their objective functions are generally monotonic (so they tend to select the full feature set); a cutoff on the number of features is required.
Wrappers
Slow execution: a classifier must be trained for each feature subset (or trained several times if cross-validation is used).
Lack of generality: the solution is tied to the bias of the classifier used in the evaluation function.
Ability to generalize: since they typically use cross-validation measures to evaluate classification accuracy, they have a mechanism to avoid overfitting.
Accuracy: they generally achieve better recognition rates than filters, since they find a feature set suited to the intended classifier.


[Figure: summary of feature selection methods, from H. Liu and L. Yu, Feature Selection for Data Mining, 2002]

Feature Selection: Summary


Most univariate methods are filters, and most wrappers are multivariate.
No feature selection method is universally better than the others: there is a wide variety of variable types, data distributions, and classifiers.
Match the method complexity to the ratio d/N: univariate feature selection may work better than multivariate, and weaker classifiers (e.g., linear) may be better than complex ones.
Feature selection is not always necessary to achieve good performance.

References


I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," JMLR, vol. 3, pp. 1157-1182, 2003.
S. Theodoridis and K. Koutroumbas, Pattern Recognition, 4th edition, 2008. [Chapter 5]
H. Liu and L. Yu, Feature Selection for Data Mining, 2002. (Recommended)