
Some Thoughts at the Interface of Ensemble Methods and Feature Selection

Gavin Brown, University of Manchester, UK

Cairo Microsoft Innovation Centre. Adapted (slightly) from MCS 2010.

What type of document is this?

- potentially thousands of “features”
- highly spatially correlated, often noisy

Statistics of High-dimensional data

In data mining, lots of features = big problem.

- overfitting (model only works on in-sample data)
- computationally expensive
- low interpretability of final model
- in certain domains, features cost money!

Handling high-dimensional data?

Feature Extraction

$S = f(\Omega)$, usually $S = \mathbf{w}^T \Omega$.

With extraction (e.g. PCA), we lose meaning of original features.
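As a quick illustration (my own sketch, not from the slides), a PCA-based extraction in scikit-learn makes the point concrete: every extracted feature is a weighted mixture of all the originals, so individual feature meaning is lost.

# Illustrative sketch (assumed example, not from the talk):
# feature extraction with PCA. Each new feature is a weighted mix
# of ALL original features, so individual feature meaning is lost.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 samples, 50 original features

pca = PCA(n_components=5)
Z = pca.fit_transform(X)                # 5 extracted features

# Each row of components_ is a weight vector w over all 50 originals.
print(Z.shape, pca.components_.shape)   # (200, 5) (5, 50)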

Feature Selection: $S \subseteq \Omega$.

With selection we identify meaningful subsets of the original features. This is a combinatorial optimization over the space of possible feature subsets.

Feature Selection - “Filters”

PROCEDURE: FILTER
Input:   large feature set Ω
Returns: useful feature subset S ⊆ Ω

10  Identify candidate subset S ⊆ Ω
20  While !stop_criterion()
        Evaluate relevancy index J using S.
        Adapt subset S.
30  Return S.

Pro: generic, and fast! Con: task-specific design is an open problem.
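A minimal sketch of this loop (my own illustration; the function and variable names are assumptions, not from the talk), written as a greedy forward search with a pluggable relevancy index J:

def filter_select(features, score, k):
    # score(f, selected) is any relevancy index J; k is the stop criterion.
    selected = []
    candidates = set(features)
    while len(selected) < k and candidates:       # 20: stop criterion
        best = max(candidates, key=lambda f: score(f, selected))
        selected.append(best)                     # adapt subset S
        candidates.remove(best)
    return selected                               # 30: return S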

How ’relevant’ is any given feature?

Y is a human-labeled document set. X is presence/absence of a particular image feature.

$I(X;Y)$: a measure of dependence between feature $X$ and target $Y$.

Zero when $X$ is independent of $Y$. Increases as $X$ and $Y$ become dependent.

Defined as a KL-divergence:

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(xy) \log \frac{p(xy)}{p(x)p(y)}$$
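A rough plug-in estimator of this quantity (my own sketch; variable names are illustrative) for a discrete feature and label:

import numpy as np

def mutual_information(x, y):
    # Plug-in estimate of I(X;Y) from empirical co-occurrence frequencies.
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            p_x, p_y = np.mean(x == xv), np.mean(y == yv)
            if p_xy > 0:
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi   # in nats; divide by np.log(2) for bits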

Filters using Mutual Information

Rank features $X_i, \forall i$ by their values of $J = I(X_i;Y)$. Retain the highest ranked features, discard the lowest ranked.

  i     J(X_i)
 35     0.846
 42     0.811
 10     0.810
654     0.611
 22     0.443
 59     0.388
...     ...
212     0.09
 39     0.05

Cut-off point decided by the user, e.g. $|S| = 5$, so $S = \{35, 42, 10, 654, 22\}$.

Problem: F42 and F10 are almost identical!
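A small sketch of this failure mode (my own, using scikit-learn's MI estimator on synthetic data): top-k ranking scores a duplicated feature almost identically to its original, so both are retained.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

# Synthetic data; shuffle=False keeps the informative features first.
X, y = make_classification(n_samples=500, n_features=100,
                           shuffle=False, random_state=0)
X = np.hstack([X, X[:, [0]]])        # feature 100 is a copy of feature 0

scores = mutual_info_classif(X, y, random_state=0)
top5 = np.argsort(scores)[::-1][:5]
print(top5)   # features 0 and 100 get near-identical scores; both are kept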

A Problem: Design of Filter Criteria

Q. What is the relevance of feature Xi?

$J_{mi}(X_i) = I(X_i;Y)$
“its own mutual information with the target”

$J_{mifs}(X_i) = I(X_i;Y) - \sum_{X_k \in S} I(X_i;X_k)$
“as above, but penalised by correlations with features already chosen”

$J_{mrmr}(X_i) = I(X_i;Y) - \frac{1}{|S|}\sum_{X_k \in S} I(X_i;X_k)$
“as above, but averaged, smoothing out noise”

$J_{jmi}(X_i) = \sum_{X_k \in S} I(X_iX_k;Y)$
“how well it pairs up with other features chosen”
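For concreteness, a sketch (illustrative names, not the author's code) of scoring a candidate feature i under these heuristics, given precomputed quantities rel[i] ≈ I(X_i;Y), red[i][k] ≈ I(X_i;X_k) and joint[i][k] ≈ I(X_iX_k;Y):

def j_mi(i, S, rel, red, joint):
    return rel[i]

def j_mifs(i, S, rel, red, joint):
    return rel[i] - sum(red[i][k] for k in S)

def j_mrmr(i, S, rel, red, joint):
    return rel[i] - sum(red[i][k] for k in S) / max(len(S), 1)

def j_jmi(i, S, rel, red, joint):
    return sum(joint[i][k] for k in S)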

The Confusing Literature of Feature Selection Land

Criterion  Full name                                Author
MIM        Mutual Information Maximisation          Various (1970s- )
MIFS       Mutual Information Feature Selection     Battiti (1994)
JMI        Joint Mutual Information                 Yang & Moody (1999)
MIFS-U     MIFS-‘Uniform’                           Kwak & Choi (2002)
IF         Informative Fragments                    Vidal-Naquet (2003)
FCBF       Fast Correlation Based Filter            Yu et al (2004)
CMIM       Conditional Mutual Info Maximisation     Fleuret (2004)
JMI-AVG    Averaged Joint Mutual Information        Scanlon et al (2004)
MRMR       Max-Relevance Min-Redundancy             Peng et al (2005)
ICAP       Interaction Capping                      Jakulin (2005)
CIFE       Conditional Infomax Feature Extraction   Lin & Tang (2006)
DISR       Double Input Symmetrical Relevance       Meyer (2006)
MINRED     Minimum Redundancy                       Duch (2006)
IGFS       Interaction Gain Feature Selection       El-Akadi (2008)
MIGS       Mutual Information Based Gene Selection  Cai et al (2009)

Why should we trust any of these? How do they relate?

The Land of Feature Selection: A Summary

Problem: construct a useful set of features

- Need features to be relevant and not redundant.

Accepted research practice: invent heuristic measures

- Encouraging “relevant” features
- Discouraging correlated features

Sound familiar? For ‘feature’ above, read ‘classifier’...

The “Duality” of MCS and FS

[Diagram sequence, © L.I. Kuncheva, ICPR Plenary 2008: feature values (object description) feed a bank of classifiers whose outputs a combiner fuses into a class label. Successive builds re-label the parts: the combiner is itself a classifier, the classifier bank is “a fancy feature extractor”, and the whole ensemble is arguably just one classifier.]

Regression Ensembles

Loss function: $(f(x) - y)^2$

Combiner function: $f(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)$

Method: take the objective function, decompose into constituent parts.

$$(f - y)^2 = \frac{1}{M}\sum_{i=1}^{M}(f_i - y)^2 - \frac{1}{M}\sum_{i=1}^{M}(f_i - f)^2$$
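This decomposition can be checked numerically; a small sketch (my own, with arbitrary numbers) verifying it at a single point x:

import numpy as np

rng = np.random.default_rng(1)
preds = rng.normal(size=7)        # f_i(x) for an ensemble of M = 7 members
y = 0.3                           # target value at x
f = preds.mean()                  # mean-combined ensemble prediction

lhs = (f - y) ** 2
rhs = np.mean((preds - y) ** 2) - np.mean((preds - f) ** 2)
assert np.isclose(lhs, rhs)       # the decomposition holds exactly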

An MCS native visits the Land of Feature Selection

Loss function: $I(F;Y)$

‘Combiner’ function: $F = X_{1:M}$ (joint random variable)

Method: take the objective function, decompose into constituent parts.

$$I(X_{1:M};Y) = \sum_{\forall i} I(X_i;Y) + \sum_{\forall i,j} I(X_i, X_j, Y) + \sum_{\forall i,j,k} I(X_i, X_j, X_k, Y) + \sum_{\forall i,j,k,l} I(X_i, X_j, X_k, X_l, Y) + \dots$$

Multiple “levels” of correlation! Each term is a multi-variate mutual information! (McGill, 1954)
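For intuition, here is the $M=2$ case worked out (my own addition, not on the slide), using the chain rule and McGill's interaction information:

\begin{align*}
I(X_1 X_2;Y) &= I(X_1;Y) + I(X_2;Y \mid X_1) \\
             &= I(X_1;Y) + I(X_2;Y) + \underbrace{\big( I(X_1;X_2 \mid Y) - I(X_1;X_2) \big)}_{\text{interaction information } I(X_1, X_2, Y)}
\end{align*}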

Linking theory to heuristics....

Take only the terms involving the $X_i$ we want to evaluate; the exact expression is:

$$I(X_i;Y \mid S) = I(X_i;Y) - \sum_{k \in S} I(X_i;X_k) + \sum_{k \in S} I(X_i;X_k \mid Y) + \sum_{j,k \in S} I(X_i, X_j, X_k, Y) + \dots$$

$$J_{mi} = I(X_i;Y)$$

$$J_{mifs} = I(X_i;Y) - \sum_{k \in S} I(X_i;X_k)$$

$$J_{mrmr} = I(X_i;Y) - \frac{1}{|S|}\sum_{k \in S} I(X_i;X_k)$$

and others can be re-written to this form...

$$J_{jmi} = \sum_{k \in S} I(X_iX_k;Y) = I(X_i;Y) - \frac{1}{|S|}\sum_{k \in S} I(X_i;X_k) + \frac{1}{|S|}\sum_{k \in S} I(X_i;X_k \mid Y)$$

A “Template” Criterion

$$J_{mifs} = I(X_i;Y) - \sum_{k \in S} I(X_i;X_k)$$

$$J_{mrmr} = I(X_i;Y) - \frac{1}{|S|}\sum_{k \in S} I(X_i;X_k)$$

$$J_{jmi} = I(X_i;Y) - \frac{1}{|S|}\sum_{k \in S} I(X_i;X_k) + \frac{1}{|S|}\sum_{k \in S} I(X_i;X_k \mid Y)$$

$$J_{cife} = I(X_i;Y) - \sum_{k \in S} I(X_i;X_k) + \sum_{k \in S} I(X_i;X_k \mid Y)$$

$$J_{cmim} = I(X_i;Y) - \max_{k \in S} \big[\, I(X_i;X_k) - I(X_i;X_k \mid Y) \,\big]$$

The template:

$$J = I(X_n;Y) - \beta \sum_{\forall k \in S} I(X_n;X_k) + \gamma \sum_{\forall k \in S} I(X_n;X_k \mid Y)$$
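A sketch of greedy forward selection under this template (my own illustration, not the author's implementation; it assumes discrete-valued features and uses scikit-learn's mutual_info_score as a plug-in estimator). Setting β = 1, γ = 0 gives MIFS-like behaviour; β = γ = 1 gives CIFE; β = γ = 0 gives plain MI ranking; MRMR and JMI correspond to β (and γ) shrinking as 1/|S|.

import numpy as np
from sklearn.metrics import mutual_info_score

def template_select(X, y, k, beta=1.0, gamma=1.0):
    # Greedy forward selection with J = rel - beta*red + gamma*cond_red.
    y = np.asarray(y)
    n_feat = X.shape[1]
    rel = [mutual_info_score(X[:, i], y) for i in range(n_feat)]
    S, remaining = [], list(range(n_feat))
    while len(S) < k:
        def J(i):
            red = sum(mutual_info_score(X[:, i], X[:, j]) for j in S)
            # I(X_i;X_j|Y) estimated as a class-weighted average of
            # within-class mutual informations.
            cond = sum(sum(np.mean(y == c) *
                           mutual_info_score(X[y == c, i], X[y == c, j])
                           for c in np.unique(y))
                       for j in S)
            return rel[i] - beta * red + gamma * cond
        best = max(remaining, key=J)
        S.append(best)
        remaining.remove(best)
    return S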

The β/γ Space of Possible Criteria

[Figure: published criteria located in the β/γ unit square (both axes 0 to 1): MIM, MIFS / MRMR (n=2), MRMR (n=3), MRMR (n=4), JMI (n=2), JMI (n=3), JMI (n=4), and FOU / CIFE.]

$$I(X_{1:M};Y) \;\approx\; \underbrace{I(X_n;Y)}_{\text{relevancy}} \;-\; \beta \underbrace{\sum_{k} I(X_n;X_k)}_{\text{redundancy}} \;+\; \gamma \underbrace{\sum_{k} I(X_n;X_k \mid Y)}_{\text{conditional redundancy}}$$


Exploring β/γ space

[Figures from Brown (2009): Figure 3, ARCENE data (cancer diagnosis); Figure 4, GISETTE data (handwritten digit recognition).]

Exploring β/γ space

Seems straightforward? Just use the diagonal? The top right corner? But remember: these are only the low order components.

It is easy to construct problems that have ZERO in the low orders, and positive terms in the high orders.

- e.g. data drawn from a Bayesian net with some nodes exhibiting deterministic behaviour (e.g. the parity problem)
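A tiny sketch of the parity point (my own illustration): each feature alone carries zero information about an XOR target, yet the pair carries one full bit, so any criterion built only from the low order terms sees nothing.

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, size=10_000)
x2 = rng.integers(0, 2, size=10_000)
y = x1 ^ x2                                 # parity target

print(mutual_info_score(x1, y))             # ~0
print(mutual_info_score(x2, y))             # ~0
print(mutual_info_score(2 * x1 + x2, y))    # ~0.69 nats = 1 bit (joint variable)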

ARCENE data

3-nn classifier (left), and SVM (right).

Image Segment data

3-nn classifier (left), and SVM (right). Pink line (‘UnitSquare’) is the top right corner of β/γ space.

GISETTE data

Low order components insufficient... heuristics can triumph over theory!

Exports & Imports

Exported a perspective from the Land of MCS...

...solved an open problem in the Land of FS.

But could the MCS natives also learn from this?

Committee members should be ‘diverse’

[Diagram repeated: feature values (object description) feed multiple classifiers, a combiner produces the class label. © L.I. Kuncheva, ICPR Plenary 2008]

Exports & Imports : Understanding Ensemble Diversity

(Step 1) Take an objective function... the log-likelihood of an ensemble combiner $g$ with $M$ members:

$$L = \mathbb{E}_{xy}\big[\, \log g(y \mid \phi_{1:M}) \,\big]$$

(Step 2) ...decompose it into constituent parts:

$$L = \text{const} \;+\; \underbrace{I(\phi_{1:M};Y)}_{\text{ensemble members}} \;-\; \underbrace{KL\big(\, p(y \mid x) \,\|\, g(y \mid \phi_{1:M}) \,\big)}_{\text{combiner}}$$

“Information Theoretic Views of Ensemble Learning”, G. Brown, Manchester MLO Tech Report, Feb 2010.

Exports & Imports : Understanding Ensemble Diversity

$$I(X_{1:M};Y) \;\approx\; \underbrace{\sum_{i=1}^{M} I(X_i;Y)}_{\text{“relevancy”}} \;\underbrace{-\; \sum_{j=1}^{M}\sum_{k=j+1}^{M} I(X_j;X_k) \;+\; \sum_{j=1}^{M}\sum_{k=j+1}^{M} I(X_j;X_k \mid Y)}_{\text{“diversity”}}$$

“An Information Theoretic Perspective on Multiple Classifier Systems”, MCS 2009.

Exports & Imports : Understanding Ensemble Diversity

Bagging vs. Adaboost: compare the individual relevancy term $\sum_{i=1}^{M} I(X_i;Y)$ for each ensemble method.

“An Information Theoretic Perspective on Multiple Classifier Systems”, MCS 2009. Zhou & Li, “Multi-Information Ensemble Diversity” (tomorrow 9.45am!)
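A rough sketch of how the individual relevancy term might be measured in practice (my own illustration, not the experiments on the following slides; the dataset and settings are assumptions), treating each member's predictions as $X_i$:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import mutual_info_score

X, y = load_breast_cancer(return_X_y=True)

for model in (AdaBoostClassifier(n_estimators=30, random_state=0),
              RandomForestClassifier(n_estimators=30, random_state=0)):
    model.fit(X, y)
    # Sum of I(h_i(x); y) over the fitted ensemble members.
    relevancy = sum(mutual_info_score(member.predict(X), y)
                    for member in model.estimators_)
    print(type(model).__name__, relevancy)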

Exports & Imports : Understanding Ensemble Diversity

[Plots: test error vs. number of classifiers (0 to 30) for Adaboost and RandomForests on the breast data.]

(ongoing work with Zhi-Hua Zhou)

Exports & Imports : Understanding Ensemble Diversity

[Plots: individual MI and ensemble gain vs. number of classifiers on the breast data. Adaboost: r = −0.94079; RandomForests: r = −0.98425.]

(ongoing work with Zhi-Hua Zhou)

Conclusions

It’s getting really hard to contribute meaningful research to MCS... and to ML/PR in general!

- I’m starting to look at importing ideas from other fields
- Information Theory seems natural
- Unified framework for feature selection
- Deeper understanding of committee learning methods

Recommended