
When Classifier Selection meets Information Theory: A Unifying View


DESCRIPTION

Classifier selection aims to reduce the size of an ensemble of classifiers in order to improve its efficiency and classification accuracy. Recently, an information-theoretic view was presented for feature selection. It derives a space of possible selection criteria and shows that several feature selection criteria in the literature are points within this continuous space. The contribution of this paper is to export this information-theoretic view to an open issue in ensemble learning, namely classifier selection. We investigate a number of information-theoretic selection criteria that are used to rank classifiers.


Page 1: When Classifier Selection meets Information Theory: A Unifying View


When Classifier Selection meets Information Theory: A Unifying View

Mohamed Abdel Hady, Friedhelm Schwenker, Günther Palm

Institute of Neural Information Processing, University of Ulm, Germany

December 8, 2010


Page 2: When Classifier Selection meets Information Theory: A Unifying View


Outline

1. Ensemble Learning
2. Ensemble Pruning
3. Information Theory
4. Ensemble Pruning meets Information Theory
5. Experimental Results
6. Conclusion and Future Work


Page 3: When Classifier Selection meets Information Theory: A Unifying View


Ensemble Learning

An ensemble is a set of accurate and diverse classifiers. The objective is that the ensemble outperforms its member classifiers.

[Figure: ensemble architecture. An input x is passed to the classifier layer h1, ..., hi, ..., hN; the individual outputs h1(x), ..., hi(x), ..., hN(x) are fused in the combination layer by the combiner g into the ensemble decision g(x).]

Ensemble learning has become a hot topic in recent years.

Ensemble methods consist of two phases: the construction of multiple individual classifiers and their combination.


Page 4: When Classifier Selection meets Information Theory: A Unifying View


Ensemble Learning

How to construct individual classifiers?


Page 5: When Classifier Selection meets Information Theory: A Unifying View


Ensemble Pruning

Recent work has considered an additional intermediate phase that deals with the reduction of the ensemble size before combination.

This phase has several names in the literature, such as ensemble pruning, selective ensemble, ensemble thinning, and classifier selection.

Classifier selection is important for two reasons: classification accuracy and efficiency.

An ensemble may consist not only of accurate classifiers, but also of classifiers with lower accuracy. The main factor for an effective ensemble is to remove the poor-performing classifiers while maintaining a good diversity among the ensemble members.

The second reason, efficiency, is equally important. Having a very large number of classifiers in an ensemble adds a lot of computational overhead. For instance, decision trees may have large memory requirements, and lazy learning methods have a considerable computational cost during the classification phase.


Page 6: When Classifier Selection meets Information Theory: A Unifying View


Information Theory

Entropy

H(X) = -\sum_{x_j \in X} p(X = x_j) \log_2 p(X = x_j)    (1)

Conditional Entropy

H(X | Y) = -\sum_{y \in Y} p(Y = y) \sum_{x \in X} p(X = x | Y = y) \log_2 p(X = x | Y = y)    (2)
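For discrete variables such as classifier outputs, both quantities can be estimated from samples by plugging in empirical frequencies. A minimal sketch in Python (the function names and the use of numpy are my own assumptions, not taken from the slides):

    import numpy as np

    def entropy(x):
        """Plug-in estimate of H(X) in bits, Eq. (1), from a 1-D array of discrete values."""
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def conditional_entropy(x, y):
        """Plug-in estimate of H(X|Y) in bits, Eq. (2): average entropy of X within each value of Y."""
        x, y = np.asarray(x), np.asarray(y)
        vals, counts = np.unique(y, return_counts=True)
        weights = counts / counts.sum()
        return sum(w * entropy(x[y == v]) for v, w in zip(vals, weights))

    # toy check: if X is fully determined by Y, then H(X|Y) = 0
    x = np.array([0, 0, 1, 1])
    y = np.array([0, 0, 1, 1])
    print(entropy(x), conditional_entropy(x, y))   # 1.0  0.0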


Page 7: When Classifier Selection meets Information Theory: A Unifying View


Information Theory

Shannon Mutual Information

I(X; Y) = H(X) - H(X | Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}    (3)

Shannon Conditional Mutual Information

I(X_1; X_2 | Y) = H(X_1 | Y) - H(X_1 | X_2, Y)
                = \sum_{y \in Y} p(y) \sum_{x_1 \in X_1} \sum_{x_2 \in X_2} p(x_1, x_2 | y) \log_2 \frac{p(x_1, x_2 | y)}{p(x_1 | y) p(x_2 | y)}    (4)
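Both quantities can also be estimated from discrete samples with plug-in frequencies via entropy identities; the identities I(X;Y) = H(X) + H(Y) - H(X,Y) and I(X1;X2|Y) = H(X1,Y) + H(X2,Y) - H(X1,X2,Y) - H(Y) are equivalent to Eqs. (3) and (4). A self-contained sketch (helper names are my own assumptions):

    import numpy as np

    def entropy(*cols):
        """Joint entropy H(X1, ..., Xk) in bits from equal-length columns of discrete values."""
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def mutual_information(x, y):
        """I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to Eq. (3)."""
        return entropy(x) + entropy(y) - entropy(x, y)

    def conditional_mutual_information(x1, x2, y):
        """I(X1;X2|Y) = H(X1,Y) + H(X2,Y) - H(X1,X2,Y) - H(Y), equivalent to Eq. (4)."""
        return entropy(x1, y) + entropy(x2, y) - entropy(x1, x2, y) - entropy(y)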


Page 8: When Classifier Selection meets Information Theory: A Unifying View


Ensemble Pruning meets Information Theory

Information theory can provide a bound on the error probability p(X_{1:N} \ne Y) for any combiner g. The error of predicting the target variable Y from the inputs X_{1:N} is bounded by the two inequalities

\frac{H(Y) - I(X_{1:N}; Y) - 1}{\log(|Y|)} \le p(X_{1:N} \ne Y) \le \frac{1}{2} H(Y | X_{1:N}).    (5)
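As a quick sanity check on Eq. (5), both sides can be evaluated for a small synthetic joint distribution. The sketch below is my own illustration, not an experiment from the paper; it assumes entropies in bits and uses the error of the optimal decision from X as the error probability being bounded.

    import numpy as np

    # toy joint distribution p(x, y): rows index x (joint classifier output), columns index y
    p_xy = np.array([[0.35, 0.05],
                     [0.10, 0.50]])
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    def H(p):
        """Shannon entropy in bits of a probability vector."""
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    H_Y = H(p_y)
    H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))
    I_XY = H_Y - H_Y_given_X

    # error of the optimal (Bayes) decision from X, and the two bounds of Eq. (5)
    bayes_error = sum(p_x[i] * (1 - (p_xy[i] / p_x[i]).max()) for i in range(len(p_x)))
    lower = (H_Y - I_XY - 1) / np.log2(len(p_y))
    upper = 0.5 * H_Y_given_X
    print(lower <= bayes_error <= upper)   # True for this example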

I(X_{1:N}; Y) involves the high-dimensional probability distribution p(x_1, x_2, ..., x_N, y), which is hard to estimate in practice. However, it can be decomposed into simpler terms.


Page 9: When Classifier Selection meets Information Theory: A Unifying View


Interaction Information

Shannon's Mutual Information I(X_1; X_2) is a function of two variables. It is not able to measure properties of multiple (N) variables.

McGill presented what is called Interaction Information as a multivariate generalization of Shannon's Mutual Information.

For instance, the Interaction Information between three random variables is

I(\{X_1, X_2, X_3\}) = I(X_1; X_2 | X_3) - I(X_1; X_2)    (6)

The general form for a set S of arbitrary size is defined recursively:

I(S \cup \{X\}) = I(S | X) - I(S)    (7)

W. McGill, Multivariate information transmission, IEEE Trans. on Information Theory, vol. 4, no. 4, pp. 93–111, 1954.


Page 10: When Classifier Selection meets Information Theory: A Unifying View


Mutual Information Decomposition

Theorem

Given a set of classifiers S = \{X_1, ..., X_N\} and a target class label Y, the Shannon mutual information between X_{1:N} and Y can be decomposed into a sum of Interaction Information terms,

I(X_{1:N}; Y) = \sum_{T \subseteq S, |T| \ge 1} I(T \cup \{Y\}).    (8)

For a set of classifiers S = \{X_1, X_2, X_3\}, the mutual information between the joint variable X_{1:3} and the target Y can be decomposed as

I(X_{1:3}; Y) = I(X_1; Y) + I(X_2; Y) + I(X_3; Y)
             + I(\{X_1, X_2, Y\}) + I(\{X_1, X_3, Y\}) + I(\{X_2, X_3, Y\})
             + I(\{X_1, X_2, X_3, Y\})

Each term can then be decomposed into a class-unconditional part I(X) and a class-conditional part I(X | Y) according to Eq. (6).

I(X_{1:3}; Y) = \sum_{i=1}^{3} I(X_i; Y) - \sum_{X \subseteq S, |X| \in \{2, 3\}} I(X) + \sum_{X \subseteq S, |X| \in \{2, 3\}} I(X | Y)


Page 11: When Classifier Selection meets Information Theory: A Unifying View


Mutual Information Decomposition (cont’d)

For an ensemble S of size N and according to Eq. (7),

I(X_{1:N}; Y) = \sum_{i=1}^{N} I(X_i; Y) - \sum_{X \subseteq S, 2 \le |X| \le N} I(X) + \sum_{X \subseteq S, 2 \le |X| \le N} I(X | Y)    (9)

We assume that there exist only pairwise unconditional and conditional interactions and omit the higher-order terms.

I(X_{1:N}; Y) \approx \sum_{i=1}^{N} I(X_i; Y) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j | Y)    (10)

G. Brown, A new perspective for information theoretic feature selection, in Proc. of the 12th Int. Conf. on Artificial Intelligence and Statistics (AI-STATS 2009), 2009.
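With plug-in frequency estimates, the pairwise approximation of Eq. (10) can be computed directly from the N vectors of classifier outputs on a validation set. A self-contained sketch (helper names and the data layout are my own assumptions):

    import numpy as np
    from itertools import combinations

    def H(*cols):
        """Joint entropy in bits from equal-length columns of discrete values."""
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def I(x, y):                 # I(X;Y)
        return H(x) + H(y) - H(x, y)

    def I_cond(x1, x2, y):       # I(X1;X2|Y)
        return H(x1, y) + H(x2, y) - H(x1, x2, y) - H(y)

    def ensemble_mi_pairwise(outputs, y):
        """Approximate I(X_{1:N}; Y) as in Eq. (10); outputs is a list of N label vectors."""
        relevance  = sum(I(x, y) for x in outputs)
        redundancy = sum(I(xi, xj) for xi, xj in combinations(outputs, 2))
        complement = sum(I_cond(xi, xj, y) for xi, xj in combinations(outputs, 2))
        return relevance - redundancy + complement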


Page 12: When Classifier Selection meets Information Theory: A Unifying View


Classifier Selection Criterion

The objective of an information-theoretic classifier selection method is to select a subset S of K classifiers from a pool Ω of N classifiers, constructed by any ensemble learning algorithm, that carries as much information as possible about the target class, using a predefined selection criterion:

J(X_{u(j)}) = I(X_{1:k+1}; Y) - I(X_{1:k}; Y)
            = I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} | Y)    (11)

That is, the difference in information after and before the addition of X_{u(j)} to S. This tells us that the best classifier is a trade-off between three components: the relevance of the classifier, the unconditional correlations, and the class-conditional correlations. In order to trade off between these components, Eq. (11) [Brown, AI-STATS 2009] can be parameterized to define the root criterion:

J(X_{u(j)}) = I(X_{u(j)}; Y) - \beta \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \gamma \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} | Y).    (12)


Page 13: When Classifier Selection meets Information Theory: A Unifying View


Classifier Selection Algorithm

1: Select the most relevant classifier, v(1) = arg max_{1 ≤ j ≤ N} I(X_j; Y)
2: S = {X_{v(1)}}
3: for k = 1 : K − 1 do
4:     for j = 1 : |Ω \ S| do
5:         Calculate J(X_{u(j)}) as defined in Eq. (12)
6:     end for
7:     v(k+1) = arg max_{1 ≤ j ≤ |Ω \ S|} J(X_{u(j)})
8:     S = S ∪ {X_{v(k+1)}}
9: end for
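The pseudocode amounts to greedy forward selection with the root criterion of Eq. (12). Below is a minimal sketch in Python; the plug-in estimators, variable names, and the default beta/gamma values are my own assumptions, not taken from the paper.

    import numpy as np

    def _H(*cols):
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def _I(x, y):             # I(X;Y)
        return _H(x) + _H(y) - _H(x, y)

    def _I_cond(x1, x2, y):   # I(X1;X2|Y)
        return _H(x1, y) + _H(x2, y) - _H(x1, x2, y) - _H(y)

    def select_classifiers(outputs, y, K, beta=1.0, gamma=1.0):
        """Greedy forward selection of K classifiers using Eq. (12).

        outputs: list of N label vectors (one per classifier, on validation data); y: true labels.
        Returns the selected classifier indices in selection order.
        """
        pool = set(range(len(outputs)))                      # Omega
        first = max(pool, key=lambda j: _I(outputs[j], y))   # step 1: most relevant classifier
        selected, pool = [first], pool - {first}             # step 2
        for _ in range(K - 1):                               # steps 3-9
            def J(j):                                        # Eq. (12)
                rel = _I(outputs[j], y)
                red = sum(_I(outputs[j], outputs[i]) for i in selected)
                com = sum(_I_cond(outputs[j], outputs[i], y) for i in selected)
                return rel - beta * red + gamma * com
            best = max(pool, key=J)
            selected.append(best)
            pool.remove(best)
        return selected

Given a pool Ω of label vectors on a validation set, select_classifiers(outputs, y, K) returns the indices of the K chosen classifiers, whose votes can then be fused by the fixed combiner g.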


Page 14: When Classifier Selection meets Information Theory: A Unifying View


Classifier Selection Heuristics

Maximal Relevance (MR)

J(X_{u(j)}) = I(X_{u(j)}; Y)    (13)

Mutual Information Feature Selection (MIFS) [Battiti, 1994]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})    (14)

Minimal Redundancy Maximal Relevance (mRMR) [Peng et al., 2005]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \frac{1}{|S|} \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})    (15)


Page 15: When Classifier Selection meets Information Theory: A Unifying View


Classifier Selection Heuristics (cont’d)

Joint Mutual Information (JMI) [Yang and Moody, 1999]

J(X_{u(j)}) = \sum_{i=1}^{k} I(X_{u(j)} X_{v(i)}; Y)    (16)

Conditional Infomax Feature Extraction (CIFE) [Lin and Tang, 2006]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} | Y) \right]    (17)

Conditional Mutual Information Maximization (CMIM) [Fleuret, 2004]

J(X_{u(j)}) = \min_{1 \le i \le k} I(X_{u(j)}; Y | X_{v(i)})
            = I(X_{u(j)}; Y) - \max_{1 \le i \le k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} | Y) \right]    (18)
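One way to read Eqs. (13)-(17) is as specific (β, γ) settings of the root criterion in Eq. (12). The summary below is my own reading, following Brown's framework, rather than a table from the slides; CMIM does not fit the linear form because of its max, and the JMI mapping holds only up to terms that do not depend on the candidate classifier.

    # Hypothetical summary: heuristics of Eqs. (13)-(17) as settings of
    # J = relevance - beta * redundancy + gamma * complement, with k = |S| already selected.
    ROOT_CRITERION_SETTINGS = {
        "MR":   {"beta": 0.0,   "gamma": 0.0},     # Eq. (13)
        "MIFS": {"beta": 1.0,   "gamma": 0.0},     # Eq. (14)
        "mRMR": {"beta": "1/k", "gamma": 0.0},     # Eq. (15)
        "JMI":  {"beta": "1/k", "gamma": "1/k"},   # Eq. (16), up to candidate-independent terms
        "CIFE": {"beta": 1.0,   "gamma": 1.0},     # Eq. (17)
    }

Plugged into the greedy sketch after the algorithm slide (with the 1/k entries recomputed from the current size of S inside the loop), each of these settings reproduces the corresponding heuristic.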


Page 16: When Classifier Selection meets Information Theory: A Unifying View


Experimental Results

Bagging and Random Forest are used to construct a pool of 50 decision trees (N = 50).

Each selection criterion is evaluated with K = 40 (20% pruned), 30 (40%), 20 (60%), and 10 (80%).

11 data sets from the UCI machine learning repository

Results are averaged over 5 runs of 10-fold cross-validation.

normalized_test_acc = (pruned_ens_acc − single_tree_acc) / (unpruned_ens_acc − single_tree_acc)
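For illustration only (the accuracy values below are hypothetical, not taken from the experiments):

    # toy example of the normalization above
    single_tree_acc  = 0.80
    unpruned_ens_acc = 0.90
    pruned_ens_acc   = 0.88
    normalized_test_acc = (pruned_ens_acc - single_tree_acc) / (unpruned_ens_acc - single_tree_acc)
    print(normalized_test_acc)  # 0.8: the pruned ensemble keeps 80% of the full ensemble's gain over a single tree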

id    name                   Classes   Examples   Discrete Features   Continuous Features
d1    anneal                 6         898        32                  6
d2    autos                  7         205        10                  16
d3    wisconsin-breast       2         699        0                   9
d4    bupa liver disorders   2         345        0                   6
d5    german-credit          2         1000       13                  7
d6    pima-diabetes          2         768        0                   8
d7    glass                  7         214        0                   9
d8    cleveland-heart        2         303        7                   6
d9    hepatitis              2         155        13                  6
d10   ionosphere             2         351        0                   34
d11   vehicle                4         846        0                   18


Page 17: When Classifier Selection meets Information Theory: A Unifying View


Results

Figure: Comparison of the normalized test accuracy of the ensemble of C4.5 decision trees constructed by Bagging.


Page 18: When Classifier Selection meets Information Theory: A Unifying View


Results (cont’d)

Figure: Comparison of the normalized test accuracy of the ensemble of random trees constructed by Random Forest.

Page 19: When Classifier Selection meets Information Theory: A Unifying View


Conclusion

This paper examined the issue of classifier selection from an information-theoretic viewpoint. The main advantage of information-theoretic criteria is that they capture higher-order statistics of the data.

The ensemble mutual information is decomposed into accuracy and diversity components.

Although diversity is represented by both low- and high-order terms, we keep only the first-order (pairwise) terms in this paper. In a further study, we will examine the influence of including the higher-order terms on pruning performance.

In this paper, we selected some points within the continuous space of possible selection criteria that correspond to well-known feature selection criteria, such as mRMR, CIFE, JMI and CMIM, and used them for classifier selection. In future work, we will explore other points in this space that may lead to more effective pruning.

We plan to extend the algorithm to pruning ensembles of regression estimators.


Page 20: When Classifier Selection meets Information Theory: A Unifying View


Thanks for your attention

Questions?
