
When Classifier Selection meets Information Theory: A Unifying View


DESCRIPTION

Classifier selection aims to reduce the size of an ensemble of classifiers in order to improve its efficiency and classification accuracy. Recently, an information-theoretic view was presented for feature selection. It derives a space of possible selection criteria and shows that several feature selection criteria in the literature are points within this continuous space. The contribution of this paper is to export this information-theoretic view to an open issue in ensemble learning, namely classifier selection. We investigate a number of information-theoretic selection criteria that are used to rank classifiers.


Page 1: When Classifier Selection meets Information Theory: A Unifying View


When Classifier Selection meets Information Theory: A Unifying View

Mohamed Abdel Hady, Friedhelm Schwenker, Günther Palm

Institute of Neural Information Processing, University of Ulm, Germany

December 8, 2010


Page 2: When Classifier Selection meets Information Theory: A Unifying View


Outline

1. Ensemble Learning
2. Ensemble Pruning
3. Information Theory
4. Ensemble Pruning meets Information Theory
5. Experimental Results
6. Conclusion and Future Work


Page 3: When Classifier Selection meets Information Theory: A Unifying View


Ensemble Learning

An ensemble is a set of accurate and diverse classifiers. The objective is that the ensemble outperforms its member classifiers.

[Figure: ensemble architecture. An input x is passed to the classifier layer h1, ..., hi, ..., hN; the individual outputs h1(x), ..., hi(x), ..., hN(x) are fused in the combination layer by the combiner g into the ensemble decision g(x).]

Ensemble learning has become a hot topic in recent years.

Ensemble methods consist of two phases: the construction of multiple individual classifiers and their combination.


Page 4: When Classifier Selection meets Information Theory: A Unifying View


Ensemble Learning

How to construct individual classifiers?


Page 5: When Classifier Selection meets Information Theory: A Unifying View


Ensemble Pruning

Recent work has considered an additional intermediate phase that deals with the reduction of the ensemble size before combination.

This phase has several names in the literature, such as ensemble pruning, selective ensemble, ensemble thinning, and classifier selection.

Classifier selection is important for two reasons: classification accuracy and efficiency.

An ensemble may consist not only of accurate classifiers, but also of classifiers with lower accuracy. The main factor for an effective ensemble is to remove the poor-performing classifiers while maintaining a good diversity among the ensemble members.

The second reason, efficiency, is equally important. Having a very large number of classifiers in an ensemble adds a lot of computational overhead. For instance, decision trees may have large memory requirements, and lazy learning methods have a considerable computational cost during the classification phase.


Page 6: When Classifier Selection meets Information Theory: A Unifying View


Information Theory

Entropy

H(X) = -\sum_{x_j \in X} p(X = x_j) \log_2 p(X = x_j)    (1)

Conditional Entropy

H(X | Y) = -\sum_{y \in Y} p(Y = y) \sum_{x \in X} p(X = x | Y = y) \log_2 p(X = x | Y = y)    (2)
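For discrete variables such as classifier outputs, both quantities can be estimated from samples by plugging in empirical frequencies. A minimal sketch in Python (the function names and the use of numpy are my own assumptions, not taken from the slides):

    import numpy as np

    def entropy(x):
        """Plug-in estimate of H(X) in bits, Eq. (1), from a 1-D array of discrete values."""
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def conditional_entropy(x, y):
        """Plug-in estimate of H(X|Y) in bits, Eq. (2): average entropy of X within each value of Y."""
        x, y = np.asarray(x), np.asarray(y)
        vals, counts = np.unique(y, return_counts=True)
        weights = counts / counts.sum()
        return sum(w * entropy(x[y == v]) for v, w in zip(vals, weights))

    # toy check: if X is fully determined by Y, then H(X|Y) = 0
    x = np.array([0, 0, 1, 1])
    y = np.array([0, 0, 1, 1])
    print(entropy(x), conditional_entropy(x, y))   # 1.0  0.0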


Page 7: When Classifier Selection meets Information Theory: A Unifying View


Information Theory

Shannon Mutual Information

I(X; Y) = H(X) - H(X | Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}    (3)

Shannon Conditional Mutual Information

I(X_1; X_2 | Y) = H(X_1 | Y) - H(X_1 | X_2, Y)
                = \sum_{y \in Y} p(y) \sum_{x_1 \in X_1} \sum_{x_2 \in X_2} p(x_1, x_2 | y) \log_2 \frac{p(x_1, x_2 | y)}{p(x_1 | y) p(x_2 | y)}    (4)
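Both quantities can also be estimated from discrete samples with plug-in frequencies via entropy identities; the identities I(X;Y) = H(X) + H(Y) - H(X,Y) and I(X1;X2|Y) = H(X1,Y) + H(X2,Y) - H(X1,X2,Y) - H(Y) are equivalent to Eqs. (3) and (4). A self-contained sketch (helper names are my own assumptions):

    import numpy as np

    def entropy(*cols):
        """Joint entropy H(X1, ..., Xk) in bits from equal-length columns of discrete values."""
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def mutual_information(x, y):
        """I(X;Y) = H(X) + H(Y) - H(X,Y), equivalent to Eq. (3)."""
        return entropy(x) + entropy(y) - entropy(x, y)

    def conditional_mutual_information(x1, x2, y):
        """I(X1;X2|Y) = H(X1,Y) + H(X2,Y) - H(X1,X2,Y) - H(Y), equivalent to Eq. (4)."""
        return entropy(x1, y) + entropy(x2, y) - entropy(x1, x2, y) - entropy(y)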


Page 8: When Classifier Selection meets Information Theory: A Unifying View


Ensemble Pruning meets Information Theory

Information theory can provide a bound on the error probability p(X_{1:N} \ne Y) for any combiner g. The error of predicting the target variable Y from the inputs X_{1:N} is bounded by the two inequalities

\frac{H(Y) - I(X_{1:N}; Y) - 1}{\log(|Y|)} \le p(X_{1:N} \ne Y) \le \frac{1}{2} H(Y | X_{1:N}).    (5)
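As a quick sanity check on Eq. (5), both sides can be evaluated for a small synthetic joint distribution. The sketch below is my own illustration, not an experiment from the paper; it assumes entropies in bits and uses the error of the optimal decision from X as the error probability being bounded.

    import numpy as np

    # toy joint distribution p(x, y): rows index x (joint classifier output), columns index y
    p_xy = np.array([[0.35, 0.05],
                     [0.10, 0.50]])
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

    def H(p):
        """Shannon entropy in bits of a probability vector."""
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    H_Y = H(p_y)
    H_Y_given_X = sum(p_x[i] * H(p_xy[i] / p_x[i]) for i in range(len(p_x)))
    I_XY = H_Y - H_Y_given_X

    # error of the optimal (Bayes) decision from X, and the two bounds of Eq. (5)
    bayes_error = sum(p_x[i] * (1 - (p_xy[i] / p_x[i]).max()) for i in range(len(p_x)))
    lower = (H_Y - I_XY - 1) / np.log2(len(p_y))
    upper = 0.5 * H_Y_given_X
    print(lower <= bayes_error <= upper)   # True for this example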

I(X_{1:N}; Y) involves the high-dimensional probability distribution p(x_1, x_2, ..., x_N, y), which is hard to estimate in practice. However, it can be decomposed into simpler terms.


Page 9: When Classifier Selection meets Information Theory: A Unifying View


Interaction Information

Shannon's Mutual Information I(X_1; X_2) is a function of two variables. It is not able to measure properties of multiple (N) variables.

McGill presented what is called Interaction Information as a multivariate generalization of Shannon's Mutual Information.

For instance, the Interaction Information between three random variables is

I(\{X_1, X_2, X_3\}) = I(X_1; X_2 | X_3) - I(X_1; X_2)    (6)

The general form for a set S of arbitrary size is defined recursively:

I(S \cup \{X\}) = I(S | X) - I(S)    (7)

W. McGill, Multivariate information transmission, IEEE Trans. on Information Theory, vol. 4, no. 4, pp. 93–111, 1954.


Page 10: When Classifier Selection meets Information Theory: A Unifying View


Mutual Information Decomposition

Theorem

Given a set of classifiers S = \{X_1, ..., X_N\} and a target class label Y, the Shannon mutual information between X_{1:N} and Y can be decomposed into a sum of Interaction Information terms,

I(X_{1:N}; Y) = \sum_{T \subseteq S, |T| \ge 1} I(T \cup \{Y\}).    (8)

For a set of classifiers S = \{X_1, X_2, X_3\}, the mutual information between the joint variable X_{1:3} and the target Y can be decomposed as

I(X_{1:3}; Y) = I(X_1; Y) + I(X_2; Y) + I(X_3; Y)
             + I(\{X_1, X_2, Y\}) + I(\{X_1, X_3, Y\}) + I(\{X_2, X_3, Y\})
             + I(\{X_1, X_2, X_3, Y\})

Each term can then be decomposed into a class-unconditional part I(X) and a class-conditional part I(X | Y) according to Eq. (6).

I(X_{1:3}; Y) = \sum_{i=1}^{3} I(X_i; Y) - \sum_{X \subseteq S, |X| \in \{2, 3\}} I(X) + \sum_{X \subseteq S, |X| \in \{2, 3\}} I(X | Y)


Page 11: When Classifier Selection meets Information Theory: A Unifying View


Mutual Information Decomposition (cont’d)

For an ensemble S of size N and according to Eq. (7),

I(X_{1:N}; Y) = \sum_{i=1}^{N} I(X_i; Y) - \sum_{X \subseteq S, 2 \le |X| \le N} I(X) + \sum_{X \subseteq S, 2 \le |X| \le N} I(X | Y)    (9)

We assume that there exist only pairwise unconditional and conditional interactions and omit the higher-order terms.

I(X_{1:N}; Y) \approx \sum_{i=1}^{N} I(X_i; Y) - \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j) + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} I(X_i; X_j | Y)    (10)

G. Brown, A new perspective for information theoretic feature selection, in Proc. of the 12th Int. Conf. on Artificial Intelligence and Statistics (AI-STATS 2009), 2009.
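With plug-in frequency estimates, the pairwise approximation of Eq. (10) can be computed directly from the N vectors of classifier outputs on a validation set. A self-contained sketch (helper names and the data layout are my own assumptions):

    import numpy as np
    from itertools import combinations

    def H(*cols):
        """Joint entropy in bits from equal-length columns of discrete values."""
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def I(x, y):                 # I(X;Y)
        return H(x) + H(y) - H(x, y)

    def I_cond(x1, x2, y):       # I(X1;X2|Y)
        return H(x1, y) + H(x2, y) - H(x1, x2, y) - H(y)

    def ensemble_mi_pairwise(outputs, y):
        """Approximate I(X_{1:N}; Y) as in Eq. (10); outputs is a list of N label vectors."""
        relevance  = sum(I(x, y) for x in outputs)
        redundancy = sum(I(xi, xj) for xi, xj in combinations(outputs, 2))
        complement = sum(I_cond(xi, xj, y) for xi, xj in combinations(outputs, 2))
        return relevance - redundancy + complement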


Page 12: When Classifier Selection meets Information Theory: A Unifying View


Classifier Selection Criterion

The objective of an information-theoretic classifier selection method is to select a subset S of K classifiers from a pool Ω of N classifiers, constructed by any ensemble learning algorithm, that carries as much information as possible about the target class, using a predefined selection criterion:

J(X_{u(j)}) = I(X_{1:k+1}; Y) - I(X_{1:k}; Y)
            = I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} | Y)    (11)

That is, the difference in information after and before the addition of X_{u(j)} to S. This tells us that the best classifier is a trade-off between three components: the relevance of the classifier, the unconditional correlations, and the class-conditional correlations. In order to trade off between these components, Eq. (11) [Brown, AI-STATS 2009] can be parameterized to define the root criterion:

J(X_{u(j)}) = I(X_{u(j)}; Y) - \beta \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)}) + \gamma \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)} | Y).    (12)


Page 13: When Classifier Selection meets Information Theory: A Unifying View


Classifier Selection Algorithm

1: Select the most relevant classifier, v(1) = arg max_{1 ≤ j ≤ N} I(X_j; Y)
2: S = {X_{v(1)}}
3: for k = 1 : K − 1 do
4:     for j = 1 : |Ω \ S| do
5:         Calculate J(X_{u(j)}) as defined in Eq. (12)
6:     end for
7:     v(k+1) = arg max_{1 ≤ j ≤ |Ω \ S|} J(X_{u(j)})
8:     S = S ∪ {X_{v(k+1)}}
9: end for
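The pseudocode amounts to greedy forward selection with the root criterion of Eq. (12). Below is a minimal sketch in Python; the plug-in estimators, variable names, and the default beta/gamma values are my own assumptions, not taken from the paper.

    import numpy as np

    def _H(*cols):
        joint = np.stack(cols, axis=1)
        _, counts = np.unique(joint, axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def _I(x, y):             # I(X;Y)
        return _H(x) + _H(y) - _H(x, y)

    def _I_cond(x1, x2, y):   # I(X1;X2|Y)
        return _H(x1, y) + _H(x2, y) - _H(x1, x2, y) - _H(y)

    def select_classifiers(outputs, y, K, beta=1.0, gamma=1.0):
        """Greedy forward selection of K classifiers using Eq. (12).

        outputs: list of N label vectors (one per classifier, on validation data); y: true labels.
        Returns the selected classifier indices in selection order.
        """
        pool = set(range(len(outputs)))                      # Omega
        first = max(pool, key=lambda j: _I(outputs[j], y))   # step 1: most relevant classifier
        selected, pool = [first], pool - {first}             # step 2
        for _ in range(K - 1):                               # steps 3-9
            def J(j):                                        # Eq. (12)
                rel = _I(outputs[j], y)
                red = sum(_I(outputs[j], outputs[i]) for i in selected)
                com = sum(_I_cond(outputs[j], outputs[i], y) for i in selected)
                return rel - beta * red + gamma * com
            best = max(pool, key=J)
            selected.append(best)
            pool.remove(best)
        return selected

Given a pool Ω of label vectors on a validation set, select_classifiers(outputs, y, K) returns the indices of the K chosen classifiers, whose votes can then be fused by the fixed combiner g.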


Page 14: When Classifier Selection meets Information Theory: A Unifying View


Classifier Selection Heuristics

Maximal Relevance (MR)

J(X_{u(j)}) = I(X_{u(j)}; Y)    (13)

Mutual Information Feature Selection (MIFS) [Battiti, 1994]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})    (14)

Minimal Redundancy Maximal Relevance (mRMR) [Peng et al., 2005]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \frac{1}{|S|} \sum_{i=1}^{k} I(X_{u(j)}; X_{v(i)})    (15)


Page 15: When Classifier Selection meets Information Theory: A Unifying View


Classifier Selection Heuristics (cont’d)

Joint Mutual Information (JMI) [Yang and Moody, 1999]

J(X_{u(j)}) = \sum_{i=1}^{k} I(X_{u(j)} X_{v(i)}; Y)    (16)

Conditional Infomax Feature Extraction (CIFE) [Lin and Tang, 2006]

J(X_{u(j)}) = I(X_{u(j)}; Y) - \sum_{i=1}^{k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} | Y) \right]    (17)

Conditional Mutual Information Maximization (CMIM) [Fleuret, 2004]

J(X_{u(j)}) = \min_{1 \le i \le k} I(X_{u(j)}; Y | X_{v(i)})
            = I(X_{u(j)}; Y) - \max_{1 \le i \le k} \left[ I(X_{u(j)}; X_{v(i)}) - I(X_{u(j)}; X_{v(i)} | Y) \right]    (18)
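One way to read Eqs. (13)-(17) is as specific (β, γ) settings of the root criterion in Eq. (12). The summary below is my own reading, following Brown's framework, rather than a table from the slides; CMIM does not fit the linear form because of its max, and the JMI mapping holds only up to terms that do not depend on the candidate classifier.

    # Hypothetical summary: heuristics of Eqs. (13)-(17) as settings of
    # J = relevance - beta * redundancy + gamma * complement, with k = |S| already selected.
    ROOT_CRITERION_SETTINGS = {
        "MR":   {"beta": 0.0,   "gamma": 0.0},     # Eq. (13)
        "MIFS": {"beta": 1.0,   "gamma": 0.0},     # Eq. (14)
        "mRMR": {"beta": "1/k", "gamma": 0.0},     # Eq. (15)
        "JMI":  {"beta": "1/k", "gamma": "1/k"},   # Eq. (16), up to candidate-independent terms
        "CIFE": {"beta": 1.0,   "gamma": 1.0},     # Eq. (17)
    }

Plugged into the greedy sketch after the algorithm slide (with the 1/k entries recomputed from the current size of S inside the loop), each of these settings reproduces the corresponding heuristic.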


Page 16: When Classifier Selection meets Information Theory: A Unifying View


Experimental Results

Bagging and Random Forest are used to construct a pool of 50 decision trees (N = 50).

Each selection criterion is evaluated with K = 40 (20% pruned), 30 (40%), 20 (60%), and 10 (80%).

11 data sets from the UCI machine learning repository

Results are averaged over 5 runs of 10-fold cross-validation.

normalized_test_acc = (pruned_ens_acc − single_tree_acc) / (unpruned_ens_acc − single_tree_acc)
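For illustration only (the accuracy values below are hypothetical, not taken from the experiments):

    # toy example of the normalization above
    single_tree_acc  = 0.80
    unpruned_ens_acc = 0.90
    pruned_ens_acc   = 0.88
    normalized_test_acc = (pruned_ens_acc - single_tree_acc) / (unpruned_ens_acc - single_tree_acc)
    print(normalized_test_acc)  # 0.8: the pruned ensemble keeps 80% of the full ensemble's gain over a single tree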

id    name                   Classes   Examples   Discrete Features   Continuous Features
d1    anneal                 6         898        32                  6
d2    autos                  7         205        10                  16
d3    wisconsin-breast       2         699        0                   9
d4    bupa liver disorders   2         345        0                   6
d5    german-credit          2         1000       13                  7
d6    pima-diabetes          2         768        0                   8
d7    glass                  7         214        0                   9
d8    cleveland-heart        2         303        7                   6
d9    hepatitis              2         155        13                  6
d10   ionosphere             2         351        0                   34
d11   vehicle                4         846        0                   18


Page 17: When Classifier Selection meets Information Theory: A Unifying View


Results

Figure: Comparison of the normalized test accuracy of the ensemble of C4.5 decision trees constructed by Bagging.


Page 18: When Classifier Selection meets Information Theory: A Unifying View


Results (cont’d)

Figure: Comparison of the normalized test accuracy of the ensemble of random trees constructed by Random Forest.

Page 19: When Classifier Selection meets Information Theory: A Unifying View


Conclusion

This paper examined the issue of classifier selection from an information-theoretic viewpoint. The main advantage of information-theoretic criteria is that they capture higher-order statistics of the data.

The ensemble mutual information is decomposed into accuracy and diversity components.

Although diversity is represented by both low- and high-order terms, we keep only the first-order (pairwise) terms in this paper. In a further study, we will examine the influence of including the higher-order terms on pruning performance.

In this paper, we selected some points within the continuous space of possible selection criteria that correspond to well-known feature selection criteria, such as mRMR, CIFE, JMI and CMIM, and used them for classifier selection. In future work, we will explore other points in this space that may lead to more effective pruning.

We plan to extend the algorithm to pruning ensembles of regression estimators.


Page 20: When Classifier Selection meets Information Theory: A Unifying View


Thanks for your attention

Questions?
