Cluster and Select Approach to Classifier Fusion

Eugeniusz Gatnar

Institute of Statistics, Katowice University of Economics, Bogucicka 14, 40-226 Katowice, Poland; [email protected]

Abstract. The key issue in classifier fusion is diversity of the component models. In order to obtain the most diverse candidate models we generate a large number of classifiers and divide the set into K disjoint subsets. Classifiers with similar outputs are in the same cluster and classifiers with different predicted class labels are assigned to different clusters. In the next step one member of each cluster is selected, e.g. the one that exhibits the minimum average distance from the cluster center. Finally, the selected classifiers are combined using majority voting.

Results from several experiments have shown that the candidate classifiers are diverse and their fusion improves classification accuracy.

1 Introduction

Tumer and Ghosh (1996) have shown that the key issue in classifier fusion is diversity of the component models, i.e. the error of the fused classifier decreases with the reduction in correlation between the base classifiers.

In order to obtain the most diverse candidate models we have proposed the “cluster and select” method, a variant of the so-called “overproduce and choose” general framework.

In this method we have generated a large number of classifiers and clustered them into disjoint subsets. Classifiers with similar outputs were in the same cluster, and classifiers with different predicted class labels were assigned to different clusters. Then, one member of each cluster has been selected, e.g. the one with the highest accuracy. Finally, the selected classifiers were combined by majority voting.

In this paper we compare the performance of six measures of dissimilarity between the base classifiers and four clustering methods used in the proposed approach.

The paper is organised as follows. In Section 2 we review existing work within the “overproduce and choose” framework and propose our “cluster and select” approach. Section 3 gives a short description of six measures of diversity between pairs of base classifiers. In Section 4 we recall the four hierarchical clustering methods used in our experiments and the silhouette index applied to find the optimal number of clusters. The selection phase is explained in Section 5. Section 6 gives a brief description of our experiments and the obtained results. The last section contains a short summary.

2 Cluster and select approach

Direct creation of accurate and diverse classifiers is a very difficult task. Therefore, Partridge and Yates (1996) proposed a method based on the “overproduce and choose” general framework. Also, Sharkey et al. (2000) developed a method that follows this framework.

According to the “overproduce and choose” framework, an initial large set of candidate classifiers is created on the basis of a training set. Then a subset of the most error-independent classifiers is selected. Partridge and Yates (1996) also introduced the measure (6) from Section 3 to guide the selection of the most diverse classifiers.

Giacinto et al. (2000) proposed to use the measure (5) from Section 3 as the distance measure, together with the complete linkage clustering method. They used neural networks and k-nearest neighbors as the component classifiers. Kuncheva (2000) developed a simple clustering and selection algorithm to combine classifier outputs. She divided the training set into clusters using the k-means method and found the most accurate classifier for each cluster. Then the selected classifier was nominated to label the observations in the Voronoi cell of the cluster centroid. The proposed method performed slightly better than the majority vote and the other combining methods.

Giacinto and Roli (2001) proposed a hierarchical method for ensemble creation. At the first step, each of the M base classifiers is a cluster. Next, the two least diverse classifiers (clusters) are joined into a cluster, and the more accurate of them is selected as the cluster representative. Then the representatives of all clusters form an ensemble. The procedure is repeated until all the classifiers are joined together.

They produced ensembles of 1, 2, . . . , M − 1 classifiers, and the final ensemble size is the size of the one with the lowest classification error estimated on a test set.

In our method, a set of M candidate classifiers is divided into K disjoint subsets {L1, L2, . . . , LK} using a clustering method. Then a member of each cluster is selected, e.g. the one with the highest accuracy or the one that exhibits the maximum average distance from all other cluster centers. Finally, the selected classifiers are combined using majority voting.

The algorithm of the proposed method is as follows:

1. Produce M candidate classifiers using the training set.


2. Build the dissimilarity matrix D = [dij] that contains the pairwise diversities between classifiers.

3. Divide the classifiers into K ∈ {2, . . . , M − 1} clusters using a clustering method and find the optimal value of K.

4. Select one classifier Ck from each cluster.
5. Combine the K selected classifiers using majority voting:

C^{*}(x) = \arg\max_{y \in Y} \left\{ \sum_{k=1}^{K} I(C_k(x) = y) \right\}.   (1)
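The five steps above can be sketched in Python (an illustrative sketch, not the paper's original implementation, which used R; here pairwise disagreement plays the role of the dissimilarity and complete linkage the role of the clustering method, and all names are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_and_select(preds, y, K):
    """Steps 1-5 of the cluster-and-select algorithm.

    preds : (M, N) array of predicted class labels, one row per candidate
    y     : (N,) array of true class labels
    K     : number of clusters (= final ensemble size)
    """
    M, N = preds.shape
    # Step 2: dissimilarity matrix D = [d_ij] (here: pairwise disagreement)
    D = np.array([[np.mean(preds[i] != preds[j]) for j in range(M)]
                  for i in range(M)])
    # Step 3: hierarchical (complete linkage) clustering into K clusters
    Z = linkage(squareform(D, checks=False), method="complete")
    labels = fcluster(Z, t=K, criterion="maxclust")
    # Step 4: from each cluster select the most accurate classifier
    acc = (preds == y).mean(axis=1)
    chosen = [int(max(np.flatnonzero(labels == k), key=lambda i: acc[i]))
              for k in range(1, K + 1)]
    # Step 5: fuse the selected classifiers by majority voting (eq. 1)
    votes = preds[chosen]
    fused = np.array([np.bincount(votes[:, n]).argmax() for n in range(N)])
    return chosen, fused
```

Any of the diversity measures of Section 3 can be substituted for the disagreement used in step 2, and any of the clustering methods of Section 4 for the complete linkage in step 3.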

Considering the above algorithm, some questions arise: how should the dissimilarity matrix D be built, i.e. what measure should be used as the dissimilarity measure between classifiers? Which clustering method should be applied to the candidate classifiers? How can the optimal number of clusters, i.e. the number of ensemble members, be detected? In order to answer these questions, we performed several comparisons, described in Section 6.

3 Pairwise diversity measures

In the “cluster” phase of the algorithm the candidate classifiers are clustered on the basis of a dissimilarity or distance matrix D = [dij] between all pairs of them. In order to find the most appropriate dissimilarity measure, we have examined six different pairwise diversity measures1, presented in Kuncheva and Whitaker (2003) and Gatnar (2005).

Let [C(x1), C(x2), . . . , C(xN)] be the vector of predictions of the classifier C for the set of examples V = {(x1, y1), (x2, y2), . . . , (xN, yN)}. The relationship between a pair of classifiers Ci and Cj can be shown in the form of a 2 × 2 contingency table (Table 1). In order to use this table for any number of classes, the “oracle” labels are applied. We define the “oracle” output (Ri) of the classifier Ci as:

R_i(x_n) = \begin{cases} 1 & \text{if } C_i(x_n) = y_n \\ 0 & \text{if } C_i(x_n) \neq y_n \end{cases}   (2)

In other words, the value Ri = 1 means that the classifier Ci is correct, i.e. it recognizes the true class (yn) of the example xn, and Ri = 0 means that the classifier is wrong.
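A small helper illustrating how the counts a, b, c, d of Table 1 follow from the oracle outputs (a sketch; the function name is illustrative):

```python
import numpy as np

def oracle_table(pred_i, pred_j, y):
    """2x2 contingency table of oracle outputs (eq. 2) for two classifiers.

    Returns (a, b, c, d): a = both correct, b = only classifier i correct,
    c = only classifier j correct, d = both wrong (cells as in Table 1).
    """
    ri = np.asarray(pred_i) == np.asarray(y)   # R_i(x_n) = 1 where i is correct
    rj = np.asarray(pred_j) == np.asarray(y)
    a = int(np.sum(ri & rj))
    b = int(np.sum(ri & ~rj))
    c = int(np.sum(~ri & rj))
    d = int(np.sum(~ri & ~rj))
    return a, b, c, d
```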

The binary version of Pearson's correlation coefficient:

d_{ij} = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}.   (3)

1 In fact, some of them are similarity measures and have been transformed into dissimilarity ones.


Table 1. A 2 × 2 contingency table for the “oracle” outputs.

Classifiers   Rj = 1   Rj = 0
Ri = 1           a        b
Ri = 0           c        d

can be used, after transformation, as a simple diversity measure. Its values range from −1 to 1, with 0 indicating independence of the two classifiers.

Kuncheva et al. (2000) proposed the Yule's Q statistic to evaluate the diversity of all possible pairs of component classifiers. The Yule's Q statistic is the original measure of dichotomous agreement, designed to be analogous to Pearson's correlation:

d_{ij} = \frac{ad - bc}{ad + bc}.   (4)

This measure is pairwise and symmetric, and varies between −1 and 1. A value of 0 indicates statistical independence of the classifiers, positive values mean that the classifiers have recognized the same examples correctly, and negative values mean that the classifiers commit errors on different examples.

Giacinto and Roli (2000) have introduced a measure based on the compound error probability for the two classifiers, named compound diversity:

d_{ij} = \frac{d}{a + b + c + d}.   (5)

This measure is also named the “double-fault measure” because it is the proportion of the examples that have been misclassified by both classifiers. Partridge and Yates (1996) and Margineantu and Dietterich (1997) have used a measure named within-set generalization diversity. This measure is simply the κ statistic, and it measures the level of agreement between two classifiers with the correction for chance. Its pairwise version is calculated as:

d_{ij} = \frac{2(ad - bc)}{(a+b)(c+d) + (a+c)(b+d)}.   (6)

Skalak (1996) reported the use of the disagreement measure to characterize the diversity between base classifiers:

d_{ij} = \frac{b + c}{a + b + c + d}.   (7)

This is the ratio of the number of examples on which one classifier is correct and the other is wrong to the total number of examples. Gatnar (2005) has proposed a diversity measure based on Hamann's coefficient, which is simply the difference between the matches and mismatches as a proportion of the total number of entries:


d_{ij} = \frac{(a + d) - (b + c)}{a + b + c + d}.   (8)

It ranges from −1 to 1. A value of 0 indicates an equal number of matches and mismatches, −1 represents perfect disagreement, and 1 perfect agreement.
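For concreteness, the six measures (3)–(8) can be computed directly from the counts a, b, c, d of Table 1 (an illustrative sketch; the κ numerator is written here as 2(ad − bc), the standard pairwise κ for this cell labelling):

```python
import math

# Pairwise diversity measures from the oracle contingency counts a, b, c, d.
def pearson(a, b, c, d):       # eq. (3): binary Pearson correlation
    return (a*d - b*c) / math.sqrt((a+b) * (c+d) * (a+c) * (b+d))

def yule_q(a, b, c, d):        # eq. (4): Yule's Q statistic
    return (a*d - b*c) / (a*d + b*c)

def double_fault(a, b, c, d):  # eq. (5): compound diversity / double fault
    return d / (a + b + c + d)

def kappa(a, b, c, d):         # eq. (6): pairwise kappa statistic
    return 2 * (a*d - b*c) / ((a+b) * (c+d) + (a+c) * (b+d))

def disagreement(a, b, c, d):  # eq. (7): disagreement measure
    return (b + c) / (a + b + c + d)

def hamann(a, b, c, d):        # eq. (8): Hamann's coefficient
    return ((a + d) - (b + c)) / (a + b + c + d)
```

As footnote 1 notes, several of these are similarity-type coefficients and must be transformed (e.g. negated or subtracted from 1) before being used as entries of the dissimilarity matrix D.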

4 Clustering methods

Having a set of M candidate classifiers {C1, C2, . . . , CM}, we divide them into K clusters {L1, L2, . . . , LK} on the basis of a dissimilarity matrix D. Classifiers with similar errors are in the same cluster, and classifiers with different errors are assigned to different clusters. In our experiments we have applied four hierarchical clustering methods that use the dissimilarity matrix D:

• single linkage (nearest neighbor method), where we treat the smallest dissimilarity between an observation in the first cluster and an observation in the second cluster as the distance between two clusters,

• complete linkage (furthest neighbor method), where we use the largest dissimilarity between a point in the first cluster and a point in the second cluster as the distance between the two clusters,

• average method, where the distance between two clusters is the average of the dissimilarities between the observations in one cluster and the observations in the other cluster,

• Ward's method, where the dissimilarity between two clusters is the Euclidean distance between their centroids.
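All four methods can be driven by the same precomputed dissimilarity matrix, e.g. with scipy (an illustrative sketch; note that scipy's "ward" linkage assumes Euclidean distances, so applying it to an arbitrary dissimilarity matrix is only an approximation):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Illustrative 4x4 dissimilarity matrix D = [d_ij] for four classifiers:
# classifiers 0 and 1 behave similarly, as do classifiers 2 and 3.
D = np.array([[0.0, 0.1, 0.8, 0.9],
              [0.1, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.2],
              [0.9, 0.8, 0.2, 0.0]])
condensed = squareform(D)  # scipy expects the condensed (upper-triangle) form

for method in ("single", "complete", "average", "ward"):
    Z = linkage(condensed, method=method)            # hierarchical merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into K = 2 clusters
    print(method, labels)  # 0,1 and 2,3 end up in different clusters
```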

In order to determine the optimal number of clusters K we have used the silhouette index developed by Rousseeuw (1987) and implemented in the cluster package described in Kaufman and Rousseeuw (1990).

For each observation xi, the silhouette width s(xi), which ranges from −1 to 1, is computed. Observations with a high value of s(xi) are well clustered, while a small value of s(xi) means that the observation xi lies between two clusters. Observations with a negative s(xi) belong to the wrong cluster. We find the optimal clustering as the one that maximizes the sum of silhouette widths over all observations in the learning set:

\max \left\{ \sum_{i=1}^{N} s(x_i) \right\}.   (9)
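This criterion can be transcribed directly (an illustrative sketch with hypothetical names; singleton clusters are given s(xi) = 0, following the usual convention):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def silhouette_sum(D, labels):
    """Sum of silhouette widths s(x_i) for a precomputed dissimilarity
    matrix D and a vector of cluster labels."""
    total = 0.0
    for i in range(len(D)):
        own = (labels == labels[i])
        own[i] = False                    # exclude x_i itself
        if not own.any():                 # singleton cluster: s(x_i) = 0
            continue
        a = D[i, own].mean()              # mean distance to own cluster
        b = min(D[i, labels == k].mean()  # mean distance to nearest other cluster
                for k in set(labels) if k != labels[i])
        total += (b - a) / max(a, b)
    return total

def best_k(D, method="complete"):
    """Choose K in {2, ..., M-1} by maximizing the silhouette sum (eq. 9)."""
    Z = linkage(squareform(D, checks=False), method=method)
    scores = {k: silhouette_sum(D, fcluster(Z, t=k, criterion="maxclust"))
              for k in range(2, len(D))}
    return max(scores, key=scores.get)
```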

5 Selecting a representative of the cluster

Having an optimal clustering {L1, L2, . . . , LK}, we choose one classifier Ck from each cluster Lk. Several different selection strategies can be considered in the “select” phase, but in our experiments we selected the classifier that has the lowest classification error estimated on the appropriate test set.
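This selection rule amounts to one comparison per cluster (an illustrative sketch; names are hypothetical):

```python
import numpy as np

def select_representatives(labels, errors):
    """Pick, for each cluster, the classifier with the lowest test error.

    labels : (M,) cluster assignment of each candidate classifier
    errors : (M,) classification error of each classifier on the test set
    """
    labels = np.asarray(labels)
    errors = np.asarray(errors)
    return [int(min(np.flatnonzero(labels == k), key=lambda i: errors[i]))
            for k in np.unique(labels)]
```

Other strategies mentioned in Section 2, such as taking the classifier closest to (or farthest from) the cluster centers, would only change the key used in the comparison.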


6 Results of experiments

In order to find the optimal diversity measure and clustering method for the cluster and select approach, we have applied it to 9 benchmark datasets from the UCI Machine Learning Repository (Blake et al. (1998)).

Some of them were already divided into learning and test parts, but in some cases we divided them randomly.

For each dataset we have generated 14 sets of candidate classifiers of different sizes: M = 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200, 250, 300, and we used classification trees2 as the base models.

For example, average classification errors for different diversity measures and different clustering methods for the DNA dataset are shown in Figures 1 and 2.

[Figure: plot of classification error against ensemble size (10–300) for the six diversity measures: Pearson, Hamann, Yule, DF, Kappa, Disagreement.]

Fig. 1. Average classification error for different diversity measures for the DNA dataset

We have also computed the number of lowest-error combinations of diversity measure and clustering method for each ensemble size. For example, Table 2 shows the distribution of winning combinations for the DNA dataset.

7 Summary

In this paper we have proposed a modification of the cluster-and-select approach to classifier fusion. In order to find the most appropriate dissimilarity measure and clustering method, we performed several comparisons.

2 In order to grow trees, we have used the Rpart procedure written by Therneau and Atkinson (1997) for the R environment.


[Figure: plot of classification error against ensemble size (10–300) for the four clustering methods: average, single, complete, ward.]

Fig. 2. Average classification error for different clustering methods for the DNA dataset

Table 2. Winning combinations for the DNA dataset

              average  single  complete  ward
Pearson          1       0        0        0
Hamann           0       0        1        1
Yule             0       0        0        0
Double fault     0       0        5        4
Kappa            3       1        0        3
Disagreement     0       0        1        0

Comparing the different diversity measures, we have observed that using the double fault measure or the Kappa statistic leads to the most accurate ensembles, but Hamann's coefficient also works quite well. Comparing the clustering methods, we conclude that complete linkage and Ward's method outperformed the other clustering methods.

References

BLAKE, C., KEOGH, E. and MERZ, C.J. (1998): UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California, Irvine.

GATNAR, E. (2005): A Diversity Measure for Tree-Based Classifier Ensembles. In: D. Baier, R. Decker and L. Schmidt-Thieme (Eds.): Data Analysis and Decision Support. Springer, Heidelberg.

GIACINTO, G. and ROLI, F. (2001): Design of Effective Neural Network Ensembles for Image Classification Processes. Image and Vision Computing, 19, 699–707.

GIACINTO, G., ROLI, F. and FUMERA, G. (2000): Design of Effective Multiple Classifier Systems by Clustering of Classifiers. Proc. of the Int. Conference on Pattern Recognition, ICPR'00, IEEE.

HANSEN, L.K. and SALAMON, P. (1990): Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 993–1001.

KAUFMAN, L. and ROUSSEEUW, P.J. (1990): Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

KUNCHEVA, L. and WHITAKER, C. (2003): Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Machine Learning 51, 181–207.

KUNCHEVA, L. (2000): Clustering-and-Selection Model for Classifier Combination. Proceedings of the 15th International Conference on Pattern Recognition, Barcelona, Spain.

KUNCHEVA, L., WHITAKER, C., SHIPP, D. and DUIN, R. (2000): Is Independence Good for Combining Classifiers? In: J. Kittler and F. Roli (Eds.): Proceedings of the First International Workshop on Multiple Classifier Systems. LNCS 1857, Springer, Berlin.

MARGINEANTU, D.D. and DIETTERICH, T.G. (1997): Pruning Adaptive Boosting. Proceedings of the 14th International Conference on Machine Learning, Morgan Kaufmann, San Mateo.

PARTRIDGE, D. and YATES, W.B. (1996): Engineering Multiversion Neural-net Systems. Neural Computation 8, 869–893.

ROUSSEEUW, P.J. (1987): Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 20, 53–65.

SHARKEY, A.J.C., SHARKEY, N.E., GERECKE, U. and CHANDROTH, G. (2000): The Test and Select Approach to Ensemble Combination. In: J. Kittler and F. Roli (Eds.): Proceedings of the First International Workshop on Multiple Classifier Systems. LNCS 1857, Springer, Berlin.

SKALAK, D.B. (1996): The Sources of Increased Accuracy for Two Proposed Boosting Algorithms. Proceedings of the American Association for Artificial Intelligence AAAI-96, Morgan Kaufmann, San Mateo.

THERNEAU, T.M. and ATKINSON, E.J. (1997): An Introduction to Recursive Partitioning Using the RPART Routines. Mayo Foundation, Rochester.

TUMER, K. and GHOSH, J. (1996): Analysis of Decision Boundaries in Linearly Combined Neural Classifiers. Pattern Recognition 29, 341–348.