
Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications (SUEMA'2007)

Workshop proceedings edited by

Oleg Okun (University of Oulu, Finland) and Giorgio Valentini (Università degli Studi di Milano, Italy)

June 4, 2007, Girona, Spain


Workshop on Supervised and Unsupervised Ensemble Methods and Their Applications (SUEMA'2007)

in conjunction with the 3rd Iberian Conference on Pattern Recognition and Image Analysis, June 6-8, 2007, Girona, Spain

http://ibpria2007.udg.cat/workshops.htm

http://suema07.dsi.unimi.it/

Workshop Chairs:

Oleg Okun
Machine Vision Group
Department of Electrical and Information Engineering
P.O. Box 4500, 90014 University of Oulu, FINLAND

&

Giorgio Valentini
Dipartimento di Scienze dell'Informazione
Università degli Studi di Milano
Via Comelico 39, 20135 Milan, ITALY


Workshop Program Committee:

Carlotta Domeniconi (George Mason University, USA)

Robert Duin (Delft University of Technology, the Netherlands)

Mark Embrechts (Rensselaer Polytechnic Institute, USA)

Ana Fred (Technical University of Lisbon, Portugal)

João Gama (University of Porto, Portugal)

Giorgio Giacinto (University of Cagliari, Italy)

Larry Hall (University of South Florida, USA)

Ludmila Kuncheva (University of Wales, UK)

Francesco Masulli (University of Genova, Italy)

Mykola Pechenizkiy (Eindhoven University of Technology, the Netherlands)

Petia Radeva (Autonomous University of Barcelona, Spain)

Juan José Rodríguez (University of Burgos, Spain)

Fabio Roli (University of Cagliari, Italy)

Paolo Rosso (University of Valencia, Spain)

Carlo Sansone (University of Naples, Italy)

José Salvador Sánchez (University Jaume I, Spain)

Johan Schubert (FOI, Sweden)

Jordi Vitrià (Autonomous University of Barcelona, Spain)

Terry Windeatt (University of Surrey, UK)


Clustering Ensembles for Categorical Data

Muna Al-Razgan, Carlotta Domeniconi, and Daniel Barbara

Department of Computer Science, George Mason University

Fairfax, Virginia 22030, USA

Abstract. Cluster ensembles offer a solution to challenges inherent to clustering arising from its ill-posed nature. In this paper we focus on the design of ensembles for categorical data. Our approach leverages diverse input clusterings discovered in random subspaces. We experimentally demonstrate the efficacy of our technique in combination with the categorical clustering algorithm COOLCAT.

1 Introduction

Recently, cluster ensembles have emerged as a technique for overcoming problems with clustering algorithms. It is well known that clustering methods may discover different patterns in a given set of data. This is because each clustering algorithm has its own bias resulting from the optimization of different criteria. Furthermore, there is no ground truth against which the clustering result can be validated. Thus, no cross-validation technique can be carried out to tune input parameters involved in the clustering process. As a consequence, the user is equipped with no guidelines for choosing the proper clustering method for a given dataset.

An orthogonal issue related to clustering is high dimensionality. High dimensional data pose a difficult challenge to the clustering process. Various clustering algorithms can handle data with low dimensionality, but as the dimensionality of the data increases, these algorithms tend to break down.

A cluster ensemble consists of different partitions. Such partitions can be obtained from multiple applications of any single algorithm with different initializations, or from the application of different algorithms to the same dataset. Cluster ensembles offer a solution to challenges inherent to clustering arising from its ill-posed nature: they can provide more robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging out emergent spurious structures that arise due to the various biases to which each participating algorithm is tuned.

To enhance the quality of clustering results, clustering ensembles have been explored. Nevertheless, clustering ensembles for categorical data have not received much attention in the literature. In this paper we focus on the design of ensembles for categorical data. Our techniques can be used in combination with any categorical clustering approach. Here we apply our method to the partitions provided by the COOLCAT algorithm [4].


2 Related Work

We briefly describe relevant work in the literature on categorical clustering and cluster ensembles.

Clustering of categorical data has recently attracted the attention of many researchers. The k-modes algorithm [14] is an extension of k-means for categorical features. To update the modes during the clustering process, the authors used a new distance measure based on the number of mismatches between two points. Squeezer [23] is a categorical clustering algorithm that processes one point at a time. At each step, a point is either placed in an existing cluster or it is rejected by all clusters and creates a new one. The decision is based on a given similarity function. ROCK (Robust Clustering using links) [9] is a hierarchical clustering algorithm for categorical data. It uses the Jaccard coefficient to compute the distance between points. Two points are considered neighbors if their Jaccard similarity exceeds a certain threshold. A link between two points is computed by considering the number of common neighbors. An agglomerative approach is then used to construct the hierarchy.

To enhance the quality of clustering results, clustering ensemble approaches have been explored. A cluster ensemble technique is characterized by two components: the mechanism to generate diverse partitions, and the consensus function to combine the input partitions into a final clustering. One popular methodology utilizes a co-association matrix as a consensus function [7, 18]. Graph-based partitioning algorithms have been used with success to generate the combined clustering [22, 13, 3].

Clustering ensembles for categorical data have not received much attention in the literature. The approach in [10] generates a partition for each categorical attribute, so that points in each cluster share the same value for that attribute. The resulting clusterings are combined using the consensus functions presented in [22]. The work in [11] constructs cluster ensembles for data with mixed numerical and categorical features.

3 The COOLCAT algorithm

In this Section, we briefly present the clustering algorithm COOLCAT [4]. This algorithm has been proven effective for clustering categorical data. Thus, we use the clusterings provided by COOLCAT as components of our ensembles to further improve the grouping of data.

The COOLCAT algorithm [4] is a scalable clustering algorithm that discovers clusters with minimal entropy in categorical data. COOLCAT uses categorical, rather than numerical, attributes, enabling the mining of real-world datasets offered by fields such as psychology and statistics. The algorithm is based on the idea that a cluster containing similar points has an entropy smaller than a cluster of dissimilar points. Thus, COOLCAT uses entropy to define the criterion for grouping similar objects.


Formally, the entropy measures the uncertainty associated with a random variable. Let X be a random variable with values in S(X), and let p(x) be the corresponding probability function of X. The entropy of X is defined as follows:

H(X) = -\sum_{x \in S(X)} p(x) \log(p(x))

The entropy of a multivariate vector X = (X_1, X_2, \ldots, X_n) is defined as:

H(X) = -\sum_{x_1 \in S(X_1)} \cdots \sum_{x_n \in S(X_n)} p(x_1, \ldots, x_n) \log p(x_1, \ldots, x_n)

To minimize the entropy associated with clusters, COOLCAT proceeds as follows. Given n points x_1, x_2, \ldots, x_n, where each point is represented as a vector of d categorical values, x_i = (x_i^1, \ldots, x_i^d), COOLCAT partitions the points into k clusters so that the entropy of the clustering is minimized. Let C = \{C_1, \ldots, C_k\} represent the clustering. Then, the entropy associated with C is:

H(C) = \sum_{j=1}^{k} \frac{|C_j|}{n} H(C_j)

where H(Cj) is the entropy of cluster Cj :

H(C_j) = -\sum_{x_i^1 \in S(X_1)} \cdots \sum_{x_i^d \in S(X_d)} P(x_i^1, \ldots, x_i^d \mid C_j) \log(P(x_i^1, \ldots, x_i^d \mid C_j))

COOLCAT uses a heuristic to incrementally build clusters based on the entropy criterion. It consists of two main phases: an initialization step and an incremental step.

During the initialization phase, COOLCAT bootstraps the algorithm by selecting a sample of points. Out of this sample, it selects the two points that have the maximum pairwise entropy, and so are most dissimilar. COOLCAT places these points in two different clusters. It then proceeds incrementally: at each step, it selects the point that maximizes the minimum pairwise entropy with the previously chosen points. At the end, the k selected points are the initial seeds of the clusters. During the incremental phase, COOLCAT constructs the k clusters. It processes the data in batches. For each data point, it computes the entropy resulting from placing the point in each cluster, and then assigns the point to the cluster that gives the minimum entropy.
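To make the assignment step concrete, the following is a minimal Python sketch of the entropy criterion, assuming attributes are treated independently when computing a cluster's entropy (the sum of per-attribute entropies). It recomputes entropies from scratch for exposition, whereas COOLCAT maintains them incrementally; the function names and the representation of points as tuples of categorical values are ours, not from [4].

```python
from collections import Counter
from math import log

def cluster_entropy(points):
    # Entropy of a cluster under the attribute-independence assumption:
    # sum over attributes of -sum_v p(v) log p(v).
    if not points:
        return 0.0
    n, h = len(points), 0.0
    for j in range(len(points[0])):
        for count in Counter(p[j] for p in points).values():
            h -= (count / n) * log(count / n)
    return h

def assign_point(x, clusters):
    # Place x in the cluster that yields the smallest expected entropy
    # H(C) = sum_j (|C_j|/n) H(C_j) after the assignment.
    def expected_entropy(partition):
        n = sum(len(c) for c in partition)
        return sum(len(c) / n * cluster_entropy(c) for c in partition)
    best, best_h = 0, float("inf")
    for i in range(len(clusters)):
        trial = [c + [x] if j == i else c for j, c in enumerate(clusters)]
        h = expected_entropy(trial)
        if h < best_h:
            best, best_h = i, h
    clusters[best].append(x)
    return best
```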

The final clustering depends on the order in which points are assigned to clusters, thus there is a danger of obtaining a poor-quality result. To circumvent this problem, the authors of COOLCAT propose a reprocessing step. After clustering a batch of points, a fraction of the set is reconsidered for clustering, where the size of the fraction is an input parameter. The fraction of points that least fit the corresponding clusters is reassigned to more fitting clusters. To assess which points least match their clusters, COOLCAT counts the number of


occurrences of each point's attribute values in the cluster to which it is assigned. This number of occurrences is then converted into a probability value by dividing it by the cluster size. The point with the lowest probability for each cluster is then removed and reprocessed according to the entropy criterion, as before. By performing this assessment at the conclusion of each incremental step, COOLCAT alleviates the risk imposed by the order in which points are input.

COOLCAT requires four input parameters: the number of clusters k, the sample size used in the initialization step, the buffer size, and the number of points considered for reprocessing.

While COOLCAT is an effective method for clustering categorical data, it still suffers from limitations inherent to clustering algorithms. The solution identified by COOLCAT depends on the initial clusters' seeding. The greedy sequential strategy used by the algorithm affects the result as well (although the reprocessing step alleviates in part this problem). Furthermore, the sparsity of the data in high dimensional spaces can severely compromise the ability to discover meaningful clustering solutions.

In the following we address these issues by constructing ensembles of clusterings resulting from multiple runs of COOLCAT in random feature subspaces.

4 Clustering Ensemble Techniques

Consider a set S = \{x_1, x_2, \ldots, x_n\} of n points. A clustering ensemble is a collection of m clustering solutions: C = \{C_1, C_2, \ldots, C_m\}. Each clustering solution C_L, for L = 1, \ldots, m, is a partition of the set S, i.e. C_L = \{C_L^1, C_L^2, \ldots, C_L^{K_L}\}, where \cup_K C_L^K = S. Given a collection of clustering solutions C and the desired number of clusters k, the objective is to combine the different clustering solutions and compute a new partition of S into k disjoint clusters.

The challenge in cluster ensembles is the design of a proper consensus function that combines the component clustering solutions into an "improved" final clustering. In our ensemble techniques we reduce the problem of defining a consensus function to a graph partitioning problem. This approach has shown good results in the literature [5, 22, 6].

In the next sections, we introduce our two consensus functions for categorical features using the clustering results produced by COOLCAT.

4.1 Motivation

To enhance the accuracy of the clustering given by COOLCAT, we construct an ensemble of clusterings obtained by multiple runs of the algorithm. A good accuracy-diversity trade-off must be achieved to obtain a consensus solution that is superior to the components. To improve the diversity among the ensemble components, each run of COOLCAT operates within a random subspace of the feature space, obtained by random sampling a fixed number of attributes from the set of given ones. Thus, diversity is guaranteed by providing the components


different views (or projections) of the data. Since such views are generated randomly from a (typically) large pool of attributes, it is highly likely that each component receives a different perspective of the data, which leads to the discovery of diverse (and complementary) structures within the data.

The rationale behind our approach finds its justification in classifier ensembles, and in the theory of Stochastic Discrimination [16, 17]. The advantages of a random subspace method, in fact, are well known in the context of ensembles of classifiers [12, 21].

Furthermore, we observe that performing clustering in random subspaces should be advantageous when data present redundant features, and/or the discrimination power is spread over many features, which is often the case in real life scenarios. Under these conditions, in fact, redundant/noisy features are less likely to appear in random subspaces. Moreover, since discriminant information is distributed across several features, we can generate multiple meaningful (for cluster discrimination) subspaces. This is in agreement with the assumptions made by the theory of stochastic discrimination [17] for building effective ensembles of classifiers; that is, there exist multiple sets of features able to discern between training data in different classes, and unable to discern training and testing data in the same class.

4.2 Categorical Similarity Partitioning Algorithm (CSPA)

Our aim here is to generate robust and stable solutions via a consensus clustering method. We can generate contributing clusterings by running the COOLCAT algorithm multiple times within random subspaces. Thus, each ensemble component has access to a random sample of f features drawn from the original d dimensional feature space. The objective is then to find a consensus partition from the output partitions of the contributing clusterings, so that an "improved" overall clustering of the data is obtained.

In order to derive our consensus function, for each data point x_i and each cluster C_l, we want to define the probability associated with cluster C_l given that we have observed x_i. Such a probability value must conform to the information provided by a given component clustering of the ensemble. The consensus function will then aggregate the findings of each clustering component utilizing such probabilities.

COOLCAT partitions the data into k distinct clusters. In order to compute distances between data points and clusters, we represent clusters using modes. The mode of a cluster is the vector of the most frequent attribute values in the given cluster. In particular, when different values for an attribute have the same frequency of occurrence, we consider the whole data set, and choose the value that has the least overall frequency. Ties are broken randomly.

We then compute the distance between a point x_i and a cluster C_l by considering the Jaccard distance [19] between x_i and the mode c_l of cluster C_l, defined as follows:

d_{il} = 1 - \frac{|x_i \cap c_l|}{|x_i \cup c_l|}    (1)


where |x_i \cap c_l| represents the number of matching attribute values in the two vectors, and |x_i \cup c_l| is the number of distinct attribute values in the two vectors.

Let D_i = \max_l\{d_{il}\} be the largest distance of x_i from any cluster. We want to define the probability associated with cluster C_l given that we have observed x_i. At a given point x_i, the cluster label C_l is assumed to be a random variable from a distribution with probabilities \{P(C_l|x_i)\}_{l=1}^{k}. We provide a nonparametric estimation of such probabilities based on the data and on the clustering result.

In order to embed the clustering result in our probability estimations, the smaller the distance d_{il} is, the larger the corresponding probability credited to C_l should be. Thus, we can define P(C_l|x_i) as follows:

P(C_l|x_i) = \frac{D_i - d_{il} + 1}{k D_i + k - \sum_l d_{il}}    (2)

where the denominator serves as a normalization factor to guarantee \sum_{l=1}^{k} P(C_l|x_i) = 1. We observe that \forall l = 1, \ldots, k and \forall i = 1, \ldots, n, P(C_l|x_i) > 0. In particular, the added value of 1 in (2) allows for a non-zero probability P(C_L|x_i) when L = \arg\max_l\{d_{il}\}. In this last case P(C_l|x_i) assumes its minimum value P(C_L|x_i) = 1/(k D_i + k - \sum_l d_{il}). For smaller distance values d_{il}, P(C_l|x_i) increases proportionally to the difference D_i - d_{il}: the larger the deviation of d_{il} from D_i, the larger the increase. As a consequence, the corresponding cluster C_l becomes more likely, as it is reasonable to expect based on the information provided by the clustering process. Thus, equation (2) provides a nonparametric estimation of the posterior probability associated with each cluster C_l.

We can now construct the vector P_i of posterior probabilities associated with x_i:

P_i = (P(C_1|x_i), P(C_2|x_i), \ldots, P(C_k|x_i))^t    (3)

where t denotes the transpose of a vector. The transformation x_i \rightarrow P_i maps the d dimensional data points x_i onto a new space of relative coordinates with respect to cluster centroids, where each dimension corresponds to one cluster. This new representation embeds information from both the original input data and the clustering result.

We then define the similarity between x_i and x_j as the cosine similarity between the corresponding probability vectors:

s(x_i, x_j) = \frac{P_i^t P_j}{\|P_i\| \|P_j\|}    (4)

We combine all pairwise similarities (4) into an (n \times n) similarity matrix S, where S_{ij} = s(x_i, x_j). We observe that, in general, each clustering may provide a different number of clusters, with different sizes and boundaries. The size of the similarity matrix S is independent of the clustering approach, thus providing a way to align the different clustering results onto the same space, with no need to solve a label correspondence problem.

After running the COOLCAT algorithm m times for different features we obtain the m similarity matrices S_1, S_2, \ldots, S_m. The combined similarity matrix


\Psi defines a consensus function that can guide the computation of a consensus partition:

\Psi = \frac{1}{m} \sum_{l=1}^{m} S_l    (5)

\Psi_{ij} reflects the average similarity between x_i and x_j (through P_i and P_j) across the m contributing clusterings.

We now map the problem of finding a consensus partition to a graph partitioning problem. We construct a complete graph G = (V, E), where |V| = n and the vertex V_i identifies x_i. The edge E_{ij} connecting the vertices V_i and V_j is assigned the weight value \Psi_{ij}. We run METIS [15] on the resulting graph to compute a k-way partitioning of the n vertices that minimizes the edge weight-cut^1. This gives the consensus clustering we seek. The size of the resulting graph partitioning problem is n^2. The steps of the algorithm, which we call CSPA (Categorical Similarity Partitioning Algorithm), are summarized in the following.

Input: n points x ∈ ℜ^d, number of features f, and number of clusters k.

1. Produce m subspaces by random sampling f features without replacement from the d dimensional original feature space.
2. Run COOLCAT m times, each time using a different sample of f features.
3. For each partition \nu = 1, \ldots, m:
   (a) Obtain the mode c_l for each cluster (l = 1, \ldots, k).
   (b) Compute the Jaccard distance d_{il}^\nu between each data point x_i and each mode c_l: d_{il}^\nu = 1 - |x_i \cap c_l| / |x_i \cup c_l|.
   (c) Set D_i^\nu = \max_l \{d_{il}^\nu\}.
   (d) Compute P(C_l^\nu | x_i) = (D_i^\nu - d_{il}^\nu + 1) / (k D_i^\nu + k - \sum_l d_{il}^\nu).
   (e) Set P_i^\nu = (P(C_1^\nu | x_i), P(C_2^\nu | x_i), \ldots, P(C_k^\nu | x_i))^t.
   (f) Compute the similarity s^\nu(x_i, x_j) = (P_i^\nu)^t P_j^\nu / (\|P_i^\nu\| \|P_j^\nu\|), \forall i, j.
   (g) Construct the matrix S^\nu where S_{ij}^\nu = s^\nu(x_i, x_j).
4. Build the consensus function \Psi = \frac{1}{m} \sum_{\nu=1}^{m} S^\nu.
5. Construct the complete graph G = (V, E), where |V| = n and V_i \equiv x_i. Assign \Psi_{ij} as the weight value of the edge E_{ij} connecting the vertices V_i and V_j.
6. Run METIS (or spectral clustering) on the resulting graph G.

Output: The resulting k-way partition of the n vertices.

1 In our experiments we also apply spectral clustering to compute a k-way partitioning of the n vertices.
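For reference, steps 3(f) through 4 can be implemented in a few lines of NumPy; this is a sketch under our own naming, taking the list of n × k posterior matrices produced by the m COOLCAT runs.

```python
import numpy as np

def consensus_similarity(P_list):
    # Each P in P_list is an (n x k) matrix of posteriors from one run.
    n = P_list[0].shape[0]
    psi = np.zeros((n, n))
    for P in P_list:
        U = P / np.linalg.norm(P, axis=1, keepdims=True)  # unit-norm rows
        psi += U @ U.T          # S^nu with S_ij = cosine(P_i, P_j), eq. (4)
    return psi / len(P_list)   # Psi = (1/m) sum_nu S^nu, eq. (5)
```

The resulting \Psi is then used as the edge-weight matrix of the complete graph handed to METIS or a spectral partitioner.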


4.3 Categorical Bipartite Partitioning Algorithm (CBPA)

Our second approach (CBPA) maps the problem of finding a consensus partition to a bipartite graph partitioning problem. This mapping was first introduced in [6]. In [6], however, 0/1 weight values are used. Here we extend the range of weight values to [0,1].

The graph in CBPA models both instances (e.g., data points) and clusters, and the graph edges can only connect an instance vertex to a cluster vertex, forming a bipartite graph. In detail, we proceed as follows for the construction of the graph. Suppose, again, we run the COOLCAT algorithm m times for different f random features. For each instance x_i, and for each clustering \nu = 1, \ldots, m, we can then compute the vector of posterior probabilities P_i^\nu, as defined in equations (3) and (2). Using the P vectors, we construct the following matrix A:

A = \begin{pmatrix} (P_1^1)^t & (P_1^2)^t & \ldots & (P_1^m)^t \\ (P_2^1)^t & (P_2^2)^t & \ldots & (P_2^m)^t \\ \vdots & \vdots & & \vdots \\ (P_n^1)^t & (P_n^2)^t & \ldots & (P_n^m)^t \end{pmatrix}    (6)

Note that the (P_i^\nu)^t are row vectors (t denotes the transpose). The dimensionality of A is therefore n \times km, under the assumption that each of the m clusterings produces k clusters. (We observe that the definition of A can be easily generalized to the case where each clustering may discover a different number of clusters.)

Based on A we can now define a bipartite graph to which our consensus partition problem maps. Consider the graph G = (V, E) with V and E constructed as follows. V = V^C \cup V^I, where V^C contains km vertices, each representing a cluster of the ensemble, and V^I contains n vertices, each representing an input data point. Thus |V| = km + n. The edge E_{ij} connecting the vertices V_i and V_j is assigned a weight value defined as follows. If the vertices V_i and V_j both represent clusters or both represent instances, then E(i, j) = 0; otherwise, if vertex V_i represents an instance x_i and vertex V_j represents a cluster C_j^\nu (or vice versa), then the corresponding entry of E is A(i, k(\nu - 1) + j).

Note that the dimensionality of E is (km + n) \times (km + n), and E can be written as follows:

E = \begin{pmatrix} 0 & A^t \\ A & 0 \end{pmatrix}

A partition of the bipartite graph G partitions the cluster vertices and the instance vertices simultaneously. The partition of the instances can then be output as the final clustering. Due to the special structure of the graph G (sparse graph), the size of the resulting bipartite graph partitioning problem is kmn. Assuming that km \ll n, this complexity is much smaller than the size n^2 of CSPA.

The steps of the algorithm, which we call CBPA (Categorical Bipartite Partitioning Algorithm), are summarized in the following.

Input: n points x ∈ ℜ^d, number of features f, and number of clusters k.


Table 1. Characteristics of the data

Dataset        k   d    n (points-per-class)
Archeological  2   8    20 (11-9)
Soybeans       4   21   47 (10-10-10-17)
Breast-cancer  2   9    478 (239-239)
Vote           2   16   435 (267-168)

1. Produce m subspaces by random sampling f features without replacement from the d dimensional original feature space.
2. Run COOLCAT m times, each time using a different sample of f features.
3. For each partition \nu = 1, \ldots, m:
   (a) Obtain the mode c_l for each cluster (l = 1, \ldots, k).
   (b) Compute the Jaccard distance d_{il}^\nu between each data point x_i and each mode c_l: d_{il}^\nu = 1 - |x_i \cap c_l| / |x_i \cup c_l|.
   (c) Set D_i^\nu = \max_l \{d_{il}^\nu\}.
   (d) Compute P(C_l^\nu | x_i) = (D_i^\nu - d_{il}^\nu + 1) / (k D_i^\nu + k - \sum_l d_{il}^\nu).
   (e) Set P_i^\nu = (P(C_1^\nu | x_i), P(C_2^\nu | x_i), \ldots, P(C_k^\nu | x_i))^t.
4. Construct the matrix A as in (6).
5. Construct the bipartite graph G = (V, E), where V = V^C \cup V^I, |V^I| = n and V_i^I \equiv x_i, |V^C| = km and V_j^C \equiv C_j (a cluster of the ensemble). Set E(i, j) = 0 if V_i and V_j are both clusters or both instances. Set E(i, j) = A(i - km, j) = E(j, i) if V_i and V_j represent an instance and a cluster.
6. Run METIS (or spectral clustering) on the resulting graph G.

Output: The resulting k-way partition of the n vertices in V^I.

5 Experimental Design and Results

In our experiments, we used four real datasets. The characteristics of all datasets are given in Table 1. The Archeological dataset is taken from [1], and was used in [4] as well. Soybeans, Breast, and Congressional Votes are from the UCI Machine Learning Repository [20].

The Soybeans dataset consists of 47 samples and 35 attributes. Since some attributes have only one value, we removed them and selected the remaining 21 attributes for our experiments, as done in other research [8]. For the Breast-cancer data, we sub-sampled the most populated class from 444 to 239, as in our previous work, to obtain balanced data [2]. The Congressional Votes dataset contains attributes with either 'yes' or 'no' responses; we treat missing values as an additional domain value for each feature, as done in [4].

Evaluating the quality of clustering is in general a difficult task. Since class labels are available for the datasets used here, we evaluate the results by computing the error rate and the normalized mutual information (NMI). The error


rate is computed according to the confusion matrix. The NMI provides a measure that is impartial with respect to the number of clusters [22]. It reaches its maximum value of one only when the result completely matches the original labels. The NMI is computed according to the average mutual information between every pair of cluster and class [22]:

NMI = \frac{\sum_{i=1}^{k} \sum_{j=1}^{k} n_{i,j} \log \frac{n_{i,j} n}{n_i n_j}}{\sqrt{\left(\sum_{i=1}^{k} n_i \log \frac{n_i}{n}\right) \left(\sum_{j=1}^{k} n_j \log \frac{n_j}{n}\right)}}    (7)

where n_{i,j} is the number of points shared by cluster i and class j, n_i is the number of points in cluster i, n_j is the number of points in class j, and n is the total number of points.
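For concreteness, equation (7) can be computed from the k × k contingency table as in the following Python sketch, which assumes every cluster and class is non-empty; the function name is ours.

```python
import numpy as np

def nmi(pred, true, k):
    # Contingency table: cont[i, j] = number of points in cluster i and class j.
    n = len(true)
    cont = np.zeros((k, k))
    for i, j in zip(pred, true):
        cont[i, j] += 1
    ni, nj = cont.sum(axis=1), cont.sum(axis=0)
    num = sum(cont[i, j] * np.log(cont[i, j] * n / (ni[i] * nj[j]))
              for i in range(k) for j in range(k) if cont[i, j] > 0)
    den = np.sqrt((ni * np.log(ni / n)).sum() * (nj * np.log(nj / n)).sum())
    return num / den
```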

5.1 Analysis of the Results

For each dataset, we ran COOLCAT 10 times with different sets of random features. The number f of selected features was set to half the original dimensionality for each data set: f = 4 for Archeological, f = 11 for Soybeans, f = 5 for Breast-cancer, and f = 8 for Vote. The clustering results of COOLCAT are then given as input to the consensus clustering techniques being compared. (As the value of k, we input to both COOLCAT and the ensemble algorithms the actual number of classes in the data.)

Figures 1-8 plot the error rate (%) achieved by COOLCAT in each random subspace, and the error rates of our categorical clustering ensemble methods (CSPA-Metis, CSPA-SPEC, CBPA-Metis, and CBPA-SPEC, where SPEC is short for spectral clustering). We also plot the error rate achieved by COOLCAT over multiple runs in the entire feature space. The figures show that we were able to obtain diverse clusterings within the random subspaces. Furthermore, the unstable performance of COOLCAT in the original space shows its sensitivity to the initial random seeding process, and to the order according to which data are processed. (We kept the input parameters of COOLCAT fixed in all runs: the sample size was set to 8, and the reprocessing size was set to 10.)

Detailed results for all data are provided in Tables 2-5, where we report the NMI and error rate (ER) of the ensembles, as well as the maximum, minimum, and average NMI and error rate values for the input clusterings.

In general, our ensemble techniques were able to filter out spurious structures identified by individual runs of COOLCAT, and performed quite well. Our techniques produced error rates comparable with, and sometimes better than, COOLCAT's minimum error rate. CSPA-Metis provided the lowest error rate among the methods being compared on three data sets. For the Archeological and Breast-cancer data, the error rate provided by the CSPA-Metis technique is as good as or better than the best individual input clustering. It is worth noting that for these two datasets, CSPA-Metis gave an error rate lower than the best individual input clustering on the entire feature space (see Figures 2 and 6). In particular, on the Breast-cancer data all ensemble techniques provided excellent results.


Table 2. Results on Archeological data

Method       Ens-NMI  Ens-ER  Max-ER  Min-ER  Avg-ER  Max-NMI  Min-NMI  Avg-NMI
CSPA-METIS   1        0       45.00   0       24.00   1        0.033    0.398
CSPA-SPEC    0.21     36.00   45.00   0       24.00   1        0.033    0.398
CBPA-METIS   0.5284   10.00   45.00   0       24.00   1        0.033    0.398
CBPA-SPEC    0.603    18.0    45.00   0       24.00   1        0.033    0.398

Table 3. Results on Soybeans data

Method       Ens-NMI  Ens-ER  Max-ER  Min-ER  Avg-ER  Max-NMI  Min-NMI  Avg-NMI
CSPA-METIS   0.807    10.6    52.1    0       24.4    1        0.453    0.689
CSPA-SPEC    0.801    12.3    52.1    0       24.4    1        0.453    0.689
CBPA-METIS   0.761    12.8    52.1    0       24.4    1        0.453    0.689
CBPA-SPEC    0.771    15.3    52.1    0       24.4    1        0.453    0.689

For the Soybeans dataset, the error rate of CSPA-Metis is still well below the average of the input clusterings, and for Vote it is very close to the average.

Also CBPA (both with Metis and SPEC) performed quite well. In general, it produced error rates comparable with the other techniques. CBPA produced error rates well below the average error rates of the input clusterings, with the exception of the Vote dataset. For the Vote data, all ensemble methods gave error rates close to the average error rate of the input clusterings. In this case, COOLCAT on the full space gave a better performance.

Overall, our categorical clustering ensemble techniques are capable of boosting the performance of COOLCAT, and achieve more robust results. Given the competitive behavior previously shown by COOLCAT, the improvement obtained by our ensemble techniques is a valuable achievement.

[Figure omitted: error-rate plot per run (1-10); y-axis: Error Rate.]
Fig. 1. Archeological data: error rates of cluster ensemble methods, and COOLCAT in random subspaces.


Table 4. Results on Breast-cancer data

Method       Ens-NMI  Ens-ER  Max-ER  Min-ER  Avg-ER  Max-NMI  Min-NMI  Avg-NMI
CSPA-METIS   0.740    4.3     9.4     6.1     7.9     0.699    0.601    0.648
CSPA-SPEC    0.743    4.4     9.4     6.1     7.9     0.699    0.601    0.648
CBPA-METIS   0.723    4.8     9.4     6.1     7.9     0.699    0.601    0.648
CBPA-SPEC    0.743    4.4     9.4     6.1     7.9     0.699    0.601    0.648

Table 5. Results on Vote data

Method       Ens-NMI  Ens-ER  Max-ER  Min-ER  Avg-ER  Max-NMI  Min-NMI  Avg-NMI
CSPA-METIS   0.473    14.0    17.7    6.9     13.7    0.640    0.345    0.447
CSPA-SPEC    0.449    13.5    17.7    6.9     13.7    0.640    0.345    0.447
CBPA-METIS   0.473    14.0    17.7    6.9     13.7    0.640    0.345    0.447
CBPA-SPEC    0.439    14.2    17.7    6.9     13.7    0.640    0.345    0.447

[Figure omitted: error-rate plot per run (1-10); y-axis: Error Rate.]
Fig. 2. Archeological data: error rates of cluster ensemble methods, and COOLCAT using all features.

6 Conclusions and Future Work

We have proposed two techniques to construct clustering ensembles for categorical data. A number of issues remain to be explored: (1) determining which specific ensemble method is best suited for a given dataset; (2) how to achieve more accurate clustering components while maintaining high diversity (e.g., by exploiting correlations among features); (3) testing our techniques on higher dimensional categorical data. We will address these questions in our future work.

Acknowledgments. This work was supported in part by NSF CAREER Award IIS-0447814.


[Figure omitted: error-rate plot per run (1-10); y-axis: Error Rate.]
Fig. 3. Soybeans data: error rates of cluster ensemble methods, and COOLCAT in random subspaces.

[Figure omitted: error-rate plot per run (1-10); y-axis: Error Rate.]
Fig. 4. Soybeans data: error rates of cluster ensemble methods, and COOLCAT using all features.

[Figure omitted: error-rate plot per run (1-10); y-axis: Error Rate.]
Fig. 5. Breast-cancer data: error rates of cluster ensemble methods, and COOLCAT in random subspaces.


[Figure omitted: error-rate plot per run (1-10); y-axis: Error Rate.]
Fig. 6. Breast-cancer data: error rates of cluster ensemble methods, and COOLCAT using all features.

[Figure omitted: error-rate plot per run (1-10); y-axis: Error Rate.]
Fig. 7. Vote data: error rates of cluster ensemble methods, and COOLCAT in random subspaces.

[Figure omitted: error-rate plot per run (1-10); y-axis: Error Rate.]
Fig. 8. Vote data: error rates of cluster ensemble methods, and COOLCAT using all features.


References

1. Aldenderfer, M. S., and Blashfield, R. K. 1984. Cluster Analysis. Sage Publications, No. 44.
2. Al-Razgan, M., and Domeniconi, C. 2006. Weighted clustering ensembles. In Proceedings of the SIAM International Conference on Data Mining.
3. Ayad, H., and Kamel, M. 2003. Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors. In Multiple Classifier Systems, 166-175.
4. Barbara, D., Li, Y., and Couto, J. 2002. COOLCAT: an entropy-based algorithm for categorical clustering. In Proceedings of the International Conference on Information and Knowledge Management. ACM Press, New York, NY, 582-589.
5. Dhillon, I. 2001. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
6. Fern, X., and Brodley, C. 2004. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the International Conference on Machine Learning.
7. Fred, A., and Jain, A. 2002. Data clustering using evidence accumulation. In Proceedings of the International Conference on Pattern Recognition.
8. Gan, G., and Wu, J. 2004. Subspace clustering for high dimensional categorical data. SIGKDD Explorations Newsletter, 6(2), 87-94.
9. Guha, S., Rastogi, R., and Shim, K. 1999. ROCK: a robust clustering algorithm for categorical attributes. In Proceedings of Data Engineering, 512-521.
10. He, Z., Xu, X., and Deng, S. 2005. A cluster ensemble method for clustering categorical data. Information Fusion, 6(2), 143-151.
11. He, Z., Xu, X., and Deng, S. 2005. Clustering mixed numeric and categorical data: a cluster ensemble approach. ArXiv Computer Science e-prints.
12. Ho, T. K. 1998. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832-844.
13. Hu, X. 2004. Integration of cluster ensemble and text summarization for gene expression analysis. In Proceedings of the IEEE Symposium on Bioinformatics and Bioengineering.
14. Huang, Z. 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 283-304.
15. Karypis, G., and Kumar, V. 1995. Multilevel k-way partitioning scheme for irregular graphs. Technical report, University of Minnesota, Department of Computer Science and Army HPC Research Center.
16. Kleinberg, E. M. 1990. Stochastic discrimination. Annals of Mathematics and Artificial Intelligence, 1, 207-239.
17. Kleinberg, E. M. 1996. An overtraining-resistant stochastic modeling method for pattern recognition. The Annals of Statistics, 24(6), 2319-2349.
18. Kuncheva, L., and Hadjitodorov, S. 2004. Using diversity in cluster ensembles. In Proceedings of the International Conference on Systems, Man and Cybernetics.
19. Mei, Q., Xin, D., Cheng, H., Han, J., and Zhai, C. 2006. Generating semantic annotations for frequent patterns with context analysis. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, NY, 337-346.
20. Newman, D., Hettich, S., Blake, C., and Merz, C. 1998. UCI repository of machine learning databases.
21. Skurichina, M., and Duin, R. P. W. 2001. Bagging and the random subspace method for redundant feature spaces. In J. Kittler and R. Poli, editors, Second International Workshop on Multiple Classifier Systems, 1-10.
22. Strehl, A., and Ghosh, J. 2002. Cluster ensembles – a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 583-617.
23. Zengyou, H., Xiaofei, X., and Shengchun, D. 2002. Squeezer: an efficient algorithm for clustering categorical data. Journal of Computer Science and Technology, 17(5), 611-624.


Gradient Boosting GARCH and Neural Networks for Time Series Prediction

J. M. Matías¹, M. Febrero², W. González-Manteiga², and J. C. Reboredo³

¹ Dpt. of Statistics, University of Vigo
² Dpt. of Statistics, University of Santiago de Compostela
³ Dpt. of Economic Analysis, University of Santiago de Compostela

Abstract. This work develops and evaluates new algorithms based on neural networks and boosting techniques, designed to model and predict heteroskedastic time series. The main novel elements of these new algorithms are as follows: a) in regard to neural networks, the simultaneous estimation of conditional mean and volatility through the maximization of likelihood; b) in regard to boosting, its simultaneous application to trend and volatility components of the likelihood, and the use of likelihood-based models (e.g., GARCH) as the base hypothesis rather than gradient fitting techniques using least squares. The behavior of the proposed algorithms is evaluated over simulated data, resulting in frequent and significant improvements in relation to the ARMA-GARCH models.

1 Introduction

A fundamental issue in financial economics is predicting the price of financial assets. Given the random nature of these prices, they are generally characterized as a stochastic process \{Y_t, t \in T\} (where, henceforth, T = N or Z, the natural or integer numbers, respectively), for which we wish to know the conditional density function p_{Y_t|A_{t-1}}, where A_{t-1} is the \sigma-algebra generated by (Y_{t-1}, Y_{t-2}, \ldots), representing all the information available to the market prior to instant t.

A large range of both parametric and non-parametric statistical techniques has been developed to predict prices (e.g., [12], [8], [3], [9], [17], [5]), including, in the last decade, neural networks (e.g., [13], [11], [10]).

The primary aim of these techniques is prediction. Traditional linear techniques (such as the ARMA models) that assume a constant variance for the process \{Y_t\} have begun to lose ground in financial time series prediction, primarily because such series usually exhibit heteroskedasticity.

From a financial point of view, investors require rewards that are proportionate to risk, so any variations in risk levels may alter the behavior of individual investors. From a statistical point of view, on the other hand, changes in variance over time have implications for the validity and efficiency of estimators that assume homoscedasticity.

With a view to resolving these problems, recent prediction techniques take into account different hypotheses in relation to the conditional variance (volatility) Var_{Y_t|A_{t-1}}(Y_t), endeavouring to model its temporal dynamics.


Of the different approaches to tackling this problem (direct modelling of the dynamics of volatility [4], [2]; modelling of the conditional distribution [15], [16]; etc.), our article focuses on the former.

2 The GARCH Model

The Generalised Autoregressive Conditional Heteroskedasticity, or GARCH(r, s), models [2] combine a stationary series model (usually ARMA) with a stationary linear squared-innovations model in the form:

Y_t = \mu_t + \varepsilon_t = f(Y_{t-1}, Y_{t-2}, \ldots, \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) + \varepsilon_t    (1)

\varepsilon_t^2 = h_t + \omega_t = a_0 + \sum_{j=1}^{r} a_j h_{t-j} + \sum_{j=1}^{s} b_j \varepsilon_{t-j}^2 + \omega_t    (2)

where \{\varepsilon_t, t \in T\} is a process with E(\varepsilon_t | A_{t-1}) = 0 and Var(\varepsilon_t) = \sigma_\varepsilon^2; \{\omega_t, t \in T\} is a process of uncorrelated variables with E(\omega_t | A_{t-1}) = 0 and Var(\omega_t) = \sigma_\omega^2 for all t \in T; and suitable restrictions hold on the process \{\omega_t\} and the parameters \{a_j\}_{j=0}^{r}, \{b_j\}_{j=1}^{s} so that Var(\varepsilon_t | A_{t-1}) = E(\varepsilon_t^2 | A_{t-1}) > 0 and the stationarity hypothesis for \{\varepsilon_t^2\} is satisfied.

An alternative representation of the GARCH models is \varepsilon_t = \sqrt{h_t} \vartheta_t with h_t as in (2), where \{\vartheta_t, t \in T\} is a process of independent variables with mean zero and Var(\vartheta_t) = 1 for all t \in T. Hence \varepsilon_t^2 = h_t \vartheta_t^2 or, using the above representation, \varepsilon_t^2 = h_t + \omega_t; in both cases \sigma_t^2 = Var(\varepsilon_t | A_{t-1}) = h_t.

The GARCH models are estimated using the maximum likelihood method (with restrictions on the parameters in order to guarantee the positivity of the variance). For example, in the case of a Gaussian conditional distribution Y_t | A_{t-1} \sim N(\mu_t, \sigma_t^2), the minus log-likelihood is:

\ell(\varepsilon_t, h_t) = -\ln p(\varepsilon_t; h_t) = -\ln\left[\frac{1}{\sqrt{2\pi h_t}} \exp\left(-\frac{\varepsilon_t^2}{2 h_t}\right)\right] = \frac{1}{2}\left(\ln h_t + \frac{\varepsilon_t^2}{h_t} + \ln 2\pi\right)    (3)

Other alternatives are to consider distributions with heavier tails for \varepsilon_t, such as the Student-t distribution with \nu degrees of freedom.

3 New algorithms for the prediction of heteroskedastic time series

The new algorithms developed here achieve a flexible modelling of volatility dynamics. These algorithms are:

1. Algorithms based on neural networks aimed at simultaneously modelling conditional mean and conditional variance in time series through the maximization of likelihood.
2. Boosting algorithms [14], and specifically Gradient Boost [7], with a loss function based on likelihood, which can simultaneously model both conditional mean and volatility. GARCH models can be used as the base hypothesis for these algorithms.


3.1 Plain Neural Networks Algorithms

This group of algorithms permits simultaneous estimation of the series conditional mean and volatility. They are described as follows:

1. M(m, p)-GARCH(r, s). A combination of a feed-forward neural network (MLF) M(m, p) for modelling the series conditional mean with p lags and m neurons (where M may be an RBF or MLP network), and a GARCH model for modelling the volatility dynamics. The models are estimated jointly through maximum likelihood (Gaussian in this work) using the conjugated gradient algorithm. The algorithm is similar to that described immediately below, with the sole difference that in the latter the GARCH equations are also modelled using neural networks.

2. M(m_1, p)-M(m_2, r, s). These algorithms are similar to those described above, but model the dynamics of volatility using a neural network M (RBF or MLP) with m_2 neurons, r lags in h_t and s lags in \varepsilon_t^2. Both models (conditional mean and volatility) are estimated jointly through maximum likelihood (Gaussian in this work) using the conjugated gradient algorithm. In other words, if \{Y_t\}_{t=1}^{n} is the set of observations and conditional mean and volatility are modelled using MLF networks (MLP or RBF):

\mu_t = \mu_t(\theta_1) = f(Y_{t-1}, \ldots, Y_{t-p}; \theta_1)
h_t = h_t(\theta_2) = g(h_{t-1}, \ldots, h_{t-r}, \varepsilon_{t-1}^2, \ldots, \varepsilon_{t-s}^2; \theta_2)

the algorithm iteratively minimizes the empirical risk:

R_n(\theta) = \frac{1}{n} \sum_{t=u+1}^{n} \ell(\varepsilon_t(\theta_1), h_t(\theta_2)) = \frac{1}{n} \sum_{t=u+1}^{n} \ell(y_t - \mu_t(\theta_1), h_t(\theta_2))

where \theta = (\theta_1^T, \theta_2^T)^T, u = p + \max(r, s) and \ell(\varepsilon_t, h_t) = -\ln p(\varepsilon_t, h_t) is the minus log-likelihood loss. In the case p = r = s = 1, these models are:

Y_t = \sum_{j=1}^{m_1} c_j \psi(w_j Y_{t-1} + w_{j0}) + c_0 + \varepsilon_t
h_t = \sum_{j=1}^{m_2} c'_j \psi(w_j^T h_{t-1} + w'_{j0}) + c'_0

or

Y_t = \sum_{j=1}^{m_1} c_j \psi(|Y_{t-1} - w_j| / w_{j0}) + c_0 + \varepsilon_t
h_t = \sum_{j=1}^{m_2} c'_j \psi(\|h_{t-1} - w_j\| / w'_{j0}) + c'_0

depending on whether MLP or RBF networks are used, and where h_{t-1} = (h_{t-1}, \varepsilon_{t-1}^2)^T. If no significant conditional mean is postulated for the series of interest, the first of the algorithms is reduced to a GARCH model and the second will not include the first component M(m_1, p) (or this will possess a very small number m_1 of hidden units, e.g., m_1 = 1).


3.2 Boosting Algorithms

The boosting algorithms developed here are designed to model the dynamics of volatility but can also simultaneously estimate the series conditional mean if that mean is postulated.

These algorithms are generally denoted as Boost[M1-M2], where M1 is the base hypothesis used for the estimation of the first component of the gradient in cases where a conditional mean is postulated, and M2 is the base hypothesis used for the estimation of the gradient component resulting from the volatility.

Assuming a likelihood-based loss \ell(\varepsilon_t, h_t) = l(y_t, (\mu_t, h_t)), the components \mu_t and h_t are estimated simultaneously via Gradient Boost applied to each of the base hypotheses M1 and M2.

If, for the sake of notational simplicity, we assume p = r = s = 1 (the general case is analogous), the estimations of the algorithm in the (j-1)-th iteration, j = 2, 3, \ldots, are, for t = 3, \ldots, n:

\mu_{t,j-1} = \sum_{k=1}^{j-1} f_k(y_{t-1})
h_{t,j-1} = \sum_{k=1}^{j-1} g_k(h_{t-1,j-1}, \varepsilon_{t-1,j-1}^2)

where the components f_k and g_k are the estimation of the functional gradient in the k-th iteration (see below) and:

\varepsilon_{t-1,j-1}^2 = (y_{t-1} - \mu_{t-1,j-1})^2 = \left(y_{t-1} - \sum_{k=1}^{j-1} f_k(y_{t-2})\right)^2
h_{t-1,j-1} = \sum_{k=1}^{j-1} g_k(h_{t-2,j-1}, \varepsilon_{t-2,j-1}^2)

Taking as the vector of covariables x_{t,j-1} = (y_{t-1}, h_{t-1,j-1}, \varepsilon_{t-1,j-1}^2), and if the base hypotheses M1 and M2 are of the form \varphi = \varphi(\cdot; \theta_1), \phi = \phi(\cdot; \theta_2) respectively, with \varphi, \phi linear combinations of basic parameterized functions (as is the case of the RBF and MLP neural networks), the projection of the negative gradient onto the hypothesis space in the j-th iteration is obtained by solving:

\theta_{1,j} = \arg\min_{\theta_1} \sum_{t=3}^{n} [-\xi_j^{(1)}(x_{t,j-1}) - \varphi(x_{t,j-1}; \theta_1)]^2    (4a)
\theta_{2,j} = \arg\min_{\theta_2} \sum_{t=3}^{n} [-\xi_j^{(2)}(x_{t,j-1}) - \phi(x_{t,j-1}; \theta_2)]^2    (4b)

Thus, the new components f_j(y_{t-1}), g_j(y_{t-1}, \varepsilon_{t-1}^2) for conditional mean and volatility are obtained by performing a line search in the direction of the projected negative gradient.


In the end, the final hypothesis produced by the algorithm in the B-th iteration is a vectorial function:

f_{B,t} = (\mu_t, h_t)^T = \left( \sum_{j=1}^{B} f_j(y_{t-1}), \; \sum_{j=1}^{B} g_j(h_{t-1}, \varepsilon_{t-1}^2) \right)^T
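The volatility part of this scheme can be sketched as follows, for a zero-mean series (the BoostM case) with a fixed step size in place of the paper's line search; `base_fit` stands for any least-squares base learner returning a prediction function, and all names here are our assumptions.

```python
import numpy as np

def boost_volatility(y, base_fit, B, step=0.1):
    eps2 = y ** 2
    h = np.full_like(y, eps2.mean())  # initial constant variance (assumed)
    learners = []
    for _ in range(B):
        # Negative gradient of (1/2)(ln h + eps^2/h) w.r.t. h:
        neg_grad = 0.5 * (eps2 - h) / h ** 2
        X = np.column_stack([h[:-1], eps2[:-1]])  # covariables (h_{t-1}, eps_{t-1}^2)
        g = base_fit(X, neg_grad[1:])             # least-squares fit, cf. (4b)
        h[1:] = np.maximum(h[1:] + step * g(X), 1e-8)  # update, keep h > 0
        learners.append(g)
    return h, learners
```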

For the above general schema we have used the RBF and MLP neural networks as the base hypotheses; nonetheless, in principle, any kind of model that can be estimated using least squares through (4) can be used.

However, in this work we have also used ARMA-GARCH as the base hypothesis, with a view to evaluating the benefits of modelling any possible heteroskedasticity that the gradient series may inherit from the original series, thereby obtaining a better fit for this gradient in each boosting iteration. These models have the peculiarity that they are not fitted using least squares, as in (4), but through the maximization of likelihood. They therefore use a discrepancy criterion for the negative gradient different from that of least squares.

Finally, when a significant conditional mean is not postulated for the series, the Boost[M1-M2] model is reduced to the form BoostM, a particular case^4 of the development above and the same kind of general algorithm used in [1], with the only difference being the base hypotheses used and the criterion of discrepancy for fitting the negative gradient.

4 Evaluation of the new algorithms

In this section we present the results of the evaluation of the proposed algorithms with different sets of artificial data.

Unless otherwise indicated, all the models based on neural networks use a single lag, with the corresponding parameter therefore omitted from the symbology that identifies the model. For example, a model MLP(m)-GARCH(1,1) indicates a model MLP(m, 1)-GARCH(1,1), in other words, an MLP model with m hidden units and just a single lag for the mean, and a GARCH(1,1) model for volatility.

In our evaluation of the different algorithms, presented as reference are the results obtained by ARMA-GARCH models estimated via maximum Gaussian likelihood, except in certain cases (to be indicated), in which Student-t likelihood is used with several degrees of freedom. This will permit a comparison between the highly non-linear models described above and non-Gaussian models.

The criteria used to evaluate the performance of the algorithms were as follows, with N denoting the sample size (training or test sample, as appropriate):

4 In the BoostM algorithm, the theoretical search space is R^\infty (variance functions), whereas in the Boost[M1-M2] algorithm this is R^{2 \times \infty} (conditional mean and variance functions). With n data, the search space is R^{2 \times n} instead of R^n.


1. Average squared error for the series. If f(y_{t-1}, y_{t-2}, \ldots) is the prediction at t:

ASR_y = \frac{1}{N} \sum_{t=1}^{N} [y_t - f(y_{t-1}, y_{t-2}, \ldots)]^2    (5)

2. Average squared error for the series of variances when these have been simulated and are known. If \sigma_t^2 are the true variances and g(\varepsilon_{t-1}^2, \varepsilon_{t-2}^2, \ldots, h_{t-1}, h_{t-2}, \ldots) the predictions:

ASR_h = \frac{1}{N} \sum_{t=1}^{N} [\sigma_t^2 - g(\varepsilon_{t-1}^2, \varepsilon_{t-2}^2, \ldots, h_{t-1}, h_{t-2}, \ldots)]^2    (6)

3. The minus log-likelihood (Gaussian unless otherwise indicated):

-\ln Lk = \sum_{t=1}^{N} \ell(y_t; \mu_t, h_t)    (7)

with

\ell(y_t, \mu_t, h_t) = \frac{1}{2}\left(\ln h_t + \frac{(y_t - \mu_t)^2}{h_t} + \ln 2\pi\right)
\mu_t = f(y_{t-1}, y_{t-2}, \ldots)
h_t = g(\varepsilon_{t-1}^2, \varepsilon_{t-2}^2, \ldots, h_{t-1}, h_{t-2}, \ldots)
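The three criteria are straightforward to compute jointly; a short Python sketch (our naming) is:

```python
import numpy as np

def evaluate(y, mu, h, true_var=None):
    asr_y = np.mean((y - mu) ** 2)                                        # eq. (5)
    asr_h = None if true_var is None else np.mean((true_var - h) ** 2)    # eq. (6)
    nll = 0.5 * np.sum(np.log(h) + (y - mu) ** 2 / h + np.log(2 * np.pi)) # eq. (7)
    return asr_y, asr_h, nll
```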

The model selection criteria used in the different algorithms were as follows:

1. Both the MLP and RBF networks were trained through maximum likelihood via conjugated gradient with line search, using no more than 20 iterations as the stopping rule. The initial values of the parameters were obtained, in the case of the MLP networks, by the Bayesian algorithm (least squares) [6], and in the case of the RBF, by using the k-means clustering algorithm to select widths. Both methods permit automatic model selection. The maximum number of basic functions was determined via a preliminary exploration of a sample of each set of data used.

2. The Gradient Boost algorithms were stopped after a few iterations, when their performance over the test sample started to worsen, without investigating whether improvement could occur subsequently. A task for the future, therefore, will be to develop stopping mechanisms.

The sets of artificial data were generated using the following models, all with Gaussian innovations:

1. Heteroskedastic model with sinusoidal mean:

Y_t = Y_{t-1} \sin(Y_{t-1}) + \varepsilon_t    (8)

with two alternative models for volatility:


(a) GARCH(1,1):

h_t = 0.1 + 0.85 h_{t-1} + 0.1 \varepsilon_{t-1}^2    (9)

(b) Logarithmic model:

\ln h_t = 0.6 \ln h_{t-1} + 0.3 \ln(0.5 + 0.2 \varepsilon_{t-1}^2)    (10)

2. Heteroskedastic zero-mean model:

Y_t = \sqrt{h_t} \varepsilon_t; \quad h_t = f(y_{t-1}, h_{t-1})    (11)

with two alternative models for volatility:

(a) Additive volatility model used in [1]:

f(y, h) = (0.1 + 0.2|y| + 0.9y^2)(0.8 \exp(-1.5|y||h|)) + (0.4y^2 + 0.5h)^{3/4}    (12)

(b) Threshold model:

f(y, h) = \begin{cases} 0.2 + 0.3y^2 & \text{if } y < 0 \\ 0.3 + 0.2y^2 + 0.6h & \text{if } y \ge 0 \text{ and } h < 0.5 \\ 0.7 + 0.8h & \text{if } y \ge 0 \text{ and } h \ge 0.5 \end{cases}    (13)
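For the sinusoidal-mean model (8) with GARCH(1,1) volatility (9), a series can be generated as in the following sketch; the initial values of the recursion are our assumptions.

```python
import numpy as np

def simulate_sin_garch(n, seed=0):
    rng = np.random.default_rng(seed)
    y, h = np.zeros(n), np.zeros(n)
    h[0], eps_prev = 0.1, 0.0  # assumed initial state
    for t in range(1, n):
        h[t] = 0.1 + 0.85 * h[t - 1] + 0.1 * eps_prev ** 2   # eq. (9)
        eps_prev = np.sqrt(h[t]) * rng.standard_normal()     # Gaussian innovations
        y[t] = y[t - 1] * np.sin(y[t - 1]) + eps_prev        # eq. (8)
    return y, h
```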

Next we will describe the results of the best algorithms for each set of data, offering for reference purposes those obtained by the GARCH(1,1) model.

4.1 Heteroskedastic series with sinusoidal mean

For the models defined by equations (8)-(9) and equations (8)-(10), 50 samples of 3000 observations each were generated, of which the first n = 1000 observations were used to train the algorithms and the remaining n_test = 2000 observations constituted the test sample for the evaluation.

Table 1 shows the mean results produced by the various algorithms with the series with sinusoidal mean and the GARCH model for the variance. The table reflects, for each algorithm: the values ASR_h (6), ASR_y (5) and the minus log-likelihood (7) in the test sample, as well as the estimations of the parameters in the GARCH model of equation (9).

Table 2 shows the mean results obtained with the series with sinusoidal mean and the logarithmic model for the variances.

These tables reveal the satisfactory results produced by the algorithms based on neural networks that represent the conditional mean via a non-linear model5. Specifically:

5 In order to correctly evaluate the results obtained in terms of likelihood, it is necessary to bear in mind the difficulty implied in producing improvements in this criterion, given that likelihood measures the quality of the prediction of future returns and future volatilities, and is much more volatile than squared error, which measures only the quality of the prediction of future volatilities [1].


Yt sin(Yt) + GARCH(1,1)

Model                      -lnLk_test  ASR_h_test  ASR_y_test  a0      a1      b1
ARMA(0,0)-GARCH(1,1)        3901.94    12.9037     3.7077      1.0418  0.3156  0.3645
ARMA(1,0)-GARCH(1,1)        3898.38    16.6013     3.8140      1.1826  0.2335  0.4101
GARCH(1,1) after RBF(20)    3536.65     1.5550     2.2807      0.1241  0.8358  0.1018
RBF(20)-GARCH(1,1)          3535.59     1.5954     2.2778      0.1240  0.8392  0.0998
GARCH(1,1) after MLP(13)    3500.11     1.1005     2.1493      0.0857  0.8817  0.0717
MLP(13)-GARCH(1,1)          3498.90     1.0681     2.1316      0.0855  0.8627  0.0808

Table 1. Results of the different algorithms for the series defined by equations (8) and (9).

1. In regard to Table 1, of particular note are the M(m)-GARCH models, which propose variance models very similar to the true one (equation (9)). For these models, the simultaneous estimation of conditional mean and variance produces better results than the sequential procedure typically used in practice (estimating the GARCH model after subtracting the mean estimated through M). As can be observed, the estimation of the parameters of the true GARCH model is totally erroneous in the ARMA(0,0)-GARCH(1,1) and ARMA(1,0)-GARCH(1,1) models. The high value of the constant (a0 in the table) of the GARCH model proposed by these models indicates that they have interpreted the variability due to the mean as if it originated from the variance.

2. In regard to Table 2 (sinusoidal process with logarithmic variance), similar results are obtained. Here the MLP(13)-MLP(1) algorithm produces a slight improvement over the MLP(13)-GARCH(1,1). This algorithm does not include the variance h_t in the model (it is a non-linear ARCH model that only includes the square ε²_t of the innovations): its inclusion slightly worsened the results (−lnLk_test = 2510), perhaps due to the instability of the intermediate estimations for h_t.

4.2 Heteroskedastic series with zero mean

Table 3 shows the mean results for various algorithms with 50 realizations of the additive (11)–(12) and threshold (11)–(13) volatility models.

1. In regard to the first series, almost all the non-linear models evaluated improve the results of the GARCH(1,1) model in terms of likelihood, although the results in terms of quadratic loss are more heterogeneous. In regard to the minus log-likelihood, of note is the behavior of the BoostMLP(3) and BoostARX(2,1) models in the table. The BoostARX(2,1) model uses ARX(2,1) as the base hypothesis to fit the gradient6, in which the exogenous covariable is the square of the innovations ε²_t and in which the initial hypothesis for the algorithm is the variance in the data (IniG=no). The utilization of a GARCH(1,1) model as the initial hypothesis (IniG=yes) produces less satisfactory results for this algorithm in terms of likelihood, but more satisfactory results in terms of quadratic loss. The relative improvements produced by these algorithms in relation to the GARCH(1,1) reference model are similar to those obtained by [1] using regression trees and projection pursuit as the base hypotheses7.

6 Note that the lag order of the base hypothesis of boosting cannot be compared to that of the ARMA-GARCH applied to the original series.


Yt sin(Yt) + Logarithmic

Model                                          -lnLk_test  ASR_h_test  ASR_y_test
ARMA(0,0)-GARCH(1,1)                            2964.56     0.3702      1.1848
ARMA(1,0)-GARCH(1,1)                            2817.14     0.5792      1.0951
BoostARMA(1,0)GARCH(1,1)-ARMA(1,0)GARCH(1,1)    2710.67     0.2067      0.9956
GARCH(1,1) after RBF(20)                        2529.39     0.0076      0.7411
RBF(20)-GARCH(1,1)                              2528.72     0.0066      0.7405
GARCH(1,1) after MLP(13)                        2510.53     0.0060      0.7228
MLP(13)-GARCH(1,1)                              2508.51     0.0057      0.6872
MLP(13)-MLP(1)                                  2506.95     0.0076      0.6834

Table 2. Results of the different algorithms for the series defined by equations (8) and (10). The results of the boosting algorithm were obtained after 10 iterations.

Volatility Additive Model

Model                                  IniG  B    -lnLk_test  ASR_h_test
ARMA(0,0)-GARCH(1,1)                   –     –    1394.96     0.0105
MLP(1)-MLP(3)                          –     –    1393.43     0.0288
BoostMLP(3)                            no    3    1392.76     0.0168
BoostARX(2,1)                          no    6    1392.01     0.0137
BoostARX(2,1)                          yes   43   1393.54     0.0096
BoostGARCH(1,1)-ARMA(1,0)GARCH(1,1)    yes   2    1394.35     0.0094

Volatility Threshold Model

Model                                  IniG  B    -lnLk_test  ASR_h_test
ARMA(0,0)-GARCH(1,1)                   –     –    1229.52     0.3697
BoostMLP(1)-MLP(3)                     yes   2    1223.91     0.3697
BoostGARCH(1,1)-ARMA(1,0)GARCH(1,1)    yes   2    1223.54     0.3683

Table 3. Results for the best algorithms with the additive volatility series (eqs. 11 and 12) and the threshold volatility series (eqs. 11 and 13). The second column indicates, for the boosting algorithms, whether they are initialized (yes) with a GARCH(1,1) model or (no) use the sample variance as the initial hypothesis. The third column indicates the number of iterations (B) for these algorithms.



2. In regard to the threshold model, the BoostMLP(1)-MLP(3) and BoostGARCH(1,1)-ARMA(1,0)-GARCH(1,1) algorithms noticeably improve the results of the GARCH(1,1) in terms of volatility. Despite the fact that the MLP(1)-MLP(2) model produces very good results in many of the samples, its overall performance is affected by the fact that it is highly unstable in some of them. In this series all the boosting algorithms improved when a GARCH(1,1) model was used as the initial hypothesis.

5 Conclusions

The evaluation of the proposed algorithms with a single lag in various sets of artificial data produces the following conclusions:

1. For all the sets of data analyzed, there was always at least one algorithm that significantly improved on the results of a GARCH model. On the other hand, no algorithm was revealed as the best in all the situations studied.

2. The algorithms that combine neural networks for the conditional mean with GARCH models or neural networks for the variance appear to represent a considerable improvement over the ARMA-GARCH models when a non-linear trend function is present.

3. The Gradient Boost algorithms tend to produce good results, particularly with highly nonlinear volatility. For this family of algorithms, the ARMA-GARCH models were the most stable, safest (and fastest) base hypotheses in our tests. As models estimated via maximum likelihood, the ARMA-GARCH models depart from the usual Gradient Boost philosophy of fitting the empirical negative gradient via least squares, although the good results obtained in our tests would support their use. It would appear that the empirical gradient is not merely a simple n-dimensional vector but a time series in itself, with possible heteroskedasticity inherited from the original series.

Future lines of research are the development of stopping rules for the boosting algorithm, the study of its behavior with different base hypotheses and the systematic evaluation of the algorithms on series of real data.

7 A precise comparison between these results and those obtained by [1] is not possible, given that important differences may exist in the different sets of data, due, for example, to the initial value of the variance used when generating each realization of the process. The relative improvements with respect to the results of the GARCH(1,1) model for each case are similar.


Acknowledgments. The research of W. González-Manteiga, M. Febrero and J. M. Matías was supported by the Spanish Ministry of Education and Science, Grant No. MTM2005-00820.

References

1. Audrino, F., Bühlmann, P.: Volatility estimation with functional gradient descent for very high dimensional financial time series. Journal of Computational Finance, 6 (2002) 65–89.

2. Bollerslev, T.: Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31 (1986) 307–327.

3. Box, G.E.P., Jenkins, G.M., Reinsel, G.C.: Time Series Analysis: Forecasting and Control. Prentice-Hall (1994).

4. Engle, R.F.: Autoregressive conditional heteroskedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50 (1982) 987–1007.

5. Fan, J., Yao, Q.: Nonlinear Time Series: Nonparametric and Parametric Methods. Springer (2003).

6. Foresee, F.D., Hagan, M.T.: Gauss-Newton approximation to Bayesian regularization. In Proceedings of the 1997 International Joint Conference on Neural Networks (1997) 1930–1935.

7. Friedman, J.: Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29 (2001) 1189–1232.

8. Granger, C.W.J., Newbold, P.: Forecasting Economic Time Series. Academic Press (1986).

9. Granger, C.W.J., Teräsvirta, T.: Modelling Nonlinear Economic Relationships. Oxford University Press (1993).

10. Medeiros, M.C., Teräsvirta, T., Rech, G.: Building neural network models for time series: A statistical approach. Technical Report 461, Departamento de Economia, PUC-Rio, and Department of Economic Statistics, Stockholm School of Economics, Sweden (2002).

11. Miranda, F.G., Burgess, N.: Modelling market volatilities: The neural network perspective. The European Journal of Finance, 3 (1997) 137–157.

12. Priestley, M.B.: Spectral Analysis and Time Series. Academic Press (1981).

13. Refenes, A.P.N., Burgess, A.N., Bentz, Y.: Neural networks in financial engineering: A study in methodology. IEEE Transactions on Neural Networks, 8 (1997) 1222–1267.

14. Schapire, R.E.: The strength of weak learnability. Machine Learning, 5 (1990) 197–227.

15. Schittenkopf, C., Dorffner, G., Dockner, E.J.: Forecasting time-dependent conditional densities: A semi-nonparametric neural network approach. Journal of Forecasting, 19 (2000) 355–374.

16. Tino, P., Schittenkopf, C., Dorffner, G.: Financial volatility trading using recurrent neural networks. IEEE Transactions on Neural Networks, 12 (2001) 865–874.

17. Tsay, R.S.: Analysis of Financial Time Series. John Wiley and Sons (2001).

18. Weigend, A.S., Gershenfeld, N.A.: Time series prediction: Forecasting the future and understanding the past. In Proceedings of the NATO Advanced Research Workshop on Comparative Time Series Analysis. Perseus Books (1994).


Cascading with VDM and Binary Decision Trees

for Nominal Data

Jesus Maudes, Juan J. Rodríguez, and Cesar García-Osorio

Escuela Politecnica Superior – Lenguajes y Sistemas Informaticos
Universidad de Burgos
Francisco de Vitoria s/n, 09006, Burgos, Spain
{jmaudes, jjrodriguez, cgosorio}@ubu.es

Abstract. In pattern recognition, some learning methods need numbers as inputs. This paper analyzes two-level classifier ensembles designed to improve numeric learning methods on nominal data. A different classifier is used at each level. The classifier at the base level transforms the nominal inputs into continuous probabilities that the classifier at the meta level uses as inputs. An experimental validation is provided over 27 nominal datasets for enhancing a method that requires numerical inputs (i.e., SVM, Support Vector Machine). Cascading, Stacking and Grading are used as two-level ensemble implementations. Experiments combine these methods with another symbolic-to-numerical transformation (i.e., VDM, Value Difference Metric). The results suggest that Cascading with Binary Decision Trees at the base level and SVM with VDM at the meta level tends to obtain better accuracies than other possible two-level configurations.

Keywords: Ensembles, Nominal Data, Cascade Generalization, Support Vector Machines, Decision Trees.

1 Introduction

Data can be classified into numerical and qualitative. Qualitative data can only take values from a finite, predefined set. If no order is assumed between such values, the data is referred to as Nominal or Categorical.

In pattern recognition many methods require numerical inputs, so there is a mismatch when they are applied to nominal data. One approach for adapting nominal data to numerical methods is to transform each nominal feature into n Binary Features (NBF [5]), where n is the number of possible values the nominal feature can take. Following this method, a nominal value is translated into a group of binary attributes, where all values are zero except the one corresponding to the attribute indicating that nominal value.

An alternative approach to binary conversion is transforming symbolic values into numeric continuous values. VDM (Value Difference Metric), for measuring distances between symbolic feature values, is presented in [11]. Duch [3] uses VDM to transform symbolic features into numeric ones, and then the converted data is fed to a classifier. In [3] VDM is tested using FSM networks and k-NN


classifiers, and better or similar results are obtained with VDM conversion than with raw data.

VDM replaces each nominal value x of an attribute A with a probability vector v = (v_1, . . . , v_c), where c is the number of classes and v_i = P(class = c_i | A = x). Thus, VDM can also significantly augment input dimensionality in multiclass problems. Nominal to binary conversion also increases input dimensionality, but, in this case, the increment is due to the cardinality of the nominal attribute domains.
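As an illustration, the VDM vector for each value of a single nominal attribute can be estimated from class-conditional relative frequencies, as in the following sketch (plain Java; our own illustrative code, not the WEKA filter used later in the experiments):

import java.util.HashMap;
import java.util.Map;

class VdmExample {
    /** Estimates v_i = P(class = c_i | A = x) for every value x of one attribute. */
    static Map<String, double[]> vdm(String[] attr, int[] cls, int numClasses) {
        Map<String, double[]> counts = new HashMap<>();
        Map<String, Integer> totals = new HashMap<>();
        for (int i = 0; i < attr.length; i++) {
            counts.computeIfAbsent(attr[i], k -> new double[numClasses])[cls[i]]++;
            totals.merge(attr[i], 1, Integer::sum);
        }
        for (Map.Entry<String, double[]> e : counts.entrySet()) {
            int n = totals.get(e.getKey());
            for (int c = 0; c < numClasses; c++) e.getValue()[c] /= n;  // relative frequency
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] a = {"red", "red", "blue", "blue", "blue"};
        int[] y = {0, 1, 1, 1, 0};
        double[] vRed = vdm(a, y, 2).get("red");   // prints 0.5 0.5
        System.out.println(vRed[0] + " " + vRed[1]);
    }
}

Note how the sketch makes the first issue discussed below concrete: two instances sharing the value "red" receive the same vector (0.5, 0.5), regardless of their classes.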

Cascading [4] is an n-level ensemble, where n is usually two. Two-level cascades have a base and a meta level. The classifier at the base level performs an extension of the original data by inserting new attributes. These new attributes are derived from the class probability distribution given by the base classifier. The meta classifier takes this extended data as its input.

In [6], Cascading is used on nominal datasets. In that work an SVM with linear kernel is enhanced using Cascading with such an SVM at the meta level and a Decision Tree at the base level. Decision Trees can deal directly with nominal data without any previous transformation, so this Cascading implementation extends nominal data with continuous probabilities as new features. These continuous values can be handled directly by a linear learner.

The experimental validation in [6] shows that this Cascading configuration tends to perform better than VDM applied to SVM, and than other combinations of one SVM and one Decision Tree (i.e., Stacking and Grading of one SVM and one Decision Tree).

Cascading only adds c new attributes to the original dataset (c = number of classes), but Cascading does not replace the original attributes. It only adds new continuous dimensions, so if the classifier at the meta level needs numerical inputs, a nominal data transformation will still be required (in [6] NBF is used). Thus the input dimension is again increased by another method that performs nominal data replacement.

According to [6], in nominal data problems with linear classifiers, it can be better to use extra features pre-built by a base classifier through Cascading than features translated into continuous values using VDM, because:

1. If there are two instances with the same symbolic value for some nominal feature, but belonging to different classes, VDM will compute the same probability vector for both of them. However, it is preferable that different values be provided to aid the classification task, for example by taking into account the values of the remaining attributes of the instance, as the classifier at the base level does. This issue is of special interest wherever linear separability is required.

2. Cascading does not replace the original attributes, requiring a method like NBF to perform such replacement. NBF (or any other nominal-to-numeric transformation method) can easily lead toward a non-linearly separable representation of the data (see the example in [6]). However, adding new input dimensions that are probability estimations of the instance classification (as Cascading does) can help to gain linear separability.

Experimental results in [6] also show that VDM applied to the input of a Decision Tree seems an interesting method. This suggests that VDM, Decision Trees and Cascading can be combined in some way to get a more accurate ensemble for nominal data, and this idea is the main motivation behind this paper.

We will show experimentally that the Cascading configuration proposed in [6] can be improved if Decision Trees with binary splits (Binary Decision Trees) are used instead of multiple-split Decision Trees, and if VDM is used instead of NBF for transforming the SVM input features.

This paper is an extension of [6]. For the sake of self-containment, some content is repeated here. The paper is structured as follows. Section 2 describes the two-level ensembles that we will use in the experimental validation (Cascading, Stacking and Grading) and how they can be used on nominal data. Section 3 analyzes VDM applied to Decision Trees and Binary Decision Trees. Section 4 derives some equivalences in order to simplify the experimental validation. Section 5 contains the experimental validation. Section 6 concludes. Finally, an Appendix contains detailed rankings over 27 datasets and 57 methods.

2 Two-Level Ensembles and Nominal Data

A two-level ensemble consists of a meta level and a base level. Each level contains different classifiers. The base level classifier output is used as input for the meta level classifier. These ensembles allow the use of a numeric-input classifier with nominal inputs: the numerical-input classifier is used as the meta classifier, and a classifier that can work directly with nominal data is used as the base classifier [6]. We have considered three two-level schemes (i.e., Cascading, Stacking and Grading).

Cascade Generalization (also known as Cascading) [4] is an architecture for designing classification algorithms. Cascading is commonly used with two levels. Level-1 (the base level) is trained with the raw dataset, whereas Level-2 (the meta level) uses all the original features from the dataset plus the output of the base level classifier as its inputs. Base level outputs are vectors representing a conditional probability distribution (p_1, . . . , p_c), where p_i is the base level classifier's estimated probability that the input data belongs to class i, and c is the number of classes in the dataset. Cascading can be extended to n levels.

Training one classifier with the output of another classifier tends to lead to overfitting. However, Cascading is not degraded by overfitting, because the base level and meta level classifiers are different, and because the meta level classifier also uses the original samples as input features.
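The resulting data flow can be summarized in a few lines (schematic Java with a hypothetical Classifier abstraction of our own; feature encoding is abstracted away as numeric arrays, and this is not the WEKA-based implementation described in Section 5):

// Schematic two-level Cascading: the meta classifier sees the original
// (possibly nominal-to-numeric converted) features plus the c class
// probabilities estimated by the base classifier.
interface Classifier {
    void train(double[][] x, int[] y);
    double[] probs(double[] x);            // class probability distribution
}

class Cascade {
    static double[][] extend(double[][] x, Classifier base) {
        double[][] ext = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            double[] p = base.probs(x[i]);                 // c new attributes
            ext[i] = new double[x[i].length + p.length];
            System.arraycopy(x[i], 0, ext[i], 0, x[i].length);
            System.arraycopy(p, 0, ext[i], x[i].length, p.length);
        }
        return ext;
    }

    static void train(Classifier base, Classifier meta, double[][] x, int[] y) {
        base.train(x, y);                  // base level on the raw data
        meta.train(extend(x, base), y);    // meta level on original + probabilities
    }
}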

Sometimes, nominal-to-numeric conversion can result in data representations that are not linearly separable (e.g., NBF conversion). Cascading is presented in [6] as a solution to this problem, because the input space is augmented with new features that transform the non-linearly separable data representation into another that is likely to avoid the problem.

In our approach, these new features are calculated by a base level classifier which can deal directly with nominal data (i.e., a Decision Tree).

On the other hand, the Cascading meta level input consists of the nominal-to-numeric converted attributes (e.g., using NBF or VDM) plus the base level classifier output. Thereby, the VDM input dimensionality grows as the number of nominal attributes rises, whereas the Cascading base level output only adds c new dimensions (where c is the number of classes). However, the meta level input dimensionality is also high, because nominal to binary conversion is still required.

Using a Decision Tree at the base level to construct continuous features for another method at the meta level can also be achieved with other, similar approaches.

Stacked Generalization, also known as Stacking [13], also uses two or more levels. Levels are numbered in a different way in [13], but for the sake of simplicity we still use the base/meta notation for levels. Stacking usually works with more than one base classifier, and those base classifiers are usually different.

Let b be the number of base classifiers we wish to obtain, and c the number of classes. n disjoint partitions (or folds) of the training data are used to train n×b initial base classifiers (one group of n initial classifiers for each final base classifier we want to obtain). All the initial classifiers in a group are of the same type as one of the final base classifiers. Each initial base classifier is trained with data from n−1 partitions and tested with data from the remaining partition (cross-validation), so the testing partition is different for each initial classifier in the group. Thus each original training sample is used for testing exactly b times (once for each of the b final base classifiers). Then the meta classifier is trained with a new dataset of dimensionality c×b plus the class attribute. For each original instance, these c×b features are obtained from the c probability estimations of the b classifiers that tested the instance. Finally, the n×b classifiers are discarded, and a new set of b base classifiers is computed, this time using the whole training set. Overfitting is thus avoided, because base and meta classifiers are different and because, due to cross-validation, the meta classifier training set differs from the base classifiers' training set.
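The construction of the Stacking meta-level training set can be sketched as follows (schematic Java for the b = 1 case used later in this paper; the BaseClassifier abstraction, the fold-splitting scheme and the classifier factory are our own illustrative assumptions):

import java.util.function.Supplier;

interface BaseClassifier {
    void train(double[][] x, int[] y);
    double[] probs(double[] x);   // c class-probability estimates
}

class StackingMeta {
    /** Builds the meta-level inputs: each instance is described by the c
     *  probabilities predicted by a base classifier that never saw it. */
    static double[][] metaFeatures(double[][] x, int[] y, int nFolds,
                                   Supplier<BaseClassifier> factory) {
        int n = x.length;
        double[][] meta = new double[n][];
        for (int f = 0; f < nFolds; f++) {
            // fold f is the test part, the remaining folds form the training part
            java.util.List<Integer> train = new java.util.ArrayList<>();
            java.util.List<Integer> test = new java.util.ArrayList<>();
            for (int i = 0; i < n; i++) (i % nFolds == f ? test : train).add(i);
            double[][] tx = new double[train.size()][];
            int[] ty = new int[train.size()];
            for (int k = 0; k < train.size(); k++) {
                tx[k] = x[train.get(k)];
                ty[k] = y[train.get(k)];
            }
            BaseClassifier c = factory.get();
            c.train(tx, ty);
            for (int i : test) meta[i] = c.probs(x[i]);   // each instance tested exactly once
        }
        return meta;   // the meta classifier is then trained on (meta, y)
    }
}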

Grading [10] is another two-level method that uses n folds like Stacking. In [10] levels are also numbered in a different way than in Cascading, but again we keep our base/meta notation. Just as in Stacking, a set of n×b base classifiers is calculated beforehand and tested with cross-validation. The output of each base learner is a binary attribute that signifies whether the prediction is correct or not (graded prediction). Subsequently, b meta classifiers (all of the same type) are calculated, each one being trained with the original features plus the graded predictions obtained from one group of n base classifiers. Graded predictions are used as the meta classifier class. Finally, the n×b base classifiers are discarded, and b new base classifiers are calculated using the whole training set. The final prediction is made by the meta classifiers using a voting rule.

Stacking and Grading can be used for nominal data in the same way we have used Cascading (i.e., a single Decision Tree at the base level). Note that Stacking and Grading require an internal cross-validation, so they are computationally more expensive than Cascading even when only one base classifier is required.

On the other hand, Stacking meta classifiers do not use nominal-to-binary converted data, so the meta input dimension is not increased and is always equal to the number of classes when only one base classifier is used. However, the Cascading and Grading meta input dimension increases, because their meta classifiers take both the base prediction and the nominal-to-binary converted attributes as their input values.

Usually, Stacking and Grading are used with several base classifiers, but we have tested them with one base classifier in order to compare Cascading with other combinations of “one” Linear Classifier (i.e., SVM) with “one” Decision Tree.

In this paper we will use the notation Two-Level-Ensemble[M=C1( );B=C2( )](x), which represents a two-level ensemble where classifier C1( ) is used as the meta classifier and classifier C2( ) is used as the base classifier. The x represents an input sample. So Cascading[M=SVM( );B=DecisionTree( )](x) is a Cascading ensemble with an SVM at the meta level and a Decision Tree at the base level. Whenever the meta or base classifier cannot deal directly with nominal data, NBF transformation is assumed.

For example, the SVM in Cascading[M=SVM( );B=DecisionTree( )](x) needs NBF, but the SVM in Stacking[M=SVM( );B=DecisionTree( )](x) does not need NBF.

3 Binary Decision Trees vs. VDM on Decision Trees

Let (x, y) be a sample from a learning set, where x is a multidimensional input vector and y the output variable. In this paper VDM(x) will represent another vector, where every component x_i of x is mapped into a group of components VDM(x_i). If x_i is nominal, VDM(x_i) is obtained by applying VDM to x_i. If x_i is quantitative, VDM(x_i) is equal to x_i. So the dimension of VDM(x) is greater than or equal to that of x.

Therefore, DecisionTree(VDM(x)) is a decision tree that takes VDM-transformed features as input. In [6] DecisionTree(VDM(x)) is tested experimentally, resulting in an interesting method due to its accuracy and low computational cost.

Decision Trees applied to nominal data usually split their nodes into more than two branches. Each branch corresponds to a symbolic value of an attribute. However, the VDM transformation of features into numerical values makes the Decision Trees split their nodes into exactly two branches, one representing instances with values bigger than a threshold, and the other branch for the rest.

Working with non-binary splits raises the number of branches in each split and makes each branch classify fewer samples, so if the dataset cardinality is small, a non-binary split can lead to a too greedy partition of the input space. Thus it seems preferable to use binary trees rather than non-binary trees (e.g., DecisionTree(VDM(x)) better than DecisionTree(x)).


However, Decision Trees can be forced to make binary splits by making some slight changes in the algorithm, even when working with nominal data and without VDM. We represent Binary Decision Trees as DecisionTreeBin(x). Binary splits in DecisionTreeBin(x) test whether some value of a nominal attribute is equal or not to a symbolic value, whereas binary splits in DecisionTree(VDM(x)) are based on testing whether some value of an attribute is bigger than a threshold or not. So DecisionTreeBin(x) will lead to slightly different results than DecisionTree(VDM(x)) (see Appendix).

Note that when VDM is applied, both Decision Trees and Binary Decision Trees have only “bigger than” splitting nodes. That is:

DecisionTree(VDM(x)) ≡ DecisionTreeBin(VDM(x)).    (1)

This equality will reduce considerably the number of possible combinations containing Decision Trees in two-level ensembles.

4 Equivalent Ensembles

VDM can be implemented as a filter that keeps numeric attributes and transforms nominal attributes into a set of probabilities. We will use C(VDM( )) to denote a classifier C that applies the VDM filter to its inputs.

Thereby, C(VDM( )) can be used to denote that VDM is applied to one level of a two-level ensemble. For example, Cascading[M=SVM(VDM( ));B=DecisionTree( )](x) applies VDM only to the SVM inputs.

So, taking into account (1), we can derive many equivalent two-level ensembles. The first and second rows in Table 1 show some examples that can be derived from this equivalence. The third row shows that VDM calculated beforehand for Cascading is the same as applying the VDM filter to each level. This is only true for Cascading: in Stacking and Grading the VDM estimations are not the same for the whole data set as for each fold. So VDM calculated beforehand for Stacking and Grading is not the same as applying the VDM filter to each level.

Because the meta level in Stacking only uses continuous probabilities as inputs, some other equivalences can be derived:

1. It is not useful to apply VDM to the meta level of Stacking. See the example in the fourth row of Table 1.

2. There is no difference between using Binary Decision Trees and non-binary Decision Trees at the Stacking meta level. See the example in the fifth row of Table 1.

These two rules can be combined, resulting in transitive equivalences; see, for example, the sixth row in Table 1.

All these simplification rules will be taken into account to reduce the number of methods to test in the experimental validation design.


Table 1. Examples of equivalent two-level ensembles with VDM. Each method in the left column is equivalent to the method in the right column.

1. Stacking[M=SVM( );B=DecisionTree(VDM( ))](x) ≡ Stacking[M=SVM( );B=DecisionTreeBin(VDM( ))](x)
2. Cascading[M=DecisionTree( );B=DecisionTreeBin( )](VDM(x)) ≡ Cascading[M=DecisionTreeBin( );B=DecisionTree( )](VDM(x))
3. Cascading[M=SVM(VDM( ));B=DecisionTree(VDM( ))](x) ≡ Cascading[M=SVM( );B=DecisionTree( )](VDM(x))
4. Stacking[M=SVM( );B=DecisionTree( )](x) ≡ Stacking[M=SVM(VDM( ));B=DecisionTree( )](x)
5. Stacking[M=DecisionTree( );B=SVM( )](x) ≡ Stacking[M=DecisionTreeBin( );B=SVM( )](x)
6. Stacking[M=DecisionTree( );B=SVM( )](x) ≡ Stacking[M=DecisionTreeBin(VDM( ));B=SVM( )](x)

5 Experimental Validation

We have implemented the Generalized Cascading and VDM algorithms in Java within WEKA [12]. We have tested the following methods, using 27 datasets:

1. A Decision Tree (binary or not). WEKA J.48 is an implementation of the C4.5 Release 8 Quinlan Decision Tree [9]. We denote the binary-split version as J.48bin. Both implementations (J.48 and J.48bin) were also applied to two-level ensembles that use a Decision Tree as their base or meta method.

2. SMO [8], an implementation of SVM provided by WEKA. A linear kernel was used. This implementation was also used in all two-level ensembles that required an SVM as their base or meta method.

3. J.48 (binary or not) with VDM, and SMO with VDM. In both cases, nominal features were replaced by the VDM output.

4. Cascading with a Decision Tree at the base level and an SVM at the meta level, and vice versa. VDM was also applied to both levels, none of them, or only one of them.

5. The WEKA Stacking implementation, with ten folds, using J.48 (binary and not binary) as base and SMO as meta, and vice versa. VDM was also applied to both levels, none of them, or only one of them.

6. The WEKA Grading implementation, with ten folds, using J.48 (binary and not binary) as base and SMO as meta, and vice versa. VDM was also applied to both levels, none of them, or only one of them.

The equivalences derived in Sections 3 and 4 have been applied, resulting in 57 methods. A 10×10-fold cross-validation was applied. The corrected resampled t-test statistic from [7] was used (significance level 5%) for comparing the methods. The 100 (10×10) results for each method and dataset are considered as the input for the test.
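For reference, the corrected resampled t-statistic of [7] is usually stated as follows (our summary of the standard formulation, not a quotation from the paper):

t = \frac{\frac{1}{k}\sum_{j=1}^{k} d_j}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\hat{\sigma}^2}}

where the d_j are the k = 100 paired accuracy differences between the two compared methods, n_1 and n_2 are the training and test set sizes of each run, and \hat{\sigma}^2 is the sample variance of the d_j; the n_2/n_1 term corrects the optimistic variance estimate caused by the overlapping training sets across runs.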

Table 2 shows the datasets used. Most of them are from the UCI repository [1], and the rest are from Statlib. All the selected datasets have no numerical or ordinal attributes.1 The only modifications made to the datasets were: (i) ID attributes have been removed (i.e., in the Molecular biology promoters and Splice datasets). (ii) In Monks-1 and Monks-2, only the training set is considered, because the testing set is a subset of the training set. (iii) In Monks-3 and Spect, a union is made of both training and testing sets.

Table 2. Datasets used in the experimental validation. #E: number of instances, #A: number of attributes including the class, #C: number of classes. (U) from UCI, (S) from Statlib.

Dataset              #E     #A  #C
Audiology (U)        226    70  24
Boxing1 (S)          120     4   2
Boxing2 (S)          132     4   2
Breast cancer (U)    286    10   2
Car (U)              1728    7   4
Dmft (S)             797     5   6
Fraud (S)            42     12   2
Kr-vs-kp (U)         3196   37   2
Mol Biol Prmtrs (U)  106    58   2
Monks-1 (U)          432     7   2
Monks-2 (U)          432     7   2
Monks-3 (U)          438     7   2
Mushroom (U)         8124   23   2
Nursery (U)          12960   9   5
Postop. patient (U)  90      9   3
Primary tumor (U)    339    18  21
Solar flare 1 C (U)  323    11   3
Solar flare 1 M (U)  323    11   4
Solar flare 1 X (U)  323    11   2
Solar flare 2 C (U)  1066   11   8
Solar flare 2 M (U)  1066   11   6
Solar flare 2 X (U)  1066   11   3
Soybean (U)          683    36  19
Spect (U)            267    23   2
Splice (U)           3190   61   3
Tic-tac-toe (U)      958    10   2
Vote (U)             435    17   2

The first and third columns in Table 7 (see the Appendix) rank the methods according to the statistical test used. For each considered dataset and pair of methods, a test is done. The test can conclude that there is no difference between the methods, or that one method is better than the other. Hence, each method will have a number of wins and losses. The difference between these two values is used for ranking the methods.

The second and fourth columns in Table 7 show the methods ranked by their average ranking [2]. A ranking is done for each dataset: a number is assigned to each method and dataset according to its ranking position in that dataset. If there are ties, average ranks are assigned. Then, for each method, the average ranking is calculated over all datasets, as in the fourth column of Table 7. Methods are then ordered increasingly using these values.

In both rankings equivalent methods were omitted. From both rankings we can draw the following remarks:

1. J.48bin(x) seems to perform better than J.48(VDM(x)), and both of them perform better than J.48(x).

1 Some attributes from these datasets are in fact ordinal, but for convenience we have treated them as nominal, just as they appear on the Weka web site. Retrieved 3 March 2007 from http://www.cs.waikato.ac.nz/ml/weka/


2. SMO(x) is not improved by VDM (i.e., SMO(VDM(x))).

3. The best ranked methods are Cascading configurations using an SMO at the meta level and a Decision Tree at the base level. The two best methods in both rankings are Cascading[M=SMO(VDM( ));B=J.48bin( )](x) and Cascading[M=SMO( );B=J.48bin( )](x). So using J.48bin at the base level seems to be a better enhancement than using VDM at the meta level.

4. The two most computationally expensive options (i.e., Stacking and Grading) tend to perform worse than Cascading, and sometimes even worse than J.48 and SMO on their own.

5. Cascading with SMO as the base method is not a bad combiner, but it tends to perform worse than using a Decision Tree as the base method.

The latter fact was already discussed in [6]. In [4] some ideas are presented for choosing suitable classifiers for each Cascading level. Our experimental results suggest putting the more unstable classifier at the base level, whereas [4] suggests putting it at the meta level, because by selecting unstable learners, “we are able to fit more complex decision surfaces, taking into account the ‘stable’ surfaces drawn by the low level learners”. An experimental validation is provided in [4] over 26 UCI datasets to show it, but in that experiment nominal and continuous attributes are mixed in the datasets.

However, those “stable surfaces” can be drawn in an inappropriate way when the data is nominal, especially if the method at the low level cannot deal directly with nominal data and needs some kind of conversion, as is the case for SVM. So we think there is no contradiction between [4] and [6], because the experiments were focused on different kinds of data.

In the rankings presented in the Appendix there are a lot of weak methods, so the results could be distorted. Therefore, we have focused only on the following methods:

1. The seven methods corresponding to the intersection of the top ten methods in both rankings. This intersection also contains the top three methods of both rankings.

2. The base methods on their own (i.e., SMO(x), J.48(x), J.48bin(x)).

3. VDM applied to the latter methods (i.e., SMO(VDM(x)), J.48(VDM(x))). Note that J.48(VDM(x)) is the same as J.48bin(VDM(x)).

The 10×10-fold cross-validation and the corrected resampled t-test statistic were considered again for these 12 methods. Accuracy results are shown in Tables 3 and 4. Bold font has been used to mark the best method of both tables for each dataset. Significant wins and losses against the best ranked configuration (Cascading using an SVM filtered with VDM as the meta classifier and a Binary Decision Tree as the base classifier) are also marked: the symbol “◦” indicates a significant win, and the symbol “•” indicates a significant loss. The last row of both tables reports significant wins, ties and losses against that method.

According to this last row, Cascading[M=SMO(VDM( ));B=J.48bin( )](x) is the best method, but the differences between this method and the rest of the Cascading methods are not very important. On the other hand, only binary trees (i.e., J.48bin(x) and J.48(VDM(x))) have comparable results.


Table 3. Accuracy of the considered Cascading ensembles. The columns are:
(1) Casc[M=SMO(VDM( ));B=J.48bin( )](x)
(2) Casc[M=SMO( );B=J.48bin( )](x)
(3) Casc[M=SMO( );B=J.48( )](x)
(4) Casc[M=SMO(VDM( ));B=J.48( )](x)
(5) Casc[M=SMO( );B=J.48( )](VDM(x))
(6) Casc[M=J.48(VDM( ));B=J.48bin( )](x)
(7) Casc[M=J.48( );B=J.48bin( )](VDM(x))

Dataset           (1)      (2)      (3)      (4)      (5)      (6)      (7)
Audiology         84.21    80.91 •  82.65    84.74    84.34    76.65 •  76.86 •
Boxing1           85.33    84.00    83.67    85.42    81.17    82.83    81.25
Boxing2           79.77    80.98    80.44    78.76    79.91    79.41    79.75
Breast cancer     70.50    70.40    74.14    74.28    70.92    70.43    70.96
Car               96.72    96.76    95.14 •  95.03 •  97.32    97.21 ◦  97.54
Dmft              20.10    20.22    19.81    19.89    20.18    19.72    19.72
Fraud             70.75    68.85    73.60    75.65    85.50 ◦  72.55    86.40 ◦
Kr-vs-kp          99.44    99.44    99.44    99.44    99.35    99.44    99.37
M.Biol.Prm        88.98    90.05    91.42    91.43    86.55    78.90 •  76.82 •
Monks-1           98.17    98.19    96.60    96.60    76.21 •  99.51    76.12 •
Monks-2           94.79    94.89    67.14 •  67.14 •  89.70 •  96.37    91.14
Monks-3           98.63    98.63    98.63    98.63    98.63    98.63    98.63
Mushroom          100.0    100.0    100.0    100.0    100.0    99.99    100.0
Nursery           99.36    99.36    98.29 •  98.25 •  99.42    99.59 ◦  99.59 ◦
Post.patient      69.11    69.11    67.22    67.11    68.78    69.22    69.33
Prim. tumor       43.13    43.25    44.31    43.93    43.55    40.61    40.56
Solar f.1 C       89.70    89.73    88.95    88.92    88.18 •  89.61    88.15 •
Solar f.1 M       89.24    89.55    89.67    89.33    88.99    89.64    89.58
Solar f.1 X       97.84    97.84    97.84    97.84    97.84    97.84    97.84
Solar f.2 C       82.59    82.67    82.91    82.77    82.76    82.70    82.90
Solar f.2 M       96.62    96.62    96.62    96.62    96.62    96.62    96.62
Solar f.2 X       99.53    99.53    99.53    99.53    99.53    99.53    99.53
Soybean           93.91    93.78    94.13    93.95    93.83    92.43    92.75
Spect             81.96    81.62    81.62    81.96    82.14    81.84    82.29
Splice            94.93    93.96 •  93.55 •  95.33    94.80    94.32 •  94.24
Tic-tac-toe       93.81    94.16    97.35 ◦  85.53 •  94.28    94.06    94.49
Vote              96.75    96.69    96.69    96.75    96.75    97.19    97.19
wins/ties/losses    –      0/25/2   1/22/4   0/23/4   1/23/3   2/22/3   2/21/4


Table 4. Accuracy of the base methods on their own, or with VDM.

Dataset           SMO(x)    SMO(VDM(x))  J.48(x)   J.48(VDM(x))  J.48bin(x)
Audiology         80.77 •   84.16        77.26 •   76.73 •       76.92 •
Boxing1           81.58     83.67        87.00     81.08         85.33
Boxing2           82.34     79.15        80.44     79.91         79.62
Breast cancer     69.52     68.97        74.28     70.88         70.50
Car               93.62 •   93.20 •      92.22 •   97.30         96.63
Dmft              21.14     20.73        19.60     20.06         19.82
Fraud             76.10     73.10        63.05     86.40 ◦       66.10
Kr-vs-kp          95.79 •   96.79 •      99.44     99.36         99.44
Mol.Biol.Prm      91.01     91.65        79.04 •   76.22 •       79.09 •
Monks-1           74.86 •   75.00 •      96.60     76.28 •       98.33
Monks-2           67.14 •   67.14 •      67.14 •   90.07 •       94.31
Monks-3           96.12 •   95.89 •      98.63     98.63         98.63
Mushroom          100.00    100.00       100.00    100.00        99.99
Nursery           93.08 •   93.08 •      97.18 •   99.42         99.36
Post. patient     67.33     67.11        69.78     69.33         70.11
Primary tumor     47.09     42.69        41.39     41.22         41.19
Solar flare1 C    88.49 •   88.19 •      88.95     88.15 •       89.73
Solar flare1 M    89.70     89.33        89.98     89.61         89.76
Solar flare1 X    97.84     97.84        97.84     97.84         97.84
Solar flare2 C    82.91     82.77        82.93     82.89         82.72
Solar flare2 M    96.62     96.62        96.62     96.62         96.62
Solar flare2 X    99.53     99.53        99.53     99.53         99.53
Soybean           93.10     93.38        91.78 •   92.77         92.30
Spect             83.61     82.79        81.35     81.69         81.35
Splice            92.88 •   95.47        94.17     94.28         94.36 •
Tic-tac-toe       98.33 ◦   73.90 •      85.28 •   94.28         93.79
Vote              95.77     96.04        96.57     96.57         96.57
wins/ties/losses  1/17/9    0/19/8       0/20/7    1/21/5        0/24/3



Again, we rank the methods using the difference between significant wins and losses (see Table 5) and using the average rank over the datasets (see Table 6).

Table 5. Ranking by the difference between significant wins and losses.

Wins-Losses  Wins  Losses  Method
 44           52     8     Casc[M=SMO(VDM( ));B=J.48bin( )](x)
 32           46    14     Casc[M=SMO( );B=J.48bin( )](x)
 25           44    19     Casc[M=SMO( );B=J.48( )](VDM(x))
 19           42    23     Casc[M=J.48(VDM( ));B=J.48bin( )](x)
 14           42    28     Casc[M=SMO( );B=J.48( )](x)
  7           34    27     J.48bin(x)
  6           38    32     Casc[M=J.48( );B=J.48bin( )](VDM(x))
  6           39    33     Casc[M=SMO(VDM( ));B=J.48( )](x)
 -3           32    35     J.48(VDM(x))
-40           15    55     J.48(x)
-55           20    75     SMO(VDM(x))
-55           21    76     SMO(x)

Table 6. Average ranking of the considered methods.

Avg. Rank  Method
5.80       Casc[M=SMO(VDM( ));B=J.48bin( )](x)
5.80       Casc[M=SMO( );B=J.48( )](x)
5.93       Casc[M=SMO(VDM( ));B=J.48( )](x)
5.94       Casc[M=SMO( );B=J.48bin( )](x)
6.00       Casc[M=SMO( );B=J.48( )](VDM(x))
6.43       Casc[M=J.48( );B=J.48bin( )](VDM(x))
6.74       Casc[M=J.48(VDM( ));B=J.48bin( )](x)
6.83       J.48bin(x)
6.85       SMO(x)
6.87       J.48(VDM(x))
7.02       J.48(x)
7.80       SMO(VDM(x))

Cascading[M=SMO(VDM( ));B=J.48bin( )](x) is again the winner in both rankings. Once more, Cascading ensembles tend to rank better in all rankings. Binary trees seem the best singleton option, and the use of binary trees as base classifiers in Cascading ensembles seems to be the reason behind the performance improvement in the Table 5 ranking. Surprisingly, SMO(VDM(x)) is the worst option in both rankings.


6 Conclusions

Some learning algorithms need numbers as inputs. Translating nominal data into numerical data allows these methods to be applied to symbolic data. NBF and VDM are two such translation techniques. The transformation into numerical values can be improved if new input dimensions are provided by another classifier. These new dimensions can result in more linearly separable representations of the data, which is an important issue for many numerical learning algorithms. We have proposed and tested Cascading ensembles to provide this extra dimensionality in order to improve SVM performance with nominal data.

The experimental validation shows that Decision Trees as Cascading base classifiers are very suitable for building these extra features. Moreover, according to the ranking given by the differences between significant wins and losses, Binary Decision Trees seem to perform better as base classifiers than non-binary Decision Trees. We suppose that the reason is that trees with non-binary splits can sometimes lead to a too greedy data analysis.

We have also tested other two-level classifiers (i.e., Stacking and Grading), getting worse results and needing more computation time. Finally, we tried using the VDM nominal-to-numeric transformation at one or both levels of the Cascading, Stacking and Grading ensembles. Although VDM appears in the best ranked method (Cascading using an SVM filtered with VDM as the meta classifier and a Binary Decision Tree as the base classifier), a second, deeper experimental validation reveals that the differences with other Cascading configurations having a Decision Tree as the base classifier are not very significant. Moreover, many of the results obtained show that SVM gets equal or worse accuracy when VDM is applied instead of NBF.

Our next work is to study these Cascading configurations with numerical methods that do not require linear separability (e.g., SVM with non-linear kernels).

References

1. C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.

2. J. Demsar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

3. W. Duch, K. Grudzinski, and G. Stawski. Symbolic features in neural networks. In Proc. 5th Conference on Neural Networks and Soft Computing, pages 180–185, Zakopane, 2000.

4. J. Gama and P. Brazdil. Cascade generalization. Machine Learning, 41(3):315–343, 2000.

5. K. Grabczewski and N. Jankowski. Transformations of symbolic data for continuous data oriented models. In Artificial Neural Networks and Neural Information Processing, ICANN/ICONIP 2003, volume 2714, pages 359–366. Springer, 2003.

6. Jesus Maudes, Juan J. Rodríguez, and Cesar García-Osorio. Cascading for nominal data. In 7th International Workshop on Multiple Classifier Systems, MCS 2007, volume 4472 of LNCS, pages 231–240. Springer, 2007.

7. C. Nadeau and Y. Bengio. Inference for the generalization error. Machine Learning, 52:239–281, 2003.

8. J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schoelkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods. MIT Press, 1998.

9. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

10. A. K. Seewald and J. Furnkranz. An evaluation of grading classifiers. In Advances in Intelligent Data Analysis, 4th International Conference, IDA 2001, pages 115–124. Springer, 2001.

11. C. Stanfill and D. Waltz. Toward memory-based reasoning. Communications of the ACM, 29:1213–1229, 1986.

12. I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition, 2005. http://www.cs.waikato.ac.nz/ml/weka.

13. D. Wolpert. Stacked generalization. Neural Networks, 5:241–260, 1992.

A 57 Methods × 27 Datasets Rankings

Table 7. 57 methods ordered by average ranking. W-L Rank = rank position by significant wins minus significant losses. W-L = difference between significant wins and losses. Avg Rank = rank position by average ranking over all datasets. Avg = average ranking value over all datasets. Equivalent methods are omitted. Casc = Cascading, Stack = Stacking, Grad = Grading.

W-L Rank  Avg Rank  W-L   Avg    Method (equivalent methods are omitted)
 1          1        294  20.85  Casc[Meta=SMO(VDM( ));Base=J.48bin( )](x)
 2          2        250  21.78  Casc[Meta=SMO( );Base=J.48bin( )](x)
 3          7        215  23.96  Casc[Meta=SMO( );Base=J.48( )](VDM(x))
 4          8        185  24.61  Casc[Meta=J.48(VDM( ));Base=J.48bin( )](x)
 5          3        172  22.15  Casc[Meta=SMO( );Base=J.48( )](x)
 6         52        146  34.17  Casc[Meta=SMO( );Base=J.48(VDM( ))](x)
 7.5       53        142  34.37  Casc[Meta=J.48bin( );Base=J.48(VDM( ))](x)
 7.5        6        142  23.85  Casc[Meta=SMO(VDM( ));Base=J.48( )](x)
 9          9        131  25.06  Casc[Meta=J.48( );Base=J.48bin( )](VDM(x))
10         13        128  25.59  J.48bin(x)
11         15        127  26.02  Casc[Meta=J.48bin( );Base=SMO( )](x)
12          4        126  23.44  Casc[Meta=J.48( );Base=J.48bin( )](x)
13         25        118  28.26  Stack[Meta=J.48( );Base=J.48bin( )](x)
14         14        111  25.85  Casc[Meta=J.48bin(VDM( ));Base=J.48( )](x)
15          5        107  23.83  Casc[Meta=J.48bin( );Base=J.48( )](x)
16         57        102  37.28  Casc[Meta=J.48( );Base=J.48bin(VDM( ))](x)
17         20         99  27.22  Grad[Meta=J.48( );Base=J.48bin( )](x)
18         16         94  26.39  J.48(VDM(x))
19         23         90  28.04  Stack[Meta=J.48bin( );Base=J.48(VDM( ))](x)
20         24         88  28.09  Stack[Meta=J.48bin( );Base=J.48( )](VDM(x))


Table 7. Continued from previous page.

W-L Rank  Avg Rank  W-L   Avg    Method (equivalent methods are omitted)
21         11         83  25.28  Stack[Meta=SMO( );Base=J.48bin( )](x)
22.5       18         71  26.85  Grad[Meta=J.48( );Base=J.48bin(VDM( ))](x)
22.5       33         71  29.57  Casc[Meta=J.48( );Base=SMO( )](VDM(x))
24         12         69  25.56  Stack[Meta=SMO( );Base=J.48( )](VDM(x))
25         26         68  28.52  Grad[Meta=J.48(VDM( ));Base=J.48bin( )](x)
26         10         67  25.15  Stack[Meta=SMO( );Base=J.48(VDM( ))](x)
27         31.5       62  29.54  Casc[Meta=J.48bin( );Base=SMO(VDM( ))](x)
28         31.5       38  29.54  Casc[Meta=J.48(VDM( ));Base=SMO( )](x)
29         21         13  27.39  Grad[Meta=J.48(VDM( ));Base=J.48bin(VDM( ))](x)
30         39         12  31.24  Grad[Meta=SMO(VDM( ));Base=J.48bin( )](x)
31         28          9  28.69  Grad[Meta=SMO( );Base=J.48bin( )](x)
32         35          8  29.98  Grad[Meta=J.48bin( );Base=J.48(VDM( ))](x)
33         17          3  26.59  Casc[Meta=J.48( );Base=SMO( )](x)
34         30          0  29.22  Grad[Meta=J.48bin( );Base=J.48( )](VDM(x))
35         51        -23  33.81  Grad[Meta=SMO( );Base=J.48(VDM( ))](x)
36         45        -26  32.20  Grad[Meta=SMO(VDM( ));Base=J.48(VDM( ))](x)
37         43        -53  32.07  Grad[Meta=SMO( );Base=J.48( )](VDM(x))
38         19        -72  27.09  J.48(x)
39         47        -75  32.69  Casc[Meta=J.48( );Base=SMO(VDM( ))](x)
40         27        -95  28.59  Grad[Meta=J.48( );Base=SMO( )](x)
41         29       -117  29.17  Grad[Meta=J.48bin( );Base=SMO( )](x)
42         22       -126  27.87  Stack[Meta=SMO( );Base=J.48( )](x)
43.5       46       -134  32.31  Grad[Meta=J.48( );Base=SMO(VDM( ))](x)
43.5       42       -134  32.02  Stack[Meta=J.48bin( );Base=J.48( )](x)
44.5       34       -137  29.78  SMO(x)
44.5       38       -137  30.96  Grad[Meta=J.48(VDM( ));Base=SMO( )](x)
47         49       -173  32.89  SMO(VDM(x))
47.5       36       -174  30.09  Grad[Meta=J.48bin(VDM( ));Base=J.48( )](x)
47.5       37       -174  30.19  Grad[Meta=J.48bin( );Base=J.48( )](x)
50         41       -179  31.85  Grad[Meta=J.48( );Base=SMO( )](VDM(x))
51         50       -184  32.93  Grad[Meta=J.48(VDM( ));Base=SMO(VDM( ))](x)
52         44       -185  32.17  Grad[Meta=SMO(VDM( ));Base=J.48( )](x)
53         48       -187  32.87  Grad[Meta=J.48bin( );Base=SMO(VDM( ))](x)
54         40       -210  31.59  Grad[Meta=SMO( );Base=J.48( )](x)
55.5       56       -278  35.72  Stack[Meta=J.48( );Base=SMO( )](VDM(x))
55.5       55       -278  35.15  Stack[Meta=J.48( );Base=SMO(VDM( ))](x)
57         54       -290  35.06  Stack[Meta=J.48( );Base=SMO( )](x)


Ensemble Clustering with a Fuzzy Approach

Roberto Avogadri and Giorgio Valentini

DSI, Dipartimento di Scienze dell’Informazione,
Universita degli Studi di Milano,
Via Comelico 39, 20135 Milano, Italy
{avogadri,valentini}@dsi.unimi.it

Abstract. Ensemble clustering is a novel research field that extends to unsupervised learning the approach originally developed for classification and supervised learning problems. In particular, ensemble clustering methods have been developed to improve the robustness and accuracy of clustering algorithms, as well as the ability to capture the structure of complex data. In many clustering applications an example may belong to multiple clusters, and the introduction of fuzzy set theory concepts can improve the level of flexibility needed to model the uncertainty underlying real data in several application domains. In this paper, we propose an unsupervised fuzzy ensemble clustering approach that combines the flexibility of fuzzy sets with the robustness of ensemble methods. Our algorithmic scheme can generate different ensemble clustering algorithms that allow the final consensus clustering to be obtained in both crisp and fuzzy formats.

1 Introduction

The ensemble clustering approach has been appreciated for many qualities, like scalability and parallelism, the ability to capture complex data structure, and robustness to noise [19]. Ensemble methods can combine both different data and different clustering algorithms. For instance, ensemble algorithms have been used in data mining to combine heterogeneous data or to combine data in a distributed environment [15]. Other research lines proposed to combine heterogeneous clustering algorithms to generate an overall “consensus” ensemble clustering, in order to exploit the different characteristics of the clustering algorithms [17]. In another general approach to ensemble clustering, multiple instances of the data are obtained through “perturbations” of the original data: a clustering algorithm is applied to the multiple perturbed data sets and the results are combined to achieve the overall ensemble clustering. In this context several techniques have been proposed, such as noise injection, bagging and random projections [1, 4]. These methods try to improve both the accuracy and the diversity of each component (base) clustering. In fact, several works showed that the diversity among the solutions of the components of the ensemble is one of the crucial factors in developing robust and reliable ensemble clustering algorithms [2], [20]. In many


“real world” clustering applications it may occur that an example belongs to more than one cluster, and in these cases traditional clustering algorithms are not able to capture the real nature of the data. Consider, for instance, general clustering problems in bioinformatics, such as the discovery of functional classes of genes: it is well known that a single gene may participate in different biological processes, and thus it may belong to multiple functional classes of genes. Sometimes it is enough to use hard clustering algorithms and to relax the condition that the final clustering has to be a partition, but more generally different techniques based on probabilistic approaches have been developed (e.g., [3]).

In this contribution we propose an ensemble clustering algorithmic scheme useful for dealing with problems where we can capture and manage the possibility of an element belonging to more than one class with different degrees of membership. To achieve this objective we use fuzzy set theory to express the uncertainty of the data ownership, and other fuzzy tools to transform the fuzzy clusterings into crisp clusterings. To perturb the data we apply random projections with low distortion [4], a method well suited to manage high dimensional data (a high number of attributes or “features”), reducing the computational time and improving at the same time the diversity of the data. Combining ensemble clustering techniques and fuzzy set theory, on one hand we can improve the accuracy and the robustness of the consensus ensemble clustering, and on the other hand we can deal with the uncertainty and the fuzziness underlying real data.

In the following sections we introduce the random projections and the fuzzy operators that characterize our proposed unsupervised ensemble methods. Then we describe the fuzzy ensemble clustering algorithmic scheme and the algorithms that can be obtained from it. In Sect. 5 we present some results of their application to synthetic data sets. The discussion and the conclusions end the paper.

2 Random projections

Our proposed method applies random projections with low distortion to perturb the data. The objective is to reduce the dimension (number of features) of the data while “preserving” the distances, and hence the structure, of the data. Consider a pair of Euclidean spaces, the initial one of high dimension d and the target one of dimension d', with d > d'; a random projection θ is a randomized function θ : R^d → R^{d'} such that, for all p, q ∈ R^d and 0 < ε < 0.5, with high probability the following inequalities hold:

1 - \varepsilon \le \frac{\|\theta(p) - \theta(q)\|^2}{\|p - q\|^2} \le 1 + \varepsilon

A type of random projection is the Plus-Minus-one (PMO) [13] θ(p) = R∗p,represented by matrices R, with elements Rij = 1/

√d′(Aij), where A is a d′ × d

matrix and Ai,j ∈ {−1, 1} such that Prob(Ai,j = 1) = Prob(Ai,j = −1) = 1/2


A key problem consists in finding a d′ such that, for every pair of data points p, q ∈ Rd, the distances between the projections θ(p) and θ(q) are approximately preserved with high probability.

A natural measure of the approximation is the distortion distθ:

distθ(p, q) = ‖θ(p) − θ(q)‖² / ‖p − q‖²   (1)

If distθ(p, q) = 1, the distances are preserved; if 1 − ε ≤ distθ(p, q) ≤ 1 + ε, we say that an ε-distortion level is introduced.

The main result on random projections is due to the Johnson-Lindenstrauss (JL) lemma [7]: given N vectors {x1, ..., xN} ∈ Rd, if d′ ≥ c log N / ε², where c is a suitable constant, then there exists a projection θ : Rd → Rd′ that, with high probability, “preserves the distances”:

(1 − ε) d(xi, xj) ≤ d(θ(xi), θ(xj)) ≤ (1 + ε) d(xi, xj)

If we choose a value of d′ according to the JL lemma, we may perturb the data while introducing only bounded distortions, approximately preserving the metric structure of the original data (see [8] for more details). Examples of random projections that obey the JL lemma can be found in [8, 4]. A key observation is that the initial dimension d must not be “too small”; otherwise, by the JL lemma, the initial space and the reduced space would end up having similar dimensions, even though the final dimension d′ does not depend on d, but only on the number of samples in the data set (N) and on the chosen distortion.

In fact, our main target applications are characterized by high dimensionality, such as DNA microarray data [14], where usually few elements (samples) of high dimensionality (number of features/genes) are available. If d ≫ d′, we can save considerable computational time, working with data sets that approximately preserve the metric characteristics of the initial space. The perturbation of the data is obtained by randomly choosing, for every base learner of the ensemble, d′ projected features. However, different perturbation methods can in principle be used.

3 Fuzzy sets & fuzzy-set methods

3.1 The membership functions

In the classic set theory (also called “crisp set theory”), created by Cantor, the membership value of an object in a set can be 0 (FALSE) or 1 (TRUE). The characteristic function of a crisp set can be defined as follows:

I_{crisp sets} : {(elem, crisp set)} ↦ {0, 1}.

The fuzzy set theory [11] is a generalization of the previous theory; in fact, an object (elem) can belong only partially to a set (fuzzy set). To this end a so-called “membership function” is defined: µ_{fuzzy sets} : (elem, fuzzy set) ↦ [0, 1]. In general, the domain of a membership function of a fuzzy set U can be any set, but usually it is a discrete set (U = {u1, u2, ..., um}) or a subset of R (U = [lower_value, higher_value] or U = (lower_value, higher_value), where lower_value and higher_value can be any real numbers in [0, 1], with lower_value < higher_value). If we consider a fuzzy set A, and a membership function defined on it, we can rewrite the membership function definition as follows:

µA : U ↦ [0, 1]

so that the membership value µA(ui) describes the degree of ownership of the element ui to the set A.

3.2 Fuzzy methods

In several data clustering applications, it is useful to have a method that can capture, with a certain approximation, the “real” structure of the data, in order to obtain the “best” clustering. Through the fuzzy ensemble clustering algorithm we propose, it is possible to manage not only the possibility of overlap among the clusters, but also the degree of membership of every example of the data set in the different clusters. In some applications, however, the initial problem does not admit a strictly fuzzy answer, but at the same time it is generally useful to have an evaluation method that can use all the available information (such as the degrees of membership). We use two classical methods to “defuzzify” the results:

1. the “alpha-cut”;
2. the “hard-clustering”.

The “alpha-cut” function can be defined as follows. For every α ∈ [0, 1], the α-cut [A]α, or simply Aα, of A is:

[A]α = {u ∈ U | µA(u) ≥ α}.

Choosing a threshold value for the membership function thus allows one to obtain from a fuzzy set a crisp set, called Aα, which contains every element of U whose membership in A is at least α. The “hard-clustering” is not properly a fuzzy function: it is a method to obtain a crisp clustering from the original fuzzy clustering. The role of both functions in the algorithmic scheme will be described in more detail in Section 4.

3.3 Triangular norms

To generalize the “classical” intersection operator, in fuzzy logic the so-called triangular norms (t-norms) are often used (see [12]). A t-norm T is a function T : [0, 1] × [0, 1] → [0, 1] that satisfies:


– 1) the boundary conditions: T(0, 0) = T(0, 1) = T(1, 0) = 0
– 2) the identity condition: T(a, 1) = a, ∀ a ∈ [0, 1]
– 3) the commutative property: T(a, b) = T(b, a), ∀ a, b ∈ [0, 1]
– 4) the monotonicity property: T(a, b) ≤ T(c, d) if a ≤ c and b ≤ d, ∀ a, b, c, d ∈ [0, 1]
– 5) the associative property: T(a, T(b, c)) = T(T(a, b), c), ∀ a, b, c ∈ [0, 1]

In the literature, four basic t-norms have been proposed (we denote them respectively TM, TP, TL and TD):

TM(x, y) = min(x, y)   (minimum)   (2)

TP(x, y) = x · y   (algebraic product)   (3)

TL(x, y) = max(x + y − 1, 0)   (Łukasiewicz t-norm)   (4)

TD(x, y) = 0 if (x, y) ∈ [0, 1)², min(x, y) otherwise   (drastic product)   (5)

The following order relation exists among the previous t-norms:

TD < TL < TP < TM

In our algorithmic scheme we chose the algebraic product as the aggregation operator (see Sect. 4).
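For concreteness, the four basic t-norms can be written in a few lines of NumPy (a sketch of ours, not code from the paper; inputs are membership values in [0, 1], and broadcasting makes the functions usable on whole membership matrices):

```python
import numpy as np

# The four basic t-norms (Eqs. 2-5); inputs are membership values in [0, 1].
def t_min(x, y):          # T_M, minimum
    return np.minimum(x, y)

def t_product(x, y):      # T_P, algebraic product (the aggregation operator chosen here)
    return x * y

def t_lukasiewicz(x, y):  # T_L, Lukasiewicz t-norm
    return np.maximum(x + y - 1.0, 0.0)

def t_drastic(x, y):      # T_D, drastic product: 0 unless one argument equals 1
    return np.where((x < 1.0) & (y < 1.0), 0.0, np.minimum(x, y))
```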

4 The algorithmic scheme

4.1 General structure

The general structure of the algorithm is similar to the one proposed in [4]: data are perturbed through random projections to lower dimensional subspaces, and multiple clusterings are performed on the projected data; note that it is likely to obtain different clusterings, since the clustering algorithm is applied to different ”views” of the data. Then the clusterings are combined, and a consensus ensemble clustering is computed.

The main difference of our proposed method consists in using a fuzzy k-means algorithm as the base clustering and in applying a fuzzy approach to the combination and consensus steps of the ensemble algorithm. In particular, we can apply different crisp and fuzzy approaches both to the aggregation and consensus steps, obtaining in this way the following algorithmic scheme:


Fuzzy ensemble clustering algorithmic scheme:

1. Random projections. Generation of multiple instances (views) of compressed data through random projections (but different types of data perturbation methods, such as resampling or noise injection, can be used).

2. Generation of multiple fuzzy clusterings. The fuzzy k-means algorithm is applied to the compressed data obtained in the previous step. The output of the algorithm is a membership matrix where each element represents the membership of an example to a particular cluster.

3. “Crispization” of the base clusterings. This step is executed if a ”crisp” aggregation is performed: the fuzzy clusterings obtained in the previous step can be “defuzzified” through one of the following techniques: a) hard-clustering; b) α-cut.

4. Aggregation. If a fuzzy aggregation is performed, the base clusterings are combined using a square similarity matrix [1] MC whose elements are generated through fuzzy t-norms applied to the membership functions of each pair of examples. If a crisp aggregation is performed, the similarity matrix is built using the product of the characteristic functions of each pair of examples.

5. Clustering in the ”embedded” similarity space. The similarity matrix induces a new representation of the data based on the pairwise similarity between examples: the fuzzy k-means clustering algorithm is applied to the rows (or, equivalently, to the columns) of the similarity matrix.

6. Consensus clustering. The consensus clustering can be represented by the overall consensus membership matrix, resulting in a fuzzy representation of the consensus clustering. Alternatively, we may apply the same crispization techniques used at step 3 to transform the fuzzy consensus clustering into a crisp one.

The two classical “crispization” techniques we used in steps 3 and 6 can be described as follows:

Hard-clustering:

χ^H_ri = 1 if arg max_s U_si = r; 0 otherwise.   (6)

α-cut:

χ^α_ri = 1 if U_ri ≥ α; 0 otherwise.   (7)

where χ_ri is the characteristic function for the cluster r: that is, χ_ri = 1 if the ith example belongs to the rth cluster, and χ_ri = 0 otherwise; 1 ≤ s ≤ k, 1 ≤ i ≤ n, 0 ≤ α ≤ 1, and U is the fuzzy membership matrix obtained by applying the fuzzy k-means algorithm. Note that two different types of membership matrices are considered: at step 3, multiple membership matrices U are obtained through the application of the fuzzy k-means algorithm to multiple instances of the projected data; at step 6, another membership matrix UC (where the superscript C stands


for ”consensus”) is obtained by applying the fuzzy k-means algorithm to the rows of the similarity matrix.
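A minimal sketch of the two crispization operators (Eqs. 6 and 7), assuming a k × n membership matrix U as produced by fuzzy k-means (the function names are ours):

```python
import numpy as np

def hard_clustering(U):
    """Hard-clustering crispization (Eq. 6).
    U is a k x n fuzzy membership matrix (rows: clusters, columns: examples).
    Returns a k x n binary characteristic matrix chi."""
    chi = np.zeros_like(U)
    chi[np.argmax(U, axis=0), np.arange(U.shape[1])] = 1.0
    return chi

def alpha_cut(U, alpha):
    """Alpha-cut crispization (Eq. 7): chi_ri = 1 iff U_ri >= alpha.
    Note that an example may belong to several clusters, or to none."""
    return (U >= alpha).astype(float)
```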

We may observe that, considering the possibility of applying crisp or fuzzy methods at steps 3 and 6, we can obtain 9 different algorithms, exploiting different combinations of aggregation and consensus clustering techniques. For instance, combining a fuzzy aggregation with a consensus clustering obtained through α-cut, we obtain from the algorithmic scheme a fuzzy-alpha ensemble clustering algorithm, while using a hard-clustering crispization technique for aggregation and a fuzzy consensus we obtain a max-fuzzy ensemble clustering.

In the next two sections we discuss the algorithms based on the fuzzy aggregation step (fuzzy-* clustering ensemble algorithms) and the ones based on crisp aggregation of the base clusterings (crisp-* clustering ensembles).

4.2 Fuzzy ensemble clustering with fuzzy aggregation of the base clusterings

The pseudo-code of the algorithm is reported below:

Fuzzy-* ensemble clustering:

Input:
- a data set X = {x1, x2, . . . , xn}, stored in a d × n matrix D
- an integer k (number of clusters)
- an integer c (number of clusterings)
- an integer v (used for the normalization of the final clustering)
- the fuzzy k-means clustering algorithm Cf with fuzziness m
- a procedure that realizes the randomized map µ
- an integer d′ (dimension of the projected subspace)
- a function τ that defines the t-norm

begin algorithm
(1) For each i, j ∈ {1, . . . , n} do Mij = 0
(2) Repeat for t = 1 to c
(3)   Rt = Generate projection matrix(d′, µ)
(4)   Dt = Rt · D
(5)   U(t) = Cf(Dt, k, m)
(6)   For each i, j ∈ {1, . . . , n}
        M(t)ij = Σ_{s=1..k} τ(U(t)si, U(t)sj)
    end repeat
(7) MC = (Σ_{t=1..c} M(t)) / v
(8) <A1, A2, . . . , Ak> = Cf(MC, k, m)
end algorithm

Output:
- the final clustering C = <A1, A2, . . . , Ak>
- the cumulative similarity matrix MC.

Note that the dimension d′ of the projected subspace is an input parameter of the algorithm, but it may be computed according to the JL lemma (Sect. 2), to approximately preserve the distances between the examples. Inside the main loop (steps 2-6), the procedure Generate projection matrix produces a d′ × d matrix Rt according to a given random map µ [4], which is used to randomly project the original data matrix D into a d′ × n projected data matrix Dt (step 4). In step (5) the fuzzy k-means algorithm Cf with a given fuzziness m is applied to Dt, and a k-clustering represented by its membership matrix U(t) is obtained. Hence the corresponding similarity matrix M(t) is computed, using a given t-norm (step 6). Note that U is a fuzzy membership matrix (where the rows are clusters and the columns are examples). A similar approach can be seen in [18].

In (7) the ”cumulative” similarity matrix MC is obtained by averaging across the similarity matrices computed in the main loop. Note the normalization factor 1/v: it is easy to show that, for the choice of the algebraic product as the t-norm, a suitable choice of v is the number of clusterings c. Finally, the consensus clustering is obtained by applying the fuzzy k-means algorithm to the rows of the similarity matrix MC (step 8).

4.3 Fuzzy ensemble clustering with crisp aggregation of the base clusterings

In agreement with the algorithmic scheme, there are two different methods to “defuzzify” the clusterings:

– hard-clustering;
– α-cut.

Below we provide the pseudo-code for the fuzzy ensemble clustering with “crispization” through hard-clustering; the algorithm with crispization through α-cut is described later by its differences from this algorithm.

Max-* ensemble clustering:

Input:
- a data set X = {x1, x2, . . . , xn}, stored in a d × n matrix D
- an integer k (number of clusters)
- an integer c (number of clusterings)
- an integer v (used for the normalization of the final clustering)
- the fuzzy k-means clustering algorithm Cf with fuzziness m
- a procedure that realizes the randomized map µ
- an integer d′ (dimension of the projected subspace)
- a “crispization” algorithm Crisp

begin algorithm
(1) For each i, j ∈ {1, . . . , n} do Mij = 0
(2) Repeat for t = 1 to c
(3)   Rt = Generate projection matrix(d′, µ)
(4)   Dt = Rt · D
(5)   U(t) = Cf(Dt, k, m)
(5bis) χ(t) = Crisp(U(t))
(6)   For each i, j ∈ {1, . . . , n}
        M(t)ij = Σ_{s=1..k} χ(t)si · χ(t)sj
    end repeat
(7) MC = (Σ_{t=1..c} M(t)) / c
(8) <A1, A2, . . . , Ak> = Cf(MC, k, m)
end algorithm

Output:
- the final clustering C = <A1, A2, . . . , Ak>
- the cumulative similarity matrix MC.

Several observations can be made about the proposed algorithm:

1. With respect to the previously proposed fuzzy-* ensemble clustering, a new step (step 5bis) has been introduced after the creation of the membership matrix of the single clusterings, to obtain the transformation of the fuzzy data into crisp data. After this new step, a characteristic matrix χ(t) is created for every clustering.

2. After this step, the data can be managed like “natural” crisp data. In fact, in step (6), the final similarity matrix is obtained through the methods shown in [1, 4].

3. As a consequence of the “hard-clustering crispization” (step 5bis), the consensus clustering is a partition, that is, each example may belong to one and only one cluster:

∀ i, j:  χ(t)ij = 1 if arg max_s U_sj = i; 0 otherwise.

Hence, the normalization of the values of the similarity matrix can be performed using the factor 1/c (step (7)).

Choosing the α-cut method instead of hard-clustering, only steps (5bis) and (7) have to be changed:

– (5bis alpha) χ(t) = Crisp_α(U(t));
– (7 alpha) MC = (Σ_{t=1..c} M(t)) / (k · c).

In step (5bis alpha) the Crisp algorithm has a new parameter: α, the alpha-cut threshold value. In particular, in the χ(t) = Crisp_α(U(t)) operation,

χ(t)ij = 1 if Uij ≥ α; 0 otherwise.

The normalization method in step (7 alpha) comes from the following observations:

1. In the proposed algorithm, only one clustering function is used, working with a fixed number (k) of clusters for each clustering;


2. For a fixed α:

χ_si = 1 ⟺ µ_si ≥ α
χ_si = 0 ⟺ µ_si < α

with 1 ≤ s ≤ k; hence

0 ≤ Σ_{s=1..k} χ_si ≤ k

Considering that each k-clustering is repeated c times, we can observe that k · c is the total number of clusters across the multiple clusterings. We may also use base clusterings with a different number of clusters for each execution; in this case, the normalization factor v becomes:

v = Σ_{t=1..c} k_t   (8)

where k_t is the number of clusters of the tth clustering.

5 Experimental results

5.1 Experimental environment

To test the performance of the proposed algorithms, we used a synthetic data generator [4]. Every synthetic data set is composed of 3 clusters with 20 samples each. Every example is 5000-dimensional. Each cluster is distributed according to a spherical Gaussian probability distribution with a standard deviation of 3. The first cluster is centered at the null vector 0. The other two clusters are centered at 0.5e and −0.5e, where e is the vector with all components equal to 1. We tested 4 of the 9 algorithms developed, two with the hard-clustering method applied to the consensus clustering and two using the α-cut approach:

– max-max: hard-clustering applied to both the base clusterings and the consensus clustering;
– fuzzy-max: the aggregation step is fuzzy, the consensus step is crisp through hard-clustering;
– max-alpha: crisp aggregation by hard-clustering, and crisp consensus through α-cut;
– fuzzy-alpha: fuzzy aggregation and crisp consensus through α-cut.

We repeated the previous four clustering ensemble algorithms 20 times, using data sets randomly projected to a 410-dimensional feature space (corresponding to a distortion ε = 0.2, see Sect. 2). As a baseline clustering algorithm we used the classical fuzzy k-means, executed on the original data set (without data compression). Since clustering does not uniquely associate a label with the examples, but only provides a set of clusters, we evaluated the error by choosing


for each clustering the permutation of the classes that best matches the ”a priori” known ”true” classes. More precisely, considering the following clustering function:

f(x) : Rd → P(Y), with Y = {1, . . . , k}   (9)

where x is the sample to classify, d its dimension, k the number of classes, and P(Y) is the powerset of Y, the error function we applied is the following:

L0/1(Y, t) = 0 if (|Y| = 1 ∧ t ∈ Y) ∨ Y = {λ}; 1 otherwise.   (10)

with t the “real” label of the sample x, Y ∈ P(Y), and {λ} the empty set. Other loss functions or measures of the performance of clustering algorithms may be applied, but we chose this modification of the 0/1 loss function to take into account the multi-label output of fuzzy k-means algorithms, and considering that our target examples belong only to a single cluster.

5.2 Results

The error boxplots of Fig. 1 show that our proposed fuzzy ensemble clustering algorithms perform consistently better than the single fuzzy k-means algorithm, with the exception of the fuzzy-max algorithm when a relatively high level of fuzziness is chosen. More precisely, in Fig. 1 we can observe how different degrees of fuzziness of the “component” k-means clusterings can change the performance of the ensemble, if the “fuzzy” information is preserved through the “crispization” operation. In fact, while the performances of the single k-means and of the ensemble max-max algorithm (which uses the hard-clustering operation at both the component and consensus levels) are similar in both graphs (Fig. 1 (a) and (b)), the result of the fuzzy-max ensemble algorithm (where the “defuzzification” operation is performed only on the final result) changes drastically. The good performance of the max-max algorithm with both degrees of fuzziness could depend on the high level of “crispness” of the data: indeed, each example is assumed to belong exactly to one cluster. A confirmation of this hypothesis is given by the improvement in performance of the fuzzy-max algorithm when lowering the level of fuzziness (from 2.0 to 1.1) of the base clusterings. A different consideration can be proposed regarding the fuzzy-max algorithm, in which the capacity to express “pure” fuzzy results at the base learner level can improve its degree of flexibility (the possibility to adapt the algorithm to the nature of the clusterings).

The analysis of the fuzzy-alpha and max-alpha algorithms (Figs. 2, 3) shows how the reduction of the fuzziness reduces the number of unclassified samples and the number of errors, especially for α ≤ 0.5; for higher levels of α the error rate goes to 0, but the number of unclassified samples rises quickly. Note that fuzzy-alpha ensembles achieve an error rate very close to 0 with a small number of unclassified examples for a large range of α values (Fig. 2 (b)). Fig. 3 shows how the max-alpha algorithm obtains inversely related error and unclassified rates while varying α, with an ”optimal” value close to 0.5.


[Boxplots: methods single, PMOmaxmax02, PMOfuzzymax02 vs. errors (0.0–1.0); panels (a) and (b)]

Fig. 1. Boxplots for the max-max and fuzzy-max algorithms (PMO data reduction and ε = 0.2) compared with the single k-means “crispized” through hard-clustering, with (a) fuzziness = 2.0 and (b) fuzziness = 1.1.

[Plots: error rate and unclassified rate vs. α (0.2–1.0); panels (a) and (b)]

Fig. 2. Fuzzy-alpha ensemble clustering error and unclassified rates with fuzziness = 2 (a) and fuzziness = 1.1 (b), with respect to α.

6 Conclusions

Table 1 numerically compares the results of the proposed fuzzy ensemble clustering algorithms with other ensemble and classical clustering algorithms. We can observe that the algorithms discussed in this paper perform equally well or better with respect to the others, including the recently proposed RandClust ensemble method based on random projections [4]. In particular, fuzzy-* ensembles significantly outperform the other methods on this specific data set.


[Plots: error rate and unclassified rate vs. α (0.2–1.0); panels (a) and (b)]

Fig. 3. Max-alpha ensemble clustering error and unclassified rates with fuzziness = 2 (a) and fuzziness = 1.1 (b), with respect to α.

Table 1. Comparison of the fuzzy ensemble clustering methods with other ensemble and ”single” clustering algorithms. The last column reports the rate of unclassified examples.

Algorithms             Mean error  Std. Dev.  % Uncl.
Fuzzy-Max              0.0058      0.0155     0
Fuzzy-Alpha            0.0058      0.0155     0.0008
Max-Max                0.0758      0.1104     0
Max-Alpha              0.0573      0.0739     0.0166
Rand-Clust             0.0539      0.0354     0
Fuzzy ”single”         0.3916      0.0886     0
Hierarchical ”single”  0.0817      0.0298     0

However, these results are too limited to draw general conclusions about the effectiveness of the proposed methods, and we need more experiments to better understand their characteristics, drawbacks and limitations. Nevertheless, other recent experiments with DNA microarray data confirm the effectiveness of the fuzzy-max and fuzzy-alpha ensemble clustering algorithms [16].

We are planning new experiments with multi-label examples to show more clearly the effectiveness of the proposed approach and to analyze the structure of unlabeled data when the boundaries of the clusters are uncertain.

References

1. Dudoit, S., Fridlyand, J.: Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19 (2003) 1090–1099

2. Fern, X., Brodley, C.: Random projections for high dimensional data clustering: A cluster ensemble approach. In Fawcett, T., Mishra, N., eds.: Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), Washington D.C., USA, AAAI Press (2003)

3. Topchy, A., Jain, A., Punch, W.: Clustering Ensembles: Models of Consensus and Weak Partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1866–1881

4. Bertoni, A., Valentini, G.: Ensembles based on random projections to improve the accuracy of clustering algorithms. In: Neural Nets, WIRN 2005. Volume 3931 of Lecture Notes in Computer Science, Springer (2006) 31–37

5. Bertoni, A., Valentini, G.: Randomized embedding cluster ensembles for gene expression data analysis. In: SETIT 2007 - IEEE International Conf. on Sciences of Electronic, Technologies of Information and Telecommunications, Hammamet, Tunisia (2007)

6. Gasch, P., Eisen, M.: Exploring the conditional regulation of yeast gene expression through fuzzy k-means clustering. Genome Biology 3 (2002)

7. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. In: Conference in Modern Analysis and Probability. Volume 26 of Contemporary Mathematics, Amer. Math. Soc. (1984) 189–206

8. Bertoni, A., Valentini, G.: Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses. Artificial Intelligence in Medicine 37 (2006) 85–109
9. Shipp, M. et al.: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine 8 (2002) 68–74

10. Ramaswamy, S., Ross, K., Lander, E., Golub, T.: A molecular signature of metastasis in primary solid tumors. Nature Genetics 33 (2003) 49–54

11. Zadeh, L.A.: Fuzzy Sets. Information and Control 8 (1965) 338–353
12. Klement, E.P., Mesiar, R., Pap, E.: Triangular Norms. Kluwer Academic Publishers (2000)
13. Achlioptas, D.: Database-friendly random projections. In: PODS '01: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM Press (2001) 274–281

14. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. PNAS 95(25) (1998) 14863–14868

15. Strehl, A., Ghosh, J.: Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research 3 (2002) 583–617

16. Avogadri, R., Valentini, G.: Fuzzy Ensemble Clustering for DNA Microarray Data Analysis. In: CIBB 2007, The Fourth International Conference on Bioinformatics and Biostatistics, Portofino, Italy. Lecture Notes in Computer Science, Springer (to appear)

17. Hu, X., Yoo, I.: Cluster ensemble and its applications in gene expression analysis. In: Proc. 2nd Asia-Pacific Bioinformatics Conference, Dunedin, New Zealand (2004) 297–302

18. Yang, L., Lv, H., Wang, W.: Soft Cluster Ensemble Based on Fuzzy Similarity Measure

19. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley (2004)

20. Kuncheva, L.I., Hadjitodorov, S.T.: Moderate Diversity for Better Cluster Ensembles. Information Fusion (2005)


Ensembles of Nearest Neighbours for Cancer

Classification Using Gene Expression Data

Oleg Okun1 and Helen Priisalu2

1 University of Oulu, Oulu 90014, Finland
2 Tallinn University of Technology, Tallinn 19086, Estonia

Abstract. It is known that an ensemble of classifiers can outperform a single best classifier if the classifiers in the ensemble are sufficiently diverse (i.e., their errors are as uncorrelated as possible) and accurate. We study ensembles of nearest neighbours for cancer classification based on gene expression data. Such ensembles have rarely been used, because the traditional ensemble methods such as bagging and boosting are unable to inject diversity into nearest neighbour classifiers. To alleviate this problem, feature selection prior to classification is done. After that, diversity-based ensemble pruning before classifier combination can lead to an ensemble outperforming a single best nearest neighbour. However, such a result does not always hold, as demonstrated by experiments with two gene expression datasets. The reason seems to lie in dataset complexity. We show that if a dataset is easy for a single nearest neighbour to classify accurately, then an ensemble of nearest neighbours is hardly capable of doing better. In contrast, if a dataset is difficult according to certain complexity measures, then an ensemble of nearest neighbours is superior in performance to a single best nearest neighbour.

1 Introduction

Cancer prediction using gene expression data is one of the topics of intensive research in bioinformatics. It is based on the assumption that expression levels of genes in healthy and diseased cells are different, so that it is possible to discriminate between cancer and normal states. However, the existing gene expression datasets bring a challenge for automatic discrimination: high dimensionality compared to the number of cases collected from patients.

To mitigate the negative effect of high data dimensionality, feature selection is usually recommended before classification. In the case of gene expression datasets, features are gene expressions, so that feature selection is synonymous with gene selection. As follows from its name, feature selection involves finding a subset of the original features so that a classifier built with this subset would perform better than a classifier built from the entire set of features.

Given n genes expressed in N cases, the task is to select a small subset of all genes so that 1) the cases are accurately classified into two classes (normal and cancer) and 2) the selected subset consists of genes biologically relevant to cancer. That is, the task is twofold, because it involves both machine learning and biological aspects.

However, different feature selection methods will often select different subsets of features from the same dataset, so that the relevant features vary. To address this issue, it is often better to combine the results of several feature selection methods. This strategy is based on the intuition that features that are consistently selected by various feature selection methods tend to be more relevant for the problem in question. Because feature selection precedes classification, it is natural to combine these two operations together under the name ‘base classifier’, which we will use in this paper. As a result, we can now talk about ensembles of classifiers.

A typical ensemble fuses the predictions (either class labels or probability estimates) of a set of classifiers, thus forming a single class membership for the test cases. The objective of applying ensembles is to achieve performance superior to a single best classifier. One possibility to do so is to combine classifiers that are both diverse and accurate [1]. Diversity means that different classifiers do not make errors on the same cases. One of the most typical ways to make classifiers diverse is to let each classifier work only with a certain subset of the original features, so that these subsets can overlap but do not totally coincide. These subsets can be randomly sampled from the original set, as in [2], or selected by a certain feature-ranking algorithm. We rule out random sampling, since it can easily lead to picking features that are biologically irrelevant to cancer. Instead, we rely on statistically-driven feature ranks, which provide a better guarantee against selecting features purely by chance.

There are several articles employing ensembles for cancer classification, where the ensembles are built with neural networks [3] and decision trees [4–6]. In contrast to these works, we focus on ensembles of nearest neighbour classifiers. Despite being one of the classifiers of choice for many real-world tasks, a nearest neighbour (NN) classifier is relatively rarely applied in ensembles, especially regarding cancer classification based on gene expression data (exceptions are [7–9]), because several researchers showed that neither bagging nor boosting (the two most popular ensemble techniques) provides improvement compared to a single NN classifier [2, 10]. In our opinion, the previous approaches based on NN ensembles suffer from a principal drawback related to automatic ensemble construction: either all possible ensemble configurations have to be tried in order to select one of them [8], or this process involves stochastic search [9], which leads to an unstable solution. We propose to apply the diversity-based ensemble pruning algorithm [11] to quickly and automatically create an ensemble out of individual NN classifiers, given the number of classifiers to fuse. Only the class labels assigned to the training data by the individual classifiers are needed, and no re-training is required when including NNs into an ensemble. The latter characteristic is especially valuable, since one does not have to worry about ensemble overfitting.

Besides the abovementioned reasons inspiring our interest in NN ensembles, we were motivated by the fact that a nearest neighbour demonstrated highly competitive performance with respect to more sophisticated classifiers [6]. Hence, as noticed in [7], one should not always expect that an NN ensemble will outperform a single NN. Although in [7] some efforts were made to answer when a single best NN classifier can outperform an NN ensemble, the results are still far from comprehensive. In this work, we continue to study this problem by concentrating on dataset complexity as measured in [12]. We argue that this complexity can be the reason why NN ensembles cannot outperform a single best NN classifier. When a dataset is complex, e.g., it includes multiple types of cancer, an NN ensemble is superior to a single best NN classifier. However, if a dataset is rather easy to classify, a highly accurate NN can render the efforts of an ensemble worthless. Together with ensemble pruning, dataset complexity constitutes the novelty of our work, since these two issues are often overlooked in the literature on cancer classification.

2 Methods

2.1 Feature Selection Methods

We chose the following methods employed in [13]: pairwise feature selection, greedy forward selection, and individual ranking. As follows from its name, pairwise feature selection takes into account the interaction between pairs of features. The other two methods are one-feature-at-a-time methods. All the methods expect two-class data and belong to the filter model, i.e., they do not rely on a classifier when selecting features.

Method of Bø and Jonassen This method evaluates how well a pair of genes in combination distinguishes two classes. First, each pair is evaluated by computing the projections of the training data on the diagonal linear discriminant (DLD) axis (see [13] for details) when using only the two genes constituting this pair. Then the two-sample t-statistic is computed on the projected points and assigned to a given pair as its score. Bø and Jonassen proposed two variants of gene selection based on pair scores, called ‘all pairs’ (exhaustive search) and ‘greedy pairs’ (greedy search).

The all-pairs variant targets all possible pairs of genes. First, all pairs are sorted in descending order of their score. After that, the pair with the highest pair score is selected and all other pairs containing either gene included in this pair are removed from the sorted list of pairs. Then the next highest-scoring pair is found among the remaining pairs and all other pairs containing either gene in this pair are removed from the list, and so on. This procedure terminates when the pre-specified number of genes is reached.

Since the all-pairs variant is computationally demanding, an alternative evaluating only a subset of all pairs is proposed (greedy pairs). The greedy-pairs variant first ranks all genes based on their individual t-scores. Next, the best gene gi ranked by its t-score is selected. Among all other genes, the gene gj that together with gi maximises the pair t-score is found, so that gi and gj form a pair. These two genes are then removed from the gene set and a search for the next pair is performed, until the desired number of genes is selected.
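A sketch of the greedy-pairs procedure, under the assumption that a pair_score routine implementing the DLD-projected two-sample t-score is available (all names here are ours, not the authors' code):

```python
import numpy as np

def greedy_pairs(X, y, n_genes, pair_score):
    """Greedy-pairs gene selection (a sketch, not the authors' implementation).

    X          : (N, n) expression matrix (cases x genes).
    y          : binary class labels (0/1) as a NumPy array.
    n_genes    : number of genes to select.
    pair_score : callable (X, y, i, j) -> DLD-projected two-sample t-score
                 of the gene pair (i, j); assumed to be given.
    """
    def t_score(g):
        # Individual two-sample t-statistic of gene g.
        a, b = X[y == 0, g], X[y == 1, g]
        return abs(a.mean() - b.mean()) / np.sqrt(
            a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

    remaining = set(range(X.shape[1]))
    selected = []
    while len(selected) < n_genes and remaining:
        gi = max(remaining, key=t_score)          # best single gene
        remaining.discard(gi)
        if remaining and len(selected) + 1 < n_genes:
            # Partner gene maximising the pair t-score together with gi.
            gj = max(remaining, key=lambda g: pair_score(X, y, gi, g))
            remaining.discard(gj)
            selected.extend([gi, gj])
        else:
            selected.append(gi)
    return selected[:n_genes]
```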


Greedy Forward Selection This is a greedy feature selection method, incrementally adding genes one by one; hence the name ‘forward selection’. An individual t-score is computed for each gene, and the genes are ranked according to their t-score. The first gene to be selected is the one with the highest t-score. Next, other genes are added so that adding a gene maximises the two-sample t-score of the new subset. The t-score at step i is computed on the data points projected on the DLD axis (as done in [13]) using the expression values of the i selected genes. Greedy forward selection stops when the desired number of genes has been collected.

Individual Ranking The methods above belong to gene subset ranking techniques, because each time a group of genes gets a rank [14], in contrast to individual ranking, which grades each gene in isolation from the others. Compared to gene subset ranking, individual ranking looks handicapped, since it ignores interactions among genes. However, it is the fastest method among the four, because it is the simplest one. Each gene gets an individual t-score ranking this gene among the others. Gene selection is straightforward: a user picks the required number of top-ranked genes.

2.2 Weighted k-Nearest Neighbour Classifiers

In our approach, classification follows feature selection. As classifiers, we employ the weighted k-nearest neighbour (k-NN) classifier (k = 1, 3, 5), with weights assigned to the original genes as described in the experimental protocol³. It is known that a k-NN classifier is sensitive to irrelevant features [15]; hence, gene selection prior to classification helps to filter out irrelevant genes.

A k-NN classifier consists of a training set and a similarity (or distance) metric. To classify a new, unseen case, it computes the distances to all cases in the training set and finds the k cases nearest to the test case. The most common class among these k cases is assigned to the new case.

Given two n-dimensional cases x and y, the weighted distance between them is determined by the following formula:

Σ_{i=1..n} wi (xi − yi)² = Σ_{i=1..n} (√wi xi − √wi yi)²,

where wi is the weight of the ith gene. Thus, cases have to be weighted before searching for nearest neighbours.

2.3 Ensemble Pruning Before Classifier Combination

It is imperative that, in order for an ensemble to outperform a single best classifier, the former must be composed of diverse classifiers that are also sufficiently accurate at the same time [1]. In many models injecting diversity into an ensemble, e.g., via random sampling of the input data as in [10] or via random feature sampling as in [2], diversity is not explicitly measured or controlled. Hence, it is difficult to judge how diverse the classifiers are, and this uncertainty calls for incorporating some explicit measure of diversity into the selection of the classifiers.

³ The ith weight is equal to the number of times the ith gene has been selected during cross-validation of feature selection.

Though all the base classifiers are typically combined into an ensemble, it is possible to select only certain classifiers for fusion. This operation is called ensemble pruning. We borrowed the diversity-based pruning algorithm from [11]. In [11], diversity for a pair of classifiers Di and Dj is defined as the number of cases where Di and Dj produce different predictions (class labels). The diversity-based algorithm is given in Figure 1, where H and C are sets containing the original and selected classifiers, respectively.

First, the most accurate classifier is included into C. Next, in each round, the algorithm adds to the list of selected classifiers the classifier Dk that is most diverse with respect to the classifiers chosen so far, i.e., the Dk that maximises the diversity S over C ∪ Dk, ∀k ∈ {1, 2, . . . , |H|}. The selection ends when the M most diverse classifiers are selected.

Let C = ∅, M := maximum number of classifiers
For i := 1, . . . , |H| − 1 do
    For j := i, . . . , |H| do
        Let z_{i,j} := number of cases where D_i and D_j give different predictions
Let D′ := the classifier with the highest accuracy
C := C ∪ D′, H := H − D′
For i := 1, . . . , M − 1 do
    For j := 1, . . . , |H| do
        Let S_j := Σ_{k=1..|C|} z_{j,k}
    Let D′ := the classifier from H with the highest S_j
    C := C ∪ D′, H := H − D′

Fig. 1. Diversity-based ensemble pruning
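In Python, the pruning procedure of Fig. 1 might be sketched as follows (our own rendering, assuming the predictions of all L base classifiers on the training cases are available as an array):

```python
import numpy as np

def diversity_prune(predictions, accuracies, M):
    """Diversity-based ensemble pruning (Fig. 1), a sketch.

    predictions : (L, N) array of class labels predicted by each of the
                  L classifiers on the N training cases.
    accuracies  : length-L array of classifier accuracies.
    M           : number of classifiers to keep.
    Returns the indices of the M selected classifiers."""
    L = predictions.shape[0]
    # Pairwise diversity: number of cases where two classifiers disagree.
    z = np.array([[np.sum(predictions[i] != predictions[j]) for j in range(L)]
                  for i in range(L)])
    selected = [int(np.argmax(accuracies))]   # start from the most accurate one
    candidates = set(range(L)) - set(selected)
    while len(selected) < M and candidates:
        # Pick the candidate most diverse w.r.t. the classifiers chosen so far.
        best = max(candidates, key=lambda j: z[j, selected].sum())
        selected.append(best)
        candidates.discard(best)
    return selected
```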

2.4 Classifier Combination

We employ two classifier combination methods: majority vote and Naïve Bayes. The notation and formulae in this section are borrowed from [1].

Let us assume that there are L classifiers Di, i = 1, . . . , L, and c different classes. The label outputs of these classifiers compose c-dimensional binary vectors [d_{i,1}, . . . , d_{i,c}] ∈ {0, 1}^c, i = 1, . . . , L, where d_{i,j} = 1 if Di labels the input x in class ωj, and d_{i,j} = 0 otherwise.

Majority Vote The majority vote results in an ensemble decision for class ωk if this class gets more votes than any other class. Ties are randomly resolved.

Naïve Bayes Combination This combination method assumes that the individual classifiers are mutually independent; hence the name ‘naïve’. Let s1, . . . , sL be the crisp class labels assigned to x by classifiers D1, . . . , DL, respectively. The independence assumption means that the support for class ωk is

µk(x) ∝ P(ωk) ∏_{i=1..L} P(si | ωk),

where P(si | ωk) is the probability that Di assigns x to class si, given that the true class is ωk. For each classifier Di, a c × c confusion matrix CM^i is calculated by applying Di to the training data. The (k, s)th entry of this matrix, cm^i_{k,s}, is the number of cases whose true class label was ωk but which were classified in class ωs by Di. By Nk we denote the total number of cases from class ωk. Taking cm^i_{k,si}/Nk as an estimate of the probability P(si | ωk), and Nk/N (N = Σ_{j=1..c} Nj) as an estimate of the prior probability of class ωk, we obtain

µk(x) ∝ (1/N) Nk^{1−L} ∏_{i=1..L} cm^i_{k,si}.

To avoid nullifying µk(x), regardless of the other estimates, when an estimate of P(si | ωk) = 0, the last formula can be rewritten as

µk(x) ∝ (Nk/N) { ∏_{i=1..L} (cm^i_{k,si} + 1/c) / (Nk + 1) }^B,

where B is a constant (we always set it to 1 in the experiments).
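A sketch of this combiner with the correction for zeroes (B = 1), assuming the confusion matrices have been computed on the training data (function and argument names are ours):

```python
import numpy as np

def naive_bayes_combination(confusions, class_counts, labels, B=1.0):
    """Naive Bayes combiner with correction for zeroes (a sketch).

    confusions   : (L, c, c) confusion matrices cm^i computed on training data.
    class_counts : length-c array, N_k = number of training cases of class k.
    labels       : length-L int array, crisp labels s_i assigned to x by the
                   L classifiers.
    Returns the (unnormalised) supports mu_k(x) for every class."""
    L, c, _ = confusions.shape
    N = class_counts.sum()
    mu = class_counts / N                       # prior estimates N_k / N
    for i in range(L):
        # Laplace-corrected estimate of P(s_i | omega_k) for every class k.
        p = (confusions[i][:, labels[i]] + 1.0 / c) / (class_counts + 1.0)
        mu = mu * p ** B
    return mu
```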

2.5 Performance Evaluation

A Receiver Operating Characteristic (ROC) curve is a plot of the false positive rate (X-axis) versus the true positive rate (Y-axis) of a binary classifier. It is commonly used for visualising and selecting classifiers based on their performance, because it is regarded as one of the best measures of predictive accuracy. The true positive rate (TPR) is defined as the ratio of the number of correctly classified positive cases to the total number of positive cases. The false positive rate (FPR) is defined as the ratio of incorrectly classified negative cases to the total number of negative cases.

The diagonal line y = x corresponds to a classifier which predicts a class membership by randomly guessing it. Hence, all useful classifiers must have ROC curves above this line.

The ROC curve is a two-dimensional plot of classifier performance. To compare classifiers, one typically prefers to work with a single scalar value. This value is called the Area Under Curve (AUC). It is calculated by adding the areas under the ROC curve between each pair of consecutive FPR values, using, for example, the trapezoidal rule. Because the AUC is a portion of the area of the unit square, its value always lies between 0 and 1. Because random guessing produces the diagonal line between (0,0) and (1,1), which has an area of 0.5, no realistic classifier should have an AUC less than 0.5 [16]. In fact, the better a classifier performs, the higher its AUC. The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive case higher than a randomly chosen negative case [16].
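For illustration, ROC points and the trapezoidal AUC can be computed as follows (a simplified sketch of ours that ignores tied scores):

```python
import numpy as np

def roc_points(scores, y_true):
    """ROC points from probabilistic outputs, sweeping a threshold over the scores.
    y_true: 1 for positives (cancer cases), 0 for negatives (normal cases)."""
    order = np.argsort(-scores)                 # sort cases by decreasing score
    y = y_true[order]
    tpr = np.cumsum(y) / y.sum()                # true positive rate at each cut-off
    fpr = np.cumsum(1 - y) / (1 - y).sum()      # false positive rate at each cut-off
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule."""
    return float(np.trapz(tpr, fpr))
```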


3 Experimental Protocol

The following experimental protocol was utilised in our experiments.

Input: Training set containing gene expressions; the number of cross-validation folds, m; G feature selection methods; Q nearest neighbour classifiers; the maximum number of genes, ng_max, to be selected; a threshold thr to filter out rarely selected genes; the number of base classifiers, M, to combine into an ensemble; the number of top genes, ng_i, i = 1, . . . , G × Q, to be used by each base classifier.

Step 1: Partition the training set into m non-overlapping and stratified folds.

Step 2: For each of the G feature selection methods, repeat the following operations m times: leave out one fold, select ng_max genes using the remaining folds, and store the indices of the selected genes in a special list. Thus, m lists of length ng_max are formed for each feature selection method.

Step 3: For each of the L (L = G × Q) base classifiers, do:
– pick the top ng_i (ng_i ≤ ng_max) genes in each of the m lists created in the previous step by the jth feature selection method and combine the genes from all lists together;
– compute the frequency of occurrence (weight), w, of each gene in the combined set;
– filter out rare genes as follows: if w < thr for a certain gene, set its w to zero; otherwise keep its w unchanged;
– using the updated values of w and the folds created in Step 1, classify the training set using m-fold cross-validation and the ℓth weighted classifier;
– construct the ROC curve and compute the AUC value for this classifier.

Step 4: Perform ensemble pruning, which reduces the number of base classifiers from L to M.

Step 5: Combine the M remaining classifiers into an ensemble by one of the non-trainable combination schemes.

Step 6: Combine all genes used by the ensemble into one set. Compute the frequency of occurrence of each gene. Rank the genes according to their frequency of occurrence.

Step 7: Construct the ROC curve and compute the AUC value for the resulting ensemble.

Output: ROC curves and AUC values for the individual classifiers and their ensemble, the sets of genes used to cross-validate each classifier, and the ranked list of genes created in Step 6.

Gene selection using the jth feature selection method followed by classification by the ℓth classifier is termed a base classifier. A combination of four feature selection methods (G = 4) and three NNs (Q = 3) produces twelve base classifiers (L = 12). Given L base classifiers, gene selection has to be done only G instead of L times, since each ng_i ≤ ng_max. We experimented with two ways to set the values of the ng_i's: 1) all ng_i's equal to a certain constant (we used 5, 10, 20, 30, 40, 50), and 2) each ng_i set to a random number between 1 and ng_max (ng_max = 50).


Genes are selected m times by each feature selection method, where m is the number of folds (m = 5). However, in general, the m lists of genes differ. Combining all lists produces a single set of genes, which facilitates further analysis and biological interpretation. But not every gene in the combined set is important. For instance, rarely occurring genes were likely selected by chance and can therefore be safely eliminated. In contrast, genes that were selected often are considered to be relevant for classification. To filter out rare genes, we applied a simple thresholding technique: if a certain gene appeared in the combined set less than thr times, where thr is a fraction of m (thr = 2 in this paper), this gene was treated as noise and its weight w was set to zero. Besides nullifying the influence of rare genes, thresholding helps to prevent overfitting of the NN classifiers to noise. Also notice that the set of genes used to cross-validate the classifiers is different to a certain degree from any of the m subsets of genes found during cross-validation of feature selection. Besides, feature selection does not employ an NN classifier to select genes, which also helps to mitigate a bias in the reported results. Finally, since NNs do not require training, no dramatic bias can be introduced in the whole process of feature selection and classification.

Having the predictions (either class labels or probability estimates⁴ or both) of the individual base classifiers, it is possible to combine them in order to achieve better performance. Sometimes certain classifiers can be left out, i.e., M out of L classifiers are sufficient to obtain results superior to a single best NN classifier. The ensemble pruning technique we applied is based on the diversity of the predictions of the individual classifiers. It only requires the class labels produced by the individual classifiers and does not involve any training.

Finally, the M (M = 3, 5, 7, 9, 11, 12) classifiers that remain after ensemble pruning are fused by one of the known combination schemes. In this work, we utilise two of them: majority vote and Naïve Bayes combination with a correction for zeroes [1]. Neither scheme requires extra training. Thus, they do not introduce a bias into our results. Because we need probabilistic outputs in order to generate ROC curves and to compute AUC values for ensembles, we approximate the probabilities as follows. For majority vote, it is the ratio of the votes for a certain class to the total number of votes, while for the Naïve Bayes combination, it is the class support µ. Cancer cases are positives whereas normal cases are negatives.

Genes used by the individual base classifiers are combined together, and for each gene the number of base classifiers using this gene is counted. Based on this characteristic, all genes can be ranked in order to guide a biological verification towards a few top-ranked genes, because genes frequently used by many classifiers in the ensemble can be highly relevant for future clinical studies.

4 Datasets

Two datasets were chosen for experiments.

⁴ Laplace correction for k-NN classifiers [1] is used. Probabilistic outputs are then employed to build ROC curves and to compute AUC values.


The first dataset was produced according to the SAGE technology. SAGE stands for Serial Analysis of Gene Expression [17]. This is a technology alternative to microarrays (cDNAs and oligonucleotides). Though SAGE was originally conceived for use in cancer studies, there is not much research using SAGE datasets regarding ensembles of classifiers (to the best of our knowledge, this is the first study of NN ensembles based on SAGE data).

In this dataset [18], there are expressions of 822 genes in 74 cases (24 cases are normal while 50 cases are cancerous) [19]. Unlike many other datasets with one or a few types of cancer, it contains 9 different types of cancer. We decided to ignore the difference between cancer types and to treat all cancerous cases as belonging to a single class. No preprocessing was done.

The second dataset is a microarray (oligonucleotide) dataset [20] introduced in [21]. It contains expressions of 2000 genes for 62 cases (22 normal and 40 colon tumour cases). Preprocessing includes a logarithmic transformation to base 10, followed by normalisation to zero mean and unit variance, as is usually done with this dataset.

We will use the names ‘SAGE’ and ‘Colon’ in order to identify these datasets.

5 Experimental Results

We used the AUC value for the performance evaluation of both single classifiers and ensembles. In the first series of experiments, we fixed the number of genes to be selected and used with each base classifier in the ensemble to a certain constant C. Hence, this number was the same for all classifiers in the ensemble. Since each feature selection method ranks all genes, it is natural to select the top C genes according to the ranks assigned by each method. Thus, feature selection helps to concentrate on a certain subset of all genes. The following values of C were tried: 5, 10, 20, 30, 40, 50. For ensembles, rare genes were eliminated at thr = 2.

Regarding SAGE, one can notice that NN ensembles created with both combination schemes often outperformed a single best NN classifier, especially when using 20-30 genes per base classifier. Moreover, it was sufficient to fuse 5-9 NN classifiers in order to achieve performance superior to that of the ensemble including all available classifiers. This testifies to the usefulness of ensemble pruning prior to classifier combination. In addition, pruning limits the number of genes to be further analysed, because the fewer the classifiers, the fewer the subsets of selected genes. For Colon, NN ensembles did not demonstrate such superiority, i.e., a single best NN turned out to be more accurate than an NN ensemble, regardless of the combination scheme. Regarding the absolute AUC values, those obtained on the SAGE data tend to be much lower than those obtained on the Colon data.

The results above point to the difference in performance of NN ensembles on the two datasets and indicate that it should not always be expected that an NN ensemble will outperform a single best classifier. To explain these facts, we measured dataset complexity by means of a number of characteristics borrowed from [12]. It turned out that, according to the majority of them, SAGE is more complex than Colon. This is not surprising, since the former comprises 9 types of cancer while the latter contains only one. The reason for the higher complexity of SAGE is highly overlapped classes. That is why NN ensembles achieved better results on the less complex Colon data. Thus, dataset complexity analysis is very useful, because one can roughly estimate the expected performance on the dataset under study before classifying it. Based on the dataset complexity, it is also possible to explain why a single NN sometimes performs better than an NN ensemble. If a dataset is easy to classify, a single best NN classifier (SBC) may be able to do so with high accuracy. In this case, it is likely that the other classifiers are greatly inferior to the SBC. Hence, combining one good classifier and several bad classifiers will drown out the vote of the SBC in the votes of the incorrect classifiers. As a result, the whole ensemble will be unable to outperform an SBC. In contrast, when a dataset is difficult to classify, an SBC is a weak classifier and so are the other classifiers. An ensemble of such weak classifiers can therefore improve on the SBC result [1, 22].

To confirm our hypothesis about the link between dataset complexity and classification performance, we carried out a broader experiment, this time with a variable number of genes for the base classifiers in the ensemble. Thus, the number of top genes varied from classifier to classifier, which allowed us to consider a more general case. Individual NNs and NN ensembles of different structures were created 500 times. The AUC values of both SBCs and ensembles were collected, and based on this information, statistical tests were carried out. We tested the hypothesis that an NN ensemble can outperform an SBC. The Kruskal-Wallis nonparametric one-way Analysis of Variance (ANOVA) was done, followed by multiple comparisons with Scheffe's S adjustment for the significance level α (α = 0.01 or α = 0.05). See [23] for details of these tests. Based on the results of these tests, NN ensembles indeed fared better than SBCs for SAGE but not for Colon, where SBCs were superior to ensembles. Thus, statistical tests on a larger sample confirmed our findings reported above when all base classifiers in the ensemble worked with the same number of genes defined by C.
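The core of this testing step can be approximated as follows (a minimal sketch, not the authors' code: scipy's Kruskal-Wallis test is used, the Scheffe-adjusted multiple comparisons are omitted, and the AUC samples are invented stand-ins):

```python
from scipy.stats import kruskal

# Stand-in AUC samples collected over repeated runs (hypothetical values).
auc_sbc      = [0.71, 0.69, 0.73, 0.70, 0.72]
auc_ensemble = [0.75, 0.74, 0.76, 0.73, 0.77]

# Kruskal-Wallis nonparametric one-way ANOVA on the two groups.
stat, p = kruskal(auc_sbc, auc_ensemble)
print(f"H = {stat:.3f}, p = {p:.4f}")
if p < 0.05:   # alpha = 0.05, one of the levels used in the experiments
    print("Reject H0: the AUC distributions differ significantly.")
```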

Besides performance analysis, we were interested to see how often different feature selection methods were coupled with SBCs (and therefore the most accurate classifiers in each ensemble). The information on this subject was collected for 500 ensembles per combination scheme. It turned out that for the SAGE dataset, greedy forward selection emerged as a strong leader (56% of cases). For the Colon dataset, the all-pairs Bø-Jonassen algorithm was clearly dominant (85% of cases). In contrast, individual ranking rarely led to the best performance of an individual NN (7.1% and 1.7% of cases for the SAGE and Colon data, respectively). Greedy forward selection and all-pairs Bø-Jonassen perform a broader search than individual ranking. Hence, they are capable of finding a better solution, but at a higher computational cost.

5.1 Biological Relevance of Selected Genes

Due to space limitations, we consider only one case: the SAGE dataset with rare gene filtering at thr = 2. The Colon dataset has already been studied in many works, while the SAGE dataset is rather new. Besides, it contains multiple types of cancer, so it is more interesting for the analysis of gene relevance.

In focusing on genes for biological interpretation, we noticed another advantage of classifier ensembles. Counting the number of times a certain gene (among the selected genes) was utilised by the base classifiers can indicate the importance or relevance of this gene. Thus, this count serves as a gene rank. Combining results from several ensembles can provide an even better (more stable) ranking. For instance, when applied to the combined list of genes generated for the six values of C, the top 4 genes are:

Gene 721 TGAGTGGTCA: ESTs, highly similar to HYPOTHETICAL 13.6 KD PROTEIN in NUP170-ILS1 INTERGENIC REGION [Saccharomyces cerevisiae].

Gene 409 GAGTGGGGGC: ESTs ('EST' means expressed sequence tag), highly similar to LYSOSOMAL PRO-X CARBOXYPEPTIDASE PRECURSOR [Homo sapiens].

Gene 596 GTGCATCCCG: Casein kinase 2, beta polypeptide.

Gene 629 GTTCTCCCAC: ESTs, highly similar to PROTEIN TRANSPORT PROTEIN SEC61 ALPHA SUBUNIT [Canis familiaris].
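A minimal sketch of this counting scheme, assuming each base classifier exposes the list of gene indices it used (the subsets below are hypothetical):

```python
from collections import Counter

# Hypothetical gene subsets used by base classifiers across several ensembles.
subsets = [
    [721, 409, 596, 12],
    [721, 409, 629, 88],
    [721, 596, 629, 409],
]

# The usage count of a gene serves as its rank; most used genes come first.
counts = Counter(gene for subset in subsets for gene in subset)
for gene, n in counts.most_common(4):
    print(gene, n)
```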

Our first expectation was that, since different types of cancer are present in the dataset, there should be genes (among the top selected) that can be found in many human tissues. This turned out to be true for all four genes. We consulted the paper [24] and Table 6 [25] supplementing that paper, which contains the list of ubiquitously expressed genes (i.e., genes that can be found in many human organs), as well as gene descriptions found on the Web.

Gene 721 can be found in many human tissues such as brain, breast, colon, heart, kidney, prostate, skin, etc. However, the link of gene 721 to cancer is obscure, and it may be that it is just a ubiquitous gene. We found one study [26] pointing to the high expression level of this gene in chromophobe carcinoma [27], which is one of the most common histological subtypes of renal cell carcinoma (RCC), which, in its turn, is one of the 10 most frequent malignancies in Western societies.

Gene 409 may be related to essential hypertension [28]. There is evidence that hypertension may be related to cancer [29]: for example, high blood pressure in combination with smoking can cause lung cancer [30]. However, this hypothesis is still disputed [31] for other types of cancer: in particular, it may be that some antihypertensive drugs are associated with the risk of cancer, but a precise mechanism of such an association is yet unknown. Currently, all that researchers seem to agree on is that cancer and hypertension share some common pathological mediators, and through these, hypertension might influence the malignant growth of cancer. LYSOSOMAL PRO-X CARBOXYPEPTIDASE PRECURSOR was also found among the top 50 marker genes used to classify acute lymphoblastic/myeloid leukemia (ALL/AML) in [32].

Gene 596 can be connected to cancer [33, 34]. This gene encodes the beta subunit of casein kinase II, a ubiquitous protein kinase which regulates metabolic pathways, signal transduction, transcription, translation, and replication [35].

Thus, it is a very important gene. It has been observed that protein kinase CK2 is elevated in various cancers [36]. In addition to cancers, CK2 is found to interact with the human immunodeficiency virus type 1 (HIV-1) proteins and the human herpes simplex virus type 1 (HSV-1) protein EB63 [35].

Gene 629 is highly similar to the canine gene. Its cellular function may be protein synthesis and transport. In medicine, human safety often demands prior tests involving animals, including dogs. By observing the effects of such clinical tests, a greater understanding of various diseases and their potential cures can be achieved, though extrapolation from dogs to humans is not always straightforward because, for example, metabolic and toxic outcomes differ across species. Recently, in [37], it was found that Sec61a, involved in protein secretion, strikingly responds to inflammatory stimuli. Another study [38] also showed that Gene 629 is involved in the inflammatory process in mice. It is estimated that inflammation contributes to the development of at least 15% of all cancers [39]. Therefore Gene 629 may be indirectly linked to cancer.

Besides the top genes, interesting information can be dug out of other genes, too. For example, let us consider the following two genes:

Gene 96 AGCCACCACG Human mRNA for KIAA0149 gene, complete cds.

Gene 594 GTGATCTCCG ESTs.

Gene KIAA0159 is a scavenger receptor expressed by endothelial cells, which are involved in the control of blood pressure, blood clotting, atherosclerosis, and inflammation. These cells are located on the interior surface of blood vessels and form a bridge between blood in the lumen and the rest of the vessel wall. Scavenger receptors have been implicated in the pathogenesis of atherosclerosis, and there is enough evidence to associate atherosclerosis with cancer [40, 41] (also, Google provides more than 25000 links for the phrase "atherosclerosis and cancer"). In other words, both diseases may represent variants of a common disease entity.

Gene 594 was found in 83 mRNA-source sequences clustered in two large groups [42]. The description of the SAGE libraries points to connections of this gene to various cancers and Gaucher disease. Researchers hypothesised that patients with Gaucher disease might be at a higher risk of such malignant disorders as cancer [43, 44]. However, because Gaucher disease is relatively rare, it is difficult to study the links between it and cancers using large samples. Nevertheless, there is quite strong indication that Gaucher disease and hematological cancers (lymphoma, myeloma) are related [44, 45].

6 Conclusion

In this paper, we studied cancer classification using gene expression data. Two real-world datasets, differing in complexity and the number of cancer types, were utilised in experiments. We demonstrated 1) the importance of feature selection in making the ensembles of NNs diverse, 2) that if dataset complexity is high, e.g., a dataset is composed of cases belonging to multiple types of cancer, small NN ensembles composed of 5-9 classifiers can outperform a single best NN classifier as attested by the AUC value, 3) that NN ensembles can benefit from diversity-based pruning prior to classifier combination, and 4) that using measures of dataset complexity can help to evaluate the expected performance of individual NN classifiers and NN ensembles on a dataset. From the biological point of view, employing NN ensembles helps to concentrate on small groups of genes thanks to two mechanisms: filtering out rarely selected genes and those genes that were utilised by only a few base classifiers in the ensemble.

References

1. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, Inc., Hoboken, NJ (2004)

2. Bay, S.: Nearest neighbor classification from multiple feature sets. Intelligent Data Analysis 3(3) (1999) 191–209

3. Liu, B., Cui, Q., Jiang, T., Ma, S.: A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics 5 (2004) 136

4. Tan, A.C., Gilbert, D.: Ensemble machine learning on gene expression data for cancer classification. Applied Bioinformatics 2(Suppl 3) (2003) 75–83

5. Long, P.M., Vega, V.B.: Boosting and microarray data. Machine Learning 52(1-2) (2003) 31–44

6. Dudoit, S., Fridlyand, J.: Classification in microarray experiments. In Speed, T., ed.: Statistical Analysis of Gene Expression Microarray Data. Interdisciplinary Statistics. Chapman and Hall/CRC, Boca Raton, Florida (2003) 93–158

7. Paik, M., Yang, Y.: Combining nearest neighbor classifiers versus cross-validation selection. Statistical Applications in Genetics and Molecular Biology 3(1) (2004) Article 12

8. Cho, S.B., Won, H.H.: Data mining for gene expression profiles from DNA microarray. Int J of Software Engineering and Knowledge Engineering 13(6) (2003) 593–608

9. Park, C., Cho, S.B.: Evolutionary computation for optimal ensemble classifier in lymphoma cancer classification. In Carbonell, J.G., Siekmann, J., eds.: Proceedings of the Fourteenth International Symposium on Methodologies for Intelligent Systems, Maebashi City, Japan. (2003) 521–530

10. Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996) 123–140

11. Prodromidis, A.L., Stolfo, S., Chan, P.K.: Pruning classifiers in a distributed meta-learning system. In: Proceedings of the First Panhellenic Conference on New Information Technologies: 8-10 October 1998; Athens, Greece. (1998) 151–160

12. Ho, T.K., Basu, M.: Complexity measures of supervised classification problems. IEEE Trans on Pattern Analysis and Machine Intelligence 24(3) (2002) 289–300

13. Bø, T.H., Jonassen, I.: New feature subset selection procedures for classification of expression profiles. Genome Biology 3(4) (2002) 0017.1–0017.11

14. Lu, Y., Han, J.: Cancer classification using gene expression data. Information Systems 28(4) (2003) 243–268

15. Aha, D.W.: Generalizing from case studies: a case study. In Sleeman, D.H., Edwards, P., eds.: Proceedings of the Ninth International Workshop on Machine Learning: 1-3 July 1992; Aberdeen, Scotland, Morgan Kaufmann (1992) 1–10

16. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27(8) (2006) 861–874

17. Velculescu, V.E., Zhang, L., Vogelstein, B., Kinzler, K.W.: Serial analysis of gene expression. Science 270(5235) (1995) 484–487

18. (http://lisp.vse.cz/challenge/ecmlpkdd2004/)

19. Gandrillon, O.: Guide to the gene expression data. In Berka, P., Cremilleux, B., eds.: Proceedings of the ECML/PKDD Discovery Challenge Workshop: 20 September; Pisa, Italy. (2004) 116–120

20. (http://microarray.princeton.edu/oncology/affydata/index.html)

21. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci 96(12) (1999) 6745–6750

22. Breiman, L.: Random forests. Machine Learning 45(1) (2001) 5–32

23. Zar, J.H.: Biostatistical Analysis. Prentice Hall, Inc., Upper Saddle River, NJ (1999)

24. Velculescu, V.E., Madden, S.L., Zhang, L., Lash, A.E., Yu, J., Rago, C., Lal, A., Wang, C.J., Beaudry, G.A., Ciriello, K.M., Cook, B.P., Dufault, M.R., Ferguson, A.T., Gao, Y., He, T.C., Hermeking, H., Hiraldo, S.K., Hwang, P.M., Lopez, M.A., Luderer, H.F., Mathews, B., Petroziello, J.M., Polyak, K., Zawel, L., Zhang, W., Zhang, X., Zhou, W., Haluska, F.G., Jen, J., Sukumar, S., Landes, G.M., Riggins, G.J., Vogelstein, B., Kinzler, K.W.: Analysis of human transcriptomes. Nature Genet 23 (1999) 387–388

25. (http://www.nature.com/ng/journal/v23/n4/extref/ng1299-387b S1.pdf)

26. Boer, J.M., Huber, W.K., Sultmann, H., Wilmer, F., von Heydebreck, A., Haas, S., Korn, B., Gunawan, B., Vente, A., Fuzesi, L., Vingron, M., Poustka, A.: Identification and classification of differentially expressed genes in renal cell carcinoma by expression profiling on a global human 31,500-element cDNA array. Genome Research 11 (2001) 1861–1870

27. (http://www.dkfz heidelberg.de/abt0840/whuber/rcc/clear chromo genes.txt)

28. (http://harvester.embl.de/harvester/P427/P42785.htm)

29. Hamet, P.: Cancer and hypertension: a potential for crosstalk? J Hypertens 15 (1997) 1573–1577

30. Lindgren, A., Pukkala, E., Nissinen, A., Tuomilehto, J.: Blood pressure, smoking, and the incidence of lung cancer in hypertensive men in North Karelia in Finland. Am J Epidemiol 158 (2003) 442–447

31. Silva, D.G., Schonbach, C., Brusic, V., Socha, L.A., Nagashima, T., Petrovsky, N.: Identification of "pathologs" (disease-related genes) from the RIKEN mouse cDNA dataset using human curation plus FACTS, a new biological information extraction system. BMC Genomics 5 (2004)

32. Chow, M.L., Moler, E.J., Mian, I.S.: Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiol. Genomics 5 (2001) 99–111

33. Faust, R.A., Gapany, M., Tristani, P., Davis, A., Adams, G.L., Ahmed, K.: Elevated protein kinase CK2 activity in chromatin of head and neck tumors: association with malignant transformation. Cancer Letters 101 (1996) 31–35

34. Wang, H., Davis, A., Yu, S., Ahmed, K.: Response of cancer cells to molecular interruption of the CK2 signal. Molecular and Cellular Biochemistry 227 (2001) 167–174

35. (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list uids=1460)

36. Guerra, B., Issinger, O.G.: Protein kinase CK2 and its role in cellular proliferation,development and pathology. Electrophoresis 20 (1999) 391–408

37. Higgins, M.A., Berridge, B.R., Mills, B.J., Schultze, A.E., Gao, H., Searfoss, G.H., Baker, T.K., Ryan, T.P.: Gene expression analysis of the acute phase response using a canine microarray. Toxicological Sciences 74 (2003) 470–484

38. Ibrahim, S.M., Koczan, D., Thiesen, H.J.: Gene-expression profile of collagen-induced arthritis. J Autoimmunity 18(2) (2002) 159–167

39. Marx, J.: Inflammation and cancer: the link grows stronger. Science 306(5698)(2004) 966–968

40. Li, J.J., Gao, R.L.: Should atherosclerosis be considered a cancer of the vascular wall? Medical Hypotheses 64(4) (2005) 694–698

41. Ross, J.S., Stagliano, N.E., Donovan, M.J., Breitbart, R.E., Ginsburg, G.S.:Atherosclerosis and cancer. Annals of the New York Acad Sci 947 (2001) 271–293

42. (http://www.ncbi.nlm.nih.gov/SAGE/index.cgi?cmd=tagsearch&anchor=NLAIII&org=Hs&tag=GTGATCTCCG)

43. Rosenbloom, B.E., Weinreb, N.J., Zimran, A., Kacena, K.A., Charrow, J., Ward, E.: Gaucher disease and cancer incidence: a study from the Gaucher Registry. Blood 105(12) (2005) 4569–4572

44. Shiran, A., Brenner, B., Laor, A., Tatarsky, I.: Increased risk of cancer in patientswith Gaucher disease. Cancer 72(1) (1993) 219–224

45. deMagalhaes Silverman, M.: Cancer and Gaucher disease. Gaucher Clinical Perspectives 4(2) (1996)

Multivariate Time Series Classification via Stacking of Univariate Classifiers. Application to Fault Detection in Dynamic Systems⋆

Carlos Alonso1, Óscar Prieto1, Juan José Rodríguez2 and Aníbal Bregón1

1 Grupo de Sistemas Inteligentes, Departamento de Informática, Universidad de Valladolid, Spain.

2 Lenguajes y Sistemas Informáticos, Universidad de Burgos, Spain.

Abstract. This work explores the capacity of Stacking to generate multivariate time series classifiers from classifiers of its univariate time series components. The Stacking scheme proposed uses K-NN with Dynamic Time Warping as a dissimilarity measure for the level 0 learners and Naïve Bayes at level 1. The method has been tested on a fault identification problem for a laboratory scale continuous process plant. The multivariate time series has eleven components and there are fourteen classes. Experimental results show that, for the available data set, the proposed Stacking configuration performs fairly well, increasing the accuracy achieved by NB or K-NN as stand-alone methods by an order of magnitude. This is an interesting issue because good univariate time series classifiers do not always perform satisfactorily in the multivariate case.

1 Introduction

Time series classification is a current research problem. Time series are of interest in several research areas like statistics, signal processing or control theory [1]. Consequently, several techniques have been proposed to tackle this problem; a detailed review of them may be found in [2]. Hidden Markov Models [3,4], Artificial Neural Networks [5,6] and Dynamic Time Warping [4,7,8,9] have been successfully applied to univariate time series classification. Feature extraction is also a popular method.

The complexity of the problem increases when we consider multivariate time series: not every univariate method scales up to cope with multivariate problems. For instance, k-nearest neighbours (K-NN) using Dynamic Time Warping (DTW) as a dissimilarity measure behaves reasonably well for some univariate problems [7] but degrades in the multivariate case. In this work we explore the possibility of using univariate classification methods to handle each multivariate time series component -itself a univariate time series- introducing an additional classifier to obtain the final class.
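For reference, a minimal dynamic-programming implementation of the DTW dissimilarity between two univariate series (a standard textbook formulation, not necessarily the exact variant used in the experiments):

```python
import numpy as np

def dtw(a: np.ndarray, b: np.ndarray) -> float:
    """DTW dissimilarity between two univariate series a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

print(dtw(np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 1.0, 2.0])))  # 0.0
```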

This approach is a variant of the Stacking ensemble method. Stacking is characterized by two classification levels: level 0, which produces different classifiers for the same data set, and level 1, which tries to generalize from level 0 classifications. Level 1 generalization is accomplished by inducing another classification model from the same data set, once it has been processed by the level 0 classifiers [10].

⋆ This work has been partially funded by the Spanish Ministry of Education and Culture, through grant DPI2005-08498, and Junta Castilla y León VA088A05.

Usually, Stacking applies different classification methods to the same data set at level 0. In this work, we first split each multivariate time series into each of its components, and then we induce a K-NN classifier for each component. Given that the multivariate series classes are not usually characterized by a single component, we do not expect that a simple combination method -i.e. vote- may produce accurate results. Nevertheless, the principles of Stacking generalization also apply to this particular configuration. In this work we analyze whether the generalization introduced by the first level classifier may compensate for the information lost by the univariate classifiers at level 0 and produce an accurate multivariate classifier.

The rest of this paper is organized as follows. We first discuss the motivating example that inspired this work. Then, the machine learning scheme proposed is presented in detail. Afterwards, the experimental setting and results are shown. Finally, some conclusions are stated.

2 Motivating Example

Continuous dynamic industrial processes have two peculiarities related to the accessible observations. First, the set of measured observations is fixed, because of the cost and additional complexity of introducing new sensors in the system. Hence, reasoning about the process situation is usually limited to the available observations and knowledge of the process. Second, as a consequence of its dynamic nature, observation values vary in time and are usually stored. Hence, the historic values of each observable variable may be considered a univariate time series, and the set of historic values of all the variables of interest a multivariate time series. Therefore, diagnosis of past and current faults may be achieved by analyzing the past multivariate time series, particularly if each diagnosable fault generates similar behaviour patterns in the time series. If this is the case, each fault mode may be associated with a unique class and the diagnosis problem may be cast as the induction of multivariate time series classifiers.

Of course, machine learning has been widely used for diagnosis problems. For instance, Inductive Logic Programming [11], Neural Networks [12], KDD techniques [13], decision trees [14] and combinations of techniques like Wavelet On-Line Preprocessing (WOLP) with Autonomous Recursive Task Decomposition (ARTD) [15] have been used with different degrees of success. However, to the best of our knowledge, few authors formulate the problem as multivariate time series classification [15,16].

For this work, we have used the laboratory scale plant shown in figure 1. Although a laboratory plant, its complexity is comparable to that encountered in several subsystems of real processes. It is made of four tanks {T1...T4}, five pumps {P1...P5}, and two PID controllers acting on pumps P1, P5 to keep the levels of {T1, T4} close to their specified set points. Temperatures of tanks {T2, T3} are controlled by two PIDs that command, respectively, resistors {R2, R3}. The plant may work with different configurations, and a simple setting without recirculation -pumps P3, P4 and resistor R2 are off- has been chosen. There are eleven different measured

[Figure 1: piping and instrumentation diagram of the plant, showing tanks T1-T4, pumps P1-P5 (P2-P4 with ON/OFF control), resistors R2 and R3 (ON/OFF), flow sensors FT01-FT04, level sensors LT01 and LT04, level controllers LC01 and LC04, and temperature sensors TT02-TT04.]

Fig. 1. The diagram of the plant.

variables or parameters in the system: levels of tanks T1 and T4, the value of the PID control signal on pumps P1 and P5, in-flow on tank T1, out-flow on tanks T2, T3 and T4, and temperatures on tanks T2, T3 and T4.

We have considered fourteen different faults, listed in table 1. Each fault produces a -sometimes slightly- different pattern in the time series of eleven components that describe the historic behavior of the plant. Figure 2 shows fourteen instances of the series, one for each faulty class. In addition, in the absence of faults, the plant works in a stationary state and the time series values remain constant, except for noise and disturbances.

As figure 2 illustrates, the classes cannot be characterized by a single univariate series. However, there are univariate patterns that seem to be strongly associated with particular faults. For instance, class FM11 is the only one that produces an overflow in tank 1, as shown by the LT1 time series. This circumstance may be better identified by a univariate time series classifier.

3 Stacking of Univariate Classifiers

We want to examine whether Stacking allows the induction of multivariate time series classifiers from univariate classifiers induced from each component of the original series. To accomplish this, we have used the Stacking configuration shown in Figure 3. If the original multivariate time series has dimension n, it is split into its n components to obtain n univariate time series. For each time series, the K-NN algorithm is used. Instead of returning only the class, K-NN computes the confidence for every class. Consequently, the output vector of level 0 has n ∗ N components, N being the total number

[Figure 2: one example multivariate series per fault class FM01-FM14 (rows); the columns show the eleven variables FT01, FT02, FT03, FT04, LT01, LT04, PI.1 control, PI.4 control, TT02, TT03 and TT04.]

Fig. 2. Examples of classes of faults. Each row shows a case of a fault mode. Each column shows one of the variables.

Table 1. Fault modes

Class  Component  Description
FM01   T1         Small leakage in tank T1
FM02   T1         Big leakage in tank T1
FM03   T1         Pipe blockage T1 (left outflow)
FM04   T1         Pipe blockage T1 (right outflow)
FM05   T3         Leakage in tank T3
FM06   T3         Pipe blockage T3 (right outflow)
FM07   T2         Leakage in tank T2
FM08   T2         Pipe blockage T2 (left outflow)
FM09   T4         Leakage in tank T4
FM10   T4         Pipe blockage T4 (right outflow)
FM11   P1         Pump failure
FM12   P2         Pump failure
FM13   P5         Pump failure
FM14   R2         Resistor failure in tank T2

of classes -11 ∗ 14 = 154 in our setting-. This output vector is the level 1 input. As the level 1 learner we have chosen Naïve Bayes (NB). The level 1 output is the output of the classifier. Ten-fold stratified cross-validation is used to create the level 1 learning set.

[Figure 3: schema of the Stacking variant - the n univariate component series feed n level 0 K-NN classifiers, whose confidence outputs are concatenated as inputs to the level 1 classifier, which produces the final output.]

Fig. 3. Schema of the Stacking variant used in this work.

For the level 0 classifiers we have opted for K-NN with DTW as the dissimilarity measure because of its simplicity and good behaviour as a univariate time series classifier.

NB has been selected as the level 1 classifier because it is a global and continuous classifier; these are desirable properties for the level 1 classifier [10]. Against it is the fact that the univariate series are not independent. For the domain problem that we are interested in, it seems that some subsets of components are adequate to predict some classes and other subsets to predict other classes. Hence, it is reasonable to expect some independence between different subsets of the multivariate time series components. On the contrary, some dependence must exist among the components of each subset. Nevertheless, the fact of training the level 0 classifiers with different and disjoint data gives a chance to increase independence.
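A minimal sketch of this level 0/level 1 pipeline (our own illustration, not the authors' WEKA setup: GaussianNB stands in for Naïve Bayes, a Euclidean distance stands in for DTW to keep the example short, and the level 1 set is built directly on the training data rather than by the ten-fold stratified cross-validation described above):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def knn_confidences(comp_train, labels, comp_query, k, n_classes, dist):
    """Level 0 learner: K-NN over one univariate component, returning
    per-class confidences (vote fractions) instead of a hard label."""
    d = [dist(s, comp_query) for s in comp_train]
    votes = labels[np.argsort(d)[:k]]
    return np.bincount(votes, minlength=n_classes) / k

def level1_vector(train, labels, query, k, n_classes, dist):
    """Concatenate the n component-wise confidence vectors into the
    n*N-dimensional level 1 input (11*14 = 154 in the paper's setting)."""
    return np.concatenate([
        knn_confidences(train[:, c], labels, query[c], k, n_classes, dist)
        for c in range(train.shape[1])])

# toy data: 20 series, 3 components of length 50, 2 classes
rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3, 50)), rng.integers(0, 2, 20)
euclid = lambda a, b: float(np.linalg.norm(a - b))   # stand-in for DTW
Z = np.array([level1_vector(X, y, x, 3, 2, euclid) for x in X])
nb = GaussianNB().fit(Z, y)                          # level 1 learner
print(nb.predict(Z[:5]))
```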

4 Experimental Settings

Due to the cost of obtaining enough data for a fourteen-class classification problem from the laboratory plant, we have resorted to a detailed, non-linear quantitative simulation of the plant. We have run twenty simulations for each class, adding noise to the sensor readings. In order to obtain more realistic data, each fault mode is modelled via a parameter, affected by a factor α in the [0, 1] range which allows grading the severity of the fault. Simulations start at a stationary state and run for 900 seconds, time enough to reach a stationary state after the fault appears. The time of occurrence of each fault is randomly generated in the interval [180, 300]. Similarly, we randomly chose the fault magnitude α.

The multivariate time series obtained have been used to train the following learners: NB, 1-NN DTW, 3-NN DTW, 5-NN DTW, Stacking (level 0: 1-NN DTW, level 1: NB), Stacking (level 0: 3-NN DTW, level 1: NB), and Stacking (level 0: 5-NN DTW, level 1: NB). This allows us to compare the behaviour of Stacking with respect to the classifiers used at both levels when they are employed alone: NB and K-NN. K-NN for the multivariate time series uses a simple extension of one-dimensional DTW: DTW is computed for each component, normalized and summed over all components to obtain a global dissimilarity measure.
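A minimal sketch of this multivariate extension, reusing a univariate dtw function such as the one sketched in the introduction; the exact normalization is not detailed in the paper, so dividing each component's DTW by the series length is an assumption:

```python
def multivariate_dtw(a, b, dtw):
    """Global dissimilarity between multivariate series a and b, given as
    sequences of univariate components: component-wise DTW, normalized
    (assumed: by series length), summed over all components."""
    return sum(dtw(a[c], b[c]) / max(len(a[c]), len(b[c]))
               for c in range(len(a)))
```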

5 Results

All the experiments have been performed with the WEKA tool [17]. Error estimations and hypothesis testing have been obtained with the slightly conservative corrected resampled t-test. Instead of standard cross-validation, we have made fifteen training-test iterations. In each iteration, we have randomly selected 90% of the available data for the training set and 10% for the test set. This experimental setting is proposed in [18].

Table 2 shows the error estimation for the seven learners considered. As expected, neither NB nor K-NN obtains accurate classifiers for this problem. NB seems to be affected by the non-independence of the univariate time series. K-NN, estimating a global dissimilarity measure by adding the DTW results for every univariate component, obtains somewhat better error rates, but the improvement is not significant. However, Stacking dramatically increases the accuracy of the classifier, reducing the error estimation by an order of magnitude.

Table 2. Percentage of error for each method

Method                                         Mean (deviation)
Naive Bayes                                    17.14 (5.26)
1-nn, DTW distance                             12.38 (6.31)
3-nn, DTW distance                             15.00 (4.90)
5-nn, DTW distance                             16.19 (6.02)
Stacking: Naive Bayes, 1-nn, DTW distance       0.95 (1.63)
Stacking: Naive Bayes, 3-nn, DTW distance       1.19 (2.20)
Stacking: Naive Bayes, 5-nn, DTW distance       1.19 (2.58)

Table 3 shows the significance test results at the 0.01 level. Every Stacking configuration is significantly better than the stand-alone methods. The non-independence of the univariate time series does not seem to affect NB at level 1, which behaves extremely well with this data set.

Table 3. t-test matrix results (* means that the method of the column is significantly better than the method of the row)

    a  b  c  d  e  f  g
a   -  0  0  0  *  *  *    a = 1-nn, DTW distance
b   0  -  0  0  *  *  *    b = 3-nn, DTW distance
c   0  0  -  0  *  *  *    c = 5-nn, DTW distance
d   0  0  0  -  *  *  *    d = Naive Bayes
e   0  0  0  0  -  0  0    e = Stacking: Naive Bayes, 1-nn, DTW distance
f   0  0  0  0  0  -  0    f = Stacking: Naive Bayes, 3-nn, DTW distance
g   0  0  0  0  0  0  -    g = Stacking: Naive Bayes, 5-nn, DTW distance

6 Conclusions

This work evaluates the capacity of Stacking to address multivariate time series classification from univariate time series classifiers induced for each component of the original time series. This is an important question because good univariate classification methods are known, but they do not always produce good results when extended to the multivariate case. The results obtained for the laboratory plant data set with the proposed Stacking scheme are good enough to suggest that this approach may be viable.

Of course, this is an initial examination of the problem and further research is necessary to obtain conclusions on the generality of the approach. In particular, we plan to test the scheme on different data sets and compare the behaviour of different level 1 learners.

References

1. Kadous, Mohammed Waleed: Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series. The University of New South Wales, School of Computer Science and Engineering (2002) http://www.cse.unsw.edu.au/ waleed/phd/

2. Rodríguez Diez, Juan José: Técnicas de Aprendizaje Automático para la Clasificación de Series. Departamento de Informática, Universidad de Valladolid (2004)

3. Yoshua Bengio: Markovian Models for Sequential Data. Neural Computing Surveys 2 (1999) 129–162 http://www.iro.umontreal.ca/ lisa/bib/pub_subject/markov/pointeurs/hmms.ps

4. Lawrence Rabiner and Biing-Hwang Juang: Fundamentals of Speech Recognition. Prentice Hall (1993)

5. Christopher M. Bishop: Neural Networks for Pattern Recognition. Oxford University Press (1995)

6. Simon Haykin: Neural Networks: A Comprehensive Foundation. 2nd edition, Prentice Hall (1998)

7. E. Keogh and C. A. Ratanamahatana: Exact indexing of dynamic time warping. Knowledge and Information Systems 7(3) (2005) 358–386

8. J. Colomer, J. Melendez and F.I. Gamero: Qualitative representation of process trends for situation assessment based on cases. In 15th Triennial World Congress of the International Federation of Automatic Control, Barcelona (2002)

9. C. S. Myers and L. R. Rabiner: A comparative study of several dynamic time-warping algorithms for connected word recognition. The Bell System Technical Journal 60(7) (1981) 1389–1409

10. Wolpert, D.H.: Stacked Generalization. Neural Networks 5 (1992) 241–259 http://citeseer.csail.mit.edu/wolpert92stacked.html

11. C. Feng: Inducting temporal fault diagnostic rules from a qualitative model. In S. Muggleton, editor, Inductive Logic Programming. Academic Press (1992)

12. V. Venkatasubramanian and K. Chan: A neural network methodology for process fault diagnosis. AIChE J. 35(12) (1989) 1993–2002

13. Derek Sleeman, F. Mitchell, and R. Milne: Applying KDD techniques to produce diagnostic rules for dynamic systems. Technical report AUCS/TR9604, Department of Computing Science, University of Aberdeen (1996)

14. Antonio J. Suárez, Pedro J. Abad, J. A. Ortega and Rafael M. Gasca: Diagnosis progresiva en el tiempo de sistemas dinámicos. IV Jornadas de ARCA, Sistemas Cualitativos y Diagnosis, JARCA02 (2002)

15. Davide Roverso: Fault diagnosis with the Aladdin transient classifier. System Diagnosis and Prognosis: Security and Condition Monitoring Issues III, AeroSense 2003, Aerospace and Defense Sensing and Control Technologies Symposium (2003)

16. Carlos Alonso, Juan J. Rodríguez and Belarmino Pulido: Enhancing consistency-based diagnosis with machine learning techniques. Current Topics in Artificial Intelligence: 10th Conference of the Spanish Association for Artificial Intelligence, volume 3040 of Lecture Notes in Artificial Intelligence, Springer (2004) 312–321

17. Ian H. Witten and Eibe Frank: Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco (2000)

18. Claude Nadeau and Yoshua Bengio: Inference for the Generalization Error. Machine Learning 52(3) (2003) 239–281

Collaborative Multi-Strategical Classification for Object-Oriented Image Analysis

Germain Forestier, Cedric Wemmert and Pierre Gancarski

Abstract. This paper deals with the use of a collaborative multi-strategy classification applied to image analysis. This system integrates different kinds of unsupervised classification methods and produces, for each classifier, a result built according to the results of all the other methods. Each method occurrence tries to make its result converge towards the results of the other method occurrences thanks to different operators. In this paper we highlight how the classifiers collaborate and we present results in the paradigm of object-oriented classification of a VHR remotely sensed image of an urban area.

1 Introduction

In the last decade, automatic interpretation of remotely sensed images has become an increasingly active domain. Sensors are now able to get images with a very high spatial resolution (VHR) (i.e. 1 meter resolution) and spectral resolution (up to 100 spectral bands). This increasing precision generates a significant amount of data. These new kinds of data become more and more complex, and a challenging task is to design algorithms and systems able to process all the data in reasonable time. In these VHR images, the abundance of noisy, correlated and irrelevant bands disturbs the classical per-pixel classification procedures. In the paradigm of object-oriented classification [1] [2], the image is segmented and the segments are classified using spectral and spatial attributes (e.g. shape index, texture...). These new heterogeneous types of data need specific algorithms to discover relevant groups of objects in order to present interesting information to geographers and geoscientists.

Within the framework of our research, we have studied in our team many methods of unsupervised classification. All have advantages, but also some limitations which sometimes seem to be complementary. From this study was born the idea to combine, in a multi-strategical classifier [3], many clustering algorithms and make them collaborate in order to propose a single result combining their various results. For that, we define a generic model of an unsupervised classifier. Our system allows several entities of clusterers from this model to collaborate without having to know their type (neural networks, conceptual classifiers, probabilistic clusterers...).

In this paper, we present how this collaborative classification process can be used in the context of automatic image analysis. We first highlight the multi-strategy algorithm and present how the methods collaborate to find a unified result. Then we present an experiment made using this approach in the domain of automatic object-oriented classification of an urban area with a remote sensing image. Finally, we conclude on the perspective of integrating high-level knowledge on the data in the different steps of the collaboration.

2 Collaborative Multi-Strategy Classification

Many works focus on combining different results of clustering, which is commonly called clustering aggregation [6], multiple clusterings [7] or cluster ensembles [8] [9]. Whereas these kinds of approaches try to combine different clustering results in a final step, our approach makes the methods collaborate during the clustering process so that they converge and obtain relatively similar solutions. These solutions are then unified to propose a unique clustering.

Indeed, it is difficult to compute a unique result from a heterogeneous set of clustering results representing all the knowledge given by the various methods, unless there is a trivial correspondence between the clusters of the different results. We propose to carry out a collaborative process which consists of an automatic and mutual refinement of the clustering results. Each clustering method produces a result that is built according to the results of all the other methods: they search for an agreement about all their proposals. The goal of these refinements is to make the results of the various clustering methods converge to results which have almost the same number of clusters, all of these clusters being statistically similar. Thus we obtain very similar results for all the method occurrences, and links between them representing the correspondences between their clusters. In many cases, it is then possible to define a bijective correspondence function and to apply a unifying technique, such as our multi-view voting method [4].

The entire classification process is presented in figure 1. It is decomposed into three main phases:

1. Initial classifications - First, a phase of initial classifications is performed: classifications are computed by each method with its parameters.

2. Results refinement - An iterative phase of convergence of the results, which corresponds to alternations between two steps, as long as the convergence and the quality of the results improve:

2.1 A step of evaluation of the similarity between the results and mapping of the classes.

2.2 A step of refinement of the results.

3. Unification - The refined results are unified using a voting algorithm.

2.1 Initial Classifications

The first step of the process is the calculation, by each instance of a clustering method, of its result. Each instance has its own clustering method and parameters.

2.2 Results Refinement

The mechanism we propose for refining the results is based on the concept of distributed local resolution of conflicts by the iteration of four phases:

– Detection of the conflicts by evaluating the dissimilarities between pairs of results
– Choice of the conflicts to solve
– Local resolution of these conflicts (concerning the two results implied in the conflict)
– Management of these local modifications in the global result (if they are relevant)

[Figure 1: flowchart of the process. The m classification methods produce results 1 to m from the data set (step 1); the results are iteratively evaluated (step 2.1) and refined (step 2.2: conflicts detection 2.2.1, conflicts choice 2.2.2, local resolution 2.2.3, local modifications management 2.2.4); finally the refined results are unified into a single result (step 3).]

Fig. 1. Collaborative multi-strategy classification process

Evaluation and Mapping - In the continuation of this article we will assume that we have $N$ clustering results ($R^1$ to $R^N$) given by the $N$ clustering method instances. Each result $R^i$ is composed of $n_i$ clusters $C^i_k$ where $1 \le k \le n_i$.

In order to evaluate the similarity between two results we compute the confusion matrix between these two clustering results. We calculate $\frac{1}{2}N \times (N-1)$ confusion matrices, one for each pair of clustering results. From these confusion matrices we observe the repartition of the objects through the clusters of each result. The confusion matrix between two results $R^i$ and $R^j$ is defined as:

$$M^{i,j} = \begin{pmatrix} \alpha^{i,j}_{1,1} & \cdots & \alpha^{i,j}_{1,n_j} \\ \vdots & \ddots & \vdots \\ \alpha^{i,j}_{n_i,1} & \cdots & \alpha^{i,j}_{n_i,n_j} \end{pmatrix} \quad \text{with} \quad \alpha^{i,j}_{k,l} = \frac{|C^i_k \cap C^j_l|}{|C^i_k|}.$$

Then we define, for each cluster of each result, which clusters from the other results are its corresponding clusters. A cluster $C^j_{k'}$ of the result $R^j$ is the corresponding cluster of the cluster $C^i_k$ of the result $R^i$ if it is the most similar to $C^i_k$.

We use the confusion matrix to define the similarity without using a concept of distance between objects by:

$$\omega^{i,j}_k = \rho^{i,j}_k \, \alpha^{j,i}_{k_m,k} \quad \text{where} \quad \alpha^{i,j}_{k,k_m} = \max \{\alpha^{i,j}_{k,l}\}_{l=1 \ldots n_j} \qquad (1)$$

It is evaluated by observing

– the relationship between the size of their intersection and the size of the cluster itself:

$$\alpha^{i,j}_k = (\alpha^{i,j}_{k,l})_{l=1,\ldots,n_j} \quad \text{where} \quad \alpha^{i,j}_{k,l} = \frac{|C^i_k \cap C^j_l|}{|C^i_k|} \qquad (2)$$

– and by taking into account the distribution of the data in the other clusters

$$\rho^{i,j}_k = \sum_{l=1}^{n_j} \left( \alpha^{i,j}_{k,l} \right)^2 \qquad (3)$$
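A minimal sketch of Eqs. (1)-(3), assuming each result is given as a vector of per-object cluster indices:

```python
import numpy as np

def confusion(labels_i, labels_j, n_i, n_j):
    """alpha[k, l] = |C^i_k ∩ C^j_l| / |C^i_k| (Eq. 2) for two clusterings
    given as per-object cluster indices."""
    M = np.zeros((n_i, n_j))
    for a, b in zip(labels_i, labels_j):
        M[a, b] += 1
    return M / M.sum(axis=1, keepdims=True)

def omega(labels_i, labels_j, n_i, n_j, k):
    """Similarity of cluster k of result i with its corresponding cluster
    in result j (Eq. 1), combining rho (Eq. 3) and the reverse alpha."""
    a_ij = confusion(labels_i, labels_j, n_i, n_j)
    a_ji = confusion(labels_j, labels_i, n_j, n_i)
    rho = float((a_ij[k] ** 2).sum())    # Eq. (3)
    k_m = int(np.argmax(a_ij[k]))        # index of the corresponding cluster
    return rho * a_ji[k_m, k]            # Eq. (1)

# toy example: a 4-cluster and a 3-cluster labeling of 8 objects
li = np.array([0, 0, 1, 1, 2, 2, 3, 3])
lj = np.array([0, 0, 1, 1, 1, 2, 2, 2])
print(omega(li, lj, 4, 3, k=0))          # 1.0: perfect agreement on cluster 0
```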

Refinement - During a phase of refinement of the results, several local resolutions are performed in parallel. The detection of the conflicts consists, for all the clusters of all the results, in seeking all the couples $(C^i_k, R^j)$ for two clustering results $R^i$ and $R^j$ such that $C^i_k \ne C^j_{k'}$, where $C^j_{k'}$ is its corresponding cluster in the result $R^j$. A conflict importance coefficient is calculated according to the interclass similarity between the two clusters and is defined as $1 - \omega^{i,j}_k$. Then a conflict is selected according to the conflict importance coefficient (the most important, for example) and its resolution is started. This conflict and all those concerning the two associated methods are removed from the list of conflicts. This process is repeated until the list of conflicts is empty.

The resolution of a conflict consists in applying an operator to $R^i$ and an operator to $R^j$. These operators are chosen according to the clusters $C^i_{k_i}$ and $C^j_{k_j}$ involved in the conflict:

– merging of clusters: the clusters to merge are chosen according to the representative clusters of the treated cluster. The representative clusters of a cluster $C^i_k$ from result $R^i$ compared to result $R^j$ are the set of clusters from $R^j$ which have more than $p_{cr}\%$ of their objects included in $C^i_k$ ($p_{cr}$ is given by the user).

– splitting a cluster into subclusters: all the objects of one cluster are classified in subclusters,
– reclassification of a group of objects: one cluster is removed and its objects are reclassified in all the other existing clusters.

But the simultaneous application of operators on $R^i$ and $R^j$ is not always relevant. Indeed, it does not always increase the similarity of the results implied in the treated conflict (Red Queen effect: "success on one side is felt by the other side as failure to which must be responded in order to maintain one's chances of survival" [11]), and the iteration of conflict resolutions may lead to a trivial solution where all the methods are in agreement: a result with only one cluster including all the objects to classify, or a result having one cluster for each object.

So we defined the local concordance and quality rate, which estimates the similarity and the quality for a couple of results, by

$$\gamma^{i,j} = \frac{1}{2} \left[ \frac{\sum_{k=1}^{n_i} \left( p_s \, \omega^{i,j}_k + p_q \, \delta^i_k \right)}{n_i} + \frac{\sum_{k=1}^{n_j} \left( p_s \, \omega^{j,i}_k + p_q \, \delta^j_k \right)}{n_j} \right] \qquad (4)$$

where $p_s + p_q = 1$ and $\delta^i_k$ is a cluster quality criterion chosen by the user ($0 < \delta^i_k \le 1$). For example, with methods which include a distance measure, the user can select the intra-class inertia. Without a distance measure, one can use class predictivity (Cobweb), class variance (EM algorithm), etc.

At the end of each conflict resolution, after the application of the operators, the couple of results (the two new results, the two old results, or one new result with one old result) which maximizes this rate is kept.

After the resolution of all the conflicts, a global application of the modifications proposed by the refinement step is decided according to the improvement of the global agreement coefficient:

$$\Gamma = \frac{1}{m} \sum_{i=1}^{m} \Gamma^i \qquad (5)$$

where

$$\Gamma^i = \frac{1}{m-1} \sum_{j=1, j \ne i}^{m} \gamma^{i,j} \qquad (6)$$

is the global concordance and quality rate of the result $R^i$ with all the other results.
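A minimal sketch of Eqs. (4)-(6), assuming the ω similarities and δ quality values have already been computed:

```python
import numpy as np

def gamma(omega_ij, omega_ji, delta_i, delta_j, ps=0.5, pq=0.5):
    """Local concordance and quality rate gamma^{i,j} (Eq. 4); ps + pq = 1.
    omega_ij[k]: similarity of cluster k of result i with result j;
    delta_i[k]: quality of cluster k of result i."""
    return 0.5 * (np.mean(ps * omega_ij + pq * delta_i)
                  + np.mean(ps * omega_ji + pq * delta_j))

def global_agreement(gamma_matrix):
    """Global agreement coefficient Gamma (Eqs. 5-6) from the m x m matrix
    of pairwise gamma^{i,j} values; the diagonal is ignored."""
    m = gamma_matrix.shape[0]
    off_diag = gamma_matrix.sum(axis=1) - np.diag(gamma_matrix)
    return float(np.mean(off_diag / (m - 1)))

g = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
print(global_agreement(g))   # 0.5
```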

Then a new iteration of the convergence phase is started if the global agreement coefficient has increased, and an intermediate unified result is calculated by combining all the results.

2.3 Combination of the Results

In the final step, all the results tend to have the same number of classes, which are increasingly similar; thus we use our voting algorithm [4] to compute a unified result combining the different results. This is a new multi-view voting algorithm which makes it possible to combine, into one unique result, many different clustering results that do not necessarily have the same number of clusters.

During this phase of combination we define two new concepts which help us to give relevant information to the user: the concepts of relevant clusters and non-consensual objects.

The relevant clusters correspond to the groups of objects of a same cluster which were classified in an identical way in a majority of the results used. Moreover, we can quantify this relevance by using the percentage of classifications that are in agreement. These clusters are interesting to highlight because they are, in the majority of cases, the relevant classes for the user.

Reciprocally, a non-consensual object is an object that has not been classified identically in a majority of results, i.e. it does not belong to any of the relevant clusters. These objects often correspond to the edges of the clusters in the data space (the results section presents an example of non-consensual objects in our object-oriented image analysis).
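A minimal sketch of the non-consensual-object test, assuming the results have already been mapped to a common cluster numbering; the strict-majority threshold is our reading of "a majority of results":

```python
import numpy as np

def non_consensual(mapped_labels):
    """Flag objects whose most frequent label is not shared by a strict
    majority of the results. mapped_labels: (n_results, n_objects) array."""
    n_results, n_objects = mapped_labels.shape
    flags = np.empty(n_objects, dtype=bool)
    for o in range(n_objects):
        counts = np.bincount(mapped_labels[:, o])
        flags[o] = counts.max() <= n_results / 2
    return flags

labels = np.array([[0, 1, 2],
                   [0, 2, 1],
                   [0, 1, 0]])
print(non_consensual(labels))   # [False False  True]
```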

3 Object-Oriented Image Analysis

3.1 Creation of the Regions

In this section we present results obtained by the application of our multi-strategical classifier for an object-oriented classification. The image selected for the evaluation is of Strasbourg (France) and has been taken by Quickbird sensors. These sensors give one panchromatic channel with a resolution of 0.7 meter and 3 color channels with a resolution of 2.8 meters. The color channels are resampled to obtain a four-channel image with a resolution of 0.7 meter [5]. This area is representative of the urban structure of Western cities and is characterized by many different objects (e.g. buildings, streets, water, vegetation) that exhibit a diverse range of spectral reflectance values.

Before starting to classify this image we need to decompose it into a set of regions. This step, the segmentation, is performed with the Derivaux et al. algorithm [10] (Fig. 2 shows an extract of the segmented image). The idea of this method is to apply a watershed transform on a fuzzy classification of the image to obtain the segmentation.

After the segmentation, we first apply a seed fill algorithm to construct regions. Then we compute many features for each region in order to build our set of complex data. We use the GeOxygene library 1 to compute shape features. GeOxygene is an open source project developed by the IGN (Institut Geographique National), the French National Mapping Agency. It provides users with an extensible object data model (geographic features, geometry, topology and metadata) which implements OGC (Open Geospatial Consortium) specifications and ISO standards in the geographic information domain. Using GeOxygene assures that our algorithm is inter-operable with geographical tools.

1 http://oxygene-project.sourceforge.net

Fig. 2. Extract of the image segmentation. The border of each segment is highlighted in white.

The features we use can be grouped into two types:

1. spectral features: mean of the spectral reflectance of the segment.
2. geometric shape features: area, perimeter, elongation and circularity ratio.

The spectral feature is simply the mean, for each band, of the spectral reflectance of each pixel composing the region. From these means we also compute the NDVI2 index and the SBI3 index, defined as:

$$NDVI = \frac{NIR - RED}{NIR + RED} \qquad SBI = \sqrt{RED^2 + NIR^2}$$

where RED and NIR stand for the spectral reflectance measurements of the red band and the near-infrared band, respectively. The NDVI index varies between −1.0 and +1.0. For vegetated land it generally ranges from about 0.1 to 0.7, with values greater than 0.5 indicating dense vegetation. These two indices have been suggested by the geographers of the LIV 4 (Image & Ville Laboratory) to analyse this kind of urban area.

Geometric shape features like area or perimeter are trivial to compute with GeOxygene. For the elongation we use the smallest of two values: the elongation computed from the ratio of the eigenvalues of the covariance matrix, and the elongation approximated using the bounding box and its degree of filling. The circularity ratio is defined below, where S is the surface and P the perimeter of the region:

$$\text{circularity ratio} = \frac{4\pi S}{P^2}$$

2 Normalized Difference Vegetation Index
3 Soil Brightness Index
4 http://imaville.u-strasbg.fr/

The circularity ratio ranges from 0 (for a line) to 1 (for a circle). These different shape features are useful to group regions according to their form and not only their spectral values (as in a traditional per-pixel classification).
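A minimal sketch of these per-region features (illustrative NumPy code, not the GeOxygene-based implementation; elongation is omitted since it depends on GeOxygene geometry):

```python
import numpy as np

def region_features(red, nir, area, perimeter):
    """Spectral indices and circularity ratio for one region;
    red, nir: per-pixel reflectances of the region."""
    mean_red, mean_nir = float(np.mean(red)), float(np.mean(nir))
    ndvi = (mean_nir - mean_red) / (mean_nir + mean_red)
    sbi = float(np.hypot(mean_red, mean_nir))    # sqrt(RED^2 + NIR^2)
    circularity = 4.0 * np.pi * area / perimeter ** 2
    return {"NDVI": ndvi, "SBI": sbi, "circularity": circularity}

print(region_features(np.array([0.1, 0.2]), np.array([0.5, 0.6]),
                      area=100.0, perimeter=40.0))
```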

When all these different features have been computed for each region, we are ready to use our multi-strategical classifier on the data to try to identify relevant groups of objects.

3.2 Application of the Multi-Strategical Classifier

We chose four methods with different parameters for experimental tests on our image:

– M1: K-means [14], k = 10, with random initialisations.
– M2: K-means, k = 15, with random initialisations.
– M3: Growing Neural Gas [15], with a maximum of 100 nodes, a maximum edge age of 88, 10000 iterations and a node insertion frequency of 100.
– M4: Cobweb [13], with an acuity of 7.5.

We have obtained the results summarized below, before and after the application of our refining algorithm (n_i is the number of clusters and Γ^i is the global agreement coefficient as defined earlier).

       n_i initial   Γ^i initial   n_i final   Γ^i final
R1     10            0.379         8           0.532
R2     15            0.336         8           0.538
R3     14            0.317         7           0.476
R4     21            0.186         8           0.362
µ      15            0.304         7.75        0.477

We applied our multi-view voting algorithm to these results and obtained the unified result presented in Fig. 4. This result is composed of 8 different classes. We present in Fig. 3 the voting result for all the objects before and after the application of our algorithm, with the color legend:

– in white: all the methods agreed on the classification.
– in gray: one method disagreed with the other ones.
– in black: the non-consensual objects (two or more results classified these objects differently).

We can notice a significant reduction of non-consensual objects during the refining process. Fig. 5 represents the evolution of the Γ value through iterations of the algorithm. We also notice here that the different classifiers are more and more in agreement in finding a consensual classification. The study of non-consensual objects is very useful. In our experiment, it helps us to highlight objects that are difficult to classify. This information is very interesting for the geographer, who can concentrate his attention on these objects. In our image, non-consensual

objects are often small houses which have a spectral reflectance close to that of the road and an unconventional form. This information can also be used to improve the segmentation, since non-consensual objects are often the consequence of a bad segment in the segmentation.

Fig. 3. Non-consensual objects (in black): a) first iteration - b) last iteration

Fig. 4. The unified result. We can identify four interesting classes: vegetation, road, water and small houses. The four other classes contain mainly badly segmented objects and do not have a real semantic interpretation.

0.2

0.25

0.3

0.35

0.4

0.45

0.5

0.55

0 50 100 150 200 250 300 350 400 450 500

Gam

ma

Iterations

Gamma

Fig. 5. Evolution of the gamma index(global agreement coefficient).


Fig. 6. Evolution of the number of clusters over the iterations.

4 Conclusion

In this paper we presented a process of collaborative multi-strategy classification of complex data and showed how it can be used for the automatic object-oriented classification of remote sensing images. Our method carries out an unsupervised multi-strategy classification of these data, giving a single result that combines the results suggested by the various classification methods. It makes the individual results converge towards a single one (without necessarily reaching it) and produces very similar classes. This makes it possible to put the classes found by the various methods into correspondence and finally to apply a unification algorithm, such as a voting method. We can thus propose to the user a single result representing all the results found by the various unsupervised classification methods.

We are now interested in using domain knowledge to improve each step of the collaborative mining process. In the collaborative step, several operators have been designed to lead the different algorithms towards a unified solution (split, merge, reclassification). We believe that endowing these operators with knowledge, or designing new specific operators that use background knowledge, will help to improve the final result. We recently designed an ontology of urban objects. An ontology [12] is a specification of an abstract, simplified view of the world being represented for some purpose; it models a domain in a formal way. The ontology defines a set of concepts (buildings, water, etc.) and their relations to each other. These definitions make it possible to describe, and reason about, the studied domain. Preliminary tests have already been carried out to integrate knowledge extracted from this ontology, and the first results are promising.
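As a toy illustration of what such operators manipulate (not the paper's actual implementation, which also uses agreement information to decide when to apply them), split and merge on a flat labeling might look like:

import numpy as np

def merge(labels, a, b):
    # Merge operator: fuse cluster b into cluster a.
    out = labels.copy()
    out[out == b] = a
    return out

def split(labels, c, mask):
    # Split operator: move the objects of cluster c selected by `mask`
    # into a new cluster.
    out = labels.copy()
    out[(out == c) & mask] = labels.max() + 1
    return out

labels = np.array([0, 0, 1, 1, 2, 2])
print(merge(labels, a=1, b=2))  # [0 0 1 1 1 1]
print(split(labels, c=0, mask=np.array([1, 0, 0, 0, 0, 0], dtype=bool)))
# [3 0 1 1 2 2]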

Acknowledgments. The authors would like to thank Nicolas Durand and Sebastien Derivaux for their work on the segmentation, and the reviewers for their valuable comments. This work is part of the FoDoMuST (http://lsiit.u-strasbg.fr/afd/sites/fodomust/) and ECOSGIL (http://ecosgil.u-strasbg.fr) projects.

References

1. Whiteside, T., Ahmad, W.: A comparison of object-oriented and pixel-based classification methods for mapping land cover in northern Australia. Spatial Sciences Conference 2005 (2005)

2. Oruc, M., Marangoz, A.M., Buyuksalih, G.: Comparison of pixel-based and object-oriented classification approaches using Landsat-7 ETM spectral bands. ISPRS 2004 (2004)

3. Wemmert, C., Gancarski, P.: Collaborative multi-step mono-level multi-strategy classification. Journal on Multimedia Tools and Applications (2006)

4. Wemmert, C., Gancarski, P.: A multi-view voting method to combine unsupervised classifications. Artificial Intelligence and Applications, Malaga, Spain (2002)

5. Puissant, A., Ranchin, T., Weber, C., Serradj, A.: Fusion of Quickbird MS and Pan data for urban studies. EARSeL Symposium (2003) 77–83

6. Gionis, A., Mannila, H., Tsaparas, P.: Clustering aggregation. ICDE '05: Proceedings of the 21st International Conference on Data Engineering (2005) 341–352

7. Boulis, C., Ostendorf, M.: Combining multiple clustering systems. 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) (2004) 63–74

8. Strehl, A., Ghosh, J.: Cluster ensembles – a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research (JMLR) (2002) 583–617

9. Fern, X.Z., Brodley, C.E.: Solving cluster ensemble problems by bipartite graph partitioning. Proceedings of the Twenty-first International Conference on Machine Learning (2004)

10. Derivaux, S., Lefevre, S., Wemmert, C., Korczak, J.: Watershed segmentation of remotely sensed images based on a supervised fuzzy pixel classification. IEEE International Geoscience and Remote Sensing Symposium (2006)

11. Paredis, J.: Coevolving cellular automata: be aware of the red queen! 7th Int. Conference on Genetic Algorithms (1997) 393–400

12. Gruber, T.R.: Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human-Computer Studies (1995) 907–928

13. Fisher, D.H.: Knowledge acquisition via incremental conceptual clustering. Machine Learning, volume 2 (1987) 139–172

14. MacQueen, J.: Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press (1967) 281–297

15. Fritzke, B.: Growing cell structures – a self-organizing network for unsupervised and supervised learning. Neural Networks, volume 7 (9) (1994) 1441–1460
