
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. X, SEPTEMBER XXXX 1

Chameleon 2 – an improved graph-based clustering algorithm for finding structure in complex data

Tomas Barton, Tomas Bruna, and Pavel Kordik

Abstract—Traditional clustering algorithms fail to produce human-like results, being limited to the discovery of spherically shaped clusters or being too sensitive to noise.

We propose an improved version of the Chameleon algorithm which overcomes several drawbacks of the previous version. Furthermore, a cutoff method which produces a high-quality clustering without requiring human interaction is presented. The proposed algorithm is compared with state-of-the-art algorithms on a benchmark that consists of artificial and real-world datasets.

Index Terms—clustering, graph clustering, cluster analysis, chameleon, clustering algorithm

I. INTRODUCTION

CLUSTERING is a technique that tries to group similar objects into the same groups, called clusters, and dissimilar

objects into different clusters [1]. There is no general consensus on what exactly a cluster is; in fact, it is generally acknowledged that the problem is ill-defined [2]. Various algorithms use slightly different definitions of a cluster, e.g. based on the distance to the closest cluster center or the density of points in its neighborhood. Unlike supervised learning, where labeled data are used to train a model which is afterwards used to classify unseen data, clustering belongs to the category of unsupervised problems. Clustering is a more difficult and challenging problem than classification [3]. Clustering has applications in many fields including data mining, machine learning, marketing, biology, chemistry, astrology, psychology and spatial database technology.

Probably due to this interdisciplinary scope, most respected authors ([4], [5]) define clustering in a vague way, leaving space for several interpretations. Generally, clustering algorithms are designed to capture a notion of grouping in the same way as a human observer does. The ultimate goal would be the detection of structures in higher dimensions, where humans fail. How to evaluate such methods is another problem. In this contribution we focus on patterns that are easily detected by a human; yet many algorithms fail such a test. A detailed overview of many algorithms and their applications, including recently proposed clustering methods, can be found in [6]. So far hundreds of algorithms have been proposed; some assign items to exactly one cluster, others allow fuzzy assignment to many clusters. There are methods based on cluster prototypes, mixture models, graph structures, density or grid-based models.

T. Barton is with the Institute of Molecular Genetics of the ASCR, v. v. i., Videnska 1083, Prague 4, Czech Republic and the Czech Technical University in Prague, Faculty of Information Technology. E-mail: tomas.barton@fit.cvut.cz.

P. Kordik and T. Bruna are with the Faculty of Information Technology, Czech Technical University in Prague, Thakurova 9, Prague 6, Czech Republic.

Manuscript received April XX, 2015; revised September XX, 2015.

The Chameleon algorithm was originally introduced by Karypis et al. [7] in 1999. It is a graph-based hierarchical clustering algorithm which tries to overcome the limitations of traditional clustering approaches. The main idea is based on a combination of approaches used by the CURE [8] and ROCK [9] algorithms (see Section II for more details). CURE ignores information about the interconnectivity of two clusters, while ROCK ignores their closeness and emphasizes only their interconnectivity. Chameleon tries to resolve these drawbacks by combining the proximity and connectivity of items as well as taking into account internal cluster properties during the merging phase. Unlike many other algorithms, Chameleon aims to produce results as close to a human partitioning as possible.

The well-known k-means algorithm, usually referred to as Lloyd's algorithm, was proposed in 1957, although published only in 1982 [10]. The algorithm itself was independently discovered in several fields [11], [12]. Since that time many algorithms have been proposed, but k-means remains quite popular due to its simplicity and nearly linear computational complexity. This fact also emphasizes the difficulty of designing a general purpose clustering algorithm [3].

We implemented a modified version of the algorithm which is capable of finding complex structures in various datasets with a minimal error rate. In this article, we describe the modifications and compare the altered version with the original implementation as well as with other clustering algorithms. Furthermore, in the last section, we focus on the automatic selection of a high-quality clustering during the process of merging clusters.

Based on the description in the original paper [7], we were unable to reproduce exactly the same results as were presented in the study. Nonetheless, with several modifications our implementation works at least as well as the original one (the comparison is based solely on visualization of results on several datasets; there are no quantifiable results available).

II. RELATED WORK

Chameleon gained a lot of attention in the literature; however, most authors mention the algorithm as an interesting graph-based clustering approach but do not investigate its properties further. This is probably due to the fact that a complete implementation of the algorithm is not available.

Hierarchical agglomerative clustering (HAC) is one of the oldest clustering algorithms; it uses a bottom-up approach, merging the closest items and producing a hierarchical data structure. Lance and Williams [13] proposed formulas for faster computation of merged-item similarity, yet the algorithm still has a high time complexity, at least O(n²), while not providing high-quality results.


[Fig. 1 pipeline: data set → construct a sparse k-nearest neighbor graph → partition the graph → merge partitions → final clusters]

Fig. 1. The overview of the Chameleon approach. Diagram courtesy of Karypis et al. [7].

There are several methods for computing the similarity of clusters, usually referred to as single-link (HAC-SL), complete-link (HAC-CL), average-link (HAC-AL) and Ward's linkage. For discovering clusters of arbitrary shapes, the most suitable method is single linkage, which computes the similarity of clusters from their closest items; however, this method cannot deal with outliers and noise [14].

Another group of clustering algorithms models the density function by a probabilistic mixture model. It is assumed that the data follow some distribution model and each cluster is described by one or more mixture components [15]. The Expectation-Maximization (EM) algorithm was proposed by Dempster et al. [16] in 1977 and is often used to infer the parameters of mixture models.

Jarvis and Patrick [17] proposed in 1973 an algorithm that defines the similarity of items based on the number of neighbors they share. A very similar approach is chosen by DBSCAN [18], a density-based algorithm proposed by Ester et al. in 1996. DBSCAN is capable of discovering clusters of arbitrary shapes, provided that the cluster density can be determined beforehand and each cluster has a uniform distribution. A cluster is defined as a maximal set of density-connected points, where every core point in a cluster must have at least a minimum number of points (MinPts) within a given radius (ε). All points within one cluster can be reached by traversing a path of density-connected points. The algorithm itself can be relatively fast; however, configuring its parameters properly requires prior knowledge of the dataset, or a run of the k-NN algorithm to estimate an appropriate parameter setting. The main disadvantage of DBSCAN is its sensitivity to the parameter setting: even a small modification of the ε parameter can cause all data points to be assigned to a single cluster. Moreover, the algorithm fails on datasets with a non-uniform distribution.

CURE (CLustering Using Representatives) [8] is a clustering algorithm that uses a variety of techniques to address various drawbacks of the agglomerative clustering method. A cluster is represented by multiple points; the first point to be chosen is the one furthest from the center of the cluster. Once the representatives are chosen, all of them are shrunk towards the center of the cluster by a user-defined factor. This helps to moderate the effect of outliers, since the absolute shift is larger for points lying farther out. As a clustering method it uses hierarchical agglomerative clustering, with a distance function taking the minimum distance between any two representative points of two clusters. However, this algorithm requires the specification of the number of clusters we would like to find (a parameter k). There are two phases of eliminating outliers. The first one occurs when approximately 1/3 of the desired k clusters is reached: slowly growing clusters are removed because they might potentially contain only outliers. The second phase of elimination is done when the desired number of clusters is reached; at this point all small clusters are removed. CURE is more robust to outliers than a hierarchical agglomerative clustering algorithm, and it also manages to identify clusters that have a non-spherical shape and wide variance in size.

Strehl and Ghosh [19] proposed in 2002 several ensemble approaches for combining multiple clusterings into a single final solution. The Cluster-based Similarity Partitioning Algorithm (CSPA) uses METIS [20] for partitioning the similarity matrix into k components.

Shatovska et al. [21] proposed an improved similarity measure in 2007 which provides results similar to those presented in the Chameleon paper. The formula is further discussed in Section IV-A1. Shatovska was also not able to reproduce the same results as were described by Karypis et al. [7].

Liu et al. [22] reported that Chameleon gives the best results on an artificial dataset with a skewed distribution when compared with clusterings produced by the k-means [10], DBSCAN [18] and hierarchical agglomerative clustering algorithms.

III. ORIGINAL CHAMELEON ALGORITHM

In order to find clusters, Chameleon uses a three-step approach. The first two steps aim to partition the data into many small subclusters and the last one repeatedly merges these subclusters into the final result. The whole process is illustrated in Figure 1.

The Chameleon algorithm works with a graph representation of the data. At first, we construct a graph using the k-NN method: a data point, represented by a node, is connected with its k nearest neighbors by edges. If a graph is given, we can skip this step and continue with the second one.
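To make the first step concrete, the following sketch builds such a graph with off-the-shelf tools. It is an illustration only, not our implementation: scikit-learn's kneighbors_graph is assumed to be available, the default k follows Table II, and the distance-to-weight transform 1/(1+d) is just one possible choice.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_knn_graph(data, k=None):
    # Step 1 sketch: connect every point with its k nearest neighbors.
    n = len(data)
    if k is None:
        k = max(1, int(2 * np.log(n)))             # default from Table II: k = 2*log(n)
    dist = kneighbors_graph(data, n_neighbors=k, mode='distance')
    dist = dist.maximum(dist.T)                    # symmetrize: keep an edge if either endpoint selected the other
    weights = dist.copy()
    weights.data = 1.0 / (1.0 + weights.data)      # turn distances into similarity weights (illustrative choice)
    return weights                                 # sparse weighted adjacency matrix

# usage (hypothetical data): graph = build_knn_graph(np.random.rand(500, 2))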

The second step partitions the previously created graph. The goal of partitioning is to produce equal-sized partitions and minimize the number of edges cut. In this manner, many small clusters are created, each containing a few highly connected nodes. For partitioning, Karypis et al. use their own hyper-graph partitioning algorithm called hMETIS, a multilevel partitioning algorithm which works with a coarsened version of the graph. Coarsening is not deterministic, thus for each


run we might obtain a slightly different result (for further details please see [23] and [24]).

The final and most important step merges the partitioned clusters using Chameleon's dynamic modeling framework, starting from the clusters with the highest similarity. The general idea is to begin with connected clusters that are close to each other and that have similar densities. The original Chameleon uses the following function for computing the similarity between two clusters (Ci and Cj) that are about to merge [7]:

Sim(C_i, C_j) = RCL(C_i, C_j)^α · RIC(C_i, C_j)^β    (1)

where RCL stands for relative cluster closeness, RIC denotes relative cluster interconnectivity, and α, β are user-specified parameters that give higher importance either to relative closeness or to relative interconnectivity.

The relative interconnectivity and closeness are computed from the external and internal properties, which describe relations between clusters and among items inside clusters, respectively.

RCL(C_i, C_j) = φ(C_i, C_j) / ( |E_Ci| / (|E_Ci| + |E_Cj|) · φ(C_i) + |E_Cj| / (|E_Ci| + |E_Cj|) · φ(C_j) )    (2)

RIC(C_i, C_j) = φ(C_i, C_j) / ( (φ(C_i) + φ(C_j)) / 2 ) = 2 φ(C_i, C_j) / (φ(C_i) + φ(C_j))    (3)

where |E_Ci,Cj| and |E_Ci| are interconnectivity properties – the number of edges between clusters Ci and Cj, and the number of edges inside cluster Ci, respectively. φ(Ci, Cj) and φ(Ci) denote closeness properties – the average weight of all edges between clusters Ci and Cj, and the average weight of all edges removed by the bisection of cluster Ci, respectively (an example is shown in Figure 2).

B_Ci = bisect(C_i)    (4)

φ̂(C_i) = Σ_{e ∈ B_Ci} w(e)    (5)

φ(C_i) = φ̂(C_i) / |B_Ci|    (6)

B_Ci is the set of edges selected by a bisection algorithm; removing these edges would split the graph into two components (clusters). φ̂(C_i) denotes the total weight of the bisection edges and φ(C_i) their average weight. w(e) is the weight of a given edge, computed by the k-NN algorithm.

It is important to note that the internal properties are computed via bisection. Therefore, the quality of the bisection algorithm determines how well the internal properties are computed.
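As an illustration of how Equations 1–3 combine, the sketch below evaluates the merge score for one cluster pair. The inputs are assumed to be already available: the bisection edge counts |E_Ci|, |E_Cj|, the average bisection edge weights φ(Ci), φ(Cj), and the average weight φ(Ci, Cj) of the edges crossing between the two clusters; this is a sketch of the formulas, not the authors' code.

def relative_closeness(phi_ij, ec_i, ec_j, phi_i, phi_j):
    # RCL, Eq. 2: cross-cluster closeness over a weighted mean of the internal closeness values
    denom = (ec_i / (ec_i + ec_j)) * phi_i + (ec_j / (ec_i + ec_j)) * phi_j
    return phi_ij / denom

def relative_interconnectivity(phi_ij, phi_i, phi_j):
    # RIC, Eq. 3
    return 2.0 * phi_ij / (phi_i + phi_j)

def chameleon_similarity(phi_ij, ec_i, ec_j, phi_i, phi_j, alpha=1.0, beta=2.0):
    # Eq. 1: Sim = RCL^alpha * RIC^beta (alpha, beta defaults taken from Table II)
    rcl = relative_closeness(phi_ij, ec_i, ec_j, phi_i, phi_j)
    ric = relative_interconnectivity(phi_ij, phi_i, phi_j)
    return rcl ** alpha * ric ** beta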


IV. CHAMELEON 2 ALGORITHM

1) Partitioning algorithms: In the partitioning phase, we experimented with the original partitioning algorithm

Fig. 2. Example of finding a bisection between clusters Ci and Cj; φ(Ci, Cj) is computed as the sum of the weights of the edges between the clusters (marked with a dashed line).

Algorithm 1 Original Chameleon
1: procedure ORIGINALCHAMELEON(dataset)
2:     graph ← knnGraph(k, dataset)
3:     partitioning ← hMETIS(graph)
4:     return mergePairs(partitioning)
5: end procedure

hMETIS [23] and the proposed partitioning method based on recursive Fiduccia–Mattheyses bisection [25]. The recursive bisection process is also thoroughly described in [26].

The difference in speed and quality is not surprising, because hMETIS is a multilevel partitioning algorithm which works with a coarsened version of the graph, while our implementation works with the original graph. The coarsened version contains fewer nodes and edges, therefore the process is faster. However, some information may be missing in the coarsened approximation, which can result in worse partitioning.

Algorithm 2 Chameleon 2
1: procedure CHAMELEON2(dataset)
2:     graph ← knnGraph(k, dataset)
3:     partitioning ← recursiveBisection(graph)
4:     partitioning ← floodFill(partitioning)
5:     return mergePairs(partitioning)
6: end procedure

A comparison of the two methods is shown in Table I.
2) Partitioning refinement: Unfortunately, disconnected clusters can occur even when the edge weights are ignored. To fix this situation, we use a simple method called flood fill which finds connected components in the graph. The refinement of the partitioning with the flood fill algorithm is shown in Algorithm 3. The result of the described algorithm is a list of connected components, where each component is represented by a list of nodes.

A. Merging

The most significant modification was made in the merging phase. During merging, Chameleon chooses the most similar cluster pairs and merges them together. The choice is based on a function which evaluates the similarity between every pair of clusters, the so-called similarity measure.


TABLE I
COMPARISON OF THE PARTITIONING METHODS

                   Computational time
Dataset size       hMETIS          RB with F-M

Algorithm 3 Flood Fill
1: procedure FLOODFILL(graph)
2:     partitions ← []
3:     for all node ∈ graph do
4:         partition ← []
5:         FILL(node, partition)
6:         partitions ← partitions ∪ partition
7:     end for
8:     return partitions
9: end procedure
10: procedure FILL(node, partition)
11:     if node.marked? then
12:         return
13:     end if
14:     node.marked ← true
15:     partition ← partition ∪ node
16:     for all neighbor of node do
17:         FILL(neighbor, partition)
18:     end for
19: end procedure
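A minimal runnable counterpart of Algorithm 3, written here as an iterative depth-first search to avoid deep recursion on large partitions; the partition is assumed to be given as an adjacency list mapping each node to its neighbours.

def flood_fill(graph):
    # Split a possibly disconnected partition into its connected components.
    visited = set()
    components = []
    for start in graph:
        if start in visited:
            continue                      # node already belongs to a previously found component
        component = []
        stack = [start]
        while stack:
            node = stack.pop()
            if node in visited:
                continue
            visited.add(node)
            component.append(node)
            stack.extend(graph[node])     # schedule all neighbours of the current node
        components.append(component)
    return components

# usage: flood_fill({0: [1], 1: [0], 2: []}) returns [[0, 1], [2]]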

1) Improved similarity measure: Shatovska et al. [21] propose a modified similarity measure which proved to be more robust than the original function based only on interconnectivity and closeness. We incorporated the function into the Chameleon algorithm, and our experiments showed that the achieved results are always at least as good as with the original measure, most of the time even better.

The whole improved formula [21] can be written as:

Sim_shat(C_i, C_j) = RCL_S(C_i, C_j)^α · RIC_S(C_i, C_j)^β · γ(C_i, C_j)    (7)

RCL_S(C_i, C_j) = s(C_i, C_j) / ( |E_Ci| / (|E_Ci| + |E_Cj|) · s(C_i) + |E_Cj| / (|E_Ci| + |E_Cj|) · s(C_j) )    (8)

RIC_S(C_i, C_j) = min{s(C_i), s(C_j)} / max{s(C_i), s(C_j)}    (9)

γ(C_i, C_j) = |E_Ci,Cj| / min(|E_Ci|, |E_Cj|)    (10)

where s(C_i) is defined as the normalized sum (i.e., the average) of the edge weights in a cluster:

s(C_i) = (1 / |E_Ci|) · Σ_{e ∈ C_i} w(e)    (11)

TABLE II
CHAMELEON 2 PARAMETERS

Parameter    Description                   Default value
k            number of neighbors (k-NN)    2 · log(n)
psize        max. partition size           max{5, n/100}
α            interconnectivity priority    1.0
β            closeness priority            2.0
similarity   determines merging order      Shatovska

and s(C_i, C_j) is computed as the sum of the weights of the edges between the clusters.

The improved measure also incorporates internal and external cluster properties to determine the relative similarity, but in a slightly different way. The internal cluster properties are computed from all edges in the graph. This way, the measure does not have to rely on the quality of the bisection algorithm. Also, not having to bisect every cluster during the merging phase saves quite a lot of computational time.

Additionally, the result is multiplied by the quotient of the clusters' densities γ – the density of the sparser cluster is divided by the density of the denser cluster. This further encourages the merging of clusters with similar densities.
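For illustration, the sketch below evaluates Equations 7–11 for one cluster pair. The per-cluster edge counts and total edge weights, as well as the count and total weight of the edges between the pair, are assumed to be maintained during merging; this is a sketch of the formulas, not our implementation, and clusters without internal edges are handled separately (see the next subsection).

def avg_weight(total_weight, edge_count):
    # s(C), Eq. 11: total edge weight normalized by the number of edges
    return total_weight / edge_count if edge_count > 0 else 0.0

def shatovska_similarity(w_i, e_i, w_j, e_j, w_ij, e_ij, alpha=1.0, beta=2.0):
    # w_i, e_i: total weight and count of edges inside cluster i (likewise for j);
    # w_ij, e_ij: total weight and count of edges between the two clusters.
    s_i, s_j = avg_weight(w_i, e_i), avg_weight(w_j, e_j)
    s_ij = w_ij                                   # s(Ci, Cj): sum of cross-cluster edge weights
    rcl = s_ij / ((e_i / (e_i + e_j)) * s_i + (e_j / (e_i + e_j)) * s_j)   # Eq. 8
    ric = min(s_i, s_j) / max(s_i, s_j)           # Eq. 9: sparser over denser cluster
    gamma = e_ij / min(e_i, e_j)                  # Eq. 10
    return rcl ** alpha * ric ** beta * gamma     # Eq. 7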

A comparison of results achieved by the original and improved similarity measures is provided in the next section.

2) Clusters formed by one node: Independently of the similarity measure chosen, problems arise while merging clusters formed by individual items. Since they contain no edges, the internal characteristics of such clusters are impossible to determine. Chameleon's similarity measures rely heavily on the internal characteristics and without them, clusters with only one node are often merged incorrectly, deteriorating the overall result.

Even when the partitioning algorithm is set to make strictly balanced partitions with the same number of items in each cluster, the partitioning refinement described in Section IV-2 can still produce clusters formed by just one node. Therefore, the problem has to be solved during the merging phase, and our solution is described below.

When computing the similarity of a cluster pair in which one of the clusters contains no edges, only the external properties of the pair are computed. The resulting cluster similarity is set to be the external similarity multiplied by a constant. This multiplication increases the similarity of all pairs containing single-item clusters and causes these clusters to merge with their neighbors in the early stages of the merging process. This way, clusters with one node are quickly merged with the nearest cluster and do not cause problems later on. The multiplying constant chosen for our implementation is 1000, but any number significantly enlarging the computed external properties would work.
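The rule above can be expressed in a few lines; the constant 1000 is the value quoted in the text, and external_sim / full_sim stand for the similarity computed from external properties only and the full measure, respectively (the function names here are illustrative).

SINGLETON_BOOST = 1000.0

def pair_similarity(internal_edges_i, internal_edges_j, external_sim, full_sim):
    # If either cluster has no internal edges, fall back to the boosted external similarity
    # so that single-node clusters merge with a neighbour early in the process.
    if internal_edges_i == 0 or internal_edges_j == 0:
        return SINGLETON_BOOST * external_sim
    return full_sim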

V. DATASETS

Our benchmark intentionally contains mostly 2D or 3D datasets with structure that is clearly distinguishable by a human; some datasets contain noise as well.

¹Labels were added manually; the original dataset did not contain any. Visualization of the assigned labels can be found in Appendix A.


TABLE III
DATASETS USED FOR EXPERIMENTS.

Dataset            d   n      classes  source
aggregation        2   788    7        [27]
atom               3   800    2        [28]
chainlink          3   1000   2        [28]
chameleon-t4.8k    2   8000   7¹       [29]
chameleon-t5.8k    2   8000   9¹       [29]
chameleon-t7.10k   2   10000  8¹       [29]
chameleon-t8.8k    2   8000   8¹       [29]
compound           2   399    6        [30]
cure-t2-4k         2   4000   7        [8]²
D31                2   3100   31       [31]
DS-850             2   850    5        [32]
diamond9           2   3000   9        [33]
flame              2   240    2        [34]
jain               2   373    2        [35]
long1              2   1000   2        [36]
longsquare         2   900    6        [36]
lsun               2   400    3        [28]
pathbased          2   300    3        [37]
s-set1             2   5000   15       [38]
spiralsquare       2   1500   6        [36]
target             2   770    6        [28]
triangle1          2   1000   4        [36]
twodiamonds        2   800    2        [28]
wingnut            2   1016   2        [28]

An overview of all used datasets with their properties and references can be found in Table III. We tried to include the same datasets as were used in the original Chameleon paper; the datasets marked as DS1 and DS2 were not available. The first one comes from CURE [8]; based on the description in the paper, we generated a similar dataset. Instead of DS2, we included the twodiamonds dataset from the FCPS suite [28]. The Chameleon datasets (t4.8k, t5.8k, t7.10k – marked as DS3 in [7], t8.8k – marked as DS4 in [7]) are available for download at the CLUTO website [29] (a software package [39] from Karypis Labs that provides dynamic model clustering, but does not offer the Chameleon algorithm directly).

No labels are provided for these datasets, thus we assigned labels manually.

VI. EXPERIMENTS

We evaluated Chameleon 2 against several popular algorithms. For the evaluation of clusterings we used Normalized Mutual Information (NMI_sqrt) as defined by Strehl and Ghosh in [19]. NMI computes the agreement between a clustering and the ground truth labels which we provided for each dataset. An NMI value of 1.0 means complete agreement of the clustering with the external labels, while 0.0 means the complete opposite. Another popular criterion for external evaluation is the Adjusted Rand Index, which would provide very similar results in this case.
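The score itself can be obtained from common libraries; for example, assuming scikit-learn is available, its geometric averaging option corresponds to the sqrt-normalized NMI used here.

from sklearn.metrics import normalized_mutual_info_score

def nmi_sqrt(ground_truth, predicted):
    # NMI normalized by sqrt(H(U) * H(V)), i.e. the variant of Strehl and Ghosh [19]
    return normalized_mutual_info_score(ground_truth, predicted, average_method='geometric')

# usage: nmi_sqrt([0, 0, 1, 1], [1, 1, 0, 0]) == 1.0 (labels agree up to renaming)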

It is not feasible to run a benchmark of our algorithm against every other existing algorithm. However, we tried to select a representative algorithm from several distinguishable groups

²The author no longer has this dataset; we generated similar data based on images provided in the referenced paper.

Fig. 3. Sorted distances to the 4th nearest neighbor for dataset chameleon-t4.8k. The purple line shows the optimal ε value selected by our heuristic.

of clustering algorithms. Complete results of our experiments can be found in Table IV. The reported NMI value is the best possible result of a given algorithm when configured optimally. These experiments are aimed at exploring the boundaries of each algorithm; in a real-world scenario it is often hard to select the optimal configuration.

The k-means [10] algorithm was provided with the “correct” number of ground truth classes (parameter k). Initialization is randomized and the NMI value is an average of 100 independent runs.

For DBSCAN [18] we followed the author's recommendation regarding ε estimation: first we compute and sort the distances to the 4th nearest neighbor of each data point (see Figure 3). Then, using a simple heuristic, we search for an “elbow” in the distance values, which is typically located in the first third of the distances sorted from highest to lowest value. From the elbow area we choose 10 different ε values and run DBSCAN with MinPts in the interval from 4 to 10. From the 60 DBSCAN clusterings we select the one with the highest NMI value.
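The sketch below illustrates the described heuristic; it is an approximation, not the exact code we used. The 4-NN distances are sorted in descending order, the elbow is taken as the largest drop within the first third of the curve, and DBSCAN is then run around that value.

import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

def estimate_eps(data, k=4):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(data)
    dist, _ = nn.kneighbors(data)                  # first column is the distance of a point to itself
    kdist = np.sort(dist[:, k])[::-1]              # distances to the k-th neighbour, descending
    first_third = kdist[: max(2, len(kdist) // 3)]
    drops = first_third[:-1] - first_third[1:]     # drop between consecutive sorted values
    return first_third[int(np.argmax(drops)) + 1]  # value just after the largest drop (the "elbow")

# usage (hypothetical data matrix X): labels = DBSCAN(eps=estimate_eps(X), min_samples=4).fit_predict(X)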

VII. CUTOFF

The output of the Chameleon algorithm is a hierarchical structure of merges. This result can be useful on its own, but most of the time the user needs a single flat partitioning. To obtain it, we need to explore the hierarchical result and determine where the best single clustering lies.

A. Dendrogram representation

A hierarchical structure is best visualized via a dendrogram. In order to meaningfully visualize Chameleon's result, certain changes to the traditional dendrogram representation have to be made.

Firstly, the first level of the dendrogram does not represent individual items but the small clusters created by the partitioning phase.

Secondly, the height of the nodes has been redefined. Normally, the height of each cluster node is simply the distance between the merged clusters:

h(Ci) = d(Cx, Cy) (12)

In Chameleon, however, we redefined the height to:

h(Ci) = h(Ci−1) + d(Cx, Cy) (13)


TABLE IV
CLUSTERING BENCHMARK ON DATASETS USED IN THE LITERATURE.

Dataset            Ch2-auto  Ch2-nd1  Ch2-Std  DBSCAN  HAC-AL  HAC-CL  HAC-SL  HAC-WL  k-means
aggregation        0.99      0.99     0.98     0.98    0.97    0.87    0.89    0.94    0.85
atom               1.00      0.99     1.00     0.99    0.59    0.57    1.00    1.00    0.29
chainlink          1.00      0.99     1.00     1.00    0.55    0.51    1.00    0.55    0.07
chameleon-t4.8k    0.88      0.89     0.86     0.95    0.67    0.63    0.86    0.65    0.59
chameleon-t5.8k    0.82      0.87     0.82     0.94    0.82    0.68    0.80    0.82    0.77
chameleon-t7.10k   0.86      0.90     0.90     0.97    0.68    0.63    0.87    0.67    0.58
chameleon-t8.8k    0.88      0.89     0.88     0.89    0.69    0.66    0.86    0.68    0.57
compound           0.96      0.95     0.95     0.92    0.85    0.82    0.85    0.82    0.72
cure-t2-4k         0.88      0.91     0.87     0.88    0.83    0.72    0.82    0.78    0.69
D31                0.96      0.96     0.94     0.88    0.95    0.95    0.87    0.95    0.92
diamond9           0.99      0.98     0.97     0.98    1.00    1.00    0.99    1.00    0.95
DS-850             0.98      0.95     0.99     0.98    0.98    0.62    0.99    0.69    0.57
flame              0.87      0.86     0.91     0.90    0.80    0.70    0.84    0.59    0.43
jain               1.00      0.93     1.00     0.89    0.70    0.70    0.86    0.52    0.37
long1              1.00      0.97     1.00     0.99    0.62    0.55    1.00    0.55    0.02
longsquare         0.98      0.97     0.98     0.94    0.90    0.83    0.93    0.84    0.81
lsun               1.00      0.99     1.00     1.00    0.82    0.83    1.00    0.73    0.54
pathbased          0.90      0.81     0.86     0.89    0.71    0.58    0.70    0.62    0.55
s-set1             1.00      0.98     1.00     0.97    0.98    0.97    0.96    0.98    0.95
spiralsquare       0.91      0.93     0.99     0.98    0.74    0.78    0.92    0.67    0.64
target             0.94      0.96     0.94     0.99    0.74    0.70    1.00    0.69    0.69
triangle1          1.00      0.97     1.00     1.00    0.98    0.91    1.00    1.00    0.93
twodiamonds        0.99      0.99     1.00     1.00    0.99    0.97    0.93    1.00    1.00
wingnut            0.97      0.91     0.97     1.00    1.00    1.00    1.00    1.00    0.77


Fig. 4. Chameleon's dendrogram on the flame dataset using the standard similarity (Fig. 4a) and the Shatovska similarity (Fig. 4b). It is easier to automatically find a reasonable cutoff for the latter dendrogram than when the standard similarity is used.

In both equations, h(C_i) represents the height of the cluster C at level i and d(C_x, C_y) is the distance between the clusters at levels x and y which are merged into the cluster C_i. Distance is an inverse similarity measure, therefore it is computed in the following way:

d(C_x, C_y) = 1 / Sim(C_x, C_y)    (14)

The main reason for the change is that over time cluster similarity can increase, and thus the distance decreases. In the standard representation this would mean that the dendrogram would have to grow downwards, which would be confusing, and the search for a line that cuts the dendrogram would not make any sense.
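A small sketch of the cumulative height from Equations 13 and 14, assuming the merge similarities are recorded in the order in which the merges are performed; since every increment is positive, the heights never decrease.

def dendrogram_heights(merge_similarities):
    heights, h = [], 0.0
    for sim in merge_similarities:
        h += 1.0 / sim            # d(Cx, Cy) = 1 / Sim(Cx, Cy), Eq. 14, accumulated as in Eq. 13
        heights.append(h)
    return heights

# usage: dendrogram_heights([4.0, 2.0, 0.5]) returns [0.25, 0.75, 2.75]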

B. First Jump Cutoff

To find an optimal cut in the proposed structure, we came up with a simple yet effective method named First Jump Cutoff.

The method is based on the idea that the first big jump in the dendrogram (the first large distance between levels) is the place where two clusters which should stay separate start to merge. Therefore, this is where the cut should be made.

To find the first jump, we compute the average distance between levels in the first half of the dendrogram. After that, we look for the first distance between levels in the second half which is at least 100 times greater than the computed average. If it is found, the cut is made between these levels. If not, we search for the first jump at least 50 times bigger, 25 times bigger and so forth. The whole method is illustrated in Algorithm 4.

The initial parameters – starting with the 100th multiple and dividing the multiple by 2 – were chosen arbitrarily. Our experiments showed that the algorithm is not very sensitive to these values, and that a starting multiple of 340 and a factor of 1.9 produce the best overall results on the chosen datasets.

VIII. CONCLUSION

In this article, we introduced several advanced clustering techniques, the original Chameleon algorithm and our improved version called Chameleon 2. We described all the


Algorithm 4 First Jump Cutoff
1: procedure FIRSTJUMP(initMultiplier, factor)
2:     avgJump ← ComputeAvgJump()
3:     multp ← initMultiplier
4:     while multp > 0 do
5:         res ← FindBiggerJump(multp × avgJump)
6:         if res.found then
7:             return res.height
8:         else
9:             multp ← multp / factor
10:        end if
11:    end while
12:    return 0
13: end procedure
14: procedure FINDBIGGERJUMP(jump)
15:     result ← []
16:     for all level ∈ Dendrogram do
17:         levelWidth ← level.next.height − level.height
18:         if levelWidth > jump then
19:             result.found ← true
20:             result.height ← level.height
21:             return result
22:         end if
23:     end for
24:     result.found ← false
25:     return result
26: end procedure

Fig. 5. Chameleon 2 result on dataset chameleon-t8.8k using max. partition size = 10, α = 4 and a manual cutoff.

differences and improvements which aim to enhance the clustering produced by Chameleon 2, and conducted several experiments. Both the external evaluation scores and the visual results demonstrate the superiority of our approach over the other tested methods. We also proposed a method which is able to automatically select the best clustering result from the modified Chameleon dendrogram.

Our original goal was to create a fully automated algorithm which produces high-quality clusterings on diverse datasets without having to set any dataset-specific parameters. We achieved this and we managed to find a configuration which

Fig. 6. Clustering result of our Chameleon 1 implementation on dataset chameleon-t7.10k, which uses hMETIS for bisection and a manual cutoff. There are several tiny clusters left after the partitioning phase; the total number of discovered clusters is 15.

Fig. 7. Chameleon 2 result on dataset chameleon-t7.10k using max. partition size = 10, α = 4 and a manual cutoff. The total number of discovered clusters is 10.

works with a minimal error rate on all of the tested data. Furthermore, by configuring each phase of the algorithm, Chameleon 2 is able to correctly identify clusters in basically any dataset. Therefore, Chameleon 2 can also be viewed as a general, robust clustering framework which can be adjusted for a wide range of specific problems.

Fig. 8. Generated dataset cure-t2-4k with ground truth labels, inspired by CURE's dataset (marked as DS1 in [7]). Many algorithms fail to identify the two upper ellipsoids due to the chain of points connecting them. Chameleon 2 provides the best result with α = 1 and β = 1.


APPENDIX A
DATASET VISUALIZATIONS

All datasets used in our experiments contain a distinguishable pattern. In the case of datasets contaminated by noise, clusters are areas with a high density of data points.

ACKNOWLEDGMENT

We would like to thank Petr Bartunek, Ph.D. from the IMG CAS institute for supporting our research and letting us publish all details of our work. This research is partially supported by CTU grant SGS15/117/OHK3/1T/18 “New data processing methods for data mining” and by Program NPU I (LO1419) of the Ministry of Education, Youth and Sports of the Czech Republic.

REFERENCES

[1] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, Inc., 1990.

[2] R. Caruana, M. Elhawary, N. Nguyen, and C. Smith, “Meta Clustering,” in Proceedings of the Sixth International Conference on Data Mining, ser. ICDM ’06. Washington, DC, USA: IEEE Computer Society, 2006, pp. 107–118.

[3] A. K. Jain, “Data Clustering: 50 Years Beyond K-Means,” Pattern Recognition Letters, 2010.

[4] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.

[5] B. S. Everitt, Cluster Analysis. Edward Arnold, 1993.

[6] C. C. Aggarwal and C. K. Reddy, Eds., Data Clustering: Algorithms and Applications. CRC Press, 2014.

[7] G. Karypis, E. Han, and V. Kumar, “Chameleon: Hierarchical Clustering Using Dynamic Modeling,” Computer, vol. 32, no. 8, pp. 68–75, August 1999.

[8] S. Guha, R. Rastogi, and K. Shim, “CURE: an efficient clustering algorithm for large databases,” in ACM SIGMOD Record, vol. 27, no. 2. ACM, 1998, pp. 73–84.

[9] ——, “ROCK: A Robust Clustering Algorithm for Categorical Attributes,” in ICDE, M. Kitsuregawa, M. P. Papazoglou, and C. Pu, Eds. IEEE Computer Society, 1999, pp. 512–521.

[10] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.

[11] G. Ball and D. Hall, “ISODATA: A novel method of data analysis and pattern classification,” Stanford Research Institute, Menlo Park, Tech. Rep., 1965.

[12] J. B. MacQueen, “Some Methods for Classification and Analysis of MultiVariate Observations,” in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam and J. Neyman, Eds., vol. 1. University of California Press, 1967, pp. 281–297.

[13] G. N. Lance and W. T. Williams, “A General Theory of Classificatory Sorting Strategies,” The Computer Journal, vol. 9, no. 4, pp. 373–380, 1967.

[14] A. K. Jain, A. Topchy, M. H. C. Law, and J. M. Buhmann, “Landscape of Clustering Algorithms,” in Proceedings of the 17th International Conference on Pattern Recognition (ICPR ’04), Volume 1. Washington, DC, USA: IEEE Computer Society, 2004, pp. 260–263.

[15] G. McLachlan and K. Basford, Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

[16] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.

[17] R. A. Jarvis and E. A. Patrick, “Clustering using a similarity measure based on shared near neighbors,” IEEE Transactions on Computers, vol. 100, no. 11, pp. 1025–1034, 1973.

[18] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise,” in KDD, E. Simoudis, J. Han, and U. M. Fayyad, Eds. AAAI Press, 1996, pp. 226–231.

[19] A. Strehl and J. Ghosh, “Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions,” Journal of Machine Learning Research (JMLR), vol. 3, pp. 583–617, December 2002.

[20] G. Karypis and V. Kumar, “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs,” SIAM J. Sci. Comput., vol. 20, pp. 359–392, December 1998. [Online]. Available: http://dx.doi.org/10.1137/S1064827595287997

[21] T. Shatovska, T. Safonova, and I. Tarasov, “A Modified Multilevel Approach to the Dynamic Hierarchical Clustering for Complex Types of Shapes,” in ISTA, ser. LNI, H. C. Mayr and D. Karagiannis, Eds., vol. 107. GI, 2007, pp. 176–186.

[22] Y. Liu, Z. Li, H. Xiong, X. Gao, and J. Wu, “Understanding of Internal Clustering Validation Measures,” in ICDM, G. I. Webb, B. Liu, C. Zhang, D. Gunopulos, and X. Wu, Eds. IEEE Computer Society, 2010, pp. 911–916.

[23] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, “Multilevel Hypergraph Partitioning: Applications in VLSI Domain,” IEEE Trans. Very Large Scale Integr. Syst., vol. 7, no. 1, pp. 69–79, Mar. 1999.

[24] G. Karypis and V. Kumar, “Multilevel k-way Hypergraph Partitioning,” in DAC, 1999, pp. 343–348.

[25] C. M. Fiduccia and R. M. Mattheyses, “A Linear-time Heuristic for Improving Network Partitions,” in Proceedings of the 19th Design Automation Conference, ser. DAC ’82. Piscataway, NJ, USA: IEEE Press, 1982, pp. 175–181.

[26] T. Bruna, “Implementation of the Chameleon Clustering Algorithm,” Master’s thesis, Czech Technical University in Prague, 2015.

[27] A. Gionis, H. Mannila, and P. Tsaparas, “Clustering aggregation,” TKDD, vol. 1, no. 1, 2007.

[28] A. Ultsch and F. Morchen, “ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM,” Technical Report No. 46, Dept. of Mathematics and Computer Science, University of Marburg, Germany, 2005.

[29] G. Karypis, “Karypis Lab – CLUTO’s datasets.” [Online]. Available: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/download

[30] C. Zahn, “Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters,” IEEE Trans. on Computers, vol. C-20, no. 1, pp. 68–86, Jan. 1971.

[31] C. J. Veenman, M. J. T. Reinders, and E. Backer, “A Maximum Variance Cluster Algorithm,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1273–1280, 2002.

[32] M.-C. Su, C.-H. Chou, and C.-C. Hsieh, “Fuzzy C-means algorithm with a point symmetry distance,” International Journal of Fuzzy Systems, vol. 7, no. 4, pp. 175–181, 2005.

[33] S. Salvador and P. Chan, “Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms,” in ICTAI. IEEE Computer Society, 2004, pp. 576–584.

[34] L. Fu and E. Medico, “FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data,” BMC Bioinformatics, vol. 8, 2007.

[35] A. K. Jain and M. H. C. Law, “Data Clustering: A User’s Dilemma,” in PReMI, ser. Lecture Notes in Computer Science, S. K. Pal, S. Bandyopadhyay, and S. Biswas, Eds., vol. 3776. Springer, 2005, pp. 1–10.

[36] J. Handl and J. Knowles, “Multiobjective clustering with automatic determination of the number of clusters,” UMIST, Tech. Rep., 2004.

[37] H. Chang and D.-Y. Yeung, “Robust path-based spectral clustering,” Pattern Recognition, vol. 41, no. 1, pp. 191–203, 2008.

[38] P. Franti and O. Virmajoki, “Iterative shrinking method for clustering problems,” Pattern Recognition, vol. 39, no. 5, pp. 761–775, 2006.

[39] G. Karypis, “CLUTO – A Clustering Toolkit,” Dept. of Computer Science, University of Minnesota, Tech. Rep. 02-017, 2002, available at http://www.cs.umn.edu/~cluto.



(a) Dataset aggregation. (b) Dataset atom. (c) Dataset chainlink.

(d) Dataset chameleon-t4.8k. (e) Dataset chameleon-t5.8k. (f) Dataset chameleon-t7.10k.

(g) Dataset chameleon-t8.8k. (h) Dataset compound. (i) Dataset cure-t2-4k.

(j) Dataset D31. (k) Dataset diamond9. (l) Dataset DS-850.

(m) Dataset flame. (n) Dataset jain. (o) Dataset long1.

Fig. 9. Visualization of datasets used in experiments with ground truth assignments.


(a) Dataset longsquare. (b) Dataset lsun. (c) Dataset pathbased.

(d) Dataset s-set1. (e) Dataset spiralsquare. (f) Dataset target.

(g) Dataset triangle1. (h) Dataset twodiamonds. (i) Dataset wingnut.

Fig. 10. Visualization of datasets used in experiments with ground truth assignments.
