
Clustering algorithms and validity measures

M. Halkidi, Y. Batistakis, M. Vazirgiannis
Department of Informatics
Athens University of Economics & Business
Email: {mhalk, yannis, mvazirg}@aueb.gr

Abstract

Clustering aims at discovering groups and identifying interesting distributions and patterns in data sets. Researchers have extensively studied clustering since it arises in many application domains in engineering and the social sciences. In recent years, the availability of huge transactional and experimental data sets and the arising requirements for data mining have created the need for clustering algorithms that scale and can be applied in diverse domains. This paper surveys clustering methods and approaches available in the literature in a comparative way. It also presents the basic concepts, principles and assumptions upon which the clustering algorithms are based. Another important issue is the validity of the clustering schemes resulting from applying the algorithms, which is also related to the inherent features of the data set under concern. We review and compare clustering validity measures available in the literature. Furthermore, we illustrate the issues that are under-addressed by recent algorithms and we outline new research directions.

1. Introduction

Clustering is one of the most useful tasks in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data. The clustering problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than to points in different clusters [10]. For example, consider a retail database whose records contain the items purchased by customers. A clustering procedure could group the customers in such a way that customers with similar buying patterns are in the same cluster. Thus, the main concern in the clustering process is to reveal the organization of patterns into "sensible" groups, which allows us to discover similarities and differences, as well as to derive useful conclusions about them. This idea is applicable in many fields, such as the life sciences, medical sciences and engineering. Clustering may be found under different names in different contexts, such as unsupervised learning (in pattern recognition), numerical taxonomy (in biology, ecology), typology (in social sciences) and partition (in graph theory) [28].

In the clustering process there are no predefined classes and no examples that would show what kind of relations should hold among the data; that is why clustering is perceived as an unsupervised process [1]. On the other hand, classification is a procedure of assigning a data item to one of a predefined set of categories [8]. Clustering produces initial categories into which the values of a data set are classified during the classification process. The clustering process may result in different partitionings of a data set, depending on the specific criterion used for clustering. Thus, there is a need for preprocessing before we undertake a clustering task on a data set. The basic steps of the clustering process can be summarized as follows [8]:
♦ Feature selection. The goal is to properly select the features on which clustering is to be performed so as to encode as much information as possible concerning the task of interest. Thus, preprocessing of the data may be necessary prior to its use in the clustering task.

♦ Clustering algorithm. This step refers to the choice of an algorithm that results in the definition of a good clustering scheme for a data set. A proximity measure and a clustering criterion mainly characterize a clustering algorithm, as well as its ability to define a clustering scheme that fits the data set. i) Proximity measure: a measure that quantifies how "similar" two data points (i.e. feature vectors) are. In most cases we have to ensure that all selected features contribute equally to the computation of the proximity measure and that no features dominate others. ii) Clustering criterion: the clustering criterion can be expressed via a cost function or some other type of rule. We should stress that we have to take into account the type of clusters that are expected to occur in the data set. Thus, we may define a "good" clustering criterion, leading to a partitioning that fits the data set well.

♦ Validation of the results. The correctness of the clustering algorithm's results is verified using appropriate criteria and techniques. Since clustering algorithms define clusters that are not known a priori, irrespective of the clustering method, the final partition of the data requires some kind of evaluation in most applications [24].

♦ Interpretation of the results. In many cases, the experts in the application area have to integrate the clustering results with other experimental evidence and analysis in order to draw the right conclusions.

Clustering Applications. Cluster analysis is a major tool in a number of applications in many fields of business and science. Below, we summarize the basic directions in which clustering is used [28]:
♦ Data reduction. Cluster analysis can contribute to the compression of the information included in the data. In several cases, the amount of available data is very large and its processing becomes very demanding. Clustering can be used to partition the data set into a number of "interesting" clusters. Then, instead of processing the data set as a whole, we adopt the representatives of the defined clusters in our process. Thus, data compression is achieved.

♦ Hypothesis generation. Cluster analysis is used here in order to infer some hypotheses concerning the data. For instance, we may find in a retail database that there are two significant groups of customers based on their age and the time of purchases. Then, we may infer some hypotheses about the data, such as "young people go shopping in the evening" and "old people go shopping in the morning".

♦ Hypothesis testing. In this case, cluster analysis is used to verify the validity of a specific hypothesis. For example, consider the following hypothesis: "Young people go shopping in the evening". One way to verify whether this is true is to apply cluster analysis to a representative set of stores. Suppose that each store is represented by its customers' details (age, job, etc.) and the time of transactions. If, after applying cluster analysis, a cluster that corresponds to "young people buy in the evening" is formed, then the hypothesis is supported by the cluster analysis.

♦ Prediction based on groups. Cluster analysis is applied to the data set and the resulting clusters are characterized by the features of the patterns that belong to them. Then, unknown patterns can be classified into specific clusters based on their similarity to the clusters' features, and useful knowledge related to the data can be extracted. Assume, for example, that cluster analysis is applied to a data set concerning patients infected by the same disease. The result is a number of clusters of patients, grouped according to their reaction to specific drugs. Then, for a new patient, we identify the cluster to which he or she can be assigned and base the medication decision on it.

More specifically, some typical applications of clustering are in the following fields [12]:
• Business. In business, clustering may help marketers discover significant groups in their customer databases and characterize them based on purchasing patterns.

• Biology. In biology, it can be used to define taxonomies, categorize genes with similar functionality and gain insights into structures inherent in populations.

• Spatial data analysis. Due to the huge amounts of spatial data that may be obtained from satellite images, medical equipment, Geographical Information Systems (GIS), image database exploration, etc., it is expensive and difficult for users to examine spatial data in detail. Clustering may help to automate the process of analysing and understanding spatial data. It is used to identify and extract interesting characteristics and patterns that may exist in large spatial databases.

• Web mining. In this case, clustering is used to discover significant groups of documents in the Web's huge collection of semi-structured documents. This classification of Web documents assists in information discovery.

In general terms, clustering may serve as a pre-processing step for other algorithms, such as classification, which would then operate on the detected clusters.

Clustering Algorithm Categories. A multitude of clustering methods have been proposed in the literature. Clustering algorithms can be classified according to:
♦ The type of data input to the algorithm.
♦ The clustering criterion defining the similarity between data points.
♦ The theory and fundamental concepts on which the clustering analysis techniques are based (e.g. fuzzy theory, statistics).

Thus, according to the method adopted to define clusters, the algorithms can be broadly classified into the following types [16]:
• Partitional clustering attempts to directly decompose the data set into a set of disjoint clusters. More specifically, these algorithms attempt to determine an integer number of partitions that optimise a certain criterion function. The criterion function may emphasize the local or global structure of the data and its optimisation is an iterative procedure.

• Hierarchical clustering proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters. The result of the algorithm is a tree of clusters, called a dendrogram, which shows how the clusters are related. By cutting the dendrogram at a desired level, a clustering of the data items into disjoint groups is obtained.

• Density-based clustering. The key idea of this type of clustering is to group neighbouring objects of a data set into clusters based on density conditions.

• Grid-based clustering. This type of algorithm is mainly proposed for spatial data mining. Its main characteristic is that it quantises the space into a finite number of cells and then does all operations on the quantised space.

For each of the above categories there is a wealth of subtypes and different algorithms for finding the clusters. Thus, according to the type of variables allowed in the data set, the algorithms can be categorized into [11, 15, 24]:
• Statistical algorithms, which are based on statistical analysis concepts. They use similarity measures to partition objects and they are limited to numeric data.

• Conceptual algorithms, which are used to cluster categorical data. They cluster objects according to the concepts they carry.

Another classification criterion is the way clustering handles uncertainty in terms of cluster overlapping:
• Fuzzy clustering, which uses fuzzy techniques to cluster data and considers that an object can be classified into more than one cluster. This type of algorithm leads to clustering schemes that are compatible with everyday-life experience, as they handle the uncertainty of real data. The most important fuzzy clustering algorithm is Fuzzy C-Means [2].

• Crisp clustering, which considers non-overlapping partitions, meaning that a data point either belongs to a class or not. Most clustering algorithms result in crisp clusters and can thus be categorized as crisp clustering.

• Kohonen net clustering, which is based on the concepts of neural networks. The Kohonen network has input and output nodes. The input layer (input nodes) has a node for each attribute of the record, each one connected to every output node (output layer). Each connection is associated with a weight, which determines the position of the corresponding output node. Thus, according to an algorithm that changes the weights appropriately, the output nodes move to form clusters.

In general terms, clustering algorithms are based on a criterion for assessing the quality of a given partitioning. More specifically, they take as input some parameters (e.g. number of clusters, density of clusters) and attempt to define the best partitioning of a data set for the given parameters. Thus, they define a partitioning of a data set based on certain assumptions and not necessarily the "best" one that fits the data set.

Since clustering algorithms discover clusters which are not known a priori, the final partition of a data set requires some sort of evaluation in most applications [24]. For instance, questions like "how many clusters are there in the data set?", "does the resulting clustering scheme fit our data set?", "is there a better partitioning for our data set?" call for clustering results validation and are the subjects of methods discussed in the literature. They aim at the quantitative evaluation of the results of clustering algorithms and are known under the general term cluster validity methods.

The remainder of the paper is organized as follows. In the next section we present the main categories of clustering algorithms that are available in the literature. Then, in Section 3 we discuss the main characteristics of these algorithms in a comparative way. In Section 4 we present the main concepts of clustering validity indices and the techniques proposed in the literature for evaluating clustering results. Moreover, an experimental study based on some of these validity indices is presented, using synthetic and real data sets. We conclude in Section 5 by summarizing and providing the trends in clustering.

2. Clustering Algorithms

In recent years, a number of clustering algorithms have been proposed and are available in the literature. Some representative algorithms of the above categories follow.

2.1 Partitional Algorithms

In this category, K-Means is a commonly used algorithm [18]. The aim of K-Means clustering is the optimisation of an objective function described by the equation

E = \sum_{i=1}^{c} \sum_{x \in C_i} d(x, m_i)    (1)

In the above equation, m_i is the center of cluster C_i, while d(x, m_i) is the Euclidean distance between a point x and m_i. Thus, the criterion function E attempts to minimize the distance of every point from the center of the cluster to which the point belongs. More specifically, the algorithm begins by initialising a set of c cluster centers. Then, it assigns each object of the data set to the cluster whose center is nearest, and re-computes the centers. The process continues until the centers of the clusters stop changing.
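As an illustration of the procedure just described, the following minimal Python sketch implements the K-Means loop (this is our own illustrative code, not the authors'; random initialisation from the data and a convergence test on the centers are assumptions):

    import numpy as np

    def k_means(X, c, max_iter=100, seed=0):
        """Minimal K-Means: X is an (n, d) array, c is the number of clusters."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=c, replace=False)]   # initial centers
        for _ in range(max_iter):
            # assign every point to the nearest center (Euclidean distance)
            dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            # recompute each center as the mean of the points assigned to it
            new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                    else centers[i] for i in range(c)])
            if np.allclose(new_centers, centers):                # centers stopped changing
                break
            centers = new_centers
        return labels, centers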

Another algorithm of this category is PAM (Partitioning Around Medoids). The objective of PAM is to determine a representative object (medoid) for each cluster, that is, to find the most centrally located object within each cluster. The algorithm begins by selecting an object as the medoid for each of the c clusters. Then, each of the non-selected objects is grouped with the medoid to which it is most similar. PAM swaps medoids with non-selected objects as long as such swaps improve the quality of the clustering. PAM is an expensive algorithm with respect to finding the medoids, as it compares each object with the entire data set [22].

CLARA (Clustering LARge Applications) is an implementation of PAM on subsets of the data set. It draws multiple samples of the data set, applies PAM to each sample, and then outputs the best clustering found across these samples [22].

CLARANS (Clustering Large Applications based on RANdomized Search) combines sampling techniques with PAM. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. The clustering obtained after replacing a medoid is called a neighbour of the current clustering. CLARANS selects a node and compares it to a user-defined number of its neighbours, searching for a local minimum. If a better neighbour is found (i.e., one having a lower square error), CLARANS moves to the neighbour's node and the process starts again; otherwise the current clustering is a local optimum. When a local optimum is found, CLARANS starts over from a new randomly selected node in search of a new local optimum.

Finally, K-prototypes and K-mode [15] are based on the K-Means algorithm, but they aim at clustering categorical data.

2.2 Hierarchical Algorithms

Hierarchical clustering algorithms, according to the method by which they produce clusters, can further be divided into [28]:
♦ Agglomerative algorithms. They produce a sequence of clustering schemes with a decreasing number of clusters at each step. The clustering scheme produced at each step results from the previous one by merging the two closest clusters into one.
♦ Divisive algorithms. These algorithms produce a sequence of clustering schemes with an increasing number of clusters at each step. Contrary to the agglomerative algorithms, the clustering produced at each step results from the previous one by splitting a cluster into two.
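To make the agglomerative procedure concrete, the following sketch repeatedly merges the two closest clusters until the requested number of clusters remains (single-linkage distance and the brute-force search are assumptions made for brevity):

    import numpy as np

    def agglomerative(X, n_clusters):
        """Naive single-linkage agglomerative clustering (illustrative, not optimised)."""
        clusters = [[i] for i in range(len(X))]        # start with every point as a cluster
        while len(clusters) > n_clusters:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # single linkage: distance between the closest members of the two clusters
                    d = min(np.linalg.norm(X[i] - X[j])
                            for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            clusters[a].extend(clusters[b])            # merge the two closest clusters
            del clusters[b]
        return clusters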

In the sequel, we describe some representative hierarchical clustering algorithms.

BIRCH [32] uses a hierarchical data structure called a CF-tree for partitioning the incoming data points in an incremental and dynamic way. The CF-tree is a height-balanced tree, which stores the clustering features and is based on two parameters: the branching factor B and the threshold T, which refers to the diameter of a cluster (the diameter (or radius) of each cluster must be less than T). BIRCH can typically find a good clustering with a single scan of the data and improve the quality further with a few additional scans. It is also the first clustering algorithm to handle noise effectively [32]. However, a node does not always correspond to a natural cluster, since each node in the CF-tree can hold only a limited number of entries due to its size. Moreover, BIRCH is order-sensitive, as it may generate different clusters for different orders of the same input data.

CURE [10] represents each cluster by a certain number of points that are generated by selecting well-scattered points and then shrinking them toward the cluster centroid by a specified fraction. It uses a combination of random sampling and partition clustering to handle large databases.

ROCK [11] is a robust clustering algorithm for Boolean and categorical data. It introduces two new concepts, a point's neighbours and links, and uses them to measure the similarity/proximity between a pair of data points.

2.3 Density-based Algorithms

Density-based algorithms typically regard clusters as dense regions of objects in the data space that are separated by regions of low density. A widely known algorithm of this category is DBSCAN [6]. The key idea in DBSCAN is that, for each point in a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points. DBSCAN can handle noise (outliers) and discover clusters of arbitrary shape. Moreover, DBSCAN is used as the basis for an incremental clustering algorithm proposed in [7]. Due to its density-based nature, the insertion or deletion of an object affects the current clustering only in the neighbourhood of this object, and thus efficient algorithms based on DBSCAN can be given for incremental insertions and deletions to an existing clustering [7].
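A compact sketch of the DBSCAN idea described above follows; eps and min_pts mirror the neighbourhood radius and minimum number of points, and the expansion loop is a simplification of the published algorithm:

    import numpy as np

    def dbscan(X, eps, min_pts):
        """Simplified DBSCAN: returns a cluster id per point, -1 for noise."""
        n = len(X)
        labels = np.full(n, -1)
        visited = np.zeros(n, dtype=bool)

        def neighbours(i):
            return list(np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps))

        cluster = -1
        for i in range(n):
            if visited[i]:
                continue
            visited[i] = True
            seeds = neighbours(i)
            if len(seeds) < min_pts:
                continue                        # not a core point: left as noise for now
            cluster += 1
            labels[i] = cluster
            while seeds:                        # grow the cluster from density-reachable points
                j = seeds.pop()
                if labels[j] == -1:
                    labels[j] = cluster
                if not visited[j]:
                    visited[j] = True
                    nb = neighbours(j)
                    if len(nb) >= min_pts:      # j is itself a core point: keep expanding
                        seeds.extend(nb)
        return labels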

In [14] another density-based clustering algorithm, DENCLUE, is proposed. This algorithm introduces a new approach to clustering large multimedia databases. The basic idea is to model the overall point density analytically as the sum of the influence functions of the data points. The influence function can be seen as a function that describes the impact of a data point within its neighbourhood. Clusters can then be identified by determining density attractors, which are local maxima of the overall density function. In addition, clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. The main advantages of DENCLUE are that it has good clustering properties for data sets with large amounts of noise and that it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets. However, DENCLUE clustering is based on two parameters and, as in most other approaches, the quality of the resulting clustering depends on their choice. These parameters are [14]: i) the parameter σ, which determines the influence of a data point in its neighbourhood, and ii) ξ, which describes whether a density attractor is significant, allowing a reduction of the number of density attractors and helping to improve performance.
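For illustration, the overall density described above can be estimated as a sum of Gaussian influence functions (a minimal sketch; evaluating the sum over all data points rather than only a neighbourhood is a simplification):

    import numpy as np

    def gaussian_density(x, data, sigma):
        """Overall density at x: sum of Gaussian influence functions of all data points."""
        d2 = np.sum((data - x) ** 2, axis=1)           # squared distances to x
        return np.sum(np.exp(-d2 / (2.0 * sigma ** 2)))

    # A point x is attached to the cluster of its density attractor x* (a local maximum of
    # gaussian_density), provided the density at x* exceeds the significance threshold xi.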

2.4 Grid-based Algorithms

Recently, a number of clustering algorithms have been presented for spatial data, known as grid-based algorithms. These algorithms quantise the space into a finite number of cells and then do all operations on the quantised space.

STING (STatistical INformation Grid-based method) is representative of this category. It divides the spatial area into rectangular cells using a hierarchical structure. STING [30] goes through the data set and computes the statistical parameters (such as mean, variance, minimum, maximum and type of distribution) of each numerical feature of the objects within the cells. Then it generates a hierarchical structure of the grid cells so as to represent the clustering information at different levels. Based on this structure, STING enables the use of clustering information to answer queries or to efficiently assign a new object to the clusters.

WaveCluster [25] is the latest grid-based algorithm proposed in the literature. It is based on signal processing techniques (wavelet transformation) to convert the spatial data into the frequency domain. More specifically, it first summarizes the data by imposing a multidimensional grid structure onto the data space [12]. Each grid cell summarizes the information of a group of points that map into the cell. Then it uses a wavelet transformation to transform the original feature space. In the wavelet transform, convolution with an appropriate function results in a transformed space where the natural clusters in the data become distinguishable. Thus, we can identify the clusters by finding the dense regions in the transformed domain. A priori knowledge about the exact number of clusters is not required in WaveCluster.

2.5 Fuzzy Clustering

The algorithms described above result in crisp clusters, meaning that a data point either belongs to a cluster or not. The clusters are non-overlapping and this kind of partitioning is called crisp clustering. The issue of supporting uncertainty in the clustering task leads to the introduction of algorithms that use fuzzy logic concepts in their procedure. A common fuzzy clustering algorithm is Fuzzy C-Means (FCM), an extension of the classical C-Means algorithm for fuzzy applications [2]. FCM attempts to find the most characteristic point in each cluster, which can be considered as the "center" of the cluster, and then the grade of membership of each object in the clusters.
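The following sketch shows one way to realise the FCM iteration, alternating membership and center updates with fuzzifier m (an illustration consistent with the objective J_m(U,V) given in Table 1, not the reference implementation):

    import numpy as np

    def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
        """Return memberships U of shape (n, c) and centers V of shape (c, d)."""
        rng = np.random.default_rng(seed)
        U = rng.random((len(X), c))
        U /= U.sum(axis=1, keepdims=True)               # each row sums to 1
        for _ in range(max_iter):
            W = U ** m
            V = (W.T @ X) / W.sum(axis=0)[:, None]      # weighted cluster centers
            d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
            U_new = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
            if np.max(np.abs(U_new - U)) < tol:         # memberships stopped changing
                break
            U = U_new
        return U, V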

Another approach proposed in the literature to address the problems of crisp clustering is based on probabilistic models. The basis of this type of clustering algorithm is the EM algorithm, which provides a quite general approach to learning in the presence of unobservable variables [20]. A common algorithm is the probabilistic variant of K-Means, which is based on a mixture of Gaussian distributions. This approach to K-Means uses probability density rather than distance to associate records with clusters [1]. More specifically, it regards the centers of the clusters as means of Gaussian distributions. Then, it estimates the probability that a data point is generated by the jth Gaussian (i.e., belongs to the jth cluster). This approach extracts clusters under a Gaussian model and assigns the data points to clusters assuming that they are generated by normal distributions. Also, this approach is implemented only in algorithms that are based on the EM (Expectation Maximization) algorithm.
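A minimal sketch of the soft, probability-based assignment described above, using spherical Gaussian components with a common, fixed variance (the shared variance and equal default weights are assumptions made to keep the example short):

    import numpy as np

    def soft_assign(X, means, var=1.0, weights=None):
        """P(cluster j | x) for spherical Gaussians centred at the cluster means."""
        k = len(means)
        weights = np.full(k, 1.0 / k) if weights is None else weights
        d2 = np.sum((X[:, None, :] - means[None, :, :]) ** 2, axis=2)
        resp = weights * np.exp(-d2 / (2.0 * var))      # unnormalised responsibilities
        return resp / resp.sum(axis=1, keepdims=True)   # normalise over the clusters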

3. Comparison of Clustering Algorithms

Clustering is broadly recognized as a useful tool in many applications. Researchers from many disciplines have addressed the clustering problem. However, it is a difficult problem, which combines concepts from diverse scientific fields (such as databases, machine learning, pattern recognition, statistics). Thus, the differences in assumptions and context among different research communities have caused a number of clustering methodologies and algorithms to be defined.

This section offers an overview of the main characteristics of the clustering algorithms, presented in a comparative way. We consider the algorithms categorized into four groups based on their clustering method: partitional, hierarchical, density-based and grid-based algorithms. Tables 1, 2, 3 and 4 summarize the main concepts and characteristics of the most representative algorithms of these clustering categories. More specifically, our study is based on the following features of the algorithms: i) the type of data that an algorithm supports (numerical, categorical), ii) the shape of clusters, iii) the ability to handle noise and outliers, iv) the clustering criterion and v) complexity. Moreover, we present the input parameters of the algorithms and we study the influence of these parameters on the clustering results. Finally, we describe the type of the algorithms' results, i.e., the information that an algorithm gives so as to represent the discovered clusters in a data set.

As Table 1 depicts, partitional algorithms are applicable mainly to numerical data sets. However, there are some variants of K-Means, such as K-mode, which handle categorical data. K-Mode is based on the K-Means method to discover clusters, while it adopts new concepts in order to handle categorical data: the cluster centers are replaced with "modes", and a new dissimilarity measure is used to deal with categorical objects. Another characteristic of partitional algorithms is that they are unable to handle noise and outliers and they are not suitable for discovering clusters with non-convex shapes. Moreover, they are based on certain assumptions in order to partition a data set. Thus, they need the number of clusters to be specified in advance, except for CLARANS, which needs as input the maximum number of neighbours of a node as well as the number of local minima that will be found in order to define a partitioning of a data set. The result of the clustering process is the set of representative points of the discovered clusters. These points may be the centers or the medoids (most centrally located objects within a cluster) of the clusters, depending on the algorithm. As regards the clustering criteria, the objective of the algorithms is to minimize the distance of the objects within a cluster from the representative point of this cluster. Thus, the criterion of K-Means aims at the minimization of the distance of the objects belonging to a cluster from the cluster center, while PAM minimizes the distance from the medoid. CLARA and CLARANS, as mentioned above, are based on the clustering criterion of PAM. However, they consider samples of the data set on which clustering is applied and, as a consequence, they can deal with larger data sets than PAM. More specifically, CLARA draws multiple samples of the data set, applies PAM to each sample and then outputs the best clustering. The problem of this approach is that its efficiency depends on the sample size. Also, the clustering results are produced based only on samples of the data set. Thus, it is clear that if a sample is biased, a good clustering based on the samples will not necessarily represent a good clustering of the whole data set. On the other hand, CLARANS is a mixture of PAM and CLARA. A key difference between CLARANS and PAM is that the former searches a subset of the data set in order to define clusters [22]. The subsets are drawn with some randomness in each step of the search, in contrast to CLARA, which has a fixed sample at every stage. This has the benefit of not confining the search to a localized area. In general terms, CLARANS is more efficient and scalable than both CLARA and PAM. The algorithms described above are crisp clustering algorithms, that is, they consider that a data point (object) may belong to one and only one cluster. However, the boundaries of a cluster can hardly be defined in a crisp way if we consider real-life cases. FCM is a representative algorithm of fuzzy clustering which is based on K-Means concepts in order to partition a data set into clusters. However, it introduces the concept of uncertainty and assigns objects to clusters with an attached degree of belief. Thus, an object may belong to more than one cluster, with a different degree of belief for each.

A summarized view of the characteristics of hierarchical clustering methods is presented in Table 2. The algorithms of this category create a hierarchical decomposition of the database, represented as a dendrogram. They are more efficient in handling noise and outliers than partitional algorithms. However, they break down due to their non-linear time complexity (typically O(n^2), where n is the number of points in the data set) and huge I/O cost when the number of input data points is large. BIRCH tackles this problem using a hierarchical data structure called a CF-tree for multiphase clustering. In BIRCH, a single scan of the data set yields a good clustering and one or more additional scans can be used to improve the quality further. However, it handles only numerical data and it is order-sensitive (i.e., it may generate different clusters for different orders of the same input data). Also, BIRCH does not perform well when the clusters do not have uniform size and shape, since it uses only the centroid of a cluster when redistributing the data points in the final phase. On the other hand, CURE employs a combination of random sampling and partitioning to handle large databases. It identifies clusters having non-spherical shapes and wide variances in size by representing each cluster by multiple points. The representative points of a cluster are generated by selecting well-scattered points from the cluster and shrinking them toward the centre of the cluster by a specified fraction. However, CURE is sensitive to some parameters, such as the number of representative points, the shrink factor used for handling outliers, and the number of partitions. Thus, the quality of the clustering results depends on the selection of these parameters. ROCK is a representative hierarchical clustering algorithm for categorical data. It introduces a novel concept called a "link" in order to measure the similarity/proximity between a pair of data points. Thus, the ROCK clustering method extends to non-metric similarity measures that are relevant to categorical data sets. It also exhibits good scalability properties in comparison with traditional algorithms, employing random sampling techniques. Moreover, it seems to handle successfully data sets with significant differences in the sizes of clusters.

The third category of our study is the density-based clustering algorithms (Table 3). They suitably handle arbitrarily shaped collections of points (e.g. ellipsoidal, spiral, cylindrical) as well as clusters of different sizes. Moreover, they can efficiently separate noise (outliers). Two widely known algorithms of this category, as mentioned above, are DBSCAN and DENCLUE. DBSCAN requires the user to specify the radius of the neighbourhood of a point, Eps, and the minimum number of points in the neighbourhood, MinPts. It is then obvious that DBSCAN is very sensitive to the parameters Eps and MinPts, which are difficult to determine. Similarly, DENCLUE requires careful selection of its input parameters' values (i.e., σ and ξ), since such parameters may influence the quality of the clustering results. However, the major advantages of DENCLUE in comparison with other clustering algorithms are [12]: i) it has a solid mathematical foundation and generalizes other clustering methods, such as partitional and hierarchical ones, ii) it has good clustering properties for data sets with large amounts of noise, iii) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets, and iv) it uses grid cells and only keeps information about the cells that actually contain points; it manages these cells in a tree-based access structure and thus it is significantly faster than some influential algorithms, such as DBSCAN. In general terms, the complexity of density-based algorithms is O(log n). They do not perform any sort of sampling, and thus they could incur substantial I/O costs. Finally, density-based algorithms typically cannot use random sampling to reduce the input size, unless the sample size is large, because there may be substantial differences between the density in a sample's clusters and in the clusters of the whole data set.


Table 1. The main characteristics of the Partitional Clustering Algorithms (Category: Partitional).

Name | Type of data | Complexity* | Geometry | Outliers, noise | Input parameters | Results | Clustering criterion
K-Means | Numerical | O(n) | Non-convex shapes | No | Number of clusters | Centers of clusters | min_{v_1,...,v_k} E_k, with E = \sum_{i=1}^{k} \sum_{x \in C_i} d^2(x, v_i)
K-mode | Categorical | O(n) | Non-convex shapes | No | Number of clusters | Modes of clusters | min_{Q_1,...,Q_k} E_k, with E = \sum_{i=1}^{k} \sum_{l=1}^{n} D(X_l, Q_i), where D(X_l, Q_i) is the distance between the categorical object X_l and the mode Q_i
PAM | Numerical | O(k(n-k)^2) | Non-convex shapes | No | Number of clusters | Medoids of clusters | min(TC_{ih}), TC_{ih} = \sum_j C_{jih}
CLARA | Numerical | O(k(40+k)^2 + k(n-k)) | Non-convex shapes | No | Number of clusters | Medoids of clusters | min(TC_{ih}), TC_{ih} = \sum_j C_{jih} (C_{jih} = the cost of replacing center i with h, as far as O_j is concerned)
CLARANS | Numerical | O(kn^2) | Non-convex shapes | No | Number of clusters, maximum number of neighbours examined | Medoids of clusters | min(TC_{ih}), TC_{ih} = \sum_j C_{jih}
FCM (Fuzzy C-Means) | Numerical | O(n) | Non-convex shapes | No | Number of clusters | Centers of clusters, beliefs | min_{U,v_1,...,v_k} J_m(U,V), with J_m(U,V) = \sum_{i=1}^{k} \sum_{j=1}^{n} U_{ij}^m d^2(x_j, v_i)

*n is the number of points in the data set and k the number of clusters defined.


Table 2. The main characteristics of the Hierarchical Clustering Algorithms (Category: Hierarchical).

Name | Type of data | Complexity* | Geometry | Outliers | Input parameters | Results | Clustering criterion
BIRCH | Numerical | O(n) | Non-convex shapes | Yes | Radius of clusters, branching factor | CF = (number of points in the cluster N, linear sum of the points in the cluster LS, square sum of the N data points SS) | A point is assigned to the closest node (cluster) according to a chosen distance metric. Also, the cluster definition is based on the requirement that the number of points in each cluster must satisfy a threshold.
CURE | Numerical | O(n^2 log n), O(n) | Arbitrary shapes | Yes | Number of clusters, number of cluster representatives | Assignment of data values to clusters | The clusters with the closest pair of representatives (well-scattered points) are merged at each step.
ROCK | Categorical | O(n^2 + n m_m m_a + n^2 log n), O(n^2, n m_m m_a), where m_m is the maximum number of neighbours for a point and m_a is the average number of neighbours for a point | Arbitrary shapes | Yes | Number of clusters | Assignment of data values to clusters | max(E_l), with E_l = \sum_{i=1}^{k} n_i \sum_{p_q, p_r \in V_i} link(p_q, p_r) / n_i^{1+2f(\theta)}, where link(p_q, p_r) is the number of common neighbours between p_q and p_r

*n is the number of points in the data set under consideration.


Table 3. The main characteristics of the Density-based Clustering Algorithms (Category: Density-based).

Name | Type of data | Complexity* | Geometry | Outliers, noise | Input parameters | Results | Clustering criterion
DBSCAN | Numerical | O(n log n) | Arbitrary shapes | Yes | Cluster radius, minimum number of objects | Assignment of data values to clusters | Merge points that are density-reachable into one cluster.
DENCLUE | Numerical | O(log n) | Arbitrary shapes | Yes | Cluster radius σ, minimum number of objects ξ | Assignment of data values to clusters | x* is the density attractor for a point x; if f^D_{Gauss}(x*) > ξ then x is attached to the cluster belonging to x*, where f^D_{Gauss}(x*) = \sum_{x_i \in near(x*)} e^{-d(x*, x_i)^2 / (2\sigma^2)}

Table 4. The main characteristics of the Grid-based Clustering Algorithms (Category: Grid-Based).

Name | Type of data | Complexity* | Geometry | Outliers | Input parameters | Output | Clustering criterion
Wave-Cluster | Spatial data | O(n) | Arbitrary shapes | Yes | Wavelets, the number of grid cells for each dimension, the number of applications of the wavelet transform | Clustered objects | Decompose the feature space by applying a wavelet transformation; the average sub-band gives the clusters, the detail sub-bands give the cluster boundaries.
STING | Spatial data | O(K), where K is the number of grid cells at the lowest level | Arbitrary shapes | Yes | Number of objects in a cell | Clustered objects | Divide the spatial area into rectangular cells and employ a hierarchical structure. Each cell at a high level is partitioned into a number of smaller cells at the next lower level.

*n is the number of points in the data set under consideration.


The last category of our study (see Table 4) refers to grid-based algorithms. The basic concept of these algorithms is that they define a grid over the data space and then do all the operations on the quantised space. In general terms, these approaches are very efficient for large databases and are capable of finding arbitrarily shaped clusters and handling outliers. STING is one of the well-known grid-based algorithms. It divides the spatial area into rectangular cells and stores the statistical parameters of the numerical features of the objects within the cells. The grid structure facilitates parallel processing and incremental updating. Since STING goes through the database once to compute the statistical parameters of the cells, it is generally an efficient method for generating clusters; its time complexity is O(n). However, STING uses a multiresolution approach to perform cluster analysis and thus the quality of its clustering results depends on the granularity of the lowest level of the grid. Moreover, STING does not consider the spatial relationship between the children and their neighbouring cells when constructing the parent cell. The result is that all cluster boundaries are either horizontal or vertical, and thus the quality of the clusters is questionable [25]. On the other hand, WaveCluster efficiently detects arbitrarily shaped clusters at different scales, exploiting well-known signal processing techniques. It does not require the specification of input parameters (e.g. the number of clusters or a neighbourhood radius), though a priori estimation of the expected number of clusters helps in selecting the correct resolution of clusters. In experimental studies, WaveCluster was found to outperform BIRCH, CLARANS and DBSCAN in terms of efficiency and clustering quality. However, the same study shows that it is not efficient in high-dimensional spaces [12].

4. Clustering results validity assessment

4.1 Problem Specification

The objective of clustering methods is to discover significant groups present in a data set. In general, they should search for clusters whose members are close to each other (in other words, have a high degree of similarity) and well separated. A problem we face in clustering is deciding the optimal number of clusters that fits a data set.

In most algorithms' experimental evaluations, 2D data sets are used so that the reader is able to visually verify the validity of the results (i.e., how well the clustering algorithm discovered the clusters of the data set). It is clear that visualization of the data set is a crucial verification of the clustering results. In the case of large multidimensional data sets (e.g. more than three dimensions), effective visualization of the data set would be difficult. Moreover, the perception of clusters using available visualization tools is a difficult task for humans who are not accustomed to higher-dimensional spaces.

The various clustering algorithms behave in a different way depending on: i) the features of the data set (geometry and density distribution of clusters), and ii) the input parameter values.
For instance, assume the data set in Figure 1a. It is obvious that we can discover three clusters in the given data set. However, if we consider a clustering algorithm (e.g. K-Means) with certain parameter values (in the case of K-Means, the number of clusters) so as to partition the data set into four clusters, the result of the clustering process would be the clustering scheme presented in Figure 1b. In our example the clustering algorithm (K-Means) found the best four clusters into which our data set could be partitioned. However, this is not the optimal partitioning for the considered data set. We define here the term "optimal" clustering scheme as the outcome of running a clustering algorithm (i.e., a partitioning) that best fits the inherent partitions of the data set. It is obvious from Figure 1b that the depicted scheme is not the best for our data set, i.e., the clustering scheme presented in Figure 1b does not fit the data set well. The optimal clustering for our data set would be a scheme with three clusters.

As a consequence, if the clustering algorithm parameters are assigned improper values, the clustering method may result in a partitioning scheme that is not optimal for the specific data set, leading to wrong decisions. The problems of deciding the number of clusters that better fits a data set, as well as the evaluation of the clustering results, have been the subject of several research efforts [3, 9, 24, 27, 28, 31].

Figure 1. (a) A data set that consists of three clusters, (b) the results from the application of K-Means when we ask for four clusters.


In the sequel, we discuss the fundamental concepts of clustering validity and we present the most important criteria in the context of clustering validity assessment.

4.2 Validity Indices

In this section, we discuss methods suitable for the quantitative evaluation of clustering results, known as cluster validity methods. However, we have to mention that these methods only give an indication of the quality of the resulting partitioning and thus they can only be considered as a tool at the disposal of the experts in order to evaluate the clustering results. In the sequel, we describe the fundamental criteria for each cluster validity approach as well as their representative indices.

How Monte Carlo is used in Cluster Validity. The goal of using Monte Carlo techniques is the computation of the probability density function of the defined statistic indices. First, we generate a large number of synthetic data sets. For each one of these synthetic data sets, called Xi, we compute the value of the defined index, denoted qi. Then, based on the respective values of qi for each of the data sets Xi, we create a scatter-plot. This scatter-plot is an approximation of the probability density function of the index. In Figure 2 we see the three possible cases for the shape of the probability density function of an index q. There are three different possible shapes depending on the critical interval Dρ, corresponding to the significance level ρ (a statistical constant). As we can see, the probability density function of a statistic index q, under H0, has a single maximum and the Dρ region is either a half line, or a union of two half lines.
Assuming that we have generated the scatter-plot using r values of the index q, called qi, in order to accept or reject the Null Hypothesis H0 we examine the following conditions [28]:
• If the shape is right-tailed (Figure 2b), we reject (accept) H0 if q's value for our data set is greater (smaller) than (1-ρ)·r of the qi values of the respective synthetic data sets Xi.
• If the shape is left-tailed (Figure 2c), we reject (accept) H0 if q's value for our data set is smaller (greater) than ρ·r of the qi values.
• If the shape is two-tailed (Figure 2a), we accept H0 if q is greater than (ρ/2)·r of the qi values and smaller than (1-ρ/2)·r of the qi values.

Based on external criteria we can work in two different ways. Firstly, we can evaluate the resulting clustering structure C by comparing it to an independent partition P of the data, built according to our intuition about the clustering structure of the data set. Secondly, we can compare the proximity matrix P to the partition P.

a) Comparison of C with partition P (not for a hierarchy of clusterings). Consider that C = {C1 … Cm} is a clustering structure of a data set X and P = {P1 … Ps} is a defined partition of the data, where m ≠ s. We refer to a pair of points (xv, xu) from the data set using the following terms:
• SS: if both points belong to the same cluster of the clustering structure C and to the same group of partition P.
• SD: if both points belong to the same cluster of C and to different groups of P.
• DS: if both points belong to different clusters of C and to the same group of P.
• DD: if both points belong to different clusters of C and to different groups of P.
Assuming now that a, b, c and d are the numbers of SS, SD, DS and DD pairs respectively, then a + b + c + d = M, which is the maximum number of all pairs in the data set (i.e., M = N(N-1)/2, where N is the total number of points in the data set).

Now we can define the following indices to measure the degree of similarity between C and P:

Figure 2. Confidence interval for (a) a two-tailed index, (b) a right-tailed index, (c) a left-tailed index, where q0ρ is the ρ proportion of q under hypothesis H0 [28].


• Rand Statistic: R = (a + d) / M,

• Jaccard Coefficient: J = a / (a + b + c),

The above two indices take values between 0 and 1, and are maximized when m = s. Another index is the:
• Folkes and Mallows index:

FM = \sqrt{m_1 \cdot m_2} = \frac{a}{\sqrt{(a + b)(a + c)}}    (2)

where m1 = a / (a + b) and m2 = a / (a + c). For the previous three indices it has been proven that high values indicate great similarity between C and P: the higher the values of these indices are, the more similar C and P are. Other indices are:
• Hubert's Γ statistic:

Γ = (1/M) \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} X(i,j)\, Y(i,j)    (3)

High values of this index indicate a strong similarity between X and Y.
• Normalized Γ statistic:

\bar{Γ} = \frac{(1/M) \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} (X(i,j) - µ_X)(Y(i,j) - µ_Y)}{σ_X σ_Y}    (4)

where µX, µY, σX, σY are the respective means and variances of the X, Y matrices. This index takes values between -1 and 1.
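To make the external indices concrete, the sketch below counts the SS/SD/DS/DD pairs for two labelings and computes the Rand, Jaccard and Folkes and Mallows indices (an illustrative implementation; the function and variable names are our own):

    from itertools import combinations
    from math import sqrt

    def external_indices(labels_c, labels_p):
        """Rand, Jaccard and Folkes-Mallows indices between clustering C and partition P."""
        a = b = c = d = 0
        for i, j in combinations(range(len(labels_c)), 2):
            same_c = labels_c[i] == labels_c[j]
            same_p = labels_p[i] == labels_p[j]
            if same_c and same_p:
                a += 1          # SS pair
            elif same_c:
                b += 1          # SD pair
            elif same_p:
                c += 1          # DS pair
            else:
                d += 1          # DD pair
        m = a + b + c + d       # M = N(N-1)/2
        rand = (a + d) / m
        jaccard = a / (a + b + c)
        fm = a / sqrt((a + b) * (a + c))
        return rand, jaccard, fm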

All these statistics have right-tailed probability density functions under the random hypothesis. In order to use these indices in statistical tests we must know their respective probability density functions under the Null Hypothesis H0, which is the hypothesis of random structure of our data set. This means that, using statistical tests, if we accept the Null Hypothesis then our data are randomly distributed. However, the computation of the probability density function of these indices is difficult. A solution to this problem is to use Monte Carlo techniques. The procedure is as follows:
For i = 1 to r
  o Generate a data set Xi with N vectors (points) in the area of X, which means that the generated vectors have the same dimension as those of the data set X.
  o Assign each vector y_{j,i} of Xi to the group that xj ∈ X belongs to, according to the partition P.
  o Run the same clustering algorithm used to produce structure C for each Xi, and let Ci be the resulting clustering structure.
  o Compute the value q(Ci) of the defined index q for P and Ci.
End For
Create a scatter-plot of the r validity index values q(Ci) computed in the for loop.

After having plotted the approximation of the probability density function of the defined statistic index, we compare its value, say q, to the q(Ci) values, say qi. The indices R, J, FM, Γ defined previously are used as the q index mentioned in the above procedure.
Example: Assume a given data set, X, containing 100 three-dimensional vectors (points). The points of X form four clusters of 25 points each. Each cluster is generated by a normal distribution. The covariance matrices of these distributions are all equal to 0.2I, where I is the 3x3 identity matrix. The mean vectors of the four distributions are [0.2, 0.2, 0.2]T, [0.5, 0.2, 0.8]T, [0.5, 0.8, 0.2]T, and [0.8, 0.8, 0.8]T. We independently group data set X into four groups according to the partition P, for which the first 25 vectors (points) belong to the first group P1, the next 25 belong to the second group P2, the next 25 belong to the third group P3 and the last 25 vectors belong to the fourth group P4. We run the K-Means clustering algorithm for k = 4 clusters and we assume that C is the resulting clustering structure. We compute the values of the indices for the clustering C and the partition P, and we get R = 0.91, J = 0.68, FM = 0.81 and Γ = 0.75. Then we follow the steps described above in order to define the probability density function of these four statistics. We generate 100 data sets Xi, i = 1, …, 100, each one consisting of 100 random vectors (in 3 dimensions) drawn from the uniform distribution. According to the partition P defined earlier, for each Xi we assign the first 25 of its vectors to P1 and the second, third and fourth groups of 25 vectors to P2, P3 and P4 respectively. Then we run K-Means once for each Xi, so as to define the respective clustering structures of the data sets, denoted Ci. For each of them we compute the values of the indices Ri, Ji, FMi, Γi, i = 1, …, 100. We set the significance level ρ = 0.05 and we compare these values to the R, J, FM and Γ values corresponding to X. We accept or reject the null hypothesis according to whether (1-ρ)r = (1 - 0.05)·100 = 95 values of Ri, Ji, FMi, Γi are greater or smaller than the corresponding values of R, J, FM, Γ. In our case the Ri, Ji, FMi, Γi values are all smaller than the corresponding values of R, J, FM and Γ, which leads us to the conclusion that the null hypothesis H0 is rejected: something we were expecting because of the predefined clustering structure of data set X.
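A compact sketch of this Monte Carlo test follows. It reuses the hypothetical k_means sketch from Section 2, accepts any of the indices above through index_fn (e.g. the Rand statistic from the external_indices sketch), and mirrors the uniform random generation and right-tailed decision rule of the example:

    import numpy as np

    def monte_carlo_test(q_value, X, labels_p, index_fn, k, r=100, rho=0.05, seed=0):
        """Right-tailed test: reject H0 if q_value exceeds (1 - rho)*r of the q_i values."""
        rng = np.random.default_rng(seed)
        q_i = []
        for _ in range(r):
            Xi = rng.uniform(size=X.shape)             # synthetic data set in the area of X
            labels_ci, _ = k_means(Xi, k)              # same clustering algorithm as for C
            q_i.append(index_fn(labels_ci, labels_p))  # index value q(Ci) against partition P
        threshold = np.quantile(q_i, 1.0 - rho)
        return q_value > threshold                     # True means H0 (randomness) is rejected

    # usage sketch:
    # reject = monte_carlo_test(R, X, labels_p, lambda c, p: external_indices(c, p)[0], k=4)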

b) Comparison of P (proximity matrix) with partition P. Partition P can be considered as a mapping g: X → {1,…,nc}. Assuming the matrix Y: Y(i, j) = 1 if g(xi) ≠ g(xj) and 0 otherwise, i, j = 1,…,N, we can compute the Γ (or normalized Γ) statistic using the proximity matrix P and the matrix Y. Based on the index value, we may have an indication of the two matrices' similarity.

To proceed with the evaluation procedure we use the Monte Carlo techniques as mentioned above. In the "Generate" step of the procedure we generate the corresponding mappings gi for every generated data set Xi. So in the "Compute" step we compute the matrix Yi for each Xi, in order to find the corresponding statistic index Γi.
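For reference, the Γ statistic between a proximity matrix and the matrix Y induced by a partition can be computed directly from its definition; the sketch below (NumPy/SciPy, hypothetical function name) is illustrative only.

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def hubert_gamma(X, labels):
    """Hubert's Gamma between the proximity matrix P and the 0/1 matrix Y
    derived from a partition: (1/M) * sum over pairs i < j of P(i,j) * Y(i,j)."""
    n = len(X)
    P = squareform(pdist(X))                                  # proximity matrix
    labels = np.asarray(labels)
    Y = (labels[:, None] != labels[None, :]).astype(float)    # 1 if different groups
    iu = np.triu_indices(n, k=1)                              # pairs with i < j
    M = n * (n - 1) / 2
    return (P[iu] * Y[iu]).sum() / M
```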

4.2.1 Internal Criteria. Using this approach of cluster validity, our goal is to evaluate the clustering result of an algorithm using only quantities and features inherent to the dataset. There are two cases in which we apply internal criteria of cluster validity, depending on the clustering structure: a) hierarchy of clustering schemes, and b) single clustering scheme.

a) Validating a hierarchy of clustering schemes. A matrix called the cophenetic matrix, Pc, can represent the hierarchy diagram produced by a hierarchical algorithm. We may define a statistical index to measure the degree of similarity between the Pc and P (proximity matrix) matrices. This index is called the Cophenetic Correlation Coefficient and is defined as:

CPCC = \frac{\frac{1}{M}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} d_{ij}\,c_{ij} \;-\; \mu_P\,\mu_C}{\sqrt{\left[\frac{1}{M}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} d_{ij}^{2} - \mu_P^{2}\right]\left[\frac{1}{M}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} c_{ij}^{2} - \mu_C^{2}\right]}}, \qquad -1 \le CPCC \le 1     ( 5)

where M = N(N-1)/2 and N is the number of points in the data set. Also, μP and μC are the means of the matrices P and Pc respectively, and are given by equation (6):

\mu_P = \frac{1}{M}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} P(i,j), \qquad \mu_C = \frac{1}{M}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} P_c(i,j)     ( 6)

Moreover, dij and cij are the (i, j) elements of the P and Pc matrices respectively. A value of the index close to 1 is an indication of a significant similarity between the two matrices. The Monte Carlo procedure described above is also used in this case of validation.
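As an illustration, the CPCC of a hierarchy produced by an agglomerative algorithm can be obtained with SciPy's hierarchical clustering utilities; the data and linkage method below are arbitrary assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # stand-in data set

proximities = pdist(X)                       # upper triangle of the proximity matrix P
Z = linkage(proximities, method="average")   # hierarchy of clustering schemes

# CPCC: correlation between cophenetic distances (Pc) and original distances (P).
cpcc, cophenetic_dists = cophenet(Z, proximities)
print("CPCC =", cpcc)   # values near 1 suggest the hierarchy fits the proximities well
```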

b) Validating a single clustering scheme. The goal here is to find the degree of agreement between a given clustering scheme C, consisting of nc clusters, and the proximity matrix P. The index defined for this approach is Hubert's Γ statistic (or the normalized Γ statistic). An additional matrix is used for the computation of the index, namely Y(i, j) = 1 if xi and xj belong to different clusters, and 0 otherwise, i, j = 1,…,N.

The application of Monte Carlo techniques is again the way to test the random hypothesis in a given data set.

4.2.2 Relative Criteria. The basis of the above described validation methods is statistical testing. Thus, the major drawback of techniques based on internal or external criteria is their high computational demands. A different validation approach is discussed in this section. It is based on relative criteria and does not involve statistical tests. The fundamental idea of this approach is to choose the best clustering scheme out of a set of defined schemes according to a pre-specified criterion. More specifically, the problem can be stated as follows:

"Let P be the set of parameters associated with a specific clustering algorithm (e.g. the number of clusters nc). Among the clustering schemes Ci, i = 1,…,nc, defined by a specific algorithm for different values of the parameters in P, choose the one that best fits the data set."

Then, we can consider the following cases of the problem:
I) P does not contain the number of clusters, nc, as a parameter. In this case, the choice of the optimal parameter values is described as follows: We run the algorithm for a wide range of its parameters' values and we choose the largest range for which nc remains constant (usually nc << N, the number of tuples). Then we choose as appropriate values of the P parameters the values that correspond to the middle of this range. This procedure also identifies the number of clusters that underlie our data set.

II) P contains nc as a parameter. The procedure of identifying the best clustering scheme is based on a validity index. Selecting a suitable performance index, q, we proceed with the following steps:
• We run the clustering algorithm for all values of nc between a minimum ncmin and a maximum ncmax. The minimum and maximum values have been defined a-priori by the user.
• For each of the values of nc, we run the algorithm r times, using a different set of values for the other parameters of the algorithm (e.g. different initial conditions).
• We plot the best values of the index q obtained for each nc as a function of nc.

Based on this plot we may identify the best clustering scheme. We have to stress that there are two approaches for defining the best clustering, depending on the behaviour of q with respect to nc. If the validity index does not exhibit an increasing or decreasing trend as nc increases, we seek the maximum (minimum) of the plot. On the other hand, for indices that increase (or decrease) as the number of clusters increases, we search for the values of nc at which a significant local change in the value of the index occurs. This change appears as a "knee" in the plot and is an indication of the number of clusters underlying the dataset. Moreover, the absence of a knee may be an indication that the data set possesses no clustering structure.
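As an illustration of case II, the loop over nc and the search for the best index value per nc can be sketched as follows; this is an assumed setup using scikit-learn's KMeans and its Davies-Bouldin score as the index q, with illustrative synthetic data and parameter values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def best_nc_by_relative_criterion(X, nc_min=2, nc_max=8, r=5):
    """For each nc, run the algorithm r times and keep the best index value;
    the returned dictionary is what one would plot against nc."""
    best = {}
    for nc in range(nc_min, nc_max + 1):
        values = []
        for seed in range(r):   # different initial conditions
            labels = KMeans(n_clusters=nc, n_init=1, random_state=seed).fit_predict(X)
            values.append(davies_bouldin_score(X, labels))
        best[nc] = min(values)  # DB exhibits no trend with nc, so its minimum is sought
    return best

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(25, 2))
               for c in ([0, 0], [0, 3], [3, 0], [3, 3])])
scores = best_nc_by_relative_criterion(X)
print(scores, "-> proposed nc:", min(scores, key=scores.get))
```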

In the sequel, some representative validity indices for crisp and fuzzy clustering are presented.

Crisp clustering. This section discusses validity indices suitable for crisp clustering.


a) The modified Hubert Γ statistic. The definition of the modified Hubert Γ statistic is given by the equation

\Gamma = \frac{1}{M}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N} P(i,j)\,Q(i,j)     ( 7)

where M = N(N-1)/2, P is the proximity matrix of the data set and Q is an N×N matrix whose (i, j) element is equal to the distance between the representative points (vci, vcj) of the clusters to which the objects xi and xj belong.

Similarly, we can define the normalized Hubert Γ statistic (given by equation (4)). If d(vci, vcj) is close to d(xi, xj) for i, j = 1,2,…,N, P and Q will be in close agreement and the values of Γ and Γ̂ (normalized Γ) will be high. Conversely, a high value of Γ (Γ̂) indicates the existence of compact clusters. Thus, in the plot of normalized Γ versus nc, we seek a significant knee that corresponds to a significant increase of normalized Γ. The number of clusters at which the knee occurs is an indication of the number of clusters that underlie the data. We note that for nc = 1 and nc = N the index is not defined.

b) Dunn and Dunn-like indices. A cluster validity index for crisp clustering proposed in [5] attempts to identify "compact and well separated clusters". The index is defined by equation (8) for a specific number of clusters nc:

D_{nc} = \min_{i=1,\dots,nc}\left\{ \min_{j=i+1,\dots,nc}\left( \frac{d(c_i,c_j)}{\max_{k=1,\dots,nc} diam(c_k)} \right) \right\}     ( 8)

where d(ci, cj) is the dissimilarity function between two clusters ci and cj, defined as

d(c_i,c_j) = \min_{x \in c_i,\; y \in c_j} d(x,y)     ( 9)

and diam(c) is the diameter of a cluster, which may be considered as a measure of the dispersion of the cluster. The diameter of a cluster C can be defined as follows:

diam(C) = \max_{x,y \in C} d(x,y)     ( 10)

It is clear that if the dataset contains compact and well-separated clusters, the distance between the clusters is expected to be large and the diameter of the clusters is expected to be small. Thus, based on the Dunn index definition, we may conclude that large values of the index indicate the presence of compact and well-separated clusters.

The index Dnc does not exhibit any trend with respect to the number of clusters. Thus, the maximum in the plot of Dnc versus the number of clusters can be an indication of the number of clusters that fits the data.
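A direct (brute-force) computation of Dunn's index from a labelled data set might look like the sketch below; the function name is hypothetical, and the single-linkage-style cluster distance and pairwise diameter follow equations (9) and (10).

```python
import numpy as np
from itertools import combinations

def dunn_index(X, labels):
    """Brute-force Dunn index: min inter-cluster distance / max cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]

    def diam(c):
        # diameter: maximum pairwise distance inside a cluster (eq. 10)
        return max((np.linalg.norm(a - b) for a, b in combinations(c, 2)), default=0.0)

    def dist(ci, cj):
        # dissimilarity between two clusters as minimum pairwise distance (eq. 9)
        return min(np.linalg.norm(a - b) for a in ci for b in cj)

    max_diam = max(diam(c) for c in clusters)
    min_dist = min(dist(ci, cj) for ci, cj in combinations(clusters, 2))
    return min_dist / max_diam
```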

The drawbacks of the Dunn index are: i) the considerable amount of time required for its computation, and ii) its sensitivity to the presence of noise in datasets, since noise is likely to increase the values of diam(c) (i.e., the denominator of equation (8)).

Three indices are proposed in [23] that are more robust to the presence of noise. They are widely known as Dunn-like indices since they are based on the Dunn index. Moreover, the three indices use for their definition the concepts of the minimum spanning tree (MST), the relative neighbourhood graph (RNG) and the Gabriel graph (GG) respectively [28].

Consider the index based on the MST. Let ci be a cluster and Gi the complete graph whose vertices correspond to the vectors of ci. The weight, we, of an edge, e, of this graph equals the distance between its two end points, x, y. Let Ei^MST be the set of edges of the MST of the graph Gi and ei^MST the edge in Ei^MST with the maximum weight. Then the diameter of ci is defined as the weight of ei^MST, and the Dunn-like index based on the concept of the MST is given by the equation

D_{nc}^{MST} = \min_{i=1,\dots,nc}\left\{ \min_{j=i+1,\dots,nc}\left( \frac{d(c_i,c_j)}{\max_{k=1,\dots,nc} diam^{MST}(c_k)} \right) \right\}     ( 11)

The number of clusters at which D_{nc}^{MST} takes its maximum value indicates the number of clusters in the underlying data. Based on similar arguments we may define the Dunn-like indices for the GG and RNG graphs.

c) The Davies-Bouldin (DB) index. A similarity measure Rij between the clusters Ci and Cj is defined based on a measure of dispersion, si, of a cluster Ci and a dissimilarity measure between two clusters, dij. The Rij index is defined so as to satisfy the following conditions [4]:
1. Rij ≥ 0
2. Rij = Rji
3. If si = 0 and sj = 0 then Rij = 0
4. If sj > sk and dij = dik then Rij > Rik
5. If sj = sk and dij < dik then Rij > Rik.
These conditions state that Rij is nonnegative and symmetric.
A simple choice for Rij that satisfies the above conditions is [4]:

Rij = (si + sj)/dij     ( 12)
Then the DB index is defined as

DB_{nc} = \frac{1}{nc}\sum_{i=1}^{nc} R_i, \qquad R_i = \max_{j=1,\dots,nc,\; j \ne i} R_{ij}, \quad i = 1,\dots,nc     ( 13)

It is clear from the above definition that DBnc is the average similarity between each cluster ci, i = 1,…,nc, and its most similar one. It is desirable for the clusters to have the minimum possible similarity to each other; therefore we seek clusterings that minimize DB. The DBnc index exhibits no trend with respect to the number of clusters and thus we seek the minimum value of DBnc in its plot versus the number of clusters. Some alternative definitions of the dissimilarity between two clusters, as well as of the dispersion of a cluster ci, are given in [4].
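Equations (12) and (13) translate almost literally into code; in the sketch below the dispersion s_i is taken as the average distance of a cluster's points to its centroid and d_ij as the distance between centroids, which is one common choice rather than the only one (scikit-learn's davies_bouldin_score is a ready-made alternative).

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB index: average over clusters of max_j (s_i + s_j) / d_ij."""
    ids = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in ids])
    # dispersion s_i: average distance of the cluster's points to its centroid
    s = np.array([np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
                  for k, c in enumerate(ids)])
    nc = len(ids)
    R = np.zeros(nc)
    for i in range(nc):
        r_ij = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                for j in range(nc) if j != i]
        R[i] = max(r_ij)          # similarity of cluster i to its most similar cluster
    return R.mean()
```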

Moreover, in [23] three variants of the DBnc index are proposed. They are based on the MST, RNG and GG concepts, similarly to the case of the Dunn-like indices.

Other validity indices for crisp clustering have been proposed in [3] and [21]. The implementation of most of these indices is very computationally expensive, especially when the number of clusters and the number of objects in the data set grow very large [31]. In [19], an evaluation study of thirty validity indices proposed in the literature is presented. It is based on small data sets (about 50 points each) with well-separated clusters. The results of this study [19] place Calinski and Harabasz (1974), Je(2)/Je(1) (1984), C-index (1976), Gamma and Beale among the six best indices. However, it is noted that although the results concerning these methods are encouraging, they are likely to be data dependent. Thus, the behaviour of the indices may change if different data structures are used [19]. Also, some indices are based on a sample of the clustering results. A representative example is Je(2)/Je(1), which is computed based only on the information provided by the items involved in the last cluster merge.

d) RMSSTD, SPR, RS, CD. At this point we give the definitions of four validity indices, which have to be used simultaneously to determine the number of clusters existing in the data set. These four indices can be applied to each step of a hierarchical clustering algorithm and are known as [26]: Root-mean-square standard deviation (RMSSTD) of the new cluster, Semi-partial R-squared (SPR), R-squared (RS), and the Distance between the two clusters (CD).

Getting into a more detailed description of them, we can say that:

RMSSTD of a new clustering scheme defined at a level of the clustering hierarchy is the square root of the pooled sample variance of all the variables (attributes) used in the clustering process. This index measures the homogeneity of the formed clusters at each step of the hierarchical algorithm. Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible. If the values of RMSSTD are higher at one step than at the previous step, we have an indication that the new clustering scheme is less homogeneous.

In the following definitions we shall use the symbol SS, which means Sum of Squares and refers to the equation:

SS = \sum_{i=1}^{n} (X_i - \bar{X})^2

Along with this we shall use some additional symbols:
i) SSw, referring to the within-group sum of squares,
ii) SSb, referring to the between-groups sum of squares,
iii) SSt, referring to the total sum of squares of the whole data set.

SPR of the new cluster is the difference between the pooled SSw of the new cluster and the sum of the pooled SSw values of the clusters joined to obtain the new cluster (loss of homogeneity), divided by the pooled SSt for the whole data set. This index measures the loss of homogeneity after merging the two clusters in a single algorithm step. If the index value is zero then the new cluster is obtained by merging two perfectly homogeneous clusters. If its value is high then the new cluster is obtained by merging two heterogeneous clusters.

RS of the new cluster is the ratio of SSb to SSt. SSb is a measure of the difference between groups. Since SSt = SSb + SSw, the greater the SSb the smaller the SSw, and vice versa. As a result, the greater the differences between groups, the more homogeneous each group is, and vice versa. Thus, RS may be considered as a measure of the degree of difference between clusters; furthermore, it measures the degree of homogeneity of the formed groups. The values of RS range between 0 and 1. A value of RS equal to zero (0) indicates that no difference exists among groups. On the other hand, when RS equals 1 there is an indication of significant difference among groups.

The CD index measures the distance between the two clusters that are merged in a given step. This distance is measured each time depending on the representatives selected for the hierarchical clustering we perform. For instance, in the case of centroid hierarchical clustering the representatives of the formed clusters are the centres of each cluster, so CD is the distance between the centres of the clusters. In the case of single linkage, CD measures the minimum Euclidean distance between all possible pairs of points. In the case of complete linkage, CD is the maximum Euclidean distance between all pairs of data points, and so on. Using these four indices we determine the number of clusters that exist in our data set, plotting a graph of all these indices' values for a number of different stages of the clustering algorithm. In this graph we search for the steepest knee, or, in other words, the greatest jump of these indices' values from a higher to a smaller number of clusters.

In the case of nonhierarchical clustering (e.g. K-Means) we may also use some of these indices in order to evaluate the resulting clustering. The indices that are more meaningful to use in this case are RMSSTD and RS. The idea here is to run the algorithm a number of times, for a different number of clusters each time. Then we plot the respective graphs of the validity indices for these clusterings and, as in the previous example, we search for the significant "knee" in these graphs. The number of clusters at which the "knee" is observed indicates the optimal clustering for our data set. In this case the validity indices described before take the following form:

RMSSTD = \left[ \frac{\sum_{i=1,\dots,nc}\;\sum_{j=1,\dots,d}\;\sum_{k=1}^{n_{ij}} (x_k - \bar{x}_j)^2}{\sum_{i=1,\dots,nc}\;\sum_{j=1,\dots,d} (n_{ij}-1)} \right]^{1/2}     ( 14)

RS = \frac{\sum_{j=1}^{d}\left[\sum_{k=1}^{n_j}(x_k - \bar{x}_j)^2\right] \;-\; \sum_{i=1,\dots,nc}\;\sum_{j=1,\dots,d}\left[\sum_{k=1}^{n_{ij}}(x_k - \bar{x}_j)^2\right]}{\sum_{j=1}^{d}\left[\sum_{k=1}^{n_j}(x_k - \bar{x}_j)^2\right]}     ( 15)

where nc is the number of clusters, d is the number of variables (data dimension), nj is the number of data values of the j-th dimension, while nij corresponds to the number of data values of the j-th dimension that belong to cluster i. Also, x̄j is the mean of the data values of the j-th dimension.
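For a concrete (non-hierarchical) use of these definitions, RMSSTD and RS can be computed from a labelling of the data as in the following sketch, which follows the pooled sum-of-squares formulas (14) and (15); the function name is hypothetical.

```python
import numpy as np

def rmsstd_rs(X, labels):
    """RMSSTD and RS for a crisp labelling, per the pooled SS definitions."""
    overall_mean = X.mean(axis=0)
    ss_total = ((X - overall_mean) ** 2).sum()          # SSt over all dimensions
    ss_within = 0.0
    pooled_dof = 0
    for c in np.unique(labels):
        cluster = X[labels == c]
        ss_within += ((cluster - cluster.mean(axis=0)) ** 2).sum()   # pooled SSw
        pooled_dof += (len(cluster) - 1) * X.shape[1]                # sum of (n_ij - 1)
    rmsstd = np.sqrt(ss_within / pooled_dof)
    rs = (ss_total - ss_within) / ss_total               # SSb / SSt
    return rmsstd, rs
```

Running this for a range of cluster numbers and plotting the two values against nc gives the graphs in which the "knee" is sought.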

e) The SD validity index. A more recent clustering validity approach is proposed in [13]. The SD validity index is defined based on the concepts of the average scattering for clusters and the total separation between clusters. In the sequel, we give the fundamental definitions for this index.

Average scattering for clusters. The average scattering for clusters is defined as

Scat(nc) = \frac{1}{nc}\sum_{i=1}^{nc} \frac{\|\sigma(v_i)\|}{\|\sigma(X)\|}     ( 16)

Total separation between clusters. The definition of total scattering (separation) between clusters is given by equation (17):

Dis(nc) = \frac{D_{max}}{D_{min}} \sum_{k=1}^{nc} \left( \sum_{z=1}^{nc} \| v_k - v_z \| \right)^{-1}     ( 17)

where Dmax = max(||vi - vj||), ∀i, j ∈ {1, 2,…,nc}, is the maximum distance between cluster centres and Dmin = min(||vi - vj||), ∀i, j ∈ {1, 2,…,nc}, is the minimum distance between cluster centres. Now, we can define a validity index based on equations (16) and (17) as follows:

SD(nc) = a⋅Scat(nc) + Dis(nc)     ( 18)
where a is a weighting factor equal to Dis(cmax), and cmax is the maximum number of input clusters. The first term, Scat(nc), defined by equation (16), indicates the average compactness of the clusters (i.e., intra-cluster distance). A small value for this term indicates compact clusters, and as the scattering within clusters increases (i.e., they become less compact) the value of Scat(nc) also increases. The second term, Dis(nc), indicates the total separation between the nc clusters (i.e., an indication of inter-cluster distance). Contrary to the first term, Dis(nc) is influenced by the geometry of the cluster centres and increases with the number of clusters. It is obvious from the previous discussion that the two terms of SD have different ranges; thus a weighting factor is needed in order to incorporate both terms in a balanced way. The number of clusters, nc, that minimizes the above index can be considered an optimal value for the number of clusters present in the data set. Also, the influence of the maximum number of clusters cmax, related to the weighting factor, on the selection of the optimal clustering scheme is discussed in [13]. It is shown that SD proposes an optimal number of clusters almost irrespectively of cmax. However, the index cannot properly handle arbitrarily shaped clusters. The same applies to all the aforementioned indices.
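A possible transcription of the SD index, under the assumption that ||σ(vi)|| and ||σ(X)|| denote the norms of the per-dimension variance vectors of cluster i and of the whole data set, is sketched below using scikit-learn's KMeans; the parameter choices are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def scat(X, labels, centroids):
    # average of ||sigma(v_i)|| / ||sigma(X)|| over clusters (eq. 16)
    sigma_X = np.linalg.norm(X.var(axis=0))
    sigmas = [np.linalg.norm(X[labels == c].var(axis=0)) for c in range(len(centroids))]
    return np.mean(sigmas) / sigma_X

def dis(centroids):
    # (Dmax / Dmin) * sum_k 1 / (sum_z ||v_k - v_z||)  (eq. 17)
    d = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
    d_max, d_min = d.max(), d[d > 0].min()
    return (d_max / d_min) * np.sum(1.0 / d.sum(axis=1))

def sd_index(X, c_max=8):
    runs = {nc: KMeans(n_clusters=nc, n_init=10, random_state=0).fit(X)
            for nc in range(2, c_max + 1)}
    a = dis(runs[c_max].cluster_centers_)        # weighting factor a = Dis(c_max)
    return {nc: a * scat(X, km.labels_, km.cluster_centers_) + dis(km.cluster_centers_)
            for nc, km in runs.items()}
```

The nc that minimizes the returned values would be taken as the proposed number of clusters.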

Fuzzy Clustering. In this section, we present validity indices suitable for fuzzy clustering. The objective is to seek clustering schemes where most of the vectors of the dataset exhibit a high degree of membership in one cluster. We note here that a fuzzy clustering is defined by a matrix U = [uij], where uij denotes the degree of membership of the vector xi in cluster j. Also, a set of cluster representatives is defined. Similarly to the crisp clustering case, we define a validity index, q, and we search for the minimum or maximum in the plot of q versus nc. Also, in case q exhibits a trend with respect to the number of clusters, we seek a significant knee of decrease (or increase) in the plot of q.

In the sequel, two categories of fuzzy validity indices are discussed. The first category uses only the membership values, uij, of a fuzzy partition of the data. The second one involves both the U matrix and the dataset itself.

a) Validity indices involving only the membership values. Bezdek proposed in [2] the partition coefficient, which is defined as

PC = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{nc} u_{ij}^{2}     ( 19)

The PC index values range in [1/nc, 1], where nc is the number of clusters. The closer the index is to unity, the "crisper" the clustering is. In case all membership values of a fuzzy partition are equal, that is, uij = 1/nc, PC obtains its lowest value. Thus, the closer the value of PC is to 1/nc, the fuzzier the clustering is. Furthermore, a value close to 1/nc indicates that there is no clustering tendency in the considered dataset, or that the clustering algorithm failed to reveal it.


The partition entropy coefficient is another index of this category. It is defined as

PE = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{nc} u_{ij}\,\log_a(u_{ij})     ( 20)

where a is the base of the logarithm. The index is computed for values of nc greater than 1 and its values range in [0, log_a nc]. The closer the value of PE is to 0, the "harder" the clustering is. As in the previous case, values of the index close to the upper bound (i.e., log_a nc) indicate the absence of any clustering structure in the dataset, or the inability of the algorithm to extract it.

The drawbacks of these indices are:
i) their monotonous dependency on the number of clusters. Thus, we seek significant knees of increase (for PC) or decrease (for PE) in the plots of the indices versus the number of clusters,
ii) their sensitivity to the fuzzifier, m. More specifically, as m → 1 the indices give the same values for all values of nc. On the other hand, when m → ∞, both PC and PE exhibit a significant knee at nc = 2,
iii) the lack of direct connection to the geometry of the data [3], since they do not use the data itself.
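Both coefficients are straightforward to compute once the membership matrix U (N rows, nc columns, rows summing to one) is available, e.g. from a fuzzy c-means run; the sketch below is illustrative only.

```python
import numpy as np

def partition_coefficient(U):
    """PC = (1/N) * sum_i sum_j u_ij^2, for an N x nc membership matrix U."""
    return (U ** 2).sum() / U.shape[0]

def partition_entropy(U, a=np.e):
    """PE = -(1/N) * sum_i sum_j u_ij * log_a(u_ij)."""
    eps = 1e-12                                   # guard against log(0) memberships
    return -(U * (np.log(U + eps) / np.log(a))).sum() / U.shape[0]
```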

b) Indices involving the membership values and the dataset. The Xie-Beni index [31], XB, also called the compactness and separation validity function, is a representative index of this category.

Consider a fuzzy partition of the data set X = {xj; j = 1,…,N} with vi (i = 1,…,nc) the centres of the clusters and uij the membership of data point j in cluster i. The fuzzy deviation of xj from cluster i, dij, is defined as the distance between xj and the centre of cluster i, weighted by the fuzzy membership of data point j in cluster i:

dij = uij ||xj - vi||     ( 21)
Also, for a cluster i, the summation of the squares of the fuzzy deviations of the data points in X, denoted σi, is called the variation of cluster i.

The summation of the variations of all clusters, σ, is called the total variation of the data set.

The quantity π = σi/ni is called the compactness of cluster i. Since ni is the number of points belonging to cluster i, π is the average variation in cluster i.

Also, the separation of the fuzzy partition is defined as the minimum distance between cluster centres, that is
dmin = min ||vi - vj||.
Then the XB index is defined as
XB = π / (N⋅dmin)
where N is the number of points in the data set.

It is clear that small values of XB are expected for compact and well-separated clusters. We note, however, that XB is monotonically decreasing when the number of clusters nc gets very large and close to N. One way to eliminate this decreasing tendency of the index is to determine a starting point, cmax, of the monotonic behaviour and to search for the minimum value of XB in the range [2, cmax]. Moreover, the values of the index XB depend on the fuzzifier value, so that if m → ∞ then XB → ∞.
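The sketch below follows the commonly used closed form of the Xie-Beni index, XB = Σi Σj u_ij^m ||x_j - v_i||^2 / (N · min_{i≠k} ||v_i - v_k||^2), which matches the compactness/separation description above up to the exact weighting; the names and the choice m = 2 are assumptions, not the authors' formulation.

```python
import numpy as np

def xie_beni(X, U, V, m=2):
    """Xie-Beni index for data X (N x d), memberships U (N x nc), centres V (nc x d)."""
    N = X.shape[0]
    d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=-1)    # (nc, N) squared distances
    compactness = ((U.T ** m) * d2).sum()                        # weighted within-cluster spread
    centre_d2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=-1)
    separation = centre_d2[centre_d2 > 0].min()                  # min squared centre distance
    return compactness / (N * separation)
```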

Another index of this category is the Fukuyama-Sugeno index, which is defined as

FS_m = \sum_{i=1}^{N}\sum_{j=1}^{nc} u_{ij}^{m}\left( \|x_i - v_j\|_A^{2} - \|v_j - \bar{v}\|_A^{2} \right)     ( 22)

where v̄ is the mean vector of X and A is an l×l positive definite, symmetric matrix. When A = I, the above distance becomes the squared Euclidean distance. It is clear that for compact and well-separated clusters we expect small values for FSm. The first term in the parenthesis measures the compactness of the clusters and the second one measures the distances between the cluster representatives.

Other fuzzy validity indices are proposed in [9], which are based on the concepts of hypervolume and density. Let Σj be the fuzzy covariance matrix of the j-th cluster, defined as

\Sigma_j = \frac{\sum_{i=1}^{N} u_{ij}^{m}\,(x_i - v_j)(x_i - v_j)^{T}}{\sum_{i=1}^{N} u_{ij}^{m}}     ( 23)

Then the total fuzzy hypervolume is given by the equation

FH = \sum_{j=1}^{nc} V_j     ( 24)

where Vj = [det(Σj)]^{1/2} is the fuzzy hypervolume of the j-th cluster. Small values of FH indicate the existence of compact clusters. The average partition density is another index of this category. It can be defined as

PA = \frac{1}{nc}\sum_{j=1}^{nc} \frac{S_j}{V_j}     ( 25)

where Xj is the set of data points that are within a pre-specified region around vj, and S_j = \sum_{x \in X_j} u_{ij} is called the sum of the central members of the j-th cluster.
A different measure is the partition density index, which is defined as
PD = S / FH     ( 26)
where S = \sum_{j=1}^{nc} S_j.

A few other indices are proposed and discussed in [17,24].

4.3 Other approaches for cluster validity

Another approach for finding the best number of clusters of a data set is proposed in [27]. It introduces a practical clustering algorithm based on Monte Carlo cross-validation. More specifically, the algorithm consists of M cross-validation runs over M chosen train/test partitions of a data set, D. For each partition u, the EM algorithm is used to define c clusters on the training data, while c is varied from 1 to cmax. Then, the log-likelihood Lc^u(D) is calculated for each model with c clusters. It is defined using the probability density function of the data as

L_c^{u}(D) = \sum_{i=1}^{N} \log f_c(x_i \,/\, \Phi_c)     ( 27)

where fc is the probability density function for the data and Φc denotes the parameters that have been estimated from the data. This is repeated M times and the M cross-validated estimates are averaged for each nc. Based on these estimates we may define the posterior probabilities for each value of the number of clusters nc, p(nc|D). If one of the p(nc|D) values is near 1, there is strong evidence that the particular number of clusters is the best for our data set.

The evaluation approach proposed in [27] is based on density functions considered for the data set. Thus, it relies on concepts related to probabilistic models in order to estimate the number of clusters that best fits a data set, and it does not use concepts directly related to the data (i.e., inter-cluster and intra-cluster distances).
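A rough sketch of the cross-validated likelihood idea of [27], with scikit-learn's GaussianMixture standing in for the EM step and repeated random train/test splits; the split ratio, M and cmax are illustrative assumptions, and the averaged held-out log-likelihoods would still have to be turned into posterior probabilities p(nc|D).

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

def cv_log_likelihood(X, c_max=8, M=20, test_size=0.5, seed=0):
    """Average held-out log-likelihood for each number of mixture components c."""
    avg = {}
    for c in range(1, c_max + 1):
        scores = []
        for m in range(M):   # M Monte Carlo train/test partitions
            train, test = train_test_split(X, test_size=test_size, random_state=seed + m)
            gm = GaussianMixture(n_components=c, random_state=seed).fit(train)
            scores.append(gm.score(test))        # mean log-likelihood on the test part
        avg[c] = float(np.mean(scores))
    return avg
```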

4.4 An experimental study of validity indices

In this section we present a comparative experimental evaluation of the important validity measures, aiming at illustrating their advantages and disadvantages. We consider the known relative validity indices proposed in the literature, such as RS-RMSSTD [26], DB [28] and the more recent SD [13]. The definitions of these validity indices can be found in Section 4.2.2. RMSSTD and RS have to be taken into account simultaneously in order to find the correct number of clusters. The optimal value of the number of clusters is the one at which a significant local change in the values of RS and RMSSTD occurs. As regards DB, an indication of the optimal clustering scheme is the point at which it takes its minimum value.

For our study, we used four synthetic two-dimensional data sets, further referred to as DataSet1, DataSet2, DataSet3 and DataSet4 (see Figure 3a-d), and a real data set, Real_Data1 (Figure 3e), representing a part of the Greek road network [29].

Table 5 summarizes the results of the validity indices (RS, RMSSTD, DB, SD) for different clustering schemes of the above-mentioned data sets, as resulting from a clustering algorithm. For our study, we used the K-means and CURE algorithms with their input value (number of clusters) ranging between 2 and 8. The indices RS and RMSSTD propose the partitioning of DataSet1 into three clusters, while DB selects six clusters as the best partitioning. On the other hand, SD selects four clusters as the best partitioning for DataSet1, which is the correct number of clusters fitting the data set. Moreover, the index DB selects the correct number of clusters (i.e., seven) as the optimal partitioning for DataSet3, while RS-RMSSTD and SD select the clustering schemes of five and six clusters respectively. Also, all indices propose three clusters as the best partitioning for Real_Data1. In the case of DataSet2, DB and SD select three clusters as the optimal scheme, while RS-RMSSTD selects two clusters (i.e., the correct number of clusters fitting the data set).

Figure 3: Datasets used in the experimental study — (a) DataSet1, (b) DataSet2, (c) DataSet3, (d) DataSet4, (e) Real_Data1.

Table 5: Optimal number of clusters proposed by the validity indices RS-RMSSTD, DB, SD

                  DataSet1   DataSet2   DataSet3   DataSet4   Real_Data1
RS, RMSSTD            3          2          5          4           3
DB                    6          3          7          4           3
SD                    4          3          6          3           3


Moreover, SD finds the correct number of clusters (three) for DataSet4, contrary to the RS-RMSSTD and DB indices, which propose four clusters as the best partitioning.

Here we have to mention that the validity indices are not clustering algorithms themselves but measures for evaluating the results of clustering algorithms, giving an indication of a partitioning that best fits a data set. The essence of clustering is not a totally resolved issue and, depending on the application domain, we may consider different aspects as more significant. For instance, for a specific application it may be important to have well separated clusters, while for another it may be more important to consider the compactness of the clusters. Having an indication of a good partitioning as proposed by an index, the domain experts may further analyse the validation procedure results. Thus, they could select some of the partitioning schemes proposed by the indices, and choose the one better fitting their demands for crisp or overlapping clusters. For instance, DataSet2 can be considered as having three clusters with two of them slightly overlapping, or as having two well-separated clusters.

5. Conclusions and Trends in Clustering

Cluster analysis is one of the major tasks in various research areas. However, it may be found under different names in different contexts, such as unsupervised learning in pattern recognition, taxonomy in biology, and partition in graph theory. Clustering aims at identifying and extracting significant groups in the underlying data. Thus, based on a certain clustering criterion, the data are grouped so that data points in a cluster are more similar to each other than points in different clusters.

Since clustering is applied in many fields, a number of clustering techniques and algorithms have been proposed and are available in the literature. In this paper we presented the main characteristics and applications of clustering algorithms. Moreover, we discussed the different categories into which algorithms can be classified (i.e., partitional, hierarchical, density-based, grid-based, fuzzy clustering) and we presented representative algorithms of each category. We concluded the discussion on clustering algorithms with a comparative presentation, stressing the pros and cons of each category.

Another important issue that we discussed in this paper is cluster validity. This is related to the inherent features of the data set under concern. The majority of algorithms are based on certain criteria in order to define the clusters into which a data set can be partitioned. Since clustering is an unsupervised method and there is no a-priori indication of the actual number of clusters present in a data set, there is a need for some kind of validation of the clustering results. We presented a survey of the best-known validity criteria available in the literature, classified into three categories: external, internal, and relative. Moreover, we discussed some representative validity indices of these criteria along with a sample experimental evaluation.

Trends in clustering. Though cluster analysis has been the subject of thorough research for many years and in a variety of disciplines, there are still several open research issues. We summarize some of the most interesting trends in clustering as follows:
i) Discovering and finding representatives of arbitrarily shaped clusters. One of the requirements in clustering is the handling of arbitrarily shaped clusters, and there are some efforts in this context. However, there is no well-established method to describe the structure of arbitrarily shaped clusters as defined by an algorithm. Considering that clustering is a major tool for data reduction, it is important to find the appropriate representatives of the clusters describing their shape. Thus, we may effectively describe the underlying data based on clustering results while achieving a significant compression of the huge amount of stored data (data reduction).
ii) Non-point clustering. The vast majority of algorithms have only considered point objects, though in many cases we have to handle sets of extended objects such as (hyper-)rectangles. Thus, a method that efficiently handles sets of non-point objects and discovers the clusters inherent in them is a subject of further research, with applications in diverse domains (such as spatial databases, medicine, biology).
iii) Handling uncertainty in the clustering process and visualization of results. The majority of clustering techniques assume that the limits of clusters are crisp. Thus, each data point may be classified into at most one cluster. Moreover, all points classified into a cluster belong to it with the same degree of belief (i.e., all values are treated equally in the clustering process). The result is that, in some cases, "interesting" data points fall outside the cluster limits and so are not classified at all. This is unlike everyday-life experience, where a value may be classified into more than one category. Thus, a further work direction is taking into account the uncertainty inherent in the data. Another interesting direction is the study of techniques that efficiently visualize multidimensional clusters, taking also into account uncertainty features.
iv) Incremental clustering. The clusters in a data set may change as insertions, updates and deletions occur throughout its life cycle. It is then clear that there is a need for evaluating the clustering scheme defined for a data set so as to update it in a timely manner. However, it is important to exploit the information hidden in the earlier clustering schemes so as to update them in an incremental way.
v) Constraint-based clustering. Depending on the application domain, we may consider different clustering aspects as more significant. It may be important to stress or ignore some aspects of the data according to the requirements of the considered application. In recent years, there is a trend for cluster analysis to be based on fewer parameters but on more constraints. These constraints may exist in the data space or in users' queries. Then a clustering process has to be defined so as to take into account these constraints and define the clusters that fit the dataset.

Acknowledgements
This work was supported by the General Secretariat for Research and Technology through the PENED ("99ΕΔ 85") project. We thank C. Amanatidis for his suggestions and his help in the experimental study. Also, we are grateful to C. Rodopoulos for the implementation of the CURE algorithm, as well as to Dr Eui-Hong (Sam) Han for providing information and the source code for the CURE algorithm.

References
[1] Michael J. A. Berry, Gordon Linoff. Data Mining Techniques for Marketing, Sales and Customer Support. John Wiley & Sons, Inc., 1996.
[2] Bezdek J.C., Ehrlich R., Full W. "FCM: Fuzzy C-Means Algorithm", Computers and Geosciences, 1984.
[3] Rajesh N. Dave. "Validating fuzzy partitions obtained through c-shells clustering", Pattern Recognition Letters, Vol. 17, pp. 613-623, 1996.
[4] Davies D.L., Bouldin D.W. "A cluster separation measure", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 1(2), 1979.
[5] J. C. Dunn. "Well separated clusters and optimal fuzzy partitions", J. Cybern., Vol. 4, pp. 95-104, 1974.
[6] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu. "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proceedings of the 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231, 1996.
[7] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Michael Wimmer, Xiaowei Xu. "Incremental Clustering for Mining in a Data Warehousing Environment", Proceedings of the 24th VLDB Conference, New York, USA, 1998.
[8] Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, Ramasamy Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.
[9] Gath I., Geva A.B. "Unsupervised optimal fuzzy clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11(7), 1989.
[10] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. "CURE: An Efficient Clustering Algorithm for Large Databases", Proceedings of the ACM SIGMOD Conference, 1998.
[11] Sudipto Guha, Rajeev Rastogi, Kyuseok Shim. "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Proceedings of the IEEE Conference on Data Engineering, 1999.
[12] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
[13] M. Halkidi, M. Vazirgiannis, Y. Batistakis. "Quality scheme assessment in the clustering process", Proceedings of PKDD, Lyon, France, 2000.
[14] Alexander Hinneburg, Daniel Keim. "An Efficient Approach to Clustering in Large Multimedia Databases with Noise", Proceedings of the KDD Conference, 1998.
[15] Zhexue Huang. "A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining", DMKD, 1997.
[16] A.K. Jain, M.N. Murty, P.J. Flynn. "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[17] Krishnapuram R., Frigui H., Nasraoui O. "Quadratic shell clustering algorithms and the detection of second-degree curves", Pattern Recognition Letters, Vol. 14(7), 1993.
[18] MacQueen, J.B. "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Volume I: Statistics, pp. 281-297, 1967.
[19] Milligan G.W., Cooper M.C. "An Examination of Procedures for Determining the Number of Clusters in a Data Set", Psychometrika, 50, pp. 159-179, 1985.
[20] T. Mitchell. Machine Learning. McGraw-Hill, 1997.
[21] Milligan G.W., Soon S.C., Sokol L.M. "The effect of cluster size, dimensionality and the number of clusters on recovery of true cluster structure", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 5, pp. 40-47, 1983.
[22] Raymond Ng, Jiawei Han. "Efficient and Effective Clustering Methods for Spatial Data Mining", Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[23] Pal N.R., Biswas J. "Cluster Validation using graph theoretic concepts", Pattern Recognition, Vol. 30(6), 1997.
[24] Ramze Rezaee, B.P.F. Lelieveldt, J.H.C. Reiber. "A new cluster validity index for the fuzzy c-means", Pattern Recognition Letters, 19, pp. 237-246, 1998.
[25] G. Sheikholeslami, S. Chatterjee, A. Zhang. "WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases", Proceedings of the 24th VLDB Conference, New York, USA, 1998.
[26] Sharma S.C. Applied Multivariate Techniques. John Wiley & Sons, 1996.
[27] Padhraic Smyth. "Clustering using Monte Carlo Cross-Validation", Proceedings of the KDD Conference, 1996.
[28] S. Theodoridis, K. Koutroumbas. Pattern Recognition, Academic Press, 1999.
[29] Y. Theodoridis. Spatial Datasets: an "unofficial" collection. http://dias.cti.gr/~ytheod/research/datasets/spatial.html
[30] Wei Wang, Jiong Yang, Richard Muntz. "STING: A statistical information grid approach to spatial data mining", Proceedings of the 23rd VLDB Conference, 1997.
[31] Xuanli Lisa Xie, Gerardo Beni. "A Validity Measure for Fuzzy Clustering", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, No. 4, August 1991.
[32] Tian Zhang, Raghu Ramakrishnan, Miron Livny. "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD, Montreal, Canada, 1996.