
An efficient k′-means clustering algorithm

Krista Rizman Zalik *

University of Maribor, Faculty of Natural Sciences and Mathematics, Department of Mathematics and Computer Science, Koroška Cesta 160, 2000 Maribor, Slovenia

Pattern Recognition Letters 29 (2008) 1385–1391, doi:10.1016/j.patrec.2008.02.014

Article history: Received 29 March 2007; Received in revised form 24 December 2007; Available online 4 March 2008.

Communicated by L. Heutte

Keywords: Clustering analysis; k-Means; Cluster number; Cost-function; Rival penalized

* Tel.: +386 02 229 38 21; fax: +386 02 251 81 80. E-mail address: [email protected]

Abstract

This paper introduces a k′-means algorithm that performs correct clustering without pre-assigning the exact number of clusters. This is achieved by minimizing a suggested cost-function. The cost-function extends the mean-square-error cost-function of k-means. The algorithm consists of two separate steps. The first is a pre-processing procedure that performs initial clustering and assigns at least one seed point to each cluster. During the second step, the seed-points are adjusted to minimize the cost-function. The algorithm automatically penalizes any possible winning chances for all rival seed-points in subsequent iterations. When the cost-function reaches a global minimum, the correct number of clusters is determined and the remaining seed points are located near the centres of actual clusters. The simulated experiments described in this paper confirm good performance of the proposed algorithm.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Clustering is a search for hidden patterns that may exist in datasets. It is a process of grouping data objects into disjointed clusters so that the data in each cluster are similar, yet different to the others. Clustering techniques are applied in many application areas such as data analyses, pattern recognition, image processing, and information retrieval.

k-Means is a typical clustering algorithm (MacQueen, 1967). It is attractive in practice, because it is simple and it is generally very fast. It partitions the input dataset into k clusters. Each cluster is represented by an adaptively-changing centroid (also called cluster centre), starting from some initial values named seed-points. It computes the squared distances between the inputs (also called input data points) and centroids, and assigns inputs to the nearest centroid. An algorithm for clustering N input data points x_1, x_2, ..., x_N into k disjoint subsets C_i, i = 1, ..., k, each containing n_i data points, 0 < n_i < N, minimizes the following mean-square-error (MSE) cost-function:

J_{MSE} = \sum_{i=1}^{k} \sum_{x_t \in C_i} \| x_t - c_i \|^2    (1)

where x_t is a vector representing the t-th data point in the cluster C_i and c_i is the geometric centroid of the cluster C_i. Finally, this algorithm aims at minimizing an objective function, in this case a squared-error-function, where \| x_t - c_i \|^2 is a chosen distance measurement between the data point x_t and the cluster centre c_i.

The k-means algorithm assigns an input data point x_t to the i-th cluster if the cluster membership function I(x_t, i) is 1:

I(x_t, i) = \begin{cases} 1, & i = \arg\min_{j=1,\ldots,k} \| x_t - c_j \|^2 \\ 0, & \text{otherwise} \end{cases}    (2)

Here c_1, c_2, ..., c_k are called cluster centres, which are learned by the following steps:

Step 1: Initialize k cluster centres c_1, c_2, ..., c_k by some initial values called seed-points, using random sampling.
For each input data point x_t and all k clusters, repeat steps 2 and 3 until all centres converge.
Step 2: Calculate the cluster membership function I(x_t, i) by Eq. (2) and decide the membership of each input data point in one of the k clusters, the one whose cluster centre is closest to that point.
Step 3: For all k cluster centres, set c_i to be the centre of mass of all points in cluster C_i.

Although k-means has been widely used in data analyses, pattern recognition and image processing, it has three major limitations:

(1) The number of clusters must be previously known and fixed.
(2) The results of the k-means algorithm depend on the initial cluster centres (initial seed-points).
(3) The algorithm contains the dead-unit problem.
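For reference, the basic k-means loop described above (Eqs. (1) and (2), Steps 1–3) can be sketched in a few lines of Python. This is a minimal illustration under the definitions above, not code from the paper; the array shapes, the tolerance and the random seeding strategy are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    """Minimal k-means sketch: X is an (N, d) array, k the number of clusters."""
    rng = np.random.default_rng(rng)
    # Step 1: seed-points chosen by random sampling from the input data.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: membership by Eq. (2) -- nearest centre in squared distance.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: each centre becomes the centre of mass of its cluster.
        new_centres = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centres[i]
            for i in range(k)
        ])
        if np.linalg.norm(new_centres - centres) < tol:  # convergence test
            centres = new_centres
            break
        centres = new_centres
    return centres, labels
```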

The major limitation of the k-means algorithm is that the number of clusters must be pre-determined and fixed. Selecting the appropriate number of clusters is critical. It requires a priori knowledge about the data or, in the worst case, guessing the number of clusters. When the input number of clusters (k) is equal to the real number of clusters (k′), the k-means algorithm correctly discovers all clusters, as shown in Fig. 1, where cluster centres are marked by squares. Otherwise, it gives incorrect clustering results, as illustrated in Fig. 2a–c. When clustering real data, the number of clusters is unknown in advance and has to be estimated. Finding the correct number of clusters is usually performed over many clustering runs using different numbers of clusters.

The performance of the k-means algorithm depends on the initial cluster centres (initial seed-points). Furthermore, the final partition depends on the initial configuration. Some research has addressed this problem by proposing algorithms for computing initial cluster centres for k-means clustering (Khan and Ahmad, 2004; Redmond and Heneghan, 2007). Genetic algorithms have been developed for selecting centres in order to seed the popular k-means method for clustering (Laszlo and Mukherjee, 2007). Steinley and Brusco (2007) evaluated twelve procedures proposed in the literature for initializing k-means clustering and introduced recommendations for best practices. They recommended the method of multiple random starting-points for general use. In general, initial cluster centres are selected randomly. An assumption in their studies is that the number of clusters is known in advance. They conclude that even the best initial strategy for cluster centres, together with minimizing the mean-square-error cost-function, does not lead to the best dataset partition.

    In the late 1980s, it was pointed-out that the classical k-meansalgorithm has the so-called dead-unit or underutilization problem(Xu, 1993). Each centre, initialized far away from the input datapoints, may never win in the process of assigning a data point tothe nearest centre, and so it then stays far away from the inputdata objects, becoming a dead-unit.

Over the last fifteen years, new advanced k-means algorithms have been developed that eliminate the dead-unit problem, for example, the frequency sensitive competitive learning (FSCL) algorithm (Ahalt et al., 1990). A typical strategy is to reduce the learning rates of frequent winners. Each cluster centre counts the number of times it wins the competition, and consequently reduces its learning rate. If a centre wins too often, it does not cooperate in the competition. FSCL solves the dead-unit problem and successfully identifies clusters, but only when the number of clusters is known in advance and appropriately preselected; otherwise, the algorithm performs badly.
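A hedged sketch of this frequency-sensitive idea follows: the competition is biased by each centre's win count, so centres that win often become less competitive and far-away centres eventually get a chance to win. This is one common formulation written only for illustration; the conscience term and learning-rate schedule are assumptions, not taken verbatim from Ahalt et al. (1990).

```python
import numpy as np

def fscl_step(x, centres, wins, lr=0.05):
    """One frequency-sensitive competitive update for a single input x.
    centres: (k, d) float array; wins: (k,) win counters, initialized to ones."""
    d2 = ((centres - x) ** 2).sum(axis=1)
    # Bias the competition by the win count: frequent winners are handicapped.
    winner = int(np.argmin(wins * d2))
    wins[winner] += 1
    # Move only the winner towards the input; its effective rate shrinks as it wins.
    centres[winner] += (lr / wins[winner]) * (x - centres[winner])
    return winner
```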

Fig. 1. A dataset with three clusters recognized by the k-means algorithm for k = 3.

Fig. 2. k-Means produces wrong clusters for k = 1 (a), k = 2 (b) and k = 4 (c) for the same dataset as in Fig. 1, which consists of three clusters; the black square denotes the location of the converged cluster centre.

Solving the selection of a correct cluster number has been tried

in two ways. The first one invokes some heuristic approaches. The clustering algorithm is run many times with the number of clusters gradually increasing from a certain initial value to some threshold value that is difficult to set. The second is to formulate cluster number selection as choosing the component number in a finite mixture model. The earliest method for solving this model selection problem may be to choose the optimal number of clusters by Akaike's information criterion or its extensions, AIC (Akaike, 1973; Bozdogan, 1987). Other criteria include Schwarz's Bayesian inference criterion (BIC) (Schwarz, 1978), the minimum message length (MML) criterion (Wallace and Dowe, 1999) and Bezdek's partition coefficients (PC) (Bezdek, 1981). As reported in Oliver et al. (1996), BIC and MML perform comparably and outperform the AIC and PC criteria. These existing criteria may overestimate or underestimate the cluster number, because of the difficulty in choosing an appropriate penalty function. Better results are obtained by a number selection criterion developed from the Ying-Yang machine (Xu, 1997), which, unfortunately, requires laborious computing.

To tackle the problem of appropriately selecting the number of clusters, the rival penalized competitive learning (RPCL) algorithm was proposed (Xu, 1993), which adds a new mechanism to FSCL. For each input data point, the basic idea is that not only is the cluster centre of the winner cluster modified to adapt to the input data point, but the cluster centre of its rival cluster (second winner) is also de-learned by a smaller learning rate. Many experiments have shown that RPCL can select the correct cluster number by driving extra cluster centres far away from the input dataset. Although the RPCL algorithm has had success in some applications, such as

colour-image segmentation and image feature extraction, it is rather sensitive to the selection of the de-learning rate (Law and Cheung, 2003; Cheung, 2005; Ma and Cao, 2006). The RPCL algorithm was proposed heuristically. It has been shown that RPCL can be regarded as a fast approximate implementation of a special case of Bayesian Ying-Yang (BYY) harmony learning on a Gaussian mixture (Xu, 1997). The ability to select a number of clusters is provided by the ability of Bayesian Ying-Yang learning model selection. There is still a lack of mathematical theory directly describing the correct convergence behaviour of RPCL, which selects the correct number of clusters while driving all other unnecessary cluster centres far away from the sample data.
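The winner/rival update that this discussion refers to can be sketched as follows. This is a generic illustration of the RPCL idea summarized above, not the exact formulation of Xu (1993); the two learning rates are placeholder values and must satisfy alpha_rival << alpha_win.

```python
import numpy as np

def rpcl_step(x, centres, alpha_win=0.05, alpha_rival=0.002):
    """One rival penalized competitive learning update for a single input x.
    Assumes at least two centres (a winner and a rival)."""
    d2 = ((centres - x) ** 2).sum(axis=1)
    order = np.argsort(d2)
    winner, rival = order[0], order[1]                      # first and second winner
    centres[winner] += alpha_win * (x - centres[winner])    # winner learns
    centres[rival] -= alpha_rival * (x - centres[rival])    # rival is de-learned
    return winner, rival
```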

This paper presents a new k′-means algorithm, which is an extension of k-means, without the three major drawbacks stated at the beginning of this section. The algorithm has a similar mechanism to RPCL in that it performs clustering without pre-determining the correct cluster number. The problem of the suggested k′-means algorithm's correct convergence is investigated via a cost-function approach. A special cost-function is suggested, since the k-means cost-function (Eq. (1)) cannot be used for determining the number of clusters, because it decreases monotonically with any increase in the cluster number. It is shown that, when the cost-function reduces to a global minimum, the correct number of cluster centres converges to the actual cluster centres, while all other initial centres are driven far away from the input dataset, and the corresponding clusters can be neglected, because they are empty.

Section 2 constructs a new cost-function. A rival penalized mechanism analysis of the proposed cost-function is presented in Section 3. Section 4 describes the k′-means algorithm for minimizing the proposed cost-function. Section 5 presents the experimental evaluation. The paper is summarized in Section 6.

2. The cost-function

The k-means algorithm minimizes the mean-square-error cost-function J_{MSE} (Eq. (1)), which decreases monotonically with any increase of the cluster number. Such a function cannot be used for identifying the correct number of clusters and cannot be used for the RPCL algorithm. This section introduces a new cost-function using the following two characteristics:

(1) Areas with dense samples strongly attract centres, and
(2) Each cluster centre pushes all other cluster centres away in order to give maximal information about the patterns formed by the input data points. This enables the possibility of moving extra cluster centres away from the sample data. When a cluster centre is driven away from the sample data, the corresponding cluster can be neglected, because it is empty.

We want to obtain maximal information about the patterns formed by the input data points. The amount of information each cluster gives us about the dataset can be quantified. Discovering the i-th cluster C_i having n_i elements in a dataset with N elements gives us the amount of information I(C_i):

I(C_i) = | \log(n_i / N) |    (3)

This information is a measure of decreasing uncertainty about the dataset. The logarithm is selected for measuring information since it is additive when concatenating independent, unrelated amounts of information for a whole system, e.g. when it discovers a cluster. For a dataset with N elements forming k distinguishable clusters, the amount of information is I(C_1) + I(C_2) + ... + I(C_k).

We have to maximize the amount of information and minimize the uncertainty about the system, J_I (Eq. (4)):

J_I = -E \sum_{i=1}^{k} n_i \log_2 p(C_i), \quad \sum_{i=1}^{k} p(C_i) = 1, \; 0 \le p(C_i) \le 1, \; i = 1, \ldots, k    (4)

where p(C_i) is the probability that an input data point is in the cluster (subset) C_i. E is a constant and is just a choice of measurement units. E should be from the range of the point coordinates. The magnitude of the coordinates does not matter, because we only care about point distances. Setting the parameter E is discussed and experimentally verified in Sections 4 and 5.

In view of the above considerations, we were motivated to construct a cost-function composed of the mean-square-error J_{MSE} and the information uncertainty:

J = J_I + J_{MSE}    (5)

Fig. 3. Dataset with 800 data objects clustered into four clusters, and values of the functions J_I, J_{MSE} and J_I + J_{MSE} for cluster numbers k = 1–9.

The data metric d_m used for clustering, which minimizes the above cost-function (Eq. (5)), where C_i is the cluster having centre c_i and x_t is an input data point, is

d_m(x_t, C_i) = \| x_t - c_i \|^2 - E \log_2 p(C_i), \quad \sum_{i=1}^{k} p(C_i) = 1, \; 0 \le p(C_i) \le 1, \; i = 1, \ldots, k    (6)

We assign an input data point x_t to the cluster C_i if the cluster membership function I(x_t, i) in Eq. (7) is 1:

I(x_t, i) = \begin{cases} 1, & i = \arg\min_{j=1,\ldots,k} d_m(x_t, C_j) \\ 0, & \text{otherwise} \end{cases}    (7)

The input data point x_t affects the cluster centre of the cluster C_i. The winner's centre is modified in order to also contain the input data point x_t, and the term E log_2 p(C_i) in the data metric (Eq. (6)) is automatically decreased for the rival centres, because p(C_i) is decreased and the sum of all probabilities (p(C_i), i = 1, ..., k) is 1. The rival cluster centres are automatically penalized in the sense of a winning chance. Such penalization of the rival cluster centres can reduce the winning chance of rival cluster centres to zero. This rival penalized mechanism is briefly described in the next section.

The minimization of the information uncertainty J_I allocates the proper number of clusters to the data points, while the minimization of J_{MSE} makes clustering of the input data possible. The values of both functions J_{MSE} and J_I over nine values of the cluster number (k), for a dataset of cardinality 800 drawn from four Gaussian distributions, are shown in Fig. 3. The nodes on the curves in Fig. 3 denote the values of the cost-functions J_I and J_{MSE} and their sum J for the various cluster numbers (k). The global minimum of the sum of both functions corresponds to the number of actual clusters (k = k′).
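The way J = J_I + J_{MSE} selects the cluster number can be illustrated with a short sketch that evaluates the two terms of Eq. (5) for a range of k, in the spirit of Fig. 3. It is only an illustration under the definitions above, with p(C_i) estimated as n_i/N; the kmeans helper from the earlier sketch and the choice of E are assumptions.

```python
import numpy as np

def cost_J(X, centres, labels, E):
    """Evaluate J = J_I + J_MSE (Eqs. (1), (4), (5)) for a given partition."""
    N = len(X)
    j_mse, j_i = 0.0, 0.0
    for i in range(len(centres)):
        members = X[labels == i]
        n_i = len(members)
        if n_i == 0:
            continue                       # empty clusters contribute nothing
        j_mse += ((members - centres[i]) ** 2).sum()
        p_i = n_i / N                      # p(C_i) estimated by relative frequency
        j_i += -E * n_i * np.log2(p_i)     # information-uncertainty term of Eq. (4)
    return j_i + j_mse

# Sketch of Fig. 3: run k-means for k = 1..9 and keep the k with the smallest J.
# X = ...   # (N, d) data array; E chosen from the range of the coordinates (Eq. (9))
# best_k = min(range(1, 10), key=lambda k: cost_J(X, *kmeans(X, k), E))
```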

    3. The rival penalized mechanism

This section analyzes the rival penalized mechanism of the proposed metric in Eq. (6). The data assignment, based on the data metric, to the winner's cluster centre reduces J_{MSE} and drives a group of cluster centres to converge onto the centres of actual clusters. The winner's centre is modified to also contain the input data point x_t. The second term in the data metric is automatically decreased for the rival centres. We show that such a penalization of rival cluster centres can reduce the winning chance of rival cluster centres to zero.

We consider a simple example of one Gaussian distribution forming one cluster. We set the input number of clusters to 2. The number of input data points is 200, the mean vector is (190, 90) and the standard variance is (0.5, 0.2). At the beginning (t = 0), the data is divided into two clusters with two cluster centres, as shown in Fig. 4a, where each cluster centre is indicated by a rectangle. We denote them as c_0 and c_1; t represents the number of iterations in which the data has been repeatedly scanned. The data metric (Eq. (2)) divides the cluster into two regions by a virtual separating line, as shown in Fig. 4a. Data points on the line are at the same distance from both cluster centres. In the next iteration, they are assigned to the cluster with more elements, which makes the second part of the proposed metric smaller (Eq. (7)). We suppose that the first cluster has fewer elements than the second, n_0 < n_1. During data scanning, if the centre c_1 of the second cluster with more elements wins when adapting to the input data point x_t, then it moves towards the first cluster centre c_0 and, consequently, the separating line is moved towards the left, as shown in Fig. 4b. Region 1 of the first cluster becomes smaller while region 2 of the second cluster expands towards the left. The same repeats throughout the next iterations for points that are near or on the separating line, until c_1 gradually converges to the actual cluster centre through minimizing the data metric d_m (Eq. (6)) and the centre c_0 moves towards the cluster's boundary. The first (rival) cluster has fewer and fewer elements, until the number of elements decreases to 0 and its competition chance reaches zero. From Eq. (7) we see that the data metric d_m then becomes infinite. The cluster centre c_0 becomes dead, without a chance to win again. When a cluster centre c_i is far away from the input data, then it is on one side of the input data and cannot be the winner for any new sample. The change of the cluster centre Δc_i is directed towards the outside of the sample data. If every cluster centre moved away from the sample dataset, then the J_{MSE} cost-function would become greater and greater. This contradicts the fact that the algorithm decreases the function J_{MSE}, and proves that some centres exist within the sample data.

Fig. 4. The clustering process of one Gaussian distribution with an input parameter (number of clusters) k = 2 after: (a) 10 iterations, (b) 15 iterations and (c) 20 iterations.

The analysis of multiple clusters is more complicated, because of interactive effects among clusters. In Section 5 various datasets are tested to demonstrate the convergence behaviour of the data metric, which automatically penalizes the winning chance of all rival cluster centres in subsequent iterations while the winning cluster centres move toward the actual cluster centres.
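A small numeric illustration of the penalization analyzed above: as a rival cluster loses members, p(C_i) shrinks and the -E·log_2 p(C_i) part of Eq. (6) grows without bound, so the rival's metric eventually exceeds every competitor's and, for an empty cluster, becomes infinite (Eq. (7)). The values of E and N below are arbitrary, chosen only for the illustration.

```python
import math

E, N = 100.0, 200                         # illustrative scale constant and dataset size
for n_i in (100, 50, 10, 1):
    penalty = -E * math.log2(n_i / N)     # second term of the data metric, Eq. (6)
    print(f"n_i = {n_i:3d}  ->  -E*log2(p) = {penalty:8.1f}")
# n_i = 0 would give p(C_i) = 0 and an infinite metric: the centre is "dead".
```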

4. k′-Means algorithm

It is clear from Section 3 that the proposed metric automatically penalizes all rival cluster centres in the competition to get a new point into the cluster. We propose a k′-means algorithm that minimizes the proposed cost-function and data metric. It has two phases. In the first phase we allocate k cluster centres in such a way that in each cluster there are one or more cluster centres. We suppose the input number of cluster centres k is greater than the real number of clusters k′. In the second phase, all rival cluster centres in the same cluster are pushed out of the cluster, so that each of them ends up representing a cluster with no elements. The detailed k′-means algorithm, consisting of two completely separated phases, is suggested as follows.

For the first phase we use the k-means algorithm as initial clustering to allocate k cluster centres so that each actual cluster has at least one or more centres. We suppose that the input parameter, the number of clusters, is greater than the actual number of clusters that the data forms: k > k′.

Step 1: Randomly initialize the k cluster centres in the input dataset.

Step 2: Randomly pick a data point x_t from the input dataset and, for j = 1, 2, ..., k, calculate the cluster membership function I(x_t, j) by Eq. (2). Every point is assigned to the cluster whose centroid is closest to that point.
Step 3: For all k cluster centres, set c_i to be the centre of mass of all points in cluster C_i:

c_i = \frac{1}{|C_i|} \sum_{x_t \in C_i} x_t    (8)

Steps 2 and 3 are repeatedly implemented until all cluster centres remain unchanged, or until they change less than some threshold value. The stopping threshold value is usually selected to be very small. The other way to stop the algorithm is to limit the number of iterations by a certain threshold value. At the end of the first phase of the algorithm, each cluster has at least one centre.

In the first phase, we do not include the extended cluster membership function described by Eq. (7), because the first phase aims to allocate the initial seed-points into some desired regions, rather than making a precise cluster number estimation. This is achieved by the second phase, which repeats the following two steps until all cluster centres converge.

Step 1: Randomly pick a data point x_t from the input dataset and, for all k clusters j = 1, 2, ..., k, calculate the cluster membership function I(x_t, j) by Eq. (7). Every point is assigned to the cluster whose centroid is closest to that point, as defined by the cluster membership function I(x_t, j).
Step 2: For all k cluster centres, set c_i to be the centre of mass of all points in cluster C_i (Eq. (8)).

Steps 1 and 2 are repeatedly implemented until all cluster centres remain unchanged for all input data points, or until they change less than some threshold value. At the end, k′ clusters are discovered, where k′ is the number of actual clusters. The initial seed-points (cluster centres) will converge towards the centroids of the input data clusters. All extra seed-points, the difference between k and k′, will be driven away from the dataset.

The number of recognized clusters k′ is implicitly defined by the parameter E (Eq. (6)). E is just a choice of measurement units. E should be from the range of the point coordinates. The magnitude of the coordinates does not matter, because we only care about point distances. However, it has been shown by experiments that a wide interval exists for E within which a consistent number of actual clusters is discovered in the sample dataset. The heuristic for the parameter E is given in Eq. (9):

E \in [a, 3a], \quad a = (\mathrm{average}(r) + \mathrm{average}(d)) / 2    (9)

where r is the average radius of the clusters after the first phase of the algorithm and d is the smallest distance between two cluster centres greater than 3r. For stronger clustering, one can double the parameter E. If E is smaller than suggested, the algorithm cannot push the redundant cluster centres away from the input regions. On the other hand, if E is too large, the algorithm pushes almost all cluster centres away from the input data.
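Putting the two phases together, a compact sketch of the whole procedure might look as follows. It reuses the kmeans helper from the earlier sketch, estimates p(C_i) by relative frequencies, and uses a simple stand-in for the E heuristic of Eq. (9); all of these concrete choices (function names, the default E, the convergence tolerance) are assumptions for illustration rather than the author's implementation.

```python
import numpy as np

def k_prime_means(X, k, E=None, max_iter=200, tol=1e-6):
    """Two-phase k'-means sketch.  Phase 1: plain k-means with k chosen larger
    than the expected number of clusters.  Phase 2: reassignment with the data
    metric of Eq. (6) and centroid updates (Eq. (8)) until convergence."""
    N = len(X)
    centres, labels = kmeans(X, k)        # phase-1 helper sketched in Section 1
    if E is None:
        # Crude stand-in for the heuristic of Eq. (9): the mean within-cluster
        # radius after phase 1 (an assumption, not the paper's exact rule).
        E = np.mean([np.linalg.norm(X[labels == i] - centres[i], axis=1).mean()
                     for i in range(k) if np.any(labels == i)])
    for _ in range(max_iter):
        counts = np.bincount(labels, minlength=k)
        p = counts / N
        # Second term of Eq. (6); empty clusters get an infinite penalty,
        # so by Eq. (7) they can never win again.
        penalty = np.full(k, np.inf)
        penalty[counts > 0] = -E * np.log2(p[counts > 0])
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = (d2 + penalty).argmin(axis=1)
        new_centres = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centres[i]
            for i in range(k)
        ])
        if np.linalg.norm(new_centres - centres) < tol:
            centres = new_centres
            break
        centres = new_centres
    kept = np.unique(labels)              # clusters that still own data points
    return centres[kept], np.searchsorted(kept, labels)
```

With a larger E the penalty term dominates sooner and more centres end up owning no points, mirroring the behaviour described after Eq. (9); the surviving clusters play the role of the k′ actual clusters.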

    5. Experimental results

Three simulated experiments were carried out to demonstrate the performance of the k′-means algorithm. This algorithm has also been applied to the clustering of a real dataset. The stopping threshold value was set to 10^{-6}.

    5.1. Experiment 1

Experiment 1 used 470 points from a mixture of four Gaussian distributions. The detailed parameters of the input dataset are given in Table 1, where N_i, c_i, σ_i and a_i denote the number of samples, the mean vector, the standard variance, and the mixing proportion, respectively.

The input number of clusters k was set to 10. Fig. 5a shows all 10 clusters and centres after the first phase of the algorithm. Each cluster has at least one seed point. After the second phase, only four seed-points denoted cluster centres. As shown in Fig. 5b, the data forms four well-separated clusters. The parameters of the four well-recognized clusters are given in Table 2.

Table 1
Parameters of dataset 1, where the number of samples N = 470

Cluster number i   N_i   c_i          σ_i          a_i
1                  100   (0.5, 0.5)   (0.1, 0.1)   0.213
2                  50    (1, 1)       (0.1, 0.1)   0.106
3                  160   (1.5, 1.5)   (0.2, 0.1)   0.25
4                  160   (1.4, 2.3)   (0.4, 0.2)   0.34

Fig. 5. (a) Clusters discovered for k = 10 by the k-means algorithm and (b) by the suggested k′-means algorithm.

  • 5.2. Experiment 2

In Experiment 2, 800 data points were used, also from a mixture of four Gaussians. Three sets of data, S1, S2 and S3, were generated with different degrees of overlap among the clusters. The sets had different variances of the Gaussian distributions, and the number of data points per cluster was controlled by the mixing proportions a_i. The detailed parameters for these datasets are given in Table 3.

In sets S1 and S2, the data has a symmetric structure and each cluster has the same number of elements. For such datasets, when the clusters are separated to a certain degree, the algorithm usually converges correctly.

It can be observed from Fig. 6 that all three datasets resulted in correct convergence. The input number of cluster centres was set to 7. Four cluster centres were located around the centres of the four actual clusters, while the remaining three cluster centres were driven far away from the data. The results show that this algorithm can also discover clusters that are not well separated, as in dataset S3.

    5.3. Experiment 3

The k′-means method was compared to previous model selection criteria and Gaussian mixture estimation methods: MDL, AIC, BIC, MML. The comparison presented by Oliver et al. (1996) was used. The same mixture of three Gaussian components, with the mean of the first component being (0, 0), the second (2, √12), and the third (4, 0), was used. As the dataset in this experiment, 100 data points were generated from this distribution. The results of our method are given in Table 4 for four values of the standard deviation.

Table 2
The four discovered clusters in Experiment 1

Cluster number i   N_i   c_i
1                  100   (0.496, 0.501)
2                  50    (0.993, 0.985)
3                  167   (1.483, 1.51)
4                  155   (1.356, 2.303)

Table 3
Parameters of the three datasets for Experiment 2

Dataset number   Cluster number (i)   N_i   c_i      σ_i          a_i
S1               1                    200   (1, 2)   (0.2, 0.2)   0.25
                 2                    200   (2, 1)   (0.2, 0.2)   0.25
                 3                    200   (3, 2)   (0.2, 0.2)   0.25
                 4                    200   (2, 3)   (0.2, 0.2)   0.25
S2               1                    200   (1, 2)   (0.4, 0.4)   0.250
                 2                    200   (2, 1)   (0.4, 0.4)   0.250
                 3                    200   (3, 2)   (0.4, 0.4)   0.250
                 4                    200   (2, 3)   (0.4, 0.4)   0.250
S3               1                    400   (1, 2)   (0.4, 0.4)   0.364
                 2                    400   (2, 1)   (0.4, 0.4)   0.364
                 3                    150   (3, 2)   (0.4, 0.4)   0.136
                 4                    150   (2, 3)   (0.4, 0.4)   0.136

Table 4
Predicted number of components for different standard deviations

k          σx = σy = 0.67   σx = σy = 1   σx = σy = 1.2   σx = σy = 1.33
1          0                3             14              45
2          0                0             0               0
3 (True)   99               97            86              55
4          1                0             0               0
5          0                0             0               0

Fig. 6. Three sets of input data used in Experiment 2 and the clusters discovered by the proposed k′-means algorithm.

The counts (e.g., 99 in the first block) indicate the number of times that the actual number of clusters (k = 3) was confirmed in 100 experiments repeated using different cluster centre initializations. The initial number of clusters had been set to 5.

If we compare the obtained results with the MML, AIC, PC, MDL and ICOMP criteria, as presented by Oliver et al. (1996) for the three-component distribution, then the k′-means algorithm gives considerably better results. The k′-means method confirms the actual (true) number of clusters more frequently than the other criteria over 100 experiments repeated using different initializations. When an incorrect number of clusters was obtained, k′-means predicted fewer clusters, but the AIC, PC, MDL and ICOMP criteria often predicted more clusters.

    5.4. Experiment 4 with real dataset

The k′-means algorithm was also applied to a real dataset. Clustering of the wine dataset (Blake and Merz, 1998) was performed, which is a typical real dataset for testing clustering (http://mlearn.ics.uci.edu/databases/wine/). The dataset consists of 178 samples of three types of wine. These data are the results of a chemical analysis of wines grown in the same region but derived from three different cultivars. The analysis determined the quantities of 13 constituents. The correct number of elements in each cluster is: 48, 71, 59.

These wine data were first regularized into the interval [0, 300] and then the k′-means algorithm was applied to solve the unsupervised clustering problem of the wine data by setting k = 6.
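A usage sketch of this experiment might look as follows, loading the same wine data through scikit-learn for convenience, rescaling each feature into [0, 300] as described, and calling the k_prime_means sketch from Section 4. Interpreting "regularized" as per-feature min–max scaling is an assumption.

```python
import numpy as np
from sklearn.datasets import load_wine

X = load_wine().data                      # 178 samples, 13 constituents
# Rescale every feature into [0, 300], mirroring the regularization step above.
X = 300.0 * (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

centres, labels = k_prime_means(X, k=6)   # sketch from Section 4
print("clusters found:", len(centres))    # ideally 3 for this dataset
```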

The k′-means algorithm detected three classes in the wine dataset with a clustering accuracy of 97.75% (there were four errors), which is a rather good result for unsupervised learning methods. This is the same result as that obtained by the method of linear mixing kernels with an information minimization criterion (Roberts et al., 2000).


    5.5. Discussion and experimental results

As shown by the experiments, k′-means can allocate the correct number of clusters at or near the actual cluster centres. Experiment 3 showed that the k′-means algorithm is insensitive to the initial values of the cluster centres and leads to good results. We also found from Experiment 4, on a real dataset, that the algorithm also works well in high-dimensional space when the clusters are separated to the degree seen in Experiment 2. Simulation experiments also showed that, when the initial cluster centres are randomly selected from the input dataset, the dead-unit problem does not occur. The experiments also showed that, if two or more clusters seriously overlap, the algorithm regards them as one cluster, which leads to an incorrect result. When clusters are elliptical, or have some other forms, the algorithm can still detect the number of clusters, but the clustering is not as good. For the classification of elliptical clusters, the Mahalanobis distance gives better clustering than the Euclidean distance in the cost-function and data metric (Ma and Cao, 2006).

According to the analysis of the data metric and the simulation experiments, we claim that when the input parameter for the number of clusters, k, is not much larger than the actual number of clusters k′, the algorithm converges correctly. However, when k is much larger than k′, the number of discovered clusters is usually greater than k′.

From the following simulation results, we have demonstrated that a large valid range of k exists for each dataset. On each of the three datasets from Experiment 2 we ran the algorithm 100 times for values k > k′. We increased k from k′ and computed the percentage of valid results. The upper boundary of the valid range for k is the largest integer k at which the valid percentage is larger than or equal to a certain threshold value. We chose it to be 98%. The valid range for the first dataset S1 is 4–24, for the second it is 4–16 and for the third it is 4–9. The parameter E in the data metric has to be doubled for a greater number k.

6. Conclusions

A new clustering algorithm named k′-means has been presented, which performs correct clustering without predetermining the exact number of clusters k. It minimizes a cost-function defined as the sum of the mean-square-error and the information uncertainty. Its rival penalized mechanism has been shown. As the cost-function reduces to a global minimum, the algorithm separates k′ cluster centres (where k′ is the actual number of clusters) that converge towards the actual cluster centres. The other (k − k′) centres are moved far away from the dataset and never win the competition for any data sample. It has been demonstrated by experiments that this algorithm can efficiently determine the actual number of clusters in artificial and real datasets.

References

Ahalt, S.C., Krishnamurty, A.K., Chen, P., Melton, D.E., 1990. Competitive algorithms for vector quantization. Neural Networks 3, 277–291.

Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Proc. 2nd Internat. Symp. on Information Theory, pp. 267–281.

Bezdek, J., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.

Blake, C.L., Merz, C.J., 1998. UCI Repository for machine learning databases, Irvine, Dept. Inf. Comput. Sci., Univ. California [Online].

Bozdogan, H., 1987. Model selection and Akaike's information criterion: the general theory and its analytical extensions. Psychometrika 52, 345–370.

Cheung, Y.M., 2005. On rival penalization controlled competitive learning for clustering with automatic cluster number selection. IEEE Trans. Knowledge Data Eng. 17, 1583–1588.

Khan, S., Ahmad, A., 2004. Cluster centre initialization algorithm for k-means clustering. Pattern Recognition Lett. 25, 1293–1302.

Laszlo, M., Mukherjee, S., 2007. A genetic algorithm that exchanges neighbouring centres for k-means clustering. Pattern Recognition Lett. 28, 2359–2366.

Law, L.T., Cheung, Y.M., 2003. Colour image segmentation using rival penalized controlled competitive learning. In: Proc. 2003 Internat. Joint Conf. on Neural Networks (IJCNN2003), Portland, Oregon, USA, pp. 20–24.

Ma, J., Cao, B., 2006. The Mahalanobis distance based rival penalized competitive learning algorithm. Lect. Notes Comput. Sci. 3971, 442–447.

MacQueen, J.B., 1967. Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Math. Statist. Prob., vol. 1. University of California Press, Berkeley, pp. 281–297.

Oliver, J., Baxter, R., Wallace, C., 1996. Unsupervised learning using MML. In: Proc. 13th Internat. Conf. on Mach. Learn., pp. 364–372.

Redmond, S.J., Heneghan, C., 2007. A method for initializing the k-means clustering algorithm using kd-trees. Pattern Recognition Lett. 28, 965–973.

Roberts, S.J., Everson, R., Rezek, I., 2000. Maximum certainty data partitioning. Pattern Recognition 33, 833–839.

Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6, 461–464.

Steinley, D., Brusco, M.J., 2007. Initializing k-means batch clustering: A critical evaluation of several techniques. J. Classif. 24, 99–121.

Wallace, C., Dowe, D., 1999. Minimum message length and Kolmogorov complexity. Comput. J. 42, 270–283.

Xu, L., 1993. Rival penalized competitive learning for cluster analysis, RBF net and curve detection. IEEE Trans. Neural Networks 4, 636–648.

Xu, L., 1997. Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Lett. 18, 1167–1178.
