103 Selection of K -means clustering - Electrical Engineeringdpwe/papers/PhamDN05-kmeans.pdf · Selection of K in K -means clustering D T Pham , S S Dimov, and C D Nguyen Manufacturing

Selection of K in K-means clusteringD T Pham S S Dimov and C D Nguyen

Manufacturing Engineering Centre Cardiff University Cardiff UK

The manuscript was received on 26 May 2004 and was accepted after revision for publication on 27 September 2004

DOI 101243095440605X8298

Abstract The K-means algorithm is a popular data-clustering algorithm However one of itsdrawbacks is the requirement for the number of clusters K to be specified before the algorithmis applied This paper first reviews existing methods for selecting the number of clusters for thealgorithm Factors that affect this selection are then discussed and a new measure to assist theselection is proposed The paper concludes with an analysis of the results of using the proposedmeasure to determine the number of clusters for the K-means algorithm for different data sets

Keywords clustering K-means algorithm cluster number selection

1 INTRODUCTION

Data clustering is a data exploration technique thatallows objects with similar characteristics to begrouped together in order to facilitate their furtherprocessing Data clustering has many engineeringapplications including the identification of partfamilies for cellular manufacture

The K-means algorithm is a popular data-clustering algorithm To use it requires the numberof clusters in the data to be pre-specified Findingthe appropriate number of clusters for a given dataset is generally a trial-and-error process made moredifficult by the subjective nature of deciding whatconstitutes lsquocorrectrsquo clustering [1]

This paper proposes a method based on infor-mation obtained during the K-means clusteringoperation itself to select the number of clusters KThe method employs an objective evaluationmeasure to suggest suitable values for K thusavoiding the need for trial and error

The remainder of the paper consists of five sectionsSection 2 reviews the main known methods forselecting K Section 3 analyses the factors influ-encing the selection of K Section 4 describes theproposed evaluation measure Section 5 presentsthe results of applying the proposed measure toselect K for different data sets Section 6 concludesthe paper

2 SELECTION OF THE NUMBER OF CLUSTERSAND CLUSTERING VALIDITY ASSESSMENT

This section reviews existing methods for selectingK for the K-means algorithm and the correspondingclustering validation techniques

21 Values of K specified within a range or set

The performance of a clustering algorithm may beaffected by the chosen value of K Therefore insteadof using a single predefined K a set of values mightbe adopted It is important for the number ofvalues considered to be reasonably large to reflectthe specific characteristics of the data sets At thesame time the selected values have to be signifi-cantly smaller than the number of objects in thedata sets which is the main motivation for perform-ing data clustering

Reported studies [2ndash18] on K-means clustering andits applications usually do not contain any expla-nation or justification for selecting particular valuesfor K Table 1 lists the numbers of clusters and objectsand the corresponding data sets used in those studiesTwo observations could be made when analysingthe data in the table First a number of researchers[5ndash7 9] used only one or two values for K Secondseveral other researchers [1 3 11 13 16] utilizedrelatively large K values compared with the numberof objects These two actions contravene the above-mentioned guidelines for selecting K Therefore theclustering results do not always correctly representthe performance of the tested algorithms

Corresponding author Manufacturing Engineering Centre

Cardiff University Cardiff CF24 OYF UK

103

C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science

In general the performance of any new versionof the K-means algorithm could be verified by com-paring it with its predecessors on the same criteriaIn particular the sum of cluster distortions isusually employed as such a performance indicator[3 6 13 16 18] Thus the comparison is consideredfair because the same model and criterion are usedfor the performance analysis

22 Values of K specified by the user

The K-means algorithm implementation in manydata-mining or data analysis software packages[19ndash22] requires the number of clusters to be speci-fied by the user To find a satisfactory clusteringresult usually a number of iterations are neededwhere the user executes the algorithm with differentvalues of K The validity of the clustering result isassessed only visually without applying any formalperformance measures With this approach it isdifficult for users to evaluate the clustering resultfor multi-dimensional data sets

23 Values of K determined in a laterprocessing step

When K-means clustering is used as a pre-processingtool the number of clusters is determined by thespecific requirements of the main processingalgorithm [13] No attention is paid to the effect ofthe clustering results on the performance of thisalgorithm In such applications the K-meansalgorithm is employed just as a lsquoblack boxrsquo withoutvalidation of the clustering result

Table 1 The number of clusters used in different studies

of the K-means algorithm

ReferenceNumbers ofclusters K

Number ofobjects N

MaximumKNratio ()

[2] 32 64 128 256512 1024

8 192 1250

32 64 128 256512 1024

29 000

256 2 048[3] 600 700 800

900 100010 000 1000

600 700 800900 1000

50 000

[4] 4 16 64100 128

100 000 013

4 16 64100 128

120 000

4 16 64100 128

256 000

[5] 4 564 0704 7204 1 0004 1 0084 1 0104 1 2024 2 0004 2 3244 3 0054 4 0004 6 2724 7 561

[6] 6 150 400[7] 10 2 310 043

25 12 902[8] 2 4 8 Not reported Not reported[9] 2 4 500 333

2 4 50 0002 4 100 00010 300

[10] 1 2 3 4 10 000 004[11] 10 20 30 40 50

60 70 8090 100

500 2000

[12] 100 10 000 20050 2 500

[13] 7 42 16661 2 3 4 5 6 7 120

[14] 2 3 4 5 6 7 8 910 11 12 13 14

250 560

[15] 8 20 50 64 256 10 000 256[16] 5000 50 000 5000

5000 100 0005000 200 0005000 300 0005000 433 208100 100 000250 200 0001000 100 0001000 200 0001000 300 0001000 433 20840 20 00010 20 30 40 50

60 70 8030 000

50 500 5000 10 00050 500 5000 50 00050 500 5000 100 00050 500 5000 200 00050 500 5000 300 00050 500 5000 433 208

(continued )

Table 1 Continued


Number ofobjects N

MaximumKNratio ()

[17] 250 80 000 1000250 90 000250 100 000250 110 000250 120 00050 100 400 4 00050 100 400 36 000250 80 000250 90 000250 100 000250 110 000250 120 00050 100 150 4 00050 100 150 36 00050 800 000500 800 000

[18] 3 4 150 6674 5 752 7 10 214

104 D T Pham S S Dimov and C D Nguyen

Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005

24 Values of K equated to thenumber of generators

Synthetic data sets which are used for testingalgorithms are often created by a set of normal oruniform distribution generators Then clusteringalgorithms are applied to those data sets with thenumber of clusters equated to the number of genera-tors It is assumed that any resultant cluster willcover all objects created by a particular generatorThus the clustering performance is judged on thebasis of the difference between objects covered bya cluster and those created by the correspondinggenerator Such a difference can be measured bysimply counting objects or calculating the infor-mation gain [7]

There are drawbacks with this method The firstdrawback concerns the stability of the clusteringresults when there are areas in the object spacethat contain objects created by different generatorsFigure 1a illustrates such a case The data setshown in this figure has two clusters A and Bwhich cover objects generated by generators GA

and G B respectively Object X is in an overlapping

area between clusters A and B X has probabilitiesPGA

and PGBof being created by GA and GB respect-

ively and probabilities PCAand PCB

of being includedinto clusters A and B respectively All four pro-babilities are larger than 0 Thus there is a chancefor X to be created by generator GA but coveredby cluster B and vice versa In such cases theclustering results will not be perfect The stability ofthe clustering results depends on these four proba-bilities With an increase in the overlapping areas inthe object space the stability of the clustering resultsdecreases

The difference between the characteristics of thegenerators also has an effect on the clustering resultsIn Fig 1b where the number of objects of cluster A isfive times larger than that of cluster B the smallercluster B might be regarded as noise and all objectsmight be grouped into one cluster Such a clusteringoutcome would differ from that obtained by visualinspection

Unfortunately this method of selecting K cannotbe applied to practical problems The data distri-bution in practical problems is unknown and alsothe number of generators cannot be specified

25 Values of K determined bystatistical measures

There are several statistical measures available forselecting K These measures are often applied in com-bination with probabilistic clustering approachesThey are calculated with certain assumptionsabout the underlying distribution of the data TheBayesian information criterion or Akeikersquos infor-mation criterion [14 17] is calculated on data setswhich are constructed by a set of Gaussian distri-butions The measures applied by Hardy [23] arebased on the assumption that the data set fits thePoisson distribution Monte Carlo techniqueswhich are associated with the null hypothesis areused for assessing the clustering results and also fordetermining the number of clusters [24 25]

There have been comparisons between probabilis-tic and partitioning clustering [7] Expectationndashmaximization (EM) is often recognized as a typicalmethod for probabilistic clustering SimilarlyK-means clustering is considered a typical methodfor partitioning clustering Although EM andK-means clustering share some common ideasthey are based on different hypotheses models andcriteria Probabilistic clustering methods do nottake into account the distortion inside a cluster sothat a cluster created by applying such methodsmay not correspond to a cluster in partitioning clus-tering and vice versa Therefore statistical measuresused in probabilistic methods are not applicable in

Fig 1 Effect of the relationship between clusters on

the clustering for two object spaces in which

(a) an area exists that contains objects created

by two different generators and (b) there are

no overlapping areas A objects generated by

GA D objects generated by GB

Selection of K in K-means clustering 105


the K-means algorithm In addition the assumptionsabout the underlying distribution cannot be verifiedon real data sets and therefore cannot be used toobtain statistical measures

26 Values of K equated to thenumber of classes

With this method the number of clusters is equatedto the number of classes in the data sets A data-clustering algorithm can be used as a classifier byapplying it to data sets from which the class attributeis omitted and then assessing the clustering resultsusing the omitted class information [26 27] The out-come of the assessment is fed back to the clusteringalgorithm to improve its performance In this waythe clustering can be considered to be supervised

With this method of determining the numberof clusters the assumption is made that the data-clustering method could form clusters each ofwhich would consist of only objects belonging to

one class Unfortunately most real problems donot satisfy this assumption

27 Values of K determined throughvisualization

Visual verification is applied widely because of itssimplicity and explanation possibilities Visualexamples are often used to illustrate the drawbacksof an algorithm or to present the expected clusteringresults [5 27]

The assessment of a clustering result usingvisualization techniques depends heavily on theirimplicit nature The clustering models utilized bysome clustering methods may not be appropriatefor particular data sets The data sets in Fig 2are illustrations of such cases The application ofvisualization techniques implies a data distributioncontinuity in the expected clusters If the K-meansapproach is applied to such data sets there is not

Fig 2 Data sets inappropriate for the K-means approach (a) data sets with four clusters [5]

(b) data sets with three clusters [23] (c) data sets with eight clusters [27] Note that

the number of clusters in each data set was specified by the respective researchers



any cluster that satisfies the K-means cluster-ing model and at the same time corresponds to aparticular object grouping in the illustrated datasets Therefore the K-means algorithm cannot pro-duce the expected clustering results This suggeststhat the K-means approach is unsuitable for suchdata sets

The characteristics of the data sets in Fig 2(position shape size and object distribution) areimplicitly defined This makes the validation of theclustering results difficult Any slight changes inthe data characteristics may lead to different out-comes The data set in Fig 2b is an illustration ofsuch a case Another example is the series of datasets in Fig 3 Although two clusters are easily identi-fiable in the data set in Fig 3a the numbers ofclusters in the data sets in Figs 3b and c depend onthe distance between the rings and the objectdensity of each ring Usually such parametersare not explicitly defined when a visual check iscarried out

In spite of the above-mentioned deficiencies visu-alization of the results is still a useful method ofselecting K and validating the clustering resultswhen the data sets do not violate the assumptionsof the clustering model In addition this methodis recommended in cases where the expected resultscould be identified explicitly

28 Values of K determined usinga neighbourhood measure

A neighbourhood measure could be added to thecost function of the K-means algorithm to determineK [26] Although this technique has showed promis-ing results for a few data sets it needs to prove itspotential in practical applications Because the costfunction has to be modified this technique cannotbe applied to the original K-means algorithm

3 FACTORS AFFECTING THE SELECTION OF K

A function f(K ) for evaluating the clustering resultcould be used to select the number of clusters Fac-tors that such a function should take into accountare discussed in this section

31 Approach bias

The evaluation function should be related closely tothe clustering criteria As mentioned previouslysuch a relation could prevent adverse effects on thevalidation process In particular in the K-meansalgorithm the criterion is the minimization of thedistortion of clusters so that the evaluation functionshould take this parameter into account

32 Level of detail

In general observers that could see relatively lowlevels of detail would obtain only an overview of anobject By increasing the level of detail they couldgain more information about the observed objectbut at the same time the amount of data that theyhave to process increases Because of resource limit-ations a high level of detail is normally used only toexamine parts of the object [28]

Such an approach could be applied in clustering Adata set with n objects could be grouped into anynumber of clusters between 1 and n which wouldcorrespond to the lowest and the highest levels ofdetail respectively By specifying different K valuesit is possible to assess the results of groupingobjects into various numbers of clusters Fromthis evaluation more than one K value could berecommended to users but the final selection ismade by them

33 Internal distribution versus global impact

Clustering is used to find irregularities in the datadistribution and to identify regions in which objectsare concentrated However not every region with ahigh concentration of objects is considered a clusterFor a region to be identified as a cluster it is import-ant to analyse not only its internal distribution butalso its interdependence with other object groupingsin the data set

In K-means clustering the distortion of a cluster isa function of the data population and the distancebetween objects and the cluster centre according to

Ij frac14XN j

tfrac141

frac12d(x jt wj)2 (1a)

where Ij is the distortion of cluster j wj is the centreof cluster j Nj is the number of objects belonging tocluster j xjt is the tth object belonging to cluster jand d(xjt wj) is the distance between object xjt andthe centre wj of cluster j

Each cluster is represented by its distortionand its impact on the entire data set is assessed by

Fig 3 Variations in the two-ring data set



its contribution to the sum of all distortions SKgiven by

SK frac14XK

jfrac141

Ij (1b)

where K is the specified number of clustersThus such information is important in assessing

whether a particular region in the object spacecould be considered a cluster

34 Constraints on f (K )

The robustness of f(K ) is very important Becausethis function is based on the result of the clusteringalgorithm it is important for this result to vary aslittle as possible when K remains unchanged How-ever one of the main deficiencies of the K-meansapproach is its dependence on randomness Thusthe algorithm should yield consistent results sothat its performance can be used as a variablein the evaluation function A new version of the K-means algorithm namely the incremental K-meansalgorithm [29] satisfies this requirement and canbe adopted for this purpose

The role of f(K ) is to reveal trends in the datadistribution and therefore it is important to keep itindependent of the number of objects The numberof clusters K is assumed to be much smaller thanthe number of objects N When K increases f(K )should converge to some constant value Then iffor any intermediate K f(K ) exhibits a special beha-viour such as a minimum or maximum point thatvalue of K could be taken as the desired number ofclusters

4 NUMBER OF CLUSTERS FORK-MEANS CLUSTERING

As mentioned in section 33 cluster analysis isused to find irregularities in the data distributionWhen the data distribution is uniform there is notany irregularity Therefore data sets with uniformdistribution could be used to calibrate and verifythe clustering result This approach was applied byTibshirani et al [30] A data set of the same dimen-sion as the actual data set and with a uniform distri-bution was generated The clustering performanceon this artificial data set was then compared withthe result obtained for the actual data set A measureknown as the lsquogaprsquo statistic [30] was employed toassess performance In this work instead of generat-ing an artificial data set the clustering performancefor the artificial data set was estimated Also insteadof the gap statistic a new and more discriminatory

measure was employed for evaluating the clusteringresult

When the K-means algorithm is applied to datawith a uniform distribution and K is increased by 1the clusters are likely to change and in the new pos-itions the partitions will again be approximatelyequal in size and their distortions similar to oneanother The evaluations carried out in reference[29] showed that when a new cluster is insertedinto a cluster (K frac14 1) with a hypercuboid shape anda uniform distribution the decrease in the sum ofdistortions is proportional to the original sum of dis-tortions This conclusion was found to be correct forclustering results obtained with relatively smallvalues of K In such cases the sum of distortionsafter the increase in the number of clusters couldbe estimated from the current value

The evaluation function f(K ) is defined using theequations

f (K ) frac14

1 if K frac14 1SK

aK SK1if SK1 = 0 8K 1

1 if SK1 frac14 0 8K 1

8gtlt

gt(2)

aK frac14

13

4Nd

if K frac14 2 and Nd 1

(3a)

aK1 thorn1 aK1

6if K 2 and Nd 1

(3b)

8gtgtgtgtgtlt

gtgtgtgtgt

where SK is the sum of the cluster distortions whenthe number of clusters is K Nd is the number ofdata set attributes (ie the number of dimensions)and aK is a weight factor The term aKSK21 inequation (2) is an estimate of SK based on SK21

made with the assumption that the data have a uni-form distribution The value of f(K ) is the ratio of thereal distortion to the estimated distortion and isclose to 1 when the data distribution is uniformWhen there are areas of concentration in the datadistribution SK will be less than the estimatedvalue so that f(K ) decreases The smaller that f(K )is the more concentrated is the data distributionThus values of K that yield small f(K ) can beregarded as giving well-defined clusters

The weight factor aK defined in equation (3) isa positive number less than or equal to 1 and isapplied to reduce the effect of dimensions WithK frac14 2 aK is computed using equation (3a) This equa-tion is derived from equation (7) in reference [29]which shows that the decrease in distortion is inver-sely proportional to the number of dimensions Nd

As K increases above 2 the decrease in the sum ofdistortions reduces (the ratio SKSK21 approaches 1)as can be seen in Fig 4 This figure shows the values



of SKSK21 computed for different K when the clus-tering algorithm is applied to data sets of differentdimensions and with uniform distributions Withsuch data sets f(K ) is expected to be equal to 1 andaK should be chosen to equate f(K ) to 1 Fromequation (2) aK should therefore be SKSK21 andthus obtainable from Fig 4 However for compu-tational simplicity the recursion equation (3b) hasbeen derived from the data represented in Fig 4 tocalculate aK Figure 5 shows that the values of aK

obtained from equation (3b) fit the plots in Fig 4closely

The proposed function f(K ) satisfies the con-straints mentioned in the previous section Therobustness of f(K ) will be verified experimentally inthe next section When the number of objects isdoubled or tripled but their distributions areunchanged the resultant clusters remain in thesame position SK and SK21 are doubled or tripledcorrespondingly so that f(K ) stays constant There-fore generally f(K ) is independent of the numberof objects in the data set

To reduce the effect of differences in the rangesof the attributes data are normalized before theclustering starts However it should be noted that

when the data have well-separated groups of objectsthe shape of such regions in the problem space hasan effect on the evaluation function In these casesthe normalization does not influence the localobject distribution because it is a scaling techniquethat applies to the whole data set

5 PERFORMANCE

The evaluation function f(K ) is tested in a series ofexperiments on the artificially generated data setsshown in Fig 6 All data are normalized before theincremental K-means algorithm is applied with Kranging from 1 to 19 f(K ) is calculated on the basisof the total distortion of the clusters

In Figs 6andashc all objects belong to a single regionwith a uniform distribution The graph in Fig 6ashows that f(K ) reflects well the clustering result onthis data set with a uniform distribution becausef(K ) is approximately constant and equal to 1 forall K When K frac14 4 and K frac14 3 in Figs 6a and b respect-ively f(K ) reaches minimum values This could beattributed to the shape of the areas defined by theobjects belonging to these data sets However theminimum values of f(K ) do not differ significantly

Fig 4 The ratio SKSK21 for data sets having uniform

distributions (a) two-dimensional lsquosquarersquo and

lsquocirclersquo (b) four-dimensional lsquocubersquo and lsquospherersquo

Fig 5 Comparison of the values of aK calculated using

equation (3b) and the ratio SKSK21



Fig 6 Data sets and their corresponding f(K )



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



from the average value for any strong recommen-dations to be made to the user By comparing thevalues of f(K ) in Figs 6a and c it can be seen thataK reduces the effect of the data set dimensions onthe evaluation function

For the data set in Fig 6d again all objects areconcentrated in a single region with a normal

distribution The f(K ) plot for this data set suggestscorrectly that when K frac14 1 the clustering result isthe most suitable for this data set

The data sets in Figs 6e and f are created bytwo generators that have normal distributions InFig 6e the two generators have an overlappingregion but in Fig 6f they are well separated Note

Fig 7 f(K) for the 12 benchmark data sets



that the value for f(2) in the latter figure is muchsmaller than in the former

The data sets in Figs 6g and h have three recog-nizable regions From the corresponding graphsf(K ) suggests correct values of K for clusteringthese data sets

Three different generators that create objectgroupings with a normal distribution are used toform the data set in Fig 6i In this case f(K ) suggeststhe value 2 or 3 for K Because two of these threegenerators create object groupings that overlapf(2) is smaller than f(3) This means that the data

Fig 7 Continued



have only two clearly defined regions but K frac14 3could also be used to cluster the objects

Figures 6j and k illustrate how the level of detailcould affect the selection of K f(K ) reaches mini-mum values at K frac14 2 and 4 respectively In suchcases users could select the most appropriate valueof K based on their specific requirements A morecomplex case is shown in Fig 6l where there is apossible K value of 4 or 8 The selection of a parti-cular K will depend on the requirements of thespecific application for which the clustering iscarried out

The data sets in Figs 6mndasho have well-definedregions in the object space each of which has adifferent distribution location and number ofobjects If the minimum value of f(K ) is used tocluster the objects K will be different from thenumber of generators utilized to create them (as inthe case of the clusters in Fig 6o or the number ofobject groupings that could be identified visually(as in the case of the clusters in Figs 6m and n)The reason for the difference varies with differentcases For example it could be considered thatthere are five clusters in Fig 6m because the clusterdistances are smaller for the two leftmost pairs ofclusters than for others and the clusters in thosepairs could be merged together However nosimple explanation could be given for the casesshown in Figs 6n and o This highlights the factthat f(K ) should only be used to suggest a guidevalue for the number of clusters and the finaldecision as to which value to adopt has to be left atthe discretion of the user

From the graphs in Fig 6 a conclusion could bemade that any K with corresponding f(K ) 085could be recommended for clustering If there isnot a value with corresponding f(K ) 085 K frac14 1 isselected

The proposed function f(K ) is also applied to 12benchmarking data sets from the UCI RepositoryMachine Learning Databases [31] Figure 7 showshow the value of f(K ) varies with K If a threshold of085 is selected for f(K ) (from the study on theartificial data sets) the numbers of clusters recom-mended for each of these data sets are given as inTable 2 Kfrac14 1 means that the data distribution isvery close to the standard uniform distribution Thevalues recommended using f(K ) are very smallbecause of the high correlation between the attributesof these data sets very similar to that shown in Fig 6eThis can be verified by examining two attributes ata time and plotting the data sets in two dimensions

The above experimental study on 15 artificialand 12 benchmark data sets has demonstratedthe robustness of f(K ) The evaluation functionconverges in most cases to 1 when K increasesabove 9

6 CONCLUSION

Existing methods of selecting the number of clustersfor K-means clustering have a number of drawbacksAlso current methods for assessing the clusteringresults do not provide much information on theperformance of the clustering algorithm

A new method to select the number of clustersfor the K-means algorithm has been proposed inthe paper The new method is closely related to theapproach of K-means clustering because it takesinto account information reflecting the performanceof the algorithm The proposed method can suggestmultiple values of K to users for cases when differentclustering results could be obtained with variousrequired levels of detail The method could be com-putationally expensive if used with large data setsbecause it requires several applications of theK-means algorithm before it can suggest a guidevalue for K The method has been validated on15 artificial and 12 benchmark data sets Furtherresearch is required to verify the capability of thismethod when applied to data sets with morecomplex object distributions

ACKNOWLEDGEMENTS

This work was carried out as part of the CardiffInnovative Manufacturing Research Centre Projectsupported by the Engineering and Physical SciencesResearch Council and the SUPERMAN Project sup-ported by the European Commission and the WelshAssembly Government under the European RegionalDevelopment Fund programme The authors aremembers of the IPROMS Network of Excellencefunded by the European Commission

Table 2 The recommended number of

clusters based on f(K)

Data setsProposed numberof clusters

Australian 1Balance-scale 1Car evaluation 2 3 4Cmc 1Ionosphere 2Iris 2 3Page blocks 2Pima 1Wdbc 2Wine 3Yeast 1Zoo 2



REFERENCES

1 Han J and Kamber M Data Mining Concepts andTechniques 2000 (Morgan Kaufmann San FranciscoCalifornia)

2 Al-Daoud M B Venkateswarlu N B andRoberts S A Fast K-means clustering algorithmsReport 9518 School of Computer Studies Universityof Leeds June 1995

3 Al-Daoud M B Venkateswarlu N B andRoberts S A New methods for the initialisation ofclusters Pattern Recognition Lett 1996 17 451ndash455

4 Alsabti K Ranka S and Singh V An efficientK-means clustering algorithm In Proceedings of theFirst Workshop on High-Performance Data MiningOrlando Florida 1998 ftpftpciseufledupubfacultyrankaProceedings

5 Bilmes J Vahdat A Hsu W and Im E J Empiricalobservations of probabilistic heuristics for theclustering problem Technical Report TR-97-018International Computer Science Institute BerkeleyCalifornia

6 Bottou L and Bengio Y Convergence properties of theK-means algorithm Adv Neural Infn Processing Systems1995 7 585ndash592

7 Bradley S and Fayyad U M Refining initialpoints for K-means clustering In Proceedings ofthe Fifteenth International Conference on MachineLearning (ICML lsquo98) (Ed J Shavlik) MadisonWisconsin 1998 pp 91ndash99 (Morgan Kaufmann SanFrancisco California)

8 Du Q and Wong T-W Numerical studies ofMacQueenrsquos K-means algorithm for computing the cen-troidal Voronoi tessellations Int J Computers MathApplics 2002 44 511ndash523

9 Castro V E and Yang J A fast and robust generalpurpose clustering algorithm In Proceedings of theFourth European Workshop on Principles of KnowledgeDiscovery in Databases and Data Mining (PKDD 00)Lyon France 2000 pp 208ndash218

10 Castro V E Why so many clustering algorithmsSIGKDD Explorations Newsletter of the ACM SpecialInterest Group on Knowledge Discovery and DataMining 2002 4(1) 65ndash75

11 Fritzke B The LBG-U method for vector quantiza-tion ndash an improvement over LBG inspired fromneural networks Neural Processing Lett 1997 5(1)35ndash45

12 Hamerly G and Elkan C Alternatives to the K-meansalgorithm that find better clusterings In Proceedings ofthe 11th International Conference on Information andKnowledge Management (CIKM 02) McLean Virginia2002 pp 600ndash607

13 Hansen L K and Larsen J Unsupervised learningand generalisation In Proceedings of the IEEEInternational Conference on Neural NetworksWashington DC June 1996 pp 25ndash30 (IEEENew York)

14 Ishioka T Extended K-means with an efficientestimation of the number of clusters In Proceedingsof the Second International Conference on Intelligent

Data Engineering and Automated Learning (IDEAL2000) Hong Kong PR China December 2000pp 17ndash22

15 Kanungo T Mount D M Netanyahu N Piatko CSilverman R and Wu A The efficient K-means clus-tering algorithm analysis and implementation IEEETrans Pattern Analysis Mach Intell 2002 24(7)881ndash892

16 Pelleg D and Moore A Accelerating exact K-meansalgorithms with geometric reasoning In Proceedingsof the Conference on Knowledge Discovery inDatabases (KDD 99) San Diego California 1999pp 277ndash281

17 Pelleg D and Moore A X-means extending K-meanswith efficient estimation of the number of clusters InProceedings of the 17th International Conference onMachine Learning (ICML 2000) Stanford California2000 727ndash734

18 Pena J M Lazano J A and Larranaga P An empiri-cal comparison of four initialisation methods for theK-means algorithm Pattern Recognition Lett 199920 1027ndash1040

19 SPSS Clementine Data Mining System User Guide Ver-sion 5 1998 (Integral Solutions Limited BasingstokeHampshire)

20 DataEngine 30 ndash Intelligent Data Analysis ndash an EasyJob Management Intelligenter Technologien GmbHGermany 1998 httpwwwmitgmbhde

21 Kerr A Hall H K and Kozub S Doing Statistics withSPSS 2002 (Sage London)

22 S-PLUS 6 for Windows Guide to Statistics Vol 2Insightful Corporation Seattle Washington 2001httpwwwinsightfulcomDocumentsLive2344statman2pdf

23 Hardy A On the number of clusters Comput StatistData Analysis 1996 23 83ndash96

24 Theodoridis S and Koutroubas K Pattern Recog-nition 1998 (Academic Press London)

25 Halkidi M Batistakis Y and Vazirgiannis MCluster validity methods Part I SIGMOD Record2002 31(2) available online httpwwwacmorgsigmodrecord

26 Kothari R and Pitts D On finding the number ofclusters Pattern Recognition Lett 1999 20 405ndash416

27 Cai Z Technical aspects of data mining PhD thesisCardiff University Cardiff 2001

28 Lindeberg T Scale-space Theory in Computer Vision1994 (Kluwer Academic Boston Massachusetts)

29 Pham D T Dimov S S and Nguyen C DIncremental K-means algorithm Proc Instn MechEngrs Part C J Mechanical Engineering Science 2003218 783ndash795

30 Tibshirani R Walther G and Hastie T Estimatingthe number of clusters in a dataset via the gap statisticTechnical Report 208 Department of StatisticsStanford University California 2000

31 Blake C Keogh E and Merz C J UCI Re-pository of Machine Learning Databases IrvineCalifornia Department of Information and Com-puter Science University of California IrvineCalifornia 1998



APPENDIX

Notation

A B clustersd(xjt wj) distance between object xjt and the

centre wj of cluster jf(K ) evaluation functionGA GB generatorsIj distortion of cluster jK number of clustersN number of objects in the data setNd number of data set attributes

(the dimension of the data set)

Nj number of objects belonging tocluster j

PGA PGB

probabilities that X is created by GA orGB respectively

PCA PCB

probabilities that X is clustered into A orB respectively

SK sum of all distortions with K being thespecified number of clusters

X objectxjt object belonging to cluster jwj centre of cluster j

aK weight factor



In general the performance of any new versionof the K-means algorithm could be verified by com-paring it with its predecessors on the same criteriaIn particular the sum of cluster distortions isusually employed as such a performance indicator[3 6 13 16 18] Thus the comparison is consideredfair because the same model and criterion are usedfor the performance analysis

22 Values of K specified by the user

The K-means algorithm implementation in manydata-mining or data analysis software packages[19ndash22] requires the number of clusters to be speci-fied by the user To find a satisfactory clusteringresult usually a number of iterations are neededwhere the user executes the algorithm with differentvalues of K The validity of the clustering result isassessed only visually without applying any formalperformance measures With this approach it isdifficult for users to evaluate the clustering resultfor multi-dimensional data sets

23 Values of K determined in a laterprocessing step

When K-means clustering is used as a pre-processingtool the number of clusters is determined by thespecific requirements of the main processingalgorithm [13] No attention is paid to the effect ofthe clustering results on the performance of thisalgorithm In such applications the K-meansalgorithm is employed just as a lsquoblack boxrsquo withoutvalidation of the clustering result

Table 1 The number of clusters used in different studies

of the K-means algorithm


Number ofobjects N

MaximumKNratio ()

[2] 32 64 128 256512 1024

8 192 1250

32 64 128 256512 1024

29 000

256 2 048[3] 600 700 800

900 100010 000 1000

600 700 800900 1000

50 000

[4] 4 16 64100 128

100 000 013

4 16 64100 128

120 000

4 16 64100 128

256 000

[5] 4 564 0704 7204 1 0004 1 0084 1 0104 1 2024 2 0004 2 3244 3 0054 4 0004 6 2724 7 561

[6] 6 150 400[7] 10 2 310 043

25 12 902[8] 2 4 8 Not reported Not reported[9] 2 4 500 333

2 4 50 0002 4 100 00010 300

[10] 1 2 3 4 10 000 004[11] 10 20 30 40 50

60 70 8090 100

500 2000

[12] 100 10 000 20050 2 500

[13] 7 42 16661 2 3 4 5 6 7 120

[14] 2 3 4 5 6 7 8 910 11 12 13 14

250 560

[15] 8 20 50 64 256 10 000 256[16] 5000 50 000 5000

5000 100 0005000 200 0005000 300 0005000 433 208100 100 000250 200 0001000 100 0001000 200 0001000 300 0001000 433 20840 20 00010 20 30 40 50

60 70 8030 000

50 500 5000 10 00050 500 5000 50 00050 500 5000 100 00050 500 5000 200 00050 500 5000 300 00050 500 5000 433 208

(continued )

Table 1 Continued


Number ofobjects N

MaximumKNratio ()

[17] 250 80 000 1000250 90 000250 100 000250 110 000250 120 00050 100 400 4 00050 100 400 36 000250 80 000250 90 000250 100 000250 110 000250 120 00050 100 150 4 00050 100 150 36 00050 800 000500 800 000

[18] 3 4 150 6674 5 752 7 10 214












































31 Approach bias


32 Level of detail






Ij frac14XN j

tfrac141








SK frac14XK

jfrac141

Ij (1b)











f (K ) frac14

1 if K frac14 1SK



8gtlt

gt(2)

aK frac14

13

4Nd


(3a)

aK1 thorn1 aK1

6if K 2 and Nd 1

(3b)

8gtgtgtgtgtlt

gtgtgtgtgt












5 PERFORMANCE













Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor












































31 Approach bias


32 Level of detail






Ij frac14XN j

tfrac141








SK frac14XK

jfrac141

Ij (1b)











f (K ) frac14

1 if K frac14 1SK



8gtlt

gt(2)

aK frac14

13

4Nd


(3a)

aK1 thorn1 aK1

6if K 2 and Nd 1

(3b)

8gtgtgtgtgtlt

gtgtgtgtgt












5 PERFORMANCE













Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor























31 Approach bias


32 Level of detail






Ij frac14XN j

tfrac141








SK frac14XK

jfrac141

Ij (1b)











f (K ) frac14

1 if K frac14 1SK



8gtlt

gt(2)

aK frac14

13

4Nd


(3a)

aK1 thorn1 aK1

6if K 2 and Nd 1

(3b)

8gtgtgtgtgtlt

gtgtgtgtgt












5 PERFORMANCE













Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor










31 Approach bias


32 Level of detail






Ij frac14XN j

tfrac141








SK frac14XK

jfrac141

Ij (1b)











f (K ) frac14

1 if K frac14 1SK



8gtlt

gt(2)

aK frac14

13

4Nd


(3a)

aK1 thorn1 aK1

6if K 2 and Nd 1

(3b)

8gtgtgtgtgtlt

gtgtgtgtgt












5 PERFORMANCE













Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor




SK frac14XK

jfrac141

Ij (1b)











f (K ) frac14

1 if K frac14 1SK



8gtlt

gt(2)

aK frac14

13

4Nd


(3a)

aK1 thorn1 aK1

6if K 2 and Nd 1

(3b)

8gtgtgtgtgtlt

gtgtgtgtgt












5 PERFORMANCE













Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor








5 PERFORMANCE













Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor






Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor



Fig 6 Continued



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor



Fig 6 Continued



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor



Fig 6 Continued













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor













Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor






Fig 7 Continued









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor









6 CONCLUSION



ACKNOWLEDGEMENTS








REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor



REFERENCES



































APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor



APPENDIX

Notation





PGA PGB


PCA PCB




aK weight factor