Selection of K in K-means clustering
D T Pham, S S Dimov, and C D Nguyen
Manufacturing Engineering Centre, Cardiff University, Cardiff, UK
The manuscript was received on 26 May 2004 and was accepted after revision for publication on 27 September 2004.
DOI: 10.1243/095440605X8298
Abstract: The K-means algorithm is a popular data-clustering algorithm. However, one of its drawbacks is the requirement for the number of clusters, K, to be specified before the algorithm is applied. This paper first reviews existing methods for selecting the number of clusters for the algorithm. Factors that affect this selection are then discussed and a new measure to assist the selection is proposed. The paper concludes with an analysis of the results of using the proposed measure to determine the number of clusters for the K-means algorithm for different data sets.
Keywords: clustering, K-means algorithm, cluster number selection
1 INTRODUCTION
Data clustering is a data exploration technique that allows objects with similar characteristics to be grouped together in order to facilitate their further processing. Data clustering has many engineering applications, including the identification of part families for cellular manufacture.
The K-means algorithm is a popular data-clustering algorithm. To use it requires the number of clusters in the data to be pre-specified. Finding the appropriate number of clusters for a given data set is generally a trial-and-error process, made more difficult by the subjective nature of deciding what constitutes 'correct' clustering [1].
This paper proposes a method, based on information obtained during the K-means clustering operation itself, to select the number of clusters, K. The method employs an objective evaluation measure to suggest suitable values for K, thus avoiding the need for trial and error.
The remainder of the paper consists of five sections. Section 2 reviews the main known methods for selecting K. Section 3 analyses the factors influencing the selection of K. Section 4 describes the proposed evaluation measure. Section 5 presents the results of applying the proposed measure to select K for different data sets. Section 6 concludes the paper.
2 SELECTION OF THE NUMBER OF CLUSTERS AND CLUSTERING VALIDITY ASSESSMENT
This section reviews existing methods for selecting K for the K-means algorithm and the corresponding clustering validation techniques.
2.1 Values of K specified within a range or set
The performance of a clustering algorithm may be affected by the chosen value of K. Therefore, instead of using a single predefined K, a set of values might be adopted. It is important for the number of values considered to be reasonably large, to reflect the specific characteristics of the data sets. At the same time, the selected values have to be significantly smaller than the number of objects in the data sets, which is the main motivation for performing data clustering.
Reported studies [2–18] on K-means clustering and its applications usually do not contain any explanation or justification for selecting particular values for K. Table 1 lists the numbers of clusters and objects and the corresponding data sets used in those studies. Two observations could be made when analysing the data in the table. First, a number of researchers [5–7, 9] used only one or two values for K. Second, several other researchers [1, 3, 11, 13, 16] utilized relatively large K values compared with the number of objects. These two actions contravene the above-mentioned guidelines for selecting K. Therefore, the clustering results do not always correctly represent the performance of the tested algorithms.
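The 'maximum K/N ratio' reported in Table 1 is simple arithmetic, and a short check makes the second observation concrete. The sketch below is illustrative only: the function name is hypothetical, and the input rows are the entries listed in Table 1 for references [2] and [18].

```python
def max_k_over_n_percent(runs):
    """Return the largest K/N ratio, in percent, over (K values, N) pairs."""
    return max(100.0 * max(ks) / n for ks, n in runs)

# Rows for reference [2] in Table 1: (cluster counts tried, number of objects).
runs_ref2 = [((32, 64, 128, 256, 512, 1024), 8192),
             ((32, 64, 128, 256, 512, 1024), 29000),
             ((256,), 2048)]
# Rows for reference [18] in Table 1.
runs_ref18 = [((3, 4), 150), ((4, 5), 75), ((2, 7, 10), 214)]

print(round(max_k_over_n_percent(runs_ref2), 2))   # 12.5
print(round(max_k_over_n_percent(runs_ref18), 2))  # 6.67
```

A ratio of several per cent or more means that each cluster holds, on average, only a few tens of objects, which is the situation the guideline above warns against.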
Corresponding author: Manufacturing Engineering Centre, Cardiff University, Cardiff CF24 OYF, UK.
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
In general, the performance of any new version of the K-means algorithm could be verified by comparing it with its predecessors on the same criteria. In particular, the sum of cluster distortions is usually employed as such a performance indicator [3, 6, 13, 16, 18]. Thus, the comparison is considered fair because the same model and criterion are used for the performance analysis.
2.2 Values of K specified by the user
The K-means algorithm implementation in many data-mining or data analysis software packages [19–22] requires the number of clusters to be specified by the user. To find a satisfactory clustering result, usually a number of iterations are needed where the user executes the algorithm with different values of K. The validity of the clustering result is assessed only visually, without applying any formal performance measures. With this approach, it is difficult for users to evaluate the clustering result for multi-dimensional data sets.
2.3 Values of K determined in a later processing step
When K-means clustering is used as a pre-processing tool, the number of clusters is determined by the specific requirements of the main processing algorithm [13]. No attention is paid to the effect of the clustering results on the performance of this algorithm. In such applications, the K-means algorithm is employed just as a 'black box', without validation of the clustering result.
Table 1 The number of clusters used in different studies of the K-means algorithm

| Reference | Numbers of clusters, K | Number of objects, N | Maximum K/N ratio (%) |
|---|---|---|---|
| [2] | 32, 64, 128, 256, 512, 1024 | 8 192 | 12.50 |
| | 32, 64, 128, 256, 512, 1024 | 29 000 | |
| | 256 | 2 048 | |
| [3] | 600, 700, 800, 900, 1000 | 10 000 | 10.00 |
| | 600, 700, 800, 900, 1000 | 50 000 | |
| [4] | 4, 16, 64, 100, 128 | 100 000 | 0.13 |
| | 4, 16, 64, 100, 128 | 120 000 | |
| | 4, 16, 64, 100, 128 | 256 000 | |
| [5] | 4 | 564; 720; 1 000; 1 008; 1 010; 1 202; 2 000; 2 324; 3 005; 4 000; 6 272; 7 561 | 0.70 |
| [6] | 6 | 150 | 4.00 |
| [7] | 10 | 2 310 | 0.43 |
| | 25 | 12 902 | |
| [8] | 2, 4, 8 | Not reported | Not reported |
| [9] | 2, 4 | 500; 50 000; 100 000 | 3.33 |
| | 10 | 300 | |
| [10] | 1, 2, 3, 4 | 10 000 | 0.04 |
| [11] | 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 | 500 | 20.00 |
| [12] | 100 | 10 000 | 2.00 |
| | 50 | 2 500 | |
| [13] | 7 | 42 | 16.66 |
| | 1, 2, 3, 4, 5, 6, 7 | 120 | |
| [14] | 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 | 250 | 5.60 |
| [15] | 8, 20, 50, 64, 256 | 10 000 | 2.56 |
| [16] | 5000 | 50 000; 100 000; 200 000; 300 000; 433 208 | 50.00 |
| | 100 | 100 000 | |
| | 250 | 200 000 | |
| | 1000 | 100 000; 200 000; 300 000; 433 208 | |
| | 40 | 20 000 | |
| | 10, 20, 30, 40, 50, 60, 70, 80 | 30 000 | |
| | 50, 500, 5000 | 10 000; 50 000; 100 000; 200 000; 300 000; 433 208 | |
| [17] | 250 | 80 000; 90 000; 100 000; 110 000; 120 000 | 10.00 |
| | 50, 100, 400 | 4 000; 36 000 | |
| | 250 | 80 000; 90 000; 100 000; 110 000; 120 000 | |
| | 50, 100, 150 | 4 000; 36 000 | |
| | 50 | 800 000 | |
| | 500 | 800 000 | |
| [18] | 3, 4 | 150 | 6.67 |
| | 4, 5 | 75 | |
| | 2, 7, 10 | 214 | |

Where one row lists several values of N separated by semicolons, the same K values were used with each of those data sets.
2.4 Values of K equated to the number of generators
Synthetic data sets, which are used for testing algorithms, are often created by a set of normal or uniform distribution generators. Then, clustering algorithms are applied to those data sets with the number of clusters equated to the number of generators. It is assumed that any resultant cluster will cover all objects created by a particular generator. Thus, the clustering performance is judged on the basis of the difference between objects covered by a cluster and those created by the corresponding generator. Such a difference can be measured by simply counting objects or calculating the information gain [7].
There are drawbacks with this method. The first drawback concerns the stability of the clustering results when there are areas in the object space that contain objects created by different generators. Figure 1a illustrates such a case. The data set shown in this figure has two clusters, A and B, which cover objects generated by generators G_A and G_B respectively. Object X is in an overlapping area between clusters A and B. X has probabilities P_GA and P_GB of being created by G_A and G_B respectively, and probabilities P_CA and P_CB of being included in clusters A and B respectively. All four probabilities are larger than 0. Thus, there is a chance for X to be created by generator G_A but covered by cluster B, and vice versa. In such cases, the clustering results will not be perfect. The stability of the clustering results depends on these four probabilities. With an increase in the overlapping areas in the object space, the stability of the clustering results decreases.
The difference between the characteristics of the generators also has an effect on the clustering results. In Fig. 1b, where the number of objects of cluster A is five times larger than that of cluster B, the smaller cluster B might be regarded as noise and all objects might be grouped into one cluster. Such a clustering outcome would differ from that obtained by visual inspection.
Unfortunately, this method of selecting K cannot be applied to practical problems. The data distribution in practical problems is unknown and also the number of generators cannot be specified.
2.5 Values of K determined by statistical measures
There are several statistical measures available for selecting K. These measures are often applied in combination with probabilistic clustering approaches. They are calculated with certain assumptions about the underlying distribution of the data. The Bayesian information criterion or Akaike's information criterion [14, 17] is calculated on data sets which are constructed by a set of Gaussian distributions. The measures applied by Hardy [23] are based on the assumption that the data set fits the Poisson distribution. Monte Carlo techniques, which are associated with the null hypothesis, are used for assessing the clustering results and also for determining the number of clusters [24, 25].
There have been comparisons between probabilistic and partitioning clustering [7]. Expectation–maximization (EM) is often recognized as a typical method for probabilistic clustering. Similarly, K-means clustering is considered a typical method for partitioning clustering. Although EM and K-means clustering share some common ideas, they are based on different hypotheses, models, and criteria. Probabilistic clustering methods do not take into account the distortion inside a cluster, so that a cluster created by applying such methods may not correspond to a cluster in partitioning clustering, and vice versa. Therefore, statistical measures used in probabilistic methods are not applicable in the K-means algorithm. In addition, the assumptions about the underlying distribution cannot be verified on real data sets and therefore cannot be used to obtain statistical measures.
Fig. 1 Effect of the relationship between clusters on the clustering for two object spaces in which (a) an area exists that contains objects created by two different generators and (b) there are no overlapping areas: A, objects generated by G_A; D, objects generated by G_B
2.6 Values of K equated to the number of classes
With this method, the number of clusters is equated to the number of classes in the data sets. A data-clustering algorithm can be used as a classifier by applying it to data sets from which the class attribute is omitted and then assessing the clustering results using the omitted class information [26, 27]. The outcome of the assessment is fed back to the clustering algorithm to improve its performance. In this way, the clustering can be considered to be supervised.
With this method of determining the number of clusters, the assumption is made that the data-clustering method could form clusters, each of which would consist of only objects belonging to one class. Unfortunately, most real problems do not satisfy this assumption.
2.7 Values of K determined through visualization
Visual verification is applied widely because of its simplicity and explanation possibilities. Visual examples are often used to illustrate the drawbacks of an algorithm or to present the expected clustering results [5, 27].
The assessment of a clustering result using visualization techniques depends heavily on their implicit nature. The clustering models utilized by some clustering methods may not be appropriate for particular data sets. The data sets in Fig. 2 are illustrations of such cases. The application of visualization techniques implies a data distribution continuity in the expected clusters. If the K-means approach is applied to such data sets, there is not any cluster that satisfies the K-means clustering model and at the same time corresponds to a particular object grouping in the illustrated data sets. Therefore, the K-means algorithm cannot produce the expected clustering results. This suggests that the K-means approach is unsuitable for such data sets.
Fig. 2 Data sets inappropriate for the K-means approach: (a) data sets with four clusters [5]; (b) data sets with three clusters [23]; (c) data sets with eight clusters [27]. Note that the number of clusters in each data set was specified by the respective researchers
The characteristics of the data sets in Fig. 2 (position, shape, size, and object distribution) are implicitly defined. This makes the validation of the clustering results difficult. Any slight changes in the data characteristics may lead to different outcomes. The data set in Fig. 2b is an illustration of such a case. Another example is the series of data sets in Fig. 3. Although two clusters are easily identifiable in the data set in Fig. 3a, the numbers of clusters in the data sets in Figs 3b and c depend on the distance between the rings and the object density of each ring. Usually, such parameters are not explicitly defined when a visual check is carried out.
In spite of the above-mentioned deficiencies, visualization of the results is still a useful method of selecting K and validating the clustering results when the data sets do not violate the assumptions of the clustering model. In addition, this method is recommended in cases where the expected results could be identified explicitly.
2.8 Values of K determined using a neighbourhood measure
A neighbourhood measure could be added to the cost function of the K-means algorithm to determine K [26]. Although this technique has shown promising results for a few data sets, it has yet to prove its potential in practical applications. Because the cost function has to be modified, this technique cannot be applied to the original K-means algorithm.
3 FACTORS AFFECTING THE SELECTION OF K
A function f(K) for evaluating the clustering result could be used to select the number of clusters. Factors that such a function should take into account are discussed in this section.
3.1 Approach bias
The evaluation function should be related closely to the clustering criteria. As mentioned previously, such a relation could prevent adverse effects on the validation process. In particular, in the K-means algorithm, the criterion is the minimization of the distortion of clusters, so that the evaluation function should take this parameter into account.
3.2 Level of detail
In general, observers that could see relatively low levels of detail would obtain only an overview of an object. By increasing the level of detail, they could gain more information about the observed object but, at the same time, the amount of data that they have to process increases. Because of resource limitations, a high level of detail is normally used only to examine parts of the object [28].
Such an approach could be applied in clustering. A data set with n objects could be grouped into any number of clusters between 1 and n, which would correspond to the lowest and the highest levels of detail respectively. By specifying different K values, it is possible to assess the results of grouping objects into various numbers of clusters. From this evaluation, more than one K value could be recommended to users, but the final selection is made by them.
3.3 Internal distribution versus global impact
Clustering is used to find irregularities in the data distribution and to identify regions in which objects are concentrated. However, not every region with a high concentration of objects is considered a cluster. For a region to be identified as a cluster, it is important to analyse not only its internal distribution but also its interdependence with other object groupings in the data set.
In K-means clustering, the distortion of a cluster is a function of the data population and the distance between objects and the cluster centre according to

$$
I_j = \sum_{t=1}^{N_j} \left[ d(x_{jt}, w_j) \right]^2 \tag{1a}
$$

where I_j is the distortion of cluster j, w_j is the centre of cluster j, N_j is the number of objects belonging to cluster j, x_{jt} is the tth object belonging to cluster j, and d(x_{jt}, w_j) is the distance between object x_{jt} and the centre w_j of cluster j.
Each cluster is represented by its distortion, and its impact on the entire data set is assessed by its contribution to the sum of all distortions, S_K, given by

$$
S_K = \sum_{j=1}^{K} I_j \tag{1b}
$$

where K is the specified number of clusters.
Thus, such information is important in assessing whether a particular region in the object space could be considered a cluster.
Fig. 3 Variations in the two-ring data set
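The two quantities in equations (1a) and (1b) can be computed directly once the objects have been assigned to clusters. The following is a minimal sketch, assuming that d is the Euclidean distance (the usual choice for K-means, but not stated explicitly above); the function names and the four sample objects are hypothetical.

```python
def cluster_distortion(points, centre):
    """I_j: sum of squared Euclidean distances from each object to the centre."""
    return sum(sum((x - w) ** 2 for x, w in zip(p, centre)) for p in points)

def sum_of_distortions(clusters):
    """S_K: total distortion over (points, centre) pairs, one pair per cluster."""
    return sum(cluster_distortion(pts, c) for pts, c in clusters)

clusters = [
    ([(0.0, 0.0), (0.0, 2.0)], (0.0, 1.0)),   # I_1 = 1 + 1 = 2
    ([(5.0, 5.0), (7.0, 5.0)], (6.0, 5.0)),   # I_2 = 1 + 1 = 2
]
print(sum_of_distortions(clusters))  # 4.0
```

Because S_K is a plain sum of the I_j, the share I_j / S_K gives the contribution of cluster j to the total, which is the 'global impact' discussed in this section.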
3.4 Constraints on f(K)
The robustness of f(K) is very important. Because this function is based on the result of the clustering algorithm, it is important for this result to vary as little as possible when K remains unchanged. However, one of the main deficiencies of the K-means approach is its dependence on randomness. Thus, the algorithm should yield consistent results so that its performance can be used as a variable in the evaluation function. A new version of the K-means algorithm, namely the incremental K-means algorithm [29], satisfies this requirement and can be adopted for this purpose.
The role of f(K) is to reveal trends in the data distribution and therefore it is important to keep it independent of the number of objects. The number of clusters, K, is assumed to be much smaller than the number of objects, N. When K increases, f(K) should converge to some constant value. Then, if, for any intermediate K, f(K) exhibits a special behaviour, such as a minimum or maximum point, that value of K could be taken as the desired number of clusters.
4 NUMBER OF CLUSTERS FOR K-MEANS CLUSTERING
As mentioned in section 3.3, cluster analysis is used to find irregularities in the data distribution. When the data distribution is uniform, there is not any irregularity. Therefore, data sets with uniform distribution could be used to calibrate and verify the clustering result. This approach was applied by Tibshirani et al. [30]. A data set of the same dimension as the actual data set and with a uniform distribution was generated. The clustering performance on this artificial data set was then compared with the result obtained for the actual data set. A measure known as the 'gap' statistic [30] was employed to assess performance. In this work, instead of generating an artificial data set, the clustering performance for the artificial data set was estimated. Also, instead of the gap statistic, a new and more discriminatory measure was employed for evaluating the clustering result.
When the K-means algorithm is applied to data with a uniform distribution and K is increased by 1, the clusters are likely to change and, in the new positions, the partitions will again be approximately equal in size and their distortions similar to one another. The evaluations carried out in reference [29] showed that, when a new cluster is inserted into a cluster (K = 1) with a hypercuboid shape and a uniform distribution, the decrease in the sum of distortions is proportional to the original sum of distortions. This conclusion was found to be correct for clustering results obtained with relatively small values of K. In such cases, the sum of distortions after the increase in the number of clusters could be estimated from the current value.
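The evaluation function f(K) is defined using the equations

$$
f(K) =
\begin{cases}
1 & \text{if } K = 1 \\[4pt]
\dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ \forall K > 1 \\[8pt]
1 & \text{if } S_{K-1} = 0,\ \forall K > 1
\end{cases}
\tag{2}
$$

$$
\alpha_K =
\begin{cases}
1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \qquad \text{(3a)} \\[8pt]
\alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1 \qquad \text{(3b)}
\end{cases}
$$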
where S_K is the sum of the cluster distortions when the number of clusters is K, N_d is the number of data set attributes (i.e. the number of dimensions), and α_K is a weight factor. The term α_K S_{K-1} in equation (2) is an estimate of S_K based on S_{K-1}, made with the assumption that the data have a uniform distribution. The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform. When there are areas of concentration in the data distribution, S_K will be less than the estimated value, so that f(K) decreases. The smaller that f(K) is, the more concentrated is the data distribution. Thus, values of K that yield small f(K) can be regarded as giving well-defined clusters.
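Equations (2) and (3) translate into a few lines of code once the sums of distortions S_1, S_2, ... are available. The sketch below is one possible reading of those equations, not the authors' implementation; the function names and the three sample distortion values are hypothetical.

```python
def weight_factors(k_max, n_dims):
    """alpha_K for K = 2..k_max from equations (3a) and (3b); needs n_dims > 1."""
    if n_dims <= 1:
        raise ValueError("equations (3a) and (3b) are stated for N_d > 1")
    alpha = {}
    for k in range(2, k_max + 1):
        if k == 2:
            alpha[k] = 1.0 - 3.0 / (4.0 * n_dims)                  # equation (3a)
        else:
            alpha[k] = alpha[k - 1] + (1.0 - alpha[k - 1]) / 6.0   # equation (3b)
    return alpha

def evaluation_function(distortions, n_dims):
    """f(K) from equation (2); distortions[K - 1] holds S_K for K = 1, 2, ..."""
    alpha = weight_factors(len(distortions), n_dims)
    f = {1: 1.0}
    for k in range(2, len(distortions) + 1):
        s_prev = distortions[k - 2]
        f[k] = 1.0 if s_prev == 0 else distortions[k - 1] / (alpha[k] * s_prev)
    return f

# Hypothetical distortions S_1, S_2, S_3 for a two-dimensional data set.
f = evaluation_function([404.0, 4.0, 3.0], n_dims=2)
print({k: round(v, 4) for k, v in f.items()})  # {1: 1.0, 2: 0.0158, 3: 1.0909}
```

In this made-up case the sharp drop at K = 2 marks a strong concentration of the data into two groups, while the value above 1 at K = 3 shows that a third cluster reduces the distortion less than it would for uniformly distributed data.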
The weight factor α_K, defined in equation (3), is a positive number less than or equal to 1 and is applied to reduce the effect of dimensions. With K = 2, α_K is computed using equation (3a). This equation is derived from equation (7) in reference [29], which shows that the decrease in distortion is inversely proportional to the number of dimensions, N_d.
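As a worked illustration of equations (3a) and (3b), the first few weight factors for a two-dimensional data set (N_d = 2, chosen here only as an example) are:

```latex
\begin{align*}
\alpha_2 &= 1 - \frac{3}{4 \times 2} = 0.625 \\
\alpha_3 &= 0.625 + \frac{1 - 0.625}{6} = 0.6875 \\
\alpha_4 &= 0.6875 + \frac{1 - 0.6875}{6} \approx 0.7396
\end{align*}
```

Rearranging equation (3b) gives 1 − α_K = (5/6)(1 − α_{K−1}), so the gap between α_K and 1 shrinks by a constant factor at each step and α_K approaches 1 from below as K grows.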
As K increases above 2, the decrease in the sum of distortions reduces (the ratio S_K/S_{K-1} approaches 1), as can be seen in Fig. 4. This figure shows the values of S_K/S_{K-1} computed for different K when the clustering algorithm is applied to data sets of different dimensions and with uniform distributions. With such data sets, f(K) is expected to be equal to 1 and α_K should be chosen to equate f(K) to 1. From equation (2), α_K should therefore be S_K/S_{K-1} and thus obtainable from Fig. 4. However, for computational simplicity, the recursion equation (3b) has been derived from the data represented in Fig. 4 to calculate α_K. Figure 5 shows that the values of α_K obtained from equation (3b) fit the plots in Fig. 4 closely.
The proposed function f(K) satisfies the constraints mentioned in the previous section. The robustness of f(K) will be verified experimentally in the next section. When the number of objects is doubled or tripled but their distributions are unchanged, the resultant clusters remain in the same position. S_K and S_{K-1} are doubled or tripled correspondingly, so that f(K) stays constant. Therefore, generally, f(K) is independent of the number of objects in the data set.
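The independence argument can be written out in one line. As an idealized case (an assumption made here for illustration), suppose every object is replicated m times and the cluster centres do not move. Each distortion I_j is then multiplied by m, and so is every sum of distortions:

```latex
\begin{align*}
S'_K = m\,S_K, \qquad S'_{K-1} = m\,S_{K-1}
\quad\Longrightarrow\quad
f'(K) = \frac{m\,S_K}{\alpha_K\, m\,S_{K-1}} = \frac{S_K}{\alpha_K S_{K-1}} = f(K)
\end{align*}
```

The factor m cancels because α_K depends only on K and N_d, not on the number of objects.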
To reduce the effect of differences in the ranges of the attributes, data are normalized before the clustering starts. However, it should be noted that, when the data have well-separated groups of objects, the shape of such regions in the problem space has an effect on the evaluation function. In these cases, the normalization does not influence the local object distribution because it is a scaling technique that applies to the whole data set.
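The text does not say which normalization is used. As one common possibility, and purely as an assumption, the sketch below rescales each attribute linearly to the range [0, 1] (min-max scaling); the function name and sample data are hypothetical.

```python
def min_max_normalize(points):
    """Rescale every attribute to [0, 1]; constant attributes map to 0."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [tuple((x - lo) / sp if sp else 0.0 for x, lo, sp in zip(p, lows, spans))
            for p in points]

data = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]
print(min_max_normalize(data))  # [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
```

Because the same linear map is applied to every object, relative positions within each group of objects are preserved, which is consistent with the remark that normalization does not influence the local object distribution.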
5 PERFORMANCE
The evaluation function f(K) is tested in a series of experiments on the artificially generated data sets shown in Fig. 6. All data are normalized before the incremental K-means algorithm is applied, with K ranging from 1 to 19. f(K) is calculated on the basis of the total distortion of the clusters.
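The experimental loop amounts to running the clustering algorithm once per value of K and recording the total distortion S_K each time. The sketch below shows that loop on a toy data set. It is not the incremental K-means algorithm of reference [29]: standard Lloyd iterations with a deterministic farthest-point seeding are substituted so that repeated runs give the same result. All names and the eight sample objects are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two objects."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first object, then repeatedly add
    the object farthest from the centres chosen so far."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    return centres

def kmeans_distortion(points, k, max_iter=100):
    """Run Lloyd's K-means and return the sum of cluster distortions S_K."""
    centres = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centres[j]))
                      for p in points]
        if new_labels == labels:       # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                # keep the old centre if a cluster is empty
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))

# Toy data: two compact groups of four objects each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
s = [kmeans_distortion(points, k) for k in (1, 2, 3)]
print(s[0], s[1])  # 404.0 4.0
```

The list `s` holds S_1, S_2, S_3 and is exactly the input that equation (2) requires. Here S_2 is about 1 per cent of S_1, far below the fraction α_2 = 0.625 expected for uniform two-dimensional data, so f(2) is close to 0.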
In Figs 6a–c, all objects belong to a single region with a uniform distribution. The graph in Fig. 6a shows that f(K) reflects well the clustering result on this data set with a uniform distribution because f(K) is approximately constant and equal to 1 for all K. When K = 4 and K = 3 in Figs 6a and b respectively, f(K) reaches minimum values. This could be attributed to the shape of the areas defined by the objects belonging to these data sets. However, the minimum values of f(K) do not differ sufficiently from the average value for any strong recommendations to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that α_K reduces the effect of the data set dimensions on the evaluation function.
Fig. 4 The ratio S_K/S_{K-1} for data sets having uniform distributions: (a) two-dimensional 'square' and 'circle'; (b) four-dimensional 'cube' and 'sphere'
Fig. 5 Comparison of the values of α_K calculated using equation (3b) and the ratio S_K/S_{K-1}
Fig. 6 Data sets and their corresponding f(K)
For the data set in Fig. 6d, again, all objects are concentrated in a single region with a normal distribution. The f(K) plot for this data set suggests correctly that, when K = 1, the clustering result is the most suitable for this data set.
The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e, the two generators have an overlapping region but, in Fig. 6f, they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.
Fig. 7 f(K) for the 12 benchmark data sets
The data sets in Figs 6g and h have three recognizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.
Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case, f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions, but K = 3 could also be used to cluster the objects.
Figures 6j and k illustrate how the level of detail could affect the selection of K. f(K) reaches minimum values at K = 2 and 4 respectively. In such cases, users could select the most appropriate value of K based on their specific requirements. A more complex case is shown in Fig. 6l, where there is a possible K value of 4 or 8. The selection of a particular K will depend on the requirements of the specific application for which the clustering is carried out.
The data sets in Figs 6m–o have well-defined regions in the object space, each of which has a different distribution, location, and number of objects. If the minimum value of f(K) is used to cluster the objects, K will be different from the number of generators utilized to create them (as in the case of the clusters in Fig. 6o) or the number of object groupings that could be identified visually (as in the case of the clusters in Figs 6m and n). The reason for the difference varies with different cases. For example, it could be considered that there are five clusters in Fig. 6m because the cluster distances are smaller for the two leftmost pairs of clusters than for others and the clusters in those pairs could be merged together. However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters and the final decision as to which value to adopt has to be left at the discretion of the user.
From the graphs in Fig. 6, a conclusion could be made that any K with corresponding f(K) < 0.85 could be recommended for clustering. If there is not a value with corresponding f(K) < 0.85, K = 1 is selected.
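The selection rule can be stated compactly in code. This is a minimal sketch of the rule as described above; the function name and the f(K) values in the two calls are hypothetical.

```python
def recommend_k(f_values, threshold=0.85):
    """Return every K whose f(K) is below the threshold, or [1] if none is."""
    candidates = sorted(k for k, v in f_values.items() if v < threshold)
    return candidates or [1]

print(recommend_k({1: 1.0, 2: 0.40, 3: 0.80, 4: 1.05}))  # [2, 3]
print(recommend_k({1: 1.0, 2: 0.95, 3: 1.02}))           # [1]
```

Returning a list rather than a single value reflects the point made in section 3.2: more than one K may be recommended, and the final choice is left to the user.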
The proposed function f(K) is also applied to 12 benchmarking data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are as given in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.
The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.
6 CONCLUSION
Existing methods of selecting the number of clusters for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.
A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project supported by the Engineering and Physical Sciences Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the I*PROMS Network of Excellence funded by the European Commission.
Table 2 The recommended number of clusters based on f(K)

| Data set | Proposed number of clusters |
|---|---|
| Australian | 1 |
| Balance-scale | 1 |
| Car evaluation | 2, 3, 4 |
| Cmc | 1 |
| Ionosphere | 2 |
| Iris | 2, 3 |
| Page blocks | 2 |
| Pima | 1 |
| Wdbc | 2 |
| Wine | 3 |
| Yeast | 1 |
| Zoo | 2 |
REFERENCES
1 Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2000 (Morgan Kaufmann, San Francisco, California).
2 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. Fast K-means clustering algorithms. Report 95.18, School of Computer Studies, University of Leeds, June 1995.
3 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. New methods for the initialisation of clusters. Pattern Recognition Lett., 1996, 17, 451–455.
4 Alsabti, K., Ranka, S., and Singh, V. An efficient K-means clustering algorithm. In Proceedings of the First Workshop on High-Performance Data Mining, Orlando, Florida, 1998, ftp://ftp.cise.ufl.edu/pub/faculty/ranka/Proceedings.
5 Bilmes, J., Vahdat, A., Hsu, W., and Im, E. J. Empirical observations of probabilistic heuristics for the clustering problem. Technical Report TR-97-018, International Computer Science Institute, Berkeley, California.
6 Bottou, L. and Bengio, Y. Convergence properties of the K-means algorithm. Adv. Neural Infn Processing Systems, 1995, 7, 585–592.
7 Bradley, S. and Fayyad, U. M. Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98) (Ed. J. Shavlik), Madison, Wisconsin, 1998, pp. 91–99 (Morgan Kaufmann, San Francisco, California).
8 Du, Q. and Wong, T.-W. Numerical studies of MacQueen's K-means algorithm for computing the centroidal Voronoi tessellations. Int. J. Computers Math. Applics, 2002, 44, 511–523.
9 Castro, V. E. and Yang, J. A fast and robust general purpose clustering algorithm. In Proceedings of the Fourth European Workshop on Principles of Knowledge Discovery in Databases and Data Mining (PKDD 00), Lyon, France, 2000, pp. 208–218.
10 Castro, V. E. Why so many clustering algorithms. SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002, 4(1), 65–75.
11 Fritzke, B. The LBG-U method for vector quantization – an improvement over LBG inspired from neural networks. Neural Processing Lett., 1997, 5(1), 35–45.
12 Hamerly, G. and Elkan, C. Alternatives to the K-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, Virginia, 2002, pp. 600–607.
13 Hansen, L. K. and Larsen, J. Unsupervised learning and generalisation. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, June 1996, pp. 25–30 (IEEE, New York).
14 Ishioka, T. Extended K-means with an efficient estimation of the number of clusters. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, December 2000, pp. 17–22.
15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. The efficient K-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Analysis Mach. Intell., 2002, 24(7), 881–892.
16 Pelleg, D. and Moore, A. Accelerating exact K-means algorithms with geometric reasoning. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, 1999, pp. 277–281.
17 Pelleg, D. and Moore, A. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, 2000, pp. 727–734.
18 Pena, J. M., Lazano, J. A., and Larranaga, P. An empirical comparison of four initialisation methods for the K-means algorithm. Pattern Recognition Lett., 1999, 20, 1027–1040.
19 SPSS Clementine Data Mining System. User Guide Version 5, 1998 (Integral Solutions Limited, Basingstoke, Hampshire).
20 DataEngine 3.0 – Intelligent Data Analysis – an Easy Job, Management Intelligenter Technologien GmbH, Germany, 1998, http://www.mitgmbh.de.
21 Kerr, A., Hall, H. K., and Kozub, S. Doing Statistics with SPSS, 2002 (Sage, London).
22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2, Insightful Corporation, Seattle, Washington, 2001, http://www.insightful.com/DocumentsLive/23/44/statman2.pdf.
23 Hardy, A. On the number of clusters. Comput. Statist. Data Analysis, 1996, 23, 83–96.
24 Theodoridis, S. and Koutroubas, K. Pattern Recognition, 1998 (Academic Press, London).
25 Halkidi, M., Batistakis, Y., and Vazirgiannis, M. Cluster validity methods: Part I. SIGMOD Record, 2002, 31(2), available online http://www.acm.org/sigmod/record.
26 Kothari, R. and Pitts, D. On finding the number of clusters. Pattern Recognition Lett., 1999, 20, 405–416.
27 Cai, Z. Technical aspects of data mining. PhD thesis, Cardiff University, Cardiff, 2001.
28 Lindeberg, T. Scale-space Theory in Computer Vision, 1994 (Kluwer Academic, Boston, Massachusetts).
29 Pham, D. T., Dimov, S. S., and Nguyen, C. D. Incremental K-means algorithm. Proc. Instn Mech. Engrs, Part C: J. Mechanical Engineering Science, 2003, 218, 783–795.
30 Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California, 2000.
31 Blake, C., Keogh, E., and Merz, C. J. UCI Repository of Machine Learning Databases, Irvine, California, Department of Information and Computer Science, University of California, Irvine, California, 1998.
APPENDIX
Notation
A, B    clusters
$d(x_{jt}, w_j)$    distance between object $x_{jt}$ and the centre $w_j$ of cluster j
f(K)    evaluation function
$G_A$, $G_B$    generators
$I_j$    distortion of cluster j
K    number of clusters
N    number of objects in the data set
$N_d$    number of data set attributes (the dimension of the data set)
$N_j$    number of objects belonging to cluster j
$P_{G_A}$, $P_{G_B}$    probabilities that X is created by $G_A$ or $G_B$ respectively
$P_{C_A}$, $P_{C_B}$    probabilities that X is clustered into A or B respectively
$S_K$    sum of all distortions with K being the specified number of clusters
X    object
$x_{jt}$    object belonging to cluster j
$w_j$    centre of cluster j
$\alpha_K$    weight factor
In general, the performance of any new version of the K-means algorithm could be verified by comparing it with its predecessors on the same criteria. In particular, the sum of cluster distortions is usually employed as such a performance indicator [3, 6, 13, 16, 18]. Thus, the comparison is considered fair because the same model and criterion are used for the performance analysis.
2.2 Values of K specified by the user
The K-means algorithm implementation in many data-mining or data analysis software packages [19–22] requires the number of clusters to be specified by the user. To find a satisfactory clustering result, a number of iterations are usually needed in which the user executes the algorithm with different values of K. The validity of the clustering result is assessed only visually, without applying any formal performance measures. With this approach, it is difficult for users to evaluate the clustering result for multi-dimensional data sets.
2.3 Values of K determined in a later processing step
When K-means clustering is used as a pre-processing tool, the number of clusters is determined by the specific requirements of the main processing algorithm [13]. No attention is paid to the effect of the clustering results on the performance of this algorithm. In such applications, the K-means algorithm is employed just as a 'black box' without validation of the clustering result.
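Table 1 The number of clusters used in different studies of the K-means algorithm
| Reference | Numbers of clusters K | Number of objects N | Maximum K/N ratio (%) |
|---|---|---|---|
| [2] | 32, 64, 128, 256, 512, 1024 | 8 192 | 12.50 |
| | 32, 64, 128, 256, 512, 1024 | 29 000 | |
| | 256 | 2 048 | |
| [3] | 600, 700, 800, 900, 1000 | 10 000 | 10.00 |
| | 600, 700, 800, 900, 1000 | 50 000 | |
| [4] | 4, 16, 64, 100, 128 | 100 000 | 0.13 |
| | 4, 16, 64, 100, 128 | 120 000 | |
| | 4, 16, 64, 100, 128 | 256 000 | |
| [5] | 4 | 564 | 0.70 |
| | 4 | 720 | |
| | 4 | 1 000 | |
| | 4 | 1 008 | |
| | 4 | 1 010 | |
| | 4 | 1 202 | |
| | 4 | 2 000 | |
| | 4 | 2 324 | |
| | 4 | 3 005 | |
| | 4 | 4 000 | |
| | 4 | 6 272 | |
| | 4 | 7 561 | |
| [6] | 6 | 150 | 4.00 |
| [7] | 10 | 2 310 | 0.43 |
| | 25 | 12 902 | |
| [8] | 2, 4, 8 | Not reported | Not reported |
| [9] | 2, 4 | 500 | 3.33 |
| | 2, 4 | 50 000 | |
| | 2, 4 | 100 000 | |
| | 10 | 300 | |
| [10] | 1, 2, 3, 4 | 10 000 | 0.04 |
| [11] | 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 | 500 | 20.00 |
| [12] | 100 | 10 000 | 2.00 |
| | 50 | 2 500 | |
| [13] | 7 | 42 | 16.66 |
| | 1, 2, 3, 4, 5, 6, 7 | 120 | |
| [14] | 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 | 250 | 5.60 |
| [15] | 8, 20, 50, 64, 256 | 10 000 | 2.56 |
| [16] | 5000 | 50 000 | 50.00 |
| | 5000 | 100 000 | |
| | 5000 | 200 000 | |
| | 5000 | 300 000 | |
| | 5000 | 433 208 | |
| | 100 | 100 000 | |
| | 250 | 200 000 | |
| | 1000 | 100 000 | |
| | 1000 | 200 000 | |
| | 1000 | 300 000 | |
| | 1000 | 433 208 | |
| | 40 | 20 000 | |
| | 10, 20, 30, 40, 50, 60, 70, 80 | 30 000 | |
| | 50, 500, 5000 | 10 000 | |
| | 50, 500, 5000 | 50 000 | |
| | 50, 500, 5000 | 100 000 | |
| | 50, 500, 5000 | 200 000 | |
| | 50, 500, 5000 | 300 000 | |
| | 50, 500, 5000 | 433 208 | |
| [17] | 250 | 80 000 | 10.00 |
| | 250 | 90 000 | |
| | 250 | 100 000 | |
| | 250 | 110 000 | |
| | 250 | 120 000 | |
| | 50, 100, 400 | 4 000 | |
| | 50, 100, 400 | 36 000 | |
| | 250 | 80 000 | |
| | 250 | 90 000 | |
| | 250 | 100 000 | |
| | 250 | 110 000 | |
| | 250 | 120 000 | |
| | 50, 100, 150 | 4 000 | |
| | 50, 100, 150 | 36 000 | |
| | 50 | 800 000 | |
| | 500 | 800 000 | |
| [18] | 3, 4 | 150 | 6.67 |
| | 4, 5 | 75 | |
| | 2, 7, 10 | 214 | |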
2.4 Values of K equated to the number of generators
Synthetic data sets, which are used for testing algorithms, are often created by a set of normal or uniform distribution generators. Clustering algorithms are then applied to those data sets with the number of clusters equated to the number of generators. It is assumed that any resultant cluster will cover all objects created by a particular generator. Thus, the clustering performance is judged on the basis of the difference between the objects covered by a cluster and those created by the corresponding generator. Such a difference can be measured by simply counting objects or by calculating the information gain [7].
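The object-counting comparison can be illustrated with a minimal sketch. It assumes that each cluster is paired with the generator that produced most of its objects; this pairing rule, the function name `count_mismatches` and the toy labels are illustrative assumptions, not part of the studies cited above.

```python
import numpy as np

def count_mismatches(generator_ids, cluster_ids):
    """Count objects whose generator differs from the majority generator of their cluster."""
    generator_ids = np.asarray(generator_ids)
    cluster_ids = np.asarray(cluster_ids)
    mismatches = 0
    for c in np.unique(cluster_ids):
        members = generator_ids[cluster_ids == c]
        # Assumption: a cluster "corresponds" to the generator that created most of its objects.
        majority = np.bincount(members).max()
        mismatches += members.size - majority
    return int(mismatches)

# Toy labels: six objects from two generators, grouped into two clusters.
generator_ids = [0, 0, 0, 1, 1, 1]
cluster_ids = [0, 0, 1, 1, 1, 1]
n_wrong = count_mismatches(generator_ids, cluster_ids)
print(n_wrong)
```

A count of zero would mean that every cluster covers exactly the objects of one generator; generator identifiers are assumed to be non-negative integers.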
There are drawbacks with this method. The first drawback concerns the stability of the clustering results when there are areas in the object space that contain objects created by different generators. Figure 1a illustrates such a case. The data set shown in this figure has two clusters, A and B, which cover objects generated by generators $G_A$ and $G_B$ respectively. Object X is in an overlapping area between clusters A and B. X has probabilities $P_{G_A}$ and $P_{G_B}$ of being created by $G_A$ and $G_B$ respectively, and probabilities $P_{C_A}$ and $P_{C_B}$ of being included in clusters A and B respectively. All four probabilities are larger than 0. Thus, there is a chance for X to be created by generator $G_A$ but covered by cluster B, and vice versa. In such cases, the clustering results will not be perfect. The stability of the clustering results depends on these four probabilities. With an increase in the overlapping areas in the object space, the stability of the clustering results decreases.
The difference between the characteristics of the generators also has an effect on the clustering results. In Fig. 1b, where the number of objects of cluster A is five times larger than that of cluster B, the smaller cluster B might be regarded as noise and all objects might be grouped into one cluster. Such a clustering outcome would differ from that obtained by visual inspection.
Unfortunately, this method of selecting K cannot be applied to practical problems. The data distribution in practical problems is unknown and the number of generators cannot be specified.
2.5 Values of K determined by statistical measures
There are several statistical measures available for selecting K. These measures are often applied in combination with probabilistic clustering approaches. They are calculated with certain assumptions about the underlying distribution of the data. The Bayesian information criterion or Akaike's information criterion [14, 17] is calculated on data sets which are constructed by a set of Gaussian distributions. The measures applied by Hardy [23] are based on the assumption that the data set fits the Poisson distribution. Monte Carlo techniques, which are associated with the null hypothesis, are used for assessing the clustering results and also for determining the number of clusters [24, 25].
There have been comparisons between probabilistic and partitioning clustering [7]. Expectation–maximization (EM) is often recognized as a typical method for probabilistic clustering. Similarly, K-means clustering is considered a typical method for partitioning clustering. Although EM and K-means clustering share some common ideas, they are based on different hypotheses, models and criteria. Probabilistic clustering methods do not take into account the distortion inside a cluster, so that a cluster created by applying such methods may not correspond to a cluster in partitioning clustering, and vice versa. Therefore, statistical measures used in probabilistic methods are not applicable in the K-means algorithm. In addition, the assumptions about the underlying distribution cannot be verified on real data sets and therefore cannot be used to obtain statistical measures.
Fig. 1 Effect of the relationship between clusters on the clustering for two object spaces in which (a) an area exists that contains objects created by two different generators and (b) there are no overlapping areas. The two marker symbols in the figure legend denote objects generated by $G_A$ and objects generated by $G_B$
2.6 Values of K equated to the number of classes
With this method, the number of clusters is equated to the number of classes in the data sets. A data-clustering algorithm can be used as a classifier by applying it to data sets from which the class attribute is omitted and then assessing the clustering results using the omitted class information [26, 27]. The outcome of the assessment is fed back to the clustering algorithm to improve its performance. In this way, the clustering can be considered to be supervised.
With this method of determining the number of clusters, the assumption is made that the data-clustering method could form clusters, each of which would consist of only objects belonging to one class. Unfortunately, most real problems do not satisfy this assumption.
2.7 Values of K determined through visualization
Visual verification is applied widely because of its simplicity and explanation possibilities. Visual examples are often used to illustrate the drawbacks of an algorithm or to present the expected clustering results [5, 27].
The assessment of a clustering result using visualization techniques depends heavily on their implicit nature. The clustering models utilized by some clustering methods may not be appropriate for particular data sets. The data sets in Fig. 2 are illustrations of such cases. The application of visualization techniques implies a data distribution continuity in the expected clusters. If the K-means approach is applied to such data sets, there is not any cluster that satisfies the K-means clustering model and at the same time corresponds to a particular object grouping in the illustrated data sets. Therefore, the K-means algorithm cannot produce the expected clustering results. This suggests that the K-means approach is unsuitable for such data sets.
Fig. 2 Data sets inappropriate for the K-means approach: (a) data sets with four clusters [5]; (b) data sets with three clusters [23]; (c) data sets with eight clusters [27]. Note that the number of clusters in each data set was specified by the respective researchers
The characteristics of the data sets in Fig. 2 (position, shape, size and object distribution) are implicitly defined. This makes the validation of the clustering results difficult. Any slight changes in the data characteristics may lead to different outcomes. The data set in Fig. 2b is an illustration of such a case. Another example is the series of data sets in Fig. 3. Although two clusters are easily identifiable in the data set in Fig. 3a, the numbers of clusters in the data sets in Figs 3b and c depend on the distance between the rings and the object density of each ring. Usually, such parameters are not explicitly defined when a visual check is carried out.
In spite of the above-mentioned deficiencies, visualization of the results is still a useful method of selecting K and validating the clustering results when the data sets do not violate the assumptions of the clustering model. In addition, this method is recommended in cases where the expected results could be identified explicitly.
2.8 Values of K determined using a neighbourhood measure
A neighbourhood measure could be added to the cost function of the K-means algorithm to determine K [26]. Although this technique has shown promising results for a few data sets, it needs to prove its potential in practical applications. Because the cost function has to be modified, this technique cannot be applied to the original K-means algorithm.
3 FACTORS AFFECTING THE SELECTION OF K
A function f(K) for evaluating the clustering result could be used to select the number of clusters. Factors that such a function should take into account are discussed in this section.
3.1 Approach bias
The evaluation function should be related closely to the clustering criteria. As mentioned previously, such a relation could prevent adverse effects on the validation process. In particular, in the K-means algorithm, the criterion is the minimization of the distortion of clusters, so that the evaluation function should take this parameter into account.
3.2 Level of detail
In general, observers that could see relatively low levels of detail would obtain only an overview of an object. By increasing the level of detail, they could gain more information about the observed object but, at the same time, the amount of data that they have to process increases. Because of resource limitations, a high level of detail is normally used only to examine parts of the object [28].
Such an approach could be applied in clustering. A data set with n objects could be grouped into any number of clusters between 1 and n, which would correspond to the lowest and the highest levels of detail respectively. By specifying different K values, it is possible to assess the results of grouping objects into various numbers of clusters. From this evaluation, more than one K value could be recommended to users, but the final selection is made by them.
3.3 Internal distribution versus global impact
Clustering is used to find irregularities in the data distribution and to identify regions in which objects are concentrated. However, not every region with a high concentration of objects is considered a cluster. For a region to be identified as a cluster, it is important to analyse not only its internal distribution but also its interdependence with other object groupings in the data set.
In K-means clustering, the distortion of a cluster is a function of the data population and the distance between objects and the cluster centre according to
$$I_j = \sum_{t=1}^{N_j} \left[ d(x_{jt}, w_j) \right]^2 \qquad (1a)$$
where $I_j$ is the distortion of cluster j, $w_j$ is the centre of cluster j, $N_j$ is the number of objects belonging to cluster j, $x_{jt}$ is the tth object belonging to cluster j, and $d(x_{jt}, w_j)$ is the distance between object $x_{jt}$ and the centre $w_j$ of cluster j.
Fig. 3 Variations in the two-ring data set
Each cluster is represented by its distortion, and its impact on the entire data set is assessed by its contribution to the sum of all distortions, $S_K$, given by
$$S_K = \sum_{j=1}^{K} I_j \qquad (1b)$$
where K is the specified number of clusters. Thus, such information is important in assessing whether a particular region in the object space could be considered a cluster.
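Equations (1a) and (1b) can be sketched in a few lines of NumPy. The sketch assumes that d is the Euclidean distance, which the text above does not state explicitly; the function names and the four-point data set are illustrative only.

```python
import numpy as np

def cluster_distortions(X, labels, centres):
    """Per-cluster distortions I_j of equation (1a), assuming Euclidean distance."""
    X = np.asarray(X, dtype=float)
    centres = np.asarray(centres, dtype=float)
    labels = np.asarray(labels)
    # Squared distance of each object to the centre of its own cluster.
    sq_dist = np.sum((X - centres[labels]) ** 2, axis=1)
    # Sum the squared distances cluster by cluster.
    return np.bincount(labels, weights=sq_dist, minlength=len(centres))

def total_distortion(X, labels, centres):
    """Sum of all cluster distortions S_K of equation (1b)."""
    return float(cluster_distortions(X, labels, centres).sum())

# Four objects in two clusters; every object lies at distance 1 from its centre.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centres = np.array([[0.0, 1.0], [4.0, 1.0]])

I = cluster_distortions(X, labels, centres)
S_K = total_distortion(X, labels, centres)
print(I, S_K)
```

In this toy case each cluster contributes half of $S_K$, so neither region dominates the total distortion.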
3.4 Constraints on f(K)
The robustness of f(K) is very important. Because this function is based on the result of the clustering algorithm, it is important for this result to vary as little as possible when K remains unchanged. However, one of the main deficiencies of the K-means approach is its dependence on randomness. Thus, the algorithm should yield consistent results so that its performance can be used as a variable in the evaluation function. A new version of the K-means algorithm, namely the incremental K-means algorithm [29], satisfies this requirement and can be adopted for this purpose.
The role of f(K) is to reveal trends in the data distribution, and therefore it is important to keep it independent of the number of objects. The number of clusters K is assumed to be much smaller than the number of objects N. When K increases, f(K) should converge to some constant value. Then, if for any intermediate K, f(K) exhibits a special behaviour, such as a minimum or maximum point, that value of K could be taken as the desired number of clusters.
4 NUMBER OF CLUSTERS FOR K-MEANS CLUSTERING
As mentioned in section 3.3, cluster analysis is used to find irregularities in the data distribution. When the data distribution is uniform, there is not any irregularity. Therefore, data sets with uniform distribution could be used to calibrate and verify the clustering result. This approach was applied by Tibshirani et al. [30]. A data set of the same dimension as the actual data set and with a uniform distribution was generated. The clustering performance on this artificial data set was then compared with the result obtained for the actual data set. A measure known as the 'gap' statistic [30] was employed to assess performance. In this work, instead of generating an artificial data set, the clustering performance for the artificial data set was estimated. Also, instead of the gap statistic, a new and more discriminatory measure was employed for evaluating the clustering result.
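For comparison, the reference-data step that the present work avoids can be sketched as follows. The sketch assumes that the uniform data set is drawn over the bounding box of the actual data; the function name `uniform_reference`, the seed and the sample data are illustrative assumptions rather than details taken from reference [30].

```python
import numpy as np

def uniform_reference(X, seed=0):
    """Draw a uniform data set with the same shape as X over the bounding box of X."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    low = X.min(axis=0)    # per-attribute minimum
    high = X.max(axis=0)   # per-attribute maximum
    return rng.uniform(low, high, size=X.shape)

# Illustrative two-dimensional data set with four objects.
X = np.array([[0.0, 10.0], [1.0, 12.0], [4.0, 11.0], [2.0, 15.0]])
R = uniform_reference(X)
print(R.shape)
```

Clustering `R` and `X` with the same K and comparing the resulting distortions is the kind of calibration described above; the evaluation function defined below replaces this sampling step with an estimate.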
When the K-means algorithm is applied to data with a uniform distribution and K is increased by 1, the clusters are likely to change and, in the new positions, the partitions will again be approximately equal in size and their distortions similar to one another. The evaluations carried out in reference [29] showed that, when a new cluster is inserted into a cluster (K = 1) with a hypercuboid shape and a uniform distribution, the decrease in the sum of distortions is proportional to the original sum of distortions. This conclusion was found to be correct for clustering results obtained with relatively small values of K. In such cases, the sum of distortions after the increase in the number of clusters could be estimated from the current value.
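The evaluation function f(K) is defined using the equations
$$
f(K) =
\begin{cases}
1 & \text{if } K = 1 \\
\dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ \forall K > 1 \\
1 & \text{if } S_{K-1} = 0,\ \forall K > 1
\end{cases}
\qquad (2)
$$
$$
\alpha_K =
\begin{cases}
1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \qquad (3a) \\
\alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1 \qquad (3b)
\end{cases}
$$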
where $S_K$ is the sum of the cluster distortions when the number of clusters is K, $N_d$ is the number of data set attributes (i.e. the number of dimensions) and $\alpha_K$ is a weight factor. The term $\alpha_K S_{K-1}$ in equation (2) is an estimate of $S_K$ based on $S_{K-1}$, made with the assumption that the data have a uniform distribution. The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform. When there are areas of concentration in the data distribution, $S_K$ will be less than the estimated value, so that f(K) decreases. The smaller that f(K) is, the more concentrated is the data distribution. Thus, values of K that yield small f(K) can be regarded as giving well-defined clusters.
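Equations (2) and (3) translate directly into code. The following is a minimal sketch: the function names are hypothetical, and the distortion values in the example are made up purely to exercise the formulas (they are not results from this study).

```python
def weight_factors(k_max, n_dims):
    """Weight factors alpha_K for K = 2..k_max from equations (3a) and (3b)."""
    if n_dims <= 1:
        raise ValueError("equations (3a) and (3b) are defined for N_d > 1")
    alpha = {}
    for k in range(2, k_max + 1):
        if k == 2:
            alpha[k] = 1.0 - 3.0 / (4.0 * n_dims)                  # equation (3a)
        else:
            alpha[k] = alpha[k - 1] + (1.0 - alpha[k - 1]) / 6.0   # equation (3b)
    return alpha

def evaluation_function(distortions, n_dims):
    """Return {K: f(K)} from equation (2); distortions[i] holds S_K for K = i + 1."""
    k_max = len(distortions)
    alpha = weight_factors(k_max, n_dims)
    f = {1: 1.0}
    for k in range(2, k_max + 1):
        s_prev = distortions[k - 2]   # S_{K-1}
        s_k = distortions[k - 1]      # S_K
        f[k] = 1.0 if s_prev == 0 else s_k / (alpha[k] * s_prev)
    return f

# Made-up distortions S_1, S_2, S_3 for a two-dimensional data set.
alpha = weight_factors(3, n_dims=2)
f = evaluation_function([100.0, 25.0, 20.0], n_dims=2)
print(alpha, f)
```

With these illustrative numbers, $\alpha_2 = 0.625$ and f(2) = 25/(0.625 × 100) = 0.4, which would indicate a strong concentration at K = 2.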
The weight factor $\alpha_K$, defined in equation (3), is a positive number less than or equal to 1 and is applied to reduce the effect of dimensions. With K = 2, $\alpha_K$ is computed using equation (3a). This equation is derived from equation (7) in reference [29], which shows that the decrease in distortion is inversely proportional to the number of dimensions $N_d$.
As K increases above 2, the decrease in the sum of distortions reduces (the ratio $S_K/S_{K-1}$ approaches 1), as can be seen in Fig. 4. This figure shows the values of $S_K/S_{K-1}$ computed for different K when the clustering algorithm is applied to data sets of different dimensions and with uniform distributions. With such data sets, f(K) is expected to be equal to 1 and $\alpha_K$ should be chosen to equate f(K) to 1. From equation (2), $\alpha_K$ should therefore be $S_K/S_{K-1}$ and thus obtainable from Fig. 4. However, for computational simplicity, the recursion equation (3b) has been derived from the data represented in Fig. 4 to calculate $\alpha_K$. Figure 5 shows that the values of $\alpha_K$ obtained from equation (3b) fit the plots in Fig. 4 closely.
The proposed function f(K) satisfies the constraints mentioned in the previous section. The robustness of f(K) will be verified experimentally in the next section. When the number of objects is doubled or tripled but their distributions are unchanged, the resultant clusters remain in the same position. $S_K$ and $S_{K-1}$ are doubled or tripled correspondingly, so that f(K) stays constant. Therefore, generally, f(K) is independent of the number of objects in the data set.
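The scaling argument can be checked numerically with a small sketch. It assumes Euclidean distance and keeps the cluster centres and assignments fixed when the objects are duplicated, mirroring the statement that the clusters remain in the same position; the data and names are illustrative.

```python
import numpy as np

def total_distortion(X, labels, centres):
    """Sum of squared Euclidean distances of the objects to their cluster centres."""
    diffs = X - centres[labels]
    return float(np.sum(diffs ** 2))

X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
one_cluster = np.zeros(4, dtype=int)            # K = 1 assignment
two_clusters = np.array([0, 0, 1, 1])           # K = 2 assignment
centre_1 = np.array([[2.0, 1.0]])               # mean of all objects
centres_2 = np.array([[0.0, 1.0], [4.0, 1.0]])  # means of the two groups

ratio = (total_distortion(X, two_clusters, centres_2)
         / total_distortion(X, one_cluster, centre_1))

# Duplicate every object: the centres stay put and both sums double.
X2 = np.tile(X, (2, 1))
ratio_doubled = (total_distortion(X2, np.tile(two_clusters, 2), centres_2)
                 / total_distortion(X2, np.tile(one_cluster, 2), centre_1))
print(ratio, ratio_doubled)
```

Because $\alpha_K$ depends only on K and $N_d$, an unchanged ratio $S_2/S_1$ implies an unchanged f(2).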
To reduce the effect of differences in the ranges of the attributes, data are normalized before the clustering starts. However, it should be noted that, when the data have well-separated groups of objects, the shape of such regions in the problem space has an effect on the evaluation function. In these cases, the normalization does not influence the local object distribution because it is a scaling technique that applies to the whole data set.
5 PERFORMANCE
The evaluation function f(K) is tested in a series of experiments on the artificially generated data sets shown in Fig. 6. All data are normalized before the incremental K-means algorithm is applied with K ranging from 1 to 19. f(K) is calculated on the basis of the total distortion of the clusters.
In Figs 6a–c, all objects belong to a single region with a uniform distribution. The graph in Fig. 6a shows that f(K) reflects well the clustering result on this data set with a uniform distribution because f(K) is approximately constant and equal to 1 for all K. When K = 4 and K = 3 in Figs 6a and b respectively, f(K) reaches minimum values. This could be attributed to the shape of the areas defined by the objects belonging to these data sets. However, the minimum values of f(K) do not differ from the average value significantly enough for any strong recommendation to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that $\alpha_K$ reduces the effect of the data set dimensions on the evaluation function.
Fig. 4 The ratio $S_K/S_{K-1}$ for data sets having uniform distributions: (a) two-dimensional 'square' and 'circle'; (b) four-dimensional 'cube' and 'sphere'
Fig. 5 Comparison of the values of $\alpha_K$ calculated using equation (3b) and the ratio $S_K/S_{K-1}$
Fig. 6 Data sets and their corresponding f(K)
For the data set in Fig. 6d, again all objects are concentrated in a single region with a normal distribution. The f(K) plot for this data set suggests correctly that, when K = 1, the clustering result is the most suitable for this data set.
The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e, the two generators have an overlapping region but, in Fig. 6f, they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.
Fig. 7 f(K) for the 12 benchmark data sets
The data sets in Figs 6g and h have three recognizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.
Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case, f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions, but K = 3 could also be used to cluster the objects.
Figures 6j and k illustrate how the level of detail could affect the selection of K. f(K) reaches minimum values at K = 2 and 4 respectively. In such cases, users could select the most appropriate value of K based on their specific requirements. A more complex case is shown in Fig. 6l, where there is a possible K value of 4 or 8. The selection of a particular K will depend on the requirements of the specific application for which the clustering is carried out.
The data sets in Figs 6m–o have well-defined regions in the object space, each of which has a different distribution, location and number of objects. If the minimum value of f(K) is used to cluster the objects, K will be different from the number of generators utilized to create them (as in the case of the clusters in Fig. 6o) or the number of object groupings that could be identified visually (as in the case of the clusters in Figs 6m and n). The reason for the difference varies with different cases. For example, it could be considered that there are five clusters in Fig. 6m because the cluster distances are smaller for the two leftmost pairs of clusters than for others, and the clusters in those pairs could be merged together. However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters, and the final decision as to which value to adopt has to be left to the discretion of the user.
From the graphs in Fig. 6, it can be concluded that any K with a corresponding f(K) < 0.85 could be recommended for clustering. If there is no value of K with f(K) < 0.85, K = 1 is selected.
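The selection rule amounts to a simple filter over the computed f(K) values. In the sketch below, the function name `recommend_k` and the sample f(K) values are illustrative assumptions; only the 0.85 threshold and the fallback to K = 1 come from the text.

```python
def recommend_k(f_values, threshold=0.85):
    """Return every K whose f(K) lies below the threshold, or [1] if there is none."""
    candidates = sorted(k for k, f_k in f_values.items() if f_k < threshold)
    return candidates if candidates else [1]

# Illustrative f(K) values, not taken from the experiments.
concentrated = {1: 1.0, 2: 0.4, 3: 0.8, 4: 1.02}
near_uniform = {1: 1.0, 2: 0.95, 3: 0.97}

print(recommend_k(concentrated))
print(recommend_k(near_uniform))
```

Returning all qualifying values, rather than only the minimum, reflects the point made in section 3.2 that more than one K may be offered and the final choice left to the user.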
The proposed function f(K) is also applied to 12 benchmark data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are as given in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.
The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.
6 CONCLUSION
Existing methods of selecting the number of clusters for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.
A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project, supported by the Engineering and Physical Sciences Research Council, and the SUPERMAN Project, supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the IPROMS Network of Excellence funded by the European Commission.
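Table 2 The recommended number of clusters based on f(K)
| Data sets | Proposed number of clusters |
|---|---|
| Australian | 1 |
| Balance-scale | 1 |
| Car evaluation | 2, 3, 4 |
| Cmc | 1 |
| Ionosphere | 2 |
| Iris | 2, 3 |
| Page blocks | 2 |
| Pima | 1 |
| Wdbc | 2 |
| Wine | 3 |
| Yeast | 1 |
| Zoo | 2 |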
APPENDIX
Notation
$A$, $B$                  clusters
$d(x_{jt}, w_j)$          distance between object $x_{jt}$ and the centre $w_j$ of cluster j
$f(K)$                    evaluation function
$G_A$, $G_B$              generators
$I_j$                     distortion of cluster j
$K$                       number of clusters
$N$                       number of objects in the data set
$N_d$                     number of data set attributes (the dimension of the data set)
$N_j$                     number of objects belonging to cluster j
$P_{G_A}$, $P_{G_B}$      probabilities that X is created by $G_A$ or $G_B$ respectively
$P_{C_A}$, $P_{C_B}$      probabilities that X is clustered into A or B respectively
$S_K$                     sum of all distortions with K being the specified number of clusters
$X$                       object
$x_{jt}$                  object belonging to cluster j
$w_j$                     centre of cluster j
$\alpha_K$                weight factor
Selection of K in K-means clustering 119
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
2.4 Values of K equated to the number of generators
Synthetic data sets, which are used for testing algorithms, are often created by a set of normal or uniform distribution generators. Then, clustering algorithms are applied to those data sets with the number of clusters equated to the number of generators. It is assumed that any resultant cluster will cover all objects created by a particular generator. Thus, the clustering performance is judged on the basis of the difference between objects covered by a cluster and those created by the corresponding generator. Such a difference can be measured by simply counting objects or calculating the information gain [7].
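The object-counting comparison described above can be sketched as follows. This is a minimal illustration, not a procedure taken from reference [7]: the function name `count_mismatched_objects` and the rule that the most frequent generator in a cluster is its "corresponding" generator are assumptions made for the example.

```python
from collections import Counter

def count_mismatched_objects(generator_ids, cluster_ids):
    """Count objects that do not come from the majority generator of their cluster."""
    # Collect the generator labels of the objects that fall into each cluster.
    members = {}
    for gen, clu in zip(generator_ids, cluster_ids):
        members.setdefault(clu, []).append(gen)
    mismatched = 0
    for gens in members.values():
        # Assumption: the most common generator in a cluster is the one it corresponds to.
        majority_count = Counter(gens).most_common(1)[0][1]
        mismatched += len(gens) - majority_count
    return mismatched

# Hypothetical labels: six objects from generators GA and GB, grouped into clusters 0 and 1.
generators = ["GA", "GA", "GA", "GB", "GB", "GB"]
clusters = [0, 0, 1, 1, 1, 1]
mismatches = count_mismatched_objects(generators, clusters)
print(mismatches)
```

A count of zero would mean that every cluster covers objects from one generator only; the count grows as objects from different generators are mixed inside a cluster.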
There are drawbacks with this method. The first drawback concerns the stability of the clustering results when there are areas in the object space that contain objects created by different generators. Figure 1a illustrates such a case. The data set shown in this figure has two clusters, A and B, which cover objects generated by generators $G_A$ and $G_B$ respectively. Object X is in an overlapping area between clusters A and B. X has probabilities $P_{G_A}$ and $P_{G_B}$ of being created by $G_A$ and $G_B$ respectively, and probabilities $P_{C_A}$ and $P_{C_B}$ of being included into clusters A and B respectively. All four probabilities are larger than 0. Thus, there is a chance for X to be created by generator $G_A$ but covered by cluster B, and vice versa. In such cases, the clustering results will not be perfect. The stability of the clustering results depends on these four probabilities. With an increase in the overlapping areas in the object space, the stability of the clustering results decreases.
The difference between the characteristics of the generators also has an effect on the clustering results. In Fig. 1b, where the number of objects of cluster A is five times larger than that of cluster B, the smaller cluster B might be regarded as noise and all objects might be grouped into one cluster. Such a clustering outcome would differ from that obtained by visual inspection.

Unfortunately, this method of selecting K cannot be applied to practical problems. The data distribution in practical problems is unknown and also the number of generators cannot be specified.
2.5 Values of K determined by statistical measures

There are several statistical measures available for selecting K. These measures are often applied in combination with probabilistic clustering approaches. They are calculated with certain assumptions about the underlying distribution of the data. The Bayesian information criterion or Akaike's information criterion [14, 17] is calculated on data sets which are constructed by a set of Gaussian distributions. The measures applied by Hardy [23] are based on the assumption that the data set fits the Poisson distribution. Monte Carlo techniques, which are associated with the null hypothesis, are used for assessing the clustering results and also for determining the number of clusters [24, 25].
There have been comparisons between probabilistic and partitioning clustering [7]. Expectation-maximization (EM) is often recognized as a typical method for probabilistic clustering. Similarly, K-means clustering is considered a typical method for partitioning clustering. Although EM and K-means clustering share some common ideas, they are based on different hypotheses, models and criteria. Probabilistic clustering methods do not take into account the distortion inside a cluster, so that a cluster created by applying such methods may not correspond to a cluster in partitioning clustering, and vice versa. Therefore, statistical measures used in probabilistic methods are not applicable in the K-means algorithm. In addition, the assumptions about the underlying distribution cannot be verified on real data sets and therefore cannot be used to obtain statistical measures.

Fig. 1 Effect of the relationship between clusters on the clustering for two object spaces in which (a) an area exists that contains objects created by two different generators and (b) there are no overlapping areas: A, objects generated by $G_A$; D, objects generated by $G_B$
2.6 Values of K equated to the number of classes

With this method, the number of clusters is equated to the number of classes in the data sets. A data-clustering algorithm can be used as a classifier by applying it to data sets from which the class attribute is omitted and then assessing the clustering results using the omitted class information [26, 27]. The outcome of the assessment is fed back to the clustering algorithm to improve its performance. In this way, the clustering can be considered to be supervised.
With this method of determining the number of clusters, the assumption is made that the data-clustering method could form clusters, each of which would consist of only objects belonging to one class. Unfortunately, most real problems do not satisfy this assumption.
2.7 Values of K determined through visualization

Visual verification is applied widely because of its simplicity and explanation possibilities. Visual examples are often used to illustrate the drawbacks of an algorithm or to present the expected clustering results [5, 27].
The assessment of a clustering result using visualization techniques depends heavily on their implicit nature. The clustering models utilized by some clustering methods may not be appropriate for particular data sets. The data sets in Fig. 2 are illustrations of such cases. The application of visualization techniques implies a data distribution continuity in the expected clusters. If the K-means approach is applied to such data sets, there is not any cluster that satisfies the K-means clustering model and at the same time corresponds to a particular object grouping in the illustrated data sets. Therefore, the K-means algorithm cannot produce the expected clustering results. This suggests that the K-means approach is unsuitable for such data sets.

Fig. 2 Data sets inappropriate for the K-means approach: (a) data sets with four clusters [5]; (b) data sets with three clusters [23]; (c) data sets with eight clusters [27]. Note that the number of clusters in each data set was specified by the respective researchers
The characteristics of the data sets in Fig. 2 (position, shape, size and object distribution) are implicitly defined. This makes the validation of the clustering results difficult. Any slight changes in the data characteristics may lead to different outcomes. The data set in Fig. 2b is an illustration of such a case. Another example is the series of data sets in Fig. 3. Although two clusters are easily identifiable in the data set in Fig. 3a, the numbers of clusters in the data sets in Figs 3b and c depend on the distance between the rings and the object density of each ring. Usually, such parameters are not explicitly defined when a visual check is carried out.

In spite of the above-mentioned deficiencies, visualization of the results is still a useful method of selecting K and validating the clustering results when the data sets do not violate the assumptions of the clustering model. In addition, this method is recommended in cases where the expected results could be identified explicitly.
2.8 Values of K determined using a neighbourhood measure

A neighbourhood measure could be added to the cost function of the K-means algorithm to determine K [26]. Although this technique has shown promising results for a few data sets, it needs to prove its potential in practical applications. Because the cost function has to be modified, this technique cannot be applied to the original K-means algorithm.
3 FACTORS AFFECTING THE SELECTION OF K
A function f(K) for evaluating the clustering result could be used to select the number of clusters. Factors that such a function should take into account are discussed in this section.

3.1 Approach bias

The evaluation function should be related closely to the clustering criteria. As mentioned previously, such a relation could prevent adverse effects on the validation process. In particular, in the K-means algorithm, the criterion is the minimization of the distortion of clusters, so that the evaluation function should take this parameter into account.
3.2 Level of detail

In general, observers that could see relatively low levels of detail would obtain only an overview of an object. By increasing the level of detail, they could gain more information about the observed object but, at the same time, the amount of data that they have to process increases. Because of resource limitations, a high level of detail is normally used only to examine parts of the object [28].

Such an approach could be applied in clustering. A data set with n objects could be grouped into any number of clusters between 1 and n, which would correspond to the lowest and the highest levels of detail respectively. By specifying different K values, it is possible to assess the results of grouping objects into various numbers of clusters. From this evaluation, more than one K value could be recommended to users, but the final selection is made by them.
3.3 Internal distribution versus global impact

Clustering is used to find irregularities in the data distribution and to identify regions in which objects are concentrated. However, not every region with a high concentration of objects is considered a cluster. For a region to be identified as a cluster, it is important to analyse not only its internal distribution but also its interdependence with other object groupings in the data set.
In K-means clustering, the distortion of a cluster is a function of the data population and the distance between objects and the cluster centre according to

$$ I_j = \sum_{t=1}^{N_j} [d(x_{jt}, w_j)]^2 \tag{1a} $$

where $I_j$ is the distortion of cluster j, $w_j$ is the centre of cluster j, $N_j$ is the number of objects belonging to cluster j, $x_{jt}$ is the tth object belonging to cluster j and $d(x_{jt}, w_j)$ is the distance between object $x_{jt}$ and the centre $w_j$ of cluster j.
Each cluster is represented by its distortion and its impact on the entire data set is assessed by its contribution to the sum of all distortions $S_K$, given by

$$ S_K = \sum_{j=1}^{K} I_j \tag{1b} $$

where K is the specified number of clusters. Thus, such information is important in assessing whether a particular region in the object space could be considered a cluster.

Fig. 3 Variations in the two-ring data set
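Equations (1a) and (1b) translate directly into a few lines of code. The sketch below assumes that d is the Euclidean distance, which the text does not state explicitly; the function name `cluster_distortions` and the small data set are illustrative only.

```python
def cluster_distortions(points, labels, centres):
    """Return (I, S_K): the distortion I_j of every cluster and their sum S_K."""
    distortions = [0.0] * len(centres)
    for point, j in zip(points, labels):
        # Squared Euclidean distance between the object and the centre of its cluster.
        distortions[j] += sum((p - c) ** 2 for p, c in zip(point, centres[j]))
    # Equation (1b): S_K is the sum of the distortions of all K clusters.
    return distortions, sum(distortions)

# Hypothetical two-cluster example in two dimensions.
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]]
labels = [0, 0, 1, 1]
centres = [[1.0, 0.0], [10.0, 2.0]]
I, S_K = cluster_distortions(points, labels, centres)
print(I, S_K)
```

Each entry of `I` shows how much one cluster contributes to `S_K`, which is the quantity the evaluation function is later built on.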
3.4 Constraints on f(K)

The robustness of f(K) is very important. Because this function is based on the result of the clustering algorithm, it is important for this result to vary as little as possible when K remains unchanged. However, one of the main deficiencies of the K-means approach is its dependence on randomness. Thus, the algorithm should yield consistent results so that its performance can be used as a variable in the evaluation function. A new version of the K-means algorithm, namely the incremental K-means algorithm [29], satisfies this requirement and can be adopted for this purpose.

The role of f(K) is to reveal trends in the data distribution and therefore it is important to keep it independent of the number of objects. The number of clusters K is assumed to be much smaller than the number of objects N. When K increases, f(K) should converge to some constant value. Then, if, for any intermediate K, f(K) exhibits a special behaviour, such as a minimum or maximum point, that value of K could be taken as the desired number of clusters.
4 NUMBER OF CLUSTERS FOR K-MEANS CLUSTERING
As mentioned in section 3.3, cluster analysis is used to find irregularities in the data distribution. When the data distribution is uniform, there is not any irregularity. Therefore, data sets with uniform distribution could be used to calibrate and verify the clustering result. This approach was applied by Tibshirani et al. [30]. A data set of the same dimension as the actual data set and with a uniform distribution was generated. The clustering performance on this artificial data set was then compared with the result obtained for the actual data set. A measure known as the 'gap' statistic [30] was employed to assess performance. In this work, instead of generating an artificial data set, the clustering performance for the artificial data set was estimated. Also, instead of the gap statistic, a new and more discriminatory measure was employed for evaluating the clustering result.
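For contrast with the estimate used in this work, the reference data set that the approach of reference [30] generates could be produced along the following lines. This is a sketch only: drawing uniformly from the bounding box of the actual data is an assumption of the example, and `uniform_reference` is a hypothetical helper name.

```python
import random

def uniform_reference(points, seed=0):
    """Draw len(points) objects uniformly from the bounding box of `points`."""
    rng = random.Random(seed)
    # One (minimum, maximum) pair per attribute of the actual data set.
    bounds = [(min(col), max(col)) for col in zip(*points)]
    # Same number of objects and same dimension as the actual data set.
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in points]

# Hypothetical actual data set with two attributes.
data = [[0.0, 10.0], [4.0, 30.0], [2.0, 20.0]]
reference = uniform_reference(data)
print(reference)
```

Clustering both `data` and `reference` with the same K and comparing the results is the step that the proposed method replaces with an estimate.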
When the K-means algorithm is applied to data with a uniform distribution and K is increased by 1, the clusters are likely to change and, in the new positions, the partitions will again be approximately equal in size and their distortions similar to one another. The evaluations carried out in reference [29] showed that, when a new cluster is inserted into a cluster (K = 1) with a hypercuboid shape and a uniform distribution, the decrease in the sum of distortions is proportional to the original sum of distortions. This conclusion was found to be correct for clustering results obtained with relatively small values of K. In such cases, the sum of distortions after the increase in the number of clusters could be estimated from the current value.
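The evaluation function f(K) is defined using the equations

$$ f(K) = \begin{cases} 1 & \text{if } K = 1 \\ \dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ \forall K > 1 \\ 1 & \text{if } S_{K-1} = 0,\ \forall K > 1 \end{cases} \tag{2} $$

$$ \alpha_K = \begin{cases} 1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \qquad \text{(3a)} \\ \alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1 \qquad \text{(3b)} \end{cases} $$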
where $S_K$ is the sum of the cluster distortions when the number of clusters is K, $N_d$ is the number of data set attributes (i.e. the number of dimensions) and $\alpha_K$ is a weight factor. The term $\alpha_K S_{K-1}$ in equation (2) is an estimate of $S_K$ based on $S_{K-1}$, made with the assumption that the data have a uniform distribution. The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform. When there are areas of concentration in the data distribution, $S_K$ will be less than the estimated value, so that f(K) decreases. The smaller that f(K) is, the more concentrated is the data distribution. Thus, values of K that yield small f(K) can be regarded as giving well-defined clusters.
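A minimal sketch of equations (2) and (3) is given below. It assumes that the sums of distortions for K = 1, 2, ... have already been obtained from a clustering run; the function names and the three input values are hypothetical and serve only to show the arithmetic.

```python
def weight_factors(k_max, n_dims):
    """Return {K: alpha_K} for K = 2..k_max, following equations (3a) and (3b)."""
    if n_dims <= 1:
        raise ValueError("equation (3) is stated only for N_d > 1")
    alphas = {}
    for k in range(2, k_max + 1):
        if k == 2:
            alphas[k] = 1.0 - 3.0 / (4.0 * n_dims)                    # equation (3a)
        else:
            alphas[k] = alphas[k - 1] + (1.0 - alphas[k - 1]) / 6.0   # equation (3b)
    return alphas

def evaluation_function(distortion_sums, n_dims):
    """distortion_sums[i] is S_K for K = i + 1; returns [f(1), f(2), ...]."""
    alphas = weight_factors(len(distortion_sums), n_dims)
    f_values = []
    for i, s_k in enumerate(distortion_sums):
        k = i + 1
        if k == 1 or distortion_sums[i - 1] == 0:
            f_values.append(1.0)   # first and third cases of equation (2)
        else:
            f_values.append(s_k / (alphas[k] * distortion_sums[i - 1]))
    return f_values

# Hypothetical sums of distortions S_1, S_2, S_3 for a two-dimensional data set.
f = evaluation_function([100.0, 20.0, 15.0], n_dims=2)
print(f)
```

With these illustrative inputs, the sharp drop from S_1 to S_2 gives a small f(2), whereas the modest drop from S_2 to S_3 gives an f(3) above 1.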
The weight factor $\alpha_K$, defined in equation (3), is a positive number less than or equal to 1 and is applied to reduce the effect of dimensions. With K = 2, $\alpha_K$ is computed using equation (3a). This equation is derived from equation (7) in reference [29], which shows that the decrease in distortion is inversely proportional to the number of dimensions $N_d$.
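The form of equation (3a) can be checked with a short calculation. The following is only a sketch under an assumption that is not taken from reference [29]: the data fill a hypercube of side $L$ uniformly and, for K = 2, the split falls across the middle of one axis. The variance of a uniform distribution over an interval of length $L$ is $L^2/12$, so the expected distortion per object is

$$ \frac{S_1}{N} = N_d \frac{L^2}{12}, \qquad \frac{S_2}{N} = (N_d - 1)\frac{L^2}{12} + \frac{(L/2)^2}{12} $$

and therefore

$$ \frac{S_2}{S_1} = \frac{N_d - 1 + \tfrac{1}{4}}{N_d} = 1 - \frac{3}{4 N_d} $$

which is the value that equation (3a) assigns to the weight factor for K = 2.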
As K increases above 2, the decrease in the sum of distortions reduces (the ratio $S_K/S_{K-1}$ approaches 1), as can be seen in Fig. 4. This figure shows the values of $S_K/S_{K-1}$ computed for different K when the clustering algorithm is applied to data sets of different dimensions and with uniform distributions. With such data sets, f(K) is expected to be equal to 1 and $\alpha_K$ should be chosen to equate f(K) to 1. From equation (2), $\alpha_K$ should therefore be $S_K/S_{K-1}$ and thus obtainable from Fig. 4. However, for computational simplicity, the recursion equation (3b) has been derived from the data represented in Fig. 4 to calculate $\alpha_K$. Figure 5 shows that the values of $\alpha_K$ obtained from equation (3b) fit the plots in Fig. 4 closely.
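The behaviour of the recursion can be made explicit by unrolling it. Rearranging equation (3b) shows that the gap between the weight factor and 1 shrinks by a factor of 5/6 at every step, which gives the closed form below (a worked consequence of equations (3a) and (3b), not a formula stated in the paper):

$$ 1 - \alpha_K = \frac{5}{6}\,(1 - \alpha_{K-1}) \quad\Longrightarrow\quad \alpha_K = 1 - \left(\frac{5}{6}\right)^{K-2} \frac{3}{4 N_d}, \qquad K \geq 2,\ N_d > 1 $$

For example, with $N_d = 2$ this gives $\alpha_2 = 0.625$, $\alpha_3 = 0.6875$ and $\alpha_4 \approx 0.7396$, so the weight factor rises monotonically towards 1 as K increases.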
The proposed function f(K) satisfies the constraints mentioned in the previous section. The robustness of f(K) will be verified experimentally in the next section. When the number of objects is doubled or tripled but their distributions are unchanged, the resultant clusters remain in the same position. $S_K$ and $S_{K-1}$ are doubled or tripled correspondingly, so that f(K) stays constant. Therefore, generally, f(K) is independent of the number of objects in the data set.
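This independence can be written out in one line. As a sketch, assume that every object is replicated $m$ times and that the cluster centres do not move ($m$ is an illustrative symbol, not part of the paper's notation). Every sum of distortions is then multiplied by $m$, and the factor cancels in equation (2):

$$ f'(K) = \frac{m\,S_K}{\alpha_K\, m\,S_{K-1}} = \frac{S_K}{\alpha_K S_{K-1}} = f(K) $$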
To reduce the effect of differences in the ranges of the attributes, data are normalized before the clustering starts. However, it should be noted that, when the data have well-separated groups of objects, the shape of such regions in the problem space has an effect on the evaluation function. In these cases, the normalization does not influence the local object distribution because it is a scaling technique that applies to the whole data set.
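The text does not say which normalization is used. One common choice, shown here purely as an assumption, is min-max scaling of every attribute to the range [0, 1]; the helper name `min_max_normalize` is hypothetical.

```python
def min_max_normalize(points):
    """Rescale every attribute (column) of the data set to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        # A constant attribute (span 0) carries no information and is mapped to 0.
        [(value - low) / span if span > 0 else 0.0
         for value, low, span in zip(point, lows, spans)]
        for point in points
    ]

# Hypothetical data set whose two attributes have very different ranges.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]]
normalized = min_max_normalize(data)
print(normalized)
```

Because the same scale factor is applied to every object, the relative positions of objects inside a group are preserved, which is the point made in the paragraph above.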
5 PERFORMANCE
The evaluation function f(K) is tested in a series of experiments on the artificially generated data sets shown in Fig. 6. All data are normalized before the incremental K-means algorithm is applied with K ranging from 1 to 19. f(K) is calculated on the basis of the total distortion of the clusters.
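The experimental loop can be sketched end to end as follows. Note the substitution: the code uses plain Lloyd-style K-means with random restarts, not the incremental K-means algorithm of reference [29], so it only approximates the procedure described here. The synthetic two-blob data set, the helper names and the restart count are assumptions of the example, and the normalization step is omitted for brevity.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_distortion(points, k, rng, iterations=50):
    """Run plain Lloyd K-means once and return the sum of distortions S_K."""
    centres = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every object to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centres[j]))
            groups[nearest].append(p)
        # Update step: move each centre to the mean of its objects
        # (an empty cluster keeps its previous centre).
        new_centres = [
            [sum(col) / len(g) for col in zip(*g)] if g else centres[j]
            for j, g in enumerate(groups)
        ]
        if new_centres == centres:
            break
        centres = new_centres
    # Sum of squared distances from every object to its nearest centre.
    return sum(min(squared_distance(p, c) for c in centres) for p in points)

def f_values(points, k_max, restarts=10, seed=0):
    """Return [f(1), ..., f(k_max)] using equations (2) and (3)."""
    rng = random.Random(seed)
    n_dims = len(points[0])
    # Lowest S_K found over several random restarts, for K = 1..k_max.
    sums = [min(kmeans_distortion(points, k, rng) for _ in range(restarts))
            for k in range(1, k_max + 1)]
    f, alpha = [1.0], None
    for k in range(2, k_max + 1):
        alpha = 1 - 3 / (4 * n_dims) if k == 2 else alpha + (1 - alpha) / 6
        prev = sums[k - 2]
        f.append(sums[k - 1] / (alpha * prev) if prev != 0 else 1.0)
    return f

# Hypothetical data: two well-separated normal generators in two dimensions.
rng = random.Random(1)
blob_a = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(50)]
blob_b = [[rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)] for _ in range(50)]
f = f_values(blob_a + blob_b, k_max=5)
for k, value in enumerate(f, start=1):
    print(k, round(value, 3))
```

Taking the best of several restarts is a simple way of reducing the dependence on random initialization that section 3.4 identifies; the incremental algorithm in reference [29] addresses the same issue differently.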
In Figs 6a–c, all objects belong to a single region with a uniform distribution. The graph in Fig. 6a shows that f(K) reflects well the clustering result on this data set with a uniform distribution because f(K) is approximately constant and equal to 1 for all K. When K = 4 and K = 3 in Figs 6a and b respectively, f(K) reaches minimum values. This could be attributed to the shape of the areas defined by the objects belonging to these data sets. However, the minimum values of f(K) do not differ significantly from the average value for any strong recommendations to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that $\alpha_K$ reduces the effect of the data set dimensions on the evaluation function.

Fig. 4 The ratio $S_K/S_{K-1}$ for data sets having uniform distributions: (a) two-dimensional 'square' and 'circle'; (b) four-dimensional 'cube' and 'sphere'

Fig. 5 Comparison of the values of $\alpha_K$ calculated using equation (3b) and the ratio $S_K/S_{K-1}$

Fig. 6 Data sets and their corresponding f(K)
For the data set in Fig. 6d, again, all objects are concentrated in a single region with a normal distribution. The f(K) plot for this data set suggests correctly that, when K = 1, the clustering result is the most suitable for this data set.
The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e, the two generators have an overlapping region but, in Fig. 6f, they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.

Fig. 7 f(K) for the 12 benchmark data sets
The data sets in Figs 6g and h have three recognizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.
Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case, f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions but K = 3 could also be used to cluster the objects.
Figures 6j and k illustrate how the level of detail could affect the selection of K. f(K) reaches minimum values at K = 2 and 4 respectively. In such cases, users could select the most appropriate value of K based on their specific requirements. A more complex case is shown in Fig. 6l, where there is a possible K value of 4 or 8. The selection of a particular K will depend on the requirements of the specific application for which the clustering is carried out.

The data sets in Figs 6m–o have well-defined regions in the object space, each of which has a different distribution, location and number of objects. If the minimum value of f(K) is used to cluster the objects, K will be different from the number of generators utilized to create them (as in the case of the clusters in Fig. 6o) or the number of object groupings that could be identified visually (as in the case of the clusters in Figs 6m and n). The reason for the difference varies with different cases. For example, it could be considered that there are five clusters in Fig. 6m because the cluster distances are smaller for the two leftmost pairs of clusters than for others and the clusters in those pairs could be merged together. However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters and the final decision as to which value to adopt has to be left at the discretion of the user.
From the graphs in Fig. 6, a conclusion could be made that any K with corresponding f(K) < 0.85 could be recommended for clustering. If there is not a value with corresponding f(K) < 0.85, K = 1 is selected.
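The selection rule just stated is simple enough to write down directly. In the sketch below, the function name `recommend_k` and the two lists of f(K) values are hypothetical; only the 0.85 threshold and the fallback to K = 1 come from the text.

```python
def recommend_k(f_values, threshold=0.85):
    """f_values[i] is f(K) for K = i + 1; return the recommended K values."""
    candidates = [k for k, f in enumerate(f_values, start=1) if f < threshold]
    # If no K falls below the threshold, K = 1 is selected.
    return candidates or [1]

# Hypothetical f(K) curves for K = 1, 2, 3, ...
several = recommend_k([1.0, 0.32, 0.80, 1.05, 0.97])
fallback = recommend_k([1.0, 0.95, 1.02])
print(several, fallback)
```

Returning every K below the threshold, rather than only the minimum, matches the intention that the user makes the final choice among the suggested values.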
The proposed function f(K) is also applied to 12 benchmarking data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are given as in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.

The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.
6 CONCLUSION
Existing methods of selecting the number of clusters for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.

A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project supported by the Engineering and Physical Sciences Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the IPROMS Network of Excellence funded by the European Commission.
Table 2 The recommended number of clusters based on f(K)

Data set          Proposed number of clusters
Australian        1
Balance-scale     1
Car evaluation    2, 3, 4
Cmc               1
Ionosphere        2
Iris              2, 3
Page blocks       2
Pima              1
Wdbc              2
Wine              3
Yeast             1
Zoo               2
REFERENCES
1 Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2000 (Morgan Kaufmann, San Francisco, California).
2 Al-Daoud, M. B., Venkateswarlu, N. B. and Roberts, S. A. Fast K-means clustering algorithms. Report 9518, School of Computer Studies, University of Leeds, June 1995.
3 Al-Daoud, M. B., Venkateswarlu, N. B. and Roberts, S. A. New methods for the initialisation of clusters. Pattern Recognition Lett., 1996, 17, 451–455.
4 Alsabti, K., Ranka, S. and Singh, V. An efficient K-means clustering algorithm. In Proceedings of the First Workshop on High-Performance Data Mining, Orlando, Florida, 1998; ftp://ftp.cise.ufl.edu/pub/faculty/ranka/Proceedings
5 Bilmes, J., Vahdat, A., Hsu, W. and Im, E. J. Empirical observations of probabilistic heuristics for the clustering problem. Technical Report TR-97-018, International Computer Science Institute, Berkeley, California.
6 Bottou, L. and Bengio, Y. Convergence properties of the K-means algorithm. Adv. Neural Infn Processing Systems, 1995, 7, 585–592.
7 Bradley, S. and Fayyad, U. M. Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98) (Ed. J. Shavlik), Madison, Wisconsin, 1998, pp. 91–99 (Morgan Kaufmann, San Francisco, California).
8 Du, Q. and Wong, T.-W. Numerical studies of MacQueen's K-means algorithm for computing the centroidal Voronoi tessellations. Int. J. Computers Math. Applics, 2002, 44, 511–523.
9 Castro, V. E. and Yang, J. A fast and robust general purpose clustering algorithm. In Proceedings of the Fourth European Workshop on Principles of Knowledge Discovery in Databases and Data Mining (PKDD 00), Lyon, France, 2000, pp. 208–218.
10 Castro, V. E. Why so many clustering algorithms. SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002, 4(1), 65–75.
11 Fritzke, B. The LBG-U method for vector quantization – an improvement over LBG inspired from neural networks. Neural Processing Lett., 1997, 5(1), 35–45.
12 Hamerly, G. and Elkan, C. Alternatives to the K-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, Virginia, 2002, pp. 600–607.
13 Hansen, L. K. and Larsen, J. Unsupervised learning and generalisation. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, June 1996, pp. 25–30 (IEEE, New York).
14 Ishioka, T. Extended K-means with an efficient estimation of the number of clusters. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, December 2000, pp. 17–22.
15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R. and Wu, A. The efficient K-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Analysis Mach. Intell., 2002, 24(7), 881–892.
16 Pelleg, D. and Moore, A. Accelerating exact K-means algorithms with geometric reasoning. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, 1999, pp. 277–281.
17 Pelleg, D. and Moore, A. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, 2000, 727–734.
18 Pena, J. M., Lazano, J. A. and Larranaga, P. An empirical comparison of four initialisation methods for the K-means algorithm. Pattern Recognition Lett., 1999, 20, 1027–1040.
19 SPSS Clementine Data Mining System User Guide Version 5, 1998 (Integral Solutions Limited, Basingstoke, Hampshire).
20 DataEngine 3.0 – Intelligent Data Analysis – an Easy Job, Management Intelligenter Technologien GmbH, Germany, 1998; http://www.mitgmbh.de
21 Kerr, A., Hall, H. K. and Kozub, S. Doing Statistics with SPSS, 2002 (Sage, London).
22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2, Insightful Corporation, Seattle, Washington, 2001; httpwwwinsightfulcomDocumentsLive2344statman2pdf
23 Hardy, A. On the number of clusters. Comput. Statist. Data Analysis, 1996, 23, 83–96.
24 Theodoridis, S. and Koutroubas, K. Pattern Recognition, 1998 (Academic Press, London).
25 Halkidi, M., Batistakis, Y. and Vazirgiannis, M. Cluster validity methods: Part I. SIGMOD Record, 2002, 31(2); available online http://www.acm.org/sigmod/record
26 Kothari, R. and Pitts, D. On finding the number of clusters. Pattern Recognition Lett., 1999, 20, 405–416.
27 Cai, Z. Technical aspects of data mining. PhD thesis, Cardiff University, Cardiff, 2001.
28 Lindeberg, T. Scale-space Theory in Computer Vision, 1994 (Kluwer Academic, Boston, Massachusetts).
29 Pham, D. T., Dimov, S. S. and Nguyen, C. D. Incremental K-means algorithm. Proc. Instn Mech. Engrs, Part C: J. Mechanical Engineering Science, 2003, 218, 783–795.
30 Tibshirani, R., Walther, G. and Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California, 2000.
31 Blake, C., Keogh, E. and Merz, C. J. UCI Repository of Machine Learning Databases, Irvine, California, Department of Information and Computer Science, University of California, Irvine, California, 1998.
118 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
APPENDIX
Notation
A B clustersd(xjt wj) distance between object xjt and the
centre wj of cluster jf(K ) evaluation functionGA GB generatorsIj distortion of cluster jK number of clustersN number of objects in the data setNd number of data set attributes
(the dimension of the data set)
Nj number of objects belonging tocluster j
PGA PGB
probabilities that X is created by GA orGB respectively
PCA PCB
probabilities that X is clustered into A orB respectively
SK sum of all distortions with K being thespecified number of clusters
X objectxjt object belonging to cluster jwj centre of cluster j
aK weight factor
Selection of K in K-means clustering 119
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
the K-means algorithm. In addition, the assumptions about the underlying distribution cannot be verified on real data sets and therefore cannot be used to obtain statistical measures.
2.6 Values of K equated to the number of classes
With this method, the number of clusters is equated to the number of classes in the data sets. A data-clustering algorithm can be used as a classifier by applying it to data sets from which the class attribute is omitted and then assessing the clustering results using the omitted class information [26, 27]. The outcome of the assessment is fed back to the clustering algorithm to improve its performance. In this way, the clustering can be considered to be supervised.
With this method of determining the number of clusters, the assumption is made that the data-clustering method could form clusters, each of which would consist of only objects belonging to one class. Unfortunately, most real problems do not satisfy this assumption.
2.7 Values of K determined through visualization
Visual verification is applied widely because of its simplicity and explanation possibilities. Visual examples are often used to illustrate the drawbacks of an algorithm or to present the expected clustering results [5, 27].
The assessment of a clustering result using visualization techniques depends heavily on their implicit nature. The clustering models utilized by some clustering methods may not be appropriate for particular data sets. The data sets in Fig. 2 are illustrations of such cases. The application of visualization techniques implies a data distribution continuity in the expected clusters. If the K-means approach is applied to such data sets, there is not any cluster that satisfies the K-means clustering model and at the same time corresponds to a particular object grouping in the illustrated data sets. Therefore, the K-means algorithm cannot produce the expected clustering results. This suggests that the K-means approach is unsuitable for such data sets.
Fig. 2 Data sets inappropriate for the K-means approach: (a) data sets with four clusters [5]; (b) data sets with three clusters [23]; (c) data sets with eight clusters [27]. Note that the number of clusters in each data set was specified by the respective researchers.
The characteristics of the data sets in Fig. 2 (position, shape, size, and object distribution) are implicitly defined. This makes the validation of the clustering results difficult. Any slight changes in the data characteristics may lead to different outcomes. The data set in Fig. 2b is an illustration of such a case. Another example is the series of data sets in Fig. 3. Although two clusters are easily identifiable in the data set in Fig. 3a, the numbers of clusters in the data sets in Figs 3b and c depend on the distance between the rings and the object density of each ring. Usually, such parameters are not explicitly defined when a visual check is carried out.
In spite of the above-mentioned deficiencies, visualization of the results is still a useful method of selecting K and validating the clustering results when the data sets do not violate the assumptions of the clustering model. In addition, this method is recommended in cases where the expected results could be identified explicitly.
2.8 Values of K determined using a neighbourhood measure
A neighbourhood measure could be added to the cost function of the K-means algorithm to determine K [26]. Although this technique has shown promising results for a few data sets, it needs to prove its potential in practical applications. Because the cost function has to be modified, this technique cannot be applied to the original K-means algorithm.
3 FACTORS AFFECTING THE SELECTION OF K
A function f(K) for evaluating the clustering result could be used to select the number of clusters. Factors that such a function should take into account are discussed in this section.
3.1 Approach bias
The evaluation function should be related closely to the clustering criteria. As mentioned previously, such a relation could prevent adverse effects on the validation process. In particular, in the K-means algorithm, the criterion is the minimization of the distortion of clusters, so that the evaluation function should take this parameter into account.
3.2 Level of detail
In general, observers that could see relatively low levels of detail would obtain only an overview of an object. By increasing the level of detail, they could gain more information about the observed object but, at the same time, the amount of data that they have to process increases. Because of resource limitations, a high level of detail is normally used only to examine parts of the object [28].
Such an approach could be applied in clustering. A data set with n objects could be grouped into any number of clusters between 1 and n, which would correspond to the lowest and the highest levels of detail respectively. By specifying different K values, it is possible to assess the results of grouping objects into various numbers of clusters. From this evaluation, more than one K value could be recommended to users, but the final selection is made by them.
3.3 Internal distribution versus global impact
Clustering is used to find irregularities in the data distribution and to identify regions in which objects are concentrated. However, not every region with a high concentration of objects is considered a cluster. For a region to be identified as a cluster, it is important to analyse not only its internal distribution but also its interdependence with other object groupings in the data set.
In K-means clustering, the distortion of a cluster is a function of the data population and the distance between objects and the cluster centre according to
$$I_j = \sum_{t=1}^{N_j} [d(x_{jt}, w_j)]^2 \tag{1a}$$
where $I_j$ is the distortion of cluster j, $w_j$ is the centre of cluster j, $N_j$ is the number of objects belonging to cluster j, $x_{jt}$ is the tth object belonging to cluster j, and $d(x_{jt}, w_j)$ is the distance between object $x_{jt}$ and the centre $w_j$ of cluster j.
Fig. 3 Variations in the two-ring data set
Each cluster is represented by its distortion, and its impact on the entire data set is assessed by its contribution to the sum of all distortions, $S_K$, given by
$$S_K = \sum_{j=1}^{K} I_j \tag{1b}$$
where K is the specified number of clusters. Thus, such information is important in assessing whether a particular region in the object space could be considered a cluster.
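Equations (1a) and (1b) can be illustrated with a minimal sketch. It assumes that d is the Euclidean distance and that each centre is the centroid of its cluster; the function names and the sample coordinates are hypothetical and are not taken from the paper.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two objects (assumed form of d squared)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Centre w_j of a cluster, taken here as the mean of its objects."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def cluster_distortion(points):
    """I_j of equation (1a): sum of squared distances to the cluster centre."""
    centre = centroid(points)
    return sum(sq_dist(p, centre) for p in points)


def total_distortion(clusters):
    """S_K of equation (1b): sum of the distortions of all K clusters."""
    return sum(cluster_distortion(points) for points in clusters)


# Two hypothetical clusters in two dimensions.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(4.0, 0.0), (6.0, 0.0), (5.0, 3.0)],
]
print([cluster_distortion(c) for c in clusters])  # [2.0, 8.0]
print(total_distortion(clusters))                 # 10.0
```

The second cluster contributes four fifths of the total here, which is the kind of per-cluster contribution to $S_K$ that the text refers to.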
3.4 Constraints on f(K)
The robustness of f(K) is very important. Because this function is based on the result of the clustering algorithm, it is important for this result to vary as little as possible when K remains unchanged. However, one of the main deficiencies of the K-means approach is its dependence on randomness. Thus, the algorithm should yield consistent results so that its performance can be used as a variable in the evaluation function. A new version of the K-means algorithm, namely the incremental K-means algorithm [29], satisfies this requirement and can be adopted for this purpose.
The role of f(K) is to reveal trends in the data distribution and therefore it is important to keep it independent of the number of objects. The number of clusters K is assumed to be much smaller than the number of objects N. When K increases, f(K) should converge to some constant value. Then, if, for any intermediate K, f(K) exhibits a special behaviour, such as a minimum or maximum point, that value of K could be taken as the desired number of clusters.
4 NUMBER OF CLUSTERS FOR K-MEANS CLUSTERING
As mentioned in section 3.3, cluster analysis is used to find irregularities in the data distribution. When the data distribution is uniform, there is not any irregularity. Therefore, data sets with uniform distribution could be used to calibrate and verify the clustering result. This approach was applied by Tibshirani et al. [30]. A data set of the same dimension as the actual data set and with a uniform distribution was generated. The clustering performance on this artificial data set was then compared with the result obtained for the actual data set. A measure known as the 'gap' statistic [30] was employed to assess performance. In this work, instead of generating an artificial data set, the clustering performance for the artificial data set was estimated. Also, instead of the gap statistic, a new and more discriminatory measure was employed for evaluating the clustering result.
When the K-means algorithm is applied to data with a uniform distribution and K is increased by 1, the clusters are likely to change and, in the new positions, the partitions will again be approximately equal in size and their distortions similar to one another. The evaluations carried out in reference [29] showed that, when a new cluster is inserted into a cluster (K = 1) with a hypercuboid shape and a uniform distribution, the decrease in the sum of distortions is proportional to the original sum of distortions. This conclusion was found to be correct for clustering results obtained with relatively small values of K. In such cases, the sum of distortions after the increase in the number of clusters could be estimated from the current value.
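The evaluation function f(K) is defined using the equations
$$f(K) = \begin{cases} 1 & \text{if } K = 1 \\ \dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ \forall K > 1 \\ 1 & \text{if } S_{K-1} = 0,\ \forall K > 1 \end{cases} \tag{2}$$
$$\alpha_K = \begin{cases} 1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \quad \text{(3a)} \\ \alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1 \quad \text{(3b)} \end{cases}$$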
where $S_K$ is the sum of the cluster distortions when the number of clusters is K, $N_d$ is the number of data set attributes (i.e. the number of dimensions), and $\alpha_K$ is a weight factor. The term $\alpha_K S_{K-1}$ in equation (2) is an estimate of $S_K$ based on $S_{K-1}$, made with the assumption that the data have a uniform distribution. The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform. When there are areas of concentration in the data distribution, $S_K$ will be less than the estimated value, so that f(K) decreases. The smaller that f(K) is, the more concentrated is the data distribution. Thus, values of K that yield small f(K) can be regarded as giving well-defined clusters.
The weight factor $\alpha_K$, defined in equation (3), is a positive number less than or equal to 1 and is applied to reduce the effect of dimensions. With K = 2, $\alpha_K$ is computed using equation (3a). This equation is derived from equation (7) in reference [29], which shows that the decrease in distortion is inversely proportional to the number of dimensions $N_d$.
As K increases above 2, the decrease in the sum of distortions reduces (the ratio $S_K/S_{K-1}$ approaches 1), as can be seen in Fig. 4. This figure shows the values of $S_K/S_{K-1}$ computed for different K when the clustering algorithm is applied to data sets of different dimensions and with uniform distributions. With such data sets, f(K) is expected to be equal to 1 and $\alpha_K$ should be chosen to equate f(K) to 1. From equation (2), $\alpha_K$ should therefore be $S_K/S_{K-1}$ and thus obtainable from Fig. 4. However, for computational simplicity, the recursion equation (3b) has been derived from the data represented in Fig. 4 to calculate $\alpha_K$. Figure 5 shows that the values of $\alpha_K$ obtained from equation (3b) fit the plots in Fig. 4 closely.
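A minimal sketch of equations (2) and (3) follows. The function names and the sample sums of distortions are hypothetical. Raising an error for K < 2 or $N_d$ ≤ 1 is a choice made here, because equation (3) does not define the weight factor for those cases.

```python
def alpha(k, n_d):
    """Weight factor alpha_K from equations (3a) and (3b)."""
    if k < 2 or n_d <= 1:
        raise ValueError("alpha_K is defined only for K >= 2 and N_d > 1")
    a = 1.0 - 3.0 / (4.0 * n_d)   # equation (3a), K = 2
    for _ in range(3, k + 1):     # equation (3b), applied once per K > 2
        a = a + (1.0 - a) / 6.0
    return a


def f_values(s, n_d):
    """Return {K: f(K)} from equation (2), where s[0] = S_1, s[1] = S_2, ..."""
    f = {1: 1.0}
    for k in range(2, len(s) + 1):
        s_prev = s[k - 2]  # S_{K-1}
        if s_prev == 0:
            f[k] = 1.0
        else:
            f[k] = s[k - 1] / (alpha(k, n_d) * s_prev)
    return f


print(alpha(2, 2), alpha(3, 2))  # 0.625 0.6875

# Hypothetical sums of distortions S_1, S_2, S_3 for two-dimensional data.
f = f_values([100.0, 20.0, 15.0], n_d=2)
```

With these made-up values, f(2) is about 0.32 and f(3) is slightly above 1, so only K = 2 would indicate a concentration of objects.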
The proposed function f(K) satisfies the constraints mentioned in the previous section. The robustness of f(K) will be verified experimentally in the next section. When the number of objects is doubled or tripled but their distributions are unchanged, the resultant clusters remain in the same position. $S_K$ and $S_{K-1}$ are doubled or tripled correspondingly, so that f(K) stays constant. Therefore, generally, f(K) is independent of the number of objects in the data set.
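The argument above can be checked numerically on a toy example. The sketch below assumes Euclidean distance and centroid centres, keeps the cluster memberships fixed, and duplicates every object; the data and names are illustrative only.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two objects."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def distortion_sum(clusters):
    """Sum of distortions, with each centre taken as the cluster centroid."""
    total = 0.0
    for points in clusters:
        n = len(points)
        centre = tuple(sum(column) / n for column in zip(*points))
        total += sum(sq_dist(p, centre) for p in points)
    return total


# The same four objects grouped into K = 2 clusters and into K = 1 cluster.
two = [[(0.0, 0.0), (0.0, 2.0)], [(4.0, 0.0), (6.0, 0.0)]]
one = [two[0] + two[1]]
ratio = distortion_sum(two) / distortion_sum(one)  # S_2 / S_1

# Every object appears twice; the centres do not move.
doubled_two = [points * 2 for points in two]
doubled_one = [points * 2 for points in one]
ratio_doubled = distortion_sum(doubled_two) / distortion_sum(doubled_one)

print(distortion_sum(two), distortion_sum(doubled_two))  # 4.0 8.0
print(abs(ratio - ratio_doubled) < 1e-12)                # True
```

Because $\alpha_K$ depends only on K and $N_d$, an unchanged ratio $S_K/S_{K-1}$ implies an unchanged f(K).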
To reduce the effect of differences in the ranges of the attributes, data are normalized before the clustering starts. However, it should be noted that, when the data have well-separated groups of objects, the shape of such regions in the problem space has an effect on the evaluation function. In these cases, the normalization does not influence the local object distribution because it is a scaling technique that applies to the whole data set.
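The normalization method is not specified here. As one possible reading, the sketch below assumes min-max scaling of every attribute to the range [0, 1]; the function name and the sample data are hypothetical.

```python
def min_max_normalize(rows):
    """Scale every attribute (column) to [0, 1] over the whole data set."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    spans = [max(column) - min(column) for column in columns]
    return [
        # A constant attribute has zero span and is mapped to 0.0.
        tuple((v - lo) / span if span else 0.0
              for v, lo, span in zip(row, lows, spans))
        for row in rows
    ]


# Two attributes with very different ranges.
data = [(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]
print(min_max_normalize(data))  # [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
```

The scaling uses the minimum and maximum of the whole data set, so the relative shape of any local group of objects is preserved, as noted above.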
5 PERFORMANCE
The evaluation function f(K) is tested in a series of experiments on the artificially generated data sets shown in Fig. 6. All data are normalized before the incremental K-means algorithm is applied with K ranging from 1 to 19. f(K) is calculated on the basis of the total distortion of the clusters.
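The experimental procedure can be sketched end to end under stated assumptions. The incremental K-means algorithm [29] is not reproduced here; plain Lloyd-style K-means with a deterministic farthest-first seeding is swapped in so that repeated runs give the same result. Euclidean distance, all function names, and the two-group data set are assumptions made for illustration, and K runs only up to 5 to keep the example small.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two objects."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans_distortion(points, k, iterations=100):
    """Return S_K from Lloyd-style K-means (a stand-in for the method of [29])."""
    centres = [points[0]]
    while len(centres) < k:
        # Next seed: the object farthest from its nearest existing centre.
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centres[j]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centre.
        updated = [centroid(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:
            break
        centres = updated
    return sum(min(sq_dist(p, c) for c in centres) for p in points)


def alpha(k, n_d):
    """Weight factor from equations (3a) and (3b), for K >= 2 and N_d > 1."""
    a = 1.0 - 3.0 / (4.0 * n_d)
    for _ in range(3, k + 1):
        a += (1.0 - a) / 6.0
    return a


def evaluate(points, k_max):
    """Return {K: f(K)} for K = 1..k_max using equation (2)."""
    n_d = len(points[0])
    s = [kmeans_distortion(points, k) for k in range(1, k_max + 1)]
    f = {1: 1.0}
    for k in range(2, k_max + 1):
        prev = s[k - 2]
        f[k] = 1.0 if prev == 0 else s[k - 1] / (alpha(k, n_d) * prev)
    return f


# Two well-separated 3 x 3 groups of objects in two dimensions.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
points = grid + [(x + 100.0, y + 100.0) for x, y in grid]
f = evaluate(points, 5)
best = min(f, key=f.get)
print(best)  # 2
```

For this artificial data set the minimum of f(K) falls at K = 2, which matches the two groups that were generated.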
In Figs 6a–c, all objects belong to a single region with a uniform distribution. The graph in Fig. 6a shows that f(K) reflects well the clustering result on this data set with a uniform distribution because f(K) is approximately constant and equal to 1 for all K. When K = 4 and K = 3 in Figs 6a and b respectively, f(K) reaches minimum values. This could be attributed to the shape of the areas defined by the objects belonging to these data sets. However, the minimum values of f(K) do not differ significantly from the average value for any strong recommendations to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that $\alpha_K$ reduces the effect of the data set dimensions on the evaluation function.
Fig. 4 The ratio $S_K/S_{K-1}$ for data sets having uniform distributions: (a) two-dimensional 'square' and 'circle'; (b) four-dimensional 'cube' and 'sphere'
Fig. 5 Comparison of the values of $\alpha_K$ calculated using equation (3b) and the ratio $S_K/S_{K-1}$
Fig. 6 Data sets and their corresponding f(K)
For the data set in Fig. 6d, again all objects are concentrated in a single region with a normal distribution. The f(K) plot for this data set suggests correctly that, when K = 1, the clustering result is the most suitable for this data set.
The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e, the two generators have an overlapping region but, in Fig. 6f, they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.
Fig. 7 f(K) for the 12 benchmark data sets
The data sets in Figs 6g and h have three recognizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.
Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case, f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions, but K = 3 could also be used to cluster the objects.
Figures 6j and k illustrate how the level of detail could affect the selection of K. f(K) reaches minimum values at K = 2 and 4 respectively. In such cases, users could select the most appropriate value of K based on their specific requirements. A more complex case is shown in Fig. 6l, where there is a possible K value of 4 or 8. The selection of a particular K will depend on the requirements of the specific application for which the clustering is carried out.
The data sets in Figs 6m–o have well-defined regions in the object space, each of which has a different distribution, location, and number of objects. If the minimum value of f(K) is used to cluster the objects, K will be different from the number of generators utilized to create them (as in the case of the clusters in Fig. 6o) or the number of object groupings that could be identified visually (as in the case of the clusters in Figs 6m and n). The reason for the difference varies with different cases. For example, it could be considered that there are five clusters in Fig. 6m because the cluster distances are smaller for the two leftmost pairs of clusters than for others and the clusters in those pairs could be merged together. However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters and the final decision as to which value to adopt has to be left at the discretion of the user.
From the graphs in Fig. 6, a conclusion could be made that any K with corresponding f(K) < 0.85 could be recommended for clustering. If there is not a value with corresponding f(K) < 0.85, K = 1 is selected.
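The selection rule can be written as a short sketch. The function name and the sample f(K) values are hypothetical; only the 0.85 threshold comes from the text.

```python
def recommend_k(f, threshold=0.85):
    """Return every K whose f(K) is below the threshold, or [1] if there is none."""
    picks = sorted(k for k, value in f.items() if value < threshold)
    return picks or [1]


# Hypothetical f(K) curves, keyed by K.
clustered = {1: 1.0, 2: 0.62, 3: 0.80, 4: 0.97}
uniform_like = {1: 1.0, 2: 0.96, 3: 1.02}

print(recommend_k(clustered))     # [2, 3]
print(recommend_k(uniform_like))  # [1]
```

Returning all qualifying values, rather than only the minimum, reflects the point made earlier that more than one K may be offered to the user.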
The proposed function f(K) is also applied to 12 benchmarking data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are as given in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.
The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.
6 CONCLUSION
Existing methods of selecting the number of clusters for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.
A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project supported by the Engineering and Physical Sciences Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the IPROMS Network of Excellence funded by the European Commission.
Table 2 The recommended number of clusters based on f(K)
Data set          Proposed number of clusters
Australian        1
Balance-scale     1
Car evaluation    2, 3, 4
Cmc               1
Ionosphere        2
Iris              2, 3
Page blocks       2
Pima              1
Wdbc              2
Wine              3
Yeast             1
Zoo               2
REFERENCES
1 Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2000 (Morgan Kaufmann, San Francisco, California).
2 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. Fast K-means clustering algorithms. Report 95.18, School of Computer Studies, University of Leeds, June 1995.
3 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. New methods for the initialisation of clusters. Pattern Recognition Lett., 1996, 17, 451–455.
4 Alsabti, K., Ranka, S., and Singh, V. An efficient K-means clustering algorithm. In Proceedings of the First Workshop on High-Performance Data Mining, Orlando, Florida, 1998; ftp://ftp.cise.ufl.edu/pub/faculty/ranka/Proceedings.
5 Bilmes, J., Vahdat, A., Hsu, W., and Im, E. J. Empirical observations of probabilistic heuristics for the clustering problem. Technical Report TR-97-018, International Computer Science Institute, Berkeley, California.
6 Bottou, L. and Bengio, Y. Convergence properties of the K-means algorithm. Adv. Neural Infn Processing Systems, 1995, 7, 585–592.
7 Bradley, S. and Fayyad, U. M. Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98) (Ed. J. Shavlik), Madison, Wisconsin, 1998, pp. 91–99 (Morgan Kaufmann, San Francisco, California).
8 Du, Q. and Wong, T.-W. Numerical studies of MacQueen's K-means algorithm for computing the centroidal Voronoi tessellations. Int. J. Computers Math. Applics, 2002, 44, 511–523.
9 Castro, V. E. and Yang, J. A fast and robust general purpose clustering algorithm. In Proceedings of the Fourth European Workshop on Principles of Knowledge Discovery in Databases and Data Mining (PKDD 00), Lyon, France, 2000, pp. 208–218.
10 Castro, V. E. Why so many clustering algorithms. SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002, 4(1), 65–75.
11 Fritzke, B. The LBG-U method for vector quantization – an improvement over LBG inspired from neural networks. Neural Processing Lett., 1997, 5(1), 35–45.
12 Hamerly, G. and Elkan, C. Alternatives to the K-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, Virginia, 2002, pp. 600–607.
13 Hansen, L. K. and Larsen, J. Unsupervised learning and generalisation. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, June 1996, pp. 25–30 (IEEE, New York).
14 Ishioka, T. Extended K-means with an efficient estimation of the number of clusters. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, December 2000, pp. 17–22.
15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. The efficient K-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Analysis Mach. Intell., 2002, 24(7), 881–892.
16 Pelleg, D. and Moore, A. Accelerating exact K-means algorithms with geometric reasoning. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, 1999, pp. 277–281.
17 Pelleg, D. and Moore, A. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, 2000, pp. 727–734.
18 Pena, J. M., Lazano, J. A., and Larranaga, P. An empirical comparison of four initialisation methods for the K-means algorithm. Pattern Recognition Lett., 1999, 20, 1027–1040.
19 SPSS Clementine Data Mining System. User Guide Version 5, 1998 (Integral Solutions Limited, Basingstoke, Hampshire).
20 DataEngine 3.0 – Intelligent Data Analysis – an Easy Job, Management Intelligenter Technologien GmbH, Germany, 1998; http://www.mitgmbh.de.
21 Kerr, A., Hall, H. K., and Kozub, S. Doing Statistics with SPSS, 2002 (Sage, London).
22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2, Insightful Corporation, Seattle, Washington, 2001; http://www.insightful.com/DocumentsLive/23/44/statman2.pdf.
23 Hardy, A. On the number of clusters. Comput. Statist. Data Analysis, 1996, 23, 83–96.
24 Theodoridis, S. and Koutroubas, K. Pattern Recognition, 1998 (Academic Press, London).
25 Halkidi, M., Batistakis, Y., and Vazirgiannis, M. Cluster validity methods: Part I. SIGMOD Record, 2002, 31(2); available online http://www.acm.org/sigmod/record.
26 Kothari, R. and Pitts, D. On finding the number of clusters. Pattern Recognition Lett., 1999, 20, 405–416.
27 Cai, Z. Technical aspects of data mining. PhD thesis, Cardiff University, Cardiff, 2001.
28 Lindeberg, T. Scale-space Theory in Computer Vision, 1994 (Kluwer Academic, Boston, Massachusetts).
29 Pham, D. T., Dimov, S. S., and Nguyen, C. D. Incremental K-means algorithm. Proc. Instn Mech. Engrs, Part C: J. Mechanical Engineering Science, 2003, 218, 783–795.
30 Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California, 2000.
31 Blake, C., Keogh, E., and Merz, C. J. UCI Repository of Machine Learning Databases, Irvine, California, Department of Information and Computer Science, University of California, Irvine, California, 1998.
APPENDIX
Notation
A, B                  clusters
$d(x_{jt}, w_j)$      distance between object $x_{jt}$ and the centre $w_j$ of cluster j
f(K)                  evaluation function
$G_A$, $G_B$          generators
$I_j$                 distortion of cluster j
K                     number of clusters
N                     number of objects in the data set
$N_d$                 number of data set attributes (the dimension of the data set)
$N_j$                 number of objects belonging to cluster j
$P_{GA}$, $P_{GB}$    probabilities that X is created by $G_A$ or $G_B$ respectively
$P_{CA}$, $P_{CB}$    probabilities that X is clustered into A or B respectively
$S_K$                 sum of all distortions with K being the specified number of clusters
X                     object
$x_{jt}$              object belonging to cluster j
$w_j$                 centre of cluster j
$\alpha_K$            weight factor
Selection of K in K-means clustering 119
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
any cluster that satisfies the K-means cluster-ing model and at the same time corresponds to aparticular object grouping in the illustrated datasets Therefore the K-means algorithm cannot pro-duce the expected clustering results This suggeststhat the K-means approach is unsuitable for suchdata sets
The characteristics of the data sets in Fig 2(position shape size and object distribution) areimplicitly defined This makes the validation of theclustering results difficult Any slight changes inthe data characteristics may lead to different out-comes The data set in Fig 2b is an illustration ofsuch a case Another example is the series of datasets in Fig 3 Although two clusters are easily identi-fiable in the data set in Fig 3a the numbers ofclusters in the data sets in Figs 3b and c depend onthe distance between the rings and the objectdensity of each ring Usually such parametersare not explicitly defined when a visual check iscarried out
In spite of the above-mentioned deficiencies visu-alization of the results is still a useful method ofselecting K and validating the clustering resultswhen the data sets do not violate the assumptionsof the clustering model In addition this methodis recommended in cases where the expected resultscould be identified explicitly
28 Values of K determined usinga neighbourhood measure
A neighbourhood measure could be added to thecost function of the K-means algorithm to determineK [26] Although this technique has showed promis-ing results for a few data sets it needs to prove itspotential in practical applications Because the costfunction has to be modified this technique cannotbe applied to the original K-means algorithm
3 FACTORS AFFECTING THE SELECTION OF K
A function f(K ) for evaluating the clustering resultcould be used to select the number of clusters Fac-tors that such a function should take into accountare discussed in this section
31 Approach bias
The evaluation function should be related closely tothe clustering criteria As mentioned previouslysuch a relation could prevent adverse effects on thevalidation process In particular in the K-meansalgorithm the criterion is the minimization of thedistortion of clusters so that the evaluation functionshould take this parameter into account
32 Level of detail
In general observers that could see relatively lowlevels of detail would obtain only an overview of anobject By increasing the level of detail they couldgain more information about the observed objectbut at the same time the amount of data that theyhave to process increases Because of resource limit-ations a high level of detail is normally used only toexamine parts of the object [28]
Such an approach could be applied in clustering Adata set with n objects could be grouped into anynumber of clusters between 1 and n which wouldcorrespond to the lowest and the highest levels ofdetail respectively By specifying different K valuesit is possible to assess the results of groupingobjects into various numbers of clusters Fromthis evaluation more than one K value could berecommended to users but the final selection ismade by them
33 Internal distribution versus global impact
Clustering is used to find irregularities in the datadistribution and to identify regions in which objectsare concentrated However not every region with ahigh concentration of objects is considered a clusterFor a region to be identified as a cluster it is import-ant to analyse not only its internal distribution butalso its interdependence with other object groupingsin the data set
In K-means clustering, the distortion of a cluster is a function of the data population and the distance between objects and the cluster centre according to

$$I_j = \sum_{t=1}^{N_j} \left[ d(x_{jt}, w_j) \right]^2 \tag{1a}$$

where $I_j$ is the distortion of cluster j, $w_j$ is the centre of cluster j, $N_j$ is the number of objects belonging to cluster j, $x_{jt}$ is the tth object belonging to cluster j, and $d(x_{jt}, w_j)$ is the distance between object $x_{jt}$ and the centre $w_j$ of cluster j.
Fig. 3 Variations in the two-ring data set

Selection of K in K-means clustering 107
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science

Each cluster is represented by its distortion, and its impact on the entire data set is assessed by its contribution to the sum of all distortions $S_K$, given by

$$S_K = \sum_{j=1}^{K} I_j \tag{1b}$$

where K is the specified number of clusters. Thus such information is important in assessing whether a particular region in the object space could be considered a cluster.
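As an illustration only, equations (1a) and (1b) can be sketched in Python as follows. The sketch assumes that d is the Euclidean distance and that each cluster is supplied as a pair of its objects and its centre; the function names are hypothetical and do not come from the paper.

```python
def squared_distance(x, w):
    """Squared Euclidean distance between object x and centre w (assumed metric)."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def cluster_distortion(objects, centre):
    """I_j of equation (1a): sum of squared distances from the objects to the centre."""
    return sum(squared_distance(x, centre) for x in objects)


def total_distortion(clusters):
    """S_K of equation (1b); `clusters` is a list of (objects, centre) pairs."""
    return sum(cluster_distortion(objects, centre) for objects, centre in clusters)


# Two small hypothetical clusters in two dimensions.
example = [
    ([(0, 0), (2, 0)], (1, 0)),
    ([(10, 10), (10, 12)], (10, 11)),
]
print(total_distortion(example))  # 4
```

Each cluster in the example contributes a distortion of 2, so that neither dominates the sum.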
3.4 Constraints on f(K)
The robustness of f(K) is very important. Because this function is based on the result of the clustering algorithm, it is important for this result to vary as little as possible when K remains unchanged. However, one of the main deficiencies of the K-means approach is its dependence on randomness. Thus the algorithm should yield consistent results so that its performance can be used as a variable in the evaluation function. A new version of the K-means algorithm, namely the incremental K-means algorithm [29], satisfies this requirement and can be adopted for this purpose.
The role of f(K) is to reveal trends in the data distribution and therefore it is important to keep it independent of the number of objects. The number of clusters K is assumed to be much smaller than the number of objects N. When K increases, f(K) should converge to some constant value. Then, if for any intermediate K, f(K) exhibits a special behaviour, such as a minimum or maximum point, that value of K could be taken as the desired number of clusters.
4 NUMBER OF CLUSTERS FOR K-MEANS CLUSTERING
As mentioned in section 3.3, cluster analysis is used to find irregularities in the data distribution. When the data distribution is uniform, there is not any irregularity. Therefore, data sets with uniform distribution could be used to calibrate and verify the clustering result. This approach was applied by Tibshirani et al. [30]. A data set of the same dimension as the actual data set and with a uniform distribution was generated. The clustering performance on this artificial data set was then compared with the result obtained for the actual data set. A measure known as the ‘gap’ statistic [30] was employed to assess performance. In this work, instead of generating an artificial data set, the clustering performance for the artificial data set was estimated. Also, instead of the gap statistic, a new and more discriminatory measure was employed for evaluating the clustering result.
When the K-means algorithm is applied to data with a uniform distribution and K is increased by 1, the clusters are likely to change and, in the new positions, the partitions will again be approximately equal in size and their distortions similar to one another. The evaluations carried out in reference [29] showed that, when a new cluster is inserted into a cluster (K = 1) with a hypercuboid shape and a uniform distribution, the decrease in the sum of distortions is proportional to the original sum of distortions. This conclusion was found to be correct for clustering results obtained with relatively small values of K. In such cases, the sum of distortions after the increase in the number of clusters could be estimated from the current value.
The evaluation function f(K) is defined using the equations

$$
f(K) =
\begin{cases}
1 & \text{if } K = 1 \\
\dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ \forall K > 1 \\
1 & \text{if } S_{K-1} = 0,\ \forall K > 1
\end{cases}
\tag{2}
$$

$$
\alpha_K =
\begin{cases}
1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \quad \text{(3a)} \\
\alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1 \quad \text{(3b)}
\end{cases}
$$

where $S_K$ is the sum of the cluster distortions when the number of clusters is K, $N_d$ is the number of data set attributes (i.e. the number of dimensions), and $\alpha_K$ is a weight factor. The term $\alpha_K S_{K-1}$ in equation (2) is an estimate of $S_K$ based on $S_{K-1}$, made with the assumption that the data have a uniform distribution. The value of f(K) is the ratio of the real distortion to the estimated distortion and is close to 1 when the data distribution is uniform. When there are areas of concentration in the data distribution, $S_K$ will be less than the estimated value, so that f(K) decreases. The smaller f(K) is, the more concentrated is the data distribution. Thus values of K that yield small f(K) can be regarded as giving well-defined clusters.
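A minimal Python sketch of equations (2) and (3) is given below for illustration. It assumes that the sums of distortions S_1, S_2, ... have already been obtained from a clustering run and are passed in as a list; the function names and the sample values are hypothetical.

```python
def alpha_weights(k_max, n_dims):
    """Weight factors alpha_K for K = 2..k_max from equations (3a) and (3b)."""
    if n_dims <= 1:
        raise ValueError("equations (3a) and (3b) are stated for N_d > 1")
    alphas = {}
    for k in range(2, k_max + 1):
        if k == 2:
            alphas[k] = 1.0 - 3.0 / (4.0 * n_dims)  # equation (3a)
        else:
            alphas[k] = alphas[k - 1] + (1.0 - alphas[k - 1]) / 6.0  # equation (3b)
    return alphas


def evaluation_function(distortions, n_dims):
    """Return {K: f(K)} following equation (2).

    distortions[i] holds S_K for K = i + 1, so distortions[0] is S_1.
    """
    k_max = len(distortions)
    alphas = alpha_weights(k_max, n_dims)
    f = {}
    for k in range(1, k_max + 1):
        if k == 1:
            f[k] = 1.0
            continue
        s_prev = distortions[k - 2]
        s_k = distortions[k - 1]
        # f(K) = 1 when the previous sum of distortions is already zero.
        f[k] = 1.0 if s_prev == 0 else s_k / (alphas[k] * s_prev)
    return f


# Hypothetical sums of distortions for K = 1..4 on a two-dimensional data set.
f_values = evaluation_function([100.0, 20.0, 15.0, 12.0], n_dims=2)
for k in sorted(f_values):
    print(k, round(f_values[k], 3))
```

With these sample values only f(2) falls well below 1, which is the behaviour expected of data with two areas of concentration.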
The weight factor $\alpha_K$, defined in equation (3), is a positive number less than or equal to 1 and is applied to reduce the effect of dimensions. With K = 2, $\alpha_K$ is computed using equation (3a). This equation is derived from equation (7) in reference [29], which shows that the decrease in distortion is inversely proportional to the number of dimensions $N_d$.
As K increases above 2, the decrease in the sum of distortions reduces (the ratio $S_K/S_{K-1}$ approaches 1), as can be seen in Fig. 4. This figure shows the values of $S_K/S_{K-1}$ computed for different K when the clustering algorithm is applied to data sets of different dimensions and with uniform distributions. With such data sets, f(K) is expected to be equal to 1 and $\alpha_K$ should be chosen to equate f(K) to 1. From equation (2), $\alpha_K$ should therefore be $S_K/S_{K-1}$ and thus obtainable from Fig. 4. However, for computational simplicity, the recursion equation (3b) has been derived from the data represented in Fig. 4 to calculate $\alpha_K$. Figure 5 shows that the values of $\alpha_K$ obtained from equation (3b) fit the plots in Fig. 4 closely.
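The behaviour of the weight factor can be made explicit with a short worked derivation. It follows directly from equations (3a) and (3b) but is not part of the original paper. Subtracting both sides of equation (3b) from 1 gives a geometric recursion, so that the weight factor approaches 1 as K grows, in line with the trend of the ratio described above. The numerical values are for the illustrative case of two dimensions.

```latex
\begin{align*}
1 - \alpha_K &= \tfrac{5}{6}\,\bigl(1 - \alpha_{K-1}\bigr), \qquad K > 2,\\
\alpha_K &= 1 - \frac{3}{4 N_d}\left(\frac{5}{6}\right)^{K-2}, \qquad K \ge 2,\\
N_d = 2:\quad \alpha_2 &= 0.625,\quad \alpha_3 = 0.6875,\quad \alpha_4 \approx 0.7396 .
\end{align*}
```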
The proposed function f(K) satisfies the constraints mentioned in the previous section. The robustness of f(K) will be verified experimentally in the next section. When the number of objects is doubled or tripled but their distributions are unchanged, the resultant clusters remain in the same position. $S_K$ and $S_{K-1}$ are doubled or tripled correspondingly, so that f(K) stays constant. Therefore, generally, f(K) is independent of the number of objects in the data set.
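This argument can be written out explicitly. As a sketch, assume that every object is replicated m times (m = 2 or 3 in the cases above) and that the cluster centres do not move; the factor m is introduced here purely for illustration. Each cluster distortion is then multiplied by m and, since the weight factor depends only on K and on the number of dimensions, the factor cancels in equation (2):

```latex
\begin{align*}
S'_K &= \sum_{j=1}^{K} m\, I_j = m\, S_K,\\
f'(K) &= \frac{S'_K}{\alpha_K\, S'_{K-1}}
       = \frac{m\, S_K}{\alpha_K\, m\, S_{K-1}}
       = f(K), \qquad S_{K-1} \neq 0 .
\end{align*}
```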
To reduce the effect of differences in the ranges of the attributes, data are normalized before the clustering starts. However, it should be noted that, when the data have well-separated groups of objects, the shape of such regions in the problem space has an effect on the evaluation function. In these cases, the normalization does not influence the local object distribution because it is a scaling technique that applies to the whole data set.
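The paper does not state which normalization is used. The following sketch assumes min-max scaling of every attribute to the range [0, 1], which is one common scaling technique applied to the whole data set; the function name is hypothetical.

```python
def min_max_normalize(data):
    """Scale every attribute (column) of `data` to the range [0, 1].

    Min-max scaling is an assumption; the paper only says that data are normalized.
    """
    n_dims = len(data[0])
    lows = [min(row[d] for row in data) for d in range(n_dims)]
    highs = [max(row[d] for row in data) for d in range(n_dims)]
    normalized = []
    for row in data:
        new_row = []
        for d in range(n_dims):
            span = highs[d] - lows[d]
            # A constant attribute has zero range; map it to 0 to avoid dividing by zero.
            new_row.append((row[d] - lows[d]) / span if span else 0.0)
        normalized.append(new_row)
    return normalized


# Two attributes with very different ranges.
print(min_max_normalize([[0, 100], [5, 300], [10, 200]]))
# [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

Because the same scale factor is applied to every object, the relative positions of the objects within each attribute are preserved.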
5 PERFORMANCE
The evaluation function f(K) is tested in a series of experiments on the artificially generated data sets shown in Fig. 6. All data are normalized before the incremental K-means algorithm is applied with K ranging from 1 to 19. f(K) is calculated on the basis of the total distortion of the clusters.
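The sweep over K can be sketched as follows. The incremental K-means algorithm of reference [29] is not described in this paper, so the sketch substitutes plain Lloyd-style K-means with a deterministic farthest-point initialization; this stand-in, the function names, and the small data set are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def init_centres(data, k):
    """Deterministic farthest-point initialization (not the method of [29])."""
    centres = [list(data[0])]
    while len(centres) < k:
        farthest = max(data, key=lambda p: min(sq_dist(p, c) for c in centres))
        centres.append(list(farthest))
    return centres


def k_means(data, k, max_iter=100):
    """Plain Lloyd iterations; returns (centres, sum of distortions S_K)."""
    centres = init_centres(data, k)
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centre.
        new_centres = [mean(g) if g else centres[i] for i, g in enumerate(groups)]
        if new_centres == centres:
            break
        centres = new_centres
    s_k = sum(min(sq_dist(p, c) for c in centres) for p in data)
    return centres, s_k


def distortion_curve(data, k_max):
    """S_K for K = 1..k_max; index 0 of the returned list holds S_1."""
    return [k_means(data, k)[1] for k in range(1, k_max + 1)]


# Two well-separated groups of four objects each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(distortion_curve(points, 3))
```

The resulting list of sums of distortions is the only input that equation (2) needs. Because the initialization is deterministic, repeated runs give the same curve, which is the property required of the clustering algorithm in section 3.4.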
In Figs 6a–c, all objects belong to a single region with a uniform distribution. The graph in Fig. 6a shows that f(K) reflects well the clustering result on this data set with a uniform distribution because f(K) is approximately constant and equal to 1 for all K. When K = 4 and K = 3 in Figs 6a and b respectively, f(K) reaches minimum values. This could be attributed to the shape of the areas defined by the objects belonging to these data sets. However, the minimum values of f(K) do not differ sufficiently from the average value for any strong recommendation to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that $\alpha_K$ reduces the effect of the data set dimensions on the evaluation function.

Fig. 4 The ratio $S_K/S_{K-1}$ for data sets having uniform distributions: (a) two-dimensional ‘square’ and ‘circle’; (b) four-dimensional ‘cube’ and ‘sphere’

Fig. 5 Comparison of the values of $\alpha_K$ calculated using equation (3b) and the ratio $S_K/S_{K-1}$

Fig. 6 Data sets and their corresponding f(K)
For the data set in Fig. 6d, again all objects are concentrated in a single region with a normal distribution. The f(K) plot for this data set suggests correctly that, when K = 1, the clustering result is the most suitable for this data set.
The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e, the two generators have an overlapping region but, in Fig. 6f, they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.

Fig. 7 f(K) for the 12 benchmark data sets
The data sets in Figs 6g and h have three recognizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.
Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case, f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions, but K = 3 could also be used to cluster the objects.
Figures 6j and k illustrate how the level of detail could affect the selection of K. f(K) reaches minimum values at K = 2 and 4 respectively. In such cases, users could select the most appropriate value of K based on their specific requirements. A more complex case is shown in Fig. 6l, where there is a possible K value of 4 or 8. The selection of a particular K will depend on the requirements of the specific application for which the clustering is carried out.
The data sets in Figs 6m–o have well-defined regions in the object space, each of which has a different distribution, location, and number of objects. If the minimum value of f(K) is used to cluster the objects, K will be different from the number of generators utilized to create them (as in the case of the clusters in Fig. 6o) or the number of object groupings that could be identified visually (as in the case of the clusters in Figs 6m and n). The reason for the difference varies with different cases. For example, it could be considered that there are five clusters in Fig. 6m because the cluster distances are smaller for the two leftmost pairs of clusters than for others and the clusters in those pairs could be merged together. However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters, and the final decision as to which value to adopt has to be left to the discretion of the user.
From the graphs in Fig. 6, a conclusion could be made that any K with corresponding f(K) < 0.85 could be recommended for clustering. If there is not a value with corresponding f(K) < 0.85, K = 1 is selected.
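The selection rule can be expressed in a few lines of Python. The sketch below assumes that the values of f(K) are available as a dictionary keyed by K; the function name and the sample values are hypothetical.

```python
def recommend_k(f_values, threshold=0.85):
    """Return every K with f(K) below the threshold, or [1] if there is none."""
    candidates = sorted(k for k, f in f_values.items() if f < threshold)
    return candidates if candidates else [1]


# Hypothetical f(K) values for two data sets.
print(recommend_k({1: 1.0, 2: 0.32, 3: 1.09, 4: 1.08}))  # [2]
print(recommend_k({1: 1.0, 2: 0.97, 3: 1.01}))           # [1]
```

More than one value of K may be returned, in which case the final choice is left to the user.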
The proposed function f(K) is also applied to 12 benchmarking data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are as given in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.
The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.
6 CONCLUSION
Existing methods of selecting the number of clusters for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.
A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project supported by the Engineering and Physical Sciences Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the IPROMS Network of Excellence funded by the European Commission.
Table 2 The recommended number of clusters based on f(K)

Data sets          Proposed number of clusters
Australian         1
Balance-scale      1
Car evaluation     2, 3, 4
Cmc                1
Ionosphere         2
Iris               2, 3
Page blocks        2
Pima               1
Wdbc               2
Wine               3
Yeast              1
Zoo                2
REFERENCES
1 Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2000 (Morgan Kaufmann, San Francisco, California).
2 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. Fast K-means clustering algorithms. Report 95.18, School of Computer Studies, University of Leeds, June 1995.
3 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. New methods for the initialisation of clusters. Pattern Recognition Lett., 1996, 17, 451–455.
4 Alsabti, K., Ranka, S., and Singh, V. An efficient K-means clustering algorithm. In Proceedings of the First Workshop on High-Performance Data Mining, Orlando, Florida, 1998; ftp://ftp.cise.ufl.edu/pub/faculty/ranka/Proceedings.
5 Bilmes, J., Vahdat, A., Hsu, W., and Im, E. J. Empirical observations of probabilistic heuristics for the clustering problem. Technical Report TR-97-018, International Computer Science Institute, Berkeley, California.
6 Bottou, L. and Bengio, Y. Convergence properties of the K-means algorithm. Adv. Neural Infn Processing Systems, 1995, 7, 585–592.
7 Bradley, S. and Fayyad, U. M. Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML ’98) (Ed. J. Shavlik), Madison, Wisconsin, 1998, pp. 91–99 (Morgan Kaufmann, San Francisco, California).
8 Du, Q. and Wong, T.-W. Numerical studies of MacQueen’s K-means algorithm for computing the centroidal Voronoi tessellations. Int. J. Computers Math. Applics, 2002, 44, 511–523.
9 Castro, V. E. and Yang, J. A fast and robust general purpose clustering algorithm. In Proceedings of the Fourth European Workshop on Principles of Knowledge Discovery in Databases and Data Mining (PKDD 00), Lyon, France, 2000, pp. 208–218.
10 Castro, V. E. Why so many clustering algorithms. SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002, 4(1), 65–75.
11 Fritzke, B. The LBG-U method for vector quantization – an improvement over LBG inspired from neural networks. Neural Processing Lett., 1997, 5(1), 35–45.
12 Hamerly, G. and Elkan, C. Alternatives to the K-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, Virginia, 2002, pp. 600–607.
13 Hansen, L. K. and Larsen, J. Unsupervised learning and generalisation. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, June 1996, pp. 25–30 (IEEE, New York).
14 Ishioka, T. Extended K-means with an efficient estimation of the number of clusters. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, December 2000, pp. 17–22.
15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. The efficient K-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Analysis Mach. Intell., 2002, 24(7), 881–892.
16 Pelleg, D. and Moore, A. Accelerating exact K-means algorithms with geometric reasoning. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, 1999, pp. 277–281.
17 Pelleg, D. and Moore, A. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, 2000, 727–734.
18 Pena, J. M., Lazano, J. A., and Larranaga, P. An empirical comparison of four initialisation methods for the K-means algorithm. Pattern Recognition Lett., 1999, 20, 1027–1040.
19 SPSS Clementine Data Mining System. User Guide Version 5, 1998 (Integral Solutions Limited, Basingstoke, Hampshire).
20 DataEngine 3.0 – Intelligent Data Analysis – an Easy Job, Management Intelligenter Technologien GmbH, Germany, 1998; http://www.mitgmbh.de.
21 Kerr, A., Hall, H. K., and Kozub, S. Doing Statistics with SPSS, 2002 (Sage, London).
22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2, Insightful Corporation, Seattle, Washington, 2001; http://www.insightful.com/DocumentsLive/23/44/statman2.pdf.
23 Hardy, A. On the number of clusters. Comput. Statist. Data Analysis, 1996, 23, 83–96.
24 Theodoridis, S. and Koutroubas, K. Pattern Recognition, 1998 (Academic Press, London).
25 Halkidi, M., Batistakis, Y., and Vazirgiannis, M. Cluster validity methods. Part I. SIGMOD Record, 2002, 31(2); available online http://www.acm.org/sigmod/record.
26 Kothari, R. and Pitts, D. On finding the number of clusters. Pattern Recognition Lett., 1999, 20, 405–416.
27 Cai, Z. Technical aspects of data mining. PhD thesis, Cardiff University, Cardiff, 2001.
28 Lindeberg, T. Scale-space Theory in Computer Vision, 1994 (Kluwer Academic, Boston, Massachusetts).
29 Pham, D. T., Dimov, S. S., and Nguyen, C. D. Incremental K-means algorithm. Proc. Instn Mech. Engrs, Part C: J. Mechanical Engineering Science, 2003, 218, 783–795.
30 Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California, 2000.
31 Blake, C., Keogh, E., and Merz, C. J. UCI Repository of Machine Learning Databases, Irvine, California, Department of Information and Computer Science, University of California, Irvine, California, 1998.
APPENDIX
Notation
A, B                    clusters
$d(x_{jt}, w_j)$        distance between object $x_{jt}$ and the centre $w_j$ of cluster j
f(K)                    evaluation function
$G_A$, $G_B$            generators
$I_j$                   distortion of cluster j
K                       number of clusters
N                       number of objects in the data set
$N_d$                   number of data set attributes (the dimension of the data set)
$N_j$                   number of objects belonging to cluster j
$P_{G_A}$, $P_{G_B}$    probabilities that X is created by $G_A$ or $G_B$ respectively
$P_{C_A}$, $P_{C_B}$    probabilities that X is clustered into A or B respectively
$S_K$                   sum of all distortions with K being the specified number of clusters
X                       object
$x_{jt}$                object belonging to cluster j
$w_j$                   centre of cluster j
$\alpha_K$              weight factor
Selection of K in K-means clustering 119
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
its contribution to the sum of all distortions SKgiven by
SK frac14XK
jfrac141
Ij (1b)
where K is the specified number of clustersThus such information is important in assessing
whether a particular region in the object spacecould be considered a cluster
34 Constraints on f (K )
The robustness of f(K ) is very important Becausethis function is based on the result of the clusteringalgorithm it is important for this result to vary aslittle as possible when K remains unchanged How-ever one of the main deficiencies of the K-meansapproach is its dependence on randomness Thusthe algorithm should yield consistent results sothat its performance can be used as a variablein the evaluation function A new version of the K-means algorithm namely the incremental K-meansalgorithm [29] satisfies this requirement and canbe adopted for this purpose
The role of f(K ) is to reveal trends in the datadistribution and therefore it is important to keep itindependent of the number of objects The numberof clusters K is assumed to be much smaller thanthe number of objects N When K increases f(K )should converge to some constant value Then iffor any intermediate K f(K ) exhibits a special beha-viour such as a minimum or maximum point thatvalue of K could be taken as the desired number ofclusters
4 NUMBER OF CLUSTERS FORK-MEANS CLUSTERING
As mentioned in section 33 cluster analysis isused to find irregularities in the data distributionWhen the data distribution is uniform there is notany irregularity Therefore data sets with uniformdistribution could be used to calibrate and verifythe clustering result This approach was applied byTibshirani et al [30] A data set of the same dimen-sion as the actual data set and with a uniform distri-bution was generated The clustering performanceon this artificial data set was then compared withthe result obtained for the actual data set A measureknown as the lsquogaprsquo statistic [30] was employed toassess performance In this work instead of generat-ing an artificial data set the clustering performancefor the artificial data set was estimated Also insteadof the gap statistic a new and more discriminatory
measure was employed for evaluating the clusteringresult
When the K-means algorithm is applied to datawith a uniform distribution and K is increased by 1the clusters are likely to change and in the new pos-itions the partitions will again be approximatelyequal in size and their distortions similar to oneanother The evaluations carried out in reference[29] showed that when a new cluster is insertedinto a cluster (K frac14 1) with a hypercuboid shape anda uniform distribution the decrease in the sum ofdistortions is proportional to the original sum of dis-tortions This conclusion was found to be correct forclustering results obtained with relatively smallvalues of K In such cases the sum of distortionsafter the increase in the number of clusters couldbe estimated from the current value
The evaluation function f(K ) is defined using theequations
f (K ) frac14
1 if K frac14 1SK
aK SK1if SK1 = 0 8K 1
1 if SK1 frac14 0 8K 1
8gtlt
gt(2)
aK frac14
13
4Nd
if K frac14 2 and Nd 1
(3a)
aK1 thorn1 aK1
6if K 2 and Nd 1
(3b)
8gtgtgtgtgtlt
gtgtgtgtgt
where SK is the sum of the cluster distortions whenthe number of clusters is K Nd is the number ofdata set attributes (ie the number of dimensions)and aK is a weight factor The term aKSK21 inequation (2) is an estimate of SK based on SK21
made with the assumption that the data have a uni-form distribution The value of f(K ) is the ratio of thereal distortion to the estimated distortion and isclose to 1 when the data distribution is uniformWhen there are areas of concentration in the datadistribution SK will be less than the estimatedvalue so that f(K ) decreases The smaller that f(K )is the more concentrated is the data distributionThus values of K that yield small f(K ) can beregarded as giving well-defined clusters
The weight factor aK defined in equation (3) isa positive number less than or equal to 1 and isapplied to reduce the effect of dimensions WithK frac14 2 aK is computed using equation (3a) This equa-tion is derived from equation (7) in reference [29]which shows that the decrease in distortion is inver-sely proportional to the number of dimensions Nd
As K increases above 2 the decrease in the sum ofdistortions reduces (the ratio SKSK21 approaches 1)as can be seen in Fig 4 This figure shows the values
108 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
of SKSK21 computed for different K when the clus-tering algorithm is applied to data sets of differentdimensions and with uniform distributions Withsuch data sets f(K ) is expected to be equal to 1 andaK should be chosen to equate f(K ) to 1 Fromequation (2) aK should therefore be SKSK21 andthus obtainable from Fig 4 However for compu-tational simplicity the recursion equation (3b) hasbeen derived from the data represented in Fig 4 tocalculate aK Figure 5 shows that the values of aK
obtained from equation (3b) fit the plots in Fig 4closely
The proposed function f(K ) satisfies the con-straints mentioned in the previous section Therobustness of f(K ) will be verified experimentally inthe next section When the number of objects isdoubled or tripled but their distributions areunchanged the resultant clusters remain in thesame position SK and SK21 are doubled or tripledcorrespondingly so that f(K ) stays constant There-fore generally f(K ) is independent of the numberof objects in the data set
To reduce the effect of differences in the rangesof the attributes data are normalized before theclustering starts However it should be noted that
when the data have well-separated groups of objectsthe shape of such regions in the problem space hasan effect on the evaluation function In these casesthe normalization does not influence the localobject distribution because it is a scaling techniquethat applies to the whole data set
5 PERFORMANCE
The evaluation function f(K ) is tested in a series ofexperiments on the artificially generated data setsshown in Fig 6 All data are normalized before theincremental K-means algorithm is applied with Kranging from 1 to 19 f(K ) is calculated on the basisof the total distortion of the clusters
In Figs 6andashc all objects belong to a single regionwith a uniform distribution The graph in Fig 6ashows that f(K ) reflects well the clustering result onthis data set with a uniform distribution becausef(K ) is approximately constant and equal to 1 forall K When K frac14 4 and K frac14 3 in Figs 6a and b respect-ively f(K ) reaches minimum values This could beattributed to the shape of the areas defined by theobjects belonging to these data sets However theminimum values of f(K ) do not differ significantly
Fig 4 The ratio SKSK21 for data sets having uniform
distributions (a) two-dimensional lsquosquarersquo and
lsquocirclersquo (b) four-dimensional lsquocubersquo and lsquospherersquo
Fig 5 Comparison of the values of aK calculated using
equation (3b) and the ratio SKSK21
Selection of K in K-means clustering 109
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
Fig 6 Data sets and their corresponding f(K )
110 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
Fig 6 Continued
Selection of K in K-means clustering 111
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
Fig 6 Continued
112 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
Fig 6 Continued
Selection of K in K-means clustering 113
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
Fig 6 Continued
114 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
from the average value for any strong recommen-dations to be made to the user By comparing thevalues of f(K ) in Figs 6a and c it can be seen thataK reduces the effect of the data set dimensions onthe evaluation function
For the data set in Fig 6d again all objects areconcentrated in a single region with a normal
distribution The f(K ) plot for this data set suggestscorrectly that when K frac14 1 the clustering result isthe most suitable for this data set
The data sets in Figs 6e and f are created bytwo generators that have normal distributions InFig 6e the two generators have an overlappingregion but in Fig 6f they are well separated Note
Fig 7 f(K) for the 12 benchmark data sets
Selection of K in K-means clustering 115
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
that the value for f(2) in the latter figure is muchsmaller than in the former
The data sets in Figs 6g and h have three recog-nizable regions From the corresponding graphsf(K ) suggests correct values of K for clusteringthese data sets
Three different generators that create objectgroupings with a normal distribution are used toform the data set in Fig 6i In this case f(K ) suggeststhe value 2 or 3 for K Because two of these threegenerators create object groupings that overlapf(2) is smaller than f(3) This means that the data
Fig 7 Continued
116 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
have only two clearly defined regions but K frac14 3could also be used to cluster the objects
Figures 6j and k illustrate how the level of detailcould affect the selection of K f(K ) reaches mini-mum values at K frac14 2 and 4 respectively In suchcases users could select the most appropriate valueof K based on their specific requirements A morecomplex case is shown in Fig 6l where there is apossible K value of 4 or 8 The selection of a parti-cular K will depend on the requirements of thespecific application for which the clustering iscarried out
The data sets in Figs 6mndasho have well-definedregions in the object space each of which has adifferent distribution location and number ofobjects If the minimum value of f(K ) is used tocluster the objects K will be different from thenumber of generators utilized to create them (as inthe case of the clusters in Fig 6o or the number ofobject groupings that could be identified visually(as in the case of the clusters in Figs 6m and n)The reason for the difference varies with differentcases For example it could be considered thatthere are five clusters in Fig 6m because the clusterdistances are smaller for the two leftmost pairs ofclusters than for others and the clusters in thosepairs could be merged together However nosimple explanation could be given for the casesshown in Figs 6n and o This highlights the factthat f(K ) should only be used to suggest a guidevalue for the number of clusters and the finaldecision as to which value to adopt has to be left atthe discretion of the user
From the graphs in Fig 6 a conclusion could bemade that any K with corresponding f(K ) 085could be recommended for clustering If there isnot a value with corresponding f(K ) 085 K frac14 1 isselected
The proposed function f(K) is also applied to 12 benchmarking data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are as given in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.
The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.
6 CONCLUSION
Existing methods of selecting the number of clusters for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.
A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project supported by the Engineering and Physical Sciences Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the IPROMS Network of Excellence funded by the European Commission.
Table 2 The recommended number of clusters based on f(K)

Data set          Proposed number of clusters
Australian        1
Balance-scale     1
Car evaluation    2, 3, 4
Cmc               1
Ionosphere        2
Iris              2, 3
Page blocks       2
Pima              1
Wdbc              2
Wine              3
Yeast             1
Zoo               2
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
REFERENCES
1 Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2000 (Morgan Kaufmann, San Francisco, California).
2 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. Fast K-means clustering algorithms. Report 95.18, School of Computer Studies, University of Leeds, June 1995.
3 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. New methods for the initialisation of clusters. Pattern Recognition Lett., 1996, 17, 451–455.
4 Alsabti, K., Ranka, S., and Singh, V. An efficient K-means clustering algorithm. In Proceedings of the First Workshop on High-Performance Data Mining, Orlando, Florida, 1998; ftp://ftp.cise.ufl.edu/pub/faculty/ranka/Proceedings.
5 Bilmes, J., Vahdat, A., Hsu, W., and Im, E. J. Empirical observations of probabilistic heuristics for the clustering problem. Technical Report TR-97-018, International Computer Science Institute, Berkeley, California.
6 Bottou, L. and Bengio, Y. Convergence properties of the K-means algorithm. Adv. Neural Infn Processing Systems, 1995, 7, 585–592.
7 Bradley, S. and Fayyad, U. M. Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98) (Ed. J. Shavlik), Madison, Wisconsin, 1998, pp. 91–99 (Morgan Kaufmann, San Francisco, California).
8 Du, Q. and Wong, T.-W. Numerical studies of MacQueen's K-means algorithm for computing the centroidal Voronoi tessellations. Int. J. Computers Math. Applics, 2002, 44, 511–523.
9 Castro, V. E. and Yang, J. A fast and robust general purpose clustering algorithm. In Proceedings of the Fourth European Workshop on Principles of Knowledge Discovery in Databases and Data Mining (PKDD 00), Lyon, France, 2000, pp. 208–218.
10 Castro, V. E. Why so many clustering algorithms? SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002, 4(1), 65–75.
11 Fritzke, B. The LBG-U method for vector quantization – an improvement over LBG inspired from neural networks. Neural Processing Lett., 1997, 5(1), 35–45.
12 Hamerly, G. and Elkan, C. Alternatives to the K-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, Virginia, 2002, pp. 600–607.
13 Hansen, L. K. and Larsen, J. Unsupervised learning and generalisation. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, June 1996, pp. 25–30 (IEEE, New York).
14 Ishioka, T. Extended K-means with an efficient estimation of the number of clusters. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, December 2000, pp. 17–22.
15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. The efficient K-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Analysis Mach. Intell., 2002, 24(7), 881–892.
16 Pelleg, D. and Moore, A. Accelerating exact K-means algorithms with geometric reasoning. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, 1999, pp. 277–281.
17 Pelleg, D. and Moore, A. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, 2000, pp. 727–734.
18 Pena, J. M., Lozano, J. A., and Larranaga, P. An empirical comparison of four initialisation methods for the K-means algorithm. Pattern Recognition Lett., 1999, 20, 1027–1040.
19 SPSS Clementine Data Mining System, User Guide Version 5, 1998 (Integral Solutions Limited, Basingstoke, Hampshire).
20 DataEngine 3.0 – Intelligent Data Analysis – an Easy Job, Management Intelligenter Technologien GmbH, Germany, 1998; http://www.mitgmbh.de.
21 Kerr, A., Hall, H. K., and Kozub, S. Doing Statistics with SPSS, 2002 (Sage, London).
22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2, Insightful Corporation, Seattle, Washington, 2001; http://www.insightful.com/DocumentsLive/23/44/statman2.pdf.
23 Hardy, A. On the number of clusters. Comput. Statist. Data Analysis, 1996, 23, 83–96.
24 Theodoridis, S. and Koutroumbas, K. Pattern Recognition, 1998 (Academic Press, London).
25 Halkidi, M., Batistakis, Y., and Vazirgiannis, M. Cluster validity methods: Part I. SIGMOD Record, 2002, 31(2); available online http://www.acm.org/sigmod/record.
26 Kothari, R. and Pitts, D. On finding the number of clusters. Pattern Recognition Lett., 1999, 20, 405–416.
27 Cai, Z. Technical aspects of data mining. PhD thesis, Cardiff University, Cardiff, 2001.
28 Lindeberg, T. Scale-space Theory in Computer Vision, 1994 (Kluwer Academic, Boston, Massachusetts).
29 Pham, D. T., Dimov, S. S., and Nguyen, C. D. Incremental K-means algorithm. Proc. Instn Mech. Engrs, Part C: J. Mechanical Engineering Science, 2003, 218, 783–795.
30 Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California, 2000.
31 Blake, C., Keogh, E., and Merz, C. J. UCI Repository of Machine Learning Databases, Irvine, California, Department of Information and Computer Science, University of California, Irvine, California, 1998.
APPENDIX
Notation
A, B                  clusters
$d(x_{jt}, w_j)$      distance between object $x_{jt}$ and the centre $w_j$ of cluster j
f(K)                  evaluation function
$G_A$, $G_B$          generators
$I_j$                 distortion of cluster j
K                     number of clusters
N                     number of objects in the data set
$N_d$                 number of data set attributes (the dimension of the data set)
$N_j$                 number of objects belonging to cluster j
$P_{G_A}$, $P_{G_B}$  probabilities that X is created by $G_A$ or $G_B$ respectively
$P_{C_A}$, $P_{C_B}$  probabilities that X is clustered into A or B respectively
$S_K$                 sum of all distortions with K being the specified number of clusters
X                     object
$x_{jt}$              object belonging to cluster j
$w_j$                 centre of cluster j
$\alpha_K$            weight factor
of $S_K/S_{K-1}$ computed for different K when the clustering algorithm is applied to data sets of different dimensions and with uniform distributions. With such data sets, f(K) is expected to be equal to 1, and $\alpha_K$ should be chosen to equate f(K) to 1. From equation (2), $\alpha_K$ should therefore be $S_K/S_{K-1}$ and is thus obtainable from Fig. 4. However, for computational simplicity, the recursion equation (3b) has been derived from the data represented in Fig. 4 to calculate $\alpha_K$. Figure 5 shows that the values of $\alpha_K$ obtained from equation (3b) fit the plots in Fig. 4 closely.
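The calibration argument can be illustrated with a short Python sketch. It assumes the ratio form f(K) = S_K / (α_K S_{K-1}) for K > 1, which is what the paragraph implies when it sets α_K = S_K/S_{K-1} to obtain f(K) = 1. The function name, the choice f(1) = 1, the zero-distortion guard, and all numerical values are assumptions made for illustration. The weight factors are passed in directly rather than computed from equation (3b).

```python
def evaluation_function(distortions, alphas):
    """Compute f(K) for K = 1..len(distortions).

    distortions[K-1] is S_K, the total distortion for K clusters.
    alphas maps K (for K >= 2) to the weight factor alpha_K.
    """
    f = {1: 1.0}  # assumed: f(1) = 1, since there is no S_0 to compare with
    for k in range(2, len(distortions) + 1):
        s_k, s_prev = distortions[k - 1], distortions[k - 2]
        if s_prev == 0:
            f[k] = 1.0  # assumed guard against division by zero
        else:
            f[k] = s_k / (alphas[k] * s_prev)
    return f


# Hypothetical distortions for a uniformly distributed data set.
uniform_s = [100.0, 55.0, 38.0, 29.0]
# Calibrate alpha_K as S_K / S_{K-1}, as the paragraph describes.
alphas = {k: uniform_s[k - 1] / uniform_s[k - 2] for k in range(2, 5)}

# On the calibration data, f(K) is 1 (up to rounding) for every K.
print(evaluation_function(uniform_s, alphas))

# A data set whose distortion drops sharply at K = 2 gives a small f(2).
print(evaluation_function([100.0, 20.0, 15.0, 12.0], alphas))
```

With weights calibrated on uniform data, a value of f(K) well below 1 indicates that moving from K - 1 to K clusters reduced the distortion more than it would for structureless data.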
The proposed function f(K) satisfies the constraints mentioned in the previous section. The robustness of f(K) will be verified experimentally in the next section. When the number of objects is doubled or tripled but their distributions are unchanged, the resultant clusters remain in the same position. $S_K$ and $S_{K-1}$ are doubled or tripled correspondingly, so that f(K) stays constant. Therefore, generally, f(K) is independent of the number of objects in the data set.
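The independence argument can be written out in one line. The following is a sketch under two assumptions: f(K) has the ratio form $S_K/(\alpha_K S_{K-1})$ implied by the calibration of $\alpha_K$, and $\alpha_K$ does not depend on the number of objects. If every object is replicated c times and the cluster centres do not move, each total distortion is multiplied by c, so that

```latex
f'(K) = \frac{c\,S_K}{\alpha_K \, c\,S_{K-1}}
      = \frac{S_K}{\alpha_K S_{K-1}}
      = f(K), \qquad K > 1,\; S_{K-1} \neq 0 .
```

The factor c cancels, which is why doubling or tripling the objects leaves the evaluation function unchanged.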
To reduce the effect of differences in the ranges of the attributes, data are normalized before the clustering starts. However, it should be noted that, when the data have well-separated groups of objects, the shape of such regions in the problem space has an effect on the evaluation function. In these cases, the normalization does not influence the local object distribution because it is a scaling technique that applies to the whole data set.
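The text does not state which normalization is used. As one common possibility, the sketch below applies min-max scaling to each attribute. The function name and the sample data are hypothetical, and this particular scaling is an assumption rather than the authors' stated procedure.

```python
def min_max_normalize(rows):
    """Scale every attribute (column) of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        # A constant attribute carries no information; map it to 0.
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


# Two attributes with very different ranges (hypothetical values).
data = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]]
print(min_max_normalize(data))
```

Because the same shift and scale are applied to every object, the relative arrangement of objects within a region is preserved, which is consistent with the remark that normalization does not alter the local object distribution.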
5 PERFORMANCE
The evaluation function f(K) is tested in a series of experiments on the artificially generated data sets shown in Fig. 6. All data are normalized before the incremental K-means algorithm is applied with K ranging from 1 to 19. f(K) is calculated on the basis of the total distortion of the clusters.
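The total distortion on which f(K) is based can be sketched as follows, using the notation of the Appendix: $S_K$ is the sum of the cluster distortions $I_j$. The sketch assumes that $I_j$ is the sum of squared Euclidean distances between the objects of cluster j and its centre $w_j$. The function name and the four sample points are hypothetical, and the cluster assignments are given by hand rather than produced by the incremental K-means algorithm.

```python
def total_distortion(points, labels, centres):
    """Return S_K: the sum over clusters j of the distortion I_j."""
    distortion = [0.0] * len(centres)  # one I_j per cluster
    for point, j in zip(points, labels):
        # Squared Euclidean distance d(x_jt, w_j)^2 (assumed form of d).
        distortion[j] += sum((x - w) ** 2 for x, w in zip(point, centres[j]))
    return sum(distortion)


points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]

# K = 1: a single centre at the mean of all points.
s1 = total_distortion(points, [0, 0, 0, 0], [(2.0, 1.0)])
# K = 2: left and right pairs, each with its own centre.
s2 = total_distortion(points, [0, 0, 1, 1], [(0.0, 1.0), (4.0, 1.0)])
print(s1, s2)
```

Repeating this calculation for each K in the tested range yields the sequence of $S_K$ values from which f(K) is evaluated.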
In Figs 6a–c, all objects belong to a single region with a uniform distribution. The graph in Fig. 6a shows that f(K) reflects well the clustering result on this data set with a uniform distribution because f(K) is approximately constant and equal to 1 for all K. When K = 4 and K = 3 in Figs 6a and b respectively, f(K) reaches minimum values. This could be attributed to the shape of the areas defined by the objects belonging to these data sets. However, the minimum values of f(K) do not differ sufficiently from the average value for any strong recommendations to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that $\alpha_K$ reduces the effect of the data set dimensions on the evaluation function.
Fig. 4 The ratio $S_K/S_{K-1}$ for data sets having uniform distributions: (a) two-dimensional 'square' and 'circle'; (b) four-dimensional 'cube' and 'sphere'
Fig. 5 Comparison of the values of $\alpha_K$ calculated using equation (3b) and the ratio $S_K/S_{K-1}$
Fig. 6 Data sets and their corresponding f(K)
For the data set in Fig. 6d, again, all objects are concentrated in a single region with a normal distribution. The f(K) plot for this data set suggests correctly that, when K = 1, the clustering result is the most suitable for this data set.
The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e, the two generators have an overlapping region but, in Fig. 6f, they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.
Fig. 7 f(K) for the 12 benchmark data sets
The data sets in Figs 6g and h have three recognizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.
Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case, f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions, but K = 3 could also be used to cluster the objects.
31 Blake C Keogh E and Merz C J UCI Re-pository of Machine Learning Databases IrvineCalifornia Department of Information and Com-puter Science University of California IrvineCalifornia 1998
118 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
from the average value for any strong recommendations to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that α_K reduces the effect of the data set dimensions on the evaluation function.
For the data set in Fig. 6d, again all objects are concentrated in a single region with a normal distribution. The f(K) plot for this data set correctly suggests that the clustering result obtained with K = 1 is the most suitable for this data set.
The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e the two generators have an overlapping region, but in Fig. 6f they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.
Fig. 7 f(K) for the 12 benchmark data sets
The data sets in Figs 6g and h have three recognizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.
Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions, but K = 3 could also be used to cluster the objects.
Figures 6j and k illustrate how the level of detail could affect the selection of K. f(K) reaches minimum values at K = 2 and 4 respectively. In such cases users could select the most appropriate value of K based on their specific requirements. A more complex case is shown in Fig. 6l, where there is a possible K value of 4 or 8. The selection of a particular K will depend on the requirements of the specific application for which the clustering is carried out.
The data sets in Figs 6m–o have well-defined regions in the object space, each of which has a different distribution, location and number of objects. If the minimum value of f(K) is used to cluster the objects, K will be different from the number of generators utilized to create them (as in the case of the clusters in Fig. 6o) or the number of object groupings that could be identified visually (as in the case of the clusters in Figs 6m and n). The reason for the difference varies from case to case. For example, it could be considered that there are five clusters in Fig. 6m because the cluster distances are smaller for the two leftmost pairs of clusters than for the others, and the clusters in those pairs could be merged together. However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters, and the final decision as to which value to adopt has to be left to the discretion of the user.
From the graphs in Fig. 6, it could be concluded that any K with corresponding f(K) < 0.85 could be recommended for clustering. If there is no value with corresponding f(K) < 0.85, K = 1 is selected.
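The selection rule stated above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name recommend_k, the dictionary input that maps each K to its f(K) value, and the example numbers are assumptions made here. Only the 0.85 threshold and the fallback to K = 1 come from the text.

```python
def recommend_k(f_values, threshold=0.85):
    """Return the values of K whose f(K) falls below the threshold.

    f_values maps each candidate K to its evaluation-function value f(K).
    If no K qualifies, K = 1 is returned, as stated in the text.
    """
    candidates = sorted(k for k, f in f_values.items() if f < threshold)
    return candidates if candidates else [1]


# Hypothetical f(K) values, for illustration only (not taken from the paper).
example = {1: 1.0, 2: 0.41, 3: 0.79, 4: 1.05, 5: 0.98}
print(recommend_k(example))
```

Returning a list rather than a single value reflects the observation that several values of K may be acceptable, with the final choice left to the user.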
The proposed function f(K) is also applied to 12 benchmarking data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are as given in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.
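The two-attributes-at-a-time inspection can be complemented by a numerical check. The sketch below is not part of the original study: it swaps the visual plots for Pearson correlation coefficients, and the function names, the 0.8 cut-off and the toy data are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant attribute carries no correlation information
    return cov / sqrt(var_x * var_y)


def correlated_pairs(rows, min_abs_r=0.8):
    """List attribute pairs (i, j, r) whose |r| is at least min_abs_r."""
    columns = list(zip(*rows))  # transpose rows into attribute columns
    pairs = []
    for i, j in combinations(range(len(columns)), 2):
        r = pearson(columns[i], columns[j])
        if abs(r) >= min_abs_r:
            pairs.append((i, j, r))
    return pairs


# Toy data: attribute 1 is exactly twice attribute 0; attribute 2 is only
# weakly correlated with the other two.
rows = [(0, 0, 1), (1, 2, 0), (2, 4, 1), (3, 6, 0)]
print(correlated_pairs(rows))
```

A Pearson coefficient only captures linear dependence, so pairs it does not flag may still be worth plotting.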
The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.
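For readers who wish to reproduce curves such as those in Figs 6 and 7, the following sketch computes f(K) from a list of distortion sums S_1, S_2, ..., one per run of the K-means algorithm. It restates the definitions from Section 4 as understood here: f(1) = 1; for K > 1, f(K) = S_K / (α_K S_{K-1}) when S_{K-1} is non-zero and 1 otherwise; α_2 = 1 - 3/(4 N_d); and α_K = α_{K-1} + (1 - α_{K-1})/6 for K > 2. Should this restatement differ from Section 4, Section 4 takes precedence. The function name, argument names and input numbers are assumptions made for illustration.

```python
def evaluation_function(distortions, n_dims):
    """Compute f(K) for K = 1..len(distortions).

    distortions[k - 1] is S_K, the sum of all cluster distortions obtained
    by running K-means with K = k. n_dims is N_d, the number of attributes.
    Returns a dict {K: f(K)}.
    """
    if n_dims < 2:
        raise ValueError("this sketch assumes N_d > 1")
    f = {}
    alpha = None
    for k, s_k in enumerate(distortions, start=1):
        if k == 1:
            f[k] = 1.0
            continue
        # Weight factor: alpha_2 = 1 - 3/(4 N_d), then a recursive update.
        if k == 2:
            alpha = 1.0 - 3.0 / (4.0 * n_dims)
        else:
            alpha = alpha + (1.0 - alpha) / 6.0
        s_prev = distortions[k - 2]  # S_{K-1}
        f[k] = s_k / (alpha * s_prev) if s_prev != 0 else 1.0
    return f


# Hypothetical distortion sums S_1..S_4 for a two-dimensional data set.
f = evaluation_function([100.0, 20.0, 15.0, 12.0], n_dims=2)
for k, value in f.items():
    print(k, round(value, 3))
```

Each entry of distortions requires one complete run of K-means, so the cost of building the f(K) curve grows with both the size of the data set and the largest K examined.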
6 CONCLUSION
Existing methods of selecting the number of clusters for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.
A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project supported by the Engineering and Physical Sciences Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the I*PROMS Network of Excellence funded by the European Commission.
Table 2 The recommended number of clusters based on f(K)
Data set          Proposed number of clusters
Australian        1
Balance-scale     1
Car evaluation    2, 3, 4
Cmc               1
Ionosphere        2
Iris              2, 3
Page blocks       2
Pima              1
Wdbc              2
Wine              3
Yeast             1
Zoo               2
REFERENCES
1 Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2000 (Morgan Kaufmann, San Francisco, California).
2 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. Fast K-means clustering algorithms. Report 95.18, School of Computer Studies, University of Leeds, June 1995.
3 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. New methods for the initialisation of clusters. Pattern Recognition Lett., 1996, 17, 451–455.
4 Alsabti, K., Ranka, S., and Singh, V. An efficient K-means clustering algorithm. In Proceedings of the First Workshop on High-Performance Data Mining, Orlando, Florida, 1998; ftp://ftp.cise.ufl.edu/pub/faculty/ranka/Proceedings.
5 Bilmes, J., Vahdat, A., Hsu, W., and Im, E. J. Empirical observations of probabilistic heuristics for the clustering problem. Technical Report TR-97-018, International Computer Science Institute, Berkeley, California.
6 Bottou, L. and Bengio, Y. Convergence properties of the K-means algorithm. Adv. Neural Infn Processing Systems, 1995, 7, 585–592.
7 Bradley, S. and Fayyad, U. M. Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98) (Ed. J. Shavlik), Madison, Wisconsin, 1998, pp. 91–99 (Morgan Kaufmann, San Francisco, California).
8 Du, Q. and Wong, T.-W. Numerical studies of MacQueen's K-means algorithm for computing the centroidal Voronoi tessellations. Int. J. Computers Math. Applics, 2002, 44, 511–523.
9 Castro, V. E. and Yang, J. A fast and robust general purpose clustering algorithm. In Proceedings of the Fourth European Workshop on Principles of Knowledge Discovery in Databases and Data Mining (PKDD 00), Lyon, France, 2000, pp. 208–218.
10 Castro, V. E. Why so many clustering algorithms. SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002, 4(1), 65–75.
11 Fritzke, B. The LBG-U method for vector quantization – an improvement over LBG inspired from neural networks. Neural Processing Lett., 1997, 5(1), 35–45.
12 Hamerly, G. and Elkan, C. Alternatives to the K-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, Virginia, 2002, pp. 600–607.
13 Hansen, L. K. and Larsen, J. Unsupervised learning and generalisation. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, June 1996, pp. 25–30 (IEEE, New York).
14 Ishioka, T. Extended K-means with an efficient estimation of the number of clusters. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, December 2000, pp. 17–22.
15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. The efficient K-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Analysis Mach. Intell., 2002, 24(7), 881–892.
16 Pelleg, D. and Moore, A. Accelerating exact K-means algorithms with geometric reasoning. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, 1999, pp. 277–281.
17 Pelleg, D. and Moore, A. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, 2000, pp. 727–734.
18 Pena, J. M., Lazano, J. A., and Larranaga, P. An empirical comparison of four initialisation methods for the K-means algorithm. Pattern Recognition Lett., 1999, 20, 1027–1040.
19 SPSS Clementine Data Mining System. User Guide Version 5, 1998 (Integral Solutions Limited, Basingstoke, Hampshire).
20 DataEngine 3.0 – Intelligent Data Analysis – an Easy Job, Management Intelligenter Technologien GmbH, Germany, 1998; http://www.mitgmbh.de.
21 Kerr, A., Hall, H. K., and Kozub, S. Doing Statistics with SPSS, 2002 (Sage, London).
22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2, Insightful Corporation, Seattle, Washington, 2001; http://www.insightful.com/DocumentsLive/23/44/statman2.pdf.
23 Hardy, A. On the number of clusters. Comput. Statist. Data Analysis, 1996, 23, 83–96.
24 Theodoridis, S. and Koutroubas, K. Pattern Recognition, 1998 (Academic Press, London).
25 Halkidi, M., Batistakis, Y., and Vazirgiannis, M. Cluster validity methods: Part I. SIGMOD Record, 2002, 31(2); available online http://www.acm.org/sigmod/record.
26 Kothari, R. and Pitts, D. On finding the number of clusters. Pattern Recognition Lett., 1999, 20, 405–416.
27 Cai, Z. Technical aspects of data mining. PhD thesis, Cardiff University, Cardiff, 2001.
28 Lindeberg, T. Scale-space Theory in Computer Vision, 1994 (Kluwer Academic, Boston, Massachusetts).
29 Pham, D. T., Dimov, S. S., and Nguyen, C. D. Incremental K-means algorithm. Proc. Instn Mech. Engrs, Part C: J. Mechanical Engineering Science, 2003, 218, 783–795.
30 Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California, 2000.
31 Blake, C., Keogh, E., and Merz, C. J. UCI Repository of Machine Learning Databases, Irvine, California. Department of Information and Computer Science, University of California, Irvine, California, 1998.
APPENDIX
Notation
A, B              clusters
d(x_jt, w_j)      distance between object x_jt and the centre w_j of cluster j
f(K)              evaluation function
G_A, G_B          generators
I_j               distortion of cluster j
K                 number of clusters
N                 number of objects in the data set
N_d               number of data set attributes (the dimension of the data set)
N_j               number of objects belonging to cluster j
P_GA, P_GB        probabilities that X is created by G_A or G_B respectively
P_CA, P_CB        probabilities that X is clustered into A or B respectively
S_K               sum of all distortions with K being the specified number of clusters
X                 object
x_jt              object belonging to cluster j
w_j               centre of cluster j
α_K               weight factor
Fig 6 Continued
Selection of K in K-means clustering 113
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
Fig 6 Continued
114 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
from the average value for any strong recommen-dations to be made to the user By comparing thevalues of f(K ) in Figs 6a and c it can be seen thataK reduces the effect of the data set dimensions onthe evaluation function
For the data set in Fig 6d again all objects areconcentrated in a single region with a normal
distribution The f(K ) plot for this data set suggestscorrectly that when K frac14 1 the clustering result isthe most suitable for this data set
The data sets in Figs 6e and f are created bytwo generators that have normal distributions InFig 6e the two generators have an overlappingregion but in Fig 6f they are well separated Note
Fig 7 f(K) for the 12 benchmark data sets
Selection of K in K-means clustering 115
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
that the value for f(2) in the latter figure is muchsmaller than in the former
The data sets in Figs 6g and h have three recog-nizable regions From the corresponding graphsf(K ) suggests correct values of K for clusteringthese data sets
Three different generators that create objectgroupings with a normal distribution are used toform the data set in Fig 6i In this case f(K ) suggeststhe value 2 or 3 for K Because two of these threegenerators create object groupings that overlapf(2) is smaller than f(3) This means that the data
Fig 7 Continued
116 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
have only two clearly defined regions but K frac14 3could also be used to cluster the objects
Figures 6j and k illustrate how the level of detailcould affect the selection of K f(K ) reaches mini-mum values at K frac14 2 and 4 respectively In suchcases users could select the most appropriate valueof K based on their specific requirements A morecomplex case is shown in Fig 6l where there is apossible K value of 4 or 8 The selection of a parti-cular K will depend on the requirements of thespecific application for which the clustering iscarried out
The data sets in Figs 6mndasho have well-definedregions in the object space each of which has adifferent distribution location and number ofobjects If the minimum value of f(K ) is used tocluster the objects K will be different from thenumber of generators utilized to create them (as inthe case of the clusters in Fig 6o or the number ofobject groupings that could be identified visually(as in the case of the clusters in Figs 6m and n)The reason for the difference varies with differentcases For example it could be considered thatthere are five clusters in Fig 6m because the clusterdistances are smaller for the two leftmost pairs ofclusters than for others and the clusters in thosepairs could be merged together However nosimple explanation could be given for the casesshown in Figs 6n and o This highlights the factthat f(K ) should only be used to suggest a guidevalue for the number of clusters and the finaldecision as to which value to adopt has to be left atthe discretion of the user
From the graphs in Fig 6 a conclusion could bemade that any K with corresponding f(K ) 085could be recommended for clustering If there isnot a value with corresponding f(K ) 085 K frac14 1 isselected
The proposed function f(K ) is also applied to 12benchmarking data sets from the UCI RepositoryMachine Learning Databases [31] Figure 7 showshow the value of f(K ) varies with K If a threshold of085 is selected for f(K ) (from the study on theartificial data sets) the numbers of clusters recom-mended for each of these data sets are given as inTable 2 Kfrac14 1 means that the data distribution isvery close to the standard uniform distribution Thevalues recommended using f(K ) are very smallbecause of the high correlation between the attributesof these data sets very similar to that shown in Fig 6eThis can be verified by examining two attributes ata time and plotting the data sets in two dimensions
The above experimental study on 15 artificialand 12 benchmark data sets has demonstratedthe robustness of f(K ) The evaluation functionconverges in most cases to 1 when K increasesabove 9
6 CONCLUSION
Existing methods of selecting the number of clustersfor K-means clustering have a number of drawbacksAlso current methods for assessing the clusteringresults do not provide much information on theperformance of the clustering algorithm
A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
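The cost noted above comes from the K-means runs needed to obtain the distortion sum S_K for each candidate K; once those sums are available, f(K) itself is cheap to evaluate. The sketch below assumes the definition given in Section 4, recalled here as f(1) = 1, f(K) = S_K / (α_K S_{K-1}) when S_{K-1} is non-zero and 1 otherwise, with α_2 = 1 - 3/(4 N_d) and α_K = α_{K-1} + (1 - α_{K-1})/6 for K > 2. It should be checked against the equations in that section; the function name and the list-based input are assumptions made for illustration.

```python
def evaluation_function(distortions, n_dims):
    """Compute f(K) for K = 1..len(distortions).

    distortions[K - 1] is S_K, the sum of cluster distortions from a K-means
    run with K clusters. n_dims is N_d; this sketch assumes N_d > 1 because
    the weight factor alpha_K is only defined for that case.
    """
    if n_dims <= 1:
        raise ValueError("this sketch assumes N_d > 1")
    f = {}
    alpha = None
    for k, s_k in enumerate(distortions, start=1):
        if k == 1:
            f[k] = 1.0
            continue
        if k == 2:
            alpha = 1.0 - 3.0 / (4.0 * n_dims)
        else:
            alpha = alpha + (1.0 - alpha) / 6.0
        s_prev = distortions[k - 2]  # S_{K-1}
        f[k] = s_k / (alpha * s_prev) if s_prev != 0 else 1.0
    return f


# Made-up distortion sums for K = 1, 2, 3 on a two-dimensional data set.
print(evaluation_function([100.0, 30.0, 25.0], n_dims=2))
```

Each entry of `distortions` stands for one complete K-means run, which is where the expense on large data sets arises.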
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project supported by the Engineering and Physical Sciences Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the IPROMS Network of Excellence funded by the European Commission.
Table 2 The recommended number of clusters based on f(K)
Data set          Proposed number of clusters
Australian        1
Balance-scale     1
Car evaluation    2, 3, 4
Cmc               1
Ionosphere        2
Iris              2, 3
Page blocks       2
Pima              1
Wdbc              2
Wine              3
Yeast             1
Zoo               2
C09304 IMechE 2005 Proc IMechE Vol 219 Part C J Mechanical Engineering Science
REFERENCES
1 Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2000 (Morgan Kaufmann, San Francisco, California).
2 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. Fast K-means clustering algorithms. Report 95.18, School of Computer Studies, University of Leeds, June 1995.
3 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. New methods for the initialisation of clusters. Pattern Recognition Lett., 1996, 17, 451-455.
4 Alsabti, K., Ranka, S., and Singh, V. An efficient K-means clustering algorithm. In Proceedings of the First Workshop on High-Performance Data Mining, Orlando, Florida, 1998; ftp://ftp.cise.ufl.edu/pub/faculty/ranka/Proceedings.
5 Bilmes, J., Vahdat, A., Hsu, W., and Im, E. J. Empirical observations of probabilistic heuristics for the clustering problem. Technical Report TR-97-018, International Computer Science Institute, Berkeley, California.
6 Bottou, L. and Bengio, Y. Convergence properties of the K-means algorithm. Adv. Neural Infn Processing Systems, 1995, 7, 585-592.
7 Bradley, S. and Fayyad, U. M. Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98) (Ed. J. Shavlik), Madison, Wisconsin, 1998, pp. 91-99 (Morgan Kaufmann, San Francisco, California).
8 Du, Q. and Wong, T.-W. Numerical studies of MacQueen's K-means algorithm for computing the centroidal Voronoi tessellations. Int. J. Computers Math. Applics, 2002, 44, 511-523.
9 Castro, V. E. and Yang, J. A fast and robust general purpose clustering algorithm. In Proceedings of the Fourth European Workshop on Principles of Knowledge Discovery in Databases and Data Mining (PKDD 00), Lyon, France, 2000, pp. 208-218.
10 Castro, V. E. Why so many clustering algorithms. SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002, 4(1), 65-75.
11 Fritzke, B. The LBG-U method for vector quantization - an improvement over LBG inspired from neural networks. Neural Processing Lett., 1997, 5(1), 35-45.
12 Hamerly, G. and Elkan, C. Alternatives to the K-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, Virginia, 2002, pp. 600-607.
13 Hansen, L. K. and Larsen, J. Unsupervised learning and generalisation. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, June 1996, pp. 25-30 (IEEE, New York).
14 Ishioka, T. Extended K-means with an efficient estimation of the number of clusters. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, December 2000, pp. 17-22.
15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. The efficient K-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Analysis Mach. Intell., 2002, 24(7), 881-892.
16 Pelleg, D. and Moore, A. Accelerating exact K-means algorithms with geometric reasoning. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, 1999, pp. 277-281.
17 Pelleg, D. and Moore, A. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, 2000, 727-734.
18 Pena, J. M., Lazano, J. A., and Larranaga, P. An empirical comparison of four initialisation methods for the K-means algorithm. Pattern Recognition Lett., 1999, 20, 1027-1040.
19 SPSS Clementine Data Mining System. User Guide Version 5, 1998 (Integral Solutions Limited, Basingstoke, Hampshire).
20 DataEngine 3.0 - Intelligent Data Analysis - an Easy Job, Management Intelligenter Technologien GmbH, Germany, 1998; http://www.mitgmbh.de.
21 Kerr, A., Hall, H. K., and Kozub, S. Doing Statistics with SPSS, 2002 (Sage, London).
22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2, Insightful Corporation, Seattle, Washington, 2001; http://www.insightful.com/DocumentsLive/23/44/statman2.pdf.
23 Hardy, A. On the number of clusters. Comput. Statist. Data Analysis, 1996, 23, 83-96.
24 Theodoridis, S. and Koutroubas, K. Pattern Recognition, 1998 (Academic Press, London).
25 Halkidi, M., Batistakis, Y., and Vazirgiannis, M. Cluster validity methods: Part I. SIGMOD Record, 2002, 31(2); available online http://www.acm.org/sigmod/record.
26 Kothari, R. and Pitts, D. On finding the number of clusters. Pattern Recognition Lett., 1999, 20, 405-416.
27 Cai, Z. Technical aspects of data mining. PhD thesis, Cardiff University, Cardiff, 2001.
28 Lindeberg, T. Scale-space Theory in Computer Vision, 1994 (Kluwer Academic, Boston, Massachusetts).
29 Pham, D. T., Dimov, S. S., and Nguyen, C. D. Incremental K-means algorithm. Proc. Instn Mech. Engrs, Part C: J. Mechanical Engineering Science, 2003, 218, 783-795.
30 Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California, 2000.
31 Blake, C., Keogh, E., and Merz, C. J. UCI Repository of Machine Learning Databases, Irvine, California, Department of Information and Computer Science, University of California, Irvine, California, 1998.
APPENDIX
Notation
A, B            clusters
d(x_jt, w_j)    distance between object x_jt and the centre w_j of cluster j
f(K)            evaluation function
G_A, G_B        generators
I_j             distortion of cluster j
K               number of clusters
N               number of objects in the data set
N_d             number of data set attributes (the dimension of the data set)
N_j             number of objects belonging to cluster j
P_GA, P_GB      probabilities that X is created by G_A or G_B respectively
P_CA, P_CB      probabilities that X is clustered into A or B respectively
S_K             sum of all distortions with K being the specified number of clusters
X               object
x_jt            object belonging to cluster j
w_j             centre of cluster j
α_K             weight factor
from the average value for any strong recommendations to be made to the user. By comparing the values of f(K) in Figs 6a and c, it can be seen that α_K reduces the effect of the data set dimensions on the evaluation function.
For the data set in Fig. 6d, again, all objects are concentrated in a single region with a normal distribution. The f(K) plot for this data set suggests correctly that when K = 1, the clustering result is the most suitable for this data set.
The data sets in Figs 6e and f are created by two generators that have normal distributions. In Fig. 6e the two generators have an overlapping region, but in Fig. 6f they are well separated. Note that the value for f(2) in the latter figure is much smaller than in the former.
Fig. 7 f(K) for the 12 benchmark data sets
The data sets in Figs 6g and h have three recognizable regions. From the corresponding graphs, f(K) suggests correct values of K for clustering these data sets.
Three different generators that create object groupings with a normal distribution are used to form the data set in Fig. 6i. In this case, f(K) suggests the value 2 or 3 for K. Because two of these three generators create object groupings that overlap, f(2) is smaller than f(3). This means that the data have only two clearly defined regions, but K = 3 could also be used to cluster the objects.
Figures 6j and k illustrate how the level of detail could affect the selection of K. f(K) reaches minimum values at K = 2 and 4 respectively. In such cases, users could select the most appropriate value of K based on their specific requirements. A more complex case is shown in Fig. 6l, where there is a possible K value of 4 or 8. The selection of a particular K will depend on the requirements of the specific application for which the clustering is carried out.
that the value for f(2) in the latter figure is muchsmaller than in the former
The data sets in Figs 6g and h have three recog-nizable regions From the corresponding graphsf(K ) suggests correct values of K for clusteringthese data sets
Three different generators that create objectgroupings with a normal distribution are used toform the data set in Fig 6i In this case f(K ) suggeststhe value 2 or 3 for K Because two of these threegenerators create object groupings that overlapf(2) is smaller than f(3) This means that the data
Fig 7 Continued
116 D T Pham S S Dimov and C D Nguyen
Proc IMechE Vol 219 Part C J Mechanical Engineering Science C09304 IMechE 2005
have only two clearly defined regions but K frac14 3could also be used to cluster the objects
Figures 6j and k illustrate how the level of detailcould affect the selection of K f(K ) reaches mini-mum values at K frac14 2 and 4 respectively In suchcases users could select the most appropriate valueof K based on their specific requirements A morecomplex case is shown in Fig 6l where there is apossible K value of 4 or 8 The selection of a parti-cular K will depend on the requirements of thespecific application for which the clustering iscarried out
The data sets in Figs 6m–o have well-defined regions in the object space, each of which has a different distribution, location, and number of objects. If the minimum value of f(K) is used to cluster the objects, K will be different from the number of generators utilized to create them (as in the case of the clusters in Fig. 6o) or the number of object groupings that could be identified visually (as in the case of the clusters in Figs 6m and n). The reason for the difference varies from case to case. For example, it could be considered that there are five clusters in Fig. 6m because the cluster distances are smaller for the two leftmost pairs of clusters than for others, and the clusters in those pairs could be merged together. However, no simple explanation could be given for the cases shown in Figs 6n and o. This highlights the fact that f(K) should only be used to suggest a guide value for the number of clusters, and the final decision as to which value to adopt has to be left to the discretion of the user.
From the graphs in Fig. 6, a conclusion could be made that any K with corresponding f(K) < 0.85 could be recommended for clustering. If there is not a value with corresponding f(K) < 0.85, K = 1 is selected.
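The threshold rule can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `recommend_k`, the dictionary layout, and the sample f(K) values are hypothetical, and the comparison is assumed to be strict (f(K) below 0.85).

```python
def recommend_k(f_values, threshold=0.85):
    """Return every K whose f(K) lies below the threshold, or [1] if none does.

    f_values maps each candidate K to its evaluation-function value f(K).
    """
    candidates = sorted(k for k, f in f_values.items() if f < threshold)
    return candidates if candidates else [1]


# Hypothetical f(K) values, invented for illustration only.
two_dips = {1: 1.0, 2: 0.62, 3: 0.80, 4: 0.97, 5: 1.02}
no_dip = {1: 1.0, 2: 0.93, 3: 0.98, 4: 1.01}

print(recommend_k(two_dips))  # [2, 3]
print(recommend_k(no_dip))    # [1]
```

Returning a list rather than a single value mirrors the point made earlier in this section: several values of K may be acceptable, and the final choice is left to the user.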
The proposed function f(K) is also applied to 12 benchmarking data sets from the UCI Repository of Machine Learning Databases [31]. Figure 7 shows how the value of f(K) varies with K. If a threshold of 0.85 is selected for f(K) (from the study on the artificial data sets), the numbers of clusters recommended for each of these data sets are as given in Table 2. K = 1 means that the data distribution is very close to the standard uniform distribution. The values recommended using f(K) are very small because of the high correlation between the attributes of these data sets, very similar to that shown in Fig. 6e. This can be verified by examining two attributes at a time and plotting the data sets in two dimensions.
The above experimental study on 15 artificial and 12 benchmark data sets has demonstrated the robustness of f(K). The evaluation function converges in most cases to 1 when K increases above 9.
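The behaviour of f(K) discussed in this section can be reproduced from a list of distortions S_1, ..., S_Kmax, each obtained from one run of K-means. The sketch below assumes the piecewise definitions of Section 4: f(1) = 1; f(K) = S_K / (α_K S_(K-1)) when S_(K-1) is non-zero and 1 otherwise; α_2 = 1 - 3/(4 N_d); and α_K = α_(K-1) + (1 - α_(K-1))/6 for K > 2, with N_d > 1. The function names and the sample distortions are hypothetical and should be checked against the equations in that section before use.

```python
def weight_factors(k_max, n_dims):
    """Return {K: alpha_K} for K = 2..k_max (assumed Section 4 definition)."""
    if n_dims < 2:
        raise ValueError("alpha_K is assumed to be defined only for N_d > 1")
    alphas = {}
    for k in range(2, k_max + 1):
        if k == 2:
            alphas[k] = 1.0 - 3.0 / (4.0 * n_dims)
        else:
            # Each step moves alpha one sixth of the way towards 1.
            alphas[k] = alphas[k - 1] + (1.0 - alphas[k - 1]) / 6.0
    return alphas


def evaluation_function(distortions, n_dims):
    """Return {K: f(K)}, where distortions[K - 1] holds S_K for K = 1..k_max."""
    k_max = len(distortions)
    alphas = weight_factors(k_max, n_dims)
    f = {}
    for k in range(1, k_max + 1):
        previous = distortions[k - 2] if k > 1 else None
        if k == 1 or previous == 0:
            f[k] = 1.0
        else:
            f[k] = distortions[k - 1] / (alphas[k] * previous)
    return f


# Hypothetical distortions S_1..S_4 for a two-dimensional data set.
sample = [100.0, 30.0, 25.0, 22.0]
for k, value in evaluation_function(sample, n_dims=2).items():
    print(k, round(value, 3))
```

Because α_K moves steadily towards 1 as K grows, f(K) approaches the plain ratio S_K / S_(K-1) for large K, which is consistent with the convergence towards 1 noted above once additional clusters stop reducing the distortion appreciably.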
6 CONCLUSION
Existing methods of selecting the number of clusters for K-means clustering have a number of drawbacks. Also, current methods for assessing the clustering results do not provide much information on the performance of the clustering algorithm.
A new method to select the number of clusters for the K-means algorithm has been proposed in the paper. The new method is closely related to the approach of K-means clustering because it takes into account information reflecting the performance of the algorithm. The proposed method can suggest multiple values of K to users for cases when different clustering results could be obtained with various required levels of detail. The method could be computationally expensive if used with large data sets because it requires several applications of the K-means algorithm before it can suggest a guide value for K. The method has been validated on 15 artificial and 12 benchmark data sets. Further research is required to verify the capability of this method when applied to data sets with more complex object distributions.
ACKNOWLEDGEMENTS
This work was carried out as part of the Cardiff Innovative Manufacturing Research Centre Project supported by the Engineering and Physical Sciences Research Council and the SUPERMAN Project supported by the European Commission and the Welsh Assembly Government under the European Regional Development Fund programme. The authors are members of the I*PROMS Network of Excellence funded by the European Commission.
Table 2  The recommended number of clusters based on f(K)
Data sets        Proposed number of clusters
Australian       1
Balance-scale    1
Car evaluation   2, 3, 4
Cmc              1
Ionosphere       2
Iris             2, 3
Page blocks      2
Pima             1
Wdbc             2
Wine             3
Yeast            1
Zoo              2
REFERENCES
1 Han, J. and Kamber, M. Data Mining: Concepts and Techniques, 2000 (Morgan Kaufmann, San Francisco, California).
2 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. Fast K-means clustering algorithms. Report 95.18, School of Computer Studies, University of Leeds, June 1995.
3 Al-Daoud, M. B., Venkateswarlu, N. B., and Roberts, S. A. New methods for the initialisation of clusters. Pattern Recognition Lett., 1996, 17, 451–455.
4 Alsabti, K., Ranka, S., and Singh, V. An efficient K-means clustering algorithm. In Proceedings of the First Workshop on High-Performance Data Mining, Orlando, Florida, 1998; ftp://ftp.cise.ufl.edu/pub/faculty/ranka/Proceedings.
5 Bilmes, J., Vahdat, A., Hsu, W., and Im, E. J. Empirical observations of probabilistic heuristics for the clustering problem. Technical Report TR-97-018, International Computer Science Institute, Berkeley, California.
6 Bottou, L. and Bengio, Y. Convergence properties of the K-means algorithm. Adv. Neural Infn Processing Systems, 1995, 7, 585–592.
7 Bradley, S. and Fayyad, U. M. Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98) (Ed. J. Shavlik), Madison, Wisconsin, 1998, pp. 91–99 (Morgan Kaufmann, San Francisco, California).
8 Du, Q. and Wong, T.-W. Numerical studies of MacQueen's K-means algorithm for computing the centroidal Voronoi tessellations. Int. J. Computers Math. Applics, 2002, 44, 511–523.
9 Castro, V. E. and Yang, J. A fast and robust general purpose clustering algorithm. In Proceedings of the Fourth European Workshop on Principles of Knowledge Discovery in Databases and Data Mining (PKDD 00), Lyon, France, 2000, pp. 208–218.
10 Castro, V. E. Why so many clustering algorithms. SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mining, 2002, 4(1), 65–75.
11 Fritzke, B. The LBG-U method for vector quantization – an improvement over LBG inspired from neural networks. Neural Processing Lett., 1997, 5(1), 35–45.
12 Hamerly, G. and Elkan, C. Alternatives to the K-means algorithm that find better clusterings. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM 02), McLean, Virginia, 2002, pp. 600–607.
13 Hansen, L. K. and Larsen, J. Unsupervised learning and generalisation. In Proceedings of the IEEE International Conference on Neural Networks, Washington, DC, June 1996, pp. 25–30 (IEEE, New York).
14 Ishioka, T. Extended K-means with an efficient estimation of the number of clusters. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2000), Hong Kong, PR China, December 2000, pp. 17–22.
15 Kanungo, T., Mount, D. M., Netanyahu, N., Piatko, C., Silverman, R., and Wu, A. The efficient K-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Analysis Mach. Intell., 2002, 24(7), 881–892.
16 Pelleg, D. and Moore, A. Accelerating exact K-means algorithms with geometric reasoning. In Proceedings of the Conference on Knowledge Discovery in Databases (KDD 99), San Diego, California, 1999, pp. 277–281.
17 Pelleg, D. and Moore, A. X-means: extending K-means with efficient estimation of the number of clusters. In Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford, California, 2000, pp. 727–734.
18 Pena, J. M., Lazano, J. A., and Larranaga, P. An empirical comparison of four initialisation methods for the K-means algorithm. Pattern Recognition Lett., 1999, 20, 1027–1040.
19 SPSS Clementine Data Mining System, User Guide Version 5, 1998 (Integral Solutions Limited, Basingstoke, Hampshire).
20 DataEngine 3.0 – Intelligent Data Analysis – an Easy Job, Management Intelligenter Technologien GmbH, Germany, 1998; http://www.mitgmbh.de.
21 Kerr, A., Hall, H. K., and Kozub, S. Doing Statistics with SPSS, 2002 (Sage, London).
22 S-PLUS 6 for Windows Guide to Statistics, Vol. 2, Insightful Corporation, Seattle, Washington, 2001; http://www.insightful.com/DocumentsLive/23/44/statman2.pdf.
23 Hardy, A. On the number of clusters. Comput. Statist. Data Analysis, 1996, 23, 83–96.
24 Theodoridis, S. and Koutroubas, K. Pattern Recognition, 1998 (Academic Press, London).
25 Halkidi, M., Batistakis, Y., and Vazirgiannis, M. Cluster validity methods: Part I. SIGMOD Record, 2002, 31(2); available online http://www.acm.org/sigmod/record.
26 Kothari, R. and Pitts, D. On finding the number of clusters. Pattern Recognition Lett., 1999, 20, 405–416.
27 Cai, Z. Technical aspects of data mining. PhD thesis, Cardiff University, Cardiff, 2001.
28 Lindeberg, T. Scale-space Theory in Computer Vision, 1994 (Kluwer Academic, Boston, Massachusetts).
29 Pham, D. T., Dimov, S. S., and Nguyen, C. D. Incremental K-means algorithm. Proc. Instn Mech. Engrs, Part C: J. Mechanical Engineering Science, 2003, 218, 783–795.
30 Tibshirani, R., Walther, G., and Hastie, T. Estimating the number of clusters in a dataset via the gap statistic. Technical Report 208, Department of Statistics, Stanford University, California, 2000.
31 Blake, C., Keogh, E., and Merz, C. J. UCI Repository of Machine Learning Databases, Irvine, California. Department of Information and Computer Science, University of California, Irvine, California, 1998.
APPENDIX
Notation
A, B             clusters
d(x_jt, w_j)     distance between object x_jt and the centre w_j of cluster j
f(K)             evaluation function
G_A, G_B         generators
I_j              distortion of cluster j
K                number of clusters
N                number of objects in the data set
N_d              number of data set attributes (the dimension of the data set)
N_j              number of objects belonging to cluster j
P_GA, P_GB       probabilities that X is created by G_A or G_B respectively
P_CA, P_CB       probabilities that X is clustered into A or B respectively
S_K              sum of all distortions with K being the specified number of clusters
X                object
x_jt             object belonging to cluster j
w_j              centre of cluster j
α_K              weight factor
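To make the relationship between these symbols concrete, the sketch below computes I_j and S_K for a small hand-made clustering. It assumes that the distortion I_j is the sum of squared Euclidean distances d(x_jt, w_j)^2 over the N_j objects of cluster j, and that S_K is the sum of I_j over the K clusters. The function names and the sample points are hypothetical.

```python
def squared_distance(x, w):
    """Squared Euclidean distance d(x, w)^2 between an object and a centre."""
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))


def cluster_distortion(members, centre):
    """I_j: sum of d(x_jt, w_j)^2 over the N_j objects in cluster j."""
    return sum(squared_distance(x, centre) for x in members)


def total_distortion(clusters):
    """S_K: sum of I_j over the K clusters, given (members, centre) pairs."""
    return sum(cluster_distortion(members, centre) for members, centre in clusters)


# Two hypothetical clusters in two dimensions (N_d = 2, K = 2, N = 5).
clusters = [
    ([(0.0, 0.0), (2.0, 0.0)], (1.0, 0.0)),
    ([(5.0, 5.0), (5.0, 7.0), (5.0, 6.0)], (5.0, 6.0)),
]

print(total_distortion(clusters))  # 4.0
```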