Hindawi Publishing Corporation
Computational Intelligence and Neuroscience
Volume 2016, Article ID 2647389, 12 pages
http://dx.doi.org/10.1155/2016/2647389

Research Article

A Self-Adaptive Fuzzy c-Means Algorithm for Determining the Optimal Number of Clusters

Min Ren,1,2,3 Peiyu Liu,1,3 Zhihao Wang,1,3 and Jing Yi1,3

1 School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong, China
2 School of Mathematic and Quantitative Economics, Shandong University of Finance and Economics, Jinan, Shandong, China
3 Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan, Shandong, China

Correspondence should be addressed to Peiyu Liu; sdkeylab@163.com

Received 29 June 2016; Accepted 17 October 2016

Academic Editor: Elio Masciari

Copyright © 2016 Min Ren et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

To address the shortcoming of the fuzzy c-means algorithm (FCM) that the number of clusters must be known in advance, this paper proposes a new self-adaptive method to determine the optimal number of clusters. Firstly, a density-based algorithm is put forward. According to the characteristics of the dataset, the algorithm automatically determines the possible maximum number of clusters, instead of using the empirical rule √n, and obtains good initial cluster centroids, mitigating the limitation of FCM that randomly selected cluster centroids can make the result converge to a local minimum. Secondly, by introducing a penalty function, this paper proposes a new fuzzy clustering validity index based on fuzzy compactness and separation, which ensures that, as the number of clusters approaches the number of objects in the dataset, the index value does not monotonically decrease toward zero, so the index keeps its robustness and its power to decide the optimal number of clusters. Then, based on these studies, a self-adaptive FCM algorithm is put forward to estimate the optimal number of clusters by an iterative trial-and-error process. At last, experiments are done on the UCI, KDD Cup 1999, and synthetic datasets, which show that the method not only effectively determines the optimal number of clusters, but also reduces the iterations of FCM and yields a stable clustering result.

1. Introduction

Cluster analysis has a long research history. Due to its advantage of learning without a priori knowledge, it is widely applied in the fields of pattern recognition, image processing, web mining, spatiotemporal database application, business intelligence, and so forth.

Clustering is often cast as unsupervised learning and aims to partition the objects in a dataset into several natural groupings, namely, the so-called clusters, such that objects within a cluster tend to be similar while objects belonging to different clusters are dissimilar. Generally, datasets from different application fields vary in their features, and the purposes of clustering are multifarious. Therefore, the best method of cluster analysis depends on the dataset and the purpose of use. There is no universal clustering technology that is applicable to the diverse structures presented by various datasets [1]. According to the accumulation rules of objects in clusters and the methods of applying these rules, clustering algorithms are divided into many types.

However, for most clustering algorithms, including partitional clustering and hierarchical clustering, the number of clusters is a parameter that needs to be preset, and the quality of the clustering result is closely related to it. In practical application, setting it usually relies on users' experience or background knowledge in related fields. But in most cases, the number of clusters is unknown to users. If the number is set too large, the clustering result may be more complicated and difficult to explain. On the contrary, if it is too small, a lot of valuable information in the clustering result may be lost [2]. Thus, determining the optimal number of clusters in a dataset is still a fundamental problem in the research of cluster analysis [3].

Contributions of this paper are as follows. (1) A density-based algorithm is proposed, which can quickly and succinctly generate high-quality initial cluster centroids instead of choosing them randomly, so as to stabilize the clustering result and also quicken the convergence of the clustering algorithm. Besides, this algorithm can automatically estimate the
maximum number of clusters based on the features of a dataset, hereby determining the search range for estimating the optimal number of clusters and effectively reducing the iterations of the clustering algorithm. (2) Based on the features of compactness within a cluster and separation between clusters, a new fuzzy clustering validity index (CVI) is defined in this paper, which avoids the index value approaching zero as the number of clusters tends to the number of objects, so that the optimal clustering result can be obtained. (3) A self-adaptive method that iteratively uses the improved FCM algorithm to estimate the optimal number of clusters is put forward.

2. Related Work

The easiest method of determining the number of clusters is data visualization. For a dataset that can be effectively mapped to a 2-dimensional Euclidean space, the number of clusters can be intuitively acquired through the distribution graph of data points. However, for high-dimensional and complicated data, this method is unserviceable. Rodriguez and Laio proposed a clustering algorithm based on density peaks, declaring that it was able to detect nonspherical clusters and to automatically find the true number of clusters [4]. But in fact the number of cluster centroids still needs to be selected artificially according to the decision graph. Next, relevant technologies to determine the optimal number of clusters are summarized below.

2.1. Clustering Validity Index Based Method. A clustering validity index is used to evaluate the quality of the partitions of a dataset generated by a clustering algorithm, and constructing an appropriate clustering validity index is an effective method to determine the number of clusters. The idea is to assign different values of the number of clusters c within a certain range, run the fuzzy clustering algorithm on the dataset for each value, and finally evaluate the results by the clustering validity index. When the value of the clustering validity index is the maximum or the minimum, or an obvious inflection point appears, the corresponding value of c is the optimal number of clusters c_opt. So far researchers have put forward many fuzzy clustering validity indices, divided into the following two types.

(1) Clustering Validity Index Based on Fuzzy Partition. These indices follow the point of view that, for a well-separated dataset, the smaller the fuzziness of the fuzzy partition is, the more reliable the clustering result is. Based on it, Zadeh, the founder of fuzzy sets, put forward the first clustering validity index, degree of separation, in 1965 [5]. But its discriminating effect is not ideal. In 1974 Bezdek put forward the concept of partition coefficient (PC) [6], which was the first practical generic function for measuring the validity of fuzzy clustering, and subsequently proposed another clustering validity function, partition entropy (PE) [7], closely related to the partition coefficient. Later, Windham defined the proportion exponent by use of the maximum value of a fuzzy membership function [8]. Lee put forward a fuzzy clustering validity index using the distinguishableness of clusters measured by the object proximities [9]. Based on the Shannon entropy and fuzzy variation theory, Zhang and Jiang put forward a new fuzzy clustering validity index taking account of the geometry structure of the dataset [10]. Saha et al. put forward an algorithm based on differential evolution for automatic cluster detection, which well evaluated the validity of the clustering result [11]. Yue et al. partitioned the original data space into a grid-based structure and proposed a cluster separation measure based on grid distances [12]. The clustering validity index based on fuzzy partition is only related to the fuzzy degree of membership and has the advantages of simplicity and a small amount of calculation. But it lacks a direct relation with the structural features of the dataset.

(2) Clustering Validity Index Based on the Geometry Structure of the Dataset. These indices follow the point of view that, for a well-separated dataset, every cluster should be compact and the clusters should be separated from each other as far as possible. The ratio of compactness to separation is used as the standard of clustering validity. Representative clustering validity indices of this type include the Xie-Beni index [13], Bensaid index [14], and Kwon index [15]. Sun et al. proposed a new validity index based on a linear combination of compactness and separation, inspired by Rezaee's validity index [16]. Li and Yu defined new compactness and separation measures and put forward a new fuzzy clustering validity function [17]. Based on fuzzy granulation-degranulation, Saha and Bandyopadhyay put forward a fuzzy clustering validity function [18]. Zhang et al. adopted Pearson correlation to measure the distance and put forward a validity function [19]. Kim et al. proposed a clustering validity index for the GK algorithm based on the average value of the relative degrees of sharing of all possible pairs of fuzzy clusters [20]. Rezaee proposed a new validity index for the GK algorithm to overcome the shortcomings of Kim's index [21]. Zhang et al. proposed a novel WGLI to detect the optimal cluster number, using global optimum membership as the global property and modularity of a bipartite network as the local independent property [22]. The clustering validity index based on the geometric structure of the dataset considers both the fuzzy degree of membership and the geometric structure, but its membership function is quite complicated, with a large amount of calculation.

Based on a clustering validity index, the optimal number of clusters is determined through exhaustive search. In order to increase the efficiency of estimating the optimal number of clusters c_opt, the search range of c_opt must be set; that is, a maximum number of clusters c_max is assigned to meet the condition c_opt ≤ c_max. Most researchers used the empirical rule c_max ≤ √n, where n is the number of data in the dataset. For this problem, theoretical analysis and example verification were conducted in [23], which indicated that the rule is reasonable in a sense. However, this method apparently has the following disadvantages. (1) Each c must be tried in turn, which causes a huge amount of calculation. (2) For each c, it cannot be guaranteed that the clustering result is the globally optimal solution. (3) When noise and outliers exist, the reliability of the clustering validity index is weak. (4) For some datasets, like FaceImage [24], where c ≥ √n, the empirical rule is invalid. Research has shown that, due to the diversified data types and structures, no universal fuzzy clustering validity index is applicable to all datasets. Research on this problem is, and will continue to be, urgently carried on.

2.2. Heuristic Method. Some new clustering algorithms have been proposed in succession recently. The main idea is to use some criteria to guide the clustering process, with the number of clusters being adjusted along the way. In this way, when the clustering is completed, the appropriate number of clusters is obtained as well. For example, the X-means algorithm [25], based on split hierarchical clustering, is representative. Contrary to the aforesaid process, RCA [26] determined the actual number of clusters by a process of competitive agglomeration. Combining the single-point iterative technology with hierarchical clustering, a similarity-based clustering method (SCM) [27] was proposed. A mercer kernel-based clustering [28] estimated the number of clusters by the eigenvectors of a kernel matrix. A clustering algorithm based on maximal θ-distant subtrees [29] detected any number of well-separated clusters with any shapes. In addition, Frey and Dueck put forward an affinity propagation clustering algorithm (AP) [24], which generated high-quality cluster centroids via message passing between objects to determine the optimal number of clusters of large-scale data. Shihong et al. proposed a Gerschgorin disk estimation-based criterion to estimate the true number of clusters [30]. José-García and Gómez-Flores presented an up-to-date review of all major nature-inspired metaheuristic algorithms used thus far for automatic clustering, determining the best estimate of the number of clusters [31]. All these have widened the thoughts for relevant research.

3. Traditional FCM for Determining the Optimal Number of Clusters

Since Ruspini first introduced the theory of fuzzy sets into cluster analysis in 1973 [32], different fuzzy clustering algorithms have been widely discussed, developed, and applied in various areas. FCM [33] is one of the most commonly used algorithms. FCM was first presented by Dunn in 1974. Subsequently, Bezdek introduced a weighted index for the fuzzy degree of membership [34], further developing FCM. FCM divides the dataset X into c fuzzy clusters. The algorithm holds that each object belongs to a cluster with a certain degree of membership; that is, a cluster is considered as a fuzzy subset of the dataset.

Assume that X = {x_1, x_2, ..., x_n} is a finite dataset, where x_i = (x_{i1}, x_{i2}, ..., x_{il}) is an l-dimensional object and x_{ij} is the jth property of the ith object. C = {C_1, C_2, ..., C_c} denotes c clusters. V = {v_1, v_2, ..., v_c} represents c l-dimensional cluster centroids, where v_i = (v_{i1}, v_{i2}, ..., v_{il}). U = (u_{ik})_{n×c} is a fuzzy partition matrix, and u_{ik} is the degree of membership of the ith object in the kth cluster, where ∑_{k=1}^{c} u_{ik} = 1 for all i = 1, ..., n. The objective function is the quadratic sum of the weighted distances from the samples to the cluster centroid in each cluster; that is,

J_m(U, V) = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ik}^{m} d_{ik}^{2},   (1)

where d_{ik} = ‖x_i − v_k‖ is the Euclidean distance between the ith object and the kth cluster centroid, and m (m ∈ [1, ∞)) is a fuzziness index controlling the fuzziness of the memberships. As the value of m becomes progressively higher, the resulting memberships become fuzzier [35]. Pal and Bezdek advised that m should be between 1.5 and 2.5, and usually m = 2 is taken if there is no special requirement [36].

According to the clustering criterion, an appropriate fuzzy partition matrix U and cluster centroids V are obtained to minimize the objective function J_m. Based on the Lagrange multiplier method, U and V are respectively calculated by the formulas below:

u_{ik} = \frac{1}{\sum_{j=1}^{c} (d_{ik}/d_{ij})^{2/(m-1)}},   (2)

v_k = \frac{\sum_{i=1}^{n} u_{ik}^{m} x_i}{\sum_{i=1}^{n} u_{ik}^{m}}.   (3)

The FCM algorithm is carried out through an iterative process of minimizing the objective function J_m, with the update of U and V. The specific steps are as follows.

Step 1. Assign the initial value of the number of clusters c, fuzziness index m, maximum iterations I_max, and threshold ξ.

Step 2. Initialize the fuzzy partition U^(0) randomly, according to the constraints of the degree of membership.

Step 3. At the t-step, calculate the c cluster centroids V^(t) according to (3).

Step 4. Calculate the objective function J_m^(t) according to (1). If |J_m^(t) − J_m^(t−1)| < ξ or t > I_max, then stop; otherwise continue to Step 5.

Step 5. Calculate U^(t+1) according to (2) and return to Step 3.

At last, each object can be arranged into one cluster in accordance with the principle of the maximum degree of membership. The advantages of FCM may be summarized as a simple algorithm, quick convergence, and easiness to extend. Its disadvantages lie in the selection of the initial cluster centroids, the sensitivity to noise and outliers, and the setting of the number of clusters, all of which have a great impact on the clustering result. As the random selection of initial cluster centroids cannot ensure that FCM converges to an optimal solution, either different initial cluster centroids are used for multiple runs of FCM, or the centroids are determined by other fast algorithms.
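To make Steps 1-5 concrete, the following is a minimal NumPy sketch of the iteration; it is a sketch under the conventions above, not the authors' code, and the function name fcm and its defaults are illustrative:

    import numpy as np

    def fcm(X, c, m=2.0, max_iter=100, xi=1e-5, U_init=None, seed=0):
        # X: (n, l) data matrix; returns memberships U, centroids V, objective J
        n = X.shape[0]
        rng = np.random.default_rng(seed)
        U = rng.random((n, c)) if U_init is None else U_init.copy()
        U /= U.sum(axis=1, keepdims=True)             # Step 2: each row sums to 1
        J_prev = np.inf
        for t in range(max_iter):                     # Step 4 stop: t > I_max
            Um = U ** m
            V = (Um.T @ X) / Um.sum(axis=0)[:, None]  # Step 3: Eq. (3)
            d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
            d = np.fmax(d, 1e-12)                     # guard against zero distances
            J = np.sum(Um * d ** 2)                   # Eq. (1)
            if abs(J_prev - J) < xi:                  # Step 4 stop: |ΔJ| < ξ
                break
            J_prev = J
            # Step 5: Eq. (2), u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
            U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
        return U, V, J

Cluster labels then follow from the maximum-membership principle, for example labels = U.argmax(axis=1).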

The traditional method to determine the optimal number of clusters for FCM is to set the search range of the number of clusters, run FCM to generate a clustering result for each candidate number of clusters, select an appropriate clustering validity index to evaluate these results, and finally obtain the optimal number of clusters according to the evaluation. The method is composed of the following steps.

Step 1. Input the search range [c_min, c_max]; generally c_min = 2 and c_max = ⌊√n⌋.

Step 2. For each integer kn ∈ [c_min, c_max]:

Step 2.1. Run FCM.

Step 2.2. Calculate the clustering validity index.

Step 3. Compare all values of the clustering validity index; the kn corresponding to the maximum or minimum value is the optimal number of clusters c_opt.

Step 4. Output c_opt, the optimal value of the clustering validity index, and the clustering result.

4. The Proposed Methods

4.1. Density-Based Algorithm. Considering the large influence of randomly selected initial cluster centroids on the clustering result, a density-based algorithm is proposed to select the initial cluster centroids; the maximum number of clusters can be estimated at the same time. Some related terms are defined first.

Definition 1 (local density). The local density ρ_i of object x_i is defined as

\rho_i = \sum_{j=1}^{n} e^{-d_{ij}^{2}/d_c^{2}},   (4)

where d_{ij} is the distance between two objects x_i and x_j, and d_c is a cutoff distance. The recommended approach is to sort the distances between all pairs of objects in descending order and then assign d_c the value corresponding to the first p% of the sorted distances (appropriately p ∈ [2, 5]). It follows that the more objects there are with a distance from x_i less than d_c, the bigger the value of ρ_i is. The cutoff distance finally decides the number of initial cluster centroids, namely, the maximum number of clusters. The density-based algorithm is robust with respect to the choice of d_c.

Definition 2 (core point). Assume that there is an object x_i ∈ X; if x_i ∉ C_k and ρ_i = max_{x_j ∉ C_k}(ρ_j), k = 1, ..., c, j = 1, ..., n, then x_i is a core point.

Definition 3 (directly density-reachable). Assume that there are two objects x_i, x_j ∈ X; if their distance d_{ij} < d_c, then x_j is directly density-reachable to x_i, and vice versa.

Definition 4 (density-reachable). Assume that X is a dataset and objects x_s, x_{s+1}, ..., x_{t−1}, x_t belong to it. If, for each i ∈ [s, t − 1], x_{i+1} is directly density-reachable to x_i, then x_s is density-reachable to x_t.

Definition 5 (neighbor). Neighbors of an object x_i are those objects that are directly density-reachable or density-reachable to it, denoted as Neighbor(x_i).

Figure 1: An example. Point A is of the highest local density. If A does not belong to any cluster, then A is a core point. Points B, C, D, and E are directly density-reachable to point A. Point F is density-reachable to point A. Point H is a border point.

Definition 6 (border point). An object is called a border point if it has no neighbors.

An example is shown in Figure 1.

The selection principle of initial cluster centroids in the density-based algorithm is that a cluster centroid is usually an object with a higher local density, surrounded by neighbors with a lower local density than its own, and with a relatively large distance from other cluster centroids [4]. The density-based algorithm can automatically select the initial cluster centroids and determine the maximum number of clusters c_max according to local density. The pseudocode is shown in Algorithm 1. The obtained cluster centroids are sorted in descending order of local density. Figure 2 demonstrates the process of the density-based algorithm.
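As a concrete illustration of this selection procedure (the formal pseudocode follows in Algorithm 1), below is a NumPy sketch under stated assumptions: d_c is taken at the first p% of the pairwise distances sorted in descending order, and Neighbor() is simplified to the directly density-reachable objects, omitting the chain expansion of Definition 4; the name density_init is illustrative:

    import numpy as np

    def density_init(X, p=3.0):
        n = X.shape[0]
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
        pairs = np.sort(D[np.triu_indices(n, k=1)])[::-1]          # descending order
        dc = pairs[max(int(np.ceil(p / 100.0 * pairs.size)) - 1, 0)]  # cutoff distance d_c
        rho = np.exp(-(D / dc) ** 2).sum(axis=1)                   # local density, Eq. (4)
        cl = -np.ones(n, dtype=int)                                # -1 = not yet assigned
        centroids = []
        for i in np.argsort(-rho):               # scan objects by descending density
            if cl[i] != -1:
                continue
            neigh = np.where((D[i] < dc) & (cl == -1))[0]  # unassigned direct neighbors
            if neigh.size > 1:                   # core point with at least one neighbor
                cl[neigh] = len(centroids)       # includes i itself, since D[i, i] = 0
                centroids.append(X[i])
        return np.array(centroids), len(centroids), cl    # V, c_max, labels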

4.2. Fuzzy Clustering Validity Index. Firstly, based on the geometric structure of the dataset, Xie and Beni put forward the Xie-Beni fuzzy clustering validity index [13], which tries to find a balance point between fuzzy compactness and separation so as to acquire the optimal clustering result. The index is defined as

V_{xie}(U, V) = \frac{(1/n) \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^{m} \|v_i - x_j\|^{2}}{\min_{i \neq j} \|v_i - v_j\|^{2}}.   (5)

In the formula, the numerator is the average distance from the objects to the centroids, used to measure compactness, and the denominator is the minimum distance between any two centroids, measuring separation.

However, Bensaid et al. found that the size of each cluster had a large influence on the Xie-Beni index and put forward a new index [14], which is insensitive to the number of objects in each cluster. Bensaid index is defined as

V_B(U, V) = \sum_{k=1}^{c} \frac{\sum_{i=1}^{n} u_{ik}^{m} \|x_i - v_k\|^{2}}{n_k \sum_{j=1}^{c} \|v_j - v_k\|^{2}},   (6)

Input: dataset X and p
Output: V and c_max
  D = (d_{ij})_{n×n}
  d_c = cutoffdis(p)   /* calculate the cutoff distance d_c */
  ρ = (ρ_i)_n
  idx = arg(sort(ρ, descend))   /* object indices sorted by descending ρ */
  k = 1
  cl = (−1)_{1×n}   /* the cluster number each object belongs to, with initial value −1 */
  for i = 1 to n − 1 do
    if cl_{idx_i} == −1 && |Neighbor(x_{idx_i})| > 1 then
      /* x_{idx_i} does not belong to any cluster and the number of its neighbors is greater than 1 */
      v_k = x_{idx_i}
      for j ∈ Neighbor(x_{idx_i}) do
        cl_j = k
      k = k + 1
  c_max = k − 1   /* k was incremented once per formed cluster */

Algorithm 1: Density-based algorithm.

where n_k is the fuzzy cardinality of the kth cluster, defined as

n_k = \sum_{i=1}^{n} u_{ik}.   (7)

∑_{i=1}^{n} u_{ik}^{m} ‖x_i − v_k‖² shows the variation of the kth fuzzy cluster. Then the compactness is computed as

\frac{1}{n_k} \sum_{i=1}^{n} u_{ik}^{m} \|x_i - v_k\|^{2}.   (8)

∑_{j=1}^{c} ‖v_j − v_k‖² denotes the separation of the kth fuzzy cluster, defined as the sum of the distances from its cluster centroid to the centroids of the other c − 1 clusters.

Because

\lim_{c \to n} \|x_i - v_k\|^{2} = 0,   (9)

this index behaves like the Xie-Beni index: when c → n, the index value monotonically decreases toward 0, and it loses its robustness and its judgment function for determining the optimal number of clusters. Thus this paper improves Bensaid index and proposes a new index V_R:

V_R(U, V) = \sum_{k=1}^{c} \frac{(1/n_k) \sum_{i=1}^{n} u_{ik}^{m} \|x_i - v_k\|^{2} + (1/c) \|v_k - \bar{v}\|^{2}}{(1/(c-1)) \sum_{j=1}^{c} \|v_j - v_k\|^{2}},   (10)

where v̄ is the average of all cluster centroids.

The numerator represents the compactness of the kth cluster, where n_k is its fuzzy cardinality. Its second item, an introduced punishing function, denotes the distance from the centroid of the kth cluster to the average of all cluster centroids, which eliminates the monotonically decreasing tendency as the number of clusters increases toward n. The denominator represents the mean distance from the kth cluster centroid to the other cluster centroids, used for measuring the separation. The ratio of the numerator to the denominator represents the clustering effect of the kth cluster, and the clustering validity index is defined as the sum of this ratio over all clusters. Obviously, the smaller the value is, the better the clustering effect on the dataset is, and the c corresponding to the minimum value is the optimal number of clusters.

4.3. Self-Adaptive FCM. In this paper, the iterative trial-and-error process [16] is still used to determine the optimal number of clusters, by a self-adaptive FCM (SAFCM). The pseudocode of the SAFCM algorithm is described in Algorithm 2.

5. Experiment and Analysis

This paper selects 8 experimental datasets, among which 3 datasets come from the UCI repository, respectively Iris, Wine, and Seeds; one dataset (SubKDD) is randomly selected from KDD Cup 1999, shown in Table 1; and the remaining 4 datasets are synthetic datasets, shown in Figure 3. SubKDD includes normal, 2 kinds of Probe attack (ipsweep and portsweep), and 3 kinds of DoS attack (neptune, smurf, and back). The first synthetic dataset (SD1) consists of 20 groups of 2-dimensional Gaussian distribution data, each with 10 samples. Their covariance matrices are the second-order unit matrix I_2. The structural feature of this dataset is that the distance between any two clusters is large and the number of classes is greater than ⌊√n⌋. The second synthetic dataset (SD2) consists of 4 groups of 2-dimensional Gaussian distribution data. The cluster centroids are respectively (5,5), (10,10), (15,15), and (20,20), each cluster containing 500 samples and the covariance matrix of each being 2I_2.

Figure 2: Demonstration of the process of the density-based algorithm. (a) is the initial data distribution of a synthetic dataset. The dataset consists of two pieces of 2-dimensional Gaussian distribution data with centroids respectively at (2, 3) and (7, 8); each class has 100 samples. In (b), the blue circle represents the highest-density core point, taken as the centroid of the first cluster, and the red plus signs represent the objects belonging to the first cluster. In (c), the red circle represents the core point taken as the centroid of the second cluster, and the blue asterisks represent the objects belonging to the second cluster. In (d), the purple circle represents the core point taken as the centroid of the third cluster, the green times signs represent the objects belonging to the third cluster, and the black dots represent the final border points, which do not belong to any cluster. According to a certain cutoff distance, the maximum number of clusters is 3. If calculated in accordance with the empirical rule, the maximum number of clusters would be 14. Therefore the algorithm can effectively reduce the iterations of the FCM algorithm.

Table 1: The data type and distribution of SubKDD.

Attack behavior    Number of samples
normal             200
ipsweep            50
portsweep          50
neptune            200
smurf              300
back               50

The structural feature of this dataset is a short intercluster distance with a small overlap. The third synthetic dataset (SD3) has a complicated structure. The fourth synthetic dataset (SD4) is a nonconvex one.

In this paper, the numeric value x_{ij} is normalized as

x'_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},   (11)

where \bar{x}_j and s_j denote the mean of the jth attribute values and their standard deviation, respectively; then

\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad s_j = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^{2}}.   (12)
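In NumPy, the z-score normalization of Eqs. (11) and (12) is a one-liner (a sketch; X denotes the raw data matrix):

    import numpy as np
    # ddof=1 gives the sample standard deviation with the (n - 1) denominator of Eq. (12)
    X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)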

Input: V, c_max, I_max, and ξ
Output: J_best, c_opt, U_best
  for kn = 2 to c_max do
    t = 0   /* the current iteration */
    V_kn^(t) = (v_i)_kn   /* the first kn cluster centroids of V */
    U_kn^(t) = (u_ik^(t))_{n×kn}
    J_kn^(t) = V_R(U_kn^(t), V_kn^(t))
    while t == 0 || (t < I_max && |J_kn^(t) − J_kn^(t−1)| > ξ) do
      update V_kn^(t)
      update U_kn^(t)
      update J_kn^(t)
      t = t + 1
    J_min,kn = min_{i=1,...,t}(J_kn^(i))
    idx = arg min_{i=1,...,t}(J_kn^(i))
    U_min,kn = U_kn^(idx)
  J_best = min_{kn=2,...,c_max}(J_min,kn)   /* the best value of the clustering validity index */
  c_opt = arg min_{kn=2,...,c_max}(J_min,kn)
  U_best = U_min,c_opt

Algorithm 2: SAFCM.
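Below is a compact sketch of the SAFCM loop, reusing the hypothetical fcm, density_init, and validity_indices sketches from the previous sections; unlike Algorithm 2, which keeps the minimum V_R over all iterations of each run, this simplification evaluates only the converged result:

    import numpy as np

    def safcm(X, p=3.0, m=2.0, max_iter=100, xi=1e-5):
        V_all, c_max, _ = density_init(X, p)       # Algorithm 1: centroids and c_max
        J_best, c_opt, U_best = np.inf, None, None
        for kn in range(2, c_max + 1):
            V0 = V_all[:kn]                        # the first kn density-based centroids
            # seed the memberships from the initial centroids via Eq. (2)
            d = np.fmax(np.linalg.norm(X[:, None, :] - V0[None, :, :], axis=2), 1e-12)
            U0 = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
            U, V, _ = fcm(X, kn, m, max_iter, xi, U_init=U0)
            _, _, v_r = validity_indices(X, U, V, m)   # the proposed index, Eq. (10)
            if v_r < J_best:                       # smaller V_R means a better partition
                J_best, c_opt, U_best = v_r, kn, U
        return J_best, c_opt, U_best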

Table 2: c_max estimated by several methods. n is the number of objects in the dataset, c is the actual number of clusters, c_ER is the number of clusters estimated by the empirical rule, that is, ⌊√n⌋, c_AP is the number of clusters obtained by AP algorithm, and c_DBA is the number of clusters obtained by the density-based algorithm (for four choices of the cutoff percentage p).

Dataset   n     c    c_ER   c_AP   c_DBA, p=1%   p=2%   p=3%   p=5%
Iris      150   3    12     9      20            14     9      6
Wine      178   3    13     15     14            7      6      3
Seeds     210   3    14     13     18            12     5      2
SubKDD    1050  6    32     24     21            17     10     7
SD1       200   20   14     19     38            22     20     —
SD2       2000  4    44     25     16            3      4      2
SD3       885   3    29     27     24            19     5      3
SD4       947   3    30     31     23            13     8      4

For the categorical attributes of SubKDD, simple matching is used as the dissimilarity measure, that is, 0 for identical values and 1 for different values.

5.1. Experiment on c_max Estimation. In [37] it was proposed that the number of clusters generated by AP algorithm could be selected as the maximum number of clusters. So this paper estimates the value of c_max by the empirical rule, AP algorithm, and the density-based algorithm. The specific experimental results are shown in Table 2. In AP algorithm, let p = p_m; d_c is respectively selected as the distance of the first 1%, 2%, 3%, and 5%.

Experimental results show that, for a dataset whose true number of clusters is greater than ⌊√n⌋, it is obviously incorrect to let c_max = ⌊√n⌋; in other words, the empirical rule is invalid. The number of clusters finally obtained by AP algorithm is close to the actual number of clusters. For a dataset whose true number of clusters is smaller than ⌊√n⌋, the number of clusters generated by AP algorithm can be used as an appropriate c_max, but sometimes the value is greater than the value estimated by the empirical rule, which enlarges the search range of the optimal number of clusters. If c_max is estimated by the proposed density-based algorithm, the results in most cases are appropriate. The method is invalid only on SD1, when the cutoff distance is taken at the first 5% distance. When the cutoff distance is taken at the first 3% distance, the c_max generated by the proposed algorithm is much smaller than ⌊√n⌋; it is closer to the true number of clusters and greatly narrows the search range of the optimal number of clusters. Therefore, the cutoff distance is selected as the distance of the first 3% in the later experiments.

5.2. Experiment on the Influence of the Initial Cluster Centroids on the Convergence of FCM. To show that the initial cluster centroids obtained by the density-based algorithm can quicken the convergence of the clustering algorithm, the traditional FCM is adopted for verification. The number of clusters is assigned the true value, with the convergence threshold being 10^−5.

Figure 3: Four synthetic datasets. (a) SD1; (b) SD2; (c) SD3; (d) SD4.

Table 3: Comparison of iterations of FCM algorithm. Method 1 uses randomly selected initial cluster centroids, and Method 2 uses the cluster centroids obtained by the density-based algorithm.

Dataset   Method 1   Method 2
Iris      21         16
Wine      27         18
Seeds     19         16
SubKDD    31         23
SD1       38         14
SD2       18         12
SD3       30         22
SD4       26         21

Because the random selection of initial cluster centroids has a large influence on FCM, the experiment on each dataset is repeated 50 times, and the rounded means of the iteration counts are compared. The specific experimental result is shown in Table 3.

As shown in Table 3, the initial cluster centroids obtained by the density-based algorithm can effectively reduce the iterations of FCM. Particularly on SD1, the iteration count of FCM is far smaller than that with randomly selected initial cluster centroids, whose minimum iteration count is 24 and maximum reaches 63, with unstable clustering results. Therefore, the proposed method not only effectively quickens the convergence of FCM, but also obtains a stable clustering result.

5.3. Experiment on Clustering Accuracy Based on the Clustering Validity Index. For the 3 UCI datasets and SubKDD, clustering accuracy r is adopted to measure the clustering effect, defined as

r = \frac{\sum_{i=1}^{k} b_i}{n}.   (13)

Here, b_i is the number of objects that co-occur in the ith cluster and the ith real cluster, and n is the number of objects in the dataset. According to this measurement, the higher the clustering accuracy is, the better the clustering result of FCM is. When r = 1, the clustering result of FCM is totally accurate.

In the experiments, the true number of clusters is assigned to each dataset, and the initial cluster centroids are obtained by the density-based algorithm. The experimental results for clustering accuracy are shown in Table 4.
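A small sketch of Eq. (13), under the assumption that each cluster is matched with the real class it shares the most objects with (clustering_accuracy and the integer-coded label arrays are illustrative):

    import numpy as np

    def clustering_accuracy(pred, truth):
        # pred, truth: integer label arrays of length n
        b = sum(np.bincount(truth[pred == k]).max() for k in np.unique(pred))
        return b / pred.size   # Eq. (13): r = (sum_i b_i) / n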

Figure 4: Clustering results on the four synthetic datasets. (a) SD1; (b) SD2; (c) SD3; (d) SD4.

Table 4: Clustering accuracy.

Dataset               Iris     Wine     Seeds    SubKDD
Clustering accuracy   84.00%   96.63%   91.90%   94.35%

Apparently, the clustering accuracies on the datasets Wine, Seeds, and SubKDD are high, while that on the dataset Iris is relatively low, for the reason that two of its clusters are not linearly separable.

The clustering results of the proposed clustering validity index on the 4 synthetic datasets are shown in Figure 4, which depicts good partitions on the datasets SD1 and SD2 and slightly poor partitions on SD3 and SD4, because of the complexity of their structure or their nonconvexity. When the number of clusters is set to 4, SD3 is divided into 4 groups; that means the right and the left parts each have 2 groups.

5.4. Experiment on the Optimal Number of Clusters. At last, Xie-Beni index, Bensaid index, Kwon index, and the proposed index are respectively adopted in runs of SAFCM so as to determine the optimal number of clusters.

Table 5: Optimal number of clusters estimated by several clustering validity indices.

Dataset   V_XB   V_B   V_K   V_R
Iris      2      9     2     2
Wine      3      6     3     3
Seeds     2      5     2     2
SubKDD    10     10    9     4
SD1       20     20    20    20
SD2       4      4     4     4
SD3       5      3     5     4
SD4       2      8     2     3

The results are shown in Table 5, where V_XB, V_B, V_K, and V_R represent Xie-Beni index, Bensaid index, Kwon index, and the proposed index, respectively.

It shows that, for the synthetic datasets SD1 and SD2 with simple structure, all these indices can obtain the optimal number of clusters. For the 3 UCI datasets, SubKDD, and SD4, Bensaid index cannot obtain an accurate number of clusters; SD3 is the only exception.

Table 6: The value of clustering validity index on Iris.

c   V_XB       V_B        V_K         V_R
2   0.114234   0.223604   17.384973   0.473604
3   0.223973   0.124598   34.572146   0.551565
4   0.316742   0.099103   49.279488   0.615436
5   0.560109   0.089108   87.540968   0.676350
6   0.574475   0.072201   90.563379   0.691340
7   0.400071   0.067005   63.328679   0.735311
8   0.275682   0.036283   45.736972   0.614055
9   0.250971   0.027735   42.868449   0.584244

Table 7: The value of clustering validity index on Wine.

c   V_XB       V_B        V_K         V_R
2   0.663406   1.328291   118.33902   1.578291
3   0.468902   0.513071   83.939058   1.346421
4   —          0.473254   —           1.735791
5   —          0.373668   —           1.846686
6   —          0.256698   —           1.683222

Table 8: The value of clustering validity index on Seeds.

c   V_XB       V_B        V_K         V_R
2   0.147247   0.293609   31.170349   0.543609
3   0.212127   0.150899   45.326216   0.599001
4   0.243483   0.127720   52.215334   0.697943
5   0.348842   0.085477   75.493654   0.701153

For the dataset Wine, both Xie-Beni index and Kwon index can obtain the accurate number of clusters, while for the datasets Iris, Seeds, SubKDD, SD3, and SD4 they only obtain results approximate to the true number of clusters. The proposed index obtains the right result on the datasets Wine and SD4, and a better result than Xie-Beni index and Kwon index on the datasets SubKDD and SD3, while it obtains the same result as those two indices on the datasets Iris and Seeds. There are two reasons. First, the importance degree of each property of the dataset is not considered in the clustering process but assumed to be the same, thereby affecting the experimental result. Second, SAFCM has a weak capacity to deal with overlapping clusters or nonconvex datasets. These are the authors' subsequent research topics.

Tables 6-13 list the detailed results of the 4 indices on the experimental datasets for each iteration of SAFCM. For the dataset Wine, when the number of clusters is 4, 5, or 6, the results based on Xie-Beni index and Kwon index do not converge. However, the proposed index achieves the desired results.

6. Conclusions

FCM is widely used in many fields. But it needs the number of clusters to be preset, and it is greatly influenced by the initial cluster centroids. This paper studies a self-adaptive method for determining the number of clusters by use of the FCM algorithm.

Table 9: The value of clustering validity index on SubKDD.

c    V_XB       V_B        V_K          V_R
2    0.646989   1.324434   550.166676   1.574431
3    0.260755   0.378775   222.020838   1.090289
4    0.133843   0.062126   119.544560   0.511238
5    0.234402   0.052499   202.641204   0.537852
6    0.180728   0.054938   156.812271   0.583800
7    0.134636   0.047514   119.029265   0.619720
8    0.104511   0.032849   919.852740   0.690873
9    0.129721   0.027639   740.128470   0.562636
10   0.068220   0.027025   913.528560   0.528528

Table 10: The value of clustering validity index on SD1.

c    V_XB       V_B        V_K         V_R
2    0.221693   0.443390   44.592968   0.693390
3    0.206035   0.198853   40.264251   0.726245
4    0.127731   0.093653   26.220200   0.655550
5    0.130781   0.069848   27.154867   0.651465
6    0.144894   0.050067   22.922121   0.639325
7    0.136562   0.040275   29.126152   0.636258
8    0.112480   0.032874   24.323625   0.627442
9    0.115090   0.026833   24.242580   0.624936
10   0.141415   0.022611   28.574579   0.616701
11   0.126680   0.019256   28.821707   0.611524
12   0.103178   0.016634   23.931865   0.605990
13   0.110355   0.013253   26.517065   0.588246
14   0.095513   0.011083   23.635022   0.576808
15   0.075928   0.009817   19.302095   0.562289
16   0.066025   0.008824   17.236138   0.557990
17   0.054314   0.007248   14.995284   0.544341
18   0.045398   0.006090   13.208810   0.534882
19   0.039492   0.005365   11.977437   0.527131
20   0.034598   0.004917   10.835952   0.517481

Table 11: The value of clustering validity index on SD2.

c   V_XB       V_B        V_K         V_R
2   0.066286   0.132572   132.81503   0.382572
3   0.068242   0.063751   137.52535   0.394200
4   0.060916   0.028149   123.09967   0.378337

Table 12: The value of clustering validity index on SD3.

c   V_XB       V_B        V_K          V_R
2   0.148379   0.300899   131.557269   0.570876
3   0.195663   0.142149   173.900551   0.599680
4   0.127512   0.142150   113.748947   0.557198
5   0.126561   0.738070   113.262975   0.589535

In this method, a density-based algorithm is put forward at first, which can estimate the maximum number of clusters to reduce the search range of the optimal number, being especially fit for datasets on which the empirical rule is inoperative.

Table 13: The value of clustering validity index on SD4.

c   V_XB       V_B        V_K          V_R
2   0.104434   0.208832   99.148271    0.473748
3   0.170326   0.142561   162.044450   0.458832
4   0.221884   0.081007   211.692699   0.583529
5   0.156253   0.053094   157.683921   0.603211
6   0.123191   0.041799   118.279116   0.575396
7   0.165465   0.032411   107.210082   0.592625
8   0.145164   0.028698   139.310969   0.606049

Besides, it can generate high-quality initial cluster centroids, so that the clustering result is stable and the convergence of FCM is quick. Then, a new fuzzy clustering validity index was put forward based on fuzzy compactness and separation, so that the clustering result is closer to the global optimum. The index remains robust and interpretable when the number of clusters tends to that of the objects in the dataset. Finally, a self-adaptive FCM algorithm is proposed to determine the optimal number of clusters, run in an iterative trial-and-error process.

The contributions are validated by the experimental results. However, in most cases each property plays a different role in the clustering process; in other words, the weights of the properties are not the same. This issue will be focused on in the authors' future work.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61373148 and 61502151), Shandong Province Natural Science Foundation (ZR2014FL010), Science Foundation of Ministry of Education of China (14YJC860042), Shandong Province Outstanding Young Scientist Award Fund (BS2013DX033), Shandong Province Social Sciences Project (15CXWJ13 and 16CFXJ05), and Project of Shandong Province Higher Educational Science and Technology Program (no. J15LN02).

References

[1] S. Sambasivam and N. Theodosopoulos, "Advanced data clustering methods of mining web documents," Issues in Informing Science and Information Technology, vol. 3, pp. 563-579, 2006.

[2] J. Wang, S.-T. Wang, and Z.-H. Deng, "Survey on challenges in clustering analysis research," Control and Decision, vol. 3, 2012.

[3] C.-W. Bong and M. Rajeswari, "Multi-objective nature-inspired clustering and classification techniques for image segmentation," Applied Soft Computing Journal, vol. 11, no. 4, pp. 3271-3282, 2011.

[4] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492-1496, 2014.

[5] L. A. Zadeh, "Fuzzy sets," Information and Computation, vol. 8, no. 3, pp. 338-353, 1965.

[6] J. C. Bezdek, "Numerical taxonomy with fuzzy sets," Journal of Mathematical Biology, vol. 1, no. 1, pp. 57-71, 1974.

[7] J. C. Bezdek, "Cluster validity with fuzzy sets," Journal of Cybernetics, vol. 3, no. 3, pp. 58-73, 1974.

[8] M. P. Windham, "Cluster validity for fuzzy clustering algorithms," Fuzzy Sets and Systems, vol. 5, no. 2, pp. 177-185, 1981.

[9] M. Lee, "Fuzzy cluster validity index based on object proximities defined over fuzzy partition matrices," in Proceedings of the IEEE International Conference on Fuzzy Systems and IEEE World Congress on Computational Intelligence (FUZZ-IEEE '08), pp. 336-340, June 2008.

[10] X.-B. Zhang and L. Jiang, "A new validity index of fuzzy c-means clustering," in Proceedings of the IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '09), vol. 2, pp. 218-221, Hangzhou, China, August 2009.

[11] I. Saha, U. Maulik, and S. Bandyopadhyay, "A new differential evolution based fuzzy clustering for automatic cluster evolution," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 706-711, IEEE, Patiala, India, March 2009.

[12] S. Yue, J.-S. Wang, T. Wu, and H. Wang, "A new separation measure for improving the effectiveness of validity indices," Information Sciences, vol. 180, no. 5, pp. 748-764, 2010.

[13] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841-847, 1991.

[14] A. M. Bensaid, L. O. Hall, J. C. Bezdek et al., "Validity-guided (re)clustering with applications to image segmentation," IEEE Transactions on Fuzzy Systems, vol. 4, no. 2, pp. 112-123, 1996.

[15] S. H. Kwon, "Cluster validity index for fuzzy clustering," Electronics Letters, vol. 34, no. 22, pp. 2176-2177, 1998.

[16] H. Sun, S. Wang, and Q. Jiang, "FCM-based model selection algorithms for determining the number of clusters," Pattern Recognition, vol. 37, no. 10, pp. 2027-2037, 2004.

[17] Y. Li and F. Yu, "A new validity function for fuzzy clustering," in Proceedings of the International Conference on Computational Intelligence and Natural Computing (CINC '09), vol. 1, pp. 462-465, IEEE, June 2009.

[18] S. Saha and S. Bandyopadhyay, "A new cluster validity index based on fuzzy granulation-degranulation criteria," in Proceedings of the IEEE International Conference on Advanced Computing and Communications (ADCOM '07), pp. 353-358, IEEE, Guwahati, India, 2007.

[19] M. Zhang, W. Zhang, H. Sicotte, and P. Yang, "A new validity measure for a correlation-based fuzzy c-means clustering algorithm," in Proceedings of the 31st IEEE Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 3865-3868, Minneapolis, Minn, USA, September 2009.

[20] Y.-I. Kim, D.-W. Kim, D. Lee, and K. H. Lee, "A cluster validation index for GK cluster analysis based on relative degree of sharing," Information Sciences, vol. 168, no. 1-4, pp. 225-242, 2004.

[21] B. Rezaee, "A cluster validity index for fuzzy clustering," Fuzzy Sets and Systems, vol. 161, no. 23, pp. 3014-3025, 2010.

[22] D. Zhang, M. Ji, J. Yang, Y. Zhang, and F. Xie, "A novel cluster validity index for fuzzy clustering based on bipartite modularity," Fuzzy Sets and Systems, vol. 253, pp. 122-137, 2014.

[23] J. Yu and Q. Cheng, "Search range of the optimal cluster number in fuzzy clustering," Science in China (Series E), vol. 32, no. 2, pp. 274-280, 2002.

[24] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972-976, 2007.

[25] D. Pelleg, A. W. Moore, et al., "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), vol. 1, pp. 727-734, 2000.

[26] H. Frigui and R. Krishnapuram, "A robust competitive clustering algorithm with applications in computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 450-465, 1999.

[27] M.-S. Yang and K.-L. Wu, "A similarity-based robust clustering method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 4, pp. 434-448, 2004.

[28] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780-784, 2002.

[29] L. Yujian, "A clustering algorithm based on maximal θ-distant subtrees," Pattern Recognition, vol. 40, no. 5, pp. 1425-1431, 2007.

[30] Y. Shihong, H. Ti, and W. Penglong, "Matrix eigenvalue analysis-based clustering validity index," Journal of Tianjin University (Science and Technology), vol. 8, article 6, 2014.

[31] A. José-García and W. Gómez-Flores, "Automatic clustering using nature-inspired metaheuristics: a survey," Applied Soft Computing Journal, vol. 41, pp. 192-213, 2016.

[32] E. H. Ruspini, "New experimental results in fuzzy clustering," Information Sciences, vol. 6, pp. 273-284, 1973.

[33] J. C. Dunn, "Well-separated clusters and optimal fuzzy partitions," Journal of Cybernetics, vol. 4, no. 1, pp. 95-104, 1974.

[34] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Springer Science & Business Media, 2013.

[35] R. E. Hammah and J. H. Curran, "Fuzzy cluster algorithm for the automatic identification of joint sets," International Journal of Rock Mechanics and Mining Sciences, vol. 35, no. 7, pp. 889-905, 1998.

[36] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370-379, 1995.

[37] S. B. Zhou, Z. Y. Xu, and X. Q. Tang, "Method for determining optimal number of clusters based on affinity propagation clustering," Control and Decision, vol. 26, no. 8, pp. 1147-1152, 2011.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 2: Research Article A Self-Adaptive Fuzzy -Means Algorithm ...downloads.hindawi.com/journals/cin/2016/2647389.pdf · A Self-Adaptive Fuzzy -Means Algorithm for Determining the Optimal

2 Computational Intelligence and Neuroscience

maximum number of clusters based on the features of a dataset, thereby determining the search range for estimating the optimal number of clusters and effectively reducing the iterations of the clustering algorithm. (2) Based on the compactness within a cluster and the separation between clusters, a new fuzzy clustering validity index (CVI) is defined, which keeps the index value from approaching zero as the number of clusters tends to the number of objects and thus allows the optimal clustering result to be obtained. (3) A self-adaptive method that iteratively runs the improved FCM algorithm to estimate the optimal number of clusters is put forward.

2 Related Work

The easiest method of determining the number of clusters is data visualization. For a dataset that can be effectively mapped to a 2-dimensional Euclidean space, the number of clusters can be read intuitively from the distribution graph of the data points. However, for high-dimensional and complicated data, this method is unserviceable. Rodriguez and Laio proposed a clustering algorithm based on density peaks, claiming that it was able to detect nonspherical clusters and to automatically find the true number of clusters [4]. In fact, however, the number of cluster centroids still needs to be selected manually from the decision graph. The relevant techniques for determining the optimal number of clusters are summarized below.

2.1. Clustering Validity Index Based Method. A clustering validity index is used to evaluate the quality of the partitions of a dataset generated by a clustering algorithm, and constructing an appropriate clustering validity index is an effective way to determine the number of clusters. The idea is to assign different values of the number of clusters $c$ within a certain range, run the fuzzy clustering algorithm on the dataset for each value, and finally evaluate the results by the clustering validity index. When the value of the clustering validity index reaches its maximum or minimum, or an obvious inflection point appears, the corresponding value of $c$ is the optimal number of clusters $c_{\mathrm{opt}}$. So far, researchers have put forward many fuzzy clustering validity indices, divided into the following two types.

(1) Clustering Validity Index Based on Fuzzy Partition. These indices follow the viewpoint that, for a well-separated dataset, the smaller the fuzziness of the fuzzy partition is, the more reliable the clustering result is. Based on this, Zadeh, the founder of fuzzy sets, put forward the first clustering validity index, the degree of separation, in 1965 [5], but its discriminating effect is not ideal. In 1974, Bezdek put forward the concept of the partition coefficient (PC) [6], which was the first practical generic function for measuring the validity of fuzzy clustering, and subsequently proposed another closely related clustering validity function, the partition entropy (PE) [7]. Later, Windham defined the proportion exponent using the maximum value of the fuzzy membership function [8]. Lee put forward a fuzzy clustering validity index using the distinguishability of clusters measured by object proximities [9]. Based on the Shannon entropy and fuzzy variation theory, Zhang and Jiang put forward a new fuzzy clustering validity index taking account of the geometric structure of the dataset [10]. Saha et al. put forward an algorithm based on differential evolution for automatic cluster detection, which evaluated the validity of the clustering result well [11]. Yue et al. partitioned the original data space into a grid-based structure and proposed a cluster separation measure based on grid distances [12]. A clustering validity index based on fuzzy partition depends only on the fuzzy degrees of membership and has the advantages of simplicity and a small amount of computation, but it lacks a direct relation to the structural features of the dataset.

(2) Clustering Validity Index Based on the Geometric Structure of the Dataset. These indices follow the viewpoint that, for a well-separated dataset, every cluster should be compact and the clusters should be separated from each other as far as possible; the ratio of compactness to separation is used as the criterion of clustering validity. Representative indices of this type include the Xie-Beni index [13], the Bensaid index [14], and the Kwon index [15]. Sun et al. proposed a new validity index based on a linear combination of compactness and separation, inspired by Rezaee's validity index [16]. Li and Yu defined new compactness and separation measures and put forward a new fuzzy clustering validity function [17]. Based on fuzzy granulation-degranulation, Saha and Bandyopadhyay put forward a fuzzy clustering validity function [18]. Zhang et al. adopted the Pearson correlation to measure distance and put forward a validity function [19]. Kim et al. proposed a clustering validity index for the GK algorithm based on the average value of the relative degrees of sharing of all possible pairs of fuzzy clusters [20]. Rezaee proposed a new validity index for the GK algorithm to overcome the shortcomings of Kim's index [21]. Zhang et al. proposed a novel index, WGLI, to detect the optimal cluster number, using the global optimum membership as the global property and the modularity of a bipartite network as the local independent property [22]. A clustering validity index based on the geometric structure of the dataset considers both the fuzzy degrees of membership and the geometric structure, but its membership function is quite complicated, with a large amount of computation.

Based on a clustering validity index, the optimal number of clusters is determined through exhaustive search. In order to increase the efficiency of estimating the optimal number of clusters $c_{\mathrm{opt}}$, the search range of $c_{\mathrm{opt}}$ must be set; that is, $c_{\max}$ is the maximum number of clusters, assigned so as to meet the condition $c_{\mathrm{opt}} \le c_{\max}$. Most researchers use the empirical rule $c_{\max} \le \sqrt{n}$, where $n$ is the number of data points in the dataset. Theoretical analysis and example verification of this rule were conducted in [23], which indicated that it is reasonable in a sense. However, this method apparently has the following disadvantages. (1) Each $c$ must be tried in turn, which causes a huge amount of computation. (2) For each $c$, it cannot be guaranteed that the clustering result is the globally optimal solution. (3) When noise and outliers are present, the reliability of the clustering validity index is weak. (4) For some datasets, like FaceImage [24], if $c \ge \sqrt{n}$, the empirical rule is invalid. Research has shown that, due to the diversity of data types and structures, no universal fuzzy clustering validity


index is applicable to all datasets; research on this problem therefore remains urgent and ongoing.

2.2. Heuristic Method. Some new clustering algorithms have been proposed in succession recently. The main idea is to use certain criteria to guide the clustering process, with the number of clusters being adjusted along the way; in this way, when the clustering is completed, the appropriate number of clusters is obtained as well. For example, the X-means algorithm [25], based on split hierarchical clustering, is representative. Contrary to that process, RCA [26] determined the actual number of clusters by a process of competitive agglomeration. Combining single-point iterative technology with hierarchical clustering, a similarity-based clustering method (SCM) [27] was proposed. A Mercer kernel-based clustering [28] estimated the number of clusters by the eigenvectors of a kernel matrix. A clustering algorithm based on maximal $\theta$-distant subtrees [29] detected any number of well-separated clusters with arbitrary shapes. In addition, Frey and Dueck put forward the affinity propagation (AP) clustering algorithm [24], which generates high-quality cluster centroids via message passing between objects to determine the optimal number of clusters of large-scale data. Shihong et al. proposed a Gerschgorin disk estimation-based criterion to estimate the true number of clusters [30]. José-García and Gómez-Flores presented an up-to-date review of all major nature-inspired metaheuristic algorithms used thus far for automatic clustering, determining the best estimate of the number of clusters [31]. All of these have widened the range of ideas for related research.

3. Traditional FCM for Determining the Optimal Number of Clusters

Since Ruspini first introduced the theory of fuzzy sets into cluster analysis in 1973 [32], different fuzzy clustering algorithms have been widely discussed, developed, and applied in various areas. FCM [33] is one of the most commonly used algorithms. FCM was first presented by Dunn in 1974; subsequently, Bezdek introduced a weighting exponent for the fuzzy degree of membership [34], further developing FCM. FCM divides a dataset $X$ into $c$ fuzzy clusters. The algorithm holds that each object belongs to every cluster with a different degree of membership; that is, a cluster is considered as a fuzzy subset of the dataset.

Assume that $X = \{x_1, x_2, \ldots, x_n\}$ is a finite dataset, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{il})$ is an $l$-dimensional object and $x_{ij}$ is the $j$th property of the $i$th object. $C = \{C_1, C_2, \ldots, C_c\}$ denotes $c$ clusters. $V = \{v_1, v_2, \ldots, v_c\}$ represents $c$ $l$-dimensional cluster centroids, where $v_i = (v_{i1}, v_{i2}, \ldots, v_{il})$. $U = (u_{ik})_{n \times c}$ is a fuzzy partition matrix, and $u_{ik}$ is the degree of membership of the $i$th object in the $k$th cluster, where $\sum_{k=1}^{c} u_{ik} = 1$, $\forall i = 1, \ldots, n$. The objective function is the weighted sum of squared distances from the samples to the cluster centroid in each cluster; that is,

$$J_m(U, V) = \sum_{k=1}^{c} \sum_{i=1}^{n} u_{ik}^m d_{ik}^2, \tag{1}$$

where $d_{ik} = \|x_i - v_k\|$ is the Euclidean distance between the $i$th object and the $k$th cluster centroid, and $m$ ($m \in [1, \infty)$) is a fuzziness index controlling the fuzziness of the memberships. As the value of $m$ becomes progressively higher, the resulting memberships become fuzzier [35]. Pal and Bezdek advised that $m$ should be between 1.5 and 2.5, and usually $m = 2$ is used if there is no special requirement [36].

According to the clustering criterion, the appropriate fuzzy partition matrix $U$ and cluster centroids $V$ are obtained so as to minimize the objective function $J_m$. Based on the Lagrange multiplier method, $U$ and $V$ are calculated by the formulas below:

$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik} / d_{ij} \right)^{2/(m-1)}}, \tag{2}$$

$$v_k = \frac{\sum_{i=1}^{n} u_{ik}^m x_i}{\sum_{i=1}^{n} u_{ik}^m}. \tag{3}$$

The FCM algorithm is carried out through an iterative process of minimizing the objective function $J_m$, with updates of $U$ and $V$. The specific steps are as follows.

Step 1. Assign the initial value of the number of clusters $c$, the fuzziness index $m$, the maximum number of iterations $I_{\max}$, and the threshold $\xi$.

Step 2. Initialize the fuzzy partition $U^{(0)}$ randomly, according to the constraints on the degree of membership.

Step 3. At step $t$, calculate the $c$ cluster centroids $V^{(t)}$ according to (3).

Step 4. Calculate the objective function $J_m^{(t)}$ according to (1). If $|J_m^{(t)} - J_m^{(t-1)}| < \xi$ or $t > I_{\max}$, then stop; otherwise continue to Step 5.

Step 5. Calculate $U^{(t+1)}$ according to (2) and return to Step 3.

At last, each object can be assigned to one cluster in accordance with the principle of the maximum degree of membership. The advantages of FCM may be summarized as a simple algorithm, quick convergence, and ease of extension. Its disadvantages lie in the selection of the initial cluster centroids, the sensitivity to noise and outliers, and the setting of the number of clusters, all of which have a great impact on the clustering result. Since randomly selected initial cluster centroids cannot ensure that FCM converges to an optimal solution, either different initial cluster centroids are used for multiple runs of FCM, or the initial centroids are determined by means of other fast algorithms.
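To make Steps 1-5 concrete, the following is a minimal Python sketch of the FCM loop described above (NumPy-based; the function names and defaults are illustrative, not taken from the paper):

```python
import numpy as np

def memberships(X, V, m):
    """Eq. (2): memberships from the current centroids."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # d_ik^2
    d2 = np.fmax(d2, 1e-12)                 # guard against zero distance
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True), d2

def fcm(X, c, m=2.0, I_max=100, xi=1e-5, V0=None, seed=0):
    """Minimal FCM: alternate eq. (3) (centroids) and eq. (2)
    (memberships) until the objective (1) changes by less than xi."""
    if V0 is None:                          # Step 2: random fuzzy partition
        rng = np.random.default_rng(seed)
        U = rng.random((X.shape[0], c))
        U /= U.sum(axis=1, keepdims=True)
    else:                                   # or seed from given centroids
        U, _ = memberships(X, np.asarray(V0, dtype=float), m)
    J_prev = np.inf
    for _ in range(I_max):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]   # Step 3, eq. (3)
        U, d2 = memberships(X, V, m)               # Step 5, eq. (2)
        J = ((U ** m) * d2).sum()                  # Step 4, eq. (1)
        if abs(J_prev - J) < xi:
            break
        J_prev = J
    return U, V, J
```

Objects can then be labeled by the maximum-membership rule, for example, `labels = U.argmax(axis=1)`.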

The traditional method of determining the optimal number of clusters for FCM is to set the search range of the number of clusters, run FCM to generate a clustering result for each number of clusters, select an appropriate clustering validity index to evaluate the clustering results, and finally obtain the optimal number of clusters according to the evaluation. The method is composed of the following steps.


Step 1. Input the search range $[c_{\min}, c_{\max}]$; generally $c_{\min} = 2$ and $c_{\max} = \lfloor\sqrt{n}\rfloor$.

Step 2. For each integer $kn \in [c_{\min}, c_{\max}]$:

Step 2.1. Run FCM.

Step 2.2. Calculate the clustering validity index.

Step 3. Compare all values of the clustering validity index; the $kn$ corresponding to the maximum or minimum value is the optimal number of clusters $c_{\mathrm{opt}}$.

Step 4. Output $c_{\mathrm{opt}}$, the optimal value of the clustering validity index, and the clustering result.
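As a sketch, the four steps amount to the following loop (reusing the `fcm` sketch above; `validity_index` is assumed to be any function scoring a partition, with smaller values taken as better, as for the minimum-type indices in Section 4.2):

```python
import numpy as np

def search_c_opt(X, validity_index, c_min=2, c_max=None):
    """Exhaustive search: run FCM for every kn in [c_min, c_max] and
    keep the kn whose validity index value is smallest."""
    if c_max is None:
        c_max = int(np.floor(np.sqrt(X.shape[0])))   # empirical rule
    best_score, c_opt, best_U = np.inf, None, None
    for kn in range(c_min, c_max + 1):               # Step 2
        U, V, _ = fcm(X, kn)                         # Step 2.1
        score = validity_index(X, U, V)              # Step 2.2
        if score < best_score:                       # Step 3
            best_score, c_opt, best_U = score, kn, U
    return c_opt, best_score, best_U                 # Step 4
```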

4. The Proposed Methods

4.1. Density-Based Algorithm. Considering the large influence of randomly selected initial cluster centroids on the clustering result, a density-based algorithm is proposed to select the initial cluster centroids, by which the maximum number of clusters can be estimated at the same time. Some related terms are defined first.

Definition 1 (local density). The local density $\rho_i$ of object $x_i$ is defined as

$$\rho_i = \sum_{j=1}^{n} e^{-d_{ij}^2 / d_c^2}, \tag{4}$$

where $d_{ij}$ is the distance between two objects $x_i$ and $x_j$, and $d_c$ is a cutoff distance. The recommended approach is to sort the distances between all pairs of objects in descending order and then assign $d_c$ the value corresponding to the first $p\%$ of the sorted distances (appropriately, $p \in [2, 5]$). It follows that the more objects there are with a distance from $x_i$ less than $d_c$, the bigger the value of $\rho_i$ is. The cutoff distance finally decides the number of initial cluster centroids, namely, the maximum number of clusters. The density-based algorithm is robust with respect to the choice of $d_c$.

Definition 2 (core point). Assume that there is an object $x_i \in X$; if $x_i \notin C_k$ and $\rho_i = \max_{x_j \notin C_k}(\rho_j)$, $k = 1, \ldots, c$, $j = 1, \ldots, n$, then $x_i$ is a core point.

Definition 3 (directly density-reachable). Assume that there are two objects $x_i, x_j \in X$; if their distance $d_{ij} < d_c$, then $x_j$ is directly density-reachable to $x_i$, and vice versa.

Definition 4 (density-reachable). Assume that $X$ is a dataset and objects $x_s, x_{s+1}, \ldots, x_{t-1}, x_t$ belong to it. If, for each $i \in [s, t-1]$, $x_{i+1}$ is directly density-reachable to $x_i$, then $x_s$ is density-reachable to $x_t$.

Definition 5 (neighbor). The neighbors of an object $x_i$ are those objects that are directly density-reachable or density-reachable to it, denoted as Neighbor($x_i$).

Figure 1: An example. Point A has the highest local density. If A does not belong to any cluster, then A is a core point. Points B, C, D, and E are directly density-reachable to point A. Point F is density-reachable to point A. Point H is a border point.

Definition 6 (border point). An object is called a border point if it has no neighbors.

The example is shown in Figure 1.

The selection principle for the initial cluster centroids of the density-based algorithm is that a cluster centroid is usually an object with a higher local density, surrounded by neighbors of lower local density, and at a relatively large distance from other cluster centroids [4]. The density-based algorithm can automatically select the initial cluster centroids and determine the maximum number of clusters $c_{\max}$ according to local density. The pseudocode is shown in Algorithm 1. The cluster centroids obtained are sorted in descending order of local density. Figure 2 demonstrates the process of the density-based algorithm.

4.2. Fuzzy Clustering Validity Index. Based on the geometric structure of the dataset, Xie and Beni put forward the Xie-Beni fuzzy clustering validity index [13], which tries to find a balance point between fuzzy compactness and separation so as to acquire the optimal clustering result. The index is defined as

$$V_{\mathrm{XB}}(U, V) = \frac{(1/n) \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m \|v_i - x_j\|^2}{\min_{i \neq j} \|v_i - v_j\|^2}. \tag{5}$$

In this formula, the numerator is the average distance from the objects to the centroids, used to measure compactness, and the denominator is the minimum distance between any two centroids, measuring separation.
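In code, eq. (5) can be evaluated directly (a sketch; smaller values indicate a better balance between compactness and separation):

```python
import numpy as np

def v_xb(X, U, V, m=2.0):
    """Xie-Beni index, eq. (5): mean weighted within-cluster scatter
    divided by the minimum squared distance between centroids."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    num = ((U ** m) * d2).sum() / X.shape[0]
    sep = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(sep, np.inf)        # exclude the i == j pairs
    return num / sep.min()
```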

However, Bensaid et al. found that the size of each cluster had a large influence on the Xie-Beni index and put forward a new index [14] that is insensitive to the number of objects in each cluster. The Bensaid index is defined as

$$V_B(U, V) = \sum_{k=1}^{c} \frac{\sum_{i=1}^{n} u_{ik}^m \|x_i - v_k\|^2}{n_k \sum_{j=1}^{c} \|v_j - v_k\|^2}, \tag{6}$$


Input: dataset and $p$
Output: $V$ and $c_{\max}$
  $D = (d_{ij})_{n \times n}$
  $d_c$ = cutoffdis($p$)  /* calculate the cutoff distance $d_c$ */
  $\rho = (\rho_i)_n$
  $idx$ = arg(sort($\rho$, descent))  /* indices of the objects corresponding to the sorted $\rho$ */
  $k = 1$
  $cl = (-1)_{1 \times n}$  /* the cluster number each object belongs to, with initial value $-1$ */
  for $i = 1$ to $n - 1$ do
    if $cl_{idx_i} = -1$ && $|$Neighbor($x_{idx_i}$)$| > 1$ then  /* $x_{idx_i}$ does not belong to any cluster and the number of its neighbors is greater than 1 */
      $v_k = x_{idx_i}$
      for $j \in$ Neighbor($x_{idx_i}$) do
        $cl_j = k$
      $k = k + 1$
  $c_{\max} = k$

Algorithm 1: Density-based algorithm.

where $n_k$ is the fuzzy cardinality of the $k$th cluster, defined as

$$n_k = \sum_{i=1}^{n} u_{ik}. \tag{7}$$

$\sum_{i=1}^{n} u_{ik}^m \|x_i - v_k\|^2$ expresses the variation of the $k$th fuzzy cluster. The compactness is then computed as

$$\frac{1}{n_k} \sum_{i=1}^{n} u_{ik}^m \|x_i - v_k\|^2. \tag{8}$$

$\sum_{j=1}^{c} \|v_j - v_k\|^2$ denotes the separation of the $k$th fuzzy cluster, defined as the sum of the distances from its cluster centroid to the centroids of the other $c - 1$ clusters.

Because

$$\lim_{c \to n} \|x_i - v_k\|^2 = 0, \tag{9}$$

this index suffers from the same problem as the Xie-Beni index: when $c \to n$, the index value decreases monotonically toward 0 and loses its robustness and its power to judge the optimal number of clusters. Thus, this paper improves the Bensaid index and proposes a new index, $V_R$:

$$V_R(U, V) = \sum_{k=1}^{c} \frac{(1/n_k) \sum_{i=1}^{n} u_{ik}^m \|x_i - v_k\|^2 + (1/c) \|v_k - \bar{v}\|^2}{(1/(c-1)) \sum_{j=1}^{c} \|v_j - v_k\|^2}, \tag{10}$$

where $\bar{v}$ is the mean of all cluster centroids.

The numerator represents the compactness of the $k$th cluster, where $n_k$ is its fuzzy cardinality. Its second term, an introduced penalty function, is the distance from the centroid of the $k$th cluster to the mean of all cluster centroids, which eliminates the monotonically decreasing tendency as the number of clusters increases toward $n$. The denominator represents the mean distance from the $k$th cluster centroid to the other cluster centroids and measures the separation. The ratio of the numerator to the denominator thus represents the clustering effect of the $k$th cluster, and the clustering validity index is defined as the sum of the clustering effects (the ratios) over all clusters. Obviously, the smaller the value is, the better the clustering effect of the dataset is, and the $c$ corresponding to the minimum value is the optimal number of clusters.

4.3. Self-Adaptive FCM. In this paper, the iterative trial-and-error process [16] is used to determine the optimal number of clusters by self-adaptive FCM (SAFCM). The pseudocode of the SAFCM algorithm is described in Algorithm 2.

5. Experiment and Analysis

This paper selects 8 experimental datasets, among which 3 datasets come from the UCI repository, namely, Iris, Wine, and Seeds; one dataset (SubKDD) is randomly selected from KDD Cup 1999 and is shown in Table 1; and the remaining 4 datasets are synthetic, shown in Figure 3. SubKDD includes normal traffic, 2 kinds of Probe attack (ipsweep and portsweep), and 3 kinds of DoS attack (neptune, smurf, and back). The first synthetic dataset (SD1) consists of 20 groups of 2-dimensional Gaussian distribution data with 10 samples each; the covariance matrices are second-order unit matrices $I_2$. The structural feature of this dataset is that the distance between any two clusters is large and the number of classes is greater than $\lfloor\sqrt{n}\rfloor$. The second synthetic dataset (SD2) consists of 4 groups of 2-dimensional Gaussian distribution data. The cluster centroids are (5,5), (10,10), (15,15), and (20,20), each cluster containing 500 samples with covariance matrix $2I_2$. The structural feature


[Figure 2: four scatter plots, (a) Initial data, (b) Iteration 1, (c) Iteration 2, (d) Iteration 3.]

Figure 2: Demonstration of the process of the density-based algorithm. (a) is the initial data distribution of a synthetic dataset. The dataset consists of two pieces of 2-dimensional Gaussian distribution data with centroids (2, 3) and (7, 8); each class has 100 samples. In (b), the blue circle represents the highest-density core point, taken as the centroid of the first cluster, and the red plus signs represent the objects belonging to the first cluster. In (c), the red circle represents the core point taken as the centroid of the second cluster, and the blue asterisks represent the objects belonging to the second cluster. In (d), the purple circle represents the core point taken as the centroid of the third cluster, the green times signs represent the objects belonging to the third cluster, and the black dots represent the final border points, which do not belong to any cluster. According to the chosen cutoff distance, the maximum number of clusters is 3. If calculated in accordance with the empirical rule, the maximum number of clusters would be 14. Therefore, the algorithm can effectively reduce the iterations of the FCM algorithm.

Table 1: The data type and distribution of SubKDD.

Attack behavior    Number of samples
normal             200
ipsweep            50
portsweep          50
neptune            200
smurf              300
back               50

of this dataset is a short intercluster distance with a small overlap. The third synthetic dataset (SD3) has a complicated structure. The fourth synthetic dataset (SD4) is nonconvex.

In this paper, each numeric value $x_{ij}$ is normalized as

$$x'_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \tag{11}$$

where $\bar{x}_j$ and $s_j$ denote the mean of the $j$th attribute value and its standard deviation, respectively:

$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad s_j = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}. \tag{12}$$
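Equations (11) and (12) are the usual z-score standardization with the sample ($n-1$) standard deviation; in NumPy:

```python
import numpy as np

def normalize(X):
    """Eq. (11)-(12): center each attribute and divide by its sample
    standard deviation (ddof=1 gives the 1/(n-1) form)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```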


Input: $V$, $c_{\max}$, $I_{\max}$, and $\xi$
Output: $J_{best}$, $c_{\mathrm{opt}}$, $U_{best}$
  for $kn = 2$ to $c_{\max}$ do
    $t = 0$  /* the current iteration */
    $V^{(t)}_{kn} = (v_i)_{kn}$  /* the first $kn$ cluster centroids of $V$ */
    $U^{(t)}_{kn} = (u^{(t)}_{ik})_{n \times kn}$
    $J^{(t)}_{kn} = V_R(U^{(t)}_{kn}, V^{(t)}_{kn})$
    while $t == 0$ || ($t < I_{\max}$ && $|J^{(t)}_{kn} - J^{(t-1)}_{kn}| > \xi$) do
      Update $V^{(t)}_{kn}$
      Update $U^{(t)}_{kn}$
      Update $J^{(t)}_{kn}$
      $t = t + 1$
    $J_{\min,kn} = \min_{i=1}^{t}(J^{(i)}_{kn})$
    $idx = \arg(\min_{i=1}^{t}(J^{(i)}_{kn}))$
    $U_{\min,kn} = U^{(idx)}_{kn}$
  $J_{best} = \min_{kn=2}^{c_{\max}}(J_{\min,kn})$  /* the best value of the clustering validity index */
  $c_{\mathrm{opt}} = \arg(\min_{kn=2}^{c_{\max}}(J_{\min,kn}))$
  $U_{best} = U_{\min,c_{\mathrm{opt}}}$

Algorithm 2: SAFCM.
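Putting the pieces together, Algorithm 2 can be sketched in Python by reusing the `fcm`, `density_based_centroids`, and `v_r` sketches above (the per-iteration bookkeeping of the pseudocode is simplified here to tracking the best index value seen):

```python
def safcm(X, p=3.0, I_max=100, xi=1e-5):
    """SAFCM sketch: for each kn <= c_max, run FCM seeded with the first
    kn density-based centroids and keep the kn minimizing V_R."""
    V0, c_max = density_based_centroids(X, p)   # Algorithm 1
    J_best, c_opt, U_best = float("inf"), None, None
    for kn in range(2, c_max + 1):
        U, V, _ = fcm(X, kn, I_max=I_max, xi=xi, V0=V0[:kn])
        J = v_r(X, U, V)                        # eq. (10)
        if J < J_best:
            J_best, c_opt, U_best = J, kn, U
    return J_best, c_opt, U_best
```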

Table 2: $c_{\max}$ estimated by several methods. $n$ is the number of objects in the dataset, $c$ is the actual number of clusters, $c_{\mathrm{ER}}$ is the number of clusters estimated by the empirical rule (i.e., $\lfloor\sqrt{n}\rfloor$), $c_{\mathrm{AP}}$ is the number of clusters obtained by the AP algorithm, and $c_{\mathrm{DBA}}$ is the number of clusters obtained by the density-based algorithm.

Dataset   n     c    c_ER   c_AP   c_DBA (p=1%)   c_DBA (p=2%)   c_DBA (p=3%)   c_DBA (p=5%)
Iris      150   3    12     9      20             14             9              6
Wine      178   3    13     15     14             7              6              3
Seeds     210   3    14     13     18             12             5              2
SubKDD    1050  6    32     24     21             17             10             7
SD1       200   20   14     19     38             22             20             —
SD2       2000  4    44     25     16             3              4              2
SD3       885   3    29     27     24             19             5              3
SD4       947   3    30     31     23             13             8              4

For the categorical attributes of SubKDD, simple matching is used as the dissimilarity measure; that is, 0 for identical values and 1 for different values.
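For example, a dissimilarity for mixed records like those in SubKDD might combine squared Euclidean distance on numeric attributes with simple matching on categorical ones (a sketch; `cat_idx` is an assumed set of categorical column indices, not from the paper):

```python
def mixed_d2(a, b, cat_idx):
    """Squared dissimilarity: (x - y)^2 on numeric attributes,
    simple matching (0 if equal, else 1) on categorical ones."""
    d2 = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in cat_idx:
            d2 += 0.0 if x == y else 1.0
        else:
            d2 += (x - y) ** 2
    return d2
```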

5.1. Experiment on $c_{\max}$ Simulation. In [37], it was proposed that the number of clusters generated by the AP algorithm could be selected as the maximum number of clusters. So this paper estimates the value of $c_{\max}$ by the empirical rule, the AP algorithm, and the density-based algorithm. The specific experimental results are shown in Table 2. In the AP algorithm, let $p = p_m$; for the density-based algorithm, $d_c$ is selected as the distance of the first 1%, 2%, 3%, and 5%, respectively.

The experimental results show that, for a dataset whose true number of clusters is greater than $\lfloor\sqrt{n}\rfloor$, it is obviously incorrect to let $c_{\max} = \lfloor\sqrt{n}\rfloor$; in other words, the empirical rule is invalid. The number of clusters finally obtained by the AP algorithm is close to the actual number of clusters. For a dataset whose true number of clusters is smaller than $\lfloor\sqrt{n}\rfloor$, the number of clusters generated by the AP algorithm can be used as an appropriate $c_{\max}$, but sometimes the value is greater than the value estimated by the empirical rule, which enlarges the search range of the optimal number of clusters. If $c_{\max}$ is estimated by the proposed density-based algorithm, the results are appropriate in most cases. The method is invalid only on SD1, when the cutoff distance is the distance of the first 5%. When the cutoff distance is selected as the distance of the first 3%, the $c_{\max}$ generated by the proposed algorithm is much smaller than $\lfloor\sqrt{n}\rfloor$; it is closer to the true number of clusters and greatly narrows the search range of the optimal number of clusters. Therefore, the cutoff distance is selected as the distance of the first 3% in the later experiments.

5.2. Experiment on the Influence of the Initial Cluster Centroids on the Convergence of FCM. To show that the initial cluster centroids obtained by the density-based algorithm can quicken the convergence of the clustering algorithm, the traditional FCM is adopted for verification. The number of clusters is assigned the true value, with the convergence threshold being $10^{-5}$. Because the random selection of initial cluster centroids has a large influence on FCM, the experiment on each dataset


[Figure 3: four scatter plots, (a) SD1, (b) SD2, (c) SD3, (d) SD4.]

Figure 3: The four synthetic datasets.

Table 3: Comparison of the iterations of the FCM algorithm. Method 1 uses random initial cluster centroids; Method 2 uses the cluster centroids obtained by the density-based algorithm.

Dataset   Method 1   Method 2
Iris      21         16
Wine      27         18
Seeds     19         16
SubKDD    31         23
SD1       38         14
SD2       18         12
SD3       30         22
SD4       26         21

is repeated 50 times, and the rounded means of the algorithm's iteration counts are compared. The specific experimental results are shown in Table 3.

As shown in Table 3, the initial cluster centroids obtained by the density-based algorithm can effectively reduce the iterations of FCM. In particular, on SD1 the iteration count of FCM is far smaller than that with randomly selected initial cluster centroids, whose minimum iteration count is 24 and whose maximum reaches 63, with unstable clustering results. Therefore, the proposed method can not only effectively quicken the convergence of FCM but also obtain a stable clustering result.

5.3. Experiment on Clustering Accuracy Based on the Clustering Validity Index. For the 3 UCI datasets and SubKDD, the clustering accuracy $r$ is adopted to measure the clustering effect, defined as

$$r = \frac{\sum_{i=1}^{k} b_i}{n}. \tag{13}$$

Here, $b_i$ is the number of objects that co-occur in the $i$th cluster and the $i$th real cluster, and $n$ is the number of objects in the dataset. According to this measure, the higher the clustering accuracy is, the better the clustering result of FCM is; when $r = 1$, the clustering result of FCM is totally accurate.
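Assuming the cluster labels have already been aligned with the true classes, eq. (13) reduces to simple counting (a sketch; `pred` and `truth` are assumed NumPy label arrays):

```python
import numpy as np

def clustering_accuracy(pred, truth, k):
    """Eq. (13): r = (sum_i b_i) / n, with b_i the number of objects
    placed in cluster i that also belong to the i-th real cluster."""
    b = sum(int(np.sum((pred == i) & (truth == i))) for i in range(k))
    return b / pred.size
```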

In the experiments, the true number of clusters is assigned to each dataset, and the initial cluster centroids are obtained by the density-based algorithm. The experimental results for clustering accuracy are shown in Table 4.


[Figure 4: four scatter plots of clustering results, (a) SD1, (b) SD2, (c) SD3, (d) SD4.]

Figure 4: Clustering results on the four synthetic datasets.

Table 4: Clustering accuracy.

Dataset               Iris     Wine     Seeds    SubKDD
Clustering accuracy   84.00%   96.63%   91.90%   94.35%

Apparently, the clustering accuracies on the datasets Wine, Seeds, and SubKDD are high, while that on the dataset Iris is relatively low, for the reason that two of its clusters are nonlinearly separable.

The clustering results of the proposed clustering validity index on the 4 synthetic datasets are shown in Figure 4, which depicts good partitions on the datasets SD1 and SD2 and slightly poorer partitions on SD3 and SD4, because of the complexity of their structures or their nonconvexity. When the number of clusters is set to 4, SD3 is divided into 4 groups; that is, the right and the left each have 2 groups.

5.4. Experiment on the Optimal Number of Clusters. At last, the Xie-Beni index, the Bensaid index, the Kwon index, and the proposed index are respectively adopted in runs of SAFCM so as to determine the optimal number of clusters. The results

Table 5: Optimal number of clusters estimated by several clustering validity indices.

Dataset   V_XB   V_B   V_K   V_R
Iris      2      9     2     2
Wine      3      6     3     3
Seeds     2      5     2     2
SubKDD    10     10    9     4
SD1       20     20    20    20
SD2       4      4     4     4
SD3       5      3     5     4
SD4       2      8     2     3

are shown in Table 5, where $V_{XB}$, $V_B$, $V_K$, and $V_R$ represent the Xie-Beni index, the Bensaid index, the Kwon index, and the proposed index, respectively.

The table shows that, for the synthetic datasets SD1 and SD2 with simple structure, all of these indices obtain the optimal number of clusters. For the 3 UCI datasets, SubKDD, and SD4, the Bensaid index cannot obtain an accurate number of clusters, SD3 excepted. For the dataset Wine, both the Xie-Beni index and


Table 6: The values of the clustering validity indices on Iris.

c   V_XB       V_B        V_K         V_R
2   0.114234   0.223604   17.384973   0.473604
3   0.223973   0.124598   34.572146   0.551565
4   0.316742   0.099103   49.279488   0.615436
5   0.560109   0.089108   87.540968   0.676350
6   0.574475   0.072201   90.563379   0.691340
7   0.400071   0.067005   63.328679   0.735311
8   0.275682   0.036283   45.736972   0.614055
9   0.250971   0.027735   42.868449   0.584244

Table 7: The values of the clustering validity indices on Wine.

c   V_XB       V_B        V_K         V_R
2   0.663406   1.328291   118.33902   1.578291
3   0.468902   0.513071   83.939058   1.346421
4   —          0.473254   —           1.735791
5   —          0.373668   —           1.846686
6   —          0.256698   —           1.683222

Table 8: The values of the clustering validity indices on Seeds.

c   V_XB       V_B        V_K         V_R
2   0.147247   0.293609   31.170349   0.543609
3   0.212127   0.150899   45.326216   0.599001
4   0.243483   0.127720   52.215334   0.697943
5   0.348842   0.085477   75.493654   0.701153

the Kwon index obtain the accurate number of clusters, while for the datasets Iris, Seeds, SubKDD, SD3, and SD4 they only obtain results approximate to the true number of clusters. The proposed index obtains the right result on the datasets Wine and SD4, and a better result than the Xie-Beni and Kwon indices on the datasets SubKDD and SD3, while it obtains the same results as those two indices on the datasets Iris and Seeds. There are two reasons for this. First, the importance of each property of the dataset is not considered in the clustering process but assumed to be equal, thereby affecting the experimental result. Second, SAFCM has a weak capacity to deal with overlapping clusters or nonconvex datasets. These are the authors' subsequent research topics.

Tables 6–13 list the detailed results of the 4 indices on the experimental datasets in each iteration of SAFCM. For the dataset Wine, when the number of clusters is 4, 5, or 6, the result based on the Xie-Beni index or the Kwon index does not converge; the proposed index, however, achieves the desired results.

6. Conclusions

FCM is widely used in many fields, but it needs the number of clusters to be preset and is greatly influenced by the initial cluster centroids. This paper studies a self-adaptive method for determining the number of clusters using the FCM algorithm. In this method, a density-based algorithm is put

Table 9: The values of the clustering validity indices on SubKDD.

c    V_XB       V_B        V_K          V_R
2    0.646989   1.324434   550.166676   1.574431
3    0.260755   0.378775   222.020838   1.090289
4    0.133843   0.062126   119.544560   0.511238
5    0.234402   0.052499   202.641204   0.537852
6    0.180728   0.054938   156.812271   0.583800
7    0.134636   0.047514   119.029265   0.619720
8    0.104511   0.032849   91.985274    0.690873
9    0.129721   0.027639   74.012847    0.562636
10   0.06822    0.027025   91.352856    0.528528

Table 10: The values of the clustering validity indices on SD1.

c    V_XB       V_B        V_K         V_R
2    0.221693   0.443390   44.592968   0.693390
3    0.206035   0.198853   40.264251   0.726245
4    0.127731   0.093653   26.220200   0.655550
5    0.130781   0.069848   27.154867   0.651465
6    0.144894   0.050067   22.922121   0.639325
7    0.136562   0.040275   29.126152   0.636258
8    0.112480   0.032874   24.323625   0.627442
9    0.115090   0.026833   24.242580   0.624936
10   0.141415   0.022611   28.574579   0.616701
11   0.126680   0.019256   28.821707   0.611524
12   0.103178   0.016634   23.931865   0.605990
13   0.110355   0.013253   26.517065   0.588246
14   0.095513   0.011083   23.635022   0.576808
15   0.075928   0.009817   19.302095   0.562289
16   0.066025   0.008824   17.236138   0.557990
17   0.054314   0.007248   14.995284   0.544341
18   0.045398   0.006090   13.208810   0.534882
19   0.039492   0.005365   11.977437   0.527131
20   0.034598   0.004917   10.835952   0.517481

Table 11: The values of the clustering validity indices on SD2.

c   V_XB       V_B        V_K         V_R
2   0.066286   0.132572   13.281503   0.382572
3   0.068242   0.063751   13.752535   0.394200
4   0.060916   0.028149   12.309967   0.378337

Table 12: The values of the clustering validity indices on SD3.

c   V_XB       V_B        V_K          V_R
2   0.148379   0.300899   131.557269   0.570876
3   0.195663   0.142149   173.900551   0.599680
4   0.127512   0.142150   113.748947   0.557198
5   0.126561   0.738070   113.262975   0.589535

forward first, which can estimate the maximum number of clusters so as to reduce the search range of the optimal number, and which is especially fit for datasets on which the empirical rule is inoperative. Besides, it can generate high-quality


Table 13: The values of the clustering validity indices on SD4.

c   V_XB       V_B        V_K          V_R
2   0.104434   0.208832   99.148271    0.473748
3   0.170326   0.142561   162.044450   0.458832
4   0.221884   0.081007   211.692699   0.583529
5   0.156253   0.053094   157.683921   0.603211
6   0.123191   0.041799   118.279116   0.575396
7   0.165465   0.032411   107.210082   0.592625
8   0.145164   0.028698   139.310969   0.606049

initial cluster centroids, so that the clustering result is stable and the convergence of FCM is quick. Then, a new fuzzy clustering validity index is put forward based on fuzzy compactness and separation, so that the clustering result is closer to the global optimum; the index remains robust and interpretable even when the number of clusters tends to the number of objects in the dataset. Finally, a self-adaptive FCM algorithm is proposed to determine the optimal number of clusters, run in an iterative trial-and-error process.

The contributions are validated by the experimental results. However, in most cases each property plays a different role in the clustering process; in other words, the weights of the properties are not the same. This issue will be focused on in the authors' future work.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61373148 and 61502151), the Shandong Province Natural Science Foundation (ZR2014FL010), the Science Foundation of the Ministry of Education of China (14YJC860042), the Shandong Province Outstanding Young Scientist Award Fund (BS2013DX033), Shandong Province Social Sciences Projects (15CXWJ13 and 16CFXJ05), and the Project of the Shandong Province Higher Educational Science and Technology Program (no. J15LN02).

References

[1] S. Sambasivam and N. Theodosopoulos, "Advanced data clustering methods of mining web documents," Issues in Informing Science and Information Technology, vol. 3, pp. 563–579, 2006.

[2] J. Wang, S.-T. Wang, and Z.-H. Deng, "Survey on challenges in clustering analysis research," Control and Decision, vol. 3, 2012.

[3] C.-W. Bong and M. Rajeswari, "Multi-objective nature-inspired clustering and classification techniques for image segmentation," Applied Soft Computing Journal, vol. 11, no. 4, pp. 3271–3282, 2011.

[4] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[5] L. A. Zadeh, "Fuzzy sets," Information and Computation, vol. 8, no. 3, pp. 338–353, 1965.

[6] J. C. Bezdek, "Numerical taxonomy with fuzzy sets," Journal of Mathematical Biology, vol. 1, no. 1, pp. 57–71, 1974.

[7] J. C. Bezdek, "Cluster validity with fuzzy sets," Journal of Cybernetics, vol. 3, no. 3, pp. 58–73, 1974.

[8] M. P. Windham, "Cluster validity for fuzzy clustering algorithms," Fuzzy Sets and Systems, vol. 5, no. 2, pp. 177–185, 1981.

[9] M. Lee, "Fuzzy cluster validity index based on object proximities defined over fuzzy partition matrices," in Proceedings of the IEEE International Conference on Fuzzy Systems and IEEE World Congress on Computational Intelligence (FUZZ-IEEE '08), pp. 336–340, June 2008.

[10] X.-B. Zhang and L. Jiang, "A new validity index of fuzzy c-means clustering," in Proceedings of the IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '09), vol. 2, pp. 218–221, Hangzhou, China, August 2009.

[11] I. Saha, U. Maulik, and S. Bandyopadhyay, "A new differential evolution based fuzzy clustering for automatic cluster evolution," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 706–711, IEEE, Patiala, India, March 2009.

[12] S. Yue, J.-S. Wang, T. Wu, and H. Wang, "A new separation measure for improving the effectiveness of validity indices," Information Sciences, vol. 180, no. 5, pp. 748–764, 2010.

[13] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

[14] A. M. Bensaid, L. O. Hall, J. C. Bezdek, et al., "Validity-guided (re)clustering with applications to image segmentation," IEEE Transactions on Fuzzy Systems, vol. 4, no. 2, pp. 112–123, 1996.

[15] S. H. Kwon, "Cluster validity index for fuzzy clustering," Electronics Letters, vol. 34, no. 22, pp. 2176–2177, 1998.

[16] H. Sun, S. Wang, and Q. Jiang, "FCM-based model selection algorithms for determining the number of clusters," Pattern Recognition, vol. 37, no. 10, pp. 2027–2037, 2004.

[17] Y. Li and F. Yu, "A new validity function for fuzzy clustering," in Proceedings of the International Conference on Computational Intelligence and Natural Computing (CINC '09), vol. 1, pp. 462–465, IEEE, June 2009.

[18] S. Saha and S. Bandyopadhyay, "A new cluster validity index based on fuzzy granulation-degranulation criteria," in Proceedings of the IEEE International Conference on Advanced Computing and Communications (ADCOM '07), pp. 353–358, IEEE, Guwahati, India, 2007.

[19] M. Zhang, W. Zhang, H. Sicotte, and P. Yang, "A new validity measure for a correlation-based fuzzy c-means clustering algorithm," in Proceedings of the 31st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 3865–3868, Minneapolis, Minn, USA, September 2009.

[20] Y.-I. Kim, D.-W. Kim, D. Lee, and K. H. Lee, "A cluster validation index for GK cluster analysis based on relative degree of sharing," Information Sciences, vol. 168, no. 1–4, pp. 225–242, 2004.

[21] B. Rezaee, "A cluster validity index for fuzzy clustering," Fuzzy Sets and Systems, vol. 161, no. 23, pp. 3014–3025, 2010.

[22] D. Zhang, M. Ji, J. Yang, Y. Zhang, and F. Xie, "A novel cluster validity index for fuzzy clustering based on bipartite modularity," Fuzzy Sets and Systems, vol. 253, pp. 122–137, 2014.

[23] J. Yu and Q. Cheng, "Search range of the optimal cluster number in fuzzy clustering," Science in China (Series E), vol. 32, no. 2, pp. 274–280, 2002.


[24] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[25] D. Pelleg, A. W. Moore, et al., "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), vol. 1, pp. 727–734, 2000.

[26] H. Frigui and R. Krishnapuram, "A robust competitive clustering algorithm with applications in computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 450–465, 1999.

[27] M.-S. Yang and K.-L. Wu, "A similarity-based robust clustering method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 4, pp. 434–448, 2004.

[28] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.

[29] L. Yujian, "A clustering algorithm based on maximal θ-distant subtrees," Pattern Recognition, vol. 40, no. 5, pp. 1425–1431, 2007.

[30] Y. Shihong, H. Ti, and W. Penglong, "Matrix eigenvalue analysis-based clustering validity index," Journal of Tianjin University (Science and Technology), vol. 8, article 6, 2014.

[31] A. José-García and W. Gómez-Flores, "Automatic clustering using nature-inspired metaheuristics: a survey," Applied Soft Computing Journal, vol. 41, pp. 192–213, 2016.

[32] E. H. Ruspini, "New experimental results in fuzzy clustering," Information Sciences, vol. 6, pp. 273–284, 1973.

[33] J. C. Dunn, "Well-separated clusters and optimal fuzzy partitions," Journal of Cybernetics, vol. 4, no. 1, pp. 95–104, 1974.

[34] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Springer Science & Business Media, 2013.

[35] R. E. Hammah and J. H. Curran, "Fuzzy cluster algorithm for the automatic identification of joint sets," International Journal of Rock Mechanics and Mining Sciences, vol. 35, no. 7, pp. 889–905, 1998.

[36] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370–379, 1995.

[37] S. B. Zhou, Z. Y. Xu, and X. Q. Tang, "Method for determining optimal number of clusters based on affinity propagation clustering," Control and Decision, vol. 26, no. 8, pp. 1147–1152, 2011.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 3: Research Article A Self-Adaptive Fuzzy -Means Algorithm ...downloads.hindawi.com/journals/cin/2016/2647389.pdf · A Self-Adaptive Fuzzy -Means Algorithm for Determining the Optimal

Computational Intelligence and Neuroscience 3

index can be applicable to all datasets The research is andwill still be carried on urgently

22 Heuristic Method Some new clustering algorithms havebeen proposed in succession recently The main idea is touse some criteria to guide the clustering process with thenumber of clusters being adjusted In this way while theclustering is completed the appropriate number of clusterscan be obtained aswell For example119883-means algorithm [25]based on the split hierarchical clustering is representativeContrary to the aforesaid process RCA [26] determined theactual number of clusters by a process of competitive agglom-eration Combining the single-point iterative technologywithhierarchical clustering a similarity-based clustering method(SCM) [27] was proposed A mercer kernel-based clustering[28] estimated the number of clusters by the eigenvectors ofa kernel matrix A clustering algorithm based on maximal 120579-distant subtrees [29] detected any number of well-separatedclusters with any shapes In addition Frey and Dueck putforward an affinity propagation clustering algorithm (AP)[24] which generated high-quality cluster centroids via themessage passing between objects to determine the optimalnumber of clusters of large-scale data Shihong et al proposeda Gerschgorin disk estimation-based criterion to estimatethe true number of clusters [30] Jose-Garcıa and Gomez-Flores presented an up-to-date review of all major nature-inspiredmetaheuristic algorithms used thus far for automaticclustering determining the best estimate of the number ofclusters [31] All these have widened the thoughts for relevantresearches

3 Traditional FCM for Determiningthe Optimal Number of Clusters

Since Ruspini first introduced the theory of fuzzy sets intocluster analysis in 1973 [32] different fuzzy clustering algo-rithms have been widely discussed developed and appliedin various areas FCM [33] is one of the most commonlyused algorithms FCM was first presented by Dunn in 1974Subsequently Bezdek introduced weighted index to the fuzzydegree of membership [34] further developing FCM FCMdivides the dataset 119883 into 119888 fuzzy clusters This algorithmholds that each object belongs to a certain cluster with adifferent degree ofmembership that is a cluster is consideredas a fuzzy subset on the dataset

Assume that119883 = 1199091 1199092 119909119899 is a finite dataset where119909119894 = (1199091198941 1199091198942 119909119894119897) is an l-dimensional object and 119909119894119895is the 119895th property of the 119894th object 119862 = 1198621 1198622 119862119888denotes 119888 cluster(s) 119881 = V1 V2 V119888 represents 119888 l-dimensional cluster centroid(s) where V119894 = (V1198941 V1198942 V119894119897)119880 = (119906119894119896)(119899times119888) is a fuzzy partition matrix and 119906119894119896 is the degreeof membership of the 119894th object in the 119896th cluster wheresum119888119896=1 119906119894119896 = 1 forall119894 = 1 119899 The objective function is thequadratic sum of weighed distances from the samples to thecluster centroid in each cluster that is

119869119898 (119880 119881) = 119888sum119896=1

119899sum119894=1

1199061198981198941198961198892119894119896 (1)

where 119889119894119896 = 119909119894 minus V119896 shows the Euclidean distancebetween the 119894th object and the 119896th cluster centroid 119898 (119898 isin[1infin)) is a fuzziness index controlling the fuzziness ofthe memberships As the value of 119898 becomes progressivelyhigher the resulting memberships become fuzzier [35] Paland Bezdek advised that119898 should be between 15 and 25 andusually let 119898 = 2 if without any special requirement [36]

According to the clustering criterion appropriate fuzzypartition matrix 119880 and cluster centroid 119881 are obtained tominimize the objective function 119869119898 Based on the Lagrangemultiplier method 119880 and 119881 are respectively calculated bythe formulas below

119906119894119896 = 1sum119888119895=1 (119889119894119896119889119895119896)2(119898minus1) (2)

V119896 = sum119899119894=1 119906119898119894119896119909119894sum119899119894=1 119906119898119894119896 (3)

FCMalgorithm is carried out through an iterative processof minimizing the objective function 119869119898 with the update of119880 and 119881 The specific steps are as follows

Step 1 Assign the initial value of the number of clusters 119888fuzziness index 119898 maximum iterations 119868max and threshold120585Step 2 Initialize the fuzzy partition119880(0) randomly accordingto the constraints of the degree of membership

Step 3 At the 119905-step calculate 119888 cluster centroids119881(119905) accord-ing to (3)

Step 4 Calculate the objective function 119869(119905)119898 according to (1)If |119869(119905)119898 minus 119869(119905minus1)119898 | lt 120585 or 119905 gt 119868max then stop otherwise continueto Step 5

Step 5 Calculate119880(119905+1) according to (2) and return to Step 3

At last each object can be arranged into one cluster inaccordance with the principle of the maximum degree ofmembership The advantages of FCM may be summarizedas simple algorithm quick convergence and easiness to beextended Its disadvantages lie in the selection of the initialcluster centroids the sensitivity to noise and outliers andthe setting of the number of clusters which have a greatimpact on the clustering result As the random selectionof the initial cluster centroids cannot ensure the fact thatFCMconverges to an optimal solution different initial clustercentroids are used for multiple running of FCM otherwisethey are determined by using of other fast algorithms

The traditional method to determine the optimal numberof clusters of FCM is to set the search range of the number ofclusters run FCM to generate clustering results of differentnumber of clusters select an appropriate clustering validityindex to evaluate clustering results and finally obtain theoptimal number of clusters according to the evaluation resultThe method is composed of the following steps

4 Computational Intelligence and Neuroscience

Step 1 Input the search range [119888min 119888max] generally 119888min = 2and 119888max = lfloorradic119899rfloorStep 2 For each integer 119896119899 isin [119888min 119888max]Step 21 Run FCM

Step 22 Calculate clustering validity index

Step 3 Compare all values of clustering validity index 119896119899corresponding to the maximum or minimum value is theoptimal number of clusters 119888optStep 4 Output 119888opt the optimal value of clustering validityindex and clustering result

4 The Proposed Methods

41 Density-Based Algorithm Considering the large influ-ence of the randomly selected initial cluster centroid on theclustering result a density-based algorithm is proposed toselect the initial cluster centroids and the maximum numberof clusters can be estimated at the same time Some relatedterms will be defined at first

Definition 1 (local density) The local density 120588119894 of object 119909119894 isdefined as

120588119894 = 119899sum119895=1

119890minus11988921198941198951198891198882 (4)

where 119889119894119895 is the distance between two objects 119909119894 and 119909119895 and119889119888 is a cutoff distance The recommended approach is to sortthe distance between any two objects in descending order andthen assign 119889119888 as the value corresponding to the first 119901 ofthe sorted distance (appropriately 119901 isin [2 5]) It shows thatthe more objects with a distance from 119909119894 less than 119889119888 thereare the bigger the value of 120588119894 is The cutoff distance finallycan decide the number of initial cluster centroids namely themaximumnumber of clustersThe density-based algorithm isrobust with respect to the choice of 119889119888Definition 2 (core point) Assume that there is an object 119909119894 isin119883 if 119909119894 notin 119862119896 and 120588119894 = max119909119895notin119862119896(120588119895) 119896 = 1 119888 119895 = 1 119899then 119909119894 is a core pointDefinition 3 (directly density-reachable) Assume that thereare two objects 119909119894 119909119895 isin 119883 if their distance 119889119894119895 lt 119889119888 then 119909119895is directly density-reachable to 119909119894 and vice versa

Definition 4 (density-reachable) Assume that 119883 is a datasetand objects 119909119904 119909119904+1 119909119905minus1 119909119905 belong to it If for each 119894 isin[119904 119905 minus 1] 119909119894+1 is directly density-reachable to 119909119894 then 119909119904 isdensity-reachable to 119909119905Definition 5 (neighbor) Neighbors of an object 119909119894 are thosewho are directly density-reachable or density-reachable to itdenoted as Neighbor (119909119894)

D

E

dc

dc

dc

B

H

F

C

A

Figure 1 An example Point A is of the highest local density If Adoes not belong to any cluster then A is a core point Points B C Dand E are directly density-reachable to point A Point F is density-reachable to point A Point H is a border point

Definition 6 (border point) An object is called a border pointif it has no neighbors

The example is shown in Figure 1The selection principle of initial cluster centroids of

density-based algorithm is that usually a cluster centroid isan object with higher local density surrounded by neighborswith lower local density than it and has a relatively largedistance fromother cluster centroids [4] Density-based algo-rithm can automatically select the initial cluster centroids anddetermine the maximum number of clusters 119888max accordingto local density The pseudocode is shown in Algorithm 1These cluster centroids obtained are sorted in descendingorder according to local density Figure 2 demonstrates theprocess of density-based algorithm

4.2. Fuzzy Clustering Validity Index. Firstly, based on the geometric structure of the dataset, Xie and Beni put forward the Xie-Beni fuzzy clustering validity index [13], which tries to find a balance point between fuzzy compactness and separation so as to acquire the optimal clustering result. The index is defined as

$$V_{\mathrm{XB}}(U, V) = \frac{(1/n)\sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m} \left\| v_i - x_j \right\|^2}{\min_{i \neq j} \left\| v_i - v_j \right\|^2}. \qquad (5)$$

In this formula, the numerator is the average distance from the objects to the centroids, used to measure compactness, and the denominator is the minimum distance between any two centroids, measuring separation.
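For concreteness, the following is a minimal NumPy sketch of equation (5). The function name xie_beni and the array conventions (the membership matrix U stored as c rows by n columns) are illustrative assumptions, not code from the paper.

import numpy as np

def xie_beni(X, V, U, m=2.0):
    """Xie-Beni index, equation (5).

    X: (n, d) data matrix; V: (c, d) centroids;
    U: (c, n) fuzzy membership matrix; m: fuzzifier.
    """
    n = X.shape[0]
    # Squared distances ||v_i - x_j||^2, shape (c, n)
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    compactness = (U ** m * d2).sum() / n
    # Minimum squared distance between distinct centroids
    vd2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(vd2, np.inf)
    return compactness / vd2.min()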

However, Bensaid et al. found that the size of each cluster had a large influence on the Xie-Beni index and put forward a new index [14], which is insensitive to the number of objects in each cluster. The Bensaid index is defined as

$$V_B(U, V) = \sum_{k=1}^{c} \frac{\sum_{i=1}^{n} u_{ik}^{m} \left\| x_i - v_k \right\|^2}{n_k \sum_{j=1}^{c} \left\| v_j - v_k \right\|^2}, \qquad (6)$$


Input: dataset and $p$
Output: $V$ and $c_{\max}$
  $D = (d_{ij})_{n \times n}$
  $d_c$ = cutoffdis($p$)   /* calculate the cutoff distance $d_c$ */
  $\rho = (\rho_i)_n$
  $idx$ = arg(sort($\rho$, descent))   /* object indices sorted by descending $\rho$ */
  $k = 1$
  $cl = (-1)_{1 \times n}$   /* the cluster number each object belongs to, initialized to $-1$ */
  for $i = 1$ to $n - 1$ do
    if $cl_{idx_i} = -1$ && $|\mathrm{Neighbor}(x_{idx_i})| > 1$ then   /* $x_{idx_i}$ does not belong to any cluster and the number of its neighbors is greater than 1 */
      $v_k = x_{idx_i}$
      for $j \in \mathrm{Neighbor}(x_{idx_i})$ do
        $cl_j = k$
      $k = k + 1$
  $c_{\max} = k$

Algorithm 1: Density-based algorithm.
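Below is a minimal, self-contained Python sketch of Algorithm 1 under simplifying assumptions: Euclidean distances, $d_c$ taken at the first $p\%$ of the pairwise distances sorted in descending order, and neighbors approximated as all unassigned objects within $d_c$ of the candidate core point (only direct density-reachability, Definition 3, is used). All names are illustrative.

import numpy as np

def density_based_init(X, p=3.0):
    """Sketch of Algorithm 1: pick initial centroids and c_max.

    X: (n, d) data matrix; p: percentage used to pick the cutoff
    distance d_c from the descending-sorted pairwise distances.
    Returns (centroids sorted by descending local density, c_max).
    """
    n = X.shape[0]
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Cutoff distance: value at the first p% of descending distances
    pair = np.sort(D[np.triu_indices(n, k=1)])[::-1]
    dc = pair[int(len(pair) * p / 100.0)]
    # Local density, equation (4)
    rho = np.exp(-(D / dc) ** 2).sum(axis=1)
    order = np.argsort(-rho)          # objects by descending density
    cl = -np.ones(n, dtype=int)       # cluster label, -1 = unassigned
    centroids, k = [], 0
    for i in order:
        if cl[i] != -1:
            continue
        neigh = np.where((D[i] < dc) & (cl == -1))[0]
        if len(neigh) > 1:            # core point with at least one neighbor
            centroids.append(X[i])
            cl[neigh] = k
            k += 1
    return np.array(centroids), k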

where $n_k$ is the fuzzy cardinality of the $k$th cluster, defined as

$$n_k = \sum_{i=1}^{n} u_{ik}. \qquad (7)$$

$\sum_{i=1}^{n} u_{ik}^{m} \|x_i - v_k\|^2$ shows the variation of the $k$th fuzzy cluster. The compactness is then computed as

$$\frac{1}{n_k} \sum_{i=1}^{n} u_{ik}^{m} \left\| x_i - v_k \right\|^2. \qquad (8)$$

$\sum_{j=1}^{c} \|v_j - v_k\|^2$ denotes the separation of the $k$th fuzzy cluster, defined as the sum of the distances from its cluster centroid to the centroids of the other $c - 1$ clusters.

Because

$$\lim_{c \to n} \left\| x_i - v_k \right\|^2 = 0, \qquad (9)$$

this index suffers from the same problem as the Xie-Beni index: when $c \to n$, the index value monotonically decreases toward 0 and loses its robustness and judgment function for determining the optimal number of clusters. Thus, this paper improves the Bensaid index and proposes a new index $V_R$:

$$V_R(U, V) = \sum_{k=1}^{c} \frac{(1/n_k)\sum_{i=1}^{n} u_{ik}^{m} \left\| x_i - v_k \right\|^2 + (1/c)\left\| v_k - \bar{v} \right\|^2}{(1/(c-1))\sum_{j=1}^{c} \left\| v_j - v_k \right\|^2}. \qquad (10)$$

The numerator represents the compactness of the $k$th cluster, where $n_k$ is its fuzzy cardinality. Its second term, an introduced penalty function, is the distance from the centroid of the $k$th cluster to the mean $\bar{v}$ of all cluster centroids, which eliminates the monotonically decreasing tendency as the number of clusters increases toward $n$. The denominator represents the mean distance from the $k$th cluster centroid to the other cluster centroids and measures separation. The ratio of the numerator to the denominator represents the clustering effect of the $k$th cluster, and the clustering validity index is defined as the sum of these ratios over all clusters. Obviously, the smaller the value, the better the clustering effect; the $c$ corresponding to the minimum value is the optimal number of clusters.
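As a worked illustration of equation (10), here is a minimal NumPy sketch; the function name v_r and the c-by-n membership layout are assumptions consistent with the earlier sketch, not code from the paper.

import numpy as np

def v_r(X, V, U, m=2.0):
    """Proposed index V_R, equation (10).

    X: (n, d) data; V: (c, d) centroids; U: (c, n) memberships.
    Smaller is better; the argmin over c gives the optimal cluster count.
    """
    c = V.shape[0]
    v_bar = V.mean(axis=0)                      # mean of all centroids
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # (c, n)
    n_k = U.sum(axis=1)                         # fuzzy cardinalities, eq. (7)
    compact = (U ** m * d2).sum(axis=1) / n_k   # per-cluster compactness
    penalty = ((V - v_bar) ** 2).sum(axis=1) / c
    vd2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    sep = vd2.sum(axis=1) / (c - 1)             # mean distance to other centroids
    return ((compact + penalty) / sep).sum()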

4.3. Self-Adaptive FCM. In this paper, the iterative trial-and-error process [16] is still used to determine the optimal number of clusters by the self-adaptive FCM (SAFCM). The pseudocode of the SAFCM algorithm is described in Algorithm 2.

5. Experiment and Analysis

This paper selects 8 experimental datasets: 3 datasets come from the UCI repository, respectively Iris, Wine, and Seeds; one dataset (SubKDD) is randomly selected from KDD Cup 1999, shown in Table 1; and the remaining 4 datasets are synthetic, shown in Figure 3. SubKDD includes normal records, 2 kinds of Probe attack (ipsweep and portsweep), and 3 kinds of DoS attack (neptune, smurf, and back). The first synthetic dataset (SD1) consists of 20 pieces of 2-dimensional Gaussian distribution data with 10 samples each; their covariance matrices are the second-order identity matrix $I_2$. The structural feature of this dataset is that the distance between any two clusters is large and the number of classes is greater than $\lfloor\sqrt{n}\rfloor$. The second synthetic dataset (SD2) consists of 4 pieces of 2-dimensional Gaussian distribution data; the cluster centroids are, respectively, (5,5), (10,10), (15,15), and (20,20), each cluster containing 500 samples with covariance matrix $2I_2$.


Figure 2: Demonstration of the process of the density-based algorithm; panels (a)-(d) show the initial data and iterations 1-3. (a) is the initial data distribution of a synthetic dataset consisting of two pieces of 2-dimensional Gaussian distribution data with centroids at (2, 3) and (7, 8); each class has 100 samples. In (b), the blue circle marks the highest-density core point, taken as the centroid of the first cluster, and the red plus signs mark the objects belonging to the first cluster. In (c), the red circle marks the core point taken as the centroid of the second cluster, and the blue asterisks mark the objects belonging to the second cluster. In (d), the purple circle marks the core point taken as the centroid of the third cluster, the green times signs mark the objects belonging to the third cluster, and the black dots mark the final border points, which do not belong to any cluster. According to the chosen cutoff distance, the maximum number of clusters is 3; calculated in accordance with the empirical rule, it would be 14. Therefore, the algorithm can effectively reduce the iterations of the FCM algorithm.

Table 1: The data type and distribution of SubKDD.

Attack behavior   Number of samples
normal            200
ipsweep           50
portsweep         50
neptune           200
smurf             300
back              50

The structural feature of this dataset is a short intercluster distance with a small overlap. The third synthetic dataset (SD3) has a complicated structure. The fourth synthetic dataset (SD4) is a nonconvex one.

In this paper, the numeric value $x_{ij}$ is normalized as

$$x'_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad (11)$$

where $\bar{x}_j$ and $s_j$ denote the mean of the $j$th attribute values and their standard deviation, respectively:

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad s_j = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)^2}. \qquad (12)$$
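A NumPy equivalent of equations (11)-(12), shown only for concreteness (ddof=1 matches the 1/(n-1) factor); the function name is illustrative.

import numpy as np

def zscore(X):
    """Column-wise z-score normalization, equations (11)-(12)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)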


Input: $V$, $c_{\max}$, $I_{\max}$, and $\xi$
Output: $J_{best}$, $c_{\mathrm{opt}}$, $U_{best}$
  for $kn = 2$ to $c_{\max}$ do
    $t = 0$   /* the current iteration */
    $V^{(t)}_{kn} = (v_i)_{kn}$   /* the first $kn$ cluster centroids of $V$ */
    $U^{(t)}_{kn} = (u^{(t)}_{ik})_{n \times kn}$
    $J^{(t)}_{kn} = V_R(U^{(t)}_{kn}, V^{(t)}_{kn})$
    while $t == 0$ || ($t < I_{\max}$ && $|J^{(t)}_{kn} - J^{(t-1)}_{kn}| > \xi$) do
      update $V^{(t)}_{kn}$
      update $U^{(t)}_{kn}$
      update $J^{(t)}_{kn}$
      $t = t + 1$
    $J^{\min}_{kn} = \min_{i=1..t} J^{(i)}_{kn}$
    $idx = \arg\min_{i=1..t} J^{(i)}_{kn}$
    $U^{\min}_{kn} = U^{(idx)}_{kn}$
  $J_{best} = \min_{kn=2..c_{\max}} J^{\min}_{kn}$   /* the best value of the clustering validity index */
  $c_{\mathrm{opt}} = \arg\min_{kn=2..c_{\max}} J^{\min}_{kn}$
  $U_{best} = U^{\min}_{c_{\mathrm{opt}}}$

Algorithm 2: SAFCM.
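The following Python sketch mirrors Algorithm 2 under stated assumptions: it reuses density_based_init and v_r from the earlier sketches, and the standard FCM membership and centroid updates are written out explicitly in NumPy. Parameter names I_max and xi follow Algorithm 2; everything else (function names, the fuzzifier default m = 2) is illustrative.

import numpy as np

def fcm_update(X, V, m=2.0, eps=1e-12):
    """One FCM iteration: memberships from centroids, then new centroids."""
    d2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2) + eps  # (c, n)
    inv = d2 ** (-1.0 / (m - 1.0))
    U = inv / inv.sum(axis=0, keepdims=True)   # columns sum to 1
    W = U ** m
    V_new = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, V_new

def safcm(X, V0, c_max, I_max=100, xi=1e-5, m=2.0):
    """Sketch of Algorithm 2: try kn = 2..c_max, keep the best V_R."""
    J_best, c_opt, U_best = np.inf, None, None
    for kn in range(2, c_max + 1):
        V = V0[:kn].copy()            # first kn density-based centroids
        J_prev, J_min, U_min = np.inf, np.inf, None
        for t in range(I_max):
            U, V = fcm_update(X, V, m)
            J = v_r(X, V, U, m)
            if J < J_min:             # track the minimum over iterations
                J_min, U_min = J, U
            if abs(J_prev - J) <= xi:
                break
            J_prev = J
        if J_min < J_best:
            J_best, c_opt, U_best = J_min, kn, U_min
    return J_best, c_opt, U_best

Here V0 is assumed to be the centroid array returned by density_based_init, already sorted by descending local density, so that V0[:kn] gives the first kn centroids as in Algorithm 2.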

Table 2: $c_{\max}$ estimated by several methods. $n$ is the number of objects in the dataset, $c$ is the actual number of clusters, $c_{\mathrm{ER}}$ is the number of clusters estimated by the empirical rule, that is, $\lfloor\sqrt{n}\rfloor$, $c_{\mathrm{AP}}$ is the number of clusters obtained by the AP algorithm, and $c_{\mathrm{DBA}}$ is the number of clusters obtained by the density-based algorithm.

Dataset   n     c    c_ER   c_AP   c_DBA: p=1%   p=2%   p=3%   p=5%
Iris      150   3    12     9      20            14     9      6
Wine      178   3    13     15     14            7      6      3
Seeds     210   3    14     13     18            12     5      2
SubKDD    1050  6    32     24     21            17     10     7
SD1       200   20   14     19     38            22     20     —
SD2       2000  4    44     25     16            3      4      2
SD3       885   3    29     27     24            19     5      3
SD4       947   3    30     31     23            13     8      4

For the categorical attributes of SubKDD, simple matching is used as the dissimilarity measure, that is, 0 for identical values and 1 for different values.

5.1. Experiment of $c_{\max}$ Simulation. In [37], it was proposed that the number of clusters generated by the AP algorithm could be selected as the maximum number of clusters. So this paper estimates the value of $c_{\max}$ by the empirical rule, the AP algorithm, and the density-based algorithm. The specific experimental results are shown in Table 2. Let $p = p_m$ in the AP algorithm; $d_c$ is respectively selected as the distance at the first 1%, 2%, 3%, and 5%.

Experimental results show that, for datasets whose true number of clusters is greater than $\lfloor\sqrt{n}\rfloor$, it is obviously incorrect to let $c_{\max} = \lfloor\sqrt{n}\rfloor$; in other words, the empirical rule is invalid. The number of clusters finally obtained by the AP algorithm is close to the actual number of clusters. For datasets whose true number of clusters is smaller than $\lfloor\sqrt{n}\rfloor$, the number of clusters generated by the AP algorithm can be used as an appropriate $c_{\max}$, but sometimes the value is greater than the value estimated by the empirical rule, which enlarges the search range of the optimal number of clusters. If $c_{\max}$ is estimated by the proposed density-based algorithm, the results are appropriate in most cases; the method is invalid only on SD1 when the cutoff distance is taken at the first 5%. When the cutoff distance is selected at the first 3%, the $c_{\max}$ generated by the proposed algorithm is much smaller than $\lfloor\sqrt{n}\rfloor$; it is closer to the true number of clusters and greatly narrows the search range of the optimal number of clusters. Therefore, the cutoff distance is selected as the distance at the first 3% in the later experiments.

5.2. Experiment on the Influence of the Initial Cluster Centroids on the Convergence of FCM. To show that the initial cluster centroids obtained by the density-based algorithm can quicken the convergence of the clustering algorithm, the traditional FCM is adopted for verification. The number of clusters is assigned the true value, with the convergence threshold being $10^{-5}$. Because the random selection of initial cluster centroids has a large influence on FCM, the experiment on each dataset is repeated 50 times and the rounded means of the algorithmic iteration counts are compared. The specific experimental results are shown in Table 3.


Figure 3: Four synthetic datasets: (a) SD1, (b) SD2, (c) SD3, (d) SD4.

Table 3: Comparison of iterations of the FCM algorithm. Method 1 uses random initial cluster centroids, and Method 2 uses the cluster centroids obtained by the density-based algorithm.

Dataset   Method 1   Method 2
Iris      21         16
Wine      27         18
Seeds     19         16
SubKDD    31         23
SD1       38         14
SD2       18         12
SD3       30         22
SD4       26         21


As shown in Table 3, the initial cluster centroids obtained by the density-based algorithm can effectively reduce the iterations of FCM. Particularly on SD1, the iteration count of FCM is far smaller than that with randomly selected initial cluster centroids, whose minimum iteration count is 24 and maximum reaches 63, with unstable clustering results. Therefore, the proposed method can not only effectively quicken the convergence of FCM but also obtain a stable clustering result.

5.3. Experiment of Clustering Accuracy Based on Clustering Validity Index. For the 3 UCI datasets and SubKDD, clustering accuracy $r$ is adopted to measure the clustering effect, defined as

$$r = \frac{\sum_{i=1}^{k} b_i}{n}, \qquad (13)$$

where $b_i$ is the number of objects that co-occur in the $i$th cluster and the $i$th real cluster and $n$ is the number of objects in the dataset. According to this measurement, the higher the clustering accuracy, the better the clustering result of FCM. When $r = 1$, the clustering result of FCM is totally accurate.
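A small sketch of equation (13), assuming the predicted cluster labels have already been aligned to the true class labels (e.g., by majority matching within each cluster); the function name is illustrative.

import numpy as np

def clustering_accuracy(y_true, y_pred):
    """Equation (13): fraction of objects whose aligned cluster
    label matches the true class label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return (y_true == y_pred).mean()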

In the experiments, the true number of clusters is assigned to each dataset and the initial cluster centroids are obtained by the density-based algorithm. The experimental results of clustering accuracy are shown in Table 4.


Figure 4: Clustering results of the four synthetic datasets: (a) SD1, (b) SD2, (c) SD3, (d) SD4.

Table 4: Clustering accuracy.

Dataset               Iris     Wine     Seeds    SubKDD
Clustering accuracy   84.00%   96.63%   91.90%   94.35%

Apparently, the clustering accuracies on the datasets Wine, Seeds, and SubKDD are high, while that on the dataset Iris is relatively low, for the reason that two of its clusters are nonlinearly separable.

The clustering results of the proposed clustering validity index on the 4 synthetic datasets are shown in Figure 4, which depicts good partitions on the datasets SD1 and SD2 and slightly poor partitions on SD3 and SD4 because of the complexity of their structure or their nonconvexity. When the number of clusters is set to 4, SD3 is divided into 4 groups; that means the right and the left each have 2 groups.

5.4. Experiment of the Optimal Number of Clusters. At last, the Xie-Beni index, the Bensaid index, the Kwon index, and the proposed index are respectively adopted in SAFCM so as to determine the optimal number of clusters.

Table 5: Optimal number of clusters estimated by several clustering validity indices.

Dataset   V_XB   V_B   V_K   V_R
Iris      2      9     2     2
Wine      3      6     3     3
Seeds     2      5     2     2
SubKDD    10     10    9     4
SD1       20     20    20    20
SD2       4      4     4     4
SD3       5      3     5     4
SD4       2      8     2     3

The results are shown in Table 5, where $V_{\mathrm{XB}}$, $V_B$, $V_K$, and $V_R$ represent the Xie-Beni index, the Bensaid index, the Kwon index, and the proposed index, respectively.

It shows that, for the synthetic datasets SD1 and SD2 with simple structure, all these indices obtain the optimal number of clusters. For the 3 UCI datasets, SubKDD, and SD4, the Bensaid index cannot obtain an accurate number of clusters; SD3 is the exception.


Table 6: The value of clustering validity index on Iris.

c   V_XB       V_B        V_K         V_R
2   0.114234   0.223604   17.384973   0.473604
3   0.223973   0.124598   34.572146   0.551565
4   0.316742   0.099103   49.279488   0.615436
5   0.560109   0.089108   87.540968   0.676350
6   0.574475   0.072201   90.563379   0.691340
7   0.400071   0.067005   63.328679   0.735311
8   0.275682   0.036283   45.736972   0.614055
9   0.250971   0.027735   42.868449   0.584244

Table 7: The value of clustering validity index on Wine.

c   V_XB       V_B        V_K         V_R
2   0.663406   1.328291   118.33902   1.578291
3   0.468902   0.513071   83.939058   1.346421
4   —          0.473254   —           1.735791
5   —          0.373668   —           1.846686
6   —          0.256698   —           1.683222

Table 8: The value of clustering validity index on Seeds.

c   V_XB       V_B        V_K         V_R
2   0.147247   0.293609   31.170349   0.543609
3   0.212127   0.150899   45.326216   0.599001
4   0.243483   0.127720   52.215334   0.697943
5   0.348842   0.085477   75.493654   0.701153

For the dataset Wine, both the Xie-Beni index and the Kwon index obtain the accurate number of clusters, while for the datasets Iris, Seeds, SubKDD, SD3, and SD4 they obtain only results approximate to the true number of clusters. The proposed index obtains the right result on the datasets Wine and SD4, a better result than the Xie-Beni and Kwon indices on the datasets SubKDD and SD3, and the same result as those two indices on the datasets Iris and Seeds. There are two reasons. First, the importance degree of each property of the dataset is not considered in the clustering process but assumed to be the same, thereby affecting the experimental results. Second, SAFCM has a weak capacity to deal with overlapping clusters or nonconvex datasets. These are the authors' subsequent research topics.

Tables 6-13 list the detailed results of the 4 indices on the experimental datasets in each iteration of SAFCM. For the dataset Wine, when the number of clusters is 4, 5, or 6, the result based on the Xie-Beni index or the Kwon index does not converge; however, the proposed index achieves the desired results.

6. Conclusions

FCM is widely used in many fields, but it needs the number of clusters to be preset and is greatly influenced by the initial cluster centroids. This paper studies a self-adaptive method for determining the number of clusters using the FCM algorithm.

Table 9: The value of clustering validity index on SubKDD.

c    V_XB       V_B        V_K          V_R
2    0.646989   1.324434   550.166676   1.574431
3    0.260755   0.378775   222.020838   1.090289
4    0.133843   0.062126   119.544560   0.511238
5    0.234402   0.052499   202.641204   0.537852
6    0.180728   0.054938   156.812271   0.583800
7    0.134636   0.047514   119.029265   0.619720
8    0.104511   0.032849   91.9852740   0.690873
9    0.129721   0.027639   74.0128470   0.562636
10   0.06822    0.027025   91.3528560   0.528528

Table 10: The value of clustering validity index on SD1.

c    V_XB       V_B        V_K         V_R
2    0.221693   0.443390   44.592968   0.693390
3    0.206035   0.198853   40.264251   0.726245
4    0.127731   0.093653   26.220200   0.655550
5    0.130781   0.069848   27.154867   0.651465
6    0.144894   0.050067   22.922121   0.639325
7    0.136562   0.040275   29.126152   0.636258
8    0.112480   0.032874   24.323625   0.627442
9    0.115090   0.026833   24.242580   0.624936
10   0.141415   0.022611   28.574579   0.616701
11   0.126680   0.019256   28.821707   0.611524
12   0.103178   0.016634   23.931865   0.605990
13   0.110355   0.013253   26.517065   0.588246
14   0.095513   0.011083   23.635022   0.576808
15   0.075928   0.009817   19.302095   0.562289
16   0.066025   0.008824   17.236138   0.557990
17   0.054314   0.007248   14.995284   0.544341
18   0.045398   0.006090   13.208810   0.534882
19   0.039492   0.005365   11.977437   0.527131
20   0.034598   0.004917   10.835952   0.517481

Table 11: The value of clustering validity index on SD2.

c   V_XB       V_B        V_K         V_R
2   0.066286   0.132572   132.81503   0.382572
3   0.068242   0.063751   137.52535   0.394200
4   0.060916   0.028149   123.09967   0.378337

Table 12: The value of clustering validity index on SD3.

c   V_XB       V_B        V_K          V_R
2   0.148379   0.300899   131.557269   0.570876
3   0.195663   0.142149   173.900551   0.599680
4   0.127512   0.142150   113.748947   0.557198
5   0.126561   0.738070   113.262975   0.589535

In this method, a density-based algorithm is put forward first, which can estimate the maximum number of clusters to reduce the search range of the optimal number, being especially fit for datasets on which the empirical rule is inoperative. Besides, it can generate high-quality initial cluster centroids, so that the clustering result is stable and the convergence of FCM is quick.


Table 13: The value of clustering validity index on SD4.

c   V_XB       V_B        V_K          V_R
2   0.104434   0.208832   99.1482710   0.473748
3   0.170326   0.142561   162.044450   0.458832
4   0.221884   0.081007   211.692699   0.583529
5   0.156253   0.053094   157.683921   0.603211
6   0.123191   0.041799   118.279116   0.575396
7   0.165465   0.032411   107.210082   0.592625
8   0.145164   0.028698   139.310969   0.606049

Then, a new fuzzy clustering validity index is put forward based on fuzzy compactness and separation, so that the clustering result is closer to the global optimum. The index remains robust and retains its decision function when the number of clusters tends to the number of objects in the dataset. Finally, a self-adaptive FCM algorithm is proposed to determine the optimal number of clusters, run in the iterative trial-and-error process.

The contributions are validated by experimental results. However, in most cases each property plays a different role in the clustering process; in other words, the weights of the properties are not the same. This issue will be focused on in the authors' future work.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61373148 and 61502151), Shandong Province Natural Science Foundation (ZR2014FL010), Science Foundation of Ministry of Education of China (14YJC860042), Shandong Province Outstanding Young Scientist Award Fund (BS2013DX033), Shandong Province Social Sciences Project (15CXWJ13 and 16CFXJ05), and Project of Shandong Province Higher Educational Science and Technology Program (no. J15LN02).

References

[1] S. Sambasivam and N. Theodosopoulos, "Advanced data clustering methods of mining web documents," Issues in Informing Science and Information Technology, vol. 3, pp. 563–579, 2006.
[2] J. Wang, S.-T. Wang, and Z.-H. Deng, "Survey on challenges in clustering analysis research," Control and Decision, vol. 3, 2012.
[3] C.-W. Bong and M. Rajeswari, "Multi-objective nature-inspired clustering and classification techniques for image segmentation," Applied Soft Computing Journal, vol. 11, no. 4, pp. 3271–3282, 2011.
[4] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
[5] L. A. Zadeh, "Fuzzy sets," Information and Computation, vol. 8, no. 3, pp. 338–353, 1965.
[6] J. C. Bezdek, "Numerical taxonomy with fuzzy sets," Journal of Mathematical Biology, vol. 1, no. 1, pp. 57–71, 1974.
[7] J. C. Bezdek, "Cluster validity with fuzzy sets," Journal of Cybernetics, vol. 3, no. 3, pp. 58–73, 1974.
[8] M. P. Windham, "Cluster validity for fuzzy clustering algorithms," Fuzzy Sets and Systems, vol. 5, no. 2, pp. 177–185, 1981.
[9] M. Lee, "Fuzzy cluster validity index based on object proximities defined over fuzzy partition matrices," in Proceedings of the IEEE International Conference on Fuzzy Systems and IEEE World Congress on Computational Intelligence (FUZZ-IEEE '08), pp. 336–340, June 2008.
[10] X.-B. Zhang and L. Jiang, "A new validity index of fuzzy c-means clustering," in Proceedings of the IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '09), vol. 2, pp. 218–221, Hangzhou, China, August 2009.
[11] I. Saha, U. Maulik, and S. Bandyopadhyay, "A new differential evolution based fuzzy clustering for automatic cluster evolution," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 706–711, IEEE, Patiala, India, March 2009.
[12] S. Yue, J.-S. Wang, T. Wu, and H. Wang, "A new separation measure for improving the effectiveness of validity indices," Information Sciences, vol. 180, no. 5, pp. 748–764, 2010.
[13] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.
[14] A. M. Bensaid, L. O. Hall, J. C. Bezdek, et al., "Validity-guided (re)clustering with applications to image segmentation," IEEE Transactions on Fuzzy Systems, vol. 4, no. 2, pp. 112–123, 1996.
[15] S. H. Kwon, "Cluster validity index for fuzzy clustering," Electronics Letters, vol. 34, no. 22, pp. 2176–2177, 1998.
[16] H. Sun, S. Wang, and Q. Jiang, "FCM-based model selection algorithms for determining the number of clusters," Pattern Recognition, vol. 37, no. 10, pp. 2027–2037, 2004.
[17] Y. Li and F. Yu, "A new validity function for fuzzy clustering," in Proceedings of the International Conference on Computational Intelligence and Natural Computing (CINC '09), vol. 1, pp. 462–465, IEEE, June 2009.
[18] S. Saha and S. Bandyopadhyay, "A new cluster validity index based on fuzzy granulation-degranulation criteria," in Proceedings of the IEEE International Conference on Advanced Computing and Communications (ADCOM '07), pp. 353–358, IEEE, Guwahati, India, 2007.
[19] M. Zhang, W. Zhang, H. Sicotte, and P. Yang, "A new validity measure for a correlation-based fuzzy c-means clustering algorithm," in Proceedings of the 31st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 3865–3868, Minneapolis, Minn, USA, September 2009.
[20] Y.-I. Kim, D.-W. Kim, D. Lee, and K. H. Lee, "A cluster validation index for GK cluster analysis based on relative degree of sharing," Information Sciences, vol. 168, no. 1–4, pp. 225–242, 2004.
[21] B. Rezaee, "A cluster validity index for fuzzy clustering," Fuzzy Sets and Systems, vol. 161, no. 23, pp. 3014–3025, 2010.
[22] D. Zhang, M. Ji, J. Yang, Y. Zhang, and F. Xie, "A novel cluster validity index for fuzzy clustering based on bipartite modularity," Fuzzy Sets and Systems, vol. 253, pp. 122–137, 2014.
[23] J. Yu and Q. Cheng, "Search range of the optimal cluster number in fuzzy clustering," Science in China (Series E), vol. 32, no. 2, pp. 274–280, 2002.
[24] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.
[25] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), vol. 1, pp. 727–734, 2000.
[26] H. Frigui and R. Krishnapuram, "A robust competitive clustering algorithm with applications in computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 450–465, 1999.
[27] M.-S. Yang and K.-L. Wu, "A similarity-based robust clustering method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 4, pp. 434–448, 2004.
[28] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.
[29] L. Yujian, "A clustering algorithm based on maximal θ-distant subtrees," Pattern Recognition, vol. 40, no. 5, pp. 1425–1431, 2007.
[30] Y. Shihong, H. Ti, and W. Penglong, "Matrix eigenvalue analysis-based clustering validity index," Journal of Tianjin University (Science and Technology), vol. 8, article 6, 2014.
[31] A. José-García and W. Gómez-Flores, "Automatic clustering using nature-inspired metaheuristics: a survey," Applied Soft Computing Journal, vol. 41, pp. 192–213, 2016.
[32] E. H. Ruspini, "New experimental results in fuzzy clustering," Information Sciences, vol. 6, pp. 273–284, 1973.
[33] J. C. Dunn, "Well-separated clusters and optimal fuzzy partitions," Journal of Cybernetics, vol. 4, no. 1, pp. 95–104, 1974.
[34] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Springer Science & Business Media, 2013.
[35] R. E. Hammah and J. H. Curran, "Fuzzy cluster algorithm for the automatic identification of joint sets," International Journal of Rock Mechanics and Mining Sciences, vol. 35, no. 7, pp. 889–905, 1998.
[36] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370–379, 1995.
[37] S. B. Zhou, Z. Y. Xu, and X. Q. Tang, "Method for determining optimal number of clusters based on affinity propagation clustering," Control and Decision, vol. 26, no. 8, pp. 1147–1152, 2011.



[21] B Rezaee ldquoA cluster validity index for fuzzy clusteringrdquo FuzzySets and Systems vol 161 no 23 pp 3014ndash3025 2010

[22] D Zhang M Ji J Yang Y Zhang and F Xie ldquoA novel clus-ter validity index for fuzzy clustering based on bipartite modu-larityrdquo Fuzzy Sets and Systems vol 253 pp 122ndash137 2014

[23] J Yu andQ Cheng ldquoSearch range of the optimal cluster numberin fuzzy clusteringrdquo Science in China (Series E) vol 32 no 2 pp274ndash280 2002

12 Computational Intelligence and Neuroscience

[24] B J Frey and D Dueck ldquoClustering by passing messagesbetween data pointsrdquo Science vol 315 no 5814 pp 972ndash9762007

[25] D Pelleg A W Moore et al ldquoX-means extending K-meanswith efficient estimation of the number of clustersrdquo in Proceed-ings of the 17th International Conference on Machine Learning(ICML rsquo00) vol 1 pp 727ndash734 2000

[26] H Frigui and R Krishnapuram ldquoA robust competitive clus-tering algorithm with applications in computer visionrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol21 no 5 pp 450ndash465 1999

[27] M-S Yang and K-L Wu ldquoA similarity-based robust clusteringmethodrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 26 no 4 pp 434ndash448 2004

[28] M Girolami ldquoMercer kernel-based clustering in feature spacerdquoIEEE Transactions on Neural Networks vol 13 no 3 pp 780ndash784 2002

[29] L Yujian ldquoA clustering algorithm based on maximal 120579-distantsubtreesrdquo Pattern Recognition vol 40 no 5 pp 1425ndash1431 2007

[30] Y Shihong H Ti and W Penglong ldquoMatrix eigenvalue anal-ysis-based clustering validity indexrdquo Journal of Tianjin Univer-sity (Science and Technology) vol 8 article 6 2014

[31] A Jose-Garcıa and W Gomez-Flores ldquoAutomatic clusteringusing nature-inspired metaheuristics a surveyrdquo Applied SoftComputing Journal vol 41 pp 192ndash213 2016

[32] E H Ruspini ldquoNew experimental results in fuzzy clusteringrdquoInformation Sciences vol 6 pp 273ndash284 1973

[33] J C Dunn ldquoWell-separated clusters and optimal fuzzy parti-tionsrdquo Journal of Cybernetics vol 4 no 1 pp 95ndash104 1974

[34] J C Bezdek Pattern Recognition with Fuzzy Objective FunctionAlgorithms Springer Science amp Business Media 2013

[35] R E Hammah and J H Curran ldquoFuzzy cluster algorithm forthe automatic identification of joint setsrdquo International Journalof Rock Mechanics and Mining Sciences vol 35 no 7 pp 889ndash905 1998

[36] N R Pal and J C Bezdek ldquoOn cluster validity for the fuzzy c-means modelrdquo IEEE Transactions on Fuzzy Systems vol 3 no3 pp 370ndash379 1995

[37] S B Zhou Z Y Xu and X Q Tang ldquoMethod for determiningoptimal number of clusters based on affinity propagationclusteringrdquo Control and Decision vol 26 no 8 pp 1147ndash11522011

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 5: Research Article A Self-Adaptive Fuzzy -Means Algorithm ...downloads.hindawi.com/journals/cin/2016/2647389.pdf · A Self-Adaptive Fuzzy -Means Algorithm for Determining the Optimal

Computational Intelligence and Neuroscience 5

Input: dataset and p
Output: V and c_max
  D = (d_ij)_{n×n}
  d_c = cutoffdis(p)                      /* calculate the cutoff distance d_c */
  ρ = (ρ_i)_n
  idx = arg(sort(ρ, descent))             /* the object indices corresponding to the sorted ρ */
  k = 1
  cl = (−1)_{1×n}                         /* the cluster number each object belongs to, initialized to −1 */
  for i = 1 to n − 1 do
    if cl_{idx_i} == −1 && |Neighbor(x_{idx_i})| > 1 then
      /* x_{idx_i} does not belong to any cluster and the number of its neighbors is greater than 1 */
      v_k = x_{idx_i}
      for j ∈ Neighbor(x_{idx_i}) do
        cl_j = k
      k = k + 1
  c_max = k

Algorithm 1: Density-based algorithm.
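For concreteness, a minimal Python sketch of Algorithm 1 follows, assuming Euclidean distance and reading the cutoff distance d_c as the value at the first p fraction of the sorted pairwise distances; the function name density_based_init and the handling of already-assigned neighbors are the sketch's own choices, not the paper's code.

import numpy as np

def density_based_init(X, p=0.03):
    # Pairwise Euclidean distance matrix D = (d_ij)_{n x n}
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Cutoff distance d_c: the value at the first p fraction of the
    # sorted pairwise distances (e.g., p = 0.03 for the "first 3%" distance)
    pair = np.sort(D[np.triu_indices(n, k=1)])
    dc = pair[max(0, int(p * pair.size) - 1)]
    # Local density rho_i: number of neighbors of x_i within d_c
    rho = (D < dc).sum(axis=1) - 1          # subtract the object itself
    order = np.argsort(-rho)                # objects in descending density order
    cl = np.full(n, -1)                     # cluster label of each object, -1 = none
    centroids, k = [], 0
    for i in order:
        neigh = np.where(D[i] < dc)[0]      # neighbors of x_i (includes x_i)
        if cl[i] == -1 and neigh.size > 1:  # unassigned core point with neighbors
            centroids.append(X[i])          # take x_i as an initial centroid
            cl[neigh[cl[neigh] == -1]] = k  # attach its unassigned neighbors
            k += 1
    return np.array(centroids), k           # initial centroids and c_max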

where $n_k$ is the fuzzy cardinality of the $k$th cluster, defined as

$$ n_k = \sum_{i=1}^{n} u_{ik}. \qquad (7) $$

$\sum_{i=1}^{n} u_{ik}^{m} \| x_i - v_k \|^2$ shows the variation of the $k$th fuzzy cluster, so the compactness is computed as

$$ \frac{1}{n_k} \sum_{i=1}^{n} u_{ik}^{m} \| x_i - v_k \|^2. \qquad (8) $$

$\sum_{j=1}^{c} \| v_j - v_k \|^2$ denotes the separation of the $k$th fuzzy cluster, defined as the sum of the distances from its cluster centroid to the centroids of the other $c - 1$ clusters.

Because

$$ \lim_{c \to n} \| x_i - v_k \|^2 = 0, \qquad (9) $$

this index behaves like the Xie-Beni index: when $c \to n$, its value monotonically decreases toward 0 and loses the robustness and decision function needed for determining the optimal number of clusters. Thus this paper improves the Bensaid index and proposes a new index $V_R$:

$$ V_R(U, V) = \sum_{k=1}^{c} \frac{(1/n_k) \sum_{i=1}^{n} u_{ik}^{m} \| x_i - v_k \|^2 + (1/c) \| v_k - \bar{v} \|^2}{(1/(c-1)) \sum_{j=1}^{c} \| v_j - v_k \|^2}. \qquad (10) $$

The numerator represents the compactness of the $k$th cluster, where $n_k$ is its fuzzy cardinality. Its second term, an introduced penalty function, is the distance from the centroid of the $k$th cluster to the average $\bar{v}$ of all cluster centroids; it eliminates the monotonically decreasing tendency as the number of clusters increases toward $n$. The denominator represents the mean distance from the $k$th cluster centroid to the other cluster centroids, which measures the separation. The ratio of the numerator to the denominator therefore represents the clustering effect of the $k$th cluster, and the clustering validity index is defined as the sum of this ratio over all clusters. Obviously, the smaller the value, the better the clustering effect of the dataset, and the $c$ corresponding to the minimum value is the optimal number of clusters.
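A direct reading of (10) in Python is given below for illustration; the function name v_r_index and the default fuzzifier m = 2 are the sketch's own choices.

import numpy as np

def v_r_index(X, U, V, m=2.0):
    # V_R from (10): sum over clusters of (compactness + penalty) / separation
    n, c = U.shape
    v_bar = V.mean(axis=0)                         # average of all cluster centroids
    total = 0.0
    for k in range(c):
        d2 = ((X - V[k]) ** 2).sum(axis=1)         # ||x_i - v_k||^2
        n_k = U[:, k].sum()                        # fuzzy cardinality n_k
        compact = (U[:, k] ** m * d2).sum() / n_k  # (1/n_k) sum_i u_ik^m ||x_i - v_k||^2
        penalty = ((V[k] - v_bar) ** 2).sum() / c  # (1/c) ||v_k - v_bar||^2, the penalty term
        sep = ((V - V[k]) ** 2).sum() / (c - 1)    # (1/(c-1)) sum_j ||v_j - v_k||^2
        total += (compact + penalty) / sep
    return total                                   # smaller values indicate better partitions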

4.3. Self-Adaptive FCM. In this paper, the iterative trial-and-error process [16] is still used to determine the optimal number of clusters by the self-adaptive FCM (SAFCM). The pseudocode of the SAFCM algorithm is described in Algorithm 2.

5. Experiment and Analysis

This paper selects 8 experimental datasets. Three come from the UCI repository (Iris, Wine, and Seeds); one dataset (SubKDD) is randomly selected from KDD Cup 1999 and shown in Table 1; the remaining 4 are synthetic datasets, shown in Figure 3. SubKDD includes normal records, 2 kinds of Probe attack (ipsweep and portsweep), and 3 kinds of DoS attack (neptune, smurf, and back). The first synthetic dataset (SD1) consists of 20 clusters of 2-dimensional Gaussian distribution data, each containing 10 samples; their covariance matrices are the second-order unit matrix I_2. The structural feature of this dataset is that the distance between any two clusters is large and the number of classes is greater than ⌊√n⌋. The second synthetic dataset (SD2) consists of 4 clusters of 2-dimensional Gaussian distribution data with cluster centroids respectively at (5, 5), (10, 10), (15, 15), and (20, 20), each containing 500 samples and each with covariance matrix 2I_2.

Figure 2: Demonstration of the process of the density-based algorithm. (a) Initial data; (b) iteration 1; (c) iteration 2; (d) iteration 3. Panel (a) is the initial data distribution of a synthetic dataset consisting of two pieces of 2-dimensional Gaussian distribution data with centroids at (2, 3) and (7, 8), each class containing 100 samples. In (b), the blue circle is the highest-density core point, taken as the centroid of the first cluster, and the red plus signs are the objects belonging to the first cluster. In (c), the red circle is the core point taken as the centroid of the second cluster, and the blue asterisks are the objects belonging to the second cluster. In (d), the purple circle is the core point taken as the centroid of the third cluster, the green times signs are the objects belonging to the third cluster, and the black dots are the final border points, which do not belong to any cluster. According to a certain cutoff distance, the maximum number of clusters is 3; if calculated in accordance with the empirical rule, the maximum number of clusters would be 14. Therefore, the algorithm can effectively reduce the iterations of the FCM algorithm.

Table 1: The data type and distribution of SubKDD.

Attack behavior    Number of samples
normal             200
ipsweep            50
portsweep          50
neptune            200
smurf              300
back               50

The structural feature of SD2 is short intercluster distance with a small overlap. The third synthetic dataset (SD3) has a complicated structure, and the fourth synthetic dataset (SD4) is a nonconvex one.

In this paper, the numeric value $x_{ij}$ is normalized as

$$ x'_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \qquad (11) $$

where $\bar{x}_j$ and $s_j$ denote the mean of the $j$th attribute value and its standard deviation, respectively; that is,

$$ \bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad s_j = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}. \qquad (12) $$
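In code, (11)-(12) amount to a column-wise z-score; a one-line sketch using the sample standard deviation (ddof=1, matching (12)):

import numpy as np

def z_score(X):
    # Column-wise standardization per (11)-(12)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)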


Input: V, c_max, I_max, and ξ
Output: J_best, c_opt, U_best
  for kn = 2 to c_max do
    t = 0                                   /* the current iteration */
    V^(t)_kn = (v_i)_kn                     /* the first kn cluster centroids of V */
    U^(t)_kn = (u^(t)_ik)_{n×kn}
    J^(t)_kn = V_R(U^(t)_kn, V^(t)_kn)
    while t == 0 || (t < I_max && |J^(t)_kn − J^(t−1)_kn| > ξ) do
      Update V^(t)_kn
      Update U^(t)_kn
      Update J^(t)_kn
      t = t + 1
    J_min,kn = min_{i=1..t} (J^(i)_kn)
    idx = arg min_{i=1..t} (J^(i)_kn)
    U_min,kn = U^(idx)_kn
  J_best = min_{kn=2..c_max} (J_min,kn)     /* the best value of the clustering validity index */
  c_opt = arg min_{kn=2..c_max} (J_min,kn)
  U_best = U_min,c_opt

Algorithm 2: SAFCM.
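A condensed Python sketch of the SAFCM loop follows; fcm_updates implements the standard FCM alternating updates, v_r_index is the sketch given with (10), and the per-iteration bookkeeping is simplified relative to the listing above. Names and defaults are illustrative, not the authors' code.

import numpy as np

def fcm_updates(X, V, m=2.0, eps=1e-10):
    # One standard FCM step: memberships from centroids, then centroids from memberships
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps
    w = d2 ** (-1.0 / (m - 1.0))                  # u_ik proportional to d_ik^(-2/(m-1))
    U = w / w.sum(axis=1, keepdims=True)
    Um = U ** m
    V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]  # weighted centroid update
    return U, V_new

def safcm(X, V0, c_max, i_max=100, xi=1e-5, m=2.0):
    # Trial-and-error search over kn = 2..c_max, scoring each run with V_R
    best = (np.inf, None, None)                   # (J_best, c_opt, U_best)
    for kn in range(2, c_max + 1):
        V = V0[:kn].copy()                        # first kn density-based centroids
        J_prev = np.inf
        for t in range(i_max):
            U, V = fcm_updates(X, V, m)
            J = v_r_index(X, U, V, m)
            if abs(J - J_prev) <= xi:             # convergence of the index value
                break
            J_prev = J
        if J < best[0]:
            best = (J, kn, U)
    return best                                   # minimal V_R, optimal c, memberships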

Table 2: c_max estimated by several methods. n is the number of objects in the dataset, c is the actual number of clusters, c_ER is the number of clusters estimated by the empirical rule, that is, ⌊√n⌋, c_AP is the number of clusters obtained by the AP algorithm, and c_DBA is the number of clusters obtained by the density-based algorithm.

Dataset   n     c    c_ER   c_AP   c_DBA (p=1%)   c_DBA (p=2%)   c_DBA (p=3%)   c_DBA (p=5%)
Iris      150   3    12     9      20             14             9              6
Wine      178   3    13     15     14             7              6              3
Seeds     210   3    14     13     18             12             5              2
SubKDD    1050  6    32     24     21             17             10             7
SD1       200   20   14     19     38             22             20             —
SD2       2000  4    44     25     16             3              4              2
SD3       885   3    29     27     24             19             5              3
SD4       947   3    30     31     23             13             8              4

For the categorical attributes of SubKDD, simple matching is used as the dissimilarity measure, that is, 0 for identical values and 1 for different values.
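As an illustrative sketch, a mixed numeric/categorical dissimilarity under this convention might look as follows; the split into numeric and categorical indices (cat_idx) is an assumption for illustration.

def mixed_dissimilarity(a, b, cat_idx):
    # Squared Euclidean distance on numeric fields plus simple matching
    # (0 if identical, 1 if different) on the categorical fields in cat_idx
    num_idx = [i for i in range(len(a)) if i not in cat_idx]
    d_num = sum((float(a[i]) - float(b[i])) ** 2 for i in num_idx)
    d_cat = sum(1 for i in cat_idx if a[i] != b[i])
    return d_num + d_cat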

5.1. Experiment of c_max Simulation. In [37] it was proposed that the number of clusters generated by the AP algorithm can be selected as the maximum number of clusters. So this paper estimates the value of c_max by the empirical rule, the AP algorithm, and the density-based algorithm; the specific experimental results are shown in Table 2. In the AP algorithm, let p = p_m; for the density-based algorithm, d_c is respectively selected as the first 1%, 2%, 3%, and 5% distance.

Experimental results show that, for a dataset whose true number of clusters is greater than ⌊√n⌋, it is obviously incorrect to let c_max = ⌊√n⌋; in other words, the empirical rule is invalid. The number of clusters finally obtained by the AP algorithm is close to the actual number of clusters. For a dataset whose true number of clusters is smaller than ⌊√n⌋, the number of clusters generated by the AP algorithm can be used as an appropriate c_max, but sometimes the value is greater than the value estimated by the empirical rule, which enlarges the search range of the optimal number of clusters. If c_max is estimated by the proposed density-based algorithm, the results are appropriate in most cases; the method fails only on SD1 when the cutoff distance is the first 5% distance. When the cutoff distance is selected as the first 3% distance, the c_max generated by the proposed algorithm is much smaller than ⌊√n⌋, closer to the true number of clusters, and greatly narrows the search range of the optimal number of clusters. Therefore, the cutoff distance is selected as the first 3% distance in the later experiments.

5.2. Experiment of the Influence of the Initial Cluster Centroids on the Convergence of FCM. To show that the initial cluster centroids obtained by the density-based algorithm can quicken the convergence of the clustering algorithm, the traditional FCM is adopted for verification. The number of clusters is assigned the true value, with the convergence threshold being 10^{-5}. Because the random selection of initial cluster centroids has a large influence on FCM, the experiment on each dataset is repeated 50 times and the rounded means of the algorithmic iterations are compared. The specific experimental results are shown in Table 3.

Figure 3: Four synthetic datasets. (a) SD1; (b) SD2; (c) SD3; (d) SD4.

Table 3: Comparison of iterations of the FCM algorithm. Method 1 uses random initial cluster centroids, and Method 2 uses the cluster centroids obtained by the density-based algorithm.

Dataset   Method 1   Method 2
Iris      21         16
Wine      27         18
Seeds     19         16
SubKDD    31         23
SD1       38         14
SD2       18         12
SD3       30         22
SD4       26         21


As shown in Table 3, the initial cluster centroids obtained by the density-based algorithm can effectively reduce the iterations of FCM. Particularly on SD1, the iteration count of FCM is far smaller than that with randomly selected initial cluster centroids, whose minimum iteration count is 24 and maximum reaches 63, with unstable clustering results. Therefore, the proposed method can not only effectively quicken the convergence of FCM but also obtain a stable clustering result.

5.3. Experiment of Clustering Accuracy Based on the Clustering Validity Index. For the 3 UCI datasets and SubKDD, clustering accuracy $r$ is adopted to measure the clustering effect, defined as

$$ r = \frac{\sum_{i=1}^{k} b_i}{n}, \qquad (13) $$

where $b_i$ is the number of objects that co-occur in the $i$th cluster and the $i$th real cluster, and $n$ is the number of objects in the dataset. By this measurement, the higher the clustering accuracy, the better the clustering result of FCM; when $r = 1$, the clustering result of FCM is totally accurate.
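A small sketch of (13) follows, assuming integer-coded true class labels and taking the per-cluster majority count as b_i; the paper does not spell out the cluster-to-class matching step, so that choice is this sketch's assumption.

import numpy as np

def clustering_accuracy(labels, truth):
    # r = (sum over clusters of the majority true-class count b_i) / n
    total = 0
    for k in np.unique(labels):
        members = truth[labels == k]         # true classes inside cluster k
        total += np.bincount(members).max()  # b_i: co-occurring objects
    return total / labels.size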

In the experiments, the true number of clusters is assigned to each dataset, and the initial cluster centroids are obtained by the density-based algorithm. The experimental results of clustering accuracy are shown in Table 4.

Figure 4: Clustering results of the four synthetic datasets. (a) SD1; (b) SD2; (c) SD3; (d) SD4.

Table 4: Clustering accuracy.

Dataset               Iris     Wine     Seeds    SubKDD
Clustering accuracy   84.00%   96.63%   91.90%   94.35%

Apparently, the clustering accuracies on the datasets Wine, Seeds, and SubKDD are high, while that on the dataset Iris is relatively low, for the reason that two of its clusters are nonlinearly separable.

The clustering results of the proposed clustering validity index on the 4 synthetic datasets are shown in Figure 4, which depicts good partitions on the datasets SD1 and SD2 and slightly poor partitions on SD3 and SD4 because of the complexity of their structure or their nonconvexity. When the number of clusters is set to 4, SD3 is divided into 4 groups; that is, the right and the left each have 2 groups.

5.4. Experiment of the Optimal Number of Clusters. Finally, the Xie-Beni index, the Bensaid index, the Kwon index, and the proposed index are respectively adopted in running SAFCM so as to determine the optimal number of clusters. The results are shown in Table 5, where V_XB, V_B, V_K, and V_R represent the Xie-Beni index, the Bensaid index, the Kwon index, and the proposed index, respectively.

Table 5: Optimal number of clusters estimated by several clustering validity indices.

Dataset   V_XB   V_B   V_K   V_R
Iris      2      9     2     2
Wine      3      6     3     3
Seeds     2      5     2     2
SubKDD    10     10    9     4
SD1       20     20    20    20
SD2       4      4     4     4
SD3       5      3     5     4
SD4       2      8     2     3


Table 5 shows that, for the synthetic datasets SD1 and SD2 with simple structure, all of these indices can obtain the optimal number of clusters. For the 3 UCI datasets, SubKDD, and SD4, the Bensaid index cannot obtain an accurate number of clusters, SD3 excepted.

Table 6: The value of clustering validity index on Iris.

c   V_XB       V_B        V_K         V_R
2   0.114234   0.223604   17.384973   0.473604
3   0.223973   0.124598   34.572146   0.551565
4   0.316742   0.099103   49.279488   0.615436
5   0.560109   0.089108   87.540968   0.676350
6   0.574475   0.072201   90.563379   0.691340
7   0.400071   0.067005   63.328679   0.735311
8   0.275682   0.036283   45.736972   0.614055
9   0.250971   0.027735   42.868449   0.584244

Table 7: The value of clustering validity index on Wine.

c   V_XB       V_B        V_K         V_R
2   0.663406   1.328291   118.33902   1.578291
3   0.468902   0.513071   83.939058   1.346421
4   —          0.473254   —           1.735791
5   —          0.373668   —           1.846686
6   —          0.256698   —           1.683222

Table 8: The value of clustering validity index on Seeds.

c   V_XB       V_B        V_K         V_R
2   0.147247   0.293609   31.170349   0.543609
3   0.212127   0.150899   45.326216   0.599001
4   0.243483   0.127720   52.215334   0.697943
5   0.348842   0.085477   75.493654   0.701153

For the dataset Wine, both the Xie-Beni index and the Kwon index can obtain the accurate number of clusters, while for the datasets Iris, Seeds, SubKDD, SD3, and SD4 they obtain only results approximate to the true number of clusters. The proposed index obtains the right result on the datasets Wine and SD4, a better result than the Xie-Beni and Kwon indices on the datasets SubKDD and SD3, and the same result as those two indices on the datasets Iris and Seeds. There are two reasons. First, the importance degree of each property of the dataset is not considered in the clustering process but assumed to be the same, thereby affecting the experimental result. Second, SAFCM has a weak capacity to deal with overlapping clusters or nonconvex datasets. These are the authors' subsequent research topics.

Tables 6-13 list the detailed results of the 4 indices on the experimental datasets in each iteration of SAFCM. For the dataset Wine, when the number of clusters is 4, 5, or 6, the result based on the Xie-Beni index or the Kwon index does not converge, whereas the proposed index achieves the desired results.


Table 9: The value of clustering validity index on SubKDD.

c    V_XB       V_B        V_K          V_R
2    0.646989   1.324434   550.166676   1.574431
3    0.260755   0.378775   222.020838   1.090289
4    0.133843   0.062126   119.544560   0.511238
5    0.234402   0.052499   202.641204   0.537852
6    0.180728   0.054938   156.812271   0.583800
7    0.134636   0.047514   119.029265   0.619720
8    0.104511   0.032849   91.985274    0.690873
9    0.129721   0.027639   74.012847    0.562636
10   0.06822    0.027025   91.352856    0.528528

Table 10: The value of clustering validity index on SD1.

c    V_XB       V_B        V_K         V_R
2    0.221693   0.443390   44.592968   0.693390
3    0.206035   0.198853   40.264251   0.726245
4    0.127731   0.093653   26.220200   0.655550
5    0.130781   0.069848   27.154867   0.651465
6    0.144894   0.050067   22.922121   0.639325
7    0.136562   0.040275   29.126152   0.636258
8    0.112480   0.032874   24.323625   0.627442
9    0.115090   0.026833   24.242580   0.624936
10   0.141415   0.022611   28.574579   0.616701
11   0.126680   0.019256   28.821707   0.611524
12   0.103178   0.016634   23.931865   0.605990
13   0.110355   0.013253   26.517065   0.588246
14   0.095513   0.011083   23.635022   0.576808
15   0.075928   0.009817   19.302095   0.562289
16   0.066025   0.008824   17.236138   0.557990
17   0.054314   0.007248   14.995284   0.544341
18   0.045398   0.006090   13.208810   0.534882
19   0.039492   0.005365   11.977437   0.527131
20   0.034598   0.004917   10.835952   0.517481

Table 11: The value of clustering validity index on SD2.

c    V_XB       V_B        V_K         V_R
2    0.066286   0.132572   13.281503   0.382572
3    0.068242   0.063751   13.752535   0.394200
4    0.060916   0.028149   12.309967   0.378337

Table 12: The value of clustering validity index on SD3.

c    V_XB       V_B        V_K          V_R
2    0.148379   0.300899   131.557269   0.570876
3    0.195663   0.142149   173.900551   0.599680
4    0.127512   0.142150   113.748947   0.557198
5    0.126561   0.738070   113.262975   0.589535



Table 13: The value of clustering validity index on SD4.

c    V_XB       V_B        V_K          V_R
2    0.104434   0.208832   99.148271    0.473748
3    0.170326   0.142561   162.044450   0.458832
4    0.221884   0.081007   211.692699   0.583529
5    0.156253   0.053094   157.683921   0.603211
6    0.123191   0.041799   118.279116   0.575396
7    0.165465   0.032411   107.210082   0.592625
8    0.145164   0.028698   139.310969   0.606049

6. Conclusions

FCM is widely used in many fields, but it needs the number of clusters to be preset and is greatly influenced by the initial cluster centroids. This paper studies a self-adaptive method for determining the number of clusters using the FCM algorithm. In this method, a density-based algorithm is put forward first, which can estimate the maximum number of clusters to reduce the search range of the optimal number and is especially fit for datasets on which the empirical rule is inoperative. Besides, it can generate high-quality initial cluster centroids, so that the clustering result is stable and the convergence of FCM is quick. Then a new fuzzy clustering validity index is put forward based on fuzzy compactness and separation, so that the clustering result is closer to the global optimum; the index remains robust and interpretable when the number of clusters tends to that of the objects in the dataset. Finally, a self-adaptive FCM algorithm is proposed to determine the optimal number of clusters, run in the iterative trial-and-error process.

The contributions are validated by the experimental results. However, in most cases each property plays a different role in the clustering process; in other words, the weights of the properties are not the same. This issue will be the focus of the authors' future work.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61373148 and 61502151), the Shandong Province Natural Science Foundation (ZR2014FL010), the Science Foundation of the Ministry of Education of China (14YJC860042), the Shandong Province Outstanding Young Scientist Award Fund (BS2013DX033), the Shandong Province Social Sciences Project (15CXWJ13 and 16CFXJ05), and the Project of Shandong Province Higher Educational Science and Technology Program (no. J15LN02).

References

[1] S. Sambasivam and N. Theodosopoulos, "Advanced data clustering methods of mining web documents," Issues in Informing Science and Information Technology, vol. 3, pp. 563–579, 2006.

[2] J. Wang, S.-T. Wang, and Z.-H. Deng, "Survey on challenges in clustering analysis research," Control and Decision, vol. 3, 2012.

[3] C.-W. Bong and M. Rajeswari, "Multi-objective nature-inspired clustering and classification techniques for image segmentation," Applied Soft Computing Journal, vol. 11, no. 4, pp. 3271–3282, 2011.

[4] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[5] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, no. 3, pp. 338–353, 1965.

[6] J. C. Bezdek, "Numerical taxonomy with fuzzy sets," Journal of Mathematical Biology, vol. 1, no. 1, pp. 57–71, 1974.

[7] J. C. Bezdek, "Cluster validity with fuzzy sets," Journal of Cybernetics, vol. 3, no. 3, pp. 58–73, 1974.

[8] M. P. Windham, "Cluster validity for fuzzy clustering algorithms," Fuzzy Sets and Systems, vol. 5, no. 2, pp. 177–185, 1981.

[9] M. Lee, "Fuzzy cluster validity index based on object proximities defined over fuzzy partition matrices," in Proceedings of the IEEE International Conference on Fuzzy Systems and IEEE World Congress on Computational Intelligence (FUZZ-IEEE '08), pp. 336–340, June 2008.

[10] X.-B. Zhang and L. Jiang, "A new validity index of fuzzy c-means clustering," in Proceedings of the IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '09), vol. 2, pp. 218–221, Hangzhou, China, August 2009.

[11] I. Saha, U. Maulik, and S. Bandyopadhyay, "A new differential evolution based fuzzy clustering for automatic cluster evolution," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 706–711, IEEE, Patiala, India, March 2009.

[12] S. Yue, J.-S. Wang, T. Wu, and H. Wang, "A new separation measure for improving the effectiveness of validity indices," Information Sciences, vol. 180, no. 5, pp. 748–764, 2010.

[13] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

[14] A. M. Bensaid, L. O. Hall, J. C. Bezdek, et al., "Validity-guided (re)clustering with applications to image segmentation," IEEE Transactions on Fuzzy Systems, vol. 4, no. 2, pp. 112–123, 1996.

[15] S. H. Kwon, "Cluster validity index for fuzzy clustering," Electronics Letters, vol. 34, no. 22, pp. 2176–2177, 1998.

[16] H. Sun, S. Wang, and Q. Jiang, "FCM-based model selection algorithms for determining the number of clusters," Pattern Recognition, vol. 37, no. 10, pp. 2027–2037, 2004.

[17] Y. Li and F. Yu, "A new validity function for fuzzy clustering," in Proceedings of the International Conference on Computational Intelligence and Natural Computing (CINC '09), vol. 1, pp. 462–465, IEEE, June 2009.

[18] S. Saha and S. Bandyopadhyay, "A new cluster validity index based on fuzzy granulation-degranulation criteria," in Proceedings of the IEEE International Conference on Advanced Computing and Communications (ADCOM '07), pp. 353–358, IEEE, Guwahati, India, 2007.

[19] M. Zhang, W. Zhang, H. Sicotte, and P. Yang, "A new validity measure for a correlation-based fuzzy c-means clustering algorithm," in Proceedings of the 31st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 3865–3868, Minneapolis, Minn, USA, September 2009.

[20] Y.-I. Kim, D.-W. Kim, D. Lee, and K. H. Lee, "A cluster validation index for GK cluster analysis based on relative degree of sharing," Information Sciences, vol. 168, no. 1–4, pp. 225–242, 2004.

[21] B. Rezaee, "A cluster validity index for fuzzy clustering," Fuzzy Sets and Systems, vol. 161, no. 23, pp. 3014–3025, 2010.

[22] D. Zhang, M. Ji, J. Yang, Y. Zhang, and F. Xie, "A novel cluster validity index for fuzzy clustering based on bipartite modularity," Fuzzy Sets and Systems, vol. 253, pp. 122–137, 2014.

[23] J. Yu and Q. Cheng, "Search range of the optimal cluster number in fuzzy clustering," Science in China (Series E), vol. 32, no. 2, pp. 274–280, 2002.

[24] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[25] D. Pelleg, A. W. Moore, et al., "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), vol. 1, pp. 727–734, 2000.

[26] H. Frigui and R. Krishnapuram, "A robust competitive clustering algorithm with applications in computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 450–465, 1999.

[27] M.-S. Yang and K.-L. Wu, "A similarity-based robust clustering method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 4, pp. 434–448, 2004.

[28] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.

[29] L. Yujian, "A clustering algorithm based on maximal θ-distant subtrees," Pattern Recognition, vol. 40, no. 5, pp. 1425–1431, 2007.

[30] Y. Shihong, H. Ti, and W. Penglong, "Matrix eigenvalue analysis-based clustering validity index," Journal of Tianjin University (Science and Technology), vol. 8, article 6, 2014.

[31] A. José-García and W. Gómez-Flores, "Automatic clustering using nature-inspired metaheuristics: a survey," Applied Soft Computing Journal, vol. 41, pp. 192–213, 2016.

[32] E. H. Ruspini, "New experimental results in fuzzy clustering," Information Sciences, vol. 6, pp. 273–284, 1973.

[33] J. C. Dunn, "Well-separated clusters and optimal fuzzy partitions," Journal of Cybernetics, vol. 4, no. 1, pp. 95–104, 1974.

[34] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Springer Science & Business Media, 2013.

[35] R. E. Hammah and J. H. Curran, "Fuzzy cluster algorithm for the automatic identification of joint sets," International Journal of Rock Mechanics and Mining Sciences, vol. 35, no. 7, pp. 889–905, 1998.

[36] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370–379, 1995.

[37] S. B. Zhou, Z. Y. Xu, and X. Q. Tang, "Method for determining optimal number of clusters based on affinity propagation clustering," Control and Decision, vol. 26, no. 8, pp. 1147–1152, 2011.



[35] R E Hammah and J H Curran ldquoFuzzy cluster algorithm forthe automatic identification of joint setsrdquo International Journalof Rock Mechanics and Mining Sciences vol 35 no 7 pp 889ndash905 1998

[36] N R Pal and J C Bezdek ldquoOn cluster validity for the fuzzy c-means modelrdquo IEEE Transactions on Fuzzy Systems vol 3 no3 pp 370ndash379 1995

[37] S B Zhou Z Y Xu and X Q Tang ldquoMethod for determiningoptimal number of clusters based on affinity propagationclusteringrdquo Control and Decision vol 26 no 8 pp 1147ndash11522011

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 7: Research Article A Self-Adaptive Fuzzy -Means Algorithm ...downloads.hindawi.com/journals/cin/2016/2647389.pdf · A Self-Adaptive Fuzzy -Means Algorithm for Determining the Optimal


Input: V, c_max, I_max, and ξ
Output: J_best, c_opt, U_best

for kn = 2 to c_max do
    t = 0                                  /* the current iteration */
    V(t)_kn = (v_i)_kn                     /* the first kn cluster centroids of V */
    U(t)_kn = (u(t)_ik)_{n×kn}
    J(t)_kn = V_R(U(t)_kn, V(t)_kn)
    while t == 0 || (t < I_max && |J(t)_kn − J(t−1)_kn| > ξ) do
        Update V(t)_kn
        Update U(t)_kn
        Update J(t)_kn
        t = t + 1
    end while
    J_min_kn = min_{i=1,…,t} J(i)_kn
    idx = arg min_{i=1,…,t} J(i)_kn
    U_min_kn = U(idx)_kn
end for
J_best = min_{kn=2,…,c_max} J_min_kn       /* the best value of the clustering validity index */
c_opt = arg min_{kn=2,…,c_max} J_min_kn
U_best = U_min_{c_opt}

Algorithm 2: SAFCM.
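
To make the trial-and-error structure concrete, the following is a minimal numpy sketch of Algorithm 2. The names safcm and fcm_objective are illustrative, V_init stands for the centroids produced by the density-based algorithm, and fcm_objective is only a stand-in for the paper's validity index V_R (defined in the earlier sections), which the loop would evaluate in its place.

    import numpy as np

    def fcm_objective(X, U, V, m=2.0):
        # Stand-in for the paper's validity index V_R: the classical FCM
        # objective J_m = sum_i sum_k u_ik^m * ||x_k - v_i||^2.
        D2 = np.sum((X[:, None, :] - V[None, :, :]) ** 2, axis=2)  # (n, c)
        return float(np.sum((U ** m) * D2))

    def safcm(X, V_init, c_max, I_max=100, xi=1e-5, m=2.0, index=fcm_objective):
        """Trial-and-error loop of Algorithm 2 over kn = 2 .. c_max (sketch)."""
        J_best, c_opt, U_best = np.inf, None, None
        for kn in range(2, c_max + 1):
            V = V_init[:kn].copy()            # first kn density-based centroids
            J_prev, J_hist, U_hist = None, [], []
            for _ in range(I_max):
                # Standard FCM updates of the membership matrix U and centroids V.
                D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
                U = 1.0 / np.sum((D[:, :, None] / D[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
                V = (U.T ** m) @ X / np.sum(U.T ** m, axis=1, keepdims=True)
                J = index(X, U, V, m)
                J_hist.append(J)
                U_hist.append(U)
                if J_prev is not None and abs(J - J_prev) <= xi:
                    break                     # |J(t) - J(t-1)| <= xi: converged
                J_prev = J
            best = int(np.argmin(J_hist))     # best iteration for this kn
            if J_hist[best] < J_best:
                J_best, c_opt, U_best = J_hist[best], kn, U_hist[best]
        return J_best, c_opt, U_best

Running safcm(X, V_init, c_max) returns the minimum index value J_best, the corresponding optimal number of clusters c_opt, and the membership matrix U_best, matching the outputs listed in Algorithm 2.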

Table 2: c_max estimated by several methods. n is the number of objects in the dataset, c is the actual number of clusters, c_ER is the number of clusters estimated by the empirical rule, that is, ⌊√n⌋, c_AP is the number of clusters obtained by the AP algorithm, and c_DBA is the number of clusters obtained by the density-based algorithm.

                                          c_DBA
Dataset    n      c     c_ER   c_AP   p = 1%   p = 2%   p = 3%   p = 5%
Iris       150    3     12     9      20       14       9        6
Wine       178    3     13     15     14       7        6        3
Seeds      210    3     14     13     18       12       5        2
SubKDD     1050   6     32     24     21       17       10       7
SD1        200    20    14     19     38       22       20       —
SD2        2000   4     44     25     16       3        4        2
SD3        885    3     29     27     24       19       5        3
SD4        947    3     30     31     23       13       8        4

For the categorical attributes of SubKDD, simple matching is used as the dissimilarity measure; that is, 0 for identical values and 1 for different values.
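
As an illustration of this measure, the sketch below combines Euclidean distance on the numeric attributes with simple matching on the categorical ones; the unweighted sum and the helper name mixed_dissimilarity are assumptions for illustration, since the paper only specifies the simple-matching part.

    import numpy as np

    def mixed_dissimilarity(x, y, cat_idx):
        # Simple matching on categorical positions: 0 if identical, 1 if different.
        d_cat = sum(1 for i in cat_idx if x[i] != y[i])
        # Euclidean distance on the remaining (numeric) positions.
        num_idx = [i for i in range(len(x)) if i not in cat_idx]
        d_num = np.linalg.norm(np.array([float(x[i]) for i in num_idx])
                               - np.array([float(y[i]) for i in num_idx]))
        return d_num + d_cat  # assumed unweighted combination

For SubKDD, cat_idx would hold the positions of the categorical fields of the KDD Cup 1999 records, such as the protocol type.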

5.1. Experiment of c_max Simulation. In [37] it was proposed that the number of clusters generated by the AP algorithm could be selected as the maximum number of clusters, so this paper estimates the value of c_max by the empirical rule, the AP algorithm, and the density-based algorithm. The specific experimental results are shown in Table 2. Let p = p_m in the AP algorithm; d_c is selected as the distance at the first 1%, 2%, 3%, and 5%, respectively.

Experimental results show that, for a dataset whose true number of clusters is greater than ⌊√n⌋, it is obviously incorrect to let c_max = ⌊√n⌋; in other words, the empirical rule is invalid. The number of clusters finally obtained by the AP algorithm is close to the actual number of clusters. For a dataset whose true number of clusters is smaller than ⌊√n⌋, the number of clusters generated by the AP algorithm can be used as an appropriate c_max, but sometimes the value is greater than the value estimated by the empirical rule, which enlarges the search range of the optimal number of clusters. If c_max is estimated by the proposed density-based algorithm, the results are appropriate in most cases. The method is invalid only on SD1, when the cutoff distance is the first 5% distance. When the cutoff distance is selected as the first 3% distance, the c_max generated by the proposed algorithm is much smaller than ⌊√n⌋; it is closer to the true number of clusters and greatly narrows the search range of the optimal number of clusters. Therefore, the cutoff distance is selected as the distance at the first 3% in the later experiments.
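
Under the reading that "the first p% distance" is the value at rank p% in the ascending list of all pairwise distances, as in the density-peaks method [4], d_c can be computed as below; cutoff_distance is an illustrative name, not a routine from the paper.

    import numpy as np

    def cutoff_distance(X, p=0.03):
        # d_c = distance at the first p (e.g., 3%) of all sorted pairwise
        # distances; assumes the density-peaks convention of [4].
        rows, cols = np.triu_indices(X.shape[0], k=1)
        d = np.sort(np.linalg.norm(X[rows] - X[cols], axis=1))
        return d[max(int(round(p * d.size)) - 1, 0)]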

5.2. Experiment of the Influence of the Initial Cluster Centroids on the Convergence of FCM. To show that the initial cluster centroids obtained by the density-based algorithm can quicken the convergence of the clustering algorithm, the traditional FCM is adopted for verification. The number of clusters is assigned as the true value, with the convergence threshold being 10⁻⁵. Because the random selection of initial cluster centroids has a large influence on FCM, the experiment on each dataset is repeated 50 times, and the rounded means of the algorithmic iterations are compared. The specific experimental results are shown in Table 3.


Figure 3: Four synthetic datasets: (a) SD1, (b) SD2, (c) SD3, (d) SD4.

Table 3: Comparison of iterations of the FCM algorithm. Method 1 uses random initial cluster centroids and Method 2 uses the cluster centroids obtained by the density-based algorithm.

Dataset   Method 1   Method 2
Iris      21         16
Wine      27         18
Seeds     19         16
SubKDD    31         23
SD1       38         14
SD2       18         12
SD3       30         22
SD4       26         21


As shown in Table 3, the initial cluster centroids obtained by the density-based algorithm effectively reduce the iterations of FCM. Particularly on SD1, the iteration count of FCM is far smaller than with randomly selected initial cluster centroids, whose minimum iteration count is 24 and maximum reaches 63, with unstable clustering results. Therefore, the proposed method not only effectively quickens the convergence of FCM but also obtains a stable clustering result.

5.3. Experiment of Clustering Accuracy Based on Clustering Validity Index. For the 3 UCI datasets and SubKDD, clustering accuracy r is adopted to measure the clustering effect, defined as

r = (∑_{i=1}^{k} b_i) / n.  (13)

Here, b_i is the number of objects which co-occur in the ith cluster and the ith real cluster, and n is the number of objects in the dataset. According to this measurement, the higher the clustering accuracy, the better the clustering result of FCM; when r = 1, the clustering result of FCM is totally accurate.
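
A short sketch of (13): since the paper does not spell out how found clusters are matched to real ones, the common greedy choice of taking b_i as the size of the dominant true class inside the ith found cluster is assumed here.

    import numpy as np

    def clustering_accuracy(y_true, y_pred):
        # r = (sum_i b_i) / n with b_i = size of the dominant true class in
        # the i-th found cluster (assumed matching); labels are integer-coded.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        total = sum(np.bincount(y_true[y_pred == c]).max()
                    for c in np.unique(y_pred))
        return total / y_true.size

For example, clustering_accuracy([0, 0, 1, 1], [7, 7, 3, 3]) gives r = 1.0, because each found cluster coincides with one real cluster.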

In the experiments, the true number of clusters is assigned to each dataset, and the initial cluster centroids are obtained by the density-based algorithm. The experimental results of clustering accuracy are shown in Table 4.


Figure 4: Clustering results of the four synthetic datasets: (a) SD1, (b) SD2, (c) SD3, (d) SD4.

Table 4: Clustering accuracy.

Dataset               Iris     Wine     Seeds    SubKDD
Clustering accuracy   84.00%   96.63%   91.90%   94.35%

Apparently, the clustering accuracies on the datasets Wine, Seeds, and SubKDD are high, while that on the dataset Iris is relatively low, for the reason that two of its clusters are nonlinearly separable.

The clustering results of the proposed clustering validity index on the 4 synthetic datasets are shown in Figure 4, which depicts good partitions on the datasets SD1 and SD2 and slightly poor partitions on SD3 and SD4 because of the complexity of their structures or nonconvexity. When the number of clusters is set as 4, SD3 is divided into 4 groups; that means the right and the left each have 2 groups.

5.4. Experiment of the Optimal Number of Clusters. At last, Xie-Beni index, Bensaid index, Kwon index, and the proposed index are respectively adopted for running SAFCM so as to determine the optimal number of clusters. The results are shown in Table 5, where V_XB, V_B, V_K, and V_R represent the Xie-Beni index, Bensaid index, Kwon index, and the proposed index, respectively.

Table 5: Optimal number of clusters estimated by several clustering validity indices.

Dataset   V_XB   V_B   V_K   V_R
Iris       2      9     2     2
Wine       3      6     3     3
Seeds      2      5     2     2
SubKDD    10     10     9     4
SD1       20     20    20    20
SD2        4      4     4     4
SD3        5      3     5     4
SD4        2      8     2     3

Table 5 shows that, for the synthetic datasets SD1 and SD2 with simple structure, these indices can all obtain the optimal number of clusters. Bensaid index cannot obtain an accurate number of clusters for the 3 UCI datasets, SubKDD, or SD4, though it does for SD3.


Table 6: The value of clustering validity index on Iris.

c    V_XB       V_B        V_K         V_R
2    0.114234   0.223604   17.384973   0.473604
3    0.223973   0.124598   34.572146   0.551565
4    0.316742   0.099103   49.279488   0.615436
5    0.560109   0.089108   87.540968   0.676350
6    0.574475   0.072201   90.563379   0.691340
7    0.400071   0.067005   63.328679   0.735311
8    0.275682   0.036283   45.736972   0.614055
9    0.250971   0.027735   42.868449   0.584244

Table 7: The value of clustering validity index on Wine.

c    V_XB       V_B        V_K          V_R
2    0.663406   1.328291   118.33902    1.578291
3    0.468902   0.513071   83.939058    1.346421
4    —          0.473254   —            1.735791
5    —          0.373668   —            1.846686
6    —          0.256698   —            1.683222

Table 8: The value of clustering validity index on Seeds.

c    V_XB       V_B        V_K         V_R
2    0.147247   0.293609   31.170349   0.543609
3    0.212127   0.150899   45.326216   0.599001
4    0.243483   0.127720   52.215334   0.697943
5    0.348842   0.085477   75.493654   0.701153

For the dataset Wine, both Xie-Beni index and Kwon index can obtain the accurate number of clusters, while for the datasets Iris, Seeds, SubKDD, SD3, and SD4 they only obtain results approximate to the true number of clusters. The proposed index obtains the right result on the datasets Wine and SD4, a better result than Xie-Beni index and Kwon index on the datasets SubKDD and SD3, and the same result as those two indices on the datasets Iris and Seeds. There are two reasons. First, the importance degree of each property of the dataset is not considered in the clustering process but assumed to be the same, thereby affecting the experimental results. Second, SAFCM has a weak capacity to deal with overlapping clusters or nonconvex datasets. These are the authors' subsequent research topics.

Tables 6–13 list the detailed results of the 4 indices on the experimental datasets in each iteration of SAFCM. For the dataset Wine, when the number of clusters is 4, 5, or 6, the result based on Xie-Beni index or Kwon index does not converge; however, the proposed index achieves the desired results.
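
For reference, the Xie-Beni baseline used in these tables can be stated compactly; the sketch below is the standard definition from [13] (fuzzy compactness over minimum centroid separation, smaller is better), not the paper's V_R.

    import numpy as np

    def xie_beni(X, U, V, m=2.0):
        # V_XB = sum_{i,k} u_ik^m ||x_k - v_i||^2 / (n * min_{i != j} ||v_i - v_j||^2)
        D2 = np.sum((X[:, None, :] - V[None, :, :]) ** 2, axis=2)   # (n, c)
        compactness = np.sum((U ** m) * D2)
        sep = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=2)  # centroid gaps
        np.fill_diagonal(sep, np.inf)                               # exclude i == j
        return compactness / (X.shape[0] * sep.min())

Kwon's index [15] keeps the same ratio but adds a penalty term to the numerator and drops the 1/n factor, which is why the V_K columns in Tables 6–13 run roughly n times larger than V_XB.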


Table 9: The value of clustering validity index on SubKDD.

c     V_XB       V_B        V_K          V_R
2     0.646989   1.324434   550.166676   1.574431
3     0.260755   0.378775   222.020838   1.090289
4     0.133843   0.062126   119.544560   0.511238
5     0.234402   0.052499   202.641204   0.537852
6     0.180728   0.054938   156.812271   0.583800
7     0.134636   0.047514   119.029265   0.619720
8     0.104511   0.032849   91.985274    0.690873
9     0.129721   0.027639   74.012847    0.562636
10    0.06822    0.027025   91.352856    0.528528

Table 10: The value of clustering validity index on SD1.

c     V_XB       V_B        V_K         V_R
2     0.221693   0.443390   44.592968   0.693390
3     0.206035   0.198853   40.264251   0.726245
4     0.127731   0.093653   26.220200   0.655550
5     0.130781   0.069848   27.154867   0.651465
6     0.144894   0.050067   22.922121   0.639325
7     0.136562   0.040275   29.126152   0.636258
8     0.112480   0.032874   24.323625   0.627442
9     0.115090   0.026833   24.242580   0.624936
10    0.141415   0.022611   28.574579   0.616701
11    0.126680   0.019256   28.821707   0.611524
12    0.103178   0.016634   23.931865   0.605990
13    0.110355   0.013253   26.517065   0.588246
14    0.095513   0.011083   23.635022   0.576808
15    0.075928   0.009817   19.302095   0.562289
16    0.066025   0.008824   17.236138   0.557990
17    0.054314   0.007248   14.995284   0.544341
18    0.045398   0.006090   13.208810   0.534882
19    0.039492   0.005365   11.977437   0.527131
20    0.034598   0.004917   10.835952   0.517481

Table 11: The value of clustering validity index on SD2.

c    V_XB       V_B        V_K         V_R
2    0.066286   0.132572   132.81503   0.382572
3    0.068242   0.063751   137.52535   0.394200
4    0.060916   0.028149   123.09967   0.378337

Table 12: The value of clustering validity index on SD3.

c    V_XB       V_B        V_K          V_R
2    0.148379   0.300899   131.557269   0.570876
3    0.195663   0.142149   173.900551   0.599680
4    0.127512   0.142150   113.748947   0.557198
5    0.126561   0.738070   113.262975   0.589535


Table 13: The value of clustering validity index on SD4.

c    V_XB       V_B        V_K          V_R
2    0.104434   0.208832   99.148271    0.473748
3    0.170326   0.142561   162.044450   0.458832
4    0.221884   0.081007   211.692699   0.583529
5    0.156253   0.053094   157.683921   0.603211
6    0.123191   0.041799   118.279116   0.575396
7    0.165465   0.032411   107.210082   0.592625
8    0.145164   0.028698   139.310969   0.606049

6. Conclusions

FCM is widely used in many fields, but it needs the number of clusters to be preset and is greatly influenced by the initial cluster centroids. This paper studies a self-adaptive method for determining the number of clusters using the FCM algorithm. In this method, a density-based algorithm is put forward first. It can estimate the maximum number of clusters to reduce the search range of the optimal number, being especially fit for datasets on which the empirical rule is inoperative. Besides, it can generate high-quality initial cluster centroids so that the clustering result is stable and the convergence of FCM is quick. Then a new fuzzy clustering validity index based on fuzzy compactness and separation is put forward so that the clustering result is closer to the global optimum; the index remains robust and interpretable when the number of clusters tends to that of the objects in the dataset. Finally, a self-adaptive FCM algorithm is proposed to determine the optimal number of clusters, run in an iterative trial-and-error process.

The contributions are validated by the experimental results. However, in most cases each property plays a different role in the clustering process; in other words, the weights of the properties are not the same. This issue will be the focus of the authors' future work.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (61373148 and 61502151), Shandong Province Natural Science Foundation (ZR2014FL010), Science Foundation of Ministry of Education of China (14YJC860042), Shandong Province Outstanding Young Scientist Award Fund (BS2013DX033), Shandong Province Social Sciences Project (15CXWJ13 and 16CFXJ05), and Project of Shandong Province Higher Educational Science and Technology Program (no. J15LN02).

References

[1] S. Sambasivam and N. Theodosopoulos, "Advanced data clustering methods of mining web documents," Issues in Informing Science and Information Technology, vol. 3, pp. 563–579, 2006.

[2] J. Wang, S.-T. Wang, and Z.-H. Deng, "Survey on challenges in clustering analysis research," Control and Decision, vol. 3, 2012.

[3] C.-W. Bong and M. Rajeswari, "Multi-objective nature-inspired clustering and classification techniques for image segmentation," Applied Soft Computing Journal, vol. 11, no. 4, pp. 3271–3282, 2011.

[4] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[5] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, no. 3, pp. 338–353, 1965.

[6] J. C. Bezdek, "Numerical taxonomy with fuzzy sets," Journal of Mathematical Biology, vol. 1, no. 1, pp. 57–71, 1974.

[7] J. C. Bezdek, "Cluster validity with fuzzy sets," Journal of Cybernetics, vol. 3, no. 3, pp. 58–73, 1974.

[8] M. P. Windham, "Cluster validity for fuzzy clustering algorithms," Fuzzy Sets and Systems, vol. 5, no. 2, pp. 177–185, 1981.

[9] M. Lee, "Fuzzy cluster validity index based on object proximities defined over fuzzy partition matrices," in Proceedings of the IEEE International Conference on Fuzzy Systems and IEEE World Congress on Computational Intelligence (FUZZ-IEEE '08), pp. 336–340, June 2008.

[10] X.-B. Zhang and L. Jiang, "A new validity index of fuzzy c-means clustering," in Proceedings of the IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '09), vol. 2, pp. 218–221, Hangzhou, China, August 2009.

[11] I. Saha, U. Maulik, and S. Bandyopadhyay, "A new differential evolution based fuzzy clustering for automatic cluster evolution," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 706–711, IEEE, Patiala, India, March 2009.

[12] S. Yue, J.-S. Wang, T. Wu, and H. Wang, "A new separation measure for improving the effectiveness of validity indices," Information Sciences, vol. 180, no. 5, pp. 748–764, 2010.

[13] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

[14] A. M. Bensaid, L. O. Hall, J. C. Bezdek, et al., "Validity-guided (re)clustering with applications to image segmentation," IEEE Transactions on Fuzzy Systems, vol. 4, no. 2, pp. 112–123, 1996.

[15] S. H. Kwon, "Cluster validity index for fuzzy clustering," Electronics Letters, vol. 34, no. 22, pp. 2176–2177, 1998.

[16] H. Sun, S. Wang, and Q. Jiang, "FCM-based model selection algorithms for determining the number of clusters," Pattern Recognition, vol. 37, no. 10, pp. 2027–2037, 2004.

[17] Y. Li and F. Yu, "A new validity function for fuzzy clustering," in Proceedings of the International Conference on Computational Intelligence and Natural Computing (CINC '09), vol. 1, pp. 462–465, IEEE, June 2009.

[18] S. Saha and S. Bandyopadhyay, "A new cluster validity index based on fuzzy granulation-degranulation criteria," in Proceedings of the IEEE International Conference on Advanced Computing and Communications (ADCOM '07), pp. 353–358, IEEE, Guwahati, India, 2007.

[19] M. Zhang, W. Zhang, H. Sicotte, and P. Yang, "A new validity measure for a correlation-based fuzzy c-means clustering algorithm," in Proceedings of the 31st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 3865–3868, Minneapolis, Minn, USA, September 2009.

[20] Y.-I. Kim, D.-W. Kim, D. Lee, and K. H. Lee, "A cluster validation index for GK cluster analysis based on relative degree of sharing," Information Sciences, vol. 168, no. 1–4, pp. 225–242, 2004.

[21] B. Rezaee, "A cluster validity index for fuzzy clustering," Fuzzy Sets and Systems, vol. 161, no. 23, pp. 3014–3025, 2010.

[22] D. Zhang, M. Ji, J. Yang, Y. Zhang, and F. Xie, "A novel cluster validity index for fuzzy clustering based on bipartite modularity," Fuzzy Sets and Systems, vol. 253, pp. 122–137, 2014.

[23] J. Yu and Q. Cheng, "Search range of the optimal cluster number in fuzzy clustering," Science in China (Series E), vol. 32, no. 2, pp. 274–280, 2002.

[24] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[25] D. Pelleg, A. W. Moore, et al., "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), vol. 1, pp. 727–734, 2000.

[26] H. Frigui and R. Krishnapuram, "A robust competitive clustering algorithm with applications in computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 450–465, 1999.

[27] M.-S. Yang and K.-L. Wu, "A similarity-based robust clustering method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 4, pp. 434–448, 2004.

[28] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.

[29] L. Yujian, "A clustering algorithm based on maximal θ-distant subtrees," Pattern Recognition, vol. 40, no. 5, pp. 1425–1431, 2007.

[30] Y. Shihong, H. Ti, and W. Penglong, "Matrix eigenvalue analysis-based clustering validity index," Journal of Tianjin University (Science and Technology), vol. 8, article 6, 2014.

[31] A. José-García and W. Gómez-Flores, "Automatic clustering using nature-inspired metaheuristics: a survey," Applied Soft Computing Journal, vol. 41, pp. 192–213, 2016.

[32] E. H. Ruspini, "New experimental results in fuzzy clustering," Information Sciences, vol. 6, pp. 273–284, 1973.

[33] J. C. Dunn, "Well-separated clusters and optimal fuzzy partitions," Journal of Cybernetics, vol. 4, no. 1, pp. 95–104, 1974.

[34] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Springer Science & Business Media, 2013.

[35] R. E. Hammah and J. H. Curran, "Fuzzy cluster algorithm for the automatic identification of joint sets," International Journal of Rock Mechanics and Mining Sciences, vol. 35, no. 7, pp. 889–905, 1998.

[36] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370–379, 1995.

[37] S. B. Zhou, Z. Y. Xu, and X. Q. Tang, "Method for determining optimal number of clusters based on affinity propagation clustering," Control and Decision, vol. 26, no. 8, pp. 1147–1152, 2011.

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 8: Research Article A Self-Adaptive Fuzzy -Means Algorithm ...downloads.hindawi.com/journals/cin/2016/2647389.pdf · A Self-Adaptive Fuzzy -Means Algorithm for Determining the Optimal

8 Computational Intelligence and Neuroscience

minus5 0 5 10 15 20 250

5

10

15

20

25

30

(a) SD10 5 10 15 20 25

0

5

10

15

20

25

(b) SD2

50 100 150 200 250 300 3500

50

100

150

200

250

300

(c) SD320 40 60 80 100 120 140 160 180 200 220 240

20

40

60

80

100

120

140

160

(d) SD4

Figure 3 Four synthetic datasets

Table 3 Comparison of iterations of FCMalgorithmMethod 1 usesthe random initial cluster centroids and Method 2 uses the clustercentroids obtained by density-based algorithm

Dataset Method 1 Method 2Iris 21 16Wine 27 18Seeds 19 16SubKDD 31 23SD1 38 14SD2 18 12SD3 30 22SD4 26 21

is done repeatedly for 50 times and the round numbers ofthe mean of algorithmic iterations are comparedThe specificexperimental result is shown in Table 3

As shown in Table 3 the initial cluster centroids obtainedby density-based algorithm can effectively reduce the itera-tion of FCM Particularly on SD1 the iteration of FCM isfar smaller than that with randomly selected initial cluster

centroids whose minimum iteration is 24 andmaximum iter-ation reaches 63 with unstable clustering results Thereforethe proposed method can not only effectively quicken theconvergence of FCM but also obtain a stable clustering result

53 Experiment of Clustering Accuracy Based on ClusteringValidity Index For 3 UCI datasets and SubKDD clusteringaccuracy 119903 is adopted tomeasure the clustering effect definedas

119903 = sum119896119894=1 119887119894119899 (13)

Here 119887119894 is the number of objects which co-occur in the 119894thcluster and the 119894th real cluster and 119899 is the number of objectsin the dataset According to this measurement the higherthe clustering accuracy is the better the clustering result ofFCM is When 119903 = 1 the clustering result of FCM is totallyaccurate

In the experiments the true number of clusters is assignedto each dataset and the initial cluster centroids are obtainedby density-based algorithm Then experimental results ofclustering accuracy are shown in Table 4

Computational Intelligence and Neuroscience 9

minus5 0 5 10 15 20 250

5

10

15

20

25

30

(a) SD10 5 10 15 20 25

0

5

10

15

20

25

(b) SD2

50 100 150 200 250 300 3500

50

100

150

200

250

300

(c) SD320 40 60 80 100 120 140 160 180 200 220 240

20

40

60

80

100

120

140

160

(d) SD4

Figure 4 Clustering results of two synthetic datasets

Table 4 Clustering accuracy

Dataset Iris Wine Seeds SubKDDClustering accuracy 8400 9663 9190 9435

Apparently clustering accuracies of the datasets WineSeeds and SubKDD are high while that of the dataset Iris isrelatively low for the reason that two clusters are nonlinearlyseparable

The clustering results of the proposed clustering validityindex on 4 synthetic datasets are shown in Figure 4 whichdepict the good partitions on the datasets SD1 and SD2 andthe slightly poor partition on SD3 and SD4 because of thecomplexity of their structure or nonconvexity When thenumber of clusters is set as 4 SD3 is divided into 4 groupsthat means the right and the left each have 2 groups

54 Experiment of the Optimal Number of Clusters At lastXie-Beni index Bensaid index Kwon index and the pro-posed index are respectively adopted for running of SAFCMso as to determine the optimal number of clustersThe results

Table 5 Optimal number of clusters estimated by several clusteringvalidity indices

Dataset 119881XB 119881B 119881K 119881R

Iris 2 9 2 2Wine 3 6 3 3Seeds 2 5 2 2SubKDD 10 10 9 4SD1 20 20 20 20SD2 4 4 4 4SD3 5 3 5 4SD4 2 8 2 3

are shown in Table 5 119881XB 119881B 119881K and 119881R represent Xie-Beniindex Bensaid index Kwon index and the proposed indexrespectively

It shows that for synthetic datasets SD1 and SD2 withsimple structure these indices can all obtain the optimalnumber of clusters For 3 UCI datasets SubKDD and SD4Bensaid index cannot obtain an accurate number of clustersexcept SD3 For the dataset Wine both Xie-Beni index and

10 Computational Intelligence and Neuroscience

Table 6 The value of clustering validity index on Iris

119888 119881XB 119881B 119881K 119881R

2 0114234 0223604 17384973 04736043 0223973 0124598 34572146 05515654 0316742 0099103 49279488 06154365 0560109 0089108 87540968 06763506 0574475 0072201 90563379 06913407 0400071 0067005 63328679 07353118 0275682 0036283 45736972 06140559 0250971 0027735 42868449 0584244

Table 7 The value of clustering validity index on Wine

119888 119881XB 119881B 119881K 119881R

2 0663406 1328291 11833902 15782913 0468902 0513071 83939058 13464214 mdash 0473254 mdash 17357915 mdash 0373668 mdash 18466866 mdash 0256698 mdash 1683222

Table 8 The value of clustering validity index on Seeds

119888 119881XB 119881B 119881K 119881R

2 0147247 0293609 31170349 05436093 0212127 0150899 45326216 05990014 0243483 0127720 52215334 06979435 0348842 0085477 75493654 0701153

Kwon index can obtain the accurate number of clusters whilefor the datasets Iris Seeds SubKDD SD3 and SD4 they onlyobtain the result approximate to the true number of clusterThe proposed index obtains the right result on the datasetsWine and SD4 and a better result compared toXie-Beni indexand Kwon index on the datasets SubKDD and SD3 whileit obtains the same result of the two indices on the datasetsIris and Seeds There are two reasons First the importancedegree of each property of the dataset is not considered inthe clustering process but assumed to be the same therebyaffecting the experimental result Second SAFCMhas a weakcapacity to deal with overlapping clusters or nonconvexitydataset These are the authorsrsquo subsequent research contents

Tables 6ndash13 list the detailed results of 4 indices on theexperimental datasets in each iteration of SAFCM For thedataset Wine when the number of clusters is 4 5 or 6the result based on Xie-Beni index or Kwon index is of noconvergence However the proposed index can achieve thedesired results

6 Conclusions

FCM is widely used in lots of fields But it needs to presetthe number of clusters and is greatly influenced by the initialcluster centroids This paper studies a self-adaptive methodfor determining the number of clusters by using of FCMalgorithm In this method a density-based algorithm is put

Table 9 The value of clustering validity index on SubKDD

119888 119881XB 119881B 119881K 119881R

2 0646989 1324434 550166676 15744313 0260755 0378775 222020838 10902894 0133843 0062126 119544560 05112385 0234402 0052499 202641204 05378526 0180728 0054938 156812271 05838007 0134636 0047514 119029265 06197208 0104511 0032849 919852740 06908739 0129721 0027639 740128470 056263610 006822 0027025 913528560 0528528

Table 10 The value of clustering validity index on SD1

119888 119881XB 119881B 119881K 119881R

2 0221693 0443390 44592968 06933903 0206035 0198853 40264251 07262454 0127731 0093653 26220200 06555505 0130781 0069848 27154867 06514656 0144894 0050067 22922121 06393257 0136562 0040275 29126152 06362588 0112480 0032874 24323625 06274429 0115090 0026833 24242580 062493610 0141415 0022611 28574579 061670111 0126680 0019256 28821707 061152412 0103178 0016634 23931865 060599013 0110355 0013253 26517065 058824614 0095513 0011083 23635022 057680815 0075928 0009817 19302095 056228916 0066025 0008824 17236138 055799017 0054314 0007248 14995284 054434118 0045398 0006090 13208810 053488219 0039492 0005365 11977437 052713120 0034598 0004917 10835952 0517481

Table 11 The value of clustering validity index on SD2

119888 119881XB 119881B 119881K 119881R

2 0066286 0132572 13281503 03825723 0068242 0063751 13752535 03942004 0060916 0028149 12309967 0378337

Table 12 The value of clustering validity index on SD3

119888 119881XB 119881B 119881K 119881R

2 0148379 0300899 131557269 05708763 0195663 0142149 173900551 05996804 0127512 0142150 113748947 05571985 0126561 0738070 113262975 0589535

forward at first which can estimate the maximum number ofclusters to reduce the search range of the optimal numberespecially being fit for the dataset on which the empiricalrule is inoperative Besides it can generate the high-quality

Computational Intelligence and Neuroscience 11

Table 13 The value of clustering validity index on SD4

119888 119881XB 119881B 119881K 119881R

2 0104434 0208832 991482710 04737483 0170326 0142561 162044450 04588324 0221884 0081007 211692699 05835295 0156253 0053094 157683921 06032116 0123191 0041799 118279116 05753967 0165465 0032411 107210082 05926258 0145164 0028698 139310969 0606049

initial cluster centroids so that the the clustering result isstable and the convergence of FCM is quick Then a newfuzzy clustering validity index was put forward based onfuzzy compactness and separation so that the clusteringresult is closer to global optimum The index is robust andinterpretable when the number of clusters tends to that ofobjects in the dataset Finally a self-adaptive FCM algorithmis proposed to determine the optimal number of clusters runin the iterative trial-and-error process

The contributions are validated by experimental resultsHowever in most cases each property plays a different rolein the clustering process In other words the weight ofproperties is not the same This issue will be focused on inthe authorsrsquo future work

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This work is partially supported by the National NaturalScience Foundation of China (61373148 and 61502151) Shan-dong Province Natural Science Foundation (ZR2014FL010)Science Foundation ofMinistry of Education ofChina (14YJC860042) Shandong Province Outstanding Young ScientistAward Fund (BS2013DX033) Shandong Province Social Sci-ences Project (15CXWJ13 and 16CFXJ05) and Project ofShandongProvinceHigher Educational Science andTechnol-ogy Program (no J15LN02)

References

[1] S Sambasivam and N Theodosopoulos ldquoAdvanced data clus-tering methods of mining web documentsrdquo Issues in InformingScience and Information Technology vol 3 pp 563ndash579 2006

[2] J Wang S-T Wang and Z-H Deng ldquoSurvey on challenges inclustering analysis researchrdquo Control and Decision vol 3 2012

[3] C-W Bong andM Rajeswari ldquoMulti-objective nature-inspiredclustering and classification techniques for image segmenta-tionrdquo Applied Soft Computing Journal vol 11 no 4 pp 3271ndash3282 2011

[4] A Rodriguez and A Laio ldquoClustering by fast search and find ofdensity peaksrdquo Science vol 344 no 6191 pp 1492ndash1496 2014

[5] L A Zadeh ldquoFuzzy setsrdquo Information and Computation vol 8no 3 pp 338ndash353 1965

[6] J C Bezdek ldquoNumerical taxonomy with fuzzy setsrdquo Journal ofMathematical Biology vol 1 no 1 pp 57ndash71 1974

[7] J C Bezdek ldquoCluster validity with fuzzy setsrdquo Journal of Cyber-netics vol 3 no 3 pp 58ndash73 1974

[8] M P Windham ldquoCluster validity for fuzzy clustering algo-rithmsrdquo Fuzzy Sets and Systems vol 5 no 2 pp 177ndash185 1981

[9] M Lee ldquoFuzzy cluster validity index based onobject proximitiesdefined over fuzzy partition matricesrdquo in Proceedings of theIEEE International Conference on Fuzzy Systems and IEEEWorldCongress on Computational Intelligence (FUZZ-IEEE rsquo08) pp336ndash340 June 2008

[10] X-B Zhang and L Jiang ldquoAnew validity index of fuzzy c-meansclusteringrdquo in Proceedings of the IEEE International Conferenceon IntelligentHuman-Machine Systems andCybernetics (IHMSCrsquo09) vol 2 pp 218ndash221 Hangzhou China August 2009

[11] I Saha U Maulik and S Bandyopadhyay ldquoA new differ-ential evolution based fuzzy clustering for automatic clusterevolutionrdquo in Proceedings of the IEEE International AdvanceComputing Conference (IACC rsquo09) pp 706ndash711 IEEE PatialaIndia March 2009

[12] S Yue J-S Wang T Wu and H Wang ldquoA new separationmeasure for improving the effectiveness of validity indicesrdquoInformation Sciences vol 180 no 5 pp 748ndash764 2010

[13] X L Xie and G Beni ldquoA validity measure for fuzzy clusteringrdquoIEEE Transactions on Pattern Analysis andMachine Intelligencevol 13 no 8 pp 841ndash847 1991

[14] A M Bensaid L O Hall J C Bezdek et al ldquoValidity-guided(re)clustering with applications to image segmentationrdquo IEEETransactions on Fuzzy Systems vol 4 no 2 pp 112ndash123 1996

[15] S H Kwon ldquoCluster validity index for fuzzy clusteringrdquo Elec-tronics Letters vol 34 no 22 pp 2176ndash2177 1998

[16] H Sun S Wang and Q Jiang ldquoFCM-based model selectionalgorithms for determining the number of clustersrdquo PatternRecognition vol 37 no 10 pp 2027ndash2037 2004

[17] Y Li and F Yu ldquoA new validity function for fuzzy clusteringrdquoin Proceedings of the International Conference on ComputationalIntelligence and Natural Computing (CINC rsquo09) vol 1 pp 462ndash465 IEEE June 2009

[18] S Saha and S Bandyopadhyay ldquoA new cluster validity indexbased on fuzzy granulation-degranulation criteriardquo in Proceed-ings of the IEEEInternational Conference on Advanced Comput-ing and Communications (ADCOM rsquo07) pp 353ndash358 IEEEGuwahati India 2007

[19] M Zhang W Zhang H Sicotte and P Yang ldquoA new validitymeasure for a correlation-based fuzzy c-means clustering algo-rithmrdquo in Proceedings of the 31st IEEE Annual InternationalConference of the IEEE Engineering in Medicine and BiologySociety (EMBC rsquo09) pp 3865ndash3868 Minneapolis Minn USASeptember 2009

[20] Y-I KimD-WKimD Lee andKH Lee ldquoA cluster validationindex for GK cluster analysis based on relative degree ofsharingrdquo Information Sciences vol 168 no 1ndash4 pp 225ndash2422004

[21] B Rezaee ldquoA cluster validity index for fuzzy clusteringrdquo FuzzySets and Systems vol 161 no 23 pp 3014ndash3025 2010

[22] D Zhang M Ji J Yang Y Zhang and F Xie ldquoA novel clus-ter validity index for fuzzy clustering based on bipartite modu-larityrdquo Fuzzy Sets and Systems vol 253 pp 122ndash137 2014

[23] J Yu andQ Cheng ldquoSearch range of the optimal cluster numberin fuzzy clusteringrdquo Science in China (Series E) vol 32 no 2 pp274ndash280 2002

12 Computational Intelligence and Neuroscience

[24] B J Frey and D Dueck ldquoClustering by passing messagesbetween data pointsrdquo Science vol 315 no 5814 pp 972ndash9762007

[25] D Pelleg A W Moore et al ldquoX-means extending K-meanswith efficient estimation of the number of clustersrdquo in Proceed-ings of the 17th International Conference on Machine Learning(ICML rsquo00) vol 1 pp 727ndash734 2000

[26] H Frigui and R Krishnapuram ldquoA robust competitive clus-tering algorithm with applications in computer visionrdquo IEEETransactions on Pattern Analysis and Machine Intelligence vol21 no 5 pp 450ndash465 1999

[27] M-S Yang and K-L Wu ldquoA similarity-based robust clusteringmethodrdquo IEEE Transactions on Pattern Analysis and MachineIntelligence vol 26 no 4 pp 434ndash448 2004

[28] M Girolami ldquoMercer kernel-based clustering in feature spacerdquoIEEE Transactions on Neural Networks vol 13 no 3 pp 780ndash784 2002

[29] L Yujian ldquoA clustering algorithm based on maximal 120579-distantsubtreesrdquo Pattern Recognition vol 40 no 5 pp 1425ndash1431 2007

[30] Y Shihong H Ti and W Penglong ldquoMatrix eigenvalue anal-ysis-based clustering validity indexrdquo Journal of Tianjin Univer-sity (Science and Technology) vol 8 article 6 2014

[31] A Jose-Garcıa and W Gomez-Flores ldquoAutomatic clusteringusing nature-inspired metaheuristics a surveyrdquo Applied SoftComputing Journal vol 41 pp 192ndash213 2016

[32] E H Ruspini ldquoNew experimental results in fuzzy clusteringrdquoInformation Sciences vol 6 pp 273ndash284 1973

[33] J C Dunn ldquoWell-separated clusters and optimal fuzzy parti-tionsrdquo Journal of Cybernetics vol 4 no 1 pp 95ndash104 1974

[34] J C Bezdek Pattern Recognition with Fuzzy Objective FunctionAlgorithms Springer Science amp Business Media 2013

[35] R E Hammah and J H Curran ldquoFuzzy cluster algorithm forthe automatic identification of joint setsrdquo International Journalof Rock Mechanics and Mining Sciences vol 35 no 7 pp 889ndash905 1998

[36] N R Pal and J C Bezdek ldquoOn cluster validity for the fuzzy c-means modelrdquo IEEE Transactions on Fuzzy Systems vol 3 no3 pp 370ndash379 1995

[37] S B Zhou Z Y Xu and X Q Tang ldquoMethod for determiningoptimal number of clusters based on affinity propagationclusteringrdquo Control and Decision vol 26 no 8 pp 1147ndash11522011

Submit your manuscripts athttpwwwhindawicom

Computer Games Technology

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

Advances in

FuzzySystems

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014

International Journal of

ReconfigurableComputing

Hindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Applied Computational Intelligence and Soft Computing

thinspAdvancesthinspinthinsp

Artificial Intelligence

HindawithinspPublishingthinspCorporationhttpwwwhindawicom Volumethinsp2014

Advances inSoftware EngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Journal of

Computer Networks and Communications

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation

httpwwwhindawicom Volume 2014

Advances in

Multimedia

International Journal of

Biomedical Imaging

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

ArtificialNeural Systems

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Computational Intelligence and Neuroscience

Industrial EngineeringJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Human-ComputerInteraction

Advances in

Computer EngineeringAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Page 9: Research Article A Self-Adaptive Fuzzy -Means Algorithm ...downloads.hindawi.com/journals/cin/2016/2647389.pdf · A Self-Adaptive Fuzzy -Means Algorithm for Determining the Optimal

Computational Intelligence and Neuroscience 9

minus5 0 5 10 15 20 250

5

10

15

20

25

30

(a) SD10 5 10 15 20 25

0

5

10

15

20

25

(b) SD2

50 100 150 200 250 300 3500

50

100

150

200

250

300

(c) SD320 40 60 80 100 120 140 160 180 200 220 240

20

40

60

80

100

120

140

160

(d) SD4

Figure 4 Clustering results of two synthetic datasets

Table 4 Clustering accuracy

Dataset Iris Wine Seeds SubKDDClustering accuracy 8400 9663 9190 9435

Apparently clustering accuracies of the datasets WineSeeds and SubKDD are high while that of the dataset Iris isrelatively low for the reason that two clusters are nonlinearlyseparable

The clustering results of the proposed clustering validityindex on 4 synthetic datasets are shown in Figure 4 whichdepict the good partitions on the datasets SD1 and SD2 andthe slightly poor partition on SD3 and SD4 because of thecomplexity of their structure or nonconvexity When thenumber of clusters is set as 4 SD3 is divided into 4 groupsthat means the right and the left each have 2 groups

54 Experiment of the Optimal Number of Clusters At lastXie-Beni index Bensaid index Kwon index and the pro-posed index are respectively adopted for running of SAFCMso as to determine the optimal number of clustersThe results

Table 5 Optimal number of clusters estimated by several clusteringvalidity indices

Dataset 119881XB 119881B 119881K 119881R

Iris 2 9 2 2Wine 3 6 3 3Seeds 2 5 2 2SubKDD 10 10 9 4SD1 20 20 20 20SD2 4 4 4 4SD3 5 3 5 4SD4 2 8 2 3

are shown in Table 5 119881XB 119881B 119881K and 119881R represent Xie-Beniindex Bensaid index Kwon index and the proposed indexrespectively

It shows that for synthetic datasets SD1 and SD2 withsimple structure these indices can all obtain the optimalnumber of clusters For 3 UCI datasets SubKDD and SD4Bensaid index cannot obtain an accurate number of clustersexcept SD3 For the dataset Wine both Xie-Beni index and

10 Computational Intelligence and Neuroscience

Table 6 The value of clustering validity index on Iris

119888 119881XB 119881B 119881K 119881R

2 0114234 0223604 17384973 04736043 0223973 0124598 34572146 05515654 0316742 0099103 49279488 06154365 0560109 0089108 87540968 06763506 0574475 0072201 90563379 06913407 0400071 0067005 63328679 07353118 0275682 0036283 45736972 06140559 0250971 0027735 42868449 0584244

Table 7 The value of clustering validity index on Wine

119888 119881XB 119881B 119881K 119881R

2 0663406 1328291 11833902 15782913 0468902 0513071 83939058 13464214 mdash 0473254 mdash 17357915 mdash 0373668 mdash 18466866 mdash 0256698 mdash 1683222

Table 8 The value of clustering validity index on Seeds

119888 119881XB 119881B 119881K 119881R

2 0147247 0293609 31170349 05436093 0212127 0150899 45326216 05990014 0243483 0127720 52215334 06979435 0348842 0085477 75493654 0701153

Kwon index can obtain the accurate number of clusters whilefor the datasets Iris Seeds SubKDD SD3 and SD4 they onlyobtain the result approximate to the true number of clusterThe proposed index obtains the right result on the datasetsWine and SD4 and a better result compared toXie-Beni indexand Kwon index on the datasets SubKDD and SD3 whileit obtains the same result of the two indices on the datasetsIris and Seeds There are two reasons First the importancedegree of each property of the dataset is not considered inthe clustering process but assumed to be the same therebyaffecting the experimental result Second SAFCMhas a weakcapacity to deal with overlapping clusters or nonconvexitydataset These are the authorsrsquo subsequent research contents

Tables 6ndash13 list the detailed results of 4 indices on theexperimental datasets in each iteration of SAFCM For thedataset Wine when the number of clusters is 4 5 or 6the result based on Xie-Beni index or Kwon index is of noconvergence However the proposed index can achieve thedesired results

6 Conclusions

FCM is widely used in lots of fields But it needs to presetthe number of clusters and is greatly influenced by the initialcluster centroids This paper studies a self-adaptive methodfor determining the number of clusters by using of FCMalgorithm In this method a density-based algorithm is put

Table 9 The value of clustering validity index on SubKDD

119888 119881XB 119881B 119881K 119881R

2 0646989 1324434 550166676 15744313 0260755 0378775 222020838 10902894 0133843 0062126 119544560 05112385 0234402 0052499 202641204 05378526 0180728 0054938 156812271 05838007 0134636 0047514 119029265 06197208 0104511 0032849 919852740 06908739 0129721 0027639 740128470 056263610 006822 0027025 913528560 0528528

Table 10 The value of clustering validity index on SD1

119888 119881XB 119881B 119881K 119881R

2 0221693 0443390 44592968 06933903 0206035 0198853 40264251 07262454 0127731 0093653 26220200 06555505 0130781 0069848 27154867 06514656 0144894 0050067 22922121 06393257 0136562 0040275 29126152 06362588 0112480 0032874 24323625 06274429 0115090 0026833 24242580 062493610 0141415 0022611 28574579 061670111 0126680 0019256 28821707 061152412 0103178 0016634 23931865 060599013 0110355 0013253 26517065 058824614 0095513 0011083 23635022 057680815 0075928 0009817 19302095 056228916 0066025 0008824 17236138 055799017 0054314 0007248 14995284 054434118 0045398 0006090 13208810 053488219 0039492 0005365 11977437 052713120 0034598 0004917 10835952 0517481

Table 11 The value of clustering validity index on SD2

119888 119881XB 119881B 119881K 119881R

2 0066286 0132572 13281503 03825723 0068242 0063751 13752535 03942004 0060916 0028149 12309967 0378337

Table 12 The value of clustering validity index on SD3

119888 119881XB 119881B 119881K 119881R

2 0148379 0300899 131557269 05708763 0195663 0142149 173900551 05996804 0127512 0142150 113748947 05571985 0126561 0738070 113262975 0589535

forward at first which can estimate the maximum number ofclusters to reduce the search range of the optimal numberespecially being fit for the dataset on which the empiricalrule is inoperative Besides it can generate the high-quality

Computational Intelligence and Neuroscience 11

Table 13 The value of clustering validity index on SD4

119888 119881XB 119881B 119881K 119881R

2 0104434 0208832 991482710 04737483 0170326 0142561 162044450 04588324 0221884 0081007 211692699 05835295 0156253 0053094 157683921 06032116 0123191 0041799 118279116 05753967 0165465 0032411 107210082 05926258 0145164 0028698 139310969 0606049

initial cluster centroids so that the the clustering result isstable and the convergence of FCM is quick Then a newfuzzy clustering validity index was put forward based onfuzzy compactness and separation so that the clusteringresult is closer to global optimum The index is robust andinterpretable when the number of clusters tends to that ofobjects in the dataset Finally a self-adaptive FCM algorithmis proposed to determine the optimal number of clusters runin the iterative trial-and-error process

The contributions are validated by experimental resultsHowever in most cases each property plays a different rolein the clustering process In other words the weight ofproperties is not the same This issue will be focused on inthe authorsrsquo future work

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

This work is partially supported by the National NaturalScience Foundation of China (61373148 and 61502151) Shan-dong Province Natural Science Foundation (ZR2014FL010)Science Foundation ofMinistry of Education ofChina (14YJC860042) Shandong Province Outstanding Young ScientistAward Fund (BS2013DX033) Shandong Province Social Sci-ences Project (15CXWJ13 and 16CFXJ05) and Project ofShandongProvinceHigher Educational Science andTechnol-ogy Program (no J15LN02)

References

[1] S. Sambasivam and N. Theodosopoulos, "Advanced data clustering methods of mining web documents," Issues in Informing Science and Information Technology, vol. 3, pp. 563–579, 2006.

[2] J. Wang, S.-T. Wang, and Z.-H. Deng, "Survey on challenges in clustering analysis research," Control and Decision, vol. 3, 2012.

[3] C.-W. Bong and M. Rajeswari, "Multi-objective nature-inspired clustering and classification techniques for image segmentation," Applied Soft Computing Journal, vol. 11, no. 4, pp. 3271–3282, 2011.

[4] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014.

[5] L. A. Zadeh, "Fuzzy sets," Information and Computation, vol. 8, no. 3, pp. 338–353, 1965.

[6] J. C. Bezdek, "Numerical taxonomy with fuzzy sets," Journal of Mathematical Biology, vol. 1, no. 1, pp. 57–71, 1974.

[7] J. C. Bezdek, "Cluster validity with fuzzy sets," Journal of Cybernetics, vol. 3, no. 3, pp. 58–73, 1974.

[8] M. P. Windham, "Cluster validity for fuzzy clustering algorithms," Fuzzy Sets and Systems, vol. 5, no. 2, pp. 177–185, 1981.

[9] M. Lee, "Fuzzy cluster validity index based on object proximities defined over fuzzy partition matrices," in Proceedings of the IEEE International Conference on Fuzzy Systems and IEEE World Congress on Computational Intelligence (FUZZ-IEEE '08), pp. 336–340, June 2008.

[10] X.-B. Zhang and L. Jiang, "A new validity index of fuzzy c-means clustering," in Proceedings of the IEEE International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC '09), vol. 2, pp. 218–221, Hangzhou, China, August 2009.

[11] I. Saha, U. Maulik, and S. Bandyopadhyay, "A new differential evolution based fuzzy clustering for automatic cluster evolution," in Proceedings of the IEEE International Advance Computing Conference (IACC '09), pp. 706–711, IEEE, Patiala, India, March 2009.

[12] S. Yue, J.-S. Wang, T. Wu, and H. Wang, "A new separation measure for improving the effectiveness of validity indices," Information Sciences, vol. 180, no. 5, pp. 748–764, 2010.

[13] X. L. Xie and G. Beni, "A validity measure for fuzzy clustering," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 8, pp. 841–847, 1991.

[14] A. M. Bensaid, L. O. Hall, J. C. Bezdek, et al., "Validity-guided (re)clustering with applications to image segmentation," IEEE Transactions on Fuzzy Systems, vol. 4, no. 2, pp. 112–123, 1996.

[15] S. H. Kwon, "Cluster validity index for fuzzy clustering," Electronics Letters, vol. 34, no. 22, pp. 2176–2177, 1998.

[16] H. Sun, S. Wang, and Q. Jiang, "FCM-based model selection algorithms for determining the number of clusters," Pattern Recognition, vol. 37, no. 10, pp. 2027–2037, 2004.

[17] Y. Li and F. Yu, "A new validity function for fuzzy clustering," in Proceedings of the International Conference on Computational Intelligence and Natural Computing (CINC '09), vol. 1, pp. 462–465, IEEE, June 2009.

[18] S. Saha and S. Bandyopadhyay, "A new cluster validity index based on fuzzy granulation-degranulation criteria," in Proceedings of the IEEE International Conference on Advanced Computing and Communications (ADCOM '07), pp. 353–358, IEEE, Guwahati, India, 2007.

[19] M. Zhang, W. Zhang, H. Sicotte, and P. Yang, "A new validity measure for a correlation-based fuzzy c-means clustering algorithm," in Proceedings of the 31st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC '09), pp. 3865–3868, Minneapolis, Minn, USA, September 2009.

[20] Y.-I. Kim, D.-W. Kim, D. Lee, and K. H. Lee, "A cluster validation index for GK cluster analysis based on relative degree of sharing," Information Sciences, vol. 168, no. 1–4, pp. 225–242, 2004.

[21] B. Rezaee, "A cluster validity index for fuzzy clustering," Fuzzy Sets and Systems, vol. 161, no. 23, pp. 3014–3025, 2010.

[22] D. Zhang, M. Ji, J. Yang, Y. Zhang, and F. Xie, "A novel cluster validity index for fuzzy clustering based on bipartite modularity," Fuzzy Sets and Systems, vol. 253, pp. 122–137, 2014.

[23] J. Yu and Q. Cheng, "Search range of the optimal cluster number in fuzzy clustering," Science in China (Series E), vol. 32, no. 2, pp. 274–280, 2002.

[24] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[25] D. Pelleg and A. W. Moore, "X-means: extending K-means with efficient estimation of the number of clusters," in Proceedings of the 17th International Conference on Machine Learning (ICML '00), vol. 1, pp. 727–734, 2000.

[26] H. Frigui and R. Krishnapuram, "A robust competitive clustering algorithm with applications in computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 450–465, 1999.

[27] M.-S. Yang and K.-L. Wu, "A similarity-based robust clustering method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 4, pp. 434–448, 2004.

[28] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780–784, 2002.

[29] L. Yujian, "A clustering algorithm based on maximal θ-distant subtrees," Pattern Recognition, vol. 40, no. 5, pp. 1425–1431, 2007.

[30] Y. Shihong, H. Ti, and W. Penglong, "Matrix eigenvalue analysis-based clustering validity index," Journal of Tianjin University (Science and Technology), vol. 8, article 6, 2014.

[31] A. José-García and W. Gómez-Flores, "Automatic clustering using nature-inspired metaheuristics: a survey," Applied Soft Computing Journal, vol. 41, pp. 192–213, 2016.

[32] E. H. Ruspini, "New experimental results in fuzzy clustering," Information Sciences, vol. 6, pp. 273–284, 1973.

[33] J. C. Dunn, "Well-separated clusters and optimal fuzzy partitions," Journal of Cybernetics, vol. 4, no. 1, pp. 95–104, 1974.

[34] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Springer Science & Business Media, 2013.

[35] R. E. Hammah and J. H. Curran, "Fuzzy cluster algorithm for the automatic identification of joint sets," International Journal of Rock Mechanics and Mining Sciences, vol. 35, no. 7, pp. 889–905, 1998.

[36] N. R. Pal and J. C. Bezdek, "On cluster validity for the fuzzy c-means model," IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370–379, 1995.

[37] S. B. Zhou, Z. Y. Xu, and X. Q. Tang, "Method for determining optimal number of clusters based on affinity propagation clustering," Control and Decision, vol. 26, no. 8, pp. 1147–1152, 2011.


