
Expert Systems with Applications 37 (2010) 6942–6947


A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data

Dan Li *, Hong Gu, Liyong Zhang
School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116024, China

Article info

Keywords: Clustering; Fuzzy c-means; Incomplete data; Nearest-neighbor intervals

0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2010.03.028

* Corresponding author. Tel.: +86 411 82965258.
E-mail addresses: [email protected] (D. Li), [email protected] (H. Gu), [email protected] (L. Zhang).

Abstract

Partially missing data sets are a prevailing problem in clustering analysis. In this paper, missing attributes are represented as intervals, and a novel fuzzy c-means algorithm for incomplete data based on nearest-neighbor intervals is proposed. The algorithm estimates the nearest-neighbor interval representation of missing attributes by making full use of the attribute distribution information of the data sets, which enhances the robustness of missing attribute imputation compared with other numerical imputation methods. Also, the convex hyper-polyhedrons formed by interval prototypes can present the uncertainty of missing attributes and simultaneously reflect the shape of the clusters to some degree, which is helpful in enhancing the robustness of clustering analysis. Comparisons and analysis of the experimental results for several UCI data sets demonstrate the capability of the proposed algorithm.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

The fuzzy c-means (FCM) algorithm (Bezdek, 1981) is a useful tool for clustering, which partitions a real s-dimensional data set X = {x_1, x_2, ..., x_n} ⊂ R^s into several clusters to describe an underlying structure within the data, and has been extensively used in pattern recognition and data mining. However, in pattern classification applications, many data sets suffer from incompleteness, i.e. a data set X can contain vectors that are missing one or more of the attribute values as a result of failure in data collection, measurement errors, missing observations, random noise, etc., and FCM is not directly applicable to such incomplete data sets.

The problem of doing pattern recognition with incomplete data can be traced back to the 1960s, when Sebestyen (1962) introduced an approach based on probabilistic assumptions. Subsequently the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) was used to handle incomplete data and probabilistic clustering (McLachlan & Basford, 1988). In 1998, several methods were proposed for handling missing values in FCM (Miyamoto, Takata, & Umayahara, 1998). One basic strategy, imputation, replaces the missing values by the weighted averages of the corresponding attributes. Another approach, discarding/ignoring, ignores the missing values and calculates the distances from the remaining coordinates. In 2001, Hathaway and Bezdek proposed other strategies to continue the FCM clustering of incomplete data (Hathaway & Bezdek, 2001). One simple


strategy (the whole data strategy, WDS) removes all sample data that include missing values from the data set, but this strategy is not desirable because the elimination brings a loss of information. Another method uses the partial distance strategy (PDS), which calculates partial distances using all available attribute values, and scales this quantity by the reciprocal of the proportion of components used. Two further methods proposed by Hathaway and Bezdek (2001) belong to the imputation category, which involve computations to replace the missing values with estimates based on the available information. The optimal completion strategy (OCS) views the missing values as an optimization problem and imputes missing values in each iteration to find better estimates. The nearest prototype strategy (NPS) replaces missing values with the corresponding attributes of the nearest prototype. Besides the above methods, by taking into account the information about why data are missing, Timm, Doring, and Kruse (2004) developed a fuzzy clustering algorithm extended from the Gath and Geva algorithm. Hathaway and Bezdek (2002) used triangle inequality-based approximation schemes to cluster incomplete relational data, and Honda and Ichihashi (2004) partitioned incomplete data sets into several linear fuzzy clusters by extracting local principal components.

In this paper, by adopting the idea of the nearest-neighbor rule, a novel fuzzy c-means algorithm for incomplete data based on nearest-neighbor intervals (FCM-NNI) is proposed. Firstly, because of the uncertainty of missing attributes, missing attributes are represented by nearest-neighbor intervals (NNI) based on the nearest-neighbor information, which are more robust than the numerical values obtained by the imputation methods mentioned above. Secondly, the clustering problem can thus be viewed as clustering


for interval-valued data, which will result in interval cluster prototypes rather than point prototypes. Therefore, the convex hyper-polyhedrons formed by interval prototypes in the attribute space, as a kind of cluster prototype with a more complicated geometrical structure, can present the uncertainty of missing attributes and at the same time reflect the shape of the clusters to some degree, thus validating the robustness of the clustering pattern with more accurate clustering results.

This paper is organized as follows. Section 2 presents a short description of the FCM algorithm and the FCM clustering algorithm for interval-valued data (IFCM) based on clustering objective function minimization. The nearest-neighbor interval representation of missing attributes and the novel FCM-NNI algorithm are introduced in Section 3. Section 4 presents clustering results for several UCI data sets and a comparative study of our proposed algorithm with various other methods for handling missing values in FCM. Finally, conclusions are drawn in Section 5.

2. FCM clustering algorithm for interval-valued data

2.1. Fuzzy c-means algorithm

The fuzzy c-means (FCM) algorithm partitions a set of complete data X = {x_1, x_2, ..., x_n} ⊂ R^s into c (fuzzy) clusters by minimizing the clustering objective function

J(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, \| x_k - v_i \|_2^2,   (1)

with the constraint of

\sum_{i=1}^{c} u_{ik} = 1, \quad k = 1, 2, \ldots, n,   (2)

where x_k = [x_{1k}, x_{2k}, ..., x_{sk}]^T is an object datum, and x_{jk} is the jth attribute value of x_k; v_i ∈ R^s is the ith point cluster prototype, and let the matrix of cluster prototypes be V = [v_{ji}] = [v_1, v_2, ..., v_c] ∈ R^{s×c} for convenience; u_{ik} is the membership that represents the degree to which x_k belongs to the ith cluster, ∀i, k: u_{ik} ∈ [0, 1], and let the partition matrix be U = [u_{ik}] ∈ R^{c×n} for convenience; m ∈ (1, ∞) is a fuzzification parameter; and ‖·‖_2 denotes the Euclidean norm.

FCM uses the Lagrange multiplier method; let the Lagrange function be

J_a(U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \| x_k - v_i \|_2^2 + \sum_{k=1}^{n} \lambda_k \left( \sum_{i=1}^{c} u_{ik} - 1 \right),   (3)

where λ = [λ_1, λ_2, ..., λ_n]^T is the Lagrange multiplier vector, and the necessary conditions for minimizing (1) with the constraint of (2) are the following update equations (Bezdek, 1981):

v_i = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k}{\sum_{k=1}^{n} u_{ik}^{m}}, \quad i = 1, 2, \ldots, c,   (4)

and

u_{ik} = \left[ \sum_{t=1}^{c} \left( \frac{\| x_k - v_i \|_2^2}{\| x_k - v_t \|_2^2} \right)^{\frac{1}{m-1}} \right]^{-1}, \quad i = 1, 2, \ldots, c; \; k = 1, 2, \ldots, n.   (5)

The procedure of FCM can be described as follows.

Step 1: Choose m, c and ε, where ε > 0 is a small positive constant; then initialize the partition matrix U^(0).

Step 2: When the iteration index is l (l = 1, 2, ...), calculate the matrix of cluster prototypes V^(l) using (4) and U^(l−1).

Step 3: Update the partition matrix U^(l) using (5) and V^(l).

Step 4: If ∀i, k: max |u_{ik}^(l) − u_{ik}^(l−1)| < ε, then stop and return the partition matrix U and the matrix of cluster prototypes V; otherwise set l = l + 1 and return to Step 2.
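The four steps above can be sketched in NumPy. This is a minimal illustration of the updates (4)–(5) and the stopping rule, not the authors' implementation; the function name, array shapes and the random initialization are our own choices:

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-6, max_iter=100, seed=0):
    """Minimal fuzzy c-means: alternate Eq. (4) and Eq. (5) until U stabilizes."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                   # constraint (2): columns sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)     # Eq. (4): weighted means, shape (c, s)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2)  # squared distances, shape (c, n)
        w = np.fmax(d2, 1e-12) ** (-1.0 / (m - 1.0))     # guard against zero distance
        U_new = w / w.sum(axis=0)                        # Eq. (5)
        if np.abs(U_new - U).max() < eps:                # Step 4 stopping test
            return U_new, V
        U = U_new
    return U, V
```

For two well-separated groups the memberships become nearly crisp; `U[:, k]` holds the fuzzy memberships of datum k.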

2.2. FCM clustering algorithm for interval-valued data (IFCM)

Let X̄ = {x̄_1, x̄_2, ..., x̄_n} be an s-dimensional interval-valued data set to be partitioned into c (fuzzy) clusters, where x̄_k = [x̄_{1k}, x̄_{2k}, ..., x̄_{sk}]^T and ∀j, k: x̄_{jk} = [x⁻_{jk}, x⁺_{jk}]. The IFCM algorithm minimizes the objective function

J_I(U, \bar{V}) = \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{m} \, \| \bar{x}_k - \bar{v}_i \|_2^2,   (6)

with the constraint of (2), where v̄_i is the ith interval cluster prototype, and let the matrix of interval cluster prototypes be V̄ = [v̄_{ji}] = [v̄_1, v̄_2, ..., v̄_c], where v̄_{ji} = [v⁻_{ji}, v⁺_{ji}], ∀i = 1, 2, ..., c; j = 1, 2, ..., s. The Euclidean distance between x̄_k and v̄_i is defined as

\| \bar{x}_k - \bar{v}_i \|_2 = \left[ (x_k^- - v_i^-)^T (x_k^- - v_i^-) + (x_k^+ - v_i^+)^T (x_k^+ - v_i^+) \right]^{\frac{1}{2}},   (7)

where

x_k^- = [x_{1k}^-, x_{2k}^-, \ldots, x_{sk}^-]^T, \quad x_k^+ = [x_{1k}^+, x_{2k}^+, \ldots, x_{sk}^+]^T,
v_i^- = [v_{1i}^-, v_{2i}^-, \ldots, v_{si}^-]^T, \quad v_i^+ = [v_{1i}^+, v_{2i}^+, \ldots, v_{si}^+]^T.

The necessary conditions for minimizing (6) with the constraint of (2) are the following update equations (Yu & Fan, 2004):

v_i^- = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k^-}{\sum_{k=1}^{n} u_{ik}^{m}}, \quad i = 1, 2, \ldots, c,   (8)

and

v_i^+ = \frac{\sum_{k=1}^{n} u_{ik}^{m} x_k^+}{\sum_{k=1}^{n} u_{ik}^{m}}, \quad i = 1, 2, \ldots, c.   (9)

If ∃k, h, 1 ≤ k ≤ n, 1 ≤ h ≤ c, ∀j: x̄_{jk} ⊆ v̄_{jh}, that is, x̄_k is within the convex hyper-polyhedron formed by v̄_h, then x̄_k can be considered to belong fully to the hth cluster with membership 1, and to the other clusters with membership 0. Thus

u_{ik} = \begin{cases} 1, & i = h \\ 0, & i \neq h \end{cases} \quad \text{for } i = 1, 2, \ldots, c,   (10)

else

u_{ik} = \left[ \sum_{t=1}^{c} \left( \frac{\| \bar{x}_k - \bar{v}_i \|_2^2}{\| \bar{x}_k - \bar{v}_t \|_2^2} \right)^{\frac{1}{m-1}} \right]^{-1}, \quad i = 1, 2, \ldots, c.   (11)

The procedure of IFCM is similar to that of FCM, and hence is omitted here.
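As a concrete sketch, the IFCM updates (7)–(11) can be written as follows. The function name is ours, and passing intervals as separate lower/upper-bound matrices is an assumption of this sketch, not the paper's notation:

```python
import numpy as np

def ifcm(Xlo, Xhi, c, m=2.0, eps=1e-6, max_iter=100, seed=0):
    """Minimal IFCM; Xlo/Xhi are (n, s) matrices of lower/upper interval bounds."""
    n = Xlo.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                    # constraint (2)
    for _ in range(max_iter):
        Um = U ** m
        Vlo = (Um @ Xlo) / Um.sum(axis=1, keepdims=True)  # Eq. (8)
        Vhi = (Um @ Xhi) / Um.sum(axis=1, keepdims=True)  # Eq. (9)
        d2 = (((Xlo[None] - Vlo[:, None]) ** 2).sum(axis=2)
              + ((Xhi[None] - Vhi[:, None]) ** 2).sum(axis=2))  # Eq. (7), squared
        w = np.fmax(d2, 1e-12) ** (-1.0 / (m - 1.0))
        U_new = w / w.sum(axis=0)                         # Eq. (11)
        # Eq. (10): a datum whose interval lies inside a prototype's
        # hyper-polyhedron gets crisp membership in that cluster.
        inside = ((Xlo[None] >= Vlo[:, None]) & (Xhi[None] <= Vhi[:, None])).all(axis=2)
        for k in range(n):
            h = np.nonzero(inside[:, k])[0]
            if h.size:
                U_new[:, k] = 0.0
                U_new[h[0], k] = 1.0
        if np.abs(U_new - U).max() < eps:
            return U_new, Vlo, Vhi
        U = U_new
    return U, Vlo, Vhi
```

Degenerate intervals (equal lower and upper bounds) reduce this to ordinary FCM on point data.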

3. Fuzzy c-means algorithm for incomplete data based on nearest-neighbor intervals

3.1. Nearest-neighbor interval determination

Recently, the use of nearest-neighbor (NN) based techniques has been proposed for the imputation of missing values. A simple NN imputation method is to substitute the missing attribute by the corresponding attribute of the nearest neighbor (Stade, 1996). In another popular approach, k-nearest-neighbor imputation (Acuna & Rodriguez, 2004), missing attributes are supplemented by the mean value of the attribute in the k nearest neighbors. Subsequently, many similarity measures other than the Euclidean distance have been introduced for searching for nearest neighbors (Huang & Zhu, 2002; Huang, 2006). All the approaches mentioned above develop


numerical imputation values, which are unsuitable for representing the uncertainty of missing attributes.

The nearest-neighbor interval determination is given as follows: let X̃ = {x̃_1, x̃_2, ..., x̃_n} be an s-dimensional incomplete data set, which contains at least one incomplete datum with some (but not all) missing attribute values. For an incomplete datum x̃_b, the method searches for its q nearest neighbors within the data set based on the partial Euclidean distance. The formula for the partial distance between x̃_b and an instance x̃_p (incomplete or complete) is given by

D_{pb} = \frac{1}{\sum_{j=1}^{s} I_j} \sum_{j=1}^{s} (\tilde{x}_{jb} - \tilde{x}_{jp})^2 I_j,   (12)

where x̃_{jb} and x̃_{jp} are the jth attributes of x̃_b and x̃_p respectively, and

I_j = \begin{cases} 1, & \text{if both } \tilde{x}_{jb} \text{ and } \tilde{x}_{jp} \text{ are nonmissing} \\ 0, & \text{otherwise} \end{cases}   (13)

for p, b = 1, 2, ..., n and j = 1, 2, ..., s.

From (12), it can be seen that the approach makes full use of the attribute information of both complete data and incomplete data (nonmissing attributes) to search for the nearest neighbors. From the nearest-neighbor perspective, an instance x̃_b and its nearest neighbor x̃_p share the same or similar attributes. So, by searching for the q nearest neighbors of an incomplete datum x̃_b, the value range of its missing attribute x̃_{jb} can be determined by the minimum and maximum of the neighbors' corresponding attribute values, which form the nearest-neighbor interval (NNI) of x̃_{jb}. Thus the data set X̃ = {x̃_1, x̃_2, ..., x̃_n} can be transformed into an interval-valued one. In this transformed data set, a missing attribute x̃_{jb} can be represented by its corresponding nearest-neighbor interval [x⁻_{jb}, x⁺_{jb}], and a nonmissing attribute x̃_{jw} can be rewritten in interval form [x⁻_{jw}, x⁺_{jw}]. Note that x⁻_{jw} = x⁺_{jw} = x̃_{jw}, that is, the original values are unchanged.
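The partial distance (12)–(13) and the NNI construction can be sketched as below, with missing attributes encoded as NaN; the encoding and the function names are our assumptions:

```python
import numpy as np

def partial_distance(xb, xp):
    """Partial squared distance of Eqs. (12)-(13); NaN marks a missing attribute."""
    I = ~np.isnan(xb) & ~np.isnan(xp)            # indicator I_j of Eq. (13)
    if not I.any():
        return np.inf                            # no shared attributes
    diff = xb[I] - xp[I]
    return (diff @ diff) / I.sum()               # Eq. (12)

def nni_transform(X, q):
    """Turn an incomplete data set into an interval-valued one (Section 3.1):
    each missing attribute becomes the min/max over its q nearest neighbors."""
    n, _ = X.shape
    lo, hi = X.copy(), X.copy()
    for b in range(n):
        missing = np.nonzero(np.isnan(X[b]))[0]
        if missing.size == 0:
            continue
        d = np.array([partial_distance(X[b], X[p]) if p != b else np.inf
                      for p in range(n)])
        nbrs = np.argsort(d)[:q]                 # q nearest neighbors of datum b
        for j in missing:
            vals = X[nbrs, j]
            vals = vals[~np.isnan(vals)]         # a neighbor may miss attribute j too
            lo[b, j], hi[b, j] = vals.min(), vals.max()
    return lo, hi
```

Nonmissing attributes come out as degenerate intervals with lo == hi, matching the paper's convention.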

Then, the clustering problem for incomplete data sets can be solved by using IFCM as described in the last section. From the IFCM algorithm, it can be seen that the clustering results rely on the NNI of missing values, so the determination of q (the number of nearest neighbors to be searched) is an important issue. If the selected q is too small, the achieved NNI is likely to be a biased estimate; and if the selected q is too large, the clustering performance will also be affected greatly. To put it simply, in the extreme situation where q is as large as n (the number of samples in the data set), according to IFCM there will be numerous misclassifications, for the range of the NNI is too large to represent the missing values properly, which will confuse the attribute characteristics among different clusters and result in unreasonable clustering results. In this paper, the determination of q follows the strategy described below. It is obvious that q is related to the percentage of missing attributes, the sample distribution of the clusters, etc. Thus, our determination strategy is to randomly designate one nonmissing attribute as missing in the given incomplete data set, and test whether the nonmissing attribute falls within the corresponding NNI, assuming q = 2, 3, ... The process should be performed Z times to estimate the percent probability; then the point q0 where the curve of percentages changes slope can be found. At this point, the probability that these nonmissing attributes fall within the corresponding NNI is small when q < q0, and the probability changes little when q > q0, so q = q0 can be selected as the number of nearest neighbors for the incomplete data set. The strategy selects q based on the characteristics of the data set, and makes full use of the attribute distribution information of the data sets, which can reduce the bias of the NNI estimation and avoid the attribute confusion among clusters resulting from excessive estimation of interval ranges. As for the value of Z, which is not a significant factor in the problem, we can adopt 10% of the nonmissing attributes to estimate the percent probability.
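The q-selection strategy can be sketched as a hit-rate estimate: temporarily hide a nonmissing attribute, rebuild its NNI, and check whether the true value falls inside. Everything below (the function name, and a fixed number of trials standing in for Z) is our assumption:

```python
import numpy as np

def nni_hit_rate(X, q, trials=30, seed=0):
    """Fraction of hidden nonmissing attributes that fall inside their
    q-nearest-neighbor interval (the curve whose knee gives q0)."""
    rng = np.random.default_rng(seed)
    n, s = X.shape
    hits = 0
    for _ in range(trials):
        b, j = int(rng.integers(n)), int(rng.integers(s))
        while np.isnan(X[b, j]):                 # pick a nonmissing cell
            b, j = int(rng.integers(n)), int(rng.integers(s))
        true_value = X[b, j]
        xb = X[b].copy()
        xb[j] = np.nan                           # hide it
        d = np.full(n, np.inf)
        for p in range(n):
            if p == b:
                continue
            I = ~np.isnan(xb) & ~np.isnan(X[p])                  # Eq. (13)
            if I.any():
                d[p] = ((xb[I] - X[p][I]) ** 2).sum() / I.sum()  # Eq. (12)
        vals = X[np.argsort(d)[:q], j]
        vals = vals[~np.isnan(vals)]
        if vals.size and vals.min() <= true_value <= vals.max():
            hits += 1
    return hits / trials
```

Evaluating `nni_hit_rate(X, q)` for q = 2, 3, ... and taking the point where the curve's slope flattens reproduces the paper's choice of q0.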

Clearly, compared with other imputation methods for incomplete data, which result in numerical values, the NNI representation can produce a more robust estimation when the proportion of incomplete data is small, which is more suitable for presenting the uncertainty of missing attributes.

3.2. Fuzzy c-means algorithm for incomplete data based on NNI

The fuzzy c-means algorithm for incomplete data based on NNI (FCM-NNI) proposed here deals with the clustering problem for incomplete data by transforming the data set into an interval-valued one, which will result in interval cluster prototypes rather than point prototypes. Moreover, the convex hyper-polyhedrons formed by interval prototypes in the attribute space, as a kind of cluster prototype with a more complicated geometrical structure, can represent the uncertainty of missing attributes. For a missing attribute x̃_{jb} of an incomplete datum x̃_b, the range of its nearest-neighbor interval [x⁻_{jb}, x⁺_{jb}] based on the q nearest neighbors of x̃_b will be large if the jth attributes are dispersive within the clusters of the incomplete data set X̃, whereas the range will be small if the jth attributes are compact within the clusters. From (8) and (9), it can be seen that the lower and upper boundaries of the interval prototypes are determined by the boundaries of the interval attributes, especially by the NNI of missing attributes. Thus, the boundaries of the interval prototypes are essentially determined by the distributions of attributes in the clusters, that is, by the geometrical structure of the clusters. Therefore, the convex hyper-polyhedrons formed by interval prototypes can present to some extent the shape of the clusters and the sample distribution of the data set. It is known that FCM generally detects clusters with spherical shapes (Dave, 1992), so the proposed FCM-NNI can overcome this drawback of FCM and validates the robustness of the clustering pattern.

For an s-dimensional incomplete data set X̃ = {x̃_1, x̃_2, ..., x̃_n}, the procedure of the fuzzy c-means algorithm for incomplete data based on NNI (FCM-NNI) can be described as follows.

Step 1: Determine the number of nearest neighbors q using the determination strategy.

Step 2: For each incomplete instance x̃_b, find its q nearest neighbors according to the partial distance defined by (12), and determine the NNI [x⁻_{jb}, x⁺_{jb}] of each missing attribute x̃_{jb}. Rewrite each nonmissing attribute x̃_{jw} of incomplete and complete data into its interval form [x⁻_{jw}, x⁺_{jw}], where x⁻_{jw} = x⁺_{jw}.

Step 3: Choose m, c and ε, and initialize the partition matrix U^(0).

Step 4: When the iteration index is l (l = 1, 2, ...), calculate the matrices of interval cluster prototypes V^−(l) and V^+(l) using (8), (9) and U^(l−1).

Step 5: Update the partition matrix U^(l) using (10), (11) and V̄^(l).

Step 6: If ∀i, k: max |u_{ik}^(l) − u_{ik}^(l−1)| < ε, then stop and return the partition matrix U and the matrix of interval cluster prototypes V̄; otherwise set l = l + 1 and return to Step 4.

4. Numerical experiments

4.1. Data sets

In the experiments presented below, we tested the performance of the proposed FCM-NNI on three well-known data sets: IRIS, Wine, and Bupa Liver Disorders. All of these databases are taken from the UCI machine learning repository (Hettich, Blake, & Merz, 1998), and are often used as standard databases to test the performance of clustering algorithms.


The IRIS data set contains 150 four-dimensional attribute vectors, depicting four attributes of iris flowers: Petal Length, Petal Width, Sepal Length and Sepal Width. The three IRIS classes involved are Setosa, Versicolor and Virginica, each containing 50 vectors. Setosa is well separated from the others, while Versicolor and Virginica are not easily separable due to the overlapping of their vectors. Hathaway and Bezdek (1995) presented the actual cluster prototypes of the IRIS data:

V^* = \begin{bmatrix} 5.00 & 5.93 & 6.58 \\ 3.42 & 2.77 & 2.97 \\ 1.46 & 4.26 & 5.55 \\ 0.24 & 1.32 & 2.02 \end{bmatrix}   (14)

The Wine data set contains the results of a chemical analysis of wines grown in the same region but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. It contains 178 data points.

Table 1
The values of q on the three data sets with four different missing percentages.

Data sets   5% missing   10% missing   15% missing   20% missing
IRIS        6            5             7             5
Wine        5            5             5             5
Bupa        5            5             5             5

Table 2
Averaged results of 30 trials using incomplete IRIS data set.

% Missing | Mean number of iterations to termination | Mean number of misclassifications | Mean prototype error
          | WDS   PDS   OCS   NPS   NNI              | WDS   PDS   OCS   NPS   NNI       | WDS    PDS    OCS    NPS    NNI
0         | 25.37 25.23 25.93 25.03 25.47            | 16    16    16    16    16        | 0.0501 0.0501 0.0501 0.0501 0.0501
5         | 25.00 25.37 31.87 26.37 26.00            | 16.87 17.03 16.80 16.73 16.47     | 0.0692 0.0535 0.0515 0.0524 0.0466
10        | 26.50 26.53 39.33 27.53 24.50            | 16.60 17.07 17.07 16.80 16.70     | 0.0828 0.0587 0.0562 0.0584 0.0417
15        | 25.60 25.93 36.77 28.10 24.30            | 16.60 17.07 17.43 17.00 16.13     | 0.1231 0.0639 0.0632 0.0634 0.0402
20        | 27.20 26.80 38.53 29.27 23.03            | 16.90 16.97 17.33 16.63 16.33     | 0.1606 0.0677 0.0644 0.0657 0.0430

Table 3
Averaged results of 30 trials using incomplete Wine data set.

% Missing | Mean number of iterations to termination | Mean number of misclassifications
          | WDS   PDS   OCS   NPS   NNI              | WDS   PDS   OCS   NPS   NNI
0         | 23.13 23.40 23.43 23.63 23.00            | 9     9     9     9     9
5         | 22.83 23.83 26.53 24.57 23.13            | 10.20 10.40 10.60 10.37 10.00
10        | 26.37 23.30 31.33 26.60 25.40            | 12.17 10.57 10.70 10.37 10.07
15        | 32.47 24.20 39.97 29.97 25.27            | 20.87 11.20 11.73 11.27 10.83
20        | 39.07 23.50 46.33 33.67 26.30            | 39.23 12.43 12.77 12.57 11.63

Table 4
Averaged results of 30 trials using incomplete Bupa data set.

% Missing | Mean number of iterations to termination | Mean number of misclassifications
          | WDS   PDS   OCS   NPS   NNI              | WDS    PDS    OCS    NPS    NNI
0         | 33.27 34.03 34.10 34.90 34.37            | 177    177    177    177    177
5         | 35.30 34.17 40.87 36.33 36.27            | 177.47 177.27 177.50 177.23 176.87
10        | 38.60 33.93 45.53 41.13 41.13            | 176.40 177.00 176.67 177.00 176.43
15        | 34.60 33.07 68.87 43.07 49.90            | 177.80 178.53 177.57 178.47 174.60
20        | 40.10 33.53 58.03 46.07 75.93            | 178.33 179.17 177.47 179.00 176.63

The Bupa Liver Disorders data set includes 345 samples in six-dimensional space. The first five attributes are all blood tests which are thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. Each data point constitutes the record of a single male individual, and the data set has two clusters.

In this paper, we consider clustering problems where the attributes are missing completely at random (MCAR). The scheme for artificially generating an incomplete data set X̃ is to randomly select a specified percentage of components and designate them as missing. The random selection of missing attribute values is constrained so that (Hathaway & Bezdek, 2001)

(i) each original attribute vector x̃_k retains at least one component;

(ii) each attribute has at least one value present in the incomplete data set X̃.
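This missing-data generation scheme can be sketched as below; the function name and the NaN encoding of missing values are our own:

```python
import numpy as np

def make_incomplete(X, pct, seed=0):
    """Mark roughly pct of all entries as missing (MCAR), honouring constraints
    (i) and (ii): every row and every column keeps at least one value."""
    rng = np.random.default_rng(seed)
    n, s = X.shape
    target = int(round(pct * n * s))
    Xm = X.astype(float).copy()
    removed = 0
    for cell in rng.permutation(n * s):              # visit cells in random order
        if removed == target:
            break
        r, c = divmod(int(cell), s)
        row_ok = (~np.isnan(Xm[r])).sum() > 1        # (i) row keeps >= 1 component
        col_ok = (~np.isnan(Xm[:, c])).sum() > 1     # (ii) column keeps >= 1 value
        if row_ok and col_ok:
            Xm[r, c] = np.nan
            removed += 1
    return Xm
```

Because cells blocked by the constraints are simply skipped, the requested percentage is reached whenever it is feasible for the given data set shape.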

4.2. Experimental results

To test the clustering performance, the clustering results of FCM-NNI and those of the WDS, PDS, OCS, and NPS versions of FCM proposed by Hathaway and Bezdek (2001) are compared. These last four versions of FCM use optimal FCM cluster prototypes obtained from the original (complete) data as the initial cluster prototypes, and the corresponding stopping criterion is ‖V^(l) − V^(l−1)‖ < ε. In practical clustering problems for incomplete data, optimal FCM cluster prototypes are unavailable. So, the initialization of the five approaches here is a partition matrix U^(0) that satisfies (2), and the corresponding stopping criterion is


Page 5: A fuzzy c-means clustering algorithm based on nearest-neighbor intervals for incomplete data

[Fig. 1. Change curves of averaged NNI accuracies of 30 trials: (a) IRIS, (b) Wine, (c) Bupa. Each panel plots the NNI accuracy (40–100%) against the number of nearest neighbors q (2–10), with one curve per missing percentage (5%, 10%, 15%, 20%).]


‖U^(l) − U^(l−1)‖ < ε. For the same reason, the missing attributes are randomly initialized in OCS-FCM and NPS-FCM.

We choose the fuzzification parameter m = 2 and the convergence threshold ε = 10⁻⁶; the values of q on the three incomplete data sets with four different missing percentages are shown in Table 1 and will be discussed in the following subsection.

To eliminate the variation in the results from trial to trial, Tables 2–4 present the averages obtained over 30 trials on the incomplete IRIS, Wine and Bupa data sets, and the same incomplete data set is used in each trial for each of the five approaches, so that the results can be compared fairly.

Since the actual cluster prototypes of the IRIS data are already known, the last five columns in Table 2 show the mean prototype error calculated by

\| V - V^* \|_2^2 = \sum_{j=1}^{s} \sum_{i=1}^{c} (v_{ji} - v_{ji}^*)^2,   (15)

where V* is the matrix of actual cluster prototypes of the IRIS data as shown in (14); as FCM-NNI results in interval cluster prototypes, their medians are used to calculate the error. The optimal solutions in each row are highlighted in bold, and the suboptimal solutions are underlined.
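The prototype-error computation of Eq. (15), with interval medians for FCM-NNI, can be written as follows. Note that the column order of V must already match V*; cluster alignment is assumed here, not handled:

```python
import numpy as np

# Actual IRIS prototypes V* from Eq. (14), shape (s, c) = (4, 3).
V_STAR = np.array([[5.00, 5.93, 6.58],
                   [3.42, 2.77, 2.97],
                   [1.46, 4.26, 5.55],
                   [0.24, 1.32, 2.02]])

def prototype_error(V_lo, V_hi, V_star=V_STAR):
    """Eq. (15): squared Frobenius distance to V*, using interval medians."""
    V = (V_lo + V_hi) / 2.0                    # median of each interval [v-, v+]
    return float(((V - V_star) ** 2).sum())
```

A point-prototype algorithm simply passes the same matrix as both bounds, so the same function covers all five compared methods.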

4.3. Discussion

(1) Overall comparison: From Tables 2–4, it can be seen that the different algorithms result in different numbers of misclassifications and prototype errors for the different methods used to handle missing values. In terms of misclassification error, FCM-NNI is always the best performer except for the 10% cases of the incomplete IRIS and Bupa data sets, where FCM-NNI gives suboptimal solutions. And for the incomplete IRIS data set, the cluster prototypes obtained by FCM-NNI are closer to the actual ones, based on the errors in the last five columns of Table 2. As for the convergence of the iterative calculations, FCM-NNI needs minimal or subminimal numbers of iterations to achieve convergence in almost all cases on the incomplete IRIS and Wine data sets, while it results in a slower convergence rate on the Bupa data set. Because cluster analysis is an off-line data analysis approach, the convergence rate is not as important as the number of misclassifications and the prototype errors.

(2) Comparison of FCM-NNI with WDS-FCM and PDS-FCM: The methods that deal with incomplete data in these two algorithms belong to the discarding/ignoring category. PDS-FCM ignores missing attributes in incomplete data and scales the partial distances by the reciprocal of the proportion of components used, so the distribution information of missing attributes implicitly embodied in the other data is not taken into account. WDS-FCM simply deletes all incomplete data and applies FCM to the remaining complete data, which can directly provide cluster membership information for the complete data, while the cluster membership for incomplete data is estimated by a nearest prototype classification scheme based on a partial distance similar to that of PDS-FCM. It is not surprising that WDS-FCM may yield unreliable or distorted clustering results, since the elimination brings a loss of information. Compared with these two methods, FCM-NNI makes full use of the attribute distribution information of the data sets, including complete data and the nonmissing attributes of incomplete data, to search for nearest neighbors, which can recover missing attributes by interval estimation, and carry out clustering and other pattern analysis on the recovered data sets.

(3) Comparison of FCM-NNI with OCS-FCM and NPS-FCM: The methods that deal with incomplete data in these two algorithms belong to the imputation category. OCS-FCM views the missing attributes as additional variables, and completes them by adding an optimization step during each iteration to obtain the smallest possible value of the clustering objective function of FCM; NPS-FCM, as a simple modification of OCS-FCM, replaces missing attributes by the corresponding attributes of the nearest prototype during each iteration. Both methods can achieve reliable estimation of missing attributes based on the available attribute information. Compared with these two methods, FCM-NNI achieves interval estimation of missing attributes by using the nearest-neighbor information of incomplete data, which represents the uncertainty of missing attributes and makes the representation more robust. Furthermore, the cluster prototypes in FCM-NNI are convex hyper-polyhedrons formed by interval prototypes in the attribute space. FCM-NNI has advantages over clustering algorithms with point prototypes, since this kind of cluster prototype, with its more complicated geometrical structure, can present the uncertainty of missing attributes, the shape of clusters and the sample distribution to some degree, thus validating the robustness of the clustering pattern with more satisfying clustering results.

(4) The number of nearest neighbors q: According to the determination strategy for q mentioned above, we randomly designate 10% of the nonmissing attributes as artificially missing in the already incomplete IRIS, Wine and Bupa data sets respectively. To eliminate the variation in the results from different missing cases, 30 trials for each of the four missing percentages are considered. Fig. 1 graphs the average percentage probability that these nonmissing attributes fall within the corresponding NNI (NNI accuracy) for different q. Thereafter, a q-NNI accuracy difference calculation can be used to determine the proper q in the different missing cases of the three incomplete data sets. It can be seen that for the incomplete IRIS data set, the points where the curves of the percentages change from sharp to smooth are q0 = 6 and q0 = 7 when 5% and 15% of the attributes are missing, while for the other two missing percentages of the incomplete IRIS data set and all four missing percentages of the incomplete Wine and Bupa data sets, the points are q0 = 5. Thus, we get the values of q for the three incomplete data sets with four different missing percentages as shown in Table 1. It can be seen that there is as yet no theoretical basis for the selection of q, and further research is needed to investigate this problem.

5. Conclusion

This paper addresses the problem of clustering incomplete data, and proposes a novel fuzzy c-means algorithm for incomplete data based on nearest-neighbor intervals (FCM-NNI). The proposed algorithm has two main advantages. Firstly, interval estimations of missing attributes are obtained by making full use of the attribute distribution information of the data sets, which is superior in expressing the uncertainty of missing attributes and enhances the robustness of the missing attribute representation. Secondly, the cluster prototypes in FCM-NNI are convex hyper-polyhedrons formed by interval prototypes in the attribute space, which can present the shape of the clusters and the sample distribution to some degree, thus validating the robustness of the clustering pattern with more satisfying classification results. The results reported in this paper are optimal or suboptimal on three famous UCI data sets, which shows that our proposed algorithm is general, simple and appropriate for the clustering of incomplete data. Our further work will focus on the selection of q with a theoretical basis, and on the determination of nearest-neighbor intervals of missing attributes when the missing percentage is large, which will help to extend FCM-NNI to solve clustering problems of incomplete data with various missing percentages.

Acknowledgments

The authors would like to express their gratitude to the editor, associate editor, and all reviewers for their valuable suggestions for improving this manuscript.

References

Acuna, E., & Rodriguez, C. (2004). The treatment of missing values and its effect in the classifier accuracy. Classification, Clustering and Data Mining Applications, 3, 639–648.

Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. NY: Plenum Press.

Dave, R. N. (1992). Generalized fuzzy c-shells clustering and detection of circular and elliptical boundaries. Pattern Recognition, 25, 713–721.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.

Hathaway, R. J., & Bezdek, J. C. (1995). Optimization of clustering criteria by reformulation. IEEE Transactions on Fuzzy Systems, 3(2), 241–245.

Hathaway, R. J., & Bezdek, J. C. (2001). Fuzzy c-means clustering of incomplete data. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 31(5), 735–744.

Hathaway, R. J., & Bezdek, J. C. (2002). Clustering incomplete relational data using the non-Euclidean relational fuzzy c-means algorithm. Pattern Recognition Letters, 23, 151–160.

Hettich, S., Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. Department of Information and Computer Science, University of California, Irvine, CA. <http://www.ics.uci.edu/~mlearn/MLRepository.html>.

Honda, K., & Ichihashi, H. (2004). Linear fuzzy clustering techniques with missing values and their application to local principal component analysis. IEEE Transactions on Fuzzy Systems, 12(2), 183–193.

Huang, C. C. (2006). A novel gray-based reduced NN classification method. Pattern Recognition, 39, 1979–1986.

Huang, X., & Zhu, Q. (2002). A pseudo-nearest-neighbor approach for missing data recovery on Gaussian random data sets. Pattern Recognition Letters, 23, 1613–1622.

McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering. NY: Marcel Dekker.

Miyamoto, S., Takata, O., & Umayahara, K. (1998). Handling missing values in fuzzy c-means. In Proceedings of the third Asian fuzzy systems symposium, Masan, Korea (pp. 139–142).

Sebestyen, G. S. (1962). Decision-making process in pattern recognition. NY: Macmillan Press.

Stade, I. (1996). Hot deck imputation procedures. In Incomplete data in sample survey: Symposium on incomplete data proceedings (pp. 225–248).

Timm, H., Doring, C., & Kruse, R. (2004). Different approaches to fuzzy clustering of incomplete datasets. International Journal of Approximate Reasoning, 35, 239–249.

Yu, C. H., & Fan, Z. P. (2004). A FCM clustering algorithm for multiple attribute information with interval numbers. Journal of System Engineering, 19(4), 387–393 (in Chinese).