
ELSEVIER

June 1994

Pattern Recognition Letters 15 (1994) 543-549

Pattern Recognition Letters

An automatic and stable clustering algorithm

Shri Kant *, T.L. Rao, P.N. Sundaram

Scientific Analysis Group, R&D Organisation, Metcalfe House, Delhi 110 054, India

Received 24 February 1992

Abstract

A new non-hierarchical and non-conventional clustering algorithm has been developed to remove the effect of the initial guess (seed points) on the performance and convergence of the "Moving Centroid Method". The classification is also least affected by the ordering of the data presented for clustering. The most appropriate cluster centres are obtained by studying the class strings of the initial partitions, which are generated with the help of random integers. These stable cluster centres lead to stable partitions in an unsupervised manner.

Keywords: Clustering; Class-strings; Seed points; Stable clusters; Random partitions; Natural association

1. Introduction

"Clustering" is the process of obtained natural classes without any a priori knowledge of prototype classification. It searches the most natural structure of categories within given complex bodies of a large data set. In other category sorting problems such as pattern classification and discriminant analysis, either the category structure is known a priori or a part of the structure is known and missing information is ex- tracted from the labelled patterns. The objective is to fit any new observation P~ into its respective category Ci. But in cluster analysis the operational objective is to search for natural categories which fit the obser- vations in such a way that the degree of NaturalAs- sociation among the patterns within a cluster is high and between the patterns of clusters is low.

A variety of cluster seeking algorithms are available in the literature (Anderberg, 1973; Everitt, 1981; Forgy, 1965). Broadly they can be classified as:

* Corresponding author.

(a) hierarchical methods (agglomerative & divisive),

(b) non-hierarchical methods (direct & indirect).

In agglomerative procedures the individual patterns are ordered in such a way that two individual patterns in the same cluster at any level remain together at all higher levels. The hierarchy is decided by the linkage method, viz. single linkage (nearest neighbour), complete linkage (farthest neighbour), group average link, weighted average link, and the centroid and median methods. A good comparison of these methods on random data patterns is discussed by Jain, Indrayan and Goel (1985). Murthy and Rao (1974) have described a clustering technique based on various kinds of optimization of within-cluster and between-cluster distances. Narasimha Murty and Krishna (1980) have given a computationally efficient agglomerative algorithm based on multilevel theory.

In divisive procedures the set of n patterns is divided into smaller clusters by binary division of an existing cluster; hence the partitions into a different number of clusters are hierarchically nested. Since n individual patterns can be divided into two subsets in $(2^{n-1} - 1)$ ways, as the size of the data set increases it rapidly becomes computationally infeasible to examine all possible partitions. For a moderately large data set, two types of divisive techniques are discussed in (Everitt, 1981; Enslein et al., 1977, Ch. 11; Forgy, 1965). One is monothetic division, which is based on the states of a single specified attribute, and the other is polythetic division, which is based on the values taken by all the attributes. The method of maximal predictive classification proposed by Gower (1974) is similar to monothetic division.

Non-hierarchical algorithms are characterized as direct methods if no criterion function is used to optimize the performance of the classification, e.g. the centre adjustment algorithm, the guard zone algorithm (Meisel, 1972), the generalized guard zone algorithm (Pathak and Pal, 1986) and the dynamic guard zone algorithm (Pal et al., 1988).

On the other hand, indirect methods are based upon certain criterion functions (also known as performance indices), which are optimized (maximized or minimized) to get the best fit of the data into categories. Some well-known algorithms are the K-means algorithm (MacQueen, 1967), ISODATA (Tou and Gonzalez, 1974), the dynamic clusters method (Diday, 1973) and DYNOC (Tou, 1979).

Proximity measures (similarity and dissimilarity) play an important role in establishing a rule for assigning patterns or data points to the domain of a particular cluster centre. A detailed list of proximity measures with their interpretations is available in (Anderberg, 1973, Chs. 4-5; Gordon, 1981; Everitt, 1981; Sneath and Sokal, 1963).

In most of the non-hierarchical methods, particularly the centre adjustment algorithm, it has been found that after the clustering procedure and the proximity measure are specified, the performance of the algorithm depends heavily on the initial guess of the cluster centres and also on the order in which the data set is exposed for classification. There are various properties which should be possessed by all classification methods. Two of the more desirable properties, "stability" and "objectivity", are generally not possessed by many classification methods. "Stability" of a classification means that the classification is little affected by small errors in recording the features characterizing each pattern, by the addition of a few new data patterns to the existing data set, or by the recording of a few new features for each pattern in the data set. "Objectivity" in this context stands for continuous repeatability of results, that is, different analysts analysing the same numerical data patterns should arrive at the same conclusion.

The present algorithm, the "Automatic and Stable Clustering Algorithm" (ASCA), has been developed to circumvent the above-mentioned major pitfalls of the general moving centroid clustering algorithm. The choice of initial centres is automatic and provides stable partitions. We have used clustering around moving centres repeatedly in various steps of ASCA. Initial cluster centres are obtained with the use of class strings, which are generated with the help of random numbers.

2. Clustering around moving centres

Forgy (1965), Ball and Hall (1967), MacQueen (1967) and several other authors have given this algorithm independently. It is essential to describe the algorithm here, because we have used a slightly modified form of it in our robust cluster seeking procedure. The steps of the algorithm are as follows.

Step 1. Let $N_D$ data patterns characterized by $N_V$ features be partitioned into $N_C$ classes. Obtain $N_C$ provisional cluster centres

$$C_1^0, C_2^0, \ldots, C_i^0, \ldots, C_{N_C}^0.$$

Step 2. Assign all the $N_D$ patterns to the cluster domains of the $C_i^0$ ($i = 1, 2, \ldots, N_C$) to which they are closest. To decide about the degree of closeness or resemblance, we have used the following dissimilarity and similarity measures.

Let the two patterns $X$ and $Y$ and their mean scores be

$$X = [x_1, x_2, \ldots, x_m], \qquad Y = [y_1, y_2, \ldots, y_m],$$

$$\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i \quad \text{and} \quad \bar{y} = \frac{1}{m}\sum_{i=1}^{m} y_i. \qquad (2.1)$$

(i) Angle between two vectors. The cosine of the angle is a measure of similarity between two patterns $X$ and $Y$, defined by Anderberg (1973) as

$$A(X, Y) = \cos\alpha = \frac{X^{\mathrm T} Y}{|X|\,|Y|} = \frac{\sum_{i=1}^{m} x_i y_i}{\left(\sum_{i=1}^{m} x_i^2\right)^{1/2}\left(\sum_{i=1}^{m} y_i^2\right)^{1/2}}. \qquad (2.2)$$

A pattern vector $X$ is decided to be very close or nearly parallel to the centre $C_i^0$ if the computed value $A(X, C_i^0)$ is very near to unity, and vice versa.

(ii) Product moment correlation coefficient. The pattern vectors of centered scores are $\tilde{X}$ and $\tilde{Y}$, where

$$\tilde{X} = [(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_m - \bar{x})],$$
$$\tilde{Y} = [(y_1 - \bar{y}), (y_2 - \bar{y}), \ldots, (y_m - \bar{y})].$$

The inner product of the two centered vectors is called the scatter of $X$ and $Y$, and the scatter divided by $m$ is $\mathrm{COV}(X, Y)$. The scatter of $X$ (the inner product of $\tilde{X}$ with itself) divided by $m$ is $\mathrm{VAR}(X)$. The product moment correlation can be viewed as

$$r(X, Y) = \frac{\mathrm{COV}(X, Y)}{[\mathrm{VAR}(X)\,\mathrm{VAR}(Y)]^{1/2}} = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\left[\left(\sum_{i=1}^{m} (x_i - \bar{x})^2\right)\left(\sum_{i=1}^{m} (y_i - \bar{y})^2\right)\right]^{1/2}}, \qquad (2.3)$$

which can be interpreted as the angle between the two centered vectors $\tilde{X}$ and $\tilde{Y}$, for observing the resemblance/closeness between pattern $X$ and pattern $Y$.

(iii) The Euclidean distance between two patterns $X$ and $Y$ is computed as

$$d_2(X, Y) = \left(\sum_{i=1}^{m} w_i\,|x_i - y_i|^2\right)^{1/2}, \qquad (2.4)$$

where $w_i$ is the corresponding weight such that $\sum_{i=1}^{m} w_i = 1$.

(iv) The Mahalanobis (1936) distance between two pattern vectors $X$ and $Y$ is given by

$$D^2(X, Y) = (X - Y)^{\mathrm T}\,\Sigma^{-1}\,(X - Y), \qquad (2.5)$$

where $\Sigma$ is the pooled within-group variance-covariance matrix. A pattern $X$ is decided to be a member of $C_i^0$ if $d_2(X, C_i^0)$, $i = 1, 2, \ldots, N_C$, is minimum. The assignment of the $N_D$ patterns to their respective classes $C_i^0$ will create a partition

$$O_1^0, O_2^0, \ldots, O_i^0, \ldots, O_{N_C}^0.$$
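Since Step 2 only needs a closeness score between a pattern and a centre, the four measures of Eqs. (2.2)-(2.5) can be coded directly. The following is a minimal NumPy sketch; the function names and the NumPy dependency are our own choices, not part of the paper.

```python
import numpy as np

def cosine_similarity(x, y):
    # Eq. (2.2): cosine of the angle between pattern vectors x and y.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def correlation(x, y):
    # Eq. (2.3): product moment correlation, i.e. the cosine of the angle
    # between the centered vectors (x - x_bar) and (y - y_bar).
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def weighted_euclidean(x, y, w=None):
    # Eq. (2.4): weighted Euclidean distance with weights summing to one.
    w = np.full(len(x), 1.0 / len(x)) if w is None else np.asarray(w)
    return float(np.sqrt(np.sum(w * np.abs(x - y) ** 2)))

def mahalanobis_sq(x, y, sigma_inv):
    # Eq. (2.5): squared Mahalanobis distance; sigma_inv is the inverse of
    # the pooled within-group variance-covariance matrix.
    d = x - y
    return float(d @ sigma_inv @ d)
```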

Step 3. Using the centre of gravity of the partitions created above, compute new centres

$$C_1^1, C_2^1, \ldots, C_i^1, \ldots, C_{N_C}^1,$$

which will create new partitions

$$O_1^1, O_2^1, \ldots, O_i^1, \ldots, O_{N_C}^1.$$

Step k. $N_C$ new cluster centres

$$C_1^k, C_2^k, \ldots, C_i^k, \ldots, C_{N_C}^k$$

are determined by using the centre of gravity of the partitions

$$O_1^{k-1}, O_2^{k-1}, \ldots, O_i^{k-1}, \ldots, O_{N_C}^{k-1}.$$

These new centres will create a new partition of the $N_D$ data patterns

$$O_1^k, O_2^k, \ldots, O_i^k, \ldots, O_{N_C}^k.$$

The algorithm stops after the kth iteration if (a) or (b) or (c) occurs:

(a) The partitions $O^{k-1}$ and $O^k$ are the same, i.e., $C^{k-1}$ and $C^k$ are identical.

(b) The within-cluster scatter

$$W_{N_C} = \sum_{k=1}^{N_C} \sum_{x \in O_k} D^2(x, C_k),$$

where $N_C$ is the number of clusters and $C_k$ is the centre of the $k$th partition $O_k$, stops decreasing significantly.

(c) A previously established maximum number of iterations is reached.
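The loop above is essentially a k-means-style iteration. A compact sketch in Python/NumPy, using the squared Euclidean distance with equal weights as the proximity measure and stopping rules (a) and (c), might look as follows; names and defaults are our own, not the authors'.

```python
import numpy as np

def moving_centres(patterns, centres, max_iter=100):
    """Clustering around moving centres (Section 2), a minimal sketch.

    `patterns` is an (N_D, N_V) array and `centres` an (N_C, N_V) array of
    provisional centres.  Returns the final centres and the label (class
    number, 0-based) of each pattern.
    """
    centres = np.array(centres, dtype=float)     # work on a copy
    labels = None
    for _ in range(max_iter):                    # stopping rule (c)
        # Step 2 / Step k: assign every pattern to its nearest centre.
        d2 = ((patterns[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Stopping rule (a): the partition did not change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3 / Step k: recompute each centre as the centre of gravity
        # of its current partition (skip a centre whose partition is empty).
        for k in range(len(centres)):
            members = patterns[labels == k]
            if len(members):
                centres[k] = members.mean(axis=0)
    return centres, labels
```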

3. Automatic and stable clustering algorithm (ASCA)

Let us assume that there are $N_D$ patterns from a pattern space $P$, and that each pattern $P_i$, $i = 1, 2, \ldots, N_D$, is characterized by $N_V$ features. The objective is to generate $C_j$, $j = 1, 2, \ldots, N_{CR}$, stable cluster regions such that each $P_i$ fits into exactly one of these regions $C_j$ and no pattern $P_i$ can be fitted into two cluster regions $C_j$ and $C_k$ for $j \neq k$, i.e.,


$$C_1 \cup C_2 \cup \cdots \cup C_{N_{CR}} = P, \quad \text{and}$$
$$C_j \cap C_k = \emptyset, \quad \forall\, j \neq k,$$

where $\cup$ and $\cap$ stand for the usual union and intersection. The steps of the algorithm are as follows.

Step 1. Set the initial parameters:

$N_D$ → number of patterns to be clustered,
$N_V$ → number of features characterising each pattern,
$N_C$ → number of classes in each initial partition,
$N_I$ → number of initial partitions created to arrive at stable cluster centres,
$N_{CR}$ → number of stable cluster regions (subject to change after observing the $(N_C)^{N_I}$ partitions).

Step 2. Generate a non-repeated random sequence of $N_D$ integers $\{R_i;\ i = 1, 2, \ldots, N_D\}$, where $R_i \neq R_j$ for $i \neq j = 1, 2, \ldots, N_D$, and $1 \le R_i \le N_D$.

Note. To get this random sequence, any random number generation process can be used. A method given by Maclaren and Marsaglia (1965) has been applied. This subroutine is specific to the IBM System/360 and has period $2^{29}$. The complete generation and testing procedure is available in IBM manual C20-8011. In the subroutine RANDU(IX, IY, YFL), IX is an odd integer of nine or fewer digits supplied to the subroutine, IY is computed from the previous IX, and YFL is the resultant uniformly distributed floating point random number in the range $0.0 \le \mathrm{YFL} < 1.0$. The integer random sequence $R_i$, $1 \le R_i \le N_D$, has been obtained by using

$$R_i = \mathrm{INT}((\mathrm{YFL})_i \cdot N_D) + 1, \qquad i = 1, 2, \ldots, N_D. \qquad (3.1)$$

In the following steps the term "class string" is used frequently; it is the sequence of class labels (class numbers), one for each pattern, indicating the cluster to which that pattern currently belongs.

Step 3. The class string is obtained by substituting

$R_i = k$, if $(k-1)\cdot\mathrm{INT}(N_D/N_C) < R_i \le k\cdot\mathrm{INT}(N_D/N_C)$,

$R_i = N_C$, if $R_i > (N_C - 1)\cdot\mathrm{INT}(N_D/N_C)$, (3.2)

where $i = 1, 2, \ldots, N_D$; $k = 1, 2, \ldots, N_C - 1$, and INT stands for the integral value. The class string IR(i) is thus obtained by partitioning the whole data set into $N_C$ parts and assigning the corresponding part number to each $R_i$.
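Steps 2-3 amount to drawing a random permutation of $1, \ldots, N_D$ and mapping it to part numbers via Eq. (3.2). A small sketch follows, assuming NumPy's generator in place of the RANDU subroutine used by the authors; the function name is ours.

```python
import numpy as np

def initial_class_string(n_d, n_c, rng):
    """Steps 2-3: draw a non-repeated random sequence R_1..R_ND and map it
    to a class string IR(i) by Eq. (3.2)."""
    r = rng.permutation(n_d) + 1                 # non-repeated integers 1..N_D
    block = n_d // n_c                           # INT(N_D / N_C)
    ir = np.minimum((r - 1) // block + 1, n_c)   # part number k for each R_i
    return ir
```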

Step 4. Compute the $N_C$ centres of the initial partitions with the help of the class string IR(i):

$$C_{k,l} = \frac{1}{(N)_k} \sum_{j=1}^{(N)_k} P_k(j, l), \qquad (3.3)$$

where $k = 1, 2, \ldots, N_C$; $l = 1, 2, \ldots, N_V$, and $(N)_k$ is the number of patterns $P_k$ in the $k$th partition. For the $k$th centre, take into account all those patterns having the corresponding value $k$ in the class string IR(i).
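Eq. (3.3) is simply the feature-wise mean of the patterns carrying label $k$ in the class string; a one-function sketch (our naming, assuming every class is non-empty):

```python
import numpy as np

def partition_centres(patterns, class_string, n_c):
    # Eq. (3.3): centre of gravity of each of the N_C initial partitions,
    # taking the patterns whose class-string entry equals k.
    return np.stack([patterns[class_string == k].mean(axis=0)
                     for k in range(1, n_c + 1)])
```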

Step 5. Create partitions around the $N_C$ centres using clustering around moving centres and update the class string IR(i).

Step 6. Repeat the process from Step 2 to Step 5 $N_I$ times and store the updated class strings in NON(n, i), $n = 1, 2, \ldots, N_I$; $i = 1, 2, \ldots, N_D$.

Step 7. In the matrix NON(n, i), $n = 1, 2, \ldots, N_I$; $i = 1, 2, \ldots, N_D$, each of the $N_D$ columns is one of the $(N_C)^{N_I}$ possible combinations of the class numbers $1, 2, \ldots, N_C$. The $i$th value of these combinations can be calculated as

$$\mathrm{NDX}(i) = \sum_{n=1}^{N_I} (I_n(i) - 1)\cdot (N_C)^{\,n-1} + 1, \qquad (3.4)$$

where $I_n(i) = \mathrm{NON}(n, i)$ and $i = 1, 2, \ldots, N_D$.

Step 8. The above $N_D$ integral values will always satisfy $1 \le \mathrm{NDX}(i) \le (N_C)^{N_I}$. Calculate the frequency of the $(N_C)^{N_I}$ values and arrange the frequencies in descending order ($\mathrm{NF}(m)$, $m = 1, 2, \ldots, (N_C)^{N_I}$).
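Steps 7-8 encode every column of NON as a single integer and count how often each code occurs. A sketch of Eq. (3.4) and the frequency ranking, with hypothetical function and variable names:

```python
import numpy as np

def stable_partition_index(non, n_c):
    """Steps 7-8: encode each column of the (N_I x N_D) class-string matrix
    NON as an integer NDX(i) via Eq. (3.4), then rank the distinct codes by
    how many patterns fall into each of them."""
    n_i, n_d = non.shape
    weights = n_c ** np.arange(n_i)                    # (N_C)^(n-1), n = 1..N_I
    ndx = ((non - 1) * weights[:, None]).sum(axis=0) + 1
    codes, counts = np.unique(ndx, return_counts=True)
    order = counts.argsort()[::-1]                     # descending frequency
    return ndx, codes[order], counts[order]
```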

Step 9. Decide the number of stable clusters to be generated, $N_{CR} = N_C + k$, where $k = 0, 1, 2, \ldots, (N_C)^{N_I} - (N_C + 1)$, provided there is some observation in $\mathrm{NF}(m)$, $m = 1, 2, \ldots, (N_C)^{N_I}$. Some of the $\mathrm{NF}(m)$'s may not have even a single observation.

Step 10. Take the first $N_{CR}$ stable partitions $\mathrm{NF}(m)$ and compute the final seed points from the patterns of the respective $N_{CR}$ stable partitions:

$$FC_{m,l} = \frac{1}{(FN)_m} \sum_{j=1}^{(FN)_m} P_m(j, l), \qquad (3.5)$$

where $m = 1, 2, \ldots, N_{CR}$; $l = 1, 2, \ldots, N_V$, and $(FN)_m$ is the number of patterns in the $m$th partition.

Step 11. (i) Apply the moving centres method once around these final cluster centres to get the stable partitions and perform statistical analysis, preferably $\chi^2$ statistics, to see the validity of the classification; or

(ii) Apply the same clustering procedure till the clusters are stabilized and then perform statistical analysis.

After repeated application of the algorithm on well-known data sets, it has been found that the cluster formation in (i) and (ii) of Step 11 is the same if the classes are clearly separable. In the case of overlapping classes there is a slight variation between the results obtained in (i) and (ii) of Step 11.
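Steps 9-11 can then be sketched as: pick the $N_{CR}$ most frequent codes, average their member patterns to obtain the seed points of Eq. (3.5), and cluster around those seeds. The sketch below reuses the moving_centres() function from the Section 2 sketch; all names are ours.

```python
import numpy as np

def asca_final_partition(patterns, ndx, codes_desc, n_cr):
    """Steps 9-11 (a sketch): compute stable seed points from the N_CR most
    frequent class-string combinations (Eq. (3.5)) and apply the moving
    centres procedure around them until the clusters stabilize."""
    seeds = np.stack([patterns[ndx == code].mean(axis=0)
                      for code in codes_desc[:n_cr]])
    return moving_centres(patterns, seeds)
```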

Example. Here we show how the algorithm ASCA performs the clustering and separates the individuals into their respective clusters. The example of 20 two-dimensional patterns given in (Bow, 1984, Ch. 5) is demonstrated below:

P1 P2 P3 P4 P5 (0,0) (1,0) (0,1) (1,1) (2,1)

P6 P7 P8 P9 P10 (1,2) (2,2) (2,3) (6,6) (7,6)

P11 P12 P13 P14 P15 (8,6) (6,7) (7,7) (8,7) (9,7)

P16 P17 P18 P19 P20 (7,8) (8,8) (9,8) (10,8) (11,8).

Let these twenty patterns be presented for clustering in the following order:

P4, P9, P1, P10, P14, P5, P3, P20, P18, P8,

P7, P6, P2, P12, P11, P16, P19, P15, P13, P17.

Step 1. Set the initial parameters

$N_D = 20$, $N_V = 2$, $N_C = 2$, $N_I = 3$, $N_{CR} = 2$.

Step 2. Twenty non-repeated integer random numbers are generated:

15, 17, 6, 8, 10, 12, 3, 11, 19, 5,

7, 2, 9, 4, 16, 13, 20, 14, 18, 1.

Step 3. IR(1) = 2, since $15 > (N_C - 1)\cdot\mathrm{INT}(N_D/N_C)$; IR(2) = 2; IR(3) = 1, since $0 < 6 \le 10$; and so on. Hence the corresponding class string becomes

2 2 1 1 1 2 1 2 2 1 1 1 1 1 2 2 2 2 2 1.

Step 4. The initial centres computed by (3.3) are

$$C_{1,l} = \tfrac{1}{10}\sum_{j=1}^{10} P_1(j, l) = (3.50,\ 3.60), \quad\text{and}$$
$$C_{2,l} = \tfrac{1}{10}\sum_{j=1}^{10} P_2(j, l) = (7.00,\ 6.00).$$

Step 5. Partitioning around these centres provides the updated class string

1 2 1 2 2 1 1 2 2 1 1 1 1 2 2 2 2 2 2 2

and corresponding new centres $C_{1,l} = (1.13, 1.25)$ and $C_{2,l} = (8.00, 7.17)$.

Step 6. Step 2 to Step 5 are repeated $N_I$ times and the $N_I$ class strings are stored in the matrix

NON($N_I$, $N_D$) =

1 2 1 2 2 1 1 2 2 1 1 1 1 2 2 2 2 2 2 2
2 1 2 1 1 2 2 1 1 2 2 2 2 1 1 1 1 1 1 1
2 1 2 1 1 2 2 1 1 2 2 2 2 1 1 1 1 1 1 1 .

Step 7. The number of possible combinations of 1 and 2 is $(N_C)^{N_I} = 2^3 = 8$. The numerical value of the first column by (3.4) is

$$\mathrm{NDX}(1) = (1-1)\cdot 2^{1-1} + (2-1)\cdot 2^{2-1} + (2-1)\cdot 2^{3-1} + 1 = 7.$$

Similarly, $\mathrm{NDX}(2) = 2$, $\mathrm{NDX}(3) = 7$, ..., $\mathrm{NDX}(20) = 2$ are the values of the 2nd column, 3rd column, etc.

Step 8. The frequency of each column value is counted and arranged in descending order

NF(1) NF(2) NF(3) NF(4) NF(5) NF(6) NF(7) NF(8)

0 12 0 0 0 0 8 0

NF(2) NF(7) NF(1) ... NF(8)

12 8 0 ... 0

Step 9. Since in all the three initial partitionings, the patterns are moving together in one of the two classes, there are only two stable clusters out of eight possible stable clusters.

Step 10. Using the patterns of the two stable partitions, the final cluster centres are computed by (3.5): $FC_{1,l} = (8.00, 7.17)$ and $FC_{2,l} = (1.13, 1.25)$.

Step 11. Around these centres, clusters are formed which provide the following class string:

2 1 2 1 1 2 2 1 1 2 2 2 2 1 1 1 1 1 1 1.

Table 1

                    Partitioning 1       Partitioning 2       Partitioning 3
                    cl.1  cl.2  cl.3     cl.1  cl.2  cl.3     cl.1  cl.2  cl.3
Iris Setosa          50     0     0        0     0    50       50     0     0
Iris Versicolor       0     3    47       47     3     0        0     3    47
Iris Virginica        0    36    14       14    36     0        0    36    14

Hence the clusters are Cluster 1: [P9, P10, P14, P20, P18, P12, P11, P16, P19, P15, P13, P17] and Cluster 2: [P4, P1, P5, P3, P8, P7, P6, P2]. The cluster centres and cluster sets are the same as those given in (Bow, 1984).
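For illustration, the whole pipeline can be exercised on these 20 patterns with the sketches given earlier (our function names; the random seed is arbitrary and merely plays the role of the odd integer IX):

```python
import numpy as np

# The 20 two-dimensional patterns of (Bow, 1984, Ch. 5).
pts = np.array([(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2), (2, 2), (2, 3),
                (6, 6), (7, 6), (8, 6), (6, 7), (7, 7), (8, 7), (9, 7),
                (7, 8), (8, 8), (9, 8), (10, 8), (11, 8)], dtype=float)

rng = np.random.default_rng(12345)
non = []
for _ in range(3):                                      # N_I = 3 initial partitionings
    ir = initial_class_string(len(pts), 2, rng)         # Steps 2-3
    centres = partition_centres(pts, ir, 2)             # Step 4, Eq. (3.3)
    _, labels = moving_centres(pts, centres)            # Step 5
    non.append(labels + 1)                              # class string with labels 1/2
ndx, codes, counts = stable_partition_index(np.array(non), 2)       # Steps 7-8
final_centres, final_labels = asca_final_partition(pts, ndx, codes, 2)  # Steps 9-11
print(final_centres)   # expected to be close to (1.13, 1.25) and (8.00, 7.17)
```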

4. Discussion

The performance of ASCA has been established by its repeated application to the IRIS data. This data set contains three classes, namely Iris Setosa, Iris Versicolor and Iris Virginica. The individual patterns of each class are characterized by four measurements: sepal length, sepal width, petal length and petal width (see Kendall and Stuart, 1961, p. 331). The IRIS data has been presented for classification more than a hundred times, in various orders, by giving different inputs for the random number generation. The final cluster centres arrived at have always been found to be the same. The clustering around these stable cluster centres leads to the stable partitions. The process of arriving at the stable cluster centres has already been discussed in the example (Step 6 to Step 10). In Table 1 we can see how the groups of patterns move together in the initial partitions.

The three stable cluster centres, which ultimately provide the stable class structures, are computed by taking into account 50 patterns for class number one, 47 patterns for class number two and 36 patterns for class number three. Observing the class structure obtained repeatedly, we can say that we are able to compute the stable cluster centres conveniently and that the effect of the ordering of the data set on the final cluster formation is removed automatically.

5. Conclusions

The clustering method developed in the present manuscript advocates a way of achieving the most appropriate initial guess through random partitioning of the data, which finally leads to stable cluster structures; stable in the sense that, in the repeated partitioning, many patterns always move together. By observing their movement, the final seed points are decided automatically, irrespective of the order in which the data are presented for classification, which finally leads to the stable cluster formation. The programme has been developed on a Prime-750 system. Our inputs are the data set and a positive odd integer of nine or fewer digits for the generation of random numbers.

Acknowledgements

The authors record their sincere thanks to Dr. C.R. Chakravarthy, Director, Scientific Analysis Group, DRDO, for his constant encouragement and continued support in the completion of the manuscript. The authors are also thankful to the anonymous reviewers for their valuable comments and suggestions for the improvement of this manuscript.

References

Anderberg, M.R. (1973). Cluster Analysis for Applications. Academic Press, New York.

Ball, G.H. and Hall, D.J. (1967). A clustering technique for summarizing multivariate data. Behav. Sci. 12, 153-155.

Bow, Sing-Tze (1984). Pattern Recognition: Applications to Large Data Set Problems. Marcel Dekker, New York and Basel.

Diday, E. (1973). The dynamic clusters method in non-hierarchical clustering. Internat. J. Comput. Inf. Sci. 2(1), 61-88.

Everitt, B. (1981). Cluster Analysis. Wiley (Halsted Press), New York.

Enslein, K., A. Ralston and H.S. Wilf (1977). Statistical Methods for Digital Computers, Vol. III. Wiley, New York.

Forgy, E.W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications (abstract). Biometrics 21, 768-769.


Gordon, A.D. (1981). Classification. Chapman and Hall, London.

Gower, J.C. (1974). Maximal predictive classification. Biometrics 30, 643-654.

Jain, N.C., A. Indrayan and L.R. Goel (1985). Monte Carlo comparison of six hierarchical clustering methods on random data. Pattern Recognition 19, 95-99.

Kendall, M.G. and A. Stuart (1961). The Advanced Theory of Statistics, Vol. 3. Griffin, London.

Lebart, L., A. Morineau and K.M. Warwick (1984). Multivariate Descriptive Statistical Analysis. Wiley, New York.

Mahalanobis, P.C. (1936). On the generalized distance in statistics. Proc. Nat. Inst. Sci. India 2, 49-55.

MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symp., 281-297.

Maclaren, M.D. and G. Marsaglia (1965). Uniform random number generators. J. Assoc. Comput. Mach. 12, 83-89.

Meisel, S.M. (1972). Computer-Oriented Approaches to Pattern Recognition. Academic Press, New York.

Murthy, C.S.R. and R.S. Rao (1974). A clustering technique. J. Comput. Soc. India 5, 13.

Narasimha Murty, M. and G. Krishna (1980). A computationally efficient technique for data-clustering. Pattern Recognition 12, 153-158.

Pathak, A. and S.K. Pal (1986). A generalised learning algorithm based on guard zones. Pattern Recognition Lett. 4, 63-69.

Pal, S.K. and D. Dutta Majumder (1986). Fuzzy Mathematical Approach to Pattern Recognition. Wiley (Halsted Press), New York.

Pal, S.K., A. Pathak and C. Basu (1988). Dynamic guard zone for self supervised learning. Pattern Recognition Lett. 7, 135-144.

Sneath, P.H.A. and R.R. Sokal (1963). Principles of Numerical Taxonomy. Freeman, San Francisco, CA.

Tou, J.T. and R.C. Gonzalez (1974). Pattern Recognition Principles. Addison-Wesley, Reading, MA.

Tou, J.T. (1979). DYNOC - A dynamic optimal cluster-seeking technique. Internat. J. Comput. Inf. Sci. 8(6), 541-547.