

Cluster Quality Indexes for Symbolic Classification – An Examination

Andrzej Dudek

Wrocław University of Economics, Department of Econometrics and Computer Science, Nowowiejska 3, 58-500 Jelenia Góra, Poland; [email protected]

Abstract. The paper presents difficulties of measuring clustering quality for symbolic data (such as the lack of a “traditional” data matrix). Some hints concerning the usage of known indexes for this kind of data are given, and indexes designed exclusively for symbolic data are described. Finally, after the presentation of simulation results, some proposals for choosing the most adequate indexes for popular classification algorithms are given.

1 Introduction

In a typical classification procedure, cluster validation is one of the crucial steps. Typically, the validation step uses an internal cluster quality index. There is a wide variety of such indexes, with over fifty measures described in the literature (Milligan and Cooper (1985), Weingessel et al. (1999)).

The problem of choosing the most adequate cluster quality index for data measured on different scales and classified by various clustering methods is well described in the literature. Milligan and Cooper (1985) suggest using the Caliński and Harabasz, Hubert and Levine, and Baker and Hubert indexes; the Silhouette index and the Krzanowski and Lai index are also quite commonly used.

The situation differs in the case of symbolic data. There are no hints in the literature as to which indexes are most appropriate for that kind of data. This paper describes cluster quality indexes that can be used in this case.

In the first part, clustering methods that can be used for symbolic data and methods designed exclusively for this kind of data are described. The second part presents the main groups of cluster quality indexes along with examples of indexes from each group (due to the lack of space, only the most frequently used indexes are described). The third part describes the classification process for symbolic data. In the next part, cluster quality indexes are compared on 100 sets of symbolic data with known structures and for three clustering methods. Furthermore, there is a short summary of which of them most accurately represents the structure of the clusters. Finally, some conclusions and remarks are given.

2 Clustering methods for symbolic data

Symbolic data are more complex than classical data, which can be arranged in tables of numeric values. Bock and Diday (2000) define five types of symbolic variables:

• single quantitative value,
• categorical value,
• interval,
• multi-valued variable,
• multi-valued variable with weights.
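As an illustration (not from the paper), a single symbolic object carrying all five variable types could be represented in code roughly as follows; the field names and values are invented for the example:

```python
# A hypothetical symbolic object: one record may mix scalar, categorical,
# interval, set-valued and weighted set-valued variables.
symbolic_object = {
    "height": 3.2,                                  # single quantitative value
    "color": "red",                                 # categorical value
    "temperature": (12.5, 17.0),                    # interval: (lower, upper)
    "languages": {"en", "pl"},                      # multi-valued variable
    "browsers": {"firefox": 0.6, "chrome": 0.4},    # multi-valued with weights
}

# Unlike a classical data matrix, such records have no common numeric cell
# type, which is why means and variances are not directly defined for them.
```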

Variables in a symbolic object can also be, regardless of their type (Diday (2002)):

• taxonomic, representing a hierarchical structure,
• hierarchically dependent,
• logically dependent.

A common problem with the usage of symbolic data in classification algorithms is the fact that, for this kind of data, due to their structure, the operations of addition, subtraction, multiplication, squaring, and the calculation of means or variances are not defined. Thus, methods based on data matrices cannot be used; only methods based on distance matrices are applicable. Among them the most popular ones are:

Hierarchical agglomerative clustering methods (Gordon (1999, p. 79)):

• Ward,
• single linkage,
• complete linkage,
• average linkage,
• McQuitty (1966),
• centroid,
• median.

Optimization methods:

• partitioning around medoids, also called the k-medoids method (Kaufman and Rousseeuw (1990)).

Algorithms developed for symbolic data (Chavent et al. (2003), Verde (2004)):

• divisive clustering of symbolic objects (DIV),
• clustering of symbolic objects based on distance tables (DCLUST),
• dynamic clustering of symbolic objects (SCLUST),
• hierarchical and pyramidal clustering of symbolic objects (HiPYR).


Popular methods like k-means and related ones like hard competitive learning, soft competitive learning, Isodata and others cannot be used for symbolic data.
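Because only distance-matrix methods apply, a clustering routine for symbolic data never touches raw feature values. The following minimal sketch of the k-medoids idea (a simplified variant of Kaufman and Rousseeuw's PAM, not the paper's implementation) consumes nothing but a precomputed distance matrix:

```python
import random

def k_medoids(dist, k, iters=100, seed=0):
    """Naive k-medoids on a precomputed distance matrix `dist`
    (list of lists). A sketch of the PAM idea only: real PAM also
    performs swap-based refinement."""
    rng = random.Random(seed)
    n = len(dist)
    medoids = rng.sample(range(n), k)
    for _ in range(iters):
        # Assignment step: each object joins its nearest medoid.
        clusters = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
        # Update step: within each cluster, pick the object that
        # minimises the total distance to all cluster members.
        new_medoids = []
        for m in medoids:
            members = [i for i in range(n) if clusters[i] == m]
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members))
            )
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return clusters, medoids
```

Any dissimilarity defined on symbolic objects (e.g. an Ichino-Yaguchi matrix) can be plugged in, since the algorithm never needs means or sums of the objects themselves.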

3 Cluster quality indexes

Over fifty internal cluster quality indexes are described in the literature on the subject. They can be arranged in three main groups (Weingessel et al. (2003)); for each group a few well-known representatives are enumerated:

Indexes based on inertia (Sum of squares):

• Caliński and Harabasz (1974) index (pseudo F-statistic),
• Hartigan (1975) index,
• Ratkovski index (Ratkovski and Lance (1978)),
• Ball (1965) index,
• Krzanowski and Lai (1988) index.
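For example, the Caliński and Harabasz pseudo F-statistic compares between-cluster and within-cluster dispersion (notation mine, following the standard definition rather than the paper):

```latex
CH(k) \;=\; \frac{\operatorname{tr}(B_k)\,/\,(k-1)}{\operatorname{tr}(W_k)\,/\,(n-k)}
```

where $n$ is the number of objects, $k$ the number of clusters, and $B_k$, $W_k$ the between- and within-cluster scatter matrices; the traces $\operatorname{tr}(B_k)$ and $\operatorname{tr}(W_k)$ are exactly the inertias this group of indexes is named after.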

Indexes based on scatter matrices:

• Scott index (Scott and Symons (1971)),
• Marriot (1971) index,
• Friedman index (Friedman and Rubin (1967)),
• Rubin index (Friedman and Rubin (1967)).

Indexes based on distance matrices:

• Silhouette (Rousseeuw (1987), Kaufman and Rousseeuw (1990)),
• Baker and Hubert (Hubert (1974), Baker and Hubert (1975)),
• Hubert and Levine (1976).
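The indexes in this third group need nothing beyond pairwise dissimilarities, which is what makes them candidates for symbolic data. As a sketch of Rousseeuw's (1987) definition (my own minimal implementation, not the paper's), the mean silhouette width can be computed from a distance matrix alone:

```python
def silhouette_width(dist, labels):
    """Mean silhouette width from a precomputed distance matrix.
    `dist` is a list of lists, `labels` a cluster label per object."""
    n = len(dist)
    widths = []
    for i in range(n):
        # a = mean distance to the object's own cluster,
        # b = mean distance to the nearest other cluster.
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:  # singleton cluster: silhouette conventionally 0
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in set(labels) if c != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / n
```

Values near 1 indicate a partition compatible with the distances; values near or below 0 indicate objects sitting closer to a foreign cluster than to their own.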

A separate, relatively small group consists of indexes dedicated exclusively to symbolic data. These indexes are (Verde (2004)):

• inertia for symbolic objects,
• homogeneity-based quality index.

4 Clustering quality indexes – symbolic objects case

Figure 1 summarizes the usage of clustering quality indexes for symbolic objects. For symbolic objects, clustering methods based on data matrices cannot be used. If the clustering algorithm is based on a distance matrix, then indexes based on inertia and indexes based on a distance matrix are allowed in validation. If an algorithm designed strictly for symbolic data is used, then indexes based on inertia and “symbolic” indexes are most appropriate for validation. Thus, four paths of the classification procedure may be distinguished:


Fig. 1. Clustering method and cluster quality indexes for symbolic data.

(Source: own research based on Verde (2004), Chavent et al. (2003), Weingessel et al. (1999))

• clustering procedure based on a dissimilarity matrix, validation with a cluster quality index based on inertia;

• clustering procedure based on a dissimilarity matrix, validation with a cluster quality index based on a dissimilarity/distance matrix;

• “symbolic” clustering procedure, validation with a cluster quality index based on inertia;

• “symbolic” clustering procedure, validation with a cluster quality index designed for symbolic data.

5 Comparison of cluster quality indexes in the symbolic objects case – computational results

Many authors, like Milligan and Cooper (1985), have compared cluster quality indexes, and some hints on their usage can be found in the literature. But no such comparison has been done for symbolic data yet (as far as the author knows).

100 symbolic data sets with known class structure have been clustered, and a compatibility measure for each index has been calculated according to the following rule: if the best value of the index is achieved for the number of clusters corresponding to the real structure of the data set, the compatibility measure is incremented by 1; if that number of clusters ranks second, 0.5 is added; and if it ranks third, 0.25 is added.
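The scoring rule above can be sketched as follows (my reading of the paper's rule; the function name and argument layout are invented for the example):

```python
def compatibility_score(ranked_cluster_counts, true_k):
    """Score one index on one data set.

    `ranked_cluster_counts` lists candidate numbers of clusters ordered
    from the index's best value downwards; `true_k` is the real number
    of clusters. Returns 1.0 / 0.5 / 0.25 for ranks 1-3, else 0.0."""
    rewards = {0: 1.0, 1: 0.5, 2: 0.25}
    for rank, k in enumerate(ranked_cluster_counts[:3]):
        if k == true_k:
            return rewards[rank]
    return 0.0
```

Summing these scores over all data sets of one experimental cell yields the table entries reported below, so each cell is bounded by the number of sets in that cell.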


Three clustering algorithms have been used: the Ward hierarchical clustering method, the partitioning around medoids method, and the dynamical clustering method for symbolic objects. For each algorithm the compatibility measure has been calculated separately. Calculations have been made with the use of the symbolicDA library (written in R and C by the author).

The data for the experiment have been generated artificially. The main reason for this is the lack of real symbolic datasets with known class structure; there are only a few datasets shipped with the SODAS software. But we can assume that switching from artificial to real data wouldn't change the results of the simulation, as long as the real cluster sizes are approximately equal. For datasets with one “large” and a few “small” clusters the situation probably differs. Each data set contained a fixed number of objects (150), a random number (from 2 to 5) of single numerical variables, a random number (from 2 to 10) of interval variables, and a random number (from 2 to 10) of multi-nominal variables. 20 data sets have a 3-cluster structure, 25 have 4 clusters, 30 have 5 clusters, and 25 have 7 clusters, as indicated in the headings of Tables 1-4.

The following indexes have been compared:

• S – Silhouette index,
• G2 – Baker and Hubert index,
• G3 – Hubert and Levine index,
• F – Caliński and Harabasz index,
• H – Hartigan index,
• KL – Krzanowski and Lai index,
• SI – inertia for symbolic objects,
• SH – homogeneity-based quality index.

The Ichino and Yaguchi distance measure has been used to calculate the distance matrix. The results of the experiment are presented in Tables 1-4.
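For the interval-valued case, the Ichino-Yaguchi dissimilarity for a single variable can be sketched as below. This is my own rendering of the interval special case (the measure is defined more generally through Cartesian join and meet operators, and the paper does not give its implementation):

```python
def ichino_yaguchi_interval(a, b, gamma=0.5):
    """Ichino-Yaguchi dissimilarity for one interval-valued variable.

    `a` and `b` are (lower, upper) tuples; `gamma` in [0, 0.5] controls
    the inner-side correction term.  Per-variable values are typically
    aggregated (e.g. summed or normalised) across variables."""
    join = max(a[1], b[1]) - min(a[0], b[0])            # span of the union hull
    meet = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))  # overlap length
    len_a, len_b = a[1] - a[0], b[1] - b[0]
    return join - meet + gamma * (2 * meet - len_a - len_b)
```

Identical intervals yield 0, and the value grows with the gap between intervals, so the resulting matrix behaves like an ordinary dissimilarity matrix for the distance-based methods above.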

Table 1. Comparison of cluster quality indexes for symbolic data – Ward hierarchical clustering.

Index   3 clusters   4 clusters   5 clusters   7 clusters   Total
        (20 sets)    (25 sets)    (30 sets)    (25 sets)
S          6.5          4.5          0.25         0.5       11.75
G2        17           20.25         5.75        21.5       64.5
G3        18.25        17.25        15.5          7         58
F          6.5          7.25         0.5          0.75      15
H          6.75        26.25         5.25         0.25      38.5
KL         8            5.25         1.5          2.25      17
SI         6.5          3.5          8.75         0.75      19.5
SH         7.25         4.75        10.25         2         24.25


Table 2. Comparison of cluster quality indexes for symbolic data – k-medoids algorithm.

Index   3 clusters   4 clusters   5 clusters   7 clusters   Total
        (20 sets)    (25 sets)    (30 sets)    (25 sets)
S         17.25         4.5          5.75         1         28.5
G2         5.25        17.5         18.75        11         52.5
G3        20            5.75        27.25         6.75      59.75
F          1.5          4.75         0.25         5.75      12.25
H         11.5          0.25         0.5          0         12.25
KL        13.75         5.75         6.25         3         28.75
SI        20            0.25        26.5          0.75      47.5
SH        20            1.25        28            4.25      53.5

For Ward hierarchical clustering of symbolic objects, the Hubert and Levine (G3) and Baker and Hubert (G2) indexes most adequately represent the real structure of the data. Of the remaining indexes, only the Hartigan index gives reasonably good results; the correlation between the other indexes' values and the real class structure is at a very low level. For the k-medoids algorithm for symbolic objects, the Baker and Hubert (G2), Hubert and Levine (G3), symbolic inertia (SI) and homogeneity-based quality (SH) indexes may be used to validate classification results. And again, for dynamical clustering of symbolic objects, the Baker and Hubert (G2) and Hubert and Levine (G3) indexes most adequately represent the real structure of the data. Table 4 summarizes the results of the experiments. The G2 and G3 indexes are significantly better than the other indexes. This can be explained by the fact that these indexes are based on distance matrices; however, the third index from this group (the Silhouette index) is not as good as the other two. The indexes designed for symbolic data, symbolic inertia and the homogeneity-based quality index, can also be used for symbolic cluster validation, but the results may be worse than those achieved by using the Hubert and Levine or the Baker and Hubert index.

6 Final remarks

In this paper several cluster quality indexes were compared on 100 artificially generated symbolic data sets. The experiment showed that the most adequate ones for this kind of data are the Hubert and Levine and the Baker and Hubert indexes. We can assume that the usage of these indexes in the case of real symbolic data validation should also give good results. Preliminary experiments with real symbolic data sets, done by the author, also confirm the quality of these indexes in the symbolic data case.


These results can be explained by the fact that the Hubert and Levine and the Baker and Hubert indexes are based on distance matrices, and so the limitations of symbolic methods described in Section 2 do not apply to them.

Table 3. Comparison of cluster quality indexes for symbolic data – Dynamical clustering.

Index   3 clusters   4 clusters   5 clusters   7 clusters   Total
        (max 20)     (max 25)     (max 30)     (max 25)
S         17.25         4            3.75         4.5       29.5
G2         5.25        10.25        17           16.75      49.25
G3        20            5.25        28            4.5       57.75
F          4.5          0            2.75         0          7.25
H         10            0            0            0.5       10.5
KL        12.25         3.5          3.25         4         23
SI         0            0.25         9.25         0          9.5
SH         2            2.75        17.5          0.5       22.75

Table 4. Comparison of cluster quality indexes for symbolic data – Aggregated results.

Index   3 clusters   4 clusters   5 clusters   7 clusters   Total
        (60 sets)    (75 sets)    (90 sets)    (75 sets)
S         41           13            9.75         6         69.75
G2        27.5         48           41.5         49.25     166.25
G3        58.25        28.25        70.75        18.25     175.5
F         12.5         12            3.5          6.5       34.5
H         28.25        26.5          5.75         0.75      61.25
KL        34           14.5         11            9.25      68.75
SI        26.5          4           44.5          1.5       76.5
SH        29.25         8.75        55.75         6.75     100.5

Note that only two strictly “symbolic” indexes (symbolic inertia and the homogeneity-based quality index) have been taken into consideration. Currently, new proposals are available (see for example Hardy (2005) for a description of the nbstat procedure), so this comparison should be repeated when more indexes and cluster-number determination procedures are introduced.

References

BAKER, F.B. and HUBERT, L.J. (1975): Measuring the Power of Hierarchical Cluster Analysis. Journal of the American Statistical Association, 70, 349, 31-38.


BALL, G.H. and HALL, D.J. (1965): ISODATA, A Novel Method of Data Analysis and Pattern Classification. Tech. Rep. NTIS No. AD 699616, Stanford Research Institute, Menlo Park.

BOCK, H.-H. and DIDAY, E. (2000): Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer, Berlin.

CALINSKI, T. and HARABASZ, J. (1974): A Dendrite Method for Cluster Analysis. Communications in Statistics, 3, 1-27.

CHAVENT, M., DE CARVALHO, F.A.T., VERDE, R. and LECHEVALLIER, Y. (2003): Trois Nouvelles Méthodes de Classification Automatique de Données Symboliques de Type Intervalle. Revue de Statistique Appliquée, LI, 4, 5-29.

DIDAY, E. (2002): An Introduction to Symbolic Data Analysis and the SODAS Software. J.S.D.A., International EJournal.

FRIEDMAN, H.P. and RUBIN, J. (1967): On Some Invariant Criteria for Grouping Data. Journal of the American Statistical Association, 62, 1159-1178.

GORDON, A.D. (1999): Classification. Chapman & Hall/CRC, London.

HARDY, A. (2005): Validation of Unsupervised Symbolic Classification. Proceedings of ASMDA 2005 Conference. Available at URL: http://asmda2005.enst-bretagne.fr/IMG/pdf/proceedings/379.pdf.

HARTIGAN, J.A. (1975): Clustering Algorithms. Wiley, New York.

HUBERT, L.J. (1974): Approximate Evaluation Technique for the Single-link and Complete-link Hierarchical Clustering Procedures. Journal of the American Statistical Association, 69, 347, 698-704.

HUBERT, L.J. and LEVINE, J.R. (1976): Evaluating Object Set Partitions: Free Sort Analysis and Some Generalizations. Journal of Verbal Learning and Verbal Behaviour, 15, 549-570.

KAUFMAN, L. and ROUSSEEUW, P.J. (1990): Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.

KRZANOWSKI, W.J. and LAI, Y.T. (1988): A Criterion for Determining the Number of Groups in a Data Set Using Sum of Squares Clustering. Biometrics, 44, 23-34.

MARRIOT, F.H. (1971): Practical Problems in a Method of Cluster Analysis. Biometrics, 27, 501-514.

MCQUITTY, L.L. (1966): Similarity Analysis by Reciprocal Pairs for Discrete and Continuous Data. Educational and Psychological Measurement, 26, 825-831.

MILLIGAN, G.W. and COOPER, M.C. (1985): An Examination of Procedures for Determining the Number of Clusters in a Data Set. Psychometrika, 50, 2, 159-179.

RATKOVSKI, D.A. and LANCE, G.N. (1978): A Criterion for Determining a Number of Groups in a Classification. Australian Computer Journal, 10, 115-117.

ROUSSEEUW, P.J. (1987): Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics, 20, 53-65.

SCOTT, A.J. and SYMONS, M.J. (1971): Clustering Methods Based on Likelihood Ratio Criteria. Biometrics, 27, 387-397.

VERDE, R. (2004): Clustering Methods in Symbolic Data Analysis. In: D. Banks et al. (Eds.): Classification, Clustering and Data Mining Applications. Springer, Berlin, 299-318.

WEINGESSEL, A., DIMITRIADOU, E. and DOLNICAR, S. (1999): An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets. Available at URL: http://www.wu-wien.ac.at/am/wp99.htm#29.