
Simultaneous Pattern and Data Clustering for Pattern Cluster Analysis

Andrew K.C. Wong, Fellow, IEEE, and Gary C.L. Li

Abstract: In data mining and knowledge discovery, pattern discovery extracts previously unknown regularities in the data and is a useful tool for categorical data analysis. However, the number of patterns discovered is often overwhelming. It is difficult and time consuming to 1) interpret the discovered patterns and 2) use them to further analyze the data set. To overcome these problems, this paper proposes a new method that clusters patterns and their associated data simultaneously. When patterns are clustered, the data containing the patterns are also clustered, and the relation between patterns and data is made explicit. Such an explicit relation allows the user, on one hand, to further analyze each pattern cluster via its associated data cluster and, on the other hand, to interpret why a data cluster is formed via its corresponding pattern cluster. Since the effectiveness of clustering mainly depends on the distance measure, several distance measures between patterns and their associated data are proposed. Their relationships to the existing common ones are discussed. Once pattern clusters and their associated data clusters are obtained, each of them can be further analyzed individually. To evaluate the effectiveness of the proposed approach, experimental results on synthetic and real data are reported.

Index Terms: Pattern discovery, contingency table, residual, chi-square test, categorical data analysis.

    1 INTRODUCTION

TODAY, a primary challenge in data mining and knowledge discovery is to discover interesting relationships from data sets. In examining relationships in data, statisticians have studied the problem of testing for correlation among random variables for over a century [1]. To obtain more specific information, pattern discovery (PD) [2], [26], [27] searches and tests for significant correlation among events rather than among variables, and such a correlation is referred to as an event association. Hence, PD shifts the correlation analysis from the variable level to the event level.

The basic idea of PD can be illustrated by a simple XOR problem with three binary variables: A, B, and C = A XOR B. Suppose that we want to check whether or not the occurrence of a compound event (A = T, B = T, C = F) is just a random happening. Given the observed frequency of occurrences o of the compound event, if we can estimate its expected frequency of occurrences e under the random assumption, we know whether it is random or not by checking the difference o - e. Such a notion is formulated as a hypothesis test. A compound event is called an event association pattern, or simply a pattern, if the difference o - e is significant enough to indicate that the compound event is not a random happening.

PD is a useful tool for categorical data analysis. First, the patterns produced are easy to understand for those who are not data mining experts. Hence, it is widely used in business and commercial applications. Second, it assumes very little knowledge about the data from the users. Thus, when the users do not have any a priori knowledge about a data set, PD is a good starting point for exploring the data set. Third, correlation relationships often reveal useful information hidden in the data. Finally, the discovered patterns provide an alternative perspective for further analysis of the data. Using the discovered patterns, we can conduct inference such as classification [3]. The classifiers built from association patterns are known as associative classifiers, which may be easier to understand than conventional classifiers.

However, in most real-world problems, PD typically produces an overwhelming number of patterns, the scope of which is very difficult and time consuming to comprehend. In addition, there is no systematic and objective way of combining fragments of information from individual patterns to produce a more generalized form of information. Since there are too many patterns, it is difficult to use them to further explore or analyze the data. Hence, the problem of too many patterns limits the usefulness of PD.

To address these issues, we propose a new method that simultaneously clusters the discovered patterns and their associated data. It is referred to as simultaneous pattern and data clustering or simply pattern clustering. One important property of the proposed method is that each pattern cluster is explicitly associated with a corresponding data cluster. This explicit data-pattern relation facilitates the subsequent analysis of individual clusters (more specifically, pattern clusters and their associated data clusters). To avoid confusion, from now on, the term cluster refers to both a pattern cluster and its associated data cluster.

To effectively cluster patterns and their associated data, several distance measures are proposed, and their relationships with the existing common measures are discussed. Once a distance measure is defined, existing clustering methods can be used to cluster patterns and their associated data.


The authors are with the Pattern Analysis and Machine Intelligence Laboratory, Department of Electrical and Computer Engineering, University of Waterloo, 200 University Avenue West, Waterloo, Ontario N2L 3G1, Canada. E-mail: {akcwong, gclli}@pami.uwaterloo.ca.

Manuscript received 20 Apr. 2007; revised 15 Dec. 2007; accepted 31 Jan. 2008; published online 11 Feb. 2008. IEEECS Log Number TKDE-2007-04-0178. Digital Object Identifier no. 10.1109/TKDE.2008.38.




After clusters are found, each of them (i.e., a pattern cluster and its associated data cluster) can be further explored and analyzed individually. Since each pattern cluster is associated with a data cluster, the users can 1) further analyze each pattern cluster via its associated data cluster by using existing well-studied techniques such as COBWEB [7] and other discrete-valued data analysis methods [23], [24], [25] and 2) interpret why a data cluster is formed via its associated pattern cluster.

The above procedures for handling a large number of patterns are based on a divide-and-conquer approach, where in the divide phase, patterns and data are simultaneously clustered, and in the conquer phase, individual clusters are further analyzed. This approach is widely used in conventional cluster analysis, where clustering is frequently applied to observe the characteristics of individual clusters and to focus on a particular set of clusters for further analysis. By the same token, this approach is proposed for handling a large set of patterns. It can be used either separately or together with other approaches such as pruning [4] and visualization [5].

The major contributions of this work are summarized as follows:

1. A simultaneous pattern and data clustering method is proposed. It is able to cluster not only patterns but also their associated data while making the relation between patterns and data explicit. The explicit data-pattern relation enables the users to 1) further analyze each pattern cluster via its associated data cluster by using well-studied techniques such as in [7], [23], [24], and [25] and 2) interpret the data cluster via its corresponding pattern cluster. The above procedures take a divide-and-conquer approach for handling a large number of patterns.

2. Since the effectiveness of a pattern clustering algorithm is highly dependent on the distance measures between patterns and their associated data, in this work, several distance measures are proposed, and their relations to existing common measures [4], [5] are studied.

The key component of the proposed approach is the simultaneous pattern and data clustering. If it is able to cluster patterns and the associated data effectively in the divide phase, the subsequent analysis of individual clusters in the conquer phase will become simpler and more effective and can be conducted using existing well-studied techniques.

This paper begins with a review of related work in Section 2, describes the preliminary concepts in Section 3, presents the proposed simultaneous pattern and data clustering in Section 4, describes the data analysis techniques in Section 5, presents the experimental results in Section 6, discusses the relation to other works in Section 7, and finally concludes in Section 8.

    2 RELATED WORK

One of the most important problems in data mining is to discover interesting regularities from a data set. In the 1990s, Agrawal and Srikant developed association rule mining, which discovers association rules from transaction databases [8]. To reduce the search space, an important property called the Apriori property was used. Based on this property, very efficient algorithms were developed for very large databases [9]. Association rule mining has been extensively studied and widely used in various real-world applications. It is a powerful tool for exploring and analyzing data.

Association rule mining is well suited to applications such as market basket analysis. However, Brin et al. [10] pointed out that for some applications where item correlation is required, association rules may be misleading. For example, in Table 1, the association rule (Tea = Y) => (Coffee = Y) has a 20 percent support and an 80 percent confidence [10]. With fairly high support and confidence, we may consider it a valid rule and believe that customers who buy tea will also buy coffee. However, Tea = Y and Coffee = Y are actually negatively correlated, since the ratio P{Tea = Y and Coffee = Y} / (P{Tea = Y} P{Coffee = Y}) = 0.2 / (0.25 x 0.9) = 0.89 < 1.

To address this issue, Brin et al. [10] proposed the use of chi-square statistics to detect correlation rules from contingency tables. However, since the chi-square statistic obtained from the entire contingency table was designed for testing correlations among random variables rather than among events, the correlation rule is less accurate if the contingency table data are sparse.

PD moves the hypothesis test from the entire contingency table to its individual cells [2], [26], [27]. In Table 1, to determine whether (Tea = Y, Coffee = Y) is a significant pattern, it tests the difference between the observed frequency o = 20 and the expected frequency under the independence assumption, e = 100 x P{Tea = Y} x P{Coffee = Y} = 100 x 0.25 x 0.9 = 22.5. If the difference 20 - 22.5 = -2.5 is significant enough, we would conclude that Tea = Y and Coffee = Y are negatively associated. Since the difference and, hence, the hypothesis test is obtained from an individual cell in the table, PD can handle sparse contingency table data. The relation between PD, association rule mining, and chi-square statistics will be discussed in Section 7 in more detail.
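As a minimal sketch (not part of the paper's own software), the two checks above can be reproduced in Python from the counts quoted in the text; the variable names are illustrative only:

    # Variable-level correlation check vs. PD's cell-level check for Table 1.
    M = 100                       # total transactions
    p_tea, p_coffee = 0.25, 0.90  # marginal probabilities quoted in the text
    o = 20                        # observed frequency of (Tea = Y, Coffee = Y)

    # Ratio used by Brin et al. [10]: a value below 1 indicates negative correlation.
    ratio = (o / M) / (p_tea * p_coffee)     # 0.2 / 0.225, about 0.889

    # Cell-level comparison used by PD: observed vs. expected frequency.
    e = M * p_tea * p_coffee                 # 22.5
    diff = o - e                             # -2.5
    print(f"ratio = {ratio:.3f}, expected = {e}, o - e = {diff}")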

A common problem encountered by most rule/pattern mining methods is the overwhelming number of rules/patterns that they often produce. Some researchers suggest using additional user specification to select interesting rules. Silberschatz and Tuzhilin [11] suggested specifying existing knowledge to search for unexpected association rules. Srikant et al. [12] used item constraints for specifying the required association rules. Others suggest the pruning of uninteresting association rules based on certain criteria.


TABLE 1
Contingency Table of the Purchase of Tea and Coffee [10]


Suppose that PD [2] produces a set of patterns {x_1^{s_1}, x_2^{s_2}, ..., x_n^{s_n}}. Then, the set of samples matched by a pattern x_i^{s_i} is denoted by m(i), the set of samples in D whose compound events contain x_i^{s_i}. A pattern-induced data cluster of a pattern x_i^{s_i} is the set of compound events containing x_i^{s_i} and is represented as

$$ I(i) = \{ x^s(x) \mid x \in m(i),\ s = s_i \}. \qquad (6) $$

As an example, in Fig. 1, x_1^{3,4,5,6} is a fourth-order pattern (eggs = 1, aquatic = 0, backbone = 0, tail = 0), where the attribute index set {3, 4, 5, 6} refers to the attributes {eggs, aquatic, backbone, tail}. By the same token, x_2^{7,8,9,10} represents the pattern (milk = 0, airborne = 1, breathes = 1, fins = 0), and x_3^{3,4,5,6,7,8,9,10} represents (eggs = 1, aquatic = 0, backbone = 0, tail = 0, milk = 0, airborne = 1, breathes = 1, fins = 0). Then, I(1) represents the data cluster P1 induced by pattern x_1^{3,4,5,6}, I(2) represents P3 induced by x_2^{7,8,9,10}, and I(3) represents P13 induced by x_3^{3,4,5,6,7,8,9,10}.

The pattern-induced data clusters defined above are constant clusters. That is, each attribute has only one value in the cluster. Since each pattern can induce a constant cluster, the number of constant clusters is overwhelming. To reduce the number, it is desirable to merge clusters. Let I(i) and I(j) be two data clusters induced by patterns x_i^{s_i} and x_j^{s_j}, respectively. The merged data cluster of I(i) and I(j) is the union of their matched samples and matched attributes.

    More precisely,

$$ I(i, j) = \{ x^s(x) \mid x \in m(i) \cup m(j),\ s = s_i \cup s_j \}. \qquad (7) $$

    The above definition can be generalized to n patterns,

i.e., I(1, ..., n) = { x^s(x) | x in m(1) U ... U m(n), s = s_1 U ... U s_n }. For instance, in Fig. 1, the highlighted block I(1, 2) is merged from I(1) and I(2). Note that I(1, 2, 3), being merged from the three data clusters, is the same as I(1, 2).
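A minimal sketch of how a pattern-induced data cluster might be represented by its matched sample set m(i) and attribute index set s_i, with merging as the unions in (7). The class name and the concrete index sets are illustrative assumptions, loosely following Fig. 1:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class InducedCluster:
        samples: frozenset     # m(i): indices of samples matched by the pattern
        attributes: frozenset  # s_i: indices of attributes fixed by the pattern

    def merge(ci: InducedCluster, cj: InducedCluster) -> InducedCluster:
        """Merged data cluster of (7): union of matched samples and matched attributes."""
        return InducedCluster(ci.samples | cj.samples, ci.attributes | cj.attributes)

    # Two fourth-order patterns over a zoo-like table (the sample indices are made up).
    I1 = InducedCluster(frozenset(range(0, 20)), frozenset({3, 4, 5, 6}))
    I2 = InducedCluster(frozenset(range(10, 35)), frozenset({7, 8, 9, 10}))
    I12 = merge(I1, I2)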

4.2 Distance Measures between Patterns/Rules

When two data clusters are merged, the corresponding patterns inducing them are simultaneously clustered. Hence, merging data clusters can also be seen as clustering patterns. In the literature, the major issue of pattern/rule clustering is to define a distance measure between patterns/rules. Once a measure is defined, existing clustering methods can be directly applied.

When defining distances between rules, most measures in the literature do not consider the direction of rules, since it is usually irrelevant to the distance among rules or can be considered separately. For instance, the rule (A = T) => (C = T) is considered as a pattern (A = T, C = T). This view simplifies the subsequent discussion of distance measures between rules or patterns.

There are numerous ways of defining distance/similarity between patterns/rules. A commonly used distance between two association rules, as proposed by Toivonen et al., is the number of samples where the rules differ [4]. As an illustration, in Fig. 2, m(i) and m(j) are the sample sets matched by patterns/rules x_i^{s_i} and x_j^{s_j}, respectively. Let r_i (r_j) be the number of samples matched by x_i^{s_i} (x_j^{s_j}) but not by x_j^{s_j} (x_i^{s_i}). That is, $r_i = |m(i) \setminus m(j)|$ and $r_j = |m(j) \setminus m(i)|$. Let r_ij be the number of samples matched by both x_i^{s_i} and x_j^{s_j}. That is, $r_{ij} = |m(i) \cap m(j)|$. The distance is defined as [4]

$$ d_T(i, j) = r_i + r_j. \qquad (8) $$

In [5], Gupta et al. pointed out that d_T tends to give higher values for rules that are matched by more samples (i.e., high |m(i)|). To address this problem, they proposed a normalized distance [5]:

$$ d_G(i, j) = 1 - \frac{r_{ij}}{r_i + r_j + r_{ij}}. \qquad (9) $$

One can have other variants of distances/similarities based on the set of matched samples. For instance, we find that the ratio

$$ d_R(i, j) = \frac{r_i + r_j}{r_{ij}} \qquad (10) $$

works fairly well, since it captures both the similarity r_ij and the dissimilarity r_i + r_j between two patterns/rules. In addition, d_R = 1 has the simple interpretation that the number of different samples is the same as the number of common samples. It can be used as a natural threshold for stopping a clustering algorithm. Intuitively, we should not cluster two rules/patterns together if d_R > 1, since there is more dissimilarity than similarity between the two patterns.
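A small sketch of the three sample-based measures above, computed directly from two matched-sample sets; the helper name is ours, not from the cited papers:

    def sample_distances(mi: set, mj: set):
        """d_T of (8), d_G of (9), and d_R of (10) from matched-sample sets."""
        ri = len(mi - mj)        # samples matched only by pattern i
        rj = len(mj - mi)        # samples matched only by pattern j
        rij = len(mi & mj)       # samples matched by both patterns
        d_T = ri + rj
        d_G = 1.0 - rij / (ri + rj + rij) if (ri + rj + rij) else 0.0
        d_R = (ri + rj) / rij if rij else float("inf")   # undefined if nothing is shared
        return d_T, d_G, d_R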


    Fig. 1. An example from the zoo data set [17].

    Fig. 2. A set of matched samples.



It has been reported that distance/similarity measures based on matched samples, such as d_T [4] and d_G [5], work fairly well for clustering similar patterns, because they go back to the samples where the patterns/rules share or differ. However, these measures do not give special consideration to the attributes where the patterns share or differ. As an illustration, consider the two pairs of patterns x_i^{s_i}, x_j^{s_j} and x_p^{s_p}, x_q^{s_q} in Fig. 3. In the figure, r_i, r_j, and r_ij bear the same meaning as in Fig. 2. Let c_i (c_j) be the number of attributes matched by x_i^{s_i} (x_j^{s_j}) but not by x_j^{s_j} (x_i^{s_i}). That is, $c_i = |s_i \setminus s_j|$ and $c_j = |s_j \setminus s_i|$. Let c_ij be the number of attributes matched by both x_i^{s_i} and x_j^{s_j}. That is, $c_{ij} = |s_i \cap s_j|$.

In Fig. 3, measures d_T, d_G, and d_R will yield the same value for the two pairs of patterns, since r_i = r_p, r_j = r_q, and r_ij = r_pq. However, it seems more reasonable to consider x_i^{s_i} and x_j^{s_j} as more similar, since they share certain attributes (c_ij > 0), while x_p^{s_p} and x_q^{s_q} do not (c_pq = 0). This motivates us to introduce the notion of the pattern-induced data cluster I(i) in (6), which considers both the set of matched samples and the set of matched attributes. In Fig. 3, the four highlighted rectangles are actually the induced data clusters I(i), I(j), I(p), and I(q). In addition, when two data clusters, say, I(i) and I(j), are merged, the compound events in the upper-right and bottom-left corner regions are added into the merged data cluster. The corner regions are defined by

$$ I(i|j) = \{ x^s(x) \mid x \in m(i) \setminus m(j),\ s = s_j \setminus s_i \}, \qquad (11) $$

$$ I(j|i) = \{ x^s(x) \mid x \in m(j) \setminus m(i),\ s = s_i \setminus s_j \}. \qquad (12) $$

One possible measure for considering both the matched samples and the matched attributes is

$$ d_{RC}(i, j) = w_r \frac{r_i + r_j}{r_{ij}} + w_c \frac{c_i + c_j}{c_{ij}}, \qquad (13) $$

where w_r and w_c are the weights of the samples and the attributes, respectively. If w_r + w_c = 1, then d_RC > 1 can again be used as a natural stopping criterion. For example, if we consider the number of matched samples and the number of matched attributes equally important, we may set w_r and w_c to 0.5.
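Following the same conventions, a sketch of d_RC; the default weights reflect the 0.5/0.5 choice mentioned above and the helper name is ours:

    def d_RC(mi: set, mj: set, si: set, sj: set, wr: float = 0.5, wc: float = 0.5):
        """d_RC of (13): weighted sum of the sample ratio and the attribute ratio."""
        ri, rj, rij = len(mi - mj), len(mj - mi), len(mi & mj)
        ci, cj, cij = len(si - sj), len(sj - si), len(si & sj)
        if rij == 0 or cij == 0:     # no shared samples or no shared attributes
            return float("inf")
        return wr * (ri + rj) / rij + wc * (ci + cj) / cij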

One problem of the measure d_RC is that it does not consider the variation within the data clusters. The measures discussed so far, including d_T, d_G, d_R, and d_RC, are designed to measure the distance among patterns. However, they do not measure the distance among the data associated with the patterns. It is worth stating at this point that clustering patterns without simultaneously clustering their associated data may lose insights about how patterns and data are interrelated. On the one hand, while most data analysis techniques are inapplicable to analyze a pattern cluster, they can be applied to its associated data cluster if the users know how patterns are realized in data. This allows us to analyze pattern clusters via their associated data clusters by using standard analysis techniques [7], [23], [24], [25]. In Section 5, we also introduce a simple technique for effectively analyzing individual data clusters. On the other hand, knowing which set of patterns induces which data clusters allows users to use understandable patterns for interpreting and validating the data clusters. In view of this, it is desirable to simultaneously cluster patterns and data and keep their relations explicit.

To obtain good data clusters, we would like to minimize the variation in the clusters. A common measure of variation/uncertainty for discrete-valued data is entropy:

$$ H(I) = -\sum_{x^s \in I} P(x^s) \log P(x^s), \qquad (14) $$

where P(x^s) is the joint probability distribution of x^s, and I is the abbreviation of I(1, ..., n).

However, the number of parameters needed to estimate P(x^s) is exponential. For |s| binary variables, it is of order O(2^{|s|}). Hence, it is impractical to estimate P(x^s) directly due to the lack of data in most real-world problems. To reduce the number of parameters, we adopt a naive assumption that the attributes are conditionally independent given a data cluster. The joint entropy is then estimated by summing up the entropies of the individual attributes:

$$ H(I) = -\sum_{i \in s} \sum_{x_i \in x^s,\ x^s \in I} P(x_i) \log P(x_i), \qquad (15) $$

where s is the attribute index set of I. P(x_i) is the probability of x_i in I and is estimated by

$$ P(x_i) = \frac{o(x_i)}{|I|}, \qquad (16) $$

where o(x_i) is the observed frequency of x_i in I, and |I| is the number of compound events in I. The computational complexity of H(I) in (15) is O(|I||s|). Since |s| is usually much smaller than |I|, the complexity could be O(|I|), which is linear. All constant clusters have zero entropy. For example, in Fig. 1, H(I(1)) = 0 and H(I(2)) = 0. When merging patterns x_1^{3,4,5,6} and x_2^{7,8,9,10}, the entropy of the merged data cluster increases (e.g., H(I(1, 2)) = 3.66), indicating that the variation/uncertainty of the clusters increases due to the merging.

Note that H(I) is bounded as $0 \le H(I) \le \sum_{i \in s} \log m_i$, where m_i is the number of possible values of the ith attribute. Hence, H(I) in (15) can be normalized:

$$ \bar{H}(I) = \frac{H(I)}{\sum_{i \in s} \log m_i}. \qquad (17) $$


    Fig. 3. Pattern-induced data clusters.


$\bar{H}(I)$ does not explicitly consider the numbers of matched samples and matched attributes. Hence, it should be weighted by the area of I:

$$ d_O(I) = |I|\,|s|\,\bar{H}(I), \qquad (18) $$

where |I| is the number of compound events in I, and |s| is the number of attributes in I.

An appropriate weighting is important for comparing the normalized entropies in regions with different sizes. Intuitively, larger regions should have a greater impact than smaller regions and should thus be assigned a greater weight.
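A sketch of the area-weighted entropy d_O of (15)-(18) under the conditional independence assumption; the logarithm base and the input layout (one tuple of attribute values per compound event) are our assumptions:

    import math
    from collections import Counter

    def d_O(cluster_rows, attr_domain_sizes):
        """Area-weighted normalized entropy |I| * |s| * H_bar(I) of (18).
        cluster_rows: list of tuples, one value per attribute of the cluster.
        attr_domain_sizes: number of possible values m_i of each attribute."""
        if not cluster_rows:
            return 0.0
        n_rows, n_attrs = len(cluster_rows), len(attr_domain_sizes)
        h = 0.0
        for a in range(n_attrs):                  # per-attribute entropy, eq. (15)
            for count in Counter(row[a] for row in cluster_rows).values():
                p = count / n_rows                # eq. (16)
                h -= p * math.log2(p)
        h_max = sum(math.log2(m) for m in attr_domain_sizes)   # bound used in (17)
        return n_rows * n_attrs * (h / h_max if h_max else 0.0)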

One problem of (18) is that the conditional independence assumption may not be realistic. If a compound event is said to be a pattern, we reject the null hypothesis that the primary events of the compound event are independent (see Section 3). Hence, the compound events that induce a pattern must be dependent, violating the conditional independence assumption. To address this problem, instead of directly estimating the entropy of the entire data cluster, we estimate the entropy of those data not induced by patterns. In Fig. 3, since the corner regions are not induced by either pattern x_i^{s_i} or x_j^{s_j}, it seems more reasonable to assume that the compound events there are independent rather than to assume that the compound events in the entire data cluster are independent.

This motivates us to estimate the entropy of the corner regions after merging. Intuitively, when merging two data clusters, the only sources of variation come from the corner regions. Hence, it is desirable to minimize the entropy therein. We therefore introduce the measure

$$ d_D(I(i), I(j)) = |I(i|j)|\,|s(i|j)|\,\bar{H}(I(i|j)) + |I(j|i)|\,|s(j|i)|\,\bar{H}(I(j|i)), \qquad (19) $$

where s(i|j) is the attribute index set of I(i|j). Note, again, that the normalized entropies of the corner regions must be weighted appropriately by their areas when they are added together.
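A speculative sketch of d_D that reuses the d_O and InducedCluster sketches above; how the corner regions are materialized and how per-attribute domain sizes are obtained are our assumptions:

    def d_D(records, ci, cj):
        """d_D of (19): area-weighted entropy of the two corner regions I(i|j) and I(j|i).
        records maps a sample index to its list of attribute values."""
        def corner(samples_only, attrs_only):
            attrs = sorted(attrs_only)
            return [tuple(records[x][a] for a in attrs) for x in samples_only]

        region_ij = corner(ci.samples - cj.samples, cj.attributes - ci.attributes)  # I(i|j), eq. (11)
        region_ji = corner(cj.samples - ci.samples, ci.attributes - cj.attributes)  # I(j|i), eq. (12)
        total = 0.0
        for region in (region_ij, region_ji):
            if region and region[0]:
                # rough domain sizes estimated from the region itself
                domains = [max(len(set(col)), 2) for col in zip(*region)]
                total += d_O(region, domains)
        return total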

A final remark is that the discussion in this section is simplified because it only considers measures that ignore rule direction, and hence, rules are considered as patterns. Although this is true for most pattern clustering methods, it is also possible to use rule direction in pattern distances. In [18], each rule is represented by a directed hypergraph, and the rule directions are encoded in the directed hyperedges. Rule pruning is then considered as a hypercycle removal problem. To remove hypercycles, d_G is used to measure the distance between two rules. Hence, to measure rule distance, rule direction can be augmented with the measures discussed in this section. However, in this paper, we focus on the problem of clustering patterns. Hence, rule direction is not used.

    4.3 The Clustering Algorithm

Once a distance/similarity measure is defined, many clustering algorithms can be applied to cluster patterns. In this paper, we adopt the agglomerative hierarchical algorithm. Besides its simplicity in computation, hierarchical clustering is insensitive to the ordering of the input patterns. Given a distance measure, the algorithm always produces the same clustering result. Hence, it is ideal for studying and comparing the properties of the various measures discussed in Section 4.2. Obviously, with slight modification of the proposed distances, other clustering algorithms can be used. For example, to improve speed, one might use K-Means. To visualize the geometric relationship among patterns, one might consider a self-organizing map. The major steps in agglomerative clustering are contained in the following procedure, where c is the desired number of final clusters:

1: begin initialize c, c' = n, I(i), i = 1, ..., n
2:   do c' = c' - 1
3:     find the nearest clusters, say, I(i) and I(j)
4:     merge I(i) and I(j)
5:   until c = c'
6:   return c clusters
7: end

Since hierarchical clustering is well studied, we present it here only for completeness.
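A plain sketch of the agglomerative loop above for pattern-induced clusters; the distance argument is any pairwise measure on two clusters (for example, a wrapper around d_RC), and the names are illustrative:

    def agglomerative_pattern_clustering(clusters, distance, c):
        """Repeatedly merge the two nearest clusters until only c clusters remain."""
        clusters = list(clusters)
        while len(clusters) > c:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = distance(clusters[a], clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            merged = merge(clusters[a], clusters[b])   # union of samples and attributes
            clusters = [x for k, x in enumerate(clusters) if k not in (a, b)]
            clusters.append(merged)
        return clusters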

5 DATA ANALYSIS TECHNIQUES FOR INDIVIDUAL DATA CLUSTERS

While pattern clustering facilitates the management and comprehension of an enormous number of patterns, we still need methods for further analyzing the resulting clusters. In the literature, due to the inadequacy of a single method for handling a large set of patterns, a hybrid approach combining pattern pruning, clustering, summarizing, and visualization is often used [4], [5], [6]. Similarly, pattern clustering itself does not render an adequate solution for handling the large number of patterns. However, being the divide phase, it is the key component in the proposed divide-and-conquer approach. In the conquer phase, once clusters (i.e., pattern clusters and their associated data clusters) are obtained, each of them is further analyzed individually.

Recall that clusters produced by pattern clustering maintain an explicit data-pattern relation. Such a relation serves two purposes. First, it relates the cluster of patterns back to where they are realized in the data set. Once the data containing the cluster of patterns are located, existing well-studied techniques such as those in [7], [23], [24], and [25] can be applied to the data cluster, which, in turn, helps analyze the corresponding pattern cluster. Second, a cluster of data is usually difficult to interpret and validate, whereas patterns are easy to comprehend and can be expressed in natural language or query statements. The data-pattern relation allows us to use easily comprehensible patterns for interpreting the meaning of the data cluster and assessing whether it makes sense or not.

An advantage of the proposed approach is that random noise is filtered from the obtained data clusters, and thus, its effects are minimized. Since patterns are detected by rejecting the random null hypothesis, most random noise is filtered from the pattern-induced data clusters. Although a certain amount of noise may be introduced into the data clusters during merging, its effects are minimized if distance measure d_O or d_D is used. Hence, data analysis methods can be applied to analyze the data clusters without being heavily affected by noise and unrelated data.


Essentially, once patterns and data are clustered by pattern clustering in the divide phase, any data analysis technique can be applied to the obtained data clusters in the conquer phase. In this section, as an illustration, we propose a simple technique for discovering subgroups in a data cluster.

To measure the interdependence between attributes, a normalized mutual information measure R_ij between attributes X_i and X_j was introduced [19], [20]:

$$ R_{ij} = \frac{MI(i, j)}{H(i, j)}, \qquad (20) $$

where MI(i, j) and H(i, j) are the mutual information and the joint entropy between X_i and X_j, respectively. The higher the dependence between attributes X_i and X_j, the higher the MI(i, j) value. R_ij = 1 if X_i and X_j are totally dependent, and R_ij = 0 if they are independent. One advantage of using the normalized mutual information is that a hypothesis test can be formulated to test for the significance of attribute dependence. Two attributes X_i and X_j are interdependent if

$$ R_{ij} > \frac{\chi^2_{(m_i - 1)(m_j - 1)}}{2\,|I|\,H(i, j)}. \qquad (21) $$

Otherwise, the two attributes are considered independent. To account for the overall significant dependence of an attribute X_i with the other attributes, we introduced the sum of normalized interdependency redundancy SR(i) of X_i with the other attributes X_j, for all j, j != i, in the data cluster [21]. Thus,

$$ SR(i) = \sum_{j:(i, j) \in N} R_{ij}, \qquad (22) $$

where N is the set of (i, j) attribute pairs satisfying (21).

Based on SR(i), a simple recursive algorithm, referred to as a subgrouping tree, is proposed in this paper to discover subgroups in a data cluster. The major steps of the algorithm are contained in the following procedure, where t is a user-specified threshold, I is an input data cluster, and D(i) is a partition of I based on the primary events x_i of the attribute X_i:

1: find the attribute X_i with the highest SR(i) in I
2: if SR(i) >= t then
3:   partition the compound events in I into D(i) based on the different primary events x_i
4:   for each x_i
5:     if D(i) is not empty then call SubgroupingTree(D(i), t)   {recursively call the algorithm}
6: end
7: else return I

The algorithm recursively partitions an input data cluster I into D(i) until the SR(i) of all attributes is smaller than a prescribed threshold t.
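A sketch of the subgrouping tree built on (20)-(22). It assumes SciPy is available for the chi-square quantile and uses a 95 percent significance level, which the text does not specify; rows are tuples of attribute values:

    import math
    from collections import Counter
    from itertools import combinations
    from scipy.stats import chi2

    def normalized_mi(col_i, col_j):
        """R_ij of (20) together with the joint entropy H(i, j)."""
        n = len(col_i)
        joint, pi, pj = Counter(zip(col_i, col_j)), Counter(col_i), Counter(col_j)
        mi = sum((c / n) * math.log2((c / n) / ((pi[a] / n) * (pj[b] / n)))
                 for (a, b), c in joint.items())
        h = -sum((c / n) * math.log2(c / n) for c in joint.values())
        return (mi / h if h else 0.0), h

    def sum_redundancy(columns):
        """SR(i) of (22): sum of R_ij over the attribute pairs passing the test (21)."""
        n = len(columns[0])
        sr = [0.0] * len(columns)
        for i, j in combinations(range(len(columns)), 2):
            r, h = normalized_mi(columns[i], columns[j])
            dof = (len(set(columns[i])) - 1) * (len(set(columns[j])) - 1)
            if dof and h and r > chi2.ppf(0.95, dof) / (2 * n * h):  # test (21), 0.95 assumed
                sr[i] += r
                sr[j] += r
        return sr

    def subgrouping_tree(rows, t):
        """Recursively partition a data cluster on its most redundant attribute."""
        if not rows or len(rows[0]) < 2:
            return rows
        sr = sum_redundancy([list(col) for col in zip(*rows)])
        best = max(range(len(sr)), key=sr.__getitem__)
        if sr[best] < t:
            return rows                                 # leaf: no attribute passes t
        groups = {}
        for row in rows:
            groups.setdefault(row[best], []).append(row)  # split on primary events of X_best
        return {v: subgrouping_tree(g, t) for v, g in groups.items()}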

    6 EXPERIMENTAL RESULTS

6.1 Synthetic Data Set 1: An Investigation of the Properties of Various Distance Measures

To study the properties of the measures discussed in Section 4.2, synthetic data sets are used. In all the experiments in this paper, the PD algorithm in the software Discover*e [22] is used to discover patterns, and then hierarchical clustering is used to cluster the discovered patterns by using the various measures. The parameter values used are shown in Table 2.

The stopping criterion of pattern clustering depends on the measure that it uses. If d_R and d_RC are used, the stopping criteria d_R > 1 and d_RC > 1 are available (see Section 4.2). For measures d_T, d_G, d_O, and d_D, such a stopping criterion is not available. A common method is to stop the hierarchical clustering algorithm if the distance is noticeably increased after merging. Usually, such noticeable increases can be observed if a data set is well separated and a suitable distance is used. If noticeable increases cannot be observed, it implies that either the data set is not well separable or the distance is not suitable. Since the data sets used in the experiments are fairly well separable, noticeable increases should be observed if appropriate measures are used. In the experiments, the above method is used to stop the clustering algorithm when using d_T, d_G, d_O, and d_D as the distance.

Two synthetic data sets, each of which contains 10 attributes and 300 samples, are randomly generated to study the properties of the measures. The first data set contains three pattern-induced data clusters I(1), I(2), and I(3), as shown in Fig. 4. Each cluster consists of four attributes and 200 samples. Ten percent random noise was added to each cluster. Measures d_T, d_G, and d_R yield the same value for the pair I(1), I(2) and the pair I(2), I(3) (with d_T = 200, d_G = 0.667, and d_R = 2 for both pairs). Thus, the clustering algorithm first clusters I(1) and I(2) and then I(3). In contrast, d_RC, d_O, and d_D yield a lower value for the pair I(2), I(3) than for the pair I(1), I(2) (e.g., d_RC = 2 for the pair I(1), I(2) and d_RC = 1 for the pair I(2), I(3)). Hence, the clustering algorithm first clusters I(2) and I(3) and then I(1). This experiment shows that d_RC, d_O, and d_D are able to take into account the difference in the matched attributes between patterns, while d_T, d_G, and d_R are not.

The second data set has the same configuration as the first one. However, the data in the regions I(2|1), I(3|2), and I(2|3) are first set to be constant, and then their levels of randomness are gradually increased. The values of d_T, d_G, d_R, and d_RC remain unchanged as the level of randomness increases, since the number of matched samples and the number of matched attributes are still the same.


TABLE 2
Parameter Values Used in All Experiments


However, the values of d_O and d_D change to reflect the increase in variation. Fig. 5 shows the values of d_O and d_D between the pair I(1), I(2) and the pair I(2), I(3) as randomness increases. The values of d_O and d_D are very close for both pairs. This may be explained by the observation that the source of variation mainly comes from the corner regions I(2|1), I(2|3), and I(3|2). Hence, although d_O estimates the variation in the entire data clusters, the entropy is mainly due to the variation in the corner regions as estimated by d_D. Thus, the values of d_O and d_D are very close. Furthermore, the distance between I(2) and I(3) is always lower than the distance between I(1) and I(2), regardless of the level of randomness. Hence, both d_O and d_D consider I(2) and I(3) as closer than I(1) and I(2).

To compare the difference between d_O and d_D, the level of randomness in regions I(1), I(2), and I(3) is increased to simulate the noise accumulated in the clusters over a sequence of merging. As randomness increases, d_O increases, while d_D remains the same. Hence, d_O is more accurate if the cumulative noise is significant. In contrast, d_D ignores the cumulative noise in the clusters and only measures the noise in the corner regions. However, d_D could be fairly accurate if the amount of noise introduced during merging is minimized.

6.2 Synthetic Data Set 2: An Investigation of the Differences of Various Distance Measures

In this section, synthetic data sets are designed to study the differences among the six measures. In Fig. 6, the two synthetic data sets consist of 15 attributes and 500 samples. In Fig. 6a, the data set contains four clusters: 1, 2, 3, and 4. Clusters 1 and 2 are originally from a larger cluster A that consists of four attributes and 75 samples (the enclosed dashed-line rectangle), whereas clusters 3 and 4 are originally from cluster B. Clusters 1 and 2 are obtained by adding random noise to cluster A. Similarly, clusters 3 and 4 are obtained by adding random noise to cluster B. We want to see whether the measures can recognize that clusters 1 and 2 come from cluster A and that clusters 3 and 4 come from cluster B. It turns out that d_T [4], d_G [5], and d_R group clusters 2 and 3 together, even though they do not share any attribute, and leave clusters 1 and 4 alone. This is because clusters 2 and 3 share 30 samples, which is more than the samples shared between clusters 1 and 2 (20 samples) or between clusters 3 and 4 (25 samples). On the other hand, d_RC, d_O, and d_D group clusters 1 and 2 together and group clusters 3 and 4 together. d_RC does so because clusters 1 and 2 (or clusters 3 and 4) share one attribute, while clusters 2 and 3 do not share any attribute. d_O and d_D do so not only because clusters 1 and 2 (or clusters 3 and 4) share one attribute but also because the common attribute shares the same value 2 (or 7), making the entropy lower.

In Fig. 6b, clusters 2 and 3 are originally from cluster A and share one attribute value 6. Although clusters 1 and 2 also share one attribute, their attribute values (4 in cluster 1 and 5 in cluster 2) are different. Measures d_T [4], d_G [5], and d_R consider clusters 1 and 2 to be the same distance apart as clusters 2 and 3, because neither cluster pair shares any sample. Hence, depending on the order of the input patterns, these measures may group clusters 1 and 2 first, leaving cluster 3 alone. d_RC also considers the two cluster pairs to be equidistant, because they both share one attribute. However, d_O and d_D always group clusters 2 and 3 first, because their shared attribute has the same value 6.

6.3 Real-World Data Sets: Cluster Entropy and Minkowski Scores

We assess the experimental results on real-world data sets both quantitatively and qualitatively. In this section, quantitative measures, including cluster entropy [7] and the Minkowski score (MS) [7], are reported for three benchmark data sets, Zoo, Splice, and Breast Cancer, obtained from UCI [17]. Cluster entropy and MS are measures of the quality of a clustering solution, given the true clustering. In Section 6.4, each real data set is discussed qualitatively. In all experiments, the class labels of the data samples are deleted and are only used for evaluation purposes.


Fig. 4. A synthetic data set.

Fig. 5. The values of dO and dD against the level of randomness in the corner regions.

Fig. 6. Two synthetic data sets for the study of different distance measures.


Suppose that a data set containing S classes is clustered into K clusters. Let n_k be the number of samples in the kth cluster and let n_sk be the number of samples from the sth class in the kth cluster. The entropy of the kth cluster is defined as

$$ H_k = -\sum_{s=1}^{S} \frac{n_{sk}}{n_k} \log \frac{n_{sk}}{n_k}. \qquad (23) $$

The total entropy for the set of K clusters is given by

$$ Ent = \sum_{k=1}^{K} \frac{n_k}{M} H_k, \qquad (24) $$

where M is the number of samples in the data set. The cluster entropy is a measure of the class purity of the clusters. The optimum value is 0, with lower values being better.

MS measures the consistency between a clustering solution and a given true clustering. To define MS, let T be the true solution and let C be the solution that we wish to measure. Let n_11 denote the number of pairs of samples that are in the same cluster in both C and T. Let n_01 denote the number of pairs that are in the same cluster only in C, and let n_10 denote the number of pairs that are in the same cluster only in T. The MS is then defined as

$$ MS(T, C) = \sqrt{\frac{n_{01} + n_{10}}{n_{11} + n_{10}}}. \qquad (25) $$

For MS, the optimum score is also 0, with lower scores being better.
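A minimal sketch of the two evaluation measures, computed from a list of true class labels and a list of assigned cluster labels; the function names are ours:

    import math
    from collections import Counter
    from itertools import combinations

    def cluster_entropy(true_labels, cluster_labels):
        """Total cluster entropy Ent of (23)-(24); 0 means perfectly pure clusters."""
        M = len(true_labels)
        ent = 0.0
        for k in set(cluster_labels):
            members = [t for t, c in zip(true_labels, cluster_labels) if c == k]
            nk = len(members)
            hk = -sum((n / nk) * math.log2(n / nk) for n in Counter(members).values())
            ent += (nk / M) * hk
        return ent

    def minkowski_score(true_labels, cluster_labels):
        """MS of (25): sqrt((n01 + n10) / (n11 + n10)) over all sample pairs."""
        n11 = n01 = n10 = 0
        for a, b in combinations(range(len(true_labels)), 2):
            same_c = cluster_labels[a] == cluster_labels[b]
            same_t = true_labels[a] == true_labels[b]
            n11 += same_c and same_t
            n01 += same_c and not same_t
            n10 += same_t and not same_c
        return math.sqrt((n01 + n10) / (n11 + n10)) if (n11 + n10) else 0.0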

The cluster entropies and MS are reported in Table 3 for all three real-world data sets. The table consists of seven columns. Column 1 lists the various distance measures. Columns 2-4 contain both Ent and MS results for each data set (Zoo, Splice, and Breast) when pattern clustering with a prescribed cluster number is run. The prescribed cluster number is the class number of each data set (i.e., 7 for Zoo, 3 for Splice, and 2 for Breast). Columns 5-7 contain the cluster number (No), Ent, and MS values when the automatic termination method described at the beginning of Section 6.1 is used. For each column, the best values are highlighted. Prescribing a cluster number allows us to compare the cluster entropy and MS fairly, since a larger cluster number tends to have lower cluster entropy (more clusters imply smaller and purer clusters). However, we also report the entropy and MS with automatic termination, since we do not assume that pattern clustering knows the cluster number in advance.

For the Zoo data in column 2 of Table 3, both the Ent and MS of d_O are the best, while those of d_D are the second. For the Splice data in column 3, the Ent of d_G and d_R are the best. d_D obtains the second best Ent and the best MS values. Considering both measures, the performance of d_D is the best for the Splice data. As for the Breast data, the Ent of d_T is the best. d_G, d_R, and d_O obtain the second best Ent and the best MS. Overall, d_T is the best, while d_G, d_R, and d_O are the second best.

Columns 5-7 report the Ent and MS values when automatic termination is used. For the Zoo data in column 5, d_D automatically produces seven clusters. The Ent of d_RC is the best. However, d_RC produces 18 clusters. Since more clusters tend to be smaller and purer, its lowest Ent does not mean that it is the best. We can only conclude that d_O obtains the best Ent among d_T, d_R, and d_O, since they all get three clusters. As for MS, d_O and d_D obtain the best and the second best scores, respectively. Overall, the performance of d_O and d_D is fairly good. For the Splice data in column 6, both d_T and d_O produce three clusters. d_D obtains four, which is very close to three. The MS of all measures are very close (with only a 0.009 difference). Overall, d_T seems the best for the Splice data, while the other measures, except for d_G, work fairly well. For the Breast data in column 7, d_RC and d_O produce two clusters. Comparing d_RC and d_O, d_RC has a lower Ent value, while d_O has a lower MS. Hence, both measures work fairly well for the Breast data.

In this set of experiments, the proposed measures, especially d_O and d_D, perform fairly well in all cases. In some cases, they even achieve the best performance. However, the results may not reflect the whole picture of clustering quality. There are two limitations in our experiments. First, cluster entropy and MS are only two common measures of clustering solutions. They cannot capture all aspects of clustering quality. In fact, many aspects of clustering quality may not be measurable at all. Second, the class labels may not necessarily be the true clustering. In fact, it is difficult, if not impossible, to define true clustering. However, the experimental results suggest that the proposed measures are better (in terms of Ent, MS, and the ability to automatically determine a reasonable cluster number) than the existing common measures d_T [4] and d_G [5] in many cases.


TABLE 3
Cluster Entropy and Minkowski Scores for Real Data


Table 4 shows the number of patterns produced by PD [2] for each data set and the average execution times of all methods, including PD, all distance measures, and the subgrouping tree (Tree). The subgrouping tree will be used in Section 6.4. In Table 4, d_O and d_D are slower than the other measures. Hence, there is a trade-off between accuracy and speed.

6.4 Real-World Data Sets: Qualitative Discussion

This section discusses the qualitative aspect of the clustering solutions. It may provide additional insight into how well each distance measure performs on real data. However, the discussion is by no means all inclusive. We only point out some interesting findings in the experiments.

The Zoo data is easy to understand and is hence ideal for us to assess whether the results make sense or not. Using d_D, pattern clustering produces seven clusters, while three second-order outlier patterns are not grouped into any cluster. The second-order patterns have a broad coverage of various animal types, and grouping them into a cluster would introduce a considerable number of irrelevant animals into the clusters. Among the seven clusters, four contain, respectively, the majority of birds, mammals, fishes, and insects. However, the remaining three clusters are mixed with other animals. The result is reasonable, because the Zoo data has an unbalanced class distribution, so it is difficult to identify small animal groups like reptiles, amphibians, and coelenterates. We also apply a subgrouping tree to each cluster. For example, Fig. 7 gives a subgrouping tree with two subgroups. The first contains all the 20 birds, whereas the second contains fruitbat and vampire. This cluster is consistent with our perception that birds, fruitbat, and vampire can all fly. Hence, to a certain extent, the 22 animals are similar.

As for the Splice data, pattern clustering using d_O produces three data clusters and eight second-order outlier patterns. The three clusters, by and large, correspond to the original three classes: exon/intron (EI boundaries), intron/exon (IE boundaries), and Neither. Fig. 8 gives a pattern cluster inducing 96 percent EI sequences. This cluster corresponds to the EI splice-junction DNA sequence (A/C)AGGT(A/G)AGT, where the slash represents a disjunctive option. Fig. 9 gives another pattern cluster inducing 91 percent IE sequences. This cluster corresponds to the IE sequence (C/T)6X(C/T)AGG(G/T), where X represents any character and the subscript indicates repetitions. It is interesting to observe that X27, corresponding to X (any character) in the sequence, is empty. X21-X26 correspond to part of (C/T)6. The last cluster contains 86 percent of the sequences in the Neither class.

As for the Breast Cancer data, pattern clustering using d_D returns eight clusters that contain interesting results. For instance, one pattern cluster (see Fig. 10) suggests that if bare nuclei > 4 and bland chromatin > 3, it is likely that the patients have cancer, even though their mitosis is low. Hence, the pattern cluster generalized from individual patterns may provide useful information. Similar results are obtained for uniformity of cell size, marginal adhesion, etc. Generally speaking, these results suggest that if certain attribute values are greater than certain values, the patients are more likely to have breast cancer. This observation is quite obvious to an experienced doctor. More subtle information may be found by subgrouping trees. Fig. 11 shows part of a subgrouping tree. It shows that patients in the cluster are likely to have cancer, even though the values of their attributes are relatively low (e.g., uniformity of cell size = 4 and marginal adhesion = 1). Thus, the tree may be helpful for detecting breast cancer at an early stage (with relatively low attribute values). Similar results could be found in other clusters.

    7 RELATION TO PREVIOUS WORK

7.1 Relation between Association Rule Mining and Pattern Discovery

Since we focus on clustering patterns produced by PD, it may be useful to give the mathematical relation between association rule mining and PD. In fact, association rule mining and PD are two highly related problems. In association rule mining, an item set or a compound event x^s is said to be frequent if its observed frequency of occurrences o(x^s) is greater than a user-specified threshold c. That is,

$$ o(x^s) \ge c. \qquad (26) $$


TABLE 4
Number of Patterns and Average Execution Times for All Experiments (in Seconds)

    Fig. 7. A subgrouping tree.

    Fig. 8. EI sequences (A/C)AGGT(A/G)AG.


In PD, recall from (3) and (4) that an item set x^s is said to be associated if

$$ o(x^s) \ge |d(x^s)| \sqrt{e(x^s)\,v(x^s)} + e(x^s). \qquad (27) $$

Comparing (26) and (27), association rule mining uses a constant threshold c for detecting frequent item sets, whereas PD uses an adaptive threshold $|d(x^s)| \sqrt{e(x^s)\,v(x^s)} + e(x^s)$ for each item set. Since d(x^s) is a constant and, from (1) and (5), e(x^s) and v(x^s) depend on the term $\prod_{x_i \in x^s} P(x_i)$ and a constant M, the threshold $|d(x^s)| \sqrt{e(x^s)\,v(x^s)} + e(x^s)$ becomes the same constant for all item sets x^s if $\prod_{x_i \in x^s} P(x_i)$ is the same for all x^s. More precisely, the criterion for detecting significant patterns in PD is equivalent to the criterion for detecting frequent item sets in association rule mining if

$$ \prod_{x_i \in x^s} P(x_i) = \text{constant}, \quad \forall x^s. \qquad (28) $$

Hence, PD uses different thresholds for different item sets x^s, while association rule mining uses a fixed threshold for all item sets. However, the cost of such a customization is the lack of the important Apriori property. Hence, PD is slower than frequent item set mining, despite the use of several effective heuristics based on the statistical properties of contingency tables [2]. Since PD is an adaptive thresholding version of frequent item set mining, the proposed pattern clustering method may be able to cluster frequent item sets. The research of clustering frequent item sets is within the scope of our continuing research.

7.2 Relation to Previous Methods for Handling a Large Number of Patterns

A single method is usually not adequate to deal with the large number of patterns discovered by rule/pattern mining. It is not uncommon to adopt a hybrid approach that combines two methods, as listed in Table 5. Most methods involve the idea of using the set of matched samples of a rule to measure the distance between rules. For example, the rule cover in [4] is the set of rules that covers all samples matched by the original set of rules. The proposed approach takes an alternative divide-and-conquer approach. Obviously, it can be used either separately or together with other approaches. For example, pruning can be used to reduce the number of patterns before clustering.

TABLE 5
Comparison of Hybrid Pattern Analysis Methods and Correlation Rule Mining

Another observation based on Table 5 is that the chi-square test for correlation has been widely used in various methods. It may be of interest to derive the relation between the chi-square statistic and the residuals in (3) and (4). The chi-square statistic has the form

$$ \chi^2 = \sum_{x^s} \frac{(o(x^s) - e(x^s))^2}{e(x^s)} = \sum_{x^s} \left( \frac{o(x^s) - e(x^s)}{\sqrt{e(x^s)}} \right)^2 = \sum_{x^s} z_{x^s}^2, \qquad (29) $$

where z_{x^s} is the standardized residual in (3). Hence, the standardized residual is the square root of an individual cell's contribution to the chi-square statistic. Since the chi-square distribution is the sum of squared standard normal distributions, z_{x^s} is normally distributed with zero mean and unit variance. To ensure that z_{x^s} has unit variance, z_{x^s} is normalized by its estimated variance v(x^s) in (5) to obtain the adjusted residual. Hence, the test in PD is a test for correlation at the compound event level.
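A small sketch relating the quantities in (29): each cell's standardized residual squares to its chi-square contribution, and the adjusted residual divides by the estimated variance from (5), which is not reproduced here:

    import math

    def standardized_residual(o, e):
        """z of (3): its square is the cell's contribution to the chi-square in (29)."""
        return (o - e) / math.sqrt(e)

    def adjusted_residual(o, e, v):
        """Standardized residual normalized by the estimated variance v of (5)."""
        return standardized_residual(o, e) / math.sqrt(v)

    # Tea/Coffee cell of Table 1: o = 20, e = 22.5.
    z = standardized_residual(20, 22.5)   # about -0.53; z**2 is that cell's chi-square term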


Fig. 9. Pattern clusters corresponding to the IE splice-junction sequence (C/T)6X(C/T)AGG(G/T).

Fig. 10. Another pattern cluster in the Wisconsin data.

Fig. 11. A subgrouping tree in the Wisconsin data.


    8 SUMMARY AND CONCLUSION

This paper has proposed a method for clustering patterns and their associated data. Each pattern cluster is then analyzed via its data cluster by using existing techniques [7], [23], [24], [25] or a subgrouping tree, and each data cluster is interpreted via its pattern cluster. The effectiveness of the above divide-and-conquer approach lies in the proposed clustering method supported by the four measures d_R, d_RC, d_O, and d_D that we proposed. In this paper, the relations of these measures with existing common measures such as d_T [4] and d_G [5] are discussed (Section 4.2) and studied experimentally (Sections 6.1-6.4). In our experiments, d_O and d_D consistently produce better results than d_T, d_G, d_R, and d_RC. However, their calculations are slower.

To give a more complete picture of the proposed method, mathematical relations between PD and association rule mining/chi-square statistics are given analytically. It is shown that PD is an adaptive thresholding version of frequent item set mining. The condition under which the two methods are equivalent is also derived. The relations show that 1) the proposed method could be applicable to frequent item sets (Section 7.1) and 2) the hypothesis test based on residuals is a test for correlation at the compound event level (Section 7.2).

The major limitation of the current method is speed. Since hierarchical clustering is not scalable, it cannot cluster too many patterns. In our experiments, when d_O or d_D is used, it takes around 5 minutes to cluster 1,000 patterns; at this speed, it would take more than 1 hour to cluster 10,000 patterns. Fortunately, in many real-world data sets, the number of patterns produced by PD is largely on the order of thousands, even though it still depends on the data sets. However, this problem limits the effectiveness of the proposed method in clustering frequent item sets/association rules, since their numbers could be orders of magnitude greater than in PD. There are three possible solutions to the speed problem. First, rather than clustering all frequent item sets, we may cluster the closed/maximal frequent item sets, which could be much fewer than all frequent item sets, depending on whether the data sets are dense or not. Second, other faster clustering algorithms such as K-Means could be adopted. Third, pattern pruning can be used before pattern clustering. The investigation of methods for effectively clustering frequent item sets/association rules is within the scope of our continuing research.

    ACKNOWLEDGMENTS

The authors would like to thank the NSERC Discovery Grant that supported this project, Pattern Discovery System Inc. for allowing them to use Discover*e, and the anonymous reviewers for their valuable comments.

REFERENCES

[1] F. Mills, Statistical Methods. Pitman, 1955.
[2] A.K.C. Wong and Y. Wang, "High Order Pattern Discovery from Discrete-Valued Data," IEEE Trans. Knowledge and Data Eng., vol. 9, no. 6, pp. 877-893, Nov./Dec. 1997.
[3] Y. Wang and A.K.C. Wong, "From Association to Classification: Inference Using Weight of Evidence," IEEE Trans. Knowledge and Data Eng., vol. 15, no. 3, pp. 764-767, May/June 2003.
[4] H. Toivonen, M. Klemetinen, P. Ronkaninen, K. Hatonen, and H. Mannila, "Pruning and Grouping Discovered Association Rules," Proc. MLnet Workshop Statistics, Machine Learning, and Discovery in Databases, pp. 47-52, 1995.
[5] G.K. Gupta, A. Strehl, and J. Ghosh, "Distance Based Clustering of Association Rules," Proc. Int'l Conf. Artificial Neural Networks in Eng. (ANNIE 99), vol. 9, pp. 759-764, 1999.
[6] B. Liu, W. Hsu, and Y. Ma, "Pruning and Summarizing the Discovered Associations," Proc. Fifth ACM Int'l Conf. Knowledge Discovery and Data Mining (KDD 99), pp. 125-134, 1999.
[7] J. Han, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[8] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB 94), pp. 487-499, 1994.
[9] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, "Algorithms for Association Rule Mining: General Survey and Comparison," ACM SIGKDD Explorations Newsletter, vol. 2, no. 1, pp. 58-64, 2000.


[10] S. Brin, R. Motwani, and R. Silverstein, "Beyond Market Basket: Generalizing Association Rules to Correlations," Proc. ACM SIGMOD 97, pp. 265-276, 1997.
[11] A. Silberschatz and A. Tuzhilin, "What Makes Patterns Interesting in Knowledge Discovery Systems," IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 970-974, Dec. 1996.
[12] R. Srikant, Q. Vu, and R. Agrawal, "Mining Association Rules with Item Constraints," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD 97), pp. 67-73, 1997.
[13] R. Bayardo, R. Agrawal, and D. Gunopulos, "Constraint-Based Rule Mining in Large Dense Databases," Proc. 15th IEEE Int'l Conf. Data Eng. (ICDE 99), pp. 188-197, 1999.
[14] N. Wrigley, Categorical Data Analysis for Geographers and Environmental Scientists. Longman, 1985.
[15] D.R. Cox and E.J. Snell, "A General Definition of Residuals," J. Royal Statistical Soc. B, vol. 30, pp. 248-265, 1968.
[16] S.J. Haberman, "The Analysis of Residuals in Cross-Classified Tables," Biometrics, vol. 29, pp. 205-220, 1973.
[17] P.M. Murphy and D.W. Aha, UCI Repository of Machine Learning Databases, Dept. Information and Computer Science, Univ. of California, Irvine, 1987.
[18] S. Chawla and J. Davis, "On Local Pruning of Association Rules Using Directed Hypergraphs," Technical Report 537, School of Information Technologies, Univ. of Sydney, 2003.
[19] A.K.C. Wong and T.S. Liu, "Typicality, Diversity and Feature Patterns of an Ensemble," IEEE Trans. Computers, vol. 24, no. 2, pp. 158-181, Feb. 1975.
[20] A.K.C. Wong, T.S. Liu, and C.C. Wang, "Statistical Analysis of Residue Variability in Cytochrome C," J. Molecular Biology, vol. 102, pp. 287-295, 1976.
[21] D. Chiu and A.K.C. Wong, "Multiple Pattern Associations for Interpreting Structural and Functional Characteristics of Biomolecules," Information Sciences, vol. 167, nos. 1-4, pp. 23-29, 2004.
[22] Y. Wang and A.K.C. Wong, Discover*e. Pattern Discovery Technologies, http://www.patterndiscovery.com, 2008.
[23] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification. Wiley, 2000.
[24] A.K.C. Wong, D.K.Y. Chiu, and W. Huang, "A Discrete-Valued Clustering Algorithm with Applications to Biomolecular Data," Information Sciences, vol. 139, pp. 97-112, 2002.
[25] W.H. Au, K.C.C. Chan, A.K.C. Wong, and Y. Wang, "Attribute Clustering for Grouping, Selection and Classification of Gene Expression Data," IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 2, no. 2, pp. 83-101, Apr.-June 2005.
[26] A.K.C. Wong and Y. Wang, "Pattern Discovery: A Data Driven Approach to Decision Support," IEEE Trans. Systems, Man, and Cybernetics, Part C, vol. 33, no. 1, pp. 114-124, 2003.
[27] T. Chau and A.K.C. Wong, "Pattern Discovery by Residual Analysis and Recursive Partitioning," IEEE Trans. Knowledge and Data Eng., vol. 11, no. 6, pp. 833-852, Nov./Dec. 1999.

Andrew K.C. Wong received the PhD degree in 1968 from Carnegie Mellon University, Pittsburgh, where he taught for several years thereafter. From 1980 to 2002, he was a professor of systems design engineering and the founding director of the Pattern Analysis and Machine Intelligence Laboratory, University of Waterloo (UW), Waterloo, Ontario, Canada. From 2000 to 2003, he was a distinguished chair professor in the Department of Computing, Hong Kong Polytechnic University, Hong Kong. In 2002, he became a distinguished professor emeritus at UW. Since then, he has served UW as an adjunct professor in the Systems Design Engineering Department, the Electrical and Computer Engineering Department, and the David R. Cheriton School of Computer Science. He is the founder and a retired director (1993-2003) of Virtek (a publicly traded company and a leader in laser vision technology), where he was the president from 1986 to 1993 and the chairman from 1993 to 1997. In 1997, he cofounded Pattern Discovery Technologies, where he is currently the chairman. He was on the IEEE Distinguished Speaker Program and served as the general chair of the IASTED Conference on Robotics and Systems held in Hawaii in 1996 and the IEEE/RSJ International Conference held in Victoria, Canada, in 1998. He has published extensively and has been invited as a keynote/plenary speaker for IEEE-related conferences. He is a fellow of the IEEE.

Gary C.L. Li received the BA (with first-class honors) and MPhil degrees in computing from the Hong Kong Polytechnic University, Hong Kong, in 2002 and 2004, respectively. He is currently working toward the PhD degree in computer engineering at the University of Waterloo, Waterloo, Ontario, Canada. The technologies that he developed have been applied to various commercial projects in the education, text mining, and biomedical industries. His research interests include knowledge discovery, data mining, machine learning, data analysis, and their applications.
