CACTUS-Clustering Categorical Data Using Summaries. Advisor: Dr. Hsu. Graduate: Min-Hung Lin. IDSL seminar, 2001/10/30.

Page 1: CACTUS-Clustering Categorical Data Using Summaries

CACTUS-Clustering Categorical Data Using Summaries

Advisor: Dr. Hsu

Graduate: Min-Hung Lin

IDSL seminar 2001/10/30

Page 2: CACTUS-Clustering Categorical Data Using Summaries

Outline

Motivation

Objective

Related Work

Definitions

CACTUS

Performance Evaluation

Conclusions

Comments

Page 3: CACTUS-Clustering Categorical Data Using Summaries

Motivation

Clustering with categorical attributes has received attention.

Previous algorithms do not give a formal description of the clusters.

Some of them need to post-process the output of the algorithm to identify the final clusters.

Page 4: CACTUS-Clustering Categorical Data Using Summaries

Objective

Introduce a novel formalization of a cluster for categorical attributes.

Describe a fast summarization-based algorithm, CACTUS, that discovers clusters.

Evaluate the performance of CACTUS on synthetic and real datasets.

Page 5: CACTUS-Clustering Categorical Data Using Summaries

Related Work

EM algorithm [Dempster et al., 1977]: iterative clustering technique

STIRR algorithm [Gibson et al., 1998]: iterative algorithm based on non-linear dynamical systems

ROCK algorithm [Guha et al., 1999]: hierarchical clustering algorithm

Page 6: CACTUS-Clustering Categorical Data Using Summaries

DEF:Support

Page 7: CACTUS-Clustering Categorical Data Using Summaries

DEF:Strongly Connected

Page 8: CACTUS-Clustering Categorical Data Using Summaries

DEF:Strongly Connected(cont’d)

Page 9: CACTUS-Clustering Categorical Data Using Summaries

Formal Definition of a Cluster

Page 10: CACTUS-Clustering Categorical Data Using Summaries

Formal Definition of a Cluster (cont’d)

C_i is the cluster-projection of C on A_i.

C is called a sub-cluster if it satisfies conditions (1) and (3).

A cluster C over a subset S of all attributes is called a subspace cluster on S; if |S| = k then C is called a k-cluster.

Page 11: CACTUS-Clustering Categorical Data Using Summaries

DEF:Similarity

Page 12: CACTUS-Clustering Categorical Data Using Summaries

Inter-attribute Summaries

Page 13: CACTUS-Clustering Categorical Data Using Summaries

Intra-attribute Summaries

Page 14: CACTUS-Clustering Categorical Data Using Summaries

Experiments

Page 15: CACTUS-Clustering Categorical Data Using Summaries

Result

STIRR fails to discover:

clusters consisting of overlapping cluster-projections on any attribute;

clusters where two or more clusters share the same cluster-projection.

CACTUS correctly discovers all clusters.

Page 16: CACTUS-Clustering Categorical Data Using Summaries

CACTUS

Three-phase clustering algorithm:

Summarization Phase: compute the summary information.

Clustering Phase: discover a set of candidate clusters.

Validation Phase: determine the actual set of clusters.

Page 17: CACTUS-Clustering Categorical Data Using Summaries

Summarization Phase

Inter-attribute Summaries

Intra-attribute Summaries
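
The inter-attribute summaries can be illustrated with a minimal sketch. The slides do not reproduce the definitions, so this assumes that a pair of values from two different attributes is kept when its co-occurrence count exceeds `alpha` times the count expected if the attributes were independent and uniformly distributed. The function name, the `alpha` parameter, and the use of observed values as the attribute domains are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def inter_attribute_summaries(tuples, alpha):
    """Return {(i, j): {(a_i, a_j): count}} for value pairs that pass the threshold.

    Hypothetical helper: the threshold rule (count > alpha * expected) is an
    assumed reading of "strongly connected", not taken from the slides.
    """
    n_rows = len(tuples)
    n_attrs = len(tuples[0])
    # Assumption: the domain of each attribute is the set of values seen in the data.
    domains = [{row[i] for row in tuples} for i in range(n_attrs)]

    # Single scan of the data: count co-occurrences for every attribute pair i < j.
    counts = {pair: Counter() for pair in combinations(range(n_attrs), 2)}
    for row in tuples:
        for i, j in counts:
            counts[(i, j)][(row[i], row[j])] += 1

    summaries = {}
    for (i, j), counter in counts.items():
        # Expected count of one value pair under independence and uniformity.
        expected = n_rows / (len(domains[i]) * len(domains[j]))
        summaries[(i, j)] = {
            pair: c for pair, c in counter.items() if c > alpha * expected
        }
    return summaries
```

Because only counts per value pair are stored, the summary size depends on the attribute domains rather than on the number of tuples, which is what makes a summarization-based approach attractive for large datasets.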

Page 18: CACTUS-Clustering Categorical Data Using Summaries

Clustering Phase

Computing cluster-projections on attributes

Level-wise synthesis of clusters

Page 19: CACTUS-Clustering Categorical Data Using Summaries

Computing Cluster-Projections on Attributes

Step 1: pairwise cluster-projection

Step 2: intersection
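
One plausible reading of Step 2 is sketched below. It assumes that Step 1 has already produced, for a fixed attribute, one collection of value sets per other attribute, and that the intersection step keeps every pairwise intersection containing more than one value. The function names and the size-greater-than-one filter are assumptions for illustration; the slides do not spell out the operator.

```python
def intersect_projections(sets_a, sets_b):
    """Pairwise intersections of two collections of value sets.

    Assumption: intersections with fewer than two values are discarded.
    """
    result = set()
    for s in sets_a:
        for t in sets_b:
            common = frozenset(s) & frozenset(t)
            if len(common) > 1:
                result.add(common)
    return result


def cluster_projections(pairwise):
    """Fold the intersection over all pairwise projections of one attribute.

    `pairwise` is a non-empty list; each element is the collection of value
    sets obtained in Step 1 with respect to one other attribute.
    """
    current = {frozenset(s) for s in pairwise[0]}
    for other in pairwise[1:]:
        current = intersect_projections(current, other)
    return current
```

For example, intersecting `[{"a1", "a2", "a3"}, {"a4", "a5"}]` with `[{"a1", "a2"}, {"a3", "a4", "a5"}]` leaves the two value sets `{"a1", "a2"}` and `{"a4", "a5"}`.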

Page 20: CACTUS-Clustering Categorical Data Using Summaries

Computing Cluster-Projections on Attributes (cont’d)

Cluster-projection

Page 21: CACTUS-Clustering Categorical Data Using Summaries

Level-wise synthesis of clusters

Page 22: CACTUS-Clustering Categorical Data Using Summaries

Level-wise synthesis of clusters (cont’d)

Generation procedure

Page 23: CACTUS-Clustering Categorical Data Using Summaries

Level-wise synthesis of clusters (cont’d)

Candidate cluster

Page 24: CACTUS-Clustering Categorical Data Using Summaries

Validation

Some of the candidate clusters may not have enough support, because the 2-clusters they are built from may be due to different sets of tuples.

Check whether the support of each candidate cluster is greater than the threshold: α times the expected support of the cluster.

Only clusters whose support on D passes the threshold are retained.

Page 25: CACTUS-Clustering Categorical Data Using Summaries

Validation Procedure

Set the supports of all candidate clusters to zero.

For each tuple t in D, increment the support of the candidate cluster to which t belongs.

At the end of the scan, delete all candidate clusters whose support is less than the threshold.
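
The validation procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions: a candidate cluster is represented as a tuple of value sets (one per attribute), a tuple belongs to a candidate when each of its values lies in the corresponding set, and the expected support is computed as if attributes were independent with uniformly distributed values. The names `validate`, `expected_support`, `domain_sizes`, and `alpha` are hypothetical.

```python
from math import prod


def expected_support(candidate, domain_sizes, n_rows):
    """Expected number of tuples in a candidate.

    Assumption: attributes are independent and values are uniformly distributed.
    """
    return n_rows * prod(len(c) / d for c, d in zip(candidate, domain_sizes))


def validate(tuples, candidates, domain_sizes, alpha):
    """Keep the candidates whose support reaches alpha times the expected support."""
    # Set the supports of all candidate clusters to zero.
    support = [0] * len(candidates)

    # One scan of the data: credit every candidate that contains the tuple.
    for row in tuples:
        for k, cand in enumerate(candidates):
            if all(value in values for value, values in zip(row, cand)):
                support[k] += 1

    # Delete the candidates whose support is below the threshold.
    n_rows = len(tuples)
    return [
        cand
        for k, cand in enumerate(candidates)
        if support[k] >= alpha * expected_support(cand, domain_sizes, n_rows)
    ]
```

The sketch tests each tuple against every candidate, which is fine for illustration; with many candidates a real implementation would need an index to find the matching candidates quickly.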

Page 26: CACTUS-Clustering Categorical Data Using Summaries

Extensions

Large Attribute Value Domains

Clusters in Subspaces

Page 27: CACTUS-Clustering Categorical Data Using Summaries

Performance Evaluation

Evaluation of CACTUS on synthetic and real datasets.

Compared the performance of CACTUS with the performance of STIRR.

Page 28: CACTUS-Clustering Categorical Data Using Summaries

Synthetic Datasets

The test datasets were generated using the data generator developed by Gibson et al. (1 million tuples, 10 attributes, 100 attribute values for each attribute).

Page 29: CACTUS-Clustering Categorical Data Using Summaries

Real Datasets

Two sets of bibliographic entries: 7766 entries are database-related and 30919 entries are theory-related.

Four attributes: the first author, the second author, the conference, and the year.

Attribute domain sizes are {3418, 3529, 1631, 44}, {8043, 8190, 690, 42}, and {10212, 10527, 2315, 52}.

Page 30: CACTUS-Clustering Categorical Data Using Summaries

Real Datasets (cont’d)

Database-related

Theory-related

Mixture

Page 31: CACTUS-Clustering Categorical Data Using Summaries

Results

CACTUS is very fast and scalable (only two scans of the dataset).

CACTUS outperforms STIRR by a factor between 3 and 10.

Page 32: CACTUS-Clustering Categorical Data Using Summaries

Conclusions

Formalized the definition of a cluster for categorical attributes.

Introduced a fast summarization-based algorithm, CACTUS, for discovering such clusters in categorical data.

Evaluated the algorithm on both synthetic and real datasets.

Page 33: CACTUS-Clustering Categorical Data Using Summaries

Future Work

Relax the cluster definition by allowing sets of attribute values that are “almost” strongly connected to each other.

Inter-attribute summaries can be incrementally maintained, which suggests deriving an incremental clustering algorithm.

Rank the clusters based on a measure of interestingness.

Page 34: CACTUS-Clustering Categorical Data Using Summaries

Comments

Pairwise cluster-projection is an NP-complete problem.

A large number of candidate clusters is still a problem.