CACTUS-Clustering Categorical Data Using Summaries
Advisor: Dr. Hsu
Graduate: Min-Hung Lin
IDSL seminar 2001/10/30
Outline
Motivation
Objective
Related Work
Definitions
CACTUS
Performance Evaluation
Conclusions
Comments
Motivation
Clustering with categorical attributes has received attention.
Previous algorithms do not give a formal description of the clusters.
Some of them need to post-process the output of the algorithm to identify the final clusters.
Objective
Introduce a novel formalization of a cluster for categorical attributes.
Describe a fast summarization-based algorithm, CACTUS, that discovers clusters.
Evaluate the performance of CACTUS on synthetic and real datasets.
Related Work
EM algorithm [Dempster et al., 1977]: iterative clustering technique
STIRR algorithm [Gibson et al., 1998]: iterative algorithm based on non-linear dynamical systems
ROCK algorithm [Guha et al., 1999]: hierarchical clustering algorithm
DEF: Support
DEF: Strongly Connected
DEF: Strongly Connected (cont’d)
Formal Definition of a Cluster
Formal Definition of a Cluster (cont’d)
C_i is the cluster-projection of C on A_i.
C is called a sub-cluster if it satisfies conditions (1) and (3).
A cluster C over a subset S of all attributes is called a subspace cluster on S; if |S| = k, then C is called a k-cluster.
DEF: Similarity
Inter-attribute Summaries
Intra-attribute Summaries
Experiments
Result
STIRR fails to discover clusters consisting of overlapping cluster-projections on any attribute, and clusters where two or more clusters share the same cluster-projection.
CACTUS correctly discovers all clusters.
CACTUS
Three-phase clustering algorithm:
Summarization Phase: compute the summary information.
Clustering Phase: discover a set of candidate clusters.
Validation Phase: determine the actual set of clusters.
Summarization Phase
Inter-attribute Summaries
Intra-attribute Summaries
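The two kinds of summaries can be sketched in a few lines of Python. The definitions on the earlier slides did not survive extraction, so this sketch rests on assumptions: a pair of values (a_i, a_j) is taken to be strongly connected when its support exceeds α times the support expected under attribute independence, |D| / (|D_i| · |D_j|), and the similarity of two values of one attribute is taken to be the number of values of the other attribute strongly connected with both. The function and parameter names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def inter_attribute_summaries(tuples, domain_sizes, alpha):
    """Single scan over the data.

    Returns {(i, j): {(a_i, a_j): support}} holding only the pairs
    assumed to be strongly connected (support > alpha * expected).
    """
    n = len(domain_sizes)
    counts = {(i, j): Counter() for i, j in combinations(range(n), 2)}
    total = 0
    for t in tuples:
        total += 1
        for i, j in counts:
            counts[(i, j)][(t[i], t[j])] += 1

    summaries = {}
    for (i, j), pair_counts in counts.items():
        # Assumption: expected support under attribute independence.
        expected = total / (domain_sizes[i] * domain_sizes[j])
        summaries[(i, j)] = {
            pair: s for pair, s in pair_counts.items() if s > alpha * expected
        }
    return summaries


def intra_attribute_similarity(summary_ij, a1, a2):
    """Count values of attribute j strongly connected with both a1 and a2
    (two values of attribute i), using only the inter-attribute summary."""
    partners1 = {y for (x, y) in summary_ij if x == a1}
    partners2 = {y for (x, y) in summary_ij if x == a2}
    return len(partners1 & partners2)


# Toy data: 2 attributes, 3 values each, 9 tuples (expected support = 1).
demo = [("a", "x")] * 4 + [("b", "x")] * 4 + [("c", "y")]
demo_summaries = inter_attribute_summaries(demo, domain_sizes=(3, 3), alpha=2)
```

Note that the intra-attribute similarity is derived from the inter-attribute summary alone, so it needs no further pass over the data.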
Clustering Phase
Computing cluster-projections on attributes
Level-wise synthesis of clusters
Computing Cluster-Projections on Attributes
Step 1: pairwise cluster-projection
Step 2: intersection
Computing Cluster-Projections on Attributes (cont’d)
Cluster-projection
Level-wise synthesis of clusters
Level-wise synthesis of clusters (cont’d)
Generation procedure
Level-wise synthesis of clusters (cont’d)
Candidate cluster
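The generation procedure itself did not survive on these slides, so the following is only a sketch of one plausible level-wise scheme, not the authors' exact procedure. It assumes that candidates are grown one attribute at a time, and that an extension is kept only when the new cluster-projection is strongly connected (every pair of values appears in the inter-attribute summary) with every projection already in the candidate. All names are hypothetical.

```python
def sets_strongly_connected(summary_ij, c_i, c_j):
    """Assumed test: every value pair across the two sets is strongly connected."""
    return all((a, b) in summary_ij for a in c_i for b in c_j)


def synthesize_candidates(projections, summaries):
    """Level-wise synthesis sketch.

    projections: list indexed by attribute; each entry is a list of
                 frozensets (the cluster-projections on that attribute).
    summaries:   {(i, j): collection of strongly connected pairs}, i < j.
    Returns a list of candidate clusters, each a tuple of frozensets.
    """
    candidates = [(c,) for c in projections[0]]
    for k in range(1, len(projections)):
        extended = []
        for cand in candidates:
            for c_new in projections[k]:
                # Keep the extension only if it is pairwise compatible
                # with every projection already in the candidate.
                if all(
                    sets_strongly_connected(summaries[(i, k)], cand[i], c_new)
                    for i in range(k)
                ):
                    extended.append(cand + (c_new,))
        candidates = extended
    return candidates


# Toy input: three attributes; {q} is not connected to b, so it is pruned.
demo_projections = [
    [frozenset({"a", "b"})],
    [frozenset({"x"})],
    [frozenset({"p"}), frozenset({"q"})],
]
demo_pairs = {
    (0, 1): {("a", "x"), ("b", "x")},
    (0, 2): {("a", "p"), ("b", "p"), ("a", "q")},
    (1, 2): {("x", "p"), ("x", "q")},
}
demo_candidates = synthesize_candidates(demo_projections, demo_pairs)
```

Because each level only consults the summaries, this step never touches the dataset; the candidates it produces still have to be checked against the data in the validation phase.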
Validation
Some of the candidate clusters may not have enough support, because the 2-clusters that make up a candidate may be due to different sets of tuples.
Check whether the support of each candidate cluster is greater than the threshold: α times the expected support of the cluster.
Only clusters whose support on D passes the threshold are retained.
Validation Procedure
Set the supports of all candidate clusters to zero.
For each tuple t, increment the support of the candidate cluster to which t belongs.
At the end of the scan, delete all candidate clusters whose support is less than the threshold.
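The validation scan above can be sketched as follows. The threshold is taken to be α times the expected support, and the expected support of a candidate is assumed here to be the value under attribute independence, |D| · Π |C_i| / |D_i|; that formula, and the function and parameter names, are assumptions rather than something stated on the slides. A tuple that falls in several candidates increments each of them.

```python
from math import prod


def validate(tuples, candidates, domain_sizes, alpha):
    """One scan over the data; returns [(candidate, support)] for survivors.

    candidates:   list of tuples of sets, one set per attribute.
    domain_sizes: number of distinct values per attribute.
    """
    supports = [0] * len(candidates)  # start every support at zero
    total = 0
    for t in tuples:
        total += 1
        for idx, cand in enumerate(candidates):
            # t belongs to a candidate if each value lies in its projection.
            if all(t[i] in cand[i] for i in range(len(cand))):
                supports[idx] += 1

    kept = []
    for cand, s in zip(candidates, supports):
        # Assumption: expected support under attribute independence.
        expected = total * prod(len(c) / d for c, d in zip(cand, domain_sizes))
        # Delete candidates whose support is less than the threshold.
        if s >= alpha * expected:
            kept.append((cand, s))
    return kept


# Toy data: 9 tuples, 2 attributes with 3 values each (expected support ~ 1).
demo_data = [("a", "x")] * 6 + [("b", "y")] + [("c", "z")] * 2
demo_cands = [
    (frozenset({"a"}), frozenset({"x"})),
    (frozenset({"b"}), frozenset({"y"})),
]
demo_kept = validate(demo_data, demo_cands, domain_sizes=(3, 3), alpha=2)
```

The inner loop is linear in the number of candidates per tuple, which is adequate for a sketch; a large candidate set would call for an index over the projections.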
Extensions
Large attribute value domains
Clusters in subspaces
Performance Evaluation
Evaluated CACTUS on synthetic and real datasets.
Compared the performance of CACTUS with that of STIRR.
Synthetic Datasets
The test datasets were generated using the data generator developed by Gibson et al. (1 million tuples, 10 attributes, 100 attribute values for each attribute).
Real Datasets
Two sets of bibliographic entries: 7766 entries are database-related and 30919 entries are theory-related.
Four attributes: the first author, the second author, the conference, and the year.
Attribute domain sizes (first author, second author, conference, year):
Database-related: {3418, 3529, 1631, 44}
Theory-related: {8043, 8190, 690, 42}
Mixture: {10212, 10527, 2315, 52}
Real Datasets (cont’d)
[Figure panels: Database-related, Theory-related, Mixture]
Results
CACTUS is very fast and scalable (only two scans of the dataset).
CACTUS outperforms STIRR by a factor between 3 and 10.
Conclusions
Formalized the definition of a cluster for categorical attributes.
Introduced a fast summarization-based algorithm, CACTUS, for discovering such clusters in categorical data.
Evaluated the algorithm against both synthetic and real datasets.
Future Work
Relax the cluster definition by allowing sets of attribute values that are “almost” strongly connected to each other.
Inter-attribute summaries can be incrementally maintained => derive an incremental clustering algorithm.
Rank the clusters based on a measure of interestingness.
Comments
Pairwise cluster-projection is an NP-complete problem.
A large number of candidate clusters is still a problem.