Mining 3-Clusters in Vertically Partitioned Data
Faris Alqadah & Raj Bhatnagar
University of Cincinnati
Outline
• Introduction to 3-clustering in binary (categorical), vertically partitioned data
• Proposed cluster quality measure
• 3-Clu: algorithm for enumerating 3-clusters from two datasets
Introduction
Traditional clustering
Bi-Clustering
3-Clustering
Why 3-clusters?
• Find correspondence between bi-clusters of two different datasets
• Sharpen local clusters with outside knowledge
• Alternative? “Join datasets then search”
– Does not capture underlying interactions
– Inefficient
– Not always possible
Why 3-clusters?
<A,1234>
<AB,134>
<AWB,13>
<AY,12>
<AX,24>
<AWBCYZ,1>
<ABDX,4>
Formal Definitions
Bi-cluster in Di
3-Cluster across D1 and D2
Pattern in Di
Defining 3-clusters
• D1 is the “learner”
• Maximal rectangle of 1's under suitable permutation in learner
• Best correspondence to rectangle of 1's in D2
[Figure panels: D1, D2]
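The "maximal rectangle of 1's" idea can be illustrated with a minimal Python sketch. The toy data, the function names, and the joint closure in `joint_cluster` are assumptions made for illustration; in particular, `joint_cluster` is one simple symmetric reading of a cluster spanning D1 and D2, not necessarily the learner-based definition given on the slides.

```python
# Hypothetical toy data: four objects described by two attribute sets.
D1 = {1: {"A", "B", "C"}, 2: {"A"}, 3: {"A", "B"}, 4: {"A", "B", "D"}}
D2 = {1: {"W", "Y", "Z"}, 2: {"X", "Y"}, 3: {"W"}, 4: {"X"}}

def common_items(data, objects):
    """Items shared by every object (all items if the object set is empty)."""
    objects = list(objects)
    if not objects:
        return set().union(*data.values())
    return set.intersection(*(data[o] for o in objects))

def supporting_objects(data, items):
    """Objects whose row contains every item in `items`."""
    return {o for o, row in data.items() if items <= row}

def bicluster(data, objects):
    """Grow an object set into a maximal rectangle of 1's in one dataset."""
    items = common_items(data, objects)
    return supporting_objects(data, items), items

def joint_cluster(d1, d2, objects):
    """Assumed reading: close an object set against both datasets at once."""
    i1 = common_items(d1, objects)
    i2 = common_items(d2, objects)
    closed = supporting_objects(d1, i1) & supporting_objects(d2, i2)
    return closed, i1, i2

print(bicluster(D1, {3}))          # rectangle in the learner alone
print(joint_cluster(D1, D2, {3}))  # rectangle spanning D1 and D2
```

In this toy data, starting from object 3 gives a taller rectangle in D1 alone than across both datasets, because D2 adds an item that object 4 lacks.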
Cluster Quality Measure
• Intuition: Maximize number of 1's while also maximizing number of items and objects
• Trade-off between objects and items
– More items...fewer objects
– More objects...fewer items
Quality Measure
– Consider bi-clusters in learner alone
[Figure labels: I1, O, C1, C2]
• Which is preferable?
• User decides
Quality Measure
• Quality measure:
– Monotonic in both width and height
• Reflects intuition
– Balances width and height according to user-defined parameter
• Introduce β
• Amount of width (attributes) willing to trade for a single unit of height (objects)
Quality Measure
Extending to 3-clusters
• Utilize same intuition
• Width of 3-cluster is sum of individual widths
Selecting β
• Larger values yield 3-clusters that are “wide” and “short” in both D1 and D2
– Cluster key websites popular with a large number of Democrats and Republicans
• Smaller values produce 3-clusters that are “narrow” and “long”
– Discover a long list of websites utilized by a few select Democrats and Republicans
3-Clu: Our Algorithm
• Search for 3-clusters similar to search for closed itemsets
• How to formulate the search space?
– Assumption that objects outnumber attributes may not hold
– Several possible orderings of the search space
Algorithm
• Define search space with primacy to objects
• Only need to maintain one search tree
• Mimic closed itemset algorithm with simultaneous pruning of search space
• Prune with quality measure
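A search tree defined over objects can be sketched as below. This is not the authors' 3-Clu algorithm: it is a Close-by-One style enumeration of closed object sets over both datasets, written under the assumption that a cluster is an object set closed against D1 and D2 together. The toy data and all names are hypothetical, and the quality-based pruning is only marked by a comment because the measure itself is not reproduced here.

```python
# Hypothetical toy data: the same objects described in two datasets.
D1 = {1: {"A", "B", "C"}, 2: {"A"}, 3: {"A", "B"}, 4: {"A", "B", "D"}}
D2 = {1: {"W", "Y", "Z"}, 2: {"X", "Y"}, 3: {"W"}, 4: {"X"}}

def enumerate_closed_object_sets(d1, d2):
    """Return (objects, items in D1, items in D2) for every closed object set."""
    objects = sorted(d1)
    position = {o: i for i, o in enumerate(objects)}
    all1 = set().union(*d1.values())
    all2 = set().union(*d2.values())

    def close(objs):
        # Items shared by all objects, then every object carrying those items.
        i1, i2 = set(all1), set(all2)
        for o in objs:
            i1 &= d1[o]
            i2 &= d2[o]
        extent = frozenset(o for o in objects if i1 <= d1[o] and i2 <= d2[o])
        return extent, frozenset(i1), frozenset(i2)

    found = []

    def expand(extent, start):
        for idx in range(start, len(objects)):
            o = objects[idx]
            if o in extent:
                continue
            new_extent, i1, i2 = close(extent | {o})
            # Canonicity test: skip if the closure pulled in an earlier object,
            # since that cluster is reached from another branch of the tree.
            if any(position[x] < idx for x in new_extent - extent):
                continue
            # A quality-based pruning test would go here (not reproduced).
            found.append((new_extent, i1, i2))
            expand(new_extent, idx + 1)

    root = close(frozenset())
    found.append(root)
    expand(root[0], 0)
    return found

for extent, i1, i2 in enumerate_closed_object_sets(D1, D2):
    print(sorted(extent), sorted(i1), sorted(i2))
```

Because the canonicity test rejects duplicates, a single tree suffices and each closed object set is emitted exactly once.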
Algorithm
• Cluster quality measure is neither monotone nor anti-monotone in the search space
• Pruning is still possible
Is C2 of higher quality?
Algorithm
• Pruning rule is very optimistic
• Can be adjusted with some a-priori information
• Example β = 0.5
• x = 2.73... can't prune
– This assumes w will stay at 15 for 3 more levels
Algorithm Analysis
• Computational cost: O(|O|*i*N)
– Only as expensive as enumerating bi-clusters in a single dataset
• Communication cost: O(N)
• Correctness guaranteed by FCA theory
Experimental Results
• Performance tests
• Randomly split benchmark datasets CHESS and CONNECT
• Genetic dataset: Genes, GO terms, Phenotypes
• Compared to LCM and CHARM
[Figure panels: Chess, Connect, GO-Pheno]
Experimental Results
• Test validity of 3-clusters
• Randomly partitioned Mushrooms dataset by attributes
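A random partition by attributes can be sketched as follows. The slides do not say how the split was drawn, so the even split, the fixed seed, and the names `vertical_split` and `rows` are assumptions for illustration only.

```python
import random

def vertical_split(rows, seed=0):
    """Randomly assign each attribute to one of two datasets over the same objects.

    rows maps each object to its set of items (a hypothetical input layout).
    """
    items = sorted(set().union(*rows.values()))
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    rng.shuffle(items)
    left = set(items[: len(items) // 2])
    d1 = {o: row & left for o, row in rows.items()}
    d2 = {o: row - left for o, row in rows.items()}
    return d1, d2

# Hypothetical toy rows; a real run would load a benchmark dataset instead.
rows = {1: {"a", "b", "c", "d"}, 2: {"a", "d"}, 3: {"b", "c"}}
d1, d2 = vertical_split(rows)
print(d1)
print(d2)
```

Every object appears in both halves, so clusters mined across the two halves can be compared against clusters in the unsplit data.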
Conclusion
• Novel concept of 3-clusters in vertically partitioned data
• Introduced quality measure framework for 3-clusters
• Presented efficient algorithm based on closed itemset mining algorithms, with adaptations:
– Defined search space to enable simultaneous pruning
– Incorporated novel pruning method based on cluster quality measure