Discovering Patterns in Multiple Datasets
Raj Bhatnagar
University of Cincinnati
Nature of Distributed Datasets
Horizontal Partitioning
[Figure: three partitions, each with the same attribute schema (A B C D E) but a different subset of the rows]
Vertical Partitioning
[Figure: partitions with different attribute sets, e.g. (D E F G H), (A H J K M), (A B C D E), that together make up the full schema]
Data components may be Geographically Distributed
Nature of Distributed Datasets: Multi-Domain Datasets
[Figure: three relation matrices: Genes × Diseases, Genes × Drugs, Drugs × Adverse Reactions]
Nature of Distributed Datasets: Multi-Domain Datasets
[Figure: three relation matrices: Documents × Keywords, Documents × Cited-Documents, Keywords × Topics]
Types of Patterns
• Decision Trees
• Association Rules
• Principal Component Analysis
• K-Nearest Neighbor Analysis
• Clusters
– Hierarchical
– K-Means
– Subspace
Nature of Clusters
Patterns ≡ Unsupervised, Data-Driven Clusters
Single-Domain Clustering
[Figure: two views of the same Genes × Diseases matrix]
Clusters of similar genes, in the context of diseases
Clusters of similar diseases, in the context of genes
Clusters may be:
- Mutually Exclusive
- Overlapping
Nature of Patterns
Simultaneous Two-Domain Clustering
[Figure: a Genes × Diseases matrix with a highlighted G × D block]
A cluster of similar genes, in a subspace of diseases; a cluster of similar diseases, in a subspace of genes
Options:
- Exhaustive in one domain
- Exhaustive in both domains
- Mutually exclusive clusters in one or both domains
- Overlapping clusters/subspaces in both domains
Nature of Patterns
Simultaneous Three (Multi)-Domain Clustering
[Figure: Genes × Diseases and Genes × Drugs matrices, each with a highlighted bi-cluster]
Match “genes” subsets in two clusters
Phase-III of this research
Part-I
Patterns in Vertically Distributed Databases
Learning Decision Trees
D = D1 × D2 × ... × Dn
- D is implicitly specified
Goal: Build a decision tree for the implicit D, using the explicit Di's
[Figure: component databases D1, D2, ..., Dn with attribute sets (A B C), (C D E), (A E G)]
Limitations:
- Can't move the Di's to a common site (size / communication cost / privacy)
- Can't update the local databases
- Can't send actual data tuples
Geographically distributed databases
Vertically Partitioned Dataset
Explicit and Implicit Databases
[Figure: an implicit database over attributes A-F, and its explicit component databases: Node 1 holds (A B C D), Node 2 holds (A C F), Node 3 holds (C E); the shared attributes A and C form the SharedSet, with value combinations (1,1), (1,2), (2,1), (2,2)]
Decomposition of Computations
- Since D is implicit, F(D) cannot be evaluated directly
- For a computation R = F(D): decompose F into a global G and local g's
- The decomposition depends on: F, the Di's, and the set of shared attributes
[Figure: component databases D1, D2, ..., Dn with attribute sets (A B C), (C D E), (A E G)]
R = F(D)
R = G[g1(D1), g2(D2), ..., gn(Dn)]
Count All Tuples in Implicit D
$R = \#\mathrm{tuples}(D) = \sum_{j=1}^{m} \prod_{i=1}^{n} (N(D_i))_{cond_j}$
– cond_j: the jth tuple in the SharedSet; m: the number of SharedSet tuples
– n: number of databases (Di's)
– (N(D_i))_{cond_j}: count of tuples in Di satisfying cond_j
– Local computation: g_i(D_i) = (N(D_i))_{cond_j}
– G is a sum-of-products
– If each Di knows the “shared” values, then only one message per site is needed for #tuples
[Table: SharedSet over shared attributes A and C, with tuples (1,1), (1,2), (2,1), (2,2)]
With L shared attributes and k values each, the SharedSet has k^L tuples.
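To make the decomposition concrete, here is a minimal Python sketch of this sum-of-products count for two sites. The site tables, shared values, and helper names are hypothetical stand-ins, not the actual distributed implementation:

```python
from itertools import product

# Hypothetical local tables; attributes A and C are the shared attributes.
site1 = [{"A": 1, "B": 6, "C": 1}, {"A": 1, "B": 2, "C": 2}, {"A": 2, "B": 6, "C": 1}]
site2 = [{"A": 1, "C": 1, "F": 3}, {"A": 2, "C": 1, "F": 1}, {"A": 1, "C": 2, "F": 2}]

# The SharedSet: every combination of shared-attribute values.
shared_set = [{"A": a, "C": c} for a, c in product([1, 2], repeat=2)]

def local_count(table, cond):
    """g_i: count of local tuples satisfying one shared-attribute condition."""
    return sum(all(row[k] == v for k, v in cond.items()) for row in table)

# G: sum over shared conditions of the product of the local counts.
total = sum(local_count(site1, cond) * local_count(site2, cond)
            for cond in shared_set)
print(total)  # number of tuples in the implicit join D
```

In this scheme each site can answer all of its conditional counts in a single batched message, which is where the one-message-per-site bound comes from.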
Learning Decision Trees
Consists of various counts only:
$E = \sum_{b} \frac{N_b}{N} \sum_{c} -\frac{N_{bc}}{N_b} \log_2\!\left(\frac{N_{bc}}{N_b}\right)$
[Figure: a decision node testing a = ?, with b branches a1, a2, ..., ab]
ID3 Algorithm
– c classes in the dataset
– N_bc and N_b can be computed using g and G as for #tuples; one message per database is needed for computing each entropy value
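Once the counts N_b and N_bc have been gathered through the same g/G mechanism, the entropy is purely local arithmetic. A small sketch with made-up counts for a two-branch split:

```python
from math import log2

# Hypothetical counts: N_bc[b][c] = tuples reaching branch b with class c.
N_bc = {"a1": {"yes": 6, "no": 2}, "a2": {"yes": 1, "no": 3}}

N = sum(sum(classes.values()) for classes in N_bc.values())

E = 0.0
for classes in N_bc.values():
    N_b = sum(classes.values())
    for n in classes.values():
        if n:  # the 0 * log(0) term is taken to be 0
            E += (N_b / N) * (-(n / N_b) * log2(n / N_b))
print(E)  # expected entropy after the split
```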
Compute Covariance Matrix for D
• Covariance matrix for D
– Needed for eigenvectors / principal components
– Needs second-order moments
– Helps compute terms of the type $\sum_{t \in D} x_t y_t$ and $E[(x_i - \bar{x}_i)(x_j - \bar{x}_j)]$
– This matrix can be computed at one of the databases
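Each covariance entry then reduces to three distributed scalars (the two first-order sums and one second-order sum); a tiny sketch with hypothetical values:

```python
# Hypothetical sums returned by the sites for a pair of attributes x, y.
sum_x, sum_y, sum_xy, n = 12.0, 9.0, 31.0, 5

cov_xy = sum_xy / n - (sum_x / n) * (sum_y / n)  # E[xy] - E[x] E[y]
print(cov_xy)
```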
G-and-g Decomposition for 2nd-order moments
• Sum of products for two attributes: $\sum_{t \in D} x_t y_t$
• There are six different ways in which x and y may be distributed; each requires a different decomposition
– Case 1: x same as y, and x belongs to the SharedSet: $\sum_j x_j^2 \cdot \mathrm{Count}(x_j)$, where $\mathrm{Count}(x_j)$ is the number of tuples in the implicit D with value $x_j$
– Case 2: x same as y, and x does not belong to the SharedSet: $\sum_k \mathrm{Avg}(x^2 \mid cond_k) \cdot \mathrm{Count}(cond_k)$
– Case 3: x and y both belong to the SharedSet: $\sum_k x_k \cdot y_k \cdot \mathrm{Count}(cond_k)$, summed over the SharedSet tuples
Sum of Products
– Case 4: x belongs to the SharedSet and y does not: $\sum_j x_j \cdot \bar{y}_j \cdot \mathrm{Count}(x_j)$, where $\bar{y}_j$ is the average of y under the jth shared condition
– Case 5: x and y do not belong to the SharedSet and reside on different nodes:
• For each tuple t in the SharedSet, obtain $\mathrm{Sum}(x(t))$ and $\mathrm{Sum}(y(t))$
• and then compute $\sum_t \mathrm{Sum}(x(t)) \cdot \mathrm{Sum}(y(t))$
– Case 6: x and y do not belong to the SharedSet and reside on the same node: $\sum_t \mathrm{Prod}(t) \cdot \mathrm{Count}(t)$, where $\mathrm{Prod}(t)$ is the average of the product of x and y for cond-t of the SharedSet
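A sketch of Case 5. Within one shared condition t, the implicit join pairs every x-tuple with every y-tuple, so the cross-node sum of products factors into Sum(x(t)) * Sum(y(t)); the per-condition groupings below are invented for illustration:

```python
# x values at node 1 and y values at node 2, grouped by shared condition.
node1_x = {"t1": [1.0, 2.0], "t2": [3.0]}
node2_y = {"t1": [4.0], "t2": [5.0, 6.0]}

sum_xy = sum(sum(node1_x[t]) * sum(node2_y[t]) for t in node1_x)
print(sum_xy)  # (1+2)*4 + 3*(5+6) = 45
```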
Nearest Neighbor Algorithm
Find the nearest neighbor of r1 in D1
• with virtual extensions in D for all tuples in D1
• Need to compute all pairwise distances
• The same distance values can be used for clustering algorithms
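Because the dataset is vertically partitioned, a squared Euclidean distance splits into per-site partial distances, so sites can exchange one scalar per pair instead of raw tuples. A minimal sketch with made-up attribute values:

```python
def partial_sq_dist(u, v):
    """One site's contribution: squared distance over its own attributes."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

site1_part = partial_sq_dist([1.0, 4.0], [2.0, 6.0])  # attributes held at site 1
site2_part = partial_sq_dist([3.0], [0.0])            # attributes held at site 2

d_squared = site1_part + site2_part
print(d_squared)  # 1 + 4 + 9 = 14
```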
Problem: Closed-Loops in Databases
Extracting Communication Graph
The learner is D1
Covariance, k-NN, and other algorithms have been developed for this situation
Part-II
Subspace Clusters and Lattice Organization
Clustering in Multi-Domains
• Example: a 3-D dataset with 4 clusters
• Each cluster lies in 2-D
• Points from two subspace clusters can be very close, making traditional clustering algorithms inapplicable
• Clusters may overlap
Subspace Clustering
• “Interestingness” of a subspace cluster:
– Domain-dependent / user-defined
– Similarity-based clusters
Subspace Clusters
• Number of Subspaces
[Table: a 15 × 30 binary dataset; rows 1-5 have 1s exactly in columns A1-A10, rows 6-10 in B1-B10, rows 11-15 in C1-C10]
With 30 attributes, the number of candidate subspaces is $\sum_{k=1}^{30} \binom{30}{k} = 2^{30} - 1$.
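A quick check of that count:

```python
from math import comb

n = 30  # attributes in the example dataset
print(sum(comb(n, k) for k in range(1, n + 1)))  # 1073741823
print(2 ** n - 1)                                 # the same value
```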
Nature of Real Datasets
[Figure: a 15 × 15 binary matrix and a 15 × 15 integer-valued matrix, with no visually obvious block structure]
Examples: Genes-Diseases; Person-MovieRating; Document-TermFrequency
Lattice of Subspaces: Formal Concept Analysis
[Figure: the lattice of all subspaces of {A, B, C, D, E}, from null up to ABCDE; each node is annotated with its supporting row ids: A:124, B:123, C:1234, D:245, E:345, and each larger subspace with the intersection of its parents' row ids, down to ABCD:2 and ACDE:4]
Parallel to the ideas of Formal Concept Analysis
1. Need algorithms to find interesting subspace clusters
2. The lattice provides much more insight into the dataset
Clusters in Subspaces
Clusters in overlapping subspaces
[Table: the 5 × 5 binary dataset used in the following slides]
   a b c d e
1  1 1 1 0 1
2  1 1 1 0 1
3  1 0 1 1 1
4  0 0 1 1 1
5  1 0 1 1 1
[Figure: clusters in overlapping subspaces of this table highlighted]
Density = number of rows: an anti-monotonic property.
If AB does not meet the needed density, then neither does any of its descendants.
[Figure: the lattice of subspaces of {A, B, C, D, E}, from null to ABCDE]
Value of (Anti)monotonic Properties
[Figure: the subspace lattice from null to ABCDE, with the supersets of a failing node pruned]
Maximal and Closed Subspaces
[Figure: the subspace lattice of {A, B, C, D, E}, annotated with supporting row ids (A:124, B:123, C:1234, D:245, E:345, and their intersections), with the closed and maximal subspaces marked]
Minimum support = 2
# Closed = 9
# Maximal = 4
Closed and maximal
Closed but not maximal
Siblings and Parents in Lattice
Merge lattice nodes to find clusters of other properties
[Table: the 5 × 5 binary dataset]
   a b c d e
1  1 1 1 0 1
2  1 1 1 0 1
3  1 0 1 1 1
4  0 0 1 1 1
5  1 0 1 1 1
C1 =<{1,2,3,4,5}, {a,c,d,e}>
C2 =<{3,4,5}, {a,c,d,e}>
Siblings in Lattice
Goal: Subspace Clusters with Properties
Anti-monotonic properties:
– Minimum density (C = <O, A>): |O| / total number of objects in the data
e.g., density(C = <{o2,o3,o4},{a.2}>) = 3/5 = 0.6
– Succinctness: density is strictly smaller than that of all of its minimum generalizations
e.g., C1 = <{o2,o3,o4},{a.2 c.2}> is not succinct; C2 = <{o3,o4},{b.2 c.2}> is succinct
– Numerical properties (row-wise): “max”, “min”
e.g., C2 = <{o1,o5},{b.2 c.4}> satisfies “max > 3”
Weak anti-monotonic properties:
– “average >= δ”, “average <= δ”, “variance >= δ”, “variance <= δ”
e.g., C2 = <{o1,o5},{b.2 c.4}> satisfies “average >= 3”, but both C3 = <{o1,o2,o4,o5},{b.2}> and C4 = <{o1,o5},{b.2 c.4 d.2}> violate “average >= 3”
a b c d
o1 5 2 4 2
o2 2 1 2 2
o3 2 2 2 2
o4 2 2 2 2
o5 3 2 4 2
Levelwise Search
• Pruning with weak anti-monotonic properties
e.g., if C2 = <{o1,o5},{b.2 c.4}> satisfies “average >= 3”, then o1 and o5 must be contained in at least one of its minimum generalizations that satisfies this constraint, such as C5 = <{o1,o5},{c.4}>
• If an object is not contained in any cluster of size k that satisfies a weak anti-monotonic property, it cannot be contained in any cluster of size k+1 that satisfies this property
Levelwise Search for Subspace Clusters
• Anti-monotonic & weak anti-monotonic properties
– Candidate generation based on anti-monotonic properties only
– Data reduction based on weak anti-monotonic properties, such as “mean >= δ”, “mean <= δ”, “variance >= δ”, “variance <= δ”
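A minimal levelwise sketch over the 5 × 5 binary table used earlier in this deck. The density threshold and the candidate-generation style are illustrative only, not the dissertation implementation:

```python
# Binary dataset: a subspace cluster is an attribute set plus the rows
# having 1s in all of its attributes (rows are 0-indexed here).
data = {
    "a": [1, 1, 1, 0, 1],
    "b": [1, 1, 0, 0, 0],
    "c": [1, 1, 1, 1, 1],
    "d": [0, 0, 1, 1, 1],
    "e": [1, 1, 1, 1, 1],
}
n_rows = 5
MIN_DENSITY = 0.4  # anti-monotonic: a superset can only lose rows

def rows_of(attrs):
    return [r for r in range(n_rows) if all(data[a][r] for a in attrs)]

dense = {frozenset([a]) for a in data
         if len(rows_of([a])) / n_rows >= MIN_DENSITY}
while dense:
    for s in sorted(dense, key=sorted):
        print(sorted(s), rows_of(s))
    # candidate generation: extend only the surviving dense subspaces
    candidates = {s | {a} for s in dense for a in data if a not in s}
    dense = {s for s in candidates
             if len(rows_of(s)) / n_rows >= MIN_DENSITY}
```

A weak anti-monotonic constraint such as “mean >= δ” would be applied here as a data-reduction filter on the surviving clusters rather than during candidate generation.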
Performance Comparison
• Optimizing techniques
– Sorting the attributes
– Reusing previous results
– A stack of unpromising branches
– Checking the closure property
Distributed Subspace Clustering
• Discover closed subspace clusters from databases located at multiple sites
• Objectives:
– Minimize local computation cost
– Minimize communication cost
Distributed Subspace Clustering
• Horizontally partitioned data
[Table: dataset DS and its horizontal partitions]
   a b c d
1  0 0 1 1
2  1 0 1 1
3  1 1 1 0
4  0 0 1 1
5  1 1 0 0
D1 holds rows 1-3; D2 holds rows 4-5.
Distributed Subspace Clustering
List of closed subspace clusters:
DS: <c,1234>, <cd,124>, <a,235>, <ac,23>, <acd,2>, <ab,35>, <abc,3>
D1: <c,123>, <cd,12>, <ac,23>, <acd,2>, <abc,3>
D2: <cd,4>, <ab,5>
Lemma 1: All locally closed attribute sets are also globally closed.
Lemma 2: The intersection of two locally closed attribute sets from two different sites is globally closed, e.g. a = ac ∩ ab.
Distributed Subspace Clustering
Computing the object set of a globally closed attribute set:
1. Closed at both partitions: take the union of the two object sets, e.g. cd
2. Closed in one of the partitions: take the union of two object sets whose attribute sets' intersection equals the target attribute set, e.g. c = c ∩ cd
3. Not closed in either partition: similar to case 2, e.g. a = ac ∩ ab
Distributed Subspace Clustering
Problem: in cases 2 and 3 the same attribute set can arise from multiple intersections, e.g. a = ac ∩ ab and a = acd ∩ ab.
Solution: for each globally closed attribute set, keep track of the largest object set (or the size of the object set).
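The two lemmas plus the largest-object-set bookkeeping can be sketched on the running example. The dictionaries below are illustrative only; the actual algorithm exchanges these lists between sites:

```python
# Locally closed clusters <attribute set, object set> from the running example.
D1 = {frozenset("c"): {1, 2, 3}, frozenset("cd"): {1, 2},
      frozenset("ac"): {2, 3}, frozenset("acd"): {2}, frozenset("abc"): {3}}
D2 = {frozenset("cd"): {4}, frozenset("ab"): {5}}

global_closed = {}

def record(attrs, objs):
    # Solution above: keep only the largest object set per attribute set.
    if len(objs) > len(global_closed.get(attrs, set())):
        global_closed[attrs] = objs

def global_objects(attrs):
    # Union of the object sets of all local closed supersets (cases 1-3).
    objs = set()
    for local in (D1, D2):
        for a, o in local.items():
            if attrs <= a:
                objs |= o
    return objs

# Lemma 1: every locally closed attribute set is globally closed.
for attrs in set(D1) | set(D2):
    record(attrs, global_objects(attrs))

# Lemma 2: intersections across sites are also globally closed (a = ac ∩ ab).
for a1 in D1:
    for a2 in D2:
        common = a1 & a2
        if common:
            record(common, global_objects(common))

for attrs in sorted(global_closed, key=sorted):
    print("".join(sorted(attrs)), sorted(global_closed[attrs]))
```

On this example the output reproduces the DS list above: <a,235>, <ab,35>, <abc,3>, <ac,23>, <acd,2>, <c,1234>, <cd,124>.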
Distributed Subspace Clustering
Partitioning each list by the density constraint δ >= 0.6:
DS: F = {<c,1234>, <cd,124>, <a,235>}, E = {<ac,23>, <acd,2>, <ab,35>, <abc,3>}
D1: F1 = {<c,123>, <cd,12>, <ac,23>}, E1 = {<acd,2>, <abc,3>}
D2: F2 = {}, E2 = {<cd,4>, <ab,5>}
Observation: the intersection of two elements both drawn from the Ei's cannot have enough density.
Efficient computation: sort the Fi's and Ei's in decreasing order of density.
Distributed Subspace Clustering
[Figure: result lists R, R1, R2; R1 = {<c,123>, <cd,12>, <ac,23>}, and the final global result R = {<c,1234>, <cd,124>, <a,235>}]
Distributed Subspace Clustering
• Generalizing to k > 2:
– k sites need k steps of communication and computation
– k sites give k types of lists
• K = 3
[Figures: the communication and computation pattern for the three-site case]
Part-III
Multi-Domain Clusters
Introduction
Traditional clustering
Bi-Clustering
3-Clustering
Why 3-clusters?
• Correspondence between bi-clusters of two different lattices
• Sharpen local clusters with outside knowledge
• Alternative: “join the datasets, then search”
– Does not capture the underlying interactions
– Inefficient
– Not always possible
Formal Definitions
Bi-cluster in Di
3-Cluster across D1 and D2
Pattern in Di
Defining 3-clusters
• D1 is the “learner”
• A maximal rectangle of 1's under a suitable permutation in the learner
• Best correspondence to a rectangle of 1's in D2
[Figure: the two data matrices with corresponding rectangles highlighted]
Cluster Quality Measure
• Intuition: maximize the number of 1's while also maximizing the number of items and objects
• Trade-off between objects and items:
– More items, fewer objects
– More objects, fewer items
Quality Measure
– Consider bi-clusters in the learner alone
[Figure: two candidate bi-clusters C1 and C2 of different shapes over objects O and items I1]
• Which is preferable? The user decides
Measure of Cluster Quality
• Quality measure:
– Monotonic in both width and height
– Balances width and height according to a user-defined parameter
• Introduce β: the width (attributes) the user is willing to trade for a single unit of height (objects)
Cluster Quality for 3-clusters
• Utilize the same intuition
• The width of a 3-cluster is the sum of the individual widths
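The slides do not reproduce the exact formula, so the following is only an assumed form: a linear measure that satisfies the stated requirements (monotonic in both width and height, with β trading width against height), with the 3-cluster width taken as the sum of the per-domain widths:

```python
def quality(height, widths, beta=1.0):
    """Assumed linear quality: height = number of objects,
    widths = per-domain attribute counts of the 3-cluster."""
    return height + beta * sum(widths)

print(quality(height=10, widths=[3, 4], beta=0.5))  # 3-cluster width = 3 + 4
```

Consistent with the “Selecting β” slide that follows, a larger β makes width count for more and so favors wide, short clusters.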
Selecting β
• Larger values yield 3-clusters that are “wide” and “short” in both D1 and D2
– Cluster key websites popular with a large number of democrats and republicans
• Smaller values produce 3-clusters that are “narrow” and “long”
– Discover a long list of websites utilized by a few select democrats and republicans
3-Clu: Our Algorithm
• The search for 3-clusters is similar to the search for closed itemsets
• How to formulate the search space?
– The assumption that objects outnumber attributes may not hold
– Several possible orderings of the search space
3-Clu Algorithm
• Define the search space with primacy to objects
• Only one search tree needs to be maintained
• Mimic a closed-itemset algorithm with simultaneous pruning of the search space
• Prune with the quality measure
3-Clu Algorithm
• The quality measure is neither monotone nor anti-monotone in the search space
• Pruning is still possible
[Figure: two candidate clusters of different shapes; is C2 of higher quality?]
Experimental Results
[Charts: results on the Chess, Connect, and GO-Pheno datasets]
Experimental Results
• Test the validity of 3-clusters
• Randomly partitioned the Mushrooms dataset by attributes
Discriminating Clusters
• Key question: Which sets of attributes and/or objects most distinguish the incremental bi-clusters from each other?
• Incremental bi-clusters that differ only slightly may be a result of noise or human error
• Prioritize relationships among incremental bi-clusters
[Table: a 4 × 4 binary dataset]
   A B C D
1  0 1 0 1
2  1 1 0 0
3  0 1 1 1
4  0 0 1 1
[Figure: the bi-cluster lattice of this dataset, with nodes <{},{A,B,C,D}>, <{2},{A,B}>, <{3,4},{C,D}>, <{1,3},{B,D}>, <{1,2,3},{B}>, <{1,3,4},{D}>, and <{1,2,3,4},{}>]
Motivation
tf1 tf2 tf3 tf4
g1 1 0 1 1
g2 1 1 1 0
g3 0 1 0 1
g4 1 0 0 0
g5 1 0 0 1
• Functional genomics
• Interactions between genes and transcription factors
• Comparing bi-clusters in the lattice tells us the difference in activation of genes/TFs that transform cellular processes
• Prioritize relationships
Related Work
• Emerging patterns [Dong, Li]
– Only consider the ratio of support between frequent itemsets
– Supervised technique
• Contrast sets [Bay, Pazzani]
– Also supervised
– Special case of rule discovery
• Closed-itemset algorithms [Zaki, Uno, Bian]
– Efficient
– Do not explicitly enumerate the lattice structure
Problem Formulation
• Challenges:
– Enumerating bi-clusters and forming the lattice is a known NP-complete problem
– Discover distinguishing sets during the mining process, as opposed to a post-processing step
– How to quantify distinction
Problem Formulation
• Dataset D = (O, A, R)
• Consider a set of objects X
• X' is defined as the set of all attributes common to all objects of X
• The set of attributes Y' is defined dually (the objects having all attributes of Y)
• Bi-cluster: (X, Y) s.t. X' = Y and Y' = X
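A direct transcription of the prime operators and the bi-cluster test, using the 4 × 4 example dataset from these slides:

```python
# R maps each object to its attribute set (the 4 x 4 example dataset).
R = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}
ALL_ATTRS = {"A", "B", "C", "D"}

def objects_prime(X):
    """X': attributes common to all objects of X."""
    return set.intersection(*(R[o] for o in X)) if X else set(ALL_ATTRS)

def attrs_prime(Y):
    """Y': objects having all attributes of Y."""
    return {o for o, attrs in R.items() if Y <= attrs}

X, Y = {3, 4}, {"C", "D"}
print(objects_prime(X) == Y and attrs_prime(Y) == X)  # True: a bi-cluster
```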
Problem Formulation
[Table: the 4 × 4 binary dataset]
   A B C D
1  0 1 0 1
2  1 1 0 0
3  0 1 1 1
4  0 0 1 1
• Sample bi-cluster: <{3,4},{C,D}>
• Bi-clusters are equivalent to:
– Maximal rectangles of 1s under a suitable permutation
– Maximal bi-cliques
Problem Formulation
• The set of bi-clusters forms a complete lattice
• Model the lattice as a weighted directed graph
• Weights represent the degree of distinction
• Each edge represents a distinguishing set
• Grow a maximum-cost spanning tree
Problem Formulation
• Quantifying distinction:
– View bi-clusters as maximal rectangles of 1s under a suitable permutation
– Consider both the change in width and the change in height when computing distinction
– Choose a shape metric s (e.g., area, ratio of height to width, etc.)
– Quantify distinction as the degree of shape change along a path in the lattice
[Table: the 4 × 4 binary dataset, as above]
Problem Model
Compute partial derivatives as forward differences
[Figure: the lattice edge from <{2},{A,B}> to <{1,2,3},{B}>, illustrating the shape change]
Our Algorithm (MIDS)
• Input: dataset D
• Output: the maximum-cost spanning tree of the bi-cluster lattice of D
• Computational challenge: enumerating the bi-cluster lattice and growing the tree simultaneously
Our Algorithm (MIDS)
• Most min/max-cost spanning tree algorithms assume the graph is already available
• Prim's algorithm depends on the cut set
• Intuitive idea: grow the bi-cluster lattice incrementally and maintain the cut set
Our Algorithm (MIDS)
• Prim's grows a sequence of trees
• Denote by Df the set of edges between a bi-cluster c and all upper neighbors of c that do not appear in the current tree
• Dynamically compute the cut while enumerating new bi-clusters
Our Algorithm (MIDS)
1. Choose a starting bi-cluster c (usually the infimum)
2. Compute the cut set by generating the upper neighbors of c, together with the update equation
3. Compute the weights of the edges between c and its upper neighbors
4. Greedily choose the maximum-cost edge and the associated concept d
5. Set c = d; repeat steps 2-5 until all reachable bi-clusters are visited
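A skeletal Python sketch of that loop. Here upper_neighbors and edge_weight stand in for the lattice enumeration and the shape-change weight; they are placeholders, not the authors' code:

```python
import heapq

def mids(start, upper_neighbors, edge_weight):
    """Prim-style maximum-cost spanning tree over a lattice enumerated on demand."""
    tree, visited = [], {start}
    frontier = [(-edge_weight(start, n), start, n) for n in upper_neighbors(start)]
    heapq.heapify(frontier)
    while frontier:
        neg_w, c, d = heapq.heappop(frontier)  # greedy: maximum-cost edge in the cut
        if d in visited:
            continue
        visited.add(d)
        tree.append((c, d, -neg_w))
        for n in upper_neighbors(d):           # new bi-clusters enumerated here
            if n not in visited:
                heapq.heappush(frontier, (-edge_weight(d, n), d, n))
    return tree

# Toy 4-node lattice, purely for demonstration.
succ = {"bottom": ["n1", "n2"], "n1": ["top"], "n2": ["top"], "top": []}
w = {("bottom", "n1"): 2, ("bottom", "n2"): 1, ("n1", "top"): 3, ("n2", "top"): 3}
print(mids("bottom", lambda c: succ[c], lambda c, d: w[(c, d)]))
```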
Algorithms
[Figure: the 4 × 4 example dataset and its bi-cluster lattice, with nodes <{},{A,B,C,D}>, <{2},{A,B}>, <{3,4},{C,D}>, <{1,3},{B,D}>, <{1,2,3},{B}>, <{1,3,4},{D}>, and <{1,2,3,4},{}>, traced as the spanning tree grows]
Algorithms
• The major computational cost is computing the upper neighbors of a bi-cluster
• Theorem: Let <X,Y> be a concept. Then {X ∪ {o}}'' is the object set of an upper neighbor of <X,Y> if and only if, for all z ∈ {X ∪ {o}}'' \ X, the following holds: {X ∪ {z}}'' = {X ∪ {o}}''
• Lindig's algorithm implements this theorem
• The algorithm performs a local computation of bi-clusters and upper neighbors
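A small sketch of upper-neighbor generation following this theorem, on the 4 × 4 example dataset (the helper names are ours, not Lindig's):

```python
# R maps each object to its attribute set (the 4 x 4 example dataset).
R = {1: {"B", "D"}, 2: {"A", "B"}, 3: {"B", "C", "D"}, 4: {"C", "D"}}

def objects_prime(X):
    return set.intersection(*(R[o] for o in X)) if X else {"A", "B", "C", "D"}

def closure(X):
    """X'': all objects having every attribute common to X."""
    Y = objects_prime(X)
    return {o for o, attrs in R.items() if Y <= attrs}

def upper_neighbors(X):
    neighbors = set()
    for o in set(R) - X:
        cand = closure(X | {o})
        # the theorem's test: every added object generates the same closure
        if all(closure(X | {z}) == cand for z in cand - X):
            neighbors.add(frozenset(cand))
    return neighbors

print(upper_neighbors({3, 4}))  # {frozenset({1, 3, 4})}, i.e. <{1,3,4},{D}>
```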
Algorithms
• Improved Lindig's algorithm; practical running time
• Adapted it to enumerate only “large” bi-clusters
• Reduced the number of set intersections performed
• The theoretical complexity remains the same
• The overall complexity of MIDS is a function of:
– E: total number of edges
– N: number of bi-clusters
– O: number of rows
– A: number of columns
Experimental Results
• Compared our solution to Zaki's CHARM-L algorithm
– CHARM-L enumerates bi-clusters and organizes them into a lattice structure
– Not incremental: hard to adapt to Prim's algorithm
– Added a post-processing step to grow the MCST
[Charts: performance comparison results]
Experimental Results
• Experimented with synthetic datasets to find the most distinguished incremental bi-clusters
• Preliminary experiments were conducted with clearly distinguishable incremental bi-clusters and random noise
• Next, planted several large incremental bi-clusters that differed only slightly as a result of noise
Experimental Results
• Region 1 is the region of interest, clearly distinct
• Noise was added to region 1, while regions 2 and 3 contain minimal distinction
[Figure: results on the synthetic dataset]
Computer_Science@UC
A Vision
Raj Bhatnagar
CS Department @ UC
Department Goals:
• Train CS graduates to meet technical manpower needs in Ohio, and the world at large
• Contribute to the creation of scholarly knowledge through research
Needed Features:
– Strong research and graduate program
– Strong undergraduate program
– Good visibility and ranking in research communities
– Good reputation in the 250-mile region for the UG program
Research and Graduate Program
[Figure: CS Dept. faculty in core CS areas at the center, linked to BME and CHMC (Bio-Informatics), A&S and the sciences (GIS), Mathematics (Security, Data Mining), and Engineering and Business (Robotics, BI)]
• CS is an enabling science
• It can catalyze research in science and engineering
• UC needs a stronger CS program
• Core CS areas to be covered in the CS Dept.; foundational research
• Collaborations to be built with other UC departments for research
Core Computing Faculty
Current number: 12
Areas of strength:
– Algorithms and CS Theory
– Networks and Communications
– Machine Learning, AI, and Data Mining
– HCI
Gaps in strength/coverage:
– Programming Languages/Compilers
– Software Engineering
– Databases
– Computer Systems and Networks
Target number for a strong CSD: 18
Potential Collaborations
Bio-Informatics
– There have been research contacts
– Strong presence of bioinformatics activity at UC
– Need new CS faculty with matching interests
Mathematics
– Computer security, quantum computing
– Data mining, privacy-preserving operations on data
GIS
– Very active and growing in A&S
– Almost no contact with CS faculty; great research potential
Robotics
– ME has good activity; a recent hire from Duke University
– Great student interest, grad and undergrad
Business
– Strength in databases and interest in collaborations
CAS
– Potential for a minor in IT for CS students and vice versa
Undergraduate Program
• High priority to increase enrollment/retention
– Translates into more resources now
– Alumni are potential donors
– Need to advertise our strengths on the web and in the local area
– Increase the interestingness of available electives, from new faculty and also from collaborating departments
– Increase social capital and a sense of belonging within the CS student body
• Strengthen capstone projects
– Sponsored projects by local industry (item for IAB)
– Advertise them on the department website
– Awards for top projects (sponsored by industry)
• Student experience while at UC
– Support ACM, IEEE, and LARC groups for an enriching experience
– Increase UG research experience opportunities
– Seek funds to provide more scholarships
Graduate Program
Marshal resources to support Ph.D. students:
– Help faculty's attempts to seek sponsored research
– Seek industry help for sponsored research
– Seek funds to support student travel to conferences
Enhance students' quality of experience at UC:
– More interactions with faculty/students
– More graduate courses
– More support for organizing/travelling to events/conferences
Resources
Serious efforts are needed to raise resources:
– Faculty lines from UC
• Need to reach critical mass for the CSD
• PBB is a hurdle
• College support/commitment is needed
– CS endowment account (our money, that can't be cut!)
• Approach alumni for donations
• Seek industry support/sponsorship for scholarships and seminar speakers
– Sponsored research awards from NSF
• The new administration has significantly enhanced funds for NSF
• Support faculty in efforts to seek these funds
– Seek more UGA funds from CoE
• Difficult for now, but efforts must continue
Phase-I Summary
Main Results:
• Algorithms for complex operations that work with implicit databases
– Decision trees, association rules, covariance matrix
– K-NN neighbors
– Hierarchical and sequential clustering
• Algorithms for distributed control of multi-agent systems
– Distributed multi-agent reorganization
Open Research Issues:
– We preserve data privacy, but we need a formal model and analysis of privacy
– Mining of streaming data at multiple cooperating nodes
Phase-I Research Participants:
– Ph.D. dissertations
• Ahmed Khedr, 2003
• Barrington Young, 2007
• Eric Matson, 2008
– M.S. theses
• Shriram Srinivasan, 1997
• Sanjeev Beemidi, 1998
• Harpreet Singh, 2000
• Rahul Dasgupta, 2000
• Susmit Kumar, 2002
• Rishi Jhaver, 2003
• Chris Calendar, 2004
• Michael Kinsey, 2005
• Kaustubh Shinde, 2006
Phase-I Publications:
1. Ahmed Khedr and Raj Bhatnagar. Agents for Integrating Distributed Data for Complex Computations. Computing and Informatics Journal, Vol. 26, 2007, pp. 149-170.
2. Eric T. Matson, Raj Bhatnagar. Knowledge Sharing Between Agents in a Transitioning Organization. Proceedings of COIN 2007, published as Coordination, Organizations, Institutions, and Norms in Agent Systems III, Springer Verlag, 2007, pp. 187-202.
3. Eric Matson and Raj Bhatnagar. Properties of Capability Based Agent Organization Transition. Proceedings of the Intelligent Agent Technology (IAT 2006) conference, Hong Kong, December 2006.
4. Barrington Young and Raj Bhatnagar. Secure K-NN Algorithm for Distributed Databases. Proceedings of the Privacy, Security and Trust Conference, 2006, pp. 485-490.
5. Ahmed Khedr and Raj Bhatnagar. Decomposable Algorithms for Minimum Spanning Tree. International Workshop on Distributed Computing, December 2003, Springer Lecture Notes in Computer Science, Vol. 2918.
6. Raj Bhatnagar, Sriram Srinivasan. Pattern Discovery in Distributed Databases. Proceedings of the AAAI-97 Conference, Providence, RI, July 1997, pp. 503-508.
Phase-II Summary
Main Results:
• Subspace clustering algorithms
– Efficient lattice-based search
– Use of novel monotonic conditions to control the search
• Distributed mining across multiple lattices
– Only for horizontally partitioned datasets
• Applications of results
– Genomic data, document collections
Open Research Issues:
– Clusters for non-binary datasets
– Approximately closed clusters
Phase-II Research Participants:
– Ph.D. dissertations
• Haiyun Bian, 2006
• Amit Sinha, 2008
– M.S. theses
• Gautam Kurra, 2002
• Anshuman Rajshiva, 2004
• Ramya Ashok, 2005
• Aparna Yardi, 2006
• Aravind Kumar, 2006
• Shriram Narayanswami, 2007
• Mrunal Deshmukh, 2008
– Collaborations:
• CHMC
Phase-II Publications:
1. Shriram Narayanswamy, Raj Bhatnagar. A Lattice-Based Model for Recommender Systems. Proceedings of the International Conference on Tools with Artificial Intelligence (ICTAI 2008), pp. 349-356.
2. Haiyun Bian, Raj Bhatnagar. An Algorithm for Mining Weighted Dense Maximal 1-Complete Regions. Data Mining: Foundations and Practice, pp. 31-48, Springer Verlag, 2008.
3. Haiyun Bian, Raj Bhatnagar, and Barrington Young. An Efficient Constraint-Based Closed Set Mining Algorithm. Proceedings of the International Conference on Machine Learning and Applications (ICMLA 2007), pp. 67-72.
4. Barrington Young, Raj Bhatnagar, Giridhar Tatavarty, and Haiyun Bian. Covariance Matrix Computations with Federated Databases. Proceedings of the International Conference on Machine Learning and Applications (ICMLA 2007), pp. 172-177.
5. Haiyun Bian, Raj Bhatnagar. Efficiently Mining Maximal 1-Complete Regions from Dense Datasets. ICDM Workshop on Foundations of Data Mining, 2006, Proceedings of ICDM Workshops, pp. 423-427.
6. Haiyun Bian and Raj Bhatnagar. Towards More Supervised Subspace Clustering. Proceedings of the MAICS 2006 conference, Valparaiso, IN, April 2006.
7. Arvind Muthukrishnan and Raj Bhatnagar. Concept-Based Organization and Retrieval of Technical Documents. Proceedings of MAICS 2006, Valparaiso, IN, April 2006.
8. Haiyun Bian and Raj Bhatnagar. An Algorithm for Lattice-Structured Subspace Clustering. Proceedings of the SIAM International Conference on Data Mining, April 2005.
Phase-III Summary
Main Results:
• 3-Clustering algorithm
– Efficient search algorithm
– Bioinformatics application, genomic datasets
• Most discriminating subsets
– Efficient algorithm
Open Research Issues:
– Multi-domain datasets with closed-loop relationships
– Diagonal band patterns
Phase-III Research Participants:
– Ph.D. dissertations
• Faris Alqadah, 2010 (very likely)
– Collaborations: CHMC
Phase-III Publications:
1. Faris Alqadah and Raj Bhatnagar. Discovering Substantial Distinctions among Incremental Bi-Clusters. To be presented at the SIAM International Conference on Data Mining (SDM 09), April 2009.
2. Faris Alqadah and Raj Bhatnagar. An Effective Algorithm for Mining 3-Clusters in Vertically Partitioned Data. Proceedings of CIKM 2008, pp. 1103-1112.
3. Faris Alqadah, Raj Bhatnagar. Detecting Significant Distinguishing Sets among Bi-Clusters. Proceedings of CIKM 2008.
Conclusions
• Introduced a quantitative measure of distinction among incremental bi-clusters
• Developed an efficient algorithm for enumerating bi-clusters and growing the maximum-cost spanning tree simultaneously