Finding Top-K Similar Graphs in Graph Databases @ReadingCircle
M1 Ishikawa Yasutaka
About this paper
A paper in "graph theory"
About "graph similarity query"
Proposes a new technique that gives more accurate answers and reduces computational cost
Proceedings of the 15th International Conference on Extending Database Technology - EDBT '12
Zhu, Yuanyuan; Qin, Lu; Yu, Jeffrey Xu; Cheng, Hong
Outline
1. Background of graph theory
2. Introduction
3. Problem statement
4. The framework
5. Pruning without indexing
6. Pruning with indexing
7. Performance studies
8. Conclusion
1. Background of graph theory
What is a "graph"?
A graph is denoted by $g = (V, E, l)$
$V$ is the set of vertices
$E \subseteq V \times V$ is the set of edges
$l$ is a labeling function, $l: V \to \Sigma$, where $\Sigma$ is a set of labels
In this paper, the edges of a graph have no weights
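A minimal sketch of how such a labeled, unweighted graph could be held in code. The class name `LabeledGraph` and its fields are illustrative assumptions, not notation from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected graph g = (V, E, l) with vertex labels and unweighted edges."""
    labels: dict = field(default_factory=dict)   # vertex id -> label, i.e. l: V -> Sigma
    edges: set = field(default_factory=set)      # each edge is a frozenset {u, v}

    def add_vertex(self, v, label):
        self.labels[v] = label

    def add_edge(self, u, v):
        # Both endpoints must already be vertices, since E is a subset of V x V.
        if u not in self.labels or v not in self.labels:
            raise KeyError("both endpoints must be vertices of the graph")
        self.edges.add(frozenset((u, v)))


# Example: a path A - B - C.
g = LabeledGraph()
for v, label in [(0, "A"), (1, "B"), (2, "C")]:
    g.add_vertex(v, label)
g.add_edge(0, 1)
g.add_edge(1, 2)
print(len(g.labels), len(g.edges))
```

Storing each edge as a `frozenset` makes `(u, v)` and `(v, u)` the same edge, which matches an undirected graph.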
Subgraph / Supergraph
Given two graphs $g$ and $g'$, if $g \subseteq g'$:
$g$ is a subgraph of $g'$
$g'$ is a supergraph of $g$
[Figure: a supergraph and one of its subgraphs]
Maximum Common Subgraph
If $g$ is a common subgraph of $g_1$ and $g_2$, and there is no other common subgraph $g'$ of $g_1$ and $g_2$ such that $|E(g')| > |E(g)|$, then $g$ is a maximum common subgraph (MCS) of the two graphs
Computing the MCS is NP-hard
[Figure: graph g1, graph g2, and their MCS]
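Bipartite graph
A graph whose vertices can be divided into two disjoint sets $X$ and $Y$
$X$ and $Y$ are each independent sets (every edge joins a vertex of $X$ to a vertex of $Y$)
[Figure: a bipartite graph with vertex sets X and Y]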
Matching of bipartite graph
If no two edges in an edge set $M$ share a vertex, $M$ is called a matching
[Figure: a matching between X and Y]
2. Introduction
Graph query processing (1)
Using a graph as a query against a graph database
It has attracted much attention in recent years
Image retrieval
Chemical compound structure search
[Figure: a query graph is issued against a graph DB, which returns result graphs]
Graph query processing (2)
Mainly falls into two categories
Subgraph containment search
Identify the set of graphs that contain the query graph
Supergraph containment search
Identify the set of graphs that are contained in the query graph
Besides exact subgraph/supergraph containment queries, some studies allow a small number of edges or nodes to be missing in the query result
→ Graph similarity search is important
Graph similarity search
Main theme of this paper
Compute the similarity between a query graph and each graph in the database
"Top-k similar graphs" means the k graphs that are most similar to the query graph
[Figure: a query graph and its top-3 similar graphs, ranked 1, 2, 3]
Existing graph similarity search (1)
Two kinds of graph similarity search in related work
Subgraph similarity search
H. Shang, X. Lin, Y. Zhang, J. X. Yu, and W. Wang. Connected substructure similarity search. In SIGMOD, pages 903–914, 2010.
X. Yan, P. Yu, and J. Han. Substructure similarity search in graph databases. In SIGMOD, pages 766–777, 2005.
Supergraph similarity search
H. Shang, K. Zhu, X. Lin, Y. Zhang, and R. Ichise. Similarity search on supergraph containment. In ICDE, pages 637–648, 2010.
To calculate similarity, a distance between graphs must be defined: $dist(q, g)$
Existing graph similarity search (2)
Subgraph similarity search
$dist(q, g) = |E(q)| - |E(mcs(q, g))|$
Supergraph similarity search
$dist(q, g) = |E(g)| - |E(mcs(q, g))|$
※ (maybe) these $dist(q, g)$ do not satisfy the axioms of a metric space
$dist(q, g) \neq dist(g, q)$
Ex: existing similarity search (1)
Query $q$ and sample graph database $D = \{g_1, g_2, g_3\}$
Bold edges mark the MCS of $q$ and each $g_i$
[Figure: query q and graphs g1, g2, g3 in D, with vertices labeled A, B, C, D]
Ex: existing similarity search (2)
If we use the subgraph similarity query ($dist(q, g) = |E(q)| - |E(mcs(q, g))|$), $g_3$ will be returned as the answer
$dist(q, g_3) = 7 - 6 = 1$
[Figure: query q and graphs g1, g2, g3 in D, with vertices labeled A, B, C, D]
Ex: existing similarity search (3)
If we use the supergraph similarity query ($dist(q, g) = |E(g)| - |E(mcs(q, g))|$), $g_1$ will be returned as the answer
$dist(q, g_1) = 3 - 2 = 1$
[Figure: query q and graphs g1, g2, g3 in D, with vertices labeled A, B, C, D]
Ex: existing similarity search (4)
But from the user's perspective, the best answer should be $g_2$
These ways of calculating $dist$ are not good
[Figure: query q and graphs g1, g2, g3 in D, with vertices labeled A, B, C, D]
Main contributions of this paper
1. Studying top-k graph similarity query processing based on a new MCS-based similarity measure
2. Deriving several distance lower bounds (without and with an index) to reduce the number of MCS computations
3. Conducting extensive performance studies on a real dataset to test the performance of their algorithms
3. Problem statement
Definitions (1)
In this paper, the distance $dist(q, g)$ is defined as follows
$dist(q, g) = |E(q)| + |E(g)| - 2 \times |E(mcs(q, g))|$
※ This $dist(q, g)$ (maybe) satisfies the axioms of a metric space
$x = y \Leftrightarrow dist(x, y) = 0$
$dist(y, x) = dist(x, y)$
$dist(x, y) \geq 0$
$dist(x, y) + dist(y, z) \geq dist(x, z)$
This becomes important later
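The three distance definitions in this deck can be compared side by side with a small sketch. The function names `subsim`, `supersim`, and `fullsim` follow the labels used in the experiments section. The edge counts are the ones quoted in the deck's examples.

```python
def subsim(e_q, e_mcs):
    """Subgraph similarity distance: |E(q)| - |E(mcs(q, g))|."""
    return e_q - e_mcs


def supersim(e_g, e_mcs):
    """Supergraph similarity distance: |E(g)| - |E(mcs(q, g))|."""
    return e_g - e_mcs


def fullsim(e_q, e_g, e_mcs):
    """Distance used in this paper: |E(q)| + |E(g)| - 2 * |E(mcs(q, g))|."""
    return e_q + e_g - 2 * e_mcs


print(subsim(7, 6))         # 1, the dist(q, g3) example
print(supersim(3, 2))       # 1, the dist(q, g1) example
print(fullsim(12, 6, 6))    # 6
print(fullsim(12, 12, 10))  # 4

# subsim depends on which graph plays the role of the query, so it is not symmetric:
# q has 7 edges, g1 has 3 edges, and their MCS has 2 edges.
print(subsim(7, 2), subsim(3, 2))   # 5 1
```

`fullsim` treats both graphs the same way, so swapping the two edge counts never changes the result.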
Definitions (2)
In this paper, they allow the MCS of two graphs to be disconnected
It can potentially capture more common substructures of two graphs
It can also evaluate the structural similarity of two graphs more globally
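To make the disconnected MCS concrete, here is a brute-force sketch for tiny graphs. It is not the MCS algorithm used in the paper. The function names and the graph encoding (a dict of labels plus a set of `frozenset` edges) are assumptions for illustration. It tries every label-preserving partial one-to-one vertex mapping and counts the edges that survive.

```python
def mcs_edge_count(labels1, edges1, labels2, edges2):
    """|E(mcs)| by exhaustive search; the common subgraph may be disconnected.

    labels*: dict vertex -> label; edges*: set of frozenset({u, v}).
    Exponential time, so only usable on very small graphs.
    """
    verts1 = list(labels1)
    best = 0

    def extend(i, mapping, used):
        nonlocal best
        if i == len(verts1):
            common = sum(
                1 for e in edges1
                if all(v in mapping for v in e)
                and frozenset(mapping[v] for v in e) in edges2
            )
            best = max(best, common)
            return
        u = verts1[i]
        extend(i + 1, mapping, used)          # option 1: leave u unmatched
        for v, label in labels2.items():      # option 2: match u to a free vertex with the same label
            if v not in used and label == labels1[u]:
                mapping[u] = v
                used.add(v)
                extend(i + 1, mapping, used)
                del mapping[u]
                used.remove(v)

    extend(0, {}, set())
    return best


def dist(labels1, edges1, labels2, edges2):
    common = mcs_edge_count(labels1, edges1, labels2, edges2)
    return len(edges1) + len(edges2) - 2 * common


# q consists of two separate edges A-B and C-C; g is the path A-B-D-C-C.
labels_q = {0: "A", 1: "B", 2: "C", 3: "C"}
edges_q = {frozenset((0, 1)), frozenset((2, 3))}
labels_g = {0: "A", 1: "B", 2: "D", 3: "C", 4: "C"}
edges_g = {frozenset((0, 1)), frozenset((1, 2)), frozenset((2, 3)), frozenset((3, 4))}
print(mcs_edge_count(labels_q, edges_q, labels_g, edges_g))  # 2
print(dist(labels_q, edges_q, labels_g, edges_g))            # 2
```

A connected common subgraph of these two graphs can contain only one edge, while the disconnected MCS captures both A-B and C-C.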
Ex: $dist(q, g)$ of this paper (1)
Query $q$ and sample graph database $D = \{g_1, g_2\}$
Bold edges mark the common edges of $q$ and each $g_i$
[Figure: graph g1, graph g2, and query q, with vertices labeled A, B, C]
Ex: $dist(q, g)$ of this paper (2)
If we require the MCS to be connected, $g_1$ will be returned as the answer
$dist(q, g_1) = 12 + 6 - 2 \times 6 = 6$
$dist(q, g_2) = 12 + 12 - 2 \times 5 = 14$
[Figure: graph g1, graph g2, and query q, with vertices labeled A, B, C]
Ex: $dist(q, g)$ of this paper (3)
If we allow the MCS to be disconnected, $g_2$ will be returned as the answer
$dist(q, g_1) = 12 + 6 - 2 \times 6 = 6$
$dist(q, g_2) = 12 + 12 - 2 \times 10 = 4$
$g_2$ is the result users want
[Figure: graph g1, graph g2, and query q, with vertices labeled A, B, C]
4. The framework
Pruning strategy
As mentioned previously, computing the MCS is an NP-hard problem
In this paper, they derive lower bounds of $dist(q, g)$ to reduce the number of MCS computations
They do not make the MCS computation itself faster
If a lower bound of $dist(q, g)$ is no less than the largest distance among the current top-k answers, $g$ cannot be a top-k answer and can be pruned safely
Base algorithm (1)
Uses a max-heap and a min-heap
Base algorithm (2)
If the lower bound of $dist(q, g)$ is smaller than the top value of the current top-k answers, the exact $dist(q, g)$ is computed and compared with the current top value again
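The loop described above can be sketched as follows. The roles given to the two heaps are an assumption: a min-heap orders candidates by lower bound, and a max-heap keeps the current top-k answers. `lower_bound` and `exact_dist` are hypothetical callbacks standing in for a distance lower bound and the MCS-based distance. This is one possible reading, not the paper's exact pseudocode.

```python
import heapq


def top_k_similar(query, graphs, k, lower_bound, exact_dist):
    """Return (sorted list of (distance, index) for the k nearest graphs, number of exact computations)."""
    # Min-heap of candidates: the smallest lower bound is examined first.
    candidates = [(lower_bound(query, g), i) for i, g in enumerate(graphs)]
    heapq.heapify(candidates)

    answers = []      # max-heap via negated distances: -answers[0][0] is the largest kept distance
    exact_calls = 0
    while candidates:
        lb, i = heapq.heappop(candidates)
        if len(answers) == k and lb >= -answers[0][0]:
            break     # all remaining candidates have bounds at least this large: prune them
        d = exact_dist(query, graphs[i])
        exact_calls += 1
        if len(answers) < k:
            heapq.heappush(answers, (-d, i))
        elif d < -answers[0][0]:
            heapq.heapreplace(answers, (-d, i))
    return sorted((-neg_d, i) for neg_d, i in answers), exact_calls


# Toy stand-ins: "graphs" are integers, the distance is |a - b|, and half of it is a valid lower bound.
toy_exact = lambda a, b: abs(a - b)
toy_bound = lambda a, b: abs(a - b) // 2
result, calls = top_k_similar(10, [1, 9, 12, 30, 11, 50], 2, toy_bound, toy_exact)
print(result, calls)   # [(1, 1), (1, 4)] 2
```

In the toy run only 2 of the 6 exact distances are computed. The other candidates are pruned by their lower bounds.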
5. Pruning without indexing
Edge frequency based lower bound
Finding a lower bound of $dist(q, g)$ is equivalent to finding an upper bound of $|E(mcs(q, g))|$
Denote the set of distinct edges in $g$ as $E_d(g)$
Denote the frequency of edge $e$ in $g$ as $f(e, g)$
$\overline{mcs}_1(q, g) = \sum_{e \in E_d(q) \cup E_d(g)} \min\{f(e, q), f(e, g)\}$
$dist_1(q, g) = |E(q)| + |E(g)| - 2 \times \overline{mcs}_1(q, g)$
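A short sketch of the edge frequency bound, treating a distinct edge as the unordered pair of its endpoint labels. The function names are illustrative assumptions. The query frequencies 4, 3, 6 match the example that follows this slide. The frequencies for `freq_g1` are an assumption chosen to agree with its 12 edges and the sum 4 + 3 + 5.

```python
from collections import Counter


def edge_label_counts(labels, edges):
    """f(e, g): how often each distinct edge (pair of endpoint labels) occurs in g."""
    return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)


def mcs_upper_bound_1(freq_q, freq_g):
    # Counter & Counter keeps min(f(e, q), f(e, g)) for every distinct edge.
    return sum((freq_q & freq_g).values())


def dist_1(freq_q, freq_g):
    e_q = sum(freq_q.values())
    e_g = sum(freq_g.values())
    return e_q + e_g - 2 * mcs_upper_bound_1(freq_q, freq_g)


freq_q = Counter({("A", "C"): 4, ("B", "C"): 3, ("C", "C"): 6})    # 13 edges
freq_g1 = Counter({("A", "C"): 4, ("B", "C"): 3, ("C", "C"): 5})   # 12 edges (assumed split)
print(mcs_upper_bound_1(freq_q, freq_g1))   # 12
print(dist_1(freq_q, freq_g1))              # 1

# Building the counts from an explicit small graph: the path A - C - C.
print(edge_label_counts({0: "A", 1: "C", 2: "C"}, [(0, 1), (1, 2)]))
```

The bound only needs label-pair counts, so it runs in time linear in the number of edges and never touches the MCS computation.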
Ex: using $dist_1(q, g)$ (1)
The frequencies of the edges (A,C), (B,C), (C,C) in the query $q$ are 4, 3, 6
$\overline{mcs}_1(q, g_1) = 4 + 3 + 5 = 12$
$dist_1(q, g_1) = 13 + 12 - 2 \times 12 = 1$
[Figure: graph g1, graph g2, and query q, with vertices labeled A, B, C]
Ex: using $dist_1(q, g)$ (2)
$\overline{mcs}_1(q, g_2) = 3 + 3 + 6 = 12$
$dist_1(q, g_2) = 13 + 13 - 2 \times 12 = 2$
In fact, these lower bounds are not tight compared to the actual $dist$
[Figure: graph g1, graph g2, and query q, with vertices labeled A, B, C]
Adjacency List Based Lower Bound (1)
Construct a bipartite graph $B(q, g)$
For each pair of nodes $u \in V(q)$ and $v \in V(g)$, there is an edge between the node for $u$ and the node for $v$ in $B(q, g)$ if $l(u) = l(v)$
$L(adj(u))$ is the multiset consisting of the labels of all nodes adjacent to $u$
[Figure: a node u and its adjacent nodes, labeled A, A, and B]
$L(adj(u)) = \{A, A, B\}$
Adjacency List Based Lower Bound (2)
The weight of an edge of $B(q, g)$ is defined as $w(u, v) = |L(adj(u)) \cap L(adj(v))|$
$M(q, g)$ is the maximum weighted bipartite matching of $B(q, g)$
$\overline{mcs}_2(q, g) = \frac{1}{2} \sum_{(u, v) \in M(q, g)} w(u, v)$
$dist_2(q, g) = |E(q)| + |E(g)| - 2 \times \overline{mcs}_2(q, g)$
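The adjacency list bound can be sketched for small graphs as below. The maximum weighted matching is found by brute force over all assignments, which is an assumption made for brevity. A real implementation would use a polynomial-time assignment algorithm. The function names and the adjacency-list encoding are illustrative.

```python
from collections import Counter
from itertools import permutations


def neighbor_labels(labels, adj, u):
    """L(adj(u)): multiset of the labels of the nodes adjacent to u."""
    return Counter(labels[v] for v in adj[u])


def mcs_upper_bound_2(labels_q, adj_q, labels_g, adj_g):
    def weight(u, v):
        if labels_q[u] != labels_g[v]:
            return 0   # differently labeled nodes are not connected in B(q, g)
        common = neighbor_labels(labels_q, adj_q, u) & neighbor_labels(labels_g, adj_g, v)
        return sum(common.values())

    q_nodes, g_nodes = list(labels_q), list(labels_g)
    # Weights are non-negative, so a best matching can always be padded with zero-weight pairs.
    if len(q_nodes) <= len(g_nodes):
        totals = (sum(weight(u, v) for u, v in zip(q_nodes, perm))
                  for perm in permutations(g_nodes, len(q_nodes)))
    else:
        totals = (sum(weight(u, v) for u, v in zip(perm, g_nodes))
                  for perm in permutations(q_nodes, len(g_nodes)))
    return max(totals) / 2


def dist_2(labels_q, adj_q, labels_g, adj_g):
    e_q = sum(len(nbrs) for nbrs in adj_q.values()) // 2
    e_g = sum(len(nbrs) for nbrs in adj_g.values()) // 2
    return e_q + e_g - 2 * mcs_upper_bound_2(labels_q, adj_q, labels_g, adj_g)


# q is the path A - B - C; g is a star with center B and leaves A, C, C.
labels_q, adj_q = {0: "A", 1: "B", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]}
labels_g, adj_g = {0: "A", 1: "B", 2: "C", 3: "C"}, {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(mcs_upper_bound_2(labels_q, adj_q, labels_g, adj_g))   # 2.0
print(dist_2(labels_q, adj_q, labels_g, adj_g))              # 1.0
```

In this toy case the matched weights are 1 + 2 + 1 = 4, so the bound on the MCS size is 2, which equals the true MCS size.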
Ex: using $dist_2(q, g)$ (1)
$\overline{mcs}_2(q, g_1) = (2 + 2 + 2 + 1) \div 2 = 3.5$
$dist_2(q, g_1) = 4 + 5 - 2 \times 3.5 = 2$
[Figure: graph g1, query q, and the bipartite graph B(q, g1) with matched edge weights 2, 2, 2, 1]
Ex: using $dist_2(q, g)$ (2)
If we use $\overline{mcs}_1$ instead, $\overline{mcs}_1(q, g_1) = 1 + 1 + 1 + 1 = 4$
$dist_1(q, g_1) = 4 + 5 - 2 \times 4 = 1$
[Figure: graph g1, query q, and the bipartite graph B(q, g1) with matched edge weights 2, 2, 2, 1]
Ex: using $dist_2(q, g)$ (3)
Given two graphs $q$ and $g$, we have $dist_2(q, g) \geq dist_1(q, g)$
[Figure: graph g1, query q, and the bipartite graph B(q, g1) with matched edge weights 2, 2, 2, 1]
Algorithm using $dist_1$, $dist_2$
The computational costs are ordered $dist > dist_2 > dist_1$
Use $dist_1$ as much as possible
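One way to read this cost ordering is as a cascade: try the cheapest bound first and move to a more expensive one only when the cheaper one fails to prune. The helper `can_prune` and the toy bounds below are assumptions for illustration, not the paper's algorithm.

```python
def can_prune(query, graph, threshold, bounds):
    """Try lower bounds from cheapest to most expensive; stop at the first one that prunes."""
    for bound in bounds:
        if bound(query, graph) >= threshold:
            return True
    return False


# Toy stand-ins that record which bound was evaluated.
calls = []


def cheap_bound(q, g):
    calls.append("cheap")
    return abs(q - g) // 2


def costly_bound(q, g):
    calls.append("costly")
    return max(abs(q - g) - 1, 0)


print(can_prune(10, 30, 5, [cheap_bound, costly_bound]), calls)   # True ['cheap']
```

When the cheap bound already reaches the threshold, the costly bound is never evaluated.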
6. Pruning with indexing
Triangle property of distance
Given three graphs $g_1, g_2, g_3$: $dist(g_1, g_3) \leq dist(g_1, g_2) + dist(g_2, g_3)$
If $g_2$ and $g_3$ are very near, $dist(g_1, g_2) \approx dist(g_1, g_3)$
If we know $dist(g, g')$, we can compute these lower bounds of $dist(q, g)$
$dist_3(q, g \mid g') = dist(q, g') - dist(g, g')$
$dist_4(q, g \mid g') = dist(g, g') - dist(q, g')$
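Both triangle bounds need only two already-known distances to a reference graph $g'$. Taking the larger of the two is an assumption about how they would be combined. The function name is illustrative.

```python
def triangle_lower_bound(dist_q_ref, dist_g_ref):
    """Lower bound on dist(q, g) from the distances of q and g to a shared reference graph g'."""
    dist_3 = dist_q_ref - dist_g_ref
    dist_4 = dist_g_ref - dist_q_ref
    return max(dist_3, dist_4)   # same value as abs(dist_q_ref - dist_g_ref)


# q is far from g' (distance 9) while g is close to g' (distance 2),
# so q and g must be at least 7 apart.
print(triangle_lower_bound(9, 2))   # 7
```

No MCS computation between $q$ and $g$ is needed to obtain this bound.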
Indexing
The $dist(g, g')$ can be precomputed
But computing it for all pairs needs $O(|D|^2)$ MCS computations
Define a set of groups $I = \{G_1, G_2, \ldots, G_{|I|}\}$, where $G_i \subseteq D$ and $G_1 \cup G_2 \cup \cdots \cup G_{|I|} = D$
Each group has a center graph $c_i \in G_i$
Precompute $dist(g, c_i)$ for every $g \in G_i$
[Figure: graphs g1 to g7 arranged into two groups G1 and G2]
Algorithm using $dist_1$, $dist_2$, index
If we get the real $dist(q, g)$, use it to update the lower bound $dist$
Ex: algorithm with index
Three indexing strategies (1)
DPIndex
Given the number of groups $m$, randomly pick $m$ graphs as the center nodes of the groups. Assign each non-center graph $g \in D$ to its nearest center
Each graph belongs to only one group
Three indexing strategies (2)
OPIndex
After selecting $m$ graphs in $D$ as centers, assign each non-center graph $g \in D$ to its $l$ nearest centers
Allows each graph to belong to multiple groups
Three indexing strategies (3)
GSIndex
Treat each graph in $D$ as a center
For each center, find its $l$ nearest graphs in $D$, and put these $l + 1$ graphs together as a group
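The three grouping strategies can be sketched as follows. The function names, the `seed` argument, and the returned layout (center index mapped to a dict of member index and precomputed distance) are assumptions for illustration. `dist` stands for any graph distance, and the toy example uses integers in place of graphs.

```python
import random


def dp_index(graphs, m, dist, seed=0):
    """Disjoint groups: every non-center graph joins its single nearest center."""
    centers = random.Random(seed).sample(range(len(graphs)), m)
    groups = {c: {c: 0} for c in centers}   # center -> {member: dist(member, center)}
    for i, g in enumerate(graphs):
        if i in groups:
            continue
        d, c = min((dist(g, graphs[c]), c) for c in centers)
        groups[c][i] = d
    return groups


def op_index(graphs, m, l, dist, seed=0):
    """Overlapping groups: every non-center graph joins its l nearest centers."""
    centers = random.Random(seed).sample(range(len(graphs)), m)
    groups = {c: {c: 0} for c in centers}
    for i, g in enumerate(graphs):
        if i in groups:
            continue
        for d, c in sorted((dist(g, graphs[c]), c) for c in centers)[:l]:
            groups[c][i] = d
    return groups


def gs_index(graphs, l, dist):
    """Every graph is a center; its group holds the center plus its l nearest graphs."""
    groups = {}
    for c, center in enumerate(graphs):
        nearest = sorted((dist(g, center), i) for i, g in enumerate(graphs) if i != c)[:l]
        groups[c] = {c: 0, **{i: d for d, i in nearest}}
    return groups


toy_graphs = [0, 1, 2, 10, 11, 12]          # integers stand in for graphs
toy_dist = lambda a, b: abs(a - b)          # stand-in for the MCS-based distance
dp = dp_index(toy_graphs, 2, toy_dist)
op = op_index(toy_graphs, 2, 2, toy_dist)
gs = gs_index(toy_graphs, 2, toy_dist)
print(sorted(gs[0].items()))                # [(0, 0), (1, 1), (2, 2)]
```

The stored distances to each center are the values the triangle property needs at query time.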
7. Performance studies
Overview of experiments
Similarity measures evaluation
Shows why the query results of subgraph/supergraph similarity queries are not good
Query performance evaluation
Compares with noIndex and SeqScan, and compares their three indexing techniques
Indexing cost evaluation
Compares the cost of their three indexing techniques
Environment
All the algorithms were implemented using Visual C++ 2005
Tested on a PC with a 2.66GHz CPU and 3.43GB of memory running Windows XP
Parameters
They evaluate their approaches by varying five parameters
$k$: the top-k value
$|V(q)|$: the size of the query graph
$|D|$: the number of graphs in the graph database
$m$: the number of groups used in DPIndex and OPIndex
$l$: the maximum number of groups
Similarity measures comparison
Experiments with three types of measure
Subsim: $|E(q)| - |E(mcs(q, g))|$
Supersim: $|E(g)| - |E(mcs(q, g))|$
Fullsim: $|E(q)| + |E(g)| - 2 \times |E(mcs(q, g))|$
The closer the answers are to the query graph in size, the better the answers are
Power of pruning strategy
SeqScan needs around 7000 MCS computations for graphs with size larger than 10
noIndex needs no more than 500
Scalability testing
Comparing their three index techniques
Index testing
Comparing the cost of the three index techniques
8. Conclusion
Conclusion
Existing solutions (subgraph/supergraph similarity search) cannot solve the problem properly
They introduced a new graph distance based on the maximum common subgraph (MCS)
To reduce the number of MCS computations, they proposed two distance lower bounds
They further used the triangle property to derive additional lower bounds
They conducted extensive performance studies