42
Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Embed Size (px)

Citation preview

Page 1: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Graph Indexing: A Frequent Structure

based ApproachAuthors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Page 2: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Outline

• Introduction• Background• Problem definition• Challenges

• Frequent fragment• Discriminative fragment• gIndex

• Discriminative fragment selection• Index construction• Search• Verification• Incremental maintenance

• Experimental result• Conclusions & discussion

Page 3: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Background

• Graphs have become increasingly important in modelling complicated structures and data

• Problems: • how to efficiently process graph queries and retrieve related graphs.

Page 4: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Background

• Path-based indexing approach

• Advantages• Paths are easier to manipulate than trees and graphs• The index space is predefined: all the paths up to maxL length are selected.

Page 5: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Background

CC-C

C-C-CC-C-C-C

Page 6: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Background

• Path-based indexing approach

• Disadvantages• Path is too simple: structural information is lost• There are too many paths: the set of paths in a graph database usually is huge

Page 7: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Problem definition

• Can we use a graph structure instead of path as the basic index feature?

• Graph-based indexing approach - gIndex

Page 8: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Challenges

• The number of graph structures is usually much larger than the number of paths in a graph database.

frequent subgraph

1. support threshold progressively increases when the subgraphs grow large

2. Discriminative structure : reduce the redundancy

Page 9: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Challenges

• Step 1: Index Construction• Enumerating and selecting features in graph database.• Df = {gi|f g⊆ i,gi D }, f F ,F is the graph feature set∈ ∈

• Step 2: Query Processing• Enumerate structures in the query graph: search+ verification

1. Calculate the candidate graphs containing these structures (Cq =∩fDf (f q and f F) )⊆ ∈2. checks graph g in Cq to verify whether q is really a subgraph of g.

Page 10: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Challenges

• Query Response Time

Graph index access time Isomorphism testing time

Size of candidate answer set

KEY: make |Cq| as small as possible

Page 11: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Challenges

• If a graph database is very large such that the index cannot be held on the memory, Tsearch may be critical for the query response time.

-> Reduce the index size-> minimizing the size of feature set |F|

Page 12: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Frequent fragment

• A graph g is frequent if its support is no less than a minimum support threshold, minSup• It is feasible to construct high-quality indices using only frequent

fragments • Suppose all the frequent fragments with minimum support minSup

are indexed , Given a query graph q1. if q is frequent 2. if q is not frequent ->use frequent subgraph f of q

Page 13: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Frequent fragment

• How to choose minSup? Uniform minSup Example: a completely con- nected graph with 10 vertices, each of which has a distinct label.

45 1-edge subgraphs360 2-edge onesmore than 1,814,400 8-edge ones3

Page 14: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Frequent fragment

• Low minimum support on small fragments (for effectiveness) and high minimum support on large fragments (for compactness)

Page 15: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Frequent fragment

• Advantages• The number of frequent fragments so obtained is much less then that with

the lowest uniform minSup• Low-support large fragments may by indexed well by their smaller subgraphs;

thereby we do not miss useful fragments for indexing

• However, the number of indexed fragments may still be huge • For example, 1,000 graphs may easily produce 100,000 fragments of that kind • time and space consuming

Page 16: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Discriminative fragment

• Among similar fragments with the same support, it is often sufficient to index only the smallest common fragment since more query graphs may contain the smallest fragment.

Page 17: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Discriminative fragment

• Let fragments f1 , f2 , …, fn be indexing features. • Given a new fragment x, the discriminative power of x can be

measured by

• When P is small enough, x is a discriminative structure and should be included in the index

• Reduce the index size

Page 18: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex

• Discriminative Fragment selection• Index construction• Search• Verification• Incremental maintenance

Page 19: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Discriminative Fragment selection• Generates all frequent fragments with the size increasing support

constraint

• Eliminate the redundant ones

Page 20: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Index construction

• It translates fragments into sequences and holds them in a prefix tree.

• Graph Sequentialization• gIndex Tree

Page 21: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Index construction

• efficient to scan the whole feature set to match fragments one by one• Graph Sequentialization• DFS coding - translate a graph into a sequence• Labeled edge is a 5-tuple (I, j, li, l(I,j), lj)

• (0, 1, X, a, X) (1, 2,X, a, Z) (2, 0, Z, b, X) (1, 3, X, b, Y )

1

2

34

Page 22: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Index construction

• gIndex Tree

f1 = (e1)f2 = (e1, e2)

f3 = (e1, e2, e3)

Page 23: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Index construction

• Records some redundant fragments• This setting makes the Apriori pruning possible.

• Reduce the number of intersection operations conducted on id lists of discriminative fragments by using (approximate) maximum fragments only.• The search time will be significantly reduced by using gIndex tree

Page 24: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Search

Page 25: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Search

• Apriori Pruning

• if a fragment is not in the gIndex tree, we need not check its super-graphs any more.

• That is why we record some redundant fragments in the gIndex tree.

Page 26: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Search

• Maximum Discriminative Fragment

• Only have to perform intersection operations on the id lists of maximum discriminative fragments

• Reduce the number of intersection operations

Page 27: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Search

• Inner Support

• one fragment may appear several times even in one graph.

• have to store the inner support of discriminative fragments, which means the space cost is doubled

• If a query graph is large, it is pretty efficient using inner support.

Page 28: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Verification

• After getting the candidate answer set, we have to verify that the graphs in the set really contain the query graph.

• The simplest way to do it is to perform a subgraph isomorphism test on each graph one by one.

Page 29: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Incremental maintenance

• Index maintenance algorithm to handle insert/delete operations.

Page 30: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

gIndex - Incremental maintenance

• If the frequent graphs and their supports in a graph database do not change, then the discriminative fragments will not change at all.

• A small number of insert/delete operations will not change their distribution too much.

• This property becomes one key advantage of using frequent fragments in the index.

Page 31: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Experimental result

• Data sets

• The AIDS antiviral screen compound dataset from NCI/NIH, containing 43,905 chemical compounds

• Synthetic data sets was provided by Kuramochi

• All our experiments are performed on a 1.5GHZ, 1GBmemory, Intel PC running RedHat 8.0. Both GraphGrep and gIndex are compiled with gcc/g++.

Page 32: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Experimental result

AIDS Antiviral Screen Dataset

• GraphGrep: • maximum length (edges) of paths is set at 10

• gIndex: • maximum size (edges) of structures is set at 10• the maximum fragment size maxL is also 10• the minimum discriminative ratio γmin is 2.0• and the maximum support Θ is 0.1N. • The size-increasing support function ψ(l) is 1 if l < 4

AIDS Antiviral Screen Dataset

Page 33: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Experimental result – Index Size Test• the index size of gIndex is at least 10 times smaller than that of GraphGrep• gIndex: index size is small and stable

• due to the fact that frequent fragments and discriminative frequent fragments do not change much if the data have similar distribution.

• GraphGrep : index size increase significantly • because GraphGrep has to index all possible paths existing in a database (up to length-10 in our exeriments)

AIDS Antiviral Screen Dataset

Page 34: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Experimental result – Performance Test according to query size• Six query sets are tested, each of which has 1, 000 queries: • randomly draw 1, 000 graphs from the antiviral screen dataset and

then extract a connected size-m subgraph from each graph randomly. • These 1, 000 subgraphs are taken as a query set, denoted by Qm. • We generate Q4, Q8, Q12, Q16, Q20, and Q24. • Each query set is then divided into two groups: • low support group if its support is less than 50 • high support group if its support is between 50 and 500.

AIDS Antiviral Screen Dataset

Page 35: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

• gIndex outperforms GraphGrep, except the low support queries in Q4 . • GraphGrep works better on Q4 simply because queries in Q4 are more

likely path-structured and the exhausted enumeration of paths in GraphGrep favors these queries.

Experimental result – Performance Test according to query size

AIDS Antiviral Screen Dataset

Page 36: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

• The performance gap between gIndex and GraphGrep shrinks when query support increases

simpler and smaller structures

• When |Dq| is close to 10, 000, |Cq| will approach |Dq|

10, 000 is their upper bound

Experimental result – Performance Test according to answer set size (query support)

• Overall, gIndex outperforms GraphGrep by 3 to 10 times when the answer set size is below 1, 000.

AIDS Antiviral Screen Dataset

Page 37: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Experimental result – Incremental Maintenance

AIDS Antiviral Screen Dataset

We have two databases D and

Page 38: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Experimental result Synthetic Dataset

• The synthetic graph dataset : • a set of S seed fragments are generated randomly, whose size is

determined by a Poisson distribution with mean I. • The size of each graph is a Poisson random variable with mean T. • Seed fragments are then randomly selected and inserted into a graph one

by one until the graph reaches its size.

• When the number of distinct labels (L) is large, the synthetic dataset is much different from the AIDS antiviral screen dataset

Synthetic Dataset

Page 39: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Experimental result – Answer Set Size

Synthetic Dataset

• gIndex performs much better than GraphGrep. • When |Dq| approaches 300, GraphGrep performs well.

Page 40: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Conclusions

• The index size of gIndex is more than 10 times smaller than that of GraphGrep• gIndex outperforms GraphGrep by 3 to 10 times in various query

loads• The index returned by the incremental maintenance algorithm is

effective: it performs as well as the index computed from scratch provided the data distribution does not change much.

Page 41: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Discussions

•Weak points of this paper• Bad performance for simple fragments • Cannot be used for directed graph

• Possible improvement• Each data should add an entry of ‘direction’

• Possible extension & applications • 3D structure• Time-related factor

41

Page 42: Graph Indexing: A Frequent Structure based Approach Authors:Xifeng Yan†, Philip S‡. Yu, Jiawei Han†

Q & AThanks for your listening