A(k)-index: Exploiting Local Similarity to Index Paths in Graph Data
Raghav Kaushik (UW), Pradeep Shenoy (UWash), Philip Bohannon (Bell Labs), Ehud Gudes (BGU)
Outline
Problem statement
Prior work and limitations
Background
A(k)-index
Query Evaluation
Preliminary experiments
Update
Conclusions
Data Model
Rooted, node-labeled graph with unique root; root has unique label
Nodes - objects
Arcs - object-subobject relationship
In XML context
  Index tag structure
  No distinction between elements and attributes
  No distinction between tree and idref arcs
  Order ignored
Problem Statement
Practical indexing schemes for large graph data, like XML data (100K - 1M nodes)
  Size ~10% of database size
  Efficient construction and update
  Tunable to a workload
Queries of the form R x, where R is a regular path expression
Schemaless data
Flavor of Approach
Different from traditional value indices
Structural summaries for indexing paths
Both data and index are rooted graphs
Example: Dataguide
Index Graph
Structural summary
Associate a set of data nodes with each index node, called its extent
Preserve data paths in index graph
Example index graph
[Figure: a data graph with nodes 0 to 6 beside its index graph with nodes 0, 1, 2, {3,4}, {5,6}]
Index Graph (cont’d)
Can be constructed from any partition
Node for every equivalence class C
Edge from C to C’ if there exists an edge v -> v’ with v in C and v’ in C’
Preserves data paths, no false drops
Our structures are all index graphs
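The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_index_graph`, the edge list, and the class names are all hypothetical, chosen so that the result resembles the example index graph.

```python
from collections import defaultdict

def build_index_graph(edges, partition):
    """Build an index graph from any partition of the data nodes.

    edges:     iterable of (v, w) data-graph edges
    partition: dict mapping each data node to its equivalence class id
    Returns (extents, index_edges).
    """
    # One index node per equivalence class; its extent is the class itself.
    extents = defaultdict(set)
    for node, cls in partition.items():
        extents[cls].add(node)
    # Edge C -> C' whenever some data edge v -> w has v in C and w in C'.
    index_edges = {(partition[v], partition[w]) for v, w in edges}
    return dict(extents), index_edges

# Hypothetical data graph and partition (assumed for illustration only).
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
partition = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
extents, index_edges = build_index_graph(edges, partition)
```

Every data path maps to an index path by construction, which is why no answers are lost; the converse does not hold in general.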
Prior Schemes
Dataguide [Goldman, Widom 1997]
  Deterministic automaton corresponding to data graph
Each set of data nodes that can be distinguished by a path query is summarized by a single node in the index
Can be exponential in size!
Prior Schemes (cont’d)
1-index [Milo, Suciu 1999]
  NFA rather than DFA (smaller)
  Split graph nodes into equivalence classes based on incoming paths from the root
  Computing best split is PSPACE complete
  Go for refinements (approximations)
    similarity
    bisimilarity
Limitations of Prior Work
Size
  Dataguide sizes subject to exponential blow-up
  1-index size can be big too!
Update
  No known update algorithm for 1-index
Designed to answer queries involving arbitrarily complex paths, but...
  such paths may never show up in queries
Local Similarity
[Figure: example data graph with nodes labeled ROOT, metro, cultural, business, neighborhoods, museum, hotel, and nearby, plus the abbreviated labels nhd., attr., and cult.]
Main Contributions
New family of approximate index structures
Applicable to Approximate Schema Statistics
Query evaluation using approximate indexes
Preliminary performance study
Update algorithms
Approximate Indexes
Motivation:
  Smaller
  More efficient query processing
  Limited update cost - maintain local information
Approximate dataguide [Goldman, et al.]
  path merging, object matching, etc.
  no formal basis (but different goal)
  no study of effect on query processing
Graph Bisimulation
A bisimulation is a symmetric relation R between nodes
If A1 R A2 then A1 and A2 have the same labels and ...
[Figure: if B1 is a parent of A1 and A1 R A2, then A2 has a parent B2 with B1 R B2]
and vice versa!
Graph Bisimulation (cont’d)
Bisimilarity
Two nodes a and b are bisimilar if they are related in some bisimulation
1-index is index graph constructed from bisimulation partition
Simulation partition: similar
Bisimulation on example
[Figure: the example data graph (ROOT, metro, cultural, business, neighborhoods, museum, hotel, nearby) with its bisimulation partition]
k-bisimulation
Nodes A1 and A2 are 0-bisimilar iff they have the same label
A1 and A2 are k-bisimilar iff they are (k-1)-bisimilar and
for every edge (B1, A1) there exists an edge (B2, A2) such that B1 and B2 are (k-1)-bisimilar, and vice versa
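The definition translates directly into iterative partition refinement: start from labels, then in each round group nodes by their previous class together with the set of their parents' previous classes. The sketch below assumes this reading; the function name `k_bisimulation` and the small graph (labels and edges) are hypothetical, not taken from the slides.

```python
def k_bisimulation(labels, edges, k):
    """Return a dict node -> class id for the k-bisimulation partition.

    labels: dict node -> label
    edges:  iterable of (parent, child) pairs
    """
    parents = {v: set() for v in labels}
    for parent, child in edges:
        parents[child].add(parent)

    def renumber(signature):
        # Map each distinct signature to a small integer class id.
        ids = {}
        return {v: ids.setdefault(s, len(ids)) for v, s in signature.items()}

    # Round 0: nodes are 0-bisimilar iff they carry the same label.
    block = renumber(labels)
    for _ in range(k):
        # Same class in the previous round, and the same set of parent classes.
        signature = {
            v: (block[v], frozenset(block[p] for p in parents[v]))
            for v in labels
        }
        block = renumber(signature)
    return block

# Hypothetical data graph (assumed labels and edges, for illustration only).
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
p0 = k_bisimulation(labels, edges, 0)
p1 = k_bisimulation(labels, edges, 1)
p2 = k_bisimulation(labels, edges, 2)
```

On this graph, nodes 3 and 4 share a class at k = 0 but separate at k = 1 because their parents carry different labels; nodes 5 and 6 stay together until k = 2.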
Example for k-bisimulation
[Figure: a data graph with nodes 0 to 6; its 0-bisimulation index graph with nodes 0, 1, 2, {3,4}, {5,6}; its 1-bisimulation index graph with nodes 0, 1, 2, 3, 4, {5,6}]
A(2) for example
[Figure: the A(2)-index of the example data graph (ROOT, metro, cultural, business, neighborhoods, museum, hotel, nearby)]
Properties
If a and b are bisimilar, the sets of incoming paths into them are the same
If a and b are k-similar or k-bisimilar, the sets of incoming paths of length <= k are the same
If k-bisim = (k+1)-bisim then k-bisim = bisim
Size: certainly no larger than bisimulation
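The third property gives a stopping rule: keep refining until a round splits nothing, and the partition reached is the full bisimulation. A minimal sketch under that reading follows; `refine`, `bisimulation`, and the sample graph are hypothetical names and data.

```python
def refine(block, parents):
    """One refinement round: split classes by the set of parent classes."""
    signature = {
        v: (block[v], frozenset(block[p] for p in parents[v])) for v in block
    }
    ids = {}
    return {v: ids.setdefault(s, len(ids)) for v, s in signature.items()}

def bisimulation(labels, edges):
    """Refine from the label partition until stable.

    Returns (partition, k), where k is the first round at which
    k-bisimulation and (k+1)-bisimulation coincide.
    """
    parents = {v: set() for v in labels}
    for parent, child in edges:
        parents[child].add(parent)
    ids = {}
    block = {v: ids.setdefault(lab, len(ids)) for v, lab in labels.items()}
    k = 0
    while True:
        nxt = refine(block, parents)
        # Refinement only ever splits classes, so an unchanged class
        # count means the partition itself is unchanged.
        if len(set(nxt.values())) == len(set(block.values())):
            return block, k
        block, k = nxt, k + 1

# Hypothetical data graph (assumed for illustration only).
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
partition, stable_k = bisimulation(labels, edges)
```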
Query Evaluation
Only queries studied are regular path queries of the form R x
Query Evaluation Approach:
  Create automaton for regexp query
  Run automaton on the index graph
  Result is union of extents belonging to index nodes accepted by automaton
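The three steps above can be sketched as a breadth-first traversal of (index node, automaton state) pairs. Everything named here is an assumption for illustration: the NFA encoding as a dict from (state, label) to a set of states, the convention that the automaton reads the root's label first, the `_` wildcard handling, and the sample A(0)-style index graph.

```python
from collections import deque

def step(nfa, state, label):
    """States reachable from `state` on `label`; '_' matches any label."""
    return nfa.get((state, label), set()) | nfa.get((state, "_"), set())

def evaluate(index_labels, index_edges, extents, root, nfa, start, accepting):
    """Run the NFA over the index graph; return the union of accepted extents."""
    children = {}
    for a, b in index_edges:
        children.setdefault(a, set()).add(b)
    queue = deque((root, s) for s in step(nfa, start, index_labels[root]))
    seen, result = set(), set()
    while queue:
        node, state = queue.popleft()
        if (node, state) in seen:
            continue
        seen.add((node, state))
        if state in accepting:
            result |= extents[node]
        for child in children.get(node, ()):
            for nxt in step(nfa, state, index_labels[child]):
                queue.append((child, nxt))
    return result

# Hypothetical index graph: one index node per label (an A(0)-style index).
index_labels = {"r": "root", "a": "a", "b": "b", "c": "c", "d": "d"}
index_edges = {("r", "a"), ("r", "b"), ("a", "c"), ("b", "c"), ("c", "d")}
extents = {"r": {0}, "a": {1}, "b": {2}, "c": {3, 4}, "d": {5, 6}}
# Hypothetical NFA for the path query root.a.c.d
nfa = {(0, "root"): {1}, (1, "a"): {2}, (2, "c"): {3}, (3, "d"): {4}}
answer = evaluate(index_labels, index_edges, extents, "r", nfa, 0, {4})
```

If the underlying data graph reaches node 6 only through the b branch, then node 6 in `answer` is a false positive of the kind an approximate index can return.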
Example Query Evaluation
[Figure: an automaton graph beside an index graph with nodes 0, 1, 2, {3,4}, {5,6}]
Approximate Indexes
Caveat: False positives possible
Approach: verify each node on data graph by running reverse automaton
  Prohibitive cost?
  Then why use approx. indices?
In fact, frequently more efficient than data graph or precise index
Improving Validation
First cut: Keep track of accepting-path-length
  for accepted nodes with path length <= k, verification not required
Second step: Share traversals among verification calls
  mark node-state pairs on a successful verification path as accept
  similar marking for failed path
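For reference, the baseline that these steps improve on is a plain reverse-automaton check on the data graph. The sketch below shows only that baseline, without the path-length shortcut or shared markings; the NFA encoding, the helper names, and the sample graph are assumptions for illustration.

```python
from collections import deque

def reverse_nfa(nfa):
    """Invert transitions: (target state, label) -> set of source states."""
    rev = {}
    for (state, label), targets in nfa.items():
        for t in targets:
            rev.setdefault((t, label), set()).add(state)
    return rev

def validate(node, labels, edges, root, nfa, start, accepting):
    """True if some root-to-node path in the data graph is accepted."""
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
    rev = reverse_nfa(nfa)
    # A pair (n, s) means: the NFA is in state s after reading n's label.
    queue = deque((node, s) for s in accepting)
    seen = set(queue)
    while queue:
        cur, state = queue.popleft()
        label = labels[cur]
        before = rev.get((state, label), set()) | rev.get((state, "_"), set())
        for prev in before:
            if cur == root and prev == start:
                return True
            for p in parents.get(cur, ()):
                if (p, prev) not in seen:
                    seen.add((p, prev))
                    queue.append((p, prev))
    return False

# Hypothetical data graph and query root.a.c.d (assumed for illustration).
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
nfa = {(0, "root"): {1}, (1, "a"): {2}, (2, "c"): {3}, (3, "d"): {4}}
ok_5 = validate(5, labels, edges, 0, nfa, 0, {4})
ok_6 = validate(6, labels, edges, 0, nfa, 0, {4})
```

Sharing traversals, as the second step describes, would amount to keeping the visited node-state pairs and their outcomes across calls instead of rebuilding `seen` each time.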
Improving Validation (cont’d)
Third Step: Avoid needless verification
Example: For _*.R queries, no need to verify all the way up to the root
Generalize the above!
Preliminary Experiments
Data used: Internet Movie Database (http://www.imdb.com)
  250,000 movies & TV shows
  460,000 actors, etc.
  XML version = ~1GB
We used subsets of this database ranging from 200 - 2000 movies
Whole database --> future work!
Preliminary Experiments (cont’d)
Second source: Open Directory Project (http://www.dmoz.org)
  Entire source available in RDF format
Subsets: entire subtree under a topic (say, shopping)
Storage Model
Results independent of any particular storage model
In-memory rooted graph
Performance metrics are abstract
Cost = total number of nodes visited (graph + index)
Bisimulation Sizes
IMDB: 190,000 nodes; ODP: 143,000 nodes
[Figure: index size as a percent of data graph size (0 to 40) versus the k parameter of the index (0 to 16), for the A(k)-index on IMDB data and on ODP data]
Query Evaluation Plans
[Figure: node visits (log scale, 0 to 8) per workload (IMDB Short, IMDB Long, ODP Short) for four plans: 1-index (fwd), 1-index (back), G (fwd), G (back)]
1. Forward eval
2. Backward eval (assume a label index)
Short Queries - IMDB
[Figure: validation cost and index cost as a fraction of 1-index cost (0 to 2) versus the k parameter of the index (0 to 15)]
Long Queries - IMDB
[Figure: validation cost and index cost as a fraction of 1-index cost (0 to 2) versus the k parameter of the index (0 to 15)]
Queries beginning with _*
[Figure: validation cost and index cost as a fraction of 1-index cost (0 to 2) versus the k parameter of the index (0 to 15)]
Queries containing _*
[Figure: validation cost and index cost as a fraction of 1-index cost (0 to 2.5) versus the k parameter of the index (0 to 15)]
Approximate Answers
[Figure: false positives, results to validate, and guaranteed results as a fraction of the correct result size (0 to 3) versus the k parameter of the index (0 to 15)]
A(k)-index Update
Edge added from u to v
A(0)-index -> no change except possible addition of edge
A(1)-index -> index node containing v may change
  determined by set of labels in v’s parents
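The A(1) case can be sketched by keying each node on its label plus the set of its parents' labels, so that adding an edge from u to v only requires recomputing v's key. The class below is a hypothetical illustration of that idea, not the authors' update algorithm; it tracks extents only and leaves index edges out.

```python
class A1Index:
    """Toy A(1)-index: class of v = (label of v, set of labels of v's parents)."""

    def __init__(self, labels, edges):
        self.labels = labels
        self.parent_labels = {v: set() for v in labels}
        for u, v in edges:
            self.parent_labels[v].add(labels[u])
        self.extents = {}
        for v in labels:
            self.extents.setdefault(self.key(v), set()).add(v)

    def key(self, v):
        return (self.labels[v], frozenset(self.parent_labels[v]))

    def add_edge(self, u, v):
        """Add data edge u -> v; return True if v moved to another index node."""
        old = self.key(v)
        self.parent_labels[v].add(self.labels[u])
        new = self.key(v)
        if new == old:
            return False
        # Only v can move: its children depend on v's label, which is unchanged.
        self.extents[old].discard(v)
        if not self.extents[old]:
            del self.extents[old]
        self.extents.setdefault(new, set()).add(v)
        return True

# Hypothetical data graph (assumed for illustration only).
labels = {0: "root", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
index = A1Index(labels, edges)
moved_first = index.add_edge(1, 4)   # node 4 gains a parent labeled "a"
moved_again = index.add_edge(1, 4)   # same edge again: nothing changes
merged = index.add_edge(2, 3)        # node 3 now has the same parent labels as 4
```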
A(k)-index Update (cont’d)
A(k)-index
  only nodes to be considered are those at distance < k from v
Maintain tree of splits
Work iteratively:
  find new A(1) position of v
  find new A(2) positions of v and its children
  …
Updating the 1-index
One way is generalization of A(k) update
R - any binary relation on the nodes that is reflexive and transitively closed.
A refinement of R is any subset that is reflexive and transitively closed
Refinement
B - bisimulation relation
B’ - any refinement of B
B(G) - index graph built using B
B’(G) - index graph built using B’
Theorem
Theorem: B(B’(G)) = B(G)
Intuition:
  Similar nodes behave similarly
  So, fuse them together!
Lazy Update
Basic Idea: G -> G’, and meanwhile B(G) -> B(G’)
  Instead, “relax” the graph B(G) to B’(G’)
How?
  A “stable” partitioning of G is either B(G) or its refinement.
  Propagate graph update on B(G) by splitting nodes until stable.
Lazy Update Performance
[Figure: percent increase in index size (0 to 6) versus number of edges added (0 to 500), for the propagated index and the accurate index]
Conclusions
Novel approximate index structures and validation techniques
Experiments demonstrate k-bisimulation index is
  Efficiently constructed
  Effective for query answering
Future Work
Handle more query types
  Branching queries
  Queries with selection
Annotating A(k) with statistics for query optimization
Storage
Application of update algorithms to triggers