A(k)-index : Exploiting Local Similarity to Index Paths in Graph Data Raghav Kaushik (UW) Pradeep Shenoy (UWash) Philip Bohannon (Bell Labs) Ehud Gudes

A(k)-index: Exploiting Local Similarity to Index Paths in Graph Data

Raghav Kaushik (UW), Pradeep Shenoy (UWash), Philip Bohannon (Bell Labs), Ehud Gudes (BGU)

Outline

Problem statement
Prior work and limitations
Background
A(k)-index
Query Evaluation
Preliminary experiments
Update
Conclusions

Data Model

Rooted, node-labeled graph with unique root; root has unique label

Nodes - objects
Arcs - object-subobject relationship
In XML context:
    Index tag structure
    No distinction between elements and attributes
    No distinction between tree and idref arcs
    Order ignored

Problem Statement

Practical indexing schemes for large graph data (like XML data) (100K - 1M nodes)
    Size ~10% of database size
    Efficient construction and update
    Tunable to a workload

Queries of the form R x, where R is a regular path expression

Schemaless data

Flavor of Approach

Different from traditional value indices

Structural summaries for indexing paths

Both data and index are rooted graphs

Example: Dataguide

Index Graph

Structural summary

Associate a set of data nodes with each index node, called its extent

Preserve data paths in index graph

Example index graph

[Figure: a data graph with nodes 0 through 6 shown next to its index graph, whose nodes have extents 0, 1, 2, {3,4}, and {5,6}.]

Index Graph (cont’d)

Can be constructed from any partition

Node for every equivalence class C

Edge between C and C’ if there exists an edge v -> v’ with v in C and v’ in C’

Preserves data paths, no false drops

Our structures are all index graphs
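
The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_index_graph`, the 7-node graph, its edges, and the class ids are all assumptions made for the example.

```python
from collections import defaultdict

def build_index_graph(edges, block_of):
    """Build an index graph from a partition of the data nodes.

    edges:    iterable of (v, w) data-graph edges
    block_of: dict mapping each data node to its equivalence-class id
    Returns (extents, index_edges).
    """
    # One index node per equivalence class; its extent is the class itself.
    extents = defaultdict(set)
    for node, block in block_of.items():
        extents[block].add(node)
    # Index edge C -> C' whenever some data edge v -> v' has v in C, v' in C'.
    index_edges = {(block_of[v], block_of[w]) for v, w in edges}
    return dict(extents), index_edges

# Hypothetical data graph (edges assumed for illustration only).
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]
# Hypothetical partition: nodes 3,4 share a class, as do nodes 5,6.
block_of = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}

extents, index_edges = build_index_graph(edges, block_of)
```

Because every data edge is mapped to an index edge, every data path has a corresponding index path, which is the "preserves data paths" property stated on the slide.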

Prior Schemes

Dataguide [Goldman, Widom 1997]
    Deterministic automaton corresponding to data graph

Each set of data nodes that can be distinguished by a path query is summarized by a single node in the index

Can be exponential in size!

Prior Schemes (cont’d)

1-index [Milo, Suciu 1999]
    NFA rather than DFA (smaller)
    Split graph nodes into equivalence classes based on incoming paths from the root
    Computing best split is PSPACE-complete
    Go for refinements (approximations)
        similarity
        bisimilarity

Limitations of Prior Work

Size
    Dataguide sizes subject to exponential blow-up
    1-index size can be big too!

Update
    No known update algorithm for 1-index

Designed to answer queries involving arbitrarily complex paths, but... such paths may never show up in queries

[Figure: example data graph rooted at ROOT, with nodes labeled metro, cultural, business, neighborhoods, museum, museum, hotel, and nearby; abbreviations nhd., attr., and cult. annotate the graph.]

Local Similarity

Main Contributions

New family of approximate index structures

Applicable to Approximate Schema Statistics

Query evaluation using approximate indexes

Preliminary performance study

Update algorithms

Approximate Indexes

Motivation:
    Smaller
    More efficient query processing
    Limited update cost - maintain local information

Approximate dataguide [Goldman, et al.]
    path merging, object matching, etc.
    no formal basis (but different goal)
    no study of effect on query processing

Graph Bisimulation

A bisimulation is a symmetric relation R between nodes

If A1 R A2 then A1 and A2 have the same labels and ...

[Figure: if A1 R A2, then for every edge B1 -> A1 there is an edge B2 -> A2 with B1 R B2, and vice versa.]

Graph Bisimulation (cont’d)

Bisimilarity

Two nodes a and b are bisimilar if they are related in some bisimulation

1-index is index graph constructed from bisimulation partition

Simulation partition: similar

[Figure: the example data graph (ROOT, metro, museum, museum, hotel, nearby; abbreviations nhd., attr., cult.) partitioned by bisimulation.]

Bisimulation on example

k-bisimulation

Nodes A1 and A2 are 0-bisimilar iff they have the same label

A1 and A2 are k-bisimilar iff they are (k-1)-bisimilar and, for every edge (B1, A1), there exists an edge (B2, A2) such that B1 and B2 are (k-1)-bisimilar, and vice versa
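
A minimal sketch of how the definition above can be turned into an iterative refinement, assuming an in-memory graph. The names `k_bisimulation` and `partition`, and the 7-node graph with its labels and edges, are hypothetical choices for illustration; the slides do not prescribe this algorithm.

```python
def k_bisimulation(labels, edges, k):
    """Return a dict node -> block id for the k-bisimulation partition.

    labels: dict node -> label
    edges:  iterable of (parent, child) pairs
    """
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)
    # 0-bisimilar: same label.
    block = dict(labels)
    for _ in range(k):
        # Two nodes stay together iff they were together in the previous
        # round and their parents cover the same set of previous blocks.
        sig = {v: (block[v], frozenset(block[p] for p in parents[v]))
               for v in labels}
        ids = {}
        block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}
    return block

def partition(block):
    """Group nodes by block id and return a sorted list of sorted lists."""
    groups = {}
    for v, b in block.items():
        groups.setdefault(b, []).append(v)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical data graph: labels and edges are assumed for illustration.
labels = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]

p0 = partition(k_bisimulation(labels, edges, 0))
p1 = partition(k_bisimulation(labels, edges, 1))
p2 = partition(k_bisimulation(labels, edges, 2))
```

On this assumed graph, nodes 3 and 4 are separated at k = 1 because their parents carry different labels, while nodes 5 and 6 are only separated at k = 2.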

Example for k-bisimulation

[Figure: data graph with nodes 0; 1, 2; 3, 4; 5, 6. 0-bisimulation groups the nodes as 0; 1; 2; {3,4}; {5,6}. 1-bisimulation groups them as 0; 1; 2; 3; 4; {5,6}.]

[Figure: the example data graph (ROOT, metro, museum, museum, hotel, nearby; abbreviations nhd., attr., cult.) partitioned by 2-bisimulation.]

A(2) for example

Properties

If a and b are bisimilar, the sets of incoming paths into them are the same

If a and b are k-similar or k-bisimilar, the sets of incoming paths of length <= k are the same

If k-bisim = (k+1)-bisim, then k-bisim = bisim

Size: never larger than the bisimulation partition
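
The fixpoint property (if k-bisim = (k+1)-bisim, then k-bisim = bisim) suggests a simple way to reach full bisimilarity: refine until nothing splits. The sketch below is an illustration under assumptions; `refine`, `bisimulation`, and the example graph are hypothetical names and data, not taken from the talk.

```python
def refine(block, parents):
    """One refinement round: split blocks by the set of parent blocks."""
    sig = {v: (block[v], frozenset(block[p] for p in parents[v]))
           for v in block}
    ids = {}
    return {v: ids.setdefault(sig[v], len(ids)) for v in block}

def bisimulation(labels, edges):
    """Refine the label partition until stable; return (block, k)."""
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)
    block, k = dict(labels), 0
    while True:
        nxt = refine(block, parents)
        # Refinement only ever splits blocks, so an unchanged block
        # count means the partition itself is unchanged.
        if len(set(nxt.values())) == len(set(block.values())):
            return block, k
        block, k = nxt, k + 1

# Hypothetical data graph (assumed for illustration).
labels = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
edges = [(0, 1), (0, 2), (1, 3), (2, 4), (3, 5), (4, 6)]

block, stable_k = bisimulation(labels, edges)
num_blocks = len(set(block.values()))
```

For this assumed graph the partition stabilizes at k = 2, so any A(k)-index with k >= 2 coincides with the bisimulation-based index here.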

Query Evaluation

Only queries studied are regular path queries of the form R x

Query Evaluation Approach:
    Create automaton for regexp query
    Run automaton on the index graph
    Result is union of extents belonging to index nodes accepted by automaton
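
The three steps above can be sketched as a search over (index node, automaton state) pairs. This is a hedged illustration: the automaton is written out by hand as a transition table rather than compiled from a regular expression, and `evaluate`, the index graph, and the query `r.a.c.d` are assumptions for the example.

```python
from collections import deque

def evaluate(delta, start, accepting, label, succ, extent, root):
    """Run an NFA over the index graph from the root.

    delta: dict (state, label) -> set of next states
    label: dict index node -> label
    succ:  dict index node -> list of child index nodes
    extent: dict index node -> set of data nodes
    The automaton consumes the label of every node on the path,
    starting with the root's label.
    """
    frontier = deque((root, q) for q in delta.get((start, label[root]), ()))
    seen = set(frontier)
    result = set()
    while frontier:
        n, q = frontier.popleft()
        if q in accepting:
            result |= extent[n]          # union of accepted extents
        for m in succ.get(n, ()):
            for q2 in delta.get((q, label[m]), ()):
                if (m, q2) not in seen:
                    seen.add((m, q2))
                    frontier.append((m, q2))
    return result

# Hypothetical A(0)-style index graph (one index node per label).
label = {"r": "r", "a": "a", "b": "b", "c": "c", "d": "d"}
succ = {"r": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["d"]}
extent = {"r": {0}, "a": {1}, "b": {2}, "c": {3, 4}, "d": {5, 6}}

# Hand-written automaton for the path query r.a.c.d
delta = {(0, "r"): {1}, (1, "a"): {2}, (2, "c"): {3}, (3, "d"): {4}}
result = evaluate(delta, 0, {4}, label, succ, extent, "r")
```

On this coarse index the whole extent {5, 6} is returned, even if only one of the two data nodes is actually reached by an r.a.c.d path in the underlying data graph; an index this coarse can over-approximate the answer.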

[Figure: automaton graph shown next to an index graph with nodes 0, 1, 2, {3,4}, and {5,6}.]

Example Query Evaluation

Approximate Indexes

Caveat: False positives possible

Approach: verify each node on data graph by running reverse automaton
    Prohibitive cost?
    Then why use approx. indices?
    In fact, frequently more efficient than data graph or precise index
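
A minimal sketch of verifying one candidate node by running the automaton in reverse, from the candidate back to the root along parent edges. The function name `validate`, the data graph, and the query `r.a.c.d` are hypothetical; the talk's optimizations (shared traversals, early termination) are not included.

```python
def validate(v, delta, start, accepting, label, parents, root):
    """Return True iff some root-to-v path in the data graph is accepted.

    delta: dict (state, label) -> set of next states (forward automaton)
    A pair (n, q) means: the automaton is in state q after reading label[n].
    """
    # Reverse transitions: (target state, label) -> set of source states.
    rev = {}
    for (q, a), targets in delta.items():
        for t in targets:
            rev.setdefault((t, a), set()).add(q)
    stack = [(v, q) for q in accepting]
    seen = set(stack)
    while stack:
        n, q = stack.pop()
        preds = rev.get((q, label[n]), ())
        if n == root and start in preds:
            return True                  # reached the root in the start state
        for p in parents.get(n, ()):
            for q0 in preds:
                if (p, q0) not in seen:
                    seen.add((p, q0))
                    stack.append((p, q0))
    return False

# Hypothetical data graph (labels and parent edges assumed).
label = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
parents = {1: [0], 2: [0], 3: [1], 4: [2], 5: [3], 6: [4]}

# Hand-written automaton for the path query r.a.c.d
delta = {(0, "r"): {1}, (1, "a"): {2}, (2, "c"): {3}, (3, "d"): {4}}
ok5 = validate(5, delta, 0, {4}, label, parents, 0)
ok6 = validate(6, delta, 0, {4}, label, parents, 0)
```

Here node 5 is confirmed, while node 6 is rejected because its only incoming path reads r.b.c.d.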

Improving Validation

First cut: Keep track of accepting-path-length
    for accepted nodes with path length <= k, verification not required

Second step: Share traversals among verification calls
    mark node-state pairs on a successful verification path as accept
    similar marking for failed path
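
The first cut can be sketched by carrying the path length along during index evaluation and splitting the answer into results that need no verification and results that do. This is an assumption-laden illustration: `evaluate_with_lengths`, the A(1)-style index graph, and both queries are hypothetical, and path length is counted in edges from the root.

```python
from collections import deque

def evaluate_with_lengths(delta, start, accepting, label, succ, extent, root, k):
    """Return (guaranteed, to_validate) for an A(k)-style index graph."""
    guaranteed, to_validate = set(), set()
    frontier = deque((root, q, 0)
                     for q in delta.get((start, label[root]), ()))
    seen = {(n, q) for n, q, _ in frontier}
    while frontier:
        n, q, depth = frontier.popleft()
        if q in accepting:
            # Accepting paths of length <= k need no verification.
            (guaranteed if depth <= k else to_validate).update(extent[n])
        for m in succ.get(n, ()):
            for q2 in delta.get((q, label[m]), ()):
                if (m, q2) not in seen:
                    seen.add((m, q2))
                    frontier.append((m, q2, depth + 1))
    return guaranteed, to_validate - guaranteed

# Hypothetical A(1)-style index graph: the two "c" nodes are split.
label = {"r": "r", "a": "a", "b": "b", "c1": "c", "c2": "c", "d": "d"}
succ = {"r": ["a", "b"], "a": ["c1"], "b": ["c2"], "c1": ["d"], "c2": ["d"]}
extent = {"r": {0}, "a": {1}, "b": {2}, "c1": {3}, "c2": {4}, "d": {5, 6}}

short_q = {(0, "r"): {1}, (1, "a"): {2}}                       # r.a
long_q = {(0, "r"): {1}, (1, "a"): {2}, (2, "c"): {3}, (3, "d"): {4}}  # r.a.c.d

short_res = evaluate_with_lengths(short_q, 0, {2}, label, succ, extent, "r", 1)
long_res = evaluate_with_lengths(long_q, 0, {4}, label, succ, extent, "r", 1)
```

With k = 1, the short query's answer is accepted outright, while the long query's candidates are all deferred to verification on the data graph.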

Improving Validation (cont’d)

Third Step: Avoid needless verification

Example: For _*.R queries, no need to verify all the way up to the root

Generalize the above!

Preliminary Experiments

Data used: Internet Movie Database (http://www.imdb.com)
    250,000 movies & TV shows
    460,000 actors, etc.
    XML version = ~1GB

We used subsets of this database ranging from 200 - 2000 movies

Whole database --> future work!

Preliminary Experiments

Second source: Open Directory Project (http://www.dmoz.org)
    Entire source available in RDF format

Subsets: (entire subtree under a topic, say shopping)

Storage Model

Results independent of any particular storage model

In-memory rooted graph

Performance metrics are abstract

Cost = total number of nodes visited (graph + index)

Bisimulation Sizes

[Figure: index size as percent of data graph size (y-axis, 0 to 40) versus k parameter of index (x-axis, 0 to 16). Series: A(k)-index on IMDB data (190,000 nodes) and A(k)-index on ODP data (143,000 nodes).]

Query Evaluation Plans

[Figure: node visits (log scale, 0 to 8) per workload (IMDB Short, IMDB Long, ODP Short). Series: 1-index (fwd), 1-index (back), G (fwd), G (back).]

1. Forward eval

2. Backward eval (assume a label index)

[Figure: Short Queries - IMDB. Cost as a fraction of 1-index cost (y-axis, 0 to 2), split into validation cost and index cost; x-axis ticks 0 to 15.]

[Figure: Long Queries - IMDB. Cost as a fraction of 1-index cost (y-axis, 0 to 2), split into validation cost and index cost; x-axis ticks 0 to 15.]

[Figure: Queries beginning with _*. Cost as a fraction of 1-index cost (y-axis, 0 to 2), split into validation cost and index cost; x-axis ticks 0 to 15.]

[Figure: Queries containing _*. Cost as a fraction of 1-index cost (y-axis, 0 to 2.5), split into validation cost and index cost; x-axis ticks 0 to 15.]

Approximate Answers

[Figure: fraction of correct result size (y-axis, 0 to 3) versus k parameter of index (x-axis, 0 to 15). Series: False Positives, ToValidate, Guaranteed Results.]

A(k)-index Update

Edge added from u to v

A(0)-index -> no change except possible addition of edge

A(1)-index -> index node containing v may change
    determined by set of labels in v’s parents
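
A small sketch of the A(1) case: after adding an edge u -> v, only v's position needs to be recomputed, and it depends on v's label and the set of labels among v's parents. The helper names `a1_signature` and `add_edge_a1` and the example graph are hypothetical, and the sketch only tracks which class v belongs to; updating the index graph's edges is left out.

```python
def a1_signature(v, label, parents):
    """A(1) class of v: its own label plus the set of its parents' labels."""
    return (label[v], frozenset(label[p] for p in parents[v]))

def add_edge_a1(u, v, label, parents, block_of):
    """Record the new edge u -> v and recompute only v's A(1) class."""
    parents[v].add(u)
    block_of[v] = a1_signature(v, label, parents)
    return block_of[v]

# Hypothetical data graph (labels and parent sets assumed).
label = {0: "r", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d", 6: "d"}
parents = {0: set(), 1: {0}, 2: {0}, 3: {1}, 4: {2}, 5: {3}, 6: {4}}
block_of = {v: a1_signature(v, label, parents) for v in label}

separate_before = block_of[3] != block_of[4]
add_edge_a1(1, 4, label, parents, block_of)   # node 4 now has parents a, b
separate_mid = block_of[3] != block_of[4]
add_edge_a1(2, 3, label, parents, block_of)   # node 3 now has parents a, b
merged_after = block_of[3] == block_of[4]
```

In this assumed example the two "c" nodes start in different classes and end up in the same class once both have parents labeled a and b.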

A(k)-index Update (contd)

A(k)-index
    only nodes to be considered are those at distance < k from v

Maintain tree of splits

Work iteratively:
    find new A(1) position of v
    find new A(2) positions of v and its children
    …

Updating the 1-index

One way is generalization of A(k) update

R - any binary relation on the nodes that is reflexive and transitively closed

A refinement of R is any subset that is reflexive and transitively closed

Refinement

B - bisimulation relation
B’ - any refinement of B
B(G) - index graph built using B
B’(G) - index graph built using B’

Theorem

Theorem: B(B’(G)) = B(G)

Intuition:
    Similar nodes behave similarly
    So, fuse them together!

Lazy Update

Basic Idea: G -> G’, and meanwhile B(G) -> B(G’)
    Instead, “relax” the graph B(G) to B’(G’)

How?
    A “stable” partitioning of G is either B(G) or its refinement.
    Propagate graph update on B(G) by splitting nodes until stable.

Lazy Update Performance

[Figure: percent increase in index size (y-axis, 0 to 6) versus number of edges added (x-axis, 0 to 500). Series: Propagated Index, Accurate Index.]

Conclusions

Novel approximate index structures and validation techniques

Experiments demonstrate k-bisimulation index is
    Efficiently constructed
    Effective for query answering

Future Work

Handle more query types
    Branching queries
    Queries with selection

Annotating A(k) with statistics for query optimization

Storage

Application of update algorithms to triggers
