Approximate and Incremental Processing of Complex Queries against the Web of Data

KIT – University of the State of Baden-Württemberg and

National Large-scale Research Center of the Helmholtz Association

Institute of Applied Informatics and Formal Description Methods (AIFB)

www.kit.edu

Thanh Tran, Günter Ladwig, Andreas Wagner

DEXA 2011

August 31st, 2011

Contents

Introduction

Overview: Approximate & Incremental Processing

Entity Search

Approximate Structure Matching

Structure-based Result Refinement and Computation

Evaluation

Conclusion

DEXA 2011, Toulouse, France


INTRODUCTION



Introduction – Data Model

Resource Description Framework (RDF)


[Figure: example RDF data graph. Nodes: p1–p5, i1, i2, u1, u2, a1, a2, c1, c2. Edge labels: supervises, knows, worksAt, authorOf, conference, partOf, name.]


Introduction – Query Model

Basic Graph Patterns

Conjunctive queries over RDF data: graph pattern matching


[Figure: example query graph. Variables: x, y, z, u, v, w. Constants: KIT, ICDE, AIFB, 29. Edge labels: worksAt, author, conf, supervise, partOf, age, name.]


Contribution

Techniques for matching (basic) query patterns against graph-structured data have limits

We might wish to trade completeness and exactness for responsiveness


Our approach allows an “affordable” computation of an initial set of approximate results, which can be incrementally refined as needed.


Contribution – Pipeline Overview

Pipeline of operations where approximate results are refined incrementally

[Figure: processing pipeline. Stages: Structure Matching, Structure-based Result Refinement, Structure-based Answer Computation. Indexes: Entity & Neighborhood Index, Structure Index, Relation Index. Label: Intermediate, Approximate Results.]


ENTITY SEARCH







Entity Search

Entity index

Stores attribute edges of the data graph

Enables lookup of entities by attribute and value

Entity search

Obtains candidate bindings for all variables in the query that have attribute edges

Does not consider structure (i.e., relations between entities)

Query decomposition and transformation

Decompose the query into entity queries to create a transformed query
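A minimal sketch of the entity index lookup described above. It assumes attribute edges are available as (entity, attribute, value) triples. The names `build_entity_index` and `entity_search` and the toy data are illustrative and not taken from the original system.

```python
from collections import defaultdict

# Hypothetical toy data: attribute edges as (entity, attribute, literal value).
ATTRIBUTE_EDGES = [
    ("p1", "age", "29"),
    ("p3", "age", "29"),
    ("p2", "age", "35"),
    ("i1", "name", "AIFB"),
    ("u1", "name", "KIT"),
    ("c1", "name", "ICDE"),
]


def build_entity_index(attribute_edges):
    """Map each (attribute, value) pair to the set of entities carrying it."""
    index = defaultdict(set)
    for entity, attribute, value in attribute_edges:
        index[(attribute, value)].add(entity)
    return index


def entity_search(index, entity_query):
    """Return entities that satisfy every (attribute, value) constraint.

    Structure (relations between entities) is deliberately ignored here.
    """
    candidate_sets = [index.get(c, set()) for c in entity_query]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


index = build_entity_index(ATTRIBUTE_EDGES)
x_candidates = entity_search(index, [("age", "29")])
z_candidates = entity_search(index, [("name", "AIFB")])
```

The resulting candidate sets are only necessary conditions: an entity found here still has to pass the structural checks of the later steps.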



Query Decomposition & Transformation


Identify entity queries

Breadth-first search starting from a random variable

[Figure: example query graph with variables x, y, z, u, v, w and constants KIT, ICDE, AIFB, 29]


Query Decomposition & Transformation


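[Figure: the example query graph and the transformed query. In the transformed query the attribute edges are collapsed into entity queries: x (age 29), z (name AIFB), u (name KIT), v (name ICDE). The variables y and w and the relation edges worksAt, author, conf, supervise, partOf remain.]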

Collapse entity queries
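The decomposition can be sketched as follows. This is an illustration under assumptions, not the original implementation: variables are written with a leading `?`, an edge is treated as an attribute edge when its object is a constant, the start variable is passed in explicitly rather than chosen at random, and the edge layout of `QUERY` is a guess at the example query.

```python
from collections import defaultdict, deque


def is_variable(term):
    """Assumption for this sketch: variables are strings starting with '?'."""
    return term.startswith("?")


def decompose(query, start):
    """Split a basic graph pattern into entity queries and relation edges.

    query: list of (subject, predicate, object) triple patterns.
    Returns (entity_queries, relation_edges, visit_order).
    """
    entity_queries = defaultdict(list)  # variable -> [(attribute, value)]
    relation_edges = []                 # (subject_var, predicate, object_var)
    neighbors = defaultdict(set)
    for s, p, o in query:
        if is_variable(o):
            relation_edges.append((s, p, o))
            neighbors[s].add(o)
            neighbors[o].add(s)
        else:
            # Attribute edge: collapse it into the entity query of s.
            entity_queries[s].append((p, o))

    # Breadth-first traversal over relation edges, starting at `start`.
    order, seen, queue = [], {start}, deque([start])
    while queue:
        var = queue.popleft()
        order.append(var)
        for nxt in sorted(neighbors[var]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return dict(entity_queries), relation_edges, order


# Hypothetical encoding of the example query (edge layout assumed).
QUERY = [
    ("?x", "age", "29"),
    ("?z", "name", "AIFB"),
    ("?u", "name", "KIT"),
    ("?v", "name", "ICDE"),
    ("?x", "worksAt", "?z"),
    ("?z", "partOf", "?u"),
    ("?x", "author", "?y"),
    ("?y", "conf", "?v"),
    ("?w", "supervise", "?x"),
]
entity_queries, relation_edges, order = decompose(QUERY, "?x")
```

Variables without attribute edges (here `?y` and `?w`) get no entity query, so they receive candidate bindings only in the later, structure-aware steps.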


Entity Search Results

Use the entity index to obtain bindings for all entity queries in the transformed query

Entity queries are necessary conditions, but not sufficient

Final results will be a subset of these candidate bindings


x    z    u    v
p1   i1   u1   c1
p3   i1   u1   c1
p5   i1   u1   c1
p6   i1   u1   c1



APPROXIMATE STRUCTURE MATCHING







Approximate Structure Matching

Only entity parts of the query have been matched

Relation edges have yet to be processed

Instead of performing exact equijoins, we propose to perform a neighborhood join

The neighborhood join allows us to check whether two entities are connected via relation edges (but not via which ones)

Again: necessary, but not sufficient


The k-neighborhood of an entity e is the set of entities in the data graph that can be reached from e via a path of relation edges of length k or less.

A neighborhood join between two sets of entities E1, E2 is an equijoin between all pairs e1 ∈ E1, e2 ∈ E2, where e1 and e2 are considered equivalent if the intersection of their k-neighborhoods is non-empty.
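The two definitions above can be sketched with plain sets. Two points are assumptions of this sketch rather than statements from the slides: relation edges are traversed in both directions, and an entity belongs to its own k-neighborhood (a path of length 0). The function names and toy edges are hypothetical.

```python
from collections import defaultdict


def build_adjacency(relation_edges):
    """Undirected adjacency over relation edges (direction ignored by assumption)."""
    adjacency = defaultdict(set)
    for source, _label, target in relation_edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def k_neighborhood(adjacency, entity, k):
    """Entities reachable from `entity` via at most k relation edges."""
    reached = {entity}
    frontier = {entity}
    for _ in range(k):
        frontier = {n for e in frontier for n in adjacency.get(e, ())} - reached
        if not frontier:
            break
        reached |= frontier
    return reached


def neighborhood_join(adjacency, left, right, k):
    """Pairs (e1, e2) whose k-neighborhoods have a non-empty intersection."""
    hoods = {e: k_neighborhood(adjacency, e, k) for e in set(left) | set(right)}
    return [
        (e1, e2)
        for e1 in sorted(left)
        for e2 in sorted(right)
        if hoods[e1] & hoods[e2]
    ]


# Hypothetical toy relation edges.
RELATION_EDGES = [
    ("p1", "worksAt", "i1"),
    ("i1", "partOf", "u1"),
    ("p3", "worksAt", "i2"),
    ("i2", "partOf", "u2"),
]
adjacency = build_adjacency(RELATION_EDGES)
pairs = neighborhood_join(adjacency, {"p1", "p3"}, {"i1"}, k=1)
```

The join reports that two entities are connected within the neighborhood distance, but not through which relation edges, so it prunes candidates without confirming them.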


Neighborhood Join via Bloom Filters

We store the k-neighborhood of each entity as a Bloom filter

Bloom filter

Space-efficient, probabilistic data structure for set membership tests

False positives are possible (false negatives are not)

We refine the results of the previous step

To perform a neighborhood join between bindings E1, E2

Load Bloom filters for one set of entities, say E1

In a nested-loop manner, check whether entities in E2 are contained in the Bloom filters
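A minimal sketch of the Bloom-filter variant, assuming the k-neighborhoods have already been computed and are passed in as a dictionary. The `BloomFilter` class, its sizing (256 bits, 3 hash functions), and the SHA-256-based hashing are illustrative choices, not the parameters of the original system.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def bloom_neighborhood_join(neighborhoods, left, right):
    """Nested-loop join: keep (e1, e2) if e2 tests positive in e1's filter.

    neighborhoods: entity -> precomputed set of k-neighborhood entities.
    """
    filters = {}
    for e1 in left:
        bloom = BloomFilter()
        for neighbor in neighborhoods[e1]:
            bloom.add(neighbor)
        filters[e1] = bloom
    return [
        (e1, e2)
        for e1 in sorted(left)
        for e2 in sorted(right)
        if e2 in filters[e1]
    ]


# Hypothetical precomputed neighborhoods.
NEIGHBORHOODS = {"p1": {"p1", "i1", "u1"}, "p3": {"p3", "i2", "u2"}}
bloom_pairs = bloom_neighborhood_join(NEIGHBORHOODS, {"p1", "p3"}, {"i1"})
```

Because false negatives cannot occur, every truly connected pair survives this step. A false positive only leaves an extra candidate that a later refinement step can discard.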



Neighborhood Join via Bloom Filters

Load Bloom filters for entities bound to x

Check whether entities bound to w, y, z are in the neighborhood of x

When k=2, Bloom filters for x also cover u and v

[Figure: example query graph with the neighborhoods for k=1 and k=2 marked]


STRUCTURE-BASED RESULT REFINEMENT






From approximate structure matching (ASM) we know that entities in intermediate results are connected

With structure-based result refinement we find out whether they are connected via paths captured by query atoms

The query is matched against a structure index graph

Bisimulation-based summary of the data graph that captures structural information

Nodes in the data graph with the same “structure” are grouped together

Much smaller than the data graph


Necessary, but not sufficient.
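One way to group nodes with the same “structure” is naive partition refinement, sketched below. The slides only state that the summary is bisimulation-based. Treating both incoming and outgoing relation edges as part of a node's structure is an assumption of this sketch, and the function names and toy graph are hypothetical.

```python
from collections import defaultdict


def bisimulation_partition(nodes, edges):
    """Refine a partition until nodes in a block share the same edge signature."""
    block_of = {n: 0 for n in nodes}  # start with all nodes in one block
    while True:
        signatures = {n: set() for n in nodes}
        for s, label, t in edges:
            signatures[s].add(("out", label, block_of[t]))
            signatures[t].add(("in", label, block_of[s]))
        ids = {}
        new_block_of = {}
        for n in nodes:
            # Keeping the old block in the key means blocks can only split.
            key = (block_of[n], frozenset(signatures[n]))
            new_block_of[n] = ids.setdefault(key, len(ids))
        if len(ids) == len(set(block_of.values())):
            return new_block_of  # no block was split: partition is stable
        block_of = new_block_of


def structure_index(nodes, edges):
    """Return (extensions, index_edges): blocks and the edges between them."""
    block_of = bisimulation_partition(nodes, edges)
    extensions = defaultdict(set)
    for n, block in block_of.items():
        extensions[block].add(n)
    index_edges = {(block_of[s], label, block_of[t]) for s, label, t in edges}
    return dict(extensions), index_edges


# Hypothetical toy data graph (relation edges only).
NODES = ["p1", "p3", "i1", "i2", "u1", "u2"]
EDGES = [
    ("p1", "worksAt", "i1"),
    ("p3", "worksAt", "i2"),
    ("i1", "partOf", "u1"),
    ("i2", "partOf", "u2"),
]
extensions, index_edges = structure_index(NODES, EDGES)
```

On this toy graph the six data nodes collapse into three index nodes connected by two index edges, which illustrates why the index graph is much smaller than the data graph.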


Structure Index


[Figure: data graph G and its structure index graph G~, obtained via bisimulation. Extensions shown in G~: E1 = {p2, p4}, E2 = {p1, p3}, E3 = {i1, i2}, E4 = {a1, a2}, E5 = {u1, u2}, E6 = {c1, c2}, and a separate extension containing only p5. Edge labels in G~: supervises, knows, worksAt, authorOf, conference, partOf.]



Whenever a query graph q matches on the data graph G, it also matches on the structure index graph G~. Moreover, the extensions of the index graph matches contain all data graph matches, i.e., the bindings to query variables.


We take advantage of this property:

Match the query against the structure index graph to obtain sets of extensions that contain potential query answers

Bindings computed in the previous ES/ASM steps can only be answers if they are contained in the matched extensions
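A sketch of how this property can be used as a filter, assuming the index graph is given as a dictionary of extensions plus a set of labeled edges between index nodes. The matcher is a simple edge-by-edge enumeration chosen for brevity. The function names, the index node names, and the toy data are hypothetical.

```python
def match_on_index(query_edges, index_edges):
    """Enumerate assignments of query variables to index nodes.

    An assignment is kept only if every query edge maps to an index edge
    with the same label.
    """
    matches = [{}]
    for s_var, label, t_var in query_edges:
        extended = []
        for m in matches:
            for s_blk, e_label, t_blk in index_edges:
                if e_label != label:
                    continue
                if s_var == t_var and s_blk != t_blk:
                    continue
                if m.get(s_var, s_blk) != s_blk or m.get(t_var, t_blk) != t_blk:
                    continue
                extended.append({**m, s_var: s_blk, t_var: t_blk})
        matches = extended
    return matches


def refine(candidates, matches, extensions):
    """Keep a candidate binding only if one index match covers all its entities."""
    kept = []
    for binding in candidates:
        for m in matches:
            if all(binding[v] in extensions[m[v]] for v in binding if v in m):
                kept.append(binding)
                break
    return kept


# Hypothetical index graph and candidate bindings from the earlier steps.
EXTENSIONS = {
    "E_person": {"p1", "p3"},
    "E_inst": {"i1", "i2"},
    "E_uni": {"u1", "u2"},
    "E_conf": {"c1"},
}
INDEX_EDGES = {
    ("E_person", "worksAt", "E_inst"),
    ("E_inst", "partOf", "E_uni"),
}
QUERY_EDGES = [("?x", "worksAt", "?z"), ("?z", "partOf", "?u")]
CANDIDATES = [
    {"?x": "p1", "?z": "i1", "?u": "u1"},
    {"?x": "c1", "?z": "i1", "?u": "u1"},
]
index_matches = match_on_index(QUERY_EDGES, INDEX_EDGES)
refined = refine(CANDIDATES, index_matches, EXTENSIONS)
```

The second candidate is dropped because c1 lies in no extension matched for `?x`. A binding that passes is still only a potential answer: membership in the matched extensions is necessary, but not sufficient.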


STRUCTURE-BASED ANSWER COMPUTATION







Structure-based Answer Computation

Finally, results that exactly match the query are computed by the last refinement.

Only for this step do we actually perform joins on the data.



EVALUATION



Evaluation

Systems

INC: the proposed approach

VP: join processing using vertical partitioning with sextuple indexing

Datasets

DBLP: 13M triples

LUBM: 0.7M – 6.7M triples

Queries

Generated 80 queries via random sampling

Different shapes: path, star, graph



Results – Average Processing Time



Results – Average Processing Time

Neighborhood Distance



Results – Precision vs. Time



Results – Precision



Conclusion

We proposed a novel process for approximate and incremental processing of complex graph pattern queries

Initial results are computed in a small fraction of the total time and then incrementally refined via approximate matching at low cost

Increased responsiveness, as inexact results are available early

Users can decide if and for which results exactness and completeness are desirable

Experiments show that our approach is relatively fast w.r.t. exact and complete results, indicating that the proposed mechanism is able to reuse intermediate results




BACKUP SLIDES


Education
