KIT – University of the State of Baden-Württemberg and
National Large-scale Research Center of the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB)
www.kit.edu
Approximate and Incremental Processing of Complex Queries against the Web of Data
Thanh Tran, Günter Ladwig, Andreas Wagner
DEXA 2011
Institute of Applied Informatics and Formal Description Methods (AIFB), August 31st, 2011
Contents
Introduction
Overview
Approximate & Incremental Processing
Entity Search
Approximate Structure Matching
Structure-based Result Refinement and Computation
Evaluation
Conclusion
DEXA 2011, Toulouse, France
INTRODUCTION
Introduction – Data Model
Resource Description Framework (RDF)
[Figure: example RDF data graph G. Nodes: p1 to p5, i1, i2, u1, u2, a1, a2, c1, c2. Relation edges: supervises, knows, worksAt, authorOf, conference, partOf. Attribute edges: name (values P2, P5, U1).]
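As a rough illustration of the data model, an RDF graph can be held as a list of (subject, predicate, object) triples, where an object is either another entity (relation edge) or a literal value (attribute edge). The sketch below is only an assumption about a convenient in-memory form; the `Triple` and `Literal` types and the toy edges are hypothetical and are not a transcription of the figure.

```python
from typing import NamedTuple, Union


class Literal(NamedTuple):
    """An attribute value such as a name or an age (hypothetical helper type)."""
    value: str


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: Union[str, Literal]


# Illustrative toy graph; entity names follow the slide's naming style,
# but the concrete edges are assumptions made for this example.
DATA = [
    Triple("p2", "supervises", "p1"),
    Triple("p1", "worksAt", "i1"),
    Triple("i1", "partOf", "u1"),
    Triple("p1", "authorOf", "a1"),
    Triple("a1", "conference", "c1"),
    Triple("p1", "age", Literal("29")),
    Triple("i1", "name", Literal("AIFB")),
    Triple("u1", "name", Literal("KIT")),
    Triple("c1", "name", Literal("ICDE")),
]


def attribute_edges(triples):
    """Edges whose object is a literal value."""
    return [t for t in triples if isinstance(t.obj, Literal)]


def relation_edges(triples):
    """Edges that connect two entities."""
    return [t for t in triples if not isinstance(t.obj, Literal)]
```

The split into attribute and relation edges matters later: entity search works on the former, structure matching on the latter.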
Introduction – Query Model
Basic Graph Patterns
Conjunctive queries over RDF data: graph pattern matching
[Figure: example query graph. Variables: x, y, z, u, v, w. Constants: 29, AIFB, KIT, ICDE. Relation edges: worksAt, author, conf, supervise, partOf. Attribute edges: age, name.]
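To make graph pattern matching concrete, here is a minimal backtracking matcher for basic graph patterns over a list of triples. It is a naive baseline sketch, not the approach proposed in the talk. The `?`-prefix convention for variables, the function names, the toy data, and the edge directions in the example query are all assumptions.

```python
def is_var(term):
    """Variables are written with a leading '?' (a convention assumed here)."""
    return term.startswith("?")


def match_bgp(patterns, triples, binding=None):
    """Yield every variable binding under which all patterns occur in triples."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    (s, p, o), rest = patterns[0], patterns[1:]
    for ts, tp, to in triples:
        if tp != p:
            continue
        new = dict(binding)
        ok = True
        for term, value in ((s, ts), (o, to)):
            if is_var(term):
                # Bind the variable, or check consistency with an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match_bgp(rest, triples, new)


# Hypothetical toy data and query, loosely modelled on the slide's example.
TRIPLES = [
    ("p2", "supervise", "p1"),
    ("p1", "worksAt", "i1"), ("p3", "worksAt", "i1"),
    ("i1", "partOf", "u1"),
    ("p1", "author", "a1"), ("a1", "conf", "c1"),
    ("p1", "age", "29"), ("p3", "age", "29"),
    ("i1", "name", "AIFB"), ("u1", "name", "KIT"), ("c1", "name", "ICDE"),
]
QUERY = [
    ("?x", "age", "29"), ("?x", "worksAt", "?z"), ("?z", "name", "AIFB"),
    ("?z", "partOf", "?u"), ("?u", "name", "KIT"),
    ("?x", "author", "?y"), ("?y", "conf", "?v"), ("?v", "name", "ICDE"),
    ("?w", "supervise", "?x"),
]
RESULTS = list(match_bgp(QUERY, TRIPLES))
```

In this toy data p3 satisfies the attribute constraint on x but has no supervisor or paper, so only p1 yields a complete match.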
Contribution
Techniques for matching (basic) query patterns against graph-structured data have limits
We might wish to trade completeness and exactness for responsiveness
Our approach allows an “affordable” computation of an initial set of approximate results, which can be incrementally refined as needed.
Contribution – Pipeline Overview
Pipeline of operations where approximate results are refined incrementally
[Figure: pipeline Entity Search → Approximate Structure Matching → Structure-based Result Refinement → Structure-based Answer Computation, with intermediate, approximate results passed between stages. Indexes shown: Entity & Neighborhood Index, Structure Index, Relation Index.]
ENTITY SEARCH
Entity Search
Entity index
Stores attribute edges of the data graph
Enables lookup of entities by attribute and value
Entity search
Obtains candidate bindings for all variables in the query that have attribute edges
Does not consider structure (i.e., relations between entities)
Query decomposition and transformation
Decompose query into entity queries to create a transformed query
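A minimal sketch of an entity index, assuming it is an inverted map from (attribute, value) pairs to the entities that carry them. The slides do not describe the physical layout, so the dictionary structure, the function names, and the toy data are assumptions; the four candidates for age 29 mirror the bindings for x shown later in the deck.

```python
from collections import defaultdict


def build_entity_index(attribute_triples):
    """Map each (attribute, value) pair to the set of entities having it."""
    index = defaultdict(set)
    for entity, attribute, value in attribute_triples:
        index[(attribute, value)].add(entity)
    return index


def entity_search(index, entity_query):
    """Return the entities that satisfy every (attribute, value) constraint.

    Structure (relations between entities) is deliberately ignored here.
    """
    candidate_sets = [index.get(constraint, set()) for constraint in entity_query]
    return set.intersection(*candidate_sets) if candidate_sets else set()


# Hypothetical attribute edges.
ATTRIBUTES = [
    ("p1", "age", "29"), ("p3", "age", "29"),
    ("p5", "age", "29"), ("p6", "age", "29"),
    ("p2", "age", "45"),
    ("i1", "name", "AIFB"), ("u1", "name", "KIT"), ("c1", "name", "ICDE"),
]
INDEX = build_entity_index(ATTRIBUTES)
X_CANDIDATES = entity_search(INDEX, [("age", "29")])
Z_CANDIDATES = entity_search(INDEX, [("name", "AIFB")])
```

Each lookup is independent of the others, which is why the candidates are only a superset of the final bindings.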
Query Decomposition & Transformation
Identify entity queries
Breadth-first search starting from a random variable
[Figure: the example query graph, in which the entity queries are identified.]
Query Decomposition & Transformation
Collapse entity queries
[Figure: original query graph and transformed query. In the transformed query the attribute edges are collapsed into entity queries: x (age 29), z (name AIFB), u (name KIT), v (name ICDE); y and w have no attribute edges. The relation edges worksAt, author, conf, supervise and partOf are retained.]
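The decomposition can be sketched as follows. For brevity this version groups attribute patterns by their subject variable in a single pass rather than performing the breadth-first traversal mentioned on the slide; for the example query both give the same entity queries. The `?`-prefix convention, the function name, and the edge directions are assumptions.

```python
from collections import defaultdict


def is_var(term):
    return term.startswith("?")


def decompose(patterns):
    """Split a query into per-variable entity queries and remaining relation edges.

    An attribute pattern has a variable subject and a constant object;
    everything else is treated as a relation edge between two variables.
    """
    entity_queries = defaultdict(list)
    relation_edges = []
    for s, p, o in patterns:
        if is_var(s) and not is_var(o):
            entity_queries[s].append((p, o))
        else:
            relation_edges.append((s, p, o))
    return dict(entity_queries), relation_edges


QUERY = [
    ("?x", "age", "29"), ("?z", "name", "AIFB"),
    ("?u", "name", "KIT"), ("?v", "name", "ICDE"),
    ("?x", "worksAt", "?z"), ("?z", "partOf", "?u"),
    ("?x", "author", "?y"), ("?y", "conf", "?v"),
    ("?w", "supervise", "?x"),
]
ENTITY_QUERIES, RELATION_EDGES = decompose(QUERY)
```

Variables y and w receive no entity query, matching the transformed query in which they appear without attribute constraints.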
Entity Search Results
Use entity index to obtain bindings for all entity queries in transformed query
Entity queries are necessary conditions, but not sufficient
Final results will be a subset of these bindings
x   z   u   v
p1  i1  u1  c1
p3  i1  u1  c1
p5  i1  u1  c1
p6  i1  u1  c1
[Figure: the transformed query, with entity queries x (age 29), z (name AIFB), u (name KIT), v (name ICDE) and relation edges worksAt, author, conf, supervise, partOf.]
APPROXIMATE STRUCTURE MATCHING
Approximate Structure Matching
Only entity parts of the query have been matched
Relation edges have yet to be processed
Instead of performing exact equijoins, we propose to perform a neighborhood join
Neighborhood join allows us to check whether two entities are connected via relation edges (but not which ones)
Again: necessary, but not sufficient
The k-neighborhood of an entity e is the set of entities in the data graph that can be reached from e via a path of relation edges of length k or less.
A neighborhood join between two sets of entities E1, E2 is an equijoin between all pairs e1 ∈ E1, e2 ∈ E2 where e1 and e2 are considered equivalent if the intersection of their k-neighborhoods is non-empty.
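The two definitions above can be sketched with exact sets as follows. Two points are assumptions rather than statements from the slides: relation edges are traversed in both directions, and an entity belongs to its own neighborhood (a path of length 0). Function names and data are hypothetical.

```python
from collections import defaultdict, deque


def build_adjacency(relation_triples):
    """Undirected adjacency over relation edges (direction ignored by assumption)."""
    adjacency = defaultdict(set)
    for s, _label, o in relation_triples:
        adjacency[s].add(o)
        adjacency[o].add(s)
    return adjacency


def k_neighborhood(adjacency, entity, k):
    """Entities reachable from `entity` within k hops, including itself."""
    seen = {entity}
    frontier = deque([(entity, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == k:
            continue
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen


def neighborhood_join(adjacency, e1_set, e2_set, k):
    """Pairs (e1, e2) whose k-neighborhoods intersect."""
    hoods2 = {e2: k_neighborhood(adjacency, e2, k) for e2 in e2_set}
    pairs = []
    for e1 in sorted(e1_set):
        hood1 = k_neighborhood(adjacency, e1, k)
        for e2 in sorted(e2_set):
            if hood1 & hoods2[e2]:
                pairs.append((e1, e2))
    return pairs


RELATIONS = [
    ("p1", "worksAt", "i1"), ("i1", "partOf", "u1"),
    ("p3", "worksAt", "i2"), ("i2", "partOf", "u2"),
]
ADJ = build_adjacency(RELATIONS)
PAIRS = neighborhood_join(ADJ, {"p1", "p3"}, {"i1"}, k=1)
```

The join only establishes that some connection exists; edge labels are never inspected, which is why the result is necessary but not sufficient.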
Neighborhood Join via Bloom Filters
We store the set of k-neighborhood entities as a Bloom filter
Bloom filter
Space-efficient, probabilistic data structure for set membership tests
False positives are possible (false negatives are not)
We refine the results of the previous step
To perform a neighborhood join between bindings E1, E2
Load Bloom filters for one set of entities, say E1
In a nested-loop manner, check if entities in E2 are contained in the Bloom filters
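A small Bloom filter and the nested-loop membership check can be sketched as below. The filter size, the number of hash functions, and the use of SHA-256 to derive bit positions are illustrative assumptions; the slides do not state how the filters are parameterised.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def bloom_neighborhood_join(filters, e1_set, e2_set):
    """Nested loop: keep (e1, e2) if e2 tests positive in e1's neighborhood filter."""
    pairs = []
    for e1 in sorted(e1_set):
        neighborhood = filters[e1]
        for e2 in sorted(e2_set):
            if e2 in neighborhood:
                pairs.append((e1, e2))
    return pairs


# Hypothetical precomputed neighborhoods, stored as one filter per entity.
NEIGHBORHOODS = {"p1": ["i1", "u1", "p2"], "p3": ["i2", "u2"]}
FILTERS = {}
for entity, neighbors in NEIGHBORHOODS.items():
    FILTERS[entity] = BloomFilter()
    for neighbor in neighbors:
        FILTERS[entity].add(neighbor)

PAIRS = bloom_neighborhood_join(FILTERS, {"p1", "p3"}, {"i1", "i2"})
```

Because false negatives cannot occur, every truly connected pair survives this step; spurious pairs caused by false positives are left for the later refinement stages.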
Neighborhood Join via Bloom Filters
Load Bloom filters for entities bound to x
Check whether entities bound to w, y, z are in the neighborhood of x
When k=2, Bloom filters for x also cover u and v
[Figure: the example query graph with the k=1 and k=2 neighborhoods of x marked.]
STRUCTURE-BASED RESULT REFINEMENT
Structure-based Result Refinement
From approximate structure matching (ASM) we know that entities in intermediate results are connected
With structure-based result refinement we find out whether they are connected via paths captured by query atoms
Query is matched against a structure index graph
Bisimulation-based summary of data graph that captures structural information
Nodes in the data graph with the same “structure” are grouped together
Much smaller than the data graph
Necessary, but not sufficient.
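One way to group nodes with the same "structure" is iterative partition refinement, sketched below. The slides do not say which edge directions the bisimulation considers; this sketch assumes both incoming and outgoing labelled edges, and the function name and toy graph are hypothetical.

```python
def structure_index(nodes, edges):
    """Partition nodes so that nodes in one block have the same edge structure.

    edges is a list of (source, label, target). Returns {node: block_id}.
    Starts with a single block and refines until the partition is stable.
    """
    block = {n: 0 for n in nodes}
    while True:
        signature_ids = {}
        refined = {}
        for n in nodes:
            outgoing = frozenset((l, block[t]) for s, l, t in edges if s == n)
            incoming = frozenset((l, block[s]) for s, l, t in edges if t == n)
            # Including the current block guarantees that blocks only ever split.
            signature = (block[n], outgoing, incoming)
            refined[n] = signature_ids.setdefault(signature, len(signature_ids))
        if len(signature_ids) == len(set(block.values())):
            return refined  # no block was split: the partition is stable
        block = refined


NODES = ["p1", "p2", "p3", "p4", "i1", "i2", "u1", "u2"]
EDGES = [
    ("p2", "supervises", "p1"), ("p4", "supervises", "p3"),
    ("p1", "worksAt", "i1"), ("p3", "worksAt", "i2"),
    ("i1", "partOf", "u1"), ("i2", "partOf", "u2"),
]
BLOCKS = structure_index(NODES, EDGES)
```

Each block corresponds to one node of the index graph, and its member nodes form that node's extension. Recomputing signatures by scanning all edges is quadratic and only suitable for small examples.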
Structure Index
[Figure: bisimulation maps the data graph G to the structure index graph G~. Extensions shown: E1 = {p2, p4}, E2 = {p1, p3}, E3 = {i1, i2}, E4 = {a1, a2}, E5 = {u1, u2}, plus {p5} and {c1, c2}. Edge labels in both graphs: supervises, knows, worksAt, authorOf, conference, partOf.]
Structure-based Result Refinement
We take advantage of this property: Whenever there is a match of a query graph q on G, the query also matches on G~. Moreover, extensions of the index graph matches will contain all data graph matches, i.e., the bindings to query variables.
Match the query against the structure index graph to obtain sets of extensions that contain potential query answers
Bindings computed in previous ES/ASM steps can only be answers if they are contained in the matched extensions
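The refinement can be sketched as: match the relation part of the query on the (small) index graph, then keep only those candidate bindings whose entities fall into the matched extensions. The index graph, the extension labels, and the function names below are hypothetical and chosen only for illustration.

```python
def match_on_index(patterns, index_edges, binding=None):
    """Yield assignments of query variables to index-graph nodes (extensions)."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    (s, label, o), rest = patterns[0], patterns[1:]
    for src, edge_label, dst in index_edges:
        if edge_label != label:
            continue
        if binding.get(s, src) != src or binding.get(o, dst) != dst:
            continue
        if s == o and src != dst:
            continue  # the same variable cannot map to two different nodes
        yield from match_on_index(rest, index_edges, {**binding, s: src, o: dst})


def refine(candidates, patterns, index_edges, extension_of):
    """Keep candidates whose entities all lie in the extensions of some index match."""
    index_matches = list(match_on_index(patterns, index_edges))
    kept = []
    for candidate in candidates:
        for match in index_matches:
            if all(var not in match or extension_of[entity] == match[var]
                   for var, entity in candidate.items()):
                kept.append(candidate)
                break
    return kept


INDEX_EDGES = [
    ("E1", "supervises", "E2"), ("E2", "worksAt", "E3"),
    ("E6", "worksAt", "E3"), ("E3", "partOf", "E5"),
]
EXTENSION_OF = {"p1": "E2", "p3": "E2", "p5": "E6", "i1": "E3", "u1": "E5"}
PATTERNS = [("?w", "supervises", "?x"), ("?x", "worksAt", "?z"), ("?z", "partOf", "?u")]
CANDIDATES = [
    {"?x": "p1", "?z": "i1", "?u": "u1"},
    {"?x": "p5", "?z": "i1", "?u": "u1"},
]
REFINED = refine(CANDIDATES, PATTERNS, INDEX_EDGES, EXTENSION_OF)
```

Here p5 is discarded because its extension has no incoming supervises edge in the index graph, so it cannot take part in any match on the data graph either.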
STRUCTURE-BASED ANSWER COMPUTATION
Structure-based Answer Computation
Finally, results which exactly match the query are computed by the last refinement.
Only for this step, we actually perform joins on the data.
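A minimal sketch of this final step, assuming a relation index that maps each edge label to its (source, target) pairs. Each surviving candidate is extended pattern by pattern with a simple nested-loop join; the index layout, function names, and data are assumptions, since the slides do not describe the join implementation.

```python
from collections import defaultdict


def build_relation_index(relation_triples):
    """Map each relation label to the set of (source, target) pairs."""
    index = defaultdict(set)
    for s, label, o in relation_triples:
        index[label].add((s, o))
    return index


def compute_answers(candidates, patterns, relation_index):
    """Extend each candidate binding by joining against the actual relation edges."""
    answers = []
    for candidate in candidates:
        partial = [dict(candidate)]
        for s, label, o in patterns:
            extended = []
            for binding in partial:
                for src, dst in relation_index.get(label, ()):
                    if s == o and src != dst:
                        continue
                    if binding.get(s, src) == src and binding.get(o, dst) == dst:
                        extended.append({**binding, s: src, o: dst})
            partial = extended
        answers.extend(partial)
    return answers


RELATIONS = [
    ("p2", "supervises", "p1"),
    ("p1", "worksAt", "i1"), ("p3", "worksAt", "i2"),
    ("i1", "partOf", "u1"), ("i2", "partOf", "u2"),
]
PATTERNS = [("?w", "supervises", "?x"), ("?x", "worksAt", "?z"), ("?z", "partOf", "?u")]
CANDIDATES = [
    {"?x": "p1", "?z": "i1", "?u": "u1"},
    {"?x": "p3", "?z": "i1", "?u": "u1"},
]
ANSWERS = compute_answers(CANDIDATES, PATTERNS, build_relation_index(RELATIONS))
```

Because the earlier stages have already pruned the candidate set, these joins run over far fewer bindings than a join over the full data would.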
EVALUATION
Evaluation
Systems
INC: the proposed approach
VP: join processing using vertical partitioning with sextuple indexing
Datasets
DBLP: 13M triples
LUBM: 0.7M – 6.7M triples
Queries
Generated 80 queries via random sampling
Different shapes: path, star, graph
Results – Average Processing Time
Results – Average Processing Time
Neighborhood Distance
Results – Precision vs. Time
Results – Precision
Conclusion
We proposed a novel process for approximate and incremental processing of complex graph pattern queries
Initial results are computed in a small fraction of total time and then incrementally refined via approximate matching at low cost
Increased responsiveness as inexact results are available early
Users can decide if and for which results exactness and completeness are desirable
Experiments show that our approach is relatively fast w.r.t. exact and complete results, indicating that the proposed mechanism is able to reuse intermediate results
BACKUP SLIDES