These are the slides from a talk I presented at the Graph Processing room at FOSDEM 2013, in which I discussed my PhD topic: a query language allowing for the flexible querying of complex paths within graph structured data
Graph processing room
FOSDEM, 2 Feb 2013
Petra Selmer
http://www.dcs.bbk.ac.uk/~lselm01/
Flexible querying of graph data
Introduction
2
I shall be presenting my PhD topic which involves
a declarative query language allowing for the
flexible querying of graph-structured data with
complex paths.
Agenda
3
Who (am I)?
Why (the motivation)?
Some background info
What (is the query language and what
can it do)?
Illustrative examples
How (is it done)?
Who?
4
Petra Selmer
Part-time PhD student:
Birkbeck College, University of London
Prof. Alexandra Poulovassilis
Dr. Peter T. Wood
Software Architect:
University College London’s Institute of Neurology
(Wellcome Trust Centre for Neuroimaging)
Why?
5
The amount of graph-structured data is growing fast
The structure of this data is becoming more complex, especially when multiple, heterogeneous data sources are integrated
The structure of the data is also always subject to change...
Why?
6
Users of such systems may not be familiar with the underlying data
structure: available paths etc
The user may not be able to obtain meaningful answers (or indeed,
any answers) from the data IF the querying system is limited to exact
matching of users’ queries
Also, the user may wish to explore the data by starting from a set of
initial answers and proceeding from there
The user may additionally wish to derive some intelligence from the
connections....
The user
The data
The query
Background: Ontologies
7
Currently part of the Semantic Web stack (Tim Berners-Lee, RDF, triple stores)
Models a domain of interest: inferences, reasoning...
It can be thought of as a “schema” for graph data
The following inference rules are included (among
others):
Subclass: ‘History’, ‘Languages’ are subclasses of
‘Humanities’
Subproperty, Domain, Range...
What?
8
Data model: G = (V, E)
Very general model
V: vertices (or nodes), each labelled with some constant
E: directed, labelled edges; labels drawn from the alphabet Σ ∪ {type}
The query language is called Flex-It (it is declarative)
The basis is that of conjunctive regular path queries
There are two operators which may be applied to the original query: APPROX and RELAX
What?
9
Conjunctive regular path queries:
This is where the graph's paths to be traversed are expressed with a
regular expression
A single regular path query conjunct: (X, R, Y)
X, Y: either constants or variables
R: the regular expression
“Conjunctive”: joining multiple conjuncts; e.g. (X, R1, Y), (Y,
R2, Z), (Z, R3, A)
Shared variables join the conjuncts: the Y’s are matched, the Z’s are matched, etc.
Example graph: N1 -n-> N2 -n-> N3 -p-> N4
1) (N1, n+, ?Y):
• Y = N2, N3
2) (N1, n*p, ?Y):
• Y = N4
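The two example conjuncts above can be checked with a small sketch. It assumes the slide's diagram is the chain N1 -n-> N2 -n-> N3 -p-> N4, and it hand-builds the automata for n+ and n*p rather than parsing regular expressions. The names `EDGES`, `N_PLUS`, `N_STAR_P` and `evaluate` are illustrative only and are not part of Flex-It.

```python
from collections import deque

# Edge-labelled graph from the slide, read as a chain (assumption).
EDGES = [("N1", "n", "N2"), ("N2", "n", "N3"), ("N3", "p", "N4")]

# Hand-built NFAs: (transitions, start state, accepting states).
# transitions maps (state, label) -> set of next states.
N_PLUS = ({(0, "n"): {1}, (1, "n"): {1}}, 0, {1})    # n+
N_STAR_P = ({(0, "n"): {0}, (0, "p"): {1}}, 0, {1})  # n*p


def evaluate(source, nfa, edges=EDGES):
    """Return every node Y such that some path source -> Y spells a word the NFA accepts."""
    transitions, start, accepting = nfa
    out = {}
    for u, label, v in edges:
        out.setdefault(u, []).append((label, v))

    # Breadth-first search over (graph node, NFA state) pairs.
    seen = {(source, start)}
    queue = deque(seen)
    answers = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            answers.add(node)
        for label, next_node in out.get(node, []):
            for next_state in transitions.get((state, label), ()):
                pair = (next_node, next_state)
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return answers


print(sorted(evaluate("N1", N_PLUS)))    # ['N2', 'N3']
print(sorted(evaluate("N1", N_STAR_P)))  # ['N4']
```

Searching over (node, state) pairs rather than enumerating paths keeps the work bounded by the number of such pairs, even when the graph contains cycles.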
What?
10
Approximation allows for the approximate matching
of labels in the path
An edit operation is applied to each edge label in
the path denoted by the regular expression:
Edit operations: insertions, deletions, inversions,
substitutions and transpositions of labels
Each operation has a ‘cost’: usually 1
Example: Query conjunct: (X, a*.b, Y)
R = a*.b [answers returned at cost 0]
R’ = p.a*.b (insertion of ‘p’) [answers returned at cost 1]
R’’ = p.a*.b- (inversion of ‘b’) [answers returned at cost 2]
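One way to see where the costs in this example come from is to compute the cheapest sequence of edits that lets a given sequence of path labels match a*.b. The sketch below is a simplification under stated assumptions: every edit costs 1, an inverted label such as b- is treated as a cost-1 substitution for b, and transpositions are left out. The function `approx_cost` and the hand-built automaton are illustrative, not the Flex-It implementation.

```python
import heapq

# NFA for a*.b: state 0 loops on 'a'; 'b' moves to the accepting state 1.
TRANSITIONS = [(0, "a", 0), (0, "b", 1)]
START, ACCEPTING = 0, {1}


def approx_cost(word, transitions=TRANSITIONS, start=START, accepting=ACCEPTING):
    """Minimum number of unit-cost edits needed for `word` to match the NFA."""
    heap = [(0, 0, start)]  # (cost so far, labels consumed, NFA state)
    best = {}
    while heap:
        cost, i, state = heapq.heappop(heap)
        if best.get((i, state), float("inf")) <= cost:
            continue  # already settled at the same or a lower cost
        best[(i, state)] = cost
        if i == len(word) and state in accepting:
            return cost
        if i < len(word):
            # Insertion: the path has a label the expression does not mention.
            heapq.heappush(heap, (cost + 1, i + 1, state))
        for s, label, t in transitions:
            if s != state:
                continue
            # Deletion: skip a label the expression requires.
            heapq.heappush(heap, (cost + 1, i, t))
            if i < len(word):
                # Exact match costs 0; substitution (including b -> b-) costs 1.
                step = 0 if word[i] == label else 1
                heapq.heappush(heap, (cost + step, i + 1, t))
    return None


print(approx_cost(["a", "a", "b"]))   # 0: matches a*.b exactly
print(approx_cost(["p", "a", "b"]))   # 1: insertion of 'p'
print(approx_cost(["p", "a", "b-"]))  # 2: insertion of 'p' plus inversion of 'b'
```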
What?
11
Relaxation is applied by using inference rules from an ontology (if one exists)
Achieved by applying logical relaxation of the query conditions using the data’s ontology definition
Relaxation operations: subclass, subproperty, domain and range
Each operation has a ‘cost’: usually 1
Example: We have an ontology:
Humanities (superclass)
Languages and History (subclasses of Humanities)
Assume our query states Languages may be relaxed
Languages is relaxed to Humanities:
Instances of Languages will be returned at cost 0
Instances of History will be returned at cost 1
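The subclass example can be sketched as follows, assuming each subclass step climbed costs 1. The class hierarchy is the one on the slide; the instance names (`french101`, `latin201`, `tudors110`) and the function names are hypothetical, chosen only for illustration. Subproperty, domain and range relaxation are not covered.

```python
# Subclass edges from the slide: child -> parent.
SUBCLASS_OF = {"Languages": "Humanities", "History": "Humanities"}

# Hypothetical instance data: instance -> class.
INSTANCE_OF = {
    "french101": "Languages",
    "latin201": "Languages",
    "tudors110": "History",
}


def is_subclass_or_equal(cls, ancestor):
    """True if cls equals ancestor or reaches it by following subclass edges."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def relaxed_instances(cls):
    """Yield (instance, cost) pairs; cost is the number of subclass steps
    climbed from cls before the instance becomes a member."""
    returned = set()
    cost = 0
    while cls is not None:
        for inst, inst_cls in sorted(INSTANCE_OF.items()):
            if inst not in returned and is_subclass_or_equal(inst_cls, cls):
                returned.add(inst)
                yield inst, cost
        cls = SUBCLASS_OF.get(cls)  # relax one step up the hierarchy
        cost += 1


print(list(relaxed_instances("Languages")))
# [('french101', 0), ('latin201', 0), ('tudors110', 1)]
```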
What?
12
Answers are ranked according to how
closely they match the original query;
higher-cost answers have a lower ranking
All answers at a certain distance d are
ranked the same and returned before
answers at a higher distance
We allow for incremental execution: exact
answers returned first; then answers at
distance 1; ...
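Incremental execution can be pictured as a generator that only computes the answers at distance d + 1 when the user asks for more. In the sketch below, `answers_at` is a hypothetical stand-in for the query evaluator, and the toy data and call log exist only to show that later distances are not evaluated early.

```python
def incremental_answers(answers_at, max_distance):
    """Yield (distance, answers) batches in order of increasing distance.

    Because this is a generator, answers_at(d + 1) is not called until the
    caller asks for the next batch.
    """
    for distance in range(max_distance + 1):
        batch = answers_at(distance)
        if batch:
            yield distance, batch


# Hypothetical stand-in for the evaluator, with a log of the calls made.
CALLS = []
TOY = {0: ["exact1"], 1: ["near1", "near2"], 2: ["far1"]}


def toy_answers_at(distance):
    CALLS.append(distance)
    return TOY.get(distance, [])


stream = incremental_answers(toy_answers_at, max_distance=2)
first = next(stream)
print(first)  # (0, ['exact1'])
print(CALLS)  # [0]: distances 1 and 2 have not been evaluated yet
```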
Example – ‘Lifelong learner metadata’
13–14
[Two figures not preserved; surviving labels: History, sc]
Query: “What work positions can I reach, having a degree in English?”
Y = the episode; Z = the job
(?Y, ?Z)
(?X, type, University),
(?X, qualif.type, EnglishStudies),
(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
16
No results from User 2’s timeline will be returned, even though it is relevant!
17
Allowing query approximation can yield some answers:
Replacing the edge label prereq by next, at an edit cost of 1, we get this variant of the
query:
(?Y, ?Z)
(?X, type, University),
(?X, qualif.type, EnglishStudies),
APPROX(?X, prereq+, ?Y),
(?Y, type, Work),
(?Y, job.type, ?Z)
prereq+ can be approximated by next.prereq* at edit distance 1:
Result: Y = ep22, Z = AirTravelAssistant
18
next.prereq* can be approximated by next.next.prereq*, now at edit distance 2:
Results:
Y = ep23, Z = Journalist
Y = ep24, Z = AssistantEditor
19
[Figure not preserved; surviving labels: History, sc]
20
Query: “What jobs are open to me if I study English, or something similar, at University?”
(?Y, ?Z)
(?X, type, University), (?X, qualif, ?D),
RELAX (?D, type, EnglishStudies),
APPROX (?X, prereq+, ?Y),
(?Y, type, Work), (?Y, job.type, ?Z)
In addition to the answers (from User 2) obtained by the previous query, we now also have
answers from the timeline of User 3
prereq+ can be approximated by next.prereq* (distance 1) and EnglishStudies can be relaxed, via Languages, to Humanities (distance 2), which encompasses History
Result: Y = ep32, Z = PersonalAssistant (distance 1 + 2 = 3 from the original query)
21
next.prereq* can be approximated by next.next.prereq* (distance 2), with EnglishStudies again relaxed to Humanities (distance 2)
Results: (both at distance 4 from the original query)
Y = ep33, Z = Author
Y = ep34, Z = AssociateEditor
How?
22
Theory
Construction of a weighted non-deterministic finite
automaton (NFA) to represent the regular expression
We add new states and transitions to the NFA to represent the
approximation and relaxation operations
Formation of a product automaton: NFA with data
graph G
We perform a lowest cost path traversal of the product
automaton; construct query tree, do joins etc
Polynomial time complexity
Correctness of algorithms proven
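The steps above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the prototype's code: it handles only substitution, deletion and insertion at cost 1 each, ignores inversion, transposition and relaxation, and uses the chain graph N1 -n-> N2 -n-> N3 -p-> N4 with the query (N1, n+, ?Y). The product automaton is never materialised; its states are explored on the fly as (graph node, NFA state) pairs. All names are illustrative.

```python
import heapq

ANY = "*"   # wildcard: matches any edge label
EPS = None  # epsilon: consumes no edge


def approximate(transitions, states):
    """Add cost-1 substitution, deletion and insertion transitions to an exact NFA."""
    weighted = []
    for s, label, t in transitions:
        weighted.append((s, label, 0, t))  # exact match
        weighted.append((s, ANY, 1, t))    # substitution
        weighted.append((s, EPS, 1, t))    # deletion
    for s in states:
        weighted.append((s, ANY, 1, s))    # insertion
    return weighted


def ranked_answers(edges, source, weighted, start, accepting):
    """Lowest-cost-first traversal of (node, state) pairs; yields (node, cost)."""
    out = {}
    for u, label, v in edges:
        out.setdefault(u, []).append((label, v))
    heap = [(0, source, start)]
    done, answered = set(), set()
    while heap:
        cost, node, state = heapq.heappop(heap)
        if (node, state) in done:
            continue
        done.add((node, state))
        if state in accepting and node not in answered:
            answered.add(node)
            yield node, cost
        for s, label, weight, t in weighted:
            if s != state:
                continue
            if label is EPS:
                heapq.heappush(heap, (cost + weight, node, t))
                continue
            for edge_label, next_node in out.get(node, []):
                if label == ANY or label == edge_label:
                    heapq.heappush(heap, (cost + weight, next_node, t))


EDGES = [("N1", "n", "N2"), ("N2", "n", "N3"), ("N3", "p", "N4")]
N_PLUS = [(0, "n", 1), (1, "n", 1)]  # n+ with start state 0, accepting state 1
WEIGHTED = approximate(N_PLUS, states={0, 1})

results = list(ranked_answers(EDGES, "N1", WEIGHTED, start=0, accepting={1}))
print(results)  # N2 and N3 at cost 0, then N1 and N4 at cost 1
```

Because the traversal is a generator that pops the cheapest pair first, exact answers are produced before any answer at distance 1, which matches the ranked, incremental behaviour described earlier in the talk.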
How?
23
Implementation of prototype
Graph database: DEX (http://www.sparsity-technologies.com/dex)
Programming language: C#
Further work
New flexible operation, FLEX, combining APPROX and RELAX
Optimisation!