FOSDEM 2013: Petra Selmer, Flexible querying of graph data

Graph processing room

FOSDEM, 2 Feb 2013

Petra Selmer

[email protected]

http://www.dcs.bbk.ac.uk/~lselm01/

Flexible querying of graph data

Introduction

I shall be presenting my PhD topic, which involves a declarative query language allowing for the flexible querying of graph-structured data with complex paths.

Agenda

Who (am I)?

Why (the motivation)?

Some background info

What (is the query language and what can it do)?

Illustrative examples

How (is it done)?

Who?

Petra Selmer

Part-time PhD student:

Birkbeck College, University of London

Prof. Alexandra Poulovassilis

Dr. Peter T. Wood

Software Architect:

University College London’s Institute of Neurology

(Wellcome Trust Centre for Neuroimaging)

Why?

Amount of graph-structured data is growing fast

The structure of this data is becoming more complex, especially when multiple, heterogeneous data sources are integrated together

The structure of the data is also always subject to change...

Why?

Users of such systems may not be familiar with the underlying data structure: available paths, etc.

The user may not be able to obtain meaningful answers (or indeed, any answers) from the data if the querying system is limited to exact matching of users’ queries

Also, the user may wish to explore the data by starting from a set of initial answers and proceeding from there

The user may additionally wish to derive some intelligence from the connections...

The user

The data

The query

Background: Ontologies

Currently part of the Semantic Web stack (Tim Berners-Lee, RDF, triple stores)

Models a domain of interest: inferences, reasoning...

It can be thought of as a “schema” for graph data

The following inference rules are included (among others):

Subclass: ‘History’, ‘Languages’ are subclasses of ‘Humanities’

Subproperty, Domain, Range...

What?

Data model: G = (V, E)

Very general model

V: vertices (or nodes); each labelled with some constant

E: directed, labelled edges; labels drawn from an alphabet Σ ∪ {‘type’}

The query language is called Flex-It (it is declarative)

The basis is that of conjunctive regular path queries

There are two operators which may be applied to the original query: APPROX and RELAX

What?

Conjunctive regular path queries:

This is where the graph's paths to be traversed are expressed with a regular expression

A single regular path query conjunct: (X, R, Y)

X, Y: either constants or variables

R: the regular expression

“Conjunctive”: joining multiple conjuncts; e.g. (X, R1, Y), (Y, R2, Z), (Z, R3, A)

Shared variables are joined: the Y’s are matched, the Z’s are matched, etc.

Example graph: N1 -n-> N2 -n-> N3 -p-> N4

1) (N1, n+, ?Y):

• Y = N2, N3

2) (N1, n*p, ?Y):

• Y = N4
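The two example conjuncts can be checked with a small sketch. It assumes the example graph is the chain N1 -n-> N2 -n-> N3 -p-> N4, and it uses hand-built automata for n+ and n*p rather than a regular expression parser. The names `EDGES`, `NFA_N_PLUS`, `NFA_N_STAR_P` and `evaluate` are illustrative and not part of Flex-It.

```python
from collections import deque

# Assumed example graph: the chain N1 -n-> N2 -n-> N3 -p-> N4.
EDGES = [("N1", "n", "N2"), ("N2", "n", "N3"), ("N3", "p", "N4")]

# Hand-built NFAs: (start state, final states, {(state, label): next states}).
NFA_N_PLUS = (0, {1}, {(0, "n"): {1}, (1, "n"): {1}})    # n+
NFA_N_STAR_P = (0, {1}, {(0, "n"): {0}, (0, "p"): {1}})  # n*p


def evaluate(start_node, nfa, edges=EDGES):
    """Return every node Y such that some path start_node -> Y spells a
    word accepted by the NFA (breadth-first search over (node, state))."""
    start_state, finals, delta = nfa
    seen = {(start_node, start_state)}
    queue = deque(seen)
    answers = set()
    if start_state in finals:
        # The empty path is accepted, so the start node itself is an answer.
        answers.add(start_node)
    while queue:
        node, state = queue.popleft()
        for src, label, dst in edges:
            if src != node:
                continue
            for nxt in delta.get((state, label), ()):
                if (dst, nxt) not in seen:
                    seen.add((dst, nxt))
                    queue.append((dst, nxt))
                    if nxt in finals:
                        answers.add(dst)
    return answers
```

Calling `evaluate("N1", NFA_N_PLUS)` and `evaluate("N1", NFA_N_STAR_P)` gives the answer sets for conjuncts 1) and 2) respectively.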

What?

Approximation allows for the approximate matching of labels in the path

An edit operation is applied to each edge label in the path denoted by the regular expression:

Edit operations: insertions, deletions, inversions, substitutions and transpositions of labels

Each operation has a ‘cost’: usually 1

Example: Query conjunct: (X, a*.b, Y)

R = a*.b [answers returned at cost 0]

R’ = p.a*.b (insertion of ‘p’) [answers returned at cost 1]

R’’ = p.a*.b- (inversion of ‘b’) [answers returned at cost 2]
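To make the cost model concrete, the sketch below computes the cheapest total edit cost at which a given sequence of edge labels is accepted by an automaton for a*.b. It is a simplification under stated assumptions: only insertions, deletions and substitutions are modelled (inversions and transpositions are left out), every operation costs 1, and the names `NFA_A_STAR_B` and `edit_cost` are hypothetical.

```python
import heapq

# NFA for a*.b: (start state, final states, [(state, label, next state)]).
NFA_A_STAR_B = (0, {1}, [(0, "a", 0), (0, "b", 1)])


def edit_cost(labels, nfa, unit=1):
    """Cheapest total edit cost at which the label sequence is accepted,
    found by a lowest-cost-first search over (position, state) pairs."""
    start, finals, transitions = nfa
    settled = set()
    heap = [(0, 0, start)]  # (cost so far, labels consumed, NFA state)
    while heap:
        cost, pos, state = heapq.heappop(heap)
        if (pos, state) in settled:
            continue
        settled.add((pos, state))
        if pos == len(labels) and state in finals:
            return cost
        if pos < len(labels):
            # Insertion: the path carries an extra label the expression lacks.
            heapq.heappush(heap, (cost + unit, pos + 1, state))
        for src, label, dst in transitions:
            if src != state:
                continue
            # Deletion: a label the expression requires is missing from the path.
            heapq.heappush(heap, (cost + unit, pos, dst))
            if pos < len(labels):
                # Exact match costs 0; substituting the label costs one unit.
                step = 0 if labels[pos] == label else unit
                heapq.heappush(heap, (cost + step, pos + 1, dst))
    return None  # no accepting state is reachable at any cost
```

With this sketch, the label sequence a, b is accepted at cost 0, while p, a, a, b needs one edit (the inserted ‘p’), matching the R’ case above.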

What?

Relaxation is applied by using inference rules from an ontology (if one exists).

Achieved by applying logical relaxation of the query conditions using the data’s ontology definition

Relaxation operations: subclass, subproperty, domain and range

Each operation has a ‘cost’: usually 1

Example:

We have an ontology:

Humanities (superclass)

Languages and History (subclasses of Humanities)

Assume our query states Languages may be relaxed

Languages is relaxed to Humanities:

Instances of Languages will be returned at cost 0

Instances of History will be returned at cost 1
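The subclass example can be sketched as follows. This is an illustration only: it covers just the subclass rule, assumes an acyclic hierarchy with a single direct superclass per class, and uses made-up instance names (`q1`, `q2`, `q3`). Placing EnglishStudies under Languages follows the later example in the talk; the function names are hypothetical.

```python
# Ontology fragment: class -> direct superclass (subclass edges).
SUPERCLASS = {
    "EnglishStudies": "Languages",
    "Languages": "Humanities",
    "History": "Humanities",
}

# Hypothetical instance data: instance -> declared class.
INSTANCE_TYPE = {"q1": "EnglishStudies", "q2": "Languages", "q3": "History"}


def ancestors(cls):
    """Return {class: steps up the hierarchy} for cls and its superclasses."""
    costs, steps = {}, 0
    while cls is not None and cls not in costs:
        costs[cls] = steps
        cls = SUPERCLASS.get(cls)
        steps += 1
    return costs


def is_subclass_of(cls, target):
    """True if cls is target or a (transitive) subclass of target."""
    return target in ancestors(cls)


def relax_type(query_class, unit_cost=1):
    """Return {instance: cost}, where cost is the cheapest relaxation of
    query_class (superclass steps taken) under which the instance matches."""
    result = {}
    by_cost = sorted(ancestors(query_class).items(), key=lambda kv: kv[1])
    for relaxed, steps in by_cost:
        for inst, cls in INSTANCE_TYPE.items():
            if inst not in result and is_subclass_of(cls, relaxed):
                result[inst] = steps * unit_cost
    return result
```

Relaxing Languages returns its own instances (and those of its subclasses) at cost 0 and the History instance at cost 1; relaxing EnglishStudies reaches History only at cost 2, via Languages and then Humanities.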

What?

Answers are ranked according to how closely they match the original query; higher-cost answers have a lower ranking

All answers at a certain distance d are ranked the same and returned before answers at a higher distance

We allow for incremental execution: exact answers returned first; then answers at distance 1; ...
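One way to picture incremental delivery: if the evaluator emits (answer, distance) pairs in non-decreasing distance order, the pairs can be grouped lazily into batches, one per distance. The sketch below assumes such an ordered stream; `EXAMPLE_STREAM` reuses answers from the examples in this talk, and `batches_by_distance` is a hypothetical helper, not part of Flex-It.

```python
from itertools import groupby

# (answer, distance) pairs, assumed to arrive in non-decreasing distance order.
EXAMPLE_STREAM = [
    (("ep22", "AirTravelAssistant"), 1),
    (("ep23", "Journalist"), 2),
    (("ep24", "AssistantEditor"), 2),
]


def batches_by_distance(stream):
    """Lazily yield (distance, [answers]) batches from an ordered stream.
    Answers in the same batch share a rank; nothing beyond the current
    batch (plus the first pair of the next) is consumed from the stream."""
    for distance, group in groupby(stream, key=lambda pair: pair[1]):
        yield distance, [answer for answer, _ in group]
```

Because `groupby` only merges adjacent items, the helper relies on the ordering assumption; an unordered stream would produce several batches for the same distance.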

Example – ‘Lifelong learner metadata’

[Figures not reproduced; surviving labels: ‘History’, ‘sc’]

Query: “What work positions can I reach, having a degree in English?”

Y = the episode; Z = the job

(?Y, ?Z)

(?X, type, University),

(?X, qualif.type, EnglishStudies),

(?X, prereq+, ?Y),

(?Y, type, Work),

(?Y, job.type, ?Z)

Query: “What work positions can I reach, having a degree in English?”

Y = the episode; Z = the job

(?Y, ?Z)



(?X, prereq+, ?Y),

(?Y, type, Work),

(?Y, job.type, ?Z)

No results from User 2's timeline will be returned, even though it is relevant!

Allowing query approximation can yield some answers:

Replacing the edge label prereq by next, at an edit cost of 1, we get this variant of the query:

(?Y, ?Z)



APPROX(?X, prereq+, ?Y),

(?Y, type, Work),

(?Y, job.type, ?Z)

prereq+ can be approximated by next.prereq* at edit distance 1:

Result: Y = ep22, Z = AirTravelAssistant

Allowing query approximation can yield some answers:

Replacing the edge label prereq by next, at an edit cost of 1, we get this variant of the query:

(?Y, ?Z)



APPROX(?X, prereq+, ?Y),

(?Y, type, Work),

(?Y, job.type, ?Z)

next.prereq* can be approximated by next.next.prereq*, now at edit distance 2:

Results:

Y = ep23, Z = Journalist

Y = ep24, Z = AssistantEditor

[Figure not reproduced; surviving labels: ‘History’, ‘sc’]

Query: “What jobs are open to me if I study English, or something similar, at University?”

(?Y, ?Z)

(?X, type, University), (?X, qualif, ?D),

RELAX (?D, type, EnglishStudies),

APPROX (?X, prereq+, ?Y),

(?Y, type, Work), (?Y, job.type, ?Z)

In addition to the answers (from User 2) obtained by the previous query, we now also have answers from the timeline of User 3

prereq+ can be approximated by next.prereq* (distance 1) and EnglishStudies can be relaxed (via Languages) to Humanities (distance 2), encompassing History

Result: Y = ep32, Z = PersonalAssistant (distance of 3 from original query)

Query: “What jobs are open to me if I study English, or something similar, at University?”

(?Y, ?Z)

(?X, type, University), (?X, qualif, ?D),

RELAX (?D, type, EnglishStudies),

APPROX (?X, prereq+, ?Y),

(?Y, type, Work), (?Y, job.type, ?Z)

next.prereq* can be approximated by next.next.prereq* (distance 2), with EnglishStudies again relaxed to Humanities (distance 2)

Results: (both at distance 4 from the original query)

Y = ep33, Z = Author

Y = ep34, Z = AssociateEditor

How?

Theory

Construction of a weighted non-deterministic finite automaton (NFA) to represent the regular expression

We add new states and transitions to the NFA to represent the approximation and relaxation operations

Formation of a product automaton: NFA with data graph G

We perform a lowest-cost path traversal of the product automaton; construct the query tree, do joins, etc.

Polynomial time complexity

Correctness of algorithms proven
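The traversal idea can be sketched for a single APPROX conjunct with a constant start node. This is not the prototype's implementation: it explores the product of the data graph and the NFA lazily with a priority queue, models only insertions, deletions and substitutions at unit cost, and ignores relaxation and joins. The data graph `EDGES` and the function name `approx_answers` are hypothetical.

```python
import heapq

# NFA for prereq+: (start state, final states, [(state, label, next state)]).
NFA_PREREQ_PLUS = (0, {1}, [(0, "prereq", 1), (1, "prereq", 1)])

# Hypothetical data graph (not the one shown in the talk).
EDGES = [("e1", "prereq", "e2"), ("e2", "next", "e3"), ("e3", "prereq", "e4")]


def approx_answers(start_node, nfa, edges, unit=1):
    """Yield (node, distance) pairs in non-decreasing distance order by a
    lowest-cost-first traversal of (graph node, NFA state) pairs."""
    start, finals, transitions = nfa
    heap = [(0, start_node, start)]
    settled, reported = set(), set()
    while heap:
        cost, node, state = heapq.heappop(heap)
        if (node, state) in settled:
            continue
        settled.add((node, state))
        if state in finals and node not in reported:
            reported.add(node)
            yield node, cost
        for src, label, dst in edges:
            if src != node:
                continue
            # Insertion: follow a graph edge without advancing the NFA.
            heapq.heappush(heap, (cost + unit, dst, state))
            for s, wanted, t in transitions:
                if s == state:
                    # Exact match costs 0; substituting the label costs one unit.
                    step = 0 if wanted == label else unit
                    heapq.heappush(heap, (cost + step, dst, t))
        for s, _, t in transitions:
            if s == state:
                # Deletion: advance the NFA without following a graph edge.
                heapq.heappush(heap, (cost + unit, node, t))
```

On the hypothetical graph, starting from `e1`, the node `e2` is an exact answer, while `e3` and `e4` are reached at distance 1 by substituting `next` with `prereq`; `e1` itself also appears at distance 1 because deleting the single required `prereq` accepts the empty path.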

How?

Implementation of prototype

Graph database: DEX (http://www.sparsity-technologies.com/dex)

Programming language: C#

Further work

New flexible operation combining APPROX and RELAX: FLEX

Optimisation!

Any questions?

Thank you for your attention!

[email protected]
