A query language for analyzing networks

Anton Dries (based on joint work with Siegfried Nijssen)

Idea

Declarative language for manipulating and analyzing information networks

“Query language” – cf. SQL

with special focus on querying connections

simplicity / expressivity / flexibility

Information networks

Objects (“nodes”)

Connections between objects (“edges”)

Focus on structure (“topology”)

a.k.a. “large single graph”

Information networks

HTTP://SPIKEDMATH.COM/382.HTML

Information networks

Examples:

World Wide Web

Social networks

Bibliographical

Transportation

Biological

Process

Common tasks

Query language

Operational model (algebra)

Implementation & Optimization

Data management & storage

Top-down approach

Process

Common tasks

Query language

Operational model (algebra)

Implementation & Optimization

Data management & storage

Top-down approach

[CIKM 2009]

[MLG 2010]

?

Graph databases (DEX, Neo, ...)

Common tasks

Feature-based queries

Structure-based queries

Aggregation

Basic graph problems e.g. degree, shortest path

Network analysis (e.g. centrality measures)

...

Mainly path-based queries

BiQL

“The BISON Query Language”

[Figure: bibliographic network. Publication nodes, keyword nodes (data mining, graphs, machine learning, probabilities) and author nodes, connected by “author of” and “has keyword” edges.]

[Figure: the same network next to the derived co-authorship network, in which author nodes are connected by “co-author” edges.]

Manipulation

“Query language”

SQL-style: loosely based on SQL syntax

One type of query: create a set of (new) objects

CREATE/UPDATE Domain<Vars> { Properties }
FROM Path Expression
WHERE Constraints

Example

[Figure: the bibliographic network (publications, keywords, authors) and the co-author network derived from it.]

CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B


“path expression” – structural selection

“object creation” – output specification

(+ other operations)

Structural selection

Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B,
Publication P −> HasKeyword −> Keyword K

[Figure: pattern graph. Author A and Author B are each linked via AuthorOf to Publication P, which is linked via HasKeyword to Keyword K.]

Author A −> CoAuthor −> Author B −> CoAuthor −> Author C −> CoAuthor −> Author A

[Figure: pattern graph. Author A, Author B and Author C form a triangle of CoAuthor links.]

Structural selection

Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

regular expressions

list variables

Each expansion of the regular expression should lead to a valid (simple) path expression defining the same variables

Structural selection

Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1, n2, n3, n4 and edges e1 (n1 −> n2), e2 (n2 −> n3), e3 (n1 −> n3), e4 (n2 −> n4), e5 (n3 −> n4).]

Node A −> Edge [E] −> Node B

(A,E,B) = (n1, [e1], n2) (n1, [e3], n3) (n2, [e2], n3) (n2, [e4], n4) (n3, [e5], n4)

Node A −> Edge [E] −> Node −> Edge [E] −> Node B

(A,E,B) = (n1, [e1,e2], n3) (n1, [e1,e4], n4) (n1, [e3,e5], n4)
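The expansion of the starred path expression can be sketched in Python. This is only an illustration of the idea, not the BiQL implementation: the function name `match_paths`, the `max_len` bound and the dictionary encoding of the example graph are assumptions made for the sketch. It follows the repeated group to any depth up to the bound, so it returns every path in the example graph, including tuples the slide does not list, such as (n2, [e2,e5], n4).

```python
from typing import Dict, List, Tuple

# Example graph from the slide: edge name -> (source node, target node).
EDGES: Dict[str, Tuple[str, str]] = {
    "e1": ("n1", "n2"),
    "e2": ("n2", "n3"),
    "e3": ("n1", "n3"),
    "e4": ("n2", "n4"),
    "e5": ("n3", "n4"),
}

# One match binds A, the list variable E, and B.
Match = Tuple[str, Tuple[str, ...], str]


def match_paths(edges: Dict[str, Tuple[str, str]], max_len: int) -> List[Match]:
    """Enumerate (A, E, B) for Node A -> Edge [E] -> (Node -> Edge [E] ->)* Node B.

    max_len bounds the number of edges per path (an assumption of this
    sketch) so that the enumeration also terminates on cyclic graphs.
    """
    outgoing: Dict[str, List[Tuple[str, str]]] = {}
    for name, (src, dst) in edges.items():
        outgoing.setdefault(src, []).append((name, dst))

    matches: List[Match] = []

    def extend(start: str, node: str, trail: Tuple[str, ...]) -> None:
        if len(trail) == max_len:
            return
        for name, dst in outgoing.get(node, []):
            new_trail = trail + (name,)
            matches.append((start, new_trail, dst))  # one expansion matched
            extend(start, dst, new_trail)            # try a longer expansion

    for start in sorted(outgoing):
        extend(start, start, ())
    return matches


matches = match_paths(EDGES, max_len=len(EDGES))
for a, e, b in matches:
    print(a, list(e), b)
```

Because every path is materialised, the number of matches can grow exponentially with the size of the graph.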

Output specification

CREATE CoAuthor<A,B> { A <−>, B <−> }

CREATE / UPDATE: update/create objects

CoAuthor: put them in this domain

<A,B>: for each combination of values

{ A <−>, B <−> }: with these properties

Output specification

UPDATE <A> { nr_reach: count }
FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B

[Figure: example graph with nodes n1, n2, n3, n4 and edges e1 (n1 −> n2), e2 (n2 −> n3), e3 (n1 −> n3), e4 (n2 −> n4), e5 (n3 −> n4).]

Matched tuples (A, E, B):
(n1, [e1], n2) (n1, [e3], n3) (n2, [e2], n3) (n2, [e4], n4) (n3, [e5], n4)
(n1, [e1,e2], n3) (n1, [e1,e4], n4) (n1, [e3,e5], n4)

Grouped by <A>:
n1: ([e1], n2) ([e3], n3) ([e1,e2], n3) ([e1,e4], n4) ([e3,e5], n4)
n2: ([e2], n3) ([e4], n4)
n3: ([e5], n4)

Within each group, grouped by B:
n1: n2 ([e1]); n3 ([e3]) ([e1,e2]); n4 ([e1,e4]) ([e3,e5])
n2: n3 ([e2]); n4 ([e4])
n3: n4 ([e5])

count:
n1: 3
n2: 2
n3: 1

UPDATE
n1 nr_reach: 3
n2 nr_reach: 2
n3 nr_reach: 1
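The grouping steps above can be sketched in Python. This is a reading of the slide rather than the actual BiQL semantics: it takes the matched tuples as given, groups them by A, and counts the distinct values of B in each group. The names `MATCHES` and `update_count` are made up for the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Matched tuples (A, E, B), copied from the slide.
MATCHES: List[Tuple[str, List[str], str]] = [
    ("n1", ["e1"], "n2"),
    ("n1", ["e3"], "n3"),
    ("n2", ["e2"], "n3"),
    ("n2", ["e4"], "n4"),
    ("n3", ["e5"], "n4"),
    ("n1", ["e1", "e2"], "n3"),
    ("n1", ["e1", "e4"], "n4"),
    ("n1", ["e3", "e5"], "n4"),
]


def update_count(matches: List[Tuple[str, List[str], str]]) -> Dict[str, int]:
    """Group by A, then count the distinct B values in each group.

    The list variable E does not create extra groups: several paths to
    the same B are counted once.
    """
    reachable: Dict[str, Set[str]] = defaultdict(set)
    for a, _edges, b in matches:
        reachable[a].add(b)
    return {a: len(bs) for a, bs in sorted(reachable.items())}


nr_reach = update_count(MATCHES)
print(nr_reach)
```

Node n4 has no outgoing edge, so it never binds A and receives no nr_reach property in this sketch.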

Object properties

Attribute definition: strength: count, start: min(P.year)

Link definition: A −>, B −>, P <−

Examples

Co-authorship

CREATE CoAuthor<A,B> { A −>, B −>, <− P,
    start: min(P.year), end: max(P.year), strength: count }

adding a new relationship

[Figure: authors A and B linked by a CoAuthor object (strength: 3, start: 2008, end: 2010), derived from publications P1 (year: 2008), P2 (year: 2008) and P3 (year: 2010).]
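The aggregated properties of the CoAuthor object can be reproduced with a small Python sketch. It assumes the path expression has already produced one (A, B, P) tuple per shared publication. The names `MATCHES`, `YEAR` and `create_coauthor` are hypothetical, and only the pair (A, B) from the figure is modelled.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Assumed output of the path expression: one tuple per shared publication.
MATCHES: List[Tuple[str, str, str]] = [
    ("A", "B", "P1"),
    ("A", "B", "P2"),
    ("A", "B", "P3"),
]

# Publication years from the figure.
YEAR: Dict[str, int] = {"P1": 2008, "P2": 2008, "P3": 2010}


def create_coauthor(
    matches: List[Tuple[str, str, str]], year: Dict[str, int]
) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Create one CoAuthor object per <A,B> and aggregate over P."""
    shared: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for a, b, p in matches:
        shared[(a, b)].add(p)
    return {
        pair: {
            "strength": len(pubs),               # strength: count
            "start": min(year[p] for p in pubs),  # start: min(P.year)
            "end": max(year[p] for p in pubs),    # end: max(P.year)
        }
        for pair, pubs in shared.items()
    }


coauthor = create_coauthor(MATCHES, YEAR)
print(coauthor)
```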

Size of neighborhood

transitive closure

UPDATE <A> { netsize: count }
FROM Author A −> (CoAuthor [co] <− Author −>)* CoAuthor [co] <− Author B
WHERE length(co) < 4

Distance

CREATE Connection<A,B> { A −>, −> B, distance: min<E>(length(E)) }

FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

based on shortest path

distance: min<E>(length(E))

distance: min<E>(sum(E.weight))

distance: max<E>(product(E.probability))

Centrality measures

closeness centrality

UPDATE <A> { closeness: 1/sum(min<AB>(AB.distance)) }
FROM Node A −> Connection AB −> Node B

$C_C(v) = \frac{1}{\sum_{t \in V} \mathrm{dist}(v, t)}$

degree centrality

UPDATE <A> { Cdegree: count/(count<N>-1) }
FROM Node A −− Edge −− Node B, Node N

$C_D(v) = \frac{\deg(v)}{n - 1}$
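For comparison, the two formulas can be computed directly in Python. This sketch assumes an undirected, unweighted, connected graph with at least two nodes, stored as an adjacency dictionary. The three-node path graph and the function names are illustrative only and are not part of BiQL.

```python
from collections import deque
from typing import Dict, Set

# Illustrative undirected path graph a - b - c (adjacency sets).
GRAPH: Dict[str, Set[str]] = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}


def degree_centrality(graph: Dict[str, Set[str]]) -> Dict[str, float]:
    """C_D(v) = deg(v) / (n - 1)."""
    n = len(graph)
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def bfs_distances(graph: Dict[str, Set[str]], source: str) -> Dict[str, int]:
    """Hop distances from source, by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness_centrality(graph: Dict[str, Set[str]]) -> Dict[str, float]:
    """C_C(v) = 1 / sum of dist(v, t) over all nodes t."""
    return {v: 1 / sum(bfs_distances(graph, v).values()) for v in graph}


degree = degree_centrality(GRAPH)
closeness = closeness_centrality(GRAPH)
print(degree)
print(closeness)
```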

Query execution

Operational model

Query algebra operators:

Evaluate path expression (graph –> tuple)

Relational algebra (tuple –> tuple)

Construction operator (tuple –> graph)

Used by prototype implementation
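The three operator kinds can be sketched as a small Python pipeline for the co-author query. The sketch rests on several assumptions: the function names, the toy AuthorOf data and the choice to drop pairs with A = B are all made up for illustration and do not describe the prototype.

```python
from typing import Dict, List, Tuple

# Toy graph: AuthorOf edges as (author, publication) pairs.
AUTHOR_OF: List[Tuple[str, str]] = [
    ("ann", "p1"),
    ("bob", "p1"),
    ("bob", "p2"),
    ("cat", "p2"),
]


def match_pattern(author_of: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Path evaluation (graph -> tuples): bind A, P, B for
    Author A -> AuthorOf -> Publication P <- AuthorOf <- Author B."""
    authors_by_pub: Dict[str, List[str]] = {}
    for author, pub in author_of:
        authors_by_pub.setdefault(pub, []).append(author)
    return [
        (a, pub, b)
        for pub, authors in authors_by_pub.items()
        for a in authors
        for b in authors
    ]


def select_and_project(tuples: List[Tuple[str, str, str]]) -> List[Tuple[str, str]]:
    """Relational algebra (tuples -> tuples): keep A != B, project on (A, B)."""
    return [(a, b) for a, _pub, b in tuples if a != b]


def construct(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Construction (tuples -> graph): one CoAuthor edge per distinct pair."""
    return sorted(set(pairs))


co_author_edges = construct(select_and_project(match_pattern(AUTHOR_OF)))
print(co_author_edges)
```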

Operational model

“Pattern match” operator is too broad

Enumerates all paths

exponential

e.g. even when only shortest path is requested

Need for atomic graph operations (open question)

Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

Pattern matching

Homomorphism matching (no cycle check)

more efficient than isomorphism

cycles could lead to unbounded solutions

Use constraints and algebraic solutions to avoid infinite processing

operator interaction – “pattern match” operator not atomic enough

Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

Avoiding unbounded solutions

CREATE Distance<A,B> { A −>, −> B, distance: min<E>(sum(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

CREATE ConnectionWeight<A,B> { A −>, −> B, distance: sum<E>(product(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

CREATE PathCount<A,B> { A −>, −> B, numP: count<E> }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B

Fletcher’s algorithm

[FLETCHER, 1980] [BATAGELJ, 1994]

FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} ⊕ (C_{k-1,i,k} ⊙ C_{k-1,k,k}* ⊙ C_{k-1,k,j})
  C_{k,k,k} = e⊙ ⊕ C_{k,k,k}

where

(S, ⊕, ⊙, e⊕, e⊙): an algebraic semiring

n: number of nodes in the graph

a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ... : closure operator

C_{0,i,j}: weighted adjacency matrix

Fletcher’s algorithm

Dynamic programming approach

At step k: C_{k,i,j} contains the solution using only paths whose intermediate nodes are in 1...k
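The recurrence can be written as a short, generic Python function. This is a sketch under stated assumptions rather than a reference implementation: the `Semiring` container, the two example semirings and the test matrices are made up for illustration. In the path-counting case the slides give a closure of ∞ for cycles; the sketch raises an error there instead, to avoid arithmetic with infinity.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # ⊕
    times: Callable[[Any, Any], Any]  # ⊙
    zero: Any                         # e⊕ (kept for completeness)
    one: Any                          # e⊙
    star: Callable[[Any], Any]        # closure a*


def fletcher(adjacency: List[List[Any]], sr: Semiring) -> List[List[Any]]:
    """Closure of a weighted adjacency matrix over the semiring sr."""
    n = len(adjacency)
    c = [list(row) for row in adjacency]
    for k in range(n):
        prev = [list(row) for row in c]  # C_{k-1}; do not update in place
        loop = sr.star(prev[k][k])       # C_{k-1,k,k}*
        for i in range(n):
            for j in range(n):
                via_k = sr.times(sr.times(prev[i][k], loop), prev[k][j])
                c[i][j] = sr.plus(prev[i][j], via_k)
        c[k][k] = sr.plus(sr.one, c[k][k])  # add the empty path at node k
    return c


# Shortest paths: (min, +, ∞, 0); a* = 0 for non-negative weights.
SHORTEST = Semiring(min, lambda a, b: a + b, math.inf, 0, lambda a: 0)


def _count_star(a: int) -> int:
    if a != 0:
        raise ValueError("cycle through this node: infinitely many paths")
    return 1


# Number of paths: (+, ·, 0, 1); only meaningful for acyclic graphs here.
PATH_COUNT = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1, _count_star)

INF = math.inf
# Illustrative weighted graph with a cycle: 0->1 (4), 0->2 (1), 1->0 (1), 2->1 (2).
weights = [[INF, 4, 1], [1, INF, INF], [INF, 2, INF]]
distances = fletcher(weights, SHORTEST)

# Acyclic example graph from the earlier slides (n1..n4 as indices 0..3).
dag = [[0, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 0, 0]]
counts = fletcher(dag, PATH_COUNT)

print(distances)
print(counts)
```

The diagonal entries of the result include the empty path (distance 0, or one path of length zero), which comes from the last line of the recurrence.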

Some examples ...


(S, ⊕, ⊙, e⊕, e⊙) = (ℝ+, min, +, ∞, 0)

a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...

C_{k,k}* = min(0, C_{k,k}, 2C_{k,k}, 3C_{k,k}, ...) = 0   (C_{k,k} >= 0)

FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = min(C_{k-1,i,j}, C_{k-1,i,k} + C_{k-1,k,j})
  C_{k,k,k} = 0

Floyd-Warshall shortest path algorithm

Fletcher’s algorithm

(S, ⊕, ⊙, e⊕, e⊙) = ([0,1], +, ·, 0, 1)

C_{k,k}* = 1 + C_{k,k} + C_{k,k}^2 + C_{k,k}^3 + ... = 1 / (1 − C_{k,k})   (|C_{k,k}| < 1)

FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
  C_{k,k,k} = 1 + C_{k,k,k}

sum of all path weights

(S, ⊕, ⊙, e⊕, e⊙) = (ℕ, +, ·, 0, 1)

a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ... = 1 + a + a^2 + a^3 + ...

C_{k,k}* = 1   (C_{k,k} = 0): no cycle k –> k

C_{k,k}* = ∞   (C_{k,k} > 0): cycle k –> k

FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · C_{k-1,k,k}* · C_{k-1,k,j}
  C_{k,k,k} = 1 + C_{k,k,k}

number of paths

Fletcher’s algorithm

Generalized algorithm for several connectivity problems

O(n^3) time complexity, O(n^3) or O(n^2) space complexity

for many problems: best known time complexity (exact, for arbitrary graphs)

also in the presence of cycles (thanks to the C_{k-1,k,k}* term)

Applicability depends on constraints on path

Fletcher’s algorithm

CREATE Connection<A,B> { A −>, −> B, distance: min<E>(sum(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
WHERE A.color = ‘blue’

(S, ⊕, ⊙, e⊕, e⊙) = (ℝ, min, +, ∞, 0)

if e1e2 matches the path expression, then e1 and e2 must each match the path expression

=> has to compute all-pairs shortest paths

Conclusion

A query language for analyzing networks

Focused on path-based analysis

Raises interesting questions

Some ideas on implementation and optimization

Future work

Need for atomic graph operations

Fletcher’s algorithm:

interaction with constraints

complex path expressions (not just Node-Edge-Node)

Approximate answers – O(n^3) is very bad

Other metrics: flow-based, pagerank, ... mining

Thank you
