DESCRIPTION
Information networks are a popular way to represent information, especially in domains where the emphasis lies on the structural relationships between the entities rather than their features. Notable examples are online social networks and road networks. This special focus on network topology has led to the development of specialized graph databases. However, few of these databases offer a high-level declarative interface suited for analyzing information networks. In this talk, I present our work on developing a query language for analyzing networks. I will focus on the general principles we followed in the design of this language, and the main challenges related to developing it into a scalable tool for network analysis.
A query language for analyzing networks
Anton Dries
(based on joint work with Siegfried Nijssen)
Idea
Declarative language for manipulating and analyzing information networks
“Query language” – cf. SQL
with special focus on querying connections
simplicity / expressivity / flexibility
Information networks
Objects (“nodes”)
Connections between objects (“edges”)
Focus on structure (“topology”)
a.k.a. “large single graph”
Information networks
HTTP://SPIKEDMATH.COM/382.HTML
Information networks
Examples:
World Wide Web
Social networks
Bibliographical
Transportation
Biological
Process
Common tasks
Query language
Operational model (algebra)
Implementation & Optimization
Data management & storage
TOP-DOWN APPROACH
Process (same layers, annotated)
Annotations shown next to the layers, in slide order:
[CIKM 2009]
[MLG 2010]
?
Graph databases (DEX, Neo, ...)
Common tasks
Feature-based queries
Structure-based queries
Aggregation
Basic graph problems e.g. degree, shortest path
Network analysis (e.g. centrality measures)
...
Mainly path-based queries
BiQL
“The BISON Query Language”
[Figure: bibliographic network with publication, author and keyword nodes (keywords: data mining, graphs, machine learning, probabilities), linked by “author of” and “has keyword” edges]
[Figure: the same network next to the derived co-authorship network, in which authors are linked by “co-author” edges]
Manipulation
“query language”
SQL-style: loosely based on SQL syntax
One type of query: create set of (new) objects
CREATE/UPDATE Domain<Vars> { Properties }
FROM Path Expression
WHERE Constraints
Example
[Figure: the bibliographic network and the co-author network derived from it]
CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
“path expression” – structural selection
“object creation” – output specification
(+ other operations)
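The CoAuthor query above can be mimicked outside BiQL. The following is a minimal Python sketch, not the BiQL implementation: the `author_of` pairs are hypothetical toy data, and the helper `co_author_pairs` is an illustrative name. As a simplification it skips the case A = B, which a homomorphic match of the path expression would also return.

```python
from itertools import permutations

# Hypothetical toy data: (author, publication) pairs standing in for AuthorOf edges.
author_of = [
    ("ann", "p1"), ("bob", "p1"),
    ("bob", "p2"), ("cat", "p2"),
    ("dan", "p3"),
]

def co_author_pairs(author_of):
    """Return the ordered (A, B) pairs of distinct authors sharing a publication."""
    authors_by_pub = {}
    for author, pub in author_of:
        authors_by_pub.setdefault(pub, set()).add(author)
    pairs = set()
    for authors in authors_by_pub.values():
        # Structural selection: Author A -> Publication P <- Author B.
        pairs.update(permutations(sorted(authors), 2))
    return pairs

co_authors = co_author_pairs(author_of)
print(sorted(co_authors))
```

Collecting the pairs in a set corresponds to creating one CoAuthor object per combination of values for <A,B>, however many publications the two authors share.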
Structural selection
Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B,
Publication P −> HasKeyword −> Keyword K
Author A −> CoAuthor −> Author B −> CoAuthor −> Author C −> CoAuthor −> Author A
[Figure: the two patterns drawn as graphs: Author A and Author B linked through AuthorOf to Publication P, which is linked through HasKeyword to Keyword K; and a triangle of Authors A, B and C linked through CoAuthor]
Structural selection
Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B
regular expressions
list variables
each expansion of the regular expression should lead to a valid (simple) path expression that defines the same variables
Structural selection
Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B
[Figure: example graph with nodes n1..n4 and edges e1: n1 −> n2, e2: n2 −> n3, e3: n1 −> n3, e4: n2 −> n4, e5: n3 −> n4]
Node A −> Edge [E] −> Node B
(A,E,B) = (n1, [e1], n2) (n1, [e3], n3) (n2, [e2], n3) (n2, [e4], n4) (n3, [e5], n4)
Node A −> Edge [E] −> Node −> Edge [E] −> Node B
(A,E,B) = (n1, [e1,e2], n3) (n1, [e1,e4], n4) (n1, [e3,e5], n4)
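To make the expansion of the starred path expression concrete, here is a small Python sketch that enumerates (A, [E], B) tuples on the example graph. The edge endpoints are read off the one-edge tuples listed on the slide; the function name `match_paths` and the `max_len` bound are assumptions of this sketch, not part of BiQL.

```python
# Example graph from the slide: edge name -> (source node, target node).
EDGES = {
    "e1": ("n1", "n2"),
    "e2": ("n2", "n3"),
    "e3": ("n1", "n3"),
    "e4": ("n2", "n4"),
    "e5": ("n3", "n4"),
}

def match_paths(edges, max_len):
    """Enumerate (A, [E], B) for every directed path with 1..max_len edges."""
    out_edges = {}
    for name, (src, dst) in edges.items():
        out_edges.setdefault(src, []).append((name, dst))
    results = []
    # Each entry is (start node, list of edges so far, current end node).
    frontier = [(src, [name], dst) for name, (src, dst) in edges.items()]
    for _ in range(max_len):
        results.extend(frontier)
        # One more expansion of (Node -> Edge [E] ->)*: append one edge.
        frontier = [
            (a, path + [name], dst)
            for a, path, b in frontier
            for name, dst in out_edges.get(b, [])
        ]
    return results

matches = match_paths(EDGES, max_len=2)
for a, path, b in matches:
    print(a, path, b)
```

With `max_len=2` the sketch returns the five one-edge tuples and the two-edge tuples; besides the three two-edge tuples shown on the slide it also finds (n2, [e2,e5], n4), which the slide listing does not show.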
Output specification
CREATE CoAuthor<A,B> { A <−>, B <−> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
Reading the CREATE clause:
CREATE / UPDATE: update/create objects
CoAuthor: put them in this domain
<A,B>: for each combination of values
{ A <−>, B <−> }: with these properties
Output specification
UPDATE <A> { nr_reach: count<B> }
FROM Node A −> Edge [E] −> (Node −> Edge [E] −>)* Node B
[Figure: example graph with nodes n1..n4 and edges e1..e5]
(A,E,B) =
(n1, [e1], n2) (n1, [e3], n3) (n2, [e2], n3) (n2, [e4], n4) (n3, [e5], n4)
(n1, [e1,e2], n3) (n1, [e1,e4], n4) (n1, [e3,e5], n4)
<A>: group the matches by A
n1: ([e1], n2) ([e3], n3) ([e1,e2], n3) ([e1,e4], n4) ([e3,e5], n4)
n2: ([e2], n3) ([e4], n4)
n3: ([e5], n4)
<B>: within each A, group by B
n1: n2 ([e1]); n3 ([e3]) ([e1,e2]); n4 ([e1,e4]) ([e3,e5])
n2: n3 ([e2]); n4 ([e4])
n3: n4 ([e5])
count: number of B groups per A
n1: 3
n2: 2
n3: 1
UPDATE
n1 nr_reach: 3
n2 nr_reach: 2
n3 nr_reach: 1
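The grouping steps of this slide sequence can be sketched in Python as follows. The tuples are the ones listed on the slides; the function name `count_reachable` is hypothetical, and the sketch only illustrates the semantics of count<B> per <A>, not how BiQL evaluates it.

```python
from collections import defaultdict

# (A, [E], B) tuples as listed on the slides.
MATCHES = [
    ("n1", ["e1"], "n2"), ("n1", ["e3"], "n3"), ("n2", ["e2"], "n3"),
    ("n2", ["e4"], "n4"), ("n3", ["e5"], "n4"),
    ("n1", ["e1", "e2"], "n3"), ("n1", ["e1", "e4"], "n4"),
    ("n1", ["e3", "e5"], "n4"),
]

def count_reachable(matches):
    """Group by A, then by B, and count the distinct B values per A."""
    reachable = defaultdict(set)
    for a, _edges, b in matches:
        # Several edge lists may lead to the same B; the set keeps B once.
        reachable[a].add(b)
    return {a: len(bs) for a, bs in reachable.items()}

nr_reach = count_reachable(MATCHES)
print(nr_reach)
```

Counting groups rather than tuples is what makes n1 reach 3 nodes even though five tuples start at n1.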
Object properties
Attribute definition
strength: count<P>
start: min<P>(P.year)
Link definition
A −>, B −>, P <−
Examples
Co-authorship
CREATE CoAuthor<A,B> { A −>, B −>, <− P,
start: min<P>(P.year), end: max<P>(P.year), strength: count<P> }
FROM Author A −> AuthorOf −> Publication P <− AuthorOf <− Author B
adding a new relationship
[Figure: authors A and B joined by a CoAuthor object with strength: 3, start: 2008, end: 2010, derived from publications P1 (year: 2008), P2 (year: 2008) and P3 (year: 2010)]
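The aggregates in the co-authorship query can be illustrated with a short Python sketch. The input tuples mirror the figure (publications P1, P2, P3 shared by A and B); the function name `build_co_author` and the tuple layout are assumptions made for this example.

```python
# One tuple per match of the path expression: (A, B, P, P.year).
MATCHES = [
    ("A", "B", "P1", 2008),
    ("A", "B", "P2", 2008),
    ("A", "B", "P3", 2010),
]

def build_co_author(matches):
    """Create one CoAuthor object per (A, B) with aggregates over P."""
    years_by_pair = {}
    for a, b, pub, year in matches:
        # Key by publication so that each P is counted once per pair.
        years_by_pair.setdefault((a, b), {})[pub] = year
    return {
        pair: {
            "strength": len(pubs),         # count<P>
            "start": min(pubs.values()),   # min<P>(P.year)
            "end": max(pubs.values()),     # max<P>(P.year)
        }
        for pair, pubs in years_by_pair.items()
    }

co_author = build_co_author(MATCHES)
print(co_author)
```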
Size of neighborhood (transitive closure)
UPDATE <A> { netsize: count<B> }
FROM Author A −> (CoAuthor [co] <− Author −>)* CoAuthor [co] <− Author B
WHERE length(co) < 4
Distance
CREATE Connection<A,B> { A −>, −> B, distance: min<E>(length(E)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
based on shortest path
distance: min<E>(length(E))
distance: min<E>(sum(E.weight))
distance: max<E>(product(E.probability))
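For the variant distance: min<E>(sum(E.weight)), the minimum over all matching paths can be computed without enumerating them. The sketch below uses Dijkstra's algorithm as a stand-in technique (it is not taken from the talk) and assumes non-negative weights; the weights themselves are hypothetical values placed on the example graph.

```python
import heapq

# Hypothetical weights on the example graph: (source, target, weight).
WEIGHTED_EDGES = [
    ("n1", "n2", 1.0), ("n2", "n3", 1.0), ("n1", "n3", 3.0),
    ("n2", "n4", 5.0), ("n3", "n4", 1.0),
]

def shortest_distances(edges, source):
    """Dijkstra: minimum summed weight from source to every reachable node."""
    adjacency = {}
    for src, dst, weight in edges:
        adjacency.setdefault(src, []).append((dst, weight))
    dist = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node in dist:
            continue  # already settled with a smaller distance
        dist[node] = d
        for nxt, weight in adjacency.get(node, []):
            if nxt not in dist:
                heapq.heappush(heap, (d + weight, nxt))
    # The path expression needs at least one edge, so drop the empty path.
    # (Simplification: a cycle back to the source is not reported.)
    return {node: d for node, d in dist.items() if node != source}

distance_from_n1 = shortest_distances(WEIGHTED_EDGES, "n1")
print(distance_from_n1)
```

Setting every weight to 1 gives the min<E>(length(E)) variant.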
Centrality measures
closeness centrality
UPDATE <A> { closeness: 1/sum<B>(min<AB>(AB.distance)) }
FROM Node A −> Connection AB −> Node B
$C_C(v) = \frac{1}{\sum_{t \in V} \mathrm{dist}(v, t)}$
degree centrality
UPDATE <A> { Cdegree: count<B>/(count<N>-1) }
FROM Node A −− Edge −− Node B, Node N
$C_D(v) = \frac{\deg(v)}{n - 1}$
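Both formulas can be checked on a small graph. The Python sketch below treats the example graph as undirected and unweighted, which is an assumption of this example, and it presumes a connected graph with at least two nodes; the function names are illustrative.

```python
from collections import deque

# Example graph, read as undirected for this sketch.
EDGES = [("n1", "n2"), ("n2", "n3"), ("n1", "n3"), ("n2", "n4"), ("n3", "n4")]

def neighbors(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree_centrality(adj):
    """C_D(v) = deg(v) / (n - 1)."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """C_C(v) = 1 / sum of shortest-path distances from v (connected graph)."""
    result = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search gives hop distances
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        result[source] = 1 / sum(dist.values())
    return result

adj = neighbors(EDGES)
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
print(degree["n2"], closeness["n1"])
```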
Query execution
Operational model
Query algebra operators:
Evaluate path expression (graph –> tuple)
Relational algebra (tuple –> tuple)
Construction operator (tuple –> graph)
Used by prototype implementation
Operational model
“Pattern match” operator is too broad
Enumerates all paths: exponential cost,
even when only the shortest path is requested
Need for atomic graph operations (open question)
Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
Pattern matching
Homomorphism matching (no cycle check)
more efficient than isomorphism
cycles could lead to unbounded solutions
Use constraints and algebraic solutions to avoid infinite processing
operator interaction – “pattern match” operator not atomic enough
Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
Avoiding unbounded solutions
CREATE Distance<A,B> { A −>, −> B, distance: min<E>(sum(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
CREATE ConnectionWeight<A,B> { A −>, −> B, distance: sum<E>(product(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
CREATE PathCount<A,B> { A −>, −> B, numP: count<E> }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
Fletcher’s algorithm
FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} ⊕ (C_{k-1,i,k} ⊙ (C_{k-1,k,k})* ⊙ C_{k-1,k,j})
  C_{k,k,k} = e⊙ ⊕ C_{k,k,k}
where
(S, ⊕, ⊙, e⊕, e⊙): an algebraic semiring
n: number of nodes in the graph
C_{0,i,j}: weighted adjacency matrix
a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...: closure operator
[FLETCHER, 1980] [BATAGELJ, 1994]
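The loop structure above can be written as a generic Python function that takes the semiring operations as parameters. This is a sketch under stated assumptions: the name `fletcher_closure` and its parameter names are invented for illustration, and the Boolean semiring (or, and, False, True) used in the usage example is not from the slides; it is chosen because it yields plain reachability on the example graph.

```python
def fletcher_closure(matrix, plus, times, one, star):
    """Semiring closure following the loop structure on the slide.

    matrix[i][j] is C_0; missing edges hold the semiring's e_plus element.
    """
    n = len(matrix)
    c = [row[:] for row in matrix]
    for k in range(n):
        prev = [row[:] for row in c]      # C_{k-1}
        loop = star(prev[k][k])           # (C_{k-1,k,k})*
        for i in range(n):
            for j in range(n):
                via_k = times(times(prev[i][k], loop), prev[k][j])
                c[i][j] = plus(prev[i][j], via_k)
        c[k][k] = plus(one, c[k][k])      # add the empty path at node k
    return c

# Example graph n1..n4 as indices 0..3 (edges e1..e5).
ADJ = [
    [False, True, True, False],
    [False, False, True, True],
    [False, False, False, True],
    [False, False, False, False],
]

reach = fletcher_closure(
    ADJ,
    plus=lambda a, b: a or b,
    times=lambda a, b: a and b,
    one=True,
    star=lambda a: True,  # True or a or (a and a) or ... is always True
)
print(reach[0][3], reach[3][0])
```

Because of the final diagonal update, the result includes the empty path, so every node reaches itself.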
Fletcher’s algorithm
Dynamic programming approach
At step k: Ck,i,j contains solution using paths containing only nodes 1...k
Some examples ...
Fletcher’s algorithm
(S, ⊕, ⊙, e⊕, e⊙) = (ℝ+, min, +, ∞, 0)
a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...
(C_{k,k})* = min(0, C_{k,k}, 2C_{k,k}, 3C_{k,k}, ...) = 0 (C_{k,k} >= 0)
FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = min(C_{k-1,i,j}, C_{k-1,i,k} + C_{k-1,k,j})
  C_{k,k,k} = 0
Floyd-Warshall shortest path algorithm
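Specialised to (min, +), the loop is short enough to run directly. The following Python sketch follows the slide, including the diagonal reset to 0; the edge weights are hypothetical values on the example graph, and the in-place update relies on non-negative weights.

```python
import math

INF = math.inf

# Hypothetical weights on the example graph, nodes n1..n4 as indices 0..3.
WEIGHTS = [
    [INF, 1.0, 3.0, INF],
    [INF, INF, 1.0, 5.0],
    [INF, INF, INF, 1.0],
    [INF, INF, INF, INF],
]

def floyd_warshall(weights):
    """All-pairs shortest distances for non-negative edge weights."""
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Closure term is 0 here, so it drops out of the sum.
                dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])
        dist[k][k] = 0.0  # e_times (+) C_{k,k,k} = min(0, C_{k,k,k})
    return dist

dist = floyd_warshall(WEIGHTS)
print(dist[0])
```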
Fletcher’s algorithm
(S, ⊕, ⊙, e⊕, e⊙) = ([0,1], +, ·, 0, 1)
a* = e⊙ ⊕ a ⊕ a⊙a ⊕ a⊙a⊙a ⊕ ...
(C_{k,k})* = 1 + C_{k,k} + C_{k,k}^2 + C_{k,k}^3 + ... = 1 / (1 − C_{k,k}) (|C_{k,k}| < 1)
FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · (C_{k-1,k,k})* · C_{k-1,k,j}
  C_{k,k,k} = 1 + C_{k,k,k}
sum of all path weights
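The geometric-series closure is what lets the algorithm handle cycles in this semiring. A minimal Python sketch, assuming a hypothetical two-node cycle with weight 0.5 in each direction; the function name `path_weight_sums` is illustrative, and the result includes the empty path on the diagonal.

```python
def path_weight_sums(weights):
    """Sum over all paths of the product of edge weights, per node pair."""
    n = len(weights)
    c = [row[:] for row in weights]
    for k in range(n):
        prev = [row[:] for row in c]
        if abs(prev[k][k]) >= 1:
            raise ValueError("closure 1 / (1 - a) diverges for |a| >= 1")
        loop = 1.0 / (1.0 - prev[k][k])   # 1 + a + a^2 + ...
        for i in range(n):
            for j in range(n):
                c[i][j] = prev[i][j] + prev[i][k] * loop * prev[k][j]
        c[k][k] = 1.0 + c[k][k]           # add the empty path at node k
    return c

# Hypothetical cycle between two nodes, weight 0.5 each way.
CYCLE = [
    [0.0, 0.5],
    [0.5, 0.0],
]

sums = path_weight_sums(CYCLE)
print(sums)
```

For this input the paths from node 0 to node 1 have weights 0.5, 0.125, ..., a geometric series with sum 2/3, which the closure term produces in a single step.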
Fletcher’s algorithm
(S, ⊕, ⊙, e⊕, e⊙) = (ℕ, +, ·, 0, 1)
a* = 1 + a + a^2 + a^3 + ...
(C_{k,k})* = 1 (C_{k,k} = 0): no cycle k –> k
(C_{k,k})* = ∞ (C_{k,k} > 0): cycle k –> k
FOR k = 1..n
  FOR i = 1..n
    FOR j = 1..n
      C_{k,i,j} = C_{k-1,i,j} + C_{k-1,i,k} · (C_{k-1,k,k})* · C_{k-1,k,j}
  C_{k,k,k} = 1 + C_{k,k,k}
number of paths
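The path-counting instance can be run on the example graph. In this Python sketch the names `count_paths` and `times` are hypothetical; `math.inf` stands in for ∞, and `times` treats 0 · ∞ as 0 so that node pairs with no connecting path stay at 0. The result counts the empty path on the diagonal.

```python
import math

def times(a, b):
    """Multiply counts, treating 0 * infinity as 0 (no path at all)."""
    return 0 if a == 0 or b == 0 else a * b

def count_paths(adjacency):
    """Number of paths between every node pair; infinite if a cycle is involved."""
    n = len(adjacency)
    c = [row[:] for row in adjacency]
    for k in range(n):
        prev = [row[:] for row in c]
        # Closure: 1 if there is no cycle through k so far, else infinity.
        loop = 1 if prev[k][k] == 0 else math.inf
        for i in range(n):
            for j in range(n):
                c[i][j] = prev[i][j] + times(times(prev[i][k], loop), prev[k][j])
        c[k][k] = 1 + c[k][k]
    return c

# Example graph n1..n4 as indices 0..3 (edges e1..e5), one edge per pair.
ADJ = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

paths = count_paths(ADJ)
print(paths[0][3])

# A two-node cycle yields infinitely many paths.
cyclic = count_paths([[0, 1], [1, 0]])
```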
Fletcher’s algorithm
Generalized algorithm for several connectivity problems
O(n^3) time complexity, O(n^3) or O(n^2) space complexity
for many problems: best known time complexity (exact, for arbitrary graphs)
also in the presence of cycles (thanks to the (C_{k,k,k})* term)
Applicability depends on constraints on path
Fletcher’s algorithm
CREATE Connection<A,B> { A −>, −> B, distance: min<E>(sum(E.weight)) }
FROM Node A −> Edge [E] (−> Node −> Edge [E])* −> Node B
WHERE A.color = ‘blue’
(S, ⊕, ⊙, e⊕, e⊙) = (ℝ, min, +, ∞, 0)
if the concatenation e1e2 matches the path expression, then e1 and e2 must each match the path expression
=> has to compute all-pairs shortest paths
Conclusion
A query language for analyzing networks
Focused on path-based analysis
Raises interesting questions
Some ideas on implementation and optimization
Future work
Need for atomic graph operations
Fletcher’s algorithm:
interaction with constraints
complex path expressions (not just Node-Edge-Node)
Approximate answers – O(n^3) is very bad
Other metrics: flow-based, pagerank, ... mining
Thank you