Performance Guarantees for

Distributed Reachability Queries

Wenfei Fan1,2 Xin Wang1 Yinghui Wu1,3

1University of Edinburgh

2Harbin Institute of Technology

3University of California, Santa Barbara

Outline

Querying distributed real-life graphs

• Real-life graphs are often fragmented/distributed

• Distributed reachability queries

• Distributed bounded reachability queries

• Distributed regular reachability queries

Distributed reachability with MapReduce

Experimental study

Conclusion

Distributed query evaluation with performance guarantees

Partial Evaluation

Yinghui Wu VLDB 2012

Distributed Real-life Graphs

Real-life graphs are distributed

• Geo-distributed, e.g., data centers

• Decentralization, e.g., social networks

• Distributed entity and personal information

Real-life graphs are purposely or naturally distributed

Distributed querying methods

Federated/centralized graph database

• collect and link graph fragments

• query the centralized graph

[Figure: graph fragments are collected into a centralized graph, on which Q is evaluated to return Q(G). Callouts: construction and maintenance cost; centralized querying.]

Distributed querying methods

Graph exploration strategy

• Master node and slave node

• Predefined graph partition and query execution plan

No bounds on the number of site visits or on data shipment

[Figure: a master node receives Q and returns Q(G); it exchanges a query plan and intermediate results with the slave nodes.]

Querying a distributed social network

[Figure: a social network distributed over three data centers DC1, DC2 and DC3. Nodes are people labeled with job titles: Ann "CTO", Mark "FA", Fred "HR", Walt "HR", Dan "DB", Bill "DB", Pat "SE", Tom "AI", Ross "HR", Jack "MK", Ben "MK", Emmy "HR", Mat "HR". The query Q carries the regular expression (DB* ∪ HR*).]

Centralized method? Graph exploration?

Using partial evaluation to obtain performance guarantees

Partial evaluation

Partial evaluation (a.k.a. program specialization)

• given a function f(s,d) and a part of its input, e.g., s, specializes f(s,d) w.r.t. s

• only conducts the part of f’s computation that depends on s

• generates a residual function f’

Partial evaluation: generating partial answer
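
A minimal sketch of the idea in Python, outside the graph setting. The function names `power` and `specialize_power` are illustrative assumptions, not part of the talk: f(s, d) is exponentiation with s the exponent, and the residual function f’(d) reuses a squaring schedule computed once from s.

```python
from typing import Callable


def power(exponent: int, base: float) -> float:
    """The general function f(s, d): s is the exponent, d is the base."""
    result = 1.0
    for _ in range(exponent):
        result *= base
    return result


def specialize_power(exponent: int) -> Callable[[float], float]:
    """Specialize f(s, d) w.r.t. the known input s; return the residual f'(d)."""
    # Everything that depends only on s is computed here, once:
    # the bits of the exponent decide which squares get multiplied in.
    bits = [bool((exponent >> i) & 1) for i in range(exponent.bit_length())]

    def residual(base: float) -> float:
        # Only the part of the computation that needs d remains.
        result, square = 1.0, base
        for bit in bits:
            if bit:
                result *= square
            square *= square
        return result

    return residual


power_of_five = specialize_power(5)   # f'(d) for s = 5
print(power_of_five(2.0), power(5, 2.0))
```

In the distributed setting the known part of the input is the local fragment Fi, and the residual is whatever still depends on data held at other sites.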

[Figure: f(s, d) specialized with s yields the residual f’(d). For graph queries? Q(Fi, G) specialized with a fragment Fi yields Q’(G).]

Distributed graphs and graph queries

Distributed graph

• graph fragmentation F = (F, Gf)

• fragment graph Gf

Reachability query

• reachability query Qr(s,t)

• bounded reachability query Qbr(s,t,l)

• regular reachability (path) query Qrr(s,t,R), with R ::= ε | a | RR | R ∪ R | R*

[Figure: the example graph split into fragments F1, F2 and F3, together with the fragment graph Gf. Legend: fragment; a virtual node of F1; an in-node of F1; a cross edge. Example queries: Qr(Ann, Mark); Qbr(Ann, Mark, 5); Qrr(Ann, Mark, (DB* ∪ HR*)).]
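
The boundary notions in the legend can be sketched as follows. This assumes one reading of the legend: a cross edge (u, v) joins nodes in different fragments, v is then an in-node of its own fragment and a virtual node of u's fragment. The function name, the node ids and the tiny graph are hypothetical and are not the graph from the figure.

```python
from collections import defaultdict


def boundary_nodes(edges, fragment_of):
    """Return (in_nodes, virtual_nodes), each mapping fragment id -> set of nodes.

    edges: iterable of directed edges (u, v).
    fragment_of: dict mapping every node to the id of the fragment storing it.
    """
    in_nodes = defaultdict(set)
    virtual_nodes = defaultdict(set)
    for u, v in edges:
        fu, fv = fragment_of[u], fragment_of[v]
        if fu != fv:                      # (u, v) is a cross edge
            virtual_nodes[fu].add(v)      # v as seen from u's fragment
            in_nodes[fv].add(v)           # v inside its own fragment
    return dict(in_nodes), dict(virtual_nodes)


# Hypothetical four-node cycle split over two fragments.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
fragment_of = {"a": 1, "b": 1, "c": 2, "d": 2}
in_nodes, virtual_nodes = boundary_nodes(edges, fragment_of)
print(in_nodes, virtual_nodes)
```

Under this reading, the nodes of the fragment graph Gf are exactly these boundary nodes, which is why the bounds on the later slides are stated in terms of |Vf|.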

Distributed graph querying framework

Applying partial evaluation to graph querying

Setting: a coordinating site Sc and a set of graph fragments F1, …, Fn

• Distributing at Sc: post Q to the fragments

• Local evaluation: partially evaluate Q at each fragment Fi

• Assembling at Sc

[Figure: coordinator Sc sends Q to every fragment, each fragment returns its partial answer Q(Fi), and Sc outputs Q(G).]

Distributed reachability queries

Performance guarantees: Over a fragmentation F = (F, Gf) of a graph G, reachability queries can be evaluated (a) in O(|Vf||Fm|) time, (b) by visiting each site only once, and (c) with the total network traffic bounded by O(|Vf|^2), where Gf = (Vf, Ef) and Fm is the largest fragment in F.

A distributed reachability evaluation algorithm, disReach

• Coordinator Sc posts qr(s,t) to each fragment site in F

• Each site locally evaluates qr(s,t) in parallel, and produces a partial answer as a set of Boolean equations

• Sc collects and assembles the partial results

Distributed reachability: partial evaluation

Partial evaluation by introducing Boolean variables

Locally evaluate each qr(v,t) on Fi in parallel:

• for each in-node v’ in Fi, decide whether v’ reaches t

• introduce a Boolean variable Xv’ for each v’

Partial answer to qr(v,t): a set of Boolean formulas, each a disjunction of the variables of the nodes v’ that v can reach

[Figure: inside a fragment, v reaches t or a boundary node v’. Labels: qr(v,t); qr(v,v’); Xv’ = qr(v’,t) = Xv1’ or … or Xvn’.]
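
A minimal sketch of the local step, under stated assumptions: the fragment is given as a set of local nodes plus an adjacency dict whose cross edges point at nodes stored elsewhere, and each equation is encoded either as `True` or as the set of boundary nodes whose variables appear in the disjunction (an empty set reads as false). The function name and the encoding are illustrative, not taken from the talk.

```python
from collections import deque


def local_eval(local_nodes, adj, in_nodes, s, t):
    """Partially evaluate qr(s, t) on one fragment.

    Returns {v: True | set_of_nodes}: one Boolean equation per in-node
    (and for s, if s is stored here). A set {v1, ..., vn} stands for
    Xv = Xv1 or ... or Xvn.
    """
    starts = set(in_nodes)
    if s in local_nodes:
        starts.add(s)
    equations = {}
    for v in starts:
        found_t = (v == t)
        reached_outside = set()
        seen = {v}
        queue = deque([v])
        while queue and not found_t:
            u = queue.popleft()
            for w in adj.get(u, ()):
                if w == t:
                    found_t = True
                    break
                if w in seen:
                    continue
                seen.add(w)
                if w in local_nodes:
                    queue.append(w)          # keep exploring locally
                else:
                    reached_outside.add(w)   # stored elsewhere: leave Xw open
        equations[v] = True if found_t else reached_outside
    return equations


# Hypothetical fragment {a, b} with a cross edge b -> c.
print(local_eval({"a", "b"}, {"a": ["b"], "b": ["c"]}, {"a"}, "a", "z"))
```

The search never leaves the fragment, so each site can run it without contacting any other site.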

Distributed reachability: assembling

Partial evaluation by introducing Boolean variables

• Collect the Boolean equation sets at coordinator Sc

• Solve the Boolean equation system over a dependency graph

• qr(s,t) is true iff Xs = true at Sc

Example equation system: Xs = Xv; Xv = Xv’’ or Xv’; Xv’’ = false; Xv’ = Xt; Xt = true

Figure annotation: O(|Vf|)
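
A sketch of the assembling step on the example system above. The encoding is an assumption made for illustration: each variable maps either to `True` or to the set of variables in its disjunction, with the empty set standing for false. Since every right-hand side is a disjunction, Xs is true exactly when s reaches a variable fixed to true in the dependency graph.

```python
def assemble(equations, s):
    """Solve for Xs in a system of disjunctive Boolean equations.

    equations: {var: True | set_of_vars}; a missing variable counts as false.
    """
    seen = {s}
    stack = [s]
    while stack:
        x = stack.pop()
        rhs = equations.get(x, set())
        if rhs is True:
            return True            # reached a variable known to be true
        for y in rhs:              # follow dependency edges x -> y
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


# The system from the slide: Xs = Xv; Xv = Xv'' or Xv'; Xv'' = false;
# Xv' = Xt; Xt = true.
system = {
    "s": {"v"},
    "v": {"v''", "v'"},
    "v''": set(),
    "v'": {"t"},
    "t": True,
}
print(assemble(system, "s"))
```

Each variable is visited at most once, so cyclic dependencies between fragments do not cause the search to loop.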

Distributed reachability queries: example

1. Dispatch Q to fragments (at Sc)

2. Partial evaluation: generating Boolean equations (at Fi)

3. Assembling: solving equation system (at Sc)

[Figure: coordinator Sc posts Q to fragments F1, F2 and F3 of the example graph (Fred "HR", Walt "HR", Bill "DB", Jack "MK", Emmy "HR", Mat "HR", Pat "SE", Tom "AI", Ross "HR"); the query endpoints are Ann and Mark.]

Distributed bounded reachability queries

1. Dispatch Q to fragments (at Sc)

2. Partial evaluation: generating equations (at Fi)

3. Assembling: solving equation system (at Sc)

[Figure: coordinator Sc and fragments F1, F2 and F3 of the example graph, with query endpoints Ann and Mark. Callouts: variables denoting numeric values; a weighted dependency graph.]

Distributed bounded reachability queries

Performance guarantees: bounded reachability queries can be evaluated with the same performance guarantees as for reachability queries.

Performance guarantees for distributed bounded reachability
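
One way to read "variables denoting numeric values" and "a weighted dependency graph" is sketched below. This is an assumption, not the talk's stated procedure: each variable Xv stands for the distance from v to t, each equation is a minimum over terms "w + Xv’", and the coordinator runs a shortest-path search (Dijkstra is swapped in here) over the weighted dependency graph. The encoding and names are hypothetical.

```python
import heapq
from itertools import count


def assemble_bounded(equations, s, bound):
    """Decide qbr(s, t, bound) from distance equations.

    equations: {var: [(w, next_var), ...]} meaning
    Xvar = min over terms of (w + Xnext_var); next_var None means
    "t is reached after w more hops".
    """
    tie = count()                      # tie-breaker so heap never compares nodes
    dist = {s: 0}
    heap = [(0, next(tie), s)]
    while heap:
        d, _, x = heapq.heappop(heap)
        if d > bound:
            return False               # every remaining candidate is too long
        if x is None:
            return True                # t reached within the bound
        if d > dist.get(x, float("inf")):
            continue                   # stale heap entry
        for w, y in equations.get(x, ()):
            nd = d + w
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, next(tie), y))
    return False


# Hypothetical system: Xs = 2 + Xv; Xv = min(1 + Xv', 4); Xv' = 1.
system = {"s": [(2, "v")], "v": [(1, "v'"), (4, None)], "v'": [(1, None)]}
print(assemble_bounded(system, "s", 5), assemble_bounded(system, "s", 3))
```

The shortest distance from s to t in this hypothetical system is 4, so the query holds for bound 5 and fails for bound 3.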

Distributed regular reachability queries

Performance guarantees: Over a fragmentation F = (F, Gf) of a graph G, regular reachability queries qrr(s, t, R) can be evaluated (a) in O((|Vf|^2 + |Fm|)|R|^2) time, (b) by visiting each site only once, and (c) with the total network traffic bounded by O(|R|^2 |Vf|^2), where Gf = (Vf, Ef) and Fm is the largest fragment in F.

Query automaton Gq(R) of R: <Vq, Eq, Lq, us, ut>

Automaton representation for queries

Query automaton

A node v is a match of state uv in Gq(R) iff (1) they have the same label, and (2) there is a path ρ from v to t and a path ρ’ from uv to ut, s.t. ρ and ρ’ induce the same sequence of labels

Given a graph G, qrr(s, t, R) over G is true if and only if s is a match of us in Gq(R)

[Figure: graph nodes Ann, Fred "HR", Walt "HR", Ross "HR", Emmy "HR", Mat "HR" and Mark "FA", shown against query automaton states labeled DB, HR and FA.]
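
The match condition can be checked on a single, non-distributed graph by walking the graph and the automaton in lockstep. The sketch below is a centralized illustration under assumptions: the automaton is a node-labeled graph with us carrying the label of s and ut the label of t, and the node ids, edges and state names are hypothetical rather than read off the figure.

```python
def regular_reach(g_adj, g_label, q_adj, q_label, s, t, us, ut):
    """Return True iff s is a match of us, i.e. qrr(s, t, R) holds."""
    if g_label[s] != q_label[us]:
        return False
    seen = {(s, us)}
    stack = [(s, us)]
    while stack:
        v, u = stack.pop()
        if v == t and u == ut:
            return True
        # Advance graph and automaton together, keeping labels equal.
        for v2 in g_adj.get(v, ()):
            for u2 in q_adj.get(u, ()):
                if g_label[v2] == q_label[u2] and (v2, u2) not in seen:
                    seen.add((v2, u2))
                    stack.append((v2, u2))
    return False


# Hypothetical automaton for (DB* ∪ HR*) between a "CTO" and an "FA" node.
q_label = {"us": "CTO", "d": "DB", "h": "HR", "ut": "FA"}
q_adj = {"us": ["d", "h", "ut"], "d": ["d", "ut"], "h": ["h", "ut"]}

# Hypothetical graph: a -> b -> c -> m carries labels DB, DB between a and m.
g_label = {"a": "CTO", "b": "DB", "c": "DB", "h1": "HR", "m": "FA"}
g_adj = {"a": ["b", "h1"], "b": ["c"], "c": ["m"], "h1": ["b"]}
print(regular_reach(g_adj, g_label, q_adj, q_label, "a", "m", "us", "ut"))
```

A path that mixes HR and DB labels is rejected because no automaton state allows the switch.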

Distributed regular query evaluation: algorithm

Distributed regular query evaluation: partial evaluation

Partial evaluation by introducing Boolean variables

• For each node v in Fi, assign v.rvec: a vector of O(|Vq|) Boolean formulas, where each entry v.rvec[u] denotes whether v matches u

• Introduce a Boolean variable X(v’,w) for each virtual node v’ of Fi and each state w in Vq, denoting whether v’ matches w

• Partial answer to qrr(s,t,R): a set of Boolean formulas from each in-node of Fi

[Figure: in-nodes v1 and v2 of a fragment carry formula vectors (f11 f12 … f1k) and (f21 f22 … f2k) over automaton states such as vq and wq; a virtual node v’ contributes the variables X(v’,w).]
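
A sketch of how one fragment could fill in these vectors, under assumptions: the fragment knows the labels of its virtual nodes, the automaton is node-labeled, and only in-nodes (plus s, if local) get entries. Each entry is encoded as `True` or as the set of pairs (v’, w) whose variables X(v’,w) it depends on; entries that are trivially false are omitted. All names are illustrative.

```python
def local_eval_regular(local_nodes, adj, label, in_nodes,
                       q_adj, q_label, s, t, us, ut):
    """Return {(v, u): True | set of (v', w)} for one fragment."""
    starts = set(in_nodes)
    if s in local_nodes:
        starts.add(s)
    rvec = {}
    for v in starts:
        for u in q_label:
            if label[v] != q_label[u]:
                continue                       # v cannot match u
            found = (v == t and u == ut)
            pending = set()                    # open variables X(v', w)
            seen = {(v, u)}
            stack = [(v, u)]
            while stack and not found:
                x, q = stack.pop()
                for x2 in adj.get(x, ()):
                    for q2 in q_adj.get(q, ()):
                        if label[x2] != q_label[q2] or (x2, q2) in seen:
                            continue
                        seen.add((x2, q2))
                        if x2 == t and q2 == ut:
                            found = True
                        elif x2 in local_nodes:
                            stack.append((x2, q2))
                        else:
                            pending.add((x2, q2))   # virtual node: stop here
            rvec[(v, u)] = True if found else pending
    return rvec


# Hypothetical fragment {a, b} with a cross edge b -> c (c is virtual).
q_label = {"us": "CTO", "d": "DB", "h": "HR", "ut": "FA"}
q_adj = {"us": ["d", "h", "ut"], "d": ["d", "ut"], "h": ["h", "ut"]}
label = {"a": "CTO", "b": "DB", "c": "DB"}
print(local_eval_regular({"a", "b"}, {"a": ["b"], "b": ["c"]}, label,
                         set(), q_adj, q_label, "a", "m", "us", "ut"))
```

Here the only entry is for (a, us), and it depends on X(c, d): whether the virtual node c matches the DB state is left for the coordinator to resolve.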

Distributed regular query evaluation: assembling

Partial evaluation by introducing Boolean variables

• Collects partial results as sets of Boolean formulas

• Constructs a dependency graph: a node vd for each in-node and each entry of its formula vector, labeled with a Boolean formula, and an edge for each dependency

• Checks whether vd(s, us) can reach vd(t, ut) in the dependency graph

[Figure: dependency graph with nodes vd(s, us), vd(v1, vq), vd(v2, vq), vd(v’, w) and vd(t, ut) = true, built from the formula vectors (f11 f12 … f1k) of the in-nodes.]

Distributed Regular Reachability Evaluation: Example

1. Dispatch Q to fragments (at Sc)

2. Partial evaluation: generating a set of Boolean equations (at Fi)

3. Assembling: solving equation system (at Sc)

[Figure: coordinator Sc and fragments F1, F2 and F3 of the example graph. Callouts: vector of Boolean formulas; test reachability in dependency graph.]

Distributed regular reachability query evaluation

Distributed Reachability with MapReduce

Partial evaluation properly fits in MapReduce framework

[Figure: the coordinator feeds mapper 1, …, mapper m, …, mapper k, whose outputs go to a single reducer.]

1. Generate the query automaton Gq; partition graph G into k fragments, each as a key/value pair (i, <Fi, Gq>)

2. Map function: local evaluation on (i, <Fi, Gq>), generating <1, rvset>

3. Reduce function: assembles the collected partial results and writes <0, ans> to the distributed file system

Cost annotations on the processing path: O(|Fm|), O(|R|^2 |Vf|^2), and O(|Fm| + |R|^2 |Vf|^2)
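
The three steps can be simulated in plain Python without a MapReduce runtime. To stay short, this sketch swaps in plain reachability qr(s, t) for the query automaton Gq, so it illustrates the key/value flow rather than the regular-path algorithm itself. The fragment encoding and the names `map_fn`, `reduce_fn` and `run_job` are assumptions.

```python
def map_fn(key, value):
    """Local evaluation on one fragment; emit <1, rvset>."""
    (local_nodes, adj, in_nodes), (s, t) = value
    rvset = {}
    for v in set(in_nodes) | ({s} & local_nodes):
        seen, stack, found, outside = {v}, [v], v == t, set()
        while stack and not found:
            u = stack.pop()
            for w in adj.get(u, ()):
                if w == t:
                    found = True
                    break
                if w not in seen:
                    seen.add(w)
                    if w in local_nodes:
                        stack.append(w)
                    else:
                        outside.add(w)     # virtual node: leave Xw open
        rvset[v] = True if found else outside
    return (1, rvset)                      # same key: all go to one reducer


def reduce_fn(key, rvsets, s):
    """Assemble the partial answers; emit <0, ans>."""
    equations = {}
    for rvset in rvsets:
        equations.update(rvset)
    seen, stack = {s}, [s]
    while stack:
        rhs = equations.get(stack.pop(), set())
        if rhs is True:
            return (0, True)
        for y in rhs - seen:
            seen.add(y)
            stack.append(y)
    return (0, False)


def run_job(fragments, query):
    pairs = [(i, (frag, query)) for i, frag in enumerate(fragments, start=1)]
    mapped = [map_fn(k, v) for k, v in pairs]        # mappers are independent
    grouped = [rv for key, rv in mapped if key == 1]  # shuffle on key 1
    return reduce_fn(1, grouped, query[0])


# Two hypothetical fragments: {a, b} with cross edge b -> c, and {c, d}.
f1 = ({"a", "b"}, {"a": ["b"], "b": ["c"]}, set())
f2 = ({"c", "d"}, {"c": ["d"]}, {"c"})
print(run_job([f1, f2], ("a", "d")))
```

Emitting every partial answer under the single key 1 is what routes them all to one reducer, which plays the role of the coordinator Sc.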

Experimental Evaluation

Experimental setting

• Real-life datasets

• Synthetic data: larger random graphs following the densification law

• Algorithms: disReach, disReachn and disReachm; disDist and disDistn; disRPQ, disRPQn and disRPQd; MRdRPQ

Distributed reachability

Efficiency and scalability

[Plot annotations: 20% and 6%; 9% of disReachn; three thousand visits over 4 fragments]

disReach outperforms centralized and message-passing approaches

Distributed regular reachability

Efficiency and network traffic

Time: 60% of disRPQn

Traffic: at most 25% and 3%

disRPQ takes much less time and communication cost

Distributed regular reachability (cont.)

Scalability

Scales well with the number of fragments; takes less time over more fragments

disRPQ scales well over the number of fragments

Performance of MapReduce implementation

Efficiency and Scalability

Scales well with the size of fragments

Takes more time over more complex queries

Takes less time with more mappers

Partial evaluation works well in MapReduce model

Conclusion

Distributed reachability querying

• Partial evaluation based distributed evaluation

• Reachability, bounded reachability and regular reachability queries

• Performance guarantees

Partial evaluation can be naturally conducted as MapReduce

Future work

• Distributed evaluation for other queries, e.g., graph pattern matching using simulation

• Combining partial evaluation and incremental computation

Partial evaluation based distributed query evaluation

Thank you!
