Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Distributed Graph Algorithms
Alessio Guerrieri
University of Trento, Italy
2016/04/26
This work is licensed under a Creative CommonsAttribution-ShareAlike 4.0 International License.
references
Contents
1 IntroductionP2PDistance computation
2 Thinking distributivelyBig DataPregel and Vertex-Centric frameworksGraph partitioning
3 Graph algorithmsTriangle countingConnectednessStrongly Connected ComponentsGraph clustering
Introduction
Who am I?
My name is Alessio Guerrieri
Postdoctoral researcher with prof. Montresor
Specialization on distributed large-scale graph processing
Graph G = (V,E)
V set of nodes
E set of edges (links, connections)between pair of nodes
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 1 / 42
Introduction
Today’s topic
Vertex-centric model for graph computation
Possible applications of vertex-centric model
Vertex-centric graph algorithms
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 2 / 42
Introduction P2P
Peer-to-Peer network
Collection of peer(nodes)
Have physicalconnections
Compute an overlaygraph
Nodes know existenceof neighbors
Can discover othernodes via peer-samplingtechniques
Network
TCP/IP
Overlay
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 3 / 42
Introduction P2P
Computing in Peer-to-Peer systems
Peers might want to run protocols on the network:
Finding centrality of peer
Test properties of network
Compute distances
...
Can we simply reuse distributed algorithms for Computer Networks?
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 4 / 42
Introduction Distance computation
Example: decentralized distance computation
Problem
For each peer x in overlay G, compute its distance from target peerTarget.
Vertex-centric model
Initially each peer only knows its direct neighbors in the overlay
Each peer can send messages to each peer of which it knows theexistence
Amount of memory used by each peer should be sub-linear
No control on rate of activity of peers and speed of messages(asynchronous execution)
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 5 / 42
Introduction Distance computation
Distributed Breadth-First-Search
D-BFS protocol executed by p:
upon initialization doif p = Target then
Dp ← 0send 〈Dp + 1〉 to neighbors
elseDp ←∞
upon receive 〈d〉 doif d < Dp then
Dp ← dsend 〈Dp + 1〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42
Introduction Distance computation
Distributed Breadth-First-Search
D-BFS protocol executed by p:
upon initialization doif p = Target then
Dp ← 0send 〈Dp + 1〉 to neighbors
elseDp ←∞
upon receive 〈d〉 doif d < Dp then
Dp ← dsend 〈Dp + 1〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42
Introduction Distance computation
Distributed Breadth-First-Search
D-BFS protocol executed by p:
upon initialization doif p = Target then
Dp ← 0send 〈Dp + 1〉 to neighbors
elseDp ←∞
upon receive 〈d〉 doif d < Dp then
Dp ← dsend 〈Dp + 1〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42
Introduction Distance computation
Distributed Breadth-First-Search
D-BFS protocol executed by p:
upon initialization doif p = Target then
Dp ← 0send 〈Dp + 1〉 to neighbors
elseDp ←∞
upon receive 〈d〉 doif d < Dp then
Dp ← dsend 〈Dp + 1〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42
Introduction Distance computation
Distributed Breadth-First-Search
D-BFS protocol executed by p:
upon initialization doif p = Target then
Dp ← 0send 〈Dp + 1〉 to neighbors
elseDp ←∞
upon receive 〈d〉 doif d < Dp then
Dp ← dsend 〈Dp + 1〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42
Introduction Distance computation
Distributed Breadth-First-Search
D-BFS protocol executed by p:
upon initialization doif p = Target then
Dp ← 0send 〈Dp + 1〉 to neighbors
elseDp ←∞
upon receive 〈d〉 doif d < Dp then
Dp ← dsend 〈Dp + 1〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42
Introduction Distance computation
Distributed Breadth-First-Search
D-BFS protocol executed by p:
upon initialization doif p = Target then
Dp ← 0send 〈Dp + 1〉 to neighbors
elseDp ←∞
upon receive 〈d〉 doif d < Dp then
Dp ← dsend 〈Dp + 1〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42
Introduction Distance computation
Distributed Breadth-First-Search
D-BFS protocol executed by p:
upon initialization doif p = Target then
Dp ← 0send 〈Dp + 1〉 to neighbors
elseDp ←∞
upon receive 〈d〉 doif d < Dp then
Dp ← dsend 〈Dp + 1〉 to neighbors
repeat every ∆ time unitssend 〈Dp + 1〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 7 / 42
Introduction Distance computation
Notes
Works if:
Nodes execute in any order
Messages arrive in any order
Nodes or edges are added
Messages are lost
Fails if:
Nodes fail
Edges are removed
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 8 / 42
Thinking distributively Big Data
New Graph Applications
Many settings in which there is need to analyze graph:
Web data
Computer networks
Biological networks
Social networks
...
Size can be extremely large
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 9 / 42
Thinking distributively Big Data
Approaches
Centralized/Sequential
Collect data in one place
Load it in memory
Analyze it with traditionaltechniques
Decentralized
One machine for each node
Construct overlay network
Run vertex-centric protocol
Both unfeasible! These are better:
Centralized/Parallel
Parallel processes and threads inshared memory
Distributed system
Multiple (not N) machines, eachhosting part of the graph
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 10 / 42
Thinking distributively Big Data
Why distributed?
Disadvantages
Slower than parallelsystems
Younger field ofresearch
Needs distribution ofdata
Advantages
Fault tolerant
Cheaper
Extendible
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 11 / 42
Thinking distributively Big Data
What do we want?
Nice interface for distributed graph algorithms and fast framework torun those algorithms
Graph analyst are not distributed systems experts!
Issues with:
Data replication
Fault tolerance
Message deliverance
...
should be dealt by the system!
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 12 / 42
Thinking distributively Big Data
Vertex centric
Why not reuse vertex-centric interface?
User
Defines data associated tovertex
Defines type of messages
Defines code executed bysingle vertex when it receivesmessages
System
Divides input graph betweenthe available machines
Each machine simulates eachvertex independently
Takes care of fault tolerance
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 13 / 42
Thinking distributively Big Data
Vertex centric distributed system
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 14 / 42
Thinking distributively Pregel and Vertex-Centric frameworks
Pregel
Developed by Google
First large-scale graphprocessing system withvertex-centric interface
Created for PageRank
Code automatically deployedon thousands of machines
Programming API
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 15 / 42
Thinking distributively Pregel and Vertex-Centric frameworks
Pregel-like frameworks
Many:
Giraph: on top of Hadoop
GraphX: on top of Spark
Gelly: on top of Flink
Graphlab: standalone
All frameworks allow user to run vertex-centric programs ondistributed systems
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 16 / 42
Thinking distributively Pregel and Vertex-Centric frameworks
Example (Giraph)
public void compute ( I t e r ab l e<IntWritable> messages ) {int minDist = In t eg e r .MAXVALUE;for ( IntWritab le message : messages ) {
minDist = Math . min (minDist , message . get ( ) ) ;}i f ( minDist < getValue ( ) . get ( ) ) {
setValue (new IntWritab le ( minDist ) ) ;IntWritab le d i s t ance = new IntWritab le ( minDist + 1)for (Edge< . . .> edge : getEdges ( ) ) {
sendMessage ( edge . getTargetVertexId ( ) , d i s t anc e ) ;}
}voteToHalt ( ) ;
}
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 17 / 42
Thinking distributively Pregel and Vertex-Centric frameworks
Common characteristics of Vertex-centric models
Independence from problem size
Code stays the same regardless of
size of graph
number of machines used
Embedded fault-tolerance
Periodic benchmarking
Machines can ”steal nodes”from slower machines
Distributed File System
Usually HDFS
Allow all machines to readand write in common space
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 18 / 42
Thinking distributively Pregel and Vertex-Centric frameworks
Synchronicity of Vertex-centric models
Synchronous
Execution in rounds
Each vertex receives in roundi all messages sent to it inround i− 1
Better guarantees → easiestmodel
Asynchronous
No rounds
Messages processed withoutguarantees
More difficult, more efficient
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 19 / 42
Thinking distributively Graph partitioning
Graph partitioning
First step of analysisWhich machine hosts which subset of nodes/edges?
Balance
Size of partitions should bebalanced
Quality
Messages between nodeshosted in same machine arecheap
Messages between machinesare costly
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 20 / 42
Thinking distributively Graph partitioning
Heuristic solutions
Graph partitioning is NP-Hard
Hash-based heuristics
Node n ends in machine H(n)%K
Iterative algorithms
Start from random andincrementally improve
Heuristics can be devised for specific data-sets
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 21 / 42
Graph algorithms
Graph Algorithms
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 22 / 42
Graph algorithms
What algorithms are available
This area is not completely new: there are distributed algorithms forcomputer networksBut:
Computer networks are small
Many interesting graph problems are not interesting on computernetworks
Some ideas can be taken from research in PRAM ( Parallel RandomAccess Machine)
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 23 / 42
Graph algorithms
Problems
We’ll look at:
Triangle counting
Connected components
Strongly connected components
Clustering
We won’t look at:
Pagerank
Single-Value decomposition
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 24 / 42
Graph algorithms Triangle counting
Triangle Counting
Definition
In an undirected graph Compute, for each node n, how many pairs ofnodes (v, w) exists such that (n, v), (v, w), (w, n) are edges.
Application on social networks
Seems easier than it is...
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 25 / 42
Graph algorithms Triangle counting
Triangle counting protocol
Send neighbor-list to neighbors and see which neighbors we have incommon
Executed by node p
upon initialization doTp ← 0send neighbor-list to neighbors
upon receive 〈M〉 doCommon← neighbor-list
⋂M
Tp ← Tp + |Common|
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 26 / 42
Graph algorithms Triangle counting
Issues
Multi-counting
Each triangle is counted 6 times!
Should make sure that every triangle is counted only once!
Hubs
Hubs are nodes with disproportionally large degree
Exists in almost all real graphs
Will send large neighbor-lists to large number of neighbors
Will receive large quantity of messages
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 27 / 42
Graph algorithms Triangle counting
Triangle counting protocol (2)
Send list of neighbors with smaller id
Executed by node p
upon initialization doTp ← 0M ← {n : n ∈ neighbor-list, idn < idp}send M to neighbors
upon receive 〈M〉 doCommon← neighbor-list
⋂M
Tp ← Tp + |Common|
Cut messages by half
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 28 / 42
Graph algorithms Triangle counting
Triangle counting protocol (3)
Executed by node p
upon initialization doTp ← 0foreach n ∈ neighbor-list do
if idn < idp thenM ← {q : q ∈ neighbor-list, idq < idn}send LIST (M) to n
upon receive 〈LIST (M)〉 from n doCommon← neighbor-list
⋂M
Tp ← Tp + |Common| send TR(|Common|) to nsend TR(1) to v ∈ Common
upon receive 〈TR(t)〉 doTp ← Tp + t
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 29 / 42
Graph algorithms Connectedness
Connected components (problem)
Problem definition
Two nodes v, w of an undirected graph are in the same weaklyconnected component (WCC) if exists a path from v to w.For each node n compute value Cn such that:
∀v, w ∈ V,Cv = Cw ⇐⇒ WCCv = WCCw
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 30 / 42
Graph algorithms Connectedness
Connected components (solution)
Executed by node p:
upon initialization doCp ← p.idsend 〈Cp〉 to neighbors
upon receive 〈c〉 doif c < Cp then
Cp ← csend 〈c〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 31 / 42
Graph algorithms Connectedness
Connected components (solution)
Executed by node p:
upon initialization doCp ← p.idsend 〈Cp〉 to neighbors
upon receive 〈c〉 doif c < Cp then
Cp ← csend 〈c〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 31 / 42
Graph algorithms Connectedness
Connected components (solution)
Executed by node p:
upon initialization doCp ← p.idsend 〈Cp〉 to neighbors
upon receive 〈c〉 doif c < Cp then
Cp ← csend 〈c〉 to neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 31 / 42
Graph algorithms Strongly Connected Components
Strongly connected components
Problem definition
Two nodes v, w of an directed graph are in the same strongly connectedcomponent (SCC) if exist paths from v to w and from w to v (bothdirections).
Centralized solution
Single depth-first search visit using Tarjan’s algorithm
Two depth-first search visits using Kosaraju’s algorithm
Distributed solution much more difficult!
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 32 / 42
Graph algorithms Strongly Connected Components
Strongly Connected components (solution)
SCC: Executed by node p:
upon initialization doFp ← p.id // Lowest id of nodes that reaches pBp ← p.id // Lowest id of nodes reachable from psend 〈FOR(Fp)〉 to out-neighborssend 〈BACK(Bp)〉 to in-neighbors
upon receive 〈FOR(c)〉 doif c < Fp then
Fp ← c send 〈FOR(c)〉 to out-neighbors
upon receive 〈BACK(c)〉 doif c < Bp then
Bp ← c send 〈BACK(c)〉 to in-neighbors
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 33 / 42
Graph algorithms Strongly Connected Components
Properties
Bp = minimum id of nodesreachable from pFp = minimum id of nodes thatcan reach p
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 34 / 42
Graph algorithms Strongly Connected Components
Properties
Bp = minimum id of nodesreachable from pFp = minimum id of nodes thatcan reach p
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 34 / 42
Graph algorithms Strongly Connected Components
Properties
Lemma1
(Fp = Bp = i)→ p and i are in the same SCC
Lemma2
(Fp = Fq and BP = B1 → p and q are in the same SCC
Are the two lemmas correct?
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 35 / 42
Graph algorithms Strongly Connected Components
Properties
Bp = minimum id of nodesreachable from pFp = minimum id of nodes thatcan reach p
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 36 / 42
Graph algorithms Strongly Connected Components
Complete algorithm
Complete algorithm
While graph not empty:
Run SCC algorithm
remove each node p such that Fp = Bp
Can improve number of SCC found through heuristics
Quicker convergence for largest SCC
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 37 / 42
Graph algorithms Graph clustering
Graph clustering
Given a graph, divide its vertices inclusters such that:
Most edges connect nodes in samecluster
Few edges go across cluster
Clusters are significant
Precise definition depends fromapplication scenarioAll versions are NP-Complete
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 38 / 42
Graph algorithms Graph clustering
Label-propagation algorithm
Executed by node p:
upon initialization doCp ← p.id // starting label equal to idNp ← {} // dictionary containing labels of neighborssend Cp to neighbors
upon receive c from q doNp[q] = c // update dictionary with new label of neighbor
repeat every ∆ time unitsc′ ← most common label in neighborsif Cp 6= c′ then
send c′ to neighborsCp ← c′
Note:ties resolved randomlyAlessio Guerrieri (UniTN) DS - Graphs 2016/04/26 39 / 42
Graph algorithms Graph clustering
Sample run
Initialization:
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42
Graph algorithms Graph clustering
Sample run
Active node chooses label 1 (randomly)
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42
Graph algorithms Graph clustering
Sample run
Active node chooses label 3(randomly)
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42
Graph algorithms Graph clustering
Sample run
Active node chooses label 6(randomly)
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42
Graph algorithms Graph clustering
Sample run
Active node chooses label 6
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42
Graph algorithms Graph clustering
Sample run
Active node chooses label 1
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42
Graph algorithms Graph clustering
Sample run
Active node chooses label 1
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42
Graph algorithms Graph clustering
Sample run
Active node chooses label 6
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42
Graph algorithms Graph clustering
Sample run
Converged state: no node change its mind
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42
Graph algorithms Graph clustering
Disadvantages
Non-deterministic
Non-null probability of collapsing to single cluster
Low quality
In practice, it’s not that good
Better options
Slow, iterative algorithms
Fast, heuristics
My algorithm!
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 41 / 42
Graph algorithms Graph clustering
Conclusions
Vertex-centric algorithms can be used in:I P2P systemsI Computer networksI Large-scale graph analysis
Field still in fluxI New frameworksI New programming modelsI New algorithms
Many applicationsI Everything is a graphI Could be interesting for a project
Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 42 / 42
Reading Material
Pregel: A System for Large-Scale Graph Processing by Malewicz et al.
Pregel Algorithms for Graph Connectivity Problems with PerformanceGuarantees by Yan et al.
Giraph http://giraph.apache.org/
Graphx http://spark.apache.org/graphx/