Pregel: A System for Large Scale Graph Processing
Presenter: Xing Feng
University of New South Wales, Australia
Outline
• What is a graph
• Challenge of big graph
• What is Pregel
• Classic graph problems in Pregel
• Improved version of connected component algorithm in Pregel
Graphs
• Definition:
• A graph is a collection of vertices joined by edges: G=(V,E)
• V is a set of vertices; u, v in V are vertices
• E is a set of edges; (u,v) in E is an edge
Graphs
• Undirected Graph:
o A graph whose edges have NO orientation.
• Example: Social Network
o People are modeled as vertices
o Friendships between people are modeled as undirected edges
Graphs
• Directed Graph:
o A graph whose edges have orientations.
• Example: Road Network
o Cities are modeled as vertices
o Highways between cities are modeled as directed edges
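As a small illustration of G=(V,E), both kinds of graph can be stored as adjacency sets. This is a minimal sketch; the function name `build_adjacency` and the two tiny example networks are made up for illustration and do not come from the slides.

```python
def build_adjacency(vertices, edges, directed=False):
    """Return {vertex: set of neighbors}.

    An undirected edge (u, v) is stored in both directions;
    a directed edge (u, v) is stored only as u -> v.
    """
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        if not directed:
            adj[v].add(u)
    return adj

# Hypothetical social network: friendships are undirected edges.
friends = build_adjacency(["ann", "bob", "cat"],
                          [("ann", "bob"), ("bob", "cat")])

# Hypothetical road network: one-way highways are directed edges.
roads = build_adjacency(["A", "B", "C"],
                        [("A", "B"), ("B", "C")], directed=True)
```

In the undirected case `bob` ends up adjacent to both `ann` and `cat`; in the directed case `C` has no outgoing edges.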
Challenges of Real-world Graphs
• Facebook: 1.4+ billion users, 0.4 trillion relationships (in 2014)
• WebGraph 2012: 0.98 billion pages, 42.6 billion hyperlinks (in 2012)
• Disk size of WebGraph 2012: 435 GB, which is impossible to load into the memory of a moderate machine
Bulk Synchronous Parallel model (BSP)
• Input → supersteps (iterations) → output
• Each superstep consists of
1. computation
2. communication
3. synchronization
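The three phases of a superstep can be sketched with Python threads and a barrier. This is only a toy, not how Pregel is implemented: the ring-shaped communication pattern and the name `bsp_ring_sum` are assumptions chosen to keep the example small.

```python
import threading

def bsp_ring_sum(num_workers, num_supersteps):
    """Each worker sums its inbox, adds the sum to a running total, and
    forwards the sum to the next worker in a ring. Messages become
    visible only after the barrier at the end of the superstep."""
    inbox = [[w] for w in range(num_workers)]     # readable in this superstep
    outbox = [[] for _ in range(num_workers)]     # delivered in the next one
    totals = [0] * num_workers
    barrier = threading.Barrier(num_workers)

    def worker(w):
        nonlocal inbox, outbox
        for _ in range(num_supersteps):
            local = sum(inbox[w])                          # 1. computation
            totals[w] += local
            outbox[(w + 1) % num_workers].append(local)    # 2. communication
            if barrier.wait() == 0:                        # 3. synchronization
                # Exactly one thread swaps the buffers for everyone.
                inbox, outbox = outbox, [[] for _ in range(num_workers)]
            barrier.wait()                                 # wait for the swap

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals

totals = bsp_ring_sum(3, 3)   # after 3 supersteps every worker has seen 0, 1 and 2
```

The double buffer (`inbox` / `outbox`) is what makes a message sent in one superstep readable only in the next one.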
Pregel System
• Superstep: the vertices compute in parallel
o Each vertex
  • Receives messages sent in the previous superstep
  • Executes the same user-defined function
  • Modifies its value or the values of its outgoing edges
  • Sends messages to other vertices (to be received in the next superstep)
  • Votes to halt if it has no further work to do
o Termination condition
  • All vertices are inactive
  • There are no messages in transit
o Vertex State Machine: an active vertex becomes inactive when it votes to halt; an inactive vertex becomes active again when it receives a message
Pregel System
• Bulk Synchronous Parallel model (BSP)
• Performed in a series of iterations (supersteps)
• In a superstep, vertices receive messages, execute user-defined functions (UDFs), …
• The system terminates when there are no messages and no computation tasks left
Figure: initialization, then in each superstep the UDF runs on Vertex 0, Vertex 1, ..., followed by message delivery
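The superstep loop and the termination condition can be simulated on a single machine. The sketch below is an assumption-laden toy: `run_pregel`, the `compute` callback signature and the max-propagation example are illustrative names, not the Pregel API.

```python
def run_pregel(values, out_edges, compute, max_supersteps=100):
    """Run compute() on every runnable vertex, superstep by superstep.

    values:    {vertex: value}
    out_edges: {vertex: [neighbor, ...]}
    compute(vertex, value, messages, neighbors, superstep)
        -> (new_value, [(target, message), ...], votes_to_halt)
    """
    active = set(values)                   # every vertex starts active
    inbox = {v: [] for v in values}        # messages from the previous superstep
    superstep = 0
    while superstep < max_supersteps:
        # A halted vertex is reactivated when it has incoming messages.
        runnable = [v for v in values if v in active or inbox[v]]
        if not runnable:                   # all inactive, no messages in transit
            break
        next_inbox = {v: [] for v in values}
        for v in runnable:
            values[v], outgoing, halt = compute(
                v, values[v], inbox[v], out_edges[v], superstep)
            for target, message in outgoing:
                next_inbox[target].append(message)
            if halt:
                active.discard(v)
            else:
                active.add(v)
        inbox = next_inbox                 # barrier: deliver for the next superstep
        superstep += 1
    return values, superstep

def propagate_max(v, value, messages, neighbors, superstep):
    """Example UDF: every vertex ends up holding the largest value in its component."""
    new_value = max([value] + messages)
    changed = superstep == 0 or new_value > value
    outgoing = [(n, new_value) for n in neighbors] if changed else []
    return new_value, outgoing, True       # always vote to halt

chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result, supersteps = run_pregel({"a": 3, "b": 6, "c": 2, "d": 1},
                                chain, propagate_max)
```

Every vertex votes to halt in every superstep; the computation keeps going only because messages reactivate their receivers, and it stops once no message is left.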
Connected Component (CC)
• Input: undirected graph G=(V,E)
• Output: maximal subgraphs in which any two vertices are connected to each other by a path
• Solution
o Single-processor machine: BFS or DFS
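On a single machine, the BFS solution can be sketched as follows. The function name and the small example graph are illustrative assumptions.

```python
from collections import deque

def connected_components(adj):
    """adj: {vertex: iterable of neighbors} for an undirected graph.
    Returns a list of components, each a sorted list of vertices."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        # Start a new BFS from every vertex not reached so far.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            u = queue.popleft()
            component.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(component))
    return components

example = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
components = connected_components(example)
```

Each vertex and each edge is visited a constant number of times, so the running time is linear in the size of the graph.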
Example: CC - BFS (figure sequence showing BFS on an example graph)
Example: CC in Pregel
• Solution on Pregel, Hash-min:
o Initially, each vertex sets its value to its id and sends the id to its neighbors
o In each iteration, each vertex
  • Receives messages from the last iteration
  • Updates its value if it received a smaller id
  • If the value has been updated, sends its value to its neighbors
  • Votes to halt
o Terminates when there is no update in an iteration
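The hash-min steps can be simulated sequentially, one dictionary swap per superstep. This is a sketch, not a distributed implementation; the function name, the message counter and the example graph are assumptions added for illustration.

```python
def hash_min(adj):
    """adj: {vertex_id: [neighbor ids]} for an undirected graph.
    Returns (labels, supersteps, messages_sent); every vertex ends up
    labelled with the smallest id in its connected component."""
    label = {v: v for v in adj}
    inbox = {v: [] for v in adj}
    messages = 0
    # Superstep 0: every vertex sends its id to its neighbors.
    for v in adj:
        for w in adj[v]:
            inbox[w].append(v)
            messages += 1
    supersteps = 1
    while any(inbox.values()):             # stop when no message is in transit
        next_inbox = {v: [] for v in adj}
        for v, received in inbox.items():
            if received and min(received) < label[v]:
                label[v] = min(received)   # a smaller id arrived: update
                for w in adj[v]:           # and tell the neighbors
                    next_inbox[w].append(label[v])
                    messages += 1
        inbox = next_inbox
        supersteps += 1
    return label, supersteps, messages

labels, supersteps, messages = hash_min(
    {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]})
```

Counting `messages` makes it visible that an edge can carry a message in many supersteps, which is where the communication cost of hash-min comes from.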
Example: CC in Pregel (figure sequence showing the Pregel solution on an example graph)
Single Source Shortest Path (SSSP)
• Input: Directed graph G=(V,E) and source vertex s∈V, such that all edge weights are nonnegative
• Output: Lengths of shortest paths from given source vertex s to all other vertices
• Solution
o Single processor machine: Dijkstra’s algorithm
Dijkstra’s algorithm
• Initially, the source vertex sets its estimate to 0 and adds itself to the priority queue Q.
• In each iteration,
o pop the vertex v with the least estimate from Q and add v to C
o add v’s neighbors to Q, or update their estimates if they are already in Q
• C is the result set of vertices with the lengths of their shortest paths
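A compact sketch of the algorithm using Python's `heapq`. Two assumptions are worth flagging: instead of updating an estimate inside Q, the sketch pushes a new entry and skips stale ones when they are popped (lazy deletion), and the edge list below is a reconstruction inferred from the edge weights and traced estimates on the example slides, not something the slides state explicitly.

```python
import heapq

def dijkstra(graph, source):
    """graph: {u: [(v, weight), ...]} with nonnegative weights.
    Returns {vertex: shortest distance} in the order vertices enter C."""
    done = {}                    # the result set C
    queue = [(0, source)]        # the priority queue Q of (estimate, vertex)
    while queue:
        dist, u = heapq.heappop(queue)
        if u in done:            # stale entry: u was already finalized
            continue
        done[u] = dist
        for v, w in graph[u]:
            if v not in done:
                heapq.heappush(queue, (dist + w, v))
    return done

# Assumed edge list for the five-vertex example (v1 is the source).
graph = {
    "v1": [("v2", 10), ("v3", 5)],
    "v2": [("v3", 2), ("v4", 1)],
    "v3": [("v2", 3), ("v4", 9), ("v5", 2)],
    "v4": [("v5", 4)],
    "v5": [("v1", 7), ("v4", 6)],
}
shortest = dijkstra(graph, "v1")
```

With this edge list the vertices are finalized in the order v1, v3, v5, v2, v4 with distances 0, 5, 7, 8, 9, which agrees with the final C on the example slides.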
Example: SSSP - Dijkstra’s Algorithm
Figure sequence: a directed graph on vertices v1 (source) to v5 with edge weights 10, 5, 2, 3, 2, 1, 9, 7, 4, 6. Contents of C and Q after each iteration:
• Start: C: (empty) Q: (v1,0)
• Iteration 1: C: (v1,0) Q: (v2,10) (v3,5)
• Iteration 2: C: (v1,0) (v3,5) Q: (v2,8) (v4,14) (v5,7)
• Iteration 3: C: (v1,0) (v3,5) (v5,7) Q: (v2,8) (v4,13)
• Iteration 4: C: (v1,0) (v3,5) (v5,7) (v2,8) Q: (v4,9)
• Iteration 5: C: (v1,0) (v3,5) (v5,7) (v2,8) (v4,9) Q: (empty)
Example: SSSP in Pregel
• Solution on Pregel:
o Initially, the source vertex sets its value to 0 and sends distance estimates to its outgoing neighbors; all other vertices set their values to MAX (infinity)
o In each iteration, each vertex
  • Receives messages from the last iteration
  • Updates its value if it received a smaller estimate
  • If the value has been updated, sends distance estimates to its outgoing neighbors
  • Votes to halt
o Terminates when there is no update in an iteration
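The same steps can be simulated sequentially to see the values evolve superstep by superstep. This is a sketch under assumptions: `pregel_sssp` and `history` are illustrative names, and the edge list is the reconstruction of the five-vertex example inferred from the weights and message values on the slides.

```python
import math

def pregel_sssp(graph, source):
    """graph: {u: [(v, weight), ...]}. Returns (dist, history), where
    history[k] is a snapshot of all vertex values after superstep k."""
    dist = {v: math.inf for v in graph}    # MAX for every non-source vertex
    dist[source] = 0
    inbox = {v: [] for v in graph}
    for v, w in graph[source]:             # superstep 0: only the source sends
        inbox[v].append(w)
    history = [dict(dist)]
    while any(inbox.values()):             # stop when no message is in transit
        next_inbox = {v: [] for v in graph}
        for u, received in inbox.items():
            if received and min(received) < dist[u]:
                dist[u] = min(received)    # smaller estimate: update
                for v, w in graph[u]:      # and notify outgoing neighbors
                    next_inbox[v].append(dist[u] + w)
        inbox = next_inbox
        history.append(dict(dist))
    return dist, history

# Assumed edge list for the five-vertex example (v1 is the source).
graph = {
    "v1": [("v2", 10), ("v3", 5)],
    "v2": [("v3", 2), ("v4", 1)],
    "v3": [("v2", 3), ("v4", 9), ("v5", 2)],
    "v4": [("v5", 4)],
    "v5": [("v1", 7), ("v4", 6)],
}
dist, history = pregel_sssp(graph, "v1")
```

Unlike Dijkstra's algorithm, a vertex may be updated several times (v4 goes through 11 and then 9 here) because all vertices relax their edges in parallel rather than in order of distance.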
Example: SSSP in Pregel
Figure sequence: the same five-vertex graph with source v1, showing vertex values and messages in transit per superstep:
• Initially: v1 = 0
• Messages sent: 10, 5
• Values: v1 = 0, v2 = 10, v3 = 5
• Messages sent: 11, 7, 12, 8, 14
• Values: v1 = 0, v2 = 8, v3 = 5, v4 = 11, v5 = 7
• Messages sent: 9, 14, 13, 15, 10
• Values: v1 = 0, v2 = 8, v3 = 5, v4 = 9, v5 = 7
• Messages sent: 13
• Final values: v1 = 0, v2 = 8, v3 = 5, v4 = 9, v5 = 7
Example: CC in Pregel
• Solution on Pregel, Hash-min:
o Initially, each vertex sets its value to its id and sends the id to its neighbors
o In each iteration, each vertex
  • Receives messages from its neighbors
  • Updates its value if it received a smaller id
  • Sends its value to its neighbors if the value has been updated
  • Votes to halt
o Terminates when there is no update in an iteration
• Drawback: hash-min has communication cost O(m × #supersteps)
• Improved CC: compute CCs with linear communication cost while retaining the other costs
Framework
• Phase 1: decompose graph G into connected subgraphs g1, g2,...,gi
• Phase 2: merge two subgraphs if they share a common vertex
Phase 1: Graph Decomposition
• Simultaneously conduct BFS from seed vertices
• Challenge: how to select seeds?
1. Randomly select some vertices: some CCs may never be detected
2. Select all vertices: too much work for subgraph merging
Phase 1: Graph Decomposition
• Simultaneously conduct BFS from seed vertices
• Seed selection: randomly sample 𝛽𝑖 − 𝛽𝑖−1 vertices at superstep 𝑖
Phase 2: Subgraph Merging
• Merge all colors received by a vertex.
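The two phases can be illustrated with a sequential toy. Everything specific here is an assumption rather than the authors' implementation: the sample size is read as β^i − β^(i−1) with β = 2, a "color" is the id of the seed whose BFS reached a vertex first, and Phase 2 uses a centralized union-find as a stand-in for merging inside Pregel. The name `gd_cc_sketch` is made up.

```python
import random

def gd_cc_sketch(adj, beta=2, rng_seed=0):
    """adj: {vertex_id: [neighbor ids]} for an undirected graph with
    comparable ids. Returns {vertex: component label}."""
    rng = random.Random(rng_seed)
    color = {}     # Phase 1: vertex -> color of the BFS that reached it first
    parent = {}    # Phase 2: union-find forest over colors

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]      # path halving
            c = parent[c]
        return c

    frontier, i = [], 0
    while frontier or len(color) < len(adj):
        i += 1
        next_frontier = []
        for u in frontier:                     # each BFS grows by one level
            for w in adj[u]:
                if w not in color:
                    color[w] = color[u]
                    next_frontier.append(w)
                elif color[w] != color[u]:     # w received a second color: merge
                    a, b = find(color[u]), find(color[w])
                    parent[max(a, b)] = min(a, b)
        uncolored = [v for v in adj if v not in color]
        k = min(len(uncolored), beta ** i - beta ** (i - 1))
        for s in rng.sample(uncolored, k):     # new seeds start their own BFS
            color[s] = parent[s] = s
            next_frontier.append(s)
        frontier = next_frontier
    return {v: find(color[v]) for v in adj}

labels = gd_cc_sketch({1: [2], 2: [1, 3], 3: [2, 4], 4: [3],
                       5: [6], 6: [5], 7: []})
```

Each vertex joins the frontier once, so each edge is scanned a constant number of times in this toy; colors only spread along edges, so two different components can never be merged.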
Theoretical results
• Number of supersteps: 𝑂 log
• Total communication cost: 𝑂
• Total computation cost: 𝑂
Experimental Setting
• 25 Amazon EC2 r3.2xlarge machines with enhanced networking
• Each machine has 4 cores and 60 GB RAM
Approaches Evaluated
• For computing CCs, we evaluate
1. S-V
2. hash-min
3. single-pivot
4. GD-CC (our approach)
Experimental Result
Figure: evaluating CC computation algorithms
References
• G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proc. of SIGMOD’10, 2010.
• X. Feng, L. Chang, X. Lin, L. Qin, and W. Zhang. Computing Connected Components with Linear Communication Cost in Pregel-like Systems. In Proc. of ICDE’16, 2016.
Thank you! Any questions or comments?
Write Pregel Applications
• Writing a Pregel program
o Subclass the predefined Vertex class
Figure: Vertex class listing, annotated with "Override this!", "in msgs" and "out msg"
Example: Vertex Class for SSSP
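A Python analog of the subclassing pattern, loosely modeled on the C++ vertex listing in the Pregel paper (Malewicz et al.). The class and method names (`Vertex`, `compute`, `send_message_to`, `vote_to_halt`), the driver `run`, and the example edge list are assumptions made for this sketch and are not the actual Pregel API.

```python
import math

class Vertex:
    """Hypothetical base class; a user program overrides compute()."""
    def __init__(self, vertex_id, value, out_edges):
        self.id = vertex_id
        self.value = value
        self.out_edges = out_edges         # [(target id, edge weight), ...]
        self.active = True
        self.outbox = []

    def compute(self, messages, superstep):   # override this; messages = "in msgs"
        raise NotImplementedError

    def send_message_to(self, target, message):   # "out msg"
        self.outbox.append((target, message))

    def vote_to_halt(self):
        self.active = False

class ShortestPathVertex(Vertex):
    def __init__(self, vertex_id, out_edges, is_source=False):
        super().__init__(vertex_id, math.inf, out_edges)
        self.is_source = is_source

    def compute(self, messages, superstep):
        candidate = min([0 if self.is_source else math.inf] + messages)
        if candidate < self.value:
            self.value = candidate
            for target, weight in self.out_edges:
                self.send_message_to(target, candidate + weight)
        self.vote_to_halt()

def run(vertices):
    """Toy sequential driver: vertices is {id: Vertex}."""
    inbox = {vid: [] for vid in vertices}
    superstep = 0
    while True:
        runnable = [v for v in vertices.values() if v.active or inbox[v.id]]
        if not runnable:                   # all halted, no messages in transit
            return
        next_inbox = {vid: [] for vid in vertices}
        for v in runnable:
            v.active = True                # an incoming message reactivates a vertex
            v.compute(inbox[v.id], superstep)
            for target, message in v.outbox:
                next_inbox[target].append(message)
            v.outbox.clear()
        inbox = next_inbox
        superstep += 1

edges = {"v1": [("v2", 10), ("v3", 5)], "v2": [("v3", 2), ("v4", 1)],
         "v3": [("v2", 3), ("v4", 9), ("v5", 2)], "v4": [("v5", 4)],
         "v5": [("v1", 7), ("v4", 6)]}
vertices = {vid: ShortestPathVertex(vid, out, is_source=(vid == "v1"))
            for vid, out in edges.items()}
run(vertices)
distances = {vid: v.value for vid, v in vertices.items()}
```

The user code is confined to `compute()`; message delivery, reactivation and termination are handled by the driver, which is the division of labor the subclassing design aims for.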