Argo: Architecture-Aware Graph Partitioning
Angen Zheng
Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange
Department of Computer Science, University of Pittsburgh
http://db.cs.pitt.edu/group/
http://www.prognosticlab.org/
1
Big Graphs Are Everywhere [SIGMOD’16 Tutorial]
2
Balanced Partitioning = Even Load Distribution
Minimal Edge-Cut = Minimal Data Communication
[Figure: a graph partitioned across three nodes, N1, N2, and N3]
Assumption: Network is the bottleneck. 3
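To make these two objectives concrete, here is a minimal sketch (not from the talk) that computes the edge-cut and the load imbalance of a given partition assignment; the graph representation and function names are illustrative only.

```python
from collections import defaultdict

def edge_cut(edges, part):
    """Number of edges whose endpoints land in different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def load_imbalance(part, num_parts):
    """Max partition size divided by the average partition size (1.0 = balanced)."""
    sizes = defaultdict(int)
    for p in part.values():
        sizes[p] += 1
    avg = len(part) / num_parts
    return max(sizes.values()) / avg

# Toy example: 6 vertices split across 2 partitions.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(edge_cut(edges, part))        # 2 cut edges
print(load_imbalance(part, 2))      # 1.0 => perfectly balanced
```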
The End of Slow Networks: Network is now as fast as DRAM [C. Binnig, VLDB’15]
✓ Dual-socket Xeon E5v2 server with
  ○ DDR3-1600
  ○ 2 FDR 4x NICs per socket
✓ InfiniBand: 1.7GB/s~37.5GB/s
✓ DDR3: 6.25GB/s~16.6GB/s
4
The End of Slow Networks: Does edge-cut still matter?
5
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
6
The End of Slow Networks: Does edge-cut still matter?
Graph Partitioners: METIS and LDG
Graph Workloads: BFS, SSSP, and PageRank
Graph Dataset: Orkut (|V|=3M, |E|=234M)
Number of Partitions: 16 (one partition per core)
7
The End of Slow Networks: Does edge-cut still matter?
SSSP Execution Time (s)
m:s:c    METIS    LDG
1:2:8    633      2,632
2:2:4    654      2,565
4:2:2    521      631
8:2:1    222      280
(up to 9x difference between the densest and the sparsest configuration)
m: # of machines used; s: # of sockets used per machine; c: # of cores used per socket
✓ Denser configurations had longer execution times.
  ○ Contention on the memory subsystems impacted performance.
  ○ Network may not always be the bottleneck.
8
The End of Slow Networks: Does edge-cut still matter?
SSSP LLC Misses (in Millions)
m:s:c    METIS    LDG
1:2:8    10,292   44,117
2:2:4    10,626   44,689
4:2:2    2,541    1,061
8:2:1    96       187
(up to 235x difference in LLC misses between the densest and the sparsest configuration)
9
The End of Slow Networks: Does edge-cut still matter?
✓ METIS had lower execution time and fewer LLC misses than LDG (see the tables above).
  ○ Edge-cut matters.
  ○ Higher edge-cut --> more communication --> more contention.
The distribution of edge-cut matters.
12
Yes! Both edge-cut and its distribution matter!
✓ Intra-node and inter-node data communication have different performance impacts on the memory subsystems of modern multicore machines.
The End of Slow Networks: Does edge-cut still matter?
13
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
14
Intra-Node Data Comm: Shared Memory
[Figure: the sending core loads a message from its send buffer and writes it into a shared buffer; the receiving core then loads it from the shared buffer and writes it into its receive buffer.]
Extra Memory Copy
15
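A minimal sketch (not from the talk) of this two-copy path, using Python's multiprocessing shared memory; the buffer names and sizes are illustrative.

```python
from multiprocessing import shared_memory

MSG_SIZE = 64  # illustrative message size

# Shared buffer visible to both the sending and the receiving core.
shm = shared_memory.SharedMemory(create=True, size=MSG_SIZE)

# Sending side: copy #1, private send buffer -> shared buffer.
send_buffer = b"vertex update".ljust(MSG_SIZE, b"\0")
shm.buf[:MSG_SIZE] = send_buffer

# Receiving side: copy #2, shared buffer -> private receive buffer.
receive_buffer = bytes(shm.buf[:MSG_SIZE])

# Both copies go through the cache hierarchy, so heavy intra-node
# traffic pollutes the LLC and consumes memory bandwidth.
shm.close()
shm.unlink()
```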
Intra-Node Data Comm: Shared Memory
[Figure: the send, shared, and receive buffers are all cached, causing LLC and memory bandwidth contention and cache pollution.]
16
Intra-Node Data Comm: Shared Memory
[Figure: the send/shared buffers are cached by the sending core and the receive/shared buffers by the receiving core, again causing LLC and memory bandwidth contention and cache pollution.]
17
Excess intra-node data communication may hurt performance.
18
Inter-Node Data Comm: RDMA Read/Write
[Figure: the send buffer of the sending core on Node#1 is transferred by the InfiniBand HCA directly into the receive buffer on Node#2.]
No extra memory copy and no cache pollution.
19
Offloading excess intra-node data comm across nodes may achieve better performance.
20
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
21
Argo: Graph Partitioning Model
Streaming Graph Partitioning Model [I. Stanton, KDD’12]
[Figure: vertices arrive one at a time from a vertex stream; the partitioner assigns each vertex to a partition as it arrives.]
22
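A minimal sketch of how a streaming partitioner consumes the vertex stream, assuming the standard LDG-style greedy rule from Stanton & Kliot rather than Argo's exact objective; all names (capacity, neighbors) are illustrative.

```python
def stream_partition(vertex_stream, neighbors, num_parts, capacity):
    """Greedy streaming partitioning (LDG-style).

    Each vertex is placed, as it arrives, on the partition holding most of
    its already-placed neighbors, discounted by how full that partition is.
    """
    assignment = {}                      # vertex -> partition id
    sizes = [0] * num_parts              # current partition loads

    for v in vertex_stream:
        best, best_score = 0, float("-inf")
        for p in range(num_parts):
            placed_neighbors = sum(1 for u in neighbors[v] if assignment.get(u) == p)
            score = placed_neighbors * (1.0 - sizes[p] / capacity)
            if score > best_score:
                best, best_score = p, score
        assignment[v] = best
        sizes[best] += 1

    return assignment
```

Argo keeps this one-pass streaming loop but changes the placement score, as the next two slides describe.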
Argo: Architecture-Aware Vertex Placement
Place vertex v in the partition Pi that maximizes:
[Formula: the weighted edge-cut gain of placing v on Pi (weighted by the relative network comm cost), combined with a penalty based on the load of Pi.]
✓ Weighted by the relative network comm cost, Argo will
  ○ avoid edge-cut across nodes (inter-node data comm).
Great for cases where the network is the bottleneck. 23
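A hedged sketch of how such a cost-weighted score could look, extending the streaming loop above; the comm_cost table (relative network comm cost between the cores hosting two partitions) and all names are my own illustration, not Argo's exact formula.

```python
def argo_h_score(v, p, neighbors, assignment, sizes, capacity, comm_cost, max_cost):
    """Illustrative ARGO-H-style score for placing vertex v on partition p.

    comm_cost[p][q] is the relative network comm cost between the cores
    hosting partitions p and q (intra-socket < inter-socket < inter-node).
    A placed neighbor u pulls v toward p in proportion to how expensive its
    edge would be to cut (max_cost - comm_cost is largest when u is already
    on p), so inter-node edge-cut is avoided first. The LDG-style factor
    penalizes nearly-full partitions.
    """
    pull = sum(max_cost - comm_cost[p][assignment[u]]
               for u in neighbors[v] if u in assignment)
    return pull * (1.0 - sizes[p] / capacity)
```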
Argo: Architecture-Aware Vertex Placement
[Formula: the refined intra-node network comm cost is derived from the original intra-node network comm cost, the maximal inter-node network comm cost, and the degree of contention (𝞴 ∈ [0, 1]).]
✓ Weighted by the refined relative network comm cost, Argo will
  ○ avoid edge-cut across cores of the same node (intra-node data comm).
Bottleneck:  Network (𝞴=0)  |  Memory (𝞴=1)
24
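One plausible reading of this refinement, sketched below under the assumption that 𝞴 linearly interpolates the intra-node cost toward the maximal inter-node cost; this exact form is my guess, not taken from the slide.

```python
def refined_intra_node_cost(original_intra_cost, max_inter_cost, contention):
    """Interpolate the intra-node comm cost by the degree of contention.

    contention (lambda) = 0: network is the bottleneck, keep the original
    (cheap) intra-node cost, so the placement behaves like ARGO-H.
    contention (lambda) = 1: memory is the bottleneck, intra-node comm is
    treated as being as expensive as the worst inter-node comm, so the
    partitioner also avoids edge-cut across cores of the same node.
    """
    assert 0.0 <= contention <= 1.0
    return (1.0 - contention) * original_intra_cost + contention * max_inter_cost
```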
Roadmap
Introduction
Does edge-cut still matter?
Why does edge-cut still matter?
Argo
Evaluation
Conclusions
25
Three Classic Graph Workloads
  o Breadth First Search (BFS)
  o Single Source Shortest Path (SSSP)
  o PageRank
Three Real-World Large Graphs
Evaluation: Workloads & Datasets
Dataset |V| |E|
Orkut 3M 234M
Friendster 124M 3.6B
Twitter 52M 3.9B
26
Evaluation: Platform
Cluster Configuration
  # of Nodes: 32
  Network Topology: FDR InfiniBand (Single Switch)
  Network Bandwidth: 56Gbps
Compute Node Configuration
  # of Sockets: 2 (Intel Haswell, 10 cores / socket)
  L3 Cache: 25MB
27
METIS: the most well-known multi-level partitioner.
LDG: the most well-known streaming partitioner.
ARGO-H: assumes the network is the bottleneck.
  o weights edge-cut by the original network comm costs.
ARGO: assumes the memory subsystem is the bottleneck.
  o weights edge-cut by the refined network comm costs.
Evaluation: Partitioners
28
Evaluation: SSSP Exec. Time on Orkut dataset
Message Grouping Size: the number of messages sent by a single SSSP process to the same destination that are grouped into one message (a sketch follows this slide).
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Bar chart: SSSP execution time for each partitioner across message grouping sizes; annotated ratios range from 1x to 5x.]
✓ ARGO had the lowest SSSP execution time. 29
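A minimal sketch of message grouping (my illustration, not the system's actual code): outgoing messages to the same destination are buffered and flushed as one combined message once the group size is reached.

```python
from collections import defaultdict

class MessageGrouper:
    """Buffer messages per destination and send them in groups."""

    def __init__(self, group_size, send_fn):
        self.group_size = group_size
        self.send_fn = send_fn                  # send_fn(dest, list_of_msgs)
        self.pending = defaultdict(list)

    def send(self, dest, msg):
        self.pending[dest].append(msg)
        if len(self.pending[dest]) >= self.group_size:
            self.flush(dest)

    def flush(self, dest):
        if self.pending[dest]:
            self.send_fn(dest, self.pending.pop(dest))

# Usage: grouper = MessageGrouper(512, network_send); grouper.send(node_id, update)
```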
Evaluation: SSSP LLC Misses on Orkut dataset
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Bar chart: SSSP LLC misses for each partitioner across message grouping sizes; annotated ratios range from 1x to 50x.]
✓ ARGO had the lowest LLC Misses.
30
Evaluation: SSSP Comm Vol. on Orkut dataset
★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines
[Chart: communication volume breakdown per partitioner; intra-socket share: METIS 69%, LDG 49%, ARGO-H 70%.]
✓ ARGO had the lowest intra-node communication volume.
✓ Distribution of the edge-cut also matters.
31
Evaluation: SSSP Exec. Time vs Graph Size
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80 Partitions: four 20-core machines
★ Message Grouping Size: 512
✓ ARGO had the lowest SSSP execution time.
✓ Up to 6x improvement over ARGO-H.
✓ Improvement became larger as the graph size increased.
32
Evaluation: SSSP Exec. Time vs # of Partitions
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four to ten 20-core machines
★ Message Grouping Size: 512
✓ ARGO always outperformed LDG and ARGO-H.
✓ Up to 11x improvement over ARGO-H.
33
Evaluation: SSSP Exec. Time vs # of Partitions
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four to ten 20-core machines
★ Message Grouping Size: 512
[Chart annotations: 160 partitions = 13h, 180 partitions = 6h.]
✓ Hours of CPU time saved.
34
Evaluation: Partitioning Overhead
★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four to ten 20-core machines
[Chart: partitioning time as a percentage of the CPU time saved (SSSP execution), versus # of partitions.]
✓ ARGO is indeed slower than LDG.
✓ The overhead was negligible in comparison to the CPU time saved.
✓ Graph analytics usually have much longer execution times.
35
Conclusions
Findings
  o Network is not always the bottleneck.
  o Contention on the memory subsystems can hurt performance significantly due to excess intra-node data comm.
  o Both edge-cut and its distribution matter.
ARGO
  o Avoids contention by offloading excess intra-node data comm across nodes.
  o Achieves up to 11x improvement on real-world workloads.
  o Scales well in terms of both graph size and number of partitions.
Acknowledgments: Peyman Givi, Patrick Pisciuneri
Funding: NSF CBET-1609120, NSF CBET-1250171, BigData’16 Student Travel Award
Thanks!
36