
Argo: Architecture-Aware Graph Partitioning

Angen Zheng

Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange

Department of Computer Science, University of Pittsburgh

http://db.cs.pitt.edu/group/

http://www.prognosticlab.org/

1




Big Graphs Are Everywhere [SIGMOD’16 Tutorial]

2

A Balanced Partitioning = Even Load Distribution

Minimal Edge-Cut = Minimal Data Comm

[Figure: a graph partitioned across three nodes: N1, N2, N3]

Assumption: Network is the bottleneck.

3
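The two partitioning goals on this slide can be made concrete with a small sketch. The function names, the toy graph, and the imbalance metric (largest partition size over the average size) are illustrative assumptions, not definitions taken from the slides.

```python
from collections import Counter

def edge_cut(edges, part):
    """Count edges whose two endpoints are assigned to different partitions."""
    return sum(1 for u, v in edges if part[u] != part[v])

def load_imbalance(part, num_parts):
    """Largest partition size divided by the ideal (average) size; 1.0 is perfectly even."""
    sizes = Counter(part.values())
    ideal = len(part) / num_parts
    return max(sizes.get(p, 0) for p in range(num_parts)) / ideal

# Hypothetical toy graph: 6 vertices placed on 3 partitions (N1..N3 -> 0..2).
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
part = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}

cut = edge_cut(edges, part)           # edges that imply data communication
imbalance = load_imbalance(part, 3)   # 1.0 means an even load distribution
print(cut, imbalance)
```

Under the slide's assumption that the network is the bottleneck, a partitioner would try to lower `cut` while keeping `imbalance` close to 1.0.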

The End of Slow Networks: Network is now as fast as DRAM [C. Binnig, VLDB’15]

✓ Dual-socket Xeon E5v2 server with
○ DDR3-1600
○ 2 FDR 4x NICs per socket

✓ InfiniBand: 1.7GB/s~37.5GB/s

✓ DDR3: 6.25GB/s~16.6GB/s

4

The End of Slow Networks: Does edge-cut still matter?

5

Roadmap

Introduction

Does edge-cut still matter?

Why does edge-cut still matter?

Argo

Evaluation

Conclusions

6


Graph Partitioners: METIS and LDG

Graph Workloads: BFS, SSSP, and PageRank

Graph Dataset: Orkut (|V| = 3M, |E| = 234M)

Number of Partitions: 16 (one partition per core)

7


SSSP Execution Time (s)

m:s:c    METIS    LDG
1:2:8    633      2,632
2:2:4    654      2,565
4:2:2    521      631
8:2:1    222      280

9x (LDG, 1:2:8 vs. 8:2:1)

m: # of machines used
s: # of sockets used per machine
c: # of cores used per socket

✓ Denser configurations had longer execution time.
○ Contention on the memory subsystems impacted performance.
○ Network may not always be the bottleneck.

8



SSSP LLC Misses (in Millions)

m:s:c    METIS     LDG
1:2:8    10,292    44,117
2:2:4    10,626    44,689
4:2:2    2,541     1,061
8:2:1    96        187

Execution time: 9x; LLC misses: 235x (LDG, 1:2:8 vs. 8:2:1)
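The 9x and 235x annotations match the LDG column when the densest configuration (1:2:8) is divided by the sparsest (8:2:1). The slides do not state this reading explicitly, so the check below is an interpretation; the numbers themselves are copied from the two tables.

```python
# SSSP results keyed by m:s:c, copied from the execution-time and LLC-miss tables.
metis_exec_time_s = {"1:2:8": 633, "2:2:4": 654, "4:2:2": 521, "8:2:1": 222}
ldg_exec_time_s = {"1:2:8": 2632, "2:2:4": 2565, "4:2:2": 631, "8:2:1": 280}
ldg_llc_misses_m = {"1:2:8": 44117, "2:2:4": 44689, "4:2:2": 1061, "8:2:1": 187}

def densest_to_sparsest(table):
    """Ratio of the single-machine (1:2:8) value to the eight-machine (8:2:1) value."""
    return table["1:2:8"] / table["8:2:1"]

time_ratio = densest_to_sparsest(ldg_exec_time_s)     # roughly 9x
miss_ratio = densest_to_sparsest(ldg_llc_misses_m)    # roughly 235x
metis_ratio = densest_to_sparsest(metis_exec_time_s)  # smaller gap for METIS
print(f"LDG time {time_ratio:.1f}x, LDG LLC misses {miss_ratio:.1f}x, "
      f"METIS time {metis_ratio:.1f}x")
```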


9



The distribution of edge-cut matters.



✓ METIS had lower execution time and LLC misses than LDG.
○ Edge-cut matters.
○ Higher edge-cut → higher comm → higher contention

12

Yes! Both edge-cut and its distribution matter!

✓ Intra-Node and Inter-Node Data Communication
○ Have different performance impacts on the memory subsystems of modern multicore machines.


13

Roadmap

Introduction



Argo

Evaluation

Conclusions

14

Intra-Node Data Comm: Shared Memory

[Figure: Sending Core with a Send Buffer, a Shared Buffer, and Receiving Core with a Receive Buffer; labeled steps: 1. Load, 2a. Load, 2b. Write, 3. Load, 4a. Load, 4b. Write]

Extra Memory Copy

15

Cached Send/Shared/Receive Buffer


LLC and Memory Bandwidth Contention

Cache Pollution

16

Cached Send/Shared Buffer Cached Receive/Shared Buffer

LLC and Memory Bandwidth Contention

Cache Pollution


17

Excess intra-node data communication may hurt performance.

18

Inter-Node Data Comm: RDMA Read/Write

[Figure: Node#1 (Send Buffer, Sending Core, IB HCA) connected to Node#2 (Receive Buffer, Sending Core, IB HCA)]

No Extra Memory Copy and Cache Pollution
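The difference between the two communication paths can be illustrated by counting buffer copies. This is a toy model only: it assumes the intra-node path stages data through a shared buffer and the RDMA path writes the destination buffer once, and it does not model caches, the HCA, or any real RDMA API. Both function names are hypothetical.

```python
def shared_memory_transfer(send_buf: bytes):
    """Toy intra-node path: send buffer -> shared buffer -> receive buffer."""
    shared_buf = bytearray(len(send_buf))
    recv_buf = bytearray(len(send_buf))
    shared_buf[:] = send_buf    # sender loads its send buffer and writes the shared buffer
    recv_buf[:] = shared_buf    # receiver loads the shared buffer and writes its receive buffer
    bytes_copied = 2 * len(send_buf)  # the staging step is the extra memory copy
    return bytes(recv_buf), bytes_copied

def rdma_like_transfer(send_buf: bytes):
    """Toy inter-node path: data lands in the receive buffer with no staging copy."""
    recv_buf = bytearray(len(send_buf))
    recv_buf[:] = send_buf
    return bytes(recv_buf), len(send_buf)

payload = b"vertex-update"
received_shm, copied_shm = shared_memory_transfer(payload)
received_rdma, copied_rdma = rdma_like_transfer(payload)
print(copied_shm, copied_rdma)
```

In this model the shared-memory path moves twice as many bytes through the memory hierarchy per message, which is the effect the slides associate with LLC contention and cache pollution.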

19

Offloading excess intra-node data comm across nodes may achieve better performance.

20

Roadmap

Introduction



Argo

Evaluation

Conclusions

21

Argo: Graph Partitioning Model

Streaming Graph Partitioning Model [I. Stanton, KDD’12]

[Figure: Vertex Stream → Partitioner → partitions]
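For reference, a streaming partitioner sees each vertex once, together with its neighbor list, and places it immediately. The sketch below uses the Linear Deterministic Greedy (LDG) rule from the cited streaming model: the number of already-placed neighbors in a partition, scaled by that partition's remaining capacity. The function name, the tie-breaking rule, and the toy stream are assumptions made for illustration.

```python
def ldg_partition(vertex_stream, num_parts, capacity):
    """Place each streamed (vertex, neighbors) pair using the LDG heuristic.

    Assumes num_parts * capacity is at least the number of streamed vertices.
    """
    assignment = {}
    sizes = [0] * num_parts
    for v, neighbors in vertex_stream:
        best, best_score = None, -1.0
        for p in range(num_parts):
            if sizes[p] >= capacity:
                continue  # partition is full
            placed = sum(1 for n in neighbors if assignment.get(n) == p)
            score = placed * (1.0 - sizes[p] / capacity)
            # Ties go to the less loaded partition (an illustrative choice).
            if score > best_score or (score == best_score and sizes[p] < sizes[best]):
                best, best_score = p, score
        assignment[v] = best
        sizes[best] += 1
    return assignment

# Hypothetical stream: two triangles, arriving one vertex at a time.
stream = [(0, [1, 2]), (1, [0, 2]), (2, [0, 1]),
          (3, [4, 5]), (4, [3, 5]), (5, [3, 4])]
placement = ldg_partition(stream, num_parts=2, capacity=3)
print(placement)
```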

22

Argo: Architecture-Aware Vertex Placement

Place vertex v in the partition Pi that maximizes:

[Equation: a weighted edge-cut term, with a penalty on the placement based on the load of Pi]

✓ Weighted by the relative network comm cost, Argo will
○ avoid edge-cut across nodes (inter-node data comm).

Great for cases where the network is the bottleneck.

23
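The slide's scoring formula did not survive extraction, so the following is only one plausible form consistent with its annotations (a cost-weighted edge-cut term and a load penalty). It assumes an LDG-style multiplicative penalty and a hypothetical cost matrix `cost[i][j]` giving the relative communication cost between partitions i and j. It is not Argo's actual formula.

```python
def weighted_placement(neighbors, assignment, sizes, capacity, cost):
    """Pick the partition that saves the most communication cost, scaled by free capacity.

    Assumed form: each placed neighbor contributes (max_cost - cost[p][its partition]),
    so cheap-to-reach neighbors count for more than expensive ones.
    """
    max_cost = max(max(row) for row in cost)
    best, best_score = None, float("-inf")
    for p in range(len(sizes)):
        if sizes[p] >= capacity:
            continue  # partition is full
        saved = sum(max_cost - cost[p][assignment[n]]
                    for n in neighbors if n in assignment)
        score = saved * (1.0 - sizes[p] / capacity)  # penalize loaded partitions
        if score > best_score:
            best, best_score = p, score
    return best

# Hypothetical layout: partitions 0,1 share one node; partitions 2,3 share another.
# Assumed relative costs: 0 within a partition, 1 intra-node, 10 inter-node.
cost = [[0, 1, 10, 10],
        [1, 0, 10, 10],
        [10, 10, 0, 1],
        [10, 10, 1, 0]]
assignment = {"a": 0, "b": 0, "c": 2}   # already-placed neighbors of the new vertex
sizes = [3, 1, 1, 0]
choice = weighted_placement(["a", "b", "c"], assignment, sizes, capacity=4, cost=cost)
print(choice)
```

With these numbers the nearly full partition 0 loses to partition 1, which sits on the same node as most neighbors, so the chosen placement keeps the resulting cut edges off the network.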

Argo: Architecture-Aware Vertex Placement

[Equation: Refined Intra-Node Network Comm Cost, defined from the Original Intra-Node Network Comm Cost, the Maximal Inter-Node Network Comm Cost, and the Degree of Contention (λ ∈ [0, 1])]

Bottleneck: Network (λ = 0) to Memory (λ = 1)

✓ Weighted by the refined relative network comm cost, Argo will
○ avoid edge-cut across cores of the same node (intra-node data comm).
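The refinement formula itself is not recoverable from the slide text. A linear interpolation is one form consistent with the labels: at λ = 0 the intra-node cost keeps its original value (network is the bottleneck), and at λ = 1 it rises to the maximal inter-node cost (memory is the bottleneck). The sketch below assumes that form; `refine_costs` and `node_of` are hypothetical names.

```python
def refine_costs(cost, node_of, lam):
    """Raise intra-node costs toward the maximal inter-node cost by the factor lam.

    Assumed form: refined = original + lam * (max_inter_node - original).
    Requires partitions on at least two different nodes.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    k = len(cost)
    max_inter = max(cost[i][j] for i in range(k) for j in range(k)
                    if node_of[i] != node_of[j])
    refined = [row[:] for row in cost]
    for i in range(k):
        for j in range(k):
            if i != j and node_of[i] == node_of[j]:
                refined[i][j] = cost[i][j] + lam * (max_inter - cost[i][j])
    return refined

# Hypothetical costs: 1 between partitions on the same node, 10 across nodes.
cost = [[0, 1, 10, 10],
        [1, 0, 10, 10],
        [10, 10, 0, 1],
        [10, 10, 1, 0]]
node_of = [0, 0, 1, 1]

network_bound = refine_costs(cost, node_of, 0.0)  # intra-node cost unchanged
balanced = refine_costs(cost, node_of, 0.5)       # intra-node cost halfway up
memory_bound = refine_costs(cost, node_of, 1.0)   # intra-node as costly as inter-node
print(network_bound[0][1], balanced[0][1], memory_bound[0][1])
```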

24

Roadmap

Introduction



Argo

Evaluation

Conclusions

25

Evaluation: Workloads & Datasets

Three Classic Graph Workloads
o Breadth First Search (BFS)
o Single Source Shortest Path (SSSP)
o PageRank

Three Real-World Large Graphs

Dataset       |V|     |E|
Orkut         3M      234M
Friendster    124M    3.6B
Twitter       52M     3.9B

26

Evaluation: Platform

Cluster Configuration
# of Nodes: 32
Network Topology: FDR Infiniband (Single Switch)
Network Bandwidth: 56Gbps

Compute Node Configuration
# of Sockets: 2 Intel Haswell (10 cores / socket)
L3 Cache: 25MB

27

Evaluation: Partitioners

METIS: the most well-known multi-level partitioner.

LDG: the most well-known streaming partitioner.

ARGO-H: network is the bottleneck.
o weight edge-cut by the original network comm costs.

ARGO: memory is the bottleneck.
o weight edge-cut by the refined network comm costs.

28

Evaluation: SSSP Exec. Time on Orkut dataset

★ Orkut: |V| = 3M, |E| = 234M
★ 60 Partitions: three 20-core machines

[Figure: SSSP execution time per partitioner, by Message Grouping Size (group multiple msgs sent by a single SSSP process to the same destination into one msg); bars are labeled with relative ratios from 1x to 5x]

✓ ARGO had the lowest SSSP execution time.

29
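Message grouping, as described on this slide, batches several messages bound for the same destination into one. A minimal sketch of that idea follows; the function name, the (destination, payload) message shape, and the flush-at-end behavior are assumptions rather than details from the slides.

```python
from collections import defaultdict

def group_messages(messages, group_size):
    """Batch (destination, payload) messages per destination, up to group_size each."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    pending = defaultdict(list)
    batches = []
    for dest, payload in messages:
        pending[dest].append(payload)
        if len(pending[dest]) == group_size:
            batches.append((dest, pending.pop(dest)))  # send one full grouped message
    # Flush partially filled groups at the end of the step.
    for dest, payloads in pending.items():
        batches.append((dest, payloads))
    return batches

# Hypothetical SSSP updates addressed to partitions 1 and 2.
msgs = [(1, "a"), (2, "b"), (1, "c"), (1, "d"), (2, "e")]
batches = group_messages(msgs, group_size=2)
print(batches)
```

Five individual messages become three grouped ones here; larger grouping sizes trade fewer sends for bigger buffers.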

Evaluation: SSSP LLC Misses on Orkut dataset

★ Orkut: |V| = 3M, |E| = 234M

[Figure: SSSP LLC misses per partitioner, by Message Grouping Size; bars are labeled with relative ratios from 1x to 50x]

✓ ARGO had the lowest LLC Misses.

30

Evaluation: SSSP Comm Vol. on Orkut dataset

★ Orkut: |V| = 3M, |E| = 234M

[Figure: SSSP communication volume per partitioner; recovered labels: "64", "Intra-Socket", METIS 69%, LDG 49%, ARGO-H 70%]

✓ ARGO had the lowest intra-node communication volume.
✓ Distribution of the edge-cut also matters.

31

Evaluation: SSSP Exec. Time vs Graph Size

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80 Partitions: four 20-core machines
★ Message Grouping Size: 512

✓ ARGO had the lowest SSSP execution time.
✓ Up to 6x improvement against ARGO-H.
✓ Improvement became larger as the graph size increased.

32

Evaluation: SSSP Exec. Time vs # of Partitions

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
★ Message Grouping Size: 512

✓ ARGO always outperformed LDG and ARGO-H.
✓ Up to 11x improvement against ARGO-H.

33

Evaluation: SSSP Exec. Time vs # of Partitions

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines
★ Message Grouping Size: 512

[Figure annotations: * 160 = 13h, * 180 = 6h]

✓ Hours of CPU time saved.

34

Evaluation: Partitioning Overhead

★ Twitter: |V| = 52M, |E| = 3.9B
★ 80~200 Partitions: four up to ten 20-core machines

[Figure: Partitioning Time as a Percentage of the CPU Time Saved (SSSP Execution), by # of Partitions]

✓ ARGO is indeed slower than LDG.
✓ The overhead was negligible in comparison to the CPU time saved.
✓ Graph analytics usually have much longer execution time.
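The overhead metric on this slide can be sketched as simple arithmetic. The slides do not define "CPU time saved", so this sketch assumes it is the reduction in execution time multiplied by the number of partitions (one core each). All input numbers are hypothetical, not measurements from the evaluation.

```python
def cpu_time_saved_s(baseline_exec_s, argo_exec_s, num_partitions):
    """Assumed definition: wall-clock time saved, summed over one core per partition."""
    return (baseline_exec_s - argo_exec_s) * num_partitions

def overhead_pct(partitioning_time_s, saved_cpu_s):
    """Partitioning time expressed as a percentage of the CPU time saved."""
    return 100.0 * partitioning_time_s / saved_cpu_s

# Hypothetical run: 80 partitions, execution drops from 1000 s to 400 s,
# and partitioning takes 240 s.
saved = cpu_time_saved_s(1000, 400, 80)
pct = overhead_pct(240, saved)
print(saved, pct)
```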

35

Conclusions

Findings
o Network is not always the bottleneck.
o Contention on memory subsystems may impact the performance a lot due to excess intra-node data comm.
o Both edge-cut and its distribution matter.

ARGO
o Avoids contention by offloading excess intra-node data comm across nodes.
o Achieves up to 11x improvement on real-world workloads.
o Scales well in terms of both graph size and number of partitions.

Acknowledgments:
Peyman Givi
Patrick Pisciuneri

Funding:
NSF CBET-1609120
NSF CBET-1250171
BigData’16 Student Travel Award

Thanks!

36
