36
Argo: Architecture-Aware Graph Partitioning Angen Zheng Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange Department of Computer Science, University of Pittsburgh http://db.cs.pitt.edu/group/ http://www.prognosticlab.org/ 1

Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Argo: Architecture-Aware Graph Partitioning

Angen Zheng

Alexandros Labrinidis, Panos K. Chrysanthis, and Jack Lange

Department of Computer Science, University of Pittsburgh

http://db.cs.pitt.edu/group/

http://www.prognosticlab.org/

1

Page 2: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Big Graphs Are Everywhere [SIGMOD’16 Tutorial]

2

Page 3: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

A Balanced Partitioning = Even Load Distribution Minimal Edge-Cut = Minimal Data Comm

N3 N1

N2

Assumption: Network is the bottleneck. 3

Page 4: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

The End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15]

✓ Dual-socket Xeon E5v2 server with ○ DDR3-1600 ○ 2 FDR 4x NICs per socket

✓ Infiniband: 1.7GB/s~37.5GB/s

✓ DDR3: 6.25GB/s~16.6GB/s

4

Page 5: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

The End of Slow Networks: Does edge-cut still matter?

5

Page 6: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Roadmap

Introduction

Does edge-cut still matter?

Why edge-cut still matters?

Argo

Evaluation

Conclusions

6

Page 7: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

The End of Slow Networks: Does edge-cut still matter?

Graph Partitioners METIS and LDG

Graph Workloads BFS, SSSP, and PageRank

Graph Dataset Orkut (|V|=3M, |E|=234M)

Number of Partitions 16 (one partition per core)

7

Page 8: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

The End of Slow Networks: Does edge-cut still matter?

m:s:c SSSP Execution Time (s)

METIS LDG

1:2:8 633 2,632

2:2:4 654 2,565

4:2:2 521 631

8:2:1 222 280

9x

m: # of machines used s: # of sockets used per machine c: # of cores used per socket

✓ Denser configurations had longer execution time. ○ Contention on the memory subsystems impacted performance. ○ Network may not always be the bottleneck.

8

Page 9: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

The End of Slow Networks: Does edge-cut still matter?

m:s:c SSSP Execution Time (s)

METIS LDG

1:2:8 633 2,632

2:2:4 654 2,565

4:2:2 521 631

8:2:1 222 280

m:s:c SSSP LLC Misses (in Millions)

METIS LDG

1:2:8 10,292 44,117

2:2:4 10,626 44,689

4:2:2 2,541 1,061

8:2:1 96 187

9x 235x

✓ Denser configurations had longer execution time. ○ Contention on the memory subsystems impacted performance. ○ Network may not always be the bottleneck.

9

Page 10: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

m:s:c SSSP LLC Misses (in Millions)

METIS LDG

1:2:8 10,292 44,117

2:2:4 10,626 44,689

4:2:2 2,541 1,061

8:2:1 96 187

The End of Slow Networks: Does edge-cut still matter?

m:s:c SSSP Execution Time (s)

METIS LDG

1:2:8 633 2,632

2:2:4 654 2,565

4:2:2 521 631

8:2:1 222 280

9x 235x

✓ Denser configurations had longer execution time. ○ Contention on the memory subsystems impacted performance. ○ Network may not always be the bottleneck.

10

Page 11: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

m:s:c SSSP LLC Misses (in Millions)

METIS LDG

1:2:8 10,292 44,117

2:2:4 10,626 44,689

4:2:2 2,541 1,061

8:2:1 96 187

The End of Slow Networks: Does edge-cut still matter?

m:s:c SSSP Execution Time (s)

METIS LDG

1:2:8 633 2,632

2:2:4 654 2,565

4:2:2 521 631

8:2:1 222 280

9x 235x

✓ Denser configurations had longer execution time. ○ Contention on the memory subsystems impacted performance. ○ Network may not always be the bottleneck.

11

The distribution of edge-cut matters.

Page 12: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

The End of Slow Networks: Does edge-cut still matter?

m:s:c SSSP Execution Time (s)

METIS LDG

1:2:8 633 2,632

2:2:4 654 2,565

4:2:2 521 631

8:2:1 222 280

m:s:c SSSP LLC Misses (in Millions)

METIS LDG

1:2:8 10,292 44,117

2:2:4 10,626 44,689

4:2:2 2,541 1,061

8:2:1 96 187

✓ METIS had lower execution time and LLC misses than LDG.

○ Edge-cut matters.

○ Higher edge-cut-->higher comm-->higher contention

9x 235x

12

Page 13: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Yes! Both edge-cut and its distribution matter!

✓ Intra-Node and Inter-Node Data Communication ○ Have different performance impact on the memory

subsystems of modern multicore machines.

The End of Slow Networks: Does edge-cut still matter?

13

Page 14: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Roadmap

Introduction

Does edge-cut still matter?

Why edge-cut still matters?

Argo

Evaluation

Conclusions

14

Page 15: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Send Buffer

Sending Core Receiving Core

Receive Buffer Shared Buffer

1. Load 3. Load 2b. Write

2a. Load 4a. Load

4b. Write

Extra Memory Copy

Intra-Node Data Comm: Shared Memory

15

Page 16: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Cached Send/Shared/Receive Buffer

Intra-Node Data Comm: Shared Memory

LLC and Memory Bandwidth Contention

Cache Pollution

16

Page 17: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Cached Send/Shared Buffer Cached Receive/Shared Buffer

LLC and Memory Bandwidth Contention

Cache Pollution

Intra-Node Data Comm: Shared Memory

17

Page 18: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Excess intra-node data communication may hurt performance.

18

Page 19: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Inter-Node Data Comm: RDMA Read/Write

Send

Buffer

Sending Core

Node#1

IB

HCA

Receive

Buffer

Sending Core

Node#2

IB

HCA

No Extra Memory Copy and Cache Pollution

19

Page 20: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Offloading excess intra-node data comm across nodes may achieve better performance.

20

Page 21: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Roadmap

Introduction

Does edge-cut still matter?

Why edge-cut still matters?

Argo

Evaluation

Conclusions

21

Page 22: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Argo: Graph Partitioning Model

Partitioner

...

...

Vertex Stream

Streaming Graph Partitioning Model [I. Stanton, KDD’12]

22

Page 23: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Argo: Architecture-Aware Vertex Placement

Place vertex, v, to a partition, Pi, that maximize:

✓ Weighted by the relative network comm cost, Argo will ○ avoid edge-cut across nodes (inter-node data comm).

Penalize the placement based on the load of Pi

Weighted Edge-cut

Great for cases where the network is the bottleneck. 23

Page 24: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Argo: Architecture-Aware Vertex Placement

Refined Intra-Node Network Comm Cost

Maximal Inter-Node Network Comm Cost

Degree of Contention (𝞴 ∈ [0, 1])

Original Intra-Node Network Comm Cost

✓ Weighted by the refined relative network comm cost, Argo will ○ avoid edge-cut across cores of the same node (intra-node

data comm).

Bottleneck

Network Memory

𝞴=0 𝞴=1

24

Page 25: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Introduction

Does edge-cut still matter?

Why edge-cut still matters?

Argo

Evaluation

Conclusions

Roadmap

25

Page 26: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Three Classic Graph Workloads o Breadth First Search (BFS) o Single Source Shortest Path (SSSP) o PageRank

Three Real-World Large Graphs

Evaluation: Workloads & Datasets

Dataset |V| |E|

Orkut 3M 234M

Friendster 124M 3.6B

Twitter 52M 3.9B

26

Page 27: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Evaluation: Platform

Cluster Configuration

# of Nodes 32

Network Topology FDR Infiniband (Single Switch)

Network Bandwidth 56Gbps

Compute Node Configuration

# of Sockets 2 Intel Haswell

(10 cores / socket)

L3 Cache 25MB

27

Page 28: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

METIS: the most well-known multi-level partitioner.

LDG: the most well-known streaming partitioner.

ARGO-H: network is the bottleneck.

o weight edge-cut by the original network comm costs.

ARGO: memory is the bottleneck.

o weight edge-cut by the refined network comm costs.

Evaluation: Partitioners

28

Page 29: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Evaluation: SSSP Exec. Time on Orkut dataset

Message Grouping Size (Group multiple msgs by a single SSSP process to the same destination into one msg)

★ Orkut: |V| = 3M, |E| = 234M

★ 60 Partitions: three 20-core machines

2x

2x

3x

1x

4x

2x

5x

1x

2x

1.4x

3x

1x

✓ ARGO had the lowest SSSP execution time. 29

Page 30: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Message Grouping Size

Evaluation: SSSP LLC Misses on Orkut dataset

★ Orkut: |V| = 3M, |E| = 234M

★ 60 Partitions: three 20-core machines

50x

38x

9x

4x 3x 6x

1x 1x 1x

9x

1.2x

12x

✓ ARGO had the lowest LLC Misses.

30

Page 31: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Evaluation: SSSP Comm Vol. on Orkut dataset

★ Orkut: |V| = 3M, |E| = 234M

★ 60 Partitions: three 20-core machines

64 Intra-Socket

METIS 69%

LDG 49%

ARGO-H 70%

✓ ARGO had the lowest intra-node communication volume. ✓ Distribution of the edge-cut also matters.

31

Page 32: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Evaluation: SSSP Exec. Time vs Graph Size ★ Twitter: |V| = 52M, |E| = 3.9B ★ 80 Partitions: four 20-core machines ★ Message Grouping Size: 512

✓ ARGO had the lowest SSSP execution time. ✓ Up to 6x improvement against ARGO-H. ✓ Improvement became larger as the graph size increased.

32

Page 33: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Evaluation: SSSP Exec. Time vs # of Partitions ★ Twitter: |V| = 52M, |E| = 3.9B ★ 80~200 Partitions: four up to ten 20-core machines ★ Message Grouping Size: 512

✓ ARGO always outperformed LDG and ARGO-H. ✓ Up to 11x improvement against ARGO-H.

33

Page 34: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Evaluation: SSSP Exec. Time vs # of Partitions

* 160 = 13h

* 180 = 6h

★ Twitter: |V| = 52M, |E| = 3.9B ★ 80~200 Partitions: four up to ten 20-core machines ★ Message Grouping Size: 512

✓ Hours CPU Time Saving.

34

Page 35: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

# of Partitions

Partitioning Time as a Percentage of the CPU Time Saved (SSSP Execution)

Evaluation: Partitioning Overhead

★ Twitter: |V| = 52M, |E| = 3.9B ★ 80~200 Partitions: four up to ten 20-core machines

# of Partitions

✓ ARGO is indeed slower than LDG. ✓ The overhead was negligible in comparison to the CPU time saved. ✓ Graph analytics usually have much longer execution time.

35

Page 36: Argo: Architecture-Aware Graph Partitioningjarusl/presentations/argo.bigdata16.slides.pdfThe End of Slow Networks: Network is now as fast as DRAM [C. Bing, VLDB’15] Dual-socket Xeon

Findings o Network is not always the bottleneck. o Contention on memory subsystems may impact the performance a lot

due to excess intra-node data comm. o Both edge-cut and its distribution matter.

Acknowledgments:

Peyman Givi

Patrick Pisciuneri

Funding:

NSF CBET-1609120

NSF CBET-1250171

BigData’16 Student

Travel Award

ARGO o voids contention by offloading excess

intra-node data comm across nodes.

o Achieves up to 11x improvement on real-world workloads.

o Scales well in terms of both graph size and number of partitions.

Thanks!

Conclusions

36