
Min-Soo Kim

Department of Information and Communication Engineering

DGIST

GStream 2.0: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs

GTCx Korea 2016 InfoLab

Outline

Introduction

Preliminaries

Streaming graph topology

Exploiting multiple GPUs

Experimental results

Conclusions

Big graph data

Graphs are everywhere

- web, social networks, telecommunication, biology, neuroscience

Sizes of graphs are growing

Graph analysis is getting more and more important

- e.g. PageRank, connected components, counting triangles

Sizes of graphs

Real graphs

- livejournal: 5M vertices, 69M edges

- twitter: 42M vertices, 1468M edges

- yahooweb: 1414M vertices, 6637M edges

Synthetic graphs

- RMAT20: 1M vertices, 16M edges

- RMAT30: 1B vertices, 16B edges

- RMAT40: 1T vertices, 16T edges

Human connectome: 100B vertices, 100T edges
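The synthetic graph sizes above are consistent with the common convention that RMATn has 2^n vertices. The small sketch below reproduces the listed figures under that assumption; the edge factor of 16 is read off the slide, and the function name `rmat_size` is only illustrative.

```python
def rmat_size(scale, edge_factor=16):
    """Return (vertices, edges) for an R-MAT graph of the given scale.

    Assumes 2**scale vertices and edge_factor edges per vertex, which
    matches the RMAT20/30/40 figures quoted on the slide.
    """
    vertices = 2 ** scale
    return vertices, edge_factor * vertices


for scale in (20, 30, 40):
    v, e = rmat_size(scale)
    print(f"RMAT{scale}: {v:,} vertices, {e:,} edges")
```

The rounded values on the slide (1M, 1B, 1T) correspond to 2^20, 2^30 and 2^40.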

Vertex-centric programming model

Think like a “vertex”

Message passing between vertices

Bulk synchronous parallel model (BSP)

[Figure: a vertex v and the computation scope of v]

for each vertex v in V:

    compute(v)
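As a rough illustration of the vertex-centric loop under BSP, the sketch below labels connected components by propagating the smallest vertex ID. It is a plain CPU simulation, not the presenter's code: the message format, the `inbox`/`outbox` dictionaries and the function name are assumptions made for this example.

```python
def bsp_connected_components(adjacency):
    """Label each vertex with the smallest vertex ID in its component.

    adjacency maps a vertex to its neighbours (undirected graph).
    Each pass of the while loop is one BSP superstep.
    """
    labels = {v: v for v in adjacency}

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in adjacency}
    for v, neighbours in adjacency.items():
        for u in neighbours:
            inbox[u].append(labels[v])

    while any(inbox.values()):
        outbox = {v: [] for v in adjacency}
        for v in adjacency:                 # for each vertex v: compute(v)
            if not inbox[v]:
                continue                    # no messages, vertex stays idle
            best = min(inbox[v])
            if best < labels[v]:
                labels[v] = best
                for u in adjacency[v]:      # message passing to neighbours
                    outbox[u].append(best)
        inbox = outbox                      # barrier between supersteps
    return labels


graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(bsp_connected_components(graph))
```

Messages sent in one superstep only become visible in the next one, which is what makes the result independent of the order in which vertices are visited.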

Example of vertex-centric model

Edge cut vs. Vertex cut

Edge Cut: a partition of vertices (e.g. Pregel, Giraph…)

- communication proportional to the number of edges across different machines

- 100 machines => 99% of edges cut

Vertex Cut: a partition of edges (PowerGraph)

- communication linear in the number of machines each vertex spans
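The 99% figure can be checked with a short calculation, assuming vertices are placed on machines uniformly at random (the slide does not state the placement policy, so this is an assumption). Under that assumption an edge stays local with probability 1/p, so the expected cut fraction is 1 - 1/p. The helper names below are illustrative.

```python
import random


def expected_edge_cut_fraction(num_machines):
    """Expected fraction of cut edges under uniform random vertex placement."""
    return 1.0 - 1.0 / num_machines


def measured_edge_cut_fraction(edges, num_machines, seed=0):
    """Place each vertex on a random machine and count edges that cross machines."""
    rng = random.Random(seed)
    owner = {}

    def place(v):
        if v not in owner:
            owner[v] = rng.randrange(num_machines)
        return owner[v]

    cut = sum(1 for u, v in edges if place(u) != place(v))
    return cut / len(edges)


rng = random.Random(1)
edges = [(rng.randrange(1000), rng.randrange(1000)) for _ in range(20000)]
print(expected_edge_cut_fraction(100))
print(measured_edge_cut_fraction(edges, 100))
```

The measured value on a random edge list should land close to the analytic 0.99.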

PowerGraph (GraphLab 2.2)

Graph processing system for power-law graphs in the real world

- Power-law degree distribution (a property of natural graphs)

Programming model

- GAS decomposition

- Asynchronous model (as well as BSP model)
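To make the gather, apply, scatter (GAS) decomposition concrete, here is a minimal synchronous PageRank sketch split into those three phases. It is not PowerGraph's API: the function names, the damping factor of 0.85 and the tolerance are assumptions chosen for this illustration.

```python
def gather(rank, out_degree, u):
    """Gather: contribution of in-neighbour u to the vertex being updated."""
    return rank[u] / out_degree[u]


def apply_rank(total, n, damping):
    """Apply: combine the gathered sum into the new vertex value."""
    return (1.0 - damping) / n + damping * total


def scatter(old, new, tol):
    """Scatter: report whether neighbours need to be re-activated."""
    return abs(new - old) > tol


def pagerank_gas(in_neighbours, out_degree, damping=0.85, tol=1e-10, max_iters=100):
    n = len(in_neighbours)
    rank = {v: 1.0 / n for v in in_neighbours}
    for _ in range(max_iters):
        new_rank = {}
        changed = False
        for v, sources in in_neighbours.items():
            total = sum(gather(rank, out_degree, u) for u in sources)
            new_rank[v] = apply_rank(total, n, damping)
            changed |= scatter(rank[v], new_rank[v], tol)
        rank = new_rank            # synchronous (BSP-style) update
        if not changed:
            break
    return rank


# Directed 3-cycle 0 -> 1 -> 2 -> 0: every vertex ends with rank 1/3.
ranks = pagerank_gas({0: [2], 1: [0], 2: [1]}, {0: 1, 1: 1, 2: 1})
print(ranks)
```

Because the gather phase is a sum over in-edges, it can be split across the machines that hold a vertex's edges, which is what makes the decomposition fit a vertex cut.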

GAS decomposition

BSP vs. Asynchronous

BSP model on GAS

- Deterministic but inefficient

Asynchronous GAS

- Better performance

- Potential non-determinism (fatal to some algorithms)
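The trade-off can be seen on a toy example. The sketch below propagates the minimum label along a 5-vertex path, once with synchronous updates (each sweep reads only the previous sweep's values) and once with asynchronous updates (each vertex reads the latest values). It is a sequential simulation written for this note, with illustrative function names; real asynchronous engines run updates concurrently.

```python
def sync_sweeps(adjacency):
    """Synchronous: every update in a sweep reads the previous sweep's labels."""
    labels = list(range(len(adjacency)))
    sweeps = 0
    while True:
        old = labels[:]
        labels = [min([old[v]] + [old[u] for u in adjacency[v]])
                  for v in range(len(adjacency))]
        if labels == old:
            return labels, sweeps
        sweeps += 1


def async_sweeps(adjacency):
    """Asynchronous: every update immediately sees the newest labels."""
    labels = list(range(len(adjacency)))
    sweeps = 0
    while True:
        changed = False
        for v in range(len(adjacency)):
            best = min([labels[v]] + [labels[u] for u in adjacency[v]])
            if best < labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            return labels, sweeps
        sweeps += 1


path = [[1], [0, 2], [1, 3], [2, 4], [3]]   # path graph 0-1-2-3-4
print(sync_sweeps(path))    # ([0, 0, 0, 0, 0], 4)
print(async_sweeps(path))   # ([0, 0, 0, 0, 0], 1)
```

Both variants reach the same labels here, but the asynchronous sweep count depends on the order in which vertices are visited, which hints at where non-determinism enters once updates run in parallel.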

Existing graph processing methods (1)

Single-machine / CPU-based methods

- limited computing power (tens of CPU cores)

- limited graph size (main memory)

- e.g., MTGL[IPDPS’07], Galois[SOSP’13], Ligra[PPoPP’13], Ligra+[DCC’15]

Single-machine / GPU-based methods

- good computing power (thousands of GPU cores)

- limited graph size (main memory / GPU memory)

- e.g., MapGraph[GRADES’14], TOTEM[PACT’12], CuSha[HPDC’14]

Existing graph processing methods (2)

Distributed methods

- scalable computing power (thousands of CPU cores)

- edge cut: communication traffic

- vertex cut: storage overhead (not scalable in graph size)

- e.g., GraphLab[OSDI’12], Giraph[Apache], GraphX[Apache], Naiad[SOSP’13]

[Figure: existing distributed methods, shown as P/M nodes joined by an interconnection network and holding topology data and attribute data]

[Figure: existing distributed methods (P/M nodes joined by an interconnection network, holding topology data and attribute data) compared with the GTS method (P, D, M)]

Key ideas of GTS

Scale-up approach

- no communication traffic (from edge cut)

- no storage overhead (from vertex cut)

- thousands of computing cores (up to 8 GPUs)

- terabytes of storage (up to 8 PCI-E SSDs)

Storing only updatable attribute data in GPU memory

- read-only attribute (RA), writable attribute (WA)

- WA data is much smaller than topology data

Moving topology data from SSDs to GPUs

- streaming only necessary pages (page-level random access)

- strategies for mapping between SSDs and GPUs

Slotted page format for graph [KDD’13]

Storing topology data

- small page (SP): low-degree vertices

- large page (LP): high-degree vertices

VID (logical ID)

RID (physical ID) : ⟨PageID, SlotNo⟩

Extended slotted page format

Original slotted page format: 1MB / page

- 2-byte PageID: max. 64K pages

- 2-byte SlotNo: max. 64K vertices / page

- max. 4 billion vertices (practically, max. 1 billion vertices)

Extended slotted page format

- p-byte PageID

- q-byte SlotNo

- max. 281 trillion vertices (p=3, q=3)
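The vertex limits above follow from the width of the physical ID: with a p-byte PageID and a q-byte SlotNo there are 2^(8(p+q)) distinct ⟨PageID, SlotNo⟩ pairs. The sketch below checks the two quoted limits and shows one possible byte encoding; the little-endian layout and the function names are assumptions, not the format's actual on-disk layout.

```python
def max_vertices(p, q):
    """Number of distinct RIDs with a p-byte PageID and a q-byte SlotNo."""
    return 2 ** (8 * (p + q))


def pack_rid(page_id, slot_no, p, q):
    """Encode <PageID, SlotNo> into p + q bytes (little-endian, assumed)."""
    if not 0 <= page_id < 2 ** (8 * p):
        raise ValueError("page_id does not fit in p bytes")
    if not 0 <= slot_no < 2 ** (8 * q):
        raise ValueError("slot_no does not fit in q bytes")
    return page_id.to_bytes(p, "little") + slot_no.to_bytes(q, "little")


def unpack_rid(rid, p, q):
    """Decode the bytes produced by pack_rid back into (page_id, slot_no)."""
    return (int.from_bytes(rid[:p], "little"),
            int.from_bytes(rid[p:p + q], "little"))


print(f"{max_vertices(2, 2):,}")   # 4,294,967,296 (about 4 billion)
print(f"{max_vertices(3, 3):,}")   # 281,474,976,710,656 (about 281 trillion)
print(unpack_rid(pack_rid(70000, 12, 3, 3), 3, 3))
```

A PageID of 70000 does not fit in the original 2-byte field, which is the kind of case the extended format is meant to cover.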

Superstep: streaming graph topology once

Streaming topology data with read-only attribute (RA)

Two GPU kernels for a graph algorithm: K_SP, K_LP

WA: writable attribute

RA: read-only attribute

SP: small page

LP: large page
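A superstep can be pictured as one pass over the topology pages while WA stays resident. The following CPU-only sketch runs one PageRank superstep that way: WA is the next rank vector, RA is the previous rank vector, and each page is handed to a small-page or large-page kernel. The page layout (a dict with `kind`, `adjacency` and `out_degree`) and the kernel bodies are hypothetical and only mimic the roles of K_SP and K_LP.

```python
def k_sp(page, ra, wa, damping):
    """Small-page kernel: the page holds several low-degree vertices."""
    for v, out_neighbours in page["adjacency"].items():
        share = damping * ra[v] / page["out_degree"][v]
        for u in out_neighbours:
            wa[u] += share


def k_lp(page, ra, wa, damping):
    """Large-page kernel: the page holds one chunk of a high-degree vertex."""
    (v, chunk), = page["adjacency"].items()
    share = damping * ra[v] / page["out_degree"][v]
    for u in chunk:
        wa[u] += share


def superstep(pages, ra, damping=0.85):
    """Stream every page once; WA stays in place for the whole superstep."""
    n = len(ra)
    wa = [(1.0 - damping) / n] * n
    for page in pages:                       # topology is streamed page by page
        kernel = k_lp if page["kind"] == "LP" else k_sp
        kernel(page, ra, wa, damping)        # RA entries are read with the page
    return wa


# Directed 3-cycle 0 -> 1 -> 2 -> 0, split over one small and one large page.
pages = [
    {"kind": "SP", "adjacency": {0: [1], 1: [2]}, "out_degree": {0: 1, 1: 1}},
    {"kind": "LP", "adjacency": {2: [0]}, "out_degree": {2: 1}},
]
print(superstep(pages, [1 / 3, 1 / 3, 1 / 3]))
```

In this sketch RA is a full in-memory list for simplicity; the slide describes RA as being streamed together with the topology data.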

Asynchronous multiple streams

Asynchronous copies of {WA, SP, RA} to the GPU cannot overlap with each other

GPU kernel executions can overlap

Up to 32 streams run simultaneously
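A simple timing model shows why multiple streams help even though copies are serialized. In the sketch below each page needs one copy followed by one kernel; copies share a single copy engine, kernels may run concurrently, and a stream cannot start its next copy until its previous kernel has finished. The uniform copy and kernel times and the unlimited kernel concurrency are simplifying assumptions for illustration.

```python
def makespan(num_pages, copy_time, kernel_time, num_streams=32):
    """Finish time when pages are issued round-robin over num_streams streams."""
    copy_engine_free = 0.0                  # copies cannot overlap each other
    stream_free = [0.0] * num_streams       # work inside a stream is ordered
    finish = 0.0
    for i in range(num_pages):
        s = i % num_streams
        start_copy = max(copy_engine_free, stream_free[s])
        end_copy = start_copy + copy_time
        copy_engine_free = end_copy
        end_kernel = end_copy + kernel_time  # kernels may overlap across streams
        stream_free[s] = end_kernel
        finish = max(finish, end_kernel)
    return finish


print(makespan(4, copy_time=1, kernel_time=3, num_streams=1))    # 16.0
print(makespan(4, copy_time=1, kernel_time=3, num_streams=32))   # 7.0
```

With one stream every copy waits for the previous kernel; with enough streams the total time approaches the sum of the copy times plus one kernel time.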

Exploiting multiple GPUs and SSDs

Strategy-P: the same WA in all GPUs

Strategy-S: different WAs in each GPU
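The two strategies can be contrasted with a small CPU simulation that counts in-degrees. It assumes that under Strategy-P each page is sent to exactly one GPU and the replicated WA copies are merged afterwards, while under Strategy-S every GPU sees every page but owns only a slice of WA. The slide states only how WA is placed, so the page assignment, the merge step and all names here are assumptions of this sketch.

```python
def accumulate(pages, wa, lo, hi):
    """Count in-edges for targets in [lo, hi); wa[i] belongs to vertex lo + i."""
    for page in pages:
        for _source, targets in page:
            for t in targets:
                if lo <= t < hi:
                    wa[t - lo] += 1


def strategy_p(pages, num_vertices, num_gpus):
    """Same full-size WA on every GPU; pages are divided among the GPUs."""
    copies = []
    for g in range(num_gpus):
        wa = [0] * num_vertices
        accumulate(pages[g::num_gpus], wa, 0, num_vertices)
        copies.append(wa)
    return [sum(column) for column in zip(*copies)]   # merge the WA copies


def strategy_s(pages, num_vertices, num_gpus):
    """Each GPU holds a different slice of WA and processes every page."""
    chunk = -(-num_vertices // num_gpus)               # ceiling division
    result = []
    for g in range(num_gpus):
        lo, hi = g * chunk, min((g + 1) * chunk, num_vertices)
        wa = [0] * max(hi - lo, 0)
        accumulate(pages, wa, lo, hi)
        result.extend(wa)
    return result


pages = [[(0, [1, 2]), (1, [2])], [(2, [0, 3])], [(3, [0, 1, 2])]]
print(strategy_p(pages, 4, 2))   # [2, 2, 3, 1]
print(strategy_s(pages, 4, 2))   # [2, 2, 3, 1]
```

In this model Strategy-P moves each page once but needs a full WA per GPU, whereas Strategy-S fits a larger WA across GPUs at the cost of sending every page to every GPU.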

Strategy-P vs. Strategy-S

Micro-level graph processing

Each thread processes one vertex (VWC technique)

Each GPU processes (numBlocks × numThreads) vertices

- process multiple Small Pages simultaneously

- process Large Pages for a high-degree vertex simultaneously

[Figure: a page whose slots 0 to 9 hold the record offsets 0, 4, 5, 7, 9, 12, 15, 18, 20, 21 (gaps +4, +1, +2, +2, ...) into the adjacency records; thread 0 through thread 9 each process one slot]
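The thread-to-vertex mapping can be mimicked on the CPU with two nested loops standing in for blocks and threads. The sketch reuses the slot offsets visible in the figure and treats the last offset as an end marker, which is an assumption; the record values are placeholders, and the function name `launch` is illustrative rather than part of GTS.

```python
SLOT_OFFSETS = [0, 4, 5, 7, 9, 12, 15, 18, 20, 21]   # offsets from the figure
RECORDS = list(range(21))                             # placeholder adjacency records


def launch(num_blocks, num_threads, offsets, records):
    """Simulate a grid where each thread handles the vertex with its global ID."""
    num_vertices = len(offsets) - 1        # last offset used as an end marker
    degrees = [0] * num_vertices
    for block in range(num_blocks):
        for thread in range(num_threads):
            gid = block * num_threads + thread
            if gid >= num_vertices:
                continue                   # surplus threads do nothing
            begin, end = offsets[gid], offsets[gid + 1]
            degrees[gid] = len(records[begin:end])
    return degrees


print(launch(3, 4, SLOT_OFFSETS, RECORDS))   # [4, 1, 2, 2, 3, 3, 3, 2, 1]
```

The uneven degrees show why a single thread per vertex is only reasonable for low-degree vertices in small pages, while a high-degree vertex is spread over large pages.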

Experimental setup

Methods and H/W settings

CPU-based methods

- MTGL [IPDPS’07], Galois [SOSP’13], Ligra [PPoPP’13], Ligra+ [DCC’15]

GPU-based methods

- MapGraph [GRADES’14], TOTEM [PACT’12], CuSha [HPDC’14], GTS [SIGMOD’16]

H/W setting for CPU-based and GPU-based methods

- 16 CPU cores, two GPUs, 128GB memory, two PCI-E SSDs

Distributed methods

- GraphLab [OSDI’12], Giraph [Apache], GraphX [Apache], Naiad [SOSP’13]

H/W setting for distributed methods

- DGIST supercomputer iREMB (Rank#454, 30 nodes)

- 480 CPU cores, 1,920 GB memory, Infiniband QDR (40 Gbps)

Data sets

Comparison with CPU-based methods

Comparison with GPU-based methods

Comparison with distributed methods

Other graph algorithms

Strategy-P

Faster than Strategy-S (BFS, two SSDs / main memory)

Similar to Strategy-S (PageRank, one SSD / two HDDs)

Conclusions

New scale-up approach

- storing only updatable attribute data in GPU memory

- moving topology data from SSDs to GPUs

- two strategies : Strategy-P, Strategy-S

- extended slotted page, caching, micro-level parallel processing

Results

- faster than distributed methods on supercomputer

- processing larger-scale graphs (RMAT34, two Intel 750 SSDs)

- cost-efficient

- energy efficient

Thank you!

Questions?
