
Page 1

GReTA: Hardware Optimized Graph Processing for GNNs

Kevin Kiningham, Phil Levis, Chris Ré
Stanford University

March 4th, 2020

1 · ReCoML’20

Page 2

2 · Introduction

Deep Neural Networks

Speech Recognition, Translation, Object Detection, Handwriting Recognition

(Figure: “Traditional DNN”: input image → Conv layer → Linear layer → “Dog”)

Page 3

3 · Introduction

Deep Neural Networks + Graphs = ?

Speech Recognition, Translation, Object Detection, Handwriting Recognition

(Figure: “Traditional DNN”: input image → Conv layer → Linear layer → “Dog”)

Social Networks, Protein Interactions, Road Networks, Citations

(Figure: “?” in place of a model for graph data)

Page 4

4 · Introduction

Deep Neural Networks + Graphs = GNNs

Social Networks, Protein Interactions, Road Networks, Citations

Speech Recognition, Translation, Object Detection, Handwriting Recognition

(Figure: “Traditional DNN”: input image → Conv layer → Linear layer → “Dog”)

(Figure: “Graph Neural Network (GNN)”: input graph → GraphConv layer → GraphConv layer)

Page 5

GNN Computation Is Irregular

5 · Introduction

● Computation pattern changes depending on the input graph structure

● GNN layers follow a message-passing architecture

1. Send messages
2. Aggregate
3. Update

(Figure: GNN with two GraphConv layers)

Page 6

Existing DNN Representations Are Bad for GNNs

6 · Introduction

● Irregular computation is difficult to represent with a static tensor network
  ○ E.g., TensorFlow

● Hard to handle large graphs
  ○ Must manually deal with partitioning variables
  ○ Hard to make efficient when the graph shape can change

(Figure: a static network (X → MatMul(W) → Add(b) → ReLU) vs. a graph neural network (message-passing (MP) units applied across vertices x1, x2, x3 of a changing graph))

Page 7

GReTA: Graph Framework for GNNs

● Simple to represent GNN layers
  ○ Computation defined on edges and vertices of the input graph
  ○ Maps directly to message passing

● Flexible enough to allow a wide range of GNN models
  ○ Allows each execution phase to be customized

● Efficient execution on an accelerator
  ○ Partitioning: limit accelerator memory usage without modifying user code
  ○ Tiling: increase the reuse of layer weights

7 · Introduction

Page 8

Talk Agenda

● Introduction

● GReTA Overview

● Execution Model

● Partitioning

● Experimental Results

● Conclusion

8 · GReTA Overview

Page 9

GReTA Overview

● GReTA represents computation using a graph framework
  ○ Functions defined on edges and vertices
  ○ Can directly map a message-passing layer

● GNN layers are implemented using four user-defined functions (UDFs), as sketched below:

1. Gather: compute a message for each edge
2. Reduce: reduce incoming messages per vertex
3. Transform: combine the reduced value with a per-vertex accumulator
4. Activate: apply a non-linear function

9 · GReTA Overview
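The slides do not show the interface itself; as a minimal sketch (hypothetical Python, with signatures chosen to match the GCN pseudocode on page 15), the four UDFs could be declared as:

class GretaInterface:
    # The four user-defined functions that together define one GNN layer.
    def gather(self, h_u, h_v, h_uv):
        # Compute the message for edge u→v from source state h_u,
        # destination state h_v, and edge state h_uv.
        raise NotImplementedError

    def reduce(self, a_v, m_v):
        # Fold message m_v into the running per-vertex reduction a_v.
        raise NotImplementedError

    def transform(self, z_v, a_v, W, b):
        # Combine the reduced value a_v with the vertex accumulator z_v,
        # using layer weights W and bias b.
        raise NotImplementedError

    def activate(self, z_v):
        # Apply the layer's non-linearity to the accumulator.
        raise NotImplementedError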

Page 10

Example: Graph Convolutional Network (GCN)

10 · GReTA Overview

GCN layer update function

Page 11

Example: Graph Convolutional Network (GCN)

11 · GReTA Overview

GCN layer update function

1. Gather messages using connected edges

Page 12

Example: Graph Convolutional Network (GCN)

12 · GReTA Overview

GCN layer update function

1. Gather messages using connected edges

2. Reduce to single vector by summation

Page 13

Example: Graph Convolutional Network (GCN)

13 · GReTA Overview

GCN layer update function

1. Gather messages using connected edges

2. Reduce to single vector by summation

3. Transform result using linear transformation

Page 14

Example: Graph Convolutional Network (GCN)

14 · GReTA Overview

GCN layer update function

1. Gather messages using connected edges

2. Reduce to single vector by summation

3. Transform result using linear transformation

4. Activate output using element-wise ReLU

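Written as a single update (a sketch consistent with steps 1-4 above; GCN's degree normalization is omitted here):

    h_v^{\text{new}} = \mathrm{ReLU}\!\left( W \sum_{u \in \mathcal{N}(v)} h_u + b \right)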

Page 15

import numpy as np

class GCNLayer(GretaInterface):
    def gather(self, h_u, h_v, h_uv):
        return h_u                      # message = source vertex state

    def reduce(self, a_v, m_v):
        return a_v + m_v                # sum incoming messages

    def transform(self, z_v, a_v, W, b):
        return z_v + W @ a_v + b        # linear transform of the sum

    def activate(self, z_v):
        return np.maximum(z_v, 0)       # element-wise ReLU

GCN Implementation Pseudocode

15 · GReTA Overview

GCN layer update function

Page 16

Talk Agenda

● Introduction

● GReTA Overview

● Execution Model

● Partitioning

● Experimental Results

● Conclusion

16 · Execution Model

Page 17

GReTA Execution Model

17 · Execution Model

Execution is conceptually split into three phases

Page 18

GReTA Execution Model

18 · Execution Model

Execution is conceptually split into three phases

1. Accumulate Edges
  ○ Gather/compute a message for each edge
  ○ Reduce to a single value per vertex

(Figure: dataflow: edge data Du, Du→v flows through the Accumulate Edges stage into per-vertex accumulators)

Page 19

GReTA Execution Model

19 · Execution Model

Execution is conceptually split into three phases

1. Accumulate Edges
  ○ Gather/compute a message for each edge
  ○ Reduce to a single value per vertex

2. Accumulate Vertices
  ○ Combine the reduced value with prior vertex accumulator state

(Figure: dataflow: Edges → Accumulate Edges → Accumulate Vertices (with weights W) → accumulators)

Page 20

GReTA Execution Model

20 · Execution Model

Execution is conceptually split into three phases

1. Accumulate Edges
  ○ Gather/compute a message for each edge
  ○ Reduce to a single value per vertex

2. Accumulate Vertices
  ○ Combine the reduced value with prior vertex accumulator state

3. Update Vertices
  ○ Apply Activate to the accumulator

(Figure: full dataflow: edge data Du, Du→v → Accumulate Edges → Accumulate Vertices (weights W) → Update Vertices → new vertex state Dv)
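Putting the three phases together, a minimal host-side sketch (hypothetical Python, not the accelerator implementation; udf is any object implementing the four UDFs, such as GCNLayer above):

import numpy as np

def run_greta(udf, vertices, edges, h, W, b, dim):
    # Phase 1 (Accumulate Edges): gather a message for each edge and
    # reduce it into the destination vertex's running value.
    a = {v: np.zeros(dim) for v in vertices}
    for u, v, h_uv in edges:
        a[v] = udf.reduce(a[v], udf.gather(h[u], h[v], h_uv))

    # Phase 2 (Accumulate Vertices): combine each reduced value with the
    # per-vertex accumulator via Transform.
    z = {v: np.zeros(W.shape[0]) for v in vertices}
    for v in vertices:
        z[v] = udf.transform(z[v], a[v], W, b)

    # Phase 3 (Update Vertices): apply Activate to every accumulator.
    return {v: udf.activate(z[v]) for v in vertices}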

Page 21

Talk Agenda

● Introduction

● GReTA Overview

● Execution Model

● Partitioning

● Experimental Results

● Conclusion

21 · Partitioning

Page 22

Optimizations for Hardware Implementation

Execution Partitioning
● Problem: large graphs do not fit into limited accelerator memory
  ○ E.g., social media graphs with millions of users
● Solution: partition the graph and execute GReTA on each partition separately
● Results are combined via vertex accumulators

Weight Tiling
● Problem: bandwidth bottlenecks when layer weights are large
● Solution: improve reuse by splitting weights into tiles
● Tiles can be reused across multiple vertices

22 · Partitioning

Page 23

Graph Partitioning Example

23 · Partitioning

(Figure: example graph with six vertices A-F)

Page 24

Graph Partitioning Example

24 · Partitioning

(Figure: vertex partition: the same graph with its vertices split into vertex chunks V1, V2, V3)

Page 25

Graph Partitioning Example

25 · Partitioning

(Figure: edge partition: the graph's adjacency matrix with Source rows A-F and Destination columns A-F, blocked by the vertex chunks V1, V2, V3; the vertex partition is shown alongside)

Page 26

Graph Partitioning Example

26 · Partitioning

(Figure: the same edge-partition matrix with block E2,1 highlighted: the edges from source chunk V2 to destination chunk V1)

Page 27

Execution Partitioning Example

27 · Partitioning

(Figure: the edge-partition matrix (Source rows A-F, Destination columns A-F) and vertex data reside in off-chip DRAM; the accelerator holds one Src chunk, one Dst chunk, and one Edge block at a time)

Page 28

Execution Partitioning Example

28 · Partitioning

Execution follows columns

(Figure: same setup; the edge blocks are processed one destination column at a time)

Page 29

Execution Partitioning Example

29 · Partitioning

Execution follows columns

(Figure: Accumulate Edges runs on the column's first edge block, with source chunk {A, B} loaded)

Page 30

Execution Partitioning Example

30 · Partitioning

Execution follows columns

(Figure: Accumulate Edges runs again on the next edge block, with source chunk {C, D} loaded)

Page 31

Execution Partitioning Example

31 · Partitioning

Execution follows columns

(Figure: Accumulate Edges runs on the column's last edge block, with source chunk {E, F} loaded)

Page 32

Execution Partitioning Example

32 · Partitioning

Execution follows columns

(Figure: after every edge block in the column has been accumulated, Accumulate Vertices and Update run once for destination chunk {A, B})
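In loop form, a hypothetical host-side sketch of this column-major schedule (chunk and block names are illustrative; udf is as before):

import numpy as np

def run_partitioned(udf, chunks, edge_blocks, h, W, b, dim):
    # chunks: list of vertex chunks (e.g. [V1, V2, V3]).
    # edge_blocks[i][j]: edges with source in chunk i and destination in chunk j.
    out = {}
    for j, dst_chunk in enumerate(chunks):
        # Accumulators for this destination chunk stay resident for the column.
        a = {v: np.zeros(dim) for v in dst_chunk}
        for i in range(len(chunks)):
            # Accumulate Edges, one edge block (one source chunk) at a time.
            for u, v, h_uv in edge_blocks[i][j]:
                a[v] = udf.reduce(a[v], udf.gather(h[u], h[v], h_uv))
        # Accumulate Vertices and Update run once the column is complete.
        for v in dst_chunk:
            out[v] = udf.activate(udf.transform(np.zeros(W.shape[0]), a[v], W, b))
    return out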

Page 33

Talk Agenda

● Introduction

● GReTA Overview

● Execution Model

● Partitioning

● Experimental Results

● Conclusion

33 · Results

Page 34

Experimental Setup

● Implemented a range of GNN models
  ○ GCN (simple, classic GNN model)
  ○ GraphSage (max-reduce instead of sum)
  ○ GIN (MLP in the transform layer)
  ○ G-GCN (per-edge computation)

● Baselines
  ○ CPU: 2.6 GHz Intel Xeon E5-2690v4
  ○ GPU: Nvidia Tesla P100
  ○ Models implemented using TensorFlow

● Compared to a custom 32nm GReTA accelerator

● Key performance metric: total inference latency for batch size 1

34 · Results

Evaluation Datasets:

    Dataset          Nodes   Edges   2-Hop
    YT  YouTube      1.13M   2.98M      25
    LJ  LiveJournal  3.99M   34.6M      65
    PO  Pokec        1.63M   30.6M     167
    RD  Reddit        232K   47.4M     239

Page 35

9-23x Latency Reduction vs CPU

● 15x geometric mean across all datasets/models

● Best results on models where message passing dominates (GCN, G-GCN)

35 · Results

Page 36

6-67x Latency Reduction vs GPU

36 · Results

● 21x geometric mean across all datasets/models

● Best speedup on models with low overall latency (GCN, GIN)

● Small batch size means data transfer latency often dominates

Page 37

Conclusion

Key features of GReTA:

1. Simple representation using a graph framework

2. Expressive enough to allow for a wide range of GNNs

3. Efficient execution on an accelerator

Future work: Apply GReTA beyond GNNs? Integration with existing frameworks?

37 · Conclusion

Page 38

Q&A?

38 · Conclusion

Page 39

39 · Conclusion

Page 40

GReTA Accelerator

● Replace the setup unit with a unit for Gather-ing edge/vertex values
  ○ Uses graph adjacency info stored in the Unified Buffer

● New accumulator unit for Reduce

● Note: existing NN ops can still run on the new architecture!
  ○ The Gather unit just performs a single load
  ○ The Reduce unit performs a no-op

40 · Hardware Acceleration

(Figure: accelerator datapath: Off-chip DRAM ↔ Unified Buffer → Gather → GEMM Engine → Reduce → Accumulators → Activation)

Page 41

Compiling GReTA to a TPU-like Architecture

41 · Hardware Acceleration

Execution in four stages

1. Load: move data from the unified buffer into the setup unit

(Figure: TPU-like datapath: Off-chip DRAM ↔ Unified Buffer → Setup Unit → GEMM Engine → Accumulators → Activation)

Page 42

Traditional DNN Accelerator Model

42 · Hardware Acceleration

Execution in four stages

1. Load: move data from the unified buffer into the setup unit

2. Compute: multiply setup data by pre-loaded weight values

(Figure: same TPU-like datapath)

Page 43

Traditional DNN Accelerator Model

43 · Hardware Acceleration

Execution in four stages

1. Load: move data from the unified buffer into the setup unit

2. Compute: multiply setup data by pre-loaded weight values

3. Accumulate: collect output from compute over N cycles

(Figure: same TPU-like datapath)

Page 44

Traditional DNN Accelerator Model

44 · Hardware Acceleration

Execution in four stages

1. Load: move data from the unified buffer into the setup unit

2. Compute: multiply setup data by pre-loaded weight values

3. Accumulate: collect output from compute over N cycles

4. Activate: execute the required activation/normalization and store the result

(Figure: same TPU-like datapath)

Page 45

Traditional DNN Accelerator Model

45 · Hardware Acceleration

Execution in four stages

1. Load: move data from the unified buffer into the setup unit

2. Compute: multiply setup data by pre-loaded weight values

3. Accumulate: collect output from compute over N cycles

4. Activate: execute the required activation/normalization and store the result

(Figure: same TPU-like datapath)

● Key insight: Stages 2-4 can already execute GReTA’s Transform and Activate UDFs

● Only need to add hardware for Gather and Reduce

Page 46

Graph Partitioning

● Problem: data for the full graph may be too large to fit entirely on the accelerator

● Solution: partition the graph and execute the phases for each partition separately

46 · GReTA Overview

(Figure: vertex partition and edge partition of a 0/1 adjacency matrix (Source rows, Destination columns); per-partition pipeline: Load → Acc. E → Acc. V → Act.)

Page 47

Interleaving Execution

● Multiple GReTA programs in a layer may reuse data
  ○ Read identical edge/vertex data
  ○ Reuse accumulator values

● Interleaving execution improves data locality

47 · Optimizations

(Figure: back-to-back schedule: accum_edges 1 → accum_verts 1 → accum_edges 2 → accum_verts 2; identical vertex data is read twice, and the vertex accumulator must be unloaded and reloaded between programs)

Page 48

Interleaving Execution

● Multiple GReTA programs in a layer may reuse data
  ○ Read identical edge/vertex data
  ○ Reuse accumulator values

● Interleaving execution improves data locality (see the sketch below)

48 · Optimizations

(Figure: interleaved schedule of the same four steps; vertex data is read once and reused, and the vertex accumulator stays loaded)
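A toy illustration of the scheduling difference (hypothetical Python; the stubs only print, since only the loop order matters here, and the interleaved order shown is one plausible reading of the figure):

def accum_edges(prog):
    print(f"accum_edges {prog}")   # stand-in for the Accumulate Edges phase

def accum_verts(prog):
    print(f"accum_verts {prog}")   # stand-in for the Accumulate Vertices phase

programs = [1, 2]

# Back-to-back: program 2 re-reads the same vertex data, and the vertex
# accumulator is unloaded and reloaded between programs.
for prog in programs:
    accum_edges(prog)
    accum_verts(prog)

# Interleaved: both edge phases run while the shared vertex data is
# resident, then both vertex phases run while the accumulator stays loaded.
for prog in programs:
    accum_edges(prog)
for prog in programs:
    accum_verts(prog)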

Page 49

Weight Tiling

● Problem: layer weights can be too large to load fully into the GEMM unit

● Existing solution: slice weights into tiles and reload them for each new vertex
  ○ Unfortunately, this gives worst-case reuse of each tile
  ○ The accelerator is often bottlenecked on loading/reloading weight tiles

49 · Optimizations

(Figure: pipeline: Unified Buffer → Reduction → GEMM (×) → Accumulate (+=))
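GReTA instead reuses each tile across many vertices. A minimal sketch of that reuse pattern (hypothetical NumPy, not the accelerator's dataflow): load each weight tile once and apply it to every vertex before moving on to the next tile.

import numpy as np

def tiled_transform(A, W, tile=128):
    # A: (in_dim, num_vertices) matrix of reduced per-vertex values.
    # W: (out_dim, in_dim) layer weights, processed in column tiles.
    out = np.zeros((W.shape[0], A.shape[1]))
    for k in range(0, W.shape[1], tile):
        W_tile = W[:, k:k + tile]          # load one weight tile...
        out += W_tile @ A[k:k + tile, :]   # ...and reuse it for every vertex
    return out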