Active-Routing: Compute on the Way for Near-Data Processing

Jiayi Huang, Ramprakash Reddy Puli, Pritam Majumder, Sungkeun Kim, Rahul Boyapati, Ki Hwan Yum and EJ Kim


Outline

Motivation

Active-Routing Architecture

Implementation

Enhancements in Active-Routing

Evaluation

Conclusion


Motivation

Data Is Exploding

Graph processing (social networks)

Deep learning (NLP) [Hestness et al. 2019]

Ref: https://griffsgraphs.files.wordpress.com/2012/07/facebook-network.png

Requires more memory to process big data

Demand More Memory

3D die-stacked memory [Loh ISCA’08]

HMC and HBM

Denser capacity

Higher throughput

Memory network [Kim et al. PACT’13, Zhan et al. MICRO’16]

Scalable memory capacity

Better processor bandwidth

[Figure: 3D-stacked memory cube. Labels: DRAM layer, Logic Layer, Vault, Vault Controller, Intra-Cube Network, I/O]

Enormous Data Movement Is Expensive

Stall processor computation

Consume energy

Near-data processing to reduce data movement

Processing-in-memory (PIM)

PIM-Enabled Instruction (PEI) [Ahn et al. ISCA’15]

C = A x B (Can we bring less data?)

In-network computing

Compute in router switches [Panda IPPS’95, Chen et al. SC’11]

MAERI [Kwon et al. ASPLOS’18]

Active-Routing for dataflow execution in memory network

Reduces data movement and is more flexible

Exploits memory throughput and network concurrency


Active-Routing Architecture

System Architecture

[Figure: Host CPU (O3 cores with caches, Network-on-Chip, HMC Controllers with Network Interfaces) attached to the Memory Network]

Active-Routing Flow

Active-Routing tree dataflow for compute kernel

Compute kernel example

Reduction over intermediate results

for (i = 0; i < n; i++)
{
    sum += (*Ai) * (*Bi);
}

Active-Routing tree dataflow

[Figure: Host CPU and memory network with operands Ai and Bi]

Active-Routing Three-Phase Processing

Active-Routing Tree Construction

[Figure: Host CPU and memory network with operands Ai and Bi. Label: Update Packet]

Update Phase for data processing

[Figure: Ak operand request, Bk operand request, Ak operand response, Bk operand response]

Gather Phase for tree reduction

[Figure: Gather requests along the tree]
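The gather phase can be pictured as a reduction over the tree built in the first phase. The sketch below is a software model only, under assumptions that are not in the slides: `ar_node`, `MAX_CHILDREN` and `ar_gather` are hypothetical names, and each node is assumed to already hold the partial sum of the products computed at its cube during the update phase.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4  /* assumption: at most four child links per cube */

/* Hypothetical software model of one node in an Active-Routing tree. */
struct ar_node {
    double partial;                       /* partial result from the update phase */
    struct ar_node *child[MAX_CHILDREN];  /* NULL where no subtree was built */
};

/* Gather phase: reduce the subtree rooted at node and return its sum,
 * which the node would then send to its parent. */
double ar_gather(const struct ar_node *node)
{
    assert(node != NULL);
    double sum = node->partial;
    for (size_t c = 0; c < MAX_CHILDREN; c++) {
        if (node->child[c] != NULL) {
            sum += ar_gather(node->child[c]);
        }
    }
    return sum;
}
```

Calling `ar_gather` on the root plays the role of the host receiving the final result. In hardware the subtrees would reduce concurrently, which this sequential recursion does not capture.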


Implementation

Programming interface and ISA extension

Update(void *src1, void *src2, void *target, int op);

Gather(void *target, int num_threads);

[Figure: API, Extended PIM Instructions, Active-Routing Execution]

for (i = 0; i < n; i++)
{
    sum += (*Ai) * (*Bi);
}

for (i = 0; i < n; i++)
{
    Update(Ai, Bi, &sum, MAC);
}
Gather(&sum, 16);

Offloading logic in network interface

Dedicated registers for offloading information

Convert to Update/Gather packets

Active-Routing Engine

[Figure: memory cube with DRAM layers over a Logic Layer (Vault Controllers, Intra-Cube Network, I/O) and an Active-Routing Engine in the Logic Layer; the engine contains the Packet Processing Unit, Flow Table, Operand Buffers and ALU]

Packet Processing Unit

Process Update/Gather packets

Schedule corresponding actions

Flow Table

Flow Table Entry

Field            Width
flowID           64-bit
opcode           6-bit
result           64-bit
req_counter      64-bit
resp_counter     64-bit
parent           2-bit
children flags   4-bit
Gflag            1-bit

Maintain tree structure

Keep track of state information of each flow

Operand Buffers

Operand Buffer Entry

Field       Width
flowID      64-bit
op_value1   64-bit
op_ready1   1-bit
op_value2   64-bit
op_ready2   1-bit

Shared temporary storage

Fire for computation in dataflow


Enhancements in Active-Routing

One Tree Per Flow

One tree from one memory port

Deep tree

Congestion at memory port

[Figure: single tree rooted at the Host CPU, annotated "Deep tree" and "Congestion"]

Multiple Trees Per Flow

Build multiple trees

ART-tid: interleave the thread ID

ART-addr: nearest port based on operands’ address

[Figure: Host CPU with Thread 0, Thread 1, Thread 2 and Thread 3]
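The two port-selection policies can be contrasted with a small sketch. Everything numeric here is an assumption made for illustration: the port count, the address-to-cube interleaving in `cube_of`, and the caller-supplied hop table are hypothetical, and the ART-addr variant looks at a single operand address for simplicity.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS       4     /* assumption: number of host memory ports */
#define NUM_CUBES       16    /* 16 cubes, as in the evaluated system */
#define CUBE_INTERLEAVE 4096u /* assumption: bytes mapped to a cube before moving on */

/* ART-tid: interleave threads across ports by thread ID. */
unsigned art_tid_port(unsigned tid)
{
    return tid % NUM_PORTS;
}

/* Hypothetical address mapping: cubes interleaved at CUBE_INTERLEAVE bytes. */
static unsigned cube_of(uint64_t addr)
{
    return (unsigned)((addr / CUBE_INTERLEAVE) % NUM_CUBES);
}

/* ART-addr: pick the port with the fewest hops to the cube holding addr.
 * hops[p][c] is a caller-supplied distance table (hypothetical). */
unsigned art_addr_port(uint64_t addr, unsigned hops[NUM_PORTS][NUM_CUBES])
{
    assert(hops != 0);
    unsigned cube = cube_of(addr);
    unsigned best = 0;
    for (unsigned p = 1; p < NUM_PORTS; p++) {
        if (hops[p][cube] < hops[best][cube]) {
            best = p;
        }
    }
    return best;
}
```

ART-tid needs no knowledge of the data layout, whereas ART-addr trades a lookup for a root that sits closer to the operands.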

Exploit Memory Access Locality

Pure reduction

Irregular (random)

Regular

for (i = 0; i < n; i++)
{
    sum += *Ai;
}

Reduction on intermediate results

Irregular-Irregular (II)

Regular-Irregular (RI)

Regular-Regular (RR)

for (i = 0; i < n; i++)
{
    sum += (*Ai) * (*Bi);
}

Offload at cache-block granularity for regular accesses
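A rough count shows why cache-block granularity helps for regular accesses. The sketch below assumes 64-byte cache blocks and a contiguous, block-aligned vector; both the block size and the helper names are assumptions, not values taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_BLOCK_BYTES 64u  /* assumption: 64-byte cache blocks */

/* Updates needed when every element is offloaded individually. */
size_t updates_per_element(size_t n)
{
    return n;
}

/* Updates needed when a contiguous, block-aligned vector is offloaded
 * one cache block at a time. */
size_t updates_per_block(size_t n, size_t elem_bytes)
{
    assert(elem_bytes > 0 && elem_bytes <= CACHE_BLOCK_BYTES);
    size_t bytes = n * elem_bytes;
    return (bytes + CACHE_BLOCK_BYTES - 1) / CACHE_BLOCK_BYTES;  /* round up */
}
```

For a 64K-element vector of 8-byte values, this gives 65536 element-granularity updates against 8192 block-granularity updates, an 8x reduction in offload requests under these assumptions.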


Evaluation


Methodology

Compared techniques

HMC Baseline

PIM-Enabled Instruction (PEI)

Active-Routing-threadID (ART-tid)

Active-Routing-address (ART-addr)

Tools: Pin, McSimA+ and CasHMC

System configurations

16 O3 cores at 2 GHz

16 memory cubes in dragonfly topology

Minimal routing with virtual cut-through

Active-Routing Engine

1250 MHz, 16 flow table entries, 128 operand entries


Workloads

Benchmarks (graph app, ML kernels, etc.)

backprop

lud

pagerank

sgemm

spmv

Microbenchmarks

reduce (sum reduction)

rand_reduce

mac (multiply-and-accumulate)

rand_mac

Comparison of Enhancements in Active-Routing

[Figure: normalized runtime speedup over the HMC baseline (log scale) for ART-Naïve, ART-tid and ART-addr]

Multiple trees and cache-block-grained offloading substantially improve ART over ART-Naïve.

Benchmark Performance

[Figure: normalized runtime speedup (axis 0 to 8) for PEI, ART-tid and ART-addr]

In general, ART-addr > ART-tid > PEI

Imbalanced computations

Analysis of spmv

[Figure: Operand Distribution and Compute Point Distribution for ART-tid and ART-addr; scale 0 to 4E+05]

Benchmark Performance

[Figure: normalized runtime speedup (axis 0 to 8) for PEI, ART-tid and ART-addr]

PEI cache thrashing

C = A x B

Microbenchmark Performance

[Figure: normalized runtime speedup (axis 0 to 60) for PEI, ART-tid and ART-addr]

ART-addr > ART-tid > PEI

Energy-Delay Product

[Figure: normalized energy-delay product (log scale) for ART-tid and ART-addr]

Reduce EDP by 80% on average
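For reference, energy-delay product is the product of energy consumed and execution time, so the reported average reduction can be related to a log-scale axis as follows. The base-10 logarithm and the symbol for the reference configuration are assumptions; the slide states neither the base nor what the values are normalized to.

```latex
\mathrm{EDP} = E \times T, \qquad
\frac{\mathrm{EDP}_{\text{ART}}}{\mathrm{EDP}_{\text{ref}}} = 1 - 0.80 = 0.20, \qquad
\log_{10}(0.20) \approx -0.70
```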

Dynamic Offloading Case Study (lud)

ART-tid-adaptive: dynamic offloading based on locality and reuse

[Figure: cycles (thousands) per phase for ART-tid and ART-tid-adaptive; speedup in the First Phase and Second Phase]


Conclusion

Propose Active-Routing, an in-network computing architecture that computes near data in the memory network in dataflow style

Present a three-phase processing procedure for Active-Routing

Categorize memory access patterns to exploit locality and offload computations at various granularities

Active-Routing achieves up to 7x speedup with 60% average performance improvement and reduces energy-delay product by 80% on average


Thank You & Questions

Jiayi Huang

[email protected]


Backup

API Extensions

UpdateRR(void *src1, void *src2, void *target, int op);

Offload both operands in cache-block granularity

UpdateRI(void *src1, void *src2[], void *target, int op);

Fetch irregular data then offload to regular operands location for cache-block granularity processing

UpdateII(void *src1, void *src2, void *target, int op);

Offload both operands in single element granularity

Update(type src1, type src2, void *target, int op);

Support general PIM

Gather(void *target, int num_threads);

As explicit barrier

Workloads (Optimization Region and Dataset)

Benchmarks

backprop: activation calculation in feedforward pass (2M hidden units)

lud: upper and lower triangular matrix decomposition (4K matrix dimension)

pagerank: ranking score calculation (web-Google graph from SNAP Dataset)

sgemm: matrix multiplication (4K x 4K matrix)

spmv: matrix-vector multiplication loop (4K dimension with 0.7 sparsity)

Microbenchmarks

reduce: sum reduction over a sequential vector (64K dimension)

rand_reduce: sum reduction over random elements (64K dimension)

mac: multiply-and-accumulate over two sequential vectors (2 64K-D vectors)

rand_mac: multiply-and-accumulate over two random element lists (2 64K-element lists)

On/Off-Chip Data Movement

[Figure: normalized on/off-chip data movement (axis 0 to 1.25) for PEI, ART-tid and ART-addr on backprop, lud, pagerank, sgemm and spmv, broken down into normal request, active request, normal response and active response]

Fine-grained offloading overhead

PEI cache thrashing