TensorDIMM: A Practical Near-Memory Processing Architecture for Embeddings and Tensor Operations in Deep Learning

Youngeun Kwon, Yunjae Lee, and Minsoo Rhu (KAIST)


Page 1

TensorDIMM: A Practical Near-Memory Processing Architecture

for Embeddings and Tensor Operations in Deep Learning

KAIST

Youngeun Kwon, Yunjae Lee, and Minsoo Rhu

Page 2

Research Scope

Page 3

Primarily focused on “dense” DNN layers (e.g., CNNs, RNNs, …)

DL architecture research so far

* Chen et al., "DaDianNao: A Machine-Learning Supercomputer", ISCA-2014
* Chen et al., "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks", ISCA-2016
* Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network", ISCA-2016
* Parashar et al., "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks", ISCA-2017
* Yazdanbakhsh et al., "GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks", ISCA-2018

Page 4

“Non”-conventional DNN layers are causing a bottleneck

Emerging DL applications?

* Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv, 2018
* Graves et al., "Neural Turing Machines", arXiv, 2014
* Naumov et al., "Deep Learning Recommendation Model for Personalization and Recommendation Systems", arXiv, 2019

[Figure: example applications: BERT, Neural Turing Machine, Recommendation]

Page 5

“Sparse” embedding layers (rather than dense DNNs) are the bottleneck

Personalized recommendation models

Page 6

Projection of sparse features into a dense vector space (e.g., word2vec)

What is an embedding?

* Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”, NIPS-2013

[Figure: a word2vec-style embedding space illustrating analogy pairs such as KING : QUEEN, MAN : WOMAN, UNCLE : AUNT, and KINGS : QUEENS]

Page 7

Stored as a large look-up table containing millions-to-billions of entries

What is an embedding?

User ID Embedding (vector)

0: Sam [0.49, 0.52, 0.23, 0.69, 0.32, …]

1: Harry [0.24, 0.27, 0.13, 0.09, 0.79, …]

2: Matt [0.31, 0.71, 0.46, 0.91, 0.07, …]

3: John [0.83, 0.43, 0.81, 0.57, 0.09, …]

4: Elicia [0.31, 0.83, 0.23, 0.69, 0.86, …]

… …

N: Danny [0.77, 0.18, 0.71, 0.59, 0.46, …]

N: can be millions
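For intuition only (this is my own sketch, not from the slides), an embedding table is just a 2-D array indexed by an ID; the table size N and dimension D below are made-up numbers:

```python
import numpy as np

# Hypothetical sizes: N users, each mapped to a D-dimensional dense vector.
N, D = 1_000_000, 64
embedding_table = np.random.rand(N, D).astype(np.float32)  # the look-up table

def embedding_lookup(table, ids):
    """Return the rows of `table` selected by `ids` (a simple gather)."""
    return table[ids]

sam_vector = embedding_lookup(embedding_table, 0)  # e.g., user 0 ("Sam")
```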

Page 8

Goal: predict the preference of a user-item pair

Recommendation model 101

[Figure: a user, “Sam”]

Page 9

Goal: predict the preference of a user-item pair

Recommendation model 101

[Figure: user “Sam” and a movie embedding table with entries Movie 0 … Movie N]

Page 10

Goal: predict the preference of a user-item pair

Recommendation model 101

[Figure: the movie embeddings relevant to user “Sam” being looked up from the table (Movie 0 … Movie N)]

Page 11

Goal: predict the preference of a user-item pair

Recommendation model 101

[Figure: the looked-up movie embeddings for user “Sam” feeding into DNNs]

Page 12

Goal: predict the preference of a user-item pair

Recommendation model 101

[Figure: the looked-up movie embeddings for user “Sam” feed into DNNs, which output a preference prediction (🙂 or 🙁)]
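To make the pipeline concrete, here is a toy forward pass in the spirit of this slide: user/item look-up tables, a reduction over the gathered item embeddings, and a small MLP. The use of PyTorch and all layer sizes are illustrative assumptions, not the model evaluated in the paper.

```python
import torch
import torch.nn as nn

class ToyRecModel(nn.Module):
    def __init__(self, num_users=1000, num_items=1000, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)   # user look-up table
        self.item_emb = nn.Embedding(num_items, dim)   # item (movie) look-up table
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, user_id, item_ids):
        u = self.user_emb(user_id)               # one user embedding
        v = self.item_emb(item_ids).mean(dim=0)  # gather + reduce (average) item embeddings
        return torch.sigmoid(self.mlp(torch.cat([u, v])))  # preference score in (0, 1)

model = ToyRecModel()
score = model(torch.tensor(0), torch.tensor([7, 4, 2, 6, 1]))  # "Sam" and five movies
```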

Page 13

Key Primitives in Embedding Layers

Page 14

Copying target embeddings into a contiguous address space

#1: Embedding lookup (gather)

[Figure: an embedding table with entries Movie 0 … Movie N]

Page 15

Copying target embeddings into a contiguous address space

#1: Embedding lookup (gather)

[Figure: Movie 7, 4, 2, 6, 1 are gathered from the embedding table (Movie 0 … Movie N) into an output tensor that occupies a “contiguous address space”]
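A minimal NumPy sketch of the gather primitive (illustrative only): the selected rows are copied out of the table into one contiguous output buffer.

```python
import numpy as np

table = np.random.rand(1000, 64).astype(np.float32)  # embedding table (Movie 0 … Movie N)
lookup_ids = [7, 4, 2, 6, 1]                          # the target embeddings

# np.take copies the selected rows into a new, contiguous output tensor.
output_tensor = np.take(table, lookup_ids, axis=0)    # shape (5, 64)
assert output_tensor.flags["C_CONTIGUOUS"]
```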

Page 16

e.g., Averaging multiple embeddings, element-wise addition/multiplication

#2: Tensor operation (reduction)

[Figure: the gathered embeddings Movie 7, 4, 2, 6, 1]

Page 17

e.g., Averaging multiple embeddings, element-wise addition/multiplication

#2: Tensor operation (reduction)

[Figure: the gathered embeddings Movie 7, 4, 2, 6, 1 being combined element-wise (+)]

Page 18

e.g., Averaging multiple embeddings, element-wise addition/multiplication

#2: Tensor operation (reduction)

[Figure: element-wise addition (+) of Movie 7, 4, 2, 6, 1 produces the average movie embedding]
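And the reduction primitive, again as an illustrative sketch: the gathered rows are combined element-wise, e.g., averaged into a single vector.

```python
import numpy as np

gathered = np.random.rand(5, 64).astype(np.float32)  # Movie 7, 4, 2, 6, 1 after the gather

sum_embedding  = gathered.sum(axis=0)        # element-wise addition
avg_embedding  = gathered.mean(axis=0)       # average movie embedding, shape (64,)
prod_embedding = np.prod(gathered, axis=0)   # element-wise multiplication variant
```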

Page 19

Embedding gathers/reductions are extremely memory-bandwidth sensitive

Key challenges of embedding layers

[Figure: an embedding lookup gathers Movie 7, 4, 2, 6, 1 from the table (Movie 0 … Movie N), and a tensor operation (+) produces the averaged embedding]
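A back-of-the-envelope calculation (with assumed sizes, not numbers from the paper) shows why these primitives are bandwidth-bound: each byte read from memory is touched by roughly one arithmetic operation, so the arithmetic intensity is far below what any processor needs to be compute-bound.

```python
# Assumed example: gather 40 embeddings of dimension 512 in FP32 and reduce them.
num_gathers, dim, bytes_per_elem = 40, 512, 4

bytes_moved = num_gathers * dim * bytes_per_elem  # 81,920 bytes read from DRAM
flops = num_gathers * dim                         # ~20,480 adds for the reduction

print(f"{flops / bytes_moved:.2f} FLOP/byte")     # 0.25 FLOP/byte -> memory-bandwidth bound
```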

Page 20

Current Solutions for Recommendation Systems

Page 21

The memory wall for “Inference”

[Figure: inference pipeline: embedding lookup → reduction (+) → DNNs]

Size of embedding tables can reach hundreds of GBs

Page 22

Size of embedding tables can reach hundreds of GBs

The memory wall for “Inference”

[Figure: embedding lookup → reduction (+) → DNNs pipeline, shown with a CPU and the label “Capacity limited”]

Page 23

Size of embedding tables can reach hundreds of GBs

The memory wall for “Inference”

[Figure: embedding lookup → reduction (+) → DNNs pipeline, shown with a CPU and the label “Capacity limited”]

Page 24

The CPU stores the entire embedding tables, while the DNNs are executed on GPUs

Design #1: Hybrid CPU-GPU approach

[Figure: embedding lookup → reduction (+) → DNNs; the capacity-limited embedding stages run on the CPU while the compute-limited DNN stages run on the GPU]

Page 25

Challenge: multiple embeddings must be copied to the GPU over a narrow PCIe channel

Design #1: Hybrid CPU-GPU approach

[Figure: hybrid CPU-GPU pipeline; the gathered embeddings must cross the PCIe link from CPU to GPU, creating a communication bottleneck]
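A sketch of what the hybrid flow looks like in practice (PyTorch with made-up sizes; this illustrates the slide, it is not the paper's code): the gather runs in CPU memory, and the gathered embeddings must cross PCIe before the GPU can reduce them and run the DNN stack.

```python
import torch

table = torch.rand(10_000_000, 64)               # embedding table held in host (CPU) memory
ids = torch.randint(0, 10_000_000, (40,))

gathered = table[ids]                            # embedding lookup (gather) on the CPU
# Copying the gathered embeddings to the GPU crosses the narrow PCIe link:
device = "cuda" if torch.cuda.is_available() else "cpu"
gathered_gpu = gathered.to(device)               # <- the communication bottleneck
pooled = gathered_gpu.mean(dim=0)                # reduction (and the DNNs) run on the GPU
```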

Page 26

The CPU handles every step of inference

Design #2: CPU-only approach

[Figure: embedding lookup → reduction (+) → DNNs, all running on the CPU]

Page 27

Challenge: the low compute throughput of CPUs slows down the DNN computation

Design #2: CPU-only approach

[Figure: CPU-only pipeline; the DNN stages become a computation bottleneck]

Page 28

Unbuildable, oracular solution assuming infinite GPU memory capacity

Design #3: GPU-only approach?

[Figure: embedding lookup → reduction (+) → DNNs, all running on the GPU]

Page 29

Unbuildable, oracular solution assuming infinite GPU memory capacity

Design #3: GPU-only approach?

[Figure: GPU-only pipeline; no CPU-GPU communication]

Page 30

Unbuildable, oracular solution assuming infinite GPU memory capacity

Design #3: GPU-only approach?

[Figure: GPU-only pipeline; no CPU-GPU communication and fast DNN execution]

Page 31

Unbuildable, oracular solution assuming infinite GPU memory capacity

Design #3: GPU-only approach?

[Figure: GPU-only pipeline; no CPU-GPU communication, higher memory bandwidth, and fast DNN execution]

Page 32

CPU-only and hybrid CPU-GPU (vs. GPU-only)

Key challenges of existing solutions?

[Figure: latency relative to the (oracular) GPU-only design; the CPU-only and hybrid CPU-GPU designs suffer 7.3x and 10x slowdowns due to their computation and communication bottlenecks]

Page 33

Our Approach: “Near”-Memory Acceleration (i.e., Near-Memory Processing, NMP)

Page 34

Augment the DIMM's buffer device with NMP cores for embedding gathers/reductions

TensorDIMM: an NMP architecture for embeddings

[Figure: (a) NMP core: a vector ALU fed by input queues (A, B) and drained by an output queue (C), an NMP-local memory controller, and a protocol engine behind the DDR PHY; (b) TensorDIMM: the NMP core sits in the DIMM's buffer device between the DDRx memory channel and the local DRAM devices]

Page 35

Augment the DIMM's buffer device with NMP cores for embedding gathers/reductions

TensorDIMM: an NMP architecture for embeddings

[Figure: the same (a) NMP core and (b) TensorDIMM diagram as on Page 34]

Page 36

“Effective” memory bandwidth scales in proportion to the number of DIMMs/ranks

Key advantage of TensorDIMM

[Figure: current system vs. TensorDIMM approach; in both, the processor's memory controller drives a memory channel populated with DIMMs, each holding DRAM devices behind a buffer device]

Page 37

“Effective” memory bandwidth scales in proportion to the number of DIMMs/ranks

Key advantage of TensorDIMM

[Figure: current system vs. TensorDIMM approach]

Embedding gathers/reductions are done “locally” within a DIMM

Page 38

“Effective” memory bandwidth scales in proportion to the number of DIMMs/ranks

Key advantage of TensorDIMM

[Figure: with a single DIMM, both the current system and the TensorDIMM approach provide a memory bandwidth of 100 (relative units)]

Page 39

“Effective” memory bandwidth scales in proportion to the number of DIMMs/ranks

Key advantage of TensorDIMM

[Figure: with two DIMMs, the current system stays at a memory bandwidth of 100 while the TensorDIMM approach reaches 200]

Page 40

“Effective” memory bandwidth scales in proportion to the number of DIMMs/ranks

Key advantage of TensorDIMM

[Figure: with three DIMMs, the current system stays at a memory bandwidth of 100 while the TensorDIMM approach reaches 300]

Page 41

“Effective” memory bandwidth scales in proportion to the number of DIMMs/ranks

Key advantage of TensorDIMM

[Figure: with four DIMMs, the current system stays at a memory bandwidth of 100 while the TensorDIMM approach reaches 400]

Page 42

Leverage rank-level parallelism for maximal bandwidth utilization

Mapping embedding tables in DRAMs

[Figure: each embedding operand of the element-wise operations (Input Embedding 0, 1, 2, …, n and the Output Embedding) is split into 64B chunks striped across Rank 0, Rank 1, …, Rank 15, so the NMP cores of DIMM 0, DIMM 1, …, DIMM 15 each access only their local 64B slices]

Page 43

Leverage rank-level parallelism for maximal bandwidth utilization

Mapping embedding tables in DRAMs

[Figure: same rank-level striping of embeddings as on Page 42]

Page 44

Leverage rank-level parallelism for maximal bandwidth utilization

Mapping embedding tables in DRAMs

[Figure: same rank-level striping of embeddings as on Page 42]
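As an illustration of this mapping (my own sketch with assumed parameters, not the paper's exact address-mapping function): consecutive 64B chunks of an embedding vector are assigned to consecutive ranks round-robin, so reading or reducing one whole vector engages every rank, and every NMP core, in parallel.

```python
NUM_RANKS = 16     # assumed: 16 ranks/DIMMs, one NMP core each
CHUNK_BYTES = 64   # striping granularity (one DRAM burst)

def chunk_to_rank(byte_offset: int) -> int:
    """Map a 64B chunk of an embedding vector to the rank that stores it."""
    return (byte_offset // CHUNK_BYTES) % NUM_RANKS  # round-robin across ranks

# A 4 KB embedding (1024 FP32 features) spreads evenly over all 16 ranks:
ranks = [chunk_to_rank(off) for off in range(0, 4096, CHUNK_BYTES)]
assert sorted(set(ranks)) == list(range(NUM_RANKS))
```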

Page 45

A pooled memory architecture aggregating multiple TensorDIMMs

Tensor“Node” using TensorDIMMs

[Figure: a TensorNode composed of multiple TensorDIMMs]

Page 46

Utilize high-speed links (e.g., NVLINK) for inter-device communication

TensorNode as a “remote” memory pool

[Figure: GPUs connected to the TensorNode (a pool of TensorDIMMs) over NVLINK at 150 GB/sec]

Page 47

A platform for scalable expansion of both memory bandwidth and capacity

Putting everything together

Addresses the memory capacity challenge

Addresses the memory bandwidth challenge

Addresses the compute & communication challenges

Page 48

Evaluation

Page 49

Cycle-level DRAM simulator (Ramulator*)

Proof-of-concept software prototype on real ML systems (NVIDIA DGX-1V)

Combination of cycle-level simulation and emulation on real ML systems

Evaluation methodology

* Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator”, IEEE Computer Architecture Letters, 2015

Page 50

Cycle-level DRAM simulator (Ramulator*)

Memory bandwidth for embedding gathers/reductions under our address mapping

Proof-of-concept software prototype on real ML systems (NVIDIA DGX-1V)

Combination of cycle-level simulation and emulation on real ML systems

Evaluation methodology

* Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator”, IEEE Computer Architecture Letters, 2015

Page 51

Effective bandwidth scales in proportion to the number of ranks (4x higher on average)

Memory bandwidth utilization

Page 52

Cycle-level DRAM simulator (Ramulator*)

Memory bandwidth for embedding gathers/reductions under our address mapping

Proof-of-concept software prototype on real ML systems (NVIDIA DGX-1V)

Intel’s Math Kernel Library (MKL)

NVIDIA cuDNN / cuBLAS

In-house CUDA implementation of other layers

NVIDIA DGX-1V

• Eight NVIDIA V100 GPUs

• Two Intel Xeon E5-2698 v4

Combination of cycle-level simulation and emulation on real ML systems

Evaluation methodology

* Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator”, IEEE Computer Architecture Letters, 2015

Page 53

A proof-of-concept software prototype to emulate TensorDIMM

TensorNode system modeling

[Figure: a modern GPU (SMs, crossbar, MC/L2 slices, DRAM channels) is used to emulate the TensorNode: the SMs play the role of the NMP cores, and the GPU's local memory channels provide the TensorNode's NMP-local memory bandwidth; in the DGX-1V, one GPU is used as a normal GPU while another is “emulated” as a TensorNode, connected over the GPU-side interconnect (NVLINK, 150 GB/sec)]

Page 54

Four system design points

CPU-only

Hybrid CPU-GPU

TensorDIMM (ours)

GPU-only (oracle)

A proof-of-concept software prototype to emulate TensorDIMM

Evaluation methodology

Page 55

TensorDIMM helps reduce both embedding/MLP latency

Latency breakdown

Page 56

TensorDIMM helps reduce both embedding/MLP latency

Latency breakdown

Page 57

TensorDIMM helps reduce both embedding/MLP latency

Latency breakdown

Page 58

TensorDIMM achieves an overall 6-9x speedup over the baselines

Latency breakdown

Page 59

TensorDIMM: A Near-Memory Processing Architecture for Sparse Embedding Layers

The “first” architectural solution tackling sparse embedding layers

A “practical” near-memory processing solution for an important AI workload

Average “6~9x” performance improvement on state-of-the-art DNN-based recommendation models

Page 60

Questions?

Page 61

Backup Slides

Page 62

TensorDIMM design overheads

It’s not free, but adding custom logic within a DIMM has been done before

IBM Centaur DIMM

Page 63

Page 64

Page 65