
1

TensorDIMM: A Practical Near-Memory Processing Architecture

for Embeddings and Tensor Operations in Deep Learning

Youngeun Kwon, Yunjae Lee, and Minsoo Rhu

KAIST

2

Research Scope

3

Primarily focused on “dense” DNN layers (e.g., CNNs, RNNs, …)

DL architecture research so far

* Chen et al., “DaDianNao: A Machine-Learning Supercomputer”, MICRO-2014
* Chen et al., “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks”, ISCA-2016
* Han et al., “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA-2016
* Parashar et al., “SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks”, ISCA-2017
* Yazdanbakhsh et al., “GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks”, ISCA-2018

4

“Non-conventional” DNN layers are causing a bottleneck

Emerging DL applications?

* Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, arXiv, 2018
* Graves et al., “Neural Turing Machines”, arXiv, 2014
* Naumov et al., “Deep Learning Recommendation Model for Personalization and Recommendation Systems”, arXiv, 2019

[Figure: example applications (BERT, Neural Turing Machine, recommendation models)]

5

“Sparse” embedding layers (rather than dense DNNs) are the bottleneck

Personalized recommendation models

6

Projection of sparse features into a dense vector space (e.g., word2vec)

What is an embedding?

* Mikolov et al., “Distributed Representations of Words and Phrases and their Compositionality”, NIPS-2013

[Figure: word2vec-style embedding space, where the KING→QUEEN offset mirrors MAN→WOMAN and KINGS→QUEENS]

7

Stored as a large look-up table containing millions-to-billions of entries

What is an embedding?

User ID    Embedding (vector)
0: Sam     [0.49, 0.52, 0.23, 0.69, 0.32, …]
1: Harry   [0.24, 0.27, 0.13, 0.09, 0.79, …]
2: Matt    [0.31, 0.71, 0.46, 0.91, 0.07, …]
3: John    [0.83, 0.43, 0.81, 0.57, 0.09, …]
4: Elicia  [0.31, 0.83, 0.23, 0.69, 0.86, …]
…          …
N: Danny   [0.77, 0.18, 0.71, 0.59, 0.46, …]

(N can be millions)
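For concreteness, a minimal NumPy sketch of such a table (all sizes and names here are hypothetical, for illustration only):

import numpy as np
# Hypothetical sizes; real tables reach millions-to-billions of rows.
NUM_USERS = 100_000
EMBED_DIM = 64
# One dense row per user ID.
table = np.random.rand(NUM_USERS, EMBED_DIM).astype(np.float32)
# An embedding "lookup" is just a row read.
sam = table[0]  # e.g., [0.49, 0.52, 0.23, ...]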

8

Goal: predict the preference of a user-item pair

Recommendation model 101

[Figure: the user's embedding (Sam) and the movie embeddings looked up from the embedding table (Movie 0 … Movie N) feed into DNNs, which output a preference prediction 🙂?🙁]
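A rough sketch of this pipeline (the mlp callable and all shapes are stand-ins, not the paper's actual model):

import numpy as np
def predict_preference(user_id, movie_ids, user_table, movie_table, mlp):
    user_vec = user_table[user_id]        # user embedding lookup
    movie_vecs = movie_table[movie_ids]   # gather item embeddings
    movie_vec = movie_vecs.mean(axis=0)   # reduce (average) them
    features = np.concatenate([user_vec, movie_vec])
    return mlp(features)                  # dense DNN -> preference
# Toy usage with random tables and a placeholder "DNN":
user_table = np.random.rand(10, 64).astype(np.float32)
movie_table = np.random.rand(100, 64).astype(np.float32)
mlp = lambda x: float(x.sum())
score = predict_preference(0, [7, 4, 2, 6, 1], user_table, movie_table, mlp)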

13

Key Primitives in Embedding Layers

14

Copying target embeddings into a contiguous address space

#1: Embedding lookup (gather)

[Figure: scattered rows of the embedding table (here Movie 7, 4, 2, 6, 1 out of Movie 0 … Movie N) are copied into a contiguous output tensor]
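In NumPy terms, the gather is fancy indexing into the table (toy sizes):

import numpy as np
table = np.random.rand(1000, 64).astype(np.float32)  # toy embedding table
indices = [7, 4, 2, 6, 1]   # sparse, scattered lookup IDs
# The output is one contiguous buffer: row i holds table[indices[i]].
gathered = table[indices]   # shape (5, 64), a contiguous copy
assert gathered.flags["C_CONTIGUOUS"]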

16

e.g., averaging multiple embeddings, element-wise addition/multiplication

#2: Tensor operation (reduction)

[Figure: the gathered embeddings (Movie 7, 4, 2, 6, 1) are summed element-wise to produce the average movie embedding]
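The same reductions as a NumPy sketch over the gathered rows:

import numpy as np
table = np.random.rand(1000, 64).astype(np.float32)
gathered = table[[7, 4, 2, 6, 1]]     # output of the gather step
summed = gathered.sum(axis=0)         # element-wise addition
average = gathered.mean(axis=0)       # average movie embedding
product = np.prod(gathered, axis=0)   # element-wise multiplication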

19

Embedding gathers/reductions are extremely memory-bandwidth sensitive

Key challenges of embedding layers

[Figure: an embedding lookup gathers Movie 7, 4, 2, 6, 1 from the table, then a tensor operation reduces them into an averaged embedding]

20

Current Solutions for Recommendation Systems

21

Size of embedding tables can reach hundreds of GBs

The memory wall for “Inference”

[Figure: inference pipeline (embedding lookup → reduction → DNNs) running on the CPU; the embedding stages are capacity limited]

24

CPU stores the entire embedding tables, but DNNs are executed on GPUs

Design #1: Hybrid CPU-GPU approach

Challenge: multiple embeddings must be copied over the narrow PCIe channel

[Figure: embedding lookup/reduction on the CPU (capacity limited), DNNs on the GPU (compute limited), connected over PCIe: the communication bottleneck]

26

The CPU handles every step of inference

Design #2: CPU-only approach

Challenge: the low throughput of CPUs slows down DNN computation

[Figure: embedding lookup, reduction, and DNNs all run on the CPU: the computation bottleneck]

28

An unbuildable, oracular design assuming infinite GPU memory capacity

Design #3: GPU-only approach?

[Figure: embedding lookup, reduction, and DNNs all run on the GPU: no communication, higher memory bandwidth, and fast DNN execution]

32

CPU-only and hybrid CPU-GPU designs fall far behind the (oracular) GPU-only design

Key challenges of existing solutions?

[Chart: inference latency relative to the oracular GPU-only design; the CPU-only design suffers a 7.3x slowdown (computation bottleneck) and the hybrid CPU-GPU design a 10x slowdown (communication bottleneck)]

33

Our Approach: “Near”-Memory Acceleration (i.e., Near-Memory Processing, NMP)

34

Augment the buffer device with NMP cores for embedding gathers/reductions

TensorDIMM: an NMP architecture for embeddings

[Figure: (a) NMP core: input queues Q(A)/Q(B), a vector ALU, an output queue Q(C), an NMP-local memory controller, a DDR PHY, and a protocol engine bridging the local DRAM to the DDRx memory channel; (b) TensorDIMM: a DIMM whose buffer device hosts the NMP core alongside commodity DRAM chips]
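A software analogy for one NMP core command (all names and the interface here are illustrative; the paper's actual command format is not reproduced):

import numpy as np
def nmp_elementwise_add(local_dram, addr_a, addr_b, addr_c, n):
    # Addresses are float32 element offsets for simplicity.
    # The protocol engine receives the command over the DDRx channel;
    # the NMP-local memory controller streams operands from local
    # DRAM into input queues Q(A)/Q(B); the vector ALU adds them;
    # results drain through output queue Q(C) back to local DRAM.
    q_a = local_dram[addr_a:addr_a + n]
    q_b = local_dram[addr_b:addr_b + n]
    local_dram[addr_c:addr_c + n] = q_a + q_b   # vector ALU
dram = np.zeros(3 * 64, dtype=np.float32)  # toy NMP-local DRAM
dram[0:64], dram[64:128] = 1.0, 2.0        # embeddings A and B
nmp_elementwise_add(dram, 0, 64, 128, 64)  # C = A + B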

36

“Effective” memory bandwidth scales in proportion to the number of DIMMs/ranks

Key advantage of TensorDIMM

[Figure: in a current system, every added DIMM shares one processor-side memory channel, so bandwidth stays at 100; in the TensorDIMM approach, embedding gathers/reductions are done “locally” within each DIMM, so each added TensorDIMM contributes its own bandwidth: 100 → 200 → 300 → 400]
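In one line (our notation, not the paper's): with N TensorDIMMs each providing local bandwidth B_DIMM, the effective bandwidth is B_effective = N × B_DIMM (e.g., 4 × 100 = 400), whereas a conventional channel stays at its fixed channel bandwidth regardless of N.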

42

Leverage rank-level parallelism for maximal bandwidth utilization

Mapping embedding tables in DRAMs

[Figure: each input embedding (Embedding 0 … Embedding n) and the output embedding, the operands of the element-wise operations, are split into 64B chunks interleaved across Rank 0 … Rank 15, so the NMP cores of DIMM0 … DIMM15 each stream their own slice in parallel]
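A sketch of this mapping (the 16-rank count and 64B granularity follow the figure; the function itself is illustrative):

import numpy as np
RANKS = 16   # Rank 0 ... Rank 15, one NMP core per TensorDIMM
CHUNK = 64   # bytes placed in each rank per round
def split_across_ranks(embedding):
    # Deal 64B chunks out round-robin: rank r gets chunks r, r+16, ...
    raw = embedding.tobytes()
    chunks = [raw[i:i + CHUNK] for i in range(0, len(raw), CHUNK)]
    return {r: b"".join(chunks[r::RANKS]) for r in range(RANKS)}
vec = np.random.rand(256).astype(np.float32)  # a 1 KB embedding
per_rank = split_across_ranks(vec)            # one 64B slice per rank here
# Element-wise ops on two embeddings now engage all 16 ranks at once,
# each NMP core reducing only its local slices.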

45

A pooled memory architecture aggregating multiple TensorDIMMs

Tensor“Node” using TensorDIMMs

[Figure: a TensorNode built from multiple TensorDIMMs]

46

Utilize high-speed links (e.g., NVLINK) for inter-device communication

TensorNode as a “remote” memory pool

[Figure: GPUs reach the TensorNode's TensorDIMMs over NVLINK (150 GB/sec)]

47

A platform for scalable expansion of both memory bandwidth and capacity

Putting everything together

Addresses the memory capacity challenge

Addresses the memory bandwidth challenge

Addresses the compute & communication challenges

48

Evaluation

49

Combination of cycle-level simulation and emulation on real ML systems

Evaluation methodology

• Cycle-level DRAM simulator (Ramulator*): memory bandwidth for embedding gathers/reductions under our address mapping

• Proof-of-concept software prototype on a real ML system (NVIDIA DGX-1V)

* Kim et al., “Ramulator: A Fast and Extensible DRAM Simulator”, IEEE Computer Architecture Letters, 2015

51

Effective bandwidth scales in proportion to the number of ranks (avg. 4x↑)

Memory bandwidth utilization

52

Software prototype runs on an NVIDIA DGX-1V

Evaluation methodology (cont.)

• NVIDIA DGX-1V: eight NVIDIA V100 GPUs, two Intel Xeon E5-2698 v4 CPUs

• CPU layers: Intel's Math Kernel Library (MKL)

• GPU layers: NVIDIA cuDNN / cuBLAS, plus in-house CUDA implementations of the other layers

53

A proof-of-concept software prototype to emulate TensorDIMM

TensorNode system modeling

[Figure: a modern GPU architecture (SMs, crossbar, MC/L2 slices, DRAM) stands in for TensorNode: its SMs emulate the NMP cores and its memory channels emulate TensorNode's NMP-local memory channels and the associated memory bandwidth; on the GPU-side interconnect, one GPU is used as a normal GPU while another is “emulated” as a TensorNode over NVLINK (150 GB/sec)]

Four system design points

Evaluation methodology

• CPU-only

• Hybrid CPU-GPU

• TensorDIMM (ours)

• GPU-only (oracle)

55

TensorDIMM helps reduce both embedding and MLP latency, achieving an overall 6-9x speedup against the baselines

Latency breakdown

59

TensorDIMM: A Near-Memory Processing Architecture for Sparse Embedding Layers

The “first” architectural solution tackling sparse embedding layers

A “practical” near-memory processing solution for an important AI workload

An average “6~9x” performance improvement on state-of-the-art DNN-based recommendation models

60

Questions?

61

Backup Slides

62

TensorDIMM design overheads

It's not free, but adding custom logic within a DIMM has been done before (e.g., the IBM Centaur DIMM)
