
Page 1

Persistent RNNs (stashing recurrent weights on-chip)

Gregory Diamos

Baidu SVAIL

April 7, 2016

Page 2

SVAIL

Think hard AI.

Goal

Develop hard AI technologies that impact 100 million users.

Page 3

Deep Learning at SVAIL

[Figure: recognition accuracy vs. data and compute, comparing deep learning against many previous methods, with the state of the art and human level marked. The compute axis spans roughly 100 GFLOP/s (1 laptop), 6 TFLOP/s (1 GPU), 800 TFLOP/s (128 GPUs), and 100 PFLOP/s (16K GPUs).]

Hypothesis: deep learning scales with data and compute.

Can we strong scale deep learning to the limits of technology?

Page 4

Persistent RNNs

30x speedup at a mini-batch size of 4

Why is reducing the mini-batch size important?

Train bigger and deeper models.

Strong scale to more GPUs.

Improve efficiency of deployed models.

Page 5

Training Deep RNNs

Page 6

Deep Speech

Near human level speech recognition in Mandarin and English

Trained on over 10,000 hours (about 1 year) of speech data.

20 ExaFLOPs of work to train (7 days on 16 GPUs at 40% of peak).
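As a rough consistency check (taking the 6 TFLOP/s single-GPU figure from the previous slide as an assumption): 16 GPUs × 6 TFLOP/s × 0.4 ≈ 38 TFLOP/s sustained, and 20 ExaFLOPs / 38 TFLOP/s ≈ 5.2 × 10^5 seconds, or about 6 days, which is in the same ballpark as the quoted 7 days.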

Page 7

Data parallel training

[Figure: a mini-batch of speech data is split into slices, one per GPU (GPU 0, GPU 1, ...).]

Data parallelism:

The training data is grouped into mini-batches.

Each GPU trains a copy of the model on a slice of the mini-batch.

GPUs synchronize their models after a fixed number of steps.
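The synchronization step amounts to averaging gradients across the replicas after each step (or a fixed number of steps). A minimal sketch of this pattern is below; the dummy kernel and the slow host-side averaging are placeholders for illustration only, not the production system.

```cuda
// Minimal sketch of synchronous data-parallel training: each GPU computes
// gradients for its slice of the mini-batch, then the replicas are averaged.
// The dummy kernel and host-side averaging are illustrative only.
#include <cuda_runtime.h>
#include <vector>

__global__ void computeGradients(const float* slice, float* grad, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) grad[i] = slice[i];  // stand-in for the real forward/backward pass
}

int main() {
    int gpuCount = 0;
    cudaGetDeviceCount(&gpuCount);
    const int params = 1 << 16;

    std::vector<float*> dSlice(gpuCount), dGrad(gpuCount);
    for (int g = 0; g < gpuCount; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&dSlice[g], params * sizeof(float));
        cudaMalloc(&dGrad[g],  params * sizeof(float));
        cudaMemset(dSlice[g], 0, params * sizeof(float));
    }

    // One training step: every replica works on its own slice of the mini-batch.
    for (int g = 0; g < gpuCount; ++g) {
        cudaSetDevice(g);
        computeGradients<<<(params + 255) / 256, 256>>>(dSlice[g], dGrad[g], params);
    }

    // Synchronize the replicas: average the gradients (done slowly on the host here).
    std::vector<float> sum(params, 0.0f), local(params);
    for (int g = 0; g < gpuCount; ++g) {
        cudaSetDevice(g);
        cudaMemcpy(local.data(), dGrad[g], params * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < params; ++i) sum[i] += local[i] / gpuCount;
    }
    // 'sum' now holds the averaged gradient that every replica would apply.
    return 0;
}
```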

Page 8

Mini-batch constraints

So how should you choose the mini-batch size?

[Figure: wall-clock time to convergence vs. mini-batch size. Hardware is inefficient below about 64 per GPU; optimization is inefficient above about 1024.]

Hardware efficiency will set a lower bound.

Optimization efficiency will set an upper bound.

Shrinking the mini-batch per GPU enables the use of more GPUs.

Page 9

Determining the batch size

The upper bound can be found empirically.

In general a hyperparameter search is needed, but a useful heuristic is:

momentum = 1.0 − miniBatchSize / windowSize

learningRate = stepSize × (1.0 − momentum) × miniBatchSize
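As a sketch (not the production code), the heuristic is just two assignments; windowSize and stepSize are hyperparameters and the numeric values below are placeholders.

```cuda
// The batch-size heuristic from the slide, written out as plain host code.
// windowSize and stepSize are hyperparameters; the example values are placeholders.
#include <cstdio>

int main() {
    const double miniBatchSize = 512.0;   // algorithmic mini-batch size
    const double windowSize    = 2048.0;  // placeholder value
    const double stepSize      = 1e-4;    // placeholder value

    double momentum     = 1.0 - miniBatchSize / windowSize;
    double learningRate = stepSize * (1.0 - momentum) * miniBatchSize;

    // Note: with this scaling, learningRate = stepSize * miniBatchSize^2 / windowSize,
    // so momentum and learning rate move together as the mini-batch size changes.
    printf("momentum = %f, learning rate = %g\n", momentum, learningRate);
    return 0;
}
```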

Page 10

Persistent RNN Details

Page 11

RNN primer

RNNs built on GEMM calls reload the weights (U) each timestep.

However, the weights are constant, and this is wasteful.
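For reference, the GEMM-based baseline looks roughly like the sketch below: one cuBLAS call per timestep, each of which streams U out of off-chip memory again even though U never changes. The sizes are placeholders, and the input term, bias, and nonlinearity are omitted.

```cuda
// Sketch of a GEMM-based recurrent step: h_t = U * h_{t-1}, one cuBLAS call per
// timestep. Every call rereads U from off-chip memory even though U is constant.
// Sizes are placeholders; the nonlinearity and input term are omitted for brevity.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int hidden = 1152, miniBatch = 4, timesteps = 100;

    float *U, *h, *hNext;
    cudaMalloc(&U,     hidden * hidden * sizeof(float));
    cudaMalloc(&h,     hidden * miniBatch * sizeof(float));
    cudaMalloc(&hNext, hidden * miniBatch * sizeof(float));
    cudaMemset(U, 0, hidden * hidden * sizeof(float));
    cudaMemset(h, 0, hidden * miniBatch * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float one = 1.0f, zero = 0.0f;

    for (int t = 0; t < timesteps; ++t) {
        // hNext = U * h  (column-major: hidden x hidden times hidden x miniBatch)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    hidden, miniBatch, hidden,
                    &one, U, hidden, h, hidden, &zero, hNext, hidden);
        float* tmp = h; h = hNext; hNext = tmp;  // swap buffers for the next step
    }

    cublasDestroy(handle);
    cudaFree(U); cudaFree(h); cudaFree(hNext);
    return 0;
}
```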

Page 12

Caching weights in registers

[Figure: the GPU memory hierarchy, one line per level (bandwidth, latency, capacity, throughput):
GPU (x1): 380 GB/s, 300 ns, 5.5 MB, 6.144 TFLOP/s
Core (x24): 128 GB/s, 30 ns, 230 KB, 256 GFLOP/s
Thread (x128): 16 GB/s, 6 ns, 896 B, 2 GFLOP/s]

Off-chip memory is much slower and less efficient than registers.

GPUs have more on-chip memory in registers than anywhere else.

Cache RNN weights in registers and reuse them over timesteps.
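A toy version of the idea, assuming a single thread block and a small hidden size, is sketched below: each thread loads one row of U into registers once and reuses it every timestep, while only the activations move through shared memory. The real persistent kernel spreads U across all SMs and replaces __syncthreads() with the inter-CTA barrier described later; the dimensions here are illustrative only.

```cuda
// Toy "weights stashed in registers" RNN step: one thread block, HIDDEN threads,
// each thread keeps one row of U in a per-thread array for the entire sequence.
// Only the activations move (through shared memory); U is loaded exactly once.
#include <cuda_runtime.h>

const int HIDDEN = 64;  // small enough that the per-thread row can live in registers

__global__ void persistentToyRnn(const float* U, const float* h0,
                                 float* hOut, int timesteps) {
    __shared__ float h[HIDDEN];

    // Load this thread's row of the recurrent matrix once, at the start.
    float row[HIDDEN];
    #pragma unroll
    for (int j = 0; j < HIDDEN; ++j)
        row[j] = U[threadIdx.x * HIDDEN + j];

    h[threadIdx.x] = h0[threadIdx.x];
    __syncthreads();

    for (int t = 0; t < timesteps; ++t) {
        // Recurrent matrix-vector product using the cached weights.
        float acc = 0.0f;
        #pragma unroll
        for (int j = 0; j < HIDDEN; ++j)
            acc += row[j] * h[j];
        acc = fmaxf(acc, 0.0f);          // ReLU nonlinearity
        __syncthreads();                 // everyone has read h before it is overwritten
        h[threadIdx.x] = acc;
        __syncthreads();                 // new h is visible before the next timestep
    }

    hOut[threadIdx.x] = h[threadIdx.x];
}

int main() {
    float *dU, *dH0, *dHOut;
    cudaMalloc(&dU,    HIDDEN * HIDDEN * sizeof(float));
    cudaMalloc(&dH0,   HIDDEN * sizeof(float));
    cudaMalloc(&dHOut, HIDDEN * sizeof(float));
    cudaMemset(dU, 0, HIDDEN * HIDDEN * sizeof(float));
    cudaMemset(dH0, 0, HIDDEN * sizeof(float));

    persistentToyRnn<<<1, HIDDEN>>>(dU, dH0, dHOut, /*timesteps=*/100);
    cudaDeviceSynchronize();

    cudaFree(dU); cudaFree(dH0); cudaFree(dHOut);
    return 0;
}
```

Loading U once and reusing it over all timesteps is what removes the per-timestep weight traffic that the GEMM formulation pays.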

Page 13

Choosing the tile sizes

[Figure: the 1152 x 1152 recurrent weight matrix is split into block rows of 48 rows, one per SM (SM0 .. SM23); within an SM, each of 8 warps (Warp0 .. Warp7) owns 6 rows, and the 32 threads of each warp are interleaved across the columns.]

Block rows avoid additional inter-CTA synchronizations.

Each SM loads the activations into shared memory.

Threads are interleaved to avoid shared memory bank conflicts.

Vector loads and broadcasts amplify shared memory bandwidth.
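Reading the figure's numbers one way: 1152 rows / 24 SMs = 48 rows per CTA, and 48 rows / 8 warps = 6 rows per warp. The small helper below only computes that hypothetical (CTA, warp) to row-range mapping; the real kernel's column interleaving and register allocation are more involved and not shown.

```cuda
// Illustrative only: one possible (CTA, warp) -> row-range mapping for a
// 1152 x 1152 recurrent weight matrix tiled over 24 SMs, following the
// block-row layout in the figure (48 rows per SM, 6 rows per warp).
#include <cstdio>

const int HIDDEN        = 1152;
const int NUM_CTAS      = 24;                            // one CTA pinned per SM
const int WARPS_PER_CTA = 8;
const int ROWS_PER_CTA  = HIDDEN / NUM_CTAS;             // 48
const int ROWS_PER_WARP = ROWS_PER_CTA / WARPS_PER_CTA;  // 6

void ownedRows(int cta, int warp, int* firstRow, int* lastRow) {
    *firstRow = cta * ROWS_PER_CTA + warp * ROWS_PER_WARP;
    *lastRow  = *firstRow + ROWS_PER_WARP - 1;
}

int main() {
    int lo, hi;
    ownedRows(/*cta=*/0, /*warp=*/0, &lo, &hi);   // rows 0..5
    printf("CTA 0, warp 0 owns rows %d..%d\n", lo, hi);
    ownedRows(/*cta=*/23, /*warp=*/7, &lo, &hi);  // rows 1146..1151
    printf("CTA 23, warp 7 owns rows %d..%d\n", lo, hi);
    return 0;
}
```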

Page 14

Global barriers on GPUs

[Figure: left, an ordinary kernel launch over a grid of cooperative thread arrays, where barriers and divergent branches are scoped to a single CTA; right, a persistent kernel launch in which the CTAs stay resident and synchronize through a global barrier.]

An inter-CTA barrier is implemented with a counting semaphore.

Uses atomic, membar, and cache-modified load/store operations.

Completes in about 500 ns on a Titan X GPU.

Disclaimer: global barriers violate the CUDA 7.5 programming model.

CUDA does not guarantee forward progress of multiple CTAs.

Our system implements cooperative threading for correctness.
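One common way to build such a barrier in the spirit of the slide (a counting semaphore plus a generation flag, using atomics and fences) is sketched below. This is an assumption about the implementation rather than the authors' code, and it is only safe when every CTA of the persistent kernel is resident on the GPU at the same time.

```cuda
// Sketch of an inter-CTA (global) barrier built from a counting semaphore and a
// generation flag, using atomics and memory fences. It only works if all CTAs of
// the kernel are co-resident on the GPU; CUDA 7.5 does not guarantee this, which
// is the disclaimer above. Illustrative, not the authors' implementation.
#include <cuda_runtime.h>

__device__ unsigned int gArrived    = 0;  // counting semaphore
__device__ unsigned int gGeneration = 0;  // bumped when the barrier opens

__device__ void globalBarrier() {
    __syncthreads();  // make sure the whole CTA has arrived
    if (threadIdx.x == 0) {
        unsigned int gen = atomicAdd(&gGeneration, 0u);  // read current generation
        __threadfence();  // publish this CTA's prior writes to the whole device

        if (atomicAdd(&gArrived, 1u) == gridDim.x - 1) {
            // Last CTA to arrive: reset the counter, then release the waiters.
            atomicExch(&gArrived, 0u);
            __threadfence();
            atomicAdd(&gGeneration, 1u);
        } else {
            // Spin until the last CTA bumps the generation.
            while (atomicAdd(&gGeneration, 0u) == gen) { }
        }
    }
    __syncthreads();  // release the rest of the CTA
}

__global__ void persistentLoop(int timesteps) {
    for (int t = 0; t < timesteps; ++t) {
        // ... per-timestep work (load, math, reduce) would go here ...
        globalBarrier();  // all CTAs agree that timestep t is finished
    }
}

int main() {
    // Launch no more CTAs than can be co-resident, e.g. one CTA per SM.
    persistentLoop<<<24, 128>>>(/*timesteps=*/100);
    cudaDeviceSynchronize();
    return 0;
}
```

In practice this means launching no more CTAs than can be simultaneously resident (for example, one per SM), which is presumably part of what the cooperative threading above refers to.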

Page 15

Software pipelining

[Figure: the pipeline schedule over iterations i0, i1, ... and timesteps. The load, math, reduce, and barrier stages of mini-batch samples 0-3 are staggered so that in steady state every stage is busy with a different sample.]

Software pipelining is used to hide latency.

Thread-local math (430 ns).

Intra-SM reduction (320 ns).

Global loads (315 ns).

Global barrier (500 ns).

These are grouped into 4 pipeline stages, kept full with a mini-batch of 4.
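One way to see how a mini-batch of 4 keeps the four stages full: in steady state, while one sample is in the barrier stage, the other three are in the reduce, math, and load stages. The toy schedule printer below is purely illustrative and runs on the host.

```cuda
// Toy illustration of the 4-stage software pipeline (load, math, reduce, barrier).
// With a mini-batch of 4 samples staggered by one cycle each, every stage is
// occupied by a different sample on every steady-state cycle. Host-only sketch.
#include <cstdio>

int main() {
    const char* stages[4] = {"load", "math", "reduce", "barrier"};
    const int samples = 4, cycles = 8;

    for (int cycle = 0; cycle < cycles; ++cycle) {
        printf("cycle %d:", cycle);
        for (int s = 0; s < samples; ++s) {
            int stage = cycle - s;  // sample s enters the pipeline on cycle s
            if (stage >= 0)
                printf("  sample %d -> %s", s, stages[stage % 4]);
        }
        printf("\n");
    }
    return 0;
}
```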

Page 16

Strong Scaling

Page 17

Scaling to 128 GPUs

Scaling results for end-to-end model training.

8 GPUs per node, 7 GB/s InfiniBand between nodes.

The algorithmic mini-batch size is fixed at 512.

[Figure: Deep Speech scaling with 1152-unit layers. Sustained TFLOP/s (0-300) vs. GPU count (0-140) for PERSISTENT-RNN and GEMM-RNN, with a PERFECT SCALING reference line.]

A smaller mini-batch per GPU enables the use of up to 128 GPUs.

Page 18

Exploring deep residual RNNs

Using a mini-batch per GPU of 4 provides a 16x reduction in memory.

Models with more parameters can now fit into GPU memory.

[Figure: deep residual network error rate reduction with depth. Word error rate (English, axis 27-36) vs. recurrent layer count (0-90) for the Deep Residual RNN; error rate falls as depth increases.]

Results suggest that residual skip connections also apply to RNNs.

Page 19

Pascal and future

Future GPUs will enable bigger and faster RNN layers.

bigger GPUs (more threads, more registers)

low-latency atomics between GPUs (NVLink)

lower precision (fp16)

Page 20

Conclusions

So far, deep learning for speech recognition has scaled with compute.

[Figure: the same recognition accuracy vs. data and compute plot from the Deep Learning at SVAIL slide, with the compute axis spanning 100 GFLOP/s (1 laptop) to 100 PFLOP/s (16K GPUs).]

Persistent kernels provide a new tool for accelerating RNN training.

Let’s continue building faster computers, software, and algorithms.

What other hard AI problems will scale with deep learning and compute?

Page 21

Questions

Questions?

Contact Me:

Gregory Diamos - [email protected]

Baidu USA is hiring!

http://usa.baidu.com/careers/
