
Page 1: S9422: AN AUTO-BATCHING API FOR HIGH-PERFORMANCE RNN INFERENCE

Murat Efe Guney – Developer Technology Engineer, NVIDIA

March 20, 2019

Page 2: REAL-TIME INFERENCE
Sequence Models Based on RNNs

Sequence models for automatic speech recognition (ASR), translation, and speech generation

Real-time applications have a stream of inference requests from multiple users

The challenge is to perform inference with low latency and high throughput

[Diagram: a stream of user utterances ("Hello, my name is Alice", "I am Susan", "This is Bob") arriving at an ASR model.]

Page 3: BATCHING VS NON-BATCHING

Batch size = 1

• Run a single RNN inference task on a GPU

• Low latency, but the GPU is underutilized

Batch size = N

• Group RNN inference instances together

• High throughput and GPU utilization

• Allows employing Tensor Cores in Volta and Turing

Batching: Grouping Inference Requests Together

[Diagram: the shared model weights W applied across the grouped inference requests in a single batched step.]
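To make the weight-reuse point concrete, here is a minimal CPU sketch (not from the talk; all names are illustrative) of one step of the recurrence: with batch size 1 the weights W multiply a single input vector, while a batch of N requests reuses the same W in one matrix-matrix product, which is the shape of work a GPU (and its Tensor Cores) executes efficiently.

#include <cstddef>
#include <vector>

// Batch size = 1: y = W * x. W is (hidden x input), row-major.
// Activation and recurrent terms are omitted for brevity.
std::vector<float> step_single(const std::vector<float>& W, const std::vector<float>& x,
                               std::size_t hidden, std::size_t input) {
    std::vector<float> y(hidden, 0.0f);
    for (std::size_t i = 0; i < hidden; ++i)
        for (std::size_t j = 0; j < input; ++j)
            y[i] += W[i * input + j] * x[j];
    return y;
}

// Batch size = N: Y = W * X, where column n of X holds request n's input.
// The same W is applied once per step instead of once per request.
std::vector<float> step_batched(const std::vector<float>& W, const std::vector<float>& X,
                                std::size_t hidden, std::size_t input, std::size_t N) {
    std::vector<float> Y(hidden * N, 0.0f);
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t i = 0; i < hidden; ++i)
            for (std::size_t j = 0; j < input; ++j)
                Y[i * N + n] += W[i * input + j] * X[j * N + n];
    return Y;
}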

Page 4: BATCHING VS NON-BATCHING
Performance Data on T4

[Chart: RNN Inference Throughput and Latency on T4 for batch sizes 1, 32, 64, 128, and 256. Left axis: throughput (timesteps per ms); right axis: latency (ms per timestep). Series: FP32 throughput, FP16 w/TC throughput, FP32 latency, FP16 w/TC latency. FP32 throughput across the five batch sizes: 1.2, 23.0, 27.6, 32.9, 31.5 timesteps/ms; FP16 w/TC throughput: 1.8, 51.4, 83.8, 116.5, 138.4 timesteps/ms.]

Page 5: RNN BATCHING
Challenges and Opportunities

Existing real-time code is written to run many inference instances with batch size = 1

Real-time batching requires extra programming effort

A naïve implementation can suffer from a significant increase in latency

An ideal solution allows a tradeoff between latency and throughput

RNN cells provide an opportunity to merge inference tasks at different timesteps


Page 6: RNN BATCHING
Combining RNNs at Different Timesteps

[Diagram: inference tasks arrive at different times; with batch size = 4, each batched execution combines the tasks' current timesteps (t0, t1, t2, ...), and a batch slot that frees up is filled with a new inference task. All tasks share common model parameters.]
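A minimal sketch of the slot-filling idea (not the talk's implementation; the types and names are hypothetical): each batch slot holds one in-flight inference task at its own timestep, and whenever a slot frees up it is refilled from the queue of newly arrived tasks before the next batched step is launched.

#include <deque>
#include <optional>
#include <vector>

struct Task {              // one in-flight inference request (hypothetical)
    int id;
    int next_step = 0;     // the task's own timestep counter
    int total_steps;       // how many timesteps have been requested so far
};

// One scheduling round: refill free slots, then advance every occupied slot
// by one timestep in a single batched step.
void schedule_step(std::vector<std::optional<Task>>& slots, std::deque<Task>& pending) {
    // 1. Fill freed slots with newly arrived inference tasks.
    for (auto& slot : slots) {
        if (!slot && !pending.empty()) {
            slot = pending.front();
            pending.pop_front();
        }
    }
    // 2. Launch one batched RNN step over all occupied slots
    //    (done on the GPU in the real system; omitted here).
    // 3. Advance each task by one timestep; retire finished tasks.
    for (auto& slot : slots) {
        if (slot && ++slot->next_step >= slot->total_steps) {
            slot.reset();   // slot becomes free for the next arrival
        }
    }
}

A caller would size the slot vector to the batch size (for this slide's example, std::vector<std::optional<Task>> slots(4);) and call schedule_step once per batched timestep.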

Page 7: RNN CELLS
RNN Cells Supported in TensorRT and cuDNN

TANH:
ht = tanh(Wi xt + Ri ht-1 + bWi + bRi)

RELU:
ht = ReLU(Wi xt + Ri ht-1 + bWi + bRi)

LSTM:
it = σ(Wi xt + Ri ht-1 + bWi + bRi)
ft = σ(Wf xt + Rf ht-1 + bWf + bRf)
ot = σ(Wo xt + Ro ht-1 + bWo + bRo)
c't = tanh(Wc xt + Rc ht-1 + bWc + bRc)
ct = ft ◦ ct-1 + it ◦ c't
ht = ot ◦ tanh(ct)

GRU:
it = σ(Wi xt + Ri ht-1 + bWi + bRi)
rt = σ(Wr xt + Rr ht-1 + bWr + bRr)
h't = tanh(Wh xt + rt ◦ (Rh ht-1 + bRh) + bWh)
ht = (1 - it) ◦ h't + it ◦ ht-1
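As a concrete reading of the LSTM equations above, here is a minimal scalar C++ sketch of a single LSTM cell step (not NVIDIA code; the data layout and names are illustrative). W* are the input weights, R* the recurrent weights, and bW*/bR* the two bias vectors, matching the notation on this slide.

#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<float>;   // row-major, rows = hidden

static float sigmoidf(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// Pre-activation for one gate: z = W*x + R*h_prev + bW + bR
// (W: hidden x input, R: hidden x hidden, both row-major).
static Vec gate_preact(const Mat& W, const Mat& R, const Vec& bW, const Vec& bR,
                       const Vec& x, const Vec& h_prev, std::size_t hidden, std::size_t input) {
    Vec z(hidden);
    for (std::size_t i = 0; i < hidden; ++i) {
        float acc = bW[i] + bR[i];
        for (std::size_t j = 0; j < input; ++j)  acc += W[i * input + j] * x[j];
        for (std::size_t j = 0; j < hidden; ++j) acc += R[i * hidden + j] * h_prev[j];
        z[i] = acc;
    }
    return z;
}

struct LSTMGate { Mat W, R; Vec bW, bR; };

// One LSTM timestep following the slide's equations; updates h and c in place.
void lstm_step(const LSTMGate& gi, const LSTMGate& gf, const LSTMGate& go, const LSTMGate& gc,
               const Vec& x, Vec& h, Vec& c, std::size_t hidden, std::size_t input) {
    Vec i = gate_preact(gi.W, gi.R, gi.bW, gi.bR, x, h, hidden, input);
    Vec f = gate_preact(gf.W, gf.R, gf.bW, gf.bR, x, h, hidden, input);
    Vec o = gate_preact(go.W, go.R, go.bW, go.bR, x, h, hidden, input);
    Vec g = gate_preact(gc.W, gc.R, gc.bW, gc.bR, x, h, hidden, input);  // candidate c't
    for (std::size_t k = 0; k < hidden; ++k) {
        float it = sigmoidf(i[k]), ft = sigmoidf(f[k]), ot = sigmoidf(o[k]);
        float ct = ft * c[k] + it * std::tanh(g[k]);   // ct = ft ◦ ct-1 + it ◦ c't
        c[k] = ct;
        h[k] = ot * std::tanh(ct);                     // ht = ot ◦ tanh(ct)
    }
}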

Page 8: HIGH-PERFORMANCE RNN INFERENCING
cuDNN Features

High-performance implementations of Tanh, RELU, LSTM and GRU recurrent cells

An arbitrary batch size and number of timesteps can be executed

Easy access to internal and hidden states of the RNN cells for each timestep

Persistent kernels for small minibatch and long sequence lengths (compute capability >= 6.0)

LSTMs with recurrent projections to reduce the op count

Utilize Tensor Cores for FP16 and FP32 cells (125 TFLOPs on V100 and 65 TFLOPs on T4)


Page 9: UTILIZING TENSOR CORES
cuDNN, cuBLAS and TensorRT

cuDNN


// input, output and weight data types are FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH);

// input, output and weight are FP32, which is converted internally to FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);

cuBLAS and cuBLASLt

cublasGemmEx(...);

cublasLtMatmul(...);

TensorRT

builder->setFp16Mode(true);
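For the cuBLAS path shown above, a call such as cublasGemmEx might look roughly like the following sketch, which runs an FP16 GEMM with FP32 accumulation on Tensor Cores. The matrix sizes and pointer names are illustrative, not from the talk.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = alpha * A * B + beta * C with FP16 inputs/outputs and FP32 accumulation.
// dA, dB, dC are device pointers to __half data; m, n, k are the GEMM sizes.
void gemm_fp16_tensor_cores(cublasHandle_t handle, const __half* dA, const __half* dB,
                            __half* dC, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    // Allow Tensor Core (HMMA) kernels for this handle.
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 dA, CUDA_R_16F, m,      // A: m x k, leading dimension m (column-major)
                 dB, CUDA_R_16F, k,      // B: k x n, leading dimension k
                 &beta,
                 dC, CUDA_R_16F, m,      // C: m x n, leading dimension m
                 CUDA_R_32F,             // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}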

Page 10: RNN INFERENCING WITH CUDNN
Key Functions

cudnnCreateRNNDescriptor(&rnnDesc); // creates an RNN descriptor

cudnnSetRNNDescriptor(rnnDesc, … ); // sets the RNN descriptor

cudnnGetRNNLinLayerMatrixParams(cudnnHandle, rnnDesc, …); // set weights

cudnnGetRNNLinLayerBiasParams(cudnnHandle, rnnDesc, …); // set bias

cudnnRNNForwardInference(cudnnHandle, rnnDesc, … ); // perform inferencing

cudnnDestroyRNNDescriptor(rnnDesc); // destroy the RNN descriptor


Page 11: AUTO-BATCHING FOR HIGH THROUGHPUT
Automatically Group Inference Instances

Rely on cuDNN, cuBLAS and TensorRT for high-performance RNN implementation

Input, hidden states and outputs are tracked automatically with a new API

Exploits optimization opportunities by overlapping compute, transfer and host computations

Similar ideas explored at:

• Low-latency RNN inference using cellular batching (Jinyang Li et al., GTC 2018)

• Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Dario Amodei et al., CoRR, 2015)


Page 12: STREAMING INFERENCE API
An Auto-batching Solution

Non-blocking function calls with a mechanism to wait on completion

Inferencing can be performed in segments with multiple timesteps for real-time processing

A background thread combines and executes the individual inference tasks

[Diagram: two inference instances submit work in segments. Inference-0 submits 4 steps (t0-t3) and then 4 more (t4-t7); Inference-1 submits 4 steps (t0-t3) and then 3 more (t4-t6). The caller then waits for the completion of t7 on Inference-0.]

Page 13: STREAMING INFERENCE API
List of Functions

streamHandle = CreateRNNInferenceStream(modelDesc);

rnnHandle = CreateRNNInference (streamHandle);

RNNInference (rnnHandle, pInput, pOutput, seqLength);

timeStep = WaitRNNInferenceTimeStep(rnnHandle, timeStep);

timeStep = GetRNNInferenceProgress(rnnHandle);

DestroyRNNInference (rnnHandle);

DestroyRNNInferenceStream(streamHandle);

Page 14: EXAMPLE USAGE
Two Inference Instances

// Create the inference stream with shared model parameters
streamHandle = CreateRNNInferenceStream(modelDesc);

// Create two RNN inference instances
rnnHandle[0] = CreateRNNInference(streamHandle);
rnnHandle[1] = CreateRNNInference(streamHandle);

// Request inferencing for each inference instance with 10 timesteps (non-blocking call)
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);

// Request inferencing an additional 5 timesteps for the second inference instance
RNNInference(rnnHandle[1], pInput[1] + 10*inputSize, pOutput[1] + 10*outputSize, 5);

// Wait for the completion of the last added inferencing job
WaitRNNInferenceTimeStep(rnnHandle[1], 15);

// Destroy the two inferencing tasks and the inference stream
DestroyRNNInference(rnnHandle[0]);
DestroyRNNInference(rnnHandle[1]);
DestroyRNNInferenceStream(streamHandle);


Page 15: RNN INFERENCING WITH SEGMENTS
Execution Queue and Task Switching for Batch Size = 2

[Diagram: three inference tasks (Inference-0, Inference-1, Inference-2) arrive at different times and submit segments of timesteps (t0, t1, ...), each timestep consuming an input i and producing an output o. With batch size = 2 only two tasks occupy the execution queue at once, so tasks are switched in and out over time: when a task is switched out its states are stored (e.g., store Inference-0 states), and when it is switched back in they are restored (e.g., store Inference-2 states, restore Inference-0 states).]

Page 16: IMPLEMENTATION
Auto-batching and GPU Execution

1. Find the inference tasks with timesteps ready to execute

2. Determine the batch slots for each inference task

3. Send the inputs to GPU for batched processing

4. Restore hidden states as needed (+ cell states for LSTMs)

5. Batched execution on the GPU

6. Store hidden states as needed (+ cell states for LSTMs)

7. Send the batched results back to host

8. De-batch the results on the host

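A compressed sketch of how these eight steps might fit together in the background thread's per-timestep loop. This is not NVIDIA's implementation; every type and helper here (gather_ready_tasks, launch_batched_rnn_step, and so on) is a hypothetical stand-in for the corresponding step above.

#include <cuda_runtime.h>
#include <vector>

struct Task;              // one in-flight inference request (hypothetical)
struct DeviceBuffers;     // batched inputs/outputs/states on the GPU (hypothetical)

// Hypothetical helpers, one per step on this slide.
std::vector<Task*> gather_ready_tasks();                                                 // 1
void assign_batch_slots(std::vector<Task*>& batch);                                      // 2
void copy_inputs_to_device(const std::vector<Task*>&, DeviceBuffers&, cudaStream_t);     // 3
void restore_states(const std::vector<Task*>&, DeviceBuffers&, cudaStream_t);            // 4
void launch_batched_rnn_step(DeviceBuffers&, cudaStream_t);                              // 5
void store_states(const std::vector<Task*>&, DeviceBuffers&, cudaStream_t);              // 6
void copy_outputs_to_host(const std::vector<Task*>&, DeviceBuffers&, cudaStream_t);      // 7
void debatch_and_notify(const std::vector<Task*>&);                                      // 8

void auto_batching_loop(DeviceBuffers& buffers, cudaStream_t stream, volatile bool& running) {
    while (running) {
        std::vector<Task*> batch = gather_ready_tasks();    // 1. tasks with a timestep ready
        if (batch.empty()) continue;
        assign_batch_slots(batch);                          // 2. pick a batch slot for each task
        copy_inputs_to_device(batch, buffers, stream);      // 3. HtoD transfer of the inputs
        restore_states(batch, buffers, stream);             // 4. restore hidden (+ cell) states
        launch_batched_rnn_step(buffers, stream);           // 5. one batched step on the GPU
        store_states(batch, buffers, stream);               // 6. store hidden (+ cell) states
        copy_outputs_to_host(batch, buffers, stream);       // 7. DtoH transfer of the outputs
        cudaStreamSynchronize(stream);                      //    wait for this timestep's work
        debatch_and_notify(batch);                          // 8. de-batch and wake up waiters
    }
}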

Page 17: IMPLEMENTATION
Batching, Executing, and De-batching Inference Tasks

Background thread accepting inference tasks

At each timestep, inference tasks are batched, executed on the GPU, and de-batched

[Diagram: per-timestep pipeline from inference task inputs to inference task outputs: Batching and Input HtoD transfer (host), Restore states, LSTM, Projection, Top K, Store states, Output DtoH transfer (GPU), then De-batching (host). Host operations and GPU operations are marked separately.]

Page 18: PERFORMANCE OPTIMIZATIONS
Hiding Host Processing, Data Transfers, and State Management

Overlapping opportunities between timesteps for compute, batching, de-batching and transfer

Perform batching and de-batching on separate CPU threads: provides better CPU bandwidth and GPU overlap

Employ three CUDA streams and triple-buffering of the output to better exploit concurrency

[Diagram: timesteps t, t+1, t+2, t+3 of the pipeline (Batching, Input HtoD, Restore, LSTM, Projection, Top K, Store, Output DtoH, De-batching) overlapped across three CUDA streams, cycling Stream 0, Stream 1, Stream 2, Stream 0, so that one timestep's compute overlaps the next timestep's transfers and host work.]
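A minimal sketch of the stream-rotation idea (not the talk's code; the buffer layout and launch_step are placeholders): each timestep is issued on one of three CUDA streams with its own device buffers, so copies and compute from consecutive timesteps can overlap.

#include <cuda_runtime.h>

// Placeholder for the batched LSTM / projection / top-k kernels of one timestep.
void launch_step(float* d_in, float* d_out, int n, cudaStream_t s) { /* no-op in this sketch */ }

// h_in/h_out should be pinned host memory for the async copies to actually overlap.
// The real pipeline also carries recurrent state between timesteps, which needs
// cross-stream dependencies (e.g., CUDA events); that is omitted here.
void run_overlapped(const float* h_in, float* h_out, int n_per_step, int num_steps,
                    float* d_in[3], float* d_out[3]) {
    cudaStream_t streams[3];
    for (int i = 0; i < 3; ++i) cudaStreamCreate(&streams[i]);

    for (int t = 0; t < num_steps; ++t) {
        int b = t % 3;                 // rotate streams and buffers (triple buffering)
        cudaStream_t s = streams[b];
        // All work for timestep t goes to stream b, so buffer b is reused safely
        // while the copy engines and SMs service other timesteps on other streams.
        cudaMemcpyAsync(d_in[b], h_in + (size_t)t * n_per_step, n_per_step * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        launch_step(d_in[b], d_out[b], n_per_step, s);
        cudaMemcpyAsync(h_out + (size_t)t * n_per_step, d_out[b], n_per_step * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
    }
    for (int i = 0; i < 3; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}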

Page 19: PERFORMANCE EXPERIMENTS
An Example LSTM Model

Input size = 128

7 LSTM layers with 1024 hidden cells

A final projection layer with 1024 outputs

Timesteps per inference segment = 10

Total sequence length = 1000

Experiments are performed on T4 and GV100

End-to-end time: from task submission to results arriving at the host

[Diagram: the example LSTM model, with labeled layer widths of 128 and 1024.]

One inference request has 10 timesteps

Page 20: PERFORMANCE EXPERIMENTS
Benchmarking Code

// Queue up inferencing tasks with 10 timesteps each
time[0] = time();
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
time[1] = time();
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);
...
RNNInference(rnnHandle[N-1], pInput[N-1], pOutput[N-1], 10);

// Wait for the completion of the first inferencing task
WaitRNNInferenceTimeStep(rnnHandle[0], 10);
time[0] = time() - time[0];
RNNInference(rnnHandle[N], pInput[N], pOutput[N], 10);

// Wait for the completion of the second inferencing task
WaitRNNInferenceTimeStep(rnnHandle[1], 10);
time[1] = time() - time[1];
RNNInference(rnnHandle[N+1], pInput[N+1], pOutput[N+1], 10);
...

There are at most N inference requests in flight at a given time.

Measure the time required to finish each inference request, including the data transfer time.

Page 21: COMPARISON AGAINST BATCHED CUDNN
FP32 Model, GV100 Numbers

[Chart: Throughput of Streaming Inference API vs. Batched cuDNN for batch sizes 32, 64, 128, 256, and 512. Left axis: throughput (timesteps per ms), 0 to 100; right axis: throughput as % of batched cuDNN, 0% to 100%. Series: Streaming API, Batched, % of Batched.]

Page 22: PERFORMANCE WITH TENSOR CORES
FP16 on GV100

[Chart: Streaming Inference Performance on GV100. Throughput (TFlop/sec, 0 to 20) for batch sizes 32, 64, 128, 256, and 512. Series: FP32 and FP16 w/TCs.]

Page 23: PERFORMANCE WITH TENSOR CORES
FP16 on T4

[Chart: Streaming Inference Performance. Throughput (TFlop/sec, 0 to 14) for batch sizes 32, 64, 128, 256, and 512. Series: FP32 and FP16 w/TCs.]

Page 24: LATENCY VS THROUGHPUT TRADEOFF
FP16 on T4

Assuming each inference segment represents 100 ms of audio

Choose a batch size that will maximize throughput while staying within latency budget

[Chart: Latency Percentiles. X-axis: inference instances served by a GPU / batch size (257/32, 441/64, 692/128, 913/256, 977/512); Y-axis: latency (ms), 0 to 70. Series: 50%, 90%, 95%, and 99% latency.]
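The selection rule on this slide can be written down directly; a small sketch follows, where the per-batch-size latency measurements that would populate the table are placeholders (the chart gives the instances-served figures, but the percentile latencies live only in the figure).

#include <vector>

struct OperatingPoint {
    int batch_size;
    int instances_served;   // concurrent inference instances a GPU can serve
    double p99_latency_ms;  // measured 99th-percentile latency at that batch size
};

// Pick the operating point that serves the most instances while its p99 latency
// stays within the real-time budget (100 ms per segment in the talk's example).
OperatingPoint pick_batch_size(const std::vector<OperatingPoint>& points, double budget_ms) {
    OperatingPoint best{1, 0, 0.0};
    for (const auto& p : points)
        if (p.p99_latency_ms <= budget_ms && p.instances_served > best.instances_served)
            best = p;
    return best;
}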

Page 25: NVIDIA TENSORRT INFERENCE SERVER
Production Data Center Inference Server

Maximize real-time inference performance of GPUs

Quickly deploy and manage multiple models per GPU per node

Easily scale to heterogeneous GPUs and multi-GPU nodes

Integrates with orchestration systems and auto scalers via latency and health metrics

Open source for thorough customization and integration

[Diagram: TensorRT Inference Server instances each serving a pair of GPUs: Tesla T4, Tesla V100, and Tesla P4.]

Page 26: INFERENCE SERVER ARCHITECTURE

Models supported:
• TensorFlow GraphDef/SavedModel
• TensorFlow and TensorRT GraphDef
• TensorRT Plans
• Caffe2 NetDef (ONNX import)
• Custom backends

Multi-GPU support

Concurrent model execution

Server HTTP REST API/gRPC

Python/C++ client libraries


Available with Monthly Updates

Page 27: INFERENCE SERVER BATCHERS
Dynamic and Sequence Batching

Dynamic Batching

TensorRT Inference Server (TRTIS) groups inference requests based on customer-defined metrics for optimal performance

Customer defines 1) batch size and/or 2) latency requirements

Sequence Batching

TRTIS can keep track of the inference requests belonging to a stateful model

The client application assigns a correlation ID for a stream of inferences belonging to the same sequence

Use together with a custom backend to store and restore the internal states of the model


Page 28: SUMMARY

Designed and implemented the Streaming Inference API

Automatically batches the RNN inference requests together to achieve high throughput

Code written for batch size = 1 achieves ≥66% throughput of batched execution (FP32)

Allows utilizing the Tensor Cores on Volta and Turing architectures

Hit latency targets by choosing the right batch size

Generalizes to sequence models with interdependent inference streams

TRTIS sequence batcher and custom backends for high-performance real-time inferencing

S9438 - Maximizing Utilization for Data Center Inference with TensorRT Inference Server

Page 29: RESOURCES

TRTIS blog post and documentation:

https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/

https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/
