Murat Efe Guney – Developer Technology Engineer, NVIDIA
March 20, 2019
S9422: AN AUTO-BATCHING API FOR HIGH-PERFORMANCE RNN INFERENCE
REAL-TIME INFERENCE
Sequence Models Based on RNNs
Sequence models for automatic speech recognition (ASR), translation, and speech generation
Real-time applications have a stream of inference requests from multiple users
The challenge is to perform inferencing with low latency and high throughput
[Figure: utterances from multiple users (“Hello, my name is Alice”, “I am Susan”, “This is Bob”) arriving as a stream of requests to an ASR model.]
BATCHING VS NON-BATCHING
Batch size = 1
• Run a single RNN inference task on a GPU
• Low-latency, but the GPU is underutilized
Batch size = N
• Group RNN inference instances together
• High throughput and GPU utilization
• Allows the use of Tensor Cores on Volta and Turing
Batching: Grouping Inference Requests Together
BATCHING VS NON-BATCHING
Performance Data on T4
[Chart: RNN Inference Throughput and Latency on T4 for batch sizes 1, 32, 64, 128, and 256. Left axis: latency (ms per timestep), 0-9; right axis: throughput (timesteps per ms), 0-180. Series: FP32 throughput, FP16 w/TC throughput, FP32 latency, FP16 w/TC latency. Throughput data labels: FP32 = 1.2, 23.0, 27.6, 32.9, 31.5; FP16 w/TC = 1.8, 51.4, 83.8, 116.5, 138.4 timesteps per ms.]
RNN BATCHING
Existing real-time code is written to run many inference instances with batch size = 1
Real-time batching requires extra programming effort
A naïve implementation can suffer from a significant increase in latency
An ideal solution allows trading off latency against throughput
RNN cells provide an opportunity to merge inference tasks at different timesteps
Challenges and Opportunities
RNN BATCHING
Combining RNNs at Different Timesteps
[Diagram: inference tasks arrive at different times (t0, t1, t2, …) and are grouped into a batch of size 4 that shares common model parameters; as a slot frees up it is filled with a new inference task, and the batched execution advances timestep by timestep.]
RNN CELLS
RNN Cells Supported in TensorRT and cuDNN

TANH:
h_t = tanh(W_i x_t + R_i h_(t-1) + b_Wi + b_Ri)

RELU:
h_t = ReLU(W_i x_t + R_i h_(t-1) + b_Wi + b_Ri)

LSTM:
i_t = σ(W_i x_t + R_i h_(t-1) + b_Wi + b_Ri)
f_t = σ(W_f x_t + R_f h_(t-1) + b_Wf + b_Rf)
o_t = σ(W_o x_t + R_o h_(t-1) + b_Wo + b_Ro)
c'_t = tanh(W_c x_t + R_c h_(t-1) + b_Wc + b_Rc)
c_t = f_t ◦ c_(t-1) + i_t ◦ c'_t
h_t = o_t ◦ tanh(c_t)

GRU:
i_t = σ(W_i x_t + R_i h_(t-1) + b_Wi + b_Ri)
r_t = σ(W_r x_t + R_r h_(t-1) + b_Wr + b_Rr)
h'_t = tanh(W_h x_t + r_t ◦ (R_h h_(t-1) + b_Rh) + b_Wh)
h_t = (1 - i_t) ◦ h'_t + i_t ◦ h_(t-1)
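For concreteness, here is a minimal single-timestep GRU reference in plain C++ that follows the equations above (with i_t acting as the update gate). This is an illustrative CPU sketch, not part of cuDNN or TensorRT; the struct, helper names, and row-major weight layout are assumptions.

#include <cmath>
#include <vector>

// Minimal single-timestep GRU reference; W* are input weights, R* recurrent weights,
// bW*/bR* biases, matching the notation on this slide.
struct GRUCell {
    int inputSize, hiddenSize;
    std::vector<float> Wi, Wr, Wh;                    // hiddenSize x inputSize, row-major
    std::vector<float> Ri, Rr, Rh;                    // hiddenSize x hiddenSize, row-major
    std::vector<float> bWi, bWr, bWh, bRi, bRr, bRh;  // hiddenSize

    static float sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

    // y = M * x + b, with M of shape rows x cols
    static std::vector<float> gemv(const std::vector<float>& M, const std::vector<float>& x,
                                   const std::vector<float>& b, int rows, int cols) {
        std::vector<float> y(rows);
        for (int r = 0; r < rows; ++r) {
            float acc = b[r];
            for (int c = 0; c < cols; ++c) acc += M[r * cols + c] * x[c];
            y[r] = acc;
        }
        return y;
    }

    // Computes h_t from x_t and h_(t-1)
    std::vector<float> step(const std::vector<float>& x, const std::vector<float>& hPrev) const {
        auto iGate = gemv(Wi, x, bWi, hiddenSize, inputSize);
        auto iRec  = gemv(Ri, hPrev, bRi, hiddenSize, hiddenSize);
        auto rGate = gemv(Wr, x, bWr, hiddenSize, inputSize);
        auto rRec  = gemv(Rr, hPrev, bRr, hiddenSize, hiddenSize);
        auto hCand = gemv(Wh, x, bWh, hiddenSize, inputSize);
        auto hRec  = gemv(Rh, hPrev, bRh, hiddenSize, hiddenSize);
        std::vector<float> h(hiddenSize);
        for (int k = 0; k < hiddenSize; ++k) {
            float it = sigmoid(iGate[k] + iRec[k]);         // update gate i_t
            float rt = sigmoid(rGate[k] + rRec[k]);         // reset gate r_t
            float ht = std::tanh(hCand[k] + rt * hRec[k]);  // candidate h'_t
            h[k] = (1.0f - it) * ht + it * hPrev[k];        // h_t
        }
        return h;
    }
};

cuDNN fuses these per-gate matrix multiplies and the pointwise math into optimized GPU kernels; the sketch only makes the data flow of the equations explicit.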
HIGH-PERFORMANCE RNN INFERENCING
High-performance implementations of Tanh, RELU, LSTM and GRU recurrent cells
An arbitrary batch size and number of timesteps can be executed
Easy access to internal and hidden states of the RNN cells for each timestep
Persistent kernels for small minibatch and long sequence lengths (compute capability >= 6.0)
LSTMs with recurrent projections to reduce the op count
Utilize Tensor Cores for FP16 and FP32 cells (125 TFLOPs on V100 and 65 TFLOPs on T4)
cuDNN Features
UTILIZING TENSOR CORES
cuDNN, cuBLAS and TensorRT
cuDNN
// input, output and weight data types are FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH);

// input, output and weight are FP32, converted internally to FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
cuBLAS and cuBLASLt
cublasGemmEx(...);
cublasLtMatmul(...);
TensorRT
builder->setFp16Mode(true);
RNN INFERENCING WITH CUDNN
cudnnCreateRNNDescriptor(&rnnDesc);                        // create an RNN descriptor
cudnnSetRNNDescriptor(rnnDesc, … );                        // configure the RNN descriptor (cell type, layers, hidden size, …)
cudnnGetRNNLinLayerMatrixParams(cudnnHandle, rnnDesc, …);  // locate each weight matrix so it can be filled in
cudnnGetRNNLinLayerBiasParams(cudnnHandle, rnnDesc, …);    // locate each bias vector so it can be filled in
cudnnRNNForwardInference(cudnnHandle, rnnDesc, … );        // perform inferencing
cudnnDestroyRNNDescriptor(rnnDesc);                        // destroy the RNN descriptor
Key Functions
AUTO-BATCHING FOR HIGH THROUGHPUT
Rely on cuDNN, cuBLAS and TensorRT for high-performance RNN implementation
Input, hidden states and outputs are tracked automatically with a new API
Exploits optimization opportunities by overlapping GPU compute, data transfers, and host-side processing
Similar ideas explored at:
• Low-latency RNN inference using cellular batching (Jinyang Li et al., GTC 2018)
• Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Dario Amodei et al., CoRR, 2015)
Automatically Group Inference Instances
STREAMING INFERENCE API
Non-blocking function calls with a mechanism to wait on completion
Inferencing can be performed in segments with multiple timesteps for real-time processing
A background thread combines and executes the individual inference tasks (sketched after the diagram below)
An Auto-batching Solution
[Diagram: two inference instances submit segments through non-blocking calls: Inference-0 submits t0-t3 and then t4-t7, Inference-1 submits t0-t3 and then t4-t6; the caller then waits for the completion of timestep t7.]
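To illustrate the submit/wait pattern and the background batching thread described above, here is a hypothetical C++ sketch. RNNInferenceStream, submit, and waitTimeStep are illustrative stand-ins for the CreateRNNInference / RNNInference / WaitRNNInferenceTimeStep calls listed below, and the GPU work is reduced to a comment; this is not the actual implementation.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Sketch: non-blocking submits enqueue timesteps; a background thread batches whatever
// is ready and advances every active instance by one timestep per iteration.
class RNNInferenceStream {
public:
    explicit RNNInferenceStream(int maxBatch) : maxBatch_(maxBatch), worker_([this] { run(); }) {}
    ~RNNInferenceStream() { stop_ = true; cv_.notify_all(); worker_.join(); }

    int createInstance() {
        std::lock_guard<std::mutex> lk(m_);
        pending_.push_back(0); completed_.push_back(0);
        return static_cast<int>(pending_.size()) - 1;
    }

    // Non-blocking: request numSteps more timesteps for this instance.
    void submit(int handle, int numSteps) {
        { std::lock_guard<std::mutex> lk(m_); pending_[handle] += numSteps; }
        cv_.notify_all();
    }

    // Block until the instance has finished at least timeStep timesteps.
    void waitTimeStep(int handle, int timeStep) {
        std::unique_lock<std::mutex> lk(m_);
        done_.wait(lk, [&] { return completed_[handle] >= timeStep; });
    }

private:
    void run() {
        while (!stop_) {
            std::vector<int> batch;  // instances that get a batch slot this timestep
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait_for(lk, std::chrono::milliseconds(1));
                for (int h = 0; h < (int)pending_.size() && (int)batch.size() < maxBatch_; ++h)
                    if (pending_[h] > 0) batch.push_back(h);
                if (batch.empty()) continue;
                for (int h : batch) --pending_[h];
            }
            // ... restore hidden states, run one batched RNN timestep on the GPU
            //     (e.g. via cudnnRNNForwardInference), store states, copy results back ...
            {
                std::lock_guard<std::mutex> lk(m_);
                for (int h : batch) ++completed_[h];
            }
            done_.notify_all();
        }
    }

    int maxBatch_;
    std::mutex m_;
    std::condition_variable cv_, done_;
    std::vector<int> pending_, completed_;
    std::atomic<bool> stop_{false};
    std::thread worker_;
};

The key design point is that callers never block on submission; only the explicit wait blocks, and the worker decides the effective batch size at every timestep from whatever work is pending.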
STREAMING INFERENCE API
List of Functions
streamHandle = CreateRNNInferenceStream(modelDesc);
rnnHandle = CreateRNNInference(streamHandle);
RNNInference(rnnHandle, pInput, pOutput, seqLength);
timeStep = WaitRNNInferenceTimeStep(rnnHandle, timeStep);
timeStep = GetRNNInferenceProgress(rnnHandle);
DestroyRNNInference(rnnHandle);
DestroyRNNInferenceStream(streamHandle);
EXAMPLE USAGE
// Create the inference stream with shared model parameters
streamHandle = CreateRNNInferenceStream(modelDesc);

// Create two RNN inference instances
rnnHandle[0] = CreateRNNInference(streamHandle);
rnnHandle[1] = CreateRNNInference(streamHandle);

// Request inferencing for each inference instance with 10 timesteps (non-blocking calls)
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);

// Request inferencing of an additional 5 timesteps for the second inference instance
RNNInference(rnnHandle[1], pInput[1] + 10*inputSize, pOutput[1] + 10*outputSize, 5);

// Wait for the completion of the most recently added inferencing job
WaitRNNInferenceTimeStep(rnnHandle[1], 15);

// Destroy the two inferencing tasks and the inference stream
DestroyRNNInference(rnnHandle[0]);
DestroyRNNInference(rnnHandle[1]);
DestroyRNNInferenceStream(streamHandle);
Two Inference Instances
RNN INFERENCING WITH SEGMENTS
Execution Queue and Task Switching for Batch Size = 2
[Diagram: three inference tasks (Inference-0, Inference-1, Inference-2) arrive at different times, each submitting segments of timesteps (t0, t1, t2, …) with inputs i0, i1, … and outputs o0, o1, …. With batch size = 2, the execution queue holds two active tasks at a time and switches between them: when Inference-2 arrives, Inference-0's states are stored; later Inference-2's states are stored and Inference-0's states are restored so it can resume.]
IMPLEMENTATION
1. Find the inference tasks ready to execute time steps
2. Determine the batch slots for each inference task
3. Send the inputs to GPU for batched processing
4. Restore hidden states as needed (+ cell states for LSTMs)
5. Batched execution on the GPU
6. Store hidden states as needed (+ cell states for LSTMs)
7. Send the batched results back to host
8. De-batch the results on the host
Auto-batching and GPU Execution
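A hypothetical C++ sketch of one iteration of this loop follows, with the GPU execution (step 5) left as a comment; the InferenceTask structure and helper names are illustrative assumptions, not the actual implementation.

#include <algorithm>
#include <vector>

// Illustrative per-task bookkeeping for the auto-batching loop.
struct InferenceTask {
    std::vector<float> input;   // submitted input timesteps, concatenated
    std::vector<float> output;  // de-batched results, appended per timestep
    std::vector<float> state;   // hidden (and, for LSTM, cell) state of this task
    int nextStep = 0;           // next timestep to execute
    int stepsRequested = 0;     // timesteps submitted so far
};

void runOneBatchedTimestep(std::vector<InferenceTask>& tasks, int maxBatch,
                           int inputSize, int outputSize, int stateSize) {
    // 1./2. Find tasks with work left and assign them batch slots
    std::vector<int> slots;
    for (int i = 0; i < (int)tasks.size() && (int)slots.size() < maxBatch; ++i)
        if (tasks[i].nextStep < tasks[i].stepsRequested) slots.push_back(i);
    if (slots.empty()) return;

    // 3./4. Gather this timestep's inputs and per-task states into contiguous batch buffers
    std::vector<float> batchedInput(slots.size() * inputSize);
    std::vector<float> batchedState(slots.size() * stateSize);
    for (int s = 0; s < (int)slots.size(); ++s) {
        const InferenceTask& t = tasks[slots[s]];
        std::copy_n(t.input.data() + t.nextStep * inputSize, inputSize,
                    batchedInput.data() + s * inputSize);
        std::copy_n(t.state.data(), stateSize, batchedState.data() + s * stateSize);
    }

    // 5. Batched execution on the GPU would happen here: send batchedInput/batchedState
    //    to the device, run one RNN timestep with batch size = slots.size()
    //    (e.g. via cudnnRNNForwardInference), and copy the results and states back.
    std::vector<float> batchedOutput(slots.size() * outputSize, 0.0f);  // placeholder results

    // 6./7./8. Store the updated states and de-batch the results back to each task
    for (int s = 0; s < (int)slots.size(); ++s) {
        InferenceTask& t = tasks[slots[s]];
        std::copy_n(batchedState.data() + s * stateSize, stateSize, t.state.data());
        t.output.insert(t.output.end(), batchedOutput.data() + s * outputSize,
                        batchedOutput.data() + (s + 1) * outputSize);
        ++t.nextStep;
    }
}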
IMPLEMENTATION
Batching, Executing, and De-batching Inference Tasks
[Diagram: per-timestep pipeline from inference-task inputs to inference-task outputs: Batching (host op) → Input HtoD transfer → Restore states → LSTM → Projection → Top K → Store states → Output DtoH transfer → De-batching (host op).]
A background thread accepts the incoming inference tasks.
At each timestep the ready inference tasks are batched, executed on the GPU, and de-batched.
PERFORMANCE OPTIMIZATIONS
Hiding Host Processing, Data Transfers, and State Management
Overlapping opportunities between timesteps for compute, batching, de-batching and transfer
Perform batching and de-batching on separate CPU threads: provides better CPU bandwidth and better overlap with the GPU
Employ three CUDA streams and triple-buffering of the output to better exploit concurrency (a sketch follows the diagram below)
[Diagram: timeline of timesteps t, t+1, t+2, t+3. Each timestep runs the Batching → Input HtoD → Restore → LSTM → Projection → Top K → Store → Output DtoH → De-batching pipeline, with successive timesteps rotated across CUDA Stream 0, Stream 1, and Stream 2 so the host work and transfers of one timestep overlap the GPU compute of its neighbors.]
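A minimal CUDA C++ sketch of the stream rotation and triple buffering (illustrative only: the batched RNN timestep is replaced by a placeholder kernel, and the buffer sizes, names, and single-event dependency are assumptions, not the actual implementation):

#include <cuda_runtime.h>

// Placeholder for one batched RNN timestep (the real work would be cuDNN/cuBLAS calls).
__global__ void rnnTimestepKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // identity stands in for LSTM + projection + top-k
}

int main() {
    const int kBuffers = 3;                  // triple buffering across three CUDA streams
    const int n = 1 << 20;                   // illustrative batched buffer size
    const size_t bytes = size_t(n) * sizeof(float);

    cudaStream_t stream[kBuffers];
    float *hIn[kBuffers], *hOut[kBuffers], *dIn[kBuffers], *dOut[kBuffers];
    for (int b = 0; b < kBuffers; ++b) {
        cudaStreamCreate(&stream[b]);
        cudaMallocHost(&hIn[b], bytes);      // pinned host memory enables async copies
        cudaMallocHost(&hOut[b], bytes);
        cudaMalloc(&dIn[b], bytes);
        cudaMalloc(&dOut[b], bytes);
    }

    cudaEvent_t lstmDone;
    cudaEventCreate(&lstmDone);

    const int timesteps = 12;
    for (int t = 0; t < timesteps; ++t) {
        int b = t % kBuffers;                // rotate streams and buffers
        cudaStreamSynchronize(stream[b]);    // buffer b's previous timestep must be finished
        // ... host threads batch the ready tasks into hIn[b] and de-batch hOut[b] here ...
        cudaMemcpyAsync(dIn[b], hIn[b], bytes, cudaMemcpyHostToDevice, stream[b]);
        if (t > 0) cudaStreamWaitEvent(stream[b], lstmDone, 0);  // timestep t needs states from t-1
        rnnTimestepKernel<<<(n + 255) / 256, 256, 0, stream[b]>>>(dIn[b], dOut[b], n);
        cudaEventRecord(lstmDone, stream[b]);
        cudaMemcpyAsync(hOut[b], dOut[b], bytes, cudaMemcpyDeviceToHost, stream[b]);
        // the copies and host work of timestep t+1 can now overlap this timestep's compute
    }

    for (int b = 0; b < kBuffers; ++b) {
        cudaStreamSynchronize(stream[b]);
        cudaFreeHost(hIn[b]); cudaFreeHost(hOut[b]);
        cudaFree(dIn[b]);     cudaFree(dOut[b]);
        cudaStreamDestroy(stream[b]);
    }
    cudaEventDestroy(lstmDone);
    return 0;
}

Pinned host buffers make the cudaMemcpyAsync calls truly asynchronous, and the event keeps the compute of consecutive timesteps ordered (the hidden state of timestep t feeds t+1) while copies and host-side batching still overlap.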
PERFORMANCE EXPERIMENTS
Input size = 128
7 LSTM layers with 1024 hidden cells
A final projection layer with 1024 outputs
Timesteps per inference segment = 10
Total sequence length = 1000
Experiments are performed on T4 and GV100
End-to-end time: from task submission to results arriving at the host
An Example LSTM Model
[Diagram: per-timestep model stack with 128-dimensional input, 1024-unit LSTM layers, and the projection layer; one inference request has 10 timesteps.]
PERFORMANCE EXPERIMENTS
Benchmarking Code
// Queue up inferencing tasks with 10 timesteps each
time[0] = time();
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
time[1] = time();
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);
...
RNNInference(rnnHandle[N-1], pInput[N-1], pOutput[N-1], 10);

// Wait for the completion of the first inferencing task, then submit a new one
WaitRNNInferenceTimeStep(rnnHandle[0], 10);
time[0] = time() - time[0];
RNNInference(rnnHandle[N], pInput[N], pOutput[N], 10);

// Wait for the completion of the second inferencing task, then submit a new one
WaitRNNInferenceTimeStep(rnnHandle[1], 10);
time[1] = time() - time[1];
RNNInference(rnnHandle[N+1], pInput[N+1], pOutput[N+1], 10);
...
There are at most N inference requests in flight at a given time.
Measure the time required to finish each inference request including the data transfer time.
COMPARISON AGAINST BATCHED CUDNN
FP32 Model, GV100 Numbers
[Chart: Throughput of Streaming Inference API vs. Batched cuDNN (FP32 model, GV100) for batch sizes 32, 64, 128, 256, and 512. Left axis: throughput (timesteps per ms), 0-100; right axis: throughput as % of batched cuDNN, 0-100%. Series: Streaming API, Batched, % of Batched.]
PERFORMANCE WITH TENSOR CORES
FP16 on GV100
[Chart: Streaming Inference Performance on GV100 for batch sizes 32, 64, 128, 256, and 512. Y-axis: throughput (TFLOP/s), 0-20. Series: FP32, FP16 w/Tensor Cores.]
PERFORMANCE WITH TENSOR CORES
FP16 on T4
[Chart: Streaming Inference Performance on T4 for batch sizes 32, 64, 128, 256, and 512. Y-axis: throughput (TFLOP/s), 0-14. Series: FP32, FP16 w/Tensor Cores.]
LATENCY VS THROUGHPUT TRADEOFF
Assuming each inference segment represents 100 ms of audio
Choose a batch size that will maximize throughput while staying within latency budget
FP16 on T4
[Chart: Latency Percentiles (FP16 on T4). X-axis: inference instances served by a GPU / batch size (257 / 32, 441 / 64, 692 / 128, 913 / 256, 977 / 512). Y-axis: latency (ms), 0-70. Series: 50%, 90%, 95%, and 99% latency.]
NVIDIA TENSORRT INFERENCE SERVER
Production Data Center Inference Server
Maximize real-time inference performance of GPUs
Quickly deploy and manage multiple models per GPU per node
Easily scale to heterogeneous GPUs and multi-GPU nodes
Integrates with orchestration systems and auto scalers via latency and health metrics
Open source for thorough customization and integration
[Diagram: multiple TensorRT Inference Server instances, each serving a node with multiple GPUs (Tesla T4, Tesla V100, Tesla P4).]
INFERENCE SERVER ARCHITECTURE
Models supported:
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
● Custom backends
Multi-GPU support
Concurrent model execution
Server HTTP REST API/gRPC
Python/C++ client libraries
Python/C++ Client Library
Available with Monthly Updates
INFERENCE SERVER BATCHERS
Dynamic Batching
The TensorRT Inference Server (TRTIS) groups inference requests based on customer-defined criteria for optimal performance
Customer defines 1) batch size and/or 2) latency requirements
Sequence Batching
TRTIS can keep track of the inference requests belonging to a stateful model
The client application assigns a correlation ID for a stream of inferences belonging to the same sequence
Use together with a custom backend to store and restore the internal states of the model
Dynamic and Sequence Batching
SUMMARY
Designed and implemented the Streaming Inference API
Automatically batches the RNN inference requests together to achieve high throughput
Code written for batch size = 1 achieves ≥66% of the throughput of batched execution (FP32)
Allows utilizing the Tensor Cores on Volta and Turing architectures
Hit latency targets by choosing the right batch size
Generalizes to sequence models with interdependent inference streams
TRTIS sequence batcher and custom backends for high-performance real-time inferencing
S9438 - Maximizing Utilization for Data Center Inference with TensorRT Inference Server
RESOURCES
TRTIS blog post and documentation:
https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/
https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/
Recommended