Murat Efe Guney – Developer Technology Engineer, NVIDIA
March 20, 2019
S9422: AN AUTO-BATCHING API FOR HIGH-PERFORMANCE RNN INFERENCE
REAL-TIME INFERENCE
Sequence Models Based on RNNs
Sequence models for automatic speech recognition (ASR), translation, and speech generation
Real-time applications have a stream of inference requests from multiple users
The challenge is to perform inferencing with low latency and high throughput
[Figure: utterances from multiple users (“Hello, my name is Alice”, “I am Susan”, “This is Bob”) arriving as a stream of requests to an ASR model.]
BATCHING VS NON-BATCHING
Batch size = 1
• Run a single RNN inference task on a GPU
• Low-latency, but the GPU is underutilized
Batch size = N
• Group RNN inference instances together
• High throughput and GPU utilization
• Allows the use of Tensor Cores on Volta and Turing
Batching: Grouping Inference Requests Together
BATCHING VS NON-BATCHING
Performance Data on T4
[Chart: RNN Inference Throughput and Latency on T4 for batch sizes 1, 32, 64, 128, and 256. Left axis: latency (ms per timestep), 0-9; right axis: throughput (timesteps per ms), 0-180. Series: FP32 throughput, FP16 w/TC throughput, FP32 latency, FP16 w/TC latency. Throughput data labels: FP32 = 1.2, 23.0, 27.6, 32.9, 31.5; FP16 w/TC = 1.8, 51.4, 83.8, 116.5, 138.4 timesteps per ms.]
RNN BATCHING
Existing real-time code is written to run many inference instances with batch size = 1
Real-time batching requires extra programming effort
A naïve implementation can suffer from a significant increase in latency
An ideal solution allows trading off latency against throughput
RNN cells provide an opportunity to merge inference tasks at different timesteps
Challenges and Opportunities
RNN BATCHING
Combining RNNs at Different Timesteps
[Diagram: inference tasks arrive at different times (t0, t1, t2, …) and are grouped into a batch of size 4 that shares common model parameters; as a slot frees up it is filled with a new inference task, and the batched execution advances timestep by timestep.]
RNN CELLS
RNN Cells Supported in TensorRT and cuDNN

TANH:
h_t = tanh(W_i x_t + R_i h_(t-1) + b_Wi + b_Ri)

RELU:
h_t = ReLU(W_i x_t + R_i h_(t-1) + b_Wi + b_Ri)

LSTM:
i_t = σ(W_i x_t + R_i h_(t-1) + b_Wi + b_Ri)
f_t = σ(W_f x_t + R_f h_(t-1) + b_Wf + b_Rf)
o_t = σ(W_o x_t + R_o h_(t-1) + b_Wo + b_Ro)
c'_t = tanh(W_c x_t + R_c h_(t-1) + b_Wc + b_Rc)
c_t = f_t ◦ c_(t-1) + i_t ◦ c'_t
h_t = o_t ◦ tanh(c_t)

GRU:
i_t = σ(W_i x_t + R_i h_(t-1) + b_Wi + b_Ri)
r_t = σ(W_r x_t + R_r h_(t-1) + b_Wr + b_Rr)
h'_t = tanh(W_h x_t + r_t ◦ (R_h h_(t-1) + b_Rh) + b_Wh)
h_t = (1 - i_t) ◦ h'_t + i_t ◦ h_(t-1)
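For concreteness, here is a minimal single-timestep GRU reference in plain C++ that follows the equations above (with i_t acting as the update gate). This is an illustrative CPU sketch, not part of cuDNN or TensorRT; the struct, helper names, and row-major weight layout are assumptions.

#include <cmath>
#include <vector>

// Minimal single-timestep GRU reference; W* are input weights, R* recurrent weights,
// bW*/bR* biases, matching the notation on this slide.
struct GRUCell {
    int inputSize, hiddenSize;
    std::vector<float> Wi, Wr, Wh;                    // hiddenSize x inputSize, row-major
    std::vector<float> Ri, Rr, Rh;                    // hiddenSize x hiddenSize, row-major
    std::vector<float> bWi, bWr, bWh, bRi, bRr, bRh;  // hiddenSize

    static float sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

    // y = M * x + b, with M of shape rows x cols
    static std::vector<float> gemv(const std::vector<float>& M, const std::vector<float>& x,
                                   const std::vector<float>& b, int rows, int cols) {
        std::vector<float> y(rows);
        for (int r = 0; r < rows; ++r) {
            float acc = b[r];
            for (int c = 0; c < cols; ++c) acc += M[r * cols + c] * x[c];
            y[r] = acc;
        }
        return y;
    }

    // Computes h_t from x_t and h_(t-1)
    std::vector<float> step(const std::vector<float>& x, const std::vector<float>& hPrev) const {
        auto iGate = gemv(Wi, x, bWi, hiddenSize, inputSize);
        auto iRec  = gemv(Ri, hPrev, bRi, hiddenSize, hiddenSize);
        auto rGate = gemv(Wr, x, bWr, hiddenSize, inputSize);
        auto rRec  = gemv(Rr, hPrev, bRr, hiddenSize, hiddenSize);
        auto hCand = gemv(Wh, x, bWh, hiddenSize, inputSize);
        auto hRec  = gemv(Rh, hPrev, bRh, hiddenSize, hiddenSize);
        std::vector<float> h(hiddenSize);
        for (int k = 0; k < hiddenSize; ++k) {
            float it = sigmoid(iGate[k] + iRec[k]);         // update gate i_t
            float rt = sigmoid(rGate[k] + rRec[k]);         // reset gate r_t
            float ht = std::tanh(hCand[k] + rt * hRec[k]);  // candidate h'_t
            h[k] = (1.0f - it) * ht + it * hPrev[k];        // h_t
        }
        return h;
    }
};

cuDNN fuses these per-gate matrix multiplies and the pointwise math into optimized GPU kernels; the sketch only makes the data flow of the equations explicit.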
HIGH-PERFORMANCE RNN INFERENCING
High-performance implementations of Tanh, RELU, LSTM and GRU recurrent cells
An arbitrary batch size and number of timesteps can be executed
Easy access to internal and hidden states of the RNN cells for each timestep
Persistent kernels for small minibatch and long sequence lengths (compute capability >= 6.0)
LSTMs with recurrent projections to reduce the op count
Utilize Tensor Cores for FP16 and FP32 cells (125 TFLOPs on V100 and 65 TFLOPs on T4)
cuDNN Features
UTILIZING TENSOR CORES
cuDNN, cuBLAS and TensorRT
cuDNN
// input, output and weight data types are FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH);

// input, output and weight are FP32, converted internally to FP16
cudnnSetRNNMatrixMathType(cudnnRnnDesc, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
cuBLAS and cuBLASLt
cublasGemmEx(...);
cublasLtMatmul(...);
TensorRT
builder->setFp16Mode(true);
RNN INFERENCING WITH CUDNN
cudnnCreateRNNDescriptor(&rnnDesc);                        // create an RNN descriptor
cudnnSetRNNDescriptor(rnnDesc, … );                        // configure the RNN descriptor (cell type, layers, hidden size, …)
cudnnGetRNNLinLayerMatrixParams(cudnnHandle, rnnDesc, …);  // locate each weight matrix so it can be filled in
cudnnGetRNNLinLayerBiasParams(cudnnHandle, rnnDesc, …);    // locate each bias vector so it can be filled in
cudnnRNNForwardInference(cudnnHandle, rnnDesc, … );        // perform inferencing
cudnnDestroyRNNDescriptor(rnnDesc);                        // destroy the RNN descriptor
Key Functions
AUTO-BATCHING FOR HIGH THROUGHPUT
Rely on cuDNN, cuBLAS and TensorRT for high-performance RNN implementation
Input, hidden states and outputs are tracked automatically with a new API
Exploits optimization opportunities by overlapping GPU compute, data transfers, and host-side processing
Similar ideas explored at:
• Low-latency RNN inference using cellular batching (Jinyang Li et al., GTC 2018)
• Deep Speech 2: End-to-End Speech Recognition in English and Mandarin (Dario Amodei et al., CoRR, 2015)
Automatically Group Inference Instances
STREAMING INFERENCE API
Non-blocking function calls with a mechanism to wait on completion
Inferencing can be performed in segments with multiple timesteps for real-time processing
A background thread combines and executes the individual inference tasks (sketched after the diagram below)
An Auto-batching Solution
[Diagram: two inference instances submit segments through non-blocking calls: Inference-0 submits t0-t3 and then t4-t7, Inference-1 submits t0-t3 and then t4-t6; the caller then waits for the completion of timestep t7.]
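To illustrate the submit/wait pattern and the background batching thread described above, here is a hypothetical C++ sketch. RNNInferenceStream, submit, and waitTimeStep are illustrative stand-ins for the CreateRNNInference / RNNInference / WaitRNNInferenceTimeStep calls listed below, and the GPU work is reduced to a comment; this is not the actual implementation.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Sketch: non-blocking submits enqueue timesteps; a background thread batches whatever
// is ready and advances every active instance by one timestep per iteration.
class RNNInferenceStream {
public:
    explicit RNNInferenceStream(int maxBatch) : maxBatch_(maxBatch), worker_([this] { run(); }) {}
    ~RNNInferenceStream() { stop_ = true; cv_.notify_all(); worker_.join(); }

    int createInstance() {
        std::lock_guard<std::mutex> lk(m_);
        pending_.push_back(0); completed_.push_back(0);
        return static_cast<int>(pending_.size()) - 1;
    }

    // Non-blocking: request numSteps more timesteps for this instance.
    void submit(int handle, int numSteps) {
        { std::lock_guard<std::mutex> lk(m_); pending_[handle] += numSteps; }
        cv_.notify_all();
    }

    // Block until the instance has finished at least timeStep timesteps.
    void waitTimeStep(int handle, int timeStep) {
        std::unique_lock<std::mutex> lk(m_);
        done_.wait(lk, [&] { return completed_[handle] >= timeStep; });
    }

private:
    void run() {
        while (!stop_) {
            std::vector<int> batch;  // instances that get a batch slot this timestep
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait_for(lk, std::chrono::milliseconds(1));
                for (int h = 0; h < (int)pending_.size() && (int)batch.size() < maxBatch_; ++h)
                    if (pending_[h] > 0) batch.push_back(h);
                if (batch.empty()) continue;
                for (int h : batch) --pending_[h];
            }
            // ... restore hidden states, run one batched RNN timestep on the GPU
            //     (e.g. via cudnnRNNForwardInference), store states, copy results back ...
            {
                std::lock_guard<std::mutex> lk(m_);
                for (int h : batch) ++completed_[h];
            }
            done_.notify_all();
        }
    }

    int maxBatch_;
    std::mutex m_;
    std::condition_variable cv_, done_;
    std::vector<int> pending_, completed_;
    std::atomic<bool> stop_{false};
    std::thread worker_;
};

The key design point is that callers never block on submission; only the explicit wait blocks, and the worker decides the effective batch size at every timestep from whatever work is pending.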
STREAMING INFERENCE API
List of Functions
streamHandle = CreateRNNInferenceStream(modelDesc);
rnnHandle = CreateRNNInference(streamHandle);
RNNInference(rnnHandle, pInput, pOutput, seqLength);
timeStep = WaitRNNInferenceTimeStep(rnnHandle, timeStep);
timeStep = GetRNNInferenceProgress(rnnHandle);
DestroyRNNInference(rnnHandle);
DestroyRNNInferenceStream(streamHandle);
EXAMPLE USAGE
// Create the inference stream with shared model parameters
streamHandle = CreateRNNInferenceStream(modelDesc);

// Create two RNN inference instances
rnnHandle[0] = CreateRNNInference(streamHandle);
rnnHandle[1] = CreateRNNInference(streamHandle);

// Request inferencing for each inference instance with 10 timesteps (non-blocking calls)
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);

// Request inferencing of an additional 5 timesteps for the second inference instance
RNNInference(rnnHandle[1], pInput[1] + 10*inputSize, pOutput[1] + 10*outputSize, 5);

// Wait for the completion of the most recently added inferencing job
WaitRNNInferenceTimeStep(rnnHandle[1], 15);

// Destroy the two inferencing tasks and the inference stream
DestroyRNNInference(rnnHandle[0]);
DestroyRNNInference(rnnHandle[1]);
DestroyRNNInferenceStream(streamHandle);
Two Inference Instances
RNN INFERENCING WITH SEGMENTS
Execution Queue and Task Switching for Batch Size = 2
[Diagram: three inference tasks (Inference-0, Inference-1, Inference-2) arrive at different times, each submitting segments of timesteps (t0, t1, t2, …) with inputs i0, i1, … and outputs o0, o1, …. With batch size = 2, the execution queue holds two active tasks at a time and switches between them: when Inference-2 arrives, Inference-0's states are stored; later Inference-2's states are stored and Inference-0's states are restored so it can resume.]
IMPLEMENTATION
1. Find the inference tasks ready to execute time steps
2. Determine the batch slots for each inference task
3. Send the inputs to GPU for batched processing
4. Restore hidden states as needed (+ cell states for LSTMs)
5. Batched execution on the GPU
6. Store hidden states as needed (+ cell states for LSTMs)
7. Send the batched results back to host
8. De-batch the results on the host
Auto-batching and GPU Execution
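A hypothetical C++ sketch of one iteration of this loop follows, with the GPU execution (step 5) left as a comment; the InferenceTask structure and helper names are illustrative assumptions, not the actual implementation.

#include <algorithm>
#include <vector>

// Illustrative per-task bookkeeping for the auto-batching loop.
struct InferenceTask {
    std::vector<float> input;   // submitted input timesteps, concatenated
    std::vector<float> output;  // de-batched results, appended per timestep
    std::vector<float> state;   // hidden (and, for LSTM, cell) state of this task
    int nextStep = 0;           // next timestep to execute
    int stepsRequested = 0;     // timesteps submitted so far
};

void runOneBatchedTimestep(std::vector<InferenceTask>& tasks, int maxBatch,
                           int inputSize, int outputSize, int stateSize) {
    // 1./2. Find tasks with work left and assign them batch slots
    std::vector<int> slots;
    for (int i = 0; i < (int)tasks.size() && (int)slots.size() < maxBatch; ++i)
        if (tasks[i].nextStep < tasks[i].stepsRequested) slots.push_back(i);
    if (slots.empty()) return;

    // 3./4. Gather this timestep's inputs and per-task states into contiguous batch buffers
    std::vector<float> batchedInput(slots.size() * inputSize);
    std::vector<float> batchedState(slots.size() * stateSize);
    for (int s = 0; s < (int)slots.size(); ++s) {
        const InferenceTask& t = tasks[slots[s]];
        std::copy_n(t.input.data() + t.nextStep * inputSize, inputSize,
                    batchedInput.data() + s * inputSize);
        std::copy_n(t.state.data(), stateSize, batchedState.data() + s * stateSize);
    }

    // 5. Batched execution on the GPU would happen here: send batchedInput/batchedState
    //    to the device, run one RNN timestep with batch size = slots.size()
    //    (e.g. via cudnnRNNForwardInference), and copy the results and states back.
    std::vector<float> batchedOutput(slots.size() * outputSize, 0.0f);  // placeholder results

    // 6./7./8. Store the updated states and de-batch the results back to each task
    for (int s = 0; s < (int)slots.size(); ++s) {
        InferenceTask& t = tasks[slots[s]];
        std::copy_n(batchedState.data() + s * stateSize, stateSize, t.state.data());
        t.output.insert(t.output.end(), batchedOutput.data() + s * outputSize,
                        batchedOutput.data() + (s + 1) * outputSize);
        ++t.nextStep;
    }
}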
IMPLEMENTATION
Batching, Executing, and De-batching Inference Tasks
[Diagram: per-timestep pipeline from inference-task inputs to inference-task outputs: Batching (host op) → Input HtoD transfer → Restore states → LSTM → Projection → Top K → Store states → Output DtoH transfer → De-batching (host op).]
A background thread accepts the incoming inference tasks.
At each timestep the ready inference tasks are batched, executed on the GPU, and de-batched.
PERFORMANCE OPTIMIZATIONS
Hiding Host Processing, Data Transfers, and State Management
Overlapping opportunities between timesteps for compute, batching, de-batching and transfer
Perform batching and de-batching on separate CPU threads: provides better CPU bandwidth and better overlap with the GPU
Employ three CUDA streams and triple-buffering of the output to better exploit concurrency (a sketch follows the diagram below)
[Diagram: timeline of timesteps t, t+1, t+2, t+3. Each timestep runs the Batching → Input HtoD → Restore → LSTM → Projection → Top K → Store → Output DtoH → De-batching pipeline, with successive timesteps rotated across CUDA Stream 0, Stream 1, and Stream 2 so the host work and transfers of one timestep overlap the GPU compute of its neighbors.]
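A minimal CUDA C++ sketch of the stream rotation and triple buffering (illustrative only: the batched RNN timestep is replaced by a placeholder kernel, and the buffer sizes, names, and single-event dependency are assumptions, not the actual implementation):

#include <cuda_runtime.h>

// Placeholder for one batched RNN timestep (the real work would be cuDNN/cuBLAS calls).
__global__ void rnnTimestepKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // identity stands in for LSTM + projection + top-k
}

int main() {
    const int kBuffers = 3;                  // triple buffering across three CUDA streams
    const int n = 1 << 20;                   // illustrative batched buffer size
    const size_t bytes = size_t(n) * sizeof(float);

    cudaStream_t stream[kBuffers];
    float *hIn[kBuffers], *hOut[kBuffers], *dIn[kBuffers], *dOut[kBuffers];
    for (int b = 0; b < kBuffers; ++b) {
        cudaStreamCreate(&stream[b]);
        cudaMallocHost(&hIn[b], bytes);      // pinned host memory enables async copies
        cudaMallocHost(&hOut[b], bytes);
        cudaMalloc(&dIn[b], bytes);
        cudaMalloc(&dOut[b], bytes);
    }

    cudaEvent_t lstmDone;
    cudaEventCreate(&lstmDone);

    const int timesteps = 12;
    for (int t = 0; t < timesteps; ++t) {
        int b = t % kBuffers;                // rotate streams and buffers
        cudaStreamSynchronize(stream[b]);    // buffer b's previous timestep must be finished
        // ... host threads batch the ready tasks into hIn[b] and de-batch hOut[b] here ...
        cudaMemcpyAsync(dIn[b], hIn[b], bytes, cudaMemcpyHostToDevice, stream[b]);
        if (t > 0) cudaStreamWaitEvent(stream[b], lstmDone, 0);  // timestep t needs states from t-1
        rnnTimestepKernel<<<(n + 255) / 256, 256, 0, stream[b]>>>(dIn[b], dOut[b], n);
        cudaEventRecord(lstmDone, stream[b]);
        cudaMemcpyAsync(hOut[b], dOut[b], bytes, cudaMemcpyDeviceToHost, stream[b]);
        // the copies and host work of timestep t+1 can now overlap this timestep's compute
    }

    for (int b = 0; b < kBuffers; ++b) {
        cudaStreamSynchronize(stream[b]);
        cudaFreeHost(hIn[b]); cudaFreeHost(hOut[b]);
        cudaFree(dIn[b]);     cudaFree(dOut[b]);
        cudaStreamDestroy(stream[b]);
    }
    cudaEventDestroy(lstmDone);
    return 0;
}

Pinned host buffers make the cudaMemcpyAsync calls truly asynchronous, and the event keeps the compute of consecutive timesteps ordered (the hidden state of timestep t feeds t+1) while copies and host-side batching still overlap.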
PERFORMANCE EXPERIMENTS
Input size = 128
7 LSTM layers with 1024 hidden cells
A final projection layer with 1024 outputs
Timesteps per inference segment = 10
Total sequence length = 1000
Experiments are performed on T4 and GV100
End-to-end time: from task submission to results arriving at the host
An Example LSTM Model
[Diagram: per-timestep model stack with 128-dimensional input, 1024-unit LSTM layers, and the projection layer; one inference request has 10 timesteps.]
PERFORMANCE EXPERIMENTS
Benchmarking Code
// Queue up inferencing tasks with 10 timesteps each
time[0] = time();
RNNInference(rnnHandle[0], pInput[0], pOutput[0], 10);
time[1] = time();
RNNInference(rnnHandle[1], pInput[1], pOutput[1], 10);
...
RNNInference(rnnHandle[N-1], pInput[N-1], pOutput[N-1], 10);

// Wait for the completion of the first inferencing task, then submit a new one
WaitRNNInferenceTimeStep(rnnHandle[0], 10);
time[0] = time() - time[0];
RNNInference(rnnHandle[N], pInput[N], pOutput[N], 10);

// Wait for the completion of the second inferencing task, then submit a new one
WaitRNNInferenceTimeStep(rnnHandle[1], 10);
time[1] = time() - time[1];
RNNInference(rnnHandle[N+1], pInput[N+1], pOutput[N+1], 10);
...
There are at most N inference requests in flight at a given time.
Measure the time required to finish each inference request including the data transfer time.
COMPARISON AGAINST BATCHED CUDNN
FP32 Model, GV100 Numbers
[Chart: Throughput of Streaming Inference API vs. Batched cuDNN (FP32 model, GV100) for batch sizes 32, 64, 128, 256, and 512. Left axis: throughput (timesteps per ms), 0-100; right axis: throughput as % of batched cuDNN, 0-100%. Series: Streaming API, Batched, % of Batched.]
PERFORMANCE WITH TENSOR CORES
FP16 on GV100
[Chart: Streaming Inference Performance on GV100 for batch sizes 32, 64, 128, 256, and 512. Y-axis: throughput (TFLOP/s), 0-20. Series: FP32, FP16 w/Tensor Cores.]
PERFORMANCE WITH TENSOR CORES
FP16 on T4
[Chart: Streaming Inference Performance on T4 for batch sizes 32, 64, 128, 256, and 512. Y-axis: throughput (TFLOP/s), 0-14. Series: FP32, FP16 w/Tensor Cores.]
LATENCY VS THROUGHPUT TRADEOFF
Assuming each inference segment represents 100 ms of audio
Choose a batch size that will maximize throughput while staying within latency budget
FP16 on T4
[Chart: Latency Percentiles (FP16 on T4). X-axis: inference instances served by a GPU / batch size (257 / 32, 441 / 64, 692 / 128, 913 / 256, 977 / 512). Y-axis: latency (ms), 0-70. Series: 50%, 90%, 95%, and 99% latency.]
NVIDIA TENSORRT INFERENCE SERVER
Production Data Center Inference Server
Maximize real-time inference performance of GPUs
Quickly deploy and manage multiple models per GPU per node
Easily scale to heterogeneous GPUs and multi-GPU nodes
Integrates with orchestration systems and auto scalers via latency and health metrics
Open source for thorough customization and integration
[Diagram: multiple TensorRT Inference Server instances, each serving a node with multiple GPUs (Tesla T4, Tesla V100, Tesla P4).]
INFERENCE SERVER ARCHITECTURE
Models supported:
● TensorFlow GraphDef/SavedModel
● TensorFlow and TensorRT GraphDef
● TensorRT Plans
● Caffe2 NetDef (ONNX import)
● Custom backends
Multi-GPU support
Concurrent model execution
Server HTTP REST API/gRPC
Python/C++ client libraries
Python/C++ Client Library
Available with Monthly Updates
INFERENCE SERVER BATCHERS
Dynamic Batching
The TensorRT Inference Server (TRTIS) groups inference requests based on customer-defined criteria for optimal performance
Customer defines 1) batch size and/or 2) latency requirements
Sequence Batching
TRTIS can keep track of the inference requests belonging to a stateful model
The client application assigns a correlation ID for a stream of inferences belonging to the same sequence
Use together with a custom backend to store and restore the internal states of the model
Dynamic and Sequence Batching
SUMMARY
Designed and implemented the Streaming Inference API
Automatically batches the RNN inference requests together to achieve high throughput
Code written for batch size = 1 achieves ≥66% of the throughput of batched execution (FP32)
Allows utilizing the Tensor Cores on Volta and Turing architectures
Hit latency targets by choosing the right batch size
Generalizes to sequence models with interdependent inference streams
TRTIS sequence batcher and custom backends for high-performance real-time inferencing
S9438 - Maximizing Utilization for Data Center Inference with TensorRT Inference Server
RESOURCES
TRTIS blog post and documentation:
https://devblogs.nvidia.com/nvidia-serves-deep-learning-inference/
https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/
Recommended