
  • GTC 2019, San Jose CA

    Optimizing Runtime Performance of Neural Net Architectures for High Scalability in Speech Recognition Servers

    John Kominek, CTO, Voci Technologies

    March 21, 2019

    S9535

  • High Density Speech Recognition on Nvidia GPUs

    ▪ Training of neural nets receives a lot of attention

    – Consumes the entire resources of the GPU

    ▪ Evaluation receives less attention, but it is critical to large commercial deployments

    – Each stream uses a fraction of GPU resources

    – But you want to get maximal use of the card

  • An Untold Story of Neural Nets in ASR

    ▪ Lessons Learned

    – Insight into what's easy and what's hard

    – The multi-threading trick that works surprisingly well

    – The mystery (and pain) of negative scaling

    – Kepler, Maxwell, Pascal, Volta – how do they stack up?

    ▪ Intermediate level talk

    – Light on math, plenty of insider jargon

  • Company Highlights – Deliver the world’s best speech-to-text platform for analytics

    [Timeline graphic, pre-2011 through 2019: CMU government-funded projects lead to the founding of Voci; first-generation V-Blaze speech engine developed for speed, accuracy, scalability, and real-time transcription; Series A and later Series B funding; AI/deep learning powers ASR; V-Spark, V-Cloud, Language ID, Speaker ID, speaker separation, emotion/sentiment/gender labeling, biometrics, and custom language models introduced; partner enablement, 40 new logos, V-Blaze 5.0 released; transcription volume grows from over 10m minutes to over 100m, then 5 billion, with >8 billion estimated minutes by 2019; 50 employees, >10m bookings, 200 man-years of development.]

  • Neural Net Revolution = Company Crossroads

    ▪ Up to 2013

    – FPGA implemented large fully continuous GMM models with integrated statistical language model evaluation and search. Fastest ASR engine in the world at the time.

    ▪ 2013

    – G. Hinton et al. established the superiority of deep neural networks, leading to a rare seismic shift in the field of speech recognition.

  • Voci's Technology Shift from FPGAs to GPUs

    ▪ 2013-2014 – Technology bakeoff

    – Implemented DNN evaluation on a Xilinx Virtex-5, pitted against a CUDA implementation on a Tesla K20

    – The Nvidia platform won convincingly

    – Matrix multiplication primitives are tailor-made for deep feedforward network evaluation

    – Migrated to the open source Kaldi toolkit for model training

  • Voci V-Blaze Runs on Extensive Array of GPUs

    ▪ Servers: Tesla K20, K40, K80, M10, M60, P100, V100

    ▪ Embedded: Jetson Tegra TK1, TX1, TX2, CX2

    ▪ Laptops: GeForce GTX 960M, GTX 1050, GTX 1050 Ti

    ▪ In the cloud on AWS

    ▪ Redhat/CentOS, Debian/Ubuntu

  • Pictures of Server Rooms are Boring, so...

    Voci powering advanced automotive conversational systems

  • If it's a Neural Net, Throw it on the GPU

    ▪ Voci V-Blaze runs

    – DNN

    – LSTM, BLSTM

    – CNN

    – TDNN

    – RNNLM

    – Combinations: e.g. DNN + CNN + BLSTM

  • A Story of Joy, Struggle, and Triumph

    ▪ Easy to accelerate: Feedforward DNN

    ▪ Hard to accelerate: Bidirectional LSTM

  • Evaluating Feedforward DNNs

    ▪ Single-threaded evaluation is a straightforward sequence of matrix multiplications and non-linear range-compression functions

    ▪ Invoke the appropriate cuDNN functions … and voilà, marketing gold
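    A minimal sketch of what one such layer evaluation can look like, assuming cuBLAS for the GEMM and a logistic sigmoid as the range-compression function. The function names, sizes, and activation choice are illustrative assumptions, not Voci's actual code:

    // One feedforward layer: y = sigmoid(W*x + b), as a GEMM plus a small kernel.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Fused bias-add and logistic sigmoid applied to the GEMM output.
    __global__ void bias_sigmoid(float *y, const float *b, int rows, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = 1.0f / (1.0f + expf(-(y[i] + b[i % rows])));
    }

    // y (rows x batch) = sigmoid(W (rows x cols) * x (cols x batch) + b).
    // All matrices are column-major, as cuBLAS expects. Create a handle once
    // with cublasCreate(&h); a full net is this call repeated per layer.
    void eval_layer(cublasHandle_t h, const float *d_W, const float *d_x,
                    const float *d_b, float *d_y, int rows, int cols, int batch)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    rows, batch, cols,
                    &alpha, d_W, rows, d_x, cols, &beta, d_y, rows);
        int n = rows * batch;
        bias_sigmoid<<<(n + 255) / 256, 256>>>(d_y, d_b, rows, n);
    }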

  • Single Threaded Performance

  • Multi-Core Performance is What Matters

    ▪ There are plenty of CUDA cores left over

    ▪ There are untapped Xeon cores

    ▪ How well does neural net inference scale as the compute load is increased and more cores (GPU or CPU) are invoked?

  • Increasing Load on an M10, DNN Evaluation

  • Increasing Load on an M60, DNN Evaluation

  • Increasing Load on a P100, DNN Evaluation

  • Increasing Load on a V100, DNN Evaluation

  • Meaning of Compute Load

    ▪ A compute load of 1 is one process pumping audio to the GPU as fast as results can be returned

    ▪ Load = number of such processes in parallel

    ▪ Independent processes, not threads
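    In other words, a load of N is just N copies of the client running at once. A minimal sketch of such a load generator, where the transcribe client binary and its arguments are hypothetical placeholders:

    // Spawn "load" independent client processes, each pumping audio at the
    // engine as fast as results come back, then wait for all of them.
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int load = (argc > 1) ? atoi(argv[1]) : 1;
        for (int i = 0; i < load; i++) {
            if (fork() == 0) {
                // Child: one unit of compute load (hypothetical client binary).
                execlp("transcribe", "transcribe", "audio.wav", (char *)NULL);
                perror("execlp");   // only reached if exec fails
                _exit(1);
            }
        }
        while (wait(NULL) > 0)      // parent: block until every child exits
            ;
        return 0;
    }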

  • Voci Engineers are Scofflaws!

    ▪ The Nvidia programming guidelines recommend multi-threading: separate processes do not truly run in parallel; to run in parallel, program threads.

    ▪ We're like, "yeah, whatever."

  • Translating Compute Load to Speed

    ▪ Depends on the size of the neural net

    • Small = 1024x6 ~ 12 million connections

    • Medium = 2048x6 ~ 34 million

    • Large = 4096x6 ~ 110 million

    ▪ Speed reported as x times faster than real time
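    As a rough check on those connection counts (assuming ~440 input features and ~6000 output senones, typical Kaldi-era dimensions that are my assumption, not figures from the talk): an h x 6 net has roughly 440·h + 5·h² + h·6000 connections, e.g. 440·4096 + 5·4096² + 4096·6000 ≈ 110 million for the large model, matching the slide.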

  • DNN Evaluation Speed vs Model Size (P100)

  • DNN Evaluation by Tesla Generation (K80)

  • DNN Evaluation by Tesla Generation (M10)

  • DNN Evaluation by Tesla Generation (M60)

  • DNN Evaluation by Tesla Generation (P100)

  • DNN Evaluation by Tesla Generation (V100)

  • Comparison to Pure CPU Performance Curve

  • V100 Provides Best Peak Power Efficiency

  • So much for Easy, Now for Hard

    https://github.com/dophist/kaldi-lstm

  • Speed of Open Source Kaldi Implementation

  • Visual Profiler Reveals the Problem

    [Visual Profiler timelines for DNN vs. BLSTM – in the BLSTM trace, kernel synchronization dominates]

  • Highly Suspicious Power/Utilization Pattern

  • The Shock of Negative Scaling

    Instead of saturating, speed decreases!

  • M10/M60 Scale According to GPU Count, then Drop

  • What to do?

    ▪ Separate processes were interfering with each other

    ▪ Three avenues forward

    – Switch to older cards that present multiple, less powerful GPU interfaces (the M10)

    – Re-engineer the infrastructure code to be a multi-threaded, single-process server

    – See how far optimizing the code will take you

  • 4 Custom Optimizations

    ▪ Kernel merging (15%)

    ▪ Matrix transpose into row major form (10%)*

    ▪ Reverse direction compute stream pairs (24%)

    ▪ Application-specific data parallelism (26%)

    ▪ Together: increase single process speed by 2x

    – * J. Appleyard, Optimizing Recurrent Neural Networks in cuDNN 5, GTC 2016
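    As a sanity check (my arithmetic, not a claim from the slides): the four gains compound multiplicatively, and 1.15 × 1.10 × 1.24 × 1.26 ≈ 1.98, consistent with the stated 2x overall speedup.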

  • Application Specific Data Parallelism

    ▪ Serialism inherent in recurrent loops can be approximated

    [Diagram: the input is split along the time axis into chunks, each handled by its own fwd/bwd compute stream pair running concurrently]

    cudaStreamCreate(&stream_fwd)

    cudaStreamCreate(&stream_bwd)
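    A minimal sketch of the stream-pair idea under stated assumptions: the forward and backward recurrences of a BLSTM are independent of each other, so each direction's per-timestep kernels can be issued on its own CUDA stream and overlap on the GPU. Here lstm_step is a hypothetical stand-in for the real LSTM cell kernel, and the math inside it is a placeholder, not the actual recurrence:

    #include <cuda_runtime.h>

    // Hypothetical stand-in for one timestep of an LSTM cell.
    __global__ void lstm_step(const float *x_t, const float *h_prev,
                              float *h_out, int dim)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < dim)
            h_out[i] = tanhf(x_t[i] + h_prev[i]);  // placeholder math
    }

    // d_hf and d_hb each hold (T+1) hidden states; row 0 is the initial state.
    void blstm_layer(const float *d_x, float *d_hf, float *d_hb, int T, int dim)
    {
        cudaStream_t fwd, bwd;
        cudaStreamCreate(&fwd);
        cudaStreamCreate(&bwd);
        int blocks = (dim + 255) / 256;
        for (int t = 0; t < T; t++) {
            int tb = T - 1 - t;  // backward direction walks time in reverse
            lstm_step<<<blocks, 256, 0, fwd>>>(d_x + t * dim,
                                               d_hf + t * dim,
                                               d_hf + (t + 1) * dim, dim);
            lstm_step<<<blocks, 256, 0, bwd>>>(d_x + tb * dim,
                                               d_hb + t * dim,
                                               d_hb + (t + 1) * dim, dim);
        }
        cudaStreamSynchronize(fwd);
        cudaStreamSynchronize(bwd);
        cudaStreamDestroy(fwd);
        cudaStreamDestroy(bwd);
    }

    The "application-specific" part is, per the diagram above, splitting the input along the time axis so several such pairs run concurrently, trading the exact recurrence for an approximation at chunk boundaries.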

  • Before and After – from 37.5x to 375x

  • Negative Scaling Eliminated

  • Power/Utilization after Optimization is Sane Again

  • V100 Still Leads on Power Efficiency

  • Best Price/Performance Provided by M10

  • Unsurprising Findings

    ▪ What's easy and what's hard

    – DNNs are easy, BLSTMs are hard

    ▪ Kepler, Maxwell, Pascal, Volta comparison

    – V100 is fastest

    – V100 has best power efficiency

    – M10 has best price/performance

  • Unexpected Findings

    ▪ The multi-threading trick that works surprisingly well to achieve high-performance scaling

    – Don't multi-thread (even though you should)

    ▪ Negative scaling can happen – and can be overcome

    – It's still kind of a mystery, though

    – For advanced details, join our company

  • www.vocitec.com


    The only true enterprise speech-to-text platform that solves real business challenges

    [email protected], [email protected] (CEO)
