GTC 2019, San Jose CA
Optimizing Runtime Performance
of Neural Net Architectures
for High Scalability
in Speech Recognition Servers
John Kominek, CTO, Voci Technologies
March 21, 2019
S9535
© 2019 | GTC 2019, San Jose CA
High Density Speech Recognition on Nvidia GPUs
▪ Training of neural nets receives a lot of attention
– It consumes the full resources of the GPU
▪ Less attention is given to evaluation (inference), but it is
critical to large commercial deployments
– Each stream uses only a fraction of GPU resources
– Yet you want to make maximal use of the card
An Untold Story of Neural Nets in ASR
▪ Lessons learned
– Insight into what's easy and what's hard
– The multi-threading trick that works surprisingly well
– The mystery (and pain) of negative scaling
– Kepler, Maxwell, Pascal, Volta – how do they stack up?
▪ Intermediate-level talk
– Light on math, plenty of insider jargon
Company Highlights
Deliver the world's best speech-to-text platform for analytics
[Timeline graphic: company milestones, pre-2011 through 2019. Recoverable highlights:]
• CMU government-funded projects lead to the founding of Voci
• First-generation speech engine V-Blaze developed – speed, accuracy, scalability, real-time transcription
• Deep learning: AI powers ASR
• Series A and, later, Series B funding
• V-Spark, V-Cloud, and V-Blaze 5.0 introduced
• Emotion, sentiment, and gender labeling; speaker separation; speaker ID; language ID; biometrics introduced
• Custom and expanded language models; partner enablement; integration
• Transcription volume grows from over 10m minutes to over 100m, then 5 billion, then >8 billion (est.) minutes
• By 2019: 50 employees, >10m bookings, 40 new logos, 200 man-years of development, new website, focused business model
Neural Net Revolution = Company Crossroads
▪ Up to 2013
– FPGA-implemented large fully-continuous GMM
models with integrated statistical language-model
evaluation and search – the fastest ASR engine in the
world at the time
▪ 2013
– G. Hinton et al. established the superiority of deep neural
networks, leading to a rare seismic shift in the field of
speech recognition
Voci's Technology Shift from FPGAs to GPUs
▪ 2013–2014 – Technology bakeoff
– Implemented DNN evaluation on a Xilinx Virtex-5, pitted
against a CUDA implementation on a Tesla K20
– The Nvidia platform won convincingly
– Matrix-multiplication primitives are tailor-made for
deep feedforward network evaluation
– Migrated to the open-source Kaldi toolkit for model
training
Voci V-Blaze Runs on Extensive Array of GPUs
▪ Servers: Tesla K20, K40, K80, M10, M60, P100, V100
▪ Embedded: Jetson Tegra TK1, TX1, TX2, CX2
▪ Laptops: GeForce GTX 960M, GTX 1050, GTX 1050 Ti
▪ In the cloud on AWS
▪ Redhat/CentOS, Debian/Ubuntu
Pictures of Server Rooms are Boring, so...
Voci powering advanced automotive conversational systems
If it's a Neural Net, Throw it on the GPU
▪ Voci V-Blaze runs
– DNN
– LSTM, BLSTM
– CNN
– TDNN
– RNNLM
– Combinations: e.g. DNN + CNN + BLSTM
A Story of Joy, Struggle, and Triumph
▪ Easy to accelerate: Feedforward DNN
▪ Hard to accelerate: Bidirectional LSTM
Evaluating Feedforward DNNs
▪ Single-threaded evaluation is a straightforward sequence
of matrix multiplications and non-linear
range-compression functions
▪ Invoke the appropriate cuDNN functions … and voilà,
marketing gold
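That chain of GEMMs and nonlinearities can be sketched in a few lines of NumPy. This is a minimal illustration of the idea only: the 440-input/8000-output layer sizes and the sigmoid/softmax choices are assumptions for the sketch, not Voci's actual model.

```python
import numpy as np

def evaluate_dnn(features, weights, biases):
    """Feedforward DNN evaluation: one GEMM plus a nonlinearity per layer.

    features: (n_frames, n_inputs) batch of acoustic feature frames.
    weights/biases: per-layer parameters; sigmoid hidden layers and a
    softmax output over acoustic classes (illustrative choices).
    """
    x = features
    for W, b in zip(weights[:-1], biases[:-1]):
        x = 1.0 / (1.0 + np.exp(-(x @ W + b)))   # sigmoid hidden layer
    logits = x @ weights[-1] + biases[-1]
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
sizes = [440, 1024, 1024, 1024, 1024, 1024, 1024, 8000]  # "small" 1024x6 net
Ws = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(sizes, sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]
out = evaluate_dnn(rng.standard_normal((16, 440)), Ws, bs)
```

On the GPU, each `x @ W` becomes a cuBLAS/cuDNN GEMM call, which is why this topology maps onto the hardware so cleanly.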
Single Threaded Performance
Multi-Core Performance is What Matters
▪ There are plenty of CUDA cores left over
▪ There are untapped Xeon cores
▪ How well does neural net inference scale as the compute
load is increased and more cores (GPU or CPU) are
invoked?
Increasing Load on an M10, DNN Evaluation
Increasing Load on an M60, DNN Evaluation
Increasing Load on a P100, DNN Evaluation
Increasing Load on a V100, DNN Evaluation
Meaning of Compute Load
▪ A compute load of 1 is one process pumping audio to
the GPU as fast as results can be returned
▪ Load = number of such processes in parallel
▪ Independent processes, not threads
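Measured this way, the load generator is just N independent OS processes, each driving the engine as fast as it will go. A hedged sketch of that harness, where `transcribe_stream` is a hypothetical stand-in for the real decoding call:

```python
import multiprocessing as mp
import time

def transcribe_stream(worker_id, results):
    """Stand-in for one decoding process: pump work as fast as results
    come back for a fixed interval, then report how much was done."""
    frames_done = 0
    deadline = time.monotonic() + 0.2          # short demo run
    while time.monotonic() < deadline:
        frames_done += 100                     # pretend: one batch decoded
    results.put((worker_id, frames_done))

def run_at_load(load):
    """Compute load = number of independent processes (not threads)."""
    results = mp.Queue()
    procs = [mp.Process(target=transcribe_stream, args=(i, results))
             for i in range(load)]
    for p in procs:
        p.start()
    reports = [results.get() for _ in range(load)]
    for p in procs:
        p.join()
    return reports

if __name__ == "__main__":
    print(len(run_at_load(4)))   # compute load = 4
```

In the real setup each process holds its own CUDA context, which is exactly what the scaling curves in the following slides stress-test.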
Voci Engineers are Scofflaws!
▪ The Nvidia programming guidelines recommend
multi-threading: separate processes do not truly run in
parallel on the GPU; to run in parallel, program with threads.
▪ We're like, "yeah, whatever."
Translating Compute Load to Speed
▪ Depends on the size of the neural net
• Small = 1024×6 ≈ 12 million connections
• Medium = 2048×6 ≈ 34 million
• Large = 4096×6 ≈ 110 million
▪ Speed reported as x times faster than real time
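"x times faster than real time" is simply audio duration divided by wall-clock processing time. A tiny helper makes the metric concrete (the one-hour example below is illustrative arithmetic, not a measurement from the talk):

```python
def speedup_vs_realtime(audio_seconds, wall_seconds):
    """An engine that processes `audio_seconds` of audio in
    `wall_seconds` of wall-clock time runs at this real-time multiple."""
    return audio_seconds / wall_seconds

# e.g. one hour of audio transcribed in 96 seconds of wall time:
x = speedup_vs_realtime(3600.0, 96.0)   # 37.5x real time
```

At a fixed compute load, larger models lower this multiple, which is why the curves below are broken out by model size.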
DNN Evaluation Speed vs Model Size (P100)
DNN Evaluation by Tesla Generation (K80)
DNN Evaluation by Tesla Generation (M10)
DNN Evaluation by Tesla Generation (M60)
DNN Evaluation by Tesla Generation (P100)
DNN Evaluation by Tesla Generation (V100)
Comparison to Pure CPU Performance Curve
V100 Provides Best Peak Power Efficiency
So much for Easy, Now for Hard
https://github.com/dophist/kaldi-lstm
Speed of Open Source Kaldi Implementation
Visual Profiler Reveals the Problem
[Visual Profiler timelines: DNN vs. BLSTM – for the BLSTM, kernel synchronization dominates]
Highly Suspicious Power/Utilization Pattern
The Shock of Negative Scaling
Instead of saturating, speed decreases!
M10/M60 Scale According to GPU Count, then Drop
What to do?
▪ Separate processes were interfering with each other
▪ Three avenues forward
– Switch to older cards that present multiple, less
powerful GPU interfaces (e.g. the M10)
– Re-engineer the infrastructure code into a
multi-threaded, single-process server
– See how far optimizing the code will take you
4 Custom Optimizations
▪ Kernel merging (15%)
▪ Matrix transpose into row major form (10%)*
▪ Reverse direction compute stream pairs (24%)
▪ Application-specific data parallelism (26%)
▪ Together these increase single-process speed by 2x
– * J. Appleyard, "Optimizing Recurrent Neural Networks in cuDNN 5", GTC 2016
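Kernel merging in recurrent nets commonly means fusing the four LSTM gate projections into one larger GEMM, as described in Appleyard's cuDNN work. A NumPy sketch of the idea follows; the layer sizes and initialization are illustrative assumptions, and this is not Voci's actual kernel:

```python
import numpy as np

def lstm_step_fused(x, h, c, W, U, b):
    """One LSTM time step with the four gate projections merged.

    Instead of four separate GEMMs for the input/forget/cell/output
    gates, W and U are concatenated to shapes (n_in, 4*n_hid) and
    (n_hid, 4*n_hid), so a single GEMM per operand computes all four
    gate pre-activations at once.
    """
    z = x @ W + h @ U + b                  # one fused GEMM per operand
    i, f, g, o = np.split(z, 4, axis=-1)   # slice out the four gates
    sig = lambda v: 1.0 / (1.0 + np.exp(-v))
    c_new = sig(f) * c + sig(i) * np.tanh(g)
    h_new = sig(o) * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(1)
n_in, n_hid = 64, 128
W = rng.standard_normal((n_in, 4 * n_hid)) * 0.1
U = rng.standard_normal((n_hid, 4 * n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = np.zeros((1, n_hid)), np.zeros((1, n_hid))
h, c = lstm_step_fused(rng.standard_normal((1, n_in)), h, c, W, U, b)
```

Fewer, larger GEMMs mean fewer kernel launches and synchronization points per time step, which is exactly what the profiler showed the BLSTM choking on.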
Application Specific Data Parallelism
▪ Serialism inherent in recurrent loops can be approximated
[Diagram: multiple fwd/bwd compute stream pairs running concurrently over time]
cudaStreamCreate(&stream_fwd);
cudaStreamCreate(&stream_bwd);
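One common way to recover data parallelism inside a recurrent layer is to split a long utterance into chunks with extra left context, evaluate the chunks independently (hence in parallel across streams), and accept a small approximation at the seams. The slides don't spell out Voci's exact scheme, so the sketch below shows only the general chunking technique on a toy recurrence:

```python
import numpy as np

def chunked_recurrent_eval(seq, step_fn, chunk=50, context=10):
    """Approximate a sequential recurrent pass by evaluating fixed-size
    chunks independently. Each chunk is primed with `context` extra
    frames on the left so its state warms up before the frames we keep.
    Every loop iteration is independent, so chunks could run on
    separate GPU streams.
    """
    n = len(seq)
    outputs = []
    for start in range(0, n, chunk):
        lo = max(0, start - context)
        h = 0.0                                 # fresh state per chunk
        for t in range(lo, min(start + chunk, n)):
            h = step_fn(seq[t], h)
            if t >= start:                      # discard warm-up frames
                outputs.append(h)
    return np.array(outputs)

# Toy recurrence: leaky integration of the input.
step = lambda x, h: 0.9 * h + 0.1 * x
seq = np.ones(200)
approx = chunked_recurrent_eval(seq, step, chunk=50, context=10)
exact = chunked_recurrent_eval(seq, step, chunk=200, context=0)
```

The context length trades accuracy against parallelism: longer warm-up shrinks the boundary error but shrinks the speedup too.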
Before and After – from 37.5x to 375x
Negative Scaling Eliminated
Power/Utilization after Optimization is Sane Again
V100 Still Leads on Power Efficiency
Best Price/Performance Provided by M10
Unsurprising Findings
▪ What's easy and what's hard
– DNNs are easy, BLSTMs are hard
▪ Kepler, Maxwell, Pascal, Volta comparison
– V100 is fastest
– V100 has best power efficiency
– M10 has best price/performance
Unexpected Findings
▪ The multi-threading trick that works surprisingly well to
achieve high performance scaling
– Don't multi-thread (even though you should)
▪ Negative scaling can happen – and can be overcome
– It's still kind of a mystery, though
– For advanced details, join our company
The only true enterprise speech-to-text platform that solves real business challenges
[email protected], [email protected] (CEO)
www.vocitec.com