
  • GTC 2019, San Jose CA

    Optimizing Runtime Performance of Neural Net Architectures for High Scalability in Speech Recognition Servers

    John Kominek, CTO, Voci Technologies

    March 21, 2019

    S9535

  • High Density Speech Recognition on Nvidia GPUs

    ▪ Training of neural nets receives a lot of attention

    – Consumes the entire resources of the GPU

    ▪ Evaluation receives less attention, but it is critical to large commercial deployments

    – Each stream uses a fraction of GPU resources

    – But you want to get maximal use of the card

  • An Untold Story of Neural Nets in ASR

    ▪ Lessons Learned

    – Insight into what's easy and what's hard

    – The multi-threading trick that works surprisingly well

    – The mystery (and pain) of negative scaling

    – Kepler, Maxwell, Pascal, Volta – how do they stack up?

    ▪ Intermediate level talk

    – Light on math, plenty of insider jargon

  • Company Highlights – Deliver the world’s best speech-to-text platform for analytics

    [Timeline graphic, pre-2011 through 2019: CMU government-funded projects lead to the founding of Voci; first-generation V-Blaze speech engine developed for speed, accuracy, scalability, and real-time transcription; Series A and later Series B funding; AI/deep learning powers ASR; V-Spark, V-Cloud, Language ID, Speaker ID, speaker separation, emotion/sentiment/gender labeling, biometrics, and custom language models introduced; partner enablement, 40 new logos, V-Blaze 5.0 released; transcription volume grows from over 10m minutes to over 100m, then 5 billion, with >8 billion estimated minutes by 2019; 50 employees, >10m bookings, 200 man-years of development.]

  • Neural Net Revolution = Company Crossroads

    ▪ Up to 2013

    – FPGA implemented large fully continuous GMM models with integrated statistical language model evaluation and search. Fastest ASR engine in the world at the time.

    ▪ 2013

    – G. Hinton et al. established the superiority of deep neural networks, leading to a rare seismic shift in the field of speech recognition.

  • Voci's Technology Shift from FPGAs to GPUs

    ▪ 2013-2014 – Technology bakeoff

    – Implemented DNN evaluation on a Xilinx Virtex-5, pitted against a CUDA implementation on a Tesla K20

    – The Nvidia platform won convincingly

    – Matrix multiplication primitives are tailor-made for deep feedforward network evaluation

    – Migrated to the open source Kaldi toolkit for model training

  • Voci V-Blaze Runs on Extensive Array of GPUs

    ▪ Servers: Tesla K20, K40, K80, M10, M60, P100, V100

    ▪ Embedded: Jetson Tegra TK1, TX1, TX2, CX2

    ▪ Laptops: GeForce GTX 960M, GTX 1050, GTX 1050 Ti

    ▪ In the cloud on AWS

    ▪ Redhat/CentOS, Debian/Ubuntu

  • Pictures of Server Rooms are Boring, so...

    Voci powering advanced automotive conversational systems

  • If it's a Neural Net, Throw it on the GPU

    ▪ Voci V-Blaze runs

    – DNN

    – LSTM, BLSTM

    – CNN

    – TDNN

    – RNNLM

    – Combinations: e.g. DNN + CNN + BLSTM

  • A Story of Joy, Struggle, and Triumph

    ▪ Easy to accelerate: Feedforward DNN

    ▪ Hard to accelerate: Bidirectional LSTM

  • Evaluating Feedforward DNNs

    ▪ Single-threaded evaluation is a straightforward sequence of matrix multiplications and non-linear range-compression functions

    ▪ Invoke the appropriate cuDNN functions … and voilà, marketing gold
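    A minimal sketch of what one such layer evaluation can look like, assuming cuBLAS for the GEMM and a logistic sigmoid as the range-compression function. The function names, sizes, and activation choice are illustrative assumptions, not Voci's actual code:

    // One feedforward layer: y = sigmoid(W*x + b), as a GEMM plus a small kernel.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Fused bias-add and logistic sigmoid applied to the GEMM output.
    __global__ void bias_sigmoid(float *y, const float *b, int rows, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = 1.0f / (1.0f + expf(-(y[i] + b[i % rows])));
    }

    // y (rows x batch) = sigmoid(W (rows x cols) * x (cols x batch) + b).
    // All matrices are column-major, as cuBLAS expects. Create a handle once
    // with cublasCreate(&h); a full net is this call repeated per layer.
    void eval_layer(cublasHandle_t h, const float *d_W, const float *d_x,
                    const float *d_b, float *d_y, int rows, int cols, int batch)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    rows, batch, cols,
                    &alpha, d_W, rows, d_x, cols, &beta, d_y, rows);
        int n = rows * batch;
        bias_sigmoid<<<(n + 255) / 256, 256>>>(d_y, d_b, rows, n);
    }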

  • Single Threaded Performance

  • Multi-Core Performance is What Matters

    ▪ There are plenty of CUDA cores left over

    ▪ There are untapped Xeon cores

    ▪ How well does neural net inference scale as the compute load is increased and more cores (GPU or CPU) are invoked?

  • Increasing Load on an M10, DNN Evaluation

  • Increasing Load on an M60, DNN Evaluation

  • Increasing Load on a P100, DNN Evaluation

  • Increasing Load on a V100, DNN Evaluation

  • Meaning of Compute Load

    ▪ A compute load of 1 is one process pumping audio to the GPU as fast as results can be returned

    ▪ Load = number of such processes in parallel

    ▪ Independent processes, not threads
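    In other words, a load of N is just N copies of the client running at once. A minimal sketch of such a load generator, where the transcribe client binary and its arguments are hypothetical placeholders:

    // Spawn "load" independent client processes, each pumping audio at the
    // engine as fast as results come back, then wait for all of them.
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int load = (argc > 1) ? atoi(argv[1]) : 1;
        for (int i = 0; i < load; i++) {
            if (fork() == 0) {
                // Child: one unit of compute load (hypothetical client binary).
                execlp("transcribe", "transcribe", "audio.wav", (char *)NULL);
                perror("execlp");   // only reached if exec fails
                _exit(1);
            }
        }
        while (wait(NULL) > 0)      // parent: block until every child exits
            ;
        return 0;
    }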

  • Voci Engineers are Scofflaws!

    ▪ The Nvidia programming guidelines recommend multi-threading: separate processes do not truly run in parallel; to run in parallel, program threads.

    ▪ We're like, "yeah, whatever."

  • Translating Compute Load to Speed

    ▪ Depends on the size of the neural net

    • Small = 1024x6 ~ 12 million connections

    • Medium = 2048x6 ~ 34 million

    • Large = 4096x6 ~ 110 million

    ▪ Speed reported as x times faster than real time
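    As a rough check on those connection counts (assuming ~440 input features and ~6000 output senones, typical Kaldi-era dimensions that are my assumption, not figures from the talk): an h x 6 net has roughly 440·h + 5·h² + h·6000 connections, e.g. 440·4096 + 5·4096² + 4096·6000 ≈ 110 million for the large model, matching the slide.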

  • DNN Evaluation Speed vs Model Size (P100)

  • DNN Evaluation by Tesla Generation (K80)

  • DNN Evaluation by Tesla Generation (M10)

  • DNN Evaluation by Tesla Generation (M60)

  • DNN Evaluation by Tesla Generation (P100)

  • DNN Evaluation by Tesla Generation (V100)

  • Comparison to Pure CPU Performance Curve

  • V100 Provides Best Peak Power Efficiency

  • So much for Easy, Now for Hard

    https://github.com/dophist/kaldi-lstm

  • Speed of Open Source Kaldi Implementation

  • Visual Profiler Reveals the Problem

    [Visual Profiler timelines for DNN vs. BLSTM – in the BLSTM trace, kernel synchronization dominates]

  • Highly Suspicious Power/Utilization Pattern

  • The Shock of Negative Scaling

    Instead of saturating, speed decreases!

  • M10/M60 Scale According to GPU Count, then Drop

  • What to do?

    ▪ Separate processes were interfering with each other

    ▪ Three avenues forward

    – Switch to older cards that present multiple, less powerful GPU interfaces (the M10)

    – Re-engineer the infrastructure code to be a multi-threaded, single-process server

    – See how far optimizing the code will take you

  • 4 Custom Optimizations

    ▪ Kernel merging (15%)

    ▪ Matrix transpose into row major form (10%)*

    ▪ Reverse direction compute stream pairs (24%)

    ▪ Application-specific data parallelism (26%)

    ▪ Together: increase single process speed by 2x

    – * J. Appleyard, Optimizing Recurrent Neural Networks in cuDNN 5, GTC 2016
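    As a sanity check (my arithmetic, not a claim from the slides): the four gains compound multiplicatively, and 1.15 × 1.10 × 1.24 × 1.26 ≈ 1.98, consistent with the stated 2x overall speedup.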

  • Application Specific Data Parallelism

    ▪ Serialism inherent in recurrent loops can be approximated

    [Diagram: the input is split along the time axis into chunks, each handled by its own fwd/bwd compute stream pair running concurrently]

    cudaStreamCreate(&stream_fwd)

    cudaStreamCreate(&stream_bwd)
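    A minimal sketch of the stream-pair idea under stated assumptions: the forward and backward recurrences of a BLSTM are independent of each other, so each direction's per-timestep kernels can be issued on its own CUDA stream and overlap on the GPU. Here lstm_step is a hypothetical stand-in for the real LSTM cell kernel, and the math inside it is a placeholder, not the actual recurrence:

    #include <cuda_runtime.h>

    // Hypothetical stand-in for one timestep of an LSTM cell.
    __global__ void lstm_step(const float *x_t, const float *h_prev,
                              float *h_out, int dim)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < dim)
            h_out[i] = tanhf(x_t[i] + h_prev[i]);  // placeholder math
    }

    // d_hf and d_hb each hold (T+1) hidden states; row 0 is the initial state.
    void blstm_layer(const float *d_x, float *d_hf, float *d_hb, int T, int dim)
    {
        cudaStream_t fwd, bwd;
        cudaStreamCreate(&fwd);
        cudaStreamCreate(&bwd);
        int blocks = (dim + 255) / 256;
        for (int t = 0; t < T; t++) {
            int tb = T - 1 - t;  // backward direction walks time in reverse
            lstm_step<<<blocks, 256, 0, fwd>>>(d_x + t * dim,
                                               d_hf + t * dim,
                                               d_hf + (t + 1) * dim, dim);
            lstm_step<<<blocks, 256, 0, bwd>>>(d_x + tb * dim,
                                               d_hb + t * dim,
                                               d_hb + (t + 1) * dim, dim);
        }
        cudaStreamSynchronize(fwd);
        cudaStreamSynchronize(bwd);
        cudaStreamDestroy(fwd);
        cudaStreamDestroy(bwd);
    }

    The "application-specific" part is, per the diagram above, splitting the input along the time axis so several such pairs run concurrently, trading the exact recurrence for an approximation at chunk boundaries.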

  • Before and After – from 37.5x to 375x

  • Negative Scaling Eliminated

  • Power/Utilization after Optimization is Sane Again

  • V100 Still Leads on Power Efficiency

  • Best Price/Performance Provided by M10

  • Unsurprising Findings

    ▪ What's easy and what's hard

    – DNNs are easy, BLSTMs are hard

    ▪ Kepler, Maxwell, Pascal, Volta comparison

    – V100 is fastest

    – V100 has best power efficiency

    – M10 has best price/performance

  • Unexpected Findings

    ▪ The multi-threading trick that works surprisingly well to achieve high-performance scaling

    – Don't multi-thread (even though you should)

    ▪ Negative scaling can happen – and can be overcome

    – It's still kind of a mystery, though

    – For advanced details, join our company

  • www.vocitec.com


    The only true enterprise speech-to-text platform that solves real business challenges

    [email protected], [email protected] (CEO)
