2
AI IS EVERYWHERE
“Find where I parked my car” “Find the bag I just saw in this magazine”
“What movie should I watch next?”
3
TOUCHING OUR LIVES
Bringing grandmother closer to family by bridging language barrier
Predicting sick baby’s vitals like heart rate, blood pressure, survival rate
Enabling the blind to “see” their surrounding, read emotions on faces
4
FUELING ALL INDUSTRIES
Increasing public safety with smart video surveillance at airports & malls
Providing intelligent services in hotels, banks and stores
Separating weeds as it harvests, reduces chemical usage by 90%
5
NVIDIA DEEP LEARNING SOFTWARE PLATFORM
NVIDIA DEEP LEARNING SDK
DEVELOP WITH DIGITS TensorRT
TRAINED NETWORK
TRAINING DATA
TRAINING
DATA MANAGEMENT
MODEL ASSESSMENT
EMBEDDED
AUTOMOTIVE
DATA CENTER
6
TENSORRT Workflow
DIGITS OPTIMIZATION USING TensorRT
RUNTIME USING TensorRT
PLAN NEURAL NETWORK
developer.nvidia.com/TensorRT
7
TENSORRT INFERENCE RUNTIME High-performance deep learning inference for production deployment
developer.nvidia.com/TensorRT
0
1
2
3
4
5
6
7
8
1 8 128
CPU-Only Tesla M4 + TensorRT
Up to 16x More Inference Perf/Watt
Batch Sizes
GoogLenet, CPU-only vs Tesla M4 + TensorRT on
Single-socket Haswell E5-2698 [email protected] with HT
Images/
Second/W
att
EMBEDDED
Jetson TX1
AUTOMOTIVE
Drive PX
DATA CENTER
Tesla M4
8
TENSORRT GoogleNet Performance
BATCH=1 M4 TX1 TX1 FP16
TensorRT 3.7 ms 13.9 ms 16.5ms (N=2)
Caffe 15 ms 33 ms n/a
developer.nvidia.com/TensorRT
BATCH=16 M4 TX1 TX1 FP16
TensorRT 39 ms 164 ms 99 ms
Caffe 67 ms 255 ms n/a
Jetson TX1 HALF2 column uses fp16
9
DEEP LEARNING DEMANDS NEW CLASS OF HPC
TRAINING INFERENCING
Data / Users
Scalable Performance
Throughput + Efficiency
Billions of TFLOPS per training run
Years of compute-days on Xeon CPU
GPU turns years to days
Billions of FLOPS per inference
Seconds for response on Xeon CPU
GPU for instant response
10
BAIDU DEEP SPEECH 2
12K Neurons
100M Parameters
2.5x Deep Speech 1 4x Deep Speech 1
15 Exaflops
Super-human Accuracy
10x Deep Speech 1 2 Months on CPU Server | 2 Days on DGX-1
Word Error Rate DS2: 5% | Human: 6% | DS1: 8%
“Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”, 12/2015 | Dataset: LibriSpeech test-clean
11
MODERN AI NEEDS NEW INFERENCE SOLUTION
0 0,5 1 1,5 2 2,5
Network
Network
Deep Speech 2
User Wait Time (seconds)
“Where is the nearest Szechuan restaurant?”
User Experience: From Seconds to Instant Wait Time for Text after Speech is Complete
6 sec CPU
0.1 sec Pascal GPU
Deep Speech 2 inference performance on 16 user server | CPU: 170 ms of estimated compute time required for each 100 ms of speech sample | Pascal GPU: 51 ms of compute required for each 100
ms of speech sample
2.2 sec CPU
12
40x Efficient vs CPU, 8x Efficient vs FPGA
0
50
100
150
200
AlexNet
CPU FPGA 1x M4 (FP32) 1x P4 (INT8)
Images/
Sec/W
att
Maximum Efficiency for Scale-out Servers P4
# of CUDA Cores 2560
Peak Single Precision 5.5 TeraFLOPS
Peak INT8 22 TOPS
Low Precision 4x 8-bit vector dot product
with 32-bit accumulate
Video Engines 1x decode engine, 2x encode engine
GDDR5 Memory 8 GB @ 192 GB/s
Power 50W & 75 W
AlexNet, batch size = 128, CPU: Intel E5-2690v4 using Intel MKL 2017, FPGA is Arria10-115 1x M4/P4 in node, P4 board power at 56W, P4 GPU power at 36W, M4 board power at 57W, M4 GPU power at 39W, Perf/W chart using GPU power
TESLA P4
13
TESLA P40
P40
# of CUDA Cores 3840
Peak Single Precision 12 TeraFLOPS
Peak INT8 47 TOPS
Low Precision 4x 8-bit vector dot product
with 32-bit accumulate
Video Engines 1x decode engine, 2x encode engines
GDDR5 Memory 24 GB @ 346 GB/s
Power 250W
0
20 000
40 000
60 000
80 000
100 000
GoogLeNet AlexNet
8x M40 (FP32) 8x P40 (INT8)
Images/
Sec
4x Boost in Less than One Year
GoogLeNet, AlexNet, batch size = 128, CPU: Dual Socket Intel E5-2697v4
Highest Throughput for Scale-up Servers
14 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
P40/P4 – NEW “INT8” FOR INFERENCE
A0 A1 A2 A3
B0 B1 B2 B3
A0 * B0
A1 * B1
A2 * B2
A3 * B3
4x INT8
4x INT8
INT32
intermediate
INT32
intermediate
INT32
intermediate
INT32
intermediate
INT32 C
INT32
PRODUCT PRECISION INFERENCE TOPS*
M4 FP32 2.2
M40 FP32 7
P100 FP16 21.2
P4 INT8 22
P40 INT8 47
• Integer 8-bit Dot Product with 32-bit accumulate
• New in Pascal, only in P40/P4
*TOPS = Tera-Operations per second, base on boost clocks
15 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
178 480
1 514
4 121
3 200
6 514
0
1 000
2 000
3 000
4 000
5 000
6 000
7 000
E5-2690v414 Core
M4(FP32)
M40(FP32)
P100(FP16)
P4(INT8)
P40(INT8)
Infe
rence Im
age/se
c
All results are measured, based on GoogLenet with batch size 128 Xeon uses MKL 2017 GOLD with FP32, GPU uses TensorRT internal development ver.
P40/P4+TensorRT DELIVER MAX INFERENCE PERFORMANCE
>35x
1,4
12,3 10,6
27,9
91,1
56,3
0
20
40
60
80
100
E5-2690v414 Core
M4(FP32)
M40(FP32)
P100(FP16)
P4(INT8)
P40(INT8)
Infe
rence Im
g/s/
watt
>60x
P40 For Max Inference Throughput P4 For Max Inference Efficiency
16
NVIDIA DEEPSTREAM SDK Delivering Video Analytics at Scale
Inference
Preprocess Hardware Decode
“Boy playing soccer”
Simple, high performance API for analyzing video
Decode H.264, HEVC, MPEG-2, MPEG-4, VP9
CUDA-optimized resize and scale
TensorRT
0
20
40
60
80
100
1x Tesla P4 Server +DeepStream SDK
13x E5-2650 v4 Servers
Concurr
ent
Vid
eo S
tream
s
Concurrent Video Streams Analyzed
720p30 decode | IntelCaffe using dual socket E5-2650 v4 CPU servers, Intel MKL 2017 Based on GoogLeNet optimized by Intel: https://github.com/intel/caffe/tree/master/models/mkl2017_googlenet_v2
18
TESLA DEEP LEARNING PLATFORM
TRAINING INFERENCING
DIGITS Training System
Deep Learning Frameworks
Tesla P100
DeepStream SDK
TensorRT
Tesla P40 & P4
19 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
END-TO-END PRODUCT FAMILY
MIXED-APPS HPC
Tesla P100 PCIE
STRONG-SCALE HPC
Tesla P100 SXM2
DL SUPERCOMPUTER
DGX-1
Get going now with fully integrated DL solution
Hyperscale & HPC data centers running apps that
scale to multiple GPUs
HPC data centers running mix of CPU and GPU workloads
HYPERSCALE HPC
Tesla P4, P40
Hyperscale deployment for DL training, inference, video &
image processing
20
Jetson TX1
JETSON TX1
GPU 1 TFLOP/s 256-core Maxwell
CPU 64-bit ARM A57 CPUs
Memory 4 GB LPDDR4 | 25.6 GB/s
Video decode 4K 60Hz
Video encode 4K 30Hz
CSI Up to 6 cameras | 1400 Mpix/s
Display 2x DSI, 1x eDP 1.4, 1x DP 1.2/HDMI
Wifi 802.11 2x2 ac
Networking 1 Gigabit Ethernet
PCIE Gen 2 1x1 + 1x4
Storage 16 GB eMMC, SDIO, SATA
Other 3x UART, 3x SPI, 4x I2C, 4x I2S, GPIOs
Антон Джораев, [email protected]