19-May-2016

Frédéric Parienté, Business Development Manager

DEEP LEARNING UPDATE

#GTC16 TESLA ANNOUNCEMENTS

NVGRAPH

CUDA 8 & UNIFIED MEMORY

NVIDIA DGX-1

NVIDIA SDK

TESLA P100

Deep Learning Achieves “Superhuman” Results

A NEW COMPUTING MODEL

Traditional Computer Vision: Experts + Time

Deep Learning Object Detection: DNN + Data + HPC

ImageNet

STATE OF DEEP LEARNING SYSTEMS

100k deep learning systems sold in 2015

[Charts: DL systems sold by application and by geography]

TOP 10 applications: Machine Learning Algorithms, Image Recognition, Object Recognition, Big Data, Natural Language Processing, Action Recognition, Medical, Other, Facial Recognition, Speech Recognition

MICROSOFT: “SUPER DEEP NETWORKS”

Microsoft Deep ResNet

http://arxiv.org/pdf/1512.03385v1.pdf

18 LAYERS: 1.8 GF

152 LAYERS: 11.3 GF

Revolution of Depth: >6X MORE FLOPS

BAIDU: DL DEVELOPERS NEED HPC

“Investments in computer systems — and I think the bleeding-edge of AI, and deep learning specifically, is shifting to HPC (high performance computing) — can cut down the time to run an experiment, and therefore go around the circle, from a week to a day and sometimes even faster.”

“Those of us that grew up doing machine learning often didn’t grow up with an HPC or computer systems background … partnerships between machine learning researchers and computer systems researchers tend to help both teams drive a lot of machine learning progress.”

— Andrew Ng

NVIDIA DGX-1: WORLD’S FIRST DEEP LEARNING SUPERCOMPUTER

170 TFLOPS FP16

8x Tesla P100 16GB

NVLink Hybrid Cube Mesh

Accelerates Major AI Frameworks

Dual Xeon

7 TB SSD Deep Learning Cache

Dual 10GbE, Quad IB 100Gb

3RU – 3200W

NVIDIA DGX-1 DEEP LEARNING SYSTEM

BENEFITS FOR AI RESEARCHERS

Design Big Networks

Reduce Training Times

Fastest DL Supercomputer

DL SDK with Ongoing Updates: cuDNN, NCCL, cuBLAS, cuSPARSE, cuFFT

NVIDIA DGX-1 SOFTWARE STACK: Optimized for Deep Learning Performance

Accelerated Deep Learning: cuDNN, NCCL, cuSPARSE, cuBLAS, cuFFT

Container-Based Applications: DIGITS, DL Frameworks, GPU Apps

NVIDIA Cloud Management

INTRODUCING TESLA P100: New GPU Architecture to Enable the World’s Fastest Compute Node

Pascal Architecture: Highest Compute Performance

NVLink: GPU Interconnect for Maximum Scalability

CoWoS HBM2: Unifying Compute & Memory in a Single Package

Page Migration Engine: Simple Parallel Programming with Virtually Unlimited Memory (Unified Memory)
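The Page Migration Engine is what makes the "virtually unlimited memory" claim concrete: with CUDA Unified Memory, a single managed allocation is visible to both the CPU and the GPU, and pages migrate on demand as either side touches them. Below is a minimal sketch assuming CUDA 8; the kernel, names and sizes are illustrative and not taken from the slides.

// Minimal CUDA Unified Memory sketch (CUDA 8 / Pascal assumed).
// Names and sizes are illustrative, not from the slides.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // pages migrate to the GPU on first touch
}

int main() {
    size_t n = 1 << 28;             // size is illustrative; on Pascal the allocation may exceed GPU memory
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // one pointer shared by CPU and GPU

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;  // initialized on the CPU

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();        // wait, then pages migrate back on CPU access

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}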

GIANT LEAPS IN EVERYTHING

NVLINK

PAGE MIGRATION ENGINE

PASCAL ARCHITECTURE

CoWoS HBM2 Stacked Mem

[Charts comparing K40, M40 and P100: teraflops (FP32/FP16), bidirectional GPU-GPU bandwidth (GB/s), memory bandwidth (GB/s) and addressable memory (GB)]

21 Teraflops of FP16 for Deep Learning

5x GPU-GPU Bandwidth

3x Higher for Massive Data Workloads

Virtually Unlimited Memory Space

HUGE JUMP IN PERFORMANCE

                      DUAL XEON    8X TESLA M40    DGX-1 (8X TESLA P100)
FLOPS (CPU + GPU)     3 TF         58 TF           170 TF
PROC-PROC BW          25 GB/s      64 GB/s         640 GB/s
ALEXNET TRAIN TIME    150 HOURS    9 HOURS         2 HOURS
# NODES FOR 3HR TAT   250          4               1
PERFORMANCE           1X           63X             250X

HIGHEST ABSOLUTE PERFORMANCE DELIVERED: NVLink for Max Scalability, More than 45x Faster with 8x P100

[Chart: speed-up vs dual-socket Haswell CPU (0x to 50x) for Caffe/AlexNet, VASP, HOOMD-Blue, COSMO, MILC, Amber and HACC, comparing 2x K80 (M40 for AlexNet), 2x P100, 4x P100 and 8x P100]

DATACENTER IN A RACK: 1 Rack of Tesla P100 Delivers Performance of 6,000 CPUs

[Chart: one rack of 12 nodes with 8 P100s per node (96 P100s / 38 kW) vs racks of 36 dual-CPU nodes]

QUANTUM PHYSICS (MILC): 638 CPUs / 186 kW

WEATHER (COSMO): 650 CPUs / 190 kW

DEEP LEARNING (CAFFE/ALEXNET): 6,000 CPUs / 1.8 MW

MOLECULAR DYNAMICS (AMBER): 2,900 CPUs / 850 kW

TESLA P100 ACCELERATOR

Compute: 5.3 TF DP ∙ 10.6 TF SP ∙ 21.2 TF HP

Memory: HBM2, 720 GB/s ∙ 16 GB

Interconnect: NVLink (up to 8-way) + PCIe Gen3

Programmability: Page Migration Engine, Unified Memory

Availability: DGX-1 order now; Atos, Cray, Dell, HP, IBM systems in Q1 2017

END-TO-END PRODUCT FAMILY

HYPERSCALE HPC (Tesla M4, M40): Hyperscale deployment for DL training, inference, video & image processing

MIXED-APPS HPC (Tesla K80): HPC data centers running a mix of CPU and GPU workloads

STRONG-SCALING HPC (Tesla P100): Hyperscale & HPC data centers running apps that scale to multiple GPUs

FULLY INTEGRATED DL SUPERCOMPUTER (DGX-1): For customers who need to get going now with a fully integrated solution

NVIDIA DEEP LEARNING SDK: High-Performance GPU Acceleration for Deep Learning

APPLICATIONS
COMPUTER VISION: Object Detection, Image Classification
SPEECH AND AUDIO: Voice Recognition, Translation
BEHAVIOR: Recommendation Engines, Sentiment Analysis

FRAMEWORKS
Mocha.jl and other DL frameworks

DEEP LEARNING SDK
DEEP LEARNING: cuDNN
MATH LIBRARIES: cuBLAS, cuSPARSE, cuFFT
MULTI-GPU: NCCL
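The SDK's math-libraries layer (cuBLAS, cuSPARSE, cuFFT) supplies the dense and sparse linear algebra that frameworks build on; a fully connected layer, for example, reduces to a GEMM call. Below is a minimal sketch of such a call with cuBLAS; the matrix sizes and variable names are illustrative assumptions, not taken from the slides, and error checking is omitted.

// Minimal cuBLAS sketch: the dense GEMM at the heart of a fully connected layer.
// Matrix sizes and names are illustrative, not taken from the slides.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int m = 1024, n = 256, k = 4096;   // output features, batch, input features
    float *A, *B, *C;                        // weights (m x k), activations (k x n), output (m x n)
    cudaMalloc(&A, sizeof(float) * m * k);
    cudaMalloc(&B, sizeof(float) * k * n);
    cudaMalloc(&C, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C, column-major as cuBLAS expects
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, A, m, B, k, &beta, C, m);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

Compile with nvcc and link against cuBLAS (-lcublas); frameworks such as Caffe and Torch issue similar GEMM calls through cuBLAS for their fully connected layers.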

NVIDIA CUDNN

Building blocks for accelerating deep neural networks on GPUs

High performance deep neural network training

Accelerates Deep Learning: Caffe, CNTK, TensorFlow, Theano, Torch

Performance continues to improve over time

“NVIDIA has improved the speed of cuDNN with each release while extending the interface to more operations and devices at the same time.”

— Evan Shelhamer, Lead Caffe Developer, UC Berkeley

developer.nvidia.com/cudnn

[Chart: AlexNet training speed-up vs CPU, 2014 to 2016: K40 (cuDNN v1), M40 (cuDNN v3), Pascal (cuDNN v5)]

AlexNet training throughput based on 20 iterations; CPU: 1x E5-2680v3, 12 cores, 2.5 GHz.
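cuDNN exposes these building blocks through a descriptor-based C API: describe the tensors and the operation, then call the corresponding forward or backward routine. The sketch below applies a ReLU activation forward pass using the cuDNN v5-style API; the tensor dimensions are illustrative assumptions and error checking is omitted.

// Minimal cuDNN (v5-style API) sketch: a ReLU activation forward pass.
// Tensor dimensions are illustrative; error checking omitted for brevity.
#include <cudnn.h>
#include <cuda_runtime.h>

int main() {
    const int n = 32, c = 64, h = 56, w = 56;    // batch, channels, height, width
    float *x, *y;
    cudaMalloc(&x, sizeof(float) * n * c * h * w);
    cudaMalloc(&y, sizeof(float) * n * c * h * w);

    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t desc;                // same layout for input and output
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU, CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, x, &beta, desc, y);  // y = relu(x)

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
    cudaFree(x); cudaFree(y);
    return 0;
}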

WHAT’S NEW IN CUDNN 5?

LSTM recurrent neural networks deliver up to 6x speedup in Torch

Improved performance:

• Deep Neural Networks with 3x3 convolutions, like VGG, GoogLeNet and ResNets

• 3D Convolutions

• FP16 routines on Pascal GPUs

Pascal GPU, RNNs, Improved Performance

[Chart: cuDNN 5 RNN layer speedups: 5.9x for char-rnn, 2.8x for DeepSpeech 2]

Performance relative to torch-rnn (https://github.com/jcjohnson/torch-rnn)

DeepSpeech 2: http://arxiv.org/abs/1512.02595 ∙ char-rnn: https://github.com/karpathy/char-rnn
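The FP16 routines called out above rely on half-precision storage (and, on Pascal, native half-precision arithmetic) to roughly double effective memory bandwidth. Below is a minimal CUDA sketch that stores data as __half while accumulating in FP32, so it also runs on pre-Pascal GPUs; the names and sizes are illustrative assumptions, not from the slides.

// Minimal FP16 sketch: store tensors as __half to halve memory traffic,
// converting to float for the arithmetic so it also runs on pre-Pascal GPUs.
// (Pascal additionally offers native half arithmetic such as __hadd/__hmul.)
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void axpy_half(const __half *x, __half *y, float a, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = __half2float(x[i]);          // load in FP16, widen to FP32
        float yi = __half2float(y[i]);
        y[i] = __float2half(a * xi + yi);       // compute in FP32, store back as FP16
    }
}

int main() {
    const size_t n = 1 << 20;
    __half *x, *y;
    cudaMalloc(&x, n * sizeof(__half));
    cudaMalloc(&y, n * sizeof(__half));
    cudaMemset(x, 0, n * sizeof(__half));       // 0x0000 is +0.0 in FP16
    cudaMemset(y, 0, n * sizeof(__half));

    axpy_half<<<(n + 255) / 256, 256>>>(x, y, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(x); cudaFree(y);
    return 0;
}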

DEEP LEARNING & ARTIFICIAL INTELLIGENCE

Sep 28-29, 2016 | Amsterdam

www.gputechconf.eu #GTC16

SELF-DRIVING CARS

VIRTUAL REALITY & AUGMENTED REALITY

SUPERCOMPUTING & HPC

GTC Europe is a two-day conference designed to showcase the innovative ways developers, businesses and academics are using parallel computing to transform our world.

EUROPE’S BRIGHTEST MINDS & BEST IDEAS

2 Days | 800 Attendees | 50+ Exhibitors | 50+ Speakers | 15+ Tracks | 15+ Workshops | 1-to-1 Meetings