May 2015
NVIDIA GPU TECHNOLOGY
UPDATE
Axel Koehler
Senior Solutions Architect, NVIDIA
NVIDIA: The VISUAL Computing Company
Platforms: PC, Data Center, Mobile
Markets: Gaming, Design, Enterprise Virtualization, HPC & Cloud Service Providers, Autonomous Machines
Tesla Accelerated Computing Platform
Tesla GPU Accelerators for 2015

Tesla K40 – Best Single GPU Performance
- Server, workstation, liquid cooled
- Higher ed, data analytics, HPC labs, defense
- Double precision workloads

Tesla K80 – Maximize Throughput within a Server
- Server
- Seismic, data analytics, HPC labs, defense
- Multi-GPU accelerated apps
- Single and double precision workloads
Tesla K40 / K80

                            K40                        K80
GPU                         GK110B                     2x GK210
Peak SP (board @ base)      4.29 TFLOPS                ~5.6 TFLOPS
Peak DP (per board)         1.43 TFLOPS (base)         ~1.87 TFLOPS (base)
                            1.68 TFLOPS (boost)        ~2.7 TFLOPS (boost)
# of GPUs                   1                          2
CUDA cores/board            2880                       4992
PCIe gen                    Gen 3                      Gen 3
GDDR5 memory (per board)    12 GB                      24 GB
Memory bandwidth            288 GB/s                   ~480 GB/s
GPU Boost                   2 levels                   >10 levels
Power                       235 W                      300 W
Form factors                PCIe active, PCIe passive  PCIe passive
Figure: Average GPU power in watts for real applications on K20X (AMBER, ANSYS, Black-Scholes, Chroma, GROMACS, GTC, LAMMPS, LSMS, NAMD, Nbody, QMCPACK, RTM, SPECFEM3D); every application draws well below the board's rated power.
GPU Boost on Tesla K40
Within the 235 W board power limit, K40 offers selectable clocks:
- Base clock 745 MHz: worst-case reference app (workload #1)
- Boost clock #1 810 MHz: e.g. AMBER (workload #2)
- Boost clock #2 875 MHz: e.g. ANSYS Fluent (workload #3)

Dynamic GPU Boost on K80
The GPU clock scales from zero at idle through the base clock (1.87 TFLOPS DP @ 560 MHz) up to the 875 MHz boost clock, giving 40-50% more flops with boost. Most CUDA apps run at boost clocks; DGEMM-heavy apps run at base clocks.
GPU Roadmap
Pascal GPU Features NVLink and Stacked Memory

NVLink high-speed GPU interconnect:
- 5x PCIe bandwidth
- Moves data at CPU memory speed
- 3x lower energy per bit

3D stacked memory:
- 4x higher bandwidth (~1 TB/s)
- 3x larger capacity
- 4x more energy efficient per bit

Unified Memory dramatically lowers developer effort: without it, the developer manages separate system and GPU memories; with it, both appear as a single unified memory space.
NVLink and Unified Memory
Enable data transfer at the speed of CPU memory; move data where it is needed, fast.

Accelerated communication:
- GPUDirect RDMA: fast access to other nodes; eliminates CPU latency; 2x app performance
- NVLink: eliminates the CPU bottleneck; 5x faster than PCIe; fast access to system memory
- GPUDirect P2P: multi-GPU scaling; fast GPU communication and fast GPU memory access
Improving GPUDirect RDMA (SC14 talk at the Mellanox booth)

GPUDirect RDMA:
- CPU prepares and queues communication tasks on the HCA
- CPU synchronizes with GPU tasks
- HCA directly accesses GPU memory

GPUDirect Async:
- CPU prepares and queues communication tasks on the GPU
- GPU triggers communication on the HCA
- HCA directly accesses GPU memory

http://on-demand.gputechconf.com/gtc/2015/presentation/S5412-Davide-Rossetti.pdf
Developer Platform With an Open Ecosystem
Accelerate applications across multiple CPU architectures (e.g. x86) via three approaches: GPU-accelerated libraries (e.g. AmgX, cuBLAS), compiler directives, and programming languages.
Drop-in Acceleration with GPU Libraries
Speedups out of the box: AmgX, cuFFT, NPP, cuBLAS, cuRAND, cuSPARSE, math library.

Linear performance scaling with the XT libraries:
- cuBLAS-XT: machine learning, O&G, material science, defense, supercomputing
- cuFFT-XT: O&G, molecular dynamics, defense
- AmgX: CFD, supercomputing, O&G reservoir simulation
CUDA 7 – New Features
• C++11 feature support
  – auto, lambdas, std::initializer_list, variadic templates, static_assert, constexpr, rvalue references, range-based for loops
• Runtime Compilation (RTC)
• cuSOLVER library
  – Routines for solving sparse and dense linear systems and eigenvalue problems
  – Three APIs: Dense, Sparse, Refactorization
• Thrust improvements
  – Device-side Thrust, API support for CUDA streams, performance improvements
• Hyper-Q/MPI (MPS): multiple GPUs per node
CUDA 7: Supported C++11 Features
C++11 language features enabled, including:
- auto
- lambdas
- std::initializer_list
- variadic templates
- static_assert
- constexpr
- rvalue references
- range-based for loops
- …
Not supported: thread_local and the standard libraries (std::thread, etc.)
CUDA 7.0 Runtime Compilation
The application passes CUDA kernel source (e.g. __global__ foo(..) { .. }) to the runtime compilation library (libnvrtc), which compiles it at run time; the application then launches the compiled kernel.
- Compiled kernels can be cached on disk
- Runtime C++ code specialization: optimize code based on run-time data
- Reduces compile time and compiled code size
- Enables runtime code generation and C++ template-based DSLs
cuSOLVER
- cusolverDN: dense Cholesky, LU, SVD, QR (optimization, computer vision, CFD)
- cusolverSP: sparse direct solvers & eigensolvers (Newton's method, chemical kinetics)
- cusolverRF: sparse refactorization solver (chemistry, ODEs, circuit simulation)

Figure: cusolverSP speedup over CPU on the matrices mhd4800b, ex33, Muu and gyro_m, and cusolverDN speedup over CPU for SPOTRF/DPOTRF/CPOTRF/ZPOTRF with M=N=4096. Setup: cuSOLVER 7.0 vs. MKL 11.0.4 and SuiteSparse 3.6.0; K40 vs. i7-3930K CPU @ 3.20 GHz.
CUDA 7: Hyper-Q/MPI (MPS): Multiple GPUs per Node
Four MPI ranks, each with its own CUDA context, share two GPUs; the MPS server efficiently overlaps work from multiple ranks onto each GPU. A wrapper script pins each local rank to a GPU and the matching CPU socket:

lrank=$OMPI_COMM_WORLD_LOCAL_RANK
case ${lrank} in
[0]) export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=0 ./executable;;
[1]) export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 ./executable;;
[2]) export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=0 ./executable;;
[3]) export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 ./executable;;
esac
OpenACC Timeline
- 2008 – PGI Accelerator Model (targeting NVIDIA GPUs)
- 2011 – OpenACC 1.0 (targeting NVIDIA GPUs, AMD GPUs): data regions, compute regions, gang/worker/vector
- 2013 – OpenACC 2.0: procedures, dynamic data lifetimes
- 2015 – OpenACC 2.5: minor fixes, additions
- 2015/16 – OpenACC 3.0: deep copy

http://on-demand.gputechconf.com/gtc/2015/presentation/S5382-Michael-Wolfe.pdf
http://on-demand.gputechconf.com/gtc/2015/video/S5382.html
Vision: Mainstream Parallel Programming
• Enable more programmers to write parallel software
• Give programmers the choice of language to use
• Embrace and evolve key programming standards
http://on-demand.gputechconf.com/gtc/2015/presentation/S5820-Mark-Harris.pdf
http://on-demand.gputechconf.com/gtc/2015/video/S5820B.html
Mixed Precision Computation
• Half precision (fp16) data type in addition to single (fp32) and double (fp64)
• fp16: half the bandwidth, twice the throughput
• Format: s1e5m10 (1 sign bit, 5 exponent bits, 10 stored mantissa bits)
• Magnitude range ≈ 6*10^-8 … 6*10^4, as it includes denormals
• Limitations
  – Limited precision: 11-bit mantissa
  – Vector operations only: a 32-bit register holds 2 fp16 values

FP16 Support in CUDA
Deep Learning using Deep Neural Networks
Figure: a deep neural network classifying an input image (label "Sara").
Today's largest networks: ~10 layers, 1B parameters, 10M images, ~30 exaflops, ~30 GPU-days
http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library
NVIDIA cuDNN Library
- Low-level library of GPU-accelerated routines
- Out-of-the-box speedup of neural networks
- Developed and maintained by NVIDIA
- First release focused on convolutional neural networks
- Already part of major open-source frameworks: Caffe, Torch, Theano
https://developer.nvidia.com/cuDNN
DIGITS – Interactive Deep Learning GPU Training System
For data scientists & researchers:
- Quickly design the best deep neural network (DNN) for your data
- Visually monitor DNN training quality in real time
- Manage training of many DNNs in parallel on multi-GPU systems
developer.nvidia.com/digits
Use Cases
Image classification, object detection & localization; face recognition; speech & natural language processing; medical imaging & interpretation; seismic imaging & interpretation; recommendation
NVIDIA DRIVE PX Auto-Pilot Platform
- Complex scenes require deep-learning-based object identification and classification
- Two Tegra X1 processors
- Up to twelve camera inputs can be processed by one Tegra X1 in real time
Cars that see better … and Learn
US TO BUILD WORLD'S TWO FASTEST SUPERCOMPUTERS: SUMMIT AND SIERRA
A major step forward on the path to exascale, arriving 2017:
- 100-300 PFLOPS peak performance
- IBM POWER CPU + NVIDIA Volta GPU
- NVLink high-speed interconnect
- 40 TFLOPS per node, >3,400 nodes
nvidia.qwiklabs.com
Self-paced hands-on sessions that run on real GPUs in the cloud. Using IPython Notebook technology, lab instructions, code editing and execution, and even interaction with visual tools are all woven together into a single web application.