May 2015 NVIDIA GPU TECHNOLOGY UPDATE Axel Koehler Senior Solutions Architect, NVIDIA

NVIDIA GPU TECHNOLOGY UPDATE - Max Planck Society…




NVIDIA: The VISUAL Computing Company

• PC, DATA CENTER, MOBILE
• GAMING, DESIGN, ENTERPRISE VIRTUALIZATION, HPC & CLOUD SERVICE PROVIDERS, AUTONOMOUS MACHINES


Tesla Accelerated Computing Platform


Tesla GPU Accelerators for 2015

Tesla K40: Best Single GPU Performance
• Server, Workstation, Liquid Cooled
• Higher Ed, Data Analytics, HPC Labs, Defense
• Double Precision Workloads

Tesla K80: Maximize Throughput within a Server
• Server
• Seismic, Data Analytics, HPC Labs, Defense
• Multi-GPU Accelerated Apps
• Single and Double Precision Workloads


Tesla K40 / K80

                                  K40                      K80
GPU                               GK110B                   GK210
Peak SP (board @ base clock)      4.29 TFLOPS              ~5.6 TFLOPS (Base)
Peak DP (per board)               1.43 TFLOPS              ~1.87 TFLOPS (Base)
                                  1.68 TFLOPS (Boost)      ~2.7 TFLOPS (Boost)
# of GPUs                         1                        2
# of CUDA Cores/board             2880                     4992
PCIe Gen                          Gen 3                    Gen 3
GDDR5 Memory Size (per board)     12 GB                    24 GB
Memory Bandwidth                  288 GB/s                 ~480 GB/s
GPUBoost                          2 Levels                 >10 levels
Power                             235W                     300W
Form Factors                      PCIe Active,             PCIe Passive
                                  PCIe Passive


Avg GPU Power in Watts for Real Applications on K20X

[Figure: bar chart of average board power (Watts, axis 0 to 180) for AMBER, ANSYS, Black Scholes, Chroma, GROMACS, GTC, LAMMPS, LSMS, NAMD, Nbody, QMCPACK, RTM, SPECFEM3D]


GPU Boost

GPU Boost K40 (each workload runs within the 235W power limit)
• Base Clock, 745 MHz: Workload #1, worst case reference app
• Boost Clock #1, 810 MHz: Workload #2, e.g. AMBER
• Boost Clock #2, 875 MHz: Workload #3, e.g. ANSYS Fluent

Dynamic GPU Boost K80
• GPU clock levels: Zero Idle, Base (560 MHz, 1.87 Teraflops DP), Boost (875 MHz)
• 40-50% more flops with Boost
• Most CUDA Apps Run At Boost Clocks
• DGEMM Heavy Apps Run at Base Clocks
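The peak figures quoted for the K40 and K80 can be approximately reproduced from core counts and clocks. The sketch below assumes one fused multiply-add (2 floating-point operations) per unit per cycle and one double-precision unit per three CUDA cores on GK110B/GK210; neither assumption appears on the slides, and `peak_tflops` is an illustrative helper name.

```python
def peak_tflops(units, clock_mhz, flops_per_cycle=2):
    """Theoretical peak in TFLOPS: units x flops per cycle x clock."""
    return units * flops_per_cycle * clock_mhz * 1e6 / 1e12


# Board-level CUDA core counts from the K40 / K80 table.
K40_CORES = 2880
K80_CORES = 4992
# Assumption (not from the slides): one DP unit per three SP cores.
SP_CORES_PER_DP_UNIT = 3

if __name__ == "__main__":
    k40_dp_units = K40_CORES // SP_CORES_PER_DP_UNIT
    k80_dp_units = K80_CORES // SP_CORES_PER_DP_UNIT
    print("K40 SP @ 745 MHz: %.2f TFLOPS" % peak_tflops(K40_CORES, 745))
    print("K40 DP @ 745 MHz: %.2f TFLOPS" % peak_tflops(k40_dp_units, 745))
    print("K40 DP @ 875 MHz: %.2f TFLOPS" % peak_tflops(k40_dp_units, 875))
    print("K80 SP @ 560 MHz: %.2f TFLOPS" % peak_tflops(K80_CORES, 560))
    print("K80 DP @ 560 MHz: %.2f TFLOPS" % peak_tflops(k80_dp_units, 560))
```

Under these assumptions the K40 values and the K80 base-clock values land within 0.01 TFLOPS of the table entries. The K80 boost figure is left out because the boost level reached depends on the workload.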


GPU Roadmap


Pascal GPU Features NVLINK and Stacked Memory

NVLINK: GPU high speed interconnect
• 5x PCIe bandwidth
• Move data at CPU memory speed
• 3x lower energy/bit

3D Stacked Memory
• 4x Higher Bandwidth (~1 TB/s)
• 3x Larger Capacity
• 4x More Energy Efficient per bit
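To get a feel for what "5x PCIe bandwidth" means for a transfer, the following back-of-the-envelope sketch may help. It assumes a theoretical PCIe Gen 3 x16 rate of about 15.75 GB/s in one direction and takes the K40's 288 GB/s as the baseline for the stacked-memory comparison; both baselines are assumptions, since the slide does not name them.

```python
# Assumption: theoretical PCIe Gen 3 x16 bandwidth, one direction, in GB/s.
PCIE_GEN3_X16_GBS = 15.75
# Slide claim: NVLink offers 5x PCIe bandwidth.
NVLINK_GBS = 5 * PCIE_GEN3_X16_GBS
# Assumption: K40 memory bandwidth as the baseline for "4x Higher Bandwidth".
K40_MEM_GBS = 288
STACKED_MEM_GBS = 4 * K40_MEM_GBS


def transfer_seconds(size_gb, bandwidth_gbs):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return size_gb / bandwidth_gbs


if __name__ == "__main__":
    size_gb = 12  # e.g. the full memory of one K40
    print("PCIe:   %.2f s" % transfer_seconds(size_gb, PCIE_GEN3_X16_GBS))
    print("NVLink: %.2f s" % transfer_seconds(size_gb, NVLINK_GBS))
    print("Stacked memory bandwidth: %d GB/s" % STACKED_MEM_GBS)
```

With the K40 baseline, 4 x 288 GB/s gives 1152 GB/s, which is consistent with the "~1 TB/s" on the slide.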


Unified Memory: Dramatically Lower Developer Effort

• Developer View Without Unified Memory: System Memory, GPU Memory
• Developer View With Unified Memory: Unified Memory


NVLink and Unified Memory

Enable Data Transfer At Speed of CPU Memory


Move Data where it is Needed Fast

Accelerated Communication

GPU Direct RDMA
• Fast Access to other Nodes
• Eliminate CPU Latency
• Eliminate CPU Bottleneck

NVLINK
• 2x App Performance
• 5x Faster Than PCIe
• Fast Access to System Memory

GPU Direct P2P
• Multi-GPU Scaling
• Fast GPU Communication
• Fast GPU Memory Access


Improving GPUDirect RDMA

SC14 TALK AT MELLANOX BOOTH

GPUDIRECT RDMA (components: GPU, CPU, IOH, HCA)
• CPU prepares and queues communication tasks on HCA
• CPU synchronizes with GPU tasks
• HCA directly accesses GPU memory

GPUDIRECT ASYNC (components: GPU, CPU, IOH, HCA)
• CPU prepares and queues communication tasks on GPU
• GPU triggers communication on HCA
• HCA directly accesses GPU memory

http://on-demand.gputechconf.com/gtc/2015/presentation/S5412-Davide-Rossetti.pdf


Developer Platform With Open Ecosystem

Accelerate Applications Across Multiple CPUs

• CPU: x86
• Libraries: AmgX, cuBLAS
• Programming Languages
• Compiler Directives


Drop-in Acceleration with GPU Libraries

Speedups out of the box: AmgX, cuFFT, NPP, cuBLAS, cuRAND, cuSPARSE, MATH

Linear Performance Scaling with XT libraries
• cuBLAS-XT: Machine learning, O&G, Material Science, Defense, Supercomputing
• cuFFT-XT: O&G, Molecular Dynamics, Defense
• AmgX: CFD, Supercomputing, O&G Reservoir Sim


CUDA 7 – New Features

• C++11 feature support
– auto, lambda, std::initializer_list, variadic templates, static_assert, constexpr, rvalue references, range-based for loops
• Runtime Compilation (RTC)
• cuSOLVER library
– Routines for solving sparse and dense linear systems and eigenvalue problems
– Three APIs: Dense, Sparse, Refactorization
• Thrust improvements
– Device-side Thrust, API support for CUDA streams, performance
• HyperQ/MPI (MPS): Multiple GPUs per Node


CUDA7: Supported C++11 Features

C++11 language features enabled, including:
• auto
• lambda
• std::initializer_list
• variadic templates
• static_assert
• constexpr
• rvalue references
• range-based for loops

Not supported:
• thread_local
• Standard libraries: std::thread, etc.


CUDA 7.0 Runtime Compilation

Compile CUDA kernel source at run time
• Compiled kernels can be cached on disk

Runtime C++ Code Specialization
• Optimize code based on run-time data
• Reduce compile time and compiled code size
• Enables runtime code generation, C++ template-based DSLs

[Diagram: the Application passes kernel source (__global__ foo(..) { .. }) to the Runtime Compilation Library (libnvrtc), receives a Compiled Kernel, and launches foo()]
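The two ideas on this slide, specializing kernel source with run-time data and caching compiled kernels on disk, can be sketched without a GPU. In the sketch below the `scale` kernel, the names `specialize` and `cached_compile`, the `.ptx` cache layout, and the `compile_fn` callback are all hypothetical. In a real program `compile_fn` would wrap the NVRTC calls (`nvrtcCreateProgram`, `nvrtcCompileProgram`, `nvrtcGetPTX`), and here a stand-in is used so the caching logic can be run anywhere.

```python
import hashlib
import os
import tempfile
from string import Template

# Illustrative kernel source; $factor is filled in from run-time data.
KERNEL_TEMPLATE = Template("""
extern "C" __global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= ${factor}f;
}
""")


def specialize(factor):
    """Bake a run-time value into the source as a compile-time constant."""
    return KERNEL_TEMPLATE.substitute(factor=repr(float(factor)))


def cached_compile(source, options, compile_fn, cache_dir):
    """Return compiled bytes, compiling only if no cached copy exists on disk."""
    key_material = source + "\0" + " ".join(options)
    key = hashlib.sha256(key_material.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".ptx")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    binary = compile_fn(source, options)
    with open(path, "wb") as f:
        f.write(binary)
    return binary


if __name__ == "__main__":
    # Stand-in compiler: returns the source bytes instead of real PTX.
    fake_compile = lambda src, opts: src.encode("utf-8")
    with tempfile.TemporaryDirectory() as cache:
        src = specialize(2)
        cached_compile(src, ["-arch=compute_35"], fake_compile, cache)
        cached_compile(src, ["-arch=compute_35"], fake_compile, cache)
        print("cache entries:", len(os.listdir(cache)))
```

Keying the cache on both the specialized source and the compile options means a kernel specialized for a different run-time value, or built for a different architecture, gets its own cache entry.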


cuSOLVER

• cusolverDN: Dense Cholesky, LU, SVD, QR
– Optimization, Computer vision, CFD
• cusolverSP: Sparse direct solvers & Eigensolvers
– Newton’s method, Chemical kinetics
• cusolverRF: Sparse refactorization solver
– Chemistry, ODEs, Circuit simulation

[Figure: cusolverSP Speedup over CPU (axis 0x to 20x) for mhd4800b, ex33, Muu, gyro_m]

[Figure: cusolverDN Speedup over CPU (axis 0x to 8x) for SPOTRF, DPOTRF, CPOTRF, ZPOTRF; M=N=4096]

Test configuration: cuSOLVER 7.0, MKL 11.0.4, SuiteSparse 3.6.0; K40, i7-3930K CPU @ 3.20GHz


CUDA7: HyperQ/MPI (MPS): Multiple GPUs per Node

[Diagram: four CUDA MPI ranks (Rank 0 to Rank 3) submit work through the MPS Server to GPU 0 and GPU 1]

MPS Server efficiently overlaps work from multiple ranks to each GPU

#!/bin/bash
# Bind each local Open MPI rank to one GPU and the matching NUMA node.
lrank=$OMPI_COMM_WORLD_LOCAL_RANK
case ${lrank} in
  0) export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=0 ./executable;;
  1) export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 ./executable;;
  2) export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=0 ./executable;;
  3) export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 ./executable;;
esac
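The case statement above amounts to a round-robin mapping: local rank modulo the number of GPUs selects both the device and the NUMA node. A minimal Python sketch of the same mapping follows. It assumes, as the script does, that GPU i is attached to NUMA node i; `binding_for` is an illustrative name.

```python
import os


def binding_for(local_rank, num_gpus=2):
    """Map a local MPI rank to a GPU and the same-numbered NUMA node."""
    gpu = local_rank % num_gpus
    env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
    # Assumption: GPU i sits on NUMA node i, as in the shell script.
    cmd = ["numactl", "--cpunodebind=%d" % gpu, "./executable"]
    return env, cmd


if __name__ == "__main__":
    lrank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    env, cmd = binding_for(lrank)
    print(env, " ".join(cmd))
```

Unlike the four-way case statement, the modulo form keeps working if more ranks per node are started, with the extra ranks sharing GPUs through the MPS Server.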


Vision: Mainstream Parallel Programming

• Enable more programmers to write parallel software

• Give programmers the choice of language to use

• Embrace and evolve key programming standards



Mixed Precision Computation


• Half precision (fp16) data type in addition to single (fp32), double (fp64)

• fp16: half the bandwidth, twice the throughput

• Format: s1e5m10

• Range (magnitude) ~6*10^-8 … 6*10^4, as it includes denormals

• Limitations

– Limited precision: 11-bit mantissa

– Vector operations only: 32-bit register holds 2 fp16 values
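The range and precision limits of the s1e5m10 format can be checked on the CPU. The sketch below uses the half-precision format code `"e"` of Python's standard `struct` module; the helper names are illustrative, and it demonstrates only the number format, not GPU fp16 arithmetic.

```python
import struct


def half_bits_to_float(bits):
    """Interpret a 16-bit pattern as a half-precision (s1e5m10) value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def round_to_half(x):
    """Round a Python float to the nearest fp16 value and widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# Exponent field 0, mantissa 1: the smallest positive denormal, 2**-24.
smallest_denormal = half_bits_to_float(0x0001)
# Exponent field 30, mantissa all ones: the largest finite value.
largest_finite = half_bits_to_float(0x7BFF)
# Two fp16 values occupy 4 bytes, i.e. one 32-bit register.
packed_pair = struct.pack("<ee", 1.5, -2.0)

if __name__ == "__main__":
    print("smallest denormal:", smallest_denormal)
    print("largest finite:   ", largest_finite)
    print("bytes for 2 values:", len(packed_pair))
    # Above 2048 the spacing between fp16 values is 2, so 2049 is not exact.
    print("2049 stored as:", round_to_half(2049.0))
```

The smallest denormal is 2^-24 (about 6*10^-8) and the largest finite value is 65504 (about 6*10^4), matching the range on the slide.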



FP16 Support in CUDA


Deep Learning using Deep Neural Networks

Image “Sara”

Today’s Largest Networks: ~10 layers, 1B parameters, 10M images, ~30 Exaflops, ~30 GPU days

http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library

NVIDIA cuDNN Library

Low-level Library of GPU-accelerated routines

Out-of-the-box speedup of Neural Networks

Developed and maintained by NVIDIA

First release focused on Convolutional Neural Networks

Already part of major open-source frameworks

Caffe, Torch, Theano

https://developer.nvidia.com/cuDNN



DIGITS: Interactive Deep Learning GPU Training System

Data Scientists & Researchers:
• Quickly design the best deep neural network (DNN) for your data
• Visually monitor DNN training quality in real-time
• Manage training of many DNNs in parallel on multi-GPU systems

developer.nvidia.com/digits


Use Cases

• Image Classification, Object Detection, Localization
• Face Recognition
• Speech & Natural Language Processing
• Medical Imaging & Interpretation
• Seismic Imaging & Interpretation
• Recommendation


NVIDIA DRIVE PX Auto-Pilot Platform

• Complex scenes require Deep Learning-based object identification and classification
• Two Tegra X1 processors
• Up to twelve camera inputs can be processed by one Tegra X1 in real-time


Cars that see better … and Learn


US TO BUILD WORLD’S TWO FASTEST SUPERCOMPUTERS

SUMMIT and SIERRA (2017)

• Major Step Forward on the Path to Exascale
• 100-300 PFLOPS Peak Performance
• IBM POWER CPU + NVIDIA Volta GPU
• NVLink High Speed Interconnect
• 40 TFLOPS per Node, >3,400 Nodes


nvidia.qwiklabs.com: Self-paced hands-on sessions that run on real GPUs in the cloud

Using IPython Notebook technology, lab instructions, editing and execution of code, and even interaction with visual tools are all woven together into a single web application
