May 2015
NVIDIA GPU TECHNOLOGY
UPDATE
Axel Koehler
Senior Solutions Architect, NVIDIA
NVIDIA: The VISUAL Computing Company
Platforms: PC, Data Center, Mobile
Markets: Gaming, Design, Enterprise Virtualization, HPC & Cloud Service Providers, Autonomous Machines
Tesla Accelerated Computing Platform
Tesla GPU Accelerators for 2015

Tesla K40 – Best Single GPU Performance
- Server, workstation, liquid cooled
- Higher ed, data analytics, HPC labs, defense
- Double precision workloads

Tesla K80 – Maximize Throughput within a Server
- Server
- Seismic, data analytics, HPC labs, defense
- Multi-GPU accelerated apps
- Single and double precision workloads
Tesla K40 / K80

                            K40                        K80
GPU                         GK110B                     2x GK210
Peak SP (board @ base)      4.29 TFLOPS                ~5.6 TFLOPS
Peak DP (per board)         1.43 TFLOPS (base)         ~1.87 TFLOPS (base)
                            1.68 TFLOPS (boost)        ~2.7 TFLOPS (boost)
# of GPUs                   1                          2
CUDA cores/board            2880                       4992
PCIe gen                    Gen 3                      Gen 3
GDDR5 memory (per board)    12 GB                      24 GB
Memory bandwidth            288 GB/s                   ~480 GB/s
GPU Boost                   2 levels                   >10 levels
Power                       235 W                      300 W
Form factors                PCIe active, PCIe passive  PCIe passive
Figure: Average GPU power in watts for real applications on K20X (AMBER, ANSYS, Black-Scholes, Chroma, GROMACS, GTC, LAMMPS, LSMS, NAMD, Nbody, QMCPACK, RTM, SPECFEM3D); every application draws well below the board's rated power.
GPU Boost on Tesla K40
Within the 235 W board power limit, K40 offers selectable clocks:
- Base clock 745 MHz: worst-case reference app (workload #1)
- Boost clock #1 810 MHz: e.g. AMBER (workload #2)
- Boost clock #2 875 MHz: e.g. ANSYS Fluent (workload #3)

Dynamic GPU Boost on K80
The GPU clock scales from zero at idle through the base clock (1.87 TFLOPS DP @ 560 MHz) up to the 875 MHz boost clock, giving 40-50% more flops with boost. Most CUDA apps run at boost clocks; DGEMM-heavy apps run at base clocks.
GPU Roadmap
Pascal GPU Features NVLink and Stacked Memory

NVLink high-speed GPU interconnect:
- 5x PCIe bandwidth
- Moves data at CPU memory speed
- 3x lower energy per bit

3D stacked memory:
- 4x higher bandwidth (~1 TB/s)
- 3x larger capacity
- 4x more energy efficient per bit

Unified Memory dramatically lowers developer effort: without it, the developer manages separate system and GPU memories; with it, both appear as a single unified memory space.
NVLink and Unified Memory
Enable data transfer at the speed of CPU memory; move data where it is needed, fast.

Accelerated communication:
- GPUDirect RDMA: fast access to other nodes; eliminates CPU latency; 2x app performance
- NVLink: eliminates the CPU bottleneck; 5x faster than PCIe; fast access to system memory
- GPUDirect P2P: multi-GPU scaling; fast GPU communication and fast GPU memory access
Improving GPUDirect RDMA (SC14 talk at the Mellanox booth)

GPUDirect RDMA:
- CPU prepares and queues communication tasks on the HCA
- CPU synchronizes with GPU tasks
- HCA directly accesses GPU memory

GPUDirect Async:
- CPU prepares and queues communication tasks on the GPU
- GPU triggers communication on the HCA
- HCA directly accesses GPU memory

http://on-demand.gputechconf.com/gtc/2015/presentation/S5412-Davide-Rossetti.pdf
Developer Platform With an Open Ecosystem
Accelerate applications across multiple CPU architectures (e.g. x86) via three approaches: GPU-accelerated libraries (e.g. AmgX, cuBLAS), compiler directives, and programming languages.
Drop-in Acceleration with GPU Libraries
Speedups out of the box: AmgX, cuFFT, NPP, cuBLAS, cuRAND, cuSPARSE, math library.

Linear performance scaling with the XT libraries:
- cuBLAS-XT: machine learning, O&G, material science, defense, supercomputing
- cuFFT-XT: O&G, molecular dynamics, defense
- AmgX: CFD, supercomputing, O&G reservoir simulation
CUDA 7 – New Features
• C++11 feature support
  – auto, lambdas, std::initializer_list, variadic templates, static_assert, constexpr, rvalue references, range-based for loops
• Runtime Compilation (RTC)
• cuSOLVER library
  – Routines for solving sparse and dense linear systems and eigenvalue problems
  – Three APIs: Dense, Sparse, Refactorization
• Thrust improvements
  – Device-side Thrust, API support for CUDA streams, performance improvements
• Hyper-Q/MPI (MPS): multiple GPUs per node
CUDA 7: Supported C++11 Features
C++11 language features enabled, including:
- auto
- lambdas
- std::initializer_list
- variadic templates
- static_assert
- constexpr
- rvalue references
- range-based for loops
- …
Not supported: thread_local and the standard libraries (std::thread, etc.)
CUDA 7.0 Runtime Compilation
The application passes CUDA kernel source (e.g. __global__ foo(..) { .. }) to the runtime compilation library (libnvrtc), which compiles it at run time; the application then launches the compiled kernel.
- Compiled kernels can be cached on disk
- Runtime C++ code specialization: optimize code based on run-time data
- Reduces compile time and compiled code size
- Enables runtime code generation and C++ template-based DSLs
cuSOLVER
- cusolverDN: dense Cholesky, LU, SVD, QR (optimization, computer vision, CFD)
- cusolverSP: sparse direct solvers & eigensolvers (Newton's method, chemical kinetics)
- cusolverRF: sparse refactorization solver (chemistry, ODEs, circuit simulation)

Figure: cusolverSP speedup over CPU on the matrices mhd4800b, ex33, Muu and gyro_m, and cusolverDN speedup over CPU for SPOTRF/DPOTRF/CPOTRF/ZPOTRF with M=N=4096. Setup: cuSOLVER 7.0 vs. MKL 11.0.4 and SuiteSparse 3.6.0; K40 vs. i7-3930K CPU @ 3.20 GHz.
CUDA 7: Hyper-Q/MPI (MPS): Multiple GPUs per Node
Four MPI ranks, each with its own CUDA context, share two GPUs; the MPS server efficiently overlaps work from multiple ranks onto each GPU. A wrapper script pins each local rank to a GPU and the matching CPU socket:

lrank=$OMPI_COMM_WORLD_LOCAL_RANK
case ${lrank} in
[0]) export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=0 ./executable;;
[1]) export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 ./executable;;
[2]) export CUDA_VISIBLE_DEVICES=0; numactl --cpunodebind=0 ./executable;;
[3]) export CUDA_VISIBLE_DEVICES=1; numactl --cpunodebind=1 ./executable;;
esac
OpenACC Timeline
- 2008 – PGI Accelerator Model (targeting NVIDIA GPUs)
- 2011 – OpenACC 1.0 (targeting NVIDIA GPUs, AMD GPUs): data regions, compute regions, gang/worker/vector
- 2013 – OpenACC 2.0: procedures, dynamic data lifetimes
- 2015 – OpenACC 2.5: minor fixes, additions
- 2015/16 – OpenACC 3.0: deep copy

http://on-demand.gputechconf.com/gtc/2015/presentation/S5382-Michael-Wolfe.pdf
http://on-demand.gputechconf.com/gtc/2015/video/S5382.html
Vision: Mainstream Parallel Programming
• Enable more programmers to write parallel software
• Give programmers the choice of language to use
• Embrace and evolve key programming standards
http://on-demand.gputechconf.com/gtc/2015/presentation/S5820-Mark-Harris.pdf
http://on-demand.gputechconf.com/gtc/2015/video/S5820B.html
Mixed Precision Computation
• Half precision (fp16) data type in addition to single (fp32) and double (fp64)
• fp16: half the bandwidth, twice the throughput
• Format: s1e5m10 (1 sign bit, 5 exponent bits, 10 stored mantissa bits)
• Magnitude range ≈ 6*10^-8 … 6*10^4, as it includes denormals
• Limitations
  – Limited precision: 11-bit mantissa
  – Vector operations only: a 32-bit register holds 2 fp16 values

FP16 Support in CUDA
Deep Learning using Deep Neural Networks
Figure: a deep neural network classifying an input image (label "Sara").
Today's largest networks: ~10 layers, 1B parameters, 10M images, ~30 exaflops, ~30 GPU-days
http://devblogs.nvidia.com/parallelforall/accelerate-machine-learning-cudnn-deep-neural-network-library
NVIDIA cuDNN Library
- Low-level library of GPU-accelerated routines
- Out-of-the-box speedup of neural networks
- Developed and maintained by NVIDIA
- First release focused on convolutional neural networks
- Already part of major open-source frameworks: Caffe, Torch, Theano
https://developer.nvidia.com/cuDNN
DIGITS – Interactive Deep Learning GPU Training System
For data scientists & researchers:
- Quickly design the best deep neural network (DNN) for your data
- Visually monitor DNN training quality in real time
- Manage training of many DNNs in parallel on multi-GPU systems
developer.nvidia.com/digits
Use Cases
Image classification, object detection & localization; face recognition; speech & natural language processing; medical imaging & interpretation; seismic imaging & interpretation; recommendation
NVIDIA DRIVE PX Auto-Pilot Platform
- Complex scenes require deep-learning-based object identification and classification
- Two Tegra X1 processors
- Up to twelve camera inputs can be processed by one Tegra X1 in real time
Cars that see better … and Learn
US TO BUILD WORLD'S TWO FASTEST SUPERCOMPUTERS: SUMMIT AND SIERRA
A major step forward on the path to exascale, arriving 2017:
- 100-300 PFLOPS peak performance
- IBM POWER CPU + NVIDIA Volta GPU
- NVLink high-speed interconnect
- 40 TFLOPS per node, >3,400 nodes
nvidia.qwiklabs.com
Self-paced hands-on sessions that run on real GPUs in the cloud. Using IPython Notebook technology, lab instructions, code editing and execution, and even interaction with visual tools are all woven together into a single web application.