HPC with the NVIDIA Accelerated Computing Toolkit
Mark Harris, November 16, 2015
2
[Chart: number of accelerated systems on the Top500 list, 2013-2015]
Accelerators Surge in World’s Top Supercomputers
100+ accelerated systems now on the Top500 list
1/3 of total FLOPS powered by accelerators
NVIDIA Tesla GPUs sweep 23 of 24 new accelerated supercomputers
Tesla supercomputers growing at 50% CAGR over the past five years
Top500: # of Accelerated Supercomputers
3 NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
70% OF TOP HPC APPS NOW ACCELERATED
KEY MATERIALS APP NOW ACCELERATED
Typically consumes 10-25% of an HPC system
Intersect360 survey of top apps:
Top 10 HPC Apps: 90% accelerated
Top 50 HPC Apps: 70% accelerated
[Chart: 1 dual-K80 server delivers 1.3x the throughput of 4 dual-CPU servers (1.0x)]
Intersect360, Nov 2015, "HPC Application Support for GPU Computing". Dual-socket Xeon E5-2690 v2 3GHz, dual Tesla K80, FDR InfiniBand. Dataset: NiAl-MD, Blocked Davidson.
4
Accelerated Computing Available Everywhere
Supercomputer | Workstation & Notebook | Datacenter & Cloud | Embedded & Automotive
NVIDIA Accelerated Computing Toolkit
5
NVIDIA Accelerated Computing Toolkit Everything You Need to Accelerate Your Applications
developer.nvidia.com
Libraries: FFT, BLAS, DNN, FFTW
Language Solutions: C/C++, Fortran, OpenACC
Tools: NSight IDE, Visual Profiler
System Interfaces: GPUDirect, Unified Memory
Training: Hands-on Labs, Docs & Samples
6
Get Started Quickly
OpenACC Toolkit
Simple way to accelerate scientific code
developer.nvidia.com/openacc

Machine Learning Tools
Design, train, and deploy deep neural networks
developer.nvidia.com/deep-learning

CUDA Toolkit
High-performance C/C++ applications
developer.nvidia.com/cuda-toolkit
7
Why is Machine Learning Hot Now?
Big Data Availability GPU Acceleration New Techniques
350 million images uploaded per day
2.5 Petabytes of customer data hourly
300 hours of video uploaded every minute
8
cuDNN Deep Learning Primitives
Igniting Artificial Intelligence
Drop-in acceleration for major deep learning frameworks: Caffe, Theano, Torch
Up to 2x faster on AlexNet than baseline Caffe GPU implementation
Stack: Applications → Frameworks → cuDNN → NVIDIA GPU
[Chart: millions of images trained per day with cuDNN 1, cuDNN 2, and cuDNN 3]
[Chart: speedup on AlexNet, OverFeat, and VGG; train up to 2x faster than cuDNN 2]
9
Tesla M40 World’s Fastest Accelerator for Deep Learning
Caffe Benchmark: AlexNet training throughput based on 20 iterations, CPU: E5-2680v3 @ 2.70GHz. 64GB System Memory, CentOS 6.2
[Chart: AlexNet training time in days, Tesla M40 vs. CPU]
Reduce training time from 10 days to 1 day
8x faster Caffe performance
CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250 W
11
NVIDIA® DIGITS™
Data Scientists and Researchers:
Quickly design the best deep neural network (DNN) for your data
Visually monitor DNN training quality in real-time
Manage training of many DNNs in parallel on multi-GPU systems
Interactive Deep Learning GPU Training System
developer.nvidia.com/digits
12
OpenACC Simple | Powerful | Portable
Fueling the Next Wave of Scientific Discoveries in HPC
University of Illinois PowerGrid MRI Reconstruction: 70x speed-up, 2 days of effort
http://www.cray.com/sites/default/files/resources/OpenACC_213462.12_OpenACC_Cosmo_CS_FNL.pdf
http://www.hpcwire.com/off-the-wire/first-round-of-2015-hackathons-gets-underway
http://on-demand.gputechconf.com/gtc/2015/presentation/S5297-Hisashi-Yashiro.pdf
http://www.openacc.org/content/experiences-porting-molecular-dynamics-code-gpus-cray-xk7
RIKEN Japan NICAM Climate Modeling: 7-8x speed-up, 5% of code modified
main()
{
  <serial code>

  #pragma acc kernels  // automatically run on GPU
  {
    <parallel code>
  }
}
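The kernels directive can be made concrete with a minimal SAXPY routine (a hypothetical example for illustration, not from the deck); a compiler without OpenACC support simply ignores the pragma and runs the loop serially on the CPU:

```cpp
// SAXPY (z = a*x + y): the loop is offered to the accelerator via OpenACC.
// Under an OpenACC compiler such as PGI, `#pragma acc kernels` moves the
// loop to the GPU; other compilers ignore the pragma and run it on the CPU.
void saxpy(int n, float a, const float *x, const float *y, float *z) {
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        z[i] = a * x[i] + y[i];
}
```

The point of the directive approach is that this same source compiles unchanged for both CPU and GPU targets.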
[Chart: speedup vs. single CPU thread for miniGhost (Mantevo), NEMO (Climate & Ocean), and CLOVERLEAF (Physics), comparing CPU: MPI + OpenMP, CPU: MPI + OpenACC, and CPU + GPU: MPI + OpenACC]
Performance Portability with PGI
359.miniGhost: CPU: Intel Xeon E5-2698 v3, 2 sockets, 32 cores total; GPU: Tesla K80 (single GPU). NEMO: each socket CPU: Intel Xeon E5-2698 v3, 16 cores; GPU: NVIDIA K80, both GPUs. CLOVERLEAF: CPU: dual-socket Intel Xeon E5-2690 v2, 20 cores total; GPU: Tesla K80, both GPUs.
13
Janus Juul Eriksen, PhD Fellow
qLEAP Center for Theoretical Chemistry, Aarhus University
"OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation."
Lines of Code Modified: <100 lines
# of Weeks Required: 1 week
# of Codes to Maintain: 1 source
Big Performance
[Chart: speedup vs. CPU for Alanine-1 (13 atoms), Alanine-2 (23 atoms), and Alanine-3 (33 atoms)]
Minimal Effort
LS-DALTON CCSD(T) module benchmarked on the Titan supercomputer (AMD CPU vs. Tesla K20X)
LS-DALTON: large-scale application for calculating high-accuracy molecular energies
14
The NVIDIA OpenACC Toolkit
Free toolkit offers a simple & powerful path to accelerated computing
Download at developer.nvidia.com/openacc
PGI Compiler: free OpenACC compiler for academia
NVProf Profiler: easily find where to add compiler directives
Code Samples: learn from examples of real-world algorithms
Documentation: quick start guide, best practices, forums
15
CUDA Toolkit Flexibility – High Performance
Libraries: drop-in acceleration
Parallel C++: parallel algorithms and custom parallel C++
Developer Tools: pinpoint bugs and performance opportunities
Download at developer.nvidia.com/cuda-toolkit
16
GPU Accelerated Libraries: Drop-in Acceleration for Your Applications
Linear Algebra & Numerical: cuBLAS, cuSPARSE, cuSOLVER, AmgX
Signal Processing, FFT, Image & Video: cuFFT, NVIDIA NPP, NVENC Video Encode
Parallel Algorithms & Math: NVIDIA Math Lib, cuRAND
Machine Learning: cuDNN
GPU AI: Path Finding
developer.nvidia.com/gpu-accelerated-libraries
17
Developer Tools Pinpoint Performance Bottlenecks and Hard-to-Find Bugs
NVIDIA Visual Profiler: analyze GPU performance
CUDA-GDB: standard Linux debugger
CUDA-MEMCHECK: find memory errors, leaks, and race conditions
Develop, Debug, & Profile in Eclipse or Visual Studio
18
Your Next Steps: Many Learning Opportunities
Free hands-on self-paced labs in the NVIDIA booth; sign up for credit for online labs
Free in-depth online courses:
Deep Learning: developer.nvidia.com/deep-learning-courses
OpenACC: developer.nvidia.com/openacc-course
Learn from hundreds of experts at GTC 2016 (April 4-7, 2016, Silicon Valley)
All about Accelerated Computing: developer.nvidia.com
19
Backup
20
// generate 32M random numbers on host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to device (GPU)
thrust::device_vector<int> d_vec = h_vec;

// sort data on device
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
THRUST: Productive Parallel C++
Resembles C++ STL
Included in CUDA Toolkit
Productive High-level API
CPU/GPU Performance portability
CUDA, OpenMP, and TBB backends
Flexible, extensible
High-level parallel algorithms
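Because Thrust deliberately mirrors the STL, the sort example above has a direct host-only analogue in standard C++ (a comparison sketch, no GPU involved); swapping std::vector for thrust::device_vector and std:: algorithms for thrust:: ones is essentially the whole port:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Host-only STL counterpart of the Thrust example: generate, then sort.
// In the Thrust version, copying a host_vector into a device_vector moves
// the data to the GPU, and thrust::sort runs there instead of on the CPU.
std::vector<int> generate_and_sort(std::size_t n) {
    std::vector<int> v(n);
    std::generate(v.begin(), v.end(), std::rand);  // like thrust::generate
    std::sort(v.begin(), v.end());                 // like thrust::sort
    return v;
}
```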
21
C++11 in CUDA
Count the number of occurrences of letters x, y, z and w in text
“C++11 Feels like a new language” – Bjarne Stroustrup
__global__ void xyzw_frequency(int *count, char *text, int n)
{
  const char letters[] { 'x', 'y', 'z', 'w' };
  count_if(count, text, n, [&](char c) {
    for (const auto x : letters)
      if (c == x) return true;
    return false;
  });
}
Output:
Read 3288846 bytes from "warandpeace.txt"
counted 107310 instances of 'x', 'y', 'z', or 'w' in "warandpeace.txt"
Lambda Function
Initializer List
Range-based For Loop
Automatic Type Deduction
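The same four features work identically in host code; a CPU-only sketch of the kernel's logic, with std::count_if standing in for the CUDA count_if on the slide:

```cpp
#include <algorithm>
#include <string>

// Host-only version of xyzw_frequency, exercising the same C++11 features
// as the CUDA kernel: initializer list, lambda, range-based for, and auto.
int xyzw_frequency(const std::string &text) {
    const char letters[]{'x', 'y', 'z', 'w'};      // initializer list
    return static_cast<int>(std::count_if(text.begin(), text.end(),
        [&](char c) {                              // lambda function
            for (const auto x : letters)           // range-for + auto
                if (c == x) return true;
            return false;
        }));
}
```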
22
Parallel STL: Algorithms + Policies Higher-level abstractions for ease of use and performance portability
int main() {
  size_t n = 1 << 16;
  std::vector<float> x(n, 1), y(n, 2), z(n);
  float a = 13;
  auto I = interval(0, n);
  std::for_each(std::par, std::begin(I), std::end(I),
                [&](int i) { z[i] = a * x[i] + y[i]; });
  return 0;
}
On track for the C++17 Standard
Includes algorithms such as for_each, reduce, transform, inclusive_scan, …
Execution policies: std::seq, std::par, std::par_vec
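The slide's pattern, sketched without an execution policy so it compiles with any C++11 toolchain (`interval` on the slide is a non-standard helper, replaced here by an explicit index vector; C++17 later standardized the policies as std::execution::seq, par, and par_unseq):

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// SAXPY expressed as for_each over an index range, as on the slide.
// With C++17, passing std::execution::par as the first argument to
// std::for_each requests a parallel run of the same lambda.
std::vector<float> saxpy(float a, const std::vector<float> &x,
                         const std::vector<float> &y) {
    std::vector<float> z(x.size());
    std::vector<int> idx(x.size());
    std::iota(idx.begin(), idx.end(), 0);  // stand-in for interval(0, n)
    std::for_each(idx.begin(), idx.end(),
                  [&](int i) { z[i] = a * x[i] + y[i]; });
    return z;
}
```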
23
Instruction-Level Profiling
New in CUDA 7.5 Visual Profiler
Highlights hot spots in code, helping you target bottlenecks
Quickly locate expensive code lines
Detailed per-line stall analysis helps you understand underlying limiters
Pinpoint Performance Problems
24
Unified Memory: Dramatically Lower Developer Effort
[Diagram: past developer view with separate System Memory and GPU Memory vs. the developer view with Unified Memory as a single memory space]
25
Simplified Memory Management Code
CPU Code:
void sortfile(FILE *fp, int N) {
  char *data;
  data = (char *)malloc(N);
  fread(data, 1, N, fp);
  qsort(data, N, 1, compare);
  use_data(data);
  free(data);
}

CUDA 6 Code with Unified Memory:
void sortfile(FILE *fp, int N) {
  char *data;
  cudaMallocManaged(&data, N);
  fread(data, 1, N, fp);
  qsort<<<...>>>(data, N, 1, compare);
  cudaDeviceSynchronize();
  use_data(data);
  cudaFree(data);
}
26
Great Performance with Unified Memory
RAJA uses Unified Memory for heterogeneous array allocations
Parallel forall loops run on CPU or GPU
"Excellent performance considering this is a 'generic' version of LULESH with no architecture-specific tuning." - Jeff Keasler, LLNL
RAJA: Portable Parallel C++ Framework
[Chart: LULESH throughput in million elements per second at mesh sizes 45^3, 100^3, and 150^3; GPU speedups of 1.5x, 1.9x, and 2.0x over CPU]
GPU: NVIDIA Tesla K40; CPU: Intel Haswell E5-2650 v3 @ 2.30GHz, single socket, 10 cores