Page 1

Tesla GPU Computing: A Revolution in High Performance Computing

Mark Harris, NVIDIA Corporation

© NVIDIA Corporation 2009

Page 2

Agenda

CUDA Review
  Architecture
  Programming Model
  Memory Model
  CUDA C

CUDA General Optimizations

Fermi
  Next Generation Architecture

Getting Started
  Resources

Page 3

CUDA ARCHITECTURE

Page 4

CUDA Parallel Computing Architecture

Parallel computing architecture and programming model

Includes a CUDA C compiler, support for OpenCL and DirectCompute

Architected to natively support multiple computational interfaces (standard languages and APIs)

Page 5

CUDA Parallel Computing Architecture

CUDA defines:
  Programming model
  Memory model
  Execution model

CUDA uses the GPU, but is for general-purpose computing
  Facilitates heterogeneous computing: CPU + GPU

CUDA is scalable
  Scales to run on 100s of cores / 1000s of parallel threads

Page 6

CUDA: PROGRAMMING MODEL

Page 7

CUDA Kernels

Parallel portion of application: execute as a kernel
  Entire GPU executes kernel, many threads

CUDA threads:
  Lightweight
  Fast switching
  1000s execute simultaneously

CPU (Host): executes functions

GPU (Device): executes kernels

Page 8

CUDA Kernels: Parallel Threads

A kernel is a function executed on the GPU

Array of threads, in parallel

All threads execute the same code, can take different paths

Each thread has an ID
  Select input/output data
  Control decisions

float x = input[threadID];
float y = func(x);
output[threadID] = y;

Page 9

CUDA Kernels: Subdivide into Blocks

Page 10

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks

Page 11

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks
Blocks are grouped into a grid

Page 12

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads

Page 13

CUDA Kernels: Subdivide into Blocks

Threads are grouped into blocks
Blocks are grouped into a grid
A kernel is executed as a grid of blocks of threads

[Figure: a grid of thread blocks mapped onto the GPU]
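To make the hierarchy concrete, here is a minimal sketch (not from the slides; fill2D and its launch configuration are illustrative) of a kernel run as a 2D grid of 2D blocks, with each thread deriving its coordinates from the built-in block and thread indices:

// Each thread computes its global (x, y) position from built-in indices.
__global__ void fill2D(float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)              // guard threads past the edge
        out[y * width + x] = (float)(x + y);
}

// Launch as a grid of 16x16-thread blocks covering the domain:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// fill2D<<<grid, block>>>(d_out, width, height);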

Page 14

Communication Within a Block

Threads may need to cooperate
  Memory accesses
  Share results

Cooperate using shared memory
  Accessible by all threads within a block

Restriction to “within a block” permits scalability
  Fast communication between N threads is not feasible when N is large
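As one illustration of block-level cooperation (a sketch assuming 256-thread blocks, not taken from the slides), threads can stage data in shared memory, synchronize, and combine partial results:

// Each 256-thread block cooperatively sums its 256 input elements.
__global__ void blockSum(const float* in, float* out)
{
    __shared__ float s[256];                 // visible to all threads in the block
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * 256 + tid];     // each thread loads one element
    __syncthreads();                         // wait for all loads

    for (int stride = 128; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];       // pairwise combine
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = s[0];              // one result per block
}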

Page 15

Transparent Scalability – G84

[Figure: a grid of 12 blocks (1-12) executes two blocks at a time: blocks 1-2, then 3-4, ..., then 11-12]

Page 16

Transparent Scalability – G80

[Figure: the same 12-block grid executes eight blocks at a time on G80: blocks 1-8, then 9-12]

Page 17

Transparent Scalability – GT200

[Figure: the same 12-block grid executes all 12 blocks at once on GT200, with remaining multiprocessors idle]

Page 18

CUDA Programming Model - Summary

A kernel executes as a grid of thread blocks

A block is a batch of threads
  Communicate through shared memory

Each block has a block ID

Each thread has a thread ID

[Figure: the host launches Kernel 1 as a 1D grid of four blocks (0-3) and Kernel 2 as a 2D grid of 2x4 blocks ((0,0)-(1,3)) on the device]

Page 19

CUDA: MEMORY MODEL

Page 20

Memory hierarchy

Thread: Registers

Page 21

Memory hierarchy

Thread: Registers

Thread: Local memory

Page 22

Memory hierarchy

Thread: Registers

Thread: Local memory

Block of threads: Shared memory

Page 23

Memory hierarchy

Thread: Registers

Thread: Local memory

Block of threads: Shared memory

Page 24

Memory hierarchy

Thread: Registers

Thread: Local memory

Block of threads: Shared memory

All blocks: Global memory

Page 25

Memory hierarchy

Thread: Registers

Thread: Local memory

Block of threads: Shared memory

All blocks: Global memory
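In CUDA C these levels map onto declarations roughly as follows (a hedged sketch; hierarchyDemo and its names are illustrative, not from the slides):

__device__ float gResult[256];      // global memory: visible to all blocks

__global__ void hierarchyDemo(const float* gIn)   // gIn also points to global memory
{
    int tid = threadIdx.x;          // scalar: typically held in a register
    float perThread[8];             // per-thread array: may be placed in local memory
    __shared__ float tile[256];     // shared memory: one copy per block

    tile[tid] = gIn[blockIdx.x * blockDim.x + tid];   // global -> shared
    __syncthreads();
    for (int i = 0; i < 8; ++i)
        perThread[i] = tile[tid] * (float)i;
    gResult[tid] = perThread[7];    // back to global memory
}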

Page 26

CUDA: PROGRAMMING ENVIRONMENT

Page 27

CUDA APIs

API allows the host to manage the devices
  Allocate memory & transfer data
  Launch kernels

CUDA C “Runtime” API
  High level of abstraction - start here!

CUDA C “Driver” API
  More control, more verbose

OpenCL
  Similar to CUDA C Driver API
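For a feel of the Runtime API, a minimal host-only program (illustrative, not from the slides) that enumerates CUDA devices:

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);              // how many CUDA devices?
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);   // query device capabilities
        printf("Device %d: %s, %d multiprocessors\n",
               i, prop.name, prop.multiProcessorCount);
    }
    return 0;
}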

Page 28

CUDA C and OpenCL

Shared back-end compiler and optimization technology

OpenCL is the entry point for developers who want a low-level API; CUDA C is the entry point for developers who prefer high-level C

Page 29

Visual Studio

Separate file types
  .c/.cpp for host code
  .cu for device/mixed code

Compilation rules: cuda.rules
Syntax highlighting
Intellisense

Integrated debugger and profiler: Nexus

Page 30

NVIDIA Nexus IDE

The industry’s first IDE for massively parallel applications

Accelerates co-processing (CPU + GPU) application development

Complete Visual Studio-integrated development environment

Page 31

NVIDIA Nexus IDE - Debugging

Page 32

NVIDIA Nexus IDE - Profiling

Page 33

Linux

Separate file types
  .c/.cpp for host code
  .cu for device/mixed code

Typically makefile driven

cuda-gdb for debugging

CUDA Visual Profiler

Page 34

CUDA C

Page 35

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

Device Code

Page 36

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

Host Code

Page 37

CUDA: Memory Management

Explicit memory allocation returns pointers to GPU memory
  cudaMalloc()
  cudaFree()

Explicit memory copy for host ↔ device, device ↔ device
  cudaMemcpy()
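A typical round trip looks like this (a sketch assuming h_buf is a host array of N floats):

float* d_buf = 0;
cudaMalloc((void**)&d_buf, N * sizeof(float));                        // allocate GPU memory
cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);  // host -> device
// ... launch kernels that read and write d_buf ...
cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_buf);                                                      // release GPU memory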

Page 38

Example: Vector Addition Kernel

// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Run N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

Page 39

Example: Host code for vecAdd

// allocate and initialize host (CPU) memory
float *h_A = …, *h_B = …;

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
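The slide stops at the kernel launch; a hedged completion of the same example (assuming a host array h_C for the result) would copy the sum back and release the device buffers:

// copy the result back to the host, then free device memory
cudaMemcpy( h_C, d_C, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree( d_A );
cudaFree( d_B );
cudaFree( d_C );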

Page 40

CUDA: Minimal extensions to C/C++

Declaration specifiers to indicate where things live

__global__ void KernelFunc(...);  // kernel callable from host
__device__ void DeviceFunc(...);  // function callable on device
__device__ int GlobalVar;         // variable in device memory
__shared__ int SharedVar;         // in per-block shared memory

Extend function invocation syntax for parallel kernel launch

KernelFunc<<<500, 128>>>(...);    // 500 blocks, 128 threads each

Special variables for thread identification in kernels

dim3 threadIdx;  dim3 blockIdx;  dim3 blockDim;  dim3 gridDim;

Intrinsics that expose specific operations in kernel code

__syncthreads();                  // barrier synchronization

Page 41

Synchronization of blocks

Threads within a block may synchronize with barriers

… Step 1 …
__syncthreads();
… Step 2 …

Blocks coordinate via atomic memory operations
  e.g., increment shared queue pointer with atomicInc()

Implicit barrier between dependent kernels

vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);
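A sketch of the atomicInc() idiom mentioned above (enqueue and its arguments are illustrative, not from the slides):

// One thread per block claims the next slot in a global queue.
__global__ void enqueue(int* queue, unsigned int* tail)
{
    if (threadIdx.x == 0) {
        // atomicInc returns the old value, wrapping at the given limit
        unsigned int slot = atomicInc(tail, 0xFFFFFFFFu);
        queue[slot] = blockIdx.x;
    }
}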

Page 42

Using per-block shared memory

Variables shared across block

__shared__ int *begin, *end;

Scratchpad memory

__shared__ int scratch[blocksize];
scratch[threadIdx.x] = begin[threadIdx.x];
// … compute on scratch values …
begin[threadIdx.x] = scratch[threadIdx.x];

Communicating values between threads

scratch[threadIdx.x] = begin[threadIdx.x];
__syncthreads();
int left = scratch[threadIdx.x - 1];  // note: thread 0 reads out of bounds here; real code needs a boundary check


Page 43

CUDA: GPU ARCHITECTURE

Page 44

10-Series Architecture

240 Scalar Processor (SP) cores execute kernel threads

30 Streaming Multiprocessors (SMs) each contain
  8 scalar processors
  2 Special Function Units (SFUs)
  1 double precision unit

Shared memory enables thread cooperation

[Figure: a multiprocessor with its scalar processors, shared memory, and double precision unit]

Page 45

Execution Model (Software → Hardware)

Thread → Scalar Processor
  Threads are executed by scalar processors

Thread Block → Multiprocessor
  Thread blocks are executed on multiprocessors
  Thread blocks do not migrate
  Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)

Grid → Device
  A kernel is launched as a grid of thread blocks
  Only one kernel can execute on a device at one time

Page 46

Warps and Half Warps

A thread block consists of 32-thread warps

A warp is executed physically in parallel (SIMD) on a multiprocessor

A half-warp of 16 threads can coordinate global memory accesses into a single transaction

[Figure: a thread block on a multiprocessor, divided into 32-thread warps; each warp splits into two 16-thread half-warps that access device memory (DRAM: global and local)]
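To see why the half-warp matters, compare two access patterns (a hedged sketch, not from the slides): consecutive addresses let the hardware combine a half-warp's loads into one transaction, while a strided pattern cannot be combined:

__global__ void copyCoalesced(float* dst, const float* src)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    dst[i] = src[i];   // consecutive threads, consecutive addresses: one transaction per half-warp
}

__global__ void copyStrided(float* dst, const float* src, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    dst[i] = src[i];   // stride > 1 scatters the half-warp: many transactions
}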

Page 47

Memory Architecture

[Figure: the host (CPU, chipset, DRAM) connects to device DRAM, which holds global, constant, texture, and local memory; on the GPU, each multiprocessor has registers and shared memory, plus constant and texture caches]

Page 48

CUDA: OPTIMIZATION GUIDELINES

Page 49

Optimize Algorithms for the GPU

Maximize independent parallelism

Maximize arithmetic intensity (math/bandwidth)

Sometimes it’s better to recompute than to cache
  GPU spends its transistors on ALUs, not memory

Do more computation on the GPU to avoid costly data transfers

Even low parallelism computations can sometimes be faster than transferring back and forth to host

Page 50

Optimize Memory Access

Coalesced vs. non-coalesced access = an order of magnitude difference
  Global/local device memory

Optimize for spatial locality in cached texture memory

In shared memory, avoid high-degree bank conflicts

Page 51

Take Advantage of Shared Memory

Hundreds of times faster than global memory

Threads can cooperate via shared memory

Use one / a few threads to load / compute data shared by all threads

Use it to avoid non-coalesced access
  Stage loads and stores in shared memory to re-order non-coalesceable addressing
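A classic instance of this staging pattern is a tiled matrix transpose (a sketch assuming dimensions that are multiples of 16): both the read and the write are coalesced, and the reordering happens on-chip:

__global__ void transpose16(float* out, const float* in, int width, int height)
{
    __shared__ float tile[16][17];       // padded to 17 columns to avoid bank conflicts

    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read
    __syncthreads();

    x = blockIdx.y * 16 + threadIdx.x;   // transposed block origin
    y = blockIdx.x * 16 + threadIdx.y;
    out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}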

Page 52

Use Parallelism Efficiently

Partition computation to keep the GPU multiprocessors equally busy

Many threads, many thread blocks

Keep resource usage low enough to support multiple active thread blocks per multiprocessor
  Registers, shared memory

Page 53

Fermi: NEXT GENERATION ARCHITECTURE

Page 54

Many-Core High Performance Computing

Each core has:
  Floating point & integer unit
  Logic unit
  Move, compare unit
  Branch unit

Cores managed by thread manager
  Spawn and manage over 30,000 threads
  Zero-overhead thread switching

10-series GPU has 240 cores

1.4 billion transistors

1 Teraflop of processing power

Page 55

Introducing the Fermi Architecture

3 billion transistors

512 cores

DP performance 50% of SP

ECC

L1 and L2 Caches

GDDR5 Memory

Up to 1 Terabyte of GPU Memory

Concurrent Kernels, C++

Page 56

Fermi SM Architecture

32 CUDA cores per SM (512 total)

Double precision at 50% of single precision throughput (8x over GT200)

Dual Thread Scheduler

64 KB of RAM for shared memory and L1 cache (configurable)
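The split can be chosen per kernel through the runtime API (a hedged sketch; myKernel is illustrative):

// Prefer 48 KB shared memory / 16 KB L1 for kernels that stage data on-chip:
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
// Or favor L1 for kernels dominated by scattered loads:
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);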

Page 57

CUDA Core Architecture

New IEEE 754-2008 floating-point standard, surpassing even the most advanced CPUs

Fused multiply-add (FMA) instruction for both single and double precision

Newly designed integer ALU optimized for 64-bit and extended precision operations

Page 58

Cached Memory Hierarchy

First GPU architecture to support a true cache hierarchy in combination with on-chip shared memory

L1 Cache per SM (per 32 cores)
  Improves bandwidth and reduces latency

Unified L2 Cache (768 KB)
  Fast, coherent data sharing across all cores in the GPU (Parallel DataCache™)

[Figure: Fermi cached memory hierarchy]

Page 59

Larger, Faster Memory Interface

GDDR5 memory interface
  2x the speed of GDDR3

Up to 1 Terabyte of memory attached to GPU

Operate on large data sets

Page 60

ECC

ECC protection for:
  DRAM: ECC supported for GDDR5 memory
  All major internal memories: register file, shared memory, L1 cache, L2 cache

Detect 2-bit errors, correct 1-bit errors (per word)

Page 61

GigaThread™ Hardware Thread Scheduler

Hierarchically manages thousands of simultaneously active threads

10x faster application context switching

Concurrent kernel execution

Page 62

GigaThread™ Hardware Thread Scheduler: Concurrent Kernel Execution + Faster Context Switching

[Figure: serial kernel execution on earlier GPUs vs. parallel (concurrent) kernel execution on Fermi]
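From CUDA C, concurrency is expressed with streams (a hedged sketch; kernelA and kernelB are illustrative): independent kernels launched into different streams may run concurrently on Fermi, where earlier GPUs would serialize them.

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

kernelA<<<gridA, blockA, 0, s1>>>(d_a);   // may execute concurrently...
kernelB<<<gridB, blockB, 0, s2>>>(d_b);   // ...with this one, on Fermi

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);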

Page 63

GigaThread Streaming Data Transfer Engine

Dual DMA engines
  Simultaneous CPU→GPU and GPU→CPU data transfer
  Fully overlapped with CPU and GPU processing time

[Figure: activity snapshot of transfers overlapped with CPU and GPU processing]
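Overlap is driven from the host with asynchronous copies, which require page-locked (pinned) host memory (a hedged sketch; process is illustrative, d_in is a device buffer, and s1 is a stream created as above):

float* h_in;
cudaMallocHost((void**)&h_in, N * sizeof(float));  // pinned host buffer

cudaMemcpyAsync(d_in, h_in, N * sizeof(float),
                cudaMemcpyHostToDevice, s1);       // DMA engine performs the copy
process<<<N/256, 256, 0, s1>>>(d_in);              // kernel queued behind it in s1
// ... the CPU is free to do other work here ...
cudaStreamSynchronize(s1);
cudaFreeHost(h_in);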

Page 64

Enhanced Software Support

Full C++ support
  Virtual functions
  Try/catch hardware support

System call support
  Support for pipes, semaphores, printf, etc.

Unified 64-bit memory addressing

Page 65

Tesla GPU Computing Products

SuperMicro 1U GPU SuperServer: 2 Tesla GPUs; single precision 1.87 Teraflops; double precision 156 Gigaflops; memory 8 GB (4 GB / GPU)

Tesla S1070 1U System: 4 Tesla GPUs; single precision 4.14 Teraflops; double precision 346 Gigaflops; memory 16 GB (4 GB / GPU)

Tesla C1060 Computing Board: 1 Tesla GPU; single precision 933 Gigaflops; double precision 78 Gigaflops; memory 4 GB

Tesla Personal Supercomputer: 4 Tesla GPUs; single precision 3.7 Teraflops; double precision 312 Gigaflops; memory 16 GB (4 GB / GPU)

Page 66

Tesla GPU Computing Products: Fermi

Tesla S2050 1U System: 4 Tesla GPUs; double precision 2.1 – 2.5 Teraflops; memory 12 GB (3 GB / GPU)

Tesla S2070 1U System: 4 Tesla GPUs; double precision 2.1 – 2.5 Teraflops; memory 24 GB (6 GB / GPU)

Tesla C2050 Computing Board: 1 Tesla GPU; double precision 520 – 630 Gigaflops; memory 3 GB

Tesla C2070 Computing Board: 1 Tesla GPU; double precision 520 – 630 Gigaflops; memory 6 GB

Page 67

Tesla GPU Computing

Questions?

Page 68

Getting Started: RESOURCES

Page 69

Getting Started

CUDA Zone www.nvidia.com/cuda

Introductory tutorials/webinars

Forums

Documentation
  Programming Guide
  Best Practices Guide

Examples
  CUDA SDK

Page 70

Libraries

NVIDIA
  CUBLAS: dense linear algebra (subset of the full BLAS suite)
  CUFFT: 1D/2D/3D real and complex FFTs

Third party
  NAG numeric libraries, e.g. RNGs
  CULAPACK/MAGMA

Open Source
  Thrust: STL/Boost-style template library
  CUDPP: data-parallel primitives (e.g. scan, sort, and reduction)
  CUSP: sparse linear algebra and graph computation

Many more...