Day1 02a Programming Overview

8/3/2019 Day1 02a Programming Overview


CUDA Programming Model

Gernot Ziegler, NVIDIA UK (material by Gregory Ruetsch)



NVIDIA Confidential

Programming in C for CUDA

C for CUDA = C + a few simple extensions

As a C developer, it is easy to start writing parallel programs

Three key abstractions:

1. parallel threads on device (GPU)

2. manage corresponding memory spaces

3. corresponding synchronization

Host: Device management API

Additionally, Runtime API & nvcc: use language extensions even for host code!



Basics

Set up GPU for computation

GPU device and memory management

GPU kernel launches (execution configuration)

Some specifics of GPU/device code

Some additional features:

Vector types

Asynchronous execution

CUDA error handling

CUDA Events

Note: only the basic features are covered

Programming Guide and Reference Manual contain more information



Device Management

First task: CPU will query and select GPU devices

cudaGetDeviceCount( int* count )

cudaSetDevice( int device )

cudaGetDevice( int *current_device )

cudaGetDeviceProperties( cudaDeviceProp* prop, int device )

cudaChooseDevice( int *device, cudaDeviceProp* prop )

Multi-GPU setup:

device 0 is used by default, careful with combination of GFX card and Tesla!

(usually, one CPU thread controls one GPU each, but driver API allows more)
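The query-and-select calls listed on this slide can be combined as in the following minimal sketch. The selection policy (pick the device with the most multiprocessors) is only an illustrative assumption, not something the slides prescribe.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("no usable CUDA device (%s)\n", cudaGetErrorString(err));
        return 1;
    }

    // Illustrative policy: prefer the device with the most multiprocessors.
    int best = 0, bestSMs = 0;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, %d multiprocessors, %zu MB global memory\n",
               dev, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem >> 20);
        if (prop.multiProcessorCount > bestSMs) {
            bestSMs = prop.multiProcessorCount;
            best = dev;
        }
    }
    cudaSetDevice(best);   // without this call, device 0 would be used

    int current = -1;
    cudaGetDevice(&current);
    printf("using device %d\n", current);
    return 0;
}
```

On a machine that mixes a display card and a Tesla board, printing the names first is a cheap way to check which index refers to which card.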



Managing Memory

Host/CPU also manages device/GPU memory:

Allocate & free memory

Copy data to and from device's global memory (GPU DRAM, e.g. 4 GB on Tesla)

cudaMalloc(void **pointer, size_t nbytes)

cudaMemset(void *pointer, int value, size_t count)

cudaFree(void *pointer)

Host and device have separate memory spaces!



Example: Managing memory (no data transfer)

int n = 1024;
int nbytes = n*sizeof(int);
int *d_a = 0;

cudaMalloc( (void**)&d_a, nbytes );   // allocate n ints in device memory
cudaMemset( d_a, 0, nbytes);          // set all bytes to zero
cudaFree(d_a);                        // release the device memory



CUDA: Runtime support

Explicit memory allocation returns pointers to GPU memory

cudaMalloc(), cudaFree()

Explicit memory copy for host <-> device, device <-> device

cudaMemcpy(), cudaMemcpy2D(), ...

Texture management

cudaBindTexture(), cudaBindTextureToArray(), ...

OpenGL & DirectX interoperability

cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), ...



Example: Host code's memory management

// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*)malloc(numBytes);

// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);

// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

// execute the kernel on GPU: [ NEXT SLIDE ]
gpu_func (params)

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(d_A);



Kernel creation

How to...

gpu_func (params)

write a kernel!

First, re-cap on the CUDA architecture...



Device code: Thread bundles

Kernel = device code call

A kernel is executed by a grid of thread blocks

A thread block is a batch of threads that can cooperate through shared memory

Threads from different blocks cannot cooperate

[Figure: the host launches Kernel 1 and Kernel 2; on the device each kernel runs as a grid (Grid 1, Grid 2) of thread blocks, Block (0, 0) to Block (2, 1). Block (1, 1) is shown expanded into its threads, Thread (0, 0) to Thread (4, 2).]


Blocks must be independent

"Threads from different blocks cannot cooperate"

Why?

Any possible interleaving of blocks should be valid

presumed to run to completion without pre-emption

can run in any order

can run concurrently OR sequentially (GPU scaling)

Blocks may coordinate but not synchronize

shared queue pointer: OK

shared lock: BAD, can easily deadlock

So: Independence requirement gives scalability for different GPU sizes.
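The "shared queue pointer: OK" case can be sketched as follows. Blocks claim chunks of work through an atomic counter in global memory and never wait for each other, so any block ordering is valid. The kernel name, the chunk layout and the launch configuration are illustrative assumptions, and atomicAdd on global memory needs compute capability 1.1 or higher.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__device__ unsigned int nextChunk = 0;   // shared queue pointer (global memory)

__global__ void processQueue(float *data, unsigned int numChunks, int chunkSize)
{
    __shared__ unsigned int claimed;
    for (;;) {
        if (threadIdx.x == 0)
            claimed = atomicAdd(&nextChunk, 1u);   // one claim per block
        __syncthreads();
        unsigned int chunk = claimed;
        __syncthreads();          // all threads have read before the next claim
        if (chunk >= numChunks)
            break;                // same decision for every thread of the block
        for (int i = threadIdx.x; i < chunkSize; i += blockDim.x)
            data[chunk * chunkSize + i] *= 2.0f;
    }
}

int main()
{
    const unsigned int numChunks = 32;
    const int chunkSize = 256;
    const int n = numChunks * chunkSize;
    float *h = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d = 0;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    processQueue<<<4, 64>>>(d, numChunks, chunkSize);   // far fewer blocks than chunks
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h[i] != 2.0f) ++bad;
    printf("%d elements were not doubled\n", bad);
    cudaFree(d);
    free(h);
    return 0;
}
```

A lock shared between blocks would differ in one important way: a block spinning on the lock could wait for a block that has not been scheduled yet, which is the deadlock the slide warns about.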



Device code: Thread IDs

Threads and blocks have IDs

So each thread can decide what data to work on

Block ID: 1D or 2D

Thread ID: 1D, 2D, or 3D

2D/3D IDs simplify addressing when processing multidimensional data

Image processing

Solving PDEs on volumes

[Figure: Grid 1 on the device, made of Block (0, 0) to Block (2, 1); Block (1, 1) is shown expanded into its threads, Thread (0, 0) to Thread (4, 2).]
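As a sketch of how 2D block and thread IDs map onto image data: each thread derives one pixel coordinate from blockIdx, blockDim and threadIdx. The kernel name invert and the 640 x 480 image size are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void invert(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)                     // guard partial blocks
        img[y * width + x] = 255 - img[y * width + x];
}

int main()
{
    const int width = 640, height = 480;
    unsigned char *d_img = 0;
    cudaMalloc((void**)&d_img, width * height);
    cudaMemset(d_img, 0, width * height);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    invert<<<grid, block>>>(d_img, width, height);

    unsigned char first = 0;
    cudaMemcpy(&first, d_img, 1, cudaMemcpyDeviceToHost);
    printf("first pixel after inversion: %d\n", first);
    cudaFree(d_img);
    return 0;
}
```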



Programming Model: Memory Spaces

Each thread can:

Read/write per-thread registers

(Read/write per-thread local memory)

Read/write per-block shared memory

Read/write per-grid global memory

Read only per-grid constant memory

Read only per-grid texture memory

[Figure: a grid with Block (0, 0) and Block (1, 0). Each block has its own shared memory; each thread, Thread (0, 0) and Thread (1, 0), has its own registers and local memory. Global, constant and texture memory are shared by the whole grid and connected to the host.]

Host can read/write global, constant, and texture memory (all stored in GPU DRAM)



Qualifiers for variable storage (device code)

__device__

Stored in device memory, aka global memory (e.g. 4 GB on Tesla)

Large capacity, BUT: high latency, uncached

Allocated with cudaMalloc

Accessible by all threads

__shared__

On-chip memory (SRAM, low latency), 16 kB per multiprocessor

Allocated by execution configuration or at compile time

Shared access by all threads in the same thread block

Short-lived (only while block runs)

All unqualified variables:

Scalars and built-in vector types are stored in registers

Arrays may be in registers, or local memory (special form of global memory / DRAM)
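A small sketch of the storage qualifiers in one kernel. It shows a statically declared __device__ variable, a __shared__ array sized at compile time, and one sized by the execution configuration. All names (scale, scaledCopy, fixedTile, dynTile) are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ float scale = 2.0f;   // global memory, visible to every thread

__global__ void scaledCopy(const float *in, float *out, int n)
{
    __shared__ float fixedTile[128];     // size fixed at compile time
    extern __shared__ float dynTile[];   // size set by the execution configuration

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unqualified scalar: register
    if (i < n) {
        fixedTile[threadIdx.x] = in[i];              // each thread uses its own slot
        dynTile[threadIdx.x] = fixedTile[threadIdx.x] * scale;
        out[i] = dynTile[threadIdx.x];
    }
}

int main()
{
    const int n = 256, threads = 128;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in, n * sizeof(float));    // dynamically allocated global memory
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    scaledCopy<<<n / threads, threads, threads * sizeof(float)>>>(d_in, d_out, n);

    cudaMemcpy(h, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h[3] = %f\n", h[3]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Because every thread touches only its own shared-memory slot here, no barrier is needed; as soon as threads read each other's slots, __syncthreads() becomes necessary.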



Launching kernels

Modified C function call syntax:

kernel<<<dim3 grid, dim3 block>>>(...)

Execution Configuration ("<<< >>>"):

grid dimensions: x and y

thread-block dimensions: x, y, and z

dim3 grid(16, 16);
dim3 block(16,16);
kernel<<<grid, block>>>(...);
kernel<<<32, 512>>>(...);



Example: Host Code

// allocate host memory
int numBytes = N * sizeof(float);


float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);


// execute the kernel
increment_gpu<<<N/blockSize, blockSize>>>(d_A, b);

// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);


cudaFree(d_A);



CUDA Built-in Device Variables

All __global__ and __device__ functions have access to these automatically defined variables

dim3 gridDim;

Dimensions of the grid in blocks (at most 2D)

dim3 blockDim;

Dimensions of the block in threads

dim3 blockIdx;

Block index within the grid

dim3 threadIdx;

Thread index within the block



Example: Increment Array Elements

CPU program

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}
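A CUDA counterpart to increment_cpu might look like the sketch below: the loop disappears and each thread handles the element selected by its global index. The signature of increment_gpu, the block size of 256 and the test values are assumptions, since only the CPU half of this slide is available here.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (idx < N)                                        // last block may be partial
        a[idx] = a[idx] + b;
}

int main()
{
    const int N = 1000;
    const int blockSize = 256;
    const int numBytes = N * sizeof(float);

    float *h_A = (float*)malloc(numBytes);
    for (int i = 0; i < N; ++i) h_A[i] = (float)i;

    float *d_A = 0;
    cudaMalloc((void**)&d_A, numBytes);
    cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);

    int numBlocks = (N + blockSize - 1) / blockSize;    // round up
    increment_gpu<<<numBlocks, blockSize>>>(d_A, 1.5f, N);

    cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);
    printf("h_A[0] = %f, h_A[%d] = %f\n", h_A[0], N - 1, h_A[N - 1]);

    cudaFree(d_A);
    free(h_A);
    return 0;
}
```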



Other extras (device code)

Other language extras....



Built-in Vector Types

[u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]

Structures accessed with x, y, z, w fields:

uint4 param;

int y = param.y;

dim3

Based on uint3

Used to specify dimensions

Default value (1,1,1)

Can be used in GPU and CPU code (if nvcc compiled)
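A short sketch of the vector types in use, with float4 on both host and device and a dim3 whose unspecified components default to 1. The kernel name saxpy4 and the data values are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy4(float a, const float4 *x, float4 *y, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 xv = x[i], yv = y[i];     // four floats per load
        yv.x += a * xv.x;  yv.y += a * xv.y;
        yv.z += a * xv.z;  yv.w += a * xv.w;
        y[i] = yv;
    }
}

int main()
{
    const int n4 = 256;
    float4 h_x[n4], h_y[n4];
    for (int i = 0; i < n4; ++i) {
        h_x[i] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);
        h_y[i] = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
    }

    float4 *d_x = 0, *d_y = 0;
    cudaMalloc((void**)&d_x, n4 * sizeof(float4));
    cudaMalloc((void**)&d_y, n4 * sizeof(float4));
    cudaMemcpy(d_x, h_x, n4 * sizeof(float4), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, n4 * sizeof(float4), cudaMemcpyHostToDevice);

    dim3 block(128);            // y and z default to 1
    dim3 grid(n4 / block.x);
    saxpy4<<<grid, block>>>(2.0f, d_x, d_y, n4);

    cudaMemcpy(h_y, d_y, n4 * sizeof(float4), cudaMemcpyDeviceToHost);
    printf("y[0] = (%f, %f, %f, %f)\n", h_y[0].x, h_y[0].y, h_y[0].z, h_y[0].w);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```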



Thread Synchronization

void __syncthreads();

Synchronizes all threads in a block

Generates barrier synchronization instruction

No thread can pass this barrier until all threads in the block reach it

Often needed for shared memory write/read synchronization between threads
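The write/read case can be sketched with a block that reverses an array through shared memory: every thread writes one element, and the barrier guarantees all writes are visible before any thread reads a slot written by another thread. The kernel name and the size of 256 are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256

__global__ void reverseBlock(int *d)
{
    __shared__ int s[N];
    int t = threadIdx.x;
    s[t] = d[t];            // write phase: each thread fills its own slot
    __syncthreads();        // barrier: all slots are filled past this point
    d[t] = s[N - 1 - t];    // read phase: reads a slot written by another thread
}

int main()
{
    int h[N];
    for (int i = 0; i < N; ++i) h[i] = i;

    int *d = 0;
    cudaMalloc((void**)&d, N * sizeof(int));
    cudaMemcpy(d, h, N * sizeof(int), cudaMemcpyHostToDevice);
    reverseBlock<<<1, N>>>(d);
    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);

    printf("h[0] = %d, h[%d] = %d\n", h[0], N - 1, h[N - 1]);
    cudaFree(d);
    return 0;
}
```

Without the barrier, a thread could read s[N - 1 - t] before the owning thread has written it, and the result would depend on scheduling.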



GPU Atomic Integer Operations

Atomic operations on integers in global memory:

Associative operations on signed/unsigned ints

add, sub, min, max, ...

and, or, xor

Increment, decrement

Exchange, compare and swap

32-bit: hardware with compute capability >= 1.1

64-bit: hardware with compute capability >= 1.2
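A typical use of integer atomics is a histogram, where many threads may increment the same bin. The following is a minimal sketch with assumed names and sizes; it uses the 32-bit atomicAdd on global memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // safe even when threads hit the same bin
}

int main()
{
    const int n = 4096;
    unsigned char h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = (unsigned char)(i % 256);

    unsigned char *d_data = 0;
    unsigned int *d_bins = 0;
    cudaMalloc((void**)&d_data, n);
    cudaMalloc((void**)&d_bins, 256 * sizeof(unsigned int));
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, 256 * sizeof(unsigned int));

    histogram<<<(n + 255) / 256, 256>>>(d_data, n, d_bins);

    unsigned int h_bins[256];
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
    printf("bin 0 holds %u samples\n", h_bins[0]);
    cudaFree(d_data);
    cudaFree(d_bins);
    return 0;
}
```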



C for CUDA : Summary

Function qualifiers:

__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }

Variable qualifiers:

__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];

Execution configuration:

dim3 dimGrid(100, 50);    // 5000 thread blocks
dim3 dimBlock(4, 8, 8);   // 256 threads per block
MyKernel<<<dimGrid, dimBlock>>>(...);   // Launch kernel

Built-in variables and functions valid in device code:

dim3 gridDim;     // Grid dimension
dim3 blockDim;    // Block dimension
dim3 blockIdx;    // Block index
dim3 threadIdx;   // Thread index
void __syncthreads();   // Thread synchronization (ProgGuide)



Runtime API: More features

Other runtime specialties for host code...



Asynchronous operation

CUDA calls are enqueued in streams, and executed one after another: usually one default stream (0)

Kernel launches are asynchronous

control returns to CPU immediately

kernel executes after all previous CUDA calls

cudaMemcpy() is synchronous

copy starts after all previous CUDA calls have completed

control returns to CPU after copy completes

(async memcopies possible, too)

Thus: GPU output, required on the host, leads to sync



Example: Async operation

// allocate host memory
int numBytes = N * sizeof(float);


float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);


// "execute the kernel"
// truly: CPU enqueues kernel calls, GPU executes asynchronously
kernel_A<<<grid, block>>>(...);
kernel_B<<<grid, block>>>(...);
kernel_C<<<grid, block>>>(...);

// copy data from device back to host - CPU/GPU SYNC
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);


cudaFree(d_A);



CUDA Error Reporting

All CUDA calls return an error code

Except for kernel launches

cudaError_t type

cudaGetLastError( )

Returns the code for the last error ("no error" has a code, too)

Can even retrieve errors from kernel execution

const char *cudaGetErrorString(code)

Returns a string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
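In practice the error code is often checked through a small wrapper macro. The sketch below is one common pattern, not part of the CUDA API: the macro name CUDA_CHECK and the kernel fill are hypothetical. Since a kernel launch returns no code itself, it is followed by cudaGetLastError() and a synchronizing call.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file and line if a runtime call fails.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,    \
                    cudaGetErrorString(err_));                    \
            exit(EXIT_FAILURE);                                   \
        }                                                         \
    } while (0)

__global__ void fill(int *a, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = value;
}

int main()
{
    const int n = 1024;
    int *d_a = 0;
    CUDA_CHECK(cudaMalloc((void**)&d_a, n * sizeof(int)));

    fill<<<(n + 255) / 256, 256>>>(d_a, n, 7);
    CUDA_CHECK(cudaGetLastError());        // launch errors, e.g. bad configuration
    CUDA_CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran

    CUDA_CHECK(cudaFree(d_a));
    puts("no CUDA errors");
    return 0;
}
```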



Textures in CUDA

Textures are known from graphics ...

In CUDA, Texture is used for data reading

Benefits:

Addressable in 1D, 2D, or 3D

Data is cached (optimized for 2D locality)

Helpful for irregular data access

Filtering

Linear / bilinear / trilinear

dedicated hardware

Wrap modes (for out-of-bounds addresses)

Usage:

Host code binds data to a texture reference

Kernel reads data by calling a fetch function, e.g. tex1Dfetch()



CUDA Event API

CUDA call streams can be interspersed with Events

Usage scenarios:

measure elapsed time for CUDA calls (clock cycle precision!)

query the status of an asynchronous CUDA call

block CPU until CUDA calls prior to the event are completed

asyncAPI sample in CUDA SDK

cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // block CPU until the stop event has been reached

float elapsedTime;            // elapsed time in milliseconds
cudaEventElapsedTime(&elapsedTime, start, stop);

cudaEventDestroy(start); cudaEventDestroy(stop);



Driver API

Up to this point the host code we've seen has been from the runtime API: cuda*() functions...

Driver API: cu*() functions

Advantages:

Plain C interface, you can use any CPU compiler for host code (e.g. icc, etc.)

More control over devices

One CPU thread can control multiple GPUs

PTX Just-In-Time (JIT) compilation (Parallel Thread eXecution (PTX) is our "GPU assembly language")

No dependency on runtime library

Disadvantages:

No device emulation

More verbose code

Note: Device code is identical, regardless of using the runtime or driver API



Once more: Runtime and Driver API

Best place to start for virtually all developers: Runtime API

Easy to migrate to driver API if/when it is needed

Anything which can be done in the runtime API can also be done in the driver API, but not vice versa

Much, much more information on both APIs in the CUDA Reference Manual



New Features in CUDA 2.2

Zero copy

CUDA threads can directly read/write host (CPU) memory

Requires pinned (non-pageable) memory

Main benefits:

More efficient than small PCIe data transfers

May be better performance when there is no opportunity for data reuse from device DRAM

2D Texturing from linear memory

Allows simpler write-to-texture in CUDA

Useful for image processing
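Zero copy can be sketched with mapped pinned memory as below. The kernel works directly on host memory through a device-side alias of the host pointer, with no cudaMemcpy. The kernel name addOne and the buffer size are assumptions; the device must report canMapHostMemory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;    // reads and writes host memory across PCIe
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {
        printf("device 0 cannot map host memory\n");
        return 0;
    }
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapping before other work

    const int n = 1024;
    float *h_a = 0, *d_a = 0;
    cudaHostAlloc((void**)&h_a, n * sizeof(float), cudaHostAllocMapped);  // pinned + mapped
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;

    cudaHostGetDevicePointer((void**)&d_a, h_a, 0);   // device alias of h_a
    addOne<<<(n + 255) / 256, 256>>>(d_a, n);
    cudaDeviceSynchronize();                          // wait before reading on the host

    printf("h_a[10] = %f\n", h_a[10]);
    cudaFreeHost(h_a);
    return 0;
}
```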



nvcc is a C compiler

Advanced C++ constructs (classes with inheritance and virtual functions) make it stumble in device code!

If problems occur, and CUDART is still desirable:

Let nvcc only compile .cu files that contain the kernels, let customer's compiler handle C++ code in their own files, and link the two parts.

Last resort: CUDA driver API

(nvcc compiles kernels into PTX or binaries, which application loads via C calls)



C for CUDA Optimization



Optimize Algorithms for GPU

Maximize data-parallelism in the algorithm (SIMD):

Think threads for data elements, not specific tasks

Reduce thread divergence (performance impact from branch serialization, when groups smaller than 32 threads start to diverge)

More computation on the GPU than costly device-host data transfers

Even low parallelism computations can sometimes be faster than transferring back and forth to host



Optimize Algorithms for GPU: Maths

Maximize arithmetic intensity (math per mem transfer)

Sometimes it's better to recompute results than to cause serial dependencies

GPU spends its transistors on ALUs, not memory

Double precision algorithms:

Consider moving parts/all to single precision computation

Hardware has built-in math functions (at reduced precision): __sinf(), __expf(), etc.

Try -use_fast_math (implicitly converts e.g. sinf() to __sinf()) or carefully replace individual function calls, considering reduced accuracy
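Before replacing calls, the accuracy loss can be measured directly. This sketch evaluates sinf() and the hardware intrinsic __sinf() on the same inputs and reports the largest difference; the input range and names are assumptions, and the measured difference depends on the device and the argument range.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

__global__ void sines(const float *x, float *accurate, float *fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        accurate[i] = sinf(x[i]);     // library version
        fast[i]     = __sinf(x[i]);   // hardware intrinsic, reduced precision
    }
}

int main()
{
    const int n = 1024;
    float h_x[n], h_acc[n], h_fast[n];
    for (int i = 0; i < n; ++i) h_x[i] = 0.01f * i;

    float *d_x = 0, *d_acc = 0, *d_fast = 0;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_acc, n * sizeof(float));
    cudaMalloc((void**)&d_fast, n * sizeof(float));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    sines<<<(n + 255) / 256, 256>>>(d_x, d_acc, d_fast, n);

    cudaMemcpy(h_acc, d_acc, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_fast, d_fast, n * sizeof(float), cudaMemcpyDeviceToHost);

    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(h_acc[i] - h_fast[i]));
    printf("max |sinf - __sinf| on [0, %.2f]: %g\n", h_x[n - 1], maxDiff);

    cudaFree(d_x);
    cudaFree(d_acc);
    cudaFree(d_fast);
    return 0;
}
```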



Optimize Memory Access

Coalescing: "Optimal" memory access pattern

Coalesced vs. Non-coalesced = order of magnitude!

Shared memory: A user-managed cache

Advanced concepts:

Shared memory bank conflicts

Make use of spatial locality for texture and constant caches



Coalescing

Compute capability 1.0 and 1.1

K-th thread must access k-th word in the segment (or k-th word in 2 contiguous 128B segments for 128-bit words), not all threads need to participate

[Figure: three access patterns. Coalesced: 1 transaction. Out of sequence: 16 transactions. Misaligned: 16 transactions.]



Coalescing

Compute capability 1.2 and higher

1 transaction - 64B segment

MMU is more advanced, relaxes coalescing requirements

Coalescing achieved for any pattern of addresses that fits into a segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for 32- and 64-bit words

Smaller transactions may be issued to avoid wasted bandwidth due to unused words

Exact rules in Programming Guide
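The effect of the access pattern can be observed with two copy kernels that move the same data, one with consecutive threads touching consecutive words and one where neighbouring threads touch addresses far apart. This is a rough sketch under assumed names and sizes; the measured ratio depends heavily on the hardware generation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];             // thread k touches word k
}

__global__ void copyScattered(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // An odd stride with power-of-two n visits every element exactly once,
        // but neighbouring threads land in different segments.
        int j = (int)(((long long)i * stride) % n);
        out[j] = in[j];
    }
}

int main()
{
    const int n = 1 << 22, threads = 256, blocks = n / threads;
    float *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    copyCoalesced<<<blocks, threads>>>(d_in, d_out, n);   // warm-up launch

    cudaEventRecord(start, 0);
    copyCoalesced<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("coalesced: %.3f ms\n", ms);

    cudaEventRecord(start, 0);
    copyScattered<<<blocks, threads>>>(d_in, d_out, n, 33);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("scattered: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```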



Take Advantage of Shared Memory

Hundreds of times faster than global memory

Threads can cooperate via shared memory

Use one / a few threads to load / compute data shared by all threads

Use it to avoid non-coalesced access

Stage loads and stores in shared memory to re-order non-coalesceable addressing
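Matrix transpose is the usual illustration of staging through shared memory: a tile is read row-wise from global memory, and the re-ordering happens in shared memory so that the write is row-wise as well. The sketch below assumes a 16 x 16 tile and a small test matrix; the extra padding column is a common trick against shared memory bank conflicts.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 16

__global__ void transpose(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // swap the block offsets
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // row-wise write
}

int main()
{
    const int width = 64, height = 32, n = width * height;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transpose<<<grid, block>>>(d_in, d_out, width, height);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            if (h_out[c * height + r] != h_in[r * width + c]) ++errors;
    printf("%d mismatches\n", errors);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```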



Use Parallelism Efficiently

Partition your computation to keep the GPU multiprocessors equally busy

Many threads, many thread blocks

Keep threads' resource usage low enough to support multiple blocks per multiprocessor

Resources: Registers, shared memory



Host-Device Data Transfers

Device-Host memory bandwidth much lower than device-device bandwidth

8 GB/s peak (PCI-e x16 Gen 2) vs. 102 GB/s peak (Tesla C1060)

Minimize transfers

Don't transfer intermediate data:

Can be allocated, operated on, and deallocated without ever copying them to host memory

Group transfers

One large transfer much better than many small ones



Overlapping Data Transfers and Computation

Stream and Async API allow overlapping host-device data transfers with computation

CPU computation can overlap data transfers on all CUDA capable devices

Devices with concurrent copy and execution (CompCap >= 1.1):

Kernel computation can overlap data transfers, controlled via streams and events.

Stream = sequence of CUDA calls that execute in order

Calls in different streams can be interleaved

Stream ID is an argument to async calls and kernel launches
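A minimal sketch of the stream pattern: the data is split into chunks, and each chunk gets its own stream for upload, kernel and download, so that the copy of one chunk may overlap the kernel of another on devices that support it. The chunk count, sizes and the kernel name scale are assumptions; asynchronous copies require pinned host memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *a, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

int main()
{
    const int nStreams = 2, chunk = 1 << 20, n = nStreams * chunk;
    const size_t chunkBytes = chunk * sizeof(float);

    float *h_a = 0, *d_a = 0;
    cudaMallocHost((void**)&h_a, n * sizeof(float));   // pinned host memory
    cudaMalloc((void**)&d_a, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_a[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        // Calls within one stream run in order; different streams may overlap.
        cudaMemcpyAsync(d_a + off, h_a + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<chunk / 256, 256, 0, streams[s]>>>(d_a + off, chunk, 3.0f);
        cudaMemcpyAsync(h_a + off, d_a + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamSynchronize(streams[s]);   // block CPU until the stream is done
        cudaStreamDestroy(streams[s]);
    }
    printf("h_a[0] = %f, h_a[n-1] = %f\n", h_a[0], h_a[n - 1]);

    cudaFreeHost(h_a);
    cudaFree(d_a);
    return 0;
}
```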



Shared Memory

~Hundred times faster than global memory

Use it to cache data from global memory accesses

Use it to avoid non-coalesced access

Stage loads and stores in shared memory to re-order non-coalesceable addressing

Threads can cooperate via shared memory

share results with each other

contribute to common result, e.g. block min/max/avg


Grid/Block Size Heuristics

# of blocks > # of multiprocessors

So all multiprocessors have at least one block to execute

# of blocks / # of multiprocessors > 2

Multiple blocks can run concurrently in a multiprocessor

Blocks that aren't waiting at a __syncthreads() keep the hardware busy

Subject to resource availability: registers, shared memory

# of blocks > 100 to scale to future devices

Blocks executed in pipeline fashion

1000 blocks per grid will scale across multiple generations
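These heuristics can be checked at run time against the actual device. The following sketch compares a planned grid size with the multiprocessor count reported by cudaGetDeviceProperties; the problem size and block size are assumed values.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device found\n");
        return 1;
    }

    const int n = 1 << 20;        // assumed problem size
    const int threads = 256;      // assumed block size
    int blocks = (n + threads - 1) / threads;
    int sms = prop.multiProcessorCount;

    printf("%d multiprocessors, %d blocks, %.1f blocks per multiprocessor\n",
           sms, blocks, (double)blocks / sms);

    if (blocks <= 2 * sms)
        printf("fewer than 2 blocks per multiprocessor: grid is likely too small\n");
    if (blocks <= 100)
        printf("100 blocks or fewer: grid may not scale to larger devices\n");
    return 0;
}
```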


Accuracy

GPU and CPU results may differ, but are equally accurate (to specified ulp accuracy)

CPU operations aren't strictly limited to 0.5 ulp

Sequences of operations can be even more accurate due to 80-bit extended precision ALUs

Compare GPU calculation to CPU SSE

And: Floating-point arithmetic is not associative!

Complex area (ask if unsure)
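The non-associativity is easy to demonstrate with three single-precision values added in two different orders. The values below are chosen for illustration: 1 is smaller than the spacing between neighbouring floats near 1e8, so it is lost when added to -1e8 first.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sumBothWays(float a, float b, float c, float *out)
{
    out[0] = (a + b) + c;   // a and b cancel first, c survives
    out[1] = a + (b + c);   // c is rounded away inside (b + c)
}

int main()
{
    float *d_out = 0, h_out[2];
    cudaMalloc((void**)&d_out, 2 * sizeof(float));

    sumBothWays<<<1, 1>>>(1.0e8f, -1.0e8f, 1.0f, d_out);

    cudaMemcpy(h_out, d_out, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("(a + b) + c = %f\n", h_out[0]);
    printf("a + (b + c) = %f\n", h_out[1]);
    cudaFree(d_out);
    return 0;
}
```

In strict single precision the first sum is 1 and the second is 0. A parallel reduction adds values in a different order than a sequential CPU loop, which is one reason GPU and CPU results can differ while both stay within the specified accuracy.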


Summary

GPU hardware can achieve great performance on data-parallel computations if you follow a few simple guidelines:

Use parallelism efficiently

Coalesce memory accesses if possible

Take advantage of shared memory

Explore other memory spaces

Texture

Constant

(Reduce shared memory bank conflicts)

See the Programming Guide, Best Practices Guide and Reference Manual

If that doesn't help: Ask your local DevTech-Compute engineer :)
