8/3/2019 Day1 02a Programming Overview
1/47
CUDA Programming Model
Gernot Ziegler, NVIDIA UK (material by Gregory Ruetsch)
NVIDIA Confidential
Programming in C for CUDA
C for CUDA = C + a few simple extensions
As a C developer, it is easy to start writing parallel programs
Three key abstractions:
1. parallel threads on the device (GPU)
2. management of the corresponding memory spaces
3. the corresponding synchronization
Host: Device management API
Additionally, Runtime API & nvcc: use language extensions even for host code!
Basics
Set up GPU for computation
GPU device and memory management
GPU kernel launches (execution configuration)
Some specifics of GPU/device code
Some additional features:
Vector types
Asynchronous execution
CUDA error handling
CUDA Events
Note: only the basic features are covered
The Programming Guide and Reference Manual contain more information
Device Management
First task: the CPU queries and selects GPU devices
cudaGetDeviceCount( int* count )
cudaSetDevice( int device )
cudaGetDevice( int *current_device )
cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
cudaChooseDevice( int *device, cudaDeviceProp* prop )
Multi-GPU setup:
Device 0 is used by default; be careful when combining a graphics card and a Tesla!
(Usually, one CPU thread controls one GPU each, but the driver API allows more)
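The device management calls above could be combined as in the following minimal sketch. The helper name listAndSelectDevice is hypothetical and not part of the CUDA API; it only assumes the runtime calls listed on this slide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API call): lists all CUDA devices and
// selects device 0. Returns the number of devices found (0 if none).
int listAndSelectDevice()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device available\n");
        return 0;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, %zu MB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }
    cudaSetDevice(0);           // explicit, although device 0 is the default
    int current = -1;
    cudaGetDevice(&current);    // reads back the active device
    printf("Using device %d\n", current);
    return count;
}
```

On a machine with both a display card and a Tesla, printing the properties first is one way to check which index belongs to which board before calling cudaSetDevice().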
Managing Memory
Host/CPU also manages device/GPU memory:
Allocate & free memory
Copy data to and from the device's global memory (GPU DRAM, e.g. 4 GB on Tesla)
cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)
Host and device have separate memory spaces!
Example:
Managing memory (no data transfer)
int n = 1024;
int nbytes = n * sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );
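Because host code cannot dereference d_a directly, one way to confirm that the memset took effect is to copy the buffer back. The sketch below assumes a hypothetical helper name, deviceMemsetRoundTrip, and adds a cudaMemcpy that the slide's example deliberately leaves out.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: allocates n ints on the device, zeroes them,
// copies them back and returns true if every element is 0.
bool deviceMemsetRoundTrip(int n)
{
    size_t nbytes = n * sizeof(int);
    int *h_a = (int*)malloc(nbytes);
    int *d_a = 0;
    if (h_a == 0 || cudaMalloc((void**)&d_a, nbytes) != cudaSuccess) {
        free(h_a);
        return false;
    }
    for (int i = 0; i < n; ++i) h_a[i] = -1;    // poison the host copy
    cudaMemset(d_a, 0, nbytes);                 // byte-wise fill on the device
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_a[i] == 0);
    cudaFree(d_a);
    free(h_a);
    return ok;
}
```

Note that cudaMemset() sets bytes, not ints, so it is only suitable for patterns such as all zeros.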
CUDA: Runtime support
Explicit memory allocation returns pointers to GPU memory
cudaMalloc(), cudaFree()
Explicit memory copy for host <-> device, device <-> device
cudaMemcpy(), cudaMemcpy2D(), ...
Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
OpenGL & DirectX interoperability
cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), ...
Example: Host code's memory management
// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*)malloc(numBytes);
// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);
// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);
// execute the kernel on GPU: [ NEXT SLIDE ]
gpu_func<<<grid, block>>>(params);
// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);
// free device memory
cudaFree(d_A);
Kernel creation
How to write a kernel such as
gpu_func<<<grid, block>>>(params)
First, a recap of the CUDA architecture...
Device code:
Thread bundles
Kernel = device code call
A kernel is executed by a grid of thread blocks
A thread block is a batch of threads that can cooperate through shared memory
Threads from different blocks cannot cooperate
[Figure: the host launches Kernel 1 and Kernel 2; each runs on the device as a grid (Grid 1, Grid 2) of thread blocks, Block (0, 0) to Block (2, 1); Block (1, 1) is shown expanded into a 5 x 3 arrangement of threads, Thread (0, 0) to Thread (4, 2).]
Blocks must be independent
"Threads from different blocks cannot cooperate"
Why?
Any possible interleaving of blocks should be valid
Blocks are presumed to run to completion without pre-emption
Blocks can run in any order
Blocks can run concurrently OR sequentially (GPU scaling)
Blocks may coordinate but not synchronize
Shared queue pointer: OK
Shared lock: BAD, can easily deadlock
So: the independence requirement gives scalability across different GPU sizes.
Device code:
Thread IDs
Threads and blocks have IDs
So each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D
2D/3D IDs simplify addressing when processing multidimensional data
Image processing
Solving PDEs on volumes
[Figure: Grid 1 on the device, a 3 x 2 arrangement of blocks from Block (0, 0) to Block (2, 1); Block (1, 1) is expanded into a 5 x 3 arrangement of threads from Thread (0, 0) to Thread (4, 2).]
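As an illustration of 2D addressing, the sketch below lets every thread compute its own pixel coordinate from its block and thread IDs. The kernel name fillIndex2D, the helper runFillIndex2D and the 16 x 16 block size are assumptions made for this example.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread derives its 2D pixel coordinate from block and thread IDs.
__global__ void fillIndex2D(int *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // edge blocks may overhang the image
        img[y * width + x] = y * width + x;
}

// Hypothetical host helper; returns true if every pixel holds its own index.
bool runFillIndex2D(int width, int height)
{
    size_t nbytes = (size_t)width * height * sizeof(int);
    int *h_img = (int*)calloc((size_t)width * height, sizeof(int));
    int *d_img = 0;
    cudaMalloc((void**)&d_img, nbytes);
    dim3 block(16, 16);                   // assumed block shape
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    fillIndex2D<<<grid, block>>>(d_img, width, height);
    cudaMemcpy(h_img, d_img, nbytes, cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < width * height; ++i) ok = ok && (h_img[i] == i);
    cudaFree(d_img);
    free(h_img);
    return ok;
}
```

The grid size is rounded up so that image dimensions that are not multiples of 16 are still covered; the bounds check in the kernel discards the surplus threads.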
Programming Model:
Memory Spaces
Each thread can:
Read/write per-thread registers
(Read/write per-thread local memory)
Read/write per-block shared memory
Read/write per-grid global memory
Read only per-grid constant memory
Read only per-grid texture memory
[Figure: memory spaces. Each thread has its own registers and local memory; each block, e.g. Block (0, 0) and Block (1, 0), has its own shared memory; the whole grid shares global, constant, and texture memory, which the host can also access.]
The host can read/write global, constant, and texture memory (all stored in GPU DRAM)
Qualifiers for variable storage (device code)
__device__
Stored in device memory, aka global memory (e.g. 4GB on Tesla)
Large capacity, BUT: high latency, uncached
Allocated with cudaMalloc
Accessible by all threads
__shared__
On-chip memory (SRAM, low latency), 16 kB per multiprocessor
Allocated by execution configuration or at compile time
Shared access by all threads in the same thread block
Short-lived (only while the block runs)
All unqualified variables:
Scalars and built-in vector types are stored in registers
Arrays may be in registers or in local memory (a special form of global memory / DRAM)
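The storage qualifiers could be used together as in this sketch. It assumes a hypothetical kernel reverseChunks whose shared buffer is sized by the third execution-configuration argument, plus a hypothetical __device__ variable d_scale; it also relies on __syncthreads(), the block-wide barrier.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

__device__ float d_scale = 2.0f;          // global memory, visible to all threads

// Scales each element and reverses every block-sized chunk in place.
__global__ void reverseChunks(float *data)
{
    extern __shared__ float s_buf[];      // size set by the execution configuration
    int base = blockIdx.x * blockDim.x;   // unqualified scalars live in registers
    int t = threadIdx.x;
    s_buf[t] = data[base + t] * d_scale;
    __syncthreads();                      // all writes visible before any read
    data[base + t] = s_buf[blockDim.x - 1 - t];
}

// Hypothetical host helper; n must be a multiple of threadsPerBlock.
bool runReverseChunks(int n, int threadsPerBlock)
{
    size_t nbytes = n * sizeof(float);
    float *h = (float*)malloc(nbytes);
    for (int i = 0; i < n; ++i) h[i] = (float)i;
    float *d = 0;
    cudaMalloc((void**)&d, nbytes);
    cudaMemcpy(d, h, nbytes, cudaMemcpyHostToDevice);
    size_t sharedBytes = threadsPerBlock * sizeof(float);
    reverseChunks<<<n / threadsPerBlock, threadsPerBlock, sharedBytes>>>(d);
    cudaMemcpy(h, d, nbytes, cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < n; ++i) {
        int base = (i / threadsPerBlock) * threadsPerBlock;
        int mirrored = base + threadsPerBlock - 1 - (i - base);
        ok = ok && (h[i] == 2.0f * mirrored);
    }
    cudaFree(d);
    free(h);
    return ok;
}
```

A fixed-size declaration such as `__shared__ float s_buf[128];` would be the compile-time alternative mentioned on the slide.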
Launching kernels
Modified C function call syntax:
kernel<<<grid, block>>>(...)
Execution configuration (<<< >>>):
grid dimensions: x and y
thread-block dimensions: x, y, and z
dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block>>>(...);
kernel<<<..., ...>>>(...);
Example: Host Code
// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*)malloc(numBytes);
// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);
// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);
// execute the kernel
increment_gpu<<<grid, block>>>(d_A, b);
// copy data from device back to host
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);
// free device memory
cudaFree(d_A);
CUDA Built-in Device Variables
All __global__ and __device__ functions have access to these automatically defined variables
dim3 gridDim;
Dimensions of the grid in blocks (at most 2D)
dim3 blockDim;
Dimensions of the block in threads
dim3 blockIdx;
Block index within the grid
dim3 threadIdx;
Thread index within the block
Example: Increment Array Elements
CPU program:
void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx < N; idx++)
        a[idx] = a[idx] + b;
}
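A CUDA counterpart to increment_cpu might look like the sketch below: the loop disappears and each thread handles one element. The host driver runIncrement and the block size of 256 are assumptions made for this example, not taken from the slide.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per array element instead of one loop iteration per element.
__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                      // the last block may be only partly used
        a[idx] = a[idx] + b;
}

// Hypothetical host driver; returns true if every element was incremented by b.
bool runIncrement(int N, float b)
{
    size_t numBytes = N * sizeof(float);
    float *h_A = (float*)malloc(numBytes);
    for (int i = 0; i < N; ++i) h_A[i] = (float)i;
    float *d_A = 0;
    cudaMalloc((void**)&d_A, numBytes);
    cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);
    int blockSize = 256;                               // assumed block size
    int numBlocks = (N + blockSize - 1) / blockSize;   // round up
    increment_gpu<<<numBlocks, blockSize>>>(d_A, b, N);
    cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int i = 0; i < N; ++i) ok = ok && (h_A[i] == (float)i + b);
    cudaFree(d_A);
    free(h_A);
    return ok;
}
```

The `if (idx < N)` guard is what allows N to be any size rather than a multiple of the block size.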
Other extras (device code)
Other language extras....
Built-in Vector Types
[u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4]
Structures accessed with x, y, z, w fields:
uint4 param;
int y = param.y;
dim3
Based on uint3
Used to specify dimensions
Default value (1,1,1)
Can be used in GPU and CPU code (if nvcc compiled)
Thread Synchronization
void __syncthreads();
Synchronizes all threads in a block
Generates a barrier synchronization instruction
No thread can pass this barrier until all threads in the block reach it
Often needed to synchronize shared memory writes and reads between threads
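A typical use of the barrier is a per-block sum in shared memory, sketched below. The names blockSum and runBlockSum and the block size of 256 are assumptions; the tree reduction shown is one common pattern, not necessarily the fastest one.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK 256   // threads per block; must be a power of two here

// Each block sums its slice of `in` and writes one partial sum.
__global__ void blockSum(const int *in, int *blockSums, int n)
{
    __shared__ int s[BLOCK];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (idx < n) ? in[idx] : 0;
    __syncthreads();                   // all loads done before anyone reads
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();               // outside the if: every thread must arrive
    }
    if (threadIdx.x == 0) blockSums[blockIdx.x] = s[0];
}

// Hypothetical host helper: sums n ones on the GPU and returns the total.
int runBlockSum(int n)
{
    int numBlocks = (n + BLOCK - 1) / BLOCK;
    int *h_in = (int*)malloc(n * sizeof(int));
    int *h_sums = (int*)calloc(numBlocks, sizeof(int));
    for (int i = 0; i < n; ++i) h_in[i] = 1;
    int *d_in = 0, *d_sums = 0;
    cudaMalloc((void**)&d_in, n * sizeof(int));
    cudaMalloc((void**)&d_sums, numBlocks * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    blockSum<<<numBlocks, BLOCK>>>(d_in, d_sums, n);
    cudaMemcpy(h_sums, d_sums, numBlocks * sizeof(int), cudaMemcpyDeviceToHost);
    int total = 0;
    for (int b = 0; b < numBlocks; ++b) total += h_sums[b];   // finish on the CPU
    cudaFree(d_in); cudaFree(d_sums);
    free(h_in); free(h_sums);
    return total;
}
```

The partial sums are combined on the host because the barrier only covers one block; blocks themselves remain independent.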
GPU Atomic Integer Operations
Atomic operations on integers in global memory:
Associative operations on signed/unsigned ints
add, sub, min, max, ...
and, or, xor
Increment, decrement
Exchange, compare and swap
32-bit: hardware with compute capability >= 1.1
64-bit: hardware with compute capability >= 1.2
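A histogram is a common case where many threads update the same global counters, so the update must be atomic. The sketch below assumes hypothetical names (histogram256, countZeros) and synthetic input data of the form i % 4.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread classifies one byte and atomically bumps the matching bin.
__global__ void histogram256(const unsigned char *data, int n, unsigned int *bins)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&bins[data[idx]], 1u);   // safe even when threads collide
}

// Hypothetical host helper: fills data[i] = i % 4 and returns the count of bin 0.
unsigned int countZeros(int n)
{
    unsigned char *h_data = (unsigned char*)malloc(n);
    for (int i = 0; i < n; ++i) h_data[i] = (unsigned char)(i % 4);
    unsigned int h_bins[256] = {0};
    unsigned char *d_data = 0;
    unsigned int *d_bins = 0;
    cudaMalloc((void**)&d_data, n);
    cudaMalloc((void**)&d_bins, sizeof(h_bins));
    cudaMemcpy(d_data, h_data, n, cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, sizeof(h_bins));
    histogram256<<<(n + 255) / 256, 256>>>(d_data, n, d_bins);
    cudaMemcpy(h_bins, d_bins, sizeof(h_bins), cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_bins);
    free(h_data);
    return h_bins[0];
}
```

With a plain `bins[...]++` instead of atomicAdd(), concurrent increments of the same bin could be lost.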
C for CUDA : Summary
Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }
Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__ float MySharedArray[32];
Execution configuration:
dim3 dimGrid(100, 50); // 5000 thread blocks
dim3 dimBlock(4, 8, 8); // 256 threads per block
MyKernel<<<dimGrid, dimBlock>>>(...); // Launch kernel
Built-in variables and functions valid in device code:
dim3 gridDim; // Grid dimension
dim3 blockDim; // Block dimension
dim3 blockIdx; // Block index
dim3 threadIdx; // Thread index
void __syncthreads(); // Thread synchronization (ProgGuide)
Runtime API: More features
Other runtime specialties for host code...
Asynchronous operation
CUDA calls are enqueued in streams and executed one after another; usually there is one default stream (0)
Kernel launches are asynchronous
control returns to the CPU immediately
the kernel executes after all previous CUDA calls have completed
cudaMemcpy() is synchronous
the copy starts after all previous CUDA calls have completed
control returns to the CPU after the copy completes
(async memcopies are possible, too)
Thus: GPU output that is required on the host leads to a sync
Example: Async operation
// allocate host memory
int numBytes = N * sizeof(float);
float* h_A = (float*)malloc(numBytes);
// allocate device memory
float* d_A = 0;
cudaMalloc((void**)&d_A, numBytes);
// copy data from host to device
cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);
// "execute the kernel"
// truly: CPU enqueues kernel calls, GPU executes asynchronously
kernel_A<<<grid, block>>>(...);
kernel_B<<<grid, block>>>(...);
kernel_C<<<grid, block>>>(...);
// copy data from device back to host - CPU/GPU SYNC
cudaMemcpy(h_A, d_A, numBytes, cudaMemcpyDeviceToHost);
// free device memory
cudaFree(d_A);
CUDA Error Reporting
All CUDA calls return an error code
Except for kernel launches
cudaError_t type
cudaGetLastError( )
Returns the code for the last error ("no error" also has a code)
Also reports errors from kernel execution
char *cudaGetErrorString(code)
Returns a string describing the error
printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
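Since a kernel launch returns no error code, a failed launch has to be picked up with cudaGetLastError(). The sketch below provokes such a failure on purpose; the CUDA_CHECK macro and the function launchWithTooManyThreads are hypothetical helpers written for this example, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical convenience macro: prints file, line and message on failure.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess)                                  \
            printf("%s:%d: %s\n", __FILE__, __LINE__,             \
                   cudaGetErrorString(err_));                     \
    } while (0)

__global__ void emptyKernel() {}

// Launches with a deliberately invalid configuration and returns the error.
cudaError_t launchWithTooManyThreads()
{
    int tooMany = 1 << 20;                  // far above any per-block thread limit
    emptyKernel<<<1, tooMany>>>();          // the launch itself returns nothing
    cudaError_t err = cudaGetLastError();   // so the error is fetched here
    printf("launch: %s\n", cudaGetErrorString(err));
    CUDA_CHECK(cudaDeviceSynchronize());    // execution errors would surface here
    return err;
}
```

Wrapping every runtime call in a macro of this kind keeps the checks from cluttering the host code.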
Textures in CUDA
Textures are known from graphics ...
In CUDA, textures are used for reading data
Benefits:
Addressable in 1D, 2D, or 3D
Data is cached (optimized for 2D locality)
Helpful for irregular data access
Filtering
Linear / bilinear / trilinear
dedicated hardware
Wrap modes (for out-of-bounds addresses)
Usage:
Host code binds data to a texture reference
Kernel reads data by calling a fetch function, e.g. tex1Dfetch()
CUDA Event API
CUDA call streams can be interspersed with Events
Usage scenarios:
measure elapsed time for CUDA calls (timed on the GPU, resolution of around half a microsecond)
query the status of an asynchronous CUDA call
block the CPU until CUDA calls prior to the event are completed
See the asyncAPI sample in the CUDA SDK
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime; // in milliseconds
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaEventDestroy(start); cudaEventDestroy(stop);
Driver API
Up to this point, the host code we've seen has used the runtime API: cuda*() functions
Driver API: cu*() functions
Advantages:
Plain C interface, so any CPU compiler can be used for host code (e.g. icc)
More control over devices
One CPU thread can control multiple GPUs
PTX Just-In-Time (JIT) compilation (Parallel Thread eXecution (PTX) is our "GPU assembly language")
No dependency on the runtime library
Disadvantages:
No device emulation
More verbose code
Note: Device code is identical, regardless of whether the runtime or the driver API is used
Once more: Runtime and Driver API
Best place to start for virtually all developers: the runtime API
Easy to migrate to the driver API if/when it is needed
Anything that can be done in the runtime API can also be done in the driver API, but not vice versa
Much, much more information on both APIs in the CUDA Reference Manual
New Features in CUDA 2.2
Zero copy
CUDA threads can directly read/write host (CPU) memory
Requires pinned (non-pageable) memory
Main benefits:
More efficient than small PCIe data transfers
May give better performance when there is no opportunity for data reuse from device DRAM
2D texturing from linear memory
Allows simpler write-to-texture in CUDA
Useful for image processing
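For the zero-copy feature, a minimal sketch could look as follows. It assumes a device that supports mapped host memory; the kernel scaleInPlace and the helper runZeroCopy are hypothetical names, and the return value of cudaSetDeviceFlags() is deliberately ignored because newer runtimes enable mapping by default.

```cuda
#include <cuda_runtime.h>

// Operates directly on host memory through a device-side alias pointer.
__global__ void scaleInPlace(float *p, int n, float s)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) p[idx] *= s;
}

// Hypothetical host helper; returns true if the kernel's writes
// are visible in host memory without any cudaMemcpy.
bool runZeroCopy(int n)
{
    cudaSetDeviceFlags(cudaDeviceMapHost);   // needed before context creation on
    cudaGetLastError();                      // older runtimes; clear any complaint
    float *h_p = 0, *d_p = 0;
    if (cudaHostAlloc((void**)&h_p, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess)
        return false;                        // pinned, mapped allocation failed
    for (int i = 0; i < n; ++i) h_p[i] = (float)i;
    if (cudaHostGetDevicePointer((void**)&d_p, h_p, 0) != cudaSuccess) {
        cudaFreeHost(h_p);
        return false;
    }
    scaleInPlace<<<(n + 255) / 256, 256>>>(d_p, n, 2.0f);
    cudaDeviceSynchronize();                 // wait before the CPU reads the data
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_p[i] == 2.0f * i);
    cudaFreeHost(h_p);
    return ok;
}
```

Every access in the kernel crosses PCIe, which is why the slide recommends this mainly for data that is touched once.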
nvcc is a C compiler
Advanced C++ constructs (classes with inheritance and virtual functions) make it stumble in device code!
If problems occur, and CUDART is still desirable:
Let nvcc compile only the .cu files that contain the kernels, let the customer's compiler handle the C++ code in its own files, and link the two parts.
Last resort: CUDA driver API
(nvcc compiles kernels into PTX or binaries, which the application loads via C calls)
C for CUDA Optimization
Optimize Algorithms for GPU
Maximize data-parallelism in the algorithm (SIMD):
Think threads for data elements, not specific tasks
Reduce thread divergence (branch serialization costs performance when threads within the same 32-thread warp take different paths)
Do more computation on the GPU to avoid costly device-host data transfers
Even low-parallelism computations can sometimes be faster than transferring back and forth to the host
Optimize Algorithms for GPU: Maths
Maximize arithmetic intensity (math per memory transfer)
Sometimes it's better to recompute results than to cause serial dependencies
GPU spends its transistors on ALUs, not memory
Double precision algorithms:
Consider moving parts/all to single precision computation
Hardware has built-in math functions (at reduced precision):
__sinf(), __expf(), etc.
Try -use_fast_math (implicitly converts e.g. sinf() to __sinf()), or carefully replace individual function calls, considering the reduced accuracy
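Before switching to the intrinsics, it may help to measure how far they deviate on the input range that actually matters. The sketch below compares sinf() and __sinf() over [0, pi); the names sineBoth and maxSineDifference are hypothetical, and the result depends on the hardware and compiler flags.

```cuda
#include <cmath>
#include <cstdlib>
#include <cuda_runtime.h>

// Evaluates both the library sine and the hardware intrinsic per element.
__global__ void sineBoth(const float *x, float *precise, float *fast, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        precise[idx] = sinf(x[idx]);      // library version
        fast[idx]    = __sinf(x[idx]);    // intrinsic, reduced precision
    }
}

// Hypothetical host helper: largest |sinf - __sinf| over n samples in [0, pi).
float maxSineDifference(int n)
{
    size_t nbytes = n * sizeof(float);
    float *h_x = (float*)malloc(nbytes);
    float *h_p = (float*)calloc(n, sizeof(float));
    float *h_f = (float*)calloc(n, sizeof(float));
    for (int i = 0; i < n; ++i) h_x[i] = 3.14159265f * i / n;
    float *d_x = 0, *d_p = 0, *d_f = 0;
    cudaMalloc((void**)&d_x, nbytes);
    cudaMalloc((void**)&d_p, nbytes);
    cudaMalloc((void**)&d_f, nbytes);
    cudaMemcpy(d_x, h_x, nbytes, cudaMemcpyHostToDevice);
    sineBoth<<<(n + 255) / 256, 256>>>(d_x, d_p, d_f, n);
    cudaMemcpy(h_p, d_p, nbytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_f, d_f, nbytes, cudaMemcpyDeviceToHost);
    float maxDiff = 0.0f;
    for (int i = 0; i < n; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(h_p[i] - h_f[i]));
    cudaFree(d_x); cudaFree(d_p); cudaFree(d_f);
    free(h_x); free(h_p); free(h_f);
    return maxDiff;
}
```

If the file is itself compiled with -use_fast_math, both columns use the intrinsic and the difference collapses to zero.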
Optimize Memory Access
Coalescing: "Optimal" memory access pattern
Coalesced vs. Non-coalesced = order of magnitude!
Shared memory: A user-managed cache
Advanced concepts:
Shared memory bank conflicts
Make use of spatial locality for texture and constant caches
Coalescing
Compute capability 1.0 and 1.1
The k-th thread must access the k-th word in the segment (or the k-th word in 2 contiguous 128B segments for 128-bit words); not all threads need to participate
[Figure: coalesced access: 1 transaction; out-of-sequence access: 16 transactions; misaligned access: 16 transactions]
Coalescing
Compute capability 1.2 and higher
[Figure: 1 transaction, 64B segment]
MMU is more advanced, relaxes coalescing requirements
Coalescing is achieved for any pattern of addresses that fits into a segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for 32- and 64-bit words
Smaller transactions may be issued to avoid wasted bandwidth due to unused words
Exact rules in Programming Guide
Take Advantage of Shared Memory
Hundreds of times faster than global memory
Threads can cooperate via shared memory
Use one / a few threads to load / compute data shared by all threads
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
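A matrix transpose is the classic case for staging through shared memory: a naive version must either read or write with a large stride. The sketch below assumes 16 x 16 tiles and hypothetical names (transposeTiled, runTranspose); both the global read and the global write walk along rows.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 16

// Transposes a height x width matrix `in` into a width x height matrix `out`.
__global__ void transposeTiled(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column avoids bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // row-wise read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;     // origin of the transposed tile
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // row-wise write
}

// Hypothetical host helper; returns true if out is the transpose of in.
bool runTranspose(int width, int height)
{
    size_t nbytes = (size_t)width * height * sizeof(float);
    float *h_in = (float*)malloc(nbytes);
    float *h_out = (float*)calloc((size_t)width * height, sizeof(float));
    for (int i = 0; i < width * height; ++i) h_in[i] = (float)i;
    float *d_in = 0, *d_out = 0;
    cudaMalloc((void**)&d_in, nbytes);
    cudaMalloc((void**)&d_out, nbytes);
    cudaMemcpy(d_in, h_in, nbytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(d_in, d_out, width, height);
    cudaMemcpy(h_out, d_out, nbytes, cudaMemcpyDeviceToHost);
    bool ok = true;
    for (int r = 0; r < height; ++r)
        for (int c = 0; c < width; ++c)
            ok = ok && (h_out[c * height + r] == h_in[r * width + c]);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return ok;
}
```

The re-ordering happens inside the tile, where strided access is cheap, so global memory only sees contiguous rows.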
Use Parallelism Efficiently
Partition your computation to keep the GPU
multiprocessors equally busy
Many threads, many thread blocks
Keep threads' resource usage low enough to support multiple blocks per multiprocessor
Resources: Registers, shared memory
Host-Device Data Transfers
Device-host memory bandwidth is much lower than device-device bandwidth
8 GB/s peak (PCI-e x16 Gen 2) vs. 102 GB/s peak (Tesla C1060)
Minimize transfers
Don't transfer intermediate data: it can be allocated, operated on, and deallocated without ever being copied to host memory
Group transfers
One large transfer is much better than many small ones
Overlapping Data Transfers and Computation
Stream and Async API allow overlapping host-device data transfers with computation
CPU computation can overlap data transfers on all CUDA capable devices
Devices with concurrent copy and execution (CompCap >= 1.1):
Kernel computation can overlap data transfers, controlled via streams and events.
Stream = sequence of CUDA calls that execute in order
Calls in different streams can be interleaved
Stream ID is an argument to async calls and kernel launches
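One way to set up such an overlap is to split the data in two and give each half its own stream, as sketched below. The names addOne and runTwoStreams are hypothetical, and whether the copies and kernels really overlap depends on the device.

```cuda
#include <cuda_runtime.h>

__global__ void addOne(float *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) a[idx] += 1.0f;
}

// Hypothetical host helper; n is assumed to be even.
bool runTwoStreams(int n)
{
    int half = n / 2;
    size_t halfBytes = half * sizeof(float);
    float *h_a = 0, *d_a = 0;
    // Async copies need pinned host memory, hence cudaMallocHost.
    if (cudaMallocHost((void**)&h_a, 2 * halfBytes) != cudaSuccess) return false;
    cudaMalloc((void**)&d_a, 2 * halfBytes);
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;
    cudaStream_t stream[2];
    for (int k = 0; k < 2; ++k) cudaStreamCreate(&stream[k]);
    for (int k = 0; k < 2; ++k) {
        float *h = h_a + k * half, *d = d_a + k * half;
        // Calls within one stream run in order; the two streams may interleave.
        cudaMemcpyAsync(d, h, halfBytes, cudaMemcpyHostToDevice, stream[k]);
        addOne<<<(half + 255) / 256, 256, 0, stream[k]>>>(d, half);
        cudaMemcpyAsync(h, d, halfBytes, cudaMemcpyDeviceToHost, stream[k]);
    }
    for (int k = 0; k < 2; ++k) cudaStreamSynchronize(stream[k]);
    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_a[i] == (float)i + 1.0f);
    for (int k = 0; k < 2; ++k) cudaStreamDestroy(stream[k]);
    cudaFree(d_a);
    cudaFreeHost(h_a);
    return ok;
}
```

The stream is the fourth execution-configuration argument of the kernel launch and the last argument of cudaMemcpyAsync().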
Shared Memory
~Hundred times faster than global memory
Use it to cache data from global memory accesses
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
Threads can cooperate via shared memory
share results with each other
contribute to a common result, e.g. block min/max/avg
Grid/Block Size Heuristics
# of blocks > # of multiprocessors
So all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2
Multiple blocks can run concurrently in a multiprocessor
Blocks that aren't waiting at a __syncthreads() keep the hardware busy
Subject to resource availability: registers, shared memory
# of blocks > 100 to scale to future devices
Blocks are executed in pipeline fashion
1000 blocks per grid will scale across multiple generations
Accuracy
GPU and CPU results may differ, but are
equally accurate (to specified ulp accuracy)
CPU operations aren't strictly limited to 0.5 ulp
Sequences of operations can be even more accurate
due to 80-bit extended precision ALUs
Compare GPU calculation to CPU SSE
And: Floating-point arithmetic is not associative!
Complex area (ask if unsure)
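The non-associativity is easy to reproduce even in plain host code, as the sketch below shows with a hypothetical function sumOrderMatters. A parallel reduction on the GPU adds the same numbers in a different order than a serial CPU loop, which is one reason the two results may differ in the last bits.

```cuda
#include <cstdio>

// Pure host code: the same three floats summed in two different orders.
bool sumOrderMatters()
{
    volatile float big = 1.0e20f;    // volatile keeps the compiler from
    volatile float small = 1.0f;     // simplifying the expressions away
    float left  = (big + -big) + small;   // 0 + 1 = 1
    float right = big + (-big + small);   // 1 is absorbed by -1e20, result 0
    printf("left = %g, right = %g\n", left, right);
    return left != right;
}
```

Neither result is wrong as far as the floating-point rules are concerned; they are simply answers to differently ordered sums.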
Summary
GPU hardware can achieve great performance on data-parallel computations if you follow a few simple guidelines:
Use parallelism efficiently
Coalesce memory accesses if possible
Take advantage of shared memory
Explore other memory spaces
Texture
Constant
(Reduce shared memory bank conflicts)
See the Programming Guide, Best Practices Guide and Reference Manual
If that doesn't help: Ask your local DevTech-Compute engineer :)