

GPU Programming with CUDA – Optimisation

Mike Griffiths

GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/

Hardware Model

[Diagram: the CPU (host), with its own DRAM and I/O, is connected over PCIe to the GPU (device), which has its own GDRAM and streaming multiprocessors (SMs) with shared memory. Main program code runs on the host; GPU kernel code runs on the device.]

Performance Inhibitors

• Data transfer to/from device memory
• Device under-utilisation and occupancy
• GPU memory bandwidth
• Code branching

Data Transfer

• CPU (host) and GPU (device) have separate dedicated memory
• All data read or written on the device must be copied across the PCIe bus
• This is a very expensive operation

• Optimisation technique: minimise data copies
• Keep resident data on the device
• Some computation may have to move to the GPU even if it is not computationally expensive
• It might be quicker to re-calculate data on the device than to copy it

Data Transfer Example

• Port the inexpensive routine to the device
• Minimise transfers by moving the copies out of the loop

Original code:

Loop over timesteps
    inexpensive_routine_on_host(data_on_host)
    copy data from host to device
    expensive_routine_on_device(data_on_device)
    copy data from device to host
End loop over timesteps

Optimised code:

copy data from host to device
Loop over timesteps
    inexpensive_routine_on_device(data_on_device)
    expensive_routine_on_device(data_on_device)
End loop over timesteps
copy data from device to host
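A minimal CUDA C sketch of this pattern (the kernel bodies, names and launch sizes are illustrative assumptions, not from the slides); the copies sit outside the timestep loop so the data stays resident on the device:

#include <cuda_runtime.h>

// Hypothetical stand-ins for the routines in the pseudocode above.
__global__ void inexpensive_routine_on_device(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;              // placeholder work
}

__global__ void expensive_routine_on_device(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];  // placeholder work
}

void timestep_loop(float *data_on_host, int n, int nsteps)
{
    float *data_on_device;
    cudaMalloc(&data_on_device, n * sizeof(float));

    // One copy in, one copy out: no per-timestep PCIe traffic.
    cudaMemcpy(data_on_device, data_on_host, n * sizeof(float),
               cudaMemcpyHostToDevice);
    int blocks = (n + 255) / 256;
    for (int t = 0; t < nsteps; t++) {
        inexpensive_routine_on_device<<<blocks, 256>>>(data_on_device, n);
        expensive_routine_on_device<<<blocks, 256>>>(data_on_device, n);
    }
    cudaMemcpy(data_on_host, data_on_device, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(data_on_device);
}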


Exposing Parallelism

• GPU performance relies on the use of many threads
• The degree of parallelism must be much higher than on the CPU
• Ideally you need many more threads than cores

• Effort must be made to expose as much parallelism as possible
• This may require re-engineering your problem

• If significant sections of code are serial then GPU acceleration will be limited
• Amdahl's Law, where B is the serial fraction of the code and N is the number of processors:

Speedup(N) = 1 / (B + (1 - B) / N)
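As an illustrative calculation (the numbers are assumptions, not from the slides): with a serial fraction B = 0.05, even N = 1000 processors give Speedup = 1 / (0.05 + 0.95/1000) ≈ 19.6, so the 5% serial section caps the acceleration at under 20x no matter how many threads the GPU runs.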

Memory Latency

• Access to GPU memory has several hundred cycles of latency
• When a thread is waiting for data it is stalled

• GPUs have very fast context switching
• Stalled threads can be switched with active threads
• Switching hides memory latency if other threads are performing compute
• This requires many threads, ideally performing large amounts of computation

• Optimisation technique: have lots of threads with high arithmetic intensity
• Arithmetic intensity is defined as the ratio of arithmetic computation to memory accesses
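As an illustrative estimate of arithmetic intensity (a standard SAXPY-style kernel, not taken from the slides):

// Per element: 2 floating-point operations (one multiply, one add)
// against 12 bytes of traffic (read x[i], read y[i], write y[i]),
// i.e. about 0.17 flops per byte. This is low arithmetic intensity,
// so the kernel is limited by memory bandwidth, not compute.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}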

Exposing Parallelism Example

Original code:

Loop over i from 1 to 512
    Loop over j from 1 to 512
        independent iteration

1D decomposition (512 threads) ✖:

Calc i from thread/block ID
Loop over j from 1 to 512
    independent iteration

2D decomposition (262,144 threads) ✔:

Calc i & j from thread/block ID
independent iteration
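A minimal CUDA sketch of the 2D decomposition (the kernel name, the placeholder body and the 16 x 16 block shape are assumptions):

// One thread per (i, j) pair: 512 x 512 = 262,144 threads in total.
__global__ void work_2d(float *data)   // data: 512 x 512, flattened row-major
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    if (i < 512 && j < 512)
        data[i * 512 + j] += 1.0f;     // placeholder independent iteration
}

// Launch: dim3 block(16, 16); dim3 grid(32, 32);
// work_2d<<<grid, block>>>(d_data);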


Memory Coalescing

• GPUs have high peak memory bandwidth
• Maximum bandwidth is achieved when data is accessed in large requests rather than many small requests
• Large requests must come from multiple threads
• Otherwise memory accesses are serialised, degrading performance
• Memory coalescing: consecutive threads accessing consecutive memory locations

• Optimisation technique: coalesced memory accesses reduce the number of requests and achieve higher bandwidth

Coalescing Example

• Consecutive threads are those with consecutive threadIdx.x values
• Question: Do consecutive threads access consecutive memory locations?

index = blockIdx.x*blockDim.x + threadIdx.x;
output[index] = 2*input[index];

• Yes: consecutive threadIdx.x values lead to consecutive index values, so consecutive threads access consecutive memory locations
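The same access pattern as a complete kernel (the kernel name and bounds check are illustrative additions):

__global__ void scale_by_two(const float *input, float *output, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)
        output[index] = 2 * input[index];  // consecutive threads hit consecutive addresses: coalesced
}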

Coalescing Example 2

• Question: Do consecutive threads access consecutive memory locations?

i = blockIdx.x*blockDim.x + threadIdx.x;
for (j=0; j<N; j++)
    output[i][j] = 2*input[i][j];

• No: consecutive threadIdx.x corresponds to consecutive i values; with row-major C arrays, elements with consecutive i but fixed j are N elements apart in memory

j = blockIdx.x*blockDim.x + threadIdx.x;
for (i=0; i<N; i++)
    output[i][j] = 2*input[i][j];

• Yes: consecutive threadIdx.x corresponds to consecutive j values, which map to consecutive memory locations

Memory Coalescing in 2D

• What about 2D or 3D decompositions?
• Exactly the same principle applies
• It is always threadIdx.x that increments with consecutive threads, so it should index the fastest-varying (contiguous) array dimension
• E.g. matrix addition:

int j = blockIdx.x * blockDim.x + threadIdx.x;
int i = blockIdx.y * blockDim.y + threadIdx.y;
c[i][j] = a[i][j] + b[i][j];
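A complete version of this matrix addition might look like the following sketch (the fixed size N and the flattened row-major layout are assumptions; the slides show only the indexing and addition lines):

#define N 1024   // assumed matrix dimension

__global__ void matrix_add(const float *a, const float *b, float *c)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column: follows threadIdx.x
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (i < N && j < N)
        c[i * N + j] = a[i * N + j] + b[i * N + j];  // coalesced: stride 1 in j
}

// Launch: dim3 block(16, 16); dim3 grid(N / 16, N / 16);
// matrix_add<<<grid, block>>>(d_a, d_b, d_c);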


Code Branching

• On NVIDIA GPUs there are fewer instruction scheduling units than cores
• Threads are scheduled in groups of 32, called a warp
• Threads within a warp execute the same instruction in lock-step
• Single Instruction Multiple Data (SIMD)

• CUDA C kernels are free to specify branches
• BUT all threads in a warp have to step through every code path taken within that warp

• Optimisation technique: avoid divergent branching within a warp wherever possible

Branching Example

• You want to split your threads into two groups:

Divergent (within every warp, threads take both paths):

i = blockIdx.x*blockDim.x + threadIdx.x;
if (i%2 == 0)
    …
else
    …

Warp-aligned (each warp of 32 threads takes a single path):

i = blockIdx.x*blockDim.x + threadIdx.x;
if ((i/32)%2 == 0)
    …
else
    …
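As a runnable sketch of the same pattern (the kernel names and the placeholder work in each branch are assumptions):

// Divergent: within every warp, odd and even threads take different paths,
// so the warp executes both paths serially.
__global__ void branch_divergent(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            data[i] += 1.0f;   // placeholder work for group A
        else
            data[i] -= 1.0f;   // placeholder work for group B
    }
}

// Warp-aligned: i/32 is constant across a warp, so every thread in a
// given warp takes the same path and no divergence occurs.
__global__ void branch_aligned(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / 32) % 2 == 0)
            data[i] += 1.0f;   // placeholder work for group A
        else
            data[i] -= 1.0f;   // placeholder work for group B
    }
}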

CUDA Profiling

• Set the COMPUTE_PROFILE environment variable to 1
• A log file will be created at runtime, e.g. cuda_profile_0.log
• The log contains timing information for kernels and data transfers
• It is possible to output more metrics (see doc/Compute_Profiler.txt)

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla M1060
# CUDA_CONTEXT 1
# TIMESTAMPFACTOR fffff6e2e9ee8858
method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 37.952 ] cputime=[ 86.000 ]
method=[ memcpyHtoD ] gputime=[ 37.376 ] cputime=[ 71.000 ]
method=[ memcpyHtoD ] gputime=[ 37.184 ] cputime=[ 57.000 ]
method=[ _Z23inverseEdgeDetect1D_colPfS_S_ ] gputime=[ 253.536 ] cputime=[ 13.000 ] occupancy=[ 0.250 ]
...
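For example, enabling the profiler from a shell might look like this (the executable name is a placeholder):

export COMPUTE_PROFILE=1
./my_application        # hypothetical program built with CUDA
cat cuda_profile_0.log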

Conclusions

• GPUs offer higher floating-point and memory-bandwidth performance than CPUs
• A number of factors will inhibit execution performance
• A number of techniques can be applied to circumvent these
• Some techniques may require re-engineering your problem
• If your application can't be adapted, GPU performance will not be good!

• It is important to have an understanding of the application, the architecture and the programming model