Intermediate GPGPU Programming in CUDA
Supada Laosooksathit
NVIDIA Hardware Architecture
Host memory
Recall
• 5 steps for CUDA Programming
  – Initialize device
  – Allocate device memory
  – Copy data to device memory
  – Execute kernel
  – Copy data back from device memory
Initialize Device Calls
• To select the device associated with the host thread
  – cudaSetDevice(device)
  – This function must be called before any __global__ function is launched; otherwise device 0 is automatically selected.
• To get the number of devices
  – cudaGetDeviceCount(&deviceCount)
• To retrieve a device's properties
  – cudaGetDeviceProperties(&deviceProp, device)
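A minimal sketch of the three calls above, put together in one host program (the device index 0 is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);          // number of CUDA-capable devices
    printf("Found %d device(s)\n", deviceCount);

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);   // properties of device 0
    printf("Device 0: %s, compute capability %d.%d\n",
           deviceProp.name, deviceProp.major, deviceProp.minor);

    cudaSetDevice(0);  // must precede any kernel launch, or device 0 is used
    return 0;
}
```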
Hello World Example
• Allocate host and device memory
• Host code
• Kernel code
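The original slides carried the code as images; a minimal sketch of a Hello World that prints block and thread IDs might look like this (device-side printf requires compute capability 2.0 or later):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints its own block and thread index.
__global__ void hello() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();        // launch 2 blocks of 4 threads
    cudaDeviceSynchronize();  // wait so the output is flushed before exit
    return 0;
}
```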
To Try CUDA Programming
• SSH to 138.47.102.111
• Set environment variables in .bashrc in your home directory:
  export PATH=$PATH:/usr/local/cuda/bin
  export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
• Compile the following directories
  – NVIDIA_GPU_Computing_SDK/shared/
  – NVIDIA_GPU_Computing_SDK/C/common/
• The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/
Demo
• Hello World
  – Print out block and thread IDs
• Vector Add
  – C = A + B
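The vector-add demo follows the 5-step pattern from the Recall slide. A minimal self-contained sketch (sizes and names are illustrative):

```cuda
#include <cuda_runtime.h>

// One thread computes one element of C = A + B.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hA = new float[n], *hB = new float[n], *hC = new float[n];
    for (int i = 0; i < n; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;                       // allocate device memory
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}
```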
NVIDIA Hardware Architecture
SM
Specifications of a Device
• For more details
  – deviceQuery in the CUDA SDK
  – Appendix F in the Programming Guide 4.0
Specification        Compute Capability 1.3   Compute Capability 2.0
Warp size            32                       32
Max threads/block    512                      1024
Max blocks/grid      65535                    65535
Shared memory        16 KB/SM                 48 KB/SM
Demo
• deviceQuery
  – Shows hardware specifications in detail
Memory Optimizations
• Reduce the time of memory transfers between host and device
  – Use asynchronous memory transfer (CUDA streams)
  – Use zero copy
• Reduce the number of transactions between on-chip and off-chip memory
  – Memory coalescing
• Avoid bank conflicts in shared memory
Reduce Time of Host-Device Memory Transfer
• Regular memory transfer (synchronous): the host blocks until the copy completes
Reduce Time of Host-Device Memory Transfer
• CUDA streams
  – Allow overlap between kernel execution and memory copies
CUDA Streams Example
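A minimal sketch of the streams idea (not the SDK's simpleStreams code): the work is split into chunks, and each chunk's copy-in, kernel, and copy-out are issued into its own stream, so copies in one stream can overlap kernels in another. Async copies require page-locked host memory.

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int nStreams = 2, chunk = 1 << 20;
    size_t bytes = chunk * sizeof(float);

    float *h, *d;
    cudaMallocHost(&h, nStreams * bytes);  // page-locked, required for async copy
    cudaMalloc(&d, nStreams * bytes);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Issue copy-kernel-copy per chunk; streams execute independently.
    for (int s = 0; s < nStreams; ++s) {
        float *hp = h + s * chunk, *dp = d + s * chunk;
        cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, streams[s]);
        work<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to drain

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```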
GPU Timers
• CUDA Events
  – A runtime API
  – Timestamps are recorded on the GPU's own clock
  – Accurate for timing kernel executions
• CUDA timer calls
  – Libraries implemented in the CUDA SDK
CUDA Events Example
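A minimal sketch of timing a kernel with CUDA events (the kernel itself is a placeholder):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // timestamp before the kernel
    busy<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);                // timestamp after the kernel
    cudaEventSynchronize(stop);              // wait until stop has been recorded

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```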
Demo
• simpleStreams
Reduce Time of Host-Device Memory Transfer
• Zero copy
  – Allow device pointers to access page-locked host memory directly
  – Page-locked host memory is allocated by cudaHostAlloc()
Demo
• Zero copy
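A minimal sketch of zero copy: the host buffer is allocated page-locked and mapped, and the kernel reads and writes it through a device pointer with no explicit cudaMemcpy.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;   // accesses go directly to host memory
}

int main() {
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);  // enable mapped pinned allocations

    float *h, *d;
    // Page-locked host memory mapped into the device address space.
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaHostGetDevicePointer(&d, h, 0);     // device pointer aliasing h
    scale<<<(n + 255) / 256, 256>>>(d, n);  // kernel works on host memory
    cudaDeviceSynchronize();                // h now holds the results

    cudaFreeHost(h);
    return 0;
}
```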
Reduce Number of On-chip and Off-chip Memory Transactions
• Threads in a warp access global memory
• Memory coalescing
  – Combine a warp's accesses so a batch of words is copied in a single transaction
Memory Coalescing
• Threads in a warp access global memory in a straightforward way (one 4-byte word per thread)
Memory Coalescing
• Memory addresses are aligned in the same segment but the accesses are not sequential
Memory Coalescing
• Memory addresses are not aligned in the same segment
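The three access patterns above can be contrasted with a sketch like this: the first kernel's warp touches one aligned segment, while the strided kernel scatters a warp's accesses across many segments and costs many transactions.

```cuda
// Coalesced: consecutive threads read consecutive 4-byte words,
// so a warp's accesses fall within one aligned segment.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Not coalesced: consecutive threads touch addresses `stride` floats
// apart, so a warp's accesses spread over many segments.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```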
Shared Memory
• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
• Helps with memory coalescing (stage data in shared memory, then access global memory in coalesced order)
• Bank conflicts may occur
  – Two or more threads access the same bank
  – In compute capability 1.x, no broadcast
  – In compute capability 2.x, the same data is broadcast to all threads that request it
Bank Conflicts
[Diagram: threads 0-3 mapped to banks 0-3. A one-to-one thread-to-bank mapping has no bank conflict; two threads hitting the same bank cause a 2-way bank conflict.]
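The two cases in the diagram can be sketched as kernels (assuming 32 banks, i.e. compute capability 2.x, and one warp per block):

```cuda
#define TILE 32

// Conflict-free: thread t reads tile[t], so each thread of the warp
// hits a different bank.
__global__ void noConflict(float *out) {
    __shared__ float tile[TILE];
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

// 2-way conflict: with a stride-2 pattern, threads t and t+16 both map
// to bank (2*t) % 32, so each bank serves two threads and the access
// is serialized into two passes.
__global__ void twoWayConflict(float *out) {
    __shared__ float tile[2 * TILE];
    tile[2 * threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[2 * threadIdx.x];
}
```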
Matrix Multiplication Example
• Shared memory reduces accesses to global memory
  – A is read (B.width/BLOCK_SIZE) times from global memory
  – B is read (A.height/BLOCK_SIZE) times from global memory
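A compact sketch of the shared-memory version, simplified from the Programming Guide's example to square n x n matrices with n a multiple of BLOCK_SIZE:

```cuda
#define BLOCK_SIZE 16

// C = A * B. Each block computes one BLOCK_SIZE x BLOCK_SIZE tile of C,
// staging tiles of A and B in shared memory so each global word loaded
// is reused BLOCK_SIZE times within the block.
__global__ void matMulShared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        // Each thread loads one element of each tile (coalesced loads).
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                     // tiles fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                     // done with this tile
    }
    C[row * n + col] = sum;
}
```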
Demo
• Matrix Multiplication– With and without shared memory– Different block sizes
Control Flow
• if, switch, do, for, while
• Branch divergence in a warp
  – Threads in a warp issue different instruction streams
  – The different execution paths are serialized
  – Increases the number of instructions executed by that warp
Branch Divergence
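A sketch of the contrast: in the first kernel even and odd lanes of the same warp take different paths, so both branches are executed serially; in the second the branch condition is uniform across each 32-thread warp, so no warp diverges.

```cuda
#include <math.h>

// Divergent: lanes of one warp split between the two branches.
__global__ void divergent(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) d[i] = sinf((float)i);
    else            d[i] = cosf((float)i);
}

// Warp-aligned: all 32 threads of a warp take the same path.
__global__ void warpAligned(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) d[i] = sinf((float)i);
    else                   d[i] = cosf((float)i);
}
```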
Summary
• 5 steps for CUDA Programming
• NVIDIA Hardware Architecture
  – Memory hierarchy: global memory, shared memory, register file
  – Specifications of a device: block, warp, thread, SM
Summary
• Memory optimization
  – Reduce overhead of host-device memory transfer with CUDA streams and zero copy
  – Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
  – Try to avoid bank conflicts in shared memory
• Control flow
  – Try to avoid branch divergence in a warp
References
• http://docs.nvidia.com/cuda/cuda-c-programming-guide/
• http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
• http://www.developer.nvidia.com/cuda-toolkit