Intermediate GPGPU Programming in CUDA
Supada Laosooksathit
NVIDIA Hardware Architecture
Host memory
Recall
• 5 steps for CUDA Programming
  – Initialize device
  – Allocate device memory
  – Copy data to device memory
  – Execute kernel
  – Copy data back from device memory
Initialize Device Calls
• To select the device associated with the host thread
  – cudaSetDevice(device)
  – This function must be called before any __global__ function is launched; otherwise device 0 is automatically selected.
• To get the number of devices
  – cudaGetDeviceCount(&deviceCount)
• To retrieve a device's properties
  – cudaGetDeviceProperties(&deviceProp, device)
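A minimal sketch of the three calls above, put together in one host program (the device index 0 is illustrative):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);          // number of CUDA-capable devices
    printf("Found %d device(s)\n", deviceCount);

    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, 0);   // properties of device 0
    printf("Device 0: %s, compute capability %d.%d\n",
           deviceProp.name, deviceProp.major, deviceProp.minor);

    cudaSetDevice(0);  // must precede any kernel launch, or device 0 is used
    return 0;
}
```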
Hello World Example
• Allocate host and device memory
• Host code
• Kernel code
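The original slides carried the code as images; a minimal sketch of a Hello World that prints block and thread IDs might look like this (device-side printf requires compute capability 2.0 or later):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread prints its own block and thread index.
__global__ void hello() {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello<<<2, 4>>>();        // launch 2 blocks of 4 threads
    cudaDeviceSynchronize();  // wait so the output is flushed before exit
    return 0;
}
```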
To Try CUDA Programming
• SSH to 138.47.102.111
• Set environment variables in .bashrc in your home directory:
  export PATH=$PATH:/usr/local/cuda/bin
  export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK
• Compile the following directories
  – NVIDIA_GPU_Computing_SDK/shared/
  – NVIDIA_GPU_Computing_SDK/C/common/
• The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/
Demo
• Hello World
  – Print out block and thread IDs
• Vector Add
  – C = A + B
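The vector-add demo follows the 5-step pattern from the Recall slide. A minimal self-contained sketch (sizes and names are illustrative):

```cuda
#include <cuda_runtime.h>

// One thread computes one element of C = A + B.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        C[i] = A[i] + B[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *hA = new float[n], *hB = new float[n], *hC = new float[n];
    for (int i = 0; i < n; ++i) { hA[i] = i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;                       // allocate device memory
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    delete[] hA; delete[] hB; delete[] hC;
    return 0;
}
```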
NVIDIA Hardware Architecture
SM
Specifications of a Device
• For more details
  – deviceQuery in the CUDA SDK
  – Appendix F in the Programming Guide 4.0
Specification        Compute Capability 1.3   Compute Capability 2.0
Warp size            32                       32
Max threads/block    512                      1024
Max blocks/grid      65535                    65535
Shared memory        16 KB/SM                 48 KB/SM
Demo
• deviceQuery
  – Shows hardware specifications in detail
Memory Optimizations
• Reduce the time of memory transfers between host and device
  – Use asynchronous memory transfer (CUDA streams)
  – Use zero copy
• Reduce the number of transactions between on-chip and off-chip memory
  – Memory coalescing
• Avoid bank conflicts in shared memory
Reduce Time of Host-Device Memory Transfer
• Regular memory transfer (synchronous): the host blocks until the copy completes
Reduce Time of Host-Device Memory Transfer
• CUDA streams
  – Allow overlap between kernel execution and memory copies
CUDA Streams Example
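A minimal sketch of the streams idea (not the SDK's simpleStreams code): the work is split into chunks, and each chunk's copy-in, kernel, and copy-out are issued into its own stream, so copies in one stream can overlap kernels in another. Async copies require page-locked host memory.

```cuda
#include <cuda_runtime.h>

__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int nStreams = 2, chunk = 1 << 20;
    size_t bytes = chunk * sizeof(float);

    float *h, *d;
    cudaMallocHost(&h, nStreams * bytes);  // page-locked, required for async copy
    cudaMalloc(&d, nStreams * bytes);

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Issue copy-kernel-copy per chunk; streams execute independently.
    for (int s = 0; s < nStreams; ++s) {
        float *hp = h + s * chunk, *dp = d + s * chunk;
        cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, streams[s]);
        work<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();  // wait for all streams to drain

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```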
GPU Timers
• CUDA Events
  – A runtime API
  – Timestamps are recorded on the GPU's own clock
  – Accurate for timing kernel executions
• CUDA timer calls
  – Libraries implemented in the CUDA SDK
CUDA Events Example
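A minimal sketch of timing a kernel with CUDA events (the kernel itself is a placeholder):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // timestamp before the kernel
    busy<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);                // timestamp after the kernel
    cudaEventSynchronize(stop);              // wait until stop has been recorded

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```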
Demo
• simpleStreams
Reduce Time of Host-Device Memory Transfer
• Zero copy
  – Allow device pointers to access page-locked host memory directly
  – Page-locked host memory is allocated by cudaHostAlloc()
Demo
• Zero copy
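A minimal sketch of zero copy: the host buffer is allocated page-locked and mapped, and the kernel reads and writes it through a device pointer with no explicit cudaMemcpy.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 0.5f;   // accesses go directly to host memory
}

int main() {
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);  // enable mapped pinned allocations

    float *h, *d;
    // Page-locked host memory mapped into the device address space.
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaHostGetDevicePointer(&d, h, 0);     // device pointer aliasing h
    scale<<<(n + 255) / 256, 256>>>(d, n);  // kernel works on host memory
    cudaDeviceSynchronize();                // h now holds the results

    cudaFreeHost(h);
    return 0;
}
```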
Reduce Number of On-chip and Off-chip Memory Transactions
• Threads in a warp access global memory
• Memory coalescing
  – Combine a warp's accesses so a batch of words is copied in a single transaction
Memory Coalescing
• Threads in a warp access global memory in a straightforward way (one 4-byte word per thread)
Memory Coalescing
• Memory addresses are aligned in the same segment but the accesses are not sequential
Memory Coalescing
• Memory addresses are not aligned in the same segment
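The three access patterns above can be contrasted with a sketch like this: the first kernel's warp touches one aligned segment, while the strided kernel scatters a warp's accesses across many segments and costs many transactions.

```cuda
// Coalesced: consecutive threads read consecutive 4-byte words,
// so a warp's accesses fall within one aligned segment.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Not coalesced: consecutive threads touch addresses `stride` floats
// apart, so a warp's accesses spread over many segments.
__global__ void strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```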
Shared Memory
• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x
• Helps with memory coalescing (stage data in shared memory, then access global memory in coalesced order)
• Bank conflicts may occur
  – Two or more threads access the same bank
  – In compute capability 1.x, no broadcast
  – In compute capability 2.x, the same data is broadcast to all threads that request it
Bank Conflicts
[Diagram: threads 0-3 mapped to banks 0-3. A one-to-one thread-to-bank mapping has no bank conflict; two threads hitting the same bank cause a 2-way bank conflict.]
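The two cases in the diagram can be sketched as kernels (assuming 32 banks, i.e. compute capability 2.x, and one warp per block):

```cuda
#define TILE 32

// Conflict-free: thread t reads tile[t], so each thread of the warp
// hits a different bank.
__global__ void noConflict(float *out) {
    __shared__ float tile[TILE];
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}

// 2-way conflict: with a stride-2 pattern, threads t and t+16 both map
// to bank (2*t) % 32, so each bank serves two threads and the access
// is serialized into two passes.
__global__ void twoWayConflict(float *out) {
    __shared__ float tile[2 * TILE];
    tile[2 * threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[2 * threadIdx.x];
}
```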
Matrix Multiplication Example
• Shared memory reduces accesses to global memory
  – A is read (B.width/BLOCK_SIZE) times from global memory
  – B is read (A.height/BLOCK_SIZE) times from global memory
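A compact sketch of the shared-memory version, simplified from the Programming Guide's example to square n x n matrices with n a multiple of BLOCK_SIZE:

```cuda
#define BLOCK_SIZE 16

// C = A * B. Each block computes one BLOCK_SIZE x BLOCK_SIZE tile of C,
// staging tiles of A and B in shared memory so each global word loaded
// is reused BLOCK_SIZE times within the block.
__global__ void matMulShared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        // Each thread loads one element of each tile (coalesced loads).
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();                     // tiles fully loaded

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                     // done with this tile
    }
    C[row * n + col] = sum;
}
```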
Demo
• Matrix Multiplication– With and without shared memory– Different block sizes
Control Flow
• if, switch, do, for, while
• Branch divergence in a warp
  – Threads in a warp issue different instruction streams
  – The different execution paths are serialized
  – Increases the number of instructions executed by that warp
Branch Divergence
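A sketch of the contrast: in the first kernel even and odd lanes of the same warp take different paths, so both branches are executed serially; in the second the branch condition is uniform across each 32-thread warp, so no warp diverges.

```cuda
#include <math.h>

// Divergent: lanes of one warp split between the two branches.
__global__ void divergent(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) d[i] = sinf((float)i);
    else            d[i] = cosf((float)i);
}

// Warp-aligned: all 32 threads of a warp take the same path.
__global__ void warpAligned(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) d[i] = sinf((float)i);
    else                   d[i] = cosf((float)i);
}
```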
Summary
• 5 steps for CUDA Programming
• NVIDIA Hardware Architecture
  – Memory hierarchy: global memory, shared memory, register file
  – Specifications of a device: block, warp, thread, SM
Summary
• Memory optimization
  – Reduce overhead of host-device memory transfer with CUDA streams and zero copy
  – Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
  – Try to avoid bank conflicts in shared memory
• Control flow
  – Try to avoid branch divergence in a warp
References
• http://docs.nvidia.com/cuda/cuda-c-programming-guide/
• http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
• http://www.developer.nvidia.com/cuda-toolkit