Intermediate GPGPU Programming in CUDA

Supada Laosooksathit

NVIDIA Hardware Architecture

Host memory

Recall

• 5 steps for CUDA Programming
  – Initialize device
  – Allocate device memory
  – Copy data to device memory
  – Execute kernel
  – Copy data back from device memory

Initialize Device Calls

• To select the device associated with the host thread
  – cudaSetDevice(device)
  – This function must be called before any __global__ function, otherwise device 0 is automatically selected.

• To get the number of devices
  – cudaGetDeviceCount(&deviceCount)

• To retrieve a device's properties
  – cudaGetDeviceProperties(&deviceProp, device)
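
Putting the three calls together (a minimal sketch; error checking omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);        // number of CUDA-capable devices
    printf("Found %d device(s)\n", deviceCount);

    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, device);   // fill the property struct
        printf("Device %d: %s, compute capability %d.%d\n",
               device, deviceProp.name, deviceProp.major, deviceProp.minor);
    }

    cudaSetDevice(0);   // bind this host thread to device 0 before any kernel launch
    return 0;
}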

Hello World Example

• Allocate host and device memory
• Host code
• Kernel code
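
The code on these slides is not reproduced in this transcript; the following is a minimal sketch of the same demo, matching the "print block and thread IDs" version listed on the Demo slide. Device-side printf() requires compute capability 2.0 or later (compile with nvcc -arch=sm_20):

#include <stdio.h>
#include <cuda_runtime.h>

// Kernel code: every thread prints its block and thread ID.
__global__ void helloWorld(void) {
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    cudaSetDevice(0);           // step 1: initialize device
    helloWorld<<<2, 4>>>();     // step 4: execute kernel (2 blocks of 4 threads)
    cudaDeviceSynchronize();    // wait for the kernel so its output is flushed
    return 0;
}

Since this demo moves no data, steps 2, 3, and 5 do not appear here; the vector-add sketch below walks through all five.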

To Try CUDA Programming

• SSH to 138.47.102.111
• Set environment variables in .bashrc in your home directory:

export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

• Copy the SDK from /home/students/NVIDIA_GPU_Computing_SDK

• Compile the following directories
  – NVIDIA_GPU_Computing_SDK/shared/
  – NVIDIA_GPU_Computing_SDK/C/common/

• The sample codes are in NVIDIA_GPU_Computing_SDK/C/src/

Demo

• Hello World
  – Print out block and thread IDs

• Vector Add
  – C = A + B
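
A sketch of the vector-add demo, walking through all five steps (the array size is arbitrary; error checking omitted):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel: one thread computes one element of C = A + B.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main(void) {
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    cudaSetDevice(0);                                   // step 1: initialize device

    float *dA, *dB, *dC;                                // step 2: allocate device memory
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // step 3: copy data to device
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);    // step 4: execute kernel

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // step 5: copy data back

    printf("C[10] = %.1f\n", hC[10]);                   // expect 30.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}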

NVIDIA Hardware Architecture

SM (streaming multiprocessor)

Specifications of a Device

• For more details
  – deviceQuery in CUDA SDK
  – Appendix F in Programming Guide 4.0

Specification       Compute Capability 1.3   Compute Capability 2.0
Warp size           32                       32
Max threads/block   512                      1024
Max blocks/grid     65535                    65535
Shared memory       16 KB/SM                 48 KB/SM

Demo

• deviceQuery
  – Shows hardware specifications in detail

Memory Optimizations

• Reduce the time of memory transfer between host and device
  – Use asynchronous memory transfer (CUDA streams)
  – Use zero copy

• Reduce the number of transactions between on-chip and off-chip memory
  – Memory coalescing

• Avoid bank conflicts in shared memory

Reduce Time of Host-Device Memory Transfer

• Regular memory transfer (synchronous): the host blocks until the copy completes

Reduce Time of Host-Device Memory Transfer

• CUDA streams
  – Allow overlap between kernel execution and memory copies

CUDA Streams Example
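
The example code from these slides is not in the transcript; below is a minimal sketch of the same idea. Asynchronous copies require page-locked (pinned) host memory, so the host buffer is allocated with cudaMallocHost():

#include <stdio.h>
#include <cuda_runtime.h>

// Placeholder kernel: doubles every element of its chunk.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20, nStreams = 4, chunk = n / nStreams;

    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));   // pinned host memory
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk in, runs the kernel on it, and copies it
    // back. Operations in different streams may overlap: stream 1's copy can
    // run while stream 0's kernel executes.
    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();          // wait for all streams to finish

    printf("h[0] = %.1f\n", h[0]);    // expect 2.0
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}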

GPU Timers

• CUDA events
  – An API in the CUDA runtime
  – Timestamps are taken on the GPU clock
  – Accurate for timing kernel executions

• CUDA timer calls
  – Timer libraries implemented in the CUDA SDK

CUDA Events Example
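
A minimal sketch of timing a kernel with CUDA events (the kernel is a placeholder):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void busyWork(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i] + 1.0f;
}

int main(void) {
    const int n = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // timestamp before the kernel
    busyWork<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);                // timestamp after the kernel
    cudaEventSynchronize(stop);              // block until the stop event is recorded

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("Kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}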

Demo

• simpleStreams

Reduce Time of Host-Device Memory Transfer

• Zero copy
  – Allows device pointers to access page-locked host memory directly
  – Page-locked host memory is allocated by cudaHostAlloc()
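
A minimal zero-copy sketch (the kernel is a placeholder). Mapping host memory must be enabled with cudaSetDeviceFlags(cudaDeviceMapHost) before the device is initialized, and the device must support mapped memory:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void increment(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void) {
    const int n = 1024;
    cudaSetDeviceFlags(cudaDeviceMapHost);   // enable mapped page-locked memory

    // Page-locked host memory, mapped into the device address space.
    float *h, *d;
    cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    // Device pointer aliasing the host allocation: no cudaMemcpy needed.
    cudaHostGetDevicePointer((void **)&d, h, 0);

    increment<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();                 // the kernel wrote straight into host memory

    printf("h[5] = %.1f\n", h[5]);           // expect 6.0
    cudaFreeHost(h);
    return 0;
}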

Demo

• Zero copy

Reduce Number of On-chip and Off-chip Memory Transactions

• Threads in a warp access global memory
• Memory coalescing
  – Transfers a batch of words in a single memory transaction

Memory Coalescing

• Threads in a warp access global memory in a straightforward way (one 4-byte word per thread, at consecutive addresses)

Memory Coalescing

• Memory addresses are aligned in the same segment but the accesses are not sequential

Memory Coalescing

• Memory addresses are not aligned in the same segment
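
Three sketch kernels illustrating these access patterns (the copy is a stand-in for real work):

// Coalesced: consecutive threads read consecutive 4-byte words, so a warp's
// 32 accesses are served by a few wide transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Aligned but not sequential: threads access a permutation of the words in
// one segment. Compute capability 2.x still coalesces this into one transaction.
__global__ void copyPermuted(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + (threadIdx.x ^ 1);   // swap neighboring threads
    if (i < n) out[i] = in[i];
}

// Misaligned: a nonzero offset shifts the warp across a segment boundary,
// which costs extra transactions, especially on compute capability 1.x.
__global__ void copyMisaligned(const float *in, float *out, int n, int offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (i < n) out[i] = in[i];
}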

Shared Memory

• 16 banks for compute capability 1.x, 32 banks for compute capability 2.x

• Helps utilize memory coalescing
• Bank conflicts may occur
  – Two or more threads in a warp access the same bank
  – In compute capability 1.x, no broadcast
  – In compute capability 2.x, the same data is broadcast to all threads that request it

Bank Conflicts

[Diagrams: threads 0-3 mapped to banks 0-3. Left: each thread accesses a different bank (no bank conflict). Right: pairs of threads access the same bank (2-way bank conflict).]
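
A sketch of both cases in shared memory, assuming compute capability 2.x (32 banks, one 4-byte word per bank; launch with a single 32x32 block):

__global__ void bankAccessDemo(float *out) {
    __shared__ float tile[32][32];

    int tx = threadIdx.x, ty = threadIdx.y;
    tile[ty][tx] = (float)tx;
    __syncthreads();

    // Conflict-free: a warp (fixed ty, tx = 0..31) reads one row, so its 32
    // words fall in 32 different banks.
    float a = tile[ty][tx];

    // 32-way conflict: a warp reads down a column, tile[0..31][ty]. All 32
    // addresses map to bank ty, so the access is serialized into 32 steps.
    // Padding the array to tile[32][33] shifts each row by one bank and
    // removes the conflict.
    float b = tile[tx][ty];

    out[ty * 32 + tx] = a + b;
}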

Matrix Multiplication Example

• Reduce accesses to global memory
  – A is read (B.width/BLOCK_SIZE) times from global memory
  – B is read (A.height/BLOCK_SIZE) times from global memory
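
A sketch of the shared-memory version, in the style of the Programming Guide example (square n x n matrices with n a multiple of BLOCK_SIZE, for brevity):

#define BLOCK_SIZE 16

// Each block computes one BLOCK_SIZE x BLOCK_SIZE tile of C. At each step t
// it stages one tile of A and one tile of B in shared memory, so each global
// memory word is loaded once per tile instead of once per thread.
__global__ void matMulShared(const float *A, const float *B, float *C, int n) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / BLOCK_SIZE; ++t) {
        // Each thread loads one element of the current A tile and B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * BLOCK_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * BLOCK_SIZE + threadIdx.y) * n + col];
        __syncthreads();   // tiles must be fully loaded before use

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // finish with the tiles before they are overwritten
    }
    C[row * n + col] = sum;
}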

Demo

• Matrix Multiplication
  – With and without shared memory
  – Different block sizes

Control Flow

• if, switch, do, for, while
• Branch divergence in a warp
  – Threads in a warp take different execution paths
• The different execution paths are serialized
• This increases the number of instructions executed by that warp

Branch Divergence
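
Two sketch kernels contrasting a divergent branch with one whose condition changes only at warp granularity:

// Divergent: the condition depends on the thread's position within its warp,
// so half of each warp takes the if-path and half the else-path. The warp
// executes both paths one after the other.
__global__ void divergentBranch(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        d[i] *= 2.0f;
    else
        d[i] += 1.0f;
}

// Not divergent: the condition is uniform across each 32-thread warp, so
// every warp takes a single path and nothing is serialized.
__global__ void uniformBranch(float *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)
        d[i] *= 2.0f;
    else
        d[i] += 1.0f;
}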

Summary

• 5 steps for CUDA Programming
• NVIDIA hardware architecture
  – Memory hierarchy: global memory, shared memory, register file
  – Specifications of a device: block, warp, thread, SM

Summary

• Memory optimization
  – Reduce host-device memory transfer overhead with CUDA streams and zero copy
  – Reduce the number of transactions between on-chip and off-chip memory by utilizing memory coalescing (shared memory)
  – Try to avoid bank conflicts in shared memory

• Control flow
  – Try to avoid branch divergence in a warp
