An Introduction to Programming with CUDA
Paul Richmond
GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/


Page 1: An Introduction to Programming with CUDA Paul Richmond

An Introduction to Programming with CUDA

Paul Richmond

GPUComputing@Sheffield
http://gpucomputing.sites.sheffield.ac.uk/

Page 2: An Introduction to Programming with CUDA Paul Richmond

• Motivation
• Introduction to CUDA Kernels
• CUDA Memory Management

Overview


Page 4: An Introduction to Programming with CUDA Paul Richmond

• Traditional sequential languages are not suitable for GPUs
• GPUs are data parallel, NOT task parallel
• CUDA allows NVIDIA GPUs to be programmed in C/C++ (also in Fortran)
  • Language extensions for compute “kernels”
  • Kernels execute on multiple threads concurrently
  • API functions provide management (e.g. memory, synchronisation, etc.)

About NVIDIA CUDA

Page 5: An Introduction to Programming with CUDA Paul Richmond

[Diagram: the main program code runs on the CPU with its DRAM; GPU kernel code runs on the NVIDIA GPU with its GDRAM; the two are connected by the PCIe bus]

Page 6: An Introduction to Programming with CUDA Paul Richmond

• Data set decomposed into a stream of elements
• A single computational function (kernel) operates on each element
  • A thread is the execution of a kernel on one data element
• Multiple Streaming Multiprocessor cores can operate on multiple elements in parallel
  • Many parallel threads
• Suitable for data-parallel problems

Stream Computing

Page 7: An Introduction to Programming with CUDA Paul Richmond

• NVIDIA GPUs have a 2-level hierarchy
  • Each Streaming Multiprocessor (SM) has multiple cores
• The number of SMs, and cores per SM, varies between devices

Hardware Model

[Diagram: a GPU containing several SMs, each with its own shared memory, all attached to device memory]

Page 8: An Introduction to Programming with CUDA Paul Richmond

• Hardware is abstracted as a Grid of Thread Blocks
  • Blocks map to SMs
  • Each thread maps onto an SM core
• You don’t need to know the hardware characteristics
  • Oversubscribe and allow the hardware to perform the scheduling
  • More blocks than SMs and more threads than cores
• Code is portable across different GPU versions

CUDA Software Model

[Diagram: the software hierarchy: a Grid contains Blocks; a Block contains Threads]

Page 9: An Introduction to Programming with CUDA Paul Richmond

• CUDA introduces new vector types, e.g. int2, float4 and, most importantly for launch configurations, dim3
• dim3 contains a collection of three integers (x, y, z); any component left unspecified defaults to 1

dim3 my_xyz (x_value, y_value, z_value);

• Values are accessed as members

int x = my_xyz.x;

CUDA Vector Types
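
As a minimal illustrative sketch (the variable names are ours, not from the slides), built as a .cu file with nvcc:

#include <cstdio>

int main(void)
{
    dim3 my_xyz(4, 2, 1);  // x=4, y=2, z=1
    dim3 flat(128);        // unspecified components default to 1: (128,1,1)

    int x = my_xyz.x;      // values are accessed as members
    printf("%d %u %u\n", x, flat.y, flat.z);  // prints: 4 1 1
    return 0;
}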

Page 10: An Introduction to Programming with CUDA Paul Richmond

• threadIdx
  • The location of a thread within a block, e.g. (2,1,0)
• blockIdx
  • The location of a block within a grid, e.g. (1,0,0)
• blockDim
  • The dimensions of the blocks, e.g. (3,9,1)
• gridDim
  • The dimensions of the grid, e.g. (3,2,1)

Idx values use zero-based indices; Dim values are a size

Special dim3 Vectors

[Diagram: threadIdx locates a thread within its Block; blockIdx locates the Block within the Grid]
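
These vectors combine to give every thread a unique global index; a minimal sketch (the kernel name is ours):

__global__ void whoAmI(int *out)
{
    // Blocks before this one, plus this thread's position within its block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i;  // each thread writes its own global index
}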

Page 11: An Introduction to Programming with CUDA Paul Richmond
Page 12: An Introduction to Programming with CUDA Paul Richmond

• Students arrive at halls of residence to check in
  • Rooms allocated in order
• Unfortunately admission rates are down!
  • Only half as many rooms as students
  • Each student can be moved from room i to room 2i so that no-one has a neighbour

Analogy

Page 13: An Introduction to Programming with CUDA Paul Richmond

• The receptionist performs the following tasks:
  1. Asks each student their assigned room number
  2. Works out their new room number
  3. Informs them of their new room number

Serial Solution

Page 14: An Introduction to Programming with CUDA Paul Richmond

“Everybody check your room number. Multiply it by 2 and go to that room”

Parallel Solution

Page 15: An Introduction to Programming with CUDA Paul Richmond

• Motivation
• Introduction to CUDA Kernels
• CUDA Memory Management

Overview

Page 16: An Introduction to Programming with CUDA Paul Richmond

• Serial solution

for (i = 0; i < N; i++) {
    result[i] = 2 * i;
}

• We can parallelise this by assigning each iteration to a CUDA thread!

A Coded Example

Page 17: An Introduction to Programming with CUDA Paul Richmond

__global__ void myKernel(int *result)
{
    int i = threadIdx.x;
    result[i] = 2 * i;
}

• Replace the loop with a “kernel”
  • Use the __global__ specifier to indicate it is GPU code
• Use the threadIdx variable to get a unique index
  • Assuming, for simplicity, we have only one block
  • Equivalent to your door number at the CUDA Halls of Residence

CUDA C Example: Device

Page 18: An Introduction to Programming with CUDA Paul Richmond

• Call the kernel by using the CUDA kernel launch syntax:
  kernel<<<GRID OF BLOCKS, BLOCK OF THREADS>>>(arguments);

dim3 blocksPerGrid(1,1,1);    // use only one block
dim3 threadsPerBlock(N,1,1);  // use N threads in the block

myKernel<<<blocksPerGrid, threadsPerBlock>>>(result);

CUDA C Example: Host

Page 19: An Introduction to Programming with CUDA Paul Richmond

• Only one block will give poor performance
  • A block gets allocated to a single SM, leaving the other SMs idle (and a single block is limited to 1024 threads on current hardware)
• Solution: use multiple blocks

dim3 blocksPerGrid(N/256,1,1);  // assumes 256 divides N exactly
dim3 threadsPerBlock(256,1,1);  // 256 threads in the block

myKernel<<<blocksPerGrid, threadsPerBlock>>>(result);

CUDA C Example: Host
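
If 256 does not divide N exactly, a common pattern (our sketch, not from the slides) is to round the block count up and add a bounds check so that surplus threads do nothing:

__global__ void myKernel(int *result, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)  // threads past the end of the data do nothing
        result[i] = 2 * i;
}

// Host: round up so every element is covered even when N % 256 != 0
dim3 blocksPerGrid((N + 255) / 256, 1, 1);
dim3 threadsPerBlock(256, 1, 1);
myKernel<<<blocksPerGrid, threadsPerBlock>>>(result, N);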

Page 20: An Introduction to Programming with CUDA Paul Richmond

//Kernel Code
__global__ void vectorAdd(float *a, float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

//Host Code
...
dim3 blocksPerGrid(N/256,1,1);  // assuming 256 divides N exactly
dim3 threadsPerBlock(256,1,1);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);

Vector Addition Example

Page 21: An Introduction to Programming with CUDA Paul Richmond

//Device Code
__global__ void matrixAdd(float a[N][N], float b[N][N], float c[N][N])
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    c[i][j] = a[i][j] + b[i][j];
}

//Host Code
...
dim3 blocksPerGrid(N/16,N/16,1);  // (N/16)x(N/16) blocks/grid (2D)
dim3 threadsPerBlock(16,16,1);    // 16x16=256 threads/block (2D)
matrixAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);

A 2D Matrix Addition Example
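
Note that the float a[N][N] parameters only work when N is a compile-time constant. In practice device matrices are often passed as flat pointers and indexed manually; a sketch of the same kernel in that style (our variant, not from the slides):

__global__ void matrixAddFlat(float *a, float *b, float *c, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row
    int idx = i * n + j;  // row-major flattening of the 2D index
    c[idx] = a[idx] + b[idx];
}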

Page 22: An Introduction to Programming with CUDA Paul Richmond

• Motivation• Introduction to CUDA Kernels• CUDA Memory Management

Overview

Page 23: An Introduction to Programming with CUDA Paul Richmond

• The GPU has its own dedicated memory, separate from the host CPU’s
  • Data accessed in kernels must be in GPU memory
  • Data must be explicitly copied and transferred
• cudaMalloc() is used to allocate memory on the GPU
• cudaFree() releases memory

float *a;
cudaMalloc(&a, N*sizeof(float));
...
cudaFree(a);

Memory Management
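
cudaMalloc (like most CUDA runtime calls) returns a cudaError_t that is worth checking; a minimal sketch:

float *a;
cudaError_t err = cudaMalloc(&a, N * sizeof(float));
if (err != cudaSuccess) {
    // cudaGetErrorString turns the error code into readable text
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
}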

Page 24: An Introduction to Programming with CUDA Paul Richmond

• Once memory has been allocated, we need to copy data to it and from it
• cudaMemcpy() transfers memory between host and device, in either direction

cudaMemcpy(array_device, array_host, N*sizeof(float), cudaMemcpyHostToDevice);

cudaMemcpy(array_host, array_device, N*sizeof(float), cudaMemcpyDeviceToHost);

• The first argument is always the destination of the transfer
• Transfers are relatively slow and should be minimised where possible

Memory Copying
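
Putting the pieces together, a minimal end-to-end host program for the earlier vectorAdd kernel might look like this (a sketch; error checking omitted for brevity):

#include <cuda_runtime.h>
#include <stdio.h>

#define N 1024

__global__ void vectorAdd(float *a, float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
}

int main(void)
{
    float a_host[N], b_host[N], c_host[N];
    for (int i = 0; i < N; i++) { a_host[i] = i; b_host[i] = 2.0f * i; }

    // Allocate device memory
    float *a, *b, *c;
    cudaMalloc(&a, N * sizeof(float));
    cudaMalloc(&b, N * sizeof(float));
    cudaMalloc(&c, N * sizeof(float));

    // Copy inputs to the device
    cudaMemcpy(a, a_host, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b, b_host, N * sizeof(float), cudaMemcpyHostToDevice);

    // Launch N/256 blocks of 256 threads (256 divides N exactly)
    vectorAdd<<<N / 256, 256>>>(a, b, c);

    // Copy the result back (cudaMemcpy blocks until the kernel has finished)
    cudaMemcpy(c_host, c, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("c[10] = %f\n", c_host[10]);  // expect 30.0

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}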

Page 25: An Introduction to Programming with CUDA Paul Richmond

• Kernel calls are non-blocking
  • The host continues after the kernel launch
  • Overlaps CPU and GPU execution

• cudaDeviceSynchronize() (the replacement for the older cudaThreadSynchronize()) can be called from the host to block until GPU kernels have completed

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c);
// do work on host (that doesn’t depend on c)
cudaDeviceSynchronize();  // wait for kernel to finish

• Standard cudaMemcpy calls are blocking
  • Non-blocking variants exist (e.g. cudaMemcpyAsync), as sketched below

Synchronisation
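
For illustration, a sketch of the non-blocking variant using cudaMemcpyAsync with a CUDA stream (reusing names from the earlier examples; pinned host memory is required for the copy to be truly asynchronous):

cudaStream_t stream;
cudaStreamCreate(&stream);

float *h_a;
cudaMallocHost(&h_a, N * sizeof(float));  // page-locked (pinned) host memory

// Enqueue the copy and the kernel on the stream; both calls return immediately
cudaMemcpyAsync(a, h_a, N * sizeof(float), cudaMemcpyHostToDevice, stream);
vectorAdd<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(a, b, c);

// ... independent host work here ...

cudaStreamSynchronize(stream);  // block until the stream's work is complete
cudaStreamDestroy(stream);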

Page 26: An Introduction to Programming with CUDA Paul Richmond

• __syncthreads() can be used within a kernel to synchronise between threads in a block
• Threads in the same block can therefore communicate through a shared memory space

__shared__ int array[1];  // shared memory: visible to all threads in the block

if (threadIdx.x == 0) array[0] = x;
__syncthreads();  // ensure the write is visible before any thread reads
if (threadIdx.x == 1) x = array[0];

• It is NOT possible to synchronise between threads in different blocks
  • A kernel exit does, however, guarantee synchronisation

Synchronisation Between Threads
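
A slightly fuller sketch of intra-block communication (the kernel and its logic are ours): each thread stores one value in shared memory, the block synchronises, then each thread reads its neighbour's value.

__global__ void shiftLeft(int *in, int *out)
{
    __shared__ int tile[256];  // one slot per thread; launch with 256-thread blocks
    int t = threadIdx.x;

    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();  // all writes complete before any thread reads

    // Read the next thread's value, wrapping around within the block
    out[blockIdx.x * blockDim.x + t] = tile[(t + 1) % blockDim.x];
}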

Page 27: An Introduction to Programming with CUDA Paul Richmond

• CUDA C code is compiled using nvcc, e.g.
  • nvcc compiles both host AND device code to produce an executable

nvcc -o example example.cu

Compiling a CUDA program
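
nvcc also accepts a target architecture flag; the value below is only an example (match sm_XX to your GPU's compute capability):

# Target compute capability 7.0 hardware, for example
nvcc -arch=sm_70 -o example example.cu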

Page 28: An Introduction to Programming with CUDA Paul Richmond

• Traditional languages alone are not sufficient for programming GPUs
• CUDA allows NVIDIA GPUs to be programmed using C/C++
  • It defines language extensions and APIs to enable this
• We introduced the key CUDA concepts and gave examples
  • Kernels for the device
  • Host API for memory management and kernel launching

Now let’s try it out…

Summary