High Performance Computing with GPUs: An Introduction
GPU Tutorial: How To Program for GPUs
Krešimir Ćosić (University of Split, Croatia), <[email protected]>
LSST All Hands Meeting 2010, Tucson, AZ
Thursday, August 12th, 2010
Overview
• CUDA
• Hardware architecture
• Programming model
• Convolution on the GPU
CUDA
• 'Compute Unified Device Architecture'
  – hardware and software architecture for issuing and managing computations on the GPU
• Massively parallel architecture
  – running over 8000 threads at once is common
• C for CUDA (C++ for CUDA)
  – the C/C++ language with some additions and restrictions
• Enables GPGPU: 'General-Purpose Computing on GPUs'
GPU: a multithreaded coprocessor
• SM: streaming multiprocessor
  – 32 SPs each (or 16, 48 or more)
  – fast local 'shared memory', shared between the SPs: 16 KiB (or 64 KiB)
• SP: scalar processor, a 'CUDA core'
  – executes one thread
[Diagram: an SM composed of SPs and shared memory, connected to the global memory on the device]
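These figures vary from card to card; a minimal sketch (not from the slides) that reads them at runtime via the CUDA runtime call cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("SMs:                %d\n", prop.multiProcessorCount);
    printf("Shared mem/block:   %zu bytes\n", prop.sharedMemPerBlock);
    printf("Global memory:      %zu bytes\n", prop.totalGlobalMem);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}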
GPU: SMs
• 30 SMs on GT200; 14-16 SMs on Fermi, depending on the card
• For example, GTX 480: 15 SMs x 32 cores = 480 cores on a GPU
• GDDR global memory (on device): 512 MiB - 6 GiB
How To Program For GPUs
• Parallelization: decomposition into threads
• Memory: shared memory, global memory
Important Things To Keep In Mind
• Avoid divergent branches
  – threads of a single SM must be executing the same code
  – code that branches heavily and unpredictably will execute slowly
• Threads should be as independent as possible
  – synchronization and communication can be done efficiently only for threads of a single multiprocessor
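To make the divergence point concrete, a hypothetical fragment (illustrative only, not from the slides):

__global__ void divergent(float* v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Divergent: odd and even threads of the same warp take different
    // branches, so the hardware runs both paths one after the other.
    if (i % 2 == 0) v[i] += 1.0f;
    else            v[i] -= 1.0f;
}

__global__ void uniform(float* v, int flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Uniform: every thread takes the same branch, so there is no divergence.
    if (flag) v[i] += 1.0f;
    else      v[i] -= 1.0f;
}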
How To Program For GPUs
• Parallelization: decomposition into threads
• Memory: shared memory, global memory
• Enormous processing power: avoid divergence
• Thread communication: synchronization, no interdependencies
Programming model
Thread blocks
• Threads are grouped into thread blocks
  – 128, 192 or 256 threads in a block
• One thread block executes on one SM
  – all of its threads share the SM's 'shared memory'
  – 32 threads are executed simultaneously (a 'warp')
[Diagram: BLOCK 1 shown as a 2D array of threads, THREAD(0,0) through THREAD(1,2)]
Thread blocks
• Blocks execute on SMs
  – execute in parallel
  – execute independently!
• Blocks form a grid
• Thread ID: unique within a block
• Block ID: unique within the grid
[Diagram: a grid of nine blocks, BLOCK 0 through BLOCK 8, each a 2D array of threads]
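A minimal host-side sketch (with illustrative names) of choosing a launch configuration:

dim3 block(256);                          // 256 threads per thread block
dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover all n elements
// later passed to a kernel launch: myKernel<<<grid, block>>>(...);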
Code that executes on GPU: Kernels
• Kernel
  – a simple C function
  – executes on the GPU
  – executes in parallel, as many times as there are threads
• The keyword __global__ tells the compiler to make a function a kernel (and compile it for the GPU, instead of the CPU)
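A minimal sketch of a kernel, with hypothetical names (the real convolution example follows below):

// Each launched thread runs this body exactly once.
__global__ void addOne(float* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        v[i] += 1.0f;
}

// Host-side launch: 30 blocks of 128 threads = 3840 threads.
// addOne<<<30, 128>>>(vGPU, n);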
Convolution
To get one pixel of the output image:
• multiply the mask (pixelwise) with the image at the corresponding position
• sum the products
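Written out as a formula (matching the kernel below, which applies the filter without flipping it):

out(x, y) = sum over fy in [0, filtH) and fx in [0, filtW) of img(x + fx, y + fy) * filt(fx, fy)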
Kernel: example code, part 1

__global__ void Convolve(float* img, int imgW, int imgH,
                         float* filt, int filtW, int filtH,
                         float* out)
{
    const int nThreads = blockDim.x * gridDim.x;
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;

    const int outW = imgW - filtW + 1;
    const int outH = imgH - filtH + 1;
    const int nPixels = outW * outH;

    // The CPU version would loop over all output pixels:
    //   for (int y = 0; y < outH; y++)
    //     for (int x = 0; x < outW; x++) { ... }
    // Here each thread starts at its own pixel and strides by the
    // total thread count until all pixels are covered.
    for (int curPixel = idx; curPixel < nPixels; curPixel += nThreads)
    {
        int x = curPixel % outW;
        int y = curPixel / outW;

        float sum = 0;
        for (int filtY = 0; filtY < filtH; filtY++)
            for (int filtX = 0; filtX < filtW; filtX++)
            {
                int sx = x + filtX;
                int sy = y + filtY;
                sum += img[sy * imgW + sx] * filt[filtY * filtW + filtX];
            }
        out[y * outW + x] = sum;
    }
}
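The outer loop is what is now commonly called a grid-stride loop: it decouples the number of launched threads from the problem size, so the same kernel handles any image size with any grid and block configuration.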
Setup and data transfer
• The GPU is the 'device', the CPU is the 'host'
• cudaMalloc: allocate memory on the GPU (global memory)
• cudaMemcpy: transfer data to and from the GPU (global memory)
• Kernel call syntax: Kernel<<<grid, block>>>(arguments)
Example setup and data transfer, part 1
int main()
{
    ...
    float* img ...
    int imgW, imgH ...

    float* imgGPU;
    cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float));
    cudaMemcpy(imgGPU,                        // destination
               img,                           // source
               imgW * imgH * sizeof(float),   // size in bytes
               cudaMemcpyHostToDevice);       // direction

    float* filter ...
    int filterW, filterH ...

    float* filterGPU;
    cudaMalloc((void**)&filterGPU, filterW * filterH * sizeof(float));
    cudaMemcpy(filterGPU,                           // destination
               filter,                              // source
               filterW * filterH * sizeof(float),   // size in bytes
               cudaMemcpyHostToDevice);             // direction
Example setup and data transfer, part 2
    int resultW = imgW - filterW + 1;
    int resultH = imgH - filterH + 1;
    float* result = (float*)malloc(resultW * resultH * sizeof(float));
    float* resultGPU;
    cudaMalloc((void**)&resultGPU, resultW * resultH * sizeof(float));

    /* Call the GPU kernel */
    dim3 block(128);
    dim3 grid(30);
    Convolve<<<grid, block>>>(imgGPU, imgW, imgH,
                              filterGPU, filterW, filterH,
                              resultGPU);

    cudaMemcpy(result,                              // destination
               resultGPU,                           // source
               resultW * resultH * sizeof(float),   // size in bytes
               cudaMemcpyDeviceToHost);             // direction

    cudaThreadExit();
    ...
}
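The example omits error checking; a minimal sketch of what production code would add after each CUDA call (cudaGetLastError and cudaGetErrorString are standard runtime API):

cudaError_t err = cudaGetLastError();    // e.g., right after the kernel launch
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));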
Speedup
• Task: linear combination of 3 filters sized 15x15; image size: 2k x 2k
• CPU: Core 2 @ 2.0 GHz (1 core): 6.58 s (0.89 Mpixels/s)
• GPU: Tesla S1070 (GT200; 30 SMs, 240 CUDA cores, 1.3 GHz): 0.21 s (27.99 Mpixels/s)
• 31 times faster!
CUDA capabilities
• 1.0: GeForce 8800 Ultra/GTX/GTS
• 1.1: GeForce 9800 GT/GTX, GTS 250 (+ atomic instructions, …)
• 1.2: GeForce GT 220
• 1.3: Tesla S1070, C1060; GeForce GTX 275, 285 (+ double precision (slow), …)
• 2.0: Tesla C2050; GeForce GTX 480, 470 (+ ECC, L1 and L2 cache, faster IMUL, faster atomics, faster double precision on Tesla cards, …)
CUDA essentials
developer.nvidia.com/object/cuda_3_1_downloads.html
• Download:
  – Driver
  – Toolkit (the nvcc compiler)
  – SDK (examples) (recommended)
  – CUDA Programming Guide
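For reference (not on the slide): a .cu source file is compiled with the toolkit's nvcc, e.g. nvcc convolve.cu -o convolve (the file name here is hypothetical).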
Other tools
• 'Emulator': executes on the CPU; slow
• Simple profiler
• cuda-gdb (Linux): on-device debugger
• Parallel Nsight (Vista): on-device debugger
...
Logical thread hierarchy
• Thread ID: unique within a block
• Block ID: unique within the grid
• To get a globally unique thread ID, combine the block ID and the thread ID (see the sketch below)
• Threads can access both shared and global memory
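A minimal sketch (with hypothetical names) of combining the two IDs, and of touching both memory spaces:

__global__ void whoAmI(int* out)
{
    // threadIdx.x is unique within the block; blockIdx.x is unique
    // within the grid. Combined, they give a globally unique ID:
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ int scratch[128];        // shared memory, per block (assumes <=128 threads/block)
    scratch[threadIdx.x] = gid;
    __syncthreads();                    // synchronize the block's threads

    out[gid] = scratch[threadIdx.x];    // global memory, visible to all threads
}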