Introduction to GPU Programming

Volodymyr (Vlad) Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)

Tutorial Goals

• Become familiar with NVIDIA GPU architecture
• Become familiar with the NVIDIA GPU application development flow
• Be able to write and run a simple NVIDIA GPU application in CUDA

7/26/2009

Tutorial Outline

• Introduction (15 minutes)
  – Why use Graphics Processing Units (GPUs) for general-purpose computing
  – Modern GPU architecture
    • NVIDIA
  – GPU programming
    • Libraries, CUDA C, OpenCL, PGI x64+GPU
• Hands-on: getting started with NCSA GPU cluster (15 minutes)
  – Cluster architecture overview
  – How to login and check out a node
  – How to compile and run an existing application
• Hands-on: Anatomy of a GPU application (25 minutes)
  – Host side
  – Device side
  – CUDA programming model
• Hands-on: Porting matrix multiplier to GPU (25 minutes)
• More on CUDA programming (40 minutes)

Introduction

• Why use Graphics Processing Units (GPUs) for general-purpose computing
• Modern GPU architecture
  – NVIDIA
• GPU programming
  – Libraries
  – CUDA C
  – OpenCL
  – PGI x64+GPU

Why GPUs? Raw Performance Trends

[Figure: peak performance in GFLOP/s (0 to 1200) from 9/22/02 to 3/14/08 for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) and Intel CPUs (Intel Xeon quad-core 3 GHz). Graph is courtesy of NVIDIA.]

Why GPUs? Memory Bandwidth Trends

[Figure: memory bandwidth in GByte/s (0 to 180) from 9/22/02 to 3/14/08 for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285). Graph is courtesy of NVIDIA.]

GPU vs. CPU Silicon Use

Graph is courtesy of NVIDIA

NVIDIA GPU Architecture

• A scalable array of multithreaded Streaming Multiprocessors (SMs); each SM consists of
  – 8 Scalar Processor (SP) cores
  – 2 special function units for transcendentals
  – A multithreaded instruction unit
  – On-chip shared memory
• GDDR3 SDRAM
• PCIe interface

Figure is courtesy of NVIDIA

NVIDIA GeForce 9400M G GPU

• 16 streaming processors arranged as 2 streaming multiprocessors
• At 0.8 GHz this provides
  – 54 GFLOPS in single-precision (SP)
• 128-bit interface to off-chip GDDR3 memory
  – 21 GB/s bandwidth

[Figure: block diagram of one TPC (geometry controller, SMC, texture units, texture L1) containing two SMs, each with 8 SPs, 2 SFUs, shared memory, instruction cache, constant cache, and MT issue unit; L2/ROP units connect through a 128-bit interconnect to DRAM.]

NVIDIA Tesla C1060 GPU

• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
  – 1 TFLOPS SP
  – 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
  – 102 GB/s bandwidth

[Figure: block diagram of TPC 1 through TPC 10, each with a geometry controller, an SMC, and three SMs (8 SPs, 2 SFUs, shared memory, instruction cache, constant cache, and MT issue unit per SM); L2/ROP units connect through a 512-bit memory interconnect to eight DRAM banks.]
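The headline numbers for the C1060 can be roughly reproduced from the core count, clock, and bus width. The sketch below is only a back-of-the-envelope check: the factor of 3 floating-point operations per SP per clock and the 1.6 GT/s memory transfer rate are assumptions that do not appear on the slide, and the function names are made up for illustration.

```c
/* Peak single-precision rate in GFLOPS.
 * flops_per_clock is an assumption (3 = one multiply-add plus one
 * multiply per SP per cycle); the slide does not state it. */
double peak_gflops(int cores, double clock_ghz, int flops_per_clock) {
    return cores * clock_ghz * flops_per_clock;
}

/* Peak memory bandwidth in GB/s for a bus of bus_bits bits.
 * The transfer rate in GT/s is an assumed value, not from the slide. */
double peak_bandwidth_gbs(int bus_bits, double gtransfers_per_s) {
    return (bus_bits / 8.0) * gtransfers_per_s;
}
```

Under these assumptions, `peak_gflops(240, 1.3, 3)` gives about 936 GFLOPS, which the slide rounds to 1 TFLOPS, and `peak_bandwidth_gbs(512, 1.6)` gives 102.4 GB/s.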

NVIDIA Tesla S1070 Computing Server

• 4 T10 GPUs

[Figure: block diagram of the Tesla S1070: four Tesla GPUs, each with 4 GB GDDR3 SDRAM, connected in pairs through two NVIDIA switches to two PCI x16 links; the unit also contains the power supply, thermal management, and system monitoring. Graph is courtesy of NVIDIA.]

GPU Use/Programming

• GPU libraries
  – NVIDIA's CUDA BLAS and FFT libraries
  – Many 3rd party libraries
• Low abstraction lightweight GPU programming toolkits
  – CUDA C
  – OpenCL
• High abstraction compiler-based tools
  – PGI x64+GPU

Getting Started with NCSA GPU Cluster

• Cluster architecture overview
• How to login and check out a node
• How to compile and run an existing application

NCSA AC GPU Cluster

GPU Cluster Architecture

• Servers: 32
  – CPU cores: 128
• Accelerator Units: 32
  – GPUs: 128

[Figure: the head node ac connected to compute nodes ac01, ac02, ..., ac32.]

GPU Cluster Node Architecture

• HP xw9400 workstation
  – AMD Opteron 2216, 2.4 GHz, dual socket, dual core
  – 8 GB DDR2
  – InfiniBand QDR
• S1070 1U GPU Computing Server
  – 1.3 GHz Tesla T10 processors
  – 4x4 GB GDDR3 SDRAM

[Figure: a compute node made of an HP xw9400 workstation (with QDR IB) attached over two PCIe x16 links to a Tesla S1070; each link reaches a PCIe interface serving two T10 GPUs and their DRAM.]

Accessing the GPU Cluster

• Use Secure Shell (SSH) client to access AC
  – `ssh [email protected]` (User: gpu001 - gpu200; Password: CHiPS-09)
  – You will see something like this printed out:

    See machine details and a technical report at: http://www.ncsa.uiuc.edu/Projects/GPUcluster/
    Machine Description and HOW TO USE. See: /usr/local/share/ac.readme
    CUDA wrapper readme: /usr/local/share/cuda_wrapper.readme
    *IMPORTANT* If you are using multiple GPU devices per host, be sure to understand how the cuda_wrapper changes this system!!
    …
    July 03, 2009
    Nvidia compute exclusive mode made default. If this breaks your application, "touch ~/FORCE_NORMAL" to create an override for all your jobs.
    Questions? Contact Jeremy Enos [email protected]

    [gpuXYZ@ac ~]$ _

Installing Tutorial Examples

• Run this sequence to retrieve and install tutorial examples:

    cd
    cp /tmp/chips_tutorial.tgz .
    tar -xvzf chips_tutorial.tgz
    cd chips_tutorial
    ls

    src1 src2 src3

Accessing the GPU Cluster

[Figure: Laptop 1, Laptop 2, ..., Laptop 30 connect to the head node ac, marked "You are here"; the head node fronts compute nodes ac01, ac02, ..., ac32.]

Requesting a Cluster Node for Interactive Use

• Run `qstat` to see what other users do, just for the fun of it

• Run `qsub -I -l walltime=02:00:00` to request a node with a single GPU for 2 hours of interactive use
  – You will see something like this printed out:

    qsub: waiting for job 64424.acm to start
    qsub: job 64424.acm ready
    [gpuXYZ@acAB ~]$ _

Requesting a Cluster Node

[Figure: the same cluster diagram (Laptop 1 to Laptop 30, head node ac, compute nodes ac01, ac02, ..., ac32) with an updated "You are here" marker.]

Checking GPU Characteristics

• Run `deviceQuery`

    CUDA Device Query (Runtime API) version (CUDART static linking)
    There is 1 device supporting CUDA
    Device 0: "Tesla C1060"
      CUDA Capability Major revision number: 1
      CUDA Capability Minor revision number: 3
      Total amount of global memory: 4294705152 bytes
      Number of multiprocessors: 30
      Number of cores: 240
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 16384 bytes
    …
      Clock rate: 1.30 GHz
      Compute mode: Exclusive (only one host thread at a time can use this device)

Compiling and Running an Existing Application

• cd chips_tutorial/src1
  – vecadd.c – reference C implementation
  – vecadd.cu – CUDA implementation
• Compile & run CPU version

    gcc vecadd.c -o vecadd_cpu
    ./vecadd_cpu

    Running CPU vecAdd for 16384 elements
    C[0]=2147483648.00 ...

• Compile & run GPU version

    nvcc vecadd.cu -o vecadd_gpu
    ./vecadd_gpu

    Running GPU vecAdd for 16384 elements
    C[0]=2147483648.00 ...

nvcc

• Any source file containing CUDA C language extensions must be compiled with nvcc

• nvcc is a compiler driver that invokes many other tools to accomplish the job

• Basic nvcc usage
  – nvcc <filename>.cu [-o <executable>]
    • Builds release mode
  – nvcc -deviceemu <filename>.cu
    • Builds device emulation mode (all code runs on CPU)
  – The -g flag builds debug mode for the gdb debugger

Anatomy of a GPU Application

• Host side
• Device side
• CUDA programming model

CPU-Only Version

    #include <stdlib.h>

    // Computational kernel
    void vecAdd(int N, float* A, float* B, float* C) {
        for (int i = 0; i < N; i++) C[i] = A[i] + B[i];
    }

    int main(int argc, char **argv) {
        int N = 16384; // default vector size

        // Memory allocation
        float *A = (float*)malloc(N * sizeof(float));
        float *B = (float*)malloc(N * sizeof(float));
        float *C = (float*)malloc(N * sizeof(float));

        // Kernel invocation
        vecAdd(N, A, B, C); // call compute kernel

        // Memory de-allocation
        free(A); free(B); free(C);
    }

Adding GPU support

    int main(int argc, char **argv) {
        int N = 16384; // default vector size

        float *A = (float*)malloc(N * sizeof(float));
        float *B = (float*)malloc(N * sizeof(float));
        float *C = (float*)malloc(N * sizeof(float));

        float *devPtrA, *devPtrB, *devPtrC;

        // Memory allocation on the GPU card
        cudaMalloc((void**)&devPtrA, N * sizeof(float));
        cudaMalloc((void**)&devPtrB, N * sizeof(float));
        cudaMalloc((void**)&devPtrC, N * sizeof(float));

        // Copy data from the CPU (host) memory to the GPU (device) memory
        cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);

Adding GPU support

        // Kernel invocation
        vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

        // Copy results from device memory to the host memory
        cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

        // Device memory de-allocation
        cudaFree(devPtrA);
        cudaFree(devPtrB);
        cudaFree(devPtrC);

        free(A); free(B); free(C);
    }

GPU Kernel

• CPU version

    void vecAdd(int N, float* A, float* B, float* C) {
        for (int i = 0; i < N; i++) C[i] = A[i] + B[i];
    }

• GPU version

    __global__ void vecAdd(float* A, float* B, float* C) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }
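One way to see how the GPU version replaces the loop is to emulate the launch on the CPU, with two nested loops standing in for blockIdx.x and threadIdx.x. The sketch below is plain C, not CUDA; the function names, the explicit n parameter, and the `i < n` guard are additions for illustration. The tutorial kernel needs no guard because 16384 is a multiple of 512.

```c
/* Number of blocks needed to cover n elements, rounding up. */
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

/* CPU emulation of a launch vecAdd<<<num_blocks, block_dim>>>:
 * the two loops stand in for blockIdx.x and threadIdx.x.
 * The n parameter and the i < n guard are illustrative additions. */
void vec_add_emulated(int n, int block_dim,
                      const float *A, const float *B, float *C) {
    int num_blocks = blocks_for(n, block_dim);
    for (int block_idx = 0; block_idx < num_blocks; block_idx++) {
        for (int thread_idx = 0; thread_idx < block_dim; thread_idx++) {
            int i = block_idx * block_dim + thread_idx;
            if (i < n)   /* threads past the end of the vector do nothing */
                C[i] = A[i] + B[i];
        }
    }
}
```

With n = 16384 and 512 threads per block, `blocks_for` returns 32, matching the N/512 grid size used in the host code.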

CUDA Programming Model

• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions

    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …

• Threads are arranged as a grid of thread blocks
  – Threads within a block have access to a segment of shared memory

[Figure: a grid of thread blocks, Thread Block 0 through Thread Block N-1, each with its own shared memory.]

Kernel Invocation Syntax

[Figure: a grid of thread blocks, Thread Block 0 through Thread Block N-1, each with its own shared memory.]

    vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);

The two values between <<< and >>> give the grid and thread block dimensionality.

    int i = blockIdx.x * blockDim.x + threadIdx.x;

• blockIdx.x: block ID within a grid
• blockDim.x: number of threads per block
• threadIdx.x: thread ID within a thread block

Mapping Threads to the Hardware

• Blocks of threads are transparently assigned to SMs
  – A block of threads executes on one SM & does not migrate
  – Several blocks can reside concurrently on one SM
• Each block can execute in any order relative to other blocks.

[Figure: a kernel grid of eight blocks (Block 0 to Block 7) scheduled over time on two devices: one runs two blocks at a time, the other runs four blocks at a time. Slide is courtesy of NVIDIA.]

CUDA Programming Model

• A kernel is executed as a grid of thread blocks
  – Grid of blocks can be 1 or 2-dimensional
  – Thread blocks can be 1, 2, or 3-dimensional
• Different kernels can have different grid/block configuration
• Threads from the same block have access to a shared memory and their execution can be synchronized

[Figure: the host launches Kernel 1 on Grid 1 (Block (0,0) to Block (2,1)) and Kernel 2 on Grid 2 on the device; Block (1,1) is expanded to show its threads, indexed in three dimensions from Thread (0,0,0) to Thread (3,1,0) and (0,0,1) to (3,0,1). Slide is courtesy of NVIDIA.]

GPU Memory Hierarchy

Memory    Location  Cached  Access  Scope                   Lifetime
Register  On-chip   N/A     R/W     One thread              Thread
Local     Off-chip  No      R/W     One thread              Thread
Shared    On-chip   N/A     R/W     All threads in a block  Block
Global    Off-chip  No      R/W     All threads + host      Application
Constant  Off-chip  Yes     R       All threads + host      Application
Texture   Off-chip  Yes     R       All threads + host      Application

[Figure: the host (CPU, chipset, DRAM) and the device; device DRAM holds local, global, constant, and texture memory, and each GPU multiprocessor has registers, shared memory, and constant and texture caches.]

Porting matrix multiplier to GPU

• cd ~/chips_tutorial/src2
• Compile & run CPU version

    icc -O3 mmult.c -o mmult
    ./mmult

    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    ...
    msec = 2215478 GFLOPS = 0.969

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    void mmult(float *a, float *b, float *c, int N);
    void minit(float *a, float *b, float *c, int N);
    void mprint(float *a, int N, int M);

    int main(int argc, char* argv[])
    {
        int N = 1024;
        struct timeval t1, t2, ta, tb;
        long msec1, msec2;   // despite the names, these hold microseconds
        float flop, mflop, gflop;
        float *a = (float *)malloc(N*N*sizeof(float));
        float *b = (float *)malloc(N*N*sizeof(float));
        float *c = (float *)malloc(N*N*sizeof(float));

        minit(a, b, c, N);

        gettimeofday(&t1, NULL);
        mmult(a, b, c, N); // a = b * c
        gettimeofday(&t2, NULL);

        mprint(a, N, 5);
        free(a); free(b); free(c);

        msec1 = t1.tv_sec * 1000000 + t1.tv_usec;
        msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
        msec2 -= msec1;
        flop = N*N*N*2.0f;
        mflop = flop / msec2;
        gflop = mflop / 1000.0f;
        printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
    }

    // a = b * c
    void mmult(float *a, float *b, float *c, int N)
    {
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                for (int i = 0; i < N; i++)
                    a[i+j*N] += b[i+k*N]*c[k+j*N];
    }

    void minit(float *a, float *b, float *c, int N)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++) {
                a[i+N*j] = 0.0f;
                b[i+N*j] = 1.0f;
                c[i+N*j] = 1.0f;
            }
    }

    void mprint(float *a, int N, int M)
    {
        for (int j = 0; j < M; j++) {
            for (int i = 0; i < M; i++)
                printf("%.2f ", a[i+N*j]);
            printf("...\n");
        }
        printf("...\n");
    }

Matrix Representation in Memory

    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            for (k = 0; k < n; ++k)
                a[i+n*j] += b[i+n*k] * c[k+n*j];

- Matrices are stored in column-major order
- For reference, jki-ordered version runs at 1.7 GFLOPS on 3 GHz Intel Xeon (single core)
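The expression `i + n*j` used throughout the matrix code is the column-major offset of element (i, j). As a small sketch (the helper name is made up for illustration), it can be written as a function:

```c
/* Linear offset of element (i, j), zero-based, in a matrix with n rows
 * stored column by column: consecutive i values are adjacent in memory. */
int col_major_index(int i, int j, int n) {
    return i + n * j;
}
```

For a 3x3 matrix, elements (0,0), (1,0), (2,0) land at offsets 0, 1, 2, and (0,1) at offset 3. This is also why loop order matters: with i innermost, `a[i+n*j]` and `b[i+n*k]` are read contiguously, while with k innermost `b[i+n*k]` jumps n floats on every iteration.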

Map this code:

    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            for (k = 0; k < n; ++k)
                a[i+n*j] += b[i+n*k] * c[k+n*j];

into this (logical) architecture:

[Figure: a grid of three thread blocks (blockIdx.x = 0, 1, 2), each with blockDim.x = 6 threads (threadIdx.x = 0 to 5); the global index blockIdx.x * blockDim.x + threadIdx.x runs from 0 to 17.]

[Figure: A = B*C for a 3x2 matrix B (b1,1 to b3,2) and a 2x3 matrix C (c1,1 to c2,3), giving a 3x3 matrix A (a1,1 to a3,3); for example, a1,2 = b1,1*c1,2 + b1,2*c2,2.]

[Figure: a 32x1024 grid of thread blocks indexed by (blockIdx.x, blockIdx.y), each a block of 32x1x1 threads indexed by threadIdx.x. 32 thread blocks of 32 threads per block cover index i; 1024 thread blocks cover index j.]

    dim3 grid(1024/32, 1024);
    dim3 threads(32);

    {
        int i = blockIdx.x*32 + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += b[i+n*k] * c[k+n*j];
        a[i+n*j] = sum;
    }

Kernel

Original CPU kernel

    void mmult(float *a, float *b, float *c, int N)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    a[i+j*N] += b[i+k*N]*c[k+j*N];
    }

GPU Kernel

    __global__ void mmult(float *a, float *b, float *c, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += b[i+N*k] * c[k+N*j];

        a[i+N*j] = sum;
    }

    dim3 dimGrid(32, 1024);
    dim3 dimBlock(32);
    mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);
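To check that this grid layout visits every matrix element exactly once, the launch can be emulated on the CPU. The sketch below is plain C rather than CUDA: the three outer loops stand in for blockIdx.y, blockIdx.x, and threadIdx.x, and the function name and block_dim parameter are illustrative. It assumes n is a multiple of block_dim, as 1024 is of 32.

```c
/* CPU emulation of the 2D launch for an n x n column-major matrix.
 * block_dim plays the role of blockDim.x (32 in the slides); n is
 * assumed to be a multiple of block_dim. */
void mmult_emulated(float *a, const float *b, const float *c,
                    int n, int block_dim) {
    int grid_x = n / block_dim;   /* blocks along i */
    int grid_y = n;               /* one row of blocks per column j */
    for (int block_y = 0; block_y < grid_y; block_y++)
        for (int block_x = 0; block_x < grid_x; block_x++)
            for (int thread_x = 0; thread_x < block_dim; thread_x++) {
                /* body of one GPU thread */
                int i = block_x * block_dim + thread_x;
                int j = block_y;
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += b[i + n * k] * c[k + n * j];
                a[i + n * j] = sum;
            }
}
```

On the GPU the three outer loops disappear: every (block_y, block_x, thread_x) combination runs as its own thread, and only the k loop remains in the kernel.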

    int main(int argc, char* argv[])
    {
        int N = 1024;
        struct timeval t1, t2;
        long msec1, msec2;
        float flop, mflop, gflop;
        float *a = (float *)malloc(N*N*sizeof(float));
        float *b = (float *)malloc(N*N*sizeof(float));
        float *c = (float *)malloc(N*N*sizeof(float));

        minit(a, b, c, N);

        // allocate device memory
        float *devPtrA, *devPtrB, *devPtrC;
        cudaMalloc((void**)&devPtrA, N*N*sizeof(float));
        cudaMalloc((void**)&devPtrB, N*N*sizeof(float));
        cudaMalloc((void**)&devPtrC, N*N*sizeof(float));

        // copy input arrays to the device memory
        cudaMemcpy(devPtrB, b, N*N*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(devPtrC, c, N*N*sizeof(float), cudaMemcpyHostToDevice);

        gettimeofday(&t1, NULL);
        msec1 = t1.tv_sec * 1000000 + t1.tv_usec;

        // define grid and thread block sizes
        dim3 dimGrid(32, 1024);
        dim3 dimBlock(32);

        // launch GPU kernel
        mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);

        // check for errors
        cudaError_t err = cudaGetLastError();
        if (cudaSuccess != err) {
            fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }

        // wait until GPU kernel is done
        cudaThreadSynchronize();

        gettimeofday(&t2, NULL);
        msec2 = t2.tv_sec * 1000000 + t2.tv_usec;

        // copy results to host
        cudaMemcpy(a, devPtrA, N*N*sizeof(float), cudaMemcpyDeviceToHost);

        mprint(a, N, 5);

        // free device memory
        cudaFree(devPtrA);
        cudaFree(devPtrB);
        cudaFree(devPtrC);

        free(a); free(b); free(c);

        msec2 -= msec1;
        flop = N*N*N*2.0f;
        mflop = flop / msec2;
        gflop = mflop / 1000.0f;
        printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
    }

Porting matrix multiplier to GPU

• Compile & run GPU version

    nvcc mmult.cu -o mmult
    ./mmult

    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    ...
    msec = 91363 GFLOPS = 23.505
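The reported GFLOPS figure follows from the operation count 2*N^3 and the elapsed time. Since the timing code combines tv_sec * 1000000 with tv_usec, the value printed after "msec" appears to be in microseconds. A minimal sketch of the same arithmetic, with a made-up function name:

```c
/* GFLOPS for an n x n matrix multiply (2*n^3 floating-point operations)
 * given the elapsed time in microseconds. */
double mmult_gflops(int n, long usec) {
    double flop = 2.0 * n * n * n;
    return flop / (double)usec / 1000.0;   /* flop/usec = Mflop/s */
}
```

Plugging in the two runs from the slides, `mmult_gflops(1024, 91363)` is about 23.5 for the GPU and `mmult_gflops(1024, 2215478)` is about 0.97 for the CPU, a ratio of roughly 24.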

More on CUDA Programming

• Language extensions
  – Function type qualifiers
  – Variable type qualifiers
  – Execution configuration
  – Built-in variables
• Common runtime components
  – Built-in vector types
• Device runtime components
  – Intrinsic functions
  – Synchronization and memory fencing functions
  – Atomic functions
• Host runtime components (runtime API only)
  – Device management
  – Memory management
  – Error handling
  – Debugging in the device emulation mode
• Exercise

Function Type Qualifiers

Function declaration            Executed on the:  Only callable from the:
__device__ float DeviceFunc()   device            device
__global__ void KernelFunc()    device            host
__host__ float HostFunc()       host              host

• __device__ and __global__ functions do not support recursion, cannot declare static variables inside their body, cannot have a variable number of arguments
• __device__ functions cannot have their address taken
• __host__ and __device__ qualifiers can be used together, in which case the function is compiled for both
• __global__ and __host__ qualifiers cannot be used together
• __global__ function must have void return type, its execution configuration must be specified, and the call is asynchronous

Variable Type Qualifiers

Declaration                                Memory    Scope  Lifetime
__device__ int GlobalVar;                  global    grid   application
__device__ __shared__ int SharedVar;       shared    block  block
__device__ __constant__ int ConstantVar;   constant  grid   application
volatile int GlobalVar or SharedVar;

• __shared__ and __constant__ variables have implied static storage
• __device__, __shared__ and __constant__ variables cannot be defined using the extern keyword
• __device__ and __constant__ variables are only allowed at file scope
• __constant__ variables cannot be assigned to from the device; they are initialized from the host only
• __shared__ variables cannot have an initialization as part of their declaration

Execution Configuration

Function declared as

    __global__ void kernel(float* param);

must be called like this:

    kernel<<<Dg, Db, Ns, S>>>(param);

where
• Dg (type dim3) specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched;
• Db (type dim3) specifies the dimension and size of each block of threads, such that Db.x*Db.y*Db.z equals the number of threads per block;
• optional Ns (type size_t) specifies the number of bytes of shared memory dynamically allocated per block for this call in addition to the statically allocated memory;
• optional S (type cudaStream_t) specifies the stream associated with this kernel call.
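The products Dg.x*Dg.y and Db.x*Db.y*Db.z can be checked with a few lines of plain C. The struct below is only a stand-in for CUDA's dim3 so that the sketch compiles without the CUDA headers; its name and the helper names are hypothetical.

```c
/* Stand-in for CUDA's dim3, used only so this compiles as plain C;
 * the name dim3_like is hypothetical. */
typedef struct { unsigned int x, y, z; } dim3_like;

/* Number of blocks launched: Dg.x * Dg.y. */
unsigned long blocks_in_grid(dim3_like Dg) {
    return (unsigned long)Dg.x * Dg.y;
}

/* Number of threads per block: Db.x * Db.y * Db.z. */
unsigned long threads_in_block(dim3_like Db) {
    return (unsigned long)Db.x * Db.y * Db.z;
}

/* Total threads in the launch. */
unsigned long total_threads(dim3_like Dg, dim3_like Db) {
    return blocks_in_grid(Dg) * threads_in_block(Db);
}
```

For the matrix multiplier configuration, a grid of {32, 1024, 1} and a block of {32, 1, 1} give 32768 blocks of 32 threads, or 1048576 threads in total: one per element of the 1024x1024 result.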

Built-in Variables

variable    type    description
gridDim     dim3    dimensions of the grid
blockIdx    uint3   block index within the grid
blockDim    dim3    dimensions of the block
threadIdx   uint3   thread index within the block
warpSize    int     warp size in threads

• It is not allowed to take addresses of any of the built-in variables
• It is not allowed to assign values to any of the built-in variables

Built-in Vector Types

Vector types derived from basic integer and float types

• char1, char2, char3, char4
• uchar1, uchar2, uchar3, uchar4
• short1, short2, short3, short4
• ushort1, ushort2, ushort3, ushort4
• int1, int2, int3, int4
• uint1, uint2, uint3 (dim3), uint4
• long1, long2, long3, long4
• ulong1, ulong2, ulong3, ulong4
• longlong1, longlong2
• float1, float2, float3, float4
• double1, double2

They are all structures, like this:

    typedef struct {
        float x, y, z, w;
    } float4;

They all come with a constructor function in the form make_<type name>, e.g.,

int2 make_int2(int x, int y);

Intrinsic Functions

Supported on the device only

Start with __, as in __sinf(x)

End with
    _rn (round-to-nearest-even rounding mode)
    _rz (round-towards-zero rounding mode)
    _ru (round-up rounding mode)
    _rd (round-down rounding mode)
as in __fadd_rn(x,y);

There are mathematical (__log10f(x)), type conversion (__int2float_rn(x)), type casting (__int_as_float(x)), and bit manipulation (__ffs(x)) functions

Synchronization and Memory Fencing Functions

function: description

void __threadfence()
    Waits until all global and shared memory accesses made by the calling thread become visible to all threads in the device for global memory accesses and all threads in the thread block for shared memory accesses

void __threadfence_block()
    Waits until all global and shared memory accesses made by the calling thread become visible to all threads in the thread block

void __syncthreads()
    Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads become visible to all threads in the block

Atomic Functions

An atomic function performs read-modify-write atomic operation on one 32-bit or one 64-bit word residing in global or shared memory. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.

function                   description
atomicAdd()                new = old + val
atomicSub()                new = old - val
atomicExch()               new = val
atomicMin()                new = min(old, val)
atomicMax()                new = max(old, val)
atomicInc()                new = ((old >= val) ? 0 : (old+1))
atomicDec()                new = (((old==0) | (old > val)) ? val : (old-1))
atomicCAS()                new = (old == compare ? val : old)
atomic{And, Or, Xor}()     new = {(old & val), (old | val), (old^val)}
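The less obvious update rules (atomicInc, atomicDec, atomicCAS) are easier to read as code. The functions below are sequential models in plain C, not the CUDA functions themselves: they apply the rule from the table and return the old value, as the CUDA atomics do, but offer no atomicity. The model_ names are hypothetical.

```c
/* Sequential model of atomicInc: counts up and wraps to 0 after val. */
unsigned int model_atomic_inc(unsigned int *addr, unsigned int val) {
    unsigned int old = *addr;
    *addr = (old >= val) ? 0 : old + 1;
    return old;
}

/* Sequential model of atomicDec: counts down and wraps to val after 0. */
unsigned int model_atomic_dec(unsigned int *addr, unsigned int val) {
    unsigned int old = *addr;
    *addr = ((old == 0) || (old > val)) ? val : old - 1;
    return old;
}

/* Sequential model of atomicCAS: store val only if *addr equals compare. */
int model_atomic_cas(int *addr, int compare, int val) {
    int old = *addr;
    *addr = (old == compare) ? val : old;
    return old;
}
```

With val = 2, repeated calls to the atomicInc model cycle a counter through 0, 1, 2, 0, and so on, which is the wrap-around behavior the table's formula describes.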

CUDA APIs

• higher-level API called the CUDA runtime API
  – myKernel<<<Dg, Db>>>((unsigned char*)devPtr, width, height, pitch);
• low-level API called the CUDA driver API
  – cuModuleLoad( &module, binfile );
  – cuModuleGetFunction( &func, module, "mmkernel" );
  – …
  – cuParamSetv( func, 0, &args, 48 );
  – cuParamSetSize( func, 48 );
  – cuFuncSetBlockShape( func, ts[0], ts[1], 1 );
  – cuLaunchGrid( func, gs[0], gs[1] );

Device Management

function                    description
cudaGetDeviceCount()        Returns the number of compute-capable devices
cudaGetDeviceProperties()   Returns information on the compute device
cudaSetDevice()             Sets device to be used for GPU execution
cudaGetDevice()             Returns the device currently being used
cudaChooseDevice()          Selects device that best matches given criteria

Device Management Example

    void cudaDeviceInit()
    {
        int devCount, device;
        cudaGetDeviceCount(&devCount);
        if (devCount == 0) {
            printf("No CUDA capable devices detected.\n");
            exit(EXIT_FAILURE);
        }
        for (device = 0; device < devCount; device++) {
            cudaDeviceProp props;
            cudaGetDeviceProperties(&props, device);
            // If a device of compute capability >= 1.3 is found, use it
            if (props.major > 1 || (props.major == 1 && props.minor >= 3))
                break;
        }
        if (device == devCount) {
            printf("No device above 1.2 compute capability detected.\n");
            exit(EXIT_FAILURE);
        }
        else
            cudaSetDevice(device);
    }

Memory Management

function             description
cudaMalloc()         Allocates memory on the GPU
cudaMallocPitch()    Allocates memory on the GPU device for 2D arrays, may pad the allocated memory to ensure alignment requirements
cudaFree()           Frees the memory allocated on the GPU
cudaMallocArray()    Allocates an array on the GPU
cudaFreeArray()      Frees an array allocated on the GPU
cudaMallocHost()     Allocates page-locked memory on the host
cudaFreeHost()       Frees page-locked memory in the host

More on Memory Alignment

[Figure: a 3x3 matrix (a1,1 to a3,3) laid out in memory in column-major order as a1,1 a2,1 a3,1 a1,2 a2,2 a3,2 a1,3 a2,3 a3,3, shown once for each allocation below.]

    cudaMalloc(&dev_a, m*n*sizeof(float));

Matrix columns are not aligned at 64-bit boundary

    cudaMallocPitch(&dev_a, &n, n*sizeof(float), m);

Matrix columns are aligned at 64-bit boundary

n is the allocated (aligned) size for the first dimension (the pitch), given the requested sizes of the two dimensions.
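The effect of a pitch can be sketched in plain C: each line of the 2D array is padded up to a multiple of some alignment, and elements are then located with byte arithmetic on the pitch. The alignment that cudaMallocPitch actually applies is chosen by the runtime; the `align` parameter and both function names below are hypothetical stand-ins.

```c
#include <stddef.h>

/* Round a line size in bytes up to the next multiple of align.
 * The real alignment is chosen by cudaMallocPitch; align is a
 * hypothetical stand-in for it. */
size_t padded_pitch(size_t width_bytes, size_t align) {
    return (width_bytes + align - 1) / align * align;
}

/* Address of element idx in line `line` of a pitched allocation:
 * step over whole lines in bytes, then index floats within the line. */
float *pitched_elem(void *base, size_t pitch, size_t line, size_t idx) {
    float *start = (float *)((char *)base + line * pitch);
    return start + idx;
}
```

For example, three floats (12 bytes) with an assumed 64-byte alignment occupy a 64-byte pitch, so the second line starts 64 bytes, not 12 bytes, after the first.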

Memory Management Example

    // host code
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
    myKernel<<<100, 192>>>(devPtr, pitch);

    // device code
    __global__ void myKernel(float* devPtr, int pitch)
    {
        for (int r = 0; r < height; ++r) {
            float* row = (float*)((char*)devPtr + r * pitch);
            for (int c = 0; c < width; ++c) {
                float element = row[c];
            }
        }
    }

Memory Management

function                    description
cudaMemset()                Initializes or sets GPU memory to a value
cudaMemcpy()                Copies data between host and the device
cudaMemcpyToArray()
cudaMemcpyFromArray()
cudaMemcpyArrayToArray()
cudaMemcpyToSymbol()
cudaMemcpyFromSymbol()
cudaGetSymbolAddress()      Finds the address associated with a CUDA symbol
cudaGetSymbolSize()         Finds the size of the object associated with a CUDA symbol

Error Handling

All CUDA runtime API functions return an error code. The runtime maintains an error variable for each host thread that is overwritten by the error code every time an error occurs.

function               description
cudaGetLastError()     Returns error variable and resets it to cudaSuccess
cudaGetErrorString()   Returns the message string from an error code

    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

Exercise

• Port Mandelbrot set fractal renderer to CUDA
  – Source is in ~/chips_tutorial/src3
    • fractal.c – reference C implementation
    • Makefile – make file
    • fractal.cu.reference – CUDA implementation for reference

Reference C Implementation

    void makefractal_cpu(unsigned char *image, int width, int height,
                         double xupper, double xlower,
                         double yupper, double ylower)
    {
        int x, y;

        double xinc = (xupper - xlower) / width;
        double yinc = (yupper - ylower) / height;

        for (y = 0; y < height; y++) {
            for (x = 0; x < width; x++) {
                image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
            }
        }
    }

CUDA Kernel Implementation

    // One thread block per pixel; the image size comes from the grid size.
    // The signature matches the host-side call makefractal_gpu<<<...>>>(devImage).
    __global__ void makefractal_gpu(unsigned char *image)
    {
        int x = blockIdx.x;
        int y = blockIdx.y;

        int width = gridDim.x;
        int height = gridDim.y;

        double xupper=-0.74624, xlower=-0.74758, yupper=0.10779, ylower=0.10671;

        double xinc = (xupper - xlower) / width;
        double yinc = (yupper - ylower) / height;

        image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
    }

Reference C Implementation

    inline unsigned char iter(double a, double b)
    {
        unsigned char i = 0;
        double c_x = 0, c_y = 0;
        double c_x_tmp, c_y_tmp;
        double D = 4.0;

        while ((c_x*c_x+c_y*c_y < D) && (i++ < 255)) {
            c_x_tmp = c_x * c_x - c_y * c_y;
            c_y_tmp = 2* c_y * c_x;
            c_x = a + c_x_tmp;
            c_y = b + c_y_tmp;
        }

        return i;
    }

The Mandelbrot set is generated by iterating complex function z^2 + c, where c is a constant:

    z_1 = (z_0)^2 + c
    z_2 = (z_1)^2 + c
    z_3 = (z_2)^2 + c

and so forth. Sequence z_0, z_1, z_2, ... is called the orbit of z_0 under iteration of z^2 + c. We stop iteration when the orbit starts to diverge, or when a maximum number of iterations is done.
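To make the stopping rule concrete, here is a standalone copy of the iteration routine with a few sample points. It is renamed iter_ref (a hypothetical name) so that it does not clash with the tutorial source. In the code, (a, b) is the constant c and (c_x, c_y) is the current orbit point z.

```c
/* Copy of the reference iteration routine, renamed iter_ref (a
 * hypothetical name) so this sketch stands alone. Returns the number
 * of iterations performed before the orbit leaves |z|^2 < 4. */
unsigned char iter_ref(double a, double b) {
    unsigned char i = 0;
    double c_x = 0, c_y = 0;
    double c_x_tmp, c_y_tmp;
    double D = 4.0;

    /* The counter is an unsigned char: when the 255-iteration limit is
     * hit, the final i++ wraps it from 255 to 0. */
    while ((c_x * c_x + c_y * c_y < D) && (i++ < 255)) {
        c_x_tmp = c_x * c_x - c_y * c_y;
        c_y_tmp = 2 * c_y * c_x;
        c_x = a + c_x_tmp;
        c_y = b + c_y_tmp;
    }
    return i;
}
```

A point far outside the set, such as c = 2 + 2i, escapes after one step and yields 1; c = 1 escapes after two steps and yields 2. A point whose orbit never escapes, such as c = 0, runs into the iteration limit, and because of the wrap-around noted in the comment the function returns 0 for it.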

CUDA Kernel Implementation

    inline __device__ unsigned char iter(double a, double b)
    {
        unsigned char i = 0;
        double c_x = 0, c_y = 0;
        double c_x_tmp, c_y_tmp;
        double D = 4.0;

        while ((c_x*c_x+c_y*c_y < D) && (i++ < 255)) {
            c_x_tmp = c_x * c_x - c_y * c_y;
            c_y_tmp = 2* c_y * c_x;
            c_x = a + c_x_tmp;
            c_y = b + c_y_tmp;
        }

        return i;
    }

Host Code

    int width = 1024;
    int height = 768;
    unsigned char *image = NULL;
    unsigned char *devImage;

    image = (unsigned char*)malloc(width*height*sizeof(unsigned char));
    cudaMalloc((void**)&devImage, width*height*sizeof(unsigned char));

    dim3 dimGrid(width, height);
    dim3 dimBlock(1);

    makefractal_gpu<<<dimGrid, dimBlock>>>(devImage);

    cudaMemcpy(image, devImage, width*height*sizeof(unsigned char), cudaMemcpyDeviceToHost);

    free(image);
    cudaFree(devImage);

Few Examples

• xupper=-0.74624
• xlower=-0.74758
• yupper=0.10779
• ylower=0.10671
• CPU time: 2.27 sec
• GPU time: 0.29 sec

• xupper=-0.754534912109
• xlower=-.757077407837
• yupper=0.060144042969
• ylower=0.057710774740
• CPU time: 1.5 sec
• GPU time: 0.25 sec
