Introduction to GPU Programming
Volodymyr (Vlad) Kindratenko
Innovative Systems Laboratory @ NCSA
Institute for Advanced Computing Applications and Technologies (IACAT)
Tutorial Goals
• Become familiar with NVIDIA GPU architecture
• Become familiar with the NVIDIA GPU application development flow
• Be able to write and run a simple NVIDIA GPU application in CUDA
7/26/2009
Tutorial Outline
• Introduction (15 minutes)
  – Why use Graphics Processing Units (GPUs) for general-purpose computing
  – Modern GPU architecture
    • NVIDIA
  – GPU programming
    • Libraries, CUDA C, OpenCL, PGI x64+GPU
• Hands-on: getting started with NCSA GPU cluster (15 minutes)
  – Cluster architecture overview
  – How to log in and check out a node
  – How to compile and run an existing application
• Hands-on: Anatomy of a GPU application (25 minutes)
  – Host side
  – Device side
  – CUDA programming model
• Hands-on: Porting matrix multiplier to GPU (25 minutes)
• More on CUDA programming (40 minutes)
Introduction
• Why use Graphics Processing Units (GPUs) for general-purpose computing
• Modern GPU architecture
  – NVIDIA
• GPU programming
  – Libraries
  – CUDA C
  – OpenCL
  – PGI x64+GPU
Why GPUs? Raw Performance Trends
[Figure: peak GFLOP/s (0 to 1200) versus release date (9/22/02 to 3/14/08) for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285) and Intel CPUs (Intel Xeon quad-core 3 GHz)]
Graph is courtesy of NVIDIA
Why GPUs? Memory Bandwidth Trends
[Figure: memory bandwidth in GByte/s (0 to 180) versus release date (9/22/02 to 3/14/08) for NVIDIA GPUs (5800, 5950 Ultra, 6800 Ultra, 7800 GTX, 7900 GTX, 8800 GTX, 8800 Ultra, GTX 280, GTX 285)]
Graph is courtesy of NVIDIA
GPU vs. CPU Silicon Use
Graph is courtesy of NVIDIA
NVIDIA GPU Architecture
• A scalable array of multithreaded Streaming Multiprocessors (SMs); each SM consists of
  – 8 Scalar Processor (SP) cores
  – 2 special function units for transcendentals
  – A multithreaded instruction unit
  – On-chip shared memory
• GDDR3 SDRAM
• PCIe interface
Figure is courtesy of NVIDIA
NVIDIA GeForce 9400M G GPU
• 16 streaming processors arranged as 2 streaming multiprocessors
• At 0.8 GHz this provides
  – 54 GFLOPS in single-precision (SP)
• 128-bit interface to off-chip GDDR3 memory
  – 21 GB/s bandwidth
[Figure: block diagram of one TPC (geometry controller, SMC, texture units with texture L1) holding 2 SMs, each with I cache, MT issue, C cache, 8 SPs, 2 SFUs, and shared memory; ROP/L2 units connect through a 128-bit interconnect to DRAM]
NVIDIA Tesla C1060 GPU
• 240 streaming processors arranged as 30 streaming multiprocessors
• At 1.3 GHz this provides
  – 1 TFLOPS SP
  – 86.4 GFLOPS DP
• 512-bit interface to off-chip GDDR3 memory
  – 102 GB/s bandwidth
[Figure: block diagram of TPC 1 through TPC 10, each with a geometry controller, SMC, texture units with texture L1, and 3 SMs (I cache, MT issue, C cache, 8 SPs, 2 SFUs, shared memory); ROP/L2 units connect through a 512-bit memory interconnect to 8 DRAM partitions]
NVIDIA Tesla S1070 Computing Server
• 4 T10 GPUs
[Figure: Tesla S1070 block diagram with four Tesla GPUs, each with 4 GB GDDR3 SDRAM; two NVIDIA switches, each with a PCI x16 link; power supply, thermal management, and system monitoring]
Graph is courtesy of NVIDIA
GPU Use/Programming
• GPU libraries
  – NVIDIA’s CUDA BLAS and FFT libraries
  – Many 3rd party libraries
• Low abstraction lightweight GPU programming toolkits
  – CUDA C
  – OpenCL
• High abstraction compiler-based tools
  – PGI x64+GPU
Getting Started with NCSA GPU Cluster
• Cluster architecture overview
• How to log in and check out a node
• How to compile and run an existing application
NCSA AC GPU Cluster
GPU Cluster Architecture
• Servers: 32
  – CPU cores: 128
• Accelerator Units: 32
  – GPUs: 128
[Figure: head node ac connected to compute nodes ac01, ac02, ..., ac32]
GPU Cluster Node Architecture
• HP xw9400 workstation
  – Dual-socket, dual-core AMD Opteron 2216, 2.4 GHz
  – 8 GB DDR2
  – InfiniBand QDR
• S1070 1U GPU Computing Server
  – 1.3 GHz Tesla T10 processors
  – 4x4 GB GDDR3 SDRAM
[Figure: compute node: an HP xw9400 workstation with a QDR IB link, connected over two PCIe x16 links to a Tesla S1070; each link serves a PCIe interface shared by two T10 GPUs, each with its own DRAM]
Accessing the GPU Cluster
• Use a Secure Shell (SSH) client to access AC
  – `ssh [email protected]` (User: gpu001 - gpu200; Password: CHiPS-09)
  – You will see something like this printed out:
    See machine details and a technical report at: http://www.ncsa.uiuc.edu/Projects/GPUcluster/
    Machine Description and HOW TO USE. See: /usr/local/share/ac.readme
    CUDA wrapper readme: /usr/local/share/cuda_wrapper.readme
    *IMPORTANT* If you are using multiple GPU devices per host, be sure to understand how the cuda_wrapper changes this system!!
    …
    July 03, 2009
    Nvidia compute exclusive mode made default. If this breaks your application, "touch ~/FORCE_NORMAL" to create an override for all your jobs.
    Questions? Contact Jeremy Enos [email protected]
    [gpuXYZ@ac ~]$ _
Installing Tutorial Examples
• Run this sequence to retrieve and install tutorial examples:
    cd
    cp /tmp/chips_tutorial.tgz .
    tar -xvzf chips_tutorial.tgz
    cd chips_tutorial
    ls
    src1 src2 src3
Accessing the GPU Cluster
[Figure: laptops 1 through 30, head node ac, and compute nodes ac01, ac02, ..., ac32, with a "You are here" marker]
Requesting a Cluster Node for Interactive Use
• Run `qstat` to see what other users are running, just for the fun of it
• Run `qsub -I -l walltime=02:00:00` to request a node with a single GPU for 2 hours of interactive use
  – You will see something like this printed out:
    qsub: waiting for job 64424.acm to start
    qsub: job 64424.acm ready
    [gpuXYZ@acAB ~]$ _
Requesting a Cluster Node
[Figure: laptops 1 through 30, head node ac, and compute nodes ac01, ac02, ..., ac32, with a "You are here" marker]
Checking GPU Characteristics
• Run `deviceQuery`
    CUDA Device Query (Runtime API) version (CUDART static linking)
    There is 1 device supporting CUDA
    Device 0: "Tesla C1060"
      CUDA Capability Major revision number: 1
      CUDA Capability Minor revision number: 3
      Total amount of global memory: 4294705152 bytes
      Number of multiprocessors: 30
      Number of cores: 240
      Total amount of constant memory: 65536 bytes
      Total amount of shared memory per block: 16384 bytes
      …
      Clock rate: 1.30 GHz
      Compute mode: Exclusive (only one host thread at a time can use this device)
Compiling and Running an Existing Application
• cd chips_tutorial/src1
  – vecadd.c - reference C implementation
  – vecadd.cu - CUDA implementation
• Compile & run CPU version
    gcc vecadd.c -o vecadd_cpu
    ./vecadd_cpu
    Running CPU vecAdd for 16384 elements
    C[0]=2147483648.00 ...
• Compile & run GPU version
    nvcc vecadd.cu -o vecadd_gpu
    ./vecadd_gpu
    Running GPU vecAdd for 16384 elements
    C[0]=2147483648.00 ...
nvcc
• Any source file containing CUDA C language extensions must be compiled with nvcc
• nvcc is a compiler driver that invokes many other tools to accomplish the job
• Basic nvcc usage
  – nvcc <filename>.cu [-o <executable>]
    • Builds release mode
  – nvcc -deviceemu <filename>.cu
    • Builds device emulation mode (all code runs on CPU)
  – The -g flag builds debug mode for the gdb debugger
Anatomy of a GPU Application
• Host side
• Device side
• CUDA programming model
CPU-Only Version
    #include <stdlib.h>

    // Computational kernel
    void vecAdd(int N, float* A, float* B, float* C) {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

    int main(int argc, char **argv) {
        int N = 16384; // default vector size

        // Memory allocation
        float *A = (float*)malloc(N * sizeof(float));
        float *B = (float*)malloc(N * sizeof(float));
        float *C = (float*)malloc(N * sizeof(float));

        // Kernel invocation
        vecAdd(N, A, B, C); // call compute kernel

        // Memory de-allocation
        free(A); free(B); free(C);
    }
Adding GPU support
    int main(int argc, char **argv) {
        int N = 16384; // default vector size

        float *A = (float*)malloc(N * sizeof(float));
        float *B = (float*)malloc(N * sizeof(float));
        float *C = (float*)malloc(N * sizeof(float));

        // Memory allocation on the GPU card
        float *devPtrA, *devPtrB, *devPtrC;

        cudaMalloc((void**)&devPtrA, N * sizeof(float));
        cudaMalloc((void**)&devPtrB, N * sizeof(float));
        cudaMalloc((void**)&devPtrC, N * sizeof(float));

        // Copy data from the CPU (host) memory to the GPU (device) memory
        cudaMemcpy(devPtrA, A, N * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(devPtrB, B, N * sizeof(float), cudaMemcpyHostToDevice);
Adding GPU support
        // Kernel invocation
        vecAdd<<<N/512, 512>>>(devPtrA, devPtrB, devPtrC);

        // Copy results from device memory to the host memory
        cudaMemcpy(C, devPtrC, N * sizeof(float), cudaMemcpyDeviceToHost);

        // Device memory de-allocation
        cudaFree(devPtrA);
        cudaFree(devPtrB);
        cudaFree(devPtrC);

        free(A); free(B); free(C);
    }
GPU Kernel
• CPU version
    void vecAdd(int N, float* A, float* B, float* C) {
        for (int i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }
• GPU version
    __global__ void vecAdd(float* A, float* B, float* C) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        C[i] = A[i] + B[i];
    }
CUDA Programming Model
• A CUDA kernel is executed by an array of threads
  – All threads run the same code (SPMD)
  – Each thread has an ID that it uses to compute memory addresses and make control decisions
    …
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    …
• Threads are arranged as a grid of thread blocks
  – Threads within a block have access to a segment of shared memory
[Figure: a grid of thread blocks, Thread Block 0 through Thread Block N-1, each with its own shared memory]
Kernel Invocation Syntax
    vecAdd<<<32, 512>>>(devPtrA, devPtrB, devPtrC);
  – <<<32, 512>>>: grid & thread block dimensionality
    int i = blockIdx.x * blockDim.x + threadIdx.x;
  – blockIdx.x: block ID within a grid
  – blockDim.x: number of threads per block
  – threadIdx.x: thread ID within a thread block
[Figure: a grid of thread blocks, Thread Block 0 through Thread Block N-1, each with its own shared memory]
Mapping Threads to the Hardware
• Blocks of threads are transparently assigned to SMs
  – A block of threads executes on one SM & does not migrate
  – Several blocks can reside concurrently on one SM
• Each block can execute in any order relative to other blocks.
[Figure: a kernel grid of 8 blocks (Block 0 through Block 7) scheduled over time on a device that runs 2 blocks at a time and on a device that runs 4 blocks at a time]
Slide is courtesy of NVIDIA
CUDA Programming Model
• A kernel is executed as a grid of thread blocks
  – Grid of blocks can be 1- or 2-dimensional
  – Thread blocks can be 1-, 2-, or 3-dimensional
• Different kernels can have different grid/block configuration
• Threads from the same block have access to a shared memory and their execution can be synchronized
[Figure: the host launches Kernel 1 on Grid 1, a 3x2 grid of blocks, Block (0, 0) through Block (2, 1), and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 3-dimensional arrangement of threads indexed (x, y, z), such as Thread (0,0,0) through Thread (3,1,0)]
Slide is courtesy of NVIDIA
GPU Memory Hierarchy
    Memory    Location  Cached  Access  Scope                   Lifetime
    Register  On-chip   N/A     R/W     One thread              Thread
    Local     Off-chip  No      R/W     One thread              Thread
    Shared    On-chip   N/A     R/W     All threads in a block  Block
    Global    Off-chip  No      R/W     All threads + host      Application
    Constant  Off-chip  Yes     R       All threads + host      Application
    Texture   Off-chip  Yes     R       All threads + host      Application
[Figure: host (CPU, chipset, DRAM) connected to the device; device DRAM holds local, global, constant, and texture memory; each GPU multiprocessor has registers, shared memory, and constant and texture caches]
Porting matrix multiplier to GPU
• cd ../chips_tutorial/src2
• Compile & run CPU version
    icc -O3 mmult.c -o mmult
    ./mmult
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    ...
    msec = 2215478 GFLOPS = 0.969
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    // a = b * c
    void mmult(float *a, float *b, float *c, int N) {
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                for (int i = 0; i < N; i++)
                    a[i+j*N] += b[i+k*N]*c[k+j*N];
    }

    void minit(float *a, float *b, float *c, int N) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++) {
                a[i+N*j] = 0.0f;
                b[i+N*j] = 1.0f;
                c[i+N*j] = 1.0f;
            }
    }

    // print the top-left M x M corner of a
    void mprint(float *a, int N, int M) {
        for (int j = 0; j < M; j++) {
            for (int i = 0; i < M; i++)
                printf("%.2f ", a[i+N*j]);
            printf("...\n");
        }
        printf("...\n");
    }

    int main(int argc, char* argv[]) {
        int N = 1024;
        struct timeval t1, t2, ta, tb;
        long msec1, msec2;
        float flop, mflop, gflop;

        float *a = (float *)malloc(N*N*sizeof(float));
        float *b = (float *)malloc(N*N*sizeof(float));
        float *c = (float *)malloc(N*N*sizeof(float));

        minit(a, b, c, N);

        gettimeofday(&t1, NULL);
        mmult(a, b, c, N); // a = b * c
        gettimeofday(&t2, NULL);

        mprint(a, N, 5);

        free(a);
        free(b);
        free(c);

        msec1 = t1.tv_sec * 1000000 + t1.tv_usec;
        msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
        msec2 -= msec1;
        flop = N*N*N*2.0f;
        mflop = flop / msec2;
        gflop = mflop / 1000.0f;
        printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
    }
Matrix Representation in Memory
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            for (k = 0; k < n; ++k)
                a[i+n*j] += b[i+n*k] * c[k+n*j];
- Matrices are stored in column-major order
- For reference, jki-ordered version runs at 1.7 GFLOPS on 3 GHz Intel Xeon (single core)
Map this code:
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            for (k = 0; k < n; ++k)
                a[i+n*j] += b[i+n*k] * c[k+n*j];
into this (logical) architecture:
[Figure: grid of thread blocks; blockIdx.x = 0, 1, 2, each block with blockDim.x = 6 threads (threadIdx.x = 0 to 5); the global index blockIdx.x * blockDim.x + threadIdx.x runs from 0 to 17]
[Figure: A = B*C with B of size 3x2 (b1,1 to b3,2), C of size 2x3 (c1,1 to c2,3), and A of size 3x3 (a1,1 to a3,3); for example, a1,2 = b1,1*c1,2 + b1,2*c2,2]
32x1024 grid of thread blocks, indexed by (blockIdx.x, blockIdx.y)
Block of 32x1x1 threads, indexed by (threadIdx.x)
    dim3 grid(1024/32, 1024);
    dim3 threads(32);

    {
        int i = blockIdx.x*32 + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += b[i+n*k] * c[k+n*j];
        a[i+n*j] = sum;
    }
[Figure: 32 thread blocks of 32 threads each cover index i; 1024 thread blocks cover index j]
Kernel
Original CPU kernel
    void mmult(float *a, float *b, float *c, int N) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    a[i+j*N] += b[i+k*N]*c[k+j*N];
    }
GPU Kernel
    __global__ void mmult(float *a, float *b, float *c, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y;
        float sum = 0.0f;
        for (int k = 0; k < N; k++)
            sum += b[i+N*k] * c[k+N*j];
        a[i+N*j] = sum;
    }

    dim3 dimGrid(32, 1024);
    dim3 dimBlock(32);
    mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);
    int main(int argc, char* argv[]) {
        int N = 1024;
        struct timeval t1, t2;
        long msec1, msec2;
        float flop, mflop, gflop;

        float *a = (float *)malloc(N*N*sizeof(float));
        float *b = (float *)malloc(N*N*sizeof(float));
        float *c = (float *)malloc(N*N*sizeof(float));

        minit(a, b, c, N);

        // allocate device memory
        float *devPtrA, *devPtrB, *devPtrC;
        cudaMalloc((void**)&devPtrA, N*N*sizeof(float));
        cudaMalloc((void**)&devPtrB, N*N*sizeof(float));
        cudaMalloc((void**)&devPtrC, N*N*sizeof(float));

        // copy input arrays to the device memory
        cudaMemcpy(devPtrB, b, N*N*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(devPtrC, c, N*N*sizeof(float), cudaMemcpyHostToDevice);
        gettimeofday(&t1, NULL);
        msec1 = t1.tv_sec * 1000000 + t1.tv_usec;

        // define grid and thread block sizes
        dim3 dimGrid(32, 1024);
        dim3 dimBlock(32);

        // launch GPU kernel
        mmult<<<dimGrid, dimBlock>>>(devPtrA, devPtrB, devPtrC, N);

        // check for errors
        cudaError_t err = cudaGetLastError();
        if (cudaSuccess != err) {
            fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString( err) );
            exit(EXIT_FAILURE);
        }

        // wait until GPU kernel is done
        cudaThreadSynchronize();

        gettimeofday(&t2, NULL);
        msec2 = t2.tv_sec * 1000000 + t2.tv_usec;
        // copy results to host
        cudaMemcpy(a, devPtrA, N*N*sizeof(float), cudaMemcpyDeviceToHost);

        mprint(a, N, 5);

        // free device memory
        cudaFree(devPtrA);
        cudaFree(devPtrB);
        cudaFree(devPtrC);

        free(a);
        free(b);
        free(c);

        msec2 -= msec1;
        flop = N*N*N*2.0f;
        mflop = flop / msec2;
        gflop = mflop / 1000.0f;
        printf("msec = %10ld GFLOPS = %.3f\n", msec2, gflop);
    }
Porting matrix multiplier to GPU
• Compile & run GPU version
    nvcc mmult.cu -o mmult
    ./mmult
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    1024.00 1024.00 1024.00 1024.00 1024.00 ...
    ...
    msec = 91363 GFLOPS = 23.505
More on CUDA Programming
• Language extensions
  – Function type qualifiers
  – Variable type qualifiers
  – Execution configuration
  – Built-in variables
• Common runtime components
  – Built-in vector types
• Device runtime components
  – Intrinsic functions
  – Synchronization and memory fencing functions
  – Atomic functions
• Host runtime components (runtime API only)
  – Device management
  – Memory management
  – Error handling
  – Debugging in the device emulation mode
• Exercise
Function Type Qualifiers
    Declaration                     Executed on the:  Only callable from the:
    __device__ float DeviceFunc()   device            device
    __global__ void KernelFunc()    device            host
    __host__ float HostFunc()       host              host
• __device__ and __global__ functions do not support recursion, cannot declare static variables inside their body, and cannot have a variable number of arguments
• __device__ functions cannot have their address taken
• __host__ and __device__ qualifiers can be used together, in which case the function is compiled for both
• __global__ and __host__ qualifiers cannot be used together
• A __global__ function must have void return type, its execution configuration must be specified, and the call is asynchronous
Variable Type Qualifiers
    Declaration                                Memory    Scope  Lifetime
    __device__ int GlobalVar;                  global    grid   application
    __device__ __shared__ int SharedVar;       shared    block  block
    __device__ __constant__ int ConstantVar;   constant  grid   application
    volatile int GlobalVar or SharedVar;
• __shared__ and __constant__ variables have implied static storage
• __device__, __shared__ and __constant__ variables cannot be defined using the extern keyword
• __device__ and __constant__ variables are only allowed at file scope
• __constant__ variables cannot be assigned to from the device; they are initialized from the host only
• __shared__ variables cannot have an initialization as part of their declaration
Execution Configuration
Function declared as
    __global__ void kernel(float* param);
must be called like this:
    kernel<<<Dg, Db, Ns, S>>>(param);
where
• Dg (type dim3) specifies the dimension and size of the grid, such that Dg.x*Dg.y equals the number of blocks being launched;
• Db (type dim3) specifies the dimension and size of each block of threads, such that Db.x*Db.y*Db.z equals the number of threads per block;
• optional Ns (type size_t) specifies the number of bytes of shared memory dynamically allocated per block for this call in addition to the statically allocated memory;
• optional S (type cudaStream_t) specifies the stream associated with this kernel call.
Built-in Variables
    variable   type   description
    gridDim    dim3   dimensions of the grid
    blockIdx   uint3  block index within the grid
    blockDim   dim3   dimensions of the block
    threadIdx  uint3  thread index within the block
    warpSize   int    warp size in threads
It is not allowed to take addresses of any of the built-in variables
It is not allowed to assign values to any of the built-in variables
Built-in Vector Types
Vector types derived from basic integer and float types
• char1, char2, char3, char4
• uchar1, uchar2, uchar3, uchar4
• short1, short2, short3, short4
• ushort1, ushort2, ushort3, ushort4
• int1, int2, int3, int4
• uint1, uint2, uint3 (dim3), uint4
• long1, long2, long3, long4
• ulong1, ulong2, ulong3, ulong4
• longlong1, longlong2
• float1, float2, float3, float4
• double1, double2
They are all structures, like this:
    typedef struct {
        float x, y, z, w;
    } float4;
They all come with a constructor function in the form make_<type name>, e.g.,
    int2 make_int2(int x, int y);
Intrinsic Functions
Supported on the device only
Start with __, as in __sinf(x)
May end with a rounding-mode suffix, as in __fadd_rn(x,y):
  – _rn (round-to-nearest-even rounding mode)
  – _rz (round-towards-zero rounding mode)
  – _ru (round-up rounding mode)
  – _rd (round-down rounding mode)
There are mathematical (__log10f(x)), type conversion (__int2float_rn(x)), type casting (__int_as_float(x)), and bit manipulation (__ffs(x)) functions
Synchronization and Memory Fencing Functions
function: void __threadfence()
    Waits until all global and shared memory accesses made by the calling thread become visible to all threads in the device for global memory accesses and all threads in the thread block for shared memory accesses
function: void __threadfence_block()
    Waits until all global and shared memory accesses made by the calling thread become visible to all threads in the thread block
function: void __syncthreads()
    Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads become visible to all threads in the block
Atomic Functions
An atomic function performs read-modify-write atomic operation on one 32-bit or one 64-bit word residing in global or shared memory. The operation is atomic in the sense that it is guaranteed to be performed without interference from other threads.
    function                 description
    atomicAdd()              new = old + val
    atomicSub()              new = old - val
    atomicExch()             new = val
    atomicMin()              new = min(old, val)
    atomicMax()              new = max(old, val)
    atomicInc()              new = ((old >= val) ? 0 : (old+1))
    atomicDec()              new = (((old==0) | (old > val)) ? val : (old-1))
    atomicCAS()              new = (old == compare ? val : old)
    atomic{And, Or, Xor}()   new = {(old & val), (old | val), (old ^ val)}
CUDA APIs
• Higher-level API called the CUDA runtime API
  – myKernel<<<Dg, Db>>>((unsigned char*)devPtr, width, height, pitch);
• Low-level API called the CUDA driver API
  – cuModuleLoad( &module, binfile );
  – cuModuleGetFunction( &func, module, "mmkernel" );
  – …
  – cuParamSetv( func, 0, &args, 48 );
  – cuParamSetSize( func, 48 );
  – cuFuncSetBlockShape( func, ts[0], ts[1], 1 );
  – cuLaunchGrid( func, gs[0], gs[1] );
Device Management
    function                   description
    cudaGetDeviceCount()       Returns the number of compute-capable devices
    cudaGetDeviceProperties()  Returns information on the compute device
    cudaSetDevice()            Sets device to be used for GPU execution
    cudaGetDevice()            Returns the device currently being used
    cudaChooseDevice()         Selects device that best matches given criteria
Device Management Example
    void cudaDeviceInit() {
        int devCount, device;

        cudaGetDeviceCount(&devCount);
        if (devCount == 0) {
            printf("No CUDA capable devices detected.\n");
            exit(EXIT_FAILURE);
        }

        for (device = 0; device < devCount; device++) {
            cudaDeviceProp props;
            cudaGetDeviceProperties(&props, device);

            // If a device of compute capability >= 1.3 is found, use it
            if (props.major > 1 || (props.major == 1 && props.minor >= 3))
                break;
        }

        if (device == devCount) {
            printf("No device above 1.2 compute capability detected.\n");
            exit(EXIT_FAILURE);
        }
        else
            cudaSetDevice(device);
    }
Memory Management
    function           description
    cudaMalloc()       Allocates memory on the GPU
    cudaMallocPitch()  Allocates memory on the GPU device for 2D arrays, may pad the allocated memory to ensure alignment requirements
    cudaFree()         Frees the memory allocated on the GPU
    cudaMallocArray()  Allocates an array on the GPU
    cudaFreeArray()    Frees an array allocated on the GPU
    cudaMallocHost()   Allocates page-locked memory on the host
    cudaFreeHost()     Frees page-locked memory in the host
More on Memory Alignment
    a1,1 a1,2 a1,3
    a2,1 a2,2 a2,3
    a3,1 a3,2 a3,3

cudaMalloc(&dev_a, m*n*sizeof(float));
    a1,1 a2,1 a3,1 a1,2 a2,2 a3,2 a1,3 a2,3 a3,3
Matrix columns are not aligned at 64-bit boundary

cudaMallocPitch(&dev_a, &n, n*sizeof(float), m);
    a1,1 a2,1 a3,1 a1,2 a2,2 a3,2 a1,3 a2,3 a3,3
Matrix columns are aligned at 64-bit boundary

n is the allocated (aligned) size for the first dimension (the pitch), given the requested sizes of the two dimensions.
Memory Management Example
    // host code
    float* devPtr;
    size_t pitch;
    cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
    myKernel<<<100, 192>>>(devPtr, pitch, width, height);

    // device code
    __global__ void myKernel(float* devPtr, size_t pitch, int width, int height) {
        for (int r = 0; r < height; ++r) {
            // pitch is in bytes, so step through memory as char*
            float* row = (float*)((char*)devPtr + r * pitch);
            for (int c = 0; c < width; ++c) {
                float element = row[c];
            }
        }
    }
Memory Management
    function                  description
    cudaMemset()              Initializes or sets GPU memory to a value
    cudaMemcpy()              Copies data between host and the device
    cudaMemcpyToArray()
    cudaMemcpyFromArray()
    cudaMemcpyArrayToArray()
    cudaMemcpyToSymbol()
    cudaMemcpyFromSymbol()
    cudaGetSymbolAddress()    Finds the address associated with a CUDA symbol
    cudaGetSymbolSize()       Finds the size of the object associated with a CUDA symbol
Error Handling
All CUDA runtime API functions return an error code. The runtime maintains an error variable for each host thread that is overwritten by the error code every time an error occurs.
    function              description
    cudaGetLastError()    Returns error variable and resets it to cudaSuccess
    cudaGetErrorString()  Returns the message string from an error code

    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err) {
        fprintf(stderr, "CUDA error: %s.\n", cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
Exercise
• Port Mandelbrot set fractal renderer to CUDA
  – Source is in ~/chips_tutorial/src3
    • fractal.c – reference C implementation
    • Makefile – make file
    • fractal.cu.reference – CUDA implementation for reference
Reference C Implementation
    void makefractal_cpu(unsigned char *image, int width, int height,
                         double xupper, double xlower, double yupper, double ylower)
    {
        int x, y;

        double xinc = (xupper - xlower) / width;
        double yinc = (yupper - ylower) / height;

        for (y = 0; y < height; y++) {
            for (x = 0; x < width; x++) {
                image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
            }
        }
    }
CUDA Kernel Implementation
    __global__ void makefractal_gpu(unsigned char *image)
    {
        // one thread block per pixel
        int x = blockIdx.x;
        int y = blockIdx.y;

        int width = gridDim.x;
        int height = gridDim.y;

        double xupper=-0.74624, xlower=-0.74758, yupper=0.10779, ylower=0.10671;

        double xinc = (xupper - xlower) / width;
        double yinc = (yupper - ylower) / height;

        image[y*width+x] = iter((xlower + x*xinc), (ylower + y*yinc));
    }
Reference C Implementation
    static inline unsigned char iter(double a, double b)
    {
        unsigned char i = 0;
        double c_x = 0, c_y = 0;
        double c_x_tmp, c_y_tmp;
        double D = 4.0;

        while ((c_x*c_x+c_y*c_y < D) && (i++ < 255)) {
            c_x_tmp = c_x * c_x - c_y * c_y;
            c_y_tmp = 2* c_y * c_x;
            c_x = a + c_x_tmp;
            c_y = b + c_y_tmp;
        }

        return i;
    }

The Mandelbrot set is generated by iterating the complex function $z^2 + c$, where $c$ is a constant:
    $z_1 = z_0^2 + c$
    $z_2 = z_1^2 + c$
    $z_3 = z_2^2 + c$
and so forth. The sequence $z_0, z_1, z_2, \ldots$ is called the orbit of $z_0$ under iteration of $z^2 + c$. We stop iterating when the orbit starts to diverge, or when a maximum number of iterations is reached.
CUDA Kernel Implementation
    inline __device__ unsigned char iter(double a, double b)
    {
        unsigned char i = 0;
        double c_x = 0, c_y = 0;
        double c_x_tmp, c_y_tmp;
        double D = 4.0;

        while ((c_x*c_x+c_y*c_y < D) && (i++ < 255)) {
            c_x_tmp = c_x * c_x - c_y * c_y;
            c_y_tmp = 2* c_y * c_x;
            c_x = a + c_x_tmp;
            c_y = b + c_y_tmp;
        }

        return i;
    }
Host Code
    int width = 1024;
    int height = 768;
    unsigned char *image = NULL;
    unsigned char *devImage;

    image = (unsigned char*)malloc(width*height*sizeof(unsigned char));
    cudaMalloc((void**)&devImage, width*height*sizeof(unsigned char));

    // one single-thread block per pixel
    dim3 dimGrid(width, height);
    dim3 dimBlock(1);

    makefractal_gpu<<<dimGrid, dimBlock>>>(devImage);

    cudaMemcpy(image, devImage, width*height*sizeof(unsigned char), cudaMemcpyDeviceToHost);

    free(image);
    cudaFree(devImage);
A Few Examples
• xupper=-0.74624
• xlower=-0.74758
• yupper=0.10779
• ylower=0.10671
  – CPU time: 2.27 sec
  – GPU time: 0.29 sec

• xupper=-0.754534912109
• xlower=-.757077407837
• yupper=0.060144042969
• ylower=0.057710774740
  – CPU time: 1.5 sec
  – GPU time: 0.25 sec