CUDA C/C++ BASICS © NVIDIA 2013

Source: cc.sjtu.edu.cn/Upload/20171012153914920.pdf


Page 1

CUDA C/C++ BASICS

Page 2

What is CUDA?

• CUDA Architecture
– Expose GPU parallelism for general-purpose computing
– Retain performance

• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous programming
– Straightforward APIs to manage devices, memory, etc.

• This session introduces CUDA C/C++

Page 3

Introduction to CUDA C/C++

• What will you learn in this session?
– Start from “Hello World!”
– Write and launch CUDA C/C++ kernels
– Manage GPU memory
– Manage communication and synchronization

Page 4

Prerequisites

• You (probably) need experience with C or C++

• You don’t need GPU experience

• You don’t need parallel programming experience

• You don’t need graphics experience

Page 5

CONCEPTS

Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices

Page 6

HELLO WORLD!

Page 7

Heterogeneous Computing

§ Terminology:
§ Host: the CPU and its memory (host memory)
§ Device: the GPU and its memory (device memory)

Page 8

Heterogeneous Computing

#include <iostream>
#include <algorithm>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn (device code)
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    int *in, *out;      // host copies of in, out
    int *d_in, *d_out;  // device copies of in, out
    int size = (N + 2*RADIUS) * sizeof(int);

    // serial code: alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // serial code: copy result back to host, then clean up
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Page 9

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

(figure: data crosses the PCI Bus between CPU and GPU)

Page 10

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute, caching data on chip for performance

(figure: data crosses the PCI Bus between CPU and GPU)

Page 11

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute, caching data on chip for performance

3. Copy results from GPU memory to CPU memory

(figure: data crosses the PCI Bus between CPU and GPU)

Page 12

Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

Standard C that runs on the host

NVIDIA compiler (nvcc) can be used to compile programs with no device code

Output:

$ nvcc hello_world.cu
$ a.out
Hello World!
$

Page 13

Hello World! with Device Code

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

§ Two new syntactic elements…

Page 14

Hello World! with Device Code

__global__ void mykernel(void) {
}

• CUDA C/C++ keyword __global__ indicates a function that:
– Runs on the device
– Is called from host code

• nvcc separates source code into host and device components
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler
  • gcc, cl.exe

Page 15

Hello World! with Device Code

mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment

• That’s all that is required to execute a function on the GPU!

Page 16

Hello World! with Device Code

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

• mykernel() does nothing, somewhat anticlimactic!

Output:

$ nvcc hello.cu
$ a.out
Hello World!
$

Page 17

Parallel Programming in CUDA C/C++

• But wait… GPU computing is about massive parallelism!

• We need a more interesting example…

• We’ll start by adding two integers and build up to vector addition

(figure: a + b = c)

Page 18

Addition on the Device

• A simple kernel to add two integers

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• As before, __global__ is a CUDA C/C++ keyword meaning:
– add() will execute on the device
– add() will be called from the host

Page 19

Addition on the Device

• Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory

• We need to allocate memory on the GPU

Page 20

Memory Management

• Host and device memory are separate entities
– Device pointers point to GPU memory
  May be passed to/from host code
  May not be dereferenced in host code
– Host pointers point to CPU memory
  May be passed to/from device code
  May not be dereferenced in device code

• Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
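The analogy with the C equivalents can be made concrete on the host alone. The sketch below is a hypothetical helper, not part of the deck: plain malloc() and memcpy() stand in for cudaMalloc() and cudaMemcpy(), so the allocate / copy-in / compute / copy-out / free pattern can be traced without a GPU. On a real device only the function names change.

```cpp
#include <cstdlib>
#include <cstring>

// Host-only sketch of the device-memory life cycle. malloc/memcpy are
// stand-ins for cudaMalloc/cudaMemcpy; the loop stands in for a kernel.
void doubled_roundtrip(const int *h_in, int *h_out, int n) {
    size_t size = n * sizeof(int);
    int *d_buf = (int *)malloc(size);   // stand-in for cudaMalloc()
    memcpy(d_buf, h_in, size);          // stand-in for cudaMemcpy(HostToDevice)
    for (int i = 0; i < n; i++)         // stand-in for a kernel launch
        d_buf[i] *= 2;
    memcpy(h_out, d_buf, size);         // stand-in for cudaMemcpy(DeviceToHost)
    free(d_buf);                        // stand-in for cudaFree()
}
```

The shape of this function is exactly the shape of the main() functions shown on the following slides.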

Page 21

Addition on the Device: add()

• Returning to our add() kernel

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• Let’s take a look at main()…

Page 22

Addition on the Device: main()

int main(void) {
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;

Page 23

Addition on the Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Page 24

RUNNING IN PARALLEL

Page 25

Moving to Parallel

• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?

  add<<< 1, 1 >>>();   becomes   add<<< N, 1 >>>();

• Instead of executing add() once, execute N times in parallel

Page 26

Vector Addition on the Device

• With add() running in parallel we can do vector addition

• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different index

Page 27

Vector Addition on the Device

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• On the device, each block can execute in parallel:

Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];
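The blocks jointly compute exactly what a serial loop computes, with each loop iteration playing the role of one block. A CPU reference like the sketch below (a hypothetical helper, not part of the deck) is a common way to validate a kernel's output.

```cpp
// CPU reference for the parallel add() kernel: iteration i of this
// loop computes what block i computes on the device, so comparing
// the two outputs element by element validates the GPU result.
void add_reference(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)   // i plays the role of blockIdx.x
        c[i] = a[i] + b[i];
}
```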

Page 28

Vector Addition on the Device: add()

• Returning to our parallelized add() kernel

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• Let’s take a look at main()…

Page 29

Vector Addition on the Device: main()

#define N 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Page 30

Vector Addition on the Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Page 31

Review (1 of 2)

• Difference between host and device
– Host: CPU
– Device: GPU

• Using __global__ to declare a function as device code
– Executes on the device
– Called from the host

• Passing parameters from host code to a device function

Page 32

Review (2 of 2)

• Basic device memory management
– cudaMalloc()
– cudaMemcpy()
– cudaFree()

• Launching parallel kernels
– Launch N copies of add() with add<<<N,1>>>(…);
– Use blockIdx.x to access the block index

Page 33

INTRODUCING THREADS

Page 34

CUDA Threads

• Terminology: a block can be split into parallel threads

• Let’s change add() to use parallel threads instead of parallel blocks

• We use threadIdx.x instead of blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

• Need to make one change in main()…

Page 35

Vector Addition Using Threads: main()

#define N 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Page 36

Vector Addition Using Threads: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N threads
    add<<<1,N>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Page 37

COMBINING THREADS AND BLOCKS

Page 38

Combining Blocks and Threads

• We’ve seen parallel vector addition using:
– Many blocks with one thread each
– One block with many threads

• Let’s adapt vector addition to use both blocks and threads

• Why? We’ll come to that…

• First let’s discuss data indexing…

Page 39

Indexing Arrays with Blocks and Threads

• No longer as simple as using blockIdx.x and threadIdx.x
– Consider indexing an array with one element per thread (8 threads/block):

threadIdx.x: 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
blockIdx.x:  0               | 1               | 2               | 3

• With M threads/block, a unique index for each thread is given by:

int index = threadIdx.x + blockIdx.x * M;

Page 40

Indexing Arrays: Example

• Which thread will operate on the red element (array index 21, of indices 0 … 31)?

threadIdx.x = 5
blockIdx.x = 2
M = 8

int index = threadIdx.x + blockIdx.x * M;
          = 5 + 2 * 8;
          = 21;
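The example's arithmetic can be checked directly on the host. The function below is a hypothetical helper mirroring the slide's formula, not part of the deck:

```cpp
// The slide's indexing formula: with M threads per block, the thread
// at (blockIdx.x, threadIdx.x) owns array element threadIdx.x + blockIdx.x * M.
int flat_index(int thread_idx_x, int block_idx_x, int M) {
    return thread_idx_x + block_idx_x * M;
}
```

With the values from the example, flat_index(5, 2, 8) gives 21, matching the red element.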

Page 41

Vector Addition with Blocks and Threads

• Use the built-in variable blockDim.x for threads per block:

int index = threadIdx.x + blockIdx.x * blockDim.x;

• Combined version of add() to use parallel threads and parallel blocks:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

• What changes need to be made in main()?

Page 42

Addition with Blocks and Threads: main()

#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Page 43

Addition with Blocks and Threads: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Page 44

Handling Arbitrary Vector Sizes

• Typical problems are not friendly multiples of blockDim.x

• Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

• Update the kernel launch:
add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
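The launch expression (N + M-1) / M is integer ceiling division: it rounds the block count up so the last, possibly partial, block is still launched, while the `if (index < n)` guard in the kernel discards the spare threads. A small host-side sketch (a hypothetical helper, not from the deck) shows the rounding:

```cpp
// Ceiling division used to size the grid: rounds n / threads_per_block
// up, so every element gets a thread even when n is not a multiple
// of the block size.
int num_blocks(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```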

Page 45

Why Bother with Threads?

• Threads seem unnecessary
– They add a level of complexity
– What do we gain?

• Unlike parallel blocks, threads have mechanisms to:
– Communicate
– Synchronize

• To look closer, we need a new example…

Page 46

Review

• Launching parallel kernels
– Launch N copies of add() with add<<<N/M,M>>>(…);
– Use blockIdx.x to access the block index
– Use threadIdx.x to access the thread index within the block

• Allocate elements to threads:

int index = threadIdx.x + blockIdx.x * blockDim.x;

Page 47

COOPERATING THREADS

Page 48

1D Stencil

• Consider applying a 1D stencil to a 1D array of elements
– Each output element is the sum of input elements within a radius

• If radius is 3, then each output element is the sum of 7 input elements
  (the element itself plus radius elements on either side)
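Before turning to the CUDA version, the stencil is easy to state as a host-side reference (a hypothetical helper, not part of the deck). It also makes the halo requirement explicit: the input must extend `radius` elements beyond the interior on each side, mirroring the `d_in + RADIUS` pointer offset used when the kernel is launched later in the deck.

```cpp
// CPU reference for the 1D stencil: out[i] sums the 2*radius + 1
// inputs centred on i. `in` and `out` point at the first *interior*
// element; the caller must provide `radius` halo elements on each
// side of the n interior inputs.
void stencil_reference(const int *in, int *out, int n, int radius) {
    for (int i = 0; i < n; i++) {
        int result = 0;
        for (int offset = -radius; offset <= radius; offset++)
            result += in[i + offset];
        out[i] = result;
    }
}
```

On an all-ones input with radius 3, every output element is 7, matching the "sum of 7 input elements" statement above.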

Page 49

Implementing Within a Block

• Each thread processes one output element
– blockDim.x elements per block

• Input elements are read several times
– With radius 3, each input element is read seven times

Page 50

Sharing Data Between Threads

• Terminology: within a block, threads share data via shared memory

• Extremely fast on-chip memory, user-managed

• Declare using __shared__, allocated per block

• Data is not visible to threads in other blocks

Page 51

Implementing with Shared Memory

• Cache data in shared memory
– Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
– Each block needs a halo of radius elements at each boundary

(figure: blockDim.x output elements, with a halo on the left and a halo on the right)

Page 52

Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

Page 53

Stencil Kernel

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

Page 54

Data Race!

§ The stencil example will not work…

§ Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                        // Store at temp[18]
if (threadIdx.x < RADIUS) {                       // Skipped, threadIdx.x > RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];                       // Load from temp[19]

Page 55

__syncthreads()

• void __syncthreads();

• Synchronizes all threads within a block
– Used to prevent RAW / WAR / WAW hazards

• All threads must reach the barrier
– In conditional code, the condition must be uniform across the block

Page 56

Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

Page 57

Stencil Kernel

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

Page 58

Review (1 of 2)

• Launching parallel threads
– Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
– Use blockIdx.x to access the block index within the grid
– Use threadIdx.x to access the thread index within the block

• Allocate elements to threads:

int index = threadIdx.x + blockIdx.x * blockDim.x;

Page 59

Review (2 of 2)

• Use __shared__ to declare a variable/array in shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks

• Use __syncthreads() as a barrier
– Use to prevent data hazards

Page 60

MANAGING THE DEVICE

Page 61

Coordinating Host & Device

• Kernel launches are asynchronous
– Control returns to the CPU immediately

• The CPU needs to synchronize before consuming the results:

cudaMemcpy()             Blocks the CPU until the copy is complete;
                         the copy begins when all preceding CUDA calls have completed
cudaMemcpyAsync()        Asynchronous, does not block the CPU
cudaDeviceSynchronize()  Blocks the CPU until all preceding CUDA calls have completed

Page 62

Reporting Errors

• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself, OR
– Error in an earlier asynchronous operation (e.g. a kernel)

• Get the error code for the last error:
cudaError_t cudaGetLastError(void)

• Get a string to describe the error:
char *cudaGetErrorString(cudaError_t)

printf("%s\n", cudaGetErrorString(cudaGetLastError()));

Page 63

Device Management

• Application can query and select GPUs
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)

• Multiple host threads can share a device

• A single host thread can manage multiple devices
cudaSetDevice(i) to select the current device
cudaMemcpy(…) for peer-to-peer copies✝

✝ requires OS and device support

Page 64

Introduction to CUDA C/C++

• What have we learned?
– Write and launch CUDA C/C++ kernels
  • __global__, blockIdx.x, threadIdx.x, <<<>>>
– Manage GPU memory
  • cudaMalloc(), cudaMemcpy(), cudaFree()
– Manage communication and synchronization
  • __shared__, __syncthreads()
  • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()

Page 65

Compute Capability

• The compute capability of a device describes its architecture, e.g.
– Number of registers
– Sizes of memories
– Features & capabilities

• The following presentations concentrate on Fermi devices
– Compute Capability >= 2.0

Compute Capability | Selected Features (see the CUDA C Programming Guide for the complete list) | Tesla models
1.0 | Fundamental CUDA support | 870
1.3 | Double precision, improved memory accesses, atomics | 10-series
2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series

Page 66

IDs and Dimensions

– A kernel is launched as a grid of blocks of threads

• blockIdx and threadIdx are 3D

• We showed only one dimension (x)

• Built-in variables:
– threadIdx
– blockIdx
– blockDim
– gridDim

(figure: a Device running Grid 1, a 3x2 arrangement of blocks from Block (0,0,0) to Block (2,1,0); Block (1,1,0) is expanded to show its 5x3 arrangement of threads, Thread (0,0,0) through Thread (4,2,0))
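The 3D indices in the figure flatten to a single linear thread ID as x + y*Dx + z*Dx*Dy, the ordering the CUDA Programming Guide uses. The sketch below is a hypothetical host-side helper illustrating that layout, not part of the deck:

```cpp
// Linear thread ID within a block of dimensions (Dx, Dy, Dz):
// thread (x, y, z) maps to x + y*Dx + z*Dx*Dy, so x varies fastest.
int linear_thread_id(int x, int y, int z, int Dx, int Dy) {
    return x + y * Dx + z * Dx * Dy;
}
```

For the 5x3 block in the figure, Thread (4,2,0) is the last of its 15 threads, with linear ID 14.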

Page 67

Textures

• Read-only object
– Dedicated cache

• Dedicated filtering hardware (linear, bilinear, trilinear)

• Addressable as 1D, 2D or 3D

• Out-of-bounds address handling (wrap, clamp)

(figure: a 2D texture grid with sample points at coordinates (2.5, 0.5) and (1.0, 1.0))

Page 68

Topics We Skipped

• We skipped some details; you can learn more:
– CUDA Programming Guide
– CUDA Zone: tools, training, webinars and more
  developer.nvidia.com/cuda

• Need a quick primer for later:
– Multi-dimensional indexing
– Textures