CUDA C/C++ BASICS © NVIDIA 2013

Source: cc.sjtu.edu.cn/Upload/20171012153914920.pdf


Page 1

CUDA C/C++ BASICS

Page 2

What is CUDA?

• CUDA Architecture
– Expose GPU parallelism for general-purpose computing
– Retain performance

• CUDA C/C++
– Based on industry-standard C/C++
– Small set of extensions to enable heterogeneous programming
– Straightforward APIs to manage devices, memory, etc.

• This session introduces CUDA C/C++

Page 3

Introduction to CUDA C/C++

• What will you learn in this session?
– Start from “Hello World!”
– Write and launch CUDA C/C++ kernels
– Manage GPU memory
– Manage communication and synchronization

Page 4

Prerequisites

• You (probably) need experience with C or C++

• You don’t need GPU experience

• You don’t need parallel programming experience

• You don’t need graphics experience

Page 5

CONCEPTS

Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices

Page 6

HELLO WORLD!

Page 7

Heterogeneous Computing

§ Terminology:
§ Host: the CPU and its memory (host memory)
§ Device: the GPU and its memory (device memory)

Page 8

Heterogeneous Computing

#include <iostream>
#include <algorithm>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn (device code)
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    int *in, *out;      // host copies of in, out
    int *d_in, *d_out;  // device copies of in, out
    int size = (N + 2*RADIUS) * sizeof(int);

    // serial code: alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // serial code: copy result back to host, then clean up
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}

Page 9

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

(figure: data crosses the PCI Bus between CPU and GPU)

Page 10

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute, caching data on chip for performance

(figure: data crosses the PCI Bus between CPU and GPU)

Page 11

Simple Processing Flow

1. Copy input data from CPU memory to GPU memory

2. Load GPU program and execute, caching data on chip for performance

3. Copy results from GPU memory to CPU memory

(figure: data crosses the PCI Bus between CPU and GPU)

Page 12

Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

Standard C that runs on the host

NVIDIA compiler (nvcc) can be used to compile programs with no device code

Output:

$ nvcc hello_world.cu
$ a.out
Hello World!
$

Page 13

Hello World! with Device Code

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

§ Two new syntactic elements…

Page 14

Hello World! with Device Code

__global__ void mykernel(void) {
}

• CUDA C/C++ keyword __global__ indicates a function that:
– Runs on the device
– Is called from host code

• nvcc separates source code into host and device components
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler
  • gcc, cl.exe

Page 15

Hello World! with Device Code

mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment

• That’s all that is required to execute a function on the GPU!

Page 16

Hello World! with Device Code

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

• mykernel() does nothing, somewhat anticlimactic!

Output:

$ nvcc hello.cu
$ a.out
Hello World!
$

Page 17

Parallel Programming in CUDA C/C++

• But wait… GPU computing is about massive parallelism!

• We need a more interesting example…

• We’ll start by adding two integers and build up to vector addition

(figure: a + b = c)

Page 18

Addition on the Device

• A simple kernel to add two integers

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• As before, __global__ is a CUDA C/C++ keyword meaning:
– add() will execute on the device
– add() will be called from the host

Page 19

Addition on the Device

• Note that we use pointers for the variables

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory

• We need to allocate memory on the GPU

Page 20

Memory Management

• Host and device memory are separate entities
– Device pointers point to GPU memory
  May be passed to/from host code
  May not be dereferenced in host code
– Host pointers point to CPU memory
  May be passed to/from device code
  May not be dereferenced in device code

• Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
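The analogy with the C equivalents can be made concrete on the host alone. The sketch below is a hypothetical helper, not part of the deck: plain malloc() and memcpy() stand in for cudaMalloc() and cudaMemcpy(), so the allocate / copy-in / compute / copy-out / free pattern can be traced without a GPU. On a real device only the function names change.

```cpp
#include <cstdlib>
#include <cstring>

// Host-only sketch of the device-memory life cycle. malloc/memcpy are
// stand-ins for cudaMalloc/cudaMemcpy; the loop stands in for a kernel.
void doubled_roundtrip(const int *h_in, int *h_out, int n) {
    size_t size = n * sizeof(int);
    int *d_buf = (int *)malloc(size);   // stand-in for cudaMalloc()
    memcpy(d_buf, h_in, size);          // stand-in for cudaMemcpy(HostToDevice)
    for (int i = 0; i < n; i++)         // stand-in for a kernel launch
        d_buf[i] *= 2;
    memcpy(h_out, d_buf, size);         // stand-in for cudaMemcpy(DeviceToHost)
    free(d_buf);                        // stand-in for cudaFree()
}
```

The shape of this function is exactly the shape of the main() functions shown on the following slides.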

Page 21

Addition on the Device: add()

• Returning to our add() kernel

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

• Let’s take a look at main()…

Page 22

Addition on the Device: main()

int main(void) {
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Setup input values
    a = 2;
    b = 7;

Page 23

Addition on the Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Page 24

RUNNING IN PARALLEL

Page 25

Moving to Parallel

• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?

  add<<< 1, 1 >>>();   becomes   add<<< N, 1 >>>();

• Instead of executing add() once, execute N times in parallel

Page 26

Vector Addition on the Device

• With add() running in parallel we can do vector addition

• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different index

Page 27

Vector Addition on the Device

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• On the device, each block can execute in parallel:

Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];
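The blocks jointly compute exactly what a serial loop computes, with each loop iteration playing the role of one block. A CPU reference like the sketch below (a hypothetical helper, not part of the deck) is a common way to validate a kernel's output.

```cpp
// CPU reference for the parallel add() kernel: iteration i of this
// loop computes what block i computes on the device, so comparing
// the two outputs element by element validates the GPU result.
void add_reference(const int *a, const int *b, int *c, int n) {
    for (int i = 0; i < n; i++)   // i plays the role of blockIdx.x
        c[i] = a[i] + b[i];
}
```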

Page 28

Vector Addition on the Device: add()

• Returning to our parallelized add() kernel

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• Let’s take a look at main()…

Page 29

Vector Addition on the Device: main()

#define N 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Page 30

Vector Addition on the Device: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Page 31

Review (1 of 2)

• Difference between host and device
– Host: CPU
– Device: GPU

• Using __global__ to declare a function as device code
– Executes on the device
– Called from the host

• Passing parameters from host code to a device function

Page 32

Review (2 of 2)

• Basic device memory management
– cudaMalloc()
– cudaMemcpy()
– cudaFree()

• Launching parallel kernels
– Launch N copies of add() with add<<<N,1>>>(…);
– Use blockIdx.x to access the block index

Page 33

INTRODUCING THREADS

Page 34

CUDA Threads

• Terminology: a block can be split into parallel threads

• Let’s change add() to use parallel threads instead of parallel blocks

• We use threadIdx.x instead of blockIdx.x

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

• Need to make one change in main()…

Page 35

Vector Addition Using Threads: main()

#define N 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Page 36

Vector Addition Using Threads: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N threads
    add<<<1,N>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Page 37

COMBINING THREADS AND BLOCKS

Page 38

Combining Blocks and Threads

• We’ve seen parallel vector addition using:
– Many blocks with one thread each
– One block with many threads

• Let’s adapt vector addition to use both blocks and threads

• Why? We’ll come to that…

• First let’s discuss data indexing…

Page 39

Indexing Arrays with Blocks and Threads

• No longer as simple as using blockIdx.x and threadIdx.x
– Consider indexing an array with one element per thread (8 threads/block):

threadIdx.x: 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
blockIdx.x:  0               | 1               | 2               | 3

• With M threads/block, a unique index for each thread is given by:

int index = threadIdx.x + blockIdx.x * M;

Page 40

Indexing Arrays: Example

• Which thread will operate on the red element (array index 21, of indices 0 … 31)?

threadIdx.x = 5
blockIdx.x = 2
M = 8

int index = threadIdx.x + blockIdx.x * M;
          = 5 + 2 * 8;
          = 21;
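The example's arithmetic can be checked directly on the host. The function below is a hypothetical helper mirroring the slide's formula, not part of the deck:

```cpp
// The slide's indexing formula: with M threads per block, the thread
// at (blockIdx.x, threadIdx.x) owns array element threadIdx.x + blockIdx.x * M.
int flat_index(int thread_idx_x, int block_idx_x, int M) {
    return thread_idx_x + block_idx_x * M;
}
```

With the values from the example, flat_index(5, 2, 8) gives 21, matching the red element.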

Page 41

Vector Addition with Blocks and Threads

• Use the built-in variable blockDim.x for threads per block:

int index = threadIdx.x + blockIdx.x * blockDim.x;

• Combined version of add() to use parallel threads and parallel blocks:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

• What changes need to be made in main()?

Page 42

Addition with Blocks and Threads: main()

#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

Page 43

Addition with Blocks and Threads: main()

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

Page 44

Handling Arbitrary Vector Sizes

• Typical problems are not friendly multiples of blockDim.x

• Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

• Update the kernel launch:
add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
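The launch expression (N + M-1) / M is integer ceiling division: it rounds the block count up so the last, possibly partial, block is still launched, while the `if (index < n)` guard in the kernel discards the spare threads. A small host-side sketch (a hypothetical helper, not from the deck) shows the rounding:

```cpp
// Ceiling division used to size the grid: rounds n / threads_per_block
// up, so every element gets a thread even when n is not a multiple
// of the block size.
int num_blocks(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```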

Page 45

Why Bother with Threads?

• Threads seem unnecessary
– They add a level of complexity
– What do we gain?

• Unlike parallel blocks, threads have mechanisms to:
– Communicate
– Synchronize

• To look closer, we need a new example…

Page 46

Review

• Launching parallel kernels
– Launch N copies of add() with add<<<N/M,M>>>(…);
– Use blockIdx.x to access the block index
– Use threadIdx.x to access the thread index within the block

• Allocate elements to threads:

int index = threadIdx.x + blockIdx.x * blockDim.x;

Page 47

COOPERATING THREADS

Page 48

1D Stencil

• Consider applying a 1D stencil to a 1D array of elements
– Each output element is the sum of input elements within a radius

• If radius is 3, then each output element is the sum of 7 input elements
  (the element itself plus radius elements on either side)
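Before turning to the CUDA version, the stencil is easy to state as a host-side reference (a hypothetical helper, not part of the deck). It also makes the halo requirement explicit: the input must extend `radius` elements beyond the interior on each side, mirroring the `d_in + RADIUS` pointer offset used when the kernel is launched later in the deck.

```cpp
// CPU reference for the 1D stencil: out[i] sums the 2*radius + 1
// inputs centred on i. `in` and `out` point at the first *interior*
// element; the caller must provide `radius` halo elements on each
// side of the n interior inputs.
void stencil_reference(const int *in, int *out, int n, int radius) {
    for (int i = 0; i < n; i++) {
        int result = 0;
        for (int offset = -radius; offset <= radius; offset++)
            result += in[i + offset];
        out[i] = result;
    }
}
```

On an all-ones input with radius 3, every output element is 7, matching the "sum of 7 input elements" statement above.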

Page 49

Implementing Within a Block

• Each thread processes one output element
– blockDim.x elements per block

• Input elements are read several times
– With radius 3, each input element is read seven times

Page 50

Sharing Data Between Threads

• Terminology: within a block, threads share data via shared memory

• Extremely fast on-chip memory, user-managed

• Declare using __shared__, allocated per block

• Data is not visible to threads in other blocks

Page 51

Implementing with Shared Memory

• Cache data in shared memory
– Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
– Each block needs a halo of radius elements at each boundary

(figure: blockDim.x output elements, with a halo on the left and a halo on the right)

Page 52

Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

Page 53

Stencil Kernel

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

Page 54

Data Race!

§ The stencil example will not work…

§ Suppose thread 15 reads the halo before thread 0 has fetched it…

temp[lindex] = in[gindex];                        // Store at temp[18]
if (threadIdx.x < RADIUS) {                       // Skipped, threadIdx.x > RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];                       // Load from temp[19]

Page 55

__syncthreads()

• void __syncthreads();

• Synchronizes all threads within a block
– Used to prevent RAW / WAR / WAW hazards

• All threads must reach the barrier
– In conditional code, the condition must be uniform across the block

Page 56

Stencil Kernel

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

Page 57

Stencil Kernel

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

Page 58

Review (1 of 2)

• Launching parallel threads
– Launch N blocks with M threads per block with kernel<<<N,M>>>(…);
– Use blockIdx.x to access the block index within the grid
– Use threadIdx.x to access the thread index within the block

• Allocate elements to threads:

int index = threadIdx.x + blockIdx.x * blockDim.x;

Page 59

Review (2 of 2)

• Use __shared__ to declare a variable/array in shared memory
– Data is shared between threads in a block
– Not visible to threads in other blocks

• Use __syncthreads() as a barrier
– Use to prevent data hazards

Page 60

MANAGING THE DEVICE

Page 61

Coordinating Host & Device

• Kernel launches are asynchronous
– Control returns to the CPU immediately

• The CPU needs to synchronize before consuming the results:

cudaMemcpy()             Blocks the CPU until the copy is complete;
                         the copy begins when all preceding CUDA calls have completed
cudaMemcpyAsync()        Asynchronous, does not block the CPU
cudaDeviceSynchronize()  Blocks the CPU until all preceding CUDA calls have completed

Page 62

Reporting Errors

• All CUDA API calls return an error code (cudaError_t)
– Error in the API call itself, OR
– Error in an earlier asynchronous operation (e.g. a kernel)

• Get the error code for the last error:
cudaError_t cudaGetLastError(void)

• Get a string to describe the error:
char *cudaGetErrorString(cudaError_t)

printf("%s\n", cudaGetErrorString(cudaGetLastError()));

Page 63

Device Management

• Application can query and select GPUs
cudaGetDeviceCount(int *count)
cudaSetDevice(int device)
cudaGetDevice(int *device)
cudaGetDeviceProperties(cudaDeviceProp *prop, int device)

• Multiple host threads can share a device

• A single host thread can manage multiple devices
cudaSetDevice(i) to select the current device
cudaMemcpy(…) for peer-to-peer copies✝

✝ requires OS and device support

Page 64

Introduction to CUDA C/C++

• What have we learned?
– Write and launch CUDA C/C++ kernels
  • __global__, blockIdx.x, threadIdx.x, <<<>>>
– Manage GPU memory
  • cudaMalloc(), cudaMemcpy(), cudaFree()
– Manage communication and synchronization
  • __shared__, __syncthreads()
  • cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()

Page 65

Compute Capability

• The compute capability of a device describes its architecture, e.g.
– Number of registers
– Sizes of memories
– Features & capabilities

• The following presentations concentrate on Fermi devices
– Compute Capability >= 2.0

Compute Capability | Selected Features (see the CUDA C Programming Guide for the complete list) | Tesla models
1.0 | Fundamental CUDA support | 870
1.3 | Double precision, improved memory accesses, atomics | 10-series
2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series

Page 66

IDs and Dimensions

– A kernel is launched as a grid of blocks of threads

• blockIdx and threadIdx are 3D

• We showed only one dimension (x)

• Built-in variables:
– threadIdx
– blockIdx
– blockDim
– gridDim

(figure: a Device running Grid 1, a 3x2 arrangement of blocks from Block (0,0,0) to Block (2,1,0); Block (1,1,0) is expanded to show its 5x3 arrangement of threads, Thread (0,0,0) through Thread (4,2,0))
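The 3D indices in the figure flatten to a single linear thread ID as x + y*Dx + z*Dx*Dy, the ordering the CUDA Programming Guide uses. The sketch below is a hypothetical host-side helper illustrating that layout, not part of the deck:

```cpp
// Linear thread ID within a block of dimensions (Dx, Dy, Dz):
// thread (x, y, z) maps to x + y*Dx + z*Dx*Dy, so x varies fastest.
int linear_thread_id(int x, int y, int z, int Dx, int Dy) {
    return x + y * Dx + z * Dx * Dy;
}
```

For the 5x3 block in the figure, Thread (4,2,0) is the last of its 15 threads, with linear ID 14.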

Page 67

Textures

• Read-only object
– Dedicated cache

• Dedicated filtering hardware (linear, bilinear, trilinear)

• Addressable as 1D, 2D or 3D

• Out-of-bounds address handling (wrap, clamp)

(figure: a 2D texture grid with sample points at coordinates (2.5, 0.5) and (1.0, 1.0))

Page 68

Topics We Skipped

• We skipped some details; you can learn more:
– CUDA Programming Guide
– CUDA Zone: tools, training, webinars and more
  developer.nvidia.com/cuda

• Need a quick primer for later:
– Multi-dimensional indexing
– Textures