8/14/2019 CyrilZeller_IntroductionToCUDACC
NVIDIA Corporation 2011
Introduction to CUDA C
QCon 2011
Cyril Zeller, NVIDIA Corporation
Welcome
Goal: an introduction to GPU programming
CUDA = NVIDIA's architecture for GPU computing
What will you learn in this session?
Start from Hello World!
Write and execute C code on the GPU
Manage GPU memory
Manage communication and synchronization
GPUs are Fast!
[Charts: CPU Server vs. GPU-CPU Server]
Performance (Gflops): CPU Server 80.1, GPU-CPU Server 656.1
Performance / $ (Gflops / $K): CPU Server 11, GPU-CPU Server 60
[Third chart, truncated: CPU Server 146]
CPU 1U Server: 2x Intel Xeon X5550 (Nehalem) 2.66 GHz, 48 GB memory, $7K, 0.55
GPU-CPU 1U Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1
8x Higher Linpack
World's Fastest MD Simulation
Sustained Performance of 1.87 Petaflops/s
Institute of Process Engineering (IPE), Chinese Academy of Sciences (CAS)
MD Simulation for Crystalline Silicon
Used all 7168 Tesla G
Tianhe-1A GPU Super
World's Greenest Petaflop Supercomputer
Tsubame 2.0, Tokyo Institute of Technology
1.19 Petaflops
4,224 Tesla M2050 GPUs
Increasing Number of Professional CUDA Applications
Tools & Libraries
Oil & Gas
TotalView Debugger
Thrust C++ Template Lib
R-Stream Reservoir Labs
NVIDIA NPP Perf Primitives
Bright Cluster Manager
CAPS HMPP
PBSWorks
EMPhotonics CULAPACK
NVIDIA Video Libraries
CUDA C/C++
PGI Fortran
Parallel Nsight Vis Studio IDE
Allinea DDT Debugger
GPU Packages For R Stats Pkg
IMSL
pyCUDA
ParaTools VampirTrace
PGI Accelerators
MAGMA
Available Now
Paradigm SKUA
StoneRidge RTM
Headwave Suite
Acceleware RTM Solver
GeoStar Seismic
ffA SVI Pro
OpenGeo Solns OpenSEIS
Seismic City RTM
Paradigm GeoDepth RTM
VSG Open Inventor
Tsunami RTM
VSG Avizo
SVI Pro
SEA 3D Pro 2010
AccelerEyes Jacket: MATLAB
Mathematica, MATLAB, LabVIEW Libraries
Numerical Analytics
MOAB Adaptive Comp
Torque Adaptive Comp
Finance
Aquimin AlphaVision
NAG RNG
SciComp SciFinance
Hanweck Volera Options Analysis
Murex MACS
Numerix Counterparty Risk
Avai
Siemens 4D Ultrasound
Digisens CT
Schrodinger Core Hopping
Manifold GIS
Dalsa Mach Vision
Other
Useful Prog
Medical Imag
WRF Weather
ASUCA Weather Model
MVTech Mach Vision
Increasing Number of Professional CUDA Applications
Bio-Chemistry
Bio-Informatics
GPU-HMMR MUMmerGPU
CUDA-BLASTP CUDA-EC CUDA-MEME CUDA SW++ OpenEye ROCS
HEX Protein Docking
TeraChem BigDFT ABINT
VMD
Acellera ACEMD
AMBER
DL-POLY
GROMACS GROMOS HOOMD NAMD
GAMESS
PIPER Docking
LAMMPS
EDA
Agilent ADS SPICE Sim
Remcom XFdtd
Agilent EMPro 2010
CST Microwave
SPEAG SEMCAD X
ANSOFT Nexxim
Synopsys TCAD
Available Now
CAE
ACUSIM/Altair AcuSolve
Autodesk Moldflow
ANSYS Mechanical
SIMULIA Abaqus/Std
FluiDyna Culises OpenFOAM
Metacomp CFD++
Impetus AFEA
Rendering
Video
Sorenson Squeeze 7
Fraunhofer JPEG2000
Elemental Live & Server
MotionDSP Ikena Video
MS Expression Encoder
MainConcept CUDA H.264
Adobe Premiere Pro
Dassault Catia v6 (iray)
NVIDIA OptiX (SDK)
mental images iray (OEM)
Bunkspeed Shot (iray)
Refractive SW Octane
Lightworks Artisan, Author
Chaos Group V-Ray RT
Caustic OpenRL (SDK)
Weta Digital PantaRay
Autodesk 3ds Max (iray)
Works Zebra Zeany
Avai
CUDA by the Numbers
CUDA Capable GPUs: 300,000,000
CUDA Toolkit Downloads: 500,000
Active CUDA Developers: 100,000
Universities Teaching CUDA: 400
% OEMs offer CUDA GPU PCs: 100
[Diagram: the CUDA platform stack, top to bottom]
GPU Computing Applications
Libraries and Middleware: cuFFT, cuBLAS, cuRAND, cuSPARSE | LAPACK: CULA, MAGMA | Thrust, NPP | cuDPP | VSIPL, SVM, OpenCurrent | PhysX, OptiX | iray, RealityServer
Languages and APIs: C, C++, Fortran, Java and Python Wrappers, DirectCompute, OpenCL
NVIDIA GPU with the CUDA Parallel Computing Architecture
Fermi Architecture (compute capabilities 2.x): GeForce 400 Series, Quadro Fermi Series, Tesla 20 Series
Tesla Architecture (compute capabilities 1.x): GeForce 200, 9 and 8 Series | Quadro FX, Plex and NVS Series | Tesla 10 Series
Markets: Entertainment, Professional Graphics, High Performance Computing
OpenCL is trademark of Apple Inc. use
Tesla Data Center & Workstation GPU Solutions
Servers & Blades: Tesla M-series GPUs M2090 | M2070 | M2050
Workstations: Tesla C-series C2070 | C
                                        M2090        M2070      M2050        C2070
Cores                                   512          448        448          448
Memory                                  6 GB         6 GB       3 GB         6 GB
Memory bandwidth (ECC off)              177.6 GB/s   150 GB/s   148.8 GB/s   148.8 GB/s
Peak perf, single precision (Gflops)    1331         1030       1030         1030
Peak perf, double precision (Gflops)    665          515        515          515
NVIDIA Developer Ecosystem
Debuggers & Profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight, Visual Studio, Allinea, TotalView
Numerical Packages: MATLAB, Mathematica, NI LabView, pyCUDA
GPU Compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python
Parallelizing Compilers: PGI Accelerator, CAPS HMPP, mCUDA, OpenMP
GPGPU Consultants & Training: ANEO, GPU Tech
Parallel Nsight: Visual Studio
Visual Profiler: Windows/Linux/Mac
cudLin
What is CUDA?
CUDA Architecture
Expose GPU computing for general purpose
Retain performance
CUDA C/C++
Based on industry-standard C/C++
Small set of extensions to enable heterogeneous programming
Straightforward APIs to manage devices, memory etc.
This session introduces CUDA C/C++
CONCEPTS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
HELLO WORLD!
Heterogeneous Computing
#include <iostream>
#include <algorithm>
using namespace std;
#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];
    // Store the result
    out[gindex] = result;
}
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
PCI Bus
Hello World!
#include <stdio.h>
int main(void) {
    printf("Hello World!\n");
    return 0;
}
Standard C that runs on the host
NVIDIA compiler (nvcc) can be used to compile programs with no device code
Output:
$ nvcc hello_wo
$ a.out
Hello World!
$
Hello World! with Device Code
__global__ void mykernel(void) {
}
int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
Two new syntactic elements...
Hello World! with Device Code
__global__ void mykernel(void) {
}
CUDA C/C++ keyword __global__ indicates a function that:
    Runs on the device
    Is called from host code
nvcc separates source code into host and device components
    Device functions (e.g. mykernel()) processed by NVIDIA compiler
    Host functions (e.g. main()) processed by standard host compiler
    - gcc, cl.exe
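The host-to-device call can be sketched as a complete program. This is a minimal illustration, not the deck's own listing: the helper name run_hello() is hypothetical, and the cudaDeviceSynchronize() call and error check are additions so the program reports a failed launch instead of silently ignoring it. The <<<1, 1>>> launch configuration asks for one block containing one thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Empty kernel: __global__ marks it as device code that is called from the host.
__global__ void mykernel(void) {
}

// Hypothetical helper: launches the kernel and returns 0 on success, 1 on error.
int run_hello(void) {
    // Triple angle brackets mark a call from host code to device code.
    // Here: 1 block, 1 thread per block.
    mykernel<<<1, 1>>>();

    // Wait for the kernel to finish and pick up any error it raised.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Hello World!\n");
    return 0;
}

int main(void) {
    return run_hello();
}
```

Compiling with nvcc (for example, nvcc on a file with a .cu extension) sends mykernel() to the device compiler and the remaining functions to the host compiler.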
Hello World! with Device Code
__global__ void mykernel(void) {
}
int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}
mykernel() does nothing, somewhat anticlimactic!
Output:
$ nvcc h
$ a.out
Hello World!
$
Parallel Programming in CUDA C/C++
But wait... GPU computing is about massive parallelism!
We need a more interesting example...
We'll start by adding two integers and build up to vector addition
Addition on the Device
A simple kernel to add two integers
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
As before __global__ is a CUDA C/C++ keyword meaning
    add() will execute on the device
    add() will be called from the host
Addition on the Device
Note that we use pointers for the variables
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
add() runs on the device, so a, b and c must point to device memory
We need to allocate memory on the GPU
Addition on the Device: add()
Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}
Let's take a look at main()...
Addition on the Device: main()
int main(void) {
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);
    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Setup input values
    a = 2;
    b = 7;
Addition on the Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Moving to Parallel
GPU computing is about massive parallelism
So how do we run code in parallel on the device?
add<<< 1, 1 >>>();   ->   add<<< N, 1 >>>();
Instead of executing add() once, execute N times in parallel
Vector Addition on the Device: add()
Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
Let's take a look at main()...
Vector Addition on the Device: main()
#define N 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);
    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
Vector Addition on the Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Review (1 of 2)
Difference between host and device
    Host: CPU
    Device: GPU
Using __global__ to declare a function as device code
Executes on the device
Called from the host
Passing parameters from host code to a device function
Review (2 of 2)
Basic device memory management
    cudaMalloc()
    cudaMemcpy()
    cudaFree()
Launching parallel kernels
    Launch N copies of add() with add<<<N,1>>>(...);
    Use blockIdx.x to access block index
INTRODUCING THREADS
CUDA Threads
Terminology: a block can be split into parallel threads
Let's change add() to use parallel threads instead of parallel blocks
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
We use threadIdx.x instead of blockIdx.x
Need to make one change in main()...
Vector Addition Using Threads: main()
#define N 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);
    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
Vector Addition Using Threads: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU with N threads
    add<<<1,N>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Combining Blocks and Threads
We've seen parallel vector addition using:
    Several blocks with one thread each
    One block with several threads
Let's adapt vector addition to use both blocks and threads
Why? We'll come to that...
First let's discuss data indexing...
Indexing Arrays with Blocks and Threads
No longer as simple as using blockIdx.x and threadIdx.x
Consider indexing an array with one element per thread (8 threads/block)
threadIdx.x:  0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7
blockIdx.x:   0               | 1               | 2               | 3
With M threads per block, a unique index for each thread is given by:
int index = threadIdx.x + blockIdx.x * M;
Indexing Arrays: Example
Which thread will operate on the red element?
[Diagram: an array split into blocks of M = 8 threads; the red element sits at threadIdx.x = 5 in the block with blockIdx.x = 2]
int index = threadIdx.x + blockIdx.x * M;
          = 5 + 2 * 8;
          = 21;
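The index formula can be checked on a real device with a small sketch. The kernel name record_index(), the helper global_index(), and the choice of 4 blocks are assumptions made for this illustration; only M = 8 and the formula itself come from the slide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define M 8                      // threads per block, as on the slide
#define NUM_BLOCKS 4             // assumed block count for this sketch
#define TOTAL (M * NUM_BLOCKS)   // one array element per thread

// Each thread records which block and thread computed its global index.
__global__ void record_index(int *block_of, int *thread_of) {
    int index = threadIdx.x + blockIdx.x * M;
    block_of[index] = blockIdx.x;
    thread_of[index] = threadIdx.x;
}

// Host-side version of the same formula, useful for checking by hand.
int global_index(int thread_idx, int block_idx, int threads_per_block) {
    return thread_idx + block_idx * threads_per_block;
}

int main(void) {
    int block_of[TOTAL], thread_of[TOTAL];
    int *d_block_of, *d_thread_of;
    cudaMalloc((void **)&d_block_of, TOTAL * sizeof(int));
    cudaMalloc((void **)&d_thread_of, TOTAL * sizeof(int));

    record_index<<<NUM_BLOCKS, M>>>(d_block_of, d_thread_of);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(block_of, d_block_of, TOTAL * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(thread_of, d_thread_of, TOTAL * sizeof(int), cudaMemcpyDeviceToHost);

    // Element 21 is the slide's example: 5 + 2 * 8.
    printf("element 21: blockIdx.x = %d, threadIdx.x = %d\n",
           block_of[21], thread_of[21]);
    printf("host formula: %d\n", global_index(5, 2, M));

    cudaFree(d_block_of);
    cudaFree(d_thread_of);
    return 0;
}
```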
Vector Addition with Blocks and Threads
Use the built-in variable blockDim.x for threads per block
int index = threadIdx.x + blockIdx.x * blockDim.x;
Combined version of add() to use parallel threads and parallel blocks
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}
What changes need to be made in main()?
Addition with Blocks and Threads: main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
    int *a, *b, *c;           // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = N * sizeof(int);
    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    // Alloc space for host copies of a, b, c and setup input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
Addition with Blocks and Threads: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU
    add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Handling Arbitrary Vector Sizes
Typical problems are not friendly multiples of blockDim.x
Avoid accessing beyond the end of the arrays:
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}
Update the kernel launch:
add<<<(N + M-1) / M,M>>>(d_a, d_b, d_c, N);
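The rounding in the launch configuration is an integer ceiling division: enough blocks are launched to cover every element, and the bounds check in the kernel idles the surplus threads of the last block. A runnable sketch follows, with assumptions labeled: the helper blocks_for(), the vector length 1000, and the deterministic input values are all illustrative choices, not taken from the slides.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 512

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)   // threads past the end of the arrays do nothing
        c[index] = a[index] + b[index];
}

// Hypothetical helper: number of blocks needed to cover n elements
// (integer ceiling of n / threads_per_block).
int blocks_for(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

int main(void) {
    const int n = 1000;   // deliberately not a multiple of THREADS_PER_BLOCK
    size_t size = n * sizeof(int);

    int *a = (int *)malloc(size);
    int *b = (int *)malloc(size);
    int *c = (int *)malloc(size);
    for (int i = 0; i < n; i++) { a[i] = i; b[i] = 2 * i; }

    int *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    int blocks = blocks_for(n, THREADS_PER_BLOCK);
    add<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Every element should equal i + 2 * i.
    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (c[i] != 3 * i) mismatches++;
    printf("blocks launched: %d, mismatches: %d\n", blocks, mismatches);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return mismatches != 0;
}
```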
Why Bother with Threads?
Threads seem unnecessary
    They add a level of complexity
    What do we gain?
Unlike parallel blocks, threads have mechanisms to efficiently:
    Communicate
    Synchronize
To look closer, we need a new example...
1D Stencil
Consider applying a 1D stencil to a 1D array of elements
    Each output element is the sum of input elements within a radius
If radius is 3, then each output element is the sum of 7 input elements
[Diagram: in and out arrays]
Implementing Within a Block
Each thread processes one output element
    blockDim.x elements per block
Input elements are read several times
    With radius 3, each input element is read seven times
[Diagram: in and out arrays; Thread 0 through Thread 8 each read the input elements within radius of their output element]
Sharing Data Between Threads
Terminology: within a block, threads share data via shared memory
Extremely fast on-chip memory
    As opposed to device memory, referred to as global memory
    Like a user-managed cache
Declare using __shared__, allocated per block
Data is not visible to threads in other blocks
Implementing With Shared Memory
Cache data in shared memory
    Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
    Compute blockDim.x output elements
    Write blockDim.x output elements to global memory
Each block needs a halo of radius elements at each boundary
[Diagram: in and out arrays; blockDim.x output elements, with a halo on the left and a halo on the right of the input]
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
Stencil Kernel
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];
    // Store the result
    out[gindex] = result;
}
The stencil example will not work...
Suppose thread 15 reads the halo before thread 0 has fetched it...
...
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
    result += temp[lindex + offset];
...
void __syncthreads();
Synchronizes all threads within a block
Used to prevent RAW / WAR / WAW hazards
All threads must reach the barrier
    In conditional code, the condition must be uniform across the block
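A small sketch of the barrier preventing a read-after-write hazard: each thread writes one element to shared memory, then reads an element written by a different thread. The kernel reverse_block(), the helper run_reverse(), and the block size of 64 are assumptions chosen for this illustration. The barrier sits outside any conditional, so every thread in the block reaches it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 64   // assumed block size for this sketch

// Reverse a BLOCK_SIZE-element array in place using a single block.
__global__ void reverse_block(int *data) {
    __shared__ int temp[BLOCK_SIZE];
    int t = threadIdx.x;
    temp[t] = data[t];                    // each thread writes one element
    __syncthreads();                      // all writes finish before any read
    data[t] = temp[BLOCK_SIZE - 1 - t];   // read what another thread wrote
}

// Hypothetical helper: runs the kernel on 0..BLOCK_SIZE-1 and returns the
// number of elements that are not in reversed order afterwards.
int run_reverse(void) {
    int h[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; i++) h[i] = i;

    int *d;
    cudaMalloc((void **)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverse_block<<<1, BLOCK_SIZE>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int mismatches = 0;
    for (int i = 0; i < BLOCK_SIZE; i++)
        if (h[i] != BLOCK_SIZE - 1 - i) mismatches++;
    return mismatches;
}

int main(void) {
    int mismatches = run_reverse();
    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```

Without the __syncthreads() call, a thread could read temp[BLOCK_SIZE - 1 - t] before the thread responsible for that slot has written it.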
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
Stencil Kernel
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];
    // Store the result
    out[gindex] = result;
}
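For context, here is one way the synchronized kernel could be driven from the host. This is a sketch under stated assumptions, not the deck's own main(): the helper run_stencil(), the all-ones input, and the padding scheme (RADIUS extra elements on each side, with the device pointers offset by RADIUS so the halo reads stay in bounds) are illustrative choices. N is assumed to be a multiple of BLOCK_SIZE.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1024          // number of output elements, a multiple of BLOCK_SIZE
#define RADIUS 3
#define BLOCK_SIZE 16

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory, including both halos
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

// Hypothetical helper: returns the number of outputs that differ from the
// expected sum. With an all-ones input every output is 2 * RADIUS + 1.
int run_stencil(void) {
    int padded = N + 2 * RADIUS;   // RADIUS padding elements on each side
    size_t size = padded * sizeof(int);

    int *in = (int *)malloc(size);
    int *out = (int *)malloc(size);
    for (int i = 0; i < padded; i++) { in[i] = 1; out[i] = 0; }

    int *d_in, *d_out;
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Offset both pointers so gindex 0 is the first non-padding element.
    stencil_1d<<<N / BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = RADIUS; i < N + RADIUS; i++)
        if (out[i] != 2 * RADIUS + 1) mismatches++;

    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return mismatches;
}

int main(void) {
    int mismatches = run_stencil();
    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```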
Review (1 of 2)
Launching parallel threads
    Launch N blocks with M threads per block with kernel<<<N,M>>>(...);
    Use blockIdx.x to access block index within grid
    Use threadIdx.x to access thread index within block
Allocate elements to threads:
int index = threadIdx.x + blockIdx.x * blockDim.x;
Review (2 of 2)
Use __shared__ to declare a variable/array in shared memory
    Data is shared between threads in a block
    Not visible to threads in other blocks
Use __syncthreads() as a barrier
    Use to prevent data hazards
Coordinating Host & Device
Kernel launches are asynchronous
    Control returns to the CPU immediately
CPU needs to synchronize before consuming the results
cudaMemcpy()              Blocks the CPU until the copy is complete
                          Copy begins when all preceding CUDA calls have completed
cudaMemcpyAsync()         Asynchronous, does not block the CPU
cudaDeviceSynchronize()   Blocks the CPU until all preceding CUDA calls have completed
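The ordering described above can be sketched as follows. The kernel fill(), the helper run_fill(), the array length, and the fill value 7 are assumptions made for this illustration. The explicit cudaDeviceSynchronize() is shown to make the wait visible; the blocking cudaMemcpy() calls after it would also wait for the kernel on their own.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Writes the same value into every element of data.
__global__ void fill(int *data, int value, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        data[index] = value;
}

// Hypothetical helper: returns 0 if the first and last elements hold the
// fill value after the kernel has completed, 1 otherwise.
int run_fill(void) {
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(int));

    // The launch returns control to the CPU immediately;
    // the kernel may still be running on the GPU.
    fill<<<(n + 255) / 256, 256>>>(d_data, 7, n);

    // Block the CPU until all preceding CUDA calls have completed.
    cudaDeviceSynchronize();

    // cudaMemcpy() blocks the CPU until the copy is complete.
    int first = 0, last = 0;
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&last, d_data + n - 1, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    return (first == 7 && last == 7) ? 0 : 1;
}

int main(void) {
    int status = run_fill();
    printf("status: %d\n", status);
    return status;
}
```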
Reporting Errors
All CUDA API calls return an error code (cudaError_t)
    Error in the API call itself
    OR
    Error in an earlier asynchronous operation (e.g. kernel)
Get the error code for the last error:
cudaError_t cudaGetLastError(void)
Get a string to describe the error:
const char *cudaGetErrorString(cudaError_t)
printf("%s\n", cudaGetErrorString(cudaGetLastError()));
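A common way to apply these calls is to wrap every API call in a checking macro. The macro CHECK_CUDA, the kernel noop(), and the helper run_checked() below are hypothetical names for this sketch and are not part of the CUDA API; only cudaGetLastError(), cudaGetErrorString(), cudaDeviceSynchronize(), cudaMalloc() and cudaFree() are real runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: prints the error string with file and line,
// then exits, if the wrapped call did not return cudaSuccess.
#define CHECK_CUDA(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

__global__ void noop(void) {
}

// Hypothetical helper: returns 0 if every checked call succeeded
// (any failure exits inside CHECK_CUDA).
int run_checked(void) {
    int *d_data;
    CHECK_CUDA(cudaMalloc((void **)&d_data, 16 * sizeof(int)));

    noop<<<1, 1>>>();
    CHECK_CUDA(cudaGetLastError());        // error from the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());   // error raised while the kernel ran

    CHECK_CUDA(cudaFree(d_data));
    return 0;
}

int main(void) {
    int status = run_checked();
    printf("no CUDA errors reported\n");
    return status;
}
```

Checking after the launch and again after synchronizing separates the two error sources listed above: a bad launch configuration shows up in cudaGetLastError(), while a fault inside the kernel surfaces at the next synchronizing call.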
Introduction to CUDA C/C++
What have we learned?
Write and launch CUDA C/C++ kernels
    - __global__, <<<>>>, blockIdx, threadIdx, blockDim
Manage GPU memory
    - cudaMalloc(), cudaMemcpy(), cudaFree()
Manage communication and synchronization
    - __shared__, __syncthreads()
    - cudaMemcpy() vs cudaMemcpyAsync(), cudaDeviceSynchronize()
Topics we skipped
We skipped some details, you can learn more:
CUDA Programming Guide
CUDA Zone: tools, training, webinars and more
http://developer.nvidia.com/cuda
Next step:
Download the CUDA toolkit at www.nvidia.com/getcuda
Accelerate your application with GPUs!