
Introduction to CUDA

GPU Performance History

GPUs are massively multithreaded many-core chips
• Hundreds of cores, thousands of concurrent threads
• Huge economies of scale
• Still on an aggressive performance growth curve
• High memory bandwidth

[Chart: GPU performance growth across the G80 and GT200 generations]

CUDA: A Parallel Computing Architecture for NVIDIA GPUs

Supports standard languages and APIs:
• C
• OpenCL
• DX Compute
• Fortran (PGI)

Supported on common operating systems:
• Windows
• Mac OS X
• Linux

NVIDIA supports any initiative that unleashes the massive power of the GPU

© NVIDIA Corporation 2008

C for CUDA

CUDA is industry-standard C
• Write a program for one thread
• Instantiate it on many parallel threads
• Familiar programming model and language

CUDA is a scalable parallel programming model
• Program runs on any number of processors without recompiling

© NVIDIA Corporation 2008

GPU Sizes Require CUDA Scalability

[Diagram: GPUs of different sizes, from 32 SP cores to 128 SP cores to 240 SP cores]

© NVIDIA Corporation 2008

CUDA Runs on NVIDIA GPUs… Over 100 Million CUDA GPUs Deployed

• Tesla™: High-Performance Computing
• Quadro®: Design & Creation
• GeForce®: Entertainment

© NVIDIA Corporation 2008

Pervasive CUDA Parallel Computing

CUDA brings data-parallel computing to the masses
• Over 100 M CUDA-capable GPUs deployed since Nov 2006

Wide developer acceptance
• Download CUDA from www.nvidia.com/cuda
• Over 50K CUDA developer downloads
• A GPU “developer kit” costs ~$100 for several hundred GFLOPS

Data-parallel supercomputers are everywhere!
• CUDA makes this power readily accessible
• Enables rapid innovations in data-parallel computing

Parallel computing rides the commodity technology wave

© NVIDIA Corporation 2008

CUDA Zone: www.nvidia.com/cuda

Resources, examples, and pointers for CUDA developers

© NVIDIA Corporation 2008

Introducing the Tesla T10P Processor
…NVIDIA’s 2nd-generation CUDA processor

• 1.4 billion transistors
• 1 teraflop of processing power
• 240 SP processing cores
• 30 DP processing cores with IEEE-754 double precision

© NVIDIA Corporation 2008

CUDA Computing with Tesla T10
• 240 SP processors at 1.44 GHz: 1 TFLOPS peak
• 30 DP processors at 1.44 GHz: 86 GFLOPS peak
• 128 threads per processor: 30,720 threads total

© NVIDIA Corporation 2008

Double-Precision Floating Point: NVIDIA GPU vs. SSE2 vs. Cell SPE

• Precision: IEEE 754 / IEEE 754 / IEEE 754
• Rounding modes for FADD and FMUL: all 4 IEEE (round to nearest, zero, +inf, -inf) / all 4 IEEE / round to zero (truncate) only
• Denormal handling: full speed / supported, costs 1000s of cycles / flush to zero
• NaN support: yes / yes / no
• Overflow and infinity support: yes / yes / no infinity, clamps to max norm
• Flags: no / yes / some
• FMA: yes / no / yes
• Square root: software with low-latency FMA-based convergence / hardware / software only
• Division: software with low-latency FMA-based convergence / hardware / software only
• Reciprocal estimate accuracy: 24 bit / 12 bit / 12 bit
• Reciprocal sqrt estimate accuracy: 23 bit / 12 bit / 12 bit
• log2(x) and 2^x estimate accuracy: 23 bit / no / no

(Each row lists NVIDIA GPU / SSE2 / Cell SPE.)

© NVIDIA Corporation 2008

Tesla C1060 Computing Processor

• Processor: 1x Tesla T10P
• Core clock: 1.29 GHz
• Form factor: full ATX, 4.736” (H) x 10.5” (L), dual-slot wide
• On-board memory: 4 GB
• System I/O: PCIe x16 Gen2
• Memory I/O: 512-bit, 800 MHz DDR; 102 GB/s peak bandwidth
• Display outputs: none
• Typical power: 160 W

© NVIDIA Corporation 2008

Tesla S1070 1U System

• Processors: 4x Tesla T10P
• Core clock: 1.44 GHz
• Form factor: 1U for an EIA 19” 4-post rack
• Total 1U system memory: 16 GB (4.0 GB per GPU)
• System I/O: 2x PCIe x16
• Memory I/O per processor: 512-bit, 800 MHz GDDR; 102 GB/s peak bandwidth
• Display outputs: none
• Typical power: 700 W
• Chassis dimensions: 1.73” H × 17.5” W × 28.5” D

© NVIDIA Corporation 2008

Applications

© NVIDIA Corporation 2008

Folding@home Performance Comparison

F@H kernel based on GROMACS code

[Bar chart of simulation rate in nanoseconds: CPU 4, PS3 100, Radeon HD3870 170, Radeon HD4850 377, Tesla 8-series 423, Tesla 10-series 740]

© NVIDIA Corporation 2008

Lattice Boltzmann

1000 iterations on a 256x128x128 domain
• cluster with 8 GPUs: 7.5 sec
• Blue Gene/L, 512 nodes: 21 sec

10000 iterations on an irregular 1057x692x1446 domain with 4M fluid nodes
• 1x C870: 760 s, 53 MLUPS
• 2x C1060: 159 s, 252 MLUPS
• 8x C1060: 42 s, 955 MLUPS

Blood flow pattern in a human coronary artery, Bernaschi et al.

© NVIDIA Corporation 2008

Desktop GPU Supercomputer Beats Cluster

FASTRA: 8 GPUs in a desktop

CalcUA: 256 nodes (512 cores)

http://fastra.ua.ac.be/en/index.html

© NVIDIA Corporation 2008

CUDA-accelerated Linpack

Standard HPL code, with a library that intercepts DGEMM and DTRSM calls and executes them simultaneously on the GPUs and CPU cores. The library is implemented with CUBLAS (a simplified sketch of a GPU DGEMM call follows the results below).

Cluster with 8 nodes:
• each node has 2 Intel Xeon E5462 (2.8 GHz), 16 GB of memory and 2 Tesla GPUs (1.44 GHz clock)
• the nodes are connected with SDR InfiniBand

T/V       N       NB   P  Q  Time     Gflops
WR11R2L2  118144  960  4  4  874.26   1.258e+03
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N) = 0.0031157 ...... PASSED
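As an illustrative sketch only (the interception library itself is not shown in the slides), a single DGEMM offloaded to the GPU through the legacy CUBLAS interface might look as follows; the function name and the absence of any CPU/GPU work-splitting are assumptions:

#include <cublas.h>

// Sketch: compute C = alpha*A*B + beta*C on the GPU (column-major, no transposes).
// A real HPL interception library would also split work between CPU cores and GPUs.
void dgemm_on_gpu(int m, int n, int k, double alpha,
                  const double *A, const double *B, double beta, double *C)
{
    double *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(m * k, sizeof(double), (void**)&dA);
    cublasAlloc(k * n, sizeof(double), (void**)&dB);
    cublasAlloc(m * n, sizeof(double), (void**)&dC);

    // Copy operands to the GPU
    cublasSetMatrix(m, k, sizeof(double), A, m, dA, m);
    cublasSetMatrix(k, n, sizeof(double), B, k, dB, k);
    cublasSetMatrix(m, n, sizeof(double), C, m, dC, m);

    // GPU DGEMM
    cublasDgemm('N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m);

    // Copy the result back to host memory
    cublasGetMatrix(m, n, sizeof(double), dC, m, C, m);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
}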

© NVIDIA Corporation 2008

Accelerating MATLAB®

Pseudo-spectral simulation of 2D isotropic turbulence
1024x1024 mesh, 400 RK4 steps, Windows XP, Core2 Duo 2.4 GHz vs. GeForce 8800 GTX

Use MEX files to call CUDA from MATLAB: 17x speed-up

http://developer.nvidia.com/object/matlab_cuda.html

© NVIDIA Corporation 2008

Applications in several fields

• 146X: Astrophysics N-body simulation
• 36X: Interactive visualization of volumetric white matter connectivity
• 19X: Ionic placement for molecular dynamics simulation on GPU
• 17X: Transcoding HD video stream to H.264
• 100X: Simulation in MATLAB using .mex file CUDA function
• 149X: Financial simulation of LIBOR model with swaptions
• 47X: GLAME@lab, an M-script API for linear algebra operations on GPU
• 20X: Ultrasound medical imaging for cancer diagnostics
• 24X: Highly optimized object-oriented molecular dynamics
• 30X: Cmatch exact string matching to find similar proteins and gene sequences

© NVIDIA Corporation 2008

CUDA Basics

© NVIDIA Corporation 2008

CUDA: A Parallel Computing Architecture for NVIDIA GPUs

Supports standard languages and APIs:
• C
• OpenCL
• Fortran (PGI)
• DX Compute

Supported on common operating systems:
• Windows
• Mac OS X
• Linux

© NVIDIA Corporation 2008 23

Arrays of Parallel Threads

A CUDA kernel is executed by an array of threads
• All threads run the same code
• Each thread has an ID (threadID) that it uses to compute memory addresses and make control decisions

[Diagram: threads 0–7, each with its own threadID, all executing:]

float x = input[threadID];
float y = func(x);
output[threadID] = y;

© NVIDIA Corporation 2008 24

Example: Increment Array Elements

Increment an N-element vector a by a scalar b

Let’s assume N=16 and blockDim=4 -> 4 blocks

blockIdx.x=0, blockDim.x=4, threadIdx.x=0,1,2,3, idx=0,1,2,3
blockIdx.x=1, blockDim.x=4, threadIdx.x=0,1,2,3, idx=4,5,6,7
blockIdx.x=2, blockDim.x=4, threadIdx.x=0,1,2,3, idx=8,9,10,11
blockIdx.x=3, blockDim.x=4, threadIdx.x=0,1,2,3, idx=12,13,14,15

int idx = blockDim.x * blockIdx.x + threadIdx.x;
will map from the local index threadIdx to a global index

NB: blockDim should be >= 32 in real code; this is just an example

© NVIDIA Corporation 2008 25

Example: Increment Array Elements

CPU program:

void increment_cpu(float *a, float b, int N)
{
    for (int idx = 0; idx<N; idx++)
        a[idx] = a[idx] + b;
}

void main()
{
    .....
    increment_cpu(a, b, N);
}

CUDA program:

__global__ void increment_gpu(float *a, float b, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + b;
}

void main()
{
    …..
    dim3 dimBlock( blocksize );
    dim3 dimGrid( ceil( N / (float)blocksize ) );
    increment_gpu<<<dimGrid, dimBlock>>>(ad, bd, N);
}

© NVIDIA Corporation 2008

Outline of CUDA Basics

Basic Memory Management
Basic Kernels and Execution on GPU
Coordinating CPU and GPU Execution
Development Resources

See the Programming Guide for the full API

© NVIDIA Corporation 2008

Basic Memory Management

© NVIDIA Corporation 2008

Memory Spaces

CPU and GPU have separate memory spaces
• Data is moved across the PCIe bus
• Use functions to allocate/set/copy memory on the GPU
• Very similar to corresponding C functions

Pointers are just addresses
• Can’t tell from the pointer value whether the address is on the CPU or GPU
• Must exercise care when dereferencing:
  dereferencing a CPU pointer on the GPU will likely crash, and vice versa

© NVIDIA Corporation 2008

GPU Memory Allocation / Release

Host (CPU) manages device (GPU) memory:
• cudaMalloc (void ** pointer, size_t nbytes)
• cudaMemset (void * pointer, int value, size_t count)
• cudaFree (void* pointer)

int n = 1024;
int nbytes = 1024*sizeof(int);
int *d_a = 0;
cudaMalloc( (void**)&d_a, nbytes );
cudaMemset( d_a, 0, nbytes );
cudaFree( d_a );

© NVIDIA Corporation 2008

Data Copies

cudaMemcpy( void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction );
• returns after the copy is complete
• blocks the CPU thread until all bytes have been copied
• doesn’t start copying until previous CUDA calls complete

enum cudaMemcpyKind:
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice

Non-blocking memcopies are also provided (see the sketch below)
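As an illustrative sketch (not from the slides), a non-blocking copy uses cudaMemcpyAsync with page-locked host memory and a stream; the kernel name and sizes below are assumptions:

// Illustrative only: overlap a host-to-device copy with a kernel launch in one stream.
int nbytes = 1024 * sizeof(float);
float *h_a = 0, *d_a = 0;
cudaStream_t stream;

cudaMallocHost( (void**)&h_a, nbytes );    // page-locked host memory, required for async copies
cudaMalloc( (void**)&d_a, nbytes );
cudaStreamCreate( &stream );

cudaMemcpyAsync( d_a, h_a, nbytes, cudaMemcpyHostToDevice, stream );  // returns immediately
kernel<<<4, 256, 0, stream>>>( d_a );      // assumed kernel, queued behind the copy in the same stream
cudaStreamSynchronize( stream );           // block the CPU until the copy and kernel finish

cudaStreamDestroy( stream );
cudaFreeHost( h_a );
cudaFree( d_a );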

© NVIDIA Corporation 2008

Code Walkthrough 1

• Allocate CPU memory for n integers
• Allocate GPU memory for n integers
• Initialize GPU memory to 0s
• Copy from GPU to CPU
• Print the values

© NVIDIA Corporation 2008


Code Walkthrough 1

#include <stdio.h>

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0;   // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );
    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}

© NVIDIA Corporation 2008

Basic Kernels and Execution on GPU

© NVIDIA Corporation 2008

CUDA Programming Model

Parallel code (a kernel) is launched and executed on a device by many threads
Threads are grouped into thread blocks
Parallel code is written for a thread
• each thread is free to execute a unique code path
• built-in thread and block ID variables

© NVIDIA Corporation 2008

Thread Hierarchy

Threads launched for a parallel section are partitioned into thread blocks
Grid = all blocks for a given launch
A thread block is a group of threads that can:
• synchronize their execution
• communicate via shared memory

© NVIDIA Corporation 2008

IDs and Dimensions

Threads: 3D IDs, unique within a block

Blocks: 2D IDs, unique within a grid

Dimensions are set at launch time
• can be unique for each grid

Built-in variables: threadIdx, blockIdx, blockDim, gridDim (see the sketch after the diagram below)

[Diagram: a device executes Grid 1, made up of Blocks (0,0) through (2,1); Block (1,1) is expanded to show its Threads (0,0) through (4,2)]
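As a brief illustration (not on the slide) of how the built-in variables combine: each thread computes a unique global index, and gridDim.x * blockDim.x gives the total thread count, usable as a stride when there are more elements than threads. The kernel name and parameters here are made up:

__global__ void scale( float *data, float alpha, int n )
{
    // Global index of this thread across the whole grid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Total number of threads in the launch
    int stride = gridDim.x * blockDim.x;

    // Grid-stride loop: handles n larger than the number of threads
    for (int i = idx; i < n; i += stride)
        data[i] *= alpha;
}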

© NVIDIA Corporation 2008

Code executed on GPU

C function with some restrictions:
• can only access GPU memory
• no variable number of arguments
• no static variables
• no recursion

Must be declared with a qualifier:
• __global__ : launched by the CPU, cannot be called from the GPU, must return void
• __device__ : called from other GPU functions, cannot be launched by the CPU
• __host__ : can be executed by the CPU
• __host__ and __device__ qualifiers can be combined (sample use: overloading operators); see the sketch below
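A hedged illustration (the function names are invented, not from the slides) of how the qualifiers combine:

// __device__: callable only from GPU code
__device__ float square( float x ) { return x * x; }

// __host__ __device__: compiled for both CPU and GPU, e.g. small helpers or overloaded operators
__host__ __device__ float clampf( float x, float lo, float hi )
{
    return x < lo ? lo : (x > hi ? hi : x);
}

// __global__: launched from the CPU with <<<grid, block>>>, must return void
__global__ void saturate( float *out, const float *in, int n )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = clampf( square( in[idx] ), 0.0f, 1.0f );
}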

© NVIDIA Corporation 2008

Code Walkthrough 2

• Build on Walkthrough 1
• Write a kernel to initialize integers
• Copy the result back to CPU
• Print the values

© NVIDIA Corporation 2008

Kernel Code (executed on GPU)

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

© NVIDIA Corporation 2008

Launching kernels on GPU

Launch parameters:
• grid dimensions (up to 2D), dim3 type
• thread-block dimensions (up to 3D), dim3 type
• shared memory: number of bytes per block
  for extern smem variables declared without size
  optional, 0 by default
• stream ID
  optional, 0 by default

dim3 grid(16, 16);
dim3 block(16, 16);
kernel<<<grid, block, 0, 0>>>(...);
kernel<<<32, 512>>>(...);

© NVIDIA Corporation 2008

#include <stdio.h>

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}

int main()
{
    int dimx = 16;
    int num_bytes = dimx*sizeof(int);

    int *d_a=0, *h_a=0;   // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    grid.x  = dimx / block.x;

    kernel<<<grid, block>>>( d_a );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int i=0; i<dimx; i++)
        printf("%d ", h_a[i] );
    printf("\n");

    free( h_a );
    cudaFree( d_a );

    return 0;
}

© NVIDIA Corporation 2008

Kernel Variations and Output

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

© NVIDIA Corporation 2008

Code Walkthrough 3

• Build on Walkthrough 2
• Write a kernel to increment n×m integers
• Copy the result back to CPU
• Print the values

© NVIDIA Corporation 2008

Kernel with 2D Indexing

__global__ void kernel( int *a, int dimx, int dimy )
{
    int ix  = blockIdx.x*blockDim.x + threadIdx.x;
    int iy  = blockIdx.y*blockDim.y + threadIdx.y;
    int idx = iy*dimx + ix;

    a[idx] = a[idx]+1;
}

© NVIDIA Corporation 2008

int main()
{
    int dimx = 16;
    int dimy = 16;
    int num_bytes = dimx*dimy*sizeof(int);

    int *d_a=0, *h_a=0;   // device and host pointers

    h_a = (int*)malloc(num_bytes);
    cudaMalloc( (void**)&d_a, num_bytes );

    if( 0==h_a || 0==d_a ) {
        printf("couldn't allocate memory\n");
        return 1;
    }

    cudaMemset( d_a, 0, num_bytes );

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x  = dimx / block.x;
    grid.y  = dimy / block.y;

    kernel<<<grid, block>>>( d_a, dimx, dimy );

    cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );

    for(int row=0; row<dimy; row++) {
        for(int col=0; col<dimx; col++)
            printf("%d ", h_a[row*dimx+col] );
        printf("\n");
    }

    free( h_a );
    cudaFree( d_a );

    return 0;
}


© NVIDIA Corporation 2008

Blocks must be independent

Any possible interleaving of blocks should be valid
• presumed to run to completion without pre-emption
• can run in any order
• can run concurrently OR sequentially

Blocks may coordinate but not synchronize
• shared queue pointer: OK (see the sketch below)
• shared lock: BAD … can easily deadlock

Independence requirement gives scalability
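A hedged sketch (not from the slides) of the "shared queue pointer" case: blocks coordinate through a global counter with atomicAdd, but never wait on each other, so any block ordering remains valid. The kernel name and the assumption that chunk_size equals blockDim.x are illustrative:

// Global work-queue pointer (assumed for this sketch)
__device__ unsigned int next_chunk = 0;

__global__ void process_chunks( float *data, int num_chunks, int chunk_size )
{
    __shared__ unsigned int chunk;               // chunk claimed by this block

    while (true) {
        if (threadIdx.x == 0)
            chunk = atomicAdd( &next_chunk, 1 ); // one claim per block, no locking
        __syncthreads();

        if (chunk >= num_chunks)                 // queue exhausted: whole block exits together
            return;

        int idx = chunk * chunk_size + threadIdx.x;   // assumes chunk_size == blockDim.x
        data[idx] *= 2.0f;                       // each thread handles one element of the chunk
        __syncthreads();                         // finish this chunk before claiming the next
    }
}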

© NVIDIA Corporation 2008

Blocks must be independent

Thread blocks can run in any order
• concurrently or sequentially
Facilitates scaling of the same code across many devices

Scalability

© NVIDIA Corporation 2008

Coordinating CPU and GPU Execution

© NVIDIA Corporation 2008

Synchronizing GPU and CPU

All kernel launches are asynchronous
• control returns to the CPU immediately
• the kernel starts executing once all previous CUDA calls have completed

Memcopies are synchronous
• control returns to the CPU once the copy is complete
• the copy starts once all previous CUDA calls have completed

cudaThreadSynchronize()
• blocks until all previous CUDA calls complete

Asynchronous CUDA calls provide:
• non-blocking memcopies
• the ability to overlap memcopies and kernel execution
(A small sketch of overlapping CPU and GPU work follows.)
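An illustrative sketch (the helper name and variables are made up, not from the slides): because a launch is asynchronous, independent CPU work can run while the GPU computes.

kernel<<<grid, block>>>( d_a );              // asynchronous: returns immediately
do_independent_cpu_work( h_b );              // hypothetical CPU-side function, overlaps with the GPU
cudaThreadSynchronize();                     // wait until the kernel has finished
cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );   // synchronous copy of the results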

© NVIDIA Corporation 2008

CUDA Error Reporting to CPU

All CUDA calls return an error code:
• except kernel launches
• cudaError_t type

cudaError_t cudaGetLastError(void)
• returns the code for the last error ("no error" has a code)

char* cudaGetErrorString(cudaError_t code)
• returns a null-terminated character string describing the error

printf("%s\n", cudaGetErrorString( cudaGetLastError() ));

(A common error-checking wrapper is sketched below.)
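A common pattern (illustrative, not shown on the slide) is to wrap API calls in a small checking macro; the macro name is an assumption:

#include <stdio.h>
#include <stdlib.h>

#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            printf("CUDA error at %s:%d: %s\n",                     \
                   __FILE__, __LINE__, cudaGetErrorString(err));    \
            exit(1);                                                \
        }                                                           \
    } while (0)

// Usage:
// CUDA_CHECK( cudaMalloc((void**)&d_a, nbytes) );
// kernel<<<grid, block>>>(d_a);
// CUDA_CHECK( cudaGetLastError() );   // kernel launches don't return an error code themselves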

© NVIDIA Corporation 2008

CUDA Event API

Events are inserted (recorded) into CUDA call streams

Usage scenarios:
• measure elapsed time for CUDA calls (clock-cycle precision)
• query the status of an asynchronous CUDA call
• block the CPU until CUDA calls prior to the event are completed
• see the asyncAPI sample in the CUDA SDK

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float et;
cudaEventElapsedTime(&et, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);

© NVIDIA Corporation 2008

Device Management

CPU can query and select GPU devices:
• cudaGetDeviceCount( int* count )
• cudaSetDevice( int device )
• cudaGetDevice( int* current_device )
• cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
• cudaChooseDevice( int* device, cudaDeviceProp* prop )

Multi-GPU setup:
• device 0 is used by default
• one CPU thread can control one GPU
• multiple CPU threads can control the same GPU
  calls are serialized by the driver

(A short device-enumeration sketch follows.)
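An illustrative sketch (not from the slides) of enumerating the available devices and selecting one; the printed fields are assumptions about what might be of interest:

// List CUDA devices and explicitly pick device 0.
int count = 0;
cudaGetDeviceCount( &count );
for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, dev );
    printf("Device %d: %s, %d multiprocessors\n",
           dev, prop.name, prop.multiProcessorCount);
}
cudaSetDevice( 0 );   // each CPU thread selects the GPU it will control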

© NVIDIA Corporation 2008

Shared Memory

© NVIDIA Corporation 2008

Shared Memory

On-chip memory
• 2 orders of magnitude lower latency than global memory
• an order of magnitude higher bandwidth than gmem
• 16 KB per multiprocessor
  NVIDIA GPUs contain up to 30 multiprocessors

Allocated per thread block
• accessible by any thread in the thread block
• not accessible to other thread blocks

Several uses:
• sharing data among threads in a thread block
• user-managed cache (reducing gmem accesses)

© NVIDIA Corporation 2008 58

Using shared memory

Size known at compile time:

__global__ void kernel(…)
{
    …
    __shared__ float sData[256];
    …
}

int main(void)
{
    …
    kernel<<<nBlocks, blockSize>>>(…);
    …
}

Size known at kernel launch:

__global__ void kernel(…)
{
    …
    extern __shared__ float sData[];
    …
}

int main(void)
{
    …
    smBytes = blockSize*sizeof(float);
    kernel<<<nBlocks, blockSize, smBytes>>>(…);
    …
}

© NVIDIA Corporation 2008

Example of Using Shared Memory

Applying a 1D stencil:
• 1D data
• for each output element, sum all elements within a radius

For example, radius = 3
• add 7 input elements (the radius on each side plus the element itself)

[Diagram: an output element with a radius-wide neighborhood on either side]

© NVIDIA Corporation 2008

Implementation with Shared Memory

1D thread blocks (partition the output)
Each thread block outputs BLOCK_DIMX elements

Read input from gmem to smem
• needs BLOCK_DIMX + 2*RADIUS input elements
Compute
Write output to gmem

[Diagram: the input elements corresponding to the output (as many as there are threads in a thread block), plus a “halo” of RADIUS elements on each side]

© NVIDIA Corporation 2008

Kernel code

__global__ void stencil( int *output, int *input, int dimx, int dimy )
{
    __shared__ int s_a[BLOCK_DIMX+2*RADIUS];

    int global_ix = blockIdx.x*blockDim.x + threadIdx.x;
    int local_ix  = threadIdx.x + RADIUS;

    s_a[local_ix] = input[global_ix];

    if ( threadIdx.x < RADIUS ) {
        s_a[local_ix - RADIUS]     = input[global_ix - RADIUS];
        s_a[local_ix + BLOCK_DIMX] = input[global_ix + BLOCK_DIMX];
    }
    __syncthreads();

    int value = 0;
    for( int offset = -RADIUS; offset<=RADIUS; offset++ )
        value += s_a[ local_ix + offset ];

    output[global_ix] = value;
}
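An illustrative launch sketch (not from the slides), assuming RADIUS and BLOCK_DIMX were #defined before the kernel (e.g. 3 and 256). The halo loads read RADIUS elements past each end of the output range, so the input is assumed here to be allocated with RADIUS padding elements on both sides:

int dimx = 4096;                                   // assumed to be a multiple of BLOCK_DIMX
int *d_in = 0, *d_out = 0;
cudaMalloc( (void**)&d_in,  (dimx + 2*RADIUS) * sizeof(int) );
cudaMalloc( (void**)&d_out, dimx * sizeof(int) );

dim3 block( BLOCK_DIMX );
dim3 grid( dimx / BLOCK_DIMX );
stencil<<<grid, block>>>( d_out, d_in + RADIUS, dimx, 1 );   // offset past the left padding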

© NVIDIA Corporation 2008

Thread Synchronization Function

void __syncthreads();

Synchronizes all threads in a thread block
• needed because threads are scheduled at run time
Once all threads have reached this point, execution resumes normally
Used to avoid RAW / WAR / WAW hazards when accessing shared memory

Should be used in conditional code only if the conditional is uniform across the entire thread block (see the sketch below)
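An illustrative sketch of the rule (the kernel is made up, not from the slides): the barrier below is safe because the condition is the same for every thread in the block, since it depends only on a kernel argument.

__global__ void reverse_if( float *data, int do_reverse )
{
    __shared__ float tile[256];                  // assumes blockDim.x == 256
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (do_reverse) {                            // uniform across the block: OK
        tile[threadIdx.x] = data[idx];
        __syncthreads();                         // every thread in the block reaches the barrier
        data[idx] = tile[blockDim.x - 1 - threadIdx.x];
    }

    // BAD (would hang): if (threadIdx.x < 128) { ... __syncthreads(); ... }
    // because threads that skip the branch never reach the barrier
}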

© NVIDIA Corporation 2008

Memory Model Review

Local storage
• each thread has its own local storage
• mostly registers (managed by the compiler)
• data lifetime = thread lifetime

Shared memory
• each thread block has its own shared memory
• accessible only by threads within that block
• data lifetime = block lifetime

Global (device) memory
• accessible by all threads as well as the host (CPU)
• data lifetime = from allocation to deallocation

© NVIDIA Corporation 2008

Memory Model Review

[Diagram: each Thread has per-thread local storage; each Block has per-block shared memory]

© NVIDIA Corporation 2008

Memory Model Review

[Diagram: sequential kernels (Kernel 0, Kernel 1, …) all access the same per-device global memory]

© NVIDIA Corporation 2008

Memory Model Review

[Diagram: host memory and per-device memories (Device 0 memory, Device 1 memory) are separate; data moves between them with cudaMemcpy()]
