A crash course on (C-) CUDA programming for (multi) GPU
Massimo Bernaschi [email protected]
http://www.iac.cnr.it/~massimo
Single GPU part of the class based on
Code samples available from (wget) http://twin.iac.rm.cnr.it/cccsample.tgz Slides available from (wget) http://twin.iac.rm.cnr.it/CCC.pdf
Let us start with a few basic facts
A Graphics Processing Unit is not a CPU!
And there is no GPU without a CPU…
[Figure: CPU vs. GPU block diagram: the CPU devotes its die area to control logic, a large cache and a few ALUs, while the GPU devotes it to many ALUs; each processor has its own DRAM.]
Graphics Processing Unit
ü Originally dedicated to specific operations required by video cards for accelerating graphics, GPUs have become flexible general-purpose computational engines (GPGPU):
ü Optimized for high memory throughput
ü Very suitable for data-parallel processing
ü Traditionally GPU programming has been tricky, but CUDA, OpenCL and now OpenACC have made it affordable.
What is CUDA? § CUDA=Compute Unified Device Architecture
— Expose general-purpose GPU computing as first-class capability
— Retain traditional DirectX/OpenGL graphics performance
§ CUDA C — Based on industry-standard C — A handful of language extensions to allow heterogeneous
programs — Straightforward APIs to manage devices, memory, etc.
CUDA Programming Model • The GPU is viewed as a compute device that:
– has its own RAM (device memory)
– runs data-parallel portions of an application as kernels by using many threads
• GPU vs. CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• A multi-core CPU needs only a few
(basically one thread per core)
CUDA C Jargon: The Basics
Host
§ Host – The CPU and its memory (host memory)
§ Device – The GPU and its memory (device memory)
Device
What Programmer Expresses in CUDA
ü Computation partitioning (where does computation occur?)
ü Declarations on functions: __host__, __global__, __device__
ü Mapping of thread programs to device: compute<<<gs, bs>>>(<args>)
ü Data partitioning (where does data reside, who may access it and how?)
ü Declarations on data: __shared__, __device__, __constant__, …
ü Data management and orchestration
ü Copying to/from host: e.g., cudaMemcpy(h_obj, d_obj, size, cudaMemcpyDeviceToHost)
ü Concurrency management
ü e.g. __syncthreads()
[Figure: the HOST (CPU) with its processors and memory (M), the DEVICE (GPU) with its memory (M), and the interconnect between devices and memories.]
Code Samples
1. Connect to the system ssh –X [email protected]
2. Get the code samples: wget http://twin.iac.rm.cnr.it/ccc.tgz wget http://twin.iac.rm.cnr.it/cccsample.tgz
3. Unpack the samples: tar zxvf cccsample.tgz
4. Go to source directory: cd cccsample/source
5. Get the CUDA C Quick Reference wget http://twin.iac.rm.cnr.it/CUDA_C_QuickRef.pdf
Hello, World! int main( void ) { printf( "Hello, World!\n" ); return 0; } • To compile: nvcc -o hello_world hello_world.cu
• To execute: ./hello_world
• This basic program is just standard C that runs on the host
• NVIDIA’s compiler (nvcc) will not complain about CUDA programs with no device code
• At its simplest, CUDA C is just C!
Hello, World! with Device Code __global__ void kernel( void ) { } int main( void ) { kernel<<<1,1>>>(); printf( "Hello, World!\n" ); return 0; }
• Two notable additions to the original "Hello, World!"
To compile: nvcc -o simple_kernel simple_kernel.cu To execute: ./simple_kernel
Hello, World! with Device Code __global__ void kernel( void ) { } • CUDA C keyword __global__ indicates that a
function – Runs on the device – Is called from host code
• nvcc splits the source file into host and device components – NVIDIA's compiler handles device functions like kernel() – The standard host compiler handles host functions like main()
• gcc, icc, … • Microsoft Visual C
Hello, World! with Device Code int main( void ) { kernel<<< 1, 1 >>>(); printf( "Hello, World!\n" ); return 0; }
• Triple angle brackets mark a call from host code to device code – A “kernel launch” in CUDA jargon – We’ll discuss the parameters inside the angle brackets later
• This is all that's required to execute a function on the GPU!
• The function kernel() does nothing, so let us run something a bit more useful:
A kernel to make an addition __global__ void add(int a, int b, int *c) { *c = a + b; }
• Compile: nvcc -o addint addint.cu
• Run: ./addint
• Look at line 28 of the addint.cu code How do we pass the first two arguments?
A More Complex Example • Another kernel to add two integers: __global__ void add( int *a, int *b, int *c ) { *c = *a + *b; }
• As before, __global__ is a CUDA C keyword meaning – add() will execute on the device – add() will be called from the host
A More Complex Example • Notice that now we use pointers for all our variables:
__global__ void add( int *a, int *b, int *c ) { *c = *a + *b; }
• add() runs on the device…so a, b, and c must point to device memory
• How do we allocate memory on the GPU?
Memory Management • Up to CUDA 4.0 host and device memory were distinct
entities from the programmer's viewpoint – Device pointers point to GPU memory
• May be passed to and from host code • (In general) May not be dereferenced from host code
– Host pointers point to CPU memory • May be passed to and from device code • (In general) May not be dereferenced from device code
Starting with CUDA 4.0 there is a Unified Virtual Addressing feature. • Basic CUDA API for dealing with device memory
– cudaMalloc(&p, size), cudaFree(p), cudaMemcpy(t, s, size, direction)
– Similar to their C equivalents: malloc(), free(), memcpy()
(note: the first argument of cudaMalloc is a pointer to a pointer)
A More Complex Example: main() int main( void ) { int a, b, c; // host copies of a, b, c int *dev_a, *dev_b, *dev_c; // device copies of a, b, c int size = sizeof( int ); // we need space for an integer // allocate device copies of a, b, c cudaMalloc( (void**)&dev_a, size ); cudaMalloc( (void**)&dev_b, size ); cudaMalloc( (void**)&dev_c, size ); a = 2; b = 7; // copy inputs to device cudaMemcpy( dev_a, &a, size, cudaMemcpyHostToDevice ); cudaMemcpy( dev_b, &b, size, cudaMemcpyHostToDevice ); // launch add() kernel on GPU, passing parameters add<<< 1, 1 >>>( dev_a, dev_b, dev_c ); // copy device result back to host copy of c cudaMemcpy( &c, dev_c, size, cudaMemcpyDeviceToHost ); cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c ); return 0; }
Parallel Programming in CUDA C • But wait…GPU computing is about massive parallelism
• So how do we run code in parallel on the device?
• The solution lies in the parameters between the triple angle brackets:
add<<< 1, 1 >>>( dev_a, dev_b, dev_c ); add<<< N, 1 >>>( dev_a, dev_b, dev_c );
• Instead of executing add() once, add() is executed N times in parallel
Parallel Programming in CUDA C • With add() running in parallel…let's do vector addition
• Terminology: Each parallel invocation of add() is referred to as a block
• Kernel can refer to its block’s index with the variable blockIdx.x • Each block adds a value from a[] and b[], storing the result in c[]:
__global__ void add( int *a, int *b, int *c ) { c[blockIdx.x] = a[blockIdx.x]+b[blockIdx.x]; }
• By using blockIdx.x to index arrays, each block handles a different index • blockIdx.x is the first example of a CUDA predefined variable.
Parallel Programming in CUDA C
§ We write this code:
__global__ void add( int *a, int *b, int *c ) {
c[blockIdx.x] = a[blockIdx.x]+b[blockIdx.x];
}
§ This is what runs in parallel on the device:
Block 0: c[0]=a[0]+b[0];
Block 1: c[1]=a[1]+b[1];
Block 2: c[2]=a[2]+b[2];
Block 3: c[3]=a[3]+b[3];
Parallel Addition: main() #define N 512 int main( void ) { int *a, *b, *c; // host copies of a, b, c int *dev_a, *dev_b, *dev_c; // device copies of a, b, c int size = N * sizeof( int ); // we need space for 512
// integers // allocate device copies of a, b, c cudaMalloc( (void**)&dev_a, size ); cudaMalloc( (void**)&dev_b, size ); cudaMalloc( (void**)&dev_c, size ); a = (int*)malloc( size ); b = (int*)malloc( size ); c = (int*)malloc( size ); random_ints( a, N ); random_ints( b, N );
Parallel Addition: main() (cont) // copy inputs to device cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice ); cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice ); // launch add() kernel with N parallel blocks add<<< N, 1 >>>( dev_a, dev_b, dev_c ); // copy device result back to host copy of c cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost ); free( a ); free( b ); free( c ); cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c ); return 0; }
Review § Difference between “host” and “device”
— Host = CPU — Device = GPU
§ Using __global__ to declare a function as device code — Runs on device — Called from host
§ Passing parameters from host code to a device function
Review (cont) § Basic device memory management
— cudaMalloc() — cudaMemcpy() — cudaFree()
§ Launching parallel kernels — Launch N copies of add() with: add<<< N, 1 >>>(); — blockIdx.x gives access to a block's index
• Exercise: look at, compile and run the add_simple_blocks.cu code
Threads § Terminology: A block can be split into parallel threads
§ Let’s change vector addition to use parallel threads instead of parallel blocks:
__global__ void add( int *a, int *b, int *c ) { c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x]; }
§ We use threadIdx.x instead of blockIdx.x in add()
§ main() will require one change as well…
Parallel Addition (Threads): main() #define N 512 int main( void ) { int *a, *b, *c; //host copies of a, b, c int *dev_a, *dev_b, *dev_c;//device copies of a, b, c int size = N * sizeof( int ); //we need space for 512 integers // allocate device copies of a, b, c cudaMalloc( (void**)&dev_a, size ); cudaMalloc( (void**)&dev_b, size ); cudaMalloc( (void**)&dev_c, size ); a = (int*)malloc( size ); b = (int*)malloc( size ); c = (int*)malloc( size ); random_ints( a, N ); random_ints( b, N );
Parallel Addition (Threads): main() // copy inputs to device cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice ); cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice ); // launch add() kernel with N threads add<<< 1, N >>>( dev_a, dev_b, dev_c ); // copy device result back to host copy of c cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost ); free( a ); free( b ); free( c ); cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c ); return 0; }
(blocks version: add<<< N, 1 >>>, i.e. N blocks of 1 thread each; threads version: add<<< 1, N >>>, i.e. 1 block of N threads)
Exercise: compile and run the add_simple_threads.cu code
Using Threads And Blocks § We’ve seen parallel vector addition using
— Many blocks with 1 thread apiece — 1 block with many threads
§ Let’s adapt vector addition to use lots of both blocks and
threads
§ After using threads and blocks together, we’ll talk about why threads
§ First let’s discuss data indexing…
Indexing Arrays With Threads & Blocks • No longer as simple as just using threadIdx.x or blockIdx.x as indices
• To index array with 1 thread per entry (using 8 threads/block)
• If we have M threads/block, a unique array index for each entry is given by int index = threadIdx.x + blockIdx.x * M;
int index = x + y * width;
[Figure: four blocks (blockIdx.x = 0, 1, 2, 3), each containing threads threadIdx.x = 0 … 7, laid out side by side over the array.]
Indexing Arrays: Example • In this example, the red entry would have an index of 21:
int index = threadIdx.x + blockIdx.x * M; = 5 + 2 * 8; = 21;
blockIdx.x = 2
M = 8 threads/block
[Figure: a 32-element array; the highlighted (red) entry is index 21.]
Indexing Arrays: other examples (4 blocks with 4 threads per block)
Addition with Threads and Blocks § blockDim.x is a built-in variable for threads per block: int index= threadIdx.x + blockIdx.x * blockDim.x; § gridDim.x is a built-in variable for blocks in a grid;
§ A combined version of our vector addition kernel to use blocks and
threads: __global__ void add( int *a, int *b, int *c ) { int index = threadIdx.x + blockIdx.x * blockDim.x; c[index] = a[index] + b[index]; }
§ So what changes in main() when we use both blocks and threads?
Parallel Addition (Blocks/Threads) #define N (2048*2048) #define THREADS_PER_BLOCK 512 int main( void ) { int *a, *b, *c; // host copies of a, b, c int *dev_a, *dev_b, *dev_c;//device copies of a, b, c int size = N * sizeof( int ); // we need space for N
// integers // allocate device copies of a, b, c cudaMalloc( (void**)&dev_a, size ); cudaMalloc( (void**)&dev_b, size ); cudaMalloc( (void**)&dev_c, size ); a = (int*)malloc( size ); b = (int*)malloc( size ); c = (int*)malloc( size );
random_ints( a, N ); random_ints( b, N );
Parallel Addition (Blocks/Threads) // copy inputs to device cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice ); cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice ); // launch add() kernel with blocks and threads add<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>>( dev_a, dev_b, dev_c); // copy device result back to host copy of c cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost ); free( a ); free( b ); free( c ); cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c ); return 0; }
Exercise: compile and run the add_simple.cu code
Exercise: array reversal
• Start from the arrayReversal.cu code The code must fill an array d_out with the contents of an input array d_in in reverse order so that if d_in is [100, 110, 200, 220, 300] then d_out must be [300, 220, 200, 110, 100].
• You must complete line 13, 14 and 34. • Remember that:
– blockDim.x is the number of threads per block; – gridDim.x is the number of blocks in a grid;
Array Reversal: Solution
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
int in = blockDim.x * blockIdx.x + threadIdx.x;
int out = (blockDim.x * gridDim.x - 1) - in;
d_out[out] = d_in[in];
}
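A minimal host-side launch sketch for this kernel, assuming the number of elements is a multiple of the block size so every thread has exactly one element; the names numElements, threadsPerBlock and numBlocks are illustrative and not necessarily those used in arrayReversal.cu:

// Illustrative launch: one thread per array element.
int numElements = 256 * 1024;                 // example size (multiple of the block size)
int threadsPerBlock = 256;
int numBlocks = numElements / threadsPerBlock;
reverseArrayBlock<<< numBlocks, threadsPerBlock >>>( d_out, d_in );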
Why Bother With Threads? § Threads seem unnecessary
— Added a level of abstraction and complexity — What did we gain?
§ Unlike parallel blocks, parallel threads have mechanisms to — Communicate — Synchronize
§ Let’s see how…
Dot Product § Unlike vector addition, dot product is a reduction from vectors to
a scalar
c = a · b
c = (a0, a1, a2, a3) · (b0, b1, b2, b3)
c = a0 b0 + a1 b1 + a2 b2 + a3 b3
Dot Product • Parallel threads have no problem computing the pairwise products:
• So we can start a dot product CUDA kernel by doing just that: __global__ void dot( int *a, int *b, int *c ) { // Each thread computes a pairwise product int temp = a[threadIdx.x] * b[threadIdx.x];
Dot Product • But we need to share data between threads to compute the final sum:
__global__ void dot( int *a, int *b, int *c ) { // Each thread computes a pairwise product int temp = a[threadIdx.x] * b[threadIdx.x]; // Can’t compute the final sum // Each thread’s copy of ‘temp’ is private!!! }
Sharing Data Between Threads § Terminology: A block of threads shares memory called…
§ Extremely fast, on-chip memory (user-managed cache)
§ Declared with the __shared__ CUDA keyword
§ Not visible to threads in other blocks running in parallel
shared memory
Shared Memory
Threads
Block 0
Shared Memory
Threads
Block 1
Shared Memory
Threads
Block 2
…
Parallel Dot Product: dot() § We perform parallel multiplication, serial addition: #define N 512 __global__ void dot( int *a, int *b, int *c ) { // Shared memory for results of multiplication __shared__ int temp[N]; temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x]; // Thread 0 sums the pairwise products if( 0 == threadIdx.x ) { int sum = 0; for( int i = N-1; i >= 0; i-- ){ sum += temp[i];
} *c = sum; } }
Parallel Dot Product Recap § We perform parallel, pairwise multiplications
§ Shared memory stores each thread’s result
§ We sum these pairwise products from a single thread
§ Sounds good…
But…
Exercise: Compile and run dot_simple_threads.cu. Does it work as expected?
Faulty Dot Product Exposed! § Step 1: In parallel, each thread writes a pairwise product
§ Step 2: Thread 0 reads and sums the products
§ But there’s an assumption hidden in Step 1…
Read-Before-Write Hazard § Suppose thread 0 finishes its write in step 1
§ Then thread 0 reads index 12 in step 2
§ Before thread 12 writes to index 12 in step 1
This read returns garbage!
Synchronization § We need threads to wait between the sections of dot(): __global__ void dot( int *a, int *b, int *c ) { __shared__ int temp[N]; temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x]; // * NEED THREADS TO SYNCHRONIZE HERE * // No thread can advance until all threads // have reached this point in the code // Thread 0 sums the pairwise products if( 0 == threadIdx.x ) { int sum = 0; for( int i = N-1; i >= 0; i-- ){ sum += temp[i];
} *c = sum; } }
__syncthreads()
§ We can synchronize threads with the function __syncthreads()
§ Threads in the block wait until all threads have hit the __syncthreads()
§ Threads are only synchronized within a block!
[Figure: threads 0, 1, 2, 3, 4, … reach __syncthreads() at different times and all wait at the barrier before proceeding.]
Parallel Dot Product: dot() __global__ void dot( int *a, int *b, int *c ) { __shared__ int temp[N]; temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x]; __syncthreads(); if( 0 == threadIdx.x ) { int sum = 0; for( int i = N-1; i >= 0; i-- ){ sum += temp[i];
} *c = sum; } }
§ With a properly synchronized dot() routine, let’s look at main()
Parallel Dot Product: main() #define N 512 int main( void ) { int *a, *b, *c; // copies of a, b, c int *dev_a, *dev_b, *dev_c; // device copies of a, b, c int size = N * sizeof( int ); // we need space for N integers // allocate device copies of a, b, c cudaMalloc( (void**)&dev_a, size ); cudaMalloc( (void**)&dev_b, size ); cudaMalloc( (void**)&dev_c, sizeof( int ) ); a = (int *)malloc( size ); b = (int *)malloc( size ); c = (int *)malloc( sizeof( int ) ); random_ints( a, N ); random_ints( b, N );
Parallel Dot Product: main() // copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice ); cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice ); // launch dot() kernel with 1 block and N threads dot<<< 1, N >>>( dev_a, dev_b, dev_c ); // copy device result back to host copy of c cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost ); free( a ); free( b ); free( c ); cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c ); return 0; }
Exercise: insert __syncthreads() in the dot kernel. Compile and run.
Review § Launching kernels with parallel threads
— Launch add() with N threads: add<<< 1, N >>>(); — Used threadIdx.x to access thread’s index
§ Using both blocks and threads — Used (threadIdx.x + blockIdx.x * blockDim.x) to index input/output — N/THREADS_PER_BLOCK blocks and THREADS_PER_BLOCK threads gave us N threads
total
Review (cont) § Using __shared__ to declare memory as shared
memory — Data shared among threads in a block — Not visible to threads in other parallel blocks
§ Using __syncthreads() as a barrier — No thread executes instructions after __syncthreads() until
all threads have reached the __syncthreads() — Needs to be used to prevent data hazards
Multiblock Dot Product • Recall our dot product launch: // launch dot() kernel with 1 block and N threads dot<<< 1, N >>>( dev_a, dev_b, dev_c );
• Launching with one block will not utilize much of the GPU
• Let's write a multiblock version of dot product
Multiblock Dot Product: Algorithm § Each block computes a sum of its pairwise products like before:
Multiblock Dot Product: Algorithm § And then contributes its sum to the final result:
Multiblock Dot Product: dot() #define THREADS_PER_BLOCK 512 #define N (1024*THREADS_PER_BLOCK) __global__ void dot( int *a, int *b, int *c ) { __shared__ int temp[THREADS_PER_BLOCK]; int index = threadIdx.x + blockIdx.x * blockDim.x; temp[threadIdx.x] = a[index] * b[index]; __syncthreads(); if( 0 == threadIdx.x ) { int sum = 0; for( int i = 0; i < THREADS_PER_BLOCK; i++ ) { sum += temp[i];
} } }
§ But we have a race condition… Compile and run dot_simple_multiblock.cu
§ We can fix it with one of CUDA's atomic operations. Use atomicAdd in the dot kernel. Compile with nvcc -o dot_simple_multiblock dot_simple_multiblock.cu -arch=sm_35
(replace *c += sum; with atomicAdd( c, sum );)
Race Conditions
§ Terminology: A race condition occurs when program behavior depends upon relative timing of two (or more) event sequences
§ What actually takes place to execute the line in question: *c += sum;
— Read value at address c
— Add sum to value
— Write result to address c
Terminology: Read-Modify-Write
§ What if two threads are trying to do this at the same time?
§ Thread 0, Block 0 — Read value at address c — Add sum to value — Write result to address c
§ Thread 0, Block 1 — Read value at address c — Add sum to value — Write result to address c
Global Memory Contention
Block 0 (sum = 3) and Block 1 (sum = 4) both execute *c += sum, with c initially 0.
Serialized read-modify-write (correct):
Block 0 reads 0, computes 0+3 = 3, writes 3
Block 1 reads 3, computes 3+4 = 7, writes 7, so c = 7
Interleaved read-modify-write (race):
Block 0 reads 0, computes 0+3 = 3, writes 3
Block 1 also reads 0, computes 0+4 = 4, writes 4, so c = 4 and the contribution of Block 0 is lost
Atomic Operations § Terminology: Read-modify-write uninterruptible when atomic
§ Many atomic operations on memory available with CUDA C
§ Predictable result when simultaneous access to memory required
§ We need to atomically add sum to c in our multiblock dot product
§ atomicAdd() § atomicSub() § atomicMin() § atomicMax()
§ atomicInc() § atomicDec() § atomicExch() § atomicCAS() (old == compare ? val : old)
Multiblock Dot Product: dot() __global__ void dot( int *a, int *b, int *c ) { __shared__ int temp[THREADS_PER_BLOCK]; int index = threadIdx.x + blockIdx.x * blockDim.x; temp[threadIdx.x] = a[index] * b[index]; __syncthreads(); if( 0 == threadIdx.x ) { int sum = 0; for( int i = 0; i < THREADS_PER_BLOCK; i++ ){ sum += temp[i];
} atomicAdd( c , sum ); } }
§ Now let's fix up main() to handle a multiblock dot product
Parallel Dot Product: main() #define THREADS_PER_BLOCK 512 #define N (1024*THREADS_PER_BLOCK) int main( void ) { int *a, *b, *c; // host copies of a, b, c int *dev_a, *dev_b, *dev_c; // device copies of a, b, c int size = N * sizeof( int ); // we need space for N ints // allocate device copies of a, b, c cudaMalloc( (void**)&dev_a, size ); cudaMalloc( (void**)&dev_b, size ); cudaMalloc( (void**)&dev_c, sizeof( int ) ); a = (int *)malloc( size ); b = (int *)malloc( size ); c = (int *)malloc( sizeof( int ) ); random_ints( a, N ); random_ints( b, N );
Parallel Dot Product: main() // copy inputs to device cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice ); cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice ); // launch dot() kernel dot<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c); // copy device result back to host copy of c cudaMemcpy( c, dev_c, sizeof( int ), cudaMemcpyDeviceToHost ); free( a ); free( b ); free( c ); cudaFree( dev_a ); cudaFree( dev_b ); cudaFree( dev_c ); return 0; }
Review § Race conditions
— Behavior depends upon relative timing of multiple event sequences
— Can occur when an implied read-modify-write is interruptible
§ Atomic operations — CUDA provides read-modify-write operations guaranteed to be
atomic — Atomics ensure correct results when multiple threads modify
memory — To use atomic operations a new option (-arch=…) must be used
at compile time
Eratosthenes’ Sieve
• A sieve has holes in it and is used to filter out the juice.
• Eratosthenes’s sieve filters out numbers to find the prime numbers.
Definition
• Prime Number – a number that has only two factors, itself and 1.
7 7 is prime because the only numbers that will divide into it evenly are 1 and 7.
The Eratosthenes' Sieve
The Prime Numbers from 1 to 120 are the following:
2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113 Download (wget) http://twin.iac.rm.cnr.it/eratostenePr.cu
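The downloaded eratostenePr.cu is the reference; as a hedged sketch of how the crossing-out step of the sieve can be parallelized (one thread per multiple of the current prime), assuming a device flag array composite[] of n+1 bytes and a host loop over primes p <= sqrt(n); the names are illustrative only:

// Sketch: mark the multiples of a given prime p in parallel.
// The host launches enough threads to cover p*p, p*p+p, ..., n.
__global__ void markMultiples(char *composite, int p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int m = p * p + i * p;           // first multiple handled by this thread
    if (m <= n) composite[m] = 1;    // cross out the multiple
}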
CUDA Thread organization: Grids and Blocks
• A kernel is executed as a 1D, 2D or 3D grid of thread blocks. – All threads share the global memory
• A thread block is a 1D, 2D or 3D batch of threads that can cooperate with each other by: – Synchronizing their execution
• For hazard-free shared memory accesses
– Efficiently sharing data through a low latency shared memory
• Thread blocks are independent of each other and can be executed in any order!
[Figure: the host launches Kernel 1 as Grid 1, a 3x2 arrangement of blocks (0,0)…(2,1), and Kernel 2 as Grid 2; Block (1,1) is expanded to show a 5x3 arrangement of threads (0,0)…(4,2). Courtesy: NVIDIA]
Built-in Variables to manage grids and blocks
• dim3 gridDim; – Dimensions of the grid in blocks
• dim3 blockDim; – Dimensions of the block in threads
• dim3 blockIdx; – Block index within the grid
• dim3 threadIdx; – Thread index within the block
dim3: a new datatype defined by CUDA: struct dim3 { unsigned int x, y, z }; three unsigned ints where any unspecified component defaults to 1.
int main() { int dimx = 16; int dimy = 16; int num_bytes = dimx*dimy*sizeof(int); int *d_a=0, *h_a=0; // device and host pointers h_a = (int*)malloc(num_bytes); cudaMalloc( (void**)&d_a, num_bytes ); dim3 grid, block; block.x = 4; block.y = 4; grid.x = dimx / block.x; grid.y = dimy / block.y; kernel<<<grid, block>>>( d_a, dimx, dimy ); cudaMemcpy(h_a,d_a,num_bytes, cudaMemcpyDeviceToHost); for(int row=0; row<dimy; row++) { for(int col=0; col<dimx; col++) printf("%d ", h_a[row*dimx+col] ); printf("\n"); } free( h_a ); cudaFree( d_a ); return 0; }
__global__ void kernel( int *a, int dimx, int dimy ) { int ix = blockIdx.x*blockDim.x + threadIdx.x; int iy = blockIdx.y*blockDim.y + threadIdx.y; int idx = iy*dimx + ix; a[idx] = idx+1; }
Bi-dimensional thread configuration by example:
set the elements of a square matrix
(assume the matrix is a single block of memory!)
Exercise: compile and run: setmatrix.cu
Matrix multiply with one thread block
• One block of threads computes matrix dP – Each thread computes one element of dP
• Each thread
– Loads a row of matrix dM – Loads a column of matrix dN – Performs one multiply and addition for each pair of dM and dN elements
• Size of matrix limited by the number of threads allowed in a thread block!
[Figure: Grid 1, Block 1; thread (2, 2) computes one element of dP (48 in the example) from a row of dM and a column of dN; WIDTH is the matrix dimension.]
ü Start from MMGlob.cu
ü Complete the function MatrixMulOnDevice and the kernel MatrixMulKernel
ü Use one bi-dimensional block
ü Threads will have an x (threadIdx.x) and a y identifier (threadIdx.y)
ü Max size of the matrix: 16
Compile: nvcc -o MMGlob MMglob.cu Run: ./MMGlob N
__global__ void MatrixMulKernel(DATA* dM, DATA* dN, DATA* dP, int Width) { DATA Pvalue = 0.0; for (int k = 0; k < Width; ++k) { DATA Melement = dM[ ]; DATA Nelement = dN[ ]; Pvalue += Melement * Nelement; } dP[ ] = Pvalue; }
Exercise: matrix-matrix multiply
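One possible completion of the blanks above, as a hedged sketch: it assumes row-major Width x Width matrices and a single Width x Width thread block, which may differ from the exact layout expected by MMGlob.cu:

__global__ void MatrixMulKernel(DATA* dM, DATA* dN, DATA* dP, int Width) {
    DATA Pvalue = 0.0;
    for (int k = 0; k < Width; ++k) {
        DATA Melement = dM[threadIdx.y * Width + k];    // walk along a row of dM
        DATA Nelement = dN[k * Width + threadIdx.x];    // walk down a column of dN
        Pvalue += Melement * Nelement;
    }
    dP[threadIdx.y * Width + threadIdx.x] = Pvalue;     // one element of dP per thread
}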
CUDA Compiler: nvcc basic options
ü -arch=sm_13 (or sm_20, sm_35) Enable double precision (on compatible hardware)
ü -G Enable debug for device code
ü --ptxas-options=-v Show register and memory usage
ü --maxrregcount <N> Limit the number of registers
ü -use_fast_math Use fast math library
ü -O3 Enable compiler optimization
ü -ccbin compiler_path use a different C compiler (e.g., Intel instead of gcc)
Exercises:
1. re-compile the MMGlob.cu program and check how many registers the kernel uses: nvcc -o MMGlob MMGlob.cu --ptxas-options=-v
2. Edit MMGlob.cu and modify DATA from float to double, then compile for double precision and run
CUDA Error Reporting to CPU
• All CUDA calls return an error code: – except kernel launches – cudaError_t type
• cudaError_t cudaGetLastError(void) – returns the code for the last error (cudaSuccess means "no error")
• char* cudaGetErrorString(cudaError_t code) – returns a null-terminated character string describing the error
printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
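A common convenience is to wrap calls in a checking helper; a minimal sketch (the macro name CUDA_CHECK is illustrative, not part of the CUDA API):

#include <stdio.h>
#include <stdlib.h>

// Illustrative helper: abort with file/line info if a CUDA call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            printf("CUDA error %s at %s:%d\n",                             \
                   cudaGetErrorString(err), __FILE__, __LINE__);           \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// usage: CUDA_CHECK( cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice) );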
CUDA Event API • Events are inserted (recorded) into CUDA call streams
• Usage scenarios: – measure elapsed time (in milliseconds) for CUDA calls (resolution ~0.5 microseconds) – query the status of an asynchronous CUDA call – block the CPU until CUDA calls prior to the event are completed
cudaEvent_t start, stop; float et; cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(...);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&et, start, stop);
cudaEventDestroy(start); cudaEventDestroy(stop); printf("Kernel execution time=%f\n", et);
Read the function MatrixMulOnDevice of the MMGlobWT.cu program. Compile and run
Execution Model: Software vs. Hardware
• Thread / Scalar Processor: threads are executed by scalar processors
• Thread Block / Multiprocessor: thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
• Grid / Device: a kernel is launched as a grid of thread blocks
Transparent Scalability
• Hardware is free to assign blocks to any processor at any time – a kernel scales across any number of parallel processors
[Figure: the same kernel grid of Blocks 0–7 runs on a 2-SM device as four waves of 2 blocks, and on a 4-SM device as two waves of 4 blocks.]
Each block can execute in any order relative to other blocks!
Warps and Half Warps
• A thread block consists of 32-thread warps
• A warp is executed physically in parallel (SIMD) on a multiprocessor
• A half-warp of 16 threads can coordinate global memory accesses into a single transaction
[Figure: a thread block split into 32-thread warps on a multiprocessor; half-warps of 16 threads access device memory (DRAM: global, local).]
Executing Thread Blocks
• Threads are assigned to Streaming Multiprocessors (SM) in block granularity – Up to 16 blocks to each SM on Kepler, 8 blocks on a Fermi SM
– A Fermi SM can take up to 1536 threads
• Could be 256 (threads/block) * 6 blocks or 192 (threads/block) * 8 blocks but, for instance, 128 (threads/block) * 12 blocks is not allowed!
– A Kepler SM can take up to 2048 threads
• Threads run concurrently – the SM manages/schedules thread execution
[Figure: two SMs (SM 0, SM 1), each with its MT issue unit, SPs and shared memory, each holding several blocks of threads t0 t1 t2 … tm]
The number of threads in a block depends on the capability (i.e., the version) of the GPU. On a Fermi or Kepler GPU a block may have up to 1024 threads.
Block Granularity Considerations
• For a kernel running on a Fermi and using multiple blocks, should we use 8X8, 16X16 or 32X32 blocks?
– For 8X8, we have 64 threads per Block. Since each Fermi SM can take up to 1536 threads, there are 24 Blocks. However, each SM can only take up to 8 Blocks, so only 512 threads will go into each SM!
– For 32X32, we have 1024 threads per Block. Only one block can fit into an SM!
– For 16X16, we have 256 threads per Block. Since each Fermi SM can take up to 1536 threads, it can take up to 6 Blocks and achieve full capacity unless other resource considerations overrule.
Repeat the analysis for a Kepler card!
Programmer View of Register File
• There are up to 32768 (32 bit) registers in each Fermi SM
• There are up to 65536 (32 bit) registers in each Kepler SM
– This is an implementation decision, not part of CUDA
– Registers are dynamically partitioned across all blocks assigned to the SM
– Once assigned to a block, a register is NOT accessible by threads in other blocks
– Each thread in the same block only accesses registers assigned to itself
[Figure: the register file partitioned among 4 blocks vs. among 3 blocks]
Registers "occupancy" example
• If a Block has 16x16 threads and each thread uses 30 registers, how many blocks can run on each Fermi SM?
– Each block requires 30*256 = 7680 registers
– 32768/7680 = 4 + change (2048)
– So, 4 blocks can run on an SM as far as registers are concerned
• How about if each thread increases its use of registers by 10%?
– Each Block now requires 33*256 = 8448 registers
– 32768/8448 = 3 + change (7424)
– Only three Blocks can run on an SM, a 25% reduction of parallelism!!!
Repeat the analysis for a Kepler card!
Optimizing threads per block ü Choose threads per block as a multiple of warp size (32)
ü Avoid wasting computation on under-populated warps ü Facilitates efficient memory access (coalescing )
ü Run as many warps as possible per multiprocessor (hide latency) ü SM can run up to 16 (on Kepler) blocks at a time
ü Heuristics ü Minimum: 64 threads per block ü 192 or 256 threads a better choice
ü Usually still enough registers to compile and invoke successfully
ü The right tradeoff depends on your computation, so experiment, experiment, experiment!!!
Review on Memory Hierarchy
• Local Memory: per-thread – Private per thread but slow! – Auto variables, register spill
• Shared Memory: per-block – Shared by threads of the same block – Fast inter-thread communication
• Global Memory: per-application – Shared by all threads – Inter-Grid communication
[Figure: each thread has its own local memory; each block has shared memory; all grids (sequential in time: Grid 0, Grid 1, …) share global memory.]
Optimizing Global Memory Access
• Memory Bandwidth is very high (~150 Gbytes/sec.) but…
• Latency is also very high (hundreds of cycles)!
• Local memory has the same latency!
• Reduce the number of memory transactions by applying a proper memory access pattern.
Thread locality is better than data locality on the GPU!
Array of structures (data locality!) vs. Structure of arrays (thread locality!)
Global Memory Operations
• Two types of loads: – Caching
• Default mode • Attempts to hit in L1, then L2, then GMEM • Load granularity is a 128-byte line
– Non-caching
• Compile with -Xptxas -dlcm=cg option to nvcc • Attempts to hit in L2, then GMEM
– Does not hit in L1; invalidates the line if it is in L1 already • Load granularity is 32 bytes
• Stores: – Invalidate L1, write-back for L2
Caching Load • Warp requests 32 aligned, consecutive 4-byte words • Addresses fall within 1 cache-line
– Warp needs 128 bytes – 128 bytes move across the bus on a miss – Bus utilization: 100%
Non-caching Load • Warp requests 32 aligned, consecutive 4-byte words • Addresses fall within 4 segments
– Warp needs 128 bytes – 128 bytes move across the bus on a miss – Bus utilization: 100%
Caching Load • Warp requests 32 aligned, permuted 4-byte words • Addresses fall within 1 cache-line
– Warp needs 128 bytes – 128 bytes move across the bus on a miss – Bus utilization: 100%
Non-caching Load • Warp requests 32 aligned, permuted 4-byte words • Addresses fall within 4 segments
– Warp needs 128 bytes – 128 bytes move across the bus on a miss – Bus utilization: 100%
Caching Load • Warp requests 32 misaligned, consecutive 4-byte words • Addresses fall within 2 cache-lines
– Warp needs 128 bytes – 256 bytes move across the bus on misses – Bus utilization: 50%
Non-caching Load • Warp requests 32 misaligned, consecutive 4-byte words • Addresses fall within at most 5 segments
– Warp needs 128 bytes – 160 bytes move across the bus on misses – Bus utilization: at least 80%
• Some misaligned patterns will fall within 4 segments, so 100% utilization
Caching Load • All threads in a warp request the same 4-byte word • Addresses fall within a single cache-line
– Warp needs 4 bytes – 128 bytes move across the bus on a miss – Bus utilization: 3.125%
Non-caching Load • All threads in a warp request the same 4-byte word • Addresses fall within a single segment
– Warp needs 4 bytes – 32 bytes move across the bus on a miss – Bus utilization: 12.5%
Caching Load • Warp requests 32 scattered 4-byte words • Addresses fall within N cache-lines
– Warp needs 128 bytes – N*128 bytes move across the bus on a miss – Bus utilization: 128 / (N*128)
Non-caching Load • Warp requests 32 scattered 4-byte words • Addresses fall within N segments
– Warp needs 128 bytes – N*32 bytes move across the bus on a miss – Bus utilization: 128 / (N*32)
Shared Memory • On-chip memory
– 2 orders of magnitude lower latency than global memory – Much higher bandwidth than global memory – Up to 48 KByte per multiprocessor (on Fermi and Kepler)
• Allocated per thread-block • Accessible by any thread in the thread-block
– Not accessible to other thread-blocks • Several uses:
– Sharing data among threads in a thread-block – User-managed cache (reducing global memory accesses) __shared__ float As[16][16];
Shared Memory Example • Compute adjacent_difference
– Given an array d_in, produce a d_out array containing the differences: d_out[0]=d_in[0]; d_out[i]=d_in[i]-d_in[i-1]; /* i>0 */
Exercise: Complete the shared_variables.cu code
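The shared_variables.cu exercise is the reference; as a hedged sketch of the idea (each block stages its tile in shared memory so the neighbouring element is read from global memory only once per tile; BLOCK_SIZE and the kernel name are illustrative):

#define BLOCK_SIZE 256

__global__ void adj_diff(int *d_out, const int *d_in, int n) {
    __shared__ int s_in[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) s_in[threadIdx.x] = d_in[i];       // stage the tile in shared memory
    __syncthreads();                              // every thread of the block sees the tile
    if (i >= n) return;
    if (threadIdx.x > 0)
        d_out[i] = s_in[threadIdx.x] - s_in[threadIdx.x - 1];
    else                                          // first thread of the block: neighbour is
        d_out[i] = (i > 0) ? d_in[i] - d_in[i - 1] : d_in[i];   // in the previous tile (or none)
}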
N-Body problems An N-body simulation numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body. The all-pairs approach to N-body simulation is a brute-force technique that evaluates all pair-wise interactions among the N bodies.
N-Body problems We use the gravitational potential to illustrate the basic form of computation in an all-pairs N-body simulation.
N-Body problems To integrate over time, we need the acceleration a_i = F_i/m_i to update the position and velocity of body i.
Download http://twin.iac.rm.cnr.it/mini_nbody.tgz and unpack: tar zxvf mini_nbody.tgz look at nbody.c
N-Body: main loop void bodyForce(Body *p, float dt, int n) { for (int i = 0; i < n; i++) { float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f; for (int j = 0; j < n; j++) { float dx = p[j].x - p[i].x; float dy = p[j].y - p[i].y; float dz = p[j].z - p[i].z; float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING; float invDist = 1.0f / sqrtf(distSqr); float invDist3 = invDist * invDist * invDist; Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3; } p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz; } }
N-Body: parallel processing • We may think of the all-pairs algorithm as calculating each entry f_ij in an NxN grid of all pair-wise forces.
• Each entry can be computed independently, so there is O(N^2) available parallelism
• We may assign each body to a thread – If the block size is, say, 256 then the number of blocks is: (nBodies + 255) / 256
N-Body: a first CUDA code __global__ void bodyForce(Body *p, float dt, int n) { int i = blockDim.x * blockIdx.x + threadIdx.x; if (i < n) { float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f; for (int j = 0; j < n; j++) { float dx = p[j].x - p[i].x; float dy = p[j].y - p[i].y; float dz = p[j].z - p[i].z; float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING; float invDist = rsqrtf(distSqr); float invDist3 = invDist * invDist * invDist; Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3; } p[i].vx += dt*Fx; p[i].vy += dt*Fy; p[i].vz += dt*Fz; } }
N-Body: a first CUDA code • Do we expect this code to be efficient?
– All threads read all data from the global memory
• It is better to serialize some of the computations to achieve the data reuse needed to reach peak performance of the arithmetic units and to reduce the memory bandwidth required.
• To that purpose let us introduce the notion of computational tile
N-Body: Computational tile
• A tile is a square region of the grid of pair-wise forces consisting of p rows and p columns. – Only 2p body descriptions are required to evaluate all p^2 interactions in the tile (p of which can be reused later).
– These body descriptions can be stored in shared memory or in registers.
N-Body: Computational tile
• Each thread in a block loads a body • All threads in a block compute the interactions with the bodies that have been loaded
N-Body: using Shared Memory (1) __global__ void bodyForce(float4 *p, float4 *v, float dt, int n) { int i = blockDim.x * blockIdx.x + threadIdx.x; if (i < n) { float Fx = 0.0f; float Fy = 0.0f; float Fz = 0.0f; for (int tile = 0; tile < gridDim.x; tile++) { __shared__ float3 spos[BLOCK_SIZE]; float4 tpos = p[tile * blockDim.x + threadIdx.x]; spos[threadIdx.x] = make_float3(tpos.x, tpos.y, tpos.z); __syncthreads();
for (int j = 0; j < BLOCK_SIZE; j++) {
….. } __syncthreads(); } v[i].x += dt*Fx; v[i].y += dt*Fy; v[i].z += dt*Fz; } }
N-Body: using Shared Memory (2) __global__ void bodyForce(float4 *p, float4 *v, float dt, int n) { int i = blockDim.x * blockIdx.x + threadIdx.x; if (i < n) { .... for (int j = 0; j < BLOCK_SIZE; j++) { float dx = spos[j].x - p[i].x; float dy = spos[j].y - p[i].y; float dz = spos[j].z - p[i].z; float distSqr = dx*dx + dy*dy + dz*dz + SOFTENING; float invDist = rsqrtf(distSqr); float invDist3 = invDist * invDist * invDist; Fx += dx * invDist3; Fy += dy * invDist3; Fz += dz * invDist3; } …. } }
N-Body: Further optimizations
• Tuning of block size
• Unrolling of the inner loop (see the sketch after the test list)
for (int j = 0; j < BLOCK_SIZE; j++) { }
• Test: 1. shmoo-cpu-nbody.sh 2. cuda/shmoo-cuda-nbody-orig.sh 3. cuda/shmoo-cuda-nbody-block.sh 4. cuda/shmoo-cuda-nbody-unroll.sh
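As a generic illustration of the unrolling idea (not the actual nbody-unroll.cu code), a loop with a compile-time trip count can be unrolled by the compiler with #pragma unroll; the kernel below is a made-up example:

#define TILE 8   // illustrative compile-time trip count

__global__ void scale_tile(float *data, float factor) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;
    #pragma unroll            // ask nvcc to fully unroll the fixed-length loop
    for (int j = 0; j < TILE; j++) {
        data[base + j] *= factor;
    }
}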
Shared Memory • Organization:
– 32 banks, 4-byte wide banks – Successive 4-byte words belong to different banks
• Performance: – 4 bytes per bank per 2 clocks per multiprocessor – smem accesses are issued per 32 threads (warp) – serialization: if N threads of the 32 access different 4-byte words in the same bank, the N accesses are executed serially
– multicast: N threads access the same word in one fetch • Could be different bytes within the same word
Bank Addressing Examples
• No Bank Conflicts: [Figure: threads 0–31 map to banks 0–31, either linearly or as a permutation, one thread per bank]
• 2-way Bank Conflicts (and an x8 pattern giving 8-way conflicts): [Figure: several threads map to the same bank]
Shared Memory: Avoiding Bank Conflicts
• 32x32 SMEM array
• Warp accesses a column: – 32-way bank conflicts (all threads in a warp access the same bank)
[Figure: each row of the 32x32 array spans banks 0–31, so an entire column falls in one bank]
• Add a column for padding: – 32x33 SMEM array
• Warp accesses a column: – 32 different banks, no bank conflicts
[Figure: with the padding column, successive rows of a column fall in different banks]
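A hedged sketch of the padding idea applied to a transpose-like access pattern; it assumes a 32x32 thread block and a 32x32 input tile, and the kernel name is illustrative:

__global__ void column_access_example(float *g_out, const float *g_in) {
    // Pad each row by one element: column accesses then hit 32 different banks.
    __shared__ float tile[32][33];            // 32x33 instead of 32x32
    int x = threadIdx.x, y = threadIdx.y;     // assumes a 32x32 thread block
    tile[y][x] = g_in[y * 32 + x];            // row-wise load from global memory
    __syncthreads();
    g_out[x * 32 + y] = tile[x][y];           // column-wise read: no bank conflicts
}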
Tree-Based Parallel Reductions in Shared Memory
• CUDA solves this by exposing on-chip shared memory – Reduce blocks of data in shared memory to save bandwidth
[Figure: tree reduction of 3 1 7 0 4 1 6 3 into 4 7 5 9, then 11 14, then 25]
Parallel Reduction: Interleaved Addressing
[Figure: shared-memory values 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 are reduced in four steps with strides 1, 2, 4, 8; threads 0–7, then 0–3, then 0–1, then 0 are active, leaving the total 41 in position 0.]
Interleaved addressing results in bank conflicts
• Arbitrarily bad bank conflicts • Requires barriers if N > warpsize • Supports non-commutative operators
Parallel Reduction: Sequential Addressing
[Figure: shared-memory values 10 1 8 -1 0 -2 3 5 -2 -3 2 7 0 11 0 2 are reduced in four steps with strides 8, 4, 2, 1; each active thread adds the element one stride away, leaving the total 41 in position 0.]
Sequential addressing is conflict free
• No bank conflicts • Requires barriers if N > warpsize • Does not support non-commutative operators
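A hedged sketch of a block-level reduction with sequential addressing, assuming blockDim.x is a power of two and the shared-memory size is passed at launch; the kernel name is illustrative:

__global__ void block_reduce(const int *g_in, int *g_out) {
    extern __shared__ int sdata[];                    // one int per thread
    int tid = threadIdx.x;
    sdata[tid] = g_in[blockIdx.x * blockDim.x + tid]; // load one element per thread
    __syncthreads();
    // halve the number of active threads each step: strides blockDim.x/2, ..., 1
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }
    if (tid == 0) g_out[blockIdx.x] = sdata[0];       // per-block partial sum
}

// illustrative launch:
// block_reduce<<< numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int) >>>( d_in, d_partial );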
Parallel Prefix Sum (Scan)
• Given an array A = [a0, a1, …, an-1] and a binary associative operator ⊕ with identity I,
• scan(A) = [I, a0, (a0 ⊕ a1), …, (a0 ⊕ a1 ⊕ … ⊕ an-2)]
• Example: if ⊕ is addition, then scan on the set – [3 1 7 0 4 1 6 3]
• returns the set – [0 3 4 11 11 15 16 22]
O(n log n) Scan
• Step efficient (log n steps) • Not work efficient (n log n work) • Requires barriers at each step (WAR dependencies)
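A hedged single-block sketch of the O(n log n) (Hillis-Steele style) exclusive scan described above; it assumes n equals blockDim.x, that n fits in shared memory, and the kernel name is illustrative:

__global__ void scan_naive(int *d_out, const int *d_in, int n) {
    extern __shared__ int temp[];
    int tid = threadIdx.x;
    temp[tid] = (tid > 0) ? d_in[tid - 1] : 0;     // shift right: identity goes in slot 0
    __syncthreads();
    for (int offset = 1; offset < n; offset *= 2) {
        int val = (tid >= offset) ? temp[tid - offset] : 0;  // read phase
        __syncthreads();
        temp[tid] += val;                                    // write phase (own slot only)
        __syncthreads();
    }
    d_out[tid] = temp[tid];
}

// illustrative launch: scan_naive<<< 1, n, n * sizeof(int) >>>( d_out, d_in, n );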
Constant Memory
ü Both scalar and array values can be stored
ü Constants stored in DRAM, and cached on chip – L1 per SM
ü A constant value can be broadcast to all threads in a Warp
ü Extremely efficient way of accessing a value that is common for all threads in a block!
[Figure: multiprocessor diagram: instruction cache, multithreaded instruction buffer, register file, constant cache, L1/shared memory, operand select, MAD and SFU units]
cudaMemcpyToSymbol(ds, &hs, sizeof(int), 0, cudaMemcpyHostToDevice);
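A minimal sketch of how the call above fits together; it assumes ds is an int __constant__ symbol on the device and hs an int on the host, and the kernel is illustrative:

__constant__ int ds;                                  // device-side constant

__global__ void scale(int *data) {
    data[threadIdx.x] *= ds;                          // ds is broadcast to all threads in a warp
}

// host side (illustrative):
// int hs = 42;
// cudaMemcpyToSymbol(ds, &hs, sizeof(int), 0, cudaMemcpyHostToDevice);
// scale<<< 1, 256 >>>( d_data );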
Texture Memory (and co-stars)
• Another type of memory system, featuring: – Spatially-cached read-only access – Avoid coalescing worries – Interpolation – (Other) fixed-function capabilities – Graphics interoperability
Traditional Caching
• When reading, cache "nearby elements" – (i.e. cache line)
– Memory is linear!
– Applies to CPU, GPU L1/L2 cache, etc
Texture-Memory Caching
• Can cache "spatially!" (2D, 3D) – Specify dimensions (1D, 2D, 3D) on creation
• 1D applications: – Interpolation, clamping (later) – Caching when, e.g., coalesced access is infeasible
Texture Memory
• “Memory is just an unshaped bucket of bits” (CUDA Handbook)
• Need texture reference in order to: – Interpret data – Deliver to registers
Interpolation
• Can "read between the lines!"
http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html
Download http://twin.iac.rm.cnr.it/readTexels.cu
• Seamlessly handle reads beyond region!
Clamping
http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html
“CUDA Arrays”
• So far, we've used standard linear arrays
• "CUDA arrays": – Different addressing calculation
• Contiguous addresses have 2D/3D locality!
– Not pointer-addressable – (Designed specifically for texturing)
Texture Memory
• A texture reference can be attached to: – An ordinary device-memory array – A "CUDA array"
• Many more capabilities
Texturing Example (2D)
[Figure: 2D texturing example code, from http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/cuda_04_ykhung.pdf]
Arithmetic Instruction Throughput
Number of operations per clock cycle per multiprocessor
Control Flow Instructions • Main performance concern with branching is divergence
– Threads within a single warp (group of 32) take different paths – Different execution paths are serialized
• The control paths taken by the threads in a warp are traversed one at a time until there are no more.
• Avoid divergence when the branch condition is a function of thread ID – Example with divergence:
• if (threadIdx.x > 7) { } • This creates two different control paths for threads in a block • Branch granularity < warp size; threads 0, 1, …, 7 follow a different path than the rest of the threads 8, 9, …, 31 in the same warp.
– Example without divergence:
• if (threadIdx.x / WARP_SIZE > 2) { } • Also creates two different control paths for threads in a block • Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path
Warp Vote Functions and other useful primitives
• int __all(predicate) returns true iff the predicate evaluates as true for all threads of a warp
• int __any(predicate) returns true iff the predicate evaluates as true for any thread of a warp
• unsigned int __ballot(predicate) returns an unsigned int whose Nth bit is set iff the predicate evaluates as true for the Nth thread of the warp.
• int __popc(int x) POPulation Count: returns the number of bits that are set to 1 in the 32-bit integer x.
• int __clz(int x) Count Leading Zeros: returns the number of consecutive zero bits beginning at the most significant bit of the 32-bit integer x.
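A small illustration (a sketch, not from the course samples) of combining __ballot and __popc to count, within each warp, how many threads satisfy a predicate; it assumes blockDim.x is a multiple of 32:

__global__ void count_positive(const int *in, int *warp_counts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int pred = (i < n) && (in[i] > 0);          // predicate evaluated by every thread
    unsigned int mask = __ballot(pred);         // bit N set iff lane N's predicate is true
    if ((threadIdx.x % 32) == 0)                // one lane per warp records the count
        warp_counts[i / 32] = __popc(mask);     // number of set bits = satisfying threads
}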
Shuffle Instruction
• Instruction to exchange data in a warp
• Threads can "read" other threads' registers
• No shared memory is needed
• It is available starting from SM 3.0 • Kepler's shuffle instruction (SHFL) enables a thread to directly read a register from another thread in the same warp
Shuffle Instruction
• 4 Variants (idx, up, down, bfly):
A simple shuffle example (1)
int __shfl_down(int var, unsigned int delta, int width=warpSize);
• __shfl_down() calculates a source lane ID by adding delta to the caller's lane ID – the lane ID is a thread's index within its warp, from 0 to 31. It can be computed as threadIdx.x % 32;
• The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes
• If the source lane ID is out of range or the source thread has exited, the calling thread's own var is returned.
A simple shuffle example (2)
int i = threadIdx.x % 32; int j = __shfl_down(i, 2, 8);
Using shuffle for a reduce operation
__inline__ __device__ int warpReduceSum(int val) { for (int offset = warpSize/2; offset > 0; offset /= 2) { val += __shfl_down(val, offset);
} return val; }
Using shuffle for an all reduce operation
If all threads in the warp need the result, it is possible to implement an "all reduce" within a warp using __shfl_xor() instead of __shfl_down(), like the following:
__inline__ __device__ int warpAllReduceSum(int val) { for (int mask = warpSize/2; mask > 0; mask /= 2) { val += __shfl_xor(val, mask);
} return val;
}
Block reduce by using shuffle • First reduce within warps. • Then the first thread of each warp writes its partial sum to shared memory. • Finally, after synchronizing, the first warp reads from shared memory and reduces again.
__inline__ __device__ int blockReduceSum(int val) {
static __shared__ int shared[32]; // Shared mem for 32 partial sums int lane = threadIdx.x % warpSize; int wid = threadIdx.x / warpSize; val = warpReduceSum(val); // Each warp performs a partial reduction if (lane==0) shared[wid]=val; // Write reduced value to shared memory __syncthreads(); // Wait for all partial reductions //read from shared memory only if that warp existed val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0; if (wid==0) val = warpReduceSum(val); //Final reduce within first warp return val;
}
Reducing Large Arrays • Reducing across blocks means global communication, which means synchronizing the grid
• Two possible approaches:
– Breaking the computation into two separate kernel launches – Using atomic operations
• Exercise (home work!): write a deviceReduce that works across a complete grid (many thread blocks).
Implement SHFL for 64-bit Numbers
Generic SHFL: https://github.com/BryanCatanzaro/generics
Optimized Filtering
Let us have a source array, src, containing n elements, and a predicate, and we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has a length of at least n and that the order of elements in the dst array does not matter. If the predicate is src[i]>0, a simple CPU solution is:
int filter(int *dst, const int *src, int n) { int nres = 0; for (int i = 0; i < n; i++) if (src[i] > 0) dst[nres++] = src[i]; // return the number of elements copied return nres; }
Optimized Filtering on GPU: global memory
A straightforward approach is to use a single global counter and atomically increment it for each new element written into the dst array: __global__ void filter_k(int *dst, int *nres, const int *src, int n) { int i = threadIdx.x + blockIdx.x * blockDim.x; if(i < n && src[i] > 0) dst[atomicAdd(nres, 1)] = src[i]; } Here atomic operations are clearly a bottleneck, and need to be removed or reduced to increase performance
Optimized Filtering on GPU: using shared memory
__global__ void filter_shared_k(int *dst, int *nres, const int* src, int n) { __shared__ int l_n; int i = blockIdx.x * (NPER_THREAD * BS) + threadIdx.x; for (int iter = 0; iter < NPER_THREAD; iter++) { // zero the counter if (threadIdx.x == 0) l_n = 0; __syncthreads(); // get the value, evaluate the predicate, and // increment the counter if needed int d, pos; if(i < n) { d = src[i]; if(d > 0) pos = atomicAdd(&l_n, 1); }
__syncthreads(); // leader increments the global counter if(threadIdx.x == 0) l_n = atomicAdd(nres, l_n); __syncthreads(); // threads with true predicates write their elements if(i < n && d > 0) { pos += l_n; // increment local pos by global counter dst[pos] = d; } __syncthreads(); i += BS; } }
Optimized Filtering on GPU: warp-aggregated atomics
Warp aggregation is the process of combining atomic operations from multiple threads in a warp into a single atomic.
1. Threads in the warp elect a leader thread.
2. Threads in the warp compute the total atomic increment for the warp.
3. The leader thread performs an atomic add to compute the offset for the warp.
4. The leader thread broadcasts the offset to all other threads in the warp.
5. Each thread adds its own index within the warp to the warp offset to get its position in the output array.
Step 1: elect a leader
The thread identifier within a warp can be computed as #define WARP_SZ 32 __device__ inline int lane_id(void) { return threadIdx.x % WARP_SZ; }
A reasonable choice is to select as leader the lowest active lane int mask = __ballot(1); // mask of active lanes int leader = __ffs(mask) - 1; // -1 for 0-based indexing __ffs(int v) returns the 1-based position of the lowest bit that is set in v, or 0 if no bit is set (ffs stands for find first set)
Step 2: compute the total increment
The total increment for the warp is equal to the number of active lanes which hold true predicates. The value can be obtained by using the __popc intrinsic, which returns the number of bits set in the binary representation of its argument: int change = __popc(mask);
Step 3: performing the atomic
Only the leader lane performs the atomic operation int res; if(lane_id() == leader) res = atomicAdd(ctr, change); // ctr is the pointer to the counter
Step 4: broadcasting the result
We can implement the broadcast efficiently using the shuffle intrinsic __device__ int warp_bcast(int v, int leader) { return __shfl(v, leader); }
Step 5: computing the index for each lane
This is a bit tricky: we must compute the output index for each lane by adding the broadcast counter value for the warp to the lane's intra-warp offset
• the intra-warp offset is the number of active lanes with IDs lower than the current lane ID
• We can compute it by applying a mask that selects only lower lane IDs, and then applying the __popc() intrinsic to the result.
res += __popc(mask & ((1 << lane_id()) - 1));
Final code // warp-aggregated atomic increment __device__ int atomicAddInc(int *ctr) { int mask = __ballot(1); // select the leader int leader = __ffs(mask) - 1; // leader does the update int res; if(lane_id() == leader) res = atomicAdd(ctr, __popc(mask)); // broadcast result res = warp_bcast(res, leader); // each thread computes its own value return res + __popc(mask & ((1 << lane_id()) - 1)); } // atomicAddInc Download http://twin.iac.rm.cnr.it/filter.cu
Exercise Write a shuffle-based kernel for the N-body problem. Hint: start from the nbody-unroll.cu code
CUDA Dynamic Parallelism • A facility introduced by NVIDIA in their GK110 chip/architecture • Allows a kernel to call another kernel from within it without returning
the host. • Each such kernel call has the same calling construction - the grid
and block sizes and dimensions are set at the time of the call. • Facility allows computations to be done with dynamically altering
grid structures and recursion, to suit the computation. • For example allows a 2/3D simulation mesh to be non-uniform with
increased precision at places of interest, see previous slides.
Dynamic Parallelism: kernels calling other kernels
Host code:
kernel1<<< G1, B1 >>>(…);
Device code:
__global__ void kernel1 (…) { kernel2<<< G2, B2 >>>(…); }
__global__ void kernel2 (…) { kernel3<<< G3, B3 >>>(…); }
Notice the kernel call is a standard syntax and allows each kernel call to have different grid/block structures
Nesting depth is limited by memory and <= 63 or 64
Dynamic Parallelism: recursion is allowed

Host code:
kernel1<<< G1,B1 >>>(…);

Device code:
__global__ void kernel1(…)
{
   …
   kernel1<<< G2,B2 >>>(…);   // a kernel may launch itself
   …
}
172
Derived from Fig. 20.4 of Kirk and Hwu, 2nd Ed.
Parent-Child Launch Nesting
[Timeline figure: the host (CPU) thread launches Grid A (parent); Grid A threads launch Grid B (child); Grid B completes before Grid A completes.]
“Implicit synchronization between parent and child forcing parent to wait until all children exit before it can exit.”
173
Host-Device Data Transfers
ü Device-to-host memory bandwidth much lower than device-to-device bandwidth
  ü 8 GB/s peak (PCI-e x16 Gen 2) vs. ~150 GB/s peak
ü Minimize transfers
  ü Intermediate data can be allocated, operated on, and deallocated without ever copying it to host memory
ü Group transfers
  ü One large transfer is usually much better than many small ones
Synchronizing GPU and CPU
• All kernel launches are asynchronous!
  – control returns to CPU immediately
• Be careful measuring times!
  – kernel starts executing once all previous CUDA calls have completed
• Memcopies are synchronous
  – control returns to CPU once the copy is complete
  – copy starts once all previous CUDA calls have completed
• cudaDeviceSynchronize()
  – CPU code
  – blocks until all previous CUDA calls complete
• Asynchronous CUDA calls provide:
  – non-blocking memcopies
  – ability to overlap memcopies and kernel execution
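Because launches are asynchronous, a wall-clock timer started on the host right after the launch measures almost nothing. A common way to time a kernel is with CUDA events, as in this sketch (kernel, grid, block and args are placeholders):

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
kernel<<<grid, block>>>(args);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait until the kernel has really finished
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %f ms\n", ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);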
175
Page-Locked Data Transfers
ü cudaMallocHost() allows allocation of page-locked ("pinned") host memory
  ü Same syntax as cudaMalloc(), but works with CPU pointers and memory
ü Enables highest cudaMemcpy performance
  ü 3.2 GB/s on PCI-e x16 Gen1
  ü 6.5 GB/s on PCI-e x16 Gen2
ü Use with caution!!!
  ü Allocating too much page-locked memory can reduce overall system performance
See and test the bandwidth.cu sample
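A minimal sketch of a pinned-memory transfer (sizes and names are illustrative) might look like:

float *h_a, *d_a;
size_t n = 1 << 24;                                   // 16M floats
cudaMallocHost((void**)&h_a, n * sizeof(float));      // page-locked host buffer
cudaMalloc((void**)&d_a, n * sizeof(float));
/* ... fill h_a on the host ... */
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
/* ... */
cudaFreeHost(h_a);                                    // pinned memory is released with cudaFreeHost
cudaFree(d_a);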
176
Overlapping Data Transfers and Computation
ü Async Copy and Stream API allow overlap of H2D or D2H data transfers with computation
  ü CPU computation can overlap data transfers on all CUDA-capable devices
  ü Kernel computation can overlap data transfers on devices with "Concurrent copy and execution" (compute capability >= 1.1)
ü Stream = sequence of operations that execute in order on the GPU
  ü Operations from different streams can be interleaved
  ü Stream ID used as argument to async calls and kernel launches
ü Asynchronous host-device memory copy returns control immediately to the CPU
  ü cudaMemcpyAsync(dst, src, size, dir, stream);
  ü requires pinned host memory (allocated with cudaMallocHost)
ü Overlap CPU computation with data transfer
  ü 0 = default stream
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();
177
Asynchronous Data Transfers overlapped
178
Overlapping kernel and data transfer
ü Requires:
  ü Kernel and transfer use different, non-zero streams
  ü For the creation of a stream: cudaStreamCreate
ü Remember that a CUDA call to stream 0 blocks until all previous calls complete and cannot be overlapped
ü Example:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(…);
The copy in stream1 and the kernel in stream2 are overlapped.
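A common pattern (a sketch only; it assumes h_a is pinned, the streams have been created, and N is a multiple of nstreams*256) splits the work into chunks so that the copy of one chunk overlaps the kernel of another:

int chunk = N / nstreams;
for (int i = 0; i < nstreams; i++) {
    int off = i * chunk;
    cudaMemcpyAsync(d_a + off, h_a + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, stream[i]);
    kernel<<<chunk / 256, 256, 0, stream[i]>>>(d_a + off);
    cudaMemcpyAsync(h_a + off, d_a + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[i]);
}
cudaDeviceSynchronize();   // wait for all streams before using h_a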
Exercise: read, compile and run simpleStreams.cu
Device Management
• CPU can query and select GPU devices
  – cudaGetDeviceCount( int* count )
  – cudaSetDevice( int device )
  – cudaGetDevice( int *current_device )
  – cudaGetDeviceProperties( cudaDeviceProp* prop, int device )
  – cudaChooseDevice( int *device, cudaDeviceProp* prop )
• Multi-GPU setup
  – device 0 is used by default
  – Starting with CUDA 4.0, a CPU thread can control more than one GPU
  – multiple CPU threads can control the same GPU
  – the driver regulates the access.
Device Management (sample code)
int cudadevice;
struct cudaDeviceProp prop;
cudaGetDevice( &cudadevice );
cudaGetDeviceProperties( &prop, cudadevice );
mpc     = prop.multiProcessorCount;
mtpb    = prop.maxThreadsPerBlock;
shmsize = prop.sharedMemPerBlock;
printf("Device %d: number of multiprocessors %d\n"
       "max number of threads per block %d\n"
       "shared memory per block %d\n",
       cudadevice, mpc, mtpb, shmsize);
Exercise: compile and run the enum_gpu.cu code
CUDA libraries
ü CUDA includes a number of widely used libraries
  – CUBLAS: BLAS implementation
  – CUFFT: FFT implementation
  – CURAND: Random number generation
  – Etc.
ü Thrust http://docs.nvidia.com/cuda/thrust
  – Reduction
  – Scan
  – Sort
  – Etc.
CUDA libraries
ü There are two types of runtime math operations
  – __func(): direct mapping to hardware ISA
    • Fast but low accuracy
    • Examples: __sinf(x), __expf(x), __powf(x,y)
  – func(): compiles to multiple instructions
    • Slower but higher accuracy (5 ulp, units in the last place, or less)
    • Examples: sin(x), exp(x), pow(x,y)
ü The -use_fast_math compiler option forces every func() to compile to __func()
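A small kernel can be used to see the difference between the two families (a sketch; the single-precision intrinsics are spelled __sinf, __expf, __powf):

__global__ void fast_vs_accurate(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float fast = __sinf(in[i]);   // hardware intrinsic: fast, lower accuracy
        float slow = sinf(in[i]);     // library version: slower, higher accuracy
        out[i] = slow - fast;         // per-element difference
    }
}

Compiling with -use_fast_math would make sinf() itself map to __sinf(), so the difference would become zero.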
CURAND Library
• Library for generating random numbers
• Features:
  – XORWOW pseudo-random generator
  – Sobol' quasi-random number generators
  – Host API for generating random numbers in bulk
  – Inline implementation allows use inside GPU functions/kernels
  – Single- and double-precision, uniform, normal and log-normal distributions
CURAND Features
– Pseudo-random numbers
  George Marsaglia. Xorshift RNGs. Journal of Statistical Software, 8(14), 2003. Available at http://www.jstatsoft.org/v08/i14/paper.
– Quasi-random 32-bit and 64-bit Sobol' sequences with up to 20,000 dimensions.
– Host API: call kernel from host, generate numbers on GPU, consume numbers on host or on GPU.
– GPU API: generate and consume numbers during kernel execution.
CURAND use
1. Create a generator:
   curandCreateGenerator()
2. Set a seed:
   curandSetPseudoRandomGeneratorSeed()
3. Generate the data from a distribution:
   curandGenerateUniform()/curandGenerateUniformDouble(): Uniform
   curandGenerateNormal()/curandGenerateNormalDouble(): Gaussian
   curandGenerateLogNormal()/curandGenerateLogNormalDouble(): Log-Normal
4. Destroy the generator:
   curandDestroyGenerator()
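A minimal host-API sketch following the four steps (error checking omitted; n, the seed and d_data are illustrative, and the program must be linked with -lcurand):

#include <curand.h>

curandGenerator_t gen;
float *d_data;
cudaMalloc((void**)&d_data, n * sizeof(float));
curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);   // 1. create
curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);        // 2. seed
curandGenerateUniform(gen, d_data, n);                   // 3. n floats in (0,1]
curandDestroyGenerator(gen);                             // 4. destroy
cudaFree(d_data);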
Example CURAND Program: Computing π with the Monte Carlo method
π = 4 * (Σ red points)/(Σ points)
1. Generate 0 <= x < 1
2. Generate 0 <= y < 1
3. Compute z = x² + y²
4. If z <= 1 the point is red
Accuracy increases slowly…
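A naive counting kernel for step 4 could look like the sketch below (not necessarily what pigpu.cu does; it assumes x and y have already been filled with uniform numbers, e.g. by curandGenerateUniform):

__global__ void count_inside(const float *x, const float *y, int n, unsigned int *inside)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] * x[i] + y[i] * y[i] <= 1.0f)
        atomicAdd(inside, 1u);   // one atomic per red point: simple but slow
}
// on the host: pi is estimated as 4.0 * (*inside) / n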
Example CURAND Program: Computing π with the Monte Carlo method
Compare the speed of pigpu and picpu
cd Pi
gcc -o picpu -O3 picpu.c -lm
nvcc -o pigpu -O3 pigpu.cu -arch=sm_13 -lm
Run for 100000 iterations
Look at pigpu and try to understand why it is so SLOW…
Multi-GPU Memory
ü GPUs do not share global memory
  ü But starting with CUDA 4.0, one GPU can copy data from/to another GPU's memory directly, if the GPUs are connected to the same PCIe switch
ü Inter-GPU communication
  ü Application code is responsible for copying/moving data between GPUs
  ü Data travel across the PCIe bus
    ü Even when GPUs are connected to the same PCIe switch!
Multi-GPU Environment
ü GPUs have consecutive integer IDs, starting with 0
ü Starting with CUDA 4.0, a host thread can maintain more than one GPU context at a time
  ü cudaSetDevice allows changing the "active" GPU (and so the context)
  ü Device 0 is chosen when cudaSetDevice is not called
ü Remember that multiple host threads can establish contexts with the same GPU
  ü The driver handles time-sharing and resource partitioning unless the GPU is in exclusive mode
Multi-GPU Environment: a simple approach
// Run an independent kernel on each CUDA device
int numDevs = 0;
cudaGetDeviceCount(&numDevs);
...
for (int d = 0; d < numDevs; d++) {
    cudaSetDevice(d);
    kernel<<<blocks, threads>>>(args);
}
Exercise: Compare the dot_simple_multiblock.cu and dot_multigpu.cu programs. Compile and run.
Attention! Add the -arch=sm_35 option:
nvcc -o dot_multigpu dot_multigpu.cu -arch=sm_35
General Multi-GPU Programming Pattern
• The goal is to hide communication cost
  – Overlap with computation
• So, every time-step, each GPU should:
  – Compute parts (halos) to be sent to neighbors
  – Compute the internal region (bulk)
  – Exchange halos with neighbors
• Linear scaling as long as internal computation takes longer than halo exchange
  – Actually, separate halo computation adds some overhead
overlapped
Example: Two Subdomains
[Figure: GPU-0 handles the green subdomain, GPU-1 the grey one. Phase 1: each GPU computes and sends its halo; Phase 2: each GPU computes its internal region.]
CUDA Features Useful for Multi-GPU
• Control multiple GPUs with a single CPU thread
  – Simpler coding: no need for CPU multithreading
• Streams:
  – Enable executing kernels and memcopies concurrently
  – Up to 2 concurrent memcopies: to/from GPU
• Peer-to-Peer (P2P) GPU memory copies
  – Transfer data between GPUs using PCIe P2P support
  – Done by GPU DMA hardware, host CPU is not involved
    • Data traverses PCIe links, without touching CPU memory
  – Disjoint GPU pairs can communicate simultaneously
Peer-to-Peer GPU Communication
• Up to CUDA 4.0, GPUs could not exchange data directly.
Example: read p2pcopy.cu, compile
nvcc -o p2pcopy p2pcopy.cu -arch=sm_35
• With CUDA >= 4.0, two GPUs connected to the same PCI Express switch can exchange data directly.
cudaStreamCreate(&stream_on_gpu_0);
cudaSetDevice(0);
cudaDeviceEnablePeerAccess( 1, 0 );
cudaMalloc(&d_0, num_bytes);  /* create array on GPU 0 */
cudaSetDevice(1);
cudaMalloc(&d_1, num_bytes);  /* create array on GPU 1 */
cudaMemcpyPeerAsync( d_0, 0, d_1, 1, num_bytes, stream_on_gpu_0 );
/* copy d_1 from GPU 1 to d_0 on GPU 0: pull copy */
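Before enabling peer access it is worth checking that the two devices can actually reach each other; a sketch (device IDs 0 and 1 assumed):

int can01 = 0, can10 = 0;
cudaDeviceCanAccessPeer(&can01, 0, 1);
cudaDeviceCanAccessPeer(&can10, 1, 0);
if (can01 && can10) {
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
} else {
    /* cudaMemcpyPeerAsync still works, but the copy is staged through host memory */
}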
What does the code look like?
• Arguments to the cudaMemcpyPeerAsync() call:
  – Dest address, dest GPU, src address, src GPU, num bytes, stream ID
• The middle loop isn't necessary for correctness
  – Improves performance by preventing the two stages from interfering with each other (15 vs. 11 GB/s in a 4-GPU example)
for( int i=0; i<num_gpus-1; i++ )   // "right" stage
    cudaMemcpyPeerAsync( d_a[i+1], gpu[i+1], d_a[i], gpu[i], num_bytes, stream[i] );
for( int i=0; i<num_gpus; i++ )
    cudaStreamSynchronize( stream[i] );
for( int i=1; i<num_gpus; i++ )     // "left" stage
    cudaMemcpyPeerAsync( d_b[i-1], gpu[i-1], d_b[i], gpu[i], num_bytes, stream[i] );
Code Pattern for Multi-GPU
• Every time-step each GPU will:
  – Compute halos to be sent to neighbors (stream h)
  – Compute the internal region (stream i)
  – Exchange halos with neighbors (stream h)
Code For Looping through Time
for( int istep=0; istep<nsteps; istep++) {
    for( int i=0; i<num_gpus; i++ ) {
        cudaSetDevice( gpu[i] );
        kernel<<<..., stream_halo[i]>>>( ... );
        kernel<<<..., stream_halo[i]>>>( ... );
        cudaStreamQuery( stream_halo[i] );
        kernel<<<..., stream_internal[i]>>>( ... );
    }
    for( int i=0; i<num_gpus-1; i++ )
        cudaMemcpyPeerAsync( ..., stream_halo[i] );
    for( int i=0; i<num_gpus; i++ )
        cudaStreamSynchronize( stream_halo[i] );
    for( int i=1; i<num_gpus; i++ )
        cudaMemcpyPeerAsync( ..., stream_halo[i] );
    for( int i=0; i<num_gpus; i++ ) {
        cudaSetDevice( gpu[i] );
        cudaDeviceSynchronize();
        // swap input/output pointers
    }
}
Compute halos
Exchange halos
Synchronize before next step
Compute internal subdomain
Code For Looping through Time
• Two CUDA streams per GPU
  – halo and internal
  – Each in an array indexed by GPU
  – Calls issued to different streams can overlap
• Prior to time-loop
  – Create streams
  – Enable P2P access
(the same time-stepping loop shown above)
GPUs in Different Cluster Nodes
• Add network communication to the halo exchange
  – Transfer data from "edge" GPUs of each node to the CPU
  – Hosts exchange via MPI (or another API)
  – Transfer data from the CPU to "edge" GPUs of each node
• With CUDA >= 4.0, transparent interoperability between CUDA and InfiniBand
export CUDA_NIC_INTEROP=1
lets MPI use, for Remote Direct Memory Access operations, memory pinned by the GPU
Inter-GPU Communication with MPI
• Example: simpleMPI.c
  – Generate some random numbers on one node.
  – Dispatch them to all nodes.
  – Compute their square root on each node's GPU.
  – Compute the average of the results by using MPI.
To compile:
$ nvcc -c simpleCUDAMPI.cu
$ mpic++ -o simpleMPI simpleMPI.c simpleCUDAMPI.o -L/usr/local/cuda/lib64/ -lcudart   (on a single line!)
To run:
$ mpirun -np 2 simpleMPI
GPU-aware MPI
• Support GPU-to-GPU communication through standard MPI interfaces:
  – e.g. enable MPI_Send, MPI_Recv from/to GPU memory
  – Made possible by Unified Virtual Addressing (UVA) in CUDA 4.0
• Provide high performance without exposing low-level details to the programmer:
  – Pipelined data transfer which automatically provides optimizations inside the MPI library without user tuning
From: MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters, H. Wang, S. Potluri, M. Luo, A. K. Singh, S. Sur, D. K. Panda
Naive code with “old style” MPI
• Sender:
cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
• Receiver:
MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);
Optimized code with "old style" MPI
• Pipelining at user level with non-blocking MPI and CUDA functions
• Sender:
for (j = 0; j < pipeline_len; j++)
    cudaMemcpyAsync(s_buf+j*block_sz, s_device+j*block_sz, …);
for (j = 0; j < pipeline_len; j++) {
    while (result != cudaSuccess) {
        result = cudaStreamQuery(…);
        if (j > 0) MPI_Test(…);
    }
    MPI_Isend(s_buf+j*block_sz, block_sz, MPI_CHAR, 1, 1, …);
}
MPI_Waitall(…);
Code with GPU-aware MPI
• Standard MPI interfaces from device buffers
Sender:
MPI_Send(s_dev, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
Receiver:
MPI_Recv(r_dev, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
• The MPI library can automatically differentiate between device memory and host memory
• Overlap CUDA copy and RDMA transfer
• Pipeline DMA of data from the GPU and InfiniBand RDMA
229
Major GPU Performance Detractors
ü Sometimes it is better to recompute than to cache
  ü The GPU spends its transistors on ALUs, not on cache!
ü Do more computation on the GPU to avoid costly data transfers
  ü Even low-parallelism computations can at times be faster than transferring data back and forth to the host
230
Major GPU Performance Detractors
ü Partition your computation to keep the GPU multiprocessors equally busy
  ü Many threads, many thread blocks
ü Keep resource usage low enough to support multiple active thread blocks per multiprocessor
• Max number of blocks per SM: 16 (Kepler)
• Max number of threads per SM: 2048 (Kepler)
• Max number of threads per block: 1024 (Fermi & Kepler)
• Total number of registers per SM: 65536 (Kepler)
• Size of shared memory per SM: 48 Kbyte (Fermi & Kepler)
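One way to help the compiler respect these limits is __launch_bounds__, which caps register usage so that a given occupancy is achievable (a sketch; my_kernel and its body are illustrative):

// at most 256 threads per block, at least 4 resident blocks per SM
__global__ void __launch_bounds__(256, 4) my_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}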
A useful tool: the CUDA Profiler
ü What it does:
  ü Provide access to hardware performance monitors
  ü Pinpoint performance issues and compare across implementations
ü Two interfaces:
  ü Text-based: built-in and included with the compiler
    export CUDA_PROFILE=1 (bash) or setenv CUDA_PROFILE 1 (tcsh)
    export CUDA_PROFILE_LOG=XXXXX or setenv CUDA_PROFILE_LOG XXXXX
    where XXXXX is any file name you want
  ü GUI: specific for each platform
Exercise: run the MMGlob program under the control of the profiler
Final CUDA Highlights
• The CUDA API is a minimal extension of the ANSI C programming language: low learning curve
• The memory hierarchy is directly exposed to the programmer to allow optimal control of data flow: huge differences in performance with, apparently, tiny changes to the data organization.
What is missing?
• ECC effects
• Debugging and advanced profiling
• Single virtual address space and other variants of memory copy operations
• Many GPU libraries
Thanks!