GPU Programming and CUDA
Sathish Vadhiyar – Parallel Programming


Page 1: GPU Programming and CUDA

GPU Programming and CUDA

Sathish Vadhiyar
Parallel Programming

Page 2: GPU Programming and CUDA

GPU

• Graphical Processing Unit
• A single GPU consists of a large number of cores – hundreds of cores
• Whereas a single CPU can consist of 2, 4, 8 or 12 cores
• Cores? – Processing units in a chip sharing at least the memory and L1 cache

Page 3: GPU Programming and CUDA

GPU and CPU
• Typically GPU and CPU coexist in a heterogeneous setting
• The “less” computationally intensive part runs on the CPU (coarse-grained parallelism), and the more intensive parts run on the GPU (fine-grained parallelism)
• NVIDIA’s GPU architecture is called the CUDA (Compute Unified Device Architecture) architecture, accompanied by the CUDA programming model and the CUDA C language

Page 4: GPU Programming and CUDA

Example: SERC system

• The SERC system had 4 nodes
• Each node has 16 CPU cores arranged as four quad-core processors
• Each node is connected to a Tesla S1070 GPU system

Page 5: GPU Programming and CUDA

Tesla S1070

• Each Tesla S1070 system has 4 Tesla GPUs (T10 processors)
• The system is connected to one or two host systems via high-speed interconnects
• Each GPU has 240 GPU cores, hence a total of 960 cores
• Frequency of processor cores – 1.296 to 1.44 GHz
• Memory – 16 GB total, 4 GB per GPU

Page 6: GPU Programming and CUDA

Tesla S1070 Architecture

Page 7: GPU Programming and CUDA

Tesla T10

• 240 streaming processors/cores (SPs) organized as 30 streaming multiprocessors (SMs) in 10 independent processing units called Texture Processor Clusters (TPCs)
• A TPC consists of 3 SMs; an SM consists of 8 SPs
• 30 DP (double-precision) units – one per SM
• 128 threads per processor; 30,720 threads total
• The collection of TPCs is called a Streaming Processor Array (SPA)

Page 8: GPU Programming and CUDA

Tesla S1070 Architecture Details

Page 9: GPU Programming and CUDA

T10 and host connection (architecture diagram)

Page 10: GPU Programming and CUDA

TPCs, SMs, SPs

Page 11: GPU Programming and CUDA

Connection to Host, Interconnection Network

Page 12: GPU Programming and CUDA

Details of SM
• Each SM has a shared memory

Page 13: GPU Programming and CUDA

Multithreading

• An SM manages and executes up to 768 concurrent threads in hardware with zero scheduling overhead
• Concurrent threads synchronize at a barrier using a single SM instruction (see the sketch below)
• These features support efficient, very fine-grained parallelism
• A group of threads executes the same program, and different groups can execute different programs
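
As an illustration (not from the slides), a minimal kernel sketch where every thread in a block writes to shared memory and waits at the barrier before reading a neighbour’s value; the kernel name and the block-size limit are assumptions.

__global__ void shift_left(int *out, const int *in)
{
    __shared__ int buf[256];                     // assumes at most 256 threads per block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                             // barrier: all writes visible before any read
    int next = (tid + 1) % blockDim.x;
    out[blockIdx.x * blockDim.x + tid] = buf[next];
}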

Page 14: GPU Programming and CUDA

Hierarchical Parallelism

• Parallel computations are arranged as grids
• One grid executes after another
• A grid consists of blocks
• Blocks are assigned to SMs; a single block is assigned to a single SM, and multiple blocks can be assigned to one SM
• A block consists of elements
• Elements are computed by threads

Page 15: GPU Programming and CUDA

Hierarchical Parallelism

Page 16: GPU Programming and CUDA

Thread Blocks

• Thread block – an array of concurrent threads that execute the same program and can cooperate to compute the result
• Consists of 1 to 512 threads
• Has a shape and dimensions (1D, 2D or 3D) for its threads (see the 2D indexing sketch below)
• A thread ID has corresponding 1D, 2D or 3D indices
• Each SM executes up to eight thread blocks concurrently
• Threads of a thread block share memory
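
A hedged sketch (not from the slides) of a 2D thread block indexing a 2D matrix; the kernel, array names and block shape are illustrative.

__global__ void add_matrices(float *c, const float *a, const float *b, int width, int height)
{
    // 2D thread and block indices map each thread to one matrix element
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        int idx = row * width + col;
        c[idx] = a[idx] + b[idx];
    }
}

// Launch with 2D blocks and a 2D grid, e.g. 16x16 threads per block:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// add_matrices<<<grid, block>>>(d_c, d_a, d_b, width, height);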

Page 17: GPU Programming and CUDA
Page 18: GPU Programming and CUDA
Page 19: GPU Programming and CUDA
Page 20: GPU Programming and CUDA
Page 21: GPU Programming and CUDA

CUDA Programming Language

• Programming language for threaded parallelism on GPUs
• Minimal extension of C
• A serial program calls parallel kernels
• Serial code executes on the CPU
• Parallel kernels execute across a set of parallel threads on the GPU
• The programmer organizes threads into a hierarchy of thread blocks and grids

Page 22: GPU Programming and CUDA
Page 23: GPU Programming and CUDA

CUDA language

• A kernel corresponds to a grid
• __global__ – denotes kernel entry functions
• __device__ – denotes global (device-memory) variables
• __shared__ – denotes shared-memory variables
• A kernel’s text is a C function for one sequential thread (the three qualifiers are illustrated below)
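
A minimal sketch (not from the slides) showing the three qualifiers together; the variable and function names, and the block-size limit, are assumptions.

__device__ float d_scale = 2.0f;               // lives in device global memory

__global__ void scale_and_stage(float *data)   // kernel entry function
{
    __shared__ float stage[128];               // per-block shared memory; assumes <= 128 threads per block
    int tid = threadIdx.x;
    stage[tid] = data[blockIdx.x * blockDim.x + tid] * d_scale;
    __syncthreads();
    data[blockIdx.x * blockDim.x + tid] = stage[tid];
}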

Page 24: GPU Programming and CUDA

CUDA C

• Built-in variables:
  – threadIdx.{x,y,z} – thread ID within a block
  – blockIdx.{x,y,z} – block ID within a grid
  – blockDim.{x,y,z} – number of threads within a block
  – gridDim.{x,y,z} – number of blocks within a grid
• kernel<<<nBlocks, nThreads>>>(args)
  – Invokes a parallel kernel function on a grid of nBlocks blocks, where each block instantiates nThreads concurrent threads (a small launch sketch follows)
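
A minimal sketch (names assumed, not from the slides) combining the built-in variables with the <<<...>>> launch syntax.

__global__ void fill(int *out, int value, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        out[i] = value;
}

// Host side: one thread per element, rounded up to whole blocks.
// int nThreads = 256;
// int nBlocks  = (n + nThreads - 1) / nThreads;
// fill<<<nBlocks, nThreads>>>(d_out, 7, n);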

Page 25: GPU Programming and CUDA

Example: Summing Up kernel function

grid of kernels
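
The slide’s code listing is not reproduced in this dump; below is a minimal element-wise summing kernel in the same spirit, with the array and variable names assumed.

__global__ void vecAdd(float *c, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];      // each thread sums one pair of elements
}

// Grid view: each block handles blockDim.x elements, and the grid covers the whole array.
// vecAdd<<<(n + 255) / 256, 256>>>(d_c, d_a, d_b, n);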

Page 26: GPU Programming and CUDA
Page 27: GPU Programming and CUDA

General CUDA Steps

1. Copy data from CPU to GPU
2. Compute on GPU
3. Copy data back from GPU to CPU
• By default, execution on the host doesn’t wait for the kernel to finish
• General rules:
  – Minimize data transfer between CPU & GPU
  – Maximize the number of threads on the GPU
(A host-side skeleton of these steps is sketched below.)
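
A hedged host-side skeleton (buffer names, kernel name and launch parameters assumed) following the three steps.

int *d_in, *d_out;
cudaMalloc(&d_in,  n * sizeof(int));
cudaMalloc(&d_out, n * sizeof(int));

cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);    // 1. CPU -> GPU

compute<<<nBlocks, nThreads>>>(d_out, d_in, n);                     // 2. compute on GPU

cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);  // 3. GPU -> CPU
                                                                    //    (this copy also waits for the kernel)
cudaFree(d_in);
cudaFree(d_out);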

Page 28: GPU Programming and CUDA

CUDA Elements

• cudaMalloc – for allocating memory on the device
• cudaMemcpy – for copying data to allocated memory from host to device, and from device to host
• cudaFree – for freeing allocated memory
• void __syncthreads() – synchronizes all threads in a block, like a barrier
• CUDA has built-in atomic functions (add, sub, etc.) and error reporting
• Event management – e.g., cudaEventCreate, cudaEventRecord, cudaEventSynchronize, cudaEventElapsedTime (a timing sketch follows)
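
A hedged timing sketch using the event API named above; the kernel and its arguments are placeholders.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
someKernel<<<nBlocks, nThreads>>>(d_data);
cudaEventRecord(stop);
cudaEventSynchronize(stop);              // wait until the kernel and the stop event complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);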

Page 29: GPU Programming and CUDA
Page 30: GPU Programming and CUDA
Page 31: GPU Programming and CUDA

EXAMPLE 2: REDUCTION

Page 32: GPU Programming and CUDA

Example: Reduction

• Tree based approach used within each thread block

• In this case, partial results need to be communicated across thread blocks

• Hence, global synchronization needed across thread blocks

Page 33: GPU Programming and CUDA

Reduction

• But CUDA does not have global synchronization –
  – it is expensive to build in hardware for a large number of GPU cores
• Solution:
  – Decompose into multiple kernels
  – A kernel launch serves as a global synchronization point

Page 34: GPU Programming and CUDA

Illustration

Page 35: GPU Programming and CUDA

Host Code
int main()
{
  int *h_idata, *h_odata;   /* host data */
  int *d_idata, *d_odata;   /* device data */

  /* allocate the host and device buffers (malloc/cudaMalloc) and fill h_idata – omitted */

  int numThreads = (n < maxThreads) ? n : maxThreads;
  int numBlocks  = n / numThreads;

  /* copy inputs to device memory */
  cudaMemcpy(d_idata, h_idata, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_odata, h_idata, numBlocks*sizeof(int), cudaMemcpyHostToDevice);

  dim3 dimBlock(numThreads, 1, 1);
  dim3 dimGrid(numBlocks, 1, 1);
  int smemSize = numThreads * sizeof(int);   /* size of the kernel's extern shared array */

  reduce<<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata);

Page 36: GPU Programming and CUDA

Host Code
  int s = numBlocks;
  while (s > 1) {
    numThreads = (s < maxThreads) ? s : maxThreads;
    numBlocks  = s / numThreads;
    dimBlock   = dim3(numThreads, 1, 1);
    dimGrid    = dim3(numBlocks, 1, 1);
    smemSize   = numThreads * sizeof(int);

    /* reduce the per-block partial sums produced by the previous pass */
    reduce<<< dimGrid, dimBlock, smemSize >>>(d_odata, d_odata);

    s = s / numThreads;
  }
}

Page 37: GPU Programming and CUDA

Device Code
__global__ void reduce(int *g_idata, int *g_odata)
{
  extern __shared__ int sdata[];

  // load shared mem
  unsigned int tid = threadIdx.x;
  unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
  sdata[tid] = g_idata[i];
  __syncthreads();

  // do reduction in shared mem
  for (unsigned int s = 1; s < blockDim.x; s *= 2)
  {
    if ((tid % (2*s)) == 0)
      sdata[tid] += sdata[tid + s];
    __syncthreads();
  }

  // write result for this block to global mem
  if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Page 38: GPU Programming and CUDA

EXAMPLE 3: SCAN

Page 39: GPU Programming and CUDA

Example: Scan or Parallel-prefix sum

• Uses a binary tree
• An upward reduction phase (reduce phase or up-sweep phase)
  – Traversing the tree from leaves to root, forming partial sums at internal nodes
• A down-sweep phase
  – Traversing from root to leaves, using the partial sums computed in the reduction phase (see the worked example below)
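
As a concrete illustration (not on the slides), an exclusive scan of the 8-element array [3, 1, 7, 0, 4, 1, 6, 3]:

Up-sweep (partial sums built in place):
  after level 1: [3, 4, 7, 7, 4, 5, 6, 9]
  after level 2: [3, 4, 7, 11, 4, 5, 6, 14]
  after level 3: [3, 4, 7, 11, 4, 5, 6, 25]   (the root now holds the total, 25)

Down-sweep (root cleared to 0, partial sums pushed back down):
  [3, 4, 7, 11, 4, 5, 6, 0]
  [3, 4, 7, 0, 4, 5, 6, 11]
  [3, 0, 7, 4, 4, 11, 6, 16]
  [0, 3, 4, 11, 11, 15, 16, 22]               (the exclusive prefix sums)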

Page 40: GPU Programming and CUDA

Up Sweep

Page 41: GPU Programming and CUDA

Down Sweep

Page 42: GPU Programming and CUDA

Host Code
int main()
{
  const unsigned int num_threads = num_elements / 2;

  /* cudaMalloc d_idata and d_odata */

  cudaMemcpy(d_idata, h_data, mem_size, cudaMemcpyHostToDevice);

  /* one thread block scans the whole (small) array; each thread handles two elements */
  dim3 grid(1, 1, 1);
  dim3 threads(num_threads, 1, 1);
  unsigned int shared_mem_size = sizeof(float) * num_elements;

  scan_workefficient<<< grid, threads, shared_mem_size >>>(d_odata, d_idata, num_elements);

  cudaMemcpy(h_data, d_odata, sizeof(float) * num_elements, cudaMemcpyDeviceToHost);

  /* cudaFree d_idata and d_odata */
}

Page 43: GPU Programming and CUDA

Device Code
__global__ void scan_workefficient(float *g_odata, float *g_idata, int n)
{
  // Dynamically allocated shared memory for scan kernels
  extern __shared__ float temp[];

  int thid = threadIdx.x;
  int offset = 1;

  // Cache the computational window in shared memory
  temp[2*thid]   = g_idata[2*thid];
  temp[2*thid+1] = g_idata[2*thid+1];

  // build the sum in place up the tree
  for (int d = n>>1; d > 0; d >>= 1)
  {
    __syncthreads();
    if (thid < d)
    {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      temp[bi] += temp[ai];
    }
    offset *= 2;
  }

Page 44: GPU Programming and CUDA

Device Code
  // scan back down the tree

  // clear the last element
  if (thid == 0)
    temp[n - 1] = 0;

  // traverse down the tree building the scan in place
  for (int d = 1; d < n; d *= 2)
  {
    offset >>= 1;
    __syncthreads();
    if (thid < d)
    {
      int ai = offset*(2*thid+1)-1;
      int bi = offset*(2*thid+2)-1;
      float t   = temp[ai];
      temp[ai]  = temp[bi];
      temp[bi] += t;
    }
  }
  __syncthreads();

  // write results to global memory
  g_odata[2*thid]   = temp[2*thid];
  g_odata[2*thid+1] = temp[2*thid+1];
}

Page 45: GPU Programming and CUDA

Arrays of Arbitrary Size

• The previous code works only for arrays that can fit inside a single thread block
• For large arrays:
  – Divide the array into blocks
  – Perform a scan using the previous algorithm for each thread block
  – Then scan the block sums
  – Illustrated by a diagram

Page 46: GPU Programming and CUDA

Illustration

Page 47: GPU Programming and CUDA

EXAMPLE 4: MATRIX MULTIPLICATION

Page 48: GPU Programming and CUDA
Page 49: GPU Programming and CUDA

Example 4: Matrix Multiplication

Page 50: GPU Programming and CUDA

Example 4

Page 51: GPU Programming and CUDA

Example 4

Page 52: GPU Programming and CUDA

Example 4

Page 53: GPU Programming and CUDA

Example 4
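
The matrix-multiplication example on these slides is shown as figures; below is a minimal CUDA kernel sketch in the same spirit, where the kernel name, the square N x N size and the row-major layout are assumptions.

// Each thread computes one element of C = A * B for N x N matrices in row-major order.
__global__ void matMul(float *C, const float *A, const float *B, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}

// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (N + 15) / 16);
// matMul<<<grid, block>>>(d_C, d_A, d_B, N);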

Page 54: GPU Programming and CUDA

Others

• CUDA has built-in libraries including cuBLAS, cuFFT, and SDKs (software development kits) for finance, image processing, etc. (a cuBLAS call is sketched below)
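
As an illustration (not from the slides), a single-precision matrix multiply through cuBLAS; cuBLAS uses column-major storage, and the device pointers and sizes here are placeholders.

#include <cublas_v2.h>

// C = alpha * A * B + beta * C, with A (MxK), B (KxN), C (MxN) in column-major order
cublasHandle_t handle;
cublasCreate(&handle);

const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            M, N, K,
            &alpha, d_A, M,    // leading dimension of A
                    d_B, K,    // leading dimension of B
            &beta,  d_C, M);   // leading dimension of C

cublasDestroy(handle);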

Page 55: GPU Programming and CUDA

• For more information:
• CUDA SDK code samples – NVIDIA – http://www.nvidia.com/object/cuda_get_samples.html

Page 56: GPU Programming and CUDA

Backup

Page 57: GPU Programming and CUDA

Host Code
int main()
{
  prescanArray(h_odata, h_idata, num_elements);   /* entry point; host arrays set up elsewhere */
}

void prescanArray(float *outArray, float *inArray, int numElements)
{
  prescanArrayRecursive(outArray, inArray, numElements, 0);
}

void prescanArrayRecursive(float *outArray, const float *inArray, int numElements, int level)
{
  unsigned int blockSize = BLOCK_SIZE;   // max size of the thread blocks
  unsigned int numBlocks = max(1, (int)ceil((float)numElements / (2.f * blockSize)));
  unsigned int numThreads;

  if (numBlocks > 1)
    numThreads = blockSize;
  else
    numThreads = numElements / 2;

  /* np2LastBlock is 1 when the last block is a partial (non-power-of-two) block;
     its setup is omitted here */
  dim3 grid(max(1, numBlocks - np2LastBlock), 1, 1);
  dim3 threads(numThreads, 1, 1);

Page 58: GPU Programming and CUDA

Host Code
  // execute the scan
  if (numBlocks > 1)
  {
    prescan<true, false><<< grid, threads, sharedMemSize >>>(outArray, inArray,
        g_scanBlockSums[level], numThreads * 2, 0, 0);

    // After scanning all the sub-blocks, we are mostly done. But now we need to take
    // the last value of each sub-block and scan those.
    // This gives a value that must be added to each block to get the final results.

    // recursive (CPU) call
    prescanArrayRecursive(g_scanBlockSums[level], g_scanBlockSums[level], numBlocks, level+1);

    uniformAdd<<< grid, threads >>>(outArray, g_scanBlockSums[level], numElements, 0, 0);
  }
  else
    prescan<false, false><<< grid, threads, sharedMemSize >>>(outArray, inArray, 0,
        numThreads * 2, 0, 0);
}