High Performance Computing with GPUs: An Introduction
GPU Tutorial: How To Program for GPUs
Krešimir Ćosić (University of Split, Croatia), <[email protected]>
LSST All Hands Meeting 2010, Tucson, AZ
Thursday, August 12th, 2010
Overview
• CUDA
• Hardware architecture
• Programming model
• Convolution on the GPU
CUDA
• 'Compute Unified Device Architecture'
  – hardware and software architecture for issuing and managing computations on the GPU
• Massively parallel architecture
  – running over 8000 threads at once is common
• C for CUDA (C++ for CUDA)
  – the C/C++ language with some additions and restrictions
• Enables GPGPU: 'General-Purpose Computing on GPUs'
GPU: a multithreaded coprocessor
• SM: streaming multiprocessor
  – 32 SPs each (or 16, 48 or more)
  – fast local 'shared memory', shared between the SPs: 16 KiB (or 64 KiB)
• SP: scalar processor, a 'CUDA core'
  – executes one thread
[Diagram: an SM composed of SPs and shared memory, connected to the global memory on the device]
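These figures vary from card to card; a minimal sketch (not from the slides) that reads them at runtime via the CUDA runtime call cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("SMs:                %d\n", prop.multiProcessorCount);
    printf("Shared mem/block:   %zu bytes\n", prop.sharedMemPerBlock);
    printf("Global memory:      %zu bytes\n", prop.totalGlobalMem);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}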
GPU: SMs
• 30 SMs on GT200; 14-16 SMs on Fermi, depending on the card
• For example, GTX 480: 15 SMs x 32 cores = 480 cores on a GPU
• GDDR global memory (on device): 512 MiB - 6 GiB
How To Program For GPUs
• Parallelization: decomposition into threads
• Memory: shared memory, global memory
Important Things To Keep In Mind
• Avoid divergent branches
  – threads of a single SM must be executing the same code
  – code that branches heavily and unpredictably will execute slowly
• Threads should be as independent as possible
  – synchronization and communication can be done efficiently only for threads of a single multiprocessor
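To make the divergence point concrete, a hypothetical fragment (illustrative only, not from the slides):

__global__ void divergent(float* v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Divergent: odd and even threads of the same warp take different
    // branches, so the hardware runs both paths one after the other.
    if (i % 2 == 0) v[i] += 1.0f;
    else            v[i] -= 1.0f;
}

__global__ void uniform(float* v, int flag)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Uniform: every thread takes the same branch, so there is no divergence.
    if (flag) v[i] += 1.0f;
    else      v[i] -= 1.0f;
}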
How To Program For GPUs
• Parallelization: decomposition into threads
• Memory: shared memory, global memory
• Enormous processing power: avoid divergence
• Thread communication: synchronization, no interdependencies
Programming model
Thread blocks
• Threads are grouped into thread blocks
  – 128, 192 or 256 threads in a block
• One thread block executes on one SM
  – all of its threads share the SM's 'shared memory'
  – 32 threads are executed simultaneously (a 'warp')
[Diagram: BLOCK 1 shown as a 2D array of threads, THREAD(0,0) through THREAD(1,2)]
Thread blocks
• Blocks execute on SMs
  – execute in parallel
  – execute independently!
• Blocks form a grid
• Thread ID: unique within a block
• Block ID: unique within the grid
[Diagram: a grid of nine blocks, BLOCK 0 through BLOCK 8, each a 2D array of threads]
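A minimal host-side sketch (with illustrative names) of choosing a launch configuration:

dim3 block(256);                          // 256 threads per thread block
dim3 grid((n + block.x - 1) / block.x);   // enough blocks to cover all n elements
// later passed to a kernel launch: myKernel<<<grid, block>>>(...);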
Code that executes on GPU: Kernels
• Kernel
  – a simple C function
  – executes on the GPU
  – executes in parallel, as many times as there are threads
• The keyword __global__ tells the compiler to make a function a kernel (and compile it for the GPU, instead of the CPU)
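A minimal sketch of a kernel, with hypothetical names (the real convolution example follows below):

// Each launched thread runs this body exactly once.
__global__ void addOne(float* v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)
        v[i] += 1.0f;
}

// Host-side launch: 30 blocks of 128 threads = 3840 threads.
// addOne<<<30, 128>>>(vGPU, n);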
Convolution
To get one pixel of the output image:
• multiply the mask (pixelwise) with the image at the corresponding position
• sum the products
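Written out as a formula (matching the kernel below, which applies the filter without flipping it):

out(x, y) = sum over fy in [0, filtH) and fx in [0, filtW) of img(x + fx, y + fy) * filt(fx, fy)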
Kernel: example code, part 1

__global__ void Convolve(float* img, int imgW, int imgH,
                         float* filt, int filtW, int filtH,
                         float* out)
{
    const int nThreads = blockDim.x * gridDim.x;
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;

    const int outW = imgW - filtW + 1;
    const int outH = imgH - filtH + 1;
    const int nPixels = outW * outH;

    // The CPU version would loop over all output pixels:
    //   for (int y = 0; y < outH; y++)
    //     for (int x = 0; x < outW; x++) { ... }
    // Here each thread starts at its own pixel and strides by the
    // total thread count until all pixels are covered.
    for (int curPixel = idx; curPixel < nPixels; curPixel += nThreads)
    {
        int x = curPixel % outW;
        int y = curPixel / outW;

        float sum = 0;
        for (int filtY = 0; filtY < filtH; filtY++)
            for (int filtX = 0; filtX < filtW; filtX++)
            {
                int sx = x + filtX;
                int sy = y + filtY;
                sum += img[sy * imgW + sx] * filt[filtY * filtW + filtX];
            }
        out[y * outW + x] = sum;
    }
}
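The outer loop is what is now commonly called a grid-stride loop: it decouples the number of launched threads from the problem size, so the same kernel handles any image size with any grid and block configuration.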
Setup and data transfer
• The GPU is the 'device', the CPU is the 'host'
• cudaMalloc: allocate memory on the GPU (global memory)
• cudaMemcpy: transfer data to and from the GPU (global memory)
• Kernel call syntax: Kernel<<<grid, block>>>(arguments)
Example setup and data transfer, part 1
int main()
{
    ...
    float* img ...
    int imgW, imgH ...

    float* imgGPU;
    cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float));
    cudaMemcpy(imgGPU,                        // destination
               img,                           // source
               imgW * imgH * sizeof(float),   // size in bytes
               cudaMemcpyHostToDevice);       // direction

    float* filter ...
    int filterW, filterH ...

    float* filterGPU;
    cudaMalloc((void**)&filterGPU, filterW * filterH * sizeof(float));
    cudaMemcpy(filterGPU,                           // destination
               filter,                              // source
               filterW * filterH * sizeof(float),   // size in bytes
               cudaMemcpyHostToDevice);             // direction
Example setup and data transfer, part 2
    int resultW = imgW - filterW + 1;
    int resultH = imgH - filterH + 1;
    float* result = (float*)malloc(resultW * resultH * sizeof(float));
    float* resultGPU;
    cudaMalloc((void**)&resultGPU, resultW * resultH * sizeof(float));

    /* Call the GPU kernel */
    dim3 block(128);
    dim3 grid(30);
    Convolve<<<grid, block>>>(imgGPU, imgW, imgH,
                              filterGPU, filterW, filterH,
                              resultGPU);

    cudaMemcpy(result,                              // destination
               resultGPU,                           // source
               resultW * resultH * sizeof(float),   // size in bytes
               cudaMemcpyDeviceToHost);             // direction

    cudaThreadExit();
    ...
}
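The example omits error checking; a minimal sketch of what production code would add after each CUDA call (cudaGetLastError and cudaGetErrorString are standard runtime API):

cudaError_t err = cudaGetLastError();    // e.g., right after the kernel launch
if (err != cudaSuccess)
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));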
Speedup
• Task: linear combination of 3 filters sized 15x15; image size: 2k x 2k
• CPU: Core 2 @ 2.0 GHz (1 core): 6.58 s (0.89 Mpixels/s)
• GPU: Tesla S1070 (GT200; 30 SMs, 240 CUDA cores, 1.3 GHz): 0.21 s (27.99 Mpixels/s)
• 31 times faster!
CUDA capabilities
• 1.0: GeForce 8800 Ultra/GTX/GTS
• 1.1: GeForce 9800 GT/GTX, GTS 250 (+ atomic instructions, …)
• 1.2: GeForce GT 220
• 1.3: Tesla S1070, C1060; GeForce GTX 275, 285 (+ double precision (slow), …)
• 2.0: Tesla C2050; GeForce GTX 480, 470 (+ ECC, L1 and L2 cache, faster IMUL, faster atomics, faster double precision on Tesla cards, …)
CUDA essentials
developer.nvidia.com/object/cuda_3_1_downloads.html
• Download:
  – Driver
  – Toolkit (the nvcc compiler)
  – SDK (examples) (recommended)
  – CUDA Programming Guide
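For reference (not on the slide): a .cu source file is compiled with the toolkit's nvcc, e.g. nvcc convolve.cu -o convolve (the file name here is hypothetical).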
Other tools
• 'Emulator': executes on the CPU; slow
• Simple profiler
• cuda-gdb (Linux): on-device debugger
• Parallel Nsight (Vista): on-device debugger
...
Logical thread hierarchy
• Thread ID: unique within a block
• Block ID: unique within the grid
• To get a globally unique thread ID, combine the block ID and the thread ID (see the sketch below)
• Threads can access both shared and global memory
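A minimal sketch (with hypothetical names) of combining the two IDs, and of touching both memory spaces:

__global__ void whoAmI(int* out)
{
    // threadIdx.x is unique within the block; blockIdx.x is unique
    // within the grid. Combined, they give a globally unique ID:
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ int scratch[128];        // shared memory, per block (assumes <=128 threads/block)
    scratch[threadIdx.x] = gid;
    __syncthreads();                    // synchronize the block's threads

    out[gid] = scratch[threadIdx.x];    // global memory, visible to all threads
}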