Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
CUDA by ExampleThe University of Mississippi
Computer Science Seminar Series
SINTEF ICT
Department of Applied Mathematics
April 28, 2010
The GPU cudaPI CUDA MJPEG-encoding Summary
Outline
1 The GPU
2 cudaPI
3 CUDA MJPEG-encoding
4 Summary
Applied Mathematics 2/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Moore's Law
Moore's Law states that �thenumber of transistors that can be
placed inexpensively on an
integrated circuit has doubled
approximately every two years�.
Applied Mathematics 3/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Why GPUs?
Three problems for serial computing:
The power wall
The von Neumann bottleneck
The Instruction Level Parallelism wall
We need more �ops, and serial computing is no longer capable ofdelievering it. The frequency of new processors is not increasinganymore.One possible solution: Compute on multiple cores!
Applied Mathematics 4/50
The GPU cudaPI CUDA MJPEG-encoding Summary
GPU history - 1981
Applied Mathematics 5/50
The GPU cudaPI CUDA MJPEG-encoding Summary
GPU history - 1992
Applied Mathematics 6/50
The GPU cudaPI CUDA MJPEG-encoding Summary
GPU history - 2001
Applied Mathematics 7/50
The GPU cudaPI CUDA MJPEG-encoding Summary
GPU history - 2009
Applied Mathematics 8/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The GPU today
The graphics card
GPU - massively parallel processor(480 cores)
Memory: 1.5 GB (177.4 GB/sec)
Connected to the host through aPCIe x 16 Gen2 bus (8 GB/s)
Processor clock @ 1401 MHzFigure: The NVIDIAGeForce GTX 480 card.
Applied Mathematics 9/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The GPU vs. the CPU
A lot more transistors for �oating point operations!
But: Algorithms cannot simply be �translated� 1-to-1 fromserial CPU-code to GPU-code.
You have to parallelize the whole or parts of your algorithm toachieve e�ective GPU-code.
Applied Mathematics 10/50
The GPU cudaPI CUDA MJPEG-encoding Summary
A heterogeneous architecture
On a heterogeneous architecture several types of processorswork together, each one solving the task for which it is bestsuited, e.g., the CPU in cooperation with the GPU.
The CPU is running the show, and the GPU launches programson many cores in parallel, as requested by the CPU-program.
Do data parallel processing on the GPU, and high-level logicon the CPU.
Applied Mathematics 11/50
The GPU cudaPI CUDA MJPEG-encoding Summary
CUDA - Compute Uni�ed Device Architecture
Small set of extensions to the C language
Made to expose the GPU to the programmer, and allow easyaccess to the computational resources of the GPU
Does support some C++-features, like templates, and moreare being added
A Fortran compiler is also available, developed by NVIDIA incooperation with The Portland Group
Applied Mathematics 12/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The CUDA tools
What you need to get started:
1 CUDA driver (device driver for the graphics card)
2 CUDA toolkit (CUDA compiler, runtime library etc.)
3 CUDA SDK (software development kit, with code examples)
All this can be found at http://nvidia.com/cuda
Applied Mathematics 13/50
The GPU cudaPI CUDA MJPEG-encoding Summary
OpenCL
Open Computing Language
Initially developed by Apple
Has a lot in common with CUDA
Released in December 2008 (version 1.0)
Applied Mathematics 14/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Our �rst example
A HelloWorld-type of application using CUDA.
Calculate decimals of π by using geometric observations.
This is not the best way to calculate π, but is serves us goodas an example of parallel programming with CUDA.
Applied Mathematics 15/50
The GPU cudaPI CUDA MJPEG-encoding Summary
p =πr2
4r2(1)
π = 4p (2)
We can calculate the ratio p to get an estimate of π:
1 By placing n dots within the square, some fraction k of thesedots will lie inside the circle.
2 This gives us p ≈ k
nand π ≈ 4k
n.
Applied Mathematics 16/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The CUDA grid
Applied Mathematics 17/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The CUDA grid
We set our block size to16x16.
The block size is a di�cultparameter to set, and it canhave a big impact onperformance.
Applied Mathematics 18/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The CUDA grid
Then we set our grid to be8x8.
This gives us (16 ∗ 8) x(16 ∗ 8) = 16384 threadstotally.
If we use a larger grid weget a �ner resultion, and abetter estimate of π.
Applied Mathematics 19/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The thread workload
We could let eachthread represent onepixel, but this wouldnot give each threadmuch of a workload.
Therefore, we leteach thread calculateseveral points.
Applied Mathematics 20/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Setup on the CPU (host)
The dim3-variables are used later when we call ourCUDA-program.
const int width = 128;
const int height = 128;
// set the block size
const dim3 blockSize (16, 16);
// determine the size of the grid
const size_t gridWidth = width/blockSize.x;
const size_t gridHeight = height/blockSize.y;
const dim3 gridSize(gridWidth , gridHeight );
// number of points per thread (=m^2)
const int m = 10;
Applied Mathematics 21/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Allocating device memory
First we allocate memory on the device (the GPU), which wewill use to save our intermediate results to.
cudaMallocPitch allocates width*height*sizeof(float)bytes of linear memory on the device.
Ensures e�cient memory access, when address is updatedfrom row to row.
float* d_results;
size_t d_pitch;
// allocate device memory
CUDA_SAFE_CALL(cudaMallocPitch ((void **)& d_results ,
&d_pitch ,
width * sizeof(float),
height ));
Applied Mathematics 22/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Allocating host memory
We also need to allocate memory on the host, so we havesomewhere to download our intermediate results to.
// allocate host memory
float* h_results = new float[width*height ];
Applied Mathematics 23/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The CUDA kernel
The __global__-keyword de�nes a CUDA kernel.
__global__ void cudaPIkernel(float* results , size_t pitch) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
float dx = 1.0f / (float) (gridDim.x * blockDim.x);
float dy = 1.0f / (float) (gridDim.y * blockDim.y);
int inside = 0;
...
Applied Mathematics 24/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The CUDA kernel cont'd
Calculate whether the points (x , y) are inside our outside theunit circle.
...
for (int l = 0; l < m; ++l) {
float y = j / (float) (gridDim.y * blockDim.y);
y += dy * (l/( float)m);
for (int k = 0; k < m; ++k) {
float x = i / (float) (gridDim.x * blockDim.x);
x += dx * (k/( float)m);
if (x * x + y * y < 1.0f) {
// point is inside circle!
++ inside;
}
}
}
float* elementPtr = (float *) ((char*) results + j * pitch);
elementPtr[i] = inside;
}
Applied Mathematics 25/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Calling the CUDA kernel
Now we know enough to write our call to the CUDA-kernel.
// run kernel
cudaPIkernel <<<gridSize , blockSize >>>(d_results , d_pitch );
CUT_CHECK_ERROR("cudaPIkernel");
Applied Mathematics 26/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Device to host copy
Now we copy the intermediate results from the GPU to theCPU.
// fetch results from device
CUDA_SAFE_CALL(cudaMemcpy2D(h_results , width * sizeof(float),
d_results , d_pitch ,
width * sizeof(float),
height ,
cudaMemcpyDeviceToHost ));
Applied Mathematics 27/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The host code
Lastly, we calculate π from the intermediate results we fetchedfrom the device.
// sum up results from each CUDA -thread
float k = 0.0;
for(int i = 0; i < width * height; ++i) {
k += h_results[i];
}
// calculate final result
int n = (width * m) * (height * m);
float pi = 4 * ((float) k / n);
cout << "PI is (approximately ): " << pi << endl;
Applied Mathematics 28/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Optimization: Allocating host memory using theCUDA API
We should also allocate memory on the host using the CUDAAPI.
This ensures fast memory transfers between the device and thehost.
cudaMallocHost allocates page-locked memory that isaccessible to the device.
float* h_results;
// allocate host memory
CUDA_SAFE_CALL(cudaMallocHost ((void **)& h_results ,
width * height * sizeof(float )));
Applied Mathematics 29/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Demo
Applied Mathematics 30/50
The GPU cudaPI CUDA MJPEG-encoding Summary
JPEG
A lossy compression method for photographic images
Applied Mathematics 31/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Motion-JPEG
A videoformat where each frame is individually compressed asa JPEG image
Uses no inter-frame compression (motion compensation)
Used by many portable devices, e.g., webcams and otherdevices that stream video
Applied Mathematics 32/50
The GPU cudaPI CUDA MJPEG-encoding Summary
CPU-version of the algorithm
We already have a CPU-version of the algorithm for M-JPEGencoding.
It is often the case that you want to accelerate some alreadyexisting algorithm.
Understanding the algorithm in detail is important.
Applied Mathematics 33/50
The GPU cudaPI CUDA MJPEG-encoding Summary
M-JPEG encoding
Split up image intomacroblocks
Zig−zag image, truncateand do Huffman
RGB −> YUV
DCT
Quantize
Split the image into macroblocks of size 8x8.
Applied Mathematics 34/50
The GPU cudaPI CUDA MJPEG-encoding Summary
M-JPEG encoding
Split up image intomacroblocks
Zig−zag image, truncateand do Huffman
RGB −> YUV
DCT
Quantize
Subtract 128 to shift the values so that theyare centered around zero.
Perform a two-dimensional DCT on eachmacroblock, converting them into afrequency-domain representation.
Applied Mathematics 34/50
The GPU cudaPI CUDA MJPEG-encoding Summary
M-JPEG encoding
Split up image intomacroblocks
Zig−zag image, truncateand do Huffman
RGB −> YUV
DCT
Quantize
Reduce the amount of information in the highfrequency components, by simply dividingeach component by a prede�ned constant.
The human eye is not so good atdistinguishing the exact strength of a highfrequency brightness variation.
Applied Mathematics 34/50
The GPU cudaPI CUDA MJPEG-encoding Summary
M-JPEG encoding
Split up image intomacroblocks
Zig−zag image, truncateand do Huffman
RGB −> YUV
DCT
QuantizeGo through the macroblock in azigzag-pattern.
Remove the trailing zeros.
Hu�man-encoding is a lossless compressionalgorithm which further compresses the result.
Applied Mathematics 34/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Mapping an algorithm to the GPU
1 Identify which parts are most computationally demanding (bypro�ling)
2 Investigate if all or some of these parts can be done in aparallel fashion
3 Write CUDA-kernels for these parts, which are called from thehost code
4 Repeat
Applied Mathematics 35/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Pro�ling the CPU-version
By running a pro�ler like gprof (remember to compile yourcode with -gp), we can determine which functions aredominating our programs runtime.
We �nd that the DCT spends well over 90% of the runtime.
Applied Mathematics 36/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Overview
CPU GPU
Split up image into
macroblocks
Zigzag, truncate
and Huffman
Read frame from file
DCT
Quantize
Write to file
Figure: Overview CPU/GPU.
Applied Mathematics 37/50
The GPU cudaPI CUDA MJPEG-encoding Summary
DCT on the CPU
Let us take a look at the CPU-version's DCT.
For each macroblock, the following code is executed:
for(int u = 0; u < 8; ++u) {
for(int v = 0; v < 8; ++v) {
float dct = 0;
for(int j = 0; j < 8; ++j) {
for(int i = 0; i < 8; ++i) {
float coeff =
in_data [(y+j)*width+(x+i)] - 128.0f;
dct += coeff
* (float) (cos ((2*i+1)*u*PI /16.0f)
* cos ((2*j+1)*v*PI/16.0f));
}
}
float a1 = !u ? ISQRT2 : 1.0f;
float a2 = !v ? ISQRT2 : 1.0f;
/* Scale according to normalizing function */
dct *= a1*a2/4.0f;
...Applied Mathematics 38/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Quantization on the CPU
...and the quantization:
...
/* Quantize */
out_data [(y+v)*width+(x+u)] =
(int16_t )(floor (0.5f + dct /
(float )( quantization[v*8+u])));
}
}
Applied Mathematics 39/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The GPU compute gridVIDEO FRAME
MACROBLOCK
GPU COMPUTE GRID
GPU COMPUTE BLOCK
PIXEL
THREAD
Figure: One videoframe and the GPU compute grid.
Applied Mathematics 40/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Quantization matrices
We store the three quantization matrices (for Y, U and V) inconstant memory on the device.
/**
* Quantization matrices
*/
__constant__ float yquanttbl [] =
{
16, 11, 12, 14, 12, 10, 16, 14,
13, 14, 18, 17, 16, 19, 24, 40,
26, 24, 22, 22, 24, 49, 35, 37,
29, 40, 58, 51, 61, 30, 57, 51,
56, 55, 64, 72, 92, 78, 64, 68,
87, 69, 55, 56, 80, 109, 81, 87,
95, 98, 103, 104, 103, 62, 77, 113,
121, 112, 100, 120, 92, 101, 103, 99
};
...
Applied Mathematics 41/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The CUDA-kernelReads one macroblock into shared memory.__syncthreads() is a barrier, which make threads wait untilall threads have reached this point in the kernel.
__global__ void cudaDct(float *output , size_t output_pitch
float *input , size_t input_pitch) {
const int bx = blockIdx.x;
const int by = blockIdx.y;
const int tx = threadIdx.x;
const int ty = threadIdx.y;
const int x = (bx * 8) + tx;
const int y = (by * 8) + ty;
__shared__ float macroblock [8][8];
float* inputPtr = (float *)(( char*) input + y * input_pitch );
macroblock[ty][tx] = inputPtr[x];
__syncthreads ();
... Applied Mathematics 42/50
The GPU cudaPI CUDA MJPEG-encoding Summary
The CUDA-kernel cont'd
__cosf is an intrinsic function.
...
float dct = 0;
for (int j = 0; j < 8; ++j) {
for (int i = 0; i < 8; ++i) {
float coeff = macroblock[j][i] - 128.0f;
dct += coeff * __cosf ((2 * i + 1) * tx * PI / 16.0f)
* __cosf ((2 * j + 1) * ty * PI / 16.0f);
}
}
float a1 = !tx ? ISQRT2 : 1.0f;
float a2 = !ty ? ISQRT2 : 1.0f;
/* Scale according to normalizing function */
dct *= a1 * a2 / 4.0f;
/* Quantize */
float* inputPtr = (float *)(( char*) output + y * output_pitch );
output[x] = dct / yquanttbl[ty * 8 + tx];
}
Applied Mathematics 43/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Moving data between host and device
Using cudaMemcpy2D, similar to the cudaPI-example:
// copy image to device
CUDA_SAFE_CALL(cudaMemcpy2D(h_in_data , h_pitch_in ,
d_in_data , d_pitch_in ,
width * sizeof(float),
height ,
cudaMemcpyHostToDevice ));
// launch kernel
cudaDct <<<gridSize , blockSize >>>(d_out_data , d_pitch_out
d_in_data , d_pitch_in );
CUT_CHECK_ERROR("cudaDct");
// copy image back to host
CUDA_SAFE_CALL(cudaMemcpy2D(d_out_data , d_pitch_out ,
h_out_data , h_pitch_out ,
width * sizeof(float),
height ,
cudaMemcpyDeviceToHost ));
Applied Mathematics 44/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Optimizations
Run kernel on four consecutive macroblocks to achievecoalescing.
Applied Mathematics 45/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Optimizations cont'd
Threaded I/O on host
Double-bu�ering input and output bu�ers on the CPU
Makes it possible to do �le IO and GPU memory transferssimultaneously
Can be implemented using threading on the CPU, e.g.,boost::thread
Could also run the Hu�man-encoding in a separate thread,since this is a compute intensive algorithm
Applied Mathematics 46/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Optimizations cont'd
Streams
Utilizing CUDA streams
Uses asynchronous memory transfers
Allows overlapping of uploading, kernel invocation, anddownloading
Applied Mathematics 47/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Is GPU-computing worth the e�ort?
It takes time to learn a new technology and a new platform.
It will take time to rewrite your existing algorithms.
But: You will get signi�cant speed ups (5-100X) of algorithmsthat bene�t from parallelism.
Once you have your parallel algorithms, they will scale onnewer generations of parallel hardware.
Applied Mathematics 48/50
The GPU cudaPI CUDA MJPEG-encoding Summary
CUDA Documentation
Programming Guide
Reference Manual
Best Practices Guide
Pay special attention to the �Performance Guidelines�-sectionof the programming guide to get the most performance out ofyour CUDA-apps
http://gpgpu.org
Applied Mathematics 49/50
The GPU cudaPI CUDA MJPEG-encoding Summary
Good luck with your CUDA-programming, and thank you :-)Contact: [email protected]://martinsa.at.i�.uio.no
Applied Mathematics 50/50