CUDA by Examplemartinsa.at.ifi.uio.no/files/cuda_by_example.pdf · Three problems for serial computing: The power wall The von Neumann bottleneck The Instruction Level Parallelism

CUDA by ExampleThe University of Mississippi

Computer Science Seminar Series

[email protected]

SINTEF ICT

Department of Applied Mathematics

April 28, 2010

The GPU cudaPI CUDA MJPEG-encoding Summary

Outline

1 The GPU

2 cudaPI

3 CUDA MJPEG-encoding

4 Summary

Applied Mathematics 2/50


Moore's Law

Moore's Law states that �thenumber of transistors that can be

placed inexpensively on an

integrated circuit has doubled

approximately every two years�.



Why GPUs?

Three problems for serial computing:

The power wall

The von Neumann bottleneck

The Instruction Level Parallelism wall

We need more �ops, and serial computing is no longer capable ofdelievering it. The frequency of new processors is not increasinganymore.One possible solution: Compute on multiple cores!



GPU history - 1981



GPU history - 1992



GPU history - 2001



GPU history - 2009



The GPU today

The graphics card

GPU - massively parallel processor(480 cores)

Memory: 1.5 GB (177.4 GB/sec)

Connected to the host through aPCIe x 16 Gen2 bus (8 GB/s)

Processor clock @ 1401 MHzFigure: The NVIDIAGeForce GTX 480 card.



The GPU vs. the CPU

A lot more transistors for �oating point operations!

But: Algorithms cannot simply be �translated� 1-to-1 fromserial CPU-code to GPU-code.

You have to parallelize the whole or parts of your algorithm toachieve e�ective GPU-code.



A heterogeneous architecture

On a heterogeneous architecture several types of processorswork together, each one solving the task for which it is bestsuited, e.g., the CPU in cooperation with the GPU.

The CPU is running the show, and the GPU launches programson many cores in parallel, as requested by the CPU-program.

Do data parallel processing on the GPU, and high-level logicon the CPU.



CUDA - Compute Uni�ed Device Architecture

Small set of extensions to the C language

Made to expose the GPU to the programmer, and allow easyaccess to the computational resources of the GPU

Does support some C++-features, like templates, and moreare being added

A Fortran compiler is also available, developed by NVIDIA incooperation with The Portland Group



The CUDA tools

What you need to get started:

1 CUDA driver (device driver for the graphics card)

2 CUDA toolkit (CUDA compiler, runtime library etc.)

3 CUDA SDK (software development kit, with code examples)

All this can be found at http://nvidia.com/cuda


http://nvidia.com/cuda


OpenCL

Open Computing Language

Initially developed by Apple

Has a lot in common with CUDA

Released in December 2008 (version 1.0)



Our �rst example

A HelloWorld-type of application using CUDA.

Calculate decimals of π by using geometric observations.

This is not the best way to calculate π, but is serves us goodas an example of parallel programming with CUDA.



p =πr2

4r2(1)

π = 4p (2)

We can calculate the ratio p to get an estimate of π:

1 By placing n dots within the square, some fraction k of thesedots will lie inside the circle.

2 This gives us p ≈ k

nand π ≈ 4k

n.



The CUDA grid



The CUDA grid

We set our block size to16x16.

The block size is a di�cultparameter to set, and it canhave a big impact onperformance.



The CUDA grid

Then we set our grid to be8x8.

This gives us (16 ∗ 8) x(16 ∗ 8) = 16384 threadstotally.

If we use a larger grid weget a �ner resultion, and abetter estimate of π.



The thread workload

We could let eachthread represent onepixel, but this wouldnot give each threadmuch of a workload.

Therefore, we leteach thread calculateseveral points.



Setup on the CPU (host)

The dim3-variables are used later when we call ourCUDA-program.

const int width = 128;

const int height = 128;

// set the block size

const dim3 blockSize (16, 16);

// determine the size of the grid

const size_t gridWidth = width/blockSize.x;

const size_t gridHeight = height/blockSize.y;

const dim3 gridSize(gridWidth , gridHeight );

// number of points per thread (=m^2)

const int m = 10;



Allocating device memory

First we allocate memory on the device (the GPU), which wewill use to save our intermediate results to.

cudaMallocPitch allocates width*height*sizeof(float)bytes of linear memory on the device.

Ensures e�cient memory access, when address is updatedfrom row to row.

float* d_results;

size_t d_pitch;

// allocate device memory

CUDA_SAFE_CALL(cudaMallocPitch ((void **)& d_results ,

&d_pitch ,

width * sizeof(float),

height ));



Allocating host memory

We also need to allocate memory on the host, so we havesomewhere to download our intermediate results to.

// allocate host memory

float* h_results = new float[width*height ];



The CUDA kernel

The __global__-keyword de�nes a CUDA kernel.

__global__ void cudaPIkernel(float* results , size_t pitch) {

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

float dx = 1.0f / (float) (gridDim.x * blockDim.x);

float dy = 1.0f / (float) (gridDim.y * blockDim.y);

int inside = 0;

...



The CUDA kernel cont'd

Calculate whether the points (x , y) are inside our outside theunit circle.

...

for (int l = 0; l < m; ++l) {

float y = j / (float) (gridDim.y * blockDim.y);

y += dy * (l/( float)m);

for (int k = 0; k < m; ++k) {

float x = i / (float) (gridDim.x * blockDim.x);

x += dx * (k/( float)m);

if (x * x + y * y < 1.0f) {

// point is inside circle!

++ inside;

}

}

}

float* elementPtr = (float *) ((char*) results + j * pitch);

elementPtr[i] = inside;

}



Calling the CUDA kernel

Now we know enough to write our call to the CUDA-kernel.

// run kernel

cudaPIkernel <<<gridSize , blockSize >>>(d_results , d_pitch );

CUT_CHECK_ERROR("cudaPIkernel");



Device to host copy

Now we copy the intermediate results from the GPU to theCPU.

// fetch results from device

CUDA_SAFE_CALL(cudaMemcpy2D(h_results , width * sizeof(float),

d_results , d_pitch ,


height ,

cudaMemcpyDeviceToHost ));



The host code

Lastly, we calculate π from the intermediate results we fetchedfrom the device.

// sum up results from each CUDA -thread

float k = 0.0;

for(int i = 0; i < width * height; ++i) {

k += h_results[i];

}

// calculate final result

int n = (width * m) * (height * m);

float pi = 4 * ((float) k / n);

cout << "PI is (approximately ): " << pi << endl;



Optimization: Allocating host memory using theCUDA API

We should also allocate memory on the host using the CUDAAPI.

This ensures fast memory transfers between the device and thehost.

cudaMallocHost allocates page-locked memory that isaccessible to the device.

float* h_results;

// allocate host memory

CUDA_SAFE_CALL(cudaMallocHost ((void **)& h_results ,

width * height * sizeof(float )));



Demo



JPEG

A lossy compression method for photographic images



Motion-JPEG

A videoformat where each frame is individually compressed asa JPEG image

Uses no inter-frame compression (motion compensation)

Used by many portable devices, e.g., webcams and otherdevices that stream video



CPU-version of the algorithm

We already have a CPU-version of the algorithm for M-JPEGencoding.

It is often the case that you want to accelerate some alreadyexisting algorithm.

Understanding the algorithm in detail is important.



M-JPEG encoding

Split up image intomacroblocks

Zig−zag image, truncateand do Huffman

RGB −> YUV

DCT

Quantize

Split the image into macroblocks of size 8x8.



M-JPEG encoding



RGB −> YUV

DCT

Quantize

Subtract 128 to shift the values so that theyare centered around zero.

Perform a two-dimensional DCT on eachmacroblock, converting them into afrequency-domain representation.



M-JPEG encoding



RGB −> YUV

DCT

Quantize

Reduce the amount of information in the highfrequency components, by simply dividingeach component by a prede�ned constant.

The human eye is not so good atdistinguishing the exact strength of a highfrequency brightness variation.



M-JPEG encoding



RGB −> YUV

DCT

QuantizeGo through the macroblock in azigzag-pattern.

Remove the trailing zeros.

Hu�man-encoding is a lossless compressionalgorithm which further compresses the result.



Mapping an algorithm to the GPU

1 Identify which parts are most computationally demanding (bypro�ling)

2 Investigate if all or some of these parts can be done in aparallel fashion

3 Write CUDA-kernels for these parts, which are called from thehost code

4 Repeat



Pro�ling the CPU-version

By running a pro�ler like gprof (remember to compile yourcode with -gp), we can determine which functions aredominating our programs runtime.

We �nd that the DCT spends well over 90% of the runtime.



Overview

CPU GPU

Split up image into

macroblocks

Zigzag, truncate

and Huffman

Read frame from file

DCT

Quantize

Write to file

Figure: Overview CPU/GPU.



DCT on the CPU

Let us take a look at the CPU-version's DCT.

For each macroblock, the following code is executed:

for(int u = 0; u < 8; ++u) {

for(int v = 0; v < 8; ++v) {

float dct = 0;

for(int j = 0; j < 8; ++j) {

for(int i = 0; i < 8; ++i) {

float coeff =

in_data [(y+j)*width+(x+i)] - 128.0f;

dct += coeff

* (float) (cos ((2*i+1)*u*PI /16.0f)

* cos ((2*j+1)*v*PI/16.0f));

}

}

float a1 = !u ? ISQRT2 : 1.0f;

float a2 = !v ? ISQRT2 : 1.0f;

/* Scale according to normalizing function */

dct *= a1*a2/4.0f;

...Applied Mathematics 38/50


Quantization on the CPU

...and the quantization:

...

/* Quantize */

out_data [(y+v)*width+(x+u)] =

(int16_t )(floor (0.5f + dct /

(float )( quantization[v*8+u])));

}

}



The GPU compute gridVIDEO FRAME

MACROBLOCK

GPU COMPUTE GRID

GPU COMPUTE BLOCK

PIXEL

THREAD

Figure: One videoframe and the GPU compute grid.



Quantization matrices

We store the three quantization matrices (for Y, U and V) inconstant memory on the device.

/**

* Quantization matrices

*/

__constant__ float yquanttbl [] =

{

16, 11, 12, 14, 12, 10, 16, 14,

13, 14, 18, 17, 16, 19, 24, 40,

26, 24, 22, 22, 24, 49, 35, 37,

29, 40, 58, 51, 61, 30, 57, 51,

56, 55, 64, 72, 92, 78, 64, 68,

87, 69, 55, 56, 80, 109, 81, 87,

95, 98, 103, 104, 103, 62, 77, 113,

121, 112, 100, 120, 92, 101, 103, 99

};

...



The CUDA-kernelReads one macroblock into shared memory.__syncthreads() is a barrier, which make threads wait untilall threads have reached this point in the kernel.

__global__ void cudaDct(float *output , size_t output_pitch

float *input , size_t input_pitch) {

const int bx = blockIdx.x;

const int by = blockIdx.y;

const int tx = threadIdx.x;

const int ty = threadIdx.y;

const int x = (bx * 8) + tx;

const int y = (by * 8) + ty;

__shared__ float macroblock [8][8];

float* inputPtr = (float *)(( char*) input + y * input_pitch );

macroblock[ty][tx] = inputPtr[x];

__syncthreads ();

... Applied Mathematics 42/50


The CUDA-kernel cont'd

__cosf is an intrinsic function.

...

float dct = 0;

for (int j = 0; j < 8; ++j) {

for (int i = 0; i < 8; ++i) {

float coeff = macroblock[j][i] - 128.0f;

dct += coeff * __cosf ((2 * i + 1) * tx * PI / 16.0f)

* __cosf ((2 * j + 1) * ty * PI / 16.0f);

}

}

float a1 = !tx ? ISQRT2 : 1.0f;

float a2 = !ty ? ISQRT2 : 1.0f;

/* Scale according to normalizing function */

dct *= a1 * a2 / 4.0f;

/* Quantize */

float* inputPtr = (float *)(( char*) output + y * output_pitch );

output[x] = dct / yquanttbl[ty * 8 + tx];

}



Moving data between host and device

Using cudaMemcpy2D, similar to the cudaPI-example:

// copy image to device

CUDA_SAFE_CALL(cudaMemcpy2D(h_in_data , h_pitch_in ,

d_in_data , d_pitch_in ,


height ,

cudaMemcpyHostToDevice ));

// launch kernel

cudaDct <<<gridSize , blockSize >>>(d_out_data , d_pitch_out

d_in_data , d_pitch_in );

CUT_CHECK_ERROR("cudaDct");

// copy image back to host

CUDA_SAFE_CALL(cudaMemcpy2D(d_out_data , d_pitch_out ,

h_out_data , h_pitch_out ,


height ,

cudaMemcpyDeviceToHost ));



Optimizations

Run kernel on four consecutive macroblocks to achievecoalescing.



Optimizations cont'd

Threaded I/O on host

Double-bu�ering input and output bu�ers on the CPU

Makes it possible to do �le IO and GPU memory transferssimultaneously

Can be implemented using threading on the CPU, e.g.,boost::thread

Could also run the Hu�man-encoding in a separate thread,since this is a compute intensive algorithm



Optimizations cont'd

Streams

Utilizing CUDA streams

Uses asynchronous memory transfers

Allows overlapping of uploading, kernel invocation, anddownloading



Is GPU-computing worth the e�ort?

It takes time to learn a new technology and a new platform.

It will take time to rewrite your existing algorithms.

But: You will get signi�cant speed ups (5-100X) of algorithmsthat bene�t from parallelism.

Once you have your parallel algorithms, they will scale onnewer generations of parallel hardware.



CUDA Documentation

Programming Guide

Reference Manual

Best Practices Guide

Pay special attention to the �Performance Guidelines�-sectionof the programming guide to get the most performance out ofyour CUDA-apps

http://gpgpu.org


http://gpgpu.org


Good luck with your CUDA-programming, and thank you :-)Contact: [email protected]://martinsa.at.i�.uio.no


Documents

CUDA by Examplemartinsa.at.ifi.uio.no/files/cuda_by_example.pdf · Three problems for serial computing: The power wall The von Neumann bottleneck The Instruction Level Parallelism