OpenCL Image Convolution Filter - Box Filter

Heterogeneous Parallel Programming - Image Convolution Filters (Box Filter)

Pi19404

January 28, 2013

Contents

OpenCL Parallel Programming for Image Convolution

0.1 Abstract
0.2 2D Convolution
0.3 Naive 2D convolution
0.4 Optimization method 1 2D convolution
    0.4.1 Using Local Memory
    0.4.2 Using Ternary Conditional Operator
    0.4.3 Unrolling For Loops
    0.4.4 Read Only Memory and Constant Variables
    0.4.5 Performance Comparison
0.5 Code


OpenCL Parallel Programming for Image Convolution

0.1 Abstract

Data parallelism is one way to achieve parallelism, wherein data is distributed across multiple computation units. In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved when each processor performs the same task on different pieces of the distributed data.

Image convolution is a neighborhood operation. The value of a pixel is computed as a weighted linear combination of its neighboring pixels. The set of operations to be performed for convolution is the same for all pixels. Thus data parallelism can be achieved for image convolution by assigning each pixel to a computation unit, with every computation unit performing the same task.

OpenCL™ is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), DSPs and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus application programming interfaces (APIs) that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism.

In the present document we do not describe the details of the OpenCL APIs, but rather how OpenCL can be used efficiently to accomplish the desired task.

We will look at image convolution using 2D kernels and separable kernels, and compare their performance with that of standard CPU algorithms.



0.2 2D Convolution

A 2D convolution is a neighborhood operation. The value of a pixel in the output matrix is a weighted linear sum of pixels in the input matrix. The weight map used in the summation is defined by the convolution kernel.

2D convolution can be viewed as the output of a discrete-time LTI system whose impulse response is defined by the convolution kernel. The value of a pixel at the system output is the linear sum of the pixels in the neighborhood of the corresponding pixel in the system input, weighted by the convolution kernel.

The convolution kernel determines the neighborhood size and the weight map. Different weight maps correspond to different types of filtering operations.

A box kernel or a Gaussian kernel defines a low-pass filter LTI system, while the Sobel kernel defines a high-pass filter LTI system.
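One way to see the low-pass/high-pass distinction is to look at the sum of the kernel weights, which is the filter's response to a constant image. The C sketch below assumes the usual 3x3 box and horizontal Sobel kernels; the names `box3`, `sobel_x3` and `kernel_sum` are chosen for illustration and do not come from the original code.

```c
#include <assert.h>
#include <stddef.h>

/* 3x3 box (averaging) kernel: all weights equal, sum is 1 (low pass). */
static const float box3[9] = {
    1.0f / 9, 1.0f / 9, 1.0f / 9,
    1.0f / 9, 1.0f / 9, 1.0f / 9,
    1.0f / 9, 1.0f / 9, 1.0f / 9
};

/* 3x3 horizontal Sobel kernel: weights sum to 0 (high pass). */
static const float sobel_x3[9] = {
    -1.0f, 0.0f, 1.0f,
    -2.0f, 0.0f, 2.0f,
    -1.0f, 0.0f, 1.0f
};

/* Sum of the weights = response of the filter to a constant image of 1s. */
static float kernel_sum(const float *k, size_t n)
{
    assert(k != NULL);
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += k[i];
    return s;
}
```

A flat region passes through the box kernel unchanged (weight sum 1) and is removed entirely by the Sobel kernel (weight sum 0).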

0.3 Naive 2D convolution

A 320x240 image is divided into 16x16 blocks. Each thread is configured to compute the value of one output pixel. Thus the total number of threads is equal to the total number of pixels in the image.

The expression for 2D convolution is given below, where P is the input image, K is the kernel, O is the output image, (i, j) is the pixel location and R is the kernel radius, so that the kernel has (2R+1) x (2R+1) entries.

\[
O[i, j] = \sum_{k=-R}^{R} \sum_{l=-R}^{R} P[i + k, j + l] \, K[k, l] \tag{1}
\]

Pixels at the image borders require pixel indices that lie outside the image, so the values of such pixels must be extrapolated. Different methods can be used for this. One simple method is to set the pixel value to zero or to a constant. Another is to replicate the border pixels into the region outside the image border. In the present approach we set these pixel values to 0.
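As a reference point, equation (1) with zero padding can be sketched in plain C as follows. This is a sequential illustration, not the author's kernel: the function names are hypothetical, and it assumes that i indexes columns, j indexes rows, and that the image and kernel are stored in row-major order.

```c
#include <assert.h>

/* Read a pixel, returning 0 for coordinates outside the image (zero padding). */
static float pixel_or_zero(const float *img, int width, int height, int x, int y)
{
    if (x < 0 || x >= width || y < 0 || y >= height)
        return 0.0f;
    return img[y * width + x];
}

/* Naive 2D convolution following equation (1): one output pixel per (i, j).
 * kernel holds (2R+1) x (2R+1) weights, with K[k, l] stored at index
 * (k + R) * (2R + 1) + (l + R). */
static void convolve2d_naive(const float *in, float *out, int width, int height,
                             const float *kernel, int R)
{
    assert(R >= 0);
    const int ksize = 2 * R + 1;
    for (int j = 0; j < height; ++j) {      /* on the device: one thread per (i, j) */
        for (int i = 0; i < width; ++i) {
            float sum = 0.0f;
            for (int k = -R; k <= R; ++k)
                for (int l = -R; l <= R; ++l)
                    sum += pixel_or_zero(in, width, height, i + k, j + l)
                         * kernel[(k + R) * ksize + (l + R)];
            out[j * width + i] = sum;
        }
    }
}
```

In a data-parallel version the two outer loops disappear, because each thread handles exactly one (i, j) pair.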

The data for the input image, output image and kernel are stored in device global memory. Thus each thread accesses its data from global device memory, and the same pixels in global memory are accessed multiple times by different threads.

Each thread implements the above expression to compute the value of the output pixel O[i, j], with the pixel location expressed in terms of the local block/work-group indices, local thread ids and global thread ids.
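The relation between these indices can be sketched in C as below. The helper names are hypothetical; the arithmetic mirrors the OpenCL identity get_global_id(d) = get_group_id(d) * get_local_size(d) + get_local_id(d), which holds when the global offset is zero, and a 16x16 work-group is assumed as in the text.

```c
#include <assert.h>

#define BLOCK_SIZE 16   /* work-group edge length used in the text */

/* Number of work-groups needed to cover `extent` pixels along one dimension. */
static int num_groups(int extent)
{
    assert(extent > 0);
    return (extent + BLOCK_SIZE - 1) / BLOCK_SIZE;   /* round up */
}

/* Global coordinate of a thread, from its work-group index and local id. */
static int global_index(int group_id, int local_id)
{
    assert(local_id >= 0 && local_id < BLOCK_SIZE);
    return group_id * BLOCK_SIZE + local_id;
}

/* Row-major offset of pixel (x, y) in an image of the given width. */
static int pixel_offset(int x, int y, int width)
{
    return y * width + x;
}
```

For the 320x240 image this gives 20 x 15 = 300 work-groups of 256 threads each, or 76800 threads, one per pixel.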

The naive parallel version is compared with the host CPU version of the code.

0.4 Optimization method 1 2D convolution

0.4.1 Using Local Memory

In the earlier method the data in global memory is accessed multiple times by different threads. As global memory reads are costly, these repeated reads may cause the parallel algorithm to take longer to execute than the host CPU algorithm.

The efficiency of the parallel algorithm can be improved by optimizing access to global memory. Each thread block/work-group is allocated a fixed amount of local memory.

This local memory is accessible to all the threads in the thread block/work-group. The data required by all the threads of the thread block is loaded from global memory into local memory. Thus each thread block loads a sub-image from global memory.

Thus each pixel in global memory is accessed only once, except for the pixels at the borders of the sub-images, which are also accessed by adjacent thread blocks.

The program is executed in two parts. In the first part all the threads load the data from global to local memory.

Once the data is loaded into local memory, all the threads in the thread block perform the convolution operations on the sub-image held in local memory.



Thus the performance increase is obtained both by reducing the number of loads from global memory and by the fact that the convolution computation uses local memory, which is faster to access than global memory.
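The two-part scheme can be sketched in plain C by simulating one work-group at a time. This is an illustration under stated assumptions rather than the author's kernel: the function names are hypothetical, an ordinary array stands in for local memory, work-groups are 16x16, borders are zero padded, and i indexes columns while j indexes rows. The functions also count how many input reads go to global memory.

```c
#include <assert.h>

#define TILE 16          /* work-group edge length, as in the text */
#define MAX_R 4          /* largest kernel radius this sketch supports */
#define TILE_MAX (TILE + 2 * MAX_R)

/* Process the work-group at (gx, gy). Returns the number of global-memory
 * reads of the input image made for this group. */
static long convolve_group(const float *in, float *out, int width, int height,
                           const float *kernel, int R, int gx, int gy)
{
    assert(R >= 0 && R <= MAX_R);
    float tile[TILE_MAX][TILE_MAX];      /* stands in for local memory */
    const int span = TILE + 2 * R;       /* tile edge including the halo */
    const int x0 = gx * TILE - R;        /* image coordinate of tile[0][0] */
    const int y0 = gy * TILE - R;
    long reads = 0;

    /* Part 1: load the sub-image plus halo; outside pixels become 0. */
    for (int ty = 0; ty < span; ++ty) {
        for (int tx = 0; tx < span; ++tx) {
            const int x = x0 + tx, y = y0 + ty;
            const int inside = (x >= 0 && x < width && y >= 0 && y < height);
            tile[ty][tx] = inside ? in[y * width + x] : 0.0f;
            reads += inside;
        }
    }
    /* On the device, a barrier(CLK_LOCAL_MEM_FENCE) separates the two parts. */

    /* Part 2: every thread of the group convolves using only the tile. */
    const int ksize = 2 * R + 1;
    for (int ly = 0; ly < TILE; ++ly) {
        for (int lx = 0; lx < TILE; ++lx) {
            const int x = gx * TILE + lx, y = gy * TILE + ly;
            if (x >= width || y >= height)
                continue;                /* thread falls outside the image */
            float sum = 0.0f;
            for (int k = -R; k <= R; ++k)
                for (int l = -R; l <= R; ++l)
                    sum += tile[ly + R + l][lx + R + k]
                         * kernel[(k + R) * ksize + (l + R)];
            out[y * width + x] = sum;
        }
    }
    return reads;
}

/* Run all work-groups one after another; returns total global input reads. */
static long convolve2d_tiled(const float *in, float *out, int width, int height,
                             const float *kernel, int R)
{
    long reads = 0;
    for (int gy = 0; gy * TILE < height; ++gy)
        for (int gx = 0; gx * TILE < width; ++gx)
            reads += convolve_group(in, out, width, height, kernel, R, gx, gy);
    return reads;
}
```

For the 320x240 image and a 3x3 kernel (R = 1), the naive scheme issues up to 320 x 240 x 9 = 691200 global reads of the input, while the tiled scheme issues at most 300 x 18 x 18 = 97200, since each of the 300 work-groups loads an 18x18 tile once.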

0.4.2 Using Ternary Conditional Operator

Performance can also be improved by replacing an if-else block with the ternary conditional operator. An if-else block takes more than two instructions, while on some devices a ternary conditional operation executes in a single instruction cycle.
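A typical place for this substitution in a convolution kernel is the border test. The sketch below shows both forms in C with hypothetical function names; whether the ternary form actually compiles to fewer instructions depends on the device and compiler, so it is worth measuring.

```c
#include <assert.h>

/* Border test written as an if-else block. */
static float fetch_if_else(const float *img, int width, int height, int x, int y)
{
    float v;
    if (x >= 0 && x < width && y >= 0 && y < height) {
        v = img[y * width + x];
    } else {
        v = 0.0f;
    }
    return v;
}

/* The same test written with the ternary conditional operator. */
static float fetch_ternary(const float *img, int width, int height, int x, int y)
{
    assert(img != 0);
    const int inside = (x >= 0 && x < width && y >= 0 && y < height);
    return inside ? img[y * width + x] : 0.0f;
}
```

Both functions return the same value for every coordinate; only the form of the branch differs.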

0.4.3 Unrolling For Loops

For loops are expensive operations. If the trip counts of the for loops are known at compile time, the loops can be unrolled. With some compilers the for loops are unrolled automatically when the compiler is given a hint. If the size of the loop is not known, the loop can still be partially unrolled.

To take full advantage of unrolling, the parameters used in the for loops can be passed as preprocessor definitions at compile time rather than as kernel arguments. However, not all devices support unrolling, in which case the for loop must be replaced manually with the equivalent sequence of statements; in some cases this gives a slight improvement over the for loop.
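A minimal C sketch of both ideas follows. The macro `KERNEL_RADIUS` and the function names are hypothetical; with OpenCL such a macro could be supplied through the program build options (for example `-D KERNEL_RADIUS=1`) instead of being hard-coded.

```c
#include <assert.h>

/* Radius fixed at compile time rather than passed as an argument. */
#ifndef KERNEL_RADIUS
#define KERNEL_RADIUS 1
#endif
#define KERNEL_SIZE (2 * KERNEL_RADIUS + 1)

/* Loop form: the trip counts are compile-time constants, so a compiler is
 * free to unroll both loops. p points at the top-left pixel of the window. */
static float window_sum_loop(const float *p, int stride, const float *kernel)
{
    assert(stride >= KERNEL_SIZE);
    float sum = 0.0f;
    for (int r = 0; r < KERNEL_SIZE; ++r)
        for (int c = 0; c < KERNEL_SIZE; ++c)
            sum += p[r * stride + c] * kernel[r * KERNEL_SIZE + c];
    return sum;
}

/* Manually unrolled equivalent for the 3x3 case (KERNEL_RADIUS == 1). */
static float window_sum_unrolled3(const float *p, int stride, const float *kernel)
{
    const float *r0 = p, *r1 = p + stride, *r2 = p + 2 * stride;
    return r0[0] * kernel[0] + r0[1] * kernel[1] + r0[2] * kernel[2]
         + r1[0] * kernel[3] + r1[1] * kernel[4] + r1[2] * kernel[5]
         + r2[0] * kernel[6] + r2[1] * kernel[7] + r2[2] * kernel[8];
}
```

The unrolled version is tied to one kernel size, which is the price paid for removing the loop counters and branch.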

0.4.4 Read Only Memory and Constant Variables

Read-only memory is faster to access than read-write memory on some devices. Thus memory that does not need to be written to is labelled as read-only memory.

Also, variables that are not going to change during the execution of the code are declared as const. These changes may provide an improvement on some devices.
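In C the same intent is expressed with const-qualified pointers and variables, as in the sketch below. The function `filter_row3` is hypothetical, a 3-tap row filter with zero padding. In OpenCL C the corresponding kernel arguments could be declared with qualifiers such as `__global const float *` for the input and `__constant float *` for the weights.

```c
#include <assert.h>

/* The input row and the weights are only read, so they are pointers to const;
 * out is the only writable buffer. Border pixels use zero padding. */
static void filter_row3(const float *in, const float *weights, float *out,
                        const int width)
{
    assert(width >= 3);
    /* Weights never change during the call, so they are held in const locals. */
    const float w0 = weights[0], w1 = weights[1], w2 = weights[2];

    out[0] = w1 * in[0] + w2 * in[1];
    for (int x = 1; x < width - 1; ++x)
        out[x] = w0 * in[x - 1] + w1 * in[x] + w2 * in[x + 1];
    out[width - 1] = w0 * in[width - 2] + w1 * in[width - 1];
}
```

The qualifiers do not change the result; they give the compiler and the device the information that these buffers are never written.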

0.4.5 Performance Comparison

For small matrices the CPU version is faster; as the size of the matrices increases, the parallel version shows an improvement. The programs were executed on a 4-core device. The box filter shows an improvement of 2X for the optimized version.

The convolution kernel for an NxN box filter is all 1's. The box filter represents an averaging filter.
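In the notation of equation (1), and assuming N = 2R + 1, the box filter can be written as follows. The division by the number of weights is an assumption made here to obtain an average; the text only states that the weights are all 1.

```latex
\[
K[k, l] = 1 \quad \text{for } -R \le k, l \le R,
\qquad
O[i, j] = \frac{1}{(2R + 1)^2} \sum_{k=-R}^{R} \sum_{l=-R}^{R} P[i + k, j + l]
\]
```

For example, for a 3x3 window (R = 1) holding the values 1 to 9, the sum of the pixels is 45 and the normalized output is 45 / 9 = 5, the mean of the window.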

0.5 Code

The code consists of two parts, the host code and the device code. The host-side code uses OpenCV APIs to read the image from a video file and demonstrates calling the kernel code for the box filter, Gaussian filter and Sobel filter in the naive and optimized parallel versions and the host CPU version.

Code is available in the repository https://code.google.com/p/m19404/source/browse/OpenCL-Image-Processing/Convolution/


