
Optimizing Image Sharpening Algorithm on GPU

Mengran Fan, Haipeng Jia1, Yunquan Zhang, Xiaojing An, Ting Cao
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

{fanmengran, jiahaipeng, zhangyunquan, anxiaojing, caoting}@ict.ac.cn

1 Corresponding Author

Abstract—Sharpness is an algorithm used to sharpen images. With increasing image sizes and resolutions and growing requirements for real-time processing, the performance of sharpness needs to be improved greatly. The independent per-pixel computation of sharpness offers a good opportunity to use the GPU to accelerate it substantially. However, one challenge in porting it to the GPU is that sharpness consists of several stages, each with its own characteristics and with or without data dependencies on other stages. Based on those characteristics, this paper proposes a complete solution to implement and optimize sharpness on GPU. Our solution includes five major and effective techniques: Data Transfer Optimization, Kernel Fusion, Vectorization for Data Locality, Border Optimization, and Reduction Optimization. Experiments show that, compared to a well-optimized CPU version, our GPU solution reaches a 10.7x to 69.3x speedup for different image sizes on an AMD FirePro W8000 GPU.

Keywords-sharpness; GPU; Kernel Fusion; Data Locality; Vectorization

I. INTRODUCTION

Since the perceptual assessment of images is crucial, sharpness has been regarded as a determining factor when dealing with images. This algorithm is widely used in TVs, cameras, videocassette recorders, and other fields. With increasing image sizes and resolutions and the need for real-time processing, improving the performance of the sharpness algorithm is urgently required in those areas. Fortunately, the sharpness algorithm has good parallelism and can process each pixel of an image independently. This feature is well suited to the characteristics of the GPU, a massively parallel, fine-grained processor. Porting the algorithm to the GPU with effective performance optimization can therefore greatly accelerate it.

However, it is very challenging to implement and optimize the sharpness algorithm on GPU. This is because sharpness is not a single-step algorithm: it consists of multiple steps, some with data dependencies on each other, such as downscale, upscale, Sobel, and reduction. Each of those steps has its own characteristics. In order to achieve good overall performance we must solve three challenging problems: 1) consider the different features and processing granularities of these steps and implement them on GPU in a highly effective way; 2) analyze the dependencies among these steps and reduce the overhead of global synchronization between them; 3) reduce the overhead of data transfers between CPU and GPU, which will largely degrade performance if not well handled.

In order to address these challenges, this paper proposes a complete optimization mechanism that ports the sharpness algorithm to GPU with high performance. This mechanism mainly contains five techniques: Data Transfer Optimization, Kernel Fusion, Vectorization for Data Locality, Border Optimization, and Reduction Optimization. With these optimization methods, we overcome the performance bottlenecks and implement a high-performance OpenCL version of this algorithm. We also demonstrate its performance by comparing it with a well-optimized CPU version. Experimental results show that the speedup reaches 10.7x to 69.3x for different image sizes on an AMD FirePro W8000 GPU.

The key contributions of this paper can be summarized as follows:

1) We provide a complete optimization mechanism for porting the sharpness algorithm to GPU. This mechanism is also applicable to other image processing algorithms with multiple steps.

2) We implement a high-performance OpenCL version of the sharpness algorithm on GPU.

The remainder of this paper is organized as follows. We discuss related work in Section II. Section III describes the sharpness algorithm in detail. After analyzing the parallelism of this algorithm in Section IV, we discuss the optimization of the GPU version in Section V. Section VI presents the experimental results and analysis. Section VII concludes this paper.

II. RELATED WORK

The graphics processing unit (GPU) has become an integral part of today's mainstream computing systems. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth that substantially outpaces its CPU counterpart [1]. We therefore use the GPU to accelerate image processing. CUDA and OpenCL are two different models for GPU programming [2]. The OpenCL standard offers a common API for program execution on systems composed of different types of computational devices such as multicore CPUs, GPUs, or other accelerators [3]. In contrast, CUDA is specific to NVIDIA GPUs. In this paper we use OpenCL, which is portable across computational devices.


Figure 1. The processing of sharpness.

Many image processing algorithms have been ported to GPU. Gu et al. [4] implement deformable image registration (DIR), an automated segmentation method that transfers tumor/organ contours from the planning image to daily images, on GPU. Yang et al. [5] implement several image processing algorithms on GPU, such as histogram equalization, cloud removal, edge detection, and DCT encoding and decoding. Bilgic et al. [6] implement the integral image algorithm on GPU. Moreland et al. [7] implement the fast Fourier transform on GPU, which is essential for many image processing techniques, including filtering, manipulation, correction, and compression. Bernabe et al. [8] implement an automatic target detection and classification algorithm for hyperspectral image analysis on GPU. Vineet et al. [9] implement the max-flow/min-cut algorithm and a graph-cuts algorithm on GPU.

Some of the sub-algorithms that compose the sharpness algorithm have also been ported to GPU. The Sobel operator performs a 2-D spatial gradient measurement on images. Transforming a 2-D pixel array into a statistically uncorrelated data set enhances the removal of redundant data; as a result, the amount of data required to represent a digital image is reduced [10]. Brown et al. [11] implement Sobel and reduction on GPU and propose optimizations for Sobel. In that paper they use shared memory together with a padded original image matrix to improve the performance of Sobel. A thread accessing neighboring pixels from low-latency shared memory is much faster than accessing global memory. Adding padding removes boundary checks from the kernel, which saves processing time because GPUs handle branching poorly. Zhang et al. [12] propose to use vectors to optimize the Sobel kernel, because accessing data through the cache on modern GPUs performs better than shared memory, and the large amount of data reuse in this algorithm allows vectorization to reduce data accesses. In this paper, we also use padding and vectorization to optimize the kernel. In addition, we use 'vload' and adjust the stage at which the padded data is transferred for the Sobel kernel in order to save time, considering the whole algorithm.

Reduction is a crucial bottleneck in many complex algorithms, including sharpness. Many researchers conduct reduction on GPU [13, 14]. However, these approaches mainly use the basic tree-based reduction without much attention to optimization. Nickolls et al. [15] provide two methods to handle the work after the first phase of reduction, namely launching the reduction kernel again and directly using the function 'atomicAdd()' to sum up the results of the first stage, but that article does not address the optimization of the tree-based reduction itself. Harris [16] provides a good method to optimize the tree-based reduction, including adding during the first load and unrolling the last wavefront, but does not describe how to process the results of the first phase of reduction effectively. In this paper, we discuss the optimization of reduction in detail and provide a complete solution for both stages.

It is worth noting that some of the sub-algorithms that compose sharpness, such as downscale, upscale, and overshoot control, have not been implemented or well optimized for GPU.

In summary, although many image processing algorithms have been ported to GPU, including some of the algorithms that compose sharpness, porting the whole sharpness algorithm has not been addressed before.

III. ALGORITHM ANALYSIS

Fig. 1 shows the whole processing flow of the sharpness algorithm. This algorithm includes four steps: downscale the image, upscale the image, difference matrix calculation, and sub-sharpness image.

Figure 2. The process of downscaling an image

The downscale operation aims to reduce the image size. Its input is the original image matrix and its output is a width/4 x height/4 matrix. As Fig. 2 shows, the value of an element in the downscaled matrix is the mean value of the elements in the corresponding 4 x 4 submatrix (marked with the same color in the figure) of the original matrix.
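As an illustration, a minimal OpenCL sketch of this step might look as follows (the kernel and buffer names are illustrative and the image is assumed to be stored as floats, one value per pixel):

    __kernel void downscale(__global const float *src,   /* original matrix           */
                            __global float *dst,         /* width/4 x height/4 result */
                            int srcWidth)
    {
        int x = get_global_id(0);   /* column in the downscaled matrix */
        int y = get_global_id(1);   /* row in the downscaled matrix    */

        /* Average the corresponding 4 x 4 block of the original matrix. */
        float sum = 0.0f;
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                sum += src[(y * 4 + i) * srcWidth + (x * 4 + j)];

        dst[y * (srcWidth / 4) + x] = sum / 16.0f;
    }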

The upscale operation scales the downscaled image back to its original size. The input of this operation is the downscaled matrix, and the output will be called the upscaled matrix. This operation includes two steps: upscale the image border, and then upscale the body.

1) Upscale the border: As Fig. 3 shows, the elements in the first row and column of the downscaled matrix are placed into the first row and column of the upscaled matrix with three vacancies between them. The vacancies, for example the black elements in the figure, are calculated from the adjacent values (the red elements in the figure) multiplied by a predefined parameter matrix. We use this method to calculate all values except the last three elements of the first row and column. Instead, the third-to-last element's value is copied from the fourth-to-last value. After the values of the first row and column are calculated, they are copied directly to the second row and column, respectively.

Using the same method, the penultimate row and column of the upscaled matrix are calculated from the last row and column of the downscaled matrix. Those calculated values are then copied to the last row and column of the upscaled matrix. After all the calculations above, the four elements in the bottom right corner are still missing. The third-to-last value of the last column is copied to those elements to complete the border upscaling.

Figure 3. The process of upscaling the border of an image.

2) Upscale the main body: As Fig. 4 shows, every 4 x 4 submatrix (with a stride of four columns/rows) of the upscaled matrix is calculated from a corresponding 2 x 2 submatrix (with a stride of one column/row) of the downscaled matrix. The calculation process is shown in Fig. 5. A 4 x 4 upscaled submatrix is the product of a predefined 4 x 2 parameter matrix, a 2 x 2 submatrix from the downscaled matrix, and the transpose of the parameter matrix. After all these calculations, we obtain the upscaled matrix.
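A hedged sketch of the body computation is given below; the concrete coefficients of the 4 x 2 parameter matrix P are defined by the interpolation scheme and are not reproduced here, so the helper only shows the U = P * D * P^T structure:

    /* Illustrative helper: compute one 4 x 4 upscaled block U from a 2 x 2 block D
       of the downscaled matrix as U = P * D * transpose(P), where P is the
       predefined 4 x 2 parameter matrix (its coefficients are assumed given). */
    void upscale_block(const float P[4][2], const float D[2][2], float U[4][4])
    {
        float T[4][2];                               /* T = P * D, a 4 x 2 matrix   */
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 2; j++)
                T[i][j] = P[i][0] * D[0][j] + P[i][1] * D[1][j];

        for (int i = 0; i < 4; i++)                  /* U = T * P^T, a 4 x 4 matrix */
            for (int j = 0; j < 4; j++)
                U[i][j] = T[i][0] * P[j][0] + T[i][1] * P[j][1];
    }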

Difference matrix calculation computes the difference matrix between the upscaled matrix and the original matrix. We call this difference matrix the pError matrix. It will be used later, together with the result matrix of Sobel, to adjust the upscaled matrix and obtain a preliminary sharpened image.

Figure 4. The process of upscaling the body.

Figure 5. The computational formula of upscaling.

Sub-sharpness image is a relatively complex process compared with other steps. This step contains multiple sub-steps: Sobel, reduction, calculation of the preliminary sharpened matrix, and Overshoot control. These sub-steps are described as follows.

Sobel is applied first to obtain the first-order gradient approximation of the brightness values of the original matrix. Completing the Sobel matrix takes two steps:

a) Filling the border: Fill the first row and column, last row and column with 0.

Figure 6. The process of Sobel.

b) Filling the body: Except for the border, every value of the Sobel matrix is calculated from a 3 x 3 submatrix of the original input matrix, as shown in Fig. 6. Similar to upscale, the stride of the submatrix is also one row or column. The center element of the submatrix has the same index as the calculated element in the result matrix.

The Sobel operator (a pair of 3 x 3 convolution masks) shown in Fig. 7 is convolved with the original matrix to obtain two matrices whose elements contain the horizontal and vertical derivative approximations, respectively. The sum of the absolute values of the two matrices is the result matrix of Sobel (referred to as pEdge in the following text).
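For reference, assuming the standard 3 x 3 Sobel coefficients (the exact masks used are those of Fig. 7), the per-element computation can be sketched in OpenCL C as:

    __constant int GX[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };   /* horizontal mask */
    __constant int GY[3][3] = { {-1,-2,-1}, { 0, 0, 0}, { 1, 2, 1} };   /* vertical mask   */

    /* Illustrative helper: pEdge value at (x, y); src is the original matrix and
       w its row pitch. The border elements are filled with 0 elsewhere. */
    float sobel_at(__global const float *src, int w, int x, int y)
    {
        float gx = 0.0f, gy = 0.0f;
        for (int i = -1; i <= 1; i++)
            for (int j = -1; j <= 1; j++) {
                float v = src[(y + i) * w + (x + j)];
                gx += GX[i + 1][j + 1] * v;
                gy += GY[i + 1][j + 1] * v;
            }
        return fabs(gx) + fabs(gy);   /* sum of absolute derivative approximations */
    }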

After Sobel works out the gradient approximation of the image brightness (the pEdge matrix), the arithmetic mean of the elements in this matrix is calculated. This is a reduction process.

Figure 7. Sobel operator (the * means convolution operation).

From the mean value and user-defined parameters, the brightness strength is computed. It is used to adjust the pEdge matrix and ultimately controls the sharpness level of the sharpened image. Next, the adjusted pEdge matrix is multiplied by the pError matrix mentioned above. The result of this multiplication, added to the upscaled matrix, forms the preliminary sharpened matrix.

Figure 8. The process of Overshoot control (preliminary values exceeding the local 3 x 3 max/min of the original matrix are replaced by osc max/osc min, and the result is clamped to [0, 255]).

The following step of the sub-sharpness image process is Overshoot control, which reduces undesired effects such as noise amplification. It takes the preliminary sharpened matrix as input and outputs the final sharpened image. The process includes two main steps:

a) Matrix Border: This step copies the first row and column, last row and column of the preliminary sharpened matrix to the same position of the final matrix.

b) Matrix Body: As shown in Fig. 8, each element of the preliminary sharpened matrix is first compared with the maximum value of the corresponding 3 x 3 submatrix of the original matrix. If it is larger, the element is replaced with the maximum value adjusted by user-defined tuning parameters and the difference between the element and the maximum value (marked as osc max in the figure). Through a similar process, an element smaller than the minimum value is replaced by the adjusted minimum value. After the comparison with the min or max values, all elements are clamped to the range [0, 255]. The result is then the final sharpened image.
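A hedged per-element sketch of this step is shown below; the exact adjustment formula depends on the user-defined tuning parameters, so a single strength factor oscStrength is assumed here for illustration:

    /* Illustrative overshoot control for one element.
       pre: preliminary sharpened value; locMax/locMin: max/min of the corresponding
       3 x 3 submatrix of the original matrix; oscStrength: assumed tuning factor. */
    float overshoot_control(float pre, float locMax, float locMin, float oscStrength)
    {
        float result;
        if (pre > locMax)
            result = locMax + oscStrength * (pre - locMax);   /* "osc max" in Fig. 8 */
        else if (pre < locMin)
            result = locMin - oscStrength * (locMin - pre);   /* "osc min" in Fig. 8 */
        else
            result = pre;
        return clamp(result, 0.0f, 255.0f);                   /* final pixel value */
    }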

IV. PARALLEL ANALYSIS AND NAÏVE GPU VERSION

This section analyzes the parallelism of the sharpness algorithm and describes a naïve version of this algorithm on GPU. As described in Section III, the sharpness algorithm consists of a series of operations, including downscale, upscale, pError matrix calculation, and sub-sharpness. They are analyzed separately below.

Downscale: Downscale is equivalent to compressing an image. As discussed in the last section, each element of the compressed matrix will be the mean of the corresponding 4 x 4 submatrix elements of the original matrix. The stride is 4 rows or columns, so the compressed size is one-sixteenth of the original matrix. The calculation of each value in the compressed matrix is fully independent. So downscale is a suitable candidate to run on GPU.

Upscale: Upscale is used to enlarge the downscaled matrix. The calculation of upscale is more complicated than downscale, especially for the border of the enlarged matrix. In the calculation of the center region of the enlarged matrix, we fetch a 2 x 2 block from the compressed matrix, multiply it with the two parameter matrices to get a 4 x 4 block, and then store that block into the enlarged matrix. There is repeated memory fetching, as shown in Fig. 4, but it does not affect the degree of parallelism, whereas repeated memory writes would. So the work can be separated and executed in parallel efficiently.

The calculation of the border region contains more conditional statements, which are inefficient on GPU and also limit the degree of parallelism. It is not worth calculating on GPU when the matrix is not big enough. However, the matrix multiplications involved can be parallelized, as Fig. 3 shows, so when this calculation becomes the bottleneck of the algorithm, we can still run it on the GPU.

Difference matrix calculation: The calculation of each value in the pError matrix is fully independent and can run in parallel, so this step is also a good candidate for GPU.

Sub-sharpness: This procedure includes four steps: Sobel, reduction, calculation of the preliminary sharpened matrix, and Overshoot control.

For Sobel, as shown in Fig. 6, there also exists data sharing for the calculation of adjacent elements. Each element is calculated from the convolution of the Sobel operator and the original matrix. Obviously the calculation of every element of the Sobel matrix is independent, so it can be parallelized on GPU.

However, porting the reduction algorithm to GPU is not easy, because it has data dependencies and repeated memory writes. To execute it in parallel and ensure correctness, atomic operations are needed, which are inefficient on GPU. In the calculation of the final sharpness matrix, the processing of every node is independent, so it is parallelizable.

The calculation of the brightness strength matrix performs many exponentiations, resulting in a large overhead. Besides, there are no data dependencies in this calculation, so it is profitable to run it in parallel on GPU.

For Overshoot control, computing the maximum and minimum values of every 3 x 3 submatrix of the original matrix can be well parallelized. After padding the original matrix, we do not need to handle the border separately, which keeps this part well parallelized.

In summary, there is great parallelism in the sharpness algorithm, and it is very suitable for massively parallel execution on GPU. Accordingly, we implemented a naïve version of this algorithm on GPU.

Naïve version: Reduction and the border processing of the upscaled matrix might be inefficient on GPU, so we put them on CPU in this preliminary version.

Because of the data dependencies between different steps of the algorithm, they have to be executed serially through global synchronization. Unfortunately, global synchronization can currently only be achieved on the host. This means that at each global synchronization point we must return to the host and launch the next OpenCL kernel. Besides, in sub-sharpness, data dependencies exist between Sobel, reduction, and the calculation of the final sharpness matrix, so we split the sub-sharpness procedure into three kernels. In total, five kernels have to be launched to finish the calculation.

V. OPTIMIZATIONS

This section describes our step-wise optimizations. They are grouped into data transfer optimization, kernel fusion, reduction optimization, vectorization for data locality, border optimization, and others.

A. Optimizing Data Transfer Between Host and Device

The sharpness algorithm involves a number of steps. As analyzed in Section IV, most of the steps are computation intensive and are accelerated by the GPU. Thus, the computation no longer takes a large fraction of the total overhead. Instead, data transmission between the host and the device can become a bottleneck. Some steps of the algorithm are conducted on the CPU, and data dependencies exist across the steps, so we have to transfer data between the host and the device frequently. The transmission goes over PCI-E, which is slow. In this article, we improve the efficiency of data transmission by optimizing the transfer mode and reducing the amount of data to transfer.

Optimize the mode of data transmission: OpenCL provides two ways to transfer data between host and device: map/unmap and read/write.

By calling the map function, a memory object on the device is mapped to a memory region on the host. Once this mapping is established, the memory object can be read or modified on the host through pointers. With this method the host does not transmit the data to the device up front, but only when it is needed. So every read or write of the mapped memory may require a transfer to get the data from, or put it back to, the host memory. Each such access goes through PCI-E, which degrades performance; this can cost more time than transmitting all the data at once. When the accesses are finished, the unmap function must be called to make the data available to the device kernels again; this simply releases the pointer into the host memory.

Compared with the map/unmap operations presented above, the read/write functions are simpler and more straightforward. The read function copies data from the device to the host, and the write function copies data from the host to the device.

The read/write functions transfer all the required data up front. That means the data is transmitted only once before the kernels are executed, while map/unmap may transmit data every time it is needed. Compared with map/unmap, read/write therefore saves many transfers, which saves data transmission time. Although map/unmap performs well on APUs, it does not perform well on discrete GPUs in most cases. For the sharpness algorithm, the data accesses are dispersed, so the map/unmap mode would transmit data repeatedly, while the read/write mode needs only one transfer. We tried both methods for the data of the sharpness algorithm, and the read/write functions indeed perform better.
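The following host-side sketch contrasts the two modes (queue and buffer handles are assumed to be created already; names are illustrative):

    #include <CL/cl.h>
    #include <string.h>

    /* read/write mode: one blocking bulk copy before the kernels are enqueued. */
    void upload_bulk(cl_command_queue queue, cl_mem d_src,
                     const unsigned char *h_src, size_t bytes)
    {
        clEnqueueWriteBuffer(queue, d_src, CL_TRUE, 0, bytes, h_src, 0, NULL, NULL);
    }

    /* map/unmap mode: the host writes through a mapped pointer; depending on where
       the buffer lives, these accesses may be served over PCI-E on demand, which is
       what hurts this workload compared with the single bulk copy above. */
    void upload_mapped(cl_command_queue queue, cl_mem d_src,
                       const unsigned char *h_src, size_t bytes)
    {
        cl_int err;
        void *p = clEnqueueMapBuffer(queue, d_src, CL_TRUE, CL_MAP_WRITE,
                                     0, bytes, 0, NULL, NULL, &err);
        if (err == CL_SUCCESS) {
            memcpy(p, h_src, bytes);
            clEnqueueUnmapMemObject(queue, d_src, p, 0, NULL, NULL);
        }
    }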

Reduce the amount of data to transfer: The calculation of the sharpness matrix needs the padded original matrix as its data source, while the downscale and Sobel calculations need the original matrix. Transferring both the padded original matrix and the original matrix from the host to the device would obviously be wasteful. To reduce the amount of data to transfer, we change the downscale and Sobel kernels to use the padded original matrix, so that only the padded original matrix has to be transferred. Meanwhile, by calling clEnqueueWriteBufferRect, we perform the padding while transferring the original matrix from the host to the device. Compared with padding the original matrix on the CPU, this method performs better in practice: padding on the CPU requires copying the original matrix line by line into the enlarged matrix, which consumes a lot of time. Using clEnqueueWriteBufferRect during the transfer may cost more than transmitting an already padded matrix directly, but considering the time spent copying data on the CPU, this method performs better overall.
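A hedged host-side sketch of padding during transfer follows (one byte per pixel and a symmetric pad of `pad` pixels are assumed; names are illustrative). The unpadded rows are written into the interior of a larger device buffer whose border has been initialized separately:

    #include <CL/cl.h>

    /* Write a width x height image into the interior of a
       (width + 2*pad) x (height + 2*pad) device buffer, so padding is produced
       during the transfer instead of by a separate line-by-line copy on the CPU. */
    void upload_with_padding(cl_command_queue queue, cl_mem d_padded,
                             const unsigned char *h_src,
                             size_t width, size_t height, size_t pad)
    {
        size_t paddedPitch      = width + 2 * pad;       /* bytes per padded row   */
        size_t buffer_origin[3] = { pad, pad, 0 };       /* skip the padding rim   */
        size_t host_origin[3]   = { 0, 0, 0 };
        size_t region[3]        = { width, height, 1 };  /* bytes x rows x slices  */

        clEnqueueWriteBufferRect(queue, d_padded, CL_TRUE,
                                 buffer_origin, host_origin, region,
                                 paddedPitch, 0,          /* buffer row/slice pitch */
                                 width, 0,                /* host row/slice pitch   */
                                 h_src, 0, NULL, NULL);
    }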

B. Kernel Fusion

The sharpness algorithm consists of a series of operations, including downscale, upscale, pError, and sub-sharpness. In addition, sub-sharpness includes four operations: Sobel, reduction, calculation of the preliminary sharpened matrix, and Overshoot control. As illustrated in the algorithm analysis, data dependences exist across these operations, so they have to be executed serially through global synchronization.

On GPU, global synchronization can only be achieved by launching multiple kernels, which causes performance problems. The time to launch a kernel can be significant, and the overhead of global synchronization increases with the number of kernels. Data communication between kernels goes through global memory. Fusing kernels can therefore reduce the required amount of global memory and the number of data accesses, which improves GPU utilization. Meanwhile, different operations sometimes need the same source data and perform the same calculations, so fusing operations also lets them share data accesses and computation, which improves performance considerably.

After analyzing the algorithm thoroughly, a one-to-one element correspondence can be found among pError, the calculation of the preliminary sharpened matrix, and Overshoot control. Thus these three operations can be computed within one kernel, with the per-element calculations executed in order within each thread. We call this combined kernel sharpness. After the fusion, the difference matrix is kept in the threads' registers, so we no longer need to allocate the difference matrix in global memory, and accesses to it through global memory are also avoided.
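A hedged per-element sketch of the fused kernel is shown below (the adjusted pEdge factor is passed in as a precomputed gain and the overshoot logic is abbreviated; names are illustrative):

    /* Illustrative fused "sharpness" kernel: pError, the preliminary sharpened
       value and the final clamping are computed per element in one pass, so the
       difference value never leaves the thread's registers. */
    __kernel void sharpness(__global const float *orig,      /* original matrix      */
                            __global const float *upscaled,  /* upscaled matrix      */
                            __global const float *edgeGain,  /* adjusted pEdge value */
                            __global float *out,
                            int width)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        int idx = y * width + x;

        float pError = orig[idx] - upscaled[idx];               /* kept in a register */
        float pre    = upscaled[idx] + edgeGain[idx] * pError;  /* preliminary value  */

        /* Overshoot control would follow here, comparing pre with the 3 x 3
           max/min of the original matrix before the final clamp. */
        out[idx] = clamp(pre, 0.0f, 255.0f);
    }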

C. Optimizing Reduction

The reduction operation computes the sum of the Sobel matrix. Conducting it naïvely on GPU would require many atomic operations, which seriously reduce parallelism, so we put the reduction on the CPU at first. However, as the data volume increases, CPU performance clearly degrades, so we port this calculation to GPU.

Figure 9. Two stages of reduction (stage 1: workgroups reduce the source data in parallel to intermediate results; stage 2: the intermediate results are summed on CPU or GPU to produce the final result).

When using the GPU to accelerate the reduction, we need to divide it into two stages, as shown in Fig. 9. In the first stage, we split the summing work across workgroups; every workgroup works in parallel and produces the sum of the data it is responsible for. In the second stage, we sum up the results of the first stage and generate the final result. Next, we optimize the two stages separately.

In the first stage, we choose tree-based reduction to finish the work within each workgroup. The tree-based scheme is the most efficient parallel implementation of reduction. Fig. 10 illustrates tree-based reduction; it is described in detail below.

Local memory is a memory region private to a workgroup: all threads within the workgroup can access it, while other threads cannot. Threads within a workgroup can communicate through local memory, and we use it to store the intermediate results of the reduction within a workgroup. So, in the first stage of the reduction, the first step of every thread is to fetch data from global memory, add them up, and store the result to local memory. After a barrier operation within the workgroup, half of the threads add the results of the other half to their own. This is one cycle of the adding operation. The number of intermediate results and active threads is halved after each cycle, and this procedure continues until the final result is obtained, as in Fig. 10. For the threads, the whole procedure can be written as a 'for' loop, but after each cycle of the adding operation a barrier is needed to make sure the next cycle executes correctly.

Figure 10. The tree-based reduction
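A minimal sketch of this first stage, assuming one element loaded per work-item and a power-of-two workgroup size, is given below (the actual kernel additionally pre-adds several elements per thread during the load):

    /* Illustrative first-stage reduction: each workgroup sums its slice of the
       pEdge matrix into one partial result using the tree-based scheme. */
    __kernel void reduce_stage1(__global const float *src,
                                __global float *partialSums,
                                __local float *sum,          /* one float per work-item */
                                int n)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int lsz = get_local_size(0);

        /* Step 1: load from global memory into local memory. */
        sum[lid] = (gid < n) ? src[gid] : 0.0f;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Tree-based halving: after every cycle half of the threads stay active. */
        for (int stride = lsz / 2; stride > 0; stride >>= 1) {
            if (lid < stride)
                sum[lid] += sum[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0)
            partialSums[get_group_id(0)] = sum[0];   /* one intermediate result per workgroup */
    }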

On AMD GPUs, a wavefront executes a number of work-items in lock step relative to each other; sixteen work-items are executed in parallel across the vector unit. Threads within the same wavefront therefore execute synchronously by nature. In the reduction, when the number of active threads in a workgroup drops below 64, the synchronization operations are no longer needed and only add overhead. So we unroll the last wavefront to reduce the synchronization overhead. Algorithm 1 outlines the implementation of this unrolling.

Algorithm 1: unrolling the last wavefront in reduction

    barrier(CLK_LOCAL_MEM_FENCE);
    if (lid < 64) *(sum + lid) += *(sum + lid + 64);
    if (lid < 32) *(sum + lid) += *(sum + lid + 32);
    /* ... same pattern for lid < 16, 8, 4, 2 ... */
    if (lid < 1) {
        *(sum + lid) += *(sum + lid + 1);
        *(global_sum + gridx + gridy * grsizex) = *(sum);
    }

Unrolling only the last wavefront leaves a single wavefront executing at the end. We therefore also try unrolling the last two wavefronts, so that two wavefronts process the task at the same time to improve the degree of parallelism. This adds a barrier after the computation of the two wavefronts, to make sure both have finished, and then the results of the two wavefronts are added. Algorithm 2 outlines the implementation of unrolling the last two wavefronts.


Algorithm 2"unrolling last two wavefronts in reduction Barrier(CLK_LOCAL_MEM_FENCE); if(lid < 64) *( sum + lid) += *( sum + lid + 64); if(lid < 32) *(sum + lid) += *(sum + lid + 32); … if(lid < 1)

*(sum + lid) += *(sum + lid + 1);

if(64 <= lid && lid < 128) *( sum + lid) += *( sum + lid + 64); if(64 <= lid && lid < 96) *(sum + lid) += *(sum + lid + 32); … if(64 <= lid && lid < 65)

*(sum + lid) += *(sum + lid + 1);

barrier(CLK_LOCAL_MEM_FENCE); if(lid == 0) *(!global_sum + gridx + gridy*grsizex) =

*(sum) + *(sum + 64); To make sure the fine-grained parallelism, which can

help GPU to schedule flexibly and hide the delay of accessing to global memory, we fix the amount of data processed per thread.

In most cases, the work in the second stage is small, so we can finish it directly on the CPU and still obtain good performance. But as data volumes increase, the results of the first stage become abundant, and then we need the GPU to conduct this reduction just like the first stage. In this article, whether the GPU is used is determined by the amount of data, and the threshold is measured in advance. Conducting this reduction on GPU reduces the amount of data transmission and keeps the reduction at high performance.

After the reduction and the border processing of the upscaled matrix are moved to GPU, essentially all of the computation of the sharpness algorithm is conducted on GPU. The original matrix, a small amount of reduction data, and the final sharpness matrix are the only data that remain to be transferred between the host and the device.

D. Vectorization for Data Locality

Considering the frequent and repeated memory accesses on GPU in Sobel, sharpness, and the processing of the center data of the upscaled matrix, the calculations of adjacent nodes can share data through vectorization.

In Sobel, there are many repeated memory accesses in the calculation of adjacent nodes. Every node of the Sobel matrix is calculated by fetching eight nodes from the original matrix, multiplying them by the Sobel operator, and summing up the results. So this procedure is memory intensive.

If only one node is calculated per thread, every node of the original matrix is fetched about 8 times. Calculating adjacent nodes, for example four nodes per thread, through vectorization removes a large part of this repeated memory access; the implementation is shown in Fig. 11. To calculate four adjacent nodes, the thread fetches 18 nodes of the original matrix into registers, and the four adjacent nodes of the Sobel matrix share this data, so every node of the original matrix is fetched only about 4.5 times. Almost half of the data fetches are eliminated, so performance improves greatly. As for the number of Sobel nodes each thread should calculate, four appears to be best. To align memory accesses, the number should be a multiple of 4; as the number increases, the parallel granularity becomes coarser, which prevents the GPU from scheduling flexibly and hiding the latency of global memory accesses.

Figure 11. Vectorization for data locality in Sobel
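A hedged sketch of the vectorized kernel follows (standard Sobel coefficients and a padded input are assumed; the exact indexing in our kernel may differ):

    /* Illustrative vectorized Sobel: one work-item produces four adjacent pEdge
       values and shares the fetched 3 x 6 neighbourhood between them via vload. */
    __kernel void sobel_vec4(__global const float *src,   /* padded original matrix */
                             __global float *pEdge,
                             int w)                       /* row pitch              */
    {
        int x = get_global_id(0) * 4 + 1;   /* first of the four output columns */
        int y = get_global_id(1) + 1;       /* skip the zero-filled border row  */

        /* Load the three rows of the shared neighbourhood. */
        float4 tl = vload4(0, src + (y - 1) * w + x - 1);
        float4 tm = vload4(0, src + (y - 1) * w + x);
        float4 tr = vload4(0, src + (y - 1) * w + x + 1);
        float4 ml = vload4(0, src + y * w + x - 1);
        float4 mr = vload4(0, src + y * w + x + 1);
        float4 bl = vload4(0, src + (y + 1) * w + x - 1);
        float4 bm = vload4(0, src + (y + 1) * w + x);
        float4 br = vload4(0, src + (y + 1) * w + x + 1);

        float4 gx = (tr - tl) + 2.0f * (mr - ml) + (br - bl);      /* horizontal gradient */
        float4 gy = (bl + 2.0f * bm + br) - (tl + 2.0f * tm + tr); /* vertical gradient   */

        vstore4(fabs(gx) + fabs(gy), 0, pEdge + y * w + x);        /* four pEdge values   */
    }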

The case is the same for the sharpness kernel and the processing of the center data of the upscaled matrix, so they are not covered again here.

E. Border Optimization

The processing of the border of the upscaled matrix needs many conditional statements, which are inefficient on GPU and limit the degree of parallelism, so we put it on the CPU at first. But this may not be the right choice. Because of the data dependences across the operations, conducting it on the CPU increases data transfer between the host and the device: we need to transfer the downscaled matrix from the device to the host, and after processing the border on the CPU, transfer the upscaled matrix back from the CPU to the GPU. These data transmissions are a large performance cost.

Besides this problem, when the image is big enough, the calculation of the vacancies shown in Fig. 3 becomes time consuming. In our experiments, we find that when the image is larger than 768x768, running this step on GPU performs better.

Thus, this step performs well on the CPU for small data sizes and better on the GPU for large data sizes, so we choose the device according to the data size.

F. Other Optimizations

Eliminate Global Synchronization: In this implementation, kernels are required to execute in order, which agrees with the default in-order execution mode of the command queue. That is, we can ensure in-order kernel execution simply by enqueueing the kernels to the command queue sequentially. So we can remove the synchronization function, clFinish(), after each kernel. The CPU-GPU communication overhead produced by clFinish() is removed, which improves performance.
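A host-side sketch of the resulting enqueue sequence (kernel handles and work sizes are assumed to be set up already; a single pair of size arrays is used only to keep the sketch short):

    #include <CL/cl.h>

    /* With an in-order command queue, back-to-back enqueues already execute
       serially, so clFinish() is only needed once at the end of the pipeline. */
    void run_pipeline(cl_command_queue queue,
                      cl_kernel downscale, cl_kernel center, cl_kernel sobel,
                      cl_kernel reduce, cl_kernel sharpness,
                      const size_t *gsz, const size_t *lsz)
    {
        clEnqueueNDRangeKernel(queue, downscale, 2, NULL, gsz, lsz, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, center,    2, NULL, gsz, lsz, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, sobel,     2, NULL, gsz, lsz, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, reduce,    1, NULL, gsz, lsz, 0, NULL, NULL);
        clEnqueueNDRangeKernel(queue, sharpness, 2, NULL, gsz, lsz, 0, NULL, NULL);
        clFinish(queue);   /* single synchronization point */
    }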

Built-in Functions: To improve memory access speed, we use the built-in functions 'vload' and 'vstore', which are faster than plain data access statements. Besides, part of the computation in the kernels is also implemented with built-in functions such as max, min, mad, clamp, and select.

Instruction selection (computation optimization): Division, multiplication, and remainder execute slowly on GPU relative to addition, subtraction, and bit operations. Thus we use shifts for multiplication and division by powers of two, and the bitwise AND operator for the modulus operation. For example, dividing by 2 can be replaced with a right shift by one, and modulus 4 can be computed by a bitwise AND of the number with 0x3.
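For instance (a sketch; the operands are assumed to be non-negative integer indices):

    /* Strength-reduced index arithmetic used inside the kernels. */
    inline int div2(int x) { return x >> 1; }    /* replaces x / 2 */
    inline int mul4(int x) { return x << 2; }    /* replaces x * 4 */
    inline int mod4(int x) { return x & 0x3; }   /* replaces x % 4 */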

VI. PERFORMANCE EVALUATION

In this section, the results obtained from the sequential implementation, the base parallel implementation, and the implementations after the step-wise optimizations are considered. We run the programs on an Intel Core i5-3470 CPU and an AMD FirePro W8000 GPU. Table I compares the specifications of the experimental hardware platforms.

TABLE I. COMPARISON OF EXPERIMENTAL HARDWARE PLATFORM SPECIFICATIONS

                           AMD FirePro W8000    Intel Core i5-3470 CPU
Processor clock frequency  0.88 GHz             3.2 GHz
Number of cores            1792                 4
Peak performance           3.23 TFlops          57.76 GFlops
Memory bandwidth           176 GB/s             25 GB/s

A. Comparison of the performance on GPU and CPU

Figure 12. Comparison of the CPU version and the base and optimized GPU versions.

Experimental results show that our final implementation on the AMD FirePro W8000 demonstrates a substantial improvement, up to a 69.3x speedup over the CPU counterpart, and the gap grows quickly as the volume of data increases. Fig. 12 shows the performance comparison of the CPU version and the base and optimized GPU versions. The test matrices are square, and the x-axis is the width of the original matrix. As the data size increases from 256 x 256 to 4096 x 4096 bytes, the speedup of the base GPU version over the CPU version grows from 9.8x to 35.3x. The optimized GPU version achieves a further 1.2x to 2.0x speedup on top of that.

Figure 13. Time fraction of each algorithm step for (a) the CPU version, (b) the base GPU version, and (c) the optimized GPU version.

The CPU version in this article is carefully optimized, including compiling with the '-O3' option. Fig. 13(a) shows the time fraction of each algorithm step for the optimized CPU implementation. Overshoot control and the calculation of the strength matrix are the bottlenecks of the running time. The main parts of the calculation of the strength matrix are reduction and preliminary sharpening. It can also be seen that, as the volume of data increases, the percentages of Sobel, pError, and upscale decrease.

The time fraction of the base parallel implementation on GPU is shown in Fig. 13(b). The running time can be unstable because of the data transmission, so the tests are repeated and the records averaged. As mentioned in Section V, the computation of the algorithm on GPU is performed by six kernels: downscale, border, center, Sobel, reduction, and sharpness. Compared with the CPU version, we separate the upscale algorithm on GPU into the border kernel and the center kernel. The sharpness kernel includes the calculation of pError, Overshoot control, and the calculation of the preliminary sharpened matrix. The data initialization and padding operations prepare the data for the GPU computation.

From Fig. 13(b), we can see that the bottlenecks of the GPU version differ from those of the CPU version: they are the processing of the center of the upscaled matrix, Sobel, and reduction. The reason is that overshoot control and preliminary sharpening are both computed in the sharpness kernel of the GPU version; they have good parallelism and are handled efficiently by the GPU, so they are no longer bottlenecks.

Besides, as the volume of data increases, the proportion of data initialization decreases. The reason is that the cost of data initialization, which mainly consists of data transmission, does not grow as fast as the computation.

After the optimizations, the performance of the final version improves greatly. The overhead distribution is shown in Fig. 13(c). Unlike the base CPU and GPU versions, the fractions of all operations are more evenly distributed, without prominent bottlenecks.

B. Performance changes after each step of optimizations

Figure 14. Comparison of the optimizations

All the optimization steps together bring a 1.15x to 9.04x speedup over the base version for data sizes ranging from 256x256 to 8192x8192 bytes. To assess the benefit of each individual optimization, performance is tested after each optimization step. Fig. 14 compares the performance after each step of optimization.

We can see that the optimization of reduction and the vectorization for data locality improve performance greatly. They cover reduction, Sobel, sharpness, and the processing of the center data of the upscaled matrix, which originally occupy a large proportion of the overhead. So these two optimizations account for most of the performance improvement of the sharpness algorithm.

It can also be seen that as the data size increases, the optimizations become more profitable. For small data sizes, some optimizations may even decrease performance. For example, after the optimization of data transmission and kernel fusion, performance decreases when the data size is smaller than 4096x4096, because the map/unmap mode is effective for small data sizes. However, as the data size increases, the optimization effect becomes obvious. GPUs are normally used to process large amounts of data, so the optimizations are still of great significance. The performance data for each optimization are given below.

Figure 15. Comparison of the two kernels of reduction

In the optimization of reduction, we use the tree-based scheme. Implementations unrolling the last wavefront and unrolling the last two wavefronts are both tried in this article. Fig. 15 compares the two implementations of reduction and shows that unrolling one wavefront works better. The reason is the barrier after the calculation: unrolling the last two wavefronts increases the synchronization overhead.

Figure 16. Comparison of the reduction on CPU and GPU


We also compare the performance of the reduction on CPU with the optimized reduction on GPU, as shown in Fig. 16. The reduction procedure on CPU includes transferring the pEdge matrix from GPU to CPU. We find that, after GPU acceleration, the performance of the reduction improves by up to 30.8 times.

Figure 17. Comparison of the border process on CPU and GPU

In the optimization of the border, implementing the border on the CPU is faster than on the GPU when the data size is small, and the GPU becomes faster as the data size increases. We therefore test the performance on both CPU and GPU, as shown in Fig. 17. Since neither device performs better at all data sizes, we use the CPU to calculate the border when the data size is small. However, calculating the border on the CPU also costs extra time for data transfer between the host and the GPU. So in the GPU version we compare the GPU time with the CPU time, where the CPU time includes the data transfer time, to decide above which size to calculate the border on GPU. The critical value is found to be 768x768 bytes.

VII. CONCLUSION

In this paper, we provide a complete and efficient solution to port the sharpness algorithm to GPU. Because sharpness involves several steps, dependent or independent, each with its own characteristics, it is a challenge to implement and optimize the sharpness algorithm on GPU. This paper proposed five techniques, Data Transfer Optimization, Kernel Fusion, Vectorization for Data Locality, Border Optimization, and Reduction Optimization, to address these challenges. Finally, we implement a high-performance GPU version of this algorithm. Experiments show that, compared to a well-optimized CPU version, our GPU solution reaches a 10.7x to 69.3x speedup for different image sizes on an AMD FirePro W8000 GPU. Meanwhile, this solution is also suited to other image processing algorithms with multiple steps.

ACKNOWLEDGMENT

This paper is supported by the General Program of the National Natural Science Foundation of China (No. 61133005), the Key Program of the National Natural Science Foundation of China (No. 61272136), the Science Fund for Creative Research Groups of the National Natural Science Foundation of China (No. 61221062), and the National Science Foundation for Distinguished Young Scholars (No. 61402441).

REFERENCES

[1] Owens J D, Houston M, Luebke D, et al. GPU computing[J]. Proceedings of the IEEE, 2008, 96(5): 879-899.

[2] Karimi K, Dickson N G, Hamze F. A performance comparison of CUDA and OpenCL[J]. arXiv preprint arXiv:1005.2581, 2010.

[3] Stone J E, Gohara D, Shi G. OpenCL: A parallel programming standard for heterogeneous computing systems[J]. Computing in science & engineering, 2010, 12(1-3): 66-73.

[4] Gu X, Pan H, Liang Y, et al. Implementation and evaluation of various demons deformable image registration algorithms on a GPU[J]. Physics in medicine and biology, 2010, 55(1): 207.

[5] Yang Z, Zhu Y, Pu Y. Parallel image processing based on CUDA[C]//Computer Science and Software Engineering, 2008 International Conference on. IEEE, 2008, 3: 198-201.

[6] Bilgic B, Horn B K P, Masaki I. Efficient integral image computation on the GPU[C]//Intelligent Vehicles Symposium (IV), 2010 IEEE. IEEE, 2010: 528-533.

[7] Moreland K, Angel E. The FFT on a GPU[C]//Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware. Eurographics Association, 2003: 112-119.

[8] Bernabe S, López S, Plaza A, et al. GPU implementation of an automatic target detection and classification algorithm for hyperspectral image analysis[J]. Geoscience and Remote Sensing Letters, IEEE, 2013, 10(2): 221-225.

[9] Vineet V, Narayanan P J. CUDA cuts: Fast graph cuts on the GPU[C]//Computer Vision and Pattern Recognition Workshops, 2008. CVPRW'08. IEEE Computer Society Conference on. IEEE, 2008: 1-8.

[10] Vincent O R, Folorunso O. A descriptive algorithm for Sobel image edge detection[C]//Proceedings of Informing Science & IT Education Conference (InSITE). 2009, 40: 97-107.

[11] Brown J A, Capson D W. A framework for 3D model-based visual tracking using a GPU-accelerated particle filter[J]. Visualization and Computer Graphics, IEEE Transactions on, 2012, 18(1): 68-80.

[12] Zhang N, Chen Y, Wang J L. Image parallel processing based on GPU[C]//Advanced Computer Control (ICACC), 2010 2nd International Conference on. IEEE, 2010, 3: 367-370.

[13] Abhinav Sarje, Xiaoye S. Li, Alexander Hexemer , High-Performance Original Modeling with Reverse Monte Carlo Simulations, Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 2014

[14] Xuejun Gu, Hubert Pan, Yun Liang, Richard Castillo, Deshan Yang, Dongju Choi, Edward Castillo, Amitava Majumdar, Thomas Guerrero and Steve B Jiang, Implementation and evaluation of various demons deformable image registration algorithms on a GPU, Department of Radiation Oncology, University of California San Diego, La Jolla, CA 92037, USA, 2010

[15] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron, Scalable parallel programming with CUDA, ACM Queue, March/April 2008.

[16] Mark Harris, Optimizing Parallel Reduction in CUDA, NVIDIA Developer Technology, 2007

[17] Nitin Singhal, In Kyu Park, Sungdae Cho, Implementation and Optimization of image processing algorithms on handheld gpu, Digital Media & Communication R&D Center, SAMSUNG Electronics Co. Ltd., Suwon, Korea, 2010.
