HIPAcc: A Domain-Specific Language and Compiler for Image Processing

Richard Membarth, Oliver Reiche, Frank Hannig, Jürgen Teich, Mario Körner, and Wieland Eckert


Abstract—Domain-Specific Languages (DSLs) provide high-level and domain-specific abstractions that allow expressive and concise algorithm descriptions. Since a DSL description also hides the properties of the target hardware, DSLs are a promising path to target different parallel and heterogeneous hardware from the same algorithm description. In theory, the DSL description can capture all characteristics of the algorithm that are required to generate highly efficient parallel implementations. However, most frameworks do not make use of this knowledge and their performance cannot reach that of optimized library implementations.

In this article, we present the HIPAcc framework, a DSL and source-to-source compiler for image processing. We show that domain knowledge can be captured in the language and that this knowledge enables us to generate tailored implementations for a given target architecture. Back ends for CUDA, OpenCL, and Renderscript allow us to target discrete Graphics Processing Units (GPUs) as well as mobile, embedded GPUs. Exploiting the captured domain knowledge, we can generate specialized algorithm variants that reach the maximum performance achievable under the peak memory bandwidth. These implementations significantly outperform state-of-the-art domain-specific languages and libraries.

Index Terms—domain-specific language, image processing, code generation, source-to-source translation, GPU, CUDA, OpenCL, Renderscript.

1 INTRODUCTION

Image processing is one of the basic and ubiquitous algorithm classes. Yet, it is challenging and poses high demands on the environments in which these algorithms are embedded: Mobile devices like phones and tablets strive for efficient implementations to save battery life; driver assistance systems require image processing to finish in time; and systems in medical imaging have extremely high data volumes that have to be processed fast. All these systems require highly efficient implementations of image processing algorithms on today's more and more (massively) parallel hardware.

This poses challenges to algorithm developers, who are typically not experts on the underlying hardware. What (parallel) language should be used for implementation? While CUDA might be the number one choice on GPUs from NVIDIA, there is no CUDA support from other vendors. OpenCL is well supported by most hardware manufacturers, but Google removed OpenCL support on Android.

• R. Membarth is with the German Research Center for Artificial Intelligence, Germany, and the Computer Graphics Lab & Intel Visual Computing Institute, Saarland University, Germany. [email protected]

• O. Reiche, F. Hannig, and J. Teich are with the Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany. {oliver.reiche,hannig,teich}@cs.fau.de

• M. Körner is with Siemens Healthcare Sector, Forchheim, Germany. [email protected]

• W. Eckert is with Siemens Corporate Technology, Erlangen, Germany. [email protected]

Figure 1: HIPAcc framework: Generation of efficient low-level code based on an algorithm description in a DSL, utilizing target hardware and domain knowledge. The C++ embedded DSL is translated by a Clang/LLVM-based source-to-source compiler that fuses domain knowledge and architecture knowledge to emit CUDA (GPU), OpenCL (x86/GPU), Renderscript (x86/ARM/GPU), or C/C++ (x86) code on top of a CUDA/OpenCL/Renderscript runtime library.

What optimizations should be applied for a given target platform? Target-specific optimizations require profound knowledge of the target hardware. Although languages like CUDA and OpenCL provide functional portability across a range of devices, target-specific tuning is required for good performance [1].

One compelling solution to tackle these challenges is DSLs: Algorithms can be described concisely in a high-level language, and code for different target architectures can be generated from this description. All the knowledge required for optimization is captured by DSL constructs, and platform-specific optimization strategies can be provided as additional input. This makes it possible to generate highly efficient implementations from a common algorithm description that is not contaminated with target-dependent code, for a variety of domains such as image processing [2], [3], PDE solvers [4], or machine learning [5].

In this article, we present the Heterogeneous Image Processing Acceleration (HIPAcc) framework (Figure 1): a DSL for image processing (Section 2) that captures domain-specific characteristics of algorithms, and a source-to-source compiler (Section 3) that generates highly optimized and efficient target code. Using source-to-source compilation, we rely on the compilers of hardware vendors for target code generation, but we also benefit from their target-specific optimizations: We preprocess the DSL input and emit code for which the hardware vendor's compiler can generate efficient code. Our focus is parallel execution on GPUs: CUDA and OpenCL allow efficient execution on standalone GPUs [6], [7], and Renderscript allows us to target mobile, embedded GPUs running Android [8].

Our approach allows compact algorithm descriptions (high productivity), portability between different target platforms, as well as excellent execution speed (performance) compared to state-of-the-art frameworks (Section 4).

The contributions of this article can be summarized as follows:

• We introduce a domain-specific language embedded into C++ for describing image processing algorithms.

• A source-to-source compiler that translates algorithms defined in the DSL into target-specific implementations.

• Target-specific and domain-specific mapping to the deep memory hierarchy as well as to the different types of parallelism found in today's GPUs.

• A range of algorithms and applications implemented in the proposed framework. We show that the DSL description allows a compact and portable representation while providing excellent performance.

• In detail, we evaluate our approach by assessing the performance of generated implementations and the productivity gains of using our DSL. We show that only HIPAcc generates highly optimized implementations that achieve performance near the limits predicted by the Roofline model [9]. Moreover, we show that HIPAcc improves productivity significantly according to Halstead's productivity metrics [10], reducing development times from days to minutes.

2 LANGUAGE DESIGN

A careful analysis of the application domain is necessary and indispensable before designing a DSL. This includes analysis of the algorithmic type of operations and operators as well as analysis of basic components characteristic to the domain. Using this information allows designing an expressive language that is compact and orthogonal at the same time.

For image processing algorithms, there exists a wide range of literature. Bankman [11] uses a classification based on what information is used to map one image to another. The three basic classes are pixel operations updating single pixel values, local operators also considering neighboring pixels, and operations with multiple images where several images of the same scene are used. Other classifications are based on why a method is applied [12]: The classes include correcting imaging defects due to imperfect detectors or limitations of the optics, image enhancement in the spatial domain to increase the visibility of one aspect, or processing images in the frequency space due to computational advantages. We follow the first approach and classify image processing algorithms based on what information contributes to the result: point, local, and global operators. However, compared to the classification presented in [13] and [11], it is not important how many images contribute to the operation or whether the result is again an image: Also the computation of a single value (e.g., the sum of pixels) is considered an image operator.

2.1 Language Components

HIPAcc provides a DSL embedded in C++ and uses C++ classes to describe components of the DSL. Similarly, computations on images are encapsulated in C++ classes, which inherit from base classes provided by the framework.

Figure 2: Image pyramid with 7 resolution levels (Level 0 to Level 6). The most fine-grained image is located at the bottom level and the most coarse-grained image is located at the top level.

2.1.1 Images in the DSL

An Image in the DSL describes the data storage for the pixels of a digital image. Each pixel can be stored using the standard data types (such as int or float) as well as vector types (such as uchar4). Images in HIPAcc are two-dimensional at the moment, but 3D images (volumes) might also be supported in the future:

Image<pixel_t>(size_t width, size_t height, pixel_t *img)

The data layout of the image is not exposed to the programmer and can be different depending on the target platform.

Images are often processed at different resolutions so that image details can be best detected and processed. A common multiresolution data representation in image processing is the image pyramid [14]: Operating on an image pyramid includes creating different fine- and coarse-grained images, each of a certain resolution, which we denote as levels, as shown exemplarily for 7 levels in Figure 2.

A Pyramid in the DSL is defined by providing the image at the most fine-grained level and the number of levels for the image pyramid:

Pyramid<pixel_t>(Image<pixel_t> &img, size_t depth)

Again, the data layout of the image pyramid is hidden from the programmer, and no care has to be taken for allocating data for the different image pyramid levels.
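As a brief illustration (the image size and pyramid depth are arbitrary example values, not taken from the paper), an image and a pyramid on top of it can be created as follows:

// a 1024x768 float image and a 5-level pyramid built on top of it
Image<float> img(1024, 768);
Pyramid<float> pyr(img, 5);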

2.1.2 Accessing Images

We differentiate between reads and writes to an image in the DSL. The Iteration Space defines the pixels of an image that are computed during an operation. An iteration space is bound to an image and can be restricted by defining a rectangular Region of Interest (ROI):

IterationSpace<pixel_t>(Image<pixel_t> &img,
                        size_t width, size_t height,
                        int offset_x, int offset_y)

That is, the image bound to the iteration space will be written during the operation. The parameters for the ROI are optional and can be omitted. The image can also be retrieved from an image pyramid by providing the desired level within the image pyramid:

Pyramid<pixel_t> &pyr(size_t level)

The pixels from (different) images that contribute to the computation are defined by accessors. An Accessor is, like the iteration space, bound to an image and can be restricted to a ROI:

Accessor<pixel_t>(Image<pixel_t> &img,
                  size_t width, size_t height,
                  int offset_x, int offset_y)

Figure 3: The Accessor defines different views on an image for image operators: (a) shows the whole image and the surrounding boundary (light blue), while in (b) an offset into the image is used. In (c) only a subregion of the image (crop) is considered, and (d) shows the combination of offset and crop. (e) shows an access pattern where only every second pixel in the x-dimension is considered.

Since the computed and contributing images may be of different size, the pixel value returned when accessing a pixel is interpolated using either nearest neighbor, linear filtering, cubic filtering, or Lanczos resampling. The interpolation mode is provided as an argument to the constructor of an accessor. Figure 3 visualizes the pixels that are returned by the view defined by different accessors: The image is extended by a virtual boundary (light blue); dark blue pixels are read, and white pixels of the image are not considered by the accessor. Note that the stride shown in Figure 3 (e) is implicitly defined by using an input image twice as wide as the output image and using nearest neighbor as interpolation mode.
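For illustration, an accessor restricted to a ROI and an accessor with an interpolation mode can be declared as follows for an Image in (crop size and offsets are arbitrary example values; the interpolation enumerator Interpolate::LF is the one used in Listing 3):

// read only a 2048x2048 crop of in, starting at offset (100, 50)
Accessor<uchar> acc_crop(in, 2048, 2048, 100, 50);

// read in with linear filtering when its size differs from the iteration space
Accessor<uchar> acc_lf(in, Interpolate::LF);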

These are all DSL constructs required to describe point operators. Next, we will introduce support for boundary handling and filter masks, which are required for local operators.

2.1.3 Boundary Handling

As indicated in Figure 3, each image is extended by a virtual boundary so that images can be accessed out of bounds. This is, for example, the case for sliding window operators where neighboring pixels contribute to the new pixel value. For such operators the accessor should take care of boundary handling. For this purpose, the Boundary Condition defines the size of the region around each pixel where boundary handling is required as well as the desired boundary handling mode:

BoundaryCondition<pixel_t>(Image<pixel_t> &img,
                           size_t size_x, size_t size_y,
                           boundary_mode mode, pixel_t val)

The framework supports repeat, clamp, mirror, constant, and undefined as boundary handling modes, as visualized in Figure 4. The accessor can also be defined by providing a boundary condition instance instead of an image. Note that using an undefined boundary handling mode might be desirable in case the accessor is only defined for a smaller ROI.
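A minimal sketch of this combination, using the size-based constructor above and the Boundary::CLAMP enumerator from Listing 1 (the 5 × 5 window size is an arbitrary example value):

// clamp out-of-bounds reads within a 5x5 neighborhood around each pixel
BoundaryCondition<uchar> bc(in, 5, 5, Boundary::CLAMP);
Accessor<uchar> acc(bc);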

2.1.4 Sliding Windows

Sliding window (local) operators that iterate over neighboring pixels are common in image processing.

Therefore, HIPAcc provides two constructs in the DSL for sliding windows: a Domain, which defines the iteration space of a sliding window, and a Mask, which is a more specific version of a Domain that also provides filter coefficients for that window.

Domain(size_t size_x, size_t size_y)
Domain(uchar &domain[size_y][size_x])
Domain(Mask<pixel_t> &mask)
Mask<pixel_t>(size_t size_x, size_t size_y)
Mask<pixel_t>(pixel_t &data[size_y][size_x])

Figure 5: Two-dimensional Gaussian function. The normalized constants for a 3 × 3 filter mask are: 0.0571, 0.1248, 0.0571 (first row); 0.1248, 0.2725, 0.1248 (second row); 0.0571, 0.1248, 0.0571 (third row).

For example, Figure 5 shows the two-dimensional Gaussian function and the normalized constants for a corresponding 3 × 3 filter mask.

Example: Gaussian Blur Filter

Having all DSL components defined, we consider the Gaussian blur filter as an example application to show how the DSL components interact. Listing 1 first defines a Mask mask for the Gaussian blur filter, then reads the input picture (image) from disk and assigns it to the Image in. To read from the Image in, a Boundary Condition is defined for the size of the Mask mask, and clamp is used as boundary handling mode. The Iteration Space iter is defined on the result Image out. Using these DSL components, an instance of the LinearFilter operator, which will be introduced in the next section, is created and executed.

2.2 Defining Operators

Operators are defined similarly to operators in Threading Building Blocks (TBB) [15], where programmers implement the operator() and the join() methods of custom C++ classes. The equivalent of the operator() method is the kernel() method in HIPAcc: The method specifies the computation for each pixel in the Iteration Space.

Figure 4: Boundary handling modes for image processing: (a) repeat, (b) clamp, (c) mirror, (d) constant, (e) undefined. By default, the behavior is undefined when the image is accessed out of bounds (e). The framework allows specifying different boundary handling modes like repeating the image (a), clamping to the last valid pixel (b), mirroring the image at the image border (c), and returning a constant value when accessed out of bounds (d).

// filter mask for Gaussian blur filter
const float filter_mask[3][3] = {
    { 0.057118f, 0.124758f, 0.057118f },
    { 0.124758f, 0.272496f, 0.124758f },
    { 0.057118f, 0.124758f, 0.057118f }
};
Mask<float> mask(filter_mask);

// input image
size_t width, height;
uchar *image = read_image(&width, &height, "input.pgm");
Image<uchar> in(width, height, image);

// reading from in with clamping as boundary condition
BoundaryCondition<uchar> cond(in, mask, Boundary::CLAMP);
Accessor<uchar> acc(cond);

// output image
Image<uchar> out(width, height);
IterationSpace<uchar> iter(out);

// instantiate and launch the Gaussian blur filter
LinearFilter Gaussian(iter, acc, mask, 3);
Gaussian.execute();

Listing 1: Instantiation of an operator for the Gaussian blur filter in the DSL.

Hence, there is no difference between point and local operators in HIPAcc. Both are realized by the kernel() method. Instead of providing a generic join() method, HIPAcc currently only allows defining global reductions using the reduce() method.

2.2.1 Local and Point Operators

The kernel() method is embedded into a user-defined class that derives from the Kernel base class provided by the framework. Listing 2 shows such a class implementing linear filters that retrieve the precomputed constants from the Mask mask. The provided implementation iterates manually over the mask elements, accessing neighboring pixels using relative offsets. The result is stored using the output() method.

For describing sliding window operators, the framework provides built-in functions that allow a more concise description: the convolve() method, taking a) the filter mask, b) the aggregation mode, and c) the computation instructions for a single filter mask component and the corresponding image pixel, described as a C++ lambda function:

void kernel() {
    output() = (uchar) convolve(mask, Reduce::SUM, [&] () -> float {
        return mask() * input(mask);
    });
}

HIPAcc currently supports min, max, sum, and prod as aggregation modes.

class LinearFilter : public Kernel<uchar> {
  private:
    Accessor<uchar> &input;
    Mask<float> &mask;
    size_t size;

  public:
    LinearFilter(IterationSpace<uchar> &iter, Accessor<uchar> &input,
                 Mask<float> &mask, size_t size) :
        Kernel(iter), input(input), mask(mask), size(size)
    { add_accessor(&input); }

    void kernel() {
        float sum = 0;
        int range = size/2;

        for (int yf = -range; yf <= range; ++yf)
            for (int xf = -range; xf <= range; ++xf)
                sum += mask(xf, yf) * input(xf, yf);

        output() = (uchar) sum;
    }
};

Listing 2: Kernel description for the Gaussian blur filter.


For more complex algorithms that do not follow the above scheme, the iterate() function can be used. Consider the bilateral filter, which combines a closeness component c and a similarity component s over a common sliding window [16]. Using iterate(), we can iterate over the sliding window defined by the Domain dom and compute both components as well as the normalization factor:

void kernel() {
    float d = 0, p = 0;

    iterate(dom, [&] () -> void {
        float diff = in(dom) - in();
        float c = mask(dom);
        float s = expf(-c_r * diff*diff);
        d += c*s;
        p += c*s * in(dom);
    });

    output() = p/d;
}

The aggregation is explicit in this case, and the closeness component is read from the Mask mask. The similarity component could also be read from a Mask, but is computed on the fly in this case.

2.2.2 Global Reduction Operators

For global reduction operators, the reduce() method can be implemented by the programmer. We use the definition provided by Blelloch [17] for the reduce operation, but do not require the identity: The operator defined by the reduce() method is applied to all elements defined by the Iteration Space of the user-defined Kernel class. For instance, the minimum of all pixels can be described as follows:

int reduce(int left, int right) {
    return min(left, right);
}

The result of the reduction can be retrieved using the reduced_data() method of the Kernel class.
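Putting the pieces together, a minimal sketch of a complete reduction operator could look as follows (the class name MinReduction and the pass-through kernel() body are illustrative, modeled on the Kernel API shown in Listing 2):

class MinReduction : public Kernel<int> {
  private:
    Accessor<int> &input;

  public:
    MinReduction(IterationSpace<int> &iter, Accessor<int> &input) :
        Kernel(iter), input(input)
    { add_accessor(&input); }

    // point operator: simply forward each pixel
    void kernel() {
        output() = input();
    }

    // global reduction: minimum over all pixels of the iteration space
    int reduce(int left, int right) {
        return min(left, right);
    }
};

// usage: execute the kernel, then query the reduction result
// MinReduction mr(iter, acc);
// mr.execute();
// int minimum = mr.reduced_data();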

2.3 Image Pyramid Construction

During traversal of the pyramid, either a more fine-grained or a more coarse-grained representation is created by upscaling or downscaling the image. The actual operators are then applied to the images at each level. To construct the images at different pyramid levels, the DSL provides the traverse() function:

void traverse(std::vector<Pyramid>, const std::function<void()>)

It takes a vector of image pyramids as argument and a C++ lambda function that describes the traversal of the pyramids. Pyramids are registered in the traverse() function, and images of each pyramid can be requested using a relative index. That is, querying a pyramid at index '0' will return an image handle for the current traversal level. The index '1' refers to the image on the next (more coarse-grained) level, whereas index '−1' refers to the image of the previous (more fine-grained) level, respectively.

The traversal of the pyramid in the C++ lambda function is triggered by a pseudo-recursive call to traverse(): Listing 3 shows a simple example for operating on pyramid data structures using the presented traversal functions. As needed for the traversal, two image pyramids are created (lines 3–4) with a depth of 7. Both are bound to a traverse function call (line 6), together with the lambda function describing the actual recursion body (lines 7–26). The body contains the set-up and execution of three kernels: downscale (lines 8–11), compute (lines 14–17), and upscale (lines 22–25). The recursive call (line 19) occurs right after the compute kernel but could also be placed before it without changing the schedule.

2.4 C++ Integration

Since the presented DSL is realized through C++ classes, programs can be compiled with any C++ compiler, so that incremental porting of applications is possible. However, when compiled with the source-to-source compiler provided by HIPAcc, target code for GPU accelerators is generated, as discussed in the next section. The programmer can combine the DSL with regular C++ as shown in the following example, using OpenCV for I/O and HIPAcc for computing:

// load image in OpenCV
Mat frame = imread("lena.pgm", CV_LOAD_IMAGE_GRAYSCALE);

// use image in HIPAcc
Image<uchar> img(frame.cols, frame.rows);
img = frame.data;
...

// use OpenCV to display image
frame.data = img.data();
imshow("Result Image", frame);

 1 Image<uchar> in;
 2 Image<uchar> out;
 3 Pyramid<uchar> pin(in, 7);
 4 Pyramid<uchar> pout(out, 7);
 5
 6 traverse({ &pin, &pout }, [&] {
 7     if (!pin.is_top_level()) {
 8         Accessor<uchar> acc_in(pin(-1), Interpolate::LF);
 9         IterationSpace<uchar> iter_in(pin(0));
10         Downscale ds(iter_in, acc_in);
11         ds.execute();
12     }
13
14     Accessor<uchar> acc_in(pin(0));
15     IterationSpace<uchar> iter_out(pout(0));
16     Compute c(iter_out, acc_in);
17     c.execute();
18
19     traverse();
20
21     if (!pout.is_bottom_level()) {
22         Accessor<uchar> acc_out_cg(pout(1), Interpolate::LF);
23         Accessor<uchar> acc_out(pout(0));
24         Upscale us(iter_out, acc_out, acc_out_cg);
25         us.execute();
26     }
27 });

Listing 3: Pyramid traversal: The downscale/upscale kernels create a coarser/finer representation and the compute kernel processes the data at each level.

Figure 6: DSL program compilation work flow of HIPAcc: The C++ embedded DSL program is parsed into a Clang AST and processed by the Match, Analysis, Clone/Translate, and PrettyPrint libraries; the Rewrite library turns the host code into C++ calls to the HIPAcc runtime, while the device code is emitted as CUDA/OpenCL/Renderscript.

3 COMPILER FRAMEWORK

Instead of implementing a new front end for C++, we decided to build our source-to-source compiler on top of Clang¹, a C language family front end for LLVM. Clang performs the typical steps of a compiler front end: syntactic analysis (lexing and parsing), semantic analysis, and the construction of an Intermediate Representation (IR) [18], [19]. Clang uses an Abstract Syntax Tree (AST) as IR that stores further information of the input source program for each AST node.

3.1 Compiler Work Flow

The source-to-source compiler of HIPAcc traverses the AST built by Clang in order to generate target code as shown in Figure 6.

1. http://clang.llvm.org

First, all AST nodes are visited using a preorder depth-first traversal in the Match library. AST nodes that correspond to a) declarations and definitions of DSL classes, b) statements that define objects in the DSL, and c) expressions involving DSL objects are stored, and further operations are triggered depending on their type: a) class definitions are removed from the textual source program representation using the Rewrite library, but the AST representation is still available to generate target code for the kernel; b) statements defining DSL objects are replaced by corresponding calls to the HIPAcc runtime (e.g., to allocate an image object on the GPU); and c) expressions on DSL objects are also replaced by calls to the HIPAcc runtime (e.g., to copy data to/from the memory allocated previously on the GPU). Depending on the target language, different runtime Application Programming Interface (API) calls are emitted by the Rewrite library, and the updated textual representation is stored to a file. Only parts of the input source file that make use of the DSL are changed. This helps the user to understand the applied changes and allows them to further modify the generated source files.
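To give an idea of this rewriting step, a hypothetical before/after pair is sketched below; the runtime call name hipaccCreateMemory is made up for illustration and is not necessarily the actual HIPAcc runtime API:

// DSL statement in the input program:
Image<uchar> in(width, height, image);

// possible host code emitted by the Rewrite library (hypothetical API name):
HipaccImage in = hipaccCreateMemory<uchar>(image, width, height);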

The Analysis library extracts metadata from DSL classes and image operators. This includes the size of filter masks as well as boundary handling modes, information that is required for target code generation. However, most important is the analysis and categorization of kernels: Memory accesses and arithmetic operations are analyzed for the user-provided kernel() method. Therefore, a Control Flow Graph (CFG) of the method is constructed and visited in postorder depth-first traversal in order to collect the following information:

(a) The number of instructions. The instructions are classified into memory accesses and arithmetic instructions.

(b) Memory access properties of images. Each memory access is analyzed, and for each image it is stored whether it is only read, only written, or both read and written. In case an image is read and written within the same kernel, a warning is emitted since this may lead to inconsistent memory views.

(c) Memory access pattern of images and kernels. In addition to the read/write analysis, the pattern of memory accesses is analyzed. It is differentiated between accesses with stride-x, stride-y, stride-xy, and without any stride. This information is collected for each image, and the aggregated stride information of all images is annotated to the kernel. This allows categorizing kernels as point operators (no stride) and local operators (stride-x, stride-y, stride-xy). Row filters then correspond to stride-x local operators and column filters to stride-y local operators.

(d) Use-def chain. Each definition of a variable and its uses is stored. This liveness analysis can be used to remove dead code, but is extended to include divergence information. This information is essential for vectorization [20].

Once the metadata has been extracted, the AST of the kernel() method is converted for execution on the target hardware. Since Clang does not allow modifying the AST representation (the AST is immutable), the whole AST representation has to be recreated in order to add or modify AST nodes. The Clone library provides the facility to duplicate single AST nodes and will issue an error message if C++ operators are used that are not supported on the target hardware, such as try, catch, and throw, as well as for instructions not supported by HIPAcc such as new and delete.

The Translate library replaces AST nodes without changing the semantics of the operation they describe. For example,

accesses to an image object in the DSL are replaced by accesses to GPU memory. However, this is not limited to simple one-to-one translations; more sophisticated transformations are also supported, such as loop unrolling or staging of data to scratchpad memory. During AST traversal, we keep track of kernel class member variable uses. Only member variables that are present in the generated code will be added as parameters to the compute kernel. For example, Domains only need to be added as a parameter in case the iteration space defined by the Domain is not unrolled.

Having the AST transformed for the target hardware, the PrettyPrint library stores the AST to a file using the syntax of the target language. During pretty printing, function and variable qualifiers are emitted. For example, the __kernel qualifier is emitted in OpenCL for kernel entry functions and the __local address space qualifier for scratchpad memory. Similarly, CUDA requires attributes for kernel entry functions (__attribute__ ((global))) and scratchpad memory (__attribute__ ((shared))).

3.2 Kernel Code Generation

The source-to-source compiler transforms the DSL description into target-specific source code optimized with respect to execution speed. To this end, the metadata extracted from the DSL description as well as the information obtained through code analysis is used.

• Memory padding: The data storage for images is allocated by the runtime system such that each image line is aligned to a multiple of the memory transaction size on the target architecture. The generated memory access index computation takes the introduced padding into account. This preserves kernel performance for images of arbitrary size.
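A minimal sketch of this padding, assuming a 256-byte transaction size and float pixels (the actual values come from the architecture model; gid_x, gid_y, and in follow the naming of the later generated-code examples):

// pad the row stride to a multiple of the transaction size
size_t alignment = 256 / sizeof(float);                       // pixels per transaction
size_t stride    = ((width + alignment - 1) / alignment) * alignment;

// generated index computations use the padded stride instead of the width
float pixel = in[gid_y * stride + gid_x];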

• Memory layout: Image data can be stored using different memory layouts. This includes linearized one-dimensional memory, two-dimensional memory, as well as hardware-vendor-specific memory layouts using space-filling curves. Therefore, the runtime can allocate memory using different memory layouts, and the required API calls to read/write the selected data structure are emitted during code generation. This includes image objects for OpenCL or texture and surface memory for CUDA.

• Memory hierarchy: GPU accelerators feature deep memory hierarchies that have to be utilized in order to overcome the long memory latency and improve locality. Depending on the memory access pattern, different approaches are followed in order to improve locality. We distinguish between the following scenarios: a) in case a kernel reads only a single pixel of a given image, locality cannot be improved and no special memory mapping is generated; b) in case a kernel reads multiple surrounding pixels, locality may be improved by using a special memory layout as described above and by staging data to fast scratchpad memory; c) in case a kernel reads the same data for each pixel and that data is of known size, constant memory can be utilized to provide a cache that broadcasts the data to multiple threads. Constant memory is predestined for Masks (Domains) and is used whenever the compiler is not able to resolve all computations involving a Mask at compile time. Scratchpad memory and special memory layouts are selected depending on the target platform. Both can be combined as shown in the following for CUDA, reading data from a texture reference bound to linear 1D memory and staging it to shared memory:

// texture declaration for input
texture<float, cudaTextureType1D, cudaReadModeElementType> tex_in;
// shared memory declaration for input
__shared__ float in_sm[8][33];
// load data to scratchpad
in_sm[threadIdx.y][threadIdx.x] =
    tex1Dfetch(tex_in, gid_y * in_stride + gid_x);
__syncthreads();
// read data from scratchpad
... in_sm[threadIdx.y-1][threadIdx.x+1] ...

The example above is generated for a work-group configuration of 32 × 4 and a filter mask size of 1 × 5. The allocated scratchpad for the x-dimension is 1 · 32 + 1 = 33 pixels, adding one element of padding to avoid bank conflicts in shared memory. Considering the given work-group configuration in the y-dimension, 4 + (5 − 1) = 8 lines of scratchpad are required.

• Constant propagation & loop unrolling: The local window described by Domains and Masks can be fully unrolled knowing the filter window size. This removes the looping overhead in the compute kernel and the requirement to pass a Domain or Mask to the compute kernel as parameter. In case a Mask is compile-time constant, the constants of the mask can be propagated in addition to unrolling the iteration space. This can only be done if the access to the Mask is also known at compile time.
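As an illustration (not verbatim compiler output), unrolling the 3 × 3 convolution of Listing 2 and propagating the Gaussian mask constants of Listing 1 could yield code of the following shape (gid_x, gid_y, in_stride, and out_stride follow the naming of the surrounding examples):

float sum = 0.057118f * in[(gid_y-1) * in_stride + gid_x-1]
          + 0.124758f * in[(gid_y-1) * in_stride + gid_x  ]
          + 0.057118f * in[(gid_y-1) * in_stride + gid_x+1]
          + 0.124758f * in[ gid_y    * in_stride + gid_x-1]
          + 0.272496f * in[ gid_y    * in_stride + gid_x  ]
          + 0.124758f * in[ gid_y    * in_stride + gid_x+1]
          + 0.057118f * in[(gid_y+1) * in_stride + gid_x-1]
          + 0.124758f * in[(gid_y+1) * in_stride + gid_x  ]
          + 0.057118f * in[(gid_y+1) * in_stride + gid_x+1];
out[gid_y * out_stride + gid_x] = (uchar) sum;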

• Thread coarsening: Merging the computation of multiple iteration points into a single thread can improve locality and reuse of data. At the same time, it reduces the launch overhead for lightweight GPU threads. This transformation is equivalent to loop unrolling of the global iteration space, reducing the number of threads running on the GPU. GPU accelerators have strict requirements on memory access patterns in order to utilize bandwidth best: contiguous memory should be accessed by threads executed in lockstep for memory coalescing. For this reason, we apply thread coarsening only to threads of the global iteration space in the y-dimension:

gid_x = blockDim.x * blockIdx.x + threadIdx.x;
gid_y = blockDim.y * blockIdx.y * N + threadIdx.y;

// first iteration
{
    // begin kernel code
    ... = in[gid_y * in_stride + gid_x];
    // end kernel code
}

// second iteration
if (gid_y + 1 * blockDim.y < is_height) {
    // begin kernel code
    ... = in[(gid_y + 1 * blockDim.y) * in_stride + gid_x];
    // end kernel code
}

// n-th iteration
if (gid_y + (N-1) * blockDim.y < is_height) {
    // begin kernel code
    ... = in[(gid_y + (N-1) * blockDim.y) * in_stride + gid_x];
    // end kernel code
}

The example above unrolls the global iteration space by a factor of N, with Image accesses being updated for the current iteration point. Since we do not know at compile time whether the image size is a multiple of N, we also need to add guards that ensure that the current iteration point belongs to the global iteration space. Setting N to 4 now requires 4 · N + (5 − 1) = 20 lines of scratchpad in the previous example.

• Multiple Program, Multiple Data (MPMD) code generation:

To support boundary handling, checks have to be added for reads to an Image. Checking boundary conditions for each memory access is expensive and leads to increased execution times on the order of 10% up to 100%, depending on the boundary handling mode as well as on the target hardware. One solution to this problem is the use of texturing hardware that takes care of boundary handling. However, texturing hardware is not available on all platforms and allows boundary handling only on two-dimensional arrays. This imposes a memory layout that is not optimal for achieving best performance and does not allow defining boundary handling on a ROI (cf. [6], [7]). Therefore, we divide the image into different regions according to the image borders and generate specialized variants for boundary handling for each region, as indicated below and visualized in Figure 7:

__global__ void kernel(... int bh_t, int bh_fb) {
    if (bh_fb) goto BH_FB;
    if (blockIdx.x < bh_l &&
        blockIdx.y < bh_t) goto BH_TL;
    ... // checks for other variants
    goto BH_NO;

    BH_FB:
        // handle all image borders
        <pretty printed AST>
        return;
    BH_TL:
        // handle only top left border
        <pretty printed AST>
        return;
    ... // other variants
    BH_NO:
        // no boundary handling
        <pretty printed AST>
}

This results in one big kernel that includes all specialized variants. This is similar to index set splitting [21], but the code variant is selected at run time depending on the block index. All threads executed on the same compute unit always execute the same code variant and, therefore, run in lockstep. Checks are now only issued for thread blocks processing pixels at the image borders, and no checks are generated for the inner image region. We choose to generate one big kernel instead of multiple kernels since the overhead of scheduling multiple small kernels for the image borders is higher than the benefit of specialized boundary handling. In total we generate up to ten variants: up to eight for the image borders, one for the inner region, and one fall-back variant for very small images (e.g., as required for Pyramids). The memory access pattern extracted during code analysis defines the code variants required: Only left and right borders have to be handled for row kernels and only top and bottom borders for column kernels.

3.3 Optimization Strategy

HIPAcc follows a model-driven approach for code optimization, utilizing the metadata captured by the DSL constructs and information from an architecture model. Tailored implementations are generated using the fused information about the input algorithm and the target architecture. The architecture model describes which optimizations are valid and beneficial for a given target architecture:

• The memory layout and alignment for images.
• Texture memory layout and usage.
• Thresholds for staging data to scratchpad memory.

Figure 7: Code variant assignment to image borders: Boundary handling for local operators is limited to those regions where out-of-bounds image accesses occur. Different code variants are generated for the top left (BH_TL), top (BH_T), top right (BH_TR), left (BH_L), right (BH_R), bottom left (BH_BL), bottom (BH_B), and bottom right (BH_BR) image borders; the inner region (BH_NO) requires no boundary handling.

• The thread coarsening factor.

The launch configuration of a kernel determines the tiling of the global iteration space and the occupancy of hardware resources. HIPAcc extracts the resource usage (such as the number of registers and the shared memory usage) of generated kernels to fine-tune their tiling to achieve good occupancy and improve locality: For all valid configuration sizes S (due to shared memory/register usage) that result in good memory bandwidth utilization (due to coalescing), the compiler calculates the occupancy. Then the heuristic of Algorithm 1 selects a configuration c_(s,t_xy) from the list of configuration sizes S. For each configuration size s, the shape of the configuration is determined such that the resulting tiling t_xy of the image into sub-problems processed on a compute unit adapts to the available resources and the inherent locality of the considered kernel.

Depending on the operator kind, a different strategy is used. For point and global operators, a 1D tiling is chosen since the resulting code requires fewer instructions (e.g., no guards for the y-dimension, simpler index address calculation) and point and global operators do not benefit from a 2D tiling (there is no temporal locality). The size of the tiling is selected such that a high occupancy is given. In case multiple configurations result in the same occupancy, the configuration with the lowest number of threads is chosen. This gives higher flexibility to the hardware scheduler since multiple work-groups can be scheduled and executed at the same time on a single compute unit. This results in 1D configurations like 128 × 1 or 256 × 1. Such configurations are typically selected by expert programmers and yield good performance for most kernels. In contrast, a 2D tiling is chosen for local operators. This has two reasons: to minimize the number of threads that execute code with conditionals for boundary handling and to maximize locality by selecting a 2D shape. The minimal size for the x-dimension (the warp size) is typically enough to cover boundary handling in the x-dimension with a single work-group (e.g., for a kernel window size of 3 × 1 up to 65 × 1 for a warp size of 32).

Algorithm 1: Heuristic for selecting kernel configuration and tiling depending on resource usage, filter mask sizes, and target graphics card.

Input: Kernel K, list of configuration sizes S for GPU G
Output: Configuration c_(s,t_xy) with size s and tiling t_xy for kernel K

 1  S ← configuration sizes of S that are a multiple of the warp size of G and within the resource limitations of G
 2  S ← sorted configuration sizes of S with descending occupancy and ascending number of threads
 3  s ← first configuration size of S
 4  if local operator then
 5      t_xy ← tiling of s, prefer y over x
 6      threads_bh ← calculate number of threads for border handling for kernel K with tiling t_xy
 7      S' ← configuration sizes with occupancy within 10% of s
 8      foreach configuration size s' of S' do
 9          t'_xy ← tiling of s', prefer y over x
10          threads_bh' ← calculate number of threads for border handling for kernel K with tiling t'_xy
11          if threads_bh' < threads_bh then
12              s ← s'
13              t_xy ← t'_xy
14              threads_bh ← threads_bh'
15          end
16      end
17  else
18      t_xy ← tiling of s, prefer x over y
19  end
20  c_(s,t_xy) ← s, t_xy

Instead, the y-dimension is increased, and not only the first configuration size with the highest occupancy is considered. All configuration sizes that are within a given threshold, for instance 10% of the configuration size with the highest occupancy, are considered: The configuration is chosen such that the number of threads with boundary handling conditionals is minimized (e.g., a configuration of 32 × 6 is preferred over 32 × 4 for a kernel window size of 13 × 13; however, a configuration of 32 × 3 would be preferred to the two aforementioned configurations). Once a configuration size and tiling for a kernel have been determined according to the presented heuristic, the source-to-source compiler can emit the kernel configuration for invocations of the kernel.

3.3.1 Configuration Space Exploration

The rules of the architecture model decide which code variant is generated. However, the compiler also allows setting transformations manually using compiler switches. For example, the user can specify that scratchpad memory or a certain kind of texture memory should be turned on or off. Similarly, the amount of padding or the thread coarsening factor can be set by the user. This allows for an easy exploration of features for future architectures.

In addition, the source-to-source compiler can generate code to explore kernel tiling parameters using the --explore-config compiler switch. The generated code includes macro-based code snippets for features that depend on tiling (statically allocated scratchpad memory and the work-group size). During execution, the runtime system compiles those variants using Just-In-Time (JIT) compilation. The influence of the kernel configuration (size and shape) on execution time is shown in Figure 8 for the bilateral filter.

Figure 8: Configuration space exploration for the bilateral filter (filter window size: 13 × 13) for an image of size 4096 × 4096 on the Tesla K20, plotting execution time in ms over the number of threads per configuration. Optimum with 32 × 4: 18.49 ms; heuristic with 32 × 6: 18.68 ms.

Only configurations that are a multiple of the warp size are considered. In the example, execution times vary between 18.49 ms for a configuration of 32 × 4 and 41.84 ms for a configuration of 32 × 1. The proposed heuristic (Algorithm 1) selected the configuration of 32 × 6, which results in an execution time of 18.68 ms. The insights from explorations can be used to set an optimized configuration by hand or to fine-tune the heuristic.

4 RESULTS

In this section, we evaluate important aspects of domain-specific languages for image processing. In particular, we evaluate the performance of the generated implementations, the portability of the description itself, and the gains in productivity using the proposed DSL. In the following, we first investigate a single, representative sliding window algorithm in detail before we show how bigger applications can be described in the DSL.

4.1 Comparison against RapidMind, Halide, OpenCV, and NPP

We picked the Gaussian blur filter as a representative sliding window algorithm since implementations exist in highly optimized domain-specific libraries such as Open Source Computer Vision (OpenCV) and NVIDIA Performance Primitives (NPP). At the same time, implementations in other domain-specific languages such as RapidMind or Halide are available.

RapidMind [22] is a multi-core development platform that allows describing algorithms as computations on arrays, similar to the proposed framework. In fact, the kernel description (embedded into C++) differs only by the used keywords and data types. GPUs from NVIDIA are supported by the CUDA back end.

Halide [2] is a domain-specific language embedded into C++ for image processing and supports GPU accelerators through a CUDA and an OpenCL back end. Algorithms are expressed using functional expressions, and the target-specific optimization (the schedule) is defined separately by the programmer.

The OpenCV library is a widely used library for image processing and provides a CUDA and an OpenCL module to target GPU accelerators. Hand-tuned implementations are provided for each module, which are further optimized depending on the target GPU.

The NPP library is part of CUDA and provides optimized primitives for imaging and video processing on NVIDIA GPUs. In contrast to OpenCV, NPP supports only the processing of 8-bit gray-scale images for the considered primitives. There is also no boundary handling support in NPP; the valid image region shrinks after each kernel invocation.

4.1.1 Results

Table 1 and Table 2 show the execution times of the automatically generated implementations for the Gaussian blur filter on an image of 4096 × 4096 pixels on a Tesla K20 and a Radeon R9 290X. The optimizations applied by the HIPAcc framework include constant propagation and unrolling of the convolve method, using texture memory when reading from global memory (CUDA), staging the data to scratchpad memory, as well as unrolling of the global iteration space. The configuration is determined by the HIPAcc framework, and six specialized variants for boundary handling are generated. The tables also list a naïve implementation generated by HIPAcc where only global and constant memory are used with a fixed configuration of 128 × 1.

The results show that our generated implementations are faster on both GPUs than the low-level hand-written implementations (OpenCV and NPP) as well as the implementations generated by RapidMind and Halide. The RapidMind implementation crashed when Repeat was used as boundary handling mode and is otherwise slower by a factor of three. For RapidMind, we recompute the filter mask, which is faster than using precomputed values. For Halide, we explored different optimizations (schedules). It turned out that a separated schedule yields the fastest implementation on both GPUs. Halide also provides an RDom object that can be used the same way as a Domain in HIPAcc. Describing the convolution using RDom results in higher execution times (11.62 ms vs. 10.95 ms) on the K20 and is significantly faster on the R9 290X (0.82 ms vs. 1.10 ms). Since both Halide and RapidMind use a JIT compiler, NVIDIA's nvprof profiler is used to measure kernel timings.

The OpenCV implementation shows competitive performance across different boundary handling modes and is faster than the naïve implementation. OpenCV is even faster than NPP, although NPP does not implement boundary handling and only needs to transfer one fourth of the data (using uchar instead of float for the intermediate result).

Table 3 shows the execution times on the Intel Xeon Phi 7120P with similar results as for the AMD and NVIDIA cards: Our generated code is significantly faster than the implementation in OpenCV (3×) and more than 10 % faster than Halide.

For the ARM Mali T-604 running Android, we generate specialized Renderscript implementations. Table 4 shows that we get a speedup of roughly 40 % compared to a naïve implementation in Renderscript. While there is no other framework targeting Renderscript at the moment, we have shown previously that our generated Renderscript implementations outperform a) highly optimized implementations from the hardware vendor of the Mali T-604 when executed on the embedded GPU, as well as b) the hand-tuned and vectorized CPU implementations provided in OpenCV when executed on the ARM CPU (cf. [8]).


Table 1: Execution times in ms for the Gaussian blur filter on a Tesla K20 using the CUDA back end for an image of 4096 × 4096 pixels and a filter window size of 5 × 5.

            Undef.   Clamp   Repeat   Mirror   Const.
naïve       crash    3.04    3.14     3.15     3.19
RapidMind   5.40     6.00    crash    n/a      5.97
Halide      n/a      4.17    n/a      n/a      n/a
OpenCV      n/a      2.12    2.15     2.22     2.01
NPP†        2.40     n/a     n/a      n/a      n/a
HIPAcc      1.32     1.38    1.40     1.38     1.39

† NPP's implementation works on 8-bit data only; no support for boundary handling; output image dimensions are reduced by the filter window size.

Table 2: Execution times in ms for the Gaussian blur filter on a Radeon R9 290X using the OpenCL back end for an image of 4096 × 4096 pixels and a filter window size of 5 × 5.

            Undef.   Clamp   Repeat   Mirror   Const.
naïve       1.19     1.18    1.19     1.18     1.20
Halide†     n/a      0.82    n/a      n/a      n/a
OpenCV      n/a      0.89    1.43     0.90     0.91
HIPAcc      0.67     0.64    0.68     0.67     0.66

† Using RDom for the convolution; the normal schedule requires 1.10 ms.


4.1.2 Performance Model

Since it is known that stencil codes are usually bandwidth-limited, we use a simple performance model to estimate the performance of the generated code. For a separated Gaussian blur filter, we load pixels of type uchar and store float values in the first component. The second component loads float values and stores pixels back as uchar. If we assume that all neighboring memory accesses for the convolution are in cache, we have to transfer 4 · 2 + 1 · 2 = 10 bytes per pixel. On the Tesla K20 with an achievable memory bandwidth of b = 140 GB/s and for a problem size of N = 4096 × 4096 pixels, we thus estimate (N · 10 bytes)/b ≈ 1.20 ms for the Gaussian blur. This corresponds to the maximal performance the Gaussian blur filter can achieve due to the peak stream memory bandwidth in the Roofline model [9]. We use OpenCL's clEnqueueCopyBuffer() and Renderscript's copy1DRangeFrom() to measure this achievable memcpy bandwidth, which we consider as an upper bound.
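As a plain C++ sanity check of this bound, the constants quoted above (10 bytes per pixel, b = 140 GB/s, 4096 × 4096 pixels) reproduce the estimate; the snippet is illustrative only.

    // Back-of-the-envelope roofline estimate for the separated Gaussian blur;
    // all constants are the ones quoted in the text above.
    #include <cstdio>

    int main() {
        const double pixels          = 4096.0 * 4096.0;
        const double bytes_per_pixel = 4 * 2 + 1 * 2;  // two float + two uchar accesses
        const double bandwidth       = 140e9;          // achievable bytes/s on the Tesla K20
        const double t_ms = pixels * bytes_per_pixel / bandwidth * 1e3;
        std::printf("bandwidth-limited estimate: %.2f ms\n", t_ms);  // prints ~1.20 ms
        return 0;
    }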

Table 3: Execution times in ms for the Gaussian blur filter on a Xeon Phi 7120P using the OpenCL back end for an image of 4096 × 4096 pixels and a filter window size of 5 × 5.

          Undef.   Clamp   Repeat   Mirror   Const.
naïve     crash    6.42    5.96     6.34     6.39
Halide    n/a      3.94    n/a      n/a      n/a
OpenCV    n/a      10.96   14.06    11.13    9.75
HIPAcc    crash    3.57    3.78     3.70     3.87

Table 4: Execution times in ms for the Gaussian blur filter on a Mali T-604 using the Renderscript back end for an image of 4096 × 4096 pixels and a filter window size of 5 × 5.

          Undef.   Clamp    Repeat   Mirror   Const.
naïve     crash    342.87   348.34   338.58   348.36
HIPAcc    crash    213.38   214.28   215.71   219.65

Figure 9: Memory bandwidth of the Gaussian blur filter implementations in different frameworks (y-axis: memory bandwidth in %; devices: Tesla K20, Radeon R9 290X, Xeon Phi 7120P, Mali T-604; series: memcpy, RapidMind, Halide, OpenCV, NPP, HIPAcc). The memcpy bandwidth is given as reference.


Figure 9 shows the achieved memory performance of the Gaussian blur filter implementations in the different frameworks². It can be seen that the memory bandwidth utilization is highest when using HIPAcc. For NVIDIA hardware, we achieve almost the memcpy bandwidth and even exceed it on the AMD card. Efficient implementations for the current Xeon Phi coprocessor require prefetching, which is not exposed in OpenCL. This results in a bandwidth utilization that is far below the theoretical peak memory bandwidth. Similarly, the bandwidth utilization we see for the ARM Mali is far below the theoretical memory bandwidth. Here, we consider the dual-port memory bandwidth as peak, but the system might only use a single port of the memory. However, all other implementations we investigated suffer from the same low bandwidth utilization.

2. Note that the NPP implementation transfers only 4 bytes per pixel and the OpenCL implementation in OpenCV only 8 bytes per pixel.

4.1.3 Programming Effort

We consider the productivity metrics introduced by Halstead [10] in order to quantify the programming effort using HIPAcc. Halstead's metrics are based on two code features: operators and operands. Operands denote the variables and constants of a program on which operators act. If we count the number of distinct operators η1 and operands η2 as well as the total numbers of operators N1 and operands N2 within an algorithm implementation, we can compute the volume V, difficulty D, and effort E for the given implementation:

Volume:     V = (N1 + N2) × log2(η1 + η2)
Difficulty: D = (η1 / 2) × (N2 / η2)
Effort:     E = D × V



Table 5: Halstead's productivity measures for the Gaussian blur filter in HIPAcc, OpenCV (CUDA), and the generated CUDA implementation. Shown are the number of unique/total operators (η1/N1) and operands (η2/N2) as well as the resulting volume V and effort E for the implementation.

                    η1    η2    N1    N2         V            E   Lines of Code (LoC)
HIPAcc      col     15     3    23     5    261.70      3082.87    10
            row     16     5    26     7
OpenCV      col     42    49   357   230   7447.45    713047.41   185
            row     40    47   345   218
generated   col     32    78  2951  1509  74208.40  24701839.63  1190
            row     32   104  3935  2268


We use and extend the implementation from [23] to also recognize the CUDA language in order to compute Halstead's productivity metrics. Table 5 shows the productivity metrics as well as the LoC for the Gaussian blur filter description in HIPAcc, the corresponding CUDA implementation in OpenCV, as well as for the generated CUDA implementation, specialized for the Tesla K20. We describe the row and column components in HIPAcc separately in order to round the second component the same way as in OpenCV and thus obtain consistent results. Apart from rounding, the code of both components is identical, and a single description would be sufficient to generate the same specialized implementations. In contrast, two separate and completely different implementations are used in OpenCV. It can be seen that the description in HIPAcc has by far the smallest volume and effort. The volume of the generated code is 10× the volume of the hand-tuned CUDA implementation in OpenCV, and it requires an approximately 35× higher effort to implement.

We can estimate the time T required for the implementation if we divide the effort E by the Stroud number S:

Time: T = E / S seconds

Using 18 for S for fluent, concentrating programmers³ [25], we estimate the time T as follows: 3 minutes for the description in HIPAcc, 11 hours for the CUDA implementation in OpenCV, and 15.8 days for the generated CUDA implementation. Note that this takes into account neither the fact that parallel programming increases programming complexity nor the fact that the generated CUDA implementations compute 4 pixels per thread (4× the LoC).

3. Stroud [24] suggests 5 ≤ S ≤ 20 discriminations per second.
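Halstead's measures are straightforward to compute mechanically. The sketch below uses the HIPAcc operator and operand counts from Table 5, sums the column and row components (which reproduces the listed V and E), and converts the effort into the quoted time estimate with S = 18; it is an illustration, not part of the measurement tooling of [23].

    // Halstead volume, difficulty, and effort for the HIPAcc description of the
    // Gaussian blur (counts taken from Table 5), plus the time estimate T = E / S.
    #include <cmath>
    #include <cstdio>

    struct Counts { double eta1, eta2, N1, N2; };

    double volume(const Counts &c)     { return (c.N1 + c.N2) * std::log2(c.eta1 + c.eta2); }
    double difficulty(const Counts &c) { return (c.eta1 / 2.0) * (c.N2 / c.eta2); }

    int main() {
        const Counts col = {15, 3, 23, 5};  // HIPAcc column component (Table 5)
        const Counts row = {16, 5, 26, 7};  // HIPAcc row component (Table 5)

        double V = volume(col) + volume(row);                      // ~261.7
        double E = difficulty(col) * volume(col)
                 + difficulty(row) * volume(row);                  // ~3082.9
        double T = E / 18.0;                                       // seconds, roughly 3 minutes

        std::printf("V = %.2f, E = %.2f, T = %.0f s\n", V, E, T);
        return 0;
    }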

4.2 Applications

In addition to the Gaussian blur filter, we consider typical image processing algorithms for feature detection: a Laplacian edge detector, a multiresolution filter, a Harris corner detector [26], and the computation of optical flow using the census transform [27]. These algorithms form a realistic scenario and reflect possible cost-sensitive implementations in commercial products, such as augmented reality or driver assistance systems. All of them are based on local and point operators or a combination of both, but they differ greatly in implementation detail.



Figure 10 shows the number of operators and operator invocations for these algorithms. It also shows the LoC required to describe them in HIPAcc as well as the performance of the generated implementations. In the following, we summarize which operators are required for each algorithm and highlight features of the DSL used to realize them:

• The Laplacian operator is a sliding window operator similar to the Gaussian blur filter. However, the filter mask of size 5 × 5 cannot be separated and is realized as a single operator.

• The multiresolution filter first creates a pyramid representation of the input image with a depth of 6. Then, it applies the bilateral filter [16] to each image of the pyramid. Finally, it reconstructs the image at the most fine-grained level. For the image pyramid traversal, we use the traverse() function working on pyramids for the input image, the output image, and an image for intermediate results.

• The Harris corner detector embodies a combination of point and local operators that form a complex image pipeline. In total, twelve operator invocations are required to detect the corners in the input image.

• The optical flow includes the computation of a signature for each pixel of the smoothed input images. The signatures of two successive images are used to compute the optical flow: They are compared within a sliding window of size 15 × 15 using the iterate() function over a Domain that excludes the center.
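The following standalone C++ sketch mimics this access pattern; Domain and iterate() are re-implemented here purely for illustration and are not the HIPAcc API.

    // Conceptual sketch (not the HIPAcc API): visiting every offset of a 15x15
    // window except the center, as used for the census-signature comparison.
    #include <cstdio>
    #include <functional>

    constexpr int R = 7;  // 15x15 window -> radius of 7

    struct Domain {
        bool enabled[2 * R + 1][2 * R + 1];
        Domain() {
            for (int y = 0; y <= 2 * R; ++y)
                for (int x = 0; x <= 2 * R; ++x)
                    enabled[y][x] = true;
            enabled[R][R] = false;  // exclude the center pixel
        }
    };

    void iterate(const Domain &d, const std::function<void(int, int)> &body) {
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
                if (d.enabled[dy + R][dx + R]) body(dx, dy);
    }

    int main() {
        Domain d;
        int visited = 0;
        iterate(d, [&](int, int) { ++visited; });
        std::printf("offsets compared per pixel: %d\n", visited);  // 15*15 - 1 = 224
        return 0;
    }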

5 RELATED WORK

Abstractions are a compelling way to hide low-level details. High-level programming languages provide abstractions from storage locations or calling conventions of a processor by means of variables and function calls. However, these abstractions are specific to the target processor and offer no means to exploit optimizations specific to a given application domain or parallelism within algorithm descriptions. As a consequence, tools and languages have been developed to extract or capture the required knowledge. For example, the polyhedron model [28] relies on code analysis to extract information required for parallelization. The need for abstraction of domain properties resulted in several languages and frameworks for a variety of domains including image processing. A prominent example for DSLs is Delite [29], [30]: a common compiler and runtime infrastructure for building new performance-oriented DSLs. It provides facilities for defining and embedding a DSL in Scala. A DSL can make use of parallel building blocks provided by the Delite infrastructure, which can in turn be mapped efficiently to parallel execution patterns. Programs written in a DSL are broken down into these building blocks, which are then combined and optimized for parallel execution. A number of DSLs have been implemented on top of Delite, such as Liszt [4] for mesh-based PDE solvers or OptiML [5] for machine learning, but none for image processing. This work follows a different approach: Instead of breaking computations down into their building blocks in order to combine them as required, it employs domain-specific optimizations using domain knowledge for code generation.


(a) Laplace: 1 operator & 1 invocation; HIPAcc: 8 + 6 LoC; Radeon R9 290X: 408.33 FPS; Tesla K20: 272.79 FPS; Xeon Phi 7120P: 16.84 FPS; Mali T-604: 2.88 FPS.

(b) Multiresolution filter: 4 operators & 21 invocations; HIPAcc: 29 + 17 LoC; Radeon R9 290X: 362.85 FPS; Tesla K20: 159.49 FPS; Xeon Phi 7120P: compiler segfault; Mali T-604: 1.77 FPS.

(c) Harris corner: 6 operators & 12 invocations; HIPAcc: 47 + 26 LoC; Radeon R9 290X: 232.99 FPS; Tesla K20: 137.58 FPS; Xeon Phi 7120P: 77.30 FPS; Mali T-604: 1.26 FPS.

(d) Optical flow: 3 operators & 5 invocations; HIPAcc: 26 + 28 LoC; Radeon R9 290X: 163.59 FPS; Tesla K20: 60.66 FPS; Xeon Phi 7120P: 65.46 FPS; Mali T-604: 0.44 FPS.

Figure 10: Algorithms in HIPAcc: edge detection (a), multiresolution processing (b), corner detection (c), as well as optical flow (d). Listed are the number of operators required for the algorithms as well as the total number of operator invocations. The (X + Y) LoC denote the DSL parametrization (X) and the algorithm implementation (Y) in DSL code. All benchmarks were performed using OpenCL on 4K UltraHD (3840 × 2160) images.


Other domain-specific or domain-agnostic approaches focus on image processing just as HIPAcc does: These include RapidMind [22], Halide [2], KernelGenius [3], and the work of Howes et al. [31] and Cornwall et al. [32].

The work closest to the work at hand is the RapidMind multi-core development platform [22], targeting standard multi-core processors as well as accelerators like the Cell Broadband Engine (Cell B. E.) and GPUs. The RapidMind technology is based on Sh [33], a high-level metaprogramming language for graphics cards. RapidMind provides its own data types that can be arranged in multi-dimensional arrays. Boundary handling properties are defined on accessors, and neighboring elements can be accessed using the shift() method on input data. Since there are no details on code generation for boundary handling publicly available for RapidMind, the approach followed in this work can only be compared quantitatively with the one of RapidMind. However, our evaluation indicates that no domain-specific knowledge is used for code generation. In 2009, Intel acquired RapidMind and incorporated the RapidMind technology into Intel Array Building Blocks (ArBB) [34]. The focus of Intel's ArBB is on vector parallel programming, and for that reason, image processing features of RapidMind like generic boundary handling support were not adopted. Also, the back ends for accelerators were dropped.

Cornwall et al. [32] introduced a domain-specific framework for visual effects that generates target code for CUDA. Their framework is based on indexed metadata in the form of C++ classes. Similar optimizations are available for code generation (staging data to local memory, realignment of threads for coalescing, 1:N mapping, and automatic work-group size and shape selection). However, only data containers for images are provided. The work at hand introduces notations for other domain-specific traits such as filter masks and provides additional domain-specific optimizations such as specialized code variant generation for boundary handling.

Howes et al. [31] propose a framework for decoupled access/execute (Æcute) specification, capturing both execution constraints and memory access patterns of a computational kernel. Their domain-agnostic framework is also based on C++ classes with tailored implementations for the Cell B. E. and standard multi-core processors. In [35], the authors highlight possible annotations for GPU accelerators using CUDA as target language. They provide hand-tuned implementations showing the benefit that can be achieved if a source-to-source compiler makes use of this knowledge. HIPAcc is capable of employing not only the proposed optimizations, but also further transformations. We have shown that we can achieve the same performance as their hand-tuned code [36].

Halide [2], a DSL for image processing, follows a promising path by using functional programming to express kernels in a compact and concise way. Using functional programming, images have no explicit storage, but are pure functions that define the value of each pixel. The schedule for these functions has to be specified separately and determines how the computation is mapped to the target hardware and whether memory needs to be allocated for the computation. In particular, image processing pipelines benefit from this concept. However, no domain-specific optimizations are applied during code generation. Boundary handling is realized as a function that is applied for each image pixel (with the resulting overhead).

KernelGenius [3] provides a DSL for image processing kernels using C-like syntax and generates target OpenCL code for the STHORM embedded many-core architecture of STMicroelectronics. Functions in a kernel can be combined using common predefined templates to create an extended Synchronous Dataflow (SDF) graph. Nodes of this graph are then scheduled within an OpenCL kernel to exploit locality. A concise syntax is provided to express boundary handling modes and convolutions.



6 CONCLUSION

In this article, we have introduced a DSL for image processing and a source-to-source compiler to generate highly efficient parallel implementations. Both are part of the HIPAcc framework and are available as open source at http://hipacc-lang.org.

We have shown that a close co-design of language and compiler can exploit domain-specific properties for specialization. Considering target architecture information yields tailored implementations that are in many cases faster than corresponding hand-tuned implementations. The abstractions of the DSL allow an algorithm to be mapped to a broad spectrum of GPU target architectures and to exploit the available types of parallelism. The algorithm description itself is not contaminated with the mapping and parallelization for a specific architecture or other hardware-dependent optimizations. We plan to also exploit this property for vectorization and have recently shown that it can be used for the synthesis of efficient hardware designs [37].

ACKNOWLEDGMENTS

This work is supported by the German Research Foundation (DFG), as part of the Research Training Group 1773 "Heterogeneous Image Systems". The Tesla K20 used for this research was donated by the NVIDIA Corporation.

REFERENCES

[1] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming", Parallel Computing, vol. 38, no. 8, pp. 391–407, 2011.

[2] J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand, "Decoupling algorithms from schedules for easy optimization of image processing pipelines", ACM Transactions on Graphics (TOG), vol. 31, no. 4, 32:1–32:12, 2012.
[3] T. Lepley, P. Paulin, and E. Flamand, "A novel compilation approach for image processing graphs on a many-core platform with explicitly managed memory", in Proc. of the 2013 Int. Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), IEEE, 2013, 6:1–6:10.
[4] Z. DeVito, N. Joubert, F. Palacios, S. Oakley, M. Medina, M. Barrientos, E. Elsen, F. Ham, A. Aiken, K. Duraisamy, E. Darve, J. Alonso, and P. Hanrahan, "Liszt: A domain specific language for building portable mesh-based PDE solvers", in Proc. of the 2011 Int. Conference for High Performance Computing, Networking, Storage and Analysis (SC), ACM, 2011, 9:1–9:12.
[5] A. K. Sujeeth, H. Lee, K. J. Brown, T. Rompf, H. Chafi, M. Wu, A. R. Atreya, M. Odersky, and K. Olukotun, "OptiML: An implicitly parallel domain-specific language for machine learning", in Proc. of the 28th Int. Conference on Machine Learning (ICML), ACM, 2011, pp. 609–616.
[6] R. Membarth, F. Hannig, J. Teich, M. Körner, and W. Eckert, "Generating device-specific GPU code for local operators in medical imaging", in Proc. of the 26th IEEE Int. Parallel & Distributed Processing Symposium (IPDPS), IEEE, 2012, pp. 569–581.
[7] R. Membarth, "Code generation for GPU accelerators from a domain-specific language for medical imaging", Verlag Dr. Hut, Munich, Germany, Dissertation, Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany, 2013.
[8] R. Membarth, O. Reiche, F. Hannig, and J. Teich, "Code generation for embedded heterogeneous architectures on Android", in Proc. of the Conference on Design, Automation and Test in Europe (DATE), IEEE, 2014, 86:1–86:6.
[9] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures", Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
[10] M. H. Halstead, Elements of Software Science, ser. Operating and Programming Systems. Elsevier, 1977.
[11] I. N. Bankman, Handbook of Medical Image Processing and Analysis. Academic Press, 2008, vol. 2.
[12] J. C. Russ, The Image Processing Handbook. CRC Press, 2006, vol. 5.
[13] R. Klette and P. Zamperoni, Handbook of Image Processing Operators. John Wiley & Sons, 1996, vol. 1.
[14] P. Burt and E. Adelson, "The Laplacian pyramid as a compact image code", IEEE Transactions on Communications, vol. 31, no. 4, pp. 532–540, 1983.
[15] J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly Media, 2007.
[16] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images", IEEE, 1998, pp. 839–846.
[17] G. E. Blelloch, "Prefix sums and their applications", in Synthesis of Parallel Algorithms, J. H. Reif, Ed., Morgan Kaufmann, 1993, ch. 1, pp. 35–60.
[18] A. V. Aho, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986, vol. 2.
[19] N. Wirth, "Program development by stepwise refinement", Communications of the ACM, vol. 14, no. 4, pp. 221–227, 1971.
[20] R. Karrenberg and S. Hack, "Whole-function vectorization", in Proc. of the 9th Annual IEEE/ACM Int. Symposium on Code Generation and Optimization (CGO), IEEE, 2011, pp. 141–150.
[21] M. Wolfe, High Performance Compilers for Parallel Computing. Addison-Wesley, 1996.
[22] RapidMind, RapidMind development platform documentation, RapidMind Inc., 2009.
[23] C. H. González and B. B. Fraguela, "A generic algorithm template for divide-and-conquer in multicore systems", in Proc. of the 12th Int. Conference on High Performance Computing and Communications (HPCC), IEEE, 2010, pp. 79–88.
[24] J. M. Stroud, "The fine structure of psychological time", Information Theory in Psychology, 1956.
[25] R. D. Gordon and M. H. Halstead, "An experiment comparing Fortran programming times with the software physics hypothesis", in Proc. of the National Computer Conference; American Federation of Information Processing Societies (AFIPS), ACM, 1976, pp. 935–937.
[26] C. Harris and M. Stephens, "A combined corner and edge detector", in Proc. of the 4th Alvey Vision Conference, 1988, pp. 147–151.
[27] F. Stein, "Efficient computation of optical flow using the census transform", in Pattern Recognition, ser. Lecture Notes in Computer Science, vol. 3175, Springer, 2004, pp. 79–86.
[28] P. Feautrier and C. Lengauer, "Polyhedron model", in Encyclopedia of Parallel Computing, Springer, 2011, pp. 1581–1592.
[29] H. Chafi, Z. DeVito, A. Moors, T. Rompf, A. K. Sujeeth, P. Hanrahan, M. Odersky, and K. Olukotun, "Language virtualization for heterogeneous parallel computing", in Proc. of the ACM Int. Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA), ACM, 2010, pp. 835–847.

[30] H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun, "A domain-specific approach to heterogeneous parallelism", in Proc. of the 16th Annual Symposium on Principles and Practice of Parallel Programming (PPoPP), ACM, 2011, pp. 35–46.

[31] L. Howes, A. Lokhmotov, A. Donaldson, and P. H. J. Kelly, "Deriving efficient data movement from decoupled access/execute specifications", in Proc. of the 4th Int. Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Springer, 2009, pp. 168–182.
[32] J. L. Cornwall, L. Howes, P. H. J. Kelly, P. Parsonage, and B. Nicoletti, "High-performance SIMT code generation in an active visual effects library", in Proc. of the 6th ACM Conference on Computing Frontiers (CF), ACM, 2009, pp. 175–184.
[33] M. McCool, S. Du Toit, T. Popa, B. Chan, and K. Moule, "Shader algebra", ACM Transactions on Graphics (TOG), vol. 23, no. 3, pp. 787–795, 2004.
[34] C. J. Newburn, B. So, Z. Liu, M. McCool, A. Ghuloum, S. Du Toit, Z. G. Wang, Z. H. Du, Y. Chen, G. Wu, P. Guo, Z. Liu, and D. Zhang, "Intel's Array Building Blocks: A retargetable, dynamic compiler and embedded language", in Proc. of the 9th Annual IEEE/ACM Int. Symposium on Code Generation and Optimization (CGO), IEEE, 2011, pp. 224–235.
[35] L. Howes, A. Lokhmotov, A. F. Donaldson, and P. H. J. Kelly, "Towards metaprogramming for parallel systems on a chip", in Proc. of the 3rd Workshop on Highly Parallel Processing on a Chip (HPPC), Springer, 2009, pp. 36–45.
[36] R. Membarth, A. Lokhmotov, and J. Teich, "Generating GPU code from a high-level representation for image processing kernels", in Proc. of the 5th Workshop on Highly Parallel Processing on a Chip (HPPC), Springer, 2011, pp. 270–280.
[37] O. Reiche, M. Schmid, F. Hannig, R. Membarth, and J. Teich, "Code generation from a domain-specific language for C-based HLS of hardware accelerators", in Proc. of the Int. Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), ACM, 2014, 17:1–17:10.

7 AUTHOR BIOGRAPHIES

Richard Membarth is a senior researcher at the German Research Center for Artificial Intelligence (DFKI). He holds a diploma degree in Computer Science from the University of Erlangen-Nuremberg and a postgraduate diploma in Computer and Information Sciences from Auckland University of Technology. In 2013, he received his Ph.D. (Dr.-Ing.) from the University of Erlangen-Nuremberg on automatic code generation for GPU accelerators from a domain-specific language for medical imaging. After his Ph.D., he joined the Graphics Chair and the Intel Visual Computing Institute at Saarland University as a postdoctoral researcher. His research interests include parallel computer architectures and programming models with a focus on automatic code generation.

Oliver Reiche holds a bachelor's degree and a master's degree in Computer Science, both from the University of Applied Sciences in Nuremberg. Since 2012, he has been a Ph.D. student at the Chair for Hardware/Software Co-Design in the Department of Computer Science at the University of Erlangen-Nuremberg and a member of the Research Training Group Heterogeneous Image Systems. His research interests are embedded systems, efficient mapping strategies of image algorithms to heterogeneous architectures, and domain-specific languages.

Frank Hannig has led the Architecture and Compiler Design Group in the CS Department at the University of Erlangen-Nuremberg, Germany, since 2004. He received a diploma degree in an interdisciplinary course of study in EE and CS from the University of Paderborn, Germany, in 2000 and a Ph.D. in CS from the University of Erlangen-Nuremberg in 2009. His main research interests are the design of massively parallel architectures, ranging from dedicated hardware to multi-core architectures, mapping methodologies for domain-specific computing, and architecture/compiler co-design.

Jürgen Teich received an M.S. degree (Dipl.-Ing.; with honors) from the University of Kaiserslautern, Germany, in 1989 and a Ph.D. (summa cum laude) from Saarland University, Germany, in 1993. In 1994, he joined the DSP design group of Prof. E. A. Lee in the Department of Electrical Engineering and Computer Sciences (EECS), University of California at Berkeley (PostDoc). From 1995 to 1998, he held a position at the Institute of Computer Engineering and Communications Networks Laboratory (TIK), ETH Zurich, Switzerland (Habilitation). From 1998 to 2002, he was a Full Professor in the Electrical Engineering and Information Technology Department, University of Paderborn, Germany. Since 2003, he has been a Full Professor in the Department of Computer Science, University of Erlangen-Nuremberg, Germany, holding a chair in Hardware/Software Co-Design.

Mario Körner works as a software architect for the Angiography and Interventional X-Ray Systems business unit at Siemens Healthcare. He completed his degree in Computer Science at the University of Erlangen-Nuremberg in 2007. Since then he has been working in different roles on the development of medical imaging applications for interventional procedures. His main interests are algorithms for image processing, visualization and geometry calibration, software engineering methods, efficient mapping to hardware accelerators, as well as usability aspects for making new technologies available to medical users.

Wieland Eckert received an M.S. degree in Computer Science (Dipl.-Inf.) and a Ph.D. (Dr.-Ing.) from the University of Erlangen-Nuremberg in 1991 and 1996, respectively. From 1996 to 1999 he held a Senior Researcher position at AT&T Labs Research in Florham Park, NJ, in the speech recognition and dialog systems laboratory. In 1999 he joined Lucent Technologies in Nuremberg, Germany, designing and developing architectures and software for embedded communication systems. In 2004 he joined Siemens Healthcare as a Senior Software Architect for interventional X-ray systems. Since 2012 he has been with the research unit Corporate Technology of Siemens AG. His main research interests are effective software architectures and the challenges of efficient implementation on new and powerful hardware designs like multi-core and many-core processors combined with easy-to-use programming models.
