DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS

STOCKHOLM, SWEDEN 2017

ART vs. NDK vs. GPU acceleration: A study of performance of image processing algorithms on Android

ANDREAS PÅLSSON

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Master in Computer Science
Date: June 26, 2017
Supervisor: Cyrille Artho
Examiner: Johan Håstad
Swedish title: ART, NDK eller GPU acceleration: En prestandastudie av bildbehandlingsalgoritmer på Android
School of Computer Science and Communication

Abstract

The Android ecosystem contains three major execution platforms, each suitable for different purposes. Android applications are normally written in the Java programming language, but computationally intensive parts of Android applications can be sped up by using a native language or by utilising the parallel architecture found in graphics processing units (GPUs). The experiments conducted in this thesis measure the performance benefits of switching from Java to C++ or RenderScript, Google’s GPU acceleration framework.

The experiments consist of common image processing tasks. For some of these tasks, optimized libraries and implementations already exist. The performance of the implementations provided by third parties is compared to that of our own.

Our results show that for advanced image processing on large images, the benefits are large enough to warrant using C++ or RenderScript instead of Java on modern smartphones. However, if the image processing is conducted on very small images (e.g. thumbnails) or the image processing task contains few calculations, moving to a native language or RenderScript is not worth the added development time and static complexity.

RenderScript is the best choice if the GPU vendors provide an optimized implementation of the processing task. If no such implementation is provided, both C++ and RenderScript are viable choices. If full precision is required in the floating point arithmetic, a C++ implementation is recommended. If the desired effect can be achieved without compliance with the IEEE Standard for Floating-Point Arithmetic, RenderScript provides better run time performance.

Sammanfattning (Summary)

The Android ecosystem contains three execution platforms suited to different purposes. Android applications are usually written in the Java programming language, but computationally intensive parts of an Android application can be sped up by using a statically compiled language or by exploiting the parallel architecture found in graphics processors. The experiments carried out in this project aim to measure the performance improvements that can be achieved by switching from Java to C++ or RenderScript, Google’s graphics acceleration framework.

The experiments consist of frequently used image processing algorithms. For some of these, optimized libraries and other ready-made implementations exist. The performance of the third-party libraries is compared with that of our implementations.

Our results show that for advanced image processing, the performance improvements are good enough to justify using C++ or RenderScript instead of Java on modern smartphones. When the image processing is done on very small images (for example thumbnails) or involves few calculations, switching from Java to RenderScript or C++ is not worth the extra development time and the static code complexity.

RenderScript is the best choice when the graphics processor manufacturers provide implementations of the algorithm to be run. If no such implementation exists, both C++ and RenderScript are applicable choices. If exact precision is required, a C++ implementation is recommended. If, on the other hand, full precision is not needed in floating point calculations, RenderScript is recommended instead.

Contents

1 Introduction
    1.1 Problem
    1.2 Research Question
    1.3 Scope
    1.4 Ethics and sustainability
    1.5 Structure

2 Background
    2.1 Native and interpreted languages
        2.1.1 Java
        2.1.2 C++
        2.1.3 Performance
    2.2 Android
    2.3 Android application compilation
    2.4 Dalvik
    2.5 Android Runtime
        2.5.1 Ahead-of-time (AOT) compilation
        2.5.2 Improved garbage collection
        2.5.3 Just-in-time (JIT) compilation
    2.6 Android Native Development Kit
    2.7 RenderScript
        2.7.1 Compilation and deployment
        2.7.2 Floating point precision
    2.8 Image Processing
        2.8.1 Image smoothing
        2.8.2 Grayscaling
        2.8.3 Thresholding
    2.9 Color spaces

3 Related Work
    3.1 Java and C++ benchmarks
    3.2 Dalvik vs ART
    3.3 Using GPU for calculations
        3.3.1 RenderScript
        3.3.2 OpenCL

4 Method
    4.1 Choice of method and algorithms
    4.2 Development environment and devices
    4.3 Implementation
        4.3.1 Color space conversion
        4.3.2 Blurring
        4.3.3 Grayscaling and thresholding
    4.4 Measuring Runtime Performance
        4.4.1 Image processing
        4.4.2 Setup
    4.5 Verifying results

5 Results
    5.1 Color space conversion
    5.2 Blurring
        5.2.1 Box filter
        5.2.2 Median filter
        5.2.3 Gaussian filter
    5.3 Grayscaling
    5.4 Thresholding

6 Discussion
    6.1 Color space conversion
    6.2 Blurring
    6.3 Grayscaling and thresholding
    6.4 Overall Performance
    6.5 Threats to validity
        6.5.1 Choice of algorithms
        6.5.2 High variance
        6.5.3 Devices
        6.5.4 Image sizes
        6.5.5 Optimization
    6.6 Future Research

7 Conclusion

Bibliography

A Tables
    A.1 Blurring
        A.1.1 Box filter
        A.1.2 Median filter
        A.1.3 Gaussian filter
    A.2 Grayscaling
    A.3 Thresholding

Chapter 1

Introduction

The first version of the mobile operating system Android was released in fall 2008. As of January 2017, it is the most widely used smartphone operating system [14]. It is used all over the world, on devices and networks of varying quality. For these reasons, it is important for mobile application developers to be able to develop high quality applications that work well on low-end devices in third world countries.

Android application developers can choose to write the business logic of an application in Java or in a native language (a source language that is compiled directly to machine code). Google recommends the use of Java [8]. However, for computationally intensive tasks it can be advantageous to use a native language, as native code is generally faster than Java [8], so that the user experience is not impeded.

Moreover, a developer can use GPU (graphics processing unit) accelerated computing to exploit the full capabilities of the device. This means offloading compute-intensive portions of the code to the device’s graphics processor, while the remainder of the code stays on the CPU (central processing unit). This allows the device to take advantage of the massively parallel architecture of the GPU.

Code written to run on a GPU does not have to be custom-written for each type of GPU, but can be compiled from a higher-level language. This means that GPU acceleration is more readily available to developers today than it has traditionally been.

As today’s users of technological products see more and more virtual and augmented reality products, it is of utmost importance to keep the experience as smooth as possible. Many new technologies offer a more visual experience than before, which further increases the need for performance, since graphics processing requires large amounts of heavy calculations.

1.1 Problem

Java is the recommended programming language for building Android applications. However, the Java programming language contains features designed to improve safety and convenience at the expense of performance, e.g., automatic memory management.

Therefore, Google suggests that it might be useful for a developer to use a native language over Java in two cases [8]:

• Squeeze extra performance out of a device to achieve low latency or run computationally intensive applications, such as games or physics simulations.

• Reuse your own or other developers’ C or C++ libraries.

This thesis examines the first case and investigates how large the performance benefits can be when conducting real time image processing. Furthermore, GPU acceleration can provide even greater performance improvements due to the increased level of parallelization in the hardware. The problem is that, as the performance of the Android system itself has increased, it is hard to know whether the performance benefit of using a language other than Java is worth the extra complexity of adding another programming language to a software project.

Image processing contains many computationally intensive processes and is therefore a candidate where it might be useful to switch to a native language or a framework that allows use of GPU acceleration.

1.2 Research Question

The question this thesis intends to answer is the following:

Can performance increases in run time warrant the usage of C++ or GPU acceleration frameworks over Java when writing image processing algorithms on Android?

1.3 Scope

A developer might choose not to use a native language or a GPU acceleration framework over Java, despite the performance benefits, most likely because the performance improvements do not outweigh the complexity added to the software project. This project does not intend to extensively measure the development complexity added by using these components in an Android project.

1.4 Ethics and sustainability

The work presented in this thesis aims to be as ethical as possible, in the sense that all results presented in this thesis are reproducible from the description in Chapter 4.

Sustainability rests on three pillars: social, economic and environmental. The performance analysis presented in this thesis only touches the last two pillars, as it has no bearing on social sustainability. Achieving higher performance on a mobile device might lead to lower battery usage and fewer clock cycles used, saving energy and therefore contributing to greater environmental and economic sustainability.

1.5 Structure

This report consists of seven chapters. Chapter 2 contains technical information needed to understand the project. Chapter 3 contains previous research conducted in the area. Chapter 4 outlines the experiments conducted in this project. Chapter 5 contains the results from the aforementioned experiments. Chapter 6 contains discussions regarding the results and their possible applications, as well as possible extensions to the research. Chapter 7 contains our conclusion and final answer to the research question.

Chapter 2

Background

This chapter contains the background information needed to understand the rest of this thesis.

2.1 Native and interpreted languages

A fundamental difference between interpreted and native languages is that a native language is compiled to instructions that can be executed directly by the processor. This section gives an overview of their compilation processes and the differences between them.

2.1.1 Java

Java is an interpreted, object-oriented programming language originally developed by Sun Microsystems and now maintained by Oracle. Java source code is compiled to bytecode that runs on the Java Virtual Machine (JVM). It is commonly associated with the slogan Write once, run anywhere, since compiled Java code can run on any platform without being recompiled for each architecture: the compiled bytecode can run on any JVM, and the JVM takes care of translating the bytecode to instructions that the host CPU can understand.

Figure 2.1: Steps in compilation to Java bytecode (.java files → Java compiler → .class files containing Java bytecode)

Figure 2.1 shows the process of compiling Java source code to its corresponding bytecode, which is then processed by the JVM.

2.1.2 C++

C++ is a native language. It is compiled directly to native processor instructions and can therefore skip the translation step required by the JVM. This also means that it is not architecture independent: the code has to be compiled for each architecture it is supposed to run on.

Figure 2.2: Steps in compiling native language source files to an executable (C++ files → compiler → assembler → linker)

The native-language source files are compiled to assembly code by the compiler. The code generated by the compiler is then assembled into object code for the platform. This object code is then linked together with library dependencies and other code needed to produce the actual executable, as shown in Figure 2.2.

2.1.3 Performance

The runtime performance achieved using C++ can be higher than that of Java running on a JVM. C++ lacks automatic garbage collection, a feature that gives Java developers convenience at the cost of run time performance. Another reason that Java performance is penalized is that it does not let developers allocate objects on the stack. Allocating and accessing memory on the heap is more costly, creating overhead for Java implementations. In C++ the stack is freely available to developers, making memory access and allocation faster.

2.2 Android

Android is an operating system developed by Google, designed primarily for use on smartphones and tablets. It is based on the Linux kernel. The Android operating system has also been customized to run on smart watches, TVs and in cars. It is the most widely used mobile operating system, with a market share of 88% [4].

The Linux kernel lies at the root of the Android architecture. On top of it there are native libraries and middleware, for example WebKit and OpenGL. On top of the native libraries lies the application framework, which provides APIs (Application Programming Interfaces) for developers to use when building Android applications. Applications that can be found on Google Play are written on top of all this, in the application layer, as shown in Figure 2.3.

The runtime, responsible for running applications on the smartphone, lies between the native libraries and the application framework.


Figure 2.3: The Android software stack. From top to bottom: Applications (Phone, Contacts, ...), Application Framework (managers for activities, packages, ...), Libraries (SQLite, OpenGL) and Runtime (Dalvik/ART), Linux Kernel (Display, WiFi, Camera, ...)

2.3 Android application compilation

Compiling for the Android platform adds extra steps to the Java compilation process described above. The first compilation step produces standard JVM bytecode (.class-files) from the source code. This bytecode is not compatible with Android devices, since Google developed the Dalvik Virtual Machine (DVM), which uses a different bytecode format. The next compilation step therefore takes any .jar-libraries and the .class-files and converts them to DVM bytecode.

Figure 2.4: Converting the bytecode to DVM format (.class-files and .jar-libraries → Dalvik converter → .dex-file)

The resulting DVM bytecode is contained in a single .dex-file, as shown in Figure 2.4. This file is packaged together with any application resources (e.g., layouts, images) into an Android Package file (.apk-file). The package can then be deployed to and installed on devices running the Android operating system.

When the application is running, the bytecode is interpreted on the host device and the resulting instructions are passed to the CPU for execution. In the 2.2 release of Android, a Just In Time (JIT) compiler was added to the DVM. This meant that frequently run code could be compiled to native code, and the DVM could achieve higher performance because it could effectively skip the interpretation step for that code.

2.4 Dalvik

Dalvik was the standard virtual machine (VM) on devices running Android versions 4.4 and earlier. It differs from Oracle’s standard JVM in certain aspects. The Dalvik VM was constructed for mobile devices with limited memory and storage space. It uses a register-based architecture, as opposed to the regular JVM’s stack-based architecture, and therefore requires fewer virtual machine instructions. The uncompressed .dex-files used by the Dalvik VM are often a few percent smaller than a compressed Java archive, making them more suitable for the limited storage on Android devices.

Dalvik has had trace-based just-in-time (JIT) compilation since the release of Android 2.2. JIT compilation allows Dalvik to compile frequently executed code ("traces") to native code. Even though Dalvik interprets the remaining bytecode, this dynamic compilation provided significant performance improvements [3].

There are also drawbacks to using a VM with a JIT compiler such as Dalvik, as opposed to using a native language. Because the optimization runs alongside the program, it is subject to time constraints, which lowers the degree of optimization possible compared to static compilers.

2.5 Android Runtime

Android Runtime (ART) is Dalvik’s successor and is the standard runtime for Android applications and certain system services. ART was, like its predecessor Dalvik, created specifically for the Android operating system and was optimized for devices with a limited amount of memory and storage space. ART introduced a number of features to improve performance in the Android system.

2.5.1 Ahead-of-time (AOT) compilation

As opposed to only using JIT to compile certain parts of the bytecode to native code, ART compiles applications to native code at install time. By eliminating Dalvik’s interpretation and JIT compilation, run time performance and battery consumption were improved [12].

It is important to note that certain optimizations that are possible with JIT compilation cannot be offered by AOT compilation. Static analysis is very difficult in the general case, and optimization done at install time is therefore difficult. The JIT compiler does not have this problem, as it does not have to statically analyze the code; the program’s behavior is observable at runtime.

JIT compilers, however, have a bigger problem with resource consumption. An AOT compiler can take a longer time without worrying about stealing resources from the program at hand, whereas a JIT compiler must not slow down the application it is optimizing.

Figure 2.5: Steps in conducting ahead of time compilation. Build process: .dex-file and resources → zip → .apk-file. Installation on smartphone: .apk-file → .dex-file and resources; .dex-file → .elf-file


Figure 2.5 outlines the process of AOT compilation. An .apk-file is created from the .dex-file and the resources. When the package is installed, it is unpacked and processed by a tool called dex2oat, which compiles the .dex-file to native code and produces an .elf-file. An .elf-file is an executable file that can be executed natively by the processor, instead of relying on a virtual machine to interpret the bytecode.

2.5.2 Improved garbage collection

Garbage collection (GC) is the process of reclaiming system memory occupied by objects that are no longer in use by the program. Poor use of objects in an application makes the GC do a lot of work, which impairs the application’s performance and results in choppy display and poor responsiveness.

The garbage collector in Dalvik is invoked if any of these conditions is true:

• An OutOfMemoryError is about to be triggered

• The heap size hits a limit

• GC was explicitly requested

A typical garbage collection is triggered by the allocation limit being reached. The actual collection is done using a Mark-Sweep algorithm. The algorithm consists of two phases: mark and sweep. In the first phase it finds and marks all accessible objects. In the second phase it scans through the heap and reclaims all objects that have not been marked. Both of these phases halt the execution of the program.

As opposed to the two pauses in Dalvik, the ART GC only pauses once. The mark phase in ART’s GC is done concurrently by letting threads mark their own objects [2].
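The mark and sweep phases can be illustrated with a toy collector over an explicit object graph. The following is a minimal Java sketch, not Dalvik’s or ART’s implementation: the class MarkSweepSketch, its Node type and the demo method are hypothetical names introduced only for illustration, and a real collector works on raw heap memory rather than on a list of objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy mark-sweep collector over an explicit object graph (illustrative only). */
final class MarkSweepSketch {
    /** A simulated heap object with outgoing references and a mark bit. */
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        boolean marked;
        Node(String name) { this.name = name; }
    }

    /** Mark phase: flag every object reachable from the given node. */
    static void mark(Node node) {
        if (node == null || node.marked) return; // already visited, also guards against cycles
        node.marked = true;
        for (Node child : node.refs) mark(child);
    }

    /** Sweep phase: remove unmarked objects, reset marks, return the number reclaimed. */
    static int sweep(List<Node> heap) {
        int before = heap.size();
        heap.removeIf(n -> !n.marked);
        for (Node n : heap) n.marked = false;
        return before - heap.size();
    }

    /** One full collection cycle: mark from all roots, then sweep the heap. */
    static int collect(List<Node> heap, List<Node> roots) {
        for (Node root : roots) mark(root);
        return sweep(heap);
    }

    /** Builds a four-object heap in which only a and b are reachable, then collects. */
    static int demo() {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.refs.add(b);
        c.refs.add(d); // c references d, but neither is reachable from the root
        List<Node> heap = new ArrayList<>(Arrays.asList(a, b, c, d));
        return collect(heap, Collections.singletonList(a));
    }
}
```

In demo, only a and b are reachable from the root, so the collection reclaims the two remaining objects.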

2.5.3 Just-in-time (JIT) compilation

In Android 7.0 Google added a JIT compiler to complement ART’s AOT compilation process. The JIT compiler can perform runtime optimizations in order to improve run time performance. ART uses profile-guided optimization, in which a profiler identifies select methods to precompile and cache. This feature further reduces applications’ memory usage.


2.6 Android Native Development Kit

The Android Native Development Kit (NDK) is a set of tools that allows developers to write parts of their Android application in native languages such as C or C++. It can be used to achieve low latency or to run computationally intense code. Furthermore, it enables reuse of previously written C/C++ code.

The NDK is used to compile C/C++ code into a native library and package it into the application package. Java code can then use the Java Native Interface (JNI) to call functions in the native library. It is worth noting that crossing this Java-native boundary might degrade performance compared to calling a Java method [11].

Native code is platform specific, so for the native code to work on all devices it must be compiled for every supported device architecture (e.g., ARM, x86).

2.7 RenderScript

RenderScript (RS) is a framework in the Android ecosystem for running computationally intensive tasks at high performance using heterogeneous computing, and is primarily oriented towards data-parallel computation. The RenderScript runtime on Android parallelizes work across the multi-core CPUs and GPUs available on a device.

Using the GPU for compute-intensive functions can be beneficial because its architecture differs from that of the CPU. A CPU consists of cores optimized for sequential processing, whereas a GPU has a parallel architecture consisting of many smaller cores, optimized for handling multiple tasks simultaneously.

RenderScript itself is a C99-derived language, and code written in it is compiled on the device at runtime to allow platform independence. The performance gain over Java comes from executing native code on the device. As opposed to code built with the NDK, RenderScript code is cross-platform. The RenderScript code is compiled to a device-agnostic intermediate format before being packaged in the application package. The scripts are then compiled to machine code and optimized on the device when the application is run. The device decides at runtime whether the computation should be run on the CPU or the GPU.

2.7.1 Compilation and deployment

Compilation and deployment of RenderScript code involves three components:

• Offline compiler

• Online JIT compiler

• RenderScript runtime

The offline compiler converts RenderScript .rs-files to portable bitcode and reflected Java files. The JIT compiler translates the portable bitcode output by the offline compiler to machine code appropriate for the processor the code is running on (e.g., CPU, GPU). The RenderScript runtime manages memory allocation, provides implementations of libraries (e.g., math, time, drawing) and manages RenderScript objects created from Android Runtime or Dalvik.

Below is a RenderScript implementation that changes saturation of a bitmap:

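    // Luma weights for the red, green and blue channels
    const static float3 gMonoMult = {0.299f, 0.587f, 0.114f};
    // Script global, set from Java through the reflected setter
    float saturationValue = 0.f;

    // Kernel invoked once per element of the input allocation
    uchar4 __attribute__((kernel)) saturation(uchar4 in)
    {
        float4 f4 = rsUnpackColor8888(in);
        // Gray value of the pixel, assigned to all three channels
        float3 result = dot(f4.rgb, gMonoMult);
        // Interpolate between the gray value and the original color
        result = mix(result, f4.rgb, saturationValue);

        return rsPackColorTo8888(result);
    }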

The corresponding usage of the above RenderScript code from Java can look like the following:

    Bitmap outputBitmap = ..
    Bitmap inputBitmap = ..

    // Initialize the RenderScript context
    RenderScript rs = RenderScript.create(mContext);
    // Create the specific script from the bitcode
    ScriptC_process script = new ScriptC_process(rs);

    // Create an allocation (the memory abstraction in RenderScript)
    // that corresponds to the outputBitmap
    Allocation allocationOut = Allocation.createFromBitmap(rs, outputBitmap);
    // Create an input allocation of the same type and fill it from the inputBitmap
    Allocation allocationIn = Allocation.createTyped(rs,
            allocationOut.getType(), Allocation.USAGE_SCRIPT);
    allocationIn.copyFrom(inputBitmap);

    script.set_saturationValue(1);
    script.forEach_saturation(allocationIn, allocationOut);
    rs.finish();
    // Copy the result back into the outputBitmap
    allocationOut.copyTo(outputBitmap);

The code above creates objects of type Allocation, which are the primary means of passing data to RenderScript code. The elements of the input allocation are then passed, one at a time, as the parameter in to the saturation kernel in the RenderScript code. Each returned result is put into the corresponding element of the output allocation.
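To make the arithmetic of the saturation kernel concrete, the sketch below mirrors it in plain Java on a packed 0xAARRGGBB pixel. It is an illustrative CPU-only reference under stated assumptions (8-bit channels, output alpha forced to opaque); the class SaturationSketch and its methods are hypothetical and are not part of the RenderScript API.

```java
/** Plain-Java sketch of the per-pixel arithmetic of the saturation kernel (illustrative only). */
final class SaturationSketch {
    // Same luma weights as gMonoMult in the RenderScript code.
    static final float[] MONO_MULT = {0.299f, 0.587f, 0.114f};

    /** Linear interpolation between start and stop, mirroring mix(start, stop, amount). */
    static float mix(float start, float stop, float amount) {
        return start + (stop - start) * amount;
    }

    /** Converts a channel in [0, 1] to an 8-bit value with rounding and clamping. */
    static int toByte(float v) {
        return Math.max(0, Math.min(255, Math.round(v * 255f)));
    }

    /** Applies the saturation adjustment to one packed 0xAARRGGBB pixel. */
    static int saturate(int argb, float saturation) {
        float r = ((argb >> 16) & 0xFF) / 255f;
        float g = ((argb >> 8) & 0xFF) / 255f;
        float b = (argb & 0xFF) / 255f;
        // Dot product of the color with the luma weights gives the gray value.
        float luma = r * MONO_MULT[0] + g * MONO_MULT[1] + b * MONO_MULT[2];
        int outR = toByte(mix(luma, r, saturation));
        int outG = toByte(mix(luma, g, saturation));
        int outB = toByte(mix(luma, b, saturation));
        return 0xFF000000 | (outR << 16) | (outG << 8) | outB; // alpha set to opaque in this sketch
    }

    /** Applies the adjustment to every pixel, one element at a time. */
    static int[] saturateAll(int[] pixels, float saturation) {
        int[] out = new int[pixels.length];
        for (int i = 0; i < pixels.length; i++) {
            out[i] = saturate(pixels[i], saturation);
        }
        return out;
    }
}
```

With a saturation value of 1 a pixel is returned unchanged, and with 0 every channel collapses to the gray value of the pixel.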

2.7.2 Floating point precision

Developers can control the required level of precision in RenderScript if full compliance with the IEEE 754-2008 standard [15] is not required. A developer can lower the required precision in order to improve performance and allow for additional optimizations on certain architectures. RenderScript implementations that do not require IEEE 754-2008 compliance will be referred to as Relaxed RenderScript implementations.


2.8 Image Processing

2.8.1 Image smoothing

Image processing is the use of algorithms to process images. This can be done in order to improve clarity, remove noise or compress images to optimize them for network communication. Image processing often involves an image smoothing step. Image smoothing is the process of blurring an image in order to remove noise. The intent is to capture important patterns in the data while removing rarely occurring phenomena. It is often used before conducting further processing, e.g., face or edge detection.

Image smoothing functions are often linear, where each pixel’s output value is a weighted sum of some input pixels:

$$g(i, j) = \sum_{k,l} f(i + k, j + l)\, h(k, l)$$

where f is the input image, g is the output image and h(k, l) is the kernel of the algorithm. The kernel contains the relative weights of each pixel.
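The formula translates almost directly into nested loops. The sketch below is one possible Java rendering for a single-channel image stored as a 2D array; the class name LinearFilterSketch, the centered kernel indexing and the border clamping are assumptions made for the example, since the text does not specify how borders are handled.

```java
/** Sketch of the linear filter g(i, j) = sum over (k, l) of f(i + k, j + l) * h(k, l). */
final class LinearFilterSketch {
    /**
     * Applies an odd-sized square kernel h to the image f.
     * The offsets k and l run from -r to r, where r is the kernel radius.
     * Coordinates outside the image are clamped to the nearest border pixel.
     */
    static double[][] apply(double[][] f, double[][] h) {
        int rows = f.length;
        int cols = f[0].length;
        int r = h.length / 2;
        double[][] g = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double sum = 0.0;
                for (int k = -r; k <= r; k++) {
                    for (int l = -r; l <= r; l++) {
                        int y = Math.min(rows - 1, Math.max(0, i + k));
                        int x = Math.min(cols - 1, Math.max(0, j + l));
                        sum += f[y][x] * h[k + r][l + r];
                    }
                }
                g[i][j] = sum;
            }
        }
        return g;
    }
}
```

A kernel that is 1 in the center and 0 elsewhere leaves the image unchanged, which is a convenient sanity check for an implementation like this.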

Box Filter

Box filter [29] is an image smoothing algorithm that achieves a blurring effect by replacing a pixel with the average of itself and its surrounding pixels. This means that the new value can be one that was not in the image before. Calculating the average of the pixels in this way is also known as applying a box filter. A 3 × 3 kernel for a box filter looks like the following:

$$h = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$

meaning that the relative weights of all pixels are the same; dividing the weighted sum by the number of kernel elements (here 9) gives the average. Applying a box filter to a 1800 × 1018 image yields the result shown in Figure 2.6.
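As a rough illustration of the averaging, the following Java sketch applies a box filter to a grayscale image held in an int array. The class BoxFilterSketch, the radius parameter and the clamped border handling are assumptions for the example and do not reflect the implementations measured in this thesis.

```java
/** Sketch of a box filter on a grayscale image stored as int[rows][cols]. */
final class BoxFilterSketch {
    /**
     * Replaces each pixel with the rounded mean of its (2 * radius + 1)^2 neighbourhood.
     * Coordinates outside the image are clamped to the nearest border pixel.
     */
    static int[][] blur(int[][] src, int radius) {
        int rows = src.length;
        int cols = src[0].length;
        int window = (2 * radius + 1) * (2 * radius + 1);
        int[][] dst = new int[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                int sum = 0;
                for (int k = -radius; k <= radius; k++) {
                    for (int l = -radius; l <= radius; l++) {
                        int y = Math.min(rows - 1, Math.max(0, i + k));
                        int x = Math.min(cols - 1, Math.max(0, j + l));
                        sum += src[y][x];
                    }
                }
                dst[i][j] = Math.round((float) sum / window);
            }
        }
        return dst;
    }
}
```

For a 3 × 3 image that is 0 everywhere except for a 9 in the center, a radius of 1 produces the value 1 at the center, a value that did not occur in the input.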


Figure 2.6: Averaging all pixels to achieve image smoothing, with a 5x5-kernel

Median Filter

Median image smoothing [29] uses the same process as averaging, but calculates the median of the pixels instead of the average. This means that the newly calculated value of a pixel is always present in the unprocessed image. A median filter applied to an image of size 1800 × 1018 pixels is shown in Figure 2.7.

Figure 2.7: Calculating the median of the pixels to achieve image smoothing
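A median filter can be sketched in the same way as the box filter, with the averaging replaced by sorting the window and picking the middle element. The Java code below is a simple illustration only: the class MedianFilterSketch and the clamped borders are assumptions, and sorting every window is chosen for clarity rather than speed.

```java
import java.util.Arrays;

/** Sketch of a median filter on a grayscale image stored as int[rows][cols]. */
final class MedianFilterSketch {
    /**
     * Replaces each pixel with the median of its (2 * radius + 1)^2 neighbourhood.
     * Coordinates outside the image are clamped to the nearest border pixel.
     */
    static int[][] filter(int[][] src, int radius) {
        int rows = src.length;
        int cols = src[0].length;
        int side = 2 * radius + 1;
        int[] window = new int[side * side];
        int[][] dst = new int[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                int n = 0;
                for (int k = -radius; k <= radius; k++) {
                    for (int l = -radius; l <= radius; l++) {
                        int y = Math.min(rows - 1, Math.max(0, i + k));
                        int x = Math.min(cols - 1, Math.max(0, j + l));
                        window[n++] = src[y][x];
                    }
                }
                Arrays.sort(window);
                dst[i][j] = window[window.length / 2]; // middle element of the sorted window
            }
        }
        return dst;
    }
}
```

Unlike the box filter, a single outlier in an otherwise uniform neighbourhood is removed entirely, because the median is always one of the input values.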

Gaussian filter

The Gaussian filter [29] is an often used image smoothing technique based on the Gaussian function. It uses a Gaussian kernel, where the relative weight of each pixel decreases as the distance to the center increases. In a 2D image the value of the kernel is calculated as


$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

where x is the distance from the center along the horizontal axis, y is the distance from the center along the vertical axis, and σ is the standard deviation of the Gaussian distribution.

0.077847  0.123317  0.077847
0.123317  0.195346  0.123317
0.077847  0.123317  0.077847

Table 2.1: Sample 3x3 Gaussian kernel with σ = 1.0.

As can be seen in Table 2.1, the weights are largest in the center of the matrix and decrease as the distance to the center increases. An example of a Gaussian filter applied to an image of size 1800 × 1018 pixels can be seen in Figure 2.8.

Figure 2.8: Example of Gaussian blur of an image with σ = 1.0 and a 5x5-kernel
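A Gaussian kernel can be generated by sampling G(x, y) at integer offsets and normalizing the weights so that they sum to one. The Java sketch below does exactly that; the class GaussianKernelSketch is hypothetical. Note that point sampling is only one possible discretization: for a 3 × 3 kernel with σ = 1.0 it gives a center weight of roughly 0.204, slightly different from Table 2.1, which appears to have been produced with another discretization.

```java
/** Sketch that builds a normalized Gaussian kernel by sampling G(x, y) at integer offsets. */
final class GaussianKernelSketch {
    /** Returns a (2 * radius + 1) x (2 * radius + 1) kernel whose weights sum to 1. */
    static double[][] kernel(int radius, double sigma) {
        int size = 2 * radius + 1;
        double[][] h = new double[size][size];
        double sum = 0.0;
        for (int y = -radius; y <= radius; y++) {
            for (int x = -radius; x <= radius; x++) {
                double v = Math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
                        / (2.0 * Math.PI * sigma * sigma);
                h[y + radius][x + radius] = v;
                sum += v;
            }
        }
        // Normalize so that filtering does not change the overall image brightness.
        for (int y = 0; y < size; y++) {
            for (int x = 0; x < size; x++) {
                h[y][x] /= sum;
            }
        }
        return h;
    }
}
```

The resulting kernel can be used as the h(k, l) of the linear filter formula in Section 2.8.1.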

2.8.2 Grayscaling

Grayscaling an image is the process of converting a colored image to an image composed of different shades of gray. Each pixel in the resulting image carries only intensity information. To convert a colored image to a grayscale image, the intensity of each pixel has to be calculated.

$$Y = 0.299R + 0.587G + 0.114B \tag{2.1}$$

Equation 2.1 shows how to calculate the intensity (Y) from the colors of a pixel [31]. R is the amount of red in the pixel, G is the amount of green and B is the amount of blue.

Figure 2.9: An image converted to grayscale. (a) Original, (b) Grayscale

Figure 2.9 shows an example of a colored image converted to grayscale. The intensities of the gray pixels are calculated from the colors according to Equation 2.1.
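Equation 2.1 can be applied per pixel as in the following Java sketch. It assumes pixels packed as 0xAARRGGBB integers with 8-bit channels; the class GrayscaleSketch and its method names are hypothetical and chosen only for this example.

```java
/** Sketch of grayscaling a packed 0xAARRGGBB pixel using Equation 2.1. */
final class GrayscaleSketch {
    /** Computes the intensity Y = 0.299R + 0.587G + 0.114B, rounded to an 8-bit value. */
    static int luma(int argb) {
        int r = (argb >> 16) & 0xFF;
        int g = (argb >> 8) & 0xFF;
        int b = argb & 0xFF;
        return Math.round(0.299f * r + 0.587f * g + 0.114f * b);
    }

    /** Returns a gray pixel with the same alpha and all color channels set to the intensity. */
    static int toGray(int argb) {
        int y = luma(argb);
        return (argb & 0xFF000000) | (y << 16) | (y << 8) | y;
    }
}
```

Because green has the largest weight, a pure green pixel maps to a noticeably brighter gray than a pure red or pure blue pixel.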

2.8.3 Thresholding

Thresholding is a method of image segmentation that converts a grayscale image to a binary image (i.e., an image with only two colors). The simplest thresholding methods replace each pixel in an image with a black or white pixel depending on the intensity of the pixel. Given a grayscale image and a fixed threshold T, every pixel in the image with an intensity I < T is replaced with a black pixel, and all others are replaced with white pixels.

(a) Original (b) Thresholding effect

Figure 2.10: Grayscaling and thresholding applied to an image


In Figure 2.10 the original image is converted to grayscale with the help of Equation 2.1. Every pixel with an intensity below 0.5 is then replaced with a black pixel, and every other pixel with a white pixel.
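The thresholding rule can be sketched as follows, under the assumptions that pixels are packed ARGB integers and that intensities are normalized to the interval [0, 1] before being compared with T. All names are hypothetical.

```java
/** Sketch: fixed-threshold binarization of packed ARGB pixels. */
final class ThresholdSketch {
    static final int BLACK = 0xFF000000;
    static final int WHITE = 0xFFFFFFFF;

    /** Intensity per Equation 2.1, normalized to [0, 1]. */
    static double intensity(int argb) {
        int r = (argb >> 16) & 0xFF;
        int g = (argb >> 8) & 0xFF;
        int b = argb & 0xFF;
        return (0.299 * r + 0.587 * g + 0.114 * b) / 255.0;
    }

    /** Pixels with intensity I < t become black, all others white. */
    static int[] threshold(int[] src, double t) {
        int[] dst = new int[src.length];
        for (int i = 0; i < src.length; i++) {
            dst[i] = intensity(src[i]) < t ? BLACK : WHITE;
        }
        return dst;
    }
}
```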

2.9 Color spaces

A color space is a model describing how to represent colors as tuples of numbers. An example of a commonly seen color space is RGB [21]. RGB is an additive color space where a color in the space is defined by its amount of red, green and blue. The color red is defined as #FF0000 in hexadecimal. The first two digits (i.e., the first byte) represent the amount of red, the middle two digits (the second byte) represent the amount of green and the last two digits (the last byte) represent the amount of blue.

RGB is not a supported format on all Android devices. All devices, however, support capturing video in YUV420 format, and it is the standard format for the Android camera preview. The YUV model defines a color space in terms of a brightness component (Y) and two color components (U, V) [21].

An image in the RGB color space is represented as interspersed values, i.e., the red, green and blue values of a pixel lie next to each other. The YUV420 format, however, stores all Y values at the beginning of the buffer and groups the U and V values together after them.

In order to do color processing in the RGB color space on Android, the camera frames fetched must therefore first be converted from YUV420. The following formula can be used [22]:

$$\begin{aligned}
B &= 1.164(Y - 16) + 2.018(U - 128) \\
G &= 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128) \\
R &= 1.164(Y - 16) + 1.596(V - 128)
\end{aligned} \tag{2.2}$$
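A direct floating-point transcription of Equation 2.2 for a single pixel might look as follows. Clamping the results to 0..255 and packing them into an ARGB integer are assumptions of this sketch, since the formula itself does not specify how out-of-range values are handled.

```java
/** Sketch: per-pixel YUV to RGB conversion following Equation 2.2. */
final class YuvToRgbSketch {

    /** Rounds and clamps a channel value to the range 0..255 (an assumption of this sketch). */
    static int clamp(double v) {
        return (int) Math.max(0, Math.min(255, Math.round(v)));
    }

    /** Converts one YUV sample (each component in 0..255) to a packed ARGB pixel. */
    static int toArgb(int y, int u, int v) {
        double r = 1.164 * (y - 16) + 1.596 * (v - 128);
        double g = 1.164 * (y - 16) - 0.813 * (v - 128) - 0.391 * (u - 128);
        double b = 1.164 * (y - 16) + 2.018 * (u - 128);
        return 0xFF000000 | (clamp(r) << 16) | (clamp(g) << 8) | clamp(b);
    }
}
```

With neutral chroma (U = V = 128), Y = 16 maps to black and Y = 235 maps to white.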

Chapter 3

Related Work

This chapter presents earlier related research.

3.1 Java and C++ benchmarks

Reinholtz [23] claimed that the run-time performance of the Java programming language will likely surpass that of C++. The author based this claim on the fact that dynamic compilation gives the Java compiler access to runtime information that is not available to the C++ compiler. The author claims that this is bound to occur since the market for embedded systems will be driven to extend battery life, and that a more performant language would be desirable.

Hundt [13] conducted a comparison of the programming languages Java, C++, Go and Scala. The intent was to compare loop recognition in the languages mentioned. The implementations all used idiomatic container classes, but did not attempt to exploit specific language or run time features. The results showed that the Java implementation contained 25.6% more lines of code than the C++ version, and that it used 6 times more virtual memory than the C++ version. The Java version was 5.8 times slower than the C++ version. The author claims that even though the benchmark itself was simple and compact, it utilized many language features such as higher-level data structures (lists, maps etc.) and some well known algorithms (e.g., DFS, union/find), which means that the comparison shown could be applicable in other situations as well. Following the reasoning of Reinholtz [23], this gap between the Java and C++ performance will decrease as the JIT compilers used in the JVMs are improved.

Gherardi, Brugali, and Comotti [9] showed a smaller difference in run time performance than Hundt [13]. Gherardi, Brugali, and Comotti [9] implemented algorithms processing sensor data, often used in robotics. In certain test runs the performance difference was measured as Java being 9% slower than the C++ implementation. However, the same program with different data also showed a performance decrease of 280% when using the Java version. Moreover, Gherardi, Brugali, and Comotti [9] presented similar benchmarks conducted with earlier versions of the JVM, showing that the performance gain of using C++ was getting smaller every year, giving reason to believe that Reinholtz [23] is correct.

Lin et al. [18] have earlier shown that Android applications written in C++ increased run time performance by up to 34.2% when moving from Java. These tests were, however, run on the Dalvik VM, and did not utilize the ART runtime released with Android 5.0. Son and Lee [30] also showed that they could increase the run time performance of their augmented reality engine by 86.9% when rewriting it using C++ instead of Java.

3.2 Dalvik vs ART

Konradsson [17] compared the performance of the Dalvik VM and the then newly introduced Android Runtime. The author compared run time, memory usage and application size, using well established benchmarking frameworks. Solving a dense 1000 × 1000 linear equation system yielded an average improvement of 12.35% when using ART over Dalvik when measuring the number of floating point operations per second. It is worth noting that Dalvik outperformed ART in certain test cases. In two of five test cases, Dalvik performed 0.6% and 0.8% more floating point operations per second. The tests were run on Android versions 4.4.2 and 5.1.1, meaning the ART version did not have access to a JIT compiler. The reason that Dalvik outperformed ART in certain test cases is possibly that the JIT compiler in Dalvik was able to optimize the code for the architecture on that device. The author furthermore measured the RAM usage for 6 popular applications, namely Drive, Gmail, WhatsApp, Netflix, Dropbox and Skype. It is not clear what actions the user performed, but the average RAM usage was 45% higher on Dalvik.


3.3 Using GPU for calculations

3.3.1 RenderScript

In 2012 an Android engineer showed that when varying saturation in a bitmap, run time could be 7 times as fast when using RenderScript over Java [27]. In 2013 Google further optimized the RenderScript engine, significantly improving its performance [26]. When code was executed only on the CPU, the RenderScript engine showed improvements in the range of 90%–220% when updating from Android 4.1 to Android 4.2.

Figure 3.1: Comparison of CPU and GPU code doing image processing on Android 4.2 [6]

The tests run in the comparisons used Android versions 4.0, 4.1 and 4.2, meaning that ART was not used in the tests. The tests did therefore not compare the RenderScript engine with AOT-compiled code. As can be seen in Figure 3.1, using the GPU provides performance benefits compared to using only the CPU. The performance is shown relative to the performance measured on Android 4.0.

3.3.2 OpenCL

OpenCL [35] is a framework for writing programs that execute across heterogeneous platforms. The programs can be run on CPUs and GPUs, and OpenCL is widely used for parallelization. Its support for Android is limited [1], since Google opted for RenderScript instead.


Wang et al. [36], using OpenCL, implemented an algorithm that removes objects from images and fills the hole left by removing the object, creating a plausible image. Using OpenCL that only ran on the CPU, processing an image took 393.8 seconds. Utilizing the GPU and varying certain parameters in their algorithm, it took 4.266 seconds. The authors conclude that frameworks such as OpenCL are suitable for use on modern mobile GPUs.

Ross et al. [25] measure the performance of mobile CPUs and GPUs using the N-body algorithm. N-body is an algorithm used to solve problems regarding particles subject to an inter-particle force. The authors considered the algorithm representative of many real-world computational kernels. The results presented show that code running on the GPU is considerably faster than the code running on the CPU, and that the performance of handheld GPUs is closing in on desktop CPUs. The authors furthermore note that OpenCL is immature for mobile and embedded devices, but that it will likely get better.

Kim and Kim [16] compared the performance of OpenCL and RenderScript when computing matrix multiplications. When performing the multiplications on a PC the OpenCL implementation far outperformed the RenderScript version. The OpenCL implementation was 2 times faster when multiplying a 10 × 10 matrix, and approximately 13 times faster when multiplying a 100 × 100 matrix. On average, OpenCL was 9.11 times faster.

However, when conducting the same experiments on a mobile device the RenderScript implementation was 5.8 times faster on average. The PC versions, however, used an emulator in order to run the RenderScript versions, which might penalize the performance. The authors conclude that RenderScript is more optimized for the architectures found on Android devices.

Chapter 4

Method

This chapter explains how the experiment that is intended to answer the research question was conducted.

4.1 Choice of method and algorithms

Selecting a language other than Java when developing Android applications is often done when developing for computationally intensive purposes. Image processing is a computationally intensive area, making it suitable for use when measuring language performance. The algorithms implemented in this project are popular algorithms that are available in open source libraries.

The different image processing algorithms implemented have different properties that make them interesting. The grayscaling algorithm, for example, only requires accessing one pixel to determine the color of the new pixel. The Gaussian blurring function, however, requires accessing neighboring pixels to calculate a weighted average for the pixel. This means that the algorithms have differing cache localities, making them suitable candidates for this test.

4.2 Development environment and devices

The Android applications for this thesis were built using Gradle 2.2.3 [10] and CMake 3.4.1 [5]. The compiler used for compiling the native parts of the application was clang 3.8.256229.


The image processing algorithms outlined in this chapter were tested on multiple Android devices running different versions of the operating system. The following smartphones were tested:

• Samsung Galaxy S5, Android 6.0.1

• Sony Xperia Z1, Android 4.4

The Samsung Galaxy S5 device was using ART whereas the Sony Xperia Z1 device was using Dalvik as its runtime. The technical specifications can be seen in Tables 4.1 and 4.2.

OS       Android 6.0.1 (Marshmallow)
Chipset  Qualcomm MSM8974AC Snapdragon 801
CPU      Quad-core 2.5 GHz Krait 400
GPU      Adreno 330
RAM      2 GB

Table 4.1: Technical specifications for the Samsung Galaxy S5

OS       Android 4.4 (KitKat)
Chipset  Qualcomm MSM8974 Snapdragon 800
CPU      Quad-core 2.2 GHz Krait 400
GPU      Adreno 330
RAM      2 GB

Table 4.2: Technical specifications for the Sony Xperia Z1

4.3 Implementation

The benchmark consists of implementations of a color space conversion algorithm, different versions of blurring filters, grayscaling and thresholding. The color space conversion is used to convert YUV data to RGB format. The implementations are described in more detail below.

The number of bugs encountered during implementation of these algorithms was larger in RenderScript and C++ than in Java. However, the number of bugs encountered also depends on our previous experience with the development languages. Moreover, the tooling for the NDK and RenderScript is not as extensive as for Java, making debugging harder. The build times of the project increased as the NDK or RenderScript was added to the project.


4.3.1 Color space conversion

The color space conversion application captures frames from the smartphone's camera. When a frame is fetched it is passed to an instance of the interface Camera.PreviewCallback. The frame is passed as a byte array containing data in YUV format. A common operation is converting the frame data from YUV format to an RGB format before conducting further processing of the image. Seeing as this is a common operation, there exist many implementations provided by different vendors. It also means that this is a suitable test that can be generalized. The following implementations were tested, and are more thoroughly explained below:

• Java Threaded

• C++ Threaded

• C++ implemented in OpenCV

• RenderScript intrinsics

• Relaxed RenderScript

The implementations were built into an Android application that captured frames from the camera and then let each algorithm process the frame.

The methods all set the pixels of a bitmap displayed on the screen. After processing, the bitmap is invalidated and is redrawn by the operating system. In the cases where it is possible to change, the processing is done by the maximum number of threads possible on the device. In the case of color space conversion, each thread processes a part of the image. The main thread then waits for each thread to complete before rendering the final image on the screen, using the join method present in the language. The garbage collector is manually requested to run before each algorithm in order to avoid garbage collection during the processing of the image.
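The threading scheme described above can be sketched in plain Java as follows. Splitting the image into horizontal strips of rows, one strip per thread, is an assumption of this sketch, as are all names; the text only states that each thread processes a part of the image and that the main thread joins the workers.

```java
import java.util.function.IntUnaryOperator;

/** Sketch: process an image in horizontal strips, one thread per strip, then join. */
final class StripedProcessingSketch {

    /** Applies op to every pixel of src and writes the result to dst. */
    static void process(int[] src, int[] dst, int width, int height, IntUnaryOperator op) {
        // The number of usable hardware threads is detected at run time.
        int threads = Math.max(1, Math.min(height, Runtime.getRuntime().availableProcessors()));
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int rowStart = (int) ((long) height * t / threads);
            final int rowEnd = (int) ((long) height * (t + 1) / threads);
            workers[t] = new Thread(() -> {
                for (int i = rowStart * width; i < rowEnd * width; i++) {
                    dst[i] = op.applyAsInt(src[i]);
                }
            });
            workers[t].start();
        }
        // The calling thread waits for every worker before the result is used.
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
    }
}
```

Since the strips do not overlap, the workers never write to the same element of `dst` and no further synchronization is needed.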

A reference C++ implementation developed by Google was adapted for our Java and C++ implementations¹. The reference implementation provided by Google uses the formula in Equation 2.2 to calculate the RGB values from YUV.

¹ https://android.googlesource.com/platform/frameworks/rs/+/master/cpu_ref/rsCpuIntrinsicYuvToRGB.cpp


Floating point operations are often slow and are therefore replaced with integer and bitwise operations, both in our implementation and in the reference implementation provided by Google.
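One way to replace the floating point arithmetic of Equation 2.2 with integer and bitwise operations is fixed-point arithmetic. In the sketch below the coefficients of Equation 2.2 are multiplied by 256 and rounded, and the result is shifted right by 8 bits. These constants are derived here for illustration and are an assumption; they are not necessarily the exact constants used in the reference implementation.

```java
/** Sketch: integer-only YUV to RGB conversion using 8-bit fixed-point coefficients. */
final class FixedPointYuvSketch {

    static int clamp(int v) {
        return v < 0 ? 0 : (v > 255 ? 255 : v);
    }

    /** Converts one YUV sample (each component in 0..255) to a packed ARGB pixel. */
    static int toArgb(int y, int u, int v) {
        int c = y - 16;
        int d = u - 128;
        int e = v - 128;
        // 298 = round(1.164 * 256), 409 = round(1.596 * 256), 208 = round(0.813 * 256),
        // 100 = round(0.391 * 256), 517 = round(2.018 * 256); adding 128 rounds before the shift.
        int r = (298 * c + 409 * e + 128) >> 8;
        int g = (298 * c - 208 * e - 100 * d + 128) >> 8;
        int b = (298 * c + 517 * d + 128) >> 8;
        return 0xFF000000 | (clamp(r) << 16) | (clamp(g) << 8) | clamp(b);
    }
}
```

The shift replaces a division by 256, so the inner loop needs only integer multiplications, additions and shifts.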

Java, C++

The Java and C++ implementations create the maximum number of threads usable by the hardware. The maximum number of threads usable by the CPU is detectable at runtime. Creating 4 new threads on a Samsung Galaxy S5 takes on average 3 ms, measured over 100 test runs. Creating new threads was therefore not considered a large overhead, and threads are recreated when needed.

C++, using OpenCV

OpenCV is an open source computer vision library [20]. It contains optimized code for many tasks often done in image processing, and contains architecture-specific optimizations. The OpenCV Android SDK v3.2.0 was used to call the OpenCV C++ API.

RenderScript intrinsics

Google provides implementations of often-used algorithms with RenderScript intrinsics. Intrinsics are built-in functions that perform operations often used when conducting image processing. They provide high performance with a very small amount of code [24].

Relaxed RenderScript

The Relaxed RenderScript implementation uses lower precision in floating point operations in favor of increased performance. The implementation therefore uses 32-bit precision instead of 64-bit precision, which is common in CPUs.


4.3.2 Blurring

A pre-defined bitmap is shown and blurred using different blurring algorithms. The algorithms are run sequentially using AsyncTask, a class in the Android framework used for processing on a background thread. Each algorithm takes two parameters: an input bitmap and an output bitmap.

The methods all set the pixels of a bitmap displayed on the screen. After processing, the bitmap is invalidated and is redrawn by the operating system. The processing is done by the maximum number of threads possible on the device in cases where it is possible to change. The main thread then waits for each thread to complete before rendering the final image on the screen, using the join method present in the language.

Before running an algorithm the system garbage collector is manually requested to run in order to not pollute the run times of the algorithms. Note that this does not guarantee that the garbage collector is invoked, but it is visible in the logs when it is run. The logs were manually checked to make sure that the garbage collector did not run during the processing.

Different implementations can in some instances return different results. For instance, it is not possible for a developer to specify the kernel used when using the intrinsic RenderScript implementation of Gaussian blurring, and we must therefore consider the possibility that the images slightly differ. The difference between two images is calculated pixel by pixel. The red, green and blue values of each pixel are summed, and the sum is compared between the images. This is called the Manhattan norm.

$$\lVert x \rVert_1 = \sum_{i=1}^{n} |x_i|$$

It is considered acceptable if the resulting images differ by up to 10% in each channel (R, G, B). Further distortion when conducting blurring will be noticeable in the resulting image.
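The comparison can be sketched as below. The text leaves some details open, so this is only one possible reading: for each channel, the absolute differences are summed over all pixels (the Manhattan norm of the difference) and divided by the largest possible sum, giving a relative difference in [0, 1] that is compared with the 10% tolerance. The packed ARGB layout and all names are assumptions.

```java
/** Sketch: per-channel relative Manhattan distance between two images. */
final class ImageDifferenceSketch {

    /** Returns the relative L1 distance for the channels {R, G, B}, each in [0, 1]. */
    static double[] relativeL1(int[] a, int[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("images must have the same size");
        }
        long[] sums = new long[3];
        for (int i = 0; i < a.length; i++) {
            for (int c = 0; c < 3; c++) {
                int shift = 16 - 8 * c; // 16 = red, 8 = green, 0 = blue
                sums[c] += Math.abs(((a[i] >> shift) & 0xFF) - ((b[i] >> shift) & 0xFF));
            }
        }
        double[] rel = new double[3];
        double max = 255.0 * a.length;
        for (int c = 0; c < 3; c++) {
            rel[c] = a.length == 0 ? 0.0 : sums[c] / max;
        }
        return rel;
    }

    /** True if no channel differs by more than the given tolerance, e.g. 0.10. */
    static boolean acceptable(int[] a, int[] b, double tolerance) {
        for (double rel : relativeL1(a, b)) {
            if (rel > tolerance) {
                return false;
            }
        }
        return true;
    }
}
```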

Gaussian filter

As described in Chapter 2, a 2D Gaussian kernel is calculated as follows:

$$G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

In order to speed up image processing, one can use a one-dimensional filter and apply it twice, both horizontally and vertically. This means that a 1D vector is computed and applied horizontally to each row in the image. The resulting image is then used for the vertical pass for all columns in the image.

The Gaussian kernel used was pre-calculated using an online service [7]. The calculation of the Gaussian kernel is therefore not taken into account.
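The two-pass approach can be sketched as follows for a single-channel image stored as a row-major array. Computing the 1D kernel by point sampling and repeating the nearest edge pixel at the image borders are assumptions of this sketch; the thesis implementations use a pre-calculated kernel and are based on Google's reference code.

```java
/** Sketch: separable Gaussian blur (horizontal pass followed by vertical pass). */
final class SeparableBlurSketch {

    /** Normalized 1D Gaussian weights; the constant factor cancels in the normalization. */
    static double[] kernel1d(int radius, double sigma) {
        double[] k = new double[2 * radius + 1];
        double sum = 0.0;
        for (int i = -radius; i <= radius; i++) {
            k[i + radius] = Math.exp(-(i * i) / (2.0 * sigma * sigma));
            sum += k[i + radius];
        }
        for (int i = 0; i < k.length; i++) {
            k[i] /= sum;
        }
        return k;
    }

    static int clamp(int v, int lo, int hi) {
        return Math.max(lo, Math.min(hi, v));
    }

    /** Applies the 1D kernel along rows (horizontal) or columns (vertical). */
    static double[] pass(double[] src, int width, int height, double[] k, boolean horizontal) {
        int radius = k.length / 2;
        double[] dst = new double[src.length];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                double acc = 0.0;
                for (int i = -radius; i <= radius; i++) {
                    int sx = horizontal ? clamp(x + i, 0, width - 1) : x;
                    int sy = horizontal ? y : clamp(y + i, 0, height - 1);
                    acc += k[i + radius] * src[sy * width + sx];
                }
                dst[y * width + x] = acc;
            }
        }
        return dst;
    }

    static double[] blur(double[] src, int width, int height, int radius, double sigma) {
        double[] k = kernel1d(radius, sigma);
        return pass(pass(src, width, height, k, true), width, height, k, false);
    }
}
```

For a kernel of width n, the two passes read 2n pixels per output pixel instead of the n² pixels read by a direct 2D convolution. Passing a kernel with equal weights to `pass` gives a box filter instead.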

The Gaussian filter had 8 different implementations tested:

• Single-threaded Java

• Multi-threaded Java

• Single-threaded C++

• Multi-threaded C++

• C++, using OpenCV

• RenderScript

• Relaxed RenderScript

• RenderScript Intrinsics

The Java and C++ implementations were based on a reference C++ implementation found in the Android system source code².

² https://android.googlesource.com/platform/frameworks/rs/+/master/cpu_ref/rsCpuIntrinsicBlur.cpp

Box filter

Applying a box filter can be done using the same strategy as applying the Gaussian filter, using one vertical and one horizontal pass. Recall that a box filter works like a Gaussian filter in which the relative weights of all pixels are the same. The box filter had 7 different implementations:

• Single-threaded Java

• Multi-threaded Java

• Single-threaded C++

• Multi-threaded C++

• C++, using OpenCV

• Relaxed RenderScript

• RenderScript

Google does not provide an intrinsic RenderScript box filter function, and it could therefore not be included. The Java and C++ versions were based on the Gaussian blurring implementation provided by Google, with changes to adapt it to a box filter.

Median filter

The median filter is applied by looking at every pixel surrounding a center pixel within the radius supplied. The color of the center pixel is then set to the median color of these pixels.
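A minimal median filter can be sketched as follows. It operates on a single-channel image and clips the window at the image borders; both choices, and the names used, are assumptions of this sketch rather than a description of the implementations measured.

```java
import java.util.Arrays;

/** Sketch: median filter on a single-channel, row-major image. */
final class MedianFilterSketch {

    static int[] apply(int[] src, int width, int height, int radius) {
        int[] dst = new int[src.length];
        int side = 2 * radius + 1;
        int[] window = new int[side * side];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int n = 0;
                // Collect all pixels within the radius that lie inside the image.
                for (int dy = -radius; dy <= radius; dy++) {
                    int sy = y + dy;
                    if (sy < 0 || sy >= height) continue;
                    for (int dx = -radius; dx <= radius; dx++) {
                        int sx = x + dx;
                        if (sx < 0 || sx >= width) continue;
                        window[n++] = src[sy * width + sx];
                    }
                }
                Arrays.sort(window, 0, n);
                // For an even count this picks the upper of the two middle values.
                dst[y * width + x] = window[n / 2];
            }
        }
        return dst;
    }
}
```

Sorting the window for every pixel makes the cost per pixel grow faster than for the box and Gaussian filters, which only accumulate weighted sums.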

The median filter has 5 different implementations:

• Single-threaded Java

• Multi-threaded Java

• Single-threaded C++

• Multi-threaded C++

• C++, using OpenCV

No RenderScript intrinsics were provided by Google. RenderScript does not allow using vectors as function parameters, which made calculating medians impractical in the C99-derived language, and it was therefore left out. The Java and C++ versions were not based on a reference implementation.


4.3.3 Grayscaling and thresholding

The grayscaling and thresholding algorithms were implemented in four variants:

• Java Threaded

• C++ Threaded

• C++ OpenCV

• RenderScript


As opposed to the blurring implementations described above, these algorithms only access one pixel at a time to calculate the color of the new pixel. This could lead to better cache locality.

The thresholding implementation uses the grayscaling implementation to convert a colored image to a grayscale image before deciding whether a certain pixel should be black or white.

4.4 Measuring Runtime Performance

There are multiple ways of measuring run time in Java. Wall-clock time obtained with System.currentTimeMillis() is not reliable, seeing as the clock can be adjusted at unpredictable times by the operating system. Instead, elapsed time is measured in this thesis with the monotonic clock provided by the Android call SystemClock.elapsedRealtimeNanos(), as recommended by Google [33].

4.4.1 Image processing

The run time of the algorithms can change depending on a number of factors. Among others, the JIT compiler present in the OS will optimize the code as it is running. In order to minimize its effect on the collected run times, a number of warmup rounds are run before the run times are measured. Furthermore, to prevent other processes from influencing the run times of the algorithms, no other applications were running during the testing of the algorithms. The smartphone was also running in flight mode.

In the color space conversion test, 50 warmup rounds are run before starting the test. The blurring, grayscaling and thresholding tests use 10 warmup rounds. The run times were successively smaller in the first warmup rounds, due to the JIT compilation. After the 10 warmup rounds, the optimization did not further improve the performance.
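The warmup procedure can be sketched in plain Java as follows. The sketch uses System.nanoTime() as a stand-in for the Android-only SystemClock.elapsedRealtimeNanos(), and the class and method names are hypothetical.

```java
/** Sketch: run warmup rounds first, then time each measured round separately. */
final class BenchmarkSketch {

    /** Returns the elapsed time in nanoseconds of each measured round. */
    static long[] measure(Runnable task, int warmupRounds, int measuredRounds) {
        // Warmup rounds give the JIT compiler a chance to optimize the task; their times are discarded.
        for (int i = 0; i < warmupRounds; i++) {
            task.run();
        }
        long[] nanos = new long[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime(); // monotonic, unaffected by wall-clock adjustments
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    /** Mean of the measured rounds in milliseconds. */
    static double meanMillis(long[] nanos) {
        double sum = 0.0;
        for (long n : nanos) {
            sum += n;
        }
        return nanos.length == 0 ? 0.0 : sum / nanos.length / 1_000_000.0;
    }
}
```

A call such as `BenchmarkSketch.measure(task, 10, 100)` would correspond to 10 warmup rounds followed by 100 measured rounds; the round counts are parameters rather than fixed values.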

4.4.2 Setup

Every time a blurring, grayscaling or thresholding algorithm is run, some setup is required (e.g., for allocating memory). The time required to set up the necessary environment differs between the algorithms, and is also collected. For instance, when running the Java versions, the following is always used:

    int[] srcpixels = new int[width * height];
    int[] dstpixels = new int[width * height];
    src.getPixels(srcpixels, 0, width, 0, 0, width, height);

The C++ setup conducted is the same as when doing image processing in Java.

The RenderScript versions require more sophisticated setup. This is not part of the actual calculations done by the algorithm, but is necessary for the algorithms to work, and is therefore taken into account. Nothing is saved between two runs of the same algorithm, which means that the setup is conducted each time the algorithm is run. To take this into account, the time taken for each algorithm to set up the necessary allocations is measured. An example setup used for RenderScript in this thesis is shown below:

    Allocation input = Allocation.createFromBitmap(rs, src);
    Allocation output = Allocation.createFromBitmap(rs, dst);
    ScriptC_gaussian_blur script = new ScriptC_gaussian_blur(rs);
    script.set_width(w);
    script.set_height(h);

    // set input for blurring
    script.set_ScratchPixel1(input);
    script.set_ScratchPixel2(input);

The Allocation objects are handled by the RenderScript runtime and provide a buffer for the GPU to read from.

In the color space conversion test, the setup is reused between tests, meaning that the initialization and memory allocation do not have to be done before every run of the algorithms. The setup times are therefore not taken into account in that experiment.

4.5 Verifying results

In order to confirm that conclusions can be drawn from the results, a statistical test must be performed. A Wilcoxon Signed Rank Test is used in this project [38]. The test is performed by conducting pairwise comparisons between the average run times of different implementations of the same algorithm.

Table 4.3 shows the algorithms and what languages they were implemented in. All algorithms, except the YUV to RGB conversion, were tested on images with 3 different resolutions. The YUV to RGB conversion was tested on a camera feed, with 2 different resolutions. The results of these tests were treated as independent data points, so the Java Threaded implementation had 17 independent data points, for example.

Columns (implementations): Java, Java Threaded, C++, C++ Threaded, C++ OpenCV, RS, Relaxed RS, RS Intrinsic

Rows (algorithms): YUV to RGB, Gaussian blur, Box blur, Median blur, Thresholding, Grayscaling

Table 4.3: Algorithms and what languages they have been implemented in

The data points of an implementation are pairwise compared with the other implementations, i.e., the run times of two implementations of a certain algorithm on a certain resolution frame are compared.

The run times are normalized to the interval [0, 1] in order to reduce the relative importance of the run times of the tasks done on larger images. The normalization is done by dividing the runtime of the implementation by the largest runtime in the pair. Table 4.4 shows an example of converting absolute runtimes to normalized values, used for further calculation of statistical significance. Note that the results are examples only, and that the tables do not contain any real data.

Algorithm      Java    C++
Gaussian blur  12 ms   16 ms
Median blur    20 ms   18 ms
Thresholding   17 ms   19 ms
Grayscaling    13 ms   14 ms

(a) Absolute runtimes

Algorithm      Java    C++
Gaussian blur  0.75    1
Median blur    1       0.9
Thresholding   0.8947  1
Grayscaling    0.9286  1

(b) Normalized runtimes

Table 4.4: An example of converting absolute run times to normalized times
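The normalization of one pair of run times can be written as a small sketch; the names are hypothetical and the example values are the illustrative ones from Table 4.4, not measured data.

```java
/** Sketch: normalize a pair of run times by the larger of the two. */
final class NormalizeSketch {

    /** Returns {a / max(a, b), b / max(a, b)}; the slower implementation gets the value 1. */
    static double[] normalizePair(double a, double b) {
        double max = Math.max(a, b);
        if (max <= 0.0) {
            throw new IllegalArgumentException("run times must be positive");
        }
        return new double[] {a / max, b / max};
    }
}
```

For the Gaussian blur row of Table 4.4, `NormalizeSketch.normalizePair(12, 16)` gives 0.75 for Java and 1 for C++.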

The results shown in Table 4.4b are used in the pairwise calculations to calculate whether the results are significant or not. An online tool is used for convenience to determine statistical significance [37]. The performance of an implementation of an algorithm on a single resolution can also be compared with other implementations with the Wilcoxon Signed Rank Test by comparing their absolute runtimes.

Some of the algorithms were not implemented in all languages. The single-threaded versions of Java and C++ performed worse than their corresponding multi-threaded versions, and were therefore left out. RenderScript Intrinsics are only available for a select few operations, and could therefore not be used for all algorithms.

In these cases where the number of data points is too small, the test statistic does not converge to a normal distribution, like it normally does. When the number of data points is lower than 10, the calculated test statistic has to be compared with predefined values to determine whether the data is significant or not, as is standard when using a Wilcoxon Signed Rank Test.
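For reference, the test statistic of the Wilcoxon Signed Rank Test can be computed as in the sketch below. It follows the common textbook procedure (zero differences are dropped, tied absolute differences receive their average rank) and is not the online tool used in this thesis; the names are hypothetical. Deciding significance still requires comparing the statistic with a table of critical values or a normal approximation, which the sketch does not do.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: rank sums and test statistic of the Wilcoxon Signed Rank Test. */
final class WilcoxonSketch {

    /** Returns {W+, W-}, the rank sums of the positive and negative differences a[i] - b[i]. */
    static double[] rankSums(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("samples must be paired");
        }
        List<Double> diffs = new ArrayList<>();
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            if (d != 0.0) {
                diffs.add(d); // pairs without a difference are discarded
            }
        }
        diffs.sort((p, q) -> Double.compare(Math.abs(p), Math.abs(q)));
        double wPlus = 0.0;
        double wMinus = 0.0;
        int i = 0;
        while (i < diffs.size()) {
            int j = i;
            while (j + 1 < diffs.size()
                    && Math.abs(diffs.get(j + 1)) == Math.abs(diffs.get(i))) {
                j++;
            }
            // Average of the 1-based ranks i + 1 .. j + 1 for a group of ties.
            double rank = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) {
                if (diffs.get(k) > 0) {
                    wPlus += rank;
                } else {
                    wMinus += rank;
                }
            }
            i = j + 1;
        }
        return new double[] {wPlus, wMinus};
    }

    /** The test statistic W is the smaller of the two rank sums. */
    static double statistic(double[] a, double[] b) {
        double[] sums = rankSums(a, b);
        return Math.min(sums[0], sums[1]);
    }
}
```

Applied to the illustrative normalized values of Table 4.4b, the rank sums are W+ = 2 and W- = 8, so W = 2.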

Chapter 5

Results

In this chapter the results from the experiments conducted are presented.

5.1 Color space conversion

The color space conversion was done on 100 sequential frames captured by the smartphone's camera, tested with different resolutions. The results are presented in this section.

Table 5.1 shows the average run time of the color space conversion. The Android 4.4 run times were collected from a Sony Xperia Z1 and the Android 6.0.1 run times were collected from a Samsung Galaxy S5. The resolutions of the camera feed are also displayed in the table.

                        Android 4.4               Android 6.0.1
Resolution              640 × 480   1280 × 720    640 × 480   1920 × 1080
Java Threaded           26 ± 8 ms   63 ± 13 ms    32 ± 9 ms   70 ± 14 ms
C++ Threaded            14 ± 5 ms   40 ± 8 ms     23 ± 8 ms   51 ± 11 ms
C++ OpenCV              11 ± 5 ms   34 ± 7 ms     16 ± 6 ms   42 ± 10 ms
Relaxed RenderScript    29 ± 3 ms   79 ± 13 ms    19 ± 8 ms   65 ± 18 ms
RenderScript Intrinsic  11 ± 3 ms   32 ± 8 ms     12 ± 5 ms   40 ± 7 ms

Table 5.1: Run times for converting from YUV to RGB on different resolutions and different smartphones


The RenderScript Intrinsic provided the best runtime performance out of the tried implementations. The average runtimes of the C++ implementations were not far behind. The Java Threaded and Relaxed RenderScript implementations provided the worst runtime performance in the color space conversion test.

5.2 Blurring

The blurring algorithms were applied to three images of different sizes, ranging from 100 × 67 pixels up to 1920 × 1080 pixels. The run times of the algorithms are presented in this section. The graphs presented below display run time as a function of image size. Tables with more detailed results can be found in Appendix A.

5.2.1 Box filter

Table 5.2 shows the run times of the box filter implementations on a 1920 × 1080 image, on both Android 4.4 and 6.0.1. Note that the different operating system versions were used on different smartphones. Tables with run times for other resolutions can be found in Appendix A.

Algorithm             Android 4.4  Android 6.0.1
Java                  3092 ms      4907 ms
Java Threaded         869 ms       1084 ms
C++                   945 ms       939 ms
C++ Threaded          353 ms       311 ms
C++ OpenCV            201 ms       236 ms
RenderScript          402 ms       324 ms
Relaxed RenderScript  168 ms       151 ms

Table 5.2: Run times for applying box filter to a 1920 × 1080 image

The Relaxed RenderScript runtime performance is the best out of the implementations shown above, and the single-threaded Java implementation is the slowest implementation. Increasing the number of threads in the Java and C++ implementations shows a linear increase in run time performance, as expected. The C++ Threaded and OpenCV implementations outperformed the RenderScript implementation that used full floating point precision. Figure 5.1 displays the run times of applying a box filter to images of different resolutions.

[Plot with two panels, Android 4.4 and Android 6.0.1. x-axis: Pixels in image (·10^6); y-axis: Time (ms). Series: Java, Java Threaded, C++, C++ Threaded, RS, Relaxed RS, OpenCV]

Figure 5.1: Run times of applying a box filter on two versions of Android

5.2.2 Median filter

Table 5.3 shows the run times of the median filter implementations on a 1920 × 1080 image, on both Android 4.4 and 6.0.1. Note that the different operating system versions were used on different smartphones. Tables with run times for other resolutions can be found in Appendix A.

The trends in the runtime performance of different implementations of the median filter are similar to the trends in the box filter results. An outlier in Table 5.3 is the OpenCV runtime performance on Android 4.4.

Algorithm      Android 4.4  Android 6.0.1
Java           4283 ms      2983 ms
Java Threaded  1903 ms      1230 ms
C++            1560 ms      1835 ms
C++ Threaded   646 ms       717 ms
C++ OpenCV     3176 ms      201 ms

Table 5.3: Run times for applying median filter to a 1920× 1080 image


[Plot with two panels, Android 4.4 and Android 6.0.1. x-axis: Pixels in image (·10^6); y-axis: Time (ms). Series: Java, Java Threaded, C++, C++ Threaded, OpenCV]

Figure 5.2: Run times of applying a median filter on two versions of Android

Figure 5.2 shows the average run times of the different implementations on both Android 4.4 and Android 6.0.1. The time is displayed as a function of image size.

5.2.3 Gaussian filter

Table 5.4 shows the run times of the Gaussian filter implementations on a 1920 × 1080 image, on both Android 4.4 and 6.0.1. Note that the different operating system versions were used on different smartphones. Tables with run times for other resolutions can be found in Appendix A.

Algorithm               Android 4.4  Android 6.0.1
Java                    3067 ms      4959 ms
Java Threaded           877 ms       1115 ms
C++                     936 ms       939 ms
C++ Threaded            392 ms       325 ms
C++ OpenCV              240 ms       285 ms
RenderScript            420 ms       356 ms
RenderScript Intrinsic  124 ms       49 ms
Relaxed RenderScript    166 ms       168 ms

Table 5.4: Run times for applying Gaussian filter to a 1920× 1080 image

[Plot with two panels, Android 4.4 and Android 6.0.1. x-axis: Pixels in image (·10^6); y-axis: Time (ms). Series: Java, Java Threaded, C++, C++ Threaded, RS, RS Intrinsic, Relaxed RS, OpenCV]

Figure 5.3: Run times of applying a Gaussian filter on two versions of Android

The Intrinsic RenderScript implementation was the fastest implementation by a significant margin. The Relaxed RenderScript implementation was the second fastest implementation, with an average run time 119 ms higher than the Intrinsic implementation on Android 6.0.1, and 42 ms higher on Android 4.4.

Figure 5.3 shows the average run times of the implementations when applying a Gaussian filter to images of varying resolution. The run times are plotted as a function of image size.

5.3 Grayscaling

Table 5.5 shows the run times of the different implementations when converting a colored image to grayscale. The resolution of the image was 1920 × 1080. The average run times of applying grayscaling to images of other resolutions can be found in Appendix A.



                      Android 4.4               Android 6.0.1
Implementation        Setup time   Runtime      Setup time   Runtime
Java Threaded         55 ± 2 ms    133 ± 12 ms  29 ± 4 ms    107 ± 26 ms
C++ Threaded          52 ± 6 ms    98 ± 11 ms   30 ± 3 ms    93 ± 18 ms
C++ OpenCV            14 ± 3 ms    20 ± 3 ms    20 ± 2 ms    23 ± 11 ms
Relaxed RenderScript  57 ± 12 ms   72 ± 15 ms   46 ± 12 ms   59 ± 15 ms
RenderScript          51 ± 4 ms    81 ± 7 ms    41 ± 6 ms    65 ± 5 ms

Table 5.5: Run times for converting a 1920 × 1080 image to grayscale

In this relatively simple algorithm, the C++ OpenCV implementation achievedthe lowest average runtime. The multi-threaded Java implementation wasthe slowest contender.

5.4 Thresholding

Table 5.6 shows the run times of the different implementations when applying thresholding to an image. The resolution of the image was 1920 × 1080. The average run times of applying thresholding to images of other resolutions can be found in Appendix A.

The results and trends shown in Table 5.6 are similar to those of the grayscaling results.



Implementation          Setup time     Runtime         Setup time     Runtime

Java Threaded           57 ± 7 ms      142 ± 13 ms     25 ± 4 ms      110 ± 14 ms
C++ Threaded            55 ± 8 ms      106 ± 13 ms     29 ± 5 ms       93 ± 21 ms
C++ OpenCV              24 ± 4 ms       24 ± 3 ms      26 ± 4 ms       26 ± 4 ms
Relaxed RenderScript    52 ± 12 ms      94 ± 11 ms     33 ± 4 ms       63 ± 7 ms
RenderScript            48 ± 6 ms       95 ± 8 ms      32 ± 3 ms       63 ± 7 ms

Table 5.6: Run times for applying thresholding to a 1920 × 1080 image
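Thresholding is computationally even simpler than grayscaling, which may explain why the two tables look alike. The sketch below shows a plain binary threshold on a grayscale image; the threshold level of 128, the `>=` comparison and the output values are illustrative assumptions, not details taken from the thesis.

```python
def threshold(gray_image, level=128, low=0, high=255):
    """Map each gray value to `high` if it is at least `level`, otherwise to `low`.

    `level`, `low` and `high` are hypothetical defaults chosen for illustration.
    """
    return [[high if value >= level else low for value in row] for row in gray_image]


if __name__ == "__main__":
    print(threshold([[0, 127, 128, 255]]))
```

The work per pixel is a single comparison, so the cost of moving the image into and out of the execution platform can be expected to weigh heavily in the total time.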

Chapter 6

Discussion

In this chapter we discuss the experiments conducted and their results.

6.1 Color space conversion

Real time image processing, such as real time color space conversion, requires high performance. Each frame captured by the camera must be processed in at most roughly 33 ms in order to reach 30 FPS (frames per second). If any real time image processing task takes longer than 33 ms, the user will start to notice stuttering. Table 5.1 shows the difference in run time when applying a formula for converting YUV-frames to RGB.
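The 33 ms budget follows from dividing one second by the target frame rate. A minimal sketch of that arithmetic is given below; the helper names and the sample run times are hypothetical.

```python
def frame_budget_ms(fps):
    """Return the time available per frame, in ms, at the given frame rate."""
    return 1000.0 / fps


def meets_frame_rate(run_time_ms, fps=30):
    """Return True if a per-frame run time fits within the frame budget."""
    return run_time_ms <= frame_budget_ms(fps)


if __name__ == "__main__":
    print(round(frame_budget_ms(30), 1))  # budget at 30 FPS, roughly 33 ms
    for run_time in (20, 33, 45):  # hypothetical per-frame run times in ms
        print(run_time, meets_frame_rate(run_time))
```

Any setup time that recurs for every frame would have to be added to the run time before comparing against the budget.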

Both tables show a significant difference between some of the implementations. In Table 5.1, using Android 4.4, the average run time of the Java implementation is 146% higher than that of the C++ implementation. Furthermore, the OpenCV implementation shows an average run time improvement of 16.2% compared to our C++ implementation. Our RenderScript implementation performed worse than the C++ implementations and the RenderScript intrinsics.

The RenderScript intrinsic on Android 4.4 performed 60.5% better than our RenderScript implementation. On Android 6.0.1, however, the difference was only 35.5%, although it was measured on a larger frame size. The main reason for this is likely the RenderScript engine improvements made between Android 4.4 and Android 6.0.1, together with an increase in processor speed.


The RenderScript intrinsics have likely been fine-tuned by hand by RenderScript engine developers, which would explain their performance. The CPU implementation of the intrinsics can be found in the Android source code (see footnote 1), heavily optimized in assembly code for certain architectures. The GPU version of this code is proprietary and developed by the vendors.

6.2 Blurring

Applying a filter to an image is done in real time less often than color space conversion. Many applications exist today that allow applying filters to an image from the smartphone’s camera roll. It is therefore not as crucial that these filters can be applied fast enough to achieve 30 FPS, but performance is still important so as not to impede the user experience.

Note that many of the results presented earlier in this report were run times of algorithms that were run on an image smaller than what is normally captured by a modern smartphone camera today. Table 5.4 shows the run time comparisons of the algorithms run on a larger image. The differences between the implementations grow as the image size grows, which means that a Java implementation might not be suitable when conducting image processing on images captured by a modern smartphone camera.
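The growth of the gap with image size can be made concrete with the Gaussian filter run times on Android 4.4. The sketch below uses the average run times of the Java and C++ Threaded implementations from Tables A.14, A.15 and 5.4; the variable and function names are illustrative.

```python
# Image resolutions used in the experiments (width, height).
RESOLUTIONS = [(100, 67), (500, 333), (1920, 1080)]

# Average Gaussian filter run times in ms on Android 4.4
# (Tables A.14, A.15 and 5.4), in the same order as RESOLUTIONS.
JAVA_MS = [10, 277, 3067]
CPP_THREADED_MS = [4, 76, 392]


def pixel_count(resolution):
    """Return the number of pixels for a (width, height) resolution."""
    width, height = resolution
    return width * height


def absolute_gaps(slower, faster):
    """Return the per-resolution run time difference in ms."""
    return [s - f for s, f in zip(slower, faster)]


if __name__ == "__main__":
    gaps = absolute_gaps(JAVA_MS, CPP_THREADED_MS)
    for resolution, gap in zip(RESOLUTIONS, gaps):
        print(pixel_count(resolution), "pixels:", gap, "ms slower in Java")
```

A gap of a few milliseconds on a thumbnail grows to several seconds on a 1920 × 1080 image, which is the trend the paragraph above describes.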

However, resorting to RenderScript implementations might not always be necessary. As can be seen in the tables in the previous section, the C++ OpenCV implementation can be considered a strong competitor to the RenderScript implementation. This means that an optimized implementation in a native language can be as fast as, or faster than, the RenderScript code. If the setup time is taken into account, the RenderScript and C++ OpenCV run times are often very close.

However, the results differ largely depending on the precision required in the RenderScript computation scripts. In Table 5.4, the time taken for the Relaxed RenderScript implementation was 47.2% of the average runtime of the RenderScript implementation on Android 6.0.1.

The other blurring filters show similar trends. The outlier is the C++ OpenCV implementation on Android 4.4 when applying median filtering on an image. The performance is significantly increased on Android 6.0.1.

Footnote 1: https://android.googlesource.com/platform/frameworks/rs/+/master/cpu_ref/rsCpuIntrinsics_neon_YuvToRGB.S


The RenderScript intrinsics provided by Google are by far the best option regarding run time. However, Google only provides intrinsics for 11 common tasks [28]. Given that many applications today conduct more sophisticated image processing, the intrinsics might not give developers what they want. If a cross-platform solution must be developed, it is easy to argue for a native language such as C or C++, because RenderScript is only available for the Android platform.

Considering that the RenderScript intrinsics outperform our RenderScript implementations by a large margin, it is worth considering that our implementation might lack optimizations used in the intrinsic implementation. However, as presented in Figure 3.1, the GPU-utilizing implementations do not always outperform their CPU counterparts, meaning the algorithm itself must be considered before deciding whether RenderScript is worth using.

The Gaussian filtering reference implementation provided by Google contains highly optimized assembly code for different architectures (see footnote 2). However, the GPU version of this code is proprietary and developed by the vendors, and the RenderScript intrinsics always outperformed their counterparts, meaning that the GPU vendors’ implementations are favorable.
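For readers unfamiliar with the algorithm being optimized, the following sketch shows a plain, unoptimized Gaussian blur of a single image row. It is not the implementation used in the thesis or in the Android source; the kernel radius, the value of sigma, the clamping of indices at the borders and all function names are assumptions made for illustration.

```python
import math


def gaussian_kernel_1d(radius, sigma):
    """Return a normalized 1D Gaussian kernel with 2 * radius + 1 weights."""
    weights = [
        math.exp(-(x * x) / (2.0 * sigma * sigma))
        for x in range(-radius, radius + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row, kernel):
    """Convolve one row of gray values with the kernel, clamping at the borders."""
    radius = len(kernel) // 2
    last = len(row) - 1
    blurred = []
    for i in range(len(row)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # repeat edge pixels
            acc += weight * row[j]
        blurred.append(acc)
    return blurred


if __name__ == "__main__":
    kernel = gaussian_kernel_1d(radius=2, sigma=1.0)
    print([round(w, 4) for w in kernel])
    print([round(v, 1) for v in blur_row([0, 0, 255, 0, 0], kernel)])
```

Because the Gaussian is separable, a full 2D blur can be obtained by applying the same 1D pass to the rows and then to the columns. Every output pixel requires one multiply-add per kernel weight, which makes this filter far heavier on floating point work than grayscaling or thresholding.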

6.3 Grayscaling and thresholding

The results of the conversion from color to grayscale can be seen in Table 5.5. The best implementation was the C++ OpenCV implementation, followed by the Relaxed RenderScript implementation. However, the average setup time for the RenderScript implementation is 26 ms higher than the average setup time for the OpenCV version. Without counting the setup, the RenderScript implementation performs nearly as well as the C++ OpenCV implementation. The runtime of this algorithm is small compared to the blurring, where we see runtimes of > 100 ms. The reason that the RenderScript implementation is lacking in run time performance here may be that it is too costly to pass data to the buffers needed in the RenderScript runtime. Note that the grayscaling was only conducted on images of resolution 1920 × 1080, and the performance difference would likely be smaller on smaller images, as we have seen in the blurring and color space conversion results. The thresholding results look very much like the grayscaling results, and the same trends can be found in Table 5.6.

Footnote 2: https://android.googlesource.com/platform/frameworks/rs/+/master/cpu_ref/rsCpuIntrinsics_advsimd_Blur.S


Our Java and C++ implementations do not differ much in run time performance when doing grayscaling and thresholding. The calculations for these algorithms are small compared to the blurring, meaning that Java can perhaps be considered a viable candidate for very simple image processing tasks.

6.4 Overall Performance

The single-threaded implementations of C++ and Java were the worst candidates, and their multi-threaded counterparts achieved significantly higher runtime performance. However, our C++ implementation performed consistently worse than the OpenCV implementation, with a p-value as low as 0.0003.

The average runtime of the OpenCV C++ implementation, over all the algorithms, proved to be better than both the RenderScript and the Relaxed RenderScript implementations as well, with p-values of 0.0036 and 0.0114, respectively.
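The p-values above come from the pairwise comparison procedure described in Chapter 4, which is not reproduced in this section. Since the bibliography cites the Wilcoxon signed-rank test [37, 38], the sketch below assumes that this is the test in question. It is a small, exact implementation intended only for illustration: it enumerates all sign assignments, so it is only practical for small samples, and the input values in the example are made up rather than taken from the measurements.

```python
from itertools import product


def signed_ranks(differences):
    """Rank non-zero differences by absolute value, averaging ranks over ties.

    Returns a list of (rank, is_positive) tuples.
    """
    nonzero = [d for d in differences if d != 0]
    ordered = sorted(nonzero, key=abs)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        rank_of[abs(ordered[i])] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j
    return [(rank_of[abs(d)], d > 0) for d in nonzero]


def wilcoxon_signed_rank(a, b):
    """Return (W, exact two-sided p-value) for the paired samples a and b."""
    pairs = signed_ranks([x - y for x, y in zip(a, b)])
    w_plus = sum(rank for rank, positive in pairs if positive)
    w_minus = sum(rank for rank, positive in pairs if not positive)
    w = min(w_plus, w_minus)
    ranks = [rank for rank, _ in pairs]
    # Count sign assignments whose positive rank sum is at most W.
    hits = sum(
        1
        for signs in product((0, 1), repeat=len(ranks))
        if sum(rank for rank, sign in zip(ranks, signs) if sign) <= w
    )
    p_value = min(1.0, 2.0 * hits / 2 ** len(ranks))
    return w, p_value


if __name__ == "__main__":
    # Hypothetical paired run times (ms) of two implementations on five tasks.
    first = [11, 10, 13, 14, 15]
    second = [10, 12, 10, 10, 10]
    print(wilcoxon_signed_rank(first, second))
```

With only a handful of pairs the smallest attainable two-sided p-value is fairly large, which illustrates why implementations with few data points cannot reach significance at the 95% level.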

Figure 6.1: Plot showing run times of Relaxed RenderScript (x-axis) and OpenCV (y-axis). Blue circles indicate a 640 × 480 resolution on the image processed, red indicates a resolution of 500 × 333, and green indicates a resolution of 1920 × 1080.

Figure 6.1 shows the average runtimes of OpenCV and Relaxed RenderScript on Android 6.0.1. OpenCV performed better than the Relaxed RenderScript version in the majority of cases. However, this was the performance of the implementations over all the algorithms and all resolutions. If we compare the runtimes of the Box filter and Gaussian filter on a 1920 × 1080 image, the Relaxed RenderScript version performs better, with p-values < 0.05. OpenCV performs better than the Relaxed RenderScript implementations on the smaller images, likely due to reduced setup time. In addition to this, OpenCV performs better in the thresholding and grayscaling tasks on large images as well.

Despite the relatively poor performance of our C++ implementation, it cannot be inferred from the data that it is worse than the RenderScript or Relaxed RenderScript implementations in the average case with 95% confidence. However, in the case of blurring, thresholding and grayscaling 1920 × 1080 images, the Relaxed RenderScript implementation outperforms our Threaded C++ implementation, yielding p-values < 0.05.

The RenderScript intrinsics outperformed all other implementations in every algorithm. However, with intrinsics only being available for the Gaussian blurring and color space conversion, there are not enough data points to say that they are better with 95% certainty.

The RenderScript and Relaxed RenderScript implementations did not show any significant difference when comparing their runtimes pairwise. However, as with Relaxed RenderScript and OpenCV, the performance is significantly different in certain cases. In the Gaussian and Box blurring, the Relaxed RenderScript implementation was significantly faster than RenderScript. In the other cases, the difference is insignificant. This is likely due to the fact that the Gaussian and Box blurring contain more floating point operations than the other algorithms implemented.

For clarity, the outcomes of the statistical significance tests are presented in Table 6.1. The results are calculated as described in Chapter 4, and are thus calculated from the run times of all algorithms on all image resolutions.


                 Java    Java Threaded   C++     C++ Threaded   C++ OpenCV   RS      Relaxed RS

Java             -       JT              C++     C++T           C++O         RS      RSR
Java Threaded    JT      -               -       C++T           C++O         -       -
C++              C++     -               -       -              C++O         -       -
C++ Threaded     C++T    C++T            -       -              C++O         -       -
C++ OpenCV       C++O    C++O            C++O    C++O           -            C++O    C++O
RS               RS      -               -       -              C++O         -       -
Relaxed RS       RSR     -               -       -              C++O         -       -

Table 6.1: Statistical significance. The names have been abbreviated as follows: Java Threaded: JT, C++ Threaded: C++T, C++ OpenCV: C++O, RenderScript Relaxed: RSR, RenderScript: RS

The Intrinsic RenderScript implementations did not show any statistical significance, due to the few test cases available, and have therefore not been included in Table 6.1. Note again that certain implementations proved to be better than others in certain cases, which is not reflected in the table. The Relaxed RenderScript implementation outperformed our Threaded C++ implementation in many tasks on higher resolution images, for example. The table shows that the third-party C++ implementation found in OpenCV performed best on average.

6.5 Threats to validity

6.5.1 Choice of algorithms

The algorithms implemented as part of this project were selected to be representative of common image processing tasks. The algorithms chosen have been implemented in open-source projects and were therefore deemed suitable for testing the performance of the available execution platforms. Even though the algorithms implemented have different properties that could affect their run time performance, similar trends can be found in many of the results. However, there might exist other image processing algorithms with different properties where another implementation language might be favorable.


6.5.2 High variance

The variance of the collected run times was, in some measurements, as high as 20% of the average run time. The high variances could most often be seen where the average run time of the algorithm was low, i.e., < 100 ms. This could be the result of the Android system performing background tasks, leaving less processing power for the application. However, despite the high variance in some cases, trends are still very visible in the results.
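A spread expressed as a percentage of the average run time can be computed as the ratio of the standard deviation to the mean. The sketch below assumes this is the measure meant above, which the text does not state explicitly; the function name and the sample values are hypothetical.

```python
import statistics


def relative_spread(samples_ms):
    """Return the sample standard deviation divided by the mean run time."""
    return statistics.stdev(samples_ms) / statistics.mean(samples_ms)


if __name__ == "__main__":
    samples = [8, 10, 12]  # hypothetical run times in ms
    print(round(relative_spread(samples), 3))
```

For short run times a fixed disturbance of a few milliseconds, for example from a background task, translates into a large relative spread, while the same disturbance is negligible for run times of several seconds.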

6.5.3 Devices

Two devices were used to test the runtime performance of the three execution platforms, while the number of different Android devices exceeds 20000. The devices used in this project use similar chipsets and GPUs, meaning that the same results might not be replicable on processing units from other vendors. However, the Samsung Galaxy models are the most popular series of Android devices, and Qualcomm GPUs are the second most commonly seen GPUs [19]. The devices used in this project can therefore represent a wide variety of commonly used devices.

6.5.4 Image sizes

The image processing algorithms were applied to images with sizes ranging from 100 × 67 to 1920 × 1080. The smallest images represent thumbnails, i.e., reduced-size versions of images used to help recognition, whereas the largest image represents pictures taken with a smartphone camera. High-end devices in the current generation of smartphones can take pictures with a higher resolution than 1920 × 1080, and these were not taken into account. The resolutions used in this project were chosen to represent an average Android device. The trends visible in the results of this project may not extrapolate to the performance of the algorithms on larger images.

6.5.5 Optimization

The OpenCV C++ implementation performed better than our own multi-threaded C++ implementation. The OpenCV implementation has, however, been optimized by many contributors over a long period of time. The implementations tested in this project can therefore likely be optimized further. However, our Java, C++ and RenderScript implementations contained identical algorithms and did not use any language-specific feature or optimization, and can therefore be used as a benchmark of language performance.

6.6 Future Research

Any future research conducted will likely use a better JIT compiler. Dynamic compilation allows for optimizations that are platform specific, and Java might therefore be able to surpass the performance of native languages, since they are statically compiled with no access to runtime information. Reinholtz [23] claims that Java performance will eventually surpass that of C++, and therefore comparing Java with native languages will be interesting in the future as well.

Increasing the performance of applications in the Android system is important for battery life. Developers have to consider the user’s battery when conducting computationally intensive tasks in an application. It could therefore be interesting to examine the RenderScript runtime and its effect on battery life. The amount of extra memory required to utilize the RenderScript runtime could be interesting to measure as well.

The native language chosen in this thesis was C++ because of the availability of the Android NDK. However, developers are free to use other native languages on the Android platform as well. The programming language Go [34] has support for mobile tools in versions above 1.5, allowing developers to generate bindings to use existing Go code in an Android project or write entire applications in Go. Go is a statically compiled language with automatic memory management, meaning that it, much like Java, trades performance for safety. However, it does not run on the JVM and can therefore be a viable competitor to other native languages on the Android system. Apple also states that it is possible to use the Swift [32] programming language on Android devices, which could be a contender. Swift is a programming language used for building iOS applications, meaning that using Swift on Android enables sharing code across platforms. Both Swift and Go only support ARM architectures, however, meaning that not all Android devices are supported.

A more practical approach to GPU acceleration than RenderScript can be to use a cross-platform framework such as OpenCL. OpenCL currently has limited support for the Android platform, but the code can be reused for, e.g., iOS devices. Many large applications are developed for multiple platforms, meaning that other GPU acceleration frameworks should be evaluated as well.

Chapter 7

Conclusion

Recall the research question posed in the introduction:

Can performance increases in run time warrant the usage of C++ or GPU acceleration frameworks over Java when writing image processing algorithms on Android?

All tests showed that our C++ implementation performed significantly better than the corresponding Java implementations. The RenderScript implementations were significantly faster than Java on large images, but did not perform better in the average case. As such, Java cannot be considered a viable option when conducting advanced image processing on large images on the current generation of smartphones. However, the difference in run time performance between Java and C++ is minor when the calculations are very simple (e.g., grayscaling) or the images are very small.

Our RenderScript implementations with full floating point precision did not turn out to perform better than their C++ counterparts. If compliance with the IEEE Standard for Floating-Point Arithmetic is required, C++ is the recommended implementation language. If there is no strict requirement on floating point precision, RenderScript can outperform the C++ implementation, although our results did not show a statistically significant difference between the two in the average case. However, when the algorithms were applied to larger images, the RenderScript implementation with low floating point arithmetic precision proved to be better than the C++ implementation.


Bibliography

[1] https://streamcomputing.eu/blog/2013-08-01/google-blocked-opencl-on-android-4-3/.

[2] ART GC overview. https://source.android.com/devices/tech/dalvik/gc-debug.html. Accessed on 2017-04-01.

[3] Bill Buzbee Ben Cheng. A JIT Compiler for Android’s Dalvik VM. http://www.android-app-developer.co.uk/android-app-development-docs/android-jit-compiler-androids-dalvik-vm.pdf. Accessed on 2017-02-07.

[4] Ananya Bhattacharya. Android just hit a record 88% market share of all smartphones. https://source.android.com/devices/tech/dalvik/. Accessed on 2017-02-07.

[5] CMake. https://cmake.org/. Accessed on 2017-03-21.

[6] Evolution of Renderscript Performance. https://android-developers.googleblog.com/2013/01/evolution-of-renderscript-performance.html. Accessed on 2017-03-03.

[7] Gaussian Kernel Calculator. http://dev.theomader.com/gaussian-kernel-calculator/. Accessed on 2017-03-23.

[8] Getting Started with the NDK. https://developer.android.com/ndk/guides/index.html. Accessed on 2017-03-20.

[9] Luca Gherardi, Davide Brugali, and Daniele Comotti. “A java vs. c++ performance evaluation: a 3d modeling benchmark”. In: International Conference on Simulation, Modeling, and Programming for Autonomous Robots. Springer. 2012, pp. 161–172.

[10] Gradle Build Tool. https://gradle.org/. Accessed on 2017-03-20.

[11] Nassim A Halli, Henri-Pierre Charles, and Jean-François Mehaut. “Performance comparison between Java and JNI for optimal implementation of computational micro-kernels”. In: arXiv preprint arXiv:1412.6765 (2014).


[12] How ART works. https://source.android.com/devices/tech/dalvik/configure.html#how_art_works. Accessed on 2017-03-02.

[13] Robert Hundt. “Loop recognition in C++/Java/Go/Scala”. In: Proceedings of Scala Days 2011 (2011), p. 38.

[14] IDC. Smartphone OS Market Share, 2016 Q3. http://www.idc.com/promo/smartphone-market-share/os. Accessed on 2017-02-03.

[15] IEEE SA - 754-2008 - IEEE Standard for Floating-Point Arithmetic. https://standards.ieee.org/findstds/standard/754-2008.html. Accessed on 2017-05-07.

[16] SeongKi Kim and Seok-Kyoo Kim. “Comparison of OpenCL and RenderScript for mobile devices”. In: Multimedia Tools and Applications 75.22 (2016), pp. 14161–14179.

[17] Tobias Konradsson. ART and Dalvik performance compared. 2015.

[18] Cheng-Min Lin et al. “Benchmark Dalvik and native code for Android system”. In: Innovations in Bio-inspired Computing and Applications (IBICA), 2011 Second International Conference on. IEEE. 2011, pp. 320–323.

[19] Mobile Hardware Statistics. http://hwstats.unity3d.com/mobile/index.html.

[20] OpenCV Library. http://opencv.org. Accessed on 2017-03-23.

[21] Charles Poynton. Digital video and HD: Algorithms and Interfaces. Elsevier, 2012.

[22] Recommendation ITU-R BT.601-5: Studio Encoding Parameters of Digital Television for Standard 4:3 and wide-screen 16:9 Aspect Ratios. https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.601-5-199510-S!!PDF-E.pdf.

[23] Kirk Reinholtz. “Java will be faster than C++”. In: ACM Sigplan Notices 35.2 (2000), pp. 25–28.

[24] RenderScript Intrinsics. https://android-developers.googleblog.com/2013/08/renderscript-intrinsics.html. Accessed on 2017-03-18.

[25] James A Ross et al. “A case study of OpenCL on an Android mobile GPU”. In: High Performance Extreme Computing Conference (HPEC), 2014 IEEE. IEEE. 2014, pp. 1–6.

[26] R Jason Sams. Evolution of Renderscript Performance. https://android-developers.googleblog.com/2013/01/evolution-of-renderscript-performance.html. Accessed on 2017-02-07.


[27] R Jason Sams. Levels in Renderscript. https://android-developers.googleblog.com/2011/03/renderscript.html. Accessed on 2017-02-07.

[28] ScriptIntrinsic. https://developer.android.com/reference/android/renderscript/ScriptIntrinsic.html. Accessed on 2017-04-20.

[29] Linda Shapiro and George C Stockman. “Computer Vision”. In: ed: Prentice Hall (2001).

[30] Ki-Cheol Son and Jong-Yeol Lee. “The method of Android application speed up by using NDK”. In: Awareness Science and Technology (iCAST), 2011 3rd International Conference on. IEEE. 2011, pp. 382–385.

[31] Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios. http://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.601-7-201103-I!!PDF-E.pdf.

[32] Swift. https://swift.org/. Accessed on 2017-04-20.

[33] SystemClock. https://developer.android.com/reference/android/os/SystemClock.html.

[34] The Go Programming Language. https://golang.org/. Accessed on 2017-04-20.

[35] The open standard for parallel programming of heterogeneous systems. https://www.khronos.org/opencl/. Accessed on 2017-02-25.

[36] Guohui Wang et al. “Accelerating computer vision algorithms using OpenCL framework on the mobile GPU-a case study”. In: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE. 2013, pp. 2629–2633.

[37] Wilcoxon Signed-Rank Test. http://vassarstats.net/wilcoxon.html.

[38] Frank Wilcoxon. “Individual comparisons by ranking methods”. In: Biometrics bulletin 1.6 (1945), pp. 80–83.

Appendix A

Tables

A.1 Blurring


Android 4.4

                          Median                Box                   Gauss
Implementation            low   med   high      low   med   high      low   med   high

Java                      13    356   3092      9     269   3092      10    277   3067
Java Threaded             14    178   869       10    128   869       10    130   877
C++                       5     157   1560      3     106   945       2     111   936
C++ Threaded              7     99    646       5     75    353       4     76    392
C++ OpenCV                8     212   3176      0     17    201       0     30    240
RenderScript              -     -     -         6     75    402       9     76    420
RenderScript Intrinsic    -     -     -         -     -     -         1     43    124
Relaxed RenderScript      -     -     -         6     59    168       6     58    166

Android 6.0.1

                          Median                Box                   Gauss
Implementation            low   med   high      low   med   high      low   med   high

Java                      12    235   2983      16    388   4907      17    395   4959
Java Threaded             9     90    1230      18    88    1084      25    84    1115
C++                       7     154   1835      3     77    939       2     75    939
C++ Threaded              10    49    717       8     43    311       4     48    325
C++ OpenCV                0     12    201       1     23    236       1     31    285
RenderScript              -     -     -         15    33    324       16    36    356
RenderScript Intrinsic    -     -     -         -     -     -         2     8     49
Relaxed RenderScript      -     -     -         19    29    151       21    29    168

Table A.1: Average run times for image smoothing operations. Resolution low indicates a 100 × 67 image, medium indicates a 500 × 333 image and high resolution indicates a 1920 × 1080 image.


A.1.1 Box filter

Android 4.4

Table A.2 shows the run times of the box filter implementations on a 100 × 67 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)    Avg (ms)     Min (ms)    Max (ms)

Java                    0 ± 0              9 ± 0.38     9           10
Java Threaded           0 ± 0              10 ± 1.59    7           13
C++                     0 ± 1              3 ± 1.02     2           6
C++ Threaded            0 ± 0              5 ± 1.86     2           11
C++ OpenCV              0 ± 0              0 ± 0        0           0
RenderScript            3 ± 2              6 ± 0.82     5           8
Relaxed RenderScript    4 ± 2              6 ± 1.60     4           10

Table A.2: Run times for applying box filter to a 100 × 67 image, on Android 4.4

Table A.3 shows the run times of the box filter implementations on a 500 × 333 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm               Setup time (ms)    Avg (ms)     Min (ms)    Max (ms)

Java                    23 ± 3             269 ± 17     261         320
Java Threaded           25 ± 3             128 ± 11     109         155
C++                     23 ± 4             106 ± 23     90          156
C++ Threaded            28 ± 6             75 ± 9       63          106
C++ OpenCV              0 ± 0              17 ± 4       15          34
RenderScript            29 ± 5             75 ± 9       61          95
Relaxed RenderScript    43 ± 9             59 ± 14      37          79

Table A.3: Run times for applying box filter to a 500 × 333 image, on Android 4.4

Table A.4 shows the run times of the box filter implementations on a 1920 × 1080 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Table A.4: Run times for applying box filter to a 1920 × 1080 image, on Android 4.4

Android 6.0.1

Table A.5 shows the run times of the box filter implementations on a 100 × 67 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)    Avg (ms)     Min (ms)    Max (ms)

Java                    0 ± 1              16 ± 6.02    12          37
Java Threaded           0 ± 0              18 ± 5.17    12          37
C++                     0 ± 0              3 ± 1.14     2           4
C++ Threaded            0 ± 4              8 ± 1.87     6           14
C++ OpenCV              0 ± 0              1 ± 0.80     0           4
RenderScript            12 ± 2             15 ± 8.39    10          50
Relaxed RenderScript    13 ± 1             19 ± 4.12    12          30

Table A.5: Run times for applying box filter to a 100 × 67 image, on Android 6.0.1

Table A.6 shows the run times of the box filter implementations on a 500 × 333 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Table A.6: Run times for applying box filter to a 500 × 333 image, on Android 6.0.1

Table A.7 shows the run times of the box filter implementations on a 1920 × 1080 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm               Setup time (ms)    Avg (ms)      Min (ms)    Max (ms)

Java                    35 ± 6             4907 ± 124    4693        5164
Java Threaded           34 ± 8             1084 ± 60     994         1259
C++                     33 ± 2             939 ± 25      896         979
C++ Threaded            32 ± 7             311 ± 35      261         402
C++ OpenCV              13 ± 4             236 ± 38      171         320
RenderScript            55 ± 12            324 ± 20      289         372
Relaxed RenderScript    68 ± 18            151 ± 30      113         212

Table A.7: Run times for applying box filter to a 1920 × 1080 image, on Android 6.0.1

A.1.2 Median filter

Android 4.4

Table A.8 shows the run times of the median filter implementations on a 100 × 67 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm        Setup time (ms)    Avg (ms)     Min (ms)    Max (ms)

Java             0 ± 1              13 ± 0.18    13          14
Java Threaded    0 ± 0              14 ± 2.44    10          21
C++              0 ± 0              5 ± 0        5           5
C++ Threaded     0 ± 1              7 ± 1.91     4           12
C++ OpenCV       0 ± 0              8 ± 0        8           8

Table A.8: Run times for applying median filter to a 100 × 67 image, on Android 4.4

Table A.9 shows the run times of the median filter implementations on a 500 × 333 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm        Setup time (ms)    Avg (ms)    Min (ms)    Max (ms)

Java             23 ± 4             356 ± 23    341         414
Java Threaded    27 ± 5             178 ± 11    157         203
C++              23 ± 2             157 ± 20    147         209
C++ Threaded     24 ± 3             99 ± 10     85          116
C++ OpenCV       0 ± 1              212 ± 1     211         212

Table A.9: Run times for applying median filter to a 500 × 333 image, on Android 4.4

Table A.10 shows the run times of the median filter implementations on a 1920 × 1080 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Table A.10: Run times for applying median filter to a 1920 × 1080 image, on Android 4.4


Android 6.0.1

Table A.11 shows the run times of the median filter implementations on a 100 × 67 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm        Setup time (ms)    Avg (ms)     Min (ms)    Max (ms)

Java             0 ± 0              12 ± 4.63    8           24
Java Threaded    0 ± 0              9 ± 3.19     5           18
C++              0 ± 0              7 ± 3.95     5           25
C++ Threaded     0 ± 0              10 ± 5.49    4           31
C++ OpenCV       0 ± 1              0 ± 0.87     0           4

Table A.11: Run times for applying median filter to a 100 × 67 image, on Android 6.0.1

Table A.12 shows the run times of the median filter implementations on a 500 × 333 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Table A.13 shows the run times of the median filter implementations on a 1920 × 1080 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.





A.1.3 Gaussian filter

Android 4.4

Table A.14 shows the run times of the Gaussian filter implementations on a 100 × 67 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm                 Setup time (ms)    Avg (ms)     Min (ms)    Max (ms)

Java                      0 ± 0              10 ± 0.31    9           10
Java Threaded             0 ± 1              8 ± 1.38     5           11
C++                       0 ± 1              2 ± 0        2           2
C++ Threaded              0 ± 0              4 ± 0.72     3           5
C++ OpenCV                0 ± 0              0 ± 0        0           0
RenderScript              5 ± 1              9 ± 2.36     6           17
RenderScript Intrinsic    0 ± 1              1 ± 0.30     1           2
Relaxed RenderScript      4 ± 2              6 ± 1.45     5           10

Table A.14: Run times for applying Gaussian filter to a 100 × 67 image, on Android 4.4

Table A.15 shows the run times of the Gaussian filter implementations on a 500 × 333 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Algorithm                 Setup time (ms)    Avg (ms)    Min (ms)    Max (ms)

Java                      23 ± 4             277 ± 25    262         337
Java Threaded             26 ± 3             130 ± 10    113         151
C++                       24 ± 1             111 ± 25    90          168
C++ Threaded              25 ± 2             76 ± 9      60          98
C++ OpenCV                1 ± 1              30 ± 6      28          58
RenderScript              30 ± 3             76 ± 6      64          89
RenderScript Intrinsic    32 ± 4             43 ± 6      25          54
Relaxed RenderScript      44 ± 9             58 ± 16     34          102

Table A.15: Run times for applying Gaussian filter to a 500 × 333 image, on Android 4.4

Table A.16 shows the run times of the Gaussian filter implementations on a 1920 × 1080 image, running Android 4.4. The run times were captured on a Sony Xperia Z1.

Table A.16: Run times for applying Gaussian filter to a 1920 × 1080 image, on Android 4.4

Android 6.0.1

Table A.17 shows the run times of the Gaussian filter implementations on a 100 × 67 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Algorithm                 Setup time (ms)    Avg (ms)     Min (ms)    Max (ms)

Java                      0 ± 1              17 ± 7.18    12          43
Java Threaded             0 ± 2              25 ± 8.62    13          47
C++                       0 ± 1              2 ± 0.55     2           4
C++ Threaded              0 ± 0              4 ± 2.45     2           13
C++ OpenCV                0 ± 0              1 ± 1.88     0           8
RenderScript              12 ± 3             16 ± 6.48    11          38
RenderScript Intrinsic    1 ± 1              2 ± 1.67     1           9
Relaxed RenderScript      15 ± 1             21 ± 6.39    16          51

Table A.17: Run times for applying Gaussian filter to a 100 × 67 image, on Android 6.0.1

Table A.18 shows the run times of the Gaussian filter implementations on a 500 × 333 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Table A.18: Run times for applying Gaussian filter to a 500 × 333 image, on Android 6.0.1

Table A.19 shows the run times of the Gaussian filter implementations on a 1920 × 1080 image, running Android 6.0.1. The run times were captured on a Samsung Galaxy S5.

Table A.19: Run times for applying Gaussian filter to a 1920 × 1080 image, on Android 6.0.1

A.2 Grayscaling

Table A.20 shows the average run and setup times for applying the grayscaling algorithm to a 500 × 333 image.

Implementation          Setup time    Runtime    Setup time    Runtime

Java Threaded           0 ± 1         9 ± 2      4 ± 1         16 ± 4
C++ Threaded            0 ± 0         5 ± 1      3 ± 1         15 ± 3
C++ OpenCV              0 ± 1         1 ± 0      1 ± 0         2 ± 1
Relaxed RenderScript    25 ± 6        41 ± 9     19 ± 4        19 ± 3
RenderScript            29 ± 3        35 ± 7     20 ± 5        20 ± 7

Table A.20: Run times for applying grayscaling to a 500 × 333 image

Table A.21 shows the average run and setup times for applying the grayscaling algorithm to a 100 × 67 image.

Table A.21: Run times for applying grayscaling to a 100 × 67 image

A.3 Thresholding




Table A.22: Run times for applying thresholding to a 500× 333-image




Table A.23: Run times for applying thresholding to a 100× 67-image
