Transcript
Page 1: FPGA-Based Simultaneous Localization and Mapping (SLAM) using High-Level Synthesis — Basile Van Hoorick

FPGA-Based Simultaneous Localization and Mapping (SLAM) using High-Level Synthesis

Academic year 2018-2019

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Electrical Engineering - main subject Communication and Information Technology

Supervisors: Prof. dr. ir. Bart Goossens, Prof. dr. ir. Erik D'Hollander
Counsellors: Dr. ir. Jan Aelterman, Ir. Michiel Vlaminck

Basile Van Hoorick
Student number: 01404852


Admission to Loan

The author gives his permission to make this master's dissertation available for consultation and to copy parts of this master's dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to explicitly state the source when quoting results from this master's dissertation.

Basile Van Hoorick, May 2019


Acknowledgements

First and foremost, I would like to express my sincerest gratitude towards Prof. dr. ir. Bart Goossens and Prof. em. dr. ir. Erik D'Hollander for giving me the opportunity to conduct this master's dissertation at the Department of Telecommunications and Information Processing. I truly appreciate their vast expertise and would like to thank them for their guidance towards making substantiated decisions, as well as for their outstanding passion in their respective fields of expertise.

In particular, Prof. em. dr. ir. Erik D'Hollander of the Department of Electronics and Information Systems has been extremely helpful with regard to the sophisticated practicalities of testing heterogeneous computer systems. I have learned a great deal about Field-Programmable Gate Arrays over the past ten months, and I could not possibly have wished for a more driven and competent supervisor than him.

I also want to thank Prof. dr. ir. Bart Goossens for offering his aid and extensive knowledge regarding Simultaneous Localization and Mapping, as well as for providing me with helpful suggestions and tips throughout the year. Furthermore, I am grateful to Prof. dr. ir. Wilfried Philips, Prof. dr. ir. Peter Veelaert and other researchers at the Image Processing and Interpretation group for their valuable feedback and advice given during the two intermediate thesis presentations.

Last but not least, I would like to thank my parents, family and friends for their indispensable support and encouragement throughout the entire period of my studies. Distinct credit goes to Tinus Pannier, Clemens Schlegel, Jacques Van Damme and Viktor Verstraelen, with whom I have shared many pleasant breaks and memorable moments during this exceptionally busy year.

Basile Van Hoorick, May 2019


FPGA-Based Simultaneous Localization and Mapping using High-Level Synthesis

by

Basile VAN HOORICK

Master’s dissertation submitted in order to obtain the academic degree of

MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

Academic year 2018-2019

Promoters: Prof. dr. ir. Bart GOOSSENS, Prof. em. dr. ir. Erik D’HOLLANDER

Supervisors: dr. ir. Jan AELTERMAN, ir. Michiel VLAMINCK

Faculty of Engineering and Architecture

Ghent University

Department of Telecommunications and Information Processing

Chairman: Prof. dr. ir. Joris WALRAEVENS

Abstract

SLAM is growing in popularity, yet an embedded, low-power and real-time solution for dense 3D scene reconstruction is still lacking. An attempt to fill this gap with the Xilinx Zynq-7020 SoC resulted in the formation and evaluation of a detailed methodology that tackles several types of typical routines in the image processing domain using HLS. The devised principles and guidelines are then tested by applying them to eight kernels of an established 3D SLAM application, revealing powerful potential and an estimated holistic speed-up of ×40.4 over execution on the ARM Cortex-A9 CPU. Multi-modal, multi-resolution dataflow architectures are subsequently proposed and compared, with the purpose of efficiently mapping algorithmic blocks and their interconnections to hardware while conforming to the FPGA's limitations. A trade-off between area and throughput appears to be the deciding factor, although further research is desired towards merging the two identified Pareto-optimal techniques.

Keywords

Simultaneous Localization and Mapping, Field-Programmable Gate Array, High-Level Synthesis, Image Processing, System-on-Chip


FPGA-Based Simultaneous Localization and Mapping using High-Level Synthesis

Basile Van Hoorick

Supervisors: Prof. dr. ir. Bart Goossens, Prof. em. dr. ir. Erik D’Hollander

Abstract — SLAM is growing in popularity, yet an embedded, low-power and real-time solution for dense 3D scene reconstruction is still lacking. An attempt to fill this gap with the Xilinx Zynq-7020 SoC resulted in the formation and evaluation of a detailed methodology that tackles several types of typical routines in the image processing domain using HLS. The devised principles and guidelines are then tested by applying them to eight kernels of an established 3D SLAM application, revealing powerful potential and an estimated holistic speed-up of ×40.4 over execution on the ARM Cortex-A9 CPU. Multi-modal, multi-resolution dataflow architectures are subsequently proposed and compared, with the purpose of efficiently mapping algorithmic blocks and their interconnections to hardware while conforming to the FPGA's limitations. A trade-off between area and throughput seems to be the deciding factor, although further research is desired towards merging the two identified Pareto-optimal techniques.

Keywords — Simultaneous Localization and Mapping, Field-Programmable Gate Array, High-Level Synthesis, Image Processing, System-on-Chip

I. INTRODUCTION

As we embark on the road towards a more autonomous world, countless challenges and opportunities emerge in various subdisciplines of computer architecture, algorithm design and electronics. One such challenge is Simultaneous Localization and Mapping (SLAM), which attempts to make a robot aware of its surroundings. The goal of SLAM is to track the position and orientation of an agent within an unknown environment, while simultaneously constructing a model of this very environment [1]. Dense SLAM variants distinguish themselves from their sparse counterparts by incorporating as much sensor data as possible into their global reconstruction. However, their considerable advantage of producing a high-quality model that is reusable across applications comes at the cost of far greater computational complexity [2]. At the same time, embedded SLAM solutions are in high demand due to their many use cases on mobile and low-power devices such as autonomous vehicles [3].

In this master's dissertation, a framework is presented by which SLAM kernels and, by extension, image processing kernels in general can be mapped effectively onto Field-Programmable Gate Arrays (FPGAs). The FPGA is a reconfigurable integrated circuit that can reach high performance at low power consumption [4], offering a flexible platform on which to evaluate the hardware implementation of a dense 3D SLAM algorithm. High-Level Synthesis (HLS) tools are employed because of their capability to perform high-level, pragma-directed compilation of C code into hardware [5]. The use case of choice is KinectFusion, a prominent scene reconstruction algorithm [6] that is representative of diverse paradigms in both 2D and 3D image processing. The only existing work in the literature that accelerates parts of KinectFusion on an FPGA also uses a GPU [7], which is avoided in this thesis due to its high energy consumption. We also explore how multiple kernels with complex dataflow characteristics can be combined in hardware so as to form an efficient, large-scale pipeline consisting of functional blocks.

II. HLS DESIGN OF INDIVIDUAL KERNELS

A. Methodology

Every kernel under consideration can be categorized according to one or multiple parallel patterns most closely associated with its computational and/or data management structure [8]. Techniques are developed to deal with the following patterns in HLS:

• Map & Reduce: The independence of every input (and output) pixel lends itself to the application of pipelining and AXI streaming interfaces, enforcing the single-read, single-write principle for every element in the array while overlapping multiple instances of similar calculations in time, so as to enable efficient use of DSPs and other hardware blocks.

• Stencil: In addition to the above, line buffers and memory windows (see Figure 1) must be inserted in order to fully exploit data reuse and preserve the I/O streaming model [9][10]. Further speed-ups are obtained by partitioning both arrays in certain dimensions across multiple instances of local storage, which prevents the internal block RAM from becoming a bottleneck due to the large number of concurrent data accesses.

• Gather: Reads from irregular positions in large arrays are more complicated to handle on an FPGA due to its limited local memory size. As continuous requests to DDR DRAM form significant bottlenecks in practice [11], the use of scratchpads is recommended to cache (portions of) the region of interest. Multiple re-executions of the subroutine might be necessary to adequately deal with all required data.

Figure 1: Interaction between the line buffer and window for Stencil-type kernels, visualized on the input image (left) and as structured in memory (right).
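The line-buffer/window interplay of Figure 1 can be modeled in plain C++ as below. This is a hedged sketch under assumptions (a 3×3 box sum on an 8-pixel-wide image; the thesis kernels and widths differ): each pixel is read exactly once from the input stream, two line buffers hold the previous rows (BRAM in hardware), and a 3×3 window of registers slides along the row.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr int W = 8; // image width (assumed small for the example)

std::vector<uint32_t> box3x3(const std::vector<uint32_t>& in, int h) {
    uint32_t line[2][W] = {};   // two line buffers (previous rows)
    uint32_t win[3][3] = {};    // 3x3 sliding window
    std::vector<uint32_t> out(in.size(), 0);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < W; ++x) {
            uint32_t px = in[y * W + x];      // single stream read per pixel
            for (int r = 0; r < 3; ++r)       // shift window one column left
                for (int c = 0; c < 2; ++c)
                    win[r][c] = win[r][c + 1];
            win[0][2] = line[0][x];           // new column from line buffers
            win[1][2] = line[1][x];
            win[2][2] = px;
            line[0][x] = line[1][x];          // rotate line buffers
            line[1][x] = px;
            if (y >= 2 && x >= 2) {           // window fully valid
                uint32_t s = 0;
                for (int r = 0; r < 3; ++r)
                    for (int c = 0; c < 3; ++c)
                        s += win[r][c];
                out[(y - 1) * W + (x - 1)] = s; // centered output pixel
            }
        }
    }
    return out;
}
```

Because the window shift and line-buffer rotate are fixed-size loops over registers, an HLS compiler can fully unroll them and sustain one pixel per clock cycle, which is exactly why the partitioning mentioned above matters.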

An initiation interval (II) of one clock cycle is the goal in the majority of cases, so that no further speed-up is possible unless processing elements are duplicated. The choice between fixed-point and floating-point data type representations depends on the complexity and kind of operations employed in each kernel, but the former usually results in a more hardware-efficient design, despite the possible overhead introduced by conversions.
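As a minimal sketch of the fixed-point idea: the thesis targets Vivado HLS ap_fixed types, but the Q16.16 integer emulation below (an assumption made so the example is self-contained) shows why fixed-point is cheaper in hardware, since a multiplication reduces to one integer multiply plus a shift, mapping directly onto DSP blocks.

```cpp
#include <cassert>
#include <cstdint>

using fix32 = int32_t;                 // Q16.16: 16 integer, 16 fractional bits
constexpr int FRAC = 16;

fix32 to_fix(double v)    { return static_cast<fix32>(v * (1 << FRAC)); }
double to_double(fix32 v) { return static_cast<double>(v) / (1 << FRAC); }

// Fixed-point multiply: widen to 64 bits, then drop FRAC fractional bits.
fix32 fix_mul(fix32 a, fix32 b) {
    return static_cast<fix32>((static_cast<int64_t>(a) * b) >> FRAC);
}
```

The to_fix/to_double calls model the conversion overhead mentioned in the text: it only pays off when a kernel performs many operations between conversions.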

B. Implementation of KinectFusion

Eight SLAM kernels are examined and optimized in Vivado HLS, leading to a median speed-up of ×30.5 by purely applying the presented methodology over leaving the code unchanged. Additional transformations, which require thorough insight into the use case as well as statistical analysis of typical values in various steps of the algorithm using real-world data, lead to an additional median speed-up of ×2.45 and further decreases in resource utilization. According to this evaluation, the most significant performance gains clearly originate from the discussed standard approaches, although it remains important to incorporate application-specific knowledge as well to avoid superfluous hardware usage and suboptimal designs.

III. COMBINED ACCELERATION OF MULTIPLE KERNELS

A. Problem statement and initial configuration

The complex, multi-resolution nature of tracking is reflected in its requirement of seven output streams from the preceding stages of KinectFusion, shown in Figure 2. Traditional task-level pipelining does not capture how stream duplication or multi-modal paths should be handled. The dataflow can be broken down into two more general challenges: one is the accumulation of intermediate results down a pipelined path, and the other concerns creating multi-modal blocks so as to maximize resource sharing across different functional paths. Three distinct ways in which both of these issues can be resolved are proposed and compared. The first one places all accelerators independently on the FPGA, each with its own AXI DMA, and all data is passed via DRAM. The described difficulties are largely avoided this way; however, it is expected that better results will be achieved once task-level pipelining between subsequent blocks is employed.

Figure 2: Dataflow diagram of KinectFusion's first five kernels.

B. Block-level and HLS-level pipelined architectures

In the Vivado block design, collecting intermediate outputs is done by redirecting the needed streams from in between multiple components directly back to the processing system via an AXI DMA. Multiple modes can be activated either by setting control signals via the AXI-Lite protocol, or by inserting stream-switching IP cores to enable the selection among different blocks altogether.

The same principles can also be applied at the level of Vivado HLS, albeit after taking special measures to reconcile them with the HLS dataflow optimization directive. This includes strict adherence to the single-producer, single-consumer paradigm and the non-conditional execution of blocks. Intermediate output aggregation is achieved by programming virtual pass-through connections and having each kernel attach its own output values to the increasingly wide stream of interleaved data. Multi-modality of kernels is translated to if-else case-switching inside loop bodies.
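The two HLS-level techniques can be sketched conceptually as below, using std::queue to stand in for hls::stream (an assumption; the real design uses Vivado HLS streams under the DATAFLOW directive). Each stage reads one stream and writes one stream, so the single-producer, single-consumer rule holds; stage B forwards stage A's intermediate value alongside its own result (pass-through aggregation), and a mode flag switches behaviour inside the loop body rather than conditionally executing whole blocks.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>

struct Packet { uint32_t a; uint32_t b; }; // increasingly wide stream element

// First stage: a trivial per-element operation on the input stream.
void stage_a(std::queue<uint32_t>& in, std::queue<uint32_t>& out) {
    while (!in.empty()) { out.push(in.front() + 1); in.pop(); }
}

// Multi-modal second stage: mode 0 doubles, mode 1 halves. The pass-through
// field 'a' carries stage A's intermediate output back alongside 'b'.
void stage_b(std::queue<uint32_t>& in, std::queue<Packet>& out, int mode) {
    while (!in.empty()) {
        uint32_t v = in.front(); in.pop();
        uint32_t r = (mode == 0) ? v * 2 : v / 2; // if-else case-switching
        out.push(Packet{v, r});                   // aggregate both outputs
    }
}
```

Every stage downstream of B would widen the packet further, which is how the single output stream of interleaved data described above is built up.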

C. Application to KinectFusion

In the dataflow graph, modes are defined to correspond to different resolution levels; this produces the fastest allocation of paths inside which to pipeline all components. Assuming all other components of the KinectFusion system (reading sensor frames, tracking, volumetric integration, etc.) work sufficiently fast, the resulting measurements on an Avnet Zedboard with a PL clock period of 10 ns are as follows:

Configuration        | Initiation interval | Max. frame rate | Avg. resource usage
---------------------|---------------------|-----------------|--------------------
Coexistence          | 2.53 ms             | 395 FPS         | 52 %
Block-level dataflow | 2.10 ms             | 476 FPS         | 45 %
HLS-level dataflow   | 4.13 ms             | 242 FPS         | 35 %
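As a quick sanity check on these figures, the maximum frame rate is simply the reciprocal of the per-frame initiation interval:

```cpp
#include <cassert>
#include <cmath>

// Max. frame rate from an initiation interval given in milliseconds.
int max_fps(double ii_ms) {
    return static_cast<int>(std::floor(1000.0 / ii_ms));
}
```

Each table row satisfies this relation, e.g. 1000 / 2.53 ≈ 395 FPS.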

The first configuration, involving independent accelerators, is Pareto-dominated by the block-level dataflow architecture. Its HLS-level counterpart is twice as slow, however, which can be explained by the fact that the whole IP core uses only one AXI DMA to forward its 256-bit output stream to the PS. The Zynq-7020 High-Performance port has a maximum data width of 64 bits, forcing the DMA to chop up every element into smaller packets and thus take four clock cycles to transfer one aggregated data point. An advantage, however, is the decreased total hardware utilization, because the opportunity for resource sharing across multiple modes of a hybrid block can already be exploited earlier in the design process by the HLS compiler, in contrast to block-level multi-modality.

IV. CONCLUSIONS

High gains in performance were obtained by applying the devised image processing acceleration methodology, although careful attention to its usage is essential. Vivado HLS provides a balanced mix of high-level and low-level details by allowing fine-grained optimization of hardware computations, while still abstracting away most of the repetitive specifics of established paradigms such as pipelining and I/O interfacing. Designing heterogeneous FPGA systems remains intricate, however, mainly due to the inherent duality of having to manage both hardware and software starting from a blank slate. On the other hand, increasing the degree of automation might adversely affect the quality of the resulting design.

Experiments on system-level acceleration of multiple components bearing non-trivial dataflows reveal that there is no clear-cut winner between composition at the block design level versus virtually implementing the same concepts at an earlier phase in HLS. Lastly, our findings on the practice of multi-modal kernels closely match those of [2].


V. FUTURE WORK

Not all KinectFusion kernels could be adequately tested on the FPGA due to scope constraints, which presents a concrete direction for future work. Second, implementation on higher-end SoCs and/or a cascade of FPGAs should be researched as well, since the combined resource utilization makes fully off-loading KinectFusion onto the Zynq-7020 FPGA impossible. Finally, the block-level and HLS-level dataflow variants could be treated as two ends of a spectrum; an untested hypothesis is that a mixture of both methods might lead to an optimum in terms of timing and area metrics.

REFERENCES

[1] C. Cadena et al., "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, 2016.

[2] K. Boikos and C.-S. Bouganis, "A Scalable FPGA-based Architecture for Depth Estimation in SLAM," Appl. Reconfigurable Comput., 2019.

[3] M. Abouzahir, A. Elouardi, R. Latif, S. Bouaziz, and A. Tajer, "Embedding SLAM algorithms: Has it come of age?," Rob. Auton. Syst., 2018.

[4] K. Rafferty et al., "FPGA-Based Processor Acceleration for Image Processing Applications," J. Imaging, vol. 5, no. 1, p. 16, 2019.

[5] R. Nane et al., "A Survey and Evaluation of FPGA High-Level Synthesis Tools," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 35, no. 10, pp. 1591–1604, 2016.

[6] R. A. Newcombe et al., "KinectFusion: Real-Time Dense Surface Mapping and Tracking," 2011.

[7] Q. Gautier, A. Shearer, J. Matai, D. Richmond, P. Meng, and R. Kastner, "Real-time 3D reconstruction for FPGAs: A case study for evaluating the performance, area, and programmability trade-offs of the Altera OpenCL SDK," in Proc. 2014 Int. Conf. Field-Programmable Technology (FPT), 2015, pp. 326–329.

[8] L. Nardi et al., "Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM," in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2015, pp. 5783–5790.

[9] J. Lee, T. Ueno, M. Sato, and K. Sano, "High-productivity Programming and Optimization Framework for Stream Processing on FPGA," in Proc. 9th Int. Symp. Highly-Efficient Accelerators and Reconfigurable Technologies (HEART), 2018, pp. 1–6.

[10] O. Reiche, M. A. Ozkan, R. Membarth, J. Teich, and F. Hannig, "Generating FPGA-based image processing accelerators with Hipacc (Invited paper)," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design (ICCAD), 2017, pp. 1026–1033.

[11] K. Boikos and C.-S. Bouganis, "Semi-dense SLAM on an FPGA SoC," in Proc. 26th Int. Conf. Field-Programmable Logic and Applications (FPL), 2016.



Contents

1 Introduction 1

1.1 Goals and outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background and related research 5

2.1 Simultaneous Localization and Mapping . . . . . . . . . . . . . . . . . 5

2.1.1 KinectFusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.2 Benchmarking visual SLAM . . . . . . . . . . . . . . . . . . . . . 8

2.2 Field-Programmable Gate Arrays . . . . . . . . . . . . . . . . . . . . . . 11

2.2.1 The FPGA put into context . . . . . . . . . . . . . . . . . . . . . 12

2.2.2 System-on-Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.2.3 High-Level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2.4 Designer workflow . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.3 SLAM on FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3.1 Dense and semi-dense SLAM . . . . . . . . . . . . . . . . . . . . 20

3 High-level synthesis design of individual kernels 23

3.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.1 Detailed algorithm description . . . . . . . . . . . . . . . . . . . 23

3.1.2 Source code, dataset and parameters . . . . . . . . . . . . . . . . 27

3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2.1 Common parallel patterns and categorization . . . . . . . . . . 31

3.2.2 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.2.3 Efficient line buffering . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2.4 Random memory access . . . . . . . . . . . . . . . . . . . . . . . 46

3.2.5 Data type selection . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

4 Implementation of KinectFusion in HLS 57

4.1 Detailed results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.1.1 mm2m_sample (Map) . . . . . . . . . . . . . . . . . . . . . . . . 59

4.1.2 bilateral_filter (Stencil) . . . . . . . . . . . . . . . . . . . . . . . . 62

4.1.3 half_sample (Stencil) . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.1.4 depth2vertex (Map) . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.1.5 vertex2normal (Stencil) . . . . . . . . . . . . . . . . . . . . . . . 67



4.1.6 track (Gather & Map) . . . . . . . . . . . . . . . . . . . . . . . . 69

4.1.7 reduce (Reduce) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.1.8 integrate (Gather & Map) . . . . . . . . . . . . . . . . . . . . . . 72

4.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4.2.1 Evaluation of the methodology . . . . . . . . . . . . . . . . . . . 74

5 System-level acceleration of multiple kernels 77

5.1 Dataflow of KinectFusion . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.1.1 Generalized problem statement . . . . . . . . . . . . . . . . . . . 80

5.2 System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.2.1 Hardware debugging . . . . . . . . . . . . . . . . . . . . . . . . 82

5.2.2 Bandwidth limitations . . . . . . . . . . . . . . . . . . . . . . . . 82

5.3 Independent coexistence of kernels . . . . . . . . . . . . . . . . . . . . . 83

5.3.1 Performance analysis . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.4 Task-level pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.4.1 Intermediate output aggregation . . . . . . . . . . . . . . . . . . 92

5.4.2 Multi-modal execution . . . . . . . . . . . . . . . . . . . . . . . . 94

5.4.3 Application to KinectFusion . . . . . . . . . . . . . . . . . . . . . 94

5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.5.1 Comparison of timing and resource profiles . . . . . . . . . . . 102

6 Conclusions and future work 103

6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

Bibliography 107



List of Figures

2.1 Continuum of SLAM algorithms from sparse (e.g. using feature ex-

traction) to dense (e.g. using voxelated maps) [3]. . . . . . . . . . . . . 6

2.2 Part of KinectFusion’s map (right) and a slice through the volume

(left) showing truncated signed distance values, each representing a

distance F to a surface [5]. Grey voxels are those without a valid mea-

surement, and are naturally found within solid objects. . . . . . . . . . 7

2.3 System workflow of the KinectFusion method [5]. . . . . . . . . . . . . 8

2.4 Simplified overview of KinectFusion kernels. A subscript j indicates

the presence of several resolution levels, while i indicates the presence

of multiple iterations within a level. . . . . . . . . . . . . . . . . . . . . 9

2.5 Violin plots comparing four SLAM algorithms on the NVIDIA Jetson

TK1, a GPU development board [6]. Here, KF-CUDA stands for a

CUDA-implementation of KinectFusion. . . . . . . . . . . . . . . . . . . 11

2.6 (a) Sketch of the FPGA architecture; (b) Diagram of a simple logic

element. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.7 Diagram comparing the FPGA to other processing platforms [19]. . . . 14

2.8 Functional block diagram of the Zynq-7000 SoC [22]. . . . . . . . . . . 16

2.9 Annotated photograph of the Avnet Zedboard (adapted from [28]). . . 16

3.1 Illustration of the bilateral filter, showing its edge-preserving prop-

erty [46]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2 Overview of KinectFusion kernels. Green shaded areas include blocks

that are executed multiple times per frame and per level; once for ev-

ery iteration i. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3 Screenshot of the SLAMBench2 GUI when evaluating the ’Living Room

2’ scene. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.4 Mean ATE for different configurations of KinectFusion. The cubed

numbers indicate volume resolutions, while the input FPS corresponds

to both the tracking and integration rate. . . . . . . . . . . . . . . . . . . 30

3.5 A) RGB video stream (unused). B) Latest depth map captured by the

Kinect sensor. C) Reconstructed scene using KinectFusion [37]. . . . . . 30

3.6 The Map pattern [9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.7 The Stencil pattern [9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.8 The Reduce pattern [9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34



3.9 The Gather (or Scatter) pattern [9]. . . . . . . . . . . . . . . . . . . . . . 35

3.10 The Search pattern [9]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.11 Non-exhaustive code snippet representing a possible instance of the

Search parallel pattern. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.12 Concept of pipelining applied to a repeated calculation called ’op’ on

a large array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.13 Effect of pipelining on the timing profile and resource utilization. . . . 40

3.14 Analysis of a pipelined Map kernel, showing the parallelized elemen-

tary operations constituting a matrix-vector multiplication. Note that

the analysis view in Vivado HLS does not clearly indicate overlapped

computation, even though it is definitely present here: a read from

and write to the streaming interface occurs at every single clock cycle

(or equivalently, control step). . . . . . . . . . . . . . . . . . . . . . . . . 41

3.15 Illustration of the Stencil parallel pattern and a corresponding buffer-

ing technique for its implementation on the FPGA. . . . . . . . . . . . . 42

3.16 Report and analysis of a naive implementation of bilateral_filter; nei-

ther line buffering nor array partitioning is applied. . . . . . . . . . . . 44

3.17 Report and analysis of an improved implementation of bilateral_filter which includes line buffer and memory window functionality. . . . . 45

3.18 Array partitioning strategy for optimizing Stencil computations. Dif-

ferently colored elements need to be accessed independently and in

parallel, which is possible only by distributing them across different

instances of internal storage components. (The memory window is

fully partitioned in all dimensions.) . . . . . . . . . . . . . . . . . . . . . 46

3.19 HLS report of the fully optimized bilateral_filter kernel. . . . . . . . . . 46

3.20 Resulting BRAM instances in the HLS report for different memory

sizes in Listing 3.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.21 Kinect v2 accuracy error distribution [66]. . . . . . . . . . . . . . . . . . 54

3.22 Kinect v1 offset and precision [44]. . . . . . . . . . . . . . . . . . . . . . 54

4.1 Effect of every optimization on the timing, resource and accuracy pro-

file of mm2m_sample (Map). . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2 I/O diagram of the mm2m_sample HLS kernel before and after du-

plicating its processing elements 8-fold, assuming no bandwidth bot-

tlenecks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.3 Effect of every optimization on the timing, resource and accuracy pro-

files of bilateral_filter (Stencil). . . . . . . . . . . . . . . . . . . . . . . . . 62

4.4 Exponential function approximation for the bilateral filter, with the

actual frequency (popularity) of all arguments translated to the thick-

ness of the green layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64



4.5 Pareto diagram of the bilateral filter’s HLS average resource usage (not including BRAM) and measured accuracy when all eight possible configurations of three separate optimizations are tested. One outlier with a large error is not shown.

4.6 Effect of every optimization on the timing, resource and accuracy profiles of half_sample (Stencil).

4.7 HLS performance analysis view of an unnecessarily complex division that went unnoticed by the HLS compiler.

4.8 I/O diagram of the half_sample HLS kernel before and after duplicating its processing elements 4-fold, assuming no bandwidth bottlenecks.

4.9 Effect of every optimization on the timing, resource and accuracy profile of depth2vertex (Map).

4.10 Effect of every optimization on the timing, resource and accuracy profile of vertex2normal (Stencil). Contrary to most other cases, the conversion from floating point to fixed-point has a negative effect here.

4.11 Effect of every optimization on the timing, resource and accuracy profile of track (Gather & Map).

4.12 Heatmap of the accessed pixel positions within the reference maps relative to the corresponding regular loop over the input maps for the first level of track. Yellow means high frequency, purple means the opposite. The underlying data was extracted from five frames selected over a video fragment captured at 30 FPS, and shows that horizontal movement of up to 750 pixels per second occurred at some point.

4.13 Effect of every optimization on the timing, resource and accuracy profile of reduce (Reduce).

4.14 Effect of every optimization on the timing, resource and accuracy profile of integrate (Gather & Map).

4.15 Two-dimensional illustration of a frustum-encompassing block, to which loop boundaries can safely be restricted. The green coloured blocks represent volumetric elements that are visible from the sensor’s current position, meaning that all yellow elements remain unchanged during integration.

5.1 Dataflow diagram of the first five kernels of KinectFusion.

5.2 Illustration of two generalized dataflow challenges.

5.3 Overview of the System-on-Chip architecture for the execution of a custom IP core.

5.4 System architecture when five coexisting kernels are implemented together on the FPGA. By allocating one port for every accelerator, hard constraints on concurrent executions are avoided.


5.5 Waveforms produced by the System ILA for the vertex2normal kernel.

5.6 Diagrams depicting how the five kernels should be executed in time if the DDR access speed were unlimited. The rows correspond to accelerators each managing their own DMA and PS-PL port, while the distinct tasks are labelled with resolution levels (0 stands for 320x240, 1 for 160x120 and 2 for 80x60).

5.7 System ILA waveforms for bilateral_filter when it is executed alone, revealing a strange hiccup. The vertical lines are spaced 200 ns apart.

5.8 System ILA waveforms for half_sample in the multi-frame execution. Large-scale pauses and restarts are clearly visible, and occur presumably due to the DDR controller having to operate at full capacity. The vertical lines are spaced 1 µs apart.

5.9 Two possible solutions for intermediate output aggregation (Figure 5.2b).

5.10 Two possible solutions for multi-modal execution (Figure 5.2c).

5.11 Three different sets of paths (depicted as large arrows) that connect components to combine using task-level pipelining. The time for one path is estimated from the slowest block inside that path, and the paths should be executed separately in time to enable resource sharing across different modes.

5.12 System architecture that handles the multi-level dataflow challenge of KinectFusion’s first five kernels (see Figure 5.1) completely within the Vivado block design, leaving the HLS IP cores unchanged. AXI-Lite control signals are omitted for clarity, and the bottleneck-inducing streams are marked with a red data width label.

5.13 Schedule to process incoming sensor frames using the improved accelerators. Due to the application of task-level pipelining, all subcomponents now adapt to the slowest link in the chain, which is formed by bandwidth limitations.


List of Tables

2.1 Summary of 3D SLAM algorithms adapted and compared by [6].

2.2 A compilation of recent 3D SLAM applications involving the FPGA taking up roles of varying importance, showing a trend of decreasing frame rate with increasing "density". SoC (System-on-Chip) boards always contain both an embedded CPU and FPGA.

3.1 Time spent in each kernel when KinectFusion is executed on the CPU of either a regular laptop or the Avnet Zedboard. The resulting frame rate is determined by summing up all timings on a given platform.

3.2 Timing and resource usage for various implementations of a simple series of arithmetic calculations.

3.3 Timing and resource usage for various implementations of a square root calculation.

4.1 Category, I/O dimensions, estimated timing and average accuracy of every KinectFusion kernel if it were executed on the FPGA. Bandwidth limitations and other external factors are not yet taken into account, since these fall outside the scope of Vivado HLS.

4.2 Resource utilization estimated by HLS for every KinectFusion kernel’s top function.

4.3 Impact of the optimizations arising from adoption of the methodology versus use case-specific knowledge on the estimated performance of KinectFusion’s kernels in HLS.

5.1 I/O characteristics of all instances of KinectFusion’s first five kernels.

5.2 Time spent in each kernel as measured on both the PS and PL of the Zedboard. Summing these values assumes that all kernels are executed separately in time, and can be placed side by side onto the same FPGA.

5.3 Realized maximum I/O throughputs that conform to HP port bandwidth bounds. The data widths and elements processed per clock cycle are measured in terms of data units meaningful to KinectFusion (e.g. one depth value), without regard for details involving packed structs.


5.4 Comparison of timing and resource profiles after implementing mm2m_sample through vertex2normal as separate accelerators versus applying both discussed multi-level dataflow techniques.

6.1 Time spent in each kernel when KinectFusion is executed on either the ARM Cortex-A9 CPU or Xilinx Zynq-7020 FPGA of the embedded SoC.


List of Listings

3.1 Code snippet representing the Map parallel pattern.

3.2 Code snippet representing the Stencil parallel pattern.

3.3 Code snippet representing the Reduce parallel pattern.

3.4 Code snippet representing the Gather parallel pattern.

3.5 Vivado HLS code to test the maximum size of a 16-bit integer array. Data is copied in burst mode from external memory, similar to how block-by-block processing is implemented in practice. Although the compiler places the local array into block RAM by default, the HLS RESOURCE directive [1] is still included for clarity.

3.6 Vivado HLS code for a fixed-point simple pipelined arithmetic calculation, belonging to the Map pattern.

3.7 Vivado HLS code for a fixed-point square root calculation, belonging to the Map pattern.

5.1 Code snippet summarizing how the multi-level dataflow problem is to be solved within Vivado HLS.


Abbreviations

ACP Accelerator Coherency Port

AXI Advanced eXtensible Interface

CPU Central Processing Unit

DDR Double Data Rate

DMA Direct Memory Access

DRAM Dynamic Random Access Memory

DSE Design Space Exploration

FIFO First-In, First-Out

FPGA Field-Programmable Gate Array

FSM Finite State Machine

GPU Graphics Processing Unit

HLS High-Level Synthesis

HP High-Performance

ILA Integrated Logic Analyzer

IP Intellectual Property

MM Memory-Mapped

PCI Peripheral Component Interconnect

PL Programmable Logic

PS Processing System

RAM Random Access Memory

SLAM Simultaneous Localization And Mapping


Chapter 1

Introduction

As we embark on the road towards a more autonomous world, countless challenges

and opportunities emerge in various subdisciplines of computer architecture, algo-

rithm design and electronics. One such challenge is Simultaneous Localization and

Mapping or SLAM, a relatively modern application that attempts to make a robot

aware of its surroundings. SLAM concerns the dual problem of constructing a model

of the robot’s real-world environment, while also determining the position and ori-

entation of the robot moving inside this map at the same time [2]. Many distinct

implementations of this concept exist. Dense SLAM variants, for example, distin-

guish themselves from their sparse counterparts by incorporating as much data as

possible captured by the sensors into their global reconstruction. This gives them

a considerable edge, mainly due to the fact that they create a high quality model

that is reusable across other applications as well. However, this comes at the cost

of greater computational demands [3]. On the other hand, use cases such as au-

tonomous driving, augmented reality, indoor mapping or navigation, and basically

any requirement of high-quality environmental awareness on mobile or low-power

devices, all justify why one might desire to run dense SLAM on embedded devices

as well rather than on high-end GPUs only.

The need for embedded SLAM solutions is evident. The Field-Programmable

Gate Array (FPGA), a low-power integrated circuit that is reconfigurable yet can

reach high performance and efficiency, offers a flexible hardware platform on which

to evaluate the implementation of a dense 3D SLAM algorithm. It is essentially

a large grid of elementary blocks and routing interconnects, both of which can be

reprogrammed by the designer ’on the field’. While FPGA designs are tradition-

ally developed using a hardware description language such as VHDL, we employ

the upcoming High-Level Synthesis (HLS) tools as a means to evaluate the present-

day programmability of FPGAs as well as the quality of our design methodology.

The strength of Vivado HLS is its capability to perform high-level, pragma-directed

compilation of C-code into hardware modules [4]. Also essential to this dissertation is the concept of a System-on-Chip (SoC), which integrates a CPU and an FPGA into


one package. The Zedboard development board is then used to evaluate both hard-

ware and software running on the Zynq-7020 SoC.
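To make this concrete, the following listing gives a minimal, hypothetical example of such pragma-directed HLS code; the kernel name, array size and directive below are purely illustrative and are not taken from the thesis sources.

```cpp
// Hypothetical Vivado HLS kernel: element-wise scaling of a small map.
// The PIPELINE directive asks the compiler to overlap loop iterations so
// that a new element is accepted every clock cycle (II = 1).
void scale_map(const short in[256], short out[256], short factor) {
    for (int i = 0; i < 256; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = in[i] * factor;
    }
}
```

On a standard C++ compiler the pragma is simply ignored, which is part of what makes HLS attractive: the same source can be tested in software before being synthesized into a hardware module.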

Computer vision and signal processing are research fields that are well repre-

sented in typical FPGA applications as well as the low-level operation of SLAM

[2]. This dissertation is based on the implementation of the KinectFusion algorithm,

mainly because it is very representative of many kernels within the general context

of image processing. Both two-dimensional and three-dimensional data structures

are processed in various ways throughout the KinectFusion pipeline [5], giving rise

to a diverse exploration of possible FPGA-specific optimizations. Furthermore, it

allows for the extraction of guidelines involving the methodology for FPGA pro-

gramming. Other benefits over comparable SLAM algorithms include its relatively

low memory requirement and good accuracy [6].

1.1 Goals and outline

The aim of this master’s thesis is to provide a framework by which SLAM and, by extension, image processing kernels in general, can be efficiently mapped onto FPGAs.

Beyond the exploitation of parallelism and pipelining, the full translation of software

algorithms into HLS code is often non-trivial. We also intend to explore how multi-

ple kernels with complex dataflow characteristics can be combined in hardware so as to

form an efficient, large-scale pipeline consisting of functional blocks. The final goal

is to achieve a heterogeneous implementation of 3D SLAM that is as fast as practica-

ble, while investigating which concepts and techniques can be distilled in order to

create a more generally applicable methodology as a side effect.

The outline and contributions of this thesis are as follows:

• Chapter 2 reviews the background and existing literature about SLAM, FPGAs,

SLAM on FPGAs and justifies several choices made in this thesis.

• Chapter 3 delineates the methodology that was developed to deal with the ef-

fective optimization of kernels bearing different computational and data man-

agement patterns using HLS.

• Chapter 4 applies these practices to KinectFusion and evaluates the extent to

which they brought us to a satisfying solution versus how many additional

optimizations had to be applied.

• Chapter 5 explores various ways in which multi-level dataflow can be realized

efficiently, fitting together the first five kernels of KinectFusion onto the pro-

grammable logic. A comparison is made among three architectures using their

resulting timing and resource metrics.


• Finally, Chapter 6 formulates a conclusion presenting some takeaways of our

research and opportunities for future work.


Chapter 2

Background and related research

2.1 Simultaneous Localization and Mapping

Simultaneous Localization and Mapping (SLAM) is an advanced computer vision

and robotic navigation algorithm that has made significant progress over the last

30 years. Its purpose is to track the state of an agent within an unknown environ-

ment, while simultaneously constructing a model of this very environment using its

sensory observations [2]. The state is typically described by its pose (position and ori-

entation), while the model essentially refers to a map, which is either a representation

of some interesting aspects (so-called features) or a dense volumetric description of

the robot’s surroundings. It is clear that both components of SLAM, being localiza-

tion and mapping, cannot be solved independently from each other. A sufficiently

detailed map is needed for localization, while an accurate pose estimate is required

to be able to reconstruct or update the map [7]. Localization is often done by means

of tracking, which compares the incoming sensor data with the map that has been

generated so far in order to create a new estimate of the current pose [3].

SLAM’s emerging use cases span a variety of applications in mobile robotics, including but not limited to path planning, visualization, augmented reality

and 3D object recognition. In general, many situations where localization infrastruc-

ture is absent (such as indoor operation) give rise to the present-day popularity of

SLAM. The same holds true for any scenario where detailed up-to-date maps need

to be created but are not available beforehand. Cadena et al. [2] note that SLAM is a

vital aspect of robotics, and is being increasingly deployed in various real-world set-

tings that range from autonomous driving and household robots to mobile devices.

However, it is also stated that more research is needed to achieve true robustness in

navigation and perception, especially for autonomous robots that ought to operate

independently for a long time. In this sense, SLAM has not been fully solved yet,

but we note that the algorithms considered in this thesis definitely hold potential

for investigation and acceleration due to the broad applicability of their underlying

concepts.


FIGURE 2.1: Continuum of SLAM algorithms from sparse (e.g. using feature extraction) to dense (e.g. using voxelated maps) [3].

Implementations of SLAM come in many shapes and sizes; this diversity is par-

tially illustrated in Figure 2.1. On one end of the spectrum, we have sparse SLAM

that focuses on the selection of a limited number of features or landmarks. This has

the upside of being computationally lighter but carries the significant downside of

reducing the quality and usability of the reconstruction. On the other end, a dense

algorithm reverses these properties: its ability to generate a much higher quality

map of the environment comes at the cost of being computationally intensive. Semi-

dense visual SLAM implementations have emerged in an attempt to form a compro-

mise, although the resulting model is still incomplete with respect to the fully dense

variant as the algorithms do not deal with all of the available sensory information

[2], [8].

2.1.1 KinectFusion

KinectFusion is a real-time dense scene reconstruction algorithm, published by Mi-

crosoft in 2011. As a SLAM algorithm, it continuously updates a global 3D map and

tracks the position of a moving depth camera within this environment. Several inno-

vations were built into this system by [5]. First, it works under all lighting conditions

since only the depth data is used. Viable consumer-oriented depth sensors include

the Microsoft Kinect camera. This allows the system to work perfectly under dark

conditions as well. Furthermore, the localization step is always done with respect

to the most up-to-date global map at all times. The usage of the map, which is rep-

resented as a volume of truncated signed distance function (TSDF) values, thereby

recapitulates the information of all previous depth frames seen so far. This helps to

avoid the drifting problems commonly associated with simple frame-to-frame align-

ment.

Figure 2.2 depicts a typical volume consisting of TSDF values. Here, F is de-

fined as the signed distance to the nearest surface. Its value is positive if it is outside

of a (solid) object, and its magnitude is truncated to a fixed maximum in order to


FIGURE 2.2: Part of KinectFusion’s map (right) and a slice through the volume (left) showing truncated signed distance values, each representing a distance F to a surface [5]. Grey voxels are those without a valid measurement, and are naturally found within solid objects.

avoid the interference of surfaces far away from each other [5]. The global surface is

then defined as the set of points where F = 0, hence this data structure belongs to

the class of implicit surface representations [2]. A functional limitation of KinectFu-

sion is that, unlike many other SLAM algorithms [6], the global map cannot expand

at runtime because its size is predefined. This renders KinectFusion unsuited for

large-scale SLAM (on the order of 500 cubic meters or more), although the author

notes that many of the aforementioned applications do not necessarily require this

functionality.
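As a minimal sketch of this truncation (assuming a symmetric band of half-width mu; the full update in [5] additionally maintains a per-voxel weight, which is omitted here):

```cpp
#include <algorithm>

// Clamp a signed distance to the nearest surface into the band [-mu, +mu],
// yielding a truncated signed distance (TSDF) value. The band half-width mu
// is an assumed parameter of this illustration.
float truncate_sdf(float signed_distance, float mu) {
    return std::max(-mu, std::min(mu, signed_distance));
}
```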

Technical description

The overall workflow of KinectFusion is shown in Figure 2.3. From a high-level

perspective, four interconnected stages can be distinguished as follows:

1. Surface measurement. After obtaining the raw depth map captured by a Mi-

crosoft Kinect (or equivalent) camera, this preprocessing step calculates 3D

vertex and normal vector arrays at multiple resolution levels.

2. Pose estimation. The device is tracked using a variant of the Iterative Clos-

est Point (ICP) algorithm; see the original paper for its description. The live

measurement is aligned in a coarse-to-fine manner with a predicted surface

measurement, which is in turn obtained from the surface prediction phase.

3. Update reconstruction. Given an accurate pose estimate, the incoming depth

data is integrated into the volume. TSDF values within the frustum are up-

dated to accommodate for the new sensor measurement, further consolidating

the global model.

4. Surface prediction. A raycast is performed on the most up-to-date volume,

thereby producing a dense and reliable surface measurement estimate against


FIGURE 2.3: System workflow of the KinectFusion method [5].

which to perform alignment in the pose estimation phase. Loop closure be-

tween mapping and localization is achieved this way [5].

Figure 2.4 shows correspondences between these high-level stages and the sub-

routines in the source code provided by [9]. Note that the system-level dataflow of

KinectFusion is much more complex in reality, and contains many interacting ker-

nels with multiple instances. These communication and replication aspects are left

out for simplicity here, but will be explored in detail in Chapter 3.
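The interaction between these four stages can be condensed into the following control-flow sketch; every type and function name below is a placeholder introduced for illustration, not the API of the reference implementation.

```cpp
// Hedged sketch of KinectFusion's per-frame control flow. All names are
// hypothetical; the bodies are stubs that only mimic the data dependencies.
struct DepthFrame {};
struct Measurement {};
struct Pose {};
struct Volume { int integrated_frames = 0; };

Measurement surface_measurement(const DepthFrame&) { return {}; }          // stage 1
Measurement surface_prediction(const Volume&, const Pose&) { return {}; }  // stage 4 (raycast)
Pose pose_estimation(const Measurement&, const Measurement&,
                     const Pose&) { return {}; }                           // stage 2 (ICP)
void update_reconstruction(Volume& v, const DepthFrame&, const Pose&) {    // stage 3
    ++v.integrated_frames;
}

void process_frame(const DepthFrame& depth, Volume& volume, Pose& pose) {
    Measurement live = surface_measurement(depth);
    Measurement predicted = surface_prediction(volume, pose); // from the current map
    pose = pose_estimation(live, predicted, pose);            // coarse-to-fine alignment
    update_reconstruction(volume, depth, pose);               // consolidate global model
}
```

Note how the raycast of stage 4 feeds the alignment of the next frame, which is the loop closure between mapping and localization mentioned above.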

2.1.2 Benchmarking visual SLAM

Nardi et al. [6], [9] have introduced SLAMBench, a tool used to test the correctness

and performance of various 3D SLAM algorithms. Given a dataset with a ground

truth camera trajectory, the accuracy, speed and optionally the power consumption

of a specified SLAM implementation can be measured on various CPU or GPU plat-

forms. This benchmark provides an important basis by which to evaluate the effect of several parameter choices, such as the resolution of the reconstruction volume

and the frame rate. Essentially, it will allow the author to deviate from the refer-

ence KinectFusion implementation whenever it is deemed useful to do so, while still

keeping track of the possible degradation in quality due to these optimizations.

Accuracy evaluation

SLAMBench allows for detailed accuracy measurements of different SLAM imple-

mentations in the form of an absolute trajectory error (ATE). At every frame during

the execution, there exists a certain error between the estimated camera position as

produced by the application under test (AUT) and the ground truth position. The

ATE, as described by [10], is a metric that serves to evaluate this discrepancy using

a scaled Euclidean distance calculation, after aligning both trajectories in a least-

squares manner. The mean ATE is then simply its average over all frames, and will

be used hereafter to quantify the accuracy of any set of parameters used as input to

KinectFusion.
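Assuming both trajectories have already been aligned, the mean ATE reduces to an average of per-frame Euclidean distances, as the following sketch illustrates (the least-squares alignment step itself is omitted):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Mean absolute trajectory error over per-frame position errors, assuming
// the estimated trajectory is already aligned to the ground truth.
float mean_ate(const std::vector<Vec3>& estimated, const std::vector<Vec3>& truth) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < estimated.size(); ++i) {
        const float dx = estimated[i].x - truth[i].x;
        const float dy = estimated[i].y - truth[i].y;
        const float dz = estimated[i].z - truth[i].z;
        sum += std::sqrt(dx * dx + dy * dy + dz * dz); // per-frame error
    }
    return sum / static_cast<float>(estimated.size());
}
```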


[Figure content: dataflow from the Depth input through mm2m_sample, bilateral_filter, half_sample (j), depth2vertex (j), vertex2normal (j), track (i,j), reduce (i,j) and update_pose (i,j) to integrate and raycast, producing the Map and Pose outputs; the kernels are grouped into the four stages 1. Surface measurement, 2. Pose estimation, 3. Update reconstruction and 4. Surface prediction.]

FIGURE 2.4: Simplified overview of KinectFusion kernels. A subscript j indicates the presence of several resolution levels, while i indicates the presence of multiple iterations within a level.


Algorithm          | Type       | Required sensors           | Year
ORB-SLAM2 [11]     | Sparse     | Monocular, stereo or RGB-D | 2016
LSD-SLAM [12]      | Semi-dense | Monocular                  | 2014
ElasticFusion [13] | Dense      | RGB-D                      | 2015
InfiniTAM [14]     | Dense      | RGB-D                      | 2015
KinectFusion [5]   | Dense      | RGB-D                      | 2011

TABLE 2.1: Summary of 3D SLAM algorithms adapted and compared by [6].

Comparison of KinectFusion among other SLAM algorithms

The following items summarize the performance results of SLAM in literature [6]

as well as benchmarks executed by the author, in order to give context to the per-

formance of KinectFusion. The considered algorithms are listed in Table 2.1, and a

standard (mid-to-high-end) set of parameters is used for their evaluation.

Accuracy. Figure 2.5 indicates that the trajectory accuracy of KinectFusion is

generally mediocre, although the author’s own executions have indicated that its

mean ATE is among the best as long as no loss of track occurs. A major drawback

of KinectFusion is that it tends to get lost completely during some video fragments,

causing the pose to simply stop updating midway through the benchmark. High drift occurs

as a consequence, which explains the high variability when performing accuracy

measurements on different datasets.

Memory requirements. In a comparison made by Bodin et al. [6], KinectFusion

turned out to require the lowest memory size among the five recent sparse, semi-

dense and dense SLAM algorithms shown in Table 2.1. The memory usage depends

on the dimensions of the reconstruction volume, but is on the order of 50 MB for a

relatively detailed map of 256³ elements. However, it should be noted that this value

is still very high compared to typical FPGA applications, since local storage on the

FPGA is typically on the order of a few megabits. This indicates a priori that the

implementation of KinectFusion on the FPGA is likely to be a challenging task.
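A back-of-the-envelope calculation illustrates these orders of magnitude. The bytes-per-voxel figure below is an assumption chosen purely for illustration, since the exact packing of TSDF values and weights is not detailed here:

```cpp
#include <cstdint>

// Rough storage arithmetic for a 256^3 reconstruction volume; the assumed
// 3 bytes per voxel is illustrative only.
constexpr std::uint64_t side = 256;
constexpr std::uint64_t voxels = side * side * side;            // 16,777,216 voxels
constexpr std::uint64_t bytes_per_voxel = 3;                    // assumption
constexpr std::uint64_t total_bytes = voxels * bytes_per_voxel; // ~50 MB
```

Even a handful of bytes per voxel thus lands in the tens of megabytes, which dwarfs the few megabits of on-chip storage and suggests that block-by-block transfers from external memory will be needed.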

Speed. According to Figure 2.5, KinectFusion is faster than most of its coun-

terparts, achieving around 8 FPS on a GPU platform. This cannot simply be gen-

eralized towards heterogeneous CPU-FPGA executions, although it does provide

another hint that KinectFusion might be the most promising choice to attempt to

accelerate.


FIGURE 2.5: Violin plots comparing four SLAM algorithms on the NVIDIA Jetson TK1, a GPU development board [6]. Here, KF-CUDA stands for a CUDA implementation of KinectFusion.

2.2 Field-Programmable Gate Arrays

The Field-Programmable Gate Array (FPGA) is essentially a two-dimensional grid

of reconfigurable blocks and routing channels, offering a low-volume yet highly ef-

ficient alternative to the Application-Specific Integrated Circuit (ASIC). The term

gate array refers to the fact that these elementary building blocks consist of various

logic gates, providing look-up tables (LUT), registers, full adders (FA), multiplexers,

flip-flops (FF) and more. Special blocks such as Digital Signal Processors (DSPs) are

also at the designer’s disposal: these serve to perform arithmetic operations such

as multiplications more efficiently than by merely using LUTs. A field-programmable integrated circuit is one that can be reprogrammed on the spot so as to perform almost

any hardware functionality that the user desires. Whereas ASICs have their elec-

tronic circuitry permanently ’baked’ into silicon, the FPGA’s logic can be changed

at will long after it has been manufactured precisely because its logic blocks and

interconnects are reconfigurable. The designs running on an FPGA are typically

created using a Hardware Description Language (HDL). This language allows the

user to formally describe the behaviour of digital circuits by means of specifying,

among others, how signals should be connected together and which logical oper-

ations should be performed. In the synthesis phase, this description is then trans-

formed into a list of electronic building blocks and their interconnections. After-

wards, the blocks are mapped onto the physical rectangular layout of the FPGA in

the mapping phase. Finally, the routing phase decides how to connect these placed

components. The resulting implemented design specifies exactly how each available

FPGA resource should be configured, including how the interconnections should be

routed so as to connect the relevant blocks together.

As an example, Figure 2.6 (adapted from [15]) depicts the architecture of an


FIGURE 2.6: (a) Sketch of the FPGA architecture; (b) Diagram of a simple logic element.

island-style FPGA. Here, another special block called the I/O block is shown, resid-

ing at the periphery of the device. These serve to provide external connections and

are necessary to communicate with the world outside of the Programmable Logic

(PL).

2.2.1 The FPGA put into context

Figure 2.7 shows a simplified comparison of how the FPGA can be situated among

the CPU, GPU and ASIC. On the left end, we find a Central Processing Unit (CPU).

This general-purpose device is clearly the most flexible with regard to programma-

bility, but it is also the least efficient one. In this context, efficiency refers to both speed

(throughput, latency) and power consumption. On the right end, we find an ASIC:

this device is the most rigid of all but also the most efficient. The logic is burned right

into silicon, which fixes its functionality permanently but allows for an extremely

low latency and energy consumption. Within a given semiconductor technology, a

much better efficiency can be achieved by ASICs relative to FPGAs since negligible

overhead exists. Since the components and interconnections are fully fixed before-

hand, their area utilization and speed metrics are much better than for comparable

designs on the FPGA. For example, [16] found that the average ratio of silicon area

required to implement circuits containing only LUT-based logic and FFs is 35.

Moving to the left on the axis generally means sacrificing efficiency, while gain-

ing the ability to easily run a wider range of applications. On the other hand, moving

to the right means giving up on ad-hoc programmability and configurability, but in-

stead gaining increased potential for high-efficiency or high-performance computa-

tion in return. For example, the Graphics Processing Unit (GPU) cannot run general-purpose programs, although it is quite well suited to massively parallel or vectorized calculations thanks to its large number of processing units. GPUs are, however, very power-hungry, a drawback from which FPGAs and ASICs do not suffer. The FPGA's low power consumption and hardware reconfigurability explain the growing academic and industrial interest in it.

Reality is of course not one-dimensional, and the FPGA has its fair share of differences and advantages that offset it on other axes as well, figuratively speaking. Rather than merely existing between GPUs and ASICs, it is also useful to compare the FPGA with the CPU with respect to how they process data. CPUs often have higher clock speeds, but execute instructions in a much less parallel fashion than their counterparts. While the CPU has undergone many architectural improve-

ments to accelerate the execution of software, including multi-core functionality, Sin-

gle Instruction Multiple Data (SIMD) technology, instruction-level parallelism (ILP),

speculative execution and more, these extra tools are only available under specific

circumstances. Serial execution happens otherwise, which results in every data el-

ement being processed one by one. A resulting benefit is that code with a high degree of control statements, for example with many if-then-else constructions, is handled well by the CPU [17]. On the other hand, FPGAs provide a more direct

shortcut to hardware, as they allow for effective pipelined and dataflow-oriented

architectures to be designed and implemented for a given (fixed) algorithm. FPGAs provide the opportunity to spatially parallelize complex computations across their many reconfigurable blocks and routing channels, in order to achieve a processing speed many times greater than the CPU's [18]. However, note that the maximum

DDR access speed, bus widths and I/O limitations still define upper limits with re-

gard to communication for both devices.

Lastly, a disadvantage of FPGAs is that, despite the rise of tools such as High-Level Synthesis (HLS) that attempt to ease their development [4], FPGAs intrinsically remain quite difficult to program. The mindset for FPGA development is very different from that of software engineering [17], which is essentially due to the design

process being multifaceted and intricate, involving both hardware and software. To

design an FPGA system is to start from a blank slate: the architecture is not fixed,

but can be changed to perform virtually any digital hardware logic the user wishes it

to. For a System-on-Chip combining both a CPU and FPGA, the situation becomes

even more involved: in addition to devising effective hardware, the designer also

has to write good software around their custom architecture, and has to ensure that

all components work well together as intended. This stands in stark contrast to reg-

ular software development that does not deal with variable architectures, such as on

a desktop or laptop CPU. In short, a high degree of technical expertise is required in

this field, although HLS can definitely be regarded as a positive evolution towards

facilitating the hardware design aspect of this two-fold development process.


FIGURE 2.7: Diagram comparing the FPGA to other processing platforms [19].

Power consumption

Minimizing the power usage of an application is especially important in the context

of embedded devices, where SLAM is most likely to be found. In theory, the FPGA is

more energy-efficient than the CPU and GPU by design. After all, this device is clas-

sified as ’reprogrammable hardware’, meaning that its data operations are directly

encoded into hardware. The overhead for any given calculation is therefore greatly

reduced. Furthermore, high throughputs can be achieved thanks to the opportunity for efficient pipelining, no matter how complex the chain of operations is.

To verify the above claims, [20] compared the energy consumption of a high-end

Altera Stratix III E260 FPGA with an NVIDIA GeForce GTX 295 GPU and quad-core

Xeon W3520 CPU. For many typical sliding window applications (which can be seen

as a subset of image processing), the FPGA turns out to be one to two orders of mag-

nitude more power efficient in terms of energy usage per frame than both the GPU

and CPU. Only in the case of a linear 2D convolution where the filtering operation

may be executed in the frequency domain as well, the GPU-FFT implementation

was able to obtain a power efficiency comparable to that of the FPGA. While both of

these devices are around a decade old, similar conclusions were drawn by [21] for

k-means clustering on more modern hardware. Here, several Xilinx Zynq FPGAs

achieved an ’FPS per Watt’ value of a factor 10 to 25 times better than the NVIDIA

GTX 980 GPU. The general trend is that GPUs can often process data faster then

FPGAs in terms of frames per second, but do so much less energy efficiently.

2.2.2 System-on-Chip

A System-on-Chip (SoC) integrates a Processing System (PS) containing the CPU

with Programmable Logic (PL) representing the FPGA onto a single device [22].

Figure 2.8 depicts the block diagram of a Xilinx Zynq-7000 SoC, where the following

relevant functional blocks can be distinguished:

• Application Processor Unit (APU): The software part of the SoC, consisting of

a dual-core ARM Cortex-A9 CPU. It is used to control the full execution of


KinectFusion and to initiate data transfers via the AXI¹ Direct Memory Access

(DMA) IP core.

• Programmable Logic (PL): The reconfigurable hardware part of the SoC, de-

rived from the Xilinx Artix-7 FPGA. It is used to run the accelerated kernels of

KinectFusion.

• General-Purpose (AXI_GP) Ports: Provide PS-PL communication via two 32-bit master and two 32-bit slave interfaces. They are used to control the IP cores, and their maximum estimated throughput is 600 MB/s [22].

• High-Performance (AXI_HP) Ports: Provide PS-PL communication via four 32- or 64-bit independently programmed master interfaces. They are used to transfer large amounts of data via the DMA, with an estimated maximum throughput of around 1,200 MB/s [22], [24], [25].

• Central Interconnect: Connects the PL via its AXI_GP ports to the DDR mem-

ory controller, PS cache and I/O peripherals.

• DDR Controller (Memory Interfaces): Supports DDR2 and DDR3 for access to

the Dynamic RAM (DRAM), not shown on the figure.

• Programmable Logic to Memory Interconnect: Connects the PL via its AXI_HP

ports directly to the DDR controller for fast streaming (reading and writing) of

data.
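To put the quoted port throughputs into perspective, the sketch below estimates how long it takes to stream a single Kinect-resolution depth frame of 32-bit floats over an AXI_GP versus an AXI_HP port. The 600 and 1,200 MB/s figures come from the text above; the 640×480 resolution and the helper function itself are illustrative assumptions.

```cpp
#include <cstddef>

// Illustrative helper (an assumption, not a measured quantity): estimated
// time in milliseconds to move 'bytes' over a link sustaining 'mb_per_s'
// megabytes per second (1 MB = 1e6 bytes).
double transfer_ms(double bytes, double mb_per_s) {
    return bytes / (mb_per_s * 1e6) * 1e3;
}

// One 640x480 depth frame of 32-bit floating point values.
const double kFrameBytes = 640.0 * 480.0 * sizeof(float);
```

With these numbers, one frame takes roughly 1 ms over an HP port and 2 ms over a GP port, which illustrates why the HP ports are reserved for bulk data while the GP ports only carry control traffic.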

The Zedboard, shown in Figure 2.9, is a low-end development board based on

the Xilinx Zynq-7000 SoC [26]. It will be used in this dissertation for the prac-

tical evaluation of several accelerated KinectFusion configurations. High-end FP-

GAs were considered as well, but were eventually deemed out of scope: we primarily wanted to establish how much the Zedboard is already capable of, given that the Zynq FPGA is relatively popular in image and video processing applications [27]. Furthermore,

the lower the cost of the hardware platform, the wider the range of devices and use

cases our work could be applied to.

2.2.3 High-Level Synthesis

High-Level Synthesis (HLS) represents a collection of processes that automatically

convert a high-level algorithmic description of a certain desired behavior to a circuit

specification in HDL that performs the same operation [4]. It allows the hardware

functionality of an FPGA to be specified directly by algorithms written in a software

programming language such as C or C++. HLS tools attempt to reduce time-to-

market and address the design of increasingly complex systems by permitting de-

signers to work at a higher level of abstraction. Design spaces can be explored more rapidly this way, which is especially important when many alternative configurations have to be implemented, generated and compared.

¹ Advanced eXtensible Interface, a set of protocols for inter-IP communication as adopted and described by Xilinx in [23].

FIGURE 2.8: Functional block diagram of the Zynq-7000 SoC [22].

FIGURE 2.9: Annotated photograph of the Avnet Zedboard (adapted from [28]).

The task of automatically generating hardware from software is far from easy,

and a one-size-fits-all solution might not even exist in the same way that a fully

optimizing and/or parallelizing compiler is theoretically impossible to create. Nev-

ertheless, a wide range of different approaches exist that attempt to partially solve

the problem. Removing the burden on the user of having to reinvent the wheel is

already a great practical advantage of HLS. After all, frequently recurring concepts

such as pipelining, array partitioning and more are often already built into these

tools, ready to be used without requiring the designer to deal with their low-level details.

Xilinx’ Vivado HLS is able to synthesize a procedural description written in C,

C++ or SystemC into a hardware IP block [1], [4]. Loop unrolling, pipelining, chain-

ing of operations, resource allocation and internal array restructuring are among the

many different optimizations that can be applied during the compilation process. In

addition, support for many types of interfaces such as shared memory and stream-

ing is built-in.
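To illustrate what such directives look like in practice, the toy kernel below is annotated with Vivado HLS pragmas for pipelining and array partitioning. The function is a made-up example, not part of KinectFusion; since ordinary C++ compilers ignore unknown pragmas, the same source remains compilable and testable in software.

```cpp
const int N = 256;  // fixed loop bound, known at synthesis time

// Toy HLS-style kernel (an illustrative stand-in, not a KinectFusion
// routine): element-wise scale-and-add over a fixed-size array. The
// #pragma HLS lines are directives for Vivado HLS; a regular C++
// compiler treats them as no-ops.
void scale_add(const float in[N], float out[N], float a, float b) {
#pragma HLS ARRAY_PARTITION variable=in cyclic factor=4
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        out[i] = a * in[i] + b;  // one multiply-add per iteration
    }
}
```

With `PIPELINE II=1`, Vivado HLS attempts to start a new loop iteration every clock cycle; the array partitioning directive widens on-chip memory access so that the pipeline is not starved for data.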

2.2.4 Designer workflow

With the availability of an Avnet Zedboard containing the Xilinx Zynq-7020 PL, the

author’s toolchain of choice consists of Vivado HLS, Vivado and Xilinx Software

Development Kit (SDK) v2017.4. These three development environments together

provide an integrated design flow as follows:

• In Vivado HLS [1]:

1. Write a C/C++ function to be integrated into the hardware system. This

can be written from scratch or based on an existing reference implemen-

tation. Data type selection and interface specifications have to be consid-

ered as well.

2. Write test benches; compile, simulate and debug the algorithm to verify

its functional correctness. Return to step 1 until the output is correct.

3. Optimize the C/C++ code to tailor it towards an efficient implementation on the FPGA. One important practice here is the application of Vivado HLS optimization directives, which automate many repetitive aspects of the optimization process. Ensure that the algorithm stays correct by performing step 2 as needed.


4. Synthesize the top function into an RTL implementation. Vivado HLS

creates two variants: VHDL and Verilog, both of which ought to be fully

equivalent.

5. Analyze the reports and cycle-by-cycle computation steps of the resulting design. Return to step 3 until satisfied. This back-and-forth process is part of Design Space Exploration (DSE).

6. Optionally, verify the correctness of the RTL implementation by running

a C/RTL cosimulation.

7. Export the RTL implementation to package it into an IP block, ready to be

used in subsequent design tools.

• In Vivado [29]–[31]:

8. Create a new IP integrator block design and insert the ZYNQ7 Processing

System. This block encompasses the embedded PS functionality, while

all other IP cores around it represent what will be implemented on the

FPGA.

9. Configure the Zynq-7000 with respect to clock speeds, PS-PL communi-

cation, peripheral interfaces and more.

10. Insert your custom HLS IP core(s) into the block design while taking care of AXI interfacing, interconnects and ports. Optionally, insert an AXI DMA IP core, which converts streaming transfers to memory-mapped ones and vice versa, in order to allow streaming IP cores to efficiently access the DRAM via the HP ports.

11. Insert a System Integrated Logic Analyzer (ILA) IP core, and add de-

bugging probes to important signals. This allows for the debugging of

post-implementation designs on the FPGA device, which consumes extra

resources but no additional clock cycles.

12. Verify the design, and fix the block design if needed.

13. Perform logic synthesis and implementation. Redo step 9 if problems

arise, such as the critical path length exceeding the clock period.

14. Analyze the resource usage and timing profile of the implemented design.

If the resource usage exceeds the Zynq-7000’s maximum, consider reduc-

ing the complexity of the HLS IP core(s) and/or decreasing the number

of concurrent IP cores present in the block design at step 10. If the critical

path exceeds the clock period, return to step 9 until any timing issues are

resolved.

15. Generate the bitstream and export the hardware to Xilinx SDK.

• In Xilinx SDK [32], [33]:


16. Create a new standalone Board Support Package (BSP) based on the pre-

viously generated hardware platform. The drivers of this BSP allow hard-

ware components to be called directly from software code.

17. Create a new C++ software application project based on the BSP. The author recommends importing the Hello World example, as it contains the necessary platform initialization and clean-up routines.

18. Configure the BSP project to include relevant libraries, and modify the

software project’s linker script to ensure the stack and heap sizes are large

enough for your use case.

19. Write C/C++ application code that executes and verifies the full system’s

functionality. Software can be debugged by setting breakpoints in SDK,

while hardware logic can be debugged using the System ILA in Vivado

as described in [31].
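The functional verification in step 2 above typically boils down to comparing the synthesizable function against a trusted reference on the same inputs. A minimal test-bench pattern of that kind might look as follows; both functions are hypothetical stand-ins for an actual kernel and its reference implementation, not code from this project.

```cpp
#include <cmath>

// Stand-in for the synthesizable top function under test (hypothetical).
float kernel_hw(float x) { return x * x + 1.0f; }

// Stand-in for the trusted software reference implementation (hypothetical).
float kernel_ref(float x) { return x * x + 1.0f; }

// Compare both implementations over a sweep of inputs. Returns the number
// of mismatches beyond the given tolerance; 0 means the test bench passes.
int run_testbench(int num_samples, float tolerance) {
    int mismatches = 0;
    for (int i = 0; i < num_samples; i++) {
        float x = 0.01f * i;
        if (std::fabs(kernel_hw(x) - kernel_ref(x)) > tolerance)
            mismatches++;
    }
    return mismatches;
}
```

The same test bench is reused unchanged after every optimization pass (step 3) and, via C/RTL cosimulation (step 6), against the generated RTL.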

It is possible that the resolution of some problems in the last step extends all the

way back to step 1, in the sense that it requires a comprehensive and holistic analysis

of all aspects in the design process. One such incident might be that the integrated system's performance is lower than expected, so that architectural decisions regarding bandwidths, data types etc. have to be revisited. At other times, more fundamental limitations might be encountered, such as excessive resource utilization (making routing impossible) or I/O bandwidth ceilings, which can usually not be solved without reassessing the initial specifications of the system.

2.3 SLAM on FPGAs

In recent years, the idea of implementing Simultaneous Localization and Mapping

on FPGAs has been explored in diverse ways. Works in literature range from two-

dimensional [34], [35] to three-dimensional [3], [7], [36]–[40] SLAM and from sparse

[7], [35], [38], [39] to semi-dense [3] SLAM, although a fully dense 3D variant such

as KinectFusion seems to be less popular in the embedded hardware community.

Furthermore, most works focus on the hardware acceleration of just specific parts

belonging to a whole heterogeneous SLAM system [36], [37], [39]. The selected

subcomponents naturally include those that the FPGA is known to be strong at in

terms of performance and efficiency. Nevertheless, the following text provides an

overview of these existing results in an attempt to pick up architectural and method-

ological clues for the as-complete-as-possible acceleration of SLAM on FPGAs.

In order to give an idea of the current state of the art, Table 2.2 summarizes most

recent works on 3D SLAM. Some authors have published improvements of their

systems over several years, in which case only the latest results are shown. A clear


Reference        Algorithm      Type        Platform(s)                    Speed      Year
[7], [41]        FastSLAM 2.0   Sparse      Host CPU + Arria 10 FPGA       102 FPS    2018
[36]             ORB-SLAM       Sparse      Host CPU + Stratix V FPGA      67 FPS     2018
[38]             VO-SLAM        Sparse      DE3-340 SoC                    31 FPS     2015
[39]             EKF-SLAM       Sparse      Zynq-7020 SoC                  30 Hz      2015
[3], [42], [43]  LSD-SLAM       Semi-dense  Zynq-7045 SoC                  >60 FPS    2019
[37]             KinectFusion   Dense       GTX 760 GPU + Stratix V FPGA   26-28 FPS  2015
[40]             ICP-SLAM       Dense       Zynq-7020 SoC                  2 FPS      2017

TABLE 2.2: A compilation of recent 3D SLAM applications involving the FPGA in roles of varying importance, showing a trend of decreasing frame rate with increasing "density". SoC (System-on-Chip) boards always contain both an embedded CPU and FPGA.

trade-off is visible between the performance and quality of the algorithm, in ad-

dition to the role played by the actual ’embeddedness’ of the heterogeneous plat-

forms in use. Low-power, real-time sparse SLAM applications seem to be coming of

age thanks to their light computational weight, although they do not produce a us-

able map and often lack loop-closure functionality [3]. On the other hand, the more

accurate and feature-rich fully dense solutions notably require high-end hardware

(typically desktop GPUs as shown in Figure 2.1) unless the real-time constraint is

disposed of [2].

2.3.1 Dense and semi-dense SLAM

Perhaps the most relevant entry in Table 2.2, the implementation of real-time 3D re-

construction using KinectFusion on a heterogeneous system with an FPGA has been

attempted before by [37]. Here, Gautier et al. attempted to accelerate two intensive

parts of the application, specifically the Iterative Closest Point (ICP) algorithm and

the volumetric integration step, corresponding to the track, reduce and integrate ker-

nels in Figure 2.4. Their set-up was a heterogeneous system with the Altera Stratix

V FPGA and the NVIDIA GTX 760 GPU, both connected via PCI Express to a host

computer. It is interesting to note that the authors’ goal of accelerating integrate was

unsuccessful due to a fundamental bandwidth limitation. A 3D volume with 512³ elements takes 512 MB of memory, which is far too large to transfer back and forth between the FPGA and the GPU (or CPU, for that matter) at sufficient

speeds. Their final architecture therefore consists of KinectFusion being executed

nearly completely on the GPU, but with the ICP part of the tracking step offloaded

to the FPGA. Real-time speeds of up to 28 FPS were achieved by halving the input

data resolution and limiting the number of ICP iterations. Lastly, Gautier et al. point

out that the Altera OpenCL SDK posed practical difficulties in optimizing area uti-

lization, for example because the tool lacked support for fixed-point arithmetic.
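The bandwidth argument behind this failure is easy to check numerically. The sketch below assumes 4 bytes per TSDF element (e.g. a value plus a weight) and an illustrative sustained link bandwidth of 8 GB/s; both figures are assumptions for illustration, as [37] is not being quoted here.

```cpp
#include <cstdint>

// Size in bytes of a cubic TSDF volume with 'side' elements per dimension,
// assuming 4 bytes per element (an illustrative layout: a truncated signed
// distance value plus a weight).
std::uint64_t volume_bytes(std::uint64_t side) {
    return side * side * side * 4u;
}

// Milliseconds needed to move 'bytes' once over a link sustaining
// 'gb_per_s' gigabytes per second (the bandwidth figure is an assumption).
double transfer_ms(std::uint64_t bytes, double gb_per_s) {
    return static_cast<double>(bytes) / (gb_per_s * 1e9) * 1e3;
}
```

Under these assumptions, a single one-way transfer of the 512³ volume takes on the order of 67 ms, i.e. about two full frame periods at 30 FPS, before any computation has even taken place; moving it back and forth every frame is clearly infeasible.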

The fact that semi-dense and dense SLAM algorithms are characterized by high

bandwidth requirements was also noted by [42] in 2016. In this work, Boikos et


al. presented their first iteration of LSD-SLAM, achieving 4 FPS at a resolution of

320x240. Two accelerator units were implemented, and the communication between

them had to occur via DDR because the intermediate data (on the order of a few

MB) produced by the first unit could not be cached entirely on the FPGA. A redesigned architecture around the tracking core, enabling a fully streaming communication paradigm, was presented in [43] and brought a five-fold frame-rate improvement over the previous work. Combined with a scalable depth estimation

architecture again by Boikos et al. in [3], 2019 marked the arrival of the first com-

plete accelerator for direct semi-dense monocular SLAM on FPGAs. The power con-

sumption was measured to be an order of magnitude smaller than that of an Intel

i7-4770 CPU. A highly important takeaway is that the dataflow principle (i.e. ker-

nels linked with a single-consumer, single-producer pattern) was found to yield the

most efficient design. Furthermore, the units were made multi-modal in order to

deal with LSD-SLAM’s complex control flow due to the iterative and multi-level na-

ture of tracking. More specifically, the pipelined hardware blocks could be put to

different uses depending on the current phase of the system: every unit contains a

set of operations from which the desired computation can be selected by means of

multiplexing. As will be explained in Chapter 5, similar techniques were employed

in our research for the dataflow architecture of KinectFusion.


Chapter 3

High-level synthesis design of individual kernels

The transformation of KinectFusion’s source code into Vivado HLS-optimized code

is an important aspect of this thesis, not just to obtain a HLS implementation of

SLAM but also in an attempt to recognize patterns in the approach by which it is

done. Many subroutines part of a dense 3D SLAM algorithm often recur in other ap-

plications related to computer vision and image or video processing as well, because

the data management structure of its kernels only varies so much in practice. Conse-

quently, the task of accelerating these kernels can largely be mapped to a framework

that we developed for the purpose of making the design of similar HLS kernels eas-

ier in the future. Before proceeding to this methodology in Section 3.2, some details about the specific use case are discussed first, so as to give a better idea about the characteristics and diversity of the kernels being dealt with.

3.1 Prerequisites

3.1.1 Detailed algorithm description

KinectFusion, like any SLAM algorithm, is composed of various steps that have to be

executed in succession for each captured depth frame. A diagram of KinectFusion’s

nine essential kernels is depicted in Figure 3.2. Note that some routines are called a

variable number of times per frame, and that multiple instances of each kernel might

be called with mutually different sets of dimensions. This complex interaction and

dataflow will be discussed in detail in Chapter 5; a functional description of each

kernel follows here.

• mm2m_sample: This kernel essentially resamples the sensor output and per-

forms a unit conversion from millimeters to meters. The raw depth frames

captured by the Kinect camera are given in an unsigned short integer format,

which need to be converted to a floating point representation and resized (if

necessary) to the correct dimension by subsampling the pixels. This allows all

subsequent kernels to work with distance values expressed in meters only.


• bilateral_filter: Because the Kinect depth map is rather noisy [44], the data is

first filtered in an edge-preserving way. This kernel is a non-linear filter that

replaces each depth value by a weighted average of nearby values in order to

reduce the noise amplitude. Its smoothing operation relies on the prior knowl-

edge that many real-world environments consist of large patches of mostly flat

areas, as shown in Figure 3.1. The filter preserves discontinuities by assigning only negligible weight to intensities that are 'far away' from the currently considered intensity in an absolute-value sense. Newcombe

et al. [5] have found that the insertion of this preprocessing stage greatly

increases the quality of the normal maps produced in later stages of the al-

gorithm, which in turn improves the data association step performed by the

tracking kernels.

• half_sample: This kernel resamples the bilaterally filtered depth map by a

factor of two in each dimension. Every four input values are thus mapped to

one output value, again in an edge-preserving manner so as not to introduce any fake averaged depth values near discontinuities.

• depth2vertex: This kernel computes a point cloud in projective form, i.e. with

every vertex as a function of its pixel position. By multiplying every depth

value with a matrix that summarizes the camera’s intrinsics [45], an output

array is produced where each element represents a Euclidean 3D point in the

local frame of reference.

• vertex2normal: Given the map of vertices generated previously, this kernel

produces an array of normal vectors. Every normal vector is calculated by

taking the cross product of two subtracted pairs of neighboring points.

• track: This kernel performs part of the multi-scale Iterative Closest Point (ICP)

alignment. It essentially tracks the live depth frame against the globally fused

model, in order to establish correspondences between the new and synthetic

point clouds. No feature selection occurs as KinectFusion is a fully dense al-

gorithm. A faster-than-conventional variant of ICP is employed by [5] as well as by the source code discussed later. This optimization relies on the assumption of small motion from one frame to the next, which holds as long as the frame rate remains sufficiently high.

• reduce: This kernel calculates the total error of the tracking result, by adding

up distances between corresponding points in the input and predicted point

clouds. 32 values are obtained to form a basis for the error minimization pro-

cess.

• update_pose: This routine produces a new or refined pose estimation, starting

from the reduction computed above.


FIGURE 3.1: Illustration of the bilateral filter, showing its edge-preserving property [46].

• check_pose: Since the correction of the pose estimate should only be applied

when the resulting tracking error is small enough, this routine verifies the re-

duction output and resets the pose to its last sufficiently stable estimate if nec-

essary.

• integrate: Once the estimated pose has been updated, this kernel integrates

the newly observed depth map into the global 3D map. This volume consists

of Truncated Signed Distance Function (TSDF) values, whereby each element

has an associated weight corresponding to the certainty of the surface mea-

surement at that position. The integration step transforms the input depth

map into a world frame of reference, and iterates over the whole volume to

update each element by computing a simple running average of the existing

TSDF value and the (possibly noisy) new TSDF value. In order to maximize

the system’s ability to reconstruct finer scale structures, the raw depth map is

used for this purpose rather than the bilaterally filtered version [5].

• raycast: This kernel generates a synthetic vertex and normal map by casting

a ray from every pixel into the fully up-to-date dense 3D volume, given the

current pose estimate. It therefore generates a reliable prediction of what the

corresponding input arrays should look like if the camera were to make its observation at the specified position. This step is essential in forming a reference

that allows the tracking kernels to take into account all previous observations

made so far.
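To make the behavior of the bilateral_filter kernel described above concrete, the following sketch applies the filter to a one-dimensional depth signal. The window radius and the two Gaussian parameters are illustrative choices, not the values used in KinectFusion, and the real kernel of course operates on 2D depth maps.

```cpp
#include <cmath>
#include <vector>

// 1D bilateral filter sketch (illustrative parameters, not KinectFusion's):
// each sample is replaced by a weighted average of its neighbours, where the
// weight decays both with spatial distance (sigma_s) and with difference in
// depth value (sigma_r), so sharp discontinuities are preserved.
std::vector<float> bilateral_1d(const std::vector<float>& in,
                                int radius, float sigma_s, float sigma_r) {
    std::vector<float> out(in.size());
    for (int i = 0; i < static_cast<int>(in.size()); i++) {
        float sum = 0.0f, norm = 0.0f;
        for (int k = -radius; k <= radius; k++) {
            int j = i + k;
            if (j < 0 || j >= static_cast<int>(in.size())) continue;
            float ds = static_cast<float>(k);  // spatial distance
            float dr = in[j] - in[i];          // depth difference
            float w = std::exp(-(ds * ds) / (2.0f * sigma_s * sigma_s)
                               - (dr * dr) / (2.0f * sigma_r * sigma_r));
            sum += w * in[j];
            norm += w;
        }
        out[i] = sum / norm;  // norm > 0: the centre sample has weight 1
    }
    return out;
}
```

Applied to a noiseless step edge, the filter leaves both sides essentially untouched: samples across the discontinuity receive near-zero range weight, which is exactly the edge-preserving property illustrated in Figure 3.1.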

The reduce kernel is followed by update_pose and check_pose, but these are left out

of the diagram because they contain ’irregular code’ that is inherently unfit for the

FPGA. The justification for this arises from the absence of any significant oppor-

tunity for parallelization, pipelining, or temporally efficient hardware sharing. In

update_pose, a vector of 6 values is computed by means of singular value decomposi-

tion, after which the updated pose matrix is generated by calculating an exponential

map from a Lie algebra to the group of rigid transformations in 3D space [47]. Imple-

menting these operations on the FPGA takes up a large fraction (around one third)


FIGURE 3.2: Overview of KinectFusion kernels. Green shaded areas include blocks that are executed multiple times per frame and per level; once for every iteration i.


of the Zynq’s resources. This indicates that the hardware blocks are being used in-

efficiently, as the code is irregular. Only if the matrix dimensions were much larger, or if the routine were called much more often than it currently is, would we be able to exploit repetitiveness and consider FPGA acceleration. The second

method, check_pose, is more related to control flow than actual computation, making

the overhead of off-loading this method to the FPGA undesirable. Lastly, both rou-

tines take relatively little processing time. All of the discussed factors have led the

author to the decision to execute update_pose and check_pose on the ARM Cortex-A9

CPU only, regardless of which other kernels are being off-loaded to the FPGA.
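For concreteness, the exponential map mentioned in this section can be written in closed form. The sketch below is a hedged illustration, not the thesis code (which delegates this calculation to TooN): it converts a hypothetical 6-element twist (v, ω) from the Lie algebra se(3) into a 4x4 rigid transformation using the Rodrigues-style coefficients. The function name and types are assumptions.

```cpp
#include <array>
#include <cmath>

// Illustrative sketch (not the thesis implementation): exponential map
// from a 6-element twist xi = (v, w) in se(3) to a 4x4 rigid
// transformation in SE(3), using the closed-form Rodrigues coefficients.
using Mat4 = std::array<std::array<double, 4>, 4>;

Mat4 exp_se3(const std::array<double, 6>& xi) {
    const double vx = xi[0], vy = xi[1], vz = xi[2];
    const double wx = xi[3], wy = xi[4], wz = xi[5];
    const double theta = std::sqrt(wx * wx + wy * wy + wz * wz);

    // Coefficients of R = I + A*[w]x + B*[w]x^2 and V = I + B*[w]x + C*[w]x^2
    double A, B, C;
    if (theta < 1e-8) {            // small-angle Taylor expansion
        A = 1.0; B = 0.5; C = 1.0 / 6.0;
    } else {
        A = std::sin(theta) / theta;
        B = (1.0 - std::cos(theta)) / (theta * theta);
        C = (1.0 - A) / (theta * theta);
    }

    // Skew-symmetric matrix [w]x and its square
    const double W[3][3] = {{0, -wz, wy}, {wz, 0, -wx}, {-wy, wx, 0}};
    double W2[3][3];
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            W2[i][j] = 0.0;
            for (int k = 0; k < 3; k++) W2[i][j] += W[i][k] * W[k][j];
        }

    Mat4 T{};                      // zero-initialized homogeneous matrix
    const double v[3] = {vx, vy, vz};
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            const double I = (i == j) ? 1.0 : 0.0;
            T[i][j] = I + A * W[i][j] + B * W2[i][j];   // rotation block R
        }
        for (int j = 0; j < 3; j++) {
            const double I = (i == j) ? 1.0 : 0.0;
            T[i][3] += (I + B * W[i][j] + C * W2[i][j]) * v[j];  // t = V * v
        }
    }
    T[3][3] = 1.0;
    return T;
}
```

A zero twist maps to the identity transformation, and a rotation-free twist maps to a pure translation, which makes the routine easy to sanity-check against a reference implementation.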

3.1.2 Source code, dataset and parameters

Reference implementation

A C++ implementation of KinectFusion is provided by SLAMBench [6], [9], [48],

[49], which is in turn based on the CUDA implementation by [50]. We have decided

to completely rewrite KinectFusion based on the source code found in these GitHub

repositories, for the following reasons. First, out-of-scope or platform-specific fea-

tures such as comprehensive benchmarking, multi-core functionality, graphics ren-

dering1, user interfaces, extensive I/O support and more can easily be omitted this

way to avoid interference with relevant features and improvements. Second, by fix-

ing loop bounds as much as possible, the methods become better suited for FPGA

acceleration. Variable loop bounds are difficult to reconcile with certain HLS opti-

mizations such as loop unrolling [1]. Moreover, the loop control can be simplified

if the HLS compiler knows the number of iterations beforehand, thus saving on re-

sources. Third, library dependencies and C++11 features are avoided as much as

possible, in order to increase the code portability. It has been experimentally de-

termined that Vivado HLS and Xilinx SDK do not fully support C++11-specific fea-

tures, so this step allows for the ARM-compilation and execution of KinectFusion

kernels on the Zedboard without issues. TooN [51] is the only remaining external

library, and is used for the complex linear algebra calculations performed in update_pose. Conveniently, this library does not need to be ported to FPGA hardware, as update_pose was already declared ineligible for acceleration. Lastly, the existence

of a clean reference implementation allows for a more straightforward comparison

between generic (CPU) and FPGA-specific variants of the kernels. Henceforth, the

terms ’original’ and ’reference implementation’ will always refer to [9] and to the rewritten version of KinectFusion as described in this paragraph, respectively.

1 After all, visualization is just one of the possible use cases of SLAM, as previously noted in Chapter 2. The removal of this feature makes our implementation less opinionated, in the sense that it allows for any other application to substitute for it instead.


FIGURE 3.3: Screenshot of the SLAMBench2 GUI when evaluating the ’Living Room 2’ scene.

Dataset choice and data extraction

There is little difference in KinectFusion’s performance or accuracy across different

datasets. In order to measure kernel timings in the reference implementation, the

author has rather arbitrarily chosen the ’Living Room 2’ RGB-D video fragment be-

longing to the ICL-NUIM dataset [45]. This scene has a resolution of 640x480, which

corresponds to what the Kinect v1 sensor captures, and is even slightly higher than

the depth map resolution of the Kinect v2 camera (512x424) [52]. An example of

SLAMBench’s visualization is shown in Figure 3.3. The original implementation was mod-

ified so that intermediate data could be extracted from in-between every block in

Figure 3.2, with the goal of thoroughly testing every HLS kernel’s accuracy relative

to a reliable ground truth. Using test benches in Vivado HLS, the exact deviations

caused by optimizations (such as moving from floating point to fixed-point arithmetic) or other changes could then be evaluated precisely.

Parameter space exploration

While the specific dimensions and parameters of our system do not matter much in

the construction of a generally applicable methodology, it is useful to get an idea of

the orders of magnitude that are being considered. The important factors influencing

this decision include accuracy, speed and usefulness of the resulting application. Us-

ing SLAMBench [6], the mean ATE of KinectFusion’s original implementation was

calculated for 64 parameter configurations by varying the following four settings:

• Input resampling factor (= 1, 2, 4 or 8): Specifies the resizing ratio of the cap-

tured sensor frame. If it is larger than one, then mm2m_sample will perform re-

sampling in order to downscale the input depth map from 640x480 to 320x240,


160x120 or 80x60.

• Global volume resolution (= 256, 128, 64 or 32): The size of the reconstruction

array along one dimension. The upper boundary of 256³ corresponds to 16

million weighted TSDF elements in total.

• Tracking rate (= 1 or 2): The frame interval by which to perform the pose track-

ing step. If the input is captured at 30 FPS but the tracking rate is halved (i.e. its

value equals 2), then the pose would only be updated once every second frame.

Consequently, the actual processing rate of the algorithm would be limited to

15 FPS in real-time as half of the input frames effectively remain unused.

• Integration rate (= 1 or 2): The frame interval by which to perform the volu-

metric integration step. As it makes no sense to perform integration with an

outdated pose, this value must be equal to or higher than the tracking rate.
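To put these parameter ranges in perspective, a quick back-of-the-envelope sketch may help. The helper functions below are illustrative, and the bytes-per-element value is an assumption (e.g. a 16-bit TSDF value plus a 16-bit weight); the text above only fixes the element counts.

```cpp
// Illustrative helpers for sizing the reconstruction volume. The
// bytes-per-element figure is an assumption for the sake of example.
long voxel_count(long res) { return res * res * res; }

double volume_mib(long res, int bytes_per_elem) {
    return (double)voxel_count(res) * bytes_per_elem / (1024.0 * 1024.0);
}
```

For instance, voxel_count(256) gives 16 777 216 elements; at an assumed 4 bytes each, volume_mib(256, 4) evaluates to 64 MiB, orders of magnitude beyond the few megabits of on-chip Block RAM, so the volume must reside in external memory.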

It was quickly found that setting the volume resolution to 64 or lower almost

always leads to loss of track, effectively rendering the system useless at perform-

ing localization or mapping. Excluding these values then results in Figure 3.4. Re-

markably, the algorithm performs well at all depth map resolutions. Halving the

volume resolution doubles the mean ATE on average, although the input resam-

pling factor does not have a significant influence as long as its value does not exceed

4. For reasons that we could not explain, the ATE for one of the 16 configurations

(640x480, 256³, 15 FPS) skyrocketed during the benchmark. This phenomenon was

also present in ’Living Room 0’, and we speculate that the algorithm simply experi-

ences bad luck for this particular configuration. Errors tend to grow near the end of

the scene, although the first half often works flawlessly.

Note that the meaning of FPS in this context is unrelated to the actual perfor-

mance of the system that runs the algorithm. Instead, it designates the rate at which

the sensor provides its input frames, or in other words, the frequency at which we

decide to skip these frames. Reducing the input FPS from 30 to 15 lowers the bar

(i.e. the system’s minimal workload) of achieving real-time performance by a factor

of two, while sacrificing just a small amount of accuracy. If the heterogeneous sys-

tem is able to process frames at 15 FPS or higher, then we can talk about real-time

performance as long as every second sensor frame is disposed of. The authors of [5] have shown that even using only every 6th sensor frame may result in reconstructions of lesser

but nonetheless acceptable quality, a concept also known as graceful degradation.

The real-world size of the volume is 8 × 8 × 8 m³, yielding a smallest step of roughly 3 cm (8 m / 256 ≈ 3.1 cm) along one dimension in the case of a volume with 256³ elements. For use

cases that require the ability to distinguish objects in sufficient detail, the author

deems any lower accuracy than that unacceptable (see Figure 3.5 for an example with


FIGURE 3.4: Mean ATE for different configurations of KinectFusion. The cubed numbers indicate volume resolutions, while the input FPS corresponds to both the tracking and integration rate.

FIGURE 3.5: A) RGB video stream (unused). B) Latest depth map captured by the Kinect sensor. C) Reconstructed scene using KinectFusion [37].

a resolution of 512), so the volume resolution is kept at 256. However, the input

resampling factor is set to 2 as both the mean ATE and the global reconstruction

quality are still satisfactory. Furthermore, the decrease from 30 to 15 FPS also has

negligible effect on the accuracy but obviously a considerable positive effect on the

speed of the system. The final selected configuration is as follows: an input size of

320x240, a volume resolution of 256, a tracking rate of 2 and an integration rate of

2. In theory, real-time performance can be achieved as soon as the actual processing

rate becomes 15 FPS or more.

CPU timing results

With the parameters determined above, the reference implementation of KinectFu-

sion was timed on both a laptop computer and the CPU of the Zynq SoC with com-

piler optimizations enabled. Table 3.1 summarizes the total time spent within each


Kernel             Intel i7-6700HQ CPU   ARM Cortex-A9 CPU
mm2m_sample              0.4 ms                2.7 ms
bilateral_filter        58.2 ms              425.9 ms
half_sample              0.6 ms                1.8 ms
depth2vertex             3.0 ms                7.9 ms
vertex2normal            7.8 ms               27.4 ms
track                   20.9 ms              125.7 ms
reduce                  35.1 ms               27.7 ms
integrate              205.0 ms             1236.1 ms
raycast                233.0 ms             1293.6 ms
Total speed             1.77 FPS              0.32 FPS

TABLE 3.1: Time spent in each kernel when KinectFusion is executed on the CPU of either a regular laptop or the Avnet Zedboard. The resulting frame rate is determined by summing up all timings on a given platform.

kernel. No multi-core functionality was used, because all software is executed as

standalone (bare metal) on the Zedboard. Without any operating system, multi-

threading is too cumbersome to implement and therefore left out-of-scope for this

dissertation.

3.2 Methodology

3.2.1 Common parallel patterns and categorization

When presented with the question of how to accelerate image processing kernels,

the first step is to understand which type of kernel is being considered. In the domain of parallel programming, a very useful categoriza-

tion happens on the basis of the kernel’s parallel pattern. This concept assigns every

method to a kind of predefined algorithmic skeleton, based on its data management

and/or computation pattern [53].2 The design methodology can then be tailored to-

wards the optimization of every category separately, which allows the kernels to be

tackled more easily after performing this subdivision. As will become clear, every

pattern has its own set of challenges and approaches during its implementation -

this holds true not just for GPUs but for FPGAs and other hardware as well.

On a wording-related side note, the following paragraphs will often talk about

images and pixels when referring to (two-dimensional) arrays and elementary data

values respectively. The reader should keep in mind that the elements can be composed of any data type that is not necessarily limited to regular pixels. Many patterns, as well as their discussed methods, can even be generalized towards three-dimensional data structures. The aforementioned terms will still be used for

2 As such, the patterns are not really parallel in the sense that the FPGA will not parallelize them the way a GPU does, but we still call them that for convenience.


FIGURE 3.6: The Map pattern [9].

for (int y = 0; y < HEIGHT; y++) {
    for (int x = 0; x < WIDTH; x++) {
        // Read an input value
        data_t value_in = data_in[y][x];
        // Calculate some function of (value_in, x, y) to produce value_out
        data_t value_out = f(value_in, x, y);
        // Write the output value
        data_out[y][x] = value_out;
    }
}

LISTING 3.1: Code snippet representing the Map parallel pattern.

conciseness and simplicity. Next, the pseudo-code examples serve to illustrate the

computation pattern of each type, and do not yet take into account any efficiency

and/or HLS optimization measures. Lastly, the approach towards FPGA accelera-

tion is already summarized very briefly for each category, but will be explained in

full detail in the next section.

Map

The Map parallel pattern is arguably the simplest pattern with regard to image processing. As illustrated in Figure 3.6, this type of kernel independently maps every

input pixel to a corresponding output pixel, without using the value of any other

input or output pixel in its calculation. The pseudo-code for this pattern can be

summarized as in the code snippet in Listing 3.1.

The characteristic aspect of this category is that, for every iteration, the calcu-

lation depends only on a single input pixel that is never used again in any other

iteration. This means that the input data is scanned in a linear manner, i.e. from

the beginning to the end. This fact can be readily exploited in FPGA-oriented opti-

mization stages, both computation- and communication-wise. Example applications

of the Map category include affine transformations such as pixel inversions or unit

conversions.

This pattern is a perfect candidate for pipelining, since all iterations are indepen-

dent of each other in terms of data. In HLS, a streaming interface should be applied

to the kernel in order to allow an AXI DMA to send and receive data at high speeds.


FIGURE 3.7: The Stencil pattern [9].

#define WIN_SIZE (HALF_SIZE * 2 + 1) // WLOG: always odd
for (int y = 0; y < HEIGHT; y++) {
    for (int x = 0; x < WIDTH; x++) {
        // Read multiple input values
        data_t window[WIN_SIZE][WIN_SIZE];
        for (int i = -HALF_SIZE; i <= HALF_SIZE; i++)
            for (int j = -HALF_SIZE; j <= HALF_SIZE; j++)
                if (0 <= y + i && y + i < HEIGHT && 0 <= x + j && x + j < WIDTH)
                    window[i + HALF_SIZE][j + HALF_SIZE] = data_in[y + i][x + j];
        // Calculate some function of (window, y, x) to produce value_out
        data_t value_out = f(window, y, x);
        // Write the output value
        data_out[y][x] = value_out;
    }
}

LISTING 3.2: Code snippet representing the Stencil parallel pattern.

Loop unrolling should be no problem either, if permitted by the FPGA resources and

PS-PL bandwidth capacity.
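A minimal HLS-style sketch of such a Map kernel is given below. This is an illustration, not the thesis implementation: it is loosely modeled on the conversion part of mm2m_sample, plain arrays are used instead of hls::stream for brevity, and the dimensions are assumptions. The PIPELINE pragma requests an initiation interval of one cycle.

```cpp
#include <cstdint>

#define HEIGHT 240
#define WIDTH  320

// Hedged sketch of a Map kernel in Vivado-HLS style: every output pixel
// depends on exactly one input pixel, read and written in linear order.
void mm2m_map(const uint16_t depth_mm[HEIGHT][WIDTH],
              float depth_m[HEIGHT][WIDTH]) {
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
#pragma HLS PIPELINE II=1
            // One read, one independent computation, one write per iteration.
            depth_m[y][x] = depth_mm[y][x] / 1000.0f;
        }
    }
}
```

Because no iteration reuses data from any other, the pragma alone is typically enough for HLS to reach II = 1 on this kind of loop.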

Stencil

The Stencil parallel pattern shown in Figure 3.7 is similar to Map, but every output

pixel now depends on a set of multiple neighboring input pixels instead of just a sin-

gle one. This dependence very often comes down to a predefined moving window,

in which case the output calculation depends only on input pixels within this win-

dow. Listing 3.2 shows a possible pseudo-code representation, in which the sliding

window is fully loaded with input data at the start of every new iteration.

Note that in the given example code, border handling effects are left to the user.

They should always be taken care of adequately in order to prevent accidental reads

from invalid (out-of-bounds) memory addresses. This type of kernel is characterized

by the explicit reuse of input data; every pixel is usually read multiple times and is

involved in the calculation of multiple output pixels. The reuse factor often equals

the squared window dimension (WIN_SIZE²) for a standard convolution with non-

zero coefficients, although this is not necessarily the case in general. Applications of


FIGURE 3.8: The Reduce pattern [9].

reduction_t result;
for (int y = 0; y < HEIGHT; y++) {
    for (int x = 0; x < WIDTH; x++) {
        // Read an input value
        data_t value_in = data_in[y][x];
        // Update the temporary variable holding a partial aggregation
        result = aggregate(result, value_in);
    }
}
return result;

LISTING 3.3: Code snippet representing the Reduce parallel pattern.

the Stencil category include for example two-dimensional linear or non-linear filters.

Clearly, the access pattern to data_in is more complicated than for the Map pat-

tern, which makes the application of streaming interfaces non-trivial. Furthermore,

bottlenecks can arise very quickly upon employing the pipeline paradigm without

careful optimization. The technique to resolve this issue will be to insert line buffers

containing several rows of pixels, along with array partitioning directives in order

to ensure the possibility of concurrent accesses to the window and line buffers while

maintaining a minimized initiation interval (II).

Reduce

The Reduce parallel pattern (Figure 3.8) is a many-to-one operator and aggregates all

input elements into a single output value. This kernel type is typically programmed

as in Listing 3.3.

The lack of an output image and the aggregation aspect are characteristic of this

kernel. Naturally, multiple output values (which can be seen as one large data type)

are also allowed. However, the output size essentially remains constant and does

not scale with the input size. FPGA-wise, the acceleration of this pattern is very

similar to Map. The biggest difference is the absence of an output stream: instead,

the result will be exposed as a simple memory-mapped register for the PS to read

from upon completion.
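One practical wrinkle is worth sketching here, under assumed numbers: a floating-point addition on the FPGA has a latency of several cycles, so a single accumulator creates a loop-carried dependency that prevents II = 1. A common remedy is to interleave the aggregation over several partial accumulators; the names, N and PAR below are illustrative.

```cpp
#define N   76800  // e.g. one 320x240 frame (illustrative)
#define PAR 8      // number of interleaved partial accumulators (assumed)

// Hedged sketch: breaking the loop-carried dependency of a Reduce kernel
// with PAR independent partial sums, restoring a pipelined II of 1.
float reduce_sum(const float data[N]) {
    float partial[PAR] = {0.0f};
#pragma HLS ARRAY_PARTITION variable=partial complete
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
        // Each individual partial sum is only updated once every PAR
        // cycles, hiding the multi-cycle latency of the floating-point adder.
        partial[i % PAR] += data[i];
    }
    float result = 0.0f;
    for (int p = 0; p < PAR; p++)   // final combination of the partial sums
        result += partial[p];
    return result;
}
```

The final combination loop costs only PAR - 1 extra additions, a negligible price for the pipelined main loop.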


FIGURE 3.9: The Gather (or Scatter) pattern [9].

for (int y = 0; y < HEIGHT; y++) {
    for (int x = 0; x < WIDTH; x++) {
        // Retrieve the input position (y_in, x_in)
        int y_in = f(y, x);
        int x_in = g(y, x);
        // Read the input value
        data_t value_in = data_in[y_in][x_in];
        // Calculate some function of (value_in, y, x, y_in, x_in) to produce value_out
        data_t value_out = h(value_in, y, x, y_in, x_in);
        // Write the output value
        data_out[y][x] = value_out;
    }
}

LISTING 3.4: Code snippet representing the Gather parallel pattern.

Gather

The Gather parallel pattern, shown in Figure 3.9, introduces an unseen complexity

in the form of random data access. Instead of scanning both the input and output

arrays regularly, they are accessed at irregular indices. These positions can depend

on a limited number of environmental parameters, but also on the data itself. Listing

3.4 shows an example whereby the reads from the input image occur in an irregular

fashion, while the output image gets written to in a linear (and thus fully predictable)

manner.

The characteristic aspect of this category is the presence of ’random’ memory

accesses (it does not matter whether this happens at the input side or the output

side). Here, the term random does not necessarily involve statistical randomness, but

also includes irregular, parameter-dependent and/or content-dependent accesses

that cannot be fully predicted by the programmer without knowing more about the

context during execution. This type of kernel is often found in combination with

the Map pattern, whereby an auxiliary or reference array is the one that gets ac-

cessed at irregular positions in order to retrieve extra data, while the main input

and output arrays are still mapped in a linear order. While the Stencil and Gather

categories concern different patterns, overlap sometimes exists for example in the

tracking step of KinectFusion. The irregularly accessed positions might not be uni-

formly random, but instead remain somewhat confined (either deterministically or

statistically) within an area around the 2D position of the corresponding ’linear’ loop iteration


FIGURE 3.10: The Search pattern [9].

determined by (y, x). Application-specific knowledge and analysis of spatial local-

ity will be very useful in determining which possible solution to pick.

In general, the Gather pattern is certainly a poor fit for FPGA acceleration due to the irregular memory accesses. After all, local buffer-

ing of all image data is often unfeasible due to the limited amount of Block RAM (on

the order of several megabits) that is present on an average FPGA. One technique

would be to cache different parts of the data, each piece as large as possible, and

re-execute the kernel as many times as needed in order to fully process all elements.

Another technique is to treat the kernel as a Stencil pattern, and employ a fall-back

procedure to external DDR requests if a specific position happens to fall outside

the current line buffer. Both techniques are, however, only applicable in limited cases, as will be discussed in Section 3.2.4.
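The first technique, caching parts of the data and re-executing the kernel once per cached part, can be sketched as follows. This is a hedged illustration with assumed names, dimensions and tile size, not the thesis code.

```cpp
#define HEIGHT 240
#define WIDTH  320
#define TILE_ROWS 60   // rows cached on-chip per pass (assumed BRAM budget)

// Hedged sketch of the tiling/re-execution technique for a Gather kernel:
// only accesses falling inside the currently cached tile are served; the
// remaining ones are handled when their tile is loaded in a later pass.
void gather_tiled(const float ref[HEIGHT][WIDTH],
                  const int idx_y[HEIGHT][WIDTH],
                  const int idx_x[HEIGHT][WIDTH],
                  float out[HEIGHT][WIDTH]) {
    for (int base = 0; base < HEIGHT; base += TILE_ROWS) {
        static float tile[TILE_ROWS][WIDTH];   // on-chip scratchpad (BRAM)
        for (int r = 0; r < TILE_ROWS; r++)    // regular burst load of the tile
            for (int c = 0; c < WIDTH; c++)
                tile[r][c] = ref[base + r][c];
        for (int y = 0; y < HEIGHT; y++) {
            for (int x = 0; x < WIDTH; x++) {
#pragma HLS PIPELINE II=1
                const int ry = idx_y[y][x];    // the 'random' read position
                if (ry >= base && ry < base + TILE_ROWS)
                    out[y][x] = tile[ry - base][idx_x[y][x]];
            }
        }
    }
}
```

The cost of this scheme is clear from the loop structure: the full index traversal is repeated once per tile, so it only pays off when the tile load is cheap relative to the irregular external accesses it replaces.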

Other

Many remaining parallel patterns exist, such as Search in Figure 3.10 and Scatter in

Figure 3.9. The former type retrieves data based on content matching [53] and is

embodied by the raycast kernel in KinectFusion, albeit for a huge volume of data.

The latter category can be contrasted with Gather, as Scatter kernels write their output

data to random locations rather than reading from them. Due to a mixture of scope

constraints and lack of FPGA-relevant methods in the considered SLAM use case,

neither pattern was studied in detail in this dissertation. For the Search kernel,

we refer to a dynamic variant of the scratchpad buffering technique that is nomi-

nally tailored towards Gather, while a Scatter kernel could be treated as belonging

to the Map category since external memory writes are non-blocking operations (un-

less read-after-writes occur). For completeness, a pseudo-code example of Search

is shown in Figure 3.11. A similar computation is performed for every 2D pixel in

raycast.

3.2.2 Pipelining

The concept of pipelining is of primary importance for the acceleration of compu-

tationally intensive kernels of any kind. This holds true especially in the world of

hardware design, because it allows for a much more time-efficient usage of all re-

sources required for the calculation. Pipelining can essentially be summarized as:

the ability to start executing the next iteration of a computation before the previous


step_t t = 0;
data_t val = 0;
while (true) {
    // Retrieve the position at which to inspect data
    int y = f(t, val);
    int x = g(t, val);
    // Read the value
    val = data[y][x];
    // Check whether the condition we are looking for is satisfied
    if (found(t, x, y, val))
        return (t, x, y, val);
    // Increment t
    t += step_size(t, x, y, val);
}

FIGURE 3.11: Non-exhaustive code snippet representing a possible instance of the Search parallel pattern.

one has finished [54]. It can be compared with a factory assembly line, whereby

parallelism that exists among the different steps is exploited in order to overlap the

execution of multiple iterations. Since any given loop body generally consists of

various operations that require distinct sets of resources, the synthesized design will

often be suboptimal as long as this opportunity for resource sharing is not taken

advantage of. The manner in which speed-ups can be obtained through pipelining

is visualized in Figure 3.12. Here, the initiation interval (II) is defined as the rate at

which a new iteration starts, measured in clock cycles. The iteration latency indicates

the actual duration of one such iteration, which can perfectly be many times longer

than the initiation interval thanks to the opportunity for overlapped computations

provided by pipelining.

Even though the clock frequency of an FPGA is typically one order of magnitude lower than that of a CPU, it can produce higher throughputs for the great majority of kernels. Furthermore, the FPGA is strongly dataflow-oriented, making a steady stream of

input and output data ideal when said data has to undergo a long string of muta-

tions in a pipelined fashion. A useful analogy would be to view the FPGA as a set

of hardware blocks operating on the data elements flowing through them. On the

other hand, a CPU could be viewed as a small set of registers being operated by the

instructions flowing through the CPU one by one. In short, the notion of pipelining

is intrinsically very well suited to the FPGA, and in general there exist few reasons

not to apply this concept to methods that contain repetitive calculations over large

bodies of data. An initiation interval of one clock cycle is the fastest possible, and is

often what the designer ought to aim for in practice.


(A) Execution on a CPU, which has limited opportunities for parallelization or higher-level pipelining.

(B) Execution on an FPGA with pipelining.

FIGURE 3.12: Concept of pipelining applied to a repeated calculationcalled ’op’ on a large array.

Practically, the loops contained within an HLS kernel can easily be pipelined

thanks to optimization directives that are readily available in Vivado HLS. By in-

serting a pragma in either the loop body’s source code or a separate configuration

file, the programmer can provide a hint to the HLS compiler that he or she wishes

to pipeline the respective loop. The compiler will then insert registers between mul-

tiple hardware operations and reschedule them as needed, so that no interference

occurs between data elements present at different stages of the calculation. This

way, instead of sequentially executing every iteration in isolation, the initiation in-

terval can effectively be reduced to just one or several clock cycles. Whereas the

iteration latency had previously directly determined the initiation interval, this ini-

tial latency has become far less important as the resulting throughput in regime is

now defined by the pipelined initiation interval only. For maximum performance, it

is also recommended to convert external arrays into a streaming interface if possible.

This enforces all reads and writes to happen in a fully linear and predictable manner,

which better suits the pipelining paradigm.

All discussed categories including the Map, Stencil, Reduce, Gather and Search

parallel patterns ought to benefit from the insertion of a pipelining pragma in the in-

nermost pixel loop (i.e. the horizontal loop if the processing order is row-major). As

an example, its effect on a Map kernel called depth2vertex is discussed here, which


processes 320 × 240 = 76800 pixels at the resolution pyramid’s finest level. Syn-

thesizing the HLS top function while employing a streaming paradigm but without

applying any pipelining optimization yields Figure 3.13a. In this case, the initiation

interval of every loop equals its iteration latency, being five clock cycles. However,

applying the HLS pipeline directive to the for_x loop results in Figure 3.13b, where a

speed-up of a factor five is visible. The initiation interval is now exactly one, which

is the fastest possible: one result is produced and written to the output stream at

every clock cycle. Note that the iteration latency has increased (HLS likely had to

insert an extra register somewhere in the chain of hardware operations to maintain

conformance with dependency and timing constraints), but this change has no real

importance whatsoever compared to the drastic decrease in the initiation interval.
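The impact can be quantified with the standard throughput formula: a loop of N iterations takes roughly latency + (N - 1) · II clock cycles. The small sketch below uses the numbers from this example (II reduced from 5 to 1); the pipelined latency value of 8 is an assumption.

```cpp
// Arithmetic sketch of pipelined-loop timing: for large iteration counts
// the initiation interval (II) dominates the total cycle count, and the
// iteration latency hardly matters.
long pipeline_cycles(long n_iters, long iteration_latency, long ii) {
    return iteration_latency + (n_iters - 1) * ii;
}
```

With these figures, pipeline_cycles(76800, 5, 5) gives 384 000 cycles without pipelining versus pipeline_cycles(76800, 8, 1) = 76 807 cycles with it: almost exactly the factor-five speed-up observed above, despite the slightly larger iteration latency.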

To explain why the resource utilization has increased just barely despite this five-

fold speedup, the more detailed analysis perspective provided by HLS is depicted in

Figure 3.14. The analysis views for the non-pipelining variant look extremely simi-

lar and are therefore not shown. One can visualize pipelining as creating copies of

every blue rectangle and shifting them to the right by intervals of one control step,

although Vivado HLS does not draw it in that manner presumably to preserve clar-

ity. A control step can be identified with one of different states of a finite state machine

(FSM), although it corresponds to a single clock cycle in the majority of cases. The

considered kernel multiplies every incoming data element with a 3x3 matrix, which

corresponds to the numbered operations 27 to 43. If pipelining were not applied, the FPGA would still be able to exploit the intrinsic parallelism of a matrix-vector multiplication; however, the many DSPs responsible for this calculation would be utilized

inefficiently throughout time. The overhead of the loop, whose iterations are exe-

cuted sequentially in time, would cause the DSPs to be activated only once or twice

every five clock cycles. Applying the pipelining concept results in these same DSPs

now having a much higher duty cycle: 100 % to be precise. Unlike loop unrolling

or similar optimizations, the relative increase in resource usage is generally much

smaller than the performance gain factor obtained by applying the pipeline pragma.

To conclude, our FPGA acceleration of this routine consists of parallelization on one

hand and pipelining on the other, made possible respectively thanks to the kernel

computation’s inherent regularity and the overlapped usage of DSPs.
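The matrix-vector core of such a kernel can be sketched as follows. This is an illustration, not the thesis code, and interpreting the 3x3 matrix as the inverse camera intrinsics is an assumption; the point is that the inner products map naturally onto parallel DSP multiply-accumulates.

```cpp
// Hedged sketch of the per-pixel matrix-vector core of a depth2vertex-style
// kernel: each pixel position is multiplied by a 3x3 matrix (assumed here
// to be the inverse camera intrinsics) and scaled by the measured depth.
void pixel_to_vertex(const float K_inv[3][3], int x, int y, float depth,
                     float vertex[3]) {
    const float p[3] = {(float)x, (float)y, 1.0f};
    for (int i = 0; i < 3; i++) {
#pragma HLS UNROLL
        // The three inner products can run concurrently on parallel DSPs.
        float acc = 0.0f;
        for (int j = 0; j < 3; j++)
            acc += K_inv[i][j] * p[j];
        vertex[i] = acc * depth;  // back-project the pixel to a 3D vertex
    }
}
```

With the outer loop unrolled and the body pipelined, all nine multiplications proceed in parallel at a sustained rate of one pixel per clock cycle, which is exactly the behavior reported in Figure 3.13b.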

3.2.3 Efficient line buffering

For kernels that incorporate any degree of data reuse, more advanced techniques

than streaming and pipelining have to be utilized. Figure 3.15a depicts the typical

functionality of a kernel belonging to the Stencil category. Hardware convolutions

and stencil kernels alike have been studied extensively in literature [20], [21], [55]–


(A) HLS report (no pipelining).

(B) HLS report (with pipelining).

FIGURE 3.13: Effect of pipelining on the timing profile and resource utilization.

[58], although no clear description exists of their implementation in HLS³. While not a full-fledged tutorial, this paragraph attempts to partially fill this gap, paying specific attention to handling feedback from the tool, as this is what constitutes a methodology.

A naive implementation of the bilateral_filter kernel is shown in Figure 3.16a. An

initiation interval of one clock cycle was not achieved, and the Vivado HLS console

displays blue messages as in Figure 3.16b to indicate why that is the case. The in-

spection and resolution of these warnings is an important aspect of the HLS design

process, and is part of the DSE philosophy. At the start of every iteration, the window is completely refilled with 3 × 3 = 9 depth values read straight from the input

array. However, these 9 accesses cannot happen concurrently within one clock cy-

cle, which forces HLS to automatically increase the pipelined initiation interval to at

least 9 cycles. This is confirmed by the analysis view in Figure 3.16c, which visualizes

which operations occur in the synthesized design at every control step. This initial

configuration is sub-optimal not only due to the presence of recurring requests for

the same data elements, but also due to the irregularity of the memory accesses. The

required pixels are generally not available in memory precisely in the order that they

³ The Window and LineBuffer data structures provided by Xilinx were found to lack the necessary features. Moreover, documentation on non-separable 2D filters is scarce: the only relevant sources would be [59], [60], although a different loop structure was used in this thesis.


3.2. Methodology 41

(A) Performance analysis view (with pipelining).

(B) Resource analysis view (with pipelining).

FIGURE 3.14: Analysis of a pipelined Map kernel, showing the parallelized elementary operations constituting a matrix-vector multiplication. Note that the analysis view in Vivado HLS does not clearly indicate overlapped computation, even though it is definitely present here: a read from and a write to the streaming interface occur at every single clock cycle (or, equivalently, control step).


(A) General operation of the Stencil kernel, shown with a 3x3 window as an example [56].

(B) Principle of the line buffer and shifting window operation, visualized on the input image.

(C) Interaction between the line buffer and memory window, visualized as they are actually structured in memory (adapted from [55]).

FIGURE 3.15: Illustration of the Stencil parallel pattern and a corresponding buffering technique for its implementation on the FPGA.

are read and processed, so that the FPGA will have to ’gather’ them and continually

make external DDR requests to the DRAM. Such requests inherently have a high la-

tency, reducing the real-world throughput even further compared to the HLS report.

A major improvement would be to access every data element exactly once, and

locally cache it for as long as needed to allow all stencil computations requiring

that particular pixel to be completed from data residing on the FPGA only. This

way, efficient streaming interfaces can be implemented for the incoming data. Fur-

thermore, we aim to store as little data as possible in order to optimize for BRAM

utilization as well. Figures 3.15b and 3.15c illustrate the principle of a window and

a line buffer, which are two important data structures that serve to fulfill this goal.


Denote the image height by H, its width by W and the window dimension by N. At

every iteration, the whole memory window of size N × N is shifted left, after which the line buffer fills N − 1 values of the window's rightmost column. In addition, a single

value is read from the input stream and placed in the window as well as the line

buffer. The line buffer of size (N − 1)×W holds N − 1 full rows of pixels. There-

fore, it has a width equal to that of the whole image, but only a single column of it is shifted vertically per iteration, so that the net result after one full row of iterations is an upward shift of one complete line.
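The per-iteration bookkeeping described above can be sketched as a small software model in plain C++ (our own illustration, not the HLS source of this thesis; the 3x3 window, the width of 8 and the array names are arbitrary choices):

```cpp
#include <cassert>

constexpr int N = 3;  // window dimension (illustrative)
constexpr int W = 8;  // image width (illustrative)

float win[N][N];          // N x N shifting window
float linebuf[N - 1][W];  // the N - 1 most recently completed rows

// One iteration: consume the new pixel arriving at column `col`.
void shift_in(float pixel, int col) {
    // Shift the whole window one position to the left.
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N - 1; c++)
            win[r][c] = win[r][c + 1];
    // Refill the rightmost column: N - 1 values come from the line
    // buffer, the newest value comes straight from the input stream.
    for (int r = 0; r < N - 1; r++)
        win[r][N - 1] = linebuf[r][col];
    win[N - 1][N - 1] = pixel;
    // Shift the touched line-buffer column up by one row and append the
    // new pixel, so that after W iterations every row has moved up once.
    for (int r = 0; r < N - 2; r++)
        linebuf[r][col] = linebuf[r + 1][col];
    linebuf[N - 2][col] = pixel;
}
```

Once three rows have been streamed in, the window holds the full 3x3 neighbourhood of the current pixel using on-chip data only, with exactly one external read per iteration.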

The sizes of both data structures correspond to the amount that is minimally re-

quired while still allowing for every stencil computation to proceed efficiently start-

ing from this locally stored data only. A large square window of e.g. 15x15 would

require no more than 14 lines to be cached at any point in time, taking up space

on the order of 20 to 100 KB. For an image width of 640 pixels and a data size of 4

bytes per pixel, the Zynq-7020’s BRAM could theoretically store up to 162 lines (see

Section 3.2.4) together with a window of 163x163. Of course, a kernel actually using

data of such magnitude will likely exhaust the FPGA's resources through its computational weight long before it approaches these internal memory limitations.

In practice, none of the considered routines in KinectFusion (or most other use cases,

to our knowledge) come anywhere near this size. Transposing the image in order to

exploit lower image heights and thereby allow for even larger window sizes is there-

fore not necessary.

Figure 3.17 shows the HLS report, console and resource analysis views after im-

plementing the aforementioned buffering technique. Unfortunately, the design is

still flawed: this time due to internal memory congestion rather than PS-PL inter-

facing constraints. The precise reason is that a single instance of the FPGA's dual-port Block RAM does not have enough ports to output all the

values that are needed at the desired rate. Furthermore, a shifting window requires

many simultaneous writes to the same array per iteration as well. The following

operations should be allowed to happen ideally within one clock cycle:

• N² reads from the memory window to be forwarded to the stencil calculation.

• N² writes to the memory window, N(N − 1) of which consist of left-shifted values, N − 1 of which are copied from the line buffer and the last one of which is read from the external stream.

• N − 1 reads from the line buffer to be forwarded to the window.

• N − 1 writes to the line buffer, N − 2 of which consist of up-shifted values and the last one of which equals the single new value received from the input stream.


(A) HLS report displaying a high initiation interval.

(B) Example of warning messages shown by Vivado HLS to inform the designer about possible bottlenecks.

(C) Performance view displaying concurrent accesses to the input image marked in red, with feedback to the HLS source code.

FIGURE 3.16: Report and analysis of a naive implementation of bilateral_filter; neither line buffering nor array partitioning is applied.


(A) HLS report still displaying a high initiation interval.

(B) Warning messages shown by Vivado HLS, revealing internal memory constraints.

(C) Resource view displaying concurrent accesses to the memory window, from which a new bottleneck arises.

FIGURE 3.17: Report and analysis of an improved implementation of bilateral_filter which includes line buffer and memory window functionality.

Note that the above list is not part of the stencil calculation itself, and exists purely

to manage the content of the buffer correctly. Similar to [25], we separate the com-

putation from the data movement to keep things simple. In case of a 3x3 window,

N² + N − 1 = 11 reads from and N² + N − 1 = 11 writes to the BRAM occur for

every iteration. Interdependencies between both data structures result in a slightly

higher initiation interval of 12 clock cycles.

In order to resolve the bottleneck, the back-end memory storage for the window and line buffer needs to be partitioned into several BRAM instances or even reg-

isters. Vivado HLS provides an effective optimization directive to do exactly that

[1]. Multi-dimensional arrays can be distributed across multiple memory cells such

as individual registers or BRAM banks by applying the HLS ARRAY_PARTITION


FIGURE 3.18: Array partitioning strategy for optimizing Stencil computations. Differently colored elements need to be accessed independently and in parallel, which is possible only by distributing them across different instances of internal storage components. (The memory window is fully partitioned in all dimensions.)

FIGURE 3.19: HLS report of the fully optimized bilateral_filter kernel.

pragma. The dimension(s) across which to partition and a cyclic factor can be spec-

ified as well, although only the former parameter is relevant in our discussion. Fig-

ure 3.18 depicts the minimally required degree of partitioning necessary to allow for

both data structures to perform without forming bottlenecks. Finally, Figure 3.19

shows the report whereby all discussed issues have been resolved. Resource utiliza-

tion has increased significantly, which is explained by the fact that the bilateral filter

performs relatively expensive operations on every pixel in its window. Previously,

this per-pixel computation could be spread over 12 clock cycles (actually forming

another pipeline within the pipelined loop), but now N² − 1 = 8 instances (the

center pixel can be optimized away in this particular case) have to start in parallel

at every clock cycle. The latency of this operation is 7 cycles, so by Little’s Law we

have that 56 such computations are in progress at any given point in time during the

kernel’s execution.
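In code, the partitioning of Figure 3.18 boils down to two ARRAY_PARTITION directives on the buffer declarations. The sketch below is our own (array names and dimensions are illustrative, and outside the Vivado HLS tool the pragmas are simply ignored):

```cpp
#include <cassert>

constexpr int N = 3;    // window dimension (illustrative)
constexpr int W = 640;  // image width (illustrative)

float stencil_sum() {
    // dim=0 partitions across all dimensions: every window element
    // becomes an individual register, so all N*N values can be read in
    // the same clock cycle.
    static float window[N][N];
#pragma HLS ARRAY_PARTITION variable=window complete dim=0
    // dim=1 splits the line buffer into N-1 separate banks, one per
    // buffered row, so that all rows can be accessed concurrently.
    static float line_buffer[N - 1][W];
#pragma HLS ARRAY_PARTITION variable=line_buffer complete dim=1
    (void)line_buffer;  // unused in this stripped-down sketch

    // Placeholder for the stencil computation, which reads the whole
    // window at once.
    float acc = 0.0f;
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            acc += window[r][c];
    return acc;
}
```

The choice of dimension matters: fully partitioning the line buffer as well would waste registers, since only N − 1 concurrent reads per cycle are ever required from it.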

3.2.4 Random memory access

Some kernels have to read from, or write to, a large array that resides outside the

FPGA such as in DRAM. The presence of such irregular accesses often creates perfor-

mance bottlenecks on the FPGA. However, this fact is usually not visible in Vivado

HLS because it assumes that every request gets answered immediately within one


clock cycle. Simply put, I/O performance analysis falls outside the scope of Vivado

HLS as this tool is only concerned with everything that happens inside the FPGA.

The full complexity of system-level interactions between the PS, PL and DRAM can

only be assessed accurately by performing real-world executions, which will be ex-

plored in Chapter 5. Note that the term random is interchanged with unpredictable in

this context: the positions at which specific elements are needed can depend not only on a limited set of parameters given to the subroutine, but also on the content of an-

other data stream. The default scenario is to leave all data on the DRAM, and let the

FPGA periodically make DDR requests whenever it needs one or more elements. In

this case, an AXI-master protocol should be applied so that the PS (representing the

outside memory) acts as a slave with the PL as its master. Due to the intrinsically high

latency of performing random address lookups, this scenario is quite slow by de-

sign. Assuming that the required data does not fully fit in the FPGA's local storage,

two possible solutions were devised:

1. Block-by-block processing. This technique locally buffers smaller sections of

the whole body of data, and restarts the subroutine as many times as needed in order

to adequately process all elements. These sections could either overlap or not;

the former case is preferred to limit the number of re-executions, but the

latter might prove more efficient when a cohesive neighbourhood of multiple

elements from the external array is often needed. The prerequisite for this

technique is a kernel that can easily be re-executed multiple times, and whose

input and output streams are of the same type. The input data that cannot yet

be handled correctly in the first execution would otherwise be lost forever. In

addition, a special flag has to be introduced for each element of the stream to

indicate that it was not yet processed due to the unavailability of the data required for its calculation, and that it should be handled in a later re-execution instead. The number of re-executions needed equals:

⌈N / n⌉

Here, N is the full size of the randomly accessed data in bytes, and n denotes

the size of the internal array containing one block of this data. A code example

of this method is found in integrate.

2. Intelligent buffering. In some cases, it is possible to execute the component

just once by buffering the data in a smart way so that most or all of it is already

locally available. Practically, this idea will result in code similar to how Stencil

patterns are handled, although we prefer to use the term scratchpad memory rather than line buffer here. At every moment in time, this scratchpad should

aim to cover areas of the randomly accessed data where it is most likely to be

needed, which often requires deleting and refilling portions of this internal


store throughout the whole execution of the program. Locality of reference

can readily be exploited using this method. If a requested position happens to

reside outside of the buffered region of interest (i.e. a cache miss occurs), then

a fall-back procedure can consist of either making a DDR request or doing

nothing and returning a (possibly incorrect) default value for that element.

The latter is only possible if the system can tolerate a small amount of invalid

data, while the former requires a statistical analysis to ensure that the cache

miss frequency remains low enough to warrant the additional overhead. A

code example of this approach is found in track.
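A stripped-down software model of the block-by-block scheme may clarify the mechanics (our own illustration with hypothetical names; the actual integrate kernel in Chapter 4 is considerably more involved):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One stream element: the position it needs from the large external
// array, a result slot, and the flag marking whether it was handled.
struct Element {
    std::size_t need;
    float result;
    bool done;
};

// Re-execute the kernel ceil(Ntotal / n) times. In each run only the
// block [lo, hi) of the external data is "locally buffered", and the
// stream elements whose required position falls inside it get processed.
int process_in_blocks(std::vector<Element>& stream,
                      const std::vector<float>& external, std::size_t n) {
    const std::size_t Ntotal = external.size();
    const int runs = static_cast<int>((Ntotal + n - 1) / n);  // ceil(N / n)
    for (int run = 0; run < runs; ++run) {
        const std::size_t lo = static_cast<std::size_t>(run) * n;
        const std::size_t hi = std::min(lo + n, Ntotal);
        // Model of the burst copy into on-chip BRAM.
        std::vector<float> block(external.begin() + lo, external.begin() + hi);
        for (auto& e : stream)
            if (!e.done && e.need >= lo && e.need < hi) {
                e.result = 2.0f * block[e.need - lo];  // placeholder computation
                e.done = true;
            }
    }
    return runs;
}
```

Elements left unflagged after a run are simply picked up by a later re-execution, which is exactly the role of the special flag described above.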

The selection among these solutions depends on various factors, and neither of them can be applied in all situations. It may even be the case that the only feasible

configuration is the default one, which does not copy data and uses DDR requests

instead. Reasons might include the conditions mentioned above not holding, or simply the lack of a significant improvement in the resulting performance.
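For completeness, the cache-miss fallback of the intelligent buffering option can be modelled just as compactly (again an illustration of ours, not thesis code; the capacity of 4 is arbitrary):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t S = 4;  // scratchpad capacity (illustrative)

// Scratchpad covering the region of interest [base, base + S) of a
// large external array. Lookups inside the region are served on-chip;
// misses fall back to a default value, as tolerated by some kernels
// (the alternative fallback would be an on-demand DDR request).
struct Scratchpad {
    std::size_t base;
    float local[S];

    void refill(const std::vector<float>& external, std::size_t new_base) {
        base = new_base;  // delete and refill the buffered region
        for (std::size_t i = 0; i < S; i++)
            local[i] = (base + i < external.size()) ? external[base + i] : 0.0f;
    }
    float lookup(std::size_t pos, float fallback) const {
        if (pos >= base && pos < base + S)
            return local[pos - base];  // hit: data already resides on-chip
        return fallback;               // miss: (possibly incorrect) default
    }
};
```

In the real kernel, the refill policy is what exploits locality of reference: the region of interest tracks where the next lookups are most likely to land.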

In order to substantiate the discussed techniques and give an idea of when they

might be relevant, the maximum size of a single contiguous array residing in the

Zynq-7020’s Block RAM was experimentally determined using the code in Listing

3.5. Block RAM units come in sizes of 36 Kib containing two independently con-

trolled 18 Kib RAMs [61], and HLS reports the BRAM usage in equivalents of 18

Kib. An array of 16-bit integers will cause the BRAM to be configured so that one

instance can effectively hold 1024 integers. The function was synthesized in HLS for

different values of BYTES, resulting in Figure 3.20. A peculiar behaviour is visible:

the tool always seems to round up BRAM utilization to the next power of two, an

observation also made by [3]. This means that an array of 512 KiB or 524 288 bytes

will result in 256 BRAM instances for memory storage, but any more than that will

cause at least 512 instances to be generated, which is far beyond the 280 BRAMs

offered by the Zynq-7020 PL. An important consequence is that the maximum size

of a single array is effectively limited by this rounding behaviour rather than by the

precise amount of BRAMs offered by the FPGA.
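The observed allocation rule can be captured in a few lines (our own reading of the reports, not an official Xilinx formula; it assumes 16-bit elements, of which one 18 Kib instance effectively holds 1024):

```cpp
#include <cassert>
#include <cstddef>

// Estimated number of BRAM instances that HLS allocates for an array of
// 16-bit elements occupying `bytes` bytes. One 18 Kib instance is
// configured to hold 1024 such elements (2048 bytes), and the instance
// count was observed to be rounded up to the next power of two.
std::size_t bram_instances(std::size_t bytes) {
    const std::size_t per_instance = 1024 * 2;  // usable bytes per instance
    std::size_t needed = (bytes + per_instance - 1) / per_instance;
    std::size_t pow2 = 1;
    while (pow2 < needed) pow2 *= 2;  // the rounding seen in Figure 3.20
    return pow2;
}
```

Under this model a 512 KiB array maps to exactly 256 instances, while a single extra byte pushes the estimate to 512, beyond the 280 BRAMs of the Zynq-7020.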

In KinectFusion, the block processing technique is used during integration since

the main input and output arrays are 3D volumes there and thus of the same type.

See Section 4.1.8 for a further discussion. The intelligent buffering method will prove

to be very useful in the tracking phase. Because physical camera motion between

subsequent frames can be assumed to be small [5], a high degree of spatial locality

across multiple 2D arrays is present in the ICP part of the algorithm. This is dis-

cussed in more detail in Section 4.1.6.


#define BYTES 524288
#define COUNT (BYTES / 2)

int32 max_memory(const int16 source[COUNT]) {
#pragma HLS INTERFACE m_axi depth=64 port=source
    int16 local[COUNT];
#pragma HLS RESOURCE variable=local core=RAM_1P_BRAM

    memcpy(local, source, BYTES);
    // Do something with the data
    int32 res = 0;
for_i:
    for (int i = 0; i < COUNT; i++) {
#pragma HLS PIPELINE II=1
        res += local[i];
    }
    return res;
}

LISTING 3.5: Vivado HLS code to test the maximum size of a 16-bit integer array. Data is copied in burst mode from external memory, similar to how block-by-block processing is implemented in practice. Although the compiler places the local array into block RAM by default, the HLS RESOURCE directive [1] is still included for clarity.

FIGURE 3.20: Resulting BRAM instances in the HLS report for different memory sizes in Listing 3.5.


3.2.5 Data type selection

Vivado HLS supports the usage of arbitrary bit sizes for integers and fixed-point

numbers [1]. The philosophy behind this freedom is that every data type should

only be assigned as many bits as actually needed. This way, the designer can pre-

vent the unnecessary allocation of FPGA resources for the calculation of bits that are

not useful enough to warrant the cost, or might even end up not being used at all.

On the other hand, floating point representations are only available in half- (16 bit),

single- (32 bit) and double-word (64 bit) sizes.

Several aspects contribute to the choice of data types for any given calculation.

A trade-off in several dimensions between resource cost and accuracy often has to

be made. When real numbers are required, the decision between floating point and

fixed-point representations turns out not to be so evident. In conjunction with this

aspect, the precise bit widths at the algorithm’s disposal naturally have to be consid-

ered as well. The following paragraphs serve to motivate a substantiated decision

between the two alternatives in the context of 3D SLAM. First, two general use cases

will be considered:

1. An IP-core that performs four basic arithmetic operations on every data ele-

ment in a pipelined manner;

2. An IP-core that calculates the square root of every data element in a pipelined

manner.

The goal is then to test the effect of different data types on the FPGA’s resource

usage and latency. Note that I/O-bandwidth considerations might also play a role; this aspect will be the focus of Chapter 5.

Comparison between floating point and fixed-point

The notion that floating point operations are expensive to implement on reconfigurable hardware is backed by [62] as well as practical measure-

ments. The other option is to use fixed-point numbers, where the binary point is

placed at a fixed position within the bit representation. These provide the opportu-

nity to perform arithmetic operations in a way that is much less costly for the FPGA’s

resources, as will be illustrated below. On the other hand, the biggest drawback of

employing fixed-point representations is the lack of a high dynamic range. Floating

point numbers retain sufficient accuracy for both extremely small and large values,

while fixed-point numbers are only practical within selected orders of magnitude

due to the presence of a fixed step size and a relatively low maximum value. This

limitation notably led [3] to the decision of keeping the computation in the floating

point domain.


typedef ap_fixed<32, 16> data_t;

for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    data_t value_in = stream_in.read();
    data_t tmp = value_in + data_t(2.0);
    tmp = tmp * data_t(3.0);
    tmp = tmp - data_t(5.0);
    tmp = tmp / data_t(7.0);
    stream_out.write(tmp);
}

LISTING 3.6: Vivado HLS code for a simple pipelined fixed-point arithmetic calculation, belonging to the Map pattern.

Any application should ideally execute its whole pipeline either in the float-

ing point domain or in the fixed-point domain so as to avoid the cost of conversion.

However, given that we are currently in the process of accelerating only subsets of

the algorithm, conversions from floating point numbers to their equivalent fixed-

point representation will have to be taken into account in the following measure-

ments. This is because the non-accelerated parts of KinectFusion will run on a CPU,

where floating point representations are naturally the best choice. After all, the ARM

Cortex-A9 does not provide native support for fixed-point operations; neither does

the C++ language. Consider the HLS kernel in Listing 3.6, which consists of a series

of basic operations. Here, one addition, one subtraction, one multiplication and one

division are performed, each with a constant second operand. The loop is pipelined

to simulate a scenario whereby data is streamed and processed efficiently with an

initiation interval of one clock cycle.

Three cases are tested: first, the kernel is synthesized with all operations in the

32-bit floating point domain. Second, it is resynthesized in the 32-bit fixed-point do-

main by changing all data types; this is the case shown in the code snippet above.

Finally, the stream data types are changed to floating point again while keeping the

calculation within the fixed-point domain. As such, the overhead due to the added

float-to-fixed and fixed-to-float conversion steps is investigated. Table 3.2 summa-

rizes the resulting HLS reports after synthesis on a Zynq xc7z020clg484-1. A second

experiment was done with the arithmetic operations replaced by a single square

root. As is commonly done, the Xilinx HLS library implementations of sqrt and

sqrtf are used to ensure a hardware-optimized design. Table 3.3 summarizes the

resulting HLS reports, and a representative code snippet is shown in Listing 3.7 for

completeness.

Several interesting observations can be made. First, it is evident that most (but

not all) timing and resource metrics improve significantly if the same operations


Calculation architecture                 Iteration latency   DSP   FF     LUT
Floating point (32-bit)                  33 cycles           7     1582   2259
Fixed-point (32-bit)                     7 cycles            9     605    621
Fixed-point (32-bit) with conversions    18 cycles           9     1659   2540

TABLE 3.2: Timing and resource usage for various implementations of a simple series of arithmetic calculations.

typedef ap_fixed<32, 16> data_t;

for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    data_t value_in = stream_in.read();
    data_t value_out = hls::sqrt(value_in);
    stream_out.write(value_out);
}

LISTING 3.7: Vivado HLS code for a fixed-point square root calculation, belonging to the Map pattern.

are performed using fixed-point representations over floating point. This can be

explained by the fact that arithmetic operations are complex to implement in the

floating point domain, but in the fixed-point domain they are essentially just glori-

fied integer operations. A drop of a factor 3 to 4 in resource usage is therefore not

uncommon.

Second, this fact cannot simply be generalized to all other types of operations

such as the square root. One reason is that the Zynq-7000 FPGA has native support

for the floating point square root calculation in the form of a hardware core called

FSqrt [1]. The analysis view in HLS confirms that a simple opcode is used in this

case, while for fixed-point data types an external function is called with an imple-

mentation provided by Xilinx. Hence, care should be taken in selecting the optimal

data type when a square root operation is present. Alternatively, optimized algo-

rithms such as the LogiCORE IP core described in [63] could be used. This product

is however not further explored in this thesis, due to the lack of an evaluation license

for academic purposes.

Third, in terms of resources, the considered series of four arithmetic operations

happens to quantitatively coincide with a break-even point. The decision of whether

Calculation architecture                 Iteration latency   DSP   FF     LUT
Floating point (32-bit)                  15 cycles           0     559    779
Fixed-point (32-bit)                     22 cycles           0     2166   5810
Fixed-point (32-bit) with conversions    32 cycles           0     3058   7404

TABLE 3.3: Timing and resource usage for various implementations of a square root calculation.


to stay within the floating point domain, or whether to perform the operations using

fixed-point representations in conjunction with a two-fold conversion, therefore de-

pends on the complexity of the considered algorithm relative to this use case. In the

context of this dissertation, all kernels in KinectFusion except for mm2m_sample have

a much higher computational complexity so that the latter option is expected to be

preferred in the majority of cases. In general, we have found that fixed-point oper-

ations are more efficient with respect to resources and, as a result, area and power

usage as well. This finding is also supported by [64] and [65] among others. To our

knowledge, the 32-bit square root operation seems to be the only exception to this rule on this specific hardware platform.⁴ It is also noted that the better accuracy provided

by floating point representations is often not needed for many applications, as long

as the dynamic range of the used fixed-point data type is sufficient.
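The trade-off can be made tangible with a minimal software model of a 32-bit fixed-point value with 16 fractional bits, mirroring the ap_fixed<32, 16> type used above (plain C++ of our own, emulating only the quantization, not the Xilinx arbitrary-precision library):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Signed 32-bit fixed-point value with 16 integer bits (including the
// sign) and 16 fractional bits: step size 2^-16, range roughly +-2^15.
struct Fixed16_16 {
    std::int32_t raw;

    static Fixed16_16 from_double(double v) {
        // Quantize to the nearest multiple of 2^-16.
        return { static_cast<std::int32_t>(std::lround(v * 65536.0)) };
    }
    double to_double() const { return raw / 65536.0; }
};
```

Every arithmetic operation then becomes plain integer arithmetic on raw, which is exactly why the FPGA cost drops; the price is that any value outside roughly ±32768, or any difference below 2⁻¹⁶, is silently lost, unlike with a 32-bit float.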

Bit sizes and application context

Next to resource usage, another drawback of floating point operations is that their

availability is restricted to 16-bit, 32-bit and 64-bit formats only. In contrast, hard-

ware designers are free to allocate any amount of bits to the integer and fractional

parts of fixed-point arithmetic. While the comparisons in Tables 3.2 and 3.3 were

made with 32-bit fixed-point calculations so as to match the bit sizes of the correspond-

ing floating point numbers, the actual optimum strongly depends on the use case.

Depth value representation. The latest Microsoft Kinect sensor has a best-case

depth measurement accuracy error of less than 2 millimeters [66]. As shown in Fig-

ure 3.21, the accuracy worsens as we move either further away from the camera or

deviate from the center of the image. While the Kinect v1’s depth map acquisition

works by a fundamentally different principle⁵, similar conclusions have been drawn

regarding its precision in [44], [52]. In addition to the measurements experiencing

noise with a standard deviation on the order of a few millimeters, the offset (bias)

strongly increases as a function of the distance as well, which is plotted in Figure

3.22. The most important KinectFusion kernel in terms of resource utilization that

operates on these depth values is the bilateral filter. This operation is quite expen-

sive, leading us to the choice of employing a fixed-point representation with a frac-

tional part of 10 bits. As a result, the smallest increment equals 1000/2¹⁰ ≈ 1 mm,

although no better accuracy is expected to be necessary according to the argumen-

tation above. Since both Kinect cameras cannot see beyond 8 to 10 meters [52], the

⁴ Even though the exponential and logarithmic functions also have designated hardware cores, neither resource profile clearly dominates the other when comparing floating point versus fixed-point implementations.

⁵ The Kinect v1 projects an infrared light pattern and estimates the 3D geometry of a scene by analyzing the projected image. On the other hand, the Kinect v2 is a Time-of-Flight (ToF) sensor which acquires the depth map by emitting modulated square waves and measuring the per-pixel phase difference of the reflection [52].


FIGURE 3.21: Kinect v2 accuracy error distribution [66].

FIGURE 3.22: Kinect v1 offset and precision [44].

integer part has to be at least 4 bits. Our final choice in this work for the representa-

tion of a depth value in meters is a 16-bit unsigned fixed-point number, assigning 10

bits to the fractional part and 6 bits to the integer part.
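As a sanity check on this choice, the resolution and range of the unsigned 6.10 format follow directly from the bit allocation (a small illustration of ours, not code from the implementation):

```cpp
#include <cassert>

// Unsigned fixed-point depth in meters: 6 integer bits, 10 fractional bits.
constexpr double step_m = 1.0 / 1024.0;      // resolution: 2^-10 m
constexpr double step_mm = 1000.0 * step_m;  // about 0.977 mm, i.e. roughly 1 mm
constexpr double max_m = 64.0 - step_m;      // largest representable depth
```

The roughly 1 mm step matches the sensor noise floor discussed above, and the 64 m ceiling comfortably exceeds the Kinect's 8 to 10 m range.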

Other geometric value representation. 3D vertices, normal vectors, matrices

and Euclidean (squared) distances constitute the many other quantities that need to

be processed on the FPGA as well. In order to ensure sufficient accuracy for kernels

where fine-grained differences in distances (such as error accumulations) matter a

lot, we decided to assign a more generous 32-bit fixed-point representation to all of

these remaining values with the integer and fractional parts each taking up 16 bits.

This way, only a limited amount of bug-prone rescaling is needed when operating on

small and/or large numbers such as in the depth2vertex, vertex2normal, track, reduce and raycast kernels.

3.3 Summary

In this chapter, a detailed algorithmic description of a known 3D dense SLAM appli-

cation was given. Afterwards, techniques were introduced to effectively accelerate


image processing methods with commonly occurring computation or data manage-

ment patterns using high-level synthesis. This methodology involves both algorith-

mic and structural changes to the code as well as the straightforward insertion of

pragmas. Some of these optimization directives thoroughly transform the design

while abstracting away many of the underlying hardware details, especially with

respect to managing cycle-by-cycle execution steps. Vivado HLS' reports and analysis views were shown to be very useful tools for gaining an understanding of

what happens behind the scenes. The concepts of streaming interfaces, loop pipelin-

ing, data buffering, array partitioning and data type decisions were studied compre-

hensively. The presence of irregular memory accesses is a more complicated matter

however, as the strategy by which to tackle these kernels is much more context-

dependent. The next chapter will evaluate the application of all studied techniques

to KinectFusion step-by-step.


Chapter 4

Implementation of KinectFusion in HLS

Having reviewed a generally applicable methodology for the acceleration of im-

age processing kernels with various parallel patterns, we now move to the concrete

world of KinectFusion. In the following sections, HLS implementations of all ker-

nels are covered separately and we discuss the effect of the applied techniques as

well as some extra application-specific optimizations. Table 4.1 summarizes what

is achieved in Vivado HLS for each KinectFusion kernel. The parallel pattern was

inferred by inspecting the source code. The time spent is an estimate calculated by

multiplying the timing of a single run by the average number of re-executions per

frame. Next, the accuracy was computed by comparing every HLS kernel’s output

with a ground truth obtained via the original KinectFusion implementation, as de-

scribed in Section 3.1.2. Unless noted otherwise, every method converts its input

from floating point to fixed-point and performs the reverse operation on the output.

This is done to ensure that all data communicated outside of the FPGA remains read-

able by the CPU, while also saving on resources for computations inside the FPGA;

see Section 3.2.5 for a grounded argumentation on this. Note that the raycast kernel

was successfully implemented in C++ but its HLS optimization was not explored in

this thesis, due to a mixture of scope constraints and the Search category’s inherent

unsuitability. The higher degree of control as well as the inherently unpredictable

number of iterations, irregular data positions and very low data reuse all make it a

priori problematic to accelerate this component adequately on an FPGA.1

1Even if we just consider data movement and ignore all other problems: the block-by-block processing method is not applicable because it fundamentally conflicts with how the Search pattern operates. Intelligent buffering faces the issue of deciding which parts to copy to the scratchpad and how to structure the memory layout efficiently. As even a partial volume is too large for most FPGAs, finer granularity is needed, which in turn requires advanced geometric calculations to determine the boundaries of these smaller blocks. In short, going down this path would quickly cause the overhead to increase even beyond the regular case of not employing any buffering technique at all.


Kernel            Category       Input     Output     HLS timing   Error
mm2m_sample       Map            2D        2D         0.38 ms      0 %
bilateral_filter  Stencil        2D        2D         0.77 ms      0.03 %
half_sample       Stencil        2D        2D         0.25 ms      0.01 %
depth2vertex      Map            2D        2D         1.0 ms       0.09 %
vertex2normal     Stencil        2D        2D         1.0 ms       0.02 %
track             Gather & Map   2D        2D         23 ms        1.13 %
reduce            Reduce         2D        32 values  3.3 ms       0.47 %
integrate         Gather & Map   2D + 3D   3D         0-168 ms     0.32 %

TABLE 4.1: Category, I/O dimensions, estimated timing and average accuracy of every KinectFusion kernel when it would be executed on the FPGA. Bandwidth limitations and other external factors are not yet taken into account, since these fall outside the scope of Vivado HLS.

Kernel            BRAM [%]   DSP [%]   FF [%]   LUT [%]
mm2m_sample       0          0         5        13
bilateral_filter  0          21        10       19
half_sample       1          2         4        17
depth2vertex      0          4         3        11
vertex2normal     6          16        18       35
track             89         57        43       42
reduce            1          38        4        11
integrate         100        74        48       93

TABLE 4.2: Resource utilization estimated by HLS for every KinectFusion kernel's top function.


Step               Estimated timing   Max. resource usage   Mean error
Original code      24.6 ms            6 % LUT               0 %
Apply streaming    25.3 ms            4 % LUT               0 %
Apply pipelining   3.07 ms            4 % LUT               0 %
Unroll loop        0.38 ms            13 % LUT              0 %

FIGURE 4.1: Effect of every optimization on the timing, resource and accuracy profile of mm2m_sample (Map).

4.1 Detailed results

4.1.1 mm2m_sample (Map)

The mm2m_sample kernel passes through a selected subset of all input pixels, while

dividing them by 1000. It is somewhat of an exception to the earlier categorization of parallel patterns, but for the purposes of our research, this method most closely belongs to the Map class.

and pipelining the computation. It is also the only individually accelerated IP core

where the author has chosen not to perform any conversion between floating point

and fixed-point representations. Because the calculation itself is so light-weight com-

pared to the overhead of an extra conversion step, the total resource usage and latency turn out to be lower when this conversion is left out. Figure 4.1 captures the

effect of every optimization separately on the HLS report, when the target device

xc7z020clg484-1 is selected with a clock period of 10 ns. The selected parameter

configuration is described in Section 3.1.2, so the raw sensor input gets downsam-

pled from 640× 480 to 320× 240. Every change is in accordance with the general

methodology that has been introduced over the previous sections, and is explained

in more detail below:

1. Original code. The function was copied straight from the reference imple-

mentation without any changes except for the introduction of an interfacing

protocol. AXI-master interface directives were added so that the IP core is able

to communicate with the Zynq’s DRAM via a master-slave link. The iteration

latency is 32 cycles, and only a quarter of the input pixels are read from the ar-

ray. The resulting execution time is therefore estimated to be 32× 320× 240×


10 ns = 24.6 ms, although this does not take into account the unpredictability

of DDR requests to outside memory.

2. Apply streaming. The interface was changed to an AXI-streaming protocol,

which enables the usage of a Zynq High-Performance port in conjunction with

the AXI Direct Memory Access (DMA) IP core. Furthermore, Xilinx’ HLS

stream data type in C/C++ enforces a single read, single write pattern over

the whole input and output maps. On the other hand, a drawback arises since

the kernel is forced to read every input pixel, regardless of whether that ele-

ment will be processed or not. The iteration latency varies from 3 to 24 clock

cycles while iterating over all elements of the input array: 3 if it does not pass

through (which is 75 % of the time on average), and 24 if it does pass through

and thus gets rescaled by 1/1000. Following the report, the resulting execution

time is equal to (0.75× 3 + 0.25× 24)× 640× 480× 10 ns = 25.3 ms.

3. Apply pipelining. The loop body was pipelined with an initiation interval of

one clock cycle. The iteration latency stays unchanged, but since a new divi-

sion now starts (and ends) every four cycles, a speed-up of a factor 8.25 has

effectively been obtained. The increase in resource usage is negligible, which

means that no extra hardware blocks had to be allocated in this transformation.

This can be explained by the fact that the existing resources are now used much

more efficiently, whereas previously the majority of them were only active ev-

ery once in a while. The resulting execution time is 640× 480× 10 ns = 3.07 ms

according to HLS.

4. Unroll loop. As there is a huge degree of spare room for extra computation,

the loop body was manually unrolled by a factor of 8.²,³ One might expect that transforming the design to duplicate its kernel calculation 8-fold would increase the resource utilization by a factor of 8 as well. However, due to the

nature of downsampling and the horizontal loop unrolling, the routine only

has to perform 4 divisions (rather than 8) per iteration, and only within every second row

of pixels. Essentially, 4 divisions happen at 50 % of the iterations on average,

whereas previously just 1 division occurred at 25 % of the iterations. This is

illustrated in Figure 4.2. The HLS report confirms that the number of FFs and

LUTs allocated for the division have increased approximately 4-fold; in addi-

tion, they are now used more efficiently. Although there is still much room

left in terms of resources, we decided not to explore further duplication of

processing elements due to bandwidth constraints. Both streaming data types

2The decision for this value was made as a trade-off between (expected) bandwidth limitations and available hardware resources.

3Duplication of processing elements has to be done manually, since HLS does not allow multiple reads from or writes to a stream construct within the same clock cycle. Therefore, the struct data type contained by the stream has to be updated as well.


FIGURE 4.2: I/O diagram of the mm2m_sample HLS kernel (A) before and (B) after duplicating its processing elements 8-fold, assuming no bandwidth bottlenecks.

have a width of 128 bit, holding either 8 unsigned short integers or 4 float-

ing point values. The AXI_HP port supports a maximum throughput of 1,200

MB/s, although the discussed configuration already assumes 1,600 MB/s at its

input and output streams at a clock period of 10 ns. No further real-world per-

formance improvement can therefore be expected by increasing the unrolling

factor. The theoretical timing of this kernel is 640/8× 480× 10 ns = 0.38 ms

although we expect something closer to 0.51 ms; a more accurate performance

analysis follows in Chapter 5.

Note that the error stays extremely small since the calculation stays in the floating

point domain at all times. Due to its relative simplicity, this method was the only

one where we found that adding conversions to fixed-point arithmetic increased

resource usage instead of decreasing it.
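The pass-through-and-rescale behaviour described above can be sketched as follows. This is a plain C++ behavioural model with illustrative names; the actual design uses 128-bit streaming interfaces and unrolled processing elements:

```cpp
#include <cstdint>
#include <vector>

// Behavioural sketch of mm2m_sample: stream once over the full W x H depth
// map (millimetres, unsigned short), keep every second pixel in each
// dimension and convert it to metres. Every element is read exactly once,
// mirroring the single-read streaming pattern; 75 % of pixels pass through
// unused, as in the text.
std::vector<float> mm2m_sample(const std::vector<uint16_t>& in, int w, int h) {
    std::vector<float> out;
    out.reserve((w / 2) * (h / 2));
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            uint16_t px = in[y * w + x];        // mandatory read of every element
            if ((x % 2 == 0) && (y % 2 == 0))   // subsample by 2 in each dimension
                out.push_back(px / 1000.0f);    // mm -> m, kept in floating point
        }
    return out;
}
```

Note that, in line with the discussion above, the computation stays in the floating point domain from input to output.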


Step                    Estimated timing   Max. resource usage   Mean error
Original code           1109 ms            11 % LUT              0.004 %
Apply pipelining        23.8 ms            31 % LUT              0.004 %
Buffering & streaming   0.78 ms            262 % LUT             0.004 %
Reduce window size      0.77 ms            89 % LUT              0.030 %
Use fixed-point         0.77 ms            29 % DSP              0.035 %
Weaken exp. func.       0.77 ms            21 % DSP              0.035 %

FIGURE 4.3: Effect of every optimization on the timing, resource and accuracy profiles of bilateral_filter (Stencil).

4.1.2 bilateral_filter (Stencil)

The bilateral_filter in KinectFusion is a non-linear, non-separable filter with a 5x5

window size whose operation is described in Section 3.1.1, and can be expressed

mathematically as follows [67]:

BF[I](i, j) = \frac{\sum_{k,l} I(k, l)\, w(i, j, k, l)}{\sum_{k,l} w(i, j, k, l)} \qquad (4.1)

w(i, j, k, l) = G_{\sigma_s}\!\left(\sqrt{(i - k)^2 + (j - l)^2}\,\right) G_{\sigma_r}\!\left(\left|I(i, j) - I(k, l)\right|\right) \qquad (4.2)

G_{\sigma}(x) = \exp\!\left(-\frac{x^2}{2\sigma^2}\right) \qquad (4.3)

Here, I(i, j) is the intensity of the source image at position (i, j) and BF[I] represents

the destination image. The ranges for k and l depend on the window size. While

the Xilinx OpenCV library already contains built-in bilateral filtering functionality

[68], we decided to provide our own implementation since it covers many interest-

ing aspects of the optimization methodology. The reference code already includes

optimizations not relevant to FPGA acceleration such as storing the Gaussian co-

efficients Gσs (see Equation 4.3) beforehand. The changes applied in Figure 4.3 are

explained below:


1. Original code. When using the reference implementation and an AXI-master

interface, the resulting FPGA design performs extremely slowly. This is be-

cause every input pixel is redundantly accessed many times, on top of the fact

that all iterations are executed serially.

2. Apply pipelining. A pipelining directive was inserted with a target initiation

interval of 31 clock cycles, which is the lowest amount that HLS could success-

fully synthesize for. Inner loops including the filtering and shifting operations

on the window are automatically unrolled this way (although we also insert

pragmas there for clarity). A bottleneck arises due to concurrent accesses to

the externally located input depth array.

3. Apply buffering & streaming. In accordance with the methodology for han-

dling Stencil-type kernels in Section 3.2.1, line buffer and window data struc-

tures were inserted. Every input element is now just accessed exactly once

instead of up to 25 times, allowing for a streaming implementation and a

pipelined initiation interval of one clock cycle. This modification is the most

drastic one from a technical standpoint. The new memory window as well

as the helper array containing Gaussian coefficients Gσs were partitioned into

registers to prevent them from forming a bottleneck. The line buffer was par-

titioned into four BRAM banks. To produce Gσr , the exponential function in

Equation 4.3 now has to be evaluated 24 times per clock cycle and has a la-

tency of 7 cycles using HLS’ default implementation, so that by Little’s Law

168 such operations are in progress at any moment in time. Since this has

caused the resource usage to skyrocket beyond the Zynq-7020’s capacity, the

strength of the algorithm will now have to be reduced in order to maintain the

desired speed.

4. Reduce window size. The memory window was reduced from 5x5 to 3x3, so

that the filtering operation effectively uses 9 values only per iteration instead

of 25. The resource usage has approximately dropped by a factor of 3, since

the amount of exponential functions called per iteration has decreased from 24

to 8.

5. Use fixed-point. In accordance with Section 3.2.5, all computations were moved

from floating point arithmetic into the fixed-point domain. A significant drop

in area utilization is observed again, while the decrease in accuracy is negligi-

ble.

6. Weaken exponential function. The calculation of Gσr is still the source of

high FF and LUT utilizations, and the full-fledged exponential function was

replaced by a piecewise linear approximation shown in Figure 4.4. The three

parameters (i.e. the coordinates of the first breaking point and the abscissa


FIGURE 4.4: Exponential function approximation for the bilateral filter, with the actual frequency (popularity) of all arguments translated to the thickness of the green layer.

of the second) of this approximation function were optimized by minimizing

the deviation between the mathematical and approximated outputs in a least

squares sense. To this end, real sensor data was used to measure the distribu-

tion of input arguments and to perform a weighted, more reliable optimization

accordingly. The total resource usage has again dropped by 30 %, with surpris-

ingly little effect on accuracy. The inv_exp block has a latency of just two cycles,

and uses 150 LUTs as opposed to 480 in the default HLS implementation.
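A piecewise linear stand-in for exp(-x) in the spirit of the inv_exp block can be sketched as below. The breakpoints (1.0, 0.37) and (4.0, 0.0) are illustrative placeholders only, NOT the least-squares-fitted values obtained in the thesis:

```cpp
// Hypothetical piecewise linear approximation of exp(-x) for x >= 0:
// two line segments and a clamped tail, cheap enough to map onto a few LUTs.
float inv_exp_approx(float x) {
    if (x <= 1.0f) return 1.0f - 0.63f * x;           // (0, 1) -> (1, 0.37)
    if (x <= 4.0f) return 0.37f * (4.0f - x) / 3.0f;  // (1, 0.37) -> (4, 0)
    return 0.0f;                                      // tail clamped to zero
}
```

Weighting the fit by the measured argument distribution, as done in the text, concentrates the accuracy where Gσr is actually evaluated most often.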

Accuracy versus strength trade-off

The last three steps of Figure 4.3 can also be approached from a different standpoint.

Figure 4.5 explores the possible outcomes when we switch between two states of

every relevant parameter. The first term of each point label indicates the window

size, the second term indicates the data type domain, while the last term denotes

which implementation of the exponential function is in use. ’Fixed’ always includes

conversion from and to floating point numbers, and ’HLS’ means the built-in im-

plementation of the exponential function in contrast to the custom approximation.

Note that even though the implementation with all optimizations enabled is Pareto-

optimal, we observed that its error is relatively large. Copying the input straight to

the output leads to an average error of 0.06 %, so that profiles with a 3x3 window

might not be of sufficient quality. This is an important trade-off, and the ability to

combine multiple components together on the FPGA will be discussed extensively

in Chapter 5. We decided to proceed with the ’5x5 fixed pwlin (piecewise linear


FIGURE 4.5: Pareto diagram of the bilateral filter's HLS average resource usage (not including BRAM) and measured accuracy when all eight possible configurations of three separate optimizations are tested. One outlier with a large error is not shown.

exponential function)’ profile despite its high DSP usage of 65 %. Lastly, little im-

provement in accuracy was observed when increasing the amount of bits allocated

to the depth value’s data type, which is therefore kept at 16-bit fixed-point. Process-

ing element duplication (i.e. loop unrolling) was not explored precisely due to the

already high area utilization of this kernel.
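The line buffer and window mechanism underlying the "Buffering & streaming" step can be sketched in plain C++ as follows. The stencil computed here is a plain 3x3 average, chosen purely to keep the example self-contained; it is not the bilateral filter itself:

```cpp
#include <vector>

// Streaming 3x3 stencil sketch: two line buffers hold the previous rows,
// a 3x3 register window shifts one column left each step, and every input
// element is read exactly once, enabling an initiation interval of one.
std::vector<float> stencil3x3(const std::vector<float>& in, int w, int h) {
    std::vector<float> lb0(w, 0.0f), lb1(w, 0.0f);  // two rows of history
    float win[3][3] = {};                            // window registers
    std::vector<float> out(in.size(), 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float px = in[y * w + x];                // single read per element
            for (int r = 0; r < 3; ++r) {            // shift window one column left
                win[r][0] = win[r][1];
                win[r][1] = win[r][2];
            }
            win[0][2] = lb0[x];                      // new column: two buffered rows
            win[1][2] = lb1[x];                      // plus the freshly read pixel
            win[2][2] = px;
            lb0[x] = lb1[x];                         // push the pixel through the
            lb1[x] = px;                             // line buffers
            if (y >= 2 && x >= 2) {                  // full window available
                float sum = 0.0f;
                for (int r = 0; r < 3; ++r)
                    for (int c = 0; c < 3; ++c) sum += win[r][c];
                out[(y - 1) * w + (x - 1)] = sum / 9.0f;  // centre of the window
            }
        }
    return out;
}
```

In the actual HLS designs the window and coefficient arrays are additionally partitioned into registers, as described above, so that all accesses can happen in the same cycle.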

4.1.3 half_sample (Stencil)

The half_sample procedure subsamples its input by a factor of two in every dimen-

sion, while also preserving edges by averaging only values that are close to each

other. Its computation pattern best fits the Stencil category, although every output

pixel depends on four input pixels that are uniquely mapped to that output pixel

only. This kernel is called twice per frame (once for 320× 240 to 160× 120 and again

for 160× 120 to 80× 60), and the estimated timings in Figure 4.6 denote the sum of

those two executions. The changes are explained in more detail as follows:

1. Original code. Synthesizing the reference implementation with AXI-master

and AXI4-Lite slave interfaces results in an initiation interval of 50 clock cycles.

2. Apply pipelining. HLS reports a pipelined initiation interval of 4 cycles; this is misleading, however, since the actual performance depends on whether

the DRAM will be able to keep up with the continuous DDR requests. Even

though every input pixel is accessed exactly once, the pattern is not linear as a

function of the address: instead, it quickly jumps back and forth between even

and odd rows of the source image.


Step                    Estimated timing   Max. resource usage   Mean error
Original code           12.0 ms            10 % LUT              0 %
Apply pipelining        0.96 ms            13 % LUT              0 %
Buffering & streaming   0.97 ms            15 % LUT              0 %
Use fixed-point         0.97 ms            11 % LUT              0.011 %
Manual division         0.97 ms            8 % LUT               0.011 %
Unroll loop             0.25 ms            17 % LUT              0.011 %

FIGURE 4.6: Effect of every optimization on the timing, resource and accuracy profiles of half_sample (Stencil).

3. Apply buffering & streaming. Line buffer and memory window functionality

was implemented in order to reliably achieve an initiation interval of one clock

cycle. Due to the nature of streaming interfaces, the estimated timing is now

much more reliable.

4. Use fixed-point. This step eliminated all DSP usage, which has the benefit of leaving

more space for other kernels, particularly the bilateral filter.

5. Manual division. A striking opportunity for optimization that seems not to

have been noticed by the compiler was found in Figure 4.7, part of the filtering

subroutine. This operation stems from the fact that depth pixels are only averaged whenever their absolute value is within 30 cm of the window's upper-left value. In order to convert the sum of a varying number of values into an average,

a division by 1, 2, 3 or 4 is necessary. HLS implemented this as a full-fledged

unsigned division with a latency of 31 clock cycles, but it is hard to believe that

this degree of complexity is needed. After all, every case except one could con-

sist of a simple bitshift, and even a division by 3 ought to be much faster than

31 cycles. Indeed, by writing a switch-case construct and performing the division manually, the area utilization of the window calculation itself was more than halved, and the total iteration latency was reduced from 35 to just 5 cycles.

6. Unroll loop. Similar to mm2m_sample, this kernel was manually unrolled by a factor of 4. The resource utilization has only doubled instead of quadrupled,


FIGURE 4.7: HLS performance analysis view of an unnecessarily complex division that went unnoticed by the HLS compiler.

the reason for which is visualized in Figure 4.8. Again, the true timing is ex-

pected to surpass 0.25 ms as the Zynq’s HP port is not able to transfer at rates

beyond 1,200 MB/s.
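The manual division of step 5 can be sketched as follows. The divide-by-3 case uses the standard reciprocal-multiplication trick with the constant (2^33 + 1)/3 = 2863311531, which yields an exact result for every 32-bit sum:

```cpp
#include <cstdint>

// Hand-written average for half_sample's filtering subroutine: the sum of
// 1-4 accepted depth values is divided by the count using shifts and one
// multiply-shift, instead of a generic 31-cycle divider.
uint32_t div_small(uint32_t sum, uint32_t count) {
    switch (count) {
        case 1:  return sum;
        case 2:  return sum >> 1;
        case 3:  // floor(sum / 3) via multiply by ceil(2^33 / 3), shift by 33
            return static_cast<uint32_t>(
                (static_cast<uint64_t>(sum) * 2863311531ull) >> 33);
        default: return sum >> 2;  // count == 4
    }
}
```

In hardware, the divide-by-3 branch becomes a constant multiplication, which is exactly the kind of strength reduction the text argues the compiler should have found on its own.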

4.1.4 depth2vertex (Map)

The depth2vertex method converts every depth pixel into a 3D point by multiplying

the inverse camera matrix with (x, y, D(x, y)). Here, the camera matrix describes

the projection from the real-world space into the image plane, and D represents the

incoming depth map. The applied optimizations are described in Figure 4.9. Since

this routine is executed three times per frame (once for every pyramid level), the

total timing is the sum of these executions. Loop unrolling was not applied since

the output stream is 128 bit wide in the final variant: the required bandwidth for

optimum performance would already be 1,600 MB/s. Note that the actual width of

the point struct is 96 bit both for floating point as well as fixed-point representations.

However, the AXI DMA and Interconnect IP cores do not support memory-mapped

or stream data sizes that are not a power of two [23], leading us to insert padding for

those structs.4 Relative to the previous kernels, no other novelties arise in the HLS

optimization of this kernel.
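For a pinhole model with intrinsics fx, fy, cx, cy (illustrative parameter names; the thesis streams whole maps through the kernel instead of calling a per-pixel function), one back-projection amounts to:

```cpp
#include <array>

// depth2vertex sketch: multiply the inverse camera matrix K^-1 with
// (x, y, 1)^T and scale by the depth D(x, y) to obtain the 3D point.
std::array<float, 3> depth2vertex(int x, int y, float depth,
                                  float fx, float fy, float cx, float cy) {
    return { (x - cx) / fx * depth,   // X = (x - cx) / fx * Z
             (y - cy) / fy * depth,   // Y = (y - cy) / fy * Z
             depth };                 // Z = D(x, y)
}
```

With the principal point at the image centre, the centre pixel maps straight onto the optical axis, which is a quick sanity check for the intrinsics.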

4.1.5 vertex2normal (Stencil)

The vertex2normal kernel uses four vertices surrounding the regular loop indices to com-

pute a normal vector at every position. The last part of every iteration is a vector

normalization, which in turn requires the calculation of a square root. Figure 4.10

summarizes the changes that were applied: clearly, moving the computation into

the fixed-point domain is not beneficial both in terms of resources and accuracy.

4Technically, data width conversions can be enabled within the AXI IP cores as well [23]. Whether to control padding in HLS or in the Vivado block design is simply a choice to be made by the designer.


FIGURE 4.8: I/O diagram of the half_sample HLS kernel (A) before and (B) after duplicating its processing elements 4-fold, assuming no bandwidth bottlenecks.

Step               Estimated timing   Max. resource usage   Mean error
Original code      32.3 ms            9 % LUT               0 %
Apply streaming    25.2 ms            8 % LUT               0 %
Apply pipelining   1.01 ms            16 % DSP              0 %
Use fixed-point    1.01 ms            11 % LUT              0.093 %

FIGURE 4.9: Effect of every optimization on the timing, resource and accuracy profile of depth2vertex (Map).


Step                    Estimated timing   Max. resource usage   Mean error
Original code           92.7 ms            19 % LUT              0.010 %
Apply pipelining        12.1 ms            11 % LUT              0.010 %
Buffering & streaming   1.01 ms            28 % LUT              0.010 %
Use fixed-point         1.01 ms            35 % LUT              0.021 %

FIGURE 4.10: Effect of every optimization on the timing, resource and accuracy profile of vertex2normal (Stencil). Contrary to most other cases, the conversion from floating point to fixed-point has a negative effect here.

This perhaps surprising result was foreshadowed by Section 3.2.5, where it was de-

termined that the fixed-point square root calculation takes up more resources than

its floating point counterpart. In addition, more conversions between both domains

are required per iteration at the FPGA’s input and output, since 3D points consist

of three real-valued quantities per element. Our attempts to create a lighter yet suf-

ficiently accurate implementation of vector normalization were unsuccessful: they

did result in fewer LUTs, but the DSP utilization increased instead.
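One iteration of the computation described above can be sketched as a cross product of neighbouring-vertex differences, followed by the normalization (and its square root) discussed in the text. This is a behavioural model, not the HLS code:

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<float, 3>;

// vertex2normal sketch: the normal at a grid position is the normalized
// cross product of the differences between the four surrounding vertices.
Vec3 normalAt(const Vec3& left, const Vec3& right, const Vec3& up, const Vec3& down) {
    Vec3 dx = { right[0] - left[0], right[1] - left[1], right[2] - left[2] };
    Vec3 dy = { down[0] - up[0], down[1] - up[1], down[2] - up[2] };
    Vec3 n = { dx[1] * dy[2] - dx[2] * dy[1],   // cross product dx x dy
               dx[2] * dy[0] - dx[0] * dy[2],
               dx[0] * dy[1] - dx[1] * dy[0] };
    // The square root below is the operation whose fixed-point variant was
    // found to be more expensive than its floating point counterpart.
    float len = std::sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
    if (len > 0.0f) { n[0] /= len; n[1] /= len; n[2] /= len; }
    return n;
}
```

For a locally flat patch of constant depth, the result degenerates to the unit vector along the optical axis, which is a convenient correctness check.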

4.1.6 track (Gather & Map)

The track method is the first kernel where unpredictable memory access occurs.

Since there are still two input maps and one output map that is accessed at regu-

lar positions next to the randomly accessed reference array, its category is a mixture

of Map and Gather. While the author was unable to finalize the debugging and opti-

mization of this method due to time constraints, a list of the applied changes follows

anyway:

1. Original code. Due to the high degree of control structures present in this

kernel, the iteration latency varies between 28 and 171 clock cycles. The control

flow of every iteration strongly depends on the content of the four vertex and

normal maps.

2. Pipelining & streaming. After converting the interfaces into streams except

for the two reference maps and pipelining the horizontal loop, synthesizing

this function was found to be difficult for HLS due to timing constraints. Since


Step                     Estimated timing   Max. resource usage   Mean error
Original code            92.6-566 ms        28 % LUT              0.648 %
Pipelining & streaming   265 ms             36 % LUT              0.648 %
Use fixed-point          132 ms             57 % DSP              0.824 %
Intelligent buffering    23.0 ms            89 % BRAM             1.125 %

FIGURE 4.11: Effect of every optimization on the timing, resource and accuracy profile of track (Gather & Map).

the indices requested from both reference arrays residing in DRAM are data-

dependent and therefore completely unpredictable for the HLS compiler, this

created a very long critical path in the resulting design. The target clock period

had to be increased to a whopping 80 ns for the DDR requests to be able to

complete in time before the next iteration could perform its pair of requests.

The lowest feasible pipelined initiation interval equals 10 clock cycles despite

this lengthy clock period.

3. Use fixed-point. The usage of fixed-point arithmetic allowed us to decrease

the clock period to 40 ns. Since there is much more input and output data than

before, conversions between floating point and fixed-point were left out in this

configuration however; they can be assumed to take place in preceding blocks,

albeit at a slower rate to keep the resource utilization at bay.

4. Intelligent buffering. The decisive transformation of this kernel into a reason-

ably fast variant is one that incorporates key insights into what the algorithm is

actually doing. The track routine constitutes part of the Iterative Closest Point’s

projective variant, but since the sensor’s movement from one frame to the next

is often limited [5], the idea is that distances between corresponding points

over subsequent frames should also be small. In other words, the 2D positions

of both pixels ought to be close to each other. This is confirmed by Figure 4.12,

a heatmap which depicts how frequently both positions relate to each other

in various configurations. It appears that the indices accessed in the reference

arrays are (statistically speaking) almost always in the neighbourhood of the

indices linked to the linear loop over the actual input maps, assuming both


FIGURE 4.12: Heatmap of the accessed pixel positions within the reference maps relative to the corresponding regular loop over the input maps for the first level of track. Yellow means high frequency, purple means the opposite. The underlying data was extracted from five frames selected over a video fragment captured at 30 FPS, and shows that horizontal movement of up to 750 pixels per second occurred at some point.

sources have the same size. A line buffer-like structure was therefore inserted,

containing several rows of both reference maps. In-between every row of itera-

tions, a burst copy is made to update the scratchpad. Any number of rows can

in principle be chosen, depending on the desired maximal BRAM utilization.

For example, 41 rows leads to 89 % BRAM usage and means that ∆x is un-

bounded although |∆y| ≤ 20 must hold for a cache hit to occur. If a requested

pixel happens to fall outside of the currently buffered data, then a default code

equivalent to an invalid measurement is returned. Note that the increase in

error is unexpected behaviour (a bug seems to be present that we could not

successfully resolve), although the optimization of this kernel is still presented

as a valid proof of concept for the application of the scratchpad technique, re-

sulting in an achieved theoretical speed-up factor of 5 to 6.
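The scratchpad lookup at the heart of this intelligent-buffering scheme can be sketched as follows. The band size matches the 41-row example above; names, the sentinel value and the flat layout are illustrative choices, not the thesis code:

```cpp
#include <vector>

// Only a band of 2*RADIUS+1 rows around the current row is held on-chip.
// A reference access outside that band returns a sentinel, mimicking a
// cache miss that is treated as an invalid measurement.
constexpr int RADIUS = 20;        // |dy| <= 20, i.e. 41 buffered rows
constexpr float INVALID = -1.0f;  // default code for a missed access

float lookupRef(const std::vector<float>& scratchpad, int bandTopRow,
                int w, int reqX, int reqY) {
    int rows = 2 * RADIUS + 1;
    int rel = reqY - bandTopRow;              // row position inside the band
    if (rel < 0 || rel >= rows) return INVALID;  // outside the band: miss
    return scratchpad[rel * w + reqX];        // dx is unbounded within a row
}
```

In-between every row of iterations, a burst copy slides the band down by one row, so that the statistically likely accesses of Figure 4.12 almost always hit.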

4.1.7 reduce (Reduce)

The reduce kernel sums products of elements of track's output stream, which are later fed into update_pose. No real novelties arise in its optimiza-

tion described in Figure 4.13, except that the struct holding all output values had

to be partitioned into multiple registers in order to achieve an initiation interval in

HLS of exactly one clock cycle. Moreover, the computations have to happen in the

fixed-point domain, as cumulative floating point additions restrict the II to at least 4

cycles as well. However, since the input stream is 256 bit wide corresponding to a

computation speed of 3,200 MB/s, the practical throughput is expected to be lower

by a factor of around three due to AXI_HP’s bandwidth constraints.
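The effect of partitioning the accumulators can be sketched as below: instead of one running sum, whose serialized additions would limit the initiation interval, several partial sums are rotated through and combined at the end. The lane count is illustrative:

```cpp
#include <cstddef>
#include <vector>

constexpr int NACC = 4;  // number of partitioned accumulators ("lanes")

// Reduce-pattern sketch: consecutive elements go to different accumulators,
// so the adds in the main loop are mutually independent, mirroring how the
// output struct was split into registers to reach an initiation interval of one.
float reduceSum(const std::vector<float>& in) {
    float acc[NACC] = {};
    for (std::size_t i = 0; i < in.size(); ++i)
        acc[i % NACC] += in[i];        // independent add per lane
    float total = 0.0f;
    for (int k = 0; k < NACC; ++k)     // short final combine
        total += acc[k];
    return total;
}
```

Note that reassociating floating point additions this way can change the rounding slightly; in the actual kernel the issue is sidestepped by accumulating in fixed-point.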


Step                     Estimated timing   Max. resource usage   Mean error
Original code            95.9 ms            22 % LUT              0.014 %
Pipelining & streaming   13.2 ms            15 % LUT              0.014 %
Partition array          13.2 ms            15 % LUT              0.014 %
Use fixed-point          3.31 ms            38 % DSP              0.465 %

FIGURE 4.13: Effect of every optimization on the timing, resource and accuracy profile of reduce (Reduce).

4.1.8 integrate (Gather & Map)

The integrate method performs volumetric integration: it loops over every element in the 3D volume to incorporate the most recently captured sensor measurement. The optimizations, summarized in Figure 4.14, incorporate two important changes. First, the full 320 × 240 depth array of 300 KiB is copied to local storage before starting the loop, similar to the block-by-block processing described in Section 3.2.4. The HLS compiler allocates a number of BRAM instances that fit up to 512 KiB rather than 300 KiB, which explains the high BRAM utilization. Only one block is used in this use case, but the kernel supports upscaling to multiple blocks as well. For a depth map of 640 × 480, three sections and thus re-executions of the full routine would be required; due to the huge volume size, however, this would bring the maximal runtime up to a rather impractical 503 ms. Second, we observe that not all iterations can possibly contribute to a meaningful update of the volume. The camera is always looking at just a part of the global volume, around which a minimally encompassing cube can be mathematically determined. This principle is shown in Figure 4.15. Since the total number of iterations now depends on the size of this subvolume, the timing also depends on the precise sensor location and can be anywhere between 0 and 168 ms. At the start of the KinectFusion algorithm, the camera is by definition positioned at the center of the volume spanning (8 m)³, oriented perpendicularly towards one face of the volume. Knowing the Kinect v1 depth image's field of view of 58.5 × 46.6 degrees [69], the initial encompassing cube is calculated to occupy approximately 9.7 % of the total volume. As such, this fraction is also taken to produce a rough estimate of the average timing, resulting in 16.2 ms.
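The encompassing-cube fraction can be sketched with elementary geometry (our own illustration, not the thesis code; in particular, the 0.8 m near limit below is our assumption based on the Kinect v1's minimum sensing range, since the thesis only states the resulting fraction):

```cpp
#include <cassert>
#include <cmath>

// Fraction of the global volume covered by an axis-aligned box around
// the part of the viewing frustum between the sensor's near and far
// range. Illustrative geometry only; exact clipping conventions may
// differ from the thesis implementation.
double frustum_box_fraction(double hfov_deg, double vfov_deg,
                            double near_m, double far_m,
                            double volume_side_m) {
    const double deg = std::acos(-1.0) / 180.0;
    // The frustum cross-section is widest at the far plane.
    double width  = 2.0 * far_m * std::tan(0.5 * hfov_deg * deg);
    double height = 2.0 * far_m * std::tan(0.5 * vfov_deg * deg);
    double box    = (far_m - near_m) * width * height;
    return box / (volume_side_m * volume_side_m * volume_side_m);
}
```

With the camera at the centre of the (8 m)³ volume looking at a face (far plane 4 m), frustum_box_fraction(58.5, 46.6, 0.8, 4.0, 8.0) evaluates to about 0.096 under these assumptions, in line with the stated 9.7 %.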




Optimization step          Mean error   Max. resource usage   Estimated timing
Original code              0.147 %      31 % LUT              3.36-23.8 s
Pipelining & streaming     0.147 %      32 % LUT              839 ms
Use fixed-point            0.322 %      88 % LUT              168 ms
Block processing           0.322 %      100 % BRAM            168 ms
Limit to frustum           0.322 %      100 % BRAM            0-168 ms

FIGURE 4.14: Effect of every optimization on the timing, resource and accuracy profile of integrate (Gather & Map).

FIGURE 4.15: Two-dimensional illustration of a frustum-encompassing block, to which loop boundaries can safely be restricted. The green coloured blocks represent volumetric elements that are visible from the sensor's current position, meaning that all yellow elements remain unchanged during integration.



                  Unchanged code   Fully optimized   Methodology   Additional   Total
Kernel            HLS timing       HLS timing        speed-up      speed-up     speed-up
mm2m_sample       24.6 ms          0.38 ms           ×8.01         ×8.08        ×64.7
bilateral_filter  1109 ms          0.77 ms           ×1422         ×1.01        ×1440
half_sample       12.0 ms          0.25 ms           ×12.4         ×3.88        ×48.0
depth2vertex      32.3 ms          1.01 ms           ×32.0         ×1           ×32.0
vertex2normal     92.7 ms          1.01 ms           ×91.8         ×1           ×91.8
track             566 ms           23.0 ms           ×4.29         ×5.74        ×24.6
reduce            95.9 ms          3.31 ms           ×29.0         ×1           ×29.0
integrate         23.8 s           16.2 ms           ×142          ×10.4        ×1469
Median                                               ×30.5         ×2.45        ×56.4

TABLE 4.3: Impact of the optimizations arising from adoption of the methodology versus use case-specific knowledge on the estimated performance of KinectFusion's kernels in HLS.

4.2 Discussion

This chapter describes the high-level synthesis design phase of all FPGA-eligible

KinectFusion kernels except one. It is clear that a significant portion of this work ex-

tends beyond merely applying the methods presented in Section 3.2, although they

did provide a strong headstart in both qualitative and quantitative terms. A non-

trivial understanding of the algorithm is nonetheless required to apply the proce-

dures correctly, and to take advantage of opportunities for additional time-oriented

or resource-oriented optimizations.

4.2.1 Evaluation of the methodology

A straightforward application of all techniques discussed in Chapter 3 yields a median speed-up factor of 30.5 for the individual components of KinectFusion, while extra changes created an additional median performance gain of 2.45. Table 4.3 goes into more detail, and lists the effect of these two types of design transformations for each kernel. It is not always easy to separate changes originating from the methodology from 'additional' optimizations. Loop unrolling, for example, is standard practice in the FPGA community, but was not described in Chapter 3. Moreover, the biggest reason why duplication of processing elements proved useful is that the two respective routines (mm2m_sample and half_sample) are downsamplers. On the other hand, the intelligent buffering method in track does, strictly speaking, belong to the methodology, but was instead included in the 'additional' column because it still requires a great deal of insight into the SLAM application and its context. Not listed in Table 4.3 is the impact on resources, even though several important extra optimizations did drastically decrease the total area utilization of some components.

While different in philosophy from the range of source-to-source compilers found in the literature [58], [70], [71], we conclude that the more manual HLS methodology we introduced based on a selection of SLAM kernels holds up well, and might even be preferable to using the aforementioned tools. It is hard to imagine that automatic code translation would reliably catch superfluous hardware usage such as the division in half_sample. While this particular instance is not especially detrimental, the implication that similar situations could be overlooked does suggest that a higher degree of automation generally correlates with a lower quality of the final design. Working at a lower level enables more fine-grained control over all aspects of the design process, which in turn opens gateways towards better efficiency and performance.

Furthermore, the reason why no fully automatic FPGA compiler exists becomes even clearer once we move outside the domain of image processing. The design space grows exponentially as more possible paths can be taken, and Vivado HLS, for example, has to ask for hints in the form of pragmas in order to incorporate indispensable human intelligence into its compilation process. Otherwise, the dimensionality of design space exploration (DSE) would become unfathomably high; even more so when it comes to the combined implementation of multiple kernels discussed in Chapter 5. As a result, the workflow of HLS aims to provide a balanced mix of high-level and low-level details. Opportunities for optimizing individual hardware operations can still be exploited, while the repetitive specifics of established paradigms such as pipelining and I/O interfacing are taken care of mostly automatically.
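As a toy illustration of such pragma "hints" (generic HLS style, not taken from the thesis code): the loop below is functionally complete without the pragma, but the directive tells the tool how to schedule it, removing one large dimension from the search space. An ordinary C++ compiler ignores unknown pragmas, so the same function also runs on the host for verification.

```cpp
#include <cassert>

// Hypothetical example kernel: scale each element and add its index.
// The PIPELINE directive asks Vivado HLS for one result per clock;
// without it, the tool would have to guess whether pipelining is wanted.
void scale_add(const int in[8], int out[8], int gain) {
    for (int i = 0; i < 8; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = gain * in[i] + i;
    }
}
```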



Chapter 5

System-level acceleration of multiple kernels

For the remainder of this work, we focus on the first five kernels of KinectFusion, for most of which multiple instances (also called levels, modes or variants) are invoked for every input depth frame. The other parts of the algorithm would certainly lend themselves to deeper investigation as well, but were deemed out of scope for this thesis due to technical difficulties. The dataflow principles presented in this chapter are largely applicable to the remaining half of KinectFusion as well. In addition, we refer to [37] for an FPGA implementation and system-level integration of track and reduce. As shown in Figure 3.2, all five kernels mm2m_sample through vertex2normal are independent of the global reconstruction volume, as they serve to transform the sensor measurement into various formats that will be used in later stages of KinectFusion.

The term system-level acceleration refers to the question of how to accelerate not just one kernel, but a multitude of algorithmic components within the heterogeneous system. Recall that an SoC essentially offers a CPU and an FPGA with a common DRAM. The quality of the design implemented on the FPGA, as well as of the supervising software program on the CPU, determines the extent to which the cooperation between PS and PL occurs harmoniously. As mentioned in the introduction, this duality of having to manage both hardware and software leaves a high degree of freedom to the system designer. We propose three different architectures:

1. Independent coexistence of kernels. Every accelerator, each corresponding to

one kernel, exists as an independent block in hardware and fully manages its

own connection to DDR via the PS. No direct inter-component communication

is possible.

2. Block-level dataflow. Subsequent kernels on the same datapath are also di-

rectly connected in hardware. The many complications arising from the prac-

tical implementation of this idea are resolved during the Vivado block design

phase.



3. HLS-level dataflow. Similar to the above, but the issues are resolved through

C++ code within the HLS top function instead. All kernels thereby reside in-

side just one IP core.

After reviewing the specifics of KinectFusion’s challenging multi-level dataflow, the

remainder of this chapter will implement and compare the three aforementioned

configurations. Note that both the problem statement and its solutions are also care-

fully generalized, which means that throughout this text we will occasionally jump

back and forth between abstract principles and their concrete application in the con-

sidered use case.

5.1 Dataflow of KinectFusion

The first five kernels of KinectFusion are shown in Figure 5.1, putting emphasis on

which data they process, how the various blocks interact and how they depend on

each other. A ’pyramid’ with three distinct levels of data processing can be distin-

guished, all of which are needed to perform the multi-level tracking phase after-

wards (see Figure 3.2). From a performance analysis perspective, every block can be

summarized as in Table 5.1 and as follows:

1. mm2m_sample: Accepts a raw sensor measurement array of unsigned short

integers as input and downsamples by a factor two in both dimensions to pro-

duce an array of single-precision floating point depth values as output. The

data rate between input and output is halved on average, although this ratio

varies between 0 and 1 at runtime.

2. bilateral_filter: Filters an array of single-precision depth values, maintaining

the same resolution and datatype at its output. The data rate stays equal be-

tween input and output.

3. half_sample: Downsamples an array of depth values by a factor of two in both

dimensions. The data rate between input and output is decreased by a factor

of 4 on average, but varies between 0 and 1/2 at runtime.

4. depth2vertex: Converts an array of single-precision depth values to 128-

bit point structs, maintaining the same resolution at its output. The data rate

is increased by a factor of 4 between input and output.

5. vertex2normal: Filters an array of point structs, maintaining the same resolu-

tion and datatype at its output. The data rate stays equal between input and

output.
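The pyramid bookkeeping behind these rate factors can be sketched as follows (helper names are ours, not the thesis code): each half_sample halves both dimensions, so its byte rate drops by a factor of 4, while depth2vertex widens 32-bit depths to 128-bit point structs, growing the byte rate by a factor of 4.

```cpp
#include <cassert>
#include <cstddef>

// Minimal model of one map in the pyramid of Table 5.1.
struct Map {
    std::size_t w, h, bytes_per_elem;
    std::size_t bytes() const { return w * h * bytes_per_elem; }
};

Map mm2m_out()         { return {320, 240, 4}; }                         // level-0 depth map
Map half_out(Map in)   { return {in.w / 2, in.h / 2, in.bytes_per_elem}; } // 1/4 byte rate
Map vertex_out(Map in) { return {in.w, in.h, 16}; }                      // 4x byte rate
```

Chaining these helpers reproduces the instance sizes of Table 5.1: 320 × 240, 160 × 120 and 80 × 60.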

All kernels under consideration have fast 2D streaming implementations thanks

to the HLS optimizations that were investigated in previous chapters. The ques-

tion arises of how to most appropriately handle the execution of multiple kernels if



FIGURE 5.1: Dataflow diagram of the first five kernels of KinectFusion.

Instance          Input size   Input width   Output size   Output width   Data rate factor
mm2m_sample       640 × 480    16-bit        320 × 240     32-bit         1/2
bilateral_filter  320 × 240    32-bit        320 × 240     32-bit         1
half_sample1      320 × 240    32-bit        160 × 120     32-bit         1/4
half_sample2      160 × 120    32-bit        80 × 60       32-bit         1/4
depth2vertex0     320 × 240    32-bit        320 × 240     128-bit        4
depth2vertex1     160 × 120    32-bit        160 × 120     128-bit        4
depth2vertex2     80 × 60      32-bit        80 × 60       128-bit        4
vertex2normal0    320 × 240    128-bit       320 × 240     128-bit        1
vertex2normal1    160 × 120    128-bit       160 × 120     128-bit        1
vertex2normal2    80 × 60      128-bit       80 × 60       128-bit        1

TABLE 5.1: I/O characteristics of all instances of KinectFusion's first five kernels.



we want to off-load them all onto the FPGA. More specifically, the current problem statement can be expressed as follows: given the task of obtaining all seven required output maps shown in Figure 5.1, how do we best design the architecture of our hardware and software in order to calculate the desired data as efficiently as possible? Here, efficiency means maximizing speed while minimizing resource utilization. We assume that none of the 2D arrays can possibly fit on the PL at once, so that the streaming paradigm somehow has to be maintained throughout all computations. On the other hand, most streaming implementations assume a single-producer, single-consumer pattern. This means that it is impossible to connect a single output to multiple inputs of other blocks, unless those blocks are forced to always operate in a fully synchronized manner and thus to run in parallel at the same rate. The last constraint is to maximize the degree of resource sharing among all blocks, so that different instances of the same kernel across multiple levels are preferably not implemented as completely separate layouts on the FPGA.

5.1.1 Generalized problem statement

The first step is to recognize that handling the complex dataflow of KinectFusion

concerns two different but related challenges. Figure 5.2a depicts the configuration

of a general string of kernels when no multi-level functionality is present and the

single-producer, single-consumer pattern holds everywhere. These blocks can quite

easily be coarse-grain pipelined as explained in Section 5.4. Building onto this base

case, Figure 5.2b then adds the requirement that intermediate outputs from kernels

residing in the middle also have to be stored for later usage. If the output of A is

not stored somewhere, then it is lost forever because B transforms the stream into

something else. As mentioned before, temporarily storing the data in local memory

for later retrieval is not possible either, so it has to be written directly to the DRAM.

How to achieve this represents the first aspect of our generalized problem statement.

Next, Figure 5.2c exemplifies the concept of similar instances of kernels handling

different streams. Here, a high similarity means that a large fraction of hardware area

can be shared among the different variants Ai. In practice, these variants might sim-

ply consist of HLS top functions that are being called with different parameters. The

fact that the back-end hardware implementation of multiple instances can be merged

to some degree will prove to be beneficial in minimizing resource utilization, espe-

cially for a low-end device such as the Zynq SoC. However, the maximum degree

of high-level parallelism is also restricted this way, so that a trade-off will have to

be made. How to efficiently reconcile this with intermediate output accumulation

represents the second part of our problem statement.
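The intermediate-output challenge of Figure 5.2b can be modelled by a tee stage (our own illustration, not the thesis code): every element of A's output is duplicated into two FIFOs, one feeding B and one destined for a DRAM write channel. On the FPGA, both consumers must drain their FIFOs at matching rates or the tee stalls, which is exactly the synchronization constraint noted above.

```cpp
#include <cassert>
#include <queue>

// Duplicate a producer's stream into two consumer streams. std::queue
// stands in for an HLS FIFO here; real hardware would additionally
// apply backpressure when either consumer falls behind.
template <typename T>
void tee(std::queue<T>& in, std::queue<T>& to_next, std::queue<T>& to_dram) {
    while (!in.empty()) {
        T v = in.front();
        in.pop();
        to_next.push(v);  // continues down the datapath (towards B)
        to_dram.push(v);  // accumulated for later stages of the algorithm
    }
}
```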



(A) Simple configuration: all kernels are connected together in a linear fashion.

(B) Configuration with intermediate output aggregation: the output of some kernels is needed in later stages as well, rather than by its direct successor only.

(C) Configuration with multi-level execution: several variants of each kernel handle data streams in principle independently from each other.

FIGURE 5.2: Illustration of two generalized dataflow challenges.

5.2 System architecture

Figure 5.3 depicts the important aspects of our initial system architecture. Here, the

processor is connected to all IP cores present in the PL via an AXI-Lite interface. This

is a low-throughput memory-mapped protocol that enables simple communication

of control and status registers [23], for example in order to start and stop the kernel.

The PS acts as a master from its General Purpose (AXI_GP) port, while the AXI DMA

and HLS IP core act as slaves. Both the PS and PL have access to a common physical

DRAM via a DDR controller. The DMA serves to provide a high-speed communi-

cation facility between the HLS IP core and the DDR controller. The label of the in-

put stream, AXIS_MM2S, denotes a protocol that converts a memory-mapped address

space (the DRAM) into an AXI stream. Essentially, the DMA is what matches the

DDR data to the interface of the custom IP core by transferring the data in a stream-

ing manner from and to the IP core. The second connection between DMA and IP

core, called AXIS_S2MM, reads the output stream and writes it back to the DRAM in

memory-mapped format. The DMA serves as a master over the Zynq’s bidirectional

HP port via a regular AXI protocol. The widths of the AXIS_MM2S and AXIS_S2MM

streams must be a power of two and can be up to 1024 bit, however the HP port’s

maximal data width is 64 bit. Not shown on the diagram are AXI Interconnect and

SmartConnect IP cores, which up- or downconvert streams as needed to ensure they



FIGURE 5.3: Overview of the System-on-Chip architecture for the execution of a custom IP core.

have the correct bit widths [23].

5.2.1 Hardware debugging

Debugging real-world hardware is far less straightforward than debugging software, although Xilinx provides an IP core called the System Integrated Logic Analyzer (ILA) to perform in-system debugging of designs on an FPGA after implementation [31]. In Vivado, interfaces between blocks as well as signals inside IP cores can be monitored by marking them for debug during the design phase or after synthesis. The hardware manager in Vivado then allows the user to select probes and set up triggers on certain values of a signal. For example, a waveform spanning 1024 clock cycles can be captured once the TVALID or TLAST signal of an AXI stream becomes true. This technique is very useful for precise timing, latency and bottleneck analysis, or to figure out why a system is not working at all.

5.2.2 Bandwidth limitations

In this thesis, all IP cores including the accelerators are fixed to a clock period of 10 ns. Given the Zynq HP port's maximum data width of 64 bits, this means that a single-way throughput of no more than 800 MB/s per interface is feasible from and to DRAM. The theoretical upper limit of 1,200 MB/s could be achieved by introducing a second clock domain and increasing the AXI Interconnect's clock frequency between the DMA and PS to 150 MHz (or 200 MHz as in [25]). Due to technical difficulties and time constraints1, this opportunity was regarded as less important and not further explored. Despite the suboptimal configuration, the conclusions drawn from this chapter apply to any other values of such limitations as well. After all, the goal of this chapter is to investigate which method is best suited to resolve multi-level dataflow problems, rather than to achieve a maximal frame rate for KinectFusion using every available feature.
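The 800 MB/s figure and its consequences follow from simple arithmetic, sketched below (helper names are ours): a 64-bit port at 100 MHz moves at most 8 bytes per 10 ns cycle, and dividing a map's byte count by that rate gives a hard lower bound on its transfer time.

```cpp
#include <cassert>

// One-way bandwidth of a PS-PL port in MB/s (1 MB = 10^6 bytes here).
double port_mb_per_s(int width_bits, double clock_mhz) {
    return (width_bits / 8.0) * clock_mhz;
}

// Lower bound (in ms) on streaming a w x h map through one port; real
// transfers only add DMA setup latency and arbitration on top.
double stream_bound_ms(int w, int h, int bytes_per_elem, double mb_per_s) {
    return (static_cast<double>(w) * h * bytes_per_elem) / (mb_per_s * 1e3);
}
```

For example, a 320 × 240 stream of 128-bit points has a floor of about 1.54 ms per direction at 800 MB/s, which is consistent with the roughly 2 ms measured for depth2vertex and vertex2normal later in this chapter once overhead is included.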

The only other way to process data at rates higher than 800 MB/s is to use multiple PS-PL ports. The Zynq-7020 SoC has 4 HP ports and one Accelerator Coherency Port (ACP), the latter of which can be made cache-coherent but is otherwise practically similar to an HP port, sharing the same maximum throughput. This idea is explored in the next section. Even in that case, however, the concurrent execution of multiple kernels is not unbounded. In an experiment done by [24] using all four HP ports, the bandwidth of the DDR interface to the external DRAM is determined to be 4,264 MB/s. Furthermore, the maximum throughput when using all four HP ports is only 3,333 MB/s, or 78 % of the DRAM bandwidth. Given the speed of 800 MB/s per port in our design, we initially expect few problems to arise here unless all five ports would be in use at the same time.

5.3 Independent coexistence of kernels

One way to off-load multiple distinct subroutines of an algorithm to the FPGA is to simply place them together in the Vivado block design, and connect every streaming kernel to the PS through a separate HP or ACP port via an AXI DMA. This concept is illustrated in Figure 5.4. Every function can then be called separately by the software running on the processor, and no direct communication has to occur between the components. Instead, the data is retrieved from and stored back to the DRAM every time. This extra overhead and lack of inter-kernel communication forms a drawback in most cases, although there are sometimes good reasons to accept it. Naturally, the available PS-PL ports will quickly fill up this way, which places an upper bound of five coexisting kernels in the case of a Zynq-7020 SoC. A well-chosen set of components to be accelerated, whose combined resources still fit on the PL, can nevertheless generate a significant speed-up for the whole system. In addition, the power efficiency, in terms of useful computation done per Watt, will be much better than what can be achieved by off-loading just a single accelerator.

1Synthesizing and implementing a block design in Vivado takes on the order of one to several hours, which limits the number of iterations we could perform.



FIGURE 5.4: System architecture when five coexisting kernels are implemented together on the FPGA. By allocating one port for every accelerator, hard constraints on concurrent executions are avoided.



In general, the selection of which subset out of N kernels to off-load to the FPGA

can be stated approximately as the following mathematical optimization problem:

minimize over s_i ∈ {0, 1}:

    ∑_{i=1}^{N} [ s_i · t_{i,FPGA} + (1 − s_i) · t_{i,CPU} ]

subject to:

    ∑_{i=1}^{N} s_i · r_{i,BRAM} ≤ 1
    ∑_{i=1}^{N} s_i · r_{i,DSP}  ≤ 1
    ∑_{i=1}^{N} s_i · r_{i,FF}   ≤ 1
    ∑_{i=1}^{N} s_i · r_{i,LUT}  ≤ 1                                   (5.1)

Here, the decision booleans s_i denote whether to execute component i on the FPGA (1) or the CPU (0). The variables t_i, measured beforehand, equal the total time spent in each kernel when it is executed on either of these devices. The resource fractions r_i ∈ [0, 1] denote the resource utilization as a fraction of the total available amount of that type provided by the FPGA. Note that three important assumptions are made:

1. Every component is executed in isolation from any other. The objective function has to be modified to account for concurrent executions and scheduling opportunities; in practice this happens on a case-by-case basis.

2. The PS-PL communication overhead is zero; one component can start immediately after the other and no waiting has to occur. This is a very reliable approximation if the individual timings already account for bandwidth bounds and latencies.

3. The resource utilization of combined components, no matter the s_i-vector, always equals the sum of individually off-loaded components. From our experience, these values tend to be off by around 35 % in a favourable way, meaning that the post-implementation total resource utilization is actually less than the sum of the post-implementation resource profiles of single-kernel accelerators. This non-linear scaling is explained by the increased opportunity for resource sharing as more and more functionality enters the FPGA.

For our research, we investigate the case of accelerating all five components at once. A sacrifice therefore had to be made regarding the bilateral filter's complexity: its window size was reduced from 5x5 to 3x3, as the design could otherwise not fit on the Zynq-7020 FPGA.
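Since N is small (five kernels here), problem (5.1) can simply be solved by enumerating all 2^N decision vectors, as in the sketch below (our own illustration; all struct fields mirror the symbols of (5.1), and the numbers in any usage are made up rather than measured).

```cpp
#include <cassert>
#include <vector>

// One candidate kernel: CPU time, FPGA time, and the four resource
// fractions r_i of equation (5.1).
struct Kernel { double t_cpu, t_fpga, bram, dsp, ff, lut; };

// Brute-force minimization of total time subject to the four resource
// budgets; returns the best achievable total execution time.
double best_total_time(const std::vector<Kernel>& ks) {
    const int n = static_cast<int>(ks.size());
    double best = 1e300;
    for (int mask = 0; mask < (1 << n); ++mask) {   // every s-vector
        double t = 0, bram = 0, dsp = 0, ff = 0, lut = 0;
        for (int i = 0; i < n; ++i) {
            if (mask & (1 << i)) {                  // s_i = 1: FPGA
                t += ks[i].t_fpga;
                bram += ks[i].bram; dsp += ks[i].dsp;
                ff   += ks[i].ff;   lut += ks[i].lut;
            } else {                                // s_i = 0: CPU
                t += ks[i].t_cpu;
            }
        }
        if (bram <= 1 && dsp <= 1 && ff <= 1 && lut <= 1 && t < best)
            best = t;
    }
    return best;
}
```

This is a variant of the 0/1 knapsack problem; exhaustive search stays cheap up to a few dozen kernels, after which an ILP solver would be the natural choice.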



                  ARM Cortex-A9   HLS       Zynq-7020   HLS        Actual
Kernel            CPU             report    FPGA        speed-up   speed-up
mm2m_sample       2.70 ms         0.38 ms   0.77 ms     ×7.0       ×3.5
bilateral_filter  426 ms          0.77 ms   0.78 ms     ×552       ×544
half_sample       1.82 ms         0.24 ms   0.50 ms     ×7.6       ×3.7
depth2vertex      7.90 ms         1.01 ms   2.05 ms     ×7.8       ×3.8
vertex2normal     27.4 ms         1.01 ms   2.06 ms     ×27.2      ×13.3
Total             467 ms          3.41 ms   6.16 ms     ×137       ×75.6

TABLE 5.2: Time spent in each kernel as measured on both the PS and PL of the Zedboard. Summing these values assumes that all kernels are executed separately in time, and can be placed side by side onto the same FPGA.

5.3.1 Performance analysis

This section benchmarks the design where IP cores A through E in Figure 5.4 are filled in by mm2m_sample, bilateral_filter, half_sample, depth2vertex and vertex2normal. The initial test engages every accelerator separately in time; concurrent executions are discussed directly afterwards.

Isolated executions

Executing KinectFusion's accelerated kernels in isolation at a clock period of 10 ns produces Tables 5.2 and 5.3. With the current architecture, every block except bilateral_filter clearly performs twice as slowly as was estimated by Vivado HLS. This is logical, since an I/O interface throughput of at least 1,600 MB/s was assumed in those designs. Recall that mm2m_sample was unrolled by a factor of 8 and half_sample by a factor of 4 in Chapter 4. The effective unroll factors have become 4 and 2 respectively, due to a PS-PL interfacing bottleneck. depth2vertex and vertex2normal work with 128-bit data points and therefore suffer from a very similar bottleneck, plotted in Figure 5.5. By comparing the PS-PL interface signals and the DMA streaming signals, it is revealed that the PL is forced to split up every 128-bit packet into two 64-bit packets in order to pass the stream of 3D points via the HP port. This can be confirmed by looking at the frequency and placement of the 32 zero-padding bits, which serve the purpose of fitting a 96-bit struct element inside the 128-bit AXI streaming format. This data width conversion causes the DMA to adapt to the slowest stream of 64 bits and thus require two clock cycles per communicated element, so that the initiation interval has effectively doubled from 1 to 2.
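The doubled initiation interval translates directly into the measured timings (our own arithmetic, not the thesis code): with two bus beats per 128-bit element, each element costs two 10 ns cycles.

```cpp
#include <cassert>

// Expected kernel time in ms for a w x h map when each element takes
// 'cycles_per_elem' clock cycles of 'clk_ns' nanoseconds.
double expected_ms(int w, int h, int cycles_per_elem, double clk_ns) {
    return static_cast<double>(w) * h * cycles_per_elem * clk_ns * 1e-6;
}
```

For vertex2normal at level 0, expected_ms(320, 240, 2, 10.0) gives about 1.54 ms, which sits between the 1.01 ms HLS estimate (II = 1) and the 2.06 ms measured in Table 5.2 once DMA setup overhead is added on top.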

Figure 5.4 depicts the architecture that is employed to off-load all five kernels at

once. For mm2m_sample and half_sample, implementations with unrolling factors of 4

and 2 respectively were used, because the computationally stronger variants do not

yield any real-world increase in performance. The performance and communication

bounds (see [62]) are precisely matched to each other this way, so that no bandwidth



Kernel            Input width   Input rate   Output width   Output rate   Elements produced
                                                                          per clock cycle
mm2m_sample       16-bit        800 MB/s     32-bit         400 MB/s      1
bilateral_filter  32-bit        400 MB/s     32-bit         400 MB/s      1
half_sample       32-bit        800 MB/s     32-bit         200 MB/s      0.5
depth2vertex      32-bit        200 MB/s     128-bit        800 MB/s      0.5
vertex2normal     128-bit       800 MB/s     128-bit        800 MB/s      0.5

TABLE 5.3: Realized maximum I/O throughputs that conform to HP port bandwidth bounds. The data widths and elements processed per clock cycle are measured in terms of data units meaningful to KinectFusion (e.g. one depth value), without regard for details involving packed structs.

(A) AXI streams directly attached to the HLS IP core (128 bits wide).

(B) Zynq HP port transferring data read from DRAM (64 bits wide).

FIGURE 5.5: Waveforms produced by the System ILA for the vertex2normal kernel.



is left and no superfluous hardware resources are used up. Coincidentally, this con-

figuration just barely fits on the Zynq-7020 FPGA with a LUT utilization of 88 %. If

every instance were to be executed separately in time, then the total time spent in

these first five kernels would be 6.16 ms. The FPGA therefore has the capacity to

process 162 frames per second, writing all seven temporary outputs to the DRAM

per incoming sensor measurement. Note that this value does not represent a real

FPS of the full system, unless it is assumed that all surrounding components of the

algorithm work sufficiently fast as well.2 While independent execution of kernels on

the FPGA is already a huge improvement over merely using the CPU, opportunities

for parallelism among different blocks exist and will be exploited next.

Frame-level pipelining

Returning to Figure 5.1, it is evident that not all components need to wait for each other. In particular, once the bilateral filter finishes and its result is written back to DRAM, there is no reason for either depth2vertex or half_sample1 to postpone its start until the other has completed. After all, the 'coexistence' configuration allows any degree of concurrent execution to occur thanks to the presence of independent PS-PL links for each of its components. Taking into account the inter-kernel data dependencies, Figure 5.6a illustrates how the accelerators' executions should be scheduled over time so that every result is available as early as possible. We again introduce the concepts of iteration latency (IL) and initiation interval (II), defined very similarly to Chapter 3, albeit on a higher level in this context. The timings are calculated theoretically assuming that the speeds of Table 5.3 hold true at all times, and that there exists no delay in between the executions of multiple kernels or instances. For example, half_sample1 produces 160 × 120 32-bit elements at an average throughput of 200 MB/s or, equivalently, 1 output value for every 2 clock cycles, so that the estimated timing is 160 × 120 × 2 × 10⁻⁵ = 0.384 ms.

The realized iteration latency of Figure 5.6a’s execution on the SoC is 5.31 ms. This is quite close to, but still more than, the theoretical value of 5.09 ms, indicating that a bottleneck occurs. Interestingly, the bilateral filter behaves strangely when it is part of the coexisting configuration. Even when executed in isolation, its timing is consistently measured to be 0.97 ms instead of 0.78 ms as in the standalone configuration where only one IP core is implemented on the PL. Figure 5.7 shows the AXI stream signals, confirming that the unexpected issue is present. When this extra latency is taken into account, the increase from 5.09 ms to 5.31 ms is logical.³ Some aspect of the component being placed on a nearly full FPGA and connected to one HP port among a fully occupied set of PS-PL ports seems to generate a slowdown,

²For example, it was experimentally determined that reading a new input frame from the SD card takes around 50 to 70 ms, which just barely threatens the real-time constraint of achieving 15 FPS.

³The extra 30 µs can simply be attributed to the summed latencies of individual executions.


5.3. Independent coexistence of kernels 89

(A) Schedule to process a single sensor frame.

(B) Schedule to process multiple sensor frames. Both depth2vertex and vertex2normal determine the smallest possible initiation interval.

FIGURE 5.6: Diagrams depicting how the five kernels should be executed in time if the DDR access speed were unlimited. The rows correspond to accelerators each managing their own DMA and PS-PL port, while the distinct tasks are labelled with resolution levels (0 stands for 320x240, 1 for 160x120 and 2 for 80x60).



FIGURE 5.7: System ILA waveforms for bilateral_filter when it is executed alone, revealing a strange hiccup. The vertical lines are spaced 200 ns apart.

although we are unable to explain precisely where this bottleneck lies (we do not exclude the possibility of hardware, software or system bugs either). However, this particular incident does not significantly impact the conclusions we can draw from our experiments with regard to multi-level dataflow, and we chose to focus on more important matters rather than extensively debug the encountered situation. Measurements of all other kernels showed that they behave exactly as expected, and achieve the same performance results as their standalone counterparts.

The natural extension of the discussed configuration is to set up a pipelined execution of all blocks as in Figure 5.6b. This way, multiple frames can be processed in an overlapped manner. The initiation interval is (320 × 240 + 160 × 120 + 80 × 60) × 2 × 10⁻⁵ = 2.02 ms in theory, but this assumes that the DRAM has no problem dealing with three concurrent streams of 200-800 MB/s to different memory locations at once, continuously generating around 2 GB/s of traffic in both directions. The realized initiation interval is 2.53 ms (an increase of 21 %), while the iteration latency is 6.02 ms (an increase of 18 %). As every individual HP or ACP port is requested to handle only 800 MB/s at most by the PL blocks, the increased II can very likely be attributed to a bottleneck arising from the high DDR workload. The presence of a slowdown is confirmed by plotting the half_sample I/O signals in Figure 5.8. The DMA is clearly sending and receiving burst copies instead of operating at its full capacity; correspondingly, the streaming interfaces are only active for a fraction of the time. The increase in IL can be explained for exactly the same reasons. Overall, the system is performing around 20 % slower than expected, and the differences between both relative increases are small enough to be blamed on random measurement noise (our timing method is sufficiently accurate but of course not perfect).

5.4 Task-level pipelining

The basic configuration in Figure 5.2a, linking data streams of multiple components together, lends itself to an obvious upgrade in the form of task-level pipelining. This concept allows for the overlapped execution of subsequent kernels, in the sense that



FIGURE 5.8: System ILA waveforms for half_sample in the multi-frame execution. Large-scale pauses and restarts are clearly visible, and occur presumably due to the DDR controller having to operate at full capacity. The vertical lines are spaced 1 µs apart.

every element produced by a given component is immediately processed by the next component as soon as it becomes available.⁴ By inserting small channels (often FIFO buffers) in between these blocks, it is therefore possible to completely bypass the requirement of having to store large bodies of intermediate data somewhere while the next component is waiting for its predecessor(s) to finish all their computations. In practice, the HLS DATAFLOW pragma is the ideal lubricant for this paradigm, as it automates many aspects of its implementation [1], [72]. C/C++ functions as well as individual for-loops can effectively be chained together to form a single ’superblock’, as if they were one large for-loop, although several restrictions apply. The usage of this optimization directive to enable linear task-level pipelining is standard practice and will not be detailed here; instead, we consider how situations resembling Figures 5.2b and 5.2c could be tackled efficiently using the underlying concept despite its limitations, in order to eventually arrive at two candidate solutions for Figure 5.1. The multi-level dataflow principles discussed in the following paragraphs can essentially be applied in two distinct manners:

1. In the block design of Vivado, where standard task-level pipelining can be generated naturally by connecting subsequent kernels directly together.

2. In Vivado HLS, where the dataflow directive stands central but special measures have to be taken to reconcile its operation with conflicting requirements.

A comparison will then be made of all reviewed techniques in Section 5.5.

⁴We have encountered three forms of pipelining at this point, so it is appropriate to emphasize the differences between them:
1) HLS pipelining concerns subsequent iterations of a loop and happens at a low level, i.e. that of flip-flops and hardware blocks.
2) Frame-level pipelining concerns subsequent frames of a video fragment, and happens at the much higher level of complete accelerators residing on the PL to be controlled by the PS.
3) Task-level pipelining concerns subsequent components within one accelerator. It also happens at a different level than regular HLS pipelining, yet is distinct from the previous concept.



5.4.1 Intermediate output aggregation

Extracting data streams from in between two blocks breaks the single-producer, single-consumer principle that is fundamental to the operation of task-level dataflow in Vivado HLS [72]. Even a basic stream construct does not allow multiple reads of the same element; instead, every call to read() also consumes that element, so that it can never be retrieved again unless a copy of the data is preserved somehow. AXI stream broadcast IP cores or custom stream duplicators in HLS provide a partial solution to this problem, although the question remains of how to redirect the duplicated data either to temporary storage, or directly to the input side of the next component where it is needed. Since the latter case requires all kernels to execute together in a synchronized manner, resource sharing among similar variants of a kernel cannot really be exploited.⁵ Therefore, only the former case is focused on, as this will allow all instances of a kernel to run separately in time while still taking advantage of task-level pipelining.

In Vivado’s block design, Figure 5.2b can be realized as Figure 5.9a (four IP cores are shown here, but this number can again vary up to five on the Zynq-7020 SoC). The AXI4-Stream Broadcast IP core forwards a single input stream to both outputs, so that one of them can return via a one-way AXI DMA to DRAM. DDR bottlenecks might arise, which is discussed in Section 5.4.3. A second solution method is to create a top function in HLS calling all subfunctions and insert the HLS dataflow directive. In order not to cause a single-producer, single-consumer violation, important modifications must however be introduced. If a single combined output stream is desired, it does not make sense to bypass the blocks in the middle and somehow recombine the differently phased streams at the end. Bypassing tasks is even downright impossible with the pragma enabled [72], and various attempts to do so anyway were found to produce deadlocks during C/RTL cosimulation.

A feasible workaround that we have conceived is depicted in Figure 5.9b. Every subkernel ought to attach its own output onto the existing stream to prevent the data from being lost forever. This way, the elements grow wider as the computation progresses. In the end, all intermediate results are stored in one fat output stream and written back to DRAM. Remark that the technical implementation of this solution is more burdensome than the first method: Stencil kernels in particular now have to keep all aggregated data inside their line buffers, in order to ensure that the information to be passed through remains correctly synchronized in time with the newly computed data. The utilization of BRAM and registers slightly increases as a result, but this should not pose a problem considering the line buffers’ relatively small dimensions.

⁵This statement is analogous to the fact that loop unrolling, a.k.a. processing element duplication, usually increases resource utilization by a factor equal to the resulting gain in performance.



(A) In Vivado’s block design: all kernels are kept fully separate during the HLS design phase. Unlabelled white arrows belong to the AXI4-Stream protocol.

(B) Single HLS IP core combining all subkernels via dataflow. Inter-kernel results are accumulated into an increasingly larger stream (depicted here as multiple arrows). The dotted lines represent conceptual pass-through connections between input and output.

FIGURE 5.9: Two possible solutions for intermediate output aggregation (Figure 5.2b).



5.4.2 Multi-modal execution

The problem of multi-modal execution is defined in this context as how to efficiently design the architecture for a chain of kernels whose instances have varying functionality and/or dimensionality, while maximizing the degree of area sharing across these modes. Figure 5.2c could trivially be implemented as three separate accelerators. While this easily allows for the concurrent execution of different instances, it prevents Vivado from exploiting any degree of inter-modal overlap at the hardware level during synthesis and implementation. Instead, we propose the usage of multi-modal blocks such as in Figure 5.10a. Another distinction is made here between kernels whose mode can be changed by setting parameters via their AXI slave interface, and kernels that are more fundamentally different, so that a custom HLS stream selection block is inserted before and after them instead. Note that the latter case, exemplified by IP cores Bi, does not conform to the area sharing principle but instead allows completely unrelated kernels to exist next to each other. The ability for the software programmer to select which block to activate in the current chain might be required in some parts of the multi-modal execution process. In KinectFusion, for example, variants of both depth2vertex and vertex2normal exist across all levels, although mm2m_sample and bilateral_filter are present in the highest resolution level only. Section 5.4.3 will therefore insert stream switching components to deal with the selection between the latter two kernels and half_sample.

Vivado HLS again provides an opportunity to resolve the multi-modal problem at an earlier stage as well. The impossibility of conditional task execution is another constraint of the dataflow optimization directive [72], although case selection within a loop is certainly not forbidden. This idea is illustrated in Figure 5.10b. Multiple variants of kernels can be activated by controlling the mode parameter, and the freedom of if-else case switching allows completely different functionality to coexist within one HLS function as well. Our hypothesis is that this method decreases total area utilization even further compared to the first solution, since resource sharing across modes (for blocks that were previously distinct) can already occur during the HLS scheduling and binding phases.

5.4.3 Application to KinectFusion

Having investigated how to tackle the two challenges constituting the multi-level dataflow problem, its application to Figure 5.1 is considered. First, a necessary intermediate step is to determine which modes should correspond to which blocks in order to reach an efficient configuration of combined kernels with respect to task-level pipelining. Stated more generally, data dependencies between components can be modelled as a directed graph. The task at hand is then to find an optimal set of



(A) In Vivado’s block design, the components can be configured to instantiatedifferent kernel variants either by setting control signals, or by routing the switch

blocks.

(B) In HLS, the differentiation between several modes is done inside the loop bodies.

FIGURE 5.10: Two possible solutions for multi-modal execution (Fig-ure 5.2c).

Page 118: FPGA-Based Simultaneous Localization and Mapping (SLAM) … · FPGA -Based Simultaneous Localization a nd Mapping using High -Level Synthesis Basile Van Hoorick Supervisor s: P rof

96 Chapter 5. System-level acceleration of multiple kernels

FIGURE 5.11: Three different sets of paths (depicted as large arrows) that connect components to combine using task-level pipelining. The time for one path is estimated from the slowest block inside that path, and the paths should be executed separately in time to enable resource sharing across different modes.

paths, each consisting of pipelined components and corresponding to a certain execution time, so that all nodes are covered and the total sum of all timings is minimal.

Figure 5.11 shows three different overlays in the KinectFusion use case. While different modes need not necessarily correspond to different resolution levels, in this case it seems that the best way forward is the leftmost configuration. We now review two practical ways in which the intermediate output aggregation and multi-modal execution techniques can be applied to KinectFusion: one way is via the block design, and the other is via HLS itself. It is expected that both methods will generate better results compared to the straightforward side-by-side placement of all accelerators that was discussed in Section 5.3.

Block-level dataflow

Figure 5.12 depicts the architecture that combines elements of both Figures 5.9a and 5.10a. Note that the bit widths correspond to packed structs, and the stream elements do not always correspond to meaningful units in KinectFusion. While this is not always the case in general, it so happens that at every stage within this pipeline of blocks, intermediate streams have to be extracted and redirected to DRAM via an AXI DMA. The outputs of mm2m_sample and depth2vertex are a priori required by the temporary storage in Figure 5.1, although this does not hold for the outputs of bilateral_filter and half_sample. However, the latter data is needed for the next level of kernels down the resolution pyramid, so that it inevitably needs to be stored away as well. The processing element duplication factor of mm2m_sample was reduced from 4 to 2, because the bottleneck of the first level (the leftmost path in Figure 5.11) now resides with the two last kernels due to their 128-bit stream size. As a result,



FIGURE 5.12: System architecture that handles the multi-level dataflow challenge of KinectFusion’s first five kernels (see Figure 5.1) completely within the Vivado block design, leaving the HLS IP cores unchanged. AXI-Lite control signals are omitted for clarity, and the bottleneck-inducing streams are marked with a red data width label.

communication and computation bounds are now matched for all three levels.

Figure 5.13 depicts which components should be started at which level for every frame, and gives an indication of how long they take. The switch blocks should of course be controlled accordingly. Every kernel exists only once in the hardware, so that no overlapped execution can occur, causing the iteration latency to be equal to the initiation interval. The measured values are II = IL = 2.10 ms, which is very close to the theoretical value of (320 × 240 + 160 × 120 + 80 × 60) × 2 × 10⁻⁵ = 2.02 ms. Despite the total throughput of data written to DRAM being quite high, there is only one input stream to the whole FPGA. In contrast to the coexisting



FIGURE 5.13: Schedule to process incoming sensor frames using the improved accelerators. Due to the application of task-level pipelining, all subcomponents now adapt to the slowest link in the chain, which is formed by bandwidth limitations.

configuration, where the input of every block had to be read from DRAM, the maximum throughput from PS to PL is 800 MB/s here because all intermediate results are passed directly to subsequent components. We suspect that this lower value removes the DDR bottleneck that was present previously. The average resource utilization of this configuration is 45 %, or 7 % less than the previous architecture with coexisting kernels, since less communication infrastructure is present.

HLS-level dataflow

Both intermediate output aggregation and multi-modal kernels can already be introduced in the HLS design as well, as conceptually illustrated in Figures 5.9b and 5.10b. Merging these techniques for KinectFusion yields Listing 5.1, where many details including fixed-float conversion, variable loop bounds and TLAST signal handling are omitted for clarity. Opportunities for resource sharing can now be taken advantage of more thoroughly, e.g. by fusing the bilateral_filter and half_sample kernels together as much as possible. The window sizes of these Stencil kernels are 3x3 and 2x2 respectively, the larger of which belongs to the bilateral filter. By using one shared window for both routines, the effect of multi-modality is consequently postponed to the actual Stencil computation given such a window filled with data.

The overarching HLS IP core can be implemented as one big accelerator on the FPGA, shown in Figure 5.3. The resulting functionality is very similar to that of the block-level dataflow design, except that the complex dataflow challenges are now taken care of at a different level. To process multiple frames, the schedule depicted in Figure 5.13 also applies to this case. However, measurements indicate that the initiation interval and iteration latency have increased to 4.13 ms. This unfortunate result is explained by the fact that only one AXI DMA is used to retrieve all output data. The width of one element is 256 bits, as it must contain at least two depth values of 32 bits each and two 3D points of 96 bits each. The 64-bit HP port at the PS-PL interface



typedef struct {
    float mm_out;    // 4 bytes; empty for level >= 1
    float bf_out;    // 4 bytes; hs_out for level >= 1
    point_t dv_out;  // 12 bytes
    point_t vn_out;  // 12 bytes
} agg_t;

int mm_through_vn(hls::stream<agg_t>& stream_out,
                  hls::stream<int>& stream_in, int level) {
#pragma HLS DATAFLOW
    hls::stream<agg_t> tmp1, tmp2, tmp3;
    // stream_in -> tmp1
    for (...) {
        if (level == 0) { /* mm2m_sample (Map) ... */ }
        else            { /* pass through (Map) ... */ }
    }
    // tmp1 -> tmp2; both kernels use a shared memory window
    for (...) {
        if (level == 0) { /* bilateral_filter (Stencil) ... */ }
        else            { /* half_sample (Stencil) ... */ }
    }
    // tmp2 -> tmp3
    for (...) { /* depth2vertex (Map) ... */ }
    // tmp3 -> stream_out
    for (...) { /* vertex2normal (Stencil) ... */ }
}

LISTING 5.1: Code snippet summarizing how the multi-level dataflow problem is to be solved within Vivado HLS.



is thus forced to chop up the accumulated data into four smaller packets, taking four clock cycles per element to send them to DRAM. Another drawback of the HLS-level dataflow solution is that all data is now returned in interleaved format and must be deinterleaved by the PS in order to obtain the same separated array structures as in the block-level dataflow solution. The FPGA cannot perform deinterleaving into one output stream itself though, as this would again bring us back to unrealistically large memory requirements. We suspect, however, that deinterleaving is not strictly required for all use cases.

Lastly, we give a remark on the average resource usage of 35 %, which is 10 % lower than for the block-level counterpart. This can be attributed to the following factors, albeit with an unknown weight for each item:

• The reduction in communication infrastructure: only one AXI DMA is present in hardware instead of four or five.

• The intrinsically higher degree of resource sharing obtained by making blocks multi-modal at an earlier stage in the design process. This is Section 5.4.2’s hypothesis that we wanted to test.

• The improved handling of data type conversions.

All streams exposed to the PS are represented in floating point format by design, so that the CPU can understand their content. The block-level variant contains the same IP cores as the very first configuration of coexisting kernels. This means, however, that all intermediate inputs and outputs consist of floating point numbers, even those in between multiple blocks, causing some redundant data type conversions to occur. On the other hand, the HLS-level dataflow architecture has to perform fewer conversions in total thanks to the following design choice: inter-kernel streams are left unconverted, so that subsequent kernels do not have to unnecessarily re-convert data from floating point to fixed-point representation. The hypothesis of a fundamental improvement of the HLS-level configuration over its block-level counterpart in terms of resource sharing therefore remains plausible but unconfirmed, and should be tested more strictly by comparing architectures where such interference due to nuisance variables is not present. However, in this particular case it still holds that the resulting design is more hardware-efficient by 10 % on average, thanks to the discussed extra opportunity for avoiding unnecessary computations.



5.5 Discussion

This chapter explored several architectures of an embedded system that deal with off-loading five distinct KinectFusion kernels at once. The complex datapath required us to solve two related problems along the way. The first concerns the retrieval of data streams from in between functional blocks, and the second involves the efficient exploitation of the algorithm’s multi-level nature. Most importantly, three implementations were developed. A summary of how each configuration solves the two aforementioned problems follows:

• Independent coexistence of kernels. In this architecture, all kernels are implemented as separate accelerators. As such, there is little reason to worry about dataflow, because every stream immediately gets written back to DRAM. No task-level pipelining is employed, so that data dependencies are the only reason kernels have to wait for each other. As soon as an output becomes available, it is stored permanently in DRAM from that point onwards: any number of other kernels that might need the data can therefore read it without issues. The blocks themselves are multi-modal by design. Frame-level pipelining can be achieved by scheduling the execution of different components and modes efficiently over time.

• Block-level or HLS-level dataflow. Next, by introducing task-level pipelining, we proposed a family of two configurations that insert small buffers in between kernels rather than imposing on them the burden of passing data via DDR every time. This concept can be applied at two levels: either in the block design or in the HLS top function with a dataflow pragma. Calculated results that are needed in later stages of the algorithm, yet would be lost without corrective measures, are either redirected via a separate AXI DMA core (block-level aggregation), or accumulated throughout the chain of components until the final output stream is reached (HLS-level aggregation). To switch across multiple resolution levels, stream switching IP cores are inserted at the block level, and hybrid kernels are implemented virtually by if-else case switching at the HLS level. No frame-level pipelining is possible in this architecture, because the accelerator is built to be executed in an indivisible manner by design.

Our findings in this chapter related to multi-level executions closely match those by Boikos et al. [3], who present an implementation of semi-dense SLAM on FPGA where the units also support multiple data rates in addition to being multi-modal. The authors confirm our conclusion that following the single-producer, single-consumer principle and reusing hardware units while incorporating adjustable processing paths leads to efficient designs. They did, however, not disclose precisely at which



Configuration          Initiation interval   Iteration latency   Frame rate   BRAM [%]   DSP [%]   FF [%]   LUT [%]
Coexistence            2.53 ms               6.02 ms             395 FPS      24         46        48       88
Block-level dataflow   2.10 ms               2.10 ms             476 FPS      25         46        40       69
HLS-level dataflow     4.13 ms*              4.13 ms*            242 FPS*     16         51        25       48

TABLE 5.4: Comparison of timing and resource profiles after implementing mm2m_sample through vertex2normal as separate accelerators versus applying both discussed multi-level dataflow techniques.

level this multiplexing between different pipelined operation paths occurs, the discernment and evaluation of which we believe is an important contribution of our research.

Lastly, power usage was not taken into account in this chapter. While Vivado does offer detailed information about static and dynamic power consumption of post-implementation FPGA designs, it is much more cumbersome to get a holistic view of the full heterogeneous system’s energy consumption (which is what actually matters). Methods in the literature for measuring the power usage of embedded processing systems are often ad hoc or even not mentioned at all, leading us to ignore this aspect.

5.5.1 Comparison of timing and resource profiles

Our best achieved performance metrics are listed in Table 5.4. The block-level dataflow configuration clearly dominates the coexistence configuration both in terms of speed and resource usage, which is a positive result. This can be explained by the fact that only four AXI DMAs are present on the FPGA in configuration 2, three of which are one-way. In contrast, the first configuration has five two-way DMAs. The third configuration, HLS-level dataflow, has a flaw in the form of a low bandwidth bound. Another drawback is the output stream interleaving all accumulated data (which is the reason for marking the timings with an asterisk). Note that configurations 2 and 3 can be seen as two ends of a spectrum, since only one DMA is used in the latter variant. If instead two DMAs were employed to write the fat output stream back to DRAM, a two-fold increase in throughput might occur, bringing the performance of both configurations on par. We therefore propose investigating hybrid block- and HLS-level dataflow architectures as a possible direction for future research.



Chapter 6

Conclusions and future work

The goal of this master’s dissertation was to implement KinectFusion on the Zynq-7020 SoC on the one hand, and to devise a set of workable guidelines by which to implement similar algorithms and kernels on the other hand. Throughout our research, it quickly became apparent that the scope of full FPGA acceleration would have to be limited to just a subset of all kernels.

First, a methodology was constructed in Chapter 3 that enables the designer to

correctly handle a range of parallel patterns often found in 2D image processing ap-

plications. Techniques that were elaborated and exemplified include pipelining, I/O

streaming, line buffering, array partitioning and scratchpad memories. The HLS re-

port summaries, performance and resource views were pinpointed as indispensable

tools when applying these procedures. Detailed investigations were also made with

respect to the impact of different data types on hardware utilization, and how Vivado
HLS sometimes creates extra overhead in the design by, for example, rounding
up memory sizes to the next power of two.
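The streaming patterns summarized above can be sketched in plain C++. The snippet below is a minimal, hypothetical 3×3 mean filter (not one of the KinectFusion kernels) showing how a two-row line buffer and a fully partitioned sliding window combine with a pipelined inner loop; the #pragma HLS directives are hints that Vivado HLS honours and an ordinary compiler ignores, so the function can be verified on a CPU first. The image size and the border policy (pass-through) are assumptions made for illustration.

```cpp
#include <cstdint>

constexpr int W = 8;   // image width (assumed, kept small for illustration)
constexpr int H = 8;   // image height (assumed)

// Hypothetical 3x3 mean filter over a row-by-row pixel stream.
void mean3x3(const uint8_t in[H][W], uint8_t out[H][W]) {
    uint8_t line_buf[2][W] = {};   // two previous rows (would map to BRAM)
    uint8_t window[3][3] = {};     // 3x3 sliding window (would map to registers)
#pragma HLS ARRAY_PARTITION variable=window complete dim=0

    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
#pragma HLS PIPELINE II=1
            out[y][x] = in[y][x];  // default: pass border pixels through

            // Shift the window left and fill the rightmost column from the
            // line buffers (rows y-2 and y-1) and the incoming pixel (row y).
            for (int i = 0; i < 3; ++i)
                for (int j = 0; j < 2; ++j)
                    window[i][j] = window[i][j + 1];
            window[0][2] = line_buf[0][x];
            window[1][2] = line_buf[1][x];
            window[2][2] = in[y][x];

            // Update the line buffers for the next row.
            line_buf[0][x] = line_buf[1][x];
            line_buf[1][x] = in[y][x];

            // Once the window fully covers image data, emit the mean of the
            // window centred at (y-1, x-1).
            if (y >= 2 && x >= 2) {
                uint16_t sum = 0;
                for (int i = 0; i < 3; ++i)
                    for (int j = 0; j < 3; ++j)
                        sum += window[i][j];
                out[y - 1][x - 1] = static_cast<uint8_t>(sum / 9);
            }
        }
    }
}
```

With the window fully partitioned into registers, the nine reads per pixel happen in parallel, so the pipelined loop can sustain one pixel per clock cycle; without the partition, the nine accesses would contend for the two ports of a single BRAM.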

Next, Chapter 4 describes how eight KinectFusion kernels were sped up signifi-

cantly using the discussed concepts. It is also illustrated that these methods should

not be applied indiscriminately, as pitfalls may otherwise lead to suboptimal
designs. In addition, further opportunities for optimization were discovered

that required a deeper knowledge of the application itself. Both of these reasons led

us to conclude that increasing the degree of automation beyond HLS compilation

might adversely affect the quality of the resulting design.

Afterwards, a step back was taken in Chapter 5 to gain a system-level overview

of the whole application. The first half of KinectFusion was fully off-loaded to

the FPGA, and comparisons were made among different methods to reconcile its

complex dataflow with task-level pipelining and the accumulation of intermediate

streams. The challenge consisted of staying within bounds of the FPGA’s capabili-

ties while achieving the desired functionality without sacrifices. The best configura-

tion with respect to performance was found to arise from composition at the Vivado


Kernel              ARM CPU    HLS estimate   FPGA single-kernel   FPGA multi-kernel
mm2m_sample         2.70 ms      0.38 ms           0.78 ms             2.10 ms*
bilateral_filter     426 ms      0.77 ms           0.78 ms
half_sample         1.82 ms      0.24 ms           0.50 ms
depth2vertex        7.90 ms      1.01 ms           2.05 ms
vertex2normal       27.4 ms      1.01 ms           2.06 ms
track              125.7 ms      23.0 ms
reduce              27.7 ms      3.31 ms
integrate           1236 ms      16.2 ms
raycast             1294 ms

TABLE 6.1: Time spent in each kernel when KinectFusion is executed on either the ARM Cortex-A9 CPU or the Xilinx Zynq-7020 FPGA of the embedded SoC. *The multi-kernel timing of 2.10 ms is the combined execution time of the first five kernels.

block design level, while the least hardware resources seem to be used if all func-

tionality is combined within one large HLS IP core instead.

For convenience, the timings obtained throughout this dissertation are summarized
in Table 6.1. The HLS column stems from Chapter 4, while the FPGA

columns originate from Chapter 5. HLS reports indicated a median speed-up of

×8.10 of the first eight kernels, and FPGA executions revealed an actual speed-up

of ×222 of the first five kernels combined. Summing all eight HLS timings and di-

viding them by the sum of the first eight CPU timings yields a ratio of ×40.4. This

value can be treated as an estimated holistic (i.e. weighted average) speed-up factor,

comparing regular execution on the CPU with HLS estimates after optimizing all

eight kernels.
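As a sanity check (not part of the original toolflow), these summary figures can be recomputed directly from the timings in Table 6.1. The sketch below assumes the values as listed; the helper name table61_summary is ours.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct Summary { double median, holistic, actual; };

// Recompute the three speed-up figures quoted in the text from the
// Table 6.1 timings (all values in milliseconds).
Summary table61_summary() {
    const std::vector<double> cpu = {2.70, 426.0, 1.82, 7.90,
                                     27.4, 125.7, 27.7, 1236.0};
    const std::vector<double> hls = {0.38, 0.77, 0.24, 1.01,
                                     1.01, 23.0, 3.31, 16.2};

    // Per-kernel speed-ups and their median (x8.10 in the text).
    std::vector<double> ratio;
    for (std::size_t i = 0; i < cpu.size(); ++i)
        ratio.push_back(cpu[i] / hls[i]);
    std::sort(ratio.begin(), ratio.end());
    const double median = (ratio[3] + ratio[4]) / 2.0;  // eight elements

    // Holistic (weighted average) speed-up: ratio of summed timings
    // (x40.4 in the text).
    const double sum_cpu = std::accumulate(cpu.begin(), cpu.end(), 0.0);
    const double sum_hls = std::accumulate(hls.begin(), hls.end(), 0.0);
    const double holistic = sum_cpu / sum_hls;

    // Actual FPGA speed-up of the first five kernels combined, using the
    // 2.10 ms multi-kernel timing (x222 in the text).
    const double first_five = std::accumulate(cpu.begin(), cpu.begin() + 5, 0.0);
    const double actual = first_five / 2.10;

    return {median, holistic, actual};
}
```

Note how the median of ×8.10 and the holistic ratio of ×40.4 diverge: the sum is dominated by the slowest CPU kernels (bilateral_filter, integrate), which also happen to benefit the most from acceleration.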

Finally, a remark on programmability is made. Throughout this thesis, we have

experienced how FPGAs remain notably difficult to program. Nonetheless, a positive
evolution is indisputably present: thanks to the advances in HLS, familiarity

with low-level hardware details such as propagation delays and the architecture of

basic logic elements is not required anymore, which stands in stark contrast to a

decade ago [73]. However, in addition to the unique duality and high-dimensional

constraints of designing heterogeneous CPU-FPGA systems as described in the in-

troduction, the toolchains in existence were not found to be bug-free. We believe that

these facts pose challenges to the accessibility and popularity of designing FPGA

hardware, and efforts towards improving this workflow can only be encouraged.

6.1 Future work

From a practical standpoint, the most drastic speed-up was achieved for bilateral_filter,

although integrate and raycast are equally (if not more) important candidates. A clear


direction for future research is hence presented. More specifically, the HLS design

of raycast ought to be investigated in detail. Chapter 4 already elucidated several

reasons why its acceleration is bound to be a sophisticated undertaking, but no
hard statements can be made until it is actually attempted.1

Second, Table 4.2 made it clear that not all kernels can fit together on a
low-end Zynq-7020 FPGA, judging by their resource usage alone. Op-

portunities for higher-end FPGAs should therefore be researched, and/or systems

with multiple FPGAs placed in cascade. Perhaps an embedded GPU could be used

for raycast, should this last step be deemed unfit for FPGAs anyway. One advantage

of using a more expensive FPGA is that clock periods can be reduced further be-

low 10 ns, which would lead to even better performance results in addition to being

able to accommodate a larger number of algorithmic components at once. Another

advantage is their larger internal memory; this extra space makes the local caching

strategies for random data access discussed in Section 3.2.4 more appealing.

Third, the block-level and HLS-level architectures for multi-kernel acceleration

explored in Chapter 5 do not represent the full design space. Once treated as two

ends of a spectrum, a mixture of both concepts could be devised so as to combine
the best of both worlds in terms of timing and resources. In addition, the effect

on hardware utilization of moving to one extremum or the other should be studied

more closely, as some uncontrolled variables were present in our experiments mak-

ing the results slightly less reliable. Lastly, the dataflow techniques should also be

evaluated more extensively by applying them to the remaining two relevant kernels,

track and reduce.

1Several paths can be undertaken here, including a fundamental transformation of the algorithm in order to use less image data overall. However, this would turn the fully dense SLAM application into a semi-dense variant, which might not be desirable with respect to preserving the quality of the reconstruction and localization. Furthermore, recent solutions for semi-dense SLAM already exist [3], [42], [43], albeit unrelated to KinectFusion.


Bibliography

[1] Xilinx, Vivado Design Suite User Guide: High-Level Synthesis v2018.2, 2018. [On-

line]. Available: https://www.xilinx.com/support/documentation/sw_

manuals/xilinx2017_4/ug902-vivado-high-level-synthesis.pdf.

[2] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid,

and J. J. Leonard, “Past, present, and future of simultaneous localization and

mapping: Toward the robust-perception age”, IEEE Transactions on Robotics,

vol. 32, no. 6, pp. 1309–1332, 2016, ISSN: 15523098. DOI: 10.1109/TRO.2016.2624754. arXiv: 1606.05830.

[3] K. Boikos and C.-S. Bouganis, “A Scalable FPGA-based Architecture for Depth

Estimation in SLAM”, Applied Reconfigurable Computing, 2019. arXiv: 1902.04907. [Online]. Available: http://arxiv.org/abs/1902.04907.

[4] R. Nane, V. M. Sima, C. Pilato, J. Choi, B. Fort, A. Canis, Y. T. Chen, H. Hsiao, S.

Brown, F. Ferrandi, J. Anderson, and K. Bertels, “A Survey and Evaluation of

FPGA High-Level Synthesis Tools”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 35, no. 10, pp. 1591–1604, 2016, ISSN:

02780070. DOI: 10.1109/TCAD.2015.2513673.

[5] R. A. Newcombe, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J.

Shotton, S. Hodges, and A. Fitzgibbon, “KinectFusion: Real-Time Dense Sur-

face Mapping and Tracking”, Tech. Rep., 2011.

[6] B. Bodin, H. Wagstaff, S. Saeedi, L. Nardi, E. Vespa, J. H. Mayer, A. Nisbet, M.

Luján, S. Furber, A. J. Davison, P. H. J. Kelly, and M. O’Boyle, “SLAMBench2:

Multi-Objective Head-to-Head Benchmarking for Visual SLAM”, 2018. arXiv:

1808.06820. [Online]. Available: http://arxiv.org/abs/1808.06820.

[7] M. Abouzahir, A. Elouardi, R. Latif, S. Bouaziz, and A. Tajer, “Embedding

SLAM algorithms: Has it come of age?”, Robotics and Autonomous Systems,

2018, ISSN: 09218890. DOI: 10.1016/j.robot.2017.10.019.

[8] J. Engel, J. Sturm, and D. Cremers, “Semi-dense visual odometry for a monocu-

lar camera”, Proceedings of the IEEE International Conference on Computer Vision,

pp. 1449–1456, 2013. DOI: 10.1109/ICCV.2013.183.

[9] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. Kelly, A. J. Davi-

son, M. Luján, M. F. O’Boyle, G. Riley, N. Topham, and S. Furber, “Introduc-

ing SLAMBench, a performance and accuracy benchmarking methodology for


SLAM”, in Proceedings - IEEE International Conference on Robotics and Automation, vol. 2015-June, 2015, pp. 5783–5790, ISBN: 9781479969234. DOI: 10.1109/ICRA.2015.7140009. eprint: 1410.2167.

[10] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark

for the evaluation of RGB-D SLAM systems”, IEEE International Conference on Intelligent Robots and Systems, pp. 573–580, 2012, ISSN: 21530858. DOI: 10.1109/IROS.2012.6385773.

[11] R. Mur-Artal and J. D. Tardos, “ORB-SLAM2: An Open-Source SLAM System

for Monocular, Stereo, and RGB-D Cameras”, IEEE Transactions on Robotics,

vol. 33, no. 5, pp. 1255–1262, 2017, ISSN: 1552-3098. DOI: 10.1109/TRO.2017.

2705103. arXiv: 1610.06475. [Online]. Available: http://ieeexplore.ieee.

org/document/7946260/.

[12] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-Scale Direct Monocu-

lar SLAM”, European Conference on Computer Vision, 2014, ISSN: 00201693. DOI:

10.1016/S0020-1693(00)81721-1.

[13] T. Whelan, R. F. Salas-Moreno, B. Glocker, A. J. Davison, and S. Leutenegger,

“ElasticFusion: Real-time dense SLAM and light source estimation”, International Journal of Robotics Research, vol. 35, no. 14, pp. 1697–1716, 2016, ISSN:

17413176. DOI: 10.1177/0278364916669237.

[14] V. A. Prisacariu, O. Kähler, S. Golodetz, M. Sapienza, T. Cavallari, P. H. S. Torr,

and D. W. Murray, “InfiniTAM v3: A Framework for Large-Scale 3D Recon-

struction with Loop Closure”, 2017. arXiv: 1708.00783. [Online]. Available:

http://arxiv.org/abs/1708.00783.

[15] Y. Bai, M. Alawad, R. DeMara, and M. Lin, “Optimally Fortifying Logic Relia-

bility through Criticality Ranking”, Electronics, vol. 4, no. 1, pp. 150–172, 2015.

DOI: 10.3390/electronics4010150.

[16] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26,

no. 2, pp. 203–215, 2007, ISSN: 02780070. DOI: 10.1109/TCAD.2006.884574.

[17] D. Koch, F. Hannig, and D. Ziener, FPGAs for software programmers. 2016, pp. 1–

327, ISBN: 9783319264080. DOI: 10.1007/978-3-319-26408-0.

[18] S. Asano, T. Maruyama, and Y. Yamaguchi, “Performance comparison of FPGA,

GPU and CPU in image processing”, FPL 09: 19th International Conference on Field Programmable Logic and Applications, pp. 126–131, 2009. DOI: 10.1109/FPL.2009.5272532.

[19] Tedway, What are FPGAs and Project Brainwave? - Azure Machine Learning service, 2019. [Online]. Available: https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-accelerate-with-fpgas.


[20] J. Fowers, G. Brown, P. Cooke, and G. Stitt, “A performance and energy com-

parison of FPGAs, GPUs, and multicores for sliding-window applications”,

p. 47, 2012. DOI: 10.1145/2145694.2145704.

[21] K. Rafferty, D. Crookes, F. Siddiqui, T. Deng, R. Woods, U. Minhas, and S.

Amiri, “FPGA-Based Processor Acceleration for Image Processing Applica-

tions”, Journal of Imaging, vol. 5, no. 1, p. 16, 2019. DOI: 10.3390/jimaging5010016.

[22] Xilinx, Zynq-7000 SoC: Technical Reference Manual v1.12.2, 2018. [Online]. Avail-

able: https://www.xilinx.com/support/documentation/user_guides/

ug585-Zynq-7000-TRM.pdf.

[23] ——, Vivado Design Suite: AXI Reference Guide v4.0, 2017. [Online]. Available:

https://www.xilinx.com/support/documentation/ip_documentation/axi_

ref_guide/latest/ug1037-vivado-axi-reference-guide.pdf.

[24] B. J. Svensson, “Exploring OpenCL Memory Throughput on the Zynq”, 2016.

[25] E. H. D’Hollander, “High-Level Synthesis Optimization for Blocked Floating-

Point Matrix Multiplication”, ACM SIGARCH Computer Architecture News, vol. 44,

no. 4, pp. 74–79, 2017, ISSN: 01635964. DOI: 10.1145/3039902.3039916.

[26] Avnet, Zynq Evaluation and Development Board: Hardware User’s Guide v2.2, 2014.

[Online]. Available: http://zedboard.org/sites/default/files/documentations/

ZedBoard_HW_UG_v2_2.pdf.

[27] Xilinx, Embedded vision solutions powered by xilinx, 2019. [Online]. Available:

https://www.xilinx.com/applications/megatrends/video-vision.html.

[28] E. Billauer, High resolution images of the ZedBoard, 2012. [Online]. Available: http://billauer.co.il/blog/2012/09/zedboard-zynq-images/.

[29] Xilinx, “Vivado Design Suite User Guide: Design Flows Overview v2018.2”,

in, 2018, ch. 1. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_2/ug892-vivado-design-flows-overview.pdf.

[30] ——, “Vivado Design Suite User Guide: Embedded Processor Hardware De-

sign v2018.2”, in, 2018, ch. 3. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_2/ug898-vivado-embedded-design.pdf.

[31] ——, “Vivado Design Suite User Guide: Programming and Debugging v2017.4”,

in, 2018, ch. 9-12. [Online]. Available: https://www.xilinx.com/support/

documentation/sw_manuals/xilinx2017_4/ug908-vivado-programming-

debugging.pdf.

[32] ——, Using Xilinx SDK, 2017. [Online]. Available: https://www.xilinx.com/

html_docs/xilinx2017_4/SDK_Doc/index.html.


[33] ——, “Zynq-7000 All Programmable SoC: Embedded Design Tutorial: A Hands-

On Guide to Effective Embedded System Design v2017.4”, in, 2017. [Online].

Available: https://www.xilinx.com/support/documentation/sw_manuals/

xilinx2017_4/ug1165-zynq-embedded-design-tutorial.pdf.

[34] R. N. Appel and H. H. Folmer, “Analysis, optimization, and design of a SLAM

solution for an implementation on reconfigurable hardware (FPGA) using CλaSH”,

PhD thesis, 2016. [Online]. Available: http://essay.utwente.nl/71550/.

[35] V. Bonato, E. Marques, and G. A. Constantinides, “A floating-point extended

Kalman filter implementation for autonomous mobile robots”, 2007 International Conference on Field Programmable Logic and Applications, 2007. DOI: 10.1109/fpl.2007.4380720.

[36] W. Fang, Y. Zhang, B. Yu, and S. Liu, “FPGA-based ORB feature extraction for

real-time visual SLAM”, in 2017 International Conference on Field-Programmable Technology, ICFPT 2017, vol. 2018-January, 2018, pp. 275–278, ISBN: 9781538626559.

DOI: 10.1109/FPT.2017.8280159. arXiv: 1710.07312.

[37] Q. Gautier, A. Shearer, J. Matai, D. Richmond, P. Meng, and R. Kastner, “Real-

time 3D reconstruction for FPGAs: A case study for evaluating the perfor-

mance, area, and programmability trade-offs of the Altera OpenCL SDK”, in

Proceedings of the 2014 International Conference on Field-Programmable Technology, FPT 2014, 2015, pp. 326–329, ISBN: 9781479962457. DOI: 10.1109/FPT.2014.7082810.

[38] M. Gu, K. Guo, W. Wang, Y. Wang, and H. Yang, “An FPGA-based Real-time

Simultaneous Localization and Mapping System”, no. 61373026, pp. 0–3, 2015.

[39] D. Törtei Tertei, J. Piat, and M. Devy, “FPGA design of EKF block accelerator

for 3D visual SLAM”, Computers and Electrical Engineering, vol. 55, pp. 1339–

1351, 2016, ISSN: 00457906. DOI: 10.1016/j.compeleceng.2016.05.003.

[40] B. W. Williams, J. Zambreno, and P. Jones, “Evaluation of a SoC for Real-Time

3D SLAM”, PhD thesis, 2017. [Online]. Available: https://lib.dr.iastate.

edu/etd.

[41] M. Abouzahir, A. Elouardi, S. Bouaziz, O. Hammami, and I. Ali, “High-level

synthesis for FPGA design based-SLAM application”, in Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA, 2017,

ISBN: 9781509043200. DOI: 10.1109/AICCSA.2016.7945638.

[42] K. Boikos and C. S. Bouganis, “Semi-dense SLAM on an FPGA SoC”, in FPL 2016 - 26th International Conference on Field-Programmable Logic and Applications,

2016, ISBN: 9782839918442. DOI: 10.1109/FPL.2016.7577365.


[43] ——, “A high-performance system-on-chip architecture for direct tracking for

SLAM”, in 2017 27th International Conference on Field Programmable Logic and Applications, FPL 2017, 2017, ISBN: 9789090304281. DOI: 10.23919/FPL.2017.8056831.

[44] O. Wasenmüller and D. Stricker, “Comparison of Kinect v1 and v2 depth images in terms of accuracy and precision”, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10117 LNCS, pp. 34–45, 2017, ISSN: 16113349. DOI: 10.1007/978-3-319-54427-4_3.

[45] A. Handa, T. Whelan, J. Mcdonald, and A. J. Davison, “A Benchmark for RGB-

D Visual Odometry, 3D Reconstruction and SLAM”, IEEE International Conference on Robotics and Automation (ICRA), 2014. DOI: 10.1109/ICRA.2014.6907054.

[46] F. Durand and J. Dorsey, “Fast Bilateral Filtering for the Display of High-

Dynamic-Range Images”, ACM Trans. Graph. (Proc. SIGGRAPH), pp. 257–266,

2002.

[47] E. Eade, “Lie Groups for Computer Vision”, Website, pp. 1–15, 2014. [Online].

Available: http://ethaneade.com/lie_groups.pdf.

[48] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. Kelly, A. J. Davi-

son, M. Luján, M. F. O’Boyle, G. Riley, N. Topham, and S. Furber, Pamela-project/slambench, 2017. [Online]. Available: https://github.com/pamela-

project/slambench.

[49] B. Bodin, H. Wagstaff, S. Saeedi, L. Nardi, E. Vespa, J. H. Mayer, A. Nisbet,

M. Luján, S. Furber, A. J. Davison, P. H. J. Kelly, and M. O’Boyle, Pamela-project/slambench2, 2019. [Online]. Available: https://github.com/pamela-

project/slambench2.

[50] G. Reitmayr, Gerhardr/kfusion, 2013. [Online]. Available: https://github.com/

GerhardR/kfusion.

[51] E. Rosten, Toon: Tom’s object-oriented numerics library, 2018. [Online]. Available:

https://www.edwardrosten.com/cvd/toon.html.

[52] S. Zennaro, M. Munaro, S. Milani, P. Zanuttigh, A. Bernardi, S. Ghidoni, and E.

Menegatti, “Performance evaluation of the 1st and 2nd generation Kinect for

multimedia applications”, Proceedings - IEEE International Conference on Multimedia and Expo, vol. 2015-August, pp. 1–6, 2015, ISSN: 1945788X. DOI: 10.1109/ICME.2015.7177380.

[53] M. D. McCool, “Structured Parallel Programming with Deterministic Patterns”,

Dr. Dobb Journal, no. June 2010, pp. 7–12, 2010, ISSN: 0960-1317. DOI: 10.1088/

0960-1317/5/3/002.


[54] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. 2007, p. 423, ISBN: 9780123704900.

[55] M. Schmid, N. Apelt, F. Hannig, and J. Teich, “An image processing library for

C-based high-level synthesis”, Conference Digest - 24th International Conference on Field Programmable Logic and Applications, FPL 2014, 2014. DOI: 10.1109/FPL.2014.6927424.

[56] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell, A. Vasi-

lyev, M. Horowitz, and P. Hanrahan, “Darkroom: Compiling High-Level Im-

age Processing Code into Hardware Pipelines”, ACM Transactions on Graphics,

vol. 33, no. 4, pp. 1–11, 2014, ISSN: 07300301. DOI: 10.1145/2601097.2601174.

[Online]. Available: http://dl.acm.org/citation.cfm?doid=2601097.

2601174.

[57] “Rigel: flexible multi-rate image processing hardware”, ACM Trans. Graph., vol. 35, no. 4, 85:1–85:11, 2016.

[58] O. Reiche, M. A. Ozkan, R. Membarth, J. Teich, and F. Hannig, “Generat-

ing FPGA-based image processing accelerators with Hipacc: (Invited paper)”,

IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD, vol. 2017-November, pp. 1026–1033, 2017, ISSN: 10923152. DOI:

10.1109/ICCAD.2017.8203894.

[59] Xilinx, Application Note: Zynq-7000 AP SoC: Demystifying the Lucas-Kanade Optical Flow Algorithm with Vivado HLS v1.0, 2017. [Online]. Available: https:

//www.xilinx.com/support/documentation/application_notes/xapp1300-

lucas-kanade-optical-flow.pdf.

[60] ——, Application Note: Vivado HLS: Implementing Memory Structures for Video Processing in the Vivado HLS Tool v1.0, 2012. [Online]. Available: https://www.

xilinx.com/support/documentation/application_notes/xapp793-memory-

structures-video-vivado-hls.pdf.

[61] ——, 7 Series FPGAs Memory Resources: User Guide v1.13, 2019. [Online]. Avail-

able: https://www.xilinx.com/support/documentation/user_guides/

ug473_7Series_Memory_Resources.pdf.

[62] B. Da Silva, A. Braeken, E. H. D’Hollander, and A. Touhafi, “Performance and

resource modeling for FPGAs using high-level synthesis tools”, Advances in Parallel Computing, vol. 25, pp. 523–531, 2014, ISSN: 09275452. DOI: 10.3233/978-1-61499-381-0-523.

[63] Xilinx, CORDIC v6.0: LogiCORE IP Product Guide, 2017. [Online]. Available:

https://www.xilinx.com/support/documentation/ip_documentation/

cordic/v6_0/pg105-cordic.pdf.


[64] ——, Reduce Power and Cost by Converting from Floating Point to Fixed Point v1.0,

2017. [Online]. Available: http://xilinx.eetrend.com/files-eetrend-xilinx/download/201706/11535-30442-wp491-floating-fixed-point.pdf.

[65] L. Saldanha and R. Lysecky, “Float-to-fixed and fixed-to-float hardware con-

verters for rapid hardware/software partitioning of floating point software

applications to static and dynamic fixed point coprocessors”, Design Automation for Embedded Systems, vol. 13, no. 3, pp. 139–157, 2009, ISSN: 09295585. DOI:

10.1007/s10617-009-9044-4.

[66] L. Yang, L. Zhang, H. Dong, A. Alelaiwi, and A. E. Saddik, “Evaluating and

improving the depth accuracy of Kinect for Windows v2”, IEEE Sensors Journal, vol. 15, no. 8, pp. 4275–4285, 2015, ISSN: 1530437X. DOI: 10.1109/JSEN.2015.2416651.

[67] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images”,

1998, pp. 839–846.

[68] Xilinx, Xilinx OpenCV User Guide, 2017. [Online]. Available: https://www.

xilinx.com/support/documentation/sw_manuals/xilinx2017_4/ug1233-

xilinx-opencv-user-guide.pdf.

[69] R. Smeenk, Kinect v1 and kinect v2 fields of view compared, 2014. [Online]. Avail-

able: https://smeenk.com/kinect-field-of-view-comparison/.

[70] J. Lee, T. Ueno, M. Sato, and K. Sano, “High-productivity Programming and

Optimization Framework for Stream Processing on FPGA”, HEART 2018 Proceedings of the 9th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, pp. 1–6, 2018. DOI: 10.1145/3241793.3241798.

[71] P. Zhang, M. Huang, B. Xiao, H. Huang, and J. Cong, “CMOST: A system-level FPGA compilation framework”, Design Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE, 2015, ISSN: 03649059. DOI: 10.1145/2744769.2744807.

[72] Xilinx, Vivado HLS Optimization Methodology Guide v2018.1, 2018. [Online]. Avail-

able: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_1/ug1270-vivado-hls-opt-methodology-guide.pdf.

[73] W. MacLean, “An Evaluation of the Suitability of FPGAs for Embedded Vi-

sion Systems”, in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05) - Workshops, IEEE, 2006, pp. 131–131. DOI:

10.1109/cvpr.2005.408.

