
  • FPGA-Based Simultaneous Localization and Mapping (SLAM) using High-Level Synthesis

    Basile Van Hoorick
    Student number: 01404852

    Supervisors: Prof. dr. ir. Bart Goossens, Prof. dr. ir. Erik D'Hollander
    Counsellors: Dr. ir. Jan Aelterman, Ir. Michiel Vlaminck

    Master's dissertation submitted in order to obtain the academic degree of
    Master of Science in Electrical Engineering - main subject Communication and Information Technology

    Academic year 2018-2019

  • Admission to Loan

    The author gives his permission to make this master's dissertation available for consultation and to copy parts of this master's dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to explicitly state the source when quoting results from this master's dissertation.

    Basile Van Hoorick, May 2019

  • Acknowledgements

    First and foremost, I would like to express my sincerest gratitude towards Prof. dr. ir. Bart Goossens and Prof. em. dr. ir. Erik D'Hollander for giving me the opportunity to conduct this master's dissertation at the Department of Telecommunications and Information Processing. I truly appreciate their vast expertise and would like to thank them for their guidance towards making substantiated decisions, as well as for their outstanding passion for their respective fields of expertise.

    In particular, Prof. em. dr. ir. Erik D'Hollander of the Department of Electronics and Information Systems has been extremely helpful with regard to the sophisticated practicalities of testing heterogeneous computer systems. I have learned a great deal about Field-Programmable Gate Arrays over the past ten months, and I could not have wished for a more driven and competent supervisor.

    I also want to thank Prof. dr. ir. Bart Goossens for offering his aid and extensive knowledge regarding Simultaneous Localization and Mapping, as well as for providing me with helpful suggestions and tips throughout the year. Furthermore, I am grateful to Prof. dr. ir. Wilfried Philips, Prof. dr. ir. Peter Veelaert and other researchers at the Image Processing and Interpretation group for their valuable feedback and advice given during the two intermediate thesis presentations.

    Last but not least, I would like to thank my parents, family and friends for their indispensable support and encouragement throughout the entire period of my studies. Distinct credit goes to Tinus Pannier, Clemens Schlegel, Jacques Van Damme and Viktor Verstraelen, with whom I have shared many pleasant breaks and memorable moments during this exceptionally busy year.

    Basile Van Hoorick, May 2019

  • FPGA-Based Simultaneous Localization and Mapping using High-Level Synthesis

    by

    Basile VAN HOORICK

    Master’s dissertation submitted in order to obtain the academic degree of

    MASTER OF SCIENCE IN ELECTRICAL ENGINEERING

    Academic year 2018-2019

    Promoters: Prof. dr. ir. Bart GOOSSENS, Prof. em. dr. ir. Erik D’HOLLANDER

    Supervisors: dr. ir. Jan AELTERMAN, ir. Michiel VLAMINCK

    Faculty of Engineering and Architecture

    Ghent University

    Department of Telecommunications and Information Processing

    Chairman: Prof. dr. ir. Joris WALRAEVENS

    Abstract

    SLAM is growing in popularity despite the lack of an embedded, low-power yet real-time solution for dense 3D scene reconstruction. An attempt to fill this gap with the Xilinx Zynq-7020 SoC resulted in the formulation and evaluation of a detailed methodology that tackles several types of typical routines in the image processing domain using HLS. The devised principles and guidelines are then tested by applying them to eight kernels of an established 3D SLAM application, revealing powerful potential and an estimated holistic speed-up of ×40.4 over execution on the ARM Cortex-A9 CPU. Multi-modal, multi-resolution dataflow architectures are subsequently proposed and compared with the purpose of efficiently mapping algorithmic blocks and their interconnections to hardware while conforming to the FPGA's limitations. A trade-off between area and throughput appears to be the deciding factor, although further research is desired towards merging the two identified Pareto-optimal techniques.

    Keywords

    Simultaneous Localization and Mapping, Field-Programmable Gate Array, High-Level Synthesis, Image Processing, System-on-Chip

  • FPGA-Based Simultaneous Localization and Mapping using High-Level Synthesis

    Basile Van Hoorick

    Supervisors: Prof. dr. ir. Bart Goossens, Prof. em. dr. ir. Erik D'Hollander

    Abstract — SLAM is growing in popularity despite the lack of an embedded, low-power yet real-time solution for dense 3D scene reconstruction. An attempt to fill this gap with the Xilinx Zynq-7020 SoC resulted in the formulation and evaluation of a detailed methodology that tackles several types of typical routines in the image processing domain using HLS. The devised principles and guidelines are then tested by applying them to eight kernels of an established 3D SLAM application, revealing powerful potential and an estimated holistic speed-up of ×40.4 over execution on the ARM Cortex-A9 CPU. Multi-modal, multi-resolution dataflow architectures are subsequently proposed and compared with the purpose of efficiently mapping algorithmic blocks and their interconnections to hardware while conforming to the FPGA's limitations. A trade-off between area and throughput seems to be the deciding factor, although further research is desired towards merging the two identified Pareto-optimal techniques.

    Keywords — Simultaneous Localization and Mapping, Field-Programmable Gate Array, High-Level Synthesis, Image Processing, System-on-Chip

    I. INTRODUCTION

    As we embark on the road towards a more autonomous world, countless challenges and opportunities emerge in various subdisciplines of computer architecture, algorithm design and electronics. One such challenge is Simultaneous Localization and Mapping (SLAM), which attempts to make a robot aware of its surroundings. The goal of SLAM is to track the position and orientation of an agent within an unknown environment, while simultaneously constructing a model of this very environment [1]. Dense SLAM variants distinguish themselves from their sparse counterparts by incorporating as much sensor data as possible into their global reconstruction. However, their considerable advantage in the form of producing a high-quality model that is reusable across applications comes at the cost of far greater computational complexity [2]. At the same time, embedded SLAM solutions are in high demand due to their many use cases on mobile and low-power devices such as autonomous vehicles [3].

    In this master's dissertation, a framework is presented by which SLAM and, by extension, image processing kernels in general can be mapped effectively onto Field-Programmable Gate Arrays (FPGAs). The FPGA is a reconfigurable integrated circuit that can reach high performance yet low power consumption [4], offering a flexible platform on which to evaluate the hardware implementation of a dense 3D SLAM algorithm. High-Level Synthesis (HLS) tools are employed because of their apt capability to perform high-level, pragma-directed compilation of C code into hardware [5]. The use case of choice is KinectFusion, a prominent scene reconstruction algorithm [6] that is representative of diverse paradigms in both 2D and 3D image processing. The only existing work in the literature that accelerates parts of KinectFusion on an FPGA also uses a GPU [7], which is avoided in this thesis due to its high energy consumption. We also explore how multiple kernels with complex dataflow characteristics can be combined in hardware so as to form an efficient, large-scale pipeline consisting of functional blocks.

    II. HLS DESIGN OF INDIVIDUAL KERNELS

    A. Methodology

    Every kernel under consideration can be categorized according to one or multiple parallel patterns most closely associated with its computational and/or data management structure [8]. Techniques are developed to deal with the following patterns in HLS (a code sketch of the Stencil case follows Figure 1):

    • Map & Reduce: The independence of every input (and output) pixel lends itself to the application of pipelining and AXI streaming interfaces, enforcing the single-read, single-write principle for every element in the array while overlapping multiple instances of similar calculations in time, so as to enable efficient use of DSPs and other hardware blocks.

    • Stencil: In addition to the above, line buffers and memory windows (see Figure 1) must be inserted in order to fully exploit data reuse and preserve the I/O streaming model [9][10]. Further speed-ups are obtained by partitioning both arrays in certain dimensions across multiple instances of local storage, which prevents the internal block RAM from becoming a bottleneck due to the high number of concurrent data accesses.

    • Gather: Reads from irregular positions in large arrays are more complicated to handle on an FPGA due to its limited local memory size. As continuous requests to DDR DRAM form significant bottlenecks in practice [11], the use of scratchpads is recommended to cache (portions of) the region of interest. Multiple re-executions of the subroutine might be necessary to adequately deal with all required data.

    Figure 1: Interaction between the line buffer and window for Stencil-type kernels, visualized onto the input image (left) and as they are structured in memory (right).
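
    To make the Stencil guidelines above concrete, the following Vivado HLS sketch combines an AXI stream, an II=1 pipeline, a line buffer and a fully partitioned memory window. It is a minimal illustration rather than the thesis code: the 3x3 kernel size, the 320x240 resolution and the plain averaging operation are assumptions, and border handling is omitted.

    #include <hls_stream.h>

    #define W 320  // image width (assumed)
    #define H 240  // image height (assumed)

    // 3x3 Stencil skeleton: one pixel read and one pixel written per cycle.
    void stencil_kernel(hls::stream<float> &in, hls::stream<float> &out) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
        static float line_buf[2][W];   // two most recent image rows
        float window[3][3];            // neighbourhood under the stencil
        // Partition so that both buffered rows and all nine window elements
        // can be accessed in the same clock cycle (see Figure 1).
    #pragma HLS ARRAY_PARTITION variable=line_buf complete dim=1
    #pragma HLS ARRAY_PARTITION variable=window complete dim=0

        for (int y = 0; y < H; y++) {
            for (int x = 0; x < W; x++) {
    #pragma HLS PIPELINE II=1
                float px = in.read();               // single read per element
                for (int i = 0; i < 3; i++)         // shift the window left
                    for (int j = 0; j < 2; j++)
                        window[i][j] = window[i][j + 1];
                window[0][2] = line_buf[0][x];      // refill the last column
                window[1][2] = line_buf[1][x];
                window[2][2] = px;
                line_buf[0][x] = line_buf[1][x];    // age the line buffers
                line_buf[1][x] = px;
                float acc = 0.0f;                   // placeholder operation:
                for (int i = 0; i < 3; i++)         // a plain 3x3 average
                    for (int j = 0; j < 3; j++)
                        acc += window[i][j];
                out.write(acc / 9.0f);              // single write per element
            }
        }
    }

    Without the two ARRAY_PARTITION directives, the nine concurrent window reads would contend for the two ports of a single block RAM and the compiler could not reach an initiation interval of one.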

    An initiation interval (II) of one clock cycle is the goal in the majority of cases, so that no further speed-up is possible unless processing elements are duplicated. The selection between fixed-point and floating-point data type representations depends on the complexity and kind of operations employed in each kernel, but the former usually results in a more hardware-efficient design, despite the possible overhead introduced by conversions.
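
    As a brief illustration of this data type trade-off, the snippet below is a sketch only: the ap_fixed<18,6> width is an arbitrary assumption and would in practice be derived from the statistics of typical values in each kernel, as discussed above.

    #include <ap_fixed.h>
    #include <hls_stream.h>

    typedef ap_fixed<18, 6> fix_t;  // 18 bits total, 6 integer bits (assumed)

    // Map-style kernel: per-element scaling performed in fixed point.
    void scale_kernel(hls::stream<float> &in, hls::stream<float> &out) {
        const fix_t gain = 0.001f;  // e.g. a millimeter-to-meter conversion
        for (int i = 0; i < 320 * 240; i++) {
    #pragma HLS PIPELINE II=1
            fix_t v = (fix_t)in.read();  // float-to-fixed conversion
            fix_t r = v * gain;          // maps to a single DSP multiply
            out.write((float)r);         // fixed-to-float conversion
        }
    }

    The two conversions at the stream boundaries are the overhead mentioned above; they only pay off when the kernel performs enough arithmetic in between.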

    B. Implementation of KinectFusion

    Eight SLAM kernels are examined and optimized in Vivado HLS, leading to a median speed-up of ×30.5 from purely applying the presented methodology, compared with leaving the code unchanged. Additional transformations, which require thorough insight into the use case as well as statistical analysis of typical values in various steps of the algorithm using real-world data, lead to a further median speed-up of ×2.45 and additional decreases in resource utilization. According to this evaluation, the most significant performance gains clearly originate from the discussed standard approaches, although it remains important to incorporate application-specific knowledge as well, to avoid superfluous hardware usage and suboptimal designs.

    III. COMBINED ACCELERATION OF MULTIPLE KERNELS

    A. Problem statement and initial configuration

    The complex, multi-resolution nature of tracking is reflected in its requirement of seven output streams from the preceding stages of KinectFusion, shown in Figure 2. Traditional task-level pipelining does not capture how stream duplication or multi-modal paths should be handled. The dataflow can be broken down into two more general challenges: one is the accumulation of intermediate results down a pipelined path, and the other concerns creating multi-modal blocks so as to maximize resource sharing across different functional paths. Three distinct ways in which both of these issues can be resolved are proposed and compared. The first places all accelerators independently on the FPGA, each with its own AXI DMA, and passes all data via DRAM. The described difficulties are largely sidestepped this way; however, better results are expected once task-level pipelining between subsequent blocks is employed.

    Figure 2: Dataflow diagram of KinectFusion's first five kernels.

    B. Block-level and HLS-level pipelined architectures

    In the Vivado block design, collecting intermediate outputs is done by redirecting the needed streams from in between multiple components directly back to the processing system via an AXI DMA. Multiple modes can be activated either by setting control signals via the AXI-Lite protocol, or by inserting stream-switching IP cores to enable the selection among different blocks altogether.

    The same principles can also be applied at the level of Vivado HLS, albeit after taking special measures to reconcile them with the HLS dataflow optimization directive. This includes strict adherence to the single-producer, single-consumer paradigm and the non-conditional execution of blocks. Intermediate output aggregation is achieved by programming virtual pass-through connections and having each kernel attach its own output values to the increasingly wide stream of interleaved data. Multi-modality of kernels is translated into if-else case-switching inside loop bodies, as sketched below.
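
    The following sketch, with invented names and placeholder computations rather than the thesis source, shows how these two ideas look in Vivado HLS: a pass-through stream forwards an intermediate output alongside the next kernel's result, and if-else case-switching inside the loop body realizes multi-modality while every block still executes unconditionally under the DATAFLOW directive.

    #include <hls_stream.h>

    #define N (320 * 240)

    struct wide_t { float mid; float fin; };  // interleaved aggregated output

    // First kernel: duplicates its result onto a pass-through stream.
    void kernel_a(hls::stream<float> &in, hls::stream<float> &to_b,
                  hls::stream<float> &fwd) {
        for (int i = 0; i < N; i++) {
    #pragma HLS PIPELINE II=1
            float v = in.read() * 2.0f;  // placeholder computation
            to_b.write(v);               // feeds the next block
            fwd.write(v);                // intermediate output, forwarded
        }
    }

    // Second kernel: multi-modal, and attaches the forwarded values.
    void kernel_b(hls::stream<float> &in, hls::stream<float> &fwd,
                  hls::stream<wide_t> &out, int mode) {
        for (int i = 0; i < N; i++) {
    #pragma HLS PIPELINE II=1
            float v = in.read();
            // Case-switching inside the loop body instead of conditionally
            // executing whole blocks, which DATAFLOW does not allow.
            float r = (mode == 0) ? v + 1.0f : v * v;
            wide_t w;
            w.mid = fwd.read();  // attach the intermediate output
            w.fin = r;
            out.write(w);        // increasingly wide interleaved stream
        }
    }

    void top(hls::stream<float> &in, hls::stream<wide_t> &out, int mode) {
    #pragma HLS DATAFLOW
        hls::stream<float> s_ab("s_ab"), s_fwd("s_fwd");
    #pragma HLS STREAM variable=s_ab depth=2
    #pragma HLS STREAM variable=s_fwd depth=2
        // Each stream has exactly one producer and one consumer.
        kernel_a(in, s_ab, s_fwd);
        kernel_b(s_ab, s_fwd, out, mode);
    }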

    C. Application to KinectFusion

    In the dataflow graph, modes are defined to correspond to different resolution levels; this produces the fastest allocation of paths inside which to pipeline all components. Assuming all other components of the KinectFusion system (reading sensor frames, tracking, volumetric integration, etc.) work sufficiently fast, the resulting measurements on an Avnet Zedboard with a PL clock period of 10 ns are as follows:

    Configuration         Initiation interval   Max. frame rate   Avg. resource usage
    Coexistence           2.53 ms               395 FPS           52 %
    Block-level dataflow  2.10 ms               476 FPS           45 %
    HLS-level dataflow    4.13 ms               242 FPS           35 %

    The first configuration, involving independent accelerators, is Pareto-dominated by the block-level dataflow architecture. Its HLS-level counterpart is twice as slow, however, which can be explained by the fact that the whole IP core uses only one AXI DMA to forward its 256-bit output stream to the PS. The Zynq-7020 High-Performance port has a maximum data width of 64 bits, forcing the DMA to chop up every element into smaller packets and thus take four clock cycles to transfer one aggregated data point. An advantage, however, is the decreased total hardware utilization: the opportunity for resource sharing across multiple modes of a hybrid block can already be exploited earlier in the design process by the HLS compiler, in contrast to block-level multi-modality.
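
    The factor-of-two penalty follows directly from this width mismatch; at the 10 ns clock period used here,

    \[
      \frac{256\,\text{bit}}{64\,\text{bit/cycle}} = 4\ \text{cycles}
      \quad\Longrightarrow\quad
      4 \times 10\,\text{ns} = 40\,\text{ns per aggregated data point,}
    \]

    compared with 10 ns when a 64-bit element passes through the port in a single cycle.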

    IV. CONCLUSIONS

    High gains in performance were obtained by applying the devised image processing acceleration methodology, although careful attention in its usage is essential. Vivado HLS provides a balanced mix of high-level and low-level details by allowing fine-grained optimization of hardware computations, while still abstracting away most of the repetitive specifics of established paradigms such as pipelining and I/O interfacing. Designing heterogeneous FPGA systems remains intricate, however, mainly due to the inherent duality of having to manage both hardware and software starting from a blank slate. On the other hand, increasing the degree of automation might adversely affect the quality of the resulting design. Experiments on system-level acceleration of multiple components bearing non-trivial dataflows reveal that there is no clear-cut winner between composition at the block design level and virtually implementing the same concepts at an earlier phase in HLS. Lastly, our findings on the practice of multi-modal kernels closely match those of [2].

  • V. FUTURE WORK

    Not all KinectFusion kernels could be adequately tested on the FPGA due to scope constraints, which presents a concrete direction for future work. Second, implementation on higher-end SoCs and/or a cascade of FPGAs should be researched as well, since the combined resource utilization makes fully off-loading KinectFusion onto the Zynq-7020 FPGA impossible. Finally, the block-level and HLS-level dataflow variants could be treated as two ends of a spectrum; an untested hypothesis is that a mixture of both methods might lead to an optimum in terms of timing and area metrics.

    REFERENCES

    [1] C. Cadena et al., "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, 2016.

    [2] K. Boikos and C.-S. Bouganis, "A Scalable FPGA-based Architecture for Depth Estimation in SLAM," Appl. Reconfigurable Comput., 2019.

    [3] M. Abouzahir, A. Elouardi, R. Latif, S. Bouaziz, and A. Tajer, "Embedding SLAM algorithms: Has it come of age?," Rob. Auton. Syst., 2018.

    [4] K. Rafferty et al., "FPGA-Based Processor Acceleration for Image Processing Applications," J. Imaging, vol. 5, no. 1, p. 16, 2019.

    [5] R. Nane et al., "A Survey and Evaluation of FPGA High-Level Synthesis Tools," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 35, no. 10, pp. 1591–1604, 2016.

    [6] R. A. Newcombe et al., "KinectFusion: Real-Time Dense Surface Mapping and Tracking," 2011.

    [7] Q. Gautier, A. Shearer, J. Matai, D. Richmond, P. Meng, and R. Kastner, "Real-time 3D reconstruction for FPGAs: A case study for evaluating the performance, area, and programmability trade-offs of the Altera OpenCL SDK," in Proc. 2014 Int. Conf. Field-Programmable Technology (FPT), 2015, pp. 326–329.

    [8] L. Nardi et al., "Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM," in Proc. IEEE Int. Conf. Robotics and Automation (ICRA), 2015, pp. 5783–5790.

    [9] J. Lee, T. Ueno, M. Sato, and K. Sano, "High-productivity Programming and Optimization Framework for Stream Processing on FPGA," in Proc. 9th Int. Symp. Highly-Efficient Accelerators and Reconfigurable Technologies (HEART), 2018, pp. 1–6.

    [10] O. Reiche, M. A. Ozkan, R. Membarth, J. Teich, and F. Hannig, "Generating FPGA-based image processing accelerators with Hipacc (Invited paper)," in Proc. IEEE/ACM Int. Conf. Computer-Aided Design (ICCAD), 2017, pp. 1026–1033.

    [11] K. Boikos and C.-S. Bouganis, "Semi-dense SLAM on an FPGA SoC," in Proc. 26th Int. Conf. Field-Programmable Logic and Applications (FPL), 2016.

  • Contents

    1 Introduction
      1.1 Goals and outline
    2 Background and related research
      2.1 Simultaneous Localization and Mapping
        2.1.1 KinectFusion
        2.1.2 Benchmarking visual SLAM
      2.2 Field-Programmable Gate Arrays
        2.2.1 The FPGA put into context
        2.2.2 System-on-Chip
        2.2.3 High-Level Synthesis
        2.2.4 Designer workflow
      2.3 SLAM on FPGAs
        2.3.1 Dense and semi-dense SLAM
    3 High-level synthesis design of individual kernels
      3.1 Prerequisites
        3.1.1 Detailed algorithm description
        3.1.2 Source code, dataset and parameters
      3.2 Methodology
        3.2.1 Common parallel patterns and categorization
        3.2.2 Pipelining
        3.2.3 Efficient line buffering
        3.2.4 Random memory access
        3.2.5 Data type selection
      3.3 Summary
    4 Implementation of KinectFusion in HLS
      4.1 Detailed results
        4.1.1 mm2m_sample (Map)
        4.1.2 bilateral_filter (Stencil)
        4.1.3 half_sample (Stencil)
        4.1.4 depth2vertex (Map)
        4.1.5 vertex2normal (Stencil)
        4.1.6 track (Gather & Map)
        4.1.7 reduce (Reduce)
        4.1.8 integrate (Gather & Map)
      4.2 Discussion
        4.2.1 Evaluation of the methodology
    5 System-level acceleration of multiple kernels
      5.1 Dataflow of KinectFusion
        5.1.1 Generalized problem statement
      5.2 System architecture
        5.2.1 Hardware debugging
        5.2.2 Bandwidth limitations
      5.3 Independent coexistence of kernels
        5.3.1 Performance analysis
      5.4 Task-level pipelining
        5.4.1 Intermediate output aggregation
        5.4.2 Multi-modal execution
        5.4.3 Application to KinectFusion
      5.5 Discussion
        5.5.1 Comparison of timing and resource profiles
    6 Conclusions and future work
      6.1 Future work
    Bibliography

  • List of Figures

    2.1 Continuum of SLAM algorithms from sparse (e.g. using feature extraction) to dense (e.g. using voxelated maps) [3].
    2.2 Part of KinectFusion's map (right) and a slice through the volume (left) showing truncated signed distance values, each representing a distance F to a surface [5]. Grey voxels are those without a valid measurement, and are naturally found within solid objects.
    2.3 System workflow of the KinectFusion method [5].
    2.4 Simplified overview of KinectFusion kernels. A subscript j indicates the presence of several resolution levels, while i indicates the presence of multiple iterations within a level.
    2.5 Violin plots comparing four SLAM algorithms on the NVIDIA Jetson TK1, a GPU development board [6]. Here, KF-CUDA stands for a CUDA implementation of KinectFusion.
    2.6 (a) Sketch of the FPGA architecture; (b) Diagram of a simple logic element.
    2.7 Diagram comparing the FPGA to other processing platforms [19].
    2.8 Functional block diagram of the Zynq-7000 SoC [22].
    2.9 Annotated photograph of the Avnet Zedboard (adapted from [28]).
    3.1 Illustration of the bilateral filter, showing its edge-preserving property [46].
    3.2 Overview of KinectFusion kernels. Green shaded areas include blocks that are executed multiple times per frame and per level; once for every iteration i.
    3.3 Screenshot of the SLAMBench2 GUI when evaluating the 'Living Room 2' scene.
    3.4 Mean ATE for different configurations of KinectFusion. The cubed numbers indicate volume resolutions, while the input FPS corresponds to both the tracking and integration rate.
    3.5 A) RGB video stream (unused). B) Latest depth map captured by the Kinect sensor. C) Reconstructed scene using KinectFusion [37].
    3.6 The Map pattern [9].
    3.7 The Stencil pattern [9].
    3.8 The Reduce pattern [9].
    3.9 The Gather (or Scatter) pattern [9].
    3.10 The Search pattern [9].
    3.11 Non-exhaustive code snippet representing a possible instance of the Search parallel pattern.
    3.12 Concept of pipelining applied to a repeated calculation called 'op' on a large array.
    3.13 Effect of pipelining on the timing profile and resource utilization.
    3.14 Analysis of a pipelined Map kernel, showing the parallelized elementary operations constituting a matrix-vector multiplication. Note that the analysis view in Vivado HLS does not clearly indicate overlapped computation, even though it is definitely present here: a read from and write to the streaming interface occurs at every single clock cycle (or equivalently, control step).
    3.15 Illustration of the Stencil parallel pattern and a corresponding buffering technique for its implementation on the FPGA.
    3.16 Report and analysis of a naive implementation of bilateral_filter; neither line buffering nor array partitioning is applied.
    3.17 Report and analysis of an improved implementation of bilateral_filter which includes line buffer and memory window functionality.
    3.18 Array partitioning strategy for optimizing Stencil computations. Differently colored elements need to be accessed independently and in parallel, which is possible only by distributing them across different instances of internal storage components. (The memory window is fully partitioned in all dimensions.)
    3.19 HLS report of the fully optimized bilateral_filter kernel.
    3.20 Resulting BRAM instances in the HLS report for different memory sizes in Listing 3.5.
    3.21 Kinect v2 accuracy error distribution [66].
    3.22 Kinect v1 offset and precision [44].
    4.1 Effect of every optimization on the timing, resource and accuracy profile of mm2m_sample (Map).
    4.2 I/O diagram of the mm2m_sample HLS kernel before and after duplicating its processing elements 8-fold, assuming no bandwidth bottlenecks.
    4.3 Effect of every optimization on the timing, resource and accuracy profiles of bilateral_filter (Stencil).
    4.4 Exponential function approximation for the bilateral filter, with the actual frequency (popularity) of all arguments translated to the thickness of the green layer.
    4.5 Pareto diagram of the bilateral filter's HLS average resource usage (not including BRAM) and measured accuracy when all eight possible configurations of three separate optimizations are tested. One outlier with a large error is not shown.
    4.6 Effect of every optimization on the timing, resource and accuracy profiles of half_sample (Stencil).
    4.7 HLS performance analysis view of an unnecessarily complex division that went unnoticed by the HLS compiler.
    4.8 I/O diagram of the half_sample HLS kernel before and after duplicating its processing elements 4-fold, assuming no bandwidth bottlenecks.
    4.9 Effect of every optimization on the timing, resource and accuracy profile of depth2vertex (Map).
    4.10 Effect of every optimization on the timing, resource and accuracy profile of vertex2normal (Stencil). Contrary to most other cases, the conversion from floating point to fixed-point has a negative effect here.
    4.11 Effect of every optimization on the timing, resource and accuracy profile of track (Gather & Map).
    4.12 Heatmap of the accessed pixel positions within the reference maps relative to the corresponding regular loop over the input maps for the first level of track. Yellow means high frequency, purple means the opposite. The underlying data was extracted from five frames selected over a video fragment captured at 30 FPS, and shows that horizontal movement of up to 750 pixels per second occurred at some point.
    4.13 Effect of every optimization on the timing, resource and accuracy profile of reduce (Reduce).
    4.14 Effect of every optimization on the timing, resource and accuracy profile of integrate (Gather & Map).
    4.15 Two-dimensional illustration of a frustum-encompassing block, to which loop boundaries can safely be restricted. The green coloured blocks represent volumetric elements that are visible from the sensor's current position, meaning that all yellow elements remain unchanged during integration.
    5.1 Dataflow diagram of the first five kernels of KinectFusion.
    5.2 Illustration of two generalized dataflow challenges.
    5.3 Overview of the System-on-Chip architecture for the execution of a custom IP core.
    5.4 System architecture when five coexisting kernels are implemented together on the FPGA. By allocating one port for every accelerator, hard constraints on concurrent executions are avoided.
    5.5 Waveforms produced by the System ILA for the vertex2normal kernel.
    5.6 Diagrams depicting how the five kernels should be executed in time if the DDR access speed were unlimited. The rows correspond to accelerators each managing their own DMA and PS-PL port, while the distinct tasks are labelled with resolution levels (0 stands for 320x240, 1 for 160x120 and 2 for 80x60).
    5.7 System ILA waveforms for bilateral_filter when it is executed alone, revealing a strange hiccup. The vertical lines are spaced 200 ns apart.
    5.8 System ILA waveforms for half_sample in the multi-frame execution. Large-scale pauses and restarts are clearly visible, and occur presumably due to the DDR controller having to operate at full capacity. The vertical lines are spaced 1 µs apart.
    5.9 Two possible solutions for intermediate output aggregation (Figure 5.2b).
    5.10 Two possible solutions for multi-modal execution (Figure 5.2c).
    5.11 Three different sets of paths (depicted as large arrows) that connect components to combine using task-level pipelining. The time for one path is estimated from the slowest block inside that path, and the paths should be executed separately in time to enable resource sharing across different modes.
    5.12 System architecture that handles the multi-level dataflow challenge of KinectFusion's first five kernels (see Figure 5.1) completely within the Vivado block design, leaving the HLS IP cores unchanged. AXI-Lite control signals are omitted for clarity, and the bottleneck-inducing streams are marked with a red data width label.
    5.13 Schedule to process incoming sensor frames using the improved accelerators. Due to the application of task-level pipelining, all subcomponents now adapt to the slowest link in the chain, which is formed by bandwidth limitations.

  • List of Tables

    2.1 Summary of 3D SLAM algorithms adapted and compared by [6].
    2.2 A compilation of recent 3D SLAM applications involving the FPGA taking up roles of varying importance, showing a trend of decreasing frame rate with increasing "density". SoC (System-on-Chip) boards always contain both an embedded CPU and FPGA.
    3.1 Time spent in each kernel when KinectFusion is executed on the CPU of either a regular laptop or the Avnet Zedboard. The resulting frame rate is determined by summing up all timings on a given platform.
    3.2 Timing and resource usage for various implementations of a simple series of arithmetic calculations.
    3.3 Timing and resource usage for various implementations of a square root calculation.
    4.1 Category, I/O dimensions, estimated timing and average accuracy of every KinectFusion kernel when it would be executed on the FPGA. Bandwidth limitations and other external factors are not yet taken into account, since these fall outside the scope of Vivado HLS.
    4.2 Resource utilization estimated by HLS for every KinectFusion kernel's top function.
    4.3 Impact of the optimizations arising from adoption of the methodology versus use case-specific knowledge on the estimated performance of KinectFusion's kernels in HLS.
    5.1 I/O characteristics of all instances of KinectFusion's first five kernels.
    5.2 Time spent in each kernel as measured on both the PS and PL of the Zedboard. Summing these values assumes that all kernels are executed separately in time, and can be placed side by side onto the same FPGA.
    5.3 Realized maximum I/O throughputs that conform to HP port bandwidth bounds. The data widths and elements processed per clock cycle are measured in terms of data units meaningful to KinectFusion (e.g. one depth value), without regard for details involving packed structs.
    5.4 Comparison of timing and resource profiles after implementing mm2m_sample through vertex2normal as separate accelerators versus applying both discussed multi-level dataflow techniques.
    6.1 Time spent in each kernel when KinectFusion is executed on either the ARM Cortex-A9 CPU or Xilinx Zynq-7020 FPGA of the embedded SoC.

  • List of Listings

    3.1 Code snippet representing the Map parallel pattern.
    3.2 Code snippet representing the Stencil parallel pattern.
    3.3 Code snippet representing the Reduce parallel pattern.
    3.4 Code snippet representing the Gather parallel pattern.
    3.5 Vivado HLS code to test the maximum size of a 16-bit integer array. Data is copied in burst mode from external memory, similar to how block-by-block processing is implemented in practice. Although the compiler places the local array into block RAM by default, the HLS RESOURCE directive [1] is still included for clarity.
    3.6 Vivado HLS code for a simple pipelined fixed-point arithmetic calculation, belonging to the Map pattern.
    3.7 Vivado HLS code for a fixed-point square root calculation, belonging to the Map pattern.
    5.1 Code snippet summarizing how the multi-level dataflow problem is to be solved within Vivado HLS.

  • Abbreviations

    ACP Accelerator Coherency Port

    AXI Advanced eXtensible Interface

    CPU Central Processing Unit

    DDR Double Data Rate

    DMA Direct Memory Access

    DRAM Dynamic Random Access Memory

    DSE Design Space Exploration

    FIFO First-In, First-Out

    FPGA Field-Programmable Gate Array

    FSM Finite State Machine

    GPU Graphics Processing Unit

    HLS High-Level Synthesis

    HP High-Performance

    ILA Integrated Logic Analyzer

    IP Intellectual Property

    MM Memory-Mapped

    PCI Peripheral Component Interconnect

    PL Programmable Logic

    PS Processing System

    RAM Random Access Memory

    SLAM Simultaneous Localization And Mapping

  • Chapter 1

    Introduction

    As we embark on the road towards a more autonomous world, countless challenges and opportunities emerge in various subdisciplines of computer architecture, algorithm design and electronics. One such challenge is Simultaneous Localization and Mapping or SLAM, a relatively modern application that attempts to make a robot aware of its surroundings. SLAM concerns the dual problem of constructing a model of the robot's real-world environment, while also determining the position and orientation of the robot moving inside this map at the same time [2]. Many distinct implementations of this concept exist. Dense SLAM variants, for example, distinguish themselves from their sparse counterparts by incorporating as much of the data captured by the sensors as possible into their global reconstruction. This gives them a considerable edge, mainly because they create a high-quality model that is reusable across other applications as well. However, this comes at the cost of greater computational demands [3]. On the other hand, use cases such as autonomous driving, augmented reality, indoor mapping or navigation, and essentially any requirement of high-quality environmental awareness on mobile or low-power devices, all justify why one might desire to run dense SLAM on embedded devices rather than on high-end GPUs only.

    The need for embedded SLAM solutions is evident. The Field-Programmable Gate Array (FPGA), a low-power integrated circuit that is reconfigurable yet can reach high performance and efficiency, offers a flexible hardware platform on which to evaluate the implementation of a dense 3D SLAM algorithm. It is essentially a large grid of elementary blocks and routing interconnects, both of which can be reprogrammed by the designer 'in the field'. While FPGA designs are traditionally developed using a hardware description language such as VHDL, we employ the emerging High-Level Synthesis (HLS) tools as a means to evaluate the present-day programmability of FPGAs as well as the quality of our design methodology. The strength of Vivado HLS is its capability to perform high-level, pragma-directed compilation of C code into hardware modules [4]. The concept of a System-on-Chip (SoC), which integrates a CPU and an FPGA into one package, is also essential to this dissertation. The Zedboard development board is used to evaluate both hardware and software running on the Zynq-7020 SoC.

    Computer vision and signal processing are research fields that are well represented both in typical FPGA applications and in the low-level operation of SLAM [2]. This dissertation is based on the implementation of the KinectFusion algorithm, mainly because it is highly representative of many kernels within the general context of image processing. Both two-dimensional and three-dimensional data structures are processed in various ways throughout the KinectFusion pipeline [5], giving rise to a diverse exploration of possible FPGA-specific optimizations. Furthermore, it allows for the extraction of guidelines involving the methodology for FPGA programming. Other benefits over comparable SLAM algorithms include its relatively low memory requirement and good accuracy [6].

    1.1 Goals and outline

    The aim of this master's thesis is to provide a framework by which SLAM and, by extension, image processing kernels in general can be efficiently mapped onto FPGAs. Beyond the exploitation of parallelism and pipelining, the full translation of software algorithms into HLS code is often non-trivial. We also intend to explore how multiple kernels with complex dataflow characteristics can be combined in hardware so as to form an efficient, large-scale pipeline consisting of functional blocks. The final goal is to achieve a heterogeneous implementation of 3D SLAM that is as fast as practicable, while investigating which concepts and techniques can be distilled in order to create a more generally applicable methodology as a side effect.

    The outline and contributions of this thesis are as follows:

    • Chapter 2 reviews the background and existing literature about SLAM, FPGAs and SLAM on FPGAs, and justifies several choices made in this thesis.

    • Chapter 3 delineates the methodology that was developed to deal with the effective optimization of kernels bearing different computational and data management patterns using HLS.

    • Chapter 4 applies these practices to KinectFusion and evaluates the extent to which they brought us to a satisfying solution versus how many additional optimizations had to be applied.

    • Chapter 5 explores various ways in which multi-level dataflow can be realized efficiently, fitting the first five kernels of KinectFusion together onto the programmable logic. A comparison is made among three architectures using their resulting timing and resource metrics.

    • Finally, Chapter 6 formulates a conclusion presenting some takeaways of our research and opportunities for future work.

  • Chapter 2

    Background and related research

    2.1 Simultaneous Localization and Mapping

    Simultaneous Localization and Mapping (SLAM) is an advanced computer vision and robotic navigation algorithm that has made significant progress over the last 30 years. Its purpose is to track the state of an agent within an unknown environment, while simultaneously constructing a model of this very environment using its sensory observations [2]. The state is typically described by the agent's pose (position and orientation), while the model essentially refers to a map, which is either a representation of some interesting aspects (so-called features) or a dense volumetric description of the robot's surroundings. It is clear that both components of SLAM, localization and mapping, cannot be solved independently of each other. A sufficiently detailed map is needed for localization, while an accurate pose estimate is required to be able to reconstruct or update the map [7]. Localization is often done by means of tracking, which compares the incoming sensor data with the map that has been generated so far in order to create a new estimate of the current pose [3].

    One of SLAM's emerging use cases is the variety of applications in mobile robotics, including but not limited to path planning, visualization, augmented reality and 3D object recognition. In general, many situations where localization infrastructure is absent (such as indoor operation) give rise to the present-day popularity of SLAM. The same holds true for any scenario where detailed up-to-date maps need to be created but are not available beforehand. Cadena et al. [2] note that SLAM is a vital aspect of robotics, and is being increasingly deployed in various real-world settings that range from autonomous driving and household robots to mobile devices. However, they also state that more research is needed to achieve true robustness in navigation and perception, especially for autonomous robots that ought to operate independently for a long time. In this sense, SLAM has not been fully solved yet, but we note that the algorithms considered in this thesis definitely hold potential for investigation and acceleration due to the broad applicability of their underlying concepts.

    FIGURE 2.1: Continuum of SLAM algorithms from sparse (e.g. using feature extraction) to dense (e.g. using voxelated maps) [3].

    Implementations of SLAM come in many shapes and sizes; this diversity is partially illustrated in Figure 2.1. On one end of the spectrum, sparse SLAM focuses on the selection of a limited number of features or landmarks. This has the upside of being computationally lighter, but carries the significant downside of reducing the quality and usability of the reconstruction. On the other end, a dense algorithm reverses these properties: its ability to generate a much higher-quality map of the environment comes at the cost of being computationally intensive. Semi-dense visual SLAM implementations have emerged in an attempt to form a compromise, although the resulting model is still incomplete with respect to the fully dense variant, as the algorithms do not deal with all of the available sensory information [2], [8].

    2.1.1 KinectFusion

    KinectFusion is a real-time dense scene reconstruction algorithm, published by Microsoft in 2011. As a SLAM algorithm, it continuously updates a global 3D map and tracks the position of a moving depth camera within this environment. Several innovations were built into this system by [5]. First, it works under all lighting conditions, since only the depth data is used; viable consumer-oriented depth sensors include the Microsoft Kinect camera. This allows the system to work perfectly under dark conditions as well. Furthermore, the localization step is always done with respect to the most up-to-date global map. The map, which is represented as a volume of truncated signed distance function (TSDF) values, thereby recapitulates the information of all previous depth frames seen so far. This helps to avoid the drifting problems commonly associated with simple frame-to-frame alignment.

    FIGURE 2.2: Part of KinectFusion's map (right) and a slice through the volume (left) showing truncated signed distance values, each representing a distance F to a surface [5]. Grey voxels are those without a valid measurement, and are naturally found within solid objects.

    Figure 2.2 depicts a typical volume consisting of TSDF values. Here, F is defined as the signed distance to the nearest surface. Its value is positive outside of a (solid) object, and its magnitude is truncated to a fixed maximum in order to avoid the interference of surfaces far away from each other [5]. The global surface is then defined as the set of points where F = 0; hence this data structure belongs to the class of implicit surface representations [2]. A functional limitation of KinectFusion is that, unlike in many other SLAM algorithms [6], the global map cannot expand at runtime because its size is predefined. This renders KinectFusion unsuited for large-scale SLAM (on the order of 500 cubic meters or more), although the author notes that many of the aforementioned applications do not necessarily require this functionality.
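
    In symbols (notation introduced here for clarity, following the convention of [5], with truncation bound \mu and signed distance \eta):

    \[
      F(\mathbf{p}) \;=\; \max\bigl(-\mu,\ \min\bigl(\mu,\ \eta(\mathbf{p})\bigr)\bigr),
      \qquad
      S \;=\; \{\, \mathbf{p} \in \mathbb{R}^3 : F(\mathbf{p}) = 0 \,\},
    \]

    where \eta(\mathbf{p}) is the signed distance from point \mathbf{p} to the nearest observed surface and S is the implicitly represented global surface.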

    Technical description

    The overall workflow of KinectFusion is shown in Figure 2.3. From a high-level perspective, four interconnected stages can be distinguished:

    1. Surface measurement. After obtaining the raw depth map captured by a Microsoft Kinect (or equivalent) camera, this preprocessing step calculates 3D vertex and normal vector arrays at multiple resolution levels.

    2. Pose estimation. The device is tracked using a variant of the Iterative Closest Point (ICP) algorithm; see the original paper for its description. The live measurement is aligned in a coarse-to-fine manner with a predicted surface measurement, which is in turn obtained from the surface prediction phase.

    3. Update reconstruction. Given an accurate pose estimate, the incoming depth data is integrated into the volume. TSDF values within the frustum are updated to accommodate the new sensor measurement, further consolidating the global model.

    4. Surface prediction. A raycast is performed on the most up-to-date volume, thereby producing a dense and reliable surface measurement estimate against which to perform alignment in the pose estimation phase. Loop closure between mapping and localization is achieved this way [5].

    FIGURE 2.3: System workflow of the KinectFusion method [5].

    Figure 2.4 shows correspondences between these high-level stages and the subroutines in the source code provided by [9]. Note that the system-level dataflow of KinectFusion is much more complex in reality, and contains many interacting kernels with multiple instances. These communication and replication aspects are left out for simplicity here, but will be explored in detail in Chapter 3.

    2.1.2 Benchmarking visual SLAM

    Nardi et al. [6], [9] have introduced SLAMBench, a tool used to test the correctness and performance of various 3D SLAM algorithms. Given a dataset with a ground truth camera trajectory, the accuracy, speed and optionally the power consumption of a specified SLAM implementation can be measured on various CPU or GPU platforms. This benchmark provides an important basis by which to evaluate the effect of several parameter choices, such as the resolution of the reconstruction volume and the frame rate. Essentially, it allows the author to deviate from the reference KinectFusion implementation whenever it is deemed useful to do so, while still keeping track of the possible degradation in quality due to these optimizations.

    Accuracy evaluation

    SLAMBench allows for detailed accuracy measurements of different SLAM implementations in the form of an absolute trajectory error (ATE). At every frame during the execution, there exists a certain error between the estimated camera position as produced by the application under test (AUT) and the ground truth position. The ATE, as described by [10], is a metric that serves to evaluate this discrepancy using a scaled Euclidean distance calculation, after aligning both trajectories in a least-squares manner. The mean ATE is then simply its average over all frames, and will be used hereafter to quantify the accuracy of any set of parameters used as input to KinectFusion.
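
    One common formulation of this metric, following the description in [10] (the notation here is ours: P_i and Q_i denote the estimated and ground truth poses at frame i, and S the least-squares alignment between the two trajectories), is

    \[
      E_i \;=\; Q_i^{-1}\, S\, P_i,
      \qquad
      \mathrm{ATE}_{\text{mean}} \;=\; \frac{1}{n} \sum_{i=1}^{n}
        \bigl\lVert \operatorname{trans}(E_i) \bigr\rVert_2 ,
    \]

    where trans(·) extracts the translational component of a rigid-body transformation.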

    [Figure 2.4 diagram: Depth input → mm2m_sample → bilateral_filter → half_sample_j → depth2vertex_j → vertex2normal_j → track_i,j → reduce_i,j → update_pose_i,j → integrate → raycast → Map output / Pose output, grouped into the four stages: 1. Surface measurement, 2. Pose estimation, 3. Update reconstruction, 4. Surface prediction.]

    FIGURE 2.4: Simplified overview of KinectFusion kernels. A subscript j indicates the presence of several resolution levels, while i indicates the presence of multiple iterations within a level.

    Algorithm           Type        Required sensors             Year
    ORB-SLAM2 [11]      Sparse      Monocular, stereo or RGB-D   2016
    LSD-SLAM [12]       Semi-dense  Monocular                    2014
    ElasticFusion [13]  Dense       RGB-D                        2015
    InfiniTAM [14]      Dense       RGB-D                        2015
    KinectFusion [5]    Dense       RGB-D                        2011

    TABLE 2.1: Summary of 3D SLAM algorithms adapted and compared by [6].

    Comparison of KinectFusion among other SLAM algorithms

    The following items summarize the performance results of SLAM in literature [6]

    as well as benchmarks executed by the author, in order to give a context to the per-

    formance of KinectFusion. The considered algorithms are listed in Table 2.1, and a

    standard (mid-to-high-end) set of parameters is used for their evaluation.

    Accuracy. Figure 2.5 indicates that the trajectory accuracy of KinectFusion is

    generally mediocre, although the author’s own executions have indicated that its

    mean ATE is among the best as long as no loss of track occurs. A major drawback

    of KinectFusion is that it tends to get lost completely during some video fragments,

    causing the pose to simply stop updating midway the benchmark. High drift occurs

    as a consequence, which explains the high variability when performing accuracy

    measurements on different datasets.

    Memory requirements. In a comparison made by Bodin et al. [6], KinectFusion

    turned out to require the lowest memory size among the five recent sparse, semi-

    dense and dense SLAM algorithms shown in Table 2.1. The memory usage depends

    on the dimensions of the reconstruction volume, but is on the order of 50 MB for a

relatively detailed map of 256³ elements (roughly 16.8 million voxels at a few bytes per voxel). However, it should be noted that this value

    is still very high compared to typical FPGA applications, since local storage on the

    FPGA is typically on the order of a few megabits. This indicates a priori that the

    implementation of KinectFusion on the FPGA is likely to be a challenging task.

    Speed. According to Figure 2.5, KinectFusion is faster than most of its coun-

    terparts, achieving around 8 FPS on a GPU platform. This cannot simply be gen-

    eralized towards heterogeneous CPU-FPGA executions, although it does provide

    another hint that KinectFusion might be the most promising choice to attempt to

    accelerate.


FIGURE 2.5: Violin plots comparing four SLAM algorithms on the NVIDIA Jetson TK1, a GPU development board [6]. Here, KF-CUDA stands for a CUDA implementation of KinectFusion.

    2.2 Field-Programmable Gate Arrays

    The Field-Programmable Gate Array (FPGA) is essentially a two-dimensional grid

of reconfigurable blocks and routing channels, offering a low-volume yet highly efficient alternative to the Application-Specific Integrated Circuit (ASIC). The term gate array refers to the fact that these elementary building blocks consist of various logic gates, providing look-up tables (LUT), registers, full adders (FA), multiplexers,

    flip-flops (FF) and more. Special blocks such as Digital Signal Processors (DSPs) are

    also at the designer’s disposal: these serve to perform arithmetic operations such

as multiplications more efficiently than by merely using LUTs. A field-programmable integrated circuit is one that can be reprogrammed on the spot so as to perform almost

    any hardware functionality that the user desires. Whereas ASICs have their elec-

    tronic circuitry permanently ’baked’ into silicon, the FPGA’s logic can be changed

    at will long after it has been manufactured precisely because its logic blocks and

    interconnects are reconfigurable. The designs running on an FPGA are typically

    created using a Hardware Description Language (HDL). This language allows the

    user to formally describe the behaviour of digital circuits by means of specifying,

    among others, how signals should be connected together and which logical oper-

ations should be performed. In the synthesis phase, this description is then transformed into a list of electronic building blocks and their interconnections. After-

    wards, the blocks are mapped onto the physical rectangular layout of the FPGA in

    the mapping phase. Finally, the routing phase decides how to connect these placed

components. The resulting implemented design specifies exactly how each available FPGA resource should be configured, including how the interconnections should be routed so as to connect the relevant blocks together.

    As an example, Figure 2.6 (adapted from [15]) depicts the architecture of an


FIGURE 2.6: (a) Sketch of the FPGA architecture; (b) Diagram of a simple logic element.

island-style FPGA. Here, another type of special block, the I/O block, is shown, residing at the periphery of the device. These blocks serve to provide external connections and

    are necessary to communicate with the world outside of the Programmable Logic

    (PL).

    2.2.1 The FPGA put into context

    Figure 2.7 shows a simplified comparison of how the FPGA can be situated among

    the CPU, GPU and ASIC. On the left end, we find a Central Processing Unit (CPU).

This general-purpose device is clearly the most flexible with regard to programma-

    bility, but it is also the least efficient one. In this context, efficiency refers to both speed(throughput, latency) and power consumption. On the right end, we find an ASIC:

    this device is the most rigid of all but also the most efficient. The logic is burned right

    into silicon, which fixes its functionality permanently but allows for an extremely

low latency and energy consumption. Within a given semiconductor technology, a

    much better efficiency can be achieved by ASICs relative to FPGAs since negligible

    overhead exists. Since the components and interconnections are fully fixed before-

    hand, their area utilization and speed metrics are much better than for comparable

designs on the FPGA. For example, [16] found that the average ratio of FPGA-to-ASIC silicon area required to implement circuits containing only LUT-based logic and FFs is approximately 35.

    Moving to the left on the axis generally means sacrificing efficiency, while gain-

    ing the ability to easily run a wider range of applications. On the other hand, moving

    to the right means giving up on ad-hoc programmability and configurability, but in-

    stead gaining increased potential for high-efficiency or high-performance computa-

    tion in return. For example, the Graphics Processing Unit (GPU) cannot run general-

    purpose programs, although it is quite well suited for massively parallel or vector-

ized calculations thanks to its large number of processing units. GPUs are, however,


very power-hungry, a drawback FPGAs and ASICs do not suffer from. The FPGA’s low power consumption and hardware reconfigurability explain the growing interest it has attracted in the academic and industrial world.

    Reality is of course not one-dimensional, and the FPGA has its fair share of dif-

ferences and advantages that offset it on other axes as well, figuratively speaking. Rather than just existing in between GPUs and ASICs, it is also useful to compare

    the FPGA with the CPU with respect to how they process data. CPUs often have

    higher clock speeds, but execute every instruction in a much less parallel fashion

than their counterparts. While the CPU has undergone many architectural improve-

    ments to accelerate the execution of software, including multi-core functionality, Sin-

    gle Instruction Multiple Data (SIMD) technology, instruction-level parallelism (ILP),

    speculative execution and more, these extra tools are only available under specific

    circumstances. Serial execution happens otherwise, which results in every data el-

ement being processed one by one. A resulting benefit is that code with a high degree of control statements, for example with many if-then-else constructions, is handled well by the CPU [17]. On the other hand, FPGAs provide a more direct

    shortcut to hardware, as they allow for effective pipelined and dataflow-oriented

    architectures to be designed and implemented for a given (fixed) algorithm. FP-

GAs provide the opportunity to spatially parallelize complex computations across their many reconfigurable blocks and routing channels, in order to achieve a process-

    ing speed many times greater than the CPU [18]. However, note that the maximum

    DDR access speed, bus widths and I/O limitations still define upper limits with re-

    gard to communication for both devices.
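As a minimal illustration of this spatial parallelism, consider the following Vivado HLS-style sketch (a hypothetical kernel, not one taken from KinectFusion). The UNROLL directive replicates the multiply-add hardware eight times, the PIPELINE directive lets a new group of eight elements enter the circuit every clock cycle, and the arrays are partitioned so that the on-chip memories can supply eight values per cycle:

    #define N 640  // hypothetical row width

    void saxpy8(float a, const float x[N], const float y[N], float out[N]) {
    #pragma HLS ARRAY_PARTITION variable=x cyclic factor=8
    #pragma HLS ARRAY_PARTITION variable=y cyclic factor=8
    #pragma HLS ARRAY_PARTITION variable=out cyclic factor=8
        for (int i = 0; i < N; i += 8) {
    #pragma HLS PIPELINE II=1
            for (int j = 0; j < 8; ++j) {
    #pragma HLS UNROLL
                // Eight multiply-add units operate side by side each cycle.
                out[i + j] = a * x[i + j] + y[i + j];
            }
        }
    }

At a clock of 100 MHz, this would sustain on the order of 800 million multiply-adds per second, whereas a serial execution of the same loop would process one element at a time.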

    Lastly, a disadvantage of FPGAs is that, despite the rise of tools such as High-

Level Synthesis (HLS) that attempt to ease their development [4], FPGAs intrinsically

    remain quite difficult to program. The mindset for FPGA development is very dif-

    ferent from that of software engineering [17], which is essentially due to the design

    process being multifaceted and intricate, involving both hardware and software. To

    design an FPGA system is to start from a blank slate: the architecture is not fixed,

    but can be changed to perform virtually any digital hardware logic the user wishes it

    to. For a System-on-Chip combining both a CPU and FPGA, the situation becomes

    even more involved: in addition to devising effective hardware, the designer also

    has to write good software around their custom architecture, and has to ensure that

    all components work well together as intended. This stands in stark contrast to reg-

    ular software development that does not deal with variable architectures, such as on

    a desktop or laptop CPU. In short, a high degree of technical expertise is required in

    this field, although HLS can definitely be regarded as a positive evolution towards

    facilitating the hardware design aspect of this two-fold development process.


FIGURE 2.7: Diagram comparing the FPGA to other processing platforms [19].

    Power consumption

    Minimizing the power usage of an application is especially important in the context

    of embedded devices, where SLAM is most likely to be found. In theory, the FPGA is

    more energy-efficient than the CPU and GPU by design. After all, this device is clas-

    sified as ’reprogrammable hardware’, meaning that its data operations are directly

    encoded into hardware. The overhead for any given calculation is therefore greatly

    reduced. Furthermore, high throughputs can be achieved thanks to the opportunity

for efficient pipelining, no matter how complex the chain of operations involved.

    To verify the above claims, [20] compared the energy consumption of a high-end

    Altera Stratix III E260 FPGA with an NVIDIA GeForce GTX 295 GPU and quad-core

    Xeon W3520 CPU. For many typical sliding window applications (which can be seen

    as a subset of image processing), the FPGA turns out to be one to two orders of mag-

    nitude more power efficient in terms of energy usage per frame than both the GPU

and CPU. Only in the case of linear 2D convolution, where the filtering operation may also be executed in the frequency domain, was the GPU-FFT implementation able to obtain a power efficiency comparable to that of the FPGA. While both of

    these devices are around a decade old, similar conclusions were drawn by [21] for

    k-means clustering on more modern hardware. Here, several Xilinx Zynq FPGAs

    achieved an ’FPS per Watt’ value of a factor 10 to 25 times better than the NVIDIA

GTX 980 GPU. The general trend is that GPUs can often process data faster than

    FPGAs in terms of frames per second, but do so much less energy efficiently.
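To illustrate the metric with hypothetical round numbers: a GPU delivering 90 FPS at 180 W scores 0.5 FPS per Watt, while an FPGA delivering 15 FPS at 3 W scores 5 FPS per Watt; the FPGA is then ten times more energy efficient per processed frame despite a six-fold lower raw throughput.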

    2.2.2 System-on-Chip

    A System-on-Chip (SoC) integrates a Processing System (PS) containing the CPU

    with Programmable Logic (PL) representing the FPGA onto a single device [22].

    Figure 2.8 depicts the block diagram of a Xilinx Zynq-7000 SoC, where the following

    relevant functional blocks can be distinguished:

    • Application Processor Unit (APU): The software part of the SoC, consisting of

    a dual-core ARM Cortex-A9 CPU. It is used to control the full execution of


KinectFusion and to initiate data transfers via the AXI Direct Memory Access (DMA) IP core. Here, AXI (Advanced eXtensible Interface) denotes a set of protocols for inter-IP communication as adopted and described by Xilinx in [23].

    • Programmable Logic (PL): The reconfigurable hardware part of the SoC, de-

    rived from the Xilinx Artix-7 FPGA. It is used to run the accelerated kernels of

    KinectFusion.

    • General-Purpose (AXI_GP) Ports: Provides PS-PL communication with two

32-bit master and two 32-bit slave interfaces. It is used to control the IP cores,

    and its maximum estimated throughput is 600 MB/s [22].

    • High-Performance (AXI_HP) Ports: Provides PS-PL communication with four

    32- or 64-bit independently programmed master interfaces. It is used to trans-

fer large amounts of data via the DMA with an estimated maximum throughput of around 1,200 MB/s [22], [24], [25]; a minimal usage sketch follows after this list.

    • Central Interconnect: Connects the PL via its AXI_GP ports to the DDR mem-

    ory controller, PS cache and I/O peripherals.

    • DDR Controller (Memory Interfaces): Supports DDR2 and DDR3 for access to

    the Dynamic RAM (DRAM), not shown on the figure.

    • Programmable Logic to Memory Interconnect: Connects the PL via its AXI_HP

    ports directly to the DDR controller for fast streaming (reading and writing) of

    data.
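The usage sketch referenced in the list above shows how such a transfer can be initiated from the APU with the Xilinx AXI DMA driver in a bare-metal application. It is a simplified illustration: the device ID macro depends on the actual block design, and all error checking is omitted.

    #include "xaxidma.h"
    #include "xil_cache.h"
    #include "xparameters.h"

    // Stream len_bytes of data from DDR to a PL IP core via an AXI_HP port.
    void dma_send(u32 *buf, u32 len_bytes) {
        static XAxiDma dma;
        XAxiDma_Config *cfg = XAxiDma_LookupConfig(XPAR_AXI_DMA_0_DEVICE_ID);
        XAxiDma_CfgInitialize(&dma, cfg);

        // Flush the CPU cache so the DMA engine reads up-to-date data from DRAM.
        Xil_DCacheFlushRange((UINTPTR)buf, len_bytes);

        XAxiDma_SimpleTransfer(&dma, (UINTPTR)buf, len_bytes,
                               XAXIDMA_DMA_TO_DEVICE);
        while (XAxiDma_Busy(&dma, XAXIDMA_DMA_TO_DEVICE)) {
            // Busy-wait until the memory-mapped-to-stream channel completes.
        }
    }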

    The Zedboard, shown in Figure 2.9, is a low-end development board based on

    the Xilinx Zynq-7000 SoC [26]. It will be used in this dissertation for the prac-

    tical evaluation of several accelerated KinectFusion configurations. High-end FP-

GAs were considered as well, but were eventually deemed out of scope; we mostly wanted to see how much the Zedboard is already capable of, as the Zynq FPGA is

    relatively popular in image and video processing applications [27]. Furthermore,

    the lower the cost of the hardware platform, the wider the range of devices and use

    cases our work could be applied to.

    2.2.3 High-Level Synthesis

    High-Level Synthesis (HLS) represents a collection of processes that automatically

    convert a high-level algorithmic description of a certain desired behavior to a circuit

    specification in HDL that performs the same operation [4]. It allows the hardware

    functionality of an FPGA to be specified directly by algorithms written in a software

    programming language such as C or C++. HLS tools attempt to reduce time-to-

    market and address the design of increasingly complex systems by permitting de-

signers to work at a higher level of abstraction. Design spaces can be explored more


    FIGURE 2.8: Functional block diagram of the Zynq-7000 SoC [22].

FIGURE 2.9: Annotated photograph of the Avnet Zedboard (adapted from [28]).


    rapidly this way, which is especially important when many alternative configura-

    tions have to be implemented, generated and compared.

    The task of automatically generating hardware from software is far from easy,

    and a one-size-fits-all solution might not even exist in the same way that a fully

    optimizing and/or parallelizing compiler is theoretically impossible to create. Nev-

    ertheless, a wide range of different approaches exist that attempt to partially solve

    the problem. Removing the burden on the user of having to reinvent the wheel is

    already a great practical advantage of HLS. After all, frequently recurring concepts

    such as pipelining, array partitioning and more are often already built into these

tools, ready to be used without requiring the designer to deal with their low-level details.

    Xilinx’ Vivado HLS is able to synthesize a procedural description written in C,

    C++ or SystemC into a hardware IP block [1], [4]. Loop unrolling, pipelining, chain-

    ing of operations, resource allocation and internal array restructuring are among the

    many different optimizations that can be applied during the compilation process. In

    addition, support for many types of interfaces such as shared memory and stream-

    ing is built-in.
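To give a flavour of this process, the following kernel sketch (loosely inspired by the millimeter-to-meter conversion discussed in Chapter 3, but not the actual implementation used in this work) shows how directives specify both the interfaces and the pipelining behaviour:

    #include "ap_int.h"
    #include "hls_stream.h"

    // Convert a stream of raw 16-bit depth samples (mm) to floats (m).
    void mm_to_m(hls::stream<ap_uint<16> > &in,
                 hls::stream<float> &out, int n) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE s_axilite port=n
    #pragma HLS INTERFACE s_axilite port=return
        for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
            // One sample is consumed and one produced every clock cycle.
            out.write(in.read().to_uint() * 0.001f);
        }
    }

The axis directives expose the arguments as AXI4-Stream ports, while s_axilite generates the control and status registers; changing the interface style or the pipelining requires editing only these directives, not the algorithm itself.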

    2.2.4 Designer workflow

    With the availability of an Avnet Zedboard containing the Xilinx Zynq-7020 PL, the

    author’s toolchain of choice consists of Vivado HLS, Vivado and Xilinx Software

    Development Kit (SDK) v2017.4. These three development environments together

    provide an integrated design flow as follows:

    • In Vivado HLS [1]:

    1. Write a C/C++ function to be integrated into the hardware system. This

    can be written from scratch or based on an existing reference implemen-

    tation. Data type selection and interface specifications have to be consid-

    ered as well.

    2. Write test benches; compile, simulate and debug the algorithm to verify

    its functional correctness. Return to step 1 until the output is correct.

    3. Optimize the C/C++ code to make it tailored towards a useful imple-

    mentation on the FPGA. One important practice here is the application of

Vivado HLS optimization directives, which automate many repetitive

    aspects of the optimization process. Ensure that the algorithm stays cor-

    rect by performing step 2 as needed.


    4. Synthesize the top function into an RTL implementation. Vivado HLS

    creates two variants: VHDL and Verilog, both of which ought to be fully

    equivalent.

    5. Analyze the reports and cycle-by-cycle computation steps of the resulting

    design. Return to step 3 until satisfied. This back and forth process is part

    of Design Space Exploration (DSE).

    6. Optionally, verify the correctness of the RTL implementation by running

    a C/RTL cosimulation.

    7. Export the RTL implementation to package it into an IP block, ready to be

    used in subsequent design tools.

    • In Vivado [29]–[31]:

    8. Create a new IP integrator block design and insert the ZYNQ7 Processing

    System. This block encompasses the embedded PS functionality, while

    all other IP cores around it represent what will be implemented on the

    FPGA.

    9. Configure the Zynq-7000 with respect to clock speeds, PS-PL communi-

    cation, peripheral interfaces and more.

    10. Insert your custom HLS IP core(s) into the block design while taking care

    of AXI interfacing, interconnects and ports. Optionally, insert an AXI

    DMA IP core which provides streaming to memory-mapped conversions

    and vice versa, in order to allow streaming IP cores to efficiently access

    the DRAM via HP ports.

    11. Insert a System Integrated Logic Analyzer (ILA) IP core, and add de-

    bugging probes to important signals. This allows for the debugging of

    post-implementation designs on the FPGA device, which consumes extra

    resources but no additional clock cycles.

    12. Verify the design, and fix the block design if needed.

    13. Perform logic synthesis and implementation. Redo step 9 if problems

    arise, such as the critical path length exceeding the clock period.

    14. Analyze the resource usage and timing profile of the implemented design.

    If the resource usage exceeds the Zynq-7000’s maximum, consider reduc-

    ing the complexity of the HLS IP core(s) and/or decreasing the number

    of concurrent IP cores present in the block design at step 10. If the critical

    path exceeds the clock period, return to step 9 until any timing issues are

    resolved.

    15. Generate the bitstream and export the hardware to Xilinx SDK.

    • In Xilinx SDK [32], [33]:


    16. Create a new standalone Board Support Package (BSP) based on the pre-

    viously generated hardware platform. The drivers of this BSP allow hard-

    ware components to be called directly from software code.

17. Create a new C++ software application project based on the BSP. The author recommends importing the Hello World example, as it contains the necessary platform initialization and clean-up routines.

    18. Configure the BSP project to include relevant libraries, and modify the

    software project’s linker script to ensure the stack and heap sizes are large

    enough for your use case.

19. Write C/C++ application code that executes and verifies the full system’s functionality; a minimal sketch follows after this list. Software can be debugged by setting breakpoints in SDK,

    while hardware logic can be debugged using the System ILA in Vivado

    as described in [31].
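The sketch referenced in step 19 might look as follows for a bare-metal application driving a single HLS IP core. The IP name foo and its scalar argument n are hypothetical; the function names follow the convention of the drivers that Vivado HLS generates alongside each exported core.

    #include "xparameters.h"
    #include "xfoo.h"  // driver generated by Vivado HLS for a hypothetical IP named 'foo'

    int main() {
        XFoo ip;

        // Bind the driver instance to the IP core's base address.
        XFoo_Initialize(&ip, XPAR_XFOO_0_DEVICE_ID);

        XFoo_Set_n(&ip, 1024);   // write a scalar argument over an AXI_GP port
        XFoo_Start(&ip);         // assert ap_start
        while (!XFoo_IsDone(&ip)) {
            // Poll ap_done; an interrupt-driven approach is also possible.
        }
        return 0;
    }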

    It is possible that the resolution of some problems in the last step extends all the

    way back to step 1, in the sense that it requires a comprehensive and holistic analysis

    of all aspects in the design process. One such incident might be that the integrated

    system’s performance is less than expected, so that architectural decisions regarding

bandwidths, data types etc. have to be revisited. At other times, more fundamental lim-

    itations might be encountered such as excessive resource utilization (making routing

    impossible) or I/O bandwidth ceilings, which can usually not be solved without re-

    assessing the initial specifications of the system.

    2.3 SLAM on FPGAs

    In recent years, the idea of implementing Simultaneous Localization and Mapping

    on FPGAs has been explored in diverse ways. Works in literature range from two-

    dimensional [34], [35] to three-dimensional [3], [7], [36]–[40] SLAM and from sparse

    [7], [35], [38], [39] to semi-dense [3] SLAM, although a fully dense 3D variant such

    as KinectFusion seems to be less popular in the embedded hardware community.

    Furthermore, most works focus on the hardware acceleration of just specific parts

    belonging to a whole heterogeneous SLAM system [36], [37], [39]. The selected

    subcomponents naturally include those that the FPGA is known to be strong at in

    terms of performance and efficiency. Nevertheless, the following text provides an

    overview of these existing results in an attempt to pick up architectural and method-

    ological clues for the as-complete-as-possible acceleration of SLAM on FPGAs.

In order to give an idea of the current state of the art, Table 2.2 summarizes the most recent works on 3D SLAM. Some authors have published improvements of their

    systems over several years, in which case only the latest results are shown. A clear


Reference        Algorithm      Type        Platform(s)                     Speed      Year
[7], [41]        FastSLAM 2.0   Sparse      Host CPU + Arria 10 FPGA        102 FPS    2018
[36]             ORB-SLAM       Sparse      Host CPU + Stratix V FPGA       67 FPS     2018
[38]             VO-SLAM        Sparse      DE3-340 SoC                     31 FPS     2015
[39]             EKF-SLAM       Sparse      Zynq-7020 SoC                   30 Hz      2015
[3], [42], [43]  LSD-SLAM       Semi-dense  Zynq-7045 SoC                   >60 FPS    2019
[37]             KinectFusion   Dense       GTX 760 GPU + Stratix V FPGA    26-28 FPS  2015
[40]             ICP-SLAM       Dense       Zynq-7020 SoC                   2 FPS      2017

TABLE 2.2: A compilation of recent 3D SLAM applications involving the FPGA taking up roles of varying importance, showing a trend of decreasing frame rate with increasing "density". SoC (System-on-Chip) boards always contain both an embedded CPU and FPGA.

    trade-off is visible between the performance and quality of the algorithm, in ad-

    dition to the role played by the actual ’embeddedness’ of the heterogeneous plat-

    forms in use. Low-power, real-time sparse SLAM applications seem to be coming of

    age thanks to their light computational weight, although they do not produce a us-

    able map and often lack loop-closure functionality [3]. On the other hand, the more

    accurate and feature-rich fully dense solutions notably require high-end hardware

    (typically desktop GPUs as shown in Figure 2.1) unless the real-time constraint is

    disposed of [2].

    2.3.1 Dense and semi-dense SLAM

    Perhaps the most relevant entry in Table 2.2, the implementation of real-time 3D re-

    construction using KinectFusion on a heterogeneous system with an FPGA has been

    attempted before by [37]. Here, Gautier et al. attempted to accelerate two intensive

    parts of the application, specifically the Iterative Closest Point (ICP) algorithm and

the volumetric integration step, corresponding to the track, reduce and integrate kernels in Figure 2.4. Their set-up was a heterogeneous system with the Altera Stratix

    V FPGA and the NVIDIA GTX 760 GPU, both connected via PCI Express to a host

    computer. It is interesting to note that the authors’ goal of accelerating integrate wasunsuccessful due to a fundamental bandwidth limitation. A 3D volume with 5123

    elements takes 512 MB of space in memory which is far too huge to transfer back

    and forth between the FPGA and the GPU (or CPU for that matter) at sufficient

    speeds. Their final architecture therefore consists of KinectFusion being executed

    nearly completely on the GPU, but with the ICP part of the tracking step offloaded

    to the FPGA. Real-time speeds of up to 28 FPS were achieved by halving the input

    data resolution and limiting the number of ICP iterations. Lastly, Gautier et al. point

    out that the Altera OpenCL SDK posed practical difficulties in optimizing area uti-

    lization, for example because the tool lacked support for fixed-point arithmetic.

    The fact that semi-dense and dense SLAM algorithms are characterized by high

    bandwidth requirements was also noted by [42] in 2016. In this work, Boikos et


    al. presented their first iteration of LSD-SLAM, achieving 4 FPS at a resolution of

320×240. Two accelerator units were implemented, and the communication between

    them had to occur via DDR because the intermediate data (on the order of a few

    MB) produced by the first unit could not be cached entirely on the FPGA. A re-

    designed architecture around the tracking core to enable the usage of a full stream-

    ing communication paradigm was presented in [43], bringing a five-fold frame rate

    improvement over the previous work. Combined with a scalable depth estimation

    architecture again by Boikos et al. in [3], 2019 marked the arrival of the first com-

    plete accelerator for direct semi-dense monocular SLAM on FPGAs. The power con-

    sumption was measured to be an order of magnitude smaller than that of an Intel

    i7-4770 CPU. A highly important takeaway is that the dataflow principle (i.e. ker-

    nels linked with a single-consumer, single-producer pattern) was found to yield the

    most efficient design. Furthermore, the units were made multi-modal in order to

    deal with LSD-SLAM’s complex control flow due to the iterative and multi-level na-

    ture of tracking. More specifically, the pipelined hardware blocks could be put to

    different uses depending on the current phase of the system: every unit contains a

    set of operations from which the desired computation can be selected by means of

    multiplexing. As will be explained in Chapter 5, similar techniques were employed

    in our research for the dataflow architecture of KinectFusion.
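The following sketch illustrates this single-producer, single-consumer pattern in Vivado HLS terms, with hypothetical stage functions rather than the actual KinectFusion units. Under the DATAFLOW directive the two stages run concurrently, exchanging data through a small on-chip FIFO instead of via DDR:

    #include "hls_stream.h"

    static void stage1(const float *in, hls::stream<float> &link, int n) {
        for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
            link.write(in[i]);  // produce one element per cycle
        }
    }

    static void stage2(hls::stream<float> &link, float *out, int n) {
        for (int i = 0; i < n; ++i) {
    #pragma HLS PIPELINE II=1
            out[i] = 0.5f * link.read();  // consume as soon as data arrives
        }
    }

    void top(const float *in, float *out, int n) {
    #pragma HLS DATAFLOW
        hls::stream<float> link("link");
    #pragma HLS STREAM variable=link depth=64
        stage1(in, link, n);
        stage2(link, out, n);
    }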


    Chapter 3

High-level synthesis design of individual kernels

    The transformation of KinectFusion’s source code into Vivado HLS-optimized code

is an important aspect of this thesis, not just to obtain an HLS implementation of

    SLAM but also in an attempt to recognize patterns in the approach by which it is

done. Many subroutines that are part of a dense 3D SLAM algorithm often recur in other applications related to computer vision and image or video processing as well, because the data management structure of such kernels only varies so much in practice. Conse-

    quently, the task of accelerating these kernels can largely be mapped to a framework

    that we developed for the purpose of making the design of similar HLS kernels eas-

    ier in the future. Before proceeding to this methodology in Section 3.2, some details

about the specific use case are discussed first, so as to give a better idea of the char-

    acteristics and diversity of kernels that are being dealt with.

    3.1 Prerequisites

    3.1.1 Detailed algorithm description

    KinectFusion, like any SLAM algorithm, is composed of various steps that have to be

    executed in succession for each captured depth frame. A diagram of KinectFusion’s

    nine essential kernels is depicted in Figure 3.2. Note that some routines are called a

variable number of times per frame, and that multiple instances of each kernel might

    be called with mutually different sets of dimensions. This complex interaction and

    dataflow will be discussed in detail in Chapter 5; a functional description of each

    kernel follows here.

    • mm2m_sample: This kernel essentially resamples the sensor output and per-

    forms a unit conversion from millimeters to meters. The raw depth frames

    captured by the Kinect camera are given in an unsigned short integer format,

    which need to be converted to a floating point representation and resized (if

    necessary) to the correct dimension by subsampling the pixels. This allows all

    subsequent kernels to work with distance values expressed in meters only.


    • bilateral_filter: Because the Kinect depth map is rather noisy [44], the data is

    first filtered in an edge-preserving way. This kernel is a non-linear filter that

    replaces each depth value by a weighted average of nearby values in order to

    reduce the noise amplitude. Its smoothing operation relies on the prior knowl-

    edge that many real-world environments consist of large patches of mostly flat

    areas, such as shown in Figure 3.1. The algorithm clearly preserves discontinu-

ities, which is achieved by assigning only very small weights to intensities that are ’far away’

    from the currently considered intensity in an absolute value sense. Newcombe

    et al. [5] have found that the insertion of this preprocessing stage greatly

    increases the quality of the normal maps produced in later stages of the al-

    gorithm, which in turn improves the data association step performed by the

    tracking kernels.

    • half_sample: This kernel resamples the bilaterally filtered depth map by a

    factor of two in each dimension. Every four input values are thus mapped to

one output value, again with an edge-preserving effort so as not to introduce any

    fake averaged depth values near discontinuities.

    • depth2vertex: This kernel computes a point cloud in projective form, i.e. with

    every vertex as a function of its pixel position. By multiplying every depth

    value with a matrix that summarizes the camera’s intrinsics [45], an output

    array is produced where each element represents a Euclidean 3D point in the

    local frame of reference.

    • vertex2normal: Given the map of vertices generated previously, this kernel

    produces an array of normal vectors. Every normal vector is calculated by

    taking the cross product of two subtracted pairs of neighboring points.

    • track: This kernel performs part of the multi-scale Iterative Closest Point (ICP)

    alignment. It essentially tracks the live depth frame against the globally fused

    model, in order to establish correspondences between the new and synthetic

    point clouds. No feature selection occurs as KinectFusion is a fully dense al-

gorithm. A faster-than-conventional variant of ICP is employed by [5], as well as by the source code discussed later. As long as the frame rate remains suffi-

    ciently high, this optimization is made possible thanks to the assumption of

    small motion from one frame to the next.

    • reduce: This kernel calculates the total error of the tracking result, by adding

    up distance