
  • Mälardalen University

    School of Innovation, Design and Engineering

    Master thesis in Electronics

    Evaluation of partial reconfiguration in FPGA-based high-performance video systems

    Author: Emil Segerblad [email protected]

    Supervisor: Dr. Mikael Ekström

    Examiner: Prof. Lars Asplund

    June 5, 2013

  • Abstract

    The use of reconfigurable logic within the field of computing has increased during the last decades. The ability to change hardware during the design process enables developers to lower the time to market and to reuse designs in several different products. Many different architectures for reconfigurable logic exist today, with one of the most commonly used being the Field-Programmable Gate Array (FPGA). The use of so-called dynamic reconfiguration, or partial reconfiguration, in FPGAs has recently been introduced by several leading vendors, but the concept has existed for several decades. Partial reconfiguration is a technique where a specific part of the FPGA can be reprogrammed during run-time. In this report an evaluation of partial reconfiguration is presented with focus on the Xilinx ZynQ System-On-Chip and the GIMME2 vision platform developed at Mälardalen University. Special focus has been given to the use of partial reconfiguration in high-performance video systems such as the GIMME2 platform. The results show that the current state of the technology is capable of performing reconfigurations within strict timing constraints but that the associated software tools are still lacking in both performance and usability.

  • Sammanfattning

    The use of reconfigurable logic in the field of computing has increased over the last decades. The ability to change the hardware during the design process can help developers shorten development times and reuse designs in several different products. Many different architectures for reconfigurable logic exist today, and one of the most common is the Field-Programmable Gate Array (FPGA). The use of so-called dynamic reconfiguration, or partial reconfiguration, in FPGAs has recently been introduced by several leading vendors, but the concept has existed for several decades. Partial reconfiguration is used to change a specific part of the hardware during run-time. In this report, an evaluation of partial reconfiguration on FPGAs is presented, with focus on the Xilinx ZynQ System-On-Chip and the GIMME2 platform developed at Mälardalen University. Special focus has been placed on the use of partial reconfiguration in high-performance video systems such as the GIMME2 platform. The results show that the current technology is capable of performing partial reconfigurations within strict timing constraints, but that the associated tools (the software) still have clear shortcomings in both performance and usability.

  • Acknowledgements

    The path of the righteous man is beset on all sides by the iniquities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity and good will, shepherds the weak through the valley of darkness, for he is truly his brother’s keeper and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger those who would attempt to poison and destroy My brothers. And you will know My name is the Lord when I lay My vengeance upon thee.

    - Jules Winnfield, ”Pulp Fiction”, 1994

    I chose this quote not because I am a religious person, on the contrary, I chose it because it sounds cool and also due to the fact that it is something that you would not expect a hitman to say just before killing a man. It makes you think, does it not? Now, before I get all philosophical, there are some people I would like to thank for making this master thesis possible. First of all I would like to thank Sara, my fiancée, for her great support during these past 4 years. Furthermore, I would like to thank my family and friends for their interest in and support of my work. I would also like to thank my roommates at MDH: Fredrik, Carl, Ralf and Batu, for not throwing me out when I was annoying and also for their good ideas and feedback. Lastly, I would like to thank my supervisor Mikael Ekström at MDH for his good support and ”can-do” attitude, and my examiner Lars Asplund at MDH for good feedback and challenging tasks.

    Emil Segerblad - Västerås - June 5, 2013


  • Glossary

    API Application Programming Interface 46, 48

    ARM Advanced Reduced Instruction Set Computer (RISC) Machine 17

    ASIC Application-Specific Integrated Circuit 1, 10

    AXI Advanced eXtensible Interface 17, 31–33, 36, 37, 41, 46

    BIOS Basic Input/Output System 39

    BLE Basic Logic Element 10

    CAN Controller Area Network 41

    CCD Charge-Coupled Device 7

    CPLD Complex Programmable Logic Device 9, 10

    CPU Central Processing Unit 3, 10, 13, 50

    DDR Double Data Rate 36, 41, 46, 47, 58

    DMA Direct Memory Access vi, 18, 35, 46

    DSP Digital Signal Processor 14

    EDK Embedded Development Kit 33, 34

    EMIO Extended Multiplexed I/O 18

    EPP Extensible Processing Platform iv, 3, 17, 19

    FPGA Field Programmable Gate Array 1–3, 5, 6, 10–17, 22–25, 27, 28, 31–33, 39, 40, 48–53

    FPS Frames Per Second 27, 28, 46

    FSBL First-Stage Boot-Loader 40

    GIMME General Image Multiview Manipulation Engine 5

    GPIO General Purpose Input/Output 41

    GPS Global Positioning System 7

    GPU Graphics Processing Unit 23

    HDL Hardware Description Language 1, 3, 5, 11, 52

    HLS High-Level Synthesis 1, 2, 6, 37, 51, 53


  • I/O Input/Output 33

    IC Integrated Circuit 10, 13, 16

    ICAP Internal Configuration Access Port v, 15, 25–28, 33–36, 46, 49, 50, 52

    IEEE Institute of Electrical and Electronics Engineers 31

    IP Intellectual Property 6, 15, 16, 39, 48, 50, 58

    ISE Integrated Software Environment 14–16, 34, 36, 38, 45, 51

    LIDAR Light Detection and Ranging 7

    LUT Look-Up Table 10, 14

    MDH Mälardalen University 41, 51, 53

    MIG Memory Interface Generator 41

    MIO Multiplexed Input/Output 40

    MIPS Microprocessor without Interlocked Pipeline Stages 8

    PCAP Parallel Configuration Access Port 16, 17, 31–34, 36, 45, 46, 49–52

    PCI Peripheral Component Interconnect 14, 23

    PL Programmable Logic 1, 9, 10, 17, 18, 31, 35, 39–41, 45, 46, 49, 50

    PS Processing System 17, 18, 31–33, 35, 36, 40, 41, 46, 49, 50

    RAM Random Access Memory 14

    RISC Reduced Instruction Set Computer ii, 17

    RTL Register Transfer Level 1

    SD Secure Digital 40, 48

    SDK Software Development Kit 40

    SoC System on Chip 5, 17, 18, 31, 39, 40, 49

    SRAM Static Random Access Memory 27

    USB Universal Serial Bus 39

    VDMA Video Direct Memory Access 36, 37, 47, 50

    VHDL VHSIC Hardware Description Language v, 37, 47, 50

    WSN Wireless Sensor Networks 27

    XMD Xilinx Microprocessor Debugger 40


  • List of Figures

    1.1  Figure showing a comparison between various hardware architectures. Picture is from the work of Flynn and Luk. [25]  2
    1.2  Figure showing the concept behind dynamic reconfiguration.  3
    1.3  Figure showing a wrongly performed partial reconfiguration.  4
    2.1  Figure showing the concept of a stereo-camera setup taken from the work of Ohmura et al. [55]  8
    2.2  Video flow example.  8
    2.3  Filtering example from http://rsbweb.nih.gov/ij/plugins/sigma-filter.html  9
    2.4  Feature extraction example from http://www2.cvl.isy.liu.se/Research/Robot/WITAS/operations.html  9
    2.5  MIPS pipeline  10
    2.6  Figure showing the mobile robot used by Weiss and Biber (left) and the output 3D map (right). The image is from the work of Weiss et al. [64]  10
    2.7  Figure showing the general concept of an FPGA-device. The figure is from the article by Kuon et al. [48]  11
    2.8  Figure showing a typical Look-Up Table (LUT). [49]  11
    2.9  Picture from lecture slides from NTU. [8]  12
    2.10 Figure showing how a LUT can be ”programmed” to perform logic operations. [49]  12
    2.11 Figure showing the general concept of an FPGA-device. The figure is from the article by Kuon et al. [48]  13
    2.12 Figure showing a typical island style partition location strategy.  15
    2.13 Figure showing the outline of the Xilinx RP. [35]  16
    2.14 The layout of the DevC-module. [38]  18
    2.15 Figure showing the general idea on how to utilize the Xilinx Extensible Processing Platform (EPP)-family. [41]  19
    2.16 Figure showing the outline of the Xilinx ZynQ-SoC. Image from Xilinx document UG585. [38]  19
    2.17 Figure showing the ZC702 board peripherals. [41]  20
    2.18 Figure showing the outline of the GIMME2 platform. [3]  20
    2.19 Figure showing the GIMME2 board’s front (right) and backside (left). Notice the two image sensors on the backside of the PCB (encircled in red). Also notice the Zynq SoC (encircled in yellow), the PS DDR Memory (encircled in blue) and the PS DDR Memory (encircled in purple).  21
    3.1  Figure showing the system from the article by Hosseini and Hu. [29] To the left is the logic implementation and to the right is the CPU implementation using the Altera Nios II.  23
    3.2  System developed by Ohmura and Takauji. Picture is retrieved from the article by Ohmura and Takauji. [55]  23
    3.3  System developed by Ohmura and Takauji. Picture is retrieved from the article by Ohmura and Takauji. [55]  24
    3.4  Komuro et al.’s architecture. [47]  24
    3.5  Blair et al.’s performance. [14]  25
    3.6  Figure showing Ming et al.’s results taken from their article. [51]  27
    3.7  Koch et al.’s implemented system. Picture is taken from the related article. [45]  28



  • 3.8  Ackermann et al.’s system. [2]  29
    4.1  Figure showing the Partial Reconfiguration flow from WP374. [23]  33
    4.2  Excluded partial reconfiguration steps from XAPP1159 shown in red. [46]  34
    4.3  Reference design with Microblaze and Internal Configuration Access Port (ICAP) added.  35
    4.4  Reference design from XAPP1159.  36
    4.5  Implemented video design.  37
    4.6  Second implemented video design.  38
    4.7  Red filter code.  39
    4.8  Color filter code.  42
    4.9  Work flow for Partial Reconfiguration in Xilinx ISE from UG702. [35]  43
    4.10 Example boot image format for Linux. Picture from UG821. [37]  43
    4.11 Example boot image format for Linux. Picture from UG873. [36]  44
    5.1  Excerpt from VHSIC Hardware Description Language (VHDL)-code generated by Vivado HLS.  47
    7.1  Proposed video pipe line.  51
    A.1  Macro definitions.  58
    A.2  Memory access code.  58
    B.1  Overview of ZynQ-family. Image from DS190. [42]  59


  • List of Tables

    3.1  Figure showing the results from work by Hosseini and Hu. [29] The first four rows are for the filtering of a 64 x 64 pixel image while the last two are for the filtering of a 256 x 256 pixel image.  22
    3.2  Komuro et al.’s performance. [47]  25
    3.3  Meyer et al.’s results. [53]  26
    3.4  Table showing Ming et al.’s results taken from their article. [51]  26
    3.5  Ackermann et al.’s results. [2]  28
    3.6  Ackermann et al.’s results. [2]  29
    3.7  Bhandari et al.’s results. [11]  30
    3.8  Perschke et al.’s results. Picture from the related article. [58]  30
    3.9  Perschke et al.’s results. Picture from the related article. [58]  30
    5.1  Figure showing the time needed to just finish the ”write”-function call on the ZC702 board running Linux.  45
    5.2  Figure showing the time needed for the full reconfiguration flow on the ZC702 board running Linux.  45
    5.3  Figure showing reconfiguration time needed to finish the Direct Memory Access (DMA)-transfer on the ZC702 board using standalone software.  46
    5.4  Figure showing summarized test results.  46


  • Contents

    List of Figures iv

    List of Tables vi

    Introduction  1
        Thesis description  4
        Scope of this report  5
        Outline of this report  6

    Background  7
        Introduction to computer vision and its applications  7
        Introduction to Programmable Logic (PL)  9
        Field Programmable Gate Array-technology  10
        Run-time Reconfigurability with focus on Xilinx FPGAs  12
        Heterogeneous systems  16
        Computer Vision on FPGAs and heterogeneous systems  17
        Xilinx ZynQ-7000 and Xilinx ZC702  17
        GIMME 2  18

    Related work  22
        Implementations of computer Vision on FPGAs  22
        Implementations of reconfigurable FPGA-systems  24
        Implementations of reconfigurable FPGA-systems running computer vision algorithms  27

    Method  31
        Early work  31
        Design considerations  31
        Method of reconfiguration  34
        Implemented vision components  36
        System design in Xilinx ISE/Planahead  38
        Interface from Linux  39
        GIMME2  39

    Results  45
        Performance of reconfiguration methods  45
        Vision component implementation  47
        Software implementation evaluation  48
        GIMME2  48

    Discussion 49

    Future work 51

    Conclusions 52


  • Bibliography 54

    Appendix A Device interface from Linux 58

    Appendix B Overview of Xilinx ZynQ-family 59


  • Introduction

    Within the world of computation, one of the largest open questions is which hardware platform, or architecture, to use for a certain task. Some platforms can achieve high clock frequencies but low degrees of parallelism, while others can achieve a high degree of parallelism but are limited to lower clock frequencies. Each architecture has its own inherent strengths and weaknesses. Designers are often faced with the problem of choosing the right platform for the right task, which is not always a simple problem to solve as costs, availability and design support also play a major role during development and production. Hence, performance is not always the most important aspect to consider when choosing a platform.

    Today many applications exist where both performance and costs are critical aspects. The industry standard for such applications has for many years been to design and implement Application-Specific Integrated Circuits (ASICs). ASICs have the advantage of being cheap to produce and allow for high performance in general. ASICs are not without problems, however. Long development times, high cost in small to medium-scale production and no re-programmability after production are some of their inherent disadvantages. Other technologies such as Field Programmable Gate Arrays (FPGAs) allow designers to shorten the design phase and allow for a high degree of re-programmability after production. FPGAs belong to a much larger family known as Programmable Logic (PL), where focus is on the programmability of the hardware, and which can jokingly be referred to as ”Lego-logic” due to its high degree of customizability.

    The use of PL in embedded systems has increased over the years due to the good flexibility and decent performance that the various PL-architectures offer. However, the reconfigurable nature of PL-devices also has some drawbacks, such as much lower speed (in terms of clock frequency) than other processing architectures. What the PL-architectures lack in speed they gain in parallelism: the ability to run tasks in parallel on PL borders on the extreme. A comparison between various hardware architectures with respect to performance and programmability can be found in figure 1.1.

    Some PL-systems are reconfigurable during run-time. This means that one or several sections of the logic fabric can be reprogrammed with or without affecting the operation of other sections, depending on which technique is used. Run-time reconfiguration enables developers to dynamically use hardware acceleration in their applications by changing the behaviour of the logic. Furthermore, by swapping functions in and out of the FPGA, chip area can be saved. An example of this concept, known as partial reconfiguration, can be seen in figure 1.2. At first the FPGA-fabric contains functions A-D. At some point the user wants to put function E onto the FPGA and hence functions B and D, in this example, must be overwritten in order to make function E fit properly.

    Some important aspects seen in figure 1.2 need further explanation. All bit streams (a stream of data containing the new configuration of the PL) for the functions to be used in the system need to be generated prior to run-time and stored in non-volatile memory in order to minimize reconfiguration delay and ensure a deterministic system behaviour (generating bit streams during run-time is currently impossible as the time needed is vast). In most cases the tasks put on the FPGA are written manually by the developer using a Hardware Description Language (HDL), but lately tools for automatic HDL-code generation from high level languages have appeared on the market. One example of such a tool is Xilinx Vivado High-Level Synthesis (HLS). Vivado HLS is able to generate Register Transfer Level (RTL)-level code from high level languages such as C/C++ and Matlab.


  • Figure 1.1: Figure showing a comparison between various hardware architectures. Picture is from the work of Flynn and Luk. [25]

    RTL, not to be confused with Resistor-Transistor Logic, is a common abstraction mechanism used in hardware design where synchronous circuits are modelled as the data flow between registers and the manipulations performed on that data flow. For more information about Xilinx Vivado HLS, please refer to the Vivado HLS User Guide. [33]

    When doing partial reconfiguration there exists a risk of corrupting previously existing functions. An example can be seen in figure 1.3.

    As can be clearly seen in figure 1.3, if a component is placed badly onto the FPGA it can destroy or interfere with other pre-existing components. Therefore, it is of utmost importance that the user keeps track of the boundaries of all components on the FPGA in some manner. This is, however, tiresome and error-prone to do by hand, so a better option to ensure component integrity is to implement a resource manager that keeps track of where the functions are placed and, during partial reconfiguration, makes sure that no pre-existing component is affected negatively. This task is highly complex and not without problems. A third option also exists: to divide the area of the chip into different regions with different properties. Most common is to have one region that is static, that is non-reconfigurable, and then several regions that can be reconfigured. Using high-level design tools, the user can then verify that none of the tasks to be put into one of these regions violates any design rules or other limitations. After this is verified, the functionality of each region can be changed dynamically during run-time without any concern that the reconfiguration would affect any other region. However, there exist some limitations on partial reconfiguration, both software-wise and hardware-wise, as will be presented later in this report. Furthermore, some different ways of partitioning the FPGA logic between static and reconfigurable areas will be briefly explained in the Background section.
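    To make the resource-manager idea above concrete, the following C++ sketch (a purely hypothetical illustration, not code from this thesis or from Xilinx’s tools) keeps a table of predefined reconfigurable regions and rejects any module that would not fit inside its target region:

        #include <cstdio>
        #include <string>
        #include <vector>

        // Hypothetical sketch of the "resource manager" idea: the static design
        // keeps a table of predefined reconfigurable regions and refuses to load
        // a module that does not fit, so neighbouring logic cannot be corrupted.
        struct Region {
            std::string current_module;   // empty when the region is free
            int capacity_luts;            // resources available in this region
        };

        bool load_module(std::vector<Region>& regions, int region_id,
                         const std::string& module, int required_luts) {
            if (region_id < 0 || region_id >= static_cast<int>(regions.size()))
                return false;                  // unknown region
            Region& r = regions[region_id];
            if (required_luts > r.capacity_luts)
                return false;                  // module would overflow the region
            r.current_module = module;         // a real manager would now start the
            return true;                       // actual bit stream transfer
        }

        int main() {
            std::vector<Region> regions = {{"", 2000}, {"", 4000}};
            bool ok = load_module(regions, 1, "edge_filter", 3500);
            std::printf("load %s\n", ok ? "accepted" : "rejected");
            return 0;
        }

    In a real system the load function would also trigger the actual bit stream transfer, which is discussed later in this report.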

    The applications of this technology are many. The ability to put software tasks on hardware is appealing, especially for systems with a strong degree of possible parallelism in them. An example here would be computer vision systems, where large chunks of data need to be processed in real-time. Vision systems also have a high degree of parallelism, making them ideal for hardware acceleration. However, the software used in modern vision systems is often complex and written in high-level languages such as C++ or Java.


  • Figure 1.2: Figure showing the concept behind dynamic reconfiguration.

    Porting such complex tasks to an FPGA would be extremely hard as many of the constructs used in high-level languages have no direct representations in HDL. However, some tools exist for high-level language to HDL conversion, such as Vivado HLS that was presented earlier in this text. These tools are not without limitations though, as will be seen later in this report.

    Instead of creating new specialized functions for each implementation, a library of vision algorithms and tools is often used in software design. One popular library for computer vision is OpenCV. OpenCV contains several high-level functions for image processing such as Hough transforms, color conversion algorithms and feature extraction methods. The OpenCV-library is compatible with a wide range of operating systems, such as Linux and Microsoft Windows.

    In recent years FPGAs with hard Central Processing Unit (CPU)-cores embedded in them have been introduced to the market. This enables programmers to mix the high clock frequency of the CPU-cores with the high parallelism and programmability of the FPGA in order to create high-performing systems within a wide range of applications. Platforms that contain several different processing elements are known as heterogeneous platforms and will be discussed more extensively later in this report. Some of these heterogeneous FPGAs even feature the possibility to reprogram the FPGA from the hard CPU-cores during run-time. This is an interesting feature as it could prove useful in certain types of high-performance systems where a wide range of functions can be offloaded onto the FPGA during run-time, hence increasing performance and decreasing power usage. The newly released Xilinx ZynQ EPP is a heterogeneous system that features a dual-core ARM-processor and a large FPGA-area. The Xilinx Zynq has the capability to reconfigure the FPGA during run-time.


  • Figure 1.3: Figure showing a wrongly performed partial reconfiguration.

    Thesis Description

    As stated earlier, the work presented in this report was done as part of a master thesis in electronics at Mälardalen University. The original thesis description was produced by Professor Lars Asplund at Mälardalen University. [7]

    Reconfigurable Systems

    The FPGA-area in the Zynq is reconfigurable, which means that parts of the area can be reloaded during run-time. This can be of great use for robotics vision system, e.g. where different algorithms can be loaded for navigation first and later algorithms for object recognition, i.e. a robot moving in an apartment first and then in the kitchen it can find the coffee . . . This master thesis work aims to design a framework for loading a set of OpenCV components in the FPGA area at the same time as the camera-part is continuously running and with the requirement that some other vision components that are resident can continue to work. The work will use already defined blocks for the camera input, and the work is also connected to the research project Ralf-3, and will be used in a tool for allocation of software components on the FPGA.

    High level programming

    Xilinx has released a system called Vivado, which makes it possible to use high level languages such as Matlab Simulink, C or C++ for implementation in an FPGA. This master thesis work aims at evaluating Vivado. Special attention should be put on components from the OpenCV-library. In the thesis work the suitability of Vivado for this type of computation will be evaluated. Special attention should be put on speed and FPGA-area allocation. The possibility of using LabView should be included.


  • Cooperation

    These two thesis works should be run simultaneously since they benefit from each other’s results. They will also be connected to two research groups: the robotics group in terms of hardware and IP-components for the cameras and communication with the dual core ARM, and the Software Engineering group in the Ralf-3 project.

    This work focuses on the ”Reconfigurable Systems” part. However, modifications to the thesis description were needed as no student applied for the ”High level programming” part and hence no OpenCV-functions were generated for the partial reconfiguration framework to use. Instead, this thesis focuses on the use of partial reconfiguration and the properties of the Xilinx Zynq FPGA and its associated tools. A more extensive description of the thesis and this report can be found in the next section.

    Scope of this report

    In this report the Xilinx ZynQ System on Chip (SoC), which features a dual-core ARM-processor unit and an FPGA, will be used to demonstrate the capabilities and limitations of heterogeneous systems in high-performance applications such as image or video processing, with emphasis on the partial reconfiguration possibilities found in SoCs such as the ZynQ. The Xilinx ZynQ SoC and the Xilinx tool suite associated with the ZynQ support run-time partial reconfiguration of the FPGA both from the hard ARM-processor and from within the FPGA, and will hence be evaluated with respect to usability and speed of the reconfiguration process. Furthermore, the usability of partial reconfiguration in video processing systems will be evaluated using the GIMME2 stereo-vision platform. GIMME2 is a hardware platform featuring a Xilinx ZynQ FPGA, developed at Mälardalen University as part of the research project VARUM (Vision Aided Robots for Underwater Minehunting). The General Image Multiview Manipulation Engine (GIMME) platform was first introduced in the work of Ekstrand et al. [4] The possibility of generating hardware components from OpenCV-like functions will also be explored and evaluated, to some degree, using the Xilinx Vivado HLS tool suite. The motivation for this is to provide programmers with tools and techniques to easily place vision components onto the ZynQ FPGA in order to offload the embedded ARM-cores in the SoC during run-time. The work was done at Mälardalen University as part of a master thesis in electronics.

    In order to limit the scope of the thesis work and to produce clear goals for the thesis itself, a set of questions that this report intends to answer is presented below.

    • What are the possible advantages of using heterogeneous systems in video processing instead of homogeneous systems?

    • What is the performance of Xilinx’s current partial reconfiguration methods?

    • How well developed are the available software tools for partial reconfiguration in Xilinx FPGAs?

    • What are the technological limitations of partial reconfiguration in its current state?

    • How can partial reconfiguration be utilized in high-performance video systems such as stereo-vision systems and what are the implications of this technology for machine vision in general?

    • What types of components can be partially reconfigured?

    • How can Mälardalen University use partial reconfiguration in their current research projects such as GIMME2?

    • What is the status of high-level language to HDL-tools such as Xilinx Vivado HLS, in terms of efficiency and performance?


  • These questions are answered throughout this report, and a summarized version of the answers can be found in the Conclusions-section on page 52.

    Outline of this report

    In the Background-section, starting on page 7, an overview of FPGAs, heterogeneous platforms, programmable logic and computer vision in general will be presented to the reader. Furthermore, the Xilinx ZynQ-FPGA, the GIMME2 stereo-vision platform, run-time reconfiguration of the ZynQ-FPGA and the open source image processing library OpenCV will be discussed. In the Related work-section, page 22, state of the art solutions and implementations in related fields will be presented to the reader in the form of short summaries of research papers and technical reports.

    In the Method-section, page 31, the implementation done by the author of this report is presented to the reader: a reference system based around Xilinx Intellectual Property (IP)-cores running in the Zynq’s FPGA will be presented and evaluated, image processing components generated with Vivado HLS will be presented and evaluated, and the final implementation on the GIMME2-board will be presented and evaluated. In the Results-section, starting at page 45, the performance and other important features of the implementation are evaluated and demonstrated. In the Discussion-section, page 49, future work and possible improvements are presented to the reader. The conclusions of this report can be found in the Conclusions-section, page 52. Finally, some appendices containing useful information about the configuration and set-up of GIMME2 are presented starting at page 57.


  • Background

    Introduction to computer vision and its applications

    Computer vision, or machine vision, is becoming more widespread these days as more and more robotic and control applications require some kind of vision to function properly.

    An image on a digital system is constructed of small elements called pixels. The number of pixels in an image indicates the so-called resolution; a large number of pixels indicates a high resolution. As the resolution of an image is always finite, regardless of the camera, the image is a discrete representation of a continuous environment. Using image acquisition devices such as a Charge-Coupled Device (CCD), the three-dimensional world is converted into a two-dimensional representation expressed by a linear combination of a set of base colours. An example of such a set is the Red, Green and Blue (RGB) used in the RGB-colorspace. By expressing the color of a pixel as a linear combination of red, green and blue, a human-interpretable representation is constructed. Other colorspaces that utilize other aspects of the environment to create an image, which may or may not be fully human-interpretable, exist as well but are outside the scope of this report. The number of unique colors that a colorspace can create is called the color depth. For example, if the RGB-colorspace is used and each component is expressed as 1 byte (8 bits), a color depth of 24 bits is achieved and the number of possible colors is 2^24 = 16,777,216.

    As said earlier, a single camera will generate a two-dimensional representation of the world. This might be hard to believe, as when a human views an image he/she can ”perceive” the missing dimension, depth. This is due to the human ability to interpolate by using visual features in the image. This is something a computer cannot do (yet) and hence it is unable to perceive depth from a single image. By using more than one camera, depth images can be created. The concept is presented in figure 2.1 using a stereo-camera system.

    The distance, z, to an object can be calculated by the formula seen in equation 2.1

    z = (fL)/(pd) (2.1)

    by a method called triangulation. L is the distance between the two cameras, f is the focal length of the cameras, p is the size of one pixel in the cameras and d is the so-called disparity. The disparity is the absolute value of the difference in pixels between the center of the object (or some other visual feature) between the two cameras. The same concept of triangulation used here is used in, for example, the Global Positioning System (GPS), where the position of a device is calculated by measuring the time it takes for signals to travel from a number of different satellites to the device. An example of stereo vision can be found in the work of Chen et al. [17]
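    As a minimal numerical illustration of equation 2.1, the C++ sketch below computes the distance from a measured disparity; the camera parameters are made-up placeholder values and are not taken from the GIMME2 platform:

        #include <cstdio>

        // Depth from stereo disparity, z = (f * L) / (p * d)  (equation 2.1).
        // f: focal length [m], L: baseline between cameras [m],
        // p: pixel size [m], d: disparity [pixels].
        double depth_from_disparity(double f, double L, double p, double d) {
            if (d <= 0.0) return -1.0;   // no valid match between the two images
            return (f * L) / (p * d);
        }

        int main() {
            double f = 0.006;            // 6 mm focal length (placeholder)
            double L = 0.10;             // 10 cm baseline (placeholder)
            double p = 0.000006;         // 6 um pixel size (placeholder)
            double d = 25.0;             // measured disparity in pixels
            std::printf("z = %.2f m\n", depth_from_disparity(f, L, p, d));
            return 0;
        }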

    Some main ideas are common to all computer vision systems. Firstly, a representation of the environment must be acquired. Image acquisition can be done in many different ways, with the most common method being the use of a regular camera. Other ways of creating representations of the environment include Light Detection and Ranging (LIDAR) and ultrasonic range sensors. After an image has been acquired, some kind of filtering known as preprocessing is often performed in order to remove noise or unwanted parts of the image. For example, if one wants to find green objects it would be unnecessary to keep anything but green pixels in the image. After the filtering is completed, an algorithm for feature extraction is often performed.


  • Figure 2.1: Figure showing the concept of a stereo-camera setup taken from the work of Ohmura et al. [55]

    After relevant features are extracted, the detection of some parameter is performed and the results are used to control the system in some fashion. An example of such a video processing flow can be seen in figure 2.2. Other steps can be added as needed to the video flow in order to perform more complex operations.

    Figure 2.2: Video flow example.

    One example of filtering can be seen in figure 2.3 while one example of feature extraction can be seen in figure 2.4.

    A common method for performing these image operations efficiently is to use a so-called pipeline. Pipelining is a well-known technique within the field of computer architecture and is used to increase throughput and parallelism in computer systems. The concept presented in figure 2.2 can be implemented as a pipeline. The essential idea behind pipelining is to allow multiple instructions to overlap in execution, i.e. instead of performing all necessary operations in one stage, the operations on data can be split between several interconnected stages. One good example of a pipeline implementation is the one found in the Microprocessor without Interlocked Pipeline Stages (MIPS) pipeline, where instructions have 5 distinct steps [32]:

    1. Fetch instruction (IF)

    2. Decode instruction (ID)

    3. Execute instruction operation (EX)

    4. Access operands (MEM)

    5. Write result to register (WB)

    An example of this pipeline can be seen in figure 2.5. For more information about the use of pipelines in modern computer architecture and the MIPS-pipeline, please refer to the works of Patterson and Hennessy. [32] [31]
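    The overlap idea can also be illustrated with a small software analogy (this is only a sketch of the scheduling principle, not the MIPS hardware pipeline itself): in each cycle a new frame enters the first stage of a video pipeline while older frames advance through the later stages.

        #include <cstdio>

        // Simplified software analogy of pipelining: three frames flow through
        // the capture -> filter -> detect stages. In cycle t, stage s works on
        // frame (t - s), so the stages overlap in time.
        int main() {
            const char* stages[3] = {"capture", "filter", "detect"};
            const int frames = 3;
            for (int t = 0; t < frames + 2; ++t) {        // 2 extra cycles to drain
                std::printf("cycle %d:", t);
                for (int s = 0; s < 3; ++s) {
                    int frame = t - s;
                    if (frame >= 0 && frame < frames)
                        std::printf("  %s(frame %d)", stages[s], frame);
                }
                std::printf("\n");
            }
            return 0;
        }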


  • Figure 2.3: Filtering example from http://rsbweb.nih.gov/ij/plugins/sigma-filter.html

    Figure 2.4: Feature extraction example from http://www2.cvl.isy.liu.se/Research/Robot/WITAS/operations.html

    A lot of pre-developed libraries that contain high-level image processing functions exist today. One example of such a library is the Open Computer Vision library, better known as OpenCV. OpenCV is an open-source project released under the BSD-license, meaning that it is free to use even in commercial applications. OpenCV contains a vast range of image processing algorithms and can be run on both Windows and *nix-based operating systems (Linux, Mac and Android). For more information about OpenCV please refer to the official homepage http://opencv.org/.
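    As a small example of the kind of high-level functionality OpenCV offers, the C++ snippet below performs the filtering and feature-extraction steps of figure 2.2 on a single image; the file names are placeholders, not files used in this thesis:

        #include <opencv2/opencv.hpp>

        int main() {
            // "input.png" is a placeholder file name.
            cv::Mat color = cv::imread("input.png");
            if (color.empty()) return 1;                        // could not read the image

            cv::Mat gray, edges;
            cv::cvtColor(color, gray, cv::COLOR_BGR2GRAY);      // color conversion
            cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.5);  // preprocessing/filtering
            cv::Canny(gray, edges, 50, 150);                    // feature (edge) extraction

            cv::imwrite("edges.png", edges);
            return 0;
        }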

    One possible application of computer vision is within agriculture, as can be seen in the research article written by Segerblad and Delight. [21] The most interesting example from the work of Segerblad and Delight [21] is, according to the author of this report at least, the work of Weiss and Biber [64]. Using a LIDAR-device the two scientists detected and mapped vegetation in fields onto a global 3D map. This LIDAR-device was mounted onto a mobile robot that traversed the fields. The solution showed promising results with a successful detection rate of maize plants, in a field, of 60 percent. An image showing the mobile robot used and the generated 3D map can be seen in figure 2.6.

    More information about computer vision and its applications can be found in the book written by Szeliski. [61]

    Introduction to Programmable Logic (PL)

    Logic circuits can be either fixed or programmable. The behaviour of programmable logic can be changed after manufacturing is completed, while fixed logic has a static behaviour. PL has existed for over 50 years and is extensively used in both industry and academia. It offers developers and researchers a flexible and fast way to design and implement hardware.



  • Figure 2.5: MIPS pipeline

    Figure 2.6: Figure showing the mobile robot used by Weiss and Biber (left) and the output 3D map (right). The image is from the work of Weiss et al. [64]

    The Complex Programmable Logic Device (CPLD) is one of the most commonly used architectures together with the FPGA. Both technologies have their unique properties and different applications. In this report the focus will be on FPGAs and their particular applications. An overview of the CPLD and FPGA technologies can be found in the article by Brown et al. [16]

    One of the most interesting aspects of the various types of existing PL-architectures is the possibility to perform tasks in true parallel, unlike regular CPU-based systems that must run tasks in series. This means that some tasks can be performed much faster on PL-devices than on CPU-devices. Another appealing aspect is the possibility to reprogram these devices without loss of performance. This implies a lower development cost compared to regular ASIC-devices and also the possibility to correct potential hardware bugs after they are released to the market.

    Field Programmable Gate Array-technology

    Modern FPGAs were first introduced in the middle of the 1980’s by the American company Xilinx [26], but the concept can be traced back to the 1960’s. The first Xilinx FPGAs only contained a few thousand logic cells, while modern FPGA Integrated Circuits (ICs) can contain several million. The basic concept of the FPGA is to pack large amounts of logic blocks, memory blocks and other low-level hardware peripherals onto one IC and then use a large network of interconnections to ”glue” all components together. [24] In figure 2.7 this concept is demonstrated.

    The high degree of interconnectivity is what makes the FPGA so versatile, but it is also one of the big drawbacks of FPGAs. The high degree of interconnectivity implies that a large area of the FPGA-IC must be dedicated to this task; this increases the physical size of the packaging and also lowers the highest possible clock frequency, as the clock signals must travel longer distances.

    Configurable Logic Blocks (CLBs) provide the core logic and storage capabilities of the FPGA. In figure 2.7 these are labelled just ”logic”. Today, most commercial CLBs are LUT-based. Each CLB consists of several Basic Logic Elements (BLEs) arranged in a special fashion. In Xilinx’s FPGAs a CLB consists of a number of so-called slices, which in turn consist of several BLEs. A BLE contains an N-input LUT and a D-type flip-flop.


  • Figure 2.7: Figure showing the general concept of an FPGA-device. The figure is from the article by Kuon et al. [48]

    Using an N-input LUT makes it possible to implement any logic function with N input bits. This concept is seen in figure 2.8. By connecting the output from the LUT to a D-flip-flop the behaviour of the circuit can be synchronised. More complex logic functions are implemented by connecting several BLEs together. An example of this can be seen in figure 2.11 and also in figure 2.9. Most basic digital electronic concepts are explained in the book by Kuphaldt. [49]

    Figure 2.8: Figure showing a typical LUT. [49]
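    The LUT concept can be modelled in a few lines of C++ (a software illustration only, not how the silicon is built): the configuration bits are the truth table of the function, and the inputs simply select one of those bits.

        #include <cstdio>
        #include <cstdint>

        // Software model of a 2-input LUT: the 4 configuration bits are the
        // truth table, and the inputs select which bit drives the output.
        struct Lut2 {
            uint8_t config;   // bit i holds the output for input pattern i
            bool eval(bool a, bool b) const {
                int index = (b << 1) | a;      // inputs form the table index
                return (config >> index) & 1;
            }
        };

        int main() {
            Lut2 and_gate{0x8};   // truth table 1000: output 1 only for a=1, b=1
            Lut2 xor_gate{0x6};   // truth table 0110: output 1 when exactly one input is 1
            std::printf("AND(1,1)=%d  XOR(1,0)=%d\n",
                        and_gate.eval(true, true), xor_gate.eval(true, false));
            return 0;
        }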

    Programming the FPGA basically amounts to connecting the CLBs in the right fashion. Several different programming methods exist, some being static while others are changeable. The most common one today is the use of some kind of static memory to hold the configuration, while older technologies used fuses and anti-fuses to create permanent connections.

    HDLs were created in order to increase the development speed of implementations on FPGAs. Using a synthesis tool, the HDL code is translated into a bit stream containing the configuration of the FPGA. This bit stream can then be uploaded to the FPGA from a computer or a dedicated programmer. The two most popular HDLs are Verilog and VHDL. Both are commonly used within both academia and industry. In later years, graphical development tools for embedded systems on FPGAs have been released by most FPGA-manufacturers, enabling developers to rapidly develop complex systems.

    A survey of various FPGA-architectures can be found in the work by Kuon et al. [48]


  • Figure 2.9: Picture from lecture slides from NTU. [8]

    Figure 2.10: Figure showing how a LUT can be ”programmed” to perform logic operations. [49]

    Run-time Reconfigurability of FPGAs with focus on the Xilinx FPGAs

    Many of the modern FPGAs support run-time reconfigurability to some extent, with partial reconfiguration being the most common form. Partial reconfiguration is a term commonly used when referring to reconfiguration of a specific part of the FPGA without interfering with other components located on the FPGA. As can be seen in the Introduction, partial reconfiguration can be dangerous to the overall system performance or stability if performed wrongly. The potential benefits of run-time reconfiguration are many; for example, the ability to dynamically move demanding functionality from software to hardware would improve the performance of many applications. However, one must consider the time it takes to reconfigure a section on the FPGA and weigh that against the potential speed-up. Even though the possibility of run-time reconfiguration has existed for over two decades now, few FPGA manufacturers have provided a complete design flow including design tools and paradigms. Several reasons behind this can be found. The main reason is that the number of logic blocks on FPGAs has increased rapidly during this time and hence no direct need for partial reconfiguration has existed, due to the much easier process of implementing a fully static system instead of using function swapping. Another reason is the added development time needed for the implementation and verification of systems that feature partial reconfiguration.


  • Figure 2.11: Figure showing the general concept of an FPGA-device. The figure is from the article by Kuon et al. [48]

    However, as the ICs grow larger, so does the time needed to program them. This proves to be troublesome in applications where start-up timing is crucial, as will be seen later in this section. Another implication is high static power consumption due to the increased number of transistors in each package. By utilizing smaller FPGAs in combination with partial reconfiguration, lower power consumption and, in some cases, higher performance can be achieved.

    In his book on the subject [44], Koch presents some crucial ideas behind the concept of partial reconfiguration. Koch separates between active and passive partial reconfiguration, where active reconfiguration is when the FPGA is reconfigured during run-time without disturbing the rest of the FPGA, and passive reconfiguration is when the entire FPGA is stopped (stopped in this case meaning that all the clocks in the FPGA are stopped for a period of time) during reconfiguration. In this report only active partial reconfiguration is considered, and the term will hence be used synonymously with partial reconfiguration. Furthermore, Koch presents three open questions on the subject of partial reconfiguration, which can be seen in the quote below. [44]

    1. Methodologies: How can hardware modules be efficiently integrated into a system at runtime? How can this be implemented with present FPGA technology? And, how can runtime reconfigurable systems be managed at runtime?

    2. Tools: How to implement reconfigurable systems at a high level of abstraction for increasing design productivity?

    3. Applications: What are the applications that can considerably benefit from runtime reconfiguration?

    Furthermore, Koch identifies three possible benefits of partial reconfiguration: performance improvement, area and power reduction, and fast system start-up. To summarize: as has been stated before in this report, algorithms that are highly parallel in nature can easily achieve speed-up by running on an FPGA compared with a CPU. By swapping functions in and out of the FPGA dynamically, the power and area used can be reduced. Lastly, fast system start-up refers to systems where the device must have low start-up times. Partial reconfiguration can be used here to only load crucial components onto the FPGA at start-up in order to minimize the start time, and then at a later stage load the rest of the functionality onto the FPGA.


  • An example of this can be found in Xilinx Application Note 883, where an FPGA connected to the Peripheral Component Interconnect (PCI)-Express bus is partially configured at boot-up in order to meet the strict timing constraints of the bus. [62]

    The partial reconfiguration methods can generally be divided into two categories: difference-based reconfiguration for small net list changes and module reconfiguration for large module-based changes. This report will focus on module reconfiguration, but according to Xilinx their latest FPGAs ”[..] support reconfiguration of CLBs (flip flops, look-up tables, distributed Random Access Memory (RAM), multiplexers, etc.), block RAM, and Digital Signal Processor (DSP) blocks, plus all associated routing resources.” [23] This would imply that a high level of granularity can be achieved during difference-based partial reconfiguration.

    Partial reconfiguration can be seen as a specific implementation of context switching. However, as Koch points out, one must consider the entire system state before explicitly labelling partial reconfiguration as context switching. [44]

    Some key words used in the context of partial reconfiguration need to be explained and elaborated on before a more technical discussion of partial reconfiguration can take place. A short summary of commonly used terms is presented below; it is a summary of the terms introduced and described both by Koch [44] and by Xilinx [35].

    Reconfigurable Partition A physical part of the FPGA constrained by the user to host reconfigurable modules.

    Reconfigurable Module A net list that is set to reside inside a reconfigurable partition at some point. Several modules can share the same reconfigurable partition.

    Reconfigurable Logic Logic elements that make up the reconfigurable module.

    Static Logic Logic implemented in such a way that it is not reconfigurable.

    Proxy Logic Logic inserted by design software in order to provide the system with a known communication path between static logic and reconfigurable partitions.

    The techniques for placement of, and interaction with, reconfigurable modules within reconfigurable partitions differ between manufacturers, but three of the most common ones are island style, grid style and slot style. Island style is the simplest model for partial reconfiguration and is the only one supported by Xilinx so far; hence it will be the focus of this report. A figure showing the concept behind island style placement can be seen in figure 2.12.

    Notice the static region around the reconfigurable module. This is needed in order to provide a safe and efficient way of routing signals in and out of the reconfigurable module. The static region is extra important when the reconfigurable module is connected to a bus, as it makes sure that the bus is not disturbed during reconfiguration. In Xilinx FPGAs the most common implementation of the static region used to be so-called bus macros, which had to be added manually by the user in the Integrated Software Environment (ISE) tool suite during the design phase. These have since been replaced by proxy LUTs that are automatically inserted during synthesis. Island style placement only allows for one reconfigurable module per island. This means that a certain degree of fragmentation will occur when swapping modules, as resources within the island will be wasted if not all resources are used by the new module. This is further enhanced by the fact that reconfigurable partitions must be predefined by the user, and hence finding a perfect partition size in order to avoid fragmentation may be hard if not impossible. Furthermore, the current tools on the market require net lists to be generated for each unique pair of module and island. This means that even if the same module can be placed in two different islands, two net lists and bit streams must still be generated. For example, a system that has 6 modules and 3 islands where all modules can reside in any island must have 18 unique net lists and bit streams. This is clearly time consuming for the designer.

    Xilinx states in their user guide for Partial Reconfiguration [35] that all logic can be reconfigured during run-time except:


  • Figure 2.12: Figure showing a typical island style partition location strategy.

    ”Clocks and Clock Modifying Logic [...], I/O and I/O related components [...], Serial transceivers (MGTs) and related components [...] and Individual architecture feature components (such as BSCAN, STARTUP, etc.) must remain in the static region of the design”. Further, it is stated that bidirectional interfaces between static logic and reconfigurable logic are not allowed unless explicit routes exist. Also, some specific IP components might function erratically if used in combination with partial reconfiguration. Another design consideration to note is that the interface between the reconfigurable partition and the static logic must be static; as stated earlier, this implies that all reconfigurable modules that are to reside within a reconfigurable partition must have the same interface ”out towards” the rest of the FPGA. Ports or bus connections cannot be created on the fly. An extensive list of design considerations for partial reconfiguration can be found in Xilinx UG702. [35]

    In order to use partial reconfiguration on a Xilinx FPGA one must use the Xilinx design suite ISE. The work flow to generate and use partial reconfiguration on a Xilinx FPGA using Xilinx ISE will be presented in the Method chapter of this report. A general idea of the partial reconfiguration concept on a Xilinx FPGA can be seen in figure 2.13. The concept is explained in depth in the work by Khalaf et al. [43] Designers are forced to use so-called ”Bottom-Up Synthesis” in order to successfully implement a reconfigurable system. This implies that all modules must have separate net lists for each possible instantiation and that no optimization is allowed for the interface between the module and the rest of the FPGA. Bottom-Up Synthesis is explained in the Partial Reconfiguration Guide by Xilinx [35] and the Hierarchical Design Methodology Guide by Xilinx. [34]

    Xilinx states the following design performance in UG702 [35]:

    • Performance will vary between designs but in general expect 10% degradation in clock frequency and not to be able to exceed 80% in packing density.

    • Longer runtimes during synthesis and implementation due to added constraints.

    • Too small reconfigurable partitions may result in routing problems.

    From the user’s side, reconfiguration of the FPGA during run-time is only a matter of writing a partial bit stream to the associated reconfiguration port.


  • Figure 2.13: Figure showing the outline of the Xilinx RP. [35]

    The most commonly used reconfiguration port in Xilinx-based systems is the ICAP-interface, which can be instantiated as a soft IP-core in the FPGA-fabric. Other reconfiguration interfaces, such as the ZynQ’s Parallel Configuration Access Port (PCAP)-interface, exist as well. This report focuses on the actual usability of the partial reconfiguration flow, and the technical low-level reconfiguration process will hence not be discussed here. A good introduction to the partial reconfiguration work flow and its limitations for Xilinx FPGAs can be found in the Partial Reconfiguration User Guide by Xilinx. [35]
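    As an illustration of how simple the user-side of this can be, the C++ sketch below streams a partial bit stream to the PCAP from Linux, assuming the Xilinx devcfg driver exposes the port as /dev/xdevcfg; the device behaviour and the bit stream file name are assumptions, not code or measurements from this thesis. Depending on the driver version, a separate flag may also have to be set to mark the bit stream as partial before writing.

        #include <cstdio>
        #include <cstdlib>

        // Minimal sketch of triggering a partial reconfiguration from Linux on
        // the Zynq PS, assuming the devcfg driver exposes the PCAP as /dev/xdevcfg.
        int main() {
            std::FILE* bit = std::fopen("partial_module.bin", "rb");  // placeholder name
            std::FILE* dev = std::fopen("/dev/xdevcfg", "wb");
            if (!bit || !dev) return EXIT_FAILURE;

            char buf[4096];
            size_t n;
            while ((n = std::fread(buf, 1, sizeof(buf), bit)) > 0)
                std::fwrite(buf, 1, n, dev);     // the driver streams the data to the PCAP

            std::fclose(dev);                    // close once the whole bit stream is written
            std::fclose(bit);
            return EXIT_SUCCESS;
        }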

    Xilinx claims the following in the Partial Reconfiguration Reference Design for the Zynq [46]:

    The configuration time scales fairly linearly as the bitstream size grows with the number of reconfigurable frames with small variances depending on the location and the contents of the frames.

    If the reconfiguration time is linear with respect to the bitstream size, it would imply that partial reconfiguration could be used in time-critical systems, as the worst case scenario could be calculated and verified with a high degree of certainty. This property could be useful in high-speed applications such as video processing or other streaming data applications.

An article from 2006 written by Xilinx employees describes the general work flow in their ISE tool suite, and much of the information found there still applies to the current versions. [52]

    Heterogeneous systems

A general trend in both research and industry is to use more and more heterogeneous systems. Heterogeneous systems are composed of several different processing architectures. An example of this is the Xilinx ZynQ platform, which features two hard ARM cores and a large FPGA section in one IC. The Xilinx ZynQ family of FPGAs will be discussed later in this chapter. Another example of a heterogeneous system can be seen in the Related Work section of this report, more precisely the work of Blair et al. [14].

Heterogeneous systems enable programmers to exploit the different properties of different processing architectures for different tasks. For instance, a task that can be run in parallel can be put on an FPGA, while a strongly serial task can be run on the much faster CPU. However, these systems are not without drawbacks. Different processing architectures use different methods of execution, and tasks must be adapted to fit these methods in order to work correctly.


Computer Vision on FPGAs and heterogeneous systems

The parallel nature of FPGAs makes them ideal for running image processing algorithms due to the structure of these algorithms. An example of this is the conversion between the YUV422 color format and the RGB color format. The equation for the conversion can be seen in equation 2.2.

\[
\begin{bmatrix} R \\ G \\ B \end{bmatrix} =
\begin{bmatrix}
1 & 0 & 1.13983 \\
1 & -0.39465 & -0.58060 \\
1 & 2.03211 & 0
\end{bmatrix}
\begin{bmatrix} Y \\ U \\ V \end{bmatrix}
\qquad (2.2)
\]

If the simple conversion example seen in equation 2.2 were to be run in series, it would require at least PictureWidth * PictureHeight iterations to finish. For an image in the VGA format (640 x 480 pixels), 307200 iterations are then needed. If this conversion were implemented on an FPGA, all pixels could be converted at the same time in parallel, that is, only a few iterations would be required (assuming that the entire picture is available in the FPGA component at the start of the conversion process, which rarely is the case). One example of this can be found in the work of Hamid et al. [29], where a filtering algorithm that took 17 iterations to finish on a CPU only took 5 iterations on an FPGA.
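As a point of comparison for the iteration count, the following is a minimal serial C sketch of the conversion in equation 2.2, looping once per pixel; the buffer layout and clamping are illustrative, and the chroma subsampling of YUV422 is ignored for clarity.

#include <stdint.h>
#include <stddef.h>

/*
 * Serial YUV-to-RGB conversion following equation 2.2: one loop iteration
 * per pixel, i.e. width * height iterations in total (307200 for VGA).
 * The buffers are illustrative; a real implementation would match the
 * exact YUV422 memory layout delivered by the camera interface.
 */
static uint8_t clamp(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

void yuv_to_rgb(const uint8_t *y, const int8_t *u, const int8_t *v,
                uint8_t *rgb, size_t width, size_t height)
{
    for (size_t i = 0; i < width * height; i++) {
        rgb[3 * i + 0] = clamp((int)(y[i] + 1.13983 * v[i]));
        rgb[3 * i + 1] = clamp((int)(y[i] - 0.39465 * u[i] - 0.58060 * v[i]));
        rgb[3 * i + 2] = clamp((int)(y[i] + 2.03211 * u[i]));
    }
}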

Integrating hard processor cores into FPGAs is no new idea, however. For example, previous Xilinx FPGAs have featured PowerPC processors integrated into them (Virtex-IV), and a wide range of soft CPU cores exist for integration into the FPGA fabric, such as MicroBlaze from Xilinx and NIOS from Altera. Using heterogeneous systems for video processing has several positive implications for the overall system performance and usability, as will be seen later in this report.

Xilinx ZynQ-7000 and Xilinx ZC702

This thesis focuses on using partial reconfiguration on the Xilinx ZynQ-7000 FPGA family. Hence, it is important to discuss the features and properties of these devices and especially to present the development boards used: the Xilinx ZC702 and the GIMME2 board, which will be presented in the next section. Xilinx calls the ZynQ family an EPP family due to the fact that it features both an ARM processor and an FPGA block in the same package. The general idea of using the EPP is demonstrated in figure 2.15.

An outline of the Xilinx ZynQ SoC can be seen in figure 2.16, and a table showing some of the basic characteristics of the different devices in the ZynQ family can be found in Appendix B. The ZynQ can generally be divided into two regions: the PL, featuring the FPGA fabric, and the Processing System (PS), featuring the ARM processor. The Xilinx ZynQ SoC features a wide range of embedded peripherals. The most interesting ones for this report are the DevC interface (where the PCAP interface is located) and the Advanced eXtensible Interface (AXI) bus between the PL and the PS. The AXI bus is a high-performance bus developed and specified by ARM. It provides developers with an easy way of interfacing between the PL and PS sections. The principal layout of the DevC module of the Advanced RISC Machine (ARM) processor on the Zynq can be seen in figure 2.14.

The latest version of the AXI bus, version 4, has three distinct implementations: AXI4, AXI4-Lite and AXI4-Stream. The standard AXI4 bus is a burst-based master-slave bus with independent channels for read addressing, write addressing, data reception, data transmission and transmission response. The width of the data channels can range from 8 up to 1024 bits. Interconnects are used to connect masters to slaves, and vice versa, and several masters can be connected to the same interconnect. AXI4-Lite is a reduced version of the standard AXI4 bus designed for a simpler communication method between masters and slaves; it is not capable of burst reads or writes and the width of its buses is limited to either 32 or 64 bits. AXI4-Stream is a stream-based communication interface that is designed for high-performance applications such as video processing. As AXI4-Stream is stream based, it lacks the address channels present in regular AXI4. AXI4-Stream can either be used as a direct protocol, where a master unit writes directly to a slave unit, or together with an interconnect in order to perform operations on the data stream such as routing or resizing. In order to pass stream-based data into a memory an extra component is needed; a common technique is to utilize a DMA device to perform such operations without the direct involvement of the processing unit. This is common in video applications, where frame buffers are used to store video frames between the various stages of the video pipeline. More information about the regular AXI4 protocol and AXI4-Lite can be found in the specification supplied by ARM. [6] More information about the AXI4-Stream protocol can be found in the specification released by ARM. [5]

Figure 2.14: The layout of the DevC-module. [38]
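As an illustration of the register-style access that AXI4-Lite provides, a user-space program on the PS can memory-map an AXI4-Lite slave in the PL through /dev/mem, as in the minimal sketch below; the base address 0x43C00000 and the register offsets are hypothetical and must match the address map assigned to the slave in the actual design.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Minimal sketch of the PS talking to an AXI4-Lite slave in the PL from
 * Linux by memory-mapping /dev/mem. The base address and the register
 * offsets below are placeholders for whatever the design's address map says.
 */
#define AXI_LITE_BASE 0x43C00000u
#define MAP_SIZE      0x1000u

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    void *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, AXI_LITE_BASE);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }
    volatile uint32_t *regs = (volatile uint32_t *)map;

    regs[0] = 0x1;                        /* write a control register (offset 0x0) */
    printf("status = 0x%08x\n", regs[1]); /* read a status register (offset 0x4)   */

    munmap(map, MAP_SIZE);
    close(fd);
    return 0;
}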

The Xilinx ZynQ has integrated AXI-based high-performance ports that give the PL access to the memory connected to the PS. This allows the two sections to share memory. Furthermore, other AXI-based ports are available for communication and peripheral sharing between the PS and PL sections. Interrupts and Extended Multiplexed I/Os (EMIOs) are also routed between the PS and the PL. More extensive information about the Xilinx Zynq SoC can be found in the Technical Reference Manual [38].

Xilinx ZC702 is a development board featuring a Xilinx ZynQ XC7Z020 SoC and a wide range of on-board peripherals. A picture showing these various peripherals can be seen in figure 2.17.

    GIMME 2

GIMME2 is a computer vision platform developed by professor Lars Asplund at Mälardalen University and AF Inventions GmbH. The board features, for example, a Xilinx ZynQ XC7Z020-2 SoC, two Omnivision OV10810 10-megapixel image sensors and separate DDR memories for the PL and the PS. A short technical summary of the board can be seen in figure 2.18, and the board itself can be seen in figure 2.19. The GIMME2 platform is intended to function as a research platform for researchers at Mälardalen University as well as other research institutes.


Figure 2.15: Figure showing the general idea on how to utilize the Xilinx EPP-family. [41]

Figure 2.16: Figure showing the outline of the Xilinx ZynQ-SoC. Image from Xilinx document UG585. [38]


Figure 2.17: Figure showing the ZC702 board peripherals. [41]

    Figure 2.18: Figure showing the outline of the GIMME2 platform. [3]


Figure 2.19: Figure showing the GIMME2 board's front (right) and backside (left). Notice the two image sensors on the backside of the PCB (encircled in red). Also notice the Zynq SoC (encircled in yellow), the PS DDR memory (encircled in blue) and the PL DDR memory (encircled in purple).


Related work

Implementations of computer vision on FPGAs and heterogeneous systems

Utilising the computational power of an FPGA in a mono or stereo vision system is not a new concept. This concept has been the target of many researchers' work throughout the years, for reasons already accounted for. One implementation of computer vision on an FPGA that is interesting and relevant to this report is the work of Hosseini and Hu [29]. Hosseini and Hu compared the performance of a hard logic solution against a soft CPU core (Altera Nios II) implemented on an FPGA when given the task of filtering a 64 x 64 pixel or 256 x 256 pixel grey scale image using an n x n coefficient matrix. It was found that the hard logic FPGA implementation was immensely faster than a similar implementation on a CPU; the authors found that the logic-based solution could perform up to 80 times faster. The results of this article can be seen in table 3.1.

Table 3.1: Results from the work by Hosseini and Hu. [29] The first four rows are for the filtering of a 64 x 64 pixel image, while the last two are for the filtering of a 256 x 256 pixel image.

An overview of the algorithm implemented in hard logic and the algorithm implemented on the soft CPU core by Hosseini and Hu can be seen in figure 3.1.

Another implementation of a computer vision system on an FPGA can be found in the article by Ohmura and Takauji. [55] Ohmura and Takauji used a stereo-vision system with the Orientation Code Matching (OCM) algorithm running on an Altera FPGA chip. In short, the OCM algorithm is designed to find similarities between pictures, and one possible application is stereo matching. An extensive explanation of the OCM algorithm can be found in the same paper. The stereo vision module used was capable of supplying images of the size 752 x 480 pixels at 60 FPS. The FPGA on the module was an Altera Cyclone III running at 53 MHz. A block diagram of the developed system can be seen in figure 3.2. The implemented system performed with a minimal delay but used 82% of the available Logic Elements on the FPGA. The system used a 16 x 16 pixel template size with a maximum disparity of 127 pixels. An overview of the algorithm implemented on the FPGA can be seen in figure 3.3.

Figure 3.1: The system from the article by Hosseini and Hu. [29] To the left is the hard logic implementation and to the right is the CPU implementation using the Altera Nios II.

Figure 3.2: System developed by Ohmura and Takauji. Picture retrieved from the article by Ohmura and Takauji. [55]

The work of Komuro et al. [47] is another example of a vision system implemented on a heterogeneous system featuring a microprocessor and an FPGA. They developed a high-speed system capable of retrieving 1000 FPS using a single camera setup. The architecture used can be seen in figure 3.4. The camera used was capable of outputting 1280 x 512 pixel frames at 1000 FPS. By splitting the functionality between the CPU and the FPGA, the team managed to get acceptable performance out of the system. The implemented system, with the CPU running at 266 MHz and the FPGA at 200 MHz, was compared to a PC with a dual-core Intel processor running at 1.86 GHz and with 3 GB of RAM. On the PC, OpenCV was used to provide high-level image processing capabilities. In table 3.2 a comparison between similar functions running on the heterogeneous system and on the PC can be seen. It is quite clear that, even though the clock frequency of the heterogeneous platform is much lower, it can still outperform the dual-core PC in most of the cases due to the increased level of parallelism achieved.

Figure 3.3: System developed by Ohmura and Takauji. Picture retrieved from the article by Ohmura and Takauji. [55]

Figure 3.4: Komuro et al.'s architecture. [47]

Blair et al. [14] present a vision system implemented on a heterogeneous platform (GPU, FPGA and CPU) for detection of pedestrians in real time using an algorithm called HOG (Histogram of Oriented Gradients). In short, HOG works by looking at the intensity gradients of image pixels and then uses this data to detect objects. Blair et al.'s system was based around a dual-core Intel processor running at 2.4 GHz, a Xilinx Virtex-6 FPGA and an Nvidia 560Ti Graphics Processing Unit (GPU). The devices were connected together using the PCI Express bus. Each stage of the implementation can run on any of the processing units, and mixing between the different architectures is allowed. Figure 3.5 shows the possible paths the data can take. The performance of the system was evaluated by sending two 1024 x 768 pixel images through all the different data paths. One of the images was single scale and the other had 13 scales. The fastest data path, for both images, was the one mainly using the GPU, at 6.8 ms and 47.0 ms, while the slowest one was the one mainly using the CPU, at 174.3 ms and 1376 ms. Using mainly the FPGA gave the third best times, 10.1 ms and 124.5 ms. The authors conclude that communication delay is a major problem and that combining two processing units in one data flow is not preferred due to this. Furthermore, Blair et al. state that it would be better to, instead of splitting the algorithm between processing units, dedicate each processing unit to a specific task and then combine the results of each task in the end.

For more information about the implementation and design of computer vision systems on FPGAs, please refer to the comprehensive book on the subject by Bailey. [9] An overview of the state of the art in heterogeneous computing can be found in the work by Brodtkorb. [15]

    Implementations of reconfigurable FPGA-systems

Thoma et al. [63] discuss a method for dynamic pipeline reconfiguration of a soft core processor implemented on an FPGA. Thoma et al. present a novel processor core that can be reconfigured with respect to the depth of the pipeline in order to increase performance or decrease power consumption. The processor core is based upon an already existing one called LEON3. LEON3 is implemented in VHDL and can hence be used on an FPGA. For more information about LEON3, please refer to the technical paper about LEON3. [1] By cleverly joining or splitting up adjacent pipeline stages dynamically, Thoma et al. demonstrated a relative saving of 3.8% in cycle count over the execution of a test program.

Table 3.2: Komuro et al.'s performance. [47]

Figure 3.5: Blair et al.'s performance. [14]

In the master thesis by Hamre [30] a framework (in principle a hardware operating system) is presented for dynamic reconfiguration of FPGAs. Hamre's work is closely related to the work presented in this report, as the same principal ideas are shared. Hamre thoroughly presents the concept of partial reconfiguration and also the difficulties of implementing it. Hamre's framework is designed to work with Linux and also uses the Xilinx ICAP port for reconfiguration. The results show that partial reconfiguration of an FPGA from Linux is possible with the available tools. However, no real performance evaluation is made in Hamre's report.

In the article by Lesau et al. [50] the usage of real-time reconfiguration in combination with embedded Linux is discussed, and furthermore a set of tools for easier handling of reconfiguration is presented. Lesau et al. have successfully implemented tools such as mailboxes for Linux-to-hardware-module communication and also a hardware administrator that handles reconfiguration. This system was implemented on a Xilinx Virtex-5 FPGA using MicroBlaze cores and PetaLinux. Lesau et al. successfully proved that this kind of hardware and software layout can be used for handling dynamic reconfiguration.

Meyer et al. present a new configuration method for FPGAs in their article on the subject. [53] The researchers have named it "Fast start-up". By manipulating the bit streams generated by the regular development tools, Meyer et al. significantly decreased the configuration time of a Xilinx Spartan-6 FPGA. The results from these tests can be found in table 3.3.

Table 3.3: Meyer et al.'s results. [53]

In Ming et al. [51] a comparison between different ICAP components is made, and the reconfiguration speed versus the bit-file size is compared. They found that by making radical changes to the standard ICAP the reconfiguration time could be lowered by an order of magnitude. Ming et al.'s results (reconfiguration time versus file size) can be seen in table 3.4. In figure 3.6 the almost linear relationship between bit-file size and reconfiguration time can be seen.

    Table 3.4: Table showing Ming et al.’s results taken from their article. [51]

Koch et al. [45] present the concept of partial reconfiguration and also demonstrate some tools commonly used for achieving it. Furthermore, some possible applications of partial reconfiguration are presented to the reader, one application being a "self-adaptive reconfigurable video-processing system". This implemented system can be seen in figure 3.7. The modules seen in the picture can be dynamically loaded and unloaded during run-time. This system was implemented on a Xilinx Virtex-II FPGA. No performance indicator is given, but it would appear that the performance of the system is acceptable for non-real-time applications, thus proving that the concept is implementable. Koch has also published a book on the subject named "Partial Reconfiguration on FPGAs" which may be of interest for the reader. [44] Koch is also the co-author of the article where the GoAhead tool for partial reconfiguration is presented. [10] GoAhead aims to provide developers with a simple user interface for creating systems containing reconfigurable modules.

Several other tools and work flows have lately emerged for creating run-time reconfigurable systems. For example: Dreams [56], the work by Ruiz et al. [54] and the work by Dondo et al. [22].

Figure 3.6: Figure showing Ming et al.'s results taken from their article. [51]

Gantel et al. [27] discuss a possible algorithm for module relocation during run-time. The work was performed on a Xilinx FPGA, mostly using the Xilinx tool Isolation Design Flow. The main problem when dealing with relocation of modules on an FPGA is that each module must have a unique bit-stream for each possible module location on the FPGA. Gantel et al. want to make it possible to have only one bit-stream per module and then, using a bit-stream parser and a bit-stream relocator, adapt it so that the module can be placed anywhere on the FPGA. This would allow for a more efficient design. Gantel et al. succeeded with module relocation on the FPGA with their proposed method.

Garcia et al. [28] presented a possible application of reconfigurable FPGAs within Wireless Sensor Networks (WSNs).

In the work by Papadimitriou et al. [57] the authors present their solution for providing fast reconfigurations and also propose a model for cost estimation of the reconfiguration process.

Implementations of reconfigurable FPGA-systems running computer vision algorithms

Ackermann et al. [2] implemented a self-reconfigurable video processing system on an FPGA. They present two different implementations of the system. In the first one a Xilinx Virtex-IV FPGA is used, and reconfiguration was performed with the help of the ICAP and a MicroBlaze processor. Some common image processing algorithms were implemented as modules used for the reconfiguration testing. The first image processing algorithm implemented was binarization, the second was edge detection using 2-D gradients and the third one was edge detection using a horizontal derivative. The bit-streams are here stored on a CompactFlash card accessed from the MicroBlaze processor. Image data is retrieved from a camera with a resolution of 2048 x 2048 pixels at 16 Frames Per Second (FPS). For both implementations only one image processing algorithm at a time was allowed on the FPGA. An overview of the first implementation can be seen in figure 3.8. In the second implementation, the MicroBlaze processor has been removed and replaced by a small bit-stream controller connected to a Static Random Access Memory (SRAM) where all bit-streams are located. The results of the two different implementations can be found in table 3.6, which shows the time needed to reconfigure the FPGA. The size of each implemented processing algorithm can be found in table 3.5. This clearly shows that run-time reconfiguration is possible even in such demanding systems as video processing systems.

Figure 3.7: Koch et al.'s implemented system. Picture is taken from the related article. [45]

Table 3.5: Ackermann et al.'s results. [2]

Another implementation of dynamic reconfiguration in a multimedia application was presented in the work by Bhandari et al. [11] Bhandari et al. used a Xilinx Virtex-4 FPGA to implement a test platform where a VGA camera stream was fed into the FPGA, passed through a reconfigurable filter module and was then output onto a monitor. Also available on the FPGA was an audio module that also had a reconfigurable filter slot. Three programming techniques utilizing the ICAP were tested. The first two, OPB-HWICAP and XPS-HWICAP, are made by Xilinx, while the last one, SEDPRC, was designed by Bhandari et al. The results of these tests can be seen in table 3.7. As can be seen, the novel ICAP interface performs much better than the Xilinx alternatives; hence the performance of the ICAP interface can be improved by utilizing custom-made components.

A run-time reconfigurable "multi-object tracker" was implemented by Perschke et al. [58] The system implemented is based on a Xilinx Virtex-4 FPGA with a PowerPC CPU, and video is retrieved from a camera running at 384 x 286 pixels and 25 FPS. Just like Bhandari et al., Perschke et al. have implemented a new ICAP controller in order to improve the performance of the system, as the standard Xilinx ICAP controller was deemed too slow to be used. In table 3.8 the total resource usage of each implemented component can be seen. In table 3.9 the results from the article can be found. Perschke et al. conclude that components can be switched between frames and hence no delay in the object-tracking algorithm is produced. Looking at the resource utilization of the components used and the speed of the switching, it can be concluded that this method can be used in high-speed vision systems.

Figure 3.8: Ackermann et al.'s system. [2]

Table 3.6: Ackermann et al.'s results. [2]

An application of partial reconfiguration within the automotive industry can be found in the work by Claus et al. [18] [19] These studies show that FPGAs and real-time reconfiguration can be useful within the automotive industry as well, where time constraints are strict and the overall reliability of the system is a critical factor.


Table 3.7: Bhandari et al.'s results. [11]

    Table 3.8: Perschke et al.’s results. Picture from the related article. [58]

    Table 3.9: Perschke et al.’s results. Picture from the related article. [58]


Method

    Early work

In order to evaluate partial reconfiguration on the Xilinx ZynQ FPGA using its associated tool suite, several steps were taken to prepare for and measure the interesting aspects. At first the general functionality of the different development boards was tested, and a basic system featuring partial reconfiguration of one slot was designed and implemented in order to learn the work flow in the different Xilinx tools. This procedure was completed both on the Xilinx ZC702 and on the GIMME2 board.

The first run-time reconfiguration designs were implemented on the Xilinx ZC702 FPGA development board. In later stages the designs were moved onto the GIMME2 board. GIMME2 [4] is a stereo-vision platform developed at Mälardalen University by the Intelligent Sensor System Group. The latest version of the GIMME board, version 2, features a Xilinx Zynq SoC, dual high-resolution cameras, etc. GIMME2 was described in the Background section, as was the ZC702 development board.

After getting to know the tools better, the state of the art was researched by looking into articles and research papers from various scientific databases such as the Institute of Electrical and Electronics Engineers (IEEE) Xplore. Linux was downloaded from the Xilinx git server in the form of binary files. The possibility to adapt the Linux kernel and file system exists, and extensive guides are available at the Xilinx Wiki page [65].

    Design considerations

The technical limitations of partial reconfiguration in its current state, as implemented by Xilinx, were accounted for in the Background section of this report. From those limitations some conclusions can be drawn:

• Modules residing within partitions should have similar functionality and communication interfaces in order to minimize fragmentation and design problems.

• The size of reconfigurable partitions should be kept as small as possible in order to minimize fragmentation and free resources for other components residing in the FPGA.

• Partial reconfiguration should only be used in systems that could potentially benefit drastically from it. Small performance improvements are a small gain for the added complexity during the design process.

Before starting the implementation of the final reconfigurable system, some technical considerations need to be dealt with. As can be seen in the Xilinx documents UG470 [39] and WP374 [23], several different methods exist for configuration and reconfiguration of Xilinx's 7 series FPGAs. Further, if one consults the technical reference manual, UG585 [38], of the Zynq 7000 SoC, one can see a clear overview of the various techniques used to program the PL from the PS. Looking at chapter 6.4.5 in UG585 [38], it is clear that reconfiguration from the PS using the PCAP is not desirable in all cases, as it would require the AXI interface to be turned off during reconfiguration and hence separate the two areas for some time. This would mean that no data could be passed from the PS to any component residing in the FPGA. However, the PCAP supports 400 MB/s throughput, which means a low configuration/reconfiguration time. The entire (power-on) configuration flow for the ZynQ using the PCAP [38] can be seen in the first quote below, and the entire reconfiguration flow for the ZynQ using the PCAP [38] can be seen in the second quote below; a bare-metal software sketch of the reconfiguration flow follows the two quotes.

1. Wait for PCFG_INIT to be set High by the PL (STATUS bit[4])

2. Set internal loopback to 0 (MCTRL bit[4])

3. Set PCAP_PR and PCAP_MODE to 1 (CTRL bits[27 and 26])

4. Initiate a DevC DMA transfer:

(a) Source address: Location of PL bitstream

(b) Destination address: 0xFFFF_FFFF

(c) Source length: Total number of 32-bit words in the PL bitstream

(d) Destination length: Total number of 32-bit words in the PL bitstream

5. Wait for PCFG_DONE to be set High by the PL (INT_STS bit[2])

1. Disable the AXI interface to the PL.

2. Disable the PL level shifters by writing 0xA to the SLCR LVL_SHFTR_EN register.

3. Set PCAP_MODE and PCAP_PR High.

4. Clear the previous configuration from the PL (optional):

(a) Set PCFG_PROG_B High.

(b) Set PCFG_PROG_B Low.

(c) Check for PCFG_INIT = 0 (STATUS bit[4]).

(d) Set PCFG_PROG_B High.

5. Check for PCFG_INIT = 1 (STATUS bit[4]).

6. Set INT_PCAP_LPBK Low (MCTRL bit[4]).

7. Initiate a DevC DMA transfer:

(a) Source Address: Location of new PL bitstream.

(b) Destination Address: 0xFFFF_FFFF.

(c) Source Length: Total number of 32-bit words in the new PL bitstream.

(d) Destination Length: Total number of 32-bit words in the new PL bitstream.

8. Clear PCFG_DONE_INT by writing a 1 to INT_STS[2].

9. Wait for PCFG_DONE_INT to be set High.

10. Enable the PL level shifters by writing 0xF to the SLCR LVL_SHFTR_EN register.

11. Enable the AXI interface to the PL.

However, little to no documentation exists for using the ICAP component on the Zynq, and developing such a system turned out to be troublesome, as will be seen later in this text.


Figure 4.1: Figure showing the Partial Reconfiguration flow from WP374. [23]

The main issue lies in the fact that the Zynq already has a built-in configuration/reconfiguration port, the PCAP, and that the configuration interfaces are mutually exclusive. The selection of which configuration interface is active is done via a register in the PS, which must be set correctly before the ICAP or PCAP can be used. [38]

If one consults the Xilinx application note XAPP1159 [46], especially table 2, some typical reconfiguration times using the PCAP interface can be found. From Linux, a partial reconfiguration time of 2 milliseconds is specified. If such a low reconfiguration time can be achieved using the PCAP, it could be acceptable to turn off the AXI bus interface for that period of time. However, Xilinx writes in the text that the 2 ms only includes the actual "beginning and end of the DevC DMA transfer driver function call". Hence some steps of the reconfiguration flow are not included in this time. These excluded steps can be seen in figure 4.2. Still, it is interesting to see the performance of the partial reconfiguration flow both from Linux and from so called stand-alone software; a sketch of how such a measurement could be made from Linux is shown below.
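The sketch below times the write phase of a partial reconfiguration from Linux through the Xilinx devcfg driver using clock_gettime(). The /dev/xdevcfg node is the one exposed by the Xilinx kernel tree, while the sysfs path for is_partial_bitstream varies between kernel versions and is an assumption here, as is the bitstream file name.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Write a partial bitstream to the Xilinx devcfg driver and measure how
 * long the write takes. Only the write itself is timed, mirroring the
 * limitation of the 2 ms figure discussed above.
 */
static void mark_partial(void)
{
    /* Tell the driver that the next write is a partial bitstream.
       The sysfs path below is an assumption and differs between kernels. */
    FILE *f = fopen("/sys/devices/amba.0/f8007000.devcfg/is_partial_bitstream", "w");
    if (f) {
        fputs("1", f);
        fclose(f);
    }
}

int main(void)
{
    char buf[4096];
    ssize_t n;
    struct timespec t0, t1;

    mark_partial();

    int in = open("partial_module.bit", O_RDONLY);   /* example file name */
    int out = open("/dev/xdevcfg", O_WRONLY);
    if (in < 0 || out < 0) {
        perror("open");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(in, buf, sizeof(buf))) > 0)
        write(out, buf, (size_t)n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("Partial bitstream written in %.2f ms\n", ms);

    close(in);
    close(out);
    return 0;
}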

Figure 4.2: Excluded partial reconfiguration steps from XAPP1159 shown in red. [46]

Another interesting problem one is faced with when dealing with partial reconfiguration is how to synthesize each reconfigurable module in an efficient way. Each module needs to be synthesized independently in order to get partial reconfiguration to work properly. Xilinx's method of performing this is to ensure that all the components that shall be able to reside inside a reconfigurable partition have the same interfaces towards the rest of the FPGA and then to import each component into the Embedded Development Kit (EDK) environment. There, by changing the component instantiation in the mhs file associated with the project between synthesis runs, a net list is generated for each component. This set of net lists is then exported to the Partial Reconfiguration project in PlanAhead. This methodology may seem clumsy at best, but it has proven to work without any major malfunctions. Another approach is to use component instantiations in the HDL files and then directly edit these instantiations between synthesis runs. As Xilinx enforces the Bottom-Up Synthesis methodology, users can synthesize modules using any synthesis tool. However, Input/Output (I/O) insertion must be disabled during synthesis, as the module's pins will not be connected to package pins but to the static logic of each partition. [35] For this report the EDK method was mostly used as it seemed smoother, but the work flow using a separate synthesis tool was also tested and confirmed to work.

As stated earlier, a reconfigurable partition consists of two parts: (1) the static logic and (2) the reconfigurable module itself. [35] In order to guarantee successful partial reconfiguration, the static logic for all modules residing in a partition must be identical after implementation. In the Xilinx ISE tool suite this is checked with the Verify tool in PlanAhead. After reconfigur