Mohamad Ibrahim Shariat Nasseri
CONFIGURABLE RUNTIME ON-CHIP FPGA LOGIC DEBUGGING AND ITS CHALLENGES
Master’s Thesis
Examiner 1: Timo Hämäläinen
Examiner 2: Arto Oinonen
Faculty of Information Technology and Communication Sciences
April 2021
ABSTRACT
Mohamad Ibrahim Shariat Nasseri: Configurable runtime on-chip FPGA logic debugging and
its challenges
Master’s Thesis
Tampere University
Information Technology
April 2021
There are many solutions to debug an FPGA design. Some are industry standards, and some are custom-made for specific cases and projects. In the context of on-chip debugging, there are solutions such as using external measurement tools, direct instrumentation (or modification) of the bitfiles, commercial embedded logic analysers or custom embedded logic analysers using trace buffers. Although these approaches utilize the speed of on-chip testing, they either lack visibility or take up too many FPGA resources to be implemented. To simplify comparison between different methods, four design criteria are defined: visibility, flexibility, logic overhead and debugging cycle.
In this thesis, an alternative approach for on-chip debugging is provided that utilizes the speed of running on the chip while using a minimal amount of FPGA resources. The approach takes the form of an intellectual property (IP) block called the Snapshot IP. It has two functionalities, capture and insertion. The capture functionality can sniff the ongoing data between two IPs and offload it to a memory space for an offline check, while insertion can inject data into an IP for validation purposes. Both capture and insertion offer multiple modes of operation to choose from depending on the testing scenario. The Snapshot IP provides a set of features to enhance the flexibility of the debugging process, such as timing-based or pattern-based capture, support for multiple data formats and simultaneous access to two memories. The IP also has multiple clock inputs for cases where the stream and memory sides operate on different clock frequencies. The user has control over starting and stopping the operations and can define the memory area to be used.
Moreover, the implementation chapter discusses the design decisions, challenges, and limitations of such a design as a high-performance debugging solution. The challenges are categorized into three sections: design, design flow and validation challenges. The limitations are the inability to insert arbitrary data, performance sensitivity to external factors and the fact that the design is only a complementary solution to other debugging and validation methods. The FPGA in use is the Intel Stratix 10 SX.
In terms of results, the final design has met its targets. When implemented, each instance of the Snapshot IP takes roughly 0.3–0.4% of the logic on the FPGA and 0.5% of the available M20K memory blocks, showing a low level of logic overhead. The IP can handle ~16 Gbps of data on its AXI4 interface toward memory. Throughout the design process, bugs such as AXI4 bus jamming were found and fixed in the final design. Preparing for a new round of debugging is simple, visibility is at an acceptable level, and the feature set provides decent flexibility.
Keywords: on-chip debugging, embedded logic analysers, ELA, instrumenting, design challenges, trace buffer
The originality of this thesis has been checked using the Turnitin OriginalityCheck service.
PREFACE
This thesis is written to conclude my master’s studies in Information Technology. It has been done as part of a project at Nokia Networks and Solutions in Tampere.
I would like to thank my supervisors and examiners, Prof. Timo Hämäläinen and Arto Oinonen, who helped and supported me not only in this thesis but throughout my studies at Tampere University.
I would also like to express my appreciation to Petri Kukkala, my manager at Nokia, for motivating me and trusting in me, and to Tomi Mansikkala, who helped and guided me in this project.
I must also thank my dear friend Mina Shahmoradi and my family for believing in me and making the path to success easier; I am grateful to them.
Tampere, 26 April 2021
Mohamad Ibrahim Shariat Nasseri
CONTENTS
1. INTRODUCTION .................................................................................................. 1
2. RELATED WORKS ............................................................................................... 4
2.1 Bitfile instrumenting .............................................................................. 4
2.2 Design for Debug and Trace Buffers .................................................... 6
3. REQUIREMENTS AND METHODS .................................................................... 11
3.1 Project Background ............................................................................ 11
3.2 Tools .................................................................................................. 12
3.3 IP interfaces ....................................................................................... 12
3.4 IP Features ........................................................................................ 13
3.4.1 Data Format ................................................................................ 13
3.4.2 Multiple clock domains ................................................................ 14
3.4.3 Modes of operation ..................................................................... 14
3.4.4 Support for two memories ........................................................... 15
3.4.5 Optional insertion instantiation .................................................... 16
3.5 Capture .............................................................................................. 16
3.6 Insertion ............................................................................................. 18
3.7 Example Use Cases ........................................................................... 21
3.7.1 Use case 1: Capture ................................................................... 21
3.7.2 Use case 2: Insertion .................................................................. 21
3.7.3 Use case 3: Self-validation ......................................................... 22
4. IMPLEMENTATION ............................................................................................ 24
4.1 Design Decisions ............................................................................... 24
4.2 Design Challenges ............................................................................. 25
4.2.1 Variable number of clocks ........................................................... 25
4.2.2 Integration points within the system ............................................ 25
4.2.3 Different input formats ................................................................. 26
4.2.4 AXI4 Master design compatibility with the slave .......................... 26
4.2.5 Performance optimizations .......................................................... 27
4.2.6 Timing optimizations ................................................................... 27
4.2.7 FPGA resource usage ................................................................ 28
4.2.8 Handling urgent feature requests ................................................ 28
4.3 Design Flow Challenges .................................................................... 29
4.3.1 IP-XACT packaging .................................................................... 29
4.3.2 SpyGlass linting/CDC checks ...................................................... 29
4.3.3 Register bank generation ............................................................ 30
4.4 Validation and verification challenges ................................................ 30
4.5 Limitations .......................................................................................... 31
4.5.1 A complementary solution ........................................................... 31
4.5.2 Ability to insert any arbitrary format of data ................................. 32
4.5.3 Performance sensitivity ............................................................... 32
5. RESULTS ........................................................................................................... 33
5.1 Overall performance ........................................................................... 33
5.2 FIFO Usage levels ............................................................................. 36
5.3 FPGA resource usage ........................................................................ 39
5.4 Design bugs ....................................................................................... 39
5.5 Well-balanced design criteria ............................................................. 40
6. CONCLUSION .................................................................................................... 42
7. REFERENCES ................................................................................................... 45
LIST OF FIGURES
Figure 1. Simple Logic Analyzer proposed by Graham et al. [5] ............................ 5
Figure 2. Architecture proposed by Kumar et al., showing the improved two-step model [14] ....................................................................................... 8
Figure 3. Architecture proposed by Ko et al. [13] .................................................. 9
Figure 4. Snapshot IP top-level architecture ........................................................ 13
Figure 5. Intel Stratix 10 HPS Block Diagram. Difference between F2H (top centre) and F2SDRAM (centre) paths to SDRAM [10]. ......................... 16
Figure 6. Capture single mode with BCN start .................................................... 17
Figure 7. Capture user operations procedure ...................................................... 19
Figure 8. Insertion user operations procedure ..................................................... 20
Figure 9. Visualisation of use case 1 ................................................................... 21
Figure 10. Visualisation of use case 2 ................................................................... 22
Figure 11. Use case 3: Self-validation ................................................................... 23
Figure 12. An example of a SignalTap state-based trigger condition ..................... 31
Figure 13. AXI4 write transactions in capture ........................................................ 34
Figure 14. Continuous insertion output to a stream ............................................... 35
Figure 15. Relation between AXI4 transactions and FIFO level in capture ............ 36
Figure 16. FIFO usage level while capturing ......................................................... 37
Figure 17. FIFO usage level in insertion ................................................................ 38
LIST OF SYMBOLS AND ABBREVIATIONS
AXI4          Advanced eXtensible Interface revision 4
AXI4-Lite     Advanced eXtensible Interface revision 4, light version
AXI4-Stream   Advanced eXtensible Interface revision 4 for data bus
BCN           BTS Clock Number
BTS           Base Transceiver Station
CCU           Cache Coherency Unit
CDC           Clock Domain Crossing
CR            Cycling Registers
DDR4          Double Data Rate 4
DSP           Digital Signal Processing
ELA           Embedded Logic Analyser
F2H           FPGA-to-HPS
F2SDRAM       FPGA-to-SDRAM
FDD           Frequency Division Duplex
FIFO          First In, First Out, a type of memory
FPGA          Field-Programmable Gate Array
Gbps          Gigabit per second
HPS           Hard Processor System
IP            Intellectual Property
IP-XACT       An XML format to describe a design
MISR          Multiple Input Signature Register
MUX           Multiplexer
RAM           Random Access Memory
SoC           System-on-Chip
Tcl           Tool Command Language
TDD           Time Division Duplex
UVM           Universal Verification Methodology
1. INTRODUCTION
5G, as the new standard for mobile communications, promises higher-than-ever internet access speeds. Each cell in a base station can serve more users simultaneously than in previous generations. From the base station’s point of view, this translates to processing more data locally. As a result, the hardware used in the base station should be capable of meeting such processing needs. In addition, with more data comes more design complexity, which in turn makes it more difficult to fully verify and validate the hardware design, so it is ever more important to be able to find bugs and fix them in due time.
There are many solutions to tackle this matter, many are industry standards. Some of
these verification methods focus on the code itself, before implementing it on a chip.
Methods such as the Universal Verification Methodology (UVM), formal verification and
static linting checks are some of these solutions.
Another category of debugging solutions is designed to work on the chip level, that is, to validate the correctness of the functionality of a programmed chip through different use cases. The most common tools here are external measuring devices that can generate custom test vectors.
Each of the above-mentioned methods has its own pros and cons. Methods such as UVM and formal verification offer unparalleled visibility into the design, but they are slow to run. It can take days to fully run the test cases for a system using UVM. On the other hand, on-chip debugging methods are fast to run, but they generally provide poor visibility into system signal activities.
Over the years, there have been studies on how to improve on-chip debugging by means of more visibility, more flexibility, less logic overhead or a shorter debugging cycle while utilizing the speed of running on an actual chip.
Visibility means how much data one can gather in a debugging session. It can range from only seeing wrong output in board-level tests to being able to monitor every signal in the design using a simulation tool. Flexibility refers to the features available in the solution and how easy they are to use. Logic overhead refers to the amount of logic area and the number of memory blocks needed to implement a solution. Debugging cycle is the amount of time needed to reconfigure and start a new session; there might be a need to recompile the design in between these sessions.
Karpagam and Viswanathan, Poulos et al. and Graham et al. have all proposed methods based on modifications to the design without a need for re-compilation. These modifications are either done after the place-and-route step or directly inserted into the bitfile before programming the Field-Programmable Gate Array (FPGA). A bitfile or bitstream is a file that is used to program an FPGA and contains information on how to place and route the design within the FPGA [16]. This type of approach is ideal for reducing debugging cycle time as it eliminates the time spent on re-compilation. It also has minimal logic overhead. Nonetheless, these methods do not contribute to the level of visibility of the system internals.
Taking a different approach to the issue at hand, authors such as Yang and Touba, Kumar et al. and Ko et al. have used trace buffers in order to increase the visibility into the system. Kumar et al. and Yang and Touba have taken a similar approach of trimming the acquired data to the potentially erroneous parts. Ko et al. have taken a different path to increase visibility and flexibility: they used multiple triggers and multiple buffers with the option to fragment the data to maximize buffer usage.
Commercial embedded logic analysers (ELA) are a good example of having as much visibility of the signal activities as possible. However, in the case of FPGAs, the resource-usage overhead that a commercial ELA creates is sometimes not affordable. It causes an unwanted increase in logic and memory utilization and may prevent the project from reaching its timing targets.
The question here is whether there can be a balanced solution with an acceptable level of compromise between criteria like visibility, flexibility, logic overhead and debugging cycle. This thesis tries to accomplish this objective by suggesting a method that has a limited effect on area utilization (logic overhead) and can remain in the final product, while packing several useful features (flexibility) and being usable at any time, either in lab testing, in simulation or on site (visibility and debugging cycle), as a tool to help validate and debug the design. The solution is implemented in the form of an intellectual property (IP) block that can be integrated into a compatible system. The IP is called the Snapshot IP. It should be noted that this IP is designed to be a complementary solution and should be used along with other debugging techniques. The IP cannot find the exact location of a bug, but rather helps to find its rough location within the design.
This thesis is structured as follows. Chapter 2 takes a look at different approaches in the academic literature and discusses them in detail. Chapter 3 provides a short background on the project and the IP itself, followed by the IP’s overall specifications, interfaces, and main features, along with a couple of short example use cases to familiarize the reader with the IP. Chapter 4 examines the decisions and challenges we faced while designing the Snapshot IP, both from the design point of view and from the set of tools we used, and also discusses the limitations of the work. Chapter 5 presents the results of the work done, and Chapter 6 concludes this thesis.
2. RELATED WORKS
When talking about a system-on-chip (SoC) design, validation and verification, in their broadest context, can be divided into two categories: simulations and on-chip approaches.
Simulations can have different levels of complexity, from a simple VHDL test bench to a full-fledged UVM environment. For a large design containing multiple IPs and subsystems, simulating all possible cases is almost impossible, because running a test case that simulates a few milliseconds can take hours or days to complete. This makes running a full regression a time-consuming task and brings the need for on-chip validation approaches, that is, running some test cases on an actual chip. This helps verification teams speed up the validation and verification processes.
However, when it comes to finding a bug, typical on-chip validation approaches are not of much use due to their significant lack of visibility within the design. Simulations, on the other hand, bring maximum visibility into the design and can be useful for debugging purposes.
In general, when maximum visibility into the system is needed, simulation is the first choice. Simulation tools can log all the signals within the system and have all the waveforms at hand when needed. However, on-chip debugging has not been explored as extensively. So, in this chapter we discuss some of the methods used to increase on-chip debugging capabilities and compare their advantages and disadvantages.
2.1 Bitfile instrumenting
Bitfile instrumenting, or direct modification of the bitfile for the purpose of debugging or validation, is a method of on-chip debugging and has been discussed in the academic literature since at least the beginning of the new century [5]. Modifying or instrumenting the bitfile is a way for an IP or system user to directly manipulate the bitfile in order to monitor new nets or signals without having to re-compile the design. In general, this approach eliminates the need to re-compile the whole design whenever a new signal needs to be traced [5]. The following sections briefly discuss how different authors approached the instrumenting of a bitfile.
To begin with, Poulos et al. proposed pre-selecting all the potential signals that need to be traced in case a bug is encountered [15]. The signals are then connected to a newly proposed multiplexer (MUX) structure with the selectors and output backed by an SRAM memory. At run time, these signals can be selected by modifying the memory contents to choose a certain set of signals to be traced. The benefit of this method is that it is vendor independent and minimizes the performance overhead due to the minimalistic nature of the approach. This leads to faster debugging rounds and eventually to less debugging effort. However, the debugging functionality this technique provides is limited and is designed to work without trigger events, without which it would be a near-impossible task to capture the traced signal at the right time.
On the other hand, Graham et al. take a relatively more complex approach and implement a simple custom ELA [5]. This custom design enables the bitfile modification features which are not available when using conventional ELAs. The high-level block diagram of the mentioned ELA is shown in Figure 1. Having a modifiable logic analyser consequently enhances the flexibility of the instrumenting, so the triggers and traced signals can be connected to any desired nets just before the physical implementation. The results presented by the authors show a significant reduction in time between debugging rounds. However, this reduction is relative to the size of the original design: the bigger the design, the smaller the benefit. The main disadvantage here is that some of the tools are vendor-specific and have been obsolete since the early 2010s.
Figure 1. Simple Logic Analyzer proposed by Graham et al. [5]
Similarly, Karpagam and Viswanathan implemented a method which manipulates the mapped database to trace the nets [12]. However, instead of using a custom ELA, it dumps the traced data directly into a RAM location. This is done by routing the signals through the FPGA using left-over resources that have not been used by the original design while mapping the design in the placement-and-routing stage. To do so, it uses a technology-mapped database of the design and manually inserts the needed triggers and trace buffers into the database; then it regenerates the bitfile. The advantage of this approach, unlike commercial ELAs, is that the debugging functionality does not affect the timing criteria of the original system, as it only uses unused parts of the chip after the original design is fully mapped. This method, again, is designed around a specific FPGA vendor and cannot be used on other FPGAs.
Overall, the main motivation for these methods is the amount of time saved by avoiding design re-compilation compared to conventional logic analysers. The proposed methods also have a small footprint on FPGA resources and a relatively simple architecture, thus having minimal or no effect on the original timing characteristics of the design, which in return simplifies the design flow. In some cases, they only use the left-over resources that are not utilized by the original design. However, the complexity for the end users of these methods is higher than what commercial ELAs normally offer. For this reason, they are only practical in some specific cases where the debugging requirements are simple enough to put these methods into use.
2.2 Design for Debug and Trace Buffers
Design for debug is a methodology of planning for debugging from the start, meaning designers should consider the testability aspects when designing the system. This can include extra logic to make debugging more straightforward. One of the ways to do so is to use trace buffers, with or without an ELA. Trace buffer refers to a technique of acquiring data and dumping it to a memory location while the chip is running at its full speed. An ELA can be utilized in tandem to give a visual interpretation of the acquired data.
Techniques incorporating trace buffers offer users real-time observability of the chip functionality in debugging and validation sessions [13]. These techniques can generally sample data of pre-defined length upon reaching one or more trigger conditions and store it into the buffer or memory area. This data can then be either accessed on-chip or offloaded from the chip for further analysis.
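The basic trigger-and-capture principle described above can be sketched in a few lines. This is a simplified Python model, not RTL; the trigger condition, buffer depth and data are hypothetical. Capture arms, waits for the trigger, then fills a fixed-depth buffer that is later offloaded for analysis.

```python
def trace_capture(samples, trigger, depth):
    """Model of a basic trace buffer: start storing samples when
    `trigger` first evaluates true, stop when the buffer is full."""
    buf = []
    capturing = False
    for s in samples:
        if not capturing and trigger(s):
            capturing = True           # trigger condition reached
        if capturing:
            buf.append(s)              # store at full chip speed
            if len(buf) == depth:
                break                  # buffer full: offload later
    return buf

data = [0, 0, 5, 7, 9, 11, 13]
window = trace_capture(data, trigger=lambda s: s > 6, depth=3)
print(window)  # [7, 9, 11]
```

The fixed `depth` models the central limitation discussed next: everything outside the captured window is lost, so observability is bounded by the buffer size.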
The main hindrance in using trace buffers is the limited amount of memory for storing the acquired data, which makes debugging observability lower than that of software simulation. However, in many cases, the speed trace buffers provide compensates for the low observability. Different researchers have taken different paths to find a methodology to increase the observability. The following sections discuss how each of these researchers approached the issue at hand and explore the available options for effectively making use of the limited storage to achieve better observability, resulting in more efficient debugging sessions.
In order to have an efficient debugging session, the data captured into the trace buffer must give debuggers useful insights to find and locate errors easily. It is possible that the captured samples are an error-free section of data which fills the buffer before an erroneous section is reached. Capturing unnecessary data can delay the debugging process, and lengthening the process can indirectly hide real bugs that otherwise could have been found.
Yang and Touba proposed a three-step process to identify the rate at which errors potentially occur and optimize the captured data based upon that [17]. The main idea behind this method is to have a selective capturing process: rather than sampling the complete sequence of clock cycles, it only samples the subset of the real processed data on the chip that is suspected to be erroneous. This scheme adds an extra module to help utilize the buffer size over three separate debugging sessions. In the first step, the error rate is “estimated using lossy compression with a parity generator in the first debug session”. This error rate, along with the buffer size, defines how large the window size can be, that is, how far apart the first and last captured samples are in terms of clock cycles. The bigger the window size, the better the observability of debugging. The second step compacts the data using two different algorithms, cycling registers (CR) and multiple input signature registers (MISR). It then compares the generated signatures to find a set of suspected clock cycles in which errors may happen. Erroneous data produces erroneous signatures, and by cross-checking them one can find where the errors may be located. In the last step, it captures only the data from those suspected cycles into the buffer, and analysis begins when the acquisition is done.
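The signature-based localization step can be illustrated with a simplified model. The sketch below implements a minimal MISR compactor and flags the intervals whose signatures differ. Note the simplifications: the actual scheme cross-checks CR and MISR signatures on-chip, whereas this sketch compares against a golden reference run, and the register width and feedback taps are arbitrary choices for the example.

```python
def misr_signature(samples, width=8, taps=0b10111000, seed=0):
    """Minimal multiple-input signature register (MISR) model: an
    LFSR that folds one data word in per clock cycle."""
    sig = seed
    mask = (1 << width) - 1
    for s in samples:
        feedback = bin(sig & taps).count("1") & 1  # XOR of tap bits
        sig = ((sig << 1) | feedback) & mask       # LFSR shift
        sig ^= s & mask                            # fold in the sample
    return sig

def suspect_intervals(golden, observed, interval):
    """Flag the intervals of clock cycles whose compacted signatures
    disagree -- these are the only cycles worth capturing later."""
    flags = []
    for i in range(0, len(golden), interval):
        g = misr_signature(golden[i:i + interval])
        o = misr_signature(observed[i:i + interval])
        flags.append(g != o)
    return flags

golden   = [1, 2, 3, 4, 5, 6, 7, 8]
observed = [1, 2, 3, 9, 5, 6, 7, 8]   # one corrupted sample
print(suspect_intervals(golden, observed, interval=4))  # [True, False]
```

The payoff is clear from the output: only the first interval needs to occupy trace buffer space in the final capture session.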
Following the same principle of identifying suspect cycles, Kumar et al. proposed an optimized scheme which includes two steps instead of three, removing the first step of identifying the window size [14]. The main difference compared to the previous method is that instead of one CR, this technique uses two CRs (CR1 and CR2) with different parameters at the same time. This brings in less overhead data, meaning data that is not erroneous, than the method discussed in [17]. This method analyses the data from the first session off-chip and creates a list of markers called tag bits. The tag bits are loaded into the chip to guide the acquisition process in the second session to only capture what was marked as error data in the first step. The architecture is shown in Figure 2.
Figure 2. Architecture proposed by Kumar et al., showing the improved two-step model [14]
While [14] and [17] tried to improve buffer usage by selectively capturing data to achieve maximum observability, Ko et al. proposed a solution to achieve the same using efficient control of multiple trigger events and multiple buffers so that the total captured debug data in all buffers is maximized [13]. This idea revolves around a system that has multiple trigger events and trace buffers to store the sampled data. The proposed architecture uses a so-called allocation unit to handle simultaneous or overlapping trigger requests by combining automatic and user-defined sets of priorities. This unit also decides which requests to ignore if there is no adequate buffer space left. It also supports segmenting the data while sampling, so each segment can be stored in a different buffer independently and without the risk of losing data. The order of the segments is defined using a control data field which is inserted at the end of each segment in the buffer. This results in a higher average throughput, better utilization of the buffers, and a balanced load on all buffers in use. Whenever a buffer is idle, the allocation unit can offload it through the trace port and keep it ready for upcoming trigger requests. Figure 3 shows the high-level architecture of the proposed scheme.
Figure 3. Architecture proposed by Ko et al. [13]
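The segmenting-and-reassembly idea behind Ko et al.’s scheme can be sketched as follows. This is a simplified Python model of one possible allocation policy (pick the emptiest buffer); the actual unit also arbitrates trigger priorities and buffer exhaustion, which is omitted here, and the control field is modelled as a simple sequence number.

```python
def store_segmented(samples, buffers, seg_len):
    """Split one capture into segments spread across several trace
    buffers; a control field (sequence number) appended to each
    segment records its order for off-chip reassembly."""
    seq = 0
    for start in range(0, len(samples), seg_len):
        segment = samples[start:start + seg_len]
        target = min(buffers, key=len)   # balance load: emptiest buffer
        target.append((segment, seq))    # (data, control field)
        seq += 1

def reassemble(buffers):
    """Off-chip step: sort fragments by control field and concatenate."""
    segments = [entry for buf in buffers for entry in buf]
    segments.sort(key=lambda e: e[1])
    return [s for seg, _ in segments for s in seg]

buffers = [[], []]
store_segmented(list(range(10)), buffers, seg_len=4)
print(reassemble(buffers))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because the control field fixes the ordering, a segment can land in whichever buffer has room, which is what lets the scheme keep all buffers busy at once.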
The main disadvantage of the methods proposed by [14] and [17] is that since the data becomes fragmented, validating the content with regard to its timing becomes complicated. Having a sequence of erroneous data without timestamps, in a system where the timing of each packet of data is important, makes the debugging process cumbersome. The method proposed by [13] has no such issue; however, its architecture is significantly more complex and requires more FPGA resources to implement.
All in all, what makes trace buffers useful is the fact that they exploit the speed of the chip in the debugging process, but due to limited storage, they can only offer a low level of observability, specifically compared to software simulations. As a result, there are many works in the literature aiming at different solutions to increase it. One such method was selective sampling, acquiring only the necessary cycles of data to minimize buffer usage [14][17]. Another solution was dynamic management of multiple trigger requests and multiple trace buffers to achieve maximal usage of all buffers [13]. These studies show that, despite some disadvantages, utilizing trace buffers can lead to efficient debugging sessions where bugs and errors can be found with more ease than with instrumenting methods.
Overall, the works discussed here give some insight into how on-chip debugging and validation can be improved. Specifically, taking trace buffers into use can diversify the capability of a debugging solution. The approach proposed in this thesis uses trace buffers and triggers as a basis for implementing a new solution.
3. REQUIREMENTS AND METHODS
This chapter discusses the high-level specification of the Snapshot IP, its main features and the tools used to implement the IP. It also includes a more detailed description of the main functionalities of the Snapshot IP. In the end, a few use cases where the IP can be utilized are explained.
3.1 Project Background
The Snapshot IP is part of a 5G baseband project which is designed to be implemented on Intel Stratix 10 SX FPGA chips. It is loosely based on another IP designed for a relatively similar project. However, being designed for a Xilinx FPGA, the earlier IP was not fully compatible with the new project, both in terms of the IPs used within the design and the interface compatibility, and it did not have all the requested features. The Snapshot IP was developed to address these needs, and most of the code was re-written, which made adding new features more straightforward. As a by-product of the code refactoring, the performance in some areas also improved.
The Snapshot IP was planned to be integrated into certain points of the design, many of which lie between the radio interface IP and the rest of the design, in both uplink and downlink directions. The IP specification was initially prepared to meet the requirements of these snapshot points.
The initial design was done for the Time Division Duplex (TDD) variant of the project, meaning it only supported data formats common in that variant. It also had a smaller set of features, and there was only one clock frequency for the datapath. Frequency Division Duplex (FDD) data format support and extra features were added in later revisions of the IP. The single-clock-frequency datapath also evolved into a two-clock datapath, bringing the need for clock crossing schemes.
The Snapshot IP is part of a bigger project and is meant to help verification, validation
and testing teams find possible bugs in an IP. This is done by providing two main
functionalities. The first is sniffing the ongoing data on a data bus, comparing it with
reference vectors generated by high-level modelling tools (MATLAB in this case),
tracking the symptoms of erroneous data back to a specific IP and continuing debugging
from there using other available debugging options.
The second functionality is artificially inserting data into a connected IP and then
observing the behaviour of the destination IP. The inserted vector is based on data
preloaded in system memory. In this case, the connection between the source and
destination IP is virtually cut off, and the inserted data replaces the real data going to
the destination IP.
Under specific circumstances, insertion and capture can work simultaneously within
the IP to provide better visibility for the testing team.
In the high-level architecture of the project, a few points are defined as insertion/capture
points. The Snapshot IP is designed to handle all the possible cases at those points.
However, it can be used at other points or in other projects as long as the interfaces
match those of the neighbouring IPs.
Incorrect data can show up as skipped or erroneous samples. There can also be a
correct sequence of samples but with wrong timing.
3.2 Tools
A handful of tools were used for the development and verification of Snapshot IP. They
range from compilation and design entry tools and software products to verification and
debugging tools.
For synthesis and compilation, Intel Quartus Prime Pro is used as the official tool to
compile and program Intel FPGAs. For generating the register bank, a script-based tool
called reg_gen is used. This tool creates a register bank that can be controlled through
software. It takes a spreadsheet as input and generates VHDL code from it.
For IP-XACT packaging, a combination of custom scripts and a third-party tool from
Magillem is used. For linting and clock domain crossing (CDC) checks, Synopsys
SpyGlass is the tool of choice. More details about these tools are discussed in Chapter 4.
For verification and simulation-level debugging, Mentor QuestaSim is used, and for
on-chip debugging, Intel SignalTap (part of the Intel Quartus Prime Pro software) is
used.
Each of these tools contributed to ensuring a high level of design quality and avoiding
human errors. If used properly, they also make the development process easier and
more automated.
3.3 IP interfaces
The interface of the design consists of an Advanced eXtensible Interface, revision 4
(AXI4) Master for the memory interface, an AXI4-Lite Slave for the configuration
interface and multiple AXI4-Stream Master and Slave interfaces for the data path that
can be connected to multiple adjacent IPs. It also has a few custom signals dedicated
to timing information, such as BTS Clock Number (BCN) values.
Figure 4. Snapshot IP top-level architecture
3.4 IP Features
This IP provides a range of features to increase debugging flexibility, giving the testing
team more visibility in different cases and environments. It works with different data
widths, clock domains, data types and modes of operation. It can also be integrated with
two memory controllers simultaneously. In addition, the insertion functionality can be
optionally left out or instantiated on demand, according to the project needs. What
follows is an explanation of each of these features.
3.4.1 Data Format
As mentioned earlier, the Snapshot IP can be used in different places within the design.
This means that the width of the AXI4-Stream tdata signal depends on the tdata width
of the adjacent IP it connects to [3]. Therefore, the Snapshot IP supports both 64- and
128-bit input and output data lines on the AXI4-Stream.
The stream of data going through the Snapshot IP can also differ in format, and the
interval of valid data on the stream can vary. The Snapshot IP processes the data
according to the user's format of choice.
3.4.2 Multiple clock domains
From a logical point of view, both the insertion and capture functionalities consist of three
sections. The first section interacts with the memory interface to read from or write to the
DDR4 memory; let's call this the memory side. The second section interacts with the
adjacent IPs, performing the actual insertion or capture process; this refers to the part of
the design that processes AXI4-Stream packets, and is called the stream side. The third
section is the configuration memory (register bank), which acts as an interface through
which the IP can be controlled by software. These three sections can optionally run on
fully independent clock frequencies: the memory side on ddr_clk, the stream side on
str_clk and the configuration memory on cfg_clk.
The clock frequencies the IP receives depend on what the current iteration of the
system-level design is targeting. The ddr_clk should be in the same domain as the DDR4
memory controller, the clock frequency of the stream side depends on the adjacent IPs
that send data to or receive data from the Snapshot IP, and cfg_clk generally depends
on the frequency used for the configuration lines at the system level. So, to make the
design as future-proof as possible, clock crossing synchronization methods are needed
within the IP. A mix of a few different approaches is used to handle all clock crossing
situations.
3.4.3 Modes of operation
The capture and insertion functionalities have numerous features that users can utilize
to debug more efficiently. The following is a brief list of these features; they are
discussed more thoroughly in sections 3.5 and 3.6.
• Capturing:
a. Single/continuous (circular buffer) modes
b. User-controllable continuous mode
c. Starting or ending a capture based on a timing preference
d. Capture when reaching a specific pattern
e. Selectable memory
f. Selectable memory address range to take into use
g. Selectable data width based on the source IP data width
• Insertion:
a. Trigger based on certain timing criteria
b. Supporting different data bus widths to match the destination IP
c. Different data formats to support different use cases
The main requirement is that the Snapshot IP must be able to capture the ongoing data
at all times, so that if a failure happens at any moment, the debugging data is available.
To work all the time, the IP needs to efficiently handle the incoming data and save it
into memory. In theory, this means ~16 Gbps of data. This, along with all the features
requested while developing the IP, caused some challenges that are the main topic of
the next chapter.
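The ~16 Gbps figure can be sanity-checked with simple arithmetic: a 128-bit stream carrying valid data on every cycle at a clock of around 125 MHz yields exactly that rate. The 125 MHz figure is an assumption for illustration; the exact stream frequency is not stated here.

```python
def stream_bandwidth_gbps(data_width_bits: int, clock_hz: float) -> float:
    """Peak stream bandwidth when every clock cycle carries valid data."""
    return data_width_bits * clock_hz / 1e9

# Assumed 125 MHz stream clock on a 128-bit bus:
print(stream_bandwidth_gbps(128, 125e6))  # -> 16.0
```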
3.4.4 Support for two memories
The system has two memories: one is an external memory with direct access from the
FPGA side, and the other is a memory accessible through the Hard Processor System
(HPS). At times, one of these memories can be busy with other operations, and using
the same memory could cause unwanted performance issues for the ongoing operation.
Although there is only one AXI4 Master interface available in the Snapshot IP design,
the read and write sections are fully separated between capture and insertion, with
capture always using write transactions and insertion using read transactions. So, with
a small effort in top-level integration, the Snapshot IP can be connected to two different
memory controllers at the same time. This is also made possible by the fact that the
insertion and capture logics are completely separated and can work independently.
It is worth noting that the IP's connection to the HPS memory is not through the
conventional FPGA-to-HPS (F2H) bridge. It uses a dedicated bridge called the
FPGA-to-SDRAM (F2SDRAM) bridge, which has more direct access to the SDRAM than
the F2H bridge: the F2SDRAM connection bypasses the Cache Coherency Unit (CCU)
and connects to the SDRAM L3 interconnect directly [11] (Figure 5).
Figure 5. Intel Stratix 10 HPS Block Diagram. Difference between F2H (top centre) and F2SDRAM (centre) paths to SDRAM [11].
3.4.5 Optional insertion instantiation
Unlike capture, not all the points where the Snapshot IP can be integrated need an
insertion feature for debugging. This is mainly due to the complexity of the data format
needed by the adjacent IP at those points. As a result, insertion as a whole can be
optionally left out and not instantiated, which results in lower total logic utilization.
3.5 Capture
The IP can capture data in one of seven available modes. They can be divided into two
categories: single and continuous modes. There is, however, a set of common
configurations that all these modes need in order to operate. First, a valid address range
(start and end addresses) reserved for the captured data in memory must be defined.
The capture functionality must also be enabled before capturing can start.
Single capture modes are those that start capturing incoming data on a predefined or
manual trigger and continue until they reach the end address defined by the user.
• Immediate single mode: the user manually triggers the start. Capturing starts
immediately and stops when it reaches the end address.
• Start based on a certain timing: this uses the BCN value. When a certain value is
reached, capturing starts and continues until the address range is full.
• Start on a specific pattern in the incoming data: the incoming data is compared
against a predefined N-bit value set by the user. Capturing starts as soon as the
pattern is found on the input.
Continuous modes are those that run until a specific condition is met, regardless of the end address.
• Continuous (circular buffer) mode: similar to immediate single mode, but when it
reaches the end address it wraps around to the start address. This mode must be
stopped manually.
• Ending a capture based on a certain timing: in this case capturing starts manually
but continues until it reaches a certain BCN value. This means it might or might not
wrap around the memory range.
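The difference between single and continuous modes is essentially whether the write address stops at the end of the configured range or wraps around to its start. The following sketch models that address behaviour only; it is an illustration with made-up parameters, not the RTL.

```python
def capture_addresses(start, end, beat_bytes, n_beats, continuous):
    """Generate successive write addresses for captured beats (in bytes).

    In single mode the capture stops when the range is full; in continuous
    (circular buffer) mode the address wraps back to `start`.
    """
    addr = start
    for _ in range(n_beats):
        yield addr
        addr += beat_bytes
        if addr >= end:
            if not continuous:
                return          # single mode: done when range is full
            addr = start        # continuous mode: wrap around

# 32-byte address range, 16-byte beats:
print(list(capture_addresses(0x1000, 0x1020, 16, 4, continuous=False)))
# -> [4096, 4112] (stops at end of range)
print(list(capture_addresses(0x1000, 0x1020, 16, 4, continuous=True)))
# -> [4096, 4112, 4096, 4112] (wraps around)
```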
Figure 6. Capture single mode with BCN start
From the user's point of view, a set of operations must be performed. The flowchart in
Figure 7 shows these operations at an abstract level. Before starting a capture, the user
should decide which input stream, address area and mode to use. Then, if the selected
mode needs additional steps, certain configurations must be done. Depending on the
selected mode, the capture finishes automatically or must be stopped manually by the
user. When the capture is done, the operation and error status registers can be checked
to make sure everything was captured successfully.
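From software, this procedure amounts to a short sequence of register writes. The register names and mode encodings below are hypothetical placeholders (the real map comes from the reg_gen spreadsheet); the sketch only illustrates the ordering: configure the address range and mode first, then any mode-specific trigger, and enable last.

```python
IMMEDIATE, BCN_START, PATTERN = 0, 1, 2   # assumed mode encodings

def start_capture(write, mode, start_addr, end_addr, bcn=None, pattern=None):
    """Bring-up sequence for a capture, as register writes via `write(name, value)`."""
    write("CAPT_START_ADDR", start_addr)   # memory area to take into use
    write("CAPT_END_ADDR", end_addr)
    write("CAPT_MODE", mode)
    if mode == BCN_START:                  # timing-based start needs a BCN value
        write("CAPT_BCN_TRIGGER", bcn)
    elif mode == PATTERN:                  # pattern-based start needs a pattern
        write("CAPT_PATTERN", pattern)
    write("CAPT_ENABLE", 1)                # enable last, then poll status registers

regs = {}
start_capture(regs.__setitem__, BCN_START, 0x8000_0000, 0x8010_0000, bcn=42)
print(regs["CAPT_ENABLE"], regs["CAPT_BCN_TRIGGER"])  # -> 1 42
```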
3.6 Insertion
The Snapshot IP can also insert a specific vector of data into an adjacent IP. In this case,
a predefined string of data is placed in memory, and the IP reads it and pushes it to the
destination IP. Due to the nature of the operation, the number of available modes is
smaller than for the capture functionality; there are two modes to choose from, single
and continuous. In both cases a BCN value is needed so the starting point of the insertion
can be found and aligned precisely.
• In single mode, upon reaching the BCN value, the data is read from memory and,
according to the chosen data format, pushed to the destination IP. When reading
from memory reaches the end address, insertion stops automatically and updates
the status registers.
• In continuous mode, similar steps happen, with the difference that when reading
reaches the end address, it wraps around to the start address and continues
inserting from the beginning. This mode must be stopped manually.
The flowchart in Figure 8 shows the procedure a user should follow to take insertion into
use. Similar to capture, the user should set the address area and select a stream and a
mode. Setting the BCN value is mandatory here; otherwise, insertion will start when
BCN=0x0. Moreover, if the output format differs from the default, it should be defined as
well. When insertion is done, the status registers can be checked to verify everything
has been transferred successfully.
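The BCN alignment can be illustrated with a small behavioural model: insertion waits until the BCN input matches the programmed trigger value, and because the trigger register resets to 0x0, leaving it unconfigured means the first occurrence of BCN=0x0 starts the insertion. A sketch, not the RTL:

```python
def insertion_start_cycle(bcn_trace, bcn_trigger=0x0):
    """Return the first cycle on which insertion starts.

    Insertion is aligned to the BCN timing signal; the default trigger of
    0x0 models an unconfigured trigger register.
    """
    for cycle, bcn in enumerate(bcn_trace):
        if bcn == bcn_trigger:
            return cycle
    return None  # trigger value never seen on the BCN input

bcns = [5, 6, 7, 0, 1, 2]
print(insertion_start_cycle(bcns, bcn_trigger=7))  # -> 2
print(insertion_start_cycle(bcns))                 # unconfigured: starts at BCN=0x0 -> 3
```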
Figure 7. Capture user operations procedure
Figure 8. Insertion user operations procedure
3.7 Example Use Cases
This section describes some simple use cases for the Snapshot IP. They showcase the
flexibility of the Snapshot IP and how it can be utilized in different scenarios.
3.7.1 Use case 1: Capture
Assume there is a Digital Signal Processing (DSP) IP that generates a series of data to
be transferred to the radio interface. Each piece of data must arrive at a specific time,
otherwise it might be dropped. The Snapshot IP can be programmed to trigger at a
certain moment and capture the incoming data into a memory space. This data is then
compared with the golden data generated by simulation tools to validate that both the
data and its timing are correct. Figure 9 shows a block diagram for this case.
Figure 9. Visualisation of use case 1
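The offline check in this use case reduces to comparing the captured memory dump with the golden vector sample by sample. A minimal sketch; the real flow uses MATLAB-generated reference vectors, and the list-based format here is just for illustration:

```python
def compare_to_golden(captured, golden):
    """Return indices of captured samples that differ from the reference."""
    return [i for i, (c, g) in enumerate(zip(captured, golden)) if c != g]

golden = [0x11, 0x22, 0x33, 0x44]
captured = [0x11, 0x22, 0x30, 0x44]   # one corrupted sample
print(compare_to_golden(captured, golden))  # -> [2]
```

A non-empty result points at the first symptom of an error, from which debugging can continue towards the IP that produced the sample.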
3.7.2 Use Case 2: Insertion
Assume a case where the processing unit is being debugged and cannot be used to
generate data for the radio interface IP. To increase the productivity of the testing team,
the Snapshot IP can be used to test the radio interface IP.
In this case, a simulation-generated test vector can be inserted into the destination IP to
verify its functionality. This can be expanded so that the data coming back from the radio
interface is captured to fully verify the functionality of the IP. Figure 10 visualises the
simplified idea: instead of the original data, data coming from the DDR4 memory is
passed to the destination IP.
Figure 10. Visualisation of use case 2
3.7.3 Use Case 3: Self-validation
The Snapshot IP can be used to validate its own insertion and capture functionality. This
has been used to verify that the Snapshot IP runs as intended and that all status registers
show correct values when a process is finished. It is achieved by feeding one of the IP's
output streams directly back into an input stream. As a result, the data inserted by the
Snapshot IP is captured by the Snapshot IP itself into a memory space. This use case
shows the robustness of the design under pressure and how the FIFOs and the AXI4
interface behave when the IP is under full load.
To do so, the IP is configured for both the insertion and capture functionalities. Different
capture and insertion modes can be used here, but single modes are recommended to
make the comparison easier: continuous modes overwrite the memory content, which
makes the comparison between the original and captured data more complex. A good
combination is insertion in single mode together with capture in either immediate single
mode or pattern-based mode, where the pattern is the same data that is going to be
inserted.
Figure 11. Use case 3: Self validation
Capture should be started first to make sure no data is missing from the final result; it
stays idle until valid data (tvalid='1') arrives. After enabling capture, insertion can be
started. When finished, the status registers of both insertion and capture should be in
the done state and the IP itself should be back in idle. The captured data should also
match the original inserted data.
4. IMPLEMENTATION
Designing a high-performance Snapshot IP that can function in different use cases
inevitably brings a series of challenges. Decisions also had to be made on a series of
items and on how to implement certain modules.
The main design decision for the Snapshot IP is how different types of data should be
synchronized when crossing clock boundaries.
The challenges can be categorized into three main types: design challenges, design flow
challenges, and validation and verification challenges. Design challenges are those that
hampered or complicated the design process and got in the way of keeping the Snapshot
IP as generic as possible while finalizing it. Design flow challenges are related to the
tools and workflows around the IP. Validation and verification challenges are those that
made debugging the IP itself more difficult.
There is also a short discussion on the limitations of the design, why they could not be
avoided and how they have been dealt with.
4.1 Design Decisions
Choosing correct synchronization methods can significantly improve the robustness of
the design. Depending on the type of data and the relation between the source and
destination clocks, different synchronization schemes should be used. The signals that
need to be synchronized in the Snapshot IP fall into three categories: multi-bit signals
on the main bus, single-bit signals from a slow to a fast clock domain, and single-bit
signals from a fast to a slow clock domain. Each category needs a synchronization
scheme of its own.
First, a dual-clock First In, First Out (FIFO) memory is utilized on the data bus to safely
move data between ddr_clk and str_clk. The FIFO used here is an Intel IP that uses
Gray code to synchronize between its write and read sections. Due to this
synchronization, there is a delay between the time data is written into the FIFO and the
time it is ready to be read out. The actual delay depends on the frequencies of the read
and write sides of the FIFO. The following formulas show examples of the delay between
writing to and reading from the dual-clock FIFO, where wrreq is the write request signal,
q is the output and rdfull shows whether the read side sees the FIFO as full [7].
wrreq to q[]: 1 wrclk + following 1 rdclk
wrreq to rdfull: 2 wrclk cycles + following n rdclk
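The reason dual-clock FIFOs use Gray code is that successive Gray-coded pointer values differ in exactly one bit, so a pointer sampled in the other clock domain is always either the old or the new value, never an inconsistent mix of bits. A sketch of the standard encoding (not Intel's FIFO internals):

```python
def bin_to_gray(b: int) -> int:
    """Standard binary-to-Gray encoding."""
    return b ^ (b >> 1)

def gray_to_bin(g: int, width: int = 8) -> int:
    """Inverse transform: fold the Gray bits back down to binary."""
    b, shift = g, 1
    while shift < width:
        b ^= b >> shift
        shift *= 2
    return b

# Adjacent pointer values differ in exactly one bit in Gray code,
# so a metastable sample can only resolve to the old or the new value.
for p in range(7):
    diff = bin_to_gray(p) ^ bin_to_gray(p + 1)
    assert bin(diff).count("1") == 1
assert all(gray_to_bin(bin_to_gray(p)) == p for p in range(256))
print("Gray code round-trip OK")
```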
The second method is multi-flop synchronization for single-bit and multi-bit signals
where the source clock is slower [4]. This consists of two or three sequential flip-flops
clocked by the destination clock. The multi-bit signals use a coherent synchronizer,
where the output only changes once the data has stabilised on all of the intermediate
flip-flops.
Finally, for single-bit synchronization from a fast clock to a slow clock, an 8-bit extension
is used. Synchronizing with this method can cause data loss if the value of the signal
changes faster than the destination clock can sample it. Therefore, it is only used for
quasi-static signals, meaning signals that rarely change value.
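The need for the extension can be seen with a toy model: if the slow domain effectively samples the fast-domain signal once every N fast cycles (N = 8 assumed below, not taken from the design), a single-cycle pulse can fall entirely between two samples and be lost, while a pulse stretched over N fast cycles is always seen at least once.

```python
def sample_slow(fast_signal, ratio=8):
    """Values the slow-clock domain sees: one sample per `ratio` fast cycles."""
    return fast_signal[::ratio]

one_cycle_pulse = [0, 0, 1, 0, 0, 0, 0, 0] + [0] * 8
extended_pulse  = [0, 0] + [1] * 8 + [0] * 6   # stretched over 8 fast cycles

print(1 in sample_slow(one_cycle_pulse))  # -> False: the short pulse is lost
print(1 in sample_slow(extended_pulse))   # -> True: the extended pulse is seen
```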
It should be noted, however, that despite all these synchronizations, not every arbitrary
set of clock frequencies can be used. The fact that FIFO overflows or underflows can
cause data loss limits the choices here.
4.2 Design Challenges
4.2.1 Variable number of clocks
Since the project was in a development phase, the clock requirements changed
regularly. The IP was initially designed with a single clock on the datapath. Later this
changed, and the synchronizers and a dual-clock buffer were added to the design. Later
still, the design moved back to a single clock as all the clock frequencies reached their
target values. Each of these changes required a significant amount of code modification.
In particular, converting the single-clock design into a dual-clock one was cumbersome.
That included making sure all the signals transferred from one clock domain to another
were properly synchronized. Also, the FIFO had to be replaced with a dual-clock version.
This complicated the relation between the read and write sections of the FIFO: the delay
between these two sections must be considered when reading from or writing into the
FIFO [7]. These delays also brought the need to modify the full and empty thresholds
and how the AXI4 Master interface interacted with the IP to avoid overflows and
underflows.
4.2.2 Integration points within the system
The Snapshot IP can be integrated at different points of the design, as long as there is
an AXI4-Stream connection. Naturally, the characteristics of these points are not
necessarily the same and often differ from point to point. This is especially important
regarding the relation between the AXI4-Stream clock and the DDR4 memory clock, i.e.
which side is faster and which is slower, as it plays a role in how the synchronization
between these two sides of the design is done.
4.2.3 Different input formats
Another main difference between the integration points is the data format on the
AXI4-Stream channel. How frequently the tvalid signal arrives and how tdata is
structured affect the processing of the data. The incoming data can be 64 or 128 bits
wide, and the interval between tvalid signals can change in different use cases. To buffer
the incoming data (either from the DDR4 memory in insertion or from the AXI4-Stream
in capture), a dual-clock FIFO is used, which also takes care of synchronizing the data
between the two clocks. Since the width of this buffer is set to 128 bits, when a 64-bit
capture is in use the data must first be packed into 128-bit words and then saved into
the buffer. That adds the need for some preprocessing of the incoming data. The same
idea applies in insertion when 64-bit data is desired on the output.
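The 64-to-128-bit packing can be sketched as combining consecutive pairs of 64-bit beats into one FIFO word. The ordering below (first beat in the low half of the word) is an assumption for illustration, not taken from the design:

```python
def pack_64_to_128(beats_64):
    """Combine consecutive pairs of 64-bit beats into 128-bit FIFO words."""
    words = []
    for i in range(0, len(beats_64) - 1, 2):
        low, high = beats_64[i], beats_64[i + 1]   # assumed: first beat low
        words.append((high << 64) | low)
    return words

beats = [0x1111_1111_1111_1111, 0x2222_2222_2222_2222,
         0x3333_3333_3333_3333, 0x4444_4444_4444_4444]
words = pack_64_to_128(beats)
print(hex(words[0]))  # -> 0x22222222222222221111111111111111
```

Insertion performs the inverse unpacking when a 64-bit output is desired.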
There are also specific bits in the data that mark the beginning of a frame. When
capturing, such a bit can be used to find the start of the frame and begin capturing from
that point. This can also be done using a 64-bit pattern that triggers the capture when
found.
A tlast signal is sent out at the end of each basic frame. In some cases, tvalid is '0' and
tdata carries no valid data for a specific number of clock cycles before the next basic
frame arrives. Since the data in the buffer is packed and does not contain these invalid
cycles, the IP must handle the gaps between non-consecutive tvalid signals and control
the flow of the outgoing data. This also includes delaying responses to the DDR4
memory so that the buffer does not overflow.
4.2.4 AXI4 Master design compatibility with the slave
AXI4 is a standardized interface, but it can be interpreted differently, so a Master
interface might not be fully compatible with a Slave if it is not generic enough. For
example, AWREADY can go high either before or after a valid WDATA is presented [2].
This makes adapting the design when changing the Slave IP cumbersome.
Another point worth discussing is that each AXI4 Master interface can be connected to
multiple Slave interfaces. As a result, the Intel Quartus software automatically adds an
interconnect between them. This interconnect internally uses Intel Avalon
memory-mapped interfaces, which makes all write and read transactions go through the
same path: if one of them gets stuck, the other cannot continue sending or receiving
data. It took some debugging sessions to find and fix the corner cases that caused these
issues.
4.2.5 Performance optimizations
While capturing, the worst-case scenario is when every clock cycle carries valid data on
the incoming AXI4-Stream channel (tvalid = '1'). It should be noted that the Snapshot IP
does not backpressure the incoming stream on the capture side (tready is always '1').
In this situation, the buffer can fill up quickly. Therefore, the AXI4 interface towards the
DDR4 memories must be optimal so that buffer overflows are avoided and data can be
sent to the DDR4 memory at a constant rate. If the buffer overflows, the captured data
is considered corrupted (data is missing) and the capture stops immediately.
In insertion, the data must be sent out exactly according to the chosen format and cannot
be delayed, so buffer underflows cannot be afforded. If a buffer underflow happens at
any time, the insertion is stopped, as there would be missing clock cycles where there
should have been valid data. This brings the need to have multiple active AXI4
transactions at a time, so the data moves in and out of the buffer as fast as possible.
In earlier versions of the IP, the design handled AXI4 transactions sequentially, that is,
a single write transaction at a time. This was improved in later releases, as the IP had
difficulty keeping up with the flow of incoming data and the buffer was overflowing. The
final versions can handle 32 outstanding AXI4 burst transactions.
One might argue that a wider AXI4 data bus towards the memories could mitigate this
issue. However, a couple of considerations led to not widening the data bus. The
explanation below is mostly for capture, but similar ideas apply to insertion.
The most important consideration is that the data bus width of the F2SDRAM AXI4
bridges of the HPS supports a maximum of 128 bits [11]. So even if the Snapshot IP had
a wider AXI4 data bus, it would be constrained to 128 bits when reaching the HPS.
Moreover, the FIFO word size is aligned to the DDR4 bus size for ease of use, so each
read from the FIFO corresponds to one AXI4 transaction. If the AXI4 data bus became
wider, there would be two options for the FIFO. The first option is to widen the FIFO to
match the AXI4 data bus width, which brings the need to pack incoming data before
writing it into the FIFO; with the current design, packing is at its minimum, being done
only for 64-bit input data. The other option is to keep the FIFO size the same and perform
multiple reads per AXI4 transaction instead of one. Both options would complicate the
design, which can directly affect critical paths.
4.2.6 Timing Optimizations
Although the Snapshot IP is rather small, Intel Quartus Prime had difficulties reaching
the timing goals inside and around the Snapshot IP instances.
One of the reasons for this was the large number of interfaces, which complicates the
routing. To address this, we relaxed the timing using pipelining wherever possible, so
the inputs and outputs go through multiple register stages.
Another timing concern that needed to be addressed was value comparisons and their
effect on timing closure. Having a variety of features in the design brings the need for
multi-layered conditional statements and more complex state machines. These can
make the combinatorial logic between two registers large, so the critical paths can
become longer than what the compiler can handle. To fix this, we re-examined the code
and simplified some of the conditional statements and state machines. This took the
form of using variables instead of some signals, moving assignments out of if statements
where possible, and pipelining signals that are not needed immediately.
Moreover, we modified the design so that signals have either a synchronous reset or no
reset at all. This is a consequence of the Intel Stratix 10 FPGA architecture: the design
can be placed and retimed more easily by the compiler if no asynchronous resets are
involved [8].
4.2.7 FPGA resource usage
Although the earlier revisions of the design were not particularly resource hungry, there
were requests to optimize the design even further, mainly to help the top-level
integration reach a timing-clean result through smoother placement.
For this, some redundant and unused parts were removed from the code, and a generic
was added to optionally leave out the insertion functionality whenever possible. The
results of these changes are reflected in Chapter 6.
4.2.8 Handling urgent feature requests
At times during the project, an urgent new feature was requested that had to be included
in the design without breaking or changing any of the existing features. One example of
such an urgent request was supporting multiple formats for insertion. Before this change,
the Snapshot IP only supported TDD formats; with it, FDD format support had to be
added, including several types of outputs.
Choosing the best way to develop this feature so that it could be merged easily with the
existing implementation was challenging, as it required many code modifications and
additions to generalize the design for all possible scenarios. Nevertheless, it was quite
rewarding, as the feature worked as intended from the first tests.
4.3 Design Flow Challenges
The challenges regarding the design flow are mainly due to either a lack of experience
in using the tools or the lack of a robust design entry flow specifically built for an FPGA
project. The tools used are industry-standard tools that are rarely encountered in an
academic setting.
In the following sections, some of the tools and the challenges that come with each of
them are discussed.
4.3.1 IP-XACT packaging
IP-XACT is an XML-based metadata format that mainly describes an IP's interface. It
makes using and integrating an IP into a bigger system easier [1]. However, the process
of generating IP-XACT files for the first time was rather cumbersome.
The flow consists of multiple steps and includes the use of a third-party tool from
Magillem combined with a few TCL-based scripts. Setting all of this up so that correct
IP-XACT XML files are produced required a few rounds of trial and error with different
options.
Once the first revision of the XML files is done and correct files are generated,
modifications and reruns of the flow are straightforward. Modifications are only needed
when the IP interface or the register bank changes.
4.3.2 SpyGlass linting/CDC checks
Linting is the process of statically checking the code to find common issues and bugs,
which helps to achieve better code quality. Clock domain crossing (CDC) checking is
the process of formally verifying clock crossing points to find issues in the clock crossing
logic, including glitches, missing data and data holding issues. Synopsys SpyGlass is
the tool used here.
Setting up SpyGlass from scratch and defining constraints for it was challenging. By
default, SpyGlass assumes the worst-case scenario for every check it performs. In reality
this is often not the case, so many manual instructions must be added to guide the tool.
For example, the status registers change their value only a handful of times during an
operation. However, since these signals go through clock crossings, SpyGlass assumes
the worst case of them changing regularly. As this is not the case, constraints should be
added so SpyGlass considers them quasi-static.
Having a blackbox IP in the design also complicates the flow: blackboxes need extra
care and constraints so the tool can handle them properly.
Even after adding numerous instructions and constraints, many errors and warnings may
still show up that need to be addressed. Some of these are false alarms and some come
from the blackboxes. These errors and warnings can be waived to clean up the results,
so that real issues are easier to find.
4.3.3 Register bank Generation
The register bank allows software to access and configure the IP for a specific purpose.
The flow starts with a spreadsheet, formatted in a specific style, that defines all the
registers needed, including both control and status registers. This spreadsheet is then
fed to a script that generates both the HDL and the IP-XACT files for the register bank.
4.4 Validation and verification challenges
Validation and verification challenges are those that affected how easily and quickly the
Snapshot IP itself could be debugged, and they resulted in lengthy debugging cycles.
Early in the project, it was decided that the Snapshot IP would be verified as part of the
top-level design, that is, the whole FPGA project. As a result, no IP-level simulation was
available. To debug the IP, one had to run a top-level simulation, which is not
time-effective: it can take tens of minutes before any data even starts to go through the
IP. This made the debugging process slow.
Another option for debugging the design was Intel's SignalTap logic analyzer. Since it runs directly on the FPGA, it is ideally a faster way to find a bug than simulation. However, due to its limited visibility into signal activity, multiple rounds of debugging are often needed to find a given issue. One SignalTap instance can handle at most 4K signals and save their data for a certain number of cycles [9]. Data acquisition is triggered when a pre-defined event or chain of events occurs [9]. The idea is to find a trigger so close to the problematic section of a signal that the bug shows up within the acquired data.
In our case, debugging an issue starts from a stuck interface or wrong status register values. With this initial information, we created the first SignalTap file to find the rough location of the issue. After that, depending on the situation, a few more rounds of SignalTap debugging may be needed to find the exact problem.
Another issue with SignalTap is related to timing. In a system that already has a high logic utilization percentage, SignalTap further complicates reaching a timing-clean result.
The combination of these limitations led to days, and sometimes weeks, spent debugging a single bug.
Nevertheless, SignalTap offers features that are useful in a debugging session. The best example is state machine-based trigger conditions, where the user can define a specific sequence of conditions that must occur before data acquisition starts (Figure 12). Data acquisition can also be split into multiple segments that acquire data individually, which is useful when a certain set of trigger conditions occurs more than once.
Figure 12. An example of a SignalTap state-based trigger condition
4.5 Limitations
Although the Snapshot IP is designed to be as generic as possible, some limitations still prevent a fully generic design. The main limitations are described below.
4.5.1 A complementary solution
While the Snapshot IP has many features that help testers validate and debug parts of the system, it was not intended to be the sole method of doing so. It is rather designed as a complementary method to be used alongside other available solutions.
Ideally, the Snapshot IP can narrow an issue down to a particular IP. In a large-scale project, this narrowing down can be used in two different scenarios. In one scenario, the Snapshot IP serves as a secondary method after a system-level validation has found a failure; it is then used at a lower level to find the specific IP in the design that malfunctions. In the other scenario, it serves as the first level of validation or debugging, determining the correctness of a datapath before longer test cases are started to validate the whole system.
4.5.2 Ability to insert any arbitrary format of data
The Snapshot IP can insert multiple formats of data. However, there is still a set of rules that defines what counts as valid data for insertion. These rules are based on what the radio interface IP accepts as valid data, so if the Snapshot IP is used somewhere other than adjacent to the radio interface IP, the data format might differ and insertion can no longer be done. Insertion formats depend on the intervals of the tvalid signal and on how often tlast should be sent out. This makes supporting an arbitrary insertion format practically unfeasible. The optional instantiation feature of insertion is partly implemented because of this limitation.
4.5.3 Performance sensitivity
As mentioned earlier, the capture functionality only sniffs the data; it is not the main destination of the incoming data and cannot request the data to be delayed. As a result, the capture functionality in this design does not provide backpressure on its AXI4-Stream interface, that is, tready is always '1'.
That means that when a data packet arrives, the FIFO must have enough space for it. This requires the data buffered in the FIFO to be sent out to external memory through the AXI4 interface in a timely and consistent manner. If the AXI4 interface transfer rate fluctuates or underperforms, the FIFO may overflow, which halts the capture process since the captured data would be corrupted by missing data.
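The overflow risk can be illustrated with a simple occupancy model. The numbers below (depth, fill rate, drain rate and cycle count) are hypothetical, chosen only to show how a drain rate that falls below the fill rate eventually overflows a FIFO whose source cannot be stalled:

```python
def simulate_fifo(depth: int, fill_per_cycle: float,
                  drain_per_cycle: float, cycles: int) -> bool:
    """Return True if the FIFO would overflow. The stream side cannot
    be stalled (tready is always '1'), so data arriving into a full
    FIFO would be lost."""
    level = 0.0
    for _ in range(cycles):
        level += fill_per_cycle                     # stream side always writes
        level = max(0.0, level - drain_per_cycle)   # AXI4 side drains
        if level > depth:
            return True
    return False

# A drain rate matching the fill rate keeps the FIFO safe...
assert not simulate_fifo(depth=4096, fill_per_cycle=1.0,
                         drain_per_cycle=1.0, cycles=100_000)
# ...but a slightly underperforming AXI4 side eventually overflows it.
assert simulate_fifo(depth=4096, fill_per_cycle=1.0,
                     drain_per_cycle=0.9, cycles=100_000)
```

Even a 10% sustained shortfall in drain rate fills a 4096-word buffer in a few tens of thousands of cycles, which is why consistent AXI4 throughput matters more here than peak throughput.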
The Snapshot IP is designed to make maximum use of the AXI4 interface when it is the dominant user of the memory. However, if multiple high-performance operations use the same memory simultaneously, the performance of the Snapshot IP can suffer directly.
In response to this limitation, support for two simultaneous memories was added, so that when one memory is busy, the other can be used for Snapshot IP operations. However, if the number of such concurrent cases increases, a performance reduction is unavoidable.
5. RESULTS
The main objective of the Snapshot IP, from the project's point of view, is to enable real-time capture and insertion functionality for a set of pre-defined integration points. The implementation was successful, and the IP is now in use in the final product. This chapter discusses the results of the designed IP and examines whether the criteria set out at the beginning of this thesis were achieved.
5.1 Overall performance
One of the items that can be tangibly measured is the speed of accessing external memory. The average theoretical data rate on the AXI4-Stream side of the system is ~16 Gbps, so this is the minimum that the AXI4 interface must be able to handle. As shown in Figure 13, this is handled with ease. Assuming the AXI4 interface has something to transfer on every clock cycle, the rate can reach ~32 Gbps at maximum; considering the system's delays in transferring data, it will in practice always remain below 32 Gbps.
One should note that these numbers apply when only capture or only insertion is at work, not both together. When both are working, the effect of some variables is too large for the performance to be measured reliably. Some of these variables are listed below.
• The timing of capture and insertion with regard to each other
• The AXI4 interface connections at the top level
• The number of external memories in use (one or two)
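For reference, the quoted rates are consistent with a 128-bit datapath (the FIFO word width given in Section 5.2) running at a clock of 250 MHz; the clock frequency and the 50% average bus utilization are assumptions made here purely for illustration, not figures from the design:

```python
# Back-of-the-envelope check of the quoted AXI4 rates. The 250 MHz
# clock and 50% average utilization are illustrative assumptions; the
# data width matches the 128-bit FIFO words described in Section 5.2.
DATA_WIDTH_BITS = 128
CLOCK_HZ = 250e6          # hypothetical clock frequency
AVG_UTILIZATION = 0.5     # hypothetical average bus utilization

peak_gbps = DATA_WIDTH_BITS * CLOCK_HZ / 1e9   # one transfer per cycle
avg_gbps = peak_gbps * AVG_UTILIZATION

print(f"peak: {peak_gbps:.0f} Gbps, average: {avg_gbps:.0f} Gbps")
```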
Figure 13. AXI4 write transactions in capture
Figure 14. Continuous insertion output to a stream
5.2 FIFO usage levels
When designing the Snapshot IP, one of the main concerns was to avoid FIFO underflow and overflow, since either leads to missing data in the output. This is because, in this system, AXI4-Stream slave interfaces must always be ready and cannot drive their tready to '0'. The AXI4 side, however, is fully capable of controlling the flow of the data.
Figure 15. Relation between AXI4 transactions and FIFO level in capture
In the capture case, where the AXI4-Stream interface writes to the FIFO and the AXI4 interface reads from it, an overflow can happen if the FIFO is not emptied quickly enough. The solution is to read out as much data as possible so that the FIFO stays as empty as possible at any given time. Since the burst size for the AXI4 master is defined as 32, the interface starts a new transaction whenever the FIFO level goes above 32 (Figure 15). Taking the delays into account, this actually happens when the FIFO level is at 35. The FIFO only becomes fully empty when the capture session is finished. Figure 16 shows how the FIFO usage level behaves in the capture modes.
In the insertion case, the situation is reversed: the AXI4-Stream interface reads from the FIFO, so the buffer is theoretically prone to underflow. For this reason, the FIFO is always kept at an almost-full level. It is worth mentioning that the FIFO size is 4096 words and the almost-full level is set at 3840; whenever the level drops below this threshold, new data is written into the FIFO from the external memory. Figure 17 shows how the FIFO is kept almost full at all times. When the operation is done, the IP gradually empties the FIFO and sends the data out.
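The two watermark policies can be summarized in a small model. This is an illustrative sketch rather than the RTL; the thresholds are those stated in the text, and the exact comparison operators (≥ versus >) are an assumption of the sketch:

```python
FIFO_DEPTH = 4096    # 128-bit words
BURST_LEN = 32       # AXI4 master burst size
ALMOST_FULL = 3840   # insertion refill watermark

def capture_should_drain(level: int) -> bool:
    """Capture: start a new 32-beat AXI4 write burst as soon as a full
    burst's worth of data is buffered, keeping the FIFO near empty."""
    return level >= BURST_LEN

def insertion_needs_refill(level: int) -> bool:
    """Insertion: fetch more data from external memory whenever the
    FIFO drops below the almost-full watermark, keeping it near full
    for the stream side, which cannot be stalled."""
    return level < ALMOST_FULL

assert capture_should_drain(32) and not capture_should_drain(31)
assert insertion_needs_refill(3839) and not insertion_needs_refill(3840)
assert ALMOST_FULL / FIFO_DEPTH == 0.9375   # watermark at ~94% of depth
```

The policies mirror each other: capture biases the FIFO toward empty so incoming stream data always fits, while insertion biases it toward full so outgoing stream data is always available.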
Figure 16. FIFO usage level while capturing
Figure 17. FIFO usage level in insertion
5.3 FPGA resource usage
One of the benefits of the Snapshot IP is its minimal footprint compared to the level of features it provides. There are three instances of the Snapshot IP inside the top-level project. The resource usage is reported as the number of Adaptive Logic Modules (ALMs) and logic registers for the logic area, and as the number of M20K memory blocks and memory bits for the memory [6].
Depending on the connections of the IP at the top level, the number of ALMs used can vary. It also depends on whether the insertion is instantiated in the design. The number of ALMs needed for each instance varies between 2500 and 3500, divided mainly into three sections: capture, insertion and the register interface. The capture uses around 1000 ALMs, while insertion and the register interface each take around 600 ALMs per instance. The rest is used by the multiplexers and demultiplexers that control the inputs and outputs. In total, all three instances use around 10k ALMs. In terms of dedicated logic registers, each instance uses between 5700 and 8700.
As for the memories, the FIFOs used in the design are implemented with M20K memory blocks. A full design that includes insertion uses a total of 52 M20K blocks (26 blocks per FIFO). Each M20K block, as the name suggests, has 20480 bits of memory, which sums up to 1,064,960 bits available in 52 blocks. The design has two FIFOs, each consisting of 4096 words of 128 bits, so the total memory needed is 1,048,576 bits. This results in a ~98.5% utilization rate for each instance. In the case of having only the capture functionality, these numbers are halved and the utilization rate stays the same.
To put all these numbers in context, an Intel Stratix 10 SX FPGA has almost 933,000 ALMs and 11,000 M20K blocks. All three instances of the Snapshot IP (assuming the full design, including insertion) take ~1% of the FPGA logic area and ~1.5% of the M20K memory blocks [10].
As a result, despite packing many features, the Snapshot IP has a relatively small footprint in the project.
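The utilization figures above can be verified with simple arithmetic; every number below comes from the text or from the Stratix 10 GX/SX product table [10]:

```python
# Memory utilization of one full Snapshot IP instance (capture + insertion).
M20K_BITS = 20480
blocks_used = 52                           # 26 per FIFO
available_bits = blocks_used * M20K_BITS   # 1,064,960 bits
needed_bits = 2 * 4096 * 128               # two FIFOs of 4096 x 128-bit words
fifo_util = needed_bits / available_bits   # ~0.985 -> ~98.5%

# Share of the whole device for all three instances.
DEVICE_ALMS, DEVICE_M20K = 933_000, 11_000   # approx. Stratix 10 SX totals [10]
alm_share = 10_000 / DEVICE_ALMS             # ~1% of logic
m20k_share = 3 * blocks_used / DEVICE_M20K   # ~1.4% of memory blocks

print(f"FIFO bits: {needed_bits}/{available_bits} = {fifo_util:.1%}")
print(f"logic: {alm_share:.1%}, M20K: {m20k_share:.1%}")
```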
5.4 Design bugs
Although the final IP design has no known bugs, a few recurring bugs showed up on multiple occasions during development. AXI4 signal timing mismatches and controlling the FIFO level were the hardest to fix.
On a few occasions there was a bug within the AXI4 transactions. The majority of the issues were in write transactions, and almost all of them caused the transactions to jam. There were different reasons for these jams: some bugs were caused by not having enough transfers within a single burst, another by a mismatch between the data and valid signals in a write transaction. All of these cases resulted in the interconnect not responding, after which the system had to be rebooted. These bugs were eventually fixed, and the AXI4 transactions are now functional.
Another recurring bug was that the FIFO underflowed or overflowed when this could have been avoided. It was fixed by defining custom almost_full and almost_empty signals that control when to stop reading from or writing to the FIFO.
5.5 Well-balanced design criteria
As mentioned in Chapter 1, four criteria were set to compare the implemented method against previous research: visibility, flexibility, logic overhead and debugging cycle duration. Each is analyzed below.
• Visibility: Although the visibility is not on par with solutions such as SignalTap, the IP provides a decent level of detail. First, there is a series of status registers that are useful when anything abnormal happens on the line. Second, it provides direct access to the data between two IPs, which cannot be achieved with external measuring devices.
• Flexibility: The IP has a rather rich list of features that enable a variety of test scenarios. Simultaneous insertion and capture, a configurable address range and multiple modes of operation are some of the more useful ones.
• Logic overhead: As explained in the previous section, the logic overhead, or resource usage, is relatively low. Each instance of the Snapshot IP takes 0.3-0.4% of the FPGA logic resources and 0.5% of the FPGA memory block resources, so it is safe to say that the logic overhead is small.
• Debugging cycle: The Snapshot IP is designed to work in final products, so it should be constantly accessible when needed. To reset the Snapshot IP for a new round of validation or debugging, all that is needed is to reset a few control registers, after which the IP is ready for another round. This resets the design and the status registers to an idle state, while all other configuration from the previous round stays the same. It should be noted that this applies only from the Snapshot IP's point of view; other IPs in the system might need a more complex resetting procedure, which is not in the scope of this thesis.
Overall, the Snapshot IP met the criteria that were set. The resource usage is at an acceptable level, and the IP provides the level of visibility it was designed for. It is developed to function constantly and can quickly be prepared for a new round of insertion or capture. The IP also meets the overall performance level needed for the project, and no known bugs remain in it.
6. CONCLUSION
In this thesis, we proposed and implemented a custom method, in the form of an IP called the Snapshot IP, to help designers and testers debug and validate a system more efficiently, with balanced properties in terms of visibility, flexibility, resource usage and debugging cycle time. These properties were compared to earlier academic research. The IP targets Intel Stratix 10 SX FPGAs and was originally designed to be implemented as part of a larger project.
We first explored different methods proposed in academia and compared their benefits and deficiencies. We then defined a set of criteria: debugging visibility, physical resource usage, flexibility (that is, the availability of features) and the amount of effort needed between debugging rounds. Finally, we implemented an IP that strikes a reasonable balance among all of these criteria.
Authors such as Karpagam and Viswanathan, Poulos et al. and Graham et al. have all done work based on direct design modifications that avoid re-compilation. These modifications are made at the place-and-route or later steps of the compilation process. This type of approach reduces the debugging cycle time, as no re-compilation is needed, and it also offers minimal logic overhead. Nonetheless, these methods provide no benefit in terms of visibility into the system internals.
A different approach is taken by authors such as Kumar et al., Ko et al. and Yang and Touba, who have proposed methods that utilize trace buffers to increase visibility into the system. Kumar et al. and Yang and Touba intelligently acquire only the data that is potentially erroneous. Ko et al., however, took a different path to increasing visibility and flexibility: they used multiple triggers and multiple buffers, with the option to fragment the data between the buffers to maximize buffer usage.
The proposed method has a rich list of features that can be used to capture ongoing data to a memory for later comparison against golden data, or to insert a vector of data into an adjacent IP under test. It can also be used for self-validation: the inserted data can be fed back into the design and captured by the IP, and the original inserted data can then be compared against the captured data to ensure the validity of the IP.
The Snapshot IP also provides reasonable visibility into the system, as it can specifically monitor the datapath between two adjacent IPs. The captured data is saved into an external memory, so it can be dumped for further analysis. The status registers of the IP can also alert users to any malfunction during capturing or insertion. Moreover, considering the list of available features, the resource usage is fairly low: each instance of the IP uses around 3-4K ALMs of the FPGA and 52 M20K memory blocks, with a utilization rate of 98.5%. That translates to 0.3-0.4% of the logic resources and 0.5% of the memory resources available on the selected FPGA. Regarding the flexibility of the design, all features are available to the tester through a software-controlled register interface that can be modified according to the user's preference without a need to recompile the design. The design is also easy to reset and can be made ready for a new round of debugging by modifying a few control registers.
The capture running modes are divided into single and continuous modes: single modes stop when they reach the end of the defined address range, while continuous modes continue until some other condition, such as a manual trigger or a certain BCN value, becomes true. The capture functionality is designed to cope with the amount of incoming traffic and to handle it efficiently.
Insertion, on the other hand, has fewer modes to operate in, due to the nature of the insertion functionality itself. It always aligns with the defined BCN value to guarantee the correct timing of the outgoing data. Insertion can send out packets in a series of pre-defined formats that the user selects at the beginning of each round.
Regarding the limitations of the design, it should be noted that this method was developed as a complementary debugging and validation tool. It can be used either as the first level of validation before starting system-level validation test cases, or as a secondary tool when a malfunction in the system has been found and the user seeks a more precise location of the issue.
Another limitation worth a remark is the limited ability of insertion to produce an arbitrary format on the output data. This is a result of the output complexity it would cause: the AXI4-Stream packets are formed locally within the IP, and the number of possible formats is too large to be coded. Furthermore, this IP was done in the scope of a bigger project, and although the feature could be useful for another project, it was not a necessary feature for the current one.
The last limitation discussed here is that although the Snapshot IP is developed to maximize the AXI4 capabilities, performance fluctuations are still possible while transferring data to or from an external memory. If multiple concurrent high-performance operations target one memory, the performance of the Snapshot IP might decrease. This is partially mitigated by supporting two simultaneous memory accesses for the Snapshot IP.
To conclude, the Snapshot IP has met the defined criteria: it is a debugging and validation method that is flexible, efficient in resource usage and offers more visibility than typical on-chip solutions, while keeping the debugging process easy. The IP is designed to function in a high-performance environment. In spite of having some limitations, the IP offers many features that can be utilized to locate bugs and to help testers validate the design more thoroughly.
7. REFERENCES
[1] Accellera Systems Initiative Inc. IP-XACT User Guide [Internet]. Mar 2018. Available: https://www.accellera.org/images/downloads/standards/ip-xact/IP-XACT_User_Guide_2018-02-16.pdf
[2] ARM. AMBA AXI and ACE Protocol Specification: AXI3, AXI4, and AXI4-Lite, ACE and ACE-Lite, Revision E [Internet]. 2013 Feb 22. Available: https://developer.arm.com/documentation/ihi0022/e/
[3] ARM. AMBA 4 AXI4-Stream Protocol Specification [Internet]. 2010 Mar 03. Available: https://developer.arm.com/documentation/ihi0051/a
[4] Arto Perttula. Lecture 13: Clock and Synchronization, course material, TIE-50206 Logic Synthesis, Tampere University of Technology, Feb 2018. Available: http://www.tkt.cs.tut.fi/kurssit/50200/S17/Kalvot/Lecture%2013%20-%20Clock%20and%20Synchronization.pdf
[5] Graham, Paul, Brent Nelson, and Brad Hutchings. "Instrumenting bitstreams for debugging FPGA circuits." The 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'01). IEEE, 2001.
[6] Intel Corporation. Intel Stratix 10 Logic Array Blocks and Adaptive Logic Modules User Guide [Internet]. 2020 Apr 24. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/ug-s10-lab.pdf
[7] Intel Corporation. FIFO Intel FPGA IP User Guide [Internet]. 2020 Dec 14. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_fifo.pdf
[8] Intel Corporation. Intel Hyperflex Architecture High-Performance Design Handbook [Internet]. 2020 Jul 13. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/s10_hp_hb.pdf
[9] Intel Corporation. Intel Quartus Prime Pro Edition User Guide: Debug Tools [Internet]. 2020 Sep 28. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-qpp-debug.pdf
[10] Intel Corporation. Intel Stratix 10 GX/SX Product Table [Internet]. [date unknown]. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/pt/stratix-10-product-table.pdf
[11] Intel Corporation. Intel Stratix 10 Hard Processor System Technical Reference Manual [Internet]. 2021 Feb 23. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/s10_5v4.pdf
[12] Karpagam, R. S., and B. Viswanathan. "Design, Test and Evaluation of Trace-Buffer Inserted FPGA System." Artificial Intelligence and Evolutionary Computations in Engineering Systems. Springer, New Delhi, 2016. 1039-1048.
[13] Ko, Ho Fai, Adam B. Kinsman, and Nicola Nicolici. "Design-for-debug architecture for distributed embedded logic analysis." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19.8 (2010): 1380-1393.
[14] Kumar, Binod, et al. "A Methodology to Capture Fine-Grained Internal Visibility During Multisession Silicon Debug." IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2020).
[15] Poulos, Zissis, et al. "Leveraging reconfigurability to raise productivity in FPGA functional debug." 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2012.
[16] Xilinx Inc. FPGA Bitstream [Internet]. 2019. Available: https://www.xilinx.com/html_docs/xilinx2019_1/SDK_Doc/SDK_concepts/concept_fpgabitstream.html
[17] Yang, Joon-Sung, and Nur A. Touba. "Improved trace buffer observation via selective data capture using 2-D compaction for post-silicon debug." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21.2 (2012): 320-328.