Mohamad Ibrahim Shariat Nasseri
CONFIGURABLE RUNTIME ON-CHIP FPGA LOGIC DEBUGGING AND ITS CHALLENGES
Master’s Thesis
Examiner 1: Timo Hämäläinen
Examiner 2: Arto Oinonen
Faculty of Information Technology and Communication Sciences
April 2021
ABSTRACT
Mohamad Ibrahim Shariat Nasseri: Configurable runtime on-chip FPGA logic debugging and
its challenges
Master’s Thesis
Tampere University
Information Technology
April 2021
There are many solutions to debug an FPGA design. Some are industry standards, and some are custom-made for specific cases and projects. In the context of on-chip debugging, there are solutions such as using external measurement tools, direct instrumentation (or modification) of the bitfiles, commercial embedded logic analysers or custom embedded logic analysers using trace buffers. Although these approaches utilize the speed of on-chip testing, they either lack visibility or take up too many FPGA resources to be implemented. To simplify comparison between different methods, four design criteria are defined: visibility, flexibility, logic overhead and debugging cycle.
In this thesis, an alternative approach for on-chip debugging is provided that utilizes the speed of running on the chip while using a minimal amount of FPGA resources. The approach takes the form of an intellectual property (IP) block called the Snapshot IP. It has two functionalities, capture and insertion. The capture functionality can sniff the ongoing data between two IPs and offload it to a memory space for an offline check, while insertion can inject data into an IP for validation purposes. Both capture and insertion offer multiple modes of operation to choose from depending on the testing scenario. The Snapshot IP provides a set of features to enhance the flexibility of the debugging process, such as timing-based or pattern-based capture, support for multiple data formats and simultaneous access to two memories. The IP also has multiple clock inputs for cases where the stream and memory sides operate on different clock frequencies. The user has control over starting and stopping the operations and can define the memory area to be used.
Moreover, the implementation chapter discusses the design decisions, challenges, and limitations of such a design as a high-performance debugging solution. The challenges are categorized into three sections: design, design flow and validation challenges. The limitations are the inability to insert arbitrary data, performance sensitivity to external factors and the fact that the design is only a complementary solution to other debugging and validation methods. The FPGA in use is the Intel Stratix 10 SX.
In terms of results, the final design has met its targets. When implemented, each instance of the Snapshot IP takes roughly 0.3–0.4% of the logic on the FPGA and 0.5% of the available M20K memory blocks, showing a low level of logic overhead. The IP can handle ~16 Gbps of data on its AXI4 interface toward memory. Throughout the design process, bugs such as AXI4 bus jamming were found and fixed in the final design. Preparing for a new round of debugging is simple, visibility is at an acceptable level, and the feature set provides decent flexibility.
Keywords: on-chip debugging, embedded logic analysers, ELA, instrumenting, design challenges, trace buffer
The originality of this thesis has been checked using the Turnitin OriginalityCheck service.
PREFACE
This thesis is written to conclude my master’s studies in Information Technology. It has been done as part of a project at Nokia Networks and Solutions in Tampere.
I would like to thank my supervisors and examiners, Prof. Timo Hämäläinen and Arto Oinonen, who helped and supported me not only in this thesis but throughout my studies at Tampere University.
I would also like to express my appreciation to Petri Kukkala, my manager at Nokia, for motivating me and trusting in me, and to Tomi Mansikkala, who helped and guided me in this project.
I must also thank my dear friend Mina Shahmoradi and my family for believing in me and making the path to success easier; I am grateful to them.
Tampere, 26 April 2021
Mohamad Ibrahim Shariat Nasseri
CONTENTS
1. INTRODUCTION .................................................................................................. 1
2. RELATED WORKS ............................................................................................... 4
2.1 Bitfile instrumenting .............................................................................. 4
2.2 Design for Debug and Trace Buffers .................................................... 6
3. REQUIREMENTS AND METHODS .................................................................... 11
3.1 Project Background ............................................................................ 11
3.2 Tools .................................................................................................. 12
3.3 IP interfaces ....................................................................................... 12
3.4 IP Features ........................................................................................ 13
3.4.1 Data Format ................................................................................ 13
3.4.2 Multiple clock domains ................................................................ 14
3.4.3 Modes of operation ..................................................................... 14
3.4.4 Support for two memories ........................................................... 15
3.4.5 Optional insertion instantiation .................................................... 16
3.5 Capture .............................................................................................. 16
3.6 Insertion ............................................................................................. 18
3.7 Example Use Cases ........................................................................... 21
3.7.1 Use case 1: Capture ................................................................... 21
3.7.2 Use case 2: Insertion .................................................................. 21
3.7.3 Use case 3: Self-validation ......................................................... 22
4. IMPLEMENTATION ............................................................................................ 24
4.1 Design Decisions ............................................................................... 24
4.2 Design Challenges ............................................................................. 25
4.2.1 Variable number of clocks ........................................................... 25
4.2.2 Integration points within the system ............................................ 25
4.2.3 Different input formats ................................................................. 26
4.2.4 AXI4 Master design compatibility with the slave .......................... 26
4.2.5 Performance optimizations .......................................................... 27
4.2.6 Timing optimizations ................................................................... 27
4.2.7 FPGA resource usage ................................................................ 28
4.2.8 Handling urgent feature requests ................................................ 28
4.3 Design Flow Challenges .................................................................... 29
4.3.1 IP-XACT packaging .................................................................... 29
4.3.2 SpyGlass linting/CDC checks ...................................................... 29
4.3.3 Register bank generation ............................................................ 30
4.4 Validation and verification challenges ................................................ 30
4.5 Limitations .......................................................................................... 31
4.5.1 A complementary solution ........................................................... 31
4.5.2 Ability to insert any arbitrary format of data ................................. 32
4.5.3 Performance sensitivity ............................................................... 32
5. RESULTS ........................................................................................................... 33
5.1 Overall performance ........................................................................... 33
5.2 FIFO Usage levels ............................................................................. 36
5.3 FPGA resource usage ........................................................................ 39
5.4 Design bugs ....................................................................................... 39
5.5 Well-balanced design criteria ............................................................. 40
6. CONCLUSION .................................................................................................... 42
7. REFERENCES ................................................................................................... 45
LIST OF FIGURES
Figure 1. Simple Logic Analyzer proposed by Graham et al. [5] ............................ 5
Figure 2. Architecture proposed by Kumar et al., showing the improved two-step model [14] ....................................................................................... 8
Figure 3. Architecture proposed by Ko et al. [13] .................................................. 9
Figure 4. Snapshot IP top-level architecture ........................................................ 13
Figure 5. Intel Stratix 10 HPS Block Diagram. Difference between F2H (top centre) and F2SDRAM (centre) paths to SDRAM [10]. ......................... 16
Figure 6. Capture single mode with BCN start .................................................... 17
Figure 7. Capture user operations procedure ...................................................... 19
Figure 8. Insertion user operations procedure ..................................................... 20
Figure 9. Visualisation of use case 1 ................................................................... 21
Figure 10. Visualisation of use case 2 ................................................................... 22
Figure 11. Use case 3: Self-validation ................................................................... 23
Figure 12. An example of a SignalTap state-based trigger condition ..................... 31
Figure 13. AXI4 write transactions in capture ........................................................ 34
Figure 14. Continuous insertion output to a stream ............................................... 35
Figure 15. Relation between AXI4 transactions and FIFO level in capture ............ 36
Figure 16. FIFO usage level while capturing ......................................................... 37
Figure 17. FIFO usage level in insertion ................................................................ 38
LIST OF SYMBOLS AND ABBREVIATIONS
AXI4          Advanced eXtensible Interface revision 4
AXI4-Lite     Advanced eXtensible Interface revision 4, light version
AXI4-Stream   Advanced eXtensible Interface revision 4 for data bus
BCN           BTS Clock Number
BTS           Base Transceiver Station
CCU           Cache Coherency Unit
CDC           Clock Domain Crossing
CR            Cycling Registers
DDR4          Double Data Rate 4
DSP           Digital Signal Processing
ELA           Embedded Logic Analyser
F2H           FPGA-to-HPS
F2SDRAM       FPGA-to-SDRAM
FDD           Frequency Division Duplex
FIFO          First In, First Out, a type of memory
FPGA          Field-Programmable Gate Array
Gbps          Gigabit per second
HPS           Hard Processor System
IP            Intellectual Property
IP-XACT       An XML format to describe a design
MISR          Multiple Input Signature Register
MUX           Multiplexer
RAM           Random Access Memory
SoC           System-on-Chip
Tcl           Tool Command Language
TDD           Time Division Duplex
UVM           Universal Verification Methodology
1. INTRODUCTION
5G, as the new standard for mobile communications, promises higher-than-ever internet access speeds. Each cell in a base station can serve more users simultaneously than in previous generations. From the base station’s point of view, this translates to processing more data locally. As a result, the hardware used in the base station should be capable of meeting such processing needs. In addition, with more data comes more design complexity, which in turn makes it more difficult to fully verify and validate the hardware design, so it is ever more important to be able to find bugs and fix them in due time.
There are many solutions to tackle this matter, many are industry standards. Some of
these verification methods focus on the code itself, before implementing it on a chip.
Methods such as the Universal Verification Methodology (UVM), formal verification and
static linting checks are some of these solutions.
Another category of debugging solutions is designed to work on the chip level, that is, to validate the correctness of the functionality of a programmed chip through different use cases. The most common tools here are external measuring devices that can generate custom test vectors.
Each of the above-mentioned methods has its own pros and cons. Methods such as UVM and formal verification offer unparalleled visibility into the design, but they are slow to run. It can take days to fully run the test cases for a system using UVM. On the other hand, on-chip debugging methods are fast to run, but they generally provide poor visibility into system signal activities.
Over the years, there have been studies on how to improve on-chip debugging by means of more visibility, more flexibility, less logic overhead or a shorter debugging cycle while utilizing the speed of running on an actual chip.
Visibility means how much data one can gather in a debugging session. It can range from only seeing wrong output in board-level tests to being able to monitor every signal in the design using a simulation tool. Flexibility refers to the features available in the solution and how easy they are to use. Logic overhead refers to the amount of logic area and the number of memory blocks needed to implement a solution. Debugging cycle is the amount of time needed to reconfigure and start a new session; there might be a need to recompile the design in between these sessions.
Karpagam and Viswanathan, Poulos et al. and Graham et al. have all proposed methods based on modifications to the design without a need for re-compilation. These modifications are either done after the place-and-route step or directly inserted into the bitfile before programming the Field-Programmable Gate Array (FPGA). A bitfile or bitstream is a file that is used to program an FPGA and contains information on how to place and route the design within the FPGA [16]. This type of approach is ideal for reducing debugging cycle time as it eliminates the time spent on re-compilation. It also has minimal logic overhead. Nonetheless, these methods do not contribute to the level of visibility of the system internals.
Taking a different approach to the issue at hand, authors such as Yang and Touba, Kumar et al. and Ko et al. have used trace buffers in order to increase the visibility into the system. Kumar et al. and Yang and Touba have taken a similar approach of trimming the acquired data to the potentially erroneous parts. Ko et al. have taken a different path to increase visibility and flexibility: they used multiple triggers and multiple buffers with the option to fragment the data to maximize buffer usage.
Commercial embedded logic analysers (ELA) are a good example of having as much visibility of the signal activities as possible. However, in the case of FPGAs, the resource-usage overhead that a commercial ELA creates is sometimes not affordable. It causes an unwanted increase in logic and memory utilization and may prevent the project from reaching its timing targets.
The question here is whether there can be a balanced solution with an acceptable level of compromise between criteria like visibility, flexibility, logic overhead and debugging cycle. This thesis tries to accomplish this objective by suggesting a method that has a limited effect on area utilization (logic overhead) and can remain in the final product, while packing several useful features (flexibility) and being usable at any time, either in lab testing, in simulation or on site (visibility and debugging cycle), as a tool to help validate and debug the design. The solution is implemented in the form of an intellectual property (IP) block that can be integrated into a compatible system. The IP is called the Snapshot IP. It should be noted that this IP is designed to be a complementary solution and should be used along with other debugging techniques. The IP cannot find the exact location of a bug, but rather helps to find its rough location within the design.
This thesis is structured as follows. Chapter 2 takes a look at different approaches in the academic literature and discusses them in detail. Chapter 3 provides a short background on the project and the IP itself, followed by the IP’s overall specifications, interfaces, and main features, along with a couple of short example use cases to familiarize the reader with the IP. Chapter 4 examines the decisions and challenges we faced while designing the Snapshot IP, both from the design point of view and from the set of tools we used, and also discusses the limitations of the work. Chapter 5 presents the results of the work done, and Chapter 6 concludes this thesis.
2. RELATED WORKS
When talking about a system-on-chip (SoC) design, validation and verification, in their broadest context, can be divided into two categories: simulations and on-chip approaches.
Simulations can have different levels of complexity, from a simple VHDL test bench to a full-fledged UVM environment. For a large design containing multiple IPs and subsystems, simulating all possible cases is almost impossible, because running a test case that simulates a few milliseconds can take hours or days to complete. This makes running a full regression a time-consuming task and brings the need for on-chip validation approaches, that is, running some test cases on an actual chip. This helps verification teams speed up the validation and verification processes.
However, when it comes to finding a bug, typical on-chip validation approaches are not of much use due to their significant lack of visibility within the design. Simulations, on the other hand, bring maximum visibility into the design and can be useful for debugging purposes.
In general, when maximum visibility into the system is needed, simulation is the first choice. Simulation tools can log all the signals within the system and have all the waveforms at hand when needed. However, on-chip debugging has not been explored as extensively. So, in this chapter we discuss some of the methods used to increase on-chip debugging capabilities and compare their advantages and disadvantages.
2.1 Bitfile instrumenting
Bitfile instrumenting, or direct modification of the bitfile for the purpose of debugging or validation, is a method of on-chip debugging and has been discussed in the academic literature since at least the beginning of the new century [5]. Modifying or instrumenting the bitfile is a way for an IP or system user to directly manipulate the bitfile in order to monitor new nets or signals without having to re-compile the design. In general, this approach eliminates the need to re-compile the whole design whenever a new signal needs to be traced [5]. The following sections briefly discuss how different authors approached the instrumenting of a bitfile.
To begin with, Poulos et al. proposed pre-selecting all the potential signals that need to be traced in case a bug is encountered [15]. The signals are then connected to a newly proposed multiplexer (MUX) structure with the selectors and output backed by an SRAM memory. At run time, these signals can be selected by modifying the memory contents to choose a certain set of signals to be traced. The benefit of this method is that it is vendor independent and minimizes the performance overhead due to the minimalistic nature of the approach. This leads to faster debugging rounds and eventually to less debugging effort. However, the debugging functionality this technique provides is limited and is designed to work without trigger events, without which it would be a near-impossible task to capture the traced signal at the right time.
On the other hand, Graham et al. take a relatively more complex approach and implement a simple custom ELA [5]. This custom design enables the bitfile modification features which are not available when using conventional ELAs. The high-level block diagram of the mentioned ELA is shown in Figure 1. Having a modifiable logic analyser consequently enhances the flexibility of the instrumenting, so the triggers and traced signals can be connected to any desired nets just before the physical implementation. The results presented by the authors show a significant reduction in time between debugging rounds. However, this reduction is relative to the size of the original design: the bigger the design, the smaller the benefit. The main disadvantage here is that some of the tools are vendor-specific and have been obsolete since the early 2010s.
Figure 1. Simple Logic Analyzer proposed by Graham et al. [5]
Similarly, Karpagam and Viswanathan implemented a method which manipulates the mapped database to trace the nets [12]. However, instead of using a custom ELA, it dumps the traced data directly into a RAM location. This is done by routing the signals through the FPGA using left-over resources that have not been used by the original design while mapping the design in the placement-and-routing stage. To do so, it uses a technology-mapped database of the design and manually inserts the needed triggers and trace buffers into the database; then it regenerates the bitfile. The advantage of this approach, unlike commercial ELAs, is that the debugging functionality does not affect the timing criteria of the original system, as it only uses unused parts of the chip after the original design is fully mapped. This method, again, is designed around a specific FPGA vendor and cannot be used on other FPGAs.
Overall, the main motivation for these methods is the amount of time saved by avoiding design re-compilation compared to conventional logic analysers. The proposed methods also have a small footprint on FPGA resources and a relatively simple architecture, thus having minimal or no effect on the original timing characteristics of the design, which in return simplifies the design flow. In some cases, they only use the left-over resources that are not utilized by the original design. However, the complexity for the end users of these methods is higher than what commercial ELAs normally offer. For this reason, they are only practical in some specific cases where the debugging requirements are simple enough to put these methods into use.
2.2 Design for Debug and Trace Buffers
Design for debug is a methodology of planning for debugging from the start, meaning designers should consider the testability aspects when designing the system. This can include extra logic to make debugging more straightforward. One of the ways to do so is to use trace buffers, with or without an ELA. Trace buffer refers to a technique of acquiring data and dumping it to a memory location while the chip is running at its full speed. An ELA can be utilized in tandem to give a visual interpretation of the acquired data.
Techniques incorporating trace buffers offer users real-time observability of the chip functionality in debugging and validation sessions [13]. These techniques can generally sample data of pre-defined length upon reaching one or more trigger conditions and store it into the buffer or memory area. This data can then be either accessed on-chip or offloaded from the chip for further analysis.
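The basic trigger-and-capture principle described above can be sketched in a few lines. This is a simplified Python model, not RTL; the trigger condition, buffer depth and data are hypothetical. Capture arms, waits for the trigger, then fills a fixed-depth buffer that is later offloaded for analysis.

```python
def trace_capture(samples, trigger, depth):
    """Model of a basic trace buffer: start storing samples when
    `trigger` first evaluates true, stop when the buffer is full."""
    buf = []
    capturing = False
    for s in samples:
        if not capturing and trigger(s):
            capturing = True           # trigger condition reached
        if capturing:
            buf.append(s)              # store at full chip speed
            if len(buf) == depth:
                break                  # buffer full: offload later
    return buf

data = [0, 0, 5, 7, 9, 11, 13]
window = trace_capture(data, trigger=lambda s: s > 6, depth=3)
print(window)  # [7, 9, 11]
```

The fixed `depth` models the central limitation discussed next: everything outside the captured window is lost, so observability is bounded by the buffer size.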
The main hindrance in using trace buffers is the limited amount of memory for storing the acquired data, which makes debugging observability lower than that of software simulation. However, in many cases, the speed trace buffers provide compensates for the low observability. Different researchers have taken different paths to find a methodology to increase the observability. The following sections discuss how each of these researchers approached the issue at hand and explore the available options for effectively making use of the limited storage to achieve better observability, resulting in more efficient debugging sessions.
In order to have an efficient debugging session, the data captured into the trace buffer must give debuggers useful insights to find and locate errors easily. It is possible that the captured samples are an error-free section of data which fills the buffer before an erroneous section is reached. Capturing unnecessary data can delay the debugging process, and lengthening the process can indirectly hide real bugs that otherwise could have been found.
Yang and Touba proposed a three-step process to identify the rate at which errors potentially occur and optimize the captured data based upon that [17]. The main idea behind this method is to have a selective capturing process: rather than sampling the complete sequence of clock cycles, it only samples the subset of the real processed data on the chip that is suspected to be erroneous. This scheme adds an extra module to help utilize the buffer size over three separate debugging sessions. In the first step, the error rate is “estimated using lossy compression with a parity generator in the first debug session”. This error rate, along with the buffer size, defines how large the window size can be, that is, how far apart the first and last captured samples are in terms of clock cycles. The bigger the window size, the better the observability of debugging. The second step compacts the data using two different algorithms, cycling registers (CR) and multiple input signature registers (MISR). It then compares the generated signatures to find a set of suspected clock cycles in which errors may happen. Erroneous data produces erroneous signatures, and by cross-checking them one can find where the errors may be located. In the last step, it captures only the data from those suspected cycles into the buffer, and analysis begins when the acquisition is done.
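The signature-based localization step can be illustrated with a simplified model. The sketch below implements a minimal MISR compactor and flags the intervals whose signatures differ. Note the simplifications: the actual scheme cross-checks CR and MISR signatures on-chip, whereas this sketch compares against a golden reference run, and the register width and feedback taps are arbitrary choices for the example.

```python
def misr_signature(samples, width=8, taps=0b10111000, seed=0):
    """Minimal multiple-input signature register (MISR) model: an
    LFSR that folds one data word in per clock cycle."""
    sig = seed
    mask = (1 << width) - 1
    for s in samples:
        feedback = bin(sig & taps).count("1") & 1  # XOR of tap bits
        sig = ((sig << 1) | feedback) & mask       # LFSR shift
        sig ^= s & mask                            # fold in the sample
    return sig

def suspect_intervals(golden, observed, interval):
    """Flag the intervals of clock cycles whose compacted signatures
    disagree -- these are the only cycles worth capturing later."""
    flags = []
    for i in range(0, len(golden), interval):
        g = misr_signature(golden[i:i + interval])
        o = misr_signature(observed[i:i + interval])
        flags.append(g != o)
    return flags

golden   = [1, 2, 3, 4, 5, 6, 7, 8]
observed = [1, 2, 3, 9, 5, 6, 7, 8]   # one corrupted sample
print(suspect_intervals(golden, observed, interval=4))  # [True, False]
```

The payoff is clear from the output: only the first interval needs to occupy trace buffer space in the final capture session.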
Following the same principle of identifying suspect cycles, Kumar et al. proposed an optimized scheme which includes two steps instead of three, removing the first step of identifying the window size [14]. The main difference compared to the previous method is that instead of one CR, this technique uses two CRs (CR1 and CR2) with different parameters at the same time. This brings in less overhead data, meaning data that is not erroneous, than the method discussed in [17]. This method analyses the data from the first session off-chip and creates a list of markers called tag bits. The tag bits are loaded into the chip to guide the acquisition process in the second session to only capture what was marked as error data in the first step. The architecture is shown in Figure 2.
Figure 2. Architecture proposed by Kumar et al., showing the improved two-step model [14]
While [14] and [17] tried to improve buffer usage by selectively capturing data to achieve maximum observability, Ko et al. proposed a solution to achieve the same using efficient control of multiple trigger events and multiple buffers so that the total captured debug data in all buffers is maximized [13]. This idea revolves around a system that has multiple trigger events and trace buffers to store the sampled data. The proposed architecture uses a so-called allocation unit to handle simultaneous or overlapping trigger requests by combining automatic and user-defined sets of priorities. This unit also decides which requests to ignore if there is no adequate buffer space left. It also supports segmenting the data while sampling, so each segment can be stored in a different buffer independently and without the risk of losing data. The order of the segments is defined using a control data field which is inserted at the end of each segment in the buffer. This results in a higher average throughput, better utilization of the buffers, and a balanced load on all buffers in use. Whenever a buffer is idle, the allocation unit can offload it through the trace port and keep it ready for upcoming trigger requests. Figure 3 shows the high-level architecture of the proposed scheme.
Figure 3. Architecture proposed by Ko et al. [13]
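The segmenting-and-reassembly idea behind Ko et al.’s scheme can be sketched as follows. This is a simplified Python model of one possible allocation policy (pick the emptiest buffer); the actual unit also arbitrates trigger priorities and buffer exhaustion, which is omitted here, and the control field is modelled as a simple sequence number.

```python
def store_segmented(samples, buffers, seg_len):
    """Split one capture into segments spread across several trace
    buffers; a control field (sequence number) appended to each
    segment records its order for off-chip reassembly."""
    seq = 0
    for start in range(0, len(samples), seg_len):
        segment = samples[start:start + seg_len]
        target = min(buffers, key=len)   # balance load: emptiest buffer
        target.append((segment, seq))    # (data, control field)
        seq += 1

def reassemble(buffers):
    """Off-chip step: sort fragments by control field and concatenate."""
    segments = [entry for buf in buffers for entry in buf]
    segments.sort(key=lambda e: e[1])
    return [s for seg, _ in segments for s in seg]

buffers = [[], []]
store_segmented(list(range(10)), buffers, seg_len=4)
print(reassemble(buffers))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because the control field fixes the ordering, a segment can land in whichever buffer has room, which is what lets the scheme keep all buffers busy at once.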
The main disadvantage of the methods proposed by [14] and [17] is that since the data becomes fragmented, validating the content with regard to its timing becomes complicated. Having a sequence of erroneous data without timestamps, in a system where the timing of each packet of data is important, makes the debugging process cumbersome. The method proposed by [13] has no such issue; however, its architecture is significantly more complex and requires more FPGA resources to implement.
All in all, what makes trace buffers useful is the fact that they exploit the speed of the chip in the debugging process, but due to limited storage, they can only offer a low level of observability, specifically compared to software simulations. As a result, there are many works in the literature aiming at different solutions to increase it. One such method was selective sampling, acquiring only the necessary cycles of data to minimize buffer usage [14][17]. Another solution was dynamic management of multiple trigger requests and multiple trace buffers to achieve maximal usage of all buffers [13]. These studies show that, despite some disadvantages, utilizing trace buffers can lead to efficient debugging sessions where bugs and errors can be found with more ease than with instrumenting methods.
Overall, the works discussed here give some insight into how on-chip debugging and validation can be improved. Specifically, taking trace buffers into use can diversify the capability of a debugging solution. The approach proposed in this thesis uses trace buffers and triggers as a basis for implementing a new solution.
3. REQUIREMENTS AND METHODS
This chapter discusses the high-level specification of the Snapshot IP, its main features and the tools used to implement the IP. It also includes a more detailed description of the main functionalities of the Snapshot IP. In the end, a few use cases where the IP can be utilized are explained.
3.1 Project Background
The Snapshot IP is part of a 5G baseband project which is designed to be implemented on Intel Stratix 10 SX FPGA chips. It is loosely based on another IP designed for a relatively similar project. However, being designed for a Xilinx FPGA, the earlier IP was not fully compatible with the new project, both in terms of the IPs used within the design and the interface compatibility, and it did not have all the requested features. The Snapshot IP was developed to address these needs, and most of the code was re-written, which made adding new features more straightforward. As a by-product of the code refactoring, the performance in some areas also improved.
The Snapshot IP was planned to be integrated into certain points of the design, many of which lie between the radio interface IP and the rest of the design, in both uplink and downlink directions. The IP specification was initially prepared to meet the requirements of these snapshot points.
The initial design was done for the Time Division Duplex (TDD) variant of the project, meaning it only supported data formats common in that variant. It also had a smaller set of features, and there was only one clock frequency for the datapath. Frequency Division Duplex (FDD) data format support and extra features were added in later revisions of the IP. The single-clock-frequency datapath also evolved into a two-clock datapath, bringing the need for clock crossing schemes.
The Snapshot IP is part of a bigger project and is meant to help verification, validation
and testing teams find possible bugs in an IP. This is done by providing two main
functionalities. The first is sniffing the ongoing data on a data bus, comparing it with
reference vectors generated by high-level modelling tools (MATLAB in this case),
tracking the symptoms of erroneous data back to a specific IP and continuing debugging
from there using other available debugging options.
The second functionality is artificially inserting data into a connected IP and then
observing the behaviour of the destination IP. The inserted vector is based on data
preloaded in system memory. In this case, the connection between the source and
destination IP is virtually cut off, and the inserted data replaces the real data going to
the destination IP.
Under specific circumstances, insertion and capture can work simultaneously within
the IP to provide better visibility for the testing team.
In the high-level architecture of the project, a few points are defined as insertion/capture
points. The Snapshot IP is designed to handle all the possible cases at those points.
However, it can be used at other points or in other projects as long as the interfaces
match those of the neighbouring IPs.
Incorrect data can show up as skipped or erroneous samples. There can also be a
correct sequence of samples but with wrong timing.
3.2 Tools
A handful of tools were used for the development and verification of Snapshot IP. They
range from compilation and design entry tools and software products to verification and
debugging tools.
For synthesis and compilation, Intel Quartus Prime Pro is used as the official tool to
compile and program Intel FPGAs. For generating the register bank, a script-based tool
called reg_gen is used. This tool creates a register bank that can be controlled through
software. It takes a spreadsheet as input and generates VHDL code from it.
For IP-XACT packaging, a combination of custom scripts and a third-party tool from
Magillem is used. For linting and clock domain crossing (CDC) checks, Synopsys
SpyGlass is the tool of choice. More details about these tools are discussed in Chapter 4.
For verification and simulation-level debugging, Mentor QuestaSim is used, and for
on-chip debugging, Intel SignalTap (part of the Intel Quartus Prime Pro software) is
used.
Each of these tools contributed to ensuring a high level of design quality and avoiding
human errors. If used properly, they also make the development process easier and
more automated.
3.3 IP interfaces
The interface of the design consists of an Advanced eXtensible Interface, revision 4
(AXI4) Master for the memory interface, an AXI4-Lite Slave for the configuration
interface and multiple AXI4-Stream Master and Slave interfaces for the data path that
can be connected to multiple adjacent IPs. It also has a few custom signals dedicated
to timing information, such as BTS Clock Number (BCN) values.
Figure 4. Snapshot IP top-level architecture
3.4 IP Features
This IP provides a range of features to increase debugging flexibility, giving the testing
team more visibility in different cases and environments. It works with different data
widths, clock domains, data types and modes of operation. It can also be integrated with
two memory controllers simultaneously. In addition, the insertion functionality can be
optionally left out or instantiated on demand, according to the project needs. What
follows is an explanation of each of these features.
3.4.1 Data Format
As mentioned earlier, the Snapshot IP can be used in different places within the design.
This means that the width of the AXI4-Stream tdata signal depends on the tdata width
of the adjacent IP it connects to [3]. Therefore, the Snapshot IP supports both 64- and
128-bit input and output data lines on the AXI4-Stream.
The stream of data going through the Snapshot IP can also differ in format, and the
interval of valid data on the stream can vary. The Snapshot IP processes the data
according to the user's format of choice.
3.4.2 Multiple clock domains
From a logical point of view, both the insertion and capture functionalities consist of three
sections. The first section interacts with the memory interface to read from or write to the
DDR4 memory; let's call this the memory side. The second section interacts with the
adjacent IPs, performing the actual insertion or capture process; this refers to the part of
the design that processes AXI4-Stream packets, and is called the stream side. The third
section is the configuration memory (register bank), which acts as an interface through
which the IP can be controlled by software. These three sections can optionally run on
fully independent clock frequencies: the memory side on ddr_clk, the stream side on
str_clk and the configuration memory on cfg_clk.
The clock frequencies the IP receives depend on what the current iteration of the
system-level design is targeting. The ddr_clk should be in the same domain as the DDR4
memory controller, the clock frequency of the stream side depends on the adjacent IPs
that send data to or receive data from the Snapshot IP, and cfg_clk generally depends
on the frequency used for the configuration lines at the system level. So, to make the
design as future-proof as possible, clock crossing synchronization methods are needed
within the IP. A mix of a few different approaches is used to handle all clock crossing
situations.
3.4.3 Modes of operation
The capture and insertion functionalities have numerous features that users can utilize
to debug more efficiently. The following is a brief list of these features; they are
discussed more thoroughly in sections 3.5 and 3.6.
• Capturing:
a. Single/continuous (circular buffer) modes
b. User-controllable continuous mode
c. Starting or ending a capture based on a timing preference
d. Capture when reaching a specific pattern
e. Selectable memory
f. Selectable memory address range to take into use
g. Selectable data width based on the source IP data width
• Insertion:
a. Trigger based on certain timing criteria
b. Supporting different data bus widths to match the destination IP
c. Different data formats to support different use cases
The main requirement is that the Snapshot IP must be able to capture the ongoing data
at all times, so that if a failure happens at any moment, the debugging data is available.
To work all the time, the IP needs to efficiently handle the incoming data and save it
into memory. In theory, this means ~16 Gbps of data. This, along with all the features
requested while developing the IP, caused some challenges that are the main topic of
the next chapter.
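The ~16 Gbps figure can be sanity-checked with simple arithmetic: a 128-bit stream carrying valid data on every cycle at a clock of around 125 MHz yields exactly that rate. The 125 MHz figure is an assumption for illustration; the exact stream frequency is not stated here.

```python
def stream_bandwidth_gbps(data_width_bits: int, clock_hz: float) -> float:
    """Peak stream bandwidth when every clock cycle carries valid data."""
    return data_width_bits * clock_hz / 1e9

# Assumed 125 MHz stream clock on a 128-bit bus:
print(stream_bandwidth_gbps(128, 125e6))  # -> 16.0
```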
3.4.4 Support for two memories
The system has two memories: one is an external memory with direct access from the
FPGA side, and the other is a memory accessible through the Hard Processor System
(HPS). At times, one of these memories can be busy with other operations, and using
the same memory could cause unwanted performance issues for the ongoing operation.
Although there is only one AXI4 Master interface available in the Snapshot IP design,
the read and write sections are fully separated between capture and insertion, with
capture always using write transactions and insertion using read transactions. So, with
a small effort in top-level integration, the Snapshot IP can be connected to two different
memory controllers at the same time. This is also made possible by the fact that the
insertion and capture logics are completely separated and can work independently.
It is worth noting that the IP's connection to the HPS memory is not through the
conventional FPGA-to-HPS (F2H) bridge. It uses a dedicated bridge called the
FPGA-to-SDRAM (F2SDRAM) bridge, which has more direct access to the SDRAM than
the F2H bridge: the F2SDRAM connection bypasses the Cache Coherency Unit (CCU)
and connects to the SDRAM L3 interconnect directly [11] (Figure 5).
Figure 5. Intel Stratix 10 HPS Block Diagram. Difference between F2H (top centre) and F2SDRAM (centre) paths to SDRAM [11].
3.4.5 Optional insertion instantiation
Unlike capture, not all the points where the Snapshot IP can be integrated need an
insertion feature for debugging. This is mainly due to the complexity of the data format
needed by the adjacent IP at those points. As a result, insertion as a whole can be
optionally left out and not instantiated, which results in lower total logic utilization.
3.5 Capture
The IP can capture data in one of seven available modes. They can be divided into two
categories: single and continuous modes. There is, however, a set of common
configurations that all these modes need in order to operate. First, a valid address range
(start and end addresses) reserved for the captured data in memory must be defined.
The capture functionality must also be enabled before capturing can start.
Single capture modes are those that start capturing incoming data on a predefined or
manual trigger and continue until they reach the end address defined by the user.
• Immediate single mode: the user manually triggers the start. Capturing starts
immediately and stops when it reaches the end address.
• Start based on a certain timing: this uses the BCN value. When a certain value is
reached, capturing starts and continues until the address range is full.
• Start on a specific pattern in the incoming data: the incoming data is compared
against a predefined N-bit value set by the user. Capturing starts as soon as the
pattern is found on the input.
Continuous modes are those that run until a specific condition is met, regardless of the end address.
• Continuous (circular buffer) mode: similar to immediate single mode, but when it
reaches the end address it wraps around to the start address. This mode must be
stopped manually.
• Ending a capture based on a certain timing: in this case capturing starts manually
but continues until it reaches a certain BCN value. This means it might or might not
wrap around the memory range.
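The difference between single and continuous modes is essentially whether the write address stops at the end of the configured range or wraps around to its start. The following sketch models that address behaviour only; it is an illustration with made-up parameters, not the RTL.

```python
def capture_addresses(start, end, beat_bytes, n_beats, continuous):
    """Generate successive write addresses for captured beats (in bytes).

    In single mode the capture stops when the range is full; in continuous
    (circular buffer) mode the address wraps back to `start`.
    """
    addr = start
    for _ in range(n_beats):
        yield addr
        addr += beat_bytes
        if addr >= end:
            if not continuous:
                return          # single mode: done when range is full
            addr = start        # continuous mode: wrap around

# 32-byte address range, 16-byte beats:
print(list(capture_addresses(0x1000, 0x1020, 16, 4, continuous=False)))
# -> [4096, 4112] (stops at end of range)
print(list(capture_addresses(0x1000, 0x1020, 16, 4, continuous=True)))
# -> [4096, 4112, 4096, 4112] (wraps around)
```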
Figure 6. Capture single mode with BCN start
From the user's point of view, a set of operations must be performed. The flowchart in
Figure 7 shows these operations at an abstract level. Before starting a capture, the user
should decide which input stream, address area and mode to use. Then, if the selected
mode needs additional steps, certain configurations must be done. Depending on the
selected mode, the capture finishes automatically or must be stopped manually by the
user. When the capture is done, the operation and error status registers can be checked
to make sure everything was captured successfully.
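From software, this procedure amounts to a short sequence of register writes. The register names and mode encodings below are hypothetical placeholders (the real map comes from the reg_gen spreadsheet); the sketch only illustrates the ordering: configure the address range and mode first, then any mode-specific trigger, and enable last.

```python
IMMEDIATE, BCN_START, PATTERN = 0, 1, 2   # assumed mode encodings

def start_capture(write, mode, start_addr, end_addr, bcn=None, pattern=None):
    """Bring-up sequence for a capture, as register writes via `write(name, value)`."""
    write("CAPT_START_ADDR", start_addr)   # memory area to take into use
    write("CAPT_END_ADDR", end_addr)
    write("CAPT_MODE", mode)
    if mode == BCN_START:                  # timing-based start needs a BCN value
        write("CAPT_BCN_TRIGGER", bcn)
    elif mode == PATTERN:                  # pattern-based start needs a pattern
        write("CAPT_PATTERN", pattern)
    write("CAPT_ENABLE", 1)                # enable last, then poll status registers

regs = {}
start_capture(regs.__setitem__, BCN_START, 0x8000_0000, 0x8010_0000, bcn=42)
print(regs["CAPT_ENABLE"], regs["CAPT_BCN_TRIGGER"])  # -> 1 42
```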
3.6 Insertion
The Snapshot IP can also insert a specific vector of data into an adjacent IP. In this case,
a predefined string of data is placed in memory, and the IP reads it and pushes it to the
destination IP. Due to the nature of the operation, the number of available modes is
smaller than for the capture functionality; there are two modes to choose from, single
and continuous. In both cases a BCN value is needed so the starting point of the insertion
can be found and aligned precisely.
• In single mode, upon reaching the BCN value, the data is read from memory and,
according to the chosen data format, pushed to the destination IP. When reading
from memory reaches the end address, insertion stops automatically and updates
the status registers.
• In continuous mode, similar steps happen, with the difference that when reading
reaches the end address, it wraps around to the start address and continues
inserting from the beginning. This mode must be stopped manually.
The flowchart in Figure 8 shows the procedure a user should follow to take insertion into
use. Similar to capture, the user should set the address area and select a stream and a
mode. Setting the BCN value is mandatory here; otherwise, insertion will start when
BCN=0x0. Moreover, if the output format differs from the default, it should be defined as
well. When insertion is done, the status registers can be checked to verify everything
has been transferred successfully.
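The BCN alignment can be illustrated with a small behavioural model: insertion waits until the BCN input matches the programmed trigger value, and because the trigger register resets to 0x0, leaving it unconfigured means the first occurrence of BCN=0x0 starts the insertion. A sketch, not the RTL:

```python
def insertion_start_cycle(bcn_trace, bcn_trigger=0x0):
    """Return the first cycle on which insertion starts.

    Insertion is aligned to the BCN timing signal; the default trigger of
    0x0 models an unconfigured trigger register.
    """
    for cycle, bcn in enumerate(bcn_trace):
        if bcn == bcn_trigger:
            return cycle
    return None  # trigger value never seen on the BCN input

bcns = [5, 6, 7, 0, 1, 2]
print(insertion_start_cycle(bcns, bcn_trigger=7))  # -> 2
print(insertion_start_cycle(bcns))                 # unconfigured: starts at BCN=0x0 -> 3
```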
Figure 7. Capture user operations procedure
Figure 8. Insertion user operations procedure
3.7 Example Use Cases
This section describes some simple use cases for the Snapshot IP. They showcase the
flexibility of the Snapshot IP and how it can be utilized in different scenarios.
3.7.1 Use case 1: Capture
Assume there is a Digital Signal Processing (DSP) IP that generates a series of data to
be transferred to the radio interface. Each piece of data must arrive at a specific time,
otherwise it might be dropped. The Snapshot IP can be programmed to trigger at a
certain moment and capture the incoming data into a memory space. This data is then
compared with the golden data generated by simulation tools to validate that both the
data and its timing are correct. Figure 9 shows a block diagram for this case.
Figure 9. Visualisation of use case 1
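The offline check in this use case reduces to comparing the captured memory dump with the golden vector sample by sample. A minimal sketch; the real flow uses MATLAB-generated reference vectors, and the list-based format here is just for illustration:

```python
def compare_to_golden(captured, golden):
    """Return indices of captured samples that differ from the reference."""
    return [i for i, (c, g) in enumerate(zip(captured, golden)) if c != g]

golden = [0x11, 0x22, 0x33, 0x44]
captured = [0x11, 0x22, 0x30, 0x44]   # one corrupted sample
print(compare_to_golden(captured, golden))  # -> [2]
```

A non-empty result points at the first symptom of an error, from which debugging can continue towards the IP that produced the sample.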
3.7.2 Use Case 2: Insertion
Assume a case where the processing unit is being debugged and cannot be used to
generate data for the radio interface IP. To increase the productivity of the testing team,
the Snapshot IP can be used to test the radio interface IP.
In this case, a simulation-generated test vector can be inserted into the destination IP to
verify its functionality. This can be expanded so that the data coming back from the radio
interface is captured to fully verify the functionality of the IP. Figure 10 visualises the
simplified idea: instead of the original data, data coming from the DDR4 memory is
passed to the destination IP.
Figure 10. Visualisation of use case 2
3.7.3 Use Case 3: Self-validation
The Snapshot IP can be used to validate its own insertion and capture functionality. This
has been used to verify that the Snapshot IP runs as intended and that all status registers
show correct values when a process is finished. It is achieved by feeding one of the IP's
output streams directly back into an input stream. As a result, the data inserted by the
Snapshot IP is captured by the Snapshot IP itself into a memory space. This use case
shows the robustness of the design under pressure and how the FIFOs and the AXI4
interface behave when the IP is under full load.
To do so, the IP is configured for both the insertion and capture functionalities. Different
capture and insertion modes can be used here, but single modes are recommended to
make the comparison easier: continuous modes overwrite the memory content, which
makes the comparison between the original and captured data more complex. A good
combination is insertion in single mode together with capture in either immediate single
mode or pattern-based mode, where the pattern is the same data that is going to be
inserted.
Figure 11. Use case 3: Self validation
Capture should be started first to make sure no data is missing from the final result; it
stays idle until valid data (tvalid='1') arrives. After enabling capture, insertion can be
started. When finished, the status registers of both insertion and capture should be in
the done state and the IP itself should be back in idle. The captured data should also
match the original inserted data.
4. IMPLEMENTATION
Designing a high-performance Snapshot IP that can function in different use cases
inevitably brings a series of challenges. Decisions also had to be made on a series of
items and on how to implement certain modules.
The main design decision for the Snapshot IP is how different types of data should be
synchronized when crossing clock boundaries.
The challenges can be categorized into three main types: design challenges, design flow
challenges, and validation and verification challenges. Design challenges are those that
hampered or complicated the design process and got in the way of keeping the Snapshot
IP as generic as possible while finalizing it. Design flow challenges are related to the
tools and workflows around the IP. Validation and verification challenges are those that
made debugging the IP itself more difficult.
There is also a short discussion on the limitations of the design, why they could not be
avoided and how they have been dealt with.
4.1 Design Decisions
Choosing correct synchronization methods can significantly improve the robustness of
the design. Depending on the type of data and the relation between the source and
destination clocks, different synchronization schemes should be used. The signals that
need to be synchronized in the Snapshot IP fall into three categories: multi-bit signals
on the main bus, single-bit signals from a slow to a fast clock domain, and single-bit
signals from a fast to a slow clock domain. Each category needs a synchronization
scheme of its own.
First, a dual-clock First In, First Out (FIFO) memory is utilized on the data bus to safely
move data between ddr_clk and str_clk. The FIFO used here is an Intel IP that uses
Gray code to synchronize between its write and read sections. Due to this
synchronization, there is a delay between the time data is written into the FIFO and the
time it is ready to be read out. The actual delay depends on the frequencies of the read
and write sides of the FIFO. The following formulas show examples of the delay between
writing to and reading from the dual-clock FIFO, where wrreq is the write request signal,
q is the output and rdfull shows whether the read side sees the FIFO as full [7].
wrreq to q[]: 1 wrclk + following 1 rdclk
wrreq to rdfull: 2 wrclk cycles + following n rdclk
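The reason dual-clock FIFOs use Gray code is that successive Gray-coded pointer values differ in exactly one bit, so a pointer sampled in the other clock domain is always either the old or the new value, never an inconsistent mix of bits. A sketch of the standard encoding (not Intel's FIFO internals):

```python
def bin_to_gray(b: int) -> int:
    """Standard binary-to-Gray encoding."""
    return b ^ (b >> 1)

def gray_to_bin(g: int, width: int = 8) -> int:
    """Inverse transform: fold the Gray bits back down to binary."""
    b, shift = g, 1
    while shift < width:
        b ^= b >> shift
        shift *= 2
    return b

# Adjacent pointer values differ in exactly one bit in Gray code,
# so a metastable sample can only resolve to the old or the new value.
for p in range(7):
    diff = bin_to_gray(p) ^ bin_to_gray(p + 1)
    assert bin(diff).count("1") == 1
assert all(gray_to_bin(bin_to_gray(p)) == p for p in range(256))
print("Gray code round-trip OK")
```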
The second method is multi-flop synchronization for single-bit and multi-bit signals
where the source clock is slower [4]. This consists of two or three sequential flip-flops
clocked by the destination clock. The multi-bit signals use a coherent synchronizer,
where the output only changes once the data has stabilised on all of the intermediate
flip-flops.
Finally, for single-bit synchronization from a fast clock to a slow clock, an 8-bit extension
is used. Synchronizing with this method can cause data loss if the value of the signal
changes faster than the destination clock can sample it. Therefore, it is only used for
quasi-static signals, meaning signals that rarely change value.
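The need for the extension can be seen with a toy model: if the slow domain effectively samples the fast-domain signal once every N fast cycles (N = 8 assumed below, not taken from the design), a single-cycle pulse can fall entirely between two samples and be lost, while a pulse stretched over N fast cycles is always seen at least once.

```python
def sample_slow(fast_signal, ratio=8):
    """Values the slow-clock domain sees: one sample per `ratio` fast cycles."""
    return fast_signal[::ratio]

one_cycle_pulse = [0, 0, 1, 0, 0, 0, 0, 0] + [0] * 8
extended_pulse  = [0, 0] + [1] * 8 + [0] * 6   # stretched over 8 fast cycles

print(1 in sample_slow(one_cycle_pulse))  # -> False: the short pulse is lost
print(1 in sample_slow(extended_pulse))   # -> True: the extended pulse is seen
```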
It should be noted, however, that despite all these synchronizations, not every arbitrary
set of clock frequencies can be used. The fact that FIFO overflows or underflows can
cause data loss limits the choices here.
4.2 Design Challenges
4.2.1 Variable number of clocks
Since the project was in a development phase, the clock requirements changed
regularly. The IP was initially designed with a single clock on the datapath. Later this
changed, and the synchronizers and a dual-clock buffer were added to the design. Later
still, the design moved back to a single clock as all the clock frequencies reached their
target values. Each of these changes required a significant amount of code modification.
In particular, converting the single-clock design into a dual-clock one was cumbersome.
That included making sure all the signals transferred from one clock domain to another
were properly synchronized. Also, the FIFO had to be replaced with a dual-clock version.
This complicated the relation between the read and write sections of the FIFO: the delay
between these two sections must be considered when reading from or writing into the
FIFO [7]. These delays also brought the need to modify the full and empty thresholds
and how the AXI4 Master interface interacted with the IP to avoid overflows and
underflows.
4.2.2 Integration points within the system
The Snapshot IP can be integrated at different points of the design, as long as there is
an AXI4-Stream connection. Naturally, the characteristics of these points are not
necessarily the same and often differ from point to point. This is especially important
regarding the relation between the AXI4-Stream clock and the DDR4 memory clock, i.e.
which side is faster and which is slower, as it plays a role in how the synchronization
between these two sides of the design is done.
4.2.3 Different input formats
Another main difference between the integration points is the data format on the
AXI4-Stream channel. How frequently the tvalid signal arrives and how tdata is
structured affect the processing of the data. The incoming data can be 64 or 128 bits
wide, and the interval between tvalid signals can change in different use cases. To buffer
the incoming data (either from the DDR4 memory in insertion or from the AXI4-Stream
in capture), a dual-clock FIFO is used, which also takes care of synchronizing the data
between the two clocks. Since the width of this buffer is set to 128 bits, when a 64-bit
capture is in use the data must first be packed into 128-bit words and then saved into
the buffer. That adds the need for some preprocessing of the incoming data. The same
idea applies in insertion when 64-bit data is desired on the output.
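The 64-to-128-bit packing can be sketched as combining consecutive pairs of 64-bit beats into one FIFO word. The ordering below (first beat in the low half of the word) is an assumption for illustration, not taken from the design:

```python
def pack_64_to_128(beats_64):
    """Combine consecutive pairs of 64-bit beats into 128-bit FIFO words."""
    words = []
    for i in range(0, len(beats_64) - 1, 2):
        low, high = beats_64[i], beats_64[i + 1]   # assumed: first beat low
        words.append((high << 64) | low)
    return words

beats = [0x1111_1111_1111_1111, 0x2222_2222_2222_2222,
         0x3333_3333_3333_3333, 0x4444_4444_4444_4444]
words = pack_64_to_128(beats)
print(hex(words[0]))  # -> 0x22222222222222221111111111111111
```

Insertion performs the inverse unpacking when a 64-bit output is desired.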
There are also specific bits in the data that mark the beginning of a frame. When
capturing, such a bit can be used to find the start of the frame and begin capturing from
that point. This can also be done using a 64-bit pattern that triggers the capture when
found.
A tlast signal is sent out at the end of each basic frame. In some cases, tvalid is '0' and
tdata carries no valid data for a specific number of clock cycles before the next basic
frame arrives. Since the data in the buffer is packed and does not contain these invalid
cycles, the IP must handle the gaps between non-consecutive tvalid signals and control
the flow of the outgoing data. This also includes delaying responses to the DDR4
memory so that the buffer does not overflow.
4.2.4 AXI4 Master design compatibility with the slave
AXI4 is a standardized interface, but it can be interpreted differently, so a Master
interface might not be fully compatible with a Slave if it is not generic enough. For
example, AWREADY can go high either before or after a valid WDATA is presented [2].
This makes adapting the design when changing the Slave IP cumbersome.
Another point worth discussing is that each AXI4 Master interface can be connected to
multiple Slave interfaces. As a result, the Intel Quartus software automatically adds an
interconnect between them. This interconnect internally uses Intel Avalon
memory-mapped interfaces, which makes all write and read transactions go through the
same path: if one of them gets stuck, the other cannot continue sending or receiving
data. It took some debugging sessions to find and fix the corner cases that caused these
issues.
4.2.5 Performance optimizations
While capturing, the worst-case scenario is when every clock cycle carries valid data on
the incoming AXI4-Stream channel (tvalid = '1'). It should be noted that the Snapshot IP
does not backpressure the incoming stream on the capture side (tready is always '1').
In this situation, the buffer can fill up quickly. Therefore, the AXI4 interface towards the
DDR4 memories must be optimal so that buffer overflows are avoided and data can be
sent to the DDR4 memory at a constant rate. If the buffer overflows, the captured data
is considered corrupted (data is missing) and the capture stops immediately.
In insertion, the data must be sent out exactly according to the chosen format and cannot
be delayed, so buffer underflows cannot be afforded. If a buffer underflow happens at
any time, the insertion is stopped, as there would be missing clock cycles where there
should have been valid data. This brings the need to have multiple active AXI4
transactions at a time, so the data moves in and out of the buffer as fast as possible.
In earlier versions of the IP, the design handled AXI4 transactions sequentially, that is,
a single write transaction at a time. This was improved in later releases, as the IP had
difficulty keeping up with the flow of incoming data and the buffer was overflowing. The
final versions can handle 32 outstanding AXI4 burst transactions.
One might argue that a wider AXI4 data bus towards the memories could mitigate this
issue. However, a couple of considerations led to not widening the data bus. The
explanation below is mostly for capture, but similar ideas apply to insertion.
The most important consideration is that the data bus width of the F2SDRAM AXI4
bridges of the HPS supports a maximum of 128 bits [11]. So even if the Snapshot IP had
a wider AXI4 data bus, it would be constrained to 128 bits when reaching the HPS.
Moreover, the FIFO word size is aligned to the DDR4 bus size for ease of use, so each
read from the FIFO corresponds to one AXI4 transaction. If the AXI4 data bus became
wider, there would be two options for the FIFO. The first option is to widen the FIFO to
match the AXI4 data bus width, which brings the need to pack incoming data before
writing it into the FIFO; with the current design, packing is at its minimum, being done
only for 64-bit input data. The other option is to keep the FIFO size the same and perform
multiple reads per AXI4 transaction instead of one. Both options would complicate the
design, which can directly affect critical paths.
4.2.6 Timing Optimizations
Although the Snapshot IP is rather small, Intel Quartus Prime had difficulties reaching
the timing goals inside and around the Snapshot IP instances.
One of the reasons for this was the large number of interfaces, which complicates the
routing. To address this, we relaxed the timing using pipelining wherever possible, so
the inputs and outputs go through multiple register stages.
Another timing concern that needed to be addressed was value comparisons and their
effect on timing closure. Having a variety of features in the design brings the need for
multi-layered conditional statements and more complex state machines. These can
make the combinatorial logic between two registers large, so the critical paths can
become longer than what the compiler can handle. To fix this, we re-examined the code
and simplified some of the conditional statements and state machines. This took the
form of using variables instead of some signals, moving assignments out of if statements
where possible, and pipelining signals that are not needed immediately.
Moreover, we modified the design so that signals have either a synchronous reset or no
reset at all. This is a consequence of the Intel Stratix 10 FPGA architecture: the design
can be placed and retimed more easily by the compiler if no asynchronous resets are
involved [8].
4.2.7 FPGA resource usage
Although the earlier revisions of the design were not particularly resource hungry, there
were requests to optimize the design even further, mainly to help the top-level
integration reach a timing-clean result through smoother placement.
For this, some redundant and unused parts were removed from the code, and a generic
was added to optionally leave out the insertion functionality whenever possible. The
results of these changes are reflected in Chapter 6.
4.2.8 Handling urgent feature requests
At times during the project, an urgent new feature was requested that had to be included
in the design without breaking or changing any of the existing features. One example of
such an urgent request was supporting multiple formats for insertion. Before this change,
the Snapshot IP only supported TDD formats; with it, FDD format support had to be
added, including several types of outputs.
Choosing the best way to develop this feature so that it could be merged easily with the
existing implementation was challenging, as it required many code modifications and
additions to generalize the design for all possible scenarios. Nevertheless, it was quite
rewarding, as the feature worked as intended from the first tests.
4.3 Design Flow Challenges
The challenges regarding the design flow are mainly due to either a lack of experience
in using the tools or the lack of a robust design entry flow specifically built for an FPGA
project. The tools used are industry-standard tools that are rarely encountered in an
academic setting.
In the following sections, some of the tools and the challenges that come with each of
them are discussed.
4.3.1 IP-XACT packaging
IP-XACT is an XML-based metadata format that mainly describes an IP's interface. It
makes using and integrating an IP into a bigger system easier [1]. However, the process
of generating IP-XACT files for the first time was rather cumbersome.
The flow consists of multiple steps and includes the use of a third-party tool from
Magillem combined with a few TCL-based scripts. Setting all of this up so that correct
IP-XACT XML files are produced required a few rounds of trial and error with different
options.
Once the first revision of the XML files is done and correct files are generated,
modifications and reruns of the flow are straightforward. Modifications are only needed
when the IP interface or the register bank changes.
4.3.2 SpyGlass linting/CDC checks
Linting is the process of statically checking the code to find common issues and bugs,
which helps to achieve better code quality. Clock domain crossing (CDC) checking is
the process of formally verifying clock crossing points to find issues in the clock crossing
logic, including glitches, missing data and data holding issues. Synopsys SpyGlass is
the tool used here.
Setting up SpyGlass from scratch and defining constraints for it was challenging. By
default, SpyGlass assumes the worst-case scenario for every check it performs. In reality
this is often not the case, so many manual instructions must be added to guide the tool.
For example, the status registers change their value only a handful of times during an
operation. However, since these signals go through clock crossings, SpyGlass assumes
the worst case of them changing regularly. As this is not the case, constraints should be
added so SpyGlass considers them quasi-static.
Having a blackbox IP in the design also complicates the flow: blackboxes need extra
care and constraints so the tool can handle them properly.
Even after adding numerous instructions and constraints, many errors and warnings may
still show up that need to be addressed. Some of these are false alarms and some come
from the blackboxes. These errors and warnings can be waived to clean up the results,
so that real issues are easier to find.
4.3.3 Register bank Generation
The register bank allows software to access and configure the IP for a specific purpose.
The flow starts with a spreadsheet, formatted in a specific style, that defines all the
registers needed, including both control and status registers. This spreadsheet is then
fed to a script that generates both the HDL and the IP-XACT files for the register bank.
4.4 Validation and verification challenges
Validation and verification challenges are those that affected how easily and quickly the
Snapshot IP itself could be debugged, and they resulted in lengthy debugging cycles.
Early in the project, it was decided that the Snapshot IP would be verified as part of the
top-level design, that is, the whole FPGA project. As a result, no IP-level simulation was
available. To debug the IP, one had to run a top-level simulation, which is not
time-effective: it can take tens of minutes before any data even starts to go through the
IP. This made the debugging process slow.
Another option for debugging the design was Intel's SignalTap logic analyzer. Since it runs directly on the FPGA, it is ideally a faster way to find a bug than simulation. However, due to its limited visibility into signal activity, multiple rounds of debugging are often needed to find a given issue. One SignalTap instance can handle at most 4K signals and save their data for a certain number of cycles [9]. Data acquisition is triggered when a pre-defined event or chain of events occurs [9]. The idea is to find a trigger so close to the problematic section of a signal that the bug shows up within the acquired data.
In our case, debugging an issue starts from a stuck interface or wrong status register values. With this initial information, we created the first SignalTap file to find the rough location of the issue. After that, depending on the situation, a few more rounds of SignalTap debugging may be needed to find the exact problem.
Another issue with SignalTap is related to timing. In a system that already has a high logic utilization percentage, SignalTap further complicates reaching a timing-clean result.
The combination of these limitations led to days, and sometimes weeks, spent debugging a single bug.
Nevertheless, SignalTap offers features that are useful in a debugging session. The best example is state machine-based trigger conditions, where the user can define a specific sequence of conditions that must occur before data acquisition starts (Figure 12). Data acquisition can also be split into multiple segments that acquire data individually, which is useful when a certain set of trigger conditions occurs more than once.
Figure 12. An example of a SignalTap state-based trigger condition
4.5 Limitations
Although the Snapshot IP is designed to be as generic as possible, some limitations still prevent a fully generic design. The main limitations are described below.
4.5.1 A complementary solution
While the Snapshot IP has many features that help testers validate and debug parts of the system, it was not intended to be the sole method of doing so. It is rather designed as a complementary method to be used alongside other available solutions.
Ideally, the Snapshot IP can narrow an issue down to a particular IP. In a large-scale project, this narrowing down can be used in two different scenarios. In one scenario, the Snapshot IP serves as a secondary method after a system-level validation has found a failure; it is then used at a lower level to find the specific IP in the design that malfunctions. In the other scenario, it serves as the first level of validation or debugging, determining the correctness of a datapath before longer test cases are started to validate the whole system.
4.5.2 Ability to insert any arbitrary format of data
The Snapshot IP can insert multiple formats of data. However, there is still a set of rules that defines what counts as valid data for insertion. These rules are based on what the radio interface IP accepts as valid data, so if the Snapshot IP is used somewhere other than adjacent to the radio interface IP, the data format might differ and insertion can no longer be done. Insertion formats depend on the intervals of the tvalid signal and on how often tlast should be sent out. This makes supporting an arbitrary insertion format practically unfeasible. The optional instantiation feature of insertion is partly implemented because of this limitation.
4.5.3 Performance sensitivity
As mentioned earlier, the capture functionality only sniffs the data; it is not the main destination of the incoming data and cannot request the data to be delayed. As a result, the capture functionality in this design does not provide backpressure on its AXI4-Stream interface, that is, tready is always '1'.
That means that when a data packet arrives, the FIFO must have enough space for it. This requires the data buffered in the FIFO to be sent out to external memory through the AXI4 interface in a timely and consistent manner. If the AXI4 interface transfer rate fluctuates or underperforms, the FIFO may overflow, which halts the capture process since the captured data would be corrupted by missing data.
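The overflow risk can be illustrated with a simple occupancy model. The numbers below (depth, fill rate, drain rate and cycle count) are hypothetical, chosen only to show how a drain rate that falls below the fill rate eventually overflows a FIFO whose source cannot be stalled:

```python
def simulate_fifo(depth: int, fill_per_cycle: float,
                  drain_per_cycle: float, cycles: int) -> bool:
    """Return True if the FIFO would overflow. The stream side cannot
    be stalled (tready is always '1'), so data arriving into a full
    FIFO would be lost."""
    level = 0.0
    for _ in range(cycles):
        level += fill_per_cycle                     # stream side always writes
        level = max(0.0, level - drain_per_cycle)   # AXI4 side drains
        if level > depth:
            return True
    return False

# A drain rate matching the fill rate keeps the FIFO safe...
assert not simulate_fifo(depth=4096, fill_per_cycle=1.0,
                         drain_per_cycle=1.0, cycles=100_000)
# ...but a slightly underperforming AXI4 side eventually overflows it.
assert simulate_fifo(depth=4096, fill_per_cycle=1.0,
                     drain_per_cycle=0.9, cycles=100_000)
```

Even a 10% sustained shortfall in drain rate fills a 4096-word buffer in a few tens of thousands of cycles, which is why consistent AXI4 throughput matters more here than peak throughput.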
The Snapshot IP is designed to make maximum use of the AXI4 interface when it is the dominant user of the memory. However, if multiple high-performance operations use the same memory simultaneously, the performance of the Snapshot IP can suffer directly.
In response to this limitation, support for two simultaneous memories was added, so that when one memory is busy, the other can be used for Snapshot IP operations. However, if the number of such concurrent cases increases, a performance reduction is unavoidable.
5. RESULTS
The main objective of the Snapshot IP, from the project's point of view, is to enable real-time capture and insertion functionality for a set of pre-defined integration points. The implementation was successful, and the IP is now in use in the final product. This chapter discusses the results of the designed IP and examines whether the criteria set out at the beginning of this thesis were achieved.
5.1 Overall performance
One of the items that can be tangibly measured is the speed of accessing external memory. The average theoretical data rate on the AXI4-Stream side of the system is ~16 Gbps, so this is the minimum that the AXI4 interface must be able to handle. As shown in Figure 13, this is handled with ease. Assuming the AXI4 interface has something to transfer on every clock cycle, the rate can reach ~32 Gbps at maximum; considering the system's delays in transferring data, it will in practice always remain below 32 Gbps.
One should note that these numbers apply when only capture or only insertion is at work, not both together. When both are working, the effect of some variables is too large for the performance to be measured reliably. Some of these variables are listed below.
• The timing of capture and insertion with regard to each other
• The AXI4 interface connections at the top level
• The number of external memories in use (one or two)
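For reference, the quoted rates are consistent with a 128-bit datapath (the FIFO word width given in Section 5.2) running at a clock of 250 MHz; the clock frequency and the 50% average bus utilization are assumptions made here purely for illustration, not figures from the design:

```python
# Back-of-the-envelope check of the quoted AXI4 rates. The 250 MHz
# clock and 50% average utilization are illustrative assumptions; the
# data width matches the 128-bit FIFO words described in Section 5.2.
DATA_WIDTH_BITS = 128
CLOCK_HZ = 250e6          # hypothetical clock frequency
AVG_UTILIZATION = 0.5     # hypothetical average bus utilization

peak_gbps = DATA_WIDTH_BITS * CLOCK_HZ / 1e9   # one transfer per cycle
avg_gbps = peak_gbps * AVG_UTILIZATION

print(f"peak: {peak_gbps:.0f} Gbps, average: {avg_gbps:.0f} Gbps")
```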
Figure 13. AXI4 write transactions in capture
Figure 14. Continuous insertion output to a stream
5.2 FIFO usage levels
When designing the Snapshot IP, one of the main concerns was to avoid FIFO underflow and overflow, since either leads to missing data in the output. This is because, in this system, AXI4-Stream slave interfaces must always be ready and cannot drive their tready to '0'. The AXI4 side, however, is fully capable of controlling the flow of the data.
Figure 15. Relation between AXI4 transactions and FIFO level in capture
In the capture case, where the AXI4-Stream interface writes to the FIFO and the AXI4 interface reads from it, an overflow can happen if the FIFO is not emptied quickly enough. The solution is to read out as much data as possible so that the FIFO stays as empty as possible at any given time. Since the burst size for the AXI4 master is defined as 32, the interface starts a new transaction whenever the FIFO level goes above 32 (Figure 15). Taking the delays into account, this actually happens when the FIFO level is at 35. The FIFO only becomes fully empty when the capture session is finished. Figure 16 shows how the FIFO usage level behaves in the capture modes.
In the insertion case, the situation is reversed: the AXI4-Stream interface reads from the FIFO, so the buffer is theoretically prone to underflow. For this reason, the FIFO is always kept at an almost-full level. It is worth mentioning that the FIFO size is 4096 words and the almost-full level is set at 3840; whenever the level drops below this threshold, new data is written into the FIFO from the external memory. Figure 17 shows how the FIFO is kept almost full at all times. When the operation is done, the IP gradually empties the FIFO and sends the data out.
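The two watermark policies can be summarized in a small model. This is an illustrative sketch rather than the RTL; the thresholds are those stated in the text, and the exact comparison operators (≥ versus >) are an assumption of the sketch:

```python
FIFO_DEPTH = 4096    # 128-bit words
BURST_LEN = 32       # AXI4 master burst size
ALMOST_FULL = 3840   # insertion refill watermark

def capture_should_drain(level: int) -> bool:
    """Capture: start a new 32-beat AXI4 write burst as soon as a full
    burst's worth of data is buffered, keeping the FIFO near empty."""
    return level >= BURST_LEN

def insertion_needs_refill(level: int) -> bool:
    """Insertion: fetch more data from external memory whenever the
    FIFO drops below the almost-full watermark, keeping it near full
    for the stream side, which cannot be stalled."""
    return level < ALMOST_FULL

assert capture_should_drain(32) and not capture_should_drain(31)
assert insertion_needs_refill(3839) and not insertion_needs_refill(3840)
assert ALMOST_FULL / FIFO_DEPTH == 0.9375   # watermark at ~94% of depth
```

The policies mirror each other: capture biases the FIFO toward empty so incoming stream data always fits, while insertion biases it toward full so outgoing stream data is always available.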
Figure 16. FIFO usage level while capturing
Figure 17. FIFO usage level in insertion
5.3 FPGA resource usage
One of the benefits of the Snapshot IP is its minimal footprint compared to the level of features it provides. There are three instances of the Snapshot IP inside the top-level project. The resource usage is reported as the number of Adaptive Logic Modules (ALMs) and logic registers for the logic area, and as the number of M20K memory blocks and memory bits for the memory [6].
Depending on the connections of the IP at the top level, the number of ALMs used can vary. It also depends on whether the insertion is instantiated in the design. The number of ALMs needed for each instance varies between 2500 and 3500, divided mainly into three sections: capture, insertion and the register interface. The capture uses around 1000 ALMs, while insertion and the register interface each take around 600 ALMs per instance. The rest is used by the multiplexers and demultiplexers that control the inputs and outputs. In total, all three instances use around 10k ALMs. In terms of dedicated logic registers, each instance uses between 5700 and 8700.
As for the memories, the FIFOs used in the design are implemented with M20K memory blocks. A full design that includes insertion uses a total of 52 M20K blocks (26 blocks per FIFO). Each M20K block, as the name suggests, has 20480 bits of memory, which sums up to 1,064,960 bits available in 52 blocks. The design has two FIFOs, each consisting of 4096 words of 128 bits, so the total memory needed is 1,048,576 bits. This results in a ~98.5% utilization rate for each instance. In the case of having only the capture functionality, these numbers are halved and the utilization rate stays the same.
To put all these numbers in context, an Intel Stratix 10 SX FPGA has almost 933,000 ALMs and 11,000 M20K blocks. All three instances of the Snapshot IP (assuming the full design, including insertion) take ~1% of the FPGA logic area and ~1.5% of the M20K memory blocks [10].
As a result, despite packing many features, the Snapshot IP has a relatively small footprint in the project.
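The utilization figures above can be verified with simple arithmetic; every number below comes from the text or from the Stratix 10 GX/SX product table [10]:

```python
# Memory utilization of one full Snapshot IP instance (capture + insertion).
M20K_BITS = 20480
blocks_used = 52                           # 26 per FIFO
available_bits = blocks_used * M20K_BITS   # 1,064,960 bits
needed_bits = 2 * 4096 * 128               # two FIFOs of 4096 x 128-bit words
fifo_util = needed_bits / available_bits   # ~0.985 -> ~98.5%

# Share of the whole device for all three instances.
DEVICE_ALMS, DEVICE_M20K = 933_000, 11_000   # approx. Stratix 10 SX totals [10]
alm_share = 10_000 / DEVICE_ALMS             # ~1% of logic
m20k_share = 3 * blocks_used / DEVICE_M20K   # ~1.4% of memory blocks

print(f"FIFO bits: {needed_bits}/{available_bits} = {fifo_util:.1%}")
print(f"logic: {alm_share:.1%}, M20K: {m20k_share:.1%}")
```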
5.4 Design bugs
Although the final IP design has no known bugs, a few recurring bugs showed up on multiple occasions during development. AXI4 signal timing mismatches and controlling the FIFO level were the hardest to fix.
On a few occasions there was a bug within the AXI4 transactions. The majority of the issues were in write transactions, and almost all of them caused the transactions to jam. There were different reasons for these jams: some bugs were caused by not having enough transfers within a single burst, another by a mismatch between the data and valid signals in a write transaction. All of these cases resulted in the interconnect not responding, after which the system had to be rebooted. These bugs were eventually fixed, and the AXI4 transactions are now functional.
Another recurring bug was that the FIFO underflowed or overflowed when this could have been avoided. It was fixed by defining custom almost_full and almost_empty signals that control when to stop reading from or writing to the FIFO.
5.5 Well-balanced design criteria
As mentioned in Chapter 1, four criteria were set to compare the implemented method against previous research: visibility, flexibility, logic overhead and debugging cycle duration. Each is analyzed below.
• Visibility: Although the visibility is not on par with solutions such as SignalTap, the IP provides a decent level of detail. First, there is a series of status registers that are useful when anything abnormal happens on the line. Second, it provides direct access to the data between two IPs, which cannot be achieved with external measuring devices.
• Flexibility: The IP has a rather rich list of features that enable a variety of test scenarios. Simultaneous insertion and capture, a configurable address range and multiple modes of operation are some of the more useful ones.
• Logic overhead: As explained in the previous section, the logic overhead, or resource usage, is relatively low. Each instance of the Snapshot IP takes 0.3-0.4% of the FPGA logic resources and 0.5% of the FPGA memory block resources, so it is safe to say that the logic overhead is small.
• Debugging cycle: The Snapshot IP is designed to work in final products, so it should be constantly accessible when needed. To reset the Snapshot IP for a new round of validation or debugging, all that is needed is to reset a few control registers, after which the IP is ready for another round. This resets the design and the status registers to an idle state, while all other configuration from the previous round stays the same. It should be noted that this applies only from the Snapshot IP's point of view; other IPs in the system might need a more complex resetting procedure, which is not in the scope of this thesis.
Overall, the Snapshot IP met the criteria that were set. The resource usage is at an acceptable level, and the IP provides the level of visibility it was designed for. It is developed to function constantly and can quickly be prepared for a new round of insertion or capture. The IP also meets the overall performance level needed for the project, and no known bugs remain in it.
6. CONCLUSION
In this thesis, we proposed and implemented a custom method, in the form of an IP called the Snapshot IP, to help designers and testers debug and validate a system more efficiently, with balanced properties in terms of visibility, flexibility, resource usage and debugging cycle time. These properties were compared to earlier academic research. The IP targets Intel Stratix 10 SX FPGAs and was originally designed to be implemented as part of a larger project.
We first explored different methods proposed in academia and compared their benefits and deficiencies. We then defined a set of criteria: debugging visibility, physical resource usage, flexibility (that is, the availability of features) and the amount of effort needed between debugging rounds. Finally, we implemented an IP that strikes a reasonable balance among all of these criteria.
Authors such as Karpagam and Viswanathan, Poulos et al. and Graham et al. have all done work based on direct design modifications that avoid re-compilation. These modifications are made at the place-and-route or later steps of the compilation process. This type of approach reduces the debugging cycle time, as no re-compilation is needed, and it also offers minimal logic overhead. Nonetheless, these methods provide no benefit in terms of visibility into the system internals.
A different approach is taken by authors such as Kumar et al., Ko et al. and Yang and Touba, who have proposed methods that utilize trace buffers to increase visibility into the system. Kumar et al. and Yang and Touba intelligently acquire only the data that is potentially erroneous. Ko et al., however, took a different path to increasing visibility and flexibility: they used multiple triggers and multiple buffers, with the option to fragment the data between the buffers to maximize buffer usage.
The proposed method has a rich list of features that can be used to capture ongoing data to a memory for later comparison against golden data, or to insert a vector of data into an adjacent IP under test. It can also be used for self-validation: the inserted data can be fed back into the design and captured by the IP, and the original inserted data can then be compared against the captured data to ensure the validity of the IP.
The Snapshot IP also provides reasonable visibility into the system, as it can specifically monitor the datapath between two adjacent IPs. The captured data is saved into an external memory, so it can be dumped for further analysis. The status registers of the IP can also alert users to any malfunction during capturing or insertion. Moreover, considering the list of available features, the resource usage is fairly low: each instance of the IP uses around 3-4K ALMs of the FPGA and 52 M20K memory blocks, with a utilization rate of 98.5%. That translates to 0.3-0.4% of the logic resources and 0.5% of the memory resources available on the selected FPGA. Regarding the flexibility of the design, all features are available to the tester through a software-controlled register interface that can be modified according to the user's preference without a need to recompile the design. The design is also easy to reset and can be made ready for a new round of debugging by modifying a few control registers.
The capture running modes are divided into single and continuous modes: single modes stop when they reach the end of the defined address range, while continuous modes continue until some other condition, such as a manual trigger or a certain BCN value, becomes true. The capture functionality is designed to cope with the amount of incoming traffic and to handle it efficiently.
Insertion, on the other hand, has fewer modes to operate in, due to the nature of the insertion functionality itself. It always aligns with the defined BCN value to guarantee the correct timing of the outgoing data. Insertion can send out packets in a series of pre-defined formats that the user selects at the beginning of each round.
Regarding the limitations of the design, it should be noted that this method was developed as a complementary debugging and validation tool. It can be used either as the first level of validation before starting system-level validation test cases, or as a secondary tool when a malfunction in the system has been found and the user seeks a more precise location of the issue.
Another limitation worth a remark is the limited ability of insertion to produce an arbitrary format on the output data. This is a result of the output complexity it would cause: the AXI4-Stream packets are formed locally within the IP, and the number of possible formats is too large to be coded. Furthermore, this IP was done in the scope of a bigger project, and although the feature could be useful for another project, it was not a necessary feature for the current one.
The last limitation discussed here is that although the Snapshot IP is developed to maximize the AXI4 capabilities, performance fluctuations are still possible while transferring data to or from an external memory. If multiple concurrent high-performance operations target one memory, the performance of the Snapshot IP might decrease. This is partially mitigated by supporting two simultaneous memory accesses for the Snapshot IP.
To conclude, the Snapshot IP has met the defined criteria: it is a debugging and validation method that is flexible, efficient in resource usage and offers more visibility than typical on-chip solutions, while keeping the debugging process easy. The IP is designed to function in a high-performance environment. In spite of having some limitations, the IP offers many features that can be utilized to locate bugs and to help testers validate the design more thoroughly.
7. REFERENCES
[1] Accellera Systems Initiative Inc. IP-XACT User Guide [Internet]. Mar 2018. Available: https://www.accellera.org/images/downloads/standards/ip-xact/IP-XACT_User_Guide_2018-02-16.pdf
[2] ARM. AMBA AXI and ACE Protocol Specification: AXI3, AXI4, and AXI4-Lite, ACE and ACE-Lite, Revision E [Internet]. 2013 Feb 22. Available: https://developer.arm.com/documentation/ihi0022/e/
[3] ARM. AMBA 4 AXI4-Stream Protocol Specification [Internet]. 2010 Mar 03. Available: https://developer.arm.com/documentation/ihi0051/a
[4] Arto Perttula. Lecture 13: Clock and Synchronization, course material, TIE-50206 Logic Synthesis, Tampere University of Technology, Feb 2018. Available: http://www.tkt.cs.tut.fi/kurssit/50200/S17/Kalvot/Lecture%2013%20-%20Clock%20and%20Synchronization.pdf
[5] Graham, Paul, Brent Nelson, and Brad Hutchings. "Instrumenting bitstreams for debugging FPGA circuits." The 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'01). IEEE, 2001.
[6] Intel Corporation. Intel Stratix 10 Logic Array Blocks and Adaptive Logic Modules User Guide [Internet]. 2020 Apr 24. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/ug-s10-lab.pdf
[7] Intel Corporation. FIFO Intel FPGA IP User Guide [Internet]. 2020 Dec 14. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug_fifo.pdf
[8] Intel Corporation. Intel Hyperflex Architecture High-Performance Design Handbook [Internet]. 2020 Jul 13. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/s10_hp_hb.pdf
[9] Intel Corporation. Intel Quartus Prime Pro Edition User Guide: Debug Tools [Internet]. 2020 Sep 28. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/ug/ug-qpp-debug.pdf
[10] Intel Corporation. Intel Stratix 10 GX/SX Product Table [Internet]. [date unknown]. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/pt/stratix-10-product-table.pdf
[11] Intel Corporation. Intel Stratix 10 Hard Processor System Technical Reference Manual [Internet]. 2021 Feb 23. Available: https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/stratix-10/s10_5v4.pdf
[12] Karpagam, R. S., and B. Viswanathan. "Design, Test and Evaluation of Trace-Buffer Inserted FPGA System." Artificial Intelligence and Evolutionary Computations in Engineering Systems. Springer, New Delhi, 2016. 1039-1048.
[13] Ko, Ho Fai, Adam B. Kinsman, and Nicola Nicolici. "Design-for-debug architecture for distributed embedded logic analysis." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 19.8 (2010): 1380-1393.
[14] Kumar, Binod, et al. "A Methodology to Capture Fine-Grained Internal Visibility During Multisession Silicon Debug." IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2020).
[15] Poulos, Zissis, et al. "Leveraging reconfigurability to raise productivity in FPGA functional debug." 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2012.
[16] Xilinx Inc. FPGA Bitstream [Internet]. 2019. Available: https://www.xilinx.com/html_docs/xilinx2019_1/SDK_Doc/SDK_concepts/concept_fpgabitstream.html
[17] Yang, Joon-Sung, and Nur A. Touba. "Improved trace buffer observation via selective data capture using 2-D compaction for post-silicon debug." IEEE Transactions on Very Large Scale Integration (VLSI) Systems 21.2 (2012): 320-328.