Automated Debugging Framework for
High-level Synthesis
by
Li Liu
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering, University of Toronto
Copyright © 2013 by Li Liu
Abstract
Automated Debugging Framework for
High-level Synthesis
Li Liu
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2013
High-level synthesis (HLS) is an automatic compilation technique that translates a soft-
ware program to a hardware circuit [10]. This process is intended to make hardware
design easier. HLS techniques have been studied for more than 20 years and a number
of HLS tools have been developed in both industry and academia. However, verifying
correctness of HLS tools can sometimes be difficult due to a lack of benchmarks.
This thesis proposes an automated test case generation technique for verifying/debugging
HLS tools. The work presented in this thesis builds a framework that automatically gen-
erates random programs with user-specified features/characteristics. These programs are
used to verify the correctness of HLS tools by comparing the output of hardware gener-
ated by HLS to the original software. Thus, users can have a large number of benchmarks
to test their HLS algorithms without having to manually develop test programs. The
framework also provides additional ways of analyzing the performance of HLS tools.
Rather than being a replacement to the existing verification tools, this debugging
framework should serve as a useful complement to other existing test suites. Together,
they can provide a more comprehensive verification/debugging and analysis for HLS
tools.
Acknowledgements
First, I would like to thank my parents for raising me and giving me the chance to study
abroad. They have always given me support spiritually and financially.
I would like to thank Professor Stephen Brown for financially supporting me and
giving me the opportunity to work in this research group and to be a part of such an
intriguing research project.
I would like to thank Professor Jason Anderson for all the daily summer meetings,
weekly status meetings, and for the many insightful ideas and suggestions. Both of you
and Professor Brown have been amazing mentors.
I would like to thank Professor Nicola Nicolici from McMaster University. You are
the person who brought me into this field. I will always remember that you told me,
“don’t behave like currents who always take the low pass. Instead, take the high pass,
you will gain more eventually.”
I would also like to thank Andrew Canis for the numerous discussions and help and
Jongsok Choi for helping me with my grammar checking and thesis writing.
In addition, I would like to thank my girlfriend, Sue, for all the times you knocked on my
head and asked me to sleep early, even though I rarely do.
Finally, I would like to thank all my friends for all the good times we had together.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 High-Level Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 LegUp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 LLVM Intermediate Representation . . . . . . . . . . . . . . . . . . . . . 5
2.4 Control Flow Graph and Data Flow Graph . . . . . . . . . . . . . . . . . 6
2.4.1 Control flow graph (CFG) . . . . . . . . . . . . . . . . . . . . . . 6
2.4.2 Data flow graph (DFG) . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Resource sharing and pattern matching in HLS . . . . . . . . . . . . . . 7
2.6 Verification techniques for HLS . . . . . . . . . . . . . . . . . . . . . . . 8
2.6.1 Formal Verification . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.6.2 Assertion-based Verification . . . . . . . . . . . . . . . . . . . . . 10
2.6.3 Manually developed test suites . . . . . . . . . . . . . . . . . . . . 11
3 Implementation 13
3.1 Overall debugging flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 Test case generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Parameters used in the generator . . . . . . . . . . . . . . . . . . 14
3.2.1.1 Size control . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1.2 Structure control . . . . . . . . . . . . . . . . . . . . . . 18
3.2.2 Summary of parameters . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.3 Graph generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3.1 Graph Structure . . . . . . . . . . . . . . . . . . . . . . 20
3.2.3.2 CFG generation . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.3.3 CFG loop generation . . . . . . . . . . . . . . . . . . . . 25
3.2.3.4 Multiple hierarchies of CFGs . . . . . . . . . . . . . . . 27
3.2.3.5 DFG generation . . . . . . . . . . . . . . . . . . . . . . 28
3.2.3.6 Patterns in DFG . . . . . . . . . . . . . . . . . . . . . . 33
3.2.4 LLVM IR generation . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.5 Generate main wrapper function . . . . . . . . . . . . . . . . . . 39
3.3 HW/SW results verification and Analysis . . . . . . . . . . . . . . . . . . 40
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Experiments 41
4.1 Effect of depth factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Analysis of pattern matching . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.3 An alternative binding algorithm . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Runtime analysis for LegUp’s pattern matching algorithm . . . . . . . . 47
4.5 Comparison with CHStone . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5.1 Diversity: CHStone vs. Auto-generated test cases . . . . . . . . . 49
4.5.1.1 Diversity of CHStone test suite . . . . . . . . . . . . . . 49
4.5.1.2 Diversity of auto-generated test cases . . . . . . . . . . . 50
4.5.2 Size: CHStone vs. Auto-generated test cases . . . . . . . . . . . . 51
4.5.2.1 Size of CHStone test suite . . . . . . . . . . . . . . . . . 51
4.5.2.2 Size of auto-generated test cases . . . . . . . . . . . . . 53
4.5.3 Synthesizability: CHStone vs. Auto-generated test cases . . . . . 54
4.5.3.1 Synthesizability of CHStone test suite . . . . . . . . . . 54
4.5.3.2 Synthesizability of auto-generated test cases . . . . . . . 54
4.5.4 Usability: CHStone vs. Auto-generated test cases . . . . . . . . . 55
4.5.4.1 Usability of CHStone test suite . . . . . . . . . . . . . . 55
4.5.4.2 Usability of auto-generated test cases . . . . . . . . . . . 55
4.5.5 Code coverage comparison . . . . . . . . . . . . . . . . . . . . . . 56
4.5.5.1 CHStone code coverage in LegUp . . . . . . . . . . . . . 56
4.5.5.2 Auto-generated test programs code coverage in LegUp . 57
4.6 Bugs detected in LegUp 2.0 release . . . . . . . . . . . . . . . . . . . . . 59
4.6.1 Problem with shift instructions . . . . . . . . . . . . . . . . . . . 59
4.6.2 A LegUp produced Verilog file hangs at Quartus II compilation . 62
4.7 Detecting injected bugs in LegUp . . . . . . . . . . . . . . . . . . . . . . 62
4.7.1 Disabling Live Variable Analysis . . . . . . . . . . . . . . . . . . . 62
5 Conclusion 64
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.2.1 Input vector range analysis for test programs . . . . . . . . . . . . 65
5.2.2 Back tracing the error points . . . . . . . . . . . . . . . . . . . . 65
5.2.3 Customizable pattern injection . . . . . . . . . . . . . . . . . . . 66
A Add new operation type to the framework 67
B Experimental results for replicating patterns in a single basic block 70
C Experimental results for replicating basic blocks 74
D Experimental results for pattern matching runtime 89
E Experimental results for size factor 95
F Experimental results for depth factor effects 101
Bibliography 107
List of Tables
3.1 Parameters that control the graph generation. . . . . . . . . . . . . . . . 16
4.1 Brief description of the CHStone benchmark programs. . . . . . . . . . . 49
4.2 C code level characteristics of CHStone benchmark programs . . . . . . . 52
4.3 Differences in code coverage by CHStone and auto-generated tests . . . . 58
4.4 Shift instructions used in CHStone benchmarks . . . . . . . . . . . . . . 61
B.1 Synthesized circuit size as increasing TEMPLATE POOL RATIO parameter within one BB (0.1–0.5) . . . . . 70
B.2 Synthesized circuit size as increasing TEMPLATE POOL RATIO parameter within one BB (0.6–1.0) . . . . . 72
C.1 Synthesized circuit size as increasing number of replicated basic blocks (0–8) . . . . . 74
C.2 Synthesized circuit size as increasing number of replicated basic blocks (10–18) . . . . . 79
C.3 Synthesized circuit size as increasing number of replicated basic blocks (20–30) . . . . . 83
D.1 Runtime measurement as circuit size increases (10–40), PM = pattern matching . . . . . 89
D.2 Runtime measurement as circuit size increases (50–80), PM = pattern matching . . . . . 92
E.1 Synthesized circuit size as increasing Basic Block size factor (10–40) . . . . . 95
E.2 Synthesized circuit size as increasing Basic Block size factor (50–80) . . . . . 98
F.1 Execution cycles as increasing Depth Factor (0.1–0.3) . . . . . 101
F.2 Execution cycles as increasing Depth Factor (0.35–0.55) . . . . . 103
F.3 Execution cycles as increasing Depth Factor (0.6–0.8) . . . . . 104
F.4 Execution cycles as increasing Depth Factor (0.85–1) . . . . . 106
List of Figures
2.1 The Clang front end and LegUp synthesis flow. . . . . . . . . . . . . . . 5
2.2 DFG with use-define chain example. . . . . . . . . . . . . . . . . . . . . . 7
2.3 Demonstration of resource sharing. . . . . . . . . . . . . . . . . . . . . . 8
2.4 Timing analysis using assertion based verification. . . . . . . . . . . . . . 10
3.1 Overall Verification Flow. . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 An example of a configuration file . . . . . . . . . . . . . . . . . . . . . . 20
3.3 Network hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Dominators in CFG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.5 Assigning operations in the network using a pool of operations. . . . . . . 29
3.6 DFG with patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.7 An example of PHI instruction in LLVM IR . . . . . . . . . . . . . . . . 36
3.8 How branch is translated. . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.9 Code inserted for type conversion. . . . . . . . . . . . . . . . . . . . . . . 38
3.10 Code inserted to avoid dividing by zero in an integer division. . . . . . . 39
3.11 Code inserted to avoid dividing by zero in a floating point division. . . . 39
3.12 An example of a main wrapper function. . . . . . . . . . . . . . . . . . . 39
4.1 Depth factor controls the shape of networks . . . . . . . . . . . . . . . . 42
4.2 Depth Factor controls total execution cycles of circuits . . . . . . . . . . 43
4.3 Resource sharability of replicated patterns within one BB. . . . . . . . . 44
x
4.4 Unsharable patterns in LegUp . . . . . . . . . . . . . . . . . . . . . . . . 45
4.5 Resource sharability as replicating basic blocks. . . . . . . . . . . . . . . 46
4.6 Runtime measurement as BLOCK SIZE FACTOR increases . . . . . . . 48
4.7 Incidence of operations per CHStone benchmark program (quoted from [14]) 50
4.8 Source level analysis and synthesized circuit size . . . . . . . . . . . . . . 52
4.9 Source level analysis and synthesized circuit size . . . . . . . . . . . . . . 53
4.10 Self-contained test vector in CHStone. . . . . . . . . . . . . . . . . . . . 55
4.11 LegUp code coverage by each CHStone benchmark. . . . . . . . . . . . 57
4.12 LegUp code coverage by each auto-generated test program. . . . . . . . 58
4.13 Configuration file with 4 basic operations . . . . . . . . . . . . . . . . . . 60
4.14 Configuration file with 4 basic operations and shl instructions . . . . . . 60
4.15 An example of arithmetic shift right in C . . . . . . . . . . . . . . . . . . 61
4.16 Example of a variable’s life cycle . . . . . . . . . . . . . . . . . . . . . . . 63
A.1 String constant added in AutoConfig.h . . . . . 67
A.2 “else if” case added in AutoConfig.c (Operation Index has to be explicitly assigned) . . . . . 68
A.3 “else if” added in CFGNtk.cpp (Operation Index is the one used in Figure A.2) . . . . . 68
A.4 Case added in CFGNtk.cpp (Operation Index is the one used in Figure A.2) . . . . . 68
A.5 An example of a configuration file with the newly added operation . . . . . 69
Chapter 1
Introduction
1.1 Motivation
Back in the early 1990s, most of the commercial HLS tools from the major EDA companies
(such as Synopsys, Cadence, and Mentor Graphics) used behavioural hardware descrip-
tion languages (HDLs), such as VHDL and Verilog, as their inputs to produce gate-level
RTL circuits [13]. However, C-based programming languages such as ANSI-C and Sys-
temC have become an important trend in replacing HDLs since the late 1990s [26] [7].
There are several reasons for such a change:
• Most embedded software is written in C/C++ hence C-based languages make hard-
ware/software hybrid-systems easier to design.
• Execution of a C program is much faster than simulation of hardware.
• A large number of existing algorithms are written in C.
• The number of software developers far exceeds the number of hardware developers.
Better HLS tools can balance the inequity by making hardware design easier for
software developers [7].
HLS consists of a series of steps, which are traditionally known as allocation, scheduling,
binding and RTL generation. These steps make debugging of HLS tools complicated. For
example, a minor change in scheduling produces different finite state machines (FSM),
which significantly impacts the results of binding and the generated RTL circuits. De-
spite these challenges, verification/debugging is crucial from the perspective of helping
researchers evaluate their new ideas and algorithms.
Researchers have spent a large amount of effort in verifying the correctness of HLS
tools using various techniques. One of the techniques is called bounded model checking
[8]. Bounded model checking establishes abstract models from input/output systems
and translates them into temporal logic expressions. The input and output temporal
logic expressions are then proved to be equivalent (or not) using SAT solvers.
However, before the formal method can be applied, additional steps are required to convert
behavioural descriptions into mathematical system models, which adds complexity
to debugging. Formal verification is also a time-consuming process whose runtime
increases exponentially as the number of input variables grows. In addition to bounded
model checking, various standard benchmark suites have been used since the 1990s.
However, the HLS community has not yet reached a consensus on the
necessary and sufficient requirements for C-based HLS benchmark programs.
In this thesis, we propose an automated test case generation and debugging frame-
work for HLS tools. This framework can create a large number of random test programs
with user-specified characteristics and later verify these programs by comparing the re-
sults from software execution and hardware simulation. By having such a framework,
developers of HLS tools can have a vast supply of test cases, which compensates for the
lack of standard benchmarks for HLS. In addition, our tool can generate test programs
with a large diversity of characteristics and sizes while remaining easy to use.
Such a framework not only helps developers to verify their HLS algorithms,
but also helps to analyze the quality of synthesized results.
1.2 Contributions
The principal objective of this research is to enable automated test case generation/debugging
for HLS tools. The contributions of this thesis are:
• Enabling automated test case generation and verification for high-level synthesis
tools.
• Enabling developers to create a vast number of test programs based on user speci-
fications.
1.3 Thesis Organization
The rest of this thesis is organized as follows:
Chapter 2 provides background information on the LegUp HLS tool and the LLVM
framework. It also describes some important concepts used in this thesis, such as control
flow graphs, data flow graphs as well as how resource sharing is implemented in HLS.
In addition, it also introduces several other verification techniques used in current HLS
tools and discusses their advantages and disadvantages.
Chapter 3 describes the implementation details of the debugging framework. It in-
cludes the overall debugging flow, graph representation overview, CFG/DFG graph gen-
eration algorithms and the graph-to-LLVM IR interpretation.
Chapter 4 describes the experiments based on our debugging framework. It introduces
experiments showing the usage of different parameters that control the graph generation,
as well as experiments measuring the performance of LegUp on resource sharing with
suggestions for future improvements. The test cases generated with our tool are also
compared to the state-of-the-art manually developed benchmark suite. Lastly, this chapter
describes bugs which are detected in LegUp by our tool.
Chapter 5 presents concluding remarks and suggestions for future work.
Chapter 2
Background
2.1 High-Level Synthesis
High-level synthesis (HLS) is a compilation technique that transforms a software be-
havioural description into a hardware circuit description with equivalent functionality
[10]. It is sometimes referred to as behavioural synthesis or C-to-gates synthesis, as HLS
often uses ANSI C/C++/SystemC (or even Java) as it input. The HLS flow is tradi-
tionally divided into four different steps [23]: allocation, scheduling, binding, and RTL
generation. Allocation decides how many resources are needed in hardware, and binding
maps the instructions and variables to hardware components, such as adders, multipliers,
and registers. Scheduling divides the software behaviour into control steps which are used
to define the states in a finite state machine (FSM). Each control step contains a small
section of code that can be executed in a single clock cycle in hardware. Scheduling also
optimizes the number of execution steps based on hardware resource limits and cycle
time. RTL generation creates HDL code based on the previous steps. The generated
HDL can then be synthesized to a hardware circuit by a logic synthesis tool. The goal of
HLS is to allow developers to describe their designs using a higher level of abstraction,
similar to the flow used in the design of software programs.
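As a toy illustration of control steps (all names here are invented; this is a sketch, not LegUp's output), the following C fragment evaluates (a + b) * c with a three-state FSM, one operation per state, mirroring how a schedule assigns work to clock cycles:

```c
/* Toy model of an HLS schedule (illustrative only). The expression
 * (a + b) * c is split across control steps; each case of the switch
 * models one FSM state, i.e. one clock cycle in hardware. */
int scheduled_eval(int a, int b, int c) {
    int state = 0, t1 = 0, result = 0, done = 0;
    while (!done) {
        switch (state) {
        case 0:  t1 = a + b;      state = 1; break;  /* control step 0: add       */
        case 1:  result = t1 * c; state = 2; break;  /* control step 1: multiply  */
        case 2:  done = 1;                   break;  /* final state: output ready */
        }
    }
    return result;
}
```

A scheduler with more resources might merge steps 0 and 1 into a single state (chaining the add and multiply), trading cycle time for cycle count.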
Figure 2.1: The Clang front end and LegUp synthesis flow.
2.2 LegUp
The debugging framework in this thesis is built within a larger project called LegUp
[7]. LegUp is an open source high-level synthesis tool being developed at the University
of Toronto. The LegUp framework allows researchers to improve C-to-Verilog synthesis
without building an infrastructure from scratch. Its long-term vision is to make hardware
for FPGAs that produces good results using a software-like flow.
LegUp uses the Low-Level Virtual Machine (LLVM) compiler framework. It is the
same framework used by Apple for iOS development. LLVM uses an intermediate repre-
sentation (IR), which is an assembly-like machine independent language. LegUp utilizes
Clang to compile C/C++ code into the LLVM IR. Clang is an open source compiler
front end for C, C++ and Objective-C. It offers a replacement for the GNU Compiler
Collection (GCC) that translates source code languages to an intermediate representa-
tion (IR). Later, the LLVM IR is translated into RTL using various optimization passes.
Figure 2.1 shows the synthesis flow of LegUp. The goal of this project is to generate
input circuits for this flow at the LLVM IR level, which enables improved testing of the
compiler optimization and LegUp synthesis steps in the flow.
2.3 LLVM Intermediate Representation
LLVM is a compiler infrastructure written in C++ [3]. It provides a framework with
a complete compiler system, taking intermediate representation (IR) code as its input
from a compiler front end and producing an optimized IR. This optimized IR can then
be translated and linked into machine-specific assembly code for a target platform (e.g.
MIPS, x86). LLVM can accept the IR from the GCC tool chain or Clang (used by
LegUp), which allows different compilers to be used with LLVM.
LLVM IR [3] uses static single assignment (SSA) form that provides type safety,
low-level operations, flexibility, and the capability of representing high-level languages
clearly. It is the common code representation used throughout all phases of the LLVM
compilation strategy. It is often written in a file with a .ll file extension. In the case of
LegUp, compilation starts from the LLVM IR level and takes these .ll files as its input.
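To make the connection concrete, here is a small C function together with a comment sketching the rough shape of the SSA-form IR a front end could emit for it; the IR lines are an approximation for illustration, not actual Clang output:

```c
/* A small C function; the comment sketches (approximately) the kind of
 * SSA-form LLVM IR a front end could produce for it. */
int madd(int a, int b, int c) {
    /* Roughly, in SSA-form LLVM IR (each value is assigned exactly once):
     *   %1 = mul nsw i32 %a, %b
     *   %2 = add nsw i32 %1, %c
     *   ret i32 %2
     */
    return a * b + c;
}
```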
2.4 Control Flow Graph and Data Flow Graph
2.4.1 Control flow graph (CFG)
A control flow graph (CFG) is a data structure that is built on top of the intermediate
representation to abstract the control flow behaviour of functions [17]. It is a directed
graph where nodes represent basic blocks and edges represent possible control flow from
one basic block (BB) to another. It contains information about a program’s execution
paths and loops. A basic block is a maximal section of straight-line code which can
only be entered via the first instruction of the block and can only be exited via the last
instruction.
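The definition above can be illustrated on a small C function; the BB labels in the comments are hypothetical markers showing where a compiler would cut the code into basic blocks:

```c
/* Comments mark where a compiler would cut this function into basic
 * blocks: each BB is a maximal straight-line run, entered only at its
 * first instruction and exited only at its last. Labels are illustrative. */
int clamp_abs(int x, int limit) {
    int y;                /* BB0: entry; ends at the first branch       */
    if (x < 0)
        y = -x;           /* BB1: negative case                         */
    else
        y = x;            /* BB2: non-negative case                     */
    if (y > limit)        /* BB3: join point; ends at the second branch */
        y = limit;        /* BB4: clamp to the limit                    */
    return y;             /* BB5: exit block                            */
}
```

The CFG for this function has six nodes; the edges BB0→BB1, BB0→BB2, BB1→BB3, and so on are the possible control transfers.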
2.4.2 Data flow graph (DFG)
To graphically represent relationships between variables and operations, data flow graphs
are used in the design. A DFG captures the use-define relationship between every pair of
connected operations. Figure 2.2 shows an example of a DFG; each arrow indicates a
use-define relation between the operations it connects. For instance, the dotted arrow in
Figure 2.2 can be read as “the subtracter uses the variable defined by the adder”.

Figure 2.2: DFG with use-define chain example.
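A DFG of this kind can be sketched as a small data structure. The following C fragment (a toy layout, not the thesis's internal representation) stores each node's operation together with the indices of the nodes that define its operands, and evaluates a node by following those use-define edges:

```c
/* A miniature DFG encoded as an array: each node records its operation
 * and the indices of the nodes that define its operands -- its
 * use-define edges. Toy layout for illustration only. */
enum op { OP_INPUT, OP_ADD, OP_SUB };

struct dfg_node {
    enum op op;
    int src0, src1;   /* defining nodes of the two operands */
    int value;        /* constant value for OP_INPUT nodes  */
};

/* Evaluate a node by walking its use-define edges. */
static int eval_node(const struct dfg_node *g, int i) {
    if (g[i].op == OP_INPUT)
        return g[i].value;
    int a = eval_node(g, g[i].src0);
    int b = eval_node(g, g[i].src1);
    return g[i].op == OP_ADD ? a + b : a - b;
}

/* Node 2 adds the two inputs; node 3 subtracts input 1 from that sum,
 * so "the subtracter uses the value defined by the adder". */
int demo_value(void) {
    struct dfg_node g[] = {
        { OP_INPUT, 0, 0, 10 },   /* node 0: input = 10 */
        { OP_INPUT, 0, 0, 4  },   /* node 1: input = 4  */
        { OP_ADD,   0, 1, 0  },   /* node 2: n0 + n1    */
        { OP_SUB,   2, 1, 0  },   /* node 3: n2 - n1    */
    };
    return eval_node(g, 3);       /* (10 + 4) - 4 = 10  */
}
```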
2.5 Resource sharing and pattern matching in HLS
Resource sharing is an area reduction technique used in the binding step of HLS. It
involves assigning multiple operations to the same hardware unit and using control logic
to multiplex input and output signals. For example, in Figure 2.3, the adder and sub-
tracter are shared by four different instructions (instructions 1 and 3 share the
same adder; instructions 2 and 4 share the same subtracter). Ideally this structure
should reduce the size of the circuit by a factor of two. However, the 2-to-1 multiplexers
used at the adder/subtracter’s inputs offset these area reductions and can even lead to
a larger circuit (and a lower clock frequency).
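The sharing scheme can be sketched in a few lines of C (illustrative only): the sel argument stands in for the 2-to-1 input multiplexers, steering either instruction's operands into the one physical adder on different cycles:

```c
/* Toy model of resource sharing: one physical adder serves two
 * "instructions"; sel plays the role of the 2-to-1 input multiplexers
 * that steer each instruction's operands into the shared unit. */
int shared_adder(int sel, int a0, int b0, int a1, int b1) {
    int a = sel ? a1 : a0;   /* input mux for operand a */
    int b = sel ? b1 : b0;   /* input mux for operand b */
    return a + b;            /* the single shared adder */
}
```

The area question is exactly whether the two muxes here cost less than the adder they eliminate.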
A graph-based pattern matching algorithm for area reduction in LegUp is presented
in [12]. It shows that certain patterns of operations occur multiple times in a program.
These patterns create opportunities for sharing larger composite functional units comprised
of multiple operations. This thesis will illustrate how our tool is used to analyze
the performance of pattern matching in LegUp.

Figure 2.3: Demonstration of resource sharing.
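As a toy analogue (a sketch in C, whereas the algorithm in [12] operates on graphs, not source code): if the two-operation pattern (a + b) - c appears at several sites, a single composite unit can serve all of them, with input muxes (omitted here) selecting the operands:

```c
/* Toy illustration of pattern sharing: the two-operation pattern
 * (a + b) - c occurs at several sites, so a single composite "add-sub"
 * unit can serve all occurrences. */
int addsub_unit(int a, int b, int c) {
    return (a + b) - c;   /* shared composite functional unit */
}

/* Two occurrences of the pattern, both bound to the same unit. */
int site_one(int x, int y) { return addsub_unit(x, y, 3); }
int site_two(int x, int y) { return addsub_unit(y, x, x); }
```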
2.6 Verification techniques for HLS
In this section we describe other existing verification/debugging techniques which are
used for validating HLS tools.
2.6.1 Formal Verification
Formal verification [8] is a method of proving or disproving the validity of a system’s
behaviour using mathematical methods, with respect to a set of formal specifications,
constraints, or properties. One approach is called model checking, which consists
of an exhaustive exploration of a system’s mathematical model. This requires the system
to be abstracted as a model of a finite state machine with data path (FSMD) described in
some temporal logic expression. With a set of specifications, constraints and properties,
the logic expression can form a boolean equation which is solvable by SAT solvers. On
the other hand, an infinite system model can also be checked by using bounded model
checking (BMC), which bounds the number of states to a limit. For instance, an infinite
loop has to be bounded to a limited number of iterations when translating it to an FSMD.
In terms of verifying high-level synthesis tools, researchers have spent a large amount
of effort on formally verifying correctness of the scheduling process since the input to the
scheduler can be changed in many ways. For example, the control structure of the input
behaviour may be modified by the path-based scheduler [6] as it tries to merge some
consecutive path segments. Also, incorporation of several code-motion techniques [20] in
the scheduling process leads to movements of operations across basic-block boundaries.
These optimizations result in scheduling that does not have a one-to-one correspondence
with the input, which makes the scheduler verification a challenging part of the HLS
verification. A formal method presented in [16] specifically verifies the correctness of the
scheduling process. It uses a finite state machine with data path (FSMD) to represent
both software and hardware schedules in a formal logic format and solves their equivalence
using SAT solvers.
Clarke et al. [8] present a way of using bounded model checking to verify the con-
sistency of behaviours for C and Verilog programs. Given an ANSI-C program and a
Verilog circuit, both are translated into a formula similar to an FSMD that represents
behavioural consistency. The formula is then checked using SAT. Note that the ANSI-C
program and Verilog circuit have no HLS connections. In other words, to verify a cir-
cuit, one has to manually develop a specifically formatted C program that is functionally
equivalent to the Verilog circuit.
Formal methods prove that a tool is correct by using mathematical methods to
check equivalence between hardware and software behaviours. However, there are at
least two disadvantages of this approach. First, formal verification usually involves a
SAT solver which has an exponentially increasing runtime with linearly increasing input
#include <assert.h>
#include <time.h>

void function(void) {
    clock_t A, B;
    A = clock();
    //
    // Some lines of instructions to be verified
    //
    B = clock();
    assert((B - A) < 100);   // check the timing constraint
}
Figure 2.4: Timing analysis using assertion based verification.
size [9]. In addition, to verify the hardware and software systems, one has to translate
them to a common expression. Such a translation process can itself create mismatches,
which increases the chance of introducing errors.
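The spirit of such checking can be conveyed by a toy in C (illustrative only; a real BMC flow translates both descriptions into a formula and hands it to a SAT solver rather than enumerating): two descriptions of an 8-bit average are shown equivalent by exhausting the bounded input space:

```c
#include <stdint.h>

/* Bounded checking shrunk to a toy: prove two descriptions of the
 * same behaviour equivalent by exhausting a bounded input space. */
uint8_t spec_avg(uint8_t a, uint8_t b) {
    return (uint8_t)(((unsigned)a + (unsigned)b) / 2u);
}

uint8_t impl_avg(uint8_t a, uint8_t b) {
    /* overflow-free average: (a & b) + ((a ^ b) >> 1) */
    return (uint8_t)((a & b) + ((a ^ b) >> 1));
}

int bounded_equiv_check(void) {
    for (unsigned a = 0; a < 256; a++)        /* bound: all 8-bit inputs */
        for (unsigned b = 0; b < 256; b++)
            if (spec_avg((uint8_t)a, (uint8_t)b) !=
                impl_avg((uint8_t)a, (uint8_t)b))
                return 0;                     /* counterexample found */
    return 1;                                 /* equivalent within the bound */
}
```

The exponential blow-up mentioned above is visible even here: widening the inputs from 8 to 32 bits makes exhaustive enumeration infeasible, which is why real flows rely on SAT solvers.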
2.6.2 Assertion-based Verification
Curreri et al. [11] propose another technique called assertion-based verification. This
technique enables an HLS tool to compile C assertions into hardware and form a processor-
accelerator architecture. During hardware execution, assertions are checked at specified
points; if any assertion fails, the processor receives an interrupt.
By enabling assertions in HLS, a developer can have more options for debugging. One
can use assertions to not only verify whether the logic is valid at a certain point, but
also check if the timing constraints are met. This can be done by using code similar to
Figure 2.4. One of the advantages of such a technique is that a user can define arbitrary
specifications which need to be verified and these results can be checked at runtime.
However, it requires a processor-accelerator architecture to be used and a user needs to
manually inject the checking conditions into the original design.
2.6.3 Manually developed test suites
Manually developed benchmark suites are important for researchers to effectively evaluate
their new ideas and algorithms for HLS. From the late 1980s to the mid 1990s, the HLS
research community made efforts to develop standard benchmark suites for HLS,
and as a result, two sets of benchmark designs, the High Level Synthesis Workshop 1992
Benchmarks [22] and the 1995 High Level Synthesis Design Repository [21] were released
by the University of California. However, most of these designs were written in VHDL,
and the language for HLS has gradually changed from HDLs to C-based languages.
Eight benchmarks from the High Level Synthesis Design Repository [21] were
written in C; however, they were small programs with fewer than one hundred lines of
code. These benchmarks can still be useful for studies on loop pipelining and memory
access optimization since these features can be exercised even in relatively small programs.
However, more complex benchmarks are needed to make HLS a practical solution for
larger designs. On the other hand, benchmark programs which are widely used in the field
of computer architecture and compilers are too large and complex for current hardware
synthesis. For instance, C programs in SPEC [4], EEMBC [5], and MediaBench [18] are
not synthesizable even by state-of-the-art HLS tools [14].
LegUp is currently using the CHStone benchmarks as its primary test suite [14], which
is a set of 12 C programs for high-level synthesis. Some key features of the CHStone
benchmarks are as follows:
• CHStone is developed for HLS researchers to analyze the effectiveness and correct-
ness of their new techniques, algorithms, and implementations.
• CHStone consists of 12 programs which are selected from various application do-
mains such as arithmetic, media processing, and security.
• The programs in CHStone are relatively large in terms of source level analysis (e.g.
number of lines of code) and synthesized circuit area.
• All the programs in CHStone have been confirmed to be synthesizable by LegUp
and eXCite (a commercial HLS tool).
• Test vectors are self-contained and no external libraries are necessary.
• CHStone is available to the public.
Our framework is built around LegUp, which uses CHStone as its primary test
suite. We will demonstrate how our tool complements CHStone as a tool for HLS
debugging.
Chapter 3
Implementation
This chapter introduces the design architecture of the debugging framework as well as
the detailed implementation algorithms used for each component of the framework.
3.1 Overall debugging flow
To accomplish our goal of automatically generating test cases and verifying HLS tools,
our debugging framework consists of the following steps:
1. Load a configuration file that gives the user-settable parameters for our test generator.
2. Generate graphs, which include CFGs, DFGs, and patterns.
3. Generate LLVM IR from the graphs created in Step 2. This is the test program
that will be executed in software and compiled to hardware.
4. Execute the test program in software with an interpreter to obtain the software
result.
5. Compile the test program to hardware with LegUp to obtain generated RTL.
6. Simulate the RTL with ModelSim to produce the hardware result.
7. Compare the software and hardware results to verify correctness.
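Steps 4 to 7 amount to running the same program twice and comparing one printed value. The following is a minimal Python sketch of that comparison, not the framework's actual driver; the function names are illustrative assumptions, and the `return_val:` output format is taken from the generated main wrapper described in Section 3.2.5:

```python
import re

def parse_result(output: str):
    """Extract the printed return value (e.g. 'return_val:42') from a
    program's output; returns None if no result line is found."""
    m = re.search(r"return_val:(\d+)", output)
    return int(m.group(1)) if m else None

def verdict(sw_output: str, hw_output: str) -> str:
    """Compare the software (interpreter) and hardware (simulation)
    results; a test case passes only when both results match."""
    sw, hw = parse_result(sw_output), parse_result(hw_output)
    if sw is None or hw is None:
        return "ERROR"  # one of the runs produced no result line
    return "PASS" if sw == hw else "FAIL"
```

For example, `verdict("return_val:123", "return_val:123")` yields `"PASS"`, while mismatched values yield `"FAIL"`.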
Figure 3.1 illustrates the detailed flow. Referring to the labels in the figure: in step 1, the tool generates random graphs based on user specifications (e.g. size, number of I/Os, operation usage, pattern usage, etc.). In step 2, the IR generator reads the graphs and fills in each node of the graph with LLVM IR instructions to produce a complete .ll file. This file is a program that can be executed by the LLVM IR interpreter and can also be compiled to Verilog by the LegUp tool in step 3. The generated program has a single output, which is printed at the end of program execution and recorded as the software result for later comparison. The produced Verilog RTL code is simulated with ModelSim to obtain the hardware result. Finally, the software and hardware results are compared: the test case is considered passed if the two results match, and failed otherwise. Note that the generated tests are not passed through any software compiler optimization flows, because optimization may eliminate the very features users intend to test.
3.2 Test case generation
In this section, we describe the implementation details of test case generation: what the user-given parameters are, how the graphs are generated, and how the graphs are converted to LLVM IR.
3.2.1 Parameters used in the generator
Although the graph is randomly generated in our tool, a user can also control some of the
characteristics of the generated graph by parameters shown in Table 3.1. In this section,
we highlight the effects of several of these parameters.
Table 3.1: Parameters that control the graph generation.

Parameter Name                 | Abbreviation         | Effect on Network
Number of Basic Blocks         | BB_NUM               | Number of blocks
Number of Inputs               | INPUT_NUM            | Number of inputs to the test function
Number of Outputs              | OUTPUT_NUM           | Number of outputs of the test function
Depth Factor                   | DEPTH_FACTOR         | The "narrowness" of the network graph
Constant Density               | CONST_DENSITY        | Probability of creating constants
Loop Enable                    | ENABLE_LOOP          | Allow the network to contain loops
Array Input Enable             | ARRAY_INPUT          | Treat the input variables as an array
Seed                           | SEED                 | The seed used for random generation
Max CFG Number                 | MAX_CFG_NUM          | Maximum number of CFGs (functions)
Sub-Function Enable            | ENABLE_SUB_FUNC      | Enable multiple hierarchies of functions
Max Sub-Function Level         | MAX_SUB_FUNC_LVL     | Maximum hierarchy of function calls
Fix Block Size                 | FIX_BLOCK_SIZE       | Fix the DFG size for all blocks
Block Size Factor              | BLOCK_SIZE_FACTOR    | Scalar for varying block size
Disable Zero Avoidance         | NO_ZERO_AVOIDANCE    | Do not avoid dividing by zero
Enable Pattern                 | ENABLE_PATTERN       | Let the network use patterns
Pattern Ratio                  | PATTERN_RATIO        | Ratio of nodes covered by patterns
Template Pool Ratio            | TEMPLATE_POOL_RATIO  | Ratio of patterns used as "stamps"
Pattern Size                   | PATTERN_SIZE         | Size of each pattern
Pattern Input Size             | PATTERN_INPUT_SIZE   | Number of inputs to each pattern
Number of Replicated BBs       | NUM_REPLICATED_BBS   | Create replicated blocks

32-bit Add                     | ADD    | Fraction of 32-bit adds
32-bit Sub                     | SUB    | Fraction of 32-bit subtracts
32-bit Mult                    | MULT   | Fraction of 32-bit multiplies
32-bit Div                     | DIV    | Fraction of 32-bit divides
64-bit Add                     | LADD   | Fraction of 64-bit adds
64-bit Sub                     | LSUB   | Fraction of 64-bit subtracts
64-bit Mult                    | LMULT  | Fraction of 64-bit multiplies
64-bit Div                     | LDIV   | Fraction of 64-bit divides
32-bit Floating Add            | FADD   | Fraction of 32-bit floating-point adds
32-bit Floating Sub            | FSUB   | Fraction of 32-bit floating-point subtracts
32-bit Floating Mult           | FMULT  | Fraction of 32-bit floating-point multiplies
32-bit Floating Div            | FDIV   | Fraction of 32-bit floating-point divides
64-bit Floating Add            | DADD   | Fraction of 64-bit floating-point adds
64-bit Floating Sub            | DSUB   | Fraction of 64-bit floating-point subtracts
64-bit Floating Mult           | DMULT  | Fraction of 64-bit floating-point multiplies
64-bit Floating Div            | DDIV   | Fraction of 64-bit floating-point divides
32-bit Left Shift              | SHL    | Fraction of 32-bit left shifts
32-bit Logic Right Shift       | LSHR   | Fraction of 32-bit logical right shifts
32-bit Arithmetic Right Shift  | ASHR   | Fraction of 32-bit arithmetic right shifts
64-bit Left Shift              | LSHL   | Fraction of 64-bit left shifts
64-bit Logic Right Shift       | LLSHR  | Fraction of 64-bit logical right shifts
64-bit Arithmetic Right Shift  | LASHR  | Fraction of 64-bit arithmetic right shifts
3.2.1.1 Size control
The parameters that control the size of the generated network are:
• BB_NUM. This parameter determines the total number of basic blocks (excluding the entry and exit blocks) in the top-level CFG. Larger values of this parameter increase the complexity of the test program, resulting in more branches, loops, and sub-functions.
• INPUT_NUM. This parameter specifies the number of input variables for the testFunc function. testFunc is the top-level test function called by the main wrapper. If the parameter ARRAY_INPUT is set, the input to the function becomes an array in which the number of elements is equal to INPUT_NUM.
• OUTPUT_NUM. This parameter specifies the number of output variables returned by the top-level testFunc function. For this function, however, only a single output is allowed, since only one result is used for the final comparison. Users should note that this single-output constraint applies only to the top-level testFunc function by default. Sub-functions can return multiple values, which are packed into an array with the corresponding number of elements.
• BLOCK_SIZE_FACTOR. This parameter is a scalar used in the equation

  DFG size = BLOCK_SIZE_FACTOR × (number of DFG inputs + number of DFG outputs)

to determine the number of operations in each DFG. The generator requires this parameter to be larger than 2 to ensure that each input node has at least one connection.
• FIX_BLOCK_SIZE. By default, the size of the DFG in each basic block varies depending on its number of inputs/outputs and the BLOCK_SIZE_FACTOR. It can be difficult for a user to control the size of the entire network, as the number of inputs/outputs is randomly generated for each DFG. Setting the FIX_BLOCK_SIZE parameter fixes the size of each DFG.
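As a quick worked example of the size equation above, the following helper is illustrative only, not part of the generator:

```python
def dfg_size(block_size_factor: int, num_inputs: int, num_outputs: int) -> int:
    """DFG size = BLOCK_SIZE_FACTOR * (inputs + outputs).
    The generator requires BLOCK_SIZE_FACTOR > 2 so that every
    input node gets at least one connection."""
    if block_size_factor <= 2:
        raise ValueError("BLOCK_SIZE_FACTOR must be larger than 2")
    return block_size_factor * (num_inputs + num_outputs)

# e.g. a block with 2 inputs, 1 output and BLOCK_SIZE_FACTOR = 10
# yields 10 * (2 + 1) = 30 operations.
```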
3.2.1.2 Structure control
This section describes a list of parameters that control the structure of the generated
network. They include:
• ENABLE_SUB_FUNC. This parameter enables the generator to create sub-functions. In other words, it allows the basic blocks in a CFG to contain other CFGs instead of DFGs. Since sub-function creation is a recursive process, limits must be set for it to terminate. To do this, two other parameters must be specified:
– MAX_CFG_NUM. This parameter limits the total number of CFGs in the entire network. The actual number of created CFGs will be less than or equal to this number.
– MAX_SUB_FUNC_LVL. This parameter specifies the maximum level of sub-functions in the entire network.
• ENABLE_PATTERN. A pattern is a small network of operations that resides inside a DFG; a template is a definition of a pattern (similar to the class concept in object-oriented programming). This parameter enables the generator to use patterns and templates during its DFG creation (more details are given in Section 3.2.3.6). Four other parameters are required:
– PATTERN_RATIO. It specifies the fraction of nodes in the DFG covered by patterns.
– TEMPLATE_POOL_RATIO. It specifies the fraction of patterns in the DFG which are defined as templates. A template is a definition of a pattern that will be instantiated and placed in the network. The DFG generator maintains a list of such templates; the size of the template list is given by TEMPLATE_POOL_RATIO × total number of patterns.
– PATTERN_SIZE. It specifies the size of the patterns generated in DFGs.
– PATTERN_INPUT_SIZE. It specifies the number of inputs to each pattern.
• NUM_REPLICATED_BBS. This parameter specifies the number of replicated basic blocks in the network. The reason for having such a parameter is to enable a user to test functional-unit sharing across basic blocks.
• DEPTH_FACTOR. This parameter is used in CFG, DFG, and pattern generation. More details about this parameter are given as we introduce the algorithms that use it. In addition, the experiments in Section 4.1 show the effects of the DEPTH_FACTOR graphically and statistically.
3.2.2 Summary of parameters
In this section, we have described the parameters that control graph generation. To combine a set of parameters into a configuration, a user can supply an auto-configuration file with the argument "-auto-config". An example configuration file is shown in Figure 3.2. This set of parameters tells the generator to create a test program with 10 basic blocks, 2 scalar input variables, and a single output. The size of each basic block is fixed, with a block size factor of 10. The only operations that can appear in the program are additions and divisions; they are assigned equal weights, meaning that either operation is equally likely to appear. In addition, 8 of the 10 basic blocks are replications of each other.
BB_NUM 10
INPUT_NUM 2
OUTPUT_NUM 1
DEPTH_FACTOR 0.5
ARRAY_INPUT 0
ADD 0.5
DIV 0.5
FIX_BLOCK_SIZE 1
NUM_REPLICATED_BBS 8
BLOCK_SIZE_FACTOR 10
Figure 3.2: An example of a configuration file
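A file in this format is straightforward to parse into key/value pairs. The sketch below is illustrative only, not the framework's actual loader:

```python
def load_config(text: str) -> dict:
    """Parse a whitespace-separated KEY VALUE configuration file.
    Values containing '.' are stored as floats, others as ints."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        key, value = line.split()
        config[key] = float(value) if "." in value else int(value)
    return config

# The configuration from Figure 3.2:
example = """\
BB_NUM 10
INPUT_NUM 2
OUTPUT_NUM 1
DEPTH_FACTOR 0.5
ARRAY_INPUT 0
ADD 0.5
DIV 0.5
FIX_BLOCK_SIZE 1
NUM_REPLICATED_BBS 8
BLOCK_SIZE_FACTOR 10
"""
```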
3.2.3 Graph generation
3.2.3.1 Graph Structure
In this section, the structure of the generated graph is introduced. The hierarchy of the
network is graphically represented in Figure 3.3. According to the figure, at the highest
level, the tool describes a function as a control flow graph (CFG) which contains the
interactions between basic blocks (BB) (rectangles in Figure 3.3). The CFG has a single
entry and a single exit block. Each BB contains either a data flow graph (DFG) or another CFG. If it contains a CFG, that CFG is mapped to a sub-function called by the current function. If it contains a DFG, the DFG represents the interconnections between operations in this BB. Each DFG can have multiple inputs and outputs.
Each DFG is filled with nodes (circles in Figure 3.3) and patterns (highlighted squares in Figure 3.3). A node is the smallest representation unit in our structure. It can represent an operation (addition, subtraction, multiplication, division, etc.), a constant, an input, or an output in the DFG. A pattern, on the other hand, is a small network of a fixed number of operations. Patterns can be stamped into DFGs in order to produce replicated structures of operations.
3.2.3.2 CFG generation
The generation starts from the highest level of the graph, which is the control flow graph.
A CFG of a program is a directed graph which can be represented as: G = (N,E). G
represents the graph, N represents a set of basic blocks in the graph and E represents a
set of edges that connect basic blocks. In addition, the entry block in G is the point where the graph starts, and the exit block in G is the point where the graph ends. Starting from the entry block, treating it as the root node, the generator builds the network with a breadth-first-search-like traversal, given in Algorithm 1. This algorithm is described below:
1. From line 1 to line 5, the generator initializes a queue and a CFG, creates the entry and exit blocks in the CFG, and puts the entry block into the queue.
2. From line 6 to line 24, the generator repeats the steps from 3 to 6 below until all
the nodes have been created.
3. The generator takes a block N0 off the queue if the queue is not empty.
4. If the number of created nodes is less than the total number of nodes that needs
to be created (there are more blocks to create), then:
(a) Randomly assign a number to N0. This number represents how many fanouts
N0 has (if N0 is the entry, its fanout number is always 1 for convenience,
otherwise, this value can be either 1 or 2).
(b) If N0 is the last element in the queue, N0 must connect to at least one newly created node Nf; otherwise the queue would be depleted and the generation process could not terminate.
(c) Otherwise, for each fanout of N0, depending on the "depth factor" specified by the user, N0 can either be connected to a newly created node Nf or left unconnected. The generator resolves the unconnected points in later steps.
(d) If any of N0's fanouts is newly created, put Nf in the queue.
5. Else if there are no more nodes to create:
(a) Connect N0 to the exit block if no node is connected to the exit block yet.
6. Go back to step 2 until the queue is empty.
7. At line 25 of the algorithm, the generator updates the level of each block using Algorithm 2. A node is assigned a lower level if it is closer to the input (e.g. the input node is level 0; the output node has the maximum level).
8. From line 26 to the end, the generator checks each node; if its expected fanout number (assigned at step 4a) is greater than the number of fanouts it is actually connected to, the generator randomly assigns a node at a higher level as its fanout.
Data: BBcap, depth_factor
Result: G = (N, E)
 1:  Queue Q;
 2:  CFG G = (N, E);
 3:  create Entry and Exit blocks in G;
 4:  put Entry into Q;
 5:  createdBB = 0;
 6:  while Q is not empty do
 7:      N0 = dequeue Q;
 8:      if createdBB ≠ BBcap then
 9:          fanout_num = rand() % 2 + 1;
10:          if N0 is Entry then
11:              fanout_num = 1;
12:          end
13:          expectedFanoutNumber(N0) = fanout_num;
14:          for i = 0; i < fanout_num; i++ do
15:              createNewBlock = ((rand() % 100) < depth_factor * 100) ? 1 : 0;
16:              if Q is empty OR createNewBlock == 1 then
17:                  create Nf in G;
18:                  make connection between Nf ← N0;
19:                  put Nf into Q;
20:                  createdBB++;
21:              end
22:          end
23:      end
24:  end
25:  update each block's level using Algorithm 2;
26:  foreach n ∈ N do
27:      if expectedFanoutNumber(n) > number of fanouts n already has then
28:          randomly choose a basic block NR such that level(NR) > level(n);
29:          make connection between NR ← n;
30:      end
31:  end
Algorithm 1: Build a control flow graph.
Data: G = (N, E)
Result: level(v), ∀ v ∈ N
 1:  level(v) = 0, ∀ v ∈ N;
 2:  changed = true;
 3:  while changed do
 4:      changed = false;
 5:      Queue q;
 6:      foreach v ∈ N where v is an input to G do
 7:          push v onto q;
 8:      end
 9:      while q is not empty do
10:          node = q.pop();
11:          foreach fanout fo of node do
12:              old_level = level(fo);
13:              level(fo) = max(level(fi), ∀ fi), where the fi are the fanin nodes of fo;
14:              push all fi onto q;
15:              if level(fo) ≠ old_level then
16:                  changed = true;
17:              end
18:          end
19:      end
20:  end
Algorithm 2: Update the level information of each node in the network.
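The level update can be sketched in Python. The version below is a simplified variant, not a transcription of Algorithm 2: it assumes an acyclic graph, processes nodes in topological order, and places each node one level above its deepest fan-in (inputs at level 0, outputs at the maximum level), matching the level convention described above. The names are illustrative:

```python
from collections import deque

def update_levels(fanins: dict) -> dict:
    """Compute a level for every node of a DAG given its fan-in lists.
    Nodes with no fan-ins (inputs) get level 0; every other node sits
    one level above its deepest fan-in, so outputs end up deepest."""
    # Derive fan-out lists and in-degrees from the fan-in map.
    fanouts = {n: [] for n in fanins}
    indeg = {n: len(fi) for n, fi in fanins.items()}
    for n, fi in fanins.items():
        for p in fi:
            fanouts[p].append(n)
    level = {n: 0 for n in fanins}
    q = deque(n for n, d in indeg.items() if d == 0)  # start at the inputs
    while q:
        node = q.popleft()
        for fo in fanouts[node]:
            level[fo] = max(level[fo], level[node] + 1)
            indeg[fo] -= 1
            if indeg[fo] == 0:   # all fan-ins levelled; safe to process
                q.append(fo)
    return level
```

For a diamond-shaped graph a → {b, c} → d, this assigns a level 0, b and c level 1, and d level 2.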
The procedure above describes the creation of a control flow graph. Note that a user can specify a number of parameters for this generation, such as the total number of blocks, the number of inputs, and the depth factor (DF). DF is a parameter introduced to the generator that controls the probability at step 4c. It can be any value between 0 and 1 (inclusive). A larger DF makes it more likely that N0 connects to a newly created node Nf. This parameter gives a user the ability to control the shape of the generated graph. In later experiments, it was found that a higher DF makes the CFG wider, which results in a circuit with a longer execution time.
3.2.3.3 CFG loop generation
At this point, a complete CFG has been created without any loops. In this implementation, the generator only creates natural loops. In order to assign natural loops, the
generator needs to do an analysis to find all the dominators for each basic block due to
the following reasons [19]:
• Dominator: Let G = (N, E) denote a CFG, and let d and n be basic blocks, d, n ∈ N. d is said to dominate n, denoted d → n, iff every path from the entry to n contains d.
– For instance, in Figure 3.4, BB1 → BB1; BB1 → BB2; BB1 → BB3; BB1 → BB4; BB2 → BB2; BB2 → BB3; BB2 → BB4; BB3 → BB3; BB4 → BB4.
• A natural loop has a single entry or head node h ∈ N; the loop can only be entered through h. In a program, as long as there are no goto statements, all loops are natural.
• A natural loop has an exit or tail node t ∈ N.
• Therefore, h has to be a dominator of t.
Data: G = (N, E)
Result: DOM(v), ∀ v ∈ N
 1:  DOM(Entry) = {Entry};
 2:  DOM(v) = N, ∀ v ∈ N − {Entry, Exit};
 3:  changed = true;
 4:  while changed do
 5:      changed = false;
 6:      foreach v ∈ N − {Entry, Exit} do
 7:          oldDOM = DOM(v);
 8:          DOM(v) = N;
 9:          foreach p ∈ predecessor(v) do
10:              DOM(v) = DOM(v) ∩ DOM(p);
11:          end
12:          DOM(v) = DOM(v) ∪ {v};
13:          if DOM(v) ≠ oldDOM then
14:              changed = true;
15:          end
16:      end
17:  end
Algorithm 3: Find dominators for each basic block.
Figure 3.4: Dominators in CFG
Since the generator only generates natural loops, the back edge of each loop can only point from a basic block to one of its dominators. Before assigning any loops, the analyzer finds the dominators of each basic block to create a list of potential loops. The generator then chooses loops from the list based on user specifications, or chooses loops randomly, and assigns a number of iterations to the back edge of each loop. To find the dominators of each basic block, Algorithm 3 is used. In this algorithm, DOM(v) denotes the set of dominators of basic block v. Lines 1 and 2 initialize the dominator set of the entry block to be itself, and the dominator sets of all other basic blocks (except the entry and exit blocks) to be the whole set of basic blocks N. From line 3 to the end, the algorithm scans through each basic block and computes the intersection of the dominator sets of the block's predecessors; this intersection, together with the block itself, becomes the block's new dominator set. The procedure from line 4 to the end is repeated until none of the basic blocks' dominator sets change any more. When the algorithm terminates, each basic block has its list of dominators.
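The iterative dominator computation of Algorithm 3 can be sketched directly in Python. The CFG used in the test below is an assumed shape consistent with the dominance relations listed for Figure 3.4:

```python
def find_dominators(preds: dict, entry: str) -> dict:
    """Iterative dominator computation in the style of Algorithm 3:
    DOM(entry) = {entry}; for every other block v,
    DOM(v) = {v} ∪ (intersection of DOM(p) over all predecessors p).
    `preds` maps each block to the list of its predecessor blocks."""
    nodes = set(preds)
    dom = {v: set(nodes) for v in nodes}   # initialize to the full set
    dom[entry] = {entry}
    changed = True
    while changed:                          # iterate to a fixed point
        changed = False
        for v in nodes - {entry}:
            new = set(nodes)
            for p in preds[v]:
                new &= dom[p]
            new |= {v}
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom
```

For a CFG BB1 → BB2, BB2 → BB3, BB2 → BB4, BB3 → BB4, BB1 dominates every block and BB2 dominates BB2, BB3, and BB4, while BB3 does not dominate BB4 (it can be bypassed via the BB2 → BB4 edge).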
3.2.3.4 Multiple hierarchies of CFGs
Based on whether the user needs multiple hierarchies of functions, the generator can randomly choose a number of basic blocks to contain other control flow graphs. The total number of sub-functions cannot exceed the parameter MAX_CFG_NUM, and the maximum level of function calls cannot exceed the parameter MAX_SUB_FUNC_LVL (a description of all the parameters can be found in Table 3.1). Basic blocks that do not contain CFGs are represented by data flow graphs. The generator goes through each BB (other than the entry and exit blocks): if a BB contains a CFG, the CFG generator of Section 3.2.3.2 is invoked again; otherwise, the DFG generator is called. (By the definition of a basic block in compiler terminology, a basic block cannot contain a function. In this implementation, however, the definition has been relaxed so that an entire basic block can be mapped to a CFG in order to enable sub-function calls.)
3.2.3.5 DFG generation
This section discusses how a data flow graph is generated for each basic block. As shown
in Figure 3.3, a DFG is also a directed, acyclic graph (DAG) that represents the data
transaction network for a portion of a program [19]. A DFG can have multiple input and
output nodes. A DFG is represented as g = (v, e), where v represents a set of nodes, and
e represents a set of directed edges that connect the nodes together to form a network.
In our case, each node in a DFG represents either an operation, an input, an output, a constant, or even a pattern of several operations. Each edge in a DFG represents a data dependency (a use-define relationship) between the two connected nodes.
To generate a data flow graph, a user has to specify at least the following parameters:
the number of inputs, the number of outputs, the maximum number of nodes for the
target DFG, the depth factor, and the constant density (a probability parameter that
determines how often constants appear in the DFG network). Once the generator receives the input parameters, it starts generating the graph using an algorithm similar to the one used for CFG generation: a breadth-first-search-like traversal, described in Algorithm 4. The algorithm is described below:
1. From line 1 to line 9, the DFG generator creates all the input and output nodes
for the graph. In addition, all the output nodes are placed in a queue.
Figure 3.5: Assigning operations in the network using a pool of operations.
2. From line 10 to line 32, the generator repeats the steps from 3 to 4 below until the
queue is empty.
3. Dequeue a node N0 from Q. If the number of created nodes is less than the total number of nodes that need to be created (i.e. there are more nodes to create), continue; otherwise go to step 5.
(a) Create connections to the inputs of node N0. Based on the depth factor,
its fan-in points can either be connected to a newly created node or left as
unconnected. If the fan-in node is newly created, put it into Q. Such a node
can be an operation with a randomly assigned operator, a constant with a
random number, or even a pattern.
• The probability of a certain operation occurring is determined by its corresponding fraction factor, described in the second half of Table 3.1. There are two ways of assigning operations.
– The generator assigns operations randomly using the rand() function.
This implementation gives the generated network more randomness.
For instance, if there are only additions and subtractions in the DFG
and the user wants them to have equal chance of occurrence, rand()
will generate a number between 1 and 100. The node is assigned to
an addition if the number is less than 50, otherwise a subtraction is
assigned.
– Before the generation starts, the generator creates a pool of all possible operations, where the size of the pool is the same as the number
of operation nodes. During the creation of the DFG network, a newly
created operation node randomly picks and removes an operation from
the pool, and assigns that operation to the node. Figure 3.5 illustrates
graphically how the operations are picked from the pool and put into
the network. This implementation ensures that the fraction of occurrence for each operation exactly matches the user specification.
One should note that the creation of the pool is not random, but the
generator picks operations from the pool randomly.
• The probability of creating a constant is determined by the constant density factor shown in Table 3.1.
4. Repeat step 3 until Q is empty.
5. In line 33, the generator updates the level of each node using Algorithm 2. A node
is assigned to a lower level if it is closer to the inputs.
6. From line 34 to the end, the generator scans through each node, and if there are any
fan-in points left unconnected, randomly assign a node at a lower level (including
input nodes) to that fan-in point.
7. Repeat step 6 until every node has been assigned a sufficient number of fan-ins, namely the number of operands of the operation assigned to the node (e.g. an adder should have 2 fan-ins).
The procedure above describes the creation of a data flow graph. The depth factor
(DF) in step 3a is a parameter similar to the one used for generating a CFG to control
the shape of the graph.
The test case generator allows the user to choose which operations are used in the test functions. The second half of Table 3.1 shows the operations currently supported by our tool. The value of each parameter has to be between 0 and 1, and indicates the fraction of occurrence for that type of operation. For instance, if a user specifies 0.4 for the addition operation and 0.6 for the subtraction operation, the generator will generate a network with only these two operations, among which 40% are additions and 60% are subtractions.
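The pool-based assignment scheme can be sketched as follows; the helper name and the fraction-dictionary input format are illustrative assumptions, not the generator's actual interface:

```python
import random

def build_operation_pool(fractions: dict, num_op_nodes: int, seed: int = 0) -> list:
    """Build a pool whose composition exactly matches the requested
    fractions, e.g. {'ADD': 0.4, 'SUB': 0.6} with 10 operation nodes
    gives 4 ADDs and 6 SUBs. During DFG creation, each newly created
    operation node would pop one entry from this pool, so the final
    operation mix exactly matches the user specification."""
    pool = []
    for op, frac in sorted(fractions.items()):
        pool += [op] * round(frac * num_op_nodes)
    # The pool's content is deterministic; only the picking order is random.
    random.Random(seed).shuffle(pool)
    return pool
```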
Data: NodeCap, depth_factor, inputNum, outputNum
Result: G = (N, E)
 1:  Queue Q;
 2:  DFG G = (N, E);
 3:  for i = 0; i < inputNum; i++ do
 4:      create node Pi in G;
 5:  end
 6:  for i = 0; i < outputNum; i++ do
 7:      create node Po in G;
 8:      put Po into Q;
 9:  end
10:  createdNode = 0;
11:  while Q is not empty do
12:      N0 = dequeue Q;
13:      if createdNode ≠ NodeCap then
14:          if N0 is an output then
15:              fanin_num = 1;
16:          end
17:          else
18:              fanin_num = expected number of operands for the operation assigned to N0;
19:          end
20:          expectedFaninNumber(N0) = fanin_num;
21:          for i = 0; i < fanin_num; i++ do
22:              createNewNode = ((rand() % 100) < depth_factor * 100) ? 1 : 0;
23:              if Q is empty OR createNewNode == 1 then
24:                  create Nf in G;
25:                  randomly assign an operation to Nf;
26:                  make connection between Nf → N0;
27:                  put Nf into Q;
28:                  createdNode++;
29:              end
30:          end
31:      end
32:  end
33:  update all the node levels using Algorithm 2;
34:  foreach n ∈ N do
35:      if expectedFaninNumber(n) > number of fanins n already has then
36:          randomly choose a node NR such that level(NR) < level(n);
37:          make connection between NR → n;
38:      end
39:  end
Algorithm 4: Build a data flow graph.
The major differences between the CFG generator and the DFG generator are:
• A CFG only has a single input (entry) and a single output (exit).
• Although both generators use a breadth-first-search-like traversal to create networks, the CFG generator scans from input to output whereas the DFG generator scans in the opposite direction, for the following reasons:
– Scanning from input to output in a CFG gives the generator the ability to
control the number of branches in the network as it always assigns fan-out
numbers to blocks.
– Scanning from output to input in a DFG makes building up the network easier
because the number of inputs to each node is fixed once the type of node is
determined.
• A DFG network has constants. These are the nodes without any fan-ins whereas
in a CFG, the only block without any inputs is the entry block.
• A DFG network has connections between nodes and patterns.
3.2.3.6 Patterns in DFG
As described earlier, a DFG can contain patterns. Our framework implements patterns as a subclass of nodes, as the two can co-exist in the same DFG, as shown in Figure 3.3. A pattern inherits some of the characteristics of a node: both have fan-in and fan-out nodes connected to them. In addition, a pattern contains a small network consisting of a group of nodes (no constants). Using patterns gives a user more control over the structure of the DFG network. A certain group of patterns can be defined as a set of stamps. A stamp is a pattern that can potentially be copied and placed elsewhere in the network. The group of stamps is called the template pool. When the generator creates the DFG, it can use these stamps to fill out the network, which easily creates replicated
structures within the DFG. With these replicated structures, a user can evaluate how well an HLS tool can detect patterns and create sharable hardware [12].
The parameters PATTERN_RATIO and TEMPLATE_POOL_RATIO are required in order to create patterns in a DFG. The detailed usage of this feature is described in Section 3.2.1.2.
During the process of creating a DFG, if a node is about to be created and determined
to be a pattern, the generator either creates a new pattern and puts it into the template
pool, or selects one of the patterns from the pool to instantiate it and place the instance
into the network. A user can pre-define pattern structures manually, or just specify the size and the number of inputs/outputs and let the generator create the patterns randomly. The algorithm for generating random patterns is similar to the one used for DFG generation. Figure 3.6 shows an example of a DFG containing patterns (different patterns are
shown with different shapes: solid circle, dotted circle and solid square). In this graph,
each pattern is of size 3 with 4 inputs and a single output. For instance, the pattern
represented by solid circles has the form of ADD-SUB-ADD, the pattern represented
by dotted circles has the form of SUB-SUB-ADD, and the pattern represented by solid
squares has the form of ADD-ADD-SUB.
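The stamping idea can be sketched as follows; representing a pattern as a plain list of operation names (e.g. the ADD-SUB-ADD pattern above) is an illustrative simplification, and the helper name is an assumption:

```python
import random

def stamp_patterns(template_pool: list, num_instances: int, seed: int = 0) -> list:
    """Instantiate patterns by copying ('stamping') templates from the
    pool. Each instance is an independent copy of a template's
    operation list, so the DFG ends up with replicated structures."""
    rng = random.Random(seed)
    # list(...) copies the chosen template, so instances are independent.
    return [list(rng.choice(template_pool)) for _ in range(num_instances)]
```

For a pool holding the ADD-SUB-ADD and SUB-SUB-ADD templates, five stamped instances are five fresh copies drawn from those two shapes.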
After all the DFGs have been generated, a graph that represents the behaviour of a
program has been created. The next step is to interpret this graph and translate it into
a compilable software program.
3.2.4 LLVM IR generation
This section discusses how the test case generator interprets the graph generated from
Section 3.2.3 and finally creates a compilable software program. As LegUp compiles software from the LLVM IR level, the debugging framework uses the LLVM API to create an LLVM module and produce an LLVM IR file. An LLVM module represents the top-level structure of an LLVM program. It contains a list of Functions, a list of GlobalVariables,
Loop: ;Infinite loop that counts from 0 on up...
%i = phi i32 [ 0, %LoopHeader ], [ %next_i, %Loop ]
%next_i = add i32 %i, 1
br label %Loop
Figure 3.7: An example of PHI instruction in LLVM IR
and a SymbolTable [3].
The CFG at the top level of the hierarchy is translated into a test function named
testFunc. Any CFGs inside it will be considered as sub-functions called by testFunc.
Sub-functions are created recursively by the same routine described in Section 3.2.3.2.
All of the DFGs are translated into basic blocks. Branch/conditional branch instructions
are inserted at the end of each basic block and PHI instructions are inserted at the front
of blocks with multiple fan-ins. At runtime, the PHI instruction takes the value specified
by the predecessor basic block that executes just before the current basic block [3]. An
example of a PHI instruction is shown in Figure 3.7. For this section of LLVM IR code, the variable %i takes the value 0 if the program has just entered the loop; otherwise, %i takes the value of the variable %next_i.
As CFGs are mapped into functions, the generator traverses each CFG using breadth-first search and creates a one-to-one mapping between blocks and LLVM basic blocks or sub-functions. Each conditional branch is controlled by a branch-control input of the function. This way, the caller of the function has the ability to determine how the callee function executes. An example of how a branch in a CFG is translated is shown in Figure 3.8. As a result, the function's signature contains all of the input variables of the CFG, followed by a list of control signals, one for each branching basic block.
To interpret the DFG in a basic block, the generator similarly traverses the DFG network using breadth-first search and creates a one-to-one mapping between nodes and operations/inputs/outputs/constants. Patterns are flattened and replaced by the networks they contain. The input variables of a basic block have the
%18 = sext i32 %12 to i64
%19 = sext i32 %13 to i64
%20 = add i64 %18, %19
%21 = trunc i64 %20 to i32
Figure 3.9: Code inserted for type conversion.
use-define relationship with the output variables of its fan-in blocks. While connecting
all of the operations together based on the DFG, there are two special cases which need
to be considered.
• Two connected operations can have different types (e.g. connecting a 32-bit integer
to a 64-bit operation or connecting an integer to a floating point operation). In
order to ensure the program’s correctness, conversion instructions are inserted.
For instance, Figure 3.9 shows how sext (sign extension) and trunc (truncation)
are used for type conversion between operations. In the case of integer-floating
point conversion, instructions sitofp (signed integer to floating point) and fptosi
(floating point to signed integer) are used.
• Dividing by zero must be avoided for programs to execute properly. To avoid this,
we insert instructions before denominators are used in divisions. In the case of
integer division, before the denominator is used, a bitwise OR is taken with the
constant 1 (as shown in Figure 3.10). Since x | 1 sets the low bit, the denominator
is always odd and therefore nonzero. In the case of floating point division, a few
more steps are needed: the denominator is first converted to an integer, then ORed
with the constant 1, and finally converted back to floating point for the division
(as shown in Figure 3.11). This not only avoids the divide-by-zero problem but also
ensures that the division result does not overflow, since the magnitude of the
denominator is at least 1 rather than a number very close to 0. The user also has
the ability to disable this feature.
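The guard insertion described above can be sketched as follows. This is a simplified illustration in Python, not the generator's actual code; the function name and the IR-emission style are our own.

```python
def guard_denominator(lines, den, next_reg, is_float=False):
    """Emit IR text that forces `den` to be odd, and hence nonzero,
    before it is used as a divisor (cf. Figures 3.10 and 3.11).

    `lines` is the list of IR lines being built, `den` the register
    holding the denominator, and `next_reg` the next free virtual
    register number. Returns (guarded_register, next_reg).
    """
    if is_float:
        # Float path: convert to integer, OR with 1, convert back.
        lines.append(f"%{next_reg} = fptosi float {den} to i32")
        lines.append(f"%{next_reg + 1} = or i32 %{next_reg}, 1")
        lines.append(f"%{next_reg + 2} = sitofp i32 %{next_reg + 1} to float")
        return f"%{next_reg + 2}", next_reg + 3
    # Integer path: x | 1 sets the low bit, so the result is odd
    # and can never be zero.
    lines.append(f"%{next_reg} = or i32 {den}, 1")
    return f"%{next_reg}", next_reg + 1
```

For example, guarding register %1 starting at virtual register 6 emits exactly the or instruction of Figure 3.10.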
%6 = or i32 %1, 1
%7 = sdiv i32 %5, %6
Figure 3.10: Code inserted to avoid dividing by zero in an integer division.
%16 = fptosi float %15 to i32
%17 = or i32 %16, 1
%18 = sitofp i32 %17 to float
%19 = fdiv float %14, %18
Figure 3.11: Code inserted to avoid dividing by zero in a floating point division.
3.2.5 Generate main wrapper function
After all of the graphs have been interpreted, a main wrapper function needs to be
created. It contains a function call to the testing function with random input variables, a
printf function that prints the returned value, as well as a return instruction to return the
result. The input arguments for the function call (actual inputs to testFunc and control
signals for the branches mentioned in Section 3.2.4) are generated randomly. Figure 3.12
shows an example of a wrapper function.
@.str = private constant [15 x i8] c"return_val:%u\0A\00", align 1
define i32 @main() {
%1 = call i32 @testFunc(i32 20, i32 30, i32 1, i32 1, i32 0)
%2 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds \
([15 x i8]* @.str, i32 0, i32 0), i32 %1)
ret i32 %1
}
Figure 3.12: An example of a main wrapper function.
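A wrapper of this shape can be emitted with a few lines of code. The sketch below is illustrative only (the generator's real emitter differs), and the 0–100 input range is an assumption of this sketch.

```python
import random

def make_main_wrapper(num_inputs, seed=None):
    """Emit an LLVM IR main() that calls testFunc with random i32
    arguments, prints the returned value, and returns it
    (cf. Figure 3.12)."""
    rng = random.Random(seed)
    args = ", ".join(f"i32 {rng.randint(0, 100)}" for _ in range(num_inputs))
    return "\n".join([
        '@.str = private constant [15 x i8] c"return_val:%u\\0A\\00", align 1',
        "define i32 @main() {",
        f"  %1 = call i32 @testFunc({args})",
        "  %2 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds "
        "([15 x i8]* @.str, i32 0, i32 0), i32 %1)",
        "  ret i32 %1",
        "}",
    ])
```

Calling make_main_wrapper(5) reproduces the shape of Figure 3.12 with fresh random arguments each time.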
3.3 HW/SW results verification and analysis
Up to this point, a compilable LLVM IR program (with the .ll suffix) has been created.
Recall from Figure 3.1 that the IR program is executed by the LLVM IR interpreter.
The printf function in the main wrapper provides a final result, marked as the software
result. On the other hand, the IR is compiled by a HLS tool (LegUp in this case) to a
Verilog file. This Verilog file contains the generated RTL, including a test bench, so that
it can be directly simulated by ModelSim. As LegUp compiles the printf functions into
the $display function in Verilog, the final simulation result is produced and marked as
the hardware result. Finally, if the hardware result and software result match, the
test case is marked as a pass; otherwise, a fail. Furthermore, the produced hardware can
also be synthesized using Quartus to gather additional hardware statistics (circuit size,
circuit speed).
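The pass/fail decision reduces to extracting one integer from each log and comparing. A sketch of that comparison is given below, assuming both the lli output and the ModelSim transcript contain a line of the form return_val:<n>; the exact transcript format is an assumption of this sketch.

```python
import re

def extract_result(log_text):
    """Pull the integer printed as `return_val:<n>` out of a tool log.
    Intended to work on both the lli output and the ModelSim transcript,
    assuming the $display line mirrors the printf format (an assumption
    of this sketch). Returns None if no result line is found."""
    m = re.search(r"return_val:(\d+)", log_text)
    return int(m.group(1)) if m else None

def verdict(sw_log, hw_log):
    """PASS if software and hardware agree, FAIL if they differ,
    ERROR if either run produced no result line."""
    sw, hw = extract_result(sw_log), extract_result(hw_log)
    if sw is None or hw is None:
        return "ERROR"
    return "PASS" if sw == hw else "FAIL"
```

The ERROR outcome is useful in practice: a simulation that hangs or crashes produces no result line at all, which should be distinguished from a clean mismatch.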
3.4 Summary
This chapter has presented the implementation details of our automated test case generator.
It described the algorithms used to generate control flow graphs, data flow graphs
and loops, and described how the generated graphs are translated into executable
programs. It also explained the uses of the different parameters which control how the
graphs are generated. In addition, it illustrated how each test case is verified.
Chapter 4
Experiments
In order to evaluate the novelty and usefulness of this framework, we have developed
several experiments. This chapter describes each experiment and its results.
4.1 Effect of depth factor
The depth factor is a parameter used in the CFG, DFG, and pattern
generation. Graphically, it controls the depth of a network: the smaller the value of
DEPTH FACTOR (the closer to 0), the deeper the network will be, as shown in Figure 4.1 (graph
B is deeper than graph A). In terms of synthesized hardware, a deeper circuit requires
more execution cycles. We have designed an experiment to analyze this:
• Set all other conditions/parameters the same and sweep the depth factor from 0.1
to 1 with 0.1 as the increment.
• For each depth factor, the generator creates 30 different tests. Compile and simulate
each test case.
• Measure the total execution cycles for each circuit and take geometric mean among
the tests with the same depth factor.
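The geometric mean in the last step can be computed with a small helper:

```python
import math

def geomean(values):
    """Geometric mean of positive measurements, e.g. the cycle
    counts of the 30 tests generated for one depth factor."""
    assert values and all(v > 0 for v in values)
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

The geometric mean is used rather than the arithmetic mean so that a few unusually slow circuits do not dominate each data point.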
Figure 4.1: Depth factor controls the shape of networks
According to Figure 4.2, the total execution cycles decrease as the depth factor
increases.
4.2 Analysis of pattern matching
One of the approaches for area reductions in HLS is resource sharing as described in
Section 2.5. LegUp uses a graph based pattern matching technique to search for replicated
patterns and share functional units. In order to measure the effectiveness of pattern
matching for area reduction, our tool has the ability to inject replicated patterns into the
network, which gives a user more control on the generated structure of programs.
To measure the effect of pattern sharing in LegUp, we have designed the following
experiment. Given a network with a single basic block (excluding the entry and exit
blocks) with 100 operations:
Figure 4.2: Depth factor controls total execution cycles of circuits
• Set the PATTERN RATIO to 1. This tells the generator that 100% of nodes in
the basic block should be covered by patterns.
• Set the FIX BLOCK SIZE to 1 to make sure all the generated test cases have the
same number of operations.
• Gradually increase the TEMPLATE POOL RATIO from 0 to 1, with 0.1 increments
each time (0 indicates that all of the patterns in the DFG are different; as the
parameter is increased, the number of replicated patterns increases; when the value
equals 1, the entire DFG is covered by a single pattern). For each
value of this parameter, the tool generates 30 different test cases and uses LegUp
and Quartus II to synthesize them.
• Collect the area of circuits for each test case while grouping circuits generated from
the same TEMPLATE POOL RATIO as a set.
Figure 4.3 plots the geometric mean of area in terms of logic elements for each set of
circuits with respect to the TEMPLATE POOL RATIO. The purpose of this experiment
Figure 4.3: Resource sharability of replicated patterns within one BB.
is to find out how much area can be reduced as we inject more potentially sharable patterns.
However, according to Figure 4.3, the circuit area does not necessarily have a linear
relationship with the TEMPLATE POOL RATIO. Based on the log files produced by
LegUp, the injected patterns cannot always be found by LegUp. This suggests that
the current pattern matching algorithm in LegUp could be altered to further reduce the
circuit area.
4.3 An alternative binding algorithm
The experiment described in the previous section implies that, with the current pattern
matching algorithm used in LegUp, the area of the synthesized circuit does not necessarily
decrease as the number of replicated patterns increases. This is because in LegUp two
patterns are recognized as sharable if and only if they have exactly the same operation
connections and the same scheduling assignments. For instance, in Figure 4.4, Pattern
1 and Pattern 2 are not considered the same pattern in LegUp, as the subtractions are
not scheduled in the same cycle.
The result in Section 4.2 indicates that there can be an alternative way of doing
Figure 4.4: Unsharable patterns in LegUp
pattern matching to further reduce the synthesized circuit area. The scheduling applied
before pattern matching actually breaks some of the sharable patterns. Instead of
using the traditional allocation-scheduling-binding flow, a HLS tool could apply pattern
matching before scheduling and later force the scheduler to give the same clock-cycle
assignments to the sharable patterns; that is, a pattern-aware scheduler.
We designed an experiment to verify the feasibility of this approach (instead of actually
implementing it). In LegUp, two blocks have the same scheduling assignments if they
have exactly the same structure (one is a replication of the other) [12]. We can utilize
this to force patterns to have the same scheduling.
• Set BB NUM to be 30.
• Force the generator to create relatively small basic blocks with 10 operations each
(10 is the maximum sharable pattern size defined in LegUp).
• For the first set of test cases (30 tests for each set), the generator randomly creates
30 different basic blocks (these 30 basic blocks are different from each other).
• For the second set of test cases, the generator makes 2 of the 30 BBs replications
of each other and creates the remaining 28 basic blocks randomly.
Figure 4.5: Resource sharability when replicating basic blocks.
• For the third set of test cases, the generator makes 4 of the 30 BBs replications
of each other and creates the remaining 26 basic blocks randomly.
• Repeat for 15 such sets of experiments, until all 30 BBs are the same.
• Collect the area result for each test case after LegUp and Quartus II synthesis.
The main difference between this experiment and the one in Section 4.2
is that, instead of filling a single large basic block with a number of patterns, the test
programs contain a number of small basic blocks (the size of each basic block does not
exceed 10, as that is the maximum size of sharable patterns in LegUp). In this case,
the small basic blocks can be considered patterns. If two of these basic blocks
have the same structure then, as mentioned previously, they must have the same scheduling
assignments, which makes them a pair of sharable patterns in LegUp.
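The construction of each experiment set above can be sketched as follows. `gen_block` stands in for the real block generator and the representation of a block is left abstract, so this is illustrative only.

```python
import random

def make_block_set(total_bbs, num_replicated, gen_block, rng=None):
    """Build a list of `total_bbs` basic-block descriptions in which
    `num_replicated` entries are copies of one shared block and the
    rest are freshly generated. `gen_block` is any callable producing
    a random block description."""
    rng = rng or random.Random()
    blocks = []
    if num_replicated >= 2:
        shared = gen_block()
        blocks.extend([shared] * num_replicated)
    while len(blocks) < total_bbs:
        blocks.append(gen_block())
    rng.shuffle(blocks)  # the replicas need not be adjacent in the CFG
    return blocks
```

Calling this with num_replicated = 0, 2, 4, ..., 30 produces the sets used in this section.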
Figure 4.5 plots the geometric mean of the area results for each set of circuits with
respect to the number of replicated basic blocks. As shown in the figure, the circuit
area generally decreases as the number of replicated blocks increases.
4.4 Runtime analysis for LegUp’s pattern matching
algorithm
While performing the experiments for the binding algorithm, we discovered that the runtime for
LegUp to compile test programs to Verilog grows significantly as the number of operations
increases. In order to measure how the size of the program affects LegUp's runtime,
we designed the following experiment:
• Set BB NUM to be 10.
• Increase BLOCK SIZE FACTOR from 10 to 80 with an interval of 10.
• For each BLOCK SIZE FACTOR, generate 30 different test programs.
• Measure the runtime of LegUp to compile 30 generated programs.
• Measure the runtime of LegUp to compile 30 generated programs with pattern
matching enabled.
Figure 4.6 plots the runtime results obtained from this experiment. The dashed line
illustrates the runtime with pattern matching enabled, whereas the solid line illustrates
the runtime with pattern matching disabled. The figure shows that the runtime
bottleneck in LegUp is the pattern matching algorithm.
4.5 Comparison with CHStone
As previously mentioned in Section 2.6.3, using manually developed test suites is one of
the most important and commonly used techniques for HLS debugging. Accordingly,
LegUp uses CHStone as its primary test suite. According to [14], CHStone emphasizes four
aspects which are believed to be the most important features for C-based HLS
benchmark programs:
Figure 4.6: Runtime measurement as BLOCK SIZE FACTOR increases
• Diversity: CHStone programs come from different application domains, and contain
various types of operations and control structures. This results in different
types of resource utilization.
• Size: CHStone consists of practically large programs, in terms of both the source-level
descriptions of the C programs (numbers of functions, variables, operations, and lines of code)
and the generated RTL circuits (number of states, numbers/types of functional units,
and size of memories in generated circuits).
• Synthesizability: CHStone programs are synthesizable by a commercial HLS tool
(eXCite) and an academic HLS tool (LegUp).
• Usability: CHStone programs are easy to use since test vectors are self-contained
and no external libraries are necessary.
In the following sections, we describe several experiments that compare the
CHStone benchmark suite with the auto-generated test cases in these four aspects. In
addition, we also compare the two using code coverage, as measured by gcov.
Table 4.1: Brief description of the CHStone benchmark programs.

Application Domain  Name      Description
Arithmetic          DFADD     Double precision floating-point addition
                    DFDIV     Double precision floating-point division
                    DFMUL     Double precision floating-point multiply
                    DFSIN     Double precision floating-point sine function
Processor           MIPS      Simplified MIPS processor
Media Processing    ADPCM     Adaptive differential pulse code modulation decoder and encoder
                    GSM       Linear predictive coding analysis of global system for mobile communication
                    JPEG      JPEG image decompression
                    MOTION    Motion vector decoding of the MPEG-2
Security            AES       Advanced encryption standard
                    BLOWFISH  Data encryption standard
                    SHA       Secure hash algorithm
4.5.1 Diversity: CHStone vs. Auto-generated test cases
In this section, we compare CHStone and our auto-generated test cases in terms of
their diversity.
4.5.1.1 Diversity of CHStone test suite
CHStone consists of 12 programs selected from various application domains: four
arithmetic programs, four media applications, three cryptography programs and one
processor program. Table 4.1 summarizes the descriptions of each CHStone
benchmark.
CHStone claims to cover a wide range of different operations, statements and data
types at the source level. According to its statistics (Figure 4.7), the representative
data types of the benchmarks range from 8-bit char to 64-bit int, and
from scalar variables to arrays. Furthermore, these CHStone programs contain various
types of statements, including assignment, goto/break, for, while, switch and if.
Figure 4.7: Incidence of operations per CHStone benchmark program (quoted from [14])
CHStone benchmarks utilize a wide range of hardware resource types when
synthesized to hardware, including 32-bit/64-bit adders, subtracters, dividers, comparators and
shifters, as well as memory and registers. However, some operations, such as
floating point arithmetic, are not included in the programs. The authors of the
CHStone paper also note that it is unclear how many more benchmarks are needed in
order to cover a wider diversity, since the HLS community has not yet established
common recognition of which features need to be tested for HLS
tools.
4.5.1.2 Diversity of auto-generated test cases
Our test case generator can generate arbitrary combinations and connections of operations.
A user can tune the "ingredients" of operations by altering the configuration
file. For instance, using the configuration file shown in Figure 3.2, the generator creates an
equal number of add and divide operations, as their parameters are equally weighted
at 0.5. In other words, our framework helps HLS developers to test their tools using
controllable test features. As another example, LegUp did not support floating point
calculations in its 2.0 release. To test LegUp 2.0, a user can easily generate test cases
that exclude floating point arithmetic. On the other hand, LegUp will support floating
point calculations in its 3.0 release, which makes CHStone insufficient for testing it.
However, by changing a couple of lines in the configuration file, our generator can include
such operations.
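The weighted operation mix can be read straight off the configuration file. Below is a sketch of how the weights could drive sampling; the real generator's internals may differ.

```python
import random

def sample_ops(weights, n, rng=None):
    """Draw `n` operation kinds according to the configuration
    weights, e.g. {"ADD": 0.5, "DIV": 0.5} as in Figure 3.2.
    A sketch only, not the generator's actual selection code."""
    rng = rng or random.Random()
    ops, w = zip(*weights.items())
    return rng.choices(list(ops), weights=list(w), k=n)
```

With equal weights, the expected mix is equal; setting a weight to 0 (or omitting the entry) excludes that operation entirely, which is how floating point arithmetic can be switched off for LegUp 2.0.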
4.5.2 Size: CHStone vs. Auto-generated test cases
This section compares CHStone and our auto-generated test cases in terms of their size.
4.5.2.1 Size of CHStone test suite
As claimed by the authors of CHStone, the CHStone benchmarks are practically large
programs in terms of their source-level descriptions and their compiled circuit sizes. Table 4.2
shows the source-level characteristics of all of the CHStone programs. It demonstrates
that the benchmarks are not trivially small in terms of their lines of code and number
of operations. However, Figure 4.8 illustrates a different result. In Figure 4.8, the first
bar of each benchmark indicates the number of lines in the original C code, the second
bar indicates the number of lines in the LLVM IR compiled by Clang, and the third bar
indicates the Quartus-synthesized circuit size in terms of the number of logic elements (all
of these numbers are normalized to their geometric means in order to show their relative sizes).
According to this figure, the number of lines in C code (or even LLVM IR code) does
not necessarily have any relationship with the circuit size. For example, the source code
of the benchmark SHA is relatively large compared to other benchmarks but after being
synthesized to hardware, the size of the circuit is one of the smallest in the benchmark
suite. As this result shows, it is not practical to make size comparisons at the source
level.
Table 4.2: C code level characteristics of CHStone benchmark programs
Name Line of C code Add/Sub Mult Div Comparison Shift Logic
DFADD 526 38 78 65 146
DFDIV 436 45 8 2 50 56 73
DFMUL 376 28 4 34 41 61
DFSIN 755 141 17 2 41 214 357
MIPS 232 17 2 196 22 23
ADPCM 541 156 69 2 73 81 24
GSM 393 251 53 110 44 41
JPEG 1692 1029 148 6 242 277 132
MOTION 583 299 155 127 55
AES 716 510 22 36 48 758 370
BLOWFISH 1406 280 15 159 370
SHA 1284 134 3 32 59 87
Figure 4.8: Source level analysis and synthesized circuit size
Figure 4.9: Source level analysis and synthesized circuit size
4.5.2.2 Size of auto-generated test cases
Our auto-generated test cases give a user more precise control over the synthesized circuit
size. The generator can not only generate test cases with an exact number of operations,
but can also create a set of different test cases from the same configuration.
As described in Section 3.2.1.1, the parameter BLOCK SIZE FACTOR linearly scales
with the number of operations in the generated program. With this, a user can precisely
specify how many operations appear in their test cases, which makes the size of the
synthesized hardware more predictable. Figure 4.9 shows the results of an experiment that
fixes all other parameters but varies the BLOCK SIZE FACTOR. As we increase the
basic block size factor linearly from 10 to 100 with an interval of 10 (30 different test
cases per interval), the size of the synthesized circuits also increases linearly.
4.5.3 Synthesizability: CHStone vs. Auto-generated test cases
In this section, we compare CHStone with our auto-generated test cases in terms of their
synthesizability.
4.5.3.1 Synthesizability of CHStone test suite
We consider a program synthesizable if a HLS tool can generate an RTL circuit
from the software program without any modifications. According to [14], one of the key
features of CHStone is that the benchmark programs are easy to use, since they
do not contain any data types or constructs which are not synthesizable by most of the
existing C-based HLS tools; such constructs include composite data types, dynamic
memory allocation, and recursive functions. However, in order to compile the CHStone
benchmarks using eXCite, a commercial HLS tool, approximately 3.0% of the C code
needed to be altered. LegUp is able to synthesize all of the CHStone benchmarks without
altering the original code.
4.5.3.2 Synthesizability of auto-generated test cases
One of the key features of our automated test case generator is allowing the user to
specify which operations/structures occur in the test programs. For instance, LegUp
plans to support floating point in its next release; however, the CHStone benchmarks
cannot be used to test this, as they do not contain any floating point data types. In our tool, a user
can generate test cases with specific types of operations by changing the configuration
file described in Section 4.2. In addition, if an operation is not supported by the current
generator, it can easily be added to the framework. Instructions on how to add new types
of operations are given in Appendix A.
In our experiments, we spent four days compiling a total of 200,000 randomly
generated test cases (these cases cover all of, and only, the supported features in
LegUp). Regardless of the correctness of the circuits, all of these test cases are compilable
printf ("Result: %d\n", main_result);
if (main_result == 150) {
printf("RESULT: PASS\n");
} else {
printf("RESULT: FAIL\n");
}
Figure 4.10: Self-contained test vector in CHStone.
by LegUp. Generally, as long as a test case generated by our tool can be executed
correctly by the LLVM interpreter, it should be synthesizable by HLS tools/compilers
that use LLVM IR as their input.
4.5.4 Usability: CHStone vs. Auto-generated test cases
In this section, we compare the CHStone benchmarks to our auto-generated test cases in
terms of their usability.
4.5.4.1 Usability of CHStone test suite
In the CHStone benchmarks, test vectors are self-contained, which makes the programs
easy to use. The final result of each program is pre-computed and checked at
the end of program execution to verify correctness. Figure 4.10 shows an example of
how this is done in CHStone.
4.5.4.2 Usability of auto-generated test cases
Our auto-generated test cases also produce a single final result in the main wrapper
function. In order for a test case to pass, a ModelSim simulation and an LLVM IR
interpretation are required. The debugging framework is entirely automated using scripts,
so that once a test case is created, it automatically verifies the correctness of the HLS
tool for that test case.
Furthermore, the debugging framework can produce a vast number of different test
cases and verify them in a reasonable amount of time. With a large number of test
cases, more comprehensive testing can be done.
Another weakness of CHStone is its small number of operations. With a
small number of operations, it can be difficult for a user to evaluate the runtime of a
HLS algorithm. On the other hand, since the auto-generated test cases can be
customized to have a large number of operations, they can reveal problems that cannot be
detected by CHStone. For example, the runtime analysis described in Section 4.4 shows
that as the number of operations increases, the runtime grows exponentially (even though
those test programs were not synthesized into large circuits, and some were even smaller
than the CHStone circuits).
4.5.5 Code coverage comparison
Code coverage measurement is an important technique for evaluating the quality of test
cases. It measures how many lines of code in a program are executed by running
a set of test vectors. In this thesis, we use gcov [2] to profile the code coverage of
LegUp. In this section, we compare the code coverage achieved by CHStone and by our
auto-generated test cases.
4.5.5.1 CHStone code coverage in LegUp
To measure the code coverage of LegUp with CHStone, both the coverage for each individual
benchmark and the accumulated coverage of all 12 benchmarks are reported. Figure
4.11 illustrates the percentage of LegUp's lines of code covered by executing each
CHStone benchmark. Cumulatively, CHStone covers 77.77% of the LegUp code. The
results of this experiment imply that the CHStone benchmarks already exercise the
majority of the LegUp code. This may be due to the fact that LegUp was primarily built
around the CHStone benchmarks.
Figure 4.11: LegUp code coverage by each CHStone benchmark.
4.5.5.2 Auto generated test programs code coverage in LegUp
To measure the code coverage of the auto-generated test programs, we enabled as
many generation features as possible (including all of the instructions supported by LegUp
2.0). The generator produced a number of different test cases with all of these features.
The accumulated coverage of this set of test cases is reported in Figure 4.12. The figure
illustrates that the coverage increases rapidly for the first few test cases and gradually
saturates as more test cases are executed. One of the reasons for choosing 30 as the
number of runs in each set of experiments is that after running 30 test cases, the coverage
plot tends to flatten, as shown in Figure 4.12.
After running 30 different auto-generated test cases, the coverage stabilized at 76.03%,
which is slightly less than the CHStone benchmarks. Investigating the code coverage logs
generated by gcov, we found that some sections of LegUp are only covered by CHStone,
whereas other sections are only covered by the auto-generated test cases.
Figure 4.12: LegUp code coverage by the auto-generated test programs.
This difference arises because the automated test case generation cannot yet cover a wide
range of memory access features. For instance, the tool does not generate calculations
across multiple arrays. This can be enabled in future releases. Table 4.3 lists the
functions that are fully covered only by CHStone or only by the auto-generated test cases.
Table 4.3: Differences in code coverage by CHStone and auto-generated tests

Only fully covered by CHStone          Only fully covered by auto-generated tests
Class name        Function name        Class name   Function name
Allocation        structsExistInCode   Scheduler    canChainAfter
Allocation        getGenerateRTL       GenerateRTL  updateRTLWithPatterns
Allocation        getRamTagNum
SchedulerDAG      memDataDeps
SchedulerMapping  createFSM
GenerateRTL       usedSameState
GenerateRTL       getOpReg
4.6 Bugs detected in LegUp 2.0 release
Our framework was used to test the LegUp 2.0 release, and it detected
several problems in the HLS infrastructure.
4.6.1 Problem with shift instructions
We consider test cases generated using the configuration file shown in Figure 4.13
to be bug-free, since all of the 10,000 different tests generated with this
configuration passed. These test cases used only 4 types of basic operations
(32-bit add, subtract, multiply and divide) and contained 5 basic blocks.
However, bugs were detected in LegUp when shift instructions (shl: shift left, lshr:
logical shift right, and ashr: arithmetic shift right) were added to the list of operations. For
instance, after adding the shl instruction (with no other types of shift instructions) to
the configuration file, as shown in Figure 4.14, 81 out of 100 test cases failed. Similarly,
after adding the lshr instruction (with no other types of shift instructions), 58 out of
100 test cases failed, and after adding the ashr instruction, 59 out of 100 test cases
failed.
The CHStone benchmark suite also contains many of these shift instructions. Table
4.4 shows the number of shift instructions used in each of the 12 CHStone programs.
BB_NUM 5
INPUT_NUM 10
OUTPUT_NUM 1
DEPTH_FACTOR 0.5
ARRAY_INPUT 0
ADD 0.25
SUB 0.25
MULT 0.25
DIV 0.25
FIX_BLOCK_SIZE 1
BLOCK_SIZE_FACTOR 5
Figure 4.13: Configuration file with 4 basic operations
BB_NUM 5
INPUT_NUM 10
OUTPUT_NUM 1
DEPTH_FACTOR 0.5
ARRAY_INPUT 0
ADD 0.2
SUB 0.2
MULT 0.2
DIV 0.2
SHL 0.2
FIX_BLOCK_SIZE 1
BLOCK_SIZE_FACTOR 5
Figure 4.14: Configuration file with 4 basic operations and shl instructions
unsigned int x1 = 10;
unsigned int x2 = x1 >> 33;
printf("x2: %u\n", x2);
Figure 4.15: An example of an out-of-range right shift in C
Table 4.4: Shift instructions used in CHStone benchmarks

Benchmark  shl  lshr  ashr
ADPCM      17   5     72
AES        29   0     32
BLOWFISH   10   54    0
DFADD      23   13    0
DFDIV      17   25    1
DFMUL      12   19    0
DFSIN      58   44    1
GSM        68   19    27
JPEG       24   0     27
MIPS       8    12    2
MOTION     7    6     0
SHA        17   10    0
However, none of the CHStone benchmarks was able to detect this shift instruction
problem.
The problem is caused by LegUp's handling of out-of-range shift counts, which differs
from the approach used by gcc and llvm (in ANSI C, this scenario is undefined
behavior). For instance, Figure 4.15 shows an example of a right shift in C,
where x1 and x2 are both 32-bit unsigned integers. After compiling the code
with gcc or Clang and executing it on an x86 machine, the resulting value of x2 is 5.
This is because when the shift count is greater than or equal to the bit width of
the data type, the shift count is truncated modulo the bit width (33
becomes 33 − 32 = 1). LegUp, however, does not perform this conversion; it maps the
llvm shift instructions directly to Verilog shift operators.
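The discrepancy can be reproduced in a few lines. The sketch below models the two behaviors: the masking rule mirrors what gcc/Clang-compiled code does on x86, while the naive function models a direct Verilog mapping.

```python
def x86_style_shr(value, count, width=32):
    """Right shift as gcc/Clang-compiled code behaves on x86:
    the shift count is taken modulo the operand width, so a
    count of 33 on a 32-bit value acts like a count of 1."""
    return (value & ((1 << width) - 1)) >> (count % width)

def naive_hw_shr(value, count, width=32):
    """Right shift as a direct Verilog mapping behaves: a count
    of 33 shifts every bit out of a 32-bit value, yielding 0."""
    mask = (1 << width) - 1
    return ((value & mask) >> count) if count < width else 0

# The mismatch the generated tests exposed:
#   x86_style_shr(10, 33) == 5   (what gcc/Clang on x86 compute)
#   naive_hw_shr(10, 33)  == 0   (what the unadjusted hardware computes)
```

Any generated test whose random data produces an out-of-range shift count therefore fails the hardware/software comparison, which explains the high failure rates above.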
4.6.2 A LegUp-produced Verilog file hangs at Quartus II compilation
While collecting the synthesized circuit sizes of generated programs using Quartus II, our
tool detected a very rare issue, which occurred in 1 out of 100,000 test cases. The test
case passed the check comparing software and hardware results; however, the Verilog
file produced by LegUp cannot be compiled by Quartus II, as the compilation hangs at
placement. The program contains only 70 operations in total, of 4 different types (32-bit
add, subtract, multiply and divide). Since this problem is unlikely to be related to LegUp
but rather to Quartus II, it has been escalated to Altera for further investigation. The test
case has also been logged in LegUp's Bugzilla (bug number 101) for future reference.
4.7 Detecting injected bugs in LegUp
In order to evaluate our tool’s the ability to detect bugs, bugs were manually injected
into LegUp. We purposely altered the original LegUp code so that it would produce
wrong synthesized results. A number of test programs were produced and compiled with
the altered version of LegUp. We measure the ratio of failed test cases and compare it
with the ratio of failed CHStone benchmarks.
4.7.1 Disabling Live Variable Analysis
In this experiment, we disabled the live variable analysis (LVA) in LegUp. Live variable
analysis is a data-flow analysis used in compilers: for each variable, it determines the
range between the variable's first definition and its last use. An example
is shown in Figure 4.16: the life cycle of the variable produced by the adder on the
right spans from point A to point B. During the binding process in HLS, LVA is needed
to determine which operators can be shared. For instance, in Figure 4.16, those
Figure 4.16: Example of a variable’s life cycle
two adders cannot be shared, as their life cycles overlap. In other words, if LVA
is disabled, LegUp could share operations that cannot be shared.
In the actual experiment, 7 of the 12 (58%) CHStone benchmarks detected
this error in LegUp when we disabled the LVA. On the other hand, 432 out of 1000
(43.2%) generated test cases failed with LVA disabled. The configuration file used for
the generator is shown in Figure 4.13.
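The sharing rule that LVA enables can be captured by a simple interval test. The representation below, half-open (start, end) ranges over schedule states, is our own simplification, not LegUp's data structure.

```python
def intervals_overlap(a, b):
    """True if two live ranges, given as half-open (start, end)
    intervals over schedule states, overlap."""
    return a[0] < b[1] and b[0] < a[1]

def can_share(a, b):
    """Two operators may share one functional unit only when the
    live ranges of their results do NOT overlap. Disabling LVA
    amounts to treating every pair as sharable, which is unsafe."""
    return not intervals_overlap(a, b)
```

With LVA disabled, the binder in effect answers True for every pair, producing the corrupted circuits that the failing test cases detect.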
Chapter 5
Conclusion
5.1 Summary
High-level synthesis reduces the time-to-market and lowers the design complexity of hardware
development. It allows developers with limited hardware expertise to design hardware
using a software programming language such as C/C++/SystemC rather than a hardware
description language such as Verilog/VHDL. In a software environment, there are many
mature debugging tools which can be used for verifying and validating software programs.
However, there is a lack of tools for verifying HLS tools.
Although several techniques exist for verifying HLS tools, they have weaknesses
when compared to software debugging tools. For instance, formal methods become
impractical as programs get larger. Assertion-based techniques require inserting
additional code into software designs, as well as hardware knowledge on the part of the developers.
Furthermore, manually developed test suites have no standardized criteria for
determining the comprehensiveness of their test programs.
This work proposed an automated test case generation framework that can be used
as a complement to other existing debugging techniques for HLS tools. It can generate a
vast number of synthesizable software programs based on user specifications. These test
cases can be simulated to verify the correctness of HLS tools.
5.2 Future work
The research in this thesis enables automated test case generation using LLVM IR for ver-
ifying, debugging and analyzing high-level synthesis tools. It provides additional options
for HLS developers to validate their designs. Several potential future directions
are proposed to broaden the features supported by the generator and to add more
intelligence to the graph generation.
5.2.1 Input vector range analysis for test programs
Currently, the test vectors for the test programs are generated randomly with few
constraints, which can make the execution of test programs unrealistic. For instance,
overflow and underflow (e.g., of floating-point numbers) can occur frequently during calculations.
One of the potential solutions to this problem is to use the range analysis described
in [24]. The author of [24] proposed an algorithm modeled after the constraint-based
framework of [25], wherein the program is analyzed against a set of constraints that
define the range of each variable. In the case of our work, for instance, users may want
to produce test cases with non-zero results. This range analysis can help the generator
determine a preferable input vector range and thereby increase the quality of the test
programs.
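As a toy illustration of how interval information could be propagated, the sketch below is not the algorithm of [24], just minimal interval arithmetic for a single addition; the `Range` type and `addRange` helper are hypothetical:

```cpp
// Minimal interval arithmetic for an addition: the result range is the
// element-wise sum of the operand ranges. A generator could use such
// ranges to pick input vectors that avoid overflow or all-zero outputs.
struct Range {
    long lo, hi;
};

Range addRange(const Range &a, const Range &b) {
    return {a.lo + b.lo, a.hi + b.hi};
}
```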
5.2.2 Back tracing the error points
Currently, the debugging framework only tells the user whether a test case has passed
by comparing the final results. It does not, however, have the ability to determine
which section of the code caused the problem.
This can be improved by creating accurate mappings between each step of software
and hardware execution so that the user can verify the results after the execution of each
operation.
Another potential improvement is to create a GDB-like debugging framework for
high-level synthesis tools. Similar to [15], it could provide a graphical representation of
source-level mappings between software code and synthesized circuits, with breakpoints
for stepping through the code during investigation.
5.2.3 Customizable pattern injection
Generating patterns randomly preserves more randomness in the test cases, which helps
developers verify correctness. However, it is worth having a tool that lets the user specify
which patterns should appear in the network; this gives users the ability to analyze which
kinds of network structures are worth sharing on certain FPGA devices. Currently, the
user can manually code pre-defined patterns inside the generator framework. In the
future, the tool should provide a dot-like language [1] that describes network structures
in plain text.
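As a sketch of what such a description might look like, a two-multiply/one-add pattern could be written in a dot-like syntax as below. The syntax and the `op` attribute are purely hypothetical, since this language does not exist yet:

```dot
// Hypothetical pattern description: two multiplications feeding an addition
digraph pattern {
    m1 [op="MUL"];
    m2 [op="MUL"];
    a1 [op="ADD"];
    m1 -> a1;
    m2 -> a1;
}
```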
Appendix A
Add new operation type to the
framework
This appendix explains how to add a new type of operation to the generation framework.
To demonstrate this, we give an example of adding an 8-bit integer addition operation
(ADD8) to the system.
• In file AutoConfig.h, add a string constant as in Figure A.1
• In file AutoConfig.h, add an “else if” case so that once the new operation appears
in the configuration file, the system can detect it (shown in Figure A.2).
• In the function createInstforNode from the file CFGNtk.cpp, add an “else if” case
(in Figure A.3) for the type conversion to make sure the inputs to this operation
are of the correct data type.
• In the same function from the previous step, add a case to the switch statement as
shown in Figure A.4 to create the proper operation instruction.
static const char add8Op[] = "ADD8";
Figure A.1: String constant added in AutoConfig.h
else if(strcmp(pch, add8Op) == 0 ){
OpName = string(pch);
OpRatio = atof(strtok (NULL, " "));
assert(OpRatio>0);
addOI(OpName, OpRatio, Operation_Index);
}
Figure A.2: “else if” case added in AutoConfig.c (Operation_Index has to be explicitly assigned)
else if((pNode->NodeOperation==Operation_Index)){
OpType = IntegerType::get(mod->getContext(), 8);
if(op_0->getType()!=OpType){
new_op_0 = create_convert_instr(mod, op_0, OpType, BB);
}else{
new_op_0 = op_0;
}
if(op_1->getType()!=OpType){
new_op_1 = create_convert_instr(mod, op_1, OpType, BB);
}else{
new_op_1 = op_1;
}
}
Figure A.3: “else if” added in CFGNtk.cpp (Operation_Index is the one used in Figure A.2)
case Operation_Index:{
result_val = BinaryOperator::Create(Instruction::Add, \
new_op_0, new_op_1, "", BB);
break;
}
Figure A.4: Case added in CFGNtk.cpp (Operation_Index is the one used in Figure A.2)
BB_NUM 10
INPUT_NUM 2
OUTPUT_NUM 1
DEPTH_FACTOR 5
ARRAY_INPUT 0
ADD8 0.25
ADD 0.25
SUB 0.5
FIX_BLOCK_SIZE 1
NUM_REPLICATED_BBS 8
BLOCK_SIZE_FACTOR 10
#SEED 12345678
Figure A.5: An example of a configuration file with the newly added operation
Now, users can put entries in their configuration file to let the generator create test
cases with the new operation. An example is shown in Figure A.5.
Appendix B
Experimental results for replicating
patterns in a single basic block
This appendix presents the full set of results for the experiments introduced in Section
4.2. The purpose of this experiment is to demonstrate how circuit area changes as more
replicated patterns are injected while the total number of operations is fixed. All of
the generated circuits have only one basic block (excluding the entry and exit blocks).
For each set of data, the experiment keeps increasing the TEMPLATE_POOL_RATIO
parameter, which turns more and more nodes within the block into replicated patterns.
Tables B.1 and B.2 show the complete results of this experiment.
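Each table below ends with a geometric mean row; it can be reproduced with a computation like the following sketch (the `geomean` helper is illustrative, computed in log space so the running product cannot overflow for long columns):

```cpp
#include <cmath>
#include <vector>

// Geometric mean of a column of circuit sizes (in LEs).
double geomean(const std::vector<double> &sizes) {
    double logSum = 0.0;
    for (double s : sizes) logSum += std::log(s);
    return std::exp(logSum / static_cast<double>(sizes.size()));
}
```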
Table B.1: Synthesized circuit size as the TEMPLATE_POOL_RATIO parameter increases within one BB (0.1–0.5)
TEMPLATE POOL RATIO 0.1 0.2 0.3 0.4 0.5
Circuit size (LEs) 587 1129 1519 721 566
Circuit size (LEs) 1512 331 1203 345 367
Circuit size (LEs) 882 1043 1084 1001 729
Circuit size (LEs) 1141 1154 776 632 972
Circuit size (LEs) 762 1294 762 1386 1023
Circuit size (LEs) 632 788 1310 919 458
Circuit size (LEs) 1013 57 677 605 788
Circuit size (LEs) 886 1243 551 813 539
Circuit size (LEs) 1682 1242 1903 668 1185
Circuit size (LEs) 1187 831 1725 866 1092
Circuit size (LEs) 1551 978 939 512 561
Circuit size (LEs) 1045 1004 1273 1352 774
Circuit size (LEs) 1534 634 1363 861 739
Circuit size (LEs) 653 624 1466 670 1135
Circuit size (LEs) 745 740 1221 683 934
Circuit size (LEs) 1071 695 1913 951 336
Circuit size (LEs) 963 1628 533 790 889
Circuit size (LEs) 881 996 985 426 1035
Circuit size (LEs) 891 1177 501 500 451
Circuit size (LEs) 910 976 1233 658 620
Circuit size (LEs) 584 719 679 1221 931
Circuit size (LEs) 759 749 1412 1092 1031
Circuit size (LEs) 664 507 1207 715 733
Circuit size (LEs) 1508 673 879 574 537
Circuit size (LEs) 780 1322 835 669 1131
Circuit size (LEs) 822 573 1274 307 745
Circuit size (LEs) 974 970 1219 1435 387
Circuit size (LEs) 1410 412 1450 593 331
Circuit size (LEs) 944 514 1009 1760 1035
Circuit size (LEs) 516 1040 1232 660 1388
Geomean 934.43 768.51 1072.14 750.46 725.26
Table B.2: Synthesized circuit size as the TEMPLATE_POOL_RATIO parameter increases within one BB (0.6–1.0)
TEMPLATE POOL RATIO 0.6 0.7 0.8 0.9 1.0
Circuit size (LEs) 861 1464 1232 1416 1152
Circuit size (LEs) 1206 448 778 788 1018
Circuit size (LEs) 737 622 1086 952 1588
Circuit size (LEs) 686 203 630 746 983
Circuit size (LEs) 1178 966 956 1105 1104
Circuit size (LEs) 1463 928 1002 1527 782
Circuit size (LEs) 1461 1132 1584 1527 1231
Circuit size (LEs) 1838 699 1946 529 877
Circuit size (LEs) 1418 896 1607 1240 837
Circuit size (LEs) 437 766 717 1115 722
Circuit size (LEs) 993 1216 619 1228 1067
Circuit size (LEs) 1022 1044 1244 1312 1279
Circuit size (LEs) 609 894 1672 610 1915
Circuit size (LEs) 1358 1403 759 1098 1354
Circuit size (LEs) 1188 1929 1233 1025 1426
Circuit size (LEs) 1286 1135 1137 1971 1006
Circuit size (LEs) 1365 940 1025 1153 1001
Circuit size (LEs) 1443 947 1451 809 1657
Circuit size (LEs) 691 1259 473 1529 977
Circuit size (LEs) 1207 1372 739 900 473
Circuit size (LEs) 1067 970 800 1360 1467
Circuit size (LEs) 1197 1059 998 587 1684
Circuit size (LEs) 1321 754 1007 1002 1424
Circuit size (LEs) 1069 1034 1122 1049 987
Circuit size (LEs) 1234 907 1006 592 999
Circuit size (LEs) 757 645 1144 457 1060
Circuit size (LEs) 880 1353 1005 472 956
Circuit size (LEs) 160 758 889 564 748
Circuit size (LEs) 1448 1268 947 917 463
Circuit size (LEs) 704 1132 1043 687 781
Geomean 991.19 938.52 1012.59 940.56 1045.63
Appendix C
Experimental results for replicating
basic blocks
This appendix presents the full set of results for the experiments introduced in Section
4.3. The purpose of this experiment is to demonstrate how much circuit area can be
reduced if the newly proposed binding algorithm is used in LegUp. There are 30 basic
blocks in the test program. For each set of data, the experiment keeps turning pairs of
basic blocks into replications of each other. Tables C.1, C.2 and C.3 show the complete
results of this experiment.
Table C.1: Synthesized circuit size as the number of replicated basic blocks increases (0–8)
Replicated BBs 0 2 4 6 8
Circuit Size(LEs) 4455 1250 1618 664 4987
Circuit Size(LEs) 2800 2494 3576 1307 2877
Circuit Size(LEs) 3416 3098 2282 4011 2294
Circuit Size(LEs) 3048 3649 2242 2482 3215
Circuit Size(LEs) 1690 5870 4415 2655 2958
Circuit Size(LEs) 3311 3047 1541 2870 1584
Circuit Size(LEs) 3778 3265 3626 2355 3873
Circuit Size(LEs) 3060 6192 1801 282 3230
Circuit Size(LEs) 6844 3654 815 2300 2311
Circuit Size(LEs) 3376 3769 2066 2473 4161
Circuit Size(LEs) 3788 697 2412 488 2250
Circuit Size(LEs) 1480 1892 1054 2263 1409
Circuit Size(LEs) 4042 573 3631 1346 517
Circuit Size(LEs) 1149 2292 3993 1584 3515
Circuit Size(LEs) 2225 1105 802 1671 409
Circuit Size(LEs) 1507 1224 4475 1682 983
Circuit Size(LEs) 2399 3534 2549 1616 2144
Circuit Size(LEs) 2743 2708 2732 1830 3431
Circuit Size(LEs) 3894 6317 1823 2230 153
Circuit Size(LEs) 2562 815 2925 4205 6742
Circuit Size(LEs) 3745 1973 2409 2396 837
Circuit Size(LEs) 3662 996 3502 706 2558
Circuit Size(LEs) 2647 2893 3627 2318 2378
Circuit Size(LEs) 1853 1957 4345 2379 2194
Circuit Size(LEs) 727 2581 3992 2280 1575
Circuit Size(LEs) 2153 3440 2040 1977 2753
Circuit Size(LEs) 2376 2084 232 2453 2012
Circuit Size(LEs) 3593 900 1952 3486 1521
Circuit Size(LEs) 3527 1525 2402 2873 3092
Circuit Size(LEs) 3096 3223 790 2528 2504
Circuit Size(LEs) 3742 2763 3907 5139 2349
Circuit Size(LEs) 3847 3326 926 1647 553
Circuit Size(LEs) 4931 1601 4730 616 1251
Circuit Size(LEs) 1254 3893 5734 4081 1639
Circuit Size(LEs) 4775 2983 7244 1925 4230
Circuit Size(LEs) 2942 4186 4372 3508 4699
Circuit Size(LEs) 4741 2854 3742 3121 1506
Circuit Size(LEs) 4223 1951 1149 1183 1462
Circuit Size(LEs) 4538 4808 1002 2500 1919
Circuit Size(LEs) 2407 3038 2657 2846 3681
Circuit Size(LEs) 939 4351 1240 3042 171
Circuit Size(LEs) 5782 2662 3937 3177 3169
Circuit Size(LEs) 2822 1373 1160 2526 3831
Circuit Size(LEs) 1301 7222 1655 2914 2809
Circuit Size(LEs) 2378 1551 3328 2460 2899
Circuit Size(LEs) 4215 967 3320 1402 2064
Circuit Size(LEs) 2252 4007 3271 1124 3427
Circuit Size(LEs) 2956 1990 3675 3507 3253
Circuit Size(LEs) 3098 3346 2780 4796 5381
Circuit Size(LEs) 511 2473 1944 2546 2780
Circuit Size(LEs) 1331 2823 3965 2977 2756
Circuit Size(LEs) 772 3018 5532 2150 1486
Circuit Size(LEs) 3195 4607 830 3091 3798
Circuit Size(LEs) 1553 2315 4830 2826 2303
Circuit Size(LEs) 1803 4493 2196 2054 5095
Circuit Size(LEs) 3988 2618 390 4950 2831
Circuit Size(LEs) 2751 4781 480 6692 3129
Circuit Size(LEs) 5273 2477 3210 2637 1631
Circuit Size(LEs) 4000 6938 468 6191 3165
Circuit Size(LEs) 1547 4006 3563 1362 2978
Circuit Size(LEs) 2375 4160 3026 2724 5777
Circuit Size(LEs) 3119 2835 3604 2334 2962
Circuit Size(LEs) 2275 1168 1829 1687 2031
Circuit Size(LEs) 5761 2748 2201 2471 3329
Circuit Size(LEs) 2554 2751 1376 2624 2709
Circuit Size(LEs) 3773 1911 1965 1065 2140
Circuit Size(LEs) 3061 3619 778 1231 1165
Circuit Size(LEs) 2194 5736 534 2531 4540
Circuit Size(LEs) 3292 5933 3810 3037 1919
Circuit Size(LEs) 4104 2332 1618 2251 2559
Circuit Size(LEs) 3501 5503 2113 3408 2273
Circuit Size(LEs) 2557 3375 3593 4678 1206
Circuit Size(LEs) 1869 2480 3071 2337 4260
Circuit Size(LEs) 3847 1450 1220 1144 680
Circuit Size(LEs) 994 641 5561 2291 4167
Circuit Size(LEs) 3232 1722 982 945 3961
Circuit Size(LEs) 4242 3442 2888 1331 699
Circuit Size(LEs) 794 2384 1591 1196 1145
Circuit Size(LEs) 3385 2851 2532 2701 4111
Circuit Size(LEs) 1077 3238 3788 2435 4981
Circuit Size(LEs) 2574 2240 2735 441 2674
Circuit Size(LEs) 2990 2103 3425 1011 1271
Circuit Size(LEs) 5584 3842 2999 2167 3406
Circuit Size(LEs) 2969 1146 1327 2278 2793
Circuit Size(LEs) 3150 2907 2255 2212 1589
Circuit Size(LEs) 4255 4203 1003 5759 2790
Circuit Size(LEs) 5219 1474 3076 2385 3774
Circuit Size(LEs) 2205 2336 3139 3279 3672
Circuit Size(LEs) 3044 2882 102 4751 3455
Circuit Size(LEs) 5301 868 4696 6091 2893
Circuit Size(LEs) 2691 1780 3423 3787 2979
Circuit Size(LEs) 1702 1300 805 3619 3826
Circuit Size(LEs) 3360 3801 1577 1992 2489
Circuit Size(LEs) 4379 4609 3627 3150 2183
Circuit Size(LEs) 3266 3049 5965 3878 3265
Circuit Size(LEs) 3494 1660 2587 641 2411
Circuit Size(LEs) 4077 2317 4707 1736 1074
Circuit Size(LEs) 3196 2221 3105 2367 1501
Circuit Size(LEs) 4356 3904 1791 645 1612
Circuit Size(LEs) 2093 3699 3189 3899 3745
Geomean 2761.33 2550.27 2182.11 2186.86 2275.26
Table C.2: Synthesized circuit size as the number of replicated basic blocks increases (10–18)
Replicated BBs 10 12 14 16 18
Circuit Size(LEs) 1546 1680 379 3971 1914
Circuit Size(LEs) 2776 2250 1036 1072 1686
Circuit Size(LEs) 3004 2497 2616 2932 1789
Circuit Size(LEs) 3570 3176 4957 3254 2818
Circuit Size(LEs) 5193 891 2077 1653 2433
Circuit Size(LEs) 4490 3307 3584 2666 3176
Circuit Size(LEs) 2121 2205 1936 1560 1878
Circuit Size(LEs) 1566 2188 911 4337 2249
Circuit Size(LEs) 1575 3571 1483 2400 1216
Circuit Size(LEs) 1374 1198 503 815 366
Circuit Size(LEs) 2427 3317 2725 3717 1364
Circuit Size(LEs) 3309 1677 3074 2888 1018
Circuit Size(LEs) 4309 2103 1432 3628 2634
Circuit Size(LEs) 5244 2478 3802 4165 1357
Circuit Size(LEs) 1382 753 1255 1491 4098
Circuit Size(LEs) 3254 4067 2189 2313 2622
Circuit Size(LEs) 1210 3162 1956 2361 4521
Circuit Size(LEs) 2191 2910 1981 585 2549
Circuit Size(LEs) 3626 3044 2195 2391 4083
Circuit Size(LEs) 2878 1381 935 1441 1446
Circuit Size(LEs) 2779 3077 2271 5030 2451
Circuit Size(LEs) 3702 1876 2502 1800 1283
Circuit Size(LEs) 2646 3298 2698 2266 4559
Circuit Size(LEs) 3890 1695 2004 1871 3890
Circuit Size(LEs) 3382 3781 2225 3986 1113
Circuit Size(LEs) 807 2660 3830 2738 1914
Circuit Size(LEs) 3671 277 2286 1122 1812
Circuit Size(LEs) 3057 1677 495 4382 2901
Circuit Size(LEs) 2282 2562 3622 2679 4102
Circuit Size(LEs) 3967 894 2277 2681 3266
Circuit Size(LEs) 1914 2596 2605 3993 3075
Circuit Size(LEs) 2702 2093 860 1811 2140
Circuit Size(LEs) 3012 2045 2311 3969 964
Circuit Size(LEs) 1708 1990 1212 2697 2099
Circuit Size(LEs) 3086 401 5294 3925 2984
Circuit Size(LEs) 3043 3101 2022 1919 2625
Circuit Size(LEs) 2472 3101 2008 2391 1616
Circuit Size(LEs) 1853 3860 687 1505 728
Circuit Size(LEs) 3493 3228 4669 304 1162
Circuit Size(LEs) 3865 3691 355 2321 1234
Circuit Size(LEs) 4537 2645 2159 2314 745
Circuit Size(LEs) 663 1245 4357 358 1309
Circuit Size(LEs) 2685 422 3165 3115 2738
Circuit Size(LEs) 456 1856 2192 2679 2158
Circuit Size(LEs) 1049 1469 2602 1312 2616
Circuit Size(LEs) 3226 2525 2581 3401 1122
Circuit Size(LEs) 4470 2747 2416 1644 819
Circuit Size(LEs) 2442 3361 1869 4831 1822
Circuit Size(LEs) 708 2407 3271 2888 2700
Circuit Size(LEs) 4027 1472 3440 3076 1936
Circuit Size(LEs) 3694 989 3147 2695 2185
Circuit Size(LEs) 3089 2197 2433 2191 1339
Circuit Size(LEs) 3015 1871 2749 2921 3056
Circuit Size(LEs) 3535 2808 1365 970 3742
Circuit Size(LEs) 1495 1700 4216 2587 1425
Circuit Size(LEs) 888 892 2443 2368 1366
Circuit Size(LEs) 1013 1538 4092 320 3186
Circuit Size(LEs) 4868 3522 3126 3575 2400
Circuit Size(LEs) 3121 1635 3740 749 2214
Circuit Size(LEs) 2061 3089 4245 181 2132
Circuit Size(LEs) 1647 2524 5965 957 1931
Circuit Size(LEs) 3793 2566 3473 1163 2460
Circuit Size(LEs) 1006 2663 3457 2707 5050
Circuit Size(LEs) 1159 5077 2662 4392 2524
Circuit Size(LEs) 98 1964 2321 3048 1822
Circuit Size(LEs) 7391 4117 1950 2731 2400
Circuit Size(LEs) 2338 1669 712 2769 1718
Circuit Size(LEs) 1523 2192 1335 3281 3262
Circuit Size(LEs) 2171 832 2619 2702 2718
Circuit Size(LEs) 3057 4042 2954 2471 2393
Circuit Size(LEs) 505 1574 2904 2122 4671
Circuit Size(LEs) 1175 864 820 2629 3430
Circuit Size(LEs) 4354 3021 2314 1788 5361
Circuit Size(LEs) 2609 2683 809 2248 1247
Circuit Size(LEs) 698 2424 3123 4353 1203
Circuit Size(LEs) 6070 3372 3014 3526 2152
Circuit Size(LEs) 3492 1915 3083 3676 2990
Circuit Size(LEs) 2557 3294 148 1801 4537
Circuit Size(LEs) 1693 2849 1993 2074 1091
Circuit Size(LEs) 1607 3334 1534 2421 1770
Circuit Size(LEs) 1257 1853 466 2848 2269
Circuit Size(LEs) 2648 2119 2122 2665 3041
Circuit Size(LEs) 1026 954 2152 5115 2295
Circuit Size(LEs) 3402 3081 1274 3087 2932
Circuit Size(LEs) 4035 2752 1844 3105 2115
Circuit Size(LEs) 3443 1239 954 1850 3554
Circuit Size(LEs) 3902 337 3442 1490 1054
Circuit Size(LEs) 242 4079 653 3446 1391
Circuit Size(LEs) 996 3823 2867 656 2448
Circuit Size(LEs) 3129 1362 3178 1734 2954
Circuit Size(LEs) 2626 2021 1463 3833 1716
Circuit Size(LEs) 4386 1949 2334 496 3011
Circuit Size(LEs) 2374 3847 1774 1624 3120
Circuit Size(LEs) 2413 3127 1671 2351 3312
Circuit Size(LEs) 2566 319 4648 1936 2060
Circuit Size(LEs) 2092 2692 1939 2289 1625
Circuit Size(LEs) 3517 2664 1144 1863 1000
Circuit Size(LEs) 3149 1786 1072 2389 2036
Circuit Size(LEs) 3218 4017 1671 2346 1992
Circuit Size(LEs) 3605 3500 2113 1987 1804
Geomean 2262.95 2072.81 1974.84 2137.52 2095.22
Table C.3: Synthesized circuit size as the number of replicated basic blocks increases (20–30)
Replicated BBs 20 22 24 26 28 30
Circuit Size(LEs) 2903 2111 1724 4142 1312 2076
Circuit Size(LEs) 1959 3760 483 1493 3177 4782
Circuit Size(LEs) 3053 2317 2142 2095 3981 5669
Circuit Size(LEs) 2440 2889 494 765 6482 3539
Circuit Size(LEs) 1300 3554 1425 3647 1319 1925
Circuit Size(LEs) 758 1316 3281 2132 3581 4612
Circuit Size(LEs) 3279 469 526 1423 1357 559
Circuit Size(LEs) 323 6093 2649 2166 2365 1475
Circuit Size(LEs) 2392 1573 4937 2861 2210 574
Circuit Size(LEs) 533 3996 229 4689 5332 3364
Circuit Size(LEs) 923 3077 2699 2613 1199 575
Circuit Size(LEs) 3463 2566 2449 1907 1695 327
Circuit Size(LEs) 972 1622 1474 2674 6253 1922
Circuit Size(LEs) 624 2861 1533 1187 4251 3353
Circuit Size(LEs) 2311 1419 2901 3766 1712 2699
Circuit Size(LEs) 3783 985 1922 1429 791 444
Circuit Size(LEs) 1619 1688 523 1976 1503 2964
Circuit Size(LEs) 2588 788 2324 1905 3803 1525
Circuit Size(LEs) 4264 3566 2592 2310 2454 418
Circuit Size(LEs) 1972 1337 2252 1309 1134 2923
Circuit Size(LEs) 3680 352 1844 3897 2249 1637
Circuit Size(LEs) 1569 2559 426 2506 2895 1222
Circuit Size(LEs) 2887 4000 2098 2464 727 779
Circuit Size(LEs) 2726 2539 939 1223 1829 1614
Circuit Size(LEs) 1550 89 2643 3910 2948 1249
Circuit Size(LEs) 4312 3650 1356 929 2799 1716
Circuit Size(LEs) 2872 1900 3099 769 2902 1899
Circuit Size(LEs) 4080 1722 2079 4816 4080 1007
Circuit Size(LEs) 2915 2829 628 2257 1390 2883
Circuit Size(LEs) 2314 2541 573 3706 2075 547
Circuit Size(LEs) 3453 1028 2943 2942 3370 2029
Circuit Size(LEs) 219 2083 1425 629 1786 1866
Circuit Size(LEs) 1602 2759 1104 1878 3736 3230
Circuit Size(LEs) 3693 1692 1161 1710 1645 1358
Circuit Size(LEs) 1363 2188 2549 2642 1602 168
Circuit Size(LEs) 1307 1342 1424 829 2630 1586
Circuit Size(LEs) 4371 1534 1772 604 5021 2659
Circuit Size(LEs) 3200 613 586 1846 1734 1930
Circuit Size(LEs) 2895 2180 6454 2251 1744 589
Circuit Size(LEs) 2410 2544 2860 1975 2876 66
Circuit Size(LEs) 922 5102 1283 3356 2375 3847
Circuit Size(LEs) 1421 975 1414 2107 1043 2128
Circuit Size(LEs) 1156 3022 2174 1245 1584 880
Circuit Size(LEs) 2754 2313 1995 4012 4066 541
Circuit Size(LEs) 4389 909 878 3370 1539 1730
Circuit Size(LEs) 3246 1500 1416 3703 3716 1347
Circuit Size(LEs) 3332 1410 4788 841 852 4071
Circuit Size(LEs) 1900 377 2057 872 3135 3563
Circuit Size(LEs) 1945 2634 3722 2340 4713 1966
Circuit Size(LEs) 1090 3275 1534 748 816 5227
Circuit Size(LEs) 1901 2263 3219 677 3056 3542
Circuit Size(LEs) 464 2533 2254 3451 2391 3230
Circuit Size(LEs) 5203 2892 1285 2222 1189 1140
Circuit Size(LEs) 1408 3564 823 2329 1072 794
Circuit Size(LEs) 2197 3185 2328 1584 1166 5777
Circuit Size(LEs) 2611 1494 1125 3180 33 695
Circuit Size(LEs) 1926 2517 605 623 1551 2295
Circuit Size(LEs) 363 3637 3652 3135 3752 1214
Circuit Size(LEs) 2848 3408 2324 1267 1268 2719
Circuit Size(LEs) 2701 2573 2957 1276 587 4070
Circuit Size(LEs) 2849 1862 2270 3126 2164 1678
Circuit Size(LEs) 911 2114 1632 1249 2485 408
Circuit Size(LEs) 1087 842 2446 5375 2792 418
Circuit Size(LEs) 1398 2091 2502 717 2546 4775
Circuit Size(LEs) 1357 3979 1257 3139 2186 31
Circuit Size(LEs) 1938 1460 1132 1914 3303 2332
Circuit Size(LEs) 1056 2593 2010 3673 2950 2278
Circuit Size(LEs) 1353 1664 216 2785 2081 1055
Circuit Size(LEs) 1545 1688 638 1908 2370 2897
Circuit Size(LEs) 1584 1864 1820 1612 531 3799
Circuit Size(LEs) 1873 3109 2018 1149 3267 4042
Circuit Size(LEs) 1990 1892 2065 1818 1228 2989
Circuit Size(LEs) 4183 2608 3326 3414 619 4759
Circuit Size(LEs) 2776 3414 1913 3329 661 539
Circuit Size(LEs) 4118 2981 542 2647 1326 3198
Circuit Size(LEs) 2160 1614 3055 2934 4900 1442
Circuit Size(LEs) 4314 3228 1899 1870 1102 6466
Circuit Size(LEs) 1573 1846 2686 2377 5847 2306
Circuit Size(LEs) 7040 1767 4120 1209 2336 2576
Circuit Size(LEs) 2849 2271 3555 2769 2296 924
Circuit Size(LEs) 1902 3699 4671 1441 3378 2012
Circuit Size(LEs) 3081 840 1538 1628 1173 502
Circuit Size(LEs) 5304 3673 2016 1655 5966 1205
Circuit Size(LEs) 2829 2504 5101 1640 909 1420
Circuit Size(LEs) 2177 3764 1906 1892 744 3559
Circuit Size(LEs) 2811 2122 1907 1539 1195 2123
Circuit Size(LEs) 2535 840 1105 1324 622 1217
Circuit Size(LEs) 1521 1800 2404 3378 771 817
Circuit Size(LEs) 2252 2708 2718 3651 6519 1588
Circuit Size(LEs) 2369 3523 3723 2624 1723 1863
Circuit Size(LEs) 2970 248 1117 2489 2244 2880
Circuit Size(LEs) 1427 2342 5251 624 1368 336
Circuit Size(LEs) 2546 933 2080 3698 3189 195
Circuit Size(LEs) 464 2681 3532 781 453 2734
Circuit Size(LEs) 2768 2702 1747 4640 2790 247
Circuit Size(LEs) 5679 2165 4519 2326 995 3343
Circuit Size(LEs) 1530 2562 3105 2131 3521 2805
Circuit Size(LEs) 1180 4205 2280 857 1640 1342
Circuit Size(LEs) 1753 2711 6684 2592 2601 2535
Circuit Size(LEs) 2269 1811 784 918 3264 3737
Geomean 1998.70 1975.57 1785.55 1948.55 1924.02 1535.97
Appendix D
Experimental results for pattern
matching runtime
This appendix presents the full set of results for the experiments introduced in Section
4.4. The purpose of this experiment is to analyze the runtime performance of the pattern
matching algorithm used in LegUp. It compares the runtime measured with pattern
matching enabled against the runtime measured without pattern matching.
Table D.1: Runtime measurement as circuit size increases (10–40); PM = pattern matching
Size Factor: 10 Size Factor: 20 Size Factor: 30 Size Factor: 40
PM No PM PM No PM PM No PM PM No PM
time(S) 0.59 0.38 2.43 1.26 51.68 2.68 239.46 4.28
time(S) 0.62 0.4 2.85 1.25 15.15 2.65 107.71 4.23
time(S) 0.59 0.41 2.68 1.22 69.85 2.68 143.5 4.19
time(S) 0.63 0.41 2.54 1.23 28.16 2.68 169.38 4.26
time(S) 0.6 0.42 3.74 1.22 22.58 2.59 149.66 4.27
time(S) 2.38 0.4 4.61 1.23 11.4 2.69 245.28 4.33
time(S) 0.6 0.4 4.31 1.25 13.5 2.59 380.83 4.21
time(S) 0.58 0.41 4.74 1.29 20.77 2.64 187.66 4.27
time(S) 0.58 0.4 4.32 1.24 19.02 2.66 321.39 4.26
time(S) 0.66 0.39 6.39 1.27 70.39 2.65 168.44 4.27
time(S) 0.56 0.4 2.76 1.22 26.29 2.63 113.28 4.28
time(S) 0.57 0.41 2.86 1.23 27.08 2.77 67.73 4.29
time(S) 0.54 0.39 2.78 1.21 10.07 2.68 716.09 4.27
time(S) 0.61 0.4 12.55 1.24 12.06 2.63 158.91 4.23
time(S) 0.58 0.39 3.34 1.21 16.53 2.62 656.93 4.25
time(S) 0.64 0.42 5.48 1.31 22.61 2.8 575.45 4.27
time(S) 0.61 0.41 7.44 1.22 54.91 2.67 576.19 4.34
time(S) 0.58 0.4 3.29 1.24 33.37 2.67 186.93 4.22
time(S) 0.58 0.39 2.56 1.25 40.11 2.65 195.06 4.22
time(S) 0.58 0.39 2.66 1.25 174.85 2.6 257.42 4.22
time(S) 0.59 0.39 2.59 1.23 53.43 2.8 180.45 4.22
time(S) 0.58 0.39 3.78 1.23 19.19 2.7 246.19 4.21
time(S) 0.6 0.4 3.64 1.22 19.83 2.7 150.91 4.31
time(S) 0.62 0.42 3.49 1.25 18.5 2.68 163.29 4.22
time(S) 0.63 0.42 9.14 1.21 28.19 2.63 144.81 4.24
time(S) 0.59 0.4 12.59 1.27 62.8 2.73 130.48 4.21
time(S) 0.61 0.4 10.33 1.25 17.55 2.63 257.87 4.25
time(S) 0.61 0.41 2.67 1.23 45.16 2.68 117.96 4.24
time(S) 0.59 0.41 4.48 1.34 32.58 2.65 314.23 4.23
time(S) 0.57 0.4 2.65 1.23 12.29 2.7 401.6 4.21
time(S) 0.69 0.42 2.31 1.22 50.77 2.65 179.64 4.31
time(S) 0.57 0.39 2.25 1.26 18.92 2.75 285.79 4.5
time(S) 0.59 0.4 2.19 1.23 34.68 2.67 192.13 4.28
time(S) 0.59 0.39 2.34 1.21 18.86 2.6 117.79 4.23
time(S) 0.62 0.41 2.52 1.23 61.91 2.64 704.97 4.2
time(S) 0.64 0.42 2.75 1.22 127.74 2.67 506.65 4.54
time(S) 0.59 0.39 2.72 1.25 22.94 2.64 215.47 4.28
time(S) 1.07 0.4 2.7 1.23 34.37 2.62 159.22 4.32
time(S) 0.57 0.4 4.66 1.23 26.47 2.7 99.38 4.24
time(S) 0.57 0.39 2.87 1.24 22.35 2.64 511.45 4.27
time(S) 0.57 0.4 3.22 1.22 41.22 2.78 636.17 4.27
time(S) 0.56 0.39 7.66 1.24 35.62 2.65 476.9 4.25
time(S) 0.57 0.4 3.33 1.31 15.26 2.62 162.78 4.24
time(S) 0.57 0.39 3.27 1.23 67.81 2.59 578 4.24
time(S) 0.58 0.41 3.18 1.24 40.93 2.77 425.28 4.22
time(S) 0.66 0.4 4.59 1.24 20.65 2.64 397.44 4.27
time(S) 0.56 0.38 3.81 1.25 65.16 2.65 418.76 4.18
time(S) 0.58 0.39 3.14 1.24 41.91 2.65 61.78 4.29
time(S) 0.55 0.38 2.76 1.24 27.86 2.67 107.6 4.24
time(S) 0.6 0.41 2.73 1.23 56.18 2.66 205.02 4.51
Geomean 0.617 0.400 3.641 1.241 30.508 2.667 233.350 4.267
Table D.2: Runtime measurement as circuit size increases (50–80); PM = pattern matching
Size Factor: 50 Size Factor: 60 Size Factor: 70 Size Factor: 80
PM No PM PM No PM PM No PM PM No PM
time(S) 894.06 6.78 3320.01 9.49 4176.01 12.66 7931.29 15.74
time(S) 3162.45 6.3 2462.04 9.43 2857.98 12.5 5402.22 16.01
time(S) 1805.56 6.35 1410.68 9.56 3428.4 12.57 3106.72 15.86
time(S) 1252.44 6.31 7639.19 9.6 3878.3 12.87 6184.14 15.8
time(S) 534.55 6.32 3335.45 9.6 4667.87 12.65 10813.1 15.98
time(S) 1170.49 6.26 2781.49 9.63 5457.61 12.73 2932.37 15.87
time(S) 444.43 6.36 1225.83 9.64 1588.44 13.39 8743.11 15.86
time(S) 421.16 6.33 989.19 9.79 3644.76 12.49 8836.87 15.88
time(S) 2455.63 6.4 1993 9.6 3243.31 12.61 2295.47 16.05
time(S) 1138.55 6.5 7263.35 9.44 2741.42 12.44 3502.94 15.78
time(S) 828.82 6.3 4343.52 9.51 3787.77 12.6 3385.76 16.05
time(S) 1509.86 6.42 1011.86 10.11 4441.19 12.44 2556.98 15.73
time(S) 1479.4 6.26 2136.79 9.59 3539.12 12.52 7467.93 16.37
time(S) 366.79 6.3 2183.3 9.56 4986.28 12.54 8381.22 16.08
time(S) 768.6 6.47 2460.97 9.58 4107.25 12.58 4676.5 15.96
time(S) 805.83 6.29 3444.76 9.68 1393.27 12.73 3071.85 15.88
time(S) 706.21 6.35 954.7 9.62 4476.53 12.68 3078.19 15.99
time(S) 702.8 6.84 2249.28 9.53 5408.42 12.51 4516.58 15.93
time(S) 1157.36 6.28 3192.02 9.53 3579.35 12.66 6062.3 15.83
time(S) 1061.89 6.37 2108.06 9.54 6759.91 12.63 6297.36 16.24
time(S) 1195.42 6.35 4594.44 10.19 3121.45 12.45 2482.05 15.94
time(S) 965.03 6.4 2440.89 9.43 3291.65 12.67 5348.11 15.78
time(S) 1066.75 6.38 2301.46 9.49 5045.73 12.49 9102.24 17.51
time(S) 2060.24 6.31 2683.8 9.68 4418.76 12.46 1572.89 16.01
time(S) 857.14 6.28 3383.45 9.69 2951.27 12.58 6314.5 15.94
time(S) 1136.76 6.32 814.33 9.82 2263.05 12.54 3426.83 15.74
time(S) 547.46 6.32 679.12 9.66 4038 12.49 7142.26 16.13
time(S) 492.44 6.38 2029.77 9.53 4288.08 12.56 6383.26 15.91
time(S) 1338.24 6.4 3555.68 9.64 1635.3 12.54 3211.24 15.79
time(S) 1652.04 6.36 1482.99 9.64 5410.23 12.48 5786.62 15.97
time(S) 3487.5 6.68 3763.57 9.64 10559.87 12.64 7576.69 15.81
time(S) 1688.17 6.4 817.85 9.71 3060.54 12.52 10825.81 18.32
time(S) 362.72 6.37 1098.81 9.55 2100.68 12.47 5421.65 16.94
time(S) 352.9 6.35 1740.23 9.49 5104.78 12.47 9643.19 17.55
time(S) 1026.68 6.3 2817.82 9.52 5348.88 12.68 5539.07 17.59
time(S) 3240.75 6.41 3116.72 10.28 1278.83 12.63 9421.17 17.76
time(S) 629.39 6.37 2289.55 9.66 6892.24 12.49 4681.79 15.79
time(S) 657.33 6.46 1596.04 9.66 1292.93 12.46 6156.46 17.25
time(S) 381.99 6.35 5356.15 9.56 948.23 12.54 6795.83 18.32
time(S) 307.38 6.41 1777.26 9.72 3060.45 12.6 8104.6 15.92
time(S) 643.69 6.8 2986.72 9.67 6212.32 12.62 2488.54 15.8
time(S) 808.83 6.23 7049.89 9.87 1721.85 12.76 3833.32 16.01
time(S) 1015.05 6.51 941.25 9.49 5771.05 12.55 5982.32 15.95
time(S) 341.86 6.35 1748.24 9.52 4536.01 12.45 5014.64 15.95
time(S) 696.5 6.33 1624.65 10.18 5505.11 12.58 5076.71 15.81
time(S) 1016.26 6.31 2640.51 9.46 3993.05 12.58 1717.27 15.77
time(S) 887.66 6.4 1085.59 9.67 3546.71 12.58 8221.29 17.98
time(S) 641.54 6.34 1879.18 9.56 6941.71 12.79 5233.32 15.02
time(S) 109.28 6.33 4711.9 9.68 2501.36 12.53 4776.44 13.73
time(S) 670.95 6.39 1682.92 9.56 19749.89 13.23 3130.3 17.32
Geomean 859.458 6.386 2221.738 9.643 3689.420 12.603 5040.602 16.183
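Each Geomean row above is the geometric mean of its column, which summarizes runtimes that vary over several orders of magnitude. As a minimal sketch of how such a row can be reproduced (the function name and sample values are illustrative, not taken from the thesis data):

```python
import math

def geomean(values):
    # Geometric mean: the nth root of the product, computed in log space
    # so that long columns of large runtimes do not overflow the product.
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Illustrative column of two runtimes: the geometric mean of 2 and 8 is 4.
print(geomean([2.0, 8.0]))  # ≈ 4.0
```

Working in log space is the standard trick here: summing logarithms is numerically safer than multiplying dozens of four-digit runtimes directly.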
Appendix E
Experimental results for size factor
This appendix presents the full set of results for the experiments introduced in Section 4.5.2.2. The purpose of these experiments is to illustrate the relationship between the basic block size factor and the synthesized circuit size.
Table E.1: Synthesized circuit size with increasing Basic Block size factor (10–40)
Size Factor 10 20 30 40
Circuit Size(LEs) 2318 2579 11095 8559
Circuit Size(LEs) 845 5464 2977 12871
Circuit Size(LEs) 5695 6228 6690 19499
Circuit Size(LEs) 3920 4752 7064 5308
Circuit Size(LEs) 1482 6486 9402 11000
Circuit Size(LEs) 3354 8439 10726 13628
Circuit Size(LEs) 210 2953 8355 3696
Circuit Size(LEs) 1700 6308 8319 4567
Circuit Size(LEs) 5058 12088 4627 11403
Circuit Size(LEs) 1526 5980 8570 5729
Circuit Size(LEs) 719 4007 9399 3515
Circuit Size(LEs) 2459 8821 11272 14943
Circuit Size(LEs) 2982 9427 15163 11771
Circuit Size(LEs) 2991 7250 1820 9937
Circuit Size(LEs) 4073 4750 7112 7038
Circuit Size(LEs) 3991 2238 3339 11485
Circuit Size(LEs) 4715 7597 18814 11001
Circuit Size(LEs) 2564 3887 14899 9689
Circuit Size(LEs) 2724 2298 10167 10428
Circuit Size(LEs) 4106 5813 8425 17841
Circuit Size(LEs) 4372 5399 11005 10892
Circuit Size(LEs) 4134 3430 8337 11037
Circuit Size(LEs) 1340 2500 4860 10878
Circuit Size(LEs) 1766 2522 1890 8512
Circuit Size(LEs) 893 5827 9740 9940
Circuit Size(LEs) 800 8881 7563 16637
Circuit Size(LEs) 3215 1014 7912 6645
Circuit Size(LEs) 1387 8387 3597 7926
Circuit Size(LEs) 3704 11889 15042 3700
Circuit Size(LEs) 5528 4624 7425 10535
Circuit Size(LEs) 4353 1851 9654 13917
Circuit Size(LEs) 4840 7316 10849 10841
Circuit Size(LEs) 2843 8788 3934 11016
Circuit Size(LEs) 893 4182 1075 10106
Circuit Size(LEs) 3377 12018 9800 21517
Circuit Size(LEs) 4282 5120 12169 12873
Circuit Size(LEs) 1078 8149 6800 13585
Circuit Size(LEs) 6472 4627 9065 9115
Circuit Size(LEs) 1127 5918 4393 8096
Circuit Size(LEs) 2546 6687 9355 4729
Circuit Size(LEs) 3539 5063 8903 10118
Circuit Size(LEs) 1006 4779 11229 11554
Circuit Size(LEs) 627 5615 8613 20743
Circuit Size(LEs) 4695 2872 9604 11174
Circuit Size(LEs) 2517 7400 11018 11473
Circuit Size(LEs) 4149 4025 5975 4365
Circuit Size(LEs) 3558 1687 12256 12084
Circuit Size(LEs) 1508 7407 7418 14296
Circuit Size(LEs) 2679 5659 5605 6964
Circuit Size(LEs) 2538 8430 9591 8902
Geomean 2350.059 5115.581 7483.752 9705.935
Table E.2: Synthesized circuit size with increasing Basic Block size factor (50–80)
Size Factor 50 60 70 80
Circuit Size(LEs) 2570 17955 14892 35781
Circuit Size(LEs) 10209 7171 12225 23979
Circuit Size(LEs) 9465 13069 12324 5396
Circuit Size(LEs) 11045 10454 15794 21905
Circuit Size(LEs) 16068 8748 11715 22473
Circuit Size(LEs) 18014 25907 9267 20745
Circuit Size(LEs) 18184 18718 25345 25723
Circuit Size(LEs) 6915 6289 32657 21614
Circuit Size(LEs) 17110 11427 28108 27210
Circuit Size(LEs) 16240 10367 23621 6878
Circuit Size(LEs) 15898 11395 28492 8540
Circuit Size(LEs) 14268 6148 7336 27732
Circuit Size(LEs) 2543 18879 19999 31667
Circuit Size(LEs) 10856 32372 24063 11915
Circuit Size(LEs) 18296 32224 23165 23131
Circuit Size(LEs) 9509 22022 21135 36090
Circuit Size(LEs) 4898 7339 19981 20573
Circuit Size(LEs) 38113 22554 15049 28902
Circuit Size(LEs) 14871 13801 6115 25104
Circuit Size(LEs) 15885 14341 5967 14728
Circuit Size(LEs) 13288 16486 20508 12578
Circuit Size(LEs) 19455 19887 18683 42512
Circuit Size(LEs) 7972 31043 13344 37329
Circuit Size(LEs) 5234 11751 16892 37228
Circuit Size(LEs) 20881 23212 22505 26080
Circuit Size(LEs) 24150 18971 16259 23904
Circuit Size(LEs) 16192 16872 19297 24538
Circuit Size(LEs) 13378 2920 20400 35852
Circuit Size(LEs) 17317 17380 20971 29174
Circuit Size(LEs) 12625 17823 21045 29900
Circuit Size(LEs) 9898 18083 6934 30263
Circuit Size(LEs) 14716 29132 25163 23831
Circuit Size(LEs) 15771 5690 4979 13520
Circuit Size(LEs) 12077 18481 30267 24274
Circuit Size(LEs) 14545 13643 24431 28977
Circuit Size(LEs) 13432 6459 14410 21386
Circuit Size(LEs) 16930 6906 20124 11130
Circuit Size(LEs) 14051 24201 24030 17125
Circuit Size(LEs) 8295 19714 13856 44269
Circuit Size(LEs) 16650 21103 15105 31662
Circuit Size(LEs) 20134 15079 9326 23644
Circuit Size(LEs) 16107 24292 12643 13189
Circuit Size(LEs) 14414 16487 11073 13678
Circuit Size(LEs) 25878 23561 16708 26075
Circuit Size(LEs) 17927 16320 16619 26909
Circuit Size(LEs) 11128 6025 20193 21526
Circuit Size(LEs) 17553 18575 20995 28530
Circuit Size(LEs) 26440 17967 35605 21464
Circuit Size(LEs) 16249 21261 12307 31275
Circuit Size(LEs) 7023 14670 21164 9424
Geomean 13117.668 14683.940 16537.111 21994.998
Appendix F
Experimental results for depth
factor effects
This appendix presents the full set of results for the experiments introduced in Section 3.2.1.2. The purpose of these experiments is to show how the depth factor affects the total execution cycle count of the synthesized RTL hardware.
Table F.1: Execution cycles with increasing Depth Factor (0.1–0.3)
Depth Factor 0.1 0.15 0.2 0.25 0.3
Cycles 26000 22000 20000 24000 24000
Cycles 22000 26000 22000 24000 20000
Cycles 26000 24000 24000 22000 22000
Cycles 28000 24000 26000 28000 20000
Cycles 26000 22000 22000 26000 24000
Cycles 24000 24000 26000 28000 24000
Cycles 28000 24000 24000 24000 26000
Cycles 28000 24000 22000 26000 24000
Cycles 26000 20000 22000 24000 24000
Cycles 26000 20000 24000 24000 20000
Cycles 28000 26000 24000 22000 22000
Cycles 24000 28000 26000 22000 24000
Cycles 20000 22000 20000 22000 22000
Cycles 26000 26000 22000 22000 22000
Cycles 22000 26000 22000 24000 24000
Cycles 24000 24000 26000 22000 24000
Cycles 24000 22000 20000 22000 22000
Cycles 26000 24000 24000 24000 20000
Cycles 26000 28000 20000 24000 24000
Cycles 28000 24000 24000 22000 26000
Cycles 22000 26000 22000 26000 22000
Cycles 26000 26000 22000 24000 22000
Cycles 26000 24000 24000 22000 20000
Cycles 28000 20000 26000 22000 24000
Cycles 26000 24000 22000 24000 26000
Cycles 26000 22000 24000 22000 22000
Cycles 26000 22000 24000 24000 20000
Cycles 22000 24000 28000 20000 24000
Cycles 24000 28000 22000 20000 22000
Cycles 22000 26000 28000 24000 26000
Geomean 25137.03 23999.78 23334.13 23416.53 22815.62
Table F.2: Execution cycles with increasing Depth Factor (0.35–0.55)
Depth Factor 0.35 0.4 0.45 0.5 0.55
Cycles 24000 20000 20000 22000 22000
Cycles 22000 22000 24000 22000 22000
Cycles 24000 24000 22000 24000 20000
Cycles 26000 24000 22000 24000 22000
Cycles 22000 22000 20000 24000 22000
Cycles 24000 22000 22000 22000 24000
Cycles 22000 24000 22000 24000 20000
Cycles 22000 22000 22000 22000 24000
Cycles 24000 22000 22000 22000 22000
Cycles 24000 24000 24000 22000 20000
Cycles 20000 24000 24000 22000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 22000 24000 24000 24000 26000
Cycles 20000 24000 24000 22000 24000
Cycles 24000 22000 24000 22000 22000
Cycles 22000 24000 22000 22000 22000
Cycles 20000 24000 22000 22000 22000
Cycles 24000 22000 22000 24000 22000
Cycles 26000 24000 22000 22000 22000
Cycles 20000 22000 22000 22000 24000
Cycles 20000 24000 26000 22000 24000
Cycles 22000 20000 22000 24000 22000
Cycles 22000 22000 26000 24000 22000
Cycles 24000 22000 26000 24000 22000
Cycles 24000 20000 20000 22000 22000
Cycles 22000 22000 22000 22000 20000
Cycles 24000 24000 24000 22000 22000
Cycles 22000 24000 22000 22000 22000
Cycles 22000 22000 24000 22000 22000
Cycles 24000 22000 22000 22000 22000
Geomean 22627.98 22642.70 22698.33 22588.37 22176.26
Table F.3: Execution cycles with increasing Depth Factor (0.6–0.8)
Depth Factor 0.6 0.65 0.7 0.75 0.8
Cycles 22000 24000 22000 22000 22000
Cycles 22000 20000 22000 22000 22000
Cycles 20000 24000 22000 22000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 22000 20000 20000 22000 22000
Cycles 22000 22000 24000 22000 22000
Cycles 22000 24000 22000 24000 22000
Cycles 24000 22000 22000 24000 24000
Cycles 24000 22000 22000 22000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 24000 22000 22000 22000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 22000 24000 22000 20000 22000
Cycles 20000 22000 24000 20000 22000
Cycles 24000 22000 22000 22000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 24000 22000 22000 24000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 24000 22000 22000 22000 24000
Cycles 22000 22000 22000 22000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 22000 22000 24000 22000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 24000 22000 22000 22000 22000
Cycles 22000 22000 22000 22000 24000
Cycles 22000 22000 22000 22000 22000
Cycles 22000 22000 22000 22000 22000
Cycles 24000 24000 22000 22000 22000
Cycles 22000 22000 20000 20000 22000
Geomean 22383.44 22187.36 22057.25 21988.55 22195.04
Table F.4: Execution cycles with increasing Depth Factor (0.85–1)
Depth Factor 0.85 0.9 0.95 1
Cycles 22000 22000 22000 22000
Cycles 22000 20000 22000 22000
Cycles 20000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 24000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 20000 22000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 24000 22000 22000 22000
Cycles 22000 20000 22000 22000
Cycles 22000 22000 22000 22000
Cycles 22000 22000 22000 22000
Geomean 21992.37 21862.97 22000.00 22000.00
Bibliography
[1] DOT language. http://en.wikipedia.org/wiki/DOT_language, 2012.
[2] Gcov - Using the GNU Compiler Collection (GCC). http://gcc.gnu.org/onlinedocs/gcc/Gcov.html, 2012.
[3] LLVM Language Reference Manual. http://llvm.org/docs/LangRef.html, 2012.
[4] Standard Performance Evaluation Corporation: SPEC. http://www.spec.org/,
2012.
[5] The Embedded Microprocessor Benchmark Consortium: EEMBC. http://www.eembc.org/, 2012.
[6] R. Camposano. Path-based scheduling for synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 85–93, 1991.
[7] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J.H. Anderson,
S. Brown, and T. Czajkowski. LegUp: High-level synthesis for FPGA-based pro-
cessor/accelerator systems. ACM/SIGDA International Symposium on Field Pro-
grammable Gate Arrays (FPGA), pages 33–36, 2011.
[8] Edmund Clarke, Daniel Kroening, and Karen Yorav. Behavioral Consistency of C
and Verilog Programs Using Bounded Model Checking. In DAC, pages 368–371,
2003.
[9] Stephen A. Cook. The complexity of theorem-proving procedures. In Proceedings of
the 3rd Annual ACM Symposium on Theory of Computing, pages 151–158, 1971.
[10] Philippe Coussy and Adam Morawiec. High-Level Synthesis From Algorithm to
Digital Circuit. Springer, 2008.
[11] John A. Curreri. Performance analysis and verification for high-level synthesis. PhD thesis, University of Florida, 2011.
[12] S. Hadjis, A. Canis, J.H. Anderson, J. Choi, K. Nam, S. Brown, and T. Czajkowski. Impact of FPGA Architecture on Resource Sharing in High-Level Synthesis. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), 2012.
[13] Grant Martin and Gary Smith. High-Level Synthesis: Past, Present, and Future. IEEE Design & Test of Computers, pages 18–25, 2009.
[14] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. Proposal and quantitative analysis of the CHStone benchmark program suite for practical C-based high-level synthesis. Journal of Information Processing, 17:242–254, 2009.
[15] K.S. Hemmert. Source level debugger for the Sea Cucumber synthesizing compiler. In Field-Programmable Custom Computing Machines (FCCM), pages 228–237, 2003.
[16] Chandan Karfa, D. Sarkar, C. Mandal, and P. Kumar. An Equivalence-Checking Method for Scheduling Verification in High-Level Synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 556–569, 2008.
[17] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization (CGO), pages 75–86, 2004.
[18] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. MediaBench: a tool for evaluating and synthesizing multimedia and communications systems. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, pages 330–335, 1997.
[19] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kauf-
mann, 1997.
[20] S. Gupta, N. Savoiu, N. Dutt, R. Gupta, and A. Nicolau. Using global code motions to improve the quality of results for high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 302–312, 2004.
[21] P.R. Panda and N.D. Dutt. 1995 High Level Synthesis Design Repository. In International Symposium on System Synthesis, pages 170–174, 1995.
[22] Nikil D. Dutt and Champaka Ramachandran. Benchmarks for the 1992 high level synthesis workshop. Technical Report 92-107, University of California, Irvine, 1992.
[23] P. Coussy, D.D. Gajski, M. Meredith, and A. Takach. An Introduction to High-Level Synthesis. IEEE Design & Test of Computers, pages 8–17, 2009.
[24] Victor Hugo Sperle Campos, Raphael Ernani Rodrigues, Igor Rafael de Assis Costa, and Fernando Magno Quintão Pereira. Speed and Precision in Range Analysis. In Brazilian Symposium on Programming Languages, 2012.
[25] Zhendong Su and David Wagner. A class of polynomially solvable range constraints for interval analysis without widenings. Theoretical Computer Science, pages 122–138, 2005.
[26] Kazutoshi Wakabayashi. C-based SoC Design Flow and EDA Tools: An ASIC and System Vendor Perspective. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 1507–1522, 2000.