
LegUp: Open-Source High-Level Synthesis Research Framework

by

Andrew Christopher Canis

A thesis submitted in conformity with the requirements
for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2015 by Andrew Christopher Canis


Abstract

LegUp: Open-Source High-Level Synthesis Research Framework
Andrew Christopher Canis
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2015

The rate of increase in computing performance has been slowing due to the end of processor frequency scaling and diminishing returns from multiple cores. We believe the industry is heading towards heterogeneous computing, an accelerator era, where specialized hardware is harnessed for better power efficiency and compute performance. A natural platform for these accelerators is the field-programmable gate array (FPGA), an integrated circuit that can implement large custom digital circuits, including complete systems-on-chip. However, programming an FPGA can be an arduous undertaking even for experienced hardware engineers. We propose raising the abstraction level by allowing a designer to incrementally move their design from a processor to a set of hardware accelerators, each automatically synthesized from a software implementation. This dissertation describes LegUp, an open-source high-level synthesis (HLS) framework that enables this new design methodology. We further present novel improvements to the quality of the synthesized circuits when targeting FPGAs.

First, we present the LegUp high-level synthesis framework with an overview of our design flow. The software is unique among academic tools in offering wide support of the ANSI C software language, targeting a hybrid processor/accelerator architecture, and being open-source. We also show that the quality of results produced by LegUp is competitive with a commercial HLS tool.

Next, we present an FPGA architecture-specific HLS resource sharing approach. Our technique multi-pumps high-speed DSP blocks on modern FPGAs by clocking them at twice the system clock frequency. We show that multi-pumping can reduce circuit area without impacting performance.

Following this, we describe a novel loop pipeline scheduling algorithm. Our approach handles complex constraints by using a backtracking method to discover better scheduling possibilities. This scheduling algorithm improves throughput for complex loop pipelines compared to prior work and a commercial tool.

Finally, we examine LegUp's target memory architecture and describe how to partition memory within the circuit hierarchy using information from compiler alias analysis. We also present a method to efficiently use the block RAMs present in modern FPGAs by grouping memories together. These techniques decrease memory usage and improve performance for our HLS-generated circuits.


Acknowledgements

There have been many people involved in the LegUp project with whom I was immensely lucky and grateful to work over the years. This dissertation would not have been possible without my two incredible supervisors and mentors. I would like to thank my co-advisor, Jason Anderson, for his guidance and mentorship throughout my studies. Jason dedicated significant time to the LegUp project, spending many hours in meetings, recruiting students, organizing tutorials, and spreading the word about LegUp. I admire your work ethic, and I have vastly improved my ability to write and conduct research by learning from your example. Also thanks to my co-advisor, Stephen Brown, for your high-level vision and candid advice, and for giving me the flexibility to follow my own research path. Thanks to the members of my committee, Vaughn Betz, Jianwen Zhu, and Andreas Koch, for their edits and feedback on this work.

I would like to thank all the other graduate students involved with the LegUp project. I was lucky to work with such a smart team: Blair Fort, Ruo Long (Lanny) Lian, Nazanin Calagar, Li Liu, Marcel Gort, Bain Syrowik, Joy (Yu Ting) Chen, and Julie Hsiao. In particular, I wanted to thank Jongsok (James) Choi, with whom I spent many long nights debugging signal waveforms and improving LegUp. Also Mark Aldham, for working on the initial version of LegUp and running power simulations. Thanks to all the LegUp summer undergraduate students: Victor Zhang, Ahmed Kammoona, Stefan Hadjis, Kevin Nam, Qijing (Jenny) Huang, Ryan Xi, Emily Miao, Yolanda Wang, Yvonne Zhang, William Cai, and Mathew Hall, who were all a joy to work with and pushed the LegUp project further. Thanks to all the other graduate students from Pratt 392, especially Mehmet Avci, Jason Luu, and Braiden Brousseau, for many entertaining discussions over the years.

Thanks for the feedback from Altera employees Tomasz Czajkowski and Deshanand Singh, who gave some initial guidance for this research direction, and for Altera's funding of the project. I would also like to thank Philippe Coussy, Daniel Gajski, and Jason Cong for organizing a fascinating tutorial that I attended at DAC in 2009, which influenced the work here. I am also grateful to CMC for providing us with ModelSim licenses. Special thanks for the dependable administrative support from Kelly, Judith, and Darlene. I also appreciated the inspiring entrepreneurship talks and dinners organized by Professor Jonathan Rose.

I am grateful to the Canadian government for their generous scholarships through the Natural Sciences and Engineering Research Council and the Ontario Graduate Scholarship. I thank the Rogers Family for their generous scholarships and for supporting the ECE faculty.

Thanks to my friends and roommates for all the fun outside of school over the past six years, especially: Adam, Michael, Paul, Mark, and Alex. I am truly grateful for the loving support of my parents, Anne and Frank, and my brothers: Lloyd, Stephen, and Ian. Thanks for believing in me, supporting my education, and teaching me to always try my best. Finally, thanks to Sabrina for all the love, support, and constant thoughtfulness!


Our grand business undoubtedly is,
not to see what lies dimly at a distance,
but to do what lies clearly at hand.

— Thomas Carlyle


Contents

1 Introduction
  1.1 Research Motivation
  1.2 Research Contributions
  1.3 Organization

2 Background and Related Work
  2.1 Introduction
  2.2 Modern Computation Platforms
  2.3 High-Level Synthesis Flow
  2.4 C Compiler: Low-Level Virtual Machine (LLVM)
  2.5 Allocation
  2.6 Scheduling
    2.6.1 SDC Scheduling
    2.6.2 Extracting Parallelism
  2.7 Binding
  2.8 FPGA Architecture

3 LegUp: Open-Source High-Level Synthesis Research Framework
  3.1 Introduction
  3.2 Background
    3.2.1 Prior HLS Tools
    3.2.2 Application-Specific Instruction-Set Processors (ASIPs)
  3.3 LegUp Overview
    3.3.1 Design Methodology
    3.3.2 Target System Architecture
  3.4 LegUp Design and Implementation
    3.4.1 Hardware Modules
    3.4.2 Device Characterization
    3.4.3 Hardware Profiling
    3.4.4 Hybrid Processor/Accelerator System
    3.4.5 Language Support and Benchmarks
    3.4.6 Circuit Correctness
    3.4.7 Extensibility of LegUp to Other FPGA Devices
  3.5 Experimental Study
    3.5.1 Experimental Results
    3.5.2 Comparison to Current LegUp Release
  3.6 Research using LegUp
  3.7 Summary

4 Multi-Pumping for Resource Reduction in FPGA High-Level Synthesis
  4.1 Introduction
  4.2 Background
  4.3 Multi-Pumped Multiplier Units: Concept and Characterization
    4.3.1 Multi-Pumped Multiplier Characterization
    4.3.2 Multi-Pumping vs. Resource Sharing
  4.4 Multi-Pumping DSPs in High-Level Synthesis
    4.4.1 DSP Inference Prediction
  4.5 Experimental Study
  4.6 Summary

5 Modulo SDC Scheduling with Recurrence Minimization in HLS
  5.1 Introduction
  5.2 Preliminaries
    5.2.1 Related Work
    5.2.2 Background: Loop Pipeline Modulo Scheduling
    5.2.3 Background: Loop Pipeline Hardware Generation
  5.3 Motivation
    5.3.1 Greedy Modulo Scheduling Example
  5.4 Modulo SDC Scheduler
    5.4.1 Detailed Scheduling Example
    5.4.2 Complexity Analysis
  5.5 Loop Recurrence Optimization
  5.6 Experimental Study and Results
    5.6.1 Runtime Analysis
  5.7 Summary

6 LegUp: Memory Architecture
  6.1 Introduction
  6.2 Background
    6.2.1 Related Work
    6.2.2 Alias and Points-to Analysis
  6.3 LegUp Memory Architecture
    6.3.1 Overview
    6.3.2 Global Memory Blocks
  6.4 Local Memory Blocks
  6.5 Grouped Memories
    6.5.1 Grouped Memory Allocation
  6.6 Experimental Study
  6.7 Summary

7 Case Study: LegUp vs Hardware Designed by Hand
  7.1 Introduction
  7.2 Background
    7.2.1 HLS vs Hand RTL
    7.2.2 Sobel Filter
  7.3 Custom Hardware Implementation
  7.4 LegUp Implementation
  7.5 Experimental Study
  7.6 Summary

8 Conclusions
  8.1 Summary and Contributions
  8.2 Future Work
    8.2.1 Extensions of this Research Work
    8.2.2 Improvements to LegUp
    8.2.3 Additional High-Level Synthesis Research Directions
  8.3 Closing Remarks

References

A LegUp Source Code Overview
  A.1 LLVM Backend Pass
  A.2 LLVM Frontend Passes


List of Tables

3.1 Release status of recent non-commercial HLS tools.
3.2 LegUp memory signals.
3.3 LegUp C language support.
3.4 Core benchmark programs included with LegUp.
3.5 Speed performance results.
3.6 Area results.
3.7 Power and energy results [Aldha 11b].
3.8 LegUp 1.0 vs. current LegUp version (hardware-only implementation).
4.1 Area results (TRS: Traditional Resource Sharing, MP: Multi-Pumping).
4.2 Speed performance results (TRS: Traditional Resource Sharing, MP: Multi-Pumping).
5.1 Algorithm Example (II=3).
5.2 Minimum initiation interval of benchmarks for balanced vs. proposed restructuring.
5.3 Operation and dependency characteristics of each benchmark.
5.4 Speed performance results.
5.5 Speed performance results.
5.6 Area comparison experimental results.
5.7 Tool runtime (s) comparison.
6.1 Naive grouped RAM memory allocation.
6.2 Grouped RAM memory allocation with reduced fragmentation.
6.3 Memory architecture performance results.
6.4 Memory architecture area results.
7.1 Sobel Gradient Masks.
7.2 Experimental Results.


List of Figures

1.1 Clock frequency scaling trends [Stan 14].
1.2 Cost per gate scaling trends [Inte 13]. 90nm - 20nm costs assume two years of high volume production. 16/14nm costs are estimated for FinFET in 2016.
2.1 Spectrum of computation platforms.
2.2 High-Level Synthesis flow.
2.3 Control flow graph (CFG) and data flow graph (DFG) of Figure 2.4.
2.4 C code for FIR filter.
2.5 LLVM IR for FIR filter.
2.6 Scheduling the DFG of a basic block.
2.7 Basic block ASAP scheduling ignoring resource constraints.
2.8 Scheduled FIR filter LLVM instructions with data dependencies.
2.9 System of difference constraints graph.
2.10 Circuit datapath after binding for the given schedule (given one adder).
2.11 Bipartite Graph.
2.12 Cyclone II and Stratix IV logic element architectures.
3.1 LegUp Design Methodology.
3.2 LegUp target hybrid processor/accelerator architecture.
3.3 LegUp hardware module interface.
3.4 Initial state of a LegUp hardware module's finite state machine.
3.5 Vector addition C function targeted for hardware.
3.6 Modified C function to call hardware accelerator for function in Figure 3.5.
3.7 Summary of geomean experimental results across the benchmark suite.
4.1 Multi-pumped multiplier (MPM) unit architecture.
4.2 Clock follower circuit from [Tidwe 05].
4.3 Multi-pumped multiplier unit FMax characterization.
4.4 Multi-pumped multiplier unit register characterization.
4.5 Loop schedule: multiplier sharing vs. multi-pumping.
4.6 Loop hardware: original vs. resource sharing.
4.7 Image after Sobel edge detection and Gaussian blur.
5.1 Time sequence of a loop pipeline with II=2 and five loop iterations (i = 0 to 4).
5.2 Loop pipelining with a recurrence.
5.3 C code for loop.
5.4 Loop pipelining Figure 5.3 with II=2.
5.5 SDC Modulo Scheduling for II=3.
5.6 Restructured loop dependency graph achieves II=1.
5.7 Dependency graph restructuring.
5.8 Incremental Associativity Transformation.
5.9 Backtracking SDC modulo scheduling experimental results.
5.10 Runtime Characterization For Loop Pipelining Scheduling Algorithms.
5.11 Initiation Interval for Loop Pipelining Scheduled in Figure 5.10.
6.1 C snippet showing an example of global and function-scoped memory variables.
6.2 LLVM intermediate representation example showing global and stack memory.
6.3 HLS memory binding and memory interconnection network.
6.4 LegUp 32-bit pointer address encoding.
6.5 LegUp memory controller block diagram.
6.6 LegUp shared memory controller when loading array element output[13].
6.7 Relationship between program call graph and hardware module instantiations.
6.8 Multiplexing required for the memory address at each level of the module hierarchy.
6.9 Local and global memory addressing logic within the hardware module datapath.
6.10 LegUp allocating one physical RAM for each array.
6.11 Grouping arrays into physical RAMs in LegUp's shared memory controller.
6.12 Grouped memory array address offsets.
7.1 Sobel stencil sliding over input image.
7.2 C code for Sobel Filter.
7.3 Sobel hardware line buffers and stencil shift registers.
7.4 Calculating the Sobel edge weight using the stencil window.
7.5 C code for the stencil buffer and line buffers synthesized with LegUp.
7.6 Optimized C code for synthesized Sobel Filter with LegUp.

Chapter 1

Introduction

Over the past four decades we have seen a tremendous increase in computing performance. This computing progress has been driven by Moore's law, an observation that the number of transistors on the latest integrated circuit doubles every 18 months [Moore 65]. Since the early 1970s, this trend has been accomplished by scaling silicon transistors to smaller dimensions using Dennard scaling [Denna 74]. Dennard showed that by scaling the transistor dimensions by 1/√2 (70%), the transistor count doubles, frequency increases by 40%, and the total power remains constant. However, Dennard scaling ended in the early 2000s due to increased transistor leakage current, which prevented a reduction in the transistor threshold voltage VT, and in turn limited the scaling of the power supply voltage [Denna 07]. As frequency continued to scale, chip power densities increased exponentially, eventually producing a thermal gradient at the limit of what could be cooled using reasonable technology [Ellsw 04]. Consequently, computer processor clock frequencies plateaued in 2004, as shown in Figure 1.1, which implies that single-thread processor performance will eventually stagnate. Chip manufacturers reacted by increasing the number of processing cores available to achieve performance gains, with four cores typical today [Lempe 11].
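To make these scaling claims concrete, the classical Dennard arithmetic can be sketched as follows (under the ideal constant-field assumptions; C, V, and f denote switched capacitance, supply voltage, and clock frequency):

\[
\begin{aligned}
\kappa &= 1/\sqrt{2} \approx 0.7 \quad \text{(scale factor for linear dimensions)}\\
\text{area per transistor} &\propto \kappa^2 = 1/2 \;\Rightarrow\; \text{transistor count doubles}\\
f &\propto 1/\kappa \approx 1.4 \;\Rightarrow\; \text{frequency increases by } 40\%\\
P_{\text{transistor}} &= C V^2 f \propto \kappa \cdot \kappa^2 \cdot \kappa^{-1} = \kappa^2\\
P_{\text{chip}} &\propto \kappa^{-2} \cdot \kappa^2 = 1 \;\Rightarrow\; \text{total power unchanged}
\end{aligned}
\]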

Moore's law has continued, with transistors doubling every generation but with significantly less improvement to transistor performance and energy efficiency. Given a fixed chip power budget, as we increase the number of transistors without improving the per-transistor energy efficiency, we must power off a portion of the computer chip, a trend called dark silicon. Projections show that 50% of the chip could be "dark" within three process generations [Esmae 13]. Even Moore's law may soon end due to economic considerations. Silicon foundries such as Taiwan Semiconductor Manufacturing Company (TSMC) are finding that for the newer processes, the cost per gate is no longer decreasing, as shown in Figure 1.2. The latest 16/14nm FinFET [Hisam 00] transistors are estimated to cost $0.0162 per million gates in 2016, which is 14% more than today's 20nm transistors costing $0.0142. If Moore's law does end, this may lead to commoditization of the semiconductor industry, with correspondingly lower profit margins and a shift to maintaining pre-existing products with less new development.

Figure 1.1: Clock frequency scaling trends [Stan 14].

Figure 1.2: Cost per gate scaling trends [Inte 13]. 90nm - 20nm costs assume two years of high volume production. 16/14nm costs are estimated for FinFET in 2016.

These recent trends motivate chip designers to use silicon area more power efficiently due to chip power constraints. Furthermore, we cannot rely on Moore's law to continue to increase computational performance; instead we need to squeeze more performance out of the available transistors. Going forward, these constraints will increasingly be met by heterogeneous computing: combining traditional multi-cores with customized hardware accelerators that offer better energy efficiency and performance for specific applications [Brodt 10]. As evidence of the shift to heterogeneous computing, we observe that 30 of the top 100 supercomputers are using accelerator/co-processor technology as of November 2014 [TOP5 14]. Seventeen supercomputers are using Nvidia graphics processing units (GPUs), eleven are using Intel Xeon Phi co-processors, one uses AMD GPUs, and one uses IBM PowerXCells. The Intel Phi is a multi-core 512-bit SIMD x86-compatible co-processor platform that plugs into a standard PCIe slot and has a peak performance of 1 TFLOPS in double precision (2 TFLOPS in single precision) [Heine 13]. In the rapidly growing mobile space, mobile systems-on-chip (SoCs) now contain specialized hardware cores to save power, including: a digital signal co-processor (DSP), GPU, sensor core, GPS, modem, and multimedia cores [Yang 14]. We are entering an accelerator era, where hardware accelerators are common in heterogeneous many-core systems [Borka 11].

1.1 Research Motivation

There are a few ways of developing a custom hardware implementation of a set of computations. Application-specific integrated circuits (ASICs) offer the highest-performance and lowest-power custom accelerators, but are uneconomical for most applications — ASIC chip design requires over $100M in non-recurring engineering costs at the 28nm process node [Quinn 15]. Instead, another way to realize these custom hardware accelerators is using field-programmable gate arrays (FPGAs), which are integrated circuits that can be programmed to implement arbitrary digital logic. FPGAs have the advantage of being reprogrammable, so they can offer some of the advantages of custom hardware without requiring the user to fabricate a custom computer chip. Additionally, FPGA devices have grown larger in recent years and can now accommodate a complete system-on-chip, including an embedded ARM hard processor like you would find in a smart phone [DE1 13]. Therefore, this dissertation focuses on FPGAs as a target platform for developing hardware accelerators.

Research has shown that implementing a design on an FPGA can offer orders-of-magnitude improvements over a processor in terms of energy efficiency and performance for some applications [Cong 09, Luu 09]. However, custom hardware on FPGAs has not yet been widely adopted for general-purpose compute acceleration. Adoption has been limited by two factors. First, FPGAs have historically had poor floating point performance compared to GPUs and CPUs, and therefore had a low cost-effectiveness for high performance computing [Crave 07]. However, this may soon change with the new Altera Stratix 10 FPGA [Stra 14], which claims 10 TFLOPS of single-precision floating point performance using hardened floating point cores. Second, we believe that a major impediment to FPGAs is that the cost and difficulty of hardware design is often prohibitive, and consequently, a software approach is used for most applications. A typical high performance computing user is a scientist or researcher looking to accelerate a scientific application, and typically they have no hardware design knowledge. Design effort for an FPGA implementation is typically an order of magnitude greater than for software development, due to the lower level of abstraction [Rupno 11]. A hardware engineer must choose a suitable circuit datapath architecture down to the bit level, implement control logic, verify the circuit functionality with a cycle-accurate simulator, and finally use a static timing analysis tool to ensure timing constraints are met. The market for FPGAs could grow tremendously if this programmability hurdle could be lowered, especially considering that software developers outnumber hardware designers 10 to 1 [Occu 10].

The overarching aim of my PhD research is to offer a new programming paradigm for FPGAs that simplifies the design process for engineers familiar with software development. We propose the following incremental design methodology. First, the designer implements their application in software using C, targeting a processor running on the FPGA device. As the application executes, a built-in profiler identifies critical sections of the code that would benefit from a hardware implementation. These segments are then automatically synthesized into hardware accelerators, which the processor uses to improve performance. In this self-accelerating adaptive system, the designer can harness the performance and energy benefits of an FPGA using an incremental design methodology. Alternatively, we can synthesize the entire program into hardware. By designing at a higher level of abstraction, the circuit designer can work more productively and achieve faster time-to-market than using hand-coded register transfer level (RTL) designs.

We have implemented our described approach in an open-source research framework called LegUp. LegUp allows designers to compile C code directly into a functionally equivalent hardware implementation that can be programmed onto an FPGA/processor embedded system. This compilation process, referred to as high-level synthesis (HLS) in the literature, involves automatically generating a cycle-accurate RTL circuit description from a high-level untimed C software specification. High-level synthesis has been studied in academia since the 1980s [McFar 88, Pauli 89, Gajsk 92] to address the issue of hardware design complexity by allowing engineers to use software to describe hardware. In recent years, high-level synthesis has gained traction as a viable approach for designing hardware, as evidenced by new commercial offerings from the two largest FPGA vendors: OpenCL from Altera [Open] and Vivado HLS from Xilinx [Xili]. However, HLS is still primarily used by hardware designers at companies like Samsung, Qualcomm, Sony, and Toshiba [Cooleb].

Despite high-level synthesis being a well-studied research area, in 2011 there were no robust open-source platforms for performing HLS research, forcing academics to build up their own infrastructure from scratch. The infrastructure required for high-level synthesis is quite extensive. LegUp is built within the open-source C compiler, LLVM [Lattn 04], which includes modern compiler optimizations. We provide support for synthesizing all the various language constructs of ANSI C into hardware, except function pointers and recursion. LegUp-generated RTL is synthesizable on Altera FPGAs and utilizes block RAMs, multipliers, dividers, and floating point units. We also support hardware/software partitioning using various processors: 1) a soft MIPS core, 2) a hard ARM core, or 3) an x86 processor connected via PCI Express. LegUp automatically generates the interconnection logic between the HLS-generated accelerators and the processor. We also generate software running on the processor to marshal data to and from the hardware accelerators and to control their execution. The high-level synthesis research community was lacking a robust, well-tested, open-source academic infrastructure that could lower the barrier to entry for new researchers—LegUp fills this gap.

Since our first release of LegUp in March 2011, the project has been well-received in the academic community. Our original conference paper [Canis 11] has 148 citations and we have had two invited papers for the LegUp project [Fort 14, Canis 13b]. LegUp is open-source and freely available (http://www.legup.org), and the source code has been downloaded by over 1200 unique researchers from outside the University of Toronto since our first release. LegUp enables future high-level synthesis research projects in the spirit of the Verilog-to-Routing (VTR) system for FPGA CAD research [Luu 14a]. In the long term, we hope our research will lead to wider adoption of FPGAs by software engineers, allowing them to implement fast and energy-efficient FPGA applications in areas such as cancer treatment, gene sequencing, finance, and oil exploration.


1.2 Research Contributions

The aim of my PhD research is to achieve three broad goals:

1. Make FPGAs easier to program.

2. Provide an open-source framework to enable further research in high-level synthesis.

3. Improve high-level synthesis quality of results towards generating FPGA designs comparable to hand-written RTL implementations.

In order to achieve these goals, we make several contributions, as summarized below:

Chapter 3 presents LegUp, an open-source high-level synthesis implementation and an associated set of benchmarks. We show that with LegUp, a hardware designer can program an FPGA using only C—without writing a single line of RTL. Here we give an overview of the LegUp design flow, describing the high-level synthesis algorithms performed at each step of the process, and we provide a description of the final circuit architecture generated by LegUp. Furthermore, we quantitatively assess LegUp's quality of results by comparing LegUp to a commercial HLS tool (eXcite) using the largest HLS benchmark suite available in the literature (CHStone). These comparisons show that circuits synthesized by LegUp are comparable to those produced by eXcite, with a geomean benchmark execution time that is 18% faster than eXcite, while having 16% higher geomean area. This work has been published in [Canis 11, Canis 12, Canis 13b]. In later chapters, we show that LegUp enables us to conduct research and evaluate new high-level synthesis algorithms.

Chapter 4 uses LegUp to investigate novel FPGA architecture-specific enhancements to high-level synthesis. We present a new approach to resource sharing that allows multiple operations to be performed by a single functional unit in one clock cycle. Our approach is based on multi-pumping, which operates functional units at a higher frequency than the surrounding system logic, typically 2×, allowing multiple computations to complete in a single system cycle. Our method is particularly effective for the DSP blocks on modern FPGAs. We show that multi-pumping is a viable approach to achieve the area reductions of resource sharing, with considerably less negative impact on circuit performance. This work has been published in [Canis 13a].

Chapter 5 describes an improved high-level synthesis scheduling algorithm. In many C applications, the majority of run time is spent executing critical loops. The high-level synthesis scheduling technique called loop pipelining exploits parallelism across loop iterations to generate hardware pipelines. Loop pipelining increases parallelism and hardware utilization, creating circuits similar to hand-coded hardware architectures. However, industrial designs often have resource constraints and constraints imposed by loops with cross-iteration dependencies. The interaction between multiple constraints can pose a challenge for HLS scheduling algorithms, which, if not handled properly, can lead to suboptimal loop pipeline schedules. We present a novel scheduler based on the SDC scheduling formulation [Cong 06b] that includes a backtracking mechanism to properly handle conflicting scheduling constraints and achieve better pipeline performance. The SDC formulation has the advantage of being a mathematical framework that supports flexible constraints that are useful for more complex loop pipelines. Furthermore, we describe how to apply associative expression transformations during scheduling to restructure recurrences in complex loops, enabling better scheduling. We compared our techniques to existing prior work in loop pipeline scheduling in HLS [Zhang 13] and also against a state-of-the-art commercial tool. Over a suite of benchmarks, we show that our approach can result in a geomean wall-clock time reduction of 32% versus prior work, and 29% versus a commercial HLS tool. This work has been published in [Canis 14].

Chapter 6 presents the on-chip memory architecture synthesized by LegUp. We partition application memory into local and global physical memory regions in the final circuit by using pointer analysis performed by LegUp at compile time. We also group physical memories to match the underlying hardware characteristics of the FPGA chip, which typically stores on-chip memory in dedicated memory blocks. Our architecture is generally applicable to a wide range of input C applications, even when pointer analysis cannot statically determine where each C pointer can reference in memory. We measure the impact of local memories and grouped global memories, compared to only having global memory, across the CHStone benchmark suite targeting the Stratix IV FPGA family. We observe a 37% improvement in geomean memory implementation bits and a 12% reduction in geomean wall-clock time. This work has been published in [Fort 14].

Chapter 7 explores a case study of a kernel from a common edge detection algorithm: the Sobel filter. We synthesize the circuit in LegUp from a C description and then compare the design to an equivalent hand-written RTL implementation. We show that performance is comparable, with the wall-clock time of the LegUp-generated circuit only 2% higher than that of the hand-designed circuit. However, the synthesized circuit used more FPGA device resources, requiring 64% more ALUTs and 66% more registers. We motivate topics for future work, particularly regarding the coding-style changes to the input C code required to achieve comparable performance: we need a way of either detecting these optimizations automatically or developing a style guide, with associated intuition, for software developers targeting an efficient hardware architecture.

1.3 Organization

The remainder of this PhD dissertation is organized as follows: Chapter 2 reviews the background material relevant to the research and presents related work. The research contributions are presented in Chapters 3, 4, 5, 6, and 7. Chapter 8 summarizes conclusions and gives suggestions for future work. Appendix A provides a brief summary of the LegUp open-source codebase.


Chapter 2

Background and Related Work

2.1 Introduction

This chapter presents background material and related work that will provide the reader with enough knowledge to understand the research contributions of this dissertation.

Section 2.2 gives an overview of available devices for computation and where our research lies within the computational landscape. Section 2.3 reviews prior work in high-level synthesis, discussing the algorithms required to turn a high-level C description into hardware. Section 2.4 discusses compiler terminology and describes LLVM, the open-source compiler framework. Sections 2.5–2.7 describe the high-level synthesis subproblems: allocation, scheduling, and binding. Section 2.8 gives an overview of our target FPGA architectures.

2.2 Modern Computation Platforms

A spectrum of computation platforms is currently available; we highlight a few popular platforms in Figure 2.1. These platforms offer a trade-off between ease of programming and performance, in terms of computation throughput, lower power, or both. On the left, we have general purpose processors from Intel or AMD. The vast bulk of computation is performed on this adaptable platform. In the mobile space, ARM processors dominate due to their lower power and customizability. Moving from left to right, we have specialized hardware cards such as graphics processing units (GPUs) from Nvidia or ATI. GPUs are programmed with languages like CUDA [CUDA 07] and OpenCL [Openc 09], which allow programmers to harness the single-instruction multiple-data (SIMD) architecture of GPUs for general purpose computation. GPU devices are particularly effective for floating point intensive computation. Next, we have digital signal processing (DSP) processors from Qualcomm [Codre 14] or Texas Instruments, which support parallel multiply-accumulate operations in a SIMD architecture, as required by signal processing (cell baseband towers, TV signal decoding). Programming DSP processors typically involves a mixture of C programming and assembly hand-tuning. Then we have more custom hardware solutions such as FPGAs, which have traditionally been used in low-volume applications, mostly for telecom/switching/networking [Xili 14]. FPGA vendor Altera receives 16% of their revenue from Huawei [Alte 13], who use FPGAs in cell tower baseband stations to route cell packets received at high bandwidth (>100Gbps) between the many DSP-specific processors that handle signal processing. As telecommunication companies upgrade to 4G data networks, this is a growing market area, with global mobile data traffic expected to grow 11-fold by 2018 [Cisc 14]. Lastly, we have custom application-specific integrated circuits (ASICs), which are typically designed using a standard cell library specific to the silicon fabrication process node, for instance 28nm. Standard cells are selected, placed, and routed using electronic design automation (EDA) tools, after which lithography masks are made for the final layout and the ASIC is fabricated in silicon. ASICs can be extremely costly to fabricate: Intel's latest 14nm fabrication plant was estimated to cost $5B [List 14]. Companies with lower volumes can save costs by using shared fabrication facilities at a foundry like Taiwan Semiconductor Manufacturing Company (TSMC). ASICs fabricated with older, mature process nodes can be significantly cheaper due to higher yields and sunk capital costs. For example, fabricating an ASIC at 65nm (from 2006) costs an estimated $500,000, and at 130nm (from 2001) costs $150,000 [Taylo 13].

Figure 2.1: Spectrum of computation platforms. From left (easier programmability) to right (better performance): general purpose processor, GPU, DSP processor, FPGA, custom ASIC.

We face a discontinuity in this spectrum, in terms of design complexity, when comparing software implementations running on a processor to hardware designs targeting an FPGA/ASIC. In the latter, the designer implements a custom circuit in a hardware description language, and must synchronize all computation across a massively parallel digital circuit down to the level of individual clock cycles. Hardware design is error prone and can be notoriously difficult to debug, requiring cycle-accurate circuit simulations. Software design is comparatively straightforward, as we typically describe software sequentially or with limited parallelism, and we use mature, freely accessible compilers and debugging tools. However, a custom hardware implementation can provide a significant improvement in speed and energy efficiency versus a software implementation (e.g. [Cong 09, Luu 09, Zhang 12]). Despite the apparent energy and performance benefits, hardware design is usually too difficult and costly for most applications, and a software approach is preferred. If we could allow software to incrementally and automatically compile into a hardware solution, we could lower the barrier to entry of hardware design.

2.3 High-Level Synthesis Flow

High-Level Synthesis (HLS) is the compilation process of turning an untimed high-level algorithm, typically described in C, into a cycle-accurate hardware description language specification of a digital circuit. We begin with a general overview of the high-level synthesis flow, and then discuss each step in detail along with relevant prior academic work.

HLS is an NP-hard combinatorial problem, so academics have traditionally used divide and conquer to break the task into distinct subproblems [Couss 09], the most important being scheduling and binding. Figure 2.2 shows the typical HLS flow. First, the user specifies a program in a high-level language, which in this dissertation we will assume to be ANSI C. The program is compiled, in our case by LLVM, to optimize the code and produce an intermediate representation (IR). The LLVM compiler and intermediate representation will be described shortly. Next, we perform the allocation step, which reads user-provided constraints to determine the amount of hardware available for use (e.g., the number of multiplier functional units), and also manages other hardware constraints (e.g., speed, area, and power) or other constraints imposed by the target hardware architecture. After the hardware is allocated, we solve the most important HLS subproblem: scheduling. Scheduling assigns each operation in the input program to a control step (state) that occurs during a particular clock cycle. Scheduling decisions can have significant performance implications, and we face challenges extracting parallelism from the untimed program description. After scheduling, we perform binding, which assigns each of the program's operations to a specific functional unit in the hardware, sharing functional units where possible to save area. Binding also shares registers/memories between variables and assigns memory ports to particular load/store operations. Finally, we generate a suitable finite state machine and datapath based on the results of scheduling and binding, while meeting the user constraints from the allocation step, and output the corresponding RTL description. The final circuit description is synthesizable on a target FPGA using standard RTL synthesis tools [Quar 14].

Figure 2.2: High-Level Synthesis flow. A C program is compiled (LLVM) into optimized LLVM IR, which passes through allocation, scheduling, and binding before RTL generation emits synthesizable Verilog; user timing and resource constraints and target hardware characterization guide each step.
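To make the scheduling subproblem concrete, the toy C sketch below performs resource-constrained list scheduling of a three-operation data-flow graph (c = a*b + x*y) given an allocation of one multiplier and one adder. This is purely illustrative and is not LegUp's implementation (LegUp uses the SDC formulation of Section 2.6.1); it also assumes, for simplicity, that every operation takes one cycle.

#include <stdio.h>

enum Kind { MUL, ADD };
struct Op { enum Kind kind; int dep0, dep1; };   /* -1 means no dependency */

int main(void) {
    /* c = a*b + x*y: ops 0 and 1 are multiplies feeding the add (op 2) */
    struct Op ops[3] = { {MUL, -1, -1}, {MUL, -1, -1}, {ADD, 0, 1} };
    int cycle[3] = { -1, -1, -1 };               /* assigned control step */
    int scheduled = 0;
    for (int t = 0; scheduled < 3; t++) {
        int muls_free = 1, adds_free = 1;        /* allocation: one of each */
        for (int i = 0; i < 3; i++) {
            if (cycle[i] != -1) continue;        /* already scheduled */
            int d0 = ops[i].dep0, d1 = ops[i].dep1;
            int ready = (d0 == -1 || (cycle[d0] != -1 && cycle[d0] < t)) &&
                        (d1 == -1 || (cycle[d1] != -1 && cycle[d1] < t));
            if (!ready) continue;                /* operands not yet available */
            int *units = (ops[i].kind == MUL) ? &muls_free : &adds_free;
            if (*units == 0) continue;           /* resource conflict: defer */
            --*units;
            cycle[i] = t;                        /* assign op i to control step t */
            scheduled++;
        }
    }
    for (int i = 0; i < 3; i++)
        printf("op %d -> cycle %d\n", i, cycle[i]);   /* prints cycles 0, 1, 2 */
    return 0;
}

With one multiplier, the second multiply is deferred to cycle 1 and the add to cycle 2; allocating a second multiplier would let both multiplies share cycle 0 and the add complete in cycle 1, illustrating how allocation decisions shape the schedule.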

2.4 C Compiler: Low-Level Virtual Machine (LLVM)

High-level synthesis is typically implemented as a series of backend compiler passes in an existing compiler, so we will briefly review prerequisite compiler terminology [Lam 06]. A control flow graph (CFG) of a program is a directed graph, where vertices map to basic blocks, which represent computation, and edges map to branches, which represent control flow. For example, given two basic blocks b1 and b2, if b1 can branch to b2 then b1 has an edge to b2 in the CFG. A basic block is a contiguous set of non-branching instructions with a single entry point (at its beginning) and exit point (at its end). Within a basic block, the data flow dependencies between instructions form a directed acyclic graph, called a data flow graph (DFG). Consider an 8-tap finite impulse response (FIR) filter whose output, y[n], is a weighted sum of the current input sample, x[n], and seven previous input samples. The C code for calculating the FIR response is given in Figure 2.4. Figure 2.3 shows the corresponding CFG of the FIR filter, where the loop is indicated by the back edge of the second basic block. We also provide the data flow graph of the FIR loop body, where we multiply two values from memory and store the sum.

Figure 2.3: Control flow graph (CFG) and data flow graph (DFG) of Figure 2.4.

y[n] = 0;
for (i = 0; i < 8; i++) {
    y[n] += coeff[i] * x[n - i];
}

Figure 2.4: C code for FIR filter.

In this dissertation, we leverage the popular open-source low-level virtual machine (LLVM) compiler framework [Lattn 04] – the same framework used by Apple for iPhone/iPad application development. At the core of LLVM is an intermediate representation (IR), which is essentially machine-independent assembly language. C code is translated into LLVM's IR, then analyzed and modified by a series of compiler optimization passes. Current results show that LLVM produces code of comparable quality to gcc for x86-based processor architectures.

Figure 2.5 gives the unoptimized LLVM IR corresponding to the FIR filter C code we gave in Figure 2.4. Register names in the IR are prefixed by "%" and there is no restriction on the number of registers. The LLVM IR is in static single assignment (SSA) form, which ensures that each register is only assigned once, guaranteeing a 1-to-1 correspondence between an instruction and its destination register. Types are explicit in the IR. For example, i32 specifies a 32-bit integer type and i32* specifies a pointer to a 32-bit integer.

In the example IR for the FIR filter in Figure 2.5, line 1 marks the beginning of a basic block called

entry. Lines 2 and 3 initialize y[n] to 0. Line 4 is an unconditional branch to a basic block called bb1

that begins on line 5, corresponding to the C loop body. phi instructions are needed to handle control

flow-dependent variables in SSA form. For example, the phi instruction on line 6 assigns loop index

register %i to 0 if the previous basic block was entry; otherwise, %i is assigned to register %i.new, which

Page 21: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

Chapter 2. Background and Related Work 11

1: entry:

2: %y.addr = getelementptr i32* %y, i32 %n

3: store i32 0, i32* %y.addr

4: br label %bb1

5: bb1:

6: %i = phi i32 [ 0, %entry ], [ %i.new, %bb1 ]

7: %coeff.addr = getelementptr [8 x i32]* %coeff, i32 0, i32 %i

8: %x.ind = sub i32 %n, %i

9: %x.addr = getelementptr i32* %x, i32 %x.ind

10: %0 = load i32* %y.addr

11: %1 = load i32* %coeff.addr

12: %2 = load i32* %x.addr

13: %3 = mul i32 %1, %2

14: %4 = add i32 %0, %3

15: store i32 %4, i32* %y.addr

16: %i.new = add i32 %i, 1

17: %exitcond = icmp eq i32 %i.new, 8

18: br i1 %exitcond, label %return, label %bb1

19: return:

Figure 2.5: LLVM IR for FIR filter.

contains the incremented %i from the previous loop iteration. The getelementptr instruction on line

7 performs address computation to initialize a pointer %coeff.addr to the address of coeff[i]. The

getelementptr instruction has three operands, a pointer %coeff to the coefficient array, an offset from

that pointer (0), and an offset from the start of the coefficient array, %i. Lines 8 and 9 initialize a pointer

to the input sample, x[n-i]. Lines 10-12 load the sum y[n], the coefficient, and the input sample into

registers. Lines 13 and 14 perform the multiply-accumulate: y[n] + coeff[i] * x[n-i]. The result

is stored in y[n] on line 15. Line 16 increments the loop index %i by one. Lines 17 and 18 compare %i.new with the loop limit (8) and branch accordingly.

Observe that LLVM instructions are simple enough to directly correspond to hardware operations

(e.g., a load from memory, or an arithmetic computation). Our HLS tool operates directly with the

LLVM IR, scheduling the instructions into specific clock cycles.

Scheduling operations in hardware requires knowing data dependencies between operations. Fortu-

nately, the SSA form of the LLVM IR makes this easy. For example, the multiply instruction (mul)

on line 13 of Figure 2.5 depends on the results of two load instructions on lines 11 and 12. Memory

data dependencies are more problematic to discern; however, LLVM includes alias analysis – a compiler

technique for determining which memory locations a pointer can reference. In Figure 2.5, the store on

line 15 has a write-after-read dependency with the load on line 10, but has no memory dependencies with the loads on lines 11 and 12. Alias analysis can determine that these instructions are independent

and can therefore be performed in parallel.
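To make this concrete, the fragment below sketches how an HLS tool might query LLVM's alias analysis to prove a store and a load independent, as in the example above. This is an illustrative sketch only, not LegUp's code: pass-manager boilerplate is omitted, and exact headers and result spellings vary across LLVM releases.

// Sketch: ask LLVM's alias analysis whether a store and a load can
// touch the same memory; if provably not, HLS may schedule them in parallel.
#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/MemoryLocation.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

static bool independentAccesses(AAResults &AA, StoreInst *St, LoadInst *Ld) {
  return AA.alias(MemoryLocation::get(St), MemoryLocation::get(Ld)) ==
         AliasResult::NoAlias;
}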

2.5 Allocation

We will now describe each step of high-level synthesis in detail beginning with allocation. We will

emphasize features from LegUp, our high-level synthesis tool discussed in Chapter 3, and we will assume

an FPGA target device without loss of generality.

Allocation sets up the constraints for the high-level synthesis problem by specifying target hardware


properties and any user-given parameters. LegUp reads allocation information from a configuration Tcl

file, which specifies:

• The target board and FPGA device family.

• The required circuit clock period.

• The limit (if any) on functional units available for each operator type.

• Which functional units should be shared.

• The number of pipeline stages in each functional unit.

• The number of memory ports and memory latency.

• Specific HLS optimizations: minimize bitwidth, loop pipelining, etc.

• The estimated delay of each functional unit using FPGA device characterization.

All of these allocation parameters have sensible default values. The user typically only manually

specifies the target board and, if necessary, specific HLS optimizations. Based on the target FPGA

device family, LegUp will automatically select a default clock period constraint that we have previously

found to achieve the highest performance across our benchmarks. We have selected the default pipeline

stages of each functional unit to minimize impact on the overall circuit clock frequency. For instance, a

floating point adder has 14 pipeline stages by default.

A functional unit is an instantiated module in the hardware, for instance a multiplier. An operation

is synonymous with an LLVM instruction in the program. Multiple operations can share a compatible

functional unit by adding multiplexers to the input ports of the shared functional unit. By default,

LegUp does not limit the number of available functional units for integer add/subtract, bitwise, shift,

and comparators because multiplexers are costly to implement in FPGAs. For example, a 32-bit adder

can be implemented using 32 4-input LUTs (and associated carry logic), but a 32-bit 2-to-1 multiplexer

also requires 32 4-input LUTs – the same number of LUTs as the adder itself. Since we cannot save area

by restricting these functional units, LegUp generates wide datapaths that can benefit from instruction

level parallelism in the input program.

For multiplier functional units, we can use hard multiplier blocks in the FPGA fabric. LegUp will

share multipliers if the synthesized program uses more multiply operations than hard blocks available

in the FPGA. Other functional units, such as divide/modulus or floating point units, are implemented

with LUTs and consume significant area. Therefore, by default LegUp limits the number of divide and

remainder units to one, and allows only one of each type of floating point unit. We allow the user to

override these defaults in the configuration file to achieve higher parallelism and performance at the

cost of area. Like other HLS tools [Xili], LegUp does not support constraining the overall circuit area

(e.g. use less than 1000 logic elements) subject to a timing constraint. This area constraint would be

redundant, as by default LegUp attempts to reduce area as much as possible while still satisfying the

timing constraint. A user can use these allocation settings to easily perform design space exploration

and gain greater control over the final LegUp-generated datapath.


Figure 2.6: Scheduling the DFG of a basic block.

 1: For each Instr in BasicBlock
 2:   state = 0
 3:   For each Operand of Instr
 4:     continue if outsideBasicBlock(BasicBlock, Operand)
 5:     operandState = getState(Operand)
 6:     if latency(Operand) > 0
 7:       state = max(state, operandState + latency(Operand))
 8:     else if delay(Operand) + delay(Instr) > maxClockPeriod
 9:       state = max(state, operandState + 1)
10:     else
11:       state = max(state, operandState)
12:     end if
13:   End For
14:   assignState(Instr, state)
15: End For

Figure 2.7: Basic block ASAP scheduling ignoring resource constraints.

2.6 Scheduling

In high-level synthesis, scheduling is the task of assigning operations to execute during specific clock

cycles, or control steps, such that all program data dependencies and resource constraints are satisfied.

The goal of scheduling is to minimize the total time needed to complete the program while satisfying all

constraints. We can think of the program’s CFG as a coarse representation of the finite state machine

(FSM) needed to control the hardware being synthesized – the nodes and edges are analogous to those

of a state diagram. Each branch condition in the CFG will become a state transition in the final FSM.

What is not represented in this coarse FSM are data dependencies between operations within a basic

block and the latencies of operations (e.g., a memory access may take more than a single cycle).

After constructing a coarse FSM from the CFG, we schedule each basic block individually. Figure 2.6

gives the schedule and corresponding FSM for the basic block DFG we saw previously in Figure 2.3,

where each operation has been scheduled to occur in a particular FSM state (clock cycle). Given this

schedule, the basic block will take four cycles to complete in hardware. Our memory controller is dual

ported, taking advantage of FPGA on-chip RAM and allowing two load/stores to be performed every

cycle. To satisfy this resource constraint, we scheduled only two loads in the first state and we pushed

the third load to the next state. Alternatively, we could have scheduled the third load in the first state; however, this would have pushed one of the first two loads to the next state, lengthening the overall

schedule by one cycle.


[Figure: Gantt chart over states 0 through 8, one row per LLVM instruction of Figure 2.5, with arrows marking data dependencies.]

Figure 2.8: Scheduled FIR filter LLVM instructions with data dependencies.

The simplest scheduling approach, which ignores resource constraints, is as-soon-as-possible (ASAP)

scheduling [Gajsk 92]. ASAP scheduling assigns an instruction to the first state after all of its depen-

dencies have been computed, guaranteeing the shortest schedule. We provide pseudocode for ASAP

scheduling in Figure 2.7, which assigns a state number, starting from zero, to each instruction. Here, we

visit the instructions within each basic block in topological order (line 1) and loop over each instruction’s

operands (line 3). The operands for each instruction are either: 1) from this basic block and therefore

guaranteed to have already been assigned a state, or 2) from outside this basic block, in which case we

can safely assume they will be available before control reaches this basic block (line 4). For operands with

multi-cycle latencies, such as pipelined divides or memory accesses, we schedule the instruction after the

instruction producing the operand has completed (line 7). Usually an instruction will be scheduled one

cycle after all of its operands have completed (line 9). In some cases, we can schedule an instruction into

the same state as one of its operands, which is called operation chaining. We perform chaining in cases

where the estimated delay of the chained operations (from allocation) does not exceed the estimated

clock period for the design (line 11). Chaining can reduce hardware latency (# of cycles for execution)

and save registers without impacting the final clock period.
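The following self-contained C++ program is one possible rendering of the pseudocode in Figure 2.7, run on the FIR loop-body DFG of Figure 2.3. It is a sketch rather than LegUp's implementation: the latency and delay values are assumptions chosen to exercise both the multi-cycle and chaining cases, and resource constraints are ignored, as in Figure 2.7.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct Op {
  std::string name;
  std::vector<int> operands; // producers, listed in topological order
  int latency;               // cycles; > 0 for multi-cycle units (e.g., loads)
  double delay;              // combinational delay estimate in ns
};

// ASAP scheduling as in Figure 2.7: multi-cycle producers push the consumer
// past their latency; zero-latency producers are chained unless the chained
// delay would exceed the clock period.
std::vector<int> asapSchedule(const std::vector<Op> &ops, double clockPeriod) {
  std::vector<int> state(ops.size(), 0);
  std::vector<double> chainDelay(ops.size(), 0.0);
  for (size_t i = 0; i < ops.size(); ++i) {
    for (int p : ops[i].operands) {
      if (ops[p].latency > 0)
        state[i] = std::max(state[i], state[p] + ops[p].latency);
      else if (chainDelay[p] + ops[i].delay > clockPeriod)
        state[i] = std::max(state[i], state[p] + 1); // chaining violates timing
      else
        state[i] = std::max(state[i], state[p]);     // chain into same state
    }
    for (int p : ops[i].operands) // accumulate delay along same-state chains
      if (ops[p].latency == 0 && state[p] == state[i])
        chainDelay[i] = std::max(chainDelay[i], chainDelay[p]);
    chainDelay[i] += ops[i].delay;
  }
  return state;
}

int main() {
  // FIR loop body (Figure 2.3): three loads feed a multiply-accumulate.
  std::vector<Op> dfg = {{"load y", {}, 2, 0.0},  {"load coeff", {}, 2, 0.0},
                         {"load x", {}, 2, 0.0},  {"mul", {1, 2}, 0, 8.0},
                         {"add", {0, 3}, 0, 2.5}, {"store", {4}, 0, 0.0}};
  std::vector<int> s = asapSchedule(dfg, /*clockPeriod=*/10.0);
  for (size_t i = 0; i < dfg.size(); ++i)
    std::printf("%-10s -> state %d\n", dfg[i].name.c_str(), s[i]);
}

With a 10 ns clock period, the multiply (8 ns) cannot chain with the add (2.5 ns), so the add lands one state later, while the store chains with the add in the same state.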

Figure 2.8 is a Gantt chart showing the ASAP schedule of the FIR filter LLVM instructions shown

in Figure 2.5. The chart shows the same LLVM instructions, now scheduled into nine states. Data


dependencies between operations are shown; in this case we do not allow operation chaining (for clarity).

We assume that load instructions have a two cycle latency. Once a load has been issued, a new load can

be issued on the next cycle.

In the presence of resource constraints, HLS scheduling is an NP-hard problem that can be solved with

integer linear programming [Hwang 91] or approximately solved using various heuristics. There are two

conventional categories of scheduling heuristics: resource-constrained, where the number of functional

units is specified, and time-constrained where the maximum cycle length of the schedule is specified.

Resource-constrained HLS scheduling is typically performed using the list scheduling technique [Adam 74].

A list scheduler keeps track of a list of candidate operations to schedule at the current time step. An op-

eration is a candidate if all data dependencies are met and if there are still compatible resources available.

The choice of candidate operations is based on a priority, the simplest being either as-soon-as-possible

(ASAP), which schedules operations as soon as their data dependencies are met, or as-late-as-possible

(ALAP), which schedules the operations at the latest time while still maintaining the overall schedule

length achieved using ASAP scheduling. Operations are taken from the candidate list in order of priority

and committed to the current time step; then the candidate list is updated to reflect operations that can

now be scheduled. A common priority function used in list scheduling is the mobility [Pangr 87] of each

operation, which is the difference between the ASAP scheduled time and ALAP scheduled time. The

mobility gives a measure of the scheduling flexibility of an operation, with zero indicating that delaying

this operation will lengthen the overall schedule.
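A minimal C++ sketch of mobility-driven list scheduling follows, under simplifying assumptions not in the text: unit-latency operations, a single resource class, and mobilities precomputed by the caller from ASAP and ALAP passes.

#include <algorithm>
#include <cstdio>
#include <vector>

// One operation in the DFG; mobility = ALAP - ASAP time (0 = critical path).
struct Node {
  std::vector<int> preds; // indices of predecessor ops (topological order)
  int mobility;
};

// Resource-constrained list scheduling with mobility as the priority,
// assuming unit-latency operations and one shared resource class.
std::vector<int> listSchedule(const std::vector<Node> &g, int unitsPerCycle) {
  std::vector<int> state(g.size(), -1);
  size_t scheduled = 0;
  for (int cycle = 0; scheduled < g.size(); ++cycle) {
    std::vector<int> ready; // candidates: all preds finished in earlier cycles
    for (int i = 0; i < (int)g.size(); ++i) {
      if (state[i] >= 0) continue;
      bool ok = true;
      for (int p : g[i].preds) ok &= (state[p] >= 0 && state[p] < cycle);
      if (ok) ready.push_back(i);
    }
    std::sort(ready.begin(), ready.end(), [&](int a, int b) {
      return g[a].mobility < g[b].mobility; // critical ops claim units first
    });
    for (int k = 0; k < (int)ready.size() && k < unitsPerCycle; ++k) {
      state[ready[k]] = cycle;
      ++scheduled;
    }
  }
  return state;
}

int main() {
  // Four ops: 0 and 1 are independent, 2 needs 0, 3 needs 1 and 2.
  std::vector<Node> g = {{{}, 0}, {{}, 1}, {{0}, 0}, {{1, 2}, 0}};
  std::vector<int> s = listSchedule(g, /*unitsPerCycle=*/1);
  for (size_t i = 0; i < g.size(); ++i)
    std::printf("op%zu -> cycle %d\n", i, s[i]);
}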

Force-directed scheduling [Pauli 89] is an example of time-constrained scheduling. This approach

uses the mobility range of each operation to estimate the distribution of resource requirements. For

each resource type, this distribution gives a measure of how many operations could be scheduled at a

particular time step. Operations are then selected to balance this distribution and minimize resource

usage at each time step while still meeting the schedule length constraint.

For control-intensive programs, we typically have many smaller basic blocks and a complex control

flow graph. This can lead to many cycles being spent due to the control flow of the program, when oper-

ations from different basic blocks could have been overlapped. SPARK [Gupta 03] proposed extending

the candidate operations during list scheduling to include operations outside the current basic block,

thereby speculatively executing an operation early in the hope that its result will be needed. Another approach is path-based scheduling [Campo 91], which schedules all possible execution paths through the CFG independently and then combines them into a final schedule.

2.6.1 SDC Scheduling

The scheduling heuristics discussed above can become trapped by locally optimal choices, while optimal branch-and-bound scheduling approaches are too slow for large programs. Alternatively, state-of-the-art HLS

scheduling uses a mathematical framework, called a system of difference constraints (SDC) to describe

constraints related to scheduling [Cong 06b]. The SDC framework is flexible and allows the specification

of a wide range of constraints such as data and control dependencies, resource constraints, relative timing

constraints for I/O protocols, and clock period constraints.

A system of difference constraints formulation is a set of difference constraints:

v_i − u_i ≤ C (2.1)


[Figure: constraint graph with vertices c0 through c4 and weighted edges c1→c0 (5), c2→c1 (−8), c3→c2 (4), c4→c3 (−3), c0→c4 (1).]

Figure 2.9: System of difference constraints graph.

where v_i and u_i are variables to be solved for and C is a constant real number.

Here is an example of a system of difference constraints:

c_0 − c_1 ≤ 5
c_1 − c_2 ≤ −8
c_2 − c_3 ≤ 4
c_3 − c_4 ≤ −3
c_4 − c_0 ≤ 1 (2.2)

By limiting the constant C values to integers, the constraint matrix formed by a system of difference

constraints has the property of being totally unimodular. A totally unimodular matrix is defined as

a matrix whose every square submatrix has a determinant of 0, -1, or +1. Due to this property, the

solution to the constrained linear programming (LP) problem is guaranteed to have integer solutions,

which avoids expensive branch-and-bound required for solving integer linear programming problems. We

can solve an SDC problem using a standard LP solver in polynomial time.

A system of difference constraints can also be represented as a constraint graph with a vertex corre-

sponding to each variable in the system and an edge for each difference constraint: u_i → v_i with an edge

weight equal to C. The constraint graph for Equation (2.2) is shown in Figure 2.9. The SDC is feasible

iff there are no negative cycles in the graph [Ramal 99], where a negative cycle in a graph occurs when

the sum of edge weights along any cycle in the graph is negative. For example, given the constraints

matrix in Equation (2.2), the sum of all rows is equal to: c_0 − c_0 ≤ −1, which is clearly infeasible. In the

graph we observe a negative cycle with a path length of −1.
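This feasibility test is easy to mechanize. The C++ sketch below runs the standard Bellman-Ford relaxation over the constraint graph of Equation (2.2), treating a virtual source as being at distance zero from every vertex; if an edge still relaxes on the n-th pass, a negative cycle exists and the SDC is infeasible.

#include <cstdio>
#include <vector>

// Constraint c_v - c_u <= w becomes edge u -> v with weight w.
struct Edge { int u, v, w; };

int main() {
  const int n = 5; // variables c0..c4 of Equation (2.2)
  std::vector<Edge> edges = {{1, 0, 5}, {2, 1, -8}, {3, 2, 4},
                             {4, 3, -3}, {0, 4, 1}};
  std::vector<long> dist(n, 0); // virtual source: distance 0 to every vertex
  bool changed = true;
  for (int pass = 0; pass < n && changed; ++pass) {
    changed = false;
    for (const Edge &e : edges)
      if (dist[e.u] + e.w < dist[e.v]) {
        dist[e.v] = dist[e.u] + e.w;
        changed = true;
      }
  }
  // Shortest paths settle within n-1 passes, so a change on the n-th pass
  // proves a negative cycle (as here, where the cycle sums to -1).
  std::printf("%s\n", changed ? "infeasible: negative cycle" : "feasible");
}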

In SDC scheduling, each operation is assigned a variable that, after solving, will hold the clock cycle in which the operation is scheduled. Consider two operations, op1 and op2, and let the variable c_op1 represent the cycle in which op1 is to be scheduled, and c_op2 the cycle in which op2 is to be scheduled. If op1 depends on op2, then we must schedule op1 after op2 and we add the following difference constraint to the SDC formulation: c_op2 − c_op1 ≤ 0 (or equivalently: c_op2 ≤ c_op1).

We can also incorporate clock period constraints into SDC scheduling. Let P be the target clock

period and let C represent a chain of any N dependent combinational operations in the data flow graph: C = op1 → op2 → ... → opN. Assume that T represents the total estimated combinational delay of the chain of N operations, computed by summing the delays of each operator. We can add the following timing constraint to the SDC formulation: ⌈T/P⌉ − 1 ≤ c_opN − c_op1. This difference constraint requires that the cycle assignment for opN be at least ⌈T/P⌉ − 1 cycles later than the cycle in which op1 is

scheduled. Such constraints control the extent to which operations can be chained together in a clock

cycle. Chaining is permitted such that the target clock period P is met.
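The sketch below shows how both kinds of rows, dependency and timing, could be emitted for a combinational chain; the per-operator delays are illustrative assumptions, not LegUp's device characterization data.

#include <cmath>
#include <cstdio>
#include <vector>

// Emit SDC rows for a combinational chain op0 -> op1 -> ... -> op(N-1):
// one dependency row per edge, plus the timing row
// ceil(T/P) - 1 <= c_{op(N-1)} - c_{op0}, rewritten in <= form.
void emitChainConstraints(const std::vector<double> &delays, double P) {
  double T = 0;
  for (double d : delays) T += d;
  int minSep = (int)std::ceil(T / P) - 1;
  for (size_t i = 0; i + 1 < delays.size(); ++i)
    std::printf("c%zu - c%zu <= 0\n", i, i + 1);              // data dependency
  std::printf("c0 - c%zu <= %d\n", delays.size() - 1, -minSep); // timing
}

int main() {
  // FIR chain: registered load output (0 ns) -> mul (8 ns) -> add (2.5 ns)
  // with a 10 ns clock period; prints "c0 - c2 <= -1".
  emitChainConstraints({0.0, 8.0, 2.5}, 10.0);
}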

A property of a system of difference constraints is that its solutions are not unique. For example,


[Figure: left, a four-state schedule (two loads; an add and a third load; an add; a store); right, the bound datapath: a 2-port RAM feeding one shared adder and an output register (FF).]

Figure 2.10: Circuit datapath after binding for the given schedule (given one adder).

consider:

c_0 − c_1 ≤ 5
c_1 − c_2 ≤ −8 (2.3)

There are many feasible solutions (c_0, c_1, c_2) to this linear system, for example: (1, −4, 10) or (8, 4, 15).

However, because all SDC variables correspond to operation schedule times, we can add an additional

constraint on each variable: c_i ≥ 0. Furthermore, SDC scheduling allows a linear objective function, for example minimizing the start time of every operation, min Σ_i c_i, to achieve an as-soon-as-possible (ASAP) schedule. Objective functions to minimize circuit power have also been proposed [Jiang 08].

SDC scheduling does not solve the scheduling problem optimally, as there is a linear ordering heuristic

required when we have resource constraints. We refer the reader to [Cong 06b] for complete details of

the formulation and how other types of constraints can be included.

2.6.2 Extracting Parallelism

Until now we have assumed the hardware generated by HLS is fairly sequential in nature, with only one

state active during any clock cycle. However, hardware designs typically exploit parallelism. Parallel

computation can be specified explicitly by the user with a C library like Pthreads [Buttl 96], compiler pragmas such as OpenMP [Buttl 96], or language extensions like OpenCL [Openc 09]. Parallelism

can also be inferred by the HLS tool, where we attempt to extract the parallelism automatically. In

HLS, a common optimization is loop pipelining, which extracts parallelism across loop iterations to generate hardware pipelines; we discuss loop pipelining further in Chapter 5.

2.7 Binding

Binding comprises two tasks: operation binding assigns operations from the program to specific hardware units, and variable binding assigns program variables to registers. When multiple operations are assigned

to the same hardware unit, or when multiple variables are bound to the same register, multiplexers

are required to facilitate the sharing. The binding step is typically performed after scheduling; however, binding is interdependent with scheduling. In retrospect, we may wish to modify the schedule to


allow more binding to occur and save area. There have been proposals to do scheduling and binding

simultaneously [Resha 05].

Figure 2.10 shows an example of a schedule and the corresponding circuit datapath after binding,

ignoring memory addressing for clarity. We assume that allocation has given us one adder functional unit

and a shared two-port memory. Each adder operation has been scheduled in distinct cycles, therefore we

can assign both adders to the same functional unit. Normally, we would require two multiplexers, one

on each input of the adder, but since one input always arrives from the memory output a multiplexer is

unnecessary.

We have three goals when binding operations to shared functional units. First, we want to balance

the sizes of the multiplexers across functional units to keep circuit performance high. Multiplexers with

more inputs have higher delay, so we wish to avoid having a functional unit with a disproportionately

large multiplexer on its input. Second, we want to recognize cases where we have shared inputs between

operations, letting us save a multiplexer if the operations are assigned to the same functional unit.

Lastly, during binding if we can assign two operations that have non-overlapping lifetime intervals to

the same functional unit, we can use a single output register for both operations. In this case we save

a register, without needing a multiplexer. We use the LLVM live variable analysis pass to check for the

lifetime intervals.

To account for these goals we use the following cost function to measure the benefit of assigning

operation op to function unit fu:

Cost(op, fu) = φ · existingMuxInputs(fu) + β · newMuxInputs(op, fu) − θ · outputRegisterSharable(op, fu) (2.4)

where φ = 0.1, β = 1, and θ = 0.5 to give priority to saving new multiplexer inputs, then output registers,

and finally balancing the multiplexers. Here existingMuxInputs(fu) returns the number of multiplexer

inputs already required by the functional unit fu. The function newMuxInputs(op,fu) returns the number

of new multiplexer inputs required if we assign operation op to functional unit fu. For example, if

op = a ∗ b and we have already assigned op2 = c ∗ b to the functional unit fu, then we will only need

one additional multiplexer input for operand a, since operand b is shared. outputRegisterSharable(op,fu)

returns one if the operation op has a lifetime interval that does not overlap with any operation

already assigned to the functional unit fu. Notice that sharing the output register reduces the cost,

while the other factors increase it.
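The following C++ sketch renders Equation (2.4) directly with the weights from the text; the data structures are illustrative stand-ins for LegUp's internal state, not its real classes.

#include <cstdio>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Operation {
  std::string lhs, rhs;   // input operand names
  int lifeStart, lifeEnd; // lifetime interval of the result, in cycles
};

struct FunctionalUnit {
  std::set<std::string> lhsInputs, rhsInputs;  // operands already muxed in
  std::vector<std::pair<int, int>> lifetimes;  // result lifetimes bound so far
};

int existingMuxInputs(const FunctionalUnit &fu) {
  return (int)(fu.lhsInputs.size() + fu.rhsInputs.size());
}

int newMuxInputs(const Operation &op, const FunctionalUnit &fu) {
  // an operand already muxed into the unit is shared for free
  return (int)!fu.lhsInputs.count(op.lhs) + (int)!fu.rhsInputs.count(op.rhs);
}

bool outputRegisterSharable(const Operation &op, const FunctionalUnit &fu) {
  for (const auto &lt : fu.lifetimes) // sharable iff no lifetimes overlap
    if (op.lifeStart < lt.second && lt.first < op.lifeEnd) return false;
  return true;
}

double bindingCost(const Operation &op, const FunctionalUnit &fu) {
  const double phi = 0.1, beta = 1.0, theta = 0.5; // weights from the text
  return phi * existingMuxInputs(fu) + beta * newMuxInputs(op, fu) -
         theta * (outputRegisterSharable(op, fu) ? 1.0 : 0.0);
}

int main() {
  FunctionalUnit fu{{"c"}, {"b"}, {{0, 2}}}; // already bound: op2 = c * b
  Operation op{"a", "b", 3, 5};              // candidate:      op  = a * b
  std::printf("cost = %.2f\n", bindingCost(op, fu));
}

Running it on the example above yields a cost of 0.7: two existing multiplexer inputs (0.2), one new input for operand a (1.0), and a sharable output register (−0.5).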

In general, binding has been shown to be NP-hard [Pangr 91], but various heuristics have been

proposed. A common heuristic for solving the binding problem is called weighted bipartite match-

ing [Huang 90]. The binding problem is represented using a bipartite graph with two vertex sets. The

first vertex set corresponds to the operations being bound (i.e. LLVM instructions) during a particular

control step. The second vertex set corresponds to the available functional units. A weighted edge is

introduced from a vertex in the first set to a vertex in the second set if the corresponding operation

is compatible with the corresponding functional unit. The cost given in Equation 2.4 of assigning an

operation to a given functional unit is assigned to the edge weight connecting the corresponding vertices.

After constructing the weighted bipartite graph, we wish to match each vertex from the first vertex set

(operations) to exactly one of the connected vertices from the second set (compatible functional units)


[Figure: bipartite graph with operations addition1, addition2 on one side and adderFuncUnit1, adderFuncUnit2 on the other; edge weights: addition1–adderFuncUnit1 = 3, addition1–adderFuncUnit2 = 4, addition2–adderFuncUnit1 = 1, addition2–adderFuncUnit2 = 5.]

Figure 2.11: Bipartite Graph.

such that the overall cost (sum of edge weights) is minimized. The weighted bipartite matching problem

can be solved optimally in O(n3) time using the Hungarian method [Kuhn 10]. We formulate and solve

the matching problem one clock cycle at a time until the operations in all clock cycles (states) have

been bound to an available functional unit. An example is shown in Figure 2.11; in this case we would

match operation addition1 to functional unit adderFuncUnit2 and addition2 to adderFuncUnit1 for

a minimum edge weight of 5. The weighted bipartite matching approach can also be used for variable

binding, where one vertex set corresponds to program variables and the other set to registers.
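For a problem as small as Figure 2.11, the optimal matching can even be found by enumerating assignments, as the C++ sketch below does; the Hungarian method replaces this brute force at realistic problem sizes.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // cost[i][j]: Equation (2.4) cost of binding operation i to unit j,
  // using the edge weights of Figure 2.11.
  const double cost[2][2] = {{3, 4},   // addition1 -> {fu1, fu2}
                             {1, 5}};  // addition2 -> {fu1, fu2}
  std::vector<int> perm = {0, 1}, bestPerm;
  double best = 1e18;
  do { // try every one-to-one assignment of operations to units
    double c = cost[0][perm[0]] + cost[1][perm[1]];
    if (c < best) { best = c; bestPerm = perm; }
  } while (std::next_permutation(perm.begin(), perm.end()));
  std::printf("addition1 -> adderFuncUnit%d\n", bestPerm[0] + 1);
  std::printf("addition2 -> adderFuncUnit%d\n", bestPerm[1] + 1);
  std::printf("total cost = %.0f\n", best); // 5, matching the text
}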

Another approach to binding is as a clique partitioning problem [Tseng 86]. In this formulation,

graph vertices represent operations and an edge between two vertices indicates the associated operations

can share hardware units. By finding the minimal number of cliques in the graph, we can minimize the

hardware. Clique partitioning is NP-hard in general but heuristics can be used. To apply this approach

to variable binding, each vertex represents a program variable and an edge between two vertices indicates

the variables have non-overlapping lifetimes. Graph colouring approaches can also be used during variable

binding for register allocation [Chait 81, Beida 05].

2.8 FPGA Architecture

In this section, we describe modern FPGA architecture specifically focusing on two commercial FPGAs

from Altera: the Cyclone II [Cycl 04] FPGA (90nm) and Stratix IV [Stra 10] FPGA (40nm). We

target these two FPGAs exclusively in the experimental results provided in this dissertation, due to

the widespread availability of Altera’s DE2 [DE2 10b] and DE4 [DE4 10] development and education

boards.

Modern FPGA architecture consists of a two-dimensional array of logic array blocks, each consisting

of a lookup table (LUT), registers, and some additional circuitry [Betz 99]. A k-input LUT can imple-

ment any k-input logic function by using a programmable SRAM containing a 2^k-bit truth table and a 2^k-to-1 multiplexer to select the correct output. Stratix IV has a considerably different logic array

block architecture than Cyclone II, as illustrated in Figure 2.12. Cyclone II uses logic elements (LEs)

containing 4-input LUTs to implement combinational logic functions, whereas Stratix IV uses adaptive

logic modules (ALMs). An ALM is a two-output 6-LUT that receives eight inputs from the FPGA in-

terconnection fabric. The ALM can implement an arbitrary 6-input function, or two 4-input functions,

or a 3- and 5-input function, or several other functions. In both FPGA architectures, the LUT output

can either be used combinationally or registered sequentially.
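As a software analogy, the C++ sketch below models a k-input LUT exactly as described: a programmable 2^k-bit truth table indexed, multiplexer-style, by the input vector. Programming it to compute parity is an arbitrary choice for illustration.

#include <bitset>
#include <cstdio>

template <unsigned K> struct Lut {
  std::bitset<(1u << K)> sram;                 // programmable truth table
  bool eval(unsigned in) const { return sram[in & ((1u << K) - 1)]; }
};

int main() {
  Lut<4> lut; // program a 4-LUT to compute the XOR (parity) of its inputs
  for (unsigned i = 0; i < 16; ++i)
    lut.sram[i] = std::bitset<4>(i).count() & 1;
  std::printf("xor(1,1,0,1) = %d\n", (int)lut.eval(0b1101)); // prints 1
}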

Much of the early work on HLS focused on targeting ASICs, but there are key differences between


[Figure: (a) Cyclone II logic element (LE): a 4-LUT feeding a D flip-flop, with combinational (comb_out) and registered (seq_out) outputs; (b) Stratix IV adaptive logic module (ALM): a fracturable 6-LUT feeding two D flip-flops, with two combinational and two registered outputs.]

Figure 2.12: Cyclone II and Stratix IV logic element architectures.

targeting ASICs and FPGA devices. On an FPGA, multiplexing requires more device area compared

to an ASIC, because multiplexers are implemented in LUTs on an FPGA device. A 32-bit wide 2-to-1

multiplexer implemented in 4-LUTs is the same size as a 32-bit adder. If we decide to share an adder

unit, we may need a multiplexer on each input, making the shared version 50% larger than simply

using two adders. In contrast, a 2-to-1 multiplexer implemented on an ASIC is implemented cheaply

using two AND gates connected to an OR gate, or transmission gates. Another difference is that an

FPGA has hardened ASIC-like blocks to implement multipliers (using DSP blocks) and memory (using

block RAMs). If possible, DSP blocks should be used instead of implementing multipliers using LUTs.

Likewise, the FPGA fabric is register rich – each logic element in the fabric has a LUT and a register.

Therefore, sharing registers is rarely justified. Consequently, in HLS we must account for the underlying

FPGA architecture to synthesize the best circuit.


Chapter 3

LegUp: Open-Source High-Level Synthesis Research Framework

3.1 Introduction

In this chapter, we describe an open-source high-level synthesis (HLS) framework called LegUp. LegUp

enables designers to use a simpler hardware design methodology, in which a software application imple-

mented in C can be incrementally synthesized to target a hybrid FPGA processor/accelerator system-

on-chip. This target architecture harnesses the energy and performance benefits of hardware, while

the LegUp HLS framework raises the design-entry abstraction up to the software level. During the

design process, the input C program is partitioned into functions executing on the processor and other

functions that are synthesized automatically into digital logic, or hardware accelerators, that are imple-

mented on the FPGA. During program execution, the processor automatically offloads computation to

these hardware accelerators, resulting in better performance than the original software implementation.

We have observed that robust open-source academic electronic design automation (EDA) tools can

lead to significant new research progress by lowering the barrier to entry for new researchers. For example,

the Versatile Place and Route (VPR) tool has been used by hundreds of FPGA researchers to perform

studies on FPGA architecture, and for developing new place and route algorithms [Betz 97, Luu 14b].

Another example is the open-source ABC logic synthesis system, which has renewed academic interest

in logic synthesis research [Mishc 06]. High-level synthesis and application-specific processor design can

also benefit from the availability of a robust open-source framework such as LegUp, which has been

missing in the research community.

Currently, LegUp is intended to target FPGA-based embedded systems, which are application-specific

systems typically implemented on a single board. Embedded systems that utilize FPGAs often include

a soft processor, which is a processor implemented in lookup tables within the FPGA fabric [Nios 09b].

In embedded systems, particularly those using a soft processor, LegUp can significantly increase perfor-

mance and energy-efficiency by performing computations in custom hardware instead of running them

on the processor. Alternatively, LegUp could target a high performance computing platform, where a

commodity general-purpose processor [Lempe 11] is connected to an FPGA board over the PCIe bus.

In this scenario, we must be cognisant of off-chip bandwidth limitations when passing data to and from

the FPGA, which can reduce or eliminate our performance gains from hardware accelerators. We focus


primarily on the embedded system target architecture in this dissertation.

In this chapter, we will present the LegUp design methodology and target architecture, as well as

other implementation details. We will also present an experimental evaluation comparing LegUp to a

commercial HLS tool. In this study, we measure LegUp's ability to effectively explore the hardware/software design space of a given program.

The remainder of this chapter is organized as follows: Section 3.2 describes other HLS tools related

to this work. Section 3.3 introduces the target hardware architecture and outlines the high-level design

flow. The details of the high-level synthesis tool and software/hardware partitioning are described in

Section 3.4. An experimental study appears in Section 3.5. Section 3.6 discusses recent research enabled

by the LegUp framework. A summary is given in Section 3.7.

3.2 Background

3.2.1 Prior HLS Tools

As we discussed in Chapter 2, high-level synthesis, also known as behavioural synthesis or electronic

system level (ESL) design, has been studied for over 30 years [McFar 88, Cong 11]. In this section, we

survey various HLS tools, both academic and commercial, that have been developed recently.

Several HLS tools have been developed to target digital signal processing (DSP) applications. The

Riverside Optimizing Compiler for Configurable Circuits (ROCCC) [Villa 10] from UC Riverside is an

open-source high level synthesis tool. ROCCC is designed to accelerate critical kernels that perform

repeated computation on streams of data. These kernels are typical in DSP applications such as FIR

filters or fast Fourier transforms. ROCCC is not designed for compiling entire C programs into hardware

and many C features are unsupported, such as: pointers, shifting by a variable amount, non-for loops,

and the ternary operator. ROCCC has a bottom-up development process that involves partitioning

one’s application into modules and systems. Modules are C functions that are synthesized by ROCCC

into a circuit datapath without any control logic. ROCCC fully unrolls any C loops within the module

at compile time. These modules cannot access memory but have data streamed into them and output

scalar values. Systems are C functions that instantiate modules to repeat computation on a stream

of data or a window of memory and usually consist of a loop nest with special function parameters for

streams. ROCCC supports advanced optimizations such as: systolic array generation, smart buffers, and

temporal common subexpression elimination. ROCCC can also generate Xilinx PCore modules to be

used with a Xilinx MicroBlaze processor [Micro 14]. However, ROCCC’s strict subset of C is insufficient

for compiling any of the CHStone benchmarks (described in Section 3.4.5) and ROCCC does not support

any resource sharing.

Another high-level synthesis tool designed for DSP applications is GAUT [Couss 10] from the University of South Brittany. GAUT synthesizes a single C function into a pipelined hardware architecture consisting

of a processing unit, a memory unit, and a communication unit described in VHDL. The tool also

includes a graphical viewer to analyze the program data flow graph and the final HLS schedule. The

user must specify the circuit throughput, expressed as a pipeline initiation interval (see Chapter 5), and

the clock period constraint.

We now discuss a few recent HLS tools that target more general applications. xPilot [Cong 06a] is a

state-of-the-art academic HLS tool, developed at UCLA, that has been used for numerous HLS studies


(e.g., [Chen 04, Cong 06c, Jiang 08, Cong 09, Wang 13]) and is now commercially released as Xilinx’s

Vivado HLS tool [Xili]. The CHiMPS [Putna 08] project, developed by Xilinx and the University of

Washington, synthesizes a C program into an FPGA circuit with many distributed caches, utilizing the

available FPGA block RAMs, and supporting latency-hiding of off-chip memory accesses. Trident is an

HLS compiler developed at Los Alamos National Labs, formerly called Sea Cucumber, which targeted

floating-point scientific applications [Tripp 07]. Bambu [Pilat 12] and DWARV [Nane 12] are two other

recent academic HLS tools, with Bambu offering support for custom floating point unit generation

using FloPoCo [De Di 11]. Of all the academic tools, the recently released Shang [Shan 13] is most

comparable to LegUp, supporting a hybrid flow where software is executed on an Altera Nios II soft

processor. Shang is built on LLVM (like LegUp) but instead of working with LLVM IR instructions,

Shang works on the LLVM machine code layer allowing further area optimizations. Shang also uses

multi-cycle path analysis [Zheng 13] to efficiently chain operations, allowing for better performance and

area.

Although we focus on C as the high-level input language in this dissertation, there have been other

proposed languages that can offer better hardware expressibility. The most popular alternative is Sys-

temC [Syste 02], used by Forte [Fort] among others. SystemC is a C++-based library offering a modeling

platform that supports a flexible input ranging from untimed C to a cycle-accurate RTL-like descrip-

tion using the SystemC class library. SystemC ships with a built-in simulation kernel that can perform

cycle-accurate simulations orders of magnitude (1000×) faster than an equivalent RTL-based simula-

tion [Coolea]. Proposed HLS input languages also have included C-based language extensions. For

example, SpecC [Gajsk 00], developed at UC Irvine, added support for state machines and hardware

pipelines. Other examples of C-based languages include: HardwareC [Ku 88] from Stanford University

and Handel-C [Aubur 96] from Oxford University. IBM Research has developed an HLS compiler called

LiquidMetal that uses an object-oriented Java-like language, LIME, which includes hardware-specific

extensions, such as bitwidth-specific integers [Huang 08]. Other high-level synthesis languages offer

user-explicit parallelism such as the general purpose GPU language OpenCL [Openc 09] used by Al-

tera’s OpenCL HLS tool [Open]. BlueSpec [Blue] uses the Haskell functional language to specify circuit

behaviour using the guarded atomic action model, which consists of guard/action pairs. Each guard is

a boolean function that triggers an atomic action to occur, which modifies the circuit state.

Our described design methodology bears some similarity to the FPGA-based Warp Processor devel-

oped at UC Riverside [Vahid 08]. Their approach starts by profiling the software binary executing on

the Warp Processor. Using this profiling data, they automatically select critical regions of the software

binary to synthesize into a custom digital circuit on the FPGA. Before synthesizing the circuit, they dis-

assemble the region of the software binary into a higher-level representation suitable for HLS [Stitt 07].

They take the circuit description produced by HLS and run FPGA CAD tools to synthesize the circuit

into a programmable bitstream for the FPGA target device. They reprogram the FPGA using this

bitstream. Next, the Warp processor transparently modifies the software binary to call the hardware

during the appropriate software region, dynamically improving the speed and energy consumption of the

program. LegUp’s design methodology is similar, but we synthesize our custom hardware accelerators

directly from the software source code, instead of the disassembled binary, enabling us to generate a

better final hardware circuit using HLS. The Warp processor was never publicly released or developed

commercially.

We show a summary of the release status of the surveyed tools in Table 3.1. Tools are broken into


Table 3.1: Release status of recent non-commercial HLS tools.

Open-source      Binary-only      No source or binary
Trident          xPilot           Warp Processor
ROCCC            GAUT             LiquidMetal
Shang            SPARK            CHiMPS
Bambu            DWARV

three categories: 1) the source code is available, 2) only a binary is available, or 3) neither source nor binary is available. Binary-only tools are only useful for benchmarking and cannot be modified by

researchers trying to investigate new HLS algorithms. Tools without a binary release cannot have their

published results independently verified. We now discuss a few shortcomings of the currently available

open-source HLS tools that motivate the creation of our new open-source HLS framework. The Trident

tool is implemented using an older LLVM version and has not been actively maintained for several years.

Trident also only synthesizes pure hardware designs and cannot support a hybrid hardware/processor

architecture. ROCCC is under active development, but lacks the C language support required for our

desired benchmark programs. Bambu is built using the GCC compiler [Stall 99] and is still under active

development. The tool supports the CHStone benchmark suite but only targets a pure hardware flow

(no processor). Shang is the most comparable open-source HLS tool to LegUp, with source code released

in the middle of 2013. However, development appears to have stopped at the end of 2012, based on

their source control system. We found the code less well-tested (many segfaults) and harder to install

compared to LegUp but the project looks promising overall. When LegUp 1.0 was released in 2011, it was

the only open-source HLS tool compiling a complete C program to a hybrid processor/accelerator system

architecture, where the synthesized hardware comprised a general datapath/state machine model. We

have found that LegUp provides researchers with a robust infrastructure supporting larger and more

general C programs than those handled by prior open-source tools.

Commercial HLS tools have been gaining traction in recent years, with both start-ups and major

EDA vendors offering HLS tools. From FPGA vendors, Xilinx offers Vivado HLS [Xili], formerly Au-

toPilot, while Altera has released the OpenCL compiler [Open]. From major EDA vendors, there is

Catapult C from Mentor Graphics [Caly], and C-to-Silicon [Cade] and Forte [Fort] (recently acquired)

from Cadence, and Synphony HLS, formerly Synfora PICO, from Synopsys [Synp 15]. From smaller

HLS companies, there is eXCite from Y Explorations [eXCi 10], CoDeveloper from Impulse Accelerated

Technologies [Impu], C2R from CebaTech [Ceba], and CyberWorkBench [Wakab 06] from NEC.

Altera’s commercial C2H tool [Nios 09a] (deprecated in Quartus 9.1) targeted a system architecture

similar to LegUp’s target architecture. C2H required the user to categorize a C program’s functions as

either hardware or software-based. After C2H generated the system, the software-based functions would

execute on a Nios II soft processor [Nios 09b], while the hardware-based functions would be synthesized

into custom hardware accelerators. These hardware accelerators were connected to the Nios II processor

using an Avalon interface (Altera’s on-chip interconnection standard). However, C2H lacked enough

coverage of the C language to compile our benchmark suite.

3.2.2 Application-Specific Instruction-Set Processors (ASIPs)

Application-specific instruction-set processors (ASIPs) [Pothi 10, Pozzi 06, Sun 04] are embedded pro-

cessors which offer support for adding new custom instructions to augment their existing instruction set


architecture. If we reconfigure the processor datapath to include useful application-specific instructions,

we can then improve the program performance and energy consumption compared to a general-purpose

processor. We perform a profiling step to analyze the application for critical regions before deciding

which custom instructions to implement in the ASIP. We can use pattern matching techniques on the

profiling data to recognize commonly executed sequences of program instructions which we can group

together to form a new custom instruction. At this point, HLS is used to synthesize the custom datap-

ath required for the custom instruction and then to resynthesize the ASIP. Finally, the software code is

updated to utilize the new custom instruction.

LegUp’s hybrid processor/accelerator target architecture has two main differences when compared

to an ASIP. Firstly, custom instructions require the ASIP to stall while the hardware performs compu-

tation whereas in LegUp, our loosely-coupled processor/accelerator architecture permits the hardware

accelerators and the processor to run in tandem. Secondly, LegUp can synthesize large portions of a

C program into hardware and is not limited to synthesizing small groups of instructions like the ASIP.

ASIPs have an advantage for very small hardware accelerators because they can access the processor’s

register file.

3.3 LegUp Overview

In this section, we describe LegUp’s design methodology and the target architecture. Implementation

details will follow in Section 3.4.

3.3.1 Design Methodology

In LegUp’s design methodology, the user begins with a C program which they compile and run on

a processor to gather profiling information. Using this data, they partition the program into either

software or hardware regions and recompile. Hardware regions are automatically synthesized by LegUp

into digital logic on the FPGA and software regions execute in tandem on the processor. We provide a

detailed flow chart for this methodology in Figure 3.1. In the first step, the user compiles a standard

C software input program using the LLVM C compiler. The resulting binary is executed on an FPGA-

based processor such as the Tiger MIPS soft processor [Tige 10] or a hardened on-chip ARM Cortex-A9

processor [ARM 11]. In the case of the Tiger MIPS processor, we have added custom profiling logic to

the processor datapath to measure cycle counts accurately [Aldha 11a]. We choose hardware profiling to

avoid the need to add instrumentation to the software program for profiling which can slow down program

execution. Consequently, we can transparently obtain very accurate profiling measurements including

exact cycle measurements of off-chip memory accesses and processor cache misses. We currently profile

at the granularity of individual functions in the program. In the next step of our flow, hardware/software

partitioning, the user analyzes the profiling data to identify critical functions in the program that could

benefit from hardware acceleration. These are functions that could be synthesized into hardware with

greater performance or improved energy consumption. The partitioning process is currently manual and

requires the user to mark each function that should be implemented in hardware by LegUp in a Tcl

configuration file. The remaining functions execute on the processor.

The final step of the flow is labeled “LegUp” in Figure 3.1 because this step is what we typically refer

to as LegUp in this dissertation. LegUp requires a C compiler and an HLS tool, both of which are built

within the LLVM compiler framework. At this stage, the user has chosen functions to synthesize into


[Figure: flow chart: a C input program is compiled (C compiler) and run on a self-profiling processor; profiling data drives hardware/software partitioning; LegUp then applies a C compiler for the processor-bound code and high-level synthesis for the hardware accelerators on the FPGA.]

Figure 3.1: LegUp Design Methodology.

hardware, which are then passed into our high-level synthesis flow to generate a hardware description that

can be synthesized onto the FPGA as a hardware accelerator. Each hardware-based C function will be

synthesized into a separate hardware accelerator. Furthermore, if a function calls another function then

these called functions are also synthesized into hardware. Currently, only software functions executing

on the processor can call hardware accelerators; the reverse is not supported.

In the final stage, the LegUp C compiler modifies the software to use the hardware accelerators. For

each hardware-partitioned function, LegUp adds specific code that will start the corresponding hardware

accelerator and pass data between the processor and accelerator. We then execute this modified software

on the FPGA-based processor. At the bottom of Figure 3.1, we show the target system architecture

consisting of a processor connected to multiple hardware accelerators on an FPGA device.

In this design methodology, the user can harness the performance and energy benefits of an FPGA us-

ing an incremental methodology while limiting time spent on hardware design. The LegUp programming

flow bears some similarity to general-purpose GPU programming using the languages CUDA [CUDA 07]

and OpenCL [Openc 09] in the sense that we allow the programmer to iteratively and incrementally work

to achieve a speedup, with the whole program working at all times.

3.3.2 Target System Architecture

LegUp’s target FPGA-based system architecture is shown in Figure 3.2. We included a processor in

our target system to support C program code that is inappropriate for hardware implementation. For

example, searching a linked list in software is inherently sequential and will achieve limited speedup

in hardware. However, highly parallel code, such as vector addition (Figure 3.5), can achieve much


[Figure: an FPGA containing a processor with an on-chip cache (connected to off-chip memory) and multiple hardware accelerators, each with local memory, communicating over the Avalon interconnect.]

Figure 3.2: LegUp target hybrid processor/accelerator architecture.

better performance by leveraging the available logic and parallel computation enabled by the FPGA.

Furthermore, offering the user a choice of running portions of the program on a processor increases

the range of allowable input programs. Functions within the program that require language features

unsupported by hardware accelerators, for example recursion or dynamic memory, can be executed on

the processor.

LegUp targets the Tiger MIPS soft processor from the University of Cambridge [Tige 10] which sup-

ports the full MIPS instruction set, has a mature tool flow, and is described in well-documented modular

Verilog. Mark Aldham evaluated two other FPGA-based soft processors to target with LegUp [Aldha 11b]:

YACC [YACC 05] and SPREE [Yiann 07]. Alternatively, Blair Fort is finalizing LegUp support for

targeting the on-chip ARM Cortex-A9 processor [ARM 11] available on the Altera DE1-SoC board

[DE1 13], which includes a Cyclone V SoC FPGA device (for details see [Fort 14]). The Cortex-A9

processor operates at a 800 MHz clock frequency, which is significantly faster than the Tiger MIPS

soft processor running at 74 MHz. Bain Syrowik benchmarked the Cortex-A9 and measured a ge-

omean wall-clock time speedup of 9.4× compared to the Tiger MIPS for software execution across

the CHStone benchmarks [ARM 14]. Furthermore, the Cortex-A9 includes 128-bit SIMD extensions

(NEON) [ARM 11], which are very applicable to the benchmarks targeted by LegUp. However, our

preliminary experiments have found that using NEON only offers a 20% geomean wall-clock time im-

provement across the CHStone benchmarks [ARM 14]. We could improve this performance by using

hand-written NEON assembly instructions instead of relying on the LLVM compiler backend. In this

chapter, we only target the Tiger MIPS soft processor for software execution.

The processor connects to one or more custom hardware accelerators through a standard on-chip

interconnect. Presently, we use the Altera Avalon interconnect [Aval 10] for communication between

the processor and the hardware accelerators with LegUp automatically generating the Avalon interface

using Altera’s SOPC [Docu 11] builder tool (now deprecated). We are also adding support for generating

the interconnect with Altera’s new interface generator: Qsys [Qsys 14]. The Avalon interconnect is

implemented as point-to-point connections between communicating hardware modules instead of as

a shared bus, for greater performance. In this system, hardware accelerators do not communicate

with other hardware accelerators, only with the processor. The interconnect allows the processor and

accelerators to communicate through a memory-mapped interface.


Our target architecture has a shared memory system, with all memory accesses from either the

processor or hardware accelerators going through an on-chip memory cache. The on-chip memory cache

is based on the Tiger MIPS data cache but has been heavily modified by Jongsok Choi. The memory

within the cache is implemented using fast FPGA block RAMs. If memory is not contained within the

cache, the cache controller requests data from off-chip main memory. Having a single-level shared cache

simplifies our architecture by not requiring cache coherency between multiple caches. Further detail on

the cache architecture can be found in [Choi 12a]. The Tiger MIPS processor also contains a separate

instruction cache. We call the memory accessed through the on-chip cache processor memory, because

this memory is shared between the processor and hardware accelerators. We distinguish processor

memory from the memory architecture within each hardware accelerator, which stores any memory

(constants and local variables) that is not shared with the processor. This memory is stored in FPGA

block RAMs and allows the hardware accelerator to use memory locally without possible contention

when accessing the Avalon interconnect, enabling greater parallelism and performance. This memory

also avoids cache misses and the associated latency to fetch off-chip memory. Memory within a hardware

accelerator is handled by a separate memory controller as described in Chapter 6.

We concede that our performance may be limited by the shared memory cache if we instantiate many

hardware accelerators in this system that share memory with the processor. Fixing this bottleneck is

outside the scope of this dissertation and we leave improvements to the processor/accelerator memory

architecture as future work.

We support various target FPGA devices that are available on Altera development and education

boards: the DE2 [DE2 10b] (Cyclone II FPGA), the DE2-115 [DE2 10a] (Cyclone IV FPGA), the

DE4 [DE4 10] (Stratix IV FPGA), the DE1-SoC [DE1 13] (Cyclone V SoC FPGA), and the DE5-

Net [DE5 13] (Stratix V GX). We note that most prior work on high-level hardware synthesis has

focused on pure hardware implementations of C programs, not the hybrid software/hardware system we

target in LegUp.

3.4 LegUp Design and Implementation

Before implementing LegUp, the author investigated two open-source compiler frameworks to leverage

for our work: GCC [Stall 99] and LLVM [Lattn 04]. The GNU Compiler Collection (GCC) is a robust

open-source compiler that is ubiquitous in the Linux community. GCC also compiles code that executes

5-10% faster than code compiled by LLVM. However, the GCC compiler has a steep learning curve due

to a large, complex C codebase with heavy use of global variables and macros. Furthermore, GCC provides no static single assignment (SSA) intermediate representation in its backend compiler passes. In contrast,

the low-level virtual machine (LLVM) compiler has excellent documentation and a modular, understandable

C++ design. Adding new compiler passes and targets in LLVM is easy with a standard class API. LLVM

also offers access to a consistent SSA intermediate representation at every stage of the compiler. The

LLVM open-source license was also favourable, with an unrestricted BSD-style license [Rosen 05]. For

these reasons, the author built LegUp within the LLVM compiler framework.

We programmed LegUp using modular C++, with HLS algorithms implemented as backend compiler

passes that fit into the existing LLVM compiler framework. We have logically divided LegUp C++

classes into the HLS steps previously discussed in Figure 2.2. Researchers can implement their own

HLS algorithms as drop-in replacements for the existing algorithms in LegUp. As we discuss later in


Section 3.4.6, users can easily verify circuit functionality and measure the quality of results across a suite

of benchmarks after making modifications to LegUp.

The author implemented a data structure to represent the RTL description of the final circuit. After scheduling and binding, a hardware generation pass converts the LLVM instructions into this final RTL data structure. Then, a final pass writes out the RTL data structure as a synthesizable Verilog circuit description file.

In the original implementation of LegUp (described in this chapter), the author implemented a list

scheduler using as-soon-as-possible (ASAP) ordering as the priority. The author also implemented the

bipartite weighted matching [Huang 90] binding algorithm within LegUp. Some improvements to allow

operator chaining in the scheduler were also implemented by Victor Zhang. The hybrid flow, including

the interconnection generation and communication between the processor and hardware accelerators,

was implemented by Jongsok Choi. The hardware profiler within Tiger MIPS was implemented by Mark

Aldham. We focus on LegUp’s first release for the experimental results in this chapter.

Since then, Jason Anderson has implemented a new scheduler based on SDC scheduling [Cong 06b]

(see Chapter 2), using the open-source lpsolve linear programming library [lpso 14]. Later, Stefan

Hadjis implemented an area saving algorithm for sharing groups of smaller functional units that ap-

pear in a particular configuration, or pattern, more than once in the program, as described in the

co-authored work [Hadji 12a]. Many other changes have been implemented since then by: Ruo Long (Lanny)

Lian, Nazanin Calagar, Li Liu, Marcel Gort, Blair Fort, Bain Syrowik, Joy (Yu Ting) Chen, Julie Hsiao,

Victor Zhang, Ahmed Kammoona, Kevin Nam, Qijing (Jenny) Huang, Ryan Xi, Emily Miao, Yolanda

Wang, Yvonne Zhang, William Cai, and Mathew Hall. The author has acted in a mentorship role by

advising these contributors on the best approach to extend LegUp. Throughout this dissertation, the author makes an effort to attribute any discussed LegUp functionality to its implementation author.

3.4.1 Hardware Modules

In the hardware generated by LegUp, each function from the input software program results in a distinct hardware module, except for small functions that are automatically inlined by the LLVM

compiler. LegUp avoids inlining every function by default because we can save area when functions

are called more than once in the program. If we inline a function in two places then the final circuit

will have duplicated hardware. Since LegUp performs resource sharing at the individual operator level,

we can have difficulty sharing this large duplicated hardware region automatically. We also implement

each function as a separate hardware module to simplify the hybrid system generation, where the user

can specify an individual function to accelerate as discussed in Section 3.4.4. We found that hardware

simulations are clearer when debugging functions that are implemented in separate modules. LegUp also

supports manually changing the function inline threshold to inline larger functions. Increasing the inline

threshold can achieve higher performance by allowing further optimizations across function boundaries

and it can reduce hardware cycles spent communicating between hardware modules at the cost of area.

We found empirically that forcing LegUp to inline all functions in the CHStone benchmark suite reduced

geomean wall-clock time by 9% but increased circuit area by 15%.

Given the following C function prototype in the input software program:

int function(int a, int *b);

LegUp would generate a hardware module with the interface given in Figure 3.3.


1  module function ( ... )
2      input clk;
3      input reset;
4
5      input [31:0] a;
6      input [31:0] b;
7
8      input start;
9      output reg finish;
10
11     output reg [31:0] memory_controller_address;
12     output reg memory_controller_enable;
13     output reg memory_controller_write_enable;
14     output reg [31:0] memory_controller_in;
15     input [31:0] memory_controller_out;
16     input memory_controller_waitrequest;
17
18     output reg [31:0] return_val;
19 endmodule

Figure 3.3: LegUp hardware module interface.

Figure 3.4: Initial state of a LegUp hardware module's finite state machine. (State diagram: the machine stays in the initial state while Start = 0 or while Reset = 1, and advances when Start = 1.)

The first two inputs are the clk and reset signals. The function parameters are on input ports a (32-bit integer) and b

(32-bit memory address). The two ports, start and finish, are the module's control flow signals. The

protocol for starting a module is to place valid inputs on the function parameter ports and then assert

the start input. Each hardware module contains a finite state machine that controls the datapath. The

start input is monitored by the first state of the module’s finite state machine as shown in Figure 3.4.

We remain in the first state until the start input is asserted. The finish output is kept low until

the last state of the state machine, when finish is asserted to indicate to the caller module that this

hardware module is finished. Line 18 provides a registered output port, return_val, which is set to the function's integer return value. When finish is high, the return_val output port must be driven with

valid data. Lines 11–16 contain the interface to the global memory shared with the rest of the system.

Table 3.2 provides a description of these memory signals. The memory architecture and instantiation

hierarchy are explained further in Chapter 6.
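
To make the start/finish protocol concrete, the following C sketch models the handshake in software. The type and function names (module_t, fsm_step, and the three states) are our own illustrative inventions, not LegUp-generated code, and the datapath is reduced to a single placeholder addition.

#include <stdio.h>

/* A software model of the start/finish handshake (a sketch; module_t,
 * fsm_step, and the states are illustrative names, not LegUp output). */
typedef enum { S_INITIAL, S_COMPUTE, S_FINAL } state_t;

typedef struct {
    int start, finish;   /* control-flow ports           */
    int a, b;            /* function-argument ports      */
    int return_val;      /* registered return-value port */
    state_t state;       /* current state of the FSM     */
} module_t;

/* One clock tick of the module's finite state machine. */
void fsm_step(module_t *m) {
    switch (m->state) {
    case S_INITIAL:      /* remain here until start is asserted */
        m->finish = 0;
        if (m->start)
            m->state = S_COMPUTE;
        break;
    case S_COMPUTE:      /* the real datapath spans many states */
        m->return_val = m->a + m->b;  /* placeholder computation */
        m->state = S_FINAL;
        break;
    case S_FINAL:        /* last state: assert finish, result is valid */
        m->finish = 1;
        m->state = S_INITIAL;
        break;
    }
}

int main(void) {
    module_t m = { .state = S_INITIAL };
    m.a = 2; m.b = 3;                 /* place valid inputs on the ports */
    m.start = 1;                      /* then assert start */
    while (!m.finish)
        fsm_step(&m);                 /* clock until finish is asserted */
    printf("return_val = %d\n", m.return_val);
    return 0;
}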

3.4.2 Device Characterization

Each target FPGA device offers different speed, area, and power properties for generated hardware

functional units, such as adders or multipliers. Consequently, LegUp includes a Perl script written

by Ahmed Kammoona to characterize each hardware operation, including all supported bitwidths (8,

16, 32, 64), for a given Altera FPGA family. The script synthesizes each operation in isolation for the

Altera FPGA family using Quartus II. Then, the script parses out the operator propagation delay, the

number of logic elements and registers, and DSP block usage from the timing analysis report. The script


Table 3.2: LegUp memory signals.

Memory Signal                     Description
memory_controller_address         Memory address (32-bit)
memory_controller_enable          Memory clock enable
memory_controller_write_enable    If one, write to memory; otherwise read
memory_controller_in              Data to be written into memory
memory_controller_out             Data to be read out of memory
memory_controller_waitrequest     If one, hold the module's current state constant

also supports running simulations to measure the estimated operator power consumption by randomly

toggling the input ports. Characteristics associated with each LLVM operator are generated by the script

and then stored in a Tcl configuration file. LegUp reads this characterization file during the allocation

stage of HLS and uses operator characteristics during scheduling and binding. We can also use this data

to make early estimates of circuit speed and area for the hardware accelerators. These scripts have been

improved by Ryan Xi, who added floating point operations, and Joy (Yu Ting) Chen, who characterized

Cyclone V and Stratix V FPGAs.
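
As an illustration of how such characterization data might be consulted, the C sketch below looks up a hypothetical operator table and applies a simple chaining test against a clock period constraint. The table entries, delay and area numbers, and the lookup function are assumptions for illustration only; LegUp's actual data structures and Tcl format differ.

#include <stdio.h>
#include <string.h>

/* Hypothetical operator characteristics, standing in for the data that
 * the characterization script writes to the Tcl configuration file.
 * All numbers below are invented for illustration. */
typedef struct {
    const char *op;       /* LLVM operator name         */
    int bitwidth;         /* 8, 16, 32, or 64           */
    double delay_ns;      /* operator propagation delay */
    int les, regs, dsps;  /* logic elements, registers, DSP blocks */
} op_char_t;

static const op_char_t table[] = {
    { "add", 32, 2.1, 32, 0, 0 },
    { "mul", 32, 5.4,  0, 0, 4 },
    { "shl", 32, 1.3, 90, 0, 0 },
};

static const op_char_t *lookup(const char *op, int width) {
    for (unsigned i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].bitwidth == width && strcmp(table[i].op, op) == 0)
            return &table[i];
    return NULL;
}

int main(void) {
    /* A simple chaining test: can an add feed a shift in one cycle? */
    const double clock_period_ns = 10.0;   /* assumed timing constraint */
    const op_char_t *a = lookup("add", 32), *s = lookup("shl", 32);
    if (a && s && a->delay_ns + s->delay_ns <= clock_period_ns)
        printf("add -> shl fits in one %.1f ns cycle\n", clock_period_ns);
    return 0;
}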

3.4.3 Hardware Profiling

As discussed in Section 3.3.1, the Tiger MIPS soft processor has been modified to include a hardware

profiler designed by Mark Aldham. The hardware profiler transparently monitors the processor operation

during program execution to gather performance characteristics and identify critical functions. By

default, the profiler keeps track of the number of clock cycles spent in each function of the running

program, but the architecture is extensible to allow measurement of other metrics (e.g., energy). A set of

performance counters, one for each function, are maintained. To keep track of the currently executing

function, the profiler monitors the processor bus for function call and return instructions and uses a

function call stack. The profiler is compatible with any input program and therefore does not require

any changes to the underlying hardware for different programs. Mark has shown that the hardware

profiler requires a 6.7% area overhead on the Tiger MIPS processor when configured to support 32

functions using 32-bit performance counters. Complete details on the profiler, including a description

of profiling the estimated energy consumption of a program, can be found in [Aldha 11a]. The ARM

Cortex-A9 processor also supports profiling as investigated by Bain Syrowik [ARM 14]. Bain’s approach

uses ARM event counters to track function calls and to sample the number of clock cycles spent in each

function.
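
The following C sketch models the profiler's bookkeeping in software: one cycle counter per function plus a call stack identifying the currently executing function. The function names and the simulated call sequence are our own; the real profiler performs this in hardware by snooping the processor bus.

#include <stdio.h>

/* A software model of the profiler's bookkeeping: one cycle counter per
 * function and a call stack identifying the current function. */
#define MAX_FUNCS 32   /* matches the 32-function configuration above */
#define MAX_DEPTH 64   /* assumed maximum call depth */

static unsigned long cycles[MAX_FUNCS];  /* one counter per function */
static int stack[MAX_DEPTH];
static int top = -1;                     /* index of executing function */

void on_call(int func_id) { stack[++top] = func_id; }
void on_return(void)      { top--; }
void on_clock_tick(void)  { if (top >= 0) cycles[stack[top]]++; }

int main(void) {
    /* main (id 0) runs 5 cycles, calls f (id 1) for 3 cycles, returns */
    on_call(0);
    for (int i = 0; i < 5; i++) on_clock_tick();
    on_call(1);
    for (int i = 0; i < 3; i++) on_clock_tick();
    on_return();
    printf("func 0: %lu cycles, func 1: %lu cycles\n", cycles[0], cycles[1]);
    return 0;
}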

3.4.4 Hybrid Processor/Accelerator System

As discussed in Section 3.3.2, our proposed target architecture is a hybrid system with the processor

communicating with hardware accelerators. In this section, we discuss the behaviour of the processor and

modifications to the software required to support offloading computation to the hardware accelerators

chosen by the user. Jongsok Choi implemented the LegUp hybrid flow.

In the hybrid flow, the user selects functions that they wish to synthesize into hardware accelerators.

To utilize each accelerator, LegUp must modify the software program to perform these steps: 1) pass

the function arguments from the processor to the hardware accelerator, 2) start the accelerator, 3) wait

for hardware computation to finish, and 4) retrieve any resultant data from the hardware.


void vector_add(int *A, int *B, int *C, int N) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }
}

Figure 3.5: Vector addition C function targeted for hardware.

In step three, after activating a hardware accelerator, the processor must wait for the computation

to complete. During this time, we can continue program execution on the processor if we do not

immediately need results from the hardware accelerator. Alternatively, we can halt processor execution

until the accelerator is finished. The first approach is implemented by polling a memory-mapped status

register residing on the hardware accelerator that indicates when computation is finished. Polling can

have a performance advantage by allowing parallel computation on the processor while we wait for the

accelerator to finish and before we begin the polling loop. The second approach of stalling the processor

is simpler to implement and we can save energy by idling the processor. LegUp supports both behaviours

but we use the second approach (stalling) in the experimental study presented in this chapter.
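
A sketch of the polling alternative is shown below as target-side C code. The ACCEL_STATUS address and its register semantics are assumptions for illustration (LegUp's actual accelerator register map may differ), and do_independent_work stands in for processor-side computation that does not depend on the accelerator's result.

/* Processor-side work that does not depend on the accelerator result. */
void do_independent_work(void) { /* ... */ }

/* Hypothetical register map: START as in Figure 3.6, plus an assumed
 * memory-mapped STATUS register that reads non-zero when finished. */
volatile int *ACCEL_START  = (volatile int *) 0xF0000000;
volatile int *ACCEL_STATUS = (volatile int *) 0xF0000004;

void run_with_polling(void) {
    *ACCEL_START = 1;             /* start the accelerator (non-blocking
                                     in the polling configuration) */
    do_independent_work();        /* overlap computation on the processor */

    while (*ACCEL_STATUS == 0)    /* poll until computation is finished */
        ;                         /* spin */
}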

For example, assume our program contains the C function shown in Figure 3.5, which we wish to

implement in hardware. The function performs a vector addition over two N -element arrays, which are

passed as the first two function parameters, and stores the output vector in the third parameter. These

arrays are stored in processor memory, which is shared between the processor and accelerators. To

transparently accelerate this function in hardware without changing the rest of the program, we replace

the original vector add function with the new function shown in Figure 3.6. The hardware accelerator

memory-mapped address space in LegUp is: 0xF0000000–0xFFFFFFFF, with hardware accelerators placed

immediately after each other in the address space. In our address space, any address above 0x80000000,

with the most significant bit equal to one, does not go to the on-chip memory cache. Rather, it is

reserved for external I/O, which includes the hardware accelerators; for details, see [Choi 12b]. The

Avalon interconnect and logic implemented in the Avalon slave component for each accelerator handles

the address decoding for the address range of each accelerator. In Figure 3.6, we first perform memory-

mapped stores for each argument of the function on lines 10–13, which will store the arguments in

dedicated registers on the hardware accelerator. We then start the accelerator by writing to the START

address on line 15, which also immediately stalls the processor. When the accelerator finishes, the

Avalon waitrequest signal will be de-asserted, allowing the processor to resume execution. Although

not shown here, functions with a return value will have an additional memory-mapped load from the

accelerator at the end of the function to retrieve the return value.
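
As a hedged illustration of that extra load, a wrapper for a function returning an int might look as follows, reusing the address-map pattern of Figure 3.6. The SUM_* names and the placement of the return-value register are assumptions, not LegUp's exact layout.

/* Hypothetical wrapper for: int sum(int a, int b); accelerated in
 * hardware.  Addresses follow the Figure 3.6 pattern; the RETVAL
 * register location is an assumption. */
volatile int *SUM_START  = (volatile int *) 0xF0000000;
volatile int *SUM_ARG_A  = (volatile int *) 0xF0000004;
volatile int *SUM_ARG_B  = (volatile int *) 0xF0000008;
volatile int *SUM_RETVAL = (volatile int *) 0xF000000C;

int sum(int a, int b) {
    *SUM_ARG_A = a;        /* pass arguments via memory-mapped stores */
    *SUM_ARG_B = b;
    *SUM_START = 1;        /* start; the processor stalls until done */
    return *SUM_RETVAL;    /* extra memory-mapped load of the result */
}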

3.4.5 Language Support and Benchmarks

LegUp has extensive coverage of the ANSI C language constructs amenable to hardware synthesis, as summarized in Table 3.3. LegUp supports integer and floating-point arithmetic and all logical, comparison, ternary, and

bitwise operators. We support arbitrary control flow including any type of loop, switch, if-statement,

goto-statements and function calls. We support memory including global variables and constants, multi-

dimensional arrays, arbitrary pointers, and pointer arithmetic. Generally, we support a wider range of

language constructs than other academic HLS tools. Victor Zhang added support for structs to LegUp,

including structs with arrays, arrays of structs, and structs containing pointers. LegUp stores structs in


1  // hardware accelerator memory-mapped address space starts at 0xF0000000
2  volatile int *VECTOR_ADD_START = (volatile int *) 0xF0000000;
3  volatile int *VECTOR_ADD_ARG_A = (volatile int *) 0xF0000004;
4  volatile int *VECTOR_ADD_ARG_B = (volatile int *) 0xF0000008;
5  volatile int *VECTOR_ADD_ARG_C = (volatile int *) 0xF000000C;
6  volatile int *VECTOR_ADD_ARG_N = (volatile int *) 0xF0000010;
7
8  void vector_add(int *A, int *B, int *C, int N) {
9      // pass arguments to hardware accelerator using memory-mapped stores
10     *VECTOR_ADD_ARG_A = (int) A;
11     *VECTOR_ADD_ARG_B = (int) B;
12     *VECTOR_ADD_ARG_C = (int) C;
13     *VECTOR_ADD_ARG_N = N;
14     // start the hardware accelerator and stall the processor until finished
15     *VECTOR_ADD_START = 1;
16 }

Figure 3.6: Modified C function to call the hardware accelerator for the function in Figure 3.5.

Table 3.3: LegUp C language support.

Supported                    Unsupported
Functions                    Dynamic Memory
Arrays, Structs              Recursion
Global Variables             Function Pointers
Pointer Arithmetic           Unaligned Memory Accesses
Floating-point Arithmetic

memory using the ANSI C alignment standards to ensure that any hardware function can access struct

elements allocated in the processor’s memory without requiring extra changes. Structs are stored in

64-bit wide block RAMs to ensure that elements up to 64-bits in size can be accessed in a single memory

operation.
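
As a concrete illustration of these alignment rules, the short C program below prints the byte offsets produced for a small struct. The struct itself is an invented example, and exact offsets can vary by ABI.

#include <stdio.h>
#include <stddef.h>

/* An invented struct illustrating ANSI C alignment: 'value' is padded
 * to a 4-byte boundary and 'total' to an 8-byte boundary, so every
 * element is naturally aligned (exact offsets can vary by ABI). */
struct record {
    char tag;          /* offset 0, followed by padding                */
    int value;         /* typically offset 4                           */
    long long total;   /* typically offset 8; fits one 64-bit RAM word */
};

int main(void) {
    printf("tag at %zu, value at %zu, total at %zu, size %zu\n",
           offsetof(struct record, tag),
           offsetof(struct record, value),
           offsetof(struct record, total),
           sizeof(struct record));
    return 0;
}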

Language features unsupported by LegUp include: dynamic memory allocation, recursion, function

pointers, and functions that return a struct. Another limitation is that all memory accesses should

be word aligned; for example, you cannot load only the upper byte of an integer. The user could

hypothetically compile their own custom memory allocator library to support dynamic memory. We

provide an example of such a library implemented by Victor Zhang in the included LegUp benchmarks,

where we store all dynamic memory in a predefined statically-sized 1024B memory heap. Recursion

could be supported in the future by adding a stack controller as described in [Jasro 04]. Any regions of

the program using unsupported features should remain in software running on the processor.
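
For illustration, a minimal bump-pointer allocator over a static heap, in the spirit of the bundled example library, might look like the sketch below. The function and constant names are ours, and unlike the bundled example this sketch never frees memory.

#include <stddef.h>
#include <stdio.h>

/* A minimal bump-pointer allocator over a static heap (a sketch). */
#define HEAP_SIZE 1024              /* 1024B static heap, as above */

static unsigned char heap[HEAP_SIZE];
static size_t heap_top = 0;

void *my_malloc(size_t n) {
    n = (n + 7) & ~(size_t) 7;      /* keep allocations 8-byte aligned */
    if (heap_top + n > HEAP_SIZE)
        return NULL;                /* heap exhausted */
    void *p = &heap[heap_top];
    heap_top += n;
    return p;
}

int main(void) {
    int *v = my_malloc(4 * sizeof(int));   /* example allocation */
    for (int i = 0; v && i < 4; i++)
        v[i] = i;
    printf("heap_top = %zu bytes used\n", heap_top);
    return 0;
}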

Table 3.4: Core benchmark programs included with LegUp.

Category         Benchmarks                         Lines of C
Arithmetic       Double-precision floating-point    363–789
                 Add, Mult, Div, Sin
Encryption       AES, Blowfish, SHA                 723–1,413
Processor        MIPS processor                     232
Media            JPEG decoder,                      441–1,692
                 Motion vector decoding
Communications   GSM, ADPCM                         380–547
Synthetic        Dhrystone                          491


LegUp includes a suite of benchmark C programs that the user can use to evaluate the HLS quality

of results. We show a list of the 13 benchmarks in Table 3.4, which includes all 12 benchmarks from the

CHStone high-level synthesis benchmark suite [Hara 09] and Dhrystone [Weick 84], a popular synthetic

benchmark. We chose the benchmarks to be representative of the types of programs synthesized by an

HLS tool. We include programs from the following categories: floating-point arithmetic, encryption and

hashing, processor emulation, media, communications, and synthetic. The benchmarks range in size

from 232–1692 lines of C code. The arithmetic benchmarks implement various double-precision floating-

point operations using mainly bitwise operations on 64-bit wide integer types. The processor benchmark

emulates a basic MIPS processor for a predefined program implemented in MIPS machine code.

These benchmarks require substantial C language coverage to be synthesized by an HLS tool; for instance, the Dhrystone benchmark contains structs that are used as a linked list. In fact, before LegUp

was released, no academic tool was robust enough to support the entire CHStone benchmark suite; even

commercial tools generated functionally incorrect circuits for some of these benchmarks. Furthermore,

these benchmarks are larger than prior benchmarks used in academic publications. For example, the

HLS area minimization work by Cong [Cong 10] and Zhang [Zhang 10] uses benchmarks that range

from 600–4500 slices on a Xilinx Virtex-4 FPGA (90nm). Each Virtex-4 slice contains two 4-input

LUTs [Virt 10]. On a Cyclone II FPGA (90nm) each logic element (LE) contains a single 4-input

LUT [Cycl 04]. As a first order approximation, we can assume that each Virtex-4 slice corresponds to

roughly two Cyclone II LEs. Therefore, their biggest benchmark is 4500 slices, which is under 10,000

LEs. The geomean circuit area for the LegUp benchmark suite is 50% larger (over 15,000 LEs) with the

jpeg benchmark consuming over 46,000 LEs, an order of magnitude larger. Typically, 5–6 benchmarks are used to perform HLS experimental studies. LegUp offers researchers double this number of

benchmarks, which are all full programs instead of individual functions or kernels. At the time of our

first publication [Canis 11], to our knowledge, these were the largest HLS benchmarks that had ever been

synthesized by an academic tool in the literature. Until LegUp was released, the CHStone benchmarks

could not be studied in depth simply because no academic tool could support them. Therefore, a key

differentiator of LegUp relative to prior work is that we allow researchers to study HLS when applied to

larger and more complex C programs than before.

3.4.6 Circuit Correctness

Circuit correctness is the most important requirement of LegUp, which is why we distribute LegUp with a

rigorous automated test suite including hundreds of tests. We must verify that the generated RTL design

simulates correctly and produces a functionally correct circuit under a wide range of input programs so

that academics can spend time on research instead of debugging infrastructure. In other CAD areas, such

as place and route, a bad placement will still be functionally correct and we can easily verify that our

final placement matches the original netlist. In contrast, verifying that the C input matches the circuit

output of high-level synthesis is non-trivial. Consequently, high-level synthesis research and development

is inherently prone to introducing bugs or regressions in the final circuit functionality. Even a single

misplaced register or an operation scheduled one cycle too soon can break the functionality of the final

circuit! Furthermore, manually debugging the auto-generated RTL code generated during HLS can be

challenging and tedious. Our test suite helps give academic end-users confidence in the core LegUp HLS

algorithms and they can use these tests to verify circuit correctness after implementing novel algorithms

in LegUp.


The CHStone [Hara 09] benchmarks we described earlier each contain built-in input vectors that

exercise the program execution and golden output vectors to verify that the program generated the

correct output. Consequently, when the programs are synthesized into hardware, we can simulate or run

the circuit on the FPGA device to verify the correct functionality using the golden input and output test

vectors. This is analogous to built-in self-test techniques [McClu 85] used for verifying chip functionality,

with no user input required. These test vectors are also marked by the volatile keyword to avoid the

LLVM compiler performing constant propagation and agressive optimizations. For example, the mips

CHStone benchmark contains a predefined unordered 8-element array as input, and the corresponding 8-

element sorted array as the golden output. The emulated MIPS processor executes a 44-instruction MIPS program

that sorts the array, after which the program verifies that the sorted array matches the expected golden

output.
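
The self-checking pattern looks roughly like the following C sketch: volatile input and golden vectors, a computation under test, and a final comparison loop. The data and the insertion sort are our own invented stand-ins, not the actual mips benchmark.

#include <stdio.h>

/* A sketch of the CHStone-style built-in self-test pattern. */
#define N 8

volatile int input[N]  = { 7, 3, 5, 1, 8, 2, 6, 4 };  /* unordered input */
volatile int golden[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };  /* expected output */

int main(void) {
    int a[N], errors = 0;
    for (int i = 0; i < N; i++)
        a[i] = input[i];

    /* computation under test: a simple insertion sort */
    for (int i = 1; i < N; i++) {
        int key = a[i], j = i - 1;
        while (j >= 0 && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }

    for (int i = 0; i < N; i++)   /* check against the golden vector */
        if (a[i] != golden[i])
            errors++;

    printf("%s\n", errors ? "FAIL" : "PASS");
    return errors;
}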

For tracking down circuit correctness issues, LegUp also provides a pass that annotates each basic

block with a print statement that dumps out the value of every register assigned in the basic block. We

can then synthesize this new program to hardware and simulate the circuit to generate a log of all register

changes. Next, we compare the simulation log to the output from running the annotated program in

software, which prints out the correct register values. We can quickly discover the exact point during the

hardware simulation when the register values are incorrect. In practice, we found this procedure helpful

for debugging incorrect circuits synthesized by LegUp. Nazanin Calagar recently proposed a debugger

for LegUp [Calag 14] that has a similar feature, which compares the execution state between software

and an equivalent synthesized circuit running on the target hardware device.
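
Conceptually, the annotated code resembles the sketch below; the printf format and the example function are ours, not the pass's exact output.

#include <stdio.h>

/* A sketch of the register-dumping annotation: each basic block prints
 * every value it assigns, so a hardware simulation log can be diffed
 * against this software run. */
int accumulate(int x, int y) {
    int sum = x + y;                        /* basic block "entry" */
    int scaled = sum << 1;
    printf("bb=entry sum=%d scaled=%d\n", sum, scaled);

    if (scaled > 16) {                      /* basic block "big" */
        scaled -= 16;
        printf("bb=big scaled=%d\n", scaled);
    }
    return scaled;
}

int main(void) {
    return accumulate(5, 7) == 8 ? 0 : 1;
}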

3.4.7 Extensibility of LegUp to Other FPGA Devices

We now describe the extensibility of the LegUp framework to target other FPGA devices, in particular

Xilinx FPGAs, which may be the only FPGAs available to some researchers.

If we wish to target a new FPGA device with LegUp, then our first step is to re-characterize all

the hardware operation propagation delays and area metrics using the script described in Section 3.4.2.

LegUp requires these updated metrics to accurately schedule and bind the operators (e.g. shift, add) in

the program to meet user timing and area constraints for the target device. Our characterization script

assumes an Altera FPGA target, therefore the script should be rewritten if we need to target a Xilinx

FPGA. We designed the format of the Tcl configuration file containing the operator characteristics to

be FPGA agnostic.

Second, the Verilog hardware description generated by LegUp requires vendor-specific hardware modules for floating-point operations, integer division, and integer multiplication. These operations are implemented

by instantiating Altera-specific megafunctions. In LegUp, memories are inferred by Verilog statements,

which should be vendor-agnostic. However, if structs are used in the C program then block RAMs

are instances of Altera’s ALTSYNCRAM megafunction, because byte-enables are not supported by RAMs

inferred in Verilog. To target a new FPGA, we would need to replace all of these Altera megafunctions

with either the equivalent vendor-specific hardware modules for the target FPGA or with generic vendor-

agnostic hardware modules. The advantage of using generic hardware modules is we could support an

academic tool such as the open-source VTR (Verilog-to-Routing) FPGA CAD flow being developed here

at the University of Toronto [Luu 14a]. An option for generating generic floating-point functional units

is to use FloPoCo [De Di 11], or alternatively to use LegUp to synthesize floating-point cores from a C

implementation.


So far, we have only discussed changes required when targeting a new FPGA with the hardware-only

LegUp flow. However, supporting another FPGA target for the hybrid processor/accelerator architecture

would require a few additional changes. First, the Tiger MIPS processor contains Altera megafunctions

for memory, multiplication, and division. These would have to be changed to equivalent vendor-specific

modules on the new FPGA. Next, the interconnect between the processor and the hardware accelerators

would have to be changed from Altera-specific Avalon to another bus protocol, such as the Advanced

Microcontroller Bus Architecture [AMBA 03] defined by ARM and used by Xilinx FPGAs. We should

ensure that this new interconnect is in the same memory-mapped point-to-point configuration as the

current Avalon bus.

3.5 Experimental Study

In this section, we present an experimental study performed using the first version of LegUp in 2010.

This study has three objectives. First, we would like to compare the quality of results in terms of speed,

area, and energy of the circuits produced by LegUp compared to those synthesized by a commercial HLS

tool. We chose the commercial HLS tool eXCite [eXCi 10] because it was the only HLS tool we were

given access to that could compile our benchmark programs. eXCite has been actively developed since

1995 after being spun out of HLS research at UC Irvine and it is representative of commercially available

HLS tools. Second, we want to investigate hardware/software partitioning choices for our benchmarks

to explore the available design space. Third, we want to quantify the improvement of a LegUp hardware

implementation compared to executing our benchmarks on a processor. To achieve these objectives, we

ran five different experiments across the benchmark suite, starting from a software-only implementation

and then successively increasing the amount of computation performed in synthesized hardware. The

experiments are as follows (with labels appearing in parentheses):

1. A software-only implementation executing on the MIPS soft processor (MIPS-SW).

2. A hybrid software/hardware implementation where the second most compute-intensive function (not counting the main() function) and its function descendants are implemented as a hardware accelerator, with the rest of the benchmark running in software on the MIPS processor (LegUp-Hybrid2).

3. A hybrid software/hardware implementation where the most compute-intensive function and all of its descendants are implemented as a hardware accelerator, with the rest executing in software (LegUp-Hybrid1).

4. A hardware-only implementation synthesized by LegUp, with no processor (LegUp-HW).

5. A hardware-only implementation synthesized by eXCite (eXCite-HW), produced by running the tool with its default options.

The two hybrid flows target a system with the Tiger MIPS processor and a single hardware accelerator

synthesized from one C function and all of its descendant functions.

We used Quartus II 9.1 SP2 to target the Cyclone II FPGA on the DE2 board [DE2 10b], in timing-driven mode with all physical synthesis optimizations turned on (except for the eXCite implementation of the jpeg benchmark, which could not fit into the largest Cyclone II device with physical synthesis optimizations enabled). We verified the circuit correctness of


all implementations using post-routed ModelSim simulations and we also verified the designs in hardware

using the Altera DE2 board. The experimental data presented here for the hybrid implementations were

collected by Jongsok Choi. The experimental data for the eXCite commercial tool were collected by

Jason Anderson. Mark Aldham also helped to gather these experimental results and he performed the

energy and power analysis presented later in this section.

In this study, we measure the circuit speed, area, and energy consumption to assess the quality of

results. Circuit speed consists of the wall-clock execution time of the circuit, the post-routed maximum

clock frequency reported by Quartus, and the number of clock cycles required to complete execution.

We calculate the wall-clock time by multiplying the number of clock cycles by the reciprocal of the

maximum clock frequency. Circuit area consists of the number of Cyclone II logic elements (LEs), the

number of memory bits, and the number of 9x9 DSP multiplier blocks.

The energy consumption of an embedded system is typically a major design constraint especially

for battery-powered mobile devices. To measure circuit energy, Mark Aldham used Altera’s PowerPlay

power analyzer tool on the final post-routed circuit. He performed a post-route netlist simulation using

Mentor Graphics’ ModelSim to gather circuit switching activity data for each benchmark. ModelSim

generates a VCD (value change dump) file containing the switching activity for each design signal.

PowerPlay reads this VCD file and calculates a power estimate for the design. Finally, we compute the

total energy consumption of each benchmark by multiplying the average core dynamic power by the

benchmark’s total execution time.
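
As a concrete check of this calculation, the short program below (an illustrative sketch; the variable names are ours) multiplies the geomean LegUp-HW dynamic power from Table 3.7 by the geomean wall-clock time from Table 3.5, approximately reproducing the geomean energy in Table 3.7; the small gap comes from rounding in the displayed geomeans.

#include <stdio.h>

/* Energy = Power x Time, as used for Table 3.7. */
int main(void) {
    double power_mw = 100.75;                  /* dynamic power (mW)  */
    double time_us  = 292.0;                   /* execution time (µs) */
    double energy_nj = power_mw * time_us;     /* mW x µs = nJ        */
    printf("energy = %.0f nJ (Table 3.7 reports 29,390 nJ)\n", energy_nj);
    return 0;
}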

3.5.1 Experimental Results

We present the circuit speed measurements across the benchmark suite for all our experiments in Ta-

ble 3.5. The experiments are presented from left to right in the order specified previously, with software-

only on the left and hardware-only on the right. Three speed metrics are shown in columns for each

experimental flow: Cycles gives the required number of clock cycles, Freq shows the post-routed maxi-

mum clock frequency (MHz), and Time lists the total wall-clock execution time (µs). The second last

row of the table shows the geometric mean results for each column. We excluded dhrystone from the

geomean calculation because eXCite could not synthesize this benchmark. The last row of the table

gives a ratio of the geomean of each column relative to the corresponding metric of the software-only

flow (MIPS-SW).

In the MIPS-SW flow shown in Table 3.5, we measured the clock frequency of the processor at

74 MHz, with the benchmarks completing within 6.8K–30M clock cycles. The wall-clock time for the

benchmarks ranged from 92–401K µs. For comparison, we executed these benchmarks on an Altera

NIOS II/f (fast) soft processor and found that the performance was twice as fast as the Tiger MIPS

processor. However, the NIOS II is not open-source and has a 6-stage pipeline, while the open-source

Tiger MIPS has a 5-stage pipeline and is not tuned for Altera FPGAs.

In the LegUp-Hybrid2 flow, we implemented the second most compute-intensive function and its

descendants in hardware. We observe that the geomean number of clock cycles during execution is cut

in half compared to a software implementation. However, the Hybrid2 benchmarks have a 10% lower

geomean clock frequency than the processor, resulting in an overall geomean wall-clock time improvement

of 45%, or a 1.8× speed-up, compared to MIPS-SW. The next LegUp-Hybrid1 flow implements additional

computation in hardware. We show in Table 3.5 that the number of cycles is 75% better in LegUp-Hybrid1

than software-only. The geomean clock frequency is again lower than the processor's, by 12%.


Table 3.5: Speed performance results. (Cycles: clock cycles; Freq.: MHz; Time: µs)

Benchmark       MIPS-SW                    LegUp-Hybrid2              LegUp-Hybrid1              LegUp-HW                   eXCite-HW
                Cycles      Freq  Time     Cycles      Freq  Time     Cycles      Freq  Time     Cycles     Freq  Time     Cycles     Freq  Time
adpcm           193,607     74    2,607    159,883     62    2,595    96,948      57    1,695    36,795     46    804      21,992     29    761
aes             73,777      74    993      55,014      55    1,001    26,878      50    543      14,022     61    231      55,679     51    1,093
blowfish        954,563     74    12,854   680,343     63    10,763   319,931     64    5,022    209,866    65    3,208    209,614    36    5,845
dfadd           16,496      74    222      14,672      75    196      5,649       77    73       2,330      124   19       370        25    15
dfdiv           71,507      74    963      15,973      78    205      4,538       66    69       2,144      75    29       2,029      44    46
dfmul           6,796       74    92       10,784      76    143      2,471       79    31       347        86    4        223        49    5
dfsin           2,993,369   74    40,309   293,031     66    4,463    80,678      68    1,182    67,466     63    1,077    49,709     40    1,241
gsm             39,108      74    527      29,500      61    480      18,505      61    303      6,656      59    113      5,739      42    137
jpeg            29,802,639  74    401,328  16,072,954  51    313,925  15,978,127  47    342,511  5,861,516  47    124,475  3,248,488  23    143,358
mips            43,384      74    584      6,463       76    86       6,463       76    86       6,443      90    72       4,344      76    57
motion          36,753      74    495      34,859      73    475      17,017      80    214      8,578      92    93       2,268      43    53
sha             1,209,523   74    16,288   358,405     77    4,631    265,221     76    3,508    247,738    87    2,850    238,009    62    3,809
dhrystone       28,855      74    389      25,599      78    330      25,509      77    331      10,202     85    119      -          -     -
Geomean:        173,332     74    2,334    86,258      67    1,286    42,701      66    650      20,854     72    292      14,594     41    357
Ratio:          1           1     1        0.50        0.90  0.55     0.25        0.88  0.28     0.12       0.96  0.12     0.08       0.55  0.15


This results in an overall geomean wall-clock time that is 72% faster, a 3.6× speed-up over MIPS-SW. We observe the

following trend: as we synthesize a greater proportion of the program into hardware, we measure an

increase in performance. We note that these clock frequencies are for targeting a Cyclone II FPGA and

would be higher if we were targeting a 40nm Stratix IV FPGA.

The last two experiments shown on the right of Table 3.5 demonstrate a hardware-only flow with

either LegUp or the commercial HLS tool eXCite. We observe that the benchmarks synthesized to

hardware with LegUp (LegUp-HW ) have a geomean cycle execution time 88% faster than the software-

only implementation and have approximately the same geomean clock frequency. When we synthesize

the benchmarks with eXCite, the geomean number of cycles is even lower, with only 8% of that required

by the software-only flow. However, the geomean clock frequency of the eXCite-generated circuits is 45%

worse than the MIPS processor. We observed that eXCite tends to perform more operator chaining than

LegUp during scheduling, which can hurt the clock frequency but improve the number of cycles required

by the benchmark. The overall geomean wall-clock time improvement over MIPS-SW was comparable

for both LegUp and eXCite, with LegUp-HW providing an 88% improvement, or an 8× speed-up, and

eXCite-HW providing an 85% improvement, or a 6.7× speed-up. These are significant performance

improvements over running these benchmarks on the processor. We saw the greatest speed-up on the

dfsin benchmark, with LegUp improving wall-clock time by over 34×. Across these benchmarks, LegUp-

generated circuits achieved an average wall-clock execution time that was 20% faster than equivalent

circuits synthesized by eXCite. Therefore, we can infer that our HLS implementation is of reasonable

quality.

We observe no performance benefit from running a portion of these benchmarks in software, as explored in the hybrid scenarios. Furthermore, these benchmarks contain no unsupported C language

constructs that would be forced to run in software. However, exploring this software/hardware design

space is useful to exercise LegUp functionality and to verify the correctness of these generated hybrid

systems.

We now discuss a few outliers in the results shown in Table 3.5. For the aes benchmark, the

LegUp-HW implementation has nearly 5× faster wall-clock time execution than the eXCite-HW imple-

mentation. Conversely, for the motion benchmark, LegUp’s implementation is nearly 4× slower in terms

of clock cycles than eXCite’s implementation. We attribute these differences to the greater amount

of functional unit pipelining performed by LegUp, particularly for division operations. This pipelining

causes higher cycle latencies for LegUp-synthesized circuits compared to those produced by eXCite but

improves the overall clock frequency. For the jpeg benchmark, the LegUp-Hybrid1 implementation has a

higher wall-clock time than the LegUp-Hybrid2 implementation despite offloading a greater proportion of

the program to hardware. This was caused by an increase in the number of memories, and consequently

multiplexing, in the memory controller which decreased the clock frequency.

We show area results across the benchmarks for each flow in Table 3.6. The area metrics are shown

in groups of three columns: the number of Cyclone II logic elements (LEs), the number of memory

bits used (# bits), and the number of 9x9 DSP block multipliers (Mults). Like the performance table

presented earlier, we provide the geometric mean and the ratio of columns relative to MIPS-SW in the

last two rows of Table 3.6. We calculated the geomean for columns containing zeros by replacing the

zeros with ones, a convention used in life sciences studies.

We show in Table 3.6 that the MIPS processor requires 12.2K LEs, 226K memory bits, and 16


multipliers. The hybrid system consists of both the MIPS processor and a custom hardware accelerator,

therefore consuming more area. We observed that the LegUp-Hybrid2 flow increases the number of LEs,

memory bits, and multipliers by 2.23×, 1.14×, and 2.68×, respectively, compared to MIPS-SW. The

LegUp-Hybrid1 flow generates a larger hardware accelerator requiring 2.75× LEs, 1.16× memory bits,

and 3.18× multipliers compared to MIPS-SW. We disabled link time optimizations in LLVM during

the hybrid flows. Link time optimizations are late-stage compiler optimizations performed after linking

object files, which we found were inlining the function we were trying to accelerate in the hybrid flow.

However, we enabled link time compiler optimizations for the MIPS-SW and LegUp-HW flows because

these optimizations can significantly improve circuit speed and area. For example, for the jpeg benchmark

the LegUp-Hybrid1 implementation has a larger circuit area than the combined area of the MIPS-SW

and LegUp-HW implementations due to disabling link time optimizations in LegUp-Hybrid1.

For the hardware-only implementations shown in Table 3.6, the LegUp-HW flow requires 28% more

LEs than the MIPS processor on average, while the eXCite-HW implementations require 7% more

geomean LEs than MIPS-SW. We observe that both the LegUp-HW and the eXCite-HW flow require

far fewer memory bits than the MIPS processor alone. We found that LegUp-HW implementations

required more 9x9 multipliers than the corresponding benchmarks synthesized by eXCite. We believe

this is due to more aggressive multiplier sharing performed by eXCite during binding. Focusing on

Cyclone II logic elements, the LegUp hardware-only implementations require on average 19% more LEs

than circuits produced by eXCite. We can also multiply the wall-clock time and logic elements to calculate an area-delay product metric, which accounts for the inherent trade-off between area and delay. Using the geomean values, LegUp-HW yields 292 µs × 15,646 LEs ≈ 4.6M µs-LEs, while eXCite-HW yields 357 µs × 13,101 LEs ≈ 4.7M µs-LEs; the two flows have nearly identical area-delay products, with LegUp requiring more LEs on average but achieving better wall-clock time. We

consider these results encouraging, given that this study used the first version of LegUp.

We show the power and energy consumption across the benchmarks for each flow in Table 3.7. We

measured the average dynamic power consumption (mW) and the total energy consumption (nJ) for

each circuit. We observe in Table 3.7 that the dynamic power of the processor is about the same

as in the geomean power for the LegUp-Hybrid1 flow, with the dynamic power increasing by 12% on

average in LegUp-Hybrid2. The hardware-only implementations consume significantly less geomean

dynamic power than the processor, with the LegUp-HW flow requiring 55% less power, and the eXCite-

HW flow requiring 70% less power. The geomean energy consumptions of each flow show an even

greater improvement than dynamic power, which can be explained through the equation: Energy = Power × Time. We observe better energy consumption from lower power, but also from the improvement

in the wall-clock time of the faster hybrid and hardware-only flows. Consequently, energy consumption is

improved dramatically as we synthesize increasing amounts of computations into hardware. The LegUp-

Hybrid2 flow uses 47% less energy and the LegUp-Hybrid1 flow uses 76% less energy on average than the

processor, or a 1.9× and 4.2× reduction in energy consumption. The hardware-only implementations

consume even less energy, with the LegUp-HW flow consuming 94% less energy on average than the

MIPS-SW flow, an 18× energy reduction. The eXCite-HW flow uses over 95% less energy than the

processor, a 22× energy reduction.

Figure 3.7 summarizes all the geomean wall-clock times, cycle counts, clock frequencies, logic ele-

ments, power, and energy results across the benchmark suite for the five flows we considered here. The

horizontal axis labels the flow for each group of six metrics. The vertical axis provides the ratio of each

measurement to the corresponding metric in the MIPS-SW flow.


Table 3.6: Area results.

Benchmark    MIPS-SW                   LegUp-Hybrid2             LegUp-Hybrid1             LegUp-HW                  eXCite-HW
             LEs     # bits   Mults    LEs     # bits   Mults    LEs     # bits   Mults    LEs     # bits   Mults    LEs     # bits   Mults
adpcm        12,243  226,009  16       25,628  242,944  152      46,301  242,944  300      22,605  29,120   300      16,654  6,572    28
aes          12,243  226,009  16       56,042  244,800  32       68,031  245,824  40       28,490  38,336   0        46,562  18,688   0
blowfish     12,243  226,009  16       25,030  341,888  16       31,020  342,752  16       15,064  150,816  0        31,045  33,944   0
dfadd        12,243  226,009  16       22,544  233,664  16       26,148  233,472  16       8,881   17,120   0        9,416   0        0
dfdiv        12,243  226,009  16       28,583  226,009  46       36,946  233,472  78       20,159  12,416   62       9,482   0        32
dfmul        12,243  226,009  16       16,149  226,009  48       20,284  233,472  48       4,861   12,032   32       4,536   0        26
dfsin        12,243  226,009  16       34,695  233,472  78       54,450  233,632  116      38,933  12,864   100      22,274  0        38
gsm          12,243  226,009  16       25,148  232,576  114      30,808  233,296  142      19,131  11,168   70       6,114   3,280    2
jpeg         12,243  226,009  16       46,432  338,096  252      64,441  354,544  254      46,224  253,936  172      30,420  105,278  20
mips         12,243  226,009  16       18,857  230,304  24       18,857  230,304  24       4,479   4,480    8        2,260   3,072    8
motion       12,243  226,009  16       28,761  243,104  16       18,013  242,880  16       13,238  34,752   0        20,476  16,384   0
sha          12,243  226,009  16       20,382  359,136  16       29,754  359,136  16       12,483  134,368  0        13,684  3,072    0
dhrystone    12,243  226,009  16       15,220  226,009  16       16,310  226,009  16       4,985   82,008   0        N/A     N/A      N/A
Geomean:     12,243  226,009  16       27,248  258,526  43       33,629  261,260  51       15,646  28,822   12       13,101  496      5
Ratio:       1       1        1        2.23    1.14     2.68     2.75    1.16     3.18     1.28    0.13     0.72     1.07    0.00     0.32


Table 3.7: Power and energy results [Aldha 11b]. (Power: mW; Energy: nJ)

Benchmark    MIPS-SW                 LegUp-Hybrid2           LegUp-Hybrid1           LegUp-HW                eXCite-HW
             Power   Energy          Power   Energy          Power   Energy          Power   Energy          Power  Energy
adpcm        264.57  689,783.0       205.80  534,073.1       175.80  298,018.1       157.83  126,894.4       64.1   48,844.2
aes          231.63  230,125.2       221.84  222,014.5       226.23  122,789.4       123.09  28,434.1        448.4  489,919.5
blowfish     127.77  1,642,398.5     274.46  2,954,049.3     214.16  1,075,608.0     121.30  389,118.5       184.4  1,078,044.8
dfadd        157.61  35,011.1        221.09  39,015.7        178.98  12,086.6        85.84   1,631.0         35.4   533.9
dfdiv        214.70  206,741.0       257.54  49,101.0        193.20  13,299.9        83.66   2,426.0         31.8   1,469.0
dfmul        122.50  11,210.6        171.13  21,593.8        202.21  5,981.8         40.00   160.0           21.6   98.1
dfsin        240.75  9,704,502.3     274.49  1,224,986.7     240.06  283,857.4       129.14  139,082.3       69.8   86,642.8
gsm          253.35  133,420.8       167.91  80,594.0        178.90  54,145.6        102.70  11,605.4        35.1   4,820.8
jpeg         281.92  113,142,738.7   207.30  65,082,605.3    204.74  70,124,803.7    261.37  32,533,823.6    209.2  29,993,289.7
mips         239.86  140,130.3       166.74  12,752.7        166.74  12,752.7        41.44   2,983.7         25.2   1,433.5
motion       274.44  135,824.3       316.77  150,562.9       158.22  32,060.0        73.91   6,873.5         61.7   3,265.9
sha          383.21  6,241,622.4     227.70  965,543.0       211.64  685,437.1       152.23  433,860.6       65.6   249,909.5
dhrystone    260.15  101,084.8       230.82  71,830.7        243.18  74,220.9        116.19  13,826.0        N/A    N/A
Geomean:     221.67  517,413.28      221.61  273,162.62      194.46  122,511.37      100.75  29,390.23       65.9   23,549.97
Ratio:       1.00    1.00            1.00    0.53            0.88    0.24            0.45    0.06            0.30   0.046


Figure 3.7: Summary of geomean experimental results across the benchmark suite.


Figure 3.7 shows that the hardware-only implementations offer the best performance compared to software-only or hybrid implementations.

The plot demonstrates LegUp’s usefulness for exploring the hardware/software design space.

3.5.2 Comparison to Current LegUp Release

We have made considerable improvements to the HLS algorithms since the first release of LegUp was

used to perform this study. The current version of LegUp is almost ready for our fourth release. To

illustrate this improvement, we compare the quality of results produced by the current version of LegUp

against the original LegUp release we used in this chapter.

We use the same benchmarks as presented earlier, and we target the Altera Cyclone II FPGA (this

was the only device supported by LegUp 1.0). We use the pure hardware LegUp flow without a processor

and specify a difficult-to-meet timing constraint.

Table 3.8 shows the comparison. Column 1 gives the name of each benchmark. The next columns

give the number of Cyclone II LEs, memory bits, and 9x9 multipliers, execution cycles, FMax (MHz),

and wall-clock time (µs), respectively. For readability, we repeat the same results presented earlier in

this section in the “1.0” columns, which we compare side-by-side to the current version of LegUp in the

“Cur” columns. The second last row presents geometric mean data for each column, while the last row

presents the ratio of the current LegUp geometric mean vs. LegUp 1.0.

On almost all metrics, we see significant quality-of-results improvements vs. the first LegUp release.

On average, wall-clock time improved by 48%, cycle count by 38%, FMax by 19%, LEs by 41%, and

multipliers by 8%. The only metric where we perform worse is memory bit usage, which increases by

17% on average in the current version of LegUp.

The majority of this improvement in circuit quality can be traced back to the following changes.

First, the author fixed the scheduling of phi and branch instructions, which could be chained with

other operations to decrease the number of clock cycles. The author also removed combinational loops

that could occur in the binding step of LegUp, which were reducing the circuit clock frequency. Qi-

jing (Jenny) Huang improved LegUp to use dual-port memories instead of single-port memories, which

allowed greater instruction level parallelism. In 2010, Yuko Hara updated the jpeg benchmark in the

CHStone benchmark suite to contain an approximately 50% smaller image, which sped up the bench-

mark. Jason Anderson experimented with different clock period constraints and achieved better geomean

performance across the CHStone benchmarks. Finally, the memory architecture described in Chapter 6

also improved performance and area.

3.6 Research using LegUp

In this section, we will give examples of how LegUp has enabled further high-level synthesis research by

highlighting recent publications that have used LegUp.

The co-authored work by Huang [Huang 13] was the first to study the impact of standard software

compiler optimizations on high-level synthesis. Huang proposed methods of formulating a specific se-

quence of compiler passes tailored for our benchmarks, giving a 16% faster geomean wall-clock time

compared to the default LegUp -O3 optimization passes. The LegUp framework has also allowed ar-

chitecture studies on processor/parallel-accelerator systems, like the co-authored study by Choi on the

impact of cache architecture on the performance and area of our system [Choi 12a].


Table 3.8: LegUp 1.0 vs. current LegUp version. (Hardware-only Implementation)

Benchmark         LEs               # bits              Mults       Cycles                  Freq. (MHz)  Time (µs)
                  1.0      Cur      1.0      Cur        1.0   Cur   1.0        Cur          1.0   Cur    1.0      Cur
chstone/adpcm     22,605   18,527   29,120   35,646     300   172   36,795     14,450       46    92     804      157
chstone/aes       28,490   13,407   38,336   36,814     0     0     14,022     9,052        61    64     231      141
chstone/blowfish  15,064   6,038    150,816  182,208    0     0     209,866    163,950      65    93     3,208    1,763
chstone/dfadd     8,881    6,443    17,120   17,024     0     0     2,330      684          124   89     18.8     7.7
chstone/dfdiv     20,159   11,837   12,416   13,495     62    48    2,144      1,938        75    92     29       21
chstone/dfmul     4,861    3,330    12,032   12,032     32    32    347        234          86    87     4.1      2.7
chstone/dfsin     38,933   24,569   12,864   13,879     100   70    67,466     59,742       63    75     1,077    797
chstone/gsm       19,131   9,920    11,168   15,488     70    64    6,656      5,868        59    90     113      65
chstone/jpeg      46,224   44,952   253,936  469,992    172   222   5,861,516  1,320,580    47    47     124,475  28,097
chstone/mips      4,479    2,837    4,480    5,504      8     8     6,443      6,228        90    102    72       61
chstone/motion    13,238   13,139   34,752   32,768     0     0     8,578      8,264        92    80     93       103
chstone/sha       12,483   3,334    134,368  134,656    0     0     247,738    166,768      87    118    2,850    1,413
dhrystone         4,985    2,302    82,008   2,136      0     0     10,202     5,020        85    121    119      41
Geomean           14,328   8,492    31,236   26,676     9.6   8.9   19,738     12,200       73    86     272      142
Ratio (Cur/1.0)   1        0.59     1        1.17       1     0.92  1          0.62         1     1.19   1        0.52


Two recent works have focused on debugging in HLS, using LegUp as the backend. Calagar [Calag 14]

proposed a source-level debugging framework that offers gdb-like step, break, and data inspection functionality for an HLS-generated hardware circuit. With the proposed framework, the user can inspect the

values of logic signals in the hardware from the C source code perspective. The logic signal values come

from one of two sources: 1) a logic simulation of the RTL, or 2) an actual execution of the hardware

on an FPGA. Goeders [Goede 14] proposed inserting debug instrumentation into the LegUp-generated

circuit, which allows a debugger application to start and stop the circuit, monitor variables and set

breakpoints. The instrumentation contains trace buffers to record the control and data flow in real-time,

allowing the debugger to retrieve this data and replay the execution in a GUI.

HLS area optimizations have also been studied using LegUp. Gort [Gort 13] presented an algorithm

for reducing area by minimizing signal bitwidths in LegUp. Gort proposed inferring bitmasks and ranges

for variables using constant propagation at compile-time. For programs with predictable inputs, he used

run-time profiling data to determine variable ranges and optimize further. Klimovic [Klimo 13] proposed

using LegUp to optimize hardware accelerators for common-case inputs, as opposed to worst-case inputs,

allowing accelerator area to be reduced by 28%. When inputs exceed the range that the hardware

accelerators can handle, a software fallback function is automatically triggered. The co-authored work

by Hadjis [Hadji 12a] used LegUp to investigate the impact of FPGA architecture on resource sharing

patterns of interconnected operators. Hadjis found that the type of operations that are beneficial for

high-level synthesis resource sharing varies depending on whether we target Cyclone II (4-LUT) or

Stratix IV (6-LUT) Altera FPGA architectures.

We have made some impressive strides towards making FPGAs easier to program with LegUp. This

was evidenced by a group of undergraduates: Victor Zhang, Ahmed Kammoona, and Bryce Long,

who extended LegUp to support PCIe communication between a Stratix IV FPGA and a host PC.

They displayed an animated Mandelbrot simulation on the PC’s monitor, where computation was being

offloaded to 128 accelerators running on the FPGA, which executed 5.5× faster and with 5.0× less energy

than the same program executing dual-threaded on an Intel Core 2 Duo processor. Most importantly,

the Mandelbrot kernels were entirely synthesized by LegUp, with no hand-coded RTL required. Ruo

Long (Lanny) Lian and William Cai also used LegUp to generate a working hardware implementation of

an artificial intelligence that could play the two-player abstract puzzle game, Blokus Duo. They entered

the synthesized hardware design into the FPT 2013 design competition [Cai 13] to compete against other

hardware implementations.

3.7 Summary

In this chapter, we introduced an open-source HLS research framework called LegUp. LegUp can synthe-

size a standard C program into a hybrid FPGA-based processor/accelerator architecture consisting of a

processor communicating with custom hardware accelerators. With LegUp, researchers can explore the

hardware/software design space, in which a program is partitioned into functions that are synthesized

automatically into custom hardware circuits. The remaining functions execute in software on an

FPGA-based processor. Our experimental results have shown that a suite of benchmarks synthesized

into hardware-only implementations by the first release of LegUp execute 8× faster and consume 18×

less energy than when executed in software on a MIPS soft processor. We also show that LegUp’s syn-

thesized hardware circuits are comparable to those generated by a commercial HLS tool, eXCite, both


in terms of circuit wall-clock time and in area-delay product.

Our overarching goal with LegUp is to make programming FPGA devices easier and more accessible

to software engineers. We hope to expand the number of users who can leverage FPGA devices to speed

up specific applications, particularly in the embedded systems community. LegUp, along with its suite of

benchmark C programs, is a robust, well-tested open-source platform for HLS research that has enabled

many research studies over the past few years. We expect LegUp will continue to support a variety of

research advances in hardware synthesis, as well as in hardware/software co-design. LegUp is available

for download at: http://legup.eecg.utoronto.ca (or http://www.legup.org).


Chapter 4

Multi-Pumping for Resource Reduction in FPGA High-Level Synthesis

4.1 Introduction

LegUp enables us to study high-level synthesis techniques aimed at exploiting specific characteristics of

an FPGA architecture. In this chapter, we will present a high-level synthesis area optimization that is

particularly suitable for modern FPGA architectures.

In high-level synthesis, we often must meet user-defined resource constraints by minimizing the area of

the synthesized circuit. Area reduction is traditionally accomplished by resource sharing: using the same

functional unit to perform two or more operations. A limitation of resource sharing is that operations

sharing the same functional unit must be scheduled into mutually exclusive clock cycles. Consequently,

resource sharing can lengthen the overall schedule, hurting circuit performance. In this chapter, we

present a new approach to resource sharing by applying the technique of multi-pumping, which can

overcome this limitation. Multi-pumping refers to the existing circuit technique of operating a hardware

block at a higher clock frequency than its surrounding system. Typically, the multi-pumped unit is

clocked at twice the system frequency, or double-data-rate (DDR). We can share a single DDR multi-

pumped functional unit between two operations that are scheduled during the same system clock cycle.

Multi-pumping is an area-reduction technique that does considerably less harm to speed performance in

comparison to traditional resource sharing (provided that the functional units can indeed be clocked at

2× the system clock frequency).

Modern FPGA architectures contain prefabricated special purpose “hard” blocks such as block

RAMs, DSP blocks, and even entire processors. These blocks are implemented as ASIC-like hard IP

blocks, and are distinct from the reconfigurable “soft” logic, comprised of lookup tables (LUTs), regis-

ters, and other programmable circuitry. This is quite different from ASIC design, where standard cells

have similar timing characteristics varying only modestly with cell size. DSP blocks can perform the

types of operations essential to digital signal processing, multiply and multiply-accumulate operations,

with very high speed and low power. In modern FPGAs, such as Stratix IV [Stra 10], DSP blocks can


operate at speeds above 500MHz, whereas typical FPGA designs operate at considerably lower speeds

(in the 100–300MHz range). Consequently, DSP blocks are particularly suitable for our multi-pumping

sharing technique. Therefore, in this work, we focus on multi-pumping DSP blocks; however, the ideas

proposed are applicable to other types of blocks. To the best of our knowledge, this is the first work to

apply multi-pumping for resource reduction automatically in a high-level synthesis context.

We evaluate the multi-pumping approach compared to traditional resource sharing and target the

Altera Stratix IV 40-nm commercial FPGA [Stra 10]. Our results show that resource sharing using

multi-pumping is an effective approach for saving circuit area. Furthermore, compared to traditional
HLS resource sharing, multi-pumping achieves the same area reduction but with significant

performance advantages. Specifically, to achieve a 50% reduction in DSP blocks, traditional resource

sharing decreases circuit speed performance by 80%, on average, whereas multi-pumping decreases circuit

speed by just 5%. Multi-pumping is a viable approach to achieve the area reductions of resource sharing,

with considerably less negative impact to circuit performance.

The remainder of this chapter is organized as follows: Section 4.2 presents related work. Section 4.3

introduces the concept of multi-pumping, provides a characterization of the multi-pumped multiplier

unit, and compares multi-pumping to traditional resource sharing. Section 4.4 describes the high-level

synthesis algorithms necessary for resource reduction using multi-pumping. Section 4.5 presents an

experimental study. Section 4.6 offers a summary.

4.2 Background

Resource sharing has been studied extensively in high-level synthesis literature over the past two decades

[Cong 11, Gajsk 92]. Two recent studies by Gort [Gort 13] and Hadjis [Hadji 12a] were already discussed

in Section 3.6.

Cong and Wei [Cong 08] presented a method for sharing patterns of operators by analyzing their

graph edit distance. Hara-Azumi and Tomiyama [Hara 12] formulated a simultaneous binding and

allocation integer linear programming problem to minimize multiplexer area under a clock constraint.

It is worth mentioning that multi-pumping, as a concept, is not new. Multi-pumping has been applied

in computer memory for over a decade, however typically multi-pumping is used to improve performance

rather than to save area as we propose. Commodity DDR3 SDRAM allows transfers on both the positive

and negative edges of the memory bus clock — doubling the effective memory bandwidth [DDR3 08]. For

example, the Altera DE4 board has two SO-DIMM sockets, each of which contains a DDR2 SDRAM

internally clocked at 800MHz, with a combined 128-bit wide data I/O bus to the Stratix IV FPGA.

The DDR memory is multi-pumped, and we can transfer on the rising and falling clocks, so the FMax

requirement is 400MHz for the DDR2 memory controller. We use the Altera DDR2 memory controller

in half-rate transfer mode, which doubles the data width to 256-bits and halves the FMax requirement

to 200 MHz. We have found that on a Stratix IV FPGA, meeting a timing constraint of 200 MHz is

difficult but feasible for a high speed circuit. For a point of reference, the maximum possible FMax of a

circuit synthesized to a Stratix IV is about 550MHz due to minimum clock pulse constraints of the DSP

block internal registers.

Multi-pumping is also widely used in memories to “mimic” the availability of extra memory ports.

Choi et al.’s work in [Choi 12a] found that multi-pumped caches had the best performance and area

for FPGA processor/parallel-accelerator systems. A Xilinx white paper [Tidwe 05] describes how multi-


Figure 4.1: Multi-pumped multiplier (MPM) unit architecture.

pumping can improve the throughput of a DSP block in isolation, outside of the HLS context.

4.3 Multi-Pumped Multiplier Units: Concept and Characteri-

zation

We exploit the high operating frequency of DSP blocks relative to the surrounding FPGA soft logic to

multi-pump DSP multipliers at double-data-rate. Figure 4.1 shows the circuit architecture of our multi-

pumped multiplier (MPM) unit. The MPM consists of a multiplier implemented by DSP blocks, with

multiplexers on the inputs to steer incoming data. The number of DSP blocks required to implement 8,

16, and 32-bit multipliers is 1, 2, and 4, respectively, for either signed or unsigned numbers. Unsigned

and signed 64-bit multipliers require 16 and 32 DSPs, respectively. The 2× clock frequency must be

exactly twice that of the system 1× clock to ensure correct multi-pumping behaviour. Assuming the

multiplier has no pipeline registers, the operation of Figure 4.1 proceeds as follows: the positive edge

of the 1× clock occurs, causing inputs A, B, C, and D to transition. The 1× Clock Follower signal

(discussed below) matches the 1× clock and is high. For the next half of the 1× clock period, A is

multiplied by B. At the half-way point of the 1× clock period, the rising edge of the 2× clock triggers

the alignment register to store the product of A and B. For the second half of the 1× clock cycle, the

1× clock follower is low, and C is multiplied by D. At the rising edge of the 1× clock, both A × B and
C × D are available and are stored at the MPM outputs by registers in the 1× clock domain (not shown

in the figure).

We derived the 1× clock and 2× clock from the same PLL to match their clock phases and to avoid

a synchronizer on the 1×-to-2× clock-domain crossings. We can use optional pipeline registers inside

the DSP blocks to improve the FMax of the 2× clock and reduce the setup time on the 1×-to-2× clock-

boundary crossings. An Altera Stratix IV [Stra 10] DSP block has up to 3 optional pipeline stages: one

at the inputs and two at the outputs after the internal multiplier, as shown in Figure 4.1.

The actual FMax of the 1× clock when using multi-pumping is given by:

FMax = min(FMax1x,FMax2x

2) (4.1)

Where FMax1x is the maximum operating frequency of the circuits in the 1× clock domain, and FMax2x

is the maximum operating frequency of the circuits in the 2× clock domain. For instance, if the circuits

in the 1× clock domain could operate at FMax1x = 300 MHz, but the DSP 2× clock has a maximum

frequency of 400MHz, then the 1× clock must be reduced to 200MHz. Consequently, adding a multi-

pumped multiplier can potentially reduce the FMax of high-speed circuits by putting a ceiling on the


Figure 4.2: Clock follower circuit from [Tidwe 05].

Figure 4.3: Multi-pumped multiplier unit FMax characterization. (FMax in MHz versus pipeline stages P = 0 to 6; curves for signed (S) and unsigned (U) multipliers of width W = 8, 16, 32, and 64.)

system clock frequency. We mitigate this problem by using DSP pipeline registers (discussed below).

In Figure 4.1, the 1× Clock Follower has an identical waveform to the 1× clock signal but the clock

follower is driven by a 2× clock register. We cannot drive the select lines directly with the 1× clock

signal because that could cause a hold-time violation. For instance, when the 2× clock has a positive

edge, the DSP input pipeline registers are receiving data but at the exact same time the 1× clock could

transition from 0 to 1. If the 1× clock is driving the select line of the multiplexer, this could change

the multiplexer output too quickly, violating the hold-time requirement of the destination DSP input

2× clock registers. Consequently, we need a signal that is identical to the 1× clock but slightly delayed:

the 1× clock follower.

Figure 4.2 gives the clock follower circuit from [Tidwe 05]. On device startup, a synchronous reset

sets all three registers to logic-0. At this point, A=0 and B=0, therefore on the rising edge of the 1×

and 2× clocks, the 1× clock follower signal transitions from 0 to 1 and A transitions from 0 to 1. On the

next rising edge of the 2× clock, B transitions from 0 to 1, and the 1× clock follower from 1 to 0. On

the rising edge of the 1× and 2× clocks, A transitions from 1 to 0, and the 1× clock follower transitions

from 0 to 1. This pattern continues with the 1× clock follower transitioning on every positive edge of

the 2× clock, matching with the 1× clock signal. The 1× clock follower is delayed by the clock-to-Q

time of the 2× register.

4.3.1 Multi-Pumped Multiplier Characterization

We characterized the multi-pumped multiplier (MPM) unit in Figure 4.1, for an Altera Stratix IV

FPGA [Stra 10], using three parameters: the number of pipeline stages (P ), also called the latency; the

width of inputs (W ); and the type of multiplier, either signed (S) or unsigned (U). Figure 4.3 shows how

the FMax of the MPM in Stratix IV is impacted by P , the number of pipeline stages. If P is greater


Figure 4.4: Multi-pumped multiplier unit register characterization. (Registers versus pipeline stages P = 0 to 6; curves for signed (S) and unsigned (U) multipliers of width W = 8, 16, 32, and 64.)

than three, we will implement additional pipeline registers outside of the DSP blocks. Each curve in the

figure represents one choice of input width (W ) and whether the data is unsigned (U) or signed (S).

As expected, there is a tradeoff between pipeline stages and the MPM FMax: increasing the latency

allows the FMax to increase. Observe in Figure 4.3 that setting P greater than 3 is only beneficial to

the FMax of 64-bit multipliers; FMax is unchanged for W=32 or lower. We expected multipliers with

smaller bit widths to have higher FMax; however, this was not always the case. For P=3, the FMax

improves from 483MHz for W=8, to 550MHz for W=16. We found that this improvement was caused

solely by a change in cell delay within the DSP block where the critical path is located. At higher clock

frequencies (above 450MHz), the MPM is restricted by the minimum clock pulse width requirements of

the registers inside the DSP blocks, which is a property of the DSP blocks and dependent on W . For

instance, we found that the clock frequency could have been 655MHz for W=16 and 533MHz for W=8,

but was restricted by minimum clock pulse width, as shown in Figure 4.3.

The cycle latency of the MPM, in terms of the 1× system clock, is ⌈P/2⌉. For instance, if the MPM

has a system cycle latency of two, from the 1× system clock perspective, then we have four 2× clock

cycles to multiply the inputs. Because of the double-data-rate operation of the MPM, there is a wasted

2× clock cycle whenever the MPM has an odd number of pipeline stages. For instance, an MPM with

a one 1× clock cycle latency constraint can have P=1 or P=2. Given that the FMax of the 2× clock

in the MPM increases as P increases, we should always choose P=2 over P=1 — there is no additional

cost in registers because the optional pipeline registers are internal to the DSP blocks.

Figure 4.4 shows how the number of registers, outside of the DSP blocks, varies with W , P and

signedness. Register counts are indicative of silicon area cost, and those in Figure 4.4 include input and

output registers in the 1× clock domain that are not shown in Figure 4.1 (i.e. registers to hold input

operands and products). Observe that register count is unaffected when sweeping P from zero to three.

This is expected, as the DSP blocks have 3 internal pipeline stages, meaning registers in general FPGA

logic blocks are not needed. An exception to this is the 64-bit multiplier, which requires additional

registers for latencies of two and above.

While multi-pumping can save DSP blocks, it can also be applied to raise computational throughput.

For instance, if the throughput of a circuit pipeline is limited by multipliers, we can multi-pump every

multiplier. This can double the original pipeline’s throughput, without requiring any additional DSPs,


Figure 4.5: Loop schedule: multiplier sharing vs. multi-pumping.

Figure 4.6: Loop hardware: original vs. resource sharing.

if the application has enough data bandwidth to saturate the new pipeline.

4.3.2 Multi-Pumping vs. Resource Sharing

Multi-pumping can be seen as an alternative to traditional resource sharing in high-level synthesis with

two differences. First, multi-pumping can share two multipliers that are scheduled in the same state,

while resource sharing can only share multipliers from different states. Second, multi-pumping requires

that the system clock be half the 2× clock rate. Therefore, when multi-pumping, we should pipeline

multipliers to a greater extent than when resource sharing. Pipelining will increase the 2× clock rate

and avoid constraining the system clock.

We illustrate the difference between resource sharing and multi-pumping by considering a loop that

performs two independent multiplies every iteration and finishes in 100 cycles, taking one cycle per

iteration, as shown in Figure 4.5. Assume that we wish to reduce the number of DSPs required for the

loop. We must reschedule the multipliers into distinct states to apply traditional resource sharing to

the loop, saving one multiplier. Figure 4.5 (middle) shows the new schedule and Figure 4.6 shows the

hardware before and after resource sharing. Assuming a single-cycle multiplier, the loop (with resource

sharing applied) now takes 200 cycles — twice as long as the original. We can apply multi-pumping to

the same loop and achieve the same reduction in multipliers without the increase in cycles. Although

we must now pipeline the multi-pumped unit to achieve the same FMax as the original, we can still

start one new multiply every cycle with loop pipelining [Ramak 96], assuming there are no loop-carried

dependencies between iterations. If we pipeline the multi-pumped unit with three stages, the loop will

complete in 102 cycles — 2 cycles to fill the pipeline, then one loop iteration finishes every clock cycle

for the subsequent 100 cycles.

For multi-pumping to provide a performance and area benefit over resource sharing, a few conditions

are necessary. First, two or more multipliers need to be scheduled into the same state. Next, these

multipliers should occur in a section of the code that is executed multiple times; otherwise, the impact on

overall circuit performance will be minor. Lastly, there needs to be limited multiplier mobility, meaning

that there is little flexibility to change the scheduled state of a multiplication operation without impacting

the schedule of its successor operations. So, if we reschedule these multiplies into separate states then


the circuit’s performance will decrease (due to a longer schedule).

To calculate the savings from multi-pumping, we can calculate the maximum number of multipliers used in any
state, which we designate as X. X is the minimum number of multipliers that can be achieved using
traditional resource sharing without modifying the schedule. However, using multi-pumping, the number
of multipliers can be reduced to ⌈X/2⌉.

4.4 Multi-Pumping DSPs in High-Level Synthesis

As discussed in Chapter 3, there are three main steps to high-level synthesis: allocation, scheduling, and

binding. Binding is performed after scheduling and solves the problem of assigning the operations in the

program to hardware functional units. In other words, binding is the key step where traditional resource

sharing is implemented, and is also where we may choose to assign two operations to a multi-pumped

multiplier unit.

We implemented our multi-pumping approach in LegUp by modifying the constraints we use during

scheduling and binding to handle the MPM functional units. We can think of the multi-pumped unit as

having two “ports” corresponding to the two inputs/outputs that occur each system clock cycle. We can

then bind multiply operations to an MPM functional unit in an analogous manner to binding loads/stores

to a dual-port RAM. Given a user resource constraint of M multipliers, we can instantiate up to M

MPM functional units in hardware. Furthermore, during scheduling we must ensure there are no more

than 2M multiply operations per cycle. After scheduling, we bind each multiply operation to one of

the 2M available ports on the MPM units using weighted bipartite matching [Huang 90]. We can still

use an MPM unit for a single multiply if we only utilize the DSP blocks for half of the 1× system clock

cycle. Hence, multi-pump sharing is a superset of resource sharing — we can share multipliers using

multi-pumping in all cases where we could perform resource sharing, but in addition, we can share when

two multipliers are scheduled in the same state.
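To make the port analogy concrete, the following C sketch (with illustrative names; this is not LegUp’s actual implementation) checks a modulo schedule against a user constraint of M MPM units, each offering two ports per system clock cycle, and computes the number of MPMs a schedule actually needs as ⌈X/2⌉, with X the peak per-state multiply count from Section 4.3.2:

#include <stdbool.h>

/* mult_count[s] holds the number of multiply operations scheduled in
 * state s (0 .. n_states-1). With m_mpms MPM units, each offering two
 * "ports" per system clock, a schedule is feasible only if no state
 * uses more than 2 * m_mpms multiplies. */
static bool schedule_feasible(const int *mult_count, int n_states, int m_mpms)
{
    for (int s = 0; s < n_states; s++)
        if (mult_count[s] > 2 * m_mpms)
            return false;
    return true;
}

/* MPMs needed for a given schedule: ceil(X / 2), where X is the peak
 * number of multiplies in any one state. */
static int mpms_needed(const int *mult_count, int n_states)
{
    int peak = 0;
    for (int s = 0; s < n_states; s++)
        if (mult_count[s] > peak)
            peak = mult_count[s];
    return (peak + 1) / 2;
}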

4.4.1 DSP Inference Prediction

In our original implementation of multi-pumping, we saw an increase in the number of DSPs compared

to the original circuit, rather than a reduction! We found that Altera’s Quartus II synthesis tool

incorporates optimizations to avoid inferring DSPs in certain scenarios. Specifically, multiplies by a

power of 2 will be replaced with a shift. Additionally, if one input to a multiply is a constant (c) and

the multiply, x × c, can be implemented as (x << a) plus or minus (x << b), where a and b are

constants, then Quartus will not infer a DSP block, instead preferring the shifts by constants, followed

by addition. For example: x×22 will infer a DSP, while x×14 = (x×16)−(x×2) can be implemented as

(x << 4)− (x << 1). This optimization is common for constants under 100, but becomes rare for larger

constants. We avoid multi-pumping multiply operations that will not result in DSP-block inference.
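As an illustration of this prediction, a predicate along the following lines (a sketch; the exact conditions Quartus applies are more involved, and larger constants are handled differently, as noted above) can flag constant multiplications that synthesis would likely map to shifts and adds rather than to a DSP block:

#include <stdbool.h>
#include <stdint.h>

/* Sketch: returns true if a multiply by constant c would likely be
 * replaced by shifts/adds instead of inferring a DSP block, i.e. c is
 * a power of two or c = 2^a + 2^b or c = 2^a - 2^b. */
static bool likely_avoids_dsp(uint64_t c)
{
    if (c == 0 || (c & (c - 1)) == 0)
        return true;                     /* zero or power of two: pure shift */
    for (int a = 1; a < 64; a++) {
        uint64_t p = 1ULL << a;
        for (int b = 0; b < a; b++) {
            uint64_t q = 1ULL << b;
            if (p + q == c || p - q == c)
                return true;             /* e.g. 14 = (1 << 4) - (1 << 1) */
        }
    }
    return false;                        /* e.g. 22 needs three shift terms */
}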

Another optimization we made is an artifact of compiling software to hardware. Namely, in the

high-level synthesis of C code, multiplying two 32-bit integers does not require a 64-bit result — the

product is usually truncated to a 32-bit integer. If all 64-bits are required, then high-level synthesis will

be forced to use a 64-bit multiply instruction and sign extend the 32-bit operands to 64-bits. However,

in hardware, we can implement a 32-bit multiplier with a 64-bit product using half as many DSP blocks

as a 64-bit multiplier with the output truncated to 64-bits. By detecting these 64-bit multipliers and

replacing them with 32-bit multipliers, we saw a significant reduction in DSP block usage.
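In C terms, the pattern we detect looks like the following minimal illustration (the function name is ours):

#include <stdint.h>

/* The C front end widens both 32-bit operands to 64 bits, so the IR
 * contains a 64-bit multiply even though only a 32x32 -> 64-bit
 * product is required: */
int64_t widened_product(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}
/* Detecting that both operands are sign-extended 32-bit values lets
 * hardware use a 32x32 multiplier with a 64-bit product, needing half
 * the DSP blocks of a full 64-bit multiplier. */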


Figure 4.7: Image after Sobel edge detection and Gaussian blur.

Lastly, two multiply operations are only paired together in a MPM if they have the same bit width.

We used a bit width minimization pass [Mahlk 01] to statically calculate the required bit width of each

multiply operation.

4.5 Experimental Study

Table 4.1: Area results (TRS: Traditional Resource Sharing, MP: Multi-Pumping).

                     DSPs                Registers                 ALUTs
Benchmark      Orig   TRS   MP     Orig     TRS     MP       Orig     TRS     MP
alphablend     8      4     4      7,799    10,599  7,965    4,756    5,786   4,821
sobel          8      4     4      22,861   22,775  22,959   25,396   25,493  25,348
4matrixmult    16     8     8      10,677   25,478  11,068   10,578   13,189  10,722
gaussblur      24     12    12     10,659   10,615  10,861   10,458   10,655  10,493
idct           40     20    20     33,977   33,289  34,925   43,361   42,204  43,440
mandelbrot     144    72    72     33,729   34,449  34,548   31,112   31,291  30,702
Geomean        23     11    11     16,895   20,530  17,269   16,193   17,360  16,239
Ratio          1      0.5   0.5    1        1.22    1.02     1        1.07    1

Table 4.2: Speed performance results (TRS: Traditional Resource Sharing, MP: Multi-Pumping).

                     Cycles                    FMax (MHz)          Time (µs)
Benchmark      Orig     TRS     MP        Orig  TRS   MP      Orig    TRS     MP
alphablend     1,131    2,131   1,151     219   203   223     5.2     10.5    5.2
sobel          45,685   66,357  46,229    163   166   166     280.8   399.8   279.3
4matrixmult    8,551    19,851  8,651     157   155   158     54.5    127.8   54.8
gaussblur      26,575   45,615  27,119    176   167   176     151.2   273.8   154.0
idct           7,336    11,436  7,336     170   158   155     43.2    72.5    47.4
mandelbrot     1,899    3,307   1,963     143   150   125     13.3    22.1    15.7
Geomean        7,395    13,007  7,513     170   166   165     43.6    78.6    45.7
Ratio          1        1.76    1.02      1     0.98  0.97    1       1.8     1.05

We used six benchmarks to evaluate our multi-pumping approach: Alphablend blends two image

streams. 4matmult performs four matrix multiply operations in parallel for 20×20 matrices stored in

independent block RAMs. Sobel is a Sobel edge detection algorithm from computer vision, applied

on an image striped over three block RAMs, shown in Figure 4.7. Gaussblur applies a Gaussian low-

pass filter to blur the same image. IDCT performs 200 inverse discrete cosine transforms used in

JPEG image decompression. Mandelbrot generates a 32×32 fractal image. All of the benchmarks

require multipliers operating in parallel and are representative of data parallel digital signal processing


applications that DSP blocks were designed for. The benchmarks also include input data, allowing us

to execute them in hardware and gather wall-clock time (execution time) results. Loop unrolling was

applied to the benchmarks to increase multiplier density. We constrained the number of multipliers in

each benchmark to balance multiply operations evenly across all pipeline stages to maximize multiplier

utilization. We compare multi-pumping to traditional resource sharing targeting the Stratix IV [Stra 10]

EP4SGX530KH40C2 on Altera’s DE4 board [DE4 10] using Quartus 11.1sp2. All benchmarks were

synthesized with a 500MHz timing constraint for the 1× clock and a 1GHz constraint for the 2× clock.

Table 4.1 gives the area results for three scenarios: “Original” (the baseline with no resource reduc-

tions), “TRS” (traditional resource sharing), and “MP” (multi-pumping). The “DSPs” column gives

the number of DSP blocks required, which is reduced by 50% by both resource sharing and multi-

pumping. Mandelbrot was the only benchmark that used exclusively 64-bit multiplication; all other

benchmarks used only 32-bit multipliers. The “Registers” column gives the total number of registers

required. “ALUTs” gives the number of Stratix IV combinational ALUTs. Ratios in the table compare

the geometric mean (geomean) of the column to the respective geomean in the original. Table 4.2 gives

speed performance results. The “Cycles” column is the total number of cycles required to complete the

benchmark. The “FMax” column provides the FMax of the circuit given by the equation in Section 4.3.

The “Time” column gives the circuit wall-clock time: Cycles · (1/FMax).
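For example, this is just the following trivial restatement (illustrative names; units as in Table 4.2):

/* Wall-clock time in microseconds, given a cycle count and FMax in MHz;
 * e.g. alphablend with MP: 1,151 cycles / 223 MHz is roughly 5.2 us. */
static double wall_clock_us(double cycles, double fmax_mhz)
{
    return cycles / fmax_mhz;
}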

In the baseline and in the traditional resource sharing scenarios, we allocated multiplier functional

units with two pipeline stages. This means that after two inputs are passed into a multiplier functional

unit, we must wait two system (1×) clock cycles before the multiplier output will be valid. For multi-

pumping, we increased the pipeline depth of the MPM units to three stages in the 1× clock, to minimize

the impact on FMax. We chose one 1× clock stage at the MPM inputs, to minimize the delay across the

1×-to-2× clock-boundary crossing, and four 2× clock stages in the MPM unit. Recall that for designs

that operate at a high FMax, the 2× clock FMax affects the system clock because the system clock

must be exactly half the 2× clock. By increasing the pipeline depth of the MPM units, we can increase

the 2× FMax and mitigate this effect on the system clock. The disadvantage of increasing pipeline

stages is that the cycle latency required to complete a multiply also increases. However, this is hidden

by having several multiply operations “in flight” within a single pipelined MPM unit at once.

The results show that both multi-pumping and resource sharing can be applied to reduce DSP usage

by 50%, though multi-pumping is able to do so with less impact on circuit speed, and also with less area

cost. With multi-pumping, the DSP reduction comes at a cost of 5% higher wall-clock time and 2% more

registers. In contrast, traditional resource sharing increased circuit wall-clock time by 80%, ALUTs by

7%, and registers by 22% to achieve the same DSP reduction. However, the increase in registers and

ALUTs when resource sharing was primarily an artifact of loop unrolling, which we used to emulate loop

pipelining (due to lack of early support by LegUp). Loop unrolling with longer schedule lengths caused

excessive registers to be created. We predict that with loop pipelining instead of loop unrolling, the

register improvement would disappear but all other results would remain the same. Geomean execution

cycles are significantly increased (76%) by the scheduling constraints imposed by resource sharing. Multi-

pumping can achieve the same DSP savings with only a 2% increase in execution cycles, caused by the

extra multiplier pipeline stage.

Overall, multi-pumping appears to be a viable way to reduce resource usage in HLS, while incurring

significantly less speed/area cost than traditional resource sharing.


4.6 Summary

This chapter presented multi-pumping as an alternative to traditional resource sharing in high-level

synthesis when targeting an FPGA device. For a given constraint on the number of FPGA DSP blocks,

multi-pumping can deliver considerably higher performance than resource sharing. Empirical results

over digital signal processing benchmarks show that multi-pumping achieves the same DSP reduction as

resource sharing, but with a lower impact to circuit performance: decreasing circuit speed by only 5%

instead of 80%.


Chapter 5

Modulo SDC Scheduling with Recurrence Minimization in HLS

5.1 Introduction

In this chapter, we investigate loop pipelining [Ramak 96] scheduling algorithm improvements. Loop

pipelining is a high-level synthesis scheduling technique that overlaps the execution of loop iterations

to achieve higher performance. We use this schedule to generate a pipelined datapath in hardware for

operations within the loop, increasing parallelism and hardware utilization.

In many C applications, the majority of run time is spent executing critical loops. Consequently,

loop pipelining is crucial for generating a hardware architecture with comparable performance to hand-

designed RTL. Furthermore, complex loops usually have resource constraints, typically caused by limited

memory ports, in combination with constraints imposed by cross-iteration dependencies. The interaction

between multiple constraints can pose a challenge for loop pipelining scheduling algorithms, which, if

not handled properly, can lead to a loop pipeline schedule that fails to achieve the best performance. The

goal of this work is to focus on loops with complex resource and dependency constraints and improve

the high-level synthesis quality of results for these cases.

Loop parallelism is limited by cross-iteration dependencies between operations in the loop called

recurrences. Recurrences can prevent the next iteration of a loop from starting in parallel until data

from a prior iteration has been computed, for instance an accumulation across iterations. The second

limitation is due to user-imposed resource constraints, e.g. only allowing one floating point adder in the

design. These constraints can significantly impact the final loop pipeline throughput.

As discussed in Chapter 2, state-of-the-art HLS scheduling uses a mathematical framework, called

a System of Difference Constraints (SDC) to describe constraints related to scheduling. The SDC

framework is flexible and allows the specification of a wide range of constraints such as data and control

dependencies, relative timing constraints for I/O protocols, and clock period constraints. Although loop

pipelining has been well studied in HLS, until recently, the SDC approach had not been applied to

scheduling loop pipelines due to non-linearities caused by describing the resource constraints in modulo

scheduling. Recent work in [Zhang 13] has extended the SDC framework to handle loop pipelining

scheduling by using step-wise legalization to handle resource constraints. This new SDC approach offers

compelling advantages over prior methods of modulo scheduling by providing the same mathematical


framework for a wide range of scheduling constraints. However, there are issues applying this approach

to more complex loops, particularly the class of loops that contain a combination of recurrences and

resource constraints.

We propose a new modulo scheduling algorithm that uses backtracking to handle complex loops with

competing resource and dependency constraints, as can be expected in commercial hardware designs.

This new scheduling approach significantly improves the performance of loop pipelines compared to prior

work by scheduling pipelines with better throughput when the loops have complex constraints. Further-

more, our scheduler is based on the SDC formulation allowing for a flexible range of user constraints. We

also describe how to apply well-known algebraic transformations to the loop’s data dependency graph

using operator associativity to reduce the length of recurrences. These associative transformations have

already been widely applied for balancing the tree height of expression trees in HLS [Nicol 91a]. How-

ever, a loop containing recurrences must be restructured differently to minimize the length of the loop

recurrences. This idea has been previously studied in the DSP domain [Iqbal 93] but to our knowledge,

has not yet been widely applied in HLS.

We compared our techniques to existing prior work in HLS loop pipelining and also compared against

a state-of-the-art commercial HLS tool. Over a suite of benchmarks, we show that our scheduler and proposed

optimizations can result in a geomean wall-clock time reduction of 32% versus prior work and 29% versus

a commercial tool.

The remainder of this chapter is organized as follows: Section 5.2 presents related work and relevant

background. Section 5.3 gives an overview of loop pipelining and introduces a motivating example.

Section 5.4 describes our modulo SDC scheduling algorithm. Section 5.5 discusses our data dependency

restructuring transformations to reduce loop recurrence cycles. Section 5.6 presents an experimental

study and Section 5.7 draws conclusions.

5.2 Preliminaries

5.2.1 Related Work

Loop pipelining can be performed using software pipelining, which is a compiler technique traditionally

aimed at Very Long Instruction Word (VLIW) processors [Lam 88]. VLIW processors [McNai 03] can

execute multiple instructions in the same clock cycle allowing them to exploit instruction-level paral-

lelism. Software pipelining uncovers instruction-level parallelism between successive iterations of a loop,

and reschedules the instructions to exploit these opportunities. Iterations of a loop are initiated at

constant time intervals, before the previous iterations are complete. Software pipelining is performed

using modulo scheduling [Rau 81], which we will discuss in more detail in Section 5.2.2.

One common software pipelining heuristic is called Iterative Modulo Scheduling (IMS) [Ramak 96],

which has been adapted for loop pipelining in high-level synthesis by PICO [Schre 02]. Iterative modulo

scheduling combines list-scheduling, backtracking, and a modulo reservation table to reorder instructions

from multiple loop iterations into a pipelined schedule. IMS, in its original form [Ramak 96], did not

consider HLS operator chaining, as chaining is not applicable to VLIW architectures. The authors of

the HLS tool PICO [Sivar 02] studied the impact of adding chaining capability to IMS, which is non-

trivial and requires adding an approximate static timing analysis to the inner loop of the algorithm.

However, they focused on area improvements assuming a fixed pipeline throughput and did not consider


the impact of chaining on loop recurrences.

Another software pipelining heuristic used by GCC is swing modulo scheduling, which tries to reduce

register pressure [Hagog 04]. Register pressure at any point in a program is equal to the number of live

variables that must be stored in machine registers [Chait 81]. If register pressure exceeds the number of

available machine registers then we must “spill” variables into main memory. Spilling to memory slows

down program execution. During modulo scheduling we have some flexibility as to when instructions

are scheduled. Swing modulo scheduling tries to schedule dependent instructions as close as possible to

shorten variable lifetimes, reducing register pressure and avoiding spilling to memory.

Earlier approaches to loop pipelining in high-level synthesis determined the pipeline datapath through

the following process [Potas 90]. First, they unroll the loop by one iteration. This entails duplicating

all basic blocks in the loop body and then connecting the last basic block of the loop body to start of

the duplicated basic blocks. They compact this new loop body by applying code motions that move

operations upwards across basic block boundaries in the new loop body. Operations migrate towards

the earlier loop iteration and upwards motion is only limited by dependencies between operations.

Unrolling is continued until the (provable) emergence of a repeating pattern of code, which will contain

all loop recurrences. This pattern then becomes the new compacted loop body, which exposes all of

the available loop parallelism to standard HLS scheduling. The code motions used to compact the loop

body are described in a compiler technique called percolation scheduling [Nicol 85]. They describe local

transformations that move one operation from a basic block to the immediately preceeding basic block(s)

if no dependencies are broken. These local transformations iteratively “percolate” operations upwards in

the control flow graph towards the start of the program. Modulo scheduling has been shown to perform

better than these iterative unrolling loop pipelining techniques for single basic block loops [Jones 91].

Loops with control flow in the loop body will have multiple basic blocks, which must be merged together

into one hyperblock [Mahlk 92] using if-conversion before modulo scheduling [Warte 92].

Recently, a heuristic using SDC-based scheduling to perform modulo scheduling was proposed [Zhang 13].

This work used an SDC-based scheduling formulation with an objective function to minimize register

pressure and compared the register usage to swing modulo scheduling. Their scheduling algorithm is

similar to the one proposed in this chapter but uses a greedy heuristic to choose operations to be sched-

uled, prioritizing operations that minimize the impact on operations still to be scheduled. We take an

alternative approach by abandoning any infeasible partial schedules and then backtracking by attempt-

ing other possible scheduling combinations. Backtracking can lead to better schedules than the greedy

approach in cases where the priority ordering prevents the discovery of a valid schedule in a single pass.

The work in [Nicol 91a] presents a method for incrementally reducing the height of an expression

tree from O(n) to O(log n) in high-level synthesis by using associative and distributive transformations.

They find trees of dependent arithmetic operations within the program data flow graph and then apply

these transformations to balance the height of each operation in the tree. The goal is to minimize

the longest dependency chain of operations in the data flow graph, which limits the total HLS schedule

length. They did not consider applying their transformations to recurrences during loop pipelining as we

describe in this chapter. Tree height restructuring has also been investigated for software pipelining in the

Cydra compiler [Schla 94] when targeting loops with recurrences. But they focused on VLIW processors

with limited instruction level parallelism instead of the flexible high-level synthesis architecture we study

here. The work in [Iqbal 93] presents an approach of using algebraic transformations and register retiming

to restructure a pipelined data flow graph. They apply these transformations to minimize the longest


Figure 5.1: Time sequence of a loop pipeline with II=2 and five loop iterations (i = 0 to 4).

chain of dependent operations that are limiting the HLS schedule length. Their algorithm allows an

arbitrary timing constraint on each operation, for instance an input arrival time for the first operation

and an output ready time for the last operation in a streaming application. Their work is the most

applicable to the transformations we describe in this chapter, but we focus on the specifics of how to

apply these transformations to modulo scheduling in HLS.

5.2.2 Background: Loop Pipeline Modulo Scheduling

LegUp performs loop pipelining in two steps, which we will discuss in this section. First, we schedule the

operations in the loop using modulo scheduling. Second, we generate the pipeline datapath and control

signals in hardware corresponding to this schedule.

The modulo scheduling algorithm assumes that the loop has exactly one basic block. If the loop body

has multiple basic blocks then we must perform if-conversion to remove control flow and leave us with

one basic block [Warte 92]. LegUp’s if-conversion pass, implemented by Joy (Yu Ting) Chen, is currently

limited to simple control flow. Modulo scheduling rearranges the operations from one iteration of the

loop into a schedule that can be repeated at a fixed interval without violating any data dependencies or

resource constraints. This fixed interval between starting successive iterations of the loop is called the

initiation interval (II) of the loop pipeline. The best pipeline performance and hardware utilization is

achieved with an II of one, meaning that successive iterations of the loop begin every cycle, analogous

to a MIPS processor pipeline. If the first iteration of the pipelined loop takes T cycles to complete, then

the total number of cycles required to complete a loop with N iterations is T + (N − 1) × II ≈ N × II,
for N ≫ T. Consequently, we can significantly improve pipeline throughput by minimizing the initiation

interval.
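Restating the formula above as a trivial helper (names are illustrative):

/* Total cycles for a pipelined loop with N iterations, where the first
 * iteration completes after T cycles and a new iteration starts every
 * II cycles; this approaches N * II for large N. */
static long pipeline_cycles(long T, long N, long II)
{
    return T + (N - 1) * II;
}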

If we are pipelining a loop that contains neither resource constraints nor cross-iteration dependencies

then the initiation interval will be one. Furthermore, in this case we can use a standard scheduling

approach as described in Section 2.6, which will correctly schedule the loop into a feed-forward pipeline.

However, when the loop does contain constraints, the initiation interval may have to be greater than

one. For instance, if two memory operations are required in the loop body but only a single memory port

is available then the initiation interval must be two. In this case, modulo scheduling will be required

because standard scheduling has no concept of an initiation interval. Standard scheduling assumes that

operations from separate control steps do not execute in parallel when satisfying resource constraints,

which is no longer true in a loop pipeline. For instance, the standard approach may schedule the first

memory operation in the first time step and the second memory operation in the third time step, but if


new data is entering the pipeline every two cycles then these memory operations will occur in parallel

and conflict with the single memory port.

To illustrate a loop pipeline, we consider a loop with five iterations pipelined with an initiation

interval of two cycles and with three pipeline stages. Figure 5.1 shows the time sequence of the pipeline

with time increasing from left to right. During the prologue the hardware pipeline “fills up”, while

during epilogue the pipeline “flushes out” as no further loop iterations remain. Each box in Figure 5.1

is labeled with the loop iteration that is occupying the pipeline stage at that moment in time. A loop

pipeline stage is analogous to stages in a processor pipeline, where each stage of the pipeline executes

in parallel. Each pipeline stage takes two cycles to complete, corresponding to the initiation interval,

with the first loop iteration completing after six cycles. At any time step in the steady state operation

of the pipeline, we are executing operations from three consecutive iterations of the loop, each iteration

activating a different pipeline stage. For instance, in Figure 5.1, when the pipeline first reaches steady
state, loop iterations i = 0, i = 1, and i = 2 are all executing, and iteration i = 0 is finishing.

The number of pipeline stages depends on when the last scheduled operation finishes and is independent

of the initiation interval of the pipeline. For instance, assume we are using a pipelined divider functional

unit, where new inputs can be passed in every cycle and the output is valid after 32 clock cycles. If we use

this divider in the loop pipeline, then we would require at least 32 pipeline stages but the pipeline could

still have an initiation interval of one. More stages will result in a longer overall latency by increasing

time spent in the prologue and epilogue of the pipeline but this is typically a small fraction of the time

spent in steady state with many iterations.

We can compare the pipeline in Figure 5.1 to the sequential schedule of the same loop. Sequentially,

the loop body would have been scheduled with up to six cycles, which are now split into three pipeline

stages. If we assume the original loop body required all six cycles, then five iterations would complete

in 30 cycles, compared to the pipelined case in Figure 5.1 which completes after 14 cycles including the

prologue and epilogue. As the number of loop iterations N increases, the gap in cycles needed to complete
the loop widens between the sequential (6N) and pipelined (about 2N) implementations, with the pipelined
version eventually running three times faster.

If we compare the final circuits, the number of functional units in the datapath of the pipelined loop

is equal to the number in the sequential loop, assuming no resource sharing, although for the pipelined

loops the datapath may have additional registers between pipeline stages. The main difference is in

the control logic. The sequential loop is controlled by a finite state machine, while the pipelined loop

is controlled by a shift register and a counter (see Section 5.2.3). The performance gain of pipelining

is due to activating multiple hardware functional units in parallel, which increases hardware utilization

compared to the sequentially scheduled loop.

Loop recurrences can increase the initiation interval required for a feasible pipeline schedule. Fig-

ure 5.2(a) illustrates the data dependency graph of a loop performing an accumulation across iterations:

sum = sum + a[i] + i. The directed edges in the graph represent data dependencies between operations

and the edge labels indicate the required clock cycle latency between operations. We assume that both

memory loads and addition operations have a latency of one cycle. In this case, sum in the current

iteration has a loop-carried dependency on the sum calculated in the previous iteration, therefore the

loop contains a recurrence, indicated by the cycle in the data flow graph. The back edge has a depen-

dency distance of one (next iteration), labeled in square brackets. Consequently, when we perform loop

pipelining, the best schedule has an initiation interval of two as shown in Figure 5.2(b). The recurrence


Figure 5.2: Loop pipelining with a recurrence. (a) Loop dependency graph; (b) loop pipeline schedule for the first three loop iterations (II=2).

for (i = 0; i < N; i++) {

sum = sum + a[i] + i;

}

Figure 5.3: C code for loop.

prevents us from the ideal case of scheduling a loop iteration every clock cycle. However, if we could

have chained the additions into a single cycle, then we could have achieved an II of one.

In general, we can calculate the minimum recurrence-constrained initiation interval (recMII) in the
following manner: for every loop recurrence i (a cycle in the data dependency graph), we take the sum
of operator clock cycle latencies along the entire path of the recurrence cycle, delay_i, divide by
the dependency distance of the recurrence, distance_i, and round up. The dependency distance is the
number of iterations separating the destination operation from the source operation of the recurrence
back edge. The recMII is calculated by taking the maximum over all recurrences in the dependency
graph: recMII = max_i ⌈delay_i / distance_i⌉. We can intuitively think of delay_i as the number of
cycles needed after the previous iteration to calculate a result required by the next iteration; we cannot
start the next iteration for a minimum of delay_i cycles. When distance_i > 1, the result is instead needed
distance_i iterations later, and each of those iterations takes II cycles to complete.

Resource constraints can also limit the minimum initiation interval. For instance, if we schedule

a loop pipeline with three multiply operations but with only one multiplier unit in the datapath, we

must wait three cycles before starting each new loop iteration. In general, we calculate the resource-constrained
minimum II (resMII) by taking every resource type i and calculating the number of
operations in a loop iteration that use that resource, #ops_i, divided by the number of functional units
available, #FU_i, rounded up to the nearest integer. We take the maximum over all resource
types to give us: resMII = max_i ⌈#ops_i / #FU_i⌉. Many resources are typically unconstrained in HLS,

for instance adders, in contrast to general purpose processors which have a fixed number of functional

units.

The modulo scheduling algorithm begins by calculating a lower bound on the initiation interval called

the minimum II (MII). Any legal schedule must have an II greater than or equal to the MII, but the MII

is optimistic and may not be feasible. We calculate the MII by taking the maximum of the resource-constrained
MII (resMII) and the recurrence-constrained MII (recMII): MII = max(resMII, recMII).
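Putting the three formulas together, a minimal C sketch of the MII calculation might look as follows (illustrative names; the per-recurrence delays and distances, and the per-resource operation and unit counts, are assumed to be given):

/* Integer ceiling division. */
static unsigned ceil_div(unsigned a, unsigned b)
{
    return (a + b - 1) / b;
}

/* MII lower bound: delay[i]/distance[i] describe each recurrence;
 * ops[i]/fus[i] give operation and functional-unit counts per
 * constrained resource type. */
static unsigned min_ii(const unsigned *delay, const unsigned *distance,
                       unsigned n_rec, const unsigned *ops,
                       const unsigned *fus, unsigned n_res)
{
    unsigned rec_mii = 1, res_mii = 1, t;
    for (unsigned i = 0; i < n_rec; i++) {          /* recMII */
        t = ceil_div(delay[i], distance[i]);
        if (t > rec_mii) rec_mii = t;
    }
    for (unsigned i = 0; i < n_res; i++) {          /* resMII */
        t = ceil_div(ops[i], fus[i]);
        if (t > res_mii) res_mii = t;
    }
    return rec_mii > res_mii ? rec_mii : res_mii;   /* MII = max(...) */
}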


Figure 5.4: Loop pipelining the code of Figure 5.3 with II=2. (a) Data dependency graph; (b) loop pipeline schedule; (c) loop pipeline datapath.

5.2.3 Background: Loop Pipeline Hardware Generation

We will briefly describe the step after modulo scheduling, which generates the loop pipeline hardware.

Scheduling determines the initiation interval of the pipeline and the start and finish time for each

operation. To illustrate, we will use the code shown in Figure 5.3 as an example, where we assume

memory loads have a latency of two cycles and addition operations have a latency of one cycle. We

show the modulo schedule in Figure 5.4(b). The numbers along the top of Figure 5.4(b) show the

cycle count when an operation is scheduled. Each operation repeats every two cycles after the first

iteration because the initiation interval is two. During the first iteration, we have scheduled the load

with startTime = 1 and finishTime = 3, adder A1 at startTime = finishTime = 3, and adder A2 at startTime = finishTime = 4. The number of pipeline stages is determined by the operation that is scheduled last: pipelineStages = ⌈(lastTime + 1)/II⌉. Here A2 is scheduled last at lastTime = 4, therefore we have three pipeline stages (⌈(4+1)/2⌉). By inspection, the load is in the first pipeline stage,

adder A1 is in the second stage, and adder A2 is in the third stage. At the start of the fourth cycle, the

prologue of the pipeline is done, and we are now in steady state with all pipeline stages active.
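In C, the stage count is a single ceiling division; this is an illustrative helper, and for the example pipeline_stages(4, 2) returns 3.

/* pipelineStages = ceil((lastTime + 1) / II), using integer arithmetic. */
int pipeline_stages(int last_time, int ii) {
    return (last_time + 1 + ii - 1) / ii;
}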

After scheduling, we generate the datapath and control logic for the loop pipeline hardware. The

pipeline datapath is almost identical to the sequential non-pipelined datapath generated by LegUp but

with two differences. First, we need to keep track of the loop induction (index) variable for each pipeline

stage, because each stage will have a different iteration executing. We store the induction variable in

a three stage shift register shown at the bottom of Figure 5.4(c). Second, we add a shift register with

N stages on the output of any operation that is used N pipeline stages later. For example, if the input

to an operation is scheduled to finish two pipeline stages earlier, then two registers are needed. If the

operation inputs are scheduled in the same pipeline stage then no registers are needed. LegUp generates

the minimum number of registers required, based on the pipeline stage in which each operation is scheduled and the stages in which its result is consumed.


Figure 5.5: SDC Modulo Scheduling for II=3. (a) Loop Dependency Graph. (b) Greedy Modulo Scheduling, which incurs a memory port conflict. (c) Optimal Modulo Schedule.

In Figure 5.4(c), the result of the load is used exactly

two cycles later (which matches the memory latency). Therefore, adder A1 can be connected directly to

the memory output. The induction variable is an input to the A2 adder, and since A2 is scheduled in

pipeline stage three, we connect adder A2 to the third register in the induction variable shift register.

The pipeline control logic determines when each functional unit in the datapath should be active.

There are two main control signals required for each pipeline: valid and ii_count. The one-bit valid shift register has lastTime registers, which is four in this case. If the pipeline has valid input data and the loop still has more iterations, then we shift a one into the valid shift register. The ii_count is a counter that repeatedly counts from 0 to II−1 and is only needed if the initiation interval is greater than one. In this case, ii_count is a one-bit counter alternating between zero and one. A datapath functional unit should be active if the valid shift register bit corresponding to the scheduled time (T) is high and ii_count is equal to T mod II. The valid shift register ensures that the inputs are valid for each time slot, and the ii_count counter ensures that each operation is performed only once per pipeline stage. For example, in Figure 5.4(c) the register driven by A1 will be enabled when valid register three is high and ii_count is equal to one (3 mod 2). The register driven by A2 will only be enabled when valid register four is high and ii_count is equal to zero (4 mod 2). Finally, the induction variable shift register only shifts every two cycles, when ii_count is equal to zero.
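The following is a rough cycle-level C model of this control scheme for the running example (II = 2, lastTime = 4). The names mirror the signals described above, but the code is only an illustrative software model of the generated control logic, not LegUp's RTL.

enum { II = 2, LAST_TIME = 4 }; /* values from the example */

static int valid[LAST_TIME + 1]; /* valid[1..LAST_TIME]: one-bit valid shift register */
static int ii_count;             /* repeatedly counts 0 .. II-1 */

/* A functional unit scheduled at time T is active when its valid bit is
 * high and ii_count matches T's slot within the initiation interval. */
int enable(int t) {
    return valid[t] && ii_count == t % II;
}

/* Advance the control state by one clock; start is 1 when valid input
 * data enters the pipeline this cycle (at most once every II cycles). */
void control_step(int start) {
    for (int t = LAST_TIME; t > 1; t--)
        valid[t] = valid[t - 1];
    valid[1] = start;
    ii_count = (ii_count + 1) % II;
}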

5.3 Motivation

5.3.1 Greedy Modulo Scheduling Example

In this section, we will illustrate the prior work, greedy modulo scheduling [Zhang 13], for a loop con-

taining both cross-iteration dependencies that cause recurrences in the loop data flow graph and also

resource constraints. A greedy modulo scheduling algorithm will not achieve an optimal schedule with


the minimum possible initiation interval if it schedules an operation at a particular time step that later turns out to be wrong. Consequently, greedy scheduling is highly dependent on the chosen priority ordering function.

We present the loop data dependency graph given in Figure 5.5(a). We have labeled the operations

A, B, C, and D for convenience. The directed edges in the graph represent data dependencies between

operations and the edge labels indicate the required clock cycle latency between operations. We assume

memory latencies of two cycles for a load and one cycle for a store, and we allow the adder to be chained

with zero latency. The back edge from node D to node A represents a cross-iteration data dependency

with a dependency distance of one (next iteration) labeled in square brackets. The total delay along the

recurrence is three cycles, therefore the recMII is three (⌈delay/distance⌉ = ⌈3/1⌉ = 3). We assume one

memory port giving a resMII of three (⌈#ops/#FU⌉ = ⌈3/1⌉ = 3).

Modulo scheduling specifies that an operation scheduled at time t will be repeated every II clock

cycles. Given resource constraints, we keep track of available resources using a table, where each row

tracks a resource and each column is an available time slot. When we schedule an operation at time

t, we reserve a single time slot in column t mod II of the table and in the appropriate resource row.

Consequently the table is called the modulo reservation table (MRT) and has II time slot columns.

Returning to the example, the minimum II is three and the MRT has three time slots available for the

memory in Figure 5.5(a).
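A minimal C sketch of an MRT for a single resource type with one functional unit is given below; a complete MRT would hold one row per constrained resource type and one slot per functional unit. The type and function names are illustrative.

enum { MAX_II = 64 };

typedef struct {
    int ii;           /* number of columns (the candidate II) */
    int slot[MAX_II]; /* instruction occupying column c, or -1 if free */
} ModuloReservationTable;

void mrt_init(ModuloReservationTable *mrt, int ii) {
    mrt->ii = ii;
    for (int c = 0; c < ii; c++)
        mrt->slot[c] = -1; /* all columns start free */
}

/* Try to reserve the resource at 'time'; the column is time mod II.
 * Returns 1 on success, 0 on a resource conflict. */
int mrt_reserve(ModuloReservationTable *mrt, int inst, int time) {
    int col = time % mrt->ii;
    if (mrt->slot[col] != -1)
        return 0;
    mrt->slot[col] = inst;
    return 1;
}

/* Unschedule an instruction, e.g., during backtracking. */
void mrt_release(ModuloReservationTable *mrt, int inst) {
    for (int c = 0; c < mrt->ii; c++)
        if (mrt->slot[c] == inst)
            mrt->slot[c] = -1;
}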

First, we will attempt to greedily modulo schedule the loop. We will schedule operations prioritized

in order of perturbation, a typical priority function [Zhang 13], which gives precedence to operations that

will most impact the schedule when moved. Therefore, the order of precedence is B (affecting C, D,

and A), followed by A, and then D. First, we schedule B into time step zero, and reserve the single

memory port for that time. Next, we attempt to schedule A into time step zero but the memory port

is occupied by B, so we schedule A into the next time step at time one. After scheduling both loads

we attempt to schedule the store operation in cycle three. But a load (B) is already scheduled in cycle

three causing a memory conflict, as shown in Figure 5.5(b). This schedule is not possible due to our

single-ported memory. In this case, the greedy approach fails to achieve the minimum initiation interval

of three. There is now no feasible place to schedule the store operation due to the recurrence constraint

and the previously scheduled loads. At this point, we must give up and increase the initiation interval to

four and try again. However, we can avoid this suboptimal greedy solution by unscheduling one of the

load operations and backtracking to find the schedule shown in Figure 5.5(c). This schedule is optimal

and achieves the minimum initiation interval of three. Generally, a greedy modulo scheduler is only guaranteed to yield an optimal schedule with an II equal to the minimum II if the loop has (1) only simple recurrence circuits involving a single operation, or (2) every operation pipelined with II=1 (no multi-cycle operations). In all other cases, greedy scheduling may fail to find the optimal solution [Ramak 96].

5.4 Modulo SDC Scheduler

In this section, we describe our novel Modulo SDC Scheduler. We begin with a candidate II based on

the pre-calculated minimum II and increment the II when we fail to find a feasible schedule. Given an

II, we can use SDC-based scheduling (described in Chapter 2) to quickly give us the control step for

every operation in the loop. An advantage we gain from the SDC formulation is the support for operator


chaining and frequency constraints. To support modulo scheduling, we modify the SDC constraints that

specify dependencies between operations by adding an additional term to account for loop recurrences.

For two dependent operations i → j, the constraint becomes:

end_i − start_j ≤ II × distance(i, j)    (5.1)

Here start_j is the starting cycle time of operation j, and end_i is the cycle time when the output of operation i is available. The dependency distance, which is the number of loop iterations separating the dependency, is given by distance(i, j). If there is no loop-carried dependency then the distance is zero and this constraint reduces to a standard SDC data dependency constraint. The loop initiation

interval, II, is fixed for each iteration of the algorithm. We also add SDC timing constraints between

operations to enforce a frequency constraint during scheduling and to prevent excessive chaining from

lowering the desired clock period.
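To illustrate how constraint (5.1) might be emitted, the sketch below walks the dependency edges and adds one difference constraint per edge. The end_var(), start_var(), and add_diff_constraint() helpers stand in for a real SDC/LP-solver interface and are assumptions of this example, not LegUp's API.

typedef struct { int src, dst, distance; } DepEdge;

extern int end_var(int op);    /* SDC variable: cycle when op's output is ready */
extern int start_var(int op);  /* SDC variable: cycle when op starts */
extern void add_diff_constraint(int var_a, int var_b, int k); /* var_a - var_b <= k */

void add_dependency_constraints(const DepEdge *edge, int n, int ii) {
    for (int e = 0; e < n; e++)
        /* end_i - start_j <= II * distance(i, j); a distance of zero
         * yields the ordinary intra-iteration dependency constraint. */
        add_diff_constraint(end_var(edge[e].src), start_var(edge[e].dst),
                            ii * edge[e].distance);
}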

Unfortunately, resource constraints during modulo scheduling cannot be modeled using the SDC-

based linear programming formulation due to the non-linearity of the modulo reservation table. There-

fore, we apply an iterative backtracking approach to legalize the SDC modulo schedule. First, we ignore

all resource constraints and then we incrementally assign each resource-constrained operation to a par-

ticular control step in the schedule, depending on availability in the modulo reservation table (MRT).

In some cases, after fixing one or more resource constrained operations, the schedule will no longer be

feasible, in which case we backtrack by unscheduling the tentatively scheduled operations and resuming

our attempts.

Algorithm 1 Modulo SDC Sched(II, budget)

 1: Schedule without resource constraints to get ASAP times
 2: schedQueue ← all resource constrained instructions
 3: while schedQueue not empty and budget ≥ 0 do
 4:     I ← pop schedQueue
 5:     time ← scheduled time of I from SDC schedule
 6:     if scheduling I at time has no resource conflicts then
 7:         Add SDC constraint: tI = time
 8:         Update modulo reservation table and prevSched for I
 9:     else
10:         Constrain SDC with GE constraint: tI ≥ time + 1
11:         Attempt to solve SDC scheduling problem
12:         if LP solver finds feasible schedule then
13:             Add I to schedQueue
14:         else
15:             Delete new GE constraint
16:             Backtracking(I, time)
17:             Solve the SDC scheduling problem
18:         end if
19:     end if
20:     budget ← budget − 1
21: end while
22: return success if schedQueue is empty otherwise fail

Algorithm 1 gives the pseudocode for our iterative algorithm. The input to this function is the

initiation interval and a budget, which will be described shortly. First, we schedule the loop without


resource constraints and we save the ASAP time for each operation. Next we initialize a queue of

all resource constrained operations. We take the first operation out of the queue, which could be a

priority queue based on height [Ramak 96] or perturbation [Zhang 13] but neither is required due to

backtracking. However, having a good priority function will reduce the execution time of the algorithm.

Next, we check the MRT for resource conflicts at the time step given by the SDC scheduler. In the first

iteration, the SDC time step will be identical to the ASAP time calculated earlier. However, as we add

constraints to the SDC formulation, the SDC time steps may begin to diverge from the ASAP times. If

there are no MRT resource conflicts then we tentatively assign the operation to that time step by adding

an equality constraint to the SDC formulation and we update the MRT and the previous scheduled time

for I (lines 7–8). Otherwise, we try to reschedule with that operation constrained to a greater time

step (lines 10–11). If we find a feasible schedule then we add this instruction back into the queue for

later scheduling (lines 12–13). If we cannot find a feasible schedule (lines 15–17), then we backtrack

by unscheduling one or more already scheduled resource constrained instructions and then schedule the

current instruction. This process is continued until either a legal schedule is discovered with all resource

constrained instructions fixed to a specific time slot, or until a budgeted number of while-loop iterations have occurred, at which point we consider the current fixed II to be infeasible and increment the II. The budget parameter is equal to budgetRatio × numInstructions, where we have observed empirically that budgetRatio = 6 (as was also found by [Ramak 96]) works well to avoid excessive backtracking. If budgetRatio = ∞ then we will backtrack through all possible schedules, guaranteeing that we find the optimal schedule that meets the II constraint; however, if no schedule is feasible for that II constraint then the algorithm will loop forever.
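The II search loop that wraps Algorithm 1 can be summarized in a short C sketch; modulo_sdc_schedule() stands in for Algorithm 1, and all names are illustrative rather than LegUp's actual interface.

extern int modulo_sdc_schedule(int ii, int budget); /* Algorithm 1; nonzero on success */

/* Start from the pre-calculated minimum II and relax on failure. */
int schedule_loop_pipeline(int min_ii, int num_instructions) {
    const int budget_ratio = 6; /* empirical value noted above */
    int ii = min_ii;
    while (!modulo_sdc_schedule(ii, budget_ratio * num_instructions))
        ii++; /* current II deemed infeasible within the budget */
    return ii; /* achieved initiation interval */
}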

Algorithm 2 Backtracking(I, time)

 1: for minTime = ASAP time of I to time do
 2:     SDC schedule with I at minTime ignoring resources
 3:     break if LP solver finds feasible schedule
 4: end for
 5: prevSched ← previous scheduled time for I
 6: if no prevSched or minTime ≥ prevSched then
 7:     evictTime ← minTime
 8: else
 9:     evictTime ← prevSched + 1
10: end if
11: if resource conflict scheduling I at evictTime then
12:     evictInst ← instr. at evictTime mod II in MRT
13:     Remove all SDC constraints for evictInst
14:     Remove evictInst from modulo reservation table
15:     Add evictInst to schedQueue
16: end if
17: if dependency conflict scheduling I at evictTime then
18:     for all S in already scheduled instructions do
19:         Remove all SDC constraints for S
20:         Remove S from modulo reservation table
21:         Add S to schedQueue
22:     end for
23: end if
24: Add SDC constraint: tI = evictTime
25: Update modulo reservation table and prevSched for I


Table 5.1: Algorithm Example (II=3).

Iter | MRT Slot 0/1/2 | SDC Time B/A/D | Sch. Time B/A/D | I | Description
  1  | B / - / -      | 0 / 0 / 2      | 0 / - / -       | B | Sched. tB = 0
  2  | B / - / -      | 0 / 1 / 3      | 0 / - / -       | A | Conflict. tA ≥ 1
  3  | B / A / -      | 0 / 1 / 3      | 0 / 1 / -       | A | Sched. tA = 1
  4  | D / A / -      | 0 / 1 / 3      | - / 1 / 3       | D | Evict B. tD = 3
  5  | D / A / -      | 1 / 1 / 3      | - / 1 / 3       | B | Conflict. tB ≥ 1
  6  | D / B / -      | 1 / 1 / 3      | 1 / - / 3       | B | Evict A. tB = 1
  7  | - / - / A      | 0 / 2 / 4      | - / 2 / -       | A | Evict All. tA = 2
  8  | B / - / A      | 0 / 2 / 4      | 0 / 2 / -       | B | Sched. tB = 0
  9  | B / D / A      | 0 / 2 / 4      | 0 / 2 / 4       | D | Sched. tD = 4

Algorithm 2 gives the pseudocode for our backtracking stage, which takes as input an operation I

to be scheduled at control step time. First, we find a valid time slot while ignoring resource constraints

but considering data dependencies (lines 1–4). Because we ignore resource constraints of the partial

SDC schedule, we will always find a minimum time slot and break out of the loop on line 3. In lines

5–10, we ensure forward progress by storing the previous scheduled time (updated on line 24 or line 8

of Algorithm 1) of each operation to prevent attempting a time step before that point. This prevents

two operations from displacing each other back and forth during backtracking. We remove any resource

conflicts at the candidate scheduling time by unscheduling the tentatively scheduled operations found

in the MRT at that slot (lines 11–15). In some cases, the previous scheduling time pushes forward the

schedule time of an operation such that there is also a data dependency conflict at the candidate time.

In this case, we unschedule all other operations to ensure forward progress and add these operations

back into the queue to be rescheduled (lines 16–22). Finally, we schedule the operation at the new time

step by updating the MRT and previous scheduled time for I, then we add an equality constraint to the

SDC formulation.

5.4.1 Detailed Scheduling Example

In this section, we walk through the exact steps of our scheduling algorithm for the loop data flow graph

previously provided in Figure 5.5(a). We begin Algorithm 1 by performing SDC scheduling without

resource constraints, giving us the ASAP times: tA = 0, tB = 0, tC = 2, tD = 2. Here, we assume

schedQueue is prioritized by perturbation [Zhang 13], giving precedence to operations that will most

impact the schedule when moved—although this is not required. The queue contains B (affecting C, D,

and A), followed by A, and then D. We skip C because adders are not resource constrained. Table 5.1

provides record keeping for the end of each iteration (first column) of the algorithm. The “MRT slot”

column lists the operations reserved in each time slot of the memory MRT, the “SDC Time” column

gives the operation control steps under the current SDC constraints, “Sch. Time” gives the tentatively

scheduled time of each operation (blank if not scheduled), “I” gives the current instruction I, and

“Description” summarizes what occurred during the iteration.

In the first iteration, we pop B off the queue and find no resource constraints at time 0, so we add the

SDC constraint tB = 0 and reserve MRT slot 0. Next iteration, we try to schedule A but find a resource

conflict with B, so we update the SDC with tA ≥ 1 and re-solve the linear program (LP). Next, we

schedule A at time 1 and reserve MRT slot 1. In iteration 4, we try to schedule D at time 3 but MRT slot


0 (3 mod 3 = 0) is unavailable. We constrain tD ≥ 4 and re-solve but the SDC constraints are infeasible

due to the recurrence with A. At this point a greedy algorithm would give up and increment the II, as

shown in Figure 5.5(b). Instead, we call Backtracking(D, 3), where we calculate D's minTime to be 3. Therefore, we evict B from the MRT at slot 0, and we can now schedule D at time 3. Next iteration,

we find a resource conflict scheduling B at time 0, so we add the constraint tB ≥ 1. In iteration 6, we

try B at time 1 but there is still a resource conflict, and tB ≥ 2 is not feasible due to the recurrence. We

call Backtracking(B, 1) and get minTime = 0, but B has already been previously scheduled at time

0, so we schedule B at time 1 and kick out A from the MRT. In iteration 7, we have a resource conflict

scheduling A at time 1 and tA ≥ 2 is infeasible, so we call Backtracking(A, 1). A has been previously scheduled at time 1, so we schedule A at time 2, which conflicts with the recurrence, so we evict all other

operations. The algorithm continues as shown in Table 5.1 until we find a valid modulo schedule for

II = 3 with tA = 2, tB = 0, tD = 4. At this point the SDC scheduled time for operations without

resource constraints is also valid, in this case tC = 4 (the addition). We now have a final schedule with

the optimal II of three, as shown in Figure 5.5(c).

We make an observation for future work: our algorithm only seeks to minimize the initiation interval, not necessarily the number of stages in the pipeline. We have observed that in some cases the scheduled pipeline will have one extra cycle of latency due to the ordering of the operations. We consider this a minor concern compared to the throughput of the pipeline.

5.4.2 Complexity Analysis

Assuming there are n operations in the loop with m SDC constraints (operation dependencies), solving the SDC scheduling problem incrementally has a worst-case time complexity of O(m + n log n) [Ramal 99]. The budget parameter limits the amount of backtracking in the algorithm and is O(n). Each resource-constrained operation has II possible slots in the reservation table, therefore it can be rescheduled up to II times. The overall time complexity of the algorithm is therefore O(n · II · (m + n log n)). In practice, only a few of the n operations have resource constraints and the amount of backtracking rarely reaches the budget parameter limit.

5.5 Loop Recurrence Optimization

Data flow graph transformations have been well-studied in prior work [Iqbal 93, Schla 94, Nicol 91a].

We propose a targeted manner of applying these transformations specific to HLS modulo scheduling.

The goal of these transformations is to reduce the length of loop recurrence cycles in the loop data

dependency graph and improve the achievable initiation interval.

We will first illustrate this concept by describing the impact of an associative transformation on a

loop with a cross-iteration dependency, with the C code given by Figure 5.3. In the loop, the sum

variable in the current loop iteration depends on the sum calculated in the previous iteration.

The loop has two equivalent data dependency graphs due to the associative property of addition: i + (load + sum) and (i + load) + sum. Figure 5.4(a) gives the former data dependency graph, i + (load + sum), which is the default graph generated by LLVM. For this example we assume that operator chaining

is not allowed, that is, every addition takes one cycle to complete. The edge labels in the dependency

graph indicate the number of cycles required between operations. Loop recurrences are indicated by

back edges in the graph, where the dependency distance is given in square brackets; in this case the distance is one (the previous iteration).


Figure 5.6: Restructured loop dependency graph achieves II=1. (a) Data dependency graph. (b) Loop pipeline schedule. (c) Loop pipeline datapath.

Due to the recurrence spanning across two addition operations,

both of which take one cycle to complete, the minimum initiation interval of this loop pipeline is two

cycles. Figure 5.4(b) shows the loop pipeline after scheduling and the corresponding datapath is shown

in Figure 5.4(c).

Alternatively, we can restructure the data dependency graph using the associative property of addition: (i + load) + sum. In this new data dependency graph, shown in Figure 5.6(a), the length

of the recurrence has been reduced and the loop can now be scheduled with an initiation interval of

one. Figure 5.6(b) shows the new schedule for the loop pipeline and the corresponding datapath is

shown in Figure 5.6(c). Due to the improvement in the initiation interval, this new pipeline will have

approximately twice the throughput of the original pipeline in Figure 5.4. Based on this example, we

can conclude that the structure of the data dependency graph is critical for obtaining high performance

loop pipelines.

In this research, we propose restructuring the data dependency graph by applying associativity and

distributivity rules to improve the minimum initiation interval. There is related work in [Nicol 91b],

which presents a method for incrementally reducing the height of an expression tree from O(n) to

O(log n). We propose extending this algorithm to consider restructuring an expression tree for the

benefit of one particular critical path, which we would prefer to incur the least latency.

To illustrate, we will consider a loop that accumulates the sum of seven arrays over all array indices:

sum = sum + a[i] + b[i] + c[i] + d[i] + e[i] + f[i] + g[i]. Figure 5.7(a) shows the default data dependency

graph assuming left-to-right associativity. The dotted lines in the figure indicate control steps after

scheduling, where arrows that cross the dotted line require registers. For this example, we assume that

operator chaining is not allowed, that is, every addition takes one clock cycle to complete. In this


Figure 5.7: Dependency graph restructuring. (a) Original: 7 cycles/iter. (b) Tree Height Reduction: 3 cycles/iter. (c) Restructured: 1 cycle/iter.

Figure 5.8: Incremental Associativity Transformation. (a) (sum + a[i]) + b[i]. (b) sum + (a[i] + b[i]).

case, sum in the current iteration has a loop-carried dependency on the sum calculated in the previous

iteration (a dependency distance of one). The loop recurrence spans across seven addition operations,

having a path delay of seven clock cycles. Therefore the minimum initiation interval of this loop pipeline

is seven cycles (recMII = ⌈7/1⌉ = 7).

The typical approach in HLS is to balance the expression tree. For instance, we could use the tree

height reduction algorithm from [Nicol 91a] to obtain the height balanced tree in Figure 5.7(b). We

have now reduced the path length of all inputs to three cycles improving the minimum initiation interval

to three. While this loop pipeline is more than twice as fast as Figure 5.7(a), the minimum initiation

interval is still constrained by the loop recurrence.

In our proposed approach, we restructure the expression tree to incur the least latency along the loop

recurrence. By targeting loop recurrences, we can focus on improving the minimum initiation interval

and consequently the loop pipeline performance. First, we find all operations in the graph that are

contained within a loop recurrence. To determine all recurrences in a loop’s data dependency graph, we

solve the equivalent problem of finding all elementary cycles in the graph. An elementary cycle is a path

through a graph where the first and last vertices are identical and no other vertex appears twice. All

elementary cycles in a graph can be found in polynomial time [Hawic 08], and each cycle corresponds to a

loop recurrence. If the graph contains multiple recurrences, we rank the recurrences by their respective

impact on the initiation interval. The rank of each recurrence is found by calculating the recMII of

the recurrence in isolation and then ranking the recMII values from high (most critical) to low. Each

operation in a recurrence inherits this ranking.


Table 5.2: Minimum initiation interval of benchmarks for balanced vs. proposed restructuring.

                Balanced Restructuring       Restructuring
Benchmark       recMII  resMII  MII          recMII  resMII  MII
faddtree        26      23      26           13      23      23
adderchain      2       2       2            2       2       2
multipliers     2       2       2            2       2       2
dividers        2       2       2            2       2       2
complex         3       3       3            3       3       3

Table 5.3: Operation and dependency characteristics of each benchmark.

                Operations          Constraints
Benchmark       +/fadd/*/%/[]       +/fadd/*/%/[]    Distance   Total Instr
faddtree        0/21/0/0/22         X/1/X/X/2        1          80
adderchain      40/0/0/0/26         X/X/X/X/2        1          92
multipliers     6/0/2/0/10          X/X/2/X/2        1          30
dividers        11/0/0/4/13         X/X/X/X/2        1          72
complex         16/0/7/2/27         X/X/3/X/2        9          98

Next, we apply transformations incrementally to the graph to reduce the path length of recurrences.

For example, Figure 5.8(a) shows the first two addition operations from the original data dependency

graph of Figure 5.7(a), corresponding to the expression (sum + a[i]) + b[i]. The left operand of the first addition is part of the loop recurrence that we wish to improve. We use associativity to restructure these two operations into an algebraically equivalent expression, sum + (a[i] + b[i]), as shown in Figure 5.8(b). This transformation has reduced the length of the recurrence by one cycle. In general, if we consider additions, an associative transformation involves two two-operand operations that form a recurrence: late = lateParent + earlyParent, and curOp = late + early. Here lateParent and late lie on the critical edges along which the recurrence occurs. In this case, we use the associative property of addition to transform this into: curOp = lateParent + (earlyParent + early). In this new expression, we have removed one addition operation from the recurrence, leaving only lateParent.
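A minimal C sketch of this single rewrite step is shown below, assuming binary addition nodes with a flag marking values that lie on the critical recurrence; the Node type and field names are our own illustration, not LegUp's internal representation.

typedef struct Node {
    struct Node *lhs, *rhs; /* operands of a two-input addition */
    int on_recurrence;      /* set if this value lies on the critical recurrence */
} Node;

/* Rewrite curOp = (lateParent + earlyParent) + early into
 * curOp = lateParent + (earlyParent + early), removing one addition
 * from the recurrence path through lateParent. */
void rotate_off_recurrence(Node *curOp) {
    Node *late = curOp->lhs;  /* assumed: late = lateParent + earlyParent */
    Node *early = curOp->rhs;
    if (!late || !late->on_recurrence)
        return;               /* the left operand is not on the recurrence */
    Node *lateParent = late->lhs;  /* critical operand on the recurrence */
    Node *earlyParent = late->rhs;
    /* Reuse the 'late' node as the new (earlyParent + early) addition. */
    late->lhs = earlyParent;
    late->rhs = early;
    late->on_recurrence = 0;
    curOp->lhs = lateParent;  /* the recurrence now crosses only curOp */
    curOp->rhs = late;
}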

Repeating this associative transformation incrementally, we eventually obtain the restructured data

dependency graph in Figure 5.7(c). In this new graph, instead of balancing the height of the expression

tree, our transformations have actually lengthened some paths in the data dependency graph in order to

shorten the recurrence path. The loop recurrence now consists of only one addition, therefore the new

loop pipeline has an initiation interval of one (assuming no resource constraints). Due to the improvement

in the initiation interval, this new pipeline will have approximately seven times the throughput of the

original loop in Figure 5.7(a) (II reduced from 7 to 1). These transformations are particularly effective

for recurrences containing multi-cycle operations, for example floating point operations, which can cause

long recurrence lengths and are unaffected by operator chaining. We only perform transformations on expressions whose operations are all of the same type (i.e., all integer or all floating point).

5.6 Experimental Study and Results

We experimentally evaluated our approach using five C benchmarks, each containing a loop whose initiation interval is limited by both loop recurrences and resource constraints; all are synthesizable by LegUp and the commercial tool. All benchmarks contain a tree of operations with a recurrence.


Table 5.4: Speed performance results.

                Initiation Interval             Cycles
Benchmark       Comm   Zha    Back   Back+R     Comm   Zha    Back   Back+R
faddtree        36     34     26     23         1539   1439   1128   1045
adderchain      4      3      2      2          372    297    209    209
multipliers     3      3      2      2          292    294    206    207
dividers        4      3      2      2          261    230    152    152
complex         4      5      3      3          454    550    382    382
Geomean         5.9    5.4    3.6    3.5        456    437    309    305
Ratio           1      0.92   0.62   0.6        1      0.96   0.68   0.67

Table 5.2 shows the impact of restructuring on the loop recurrence. The “Balanced Restructuring” column gives

the recurrence minimum II (recMII), the resource MII (resMII), and the combined minimum II (MII) for

the default case of balanced expression tree restructuring. The “Restructuring” column gives the same

metrics but after restructuring, as described in Section 5.5. In Table 5.2, we can see that restructuring

improved the recMII of faddtree from 26 to 13. This occurred because restructuring moved a floating-point addition, which has a latency of 13 cycles, off the loop recurrence, as shown in Figure 5.8. Table 5.3 provides a summary of the properties of each benchmark. The “Operations” column gives the number of

additions, floating point additions, multiplications, divisions, and memory operations in the loop body.

The “Constraints” column gives the constraint on the number of functional units for adders, floating

point adders, multipliers, and memory ports, with an X indicating no constraint. Although we restricted

memories to two ports in these benchmarks, memory is spread across multiple independent block RAMs

that can be accessed in parallel. The “Distance” column gives the dependency distance of the cross-

iteration dependency in the loop. The “Total Instr” column gives the total number of LLVM instructions

in the loop, most of which represent binary operations, to measure the scheduling complexity of each

benchmark.

The benchmarks include golden input and output test vectors, allowing us to synthesize the cir-

cuits with a built-in self-test. We used these test vectors to simulate the circuits in ModelSim and

verify correctness. We targeted the Stratix IV [Stra 10] FPGA (EP4SGX530KH40C2) on Altera’s DE4

board [DE4 10] using Quartus II 11.1SP2 to obtain area and FMax metrics. Quartus timing constraints

were configured to optimize for the highest achievable clock frequency.

We benchmarked against a state-of-the-art commercial HLS tool configured to target a commercial

FPGA similar to Stratix IV. We used the default commercial tool options, which include standard

expression tree balancing. We configured LegUp to use functional units with identical latency as the

commercial tool. We imposed a target clock period constraint of 3ns (333MHz) on both HLS schedulers.

In our study, we consider four scenarios for comparison: (1) A commercial HLS tool (Comm), (2)

Zhang’s recently published greedy modulo SDC scheduler [Zhang 13] (Zha) implemented in LegUp, (3)

Our proposed backtracking SDC modulo scheduler (Back), and (4) Our scheduler combined with data

dependency graph associative restructuring (Back+R). Table 5.4 and Table 5.5 give speed performance

results for these four scenarios. The “Initiation Interval” column is the scheduled II of the loop pipeline.

The “Cycles” column is the total number of cycles required to complete the benchmark. The “FMax”

column provides the FMax of the circuit as reported by Quartus. The “Time” column gives the circuit

wall-clock time: Cycles · (1/FMax). Ratios in the table compare the geometric mean (geomean) of the

column to the respective geomean in the commercial tool. We also summarize the results in Figure 5.9.


Table 5.5: Speed performance results.

                FMax (MHz)                      Time (µs)
Benchmark       Comm   Zha    Back   Back+R     Comm   Zha    Back   Back+R
faddtree        257    229    248    233        5.99   6.28   4.55   4.48
adderchain      270    239    194    234        1.38   1.24   1.08   0.89
multipliers     540    485    545    511        0.54   0.61   0.38   0.41
dividers        236    261    270    270        1.11   0.88   0.56   0.56
complex         269    211    232    232        1.69   2.61   1.65   1.65
Geomean         299    271    277    281        1.53   1.61   1.11   1.09
Ratio           1      0.91   0.93   0.94       1      1.05   0.73   0.71

Figure 5.9: Backtracking SDC modulo scheduling experimental results. (a) LegUp versus prior work [Zhang 13]. (b) LegUp versus commercial HLS tool.


Table 5.6: Area comparison experimental results.

                ALUTs                                Registers
Benchmark       Comm    Zha     Back    Back+R      Comm    Zha     Back    Back+R
faddtree        1,266   1,676   1,629   1,638       2,305   2,175   2,240   2,374
adderchain      857     1,190   1,110   1,108       929     1,857   2,178   2,114
multipliers     77      122     124     108         68      173     193     110
dividers        5,395   8,488   5,495   5,495       9,072   13,771  9,732   9,732
complex         6,551   4,166   4,223   4,223       11,571  14,854  17,732  17,732
Geomean         1,242   1,538   1,391   1,354       1,725   2,698   2,768   2,488
Ratio           1.00    1.24    1.12    1.09        1.00    1.56    1.60    1.44

Table 5.7: Tool runtime (s) comparison.

Benchmark       Comm   Zha    Back   Back+R
faddtree        16     34     9      59
adderchain      6      6      4      10
multipliers     0.2    0.4    0.2    0.2
dividers        6      3      2      6
complex         7      4      4      5
Geomean         3.8    4.0    2.2    5.1
Ratio           1.00   1.04   0.59   1.34

The results show that our backtracking modulo SDC scheduling approach can have a significant

impact on loop pipelines with resource constraints combined with recurrences. Based on our experiments,

the commercial tool appears to use a greedy modulo scheduler for loop pipelining, because its schedules cannot achieve the minimum II for these benchmarks. Consequently, our approach achieved a 38% geomean reduction in II versus the commercial tool. Furthermore, with restructuring we were able to improve this to a 40% reduction in geomean II. We also see that our backtracking approach achieves a 33% geomean II improvement over Zhang's greedy approach.

Backtracking SDC modulo scheduling reduced the geomean cycle count by 32% versus the commercial tool and by 29% versus Zhang. The cycle count improvement was smaller than the reduction in II due to

time spent outside the loop pipeline in these benchmarks. The geomean FMax decreased by 7% versus

the commercial tool when applying our approach, due to better balanced expression restructuring by the

commercial tool. When restructuring along recurrences, we chain fewer operations, causing the geomean

FMax to increase by 2%. Overall, the geomean wall-clock time for these benchmarks was reduced by

32% using backtracking and restructuring versus greedy SDC modulo scheduling. When compared to

the commercial tool, our backtracking and restructuring approach improves geomean wall-clock time

by 29%. This improvement is mainly due to a reduction in II caused by better scheduling of our SDC

modulo scheduler when compared to greedy scheduling.

Table 5.6 gives the area results for the four scenarios. The “ALUTs” and “Registers” columns give the

number of Stratix IV combinational ALUTs and dedicated registers required. Table 5.7 compares the tool runtime in seconds for each algorithm; this includes the entire flow from C to Verilog. Ratios in the

table compare the geometric mean (geomean) of the column to the respective geomean in the commercial

tool. Comparing our approach to the commercial tool in terms of area, the geomean combinational

ALUTs increased by 9% and the geomean registers increased by 60%. The registers increased due to a

lower II in pipelines generated by our approach allowing less register sharing. Comparing our approach


to Zhang’s in terms of area, the geomean combinational ALUTs decreased by 12% and the geomean

registers decreased by 8%.

5.6.1 Runtime Analysis

Now we present a runtime characterization of our new backtracking approach versus the other scheduling

algorithms. First, the runtime results in Table 5.7 show that our backtracking scheduler had 41%

less geomean runtime than the commercial tool but we had a 43% increase in runtime when using

restructuring. Our scheduling algorithm’s runtime is influenced by the number of invocations of the

linear program solver, which is proportional to the number of instructions in the loop being scheduled.

We would like to know the typical range of instructions found in a loop to be pipelined by HLS. The

MediaBench II Video benchmark suite [Fritt] is representative of modern and emerging multimedia DSP

applications (MPEG-4, JPEG-2000, H.264) with applications that typically have extensive instruction

level parallelism. A study of workload characteristics [Fritt] observed that the average instructions per

basic block in MediaBench was 9.4 instructions, with a maximum of 61 instructions in a single basic

block. We performed a characterization of the CHStone [Hara 09] benchmarks and observed the median

instructions per basic block was 4, while the median instructions per loop was 24. Across the CHStone

suite, a single basic block contained a maximum of 378 instructions and a maximum of 805 instructions

in a single loop. Therefore, we will perform runtime analysis of our algorithm for basic blocks of size up

to 1000 instructions.

For this experiment, we used the adderchain benchmark and duplicated the body of the loop N times,

where N ranged from 1 to 12, and added a final summation after the loop. Each additional duplication

introduced another recurrence into the loop pipeline and increased the number of instructions by 79. Figure 5.10 shows the runtime in seconds for each algorithm as the number of instructions in

the loop increases. By default, we solved the SDC problem using a linear programming solver [lpso 14].

We also analysed the runtime taken when efficiently solving the SDC problem incrementally after modi-

fying the constraints as described in [Ramal 99]. The lines marked with “(incremental SDC)” show these

results. However, we observed only a minor runtime difference, leading us to believe that the LP solver

is quite optimized. Here we see that our backtracking algorithm’s runtime increases substantially com-

pared to the commercial tool as the number of instructions grows, but the absolute runtime is still about

1 minute even for 1000 instructions. Although backtracking runtime compares poorly to the commercial

tool, Zhang’s greedy approach is actually no better, because if we fail to schedule for a given II we must

iteratively increment the candidate II and attempt to reschedule again. This iterative process can be

costly in terms of runtime as seen in Figure 5.10. Figure 5.11 provides the final pipeline II achieved

in each case. We observe that the greedy algorithms, both the commercial tool and Zhang, achieve inconsistent pipeline initiation intervals due to the resource constraints and cross-iteration dependencies.

5.7 Summary

This chapter demonstrated that resource constraints and loop recurrences can have a considerable im-

pact on loop pipelining in HLS by increasing the initiation interval of synthesized pipelines. We proposed

a novel backtracking SDC-based modulo scheduling algorithm and a graph restructuring technique for


Figure 5.10: Runtime Characterization For Loop Pipelining Scheduling Algorithms (runtime in seconds, log scale, versus number of instructions; curves: Zhang, Zhang (incremental SDC), Backtracking, Backtracking (incremental SDC), Commercial).

Figure 5.11: Initiation Interval for Loop Pipelining Scheduled in Figure 5.10 (II versus number of instructions; curves: Zhang, Commercial, Backtracking).


expression height reduction to reduce loop recurrence lengths. Our empirical study on a set of benchmarks containing loop pipelines constrained by resources and limited by recurrences shows that our approach achieves a 32% improvement in geomean wall-clock time versus prior work and 29% versus a commercial tool.


Chapter 6

LegUp: Memory Architecture

6.1 Introduction

In this chapter, we describe LegUp’s target memory architecture. Our goal for the memory architecture

is to achieve high circuit performance and minimize FPGA memory block usage while supporting C input

programs that use memory pointers.

The C language targets a computing architecture that has a single address space. In modern computer

architectures, the memory is a large off-chip RAM supported by on-chip caches. But in high-level

synthesis, we have more flexibility and can generate a custom memory architecture for our particular

target application based on its memory use. We will explore various HLS memory architectures in this

chapter.

A common approach in high-level synthesis is to limit the types of memory pointers that are allowed in the input program, for instance, by requiring that each C pointer point to only one array during program execution. HLS users are expected to rewrite their programs to conform to these limitations. In LegUp,

we support generic pointers, meaning that a C pointer can point to any location in memory. This

increases LegUp’s utility by allowing a wider range of input programs. We will discuss later how LegUp

handles pointers that cannot be resolved to a particular memory location at compile time.

In LegUp’s hardware-only (no processor) flow, we partition program memory into distributed on-chip

FPGA block memories that are either placed globally or locally. Each local memory block is directly

connected to the generated circuit datapath at any location where we access that particular memory.

The global memory blocks are connected to the datapath through a single shared dual-port memory

controller. During a load, the memory controller steers the incoming memory address to the appropriate

global memory block and then returns the data from the memory. Local memory blocks can only be

accessed within the hardware module in which they are instantiated, while global memory blocks can be

accessed from any hardware module in the final design. We will describe the design of LegUp’s shared

memory controller and the global memory addressing scheme in Section 6.3.

We will show in Section 6.4 that the shared memory controller can limit performance in certain ways.

By partitioning the memory space into local and global memory we can increase circuit throughput. We

discuss the compiler-time analysis performed by LegUp to partition C arrays from the input program

into either global and local memory blocks in hardware. We show empirically that using local memory

blocks in LegUp improves the geomean wall-clock time by 8% compared to only using global memory,


when averaged across the CHStone benchmarks.

In Section 6.5, we investigate grouping arrays from the C program into shared memory blocks instead

of storing each array in a separate memory block. We show in that section that this technique reduces the

number of FPGA memory blocks required in the final design. We also discuss how to reduce addressing

logic required in the circuit datapath by allocating each array to an appropriate place in the shared

memory block. We applied our array grouping approach in an experimental study on the CHStone

benchmark suite and found that the geomean memory implementation bit usage decreased by 27%.

The remainder of this chapter is organized as follows: Section 6.2 presents related work. Section 6.3

gives an overview of LegUp’s memory architecture and describes the shared memory controller. Sec-

tion 6.4 describes local memory blocks and the algorithm we use to partition program memory between

local and global memory. Section 6.5 discusses how we can group arrays into shared RAMs to save

FPGA memory blocks. An experimental study is presented in Section 6.6. Section 6.7 offers a summary.

6.2 Background

6.2.1 Related Work

In HLS, we can often improve the performance of a pipelined loop by partitioning the input or output

arrays into distinct RAMs, or memory banks, allowing for increased memory bandwidth. The Vivado

HLS tool [Xili] includes compiler pragmas supporting various forms of memory partitioning. The user

can manually combine smaller arrays into a single RAM or partition arrays into many RAMs. They can

also reshape an array of many elements into a RAM with larger bitwidth allowing multiple elements to

be accessed in parallel. The HLS study by Cong [Cong 12] optimizes C loop nests using on-chip memory

reuse buffers, which reduce off-chip memory accesses by buffering array elements accessed in prior loop

iterations. They describe loop transformations that reduce the size of these buffers while maintaining

the same circuit performance, reducing on-chip FPGA memory bits by 40% on average.

Many academic HLS tools, such as SPARK [Gupta 03], GAUT [Couss 10], and ROCCC [Villa 10]

only support programs containing C arrays (i.e., int a[]) and disallow all C pointers (i.e., int *a).

In GAUT, pointers are allowed as function parameters, which synthesize into single integer output

ports (they do not point to memory). Other HLS tools like CoDeveloper [Impu] and CatapultC [Caly]

only allow C pointers that point to a single array during program execution, which can be resolved at

compile-time. In CoDeveloper, only arrays can be passed as arguments to functions (not pointers). By

limiting the types of pointers allowable in the input program, these tools can simplify the target memory

architecture; block RAMs are simply connected directly to the final circuit datapath.

Semeria and Micheli [Semer 98] from Stanford demonstrated an HLS approach that supports generic

C pointers to statically defined memory (no dynamic memory). They implemented their approach in the

SUIF C compiler framework [Wilso 94] and use a points-to analysis at compile time to determine the

memory locations each pointer can access. Memory instructions that access multiple memory locations

are implemented using a multiplexer that selects between each possible memory using a minimally sized

pointer (i.e., for two memories use a one bit pointer). The work was extended to support dynamic

memory [Semer 01], where they proposed a 32-bit pointer address encoding scheme consisting of a 16-

bit tag, representing the memory location, concatenated with a 16-bit offset, representing the byte offset

into the location. We describe a similar pointer encoding for LegUp in Section 6.3.2. Semeria only


considered synthesizing a single C function and did not discuss how to handle larger programs that pass

pointers between functions. LegUp does not inline all functions by default for the reasons explained in

Section 3.4.1. Furthermore, they did not target a hybrid architecture where the hardware can access

memory shared with the processor. We have extended their work in this chapter to handle the LegUp

hardware architecture.
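As a small illustration of the tag/offset encoding just described, the block tag and byte offset can be recovered from a 32-bit address with a shift and a mask. The macro names and the 16/16 split follow Semeria's scheme quoted above, and addr denotes the encoded hardware address rather than a host C pointer.

#define TAG_BITS 16
#define PTR_TAG(addr)    ((unsigned)(addr) >> TAG_BITS)              /* selects the memory block */
#define PTR_OFFSET(addr) ((unsigned)(addr) & ((1u << TAG_BITS) - 1)) /* byte offset within it */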

Modern FPGAs have dedicated block RAMs distributed in columns across the chip, resulting in

high on-chip memory bandwidth [Under 04]. On-chip memory bandwidth is a key advantage of FP-

GAs in comparison to other compute platforms such as GPUs and CPUs [Fu 11]. On Cyclone II

FPGAs [Cycl 04], these dedicated RAMs consist of 4-Kb memory blocks (M4Ks). Stratix IV FP-

GAs [Stra 10] contain dedicated 9-Kb memory blocks (M9Ks) and 144-Kb memory blocks (M144Ks).

The work by Cong [Cong 06c] proposed an HLS target memory architecture that uses FPGA block

RAMs to implement a distributed set of register files as an alternative to discrete registers. This approach

has the advantage of reducing register usage, minimizing multiplexing and improving clock frequency.

They reported a 2X logic area reduction on average and a clock period improvement of 7.8%. The work

in [Zhu 01] proposed an HLS temporal memory optimization to detect arrays within a procedure with

non-overlapping lifetimes, which can then be stored in the same memory block. The approach uses a

flow-sensitive pointer analysis combined with memory lifetime analysis and selects memories to share

using a graph coloring algorithm.

The work presented in [Pilat 11] is based on the PandA HLS framework and bears similarity to

the local memory scheme we describe in Section 6.4. They propose a distributed memory architecture

where each array is stored in a RAM local to the function where the array is defined. Instead of a

shared memory controller, all distributed RAMs from other functions lower in the program call graph

are connected together using a daisy-chain network. During a load/store from a function, they use

this network to access memory stored in other functions by checking every distributed memory on the

chain, one at a time, every cycle. The function stalls while waiting for the memory request. A possible

drawback of this approach is that the daisy-chain network can require a long clock cycle latency for

accessing memories from other functions, depending on the number of distributed memories.

The C2H compiler targets a hybrid processor/accelerator architecture with a shared memory between

the processor and accelerators [Santa 07]. They implemented memory accesses from accelerators to the

processor memory as distinct master ports in the Avalon interconnect, allowing multiple memory accesses

to be performed in parallel.

6.2.2 Alias and Points-to Analysis

A key challenge when we try to generate a suitable memory hierarchy in high-level synthesis is to reason

about C memory pointers using only static compiler analysis. Alias analysis, or memory disambiguation,

is the problem of determining when two pointers refer to overlapping memory locations. An alias occurs

during program execution when two or more pointers refer to the same memory location. Points-to

analysis, a closely related problem, determines which memory locations a pointer can reference. In this

chapter, we will be more concerned with points-to analysis. Solving the alias and points-to analysis

problems requires us to know the values of all pointers at any state in the program, which makes this an

undecidable problem in general [Landi 92].

Points-to analysis algorithms are categorized by flow-sensitivity and context-sensitivity. An approach

is flow-sensitive if the control flow within the given procedure is used during analysis, while a flow-insensitive approach ignores instruction execution order.


const int imem[44] = { 0x8fa40000, 0x27a50004, ...
char output[2048] = {0};

int main() {
    int reg[32], dmem[64];

Figure 6.1: C snippet showing an example of global and function-scoped memory variables.

Context-sensitive analysis considers the possible

calling contexts of a procedure during analysis. Points-to analysis can either be confined to a single

function, called intraprocedural, or applied to the whole program, called interprocedural. A survey of

popular points-to analysis techniques is given in [Hind 00, Hind 01]. Points-to analysis algorithms have

varying levels of accuracy and may be overly conservative, but for programs without dynamic memory,

recursion, and function pointers, most pointers are resolvable at compile-time [Semer 98].

The compiler community has developed fast interprocedural flow-insensitive and context-insensitive

algorithms. Andersen [Ander 94] described the most accurate of these approaches, which formulates

the points-to analysis problem as a set of inclusion constraints for each program variable that are then

solved iteratively. Steensgaard [Steen 96] presented a less accurate points-to analysis, which used a set of

type constraints modeling program memory locations that can be solved in linear-time. In this chapter,

we use the points-to analysis described by Hardekopf [Harde 07], which speeds up Andersen’s approach

by detecting and removing cycles that can occur in the inclusion constraints graph. We could improve

the accuracy of our points-to analysis by using a context-sensitive, flow-sensitive algorithm such as the

symbolic pointer analysis described by Zhu [Zhu 02], which was shown to be scalable to larger programs

by using binary decision diagrams [Zhu 04].

To aid pointer analysis, the C language (since C99) includes a pointer type qualifier keyword, restrict,

allowing the user to assert that memory accesses by the pointer do not alias with any memory accesses

by other pointers.
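A minimal sketch of restrict usage (ours, not drawn from LegUp's benchmarks): the qualifier lets the compiler assume the two arrays never overlap, so loads from src and stores to dst may be scheduled independently.

#include <stddef.h>

/* restrict: the programmer asserts dst and src never alias */
void scale(int *restrict dst, const int *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 2 * src[i];
}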

6.3 LegUp Memory Architecture

This section gives an overview of HLS memory challenges and describes the LegUp memory architecture.

We focus here on targeting a pure hardware flow, without a processor. We also assume that LegUp does

not allow dynamic memory (malloc/free), recursion, or function pointers. Throughout this chapter, we focus

only on the on-chip memory architecture.

6.3.1 Overview

A traditional C compiler models memory as a single contiguous block of byte-addressable memory that

is shared across all functions in the program. Global static variables are placed in a region of this shared

memory and constants in a read-only section of memory. Function-scoped memory is stored in a region

of memory called the stack, which grows after each function invocation and shrinks after each function

return. Thus stack memory is re-used dynamically by different functions.

As discussed in Chapter 3, LegUp is built on the LLVM compiler and operates on the LLVM inter-

mediate representation. The C code shown in Figure 6.1 gives an example of global memory, constants,

and stack memory with the corresponding representation in LLVM given by Figure 6.2. We have two


@imem = constant [44 x i32] [i32 -1885077504, i32 665124868, ...]
@output = global [2048 x i8] zeroinitializer

define i32 @main() {
  %reg = alloca [32 x i32]
  %dmem = alloca [64 x i32]

Figure 6.2: LLVM intermediate representation example showing global and stack memory.

Figure 6.3: HLS memory binding and memory interconnection network.

global variables: @imem is a constant integer array with 44 elements, @output is a global byte array

holding 2048 elements. Local to the main function, we have two variables allocated on the stack: %reg

is a 32 element integer array and %dmem is a 64 element integer array.

Unlike a C compiler with a set target architecture, in HLS when we compile a C program into a digital

circuit we are free to determine the best memory architecture for our specific application. Conceptually,

generating a memory architecture in HLS requires us to determine the number of memory blocks and

then connect them to the rest of the circuit datapath, as shown in Figure 6.3. Here we distinguish

between programmer-defined C variables (reg, dmem, imem, output) called program memory and the

physical memory consisting of block RAMs on the FPGA in the final synthesized circuit. First, we bind

program memory to physical memory, using one of several approaches:

1. One global shared physical RAM that stores all program memory.

2. One-to-one mapping: each program variable has a physical RAM.

3. A compromise, where multiple program memories can be assigned to multiple available physical

memories.

The first binding approach is analogous to traditional computer architecture, with a single large con-

tiguous shared memory. In LegUp, we use the second binding approach, where each C array is stored

in a separate FPGA on-chip dual-port block RAM, with a data width that matches the data width of

the array. Constant arrays are specifically instantiated as read-only memories to enable later FPGA

synthesis optimizations. We show the third binding approach in Figure 6.3 where two program variables

reg and dmem are placed in the same RAM, in non-overlapping regions. We explore this approach in

Section 6.5.


31        23 22                    0
+-----------+-----------------------+
| 9-bit Tag |    23-bit Address     |
+-----------+-----------------------+

Figure 6.4: LegUp 32-bit pointer address encoding.

In LegUp, we use a one-to-one mapping because: 1) the hardware implementation is simpler, 2) the

circuit is easier to understand and debug with each array from the C program in a separate RAM rather

than buried in a large RAM, and 3) in the future we could allow many distributed RAMs to be accessed

in parallel, increasing memory bandwidth.

As Figure 6.3 shows, we also must connect the physical memory to the circuit datapath, where store

and load operations occur, using a particular interconnection network. Ideally, we would simply use

point-to-point wires to connect each memory operation in the circuit datapath to an exact physical

RAM. But if the C input program contains pointers that can point to multiple arrays then we will need

multiplexing logic between the circuit datapath and the memory blocks. Furthermore, we will require

accurate points-to analysis to determine exactly which arrays each load or store instruction can access

at compile time.

In LegUp, we partition all program memory into the following categories: global memory blocks

accessed through a shared memory controller, or local memory blocks instantiated in a particular hard-

ware module. We use local memory when we can statically determine that two conditions are met: 1)

the array is only accessed in one function, and 2) each pointer only points to a single local array (no

multiplexing is required). All other program memory is stored in global memory blocks. LegUp has no

semi-local memory blocks for pointers that can point to multiple arrays; these arrays are all placed in

global memory (this is future work). In the case of the hybrid flow, we have a third memory category:

processor memory allocated by the processor and accessed by the hardware accelerators using the on-

chip memory cache in Figure 3.2. We will discuss global memory in Section 6.3.2 and local memory in

Section 6.4. The details of accessing processor memory are beyond the scope of this dissertation.

6.3.2 Global Memory Blocks

All global memory blocks in LegUp’s target memory architecture are accessed through a shared memory

controller. The shared memory controller makes global memory accesses easy to understand and reason

about, and reduces memory signals passed between hardware modules. We assign a unique number

called a tag to each program variable and associated physical memory, which we use for steering logic

in the memory controller. All LegUp addresses are 32 bits wide and are composed of the array tag and

the array address as shown in Figure 6.4. The upper 9 bits of the memory address are reserved for the tag, allowing up to 510 distinct C arrays (512 tag values minus the two reserved ones). A tag value of zero is reserved for NULL pointers and a tag value of one is reserved for the processor memory address space. The 23-bit address allows up to an 8MB byte-addressable memory for each array. Because the lower bits are used for the array address,

this scheme allows pointer arithmetic—incrementing the address will not affect the tag bits. We could

increase the pointer size to 64 bits in the future if we need more addressable memory space.

Continuing the example C code shown in Figure 6.1, we assume that LegUp places each of these

arrays in global memory: imem, output, reg, dmem. LegUp will instantiate one 44-word 32-bit ROM for

the constant imem, a 2048-word 8-bit RAM for output, a 32-word 32-bit RAM for reg, and a 64-word

Page 96: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

Chapter 6. LegUp: Memory Architecture 86

[Figure: block diagram. The 32-bit imem ROM and 8-bit output RAM each have en, addr, and dataout ports clocked by mem_clk; tag comparators (= 2, = 3) on the 9-bit mem_tag gate mem_enable and mem_write_en; the 23-bit mem_addr is right-shifted by two for imem; and a multiplexer selected by prev_tag steers the chosen memory's data onto the 32-bit mem_data_out.]

Figure 6.5: LegUp memory controller block diagram.

32-bit RAM for dmem. As an example, we could assign the unique tags 2, 3, 4, 5 to imem, output, reg,

and dmem respectively. Given these 9-bit tag assignments, the (byte) address of imem[10] would be

(01000028)16, the address of output[15] would be (0180000F )16, and the address of dmem[5] would be

(02800014)16.
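A short C sketch (ours; the macro name is hypothetical) makes the encoding concrete, using the tag values from the running example and assuming 4-byte ints:

#include <stdint.h>

/* Tag occupies bits [31:23]; the 23-bit byte offset occupies bits [22:0]. */
#define LEGUP_ADDR(tag, byte_offset) \
    (((uint32_t)(tag) << 23) | (uint32_t)(byte_offset))

uint32_t imem_10   = LEGUP_ADDR(2, 10 * 4); /* 0x01000028 */
uint32_t output_15 = LEGUP_ADDR(3, 15);     /* 0x0180000F */
uint32_t dmem_5    = LEGUP_ADDR(5, 5 * 4);  /* 0x02800014 */

Because the tag and offset occupy disjoint bit ranges, the OR here is equivalent to a concatenation, which is why pointer arithmetic on the offset never disturbs the tag.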

Figure 6.5 shows the LegUp memory controller block diagram for two of these integer arrays: imem

and output. Extending this controller to access all four arrays is straightforward. In the figure, we

show the FPGA block ROM for the constant 32-bit imem array and the block RAM for 8-bit output

array, which are both instances of Altera’s ALTSYNCRAM memory megafunction (inferred from Verilog).

LegUp automatically generates a memory initialization file for each memory block. Here the memory

controller checks the 9-bit tag (mem_tag) of the incoming memory address to determine which memory

block to enable and disables all the other RAMs. All addresses accessing the integer array imem will be

aligned to 4-byte word boundaries, therefore the bottom two address bits will be zero. We account for

this by right shifting the incoming 23-bit array address by two and passing the resulting 21-bit array

index into the address port of the imem ROM block. The output array is 8 bits wide, so no address shifting

is needed. We assume that the latency of the FPGA block RAMs is one cycle, therefore we must use

the previous tag to select the memory block that is outputting the data requested in the previous cycle.

We register the output of the memory controller to improve the circuit clock frequency, as the steering

output multiplexer can become large. Consequently, all load/stores in LegUp have a two cycle latency

by default. If the tag equals zero (NULL) then we ignore the memory access. If the tag equals one

(processor memory) then we redirect the memory request to the on-chip memory cache over the Avalon

interconnect (Figure 3.2), which we do not show here. The memory controller output width is equal to

the maximum data width of any global memory block.
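The following behavioural C model (a sketch of ours, not code generated by LegUp) summarizes the controller's load path for the two arrays of Figure 6.5; output_arr stands in for the C array named output:

#include <stdint.h>

extern const uint32_t imem[44];   /* tag 2: 32-bit ROM            */
extern uint8_t output_arr[2048];  /* tag 3: 8-bit RAM ("output")  */

uint32_t mem_load(uint32_t addr) {
    uint32_t tag    = addr >> 23;       /* upper 9 bits select the RAM */
    uint32_t offset = addr & 0x7FFFFF;  /* lower 23 bits: byte offset  */
    switch (tag) {
    case 2:  return imem[offset >> 2];  /* word-aligned: drop two bits */
    case 3:  return output_arr[offset]; /* byte array: no shifting     */
    default: return 0;                  /* tag 0 (NULL): access ignored */
    }
}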

LegUp can support generic pointers, or pointers that can point to any array, by using the pointer

tag bits in the memory controller to resolve pointer ambiguity during circuit runtime. This allows a

wider range of input C programs, particularly those which are not amenable to points-to analysis at

compile time. For the hybrid flow, we can use the tag bits to determine when a pointer should access


the processor memory.

For simplicity, we have been assuming that the memory controller is single-ported, allowing one

memory load or store per cycle. To add another memory port to the controller, we duplicate all the

previously described input and output signals for the second port. We instantiate dual-ported RAMs

that are available in the FPGA fabric for each array and connect each port of the memory controller

to the corresponding port on the RAMs. We also duplicate the output multiplexer and previous tag

register for the second output port of the memory controller. With these modifications to the memory

controller, we can support two global memory accesses from the circuit datapath every cycle. During

HLS scheduling we enforce this constraint by only scheduling up to two load or store instructions in any

cycle. During HLS binding, we connect memory accesses in the circuit datapath to one of the available

memory controller ports during that cycle.

A limitation of LegUp’s memory controller is that we can only access array elements; the memory

contents are not completely byte-addressable. For instance, we could not directly load the most signifi-

cant byte of an integer stored in an array. We could allow this in the future by adding additional input

and output multiplexing to steer individual bytes.

Multi-dimensional arrays are handled like single-dimensional arrays, with elements stored in row-

major order, the same convention used by C. LegUp uses a different memory controller to handle C

structs (implemented by Victor Zhang). In a struct, the individual elements can have non-uniform size.

We handle this with an additional 2-bit mem_size input port to the memory controller, which indicates

the size of the struct element we are accessing: 0 for an 8-bit element, 1 for 16 bits, 2 for 32 bits, or 3

for 64 bits. We instantiate a 64-bit wide block RAM for each struct. When writing to a struct element

that is smaller than 64 bits, we must use the mem_addr and mem_size to activate the appropriate input

byte enables of the RAM. When reading a struct element that is smaller than 64 bits, we must use the

mem_addr and mem_size to steer the correct bits of the 64-bit struct memory output to the lowermost bits of mem_data_out using a multiplexer. The full details of this memory controller are outside the

scope of this dissertation.
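A sketch (ours) of the byte-enable computation just described, assuming struct elements are naturally aligned within the 64-bit word:

#include <stdint.h>

/* mem_size encodes the element width: 0 = 8 bits, 1 = 16, 2 = 32, 3 = 64. */
uint8_t byte_enable(uint32_t mem_addr, unsigned mem_size) {
    unsigned width = 1u << mem_size;  /* element width in bytes        */
    unsigned lane  = mem_addr & 7u;   /* byte offset in the 64-bit word */
    return (uint8_t)(((1u << width) - 1u) << lane); /* one bit per byte */
}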

In Figure 6.6, we continue our previous example from Figure 6.1 and show the memory controller

steering logic when loading array element output[13]. At the top of the figure, we show the pointer to

output[13] with a tag equal to three and an array address of 13. The address 13 is fed into the address

port of the output RAM and the imem ROM (after right shifting by two). The memory controller write

enable is equal to zero (load) and the enable bit is true. The tag is checked to enable the output RAM

and disable the ROM holding imem. Using the memory controller output multiplexer, we select the

output of the output RAM.

There are a few performance issues with the described memory controller. First, it contains a wide

multiplexer, which grows linearly with the number of arrays declared in the C code, assuming that all

arrays are placed in global memory. We will mitigate this issue by placing arrays in local memory

blocks as described in Section 6.4. Second, the memory controller allows only two memory accesses per

clock cycle, which can be a performance bottleneck if we have significant parallelism in the program.

In particular, loop pipelining is severely impacted by this memory constraint, so distinct local RAMs

should be used if possible. Instruction level parallelism is also limited by this memory constraint.

Hypothetically, since program variables are stored in separate physical memory blocks, we could allow

more than two memory accesses per cycle. But adding more ports to the global memory controller

would require additional multiplexing. Furthermore, during HLS scheduling we would need to ensure


[Figure: the controller of Figure 6.5 while loading output[13]. The 32-bit pointer decodes to Tag = 3 (bits 31-23) and Address = 13 (bits 22-0); the tag comparison enables the output RAM (enable 1, write enable 0) and disables the imem ROM, and prev_tag selects the output RAM's data through the output multiplexer.]

Figure 6.6: LegUp shared memory controller when loading array element output[13].

[Figure: (a) Program call graph: main calls a and b; a calls c and d; b calls c. (b) Hardware module instantiation hierarchy: a and b are instantiated inside main, with a copy of c inside each of a and b, and d inside a.]

Figure 6.7: Relationship between program call graph and hardware module instantiations.

that during any particular cycle, we do not access a program variable more than twice (each RAM is

only dual-port). This would require either accurate points-to analysis or stalling logic in the memory

controller to detect and handle conflicts. Also the benefit would primarily depend on the increase in

instruction level parallelism enabled by allowing more memory accesses per cycle. We leave this for

future work.

Another performance limitation of global memory relates to the program call graph. In LegUp, each

C function corresponds to a hardware module in the final synthesized circuit. The hardware module

instantiation hierarchy is dependent on the program call graph, with modules corresponding to functions

lower in the call graph of the program being instantiated deeper in the module hierarchy of the circuit.

For instance, assume we have a program with the call graph shown by Figure 6.7(a), where the main

function calls two other functions: a and b. In LegUp, the hardware modules will be instantiated

according to the hierarchy shown in Figure 6.7(b), where the modules corresponding to a and b have been

instantiated inside the main module. The module c is instantiated twice, because the corresponding

function is called from two distinct functions, a and b, in the program. Since we do not allow recursion, the call graph is acyclic and the module instantiation hierarchy always forms a tree (functions called from multiple parents, like c, are duplicated).


[Figure: the datapaths of modules main, a, b, c, and d. Each level of the hierarchy multiplexes the memory address of its own datapath with those of its submodules, and the selected address propagates up to the memory controller instantiated in main.]

Figure 6.8: Multiplexing required for the memory address at each level of the module hierarchy.

In the pure hardware flow, the shared memory controller is always instantiated in the main hardware

module. We further assume that while the circuit is operating, only one hardware module is active at any

one time. Figure 6.8 shows the multiplexing required for the memory address at each level of hierarchy

for the example already described in Figure 6.7. For example, if we read from an array in the datapath

of hardware module d then we must pass the array address up to module a, which then must pass this

address up to the main module, where the memory controller is instantiated. Module main contains a

32-bit wide 3-to-1 memory address multiplexer, to handle the three possibilities: either the datapath in

main, a, or b is active. Modules a and b contain a 3-to-1 and 2-to-1 32-bit multiplexer respectively. As

we get further down in the module hierarchy, there is more multiplexing to get to the memory controller.

For a program with a deep call graph, this multiplexing can be detrimental to circuit clock frequency.

6.4 Local Memory Blocks

In this section, we describe how LegUp stores some program arrays in distinct RAMs, called local

memories, that are instantiated locally within a particular hardware module. These local memories

can alleviate some of the performance drawbacks of global memories that we discussed in the previous

section.

For local memories of a function, we instantiate a physical RAM within the module corresponding

to the function, and connect any memory accesses that refer to the memory location directly to the

corresponding local RAM. Global-scoped variables in the C program may still be categorized as local

memory if they are only used in one function. All constant arrays are categorized as local memory, which

we handle in LegUp by duplicating a local ROM inside all functions that access the constant.

We continue our example from Figure 6.1 and assume that the arrays reg and dmem are partitioned

into local memory and output is stored in global memory (we ignore array imem). We show the synthe-

sized datapath of a function that accesses these arrays in Figure 6.9. In the figure, we only show the

memory address wires for simplicity. On the right side of the block diagram, we have shown two local

FPGA block RAMs containing the reg and dmem arrays. Blocks in the leftmost column of the figure


[Figure: a function datapath containing the accesses Load reg[2], Store reg[7], Load reg[6], Store dmem[3], Load dmem[5], Store output[8], and Store output[9]. Multiplexers selected by the finite state machine drive the address ports of the local reg and dmem RAMs, while the output accesses send the 32-bit addresses (01800008)16 and (01800009)16 to the global memory controller.]

Figure 6.9: Local and global memory addressing logic within the hardware module datapath.

denote locations in the datapath that either load or store array elements. For simplicity, we assume that

the exact memory accesses are known at compile time, for instance the first load accesses reg[2]. In

general, we may not know the element until runtime. For example, if the first load had accessed the ele-

ment reg[i] then the datapath would be the same, but with the constant 2 replaced by the wire i. The first three accesses

in the figure load and store from the reg array. Therefore, we need a 3-to-1 multiplexer to determine

the memory address of the local reg block RAM. The select line of the multiplexer is controlled by the

circuit’s finite state machine, which compares the current state to the scheduled state of the memory

access. Arrays in local memory do not need tags; the address of reg[2] is simply two. There are only

two accesses to the dmem array in the datapath, so we only need a 2-to-1 multiplexer in front of the

dmem RAM address port. We also note that the datapath could access both the dmem and the reg block

RAMs in parallel. The output array was assigned to global memory. Therefore, we have a tag assigned

to output which we assume is equal to three. Given this tag assignment, the address of output[8] and

output[9] would be (01800008)16 and (01800009)16 respectively. The size of the memory address port

multiplexer scales linearly with the number of memory accesses to that array in the function. For global

memories, the multiplexer scales with the number of accesses to any global array, which may include

several arrays in the function. If we have many memory accesses in a function then these multiplexers

can be on the critical path. In the future, we could explore pipelining these multiplexers.

Local memory reduces the number of physical memories accessed by the shared memory controller,

which shrinks the size of the output multiplexer in Figure 6.5. Accesses to local memory require an

address bitwidth that depends on the size of the local RAM, in contrast to global memories which all

require 32-bit addresses passed to the memory controller. Local RAMs have no output multiplexing or

output register, therefore they have only one cycle of latency compared to the two cycle latency required

by the shared memory controller. This lower latency reduces the number of cycles spent on local memory accesses. During FPGA placement, these local RAMs can be placed physically close to the datapath operations that use them, which can lower routing delays on the FPGA device.

A key motivation for local memories is they can improve performance by allowing arrays to be

Page 101: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

Chapter 6. LegUp: Memory Architecture 91

accessed in parallel, which increases memory bandwidth. We already showed some of the benefits of local

memories in Chapter 5, where loop pipelining required local memories to achieve good performance. The

loop pipelining experimental study presented in Section 5.6 used local memories within the benchmarks

to allow independent memory blocks to be accessed in parallel, achieving higher pipeline throughput.

Algorithm 3 MemoryPartition()

1:  pointsToSet ← points-to set found by points-to analysis of the program
2:  for each load/store instruction loadstore in the program do
3:      address ← array address accessed by loadstore
4:      function ← function containing loadstore
5:      memories ← set of arrays pointed-to by address in pointsToSet
6:      for each array in memories do
7:          continue if array is already marked global
8:          if array is constant then
9:              Mark array as local to function
10:         else if number of arrays in memories > 1 then
11:             Mark array as global
12:         else if array is already marked local to any function ≠ function then
13:             Mark array as global
14:         else
15:             Mark array as local to function
16:         end if
17:     end for
18: end for

We now describe LegUp’s partitioning algorithm that decides whether each program array is placed

in local or global memory. We show the pseudocode in Algorithm 3. On line 1, we use a points-to

analysis [Harde 07] implemented in LLVM by Silva [Silva 13], which we enhanced to perform accurately

for LegUp’s benchmarks. This returns a points-to set that contains the set of all memory locations

pointed to by each address variable in the program. We treat a C array as one memory location,

assuming that individual array elements are not distinguished. On lines 2–18, we loop over all LLVM

load and store instructions in the program and retrieve the memory address accessed by the instruction

(line 3). We retrieve the set of all arrays that could be accessed by the current instruction by looking up

the address in the points-to set (line 5). We then loop over every array that this instruction can point to,

categorizing them as local or global (lines 6–17). During this algorithm, an array can either be marked

as global or local memory. Marking an array as global memory overrides any prior assignment to local

memory (line 7). Each local array must be used exclusively in one function unless the array is a constant,

in which case we mark the constant array local to all functions where the array is accessed (lines 8–9).

We only allow pointers to access local arrays if they only point to a single local array. Therefore, we

detect when pointers can point to multiple arrays, and mark all of these arrays global (lines 10–11).

If the array was used in more than one function, we mark the array as global memory on lines 12–13.

Finally, in all other cases we mark the array as local memory to this particular function on line 15.

6.5 Grouped Memories

LegUp’s existing memory architecture suffers from poor utilization of Stratix IV M9K memory blocks by

global memory. This is because an Altera synchronous on-chip RAM instantiated with only a few words

occupied will be synthesized to use an entire M9K block on the FPGA, or effectively 9Kb of memory.

Figure 6.10: LegUp allocating one physical RAM for each array.

Figure 6.11: Grouping arrays into physical RAMs in LegUp’s shared memory controller.

For example, Figure 6.10 shows the LegUp memory controller for a program with three 32-bit arrays:

A, B, and C. In the figure, we leave out irrelevant details of the memory controller shown in Figure 6.5.

Each array has been allocated to a separate physical block RAM, which will require three distinct M9K

memory blocks on the FPGA device. However, the arrays have only used a total of 192 bits, which

should only require one M9K memory block.

To better utilize FPGA block memory, LegUp can group arrays by bitwidth and store them in one

large physical RAM for each bitwidth size. We can achieve significant M9K savings by packing many

memories into the same M9K block using this grouped memories approach, as shown in Figure 6.11.

Here, we have grouped all three 32-bit arrays from Figure 6.10 together inside a single RAM. We also

show that an array with a different bitwidth, such as the 16-bit array Z, is stored in a different RAM. By

grouping the 32-bit arrays, we have shrunk the number of M9K blocks from three to one. Furthermore,

the multiplexer on the output has fewer inputs, which could improve the circuit clock frequency. The

number of memory bits required has remained unchanged but the number of memory implementation

bits corresponding to M9K block usage has improved substantially.

We group all arrays in the memory controller in up to four RAMs, one for each possible array

bitwidth: 8, 16, 32, 64. LegUp only groups global memory blocks, not local memories. We group

constant memories and non-constant memories in ROMs and RAMs, respectively.


(a) Naive array offsets with minimal wasted space.

(b) Pad the RAM to make both arrays’ offsets divisible by four.

Figure 6.12: Grouped memory array address offsets.

6.5.1 Grouped Memory Allocation

All program memory grouped within a physical RAM shares the same tag. But now the address of each

grouped array must include a byte offset, to account for the location of the array within the larger RAM.

Therefore, a pointer address now consists of: Tag + Offset + Index, where Tag and Offset are usually

known at compile time but Index typically changes at runtime. Unfortunately, this offset will require us

to use more addition operations during address calculations in hardware compared to the non-grouped

approach. As we saw previously, by default an array’s address is calculated by Tag + Index. We know

that the constant Tag has no overlapping bits with the Index, so we can concatenate the two values

without any addition. When grouping memory, if we pack each array directly after the previous array

in the grouped RAM, the Offset will typically have overlapping bits with the Index, so an addition

must be used. For example, Figure 6.12(a) shows the default grouping of two arrays, A and B, in a

single RAM. The array B has an Offset of three, therefore a pointer to B[1] would involve the address

calculation: Offset + Index = (11)2 + (01)2, requiring an addition. These extra additions after grouping add delay to the datapath, degrading the circuit clock frequency and increasing circuit area, effectively negating

the improvement in M9K blocks.

Given an array A, to access the element A[i] from the circuit datapath we would perform the address

calculation: Tag_A + Offset_A + i. However, if we can align the array offset such that Offset_A does not

overlap with the range of i indices for array A, then we can transform this address calculation to use OR

gates instead of addition: Tag_A OR Offset_A OR i. Furthermore, if Tag_A and Offset_A are known at

compile time, then we can calculate the address using simple concatenation without any hardware logic.

If we can ensure that all arrays follow this alignment property, then we can always perform address

calculation without the need to perform addition. We note that this technique will only work in LegUp’s

pure hardware flow because in the hybrid flow we have no control over the alignment of arrays placed

in processor memory.

We need to ensure that for each grouped array, the array address Offset is large enough so that

no bits overlap with any possible values of the array’s Index. The bitwidth N of Index is determined by the number of elements in the array. We increase the Offset until it is a multiple of 2^N (i.e., the

bottom N bits equal zero). For example, in Figure 6.12(a) the possible values of Index for array B range

from zero to two, requiring two bits. Therefore, we should align the address of the array such that the

Offset is a multiple of four (2^2), as shown in Figure 6.12(b). A pointer to B[1] now involves the address


Table 6.1: Naive grouped RAM memory allocation

Array   Start Address (B)   End Address (B)   Memory Size (B)   Alignment (B)
a       0 (0)16             2 (2)16           3 (3)16           4 (4)16
(hole)  3 (3)16             3 (3)16           1 (1)16
b       4 (4)16             6 (6)16           3 (3)16           4 (4)16
(hole)  7 (7)16             511 (1FF)16       505 (1F9)16
c       512 (200)16         799 (31F)16       288 (120)16       512 (200)16
(hole)  800 (320)16         4095 (FFF)16      3296 (CE0)16
d       4096 (1000)16       6151 (1807)16     2056 (808)16      4096 (1000)16
(hole)  6152 (1808)16       6655 (19FF)16     504 (1F8)16
e       6656 (1A00)16       6943 (1B1F)16     288 (120)16       512 (200)16
(hole)  6944 (1B20)16       8191 (1FFF)16     1248 (4E0)16
f       8192 (2000)16       10247 (2807)16    2056 (808)16      4096 (1000)16
(hole)  10248 (2808)16      11263 (2BFF)16    1016 (3F8)16
g       11264 (2C00)16      12287 (2FFF)16    1024 (400)16      1024 (400)16
h       12288 (3000)16      12291 (3003)16    4 (4)16           4 (4)16

calculation: Offset OR Index = (100)2 OR (001)2 = (101)2, which can be performed as a concatenation.

Modifying the array offsets using this technique saves area (fewer adders) and improves circuit clock

frequency. Of course we waste memory bits; for instance, in Figure 6.12(b) the fourth word is empty. However, we still waste fewer memory bits than when we do not group global memory at all.
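A short C sketch (ours) of this alignment rule: the required alignment is the smallest power of two that is at least the array's size in bytes (matching the alignment column of Table 6.1), and each offset is rounded up to that alignment so Tag, Offset, and Index concatenate without adders.

#include <stdint.h>

static uint32_t alignment_for(uint32_t size_bytes) {
    uint32_t a = 1;
    while (a < size_bytes)  /* smallest power of two >= size */
        a <<= 1;
    return a;
}

static uint32_t align_up(uint32_t addr, uint32_t align) {
    return (addr + align - 1) & ~(align - 1); /* round up to a multiple */
}

For instance, for array B of Figure 6.12 (three words), alignment_for(3) returns 4 and align_up(3, 4) returns the padded offset 4.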

To provide a concrete example of grouping arrays into a single RAM, consider eight 32-bit arrays (a–h)

from the jpeg CHStone benchmark varying from 3–2056B in size. We first use a naive memory allocation

for placing each array into the same 32-bit RAM. We add each array to the RAM in program order

(a–h) and we place each array at the next available spot in memory that satisfies the array’s alignment

constraint. We show the final memory allocation in Table 6.1. The first column gives the name of

each array, the second gives the start address of each array in decimal and in hexadecimal as calculated by our naive approach. The third column gives the ending address of each array, the fourth gives the array size in bytes, and the final column shows the required address alignment

of the array, as explained previously. Rows that represent unused memory “holes” in the RAM are

labeled “(hole)”, with the size column showing the amount of unused space. For example, array c is

offset to byte-address (200)16 in the RAM, which has no bits overlapping with the possible array indices:

(0)16–(11F )16. After this memory allocation, the total required size of the RAM is 12,292B with 6,570B

of unused space resulting in a fragmentation ratio (unused/total memory) of 0.53.

Minimizing memory fragmentation requires us to develop a static memory allocator. Our algorithm

will differ from conventional memory allocation techniques [Wilso 95] because each memory must be

aligned to specific address boundaries. Algorithm 4 shows the pseudocode for our memory

allocation. Our approach reduces fragmentation by reordering the arrays in the RAM and by keeping

track of unused holes of memory. We observe that arrays with larger address alignment requirements

have fewer choices of valid offsets in the RAM. Therefore, on line 1 we sort the arrays by descending

alignment size first, and by descending array size to break ties. We keep a list of unused holes in memory

on line 2. Initially the RAM is empty, recall from Figure 6.4 that we have 23 address bits for each RAM,

therefore the first hole spans the addresses from 0 to 2^23 − 1 (line 3). On lines 4–30, we loop over the

sorted arrays and greedily place each array in the first available space (hole) in the RAM. The boolean

variable placed on line 5 will be true when we have found a spot in the RAM for the array. We keep


Algorithm 4 MemoryAllocation(arrays)

1:  Sort arrays in descending order by address alignment, then descending by array size
2:  holes ← empty list
3:  Insert hole from addresses 0 to 2^23 − 1 into holes list
4:  for each array in sorted order do
5:      placed ← false
6:      hole ← first hole from holes list
7:      arrayStart ← 0
8:      while not placed do
9:          while hole start address > arrayStart do
10:             arrayStart ← arrayStart + array.alignment
11:         end while
12:         arrayEnd ← arrayStart + array.size − 1
13:         if arrayEnd ≤ hole end address then
14:             allocate array to start at address arrayStart in memory
15:             placed ← true
16:             start1 ← hole start address
17:             end1 ← arrayStart − 1
18:             start2 ← end1 + 1 + array.size
19:             end2 ← hole end address
20:             if start1 ≤ end1 then
21:                 Insert new hole from addresses start1 to end1 in holes list before hole
22:             end if
23:             if start2 ≤ end2 then
24:                 Insert new hole from addresses start2 to end2 in holes list after hole
25:             end if
26:             Remove hole from holes list
27:         end if
28:         hole ← next hole in holes list
29:     end while
30: end for


Table 6.2: Grouped RAM memory allocation with reduced fragmentation

Array   Start Address (B)   End Address (B)   Memory Size (B)   Alignment (B)
d       0 (0)16             2055 (807)16      2056 (808)16      4096 (1000)16
h       2056 (808)16        2059 (80B)16      4 (4)16           4 (4)16
(hole)  2060 (80C)16        2559 (9FF)16      500 (1F4)16
c       2560 (A00)16        2847 (B1F)16      288 (120)16       512 (200)16
a       2848 (B20)16        2850 (B22)16      3 (3)16           4 (4)16
(hole)  2851 (B23)16        2851 (B23)16      1 (1)16
b       2852 (B24)16        2854 (B26)16      3 (3)16           4 (4)16
(hole)  2855 (B27)16        3071 (BFF)16      217 (D9)16
g       3072 (C00)16        4095 (FFF)16      1024 (400)16      1024 (400)16
f       4096 (1000)16       6151 (1807)16     2056 (808)16      4096 (1000)16
(hole)  6152 (1808)16       6655 (19FF)16     504 (1F8)16
e       6656 (1A00)16       6943 (1B1F)16     288 (120)16       512 (200)16

track of the candidate starting address for the array on line 7. We now start looping over the available

holes on lines 8–29, starting from the first available hole in the RAM (line 6). Recall that if an array has

an alignment of 4096 then the array’s start address must be a multiple of 4096 (i.e., 0, 4096, 8192, 12288).

In the loop on lines 9–11, we increase the candidate array start address by multiples of the alignment

until it is at or past the start of the currently available hole. The array end address is equal to the array

start address plus the array size in bytes (line 12). On line 13, we check if we can fit the array into the

current hole. If the array fits, we have successfully allocated the array at this start address (line 14) and

we can update the placed variable (line 15). Now we must update the current hole in memory, which

has possibly become two new holes on either side of the array we just placed (lines 16–26). We calculate

the new start and end addresses of the new holes (lines 16–19). We add the two new holes on lines 21

and 24 after checking that there was unused space to form a hole (lines 20, 23). We remove the original

outdated hole from the holes list on line 26. If there was not enough room for the array in the current

hole, then we move on to the next available hole on line 28 and iterate, otherwise we move on to the

next array. Our algorithm is O(n^2), where n is the number of arrays to be allocated in the RAM.

Using this algorithm on the arrays we saw in Table 6.1, we present the new memory allocation

shown in Table 6.2. In this example, we first ordered the arrays by descending alignment and size:

d, f, g, c, e, h, a, b. We start by allocating the first array d into the first available location at address

zero. Then we allocate array f , which is placed at the next available unused address at 4096, a multiple

of the required 4096 alignment. The g array can be placed at addresses 0, 1024, 2048, 3072 but the

first three are taken up already by array d, so we place g at the first available memory slot at address

3072. We continue this process for the remaining arrays. After memory allocation, the total size of the

RAM is now 6,944B (about 44% less than before) with 1,222B of unused space, leading to a significantly better

fragmentation ratio of 0.18.

6.6 Experimental Study

We studied the impact of our proposed memory optimizations using the CHStone benchmarks. We

targeted the Stratix IV [Stra 10] FPGA (EP4SGX530KH40C2) on Altera’s DE4 board [DE4 10] using

Quartus II 13.1 to obtain area and FMax metrics. Quartus timing constraints were configured to

optimize for the highest achievable clock frequency. Our experiments were performed across the


CHStone benchmark suite. We considered four scenarios for comparison: 1) placing all program arrays

in global memory (Global); 2) grouping multiple arrays into the same RAM in global memory (Group);

3) partitioning arrays into local memory and global memory (Local); and, 4) combining grouped global

memory with local memories (Both).

Table 6.3 gives speed performance results for these four scenarios. The “Cycles” column is the total

number of cycles required to complete the benchmark. The “FMax” column provides the FMax of the

circuit as reported by Quartus. The “Time” column gives the circuit wall-clock time: Cycles · (1/FMax).

Ratios in the table compare the geometric mean (geomean) of the column to the respective geomean in

the default LegUp flow.
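For example, for adpcm under the Global scenario, Time = 14,444 cycles / 160 MHz ≈ 90.3 µs, matching the first row of the table.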

For local memories, we used static points-to analysis [Harde 07] across the CHStone benchmarks to

identify the set of possible memory locations accessed by each pointer during program execution. For

these benchmarks, all pointers could be resolved to one or more memory locations. Of all the pointers,

94% pointed to a single memory location, 3% pointed to two memory locations, and 2% pointed to four

memory locations. Of all the 140 arrays across the benchmarks, 16% were referenced by a pointer that

could point to more than one memory location, causing these arrays to be placed in global memory.

When we partitioned the arrays into local and global memory across these benchmarks, we were able

to place 57% of arrays in local memory and 43% in global memory. We were unable to place 27% of

the arrays in local memory because those arrays were accessed by multiple functions. For this analysis, we

ignored unsynthesizable memory such as the constant character arrays used as arguments for the printf

function.

We had a choice of either one or two cycles of latency for the local memories. We chose single-cycle

latency for local memories, which resulted in a geomean clock cycle reduction of 9% compared to the

two-cycle latency of global memory. However, using single-cycle local memories did not improve FMax,

due to local block RAMs only having an input register. Without a RAM output register, our datapath

contained more combinational delay when loading from a local memory compared to the shared memory

controller used for global memory. We also ran an experiment with two-cycle latency local memories

that resulted in a 10% higher geomean FMax and a comparable wall-clock time improvement but with a

geomean cycle count within 1% of using only global memory. This result implies that the HLS schedule

(instruction level parallelism) of these benchmarks was not significantly constrained by having only two

global shared memory ports.

Overall geomean wall-clock time performance was improved by 12% by combining local memories with

grouping global memories, with a portion of the overall improvement coming from the local memories

and a portion from grouping global memories.

Table 6.4 gives the area results for the four scenarios. The “Memory Implementation Bits” column

gives the number of Stratix IV M9K blocks required multiplied by 9Kb, added to the number of M144K

blocks multiplied by 144Kb. We use “implementation” to differentiate this metric from the number of

memory bits required by the circuit without consideration for actual FPGA memory block usage. The

“Logic Utilization” column provides the logic utilization reported by Quartus II, which is a metric for

measuring device area by estimating the number of half-ALMs used by the circuit. The “Registers”

column gives the number of Stratix IV dedicated registers required. We found a significant improvement

in the memory implementation bits required during synthesis, which decreased by 27% by grouping

global memories. Furthermore, using local RAMs in isolation also improves memory implementation

bits by 22% because the FPGA synthesis tool is able to optimize the smaller RAMs away (implementing


Table 6.3: Memory architecture performance results

                      Cycles                                 FMax (MHz)               Time (µs)
Benchmark  Global     Group      Local      Both       Global Group Local Both  Global    Group     Local     Both
adpcm      14,444     14,444     13,274     13,274     160    167   141   151   90.3      86.5      94.1      87.9
aes        9,348      9,348      9,196      9,196      112    129   123   149   83.5      72.5      74.8      61.7
blowfish   181,228    181,228    163,928    163,928    177    178   206   223   1,023.9   1,018.1   795.8     735.1
dfadd      776        776        676        676        265    234   211   211   2.9       3.3       3.2       3.2
dfdiv      1,962      1,962      1,918      1,918      237    213   215   215   8.3       9.2       8.9       8.9
dfmul      274        274        234        234        259    252   228   228   1.1       1.1       1.0       1.0
dfsin      59,438     59,438     59,366     59,366     163    169   151   159   364.7     351.7     393.2     373.4
gsm        5,868      5,868      4,774      4,774      153    170   190   190   38.4      34.5      25.1      25.1
jpeg       1,320,580  1,212,984  1,301,836  1,194,242  102    101   98    107   12,946.9  12,009.7  13,284.0  11,161.1
mips       6,234      6,234      5,044      5,044      240    236   195   195   26.0      26.4      25.9      25.9
motion     8,266      8,266      8,264      8,264      180    173   184   191   45.9      47.8      44.9      43.3
sha        209,414    209,414    166,768    166,768    201    209   242   254   1,041.9   1,002.0   689.1     656.6

Geomean    13,871     13,773     12,571     12,481     180    180   176   185   77        76        71        68
Ratio      1.00       0.99       0.91       0.90       1.00   1.00  0.98  1.03  1.00      0.99      0.92      0.88


Table 6.4: Memory architecture area results

             Memory Implementation Bits           Logic Utilization (half-ALMs)      Registers
Benchmark  Global   Group    Local    Both      Global  Group   Local   Both     Global  Group   Local   Both
adpcm      202,752  55,296   193,536  64,512    13,509  12,148  11,396  11,036   10,192  9,968   9,608   9,589
aes        110,592  73,728   110,592  92,160    12,673  11,800  11,693  11,359   9,525   9,330   9,131   9,051
blowfish   294,912  267,264  285,696  285,696   5,726   5,422   5,141   4,952    4,092   4,066   3,944   3,939
dfadd      36,864   64,512   9,216    9,216     5,754   6,074   3,485   3,485    2,679   2,955   2,044   2,044
dfdiv      18,432   18,432   9,216    9,216     9,421   9,327   6,632   6,632    8,305   8,261   5,846   5,846
dfmul      18,432   18,432   9,216    9,216     2,906   2,848   2,091   2,091    1,393   1,370   1,070   1,070
dfsin      18,432   18,432   18,432   18,432    18,734  18,810  18,590  18,616   13,052  12,998  13,004  13,005
gsm        64,512   27,648   55,296   55,296    8,117   7,837   6,994   6,994    4,644   4,616   4,561   4,561
jpeg       976,896  755,712  967,680  635,904   38,200  33,897  34,136  31,854   20,785  19,130  19,379  18,395
mips       36,864   18,432   36,864   36,864    2,697   2,607   2,173   2,173    1,086   1,082   923     923
motion     92,160   73,728   92,160   55,296    11,709  11,410  12,051  11,872   7,904   7,856   7,859   7,835
sha        276,480  165,888  276,480  184,320   4,327   3,791   3,122   2,754    3,294   3,247   2,488   2,478

Geomean    81,846   59,757   63,665   51,190    8,395   8,019   6,918   6,743    5,250   5,211   4,608   4,581
Ratio      1.00     0.73     0.78     0.63      1.00    0.96    0.82    0.80     1.00    0.99    0.88    0.87


them in LUT RAM). When we combined local memories with grouping global memories, we found that

the reduction in memory implementation bits was 37% on average. By combining local memories with

grouping, we found geomean logic utilization decreased by 20% and geomean registers decreased by 13%.

This was due to less multiplexing in the shared memory controller, and also to datapath registers that can

be packed by Quartus into unused registers inside the connected local block RAMs.

6.7 Summary

This chapter presented the memory architecture generated by LegUp. We discussed how global memories

are accessed through a shared memory controller. We also described how LegUp uses separate local

memories for arrays that are only accessed in one particular function. These local memories can offer

greater performance than global memories alone. We also described how to group arrays together in

global memories instead of storing them in separate RAMs, which reduces FPGA block RAM usage. We

found the geomean memory implementation bits required during synthesis decreased by 37% using local

memories and grouped memories when compared to using only global memories. We also found that

combining the two memory approaches improved geomean wall-clock time performance by 12% over the

CHStone benchmark suite.


Chapter 7

Case Study: LegUp vs Hardware Designed by Hand

7.1 Introduction

A major goal of this dissertation is to improve the quality of hardware that can be synthesized automat-

ically from software. In this chapter, we attempt to answer the question: how close is LegUp-generated

hardware to hand-designed hardware? Or perhaps more importantly, can our proposed methodology

generate circuits that can meet realistic FPGA design constraints? As a first step to answering this

question, we present a case study of a Sobel image filter. This filter is typically used in edge detection,

which is important for computer vision applications.

We will first describe the Sobel algorithm and present a straightforward C implementation. We then describe the hand-written hardware implementation of the filter. Next, we provide an implementation

using LegUp, showing the transformations we made on the C code in order to match the performance

of the RTL implementation. We show that LegUp can produce a filter with a wall-clock time within 2%

of the custom implementation, but with about 65% more circuit area. This case study illustrates the

types of code transformations we must currently apply for high-level synthesis to create a circuit of peak

performance. Some of these transformations are non-obvious to a software engineer, which we hope will

motivate future work.

The remainder of this chapter is organized as follows: Section 7.2 presents related work and provides

a description of the Sobel filter. Section 7.3 describes the hand-written RTL implementation of the filter.

We present an equivalent LegUp implementation in Section 7.4. An experimental study comparing the

two approaches is presented in Section 7.5. Section 7.6 offers a summary.

7.2 Background

7.2.1 HLS vs Hand RTL

There have been a few published studies measuring the gap between HLS and custom hand-written RTL

designs. An independent study by BDTI [BDTI] found that AutoESL [Auto] (now called Vivado [Xili])

produced a design that met the throughput requirements for a DQPSK receiver and had a level of


[Figure: an image grid with a three-by-three stencil window, labeled a, b, c / d, e, f / g, h, i, sliding left to right and top to bottom over the input image.]

Figure 7.1: Sobel stencil sliding over input image.

Table 7.1: Sobel Gradient Masks.

        Gx                 Gy
    -1   0   1         1   2   1
    -2   0   2         0   0   0
    -1   0   1        -1  -2  -1

resource utilization comparable to hand-coded RTL on a Spartan-3A Xilinx FPGA device. Similar

work in [Nogue 11] implemented a DSP wireless receiver sphere decoder channel preprocessor using

AutoESL. They found that the high-level synthesis implementation was competitive with the reference

RTL implementation in terms of throughput and resource cost.

The high-level synthesis Blue Book [Finge 10] discusses the style of C coding required to achieve

acceptable performance when synthesizing hardware. Fingeroff shows examples for the Catapult C [Caly]

HLS tool, but the lessons apply equally to other HLS tools, including LegUp. He stresses that a poor C-level

description can lead to a sub-optimal final circuit.

7.2.2 Sobel Filter

The Canny edge detector [Canny 86] is an algorithm to detect edges in an image, which are important

for computer vision applications. In an efficient hardware implementation, we operate on a “stream”

of incoming data, where one new pixel from the image arrives every clock cycle. The edge detection

algorithm consists of five stages, each of which can be run in parallel, with the output pixel from one

stage feeding into the next stage at every clock cycle.

The first stage of Canny edge detection applies a low-pass image filter, using a Gaussian convolution,

to blur the image and remove noise that could cause superfluous edges. Next, we use an edge detection

operator to compute the edge direction. Edge detection operators work by approximating the horizontal

and vertical first derivatives of the intensity of the image in a particular window of the image. These

derivatives indicate the direction of the edge (horizontal, vertical, or two possible diagonals). Here we

chose the Sobel edge detection operator, although there are other operator choices [Abdou 79]. After

this step, we apply another filter to thin the edges. Next, we perform a step to avoid breaking up edges

where the operator output swings slightly above and below the edge threshold. Finally, we remove lone

spurious edges caused by image noise.


1  #define HEIGHT 512
2  #define WIDTH 512
3  for (y = 0; y < HEIGHT; y++) {
4      for (x = 0; x < WIDTH; x++) {
5          if (not_in_bounds(x, y)) continue;
6          x_dir = 0; y_dir = 0;
7          for (xOffset = -1; xOffset <= 1; xOffset++) {
8              for (yOffset = -1; yOffset <= 1; yOffset++) {
9                  pixel = input_image[y+yOffset][x+xOffset];
10                 x_dir += pixel * Gx[1+xOffset][1+yOffset];
11                 y_dir += pixel * Gy[1+xOffset][1+yOffset];
12             }
13         }
14         edge_weight = bound(x_dir) + bound(y_dir);
15         output_image[y][x] = 255 - edge_weight;
16     }
17 }

Figure 7.2: C code for Sobel Filter.

[Figure: the 8-bit pixel_in stream feeds the 3x3 stencil registers a-i; two 512-element shift registers (line buffers) hold the previous two image rows and feed the upper stencil rows.]

Figure 7.3: Sobel hardware line buffers and stencil shift registers.

In this chapter, we focus on the second stage of the edge detector: the Sobel filter. The Sobel filter

is performed by convolution, using a three pixel by three pixel stencil window. This window shifts one

pixel at a time from left to right across the input image, and then shifts one pixel down and continues

from the far left of the image as shown in Figure 7.1. At every position of the stencil, we calculate the

edge value of the middle pixel e, using the adjacent pixels labeled from a to i. Table 7.1 gives the three

by three Gx and Gy gradient masks. These constants are used to approximate the gradient in both x

and y directions at each pixel of the image using the eight neighbouring pixels. The C source code for

the Sobel filter is provided in Figure 7.2. The outer two loops ensure that we visit every pixel in the

image, while ignoring image borders (line 5). The stencil gradient calculation is performed on lines 6–13.

The edge weight is calculated on line 14, where we bound the x and y directions to be from 0 to 255.

Finally, we store the edge value in the output image on line 15. We assume that the image is 512 by 512

pixels.
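
The bodies of the two helper functions used in Figure 7.2 are not shown in the listing. A minimal sketch of what they might look like is given below; we assume that not_in_bounds() simply tests for the one-pixel image border and that bound() clamps the signed gradient sum into the 0 to 255 range of an 8-bit pixel, so the exact bodies in the actual benchmark may differ.

    // Hypothetical helpers (sketch); Figure 7.2 assumes these exist.
    static inline int not_in_bounds(int x, int y) {
        // true when the 3x3 stencil centred at (x, y) falls outside the image
        return (x == 0) || (x == WIDTH - 1) || (y == 0) || (y == HEIGHT - 1);
    }

    static inline unsigned char bound(int dir) {
        // clamp a signed gradient sum into the 0..255 range of an 8-bit pixel
        if (dir < 0)   return 0;
        if (dir > 255) return 255;
        return (unsigned char) dir;
    }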

7.3 Custom Hardware Implementation

An experienced hardware designer, Blair Fort, has provided a hand-coded RTL implementation of the

Sobel filter. The hardware design assumes that a “stream” of image pixels is being fed into the hardware

module at the rate of one pixel every cycle. The hardware implementation stores the previous two rows

of the image in two shift registers, or line buffers, as shown in Figure 7.3. These line buffers can be

Page 114: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

Chapter 7. Case Study: LegUp vs Hardware Designed by Hand 104

 1  inline unsigned char calculate_edge() {
 2      int x_dir = 0, y_dir = 0;
 3      int xOffset, yOffset;
 4      unsigned char edge_weight;
 5
 6      for (yOffset = -1; yOffset <= 1; yOffset++) {
 7          for (xOffset = -1; xOffset <= 1; xOffset++) {
 8              x_dir += stencil[1+yOffset][1+xOffset] * Gx[1+xOffset][1+yOffset];
 9              y_dir += stencil[1+yOffset][1+xOffset] * Gy[1+xOffset][1+yOffset];
10          }
11      }
12
13      edge_weight = bound(x_dir) + bound(y_dir);
14      edge_weight = 255 - edge_weight;
15      return edge_weight;
16  }

Figure 7.4: Calculating the Sobel edge weight using the stencil window.

efficiently implemented in FPGA block RAMs. The incoming image pixel is labeled pixel_in, while

the elements of the stencil are labeled from a to i, which correspond to the labels from the stencil in

Figure 7.1. Using the two 512-pixel line buffers, the hardware can retain the necessary neighbouring

pixels for the stencil to operate, and update this window of pixels as each new input pixel arrives every

cycle.

The edge calculation can now be changed to use the stencil buffer directly, as shown in the updated

C code in Figure 7.4. After sufficient cycles have passed such that the stencil holds valid data, we input

the current nine values of the stencil into a three stage pipeline that calculates the x and y directions,

as described in lines 6–11 in Figure 7.4. The calculation was optimized to use wire-based shifting to

implement the multiply by two, twos complement for negation, and leaving out stencil values that were

not needed (like the middle pixel e). This is followed by a pipeline stage to perform the bounding on

line 13, another stage to perform the addition on line 13, and a final stage for the subtraction on line

14. After this pipeline, the edge weight has been calculated and can be output from the module.

The hardware has some additional control to wait until the line buffers are full before the output

edge data is marked as valid, and additional checks that set the output to zero if we are on the border

of the image. This hardware implementation does not need an explicit FSM. To summarize, in steady

state, this hardware pipeline receives a new pixel every cycle and outputs an edge value every cycle,

along with an output valid bit. The first valid output pixel is after 521 clock cycles, after which point an edge will be output on every cycle for the next 262,144 cycles (512 × 512). The first 513 valid edge output values will be suppressed to zero because we are still on the border of the image. Therefore, the custom RTL circuit has a total cycle count of 262,665 (521 + 262,144), which is 0.2% worse than optimal, where optimal is finishing right after the last pixel (after 262,144 cycles). The latency of 521 cycles is due to time spent filling up the line buffers in the hardware before we can begin computation.

7.4 LegUp Implementation

We now describe the LegUp synthesized circuit, starting from the original C code in Figure 7.2. By

default, compiler optimizations built into LLVM will automatically unroll the innermost 3x3 loop (lines

7–13) and constant propagate the gradient values from Table 7.1 (lines 10–11). During constant propagation,


 1  // line buffer shift registers
 2  unsigned char prev_row[WIDTH] = {0};
 3  int prev_row_index = 0;
 4  unsigned char prev_prev_row[WIDTH] = {0};
 5  int prev_prev_row_index = 0;
 6
 7  // stencil buffer:
 8  //   stencil[0][0], stencil[0][1], stencil[0][2],
 9  //   stencil[1][0], stencil[1][1], stencil[1][2],
10  //   stencil[2][0], stencil[2][1], stencil[2][2]
11  unsigned char stencil[3][3] = {0};
12
13  inline void receive_new_pixel(unsigned char pixel) {
14
15      // shift existing stencil to the left by one
16      stencil[0][0] = stencil[0][1]; stencil[0][1] = stencil[0][2];
17      stencil[1][0] = stencil[1][1]; stencil[1][1] = stencil[1][2];
18      stencil[2][0] = stencil[2][1]; stencil[2][1] = stencil[2][2];
19
20      int prev_row_elem = prev_row[prev_row_index];
21
22      // grab next column (the rightmost column of the sliding stencil)
23      stencil[0][2] = prev_prev_row[prev_prev_row_index];
24      stencil[1][2] = prev_row_elem;
25      stencil[2][2] = pixel;
26
27      // shift in new pixel
28      prev_prev_row[prev_prev_row_index] = prev_row_elem;
29      prev_row[prev_row_index] = pixel;
30
31      // adjust shift register indices
32      prev_row_index++;
33      prev_prev_row_index++;
34
35      prev_row_index = (prev_row_index == WIDTH) ? 0 : prev_row_index;
36      prev_prev_row_index = (prev_prev_row_index == WIDTH) ? 0 : prev_prev_row_index;
37  }

Figure 7.5: C code for the stencil buffer and line buffers synthesized with LegUp.

the LLVM optimizations can detect the zero in the middle of each gradient mask, allowing us to

ignore the middle pixel during the iteration. Consequently, there are eight loads from the input image

required during each outer loop iteration (lines 5–15), one for each pixel adjacent to the current pixel

(line 9). The outer loop will iterate 262,144 (512 × 512) times. We have nine total memory operations

in the loop, eight loads (line 9) and one store (line 15). We found that LegUp schedules the unmodified

code into nine clock cycles per iteration, mainly due to the shared memory having only two ports and a

latency of two cycles. This circuit takes 2,866,207 cycles to complete.

The first transformation we can make is to use a stencil and two line buffers holding the previous two

rows. The C code for this is given in Figure 7.5, with the stencil stored in a nine element two-dimensional

array on line 11. We shift the stencil after each new pixel arrives on lines 16–18, and shift new data

into the stencil on lines 23–25. The two line buffers are implemented on lines 28–36 using arrays:

prev_prev_row and prev_row. We have to manually keep track of an index to indicate where to shift

data into and out of the arrays, with the index rolling over to zero when reaching the end of the array

(lines 35–36). We can now calculate an edge value using only the stencil buffer as shown in Figure 7.4,

without reading from memory eight times every loop iteration. We also enable local memories, so that

we are not constrained by the global memory controller ports.


 1  int sobel_opt(
 2      unsigned char input_image[HEIGHT][WIDTH],
 3      unsigned volatile char output_image[HEIGHT][WIDTH])
 4  {
 5      int i, errors = 0, x_offset = -1, y_offset = -1, start = 0;
 6      unsigned char pixel, edge_weight;
 7      unsigned char *input_image_ptr = (unsigned char *)input_image;
 8      unsigned char *output_image_ptr = (unsigned char *)output_image;
 9
10      loop: for (i = 0; i < (HEIGHT)*(WIDTH); i++) {
11          pixel = *input_image_ptr++;
12
13          receive_new_pixel(pixel);
14
15          edge_weight = calculate_edge();
16
17          // we only want to start calculating the value when
18          // the shift registers are full and the window is valid
19          int check = (i == 512*2+2);
20          x_offset = (check) ? 1 : x_offset;
21          y_offset = (check) ? 1 : y_offset;
22          start = (!start) ? check : start;
23          int border = not_in_bounds(x_offset, y_offset) + !start;
24
25          output_image[y_offset][x_offset] = (border) ? 0 : edge_weight;
26
27          // error checking
28          int incorrect = errors + (edge_weight != golden[y_offset][x_offset]);
29          errors = (border) ? errors : incorrect;
30
31          x_offset++;
32          y_offset = (x_offset == WIDTH-1) ? (y_offset + 1) : y_offset;
33          x_offset = (x_offset == WIDTH-1) ? -1 : x_offset;
34      }
35
36      return errors;
37  }

Figure 7.6: Optimized C code for synthesized Sobel Filter with LegUp.


Table 7.2: Experimental Results.

    Metric           Hand-RTL     LegUp        LegUp/Hand-RTL
    FMax (MHz)       191.46       187.13       0.98
    Cycles           262,665      262,156      1.00
    Time (ms)        1.37         1.40         1.02
    ALUTs            495          813          1.64
    Registers        382          635          1.66
    Memory (bits)    6,299,616    6,299,616    1.00

Next, we need to enable loop pipelining to overlap iterations of the outermost loop of the algorithm. First, we must merge the two outer loops into a single loop and add the label “loop”. We also change the array accesses to use pointer dereferencing to avoid unnecessary index calculations. Second, we must manually remove any control flow in the loop body to allow loop pipelining, because automatic if-conversion is not yet implemented in LegUp. We do this by replacing any if statements with the ternary operator, “? :”. We show the new C code in Figure 7.6, where the incoming pixel is read on line

11, the stencil and line buffers are shifted on line 13, and the edge weight is calculated on line 15. We

have added some new control variables, such as a check for when the stencil has been filled (line 19), a

calculation of the output image x and y indices (lines 20–21 and lines 31–33), whether the output is now

valid (line 22), and an additional check for whether we are on the image border (line 23). If we are on

the border, we output a zero, otherwise we output the edge weight (line 25). We moved error checking

into the loop body to avoid an additional loop afterwards to verify the output, which would take another

262,144 cycles (512 × 512). LegUp does not have any facility for allowing one loop to “stream” into a

successive loop, so we instead manually fuse the loops together. In general, we would remove the error

checking logic after the circuit is verified to avoid wasting silicon area. There is only one load (line 11)

and one store (line 25) in the loop body that go to the dual-ported shared global memory controller.

Therefore, we can pipeline the transformed loop with an initiation interval of one. The circuit now

finishes after 262,156 cycles, only 12 cycles worse than optimal, although we do assume that the output image is already initialized to zero for the very last row of the image (the bottom border). We also set

the LegUp clock period scheduling constraint as low as possible, to ensure a better final circuit FMax.

Some of the C transformations we have just described would be unintuitive to software developers, particularly the use of line buffers and a stencil to reduce memory operations in the loop. The user would have to be familiar with the concept of a pipeline initiation interval and the strategy of reducing memory contention in the loop body. They also have to rewrite all control flow in the loop to use the ternary operator.
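
As a concrete example of this last rewrite (an illustrative sketch, not taken verbatim from Figure 7.6), the border check that would naturally guard the store with an if statement becomes a predicated select that LegUp can map to a multiplexer:

    /* Before: control flow in the loop body blocks pipelining.
       if (not_in_bounds(x_offset, y_offset))
           output_image[y_offset][x_offset] = 0;
       else
           output_image[y_offset][x_offset] = edge_weight;        */

    /* After: if-conversion with the ternary operator; every
       iteration executes the same straight-line path.            */
    int border = not_in_bounds(x_offset, y_offset);
    output_image[y_offset][x_offset] = (border) ? 0 : edge_weight;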

7.5 Experimental Study

We measured the results of the LegUp vs hand RTL case study by targeting the Stratix IV [Stra 10]

FPGA (EP4SGX530KH40C2) on Altera’s DE4 board [DE4 10] using Quartus II 13.1SP2 to obtain area

and FMax metrics. Quartus timing constraints were configured to optimize for the highest achievable

clock frequency. The results are summarized in Table 7.2, with the custom hardware implementation

shown in the first column side-by-side with the LegUp synthesized results in the second column. In the

third column, we compute the ratio of the two results: LegUp / Hand-RTL.

We found that after performing manual code transformations, LegUp produced a circuit with a wall-clock


time within 2% of the hand-written hardware implementation. However, the synthesized circuit

area was larger, consuming 64% more ALUTs and 66% more registers.

We observed a few reasons for this increase in area. First, we use many unnecessary additional registers due to the low LegUp clock period constraint, which prevents scheduling from chaining any operations. We needed the clock period constraint to achieve an acceptable FMax, but this indicates that our timing analysis and estimation need improvement. Also, the pipeline produced by LegUp is needlessly complex and includes a lot of additional array indexing that did not exist in the custom implementation. LegUp also generates a standard FSM even though this application does not need one. Generally, the

custom hardware implementation is very minimalistic and fits in a few short pages of Verilog (296 lines

without comments). In contrast, LegUp’s implementation is 2,238 lines and includes many unnecessary

operations from the LLVM intermediate representation, such as sign extensions and memory indexing.

We expect that LegUp will perform well for hardware modules that are control-heavy and fairly sequential. LegUp will also perform well for pipelined hardware modules, as long as the user can express the pipelining in a single C loop; these could include video, media, and networking applications. LegUp, and all HLS tools, will struggle with highly optimized hardware designs with a known structure, such as a fast Fourier transform butterfly architecture. Also, LegUp cannot generate circuits that require exact cycle-accurate timing behaviour, such as a bus controller.

7.6 Summary

This chapter provided a case study comparing a LegUp synthesized Sobel image filter to an equiva-

lent hand-written version. We described transformations that were performed on the input C code to

synthesize a better final design in LegUp. We showed that LegUp synthesizes a circuit with a wall-clock

time within 2% of the hand-designed circuit but with about 65% more area. We hope this case study

emphasizes the importance of coding style during HLS and motivates the need for better support of

streaming applications in LegUp.


Chapter 8

Conclusions

8.1 Summary and Contributions

With the end of processor frequency scaling and as Moore’s law continues, we need a better approach

to harness the increased number of transistors on a silicon chip. The industry has been moving towards

heterogeneous computing with the use of custom hardware accelerators to achieve higher performance.

Field-programmable gate arrays (FPGAs) are a way to realize these accelerators, especially as FPGAs

continue to grow in size, now including complete system-on-chips with hardened on-chip ARM processors.

However, hardware design remains a difficult process especially for software developers. We present a

new design entry methodology that offers a higher level of abstraction, where the user incrementally

moves their design from a processor to custom hardware. The hardware is automatically synthesized

from software using high-level synthesis. By allowing designers to program in software, they escape

the need to perform tedious cycle-accurate hardware design and they can re-synthesize their design for

future FPGA chips without redesigning the circuit datapath and control.

This dissertation has contributed a robust, open-source high-level synthesis tool, LegUp, to the

research community. Furthermore, we described FPGA-specific high-level synthesis optimizations and

an improved state-of-the-art HLS loop pipelining algorithm:

Chapter 3 discussed our LegUp open-source HLS framework and our proposed design methodology. We consider the LegUp infrastructure itself to be a major contribution of this dissertation. LegUp is implemented using state-of-the-art HLS algorithms and a large test suite that ensures correctness of a greater variety of C programs than in previous academic tools. We compared LegUp's performance to a commercial HLS tool across a suite of benchmark circuits and found that LegUp's geomean wall-clock time was 18% faster, at the cost of 16% higher geomean area. The LegUp software has been downloaded by 1,200 unique researchers from all over the world and is available online at: legup.eecg.utoronto.ca. To the author's knowledge, this represents the first open-source HLS tool ever published targeting FPGAs with comprehensive coverage of the C language and a hybrid processor/accelerator architecture. This work has been published in [Canis 11, Canis 12, Canis 13b]. LegUp has been used for recent HLS research contributions in debugging [Calag 14, Goede 14], circuit area minimization [Gort 13, Klimo 13, Hadji 12a], compiler optimizations [Huang 13, Huang 14], performance optimizations [Hadji 12b], parallel programming [Choi 13], cache architecture [Choi 12a], and even for hardware design contests [Cai 13].

The LegUp project received the Community Award at FPL 2014 for contributions to open-source high-level synthesis.

Chapter 4 presented a new FPGA architecture-specific enhancement to high-level synthesis, where we

multi-pump functional units that can run at higher clock speeds than the surrounding logic to

facilitate additional resource sharing. Our method was shown to be particularly effective for the

ASIC-like DSP blocks on modern FPGAs. We showed that multi-pumping achieves the same DSP

reduction as previous resource sharing approaches, but with better circuit performance: decreasing

circuit speed by only 5% instead of 80%, across a suite of digital signal processing benchmarks.

This work has been published in [Canis 13a].

Chapter 5 described a novel HLS loop pipelining scheduling algorithm using the SDC mathematical

framework. Our approach improves upon prior work by providing better handling of scheduling

constraints using a backtracking mechanism, which can achieve better pipeline throughput. We

also described a method for restructuring associative expressions within loops to reduce recurrence

constraints that can hurt pipeline throughput. We compared our approach to prior work and

a commercial HLS tool and we found a geomean wall-clock time improvement of 32% and 29%

respectively, across a suite of benchmark circuits. This work has been published in [Canis 14].

Chapter 6 discussed LegUp’s synthesized on-chip memory architecture. We described global memory,

which is accessed using a shared memory controller to support arbitrary pointers in the C input

program. We discussed how LegUp partitions program memory into local and global physical

on-chip RAMs by using static points-to analysis techniques. These local memories improve circuit performance compared to using only global memories. We also explored grouping program memories with compatible bitwidths into a shared physical on-chip RAM to better utilize the dedicated RAM blocks available on the FPGA device (M9K blocks on Stratix IV). We applied these

approaches and showed a reduction in the geomean memory implementation bits by 37%, and a

decrease in geomean wall-clock time by 12%, across the CHStone benchmark suite. This work has

been published in [Fort 14].

Chapter 7 presented a case study to measure the gap between HLS-generated and hand-written circuit

implementations using a Sobel image filter kernel from the computer vision domain. We compared

a hand-written streaming hardware implementation to a circuit synthesized with LegUp. We found

that after performing manual code transformations, LegUp produced a circuit with a wall-clock

execution time within 2% of the hand-coded RTL. However, the synthesized circuit area was larger,

consuming 64% more ALUTs and 66% more registers.

8.2 Future Work

Ever since our first release in 2011, LegUp has been designed to enable other academics to explore future

HLS research directions. There remain several active research areas that merit further exploration. In this section, we will discuss extensions to the work presented in this dissertation and suggest improvements specifically for the LegUp framework. We will then suggest other promising areas that were not

covered in the preceding chapters.


8.2.1 Extensions of this Research Work

Our experimental study in Chapter 4 focused on area savings using multi-pumping. We could also

investigate the impact on circuit power and energy, which will depend on the energy consumption of the

DSP blocks operating at a higher clock speed. Future work also could investigate multi-pumping as a

general sharing technique for other types of FPGA functional units. For example, multi-pumping FPGA

block RAMs would offer us more memory ports, as described in [Choi 12a]. Also, we could extend this work to multi-pump the new hardened floating point units in Stratix 10 [Stra 14]. Another idea is

to use multi-pumping to improve circuit performance and throughput, particularly for loop pipelining,

instead of focusing on resource sharing. Finally, applying our approach to slower circuits would allow

quad-pumping the DSPs (with a 4× clock) to achieve even more area savings.

For an extension to the loop pipelining scheduler in Chapter 5, we could study the impact of loop

unrolling when combined with loop pipelining. Loop unrolling has the effect of duplicating the pipeline,

increasing circuit area while also improving throughput, which would be an interesting trade-off to

explore.

Additionally, cross-iteration dependencies involving floating point operations can greatly increase the

pipeline initiation interval because these operations are typically heavily pipelined to achieve acceptable

clock frequencies. In LegUp, floating point addition/subtraction functional units are pipelined to 14

cycles by default. Currently, the functional unit latencies are hard-coded in the allocation step of

LegUp, however, the work in [Ben A 08] investigated a variable pipeline scheduler that determines the

appropriate number of pipeline stages for each functional unit during scheduling. Future work could

involve extending our algorithm to detect critical recurrences in loop pipelines and attempt to lower the

latency of functional units along the recurrence. We could likewise modify the latency of global memory

load operations, which have a two cycle latency in LegUp.

The memory optimizations in Chapter 6 could be extended to include better context and flow sensitive

points-to analysis techniques [Zhu 04] to identify local memories more accurately in the program. We

could also add support for semi-local arrays by instantiating distributed memory controllers throughout

the hardware hierarchy. Semi-local arrays occur whenever pointers point to multiple arrays or when arrays are shared between multiple functions. We could also handle the connections between

each memory access and the associated physical RAM using an interconnect generator like Altera’s

Qsys [Qsys 14].

We should also investigate whether LegUp is better off with a flat module hierarchy instead of a tree

hierarchy as shown in Figure 6.7. In the flat hierarchy, we can instantiate every module once regardless

of the program call graph, which saves area. Modules would be connected together at the top-level, with

modules only connected if the corresponding C functions called each other.

Another avenue for future work is memory partitioning, which involves splitting arrays into registers

or smaller block RAMs. We could use memory access patterns to automatically split arrays into distinct

physical RAMs. This is particularly effective for achieving greater parallelism during loop pipelining.

Furthermore, we could explore the reverse: storing registers in RAMs.

Currently, the shared global memory controller is dual-ported, but we could investigate increasing the number of ports to allow greater memory bandwidth. The HLS scheduler would be responsible for ensuring that we never perform more than two memory accesses to the same global memory block. Another idea is to assign a global memory block to only one port of the memory controller if the memory is only ever accessed once per cycle. This would reduce output multiplexing in the controller.


For grouping memory, we could investigate sharing arrays with different bitwidths in the same RAM.

This would require additional steering logic in the memory controller. Also, the arrays with smaller

bitwidths would waste space when placed in the wider shared RAM. We could also try matching the size

of our physical RAMs to the size of the available FPGA block RAMs. Alternatively, for a small array

we may forgo memory entirely and store the array elements in separate registers. We could also explore

storing variables from mutually exclusive program scopes or mutually exclusive program execution in

the same physical memory using lifetime analysis as discussed in [Zhu 01].

Our case study comparing HLS to hand-written RTL in Chapter 7 could be extended to include other

benchmarks. In fact, we believe the research community would find value in having a new benchmark

suite that contains two sets of equivalent designs: a reference in C code and a hand-written hardware

implementation. This could help researchers focus on closing the gap between HLS and custom design.

Another area for future work is investigating if a LegUp pass could detect these types of image filter

memory dependencies and instantiate the line buffers (Figure 7.3) automatically. Wang [Wang 14] has

investigated this memory partitioning problem by using the polyhedral model to represent the iteration

space of a loop nest and the associated memory accesses and then inferring memory banks for parallel

memory accesses. For more complex loop nests we may require the user to profile the memory access

patterns in advance. We should also add support in LegUp for streaming-style C code using Pthreads to

allow us to rewrite this code in a form more understandable to software developers.

8.2.2 Improvements to LegUp

The long-term vision for LegUp is to fully automate the flow in Figure 3.1, thereby creating a self-accelerating adaptive processor that will profile running applications and automatically synthesize critical

code regions into hardware, improving performance without user intervention. Self-acceleration would

require on-the-fly FPGA synthesis, place, and route of generated hardware accelerators, which can

take minutes or hours for larger circuits. Therefore, this flow would be most suitable for long-running

applications.

In the hardware synthesized by LegUp, we still rely on Altera-specific hardware primitives, such as

floating point cores, dividers, and the Avalon bus. We should move towards using the popular AMBA

AXI (Advanced eXtensible Interface) bus defined by ARM [AMBA 03]. We could also add support for

custom floating point units generated using FloPoCo [De Di 11]. Making these changes would enable us

to support Xilinx FPGAs, which is by far the most requested LegUp feature.

In the future, LegUp should support a hybrid flow that includes an x86 processor connected over the

PCIe bus to LegUp-synthesized hardware accelerators implemented on an FPGA. We have a prototype of

this flow working, but we need more testing to make it robust. This could allow experiments comparing

FPGA hardware accelerators with the performance of commodity GPU cards (typically programmed

with CUDA) for high performance computing workloads. We could also investigate supporting other

language extensions, such as OpenCL [Openc 09], to allow the user to express further parallelism.

LegUp still needs better support for off-chip memory. Currently we only support off-chip memory in

the hybrid flow by using the shared processor cache. Instead, in the pure hardware flow we should be

able to read and write directly to off-chip DDR3 RAM, possibly buffering the result in a FIFO.

We could investigate how to better support streaming applications in LegUp, for instance by inferring

common hardware idioms like line buffers. A limitation of the CHStone benchmarks is that they do not

offer much opportunity for parallelism. We could create a new benchmark suite that focuses on designs


that can be parallelized, with streaming examples that can be pipelined, or applications that use explicit

parallelism with Pthreads and OpenMP.

There are many other smaller improvements that can be made to LegUp, such as: support for

fixed-point integer arithmetic, user specified variable bitwidths, and better timing and area estimation.

8.2.3 Additional High-Level Synthesis Research Directions

We believe that debugging in HLS is an important topic that is still an open research question. We still

do not know the best way for a user to debug a synthesized hardware design, especially a SoC with parts

of the code running on the processor. Visualizing the hardware circuit in an intuitive way is especially

important for winning over software developers.

Another active area of research is loop transformations in HLS using the polyhedral model [Basto 04]. We could start by using the Polly polyhedral LLVM framework [Gross 11] to provide more detailed information about cross-iteration loop dependencies, such as dependence distances based on array indices. Using this framework, we could investigate more complex loop dependencies and perform loop transformations such as loop fusion, loop interchange, and loop skewing that can expose further loop parallelism. These transformations can be applied during loop pipelining to improve parallelism and memory access patterns [Pouch 13], which can greatly increase circuit performance. We believe this is fertile ground for new research.

Another relevant area to explore is power and energy optimizations in HLS. We could explore energy-driven scheduling and FSM generation to minimize toggle rates. Or we could work on power-aware

binding approaches that look at operator power characteristics instead of area.

The processor and accelerators in our target architecture currently share a single clock signal. We

could also investigate the benefits of using multiple clock domains, where each processor and accelerator

can operate at its maximum speed and communication between modules occurs across clock domains.

8.3 Closing Remarks

In summary, we believe that LegUp offers a platform for researchers to continue pressing forward with

high-level synthesis progress. High-level synthesis targeting FPGAs will continue to be an active research

area in the years to come. The optimizations described in this dissertation improve the target circuit

architecture synthesized by HLS tools. However, further improvements, particularly in the area of

pipelining and automatic parallelization may be needed for massive adoption by hardware designers

targeting FPGAs. We are optimistic that the advantages of HLS will continue to win over hardware

designers and raise the productivity of our industry. We hope that researchers will continue to improve

HLS until we reach the holy grail: synthesizing a circuit from software that is just as good (or better)

than a hand-designed implementation.


References

[Abdou 79] I. E. Abdou and W. Pratt. “Quantitative design and evaluation of enhancement/thresholding edge detectors”. Proceedings of the IEEE, Vol. 67, No. 5, pp. 753–763, 1979.

[Adam 74] T. L. Adam, K. M. Chandy, and J. Dickson. “A comparison of list schedules for parallel processing systems”. Communications of the ACM, Vol. 17, No. 12, pp. 685–690, 1974.

[Aldha 11a] M. Aldham, J. Anderson, S. Brown, and A. Canis. “Low-Cost Hardware Profiling of Run-Time and Energy in FPGA Embedded Processors”. In: IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), Santa Monica, CA, 2011.

[Aldha 11b] M. Aldham. Low-Cost Hardware Profiling of Run-Time and Energy in FPGA Soft Processors. PhD thesis, 2011.

[Alte 13] Altera 2012 Annual Report (Form 10-K). http://www.altera.com, 2013.

[AMBA 03] ARM. “AMBA Protocol Specification”. June 2003.

[Ander 94] L. O. Andersen. Program analysis and specialization for the C programming language. PhD thesis, University of Copenhagen, 1994.

[ARM 14] ARM Benchmark Results. http://legup.eecg.utoronto.ca/wiki/doku.php?id=arm_chstone_benchmark_results, 2014.

[ARM 11] ARM. “Cortex-A9 Processor”. http://www.arm.com/products/processors/cortex-a/cortex-a9.php, 2011.

[Aubur 96] M. Aubury, I. Page, G. Randall, J. Saul, and R. Watts. “Handel-C language reference guide”. Computing Laboratory, Oxford University, UK, 1996.

[Auto] AutoESL Design Technologies, Inc. http://www.autoesl.com.

[Aval 10] Avalon Interface Specification. Altera, Corp., San Jose, CA, 2010.

[Basto 04] C. Bastoul, A. Cohen, S. Girbal, S. Sharma, and O. Temam. “Putting polyhedral loop transformations to work”. In: Languages and Compilers for Parallel Computing, pp. 209–225, Springer, 2004.

[BDTI] BDTI Certified Results for the AutoESL AutoPilot High-Level Synthesis Tool. http://www.bdti.com/Resources/BenchmarkResults/HLSTCP/AutoPilot.


[Beida 05] R. Beidas and J. Zhu. “Scalable Interprocedural Register Allocation for High Level Synthesis”. In: Proceedings of the 2005 Asia and South Pacific Design Automation Conference, pp. 511–516, ACM, New York, NY, USA, 2005.

[Ben A 08] Y. Ben-Asher and N. Rotem. “Synthesis for Variable Pipelined Function Units”. In: IEEE International Symposium on System-on-Chip, 2008.

[Betz 97] V. Betz and J. Rose. “VPR: A New Packing, Placement and Routing Tool for FPGA Research”. In: Int'l Workshop on Field Programmable Logic and Applications, pp. 213–222, 1997.

[Betz 99] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for deep-submicron FPGAs. Kluwer Academic Publishers, 1999.

[Blue] Bluespec: The Synthesizable Modeling Company. http://www.bluespec.com.

[Borka 11] S. Borkar and A. A. Chien. “The Future of Microprocessors”. Commun. ACM, Vol. 54, No. 5, pp. 67–77, May 2011.

[Brodt 10] A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik, and O. O. Storaasli. “State-of-the-art in heterogeneous computing”. Scientific Programming, Vol. 18, No. 1, pp. 1–33, 2010.

[Buttl 96] D. Buttlar and J. Farrell. Pthreads programming: A POSIX standard for better multiprocessing. O'Reilly Media, Inc., 1996.

[Cade] Cadence C-to-Silicon Compiler. http://www.cadence.com/products/sd/silicon_compiler.

[Cai 13] J. C. Cai, R. Lian, M. Wang, A. Canis, J. Choi, B. Fort, E. Hart, E. Miao, Y. Zhang, N. Calagar, et al. “From C to Blokus Duo with LegUp high-level synthesis”. In: Field-Programmable Technology (FPT), 2013 International Conference on, pp. 486–489, IEEE, 2013.

[Calag 14] N. Calagar, S. D. Brown, and J. H. Anderson. “Source-level debugging for FPGA high-level synthesis”. In: Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, pp. 1–8, IEEE, 2014.

[Caly] Calypto Catapult. http://calypto.com/en/products/catapult/overview.

[Campo 91] R. Camposano. “Path-based scheduling for synthesis”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 10, No. 1, pp. 85–93, Jan 1991.

[Canis 11] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. Anderson, S. Brown, and T. Czajkowski. “LegUp: high-level synthesis for FPGA-based processor/accelerator systems”. In: ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 33–36, 2011.

[Canis 12] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, T. Czajkowski, S. D. Brown, and J. H. Anderson. “LegUp: An Open Source High-Level Synthesis Tool for FPGA-Based Processor/Accelerator Systems”. ACM Transactions on Embedded Computing Systems (TECS), 2012.


[Canis 13a] A. Canis, J. H. Anderson, and S. D. Brown. “Multi-Pumping for Resource Reduction in FPGA High-Level Synthesis”. In: IEEE Design Automation and Test in Europe Conference (DATE), Grenoble, France, 2013.

[Canis 13b] A. Canis, J. Choi, B. Fort, R. Lian, Q. Huang, N. Calagar, M. Gort, J. J. Qin, M. Aldham, T. Czajkowski, S. Brown, and J. Anderson. “From Software to Accelerators with LegUp High-level Synthesis”. In: Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), pp. 18:1–18:9, IEEE Press, Piscataway, NJ, USA, 2013.

[Canis 14] A. Canis, S. Brown, and J. Anderson. “Modulo SDC Scheduling with Recurrence Minimization in High-Level Synthesis”. In: International Conference on Field-Programmable Logic and Applications, 2014.

[Canny 86] J. Canny. “A computational approach to edge detection”. Pattern Analysis and Machine Intelligence, IEEE Transactions on, No. 6, pp. 679–698, 1986.

[Ceba] CebaTech: The software to silicon company. http://www.cebatech.com.

[Chait 81] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins, and P. W. Markstein. “Register allocation via coloring”. Computer Languages, Vol. 6, No. 1, pp. 47–57, 1981.

[Chen 04] D. Chen and J. Cong. “Register Binding and Port Assignment for Multiplexer Optimization”. In: IEEE/ACM Asia and South Pacific Design Automation Conference, pp. 68–73, 2004.

[Choi 12a] J. Choi, K. Nam, A. Canis, J. Anderson, S. Brown, and T. Czajkowski. “Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems”. In: IEEE Symposium on Field-Programmable Custom Computing Machines, 2012.

[Choi 12b] J. Choi. Enabling Hardware/Software Co-design in High-level Synthesis. PhD thesis, University of Toronto, 2012.

[Choi 13] J. Choi, S. Brown, and J. Anderson. “From software threads to parallel hardware in high-level synthesis for FPGAs”. In: Field-Programmable Technology (FPT), 2013 International Conference on, pp. 270–277, IEEE, 2013.

[Cisc 14] Cisco Visual Networking Index: Global Mobile Data Traffic Forecast Update, 2013–2018. http://www.cisco.com, Feb. 2014.

[Codre 14] L. Codrescu, W. Anderson, S. Venkumanhanti, M. Zeng, E. Plondke, C. Koob, A. Ingle, C. Tabony, and R. Maule. “Hexagon DSP: An Architecture Optimized for Mobile Multimedia and Communications”. Micro, IEEE, Vol. 34, No. 2, pp. 34–43, Mar 2014.

[Cong 06a] J. Cong, Y. Fan, G. Han, W. Jiang, and Z. Zhang. “Platform-Based Behavior-Level and System-Level Synthesis”. In: IEEE Int'l System-on-Chip Conference, pp. 199–202, 2006.


[Cong 06b] J. Cong and Z. Zhang. “An efficient and versatile scheduling algorithm based on SDC formulation”. In: IEEE/ACM Design Automation Conference, pp. 433–438, 2006.

[Cong 06c] J. Cong, Y. Fan, and W. Jiang. “Platform-Based Resource Binding Using a Distributed Register-File Microarchitecture”. San Jose, CA, 2006.

[Cong 08] J. Cong and J. Wei. “Pattern-based behavior synthesis for FPGA resource reduction”. In: Int'l ACM/SIGDA symposium on Field programmable gate arrays, pp. 107–116, 2008.

[Cong 09] J. Cong and Y. Zou. “FPGA-Based Hardware Acceleration of Lithographic Aerial Image Simulation”. ACM Trans. Reconfigurable Technol. Syst., Vol. 2, No. 3, pp. 1–29, 2009.

[Cong 10] J. Cong, B. Liu, and J. Xu. “Coordinated Resource Optimization in Behavioral Synthesis”. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1267–1272, 2010.

[Cong 11] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers, and Z. Zhang. “High-Level Synthesis for FPGAs: From Prototyping to Deployment”. IEEE Tran. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 30, No. 4, pp. 473–491, April 2011.

[Cong 12] J. Cong, P. Zhang, and Y. Zou. “Optimizing memory hierarchy allocation with loop transformations for high-level synthesis”. In: Proceedings of the 49th Annual Design Automation Conference, pp. 1233–1238, ACM, 2012.

[Coolea] J. Cooley. “750 engineer survey on HLS verification issues & power reduction”.

[Cooleb] J. Cooley. “My Cheesy Must See List for DAC 2014”.

[Couss 09] P. Coussy, D. Gajski, M. Meredith, and A. Takach. “An Introduction to High-Level Synthesis”. IEEE Design & Test of Computers, Vol. 26, No. 4, pp. 8–17, Jul. 2009.

[Couss 10] P. Coussy, G. Lhairech-Lebreton, D. Heller, and E. Martin. “GAUT – A Free and Open Source High-Level Synthesis Tool”. In: IEEE Design Automation and Test in Europe – University Booth, 2010.

[Crave 07] S. Craven and P. Athanas. “Examining the viability of FPGA supercomputing”. EURASIP Journal on Embedded Systems, Vol. 2007, No. 1, pp. 13–13, 2007.

[CUDA 07] CUDA: Compute Unified Device Architecture Programming Guide. NVIDIA Corporation, 2007.

[Cycl 04] Cyclone-II Data Sheet. Altera, Corp., San Jose, CA, 2004.

[DDR3 08] DDR3 SDRAM Standard (JESD 79-3B). JEDEC Solid State Technology Assoc., 2008.

[De Di 11] F. De Dinechin and B. Pasca. “Designing custom arithmetic data paths with FloPoCo”. IEEE Design & Test of Computers, Vol. 28, No. 4, pp. 18–27, 2011.

[DE1 13] DE1-SoC Development and Education Board. Altera, Corp., San Jose, CA, 2013.

[DE2 10a] DE2-115 Development Board. Altera, Corp., San Jose, CA, 2010.


[DE2 10b] DE2 Development and Education Board. Altera, Corp., San Jose, CA, 2010.

[DE4 10] DE4 Development Board. Altera, Corp., San Jose, CA, 2010.

[DE5 13] DE5-Net Development Board. Terasic, 2013.

[Denna 07] R. H. Dennard, J. Cai, and A. Kumar. “A perspective on today's scaling challenges and possible future directions”. Solid-State Electronics, Vol. 51, No. 4, pp. 518–525, 2007.

[Denna 74] R. Dennard, F. Gaensslen, V. Rideout, E. Bassous, and A. LeBlanc. “Design of ion-implanted MOSFET's with very small physical dimensions”. Solid-State Circuits, IEEE Journal of, Vol. 9, No. 5, pp. 256–268, Oct 1974.

[Docu 11] Documentation: SOPC Builder. http://www.altera.com/literature/lit-sop.jsp, 2011.

[Ellsw 04] M. Ellsworth. “Chip power density and module cooling technology projections for the current decade”. In: Thermal and Thermomechanical Phenomena in Electronic Systems, 2004. ITHERM '04. The Ninth Intersociety Conference on, pp. 707–708 Vol. 2, June 2004.

[Esmae 13] H. Esmaeilzadeh, E. Blem, R. S. Amant, K. Sankaralingam, and D. Burger. “Power Challenges May End the Multicore Era”. Commun. ACM, Vol. 56, No. 2, pp. 93–102, Feb. 2013.

[eXCi 10] eXCite C to RTL Behavioral Synthesis 4.1(a). Y Explorations (XYI), San Jose, CA, 2010.

[Finge 10] M. Fingeroff. High-Level Synthesis Blue Book. Xlibris Corporation, 2010.

[Fort] Forte Design Systems – The high-level design company. http://www.forteds.com/products/cynthesizer.asp.

[Fort 14] B. Fort, A. Canis, J. Choi, N. Calagar, R. Lian, S. Hadjis, Y. Chen, M. Hall, B. Syrowik, T. Czajkowski, S. Brown, and J. Anderson. “Automating the Design of Processor/Accelerator Embedded Systems with LegUp High-Level Synthesis”. In: IEEE Int'l Conference on Embedded and Ubiquitous Computing (EUC), Milan, Italy, August 2014.

[Fritt] J. E. Fritts, F. W. Steiling, and J. A. Tucek. “MediaBench II Video: Expediting the next generation of video systems research”. In: Electronic Imaging 2005, Int'l Society for Optics and Photonics.

[Fu 11] H. Fu and R. G. Clapp. “Eliminating the Memory Bottleneck: An FPGA-based Solution for 3D Reverse Time Migration”. In: Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 65–74, ACM, New York, NY, USA, 2011.

[Gajsk 00] D. D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, and S. Zhao. “SpecC: Specification Language and Methodology”. 2000.

[Gajsk 92] D. Gajski et al., Editors. High-Level Synthesis – Introduction to Chip and System Design. Kluwer Academic Publishers, 1992.


[Goede 14] J. Goeders and S. J. Wilton. “Effective FPGA debug for high-level synthesis generated circuits”. In: Field Programmable Logic and Applications (FPL), 2014 24th International Conference on, pp. 1–8, IEEE, 2014.

[Gort 13] M. Gort and J. H. Anderson. “Range and Bitmask Analysis for Hardware Optimization in High-Level Synthesis”. In: Asia and South Pacific Design Automation Conference (ASP-DAC), Yokohama, Japan, 2013.

[Gross 11] T. Grosser, H. Zheng, R. Aloor, A. Simbürger, A. Größlinger, and L.-N. Pouchet. “Polly – Polyhedral optimization in LLVM”. In: First International Workshop on Polyhedral Compilation Techniques (IMPACT'11), Chamonix, France, Apr. 2011.

[Gupta 03] S. Gupta, N. Dutt, R. Gupta, and A. Nicolau. “SPARK: A High-Level Synthesis Framework For Applying Parallelizing Compiler Transformations”. In: Proc. Int. Conf. on VLSI Design, 2003.

[Hadji 12a] S. Hadjis, A. Canis, J. Anderson, J. Choi, K. Nam, S. Brown, and T. Czajkowski. “Impact of FPGA Architecture on Resource Sharing in High-Level Synthesis”. In: ACM/SIGDA Int'l Symp. on Field Programmable Gate Arrays, pp. 111–114, 2012.

[Hadji 12b] S. Hadjis, A. Canis, R. Sobue, Y. Hara-Azumi, H. Tomiyama, and J. Anderson. “Profiling-driven multi-cycling in FPGA high-level synthesis”. ACM/IEEE Design Automation and Test in Europe Conference (DATE), 2012.

[Hagog 04] M. Hagog and A. Zaks. “Swing Modulo Scheduling for GCC”. In: Proc. GCC Developers Summit, pp. 55–64, 2004.

[Hara 12] Y. Hara-Azumi and H. Tomiyama. “Clock-constrained simultaneous allocation and binding for multiplexer optimization in high-level synthesis”. In: Asia and South Pacific Design Automation Conference, pp. 251–256, 2012.

[Hara 09] Y. Hara, H. Tomiyama, S. Honda, and H. Takada. “Proposal and Quantitative Analysis of the CHStone Benchmark Program Suite for Practical C-based High-level Synthesis”. Journal of Information Processing, Vol. 17, pp. 242–254, 2009.

[Harde 07] B. Hardekopf and C. Lin. “The ant and the grasshopper: fast and accurate pointer analysis for millions of lines of code”. In: ACM SIGPLAN Notices, pp. 290–299, ACM, 2007.

[Hawic 08] K. A. Hawick and H. A. James. “Enumerating Circuits and Loops in Graphs with Self-Arcs and Multiple-Arcs”. In: FCS, pp. 14–20, 2008.

[Heine 13] A. Heinecke, K. Vaidyanathan, M. Smelyanskiy, A. Kobotov, R. Dubtsov, G. Henry, A. G. Shet, G. Chrysos, and P. Dubey. “Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel Xeon Phi Coprocessor”. In: Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pp. 126–137, IEEE, 2013.

[Hind 00] M. Hind and A. Pioli. “Which Pointer Analysis Should I Use?”. SIGSOFT Softw. Eng. Notes, Vol. 25, No. 5, pp. 113–123, Aug. 2000.


[Hind 01] M. Hind. “Pointer Analysis: Haven't We Solved This Problem Yet?”. In: Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pp. 54–61, ACM, New York, NY, USA, 2001.

[Hisam 00] D. Hisamoto, W.-C. Lee, J. Kedzierski, H. Takeuchi, K. Asano, C. Kuo, E. Anderson, T.-J. King, J. Bokor, and C. Hu. “FinFET – a self-aligned double-gate MOSFET scalable to 20 nm”. Electron Devices, IEEE Transactions on, Vol. 47, No. 12, pp. 2320–2325, 2000.

[Huang 08] S. Huang, A. Hormati, D. Bacon, and R. Rabbah. “Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary”. In: 22nd European conference on Object-Oriented Programming, pp. 76–103, 2008.

[Huang 13] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, S. Brown, and J. Anderson. “The effect of compiler optimizations on high-level synthesis for FPGAs”. In: IEEE Int'l Symposium on Field-Programmable Custom Computing Machines (FCCM), Seattle, WA, 2013.

[Huang 14] Q. Huang, R. Lian, A. Canis, J. Choi, R. Xi, S. Brown, and J. Anderson. “The effect of compiler optimizations on high-level synthesis-generated hardware”. ACM Transactions on Reconfigurable Technology and Systems (TRETS), Apr. 2014.

[Huang 90] C. Huang, Y. Che, Y. Lin, and Y. Hsu. “Data Path Allocation Based on Bipartite Weighted Matching”. In: Design Automation Conference, pp. 499–504, 1990.

[Hwang 91] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu. “A formal approach to the scheduling problem in high level synthesis”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 10, No. 4, pp. 464–475, 1991.

[Impu] Impulse CoDeveloper – Impulse accelerated technologies. http://www.impulseaccelerated.com.

[Inte 13] Interactive Presentation on Key Trend For Advanced Technologies and Role of SOI. International Business Strategies, Inc., October 2013.

[Iqbal 93] Z. Iqbal, M. Potkonjak, S. Dey, and A. Parker. “Critical path minimization using retiming and algebraic speed-up”. In: DAC, 1993.

[Jasro 04] K. Jasrotia and J. Zhu. “Stacked FSMD: a power efficient micro-architecture for high level synthesis”. In: Quality Electronic Design, 2004. Proceedings. 5th International Symposium on, pp. 425–430, 2004.

[Jiang 08] W. Jiang, Z. Zhang, M. Potkonjak, and J. Cong. “Scheduling with Integer Time Budgeting for Low-power Optimization”. In: Proceedings of the 2008 Asia and South Pacific Design Automation Conference, pp. 22–27, IEEE Computer Society Press, Los Alamitos, CA, USA, 2008.

[Jones 91] R. B. Jones and V. H. Allan. “Software Pipelining: An Evaluation of Enhanced Pipelining”. In: Proceedings of the 24th Annual International Symposium on Microarchitecture, pp. 82–92, ACM, New York, NY, USA, 1991.


[Klimo 13] A. Klimovic and J. H. Anderson. “Bitwidth-optimized hardware accelerators with software fallback”. In: Field-Programmable Technology (FPT), 2013 International Conference on, pp. 136–143, IEEE, 2013.

[Ku 88] D. C. Ku and G. De Micheli. “Hardware C – a language for hardware design”. Tech. Rep., DTIC Document, 1988.

[Kuhn 10] H. Kuhn. “The Hungarian Method for the Assignment Problem”. In: 50 Years of Integer Programming 1958–2008, pp. 29–47, Springer, 2010.

[Lam 06] M. Lam, R. Sethi, J. Ullman, and A. Aho. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 2006.

[Lam 88] M. Lam. “Software pipelining: An effective scheduling technique for VLIW machines”. In: ACM Sigplan Notices, pp. 318–328, ACM, 1988.

[Landi 92] W. Landi. “Undecidability of Static Analysis”. ACM Lett. Program. Lang. Syst., Vol. 1, No. 4, pp. 323–337, Dec. 1992.

[Lattn 04] C. Lattner and V. Adve. “LLVM: A compilation framework for lifelong program analysis & transformation”. In: IEEE CGO, pp. 75–86, http://www.llvm.org, 2004.

[Lempe 11] O. Lempel. “2nd generation Intel Core processor family: Intel Core i7, i5 and i3”. In: Hot Chips, 2011.

[List 14] List of semiconductor fabrication plants. http://en.wikipedia.org/wiki/List_of_semiconductor_fabrication_plants, 2014.

[lpso 14] “lp_solve LP Solver”. http://lpsolve.sourceforge.net/5.5/, 2014.

[Luu 09] J. Luu, K. Redmond, W. Lo, P. Chow, L. Lilge, and J. Rose. “FPGA-based Monte Carlo Computation of Light Absorption for Photodynamic Cancer Therapy”. In: IEEE Symposium on Field-Programmable Custom Computing Machines, pp. 157–164, 2009.

[Luu 14a] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk, M. Nasr, S. Wang, T. Liu, N. Ahmed, et al. “VTR 7.0: next generation architecture and CAD system for FPGAs”. ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol. 7, No. 2, p. 6, 2014.

[Luu 14b] J. Luu, J. Goeders, M. Wainberg, A. Somerville, T. Yu, K. Nasartschuk, M. Nasr, S. Wang, T. Liu, N. Ahmed, et al. “VTR 7.0: next generation architecture and CAD system for FPGAs”. ACM Transactions on Reconfigurable Technology and Systems (TRETS), Vol. 7, No. 2, p. 6, 2014.

[Mahlk 01] S. Mahlke, R. Ravindran, M. Schlansker, and R. Schreiber. “Bitwidth Cognizant Architecture Synthesis of Custom Hardware Accelerators”. In: IEEE Trans. on Comput. Embed. Syst., 2001.

[Mahlk 92] S. A. Mahlke, D. C. Lin, W. Y. Chen, et al. “Effective compiler support for predicated execution using the hyperblock”. In: ACM SIGMICRO, pp. 45–54, IEEE Computer Society Press, 1992.

Page 132: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

References 122

[McClu 85] E. J. McCluskey. “Built-in self-test techniques”. Design & Test of Computers, IEEE,

Vol. 2, No. 2, pp. 21–28, 1985.

[McFar 88] M. C. McFarland, A. C. Parker, and R. Camposano. “Tutorial on High-level Synthesis”.

In: Proceedings of the 25th ACM/IEEE Design Automation Conference, pp. 330–336,

IEEE Computer Society Press, Los Alamitos, CA, USA, 1988.

[McNai 03] C. McNairy and D. Soltis. “Itanium 2 processor microarchitecture”. Micro, IEEE, Vol. 23,

No. 2, pp. 44–55, 2003.

[Micro 14] MicroBlaze. “MicroBlaze Soft Processor Core”. 2014.

[Mishc 06] A. Mishchenko, S. Chatterjee, and R. Brayton. “DAG-aware AIG rewriting: A fresh look

at combinational logic synthesis”. In: ACM/IEEE Design Automation Conf., pp. 532–536,

2006.

[Moore 65] G. E. Moore et al. “Cramming more components onto integrated circuits”. 1965.

[Nane 12] R. Nane, V. Sima, B. Olivier, R. Meeuws, Y. Yankova, and K. Bertels. “DWARV 2.0: A

CoSy-based C-to-VHDL hardware compiler”. In: Field Programmable Logic and Applica-

tions (FPL), 2012 22nd International Conference on, pp. 619–622, IEEE, 2012.

[Nicol 85] A. Nicolau. “Percolation scheduling: A parallel compilation technique”. Tech. Rep., Cornell

University, 1985.

[Nicol 91a] A. Nicolau and R. Potasmann. “Incremental tree height reduction for high level synthesis”.

In: DAC, 1991.

[Nicol 91b] A. Nicolau and R. Potasmann. “Incremental tree height reduction for high-level synthesis”.

In: ACM/IEEE Design Automation Conference, pp. 770–774, ACM, 1991.

[Nios 09a] Nios II C2H Compiler User Guide. Altera, Corp., San Jose, CA, 2009.

[Nios 09b] I. Nios. “Processor Reference Handbook”. 2009.

[Nogue 11] J. Noguera, S. Neuendorffer, S. Haastregt, J. Barba, K. Vissers, and C. Dick. “Implemen-

tation of sphere decoder for MIMO-OFDM on FPGAs using high-level synthesis tools”.

Analog Integrated Circuits and Signal Processing, Vol. 69, No. 2-3, pp. 119–129, 2011.

[Occu 10] Occupational Outlook Handbook 2010-2011 Edition. United States Bureau of Labor Statis-

tics, 2010.

[Open] OpenCL for Altera FPGAs. http://www.altera.com/products/software/opencl/ opencl-

index.html.

[Openc 09] K. Opencl and A. Munshi. “The OpenCL Specification Version: 1.0 Document Revision:

48”. 2009.

[Pangr 87] B. M. Pangrle and D. D. Gajski. “Design tools for intelligent silicon compilation”.

Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 6,

No. 6, pp. 1098–1112, 1987.

Page 133: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

References 123

[Pangr 91] B. Pangrle. “On the Complexity of Connectivity Binding”. IEEE Tran. on Computer-Aided

Design, Vol. 10, No. 11, November 1991.

[Pauli 89] P. Paulin and J. Knight. “Force-directed scheduling for the behavioral synthesis of ASICs”.

Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 8,

No. 6, pp. 661–679, Jun 1989.

[Pilat 11] C. Pilato, F. Ferrandi, and D. Sciuto. “A design methodology to implement memory ac-

cesses in High-Level Synthesis”. In: Hardware/Software Codesign and System Synthesis

(CODES+ISSS), 2011 Proceedings of the 9th International Conference on, pp. 49–58, Oct

2011.

[Pilat 12] C. Pilato and F. Ferrandi. “Bambu: A Free Framework for the High-Level Synthesis of

Complex Applications”. DATE, 2012.

[Potas 90] R. Potasman, J. Lis, A. Nicolau, and D. Gajski. “Percolation based synthesis”. In: Design

Automation Conference, 1990. Proceedings., 27th ACM/IEEE, pp. 444–449, IEEE, 1990.

[Pothi 10] N. Pothineni, P. Brisk, P. Ienne, A. Kumar, and K. Paul. “A high-level synthesis flow

for custom instruction set extensions for application-specific processors”. In: ACM/IEEE

Asia and South Pacific Design Automation Conference, pp. 707–712, 2010.

[Pouch 13] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong. “Polyhedral-based Data Reuse Opti-

mization for Configurable Computing”. In: Proceedings of the ACM/SIGDA International

Symposium on Field Programmable Gate Arrays, pp. 29–38, ACM, New York, NY, USA,

2013.

[Pozzi 06] L. Pozzi, K. Atasu, and P. Ienne. “Exact and approximate algorithms for the extension of

embedded processor instruction sets”. IEEE Transactions on Computer-Aided Design of

Integrated Circuits and Systems, Vol. 25, No. 7, pp. 1209–1229, 2006.

[Putna 08] A. Putnam, D. Bennett, E. Dellinger, J. Mason, P. Sundararajan, and S. Eggers. “CHiMPS:

A C-level compilation flow for hybrid CPU-FPGA architectures”. In: IEEE Int’l Conf. on

Field Programmable Logic and Applications, pp. 173–178, 2008.

[Qsys 14] Qsys interconnect. http://www.altera.com/products/ip/qsys/, 2014.

[Quar 14] Quartus II. Altera, Corp., San Jose, CA, 2014.

[Quinn 15] P. J. Quinn. “Silicon Innovation Exploiting Moore Scaling and More than Moore Technol-

ogy”. In: High-Performance AD and DA Converters, IC Design in Scaled Technologies,

and Time-Domain Signal Processing, pp. 213–232, Springer, 2015.

[Ramak 96] B. Ramakrishna Rau. “Iterative Modulo Scheduling”. The International Journal of Parallel

Processing, Vol. 24, No. 1, Feb 1996.

[Ramal 99] G. Ramalingam, J. Song, L. Joskowicz, and R. E. Miller. “Solving systems of difference

constraints incrementally”. Algorithmica, 1999.

Page 134: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

References 124

[Rau 81] B. R. Rau and C. D. Glaeser. “Some Scheduling Techniques and an Easily Schedulable

Horizontal Architecture for High Performance Scientific Computing”. SIGMICRO Newsl.,

Vol. 12, No. 4, pp. 183–198, Dec. 1981.

[Resha 05] M. Reshadi and D. Gajski. “A cycle-accurate compilation algorithm for custom pipelined

datapaths”. In: Proceedings of the 3rd IEEE/ACM/IFIP international conference on

Hardware/software codesign and system synthesis, pp. 21–26, ACM, 2005.

[Rosen 05] L. Rosen. Open source licensing. Prentice Hall, 2005.

[Rupno 11] K. Rupnow, Y. Liang, Y. Li, D. Min, M. Do, and D. Chen. “High level synthesis of stereo

matching: Productivity, performance, and software constraints”. In: Field-Programmable

Technology (FPT), 2011 International Conference on, pp. 1–8, Dec 2011.

[Santa 07] A. Santa Cruz. “Automated Generation of Hardware Accelerators From Standard C”. 2007.

[Schla 94] M. S. Schlansker and V. Kathail. “Acceleration of First and Higher Order Recurrences on

Processors with ILP”. In: Work. on Lang. & Comp. for Par. Comp., 1994.

[Schre 02] R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. R. Rau, D. Cronquist, and M. Sivaraman.

“PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators”. Journal

of VLSI signal processing systems for signal, image and video technology, Vol. 31, No. 2,

pp. 127–142, 2002.

[Semer 01] L. Semeria, K. Sato, and G. De Micheli. “Synthesis of hardware models in C with point-

ers and complex data structures”. Very Large Scale Integration (VLSI) Systems, IEEE

Transactions on, Vol. 9, No. 6, pp. 743–756, Dec 2001.

[Semer 98] L. Semeria and G. De Micheli. “SpC: synthesis of pointers in C application of pointer

analysis to the behavioral synthesis from C”. In: Computer-Aided Design, 1998. ICCAD

98. Digest of Technical Papers. 1998 IEEE/ACM International Conference on, pp. 340–346,

Nov 1998.

[Shan 13] Shang High-level Synthesis Framework. https://github.com/OpenEDA/Shang/, 2013.

[Silva 13] G. Q. Silva. Static Detection of Address Leaks. https://code.google.com/p/addr-leaks/,

2013.

[Sivar 02] M. Sivaraman and S. Aditya. “Cycle-time aware architecture synthesis of custom hardware

accelerators”. In: Int’l conf. on Compilers, architecture, and synthesis for embedded

systems, pp. 35–42, 2002.

[Stall 99] R. M. Stallman et al. Using and porting the GNU compiler collection. Free Software

Foundation, 1999.

[Stan 14] Stanford CPU DB: Clock Frequency Scaling. http://cpudb.stanford.edu/visualize/

clock frequency, 2014.

[Steen 96] B. Steensgaard. “Points-to analysis in almost linear time”. In: Proceedings of the 23rd

ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pp. 32–41,

ACM, 1996.

Page 135: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

References 125

[Stitt 07] G. Stitt and F. Vahid. “Binary synthesis”. ACM Transactions on Design Automation of

Electronic Systems, Vol. 12, No. 3, 2007.

[Stra 10] Stratix-IV Data Sheet. Altera, Corp., San Jose, CA, 2010.

[Stra 14] Stratix-10. Altera, Corp., San Jose, CA, 2014.

[Sun 04] F. Sun, A. Raghunathan, S. Ravi, and N. Jha. “Custom-Instruction Synthesis for

Extensible-Processor Platforms”. IEEE Transactions on Computer-Aided Design of In-

tegrated Circuits and Systems, Vol. 23, No. 7, pp. 216–228, 2004.

[Synp 15] Synphony Model Compiler. http://www.synopsys.com/systems/blockDesign/HLS/

Pages/default.aspx, 2015.

[Syste 02] SystemC. “SystemC 2.0 User’s Guide”. Open SystemC Initiative, 2002.

[Taylo 13] M. B. Taylor. “Bitcoin and the Age of Bespoke Silicon”. In: Proceedings of the 2013

International Conference on Compilers, Architectures and Synthesis for Embedded Systems,

pp. 16:1–16:10, Piscataway, NJ, USA, 2013.

[Tidwe 05] R. Tidwell. XAPP706: Alpha Blending Two Data Streams Using a DSP48 DDR Technique.

Xilinx, Inc., 2005.

[Tige 10] Tiger ”MIPS” processor. University of Cambridge, http://www.cl.cam.ac.uk/teaching/

0910/ECAD+Arch/mips.html, 2010.

[TOP5 14] TOP500: TOP 500 Supercomputer Sites. http://www.top500.org, Nov. 2014.

[Tripp 07] J. Tripp, M. Gokhale, and K. Peterson. “Trident: From High-Level Language to Hardware

Circuitry”. Computer, Vol. 40, No. 3, pp. 28–37, 2007.

[Tseng 86] C.-J. Tseng and D. P. Siewiorek. “Automated synthesis of data paths in digital systems”.

Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, Vol. 5,

No. 3, pp. 379–395, 1986.

[Under 04] K. D. Underwood and K. S. Hemmert. “Closing the gap: CPU and FPGA trends in sus-

tainable floating-point BLAS performance”. In: Field-Programmable Custom Computing

Machines, 2004. FCCM 2004. 12th Annual IEEE Symposium on, pp. 219–228, IEEE, 2004.

[Vahid 08] F. Vahid, G. Stitt, and L. R. “Warp Processing: Dynamic Translation of Binaries to FPGA

Circuits”. IEEE Computer, Vol. 41, No. 7, pp. 40–46, 2008.

[Villa 10] J. Villarreal, A. Park, W. Najjar, and R. Halstead. “Designing Modular Hardware Accel-

erators in C with ROCCC 2.0”. In: IEEE Symposium on Field-Programmable Custom

Computing Machines, pp. 127–134, 2010.

[Virt 10] Virtex-4 Family Overview. Xilinx, Inc., San Jose, CA, 2010.

[Wakab 06] K. Wakabayashi and T. Okamoto. “C-based SoC design flow and EDA tools: an ASIC and

system vendor perspective”. IEEE Transactions on Computer-Aided Design of Integrated

Circuits and Systems, Vol. 19, No. 12, pp. 1507–1522, 2006.

Page 136: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

References 126

[Wang 13] Y. Wang, P. Li, P. Zhang, C. Zhang, and J. Cong. “Memory Partitioning for Multidi-

mensional Arrays in High-level Synthesis”. In: Proceedings of the 50th Annual Design

Automation Conference, pp. 12:1–12:8, ACM, New York, NY, USA, 2013.

[Wang 14] Y. Wang, P. Li, and J. Cong. “Theory and Algorithm for Generalized Memory Parti-

tioning in High-level Synthesis”. In: Proceedings of the 2014 ACM/SIGDA International

Symposium on Field-programmable Gate Arrays, pp. 199–208, ACM, New York, NY, USA,

2014.

[Warte 92] N. J. Warter, G. E. Haab, K. Subramanian, and J. W. Bockhaus. “Enhanced Modulo

Scheduling for Loops with Conditional Branches”. In: Proceedings of the 25th Annual

International Symposium on Microarchitecture, pp. 170–179, IEEE Computer Society Press,

Los Alamitos, CA, USA, 1992.

[Weick 84] R. P. Weicker. “Dhrystone: a synthetic systems programming benchmark”. Communica-

tions of the ACM, Vol. 27, No. 10, pp. 1013–1030, 1984.

[Wilso 94] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. Tjiang,

S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, et al. “SUIF: An infrastructure for

research on parallelizing and optimizing compilers”. ACM Sigplan Notices, Vol. 29, No. 12,

pp. 31–37, 1994.

[Wilso 95] P. R. Wilson, M. S. Johnstone, M. Neely, and D. Boles. “Dynamic storage allocation: A

survey and critical review”. In: Memory Management, pp. 1–116, Springer, 1995.

[Xili] Xilinx: Vivado Design Suite. http://www.xilinx.com/products/design tools/

vivado/vivado-webpack.htm.

[Xili 14] Xilinx 2014 Annual Report (Form 10-K). http://www.xilinx.com, 2014.

[YACC 05] YACC-Yet Another CPU CPU. http://opencores.org/project,yacc,overview, 2005.

[Yang 14] D. Yang, C. Gan, P. R. Chidambaram, G. Nallapadi, J. Zhu, S. C. Song, J. Xu, and G. Yeap.

“Technology-design-manufacturing co-optimization for advanced mobile SoCs”. March 28

2014.

[Yiann 07] P. Yiannacouras, J. Steffan, and J. Rose. “Exploration and Customization of FPGA-Based

Soft Processors”. Computer-Aided Design of Integrated Circuits and Systems, IEEE Trans-

actions on, Vol. 26, No. 2, pp. 266–277, Feb 2007.

[Zhang 10] J. Zhang, Z. Zhang, S. Zhou, M. Tan, X. Liu, X. Cheng, and J. Cong. “Bit-level opti-

mization for high-level synthesis and FPGA-based acceleration”. In: Proceedings of the

18th annual ACM/SIGDA international symposium on Field programmable gate arrays,

pp. 59–68, ACM, 2010.

[Zhang 12] W. Zhang, V. Betz, and J. Rose. “Portable and Scalable FPGA-based Acceleration of a

Direct Linear System Solver”. ACM Trans. Reconfigurable Technol. Syst., Vol. 5, No. 1,

pp. 6:1–6:26, March 2012.

Page 137: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

References 127

[Zhang 13] Z. Zhang and B. Liu. “SDC-Based Modulo Scheduling for Pipeline Synthesis”. In: ICCAD,

San Jose, CA, 2013.

[Zheng 13] H. Zheng, S. T. Gurumani, L. Yang, D. Chen, and K. Rupnow. “High-level synthesis with

behavioral level multi-cycle path analysis”. In: Field Programmable Logic and Applications

(FPL), 2013 23rd International Conference on, pp. 1–8, IEEE, 2013.

[Zhu 01] J. Zhu. “Static Memory Allocation by Pointer Analysis and Coloring”. In: Proceedings

of the Conference on Design, Automation and Test in Europe, pp. 785–790, IEEE Press,

Piscataway, NJ, USA, 2001.

[Zhu 02] J. Zhu. “Symbolic Pointer Analysis”. In: Proceedings of the 2002 IEEE/ACM International

Conference on Computer-aided Design, pp. 150–157, ACM, New York, NY, USA, 2002.

[Zhu 04] J. Zhu and S. Calman. “Symbolic Pointer Analysis Revisited”. SIGPLAN Not., Vol. 39,

No. 6, pp. 145–157, June 2004.

Page 138: LegUp: Open-Source High-Level Synthesis Research Framework · LegUp: Open-Source High-Level Synthesis Research Framework by AndrewChristopherCanis Athesissubmittedinconformitywiththerequirements

Appendix A

LegUp Source Code Overview

In this appendix, we give an overview of the LegUp source code. In Section A.1, we discuss the LegUp

compiler backend pass, which receives the final optimized LLVM intermediate representation (IR) as

input and produces Verilog as output. In Section A.2, we discuss the LegUp frontend compiler passes,

which receive LLVM IR as input and produce modified LLVM IR as output.

A.1 LLVM Backend Pass

Most of the LegUp code is implemented as a target backend pass in the LLVM compiler framework.

The top-level class is called LegupPass. This class is run by the LLVM pass manager, which calls its runOnModule() method, passing in the LLVM IR for the entire program and expecting the final Verilog code as output.
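
To make this entry point concrete, the following minimal sketch shows the general shape of such a module-level pass under LLVM's legacy pass manager. Only the runOnModule() hook reflects the text above; the class body and registration details are illustrative.

    // Minimal sketch of a module-level backend pass (legacy pass manager).
    // Only the runOnModule() entry point mirrors the text; the rest is illustrative.
    #include "llvm/Pass.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Function.h"
    #include "llvm/Support/raw_ostream.h"
    using namespace llvm;

    namespace {
    struct LegupPassSketch : public ModulePass {
      static char ID;
      LegupPassSketch() : ModulePass(ID) {}

      bool runOnModule(Module &M) override {
        // The pass manager hands us the IR for the whole program; a real
        // backend would schedule, bind, and emit Verilog from here.
        for (Function &F : M) {
          if (F.isDeclaration()) continue;
          errs() << "would synthesize: " << F.getName() << "\n";
        }
        return false; // the IR is only read, not modified
      }
    };
    } // namespace

    char LegupPassSketch::ID = 0;
    static RegisterPass<LegupPassSketch> X("legup-sketch", "LegUp backend pass sketch");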

The LegUp code is logically structured according to the flow chart we gave in Figure 2.2. There

are five major steps performed in order: Allocation, Scheduling, Binding, RTL generation, and Verilog output. First, the Allocation class reads a user Tcl configuration script that specifies the target device, timing constraints, and HLS options. The class reads another Tcl script that contains the FPGA device-specific operation delay and area characteristics. These Tcl configuration settings are stored in a global LegupConfig object accessible throughout the code. We pass the Allocation object to all later stages of LegUp. This object also maps LLVM instructions to unique Verilog signal names and ensures that these names do not collide with reserved Verilog keywords. Any global setting that must be readable from the other stages of LegUp belongs in the Allocation class.
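
As an illustration of the name-mapping duty described above, the sketch below derives a unique, keyword-safe Verilog name for an instruction. The caching map, the sanitizing rules, and the (abbreviated) keyword set are assumptions made for the example, not LegUp's actual implementation.

    // Illustrative sketch: map an LLVM instruction to a unique Verilog-safe name.
    // The caching map, sanitizer, and keyword set are assumptions for this example.
    #include "llvm/IR/Instruction.h"
    #include <cctype>
    #include <map>
    #include <set>
    #include <string>
    using namespace llvm;

    static const std::set<std::string> VerilogKeywords = {
        "reg", "wire", "module", "input", "output"}; // abbreviated list

    std::string verilogName(const Instruction *I,
                            std::map<const Instruction *, std::string> &Cache) {
      auto It = Cache.find(I);
      if (It != Cache.end()) return It->second;        // reuse a previously assigned name

      std::string Base = I->hasName() ? I->getName().str() : "tmp";
      for (char &C : Base)                             // keep only [A-Za-z0-9_]
        if (!isalnum(static_cast<unsigned char>(C)) && C != '_') C = '_';
      if (VerilogKeywords.count(Base)) Base += "_var"; // dodge reserved words

      std::string Name = Base + "_" + std::to_string(Cache.size()); // uniquify
      Cache[I] = Name;
      return Name;
    }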

Next, we loop over each function in the program and perform HLS scheduling. The default scheduler uses the SDC approach and is implemented in the SDCScheduler class. The scheduler uses the SchedulerDAG class, which holds all dependencies between the instructions of a function. The final schedule for the function is stored in a FiniteStateMachine object that specifies the start and end state of each LLVM instruction.
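
For intuition, the dependency constraints in an SDC-style formulation all have the difference form start(v) - start(u) >= latency(u). The sketch below computes the earliest feasible start times by longest-path relaxation over a topologically sorted DAG; it is a simplified stand-in for illustration, whereas the real SDCScheduler hands the same constraints to an LP solver.

    // Simplified stand-in for SDC scheduling: every dependency u -> v imposes
    // start[v] - start[u] >= latency[u]. Over a DAG whose edges are listed in
    // topological order of their sources, one relaxation pass yields the
    // earliest starts. (The real scheduler feeds these constraints to an LP.)
    #include <utility>
    #include <vector>

    struct DepGraph {
      int NumOps;
      std::vector<int> Latency;                  // cycles each operation occupies
      std::vector<std::pair<int, int>> Edges;    // (u, v): v depends on u; topologically sorted
    };

    std::vector<int> earliestStarts(const DepGraph &G) {
      std::vector<int> Start(G.NumOps, 0);
      for (const auto &E : G.Edges) {            // relax each difference constraint once
        int U = E.first, V = E.second;
        if (Start[V] < Start[U] + G.Latency[U])
          Start[V] = Start[U] + G.Latency[U];
      }
      return Start;                              // minimal starts satisfying all constraints
    }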

Next, binding is carried out by the BipartiteWeightedMatchingBinding class, which solves a bipartite weighted matching problem between instructions and functional units. We store the binding results in a data structure that maps each LLVM instruction to the name of the hardware functional unit on which the instruction should be implemented.
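
The sketch below shows the shape of that matching problem: pick, for each instruction, the functional unit that minimizes the total assignment cost. Brute-force permutation search is used purely for clarity and only works at toy sizes; LegUp solves the same problem with a polynomial-time weighted bipartite matching algorithm.

    // Toy bipartite assignment: choose which functional unit implements each
    // instruction so the summed cost (e.g., added multiplexing) is minimal.
    // Assumes a square cost matrix (as many units as instructions) for brevity.
    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Cost[i][u]: cost of placing instruction i on functional unit u.
    std::vector<int> bindMinCost(const std::vector<std::vector<int>> &Cost) {
      int N = Cost.size();
      std::vector<int> Perm(N), Best;
      std::iota(Perm.begin(), Perm.end(), 0);    // start from identity assignment
      long BestCost = -1;
      do {
        long C = 0;
        for (int i = 0; i < N; ++i) C += Cost[i][Perm[i]];
        if (BestCost < 0 || C < BestCost) { BestCost = C; Best = Perm; }
      } while (std::next_permutation(Perm.begin(), Perm.end()));
      return Best;                               // Best[i] = unit chosen for instruction i
    }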

In the next step, the GenerateRTL class loops over every LLVM instruction in the program and, using the schedule and binding information, creates an RTLModule object that represents the final hardware circuit. The data structure that we use to represent an arbitrary circuit uses the following classes:

• RTLModule describes a hardware module.

• RTLSignal represents a register or wire signal in the circuit. A signal can be driven by multiple RTLSignals, each predicated on another RTLSignal, to form a multiplexer.

• RTLConst represents a constant value.

• RTLOp represents a functional unit with one, two or three operands.

• RTLWidth represents the bit width of an RTLSignal (e.g., [31:0]).

Finally, the VerilogWriter class loops over each RTLModule object and prints the corresponding Verilog for the hardware module, along with any hard-coded testbenches and top-level modules.
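
The toy classes below mimic that layered representation: a module holds signals, and a signal with several predicated drivers prints as a multiplexer. The names echo the RTL classes listed above, but the fields and printing logic are simplified assumptions, not the actual LegUp definitions.

    // Toy version of the RTL layer: a module of signals, where a signal driven
    // by several predicated sources prints as a multiplexer. Fields and methods
    // are simplified assumptions, not LegUp's actual definitions.
    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    struct RTLSignalToy {
      std::string Name;
      int Width = 32;                                   // e.g., [31:0]
      // (predicate, source) pairs; more than one pair implies a mux.
      std::vector<std::pair<std::string, std::string>> Drivers;
    };

    struct RTLModuleToy {
      std::string Name;
      std::vector<RTLSignalToy> Signals;

      void printVerilog(std::ostream &OS) const {
        OS << "module " << Name << ";\n";
        for (const auto &S : Signals) {
          OS << "  reg [" << (S.Width - 1) << ":0] " << S.Name << ";\n";
          if (S.Drivers.empty()) continue;
          OS << "  always @(*) begin\n";
          for (const auto &D : S.Drivers)               // predicated drivers -> mux
            OS << "    if (" << D.first << ") " << S.Name
               << " = " << D.second << ";\n";
          OS << "  end\n";
        }
        OS << "endmodule\n";
      }
    };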

A.2 LLVM Frontend Passes

In this section, we discuss portions of the LegUp code that are implemented as frontend LLVM passes.

These passes receive LLVM IR as input, return modified LLVM IR as output, and are run individually using the LLVM opt command. For each pass, the LLVM pass manager calls the implementing class's runOnFunction() method with the LLVM IR of a single function and expects the modified IR in return.
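
A frontend pass therefore has the same skeleton as the backend pass, but at function granularity. In the minimal sketch below, only the runOnFunction() hook reflects the text; everything else is illustrative.

    // Minimal legacy FunctionPass skeleton; only the runOnFunction() hook is
    // taken from the text, the rest is illustrative.
    #include "llvm/Pass.h"
    #include "llvm/IR/Function.h"
    using namespace llvm;

    namespace {
    struct FrontendPassSketch : public FunctionPass {
      static char ID;
      FrontendPassSketch() : FunctionPass(ID) {}
      bool runOnFunction(Function &F) override {
        bool Changed = false;
        // ... rewrite F's IR here ...
        return Changed; // true tells opt the IR was modified
      }
    };
    } // namespace
    char FrontendPassSketch::ID = 0;
    static RegisterPass<FrontendPassSketch> Y("frontend-sketch", "LegUp frontend pass sketch");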

For the hybrid flow, the HwOnly class removes from the IR all functions that should be implemented in software, while the SwOnly class removes all functions that should be implemented in hardware. Running these two passes on the original IR of the program yields two new versions of the IR: we pass the HwOnly IR to the LegUp HLS backend and the SwOnly IR to the MIPS/ARM compiler backend.
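
A pruning pass of this kind could look like the following sketch, which drops the bodies of functions assigned to the other partition while keeping their declarations so that remaining call sites still resolve. The shouldStayHere() predicate and its naming convention are placeholders invented for the example.

    // Hedged sketch of partition pruning: delete the bodies of functions that
    // belong to the other partition, leaving external declarations behind.
    #include "llvm/Pass.h"
    #include "llvm/IR/Module.h"
    using namespace llvm;

    // Placeholder partition decision; in practice this would come from the
    // user's accelerator list. The "accel_" prefix is a made-up convention.
    static bool shouldStayHere(const Function &F) {
      return !F.getName().startswith("accel_");
    }

    namespace {
    struct PartitionPrune : public ModulePass {
      static char ID;
      PartitionPrune() : ModulePass(ID) {}
      bool runOnModule(Module &M) override {
        bool Changed = false;
        for (Function &F : M) {
          if (F.isDeclaration() || shouldStayHere(F)) continue;
          F.deleteBody();   // the function becomes an external declaration
          Changed = true;
        }
        return Changed;
      }
    };
    } // namespace
    char PartitionPrune::ID = 0;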

Loop pipelining is performed by the SDCModuloScheduler class, which implements the algorithm described in Chapter 5. This pass determines the pipeline initiation interval and the scheduled start and end time of each instruction in the pipeline. This data is stored as LLVM IR metadata, which the LegUp backend reads later.
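
Handing a schedule to a later pass through IR metadata can be written as in the sketch below, which uses the modern (post-3.6) LLVM metadata API; the "legup.pipeline.start" key is invented for this example.

    // Hedged sketch of passing a schedule to a later pass via IR metadata.
    // The "legup.pipeline.start" key is invented for this example.
    #include "llvm/IR/Constants.h"
    #include "llvm/IR/Instruction.h"
    #include "llvm/IR/Metadata.h"
    #include "llvm/IR/Type.h"
    using namespace llvm;

    void recordStartTime(Instruction &I, unsigned StartState) {
      LLVMContext &Ctx = I.getContext();
      Metadata *Ops[] = {ConstantAsMetadata::get(
          ConstantInt::get(Type::getInt32Ty(Ctx), StartState))};
      I.setMetadata("legup.pipeline.start", MDNode::get(Ctx, Ops));
    }

    unsigned readStartTime(const Instruction &I) {
      if (MDNode *N = I.getMetadata("legup.pipeline.start"))
        return (unsigned)cast<ConstantInt>(
                   cast<ConstantAsMetadata>(N->getOperand(0))->getValue())
            ->getZExtValue();
      return 0; // no schedule recorded for this instruction
    }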

If-conversion is performed by the LegUpCombineBB class, which removes simple control flow by combining basic blocks. The PreLTO class detects LLVM intrinsic functions that can appear in the IR (e.g., memset, memcpy) and replaces them with equivalent LLVM IR instructions that the LegUp backend can synthesize.
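
Lowering such an intrinsic amounts to replacing the call with ordinary IR that a hardware backend can handle. The sketch below expands a constant-length memset into an unrolled sequence of byte stores using IRBuilder; it is a simplified assumption of how such a replacement can be written, not the actual PreLTO code.

    // Hedged sketch: expand a constant-length llvm.memset into an unrolled
    // sequence of plain byte stores, which an HLS backend can synthesize.
    // Simplified (ignores volatility, assumes a constant length); not the
    // actual PreLTO implementation.
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/IntrinsicInst.h"
    using namespace llvm;

    void lowerMemSet(MemSetInst *MS) {
      uint64_t Len = cast<ConstantInt>(MS->getLength())->getZExtValue();
      IRBuilder<> B(MS);                 // insert new IR right before the call
      Value *Dst = MS->getRawDest();
      Value *Val = MS->getValue();       // the byte value being stored
      for (uint64_t i = 0; i < Len; ++i) {
        Value *Ptr = B.CreateConstGEP1_64(B.getInt8Ty(), Dst, i);
        B.CreateStore(Val, Ptr);
      }
      MS->eraseFromParent();             // the intrinsic call is no longer needed
    }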