University of California
Los Angeles
DSP Architecture Optimization
in MATLAB/Simulink Environment
A thesis submitted in partial satisfaction
of the requirements for the degree
Master of Science in Electrical Engineering
by
Rashmi Nanda
2008
© Copyright by
Rashmi Nanda
2008
The thesis of Rashmi Nanda is approved.
Miodrag Potkonjak
Mani B. Srivastava
Dejan Markovic, Committee Chair
University of California, Los Angeles
2008
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of Previous Work . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Architecture Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Representations of DSP Algorithms . . . . . . . . . . . . . . . . . 11
2.2 Data Flow Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Loop Bound and Iteration Bound . . . . . . . . . . . . . . 13
2.2.2 Precedence Relations . . . . . . . . . . . . . . . . . . . . . 14
2.3 Data Flow Graph Model . . . . . . . . . . . . . . . . . . . . . . . 16
3 Architectural Transformations . . . . . . . . . . . . . . . . . . . . 19
3.1 Retiming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1 Mathematical Model for Retiming . . . . . . . . . . . . . . 21
3.1.2 Retiming for Clock Period Minimization . . . . . . . . . . 22
3.1.3 Retiming for Energy Efficiency . . . . . . . . . . . . . . . 24
3.2 Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Unfolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Mathematical Formulation for Unfolding . . . . . . . . . . 31
3.3.2 Carry-Save Arithmetic . . . . . . . . . . . . . . . . . . . . 32
4 ILP Model for Scheduling . . . . . . . . . . . . . . . . . . . . . . . 36
4.1 Scheduling and Retiming . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Modified ILP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 CAD Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1 Simulink Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 RTL Generation and Synthesis . . . . . . . . . . . . . . . . . . . . 50
5.3 Architectural Optimization . . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Controller Generation . . . . . . . . . . . . . . . . . . . . 55
6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.1 Comparison of Existing and Modified Scheduling
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2 Design Space Exploration: 16-tap FIR Filter . . . . . . . . . . . . 62
6.3 Hierarchical Design: Multi-Core MIMO Sphere Decoder . . . . . . 67
7 Conclusions & Future Work . . . . . . . . . . . . . . . . . . . . . . 71
7.1 Summary of Research Contributions . . . . . . . . . . . . . . . . 71
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Appendix: GUI Environment . . . . . . . . . . . . . . . . . . . . . . . 73
Appendix: Tarjan’s Algorithm . . . . . . . . . . . . . . . . . . . . . . 76
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
List of Figures
1.1 Algorithm-architecture-circuit-level interaction in the design-space. 3
1.2 Thesis organization. . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 (a) Block diagram of y(n) = ay(n− 1) + x(n) (b) DFG representation. . . . 14
2.2 IIR Filter: (a) Block diagram (b) DFG representation with loops. 15
2.3 (a) Architecture of a second order IIR filter (b) DFG representation
for the filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 (a) Original DFG (b) Retimed DFG. . . . . . . . . . . . . . . . . 20
3.2 FFT butterfly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Retimed FFT butterfly with Vdd-scaling. . . . . . . . . . . . . . . 26
3.4 (a) ASAP scheduling (b) ALAP scheduling. . . . . . . . . . . . . 27
3.5 (a) DFG of recursive algorithm (b) DFG of two-unfolded version. 29
3.6 (a) DFG of feedforward algorithm (b) DFG of two-unfolded version. 30
3.7 (a) Original architecture (b) Two-unfolded version with Vdd-scaling. 32
3.8 (a) Conventional array multiplier (b) Carry-save multiplier. . . . . 33
3.9 Carry-save tree adding nine numbers. . . . . . . . . . . . . . . . . 34
3.10 A multiplier implemented with shifts and adds. . . . . . . . . . . 35
4.1 Data-flow-graph and corresponding schedule. . . . . . . . . . . . . 36
4.2 (a) Original DFG (b) Retimed DFG (c) Scheduled architecture. . 39
5.1 Design and optimization flow. . . . . . . . . . . . . . . . . . . . . 45
5.2 Simulink pre-defined library (Synplify DSP blockset). . . . . . . . 46
5.3 Baseband processing in a QAM system. . . . . . . . . . . . . . . . 47
5.4 BER vs. SNR curve for the QAM system. . . . . . . . . . . . . . 48
5.5 Simulink model for an 8-tap FIR filter. . . . . . . . . . . . . . . . 48
5.6 Input with normalized frequencies of 0.03 & 0.4, and corresponding
output that passes the lower frequency. . . . . . . . . . . . . . . . 49
5.7 Activity factor for a 16-bit sinusoidal input of normalized frequency
0.25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.8 Energy-area-delay tradeoffs at the circuit and micro-architectural
level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.9 Choosing values of N , P and R based on energy-delay sensitivity. 52
5.10 Time-multiplexed and parallel implementation of a 16-tap FIR filter. 54
5.11 Control circuitry using M-Control blocks. . . . . . . . . . . . . . . 55
6.1 Synthesis results for a fifth order elliptic wave digital filter. . . . . 60
6.2 Synthesis results for a 16-tap FIR filter. . . . . . . . . . . . . . . . 60
6.3 Synthesis results for a 4th order all pole lattice filter. . . . . . . . 61
6.4 Fifth-order wave digital elliptic filter. . . . . . . . . . . . . . . . . 63
6.5 Synthesis results for retimed and time-multiplexed FIR filters. . . 64
6.6 Increase in register area with retiming. . . . . . . . . . . . . . . . 64
6.7 (a) FIR with an extra latency at the output (b) Retimed version. 65
6.8 Synthesis results for parallel FIR filters. . . . . . . . . . . . . . . . 67
6.9 Multi-core MIMO sphere decoder. . . . . . . . . . . . . . . . . . . 68
6.10 Simulink model for the multi-core MIMO sphere decoder. . . . . . 69
6.11 Synthesis results for the multi-core MIMO sphere decoder. . . . . 70
7.1 GUI built within MATLAB to facilitate transformations. . . . . . 74
List of Tables
6.1 Comparison of scheduling and scheduling with Bellman-Ford re-
timing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Comparison of normalized area-delay product for scheduling and
scheduling with Bellman-Ford retiming. . . . . . . . . . . . . . . . 62
Acknowledgments
I am sincerely grateful to my advisor Professor Dejan Markovic, without whose
help and support this thesis could not have been written. His constant drive for
perfection has taught me a great deal and I am truly indebted to him for the
patience he had with me on occasions when I faltered.
This work started out and has progressed based on Dejan's idea of design-
space exploration via architectural transformations. I wish to thank my group
members Chia-Hsiang Yang and Victoria Wang for the invaluable feedback I
received from them when working on this research project. Chia-Hsiang Yang
designed the sphere decoder which I have used as a design driver example. My
thanks also go out to the new members in my group, Sarah Gibson, Cheng-Cheng
Wang and Vaibhav Karkare for their insightful comments and discussions in the
group meetings. Professor Mani Srivastava and Professor Miodrag Potkonjak
provided helpful reviews of this thesis which helped me refine its contents.
This work could not have progressed without the critical infrastructure support
provided by Synplicity Incorporated and Cadence Design Systems. I am grateful
to the Synplicity team for the tools they provided us and the training sessions
they conducted to help us learn them. I found the Source-Link database provided
by Cadence to be extremely useful for logic and physical synthesis flows.
I would like to thank Rohit for his unwavering support through the course of
this project. He has always had the utmost faith in my abilities, and has never
fallen short of words of encouragement, especially at times when I needed them
most. We have also had some very lively discussions when I was preparing for
the prelim examination. I am indebted to Nitesh Singhal, Abhishek Ghosh and
Bibhu Dutta Sahoo who were always ready to discuss and clear any doubt I had
before the prelim examination.
I will forever remain grateful to my parents and my brother for their loving
support during my studies at UCLA.
Abstract of the Thesis
DSP Architecture Optimization
in MATLAB/Simulink Environment
by
Rashmi Nanda
Master of Science in Electrical Engineering
University of California, Los Angeles, 2008
Professor Dejan Markovic, Chair
Architectural optimization has traditionally been a heuristic process involving
multiple iterations before the design converges to the desired specifications. Mul-
tiple architectures are difficult to evaluate if RTL is written repeatedly for each
design. The process becomes tedious if the design fails to meet target specifica-
tions and changes need to be made at the system level. This work aims to auto-
mate the process of architecture selection and provide energy-area-performance
optimal solutions starting from the graphical timed data-flow MATLAB/Simulink
description of an algorithm.
Integrating functional blocks into system architecture requires the most ef-
ficient use of available resources for maximizing area-efficiency. In essence, this
task is accomplished with scheduling. Improved power efficiency requires that all
pipelines in a design be balanced, necessitating retiming at the micro-architecture
level. Parallelism is employed when building high-throughput or low-energy
systems. These architectural transformations have been implemented in MAT-
LAB/Simulink environment, using Integer Linear Programming (ILP) models.
This work proposes a modified ILP model which integrates scheduling and re-
timing, providing a 33% average reduction in area-delay product compared to
existing ILP models which do not incorporate retiming. In addition, a 20× reduction in
worst-case CPU runtime is achieved by the proposed method when compared to
existing ILP models which directly incorporate retiming with scheduling.
Optimization of complex structures is supported by hierarchically extend-
ing circuit-level results from the underlying macros. The entire framework al-
lows comparison of various architectural solutions for a given algorithm in the
energy-area-performance space. The high-level block-diagram based description
in Simulink conveniently maps to FPGA or ASIC. The method is applicable for
dedicated DSP algorithms like Fourier transforms as well as hierarchical struc-
tures with complex macros such as those in MIMO communications.
CHAPTER 1
Introduction
1.1 Motivation
Integrated circuit design is at an interesting juncture at present with continued
scaling of the underlying technology. The integration complexity has reached sev-
eral billion transistors, opening doors for the IC design market to capture a wide
variety of application domains. Cell phones, iPods, palmtops, and biomedical instru-
ments are only a few manifestations of this ever-growing trend. However,
we are still faced with several challenges in the design process; one of which is
making a suitable architecture selection for a given algorithm. As systems be-
come more and more complex and constraints on energy, area and performance
become tighter, it is not possible to single out a particular architecture as being
always optimal. Area is no longer the only optimization metric; the advent of
portable battery-operated devices has made lower power consumption a more de-
sirable target. In fact the choice of the optimal architecture is strongly dictated
by system specifications such as throughput, area and power consumption. A
simple FIR filter could have several different realizations depending on the sys-
tem it is being integrated with. For example, neural signal processing requires
very slow operation at low sample rates, in which
case the filter could be time-multiplexed. For a high-speed base-band processing
unit in a wireless LAN system, the same filter would have to be parallelized to
meet throughput requirements.
The aim of this work is to make it feasible for algorithm designers to an-
alyze the hardware cost associated with their algorithms in the energy-area-
performance space. This feedback helps designers in tuning their algorithms such
that system specifications can be met, and also in finding the optimal architecture
for a given algorithm. The choice of the optimal architecture is strongly dictated
by circuit-level energy-delay sensitivity results [1] of the underlying macros in the
design (Fig. 1.1). The interface between the micro-architectural level and the
circuit-level is not easy to navigate. This is because, if the design fails to meet
the target specifications at the circuit-level, then changes will have to be made at
the system level. This can mean major restructuring at the micro-architectural
level, like opting for a parallel design instead of a time-multiplexed one, which
then necessitates re-writing the RTL. The process becomes very tedious if this
design cycle re-iterates.
The answer to this problem lies in automating the generation of architectural
solutions (high-level models and RTL) for a given algorithm and also hierar-
chically extrapolating circuit-level measurements like energy, area and delay for
these architectures from synthesis results of the underlying macros.
A tool which automates this flow will offer designers a convenient medium to
explore the solution space of possible architectures and also analyze the tradeoffs
in the energy-area-performance space. This work proposes such a tool embedded
in MATLAB/Simulink which is a very comprehensive and easy-to-use graphical
platform. SystemC-based modeling was an alternative to Simulink for the imple-
mentation of this tool; however, that approach would involve C++ coding, which would
limit the use of the tool to those adept at programming. Moreover, a graphical format
allows easy visualization and understanding of the architectural transformations,
as compared to when architectures are described in C++ code.

Figure 1.1: Algorithm-architecture-circuit-level interaction in the design-space.
The reference architecture (direct-mapped version of the target algorithm)
has to be defined only once in Simulink using pre-existing or user-defined library
blocks. By use of data-flow-graph models this reference architecture is mathe-
matically modeled by a set of matrices. The matrix based representation makes
it convenient to apply transformations like scheduling, retiming, pipelining and
parallelism. The transformed architectures are then converted back from matri-
ces to Simulink models, making it possible for designers to functionally verify
the new design. Modeling complex DSP kernels in a hierarchical manner is also
very convenient with this approach since data-flow-graph models are essentially
independent of the internal complexity of the processing kernel.
The optimization discussed in this work is not limited to making architectural
choices alone. Tools like Synplify DSP support automatic RTL generation from
Simulink descriptions. By synthesizing the RTL in backend tools like Cadence
or Synopsys, accurate circuit-level results for the generated architectures can
be obtained. Other refinements, like the use of carry-save arithmetic and gate
sizing, which are part of logic synthesis, can also be integrated into this flow.
Depending upon delay slack available in the design after synthesis, Vdd scaling is
also employed to improve energy efficiency.
The goal of this work is to establish a complete framework for evaluating
architectures based on results generated through actual physical synthesis and
not just heuristics. For example, in the case of time-multiplexed architectures, the
energy overhead associated with control circuitry cannot be estimated accurately
at the high level. Logic synthesis of the final architecture, on the other hand,
gives a clearer picture of the degradation in energy efficiency associated with
time-multiplexing.
1.2 Overview of Previous Work
The problem of exploring architectural solutions in combination with lower level
circuit optimizations has received constant attention in the past few years. De-
signers have always known that optimizing only at the circuit-level has marginal
gains, while combination of architectural and circuit-level refinements produce
much improved results. This concept was clearly demonstrated in the paper by
Gemmeke et al. [2], where a design methodology for optimizing FIR filters was pro-
posed. The paper started out with a description of various architectural choices
like time-multiplexing and parallelism and moved on to arithmetic-level improve-
ments like Booth recoding of the filter coefficients. The last level of optimization
suggested was at the circuit-level which described sizing of the transistors for
minimum power dissipation at the specified throughput. The approach outlined
in this paper targeted the FIR filter design-space; in this work we try to create a
design environment which automates these optimization techniques for a generic
DSP algorithm. The tool Hyper [3] developed by the University of California,
Berkeley automates architectural transformations to enable efficient design-space
exploration. Given a flow-graph and a set of timing constraints the tool gener-
ates the most area-efficient solution. In this work, we explore the design-space of
possible architectures to find the solution which is jointly optimal in the energy-
area-performance space.
Mapping of Simulink models onto ASIC was introduced in [4]. The work in
[5] added optimization heuristics and applied them to a complex singular value
decomposition algorithm [6]. Although systematic, architecture tuning was man-
ual, making the process impractical or time-consuming for complex systems. This
work aims to automate the process in the convenient-to-use MATLAB/Simulink
graphical environment. The user selects tuning parameters which control the de-
gree of time-multiplexing, retiming, parallelism and pipelining at the Simulink level.
Based on these parameters the optimizer automatically generates the optimized
Simulink model. The core of this optimization process is in the efficient modeling
and transformation of the reference architecture.
Modeling of DSP algorithms as data-flow graphs (DFG) followed by schedul-
ing or retiming has received considerable attention in literature. Retiming was
first described by C.E. Leiserson and J.B. Saxe in 1983 [7], following which it
has perhaps become one of the most widely used techniques for improving de-
lay or lowering power consumption. The retiming problem is formalized using
Integer Linear Programming (ILP) models which are solved using branch and
bound techniques or shortest-path algorithms like Bellman-Ford [8]. A formal de-
scription of the ILP model is given in the paper by K.N. Lalgudi et al. [9].
Retiming has been fairly well integrated in modern-day CAD tools (Synopsys,
Cadence retiming tools). However, synthesis tools can retime circuits only after
the structural netlist has been created once the technology mapping process is
complete. This gate-level granularity introduces a large number of retiming vari-
ables, making the process computationally inefficient for larger designs. For
example, a 512-point FFT could not be retimed at the gate-level in a span of 2.5
days when using the RTL Compiler (RC) tool from Cadence. Performing retiming at the
gate-level for multiple architecture solutions can become quite tedious in such
a situation. The solution lies in hierarchically decomposing the top level design
into smaller modules which can be retimed quickly. Extrapolating circuit-level
results from the smaller modules makes it possible to get fairly accurate estimates
of the critical path post-retiming for the top level design. From these estimates a
final design can be fixed upon based on system constraints. Hence, only the final
design would have to be synthesized. This approach for hierarchical retiming
was adopted in the high-level synthesis tool IRIS [10]. However, the scheduling
approach used in IRIS was not optimal and also it lacked support for pipelining
and parallelism.
Scheduling is the formal approach for time multiplexing a finite set of op-
erations onto a core of processing elements. If the design is subject to speed
constraints, the scheduling algorithm will attempt to parallelize [11] the opera-
tions to meet timing constraints. Conversely, if there is a limit on the cost (area
resources), the scheduler will serialize operations to meet resource constraints.
Scheduling thus determines the cost-speed tradeoffs of the design.
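The serialize-versus-parallelize tradeoff described above can be illustrated with a toy resource-constrained scheduler. This is a greedy sketch in Python, not the ILP formulation developed in this thesis, and all operation names and unit counts are illustrative:

```python
def list_schedule(ops, deps, num_units):
    """Greedy list scheduling: each op takes one step on one of
    `num_units` identical units; deps[v] lists predecessors of v
    (the precedence graph is assumed acyclic)."""
    done_at, t, remaining = {}, 0, set(ops)
    while remaining:
        ready = [v for v in sorted(remaining)
                 if all(done_at.get(p, 10**9) <= t for p in deps.get(v, []))]
        for v in ready[:num_units]:        # serialize when units are scarce
            done_at[v] = t + 1
            remaining.discard(v)
        t += 1
    return done_at  # op -> completion step

# four independent operations: 2 units finish in 2 steps, 1 unit in 4
print(list_schedule("abcd", {}, 2))
print(list_schedule("abcd", {}, 1))
```

With abundant units the schedule parallelizes to meet timing; with a resource limit the same operations are serialized over more steps, which is exactly the cost-speed tradeoff the scheduler negotiates.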
The scheduling problem is NP hard in general. Various heuristic and formal
approaches to the scheduling problem have been covered extensively in literature.
Approaches like ASAP (As Soon As Possible scheduling) [13], ALAP (As Late As
Possible scheduling) [14], list scheduling [15], MARS (Minnesota Architectural
Scheduling) [36] are quick but sub-optimal ways to generate a schedule. Integer
linear programming models can formalize the process and obtain global optimum
solutions. An integer programming model for synthesizing digital logic at the
register-transfer level (RTL) was formulated in [9]. The model gives detailed spec-
ifications for data-path synthesis such as variable storage, operation precedence,
resource sharing, and control structures. Due to the complexity of the formulation
this approach cannot be extended to complex structures. An integer programming
approach was also proposed for microcode scheduling in CATHEDRAL-II [16],
which is a synthesis engine for multiprocessor DSP systems. After a customized
data path has been synthesized and the high-level operations are mapped onto
a set of RTL operations, the microcode scheduling is performed. The model
contains data precedence, resource conflict, and controller pipelining constraints.
Since excessive CPU time is required to solve large problems, the model was
replaced by a graph-based scheduling algorithm [17] which used Integer Linear
Programming models. This approach formalized the scheduling problem and pro-
posed a way to achieve the most area or throughput efficient schedule. Retiming
of the original data-flow-graph was not integrated in all the above mentioned ap-
proaches to scheduling. It is shown later in this thesis (Chapter IV) that
retiming simultaneously with scheduling can result in more area and throughput
efficient schedules.
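For reference, the ASAP heuristic mentioned above amounts to a longest-path computation over the precedence graph. A minimal unit-latency sketch (illustrative only, not the thesis's implementation; resources are assumed unlimited):

```python
def asap_schedule(nodes, deps):
    """ASAP: earliest start step for each op assuming unit latency and
    unlimited resources; deps maps an op to its list of predecessors
    (the dependence graph must be acyclic)."""
    start = {}
    def visit(n):
        if n not in start:
            start[n] = max((visit(p) + 1 for p in deps.get(n, [])), default=0)
        return start[n]
    for n in nodes:
        visit(n)
    return start

# c waits for a and b; d waits for c
print(asap_schedule(["a", "b", "c", "d"], {"c": ["a", "b"], "d": ["c"]}))
# {'a': 0, 'b': 0, 'c': 1, 'd': 2}
```

ALAP is the mirror image: the same relaxation run backwards from a fixed latest completion time.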
The benefit of retiming with scheduling to improve on the resource utiliza-
tion was investigated in [19]. An iterative-improvement probabilistic algorithm
similar to simulated annealing was developed in that work, which applied trans-
formations like retiming, associativity and commutativity to improve the area of
the scheduled architecture. However, the issue of improving the throughput of
the scheduled architecture by supporting greater pipeline depth in the processing
elements via retiming was not addressed. Retiming with scheduling was also in-
vestigated in [18] where a scheme to generate all possible schedules for any DFG
was proposed. However, this scheme works only on strongly connected graphs
(a DFG where every node must be a part of a loop). It was also not clear in
that work how the most area- or throughput-efficient schedule was to be
extracted from the pool of solutions generated.
Retiming variables are unbounded in the integer space. Their inclusion in the
ILP model makes it practically impossible for the ILP to converge to a solution
for complex structures, when we attempt to minimize the area of the schedule.
In this work a new model has been developed for ILP scheduling which integrates
retiming without directly introducing the retiming variables in the ILP. Retiming
is done post scheduling using the Bellman-Ford shortest-path algorithm [8] which
has polynomial time-complexity. This ensures that the optimum schedule is found
without increasing CPU runtime excessively. We show that simultaneous retiming
of the original DFG along with scheduling produces better results in terms of area
and throughput when compared to the results in [17].
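The Bellman-Ford step referred to here solves the retiming difference constraints (inequalities of the form r(v) − r(u) ≤ c, mapped onto shortest-path edges) in polynomial time. A generic stdlib sketch with hypothetical toy data; a negative cycle signals an infeasible constraint system:

```python
def bellman_ford(num_nodes, edges, src):
    """edges: list of (u, v, weight). Returns shortest distances from src,
    or None if a negative cycle is reachable (constraints infeasible)."""
    INF = float("inf")
    dist = [INF] * num_nodes
    dist[src] = 0
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            return None  # negative cycle => no feasible solution
    return dist

# toy system: r(v) - r(u) <= w becomes an edge u -> v of weight w
print(bellman_ford(3, [(0, 1, 1), (1, 2, -2), (0, 2, 0)], 0))  # [0, 1, -1]
```

Each distance dist[v] then serves as a feasible value for the corresponding variable r(v), which is how shortest paths certify a legal retiming.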
1.3 Thesis Outline
The subsequent chapters will present in detail our methodology for architecture
optimization (Fig. 1.2). The second chapter introduces data-flow-graph represen-
tations and explains how matrices can be used to represent flow-graph connec-
tivity information. Chapter III discusses various architectural transformations
like retiming, scheduling and parallelism along with mathematical models which
automate them. Carry-save arithmetic optimization and its benefits in reducing
the critical path of a design are also discussed at the end of this chapter.

Figure 1.2: Thesis organization.

Chapter IV focuses on the ILP model for scheduling and then discusses how this model
can be extended to include retiming such that the final framework still remains
CPU efficient. The fifth chapter presents the CAD design flow starting from
Simulink and ending with the technology mapped design. The chapter illustrates
how systems can be modeled and functionally verified in Simulink. Chapter VI
discusses how scheduling results compare with scheduling combined with retim-
ing. This is followed by the design-space exploration results for a 16-tap filter
(used in an ultra-wideband application). The hierarchical capability of our design-space
exploration approach is demonstrated on a flexible MIMO sphere decoder design.
The last chapter summarizes the contributions of this thesis and discusses future
scope of this work.
CHAPTER 2
Architecture Modeling
This chapter describes how algorithms (direct-mapped architectures) are modeled
using graphical representations like data-flow graphs and then mathematically
abstracted as matrices. Such representations allow transformations like retiming
and scheduling to be viewed in compact form as manipulation of matrices rather
than changes being made at the architectural level.
2.1 Representations of DSP Algorithms
DSP algorithms [12] can be broadly divided into two classes, namely FIR (finite
impulse response) and IIR (infinite impulse response). The result of an FIR algo-
rithm depends on previous and current inputs, while the result of an IIR algorithm
depends on previous and current inputs as well as previous outputs. An example
is shown for both types of systems.
y(n) = a0 x(n) + a1 x(n−1) + a2 x(n−2)    (FIR system)    (2.1)
y(n) = a0 x(n) + a1 y(n−1)    (IIR system)    (2.2)
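As a sketch of the difference between the two classes, the recurrences (2.1) and (2.2) can be evaluated directly in software (the coefficient values below are arbitrary placeholders):

```python
def fir_3tap(x, a):
    """y(n) = a0*x(n) + a1*x(n-1) + a2*x(n-2); x(k) = 0 for k < 0."""
    pad = [0, 0] + list(x)
    return [a[0]*pad[n+2] + a[1]*pad[n+1] + a[2]*pad[n] for n in range(len(x))]

def iir_1st(x, a0, a1):
    """y(n) = a0*x(n) + a1*y(n-1); y(-1) = 0, so the output feeds back."""
    y, prev = [], 0
    for xn in x:
        prev = a0*xn + a1*prev
        y.append(prev)
    return y

print(fir_3tap([1, 0, 0, 0], [0.5, 0.3, 0.2]))  # impulse response dies out
print(iir_1st([1, 0, 0], 1.0, 0.5))             # impulse response never ends
```

The FIR response to an impulse is finite (it ends after the last tap), while the IIR feedback term keeps the response alive indefinitely, which is the defining distinction between the two classes.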
Execution of all computations in the algorithm once is referred to as an iteration.
The iteration period is the time required for execution of one complete iteration
of the algorithm. During each iteration, the 3-tap FIR filter in (2.1) processes one
input sample, completes 3 multiplications and 2 additions, and generates
one output sample. DSP systems are also characterized by their sampling rate
(throughput) and input to output latency. The throughput is determined by
the critical path (longest path between any two storage elements) of the system.
Latency is defined as the difference between the time an output is generated and
the time at which the corresponding input was received by the system.
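A toy illustration of the critical-path definition: treating zero-delay edges as combinational connections, the critical path is the longest node-weighted path that crosses no register. This is a simplified sketch under that assumption; the node names and execution times are made up:

```python
def critical_path(exec_time, edges):
    """Longest register-to-register combinational path: the longest path
    in the DFG restricted to zero-delay edges (assumed acyclic).
    edges: list of (src, dst, delays); exec_time: node -> time in u.t."""
    zero = {}
    for u, v, w in edges:
        if w == 0:
            zero.setdefault(u, []).append(v)
    memo = {}
    def longest_from(n):
        if n not in memo:
            memo[n] = exec_time[n] + max(
                (longest_from(m) for m in zero.get(n, [])), default=0)
        return memo[n]
    return max(longest_from(n) for n in exec_time)

# two-node loop: A (2 u.t.) and B (4 u.t.), only B -> A is delay-free
print(critical_path({"A": 2, "B": 4}, [("A", "B", 1), ("B", "A", 0)]))  # 6
```

The returned value bounds the clock period, and hence the throughput, of the direct-mapped architecture.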
Complex DSP algorithms can be conveniently modeled using high-level de-
scriptions, where it is more important to specify the communication between
the processing elements, rather than the order and structure of the internal op-
erations. These high-level descriptions can either take the form of behavioral
description languages or graphical representations. Many applications are de-
scribed using descriptive languages that represent the structure of the system.
Examples of these are hardware description languages such as Verilog and VHDL
which can be integrated into the physical synthesis flow using CAD tools.
Graphical representations are efficient for investigating and analyzing data-
flow properties of DSP algorithms and for exploiting inherent parallelism among
different subtasks. As such they are more amenable to transformation than be-
havioral descriptions. More importantly, graphical representations can be easily
converted to Verilog/VHDL scripts which are easy to map into hardware imple-
mentations. Hence, these representations can bridge the gap between algorithmic
descriptions and structural implementations.
The absolute measures of the performance metrics of DSP systems namely
area, speed and power cannot be obtained without the knowledge of the sup-
porting technology. However, graphical representations provide useful insight
into space-time-energy tradeoffs making it possible to explore the architectural
design-space. Various forms of graph representations include signal-flow-graph
(SFG), data-flow-graph (DFG) and dependence graphs (DG) [31]. In this work
data-flow-graph based representation has been adopted because of its simplicity
of representation and the convenient way in which flow-graph connectivity infor-
mation can be extracted from it. Data flow graph modeling has been described
in the following section.
2.2 Data Flow Graphs
In data-flow-graph representations, the nodes represent computations (functions
or subtasks) and the directed edges represent data paths (communication between
nodes). Each edge has a nonnegative number of delays associated with it. For
example, Fig. 2.1(b) is a data-flow-graph of the computation y(n) = ay(n −
1) + x(n). Node A represents addition while node B represents multiplication.
The edge from node A to B contains one delay while edge from B to A has no
delay. Associated with each node is its execution time in terms of normalized
units (u.t.). For example, the execution time of node A is 2 u.t. while that of
node B is 4 u.t.
2.2.1 Loop Bound and Iteration Bound
A loop is a directed path that begins and ends at the same node, such as the
path A → B → A in Fig. 2.1(b). Given that the execution times of nodes A and
B are 2 and 4 u.t. respectively, one iteration of the loop requires 6 u.t. This is
the loop bound, which represents the lower bound on the loop computation time.
Formally, the loop bound of the l-th loop is defined as t_l/w_l, where t_l is the loop
computation time and w_l is the number of delays in the loop. The loop bound
for the DFG in Fig. 2.1 is 6/1 = 6 u.t.
Figure 2.1: (a) Block diagram of y(n) = ay(n−1) + x(n) (b) DFG representation.

The critical loop of a DFG is the loop with the maximum loop bound. This
is known as the iteration bound [21], [23] of the DSP program, which determines
the lower bound on the sample period regardless of the amount of computing
resources available. Formally, the iteration bound is defined as

T∞ = max_l { t_l / w_l }.    (2.3)
For loop 1 (L1) in Fig. 2.2(b) the loop bound is 2 u.t. while for loop 2 (L2) it
is 3 u.t. The slower loop L2 determines the iteration bound, or the minimum
possible sample period for the system, which is 3 u.t. in this case. The iteration bound
sets the fundamental limit on the achievable throughput of the system during
retiming, as will be explained later in Chapter IV.
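For small graphs, T∞ in (2.3) can be computed by brute-force loop enumeration. A stdlib sketch using the Fig. 2.1 example, with node times and edge delays as given in the text (a valid DFG is assumed, i.e., every loop contains at least one delay):

```python
def simple_cycles(edges):
    """Enumerate simple cycles of a small directed graph by DFS,
    keeping one rotation of each (the one starting at the min node)."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append(v)
    cycles = []
    def dfs(start, node, path):
        for nxt in adj.get(node, []):
            if nxt == start:
                cycles.append(path[:])
            elif nxt not in path:
                dfs(start, nxt, path + [nxt])
    for s in adj:
        dfs(s, s, [s])
    return [c for c in cycles if c[0] == min(c)]

def iteration_bound(exec_time, edges):
    """T_inf = max over loops of (loop time / loop delays)."""
    delay = {(u, v): w for u, v, w in edges}
    best = 0.0
    for cyc in simple_cycles(edges):
        t = sum(exec_time[n] for n in cyc)
        w = sum(delay[(cyc[i], cyc[(i + 1) % len(cyc)])] for i in range(len(cyc)))
        best = max(best, t / w)  # every valid loop has w >= 1
    return best

# Fig. 2.1 DFG: A takes 2 u.t., B takes 4 u.t., one delay on A -> B
print(iteration_bound({"A": 2, "B": 4}, [("A", "B", 1), ("B", "A", 0)]))  # 6.0
```

Brute-force enumeration is exponential in general; the literature cited here uses polynomial algorithms, but the arithmetic per loop is exactly this ratio.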
2.2.2 Precedence Relations
Data-flow graphs capture the data driven property of DSP algorithms where any
node can fire (perform its computation) whenever all the input data are available.
This implies that a node with no input edges can fire at any time. Thus many
Figure 2.2: IIR Filter: (a) Block diagram (b) DFG representation with loops.
nodes can be fired simultaneously, leading to concurrency. Conversely, a node
with multiple input edges can only fire after all its precedent nodes have fired.
The latter case imposes the precedence constraints on a DFG, where each edge
describes a precedence relation between two nodes. This precedence constraint
is an intra-iteration constraint if the edge has zero delays, while it is called an
inter-iteration constraint (occurring between iterations) if the edge has one or more delays. For
example, the edge from node 2 to node 1 in Fig. 2.2(b) enforces the inter-iteration
constraint, which states that the execution of the k-th iteration of node 2 must
be completed before the (k+1)-th iteration of node 1. The edge from node 4 to
node 2 enforces the intra-iteration precedence constraint, which states that the
k-th iteration of node 4 must be executed before the k-th iteration of node 2.
Precedence relations enforce a set of constraints in the scheduling model, where a
particular operation can be scheduled only after all its precedent operations have
executed.
2.3 Data Flow Graph Model
A directed DFG is denoted as G = <V,E,d,w>, where the notation is as
follows:
• V: Set of vertices (nodes) of G. The vertices represent operations. The
number of nodes in G is |V|.
• E: Set of directed edges of G. A directed edge from node U ∈ V to node
V ∈ V is denoted as U → V. The edges represent communication between
the nodes. The number of edges in G is |E|.
• w(e): Number of delays on the edge e, also referred to as the weight of the
edge.
• d(U ): Pipeline depth of the node U.
The data-flow-graph is initially described in the Simulink environment using a
block-based description. We capture the DFG information in the form of an incidence
matrix A, a loop matrix B, a weight vector w and a pipeline vector du. Let A
be the incidence matrix of the graph G; this |V| × |E| matrix is described
as
α_{i,j} =   1, if edge j starts from node i,
           −1, if edge j ends in node i,
            0, if edge j does not start or end in node i.
The A matrix is generated by identifying the source and destination nodes for
every edge in the DFG.
Figure 2.3: (a) Architecture of a second order IIR filter (b) DFG representation
for the filter.
The B matrix is an |L| × |E| matrix where |L| is the total number of loops
in the DFG. It is defined as
β_{i,j} =   1, if edge j is in loop i,
            0, otherwise.
The B matrix is computed using Tarjan’s algorithm [20] (Appendix 2) in O((|V|+
|E|)(|L|+ 1)). The weight vector w is an |E|×1 vector. It is defined as
wi = number of delays (registers) on edge i .
The pipeline vector du is an |E|×1 vector. It is defined as
du_i = pipeline depth of the source node U of edge i (e_i : U → V).
Second-Order IIR Filter Example: The incidence and loop matrices [18]
for the second order IIR filter (Fig. 2.3) are shown below. The A matrix captures
the connectivity between the four nodes (rows in A) in the DFG while the B
matrix extracts the two loops (rows in B).
A =
[  1   1   0   0  -1 ]
[  0   0  -1  -1   1 ]
[ -1   0   1   0   0 ]
[  0  -1   0   1   0 ]

B =
[ 0  1  0  1  1 ]
[ 1  0  1  0  1 ]
The weight vector for the five edges in the IIR filter (Fig. 2.3) is given by
w^T = [1 2 0 0 1].
If the multipliers in the filter have a pipeline depth of m while the adders have a
depth of a then the pipeline vector takes the following form
du^T = [a a m m a].
The incidence, loop, weight and pipeline matrices/vectors provide a compact
representation of the flow-graph information. Once extracted from the data-
flow-graph these matrices are used to model the architectural transformations
described in the next chapter.
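The extraction of the incidence matrix from the edge list is mechanical. The sketch below is in Python (the thesis framework uses MATLAB) and assumes a 0-indexed edge list consistent with the second-order IIR example above:

```python
def incidence_matrix(num_nodes, edges):
    """Build the |V| x |E| incidence matrix: column j carries +1 at the
    source node and -1 at the destination node of edge j."""
    A = [[0] * len(edges) for _ in range(num_nodes)]
    for j, (u, v) in enumerate(edges):
        A[u][j] = 1
        A[v][j] = -1
    return A

# Edge list (source, destination) assumed for the second-order IIR filter.
edges = [(0, 2), (0, 3), (2, 1), (3, 1), (1, 0)]
A = incidence_matrix(4, edges)
```

The resulting rows correspond to nodes and the columns to edges, matching the A matrix shown for the filter.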
CHAPTER 3
Architectural Transformations
This chapter presents the details of architectural transformations such as retiming,
scheduling and unfolding (parallelism), their relative advantages, and their formu-
lation using data-flow graphs. Micro-architectural optimizations such as the use of
carry-save arithmetic and supply voltage scaling, and their impact on the energy
and performance of a design, are also discussed.
3.1 Retiming
Retiming changes the location of registers (delay elements) in a circuit in an
attempt to balance the logic depth between sequential elements and minimize
the critical path. A valid retiming solution must not change the input/output
functionality of the DFG.
Second-Order IIR Filter Example: Consider the DFG of an IIR filter
in Fig. 3.1(a) where the numbers in brackets indicate the computation time
associated with each processing node. This filter is described by
y(n) = w(n− 1) + x(n). (3.1)
However w(n) itself is recursively related to y(n) by
w(n) = ay(n− 1) + by(n− 2). (3.2)
Substituting the value of w(n) from (3.2) into (3.1) we get the final equation for
Figure 3.1: (a) Original DFG (b) Retimed DFG.
y(n) in (3.3).
y(n) = ay(n− 2) + by(n− 3) + x(n) (3.3)
Following a similar process we derive the input to output relation for the filter in
Fig. 3.1(b),
w1(n) = ay(n− 1) (3.4)
w2(n) = by(n− 2) (3.5)
y(n) = w1(n− 1) + w2(n− 1) + x(n) (3.6)
= ay(n− 2) + by(n− 3) + x(n).
Although the DFGs in Fig. 3.1 have delays at different locations, these filters
have the same input/output functionality and can be derived from each other
through retiming.
Retiming is mainly used to reduce the critical path in synchronous circuits.
The critical path of the filter in Fig. 3.1(a) (shown by the dashed line) passes
through one multiplier and one adder and has a computation time of 3 u.t. The
retimed filter in Fig. 3.1(b) has a critical path that passes through two adders and
has a computation time of 2 u.t. Reduction in critical path either translates to
improved throughput or lower power via supply voltage scaling as will be shown
later in this chapter.
3.1.1 Mathematical Model for Retiming
Retiming maps a data-flow-graph G to a retimed graph Gr. A retimed solution
is characterized by a value r(U) known as the retiming weight for each node U in
the graph. Let w(e) denote the weight of the edge e in the original graph G, and
let wr(e) denote the weight of the edge e in the retimed graph Gr. The weight
of the edge e : U → V in the retimed graph is computed from the weight of the
edge in the original graph using
w_r(e) = w(e) + r(V) − r(U),   r(V), r(U) ∈ Z   (3.7)
where Z is the set of integers.
The retiming values r(1) = 0, r(2) = 1, r(3) = 0 and r(4) = 0 translate
the DFG in Fig. 3.1(a) into the retimed DFG in Fig. 3.1(b). Retiming does not
alter the architecture of the design, hence the incidence and loop matrices remain
the same for the original and retimed DFG. However, the weights on the
edges change, and it is the weight vector that is transformed during retiming.
For the DFG in Fig. 3.1(a) the transformation (w^T → w_r^T) is shown below.
w^T = [1 2 0 0 1] → w_r^T = [1 2 1 1 0]
A retiming solution is feasible if w_r(e) ≥ 0 holds for all edges. It can be
proved that retiming does not alter the total number of delays in a loop of the
DFG [7]. This means that for a given loop in the DFG the sum of delays
on the edges of the loop remains unchanged after retiming. Mathematically,
this can be expressed as the product of the loop matrix B and the weight vector
w remaining constant:
Bw = Bw_r.   (3.8)
We can see that this property holds for the retimed IIR filter.
Bw = [ 0 1 0 1 1 ; 1 0 1 0 1 ] · [1 2 0 0 1]^T = [3 2]^T   (3.9)

Bw_r = [ 0 1 0 1 1 ; 1 0 1 0 1 ] · [1 2 1 1 0]^T = [3 2]^T   (3.10)
Since the number of delays in a loop is the same before and after retiming,
retiming cannot change the iteration bound of the DFG. This property sets the
fundamental limit on the minimum achievable critical path after retiming which
is equal to the iteration bound of the DFG. The smallest critical path obtained
after retiming is therefore limited by the slowest loop in the system.
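Checking a candidate retiming amounts to applying (3.7) edge by edge and verifying nonnegativity and the loop invariant (3.8). A minimal Python sketch, using an assumed 0-indexed edge list consistent with the IIR example and the retiming values quoted above:

```python
def retime_weights(w, edges, r):
    """Apply w_r(e) = w(e) + r(V) - r(U) for every edge e: U -> V."""
    return [we + r[v] - r[u] for we, (u, v) in zip(w, edges)]

edges = [(0, 2), (0, 3), (2, 1), (3, 1), (1, 0)]  # assumed edge list
w = [1, 2, 0, 0, 1]
r = [0, 1, 0, 0]            # r(1)=0, r(2)=1, r(3)=0, r(4)=0
wr = retime_weights(w, edges, r)   # -> [1, 2, 1, 1, 0]
assert all(d >= 0 for d in wr)     # feasibility: w_r(e) >= 0 on every edge
```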
3.1.2 Retiming for Clock Period Minimization
This algorithm was proposed by Leiserson et al. in [7] and has since been widely
used for retiming synchronous circuits to optimize the clock period. Mathemat-
ically, the minimum feasible clock period φ(G) for a graph G is defined as
φ(G) = max{t(p) : w(p) = 0} (3.11)
where t(p) denotes the logic delay of path p and w(p) denotes the number of
registers in path p. This implies that the clock period is determined by the
longest register-less path in the circuit (critical path).
Two quantities, W (U, V ) and D(U, V ) are used to implement this algorithm.
W (U, V ) is the minimum number of registers on any path from node U to node
V and D(U, V ) is the maximum computation time among all paths from U to V
with weight W (U, V ). Formally,
W (U, V ) = min{w(p) : U → V } (3.12)
D(U, V ) = max{t(p) : U → V and w(p) = W (U, V )}. (3.13)
The following algorithm can be used to compute W (U, V ) and D(U, V ).
• Let M = t_max · n, where t_max is the maximum computation time of the nodes
in G and n is the number of nodes in G.
• Form a new graph G′ which is the same as G except that the edge weights are
replaced by w′(e) = M·w(e) − t(U) for all edges e : U → V.
• Solve the all-pairs shortest-path problem on G′ using the Floyd-Warshall algo-
rithm [8], [37]. Let S_UV be the shortest path from U to V.
• W(U, V) = ⌈S_UV / M⌉ and D(U, V) = M·W(U, V) − S_UV + t(V).
The values of W (U, V ) and D(U, V ) are used to determine if there exists a
retiming solution that can achieve a desired clock period. Given a desired clock
period c, there is a feasible retiming solution r such that φG(r) ≤ c if the following
constraints hold:
• r(U) − r(V) ≤ w(e) for every edge e : U → V (feasibility constraint),
• r(U) − r(V) ≤ W(U, V) − 1 for all vertices U, V in G such that D(U, V) >
c (critical path constraint).
The feasibility constraint forces the number of delays on each edge in the retimed
graph to be nonnegative, and the critical path constraint enforces that all paths
without delays in the graph have computation time less than c. This procedure
is iteratively repeated for several monotonically decreasing values of c, until no
feasible retiming solution exists. At this point we get the retimed graph with
minimum possible critical path.
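The W/D computation above can be sketched as follows. This is an illustrative Python version (the node/edge containers are assumptions), checked against the two-node DFG of Fig. 2.1:

```python
import math

def w_d_matrices(nodes, edges, t):
    """W(U,V): minimum registers on any U->V path; D(U,V): maximum
    computation time among those minimum-register paths."""
    M = max(t.values()) * len(nodes)
    INF = float('inf')
    # Shortest paths on the reweighted graph w'(e) = M*w(e) - t(U).
    S = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    for (u, v, w) in edges:
        S[u][v] = min(S[u][v], M * w - t[u])
    for k in nodes:                      # Floyd-Warshall all-pairs shortest paths
        for i in nodes:
            for j in nodes:
                if S[i][k] + S[k][j] < S[i][j]:
                    S[i][j] = S[i][k] + S[k][j]
    W, D = {}, {}
    for u in nodes:
        for v in nodes:
            if u != v and S[u][v] < INF:
                W[(u, v)] = math.ceil(S[u][v] / M)
                D[(u, v)] = M * W[(u, v)] - S[u][v] + t[v]
    return W, D

# Fig. 2.1: A (t=2) -> B (t=4) with one delay, B -> A with none.
W, D = w_d_matrices(['A', 'B'], [('A', 'B', 1), ('B', 'A', 0)], {'A': 2, 'B': 4})
```

Both D values come out to 6 u.t., consistent with the 6 u.t. loop bound of that example.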
3.1.3 Retiming for Energy Efficiency
Retiming combined with supply voltage scaling can result in significant energy
savings, as illustrated in [22]. We illustrate this energy saving by scaling the
supply voltage of a butterfly unit in an FFT processor [27]. The butterfly in
Fig. 3.2 has a longer critical path than the retimed architecture in Fig.
3.3. This introduces a delay slack in the second design, which can be utilized to
scale the supply voltage and reduce energy. A 40% saving in energy was achieved
for this design after inserting an extra pipeline stage. Retiming therefore not
only helps in improving the throughput but can also be used to improve the
energy-efficiency of the system. Supporting results for energy savings achieved
after retiming will be presented in Chapter VI for a 16-tap FIR filter.
Figure 3.2: FFT butterfly.
3.2 Scheduling
Scheduling and allocation are two important tasks in the synthesis of DSP sys-
tems. Scheduling involves assigning every node of the DFG to control time steps
or clock cycles. Resource allocation is the process of assigning operations to hard-
ware with a goal of minimizing the amount of hardware required to implement
the desired algorithm.
Figure 3.3: Retimed FFT butterfly with Vdd-scaling.
The simplest scheduling technique is As Soon As Possible (ASAP) schedul-
ing [13], [24], where the operations in the data-flow-graph are scheduled step-by-
step from the first control step to the last. An operation is called a "ready opera-
tion" if all of its predecessors are scheduled. This procedure repeatedly schedules
ready operations to the next control step until all the operations are scheduled.
As Late As Possible (ALAP) scheduling [14] performs a very similar procedure as
ASAP. In contrast to ASAP, ALAP scheduling assigns the operations from the
last control step towards the first. An operation is scheduled to the next control
step once all its successors are scheduled. Figure 3.4 gives an example of ASAP
and ALAP scheduling. Since it is not practical to assign too many operations of
the same type into a control step due to the constraint on the number of func-
tion units, a variation of ASAP [28],[35] is to delay the ready operations when
their number exceeds the number of function units. Selection of the operations
to be delayed is arbitrary. The main problem with ASAP and ALAP scheduling
algorithms is that no priority is given to the nodes on the critical path. As a
result, less critical nodes may be scheduled ahead of critical nodes. This becomes
a problem under limited resource constraints because critical nodes will require
extra processing elements (PE) when other nodes have blocked all the available
PEs.
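The ASAP rule can be illustrated with a short sketch (Python, not part of the thesis framework; operations are assumed unit-time, the precedence graph acyclic, and resource limits are ignored):

```python
def asap_schedule(ops, preds):
    """Assign each operation the earliest control step at which all of
    its predecessors have completed (unit-time operations assumed)."""
    step = {}
    remaining = set(ops)
    while remaining:
        for o in sorted(remaining):
            # An operation is "ready" once every predecessor is scheduled.
            if all(p in step for p in preds.get(o, ())):
                ps = [step[p] for p in preds.get(o, ())]
                step[o] = 1 + max(ps) if ps else 0
                remaining.discard(o)
    return step

# Chain a -> b -> c plus an independent operation d.
sched = asap_schedule(['a', 'b', 'c', 'd'], {'b': ['a'], 'c': ['b']})
```

Note that d lands in step 0 alongside a, regardless of how critical a's chain is — exactly the weakness described above.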
Figure 3.4: (a) ASAP scheduling (b) ALAP scheduling.
To overcome the problems with ASAP and ALAP scheduling, list scheduling tech-
niques were developed. The list scheduling technique [15], [25], [16], which was orig-
inally used in microcode compaction [15], has been adopted by many high-level
synthesis systems. Similar to ASAP, the operations in the DFG are assigned
to control steps from the first control step to the last. The ready operations
are given a priority according to heuristic rules and are scheduled into the next
control step according to this predefined priority. When the number of sched-
uled operations exceeds the number of resources, the remaining operations are
delayed. The drawback of list scheduling is that this algorithm requires some
prior knowledge of the number of resources and, therefore, can only be applied
to resource constrained problems.
The third type of scheduling is "global" in the way it selects the next operation
to be scheduled and in the way it decides the control step in which to put it. There
are two variations: freedom-based scheduling and force-directed scheduling. In
freedom-based scheduling [30], the operations on the critical path are scheduled
first. The operations not on the critical path are assigned one at a time according
to their degree of freedom. In force-directed scheduling [33], ”force” values are
calculated for all operations at all feasible control steps. The pairing of operation
and control step that has the most attractive force is selected and assigned.
After the assignment, the forces of the unscheduled operations are re-evaluated.
Assignment and evaluation are iterated until all the operations are assigned.
Among the above scheduling techniques, list scheduling requires that the number
of function units be specified, while force-directed scheduling requires that the
maximum number of control steps be specified. They correspond to resource-
constrained and time-constrained scheduling, respectively.
Integer Linear Programming (ILP) models provide a formal method to de-
scribe and solve the scheduling problem in an optimal manner. These models
overcome all the drawbacks of the previous approaches and are able to solve for
schedules which either minimize the resources required or the total time required
for completion of all operations. They require longer execution time when com-
pared to previous approaches, but always guarantee the global optimum solution.
A detailed description of the ILP models along with a modified approach which
incorporates retiming with scheduling is described in Chapter IV.
Figure 3.5: (a) DFG of recursive algorithm (b) DFG of two-unfolded version.
3.3 Unfolding
Unfolding [31] is applied to DSP algorithms to create a DFG which describes
more than one iteration of the original algorithm. For example, the DFG in Fig.
3.5(a) represents the following relation
y(n) = ay(n− 1) + x(n). (3.14)
Replacing the index n with 2m and 2m + 1 gives the following relations which
describe a 2-unfolded version of the original DFG (Fig. 3.5(b)).
y(2m) = ay(2m− 1) + x(2m) (3.15)
y(2m + 1) = ay(2m) + x(2m + 1) (3.16)
Figure 3.6: (a) DFG of feedforward algorithm (b) DFG of two-unfolded version.
Figure 3.5 illustrates an important conclusion regarding unfolding of recursive
systems. Since these systems do not allow the insertion of extra latency, unfolding
them by a factor of J can only increase the critical path (dashed line in red) by
a factor of J . The critical path per iteration (critical path divided by unfolding
factor) or the throughput of the system therefore remains the same in this case.
With a feedforward system, on the other hand, it is possible to insert extra
latency and by strategically placing the registers at the optimum location (via
retiming) it is possible to improve the throughput considerably. An example for
this case is illustrated for an FIR filter in Fig. 3.6. The critical path for the
original architecture was the sum of an adder and a multiplier delay (Tadd + Tmult).
Unfolding by a factor of 2 followed by optimal placement of registers results in
the same critical path (Tadd + Tmult). The critical path per iteration is therefore
(Tadd + Tmult)/2, achieving a speed-up of 2×.
3.3.1 Mathematical Formulation for Unfolding
The J-unfolded DFG contains J times as many nodes and edges. The unfolding
procedure is given by the following two steps:
• For each node U in the original DFG, draw the J nodes U_0, U_1, ..., U_{J−1}.
• For each edge U → V with w delays in the original DFG, draw the J edges
U_i → V_{(i+w)%J} with ⌊(i + w)/J⌋ delays, for i = 0, 1, ..., J − 1 (% denotes the
modulo operation).
In unfolded systems each delay is J-slow. This means that if the input to a delay
element is the signal x(kJ + m) the output is the signal x((k− 1)J + m). In the
matrix domain the unfolded incidence matrix A_u is a J-fold replication of the
original matrix A. This increases the dimension of the A matrix from |V| × |E| to
J|V| × J|E|.
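The two unfolding steps translate directly into code. The sketch below is a Python illustration with an assumed (source, destination, delays) edge encoding, applied to the single-loop IIR of Fig. 3.5:

```python
def unfold(edges, J):
    """J-unfold a DFG given as (u, v, w) edges; nodes of the unfolded
    graph are (node, i) pairs for i = 0..J-1."""
    out = []
    for (u, v, w) in edges:
        for i in range(J):
            # Edge U_i -> V_{(i+w)%J} carries floor((i+w)/J) delays.
            out.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return out

# First-order IIR loop of Fig. 3.5: the add -> mult edge carries one delay.
unfolded = unfold([('add', 'mult', 1), ('mult', 'add', 0)], 2)
```

The total delay count across the unfolded edges equals that of the original edge, consistent with unfolding preserving the number of delays.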
The primary application of unfolding is in the design of high-speed or low-
power parallel architectures. The clock rate of each branch can be halved in a 2-way
parallel architecture, as shown in Fig. 3.7, while preserving the overall throughput.
This results in the creation of a delay slack which
can be used to scale the supply voltage and further reduce power. This concept
was illustrated in [22] where for the architecture in Fig. 3.7(b) a 64% saving in
energy was reported.
Figure 3.7: (a) Original architecture (b) Two-unfolded version with Vdd-scaling.
3.3.2 Carry-Save Arithmetic
Carry-save arithmetic (CSA) is a very useful micro-architectural transformation
when it comes to reducing the critical path of a multiplier. As shown in Fig.
3.8(a) [26] a conventional array multiplier must compute the carry and sum of
the partial products at each stage. This would involve rippling the carry through
each of the stages and increase the length of the critical path (dotted line). In
(Delay expressions annotated in Fig. 3.8: conventional array multiplier
t_mult ≈ (M+N−3)·t_carry + (N−1)·t_sum + (N−1)·t_and; carry-save multiplier
t_mult ≈ (N−1)·t_carry + (N−1)·t_and + t_merge.)
Figure 3.8: (a) Conventional array multiplier (b) Carry-save multiplier.
a carry-save implementation (Fig. 3.8(b)) on the other hand, the carry is not
rippled but saved and sent to the next stage. Each stage produces two outputs,
namely carry and sum which are then finally merged in the last stage using a
very fast adder (vector merging adder). This scheme reduces the critical path of
the multiplier considerably as compared to the conventional array multiplier, as
indicated in Fig. 3.8. For example, for N = 3, M = 4, the critical path reduces
from the delay of three half-adders and three full-adders to the delay of three
half-adders and a carry-propagate adder.
A second use of carry-save arithmetic is made in the addition of N numbers,
when a carry-save tree can be used to reduce the critical path. Figure 3.9 shows
an example of this where a set of nine numbers must be added. The carry-save
adder shown in Fig. 3.9 is a series of M full adder units where M is the number
of bits at the input. The first level of full-adder units operates on a set of 3 inputs
to generate the M-bit sum and carry. This operation is known as 3:2 compression,
since it takes in 3 inputs and converts them to 2 outputs, namely sum and carry.
The sum and carry propagate to the next level where a similar compression takes
Figure 3.9: Carry-save tree adding nine numbers.
place. This process repeats until we compress down to the final 3 inputs, which go
into the fast carry-propagate adder. The delay of the addition process in Fig. 3.9
is reduced to 4 full-adder delays and the final delay of the carry-propagate adder.
The carry-save tree implementation is particularly useful when multiplication
by a constant coefficient (as in the case of filters) is reduced to a series of
shifts and adds, as shown in Fig. 3.10. The final adder has to add a series of
M-bit numbers, which is efficiently done with the help of carry-save trees. Using
CSA optimization not only improves the critical path but can result in area
and energy savings as well. This is because to obtain the same performance
from a conventional adder we must either upscale the devices or use complex
structures like carry look-ahead adders. This not only increases the area of the design
but also results in larger switched capacitance, increasing the energy. Results
supporting this claim are presented in Chapter VI.
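The 3:2 compression described above can be mimicked on nonnegative integers with bitwise operations. This is an illustrative Python sketch, not a hardware model:

```python
def csa(a, b, c):
    """3:2 compressor: returns (s, carry) with a + b + c == s + carry.
    XOR forms the per-column sum; the majority function forms the carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def csa_tree(addends):
    """Reduce the addends three at a time until two remain, then perform a
    single carry-propagate addition (modeled here by Python's +)."""
    vals = list(addends)
    while len(vals) > 2:
        a, b, c = vals.pop(), vals.pop(), vals.pop()
        s, carry = csa(a, b, c)
        vals += [s, carry]
    return sum(vals)
```

Feeding nine numbers through `csa_tree` mirrors the reduction of Fig. 3.9: the carry never ripples until the final carry-propagate step.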
Figure 3.10: A multiplier implemented with shifts and adds.
The mathematical models presented for retiming and unfolding were imple-
mented in MATLAB. These models use the incidence and loop matrices to extract
the DFG information. The optimized architecture is then transformed into a new
Simulink model. Details of functional verification and synthesis of Simulink mod-
els will be discussed in Chapter V. Carry-save optimization is done automatically
during logic synthesis by Cadence backend tools. The next chapter is devoted to
the description of the ILP model used in our framework for scheduling integrated
with retiming.
CHAPTER 4
ILP Model for Scheduling
This chapter describes the Integer Linear Programming model used for scheduling
and retiming of architectures. The ILP model presented attempts to minimize
the number of processing elements required to execute a finite set of operations
in N time steps (clock cycles). An example of scheduling is shown in Fig. 4.1
where the number of time steps N has been set to 4. The operations
(o_i, p_j) have been scheduled onto a set of resource elements (Proc_i) in four time
steps. Each operation takes a single clock cycle to complete. The table shows
the distribution of the operations across the time steps.
Figure 4.1: Data-flow-graph and corresponding schedule.
The formal ILP model used to construct the schedule uses the following set
of variables.
• x_{i,j}: binary variable associated with node i; x_{i,j} = 1 if node i is scheduled
in time step j, else x_{i,j} = 0,
• M_p: number of resource elements of type p (e.g. adders, multipliers),
• c_p: cost associated with each resource of type p,
• r(U) ∈ Z: retiming weight associated with node U,
• N: folding factor/number of time steps in which all operations in the algo-
rithm must be scheduled.
The value of N remains fixed in each run of the ILP. Scheduling in essence folds
the DFG by a factor of N such that a new input sample arrives every N clock
cycles in the scheduled DFG. To maintain real-time latency constraints a delay in
the original DFG maps to N delays in the scheduled DFG. The edge e : U → V
with w(e) delays originally and retiming weights r(V ) and r(U) maps to
w(e) → N(w(e) + r(V )− r(U)) (4.1)
in the scheduled DFG.
If the source node U is pipelined by a depth d(U) during scheduling, d(U)
number of delays from the outgoing edge e will be used for pipelining. Pipelining
is essential in scheduled circuits to enable higher throughput since performance
degrades by a factor of N with input coming every N clock cycles. Taking the
pipeline depth d(U) of the source node U into account, w(e) now maps to the
form in (4.2).
w(e) → N(w(e) + r(V )− r(U))− d(U) (4.2)
The schedule value p(V) of node V is defined as
p(V) = Σ_j j·x_{V,j},   j ∈ {0, 1, ..., N − 1},   (4.3)
so that p(V) ∈ {0, 1, 2, ..., N − 1}.
The value p(V ) for the node V gives the time step in which the node executes.
Additional delays amounting to the difference in the schedule values of nodes U
and V are introduced on the edge to maintain precedence relations. The final
number of delays on the edge e of the scheduled DFG is given by
fe = N(w(e) + r(V )− r(U))− d(U) + p(V )− p(U). (4.4)
The above expression is referred to as the folding equation [32] and gives the
number of delays on any edge of the DFG after scheduling in N time steps. fe
will hence be referred to as the folded delay.
The resource utilization cost of a schedule is modeled as
C_total = Σ_p c_p · M_p.   (4.5)
This cost is a weighted sum of the processing elements in the schedule, with the
weight c_p representing the individual cost associated with M_p, the number of
processing elements of type p.
The scheduled architecture must execute all operations in N time steps while
also preserving the functionality of the original algorithm. This imposes several
constraints on the ILP formulation which are modeled as follows.
• Each node in the DFG can be scheduled only once over the N time steps.
This condition gives |V| constraint equations of the form in (4.6).
Σ_j x_{i,j} = 1,   i ∈ {1, 2, ..., |V|}   (4.6)
• The total number of operations of type p scheduled in any time step cannot
exceed the number of resource elements M_p of type p.
Σ_{i∈p} x_{i,j} ≤ M_p,   j ∈ {0, 1, ..., N − 1}   (4.7)
• The delay on each edge after scheduling cannot be negative. This condition
gives the final |E| constraints.
N(w(e) + r(V) − r(U)) − d(U) + p(V) − p(U) ≥ 0   (4.8)
For a compact description of the constraints in (4.8) we express them in matrix form
in (4.9):
f = Nw + NAr + g ≥ 0.   (4.9)
Here f is the |E| × 1 vector of folded delays and r is the |V| × 1 vector of the
retiming weights associated with the nodes. The elements of the vectors w, Ar,
g are w(e), r(V) − r(U), and p(V) − p(U) − d(U), respectively.
The constraints in (4.7) ensure that the operations scheduled in every control
step can be processed by the available computing resources. The constraints
in (4.8), (4.9) maintain the input/output functionality of the original algorithm
after scheduling.
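For tiny graphs the ILP can be stood in for by exhaustive search, which makes the constraint set above easy to sanity-check. The sketch below is Python for illustration only; it assumes a single resource type and fixes the retiming weights at zero:

```python
from itertools import product

def min_resource_schedule(ops, edges, d, N):
    """Try every assignment of ops to time steps 0..N-1; keep the one that
    satisfies the folded-delay constraint (4.8) with r = 0 and minimizes
    the peak number of operations in any step."""
    best_peak, best_p = None, None
    for assign in product(range(N), repeat=len(ops)):
        p = dict(zip(ops, assign))
        # Constraint (4.8) with r = 0: N*w(e) - d(U) + p(V) - p(U) >= 0.
        if any(N * w - d[u] + p[v] - p[u] < 0 for (u, v, w) in edges):
            continue
        peak = max(sum(1 for o in ops if p[o] == j) for j in range(N))
        if best_peak is None or peak < best_peak:
            best_peak, best_p = peak, p
    return best_peak, best_p

# Two same-type ops joined by a zero-delay edge, no pipelining, N = 2.
peak, sched = min_resource_schedule(['V1', 'V2'], [('V1', 'V2', 0)],
                                    {'V1': 0, 'V2': 0}, 2)
```

Here the search finds that staggering the two operations lets a single processing element serve both, which is exactly the resource minimization the ILP performs.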
Figure 4.2: (a) Original DFG (b) Retimed DFG (c) Scheduled architecture.
4.1 Scheduling and Retiming
The benefit of retiming simultaneously with scheduling is demonstrated on the
DFG in Fig. 4.2. Nodes V1 and V2 represent operations of the same type which
have to be scheduled onto the single available resource V in Fig. 4.2(c). The
operation V1 is scheduled in the first time step (p(V1) = 1) while V2 is scheduled
in the second time step (p(V2) = 2). If the processing element V needs to be
pipelined by two stages (d(V1) = 2) to satisfy throughput constraints, we get the
following folding equation for the delays on edge e1:
f_e1 = N(r(V2) − r(V1)) − d(V1) + p(V2) − p(V1)   (4.10)
     = 2(r(V2) − r(V1)) − 1.
From the constraint f_e1 ≥ 0 we get
r(V2) − r(V1) ≥ 1/2.   (4.11)
Since r(V2) − r(V1) ∈ Z, (4.11) can be rewritten as
r(V2) − r(V1) ≥ ⌈1/2⌉ = 1.
Without retiming, the variables r(V1) = r(V2) = 0 and the two operations cannot
be scheduled onto the resource V. If retiming is allowed, a delay can move from
the outgoing edge of node V2 into edge e1, making r(V2) = 1 and r(V1) = 0.
The retiming variables now satisfy the constraint in (4.11) and the schedule is
feasible. This example illustrates how the larger movement of delays across the DFG
with retiming can help produce more area-efficient results.
A second advantage of retiming with scheduling is seen in feedforward al-
gorithms where insertion of extra registers at the input/output only increases
latency without affecting the functionality. These extra registers can be used
to pipeline the resource elements and speed up the schedule. It must be noted
that in latency constrained systems the retiming weights of the input and output
nodes will be restricted by the maximum allowable latency.
4.2 Modified ILP
Retiming variables are unbounded in the integer space and introduce exponential
time complexity when ILPs are solved using branch and bound methods. The
ILP model described in [17] minimizes resources after enforcing the vector r = 0 (no
retiming). We propose an approach where retiming is incorporated in the ILP
model, but the retiming vector is decoupled from the ILP so that the unbounded
variables do not increase the runtime exponentially. Since the retiming variables
are all integers, the constraint in (4.9) can be rewritten as
−Ar ≤ ⌊w + g/N⌋ = w + ⌊g/N⌋.   (4.12)
This simplification uses the lemma
⌊k + x⌋ = k + ⌊x⌋ if k ∈ Z.   (4.13)
An integral solution to the inequalities in (4.12) can be obtained using the
Bellman-Ford shortest-path algorithm [8]. A necessary and sufficient condition
for the existence of a solution to (4.12) is
B(w + ⌊g/N⌋) ≥ 0.   (4.14)
This condition is easily explained with the following example. Without loss of
generality, consider a set of 3 inequalities of the form in (4.12) such that
the sum of their left-hand sides is zero:
r(1) − r(2) ≤ c1   (4.15)
r(2) − r(3) ≤ c2   (4.16)
r(3) − r(1) ≤ c3   (4.17)
The above set of inequalities can have a solution only when
c1 + c2 + c3 ≥ 0.   (4.18)
If the system of inequalities in (4.12) is represented using a constraint graph, such
that
• U ∈ {1, 2, ..., |V|} are the nodes of the graph,
• an edge e from U to V exists if the inequality r(V) − r(U) ≤ w(e) + ⌊g(e)/N⌋
exists,
• an edge e from U to V is weighted by w(e) + ⌊g(e)/N⌋,
then the inequalities from (4.15)-(4.17) will form a loop in the constraint graph.
The condition in (4.18) implies that the sum of the weights of all edges in any loop
of the constraint graph must be non-negative. This reduces to the constraint in
(4.14) since the constraint graph in this case is the original data-flow-graph and
the loops in the graph are given by the B matrix. Hence the retiming inequalities
in (4.12) need not be a part of the ILP. If the constraint in (4.14) is introduced
in the ILP, it will be ensured that the retiming vector can be solved by Bellman-
Ford once scheduling is complete. The constraint in (4.14) does not contain any
unbounded integer variables and therefore the CPU runtime for this formulation
is not increased significantly. Bellman-Ford solves the retiming inequalities in O(|V||E|) time, converging quickly to a solution.
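The retiming inequalities in (4.12) form a system of difference constraints, which maps directly onto a single-source shortest-path problem: add a virtual source with zero-weight edges to every node, relax all edges, and read the retiming values off the resulting distances; a negative-weight cycle (a violation of (4.14)) is detected by one extra relaxation pass. A minimal sketch in Python, assuming the constraint values c(e) = w(e) + ⌊g(e)/N⌋ have already been computed:

```python
def solve_retiming(num_nodes, constraints):
    """Solve the difference constraints r(v) - r(u) <= c via Bellman-Ford.

    `constraints` is a list of (u, v, c) triples. Returns a feasible
    retiming vector, or None when some cycle in the constraint graph has
    negative total weight, i.e. when condition (4.14) is violated.
    """
    INF = float("inf")
    # A virtual source (index num_nodes) reaches every node at cost 0,
    # making the whole system jointly solvable from one origin.
    edges = list(constraints) + [(num_nodes, v, 0) for v in range(num_nodes)]
    dist = [INF] * num_nodes + [0]
    for _ in range(num_nodes):            # |V| relaxation passes
        for u, v, c in edges:
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
    # One extra pass: any further relaxation implies a negative cycle.
    for u, v, c in edges:
        if dist[u] + c < dist[v]:
            return None
    return dist[:num_nodes]               # r(v) = shortest distance to v

# The 3-inequality loop of (4.15)-(4.17): feasible iff c1 + c2 + c3 >= 0.
r = solve_retiming(3, [(1, 0, 2), (2, 1, -1), (0, 2, 0)])     # sum = 1
bad = solve_retiming(3, [(1, 0, -2), (2, 1, -1), (0, 2, 0)])  # sum = -3
```

The second call reproduces the loop of (4.15)-(4.17) with c1 + c2 + c3 < 0, for which no retiming vector exists.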
The constraint in (4.14), however, introduces the nonlinear floor function into a linear model. To model the floor function linearly we introduce extra variables t(e) and q(e) in the ILP. This increases the variable complexity of the ILP from O(N|V|) (without retiming) to O(N|V| + 2|E|) (with retiming). However, as shown below, these variables are bounded and therefore do not increase the CPU runtime significantly. The constraint in (4.14) is then modeled as follows:
B(t+w) ≥ 0. (4.19)
Here t is an |E| × 1 vector with individual elements t(e) given by

t(e) = ⌊g(e)/N⌋. (4.20)
Since t(e) represents the floor of g(e)/N, the following relation will always hold:

g(e)/N = t(e) + frac(g(e)/N). (4.21)

The fractional part frac(g(e)/N) is modeled by the variable q(e), and the new set of |E| constraints for the modified ILP is expressed in (4.22):

t(e) = g(e)/N − q(e), q(e) ∈ [0, 1), t(e) ∈ Z (4.22)

g(e) = p(V) − p(U) − d(U) (4.23)
The variable q(e) is bounded between 0 and 1, and the bounded value of g(e) ∈ [−N − 1 − max(d(U)), N − 1] bounds t(e). The equation in (4.22) completes the description of the modified ILP with bounded variables.
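The decomposition in (4.20)-(4.22) can be checked numerically; note that t(e) must be the true floor, not the truncation, of g(e)/N, which matters when g(e) is negative. A small sketch (the values of g and N below are hypothetical):

```python
def linearize_floor(g, N):
    """Split g/N into integer part t and fractional part q per (4.20)-(4.22)."""
    t = g // N        # true floor division (not truncation toward zero)
    q = g / N - t     # fractional remainder, always in [0, 1)
    return t, q

# Floor and truncation differ for negative edge quantities:
print(linearize_floor(7, 4))    # t = 1, q = 0.75
print(linearize_floor(-7, 4))   # t = -2, q = 0.25
```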
The modified ILP model has been used in our optimization framework to gen-
erate more area- and throughput-efficient schedules compared to the ILP which
does not incorporate retiming. Supporting results are presented for benchmark
DSP algorithms in Chapter VI. The next chapter details the optimization flow in Simulink, HDL generation, RTL synthesis, and the energy estimation methods used.
CHAPTER 5
CAD Design Flow
The system optimization flow for automating the architectural transformations is detailed in this chapter. The flow starts from the direct-mapped DFG representation of the algorithm in Simulink. This is followed by extraction of the incidence and loop matrices of the DFG in MATLAB. Based on circuit-level energy-delay sensitivity results and system specifications (throughput, area, power), a suitable selection of transformations is made and the direct-mapped architecture is optimized. This process is outlined in the flow chart shown below; details of each step in the process follow.
Figure 5.1: Design and optimization flow.
5.1 Simulink Modeling
Simulink is a graphical environment embedded within MATLAB, useful for high-level modeling and evaluation of algorithms. It contains a pre-defined library of components like adders, multipliers, registers, and multiplexers, as well as more complex dedicated blocks like FFT and CORDIC (Fig. 5.2). In addition, it is possible to create a user-defined library with custom blocks. This feature becomes particularly useful when dealing with large designs that have user-specified complex macros. Examples of such custom blocksets include the commercially available Xilinx XSG and Synplify DSP.
Figure 5.2: Simulink pre-defined library (Synplify DSP blockset).
QAM Communication System Example: A Simulink model for a Quadrature Amplitude Modulated (QAM) communication system is shown in Fig. 5.3. The model takes two random integers as input and modulates them into a QAM signal. This signal is low-pass filtered (raised-cosine filter) to constrain it within the allowable bandwidth. The signal then passes through a channel with white Gaussian noise (the SNR of this channel can be user-specified). The received signal is again low-pass filtered and demodulated to recover the transmitted symbol. The entire system can easily be emulated using Simulink blocks as shown in Fig. 5.3. This modeling also captures wordlength quantization effects, since the baseband processing unit (low-pass filter) is implemented using finite-precision arithmetic (supported by the Synplify DSP blockset).
Figure 5.3: Baseband processing in a QAM system.
Results of bit error rate simulation for the system are shown in Fig. 5.4. The system was tested with both finite-precision arithmetic (16-bit datapath, 14-bit fractional length) and floating-point arithmetic (ideal full precision). We
Figure 5.4: BER vs. SNR curve for the QAM system.
see a small degradation in bit error rate owing to quantization noise introduced
in the fixed-point arithmetic.
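The fixed-point format used above (16-bit datapath, 14-bit fractional length) can be emulated with a simple round-and-saturate quantizer; a sketch, assuming round-to-nearest (the rounding mode actually used by the blockset is not specified here):

```python
def quantize(x, total_bits=16, frac_bits=14):
    """Round x to a signed fixed-point grid of total_bits with frac_bits fraction."""
    scale = 1 << frac_bits                      # 2^14 levels per unit
    lo = -(1 << (total_bits - 1))               # most negative code
    hi = (1 << (total_bits - 1)) - 1            # most positive code
    code = max(lo, min(hi, round(x * scale)))   # round to nearest, then saturate
    return code / scale

# The quantization step is 2^-14, so the rounding error is at most 2^-15:
err = abs(quantize(0.123456789) - 0.123456789)
```

The quantization noise introduced by this rounding is what produces the small BER degradation visible in Fig. 5.4.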
Figure 5.5: Simulink model for an 8-tap FIR filter.
The next example focuses on verification and synthesis of the baseband filter
Figure 5.6: Input with normalized frequencies of 0.03 & 0.4, and corresponding output that passes the lower frequency.
in the above communication system. A direct-mapped structure for the reference
architecture was created using library blocks from Synplicity and the filter was
implemented with 16-bit fixed-point arithmetic. The direct-mapped architecture
for the 8-tap low-pass FIR filter is shown in Fig. 5.5. Functional verification
of the model can be carried out by applying appropriate inputs. The inputs
can either be generated in Simulink from blocks like frequency synthesizers or
can be user-specified from the MATLAB workspace. Simulation results can be exported to the MATLAB workspace or viewed with the help of the scope display block in the library. Simulation results for the FIR filter are shown in Figs. 5.6
and 5.7. The FIR structure in Fig. 5.5 is a low-pass filter with cut-off at 0.2 rad/s. To verify its frequency-selective nature, an input which was a combination of two sinusoidal frequencies, 0.03 rad/s and 0.4 rad/s, was applied to the filter. Figure 5.6 illustrates how the filter passes the 0.03 rad/s input (passband) and suppresses the 0.4 rad/s input (transition band) in the output (the dashed line is the lower-frequency output).
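The same two-tone test is easy to reproduce outside Simulink. The sketch below uses a hypothetical 8-tap moving-average low-pass filter in place of the thesis's raised-cosine taps, but shows the identical verification idea: the slow tone survives while the fast tone is attenuated:

```python
import math

def fir_filter(coeffs, x):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                acc += h * x[n - k]
        y.append(acc)
    return y

# Hypothetical 8-tap moving-average low-pass filter (stand-in taps only).
h = [1.0 / 8] * 8

# Two-tone input: a slow passband tone plus a fast tone the filter rejects.
N = 256
x = [math.sin(2 * math.pi * 0.01 * n) + math.sin(2 * math.pi * 0.25 * n)
     for n in range(N)]
y = fir_filter(h, x)

# Past the start-up transient, the fast tone is attenuated (the 8-tap
# moving average even has an exact null at 0.25 cycles/sample), so the
# output peak stays near the slow tone's unit amplitude.
peak = max(abs(v) for v in y[32:])
```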
Figure 5.7: Activity factor for a 16-bit sinusoidal input of normalized frequency 0.25.
5.2 RTL Generation and Synthesis
Simulink is also convenient as a tool because it can be used to bridge the gap between high-level description of algorithms and physical synthesis of architectures. SynDSPTool, which is a part of the SynDSP blockset, is capable of automatically generating synthesizable RTL in Verilog/VHDL from the Simulink models. HDL descriptions of the architectures can then be synthesized with backend tools like Cadence RC Compiler or Synopsys Design Compiler. Synthesis results from the architectures then provide us with accurate area and throughput results.
Input switching activity can be extracted in MATLAB from target test vectors
Figure 5.8: Energy-area-delay tradeoffs at the circuit and micro-architectural
level.
and then propagated during synthesis for accurate energy estimates. Figure 5.7 shows the input switching activity for a 16-bit sinusoidal input applied to the FIR filter. The results indicate that the activity is higher for the LSB bits and decreases as we move towards the MSB. Tools like Cadence RC Compiler take in the input switching activity information and propagate the switching probabilities across the whole design to provide energy estimates. Hence we can now fully characterize an architecture in the energy-area-delay space in an automated fashion starting from the Simulink description.
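The per-bit switching activity of Fig. 5.7 is simply the fraction of clock cycles in which each bit of the two's-complement input word toggles. A sketch of the extraction (a slowly varying tone is used here so the waveform sweeps many code values; the exact test vectors behind Fig. 5.7 are not reproduced):

```python
import math

def bit_activity(samples, bits=16):
    """Per-bit toggle probability for a stream of two's-complement integers."""
    mask = (1 << bits) - 1
    toggles = [0] * bits
    for prev, cur in zip(samples, samples[1:]):
        diff = (prev ^ cur) & mask            # bits that switched this cycle
        for b in range(bits):
            toggles[b] += (diff >> b) & 1
    n = len(samples) - 1
    return [t / n for t in toggles]           # index 0 = LSB, 15 = MSB

# 16-bit sinusoid with 14 fractional bits; a slow tone so the waveform
# sweeps many code values between samples.
sig = [round(math.sin(2 * math.pi * 0.013 * n + 0.1) * (1 << 14))
       for n in range(1024)]
act = bit_activity(sig)
# act[0] (the LSB) toggles on roughly half the cycles; act[15] (the sign
# bit) toggles only twice per sinusoid period.
```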
5.3 Architectural Optimization
Based on energy-delay sensitivity results [1] from synthesis of the direct-mapped
architecture and the underlying macros we can figure out which transformations
get us closest to the system specifications. This concept is illustrated in Fig.
5.8 where the impact of time-multiplexing, parallelism, pipelining etc. has been
Figure 5.9: Choosing values of N , P and R based on energy-delay sensitivity.
shown in the energy-area-delay space [38]. Time-multiplexing reduces the area of the design but increases the energy due to additional energy consumption in control circuitry like multiplexers, memories, etc. Parallelism coupled with supply voltage scaling, on the other hand, helps reduce the energy while increasing the area. Pipelining can improve the throughput or reduce energy similarly to parallelism, but at a reduced area overhead. The benefits of pipelining saturate, however, when the delay overhead introduced by the registers begins to dominate the reduction achieved in the critical path.
Depending upon the system specifications like throughput, area or power
consumption and the optimization objective, the degree of time-multiplexing (N),
parallelism (P ) and extra latency introduced via retiming (R) can be set by the
user (Fig. 5.9) [42]. If a lower area is desired at a fixed supply voltage with a loss
in the achievable throughput, then we opt for a higher value of N . Increasing
the value of N also increases the energy consumption due to additional energy
overhead in the control circuitry. For a fixed supply voltage retiming (R) and
parallelism (P ) can only improve the throughput of the system. The bounds on
the values of N and P are determined by the throughput constraints and the
lower limit on the supply voltage. The value of R is bounded by the maximum
I/O latency which can be inserted. This value is significant only in feed-forward portions of the architecture, where extra registers can be introduced. For recursive structures, retiming can only balance the logic depth between registers to obtain the lowest possible critical path (R = 0 for feedback structures). Retiming
and parallelism coupled with supply voltage scaling improves the energy-efficiency
as was explained earlier in Chapter III.
For example, for the FIR filter (Fig. 5.5), if the objective is to reduce the area roughly by a factor of two by trading off speed, then we set the degree of time-multiplexing to N = 2. On the other hand, if the objective is to double the throughput or improve energy efficiency, then parallelism (P) can be employed. For a given set of system specifications it is possible to find several architectures which meet the system constraints, in which case the architecture which best meets the optimization objective must be selected for synthesis.
The values of N, P and R are next sent to the MATLAB/Simulink-based optimizer as illustrated in Fig. 5.1. The optimizer first extracts the connectivity information from the direct-mapped architecture in the form of incidence and loop matrices (Chapter III). The matrices are used to model the constraints in the ILP set up in MOSEK, an optimization tool embedded in MATLAB. After the ILP simulations are complete, the optimizer uses the results to automatically construct the Simulink model for the resulting optimized architecture. The optimized model can be synthesized in the target technology to verify whether it meets the system constraints or whether there is a need for further refinement. In the
Figure 5.10: Time-multiplexed and parallel implementation of a 16-tap FIR filter.
latter case, the values of N, P and R can be changed and the process repeated until the system constraints and optimization objective are met. The user can iteratively generate multiple architectures for the target algorithm by varying the values of N, P and R, which enables effective exploration of the design space. This process was carried out for the FIR filter, and the Simulink-level results of time-multiplexing (N=2), parallelizing (P=4) and retiming (R=1) for the FIR filter are shown in Fig. 5.10. A detailed discussion of the synthesis results of time-multiplexing, retiming and parallelizing this filter follows in the next chapter.
Figure 5.11: Control circuitry using M-Control blocks.
5.3.1 Controller Generation
Automating the generation of control circuitry is an important part of high-level
synthesis. This has been done with the aid of M-Control blocks from Synplicity’s
Synplify DSP blockset [39] in Simulink. The M-Control block is a MATLAB
function which generates certain outputs in response to certain input patterns.
The example in Fig. 5.11 illustrates the use of this block to generate controllers
for scheduled architectures. The M-Control function for this architecture can be
written as
function [Sel] = M_Control(count)
if (count == 1)
    Sel = 1;   % In1 is the output of the mux
elseif (count == 4)
    Sel = 2;   % In2 is the output of the mux
end
end
The controller block must route the correct signal into the processing elements
every clock cycle, hence the output of the block depends on the schedule of
the processing elements. The generation of MATLAB scripts for the M-Control
blocks has been automated in this work. The control script is a collection of
if-then-else statements which generate the correct value of the select signal for
the multiplexers at every control step (clock cycle). A parameterized function
which accepts the schedule of the processing elements and the registers in the
design was written in MATLAB to generate this control script. This MATLAB
script can then be translated into synthesizable RTL using the SynDSP function
embedded in the Synplicity DSP blockset.
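The automated generation of the control script can be sketched as a small generator that turns a schedule (control step → mux select) into the if/elseif MATLAB text shown above; the schedule below is hypothetical, matching the two-input mux of Fig. 5.11:

```python
def make_mcontrol(name, schedule):
    """Emit MATLAB text for an M-Control function from a schedule.

    `schedule` maps a control step (counter value) to the mux select value
    that routes the correct operand into the shared processing element.
    """
    lines = [f"function [Sel] = {name}(count)"]
    keyword = "if"
    for step, sel in sorted(schedule.items()):
        lines.append(f"{keyword} (count == {step})")
        lines.append(f"    Sel = {sel};")
        keyword = "elseif"
    lines.append("end")   # closes the if/elseif chain
    lines.append("end")   # closes the function
    return "\n".join(lines)

# Hypothetical schedule matching Fig. 5.11: In1 at step 1, In2 at step 4.
script = make_mcontrol("M_Control", {1: 1, 4: 2})
```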
We have now described the complete optimization framework, which includes architecture modeling starting from the Simulink description, functional verification of the model, and finally RTL synthesis via backend tools. The next chapter will discuss the results of our formal approach to architectural optimization.
CHAPTER 6
Results
This chapter compares high-level and logic synthesis results obtained from exist-
ing ILP scheduling and the modified scheduling model which integrates retiming
(outlined in Chapter IV). This is followed by a discussion on the design-space ex-
ploration results of a 16-tap FIR filter (used in ultra-wide-band applications). The
energy-area-throughput results obtained by scheduling, retiming and parallelizing
this filter in 90 nm CMOS technology are presented. Hierarchical design-space
exploration is illustrated for a multi-core MIMO sphere decoder.
6.1 Comparison of Existing and Modified Scheduling
Algorithms
The modified ILP was verified on a general class of feedforward and recursive algorithms which exhibit varying degrees of structural complexity. The feedforward algorithms selected were a 16-tap FIR filter and an 8-point discrete cosine transform (DCT), while the recursive algorithms include second-order IIR, four-stage lattice and elliptic wave digital filters (Fig. 6.4). The ILP simulations for high-level synthesis were run using the ILOG/OPL optimization tool on a 32-bit Intel Core 2 CPU running at 2.0 GHz.
Table 6.1: Comparison of scheduling and scheduling with Bellman-Ford retiming.

                     Adder  Mult   Adds     Adds      Mults    Mults     CPU (s)  CPU (s)   CPU (s)
Design         N     pipe   pipe   (Sched)  (Sch&BF)  (Sched)  (Sch&BF)  (Sched)  (Sch&BF)  (Sch&Retime)
Wave           3     0      1      -        12        -        4         -        0.25      5376
Digital        4     0      1      8        7         4        2         0.18     1.25      2.25
Filter         8     0      1      4        4         2        1         13.9     45.5      20.5
               16    1      2      -        3         -        1         -        264       >6000
2-stage        2     0      1      8        4         8        4         0.28     0.26      0.30
IIR            2     0      2      -        4         -        4         -        0.28      0.28
Filter         4     0      2      4        2         4        2         0.26     0.30      0.30
4-stage        2     0      2      -        6         -        8         -        0.13      0.3
Lattice        2     1      2      -        6         -        8         -        0.26      0.26
Filter         3     0      2      6        4         8        5         0.26     0.25      0.25
               3     1      2      -        4         -        5         -        0.26      0.26
               4     0      2      4        3         5        4         0.28     0.21      0.26
               4     1      2      -        3         -        4         -        0.26      0.26
8-point        2     0      1      16       16        16       8         0.14     0.26      0.26
DCT            3     0      2      16       11        16       6         0.15     0.15      0.40
(1-D)          3     1      2      -        11        -        6         -        0.15      0.15
               4     0      2      8        8         8        4         0.29     0.28      0.26
               4     1      2      -        8         -        4         -        0.25      0.25
               8     1      2      5        4         4        2         6.0      0.25      0.26
16-tap         2     0      1      15       8         14       8         0.25     0.28      0.25
FIR            2     1      2      -        8         -        8         -        0.25      0.28
               4     1      2      8        4         7        4         0.20     0.20      0.26
               8     1      2      3        2         3        2         0.36     0.25      0.26
In addition, RTL synthesis was done in 90 nm CMOS technology for selected architectures (FIR, wave digital and lattice filters) to investigate the area-performance tradeoff offered by scheduling (the existing ILP) and scheduling with Bellman-Ford (BF) retiming (the modified ILP).
Table 6.1 compares the high-level synthesis results for both approaches. In all cases considered, the modified ILP outperforms the existing ILP scheduling model in terms of the number of resource elements (adders/multipliers) needed to execute the algorithm. For several combinations of folding factor (N) and pipeline depth, the existing ILP model could not reach a feasible solution, indicating that the modified ILP can traverse the area-throughput space more effectively.
A comparison of simulation runtimes was made among three approaches: scheduling, scheduling with BF retiming, and scheduling with unbounded retiming variables in the ILP (unbounded ILP). For all the algorithms considered except the wave digital filter, the modified ILP and the unbounded ILP have runtimes comparable with the existing ILP. Although the modified ILP and the unbounded ILP converge to the same solution when minimizing resource count, their runtimes differ significantly for the wave digital filter (Fig. 6.4), which is a complex structure with a large number of loops (the A and B matrices have high dimensions). The runtime with unbounded variables in the ILP degrades rapidly in this case since it has a larger search space to cover (> 6000 s for N=16). The runtime of the modified ILP, on the other hand, shows a more graceful degradation (264 s for N=16) owing to the reduced search space.
Figures 6.1, 6.2, 6.3 show the synthesis results for a fifth-order elliptic wave
digital filter (WDF), 16-tap FIR and a 4-stage lattice filter, respectively. The
area and throughput numbers have been normalized to the reference architec-
Figure 6.1: Synthesis results for a fifth order elliptic wave digital filter.
Figure 6.2: Synthesis results for a 16-tap FIR filter.
Figure 6.3: Synthesis results for a 4th order all pole lattice filter.
ture (the original architecture, which is not scheduled). It was observed in all three examples that an equal reduction in throughput does not, in most cases, yield an equal reduction in area (traversing from point A to point B in Fig. 6.1 results in a 38% reduction in throughput but only a 16% reduction in area). This trend is expected, since the increase in register and controller area after scheduling partly offsets the area reduction achieved by lowering the number of resource elements. Also, with a higher degree of scheduling it is not always possible to increase the degree of pipelining of the resource elements, which results in a lower-than-expected throughput. Both these factors contribute to the area-delay product [34] becoming greater than 1 (the area-delay product for the reference is 1) for scheduled architectures (Table 6.2).
As mentioned earlier, scheduling with BF retiming is able to produce results for a larger combination of folding factors and pipeline depths, allowing a higher throughput for scheduled architectures (Fig. 6.3). The area results for the
Table 6.2: Comparison of normalized area-delay product for scheduling and scheduling with Bellman-Ford retiming.

Design    Scheduling   Scheduling & BF   Gain
WDF       1.5011       1.2795            14.76 %
Lattice   3.6329       1.4456            60.208 %
FIR       3.9530       2.9328            25.808 %
modified ILP also show an improvement over the existing approach (Fig. 6.2) due to the larger movement of delays across the design owing to retiming (explained in detail in Chapter IV). For a fair comparison between the area and throughput of the architectures generated by the two approaches, we compute the average area-delay product of the synthesized results (Table 6.2). The numbers in Table 6.2 indicate the mean value of the area-delay product of the synthesized architectures for the three examples. These numbers were further averaged across all three examples, and a 33% average reduction in the area-delay product was observed for the modified ILP. The results clearly demonstrate that retiming integrated with scheduling produces more area- and throughput-efficient architectures when compared to scheduling without retiming.
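The Gain column of Table 6.2 and the quoted 33% average follow directly from the tabulated area-delay products:

```python
# Normalized area-delay products from Table 6.2: (scheduling, scheduling + BF).
adp = {
    "WDF":     (1.5011, 1.2795),
    "Lattice": (3.6329, 1.4456),
    "FIR":     (3.9530, 2.9328),
}

# Gain = relative reduction in area-delay product from adding BF retiming.
gains = {k: (sched - bf) / sched * 100 for k, (sched, bf) in adp.items()}
avg = sum(gains.values()) / len(gains)
# gains reproduces the table (~14.76%, ~60.21%, ~25.81%); avg is ~33.6%.
```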
6.2 Design Space Exploration: 16-tap FIR Filter
The optimization flow detailed in Chapter V was first verified on a 16-tap FIR filter because of its simplicity and well-understood structure. Transformations like scheduling, retiming and parallelism were applied to the filter, combined with supply voltage scaling and micro-architectural techniques such as carry-save arithmetic [29]. The result was an array of optimized architectures, each unique in the energy-area-performance space. A comparison of these architectures
has been made in Figs. 6.5, 6.6 with contour lines connecting the architectures
which have the same throughput.
Figure 6.4: Fifth-order wave digital elliptic filter.
Figure 6.5 shows the effect of carry-save optimization on the direct-mapped reference architecture. The reference architecture without carry-save arithmetic (CSA) consumes a larger area and is slower compared to the design which employs CSA optimization. To achieve the same reference throughput (set at 100 Ms/s for all architectures during logic synthesis), the architecture without CSA must upsize its gates or use complex adder structures like carry-look-ahead, which increases the area and switched capacitance, leading to an increase in energy consumption as well. The CSA-optimized architecture still performs better in terms of achievable throughput, which highlights the effectiveness of this technique.
Following CSA optimization the design is retimed to further improve the
throughput. From Fig. 6.5 we see that retiming improves the achievable through-
Figure 6.5: Synthesis results for retimed and time-multiplexed FIR filters.
Figure 6.6: Increase in register area with retiming.
put from 350 Ms/s to 395 Ms/s (13%) but also results in a small area increase (3.5%). The area increase is attributed to the movement of registers from a single output edge to multiple input edges, as shown in Fig. 6.6, where the register count increases from one to two.
Figure 6.7: (a) FIR with an extra latency at the output (b) Retimed version.
Increasing the input-to-output latency in feedforward systems also results in considerable throughput enhancement. This is illustrated in Fig. 6.7(b), where retiming at the Simulink level cuts the critical path from Tadd + Tmult down to Tmult. The results from logic synthesis (Fig. 6.5) show a 30% throughput improvement from 395 Ms/s to 516 Ms/s. The area increases roughly by 22% due to the extra register insertion shown in Fig. 6.7(b). Retiming during logic synthesis performs fine-grain pipelining inside the multipliers to balance the logic depth across the design. This step improves the throughput to 623 Ms/s (20% increase).
Scheduling the filter results in an area reduction of about 20% compared to the retimed reference architecture, and a throughput degradation of about 40%. The area reduction is small for this filter, since the number of taps in the design is small and the decrease in adder and multiplier area is offset by the increased area of the registers and multiplexers. Retiming the scheduled architecture results in a 12% improvement in throughput but also a 5% increase in area due to the larger number of registers, as explained earlier in the chapter.
Supply-voltage scaling is another degree of freedom available when exploring the design space. From energy and throughput results at the nominal supply voltage (Vdd = 1 V for 90 nm CMOS) we can obtain throughput and energy numbers at reduced supply by using the models for delay and power in (6.1), (6.2):

1/Throughput = K · Vdd / (Vdd − Vth)^α (6.1)

Power ∝ Vdd^2 · Throughput (6.2)
The equation in (6.1) is the well-known alpha-power law model used for computing the logic delay of a circuit [40]. The value of α typically ranges between 1 and 2 depending upon the target technology.
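The scaling curves can be reproduced from (6.1) and (6.2) alone; the sketch below assumes illustrative values Vth = 0.25 V and α = 1.5, since the thesis does not fix these constants here:

```python
def throughput_scale(vdd, vdd_nom=1.0, vth=0.25, alpha=1.5):
    """Throughput at vdd relative to nominal, from the alpha-power law (6.1).

    vth and alpha are assumed, illustrative values; they are not fixed by
    the thesis for this 90 nm process.
    """
    def speed(v):
        return (v - vth) ** alpha / v   # proportional to 1/delay
    return speed(vdd) / speed(vdd_nom)

def energy_scale(vdd, vdd_nom=1.0):
    """Switching energy per operation scales as Vdd^2, per (6.2)."""
    return (vdd / vdd_nom) ** 2

# Scaling from 1.0 V down to 0.6 V trades speed for a large energy saving:
t_rel = throughput_scale(0.6)   # roughly half the nominal throughput here
e_rel = energy_scale(0.6)       # 0.36x the energy per operation
```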
Figure 6.5 illustrates how both the throughput and the energy scale with decreasing Vdd. The retimed reference architecture can operate between 100 Ms/s and 395 Ms/s (4× variation in throughput) for Vdd values between 0.35 V and 1 V in a 90 nm technology. It is interesting to note that for this 4× change in throughput we achieve a 3× change in energy. The system designer therefore has the option to trade off throughput for increased energy efficiency by applying supply voltage scaling.
The unfolding algorithm was applied to the 16-tap FIR to generate the architectures shown in Fig. 6.8. Note that the Simulink models and RTL for these architectures were generated automatically by the optimizer. The parallelism variable P has been varied from 2 to 12 to generate a series of architectures which exhibit a range of throughputs and energy efficiencies. The
Figure 6.8: Synthesis results for parallel FIR filters.
throughput varies from 40 Ms/s to 3.4 Gs/s, while the energy efficiency ranges from 0.5 GOPS/mW to 5 GOPS/mW. It is possible to improve the energy efficiency significantly with continued Vdd scaling if sufficient delay slack is available. The supply voltage has been scaled in the 90 nm technology over the range 1 V to 0.32 V. In Fig. 6.8 we see a clear tradeoff between energy/throughput and area. The final choice of architecture will ultimately depend on the throughput constraints, available area and power budget.
6.3 Hierarchical Design: Multi-Core MIMO Sphere De-
coder
Optimization of complex architectures in Simulink can be done hierarchically if
Energy-Delay sensitivity [1] results for the smaller modules in the system are
Figure 6.9: Multi-core MIMO sphere decoder.
available. We take the example of a MIMO sphere decoder architecture (Fig. 6.9) [41],[42] to exhibit this hierarchical extension. The processing element (PE) of the decoder has to find the best possible match for the transmitted symbol within a pre-defined search radius of the symbol constellation. The workload of searching for the correctly decoded symbol can either be handled by a single PE or distributed across multiple PEs (multi-core architecture) [41]. This is equivalent to parallelism: with more processing elements, the incoming symbol can be decoded more quickly or with higher energy efficiency (by scaling Vdd as explained in Chapter IV). To obtain energy-delay trade-off curves for the multi-core architecture it is
Figure 6.10: Simulink model for the multi-core MIMO sphere decoder.
sufficient to extrapolate results from the single-core architecture. The maximum throughput achieved by the single-core design was 100 Ms/s at a supply voltage of 1 V, taking up an area of 0.55 mm2. For the decoder to work at a higher throughput or higher energy efficiency we must vary the degree of parallelism (P). This corresponds to architectures with a higher number of processing elements. A 16-core architecture automatically generated in Simulink is shown in Fig. 6.10, with the scheduler controlling the communication between the PEs. The energy-delay tradeoff curves for the decoder architecture with varying numbers of PEs are shown in Fig. 6.11. We see a 10× tuning range in energy efficiency when varying the degree of parallelism from P = 1 to P = 16 and scaling the supply voltage between 1 V and 0.32 V. A range of throughputs from 100 Ms/s to 1.5 Gs/s can also be achieved if the supply voltage is maintained at 1 V.
Figure 6.11: Synthesis results for the multi-core MIMO sphere decoder.
The results presented in this chapter show the efficiency of integrating retiming with scheduling in the ILP model. Design-space exploration results show the effect of each transformation (scheduling, retiming, parallelism, CSA, Vdd scaling) in the energy-area-delay space. Also highlighted is the automatic generation of multiple architectures (Simulink model and RTL) for a given algorithm. This allows the user to pick the design which best meets the system specifications and optimization objective. The next chapter concludes this thesis with a summary of research contributions.
CHAPTER 7
Conclusions & Future Work
To conclude, we summarize the main contributions of this work and also discuss
possible directions for future research.
7.1 Summary of Research Contributions
• Developed an automated flow for optimizing DSP architectures, starting
from architectural modeling in Simulink followed by MATLAB optimization
through various architectural transformations.
• The optimization flow uses Synplicity and Cadence backend tools to integrate RTL synthesis and power estimation in the framework. Suitable architectural transformations are decided upon based on the system constraints (throughput, area-energy budget) and energy-delay (E-D) sensitivity results extracted from the circuit level.
• Developed a modified ILP scheduling model which integrates retiming and
improves the energy-delay product by 33% on an average compared to re-
sults from ILP without retiming.
• The worst case CPU runtime for the modified ILP is much better compared
to the case where unbounded retiming variables are present in the ILP. For
the fifth-order elliptic wave digital filter example the modified ILP achieved
almost 20× reduction in worst case CPU runtime.
• Hierarchical optimization is illustrated for a complex MIMO sphere decoder
kernel based on circuit-level results of the underlying macros.
7.2 Future Work
• Extend the optimization framework to support multi-rate systems.
• Include the effect of interconnects at the Simulink level by developing suitable models for wire delay and power.
• Develop area and power models for memory units (SRAM, DRAM) at the
Simulink level.
• Include support for dynamic scheduling in a real-time environment for
applications like software-defined radios.
Appendix 1: GUI Environment
A graphical user interface (GUI) was built in MATLAB to provide an easy interface
for applying the transformations. The user must first create the reference
(direct-mapped) architecture using Simulink, Synplify DSP, or other user-defined
components. Once created, this model will appear in the GUI's listbox
menu under the header Simulink Model. The model must be selected from the
menu in order to load it into MATLAB's workspace. The next step is entering the
design components used by the reference architecture (e.g., adders and multipliers
for a filter). If the design components are pipelined, the pipeline depth must
also be specified under the header Pipeline depth. Hitting the Extract Model
button in the GUI then extracts the incidence, loop, weight and pipeline
matrices/vectors from the Simulink model. All the relevant DFG connectivity
information is now present in the MATLAB workspace. This step also reports the
total number of components present in the reference design (15 adders and 16
multipliers for the filter example shown in Fig. 7.1).
Extraction of the incidence, loop and other matrices is independent of the nature
of the components used in the Simulink model. At present the user must provide
the details of the components in the model and their exact path through the GUI;
we plan to automate this process in the future. From the Select Design
Components menu the user selects the blocks that are used in the reference design.
As the design components are selected, their names appear under the header
Design Components. It is assumed that the user has either synthesized the
reference architecture or hierarchically extrapolated results from the underlying
macros to compute its energy, area and performance. Depending upon the target
specifications, the user now decides on the degree of pipelining, parallelism or
folding/scheduling to be applied to the reference architecture.

Figure 7.1: GUI built within MATLAB to facilitate transformations.

The degree of pipelining for each component in the model can be set by entering
values in the box next to the name of the component (which appears under the
header Design Components). The degrees of scheduling (N), retiming (R) and
parallelism (P) can likewise be set by entering the corresponding values in the
boxes next to Schedule, Retime and Parallel, respectively.
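As an illustration of the kind of connectivity data the extraction step places in the workspace, the following sketch (written in Python rather than MATLAB, with hypothetical variable names) builds the incidence matrix and a loop-delay count for a first-order IIR filter y[n] = a·y[n-1] + x[n], whose DFG has one adder, one multiplier, and a single loop containing one register:

```python
# Hypothetical sketch of the matrices extracted from a DFG, for a
# first-order IIR filter y[n] = a*y[n-1] + x[n].
# Nodes: 0 = adder, 1 = multiplier.
# Edges: 0 = adder -> multiplier (through one register),
#        1 = multiplier -> adder (no register).
nodes = ["add", "mult"]
edges = [(0, 1), (1, 0)]   # (source node, destination node)
w = [1, 0]                 # register (delay) count on each edge

# Incidence matrix: +1 where an edge leaves a node, -1 where it enters.
A = [[0] * len(edges) for _ in nodes]
for e, (src, dst) in enumerate(edges):
    A[src][e] += 1
    A[dst][e] -= 1

# The single loop add -> mult -> add traverses both edges once; its row
# in the loop matrix B selects them, and B*w gives the loop's registers.
B = [[1, 1]]
loop_delays = [sum(b * wi for b, wi in zip(row, w)) for row in B]

print(A)            # [[1, -1], [-1, 1]]
print(loop_delays)  # [1]: one register around the loop
```

The loop-delay count is exactly the quantity needed to check retiming feasibility and the iteration bound of Chapter 2.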
Once the values of N, P and R have been set, the transformations can be
applied by hitting the Generate scheduled architecture, Generate parallel
architecture and Generate retimed architecture buttons, respectively.
The transformed Simulink model, which uses Synplify DSP components, will open
automatically once the transformations are complete. The name of the new
model is set to ’test1’ by default but can be changed by the user. Two
transformations can be applied consecutively to a given design by following the
above procedure twice. For example, we may first parallelize a design to obtain
the unfolded architecture, then use the new model as the reference and retime it.
This GUI is still in the development phase and will soon be fully functional
and available for use.
Appendix 2: Tarjan’s Algorithm
Tarjan’s algorithm [20] for finding the elementary circuits (the B matrix) of a directed graph:
procedure BACKTRACK(integer v, logical result f);
begin
    logical g;
    f := false;
    place v on point stack;
    mark(v) := true;
    place v on marked stack;
    for each w in A(v) do
        if w < s then
            delete w from A(v)
        else if w = s then
        begin
            output circuit from s to v to s
                given by point stack;
            f := true;
        end
        else if not mark(w) then
        begin
            BACKTRACK(w, g);
            f := f or g;
        end;
    comment: f = true if an elementary circuit containing
             the partial path on the stack has been found;
    if f = true then
    begin
        while top of marked stack != v do
        begin
            u := top of marked stack;
            delete u from marked stack;
            mark(u) := false;
        end;
        delete v from marked stack;
        mark(v) := false;
    end;
    delete v from point stack;
end;

procedure LOOP-ENUMERATION;
begin
    integer s;
    logical flag;
    for i := 1 to n do mark(i) := false;
    for s := 1 to n do
    begin
        BACKTRACK(s, flag);
        while marked stack not empty do
        begin
            u := top of marked stack;
            mark(u) := false;
            delete u from marked stack;
        end;
    end;
end;
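The procedure above transcribes almost line for line into executable form. The sketch below (in Python, with hypothetical names; circuits are collected in a list instead of being output) is one way to realize it, skipping vertices below the root s rather than deleting them from A(v):

```python
def elementary_circuits(adj):
    """Enumerate the elementary circuits of a directed graph.

    adj: dict mapping each vertex (1..n) to its list of successors.
    Returns circuits in discovery order, each as a list of vertices.
    """
    circuits = []
    point_stack = []                       # current partial path
    marked = {v: False for v in adj}
    marked_stack = []

    def backtrack(v, s):
        f = False
        point_stack.append(v)
        marked[v] = True
        marked_stack.append(v)
        for w in adj[v]:
            if w < s:
                continue                   # vertices below the root are ignored
            if w == s:
                circuits.append(point_stack[:])  # circuit from s to v to s
                f = True
            elif not marked[w]:
                f = backtrack(w, s) or f
        if f:
            # a circuit was found: unmark everything stacked above v
            while marked_stack[-1] != v:
                marked[marked_stack.pop()] = False
            marked_stack.pop()
            marked[v] = False
        point_stack.pop()
        return f

    for s in sorted(adj):                  # loop enumeration over roots
        backtrack(s, s)
        while marked_stack:
            marked[marked_stack.pop()] = False
    return circuits
```

For example, `elementary_circuits({1: [2], 2: [1, 3], 3: [1]})` returns `[[1, 2], [1, 2, 3]]`, i.e., the circuits 1→2→1 and 1→2→3→1.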
References
[1] D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, and R.W. Brodersen. Methods for true energy-performance optimization. IEEE Journal of Solid-State Circuits, 39(8):1282–1293, August 2004.
[2] T. Gemmeke et al. Design optimization of low-power high-performance DSP building blocks. IEEE Journal of Solid-State Circuits, 39(7):1131–1139, 2004.
[3] J. Rabaey, C. Chu, P. Hoang, and M. Potkonjak. Fast prototyping of datapath-intensive architectures. IEEE Design and Test of Computers, 8(2):40–51, June 1991.
[4] W.R. Davis, N. Zhang, K. Camera, F. Chen, D. Markovic, N. Chan, B. Nikolic, and R.W. Brodersen. A design environment for high-throughput low-power dedicated signal processing systems. IEEE Journal of Solid-State Circuits, 37:420–430, 2002.
[5] D. Markovic, R.W. Brodersen, and B. Nikolic. A 70GOPS 34mW multi-carrier MIMO chip in 3.5mm2. 2006 Symposia on VLSI Technology and Circuits, pages 158–159, 2006.
[6] A. Poon, D. Tse, and R. Brodersen. An adaptive multiple-antenna transceiver for slowly flat fading channels. IEEE Transactions on Communications, 51(13):1820–1827, 2003.
[7] C.E. Leiserson and J.B. Saxe. Optimizing synchronous circuitry using retiming. Algorithmica, 2(3):211–216, 1991.
[8] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, November 2001.
[9] M.C. Papaefthymiou and K.N. Lalgudi. Retiming edge-triggered circuits under general delay models. IEEE Transactions on Computer-Aided Design, 16(12):1393–1408, 1997.
[10] Y. Yi and R. Woods. Hierarchical synthesis of complex DSP functions using IRIS. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(5):806–820, 2006.
[11] V.P. Roychowdhury and T. Kailath. Study of parallelism in regular iterative algorithms. Proceedings of the Second Annual ACM Symposium on Parallel Algorithms and Architectures, pages 367–376, 1990.
[12] S.K. Rao and T. Kailath. Regular iterative algorithms and their implementation on processor arrays. Proceedings of the IEEE, pages 259–269, 1988.
[13] C. Tseng and D.P. Siewiorek. Automated synthesis of datapaths in digital systems. IEEE Transactions on Computer-Aided Design, CAD-5:379–395, July 1986.
[14] S.Y. Kung, H.J. Whitehouse, and T. Kailath. VLSI and Modern Signal Processing. Prentice Hall, 1985.
[15] S. Davidson et al. Some experiments in local microcode compaction for horizontal machines. IEEE Transactions on Computers, pages 460–477, July 1981.
[16] H. De Man, J. Rabaey, J. Six, and P. Claesen. Cathedral-II: A silicon compiler for digital signal processing. IEEE Design and Test of Computers, pages 13–25, 1986.
[17] C.T. Hwang, J.H. Lee, and Y.C. Hsu. A formal approach to the scheduling problem in high level synthesis. IEEE Transactions on Computer-Aided Design, 10(4):464–474, April 1991.
[18] T.C. Denk and K.K. Parhi. Exhaustive scheduling and retiming of digital signal processing systems. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 45(7):821–838, July 1998.
[19] M. Potkonjak and J.M. Rabaey. Optimizing resource utilization using transformations. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(3):277–292, March 1994.
[20] R.E. Tarjan. Enumeration of the elementary circuits of a directed graph. SIAM Journal on Computing, 2(3):211–216, 1973.
[21] T.P. Barnwell and C.J.M. Hodges. Optimal implementations of signal flow graphs on synchronous multiprocessors. Proceedings of the International Conference on Parallel Processing, August 1982.
[22] A. Chandrakasan, S. Sheng, and R. Brodersen. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27(4):473–484, 1992.
[23] D.Y. Chao and D.T. Wang. Iteration bounds of single-rate data flow graphs for concurrent processing. IEEE Transactions on Circuits and Systems, 40(9):629–634, July 1993.
[24] C.H. Gebotys and M.I. Elmasry. A VLSI methodology with testability constraints. Canadian Conference on VLSI, October 1987.
[25] C.Y. Hitchcock and D.E. Thomas. A method of automatic datapath synthesis. Design Automation Conference, pages 484–489, July 1983.
[26] A. Chandrakasan, J.M. Rabaey, and B. Nikolic. Digital Integrated Circuits: A Design Perspective. Prentice Hall, 2003.
[27] J.G. Proakis and D. Manolakis. Digital Signal Processing. Macmillan, 1992.
[28] P. Marwedel. A new synthesis algorithm for the MIMOLA software system. Design Automation Conference, pages 271–277, July 1986.
[29] T.G. Noll. Carry-save arithmetic for high-speed digital signal processing. IEEE International Symposium on Circuits and Systems, 2:982–986, 1990.
[30] B.M. Pangrle and D.D. Gajski. State synthesis and connectivity binding for microarchitecture compilation. International Conference on Computer-Aided Design, pages 210–213, November 1986.
[31] K.K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. Wiley, 1999.
[32] K.K. Parhi, C.Y. Wang, and A.P. Brown. Synthesis of control circuits in folded pipelined DSP architectures. IEEE Journal of Solid-State Circuits, 27(1):29–43, January 1992.
[33] P.G. Paulin and J.P. Knight. Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Transactions on Computer-Aided Design, 8:661–679, November 1989.
[34] Z.X. Shen and C.C. Jong. Functional area lower bound and upper bound on multicomponent selection for interval scheduling. IEEE Transactions on Computer-Aided Design, 19(7):745–759, 2000.
[35] H. Trickey. Flamel: A high-level hardware compiler. IEEE Transactions on Computer-Aided Design, CAD-6:259–269, March 1987.
[36] C.Y. Wang and K.K. Parhi. Dedicated DSP architecture synthesis using the MARS design system. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1253–1256, May 1991.
[37] R.W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6):345, June 1962.
[38] D. Markovic, B. Nikolic, and R.W. Brodersen. Power and area minimization for multidimensional signal processing. IEEE Journal of Solid-State Circuits, 42(4):1253–1256, April 2007.
[39] http://www.synplicity.com/products/synplifydsp/.
[40] T. Sakurai and A.R. Newton. Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE Journal of Solid-State Circuits, 25(2):584–594, April 1990.
[41] C.H. Yang and D. Markovic. A flexible VLSI architecture for extracting diversity and spatial multiplexing gains in MIMO channels. To appear at the International Conference on Communications, 2008.
[42] R. Nanda, C.H. Yang, and D. Markovic. DSP architecture optimization in MATLAB/Simulink environment. To appear at the 2008 Symposium on VLSI Circuits, 2008.