University of California
Los Angeles
DSP Architecture Optimization
in MATLAB/Simulink Environment
A thesis submitted in partial satisfaction
of the requirements for the degree
Master of Science in Electrical Engineering
by
Rashmi Nanda
2008
© Copyright by
Rashmi Nanda
2008
The thesis of Rashmi Nanda is approved.
Miodrag Potkonjak
Mani B. Srivastava
Dejan Markovic, Committee Chair
University of California, Los Angeles
2008
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview of Previous Work . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Architecture Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Representations of DSP Algorithms . . . . . . . . . . . . . . . . . 11
2.2 Data Flow Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Loop Bound and Iteration Bound . . . . . . . . . . . . . . 13
2.2.2 Precedence Relations . . . . . . . . . . . . . . . . . . . . . 14
2.3 Data Flow Graph Model . . . . . . . . . . . . . . . . . . . . . . . 16
3 Architectural Transformations . . . . . . . . . . . . . . . . . . . . 19
3.1 Retiming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.1 Mathematical Model for Retiming . . . . . . . . . . . . . . 21
3.1.2 Retiming for Clock Period Minimization . . . . . . . . . . 22
3.1.3 Retiming for Energy Efficiency . . . . . . . . . . . . . . . 24
3.2 Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Unfolding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Mathematical Formulation for Unfolding . . . . . . . . . . 31
3.3.2 Carry-Save Arithmetic . . . . . . . . . . . . . . . . . . . . 32
4 ILP Model for Scheduling . . . . . . . . . . . . . . . . . . . . . . . 36
4.1 Scheduling and Retiming . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Modified ILP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5 CAD Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1 Simulink Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2 RTL Generation and Synthesis . . . . . . . . . . . . . . . . . . . . 50
5.3 Architectural Optimization . . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Controller Generation . . . . . . . . . . . . . . . . . . . . 55
6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.1 Comparison of Existing and Modified Scheduling
Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.2 Design Space Exploration: 16-tap FIR Filter . . . . . . . . . . . . 62
6.3 Hierarchical Design: Multi-Core MIMO Sphere Decoder . . . . . . 67
7 Conclusions & Future Work . . . . . . . . . . . . . . . . . . . . . . 71
7.1 Summary of Research Contributions . . . . . . . . . . . . . . . . 71
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Appendix: GUI Environment . . . . . . . . . . . . . . . . . . . . . . . 73
Appendix: Tarjan’s Algorithm . . . . . . . . . . . . . . . . . . . . . . 76
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
List of Figures
1.1 Algorithm-architecture-circuit-level interaction in the design-space. 3
1.2 Thesis organization. . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 (a) Block diagram of y(n) = ay(n− 1) + x(n) (b) DFG representation. . . . 14
2.2 IIR Filter: (a) Block diagram (b) DFG representation with loops. 15
2.3 (a) Architecture of a second order IIR filter (b) DFG representation
for the filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1 (a) Original DFG (b) Retimed DFG. . . . . . . . . . . . . . . . . 20
3.2 FFT butterfly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Retimed FFT butterfly with Vdd-scaling. . . . . . . . . . . . . . . 26
3.4 (a) ASAP scheduling (b) ALAP scheduling. . . . . . . . . . . . . 27
3.5 (a) DFG of recursive algorithm (b) DFG of two-unfolded version. 29
3.6 (a) DFG of feedforward algorithm (b) DFG of two-unfolded version. 30
3.7 (a) Original architecture (b) Two-unfolded version with Vdd-scaling. 32
3.8 (a) Conventional array multiplier (b) Carry-save multiplier. . . . . 33
3.9 Carry-save tree adding nine numbers. . . . . . . . . . . . . . . . . 34
3.10 A multiplier implemented with shifts and adds. . . . . . . . . . . 35
4.1 Data-flow-graph and corresponding schedule. . . . . . . . . . . . . 36
4.2 (a) Original DFG (b) Retimed DFG (c) Scheduled architecture. . 39
5.1 Design and optimization flow. . . . . . . . . . . . . . . . . . . . . 45
5.2 Simulink pre-defined library (Synplify DSP blockset). . . . . . . . 46
5.3 Baseband processing in a QAM system. . . . . . . . . . . . . . . . 47
5.4 BER vs. SNR curve for the QAM system. . . . . . . . . . . . . . 48
5.5 Simulink model for an 8-tap FIR filter. . . . . . . . . . . . . . . . 48
5.6 Input with normalized frequencies of 0.03 & 0.4, and corresponding
output that passes the lower frequency. . . . . . . . . . . . . . . . 49
5.7 Activity factor for a 16-bit sinusoidal input of normalized frequency
0.25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.8 Energy-area-delay tradeoffs at the circuit and micro-architectural
level. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.9 Choosing values of N , P and R based on energy-delay sensitivity. 52
5.10 Time-multiplexed and parallel implementation of a 16-tap FIR filter. 54
5.11 Control circuitry using M-Control blocks. . . . . . . . . . . . . . . 55
6.1 Synthesis results for a fifth order elliptic wave digital filter. . . . . 60
6.2 Synthesis results for a 16-tap FIR filter. . . . . . . . . . . . . . . . 60
6.3 Synthesis results for a 4th order all pole lattice filter. . . . . . . . 61
6.4 Fifth-order wave digital elliptic filter. . . . . . . . . . . . . . . . . 63
6.5 Synthesis results for retimed and time-multiplexed FIR filters. . . 64
6.6 Increase in register area with retiming. . . . . . . . . . . . . . . . 64
6.7 (a) FIR with an extra latency at the output (b) Retimed version. 65
6.8 Synthesis results for parallel FIR filters. . . . . . . . . . . . . . . . 67
6.9 Multi-core MIMO sphere decoder. . . . . . . . . . . . . . . . . . . 68
6.10 Simulink model for the multi-core MIMO sphere decoder. . . . . . 69
6.11 Synthesis results for the multi-core MIMO sphere decoder. . . . . 70
7.1 GUI built within MATLAB to facilitate transformations. . . . . . 74
List of Tables
6.1 Comparison of scheduling and scheduling with Bellman-Ford re-
timing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Comparison of normalized area-delay product for scheduling and
scheduling with Bellman-Ford retiming. . . . . . . . . . . . . . . . 62
Acknowledgments
I am sincerely grateful to my advisor Professor Dejan Markovic, without whose
help and support this thesis could not have been written. His constant drive for
perfection has taught me a great deal and I am truly indebted to him for the
patience he had with me on occasions when I faltered.
This work started out and has progressed based on Dejan's idea of design-
space exploration via architectural transformations. I wish to thank my group
members Chia-Hsiang Yang and Victoria Wang for the invaluable feedback I
received from them when working on this research project. Chia-Hsiang Yang
designed the sphere decoder which I have used as a design driver example. My
thanks also go out to the new members in my group, Sarah Gibson, Cheng-Cheng
Wang and Vaibhav Karkare for their insightful comments and discussions in the
group meetings. Professor Mani Srivastava and Professor Miodrag Potkonjak
provided helpful reviews of this thesis which helped me refine its contents.
This work could not have progressed without the critical infrastructure support
provided by Synplicity Incorporated and Cadence Design Systems. I am grateful
to the Synplicity team for the tools they provided us and the training sessions
they conducted to help us learn them. I found the Source-Link database provided
by Cadence to be extremely useful for logic and physical synthesis flows.
I would like to thank Rohit for his unwavering support through the course of
this project. He has always had the utmost faith in my abilities, and has never
fallen short of words of encouragement, especially at times when I needed them
most. We have also had some very lively discussions when I was preparing for
the prelim examination. I am indebted to Nitesh Singhal, Abhishek Ghosh and
Bibhu Dutta Sahoo who were always ready to discuss and clear any doubt I had
before the prelim examination.
I will forever remain grateful to my parents and my brother for their loving
support during my studies at UCLA.
Abstract of the Thesis
DSP Architecture Optimization
in MATLAB/Simulink Environment
by
Rashmi Nanda
Master of Science in Electrical Engineering
University of California, Los Angeles, 2008
Professor Dejan Markovic, Chair
Architectural optimization has traditionally been a heuristic process involving
multiple iterations before the design converges to the desired specifications. Mul-
tiple architectures are difficult to evaluate if RTL is written repeatedly for each
design. The process becomes tedious if the design fails to meet target specifica-
tions and changes need to be made at the system level. This work aims to auto-
mate the process of architecture selection and provide energy-area-performance
optimal solutions starting from the graphical timed data-flow MATLAB/Simulink
description of an algorithm.
Integrating functional blocks into system architecture requires the most ef-
ficient use of available resources for maximizing area-efficiency. In essence, this
task is accomplished with scheduling. Improved power efficiency requires that all
pipelines in a design be balanced, necessitating retiming at the micro-architecture
level. Parallelism is employed when building high-throughput or low-energy
systems. These architectural transformations have been implemented in MAT-
LAB/Simulink environment, using Integer Linear Programming (ILP) models.
This work proposes a modified ILP model which integrates scheduling and re-
timing, providing a 33% average reduction in area-delay product compared to
existing ILP models which do not incorporate retiming. In addition, a 20× reduction in
worst-case CPU runtime is achieved by the proposed method when compared to
existing ILP models which directly incorporate retiming with scheduling.
Optimization of complex structures is supported by hierarchically extend-
ing circuit-level results from the underlying macros. The entire framework al-
lows comparison of various architectural solutions for a given algorithm in the
energy-area-performance space. The high-level block-diagram based description
in Simulink conveniently maps to FPGA or ASIC. The method is applicable for
dedicated DSP algorithms like Fourier transforms as well as hierarchical struc-
tures with complex macros such as those in MIMO communications.
CHAPTER 1
Introduction
1.1 Motivation
Integrated circuit design is at an interesting juncture at present with continued
scaling of the underlying technology. The integration complexity has reached sev-
eral billion transistors, opening doors for the IC design market to capture a wide
variety of application domains. Cell phones, iPods, palmtops, and biomedical instru-
ments are only a few manifestations of this ever-growing trend. However,
we are still faced with several challenges in the design process; one of which is
making a suitable architecture selection for a given algorithm. As systems be-
come more and more complex and constraints on energy, area and performance
become tighter, it is not possible to single out a particular architecture as being
always optimal. Area is no longer the only optimization metric; the advent of
portable battery-operated devices has made lower power consumption a more de-
sirable target. In fact the choice of the optimal architecture is strongly dictated
by system specifications such as throughput, area and power consumption. A
simple FIR filter could have several different realizations depending on the sys-
tem it is being integrated with. For example, neural signal processing requires
very slow operation at low sample rates, in which
case the filter could be time-multiplexed. For a high-speed base-band processing
unit in a wireless LAN system, the same filter would have to be parallelized to
meet throughput requirements.
The aim of this work is to make it feasible for algorithm designers to an-
alyze the hardware cost associated with their algorithms in the energy-area-
performance space. This feedback helps designers in tuning their algorithms such
that system specifications can be met, and also in finding the optimal architecture
for a given algorithm. The choice of the optimal architecture is strongly dictated
by circuit-level energy-delay sensitivity results [1] of the underlying macros in the
design (Fig. 1.1). The interface between the micro-architectural level and the
circuit-level is not easy to navigate. This is because, if the design fails to meet
the target specifications at the circuit-level, then changes will have to be made at
the system level. This can mean major restructuring at the micro-architectural
level, like opting for a parallel design instead of a time-multiplexed one, which
then necessitates re-writing the RTL. The process becomes very tedious if this
design cycle re-iterates.
The answer to this problem lies in automating the generation of architectural
solutions (high-level models and RTL) for a given algorithm and also hierar-
chically extrapolating circuit-level measurements like energy, area and delay for
these architectures from synthesis results of the underlying macros.
A tool which automates this flow will offer designers a convenient medium to
explore the solution space of possible architectures and also analyze the tradeoffs
in the energy-area-performance space. This work proposes such a tool embedded
in MATLAB/Simulink which is a very comprehensive and easy-to-use graphical
platform. SystemC-based modeling was an alternative to Simulink for the imple-
mentation of this tool; however, that approach would involve C++ coding, which would
limit the use of the tool to those adept at programming. Moreover, a graphical format
allows easy visualization and understanding of the architectural transformations,
as compared to when architectures are described in C++ code.

Figure 1.1: Algorithm-architecture-circuit-level interaction in the design-space.
The reference architecture (direct-mapped version of the target algorithm)
has to be defined only once in Simulink using pre-existing or user-defined library
blocks. By use of data-flow-graph models this reference architecture is mathe-
matically modeled by a set of matrices. The matrix based representation makes
it convenient to apply transformations like scheduling, retiming, pipelining and
parallelism. The transformed architectures are then converted back from matri-
ces to Simulink models, making it possible for designers to functionally verify
the new design. Modeling complex DSP kernels in a hierarchical manner is also
very convenient with this approach since data-flow-graph models are essentially
independent of the internal complexity of the processing kernel.
The optimization discussed in this work is not limited to making architectural
choices alone. Tools like Synplify DSP support automatic RTL generation from
Simulink descriptions. By synthesizing the RTL in backend tools like Cadence
or Synopsys, accurate circuit-level results for the generated architectures can
be obtained. Other refinements, like the use of carry-save arithmetic and gate
sizing, which are part of logic synthesis, can also be integrated into this flow.
Depending upon delay slack available in the design after synthesis, Vdd scaling is
also employed to improve energy efficiency.
The goal of this work is to establish a complete framework for evaluating
architectures based on results generated through actual physical synthesis and
not just heuristics. For example, in the case of time-multiplexed architectures, the
energy overhead associated with control circuitry cannot be estimated accurately
at the high level. Logic synthesis of the final architecture, on the other hand,
gives a clearer picture of the degradation in energy efficiency associated with
time-multiplexing.
1.2 Overview of Previous Work
The problem of exploring architectural solutions in combination with lower level
circuit optimizations has received constant attention in the past few years. De-
signers have always known that optimizing only at the circuit-level has marginal
gains, while combination of architectural and circuit-level refinements produce
much improved results. This concept was clearly demonstrated in the paper by
Gemmeke et al. [2], where a design methodology for optimizing FIR filters was pro-
posed. The paper started out with a description of various architectural choices
like time-multiplexing and parallelism and moved on to arithmetic-level improve-
ments like Booth recoding of the filter coefficients. The last level of optimization
suggested was at the circuit-level which described sizing of the transistors for
minimum power dissipation at the specified throughput. The approach outlined
in this paper targeted the FIR filter design-space; in this work we try to create a
design environment which automates these optimization techniques for a generic
DSP algorithm. The tool Hyper [3] developed by the University of California,
Berkeley automates architectural transformations to enable efficient design-space
exploration. Given a flow-graph and a set of timing constraints the tool gener-
ates the most area-efficient solution. In this work, we explore the design-space of
possible architectures to find the solution which is jointly optimal in the energy-
area-performance space.
Mapping of Simulink models onto ASIC was introduced in [4]. The work in
[5] added optimization heuristics and applied them to a complex singular value
decomposition algorithm [6]. Although systematic, architecture tuning was man-
ual, making the process impractical or time-consuming for complex systems. This
work aims to automate the process in the convenient-to-use MATLAB/Simulink
graphical environment. The user selects tuning parameters which control the de-
gree of time-multiplexing, retiming, parallelism and pipelining at the Simulink level.
Based on these parameters the optimizer automatically generates the optimized
Simulink model. The core of this optimization process is in the efficient modeling
and transformation of the reference architecture.
Modeling of DSP algorithms as data-flow graphs (DFG) followed by schedul-
ing or retiming has received considerable attention in literature. Retiming was
first described by C.E. Leiserson and J.B. Saxe in 1983 [7], following which it
has perhaps become one of the most widely used techniques for improving de-
lay or lowering power consumption. The retiming problem is formalized using
Integer Linear Programming (ILP) models which are solved using branch and
bound techniques or shortest-path algorithms like Bellman-Ford [8]. A formal de-
scription of the ILP model is given in the paper by K.N. Lalgudi et al. [9].
Retiming has been fairly well integrated in modern-day CAD tools (Synopsys,
Cadence retiming tools). However, synthesis tools can retime circuits only after
the structural netlist has been created once the technology mapping process is
complete. This gate-level granularity introduces a large number of retiming vari-
ables, making the process computationally inefficient for larger designs. For
example, a 512-point FFT could not be retimed at the gate-level in a span of 2.5
days when using the RTL Compiler (RC) tool from Cadence. Performing retiming at the
gate-level for multiple architecture solutions can become quite tedious in such
a situation. The solution lies in hierarchically decomposing the top level design
into smaller modules which can be retimed quickly. Extrapolating circuit-level
results from the smaller modules makes it possible to get fairly accurate estimates
of the critical path post-retiming for the top level design. From these estimates a
final design can be fixed upon based on system constraints. Hence, only the final
design would have to be synthesized. This approach for hierarchical retiming
was adopted in the high-level synthesis tool IRIS [10]. However, the scheduling
approach used in IRIS was not optimal and also it lacked support for pipelining
and parallelism.
Scheduling is the formal approach for time multiplexing a finite set of op-
erations onto a core of processing elements. If the design is subject to speed
constraints, the scheduling algorithm will attempt to parallelize [11] the opera-
tions to meet timing constraints. Conversely, if there is a limit on the cost (area
resources), the scheduler will serialize operations to meet resource constraints.
Scheduling thus determines the cost-speed tradeoffs of the design.
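The serialize-versus-parallelize tradeoff described above can be illustrated with a toy resource-constrained scheduler. This is a greedy sketch in Python, not the ILP formulation developed in this thesis, and all operation names and unit counts are illustrative:

```python
def list_schedule(ops, deps, num_units):
    """Greedy list scheduling: each op takes one step on one of
    `num_units` identical units; deps[v] lists predecessors of v
    (the precedence graph is assumed acyclic)."""
    done_at, t, remaining = {}, 0, set(ops)
    while remaining:
        ready = [v for v in sorted(remaining)
                 if all(done_at.get(p, 10**9) <= t for p in deps.get(v, []))]
        for v in ready[:num_units]:        # serialize when units are scarce
            done_at[v] = t + 1
            remaining.discard(v)
        t += 1
    return done_at  # op -> completion step

# four independent operations: 2 units finish in 2 steps, 1 unit in 4
print(list_schedule("abcd", {}, 2))
print(list_schedule("abcd", {}, 1))
```

With abundant units the schedule parallelizes to meet timing; with a resource limit the same operations are serialized over more steps, which is exactly the cost-speed tradeoff the scheduler negotiates.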
The scheduling problem is NP hard in general. Various heuristic and formal
approaches to the scheduling problem have been covered extensively in literature.
Approaches like ASAP (As Soon As Possible scheduling) [13], ALAP (As Late As
Possible scheduling) [14], list scheduling [15], MARS (Minnesota Architectural
Scheduling) [36] are quick but sub-optimal ways to generate a schedule. Integer
linear programming models can formalize the process and obtain global optimum
solutions. An integer programming model for synthesizing digital logic at the
register-transfer level (RTL) was formulated in [9]. The model gives detailed spec-
ifications for data-path synthesis such as variable storage, operation precedence,
resource sharing, and control structures. Due to the complexity of the formulation
this approach cannot be extended to complex structures. An integer programming
approach was also proposed for microcode scheduling in CATHEDRAL-II [16],
which is a synthesis engine for multiprocessor DSP systems. After a customized
data path has been synthesized and the high-level operations are mapped onto
a set of RTL operations, the microcode scheduling is performed. The model
contains data precedence, resource conflict, and controller pipelining constraints.
Since excessive CPU time is required to solve large problems, the model was
replaced by a graph-based scheduling algorithm [17] which used Integer Linear
Programming models. This approach formalized the scheduling problem and pro-
posed a way to achieve the most area or throughput efficient schedule. Retiming
of the original data-flow-graph was not integrated in all the above mentioned ap-
proaches to scheduling. It is shown later in this thesis (Chapter IV) that
retiming simultaneously with scheduling can result in more area and throughput
efficient schedules.
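For reference, the ASAP heuristic mentioned above amounts to a longest-path computation over the precedence graph. A minimal unit-latency sketch (illustrative only, not the thesis's implementation; resources are assumed unlimited):

```python
def asap_schedule(nodes, deps):
    """ASAP: earliest start step for each op assuming unit latency and
    unlimited resources; deps maps an op to its list of predecessors
    (the dependence graph must be acyclic)."""
    start = {}
    def visit(n):
        if n not in start:
            start[n] = max((visit(p) + 1 for p in deps.get(n, [])), default=0)
        return start[n]
    for n in nodes:
        visit(n)
    return start

# c waits for a and b; d waits for c
print(asap_schedule(["a", "b", "c", "d"], {"c": ["a", "b"], "d": ["c"]}))
# {'a': 0, 'b': 0, 'c': 1, 'd': 2}
```

ALAP is the mirror image: the same relaxation run backwards from a fixed latest completion time.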
The benefit of retiming with scheduling to improve on the resource utiliza-
tion was investigated in [19]. An iterative-improvement probabilistic algorithm
similar to simulated annealing was developed in that work, which applied trans-
formations like retiming, associativity and commutativity to improve the area of
the scheduled architecture. However, the issue of improving the throughput of
the scheduled architecture by supporting greater pipeline depth in the processing
elements via retiming was not addressed. Retiming with scheduling was also in-
vestigated in [18] where a scheme to generate all possible schedules for any DFG
was proposed. However, this scheme works only on strongly connected graphs
(a DFG where every node must be a part of a loop). It was also not clear in
that work how the most area- or throughput-efficient schedule was to be
extracted from the pool of solutions generated.
Retiming variables are unbounded in the integer space. Their inclusion in the
ILP model makes it practically impossible for the ILP to converge to a solution
for complex structures, when we attempt to minimize the area of the schedule.
In this work a new model has been developed for ILP scheduling which integrates
retiming without directly introducing the retiming variables in the ILP. Retiming
is done post scheduling using the Bellman-Ford shortest-path algorithm [8] which
has polynomial time-complexity. This ensures that the optimum schedule is found
without increasing CPU runtime excessively. We show that simultaneous retiming
of the original DFG along with scheduling produces better results in terms of area
and throughput when compared to the results in [17].
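The Bellman-Ford step referred to here solves the retiming difference constraints (inequalities of the form r(v) − r(u) ≤ c, mapped onto shortest-path edges) in polynomial time. A generic stdlib sketch with hypothetical toy data; a negative cycle signals an infeasible constraint system:

```python
def bellman_ford(num_nodes, edges, src):
    """edges: list of (u, v, weight). Returns shortest distances from src,
    or None if a negative cycle is reachable (constraints infeasible)."""
    INF = float("inf")
    dist = [INF] * num_nodes
    dist[src] = 0
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            return None  # negative cycle => no feasible solution
    return dist

# toy system: r(v) - r(u) <= w becomes an edge u -> v of weight w
print(bellman_ford(3, [(0, 1, 1), (1, 2, -2), (0, 2, 0)], 0))  # [0, 1, -1]
```

Each distance dist[v] then serves as a feasible value for the corresponding variable r(v), which is how shortest paths certify a legal retiming.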
1.3 Thesis Outline
The subsequent chapters will present in detail our methodology for architecture
optimization (Fig. 1.2). The second chapter introduces data-flow-graph represen-
tations and explains how matrices can be used to represent flow-graph connec-
tivity information. Chapter III discusses various architectural transformations
like retiming, scheduling and parallelism along with mathematical models which
automate them. Carry-save arithmetic optimization and its benefits in reducing
the critical path of a design are also discussed at the end of this chapter.

Figure 1.2: Thesis organization.

Chapter IV focuses on the ILP model for scheduling and then discusses how this model
can be extended to include retiming such that the final framework still remains
CPU efficient. The fifth chapter presents the CAD design flow starting from
Simulink and ending with the technology mapped design. The chapter illustrates
how systems can be modeled and functionally verified in Simulink. Chapter VI
discusses how scheduling results compare with scheduling combined with retim-
ing. This is followed by the design-space exploration results for a 16-tap filter
(used in an ultra-wideband application). The hierarchical capability of our design-space
exploration approach is demonstrated on a flexible MIMO sphere decoder design.
The last chapter summarizes the contributions of this thesis and discusses future
scope of this work.
CHAPTER 2
Architecture Modeling
This chapter describes how algorithms (direct-mapped architectures) are modeled
using graphical representations like data-flow graphs and then mathematically
abstracted as matrices. Such representations allow transformations like retiming
and scheduling to be viewed in compact form as manipulation of matrices rather
than changes being made at the architectural level.
2.1 Representations of DSP Algorithms
DSP algorithms [12] can be broadly divided into two classes, namely FIR (finite
impulse response) and IIR (infinite impulse response). The result of an FIR algo-
rithm depends on previous and current inputs, while the result of an IIR algorithm
depends on previous and current inputs as well as previous outputs. An example
is shown for both types of systems.
y(n) = a0 x(n) + a1 x(n−1) + a2 x(n−2)    (FIR system)    (2.1)
y(n) = a0 x(n) + a1 y(n−1)    (IIR system)    (2.2)
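As a sketch of the difference between the two classes, the recurrences (2.1) and (2.2) can be evaluated directly in software (the coefficient values below are arbitrary placeholders):

```python
def fir_3tap(x, a):
    """y(n) = a0*x(n) + a1*x(n-1) + a2*x(n-2); x(k) = 0 for k < 0."""
    pad = [0, 0] + list(x)
    return [a[0]*pad[n+2] + a[1]*pad[n+1] + a[2]*pad[n] for n in range(len(x))]

def iir_1st(x, a0, a1):
    """y(n) = a0*x(n) + a1*y(n-1); y(-1) = 0, so the output feeds back."""
    y, prev = [], 0
    for xn in x:
        prev = a0*xn + a1*prev
        y.append(prev)
    return y

print(fir_3tap([1, 0, 0, 0], [0.5, 0.3, 0.2]))  # impulse response dies out
print(iir_1st([1, 0, 0], 1.0, 0.5))             # impulse response never ends
```

The FIR response to an impulse is finite (it ends after the last tap), while the IIR feedback term keeps the response alive indefinitely, which is the defining distinction between the two classes.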
Execution of all computations in the algorithm once is referred to as an iteration.
The iteration period is the time required for execution of one complete iteration
of the algorithm. During each iteration, the 3-tap FIR filter in (2.1) processes one
input sample, completes 3 multiplications and 2 additions, and generates
one output sample. DSP systems are also characterized by their sampling rate
(throughput) and input to output latency. The throughput is determined by
the critical path (longest path between any two storage elements) of the system.
Latency is defined as the difference between the time an output is generated and
the time at which the corresponding input was received by the system.
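A toy illustration of the critical-path definition: treating zero-delay edges as combinational connections, the critical path is the longest node-weighted path that crosses no register. This is a simplified sketch under that assumption; the node names and execution times are made up:

```python
def critical_path(exec_time, edges):
    """Longest register-to-register combinational path: the longest path
    in the DFG restricted to zero-delay edges (assumed acyclic).
    edges: list of (src, dst, delays); exec_time: node -> time in u.t."""
    zero = {}
    for u, v, w in edges:
        if w == 0:
            zero.setdefault(u, []).append(v)
    memo = {}
    def longest_from(n):
        if n not in memo:
            memo[n] = exec_time[n] + max(
                (longest_from(m) for m in zero.get(n, [])), default=0)
        return memo[n]
    return max(longest_from(n) for n in exec_time)

# two-node loop: A (2 u.t.) and B (4 u.t.), only B -> A is delay-free
print(critical_path({"A": 2, "B": 4}, [("A", "B", 1), ("B", "A", 0)]))  # 6
```

The returned value bounds the clock period, and hence the throughput, of the direct-mapped architecture.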
Complex DSP algorithms can be conveniently modeled using high-level de-
scriptions, where it is more important to specify the communication between
the processing elements, rather than the order and structure of the internal op-
erations. These high-level descriptions can either take the form of behavioral
description languages or graphical representations. Many applications are de-
scribed using descriptive languages that represent the structure of the system.
Examples of these are hardware description languages such as Verilog and VHDL
which can be integrated into the physical synthesis flow using CAD tools.
Graphical representations are efficient for investigating and analyzing data-
flow properties of DSP algorithms and for exploiting inherent parallelism among
different subtasks. As such they are more amenable to transformation than be-
havioral descriptions. More importantly, graphical representations can be easily
converted to Verilog/VHDL scripts which are easy to map into hardware imple-
mentations. Hence, these representations can bridge the gap between algorithmic
descriptions and structural implementations.
The absolute measures of the performance metrics of DSP systems namely
area, speed and power cannot be obtained without the knowledge of the sup-
porting technology. However, graphical representations provide useful insight
into space-time-energy tradeoffs making it possible to explore the architectural
design-space. Various forms of graph representations include signal-flow-graph
(SFG), data-flow-graph (DFG) and dependence graphs (DG) [31]. In this work
data-flow-graph based representation has been adopted because of its simplicity
of representation and the convenient way in which flow-graph connectivity infor-
mation can be extracted from it. Data flow graph modeling has been described
in the following section.
2.2 Data Flow Graphs
In data-flow-graph representations, the nodes represent computations (functions
or subtasks) and the directed edges represent data paths (communication between
nodes). Each edge has a nonnegative number of delays associated with it. For
example, Fig. 2.1(b) is a data-flow-graph of the computation y(n) = ay(n −
1) + x(n). Node A represents addition while node B represents multiplication.
The edge from node A to B contains one delay while edge from B to A has no
delay. Associated with each node is its execution time in terms of normalized
units (u.t.). For example, the execution time of node A is 2 u.t. while that of
node B is 4 u.t.
2.2.1 Loop Bound and Iteration Bound
A loop is a directed path that begins and ends at the same node, such as the
path A → B → A in Fig. 2.1(b). Given that the execution times of nodes A and
B are 2 and 4 u.t. respectively, one iteration of the loop requires 6 u.t. This is
the loop bound, which represents the lower bound on the loop computation time.
Formally, the loop bound of the l-th loop is defined as t_l/w_l, where t_l is the loop
computation time and w_l is the number of delays in the loop. The loop bound
for the DFG in Fig. 2.1 is 6/1 = 6 u.t.
Figure 2.1: (a) Block diagram of y(n) = ay(n−1) + x(n) (b) DFG representation.

The critical loop of a DFG is the loop with the maximum loop bound. This
is known as the iteration bound [21], [23] of the DSP program, which determines
the lower bound on the sample period regardless of the amount of computing
resources available. Formally, the iteration bound is defined as

T∞ = max_l { t_l / w_l }.    (2.3)
For loop 1 (L1) in Fig. 2.2(b) the loop bound is 2 u.t. while for loop 2 (L2) it
is 3 u.t. The slower loop L2 determines the iteration bound, or the minimum
possible sample period for the system, which is 3 u.t. in this case. The iteration bound
sets the fundamental limit on the achievable throughput of the system during
retiming, as will be explained later in Chapter IV.
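For small graphs, T∞ in (2.3) can be computed by brute-force loop enumeration. A stdlib sketch using the Fig. 2.1 example, with node times and edge delays as given in the text (a valid DFG is assumed, i.e., every loop contains at least one delay):

```python
def simple_cycles(edges):
    """Enumerate simple cycles of a small directed graph by DFS,
    keeping one rotation of each (the one starting at the min node)."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append(v)
    cycles = []
    def dfs(start, node, path):
        for nxt in adj.get(node, []):
            if nxt == start:
                cycles.append(path[:])
            elif nxt not in path:
                dfs(start, nxt, path + [nxt])
    for s in adj:
        dfs(s, s, [s])
    return [c for c in cycles if c[0] == min(c)]

def iteration_bound(exec_time, edges):
    """T_inf = max over loops of (loop time / loop delays)."""
    delay = {(u, v): w for u, v, w in edges}
    best = 0.0
    for cyc in simple_cycles(edges):
        t = sum(exec_time[n] for n in cyc)
        w = sum(delay[(cyc[i], cyc[(i + 1) % len(cyc)])] for i in range(len(cyc)))
        best = max(best, t / w)  # every valid loop has w >= 1
    return best

# Fig. 2.1 DFG: A takes 2 u.t., B takes 4 u.t., one delay on A -> B
print(iteration_bound({"A": 2, "B": 4}, [("A", "B", 1), ("B", "A", 0)]))  # 6.0
```

Brute-force enumeration is exponential in general; the literature cited here uses polynomial algorithms, but the arithmetic per loop is exactly this ratio.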
2.2.2 Precedence Relations
Data-flow graphs capture the data driven property of DSP algorithms where any
node can fire (perform its computation) whenever all the input data are available.
This implies that a node with no input edges can fire at any time. Thus many
Figure 2.2: IIR Filter: (a) Block diagram (b) DFG representation with loops.
nodes can be fired simultaneously, leading to concurrency. Conversely, a node
with multiple input edges can only fire after all its precedent nodes have fired.
The latter case imposes the precedence constraints on a DFG, where each edge
describes a precedence relation between two nodes. This precedence constraint
is an intra-iteration constraint if the edge has zero delays, while it is called an
inter-iteration constraint (occurring between iterations) if the edge has one or more delays. For
example, the edge from node 2 to node 1 in Fig. 2.2(b) enforces the inter-iteration
constraint, which states that the execution of the k-th iteration of node 2 must
be completed before the (k+1)-th iteration of node 1. The edge from node 4 to
node 2 enforces the intra-iteration precedence constraint, which states that the
k-th iteration of node 4 must be executed before the k-th iteration of node 2.
Precedence relations enforce a set of constraints in the scheduling model, where a
particular operation can be scheduled only after all its precedent operations have
executed.
2.3 Data Flow Graph Model
A directed DFG is denoted as G = <V,E,d,w>, where the notation is as
follows:
• V: Set of vertices (nodes) of G. The vertices represent operations. The
number of nodes in G is |V|.
• E: Set of directed edges of G. A directed edge from node U ∈ V to node
V ∈ V is denoted as U → V. The edges represent communication between
the nodes. The number of edges in G is |E|.
• w(e): Number of delays on the edge e, also referred to as the weight of the
edge.
• d(U ): Pipeline depth of the node U.
The data-flow-graph is initially described in the Simulink environment using a
block-based description. We capture the DFG information in the form of an incidence
matrix A, a loop matrix B, a weight vector w and a pipeline vector du. Let A
be the incidence matrix of the graph G; this |V| × |E| matrix is described
as
α_{i,j} =   1, if edge j starts from node i,
           −1, if edge j ends in node i,
            0, if edge j does not start or end in node i.
The A matrix is generated by identifying the source and destination nodes for
every edge in the DFG.
Figure 2.3: (a) Architecture of a second order IIR filter (b) DFG representation
for the filter.
The B matrix is an |L| × |E| matrix where |L| is the total number of loops
in the DFG. It is defined as
β_{i,j} =   1, if edge j is in loop i,
            0, otherwise.
The B matrix is computed using Tarjan’s algorithm [20] (Appendix 2) in O((|V|+
|E|)(|L|+ 1)). The weight vector w is an |E|×1 vector. It is defined as
wi = number of delays (registers) on edge i .
The pipeline vector du is an |E|×1 vector. It is defined as
du_i = pipeline depth of the source node U of edge i (e_i : U → V).
Second-Order IIR Filter Example: The incidence and loop matrices [18]
for the second order IIR filter (Fig. 2.3) are shown below. The A matrix captures
the connectivity between the four nodes (rows in A) in the DFG while the B
matrix extracts the two loops (rows in B).
A =
[  1   1   0   0  -1 ]
[  0   0  -1  -1   1 ]
[ -1   0   1   0   0 ]
[  0  -1   0   1   0 ]

B =
[ 0  1  0  1  1 ]
[ 1  0  1  0  1 ]
The weight vector for the five edges in the IIR filter (Fig. 2.3) is given by
w^T = [1 2 0 0 1].
If the multipliers in the filter have a pipeline depth of m while the adders have a
depth of a then the pipeline vector takes the following form
du^T = [a a m m a].
The incidence, loop, weight and pipeline matrices/vectors provide a compact
representation of the flow-graph information. Once extracted from the data-
flow-graph these matrices are used to model the architectural transformations
described in the next chapter.
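The extraction of the incidence matrix from the edge list is mechanical. The sketch below is in Python (the thesis framework uses MATLAB) and assumes a 0-indexed edge list consistent with the second-order IIR example above:

```python
def incidence_matrix(num_nodes, edges):
    """Build the |V| x |E| incidence matrix: column j carries +1 at the
    source node and -1 at the destination node of edge j."""
    A = [[0] * len(edges) for _ in range(num_nodes)]
    for j, (u, v) in enumerate(edges):
        A[u][j] = 1
        A[v][j] = -1
    return A

# Edge list (source, destination) assumed for the second-order IIR filter.
edges = [(0, 2), (0, 3), (2, 1), (3, 1), (1, 0)]
A = incidence_matrix(4, edges)
```

The resulting rows correspond to nodes and the columns to edges, matching the A matrix shown for the filter.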
CHAPTER 3
Architectural Transformations
This chapter presents the details of architectural transformations such as retiming,
scheduling and unfolding (parallelism), their relative advantages, and their formu-
lation using data-flow graphs. Micro-architectural optimizations such as the use of
carry-save arithmetic and supply voltage scaling, and their impact on the energy
and performance of a design, are also discussed.
3.1 Retiming
Retiming changes the location of registers (delay elements) in a circuit in an
attempt to balance the logic depth between sequential elements and minimize
the critical path. A valid retiming solution must not change the input/output
functionality of the DFG.
Second-Order IIR Filter Example: Consider the DFG of an IIR filter
in Fig. 3.1(a) where the numbers in brackets indicate the computation time
associated with each processing node. This filter is described by
y(n) = w(n− 1) + x(n). (3.1)
However w(n) itself is recursively related to y(n) by
w(n) = ay(n− 1) + by(n− 2). (3.2)
Substituting the value of w(n) from (3.2) into (3.1) we get the final equation for
Figure 3.1: (a) Original DFG (b) Retimed DFG.
y(n) in (3.3).
y(n) = ay(n− 2) + by(n− 3) + x(n) (3.3)
Following a similar process we derive the input to output relation for the filter in
Fig. 3.1(b),
w1(n) = ay(n− 1) (3.4)
w2(n) = by(n− 2) (3.5)
y(n) = w1(n− 1) + w2(n− 1) + x(n) (3.6)
= ay(n− 2) + by(n− 3) + x(n).
Although the DFGs in Fig. 3.1 have delays at different locations, these filters
have the same input/output functionality and can be derived from each other
through retiming.
Retiming is mainly used to reduce the critical path in synchronous circuits.
The critical path of the filter in Fig. 3.1(a) (shown by the dashed line) passes
through one multiplier and one adder and has a computation time of 3 u.t. The
retimed filter in Fig. 3.1(b) has a critical path that passes through two adders and
has a computation time of 2 u.t. Reduction in critical path either translates to
improved throughput or lower power via supply voltage scaling as will be shown
later in this chapter.
3.1.1 Mathematical Model for Retiming
Retiming maps a data-flow-graph G to a retimed graph Gr. A retimed solution
is characterized by a value r(U) known as the retiming weight for each node U in
the graph. Let w(e) denote the weight of the edge e in the original graph G, and
let wr(e) denote the weight of the edge e in the retimed graph Gr. The weight
of the edge e : U → V in the retimed graph is computed from the weight of the
edge in the original graph using
w_r(e) = w(e) + r(V) − r(U),   r(V), r(U) ∈ Z   (3.7)
where Z is the set of integers.
The retiming values r(1) = 0, r(2) = 1, r(3) = 0 and r(4) = 0 translate
the DFG in Fig. 3.1(a) into the retimed DFG in Fig. 3.1(b). Retiming does not
alter the architecture of the design, hence the incidence and loop matrices remain
the same for the original and retimed DFG. However, the weights on the
edges change, and it is the weight vector that is transformed during retiming.
For the DFG in Fig. 3.1(a) the transformation (w^T → w_r^T) is shown below.
w^T = [1 2 0 0 1] → w_r^T = [1 2 1 1 0]
A retiming solution is feasible if w_r(e) ≥ 0 holds for all edges. It can be
proved that retiming does not alter the total number of delays in a loop of the
DFG [7]. This means that for a given loop in the DFG the sum of delays
on the edges of the loop remains unchanged after retiming. Mathematically,
this can be expressed as the product of the loop matrix B and the weight vector
w remaining constant:
Bw = Bw_r.   (3.8)
We can see that this property holds for the retimed IIR filter.
Bw = [ 0 1 0 1 1 ; 1 0 1 0 1 ] · [1 2 0 0 1]^T = [3 2]^T   (3.9)

Bw_r = [ 0 1 0 1 1 ; 1 0 1 0 1 ] · [1 2 1 1 0]^T = [3 2]^T   (3.10)
Since the number of delays in a loop is the same before and after retiming,
retiming cannot change the iteration bound of the DFG. This property sets the
fundamental limit on the minimum achievable critical path after retiming which
is equal to the iteration bound of the DFG. The smallest critical path obtained
after retiming is therefore limited by the slowest loop in the system.
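Checking a candidate retiming amounts to applying (3.7) edge by edge and verifying nonnegativity and the loop invariant (3.8). A minimal Python sketch, using an assumed 0-indexed edge list consistent with the IIR example and the retiming values quoted above:

```python
def retime_weights(w, edges, r):
    """Apply w_r(e) = w(e) + r(V) - r(U) for every edge e: U -> V."""
    return [we + r[v] - r[u] for we, (u, v) in zip(w, edges)]

edges = [(0, 2), (0, 3), (2, 1), (3, 1), (1, 0)]  # assumed edge list
w = [1, 2, 0, 0, 1]
r = [0, 1, 0, 0]            # r(1)=0, r(2)=1, r(3)=0, r(4)=0
wr = retime_weights(w, edges, r)   # -> [1, 2, 1, 1, 0]
assert all(d >= 0 for d in wr)     # feasibility: w_r(e) >= 0 on every edge
```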
3.1.2 Retiming for Clock Period Minimization
This algorithm was proposed by Leiserson et al. in [7] and has since been widely
used for retiming synchronous circuits to optimize the clock period. Mathemat-
ically, the minimum feasible clock period φ(G) for a graph G is defined as
φ(G) = max{t(p) : w(p) = 0} (3.11)
where t(p) denotes the logic delay of path p and w(p) denotes the number of
registers in path p. This implies that the clock period is determined by the
longest register-less path in the circuit (critical path).
Two quantities, W (U, V ) and D(U, V ) are used to implement this algorithm.
W (U, V ) is the minimum number of registers on any path from node U to node
V and D(U, V ) is the maximum computation time among all paths from U to V
with weight W (U, V ). Formally,
W (U, V ) = min{w(p) : U → V } (3.12)
D(U, V ) = max{t(p) : U → V and w(p) = W (U, V )}. (3.13)
The following algorithm can be used to compute W (U, V ) and D(U, V ).
• Let M = t_max · n, where t_max is the maximum computation time of the nodes
in G and n is the number of nodes in G.
• Form a new graph G′ which is the same as G except that the edge weights are
replaced by w′(e) = M·w(e) − t(U) for all edges e : U → V.
• Solve the all-pairs shortest-path problem on G′ using the Floyd-Warshall algo-
rithm [8], [37]. Let S_UV be the shortest path from U to V.
• W(U, V) = ⌈S_UV / M⌉ and D(U, V) = M·W(U, V) − S_UV + t(V).
The values of W (U, V ) and D(U, V ) are used to determine if there exists a
retiming solution that can achieve a desired clock period. Given a desired clock
period c, there is a feasible retiming solution r such that φG(r) ≤ c if the following
constraints hold:
• r(U) − r(V) ≤ w(e) for every edge e : U → V (feasibility constraint),
• r(U) − r(V) ≤ W(U, V) − 1 for all vertices U, V in G such that D(U, V) >
c (critical path constraint).
The feasibility constraint forces the number of delays on each edge in the retimed
graph to be nonnegative, and the critical path constraint enforces that all paths
without delays in the graph have computation time less than c. This procedure
is iteratively repeated for several monotonically decreasing values of c, until no
feasible retiming solution exists. At this point we get the retimed graph with
minimum possible critical path.
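The W/D computation above can be sketched as follows. This is an illustrative Python version (the node/edge containers are assumptions), checked against the two-node DFG of Fig. 2.1:

```python
import math

def w_d_matrices(nodes, edges, t):
    """W(U,V): minimum registers on any U->V path; D(U,V): maximum
    computation time among those minimum-register paths."""
    M = max(t.values()) * len(nodes)
    INF = float('inf')
    # Shortest paths on the reweighted graph w'(e) = M*w(e) - t(U).
    S = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    for (u, v, w) in edges:
        S[u][v] = min(S[u][v], M * w - t[u])
    for k in nodes:                      # Floyd-Warshall all-pairs shortest paths
        for i in nodes:
            for j in nodes:
                if S[i][k] + S[k][j] < S[i][j]:
                    S[i][j] = S[i][k] + S[k][j]
    W, D = {}, {}
    for u in nodes:
        for v in nodes:
            if u != v and S[u][v] < INF:
                W[(u, v)] = math.ceil(S[u][v] / M)
                D[(u, v)] = M * W[(u, v)] - S[u][v] + t[v]
    return W, D

# Fig. 2.1: A (t=2) -> B (t=4) with one delay, B -> A with none.
W, D = w_d_matrices(['A', 'B'], [('A', 'B', 1), ('B', 'A', 0)], {'A': 2, 'B': 4})
```

Both D values come out to 6 u.t., consistent with the 6 u.t. loop bound of that example.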
3.1.3 Retiming for Energy Efficiency
Retiming combined with supply voltage scaling can result in significant energy
savings, as illustrated in [22]. We illustrate this energy saving by scaling the
supply voltage of a butterfly unit in an FFT processor [27]. The butterfly in
Fig. 3.2 has a longer critical path than the retimed architecture in Fig.
3.3. This introduces a delay slack in the second design, which can be utilized to
scale the supply voltage and reduce energy. A 40% saving in energy was achieved
for this design after inserting an extra pipeline stage. Retiming therefore not
only helps in improving the throughput but can also be used to improve the
energy-efficiency of the system. Supporting results for energy savings achieved
after retiming will be presented in Chapter VI for a 16-tap FIR filter.
Figure 3.2: FFT butterfly.
3.2 Scheduling
Scheduling and allocation are two important tasks in the synthesis of DSP sys-
tems. Scheduling involves assigning every node of the DFG to control time steps
or clock cycles. Resource allocation is the process of assigning operations to hard-
ware with a goal of minimizing the amount of hardware required to implement
the desired algorithm.
Figure 3.3: Retimed FFT butterfly with Vdd-scaling.
The simplest scheduling technique is As Soon As Possible (ASAP) schedul-
ing [13], [24], where the operations in the data-flow-graph are scheduled step-by-
step from the first control step to the last. An operation is called a "ready opera-
tion" if all of its predecessors are scheduled. This procedure repeatedly schedules
ready operations to the next control step until all the operations are scheduled.
As Late As Possible (ALAP) scheduling [14] performs a very similar procedure as
ASAP. In contrast to ASAP, ALAP scheduling assigns the operations from the
last control step towards the first. An operation is scheduled to the next control
step once all its successors are scheduled. Figure 3.4 gives an example of ASAP
and ALAP scheduling. Since it is not practical to assign too many operations of
the same type into a control step due to the constraint on the number of func-
tion units, a variation of ASAP [28],[35] is to delay the ready operations when
their number exceeds the number of function units. Selection of the operations
to be delayed is arbitrary. The main problem with ASAP and ALAP scheduling
algorithms is that no priority is given to the nodes on the critical path. As a
result, less critical nodes may be scheduled ahead of critical nodes. This becomes
a problem under limited resource constraints because critical nodes will require
extra processing elements (PE) when other nodes have blocked all the available
PEs.
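The ASAP rule can be illustrated with a short sketch (Python, not part of the thesis framework; operations are assumed unit-time, the precedence graph acyclic, and resource limits are ignored):

```python
def asap_schedule(ops, preds):
    """Assign each operation the earliest control step at which all of
    its predecessors have completed (unit-time operations assumed)."""
    step = {}
    remaining = set(ops)
    while remaining:
        for o in sorted(remaining):
            # An operation is "ready" once every predecessor is scheduled.
            if all(p in step for p in preds.get(o, ())):
                ps = [step[p] for p in preds.get(o, ())]
                step[o] = 1 + max(ps) if ps else 0
                remaining.discard(o)
    return step

# Chain a -> b -> c plus an independent operation d.
sched = asap_schedule(['a', 'b', 'c', 'd'], {'b': ['a'], 'c': ['b']})
```

Note that d lands in step 0 alongside a, regardless of how critical a's chain is — exactly the weakness described above.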
Figure 3.4: (a) ASAP scheduling (b) ALAP scheduling.
To overcome the problems with ASAP and ALAP scheduling, list scheduling tech-
niques were developed. The list scheduling technique [15], [25], [16], which was orig-
inally used in microcode compaction [15], has been adopted by many high-level
synthesis systems. Similar to ASAP, the operations in the DFG are assigned
to control steps from the first control step to the last. The ready operations
are given a priority according to heuristic rules and are scheduled into the next
control step according to this predefined priority. When the number of sched-
uled operations exceeds the number of resources, the remaining operations are
delayed. The drawback of list scheduling is that this algorithm requires some
prior knowledge of the number of resources and, therefore, can only be applied
to resource constrained problems.
The third type of scheduling is "global" in the way it selects the next operation
to be scheduled and in the way it decides the control step in which to put it. There
are two variations: freedom-based scheduling and force-directed scheduling. In
freedom-based scheduling [30], the operations on the critical path are scheduled
first. The operations not on the critical path are assigned one at a time according
to their degree of freedom. In force-directed scheduling [33], ”force” values are
calculated for all operations at all feasible control steps. The pairing of operation
and control step that has the most attractive force is selected and assigned.
After the assignment, the forces of the unscheduled operations are re-evaluated.
Assignment and evaluation are iterated until all the operations are assigned.
Among the above scheduling techniques, list scheduling requires that the number
of function units be specified, while force-directed scheduling requires that the
maximum number of control steps be specified. They correspond to resource-
constrained and time-constrained scheduling, respectively.
Integer Linear Programming (ILP) models provide a formal method to de-
scribe and solve the scheduling problem in an optimal manner. These models
overcome all the drawbacks of the previous approaches and are able to solve for
schedules which either minimize the resources required or the total time required
for completion of all operations. They require longer execution time when com-
pared to previous approaches, but always guarantee the global optimum solution.
A detailed description of the ILP models along with a modified approach which
incorporates retiming with scheduling is described in Chapter IV.
Figure 3.5: (a) DFG of recursive algorithm (b) DFG of two-unfolded version.
3.3 Unfolding
Unfolding [31] is applied to DSP algorithms to create a DFG which describes
more than one iteration of the original algorithm. For example, the DFG in Fig.
3.5(a) represents the following relation
y(n) = ay(n− 1) + x(n). (3.14)
Replacing the index n with 2m and 2m + 1 gives the following relations which
describe a 2-unfolded version of the original DFG (Fig. 3.5(b)).
y(2m) = ay(2m− 1) + x(2m) (3.15)
y(2m + 1) = ay(2m) + x(2m + 1) (3.16)
Figure 3.6: (a) DFG of feedforward algorithm (b) DFG of two-unfolded version.
Figure 3.5 illustrates an important conclusion regarding unfolding of recursive
systems. Since these systems do not allow the insertion of extra latency, unfolding
them by a factor of J can only increase the critical path (dashed line in red) by
a factor of J . The critical path per iteration (critical path divided by unfolding
factor) or the throughput of the system therefore remains the same in this case.
With a feedforward system, on the other hand, it is possible to insert extra
latency and by strategically placing the registers at the optimum location (via
retiming) it is possible to improve the throughput considerably. An example for
this case is illustrated for an FIR filter in Fig. 3.6. The critical path for the
original architecture was the sum of an adder and a multiplier delay (Tadd + Tmult).
Unfolding by a factor of 2 followed by optimal placement of registers results in
the same critical path (Tadd + Tmult). The critical path per iteration is therefore
(Tadd + Tmult)/2, achieving a speed-up of 2×.
3.3.1 Mathematical Formulation for Unfolding
The J-unfolded DFG contains J times as many nodes and edges. The unfolding
procedure is given by the following two steps:
• For each node U in the original DFG, draw the J nodes U_0, U_1, ..., U_{J−1}.
• For each edge U → V with w delays in the original DFG, draw the J edges
U_i → V_{(i+w)%J} with ⌊(i + w)/J⌋ delays, for i = 0, 1, ..., J − 1 (% denotes the
modulo operation).
In unfolded systems each delay is J-slow. This means that if the input to a delay
element is the signal x(kJ + m) the output is the signal x((k− 1)J + m). In the
matrix domain the unfolded incidence matrix A_u is a J-fold replication of the
original matrix A. This increases the dimension of the A matrix from |V| × |E| to
J|V| × J|E|.
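The two unfolding steps translate directly into code. The sketch below is a Python illustration with an assumed (source, destination, delays) edge encoding, applied to the single-loop IIR of Fig. 3.5:

```python
def unfold(edges, J):
    """J-unfold a DFG given as (u, v, w) edges; nodes of the unfolded
    graph are (node, i) pairs for i = 0..J-1."""
    out = []
    for (u, v, w) in edges:
        for i in range(J):
            # Edge U_i -> V_{(i+w)%J} carries floor((i+w)/J) delays.
            out.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return out

# First-order IIR loop of Fig. 3.5: the add -> mult edge carries one delay.
unfolded = unfold([('add', 'mult', 1), ('mult', 'add', 0)], 2)
```

The total delay count across the unfolded edges equals that of the original edge, consistent with unfolding preserving the number of delays.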
The primary application of unfolding is in the design of high-speed or low-
power parallel architectures. The clock rate of each branch can be halved in a 2-way
parallel architecture, as shown in Fig. 3.7, while preserving the overall throughput.
This results in the creation of a delay slack which
can be used to scale the supply voltage and further reduce power. This concept
was illustrated in [22] where for the architecture in Fig. 3.7(b) a 64% saving in
energy was reported.
Figure 3.7: (a) Original architecture (b) Two-unfolded version with Vdd-scaling.
3.3.2 Carry-Save Arithmetic
Carry-save arithmetic (CSA) is a very useful micro-architectural transformation
when it comes to reducing the critical path of a multiplier. As shown in Fig.
3.8(a) [26] a conventional array multiplier must compute the carry and sum of
the partial products at each stage. This would involve rippling the carry through
each of the stages and increase the length of the critical path (dotted line). In
(Delay expressions annotated in Fig. 3.8: conventional array multiplier
t_mult ≈ (M+N−3)·t_carry + (N−1)·t_sum + (N−1)·t_and; carry-save multiplier
t_mult ≈ (N−1)·t_carry + (N−1)·t_and + t_merge.)
Figure 3.8: (a) Conventional array multiplier (b) Carry-save multiplier.
a carry-save implementation (Fig. 3.8(b)) on the other hand, the carry is not
rippled but saved and sent to the next stage. Each stage produces two outputs,
namely carry and sum which are then finally merged in the last stage using a
very fast adder (vector merging adder). This scheme reduces the critical path of
the multiplier considerably as compared to the conventional array multiplier, as
indicated in Fig. 3.8. For example, for N = 3, M = 4, the critical path reduces
from the delay of three half-adders and three full-adders to the delay of three
half-adders and a carry-propagate adder.
A second use of carry-save arithmetic is made in the addition of N numbers,
when a carry-save tree can be used to reduce the critical path. Figure 3.9 shows
an example of this where a set of nine numbers must be added. The carry-save
adder shown in Fig. 3.9 is a series of M full adder units where M is the number
of bits at the input. The first level of full-adder units operates on a set of 3 inputs
to generate the M-bit sum and carry. This operation is known as 3:2 compression,
since it takes in 3 inputs and converts them to 2 outputs, namely sum and carry.
The sum and carry propagate to the next level where a similar compression takes
Figure 3.9: Carry-save tree adding nine numbers.
place. This process repeats until we compress down to the final 3 inputs, which go
into the fast carry-propagate adder. The delay of the addition process in Fig. 3.9
is reduced to 4 full-adder delays and the final delay of the carry-propagate adder.
The carry-save tree implementation is particularly useful when multiplication
by a constant coefficient (as in the case of filters) is reduced to a series of
shifts and adds, as shown in Fig. 3.10. The final adder has to add a series of
M-bit numbers, which is efficiently done with the help of carry-save trees. Using
CSA optimization not only improves the critical path but can result in area
and energy savings as well. This is because to obtain the same performance
from a conventional adder we must either upscale the devices or use complex
structures like carry look-ahead adders. This not only increases the area of the design
but also results in larger switched capacitance, increasing the energy. Results
supporting this claim are presented in Chapter VI.
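The 3:2 compression described above can be mimicked on nonnegative integers with bitwise operations. This is an illustrative Python sketch, not a hardware model:

```python
def csa(a, b, c):
    """3:2 compressor: returns (s, carry) with a + b + c == s + carry.
    XOR forms the per-column sum; the majority function forms the carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry

def csa_tree(addends):
    """Reduce the addends three at a time until two remain, then perform a
    single carry-propagate addition (modeled here by Python's +)."""
    vals = list(addends)
    while len(vals) > 2:
        a, b, c = vals.pop(), vals.pop(), vals.pop()
        s, carry = csa(a, b, c)
        vals += [s, carry]
    return sum(vals)
```

Feeding nine numbers through `csa_tree` mirrors the reduction of Fig. 3.9: the carry never ripples until the final carry-propagate step.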
Figure 3.10: A multiplier implemented with shifts and adds.
The mathematical models presented for retiming and unfolding were imple-
mented in MATLAB. These models use the incidence and loop matrices to extract
the DFG information. The optimized architecture is then transformed into a new
Simulink model. Details of functional verification and synthesis of Simulink mod-
els will be discussed in Chapter V. Carry-save optimization is done automatically
during logic synthesis by Cadence backend tools. The next chapter is devoted to
the description of the ILP model used in our framework for scheduling integrated
with retiming.
CHAPTER 4
ILP Model for Scheduling
This chapter describes the Integer Linear Programming model used for scheduling
and retiming of architectures. The ILP model presented attempts to minimize
the number of processing elements required to execute a finite set of operations
in N time steps (clock cycles). An example of scheduling is shown in Fig. 4.1
where the number of time steps N has been set to 4. The operations
(o_i, p_j) have been scheduled onto a set of resource elements (Proc_i) in four time
steps. Each operation takes a single clock cycle to complete. The table shows
the distribution of the operations across the time steps.
Figure 4.1: Data-flow-graph and corresponding schedule.
The formal ILP model used to construct the schedule uses the following set
of variables.
• x_{i,j}: binary variable associated with node i; x_{i,j} = 1 if node i is scheduled
in time step j, else x_{i,j} = 0,
• M_p: number of resource elements of type p (e.g. adders, multipliers),
• c_p: cost associated with each resource of type p,
• r(U) ∈ Z: retiming weight associated with node U,
• N: folding factor/number of time steps in which all operations in the algo-
rithm must be scheduled.
The value of N remains fixed in each run of the ILP. Scheduling in essence folds
the DFG by a factor of N such that a new input sample arrives every N clock
cycles in the scheduled DFG. To maintain real-time latency constraints a delay in
the original DFG maps to N delays in the scheduled DFG. The edge e : U → V
with w(e) delays originally and retiming weights r(V ) and r(U) maps to
w(e) → N(w(e) + r(V )− r(U)) (4.1)
in the scheduled DFG.
If the source node U is pipelined by a depth d(U) during scheduling, d(U)
number of delays from the outgoing edge e will be used for pipelining. Pipelining
is essential in scheduled circuits to enable higher throughput since performance
degrades by a factor of N with input coming every N clock cycles. Taking the
pipeline depth d(U) of the source node U into account, w(e) now maps to the
form in (4.2).
w(e) → N(w(e) + r(V )− r(U))− d(U) (4.2)
The schedule value p(V) of node V is defined as
p(V) = Σ_j j·x_{V,j},   j ∈ {0, 1, ..., N − 1},   (4.3)
so that p(V) ∈ {0, 1, 2, ..., N − 1}.
The value p(V ) for the node V gives the time step in which the node executes.
Additional delays amounting to the difference in the schedule values of nodes U
and V are introduced on the edge to maintain precedence relations. The final
number of delays on the edge e of the scheduled DFG is given by
fe = N(w(e) + r(V )− r(U))− d(U) + p(V )− p(U). (4.4)
The above expression is referred to as the folding equation [32] and gives the
number of delays on any edge of the DFG after scheduling in N time steps. fe
will hence be referred to as the folded delay.
The resource utilization cost of a schedule is modeled as
C_total = Σ_p c_p · M_p.   (4.5)
This cost is a weighted sum of the processing elements in the schedule, with the
weight c_p representing the individual cost associated with M_p, the number of
processing elements of type p.
The scheduled architecture must execute all operations in N time steps while
also preserving the functionality of the original algorithm. This imposes several
constraints on the ILP formulation which are modeled as follows.
• Each node in the DFG can be scheduled only once over the N time steps.
This condition gives |V| constraint equations of the form in (4.6).
Σ_j x_{i,j} = 1,   i ∈ {1, 2, ..., |V|}   (4.6)
• The total number of operations of type p scheduled in any time step cannot
exceed the number of resource elements M_p of type p.
Σ_{i∈p} x_{i,j} ≤ M_p,   j ∈ {0, 1, ..., N − 1}   (4.7)
• The delay on each edge after scheduling cannot be negative. This condition
gives the final |E| constraints.
N(w(e) + r(V) − r(U)) − d(U) + p(V) − p(U) ≥ 0   (4.8)
For a compact description of the constraints in (4.8) we express them in matrix form
in (4.9):
f = Nw + NAr + g ≥ 0.   (4.9)
Here f is the |E| × 1 vector of folded delays and r is the |V| × 1 vector of the
retiming weights associated with the nodes. The elements of the vectors w, Ar,
g are w(e), r(V) − r(U), and p(V) − p(U) − d(U), respectively.
The constraints in (4.7) ensure that the operations scheduled in every control
step can be processed by the available computing resources. The constraints
in (4.8), (4.9) maintain the input/output functionality of the original algorithm
after scheduling.
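For tiny graphs the ILP can be stood in for by exhaustive search, which makes the constraint set above easy to sanity-check. The sketch below is Python for illustration only; it assumes a single resource type and fixes the retiming weights at zero:

```python
from itertools import product

def min_resource_schedule(ops, edges, d, N):
    """Try every assignment of ops to time steps 0..N-1; keep the one that
    satisfies the folded-delay constraint (4.8) with r = 0 and minimizes
    the peak number of operations in any step."""
    best_peak, best_p = None, None
    for assign in product(range(N), repeat=len(ops)):
        p = dict(zip(ops, assign))
        # Constraint (4.8) with r = 0: N*w(e) - d(U) + p(V) - p(U) >= 0.
        if any(N * w - d[u] + p[v] - p[u] < 0 for (u, v, w) in edges):
            continue
        peak = max(sum(1 for o in ops if p[o] == j) for j in range(N))
        if best_peak is None or peak < best_peak:
            best_peak, best_p = peak, p
    return best_peak, best_p

# Two same-type ops joined by a zero-delay edge, no pipelining, N = 2.
peak, sched = min_resource_schedule(['V1', 'V2'], [('V1', 'V2', 0)],
                                    {'V1': 0, 'V2': 0}, 2)
```

Here the search finds that staggering the two operations lets a single processing element serve both, which is exactly the resource minimization the ILP performs.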
Figure 4.2: (a) Original DFG (b) Retimed DFG (c) Scheduled architecture.
4.1 Scheduling and Retiming
The benefit of retiming simultaneously with scheduling is demonstrated on the
DFG in Fig. 4.2. Nodes V1 and V2 represent operations of the same type which
have to be scheduled onto the single available resource V in Fig. 4.2(c). The
operation V1 is scheduled in the first time step (p(V1) = 1) while V2 is scheduled
in the second time step (p(V2) = 2). If the processing element V needs to be
pipelined by two stages (d(V1) = 2) to satisfy throughput constraints, we get the
following folding equation for the delays on edge e1:
f_e1 = N(r(V2) − r(V1)) − d(V1) + p(V2) − p(V1)   (4.10)
     = 2(r(V2) − r(V1)) − 1.
From the constraint f_e1 ≥ 0 we get
r(V2) − r(V1) ≥ 1/2.   (4.11)
Since r(V2) − r(V1) ∈ Z, (4.11) can be rewritten as
r(V2) − r(V1) ≥ ⌈1/2⌉ = 1.
Without retiming, the variables r(V1) = r(V2) = 0 and the two operations cannot
be scheduled onto the resource V. If retiming is allowed, a delay can move from
the outgoing edge of node V2 into edge e1, making r(V2) = 1 and r(V1) = 0.
The retiming variables now satisfy the constraint in (4.11) and the schedule is
feasible. This example illustrates how the larger movement of delays across the DFG
with retiming can help produce more area-efficient results.
A second advantage of retiming with scheduling is seen in feedforward al-
gorithms where insertion of extra registers at the input/output only increases
latency without affecting the functionality. These extra registers can be used
to pipeline the resource elements and speed up the schedule. It must be noted
that in latency constrained systems the retiming weights of the input and output
nodes will be restricted by the maximum allowable latency.
4.2 Modified ILP
Retiming variables are unbounded in the integer space and introduce exponential
time complexity when ILPs are solved using branch and bound methods. The
ILP model described in [17] minimizes resources after enforcing the vector r = 0 (no
retiming). We propose an approach where retiming is incorporated in the ILP
model, but the retiming vector is decoupled from the ILP so that the unbounded
variables do not increase the runtime exponentially. Since the retiming variables
are all integers, the constraint in (4.9) can be rewritten as
−Ar ≤ ⌊w + g/N⌋ = w + ⌊g/N⌋.   (4.12)
This simplification uses the lemma
⌊k + x⌋ = k + ⌊x⌋ if k ∈ Z.   (4.13)
An integral solution to the inequalities in (4.12) can be obtained using the
Bellman-Ford shortest-path algorithm [8]. A necessary and sufficient condition
for the existence of a solution to (4.12) is
B(w + ⌊g/N⌋) ≥ 0.   (4.14)
This condition is easily explained with the following example. Without loss of
generality, consider a set of 3 inequalities of the form in (4.12) such that
the sum of their left-hand sides is zero:
r(1) − r(2) ≤ c1   (4.15)
r(2) − r(3) ≤ c2   (4.16)
r(3) − r(1) ≤ c3   (4.17)
The above set of inequalities can have a solution only when
c1 + c2 + c3 ≥ 0.   (4.18)
If the system of inequalities in (4.12) is represented using a constraint graph, such
that
• U ∈ {1, 2, ..., |V|} are the nodes of the graph,
• an edge e from U to V exists if the inequality r(V) − r(U) ≤ w(e) + ⌊g(e)/N⌋
exists,
• an edge e from U to V is weighted by w(e) + ⌊g(e)/N⌋,
then the inequalities from (4.15)-(4.17) will form a loop in the constraint graph.
The condition in (4.18) implies that the sum of the weights of all edges in any loop
of the constraint graph must be non-negative. This reduces to the constraint in
(4.14) since the constraint graph in this case is the original data-flow-graph and
the loops in the graph are given by the B matrix. Hence the retiming inequalities
in (4.12) need not be a part of the ILP. If the constraint in (4.14) is introduced
in the ILP, it will be ensured that the retiming vector can be solved by Bellman-
Ford once scheduling is complete. The constraint in (4.14) does not contain any
unbounded integer variables and therefore the CPU runtime for this formulation
is not increased significantly. Bellman-Ford solves the retiming inequalities in O(|V||E|) time, converging quickly to a solution.
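The retiming inequalities in (4.12) form a system of difference constraints, which maps directly onto a single-source shortest-path problem: add a virtual source with zero-weight edges to every node, relax all edges, and read the retiming values off the resulting distances; a negative-weight cycle (a violation of (4.14)) is detected by one extra relaxation pass. A minimal sketch in Python, assuming the constraint values c(e) = w(e) + ⌊g(e)/N⌋ have already been computed:

```python
def solve_retiming(num_nodes, constraints):
    """Solve the difference constraints r(v) - r(u) <= c via Bellman-Ford.

    `constraints` is a list of (u, v, c) triples. Returns a feasible
    retiming vector, or None when some cycle in the constraint graph has
    negative total weight, i.e. when condition (4.14) is violated.
    """
    INF = float("inf")
    # A virtual source (index num_nodes) reaches every node at cost 0,
    # making the whole system jointly solvable from one origin.
    edges = list(constraints) + [(num_nodes, v, 0) for v in range(num_nodes)]
    dist = [INF] * num_nodes + [0]
    for _ in range(num_nodes):            # |V| relaxation passes
        for u, v, c in edges:
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
    # One extra pass: any further relaxation implies a negative cycle.
    for u, v, c in edges:
        if dist[u] + c < dist[v]:
            return None
    return dist[:num_nodes]               # r(v) = shortest distance to v

# The 3-inequality loop of (4.15)-(4.17): feasible iff c1 + c2 + c3 >= 0.
r = solve_retiming(3, [(1, 0, 2), (2, 1, -1), (0, 2, 0)])     # sum = 1
bad = solve_retiming(3, [(1, 0, -2), (2, 1, -1), (0, 2, 0)])  # sum = -3
```

The second call reproduces the loop of (4.15)-(4.17) with c1 + c2 + c3 < 0, for which no retiming vector exists.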
The constraint in (4.14), however, introduces the nonlinear floor function into a linear model. To model the floor function linearly we introduce extra variables t(e) and q(e) in the ILP. This increases the variable complexity of the ILP from O(N|V|) (without retiming) to O(N|V| + 2|E|) (with retiming). However, as shown below, these variables are bounded and therefore do not increase the CPU runtime significantly. The constraint in (4.14) is then modeled as follows:
B(t+w) ≥ 0. (4.19)
Here t is an |E| × 1 vector with individual elements t(e) given by

t(e) = ⌊g(e)/N⌋. (4.20)
Since t(e) represents the floor of g(e)/N, the following relation will always hold:

g(e)/N = t(e) + frac(g(e)/N). (4.21)

The fractional part frac(g(e)/N) is modeled by the variable q(e), and the new set of |E| constraints for the modified ILP is expressed in (4.22):

t(e) = g(e)/N − q(e), q(e) ∈ [0, 1), t(e) ∈ Z (4.22)

g(e) = p(V) − p(U) − d(U) (4.23)
The variable q(e) is bounded between 0 and 1, and the bounded value of g(e) ∈ [−N − 1 − max(d(U)), N − 1] bounds t(e). The equation in (4.22) completes the description of the modified ILP with bounded variables.
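The decomposition in (4.20)-(4.22) can be checked numerically; note that t(e) must be the true floor, not the truncation, of g(e)/N, which matters when g(e) is negative. A small sketch (the values of g and N below are hypothetical):

```python
def linearize_floor(g, N):
    """Split g/N into integer part t and fractional part q per (4.20)-(4.22)."""
    t = g // N        # true floor division (not truncation toward zero)
    q = g / N - t     # fractional remainder, always in [0, 1)
    return t, q

# Floor and truncation differ for negative edge quantities:
print(linearize_floor(7, 4))    # t = 1, q = 0.75
print(linearize_floor(-7, 4))   # t = -2, q = 0.25
```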
The modified ILP model has been used in our optimization framework to gen-
erate more area- and throughput-efficient schedules compared to the ILP which
does not incorporate retiming. Supporting results are presented for benchmark
DSP algorithms in Chapter VI. The next chapter details the optimization flow in Simulink, HDL generation, RTL synthesis, and the energy estimation methods used.
CHAPTER 5
CAD Design Flow
The system optimization flow for automating the architectural transformations is detailed in this chapter. The flow starts from the direct-mapped DFG representation of the algorithm in Simulink. This is followed by extraction of the incidence and loop matrices of the DFG in MATLAB. Based on circuit-level energy-delay sensitivity results and system specifications (throughput, area, power), a suitable selection of transformations is made and the direct-mapped architecture is optimized. This process is outlined in the flow chart shown below; details of each step in the process follow.
Figure 5.1: Design and optimization flow.
5.1 Simulink Modeling
Simulink is a graphical environment embedded within MATLAB, useful for high-level modeling and evaluation of algorithms. It contains a pre-defined library of components like adders, multipliers, registers, and multiplexers, as well as more complex dedicated blocks like FFT and CORDIC (Fig. 5.2). In addition, it is possible to create a user-defined library with custom blocks. This feature becomes particularly useful when dealing with large designs that have user-specified complex macros. Examples of such custom blocksets include the commercially available Xilinx XSG and Synplify DSP.
Figure 5.2: Simulink pre-defined library (Synplify DSP blockset).
QAM Communication System Example: A Simulink model for a Quadrature Amplitude Modulated (QAM) communication system is shown in Fig. 5.3. The model takes two random integers as input and modulates them into a QAM signal. This signal is low-pass filtered (raised-cosine filter) to constrain it within the allowable bandwidth. The signal then passes through a channel with white Gaussian noise (the SNR of this channel can be user-specified). The received signal is again low-pass filtered and demodulated to recover the transmitted symbol. The entire system can easily be emulated using Simulink blocks as shown in Fig. 5.3. This modeling also captures wordlength quantization effects, since the baseband processing unit (low-pass filter) is implemented using finite-precision arithmetic (supported by the Synplify DSP blockset).
Figure 5.3: Baseband processing in a QAM system.
Results of bit error rate simulation for the system are shown in Fig. 5.4. The system was tested with both finite-precision arithmetic (16-bit datapath, 14-bit fractional length) and floating-point arithmetic (ideal full precision). We
Figure 5.4: BER vs. SNR curve for the QAM system.
see a small degradation in bit error rate owing to quantization noise introduced
in the fixed-point arithmetic.
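The fixed-point format used above (16-bit datapath, 14-bit fractional length) can be emulated with a simple round-and-saturate quantizer; a sketch, assuming round-to-nearest (the rounding mode actually used by the blockset is not specified here):

```python
def quantize(x, total_bits=16, frac_bits=14):
    """Round x to a signed fixed-point grid of total_bits with frac_bits fraction."""
    scale = 1 << frac_bits                      # 2^14 levels per unit
    lo = -(1 << (total_bits - 1))               # most negative code
    hi = (1 << (total_bits - 1)) - 1            # most positive code
    code = max(lo, min(hi, round(x * scale)))   # round to nearest, then saturate
    return code / scale

# The quantization step is 2^-14, so the rounding error is at most 2^-15:
err = abs(quantize(0.123456789) - 0.123456789)
```

The quantization noise introduced by this rounding is what produces the small BER degradation visible in Fig. 5.4.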
Figure 5.5: Simulink model for an 8-tap FIR filter.
The next example focuses on verification and synthesis of the baseband filter
Figure 5.6: Input with normalized frequencies of 0.03 & 0.4, and corresponding output that passes the lower frequency.
in the above communication system. A direct-mapped structure for the reference
architecture was created using library blocks from Synplicity and the filter was
implemented with 16-bit fixed-point arithmetic. The direct-mapped architecture
for the 8-tap low-pass FIR filter is shown in Fig. 5.5. Functional verification
of the model can be carried out by applying appropriate inputs. The inputs
can either be generated in Simulink from blocks like frequency synthesizers or
can be user-specified from the MATLAB workspace. Simulation results can be exported to the MATLAB workspace or viewed with the help of the scope display block in the library. Simulation results for the FIR filter are shown in Figs. 5.6
and 5.7. The FIR structure in Fig. 5.5 is a low-pass filter with cut-off at 0.2 rad/s. To verify its frequency-selective nature, an input which was a combination of two sinusoidal frequencies, 0.03 rad/s and 0.4 rad/s, was applied to the filter. Figure 5.6 illustrates how the filter passes the 0.03 rad/s input (passband) and suppresses the 0.4 rad/s input (transition band) in the output (the dashed line is the lower-frequency output).
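The same two-tone test is easy to reproduce outside Simulink. The sketch below uses a hypothetical 8-tap moving-average low-pass filter in place of the thesis's raised-cosine taps, but shows the identical verification idea: the slow tone survives while the fast tone is attenuated:

```python
import math

def fir_filter(coeffs, x):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k]."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, h in enumerate(coeffs):
            if n - k >= 0:
                acc += h * x[n - k]
        y.append(acc)
    return y

# Hypothetical 8-tap moving-average low-pass filter (stand-in taps only).
h = [1.0 / 8] * 8

# Two-tone input: a slow passband tone plus a fast tone the filter rejects.
N = 256
x = [math.sin(2 * math.pi * 0.01 * n) + math.sin(2 * math.pi * 0.25 * n)
     for n in range(N)]
y = fir_filter(h, x)

# Past the start-up transient, the fast tone is attenuated (the 8-tap
# moving average even has an exact null at 0.25 cycles/sample), so the
# output peak stays near the slow tone's unit amplitude.
peak = max(abs(v) for v in y[32:])
```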
Figure 5.7: Activity factor for a 16-bit sinusoidal input of normalized frequency 0.25.
5.2 RTL Generation and Synthesis
Simulink is also convenient as a tool because it can be used to bridge the gap between high-level description of algorithms and physical synthesis of architectures. SynDSPTool, which is a part of the SynDSP blockset, is capable of automatically generating synthesizable RTL in Verilog/VHDL from the Simulink models. HDL descriptions of the architectures can then be synthesized with backend tools like Cadence RC Compiler or Synopsys Design Compiler. Synthesis results from the architectures then provide us with accurate area and throughput results.
Input switching activity can be extracted in MATLAB from target test vectors
Figure 5.8: Energy-area-delay tradeoffs at the circuit and micro-architectural
level.
and then propagated during synthesis for accurate energy estimates. Figure 5.7 shows the input switching activity for a 16-bit sinusoidal input applied to the FIR filter. The results indicate that the activity is higher for the LSB bits and decreases as we move towards the MSB. Tools like Cadence RC Compiler take in the input switching activity information and propagate the switching probabilities across the whole design to provide energy estimates. Hence we can now fully characterize an architecture in the energy-area-delay space in an automated fashion starting from the Simulink description.
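The per-bit switching activity of Fig. 5.7 is simply the fraction of clock cycles in which each bit of the two's-complement input word toggles. A sketch of the extraction (a slowly varying tone is used here so the waveform sweeps many code values; the exact test vectors behind Fig. 5.7 are not reproduced):

```python
import math

def bit_activity(samples, bits=16):
    """Per-bit toggle probability for a stream of two's-complement integers."""
    mask = (1 << bits) - 1
    toggles = [0] * bits
    for prev, cur in zip(samples, samples[1:]):
        diff = (prev ^ cur) & mask            # bits that switched this cycle
        for b in range(bits):
            toggles[b] += (diff >> b) & 1
    n = len(samples) - 1
    return [t / n for t in toggles]           # index 0 = LSB, 15 = MSB

# 16-bit sinusoid with 14 fractional bits; a slow tone so the waveform
# sweeps many code values between samples.
sig = [round(math.sin(2 * math.pi * 0.013 * n + 0.1) * (1 << 14))
       for n in range(1024)]
act = bit_activity(sig)
# act[0] (the LSB) toggles on roughly half the cycles; act[15] (the sign
# bit) toggles only twice per sinusoid period.
```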
5.3 Architectural Optimization
Based on energy-delay sensitivity results [1] from synthesis of the direct-mapped
architecture and the underlying macros we can figure out which transformations
get us closest to the system specifications. This concept is illustrated in Fig.
5.8 where the impact of time-multiplexing, parallelism, pipelining etc. has been
Figure 5.9: Choosing values of N , P and R based on energy-delay sensitivity.
shown in the energy-area-delay space [38]. Time-multiplexing reduces the area of the design but increases the energy due to additional energy consumption in control circuitry like multiplexers, memories, etc. Parallelism coupled with supply voltage scaling, on the other hand, helps reduce the energy while increasing the area. Pipelining can improve the throughput or reduce energy similarly to parallelism, but at a reduced area overhead. The benefits of pipelining saturate, however, when the delay overhead introduced by the registers begins to dominate the reduction achieved in the critical path.
Depending upon the system specifications like throughput, area or power
consumption and the optimization objective, the degree of time-multiplexing (N),
parallelism (P ) and extra latency introduced via retiming (R) can be set by the
user (Fig. 5.9) [42]. If a lower area is desired at a fixed supply voltage with a loss
in the achievable throughput, then we opt for a higher value of N . Increasing
the value of N also increases the energy consumption due to additional energy
overhead in the control circuitry. For a fixed supply voltage retiming (R) and
parallelism (P ) can only improve the throughput of the system. The bounds on
the values of N and P are determined by the throughput constraints and the
lower limit on the supply voltage. The value of R is bounded by the maximum
I/O latency which can be inserted. This value is significant only in feed-forward portions of the architecture, where extra registers can be introduced. For recursive structures, retiming can only balance the logic depth between registers to obtain the lowest possible critical path (R = 0 for feedback structures). Retiming
and parallelism coupled with supply voltage scaling improves the energy-efficiency
as was explained earlier in Chapter III.
For example, for the FIR filter (Fig. 5.5), if the objective is to reduce the area roughly by a factor of two by trading off speed, then we set the degree of time-multiplexing to N = 2. On the other hand, if the objective is to double the throughput or improve energy efficiency, then parallelism (P) can be employed. For a given set of system specifications it is possible to find several architectures which meet the system constraints, in which case the architecture which best meets the optimization objective must be selected for synthesis.
The values of N, P and R are next sent to the MATLAB/Simulink-based optimizer as illustrated in Fig. 5.1. The optimizer first extracts the connectivity information from the direct-mapped architecture in the form of incidence and loop matrices (Chapter III). The matrices are used to model the constraints in the ILP set up in MOSEK, an optimization tool embedded in MATLAB. After the ILP simulations are complete, the optimizer uses the results to automatically construct the Simulink model for the resulting optimized architecture. The optimized model can be synthesized in the target technology to verify whether it meets the system constraints or whether there is a need for further refinement. In the
Figure 5.10: Time-multiplexed and parallel implementation of a 16-tap FIR filter.
latter case, the values of N, P and R can be changed and the process repeated until the system constraints and optimization objective are met. The user can iteratively generate multiple architectures for the target algorithm by varying the values of N, P and R, which enables effective exploration of the design space. This process was carried out for the FIR filter, and the Simulink-level results of time-multiplexing (N=2), parallelizing (P=4) and retiming (R=1) for the FIR filter are shown in Fig. 5.10. A detailed discussion of the synthesis results of time-multiplexing, retiming and parallelizing this filter follows in the next chapter.
Figure 5.11: Control circuitry using M-Control blocks.
5.3.1 Controller Generation
Automating the generation of control circuitry is an important part of high-level
synthesis. This has been done with the aid of M-Control blocks from Synplicity’s
Synplify DSP blockset [39] in Simulink. The M-Control block is a MATLAB
function which generates certain outputs in response to certain input patterns.
The example in Fig. 5.11 illustrates the use of this block to generate controllers
for scheduled architectures. The M-Control function for this architecture can be
written as
function [Sel] = M_Control(count)
if (count == 1)
    Sel = 1;   % In1 is the output of the mux
elseif (count == 4)
    Sel = 2;   % In2 is the output of the mux
end
end
The controller block must route the correct signal into the processing elements
every clock cycle, hence the output of the block depends on the schedule of
the processing elements. The generation of MATLAB scripts for the M-Control
blocks has been automated in this work. The control script is a collection of
if-then-else statements which generate the correct value of the select signal for
the multiplexers at every control step (clock cycle). A parameterized function
which accepts the schedule of the processing elements and the registers in the
design was written in MATLAB to generate this control script. This MATLAB
script can then be translated into synthesizable RTL using the SynDSP function
embedded in the Synplicity DSP blockset.
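The automated generation of the control script can be sketched as a small generator that turns a schedule (control step → mux select) into the if/elseif MATLAB text shown above; the schedule below is hypothetical, matching the two-input mux of Fig. 5.11:

```python
def make_mcontrol(name, schedule):
    """Emit MATLAB text for an M-Control function from a schedule.

    `schedule` maps a control step (counter value) to the mux select value
    that routes the correct operand into the shared processing element.
    """
    lines = [f"function [Sel] = {name}(count)"]
    keyword = "if"
    for step, sel in sorted(schedule.items()):
        lines.append(f"{keyword} (count == {step})")
        lines.append(f"    Sel = {sel};")
        keyword = "elseif"
    lines.append("end")   # closes the if/elseif chain
    lines.append("end")   # closes the function
    return "\n".join(lines)

# Hypothetical schedule matching Fig. 5.11: In1 at step 1, In2 at step 4.
script = make_mcontrol("M_Control", {1: 1, 4: 2})
```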
We have now described the complete optimization framework, which includes architecture modeling starting from the Simulink description, functional verification of the model, and finally RTL synthesis via backend tools. The next chapter will discuss the results of our formal approach to architectural optimization.
CHAPTER 6
Results
This chapter compares high-level and logic synthesis results obtained from exist-
ing ILP scheduling and the modified scheduling model which integrates retiming
(outlined in Chapter IV). This is followed by a discussion on the design-space ex-
ploration results of a 16-tap FIR filter (used in ultra-wide-band applications). The
energy-area-throughput results obtained by scheduling, retiming and parallelizing
this filter in 90 nm CMOS technology are presented. Hierarchical design-space
exploration is illustrated for a multi-core MIMO sphere decoder.
6.1 Comparison of Existing and Modified Scheduling
Algorithms
The modified ILP was verified on a general class of feedforward and recursive algorithms which exhibit varying degrees of structural complexity. The feedforward algorithms selected were a 16-tap FIR filter and an 8-point discrete cosine transform (DCT), while the recursive algorithms include second-order IIR, four-stage lattice and elliptic wave digital filters (Fig. 6.4). The ILP simulations for high-level synthesis were run using the ILOG/OPL optimization tool on a 32-bit Intel Core 2 CPU running at 2.0 GHz.
Table 6.1: Comparison of scheduling and scheduling with Bellman-Ford retiming.

                     Adder  Mult   Adds     Adds      Mults    Mults     CPU (s)  CPU (s)   CPU (s)
Design         N     pipe   pipe   (Sched)  (Sch&BF)  (Sched)  (Sch&BF)  (Sched)  (Sch&BF)  (Sch&Retime)
Wave           3     0      1      -        12        -        4         -        0.25      5376
Digital        4     0      1      8        7         4        2         0.18     1.25      2.25
Filter         8     0      1      4        4         2        1         13.9     45.5      20.5
               16    1      2      -        3         -        1         -        264       >6000
2-stage        2     0      1      8        4         8        4         0.28     0.26      0.30
IIR            2     0      2      -        4         -        4         -        0.28      0.28
Filter         4     0      2      4        2         4        2         0.26     0.30      0.30
4-stage        2     0      2      -        6         -        8         -        0.13      0.3
Lattice        2     1      2      -        6         -        8         -        0.26      0.26
Filter         3     0      2      6        4         8        5         0.26     0.25      0.25
               3     1      2      -        4         -        5         -        0.26      0.26
               4     0      2      4        3         5        4         0.28     0.21      0.26
               4     1      2      -        3         -        4         -        0.26      0.26
8-point        2     0      1      16       16        16       8         0.14     0.26      0.26
DCT            3     0      2      16       11        16       6         0.15     0.15      0.40
(1-D)          3     1      2      -        11        -        6         -        0.15      0.15
               4     0      2      8        8         8        4         0.29     0.28      0.26
               4     1      2      -        8         -        4         -        0.25      0.25
               8     1      2      5        4         4        2         6.0      0.25      0.26
16-tap         2     0      1      15       8         14       8         0.25     0.28      0.25
FIR            2     1      2      -        8         -        8         -        0.25      0.28
               4     1      2      8        4         7        4         0.20     0.20      0.26
               8     1      2      3        2         3        2         0.36     0.25      0.26
In addition, RTL synthesis was done in 90 nm CMOS technology for selected architectures (FIR, wave digital and lattice filters) to investigate the area-performance tradeoff offered by scheduling (the existing ILP) and scheduling with Bellman-Ford (BF) retiming (the modified ILP).
Table 6.1 compares the high-level synthesis results for both approaches. In all cases considered, the modified ILP outperforms the existing ILP scheduling model in terms of the number of resource elements (adders/multipliers) needed to execute the algorithm. For several combinations of folding factor (N) and pipeline depth, the existing ILP model could not reach a feasible solution, indicating that the modified ILP can traverse the area-throughput space more effectively.
A comparison of simulation runtimes was made among three approaches: scheduling, scheduling with BF retiming, and scheduling with unbounded retiming variables in the ILP (unbounded ILP). For all the algorithms considered except the wave digital filter, the modified ILP and the unbounded ILP have runtimes comparable with the existing ILP. Although the modified ILP and the unbounded ILP converge to the same solution when minimizing resource count, their runtimes differ significantly for the wave digital filter (Fig. 6.4), which is a complex structure with a large number of loops (the A and B matrices have high dimensions). The runtime with unbounded variables in the ILP degrades rapidly in this case since it has a larger search space to cover (> 6000 s for N=16). The runtime of the modified ILP, on the other hand, shows a more graceful degradation (264 s for N=16) owing to the reduced search space.
Figures 6.1, 6.2, 6.3 show the synthesis results for a fifth-order elliptic wave
digital filter (WDF), 16-tap FIR and a 4-stage lattice filter, respectively. The
area and throughput numbers have been normalized to the reference architec-
Figure 6.1: Synthesis results for a fifth order elliptic wave digital filter.
Figure 6.2: Synthesis results for a 16-tap FIR filter.
Figure 6.3: Synthesis results for a 4th order all pole lattice filter.
ture (the original architecture, which is not scheduled). It was observed in all three examples that an equal reduction in throughput does not, in most cases, yield an equal reduction in area (traversing from point A to point B in Fig. 6.1 results in a 38% reduction in throughput but only a 16% reduction in area). This trend is expected, since the increase in register and controller area after scheduling partly offsets the area reduction achieved by lowering the number of resource elements. Also, with a higher degree of scheduling it is not always possible to increase the degree of pipelining of the resource elements, which results in a lower-than-expected throughput. Both these factors contribute to the area-delay product [34] becoming greater than 1 (the area-delay product for the reference is 1) for scheduled architectures (Table 6.2).
As mentioned earlier, scheduling with BF retiming is able to produce results for a larger combination of folding factors and pipeline depths, allowing a higher throughput for scheduled architectures (Fig. 6.3). The area results for the
Table 6.2: Comparison of normalized area-delay product for scheduling and scheduling with Bellman-Ford retiming.

Design    Scheduling   Scheduling & BF   Gain
WDF       1.5011       1.2795            14.76 %
Lattice   3.6329       1.4456            60.208 %
FIR       3.9530       2.9328            25.808 %
modified ILP also show an improvement over the existing approach (Fig. 6.2) due to the larger movement of delays across the design owing to retiming (explained in detail in Chapter IV). For a fair comparison between the area and throughput of the architectures generated by the two approaches, we compute the average area-delay product of the synthesized results (Table 6.2). The numbers in Table 6.2 indicate the mean value of the area-delay product of the synthesized architectures for the three examples. These numbers were further averaged across all three examples, and a 33% average reduction in the area-delay product was observed for the modified ILP. The results clearly demonstrate that retiming integrated with scheduling produces more area- and throughput-efficient architectures when compared to scheduling without retiming.
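The Gain column of Table 6.2 and the quoted 33% average follow directly from the tabulated area-delay products:

```python
# Normalized area-delay products from Table 6.2: (scheduling, scheduling + BF).
adp = {
    "WDF":     (1.5011, 1.2795),
    "Lattice": (3.6329, 1.4456),
    "FIR":     (3.9530, 2.9328),
}

# Gain = relative reduction in area-delay product from adding BF retiming.
gains = {k: (sched - bf) / sched * 100 for k, (sched, bf) in adp.items()}
avg = sum(gains.values()) / len(gains)
# gains reproduces the table (~14.76%, ~60.21%, ~25.81%); avg is ~33.6%.
```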
6.2 Design Space Exploration: 16-tap FIR Filter
The optimization flow detailed in Chapter V was first verified on a 16-tap FIR filter because of its simplicity and well-understood structure. Transformations like scheduling, retiming and parallelism were applied to the filter, combined with supply voltage scaling and micro-architectural techniques such as carry-save arithmetic [29]. The result was an array of optimized architectures, each unique in the energy-area-performance space. A comparison of these architectures
has been made in Figs. 6.5, 6.6 with contour lines connecting the architectures
which have the same throughput.
Figure 6.4: Fifth-order wave digital elliptic filter.
Figure 6.5 shows the effect of carry-save optimization on the direct-mapped reference architecture. The reference architecture without carry-save arithmetic (CSA) consumes a larger area and is slower compared to the design which employs CSA optimization. To achieve the same reference throughput (set at 100 Ms/s for all architectures during logic synthesis), the architecture without CSA must upsize its gates or use complex adder structures like carry-look-ahead, which increases the area and switched capacitance, leading to an increase in energy consumption as well. The CSA-optimized architecture still performs better in terms of achievable throughput, which highlights the effectiveness of this technique.
Following CSA optimization the design is retimed to further improve the
throughput. From Fig. 6.5 we see that retiming improves the achievable through-
Figure 6.5: Synthesis results for retimed and time-multiplexed FIR filters.
Figure 6.6: Increase in register area with retiming.
put from 350 Ms/s to 395 Ms/s (13%) but also results in a small area increase (3.5%). The area increase is attributed to the movement of registers from a single output edge to multiple input edges, as shown in Fig. 6.6, where the register count increases from one to two.
Figure 6.7: (a) FIR with an extra latency at the output (b) Retimed version.
Increasing the input-to-output latency in feedforward systems also results in considerable throughput enhancement. This is illustrated in Fig. 6.7(b), where retiming at the Simulink level cuts the critical path from Tadd + Tmult down to Tmult. The results from logic synthesis (Fig. 6.5) show a 30% throughput improvement from 395 Ms/s to 516 Ms/s. The area increases roughly by 22% due to the extra register insertion shown in Fig. 6.7(b). Retiming during logic synthesis performs fine-grain pipelining inside the multipliers to balance the logic depth across the design. This step improves the throughput to 623 Ms/s (20% increase).
Scheduling the filter results in an area reduction of about 20% compared to the retimed reference architecture, and a throughput degradation of about 40%. The area reduction is small for this filter, since the number of taps in the design is small and the decrease in adder and multiplier area is offset by the increased area of the registers and multiplexers. Retiming the scheduled architecture results in a 12% improvement in throughput but also a 5% increase in area due to the larger number of registers, as explained earlier in the chapter.
Supply-voltage scaling is another degree of freedom available when exploring the design space. From energy and throughput results at the nominal supply voltage (Vdd = 1 V for 90 nm CMOS) we can obtain throughput and energy numbers at reduced supply by using the models for delay and power in (6.1), (6.2):

1/Throughput = K · Vdd / (Vdd − Vth)^α (6.1)

Power ∝ Vdd^2 · Throughput (6.2)
The equation in (6.1) is the well-known alpha-power law model used for computing the logic delay of a circuit [40]. The value of α typically ranges between 1 and 2 depending upon the target technology.
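The scaling curves can be reproduced from (6.1) and (6.2) alone; the sketch below assumes illustrative values Vth = 0.25 V and α = 1.5, since the thesis does not fix these constants here:

```python
def throughput_scale(vdd, vdd_nom=1.0, vth=0.25, alpha=1.5):
    """Throughput at vdd relative to nominal, from the alpha-power law (6.1).

    vth and alpha are assumed, illustrative values; they are not fixed by
    the thesis for this 90 nm process.
    """
    def speed(v):
        return (v - vth) ** alpha / v   # proportional to 1/delay
    return speed(vdd) / speed(vdd_nom)

def energy_scale(vdd, vdd_nom=1.0):
    """Switching energy per operation scales as Vdd^2, per (6.2)."""
    return (vdd / vdd_nom) ** 2

# Scaling from 1.0 V down to 0.6 V trades speed for a large energy saving:
t_rel = throughput_scale(0.6)   # roughly half the nominal throughput here
e_rel = energy_scale(0.6)       # 0.36x the energy per operation
```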
Figure 6.5 illustrates how both the throughput and the energy scale with decreasing Vdd. The retimed reference architecture can operate between 100 Ms/s and 395 Ms/s (4× variation in throughput) for Vdd values between 0.35 V and 1 V in a 90 nm technology. It is interesting to note that for this 4× change in throughput we achieve a 3× change in energy. The system designer therefore has the option to trade off throughput for increased energy efficiency by applying supply voltage scaling.
The unfolding algorithm was applied to the 16-tap FIR to generate the architectures shown in Fig. 6.8. Note that the Simulink models and RTL for these architectures were generated automatically by the optimizer. The parallelism variable P has been varied from 2 to 12 to generate a series of architectures which exhibit a range of throughputs and energy efficiencies. The
Figure 6.8: Synthesis results for parallel FIR filters.
throughput varies from 40 Ms/s to 3.4 Gs/s, while the energy efficiency ranges from 0.5 GOPS/mW to 5 GOPS/mW. It is possible to improve the energy efficiency significantly with continued Vdd scaling if sufficient delay slack is available. The supply voltage has been scaled in the 90 nm technology over the range 1 V to 0.32 V. In Fig. 6.8 we see a clear tradeoff between energy/throughput and area. The final choice of architecture will ultimately depend on the throughput constraints, available area and power budget.
6.3 Hierarchical Design: Multi-Core MIMO Sphere De-
coder
Optimization of complex architectures in Simulink can be done hierarchically if
Energy-Delay sensitivity [1] results for the smaller modules in the system are
Figure 6.9: Multi-core MIMO sphere decoder.
available. We take the example of a MIMO sphere decoder architecture (Fig. 6.9) [41],[42] to exhibit this hierarchical extension. The processing element (PE) of the decoder has to find the best possible match for the transmitted symbol within a pre-defined search radius of the symbol constellation. The workload of searching for the correctly decoded symbol can either be handled by a single PE or distributed across multiple PEs (multi-core architecture) [41]. This is equivalent to parallelism: with more processing elements, the incoming symbol can be decoded more quickly or with higher energy efficiency (by scaling Vdd as explained in Chapter IV). To obtain energy-delay trade-off curves for the multi-core architecture it is
Figure 6.10: Simulink model for the multi-core MIMO sphere decoder.
sufficient to extrapolate results from the single-core architecture. The maximum throughput achieved by the single-core design was 100 Ms/s at a supply voltage of 1 V, taking up an area of 0.55 mm2. For the decoder to work at a higher throughput or higher energy efficiency we must vary the degree of parallelism (P). This corresponds to architectures with a higher number of processing elements. A 16-core architecture automatically generated in Simulink is shown in Fig. 6.10, with the scheduler controlling the communication between the PEs. The energy-delay tradeoff curves for the decoder architecture with varying numbers of PEs are shown in Fig. 6.11. We see a 10× tuning range in energy efficiency when varying the degree of parallelism from P = 1 to P = 16 and scaling the supply voltage between 1 V and 0.32 V. A range of throughputs from 100 Ms/s to 1.5 Gs/s can also be achieved if the supply voltage is maintained at 1 V.
Figure 6.11: Synthesis results for the multi-core MIMO sphere decoder.
The results presented in this chapter show the efficiency of integrating retiming with scheduling in the ILP model. Design-space exploration results show the effect of each transformation (scheduling, retiming, parallelism, CSA, Vdd scaling) in the energy-area-delay space. Also highlighted is the automatic generation of multiple architectures (Simulink model and RTL) for a given algorithm. This allows the user to pick the design which best meets the system specifications and optimization objective. The next chapter concludes this thesis with a summary of research contributions.
CHAPTER 7
Conclusions & Future Work
To conclude, we summarize the main contributions of this work and also discuss
possible directions for future research.
7.1 Summary of Research Contributions
• Developed an automated flow for optimizing DSP architectures, starting
from architectural modeling in Simulink followed by MATLAB optimization
through various architectural transformations.
• The optimization flow uses Synplicity and Cadence backend tools to integrate RTL synthesis and power estimation in the framework. Suitable architectural transformations are decided upon based on the system constraints (throughput, area-energy budget) and energy-delay (E-D) sensitivity results extracted from the circuit level.
• Developed a modified ILP scheduling model which integrates retiming and
improves the energy-delay product by 33% on an average compared to re-
sults from ILP without retiming.
• The worst case CPU runtime for the modified ILP is much better compared
to the case where unbounded retiming variables are present in the ILP. For
the fifth-order elliptic wave digital filter example the modified ILP achieved
almost 20× reduction in worst case CPU runtime.
• Hierarchical optimization is illustrated for a complex MIMO sphere decoder
kernel based on circuit-level results of the underlying macros.
7.2 Future Work
• Extend the optimization framework to support multi-rate systems.
• Include the effect of interconnects at the Simulink level by developing suitable models for wire delay and power.
• Develop area and power models for memory units (SRAM, DRAM) at the
Simulink level.
• Include support for dynamic scheduling in a real-time environment for
applications like software-defined radios.
Appendix 1: GUI Environment
A graphical user interface (GUI) was built in MATLAB to provide an easy interface
for applying the transformations. The user must first create the reference
(direct-mapped) architecture using Simulink, Synplify DSP, or other user-defined
components. Once created, this model will appear in the GUI's listbox
menu under the header Simulink Model. The model must be selected from the
menu in order to load it into MATLAB's workspace. The next step is entering the
design components used by the reference architecture (e.g., adders and multipliers
for a filter). If the design components are pipelined, the pipeline depth must
also be specified under the header Pipeline depth. Hitting the Extract Model
button in the GUI then extracts the incidence, loop, weight and pipeline
matrices/vectors from the Simulink model. All the relevant DFG connectivity
information is now present in the MATLAB workspace. This step also reports the
total number of components present in the reference design (15 adders and 16
multipliers for the filter example shown in Fig. 7.1).
Extraction of the incidence, loop and other matrices is independent of the nature
of the components used in the Simulink model. At present the user must provide
the details of the components in the model and their exact path through the GUI;
we plan to automate this process in the future. From the Select Design
Components menu the user selects the blocks that are used in the reference design.
As the design components are selected, their names appear under the header
Design Components. It is assumed that the user has either synthesized the
reference architecture or hierarchically extrapolated results from the underlying
macros to compute its energy, area and performance. Depending upon the target
specifications, the user now decides on the degree of pipelining, parallelism or
folding/scheduling to be applied to the reference architecture.

Figure 7.1: GUI built within MATLAB to facilitate transformations.

The degree of pipelining for each component in the model can be set by entering
values in the box next to the name of the component (which appears under the
header Design Components). The degrees of scheduling (N), retiming (R) and
parallelism (P) can likewise be set by entering the corresponding values in the
boxes next to Schedule, Retime and Parallel, respectively.
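As an illustration of the kind of connectivity data the extraction step places in the workspace, the following sketch (written in Python rather than MATLAB, with hypothetical variable names) builds the incidence matrix and a loop-delay count for a first-order IIR filter y[n] = a·y[n-1] + x[n], whose DFG has one adder, one multiplier, and a single loop containing one register:

```python
# Hypothetical sketch of the matrices extracted from a DFG, for a
# first-order IIR filter y[n] = a*y[n-1] + x[n].
# Nodes: 0 = adder, 1 = multiplier.
# Edges: 0 = adder -> multiplier (through one register),
#        1 = multiplier -> adder (no register).
nodes = ["add", "mult"]
edges = [(0, 1), (1, 0)]   # (source node, destination node)
w = [1, 0]                 # register (delay) count on each edge

# Incidence matrix: +1 where an edge leaves a node, -1 where it enters.
A = [[0] * len(edges) for _ in nodes]
for e, (src, dst) in enumerate(edges):
    A[src][e] += 1
    A[dst][e] -= 1

# The single loop add -> mult -> add traverses both edges once; its row
# in the loop matrix B selects them, and B*w gives the loop's registers.
B = [[1, 1]]
loop_delays = [sum(b * wi for b, wi in zip(row, w)) for row in B]

print(A)            # [[1, -1], [-1, 1]]
print(loop_delays)  # [1]: one register around the loop
```

The loop-delay count is exactly the quantity needed to check retiming feasibility and the iteration bound of Chapter 2.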
Once the values of N, P and R have been set, the transformations can be
applied by hitting the Generate scheduled architecture, Generate parallel
architecture and Generate retimed architecture buttons, respectively.
The transformed Simulink model, which uses Synplify DSP components, will open
automatically once the transformations are complete. The name of the new
model is set to ’test1’ by default but can be changed by the user. Two
transformations can be applied consecutively to a given design by following the
above procedure twice. For example, we may first parallelize a design to obtain
the unfolded architecture, then use the new model as the reference and retime it.
This GUI is still in the development phase and will soon be fully functional
and available for use.
Appendix 2: Tarjan’s Algorithm
Tarjan’s algorithm [20] for finding the elementary circuits (the B matrix) of a directed graph:
procedure BACKTRACK(integer v, logical result f);
begin
    logical g;
    f := false;
    place v on point stack;
    mark(v) := true;
    place v on marked stack;
    for each w in A(v) do
        if w < s then
            delete w from A(v)
        else if w = s then
        begin
            output circuit from s to v to s
                given by point stack;
            f := true;
        end
        else if not mark(w) then
        begin
            BACKTRACK(w, g);
            f := f or g;
        end;
    comment: f = true if an elementary circuit containing
             the partial path on the stack has been found;
    if f = true then
    begin
        while top of marked stack != v do
        begin
            u := top of marked stack;
            delete u from marked stack;
            mark(u) := false;
        end;
        delete v from marked stack;
        mark(v) := false;
    end;
    delete v from point stack;
end;

procedure LOOP-ENUMERATION;
begin
    integer s;
    logical flag;
    for i := 1 to n do mark(i) := false;
    for s := 1 to n do
    begin
        BACKTRACK(s, flag);
        while marked stack not empty do
        begin
            u := top of marked stack;
            mark(u) := false;
            delete u from marked stack;
        end;
    end;
end;
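The procedure above transcribes almost line for line into executable form. The sketch below (in Python, with hypothetical names; circuits are collected in a list instead of being output) is one way to realize it, skipping vertices below the root s rather than deleting them from A(v):

```python
def elementary_circuits(adj):
    """Enumerate the elementary circuits of a directed graph.

    adj: dict mapping each vertex (1..n) to its list of successors.
    Returns circuits in discovery order, each as a list of vertices.
    """
    circuits = []
    point_stack = []                       # current partial path
    marked = {v: False for v in adj}
    marked_stack = []

    def backtrack(v, s):
        f = False
        point_stack.append(v)
        marked[v] = True
        marked_stack.append(v)
        for w in adj[v]:
            if w < s:
                continue                   # vertices below the root are ignored
            if w == s:
                circuits.append(point_stack[:])  # circuit from s to v to s
                f = True
            elif not marked[w]:
                f = backtrack(w, s) or f
        if f:
            # a circuit was found: unmark everything stacked above v
            while marked_stack[-1] != v:
                marked[marked_stack.pop()] = False
            marked_stack.pop()
            marked[v] = False
        point_stack.pop()
        return f

    for s in sorted(adj):                  # loop enumeration over roots
        backtrack(s, s)
        while marked_stack:
            marked[marked_stack.pop()] = False
    return circuits
```

For example, `elementary_circuits({1: [2], 2: [1, 3], 3: [1]})` returns `[[1, 2], [1, 2, 3]]`, i.e., the circuits 1→2→1 and 1→2→3→1.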
References
[1] D. Markovic, V. Stojanovic, B. Nikolic, M.A. Horowitz, and R.W. Brodersen. Methods for true energy-performance optimization. IEEE Journal of Solid-State Circuits, 39(8):1282–1293, August 2004.
[2] T. Gemmeke et al. Design optimization of low-power high-performance DSP building blocks. IEEE Journal of Solid-State Circuits, 39(7):1131–1139, 2004.
[3] J. Rabaey, C. Chu, P. Hoang, and M. Potkonjak. Fast prototyping of datapath-intensive architectures. IEEE Design and Test of Computers, 8(2):40–51, June 1991.
[4] W.R. Davis, N. Zhang, K. Camera, F. Chen, D. Markovic, N. Chan, B. Nikolic, and R.W. Brodersen. A design environment for high-throughput low-power dedicated signal processing systems. IEEE Journal of Solid-State Circuits, 37:420–430, 2002.
[5] D. Markovic, R.W. Brodersen, and B. Nikolic. A 70GOPS 34mW multi-carrier MIMO chip in 3.5mm2. 2006 Symposia on VLSI Technology and Circuits, pages 158–159, 2006.
[6] A. Poon, D. Tse, and R. Brodersen. An adaptive multiple-antenna transceiver for slowly flat fading channels. IEEE Transactions on Communications, 51(13):1820–1827, 2003.
[7] C.E. Leiserson and J.B. Saxe. Optimizing synchronous circuitry using retiming. Algorithmica, 2(3):211–216, 1991.
[8] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, November 2001.
[9] M.C. Papaefthymiou and K.N. Lalgudi. Retiming edge-triggered circuits under general delay models. IEEE Transactions on Computer-Aided Design, 16(12):1393–1408, 1997.
[10] Y. Yi and R. Woods. Hierarchical synthesis of complex DSP functions using IRIS. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(5):806–820, 2006.
[11] V.P. Roychowdhury and T. Kailath. Study of parallelism in regular iterative algorithms. Proceedings of the Second Annual ACM Symposium on Parallel Algorithms and Architectures, pages 367–376, 1990.
[12] S.K. Rao and T. Kailath. Regular iterative algorithms and their implementation on processor arrays. Proceedings of the IEEE, pages 259–269, 1988.
[13] C. Tseng and D.P. Siewiorek. Automated synthesis of datapaths in digital systems. IEEE Transactions on Computer-Aided Design, CAD-5:379–395, July 1986.
[14] S.Y. Kung, H.J. Whitehouse, and T. Kailath. VLSI and Modern Signal Processing. Prentice Hall, 1985.
[15] S. Davidson et al. Some experiments in local microcode compaction for horizontal machines. IEEE Transactions on Computers, pages 460–477, July 1981.
[16] H. De Man, J. Rabaey, J. Six, and P. Claesen. Cathedral-II: A silicon compiler for digital signal processing. IEEE Design and Test of Computers, pages 13–25, 1986.
[17] C.T. Hwang, J.H. Lee, and Y.C. Hsu. A formal approach to the scheduling problem in high level synthesis. IEEE Transactions on Computer-Aided Design, 10(4):464–474, April 1991.
[18] T.C. Denk and K.K. Parhi. Exhaustive scheduling and retiming of digital signal processing systems. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 45(7):821–838, July 1998.
[19] M. Potkonjak and J.M. Rabaey. Optimizing resource utilization using transformations. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(3):277–292, March 1994.
[20] R.E. Tarjan. Enumeration of the elementary circuits of a directed graph. SIAM Journal on Computing, 2(3):211–216, 1973.
[21] T.P. Barnwell and C.J.M. Hodges. Optimal implementations of signal flow graphs on synchronous multiprocessors. Proceedings of the International Conference on Parallel Processing, August 1982.
[22] A. Chandrakasan, S. Sheng, and R. Brodersen. Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27(4):473–484, 1992.
[23] D.Y. Chao and D.T. Wang. Iteration bounds of single-rate data flow graphs for concurrent processing. IEEE Transactions on Circuits and Systems, 40(9):629–634, July 1993.
[24] C.H. Gebotys and M.I. Elmasry. A VLSI methodology with testability constraints. Canadian Conference on VLSI, October 1987.
[25] C.Y. Hitchcock and D.E. Thomas. A method of automatic datapath synthesis. Design Automation Conference, pages 484–489, July 1983.
[26] A. Chandrakasan, J.M. Rabaey, and B. Nikolic. Digital Integrated Circuits: A Design Perspective. Prentice Hall, 2003.
[27] J.G. Proakis and D. Manolakis. Digital Signal Processing. Macmillan, 1992.
[28] P. Marwedel. A new synthesis algorithm for the MIMOLA software system. Design Automation Conference, pages 271–277, July 1986.
[29] T.G. Noll. Carry-save arithmetic for high-speed digital signal processing. IEEE International Symposium on Circuits and Systems, 2:982–986, 1990.
[30] B.M. Pangrle and D.D. Gajski. State synthesis and connectivity binding for microarchitecture compilation. International Conference on Computer-Aided Design, pages 210–213, November 1986.
[31] K.K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. Wiley, 1999.
[32] K.K. Parhi, C.Y. Wang, and A.P. Brown. Synthesis of control circuits in folded pipelined DSP architectures. IEEE Journal of Solid-State Circuits, 27(1):29–43, January 1992.
[33] P.G. Paulin and J.P. Knight. Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Transactions on Computer-Aided Design, 8:661–679, November 1989.
[34] Z.X. Shen and C.C. Jong. Functional area lower bound and upper bound on multicomponent selection for interval scheduling. IEEE Transactions on Computer-Aided Design, 19(7):745–759, 2000.
[35] H. Trickey. Flamel: A high-level hardware compiler. IEEE Transactions on Computer-Aided Design, CAD-6:259–269, March 1987.
[36] C.Y. Wang and K.K. Parhi. Dedicated DSP architecture synthesis using the MARS design system. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1253–1256, May 1991.
[37] R.W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6):345, June 1962.
[38] D. Markovic, B. Nikolic, and R.W. Brodersen. Power and area minimization for multidimensional signal processing. IEEE Journal of Solid-State Circuits, 42(4):1253–1256, April 2007.
[39] http://www.synplicity.com/products/synplifydsp/.
[40] T. Sakurai and A.R. Newton. Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. IEEE Journal of Solid-State Circuits, 25(2):584–594, April 1990.
[41] C.H. Yang and D. Markovic. A flexible VLSI architecture for extracting diversity and spatial multiplexing gains in MIMO channels. To appear at the International Conference on Communications, 2008.
[42] R. Nanda, C.H. Yang, and D. Markovic. DSP architecture optimization in MATLAB/Simulink environment. To appear at the 2008 Symposium on VLSI Circuits, 2008.