


Journal of VLSI Signal Processing 25, 261–284, 2000. © 2000 Kluwer Academic Publishers. Manufactured in The Netherlands.

A Framework for High Level Estimations of Signal Processing VLSI Implementations

J. Ph. DIGUET
LESTER–Université de Bretagne Sud, 10 Rue Le Coat Saint Haoen, 56100 Lorient, France

D. CHILLET AND O. SENTIEYS
LASTI–ENSSAT–Université de Rennes, 6 rue de Kerampont, 22300 Lannion, France

Received November 24, 1998; Revised June 8, 1999

Abstract. This paper presents a framework for the rapid prototyping of Digital Signal Processing applications. The BSS framework enables both the synthesis of dedicated VLSI circuits and cost/performance estimation. The latter can be used at different accuracy levels and can help the designer in selecting a proper algorithm in order to improve the global performance of its implementation. The cost estimation takes into account both the processing unit, including operators, registers and interconnections, and the memory units. The implemented estimation techniques are presented and include functional unit number bound calculation, probabilistic cost estimation of processing unit components, and memory unit area evaluation. We demonstrate, on a real application, that cost/signal processing quality trade-offs can be achieved by changing the type of algorithm and the number of filter taps for a given algorithm specification.

1. Introduction

Recent advances in VLSI technologies and Computer-Aided Design (CAD) enable us to design systems on a single chip. Programmable Digital Signal Processors and VLSI hardware for signal processing have matured rapidly during the last decade, and cover a large part of the integrated circuit market. This market is now driven by the explosive growth of digital mobile communication and multimedia [1].

Due to the fast evolution of digital signal processing (DSP) applications and the limited lifetime of new telecommunication products, the time-to-market is one of the most important figures of merit, together with the design of first-time-right complex VLSI systems. Moreover, these systems must optimize design constraints such as signal processing quality, cost, performance and power dissipation. Therefore, the designer faces a huge multidimensional design space bounded by the constraints. In order to find his way through this maze, he must use the indications that trade-off metrics provide.

An ideal prototyping framework for DSP designers must deal with:

• Rapid prototyping of DSP applications. The tool must quickly return an estimation (cost, performance, power) of the implementation on the target architecture, in order to compare different algorithmic solutions. The transformations are not limited to the algorithmic level, but can also be applied at the functional level (e.g. the choice of a hardware or software solution for a given function) and at the structural level (e.g. retiming, associativity). The estimation must be accurate, independent from the CAD tools, and located at the highest level of abstraction.

• Choice between programmable processors, dedicated VLSI circuits, or flexible architectures such as ASIPs. The tool must accept a generic model of the architecture. It must also be interfaced with the design environments of all the target architectures (High Level Synthesis, DSP compilers, ...).

• The memory problem, which can be, for most DSP applications, the critical part of cost and power dissipation. Most tools neglect the memory unit and focus on the processing part.

• Targeting different VLSI technologies (FPGA, Standard Cells, ...).

• An associated DSP co-simulation tool in order to verify the DSP quality constraints (error, convergence rate, filter parameters, ...).

The methodology that we present in this paper tries to answer most of the problems of a rapid prototyping environment for DSP. It enables both estimation and synthesis under cost, time, power dissipation and SNR constraints. This paper will not deal with the two last criteria [2, 3], which will be detailed in a future paper.

The area/time measurements are obtained in different ways based on various synthesis or estimation techniques. They give an area estimation of the Processing and Memory Units for a given real-time constraint. This constraint is generally linked to the sample frequency of the DSP system and is a throughput constraint. The delay between the inputs and the corresponding outputs may or may not be constrained by a number of pipeline stages. By using the methodology, the designer can find his optimal trade-off by changing input parameters and iterating the process. These parameters can be either the throughput constraint, or the digital signal processing quality and performance criteria. The latter can be modified by changing the algorithm with transformations (e.g. block processing, passing through the frequency domain, etc.), the size of the data to be processed (e.g. the size of the impulse response of an adaptive filter), or even the bit size of the data. All of these can improve the time performance but can have an effect on the quality of the DSP. It is for this reason that the estimation must be correlated with the simulation in the same framework.

Different trade-off curves may be built. For a given algorithm, area/time curves are obtained by changing the time constraint T on which the estimation is based. For a given time constraint, signal processing quality/cost curves are established by changing the algorithm or by modifying the characteristics of the same algorithm (see §5).

Figure 1 sums up the different ways which can be followed in order to obtain estimations, and then trade-off curves. §2.2 presents a method that provides the optimal number of functional units (FU) and pipeline stages of the VLSI architecture that will be used later by the synthesis. This area estimation depends on the time constraint. In §2.3, we propose metrics to estimate the number of FU, registers and interconnection components of the processing unit (PU). This method is based on a probabilistic calculation of the repartition in time of the DFG operations and is linked with the hardware selection (see §3) of the high level synthesis system GAUT integrated in the BSS framework [4]. It aims at guiding the designer in the transformation space. §4 deals with the memory problem. We will show in this section how the memory unit (MU) can be taken into account in the framework. A cost estimation tool, based on bus transfer optimization, bank number calculation, and memory selection, is presented.

The ultimate way to accomplish trade-offs is through the high level synthesis itself. The PU and MU synthesis integrated in this framework are not presented in this paper, but can be found in [5–7]. Some of the results presented in §5 have been achieved with accurate estimations. The application studied deals with new fast LMS algorithms.

2. Processing Unit Cost Estimation

2.1. Introduction and Previous Works

In this section, we deal with the two kinds of estimation involved in the GAUT framework. Estimations are carried out under a time constraint (the iteration bound T).

The first one aims at providing the synthesis tool with optimal numbers of functional units and pipeline stages. This accurate estimation must predict the least expensive set of resources stemming from the most efficient use of the high level synthesis phases. Like these tasks, resource estimation is an NP-complete problem [8] that involves searching for an optimized distribution of the operations of a DFG into a limited number of time steps.

Previous works deal with upper and lower bounds. They can be classified into two classes: estimations either taking or not taking into account the precedence constraints between operations in the DFG. The first one contains fast methods that give very optimistic lower bounds and pessimistic upper bounds [9, 10].


Figure 1. Global framework.

The second category computes the estimation by using the time frames [asap ; alap] of the operations. Upper bounds are computed by extracting the maximum parallelism from an oriented graph F(N, R), where Ni is a node representing the operation i and Ri,j an edge indicating a precedence relationship between Ni and Nj. A solution is presented in [11] with an O(N³) complexity, and it is suggested in [12] that an algorithm with an O(N³) complexity is feasible. Lower bounds correspond to the designer objective. In [13] a method is described concerning functional units, data transfers and registers; its complexity is O(N²C) for the active resources and O(NC²) for the interconnects. The minimum concurrency in a given time is computed from the set of time frames of the operations present. With such a method, precedence constraints are underestimated, since the overlapping of [asap ; alap] intervals may hide dependence constraints, as shown in Fig. 12. The lower bounds are computed on the same scheme. Another method is detailed in [10] for the functional units: the authors propose to evaluate the critical path Tc of the application for a given set of resources; if Tc is greater than Tconstraint, the number of the tested resources is increased, and so on until the time constraint is reached. Tmin is computed with a list scheduling algorithm, equivalent to [14], whose complexity is O(N log(N)). The algorithm is fast but can lead to underestimation. It actually depends on a scheduling algorithm that does not always give the best solution; in fact the priority given to the list scheduling is 1/(alap − asap), and, as said before, such time frames do not fully reproduce the precedence constraints between operations. An equivalent algorithm is presented for the interconnection bounds and heuristics are described for register bounds.
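For reference, the [asap ; alap] time frames used by all these methods can be computed as in the following sketch. The graph encoding (node names, a topological-order assumption on the dictionary) is ours, not the paper's:

```python
def asap_alap(deps, delays, T):
    """Compute the [asap ; alap] start-time frames of DFG nodes.
    `deps` maps node -> list of predecessors (assumed listed in
    topological order); `delays` gives per-node delays; T is the
    iteration period. Mobility of a node = alap - asap."""
    order = list(deps)
    asap, alap = {}, {}
    # forward pass: earliest start after all predecessors complete
    for n in order:
        asap[n] = max((asap[p] + delays[p] for p in deps[n]), default=0)
    # backward pass: latest start that still meets the deadline T
    succs = {n: [m for m in order if n in deps[m]] for n in order}
    for n in reversed(order):
        alap[n] = min((alap[s] - delays[n] for s in succs[n]),
                      default=T - delays[n])
    return asap, alap

# chain a -> b, 100 ns each, T = 400 ns: each node gets 200 ns of mobility
asap, alap = asap_alap({'a': [], 'b': ['a']}, {'a': 100, 'b': 100}, 400)
print(asap, alap)
```

As the text notes, two frames may overlap even when the operations can never execute simultaneously, which is why bounds built only from these intervals underestimate precedence constraints.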

The method we shall present is independent from any scheduling algorithm and combines the estimation of functional units with the associated pipeline stages. It avoids exhaustive searching while remaining accurate, through an analysis of the usual causes of over- and under-estimation.

The second kind of resource estimation fits into a new approach of high level synthesis. It addresses the cost characterization of the tested application in order to guide the designer in his specification choices. This estimation may be defined in three points.

Firstly, it is important to note here that the aim is not the exact prediction of the final circuit features. Due to the level of abstraction, it would be illusory and useless to focus on accurate area/time estimations of the final circuit. However, it still remains possible to provide reliable metrics which are sufficient to enable comparisons between different specifications. So, the objective is to get relative estimations which express the optimization potential of different algorithmic solutions.

Secondly, it provides an estimation of the distribution of the different kinds of resources in the available time steps. The aim is to get metrics to analyze the complexity of the tested application, in order to understand and apply transformations.

Finally, it is independent from the synthesis tool.

Few works are aimed towards this kind of estimation. In [15] the authors deal with the extraction of the properties of an algorithm to guide its synthesis by reducing the design space. The main metrics exposed are the concurrency, to get an estimation of the number of resources; the temporal density, that represents an approximation of the lifetime of the variables; and the regularity, that measures the ratio between the whole description of the algorithm and the smallest pattern from which the algorithm can be reproduced. This work and ours fit in the same bracket of estimation. But, in our opinion, it is more interesting firstly to evaluate all these properties through a probabilistic calculation that includes the dependence relationships of the nodes, and secondly to keep the links between the nodes and their localization in the time steps. Finally we can mention [16], where an original method is used, involving entropy to evaluate the repartition of the nodes in the different pipeline stages.

The properties of this kind of estimation as a guidance tool in the space of transformations will not be described here. We will report its principle and its use for the estimation of interconnection costs in order to guide the module selection.

2.2. Optimal Resource and Pipeline Estimations for Synthesis

2.2.1. Principle. Within a High-Level Synthesis framework such as the GAUT tool [6], the estimation task follows the hardware component selection that provides the functional unit (FU) computation delays and the optimal clock time; the component selection can be modified according to the estimation results [17]. The method presented here aims at the dual estimation of accurate numbers of FU and pipeline stages. It is based on a preliminary analysis of the error causes in a pipelined architecture whose model is defined in Fig. 2. The data flow graph compiled from the VHDL behavioural specification is completely unrolled. Consequently, the only restrictions to resource optimization through pipeline transformations are due to internal feedback loops. In fact, without feedback loops, the average number of FU (Eq. (1)) is enough to execute the algorithm if the appropriate number of pipeline stages is used (see Fig. 3).

Figure 2. Architectural model of GAUT synthesis tool.

With feedback loops, two issues cause estimation errors. The first one is the ignorance of the real numbers of time steps available for the different kinds of FU. The second one is the unavoidable parallelism that has to be introduced in order to respect the iteration period constraint. In [10] the authors briefly suggest taking into account the unusable time steps, but finally leave the idea because they think it is inefficient in many cases. In fact this information is a useful part of the solution; we propose here a formulation of the problem in a loop context.

Figure 3. Accurate average number with a pipeline.


Firstly, we will not take into account the number of delays in the feedback loops. We will see later how to benefit from the increase of mobility that multi-delay graphs allow. This simplification allows the NP-complete problem to be avoided. Contrary to other techniques, the pipeline possibilities are introduced from the beginning of the estimation.

Bounds are useful for the limitation of the design space, but inefficient for the provision of the exact numbers of resources and pipeline stages that are necessary to correctly guide the scheduling and binding tasks. Our tool GAUT does not perform a hardware exploration; on the contrary, it rapidly builds an architecture based on the available FUs and time steps. That is why an accurate estimation is required. Then, the datapath synthesis optimization focuses on minimizing controller, bus and register costs.

The estimation is made up of two steps. The minimum number of FU is firstly computed with unlimited pipeline; the associated minimum number of pipeline stages is then calculated.

The method is carried out in four main steps: (ii) computation of the number of FU in a feedback loop, (iii) computation of the global number of FU in the DFG, (iv) computation of the minimal number of pipeline stages, (v) optimisation with a multi-delay feedback loop.

N(i) = ⌈ Nop(i) · Δ(i) / T ⌉    (1)

with Δ(i) the computation delay of the FU of type i and Nop(i) the number of operations of type i in the DFG.
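As an illustration, Eq. (1) can be sketched in Python (a hypothetical helper, not part of the GAUT tool):

```python
import math

def avg_fu_number(n_op: int, delta: int, t: int) -> int:
    """Average number of FU of one type (Eq. (1)):
    ceil(Nop(i) * Delta(i) / T)."""
    return math.ceil(n_op * delta / t)

# e.g. 8 multiplications of 100 ns each under a 400 ns iteration period
print(avg_fu_number(8, 100, 400))  # -> 2
```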

2.2.2. Computation of the Number of FU in Feedback Loops. The method may be divided into four main steps for a given type i of operators. Firstly, the maximum mobility is given to the operations in the loops. Then we compute the improved average number of FU, N'(i), which is based on the exact number of available time steps. We treat the problem of local strong parallelism requirements by computing the minimum parallelism rate Φ_Npar(i). Finally, the number of FU is obtained by keeping the maximum between N'(i) and Φ_Npar(i).

Improved Average Number. If NopF(i) is the number of operations of type i in the feedback loops, NopF(i) · Δ(i) time units are required to execute them. In Eq. (1) it is assumed that a whole set (T) of time units is available for the execution of the operations of type i. This assumption is, of course, seldom true: the available time steps must belong to at least one of the time frames [asap ; alap] of an operation of type i. The number of time steps in this subset is defined as T'(i).

Figure 4. Unavailable time units for multipliers because of split intervals.

Another cause of error remains in the computation of the average number. It is due to a part of the time, apparently available for operations of type i, that cannot actually be used, because of intervals of length lower than the computation delay of operators of type i. Figure 4 gives an example where the apparent number of required multipliers is one and the exact number is two. In this case, the multiplication in the upper branch cannot use the free 200 ns because they are split into two intervals of 100 ns. This part of the time is due to dependences with the other types of operations. It is defined as T''(i) and must be considered as time units required for the execution of operations of type i. In order to compute the numbers T''(i), we firstly exhibit, in each branch of the graph, the maximum time frames that separate two operations of type i. Then we add the intervals from the different branches, and keep only the intervals that have a length divisible by Δ(i). The construction of the final intervals looks like the Tetris game. Finally, a mod Δ(i) operation, applied to the remaining intervals, provides the desired T''(i) numbers. Details of the formulation are given in [18].
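A much-simplified reading of this computation (ignoring the inter-branch Tetris-like merging detailed in [18]) charges, for each free interval, the remainder that is too short to host an operation of delay Δ(i):

```python
def unusable_time(free_interval_lengths, delta):
    """Simplified T''(i) estimate: in each free interval, a remainder
    shorter than the operator delay `delta` cannot host an operation
    of type i and is counted as required time (cf. Fig. 4)."""
    return sum(length % delta for length in free_interval_lengths)

# Fig. 4: 200 ns of slack split into two 100 ns intervals,
# with a 200 ns multiplier -> none of the slack is usable
print(unusable_time([100, 100], 200))  # -> 200
```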

Finally we obtain the improved average number with Eq. (2):

N'_r(i) = ( NopF(i) · Δ(i) + T''(i) ) / T'(i),    N'(i) = ⌈ N'_r(i) ⌉    (2)
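Combining T'(i) and T''(i), Eq. (2) could be sketched as follows (the parameter names are ours):

```python
import math

def improved_avg_fu(nop_f, delta, t_avail, t_unusable):
    """Improved average FU number (Eq. (2)): the unusable split time
    T''(i) is added to the busy time, and only the truly available
    time steps T'(i) appear in the denominator."""
    n_r = (nop_f * delta + t_unusable) / t_avail
    return math.ceil(n_r)

# Fig. 4-like case: 2 multiplications of 200 ns, 200 ns unusable,
# 400 ns of available steps -> 2 multipliers, not 1
print(improved_avg_fu(2, 200, 400, 200))  # -> 2
```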

In [10], a notion of lost time is also tackled. However, the authors considered neither the estimation in the pipeline optimisation nor the problem of feedback loops. They do not propose a formulation of the problem because they notice that such a method cannot detect errors due to strong and local requirements of parallelism. They finally choose a method based on a scheduling algorithm, as previously stated.

It is true that the problem of parallelism is not pointed out by the improved average number; this is the reason why we complete it with the computation of a parallelism rate. We will see that the two techniques must be associated to get an accurate estimation.

Minimum Parallelism Rate. The aim is now to detect the parallelism requirements that impose the limitation of the mobility in the feedback loops. So, we now have to compute the minimum number of FU of type i which must be active in the same time step, over the whole set of feasible schedulings associated with the operations of the feedback loops. The first idea consists of creating a density of parallelism for each operation that, in a way, expresses the needs of FUs associated with the operation node. Thus, by adding these densities for each time step t common to operations located on different parallel branches of the DFG, it is possible to obtain the number of FU required. The density is defined (Eq. (4)), for a given operation node k of type i, as the inverse of the integer number of operations of type i that can be executed consecutively during the time frame [asap(k) ; alap(k) + Δ(i)[. In other words, the wider the time frame, the more intervals of length Δ(i) it can contain, and hence the less the operation k will be constrained to be scheduled in the time step t.

There then remains the problem of knowing which densities must be added in a given time step, without presuming what the scheduling may be.

How and When Must We Add the Densities? The key problem is the sum of the needs in FU linked to the operations that may be scheduled at a given date t. The principle is based on the addition of the densities of operations that can be executed simultaneously. Actually, the sum cannot be done on the whole [asap ; alap] interval for each operation, because in this case the overlappings of the intervals of the following nodes lead to over-estimations. The difficulty associated with the overlapping intervals can be avoided. In fact, the over-estimation results from the addition of densities associated with operations that cannot be executed in common time steps because they follow one another. However, it is important to retain, over the whole [asap(i) ; alap(i)] interval, the information concerning the possible presence of the operation i or one of its predecessors or successors. Therefore, the solution for the avoidance of unfeasible superpositions of densities, while covering the whole set of mobility intervals, consists in adding, at the given time step t, only the densities 1/Ps(i) of operations belonging to separate parallel branches of the DFG. In a given branch, the choice between a node and its predecessors and successors is done by keeping the node with the greater density. In the case of nodes common to different branches, the density is normalized by the number of its successors having a density greater than or equal over their common time frame (Eqs. (5) and (7) of Algorithm 1). Figure 5 illustrates the method on a subgraph assumed to be inserted in feedback loops.

Figure 5. Minimum parallelism method. Δ(∗): delay of ∗; Pr = 1/Ps(∗): density of ∗; Φ(∗): minimum parallelism rate for ∗; Φ(∗) = max_t Φ(∗, t).

We have seen how the operations in the same branch of the DFG are selected. The other difficulty is linked to the choice of the densities, belonging to parallel branches, that have to be taken into account. If we analyse the problem, we notice that the operations involved in local strong parallelism have a reduced mobility. Such operations possess fewer scheduling opportunities than the average of the operations belonging to the same type. Therefore, a simple method for carrying out the selection of the critical operations of type i consists of adding only the nodes in parallel branches that have densities greater than the average density of FU of type i (see Eq. (2)).

For instance, we can imagine, in the case of Fig. 5, that another multiplication exists in parallel with operations 2, 3, 4 and 5, and that an infinitely small density is associated with this extra operation. If all the parallel branches are taken into account, without distinguishing critical nodes from the others, the parallelism rate computed would over-estimate the number of FU required (4). If the extra operation, which has many more scheduling opportunities than the only three dates 200, 300 and 400, is filtered out, the exact parallelism is obtained.

Pr(i) = 1 / N'_r(i)    (3)

where N'(i) is the integer just greater than N'_r(i).

Algorithm 1 sums up the different steps of the computation of the minimum parallelism rate.

Algorithm 1. Computation of the minimum parallelism rate for FU of type i

• Definitions:

– Let Ps(Ok) be the potential of operations, having the same type as Ok, that can be executed consecutively in the mobility interval of Ok;
– Let Nbpar(t) be the number of parallel branches in the time step t;
– Let Ii,j(t) be the set of operations Ok of type i in the branch j whose interval [asap(Ok) ; alap(Ok) + Δ(i)[ contains t;
– Let PsN(Ok) = Ps(Ok) · Nsucc(Ok), where Nsucc(Ok) is the number of branches coming from Ok, belonging to the feedback loops and containing some operations Op of type i with a density 1/Ps(Op) lower than or equal to 1/Ps(Ok) until a date t' such that t' < alap(Ok) + Δ(Ok). If an operation has a density 1/Ps(Ok) greater than those of its successors, the branches coming from Ok will use the normalized density until t', in order to avoid counting the density of a same node several times.

• Increase the mobility of all nodes located in the feedback loops to the maximum;
• Compute the normalized potential associated with consecutive operations of type i in the DFG:

∀ Ok ∈ I:
Ps(Ok) = ⌊ ( alap(Ok) + Δ(Ok) − asap(Ok) ) / Δ(Ok) ⌋    (4)
PsN(Ok) = Ps(Ok) · Nsucc(Ok)    (5)

• Filter the critical nodes for parallelism:

∀ Ok ∈ Ii:
PrN(Ok) = 1/PsN(Ok) if 1/PsN(Ok) > Pr(i), 0 otherwise    (6)

• Compute the minimum parallelism rate of FU of type i in step t, Φ_Npar(i, t):

Φ_Npar(i, t) = Σ_{j=1..Nbpar(t)} max_{Ok ∈ Ii,j(t)} PrN(Ok)    (7)

• Extract the global minimum parallelism rate:

Φ_Npar(i) = max_{t ∈ T'} Φ_Npar(i, t)    (8)
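Equations (4), (6) and (7) can be sketched as follows. The flat encoding of the branches and the parameter names are our own simplification, and Eq. (5)'s normalization by Nsucc is omitted:

```python
def density(asap, alap, delta):
    """Potential Ps(Ok) (Eq. (4)): how many operations of the same type
    fit consecutively in the mobility window [asap ; alap + delta[."""
    return (alap + delta - asap) // delta

def parallelism_rate(branches, t, avg_rate):
    """Phi_Npar(i, t) (Eq. (7)), simplified: `branches` maps a branch id
    to a list of (asap, alap, delta) operations of type i. Only densities
    above the average rate Pr(i) are kept (the filter of Eq. (6)), and
    one maximum-density node is counted per parallel branch."""
    total = 0.0
    for ops in branches.values():
        live = [1 / density(a, l, d)
                for (a, l, d) in ops
                if a <= t < l + d and 1 / density(a, l, d) > avg_rate]
        if live:
            total += max(live)
    return total

# two parallel branches, each holding one zero-mobility 100 ns operation
print(parallelism_rate({0: [(0, 0, 100)], 1: [(0, 0, 100)]}, 0, 0.5))  # -> 2.0
```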

Necessity of Using Both the Improved Average Number and the Minimum Parallelism Rate. Two criteria must be taken into consideration in order to obtain the accurate number of FU. The first one, the improved average number, considers the real time allowed to the operations of a given type, but cannot detect the local necessities of parallelism. The latter computes the minimum parallelism rate, but considers for each operation the possibility of using the entire time frame [asap(i) ; alap(i) + Δ(i)[, without otherwise taking care of the possible consequences on the time frames of the other operations, which can thus be potentially reduced. For this reason, it is necessary to check, with the improved average number of operators, that the number of FU provided by the minimum parallelism rate is really sufficient to execute the whole set of operations. Such cases may appear with DFGs that are not strongly constrained, as in the case of Fig. 6.

Consequently, the number of FU required by the operations of type i in the feedback loops is:

Nopt(i)F = max{ Φ_Npar(i) ; N'(i) }    (9)

Figure 6. Case where the solution is given by the average number.


2.2.3. Estimation of the Number of FU in the Whole DFG

Problem Related to the Redistribution of the Operations Located out of Feedback Loops. Following the estimation of FU in the feedback loops, there still remains the redistribution of the operations located out of feedback loops into the free time steps. If required, pipeline stages are used.

If the global improved average number N'G(i) is lower than the number of operators required by the feedback loop nodes, we know that the number of free time steps is quantitatively sufficient. However, these time steps cannot always be used to execute the remaining operations. When they are split up into intervals of length lower than the processing delay of the given type of FU, the redistribution is impossible. Therefore, the estimation must take into account the real time available for the operations located out of the feedback loops, T''(i)G; if this one is insufficient for the remaining operations, one or more extra FU will be allocated. Figure 7 shows an example where the redistribution of the three multiplications located out of loops is not feasible with the optimal number of multipliers computed for the operations inside the feedback loop (Nopt(∗)F), despite a global average number of multipliers N(∗)G equal to Nopt(∗)F.

Principle of Complete Estimation.The method issimilar to the previous one used for computing the num-ber of FU in the feedback loops. The only differenceis related to the particular processing of the mobilityof the operations that are located out of loops. Actu-ally, these operations may be scheduled anywhere in

Figure 7. Unfeasible complete redistribution of outside-loop operations.

the interval [0; T], thanks to pipeline transformations. Consequently, they are supposed to have a maximal mobility. The method is described in Algorithm 2.

Algorithm 2. Computation of the global number of FU of type i required

• the mobilities of the out-of-feedback-loop operations are extended to [0; T];
• compute Nopt(i)F with Algorithm 1;
• compute N′(i)G with all the operations of the DFG:

  N′(i)G = (Nop(i)F · Δ(i) + T″(i)G) / T

  (T″(i)G and T″(i)F are computed simultaneously);
• Nopt(i)G = max{Nopt(i)F; N′(i)G}
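As a hedged sketch of the final step of Algorithm 2 (the function name and the scalar inputs are illustrative simplifications, not the paper's data structures), the global bound combines the feedback-loop bound with an improved average number taken here as total busy time over usable time:

```python
import math

def global_fu_count(n_opt_loop, busy_time, usable_time):
    """Sketch of Algorithm 2's last step: the global number of FUs of
    one type is the max of the feedback-loop bound Nopt(i)F and an
    improved average number N'(i)G, here modelled as the total busy
    time of the operations over the really usable time."""
    n_prime_g = busy_time / usable_time          # improved average number
    return max(n_opt_loop, math.ceil(n_prime_g))  # Nopt(i)G

# 3 multipliers suffice inside the loops, but 26 cycles of multiply work
# must fit into 6 usable cycles, so the global bound rises to ceil(26/6) = 5
print(global_fu_count(3, 26, 6))  # 5
```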

2.2.4. Computation of the Minimal Number of Pipeline Stages

Principle. The numbers of FU have been computed without any restriction on pipelining. The aim is now to find the minimal number of pipeline stages that enables us to use only the minimal number of FU.

The pipeline transformation is required to modify the mobility [asap; alap] of the operations in order to associate them with free and usable time steps. This modification can only be done towards future dates. It must keep the mobility of operations lower than T in order to use the estimation based on the improved average numbers and minimal parallelism rates. This condition is respected by applying a recursive method that moves operation intervals without any assumption on scheduling. The initial mobilities are fixed by the absolute minimum of pipeline stages, computed with asap dates while respecting the feedback-loop scheduling constraints.

The principle is based on the parallelism rate, which is locally (for an operation) compared to the optimal numbers of FU. Following this comparison, the possible extra time steps required, not exceeding the optimal number of FU, are passed on to the successors. If necessary, the intervals of the successors are consequently brought forward. Thus, the whole set of operation intervals is recursively moved forward from the beginning of the DFG to its end. During the recursion, the initial mobilities remain unchanged. When the whole DFG has been treated, the improved average number is computed to check the validity of the assumed available


time. Finally, the intervals of mobility are adjusted according to the final number of pipeline stages while respecting the feedback-loop constraints. The accurate number of pipeline stages NbGpip is finally obtained from the maximum over the alap dates (Eq. (10)).

Talap max = max_{Ok∈G} {alap(Ok) + Δ(Ok)}

NbGpip = ⌈Talap max / T⌉   (10)

The critical task of the method resides in the recursive processing of the forward moves of the operation intervals. This point is detailed in Algorithm 3.

Algorithm 3: Recursive forward move of operation intervals for pipeline depth minimization
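The pseudo-code of Algorithm 3 appears as a figure in the original and is not recoverable here; the following is a hedged reconstruction of the forward-move idea as a simple pass in topological order. The single FU type and the start-date-only occupancy are simplifications of ours:

```python
def forward_move(ops, preds, delay, n_opt):
    """Reconstruction sketch of Algorithm 3: operations are visited in
    topological order; when the parallelism rate at a date would exceed
    the optimal FU number n_opt, the interval is pushed to a later
    date, and the successors follow through the dependence maximum."""
    start, usage = {}, {}
    for op in ops:  # topological order: predecessors already placed
        t = max((start[p] + delay[p] for p in preds[op]), default=0)
        while usage.get(t, 0) >= n_opt:  # parallelism rate exceeded
            t += 1                       # move the interval forward
        start[op] = t
        usage[t] = usage.get(t, 0) + 1
    return start

# a and b feed c, a single FU, unit delays:
start = forward_move(['a', 'b', 'c'],
                     {'a': [], 'b': [], 'c': ['a', 'b']},
                     {'a': 1, 'b': 1, 'c': 1}, n_opt=1)
print(start)  # {'a': 0, 'b': 1, 'c': 2}
```

The pipeline depth would then follow from the latest finish date, as in Eq. (10), by taking its ceiling over the time constraint T.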

Limited Pipeline Constraints. Some applications may impose constraints on the pipeline, for instance to limit the response delay of the signal processing. In such cases, the method remains unchanged: Algorithm 3 is applied until the limit on pipeline stages is reached. If the process is stopped before the end, the number of FU required will not be optimal.

2.2.5. Optimization with the Multi-Delay Feedback Loops

Optimization Opportunities. A number ND of delays in a feedback loop permits the distribution of the operations


Figure 8. Multi-delay optimization example—case of a recursive DCT with block size = 2.

on ND pipeline stages. This increase of mobility favors the optimization of the scheduling task. An example of a recursive DCT [19] with two delays in the feedback loop is given in Fig. 8. In this case, a multiplier is saved thanks to the utilization of the whole mobility. The techniques used to reduce the critical loop by insertion of extra delays [20] increase the algorithm's complexity; therefore the optimization of the scheduling may lead to significant hardware minimization.

Principle. The multi-delay opportunities in the feedback loop are taken into account in the MARS system during the scheduling task [21]. The authors use two kinds of mobility in order to check the scheduling possibilities of a given operation: the mobility of the loop it belongs to and the intrinsic mobility of the operation. Our method is independent from scheduling techniques. The aim is to exhibit the mobility intervals in order to compute the required numbers of FUs. The method will not be detailed here (see [18]); we will just describe the main steps.

Numbers of FU. Firstly, the DFG is broken into independent subsets. A subgraph contains the nodes of interdependent feedback loops. The maximal mobility of the operations is then computed according to the subgraph they belong to.

When an operation belongs to different feedback loops, several solutions of pipeline stage sets may appear. Such cases are resolved by applying two criteria of choice: the most constrained loop (with the lowest number of delays) has priority, and the solution leading to the minimal number of pipeline stages is retained. The numbers of FU are computed in the same way as before. However, the result (Eq. (9)) cannot be used if mobility lengths are greater than the time constraint T, whereas now mobility intervals may exceed T. This difficulty is overcome by using normalized

densities:

Ps′(Ok) = (1/Np) · Ps(Ok)

where Np is the number of pipeline stages where Ok may be scheduled and Ps(Ok) is the density computed without pipeline. This simply reflects the fact that the usable time is Np times greater than previously.

Number of Pipeline Stages. Two methods may be used to compute the minimal number of pipeline stages. The first one is still valid; however, the pipeline checking performed while the intervals are moved forward requires particular attention.

A second one becomes feasible because the intervals of mobility are allowed to be greater than the time constraint T. It consists in increasing the number of pipeline stages from the minimal value until the FU bounds are reached. For each step, the mobility intervals are worked out. However, in the case of pipeline estimation, the subgraphs are not studied independently: the mobility of the entire graph is now limited by the current number of pipeline stages tested, and the dependences between subgraphs must be taken into account.

We prefer the second method in order to avoid the recursion, which may require a huge program stack in the case of a large DFG. As previously, the designer can impose a constraint on the pipeline stage number.

Example. Figure 9 gives some results obtained in the case of an IIR7 filter. The characteristics of the DFG (multiple delays, seven loops, four sets of loops) make it a very suitable benchmark for our estimation based on the exploration of loop opportunities. By increasing the time constraint, trade-off curves are obtained for FU area and numbers of pipeline stages. A retimed version of


Figure 9. IIR7 trade-offs—Area and pipeline/time constraint.

the algorithm is also proposed; we notice that the transformation enables a better use of pipelining in order to reduce the area cost.

2.2.6. Conclusion. We have seen the method included in the GAUT framework to compute bounds on FU and pipeline numbers. The complexity of the estimation is lower than O(N · max_i{(alap(Oi) − asap(Oi))/Δt}). The method is independent from any given scheduling technique; it is based on the two causes of error in resource estimation: the misestimation of the available time and the local requirements of strong parallelism. We will see in §3 how it is used to select the right components for estimation or synthesis. Using the right number of components and pipeline stages, the high level synthesis focuses on the minimization of both interconnection and controller cost during the scheduling and binding tasks. Then a VHDL hardware description is produced for the logic synthesis tool [22].

2.3. Probabilistic Resource Estimation

2.3.1. Introduction. The first aim of the method is to provide the designer with a dynamic estimation of cost over the interval [0; T]. The estimation is based on the probable numbers of FU, registers, buses and interconnections in each time step. The aim is not a high accuracy of cost prediction, which cannot be achieved at a high level of abstraction. The main issue is to be able to compare the effects of transformations (algorithmic/structural/functional) in order to make choices based on analysis and relative measures. Such metrics are handled in a global design methodology [18] that associates estimations and transformations in order to guide the optimization of the algorithm-architecture adequacy. While analysing the results and observing the effects of specification choices, the designer gets keys to reduce the transformation space. The objective is also to complete the component selection choices by an estimation of register, bus and interconnection complexity.

2.3.2. Principle

Functional Units. Given a DFG and a selection of components, we look for the probable number of FUs of type n in the time step t, Nn(t). Firstly, Nn(t), the probable number of FU of type n scheduled at t, may simply be defined as:

Nn(t) = Σ_{i=0}^{Nmax} i · Pn(Nn = i, t)   (11)

where Pn(Nn = i, t) is the probability that i FUs of type n are scheduled at the date t. Such a formulation would lead to a very complex computation. It can be demonstrated that Eq. (11) may be simplified as:

Nn(t) = Σ_{i∈Eopn} POi(t)   (12)

where Eopn is the set of operations of type n and POi(t) the probability that the operation i is scheduled in the time step t. After introducing the delay Δ of the associated FU, we obtain:

Nn(t) = Σ_{i∈Eopn} Σ_{k=t−Δ(Oi)+1}^{t} POi(k)   (13)
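Eq. (13) can be sketched as follows; `probs` holds precomputed per-operation scheduling densities P_Oi(k) for one operation type, a data layout assumed for the example:

```python
def probable_fu_count(probs, delay, t):
    """Sketch of Eq. (13): the probable number of FUs of one type that
    are busy at step t sums, over each operation i of that type, the
    probability that i started recently enough to still occupy a FU.
    probs[i] maps a date k to P_Oi(k), assumed precomputed (Eq. (19))."""
    return sum(
        sum(p_oi.get(k, 0.0) for k in range(t - delay + 1, t + 1))
        for p_oi in probs
    )

# two multiplications with uniform densities over two dates, delay 2:
probs = [{0: 0.5, 1: 0.5}, {1: 0.5, 2: 0.5}]
print(probable_fu_count(probs, 2, 1))  # 1.5 probable multipliers busy at t = 1
```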


Figure 10. Data dependence between operations.

Resources for Data Transfers. The architectural model of GAUT (Fig. 2) enables three kinds of data transfers:

• memory → FU (input data);
• FU → memory (results);
• FU → FU (temporary data). If the data lifetime is greater than the memory threshold, the data is stored in memory (FU/register/memory/register/FU); otherwise the data remains in a register (FU/register/FU).

We will comment on the estimation of registers for a data transfer between FUs, in the particular case where a data D is produced by an operation O0 and consumed by two operations O1 and O2 (see Fig. 10). A generalization will be given later.

Probability of Transfer. The scheduling probabilities of the producer and of the consumers are now both involved:

PR(t, i) = { P{(O0, t0); (Oi, ti)}  if t ∈ IR
           { 0                      otherwise,     i = 1, 2   (14)

where P{(O0, t0); (Oi, ti)} is the probability that O0 and Oi are respectively scheduled in the time steps t0 and ti. IR represents the duration interval during which the data D is stored in a register. Two cases may be considered:

• ti − t0 ≥ memory_threshold: IR = [t0 + ΔOp0, t0 + ΔOp0 + Δreg + Δmem] ∪ [ti − Δreg, ti + ΔOpi]. The data D is first written to and then read from memory.
• ti − t0 < memory_threshold: IR = [t0 + ΔOp0, ti + ΔOpi]. D remains in the datapath.

The events "O1 is scheduled at t1" and "O2 is scheduled at t2" are not independent; consequently, the associated probability of transfer is defined as:

P{(Oi, ti); (O0, t0)} = P((Oi, ti) | (O0, t0)) · PO0(t0)
                      = P((O0, t0) | (Oi, ti)) · POi(ti)
                      = [PO0(t0) / Σ_{t=asap(O0)}^{ti − ΔOp0} PO0(t)] · POi(ti)   (15)

The partial sums S0(t′) = Σ_{t=asap(O0)}^{t′} PO0(t) are constituted while the PO0(t) are computed.

Figure 11. Without and with data sharing.

Probable Number of Registers. The principle is identical to that used for FUs: it is based on the sum of the data-transfer probabilities. However, it also depends on whether the data-sharing optimization is used (case 2 of Fig. 11) or not (case 1 of Fig. 11). In the case of data sharing, a single register (resp. bus) can store the data for the two successors.

Thus the probable number of registers due to the data D is, without data sharing:

NR(D, t) = PR(t, 1) + PR(t, 2)   (16)

and with data sharing:

NR(D, t) = PR(t, 1) + PR(t, 2) − PR(t, 1) · PR(t, 2)   (17)

The probable number of registers computed over the whole DFG is finally:

NR(t) = Σ_D NR(D, t)   (18)
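Eqs. (16) and (17), evaluated at a single time step, can be sketched as follows (the function and argument names are illustrative):

```python
def probable_registers(p1, p2, sharing):
    """Sketch of Eqs. (16)-(17) at one time step: probable register
    count for a data D with two consumers, given the transfer
    probabilities PR(t, 1) and PR(t, 2). With data sharing, a single
    register can serve both successors."""
    if sharing:
        return p1 + p2 - p1 * p2  # Eq. (17): union of the two events
    return p1 + p2                # Eq. (16): one register per transfer

print(probable_registers(0.5, 0.4, sharing=False))  # 0.9
print(probable_registers(0.5, 0.4, sharing=True))   # 0.7
```

Summing these per-data values over the whole DFG gives NR(t) as in Eq. (18).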

Generalization. The method described above applies similarly to the estimation of buses and interconnections between FUs. The difference lies in the definition of the associated intervals IR, IB, IC[i,j], where C[i,j] designates a connection between the FUs of types i and j. The method is finally generalized to transfers between an operation and any number of successors.

Computation Complexity. The complexity is higher than for the estimation of FUs. Without data sharing, we get O(N · max_i{(alap(i) − asap(i))/Δt}²). The squared time term may lead to slow estimations when the lengths of the [asap; alap] intervals increase, since the scheduling probability of an operation in a given time step becomes infinitely small. In such cases, an equivalent estimation is obtained by applying an interpolation: considering a limited number of scheduling possibilities Ns, uniformly distributed in the intervals [asap; alap], the estimation is computed with a maximal complexity equal to O(N · Ns²).

Figure 12. Unfeasible parallelism.

Scheduling Probabilities. The probability model could be uniform: POi(t) = 1/(alap(Oi) − asap(Oi) + 1) if asap(Oi) ≤ t ≤ alap(Oi), and 0 otherwise. We did not retain such a distribution, in order to avoid the wrong-parallelism problems due to time-frame overlaps (see Fig. 12). The distribution law we use takes into account the dependences on predecessors and successors. The scheduling probability of Ok in the time step t is computed according to the number of scheduling possibilities of the predecessors and successors that allow the scheduling of Ok at t (Eqs. (19) and (20)).

POn(t) = (1/N0) · Π_{i=1}^{NPred} max[1, t − (asap(pi) + Δ(pi)) + 1]
                · Π_{j=1}^{NSuc} max[1, asap(sj) − (t + Δ(On) + 1)]   (19)

with:

N0 = Σ_{t=asap(On)}^{alap(On)} Π_{i=1}^{NPred} max[1, t − (asap(pi) + Δ(pi)) + 1]
                             · Π_{j=1}^{NSuc} max[1, asap(sj) − (t + Δ(On) + 1)]   (20)

where NPred and NSuc are respectively the numbers of predecessors and successors. POn(t) and N0 may be computed simultaneously. The complexity is linear in the number of operations N:

C < O(N · max_n{(alap(n) − asap(n))/Δt} · max_i{NPred(i)} · max_j{NSuc(j)})
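A sketch of Eqs. (19)-(20), with a flat input layout (lists of predecessor (asap, Δ) pairs and successor asap dates) that is ours, not the paper's:

```python
def scheduling_probability(t_range, preds, succs, delta_on):
    """Sketch of Eqs. (19)-(20): non-uniform scheduling density of an
    operation On over its [asap; alap] interval, weighted by how much
    slack each predecessor and successor leaves at date t."""
    def weight(t):
        w = 1
        for asap_p, delta_p in preds:
            w *= max(1, t - (asap_p + delta_p) + 1)   # predecessor slack
        for asap_s in succs:
            w *= max(1, asap_s - (t + delta_on + 1))  # successor slack
        return w
    n0 = sum(weight(t) for t in t_range)              # normalization, Eq. (20)
    return {t: weight(t) / n0 for t in t_range}

# one predecessor ready at date 1 (asap 0, delay 1), one successor at asap 4:
p = scheduling_probability(range(1, 4), [(0, 1)], [4], 1)
print(p)
```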

2.3.3. Estimation Properties. The global cost distribution is obtained with:

Ctotal(t) = Σ_i Cost(FUi) · Ni(t) + Cost(reg) · NR(t) + Cost(bus) · NB(t)
          + Σ_{j,k} Cost(interconnect) · NC[j,k](t)   (21)

In the design methodology, Eq. (21) is used at the upper level to perform algorithmic selection. For Ctotal(t), as for the other curves, two main metrics are significant:

• The average of NX(t): gives the lower bound on the resource or cost of X.
• The irregularity of NX(t): indicates the difficulty of reaching the minimal bound. In practice, the standard deviation is used to quantify the regularity of the resource distribution.

We associate the two values to get a global measure of complexity that is called the Probable Cost. The estimation is independent from the algorithms involved in the high level synthesis tool. Consequently, the probable values do not return an exact estimation of the results obtained after synthesis. However, the metrics of average and regularity provide relative values that enable the transformed specifications to be ranked according to cost.
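The two metrics can be sketched as follows; how they are combined into the single Probable Cost value is not specified in this excerpt, so the sketch simply returns both:

```python
from statistics import mean, pstdev

def probable_cost_metrics(n_x):
    """The average of a resource distribution N_X(t) gives the lower
    bound on the resource or cost of X, and the standard deviation
    quantifies how hard that bound is to reach (the regularity)."""
    return mean(n_x), pstdev(n_x)

# a flat multiplier profile is easier to reach than a spiky one
print(probable_cost_metrics([2, 2, 2, 2]))  # (2, 0.0)
print(probable_cost_metrics([4, 0, 4, 0]))  # (2, 2.0)
```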

2.3.4. Examples. In Fig. 13 a table provides some estimations for two versions of a given algorithm. Indeed, the purpose of high level estimation is the comparison between different kinds of specifications; each problem must then be analyzed in detail through the whole set of resource estimations. Figures 13(a) and (b) exhibit a possible use of estimation for algorithmic transformations. Figure 13(a) gives the curves of cost distribution over the time steps [0; T] for successive retiming transformations of a 2nd-order Volterra filter. Figure 13(b) provides the evolution of the global cost, from estimation and from synthesis, over the consecutive transformations. As the transformations are performed, we notice an associated improvement in regularity.

2.3.5. Conclusion. The probabilistic cost estimation serves the design methodology as an analysis tool. We take benefit from its properties to guide the designer in the transformation space, including algorithmic, functional and


Figure 13. Cost evolution of a Volterra filter with successive retiming transformations.

structural transformations. We also use it to obtain metrics about interconnections for the hardware component selection.

3. Hardware Component Selection and Allocation

3.1. Introduction

The input requirements of the component selection are:

• A DFG built from the behavioral specification (VHDL) of the application;
• A library of specified components (VHDL). For a given function, several FUs are available with different area/delay trade-offs. The library also contains multi-functional units;
• A time constraint.

Using these data, the aim is to select the set of components that enables the time constraint to be respected while minimizing the final architecture cost.

Figure 14. FFT DIF2—multifunction selection.

3.2. Example

We use a DIF2 algorithm to show how the selection of a multifunctional component may evolve with the time constraint. Figure 14 gives the selected components; it may be noticed that an AdderSubtractor is not always required.

3.3. Method

The process is composed of four steps, in order to reduce the search space and keep the computation complexity tractable.

• Selection of a subset of components based on the average numbers. A reduced number of solutions is picked out with a branch-and-bound method. The following tasks are performed on this subset of solutions.
• The optimal numbers of FU and pipeline stages are computed.


• Multifunctional optimization is performed for underused FUs (see [17]).
• The estimations of the numbers of registers, interconnections and buses are computed for the subset of solutions.

The metrics used to carry out the final selection are:

NR = (1/Tr) · Σ_{t=1}^{Tr} NR(t)

NB = (1/Tr) · Σ_{t=1}^{Tr} NB(t)

NC = (1/Tr) · Σ_{t=1}^{Tr} NC(t)   (22)

4. Memory Unit Cost Estimation

Recent works [23, 24] have shown that, during system design, the memory can take a considerable part of the system cost. This is particularly true in digital signal processing, because the quality criterion for DSP applications is met either by suggesting new algorithms (mostly with block-vector computing) or by extending the response size of known filters. In the first case, it is necessary to evaluate the implementation of these algorithms through a prototyping stage. In the second case, the processing unit evolves little; however, the number of data to store rises considerably, and under these conditions an ASIC could be impossible to achieve. So it becomes important to develop tools which allow us to estimate the memory cost before the synthesis step. To be interesting, these tools must give a good estimation in a very small computation time. If this condition is satisfied, then it is possible to explore numerous design solutions for an application.

4.1. Previous Works

To estimate accurately the cost of the system implementation, we need to take into account the memorization problem. In the literature, we can find many articles concerning these problems. Generally, these articles suggest hardware solutions which strongly depend on the data types used in the description and the memory types used in the library. Most of these articles deal with the synthesis of a particular memory organization, detailed in the following points: (1) the allocation problem of the storage units for scalar or vectorial data [25, 26]; (2) the problem of address generation [27, 28]; (3) the hierarchical organization (to sustain the access rhythm of the processor) [29, 30].

These previous points are not very interesting for the estimation problem, because the cost of the solution is only known after the hardware has been built. On the other hand, some studies give methods to compute a memory estimation.

In [30], a memory selection technique is presented. It is based on a specific memory unit organization and on four possible methods. Horizontal selection, which associates several memory units, ensures that the width of the data corresponds to the width of the selected memory. A specific memory size can be achieved through vertical selection. Furthermore, to increase the transfer frequency, a selection of interleaved memories can be used. Finally, a last method can virtually increase the number of memory ports.

The selected set of memories is a good estimation of the memory unit complexity. In [31], the suggested method provides the number and the type (RAM or ROM), the word width, and the number of ports of each memory. The selection rests on the evaluation of a silicon-area cost function.

In [32], some high level transformations are presented. The goal consists in transforming the algorithm loops in order to minimize the data stored and thus the number of transfers to make between the processing and memory units. These two parameters are good estimators of the memory complexity. The methods described are based upon data structures of matrix form and suggest transformations with a view to reducing the number of memory points and transfers.

In [33], a technique of storage estimation for arrays of signals is presented. An integer linear formulation is given to compute an estimation of the number of storage locations. Also, in [23], the technique presented allows an evaluation of the memory area of a chip to be obtained. It is based on data-flow analysis and gives information about the total number of memory cells.

4.2. Our Approach

The memory estimation takes place after the processing unit estimation, as shown in Fig. 1. In this context, the constraint we have to ensure is defined as an access sequence (see Fig. 15). The methodology we have developed consists of three estimation levels: the first level computes the number of buses to implement between the PU and the MU; the second computes the number of memory banks needed to store all the data; and the third level selects, from a library, a set of memory banks. The first two levels are not accurate


Figure 15. (a) Access sequence example (b) Access interval between PU and MU.

enough to evaluate the MU cost correctly. On the other hand, the third level gives a good estimation of the MU area.

To compute the number of banks, we first need to compute the number of buses implemented between the PU and the MU; this step is called the smoothing task.

4.2.1. Smoothing of the Access Sequence. The processing unit estimation gives a list of access-time intervals which represent the possible dates for each access. This step consists in computing an estimation of the best dates for each access (see Fig. 15). This estimation gives the number of buses between the PU and the MU. The ideal goal of the smoothing operation is to produce a constant distribution of the number of transfers over all the possible access times. In Fig. 16(a), we can see a bad distribution of accesses: to ensure this sequence, it is necessary to implement 6 buses between the PU and the MU. In Fig. 16(b), we can see a good distribution of the same sequence; only 3 buses are implemented for the same total number of transfers.

Figure 16. Examples of transfer sequences.

Mathematical Smoothing Formulation. Hypothesis: we consider that each data is transferred just once. When a data is transferred N times, we create N − 1 new data, with one transfer for each.

Let Ait be the integer variable indicating whether the ith data is transferred at the time t.

Let ASAPi be the variable indicating the "as soon as possible" date for the transfer of the ith data.

Let ALAPi be the variable indicating the "as late as possible" date for the transfer of the ith data.

The smoothing operation consists of the minimization of the number of accesses at each access date:

Min ( Σ_{i=1}^{ND} Ait )   ∀t = 1, …, NA

Constraint 1:  Σ_{t=1}^{NA} Ait = 1   ∀i = 1, …, ND

Constraint 2:  ∀t | Ait = 1 ⇒ ASAPi ≤ t ≤ ALAPi

with ND the number of data and NA the number of access times.


Figure 17. Example of building a conflict graph from a transfer sequence.

Constraint 1 ensures that all the transfers are completed, and constraint 2 ensures that the date of each transfer lies in the ASAP-ALAP interval. The problem can also be formulated as follows:

Min ( Σ_{t=1}^{NA} | Σ_{i=1}^{ND} Ait − T̄NA | ),   with T̄NA = TNA/NA

Constraint 1:  Σ_{t=1}^{NA} Ait = 1   ∀i = 1, …, ND

Constraint 2:  ∀t | Ait = 1 ⇒ ASAPi ≤ t ≤ ALAPi

with TNA the total number of accesses of the algorithm (here TNA = ND, since each data is transferred once) and T̄NA the average number of accesses per time interval.

Because of the absolute value, this formulation is non-linear, so it is impossible to solve it with the classical ILP (Integer Linear Programming) techniques.

This problem is similar to a data-flow-graph scheduling problem, so it is solved with a scheduling algorithm under the constraint of minimum parallel accesses.
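As a hedged sketch, such a scheduling pass can be a greedy heuristic that places the most constrained transfers first into the least-loaded feasible date, the peak load then giving the bus count. This is one possible heuristic, not necessarily the algorithm used in GAUT:

```python
def smooth_accesses(intervals, n_slots):
    """Sketch of the smoothing task solved as scheduling: every
    transfer i gets one date inside [ASAP_i, ALAP_i]; the most
    constrained transfers (smallest interval) are placed first in the
    least-loaded feasible date, and the peak load is the bus count."""
    load = [0] * n_slots
    dates = {}
    for i, (asap, alap) in sorted(enumerate(intervals),
                                  key=lambda x: x[1][1] - x[1][0]):
        t = min(range(asap, alap + 1), key=lambda s: load[s])
        dates[i] = t
        load[t] += 1
    return dates, max(load)

# six transfers over NA = 3 access dates; the ideal bound is ceil(6/3) = 2
iv = [(0, 0), (0, 2), (0, 2), (1, 2), (0, 1), (2, 2)]
dates, buses = smooth_accesses(iv, 3)
print(buses)  # 2
```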

4.2.2. Computation of the Number of Memory Banks. Here, the important point is to ensure the coherence of the data and the simultaneity of the transfers over the whole access sequence. The coherence of data is defined by the following three points:

• temporal coherence: two data which are transferred at the same time must not be placed in the same memory bank;
• spatial coherence: the reading and writing of the same data must be carried out in the same memory bank;
• functional coherence: in the case of a pipelined structure, the writing at one pipeline stage must not erase data which are still in use by other pipeline stages.

We use a graph G{N, A} to represent the conflicts between data (see Fig. 17(b)). The set of vertices N is made up of the set of data to be memorized (N = {a, b, c, d}), whilst the set of edges A represents the conflicts between the data (A = {1, 2, 3, 4}). There is a conflict between two data during a discrete time step if these data are transferred simultaneously; should this be the case, an edge is created between the two vertices. The graph of Fig. 17(b) is therefore directly built from the sequence of Fig. 17(a). Finding the minimum number of banks NB necessary to memorize the set of data is equivalent to finding the chromatic number of the graph G: it involves finding the minimum number of colours needed to colour the vertices of G, under the restriction that two vertices connected by an edge cannot carry the same colour. The colouring problem of graph G is equivalent to a clique partitioning problem on the complementary graph Ḡ (a clique is a subgraph in which each pair of vertices is connected by an edge).
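The bank-count estimation can be sketched as a greedy colouring of the conflict graph, which yields an upper bound on the chromatic number in near-linear time (the example graph is hypothetical, not the one of Fig. 17):

```python
def bank_count(data, conflicts):
    """Sketch of the bank estimation as greedy vertex colouring of the
    conflict graph: two data linked by an edge (simultaneous transfer)
    must go to different banks. Greedy colouring yields an upper bound
    on the chromatic number, i.e. on the minimal number of banks."""
    bank = {}
    for d in data:
        used = {bank[n] for n in conflicts.get(d, ()) if n in bank}
        bank[d] = next(c for c in range(len(data)) if c not in used)
    return bank, max(bank.values()) + 1

# hypothetical 4-data conflict graph (edges a-b, a-c, b-d, c-d):
g = {'a': ['b', 'c'], 'b': ['a', 'd'], 'c': ['a', 'd'], 'd': ['b', 'c']}
banks, nb = bank_count('abcd', g)
print(nb)  # 2 banks suffice
```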

The solution to this problem thus amounts to isolating as many data as possible, such that each pair of isolated data is connected by an edge. We can express such a calculation in the following way:

• Maximize NB = Σ_{i=1}^{ND} DMi

  Constraint:  ∀i, j | DMi = DMj = 1 ⇒ Cij = 1

with DMi the variable that indicates whether the ith data is isolated. The objective function NB is correct from the point of view of its ILP expression. On the other hand, the constraint must be expressed by way of a linear inequation. The constraint ∀i, j | DMi = DMj = 1 means that each pair of variables i and j belongs to the largest clique if, and only if, there is a conflict between the two variables. This constraint cannot be expressed in the form of an inequality; however, we can express the complementary constraint: each pair of


variables does not belong to the largest clique if there is no conflict between the two variables:

DMi + DMj ≤ 1   ∀i, j | Cij = 0

The problem of finding the number of memory banks is thus expressed in the following way:

• Maximize NB = Σ_{i=1}^{ND} DMi

  Constraints:  0 ≤ DMi ≤ 1   ∀i
                DMi + DMj ≤ 1   ∀i, j | Cij = 0

The resolution of this problem can be achieved by three types of methods:

• exhaustive: although the complexity is usually high, some heuristics allow us to limit the search space;
• maximization or minimization: starting from the ILP formulation, a search by a simplex method supplies good results; however, the calculation time is long and often unsatisfactory for an algorithm prototyping stage;
• linear: this type of search is non-optimal, but quickly supplies an evaluation of the complexity.

The method we developed to carry out the number of banks is linear; the calculation times necessary to obtain the solution are therefore relatively small.

4.2.3. Selection of Memory Components. To obtain a fast estimation of the memory cost, we suggest a selection of memory components for a fixed model of organization (Fig. 18). This selection consists of searching through a library for the memory set best suited to the data storage, according to the requirements expressed by the PU estimation. We can notice that the higher the number of memory types, the higher the number of solutions to explore before reaching the best selection.

This model is composed of NB memory banks associated with NG address generators (with NG ≤ NB). The memory unit can be made up of memories with varying access times.

In a first approach, the types of data and memory are ignored (note that these types could be taken into account with some small modifications). The problem

Figure 18. Memory unit model.

is to select a set with a minimum number of memories, which can be formulated by the following expressions:

Min ( Σ_{i=1}^{NM} Si )

Constraint 1:  Σ_{i=1}^{NM} Si ≥ NB

Constraint 2:  Σ_{i=1}^{NM} Si · NPi ≥ ND

Constraint 3:  Σ_{i=1}^{NM} Si · (TL / TAi) ≥ TNA

with NM the number of memories present in the library and NPi the number of memory points in the ith memory bank. Constraint 1 ensures transfer parallelism and data coherence: the first is obtained by a simple analysis of the transfer sequence, while the second calls for the number of memory banks (NB) computed previously.

Constraint 2 ensures the storage of all the data. In a first step, each data needs a memory location; this can be optimized if one envisages the sharing of memory locations by several data which have disjoint lifetimes.

Constraint 3 ensures that all the transfers can be done.

The number of available access times in a memory bank during the latency time is represented by TL/TA_i. The total number of possible accesses is therefore the sum of all possible accesses in the selection.
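As a sketch of this selection step, the constrained minimisation above can be solved by a small exhaustive search over a toy component library. All names (`MemComponent`, `select_memories`), the library contents and the numeric requirements below are illustrative assumptions, not the actual GAUT implementation:

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class MemComponent:
    name: str
    points: int          # NP_i: capacity in memory points (words)
    access_time: float   # TA_i: access time (e.g. in ns)

def select_memories(library, nb_banks, n_data, latency, n_accesses, max_copies=4):
    """Minimise the number of instances sum(S_i) subject to:
       C1: sum(S_i) >= NB          (bank parallelism / coherence)
       C2: sum(S_i*NP_i) >= ND     (storage of all data)
       C3: sum(S_i*TL/TA_i) >= TNA (enough accesses within the latency)."""
    best = None
    for counts in product(range(max_copies + 1), repeat=len(library)):
        total = sum(counts)
        if total < nb_banks:                                               # C1
            continue
        if sum(s * m.points for s, m in zip(counts, library)) < n_data:    # C2
            continue
        if sum(s * latency / m.access_time
               for s, m in zip(counts, library)) < n_accesses:             # C3
            continue
        if best is None or total < best[0]:
            best = (total, counts)
    return best

lib = [MemComponent("RAM256", 256, 10.0), MemComponent("RAM1K", 1024, 15.0)]
total, counts = select_memories(lib, nb_banks=2, n_data=1500,
                                latency=100.0, n_accesses=12)
print(total, counts)   # smallest selection satisfying all three constraints
```

An exhaustive search is only workable for small libraries; as noted above, the real formulation is an integer program, and the number of explored solutions grows quickly with the number of memory types.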

Feasibility of Internal Memory. For a given technology and foundry, the internal memory must not be



larger than a specific size. This maximum size is given by the product NumberOfBits · NumberOfWords. Thus, we can express a constraint for the internal memory, see below.

$$\forall i = 1, \ldots, NM:\quad NP_i \cdot NBits_i \le Max_i$$

4.2.4. Conclusion. The memory estimation we have developed is based on memory selection. This method gives a good estimation of the MU area based on realistic memory components (because the components are selected from a silicon-foundry library).

After this estimation, it is possible to start the synthesis step of the MU. The main tasks of the synthesis are: the data distribution in memory banks, the data placement in each memory bank, and the synthesis of the address generators (for more details, see [34]).

5. Area/Time Trade-offs of New Fast LMS Filters

5.1. Introduction

The GAUT framework provides two ways of obtaining trade-off curves: with or without performing architectural synthesis. While performing estimation and selection tasks we obtain quick results, without building the architecture. This solution is independent of the scheduling and binding algorithmic choices. The following results have been obtained in this way. The number of FUs has been obtained through the optimal estimation described in §2.2. The estimation of interconnect cost has been performed through a probabilistic estimation. Finally, the memory cost estimation has been carried out.

The aim of this study is to estimate the cost of different algorithms for a given time constraint. The iteration frequency is fixed at 16 kHz, while the number of filter taps associated with the specification of the application is modified. Therefore, we finally obtain cost/(DSP quality) trade-off curves for each kind of tested algorithm.

5.2. Description of the Application

The application fits in the context of acoustic echo cancellation, for which the large size of the finite impulse response requires filters of about 1000–4000 taps. The

computational load of such applications justifies the efforts spent on reducing the arithmetic complexity. Our goal is to test, through estimation in high level synthesis, the efficiency of some recent methods in a VLSI context.

The LMS algorithm may be split into an updating part and a fixed filtering part. For the updating task, we use the method described in [35] for the FELMS algorithm, which keeps the convergence properties of the LMS while allowing a significant reduction of the computational complexity. The fixed part uses the fast FIR filtering described in [36]; it is composed of short-length FIR sub-filters derived from the original FIR algorithm with the Chinese Remainder Theorem (CRT). Contrary to the block versions, this kind of fast algorithm generates a processing delay independent of the filter length (see [37, 38]). The four kinds of algorithms will be detailed below.

5.3. Description of New Fast LMS Algorithms

5.3.1. Reduction of Arithmetic Complexity. The first algorithm is the classic LMS in the FIR case. The principle of fast filtering applied to the fixed part of the LMS is a transformation of the FIR filter equation into two finite-degree polynomials computed by the CRT. It decomposes the computation into three main parts. The first part is concerned with the interpolation of the polynomial products at different points, deriving different algorithms in the same F(N_i, N_i) class [37]. The second is the filtering operation and the final one is the reconstruction of the filtered signal. An algorithm of the class F(N_i, N_i) applied to a FIR of length L computes N_i outputs in parallel using m_i sub-filters of length L/N_i. Despite pre- and post-processing operations, it results in a significant reduction of the arithmetic complexity. By repeating the process [37] on sub-filters, the length of the final filters, and consequently the arithmetic complexity, can be considerably reduced.

The algorithms discussed in this paper are two F(2, 2) algorithms based on different interpolation points and an F(3, 3) algorithm.

5.3.2. F(2, 2) - First Version. The first version of the F(2, 2) is built by choosing the following interpolation points in the pre-processing step: {0, 1, ∞}. If the initial FIR filtering is defined as Y(z) = H(z) · X(z), we obtain the following equations:



Filtering part:

$$Y(z) = Y_0 + Y_1 \cdot z^{-1}, \quad X(z) = X_0 + X_1 \cdot z^{-1}, \quad H(z) = H_0 + H_1 \cdot z^{-1}$$

$$\begin{cases} Y_0 = X_0 H_0 + X_1 H_1 \cdot z^{-2} \\ Y_1 = (X_0 + X_1)(H_0 + H_1) - X_0 H_0 - X_1 H_1 \end{cases} \quad (23)$$

Updating in accordance with the FELMS algorithm:

$$\begin{bmatrix} H_0 \\ H_1 \end{bmatrix}^{(n+1)} = \begin{bmatrix} H_0 \\ H_1 \end{bmatrix}^{(n-1)} + \mu \begin{bmatrix} X_0^t (e_n - e_{n-1}) + (X_0 + X_1)^t e_{n-1} \\ -X_0^t (e_n - e_{n-1}) + (X_1 \cdot z^{-2} + X_1)^t e_n \end{bmatrix} \quad (24)$$

Equation (24) has been modified to reduce its complexity: two terms are common to H_0 and H_1, and the sum of the X_i results from the filtering part.

The initial filtering requires L multiplications and additions; the F(2,2) algorithm needs 3L/4 multiplications and additions for the filtering part and 2 + (3/2)(L/2 − 1) additions/subtractions for the pre- and post-processing parts. The reduction of arithmetic complexity is therefore about 25%. Fig. 19 depicts the structure of the algorithm.
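The decomposition of Eq. (23) can be checked numerically. The sketch below (hypothetical function name, NumPy assumed available) computes a full convolution through the three half-length sub-products instead of the four of a plain polyphase split, and compares it with a direct FIR computation:

```python
import numpy as np

def fast_fir_f22_v1(h, x):
    """conv(h, x) via Eq. (23): three half-length sub-products instead of four."""
    h0, h1 = h[0::2], h[1::2]                # even/odd polyphase components
    x0, x1 = x[0::2], x[1::2]
    p0 = np.convolve(x0, h0)                 # X0.H0
    p1 = np.convolve(x1, h1)                 # X1.H1
    p2 = np.convolve(x0 + x1, h0 + h1)       # (X0+X1)(H0+H1)
    y0 = np.zeros(len(p0) + 1)
    y0[:-1] += p0                            # Y0 = X0.H0 ...
    y0[1:] += p1                             # ... + X1.H1 . z^-2
    y1 = p2 - p0 - p1                        # Y1 = (X0+X1)(H0+H1) - X0.H0 - X1.H1
    y = np.zeros(len(y0) + len(y1))
    y[0::2] = y0                             # even output phase
    y[1::2] = y1                             # odd output phase
    return y

h = np.arange(1.0, 9.0)                      # length-8 example filter
x = np.array([1.0, -2.0, 3.0, 0.5, 0.0, 1.0, 2.0, -1.0])
assert np.allclose(fast_fir_f22_v1(h, x), np.convolve(h, x))
```

The three sub-products on half-length vectors are where the 3L/4 multiplication count comes from; the extra additions correspond to the pre- and post-processing terms counted above.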

5.3.3. F(2, 2) - Second Version. The second version of the F(2, 2) uses another set of interpolation points: {0, −1, ∞}. It results in the following equations:

Filtering part:

$$\begin{cases} Y_0 = X_0 H_0 + X_1 H_1 \cdot z^{-2} \\ Y_1 = X_0 H_0 + X_1 H_1 - (X_0 - X_1)(H_0 - H_1) \end{cases} \quad (25)$$

Updating in accordance with the FELMS algorithm:

$$\begin{bmatrix} H_0 \\ H_1 \end{bmatrix}^{(n+1)} = \begin{bmatrix} H_0 \\ H_1 \end{bmatrix}^{(n-1)} + \mu \begin{bmatrix} X_0^t (e_n + e_{n-1}) - (X_0 - X_1)^t e_{n-1} \\ X_0^t (e_n + e_{n-1}) + (X_1 \cdot z^{-2} - X_1)^t e_n \end{bmatrix} \quad (26)$$

We also have two terms which are common to the computation of H_0 and H_1, and the subtraction of the X_i terms results from the filtering part. The reduction of arithmetic complexity is equal to the previous one: 25%.
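A quick numerical sanity check of the odd-phase identity in Eq. (25) (NumPy assumed; variable names are illustrative): the cross term X0·H1 + X1·H0 is recovered from the two products already computed plus a single extra sub-product built on differences rather than sums:

```python
import numpy as np

rng = np.random.default_rng(1)
x0, x1, h0, h1 = (rng.standard_normal(5) for _ in range(4))

# Cross term needed for the odd output phase, computed directly ...
cross = np.convolve(x0, h1) + np.convolve(x1, h0)
# ... and via the {0, -1, inf} interpolation points of Eq. (25)
via_v2 = (np.convolve(x0, h0) + np.convolve(x1, h1)
          - np.convolve(x0 - x1, h0 - h1))
assert np.allclose(cross, via_v2)
```

The only difference from the first version is the sign pattern, which, as noted in §5.4, yields a better balance between additions and subtractions.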

5.3.4. F(3, 3). The F(3, 3) algorithm computes three parallel outputs. In order to avoid complex interpolation points, it is carried out by applying the F(2, 2) algorithm twice. It results in the following equations:

Filtering part:

$$Y(z) = Y_0 + Y_1 \cdot z^{-1} + Y_2 \cdot z^{-2}, \quad X(z) = X_0 + X_1 \cdot z^{-1} + X_2 \cdot z^{-2}, \quad H(z) = H_0 + H_1 \cdot z^{-1} + H_2 \cdot z^{-2} \quad (27)$$

$$\begin{cases} Y_0 = (X_0 H_0 - X_2 H_2 \cdot z^{-3}) + ((X_1 + X_2)(H_1 + H_2) - X_1 H_1) \cdot z^{-3} \\ Y_1 = ((X_0 + X_1)(H_0 + H_1) - X_1 H_1) - (X_0 H_0 - X_2 H_2 \cdot z^{-3}) \\ Y_2 = ((X_0 + X_1) + X_2)(H_0 + H_1 + H_2) - ((X_0 + X_1)(H_0 + H_1) - X_1 H_1) - ((X_1 + X_2)(H_1 + H_2) - X_1 H_1) \end{cases} \quad (28)$$
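Equation (28) can likewise be verified numerically. The sketch below (hypothetical function name, NumPy assumed) builds the full convolution from the six one-third-length sub-products and compares it with a direct computation:

```python
import numpy as np

def fast_fir_f33(h, x):
    """conv(h, x) via the F(3,3) identities of Eq. (28): six third-length
       sub-products instead of nine (len(h) and len(x) divisible by 3)."""
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]           # polyphase split
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    A = np.convolve(x0, h0)                          # X0.H0
    B = np.convolve(x1, h1)                          # X1.H1
    C = np.convolve(x2, h2)                          # X2.H2
    D = np.convolve(x0 + x1, h0 + h1)                # (X0+X1)(H0+H1)
    E = np.convolve(x1 + x2, h1 + h2)                # (X1+X2)(H1+H2)
    F = np.convolve(x0 + x1 + x2, h0 + h1 + h2)      # (X0+X1+X2)(H0+H1+H2)
    n = len(A) + 1
    pad = lambda p: np.concatenate((p, [0.0]))       # align to length n
    z3 = lambda p: np.concatenate(([0.0], p))        # one-tap delay, i.e. z^-3
    Y0 = pad(A) + z3(E - B - C)
    Y1 = pad(D - A - B) + z3(C)
    Y2 = F - D - E + 2 * B
    y = np.zeros(3 * n - 1)
    y[0::3], y[1::3], y[2::3] = Y0, Y1, Y2           # interleave the 3 phases
    return y

h = np.arange(1.0, 7.0)                              # length-6 example filter
x = np.array([2.0, -1.0, 0.5, 3.0, 0.0, 1.0])
assert np.allclose(fast_fir_f33(h, x), np.convolve(h, x))
```

Six sub-products on length-L/3 vectors give the roughly one-third reduction in multiplications quoted below, at the price of extra pre- and post-processing additions.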

Updating in accordance with the FELMS algorithm:

$$\begin{bmatrix} H_0 \\ H_1 \\ H_2 \end{bmatrix}^{(n+1)} = \begin{bmatrix} H_0 \\ H_1 \\ H_2 \end{bmatrix}^{(n-2)} + \mu \begin{bmatrix} X_0^t(e_{n-2} - e_{n-1} + e_n) - (X_0 + X_1)^t(e_{n-2} - e_{n-1}) + (X_1 + X_2)^t e_{n-2} \\ -X_0^t(e_{n-2} - e_{n-1} + e_n) + (X_0 + X_1)^t(e_{n-2} - e_{n-1}) - (X_2 \cdot z^{-3} + X_0)^t(e_{n-1} - e_n) + ((X_2 \cdot z^{-3} + X_0) + (X_0 + X_1))^t e_{n-1} \\ X_0^t(e_{n-2} - e_{n-1} + e_n) + (X_2 \cdot z^{-3} + X_0)^t(e_{n-1} - e_n) + (X_1 \cdot z^{-3} + X_2 \cdot z^{-3})^t e_n \end{bmatrix} \quad (29)$$

The complexity is now about a third less than the original. Equation (29) has also been formulated to enable the re-use of terms shared by the equations of the H_i and the additions of the X_i, previously computed in the FIR part (Eq. (28)).

5.4. Results and Conclusions

The four algorithms seem to have the same memory requirements: 2L memory points for the samples vector and the filter coefficients. However, we note that the faster the algorithm is, the higher the memory cost is. The extra memory cost is due, firstly, to the M new sample



Figure 19. Structure of fast FIR F22-version 1.

Figure 20. Trade-offs Area/performance of LMS algorithms.



vectors of length L/N that have, in practice, to be created in order to compute the M sub-filters in parallel. In [38] this problem is discussed for implementations of such fast FIR filters on DSPs; the authors explain that the number of pointer registers is proportional to the number of FIR sub-filters, and an efficient methodology is developed to reduce this number of memory registers. Moreover, the pre- and post-processing tasks generate a lot of temporary data that must be stored in memory or registers.

The memory cost is not solely responsible for the extra cost observed. For instance, we note that the LMS-F33 datapath becomes less expensive than the LMS-F22 datapath only for a number of taps equal to 3072, whereas it should always have been cheaper. In fact, the memory transfers also modify the datapath cost, because they generate muxes, demuxes and tristates. When the data transfers become important, the cost of these small components, often neglected in High Level Synthesis, results in a significant extra cost.

Note that a reduction of the number of operations is not always related to a reduction of the number of components in High Level Synthesis. If a fast algorithm improves the rate of multiplications per unit of time from 2.5 to 2, it saves 20% of operations but 0% of multipliers. Moreover, a fast algorithm often partially breaks the regularity of the initial algorithm; the consequence for the architecture is an increase of the number of registers, muxes, demuxes and tristates required.

In fact, the efficiency of the algorithmic reduction of arithmetic complexity depends strongly on the ratio between the initial number of operations and the throughput constraint. In the case of a filter length equal to 1500, we observe that the reduction of the number of multiplications is not sufficient to save enough multipliers to make up for the increase of memory and interconnection resources. Consequently, the classic LMS gives the smallest area.
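This effect can be illustrated with a back-of-the-envelope calculation (the cycle budget below is a made-up figure, not taken from the paper): the number of multipliers is bounded from below by the ceiling of operations over available cycles, so a 25% cut in multiplications may save no multiplier at all:

```python
from math import ceil

def min_multipliers(mult_ops, cycles):
    """Lower bound on multipliers: operations to schedule / cycles available."""
    return ceil(mult_ops / cycles)

L = 1500                  # filter taps (as in the case study)
cycles = 500              # cycles per iteration: illustrative assumption
classic = L               # L multiplications for the plain FIR part
fast = 3 * L // 4         # 3L/4 multiplications with the F(2,2) scheme

# 25% fewer operations, yet the same lower bound on multipliers:
print(min_multipliers(classic, cycles), min_multipliers(fast, cycles))
```

With these assumed numbers both counts evaluate to 3: the saved multiplications disappear into the ceiling, while the extra registers, muxes and memory transfers of the fast version remain.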

The results show that algorithmic transformations in High Level Synthesis are essential to guide the designer towards a good specification choice, for a given case study.

Finally, we observe that the fast algorithms give the best design three times out of four, despite the increase of memory resources. The advantage of the second version of the F(2,2) algorithm is due to a better balance between the numbers of additions and subtractions, which makes the overall synthesis easier.

To conclude, we can say that after estimation it is up to the designer to make a choice. According to the DSP quality he needs for his application, he will set a minimum number of taps. Given this constraint, he will, by observing the results, choose one of the proposed algorithms. For instance, if a high quality is required, he will take a 2048-tap LMS, for which the algorithm LMS-F22-v2 provides the lowest cost.

6. Conclusion

In this paper we have presented a global framework that enables the prototyping of applications. The complexity of the trade-off exploration requires the use of different techniques that depend on the needs of the user. The area/time trade-offs are obtained by repeating the estimation with different time constraints. For a given signal processing application, this method enables quick estimation of various algorithms; thus DSP quality/area/time trade-offs are produced.

Probabilistic cost estimations focus on the analysis of cost distribution within a constrained number of time steps. The main features of the method are the dynamic cost estimation based on a conditional probability law that handles real dependencies, the computation simplification that allows fast estimations, and the estimation of the whole set of datapath resources: FUs, registers, buses and interconnections.

The results analysis is involved in an estimation/transformation methodology. The aim is to guide the designer towards the best specification by applying algorithmic/data-flow/functional transformations, in order to correctly distribute resource use and to flatten the cost curve.

The other datapath estimation is dedicated to the subsequent synthesis optimization. The originality of the method is, firstly, the dual computation of minimal numbers of FUs and associated pipeline stages. Secondly, it proposes the association of a new parallelism rate, which avoids usual over-estimations, with an improved FU average number that rejects unusable time units. Finally, the method addresses the entire set of mobility opportunities allowed by the number of delays in feedback loops. The knowledge of operation-node mobilities and the number of FUs enables the reduction of the synthesis complexity. Consequently, the synthesis task principally addresses the minimization of the interconnect and controller costs during the scheduling and binding tasks.



The two kinds of estimations are involved in a global hardware component selection.

The memory unit estimation is presented to complete the evaluation of the ASIC cost. For a more accurate estimation, the estimation process is decomposed into three levels. The method is based on the mobilities of the read/write operations. The probabilistic cost estimation gives us the ASAP and ALAP times for the access operations. These times are used to evaluate the minimum number of memory banks that we need to implement. After that, our tool selects from a library the best set of memory components for the storage of all data.

This method gives a good and fast estimation of the MU area based on realistic memory components. The number of memories in the library is very important: the higher the number of memory components, the better the solution can be. Moreover, the memory can be internal or external.

A real-life application has been studied in the case of accurate estimations applied to cost/DSP quality trade-offs (see §5). The results show how a designer can use the estimation framework to make a specification choice. This estimation is not linked to any synthesis tool, and can therefore guide the designer in the optimisation of his application implementation by using transformations.

Only three axes of the complete design space exploration framework have been presented here. The methodology is now being extended to two other criteria. The first one addresses the estimation and optimization of ASIP and ASIC power consumption [2]. The second one is related to the optimal choice of FU bit lengths in order to respect an SNR constraint [3].

References

1. P. Lippens, V. Nagasamy, and W. Wolf, "CAD Challenges in Multimedia Computing," in IEEE Int. Conf. on Computer Aided Design, 1995.

2. N. Julien, E. Martin, S. Gailhard, and O. Sentieys, "Area/Time/Power Space Exploration in Module Selection for DSP High-Level Synthesis," in Int. Workshop PATMOS'97, Louvain-la-Neuve, Belgium, 1997, pp. 35–44.

3. J.M. Tourreilles, C. Nouet, and E. Martin, "A Study on Discrete Wavelet Transform Implementation for a High Level Synthesis Tool," in Eusipco'98, Sept. 1998.

4. "Breizh Synthesis System Framework," Tech. Rep., ENSSAT—University of Rennes, http://archi.enssat.fr/bss, 1998.

5. D. Chillet, J.P. Diguet, J.L. Philippe, and O. Sentieys, "Memory Unit Design for Real Time DSP Applications," submitted to Integrated Computer-Aided Engineering, 1996.

6. E. Martin, O. Sentieys, H. Dubois, and J.L. Philippe, "GAUT, an Architectural Synthesis Tool for Dedicated Signal Processors," in EURO-DAC, Hamburg, Oct. 1993, pp. 85–94.

7. E. Martin, O. Sentieys, and J.L. Philippe, "Synthèse Architecturale de Coeur de Processeurs de Traitement du Signal," Techniques et Sciences Informatiques, vol. 1, no. 2, 1994.

8. B.M. Pangrle, "On the Complexity of Connectivity Binding," IEEE Trans. on Computer Aided Design, vol. 10, no. 11, 1991, pp. 1460–1465.

9. R. Jain, A.C. Parker, and N. Park, "Predicting System-Level Area and Delay for Pipelined and Nonpipelined Designs," IEEE Trans. on Computer Aided Design, vol. 11, no. 8, 1992.

10. J.M. Rabaey and M. Potkonjak, "Estimating Implementation Bounds for Real Time DSP Application Specific Circuits," IEEE Trans. on Computer Aided Design, vol. 13, no. 6, 1994.

11. M.C. Golumbic, Algorithmic Graph Theory and Perfect Graphs, New York: Academic Press, 1980.

12. D.L. Springer and D.E. Thomas, "Exploiting the Special Structure of Conflict and Compatibility Graphs in High-Level Synthesis," IEEE Trans. on Computer Aided Design, vol. 13, no. 7, 1994, pp. 843–856.

13. A. Sharma, Estimation and Design Algorithms for the Behavioral Synthesis of ASICs, Ph.D. Thesis, University of Wisconsin, Madison, Dept. of Elec. and Comp. Eng., Dec. 1992.

14. M. Rim and R. Jain, "Lower-Bound Performance Estimation for the High-Level Synthesis Scheduling Problem," IEEE Trans. on Computer Aided Design, vol. 13, no. 4, 1994, pp. 451–458.

15. L. Guerra, M. Potkonjak, and J. Rabaey, "System-Level Design Guidance Using Algorithm Properties," in IEEE Workshop on VLSI S.P., San Diego, Oct. 1994, pp. 73–82.

16. H. Jun and S. Hwang, "Design of a Pipeline Datapath Synthesis System for Digital Signal Processing," IEEE Trans. on VLSI Systems, vol. 2, no. 3, 1994.

17. O. Sentieys, J.Ph. Diguet, J.L. Philippe, and E. Martin, "Hardware Module Selection for Real Time Pipeline Architectures Using Probabilistic Cost Estimation," in IEEE ASIC Conference, Rochester, USA, Sept. 1996.

18. J.Ph. Diguet, Estimation de Complexité et Transformations d'Algorithmes de Traitement du Signal pour la Conception de Circuits VLSI, Ph.D. Thesis, Université de Rennes I, Oct. 1996.

19. K.J.R. Liu and C.-T. Chiu, "Unified Parallel Lattice Structures for Time-Recursive Discrete Cosine/Sine/Hartley Transforms," IEEE Trans. on Signal Processing, vol. 41, no. 3, 1993, pp. 1357–1377.

20. K.K. Parhi and D.G. Messerschmitt, "Pipeline Interleaving and Parallelism in Recursive Digital Filters—Part I: Pipelining Using Scattered Look-Ahead and Decomposition," IEEE Trans. on ASSP, vol. 37, no. 7, 1989, pp. 1099–1117.

21. C. Wang and K.K. Parhi, "High-Level DSP Synthesis Using Concurrent Transformations, Scheduling, and Allocation," IEEE Trans. on Computer Aided Design, vol. 14, no. 3, 1995.

22. J.L. Philippe, O. Sentieys, J.Ph. Diguet, and E. Martin, "From Digital Signal Processing Specifications to Layout," Novel Approaches in Logic and Architecture Synthesis, Chapman & Hall, 1995, pp. 307–313.

23. F. Balasa, F. Catthoor, and H. De Man, "Exact Evaluation of Memory Size for Multi-Dimensional Signal Processing Systems," in IEEE International Conference on Computer Aided Design, 1993.

24. T. Kim and C.L. Liu, "A New Approach to the Multiport Memory Allocation Problem in Data Path Synthesis," Integration, the VLSI Journal, 1995, pp. 133–160.

25. M. Balakrishnan, A.K. Majumdar, D.K. Banerji, J.G. Linders, and J.C. Majithia, "Allocation of Multiport Memories in Data Path Synthesis," IEEE Transactions on Computer-Aided Design of Integrated Circuits, vol. 7, no. 4, 1988.

26. F. Depuydt, G. Goossens, and H. De Man, "Scheduling with Register Constraints for DSP Architectures," Integration, the VLSI Journal, vol. 18, 1994.

27. F. Grant, P.B. Denyer, and I. Finlay, "Synthesis of Address Generators," in IEEE Int. Conf. on Computer Aided Design, Nov. 1989, pp. 116–119.

28. R. Reynaud, E. Martin, and F. Devos, "Design of a Bit Sliced Address Generator for FFT Algorithms," in Signal Processing IV: Theories and Applications, 1986.

29. J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, Inc. (French version by D. Etiemble and M. Israel, McGraw-Hill), 1992.

30. S. Bakshi and D.D. Gajski, "A Memory Selection Algorithm for High-Performance Pipelines," in EURO-DAC, Brighton, Sept. 1995.

31. F. Balasa, F. Catthoor, and H. De Man, "Dataflow Driven Memory Allocation for Multi-Dimensional Signal Processing Systems," in IEEE International Conference on Computer Aided Design, Nov. 1994, pp. 31–34.

32. H. Samsom, F. Franssen, F. Catthoor, and H. De Man, "Verification of Loop Transformations for Real Time Signal Processing Applications," in IEEE Workshop on VLSI Signal Processing, Oct. 1995.

33. I.M. Verbauwhede, C.J. Scheers, and J.M. Rabaey, "Memory Estimation for High Level Synthesis," in ACM/IEEE Design Automation Conference, 1994, no. 31.

34. O. Sentieys, D. Chillet, J.P. Diguet, and J.L. Philippe, "Memory Module Selection for High Level Synthesis," in IEEE Workshop on VLSI Signal Processing, 1996.

35. J. Benesty and P. Duhamel, "A Fast Exact Least Mean Square Adaptive Algorithm," IEEE Trans. on Signal Processing, vol. 40, no. 12, 1992, pp. 2904–2920.

36. R.E. Blahut, Fast Algorithms for Signal Processing, Addison-Wesley, Reading, MA, 1985.

37. Z.J. Mou and P. Duhamel, "Short Length FIR Filters and Their Use in Fast Nonrecursive Filtering," IEEE Trans. on ASSP, 1989.

38. A. Zergaïnoh and P. Duhamel, "Implementation and Performance of Composite Fast FIR Filtering Algorithms," in IEEE VLSI Signal Processing, Osaka, Japan, Oct. 1995, pp. 267–276.

Jean-Philippe Diguet received the Engineer degree in Electronics from ESEO, France, in 1992 and the PhD degree in Signal Processing from the University of Rennes in 1996. He worked toward his thesis at the LASTI laboratory, Lannion, France, then joined IMEC in 1997 for a postdoctoral year. He is now an assistant professor of electrical engineering at UBS University, Lorient, France, and a member of the LESTER laboratory. His current research interests are in architecture/application matching and high level estimations for system design methodologies applied to the signal-processing and multimedia domains.

Daniel Chillet received the engineering degree in Electronics and Computer Engineering from the Graduate School in Applied Science and Technology (ENSSAT) of the University of Rennes. After his DEA graduation, he received, in January 1997, the Ph.D. degree in Signal Processing from the University of Rennes. The topic of his Ph.D. concerned the definition of high level synthesis for the memory unit in the context of signal processing. In September 1997, he was appointed as Assistant Professor at the University of Rennes (ENSSAT). His current research interests concern methodology definition for ASIC design in the context of signal processing. He is involved in the development of the GAUT tool, a high level synthesis tool developed at LASTI since 1986.

Olivier Sentieys graduated from ENSSAT as a senior executive engineer in Electronics and Signal Processing Engineering in 1990, and received the DEA degree in 1990 and the Ph.D. degree in Signal Processing and Telecommunications in 1993 from the University of Rennes. He is currently an Associate Professor of Electrical Engineering at ENSSAT-University of Rennes and is the head of the Signal-Architecture Group of the LASTI Laboratory. His research interests include high level synthesis of VLSI signal processing systems, finite arithmetic effects, and low power methodologies for ASIC and DSP. He is involved in research projects including the development of the design framework BSS, the definition of behavioural IP for telecom, and joint research programs in codesign and biomedical engineering. He is a member of IEEE and [email protected]