
High-level synthesis of DSP datapaths by global optimisation of variable lifetimes

A.A. Duncan D.C. Hendry

Indexing terms: Architectures, Datapath space, Datapath synthesis, Interconnections, VLSI

Abstract: COBRA is a behavioural high-level synthesis tool for datapath-dominated applications. It globally optimises the synthesised datapath by performing the scheduling and allocation tasks simultaneously. COBRA uses a bit-sliced target architecture and layout style which, when compared with conventional approaches, has been previously shown to significantly reduce the area of final datapaths. The synthesis problem is formulated as an optimisation problem on the configuration of variable lifetimes when mapped into a three-dimensional 'datapath space'. The configuration of the data in datapath space implies the structure required to achieve the data configuration and hence the datapath. Simulated annealing is used to optimise the solution. A description of the target architecture, mapping of the input description into datapath space, optimisation of the data configuration in datapath space and post processing operations is given. Results for a number of examples are presented.

1 Introduction

A high-level synthesis system transforms an algorithmic behavioural description into a register-transfer level (RT level) netlist of functional units (FUs), memory elements and an associated communication structure. This transformation is performed subject to constraints imposed by the designer. At this level of abstraction, the designer can either optimise a design for area by specifying the number of available FUs, or for speed by specifying the number of control steps (c-steps) available for execution of the algorithm. The goal of the synthesis process is to create an RT-level structure which will execute the required behaviour, subject to the imposed constraints, minimising the physical domain cost of the solution.

To provide connectivity between FUs and memory elements, some systems, e.g. S(P)LICER [1], HAL [2], ADPS [3], and Facet, use a random multiplexer-based interconnection topology. The number of interconnects is usually high and the allocated interconnect is underutilised, since interconnect minimisation is not performed until late in the design process. Alternatively, other systems use a bus-based target architecture, e.g. SPAID [4], PARBUS [5], STAR [6], and DAGAR [7]. Systems such as these use interconnect more efficiently, using buses to connect datapath elements together.

© IEE, 1995. Paper 1789E (C1), first received 31st May and in revised form 25th November 1994. The authors are with the Department of Engineering, University of Aberdeen, Fraser Noble Building, King's College, Aberdeen AB9 2UE, United Kingdom.

IEE Proc.-Comput. Digit. Tech., Vol. 142, No. 3, May 1995

The importance of wiring and layout to high-level synthesis cannot be overemphasised. In an experiment to show the cost-performance trade-off [8], it was shown that the wiring/cell ratio for the chosen example varied from 1.6 to 2.6 over the designs synthesised, illustrating the large area required for routing. It was concluded that high-level synthesis tools must take physical effects into consideration if they are to produce high-quality designs. McFarland reached similar conclusions in Reference 9, where it was shown that BUD could not produce a cost-performance trade-off curve when physical domain effects were taken into account. Wiring delays are also shown to be significant in Reference 10, where it is stated that interconnection delay can contribute an additional 20% to FU delay for a 1.6 micron process, and that this will increase dramatically when submicron processes are utilised.

There are a number of ways to take physical effects into account. First, physical effects can be estimated during high-level synthesis. This is the approach taken by BUD [11], where the input description is 'hierarchically clustered', which gives an estimate of the overall layout and wiring. However, it was concluded in Reference 9 that more sophisticated techniques were required. Nevertheless, this concept is extended in both the Siemens synthesis system [12] and in the APARTY architectural partitioner of SAW [13]. 3D scheduling [10] performs simultaneous scheduling and floorplanning to attempt to minimise wiring delays; however, it does not consider the cost or delay of registers, multiplexers or wiring space. Elf [14] dynamically allocates interconnect during the allocation of processors and registers. It then replaces the interconnects with optimised 'interconnection networks'. HYPER [15] partitions the synthesised datapath to optimise the floorplan, using locality of interconnect as one of the main criteria in the partitioning process. Pangrle et al. [16] perform placement and routing simultaneously with connectivity generation. This has the advantage that multiplexers and other switching devices are added in a way which helps minimise routing, rather than minimising them for their own sake, as is commonly done. Similar ideas are also employed in Jang and Pangrle's GB tool [17].

A second method involves improvement of the RT structure using information fed back to the high-level synthesis system from later in the design cycle. Chippie [18] is an expert system which controls synthesis in the Slicer/Splicer system, using a 'knobs' and 'gauges' approach to meet area, timing and power constraints. The IBA system uses Fasolt [19] to perform feedback-driven synthesis. Fasolt in turn uses LE, another IBA tool, to generate a macrocell-oriented floorplan. This is used to guide Fasolt in merging buses and rescheduling. The feedback process continues until no further area reduction can be made.

Another approach is to define a target architecture such that the interconnect and layout style are well defined. MacPitts [20] takes this approach and synthesises onto an essentially fixed floorplan. However, the fixed structure limits the types of RT-level configurations available. PARBUS [5] uses a system of partitioned buses, whereas DAGAR [7] and CASS [21] partition the datapath into rows and columns, respectively. CATHEDRAL-III [30] partitions its datapath into several application-specific units; however, this is done for performance reasons and not in an attempt to take physical effects into consideration. CASS, DAGAR and PARBUS all report area savings of around 50% when compared to conventional synthesis techniques. These systems give significant area savings, do not suffer from the lack of flexibility of systems like MacPitts, do not require feedback from layout tools, and do not need heuristic physical domain estimation techniques.

The synthesis problem involves several interdependent scheduling and allocation problems. Solving these problems individually will lead to suboptimal solutions. Recently, however, there have been moves to unify some of the tasks in high-level synthesis. SAM [22] integrates scheduling, allocation and mapping (binding) in one algorithm. Rim et al. [23] formulate the allocation and binding problem with interconnect minimisation using an integer linear programming formulation. Safir and Zavidovique [24] present a simulated annealing approach which performs simultaneous scheduling, operator allocation and module selection. The various tools from the University of Waterloo, e.g. References 25-27, have developed an integer programming model which can perform global optimisation on the whole datapath synthesis problem. Devadas and Newton [28] use simulated annealing to perform global datapath optimisation by treating the datapath synthesis problem as a two-dimensional placement problem. Similarly, SAVAGE [29] uses simulated annealing to synthesise pipelined datapaths.

This paper presents COBRA (column-oriented butted regular architecture), a high-level synthesis system which performs global optimisation of the synthesised datapath using simulated annealing. In contrast to the global optimisation methods mentioned above, COBRA partitions the synthesised architecture into several communicating processors. This is achieved by synthesising onto the CASS target architecture, which is designed to localise computation within processors. The combination of the target architecture and layout style localises the interconnect within processors. Conventional approaches to synthesis allocate hardware to create the desired behaviour. In COBRA, however, the specified behaviour implies a datapath. Synthesis is formulated as an optimisation problem of variable lifetimes. As shall be shown, the variable lifetimes imply a datapath.

As shall be discussed, the use of the CASS target architecture and layout style effectively gives free local interconnect and allows the physical area of synthesised datapaths to be more easily, and accurately, estimated during optimisation. The target architecture and layout style should ultimately lead to more compact, and hence less expensive, physical layout. In addition, it will be shown that the global optimisation method of COBRA leads to better solutions than those synthesised using a traditional divide-and-conquer approach to synthesis.


COBRA therefore addresses the physical effects of synthesis by using the CASS target architecture and performs synthesis by globally optimising the variable lifetimes within the system.

2 Target architecture and layout style

COBRA uses the CASS target architecture, first presented in Reference 21, which has been shown to use approximately 50% of the area required by conventional techniques. Fig. 1 shows the architecture. The datapath is partitioned into a number of communicating processors, each with a set of functional units and local memory (registers and ROM).

Fig. 1 Example of target architecture/layout style composed of three processors (each processor/column has a local register file and an in-the-cell local bus; additional buses connect to a multiport global register file; construction is by butted bit-sliced cell assembly)

Each processor is designed around a three-bus local bus, comprising two read buses and one write bus. The local bus can support one write and two reads per cycle. The functional units may be a complex multifunctional ALU, a set of single purpose units, or a mixture of both. (The tool described herein allocates a set of single purpose units to each processor.) Read-write memory in the column is provided by the 'local' and 'global' register files. The local register file always has two read ports and one write port and connects to the local bus. The global register file has sufficient ports to service all allocated local and global bus accesses. The global register file is provided such that a global bus can access a processor's storage at times when no local bus capacity is available. It is proposed that local datapath control is distributed to each processor with each local controller under the orchestration of a global VLIW (very long instruction word) controller. This is similar to the method suggested in Reference 30. Control issues are not discussed in this paper.
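The local-bus discipline just described (at most two reads and one write per cycle on the three-bus local bus) can be sketched as a simple feasibility check. This is an illustrative model only; the class and field names are ours, not COBRA's:

```python
from dataclasses import dataclass

@dataclass
class LocalBusCycle:
    """Accesses requested on a processor's local bus in one c-step."""
    reads: int = 0   # uses of the two read buses
    writes: int = 0  # uses of the single write bus

    def feasible(self) -> bool:
        # The three-bus local bus supports two reads and one write per cycle.
        return self.reads <= 2 and self.writes <= 1

# Example: an FU reading two operands and writing one result fits;
# a second simultaneous writer would not.
ok = LocalBusCycle(reads=2, writes=1)
bad = LocalBusCycle(reads=2, writes=2)
print(ok.feasible(), bad.feasible())
```

Accesses that do not fit on the local bus are exactly those the global register file's extra ports are provided to absorb.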

Communication between processors is provided by a system of global buses and column multiplexers. Sufficient global buses, and global register file ports, are allocated so that all interprocessor communications can be serviced. Unlike PARBUS and DAGAR, this architecture will never require extra control-steps (c-steps) to be added to enable communications between datapath partitions to take place.


The processors are implemented as separate datapaths and are constructed by butted bit-sliced cell assembly. Cells are butted horizontally to achieve the required datapath width and stacked vertically to achieve connectivity between datapath elements. The processors can therefore be considered as columns of butted cells when viewed in the physical domain. As such, local bus routing is essentially free. In each bit-slice, the local bus can either be routed in-the-cell, utilising the otherwise wasted area between n- and p-type transistors [31], or over the cell if sufficient wiring layers are available.

Global bus routing may also be provided by butt connections; however, if the column lengths are markedly different, then a more area-efficient solution may involve conventional macrocell place-and-route techniques, viewing each processor as a macrocell. Using this technique, global wiring is required between columns. However, as shown subsequently, COBRA minimises global buses, which will therefore minimise this wiring area.

The architecture introduces hierarchy in the overall datapath. This is in contrast to most systems, which essentially synthesise onto a flat architecture. Parallelism in COBRA datapaths is coarser grained than in flat hierarchies. The combination of the synthesis tool and the architecture is designed to localise computation in processors and thus localise interconnect to columns. Moreover, the combination of the architecture and layout style almost completely eliminates global interconnect in the final datapath, thus significantly reducing area. There is a very strong relationship between the structural and physical domains, since the structure implies a floorplan. Therefore synthesising onto this architecture should enable good cost estimates of the physical datapath area to be made.

3 Datapath synthesis by variable lifetime optimisation

3.1 Internal representation - datapath space

The target architecture is designed to provide a strong relationship between the structural and physical domains. The internal representation is designed to provide a strong relationship between the behavioural and structural domains, and therefore a relationship between the behavioural and physical domains via the structural domain.

The internal representation used in COBRA is a mapping of the given behavioural description into 'datapath space'. Fig. 2 shows datapath space (dp-space). The three dimensions of dp-space are time, variable and column. Ordinates in the variable dimension represent variables (both explicit and implicit) in the behavioural specification. The column dimension directly relates to the columns in the target architecture. Datapath space is useful since, as shall be shown, the configuration of variables in dp-space implies hardware and hence a datapath. The task of datapath synthesis can therefore be formulated as an optimisation problem in dp-space.

Fig. 2 Three-dimensional 'datapath space'

Fig. 3 shows a configuration of variables a, b and c on the variable-time plane for some column C. The Figure depicts the mapping of the operation c = a + b for a one-cycle adder and shows four important entities in dp-space: data lines, a tie line, destination points and a root point.

Fig. 3 Mapping on the variable-time plane for c = a + b

• A data line exists when a particular variable is alive. A data line in the time dimension implies storage during the c-steps in which it is alive. Similarly, a data line in the column dimension implies a global bus use.

• A tie line 'ties' data together. In Fig. 3 the tie line connects data lines a, b and c together. At the end of c-step 0, c is created by adding a and b. This implies that an adder is used in c-step 0. The new data c can now be used elsewhere.

• The root point is the data source for all uses of this variable until it is reassigned. All uses of the variable must stem from its root point.

• Destination points are data sinks. A tie line connects a root point and one or more destination points, depending on the operator associated with the tie line.

A data line is a composite entity: it is composed of an overlay of one or more two-pin nets, as shown in Fig. 4. A two-pin net connects tie lines in dp-space. Every use of data requires a unique two-pin net which defines the route of the data from the source to the data sink. Therefore, for each data use, a two-pin net defines the lifetime of the particular data between its birth at its root point and its death at its destination point. Fig. 4 shows that the two-pin net is composed of originator lifetime, destination lifetime and logical global bus-use segments:

Fig. 4 Two-pin net

• The originator lifetime segment is the time contributed by a net to the variable data line in the source column.

• The logical global bus-use segment schedules a logical global bus transfer between the source and destination columns. Logical global buses provide connectivity between a column and other columns in the datapath. Logical global buses are an upper bound on the number of physical global buses which will be required in the final datapath.

• The destination lifetime segment is the time contributed by a net to the variable data line in the destination column. Part of the destination lifetime is 'solid'. The solid part cannot be shortened by moving the logical global bus transfer back in time. As such, it represents the delay of the operator associated with the destination tie line.

The two-pin nets define root and destination points and hence tie lines. The overlay of two-pin nets defines data lines. Fig. 5 shows a data line composed of an overlay of three nets. As can be seen, a data line is essentially a rectilinear Steiner tree (RST) rooted at the root point with leaves at the destination points. The RST will be planar in the column-time plane (since a unique RST exists for each variable) and defines the c-steps in which the variable is alive for each column. Storage and bus use are not directly associated with nets, since this could imply multiple resource use for storing/transferring the same data.

Fig. 5 Data line composed of three two-pin nets
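To make the overlay concrete, each two-pin net can be treated as the set of (column, c-step) cells it occupies; the data line is then the union of those sets, so segments shared by several nets are counted only once, as the text requires. A sketch under our own encoding (the helper and its parameters are illustrative, not COBRA's internal representation):

```python
def net_cells(src_col, dest_col, birth, transfer, death):
    """(column, c-step) cells occupied by one two-pin net: originator
    lifetime in the source column up to the transfer, then destination
    lifetime in the destination column."""
    cells = {(src_col, t) for t in range(birth, transfer + 1)}
    cells |= {(dest_col, t) for t in range(transfer, death + 1)}
    return cells

# Three nets rooted at the same point, in the spirit of Fig. 5: overlaying
# them (set union) gives the data line; duplicated segments cost nothing extra.
nets = [
    net_cells(0, 0, 0, 0, 2),   # stays in column 0
    net_cells(0, 1, 0, 1, 3),   # transferred to column 1 at c-step 1
    net_cells(0, 1, 0, 2, 2),   # transferred to column 1 at c-step 2
]
data_line = set().union(*nets)
# Storage implied per column = the c-steps the variable is alive there.
alive_in_col0 = sorted(t for c, t in data_line if c == 0)
print(alive_in_col0)
```

The union, not the sum, of net segments determines storage and bus cost, which is exactly why the text keeps resources associated with data lines rather than with individual nets.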

Fig. 6 shows the solid part of the destination lifetime in more detail. As mentioned, the solid part represents the delay of the operator associated with the target tie line. The solid length is composed of a delay, a read time and a stage delay. Delay represents the overall operator delay, whereas stage delay refers to the delay of each stage of a pipelined operator. The read time is the number of cycles for which the register file needs to be read for this operator. To illustrate this, Fig. 6a represents a three-cycle pipelined operator with a stage delay of one c-step. The pipelined operator has internal registers and therefore has a read time of one c-step. Fig. 6b shows a three-cycle nonpipelined operator without internal registers. Since the operator is nonpipelined, the stage delay is the same as the operator delay (effectively a one-stage pipe). Similarly, since the operator has no internal storage, the register file is read for the duration of the operation. Using this model, COBRA can deal with multicycle and pipelined functional units; chaining is not supported at the moment.

Fig. 6 Detail of solid length of destination lifetime for pipelined and nonpipelined operators
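Under this timing model, an operator's exclusive local-bus windows follow directly from its delay, read time and stage delay: the read buses are claimed during the read time, and the write bus during the last stage delay of the operator delay. A sketch reproducing the two cases of Fig. 6 (class and method names are ours):

```python
from dataclasses import dataclass

@dataclass
class Operator:
    delay: int        # overall operator delay in c-steps
    stage_delay: int  # delay of each pipeline stage (== delay if nonpipelined)
    read_time: int    # c-steps for which the register file is read

    def read_bus_csteps(self, start):
        # Read buses are needed for the read time, starting at `start`.
        return set(range(start, start + self.read_time))

    def write_bus_csteps(self, start):
        # Write bus is needed during the last stage delay of the operation.
        finish = start + self.delay
        return set(range(finish - self.stage_delay, finish))

# Fig. 6a: three-cycle pipelined operator, one-c-step stages, internal registers.
pipelined = Operator(delay=3, stage_delay=1, read_time=1)
# Fig. 6b: three-cycle nonpipelined operator, no internal storage.
nonpipelined = Operator(delay=3, stage_delay=3, read_time=3)

print(pipelined.read_bus_csteps(0), pipelined.write_bus_csteps(0))
print(nonpipelined.read_bus_csteps(0), nonpipelined.write_bus_csteps(0))
```

For the nonpipelined operator the two windows coincide over the whole operation, which is why such operators tie up the local bus far longer than their pipelined counterparts.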

It can therefore be seen that the configuration of two-pin nets in dp-space defines a set of data lines and tie lines. The data lines and tie lines in turn imply a set of FUs, registers and global buses. Thus, the data configuration in dp-space implies a datapath. The synthesis task can therefore be formulated as an optimisation problem on the structure of two-pin nets in dp-space, i.e. it is an optimisation problem on variable lifetimes.

Currently, datapath space is three-dimensional, since COBRA only deals with straight-line code segments, but it is anticipated that it will be extended to four dimensions to deal with looping and conditional branches.

The dp-space representation is behavioural insofar as it represents the variables required to give the specified behaviour and the data dependencies between the variables. The configuration in dp-space implies a structure which in turn implies a floorplan.

3.2 Mapping the input description into datapath space

The previous Section gave an overview of dp-space, gave some of the terminology used and showed that datapath synthesis could be modelled as an optimisation problem in dp-space. Fig. 7 gives a COBRA input description for a simple example. This example assumes a two c-step multiplier and a single c-step adder. An implicit time loop exists around the behavioural description. The datapath is specified to execute in three c-steps and the output is fed back to the input of the multiplier for the next iteration.

Fig. 7 Input description for the simple example:

    // multiply accumulate
    input a, b, c;
    output d;
    feedback(d, a);
    constrain(time, 3);
    d = a * b + c;

The input description is specified in the CASSette language, which was originally developed for use with the CASS system [21]. CASSette is used in preference to existing languages such as VHDL and SILAGE [32], since these are viewed as being overly complicated for the initial requirements of COBRA. CASSette is a small and simple language which allows data flow graphs to be specified, and as such is perfectly adequate as an input language for COBRA. For an overview of CASSette see Reference 33.

COBRA begins by compiling the input description into a data flow graph (DFG). ASAP and ALAP scheduling is performed on the DFG as a preprocessing operation before mapping into dp-space.
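For the multiply-accumulate example of this Section (a two c-step multiplier feeding a one c-step adder under a three c-step constraint), ASAP and ALAP preprocessing can be sketched as follows; the graph encoding is our own, not COBRA's:

```python
# DFG for d = a * b + c: node -> (delay in c-steps, predecessors).
dfg = {
    'mul': (2, []),       # a * b
    'add': (1, ['mul']),  # (a * b) + c
}

def asap(dfg):
    """Earliest start time of each node (inputs available at c-step 0).
    Assumes predecessors appear before successors in the dict."""
    times = {}
    for node in dfg:
        delay, preds = dfg[node]
        times[node] = max((times[p] + dfg[p][0] for p in preds), default=0)
    return times

def alap(dfg, time_constraint):
    """Latest start time of each node still meeting the time constraint."""
    times = {}
    for node in reversed(list(dfg)):
        delay, _ = dfg[node]
        succs = [s for s in dfg if node in dfg[s][1]]
        latest = min((times[s] for s in succs), default=time_constraint)
        times[node] = latest - delay
    return times

print(asap(dfg))     # multiplier first, adder after its result
print(alap(dfg, 3))  # three c-step constraint leaves no slack here
```

With a three c-step constraint the ASAP and ALAP times coincide, so every operation's schedule is fixed before optimisation even begins; a looser constraint would open up the mobility that simulated annealing later exploits.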

Mapping from DFG to dp-space is performed by examining the DFG. DFG nodes map to tie lines in dp-space. Tie lines are mapped randomly to columns in dp-space. Input and operator DFG nodes use their ASAP time as their initial time co-ordinate. Output nodes are all scheduled at the time constraint specified in the behavioural description. After tie-line instantiation, connectivity in dp-space is added by creating two-pin nets between root and destination points. These nets correspond to DFG edges. Logical global bus transfers are scheduled to occur whenever the data is created. This initially gives all two-pin nets null originator lifetimes, so no problems can arise with violation of destination solid lengths.

The translation of the DFG into dp-space also creates the implicit structural domain representation. The data structures used to store this representation can be examined to determine which variables are alive in each c-step, when functional units create data, and when global buses are required to transfer data. The hardware cost of the datapath can therefore be evaluated by examining these data structures.
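Evaluating the hardware cost from these data structures amounts to taking maxima over c-steps of the recorded resource demands: registers from simultaneous variable lifetimes, logical global buses from simultaneous transfers. A hedged sketch with an invented encoding (not COBRA's actual bookkeeping):

```python
from collections import Counter

def registers_needed(lifetimes):
    """Max number of variables simultaneously alive in one column.

    `lifetimes` maps variable -> set of c-steps in which it is alive.
    """
    demand = Counter(t for steps in lifetimes.values() for t in steps)
    return max(demand.values(), default=0)

def buses_needed(transfers):
    """Max number of intercolumn transfers scheduled in any one c-step.

    `transfers` is a list of (variable, c_step) logical bus uses.
    """
    demand = Counter(t for _, t in transfers)
    return max(demand.values(), default=0)

# A small illustrative column: x and y overlap in c-step 1, y and z in c-step 2,
# so two registers suffice.
lifetimes = {'x': {0, 1}, 'y': {1, 2}, 'z': {2}}
print(registers_needed(lifetimes))
```

FU allocation per column follows the same pattern: the peak number of concurrent operations of each type, taken over all c-steps.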

Fig. 8 shows the DFG and initial mapping for the example. The variable lifetimes for the two columns in this mapping are depicted in Fig. 9a. The bold line segments represent the time in which a variable is alive; it is evident from the Figure that two registers will be required in each column. Fig. 9b shows the FU use implied by the tie lines in Fig. 8. As can be seen, an adder is required in column 0 and a multiplier is required in column 1. The logical global bus use for the initial mapping is shown in Fig. 9c. There is only one logical global bus use, which corresponds to the transfer of a * b from column 1 to column 0. Finally, Fig. 10 shows the architecture implied by the initial dp-space configuration.

Fig. 8 DFG and initial dp-space representation for the example

Fig. 9 Variable lifetimes, FU use and logical global bus use for the example

Fig. 10 Initial architecture for the example (adder in column 0, multiplier in column 1)

The feedback from d to a can be seen in dp-space. To facilitate the feedback, points 1 and 2 on the Figure are 'tied' together, i.e. if point 1 is moved in the column dimension then point 2 will follow, and vice versa. This ensures the pins are 'data-equivalent', i.e. a continuous path exists for the data between iterations of the behavioural description. Fig. 11 shows the internal representation and datapath after optimisation.

Fig. 11 Dp-space representation and synthesised architecture after optimisation

A more realistic example of a configuration in dp-space is shown in Fig. 12. This is an initial solution for the 19-cycle wavefilter example using a pipelined multiplier, displayed in the main window of COBRA.

The mapping from DFG to dp-space gives the initial solution for simulated annealing to optimise. Dp-space can be used for formulation of both time-constrained and resource-constrained optimisation problems. At the moment only time-constrained problems are considered; however, all COBRA data structures have been designed to allow the solution to grow in both space and time.

4

Mapping the behavioural description into dp-space creates a datapath in the structural domain which in turn has a certain area cost when implemented in the physical domain. Using the CASS target architecture the cost of the structural datapath will be a good indication of the cost of the physical datapath.

Simulated annealing, proposed by Kirkpatrick et al. [34] has proved to be an effective solution to placement and routing problems in VLSI layout [3S]. Its basic feature is that it allows hill climbing moves which allow it to climb out of local minima while exploring its configu- ration space. Simulated annealing will always accept lower cost solutions but it will also accept higher cost states probabilistically under the control of a 'tem- perature' parameter, which is gradually reduced during the annealing process. This control parameter is analo- gous to temperature in the physical annealing process. Simulated annealing can be used for combinatorial opti- misation problems specified by a finite set of states and a cost function defined on that state space.

Exploration of the state space is performed by making random moves from the current state to neighbouring states. A number of moves are made at each temperature until it is decided that an equilibrium has been reached. The temperature is then reduced. This process is continued until it is finally decided that the system is frozen and no more optimisation is possible. The decisions on temperature reduction and algorithm termination are controlled by the annealing schedule, which is described below.

In COBRA, dp-space is the annealing state space; the initial dp-space mapping serves as the initial state. The cost function is based on the structural representation implied by the dp-space configuration. Neighbouring solutions are generated in dp-space by movement of tie lines and logical global bus transfers. The effectiveness of simulated annealing is very dependent on the cost function, move set and annealing schedule used [36]. An explanation of the cost function, move set and annealing schedule used in COBRA follows.
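As a concrete illustration, the annealing loop just described can be sketched as follows. This is a generic skeleton, not COBRA's implementation: cost and random_move stand in for the dp-space cost function and move set, and the geometric cooling and freeze test mirror the TempMult and LCFactor controls described later.

```python
import math
import random

def anneal(state, cost, random_move, t0, cool=0.8, moves_per_temp=400, freeze=4):
    """Generic simulated annealing: always accept downhill moves, accept
    uphill moves with probability exp(-delta/T), cool geometrically, and
    stop after `freeze` consecutive temperatures without improvement."""
    current = cost(state)
    best_state, best = state, current
    temp = t0
    stagnant = 0
    while stagnant < freeze:
        improved = False
        for _ in range(moves_per_temp):
            candidate = random_move(state)
            delta = cost(candidate) - current
            # Metropolis criterion: hill-climbing moves escape local minima
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                state, current = candidate, current + delta
                if current < best:
                    best_state, best, improved = state, current, True
        temp *= cool
        stagnant = 0 if improved else stagnant + 1
    return best_state, best
```

For example, `anneal(50, lambda s: s * s, lambda s: s + random.choice([-1, 1]), t0=100.0)` walks an integer state towards zero cost.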

4.1 Cost function

The cost of a solution is calculated from the structural representation, which should be a good estimate of the physical cost. From a synthesis viewpoint, the most important aspects of the available FUs are the operator delay, read time, stage delay and the cell area. The stage delay and read-time properties of an operator are important as they define the time for which an operator must have exclusive access to the local bus. An operator must have access to the read buses during its read time and access to the write bus during the last stage delay of the operator delay. It is illegal for two or more operators to require exclusive bus access at any time (otherwise bus contention will occur); however, the random moves of simulated annealing will produce datapath configurations in which this occurs. The cost function must take this into account by making these illegal solutions expensive such that a legal solution will be chosen instead.
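The bus-access rule above can be made concrete with a small check. The tuple layout and penalty value here are illustrative assumptions, not COBRA's actual data structures; the point is simply that any overlap of exclusive read or write windows within a column inflates the column's cost.

```python
def windows(ops):
    """Exclusive bus-access windows for operators given as
    (start, read_time, op_delay, stage_delay) in c-steps: the read buses
    are held during [start, start + read_time) and the write bus during
    the last stage_delay c-steps of the operator delay."""
    reads = [(s, s + r) for s, r, d, g in ops]
    writes = [(s + d - g, s + d) for s, r, d, g in ops]
    return reads, writes

def concurrency_penalty(ops, penalty=10.0):
    """Return a cost multiplier > 1 if any two operators in the column
    demand exclusive access to the same bus at the same time."""
    for wins in windows(ops):
        for i in range(len(wins)):
            for j in range(i + 1, len(wins)):
                a, b = wins[i], wins[j]
                if a[0] < b[1] and b[0] < a[1]:   # intervals overlap
                    return penalty
    return 1.0
```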


The cost function used is similar to that used by Devadas and Newton in their system [28]. The total cost is calculated as

cost = Σ(columns) columncost + Σ(buses) globalbuscost + Σ(transfers) transfercost

where columncost is calculated as follows:

columncost = functionalunitarea × concurrencypenalty + registerarea + muxarea

Functionalunitarea is simply the sum of the areas of all FU cells allocated to the column. The concurrency penalty is introduced to solve the problem of local bus contention.

The annealing process does not actually assign variables to registers. This is done by the left-edge algorithm [38]; however, the annealing process does know how many registers will be required in each column for the mapping to be performed. This is simply the maximum lifetime (data line) density in each column. The register area required can therefore be calculated.
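The maximum lifetime density can be computed with a simple sweep. This sketch assumes lifetimes encoded as half-open (birth, death) intervals in c-steps, which is one plausible encoding of COBRA's data lines rather than its actual one.

```python
def registers_needed(lifetimes):
    """Peak number of simultaneously live values in a column: the
    maximum density of variable lifetimes, and hence the number of
    registers the column requires."""
    events = []
    for birth, death in lifetimes:
        events.append((birth, 1))    # value becomes live
        events.append((death, -1))   # value dies
    # sort deaths before births at the same c-step so a register freed
    # in a step counts as reusable in that step
    events.sort(key=lambda e: (e[0], e[1]))
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak
```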

Every column which accesses a global bus requires a column multiplexer. Therefore, if necessary, muxarea is added to take account of the area required by the column multiplexer. At the moment muxarea is a constant cost, but it is intended to size this quantity relative to the number of multiplexer inputs, thus making it a more accurate estimate of the required multiplexer area.

The global bus term in the cost function takes into account the number of logical global buses allocated. This influences potential global wiring (if required), the complexity of the column multiplexers and also the number of ports on the global register files.

The transfercost term is introduced in an attempt to minimise the use of the allocated logical global buses. This does not directly affect the area of the solution during annealing; however, it may allow more efficient use of buses when logical buses are mapped to physical global buses during postprocessing.

4.2 Move set

COBRA optimises the dp-space configuration by making one of three moves:

• Move tie line in the time dimension: this is analogous to scheduling operations and affects register use and potentially logical global bus use.

• Move tie line in the column dimension: this affects the binding of operators to columns and thus the allocation of FUs to columns, and also affects register use and logical global bus use.

• Move logical global bus transfer in the time dimension: this effectively schedules intercolumn communications and affects global bus use and register use.

In addition to maintaining the structural representation implied by the configuration in dp-space, COBRA also keeps track of operator concurrency across all columns. This is used to guide the move selection process.

Performing a move begins by choosing either to move a tie line or a logical global bus transfer. The selection procedure is slightly biased towards moving tie lines since these have more significance than logical bus moves. If a tie line is chosen, the dimension to move in is then selected; otherwise the logical bus transfer is moved in the time dimension.


Dimension selection is influenced by the operator concurrency. If the datapath uses more FUs than the operator concurrency would dictate, the selection procedure is biased to produce a move in the column dimension, e.g. if there is a maximum of two concurrent adds, but the datapath has three adders, a move in the column dimension will probably be chosen. This is performed in an attempt to reduce redundant FUs by changing the binding of operators to columns (the algorithm is only biased to move these in the column dimension; it is not a hard rule since redundant FUs may reduce overall datapath cost). In other cases, the dimension selected is biased towards the time dimension, which essentially provides an operator scheduling move.
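The bias can be sketched as a weighted random choice. The 0.75 weighting below is an invented figure for illustration only; the paper describes the bias qualitatively, not numerically.

```python
import random

def choose_dimension(num_fus, max_concurrency, bias=0.75):
    """Pick the dimension for a tie-line move. Excess FUs relative to
    peak operator concurrency tilt the choice towards a column move
    (rebinding operators to columns); otherwise a time move (operator
    scheduling) is favoured. Neither outcome is ruled out entirely."""
    favoured = 'column' if num_fus > max_concurrency else 'time'
    other = 'time' if favoured == 'column' else 'column'
    return favoured if random.random() < bias else other
```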

At this stage the entity type and dimension of the move have been established. The remaining selection steps are identical for every move. An instance of the selected entity is chosen and a direction (either positive or negative) selected without bias. Finally, the distance in dp-space to move is chosen. This favours moves of one unit but also allows moves of two or three units.

It has been found that the size of dp-space in the column dimension needs to be limited for the annealing procedure to find good solutions. Moves which go outside the bounds of dp-space in the time or column dimensions are immediately rejected. Similarly, moves which would violate data dependencies and destination solid lengths are rejected.

4.3 Annealing schedule

The initial annealing temperature is determined by making 1000 random moves. An average value for the change of the cost function is then determined. The temperature is calculated such that there is a high probability that a solution which results in an 'average increase' of the cost function will be accepted.
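A standard way to realise this, used here only as an illustrative sketch since the paper does not give the exact formula, is to solve the Metropolis acceptance probability exp(-avgΔ/T0) = p for T0, where avgΔ is the average cost increase over the sampled moves:

```python
import math

def initial_temperature(deltas, p_accept=0.95):
    """Starting temperature such that an 'average increase' of the cost
    function is accepted with probability p_accept. `deltas` are the
    cost changes observed over the exploratory random moves; only the
    increases contribute to the average. p_accept = 0.95 is an assumed
    value: the paper states only 'a high probability'."""
    increases = [d for d in deltas if d > 0]
    avg = sum(increases) / len(increases)
    # exp(-avg / T0) = p_accept  =>  T0 = avg / -ln(p_accept)
    return avg / -math.log(p_accept)
```

With an average increase of 10 cost units, this gives T0 ≈ 10 / 0.0513 ≈ 195.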

Four variables gamma, Nmult, LCFactor and TempMult are defined which control annealing in COBRA. For the results presented in this paper, gamma = 20, Nmult = 20, LCFactor = 4 and TempMult = 0.8. N = gamma × NumDFGNodes is also defined, where NumDFGNodes is the number of nodes in the data flow graph.

The temperature is lowered if there have been N lower cost moves at this temperature or there have been Nmult × N total moves at this temperature. The next temperature is calculated as TempMult × CurrentTemperature. Annealing is terminated when the last LCFactor temperatures have not resulted in a decrease in the cost function.

The values of gamma, Nmult, LCFactor and TempMult used here have been found empirically by observation of the operation of COBRA and typically produce good results. However, the user ultimately has control over the annealing schedule and may provide their own annealing control parameters if desired.
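Putting the four control parameters together, the schedule can be sketched as follows; try_move is a placeholder that attempts one annealing move at the given temperature and reports whether it lowered the cost.

```python
def annealing_schedule(num_dfg_nodes, try_move, temperature,
                       gamma=20, nmult=20, lcfactor=4, tempmult=0.8):
    """Temperature control as described in the text: stay at a
    temperature until N = gamma * NumDFGNodes lower-cost moves have been
    made, or Nmult * N moves in total; then cool by TempMult. Terminate
    after LCFactor consecutive temperatures with no cost decrease."""
    n = gamma * num_dfg_nodes
    stagnant = 0
    while stagnant < lcfactor:
        lower = total = 0
        while lower < n and total < nmult * n:
            total += 1
            lower += bool(try_move(temperature))
        stagnant = 0 if lower else stagnant + 1
        temperature *= tempmult
    return temperature
```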


5 Postprocessing operations

Once annealing is finished there are a number of postprocessing operations which need to be performed. The annealing process minimises the number of logical global bus communications and also determines the number of registers required, but does not assign the lifetimes to registers. In addition, the number of ports on the column multiplexers and global registers needs to be determined.


Constants are not stored in registers and are not communicated to other columns. Constant use therefore has to be extracted and assigned to ROM in each column. Finally, a VHDL netlist of the datapath is produced.

The first postprocessing step is to map logical global buses onto physical global buses. The number of logical global buses is an upper bound on the number of physical buses; however, the mapping process may be able to minimise this further by routing some logical bus communications to physical buses which have already been allocated. This task is accomplished using a combination of Prim's spanning tree algorithm [36] and Dijkstra's shortest path algorithm [37]. A spanning tree which maximises bus use between columns is first created and physical buses created. Remaining logical bus communications are then routed on the physical buses, if possible, using Dijkstra's algorithm. If no route can be found then a physical bus is instantiated for the logical bus.
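A sketch of the two-phase mapping follows. The graph encoding, unbounded bus capacity, and BFS standing in for Dijkstra on a unit-weight bus graph are all simplifying assumptions; the intent is only to show the maximum-spanning-tree-then-route structure.

```python
import heapq
from collections import defaultdict

def reachable(adj, src, dst):
    # BFS stand-in for Dijkstra: every bus is treated as a unit-weight hop
    frontier, seen = [src], {src}
    while frontier:
        node = frontier.pop()
        if node == dst:
            return True
        for nxt in adj[node] - seen:
            seen.add(nxt)
            frontier.append(nxt)
    return False

def map_logical_to_physical(columns, transfers):
    """Phase 1: Prim's algorithm on the column graph (edge weight = number
    of logical transfers between the two columns, negated for the
    min-heap) builds a spanning tree maximising bus use; tree edges that
    carry traffic become physical buses. Phase 2: remaining transfers are
    routed over existing buses when a path exists, otherwise a new
    physical bus is instantiated for them."""
    weight = defaultdict(int)
    for a, b in transfers:
        weight[frozenset((a, b))] += 1

    buses, seen = set(), {columns[0]}
    heap = [(-weight[frozenset((columns[0], c))], columns[0], c)
            for c in columns[1:]]
    heapq.heapify(heap)
    while heap and len(seen) < len(columns):
        w, a, b = heapq.heappop(heap)
        if b in seen:
            continue
        seen.add(b)
        if w < 0:            # only edges that carry traffic become buses
            buses.add(frozenset((a, b)))
        for c in columns:
            if c not in seen:
                heapq.heappush(heap, (-weight[frozenset((b, c))], b, c))

    adj = defaultdict(set)
    for edge in buses:
        a, b = tuple(edge)
        adj[a].add(b)
        adj[b].add(a)
    for a, b in transfers:
        if not reachable(adj, a, b):
            buses.add(frozenset((a, b)))
            adj[a].add(b)
            adj[b].add(a)
    return buses
```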

Local bus use is then examined to determine local bus busy time. All data transfers are assigned a port number. All local transfers will be assigned to the local bus; however, global bus transfers may need to be assigned to global ports if the local bus is in use. This therefore determines the number of ports on the global register file, and enables the variables which need to be assigned to the global register file to be determined and tagged.

Constants are then mapped to ROM. This is performed before register assignment, but it is not important when this operation takes place in the postprocessing phase. Register assignment is performed using the left-edge algorithm [38]. This simply maps variable lifetimes to registers; the number of registers required was determined by the annealing process. Registers which store variables tagged as global are assigned to the global register file.
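The left-edge pass itself is compact. This sketch again assumes half-open (birth, death) lifetimes and returns which lifetimes share each register.

```python
def left_edge(lifetimes):
    """Left-edge register assignment: scan lifetimes in order of birth
    time and pack each into the first register whose previous occupant
    has already died, opening a new register only when none is free."""
    registers = []
    for birth, death in sorted(lifetimes):
        for reg in registers:
            if reg[-1][1] <= birth:      # last value in this register is dead
                reg.append((birth, death))
                break
        else:
            registers.append([(birth, death)])
    return registers
```

Because the left-edge algorithm is optimal for interval graphs, the number of registers it opens equals the maximum lifetime density, which is why the annealing phase can predict the register count without performing the assignment.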

The number of multiplexer ports and their connectivity is then determined by examining the port allocations made to global bus transfers. As a final operation, a VHDL netlist of the synthesised structure is created to describe the datapath.

Although postprocessing determines the actual number of ports on the global register files and column multiplexers, the number of ports has been indirectly minimised during the annealing phase. Minimisation of the number of global buses and global bus transactions will effectively minimise the number of local/global bus conflicts and hence the number of global ports on the register file. In turn this implicit minimisation, coupled with the minimisation of the number of global buses, will minimise the number of ports on the column multiplexer.

6 Results

COBRA has been implemented in the C programming language. It consists of approximately 30000 lines of code and runs on a variety of workstation platforms. An X-window interface has been incorporated into the tool which allows examination of dp-space, internal data structures, register and FU utilisation, and datapath visualisation. In addition, the X-interface also allows manual intervention in the annealing process. The results presented here were synthesised on a Sun SparcCenter 1000 running Solaris 2.3.

Results are presented for a number of examples in Table 1. Comparison with other systems is given where possible. The hal examples are for Paulin's original differential equation example taken from Reference 2. These


Table 1: Comparison of COBRA synthesis results with other systems

System Example Cycles * + - D Regs Global buses Columns

Nonpipelined multiplier

6 2 1 1 1 8 1 2 3 1 1 1 9 3 3

diffeq8 8 2 1 1 1 9 2 COBRA diffeq9 9 2 1 1 1 9 1 ellip18 18 2 3 - - 1 2 3

3 2 3

ellip19 19 2 3 - - 1 3 2 3 ellip21 21 1 2 - - 1 2 2 3 cosine20 20 3 3 2 - 1 4 3 3

hal6 6 2 1 1 1 10 2 3 diffeq6 6 2 1 1 1 10 3 3 diffeq8 8 2 1 1 1 11 2 3 diffeq9 9 2 1 1 1 10 2 3 ellip18 18 2 3 - - 16 5 4 ellip19 19 2 2 - - 17 4 4 ellip21 21 1 2 - - 16 2 3

ellip19 19 2 2 - - 1 2 - - ellip21 21 1 2 - - 1 2 - -

cosine20 20 4 2 2 - 23 8 6

2 2 - - 1 1 -

HAL [42] { ELF [14] ellip19 19

Pipelined multiplier

diffeq6p 6 2 1 1 1 4 1 2 ellip18p 18 1 2 - - 1 3 3 3 ellip19p 19 1 2 - - 1 1 3 3 cosine20p 20 2 2 2 - 12 3 3

PARBUS [5] ellip19p 19 1 2 - - 1 2 - -

COBRA { HAL [42] ellip19p 19 1 2 - - 1 2 -

ellip18p 18 1 2 - - 2 1 - -

SPAID [4] ellip19p 19 1 2 - - 19 - -

ELF [14] ellip19p 19 1 2 - - 1 1 -

SAM [22] ellip19p 19 1 2 - - 1 2 - -

PASS [43] ellip19p 19 1 2 - - 10 - -

(ESC) [44] ellip19p 19 1 2 - - 15 - -

examples use functional units which require one c-step to execute. The diffeq results are also for the differential equation example; however, unlike the hal examples, multipliers take two c-steps to execute. The ellip results are for the elliptical wave filter benchmark which is originally from Reference 39; additions are defined to take one c-step and multiplications two c-steps. Finally, the cosine results are for the fast discrete cosine transform example taken from Reference 29. Again, addition and subtraction require one c-step and multiplication requires two c-steps.

It is interesting that the COBRA solution for the 19-cycle nonpipelined example uses one more adder than the other systems. However, it should also be observed that COBRA uses five fewer registers and three fewer global buses than the CASS solution, which results in a lower global cost (for the same target architecture).

Given the widespread use of the 19-cycle wavefilter with pipelined multiplier as a benchmark, the synthesised architecture for this example is presented in Fig. 13.

COBRA run times are quite acceptable. For example, the wavefilter results presented in Table 1 typically require around eight minutes of CPU time on a Sun SparcCenter 1000.

7 Conclusions

This paper has presented COBRA, a high-level synthesis tool which performs global optimisation of the synthesised datapath on a target architecture designed to eliminate global interconnect. The optimisation is performed by viewing the behaviour to be synthesised as a configuration of data in datapath space. This configuration has been shown to imply a datapath.

Rather than allocating hardware to create a behaviour, the approach presented uses the given behaviour to

Fig. 13 Synthesised architecture for 19-cycle wavefilter with pipelined multiplier

imply the hardware. The internal representation and target architecture used are designed to closely link the behavioural, structural and physical domains in an attempt to make behavioural-level decisions more responsive to physical domain effects. The results presented illustrate the viability of the approach and show that the global optimisation technique gives superior results when compared to another system synthesising onto the same architecture (CASS). Moreover, the target architecture gives an area advantage over conventional synthesis techniques. From a purely structural viewpoint, the datapaths which COBRA synthesises are as good as those of other systems. However, the target architecture and layout style have been designed to significantly reduce interconnect area. COBRA datapaths should therefore be physically superior to those synthesised by conventional techniques.

COBRA has proven the dp-space concept and inspired new directions of research within the datapath space model. Future work is needed to extend the dp-space model to handle a block-structured language instead of just supporting straight-line code segments. This would make the system applicable to a wider range of DSP problems. It is envisaged that this would be made possible by extending the three-dimensional model to four dimensions. The fourth dimension of dp-space would represent 'branch level' and, as such, operations at different branch levels would be mutually exclusive. The other three dimensions of dp-space would remain unchanged and the cost of the solution would be determined by examining the fourth-dimension superposition of the conventional three-dimensional dp-space model. This superposition would imply all hardware (including storage) required to perform any of the mutually exclusive operations assigned to a processor. Since movement in the fourth dimension would not be allowed, it is predicted that the extension to a four-dimensional dp-space model would be relatively straightforward and that the annealing process would not have to be significantly changed.

The extension would require either CASSette to be extended to include conditional and loop constructs, or the use of another input language. At the moment it is felt that SILAGE should be adopted because of its applicability to DSP applications.

Architectural issues may also be examined in a future system. The arithmetic complexity of the examples which have been synthesised is such that the interprocessor communication requirements can be serviced using relatively few global buses. As the degree of parallelism increases, the number of interprocessor communications will increase, which will affect the global bus requirement. As the number of global buses increases, the average bus length will increase, as will the interconnect area. Presently, there are two levels of hierarchy in the target architecture. It is proposed to introduce further hierarchy into the global communication layer, which may allow more efficient use of allocated global interconnect and thus reduce global interconnect, particularly in cases of high parallelism.

The present system will sometimes get stuck in nonoptimal metastable states, thus producing inferior solutions. More work is therefore required in refining the move set, cost function and annealing schedule, as well as in analysing state accessibility. However, it is felt that this should be left until a more general model of dp-space has been developed which can handle conditional branches, loops and pipeline synthesis.

The work of Ly and Mowchenko [40] on the combination of simulated evolution and simulated annealing in high-level synthesis may indicate that allowing larger state-transition distances in COBRA could lead to faster convergence to a solution. In addition, some preliminary work on the applicability of genetic algorithms [41] and linear programming to dp-space optimisation may also be further investigated.

Work is required on interfacing to other CAD tools, both for producing layout and for solution verification through simulation.

8 References

1 PANGRLE, B.M., and GAJSKI, D.D.: 'State synthesis and connectivity binding for microarchitecture compilation'. Proceedings of the 1986 IEEE international conference on Computer aided design (ICCAD86), 1986, pp. 210-213

2 PAULIN, P.G., and KNIGHT, J.P.: ‘HAL: A multi-paradigm approach to automatic datapath synthesis’. Proceedings of the 23rd ACM/IEEE conference on Design automation, 1986, pp. 263-270

3 PAPACHRISTOU, C.A., and KONUK, H.: 'A linear program driven scheduling and allocation method followed by an interconnect optimization algorithm'. Proceedings of the 27th ACM/IEEE conference on Design automation, 1990, pp. 77-83

4 HAROUN, B.S., and ELMASRY, M.I.: 'Automatic synthesis of a multi-bus architecture for DSP'. Proceedings of the 1988 IEEE international conference on Computer aided design (ICCAD88), 1988, pp. 44-47

5 EWERING, C.: ‘Automatic high-level synthesis of partitioned busses’. Proceedings of IEEE international conference on Computer aided design (ICCAD90), 1990, pp. 304-307

6 TSAI, F.-S., and HSU, Y.-C.: 'Data path construction and refinement'. Proceedings of the 1990 IEEE international conference on Computer aided design (ICCAD90), 1990, pp. 308-311

7 RAJ, V.K., and PATWARDHAN, C.S.: 'Automated datapath synthesis to avoid global interconnects'. Proceedings of 4th CSI/IEEE international symposium on VLSI design, 1991, pp. 11-16

8 PARKER, A., GUPTA, P., and HUSSAIN, A.: 'The effects of physical design characteristics on the area-performance tradeoff curve'. Proceedings of the 28th ACM/IEEE conference on Design automation, June 1991, pp. 530-534

9 McFARLAND, M.C.: 'Reevaluating the design space for register-transfer hardware synthesis'. Proceedings of 1987 IEEE international conference on Computer aided design (ICCAD87), 1987, pp. 262-265

10 WENG, J.-P., and PARKER, A.C.: '3D scheduling: High-level synthesis with floorplanning'. Proceedings of the 28th ACM/IEEE conference on Design automation, 1991, pp. 668-673

11 McFARLAND, M.C.: 'Using bottom-up techniques in the synthesis of digital hardware from abstract behavioral descriptions'. Proceedings of the 23rd ACM/IEEE conference on Design automation, June 1986, pp. 474-479

12 SCHEICHENZUBER, J., GRASS, W., LAUTHER, U., and MÄRZ, S.: 'Global hardware synthesis from behavioral dataflow descriptions'. Proceedings of the 27th ACM/IEEE conference on Design automation, June 1990, pp. 456-461

13 THOMAS, D.E., LAGNESE, E.D., WALKER, R.A., NESTOR, J.A., RAJAN, J.V., and BLACKBURN, R.L.: 'Algorithmic and register-transfer level synthesis: the system architect's workbench', in 'VLSI, computer architecture, and digital signal processing' (Kluwer Academic Publishers, 1990)

14 LY, T.A., ELWOOD, W.L., and GIRCZYC, E.F.: ‘A generalized interconnect model for data path synthesis’. Proceedings of the 27th ACM/IEEE conference on Design automation, 1990, pp. 168-173

15 RABAEY, J.M., CHU, C., HOANG, P., and POTKONJAK, M.: 'Fast prototyping of datapath-intensive architectures', IEEE Design and Test of Computers, June 1991, pp. 40-51

16 PANGRLE, B., BREWER, F., LOBO, D., and SEAWRIGHT, A.: 'Relevant issues in high-level connectivity synthesis'. Proceedings of the 28th ACM/IEEE conference on Design automation, June 1991, pp. 607-610

17 JANG, H.-J., and PANGRLE, B.M.: 'A grid-based approach for connectivity binding with geometric costs'. Proceedings of the 1993 IEEE international conference on Computer aided design (ICCAD93), 1993, pp. 94-99

18 BREWER, F., and GAJSKI, D.: 'Chippie: A system for constraint driven behavioral synthesis', IEEE Trans., July 1990, CAD-9, (7), pp. 681-695


19 KNAPP, D.W.: 'Feedback-driven datapath optimization in Fasolt'. Proceedings of the 1990 IEEE international conference on Computer aided design (ICCAD90), June 1990, pp. 300-303

20 SOUTHARD, J.R.: ‘MacPitts: an approach to silicon compilation’, IEEE Computer, December 1983, pp. 74-82

21 DUNCAN, A.A., and HENDRY, D.C.: 'DSP datapath synthesis eliminating global interconnect'. Proceedings of European conference on Design automation (Euro-DAC93) with Euro-VHDL93, Hamburg, September 1993, pp. 46-51

22 CLOUTIER, R., and THOMAS, D.: 'The combination of scheduling, allocation and mapping in a single algorithm'. Proceedings of the 27th ACM/IEEE conference on Design automation, June 1990, pp. 71-76

23 RIM, M., JAIN, R., and DE LEONE, R.: 'Optimal allocation and binding in high level synthesis'. Proceedings of the 29th ACM/IEEE conference on Design automation, June 1992, pp. 120-123

24 SAFIR, A., and ZAVIDOVIQUE, B.: 'Towards a global solution to high level synthesis problems'. Proceedings of the first European conference on Design automation, March 1990, pp. 283-288

25 GEBOTYS, C.H.: 'Optimal scheduling and allocation of embedded VLSI chips'. Proceedings of the 29th ACM/IEEE conference on Design automation, June 1992, pp. 116-119

26 GEBOTYS, C.H., and ELMASRY, M.I.: 'Global optimization approach for architectural synthesis', IEEE Trans., September 1993, CAD-12, (9), pp. 1266-1278

27 GEBOTYS, C.H., and ELMASRY, M.I.: 'Optimal synthesis of high performance architecture', IEEE J. Solid State Circuits, 1992, SC-27, (3), pp. 389-397

28 DEVADAS, S., and NEWTON, A.R.: 'Algorithms for hardware allocation in datapath synthesis', IEEE Trans., July 1989, CAD-8, (7), pp. 768-781


29 NEIL, J.P., and DENYER, P.B.: 'Simulated annealing based synthesis of fast discrete cosine transform blocks', in TAYLOR, G., and RUSSELL, G. (Eds.): 'Algorithmic and knowledge based CAD for VLSI' (Peter Peregrinus, 1992), Chap. 4, pp. 75-93

30 NOTE, S., GEURTS, W., CATTHOOR, F., and DE MAN, H.: 'Cathedral-III: architecture-driven high-level synthesis for high throughput DSP applications'. Proceedings of the 28th ACM/IEEE conference on Design automation, June 1991, pp. 597-602

31 DRENTH, P.J., and STROLENBERG, C.: 'Datapath layout generation with in-the-cell routing and optimal column resequencing'. Proceedings of EURO ASIC '91, May 1991, pp. 373-376

32 VANHOOF, J., VAN ROMPAEY, K., BOLSENS, I., GOOSSENS, G., and DE MAN, H.: 'High-level synthesis for real time digital signal processing' (Kluwer Academic Publishers, 1993)

33 DUNCAN, A.A.: 'High-level synthesis for an area-efficient datapath architecture'. PhD thesis, University of Aberdeen, 1994

34 KIRKPATRICK, S., GELATT, C.D., and VECCHI, M.P.: 'Optimization by simulated annealing', Science, May 1983, 220, (4598), pp. 671-680

35 SECHEN, C., and SANGIOVANNI-VINCENTELLI, A.: 'The TimberWolf placement and routing package', IEEE J. Solid State Circuits, April 1985, SC-20, (2), pp. 510-522

36 AHO, A.V., HOPCROFT, J.E., and ULLMAN, J.D.: 'Data structures and algorithms' (Addison-Wesley, 1983)

37 O’ITEN, R.H.J.M., and VAN GINNEKEN, L.P.P.P.: ‘The anneal- ing algorithm’ (Kluwer, Academic Publishers, 1989)

38 KURDAHI, F.J., and PARKER, A.C.: 'REAL: a program for REgister ALlocation'. Proceedings of the 24th ACM/IEEE conference on Design automation, 1987, pp. 210-215

39 KUNG, S.Y., and WHITEHOUSE, H.J.: 'VLSI and modern signal processing' (Prentice-Hall, 1985)

40 LY, T.A., and MOWCHENKO, J.T.: ‘Applying simulated evolution to high-level synthesis’, IEEE Trans., March 1993, CAD-12, (3), pp. 389-409

41 DAVIS, L.: 'Handbook of genetic algorithms' (Van Nostrand Reinhold, 1991)

42 PAULIN, P.G., and KNIGHT, J.P.: 'Scheduling and binding algorithms for high-level synthesis'. Proceedings of the 26th ACM/IEEE conference on Design automation, 1989, pp. 1-6

43 EWERING, C., and GERHARDT, G.: 'PASS: High-level synthesis', Microprocess. Microprog., August 1990, 30, pp. 103-108

44 STOK, L.: 'Interconnect optimisation during datapath synthesis'. Proceedings of EDAC90, March 1990, pp. 141-145
