FIDER: A force-balance-based interconnect delay driven re-synthesis algorithm for data-path optimization after floorplan

TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 10/17 pp63-69 Volume 12, Number 1, February 2007

FIDER: A Force-Balance-Based Interconnect Delay Driven Re-Synthesis Algorithm for Data-Path Optimization After Floorplan*

WANG Yunfeng (王云峰), BIAN Jinian (边计年)**, HONG Xianlong (洪先龙), ZHOU Qiang (周强), WU Qiang (吴强)

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract: As the feature size of integrated circuits is reduced to the deep sub-micron level or the nanometer level, the interconnect delay is becoming more and more important in determining the total delay of a circuit. Re-synthesis after floorplan is expected to be very helpful for reducing the interconnect delay of a circuit. In this paper, a force-balance-based re-synthesis algorithm for interconnect delay optimization after floorplan is proposed. The algorithm optimizes the interconnect delay by changing the operation scheduling and the functional unit allocation and binding. With this method the number and positions of all functional units are not changed, but some operations are allocated or bound to different units. Preliminary experimental results show that the interconnect wire delays are reduced efficiently without destroying the floorplan performance.

Key words: high-level synthesis; floorplan; interconnect delay; re-synthesis; reschedule; reallocation

Introduction

As the feature size of integrated circuits (IC) is reduced to the deep sub-micron level or the nanometer level, the interconnect delay is becoming more and more important in determining the total delay of a circuit[1,2]. In the traditional IC design flow, high-level synthesis (HLS) does scheduling and allocation first, and then floorplan determines the actual positions of modules in the physical design. As little information concerning the interconnect delays can be gained in the high-level synthesis phase, the interconnect wire delay is only optimized in the layout phases. However, changing the topological structure of the circuit in the synthesis phase will have a significant effect on the interconnect delay.

Some researchers have already addressed the problem of incorporating physical design information in high-level synthesis. The 3-D scheduling algorithm[3] considers floorplan in high-level synthesis. The algorithm decides the shape and position of each functional unit by floorplanning concurrently as operations are scheduled and functional units are allocated. The grid-based (GB)[4] algorithm does not consider the scheduling problem. Instead, it combines binding with one-dimensional floorplan, and translates the problem into a two-dimensional grid placement problem to minimize the interconnect wire length. However, the use of one-dimensional floorplan is very limited. Choi and Levitan[5] improved the method by using more accurate estimation of interconnect wire area and delay, but the method still uses one-dimensional floorplan. Other approaches are possible. BINET[6] is a binding algorithm that performs incremental binding and floorplan on a previously scheduled result based on a network flow model. Shantanu and Miriam[7] have incorporated floorplan into the high-level synthesis formulation using a data-transfer model. Prabhakaran and Banerjee[8] provided a simulated annealing-based algorithm that combines scheduling, binding, and floorplan. For each step of the simulated annealing of scheduling and binding, a constructive timing-driven floorplan algorithm is performed to evaluate the floorplan area and the time delay.

* Received: 2005-11-14; revised: 2006-02-16. Supported by the National Natural Science Foundation of China (Nos. 90407005, 90207017, 60236020, and 60121120706)

** To whom correspondence should be addressed. E-mail: [email protected]; Tel: 86-10-62785564

In all the algorithms discussed above, high-level synthesis and floorplan are integrated. However, simply combining them will cause a dramatic increase in the size of the search space. In order to accelerate the search phase, the precision of floorplan is often sacrificed. To overcome this problem, a reallocation and rescheduling procedure based on a good floorplan result can be used. The procedure can be very useful for optimizing the interconnect delay of a circuit. However, in a general method, the floorplan will be destroyed during the re-synthesis phase.

In this paper, a force-balance-based re-synthesis algorithm is presented. With this method, the floorplan result is kept as an invariant. The algorithm can be used either in getting a new synthesis solution based on the floorplan result when these two phases are integrated together, or in optimizing the interconnect delay after the floorplan phase is finished.

1 Re-Synthesis After Floorplan

The inputs to the algorithm include the following: a schedule that assigns operations to control steps, a binding that assigns operations to functional units, and a floorplan that assigns functional units to positions. The output of the algorithm is a new schedule and a new binding with an unaltered floorplan.

For a synchronized circuit, the total delay of the circuit can be calculated by

$$T = t \cdot s \tag{1}$$

where $T$ is the total delay of the circuit, $t$ is the delay of each control step, and $s$ is the number of cycles.

As shown in Eq. (1), when the number of cycles is fixed, the total delay $T$ depends only on the length of $t$. The main objective of our approach is to reduce $t$ in Eq. (1) by reducing the interconnect delay of each control step. $t$ is calculated by

$$t = \max_{0 \le i \le \text{max\_steps}} t_i \tag{2}$$

where $t_i$ is the exact delay of step $i$. The exact delay of step $i$ is calculated by

$$t_i = \max_{0 \le j \le \text{max\_unit\_number}} t_{ij} \tag{3}$$

where $t_{ij}$ is the exact delay of the operation scheduled to begin at step $i$ and allocated to unit $j$.

Assume that operation op is scheduled to execute from step $i$ to step $i+k-1$; the exact delay of op is calculated by

$$t_{ij} = \frac{t_f + t_{in} + t_{out}}{k} \tag{4}$$

where $t_f$ is the delay of the functional unit, $t_{in}$ is the delay of the input wires, and $t_{out}$ is the delay of the output wires.
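As a rough illustration of how Eqs. (1) to (4) fit together, the following Python sketch computes the control-step length and the total delay for two operations. The class `Op`, the function names, and the wire delays are hypothetical and chosen only for this example; the functional-unit delays follow the 0.25 µm values of Table 1.

```python
from dataclasses import dataclass

@dataclass
class Op:
    k: int        # number of control steps the operation spans
    t_f: float    # functional-unit delay (ns)
    t_in: float   # worst input-wire delay (ns), assumed value
    t_out: float  # worst output-wire delay (ns), assumed value

def op_delay(op: Op) -> float:
    """Eq. (4): delay per control step of one operation."""
    return (op.t_f + op.t_in + op.t_out) / op.k

def step_length(ops: list[Op]) -> float:
    """Eqs. (2)-(3): the nested max over steps and units is a max over all ops."""
    return max(op_delay(op) for op in ops)

def total_delay(ops: list[Op], steps: int) -> float:
    """Eq. (1): T = t * s."""
    return step_length(ops) * steps

adder = Op(k=1, t_f=2.0, t_in=0.3, t_out=0.2)       # 2.5 ns in one step
multiplier = Op(k=2, t_f=4.0, t_in=0.4, t_out=0.4)  # 4.8 ns over two steps
t = step_length([adder, multiplier])
T = total_delay([adder, multiplier], steps=3)
```

In this example the adder, not the multiplier, determines the step length, because the multiplier's delay is spread over two control steps.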

$t_{in}$ and $t_{out}$ in Eq. (4) can be calculated by

$$t_{in/out} = \max_{0 \le i \le \text{max\_input/output\_wire\_number}} t^{(in/out)}_i \tag{5}$$

where $t^{(in/out)}_i$ is the delay of the $i$-th input/output wire of the operation.

In our approach, since no routing is performed, the length of the interconnect wire between two functional units is simply estimated by the half perimeter of the minimum rectangle that contains the two functional units. The Elmore delay model is used to calculate the delay of the interconnect wires, as shown by

$$t = \frac{1}{2} R C, \quad C = c\,l, \quad R = r\,l \tag{6}$$

where $r$ is the resistance of wire per unit length, $c$ is the capacitance of wire per unit length, and $l$ is the length of wire.
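The wire model can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each unit is reduced to a single point, so the half perimeter becomes a Manhattan distance, and the function names are hypothetical. The default values of `r` and `c` are the ones reported in the experimental section (0.075 Ω/µm and 0.118 fF/µm).

```python
def half_perimeter(p, q):
    """Half perimeter (um) of the bounding box of two points (x, y).

    Assumption: units are treated as points, not as rectangles.
    """
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def elmore_delay_ns(length_um, r=0.075, c=0.118e-15):
    """Eq. (6): t = R*C/2 with R = r*l (ohm) and C = c*l (farad), returned in ns."""
    resistance = r * length_um
    capacitance = c * length_um
    return 0.5 * resistance * capacitance * 1e9

length = half_perimeter((0, 0), (600, 400))  # 1000 um
delay = elmore_delay_ns(length)              # about 0.0044 ns
```

Because both R and C grow linearly with the length, the delay grows quadratically: doubling the wire length quadruples the estimated delay.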

In the high-level synthesis phase before floorplan, no information about the interconnect delay can be considered. The scheduling and allocation task can only proceed based on a delay-less interconnect model. Re-synthesis after floorplan can achieve an optimization based on accurate estimation of the interconnect delay. To avoid the usual problem of destroying the floorplan performance in the re-synthesis phase, a force-balance-based interconnect delay driven re-synthesis algorithm (FIDER) is used as described in the following.

1.1 Basic idea of FIDER

In this paper, the behavior of the circuit is represented by a control-data flow graph (CDFG)[9]. A two-dimensional grid, first introduced by Jang and Barry[4], is used to represent the scheduling and allocation result, as shown in Fig. 1. The rows of the grid represent the scheduling control steps, while the columns represent the functional units used in the circuit. The result of scheduling and allocation can thus be considered as a placement on the two-dimensional grid.

Fig. 1 Two-dimensional grid used to represent the solution of scheduling and allocation

In Fig. 1a, Operation B is placed in Row 2 and Column 2, which means that Operation B is scheduled to Step 2 and allocated to Unit 2. If Operation B is moved into Row 2 and Column 1, as shown in Fig. 1b, then Operation B is reallocated to Unit 1.
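One simple way to hold such a grid in code is a dictionary keyed by cell. The sketch below is only an illustration with hypothetical names; following the example above, where moving Operation B from Column 2 to Column 1 reallocates it to Unit 1, it treats the row as the control step and the column as the functional unit.

```python
# grid[(step, unit)] -> operation name; empty cells are simply absent.
grid = {(1, 1): "A", (2, 2): "B", (2, 3): "C"}

def reallocate(grid, op, new_unit):
    """Return a new grid with op moved to another unit in the same step."""
    step, unit = next(cell for cell, name in grid.items() if name == op)
    if (step, new_unit) in grid:
        raise ValueError("target cell is occupied")
    moved = {cell: name for cell, name in grid.items() if name != op}
    moved[(step, new_unit)] = op
    return moved

after = reallocate(grid, "B", 1)  # B stays in Step 2 but now uses Unit 1
```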

As described before, the main objective of our approach is to reduce the maximum delay of each control step. The main idea of the re-synthesis procedure is presented in Fig. 2.

Fig. 2 Reallocation and rescheduling based on the result of the floorplan

The initial result of the synthesis procedure is shown in Fig. 2a, and the floorplan generated from this result is shown in Fig. 2f. From Fig. 2f, it can be seen that Operation A is allocated to functional unit f2, which is too far from reg4 to satisfy the delay constraint (assume that f3 is the best position for Operation A). The re-synthesis then proceeds as follows.

(1) The operation allocated to f3 (Operation C in this case) is taken out of the grid. Then, Operation A is reallocated to f3, as shown in Fig. 2b.

(2) Operation C is allocated to another functional unit. If it is possible to allow Operation C to be executed in more control steps, then it is rescheduled, as shown in Fig. 2c.

(3) The grid is perturbed by rescheduling some operations under some constraints, as shown in Fig. 2d.

(4) These procedures are repeated to find a better solution.

In this way, we can find an improved solution without changing the floorplan result.

1.2 Reallocation after floorplan

For each op in the grid, the local path set $S$ of this operation can be defined as

$$S = \langle f, R \rangle \tag{7}$$

In Eq. (7), $f$ is the functional unit where op is allocated, and $R$ is the set of registers which have a direct data flow with op, as shown in Fig. 3.

A tuple <f, r> denotes a wire between functional unit f and register r. It is obvious that the best functional unit for an operation is the one in which interconnect wire length is minimized. However, minimizing the interconnect wire length of one operation may cause a wire length increase for other interconnects. The main objective of reallocation is, therefore, to reduce the length of interconnect wires of all operations impartially.

Fig. 3 Local path set

For each op in the grid, assume that the operation is allocated to functional unit f. A virtual force that acts on the op is calculated by

$$\mathbf{F}_{op} = \sum_i \mathbf{F}_{op,\,r_i} \tag{8}$$

where $\mathbf{F}_{op,\,r_i}$ is the force caused by the $i$-th register in the local path set of op,

$$\mathbf{F}_{op,\,r_i} = \mathbf{P}_{r_i} - \mathbf{P}_{op} \tag{9}$$

where $\mathbf{P}_{op}$ denotes the position of functional unit f, and $\mathbf{P}_{r_i}$ is the position of the $i$-th register in the local path set of op.

In this case, the best position for op on a chip is where the virtual force acting on op is minimized. Additionally, the best situation for the circuit is that all its operations are located at their best or nearly best positions. In the grid representation, the best column for op in the grid means the best functional unit for op on the chip.
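The force calculation can be illustrated with a short Python sketch. It assumes that "minimized" refers to the magnitude of the summed force vector and that each force points from the unit toward the register; the function names and coordinates are hypothetical.

```python
import math

def virtual_force(unit_pos, register_positions):
    """Eqs. (8)-(9): sum of the vectors from the unit to each register."""
    fx = sum(rx - unit_pos[0] for rx, _ in register_positions)
    fy = sum(ry - unit_pos[1] for _, ry in register_positions)
    return fx, fy

def best_unit(unit_positions, register_positions):
    """Pick the unit (grid column) with the smallest force magnitude."""
    def magnitude(name):
        return math.hypot(*virtual_force(unit_positions[name], register_positions))
    return min(unit_positions, key=magnitude)

units = {"f1": (0, 0), "f2": (10, 0), "f3": (4, 3)}  # unit positions from the floorplan
registers = [(3, 3), (5, 3), (4, 4)]                # registers in the local path set
force_f3 = virtual_force(units["f3"], registers)    # (0, 1)
choice = best_unit(units, registers)                # "f3"
```

The force vanishes at the centroid of the registers, so the preferred unit is the one closest to that balance point among the units available in the floorplan.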

The reallocation algorithm is described as follows.

Algorithm 1: Reallocation once

With the grid of the original result of synthesis:
For each control step in the grid
    For each kind of operation beginning in this step
        Select randomly an operation op of this kind that begins execution in this step;
        Set the iteration count = 0;
        Calculate the best column for op in this step;
        Take op out of the grid;
        While (count < max_count and the best column of op is not empty in this step)
            if (the best column in this step is locked by an operation op2)
                Calculate the next best column for op in this step;
            else (the best column is occupied by an unlocked operation op2)
                Take op2 out of the grid;
                Put op into the best column in this step;
                Set op to be op2;
                Calculate the best column of op in this step;
            count++;
        if (the best column of op in this step is not empty)
            Calculate the best empty column for op in this step;
        Put op into the best column in this step;
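The displacement chain of Algorithm 1 can be sketched in Python for a single control step. This is an illustrative reading, not the authors' implementation: the function name, the `cost` callback (for example the force magnitude of an operation on a unit), and the rule that an operation becomes locked once it has been placed are all assumptions.

```python
def reallocate_step(units, row, op, cost, max_count=10):
    """Re-place op within one control step.

    units: all units of the matching kind; row: dict unit -> operation
    (free units are absent); cost(op, unit): lower is better.
    """
    row = {u: o for u, o in row.items() if o != op}  # take op out of the grid
    locked = set()                                   # operations placed in this pass
    for _ in range(max_count):
        # Best column for op, skipping columns held by locked operations.
        candidates = [u for u in units if row.get(u) not in locked]
        best = min(candidates, key=lambda u: cost(op, u))
        if best not in row:                          # best column is empty: done
            row[best] = op
            return row
        locked.add(op)
        row[best], op = op, row[best]                # displace the occupant, continue with it
    # Iteration limit reached: use the best empty column for the remaining op.
    free = [u for u in units if u not in row]
    row[min(free, key=lambda u: cost(op, u))] = op
    return row

costs = {("A", "f1"): 5, ("A", "f2"): 4, ("A", "f3"): 1,
         ("C", "f1"): 3, ("C", "f2"): 2, ("C", "f3"): 1}
result = reallocate_step(["f1", "f2", "f3"], {"f2": "A", "f3": "C"}, "A",
                         lambda o, u: costs[(o, u)])
```

In this toy case Operation A takes f3 and the displaced Operation C moves to f2, its best remaining unit, which mirrors steps (1) and (2) of the Fig. 2 walkthrough.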

1.3 Rescheduling after reallocation

The lifetime of variables must be considered when a rescheduling procedure is carried out. In our approach, another two-dimensional grid is used to represent the scheduling and allocation of variables. As shown in Fig. 4, the columns of the grid stand for the registers, while the rows of the grid stand for the lifetime of the variables. Variable b is scheduled in Steps 2 and 3, which means that the variable is produced at the end of Step 1, and must be stored in a register at the very beginning of Step 2, and be held until Step 3 is finished.

Fig. 4 Use of a two-dimensional grid to represent the allocation of registers

In order to distinguish these two grids in the text to follow, the grid shown in Fig. 1 is called a module-grid, and the grid shown in Fig. 4 is called a register-grid.

As shown in Eq. (4), increasing k can reduce the average delay of each step. For example, if an operation o is scheduled to begin execution at step i and finish in the same step, then rescheduling this operation to begin at step i and finish in step i+1 will be helpful to reduce ti in Eq. (3). This rescheduling is not certain to reduce the maximum delay of each step, but will increase the probability of reducing the maximum delay. However, the inputs of each operation should not be changed until the operation is finished. The constraint corresponding to this rescheduling approach can be described as follows.

Assume an operation o is scheduled to execute from step i to step j.

(1) The operation o can be rescheduled to execute from step i−1 to step j if all its previous operations are finished before step i−1 (it is assumed that the inputs of each operation should be stored in registers whenever they are produced, and be held until the operation is finished).

(2) The operation o can be rescheduled to execute from step i to step j+1 if (i) all its succeeding operations are scheduled to begin after step j+1, and (ii) all registers in its local step set are not occupied in step j+1, i.e., the input registers of this operation must not be occupied by other operations in step j+1.

This kind of rescheduling approach is called an expansion of operations. The algorithm is described as follows.

Algorithm 2: Expansion of operations

For each step i of the module-grid
    For each functional unit f of the module-grid
        if the module-grid cell on (f, i) is empty
            Calculate the delay (with wire delay) of operation op1 (the operation scheduled in functional unit f and step i+1) and op2 (the operation scheduled in functional unit f and step i−1);
            if (delay of op1 > delay of op2)
                Reschedule op1 if possible;
            else
                Reschedule op2 if possible;
            Adjust the register-grid;
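The two expansion rules can be sketched as precedence checks in Python. This is a simplified illustration: the names are hypothetical, control steps are assumed to be numbered from 0, and the register-occupancy condition (ii) of rule (2) and the module-grid occupancy check are omitted.

```python
from dataclasses import dataclass, field

@dataclass
class Op:
    start: int                                  # first control step
    end: int                                    # last control step
    preds: list = field(default_factory=list)   # names of producer operations
    succs: list = field(default_factory=list)   # names of consumer operations

def can_start_earlier(op, ops):
    """Rule (1): every predecessor finishes before step start-1."""
    return op.start > 0 and all(ops[p].end < op.start - 1 for p in op.preds)

def can_end_later(op, ops, last_step):
    """Rule (2)(i): every successor begins after step end+1."""
    return op.end < last_step and all(ops[s].start > op.end + 1 for s in op.succs)

def expand(op, ops, last_step):
    """Stretch op by one step if a rule allows it; return True on success."""
    if can_start_earlier(op, ops):
        op.start -= 1
    elif can_end_later(op, ops, last_step):
        op.end += 1
    else:
        return False
    return True

ops = {"a": Op(0, 0, succs=["b"]),
       "b": Op(2, 2, preds=["a"], succs=["c"]),
       "c": Op(4, 4, preds=["b"])}
history = [expand(ops["b"], ops, last_step=4) for _ in range(3)]
```

Here operation b first grows backwards into step 1, then forwards into step 3, and the third attempt fails because both neighbours are now adjacent. Each successful expansion increases k in Eq. (4) by one.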

The reallocation and expansion procedures should be called alternately to find a good solution. It must be noted that whenever an expansion procedure is called, the solution before expansion should be backed up and then restored before another reallocation procedure is called. If this is not done, the expanded operation will be stuck in its expanded form, and will not be able to make any further contribution to the reduction of the total circuit delay.

2 Improvement Procedure

The improvement procedure contains two phases, grid perturbation and a simulated annealing approach.

2.1 Perturbation of the module-grid

The simulated annealing algorithm is used widely in circuit optimization. However, the expansion procedure described previously is not suitable for the simulated annealing method. Moreover, adapting the reallocation to allow a simulated annealing approach will only limit the effect of optimization. In order to solve this problem, we try to perturb the module-grid at various times by rescheduling some operations to different steps. In this way, the search space of the simulated annealing approach will be widely expanded.

However, the perturbations must be carried out under some constraints. The approach is very similar to that already described in Section 1. All the perturbations can be divided into two parts.

Assume an operation op is scheduled to execute from step i to step j.

(1) The operation op can be rescheduled to execute from step i−1 to step j−1 if all its previous operations are finished before step i−1.

(2) The operation op can be rescheduled to execute from step i+1 to step j+1 if (i) all its succeeding operations are scheduled to begin after step j+1 and (ii) all registers in its local step set are not occupied in step j+1.

The disturbance algorithm is described as follows.

Algorithm 3: Disturbance of the module-grid

For each step i of the module-grid
    For each functional unit f of the module-grid
        if the module-grid cell on (f, i) is empty
            Get the operation o1 (the operation scheduled in functional unit f and step i+1) and o2 (the operation scheduled in functional unit f and step i−1);
            Select one operation o from o1, o2 randomly;
            if (o is reschedulable)
                Reschedule o and adjust the register-grid;
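A perturbation step of this kind can be sketched in Python as "list the legal one-step shifts, then pick one at random". The data layout and names are hypothetical, steps are assumed to be numbered from 0, and the register condition (ii) is left out for brevity.

```python
import random

def legal_shifts(ops, preds, succs, last_step):
    """ops: name -> (unit, start, end). Return the allowed (name, delta) moves."""
    occupied = {(u, s) for u, a, b in ops.values() for s in range(a, b + 1)}
    moves = []
    for name, (u, a, b) in ops.items():
        # Shift one step earlier: the cell must be free and all predecessors done.
        if (a > 0 and (u, a - 1) not in occupied
                and all(ops[p][2] < a - 1 for p in preds.get(name, []))):
            moves.append((name, -1))
        # Shift one step later: the cell must be free and all successors later still.
        if (b < last_step and (u, b + 1) not in occupied
                and all(ops[s][1] > b + 1 for s in succs.get(name, []))):
            moves.append((name, +1))
    return moves

def perturb(ops, preds, succs, last_step, rng=random):
    """Apply one randomly chosen legal shift and return the new schedule."""
    moves = legal_shifts(ops, preds, succs, last_step)
    if not moves:
        return dict(ops)
    name, delta = rng.choice(moves)
    unit, start, end = ops[name]
    return {**ops, name: (unit, start + delta, end + delta)}

ops = {"x": ("f1", 0, 0), "y": ("f1", 2, 2), "z": ("f2", 3, 3)}
preds = {"y": ["x"], "z": ["y"]}
succs = {"x": ["y"], "y": ["z"]}
moves = legal_shifts(ops, preds, succs, last_step=4)
shaken = perturb(ops, preds, succs, last_step=4, rng=random.Random(0))
```

Unlike an expansion, a shift keeps the number of steps an operation occupies, so it changes the shape of the grid without changing k in Eq. (4).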


2.2 Simulated annealing approach of FIDER

A simulated annealing approach is used to call the reallocation and rescheduling procedures iteratively. The main objective of this simulated annealing approach is to minimize t in Eq. (2).

Algorithm 4: Simulated annealing

After floorplan is finished:
Initialize solutions s, s0, s1, s2;
Set the current solution s0 as the best solution s;
while (not temperature out)
    while (not loop count out)
        s2 = s0;
        Reallocation;
        s1 = s0;
        Rescheduling;
        if (the result is better than the best solution)
            s = s0;
        s0 = s1;
        if (current solution is acceptable)
            if (the result is better than the best solution)
                s = s0;
        else
            s0 = s2;
        Increase loop count;
    Decrease the temperature;
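The control flow of Algorithm 4 can be sketched as a generic annealing loop in Python. The acceptance test is not specified in the text, so the Metropolis criterion used here is an assumption, as are the function name, the cooling parameters, and the toy problem at the end. Solutions are treated as immutable values, so "backing up" is just keeping a reference.

```python
import math
import random

def anneal(s0, reallocate, reschedule, cost,
           t0=1.0, t_min=0.01, alpha=0.9, loops=20, rng=random):
    best, temp = s0, t0
    while temp > t_min:                        # "not temperature out"
        for _ in range(loops):                 # "not loop count out"
            backup = s0                        # s2 in Algorithm 4
            s1 = reallocate(s0, rng)           # solution after reallocation
            expanded = reschedule(s1, rng)     # expansion is evaluated, not kept
            if cost(expanded) < cost(best):
                best = expanded
            delta = cost(s1) - cost(backup)
            # Assumed acceptance rule: Metropolis criterion.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                s0 = s1
                if cost(s0) < cost(best):
                    best = s0
            else:
                s0 = backup                    # reject: restore s2
        temp *= alpha                          # decrease the temperature
    return best

# Toy stand-in: a "solution" is an integer and the cost is its distance from 7.
result = anneal(0,
                reallocate=lambda s, r: s + r.choice([-1, 1]),
                reschedule=lambda s, r: s,
                cost=lambda s: abs(s - 7),
                rng=random.Random(0))
```

Continuing from the pre-expansion solution `s1` rather than from `expanded` reflects the backup-and-restore rule stated for the expansion procedure.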

3 Experimental Results

We have implemented the reallocation and rescheduling algorithms in the C++ programming language, and executed the program on a SUN UltraSparc V880 workstation. The parameters used are: the wire resistance per unit length, r, is 0.075 Ω/µm; the wire capacitance per unit length, c, is 0.118 fF/µm.

We test the algorithm under two fabrication technologies (FT). Delays of the functional units for the different fabrication technologies are given in Table 1. The initial length of each control step is based on the delay of the adders. As shown in Table 1, for FT=0.25 µm, the adder delay is 2 ns, and the initial length of each control step should be 2 ns. Similarly, the initial length of each control step should be 1 ns when FT=0.18 µm.

Table 1 Delays of functional units

| Description | d/ns (FT=0.25 µm) | d/ns (FT=0.18 µm) |
|---|---|---|
| Delay of adder/subtractor | 2.0 | 1.0 |
| Delay of multiplier | 4.0 | 2.0 |
| Delay of negator | 0.2 | 0.1 |

We use the behavioral description of the Fir11, Iir7, and Ellipf filters in VHDL to check the algorithms. The input VHDL files are first compiled into a data flow graph, and then a list procedure is used to make the initial scheduling and allocation solution. An interconnect-aware register allocation algorithm is used to allocate variables to the registers. A corner block list (CBL)-based floorplan algorithm[10] is used to obtain the floorplan. The constraints in experiments are the maximum delay and maximum area of the chip. The maximum number of each kind of functional unit can be specified. If the maximum number for a certain kind of functional unit is not specified, the system will use the maximum available number under the constraint of the maximum allowed area as the maximum number of this kind of functional unit.

The basic information for these benchmarks is presented in Table 2.

Table 2 Benchmarks

| Filter | Number of add/minus operations | Number of multiply operations | Number of negative operations |
|---|---|---|---|
| Fir11 | 10 | 11 | 6 |
| Iir7 | 14 | 14 | 4 |
| Ellipf | 22 | 8 | 5 |

The experimental results are shown in Table 3. The maximum delay of each control step is reduced by up to 22.19%. The resulting floorplans of Ellipf in different fabrication technologies are shown in Fig. 5.

The experimental results show that the effect of reallocation and rescheduling strongly depends on the initial synthesis and floorplan, and that significant benefit can be obtained by reducing the interconnect delay of the circuit.


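Table 3 Experimental results (Original, Optimized, and Ratio refer to the maximum delay of each step)

| Filter | Control steps | FT=0.25 µm: Area ratio (%) | FT=0.25 µm: Original (ns) | FT=0.25 µm: Optimized (ns) | FT=0.25 µm: Ratio (%) | FT=0.18 µm: Area ratio (%) | FT=0.18 µm: Original (ns) | FT=0.18 µm: Optimized (ns) | FT=0.18 µm: Ratio (%) |
|---|---|---|---|---|---|---|---|---|---|
| Fir11 | 14 | 94.73 | 3.4 | 2.9 | 83.95 | 95.24 | 1.8 | 1.7 | 94.92 |
| Iir7 | 18 | 93.75 | 2.5 | 2.5 | 100 | 94.00 | 1.9 | 1.9 | 92.82 |
| Ellipf | 17 | 92.86 | 2.7 | 2.4 | 87.07 | 94.62 | 2.5 | 1.9 | 77.81 |
| Average | - | - | - | - | 90.34 | - | - | - | 88.52 |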
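The summary figures can be checked directly from the per-filter delay ratios reported in Table 3. The short Python calculation below, with hypothetical variable names, recomputes the two averages and the largest reduction.

```python
# Delay ratios (%) for Fir11, Iir7, and Ellipf, copied from Table 3.
ratios = {
    "0.25um": [83.95, 100.0, 87.07],
    "0.18um": [94.92, 92.82, 77.81],
}

# Average ratio per fabrication technology, rounded as in the table.
averages = {ft: round(sum(v) / len(v), 2) for ft, v in ratios.items()}

# Largest reduction = 100% minus the smallest ratio over all entries.
best_reduction = round(100 - min(min(v) for v in ratios.values()), 2)
```

The averages come out as 90.34% and 88.52%, and the largest reduction is the 22.19% obtained for Ellipf at FT=0.18 µm.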

Fig. 5 Floorplan result of ellipf.vhd

4 Conclusions

In this paper a force-balance-based re-synthesis algorithm is presented. The main objective is to reduce the total delay of a circuit by reducing the circuit interconnect delay. The algorithm can be used to optimize the result of allocation and scheduling after floorplan is made or to quickly adjust the synthesis results based on a temporary floorplan when integrating synthesis and floorplan together.

However, at present the proposed algorithm only deals with the optimization of data-path interconnect. We plan to integrate interconnect delay optimization of the controller into this algorithm as well. Additionally, the optimization of pipeline architecture should also be integrated into the algorithm, and a more accurate delay model should be used to test the algorithm.

References

[1] Luca B, Luca M, Macii A, Poncino M. Layout-driven memory synthesis for embedded systems-on-chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2002, 10(2): 96-105.

[2] Cong J. An interconnect-centric design flow for nanometer technologies. Proceedings of the IEEE, 2001, 89(4): 505-527.

[3] Weng J P, Parker A C. 3D scheduling high level synthesis with floorplan. In: Proceedings of 28th ACM/IEEE Design Automation Conference. San Francisco, 1991: 668-673.

[4] Jang H, Barry M P. A grid-based approach for connectivity binding with geometric costs. In: Proceedings of International Conference of Computer Aided Design. Santa Clara, 1993: 94-99.

[5] Choi K, Levitan S P. Exploration of area and performance optimized datapath design using realistic cost metrics. In: Proceedings of IEEE International Symposium on Circuits and Systems. Seattle, 1995: 1049-1052.

[6] Rim M, Majumdar A, Jain R, et al. Optimal and heuristic algorithms for solving the binding problem. IEEE Transactions on VLSI Systems, 1994, 2(7): 211-225.

[7] Shantanu T, Miriam L. A data-centric approach to high-level synthesis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2000, 19(11): 1251-1267.

[8] Prabhakaran P, Banerjee P. Simultaneous scheduling, binding and floorplan in high-level synthesis. In: Proceedings of the IEEE International Conference on VLSI Design. San Jose, 1998: 428-434.

[9] Wu Q, Wang Y F, Bian J N, Wu W M, Xue H X. A hierarchical CDFG as intermediate representation for hardware/software codesign. In: Proceedings of IEEE 2002 International Conference on Communications, Circuits and Systems and West Sino Expositions. Chengdu, 2002: 1429-1432.

[10] Ma Y C, Hong X L, Dong S Q, Cai Y C, Cheng C K, Gu J. Floorplan with abutment constraints based on corner block list. Integration of the VLSI Journal, 2001, 31: 65-77.
