
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 8, AUGUST 1993, 1107

SALSA: A New Approach to Scheduling with Timing Constraints

John A. Nestor, Senior Member, IEEE, and Ganesh Krishnamoorthy

Abstract-This paper describes a new approach to the scheduling problem in high-level synthesis that meets timing constraints while attempting to minimize hardware resource costs. The approach is based on a modified control/data flow graph (CDFG) representation called SALSA. SALSA provides a simple move set that allows alternative schedules to be quickly explored while maintaining timing constraints. It is shown that this move set is complete in that any legal schedule can be reached using some sequence of move applications. In addition, SALSA provides support for scheduling with conditionals, loops, and subroutines. Scheduling with SALSA is performed in two steps. First, an initial schedule that meets timing constraints is generated using a constraint solution algorithm adapted from layout compaction. Second, the schedule is improved using the SALSA move set under control of a simulated annealing algorithm. Results show the scheduler's ability to find good schedules which meet timing constraints in reasonable execution times.

I. INTRODUCTION

The goal of high-level synthesis [1] is to translate a procedural specification of behavior into a register-transfer design that implements that behavior. Most approaches to high-level synthesis use a control/data flow graph (CDFG) as an intermediate representation of the behavioral specification and break the synthesis problem into two subtasks: scheduling, which assigns CDFG nodes representing operators to control steps, and allocation, which assigns CDFG nodes representing operators and edges representing data values to hardware (e.g., ALU's, registers, and interconnections) to realize a datapath. Scheduling is a particularly important part of this process for two reasons. First, it fixes requirements for the various hardware resources used during allocation. Second and equally important, it fixes the relative timing of operators and thus the satisfaction of timing constraints [2], [3]. Timing constraints are important because they allow designers to specify both desired performance and interface information [2], [4].

Fig. 1 illustrates the scheduling problem using a typical CDFG, which is a directed graph in which nodes represent operators and edges represent ordering dependencies between operators. Source and sink nodes represent the beginning and end of activities in the graph. Edges between nodes represent three different types of ordering dependencies. Data edges represent the flow of data from one operator to another, implying an ordering relationship because the data must be computed before it is used. Control edges represent ordering relationships associated with control operations such as conditionals. Timing edges [2], [3] represent timing constraints between two operators that must be satisfied in a correct design. A timing constraint specifies the required relative timing between two operators. Minimum timing constraints specify a lower bound on the relative timing between operators, while maximum timing constraints specify an upper bound. All of these dependencies imply an ordering in which the first operator precedes the second operator in execution.

Manuscript received February 5, 1991; revised October 15, 1992. This work was supported in part by NSF Grant MIP-9010406 and the IIT Education & Research Initiative Fund. This paper was recommended by Associate Editor A. Parker.

J. Nestor is with the Department of Electrical and Computer Engineering, Illinois Institute of Technology, Chicago, IL 60616.

G. Krishnamoorthy is with Mentor Graphics Corp., 15 Independence Blvd., Warren, NJ 07059.

IEEE Log Number 9207276.

Scheduling assigns each operator node to a control step that represents the controller state in which this operator will execute. Fig. 1 illustrates a typical schedule by displaying control step boundaries as horizontal lines. Since scheduling fixes the order in which operators will be implemented in the design, it must perform this task in a way that meets all dependencies specified by edges in the graph. When performed before hardware allocation, this ordering sets a lower bound on the resources required to implement the CDFG in hardware. Functional unit requirements are determined by the maximum number of operators of each type (i.e., adders, ALU's, etc.) that are scheduled in the same control step. Register requirements are determined by the maximum number of values that are live at the end of each control step, as represented by data edges in the CDFG that cross control step boundaries. For example, the schedule in Fig. 1 requires at least two adders, one multiplier, and four registers. A weighted sum of these requirements can be used as a cost function to estimate schedule quality. Some schedulers also include an estimate of interconnection requirements based on the total number of data transfers in each control step [5].
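As an illustration, the weighted-sum cost estimate described above can be sketched as follows. This is a hypothetical sketch, not the paper's implementation: the function name, weights, and data encoding are assumptions, and registers are approximated by counting data edges that cross each control step boundary.

```python
# Illustrative sketch (not SALSA's actual code): estimate schedule cost
# as a weighted sum of functional-unit and register requirements.

def schedule_cost(ops, edges, fu_weight=10, reg_weight=2):
    """ops: {node: (op_type, step)}; edges: [(src, dst)] data edges.
    Functional units: max count of each operator type in any one step.
    Registers: max number of data edges crossing any step boundary."""
    per_step = {}
    for node, (op_type, step) in ops.items():
        per_step[(op_type, step)] = per_step.get((op_type, step), 0) + 1
    fu_need = {}
    for (op_type, _), count in per_step.items():
        fu_need[op_type] = max(fu_need.get(op_type, 0), count)
    max_step = max(step for _, step in ops.values())
    regs = 0
    for boundary in range(1, max_step + 1):
        # a value is live at the end of step `boundary` if it is produced
        # at or before that step and consumed after it
        live = sum(1 for s, d in edges
                   if ops[s][1] <= boundary < ops[d][1])
        regs = max(regs, live)
    return fu_weight * sum(fu_need.values()) + reg_weight * regs
```

A scheduler would minimize this value over legal schedules; the weights trade off functional-unit cost against register cost.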

Early approaches to scheduling used the simple "as-soon-as-possible" (ASAP) or "as-late-as-possible" (ALAP) algorithms [6], [7] to minimize schedule length while ignoring hardware costs and timing constraints. More recently, a large number of approaches have been developed that attempt to minimize hardware costs in the resulting allocation. Approaches that minimize hardware cost can be broken down into constructive approaches using greedy heuristics, iterative transformational approaches such as simulated annealing, and exact approaches such as integer linear programming.

1. By convention, edges in all figures are directed top-to-bottom unless an arrowhead indicates otherwise.

0278-0070/93$03.00 © 1993 IEEE

[Fig. 1. A control/data flow graph (CDFG), including a maximum timing constraint and source/sink nodes.]

Greedy heuristics attempt to minimize resource costs but do not guarantee that an optimal schedule will be found. Examples of greedy approaches include fast, simple heuristics such as list scheduling [8]-[12] and more complex (and more effective) heuristics such as force-directed scheduling [5]. Greedy heuristics suffer the shortcoming that they can be "trapped" in local minima of the cost function and so may not find the globally best schedule.

Transformational approaches alter an existing schedule to find new schedules and search for low-cost schedules. Iterative probabilistic approaches such as simulated annealing [13] employ a small set of transformations that are applied to randomly selected parts of the schedule. A probabilistic acceptance function allows controlled hill-climbing to escape local minima. Simulated annealing has been employed in combined scheduling and allocation schemes [14], [15]. Simulated evolution is a related approach in which a group of scheduled operators is probabilistically selected, "ripped up" (unscheduled), and then rescheduled using a greedy algorithm [16]. Since transformational approaches require a large number of move applications to properly explore the schedule space, they typically exhibit large execution times.
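The probabilistic acceptance function used by annealing-based schedulers is typically the Metropolis criterion; the following is a minimal generic sketch of that criterion, not the acceptance function of any particular scheduler cited above.

```python
import math
import random

def accept(delta_cost, temperature, rng=random.random):
    """Metropolis acceptance: always take improvements; accept cost
    increases with probability exp(-delta/T), enabling controlled
    hill-climbing out of local minima at high temperatures."""
    if delta_cost <= 0:
        return True
    return rng() < math.exp(-delta_cost / temperature)
```

As the temperature is lowered, uphill moves become increasingly unlikely and the search settles into a low-cost region.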

Exact approaches using techniques such as integer linear programming (ILP) (e.g., [17], [18]) guarantee a globally optimum schedule but can have large execution times. Such approaches represent scheduling decisions as a set of decision variables (one for each possible assignment of an operator to a control step) and a set of constraints that must be satisfied to guarantee a legal schedule. ILP can then be used to find the optimal schedule with respect to a given cost function. While the worst-case execution times of these approaches are exponential, special characteristics of the scheduling problem can be exploited to reduce runtime [18], making such techniques practical for fairly large problems. However, the execution time of these approaches depends on the number of variables and constraints, which can grow quite large, especially as schedule length is increased for a given CDFG.

Only a few scheduling approaches attempt to meet arbitrary timing constraints while minimizing resource costs. Constructive heuristics include modified list-scheduling heuristics [9], [2] that attempt to meet timing constraints during scheduling, and force-directed scheduling, which uses local timing constraints [5] to limit the "time frames" of control steps into which an operator may be scheduled. In another approach, timing constraints have been included in an ILP formulation [18].

Other approaches to scheduling consider timing constraints separately from allocation and so do not attempt to minimize resource costs. For example, Hayati and Parker [19] consider timing constraints as part of the controller generation problem after scheduling and allocation are complete. Path-based scheduling [20] treats timing constraints and functional-unit constraints as intervals on a serialized control-flow graph (CFG) and derives a controller with a minimum number of control states by scheduling each path in the CFG separately.

It has recently been recognized that constraint solution algorithms drawn from layout compaction [21] can be used to find an ASAP-like schedule that meets timing constraints. This approach was used by Borriello [22] in the synthesis of asynchronous interface transducers. A similar approach has also been used to satisfy constraints between loop iterations [12]. More recently, Ku and De Micheli [23] have applied constraint solution to the problem of scheduling with timing constraints after allocation has been performed. This technique, called "relative scheduling," has the added feature that it can guarantee constraint satisfaction in the presence of unknown and unbounded delays using a specialized controller scheme. Since relative scheduling is performed after allocation, resource costs are not considered.

This paper describes a new transformational approach to scheduling with timing constraints. The key to this approach is a simple set of moves: transformations that alter an existing schedule by rescheduling individual operators or in some cases multiple operators. An ASAP-like schedule that meets all timing constraints is first generated using constraint solution techniques similar to [23]. However, unlike [23], this schedule is generated before allocation and is used as a starting point from which to search for other schedules that are lower in cost. This is accomplished by applying a sequence of moves that alter the schedule under the control of a simulated annealing algorithm. Moves are applied only when the resulting schedule will be legal with respect to all ordering and timing constraints. It is shown in the paper that any legal schedule can be reached from any other legal schedule by the application of some sequence of these moves.
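The constraint-solution step can be illustrated with a longest-path relaxation over the constraint inequalities, in the spirit of layout-compaction solvers. This is a generic Bellman-Ford-style sketch under assumed data encodings, not the paper's actual algorithm.

```python
# Generic sketch: minimal (ASAP-like) schedule satisfying constraints
# of the form x_j >= x_i + w, with node 0 as the source fixed at step 0.
# Backward edges from maximum timing constraints carry negative weights.

def asap_with_constraints(n, edges):
    """edges: list of (i, j, w) meaning x_j >= x_i + w.
    Returns the minimal legal schedule as a list, or None if the
    constraints are inconsistent (a positive-weight cycle exists)."""
    NEG = float('-inf')
    x = [NEG] * n
    x[0] = 0
    for _ in range(n):  # longest-path relaxation passes
        changed = False
        for i, j, w in edges:
            if x[i] != NEG and x[i] + w > x[j]:
                x[j] = x[i] + w
                changed = True
        if not changed:
            return x
    return None  # still relaxing after n passes: inconsistent constraints
```

Each node ends up at the earliest step permitted by all minimum and maximum constraints, which is the natural starting point for downhill search by move application.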

Since the moves that alter the schedule are applied many times, a modified CDFG representation called SALSA is used to represent a scheduled CDFG and speed up the tasks of checking for move legality, move application, and evaluation of the cost function after move application. The SALSA representation also provides support for conditional operations, mutual exclusion in both functional units and registers, and accurate representation of storage requirements in loops and subroutines. The annealing approach allows the scheduler to avoid local minima that can trap greedy algorithms.

The key contribution of this work is the development of a conceptually simple transformational approach to scheduling with timing constraints that uses simulated annealing with a small move set to find high-quality schedules of "data-dominated" CDFG's. Results show that this can be accomplished using reasonable amounts of CPU time, even when schedule length is substantially longer than minimum schedule length. Constraint solution provides an effective way to find an initial schedule that meets timing constraints, and these constraints are preserved by the move set. Analysis showing that all legal schedules can be reached using the move set lends intuitive support to the effectiveness of the approach and provides new insight into the structure of the scheduling problem. Finally, support for conditionals, subroutines, and loops allows the application of this approach to large, structured designs.

The remainder of this paper is organized as follows: Section II describes the notation and some key concepts that will be used in the paper. Section III introduces the SALSA representation and the move set that is used to explore alternative schedules. In addition, it discusses the completeness of the move set and describes support for conditionals, loops, and subroutines. Section IV discusses the techniques used for initial schedule generation and schedule improvement using the SALSA representation. Section V describes the implementation and presents scheduling results for a number of examples.

II. PRELIMINARIES

A CDFG can be represented by a directed graph G(V, E), where V represents the set of nodes of the graph and E represents the set of edges between nodes. The set of nodes includes a source node v_source, a sink node v_sink, and operator nodes v_1-v_n which represent the operations in the high-level specification. The term v_i.delay denotes the combinational delay of an operator node.

Each edge e_ij in E represents a data, control, or timing dependency between nodes v_i and v_j. Two edge weights may be associated with each edge to represent the spacing requirement associated with the dependency. Edge weight e_ij.min denotes the minimum allowable spacing in control steps of nodes v_i and v_j. Simple ordering and data flow constraints are represented by this weight with a value of 0 or 1, while minimum timing constraints are weighted with the constraint value in control steps. Edge weight e_ij.max denotes the maximum allowable spacing in control steps between nodes v_i and v_j as specified by a maximum timing constraint.

Each edge represents an inequality relationship between the scheduled values of the nodes that must be satisfied in a legal schedule. For example, a minimum timing constraint time(v_i, v_j) >= e_ij.min in a schedule x represents an inequality:

    x_j >= x_i + e_ij.min

while a maximum constraint time(v_i, v_j) <= e_ij.max represents an inequality:

    x_j <= x_i + e_ij.max.

Minimum and maximum constraints on an edge e_ij are sometimes separated into two edges e_ij and e_ji, where forward edge e_ij represents the minimum constraint and backward edge e_ji represents the maximum constraint [23]. In this formulation, all constraints can be expressed in a uniform way as inequalities of the form:

    x_j >= x_i + w_ij,

where

    w_ij = e_ij.min   for a minimum constraint (forward edge)
    w_ij = -e_ji.max  for a maximum constraint (backward edge).

A schedule of length L is an ordered n-tuple

    x = (x_1, x_2, ..., x_i, ..., x_n)

where each x_i is an integer 1 <= x_i <= L that represents the control step in which node v_i is scheduled. While not explicitly included in the tuple above, in all schedules of length L source node v_source is always scheduled in control step 0 (i.e., x_source = 0) and sink node v_sink is always scheduled in control step L + 1 (i.e., x_sink = L + 1).

A schedule x of length L is legal if it satisfies all of the constraint inequalities specified by the edges of the CDFG and every node v_i is scheduled in the range of control steps 1 <= x_i <= L. Since the CDFG by definition includes ordering edges from the source node and to the sink node, this second requirement will be satisfied whenever x_source = 0 and x_sink = L + 1.

The slack s_ij(x) of a constraint e_ij in schedule x represents the amount by which the scheduled positions of nodes v_i and v_j can be decreased (increased) without violating the minimum (maximum) constraint represented by e_ij. It is defined as

    s_ij(x) = x_j - x_i - w_ij.

Note that in a legal schedule, s_ij(x) >= 0 for every constraint e_ij since every inequality must be satisfied.
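The slack and legality definitions translate directly into code. The following is a small illustrative sketch using the uniform inequality form x_j >= x_i + w_ij; the data encoding (constraints as (i, j, w) triples) is an assumption made for illustration.

```python
# Illustrative sketch of the slack and legality definitions.

def slack(x, i, j, w_ij):
    """Slack s_ij(x) = x_j - x_i - w_ij of the constraint
    x_j >= x_i + w_ij under schedule x (node -> control step)."""
    return x[j] - x[i] - w_ij

def is_legal(x, constraints):
    """A schedule is legal iff every constraint has non-negative slack."""
    return all(slack(x, i, j, w) >= 0 for i, j, w in constraints)
```

A maximum timing constraint appears here as a backward edge with negative weight, so the same non-negativity test covers both constraint kinds.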

It is sometimes useful to think of a schedule x of n nodes as a point in an n-dimensional schedule space. Each constraint inequality defines a legal half-space within the schedule space that satisfies that particular constraint.


[Fig. 2. The schedule space for two operators, showing schedules x = (1, 3), y = (1, 4), and z = (2, 5).]

Since a legal schedule must satisfy all constraints, a region of legal schedules is defined by the intersection of all such half-spaces. Since each half-space is convex, it is easily shown that the region resulting from the intersection of half-spaces is also convex [24].

To illustrate the concept of schedule space, Fig. 2(a) shows the two-dimensional schedule space that results given two operators under the constraints:

    time(v1, v2) >= 1 step AND time(v1, v2) <= 3 steps.

The inequalities implied by these constraints combine with constraints on schedule length to form a trapezoidal region in the schedule space. Fig. 2(b) shows the schedules of three of the nine points contained in this region: schedules x = (1, 3), y = (1, 4), and z = (2, 5). For each point in the schedule space, an adjacent point can be reached by changing the schedule of a single operator node by one control step. Schedule y can be reached from schedule x in this manner by rescheduling node v2. Diagonally adjacent points in the schedule space can be reached by changing the scheduling of two operators by one control step. Schedule z can be reached from schedule y in this manner by rescheduling both nodes v1 and v2.

It can also be useful to quantify the amount by which the scheduled position of an operator node v_i varies between two schedules x and y. This is denoted by

    d_i(x, y) = y_i - x_i.

This specifies the distance between the two schedules for node v_i. Similarly, the total distance between two schedules x and y for all nodes is denoted by

    D(x, y) = sum_{i=1}^{n} |d_i(x, y)|.

This value is equivalent to the rectilinear distance between points in the schedule space. For example, in Fig. 2, D(x, y) = 1, D(y, z) = 2, and D(x, z) = 3.
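The distance measure above is a plain rectilinear (L1) distance between schedule tuples; a one-function sketch:

```python
# Rectilinear distance D(x, y) = sum_i |y_i - x_i| between two
# schedules given as equal-length tuples of control steps.

def distance(x, y):
    return sum(abs(yi - xi) for xi, yi in zip(x, y))
```

The test below reproduces the Fig. 2 example from the text.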

III. THE SALSA SCHEDULE REPRESENTATION

The SALSA representation supports a transformational approach to scheduling by describing a scheduled CDFG and providing a set of simple moves that transform a schedule x into a new schedule x'. Repeated application of these moves provides a means to search the schedule space that can be guided by the cost of each new schedule encountered. Since such a transformational approach requires many move applications and many evaluations of the cost function, it is important to make these moves fast and simple. In addition, it is important to quickly test whether the application of a move will result in a legal schedule. SALSA supports these needs using an explicit representation of slack in the constraints of a scheduled CDFG.

This section describes the SALSA representation and move set. In addition, it shows that the move set is complete in that any legal schedule can be reached from any other legal schedule by the application of some sequence of moves from the move set. Finally, it describes additional considerations, including support for conditional execution, loops, and subroutines.

3.1. Slack Nodes

SALSA explicitly represents slack in a scheduled CDFG using a new class of nodes known as slack nodes. Slack nodes are inserted in data, control, and timing dependency edges between operator nodes to represent slack in an existing schedule, and each slack node explicitly represents one step of slack. Thus in some schedule x with node v_i scheduled in step x_i, node v_j scheduled in step x_j, and a constraint weight w_ij, the edge e_ij will contain s_ij(x) = x_j - x_i - w_ij slack nodes. Maximum timing constraints are represented as backward edges with slack nodes inserted in the same way.

For data edges, slack nodes explicitly represent the need for storage of a data value during each control step which is crossed by the edge. Each such "data slack" node is considered to be scheduled into one of the control steps crossed by the edge, as shown in Fig. 3. Using this representation, register costs can be calculated locally in a control step by examining only nodes scheduled in that control step: operator nodes that create a new value, and slack nodes that represent the storage of a previously created value. For example, Fig. 3 shows the SALSA graph for one schedule of a simple CDFG. In the first step, two operator nodes produce values that are used in later control steps. In addition, two slack nodes represent storage of previously created values that are used in later control steps. Thus a total of four registers are required for this control step. This is the maximum number of registers required over all control steps, so this schedule will require a minimum of four registers. When a simple transformation changes only part of a scheduled CDFG, the local nature of these calculations can be exploited to speed the calculation of register costs.

[Fig. 3. The SALSA CDFG representation, with slack nodes inserted on data and timing edges.]
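The local register-cost calculation can be sketched as follows. The data representation here (lists of tagged nodes per control step) is an assumption made for illustration, not SALSA's actual data structure.

```python
# Illustrative sketch of SALSA's local register-cost calculation:
# registers needed in a step = operator nodes producing a live value
# + data slack nodes (values carried through the step).

def registers_in_step(step_nodes):
    """step_nodes: list of (kind, produces_live_value) pairs for one
    control step, where kind is 'op' or 'slack'."""
    return sum(1 for kind, live in step_nodes
               if kind == 'slack' or (kind == 'op' and live))

def register_cost(schedule_steps):
    """Overall register requirement: the maximum over all steps."""
    return max(registers_in_step(s) for s in schedule_steps)
```

Because each step is scored independently, a move that touches only a few steps only requires rescoring those steps, which is the locality the text describes.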

3.2. The Move Set

An important property of slack nodes is that an operator can be rescheduled in an adjacent control step while still satisfying constraints if all of its predecessor or successor nodes are slack nodes. Furthermore, this rescheduling can be accomplished by local rearrangement of the operator node and adjacent slack nodes. These properties can be exploited by defining a simple set of moves that alter a schedule by rescheduling one or more operator nodes in adjacent steps. SALSA provides four such moves, M1-M4. Each of these moves can be applied to a target operator only when legal, that is, when the schedule that results from the move does not violate any data, control, or timing dependencies. Move legality is easily determined before performing the move by checking that all dependency edges in the direction of the move contain slack nodes.

Simple moves M1 and M2 alter a schedule by moving a single operator node v_i to an adjacent control step. M1 and M2 are defined as follows:

M1: Move an operator node v_i from its current control step to the preceding control step. M1 transforms a schedule x into a new schedule x' = (x_1, x_2, ..., x_i - 1, ..., x_n). M1 is legal when all predecessor edges of v_i are connected to slack nodes. Applying M1 removes one slack node from each predecessor edge, reschedules the operator, and adds one slack node to each successor edge, as shown in Fig. 4(a).

M2: Move an operator node v_i from its current control step to the following control step. M2 transforms a schedule x into a new schedule x' = (x_1, x_2, ..., x_i + 1, ..., x_n). M2 is legal only when all successor edges of v_i are connected to slack nodes. Applying M2 removes one slack node from each successor edge, reschedules the operator, and adds one slack node to each predecessor edge, as shown in Fig. 4(b).
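A minimal sketch of move M1 on an explicit-slack representation follows. Encoding each edge as a slack count rather than a list of slack-node objects is a simplification for illustration; it is not SALSA's actual graph structure.

```python
# Illustrative sketch of simple move M1 (move an operator one step
# earlier). Each node's predecessor/successor edges are represented
# by per-edge slack counts.

def m1_legal(node, pred_slack):
    """M1 is legal when every predecessor edge carries at least
    one slack node (one step of slack)."""
    return all(s >= 1 for s in pred_slack[node])

def apply_m1(node, x, pred_slack, succ_slack):
    """Consume one slack node on each predecessor edge, add one on
    each successor edge, and reschedule the node one step earlier."""
    assert m1_legal(node, pred_slack)
    pred_slack[node] = [s - 1 for s in pred_slack[node]]
    succ_slack[node] = [s + 1 for s in succ_slack[node]]
    x[node] -= 1
```

Move M2 is the mirror image: it tests and consumes slack on successor edges and adds slack on predecessor edges.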

The application of a simple move transforms a schedule x into a new schedule x' that is immediately adjacent in the schedule space. For example, in Fig. 2 schedule y can be reached from schedule x by applying move M2 to node v2. Repeated application of simple moves allows the exploration of schedule space using very simple transformations.

[Fig. 4. Simple moves M1 and M2.]

Chaining [11] is supported by a minor extension which allows moves with non-slack predecessors (successors) when delay permits. This corresponds to a slight relaxation of ordering constraints (i.e., x_j >= x_i + 1 becomes x_j >= x_i) when the estimated combinational delay of the chained nodes does not exceed the clock period. Chaining of a single node is accomplished using modified versions of M1 and M2 that take this calculation into effect. A more powerful recursive chaining move is also useful: this move recursively moves predecessor or successor nodes if they are already chained with the target node and would block the completion of a simple chaining move. Chaining also affects initial schedule generation and is discussed further in Section 4.1. Multicycling (scheduling operators into multiple control steps) and multicycling with pipelined functional units are supported directly and require no special handling.
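The chaining condition reduces to a single delay comparison; a one-line sketch, where the delay values and clock period units are illustrative:

```python
# Illustrative chaining test: dependent operators may share a control
# step (x_j >= x_i instead of x_j >= x_i + 1) only if their combined
# combinational delay fits within one clock period.

def can_chain(delays, clock_period):
    return sum(delays) <= clock_period
```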

Because simple moves require only local changes to a SALSA graph, the cost of an individual move application is low. However, several moves may be required to make significant changes to the schedule. This is especially true when a move is blocked by a chain of constraints with no slack. For example, Fig. 5(a) shows a schedule in which operator v1 cannot be moved to its preceding control step because there is no slack in its predecessor edge. However, if the preceding "+" operator were moved to a previous control step, slack would be present, allowing v1 to move.

To overcome this problem, two more powerful shoving moves are defined that are similar in concept to the "shove-aside" transformations used in some routers [25]. A shoving move recursively moves any predecessor or successor operators that are blocking a simple move of the target operator, thus rescheduling several operators at once at an added expense in CPU time. The shoving moves are defined as follows:

M3: Shove an operator node vi from its current control step to the preceding control step. This move is ac- complished in two steps: (1) if any predecessor nodes are operator nodes that would block moving vi, recursively apply M3 to these nodes to “shove” them into preceding control steps. (2) Move vi to the preceding control step as in move M1. As each operator node is moved, slack nodes are removed from predecessor edges and added to successor



Fig. 5. Move M3 (shove up). Fig. 6. Move M4 (shove down).

edges as appropriate. When applied to an operator node vi, M3 transforms a schedule x with a chain of non-slack predecessor constraints (ejk, …, eli) into a new schedule x′ = (x1, x2, …, xi − 1, …, xj − 1, …, xk − 1, …, xl − 1, …, xn). M3 is always legal unless a chain of dependent operator nodes extends to the source node, which represents the beginning of the schedule. In this case, completing the move would result in an illegal schedule. Fig. 5 shows an example of move M3.

M4: Shove an operator node vi from its current control step to the following control step. This move is accomplished in two steps: (1) if any successor nodes are operator nodes that would block moving vi, recursively apply M4 to these nodes to “shove” them into following control steps; (2) move vi to the following control step as in move M2. As each node is moved, slack nodes are removed from successor edges and added to predecessor edges as appropriate. When applied to an operator node vi, M4 transforms a schedule x with a chain of non-slack successor constraints (eij, …, ekl) into a new schedule x′ = (x1, x2, …, xi + 1, …, xj + 1, …, xk + 1, …, xl + 1, …, xn). M4 is always legal unless a chain of dependent operators extends to the sink node, which represents the end of the schedule. In this case, completing the move would result in an illegal schedule. Fig. 6 shows an example of move M4.
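The recursive structure of the shoving moves can be sketched as follows, here for M4 (shove down) on a schedule with unit-weight precedence edges. This is a simplified illustration with hypothetical data structures: it omits the slack-node bookkeeping and the cycle marking needed for fixed-time constraints.

```python
def shove_down(x, succs, node, sink):
    """Sketch of move M4: move `node` one control step later, recursively
    shoving any zero-slack successors first. `x` maps node -> control step,
    `succs` maps node -> successor list (unit-weight edges: x[j] >= x[i] + 1).
    Returns a new schedule dict, or None when a zero-slack chain reaches
    the sink node (the move is illegal)."""
    new = dict(x)

    def shove(v):
        if v == sink:
            return False          # chain of zero-slack edges hit the sink
        for s in succs.get(v, []):
            # slack of edge (v, s) under the schedule built so far
            if new[s] - new[v] - 1 == 0 and not shove(s):
                return False      # successor is blocked and cannot be shoved
        new[v] += 1               # all successors now have slack; move v
        return True

    return new if shove(node) else None
```

Note that a real implementation must mark visited nodes so that a cycle of zero-slack constraints (a fixed-time constraint) is shoved once as a unit rather than recursed into forever, as the paper's footnote on fixed-time constraints points out.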

Shoving moves are important when minimum and maximum timing constraints are combined to create “fixed-time” constraints that specify an exact spacing between two or more operator nodes. For example, Fig. 7 shows the schedule space and two possible schedules of two nodes under a fixed-time constraint consisting of constraints time(v1, v2) ≥ 2 steps AND time(v1, v2) ≤ 2 steps. These constraints form a cycle of non-slack constraints, and there is never slack in the constraints on the cycle. For this reason, simple moves cannot be used to alter the schedule. However, shoving moves can be applied to reschedule all operators on such a non-slack cycle simultaneously.² For example, in Fig. 7 simple moves cannot be used because each legal schedule has no immediately adjacent legal schedules. However, shoving moves can be used to reschedule both operators simultaneously (e.g., M4 transforms schedule v into schedule w). This corresponds to a diagonal move in the schedule space.

²To support fixed-time constraints, shoving moves must detect cycles when recursively shoving predecessor or successor nodes. This can be accomplished using a simple marking scheme.

Fig. 7. Operators under a fixed-time constraint.

3.3. Completeness of the Move Set

Since moves M1-M4 are proposed for use in searching the schedule space, it is important to show that this move set is complete, i.e., that all legal schedules in the schedule space can be reached using these moves. This can be demonstrated by showing that, given two arbitrary legal schedules x and y, a sequence of moves M1-M4 can be applied to operator nodes in the CDFG that will transform x into a sequence of legal schedules that are successively closer to schedule y until schedule y is reached. To accomplish this, we first present a useful property of legal schedules.
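The argument below measures progress with the distance D(x, y). A plausible reading, consistent with the per-node distances di(x, y) used in the proofs that follow, is the sum of absolute per-node differences; this definition is an assumption here (the paper introduces D earlier in the text), shown only to make the lemmas concrete:

```python
def distance(x, y):
    """D(x, y): assumed here to be the sum over all nodes of |y_i - x_i|,
    i.e., the sum of the per-node distances d_i(x, y) used in the proofs.
    `x` and `y` map each node to its scheduled control step."""
    return sum(abs(y[v] - x[v]) for v in x)
```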

Lemma 1: Let x and y be two legal schedules of length L which differ in the scheduled position of at least one node vi (i.e., yi ≠ xi). If the scheduled position of node vi is greater in schedule y than in schedule x (i.e., yi > xi) and there is a successor constraint edge eij between vi and some node vj with no slack in schedule x, then the scheduled position of node vj must also be greater in schedule y than in schedule x (i.e., yj > xj). Similarly, if the scheduled position of node vi is smaller in schedule y than in schedule x (i.e., yi < xi) and there is a predecessor constraint edge eji between some node vj and vi with no slack in schedule x, then the scheduled position of node vj must also be smaller in schedule y than in schedule x (i.e., yj < xj).

Proof: Consider the case where yi > xi. Since there is no slack in schedule x for constraint eij (i.e., sij(x) = 0), we can replace the constraint inequality for eij in schedule x with an equality relationship:

xj = xi + wij.



In addition, since y is a legal schedule, the constraint inequality for the same constraint eij must hold in schedule y:

yj ≥ yi + wij.

Since yi > xi, this inequality can only be satisfied if yj > xj.

The proof of the case where yi < xi is similar and is omitted. □

This result can be used to show that given two schedules x and y with distance D(x, y), there is always a move that will create a new schedule x′ that is closer to y than the original schedule x.

Lemma 2: Let x and y be two legal schedules of length L that differ in the scheduled position of at least one node vi (i.e., yi ≠ xi). If the scheduled position of node vi is greater in schedule y than in schedule x (i.e., yi > xi), then there exists a legal schedule x′ that can be reached from x through the application of move M2 or M4 to node vi such that D(x′, y) < D(x, y). Similarly, if the scheduled position of node vi is smaller in schedule y than in schedule x (i.e., yi < xi), then there exists a legal schedule x′ that can be reached from x through the application of move M1 or M3 to operator vi such that D(x′, y) < D(x, y).

Proof: Consider first the case where the scheduled position of node vi is greater in schedule y than in schedule x. In this case, applying move M2 or M4 to schedule x will create a new schedule x′ in which node vi is one step closer to its scheduled position in schedule y. We must show that all constraints that involve vi will be satisfied in the new schedule x′ and that x′ is closer to y than the original schedule x. Constraints associated with predecessor edges of vi need not be considered, since the inequalities that they represent will be satisfied both before and after the move. For successor edges, we consider three cases, illustrated in Fig. 8:

Case 1 (M2 succeeds): If every successor constraint eij has slack sij(x) ≥ 1, then node vi can be moved forward one control step using simple move M2 to create a new schedule x′, as shown in Fig. 8(a). All constraints will still be satisfied in x′ but with reduced slack values. Further, since yi > xi and x′i = xi + 1, |di(x′, y)| = |di(x, y)| − 1 and so D(x′, y) < D(x, y).

Case 2 (M4 succeeds): If one or more successor constraints have zero slack, then node vi may still be moved forward one control step using shoving move M4 to create a new schedule x′, as shown in Fig. 8(b). Move M4 will complete successfully if it can be recursively applied to the successors of each successor node. These recursive applications will move forward all nodes {vi, vj, vk, …, vr} that lie on one or more paths of non-slack successor constraints starting with node vi, provided that each path is terminated by a constraint with slack or else forms a cycle of zero-slack constraints. Since move M4 simultaneously moves forward all nodes on the path, constraints on the path remain satisfied after the move is completed. Further, since each edge in the path is a constraint with no slack and yi > xi, then by Lemma 1, yj > xj, yk > xk, …, yr > xr. Thus for each node on the path |di(x′, y)| = |di(x, y)| − 1, |dj(x′, y)| = |dj(x, y)| − 1, |dk(x′, y)| = |dk(x, y)| − 1, …, |dr(x′, y)| = |dr(x, y)| − 1, and so D(x′, y) < D(x, y) for any number of nodes that are moved using M4.

Case 3 (M2 and M4 fail): Move M2 will only succeed when all successor constraints contain slack. However, move M4 will always succeed unless there is a path of non-slack constraints involving nodes {vi, vj, vk, …, vr} that is not terminated by a slack node. This can only occur when node vr has a non-slack successor constraint with the sink node vsink, as shown in Fig. 8(c). However, this situation cannot occur because both x and y are legal schedules. To demonstrate this assertion, assume that a path of non-slack constraints extends from node vi to the sink node, as shown in Fig. 8(c). In this case, since yi > xi, by Lemma 1, yj > xj, yk > xk, …, yr > xr and ysink > xsink. Since by definition the two schedules are legal only if ysink = xsink = L + 1, this cannot occur when both schedules are legal. Since Case 3 cannot occur, Cases 1 and 2 show that there is always a move that will create a new schedule x′ such that D(x′, y) < D(x, y).

The proof for the case where the scheduled position of node vi is smaller in schedule y than in schedule x is similar and is omitted here. □

Fig. 8. Considerations when moving an operator. (a) Case 1: M2 succeeds. (b) Case 2: M4 succeeds. (c) Case 3: M2 and M4 fail.

Given the result of Lemma 2, we can show that it is possible to reach any schedule y from any other schedule x.

Theorem 1: Let x and y be two legal schedules of length L that differ in the scheduled position of at least one node vi (i.e., yi ≠ xi). Then there exists a sequence of no more than D(x, y) applications of moves M1-M4 to selected nodes of the CDFG that will transform schedule x into schedule y.

Proof: By Lemma 2 there is always a move that will transform schedule x into a new schedule that is closer to schedule y. Applying such a move will create a new schedule x1 such that D(x1, y) < D(x, y). Similarly, there is always a move that will transform schedule x1 into a new schedule x2 such that D(x2, y) < D(x1, y). This process can be continued, creating a sequence of intermediate schedules x1, x2, …, xr that are successively



closer to y until finally D(xr, y) = 0, and so xr is equivalent to y. If each intermediate schedule is created using a simple move M1 or M2, then each intermediate schedule reduces the distance from schedule y by one. In this case exactly D(x, y) moves are required to transform schedule x into schedule y. Since shoving moves reduce the distance of an intermediate schedule from y by more than one, any shoving moves reduce the number of moves required to reach schedule y. Thus no more than D(x, y) moves are required to transform schedule x into schedule y. □

This result is important because it shows that using moves M1-M4 the region of legal schedules can be fully explored from any legal starting schedule. Thus the choice of a particular starting schedule cannot preclude the exploration of some set of schedules in the region. Further, the most direct path between any two legal schedules lies within this region, suggesting that illegal configurations are neither needed nor desirable when searching the schedule space. It is important to note that the number of legal schedules grows exponentially with the number of operator nodes, and thus exhaustive exploration is prohibitively expensive. This motivates the use of a probabilistic algorithm such as simulated annealing to guide the exploration of the schedule space.

3.4. Variable-Length Schedules

A straightforward extension to the SALSA representation allows it to explore schedules with different lengths. As defined in Section II, the sink node vsink of a CDFG with schedule length L is assigned to control step L + 1. Varying the length of the schedule therefore corresponds to varying the scheduled position of vsink. This can be accomplished by applying the moves defined in Section 3.2 to the sink node as well as the other nodes in the graph. For example, an M1 (move up) move can be successfully applied to vsink when the final step of the schedule contains only slack nodes. This has the effect of shortening the schedule by one control step. On the other hand, applying move M2 to vsink will lengthen the schedule if it is less than a user-specified upper bound. Shoving move M4 (shove down) must also be redefined slightly, since it can always complete by lengthening the schedule if necessary. This move now fails only when the upper bound on schedule length would be violated. Since there is always some sequence of moves that will create a maximum-length schedule, the analysis of move-set completeness in the previous section still holds.

3.5. Cost Estimation in SALSA

SALSA normally uses a cost function that is a weighted sum of register and functional unit requirements. When variable-length schedules are specified, an additional weighted term is added to the cost function to account for schedule length. Weights are user-specified to allow tradeoffs between resources of different types and schedule length.

Functional unit and register costs are computed by counting the requirements in each control step and taking the maximum of these values over all control steps. As in other approaches, functional unit costs are calculated by counting the number of similar operator nodes of each type. Register costs are determined by counting the number of operator nodes that produce data values used in later steps and the number of slack nodes that represent previously stored values, as described in Section 3.1.

A full calculation of functional unit and register costs over all control steps in a CDFG is expensive. However, the local nature of simple moves M1 and M2 allows these changes to be calculated incrementally in the following fashion: for each resource type (functional unit and register), the control steps which contain the maximum demand for the resource are retained in a critical step list. Each simple move M1 or M2 affects two control steps, which we will refer to as the source step (from which the operator is removed) and the destination step (to which the operator is added).³ Adding an operator to the destination step raises the demand for resources in that step. If this value is less than the current maximum demand, no action is taken. If it is equal to the current maximum, the step is added to the list of critical steps for that resource. If it is greater than the current maximum, the current critical steps are removed and the destination step becomes the new critical step. Removing an operator from the source step lowers the demand for resources in that step. If the step is currently the only critical step for a particular resource, then the overall cost is lowered and the critical steps must be recalculated.
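The bookkeeping just described might look like the following sketch for a single resource type. The names and data layout are illustrative, not the paper's implementation; the real scheduler would keep one critical step list per resource type.

```python
def apply_move(demand, critical, max_demand, src, dst):
    """Incrementally maintain the critical step list for one resource type
    when an operator moves from control step `src` to step `dst`.
    `demand` maps step -> resource demand; `critical` lists the steps that
    currently hold `max_demand`. Returns the updated (critical, max_demand).
    A sketch with hypothetical names."""
    # Destination step gains one unit of demand.
    demand[dst] = demand.get(dst, 0) + 1
    if demand[dst] == max_demand:
        critical = critical + [dst]              # ties the current maximum
    elif demand[dst] > max_demand:
        critical, max_demand = [dst], demand[dst]  # becomes the new maximum
    # Source step loses one unit of demand.
    demand[src] -= 1
    if critical == [src]:
        # Only critical step was lowered: overall cost drops, recalculate.
        max_demand = max(demand.values())
        critical = [s for s, d in demand.items() if d == max_demand]
    elif src in critical:
        critical = [s for s in critical if s != src]
    return critical, max_demand
```

The full recalculation happens only when the sole critical step loses demand, which is exactly the case the text identifies as lowering the overall cost.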

Shoving moves M3 and M4 are implemented using repeated applications of M1 and M2. When these moves are applied, the incremental cost adjustment must be recalculated for each operator that is moved, adding to the expense of shoving moves.

3.6. Conditionals, Subroutines, and Loops

SALSA represents conditional activities using an approach similar to [26], [27], as shown in Fig. 9. A list of input conditions is attached to each operation that represents the conditions under which it is activated. Each row of this list represents a set of input conditions encoded as 0, 1, or X (either 1 or 0). The universal condition (XXX) is attached to unconditional operations, signifying that they always execute. Since the input conditions are based on other values in the CDFG, data edges are added to conditional operators to represent the use of these values in conditional execution. These edges maintain proper sequencing and account for storage requirements. Conditional operators that produce data values require an added multiplexer operator to select the proper value based on the tested condition. Control operators that change control flow (e.g., restarting a loop) require no multiplexer operator but may require added control edges to maintain proper sequencing.

³Note that this is true for both single-cycle and multicycle operators because each move reschedules an operator in an adjacent control step.

Fig. 9. Conditional execution.

Mutual exclusion between two operator nodes is detected by taking the intersection of their condition lists. If this intersection is empty, then they are mutually exclusive and can share the same functional unit. Condition lists are also attached to slack nodes, allowing the detection of mutual exclusion in value storage. If the condition lists of any pair of nodes (either slack or operator) do not intersect, then the values that they produce are mutually exclusive and can share the same register.
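The intersection test follows directly from the 0/1/X encoding; a minimal sketch with hypothetical names: two condition rows intersect unless some position pins opposite constant values, and two condition lists are disjoint (hence mutually exclusive) when no pair of rows intersects.

```python
def rows_intersect(r1, r2):
    """Two condition rows (strings over '0', '1', 'X') intersect unless
    some bit position requires opposite constant values."""
    return all(a == b or a == 'X' or b == 'X' for a, b in zip(r1, r2))

def mutually_exclusive(list1, list2):
    """Operators (or stored values) can share hardware when the
    intersection of their condition lists is empty."""
    return not any(rows_intersect(a, b) for a in list1 for b in list2)
```

As expected, nothing is mutually exclusive with the universal condition XXX, so unconditional operators never share a unit this way.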

Subroutines are an important tool for structuring the control and data flow of behavioral descriptions and synthesized designs. Depending on the designer’s intent, subroutines in a behavioral description may be synthesized in a number of different ways, each with different advantages. First, subroutines may be implemented directly in the controller program. This approach assumes a single thread of control, and multiple instances of a subroutine are implemented directly as subroutine calls in the control program. This has the advantage of allowing datapath hardware to be shared between subroutines and calling routines. Second, subroutines may be treated as separate structural entities that are synthesized independently. In this case, multiple instances of a subroutine may be implemented either as a single datapath or as several datapaths. This has the advantage of allowing hierarchy and parallelism. Finally, subroutines may be eliminated altogether by expanding them into calling routines. This approach has the advantage of simplifying the control structure but increases the size of the calling routines.

The SALSA representation directly supports only the first approach to implementing subroutines. This approach allows datapath resources to be shared between subroutines and calling routines. However, the remaining approaches can still be implemented using behavioral-level transformations [8] such as process formation and inline expansion to alter the structure of the behavioral description.

This representation of subroutines is implemented using a method similar to the CMU Value Trace [8] with extensions that support accurate register cost calculation during subroutine execution. In this approach, subroutines are represented as separate graphs that will be implemented by the same datapath and controller after scheduling and allocation. Each graph is scheduled into a separate sequence of control steps which will be bound to the same datapath during allocation.

CALL operator nodes represent the activation of subroutine graphs. Each CALL node represents a transfer of control from a control step in the calling context to the sequence of control steps in the subroutine graph. In addition, it represents the transfer of data values to the subroutine from the calling context by data edges into the CALL node and corresponding edges out of the source node of the subroutine graph. Similarly, it represents the transfer of data values from the subroutine to the calling context at the end of the subroutine by data edges into the sink node of the subroutine graph and corresponding edges out of the CALL node. Fig. 10 shows an example of a subroutine graph and two CALL nodes that transfer control to that graph.

Fig. 10. Subroutine graph and CALL nodes.

During scheduling, SALSA allows moves M1-M4 to be applied to every subroutine graph as well as the graph representing the main program. This allows scheduling tradeoffs to be considered simultaneously for the entire design. However, when subroutines are present, register cost estimation must account not only for local storage requirements in subroutine graphs but also for the storage requirements of the calling routines. For example, in Fig. 10 there are two CALL nodes that activate subroutine graph X with different storage requirements. During the first call, values A and B require storage during the execution of the subroutine. During the second call, value C requires storage. These values must be considered live when calculating the register cost of the scheduled subroutine. The SALSA representation describes these values explicitly by adding data edges between the source and sink node of the subroutine graph, as shown in Fig. 11. Slack nodes in these edges explicitly represent storage requirements in each control step but do not imply any additional scheduling slack. Because only one call to the subroutine may be active at a time, values from different calling contexts are mutually exclusive with respect to each other. SALSA represents this mutual exclusion by creating a unique bit vector for each calling context and adding this vector to the condition list of each slack operator.

As in the Value Trace, loops are treated as a special case of subroutines. Each loop is represented using a separate graph. Loop execution is initiated using a CALL operator, and new iterations of the loop are initiated using a RESTART operator that feeds data values back to the beginning of the loop. WAIT operations that are used for external synchronization are implemented in the same way using simple loops of one control step.



Fig. 11. Subroutine graph with added data edges.

IV. SCHEDULING WITH SALSA

The previous section discussed the SALSA representation and how alternative schedules can be explored using the SALSA move set. Given a schedule which meets all timing and ordering constraints, the application of a legal move to an operator in the schedule will result in a new schedule that meets the same constraints. However, an initial schedule that meets timing constraints must first be created before this exploration process can proceed. Following initial schedule creation, some method must be used to guide the exploration process. This section describes the techniques used to accomplish these tasks.

4.1. Initial Schedule Generation

The initial scheduling phase takes a traditional CDFG as input, finds a schedule that meets all timing constraints, and adds slack operators to form a SALSA graph. The schedule can be either a minimum-length schedule or a schedule of length specified by the user. To find the schedule, it uses an iterative algorithm adapted from layout compaction [28], [29]. This algorithm is similar to the relative scheduling algorithm of [23], but is performed before allocation and does not support unbounded delays.

In one-dimensional layout compaction, objects to be compacted are treated as nodes in a directed constraint graph with a single source and sink node. Edges represent relative positioning (e.g., object A is to the left of object B). Edge weights represent spacing constraints between objects (e.g., the distance between the centers of objects A and B must be greater than X). The problem of constraint solution is to find an assignment of objects to locations that meets all spacing constraints and minimizes the overall layout size. Compaction research [21], [29] has shown that when a constraint graph contains both minimum and maximum constraints it can be solved in O(V * K) execution time, where V is the number of nodes and K is the number of maximum constraints. Additional algorithms allow the determination of whether a graph contains contradictory constraints [29].

It is straightforward to apply constraint solution techniques to the problem of finding a schedule in a CDFG. The CDFG becomes a constraint graph in which edges are weighted to represent timing constraints expressed in control steps. Data and control edges are weighted to guarantee proper operation ordering, and timing edges are weighted to represent constraint values. Fig. 12 shows a constraint solution algorithm for scheduling which is patterned after the constraint solution algorithm of Burns and Newton [28], [29] but is extended to deal with chaining.

constraint_solution() {
    for ( every node vi in G(V,E) )          /* start with all ops scheduled in step 0 */
        xi = 0;
    for ( each successor vi of source vsrc )  /* place successors of the source in queue */
        enqueue( vi );

    while ( queue is not empty ) {
        vj = dequeue();
        lower_bound = 0;
        upper_bound = 0;

        /* process minimum constraints and dependencies on predecessors */
        for ( each predecessor edge eij of vj ) {
            if ( chaining enabled && eij is not a timing constraint ) {
                /* check constraint on chained operators */
                comb_delay = longest_comb_delay( vi ) + vj.delay;
                if ( comb_delay <= clock_period )
                    lower_bound = max( lower_bound, xi );
                else
                    lower_bound = max( lower_bound, xi + 1 );
            } else
                lower_bound = max( lower_bound, xi + eij.min );
        }

        /* process maximum constraints on successors */
        for ( each successor edge ejk of vj )
            upper_bound = max( upper_bound, xk - ejk.max );

        /* reschedule vj if necessary and enqueue constrained nodes */
        newstep = max( xj, max( lower_bound, upper_bound ) );
        if ( newstep != xj ) {
            xj = newstep;
            for ( each predecessor edge eij of vj )
                if ( eij.max represents a valid max. constraint )
                    enqueue( vi );
            for ( each successor edge ejk of vj )
                if ( ejk.min represents a valid min. constraint )
                    enqueue( vk );
        }
    }
}

Fig. 12. Constraint solution algorithm for initial schedule generation.

The algorithm operates by initially scheduling all operators in control step 0 and then iteratively correcting constraint violations by moving operators to later control steps. Operators that may violate constraints are placed in a queue for processing. The outer loop of the algorithm removes operators from the queue one at a time and tests for constraint violations. It first tests for the violation of any minimum and ordering constraints on predecessor operators. If a minimum constraint is violated, it can be corrected by moving the node to a later control step. It then tests for violation of maximum timing constraints on successor nodes. If a maximum timing constraint is violated, it can also be corrected by moving the operator to a later control step. Since moving the operator can cause violations of other constraints, operators connected to potentially violated constraints are placed on the queue for later processing. The process iterates until the queue is empty; the result is a schedule in which all constraints have been met.

Chaining [11] is supported during initial schedule generation using a slight modification to the constraint solution algorithm. When chaining is enabled, an operator that depends on the data output of another operator may be placed in the same control step if the estimated combinational delay of the cascaded operators does not exceed the clock period. If this value is exceeded, the second operator is placed in the following control step.

The schedule that results from this algorithm is equivalent to an “as-soon-as-possible” (ASAP) [1] schedule that is adjusted to meet all timing constraints. The constraint graph can also be solved in reverse order, starting at the sink node with a given number of steps. This schedule is equivalent to an “as-late-as-possible” (ALAP) schedule that meets all timing constraints. These schedules can be used in the same way that ASAP and ALAP schedules are used to determine operator time frames [5], ranges of control steps in which an operator may be scheduled. When scheduling in a minimum number of control steps, operators that are scheduled into the same control step in both schedules are critical path [30], [11] operators that cannot be placed in any other control step in schedules of the given length.
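Under these definitions, time frames and critical-path operators follow directly from the two constraint solutions; a sketch with hypothetical names, assuming each solution maps operators to control steps:

```python
def time_frames(asap, alap):
    """Operator time frames from the forward (ASAP-like) and reverse
    (ALAP-like) constraint solutions. An operator lies on the critical
    path when both solutions agree on its control step, i.e., its time
    frame collapses to a single step."""
    frames = {v: (asap[v], alap[v]) for v in asap}
    critical = [v for v, (lo, hi) in frames.items() if lo == hi]
    return frames, critical
```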

When multiple graphs are present that represent loops and subroutines, the initialization part of the algorithm must be modified so that each subroutine is assigned to a unique set of control steps. This task is straightforward. Maintaining timing constraints in the presence of calls to subroutines and loops is more complicated. When the execution time of a subroutine or loop is known exactly, call operators can be assigned a “delay” that represents the execution time of the subroutine or loop in control steps. This operator is then scheduled into dummy control steps that represent the time spent during the execution of the loop or subroutine. In this case, constraint solution can be used as before to find a schedule that meets all timing constraints. After constraint solution is completed, the dummy control steps are removed. To guarantee that normal operators are not scheduled into the dummy control steps, normal operators must be constrained to either precede or follow call operators by adding edges to the CDFG.

When the execution time of a loop or subroutine is not known, a timing constraint that “crosses” the call operation (i.e., a constraint between one operator that precedes the call and one operator that follows the call) cannot be satisfied in all circumstances. However, if a lower bound on execution time is known, then solving the graph assuming the minimum number of control steps will result in a schedule that meets any minimum timing constraints that cross a call. Similarly, solving the graph assuming the maximum number of control steps will result in a schedule that meets any maximum timing constraints that cross a call. However, this approach will not work when both minimum and maximum constraints cross a call; this remains an area for future research.

4.2. Schedule Improvement

Schedule improvement is implemented using simulated annealing. The configuration space of the annealing problem is the set of legal schedules for a CDFG. The move set consists of moves M1-M4. In terms of the schedule space, the configuration space corresponds to the region of legal schedules. The application of a simple move M1 or M2 reschedules an operator in an adjacent control step, corresponding to a move to an adjacent point in the schedule space. The application of a shoving move M3 or M4 reschedules multiple operators into adjacent control steps, corresponding to a move to a point in the schedule space that differs by one control step in multiple dimensions.

More “global” moves which make larger changes to a schedule (e.g., moving an operator more than one control step) were also considered, but experiments showed no improvement over using the basic move set. Repeated application of the move set under the control of simulated annealing corresponds to a search of several different points in the schedule space. Individual schedules may be visited more than once when a move is rejected or reversed by a later move or sequence of moves.

Illegal configurations are often used in annealing implementations, particularly in module placement [31]. However, illegal configurations are not supported in SALSA because the completeness of the move set guarantees that any legal schedule can be reached from any starting schedule. Further, because the region of legal schedules is convex, the shortest path between two schedules also lies within the region of legal schedules; any path of schedules that includes illegal configurations is longer. This is not true in module placement, where module overlap constraints result in a region of legal configurations that is not convex and the shortest path between two configurations is likely to pass through a sequence of illegal (overlapping) configurations.

Constraint solution is used as discussed earlier to create an initial legal schedule. If a minimum-length schedule is specified by the user, then critical path operators are also identified at this time. Since critical path operators can only be scheduled in one position, they are excluded from consideration for move applications.

Simulated annealing is implemented in a straightforward manner [13] using a cost value C (the weighted sum of resource requirements described in Section 3.4) and a temperature control parameter T. The temperature parameter is set to an initial temperature T0 which is gradually lowered. At each temperature, several move attempts are made. During each attempt, a move and operator are selected at random and the move is tested for legality with the selected operator. If illegal, the attempt is discarded without applying the move. If legal, then the move is applied and the change in cost ΔC is calculated. A negative value of ΔC reflects an improved configuration. These “downhill moves” are always accepted. A positive value of ΔC reflects an inferior configuration. These “uphill moves” are accepted with a probability:

p = e^(-ΔC/T).

This acceptance probability allows acceptance of uphill moves and an escape from locally optimal points in the design space. Rejected moves are reversed by applying the equivalent move in the opposite direction.

Temperature is controlled by an adaptive cooling schedule adapted from [32]. It calculates the initial temperature, temperature changes, and equilibrium conditions based on statistics gathered from a number of moves made before annealing begins. Move attempts are made at each temperature until either equilibrium is detected or an upper bound is reached that is a weighted sum of the number of off-critical operators and the schedule length. Annealing terminates when there is no change in cost over a number of successive temperatures (typically 3).

Move selection is biased towards moves and operators that are likely to reduce schedule cost. This is accomplished by selecting operators from the critical steps of resources that contribute to the current cost. During this selection a resource is chosen with probability proportional to its relative resource cost if the demand on this resource exceeds a preset threshold (typically the lower bound on functional units of this type). An operator is then chosen at random from a critical step of the selected resource.

Moving operators out of these critical steps reduces the demand on the control step and tends to reduce schedule cost. However, it is still useful to attempt moves on operators in other control steps, since this may indirectly provide opportunities to reduce the cost. For this reason, some of the selected operators (typically 10-15%) are instead chosen at random from the list of all non-critical operators without regard to their current control step.

After an operator is selected, a simple move M1 or M2 is selected at random and attempted. If this attempt fails and chaining is allowed, then a simple chaining move is attempted in the same direction. If this second attempt fails, then either a shoving move or a recursive chaining move is attempted. If all attempts fail, a final move is occasionally applied (typically in 0.1% of all cases) that returns the search to the best schedule found so far.

V. IMPLEMENTATION AND RESULTS

The SALSA representation and scheduler have been implemented in about 4300 lines of C, including the initial schedule generation and schedule improvement phases. A separate translator has also been developed that reads CMU Value Trace files from the System Architect’s Workbench [8] and translates these files into the SALSA representation.

The SALSA scheduler has been tested with a number of examples. Results from these examples are summarized in Tables I-III. These examples include some small examples previously used in the literature, a control-dominated benchmark example, and two larger data-dominated examples. Each table lists schedule length, resource requirements, estimated problem size, and CPU seconds of execution time for each annealing run (CPU times were measured on a Sun SparcStation IPC with 24-Mb memory).

When evaluating scheduling speed, it is important to recognize that the complexity of the scheduling problem grows both with the number of operators in the CDFG and with the length of the schedule. The estimated problem size entry in Tables I-III attempts to estimate this complexity as the total number of schedule positions to which each operator may be assigned. This value is equal to the number of variables required to represent the scheduling problem in an ILP formulation [17], [18].

TABLE I
VARIOUS EXAMPLES

Example          Steps   FU *   FU +   FU /-   Other   Reg   CPU (sec)
MAHA               8      -      2      7        2       9      18
MAHA (chained)     4      -      4      -        -
TMPCTL            15      1      2      -        4      10      42
RCVR              37      1      1      -        5      10      64

Table I summarizes results for three examples. The MAHA “code sequence” example [30] shows the operation of chaining in a small data-dominated example. The first schedule was created without chaining, while the second schedule was created with chaining enabled. In this case the use of chaining created a schedule with fewer control steps at the expense of added functional units. The TMPCTL “temperature controller” example [9] is a simple example that illustrates the interaction between scheduling and timing constraints. In each of these cases, the quality of results matches those reported previously.

The RCVR example is used as a control-dominated example that is part of the I8251 high-level synthesis benchmark [33]. While the scheduler reduces resource demands as much as possible, control-oriented approaches such as path-based scheduling [20] give better results for this example, especially in terms of the number of control states. Improving performance in this area will be an important area of future work.

The “Fifth-Order Elliptic Wave Filter” benchmark [5], [33] has been intensively studied. It consists of 34 operators (8 multiply-by-constant, 26 addition). Table II summarizes results for this example for a number of schedule lengths under three different sets of assumptions. In the first set of results, chaining is not allowed, adder delay is assumed to be one clock cycle, and multiplier delay is assumed to be two clock cycles. Non-pipelined multipliers are used in this case. In the second set of results, the same set of assumptions is used but now pipelined multipliers are used with a latency of one clock cycle. In the third set of results, chaining is allowed. In each case, execution time of annealing increases as schedule length increases, but at a slower rate than the increase in estimated problem size.

The first two sets of results can be compared to several results in the literature (e.g., [5], [9], [11], [14]-[18]). In each of these cases, SALSA finds schedules that match the cost of the best schedules found by other researchers, including several that are known to be optimal.⁴ Fewer results are available for chained schedules. Camposano [20] reports 9- and 13-step schedules that were found using path-based scheduling with functional unit constraints on a serially-ordered CDFG. Similar schedules were found by SALSA and are shown in Table II. In addition, SALSA found a 26-step schedule requiring one multiplier

⁴Register requirements may be reduced by one in some cases if the input is assumed to be stored in a dedicated input register [18].



TABLE II
FIFTH-ORDER ELLIPTIC WAVE FILTER

Schedule Characteristics     Steps   FU *   FU +/-   Reg   Prob. Size   CPU (sec)
No chaining,                   18      2      3       10       96          55
non-pipelined multipliers      17      3      3       10       38          13
                               21      1      2       10      198          53
                               28      1      1       10      436          34

Pipelined multipliers          17      2      3       10       38          12
                               18      1      3       10       96          57
                               19      1      2       10      130          76
                               28      1      1       10      436          45

Chaining                        9      1      3       11      205          45
                               13      1      2       11      440          46
                               26      1      1       11      882          70

TABLE III
DISCRETE COSINE TRANSFORM EXAMPLE

Schedule Characteristics     Steps   FU *   FU +/-   Reg   Prob. Size   CPU (sec)
No chaining,                   10      4      4       15      240          57
non-pipelined multipliers      14      3      3       13      432          52
                               18      2      3       16      624          63
                               19      2      2       17      672          81
                               34      1      2       15     1392         122
                               35      1      1       15     1440         137

Pipelined multipliers          10      3      4       12      240          96
                               11      2      4       13      288          74
                               13      2      3       14      384          78
                               19      1      3       14      672          88
                               20      1      2       14      720          99
                               33      1      1       16     1344         132

Chaining                        7      3      5       15      384          41
                                8      2      4       15      432          42
                               11      2      3       15      578          44
                               16      1      3       14      816          50
                               17      1      2       15      864          52
                               32      1      1       16     1584          91

and one adder. When comparing these approaches, it is important to note that the quality of the schedule found by path-based scheduling depends on the initial serial ordering of nodes; some orderings result in longer schedules. In contrast, SALSA requires no such ordering and minimizes both functional unit and register requirements.

When examining the quality of schedules with chaining, it is interesting to compare the functional unit requirements with the absolute lower bounds for resource requirements derived in [36]. This bound predicts that the number of functional units of each type can be no smaller than the number of operators of each type divided by the number of control steps. In each of the three chained EWF schedules, multiplier and adder costs are equal to the absolute lower bound. This demonstrates that, in contrast to our initial experience with the small “MAHA” example, chaining often makes it possible to find low-cost schedules using a smaller number of control steps than other approaches. These opportunities come at the expense of a more complex scheduling problem; estimated problem sizes for chained versions of the EWF example are much larger than for unchained versions due to the larger time frames that result from chaining.

The discrete cosine transform (DCT) was used to show the behavior of SALSA with larger examples. The DCT is used extensively in image coding and compression, and has been implemented in hardware for special-purpose image processors (e.g., [37]). Fig. 13 shows the CDFG of an 8-point DCT patterned after the implementation described in [37]. It consists of 48 operators (16 multiply-by-constant, 25 add, and 7 subtract). Unlike the EWF example, which has a relatively long minimum schedule length (17 steps in the unchained case), the DCT has a short minimum schedule length (7 steps in the unchained case). This substantially increases the difficulty of finding schedules that contain a reasonable number of functional units.



Fig. 13. CDFG for DCT example.

Table III summarizes scheduling results for this example under the same scheduling conditions used for the EWF example: non-pipelined multipliers, pipelined multipliers, and chaining. In addition, it was assumed that add and subtract operators would be implemented by ALU functional units that can perform both operations. As in the EWF example, pipelined multipliers allow a substantial reduction in functional unit costs. However, as in the EWF example, chaining again provides the best way to find low-cost schedules using a small number of control steps. Schedules that were produced using chaining match the absolute lower bound for functional units in 8, 11, and 32 steps. The scheduler was not able to produce a 16-step schedule at the absolute lower bound (1 multiplier and 2 adders); however, it found this result in a 17-step schedule.

Execution times for the DCT grow at a reasonable rate as schedule length increases. However, we have found that while SALSA consistently finds the best schedules for small examples such as the EWF in a single annealing run, it does not always do so for larger examples such as the DCT. When this occurs, multiple runs can be used to further improve the schedule at the expense of additional CPU time.

Table IV summarizes execution times for SALSA on the EWF example compared to those of a number of previous approaches. Because these measurements were made on processors of widely varying speed, it is difficult to use these results to make accurate comparisons. However, some conclusions can be drawn from these results. First, SALSA shows a clear advantage over the simulated annealing approach of [14], which simultaneously considers scheduling, operator allocation, and estimated interconnect cost. We believe that this advantage is due not only to the reduced problem scope (i.e., scheduling only), but also to the fact that SALSA’s efficient representation and move set allows configurations to be explored very quickly.

TABLE IV
COMPARISON OF EXECUTION TIMES FOR EWF EXAMPLES

Scheduler            # CSTEPS   CPU Time       Machine Type
SALSA                17-21      13s-55s        Sun Sparc IPC
SA [14]              17         4m             DEC VAX 8650
FDS [5]              17-21      2m-6m          Xerox 1108
Extended FDS [38]    17-21      2s-3m          Apollo DN10000
OASIC [18]           17, 18     30s, 4m        Intel 386
OASIC (FU only)      19         36s            Intel 386
ILP [17]             17-21      0.26s-34.5s    DEC VAX 8800

While SALSA appears to have a clear advantage over the execution times of Force-Directed Scheduling [5], [34], the discrepancy in processor speed between the two sets of measurements is large enough to render comparison almost meaningless. However, when compared to results from an extended FDS algorithm [38], there is still an advantage even though a faster processor was used. More importantly, analysis of the FDS algorithm [38] has shown that its execution time grows as the square of schedule length. In contrast, while it is difficult to characterize the execution time of a probabilistic algorithm, this time is related to the maximum number of move attempts at each temperature; in SALSA, this value grows linearly with schedule length.

Results for the ILP approach of [17] are given for non-pipelined multipliers in 17-21 control steps. These execution times are smaller than those of SALSA but grow rapidly with increasing schedule length. An extension of this work [35] adds constraints to support chaining and pipelined functional units. Execution times are not available for these features, but for chaining the number of added constraints grows exponentially with the depth of chaining allowed. Results for the OASIC IP approach to scheduling and allocation [18] are given for 17- and 18-step schedules with pipelined multipliers. This approach uses more CPU time than the SALSA approach, but includes consideration of interconnect cost and allocation. Execution times are greatly reduced when only functional unit cost is considered, as shown in the final entry of Table IV.

ILP and IP approaches are very attractive since an optimal solution is guaranteed. These recent results show that when schedule lengths are close to minimum schedule lengths, execution times are quite good. However, in cases where schedule length is substantially longer than the minimum length or when chaining is used, the number of variables in the problem formulation grows rapidly, as shown in Tables II and III. Since the execution times of these approaches can be expected to grow rapidly with the number of variables, we believe that heuristic approaches like SALSA will be competitive for a large class of practical synthesis problems.

VI. CONCLUSION

This paper has described a new approach to scheduling with timing constraints that minimizes resource costs. A specialized representation and move set provide a way to quickly explore scheduling alternatives after an initial schedule is found using constraint solution. Simulated annealing provides an effective way to implement this exploration and yields good results in reasonable execution times, especially when chaining is used and when schedule lengths are substantially longer than minimum schedule lengths. Proof that all legal schedules may be reached using the move set provides confidence that the schedule space can be thoroughly explored during annealing. In addition, it provides new insight into the scheduling problem that may be useful in other approaches. Future work will concentrate on improving schedule quality for control-dominated examples, improving annealing performance on large examples, and extending the approach to include support for interconnections, allocation, and more general timing constraints.

ACKNOWLEDGMENT

The authors would like to thank R. Cloutier and the anonymous reviewers for their suggestions for improving this paper, M. McFarland and K. Vissers for helpful discussions concerning scheduling, and R. Rutenbar for helpful discussions concerning simulated annealing.

REFERENCES

[1] M. McFarland, A. Parker, and R. Camposano, “The high-level synthesis of digital systems,” Proc. IEEE, vol. 78, Feb. 1990.
[2] J. Nestor and D. Thomas, “Behavioral synthesis with interfaces,” in Proc. ICCAD-86, pp. 112-115, Nov. 1986.
[3] R. Camposano and A. Kunzmann, “Considering timing constraints in synthesis from a behavioral description,” in Proc. ICCD, pp. 6-9, Oct. 1986.
[4] G. Borriello and R. Katz, “Synthesis and optimization of interface transducer logic,” in Proc. ICCAD-87, pp. 274-277, Nov. 1987.
[5] P. Paulin and J. Knight, “Force-directed scheduling for behavioral synthesis of ASIC’s,” IEEE Trans. Computer-Aided Design, vol. 8, pp. 661-678, June 1989.
[6] C. Hitchcock and D. Thomas, “A method of automatic data path synthesis,” in Proc. 20th DAC, pp. 484-489, June 1983.
[7] C. Tseng and D. Siewiorek, “Automated synthesis of data paths in digital systems,” IEEE Trans. Computer-Aided Design, vol. CAD-5, pp. 379-395, July 1986.
[8] D. Thomas, E. Lagnese, R. Walker, J. Nestor, J. Rajan, and R. Blackburn, Algorithmic and Register-Transfer Level Synthesis: The System Architect’s Workbench. New York: Kluwer Academic, 1990.
[9] E. Girczyc and J. Knight, “An ADA to standard cell hardware compiler based on graph grammars and scheduling,” in Proc. ICCD, pp. 726-731, Oct. 1984.
[10] M. McFarland and T. Kowalski, “Incorporating bottom-up design into high-level synthesis,” IEEE Trans. Computer-Aided Design, vol. 8, pp. 938-950, Sept. 1990.
[11] B. Pangrle and D. Gajski, “Slicer: A state synthesizer for intelligent silicon compilation,” in Proc. ICCD-87, Oct. 1987.
[12] G. Goossens, J. Vandewalle, and H. De Man, “Loop optimization in register-transfer scheduling for DSP-systems,” in Proc. 26th DAC, pp. 826-831, June 1989.
[13] S. Kirkpatrick, C. Gelatt, and M. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, no. 4598, pp. 671-680, May 1983.
[14] S. Devadas and A. R. Newton, “Algorithms for hardware allocation in data path synthesis,” IEEE Trans. Computer-Aided Design, vol. 8, pp. 768-781, July 1989.
[15] M. Quayle and L. Grover, “Pipelined and non-pipelined data path synthesis using simulated annealing,” Progress in Computer Aided VLSI Design, vol. 4, Feb. 1990.
[16] T. Ly and J. Mowchenko, “Applying simulated evolution to scheduling in high-level synthesis,” in Proc. IEEE 33rd Midwest Symp. on Circuits and Systems, 1990.
[17] J. Lee, “A new integer linear programming formulation for the scheduling problem in data path synthesis,” in Proc. ICCAD, pp. 20-23, Nov. 1989.
[18] C. Gebotys and M. Elmasry, “A global optimization approach for architectural synthesis,” in Proc. 28th DAC, pp. 2-7, June 1991.
[19] S. Hayati and A. Parker, “Automatic production of controller specification from control and timing behavioral descriptions,” in Proc. 26th DAC, pp. 75-80, June 1989.
[20] R. Camposano, “Path-based scheduling,” IEEE Trans. Computer-Aided Design, vol. 10, Jan. 1991.
[21] Y. Liao and C. Wong, “An algorithm to compact a VLSI symbolic layout with mixed constraints,” IEEE Trans. Computer-Aided Design, vol. CAD-2, pp. 62-69, Apr. 1983.
[22] G. Borriello, “A new interface specification methodology and its application to transducer synthesis,” Ph.D. dissertation, Univ. of California at Berkeley, May 1988.
[23] D. Ku and G. De Micheli, “Relative scheduling under timing constraints,” in Proc. 27th DAC, pp. 59-64, June 1990.
[24] C. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[25] M. Lorenzetti and D. Baeder, “Routing,” in Physical Design Automation of VLSI Systems, B. Preas and M. Lorenzetti, Eds. Menlo Park, CA: Benjamin-Cummings, 1988.
[26] C. Tseng, R. Wei, S. Rothweiler, M. Tong, and A. Bose, “Bridge: A versatile behavioral synthesis system,” in Proc. 25th DAC, pp. 415-420, June 1988.
[27] K. Wakabayashi and T. Yoshimura, “A resource sharing and control synthesis method for conditional branches,” in Proc. ICCAD-89, pp. 62-65, Nov. 1989.
[28] J. Burns and A. R. Newton, “SPARCS: A new constraint-based IC symbolic layout spacer,” in Proc. CICC, pp. 534-539, May 1986.
[29] A. R. Newton, “Symbolic layout and procedural design,” in Design Systems for VLSI Circuits. Dordrecht, The Netherlands: Martinus Nijhoff, 1987, pp. 65-112.
[30] A. Parker, J. Pizarro, and M. Mlinar, “MAHA: A program for datapath synthesis,” in Proc. 22nd DAC, pp. 461-466, July 1986.
[31] R. Rutenbar, “Simulated annealing algorithms: An overview,” IEEE Circuits Devices Mag., vol. 6, no. 1, Jan. 1989.
[32] M. Huang, F. Romeo, and A. Sangiovanni-Vincentelli, “An efficient general cooling schedule for simulated annealing,” in Proc. ICCAD-86, pp. 381-384, Nov. 1986.
[33] G. Borriello and E. Detjens, “High-level synthesis: Current status and future directions,” in Proc. 25th DAC, pp. 477-482, June 1988.
[34] P. Paulin and J. Knight, “Scheduling and binding algorithms for high-level synthesis,” in Proc. 26th DAC, pp. 1-6, June 1989.
[35] C. Hwang, J. Lee, and Y. Hsu, “A formal approach to the scheduling problem in high-level synthesis,” IEEE Trans. Computer-Aided Design, vol. 10, pp. 464-475, Apr. 1991.
[36] J. Rabaey, C. Chu, P. Hoang, and M. Potkonjak, “Fast prototyping of datapath-intensive architectures,” IEEE Design & Test of Computers, June 1991.
[37] R. Woudsma et al., “One-dimensional linear picture transformer,” U.S. Patent 4 881 192.
[38] W. Verhaegh, E. Aarts, J. Korst, and P. Lippens, “Improved force-directed scheduling,” in Proc. EDAC 91, pp. 430-435, Feb. 1991.



John A. Nestor (S’78-M’87-SM’91) received the B.E.E. degree from the Georgia Institute of Technology in 1979 and the M.S.E.E. and Ph.D. degrees from Carnegie Mellon University, Pittsburgh, PA, in 1981 and 1987, respectively.

Currently he is an Associate Professor of Electrical and Computer Engineering at Illinois Institute of Technology. His research interests include high-level synthesis, visual hardware description languages, and VLSI systems design.

Dr. Nestor received a Best Paper Award at the 22nd International Workshop on Microprogramming and Microarchitecture in 1989 and an NSF Research Initiation Award in 1990. He is a member of Eta Kappa Nu, Tau Beta Pi, and Sigma Xi.

Ganesh Krishnamoorthy received the B.S.E.E., M.S.E.E., and Ph.D. degrees from Illinois Institute of Technology in 1985, 1987, and 1992, respectively.

He is currently a Custom Engineer at Mentor Graphics in Warren, NJ. His research interests include layout compaction and high-level synthesis.

Dr. Krishnamoorthy is a member of Eta Kappa Nu and Tau Beta Pi.