Efficient Module Selections for Finding Highly Acceptable Designs based on Inclusion Scheduling†

Chantana Chantrapornchai Edwin H.-M. Sha Xiaobo (Sharon) Hu

Research report: TR-99-2

Department of Computer Science and Engineering

University of Notre Dame

Notre Dame, IN 46556

email: fcchantra,esha,[email protected]

Phone number: (219) 631-8803

Fax number: (219) 631-9260

Abstract

In high-level synthesis, module selection, scheduling, and resource binding are interdependent tasks. For a selected module set, the best schedule/binding should be generated in order to accurately assess the quality of a module selection. Exhaustively enumerating all module selections and constructing a schedule and binding for each one of them can be extremely expensive. In this paper, we present an iterative framework, called WiZard, to solve the module selection problem under resource, latency, and power constraints. The framework associates a utility measure with each module. This measure reflects the usefulness of the module for a given design goal. Using modules with high utility values should result in superior designs. We propose a heuristic which iteratively perturbs module utility values until they lead to good module selections. Our experiments show that by keeping modules with high utility values, WiZard can drastically reduce the module exploration space (approximately 99.2% reduction). Furthermore, the module selections formed by these modules belong to superior solutions in the enumerated set (top 15%).

Keywords: Inclusion scheduling, Module selections, Design exploration, Acceptable designs

†This work was supported in part by the NSF under grant numbers MIP 95-01006, MIP-9612298 and MIP-9701416.

1 Introduction

In high-level synthesis, scheduling, resource binding and module selection are important phases. Schedul-

ing is the task of determining the start times of operations based on precedence constraints, resource

binding is the explicit mapping between the operations and generic resources (functional units), and

module selection is the resource-type selection problem, in which more than one resource type can

match the functional requirement of an operation type [12]. These three steps are highly dependent on

one another. This is due to the fact that operations sharing the same resource must have the same module

and that the execution delays of the operations depend on the chosen module. For today's IC systems,

the cost of solving the combined scheduling, binding, and module selection problem by exhaustive search

is prohibitive.

A number of research results have been published on the module selection and scheduling/binding problem. Some works focus on finding the optimal module selection. Jain developed an integer linear programming formulation whose coefficient matrix is unimodular, so that an optimal module selection problem can be solved in polynomial time [9]. However, in this formulation, scheduling and resource binding are not considered. Further, all instances of the same operation type are implemented using the same module type. Timmer et al. used a mixed integer linear programming (MILP) approach to find selected modules and then applied resource-constrained list scheduling to check whether the time constraint is met [16]. The number of integer variables of the MILP, nevertheless, increases as the size of the eligible module set increases. Chen and Jeng investigated the problem of finding an optimal module as well as clock cycle selection simultaneously [5]. However, their method explores all possible module selections. Chaudhuri, Blythe and Walker included a module selection scheme for optimal design space characterization [4]. Similar to [5], they exhaustively generated all module selections and schedules to find optimal designs.

Many researchers have proposed heuristics to find satisfactory module selections. Ramachandran and Gajski used a probability table to investigate module selection while constructing a schedule [13]. However, they do not consider resource constraints. Shen and Jong explored the module selection space for low-power designs [15]. Their approach also does not consider resource constraints. Ishikawa and De Micheli proposed an approach to solve a module selection problem with a fixed latency constraint [8]. Their search scheme starts with the modules with the largest area. Then, it iteratively investigates modules with smaller areas until the latency constraint is violated. Depending on the constraint, the approach may, however, need to explore all possible solutions. Moreover, this exploration process assumes a fixed schedule during each iteration of changing modules. Some researchers applied artificial intelligence approaches to the problem of design exploration. Simon and Kobayashi used a knowledge-based system to select satisfactory VLSI designs [6]. They assumed, however, that the schedule has been given. Ahmad, Dhodhi and Chen used a genetic algorithm to help design exploration [2]. Yet, they do not consider latency and power constraints while performing module selection.

Considering the immense module selection space, we focus on designing a technique that helps rapidly reduce the search space. That is, we would like to find a very small set of module selections which includes mostly superior solutions and which can be identified efficiently. We call this set the "elite set". The key to our approach is the introduction of a module utility metric and inclusion scheduling. Our algorithm produces a utility measure for each module. Modules with higher utility values should result in superior designs. For an initial set of utility values, inclusion scheduling efficiently computes an "average" schedule for various module selections. (The meaning of "average" will be explained in Section 3.) Based on the information provided by such a schedule, we use a heuristic to improve module utility assignments. Within a short running time (within seconds), appropriate utility values of given modules can be identified. An elite set is then constructed by selecting the combinations of the modules with high utility values.

We have carried out many experiments and our approach gives promising results. The elite set is

very small compared to the size of enumerated solutions, and the modules in the set lead to very good

designs with respect to the given optimization goal. As an example, for one of the tests on the discrete cosine

transform benchmark, our approach yields an elite set whose size is 0.1% of the enumerated set, and

all of them are in the top 15% after ranking all enumerated solutions according to the given constraints

and optimization criteria.

Finding an elite set rather than a single solution can be very advantageous in the high-level design phase. It is well known that during the high-level design phase, many detailed design decisions have yet to be made. A design may need to be revised several times before it is finalized. By identifying several good solutions and using them as good initial designs, an iterative design process can be readily performed and design time can be saved.

The remainder of the paper is organized as follows: Section 2 presents the models used in our work and the problem definition. Some necessary background concepts are also presented in this section. Section 3 presents our approach in detail. In Section 4, we use examples to illustrate how WiZard works. Section 5 gives experimental results for several benchmarks, showing the derived elite sets and their qualities. Finally, Section 6 concludes this work.

2 Overview and Models

In this section, we first describe our model as well as the problem description. Since developing an inclusion schedule involves some fuzzy arithmetic, we also review some basic concepts in fuzzy computation.

2.1 Model Descriptions

Operations and their dependencies in an application are modeled by a vertex-weighted directed acyclic graph, called a Data Flow Graph, G = (V, E, β), where each vertex in the vertex set V corresponds to an operation and E is the set of edges representing data flow between two vertices. Function β defines the type of operation for node v ∈ V. Figure 1 shows a five-node data flow graph, where V = {A, B, C, D, E}, E = {A → E, B → E, C → E, D → E} (u → v denotes a directed edge from u to v), β(A) = β(B) = β(E) = add, and β(C) = β(D) = multiply.

Figure 1: Test1: Data flow graph example

Operations in a data flow graph can be mapped to different functional units, which in turn can be implemented by different modules. Such a system must also satisfy certain design constraints, for instance, power and cost limitations. These specifications are characterized by a tuple S = (N, F, M, A, Q), where N is the number of functional units allowed in the system, F = {f_i : ∀i ∈ [1, N]} is the set of functional units allowed in the system, e.g., {add, mul}. M = {M_{f_i} | ∀f ∈ F, ∀i ∈ [1, N]}, where each M_{f_i} contains a set of eligible modules for functional unit f_i, e.g., M_{f_1} = {ripple adder, carry-look-ahead adder}. A is a function mapping from M_{f_i} ∈ M to a set of tuples (a_1, ..., a_k), where a_1 to a_k represent attributes of a particular module. In this paper, for simplicity of explanation, we are interested in synthesizing a system under latency and power constraints¹. Hence, A(m) = (a_1, a_2), where a_1 refers to the latency attribute of module m while a_2 refers to the power consumption of module m. Finally, Q is a function that defines the degree to which a system is acceptable for different system attributes. If Q(a_1, ..., a_k) = 0, the corresponding design is totally unacceptable, while if Q(a_1, ..., a_k) = 1, the corresponding design is definitely acceptable.

Using a function Q to define the acceptability of a system is a very powerful model. It can not only define certain constraints but also express certain design goals. For example, suppose one is interested in designing a system with latency and power under 150 and 120 respectively. Also, the smaller the latency and power, the better the system is. The best system would have latency and power under 60. An acceptability function Q(a_1, a_2) for such a specification is formally defined as:

\[
Q(a_1, a_2) =
\begin{cases}
0 & \text{if } a_1 > 150 \text{ or } a_2 > 120 \\
1 & \text{if } a_1 \le 60 \text{ and } a_2 \le 60 \\
\min(F_1(a_1), F_2(a_2)) & \text{otherwise,}
\end{cases}
\tag{1}
\]

where F_1 and F_2 are decreasing linear functions, e.g., F_1(a_1) = −0.011 a_1 + 1.67 and F_2(a_2) = −0.0167 a_2 + 2 respectively, which return an acceptability in (0, 1). Figure 2(a) illustrates Equation (1) graphically.
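As a minimal sketch, the piecewise definition in Equation (1) can be written directly in Python. The function name `linear_acceptability` and the final clamp to [0, 1] are assumptions made for illustration (the clamp only guards against the rounded coefficients slightly overshooting); the thresholds and coefficients are those stated in the text.

```python
def linear_acceptability(latency: float, power: float) -> float:
    """Acceptability Q(a1, a2) following Equation (1)."""
    if latency > 150 or power > 120:
        return 0.0  # violates a hard constraint
    if latency <= 60 and power <= 60:
        return 1.0  # inside the "best" region
    f1 = -0.011 * latency + 1.67   # F1: about 1 at latency 60, about 0 at 150
    f2 = -0.0167 * power + 2.0     # F2: about 1 at power 60, about 0 at 120
    # Assumption: clamp to [0, 1] because the coefficients are rounded.
    return max(0.0, min(1.0, f1, f2))


print(linear_acceptability(105, 60))
```

A design halfway through the latency range with power at its best value is limited by the latency term, since Q takes the minimum of the two linear functions.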

To model the tradeoff between two criteria by acceptability function Q, one may define Q as follows:

\[
Q(a_1, a_2) =
\begin{cases}
0 & \text{if } a_1 > 140 \text{ or } a_2 > 120 \\
1 & \text{if } a_1 \le 30 \text{ and } a_2 \le 30 \\
F(w_1 a_1 + w_2 a_2) & \text{otherwise.}
\end{cases}
\tag{2}
\]

The specification states that any design with latency greater than 140 or power greater than 120 is unacceptable, while any design with latency and power less than 50 is the best. Furthermore, a design with lower latency and power is more desirable. In addition, F considers the tradeoff between the latency and power criteria by using the weighted sum of both latency and power. Figure 2(b) depicts an example of the specification concerning the tradeoff graphically, where w_1 = 2, w_2 = 1. F refers to a z-shaped curve function which produces a smooth transition between two given points. To give a better view, Figure 2(c) shows the projection of the 3-dimensional acceptability model onto the latency and acceptability plane. In this figure, each z curve represents a projection of the Q function onto a latency-acceptability plane. An inner curve (tighter latency constraint) corresponds to larger power values. Based on the acceptability model, a design with high acceptability implies an optimized design towards certain goals.

¹ The area constraint is easy to include in the algorithm. However, to avoid drawing a four-dimensional graph, we omit the area criterion for clarity of presentation.
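The tradeoff model of Equation (2) can be sketched as follows. The text does not give the exact form of F or its two transition points, so this sketch makes two labeled assumptions: F is the common quadratic-spline z-shaped function, and its transition runs from the weighted sum at the "best" corner (30, 30) to the weighted sum at the limit corner (140, 120). The names `z_curve` and `tradeoff_acceptability` are hypothetical.

```python
def z_curve(x: float, a: float, b: float) -> float:
    """Z-shaped spline: 1 for x <= a, 0 for x >= b, smooth in between (a < b)."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    mid = (a + b) / 2
    if x <= mid:
        return 1.0 - 2.0 * ((x - a) / (b - a)) ** 2
    return 2.0 * ((x - b) / (b - a)) ** 2


def tradeoff_acceptability(latency: float, power: float,
                           w1: float = 2.0, w2: float = 1.0) -> float:
    """Acceptability following Equation (2), with w1 = 2, w2 = 1 as in Figure 2(b)."""
    if latency > 140 or power > 120:
        return 0.0
    if latency <= 30 and power <= 30:
        return 1.0
    start = w1 * 30 + w2 * 30     # assumption: transition starts at the best corner
    end = w1 * 140 + w2 * 120     # assumption: transition ends at the limit corner
    return z_curve(w1 * latency + w2 * power, start, end)


print(tradeoff_acceptability(80, 85))
```

Because latency carries twice the weight of power, a given increase in latency lowers the acceptability more than the same increase in power.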

Figure 2: Various kinds of acceptability functions. (a) Linear acceptability; (b) z-curve acceptability with tradeoff; (c) latency acceptability curves corresponding to different power constraints derived from Figure 2(b). Axes in (a) and (b): latency, power, acceptability; axes in (c): latency, acceptability.

In this paper, the combined scheduling/binding and module selection problem we intend to solve can be formulated as follows: Given a specification S = (N, F, M, A, Q) and a graph G = (V, E, β), select modules m ∈ M_{f_i} for each functional unit f_i to execute graph G while maximizing the acceptability degree of the resulting solutions.

Since there are complex interactions among scheduling, binding, and module selection, it is not trivial to pick the right module combination which results in a high acceptability value. A straightforward way to solve this problem would be to completely enumerate all module selections and generate the best schedule and binding for each one of them. By comparing the acceptability degrees of all module combinations, the one with the highest acceptability degree is chosen. However, if there are 10 possible adders and 10 possible multipliers to choose from, and only 3 adder units and 3 multiplier units are allowed in the system, the number of solutions to explore would be 10^6. Thus, it would be a very time-consuming task to investigate every design solution. In our method, we efficiently derive an elite set in which superior solutions lie. Since the size of an elite set is small, investigating all of them will take much less time. Therefore, one can rapidly select good module selections.
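The count in this example follows from a simple product, assuming each of the six functional units is assigned one of its 10 eligible modules independently of the others:

\[
\underbrace{10 \times 10 \times 10}_{\text{3 adder units}} \times \underbrace{10 \times 10 \times 10}_{\text{3 multiplier units}} = 10^{3} \times 10^{3} = 10^{6}.
\]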

2.2 Fuzzy Sets

In this section, we give a quick review of fuzzy set theory as it relates to our work. Readers familiar with

the theory can skip this part.

Fuzzy sets, proposed by Zadeh, represent a set with an imprecise boundary [17, 18]. In classical (crisp) sets, an element can either be a member of a set or not at all; hence, its membership degree is either 1 or 0. A fuzzy set is defined by assigning each element x in a universe of discourse its membership degree in the unit interval [0, 1], conveying to what degree x is a member of the set. This membership value can be defined as a membership function of an element in the set, µ(x) : x → [0, 1].

A fuzzy set is said to be normal if there exists at least one member in the set whose membership value is unity. A convex fuzzy set is defined as: for any x, y, and z in the fuzzy set A, the relation x < y < z implies that µ_A(y) ≥ min(µ_A(x), µ_A(z)). A fuzzy number is a normal, convex fuzzy set defined on the real line R. Let A and B be fuzzy numbers with membership functions µ_A(x) and µ_B(y), respectively. Let ∗ be any of the binary operations {+, −, ×, ÷, min, max}. The arithmetic operation between two fuzzy numbers, A ∗ B with membership function µ_{A∗B}(z), can be defined using the extension principle [7]:

\[
\mu_{A \ast B}(z) = \bigvee_{z = x \ast y} \bigl( \mu_A(x) \wedge \mu_B(y) \bigr)
\tag{3}
\]

where ∨ and ∧ denote the max and min operations respectively.

Fuzzy arithmetic is used to compute an arithmetic operation between two fuzzy numbers. Figure 3(a) shows a fuzzy number A, which is assumed to be a normal, triangular-shaped fuzzy number lying on the real line. In this figure, A is assigned the confidence interval (2, 6). The most possible value of A is 4 since its confidence level or presumption level is 1. Similarly, Figure 3(b) shows the fuzzy number B with the confidence interval (3, 7). Figure 3(c) demonstrates a graphical result of adding the two fuzzy numbers, defined on the integer line, from Figures 3(a) and 3(b), using Equation (3).
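The addition described here can be reproduced with a short sketch of Equation (3) over discrete supports. The helper names `triangular` and `fuzzy_apply` are hypothetical, and the triangular shapes are sampled on the integers, so intermediate membership values are coarser than on a continuous line.

```python
from itertools import product


def triangular(lo: int, peak: int, hi: int) -> dict[int, float]:
    """Discrete triangular fuzzy number on the integers lo..hi (lo < peak < hi)."""
    mu = {}
    for x in range(lo, hi + 1):
        if x <= peak:
            mu[x] = (x - lo) / (peak - lo)
        else:
            mu[x] = (hi - x) / (hi - peak)
    return mu


def fuzzy_apply(a: dict, b: dict, op) -> dict:
    """Extension principle: mu(z) = max over z = op(x, y) of min(mu_a(x), mu_b(y))."""
    out = {}
    for (x, mx), (y, my) in product(a.items(), b.items()):
        z = op(x, y)
        out[z] = max(out.get(z, 0.0), min(mx, my))
    return out


A = triangular(2, 4, 6)   # confidence interval (2, 6), most possible value 4
B = triangular(3, 5, 7)   # confidence interval (3, 7), most possible value 5
C = fuzzy_apply(A, B, lambda x, y: x + y)
print(sorted(C.items()))
```

The result spans 5 to 13 and reaches membership 1 at 9, the sum of the two most possible values. Passing `min` or `max` as `op` gives the fuzzy minimum or maximum in the same way.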

Figure 3: Adding two fuzzy numbers, A and B. (a) A, triangular over (2, 6) with peak at 4; (b) B, triangular over (3, 7) with peak at 5; (c) A + B, triangular over (5, 13) with peak at 9.

In order to compare two fuzzy numbers, several methods can be used. All of these methods are based on selecting a representative for each fuzzy number and comparing the representatives [10]. One way to obtain the representatives is using the removal with respect to k, which is a measure of distance from k, computed by R(A, k) = ½ (R_l(A, k) + R_r(A, k)), where A is a fuzzy number, k is a reference position on the x-axis, R_l is the area bounded by the left side of the curve and the line x = k, and similarly R_r for the right side. Another is the mode, which uses the value x such that µ(x) = max_i {µ(x_i)} for all x_i in the fuzzy set. Divergence is another way to calculate the representatives. It represents the width of the set, which is computed by x_max − x_min. In addition, the defuzzified value can be used to represent the fuzzy set. Several defuzzification methods can be found in [14].
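Two of these representatives are easy to sketch for a discrete fuzzy set stored as a value-to-membership dictionary. The function names are hypothetical, divergence is computed over points with nonzero membership (an assumption about what belongs to the set), and the centroid is shown only as one common defuzzification choice, not necessarily the one used in [14] or in this work.

```python
def mode(mu: dict) -> float:
    """Value with the highest membership degree."""
    return max(mu, key=mu.get)


def divergence(mu: dict) -> float:
    """Width x_max - x_min over points with nonzero membership (assumption)."""
    support = [x for x, m in mu.items() if m > 0]
    return max(support) - min(support)


def centroid(mu: dict) -> float:
    """Centroid defuzzification: membership-weighted mean of the values."""
    return sum(x * m for x, m in mu.items()) / sum(mu.values())


# A coarse sample of a triangular fuzzy number peaking at 9.
fuzzy_sum = {5: 0.0, 7: 0.5, 9: 1.0, 11: 0.5, 13: 0.0}
print(mode(fuzzy_sum), divergence(fuzzy_sum), centroid(fuzzy_sum))
```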

3 WiZard Framework

In this section, we present our approach to solving the module selection problem. The key to the approach

is the introduction of utility values. Each module is associated with a utility value, which represents the

usefulness of a module. Ideally, the utility values of modules are either 1 or 0. The design using those

modules with utility value of 1 should be of the highest quality. However, in reality, the usefulness of a

module is dependent on others. The module may be present in good designs which optimize a certain

goal and/or bad designs that do not satisfy the design goal. In this case, we allow the utility value of a

module to be any real number between 0 and 1 to represent this ambiguity. As an input to the framework,

initial utility values can be assigned based on designers' experience.

The operation of the framework at a high level is straightforward. First, a designer gives initial utility values. Then, the framework improves them until they stabilize. The utility of each module implies that combinations of modules with a high degree of utility will likely give schedules that yield high acceptability. Therefore, such solutions form an elite set. By inspecting the solutions in the set, the best solution can be identified and then used to implement a final system.

Figure 4 depicts the WiZard framework in detail. It consists of two main steps, scheduling and utility value improvement. The first phase, inclusion scheduling, takes a data flow graph, the number of functional units required in a target system, and a module set as well as their associated utility values as inputs and constructs a general schedule. It is important to point out that only one schedule is created for the entire module set. The schedule may not be the same as any schedule that one would obtain if any single combination of modules is used. Nevertheless, the resulting latency and power of the schedule give a close approximation to the latencies and powers generated by the schedules of all module combinations [3]. We, therefore, use the scheduling process as a means to obtain important information for further assessment of the modules' usefulness.

Figure 4: WiZard module selection framework. Inputs: the DFG, the number of functional units, and the module set with its initial utility assignment. Stages: inclusion scheduling, utility calculation, and utility value improvement, repeated while improvement is necessary. Outputs: the schedule and the module selection.

Specifically, while calculating a schedule, the latency, the power, and the corresponding module usage are recorded. These data are then used as inputs to the second phase: the utility value improvement. In this step, the utility value of each module is adjusted. For a given module, the acceptability of every latency and power pair that the module contributes to is analyzed. Intuitively, if a module causes a lot of unacceptable latency and power values, the utility of the module should be decreased. On the other hand, if a module contributes to a lot of high-acceptability latency and power values, the module's utility value should be increased. The statistics of a module's usage for each latency and power value are used as a scaling factor on the acceptability value, signifying the module's contribution. Based on this idea, we have developed a heuristic to compute the relative adjustment of a utility value. The adjustment is then applied to update the previous utility value. The two-step process is repeated until the adjustment values converge to zero. The experimental results show that the average number of iterations is approximately eleven.

3.1 Inclusion Scheduling

In order to construct a general schedule based on a utility assignment, we borrow some techniques from fuzzy theory. In particular, we model modules and their respective utility values as a fuzzy set with respect to the corresponding functional unit. For each functional unit f and its eligible module set M_f, let µ_f(m) ∈ [0, 1], ∀m ∈ M_f, describe the utility value of module m. For instance, consider an application with only addition and multiplication operations, in which only 2 adders and 1 multiplier are allowed in any implementation (3 functional units f_1, f_2, f_3). Assume that there are 10 possible adder modules ADD = {add1, add2, ..., add10} and 10 possible multiplier modules MUL = {mult1, mult2, ..., mult10}. Let µ_{f_i}(m) be the utility value of module m for functional unit f_i. Note that m ∈ ADD for f_1, f_2 and m ∈ MUL for f_3. It follows that µ_{f_i}(m) ∈ [0, 1] can be considered as the membership function of f_i with respect to the universe M_f. This concept also implicitly shows that a functional unit has fuzzy execution times and powers.

Inclusion scheduling is a scheduling method which takes fuzzy characteristics into consideration, e.g., the fuzzy set of varying latency (power) values associated with each functional unit. The output schedule, in turn, also consists of fuzzy attributes. The actual steps of inclusion scheduling are given in Algorithm 3.1. In a nutshell, inclusion scheduling simply replaces the computation of accumulated execution times in a traditional scheduling algorithm by fuzzy arithmetic-based computation. In our case, both latency and power attributes are fuzzy numbers. Hence, fuzzy arithmetic (specifically Equation (3)) is used to compute possible latencies and powers from the given functional specification. Then, latencies and powers of different schedules are compared to select a functional unit for scheduling an operation. Note that fuzzy arithmetic can be embedded in different scheduling algorithms. However, a tradeoff exists between the complexity and quality of different scheduling algorithms. In this paper, we use a general version of the list scheduling algorithm [1] as our scheduling framework due to its efficiency.

In a list-based scheduling algorithm, we attempt to construct a schedule that minimizes the total latency and average power. The algorithm traverses the graph in topological order. It puts all ready nodes (vertices whose parents are already computed) into a queue Q. The nodes in the queue are then rearranged by function prioritized in Line 3. A ready node is extracted from the list one at a time. The algorithm tries to schedule the node in each available functional unit. Routine assign_heuristic(S, u, f) at Line 7 finds a legal scheduling position for node u at functional unit f based on the schedule S. A simple heuristic can be to assign node u after the last node in the unit f or to try all possible legal positions in the unit f. After a node is scheduled to a functional unit, the schedule latency must be updated and, therefore, the average power consumption needs to be recalculated.

Algorithm 3.1 (Inclusion scheduling)

Input: G = (V, E, β), and specification Spec = (N, F, M, A, Q)

Output: A schedule S, latency and power

     1  Q = vertices in G with no incoming edges      // finding root nodes
     2  while Q ≠ empty do
     3      Q = prioritized(Q)
     4      u = dequeue(Q); mark u scheduled
     5      good_S = NULL
     6      foreach f ∈ {f_j : f_j is able to perform β(u)} do
     7          temp_S = assign_heuristic(S, u, f)     // assign u at FU f
     8          if Eval_Schedule(temp_S, good_S, G, Spec)
     9              then good_S = temp_S fi od
    10      S = good_S                                 // keep good schedule
    11      foreach v : (u, v) ∈ E do
    12          indegree(v) = indegree(v) − 1
    13          if indegree(v) = 0 then enqueue(Q, v) fi od
    14  od
    15  return(S)

Routine Eval_Schedule (Line 8) updates the total latency and power of the schedule after node u is assigned to f. It also compares the current schedule with the "best" one found in previous iterations. The better of the two is then chosen and the process is repeated for all nodes in the graph.
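The control flow of Algorithm 3.1 can be sketched in Python with a crisp stand-in for the fuzzy evaluation. This is only an illustration under stated assumptions: the ready queue is served in FIFO order (the prioritization step is omitted), the placement heuristic appends each node after the last node of a unit, and the comparison callback `better` and the cost `load_cost` are hypothetical replacements for Eval_Schedule, which in the framework compares fuzzy latency and power.

```python
from collections import deque


def list_schedule(nodes, edges, op_type, units, better):
    """Bind each node to one compatible unit, visiting nodes in topological order.

    better(s1, s2) returns True when schedule s1 is preferable to s2; it is a
    crisp stand-in for Eval_Schedule and is an assumption of this sketch.
    """
    indegree = {v: 0 for v in nodes}
    successors = {v: [] for v in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    ready = deque(v for v in nodes if indegree[v] == 0)  # root nodes
    schedule = {f: [] for f in units}
    while ready:
        u = ready.popleft()  # FIFO order; the prioritization step is omitted
        best = None
        for f, kind in units.items():
            if kind != op_type[u]:
                continue  # unit f cannot perform this operation type
            candidate = {g: list(ops) for g, ops in schedule.items()}
            candidate[f].append(u)  # simple heuristic: place u after the last node
            if best is None or better(candidate, best):
                best = candidate
        if best is None:
            raise ValueError(f"no functional unit can execute {u!r}")
        schedule = best
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return schedule


def load_cost(s):
    # Hypothetical crisp quality measure: sum of squared per-unit loads.
    return sum(len(ops) ** 2 for ops in s.values())


nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "E"), ("B", "E"), ("C", "E"), ("D", "E")]
op_type = {"A": "add", "B": "add", "E": "add", "C": "mul", "D": "mul"}
units = {"f1": "add", "f2": "mul", "f3": "mul"}
result = list_schedule(nodes, edges, op_type, units,
                       lambda s1, s2: load_cost(s1) < load_cost(s2))
print(result)
```

On the five-node graph of Figure 1 with one adder and two multipliers, this sketch places A, B, and E on f1 and spreads C and D over f2 and f3.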

Algorithm 3.2 (Eval_Schedule)

Input: schedules S1, S2, G = (V, E, β) and specification Spec = (N, F, M, A, Q)

Output: 1 if S1 is better than S2, 0 otherwise.

     1  G' = (V', E', β) where V' = V − {unscheduled nodes}; E' = ∅
     2  foreach schedule S_i = S1 to S2 do
     3      E' = {(u, v) : u, v ∈ V', u and v are in the same f.u. in S_i
     4            and v is immediately after u}
     5      Sort graph G according to the topological order
     6      // total, total_tp_i and quality[i]: fuzzy sets of latency and power
     7      total = ∅
     8      foreach level i of graph G in topological order do
     9          // ψ(u) returns the functional unit binding for u
    10          total_tp_i = Σ_{u ∈ V_i} fuzzymax_time_add_energy(TP(ψ(u)))
    11          total = fuzzyadd_time_add_energy(total, total_tp_i)
    12      od
    13      quality[S_i] = total od      // quality of the schedule i
    14  // comparing the overall attributes of both schedules
    15  return(compare(quality[S1], quality[S2]))

Algorithm 3.2 computes the fuzzy set of latency and power of a schedule. The algorithm adds edges between two consecutive nodes in the same functional unit to construct a legal execution order of all nodes. In order to simplify the calculation, the latency and power values are discretized. Using Equation (3), with z replaced by an execution time and power tuple, we compute the utility degree of each latency and power pair. This value simply reflects the utility value of the modules contributing to the latency and power pair. Line 10 replaces a maximum operation with a fuzzy maximum in order to compute the maximum execution time of the nodes in the same control step. For the power value, we calculate the average power consumption per time unit by using the formula [11]

\[
P = \frac{\sum_k n_k p_k t_k}{T}
\tag{4}
\]

where T is the schedule length, p_k and t_k are the power consumption and execution time of functional unit f_k, and n_k is the number of operations executed by functional unit f_k. Since power consumption is closely related to latency, each power value needs to be associated with a latency value. While calculating a latency value in both fuzzymax_time_add_energy and fuzzyadd_time_add_energy, the energy (t × p) is accumulated. Note that for the purpose of utility value updates, the modules causing each latency and power value need to be recorded. In order to do so, both the fuzzymax_time_add_energy and fuzzyadd_time_add_energy routines also collect module references while adding/maximizing latencies and powers (we will discuss the utility value update in Section 3.2). Depending on the assign heuristic, the values total_i may be kept statically for some control step i to save the computation of fuzzy addition and maximum for each iteration in Line 6 of Algorithm 3.1.

In routine compare, the accumulated energy is first divided by the corresponding latency to calculate a power consumption. Fuzzy arithmetic is also used here for comparing two fuzzy sets of latency and power. Based on a design goal, the routine defines a heuristic used to decide the best intermediate schedule to keep (between quality[S1] and quality[S2]).

    Adder f1                         Multipliers f2, f3
    Module  Time  Power  Util.       Module  Time  Power  Util.
    a0      5     20     0.2         m0      5     100    0.2
    a1      10    15     0.7         m1      10    37.5   0.5
    a2      25    10     1           m2      20    23     0.7
                                     m3      30    10     1

Table 1: Adder, multiplier modules and their utility values

Using Algorithm 3.2, one can easily add more criteria by modifying Lines 10 and 11. For instance, if minimizing area is another criterion, the area calculation is added to Lines 10 and 11 (in routines fuzzymax_time_add_energy and fuzzyadd_time_add_energy).

Let us use an example to illustrate how inclusion scheduling works. Consider the example in Section 2, Figure 1. Nodes C and D are multiplication nodes while nodes A, B, and E are addition nodes. Suppose a potential implementation consists of 3 functional units (f_1, f_2, and f_3, where f_1 is an adder and f_2, f_3 are multipliers). Table 1 shows execution times and powers for each adder and multiplier module as well as their initial utility assignment. This assignment suggests that a2, m3 and m3 should be the best module selection for f_1, f_2 and f_3 respectively. To compare the quality of schedules, the compare function needs to be defined according to a design goal. Suppose our design goal is to optimize both latency and power criteria and we consider a tradeoff between latency and power (similar to the acceptability model in Figure 2(b)) where the power to time ratio is 2/1. In the compare routine, we use this ratio to compute the weighted sum of both latency and power. Then, we defuzzify the weighted sum of latency and power and compare the results to select an intermediate schedule.

    f1 (+)   f2 (×)   f3 (×)
    A        C        D
    B
    E

Figure 5: Schedule for Test1 DFG (Figure 1)

Applying inclusion scheduling to the above example, we obtain a schedule as shown in Figure 5. The possible latency and average power of this schedule are estimated as shown in Figure 6(a). In Figure 6(b), the z axis shows the resulting utility value for each latency and power pair based on the initial module utility assignment. These values are derived from Algorithm 3.2. Note that the point with minimum power and maximum latency (latency = 80, power = 17, as noted in the square box in Figure 6(b)) has the maximum utility value (1), which corresponds precisely to the initial utility assignment that prefers the modules with the minimum power (a2, m3, m3). If the system requires that the latency must be less than 50, some of the designs with high utility values are no longer acceptable. Furthermore, given that the tradeoff criterion of power to latency is 2/1, we need to adjust the utility assignment. This will be discussed in the next subsection.
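The highlighted point (latency = 80, power = 17) can be checked with crisp values for the single selection (a2, m3, m3). The sketch below assumes the level-by-level evaluation of Algorithm 3.2, with the schedule of Figure 5 read as three control-step levels {A, C, D}, {B}, {E}; that reading and the variable names are assumptions for illustration. Times and powers come from Table 1, and the average power follows Equation (4).

```python
# Crisp (time, power) attributes of the selected modules, taken from Table 1.
modules = {"f1": (25, 10), "f2": (30, 10), "f3": (30, 10)}  # a2, m3, m3

# Schedule of Figure 5 as control-step levels: each level maps node -> unit.
levels = [
    {"A": "f1", "C": "f2", "D": "f3"},
    {"B": "f1"},
    {"E": "f1"},
]

latency = 0
energy = 0
for level in levels:
    # A level lasts as long as its slowest operation (crisp analogue of fuzzy max).
    latency += max(modules[f][0] for f in level.values())
    # Energy accumulates time x power for every operation in the level.
    energy += sum(modules[f][0] * modules[f][1] for f in level.values())

power = energy / latency  # Equation (4): total energy over schedule length
print(latency, energy, power)
```

Under these assumptions the latency is 30 + 25 + 25 = 80 and the average power is 1350 / 80 = 16.875, which rounds to the value 17 quoted in the text.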

Figure 6: Latency and power of schedule in Figure 5. (a) Latency and power; (b) latency, power and utility values.

The time complexity of the inclusion scheduling algorithm is simply the time complexity of list scheduling multiplied by the fuzzy computation complexity. In Algorithm 3.2, the time complexity of fuzzy addition and fuzzy maximum depends on the number of discrete points each total_i retains. To limit the number of points, we allow at most K discrete points for each functional unit. Then, the time complexity of the addition and maximum operations is O(K²). Hence, the time complexity of computing the overall fuzzy latency and power is O((|V'| + |E'|)K²), while the time complexity of the compare routine is O(K). Altogether, the time complexity of the expanded version of list scheduling is O((|V| + |E|)² K²), where O(|V| + |E|) is the list scheduling time complexity.

3.2 Utility Adjustment

Recall that the utility values of modules should reflect the usefulness of the modules towards a design goal. However, the initial assignment of utility values may not satisfy the given design goal. We use the utility adjustment phase to modify the utility values in order to find the most appropriate utility assignments. In other words, we attempt to give high utility values to modules that contribute to highly acceptable latency and power pairs, and low utility values to modules that contribute to latency and power pairs with low acceptability values. As stated earlier, the calculation of the schedule latency and power (Lines 10 and 11 in Algorithm 3.2) also tallies the number of module references for each latency and power value. Let the number of references to a module $m$ by functional unit $f$ be $freq(f,m)$; for a given latency and power pair $(t,p)$, this count is written $freq_{t,p}(f,m)$. We compute the positive contribution of $m$ by

$$\delta^{+}(f,m) = \sum_{\forall (t,p)\ \text{s.t.}\ \mu(t,p) = \mu_f(m)} freq_{t,p}(f,m) \times acc(t,p) \tag{5}$$

The condition "$\forall (t,p)$ s.t. $\mu(t,p) = \mu_f(m)$" ensures that only the $(t,p)$ pairs caused by module $m$ in $M_f$ are considered when calculating the adjustment value for the module. From Equation (5), a higher $\delta^{+}(f,m)$ value indicates that using $m$ can potentially lead to systems with a higher acceptability value. We also compute the negative contribution of $m$ by

$$\delta^{-}(f,m) = \sum_{\forall (t,p)\ \text{s.t.}\ \mu(t,p) = \mu_f(m)} freq_{t,p}(f,m)\,(1 - c \times acc(t,p)) \tag{6}$$

Equation (7) is then used to estimate an adjustment of the utility value for each module of functional unit $f$.

$$adj_f(m) = \frac{c\,\delta^{+}(f,m) - \delta^{-}(f,m)}{c\,\delta^{+}(f,m) + \delta^{-}(f,m)} \tag{7}$$

The term $adj_f(m)$ is a relative change to the current $\mu_f(m)$ value. From Equation (7), if $adj_f(m)$ is negative, the module tends to cause more bad latency and power pairs, and $\mu_f(m)$ should be decreased. On the other hand, if $adj_f(m)$ is positive, $\mu_f(m)$ is increased.

The scalar $c$ in Equations (6) and (7) tunes the utility values to avoid the case where $adj_f(m)$ is negative for every $m$. Normally, $c$ is equal to one. Under a very tight constraint (upper and lower limits close to each other), the maximum acceptability obtained over all latency and power values is quite small, for instance less than 0.6. It is then likely that $\delta^{-}$ is much bigger than $\delta^{+}$ for every module, so all $adj$ values become negative. In this case, $c$ has to be greater than one so as to scale down this difference (i.e., scale up the acceptability values). Note that, for consistency, the same $c$ is used for all modules $m$ of all functional units $f$. The value of $c$ can be obtained experimentally: after running inclusion scheduling once, we find the latency and power pair that gives the maximum acceptability value (MaxAcc), and $c$ can be estimated from MaxAcc. For instance, when the acceptability values of latencies and powers are normalized to one, $c = 1/MaxAcc$.
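Equations (5) to (7) can be sketched in a few lines. The sketch below assumes that, for one module of one functional unit, the reference counts and acceptability values are available as dictionaries keyed by the $(t,p)$ pairs attributed to that module; the function names and the sample counts are hypothetical, and only the two $(t,p)$ pairs are taken from Example I.

```python
def contributions(freq, acc, c=1.0):
    """Return (positive, negative) contributions of one module.

    freq: {(t, p): number of references to the module for that pair}
    acc:  {(t, p): acceptability of that pair, in [0, 1]}
    """
    positive = sum(n * acc[tp] for tp, n in freq.items())            # Eq. (5)
    negative = sum(n * (1.0 - c * acc[tp]) for tp, n in freq.items())  # Eq. (6)
    return positive, negative


def adjustment(positive, negative, c=1.0):
    """Relative utility adjustment of Eq. (7)."""
    denominator = c * positive + negative
    if denominator == 0:
        return 0.0  # module never referenced: leave its utility alone
    return (c * positive - negative) / denominator


# Made-up counts: 6 references on an acceptable pair, 2 on a rejected one.
freq = {(90, 66): 6, (202, 13): 2}
acc = {(90, 66): 1.0, (202, 13): 0.0}
pos, neg = contributions(freq, acc)
print(adjustment(pos, neg))  # (6 - 2) / (6 + 2) = 0.5

# Tight constraint: every pair has acceptability 0.5, so c = 1 gives adj = 0
# while c = 1 / MaxAcc = 2 rescales the same data to adj = 1.
tight_freq, tight_acc = {(120, 40): 4}, {(120, 40): 0.5}
print(adjustment(*contributions(tight_freq, tight_acc, c=1.0), c=1.0))
print(adjustment(*contributions(tight_freq, tight_acc, c=2.0), c=2.0))
```

The last two lines illustrate the role of $c$: with the same counts, raising $c$ moves the adjustment away from the negative side.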

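In our experiments, we have used the following method to calculate the new $\mu_f(m)$.

$$\mu^{new}_f(m) = \begin{cases} \mu_f(m) \times adj_f(m) + \mu_f(m) & \text{if } 0 < adj_f(m) \le 1 \\[4pt] \dfrac{\mu_f(m)}{2} + (1 + adj_f(m)) \times \dfrac{\mu_f(m)}{2} & \text{if } -1 \le adj_f(m) < 0 \end{cases} \tag{8}$$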

Since the value of $adj_f(m)$ always lies in $[-1, 1]$, we double the value of $\mu_f(m)$ if $adj_f(m)$ equals 1, and halve it if $adj_f(m)$ equals $-1$. If $adj_f(m)$ is in $(-1, 0]$, the change of $\mu_f(m)$ is proportional to half of $\mu_f(m)$, and if $adj_f(m)$ is in $(0, 1)$, the change of $\mu_f(m)$ is proportional to $\mu_f(m)$. After the adjustment for all modules is made, the $\mu^{new}_f(m)$ values are normalized with respect to the highest one, i.e., $norm(\mu_f(m)) = \mu^{new}_f(m) / \max_{m} \mu^{new}_f(m)$, $\forall m \in M_f$. If $\mu^{new}_f(m)$ is the same as $\mu_f(m)$ from the previous iteration for every $m$, no further adjustment is needed.
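The update and normalization steps can be sketched as follows. This is an illustrative reading of Equation (8), not the authors' implementation: the function names are hypothetical, and the case $adj_f(m) = 0$, which neither branch of Equation (8) covers explicitly, is assumed to leave the utility unchanged (both branches agree on that value).

```python
def update_utility(mu, adj):
    """Apply Eq. (8) to one utility value, with adj in [-1, 1]."""
    if not -1.0 <= adj <= 1.0:
        raise ValueError("adj must lie in [-1, 1]")
    if adj > 0:
        return mu * adj + mu                  # up to doubling at adj = 1
    if adj < 0:
        return mu / 2 + (1 + adj) * mu / 2    # down to halving at adj = -1
    return mu                                 # assumption: adj = 0 keeps mu


def normalize(utilities):
    """Scale the utilities of one functional unit so the largest is 1."""
    top = max(utilities.values())
    if top == 0:
        return dict(utilities)
    return {module: value / top for module, value in utilities.items()}


def adjust_unit(utilities, adjustments):
    """One adjustment iteration for all modules of one functional unit."""
    updated = {m: update_utility(u, adjustments[m]) for m, u in utilities.items()}
    return normalize(updated)


# Hypothetical adjustments for two multiplier modules.
print(adjust_unit({"m2": 0.6, "m5": 1.0}, {"m2": 1.0, "m5": -1.0}))
```

In this made-up case m2 doubles to 1.2 and m5 halves to 0.5, so after normalization m2 becomes the module with utility one, the same kind of switch reported for $f_2$ in Example I.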

We have investigated other adjustment policies. For instance, $adj$ can be an absolute change to $\mu_f(m)$. This method does not give good results because $\mu_f(m)$ changes too fast and all of the values may quickly become one, which is undesirable since we can then no longer distinguish the quality of the modules. Though the proposed adjustment method does not suffer from this problem, it has a drawback: when $\mu_f$ is decremented repeatedly, the value tends to converge to zero. Once $\mu_f(m)$ is zero, the scheduling approach disregards the module for efficiency, and the module is therefore removed from the set permanently. Often, a module whose utility is repeatedly decreased is indeed not a good one. Nevertheless, several modifications can be made to fix this problem. The simplest would be to slow down the reduction rate, to ensure that a module that is completely ignored is definitely not a good one. From another point of view, this approach eliminates unwanted modules iteratively. Furthermore, every time a module is eliminated, inclusion scheduling runs faster because the number of possible latency and power values is reduced.

4 Illustrated Examples

In this section, we present an example to show how WiZard works. Consider the module set shown in Figure 7(a) and the data flow graph shown in Figure 7(b). Assume that we consider a system with 3 functional units: 1 adder and 2 multipliers. We will show that WiZard can consistently produce proper elite sets for various design goals.

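| Modules | Time | Power | Initial |
|---|---|---|---|
| a0 | 7 | 40 | 0.1 |
| a1 | 15 | 30 | 0.2 |
| a2 | 20 | 25 | 0.3 |
| a3 | 25 | 15 | 0.7 |
| a4 | 35 | 10 | 0.9 |
| a5 | 40 | 5 | 1 |
| m0 | 8 | 75 | 0.2 |
| m1 | 15 | 55 | 0.4 |
| m2 | 25 | 40 | 0.6 |
| m3 | 35 | 30 | 0.7 |
| m4 | 45 | 20 | 0.9 |
| m5 | 55 | 10 | 1 |
| m6 | 60 | 5 | 0.05 |

(a) Modules

[Figure 7(b), graph not reproduced: the Test2 DFG with nodes A to I, each marked as either a multiplication or an addition.]

(b) Test2 DFG

Figure 7: An example of module set and data flow graph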

Example I

First, let us consider the case where the goal is simply to satisfy latency and power constraints. We define $Q(t,p)$ as:

$$Q(t,p) = \begin{cases} 0 & \text{if } t > 200 \text{ or } p > 80 \\ 1 & \text{if } t \le 200 \text{ and } p \le 80. \end{cases}$$

In the initial assignment, a5 is the most favorable module for $f_1$ and m5 is the most favorable one for both $f_2$ and $f_3$ (see the last column in Figure 7(a)). Notice that if (a5, m5, m5) were used, the selection would lead to an unacceptable system (latency equals 245 and power equals 14, by the list-based scheduling algorithm). Therefore, utility adjustments are necessary.

During inclusion scheduling, the number of times each module is referred to, denoted by $freq_{(t,p)}$, is recorded. These frequencies are associated with each functional unit and are collected while scheduling each node of the input graph. Figure 8 presents a sample of $freq_{(t,p)}$ values for $f_2$ at the latency and power pairs $(t,p)$ of (90,66) and (202,13). In Figure 8(a), which corresponds to the pair (90,66), only modules m1, m2, and m3 contribute. Since this pair is acceptable, the utility values of m1, m2, and m3 for $f_2$ should be increased. On the other hand, Figure 8(b) corresponds to the pair (202,13), which is unacceptable. Many modules may have contributed to this pair; however, Figure 8(b) shows that module m6 contributes the most. Based on this observation, the utility value of m6 should be decreased. Combining the $freq_{(t,p)}$ information for all latency and power pairs $(t,p)$, the adjustments are determined according to Equations (5)–(8).

Figure 9(a) presents the evolution of the utility values for the adder modules of $f_1$, and Figure 9(b) presents the change of the utility values for the multiplier modules of $f_2$ and $f_3$. Rows "1st" and "2nd" show the utility values after adjusting and normalizing the utility values for the respective iteration.

[Figure 8, bar charts not reproduced. Each panel plots freq for multiplier modules m0 to m6. (a) freq(t,p) for latency and power pair (90,66). (b) freq(t,p) for latency and power pair (202,13).]

Figure 8: Sample freq(t,p) information for multiplier modules for f2 collected during the first iteration in Example I

After the second iteration, the utility values for $f_1$ have stabilized. For $f_2$ and $f_3$, the module with utility value one has changed from m5 to m2 in the second iteration. In the remaining iterations, the utility values of multiplier modules with large execution times keep decreasing. Since modules m5 and m6 contribute significantly to specifications that violate the constraint, they are removed from the eligible set after eleven adjustment iterations. The row "final" of Figure 9 shows the final utility values of choosing each module for each functional unit.

| Func. unit | Iteration | a0 | a1 | a2 | a3 | a4 | a5 |
|---|---|---|---|---|---|---|---|
| f1 | 1st | 0.1 | 0.2 | 0.3 | 0.7 | 0.9 | 1 |
| | 2nd | 0.16 | 0.39 | 0.59 | 1 | 0.59 | 0.66 |
| | ... | | | | | | |
| | final | 0.16 | 0.39 | 0.59 | 1 | 0.59 | 0.66 |

(a) Utility values for adder modules (f1)

| Func. unit | Iteration | m0 | m1 | m2 | m3 | m4 | m5 | m6 |
|---|---|---|---|---|---|---|---|---|
| f2 | 1st | 0.2 | 0.4 | 0.6 | 0.7 | 0.9 | 1 | 0.05 |
| | 2nd | 0.33 | 0.60 | 1 | 0.95 | 0.44 | 0.50 | 0.11 |
| | ... | | | | | | | |
| | final | 0.32 | 0.60 | 1 | 0.96 | 0.05 | 0 | 0 |
| f3 | 1st | 0.2 | 0.4 | 0.6 | 0.7 | 0.9 | 1 | 0.05 |
| | 2nd | 0.33 | 0.61 | 1 | 0.96 | 0.44 | 0.50 | 0.11 |
| | ... | | | | | | | |
| | final | 0.32 | 0.60 | 1 | 0.96 | 0.05 | 0 | 0 |

(b) Utility values for multiplier modules (f2, f3)

| (f1 f2 f3) | Latency | Power |
|---|---|---|
| a3 m2 m2 | 125 | 52 |
| a3 m2 m3 | 135 | 48 |
| a3 m3 m2 | 135 | 48 |
| a3 m3 m3 | 155 | 43 |

(c) Latency and power of the solutions obtained by modules with utility value greater than 0.9

Figure 9: Initial utility assignment and their improvements after each iteration, and elite set (Example I)

Suppose we select the modules that have utility values of 0.9 or higher. This leads to the selection of {a3}, {m2, m3}, and {m2, m3} for $f_1$, $f_2$, and $f_3$ respectively. Consequently, the elite set contains {(a3,m2,m2), (a3,m2,m3), (a3,m3,m2), (a3,m3,m3)}. For comparison, we construct a schedule for each of these selections. Figure 9(c) displays the resulting latency and power attributes for each selection in the elite set. All the selections give acceptable solutions, i.e., their latencies and powers are less than 200 and 80 respectively.
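Forming the elite set from thresholded utilities amounts to a Cartesian product over the functional units. The following minimal sketch uses the "final" rows of Figure 9 and the 0.9 threshold from the text; the dictionary layout and the function name `elite_set` are assumptions made for illustration.

```python
from itertools import product

# Final utility values from Figure 9(a)-(b).
FINAL_UTILITY = {
    "f1": {"a0": 0.16, "a1": 0.39, "a2": 0.59, "a3": 1.0, "a4": 0.59, "a5": 0.66},
    "f2": {"m0": 0.32, "m1": 0.60, "m2": 1.0, "m3": 0.96, "m4": 0.05, "m5": 0.0, "m6": 0.0},
    "f3": {"m0": 0.32, "m1": 0.60, "m2": 1.0, "m3": 0.96, "m4": 0.05, "m5": 0.0, "m6": 0.0},
}


def elite_set(utilities, threshold):
    """Return every module selection built from modules at or above threshold."""
    candidates = [
        [module for module, value in modules.items() if value >= threshold]
        for modules in utilities.values()
    ]
    return list(product(*candidates))


for selection in elite_set(FINAL_UTILITY, threshold=0.9):
    print(selection)
```

With these values the product has 1 × 2 × 2 = 4 members, matching the four selections listed in Figure 9(c).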

Example II

Next, let us consider the case where the design goal is to minimize latency under given latency and power constraints. Define $Q(t,p)$ as:

$$Q(t,p) = \begin{cases} 0 & \text{if } t > 200 \text{ or } p > 80 \\ 1 & \text{if } t \le 50 \text{ and } p \le 80 \\ -0.0067t + 1.33 & \text{otherwise.} \end{cases}$$

For this design goal, we assume a linear acceptability function. The constants in $-0.0067t + 1.33$ are obtained from the linear interpolation between two (latency, acceptability) coordinates, (200, 0) and (50, 1). This function depends only on the latency variable $t$ and favors shorter latency values. Figure 10(a) plots the above $Q$. For this problem, the solutions obtained in the previous example no longer result in high acceptability. Using the same initial utility assignment, we re-run the experiment with the new acceptability model.
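As a check on the constants, the line through (50, 1) and (200, 0) can be worked out directly. The rounding to four and two decimal places is an assumption about how the printed constants were obtained.

$$\text{slope} = \frac{0 - 1}{200 - 50} = -\frac{1}{150} \approx -0.0067, \qquad \text{intercept} = 1 + \frac{50}{150} = \frac{4}{3} \approx 1.33$$

so that, in the "otherwise" region, $Q(t,p) = -\tfrac{t}{150} + \tfrac{4}{3}$, which equals 1 at $t = 50$ and 0 at $t = 200$.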

In Figures 10(b)–10(c), we include only the initial and final utility values for adders and multipliers, obtained after 16 iterations. Based on these results, the elite set contains only (a0,m1,m1). This selection gives a latency of 67 and a power of 78. Clearly, the latency obtained from this solution is much shorter than the latency values obtained for Example I (Figure 9(c)).

To assess the quality of our solution, we generate all 294 possible combinations of modules for the given adder and multipliers (6 adder modules and 7 multiplier modules, i.e., 6 × 7 × 7 = 294) and their corresponding schedules. Then, for each schedule, its latency and power are mapped through the acceptability function to compute the acceptability value. The solutions obtained by this method are referred to as the "enumerated set". Figure 10(d) uses a pie chart to divide the 294 enumerated solutions into 10 ranks: the 1st rank corresponds to the solutions whose acceptability values lie in $[\frac{9}{10} \times MaxAcc, MaxAcc]$ (the top 10% of the acceptability range), while the 10th rank corresponds to the solutions whose acceptability values lie in $[0, \frac{1}{10} \times MaxAcc]$, where MaxAcc is $\max_{(t,p)} Q(t,p)$. In this figure, only 3% of the solutions belong to the first rank, while 28% of the solutions belong to the last rank. The elite set produced by WiZard is in the first rank, where only 3 percent of all 294 enumerated solutions fall.
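The ranking scheme can be sketched as a small helper. The function name `rank_of` and the treatment of values that sit exactly on a boundary between two ranks (assigned to the better rank here) are assumptions, since the text gives only the first and last intervals; the sample acceptability values are illustrative.

```python
import math


def rank_of(acc, max_acc, n_ranks=10):
    """Return the rank (1 = best) of an acceptability value.

    Rank k covers the k-th slice of width max_acc / n_ranks, counted
    from the top. A value exactly on a slice boundary goes to the
    better rank (an assumption; floating-point rounding can shift
    such boundary cases).
    """
    if max_acc <= 0:
        raise ValueError("max_acc must be positive")
    if not 0 <= acc <= max_acc:
        raise ValueError("acc must lie in [0, max_acc]")
    slices_below = math.floor(acc / max_acc * n_ranks)
    return min(n_ranks, max(1, n_ranks - slices_below))


# Illustrative values only.
for value in (0.74, 0.70, 0.62, 0.05, 0.0):
    print(value, rank_of(value, max_acc=0.74))
```

The same helper covers the 7-rank scale used for the larger benchmarks by passing `n_ranks=7`.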


[Figure 10(a), surface plot not reproduced: acceptability plotted against power and latency.]

(a) Acceptability model

| Func. unit | Iteration | a0 | a1 | a2 | a3 | a4 | a5 |
|---|---|---|---|---|---|---|---|
| f1 | 1st | 0.1 | 0.2 | 0.3 | 0.7 | 0.9 | 1 |
| | final | 1 | 0.05 | 0.07 | 0.02 | 0 | 0 |

(b) Utility values for adder modules (f1)

| Func. unit | Iteration | m0 | m1 | m2 | m3 | m4 | m5 | m6 |
|---|---|---|---|---|---|---|---|---|
| f2 | 1st | 0.2 | 0.4 | 0.6 | 0.7 | 0.9 | 1 | 0.05 |
| | final | 0.05 | 1 | 0.19 | 0 | 0 | 0 | 0 |
| f3 | 1st | 0.2 | 0.4 | 0.6 | 0.7 | 0.9 | 1 | 0.05 |
| | final | 0.01 | 1 | 0.36 | 0 | 0 | 0 | 0 |

(c) Utility values for multiplier modules (f2, f3)

[Figure 10(d), pie chart not reproduced: the 294 enumerated solutions divided into ranks 1st to 10th. The 1st rank holds 3% of the solutions and the 10th rank 28%; the chart is annotated with the selection (a0, m1, m1), latency = 67, power = 78.]

(d) Distribution of 294 solutions

Figure 10: Acceptability model minimizing latency, utility values for Test2 DFG and distribution of enumerated solutions (Example II)

Example III

For the last example, we demonstrate the use of WiZard for finding a design that minimizes both latency and power under given latency and power constraints. Obviously, the selection from the previous example does not give a solution that minimizes power. Therefore, we formulate a new objective function using a weighted sum of latency and power. In particular, the design goal is to minimize $w_1 t + w_2 p$, where $w_1$ and $w_2$ are the weights of the latency and power attributes. In the following equation, we assume a linear tradeoff where $w_1 = 1$ and $w_2 = 2$, i.e., saving power is twice as important as reducing latency.

$$Q(t,p) = \begin{cases} 0 & \text{if } t > 200 \text{ or } p > 80 \\ 1 & \text{if } t \le 50 \text{ and } p \le 20 \\ -0.0039(t + 2p) + 1.35 & \text{otherwise.} \end{cases}$$

In order to calculate the weighted sum of latency and power $(t + 2p)$, both $t$ and $p$ must be normalized to the same scale. We assume that this normalization has already been applied in the above equation. The acceptability model for this problem is depicted in Figure 11(a).

Figures 11(b)–11(c) display the final utility values after 14 iterations. These utility values lead to the elite set {(a3,m0,m0), (a3,m0,m1), (a3,m0,m2), (a3,m1,m0), (a3,m1,m1), (a3,m1,m2)}. Compared with the solution in Example II, the solutions in Figure 11(e) have lower power at the expense of increased latency. These solutions also yield better latency than those of Example I.

Figure 11(d) displays the 10-rank distribution of the enumerated solutions for this problem. Of all 294 enumerated solutions, the best solution found has acceptability 0.74, and only 15% belong to the first rank. Using our approach, we are able to construct an elite set of size 6 in which 4 solutions are of the first rank. This example shows that our approach is able to reduce the search space from 294 to 6 and to increase the percentage of 1st-rank solutions from 15% (44/294) to 67% (4/6).
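The first-rank count can be re-derived from the reported numbers. The sketch below assumes that $Q$ can be applied directly to the latency and power values listed in Figure 11(e), which the normalization remark leaves open, and takes MaxAcc = 0.74 from the text; the function and variable names are hypothetical.

```python
def q_example3(t, p):
    """Acceptability function of Example III (weights 1 : 2)."""
    if t > 200 or p > 80:
        return 0.0
    if t <= 50 and p <= 20:
        return 1.0
    return -0.0039 * (t + 2 * p) + 1.35


# (latency, power) of each elite selection, from Figure 11(e).
ELITE = {
    ("a3", "m0", "m0"): (108, 28),
    ("a3", "m0", "m1"): (116, 26),
    ("a3", "m0", "m2"): (116, 26),
    ("a3", "m1", "m0"): (116, 26),
    ("a3", "m1", "m1"): (115, 36),
    ("a3", "m1", "m2"): (125, 34),
}
MAX_ACC = 0.74  # best acceptability among the 294 enumerated solutions

first_rank = [
    selection
    for selection, (t, p) in ELITE.items()
    if q_example3(t, p) >= 0.9 * MAX_ACC  # 1st rank: top tenth of the range
]
print(len(first_rank), "of", len(ELITE), "selections are in the 1st rank")
```

Under these assumptions the acceptabilities are about 0.71 for (108, 28) and 0.69 for the three (116, 26) selections, all above the first-rank threshold of 0.666, while the remaining two fall just below it.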

5 Experimental Results

To demonstrate the effectiveness of our approach, we have performed several experiments on benchmarks including test1 (Figure 1), test2 (Figure 7(b)), differential equation (deq), wave-digital filter (fj-wdf), Volterra filter (vtf), elliptic filter (elf), and discrete cosine transform (dct). Table 2 summarizes the characteristics of the benchmarks considered.

Figure 12 lists the module sets used in the experiments. Some of these are obtained by modifying the modules in [11]. Module set I is the same set as in Table 1. Module set II consists of 6 adder modules and 7 multiplier modules. Module set III consists of 6 adder modules and 6 multiplier modules. The initial assignment for module set I favors the lowest-power adder and multiplier modules. For module sets II and III, we consider two different initial utility assignments: II-1 favors the lowest-power adder and multiplier, II-2 and III-1 favor the fastest adder and multiplier, and III-2 favors the lowest-power adder and the fastest multipliers.

[Figure 11(a), surface plot not reproduced: acceptability plotted against power and latency.]

(a) Acceptability model

| Func. unit | Iteration | a0 | a1 | a2 | a3 | a4 | a5 |
|---|---|---|---|---|---|---|---|
| f1 | 1st | 0.1 | 0.2 | 0.3 | 0.7 | 0.9 | 1 |
| | final | 0.25 | 0.65 | 0.45 | 1 | 0.65 | 0.45 |

(b) Utility values for adder modules (f1)

| Func. unit | Iteration | m0 | m1 | m2 | m3 | m4 | m5 | m6 |
|---|---|---|---|---|---|---|---|---|
| f2 | 1st | 0.2 | 0.4 | 0.6 | 0.7 | 0.9 | 1 | 0.05 |
| | final | 0.92 | 1 | 0.44 | 0.77 | 0.01 | 0 | 0 |
| f3 | 1st | 0.2 | 0.4 | 0.6 | 0.7 | 0.9 | 1 | 0.05 |
| | final | 1 | 0.99 | 0.98 | 0.8 | 0.01 | 0 | 0 |

(c) Utility values for multiplier modules (f2, f3)

[Figure 11(d), pie chart not reproduced: distribution of the 294 enumerated solutions by rank. The slices are labelled 1st, 2nd, 3rd, 4th, 5th and 10th, with percentages 14%, 14%, 22%, 22%, 6% and 22% in the order extracted; ranks 6th to 9th hold 0%. The chart is annotated with the six elite selections (a3,m0,m0), (a3,m0,m1), (a3,m0,m2), (a3,m1,m0), (a3,m1,m1), (a3,m1,m2).]

(d) Distribution of 294 solutions

| (f1 f2 f3) | Latency | Power |
|---|---|---|
| a3 m0 m0 | 108 | 28 |
| a3 m0 m1 | 116 | 26 |
| a3 m0 m2 | 116 | 26 |
| a3 m1 m0 | 116 | 26 |
| a3 m1 m1 | 115 | 36 |
| a3 m1 m2 | 125 | 34 |

(e) Latency and power of the solutions obtained by modules with utility value greater than 0.9

Figure 11: Linear acceptability model considering tradeoff, utility values for Test2 DFG, distribution of enumerated solutions, and elite set (Example III)

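| Modules | Time | Power | Initial |
|---|---|---|---|
| a0 | 5 | 20 | 0.2 |
| a1 | 10 | 15 | 0.7 |
| a2 | 25 | 10 | 1.0 |
| m0 | 5 | 100 | 0.2 |
| m1 | 10 | 37.5 | 0.5 |
| m2 | 20 | 23 | 0.8 |
| m3 | 30 | 10 | 1.0 |

(a) Module set I

| Modules | Time | Power | Initial1 | Initial2 |
|---|---|---|---|---|
| a0 | 5 | 30 | 0.05 | 1 |
| a1 | 7 | 25 | 0.1 | 0.4 |
| a2 | 10 | 19 | 0.2 | 0.2 |
| a3 | 15 | 15 | 0.5 | 0.1 |
| a4 | 20 | 12 | 0.8 | 0.1 |
| a5 | 30 | 10 | 1 | 0.05 |
| m0 | 5 | 55 | 0.05 | 1 |
| m1 | 10 | 40 | 0.1 | 0.8 |
| m2 | 15 | 30 | 0.3 | 0.5 |
| m3 | 23 | 20 | 0.4 | 0.4 |
| m4 | 30 | 13 | 0.7 | 0.3 |
| m5 | 35 | 8 | 0.8 | 0.2 |
| m6 | 50 | 5 | 1 | 0.1 |

(b) Module set II

| Modules | Time | Power | Initial1 | Initial2 |
|---|---|---|---|---|
| a0 | 5 | 60 | 1 | 0.1 |
| a1 | 10 | 38 | 0.7 | 0.2 |
| a2 | 20 | 23 | 0.5 | 0.3 |
| a3 | 35 | 20 | 0.3 | 0.5 |
| a4 | 40 | 10 | 0.2 | 0.7 |
| a5 | 70 | 5 | 0.1 | 1 |
| m0 | 100 | 296 | 0.3 | 1 |
| m1 | 160 | 84 | 1 | 0.7 |
| m2 | 170 | 70 | 0.8 | 0.5 |
| m3 | 300 | 55 | 0.4 | 0.4 |
| m4 | 640 | 29 | 0.2 | 0.2 |
| m5 | 770 | 20 | 0.1 | 0.1 |

(c) Module set III

Figure 12: Module sets I–III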

| Benchmark name | # add operations | # mult operations |
|---|---|---|
| test1 | 3 | 2 |
| test2 | 4 | 5 |
| deq | 5 | 6 |
| fj-wdf | 13 | 4 |
| vtf | 10 | 17 |
| elf | 28 | 7 |
| dct | 32 | 16 |

Table 2: Benchmark characteristics

We first present the results generated by enumerating solutions for various tests, as shown in Table 3. Column "Spec" displays the specification of the target system (the number of functional units, i.e., adders and multipliers, allowed in the system). Column "Module set" shows the module set used in each benchmark test. In Column "Acceptability Q", fields "latency" and "power" contain two vital points $(x_0, x_1)$ similar to those of Equations (1)–(2): any design whose latency and power are less than $x_0$ has acceptability one, while a design whose latency or power is greater than $x_1$ is unacceptable (acceptability equals zero). A design whose latency and power are between $x_0$ and $x_1$ is acceptable to a degree that depends on the assumption in field "$w_1 : w_2$", which displays the tradeoff ratio between latency and power respectively. When $w_1 : w_2 = 1 : 1$, i.e., no tradeoff is considered, we use a z-shaped curve to represent the $Q$ function (similar to Equation (1) except that $F_1$ and $F_2$ are z-curves as shown in Figure 2(c)). When $w_1 : w_2 = 5 : 1$ or $w_1 : w_2 = 1 : 3$, we use a linear tradeoff function applied to $F$, similar to the graph in Figure 11(a).

Under Column "Enumerated solutions", Column "total" displays the total number of enumerated solutions. For Column "Distribution by rank", we divide the solutions into 7 groups according to the acceptabilities of their corresponding latency and power values. Let MaxAcc be the maximum acceptability value obtained among all enumerated solutions. The solutions in the first rank have acceptability in $[\frac{6}{7} \times MaxAcc, MaxAcc]$, while the solutions in the 7th rank have acceptability in $[0, \frac{1}{7} \times MaxAcc]$. The numbers in Columns "1st", "2nd", ..., "7th" under "Distribution by rank" give the number of module combinations belonging to each rank.

Table 4 shows the results generated by WiZard. (All experiments used $c = 1$, and a module $m$ is included in the elite set if $\mu_f(m) > 0.6$.) The number in front of each row entry of Table 4 corresponds to the same indexed test in Table 3. Column "#elite set" shows the size of the elite set formed by the selected modules. Column "Selected modules" shows the modules WiZard selected for each functional unit. Under this column, the selected adder and multiplier modules are displayed in Columns "adder" and "multiplier". The modules within each pair of parentheses are the candidate modules for one adder or multiplier functional unit. If the target system specification requires 2 adders or multipliers, the modules in the first and second parentheses are the candidate modules of the first and second adder or multiplier unit respectively. Column "#iter" displays the number of adjustment iterations performed to obtain the selection in Columns "adder" and "multiplier". Column "Distribution by rank" displays the ranking of the elements in the elite set. For comparison, each row uses the same ranking scale as its corresponding row under "Distribution by rank" in Table 3.

Consider the elliptic filter benchmark for a 2-adder and 2-multiplier system in Table 4 (row 19), where the number of enumerated solutions is 1,296. The experiments give a module selection (a1,a1,m1,m1) which lies in the first rank (latency = 1060 and power = 110.75). This result is obtained within five iterations of adjustments. To verify the result, Figure 13(a) graphically depicts all enumerated solutions as well as their acceptability values for this case. For easier inspection, Figure 13(b) projects these points onto the latency-acceptability plane and Figure 13(c) projects them onto the power-acceptability plane. Each point in the graphs corresponds to the latency and power values of the schedule using a particular module selection. The solutions in the square region in Figures 13(b)–13(c) are highlighted to show the 1st-rank solutions (latency in [1015, 1280] and power in [95, 137]). The modules identified by WiZard lead to solutions that fall into this region.

For some large benchmarks, we tested the method against acceptability functions in which the tradeoff between latency and power is considered. Take the discrete cosine transform benchmark (rows 28–29 of Tables 3–4) as an example. In row 28, since the goal is to optimize $5t + p$, we attempt to minimize latency at the expense of increasing power. Figure 14 characterizes the enumerated solutions in row 28 of Table 3 in the 2-dimensional latency-acceptability plane as well as the power-acceptability plane. The points in the square box area correspond to the 1st-rank solutions, whose latencies are in [980, 1160] and powers are in [418, 498]. Both of our selections in row 28 of Table 4, (a2,a2,a2,m0,m0) and (a2,a2,a1,m0,m0), result in latency and power of (980, 498) and (1010, 481) respectively, which again belong to the 1st-rank solutions.

Figure 15 characterizes the enumerated solutions when the goal is to optimize $t + 3p$, i.e., to reduce the power consumption at the expense of increased latency. According to Figure 15(a), the 1st-rank solutions have latencies in [1310, 1470] and powers in [135, 266]. Consider one of the solutions in row 29 of Table 4, (a1,a1,a1,m1,m1). This solution results in latency 1350 and power 168. Hence, it falls into the first rank. Compared with the latency and power of the solutions obtained for the previous case, the power is significantly reduced (from 498 or 481 to 168) while the latency increases (from 980 or 1010 to 1350).

To further illustrate the effectiveness of WiZard, we summarize our experimental data in Tables 5–6. Table 5 summarizes the ratio between the average size of the derived elite set and the number of enumerated solutions for each benchmark. Examining Column "% Reduction", one can see that our proposed approach is able to reduce the search space dramatically. Table 6 shows the average quality of the selections found by WiZard. Columns "Enumerated" and "WiZard" show the average percentage of module selections per rank for the enumerated set and for the elite set obtained by our algorithm. Though, on average, most of the module selections from the enumerated set result in schedules with low acceptability degrees, our algorithm is able to capture good module selections. This can also be done within a reasonable amount of time. In the worst-case example, the discrete cosine transform with 5 functional units, the average running time of WiZard is about 800 seconds, while it takes approximately 10800 seconds to generate the 7776 combinations. Therefore, WiZard takes only 7.4% of the running time of the enumerated approach. From all the tables, one can see that WiZard efficiently produces high-quality module selections.

| # | Benchmark | Spec. | Module set | Q: latency | Q: power | Q: w1:w2 | Total | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | test1 | 1a 2m | I | (20,70) | (30,80) | 1:1(z) | 48 | 25 | 0 | 0 | 6 | 0 | 0 | 17 |
| 2 | test2 | 1a 2m | II-1 | (55,200) | (35,80) | 1:1(z) | 294 | 90 | 56 | 45 | 52 | 13 | 25 | 13 |
| 3 | test2 | 1a 2m | II-2 | (55,200) | (35,80) | 1:1(z) | 294 | 90 | 56 | 45 | 52 | 13 | 25 | 13 |
| 4 | deq | 1a 2m | II-1 | (100,200) | (35,70) | 1:1(z) | 294 | 168 | 13 | 47 | 8 | 45 | 12 | 1 |
| 5 | deq | 1a 2m | II-2 | (100,200) | (35,70) | 1:1(z) | 294 | 168 | 13 | 47 | 8 | 45 | 12 | 1 |
| 6 | fj-wdf | 2a 1m | II-1 | (100,200) | (35,70) | 1:1(z) | 252 | 103 | 40 | 4 | 4 | 28 | 20 | 53 |
| 7 | fj-wdf | 2a 1m | II-2 | (100,200) | (35,70) | 1:1(z) | 252 | 103 | 40 | 4 | 4 | 28 | 20 | 53 |
| 8 | fj-wdf | 2a 2m | II-1 | (50,125) | (40,80) | 1:1(z) | 1764 | 116 | 34 | 4 | 74 | 29 | 67 | 1440 |
| 9 | fj-wdf | 2a 2m | II-2 | (50,125) | (40,80) | 1:1(z) | 1764 | 116 | 34 | 4 | 74 | 29 | 67 | 1440 |
| 10 | fj-wdf | 2a 2m | II-1 | (50,125) | (40,80) | 5:1(l) | 1764 | 77 | 0 | 124 | 186 | 181 | 135 | 1061 |
| 11 | fj-wdf | 2a 2m | II-1 | (50,125) | (40,80) | 1:3(l) | 1764 | 414 | 385 | 0 | 0 | 0 | 0 | 965 |
| 12 | vtf | 1a 3m | II-1 | (150,300) | (35,50) | 1:1(z) | 2058 | 124 | 50 | 113 | 85 | 162 | 211 | 1313 |
| 13 | vtf | 1a 3m | II-2 | (150,300) | (35,50) | 1:1(z) | 2058 | 124 | 50 | 113 | 85 | 162 | 211 | 1313 |
| 14 | vtf | 1a 3m | II-1 | (200,300) | (35,70) | 5:1(l) | 2058 | 12 | 309 | 474 | 42 | 0 | 0 | 1221 |
| 15 | vtf | 1a 3m | II-1 | (200,300) | (35,70) | 1:3(l) | 2058 | 367 | 247 | 277 | 254 | 0 | 0 | 913 |
| 16 | elf | 2a 1m | III-1 | (2000,3000) | (75,200) | 1:1(z) | 216 | 72 | 0 | 1 | 23 | 9 | 2 | 109 |
| 17 | elf | 2a 1m | III-2 | (2000,3000) | (75,200) | 1:1(z) | 216 | 72 | 0 | 1 | 23 | 9 | 2 | 109 |
| 18 | elf | 2a 2m | III-1 | (1000,2000) | (100,250) | 1:1(z) | 1296 | 738 | 169 | 196 | 78 | 22 | 18 | 72 |
| 19 | elf | 2a 2m | III-2 | (1000,2000) | (100,250) | 1:1(z) | 1296 | 738 | 169 | 196 | 78 | 22 | 18 | 72 |
| 20 | elf | 2a 2m | III-2 | (1300,1700) | (75,150) | 5:1(l) | 1296 | 38 | 69 | 110 | 173 | 130 | 102 | 674 |
| 21 | elf | 2a 2m | III-2 | (1300,1700) | (75,150) | 1:3(l) | 1296 | 140 | 173 | 75 | 75 | 34 | 0 | 799 |
| 22 | dct | 2a 2m | III-1 | (2000,3000) | (300,500) | 1:1(z) | 1296 | 504 | 0 | 5 | 82 | 125 | 85 | 495 |
| 23 | dct | 2a 2m | III-2 | (2000,3000) | (300,500) | 1:1(z) | 1296 | 504 | 0 | 5 | 82 | 125 | 85 | 495 |
| 24 | dct | 2a 2m | III-2 | (1200,2600) | (75,150) | 5:1(l) | 1296 | 30 | 62 | 68 | 20 | 0 | 0 | 1116 |
| 25 | dct | 2a 2m | III-2 | (1200,2600) | (75,150) | 1:3(l) | 1296 | 43 | 64 | 31 | 0 | 0 | 62 | 1096 |
| 26 | dct | 3a 2m | III-1 | (1500,2500) | (350,450) | 1:1(z) | 7776 | 3005 | 0 | 0 | 0 | 0 | 0 | 4771 |
| 27 | dct | 3a 2m | III-2 | (1500,2500) | (350,450) | 1:1(z) | 7776 | 3005 | 0 | 0 | 0 | 0 | 0 | 4771 |
| 28 | dct | 3a 2m | III-2 | (1000,2000) | (200,500) | 5:1(l) | 7776 | 174 | 47 | 507 | 872 | 1483 | 132 | 4560 |
| 29 | dct | 3a 2m | III-2 | (1000,2000) | (200,500) | 1:3(l) | 7776 | 294 | 1390 | 1516 | 16 | 0 | 0 | 4560 |

Table 3: Enumerated module selection results for tested benchmarks, generated for comparison with the WiZard results in Table 4. Columns "1st" to "7th" give the distribution of the enumerated solutions by rank. Some benchmarks are tested against different initial utility assignments and different design goals.

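| Ben. | # elite set | Selected adder modules | Selected multiplier modules | #iter | 1st | 2nd | 3rd | 4th | 5th | 6th | 7th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | (a1) | (m1 m2) (m1 m2) | 15 | 3 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | (a3 a4) | (m2) (m2) | 12 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | (a3) | (m0) (m0) | 22 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 9 | (a1) | (m1 m2 m3) (m1 m2 m3) | 11 | 5 | 0 | 0 | 2 | 1 | 0 | 1 |
| 5 | 9 | (a2) | (m0 m2 m3) (m0 m2 m3) | 7 | 4 | 3 | 1 | 0 | 1 | 0 | 0 |
| 6 | 1 | (a2) (a2) | (m1) | 16 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 4 | (a0 a1) (a0 a1) | (m1) | 9 | 2 | 0 | 0 | 0 | 0 | 2 | 0 |
| 8 | 2 | (a0) (a1) | (m0 m1) (m1) | 9 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 8 | (a0 a1) (a0) | (m0 m2) (m0 m2) | 9 | 3 | 0 | 1 | 3 | 0 | 1 | 0 |
| 10 | 2 | (a0 a1) (a0) | (m1) (m0) | 14 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 11 | 2 | (a2) (a2) | (m1) (m0 m1) | 10 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 12 | 2 | (a2 a3) | (m1) (m0) (m1) | 9 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 13 | 8 | (a1 a2) | (m0 m1) (m0 m1) (m3) | 9 | 2 | 0 | 1 | 1 | 1 | 1 | 2 |
| 14 | 2 | (a0 a1) | (m0) (m1) (m0) | 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 6 | (a1 a2 a3) | (m0) (m0 m1) (m0) | 15 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16 | 1 | (a0) (a0) | (m1) | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 9 | (a0 a1 a2) (a0 a1 a2) | (m1) | 5 | 9 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18 | 4 | (a0) (a0) | (m1 m2) (m1 m2) | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19 | 1 | (a1) (a1) | (m1) (m1) | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20 | 8 | (a1 a2) (a2) | (m1 m2) (m1 m2) | 18 | 4 | 0 | 1 | 3 | 0 | 0 | 0 |
| 21 | 3 | (a2 a3 a4) (a2) | (m1) (m1) | 8 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 22 | 1 | (a0) (a0) | (m1) (m1) | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 23 | 4 | (a1 a3) (a1 a3) | (m1) (m1) | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24 | 4 | (a2) (a2) | (m1 m2) (m1 m2) | 7 | 1 | 3 | 0 | 0 | 0 | 0 | 0 |
| 25 | 4 | (a2 a3 a4 a5) (a0) | (m1) (m2) | 20 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 26 | 8 | (a1 a3) (a1 a3) (a1 a3) | (m1) (m1) | 15 | 8 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | 2 | (a4) (a1) (a3) | (m0 m2) (m2) | 3 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28 | 2 | (a2) (a2) (a1 a2) | (m0) (m0) | 16 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 29 | 32 | (a1 a2) (a1 a2) (a1 a2) | (m1 m2) (m1 m2) | 4 | 28 | 4 | 0 | 0 | 0 | 0 | 0 |

Table 4: Module selection results for tested benchmarks generated by WiZard, where each row corresponds to the row in Table 3. Columns "1st" to "7th" give the distribution of the elite set by rank.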

[Figure 13, scatter plots not reproduced. (a) Latency, power and acceptability. (b) Latency and acceptability. (c) Power and acceptability.]

Figure 13: Enumerated solutions for elliptic filter benchmark (row 19 of Table 3)

[Figure 14, scatter plots not reproduced. (a) Latency and acceptability. (b) Power and acceptability.]

Figure 14: Enumerated solutions for discrete cosine transform benchmark when latency and power tradeoff is 5:1 (row 28 of Tables 3–4)

[Figure 15, scatter plots not reproduced. (a) Latency and acceptability. (b) Power and acceptability.]

Figure 15: Enumerated solutions for discrete cosine transform benchmark when latency and power tradeoff is 1:3 (row 29 of Tables 3–4)

| Rows | Benchmark | Spec | WiZard : Enumerated | % Reduction |
|---|---|---|---|---|
| 1 | test1 | 1a 2m | 4 : 48 | 91% |
| 2–3 | test2 | 1a 2m | 1.5 : 294 | 99% |
| 4–5 | deq | 1a 2m | 9 : 294 | 97% |
| 6–7 | fj-wdf | 2a 1m | 2.5 : 252 | 99% |
| 8–11 | fj-wdf | 2a 2m | 3.5 : 1764 | 99% |
| 12–15 | vtf | 1a 3m | 4.5 : 2058 | 99% |
| 16–17 | elf | 2a 1m | 5 : 216 | 97% |
| 18–21 | elf | 2a 2m | 4 : 1296 | 99% |
| 22–25 | dct | 2a 2m | 3.25 : 1296 | 99% |
| 26–29 | dct | 3a 2m | 11 : 7776 | 99% |

Table 5: Ratio of the average elite set size to the number of enumerated solutions
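The reduction figures follow from the ratios as one minus the elite set size over the enumerated size. A minimal sketch, using the ratios listed in Table 5 (the dictionary keys and the function name are hypothetical labels):

```python
# (average elite set size, number of enumerated solutions) from Table 5.
RATIOS = {
    "test1 1a 2m": (4, 48),
    "test2 1a 2m": (1.5, 294),
    "deq 1a 2m": (9, 294),
    "fj-wdf 2a 1m": (2.5, 252),
    "fj-wdf 2a 2m": (3.5, 1764),
    "vtf": (4.5, 2058),
    "elf 2a 1m": (5, 216),
    "elf 2a 2m": (4, 1296),
    "dct 2a 2m": (3.25, 1296),
    "dct 3a 2m": (11, 7776),
}


def reduction_percent(elite, enumerated):
    """Share of the enumerated search space that no longer needs exploring."""
    return 100.0 * (1.0 - elite / enumerated)


for name, (elite, enumerated) in RATIOS.items():
    print(f"{name:14s} {reduction_percent(elite, enumerated):6.2f}%")
```

The computed values carry more digits than the whole percentages printed in the table, which appear to have been rounded or truncated.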

| Rank | Enumerated | WiZard |
|---|---|---|
| 1st | 25% | 79% |
| 2nd | 7% | 11% |
| 3rd | 7% | 1% |
| 4th | 6% | 5% |
| 5th | 8% | 1% |
| 6th | 3% | 2% |
| 7th | 44% | 1% |

Table 6: Average distribution per rank for enumerated set and the elite set found by WiZard

6 Conclusion and Future Work

We have presented a module selection framework, WiZard, that takes into account scheduling effects as well as resource, latency, and power constraints. The key to this approach is the use of a utility measure to model the degree of usefulness of a module. A module with a high utility value should lead to a design with high acceptability values. The scheduling and binding method called inclusion scheduling is used exclusively as a basis for deriving "fuzzy" latency and power values that approximate the latency and power values obtained by exhaustive enumeration. The modules contributing to these latency and power values are also recorded, for the purpose of tracking the modules that result in schedules with good or bad latency and power. Then, the utility values are fine-tuned so that they reflect the quality of each module more precisely. In many experiments, the framework can identify, within seconds, an elite set composed of modules with high utility values. With respect to the enumerated solutions, these selections belong to the superior solutions. The solutions in the elite set can be used as good initial design solutions so as to speed up the design process.

In this work, we assume that the number of resources, i.e., functional units, is limited. Our current approach can be included in an iterative design process that varies the required number of resources. However, since the number of resources also directly affects schedules as well as module selection, future work will consider the effect of the number of resources when evaluating the utility of a module.

References

[1] T. L. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules for parallel processing systems. Communications of the ACM, pages 685-690, December 1974.

[2] I. Ahmad, M. K. Dhodhi, and C. Y. R. Chen. Integrated scheduling, allocation and module selection for design-space exploration in high-level synthesis. IEE Proc.-Comput. Digit. Tech., 142:65-71, January 1995.

[3] E. Sha, C. Chantrapornchai, and Xiaobo S. Hu. Dealing with imprecise timing information in architectural synthesis. Technical Report TR-98-5, University of Notre Dame, 1998.

[4] S. Chaudhuri, S. A. Blythe, and R. A. Walker. An exact methodology for scheduling in 3D design space. In Proceedings of the 1995 International Symposium on System Level Synthesis, pages 78-83, 1995.

[5] L. Chen and L. Jeng. Optimal module set and clock cycle selection for DSP synthesis. In Proceedings of the International Symposium on Circuits and Systems, pages 2200-2203, 1991.

[6] Y. Foo and H. Kobayashi. A knowledge-based system for VLSI module selection. In Proceedings of the International Conference on Computer Design, pages 184-187, 1986.

[7] K. Gupta. Introduction to fuzzy arithmetics. Van Nostrand, 1985.

[8] M. Ishikawa and G. De Micheli. A module selection algorithm for high-level synthesis. In Proceedings of the International Symposium on Circuits and Systems, pages 1777-1780, 1991.


[9] R. Jain. MOSP: Module selection for pipelined designs with multi-cycle operations. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, pages 212-215, 1990.

[10] A. Kaufmann and M. M. Gupta. Fuzzy mathematical models in engineering and management science. North-Holland, 1988.

[11] R. Martin and J. P. Knight. Power-profiler: Optimizing ASICs power consumption at the behavioral level. In Proceedings of the 32nd Design Automation Conference, pages 42-47, 1997.

[12] G. De Micheli. Synthesis and optimization of digital circuits. McGraw-Hill, Inc., 1994.

[13] L. Ramachandran and D. Gajski. An algorithm for component selection in performance optimized scheduling. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design, pages 92-95, 1991.

[14] T. J. Ross. Fuzzy Logic with Engineering Applications. McGraw-Hill, 1st edition, 1995.

[15] Z. X. Shen and C. C. Jong. Exploring module selection space for architectural synthesis of low power designs. In Proceedings of the International Symposium on Circuits and Systems, pages 1532-1535, 1997.

[16] A. Timmer, M. Heijligers, L. Stok, and J. Jess. Module selection and scheduling using unrestricted libraries. In Proceedings of the European Design Automation Conference, pages 547-551, 1993.

[17] L. A. Zadeh. The concept of a linguistic variable and its application to approximate reasoning, Part I. Information Sciences, 8:199-249, 1975.

[18] L. A. Zadeh. Fuzzy Logic. Computer, 1:83-93, 1988.
