Efficient Module Selections for Finding Highly
Acceptable Designs based on Inclusion
Scheduling†
Chantana Chantrapornchai Edwin H.-M. Sha Xiaobo (Sharon) Hu
Research report: TR-99-2
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN 46556
email: fcchantra,esha,[email protected]
Phone number: (219) 631-8803
Fax number: (219) 631-9260
Abstract
In high level synthesis, module selection, scheduling, and resource binding are interdependent
tasks. For a selected module set, the best schedule/binding should be generated in order to
accurately assess the quality of a module selection. Exhaustively enumerating all module
selections and constructing a schedule and binding for each one of them can be extremely
expensive. In this paper, we present an iterative framework, called WiZard, to solve the
module selection problem under resource, latency, and power constraints. The framework
associates a utility measure with each module. This measure reflects the usefulness of
the module for a given design goal. Using modules with high utility values should result
in superior designs. We propose a heuristic which iteratively perturbs module utility values
until they lead to good module selections. Our experiments show that by keeping modules
with high utility values, WiZard can drastically reduce the module exploration space
(approximately 99.2% reduction). Furthermore, the module selections formed by these modules
belong to superior solutions in the enumerated set (top 15%).
Keywords: Inclusion scheduling, Module selections, Design exploration, Acceptable designs
†This work was supported in part by the NSF under grant numbers MIP 95-01006, MIP-9612298 and MIP-9701416.
1 Introduction
In high-level synthesis, scheduling, resource binding and module selection are important phases. Schedul-
ing is the task of determining the start times of operations based on precedence constraints, resource
binding is the explicit mapping between the operations and generic resources (functional units), and
module selection problem is the resource-type selection problem where more than one resource type can
match the functional requirement of an operation type [12]. These three steps are highly dependent on
one another. This is due to the fact that operations sharing the same resource must have the same module
and that the execution delays of the operations depend on the chosen module. For today's IC systems,
the cost of solving the combined scheduling, binding, and module selection problem by exhaustive search
is prohibitive.
A number of research results have been published on the module selection and scheduling/binding
problem. Some works are interested in finding the optimal module selection. Jain developed an integer
linear programming formulation where the coefficient matrix is unimodular to solve an optimal module
selection problem in polynomial time [9]. However, in this formulation, schedule and resource binding
are not considered. Further, all instances of the same operation type are implemented using the same
module type. Timmer et al. used a mixed integer linear programming (MILP) approach to find selected
modules and then applied resource-constrained list scheduling to check if the time constraint is met [16].
The number of integer variables of the MILP, nevertheless, increases as the size of the eligible module set
increases. Chen and Jeng investigated the problem of finding an optimal module as well as clock cycle
selection simultaneously [5]. However, their method explores all possible module selections. Chaudhuri,
Blythe and Walker included a module selection scheme for optimal design space characterization [4].
Similar to [5], they exhaustively generated all module selections and schedules to find optimal designs.
Many researchers have proposed heuristics to find satisfactory module selections. Ramachandran
and Gajski used a probability table to investigate module selection while constructing a schedule [13].
However, they do not consider resource constraints. Shen and Jong explored the module selection space for
low power designs [15]. Their approach also does not consider resource constraints. Ishikawa and De
Micheli proposed an approach to solve a module selection problem with a fixed latency constraint [8].
Their searching scheme starts with the modules with the largest area. Then, it iteratively investigates
modules with smaller areas until the latency constraint is violated. Depending on the constraint, the
approach may, however, need to explore all possible solutions. Moreover, this exploration process assumes
a fixed schedule during each iteration of changing modules. Some researchers applied artificial intelligence
approaches to the problem of design exploration. Simon and Kobayashi used a knowledge-based system to
select satisfactory VLSI designs [6]. They assumed, however, that the schedule has been given. Ahmad,
Dhodhi and Chen used a genetic algorithm to help design exploration [2]. Yet, they do not consider
latency and power constraints while performing module selection.
Considering the immense module selection space, we focus on designing a technique that helps rapidly
reduce the search space. That is, we would like to find a very small set of module selections which includes
mostly superior solutions and which can be identified efficiently. We call this set the "elite set". The key
to our approach is the introduction of a module utility metric and inclusion scheduling. Our algorithm
produces a utility measure for each module. Modules with higher utility values should result in superior
designs. For an initial set of utility values, inclusion scheduling efficiently computes an "average" schedule
for various module selections. (The meaning of "average" will be explained in Section 3.) Based on the
information provided by such a schedule, we use a heuristic to improve module utility assignments.
Within a short period of running time (within seconds), appropriate utility values of given modules can
be identified. An elite set is then constructed by selecting the combinations of the modules with high
utility values.
We have carried out many experiments and our approach gives promising results. The elite set is
very small compared to the size of the enumerated solutions, and the modules in the set lead to very good
designs with respect to the given optimization goal. As an example, for one of the tests on the discrete cosine
transform benchmark, our approach results in an elite set whose size is 0.1% of the enumerated set, and
all of its members are in the top 15% after ranking all enumerated solutions according to the given constraints
and optimization criteria.
Finding an elite set rather than a single solution can be very advantageous in the high-level design phase.
It is well known that during the high-level design phase, many detailed design decisions have not yet been
made. A design may need to be revised several times before it is finalized. By identifying several good
solutions and using them as good initial designs, an iterative design process can be readily performed
and design time can be saved.
The remainder of the paper is organized as follows: Section 2 presents the models used in our work and
the problem definition. Some necessary background concepts are also presented in this section. Section 3
presents our approach in detail. In Section 4, we use examples to illustrate how WiZard works. Section 5
gives experimental results for several benchmarks showing the derived elite sets and their qualities.
Finally, Section 6 draws a conclusion of this work.
2 Overview and Models
In this section, we first describe our model as well as the problem description. Since developing an inclusion
schedule involves some fuzzy arithmetic, we also review some basic concepts in fuzzy computation.
2.1 Model Descriptions
Operations and their dependencies in an application are modeled by a vertex-weighted directed acyclic
graph, called a Data Flow Graph, $G = (V, E, \beta)$, where each vertex in the vertex set $V$ corresponds
to an operation and $E$ is the set of edges representing data flow between two vertices. Function $\beta$
defines the type of operation for node $v \in V$. Figure 1 shows a five-node data flow graph, where
$V = \{A, B, C, D, E\}$, $E = \{A \to E, B \to E, C \to E, D \to E\}$ ($u \to v$ denotes a directed edge from $u$ to $v$),
$\beta(A) = \beta(B) = \beta(E) = \text{add}$, and $\beta(C) = \beta(D) = \text{multiply}$.
Figure 1: Test1: Data flow graph example
Operations in a data flow graph can be mapped to different functional units, which in turn can be
implemented by different modules. Such a system must also satisfy certain design constraints, for instance,
power and cost limitations. These specifications are characterized by a tuple $S = (N, \mathcal{F}, \mathcal{M}, A, Q)$, where
$N$ is the number of functional units allowed in the system, $\mathcal{F} = \{f_i : \forall i \in [1, N]\}$ is the set of functional
units allowed in the system, e.g., {add, mul}. $\mathcal{M} = \{M_{f_i} \mid \forall f \in \mathcal{F}, \forall i \in [1, N]\}$, where each $M_{f_i}$ contains
a set of eligible modules for functional unit $f_i$, e.g., $M_{f_1}$ = {ripple adder, carry-look-ahead adder}. $A$ is
a function mapping from $M_{f_i} \in \mathcal{M}$ to a set of tuples $(a_1, \ldots, a_k)$, where $a_1$ to $a_k$ represent attributes
of a particular module. In this paper, for simplicity of explanation, we are interested in synthesizing a
system under latency and power constraints¹. Hence, $A(m) = (a_1, a_2)$, where $a_1$ refers to the latency
attribute of module $m$ while $a_2$ refers to the power consumption of module $m$. Finally, $Q$ is a function
that defines the degree to which a system is acceptable for different system attributes. If $Q(a_1, \ldots, a_k) = 0$,
the corresponding design is totally unacceptable, while if $Q(a_1, \ldots, a_k) = 1$, the corresponding design is
definitely acceptable.
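As an illustration only, the Test1 graph of Figure 1 could be held in a small Python structure such as the one sketched here. The names `DataFlowGraph`, `op_type`, `predecessors`, and `root_nodes` are hypothetical and are not part of the paper; `op_type` plays the role of the operation-type function of the model.

```python
from dataclasses import dataclass


@dataclass
class DataFlowGraph:
    vertices: set
    edges: set      # set of (u, v) pairs, meaning a directed edge u -> v
    op_type: dict   # vertex -> operation type (hypothetical name)

    def predecessors(self, v):
        return {u for (u, w) in self.edges if w == v}

    def root_nodes(self):
        # vertices with no incoming edges
        return {v for v in self.vertices if not self.predecessors(v)}


test1 = DataFlowGraph(
    vertices={"A", "B", "C", "D", "E"},
    edges={("A", "E"), ("B", "E"), ("C", "E"), ("D", "E")},
    op_type={"A": "add", "B": "add", "E": "add",
             "C": "multiply", "D": "multiply"},
)
```

In this encoding, the root nodes are the four operations feeding E, which is where a list scheduler would start.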
Using a function $Q$ to define the acceptability of a system is a very powerful model. It can not
only define certain constraints but also express certain design goals. For example, suppose one is interested in
designing a system with latency and power under 150 and 120, respectively. Also, the smaller the latency and
power, the better the system is. The best system would have latency and power under 60. An acceptability
function $Q(a_1, a_2)$ for such a specification is formally defined as:
$$Q(a_1, a_2) = \begin{cases} 0 & \text{if } a_1 > 150 \text{ or } a_2 > 120 \\ 1 & \text{if } a_1 \le 60 \text{ and } a_2 \le 60 \\ \min(F_1(a_1), F_2(a_2)) & \text{otherwise,} \end{cases} \tag{1}$$
where $F_1$ and $F_2$ are linear functions, e.g., $F_1(a_1) = -0.011 a_1 + 1.67$ and $F_2(a_2) = -0.0167 a_2 + 2$, respectively,
which return an acceptability in $(0, 1)$. Figure 2(a) illustrates Equation (1) graphically.
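A minimal sketch of Equation (1) in Python follows. It assumes that both linear functions have negative slopes, so that acceptability falls from about 1 at the lower limits (60, 60) to about 0 at the upper limits (150, 120); the function name `linear_acceptability` and the final clamp are assumptions added for illustration.

```python
def linear_acceptability(latency, power):
    """Equation (1) with the example limits from the text."""
    if latency > 150 or power > 120:
        return 0.0
    if latency <= 60 and power <= 60:
        return 1.0
    f1 = -0.011 * latency + 1.67   # about 1 at latency 60, about 0 at 150
    f2 = -0.0167 * power + 2.0     # about 1 at power 60, about 0 at 120
    # Clamp to [0, 1]: the rounded coefficients overshoot slightly near the limits.
    return max(0.0, min(1.0, f1, f2))
```

For instance, a design with latency 100 and power 50 is limited by its latency term and receives an acceptability of about 0.57.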
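To model the tradeoff between two criteria by acceptability function $Q$, one may define $Q$ as follows:
$$Q(a_1, a_2) = \begin{cases} 0 & \text{if } a_1 > 140 \text{ or } a_2 > 120 \\ 1 & \text{if } a_1 \le 30 \text{ and } a_2 \le 30 \\ F(w_1 a_1 + w_2 a_2) & \text{otherwise.} \end{cases} \tag{2}$$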
The specification states that any design with latency greater than 140 or power greater than
120 is unacceptable, while any design with latency and power less than 50 is the best. Furthermore,
a design with lower latency and power is more desirable. In addition, $F$ considers the tradeoff between latency
and power criteria by using the weighted sum of both latency and power. Figure 2(b) depicts an example
of the specification concerning the tradeoff graphically, where $w_1 = 2$, $w_2 = 1$. $F$ refers to a z-shaped curve
function which produces a smooth transition between two given points. To give a better view, Figure 2(c)
shows the projection of the 3-dimensional acceptability model onto the latency and acceptability plane. In
this figure, each z curve represents a projection of the $Q$ function onto the latency-acceptability plane. An inner
curve (tighter latency constraint) corresponds to larger power values. Based on the acceptability model,
a design with high acceptability implies an optimized design towards certain goals.
¹An area constraint is very easy to include in the algorithm. However, in order to avoid drawing a four-
dimensional graph, we skip the area criterion for the purpose of clear presentation.
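The tradeoff model of Equation (2) can be sketched as follows. The exact form of the z-shaped curve is not given in the text, so this sketch assumes the common piecewise-quadratic Z-shaped membership function; the breakpoints `lo` and `hi` on the weighted sum are hypothetical values chosen only to make the example concrete, and both function names are illustrative.

```python
def z_curve(x, lo, hi):
    """Smooth Z-shaped transition: 1 at or below lo, 0 at or above hi."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    mid = (lo + hi) / 2.0
    if x <= mid:
        return 1.0 - 2.0 * ((x - lo) / (hi - lo)) ** 2
    return 2.0 * ((x - hi) / (hi - lo)) ** 2


def tradeoff_acceptability(latency, power, w1=2.0, w2=1.0, lo=90.0, hi=400.0):
    """Equation (2); lo and hi are assumed breakpoints, not values from the paper."""
    if latency > 140 or power > 120:
        return 0.0
    if latency <= 30 and power <= 30:
        return 1.0
    return z_curve(w1 * latency + w2 * power, lo, hi)
```

With weights 2 and 1, a unit of latency costs twice as much acceptability as a unit of power, which is the tradeoff the weighted sum is meant to express.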
[Figure 2 panels; 3-D axes for (a) and (b): latency, power, acceptability; axes for (c): latency, acceptability]
(a) Linear acceptability
(b) z curve acceptability with tradeoff
(c) Latency acceptability curves corresponding to different power constraints derived from Figure 2(b)
Figure 2: Various kinds of acceptability functions
In this paper, the combined scheduling/binding and module selection problem we intend to solve can
be formulated as follows: Given a specification containing $S = (N, \mathcal{F}, \mathcal{M}, A, Q)$ and $G = (V, E, \beta)$, select
modules $m \in M_{f_i}$ for each functional unit $f_i$ to execute graph $G$ while maximizing the acceptability
degree of the resulting solutions.
Since there are complex interactions among scheduling, binding, and module selection, it is not trivial
to pick the right module combination which results in a high acceptability value. A straightforward way to
solve this problem would be to completely enumerate all module selections and generate the best schedule
and binding for each one of them. By comparing the acceptability degrees of all module combinations,
the one with the highest acceptability degree is chosen. However, if there are 10 possible adders and
10 possible multipliers to choose from, and only 3 adder units and 3 multiplier units are allowed in the
system, the number of solutions to explore would be $10^6$. Thus, it would be a very time-consuming task
to investigate every design solution. In our method, we efficiently derive an elite set in which superior
solutions lie. Since the size of an elite set is small, investigating all of its members will take much less time.
Therefore, one can rapidly select good module selections.
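The size of the enumeration space in this example is simply the product of the number of eligible modules over all functional units. A short check, using hypothetical module names:

```python
import math

adders = [f"add{i}" for i in range(1, 11)]
multipliers = [f"mult{i}" for i in range(1, 11)]

# One independent module choice per functional unit:
# 3 adder units followed by 3 multiplier units.
choices_per_unit = [adders] * 3 + [multipliers] * 3
selection_count = math.prod(len(choices) for choices in choices_per_unit)
```

Each of these selections would additionally need its own schedule and binding, which is what makes exhaustive exploration expensive.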
2.2 Fuzzy Sets
In this section, we give a quick review of fuzzy set theory as it relates to our work. Readers familiar with
the theory can skip this part.
Fuzzy sets, proposed by Zadeh, represent a set with an imprecise boundary [17, 18]. In classical (crisp)
sets, an element can either be a member of a set or not at all; hence, its membership degree is either 1
or 0. A fuzzy set is defined by assigning each element $x$ in a universe of discourse its membership degree
in the unit interval $[0, 1]$, conveying to what degree $x$ is a member of the set. This membership value can
be defined as a membership function of an element in the set, $\mu(x) : x \to [0, 1]$.
A fuzzy set is said to be normal if there exists at least one member in the set whose membership value
is unity. A convex fuzzy set is defined as: for any $x$, $y$, and $z$ in the fuzzy set $A$, the relation $x < y < z$
implies that $\mu_A(y) \ge \min(\mu_A(x), \mu_A(z))$. A fuzzy number is a normal, convex fuzzy set defined on the
real line $\mathbb{R}$. Let $A$ and $B$ be fuzzy numbers with membership functions $\mu_A(x)$ and $\mu_B(y)$, respectively.
Let $\circ$ be one of the binary operations $\{+, -, \times, \div, \min, \max\}$. The arithmetic operations between two fuzzy
numbers, defined on $A \circ B$ with membership function $\mu_{A \circ B}(z)$, can use the extension principle [7]:
$$\mu_{A \circ B}(z) = \bigvee_{z = x \circ y} \left( \mu_A(x) \wedge \mu_B(y) \right) \tag{3}$$
where $\vee$ and $\wedge$ denote max and min operations, respectively.
Fuzzy arithmetic is used to compute an arithmetic operation between two fuzzy numbers. Figure 3(a)
shows a fuzzy number $A$, which is assumed to be normal and triangular-shaped, lying on the real line. In this
figure, let $A$ be assigned the confidence interval $(2, 6)$. The most possible value of $A$ is 4, since
its confidence level or presumption level is 1. Similarly, Figure 3(b) shows the fuzzy number with the
confidence interval $(3, 7)$ representing $B$. Figure 3(c) demonstrates a graphical result of adding the two fuzzy
numbers defined on the integer line from Figures 3(a) and 3(b), using Equation (3).
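The addition of A and B on the integer line can be reproduced with a few lines of Python. This is a sketch under the assumption that a discrete fuzzy number is stored as a dictionary from value to membership degree; the helper names `triangular` and `fuzzy_apply` are illustrative.

```python
def triangular(lo, peak, hi):
    """Discrete triangular fuzzy number on the integers lo..hi."""
    def mu(x):
        if x <= peak:
            return (x - lo) / (peak - lo)
        return (hi - x) / (hi - peak)
    return {x: mu(x) for x in range(lo, hi + 1)}


def fuzzy_apply(a, b, op):
    """Extension principle, Equation (3):
    mu(z) = max over all z = op(x, y) of min(mu_a(x), mu_b(y))."""
    result = {}
    for x, mu_x in a.items():
        for y, mu_y in b.items():
            z = op(x, y)
            result[z] = max(result.get(z, 0.0), min(mu_x, mu_y))
    return result


A = triangular(2, 4, 6)
B = triangular(3, 5, 7)
A_plus_B = fuzzy_apply(A, B, lambda x, y: x + y)
```

The result spans 5 to 13 and reaches membership 1 at 9, the sum of the two most possible values. Passing `max` or `min` as `op` gives the fuzzy maximum or minimum in the same way.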
Figure 3: Adding two fuzzy numbers, A and B: (a) A, triangular on (2, 6) with peak at x = 4; (b) B, triangular on (3, 7) with peak at y = 5; (c) A + B, triangular on (5, 13) with peak at z = 9
In order to compare two fuzzy numbers, several methods can be used. All of these methods are based
on selecting a representative for each fuzzy number and comparing the representatives [10]. One way to
obtain the representatives is using the removal with respect to $k$, which is a measure of distance from $k$,
computed by $R(A, k) = \frac{1}{2}(R_l(A, k) + R_r(A, k))$, where $A$ is a fuzzy number, $k$ is a reference position on
the x-axis, $R_l$ is the area bounded by the left side of the curve and the line $x = k$, and $R_r$ is defined similarly for the
right side. Another is the mode, which uses the value $x$ such that $\mu(x) = \max_i \{\mu(x_i)\}$ for all $x_i$ in
the fuzzy set. Divergence is another way to calculate the representatives. It represents the width of the
set, which is computed by $x_{max} - x_{min}$. In addition, the defuzzified value can be used to represent the
fuzzy set. Several defuzzification methods can be found in [14].
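Two of these representatives are easy to compute for a discrete fuzzy set stored as a value-to-membership dictionary. The sketch below also includes a centroid, which is one common defuzzification method and is used here only as an assumed example; the text does not say which defuzzification the authors use.

```python
def mode(fs):
    """Value with the highest membership degree."""
    return max(fs, key=fs.get)


def divergence(fs):
    """Width of the set: x_max - x_min over the listed elements."""
    return max(fs) - min(fs)


def centroid(fs):
    """Membership-weighted mean (an assumed defuzzification choice)."""
    return sum(x * mu for x, mu in fs.items()) / sum(fs.values())


A = {2: 0.0, 3: 0.5, 4: 1.0, 5: 0.5, 6: 0.0}
B = {3: 0.0, 4: 0.5, 5: 1.0, 6: 0.5, 7: 0.0}
a_is_smaller = centroid(A) < centroid(B)
```

Comparing the representatives, rather than the sets themselves, reduces the comparison of two fuzzy numbers to a comparison of two ordinary numbers.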
3 WiZard Framework
In this section, we present our approach to solving the module selection problem. The key to the approach
is the introduction of utility values. Each module is associated with a utility value, which represents the
usefulness of a module. Ideally, the utility values of modules are either 1 or 0. The design using those
modules with utility value of 1 should be of the highest quality. However, in reality, the usefulness of a
module is dependent on others. The module may be present in good designs which optimize a certain
goal and/or bad designs that do not satisfy the design goal. In this case, we allow the utility value of a
module to be any real number between 0 and 1 to represent this ambiguity. As an input to the framework,
initial utility values can be assigned based on designers' experience.
The operation of the framework at a high level is straightforward. First, a designer gives initial utility
values. Then, the framework improves them until they stabilize. The utility of each module implies
that a combination of modules with a high degree of utility will likely give schedules that
yield high acceptability. Therefore, such solutions form an elite set. By inspecting the solutions in the
set, the best solution can be identified and then used to implement a final system.
Figure 4 depicts the WiZard framework in detail. It consists of two main steps, scheduling and
utility value improvement. The first phase, inclusion scheduling, takes a data flow graph, the number of
functional units required in a target system, and a module set as well as their associated utility values
as inputs and constructs a general schedule. It is important to point out that only one schedule is
created for the entire module set. The schedule may not be the same as any schedule that one would
obtain if any single combination of modules were used. Nevertheless, the resulting latency and power of the
schedule give a close approximation to the latencies and powers generated by the schedules of all module
combinations [3]. We, therefore, use the scheduling process as a means to obtain important information
for further assessment of the modules' usefulness.
Specifically, while calculating a schedule, latency and power as well as the corresponding module usage
are recorded. These data are then used as inputs to the second phase: the utility value improvement.
[Figure 4 flowchart: the DFG, the number of functional units (#FU), and the module set (e.g., cla-add, ra-add, alu16, alu32, qbs-mul, p-mul32) with its initial utility assignment feed Inclusion Scheduling, which outputs a schedule with latency, power, and module information; Calculate Utility and Value Improvement update the utility values per functional unit; if improvement is necessary the loop repeats, otherwise the module selection is output and the process stops]
Figure 4: WiZard module selection framework
In this step, the utility value of each module is adjusted. For a given module, the acceptability of every
latency and power pair that the module contributes to is analyzed. Intuitively, if a module causes a lot
of unacceptable latency and power values, the utility of the module should be decreased. On the other
hand, if a module contributes to a lot of high-acceptability latency and power values, the module's utility
value should be increased. The statistics of a module usage for each latency and power value are used as
a scaling factor to the acceptability value for the module for signifying the module's contribution. Based
on this idea, we have developed a heuristic to compute the relative adjustment of a utility value. The
adjustment is then applied to update the previous utility value. The 2-step process is repeated until the
adjustment values converge to zero. The experimental results show that the average number of iterations
is approximately eleven.
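The two-phase iteration can be summarized by the following Python skeleton. It is a sketch only: `schedule_fn` and `adjust_fn` are hypothetical callbacks standing in for inclusion scheduling and for the adjustment heuristic, and the multiplicative update with clamping is an assumed way of applying a relative adjustment, not a rule stated in the text.

```python
def wizard_loop(utilities, schedule_fn, adjust_fn, tol=1e-3, max_iters=100):
    """utilities: {(unit, module): value in [0, 1]}.
    schedule_fn(utilities) -> statistics gathered while scheduling.
    adjust_fn(stats, key) -> relative adjustment for one module."""
    for iteration in range(1, max_iters + 1):
        stats = schedule_fn(utilities)                     # phase 1: scheduling
        adj = {key: adjust_fn(stats, key) for key in utilities}
        for key, a in adj.items():                         # phase 2: improvement
            # Assumed update rule: scale by (1 + adj), then clamp to [0, 1].
            utilities[key] = min(1.0, max(0.0, utilities[key] * (1.0 + a)))
        if max(abs(a) for a in adj.values()) < tol:        # adjustments converged
            return utilities, iteration
    return utilities, max_iters
```

The loop stops as soon as every adjustment is smaller than `tol` in magnitude, mirroring the stopping condition described above.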
3.1 Inclusion Scheduling
In order to construct a general schedule based on a utility assignment, we borrow some techniques from
fuzzy set theory. In particular, we model modules and their respective utility values as a fuzzy set with
respect to the corresponding functional unit. For each functional unit $f$ and its eligible module set $M_f$,
let $\mu_f(m) \in [0, 1], \forall m \in M_f$, describe the utility value of module $m$. For instance, consider an application
with only addition and multiplication operations, in which only 2 adders and 1 multiplier are allowed in
any implementation (3 functional units $f_1, f_2, f_3$). Assume that there are 10 possible adder modules
ADD = {add1, add2, ..., add10} and 10 possible multiplier modules MUL = {mult1, mult2, ..., mult10}.
Let $\mu_{f_i}(m)$ be the utility value of module $m$ for functional unit $f_i$. Note that $m \in$ ADD for $f_1, f_2$ and
$m \in$ MUL for $f_3$. It follows that $\mu_{f_i}(m) \in [0, 1]$ and can be considered as the membership function of
$f_i$ with respect to the universe $M_f$. This concept also implicitly shows that a functional unit has fuzzy
execution times and powers.
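To make the last remark concrete, the sketch below turns a module set with utility values into a fuzzy set of execution times (or powers) for one functional unit. The adder values are those listed in Table 1 later in this section; the function name `fuzzy_attribute` and the rule of keeping the highest utility when two modules share a value are assumptions.

```python
def fuzzy_attribute(modules, attribute):
    """modules: {name: {"time": ..., "power": ..., "util": ...}}.
    Returns {attribute value: membership}, keeping the highest utility
    when two modules share the same attribute value."""
    fuzzy = {}
    for spec in modules.values():
        value = spec[attribute]
        fuzzy[value] = max(fuzzy.get(value, 0.0), spec["util"])
    return fuzzy


adders_f1 = {
    "a0": {"time": 5, "power": 20, "util": 0.2},
    "a1": {"time": 10, "power": 15, "util": 0.7},
    "a2": {"time": 25, "power": 10, "util": 1.0},
}
fuzzy_time_f1 = fuzzy_attribute(adders_f1, "time")
fuzzy_power_f1 = fuzzy_attribute(adders_f1, "power")
```

The execution time of the adder unit is then no longer a single number but the set {5: 0.2, 10: 0.7, 25: 1.0}, which is the kind of operand fuzzy arithmetic works on.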
Inclusion scheduling is a scheduling method which takes fuzzy characteristics into consideration,
e.g., the fuzzy set of varying latency (power) values associated with each functional unit. The output schedule,
in turn, also consists of fuzzy attributes. The actual steps in inclusion scheduling are given in Algorithm 3.1.
In a nutshell, inclusion scheduling simply replaces the computation of accumulated execution times in a
traditional scheduling algorithm by fuzzy arithmetic-based computation. In our case, both latency
and power attributes are fuzzy numbers. Hence, fuzzy arithmetic (specifically Equation (3)) is used
to compute possible latencies and powers from the given functional specification. Then, latencies and
powers of different schedules are compared to select a functional unit for scheduling an operation. Note
that fuzzy arithmetic can be embedded in different scheduling algorithms. However, a tradeoff exists
between the complexity and quality of different scheduling algorithms. In this paper, we use a general
version of the list scheduling algorithm [1] as our scheduling framework due to its efficiency.
In a list-based scheduling algorithm, we attempt to construct a schedule that minimizes the total
latency and average power. The algorithm traverses the graph in topological order. It puts all ready
nodes (vertices whose parents are already computed) into a queue Q. The nodes in the queue are then
rearranged by function prioritized in Line 3. A ready node is extracted from the list one at a time. The
algorithm tries to schedule the node in each available functional unit. Routine assign_heuristic(S, u, f) at
Line 7 finds a legal scheduling position for node u at functional unit f based on the schedule S. A simple
heuristic can be to assign node u after the last node in the unit f or to try all possible legal positions in
the unit f. After a node is scheduled to a functional unit, the schedule latency must be updated and,
therefore, the average power consumption needs to be recalculated.
Algorithm 3.1 (Inclusion scheduling)
Input: G = (V, E, β), and specification Spec = (N, F, M, A, Q)
Output: A schedule S, latency and power
1 Q = vertices in G with no incoming edges // finding root nodes
2 while Q ≠ empty do
3   Q = prioritized(Q)
4   u = dequeue(Q); mark u scheduled
5   good_S = NULL;
6   foreach f ∈ {f_j : f_j is able to perform β(u)} do
7     temp_S = assign_heuristic(S, u, f) // assign u at FU f
8     if Eval_Schedule(good_S, temp_S, G, Spec)
9       then good_S = temp_S fi od
10  S = good_S // keep good schedule
11  foreach v : (u, v) ∈ E do
12    indegree(v) = indegree(v) - 1
13    if indegree(v) = 0 then enqueue(Q, v) fi od
14 od
15 return(S)
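The control flow of Algorithm 3.1 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the `prioritized` step is omitted (ready nodes are served in FIFO order), the assignment heuristic always appends to the end of a unit, and the fuzzy Eval_Schedule is replaced by a caller-supplied crisp `cost` function where lower is better. All names are hypothetical.

```python
from collections import deque


def list_schedule(vertices, edges, op_type, units, cost):
    """units: {unit name: operation type it performs}.
    cost(schedule) -> number, lower is better (stand-in for Eval_Schedule).
    Returns {unit: [nodes in execution order]}."""
    indegree = {v: 0 for v in vertices}
    for _, v in edges:
        indegree[v] += 1
    ready = deque(sorted(v for v in vertices if indegree[v] == 0))  # root nodes
    schedule = {f: [] for f in units}
    while ready:
        u = ready.popleft()
        best = None
        for f, kind in units.items():
            if kind != op_type[u]:
                continue
            # assign_heuristic stand-in: place u after the last node on unit f
            trial = {g: list(nodes) for g, nodes in schedule.items()}
            trial[f].append(u)
            if best is None or cost(trial) < cost(best):
                best = trial
        if best is None:
            raise ValueError(f"no functional unit can perform {op_type[u]}")
        schedule = best                       # keep the better schedule
        for a, b in sorted(edges):
            if a == u:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    return schedule


units = {"f1": "add", "f2": "multiply", "f3": "multiply"}
edges = {("A", "E"), ("B", "E"), ("C", "E"), ("D", "E")}
op_type = {"A": "add", "B": "add", "E": "add", "C": "multiply", "D": "multiply"}
# Crude load-balancing cost, used only to make the example runnable.
balance = lambda s: sum(len(nodes) ** 2 for nodes in s.values())
result = list_schedule({"A", "B", "C", "D", "E"}, edges, op_type, units, balance)
```

In inclusion scheduling proper, the `cost` comparison would be carried out on fuzzy latency and power sets rather than on a single number.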
Routine Eval_Schedule (Line 8) updates the total latency and power of the schedule after node u is
assigned to f. It also compares the current schedule with the "best" one found in previous iterations.
The better one of the two is then chosen and the process is repeated for all nodes in the graph.
Algorithm 3.2 (Eval_Schedule)
Input: schedules S1, S2, G = (V, E, β) and
       specification Spec = (N, F, M, A, Q)
Output: 1 if S1 is better than S2, 0 otherwise.
1 G' = (V', E', β) where V' = V - {unscheduled nodes}; E' = ∅
2 foreach schedule S_i = S1 to S2 do
3   E' = {(u, v) : u, v ∈ V', if u, v in same f.u. in S_i
4        and v is immediately after u}
5   Sort graph G according to the topological order
6   // total, total_tp_i and quality[i]: fuzzy set of latency and power.
7   total = ∅
8   foreach level i of graph G in topological order do
9     // bind(u) returns the functional unit binding for u
10    total_tp_i = Σ_{u ∈ V_i} fuzzymax_time_add_energy(TP(bind(u)));
11    total = fuzzyadd_time_add_energy(total, total_tp_i)
12  od
13  quality[S_i] = total od // quality of the schedule i
14 // comparing the overall attributes of both schedules
15 return(compare(quality[S1], quality[S2]))
Algorithm 3.2 computes the fuzzy set of latency and power of a schedule. The algorithm adds edges
between two consecutive nodes in the same functional unit to construct a legal execution order of all nodes.
In order to simplify the calculation, the latency and power values are discretized. Using Equation (3)
by replacing $z$ with an execution time and power tuple, we compute the utility degree of each latency and
power pair. This value simply reflects the utility value of the modules contributing to the latency and
power pair. Line 11 replaces a maximum operation with fuzzy maximum in order to compute maximum
execution times of the nodes in the same control step. For the power value, we calculate the average
power consumption per time unit by using the formula [11],
$$P = \frac{\sum_k n_k p_k t_k}{T} \tag{4}$$
where $T$ is the schedule length, $p_k$ and $t_k$ are the power consumption and execution time of
functional unit $f_k$, and $n_k$ is the number of operations executed by functional unit $f_k$. Since power
consumption is closely related to latency, each power value needs to be associated with a latency value.
While calculating a latency value in both fuzzymax_time_add_energy and fuzzyadd_time_add_energy, the
energy ($t \times p$) is accumulated. Note that for the purpose of utility value updates, the modules causing each
latency and power value need to be recorded. In order to do so, both the fuzzymax_time_add_energy and
fuzzyadd_time_add_energy routines also collect module references while adding/maximizing latencies and
powers (we will discuss the utility value update in Section 3.2). Depending on the assign_heuristic, the
values $total_i$ may be kept statically for some control step $i$ to save the computation for fuzzy addition
and maximum, for each iteration in Line 6 of Algorithm 3.1.
In routine compare, the accumulated energy is first divided by the corresponding latency to calculate
a power consumption. Fuzzy arithmetic is also used here for comparing two fuzzy sets of latency and
power. Based on a design goal, the routine defines a heuristic used to decide the best intermediate
schedule to keep (between quality[S1] and quality[S2]).
            Adder f1                      Multipliers f2, f3
Module   Time   Power   Util.     Module   Time   Power   Util.
a0       5      20      0.2       m0       5      100     0.2
a1       10     15      0.7       m1       10     37.5    0.5
a2       25     10      1         m2       20     23      0.7
                                  m3       30     10      1
Table 1: Adder, multiplier modules and their utility values
Using Algorithm 3.2, one can easily add more criteria by modifying Lines 10 and 11. For instance,
if minimizing area is another criterion, the area calculation is added to Lines 10 and 11 (in routines
fuzzymax_time_add_energy and fuzzyadd_time_add_energy).
Let us use an example to illustrate how inclusion scheduling works. Consider the example in Section 2,
Figure 1. Nodes C and D are multiplication nodes while nodes A, B, and E are addition nodes. Suppose
a potential implementation consists of 3 functional units ($f_1$, $f_2$, and $f_3$, where $f_1$ is an adder and $f_2$, $f_3$
are multipliers). Table 1 shows execution times and powers for each adder and multiplier module as well
as their initial utility assignment. This assignment suggests that a2, m3 and m3 should be the best
module selection for $f_1$, $f_2$ and $f_3$, respectively. To compare the quality of schedules, the compare function
needs to be defined according to a design goal. Suppose our design goal is to optimize both latency and
power criteria and we consider a tradeoff between latency and power (similar to the acceptability model in
Figure 2(b)) where the power to time ratio is 2/1. In the compare routine, we use such a ratio to compute the
weighted sum of both latency and power. Then, we defuzzify the weighted sums of latency and power and
compare them to select an intermediate schedule.
f1   f2   f3
+    ×    ×
A    C    D
B
E
Figure 5: Schedule for Test1 DFG (Figure 1)
Applying inclusion scheduling to the above example, we obtain a schedule as shown in Figure 5. The
possible latency and average power of this schedule are estimated as shown in Figure 6(a). In Figure 6(b),
the z axis shows the resulting utility value for each latency and power pair based on the initial module
utility assignment. These values are derived from Algorithm 3.2. Note that the point with minimum
power and maximum latency (latency=80, power=17, as noted in the square box in Figure 6(b)) has
the maximum utility value (1), which corresponds precisely to the initial utility assignment that prefers
the modules with the minimum power (a2, m3, m3). If the system requires that the latency must be less
than 50, some of the designs with high utility values are no longer acceptable. Furthermore, given that
the tradeoff criterion of power to latency is 2/1, we need to adjust the utility assignment. This will be
discussed in the next subsection.
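The point (latency = 80, power = 17) can be checked by hand with crisp values. The sketch below assumes that the schedule of Figure 5 has three control-step levels, {A, C, D}, {B}, and {E}, that latency is the sum over levels of the slowest node in each level, and that average power follows Equation (4); the variable names are illustrative.

```python
# (time, power) of the modules preferred by the initial assignment in Table 1
module = {"f1": (25, 10), "f2": (30, 10), "f3": (30, 10)}   # a2, m3, m3
binding = {"A": "f1", "B": "f1", "E": "f1", "C": "f2", "D": "f3"}
levels = [["A", "C", "D"], ["B"], ["E"]]   # assumed control steps of Figure 5

# Latency: sum over levels of the slowest node in each level.
latency = sum(max(module[binding[u]][0] for u in level) for level in levels)

# Energy: t * p accumulated over every executed operation
# (the numerator of Equation (4)).
energy = sum(module[f][0] * module[f][1] for f in binding.values())

average_power = energy / latency
```

Under these assumptions the latency is 30 + 25 + 25 = 80 and the average power is 1350 / 80 = 16.875, which rounds to the 17 reported for this point.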
[Figure 6 panels; axes for (a): latency, power; axes for (b): latency, power, utility]
(a) Latency and power
(b) Latency, power and utility values
Figure 6: Latency and power of schedule in Figure 5
The time complexity of the inclusion scheduling algorithm is simply the time complexity of list
scheduling multiplied by the fuzzy computation complexity. In Algorithm 3.2, the time complexity of
fuzzy addition and fuzzy maximum depends on the number of discrete points each $total_i$ retains. To limit
the number of points, we allow at most $K$ discrete points for each functional unit. Then, the time
complexity for addition and maximum operations is $O(K^2)$. Hence, the time complexity of computing
the overall fuzzy latency and power is $O((|V'| + |E'|)K^2)$, while the time complexity of the compare routine
is $O(K)$. Altogether, the time complexity of the expanded version of list scheduling is $O((|V| + |E|)^2 K^2)$,
where $O(|V| + |E|)$ is the list scheduling time complexity.
3.2 Utility Adjustment
Recall that the utility values of modules should reflect the usefulness of the modules towards a design goal.
However, the initial assignment of utility values may not satisfy the given design goal. We use the utility
adjustment phase to modify the utility values in order to find the most appropriate utility assignments. In
other words, we attempt to give high utility values to modules which contribute to the most highly acceptable
latency and power pairs and assign low utility values to the modules contributing to latency and power
pairs with low acceptability values. As we stated earlier, the calculation of the schedule latency and
power (Lines 10 and 11 in Algorithm 3.2) also tallies the number of module references for each latency
and power value. Let the number of references to a module $m$ by functional unit $f$ be $freq(f, m)$. For a given
latency and power pair $(t, p)$, we compute the positive contribution of $m$ by
$$\delta^{+}(f, m) = \sum_{\forall (t,p) \text{ s.t. } \mu(t,p) = \mu_f(m)} freq_{t,p}(f, m) \times acc(t, p) \tag{5}$$
The condition "$\forall (t, p)$ s.t. $\mu(t, p) = \mu_f(m)$" ensures that only the $(t, p)$ pairs caused by module
$m$ in $M_f$ are considered when calculating the adjustment value for the module. From Equation (5), a
higher $\delta^{+}(f, m)$ value indicates that using $m$ can potentially lead to systems with a higher acceptability
value. We also compute the negative contribution of $m$ by
$$\delta^{-}(f, m) = \sum_{\forall (t,p) \text{ s.t. } \mu(t,p) = \mu_f(m)} freq_{t,p}(f, m) \, (1 - c \times acc(t, p)) \qquad (6)$$
Then Equation (7) is used to estimate an adjustment of the utility value for each module of functional
unit f.
$$adj_f(m) = \frac{c\,\delta^{+}(f, m) - \delta^{-}(f, m)}{c\,\delta^{+}(f, m) + \delta^{-}(f, m)} \qquad (7)$$
The term $adj_f(m)$ is a relative change to the current $\mu_f(m)$ value. From Equation (7), if $adj_f(m)$ is
negative, the module tends to cause more bad latency and power pairs, and $\mu_f(m)$ should be decreased.
On the other hand, if $adj_f(m)$ is positive, $\mu_f(m)$ is increased.
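As a minimal sketch of Equations (5)-(7), the Python function below accumulates the positive and negative contributions of each module of one functional unit and derives the relative adjustment. The function name `adjustment` and the layout of `records` are illustrative assumptions, not part of WiZard: each record is assumed to be a (module, frequency, acceptability) triple that already satisfies the filtering condition of Equation (5).

```python
from collections import defaultdict


def adjustment(records, c=1.0):
    """Return {module: adj} for one functional unit.

    records: iterable of (module, freq, acc) triples (hypothetical layout),
             one per latency/power pair attributed to that module.
    c:       the scaling constant of Equations (6) and (7).
    """
    pos = defaultdict(float)  # positive contribution, Equation (5)
    neg = defaultdict(float)  # negative contribution, Equation (6)
    for module, freq, acc in records:
        pos[module] += freq * acc
        neg[module] += freq * (1.0 - c * acc)

    adj = {}
    for module in pos:
        denom = c * pos[module] + neg[module]
        # Assumption: a module with no usable evidence gets no adjustment.
        adj[module] = (c * pos[module] - neg[module]) / denom if denom else 0.0
    return adj


# Toy data: m1 appears only in acceptable pairs, m6 only in unacceptable
# ones, and m2 is split evenly between the two.
example = adjustment([
    ("m1", 10, 1.0),
    ("m6", 400, 0.0),
    ("m2", 4, 1.0),
    ("m2", 4, 0.0),
])
```

With c = 1, a module referenced only by fully acceptable pairs receives an adjustment of 1, a module referenced only by unacceptable pairs receives -1, and an evenly split module receives 0.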
The scalar value $c$ introduced in Equation (7) serves to tune the utility values so as to avoid the
case where $adj_f(m)$ is negative for every $m$. Normally, $c$ is equal to one. For a very tight constraint
(upper and lower limits are close to each other), the maximum acceptability obtained over all the latency
and power values is quite small, for instance, less than 0.6. It is then likely that $\delta^{-}$ is much bigger than $\delta^{+}$
for every module, and therefore all $adj$ values become negative. In this case, $c$ has to be greater than
one so as to scale down this difference (i.e., scale up the acceptability values). Note that for consistency, the
same $c$ is used for all modules $m$ and all functional units $f$. This $c$ value can be obtained experimentally. By
running the inclusion scheduling once, from the derived latency and power, we find the latency and power
pair which gives the maximum acceptability value (MaxAcc), and $c$ can be estimated from MaxAcc. For
instance, when the acceptability values of latencies and powers are normalized to one, $c = 1/MaxAcc$.
In our experiments, we have used the following method to calculate the new $\mu_f(m)$.
$$\mu_f^{new}(m) = \begin{cases} \mu_f(m) \times adj_f(m) + \mu_f(m) & \text{if } 0 < adj_f(m) \le 1 \\ \frac{\mu_f(m)}{2} + (1 + adj_f(m)) \times \frac{\mu_f(m)}{2} & \text{if } -1 \le adj_f(m) < 0 \end{cases} \qquad (8)$$
Since the value of $adj_f(m)$ always lies in $[-1, 1]$, if $adj_f(m)$ equals 1, we double the value of $\mu_f(m)$,
and if $adj_f(m)$ equals $-1$, $\mu_f(m)$ is reduced by half. If $adj_f(m)$ is in $(-1, 0]$, the change of $\mu_f(m)$
is proportional to half of $\mu_f(m)$, and if $adj_f(m)$ is in $(0, 1)$, the change of $\mu_f(m)$ is proportional
to $\mu_f(m)$. After the adjustment for all modules is made, the $\mu_f^{new}(m)$ values are normalized with respect to the
highest one, i.e., $norm(\mu_f(m)) = \frac{\mu_f^{new}(m)}{\max_m \mu_f^{new}(m)}, \forall m \in M_f$. If $\mu_f^{new}(m)$ is the same as $\mu_f(m)$ from the
previous iteration for every $m$, the adjustment is no longer needed.
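The update rule of Equation (8) and the subsequent normalization can be sketched in Python as follows. The function name `update_utilities` and the dictionary layout are assumptions made for illustration; an adjustment of exactly zero, which the printed cases of Equation (8) do not list, is routed through the second branch here, where it leaves the utility unchanged.

```python
def update_utilities(utility, adj):
    """Apply Equation (8) to one functional unit, then normalize.

    utility: {module: current utility value}
    adj:     {module: relative adjustment in [-1, 1]}
    """
    new = {}
    for module, u in utility.items():
        a = adj.get(module, 0.0)  # assumption: missing adjustment means 0
        if a > 0:
            new[module] = u * a + u                  # a = 1 doubles u
        else:
            new[module] = u / 2 + (1 + a) * u / 2    # a = -1 halves u

    top = max(new.values())
    # Normalize with respect to the highest new utility value.
    return {m: (v / top if top > 0 else 0.0) for m, v in new.items()}


updated = update_utilities(
    {"m1": 0.4, "m5": 1.0, "m6": 0.05},
    {"m1": 1.0, "m5": -1.0, "m6": 0.0},
)
```

In this toy input, m1 doubles to 0.8 and becomes the new maximum, so after normalization m1 has utility 1, m5 has 0.5/0.8 = 0.625, and m6 has 0.05/0.8 = 0.0625.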
We have investigated other adjustment policies. For instance, $adj$ can be an absolute change to $\mu_f(m)$.
This method does not give a good result because $\mu_f(m)$ changes too fast and all of the values may quickly
become one. This is not desirable since we are then unable to distinguish the quality of the modules. Though
the proposed adjustment method does not suffer from this problem, it has a drawback: when
$\mu_f(m)$ is decremented repeatedly, the value tends to converge to zero. Once $\mu_f(m)$ is zero, the scheduling
approach disregards the module for efficiency and, therefore, the module is removed from the set
permanently. Often, if a module's utility is repeatedly decreased, the module is not a good one.
However, several modifications can be made to fix this problem. The simplest way would be to slow down
the reduction rate to ensure that a module which is completely ignored is definitely not a good
one. From another point of view, this approach eliminates unwanted modules iteratively. Further, every
time a module is eliminated, the inclusion scheduling runs faster because the number of possible
latency and power values is reduced.
4 Illustrated Examples
In this section, we present an example to show how WiZard works. Consider the module set shown in
Figure 7(a) and the data flow graph shown in Figure 7(b). Assume that we consider a system with 3
functional units: 1 adder and 2 multipliers. We will show that WiZard can consistently produce proper
elite sets for various design goals.
Modules  Time  Power  Initial
a0        7     40     0.1
a1       15     30     0.2
a2       20     25     0.3
a3       25     15     0.7
a4       35     10     0.9
a5       40      5     1
m0        8     75     0.2
m1       15     55     0.4
m2       25     40     0.6
m3       35     30     0.7
m4       45     20     0.9
m5       55     10     1
m6       60      5     0.05
(a) Modules
(b) Test2 DFG [graph: nodes A, B, C, D, E, F, G, H, I; node types Multiplication and Addition]
Figure 7: An example of module set and data flow graph
Example I
First, let us consider the case where the goal is simply to satisfy latency and power constraints. We define
$Q(t, p)$ as:
$$Q(t, p) = \begin{cases} 0 & \text{if } t > 200 \text{ or } p > 80 \\ 1 & \text{if } t \le 200 \text{ and } p \le 80. \end{cases}$$
In the initial assignment, a5 is the most favorable module for f1 and m5 is the one for both f2 and f3
(see the last column in Figure 7(a)). Notice that if a5, m5, m5 were used, the selection would lead to an
unacceptable system (latency equals 245 and power equals 14, by the list-based scheduling algorithm).
Therefore, utility adjustments are necessary.
During inclusion scheduling, the number of times each module is referred to, denoted by $freq_{(t,p)}$,
is recorded. These frequencies are associated with each functional unit and are collected while scheduling
each node of the input graph. Figure 8 presents a sample of $freq_{(t,p)}$ values for the latency and power
pairs $(t, p)$ of (90,66) and (202,13) for f2. In Figure 8(a), which corresponds to the latency and power pair (90,66),
only modules m1, m2, and m3 contribute to this value. Since this pair is acceptable, the utility
values of modules m1, m2, and m3 for f2 should be increased. On the other hand, Figure 8(b) corresponds
to the latency and power pair (202,13), which is unacceptable. Many modules may have
contributed to these values. However, from Figure 8(b), module m6 contributes the most. Based on
this observation, the utility value of m6 should be decreased. Combining the $freq_{(t,p)}$ information for all latency
and power pairs $(t, p)$, adjustments are determined according to Equations (5)–(8).
Figure 9(a) presents the evolution of the utility values for adder modules of f1 and Figure 9(b) presents
the change of the utility values for multiplier modules of f2 and f3. Rows "1st" and "2nd" show the utility
values after adjusting and normalizing the utility values for the respective iteration.
(a) $freq_{(t,p)}$ for latency and power pair (90,66) [bar chart: freq per module m0–m6]
(b) $freq_{(t,p)}$ for latency and power pair (202,13) [bar chart: freq per module m0–m6]
Figure 8: Sample $freq_{(t,p)}$ information for multiplier modules for f2 collected during the first
iteration in Example I
After the second iteration, the utility values for f1 have stabilized. For f2 and f3, the module with utility value
one has changed from m5 to m2 in the second iteration. In the remaining iterations, the utility values of
multiplier modules with large execution times keep decreasing. Since modules m5 and m6 contribute
significantly to violations of the constraints, they are removed from the eligible set after eleven
adjustment iterations. The row "final" of Figure 9 shows the final utility values of choosing each module
for each functional unit.
Func. unit  iteration  adder module utility
                       a0    a1    a2    a3    a4    a5
f1          1st        0.1   0.2   0.3   0.7   0.9   1
            2nd        0.16  0.39  0.59  1     0.59  0.66
            ...        ...
            final      0.16  0.39  0.59  1     0.59  0.66
(a) Utility values for adder modules (f1)
Func. unit  iteration  multiplier module utility
                       m0    m1    m2    m3    m4    m5    m6
f2          1st        0.2   0.4   0.6   0.7   0.9   1     0.05
            2nd        0.33  0.60  1     0.95  0.44  0.50  0.11
            ...        ...
            final      0.32  0.60  1     0.96  0.05  0     0
f3          1st        0.2   0.4   0.6   0.7   0.9   1     0.05
            2nd        0.33  0.61  1     0.96  0.44  0.50  0.11
            ...        ...
            final      0.32  0.60  1     0.96  0.05  0     0
(b) Utility values for multiplier modules (f2, f3)
(f1 f2 f3)   Latency  Power
a3 m2 m2     125      52
a3 m2 m3     135      48
a3 m3 m2     135      48
a3 m3 m3     155      43
(c) Latency and power of the solutions obtained
by modules with utility value greater than 0.9
Figure 9: Initial utility assignment and their improvements after each iteration, and elite set
(Example I)
Suppose we select the modules that have utility values of 0.9 or higher. This leads to the selection of {a3},
{m2, m3}, and {m2, m3} for f1, f2, and f3 respectively. Consequently, the elite set contains {(a3,m2,m2),
(a3,m2,m3), (a3,m3,m2), (a3,m3,m3)}. For comparison, we generate results by constructing schedules
for each of these selections. Figure 9(c) displays the resulting latency and power attributes when using the
selections in the elite set. It is clear that all the selections give acceptable solutions, i.e., their latencies
and powers do not exceed 200 and 80 respectively.
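The construction of the elite set from the final utility values can be sketched as a threshold filter per functional unit followed by a Cartesian product. The function name `elite_set` and the data layout are illustrative assumptions; the utility values are the "final" rows of Figure 9.

```python
from itertools import product


def elite_set(final_utility, threshold=0.9):
    """Combine, per functional unit, the modules whose utility >= threshold.

    final_utility: list of {module: utility} dicts, one per functional unit,
                   in the order f1, f2, f3 (hypothetical layout).
    """
    candidates = [
        [module for module, u in unit.items() if u >= threshold]
        for unit in final_utility
    ]
    return list(product(*candidates))


adders = {"a0": 0.16, "a1": 0.39, "a2": 0.59, "a3": 1, "a4": 0.59, "a5": 0.66}
multipliers = {"m0": 0.32, "m1": 0.60, "m2": 1, "m3": 0.96,
               "m4": 0.05, "m5": 0, "m6": 0}

elite = elite_set([adders, multipliers, multipliers])
```

For these values the candidates are {a3} for f1 and {m2, m3} for each multiplier unit, so `elite` holds the four selections listed in Figure 9(c).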
Example II
Next, let us consider the case where the design goal is to minimize latency under certain latency and
power constraints. Define $Q(t, p)$ as:
$$Q(t, p) = \begin{cases} 0 & \text{if } t > 200 \text{ or } p > 80 \\ 1 & \text{if } t \le 50 \text{ and } p \le 80 \\ -0.0067t + 1.33 & \text{otherwise.} \end{cases}$$
For this design goal, we assume a linear acceptability function. The constants in $-0.0067t + 1.33$ are
obtained from the linear interpolation of two (latency, acceptability) coordinates, $(200, 0)$ and $(50, 1)$.
This function depends only on the latency variable $t$ and favors shorter latency values. Figure 10(a) plots
the above $Q$. For this problem, the solutions obtained from the previous example would no longer result
in high acceptability. Using the same initial utility assignment, we re-run the experiment using the new
acceptability model.
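The interpolation behind the constants of this acceptability function can be checked with a short Python sketch. The names `linear_coefficients` and `q_example2` are hypothetical, and the sketch uses the exact slope and intercept rather than the rounded constants, so its values differ slightly from those of the printed formula.

```python
def linear_coefficients(x0, y0, x1, y1):
    """Slope and intercept of the line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    intercept = y0 - slope * x0
    return slope, intercept


def q_example2(t, p):
    """Acceptability for Example II: latency-only linear model."""
    if t > 200 or p > 80:
        return 0.0
    if t <= 50:          # p <= 80 already holds at this point
        return 1.0
    slope, intercept = linear_coefficients(200, 0.0, 50, 1.0)
    return slope * t + intercept


slope, intercept = linear_coefficients(200, 0.0, 50, 1.0)
```

The slope is -1/150 (about -0.0067) and the intercept is 4/3 (about 1.33), matching the constants above. Under this sketch, the selection (a3,m2,m2) from Example I, with latency 125 and power 52, has acceptability 0.5, which illustrates why the earlier elite set no longer rates highly.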
In Figures 10(b)–10(c), we only include the initial and final utility values for adders and multipliers
obtained after 16 iterations. Based on these results, the elite set contains only (a0,m1,m1). This selection
gives latency 67 and power 78. Clearly, the latency obtained from this solution is much shorter than
the latency values obtained for Example I (Figure 9(c)).
To assess the quality of our solution, we generate all possible 294 combinations of modules for the
given adders and multipliers (6 adder modules and 7 multiplier modules) and their corresponding schedules.
Then, for each schedule, its latency and power are projected onto the acceptability function to compute
the acceptability value. The solutions obtained from this method are referred to as the "enumerated set".
Figure 10(d) uses a pie chart to divide the 294 enumerated solutions into 10 ranks: the 1st rank corresponds
to the solutions whose acceptability values lie in $[\frac{9}{10} \times MaxAcc, MaxAcc]$ (top 10% best
acceptabilities) while the 10th rank corresponds to the solutions whose acceptability values lie in
$[0, \frac{1}{10} \times MaxAcc]$, where $MaxAcc$ is $\max_{(t,p)} Q(t, p)$. In this figure, only 3% of the solutions
belong to the first rank while 28% of the solutions belong to the last rank. The elite set
produced by WiZard is in the first rank, where only 3 percent of all 294 enumerated solutions fall.
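The enumeration count and the rank assignment can be sketched as follows. The function names are hypothetical, and since the rank intervals as stated share their end points, assigning a boundary value to the better rank is an assumption of this sketch.

```python
import math
from itertools import product


def count_selections(n_adder_modules=6, n_multiplier_modules=7):
    """Number of module selections for 1 adder unit and 2 multiplier units."""
    return len(list(product(range(n_adder_modules),
                            range(n_multiplier_modules),
                            range(n_multiplier_modules))))


def rank_of(acc, max_acc, n_ranks=10):
    """Rank 1 covers the top 1/n_ranks of [0, max_acc], rank n_ranks the bottom."""
    if max_acc <= 0:
        raise ValueError("max_acc must be positive")
    k = math.ceil(n_ranks * (1 - acc / max_acc))
    return min(max(k, 1), n_ranks)  # clamp so that acc == max_acc gets rank 1
```

For example, with MaxAcc normalized to 1, an acceptability of 0.95 falls in the 1st rank, 0.85 in the 2nd, and 0.05 in the 10th.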
(a) Acceptability model [3-D plot: acceptability over power and latency]
Func. unit  iteration  adder module utility
                       a0    a1    a2    a3    a4    a5
f1          1st        0.1   0.2   0.3   0.7   0.9   1
            final      1     0.05  0.07  0.02  0     0
(b) Utility values for adder modules (f1)
Func. unit  iteration  multiplier module utility
                       m0    m1    m2    m3    m4    m5    m6
f2          1st        0.2   0.4   0.6   0.7   0.9   1     0.05
            final      0.05  1     0.19  0     0     0     0
f3          1st        0.2   0.4   0.6   0.7   0.9   1     0.05
            final      0.01  1     0.36  0     0     0     0
(c) Utility values for multiplier modules (f2, f3)
(d) Distribution of 294 solutions [pie chart over ranks 1st–10th; slice percentages as extracted: 3%, 4%, 7%,
13%, 9%, 3%, 11%, 9%, 14%, 28%; annotated selection: a0, m1, m1 with latency = 67, power = 78]
Figure 10: Acceptability model minimizing latency, utility values for Test2 DFG and distribution
of enumerated solutions (Example II)
Example III
For the last example, we demonstrate the use of WiZard for finding a design minimizing both latency
and power under given latency and power constraints. Obviously, the selection from the previous example
does not give a solution which minimizes power. Therefore, we formulate a new objective function using
a weighted sum of latency and power. In particular, the design goal is to minimize $w_1 t + w_2 p$, where
$w_1$ and $w_2$ are the weights of the latency and power attributes. In the following equation, we assume a linear
tradeoff where $w_1 = 1$ and $w_2 = 2$, i.e., saving power is twice as important as reducing latency.
$$Q(t, p) = \begin{cases} 0 & \text{if } t > 200 \text{ or } p > 80 \\ 1 & \text{if } t \le 50 \text{ and } p \le 20 \\ -0.0039(t + 2p) + 1.35 & \text{otherwise.} \end{cases}$$
In order to calculate the weighted sum of latency and power $(t + 2p)$, both $t$ and $p$ must be normalized
to the same scale. We assume that the normalization has already been applied in the above equation.
The acceptability model for this problem is depicted in Figure 11(a).
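A small Python sketch of this tradeoff model is given below. It assumes that the latency and power values are plugged into the formula directly, and it clamps the linear branch to [0, 1]; the clamp and the function name `q_example3` are assumptions of this sketch and are not stated in the text.

```python
def q_example3(t, p):
    """Acceptability for Example III with w1 = 1 and w2 = 2."""
    if t > 200 or p > 80:
        return 0.0
    if t <= 50 and p <= 20:
        return 1.0
    value = -0.0039 * (t + 2 * p) + 1.35
    # Assumption: keep the linear branch inside the valid range [0, 1].
    return min(1.0, max(0.0, value))


low_power = q_example3(108, 28)    # a design with latency 108, power 28
low_latency = q_example3(67, 78)   # the selection found in Example II
```

Under these assumptions, the design with latency 108 and power 28 scores about 0.71, whereas the Example II selection (latency 67, power 78) scores about 0.48, reflecting the heavier weight on power.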
Figures 11(b)–11(c) display the final utility values after 14 iterations. These utility values lead to the
elite set {(a3,m0,m0), (a3,m0,m1), (a3,m0,m2), (a3,m1,m0), (a3,m1,m1), (a3,m1,m2)}. Compared
with the solutions in Example II, the solutions in Figure 11(e) have lower power at the expense of
increased latency. These solutions also yield better latency than those of Example I.
Figure 11(d) displays the 10-rank distribution of the enumerated solutions for this problem. Of all
294 enumerated solutions, the best solution found has acceptability 0.74 and only 15%
belong to the first rank. Using our approach, we are able to construct an elite set of size 6 in which 4
solutions are of the first rank. This example shows that our approach is able to reduce the search space
from 294 to 6 and increases the percentage of 1st-rank solutions from 15% (44/294) to
67% (4/6).
5 Experimental Results
To demonstrate the effectiveness of our approach, we have performed several experiments on benchmarks
including test1 (Figure 1), test2 (Figure 7(b)), differential equation (deq), wave-digital filter (fj-wdf),
Volterra filter (vtf), elliptic filter (elf), and discrete cosine transform (dct). Table 2 summarizes the
characteristics of the benchmarks considered.
Figure 12 lists the module sets used in the experiments. Some of these are obtained by modifying the
modules in [11]. Module set I is the same set as in Table 1. Module set II consists of 6 adder modules
and 7 multiplier modules. Module set III consists of 6 adder modules and 6 multiplier modules. The
initial assignment for module set I favors the lowest-power adder and multiplier modules. For module
sets II and III, we consider two different initial utility assignments each: II-1 favors the lowest-power adder
and multiplier, II-2 and III-1 favor the fastest adder and multiplier, and III-2 favors the lowest-power
adder and the fastest multipliers.
(a) Acceptability model [3-D plot: acceptability over power and latency]
Func. unit  iteration  adder module utility
                       a0    a1    a2    a3    a4    a5
f1          1st        0.1   0.2   0.3   0.7   0.9   1
            final      0.25  0.65  0.45  1     0.65  0.45
(b) Utility values for adder modules (f1)
Func. unit  iteration  multiplier module utility
                       m0    m1    m2    m3    m4    m5    m6
f2          1st        0.2   0.4   0.6   0.7   0.9   1     0.05
            final      0.92  1     0.44  0.77  0.01  0     0
f3          1st        0.2   0.4   0.6   0.7   0.9   1     0.05
            final      1     0.99  0.98  0.8   0.01  0     0
(c) Utility values for multiplier modules (f2, f3)
(d) Distribution of 294 solutions [pie chart over ranks 1st–5th and 10th; slice percentages as extracted: 14%,
14%, 22%, 22%, 6%, 22%; ranks 6th–9th = 0%; annotated selections: (a3,m0,m0), (a3,m0,m1), (a3,m0,m2),
(a3,m1,m0), (a3,m1,m1), (a3,m1,m2)]
(f1 f2 f3)   Latency  Power
a3 m0 m0     108      28
a3 m0 m1     116      26
a3 m0 m2     116      26
a3 m1 m0     116      26
a3 m1 m1     115      36
a3 m1 m2     125      34
(e) Latency and power of the solutions obtained
by modules with utility value greater than 0.9
Figure 11: Linear acceptability model considering tradeoff, utility values for Test2 DFG, distribution
of enumerated solutions, and elite set (Example III)
Modules Time Power Initial
a0 5 20 0.2
a1 10 15 0.7
a2 25 10 1.0
m0 5 100 0.2
m1 10 37.5 0.5
m2 20 23 0.8
m3 30 10 1.0
(a) Module set I
Modules Time Power Initial1 Initial2
a0 5 30 0.05 1
a1 7 25 0.1 0.4
a2 10 19 0.2 0.2
a3 15 15 0.5 0.1
a4 20 12 0.8 0.1
a5 30 10 1 0.05
m0 5 55 0.05 1
m1 10 40 0.1 0.8
m2 15 30 0.3 0.5
m3 23 20 0.4 0.4
m4 30 13 0.7 0.3
m5 35 8 0.8 0.2
m6 50 5 1 0.1
(b) Module set II
Modules Time Power Initial1 Initial2
a0 5 60 1 0.1
a1 10 38 0.7 0.2
a2 20 23 0.5 0.3
a3 35 20 0.3 0.5
a4 40 10 0.2 0.7
a5 70 5 0.1 1
m0 100 296 0.3 1
m1 160 84 1 0.7
m2 170 70 0.8 0.5
m3 300 55 0.4 0.4
m4 640 29 0.2 0.2
m5 770 20 0.1 0.1
(c) Module set III
Figure 12: Module sets I–III
Benchmark # operations
name add mult
test1 3 2
test2 4 5
deq 5 6
fj-wdf 13 4
vtf 10 17
elf 28 7
dct 32 16
Table 2: Benchmark characteristics
We first present the results generated by enumerating solutions for various tests, as shown in Table 3.
Column "Spec." displays the specification of a target system (the number of functional units, i.e., adders
and multipliers, allowed in a system). Column "Module set" shows the module set used in each benchmark test.
In Column "Acceptability Q", fields "latency" and "power" contain two vital points $(x_0, x_1)$ similar
to those of Equations (1)–(2): any design whose latency and power are less than $x_0$ has acceptability
one, while a design with latency or power greater than $x_1$ is unacceptable (acceptability equals
zero). A design whose latency and power lie between the $x_0$ and $x_1$ values is acceptable to a certain degree
depending on the assumption in field "$w_1 : w_2$". Column "$w_1 : w_2$" displays the tradeoff ratio between
latency and power respectively. When $w_1 : w_2 = 1 : 1$, i.e., no tradeoff is considered, we use a z-shaped
curve to represent the $Q$ function (similar to Equation (1) except that $F_1$ and $F_2$ are z-curves as shown in
Figure 2(c)). When $w_1 : w_2 = 5 : 1$ or $w_1 : w_2 = 1 : 3$, we use a linear tradeoff function applied to $F$,
similar to the graph in Figure 11(a).
Under Column "Enumerated solutions", Column "total" displays the total number of enumerated
solutions. For Column "Distribution by rank", we divide the solutions into 7 groups according to the
acceptabilities of their corresponding latency and power values. Let $MaxAcc$ be the maximum acceptability
value obtained among all enumerated solutions. The solutions in the first rank have acceptability
in $[\frac{6}{7} \times MaxAcc, MaxAcc]$ while the solutions in the 7th rank have acceptability in
$[0, \frac{1}{7} \times MaxAcc]$. The numbers in Columns "1st", "2nd", ..., "7th" under "Distribution by rank" give the
number of module combinations belonging to each rank.
Table 4 shows the results generated by WiZard (see footnote 2). The number in front of each row entry of Table 4
corresponds to the same indexed test in Table 3. Column "# elite set" shows the size of the elite set formed
by the selected modules. Column "Selected modules" shows the modules WiZard selected for each
functional unit. Under this column, the selected adder and multiplier modules are displayed in Columns "adder"
and "multiplier". The values in parentheses show the possible module selections for the adder/multiplier
functional unit. If the target system specification requires 2 adders/multipliers, the modules in the first
and second parentheses show the candidate modules of the first and second adder/multiplier units
respectively. Column "#iter" displays the number of adjustment iterations performed to obtain the selection in
Columns "adder" and "multiplier". Column "Distribution by rank" displays the ranking of the elements
in the elite set. For comparison, for each row entry, we use the same ranking scale as its
corresponding row in Column "Distribution by rank" in Table 3.
Footnote 2: All experiments used $c = 1$, and a module $m$ is included in the elite set if $\mu_f(m) > 0.6$.
Consider the elliptic filter benchmark for a 2-adder and 2-multiplier system in Table 4 (row 19), where
the number of enumerated solutions is 1,296. The experiments give a module selection (a1,a1,m1,m1)
which lies in the first rank (latency = 1060 and power = 110.75). This result is obtained within five
iterations of adjustments. To verify the result, Figure 13(a) graphically depicts all enumerated solutions
as well as their acceptability values for this case. For easier inspection, Figure 13(b) projects these
points onto the latency-acceptability plane and Figure 13(c) projects them onto the power-acceptability
plane. Each point in the graphs corresponds to the latency/power values associated with the schedule using
a particular module selection. The solutions in the square region in Figures 13(b)–13(c) are highlighted to
show the 1st-rank solutions (latency in [1015, 1280] and power in [95, 137]). The modules
identified by WiZard lead to solutions that fall into this region.
For some large benchmarks, we tested the method against acceptability functions in which the tradeoff
between latency and power is considered. Take the discrete cosine transform benchmark (rows 28–29 of
Tables 3–4) as an example. In row 28, since the goal is to optimize $5t + p$, we attempt to minimize
latency at the expense of increasing power. Figure 14 characterizes the enumerated solutions in row 28 of
Table 3 in the 2-dimensional latency-acceptability plane as well as the power-acceptability plane. The
points in the square box area correspond to the 1st-rank solutions, whose latencies are in [980, 1160]
and whose powers are in [418, 498]. The two selections in row 28 of Table 4, (a2,a2,a2,m0,m0) and
(a2,a2,a1,m0,m0), result in latency and power (980, 498) and (1010, 481) respectively, which again
belong to the 1st-rank solutions.
Figure 15 characterizes the enumerated solutions whose goal is to optimize $t + 3p$, i.e., it attempts to
reduce the power consumption at the expense of increased latency. According to Figure 15(a), the 1st-rank solutions
have latencies in [1310, 1470] and powers in [135, 266]. Consider one of the solutions in
row 29 of Table 4, (a1,a1,a1,m1,m1). This solution results in latency 1350 and power 168. Hence, it falls
into the first rank. Compared with the latency and power of the solution obtained for the previous case,
the power is significantly reduced while the latency is slightly increased.
To further illustrate the effectiveness of WiZard, we summarize our experimental data in Tables 5–6.
Table 5 summarizes the ratio between the average size of the derived elite set and the number of enumerated
solutions for each benchmark. Examining Column "% Reduction", one can see that our proposed approach
is able to reduce the search space dramatically. Table 6 shows the average quality of the selections found by
WiZard. Columns "Enumerated" and "WiZard" show the average percentage of module selections
per rank for the enumerated set and for the elite set obtained by our algorithm. Though on average most
of the module selections in the enumerated set result in schedules with low acceptability degrees, our
algorithm is able to capture good module selections. This can also be done within a reasonable amount
of time. In the worst-case example, the discrete cosine transform with 5 functional units, the average running
time of WiZard is about 800 seconds, while it takes approximately 10800 seconds to generate the 7776
combinations. Therefore, WiZard takes only 7.4% of the running time of the enumeration approach.
From all the tables, one can see that WiZard efficiently produces high-quality module selections.
Benchmark Spec. Module Acceptability Q Enumerated solutions
set latency power w1 : w2 total Distribution by rank
1st 2nd 3rd 4th 5th 6th 7th
1. test1 1a 2m I (20,70) (30,80) 1:1(z) 48 25 0 0 6 0 0 17
2. test2 1a 2m II-1 (55,200) (35,80) 1:1(z) 294 90 56 45 52 13 25 13
3. test2 1a 2m II-2 (55,200) (35,80) 1:1(z) 294 90 56 45 52 13 25 13
4. deq 1a 2m II-1 (100,200) (35,70) 1:1(z) 294 168 13 47 8 45 12 1
5. deq 1a 2m II-2 (100,200) (35,70) 1:1(z) 294 168 13 47 8 45 12 1
6. fj-wdf 2a 1m II-1 (100,200) (35,70) 1:1(z) 252 103 40 4 4 28 20 53
7. fj-wdf 2a 1m II-2 (100,200) (35,70) 1:1(z) 252 103 40 4 4 28 20 53
8. fj-wdf 2a 2m II-1 (50,125) (40,80) 1:1(z) 1764 116 34 4 74 29 67 1440
9. fj-wdf 2a 2m II-2 (50,125) (40,80) 1:1(z) 1764 116 34 4 74 29 67 1440
10. fj-wdf 2a 2m II-1 (50,125) (40,80) 5:1(l) 1764 77 0 124 186 181 135 1061
11. fj-wdf 2a 2m II-1 (50,125) (40,80) 1:3(l) 1764 414 385 0 0 0 0 965
12. vtf 1a 3m II-1 (150,300) (35,50) 1:1(z) 2058 124 50 113 85 162 211 1313
13. vtf 1a 3m II-2 (150,300) (35,50) 1:1(z) 2058 124 50 113 85 162 211 1313
14. vtf 1a 3m II-1 (200,300) (35,70) 5:1(l) 2058 12 309 474 42 0 0 1221
15. vtf 1a 3m II-1 (200,300) (35,70) 1:3(l) 2058 367 247 277 254 0 0 913
16. elf 2a 1m III-1 (2000,3000) (75,200) 1:1(z) 216 72 0 1 23 9 2 109
17. elf 2a 1m III-2 (2000,3000) (75,200) 1:1(z) 216 72 0 1 23 9 2 109
18. elf 2a 2m III-1 (1000,2000) (100,250) 1:1(z) 1296 738 169 196 78 22 18 72
19. elf 2a 2m III-2 (1000,2000) (100,250) 1:1(z) 1296 738 169 196 78 22 18 72
20. elf 2a 2m III-2 (1300,1700) (75,150) 5:1(l) 1296 38 69 110 173 130 102 674
21. elf 2a 2m III-2 (1300,1700) (75,150) 1:3(l) 1296 140 173 75 75 34 0 799
22. dct 2a 2m III-1 (2000,3000) (300,500) 1:1(z) 1296 504 0 5 82 125 85 495
23. dct 2a 2m III-2 (2000,3000) (300,500) 1:1(z) 1296 504 0 5 82 125 85 495
24. dct 2a 2m III-2 (1200,2600) (75,150) 5:1(l) 1296 30 62 68 20 0 0 1116
25. dct 2a 2m III-2 (1200,2600) (75,150) 1:3(l) 1296 43 64 31 0 0 62 1096
26. dct 3a 2m III-1 (1500,2500) (350,450) 1:1(z) 7776 3005 0 0 0 0 0 4771
27. dct 3a 2m III-2 (1500,2500) (350,450) 1:1(z) 7776 3005 0 0 0 0 0 4771
28. dct 3a 2m III-2 (1000,2000) (200,500) 5:1(l) 7776 174 47 507 872 1483 132 4560
29. dct 3a 2m III-2 (1000,2000) (200,500) 1:3(l) 7776 294 1390 1516 16 0 0 4560
Table 3: Enumerated module selection results for tested benchmarks, generated for comparison with the
WiZard results in Table 4. Some benchmarks are tested against different initial utility assignments
and different design goals.
Ben. # elite WiZard Selected modules #iter Distribution by rank
set adder multiplier 1st 2nd 3rd 4th 5th 6th 7th
1. 4 (a1) (m1 m2) (m1 m2) 15 3 1 0 0 0 0 0
2. 2 (a3 a4) (m2) (m2) 12 1 1 0 0 0 0 0
3. 1 (a3) (m0) (m0) 22 1 0 0 0 0 0 0
4. 9 (a1) (m1 m2 m3) (m1 m2 m3) 11 5 0 0 2 1 0 1
5. 9 (a2) (m0 m2 m3) (m0 m2 m3) 7 4 3 1 0 1 0 0
6. 1 (a2) (a2) (m1) 16 1 0 0 0 0 0 0
7. 4 (a0 a1) (a0 a1) (m1) 9 2 0 0 0 0 2 0
8. 2 (a0) (a1) (m0 m1) (m1) 9 1 1 0 0 0 0 0
9. 8 (a0 a1) (a0) (m0 m2) (m0 m2) 9 3 0 1 3 0 1 0
10. 2 (a0 a1) (a0) (m1) (m0) 14 1 1 0 0 0 0 0
11. 2 (a2) (a2) (m1) (m0 m1) 10 1 1 0 0 0 0 0
12. 2 (a2 a3) (m1) (m0) (m1) 9 1 0 0 1 0 0 0
13. 8 (a1 a2) (m0 m1) (m0 m1) (m3) 9 2 0 1 1 1 1 2
14. 2 (a0 a1) (m0) (m1) (m0) 14 2 0 0 0 0 0 0
15. 6 (a1 a2 a3) (m0) (m0 m1) (m0) 15 6 0 0 0 0 0 0
16. 1 (a0) (a0) (m1) 5 1 0 0 0 0 0 0
17. 9 (a0 a1 a2) (a0 a1 a2) (m1) 5 9 0 0 0 0 0 0
18. 4 (a0) (a0) (m1 m2) (m1 m2) 2 4 0 0 0 0 0 0
19. 1 (a1) (a1) (m1) (m1) 5 1 0 0 0 0 0 0
20. 8 (a1 a2) (a2) (m1 m2) (m1 m2) 18 4 0 1 3 0 0 0
21. 3 (a2 a3 a4) (a2) (m1) (m1) 8 3 0 0 0 0 0 0
22. 1 (a0) (a0) (m1) (m1) 3 1 0 0 0 0 0 0
23. 4 (a1 a3) (a1 a3) (m1) (m1) 3 4 0 0 0 0 0 0
24. 4 (a2) (a2) (m1 m2) (m1 m2) 7 1 3 0 0 0 0 0
25. 4 (a2 a3 a4 a5) (a0) (m1) (m2) 20 4 0 0 0 0 0 0
26. 8 (a1 a3) (a1 a3) (a1 a3) (m1) (m1) 15 8 0 0 0 0 0 0
27. 2 (a4) (a1) (a3) (m0 m2) (m2) 3 2 0 0 0 0 0 0
28. 2 (a2) (a2) (a1 a2) (m0) (m0) 16 2 0 0 0 0 0 0
29. 32 (a1 a2) (a1 a2) (a1 a2) (m1 m2) (m1 m2) 4 28 4 0 0 0 0 0
Table 4: Module selection results for tested benchmarks generated by WiZard, where each row
corresponds to the row with the same index in Table 3
(a) Latency, power and acceptability [3-D plot: acceptability over latency and power]
(b) Latency and acceptability [plot: acceptability versus latency]
(c) Power and acceptability [plot: acceptability versus power]
Figure 13: Enumerated solutions for elliptic filter benchmark (row 19 of Table 3)
(a) Latency and acceptability [plot: acceptability versus latency]
(b) Power and acceptability [plot: acceptability versus power]
Figure 14: Enumerated solutions for discrete cosine transform benchmark when latency and power
tradeoff is 5:1 (row 28 of Tables 3–4)
(a) Latency and acceptability [plot: acceptability versus latency]
(b) Power and acceptability [plot: acceptability versus power]
Figure 15: Enumerated solutions for discrete cosine transform benchmark when latency and power
tradeoff is 1:3 (row 29 of Tables 3–4)
Benchmark         Spec    WiZard : Enumerated   % Reduction
1. test1          1a 2m   4 : 48                91%
2.–3. test2       1a 2m   1.5 : 294             99%
4.–5. deq         1a 2m   9 : 294               97%
6.–7. fj-wdf      2a 1m   2.5 : 252             99%
8.–11. fj-wdf     2a 2m   3.5 : 1764            99%
12.–15. vtf       1a 3m   4.5 : 2058            99%
16.–17. elf       2a 1m   5 : 216               97%
18.–21. elf       2a 2m   4 : 1296              99%
22.–25. dct       2a 2m   3.25 : 1296           99%
26.–29. dct       3a 2m   11 : 7776             99%
Table 5: Ratio of average elite set size to enumerated set size
Rank# Enumerated WiZard
1st 25% 79%
2nd 7% 11%
3rd 7% 1%
4th 6% 5%
5th 8% 1%
6th 3% 2%
7th 44% 1%
Table 6: Average distribution per rank for enumerated set and the elite set found by WiZard
6 Conclusion and Future Work
We have presented a module selection framework, WiZard, that takes into account the scheduling effect
as well as resource, latency, and power constraints. The key to this approach is the use of the utility
measure to model the degree of usefulness of a module. A module with a high utility value should lead to
a design with high acceptability values. The scheduling and binding method called inclusion scheduling
is used as the basis for deriving "fuzzy" latency and power values that approximate the latency and
power values enumerated exhaustively. The modules contributing to these latency and power values are also recorded
for the purpose of tracking the modules that result in schedules with good and/or bad latency and power.
Then, the utility values are fine-tuned so as to precisely specify the quality of each module. For many
experiments, within seconds, the framework can identify an elite set composed of modules with high
utility values. With respect to the enumerated solutions, these selections belong to the superior solutions. The
solutions in the elite set can be used as good initial design solutions so as to speed up a design process.
In this work, we assume that the number of resources, i.e., functional units, is limited. Our current
approach can be included in an iterative design process that varies the required number of resources.
However, since the number of resources also directly affects schedules as well as module selection, future work
will consider the effect of the number of resources while evaluating the utility of a module.
References
[1] T. L. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules for parallel processing
systems. Communications of the ACM, pages 685–690, December 1974.
[2] I. Ahmad, M. K. Dhodhi, and C. Y. R. Chen. Integrated scheduling, allocation and module selection
for design-space exploration in high-level synthesis. IEE Proc.-Comput. Digit. Tech., 142:65–71,
January 1995.
[3] C. Chantrapornchai, E. Sha, and Xiaobo S. Hu. Dealing with imprecise timing information in
architectural synthesis. Technical Report TR-98-5, University of Notre Dame, 1998.
[4] S. Chaudhuri, S. A. Blythe, and R. A. Walker. An exact methodology for scheduling in 3D design
space. In Proceedings of the 1995 International Symposium on System Level Synthesis, pages 78–83,
1995.
[5] L. Chen and L. Jeng. Optimal module set and clock cycle selection for DSP synthesis. In Proceedings
of the International Symposium on Circuits and Systems, pages 2200–2203, 1991.
[6] Y. Foo and H. Kobayashi. A knowledge-based system for VLSI module selection. In Proceedings of
the International Conference on Computer Design, pages 184–187, 1986.
[7] K. Gupta. Introduction to fuzzy arithmetics. Van Nostrand, 1985.
[8] M. Ishikawa and G. De Micheli. A module selection algorithm for high-level synthesis. In Proceedings
of the International Symposium on Circuits and Systems, pages 1777–1780, 1991.
[9] R. Jain. MOSP: Module selection for pipelined designs with multi-cycle operations. In Proceedings
of the IEEE/ACM International Conference on Computer Aided Design, pages 212–215, 1990.
[10] A. Kaufmann and M. M. Gupta. Fuzzy mathematical models in engineering and management science.
North-Holland, 1988.
[11] R. Martin and J. P. Knight. Power-profiler: Optimizing ASICs power consumption at the behavioral
level. In Proceedings of the 32nd Design Automation Conference, pages 42–47, 1997.
[12] G. D. Micheli. Synthesis and optimization of digital circuits. McGraw-Hill, Inc., 1994.
[13] L. Ramachandran and D. Gajski. An algorithm for component selection in performance optimized
scheduling. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design,
pages 92–95, 1991.
[14] T. J. Ross. Fuzzy Logic with Engineering Applications. McGraw-Hill, 1st edition, 1995.
[15] Z. X. Shen and C. C. Jong. Exploring module selection space for architectural synthesis of low power
designs. In Proceedings of the International Symposium on Circuits and Systems, pages 1532–1535,
1997.
[16] A. Timmer, M. Heijligers, L. Stok, and J. Jess. Module selection and scheduling using unrestricted
libraries. In Proceedings of the European Design Automation Conference, pages 547–551, 1993.
[17] L. A. Zadeh. The concept of a linguistic variable and its application to approximate reasoning, Part
I. Information Sciences, 8:199–249, 1975.
[18] L. A. Zadeh. Fuzzy Logic. Computer, 1:83–93, 1988.