0-7695-0990-8/01/$10.00 (C) 2001 IEEE

[IEEE Comput. Soc IEEE International Symposium on Parallel and Distributed Processing - San Francisco, CA, USA (23-27 April 2001)] Proceedings 15th International Parallel and Distributed


Minimizing Routing Configuration Cost in Dynamically Reconfigurable FPGAs

Daler Rakhmatov and Sarma B. K. Vrudhula
NSF Center for Low Power Electronics
ECE Department
University of Arizona
Tucson, AZ 85716, USA
daler/[email protected]

Abstract

Dynamically reconfigurable computing systems built on FPGAs offer a variety of benefits; however, the reconfiguration cost in terms of power dissipation and delay in such systems is the key negative factor limiting system performance. We describe a hardware organization that allows for simple dynamic placement and routing through introduction of a virtual standard cell topology over an FPGA. We focus on the channel routing issues under the assumption that the FPGA hardware is partially (selectively) reconfigurable. Even though the channel capacity is fixed, routes that cannot fit in the channel at once can share the reconfigurable channel over time. The cost of configuring a new routing pattern can be greatly reduced if portions of the last configured routing pattern are reused. We address the problem of minimization of the configuration cost through maximization of the reuse of an already existing configuration of the channel.

1 Introduction

Dynamically reconfigurable FPGAs provide a low-cost, highly flexible hardware platform on which algorithms can be mapped into circuits automatically. Recently, a new class of dynamically and partially reconfigurable FPGAs has been introduced [1]. Partial reconfigurability allows for a selective change of functionality of FPGA segments of arbitrary size at arbitrary location, without disrupting operation of the rest of the FPGA chip. A dynamic hardware update can be highly localized, favoring the reuse of the already configured silicon. Thus, the amount of necessary reconfiguration is greatly reduced, which translates into a decrease in reconfiguration delays and energy.

One common approach to utilizing dynamic reconfiguration is to couple a microprocessor with a collection of dynamically reconfigurable devices serving as the logic cache. Hardware resources (e.g., basic components such as adders and multipliers, or higher-level primitives such as filters and DFTs) can be instantiated and reused at run-time. Existing resources can be relocated on the array or removed to make space for incoming resources. The capability of deleting, creating, and deploying hardware resources at run-time permits adaptive computation.

The key challenges in realizing these benefits center around the time and power required to perform reconfiguration. We describe a hardware organization that allows for simple dynamic placement and routing of macros through the introduction of a virtual standard cell topology over an FPGA. Hardware resources are placed into rows separated by channels that are used for routing. We focus on the issues of channel routing in the logic cache. Even though the channel capacity is fixed, routes that cannot fit in the channel at once can share the reconfigurable channel over time. The cost of configuring a new routing pattern can be reduced greatly if portions of the last configured routing pattern are reused. We address the problem of minimizing the configuration cost by maximizing the reuse of an already existing configuration of the channel. The importance of addressing this problem is clear: if the amount of reconfiguration is reduced by a factor of X, then the energy dissipation and delay due to reconfiguration will each be reduced by a factor of X (based on fixed average delay and power dissipation per configuration word); thus, the energy-delay product will be reduced by a factor of $X^2$.
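The scaling argument can be written out explicitly. As a sketch, assume that a reconfiguration writes N configuration words, each with a fixed average energy e and delay d. The symbols N, e, d, E, and D are introduced here for illustration only and do not appear in the original text.

```latex
\begin{align*}
E &= N e, & D &= N d, & E \cdot D &= N^2 e d, \\
E' &= \frac{N}{X}\, e = \frac{E}{X}, &
D' &= \frac{N}{X}\, d = \frac{D}{X}, &
E' \cdot D' &= \frac{E \cdot D}{X^2}.
\end{align*}
```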

2 Related Work

Reconfigurable systems with the logic cache feature a processor that is tightly coupled with its reconfigurable array. Computationally intensive and repetitive segments of a program are synthesized in hardware during compilation, and cached in the reconfigurable logic cache at run-time. Examples of such systems are presented in [6, 12, 13, 14, 15, 16].

Placement of macros for reconfigurable systems has been addressed by several authors. One-dimensional placement of macros for FPGA datapath by grammar-based tree covering is described in [4]. Fast online placement techniques in two dimensions are treated in detail in [3]. In the context of our system, placement is linear within a row, and it greatly affects routability. To reduce reconfiguration cost during placement, macros should exhibit regularity. Efforts on functional regularity extraction were reported in [8, 7, 5].

To the best of our knowledge, time-multiplexed routing in reconfigurable architectures with the goal of maximizing reuse has not been studied. The classic channel routing algorithms [10] define the cost to be the number of tracks in the channel rather than the amount of reconfiguration, and FPGA routing algorithms [9] do not take into consideration already existing routes whose presence arises due to time-sharing of the FPGA routing resources.

3 System Organization

The dynamic routing addressed in this paper is part of a project called PRECIS (Partially Reconfigurable Energy Conscious Integrated System), which is aimed at developing an adaptive computing system. A brief description of the system organization is presented in this section.

[Figure 1: block diagram with blocks labeled software, hardware, configuration control, configurable memory, and peripheral device.]

Figure 1. Components of a Processor.

Figure 1 shows the assumed architecture of a dynamically reconfigurable digital processor. It consists of five main components: the software (a microprocessor), the hardware (a partially reconfigurable FPGA), the configurable memory (flexible static random-access memory), the configuration controller (non-volatile storage and loader of the configuration bitstream), and the periphery (timers, counters, receivers, transmitters, etc.). The software can directly configure the hardware. The configurable memory is an advanced feature that allows for software-controlled tuning of the memory organization to a given application. This memory serves as a communication link between software and hardware during application execution. The peripheral device provides an interface to the external world. The processor is assumed to be implemented as a system-on-chip (SOC) with the primary goal of reducing the system power consumption. The number of cores that can be integrated reliably into a single chip is limited; therefore, reconfigurability is particularly desirable since it enables software-controlled sharing of the FPGA resources over time. Such a device will soon be commercially available from the Atmel Corporation [2].

The FPGA resources are utilized by three main types of reconfigurable hardware objects: routing, macros, and interface. The FPGA interior is logically organized as shown in figure 2. Interface circuitry is responsible for external communication as well as global control, and also serves as global data storage. Macros are the actual computing elements. Each macro has a local controller and a local memory block. Control routing connects individual macro controls with one another and the global controller, and data routing connects individual macro memory blocks with one another and the global memory. The shaded blocks directly affect the computation data by changing either its state (datapath blocks) or location (data routing block). Such a logical organization is intended to specialize reconfiguration events into four categories: routing reconfiguration, datapath reconfiguration, control reconfiguration, and memory reconfiguration.

[Figure 2: block diagram. INTERFACE contains global memory and global control; MACRO 1 through MACRO n each contain local memory, local control, and a datapath; ROUTING contains data routing and control routing.]

Figure 2. FPGA Logical Organization.

There are many possible realizations of the logical organization shown in figure 2. To support fast placement and routing, we impose a virtual standard-cell topology over the FPGA (see figure 3). The FPGA is partitioned into rows and channels. The rows host macro datapaths, and the channels host routes, memories, and control logic. The channel adjacent to the FPGA side facing the microprocessor is dedicated to the interface circuitry (global memory and control, e.g., loop counters, condition checkers, etc.). Placement of macros is essentially configuring macro datapaths onto the rows. Data routing of macros is essentially configuring buses to establish connections among memory blocks. Exchange of control signals is performed through predetermined (reserved) routing resources. Figure 3 shows that macro-to-macro routing takes place inside the channel, whereas macro-to-interface and interface-to-macro routing is over-the-cell (OTC). In this paper, we focus on data routing reuse within a single channel. It should be noted that as the FPGA size grows, the number and the size of rows and channels grow accordingly.

[Figure 3: floorplan with a legend for macro datapath, macro (local) memory/control, interface (global) memory/control, and data/control routing.]

Figure 3. FPGA Physical Organization.

Clearly, the routing resources of the logic-intensive rows and the logic resources of the routing-intensive channels might be underutilized. However, specialization of FPGA segments introduces spatial locality during reconfigurations, which simplifies dynamic resource management and reduces reconfiguration delays. For example, if two different macros have the same datapath but different control logic, only the control logic needs to be changed to reconfigure one macro into the other. Also, if two FPGA-mapped segments of computation have the same operations but different data flows, then only the routing needs to be changed. Locality of these changes is the consequence of the proposed organization. Such an organization is targeted toward dynamic hardware with multiple functional contexts (configurations) rather than static user designs with a single configuration. Dynamic hardware requires efficient configuration management through reuse to reduce reconfiguration overhead. Our primary goal is reusability of FPGA configurations rather than high utilization of FPGA resources. Nevertheless, resource utilization can be improved. For example, the channels can host data transfer schedulers and reconfigurable on-line test circuitry, and the row interconnects can be used for macro-to-macro OTC routing.

Finally, we note that for the proposed architecture a generic FPGA is not an ideal realization of the dynamic hardware space. A generic FPGA offers uniform fine-grain logic and routing reconfigurability, whereas the proposed approach favors segmented (non-uniform) coarse-grain reconfigurability. Indeed, the rows are configured to perform a predetermined set of high-level arithmetic and logic operations, and the channels are primarily configured to establish simple bus connections. Ideally, an FPGA specifically designed for the proposed architecture would have a segmented distribution of logic and routing resources to match the requirements of the logic-intensive rows and the routing-intensive channels. However, even though generic FPGAs offer more flexibility than needed (resulting in higher than architecture-inherent energy-delay penalties during configurations and computations), they are readily available on the market and well-suited for hardware prototyping. Thus, our choice to use a generic FPGA is due to the need for the availability of hardware and software and the possibility of experimentation.

4 Problem Description

Vertical orientation of the channel shown in figure 3 is conceptually the same as its horizontal orientation due to the FPGA symmetry. Traditionally, channels are depicted as being horizontally oriented; therefore, in our discussion we assume that a channel is horizontal.

The channel area between two rows of macros is a two-dimensional grid, with a generic configurable routing block (CRB), or a switchbox, being present at each grid point. A row of CRBs in a channel is a horizontal track along which routes are established. We do not consider columns of CRBs used for vertical segments of routes, and assume no vertical constraints are present. The CRBs at which the routes change direction correspond to vias in a classical routing problem. Thus, for a given set of placed macros, the set of data transfers that must take place can be specified as pairs of endpoints (an interval), where each endpoint of a pair represents a column coordinate of a terminal of a macro. Note that all connections are two-point connections and each connection is to be realized with at most one horizontal segment. In the context of this paper, configuration of a CRB means configuration of its switches.

In order to configure the channel, each of its CRBs must be configured to either enable or disable horizontal signal propagation. Thus, if the channel is reconfigured from scratch, the configuration cost is $T \times L \times C$, where T is the width (the number of tracks), L is the length (the number of CRBs on a track), and C is the configuration cost of one CRB (assume that C = 1). A line of N enabling CRBs realizes an interval of length N; these intervals are used to transfer data. A line of N disabling CRBs realizes a gap of length N; these gaps are used to block data. An example of a 4-track channel of length 10 is shown in figure 4(a). To configure track t3, for instance, 4 CRBs disable connections (a gap from 0 to 4), and 6 CRBs enable connections (an interval from 4 to 10).

Clearly, reconfiguration costs can be greatly reduced if already existing routing segments are reused. If, during a switch from the layout of pattern P1 to the layout of pattern P2, only the difference between these two layouts is reconfigured, then the reconfiguration penalties are decreased.

[Figure 4: six panels on a 4-track channel (tracks t0 to t3, columns 0 to 10). (a) layout of routing pattern P1, with intervals labeled α, β, γ, δ, λ, µ, φ, ω; (b) layout of routing pattern P2, with intervals A to H; (c) P2 zones q0 to q3 with gaps X and Y; (d) zone graph of P2; (e) matching graph of q0; (f) matching graph of q1 (after q0).]

Figure 4. Routing Reuse Problem.

Figure 4(a) shows a layout of some pattern P1. Assume that a new pattern P2 to be configured has the intervals specified in Table 1. Figure 4(b) shows a possible layout of the intervals in P2. Note that without reuse the reconfiguration cost is $4 \times 10 = 40$, and with reuse of both intervals and gaps it is 4. This number was obtained as follows. Only the configurations of segment 6-8 of tracks t1 and t2 are different for the two layouts. For t1, segment 6-8 of P1's layout is a part of an interval, and segment 6-8 of P2's layout is a gap. Thus, 2 CRBs (the length of the segment) need to be reconfigured for track t1. For t2, segment 6-8 of P1's layout is a gap, and segment 6-8 of P2's layout is a part of an interval. Thus, 2 CRBs (the length of the segment) need to be reconfigured for track t2. A total of 4 CRBs are reconfigured, which is much less than 40. If intervals E and G in the layout of P2 in figure 4(b) are interchanged, then the reconfiguration cost would be zero, since the layouts of P1 and P2 would match exactly. Note that the amount of reuse for a track is the amount of overlap between intervals in P2's layout and intervals in P1's layout assigned to that track, plus the amount of overlap between gaps in P2's layout and gaps in P1's layout on that track. The reconfiguration cost is simply the difference between $T \times L$ and the amount of reuse for all tracks.
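The cost accounting in this example can be checked with a short script. This is a minimal sketch, not the authors' code: it assumes each track is stored as a list of half-open intervals over unit CRB positions, and the layout of P1 is reconstructed from the worked example in the text (Figure 4(a) itself is not reproduced here). The function and variable names are hypothetical.

```python
def track_bits(intervals, length):
    """Return one boolean per CRB on a track: True = enabled (interval), False = gap."""
    bits = [False] * length
    for left, right in intervals:
        for x in range(left, right):
            bits[x] = True
    return bits


def reconfiguration_cost(old_layout, new_layout, length):
    """Count the CRBs whose configuration differs between two layouts (C = 1)."""
    cost = 0
    for old_track, new_track in zip(old_layout, new_layout):
        old_bits = track_bits(old_track, length)
        new_bits = track_bits(new_track, length)
        cost += sum(o != n for o, n in zip(old_bits, new_bits))
    return cost


L = 10  # channel length
# Assumed layout of P1 (tracks t0..t3), inferred from the worked example.
P1 = [[(0, 3), (4, 7), (8, 10)], [(0, 5), (6, 10)], [(2, 5), (8, 10)], [(4, 10)]]
# Layout of P2 as in Figure 4(b): t0 = A, B, C; t1 = D, G; t2 = F, E; t3 = H.
P2_B = [[(0, 3), (4, 7), (8, 10)], [(0, 5), (8, 10)], [(2, 5), (6, 10)], [(4, 10)]]
# The same layout with intervals E and G interchanged.
P2_SWAPPED = [[(0, 3), (4, 7), (8, 10)], [(0, 5), (6, 10)], [(2, 5), (8, 10)], [(4, 10)]]

cost_no_reuse = len(P1) * L
cost_with_reuse = reconfiguration_cost(P1, P2_B, L)
cost_swapped = reconfiguration_cost(P1, P2_SWAPPED, L)
print(cost_no_reuse, cost_with_reuse, cost_swapped)
```

Counting differing CRBs is equivalent to subtracting the total reuse from T × L, because every CRB either keeps its configuration or is rewritten.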

In this paper we present a method that determines the layout of a new pattern, attempting to maximize the amount of reuse of the current channel layout. The input to the procedure is the current layout of the channel and the pattern to be laid out. It is assumed that there exists a layout of the pattern that fits into the channel, that is, the pattern is physically realizable. This is ensured by a preceding routing partitioning step, which is not addressed in this paper. (Given a global routing specification, routing partitioning identifies routing patterns time-sharing the channel and guarantees that each pattern is physically realizable.)

Interval   Left End   Right End   Length   Overlapping
A          0          3           3        D, F
B          4          7           3        D, E, F, H
C          8          10          2        E, G, H
D          0          5           5        A, B, F, H
E          6          10          4        B, C, G, H
F          2          5           3        A, B, D, H
G          8          10          2        C, E, H
H          4          10          6        B, C, D, E, F, G

Table 1. Example of Routing Pattern P2.

5 Problem Formulation

Assume that the channel is of length L and has T tracks. Let the pattern currently laid out in the channel be denoted by P1, and the pattern to be laid out next be denoted by P2. Let the left endpoint coordinate of interval i be denoted by l(i) and its right endpoint coordinate by r(i). For interval α in figure 4(a), l(α) = 0 and r(α) = 3. Let t(i) denote the track number to which interval i is assigned (tracks are numbered from top to bottom, starting from 0). For interval α, for instance, t(α) = 0.

Note that the left and right endpoints of intervals are fixed for given P1 and P2. For gaps, the left and right endpoints are determined by two neighboring intervals on the same track. That is, the set of gaps of P2 is different, in general, for different layouts of P2 (note that the set of gaps of P1 is fixed since P1 is already laid out). We need a representation of P2 such that the computation of overlap for one element of the representation does not depend on the choice of others. This representation must also implicitly guarantee the feasibility of a layout of P2. That is, the number of tracks in a layout must be less than or equal to T. For example, assume that interval A and interval E (see figure 4(b)) are assigned to the same track in some layout, and the channel has 4 tracks. It can be verified that such an assignment makes routing infeasible, since at least five tracks are needed to lay out A, E, and the rest of the intervals.

To treat layout choices as well as interval and gap overlaps systematically, we find the zone representation of P2 (see [10] for details and construction) to be adequate. Zones correspond to maximal cliques of the interval graph of P2. In the interval graph, each node represents an interval, and an edge between two nodes exists if and only if the corresponding intervals overlap. Zones of P2 from figure 4(b) are shown in figure 4(c): there are 4 zones q0, q1, q2, and q3 with boundaries 0-4, 4-6, 6-8, and 8-10, respectively.

The number of intervals in a zone is the size of the corresponding maximal clique; no two intervals in the same zone can be assigned to the same track. Clearly, the size of the maximum clique of the interval graph of P2 must be $\le T$ in order for P2 to be realizable in the channel with T tracks. If, for some zone, the number of intervals in that zone is $< T$, we include one gap (its endpoints are the zone's boundaries) per vacant track in that zone. For example, assuming a 4-track channel, in figure 4(c), gap X is included in zone q0, and gap Y is included in zone q2. This means that gaps X (0-4) and Y (6-8) are always present regardless of how P2 is laid out.
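The zones and per-zone gap counts of the Table 1 example can be reproduced with a simple left-to-right sweep. This is only a sketch under stated assumptions: intervals are treated as half-open spans over integer columns, and a new zone is opened whenever an interval starts after some interval has ended, which is one common way to build a zone representation and not necessarily the exact construction in [10]. The names `zone_representation` and `gaps_per_zone` are hypothetical.

```python
def zone_representation(intervals, length):
    """Return zones as (start, end, sorted member names) for a channel of given length."""
    spans, start, prev, ended = [], 0, set(), False
    for x in range(length):
        active = {name for name, (left, right) in intervals.items() if left <= x < right}
        if prev - active:            # some interval ended before column x
            ended = True
        if (active - prev) and ended:  # a new interval starts after an ending: new zone
            spans.append((start, x))
            start, ended = x, False
        prev = active
    spans.append((start, length))
    return [
        (s, e, sorted(n for n, (left, right) in intervals.items() if left < e and right > s))
        for s, e in spans
    ]


def gaps_per_zone(zones, tracks):
    """Number of gap child nodes added to each zone so that it fills all tracks."""
    return [tracks - len(members) for _, _, members in zones]


# Intervals of P2 from Table 1.
P2 = {"A": (0, 3), "B": (4, 7), "C": (8, 10), "D": (0, 5),
      "E": (6, 10), "F": (2, 5), "G": (8, 10), "H": (4, 10)}

zones = zone_representation(P2, 10)
gaps = gaps_per_zone(zones, 4)
for zone, gap_count in zip(zones, gaps):
    print(zone, "gaps:", gap_count)
```

For this input the sweep yields the four zones named in the text, with one gap each in q0 and q2 (the gaps X and Y).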

Next, we construct the zone graph, where each zone corresponds to a parent node, and each interval or gap corresponds to a child node. Children of a parent node that represents some zone correspond to the intervals (interval child nodes) and gaps (gap child nodes) belonging to that zone. Now, the interval children and gap children can be treated in the same manner in terms of overlap. Figure 4(d) shows the zone graph for the zones shown in figure 4(c). It is important to note that an interval child node represents not just the interval itself but the entire segment within the boundaries of its parent zone(s). For example, child A of q0 represents the entire segment 0-4, that is, both the interval 0-3 and the gap 3-4. If an interval child has more than one parent, it represents the segment formed by all the zones corresponding to its parents. For example, child node D of q0 and q1 represents segment 0-6. Each gap child node has exactly one parent and naturally represents the entire segment within the corresponding zone's boundaries.

The next step is to compute the amount of overlap for each child node. If a child node v is placed on track t, the overlap is computed as follows. First, track t is found in the layout of P1; then, the segment of t with the same boundaries as those of the segment represented by v is selected. Comparing the selected segment against the segment of v, the amount of overlap is the number of CRBs with the same configuration. For example, to compute the overlap for A, placed on track t0, in figure 4(d), we find t0 in the layout of P1 in figure 4(a) and select segment 0-4 (the boundaries of child A). It can be seen in figure 4(a) that subsegment 0-3 is enabled, and subsegment 3-4 is disabled. In segment 0-4 of child A, subsegment 0-3 is also enabled, and subsegment 3-4 is also disabled; therefore, the overlap is 4. For child D (representing segment 0-6), placed on track t2, the overlap is 4 (subsegment 2-5 of t2 is the same as subsegment 2-5 of D). Similarly, the overlap can be computed for each track assignment of each child node.
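The overlap computation described above can be sketched as a CRB-by-CRB comparison. The code below is illustrative only: the layout of P1 is an assumption reconstructed from the worked example, intervals are half-open over unit CRB positions, and the helper names are hypothetical. A gap child node is passed as `None`.

```python
def enabled_positions(intervals):
    """Set of unit CRB positions that are enabled by a list of intervals."""
    return {x for left, right in intervals for x in range(left, right)}


def overlap(interval, segment, track_intervals):
    """Count CRBs in `segment` whose state is the same for the child node and the track.

    interval        -- (left, right) of an interval child, or None for a gap child
    segment         -- (start, end) span represented by the child (its parent zones)
    track_intervals -- intervals currently configured on the candidate track of P1
    """
    child_on = enabled_positions([interval] if interval else [])
    track_on = enabled_positions(track_intervals)
    start, end = segment
    return sum((x in child_on) == (x in track_on) for x in range(start, end))


# Assumed layout of P1, inferred from the text.
P1_LAYOUT = {
    "t0": [(0, 3), (4, 7), (8, 10)],
    "t1": [(0, 5), (6, 10)],
    "t2": [(2, 5), (8, 10)],
    "t3": [(4, 10)],
}

overlap_A_t0 = overlap((0, 3), (0, 4), P1_LAYOUT["t0"])  # child A of q0
overlap_D_t2 = overlap((0, 5), (0, 6), P1_LAYOUT["t2"])  # child D of q0 and q1
overlap_X_t3 = overlap(None, (0, 4), P1_LAYOUT["t3"])    # gap child X of q0
print(overlap_A_t0, overlap_D_t2, overlap_X_t3)
```

Both values quoted in the text (4 for A on t0 and 4 for D on t2) come out of this comparison under the assumed P1 layout.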

The routing reuse problem can be stated as follows. Given the layout of P1, the zone graph of P2, and the channel of length L with T tracks, assign each child node v of the zone graph of P2 to a track number t(v) such that the cost is minimized and the following constraints are met:

1. $\max\{t(v)\} < T$

2. if child nodes u and v have a common parent, then $t(v) \neq t(u)$

The cost is $T \times L - \sum_{v} \mathrm{OVERLAP}(v)$, where the sum is computed over all child nodes v of the zone graph of P2, and OVERLAP(v) is computed as described above (the amount of overlap depends on the track assignment of v).

6 Routing Reuse Procedure

Our procedure for solving the routing reuse problem is based on weighted matching. Child nodes for matching are selected after a zone is chosen; then, a bipartite matching graph between child nodes and tracks is constructed, and a maximum weight matching is found (see [11] for details). This step is repeated for each zone, and the end result is the track assignment of each child node. For example, if zone q0 is under consideration, then nodes A, D, F, and X are selected (see Figure 4(c)). Its matching graph is shown in Figure 4(e). An edge connecting child node s to track t is assigned a weight equal to the amount of overlap obtained if s is placed on t. For example, the edge (A, t0) is assigned a weight of 4, and the edge (D, t2) is assigned a weight of 4. The maximum weight matching is as follows: A → t0, D → t1, F → t2, X → t3, since such a track assignment matches segment 0-4 in Figure 4(a) exactly, giving the maximum overlap. Once the child nodes of q0 are matched, the neighbor zone q1 is considered. The matching graph for the child nodes of q1 is shown in Figure 4(f). Note that two child nodes (D and F) as well as two tracks (t1 and t2) have no edges; they are said to be isolated. This is due to the fact that D and F were previously assigned to tracks t1 and t2, respectively. The absence of child node edges ensures that a child node is not assigned to more than one track, and the absence of track edges ensures that no overlapping child nodes are assigned to the same track. To ensure that the generated layout is feasible, the zones are considered in order of their distance from the starting zone in one direction, say, to the right, and then in the other, that is, to the left. For example, if the start zone is q0, then the order in the right direction is q1-q2-q3 (there are no zones to the left of q0). If the start zone is q2, then the order in the right direction is q3, and the order in the left direction is q1-q0. Ordering the matching graphs as well as isolating already assigned nodes and tracks in the matching graphs ensures layout correctness by construction. If the order is not preserved, then feasibility is not guaranteed. For instance, if q0 is considered first, and q2 is considered second, it is possible for A and E to be assigned to the same track, which leads to a layout with at least 5 tracks for the 4-track channel.
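The zone visiting order described above is easy to state in code. A minimal sketch, assuming zones are indexed 0, 1, ... from left to right; the function name `zone_order` is hypothetical.

```python
def zone_order(start, num_zones):
    """Visit the start zone, then every zone to its right, then every zone to its left."""
    right = list(range(start + 1, num_zones))
    left = list(range(start - 1, -1, -1))
    return [start] + right + left


order_from_q0 = zone_order(0, 4)
order_from_q2 = zone_order(2, 4)
print(order_from_q0, order_from_q2)
```

Because each newly visited zone is adjacent to one that has already been processed, any interval shared with earlier zones already has a track when the zone is matched.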

Clearly, the cost of the generated layout of P2 depends on the choice of the starting zone. Our procedure generates the layouts corresponding to all possible starting zones, and selects the minimum cost solution among them. There is a special case when the optimal solution is generated. If all maximal cliques of the interval graph of P2 are independent, it implies that each child node will have exactly one parent. In other words, no interval belongs to more than one zone. In this case, a matching of child nodes for one zone is independent of a matching of child nodes in any other zone. The order of zone consideration and the choice of the starting zone no longer affect the solution, and the optimal layout is found. In general, the matching of child nodes of the current zone will depend on the matchings performed previously.

The procedure is summarized in figure 5. The time complexity of matching is $O(T^3)$. If the number of zones is Z, then the procedure runs in $O(Z^2 T^3)$ time, since each of the Z starting zones requires Z matchings. The value of Z is bounded by $O(V)$, where V is the number of intervals; therefore, the running time of the algorithm is $O(V^2 T^3)$. If the value of T is assumed to be constant, then the time complexity is quadratic in V.

RoutingReuse (P2, T, L, layout of P1)
    construct zone graph of P2
    for each child node v of zone graph
        for each track t
            compute overlap of v on t
    SOLUTION = ∅
    MINCOST = ∞
    for each zone z
        StartZone = z
        match child nodes of StartZone to T tracks
        NextZone = next zone to the right of StartZone
        while NextZone ≠ ∅
            match child nodes of NextZone to T tracks
            NextZone = next zone to the right of NextZone
        NextZone = next zone to the left of StartZone
        while NextZone ≠ ∅
            match child nodes of NextZone to T tracks
            NextZone = next zone to the left of NextZone
        compute COST of current layout of P2
        if COST < MINCOST
            SOLUTION = current layout of P2
            MINCOST = COST
    return SOLUTION

Figure 5. Routing Reuse Procedure.
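The procedure of Figure 5 can be exercised on the running example with the sketch below. Several things here are assumptions rather than the authors' implementation: the layout of P1 is reconstructed from the text, the zone boundaries are taken from Figure 4(c) as quoted above, and maximum weight matching is replaced by brute-force enumeration with `itertools.permutations`, which is only practical for a small T. All function names are hypothetical.

```python
from itertools import permutations

T, L = 4, 10
# Intervals of P2 (Table 1) and its zones (Figure 4(c)), as half-open spans.
P2 = {"A": (0, 3), "B": (4, 7), "C": (8, 10), "D": (0, 5),
      "E": (6, 10), "F": (2, 5), "G": (8, 10), "H": (4, 10)}
ZONES = [(0, 4), (4, 6), (6, 8), (8, 10)]
# Assumed layout of P1 (tracks t0..t3), inferred from the worked example.
P1 = [[(0, 3), (4, 7), (8, 10)], [(0, 5), (6, 10)], [(2, 5), (8, 10)], [(4, 10)]]


def enabled(intervals):
    return {x for left, right in intervals for x in range(left, right)}


P1_ON = [enabled(track) for track in P1]


def children(z):
    """Child nodes of zone z as (name, segment, interval); interval is None for gaps."""
    zs, ze = ZONES[z]
    kids = []
    for name, (left, right) in P2.items():
        if left < ze and right > zs:
            parents = [q for q in ZONES if left < q[1] and right > q[0]]
            kids.append((name, (parents[0][0], parents[-1][1]), (left, right)))
    # One gap child per vacant track, spanning the zone.
    kids += [(f"gap{z}.{g}", (zs, ze), None) for g in range(T - len(kids))]
    return kids


def overlap(child, t):
    _, (start, end), interval = child
    on = enabled([interval]) if interval else set()
    return sum((x in on) == (x in P1_ON[t]) for x in range(start, end))


def layout_from(start):
    """Match zones outward from `start` (right first, then left); return (cost, tracks)."""
    order = list(range(start, len(ZONES))) + list(range(start - 1, -1, -1))
    track_of, node = {}, {}
    for z in order:
        kids = children(z)
        free_kids = [k for k in kids if k[0] not in track_of]  # others are isolated
        used = {track_of[k[0]] for k in kids if k[0] in track_of}
        free_tracks = [t for t in range(T) if t not in used]
        # Brute-force stand-in for maximum weight bipartite matching.
        best = max(permutations(free_tracks, len(free_kids)),
                   key=lambda p: sum(overlap(k, t) for k, t in zip(free_kids, p)))
        for k, t in zip(free_kids, best):
            track_of[k[0]], node[k[0]] = t, k
    cost = T * L - sum(overlap(node[n], t) for n, t in track_of.items())
    return cost, track_of


def routing_reuse():
    return min((layout_from(s) for s in range(len(ZONES))), key=lambda r: r[0])


cost, track_of = routing_reuse()
print(cost, track_of)
```

For a larger T, the enumeration would be swapped for a polynomial-time assignment algorithm, which is where the $O(T^3)$ matching bound in the text applies.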

7 Results

To illustrate the procedure on a relatively large example with easily reproducible results, we used an MCNC layout netlist "fract" with 182 point-to-point connections. It is a multiplier built from standard cells.

Four placements were generated to obtain different routing patterns. In the first placement, all cells were placed on the bottom row in the order they are listed in the benchmark specification. The second placement differed from the first one in that every second cell was placed on the top row. In the third placement, the first twenty cells were assigned to the bottom row, the second twenty cells were assigned to the top row, the next twenty cells were assigned to the bottom row, etc. The fourth placement generation was similar to the third except that cells were placed in groups of a hundred. The problem instances for the first, second, third, and fourth placements are denoted by "f1", "f2", "f3", and "f4", respectively. The channel length L is set to the right endpoint of the rightmost interval. In the netlist, the coordinates of interval endpoints are scaled by a factor of 1000. The profiles of the routing patterns are presented in table 2. The first column shows the pattern, the second column indicates the minimum number of tracks T needed to realize the pattern, the third column shows the channel length L needed, and the last column indicates the cost $T \times L$.

Pattern   T     L      T × L
f1        120   3825   459000
f2        116   1929   223764
f3        128   1905   243840
f4        105   2713   284865

Table 2. Pattern Profile.

The results are presented in Table 3. The first column shows the pair of patterns, the second column shows the amount of reconfiguration without reusing already existing routing segments (NR = No Reuse), the third column shows the reconfiguration cost with the unoptimized reuse of already existing routing segments (UR = Unoptimized Reuse), and the fourth column shows the cost with optimized reuse (OR = Optimized Reuse). The fifth column shows the ratio of NR to OR, and the sixth column shows the percentage savings of OR with respect to UR. Reconfiguration costs are measured in terms of the number of CRBs that need to be reconfigured. A pair of patterns gives two Ls and two Ts; the channel length is the larger L, and the channel width is the larger T. The NR cost is the product of the channel length and width.
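The derived columns can be recomputed from the pattern profiles. The snippet below is a small sketch: the (T, L) values are those reported in Table 2, while the names `PROFILE`, `nr_cost`, and `summarize` are hypothetical helpers introduced only for illustration.

```python
# (T, L) per pattern, as reported in Table 2.
PROFILE = {"f1": (120, 3825), "f2": (116, 1929),
           "f3": (128, 1905), "f4": (105, 2713)}

def nr_cost(a, b):
    """No-reuse cost: larger channel width times larger channel length."""
    (ta, la), (tb, lb) = PROFILE[a], PROFILE[b]
    return max(ta, tb) * max(la, lb)

def summarize(a, b, ur, opt):
    """Return (NR, NR/OR, percentage savings of OR over UR) for one pair."""
    nr = nr_cost(a, b)
    ratio = round(nr / opt, 1) if opt else None      # undefined when OR = 0
    savings = round(100 * (1 - opt / ur), 1) if ur else 0.0
    return nr, ratio, savings

# UR and OR for the f2/f3 pair are taken from the measured results.
row = summarize("f2", "f3", ur=39064, opt=17800)
```

For the f2/f3 pair this yields an NR cost of 246912, an NR/OR ratio of 13.9, and savings of 54.4% relative to UR.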

The layout of P1 is computed by the left-edge algorithm (see [10] for details). The unoptimized savings were obtained by computing the overlap between the layout of P1 and a layout of P2 that was also computed by the left-edge algorithm. That is, the UR number corresponds to the layout of P2 obtained by the left-edge algorithm, whereas the OR number corresponds to the layout of P2 obtained by our method. Note that when P1 = P2, the optimal solution should have zero cost, and our procedure finds it. Routing reuse is clearly beneficial, since it allows for a considerable decrease in reconfiguration costs. The above results indicate that the reuse of routing segments can reduce the cost by up to a factor of 14, and the overlap optimization can achieve extra savings of up to 54%.

P2/P1    NR        UR        OR        NR/OR    1 - OR/UR (%)
f1/f1    459000    0         0         -        0
f2/f2    223764    0         0         -        0
f3/f3    243840    0         0         -        0
f4/f4    284865    0         0         -        0
f1/f2    459000    168048    162800    2.8      3.1
f1/f3    489600    172696    162344    3.0      6.0
f1/f4    459000    75280     57552     8.0      23.5
f2/f1    459000    168048    163296    2.8      2.8
f2/f3    246912    39064     17800     13.9     54.4
f2/f4    314708    117728    113536    2.8      3.6
f3/f1    489600    172696    163416    3.0      5.4
f3/f2    246912    39064     20184     12.2     48.3
f3/f4    347264    123912    115192    3.0      7.0
f4/f1    459000    75280     56352     8.1      25.1
f4/f2    314708    117728    109840    2.9      6.7
f4/f3    347264    123912    110728    3.1      10.6

Table 3. Comparison Results.
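For readers unfamiliar with the left-edge algorithm used for the P1 layout and for the UR baseline, a common formulation is sketched below. It assumes half-open intervals (left, right) and no vertical constraints, and it is a generic textbook version rather than the authors' implementation; the name `left_edge` is hypothetical.

```python
def left_edge(intervals):
    """Assign intervals to tracks in order of their left endpoints.

    Each interval goes to the lowest-numbered track whose last wire
    ends at or before the interval's left endpoint; a new track is
    opened when none fits.
    """
    tracks = []  # tracks[k] is the list of intervals placed on track k
    for iv in sorted(intervals):
        for track in tracks:
            if track[-1][1] <= iv[0]:   # fits to the right of the last wire
                track.append(iv)
                break
        else:
            tracks.append([iv])         # no existing track fits
    return tracks

tracks = left_edge([(7, 10), (0, 4), (5, 9), (2, 6)])
```

The number of tracks produced equals the channel density, that is, the maximum number of intervals crossing any single column.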

8 Acknowledgements

This work was carried out at the National Science Foundation’s State/Industry/University Cooperative Research Centers’ (NSF-S/IUCRC) Center for Low Power Electronics (CLPE). CLPE is supported by the NSF (Grant EEC-9523338), the State of Arizona, and the following companies and foundations: Conexant, Gain Technology, Intel Corporation, Medtronic Microelectronics Center, Microchip Technology, Motorola, Inc., The Motorola Foundation, ON Semiconductor, Philips Semiconductors, Raytheon, Syncron Technologies, LLT, Texas Instruments and Western Design Center.

9 Conclusion

We described the hardware organization that allows for simple dynamic placement and routing through introduction of a virtual standard cell topology over an FPGA. We discussed in detail the problem of data routing in a reconfigurable channel. Even though the channel capacity is fixed, routes that cannot fit in the channel at once can share the channel over time. The cost of configuring a new routing pattern is reduced if portions of the already existing configured routing pattern are reused. We addressed the issue of minimization of configuration cost through maximization of the reuse of the current configuration of the channel and proposed a simple yet efficient procedure to solve the routing reuse problem.

References

[1] AT40K Field Programmable Gate Arrays, Atmel, 1997.


[2] AT94K Field Programmable System-Level ICs, Atmel, 1999.

[3] K. Bazargan, R. Kastner, M. Sarrafzadeh. “Fast Template Placement for Reconfigurable Computing Systems” IEEE Design and Test, Feb. 2000.

[4] T. Callahan, P. Chong, A. DeHon, J. Wawrzynek. “Fast Module Mapping and Placement for Datapaths in FPGAs” Proc. Symposium FPGA, 1998.

[5] A. Chowdhary, S. Kale, P. Saripella, N. Sehgal, R. Gupta. “Extraction of Functional Regularity in Datapath Circuits” IEEE Trans. CAD, Sept. 1999.

[6] S. Hauck, T. Fry, M. Hosler, J. Kao. “The Chimaera Reconfigurable Functional Unit” Proc. Symposium FCCM, 1997.

[7] D. Rakhmatov, S. Vrudhula, T. Brown, A. Nagarandal. “Adaptive Multiuser Online Reconfigurable Engine” IEEE Design and Test, Feb. 2000.

[8] D. Rao, F. Kurdahi. “On Clustering for Maximal Regularity Extraction” IEEE Trans. CAD, Aug. 1993.

[9] M. Sarrafzadeh, C. Wong. An Introduction to VLSI Physical Design, Chapter 3, Section 3.4, McGraw-Hill, 1996.

[10] N. Sherwani. Algorithms for VLSI Physical Design Automation, Chapter 7, Section 7.4, Kluwer Academic Publishers, 1995.

[11] D. West. Introduction to Graph Theory, Chapter 3, Section 3.2, Prentice Hall, 1996.

[12] M. Wirthlin, B. Hutchings. “Sequencing Run-Time Reconfigured Hardware with Software” Proc. Symposium FPGA, 1996.

[13] http://brass.cs.berkeley.edu/garp.html

[14] http://www.annapmicro.com/WASPP.html

[15] http://www.ececs.uc.edu/~dal/acs/index.htm

[16] http://www.icsl.ucla.edu/~atr/atr_mach.html
