

  • Reconfigurable Computing: A Survey of Systems and Software

    KATHERINE COMPTON

    Northwestern University

    AND

    SCOTT HAUCK

    University of Washington

    Due to its potential to greatly accelerate a wide variety of applications, reconfigurable computing has become a subject of a great deal of research. Its key feature is the ability to perform computations in hardware to increase performance, while retaining much of the flexibility of a software solution. In this survey, we explore the hardware aspects of reconfigurable computing machines, from single-chip architectures to multi-chip systems, including internal structures and external coupling. We also focus on the software that targets these machines, such as compilation tools that map high-level algorithms directly to the reconfigurable substrate. Finally, we consider the issues involved in run-time reconfigurable systems, which reuse the configurable hardware during program execution.

    Categories and Subject Descriptors: A.1 [Introductory and Survey]; B.6.1 [Logic Design]: Design Style—logic arrays; B.6.3 [Logic Design]: Design Aids; B.7.1 [Integrated Circuits]: Types and Design Styles—gate arrays

    General Terms: Design, Performance

    Additional Key Words and Phrases: Automatic design, field-programmable, FPGA, manual design, reconfigurable architectures, reconfigurable computing, reconfigurable systems

    This research was supported in part by Motorola, Inc., DARPA, and NSF. K. Compton was supported by an NSF fellowship. S. Hauck was supported in part by an NSF CAREER award and a Sloan Research Fellowship.

    Authors’ addresses: K. Compton, Department of Electrical and Computer Engineering, Northwestern University, 2145 Sheridan Road, Evanston, IL 60208-3118; e-mail: [email protected]; S. Hauck, Department of Electrical Engineering, The University of Washington, Box 352500, Seattle, WA 98195; e-mail: [email protected].

    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or [email protected].

    ©2002 ACM 0360-0300/02/0600-0171 $5.00

    ACM Computing Surveys, Vol. 34, No. 2, June 2002, pp. 171–210.

    1. INTRODUCTION

    There are two primary methods in conventional computing for the execution of algorithms. The first is to use hard-wired technology, either an Application Specific Integrated Circuit (ASIC) or a group of individual components forming a


    board-level solution, to perform the operations in hardware. ASICs are designed specifically to perform a given computation, and thus they are very fast and efficient when executing the exact computation for which they were designed. However, the circuit cannot be altered after fabrication. This forces a redesign and refabrication of the chip if any part of its circuit requires modification. This is an expensive process, especially when one considers the difficulties in replacing ASICs in a large number of deployed systems. Board-level circuits are also somewhat inflexible, frequently requiring a board redesign and replacement in the event of changes to the application.

    The second method is to use software-programmed microprocessors—a far more flexible solution. Processors execute a set of instructions to perform a computation. By changing the software instructions, the functionality of the system is altered without changing the hardware. However, the downside of this flexibility is that the performance can suffer, if not in clock speed then in work rate, and is far below that of an ASIC. The processor must read each instruction from memory, decode its meaning, and only then execute it. This results in a high execution overhead for each individual operation. Additionally, the set of instructions that may be used by a program is determined at the fabrication time of the processor. Any other operations that are to be implemented must be built out of existing instructions.
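The fetch-decode-execute overhead described here can be illustrated with a toy interpreter; the three-instruction machine below is hypothetical, chosen only to show that every operation pays a fetch and decode cost before any useful work occurs:

```python
# Toy interpreter: every operation pays a fetch and a decode cost
# before any useful work happens, unlike a hard-wired circuit.
def run(program, x):
    pc = 0
    while pc < len(program):
        instr = program[pc]           # fetch
        op, operand = instr           # decode
        if op == "ADD":               # execute
            x += operand
        elif op == "MUL":
            x *= operand
        elif op == "HALT":
            break
        pc += 1
    return x

# (3*x + 1) built only from the fixed instruction set
program = [("MUL", 3), ("ADD", 1), ("HALT", None)]
print(run(program, 5))  # 16
```

Any operation outside the fixed instruction set (here, just ADD and MUL) must be synthesized as a sequence of these instructions, each paying the same per-instruction overhead.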

    Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much higher performance than software, while maintaining a higher level of flexibility than hardware. Reconfigurable devices, including field-programmable gate arrays (FPGAs), contain an array of computational elements whose functionality is determined through multiple programmable configuration bits. These elements, sometimes known as logic blocks, are connected using a set of routing resources that are also programmable. In this way, custom digital circuits can be mapped to the reconfigurable hardware by computing the logic functions of the circuit within the logic blocks, and using the configurable routing to connect the blocks together to form the necessary circuit.

    FPGAs and reconfigurable computing have been shown to accelerate a variety of applications. Data encryption, for example, is able to leverage both parallelism and fine-grained data manipulation. An implementation of the Serpent Block Cipher in the Xilinx Virtex XCV1000 shows a throughput increase by a factor of over 18 compared to a Pentium Pro PC running at 200 MHz [Elbirt and Paar 2000]. Additionally, a reconfigurable computing implementation of sieving for factoring large numbers (useful in breaking encryption schemes) was accelerated by a factor of 28 over a 200-MHz UltraSparc workstation [Kim and Mangione-Smith 2000]. The Garp architecture shows a comparable speed-up for DES [Hauser and Wawrzynek 1997], as does an FPGA implementation of an elliptic curve cryptography application [Leung et al. 2000].

    Other recent applications that have been shown to exhibit significant speed-ups using reconfigurable hardware include: automatic target recognition [Rencher and Hutchings 1997], string pattern matching [Weinhardt and Luk 1999], Golomb Ruler Derivation [Dollas et al. 1998; Sotiriades et al. 2000], transitive closure of dynamic graphs [Huelsbergen 2000], Boolean satisfiability [Zhong et al. 1998], data compression [Huang et al. 2000], and genetic algorithms for the travelling salesman problem [Graham and Nelson 1996].

    In order to achieve these performance benefits, yet support a wide range of applications, reconfigurable systems are usually formed with a combination of reconfigurable logic and a general-purpose microprocessor. The processor performs the operations that cannot be done efficiently in the reconfigurable logic, such as data-dependent control and possibly memory accesses, while the computational cores are mapped to the reconfigurable hardware. This reconfigurable logic can be



    composed of either commercial FPGAs or custom configurable hardware.

    Compilation environments for reconfigurable hardware range from tools to assist a programmer in performing a hand mapping of a circuit to the hardware, to complete automated systems that take a circuit description in a high-level language to a configuration for a reconfigurable system. The design process involves first partitioning a program into sections to be implemented on hardware, and those which are to be implemented in software on the host processor. The computations destined for the reconfigurable hardware are synthesized into a gate-level or register-transfer-level circuit description. This circuit is mapped onto the logic blocks within the reconfigurable hardware during the technology mapping phase. These mapped blocks are then placed into the specific physical blocks within the hardware, and the pieces of the circuit are connected using the reconfigurable routing. After compilation, the circuit is ready for configuration onto the hardware at run-time. These steps, when performed using an automatic compilation system, require very little effort on the part of the programmer to utilize the reconfigurable hardware. However, performing some or all of these operations by hand can result in a more highly optimized circuit for performance-critical applications.
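As a rough illustration, the stages just described can be sketched as a pipeline. The stage names follow the text; the data structures and the `compute_bound` partitioning criterion are placeholders, not the workings of any real CAD flow:

```python
# A schematic compilation flow for reconfigurable hardware:
# partition, synthesize, technology-map, place, route, configure.
def compile_for_rc(program_sections):
    # 1. Partition: hardware-bound kernels vs. software for the host CPU
    hw = [s for s in program_sections if s["compute_bound"]]
    sw = [s for s in program_sections if not s["compute_bound"]]

    # 2-5. Hardware pipeline (each step stands in for a real CAD algorithm)
    netlist   = [("gate_level", s["name"]) for s in hw]            # synthesis
    mapped    = [("logic_block", n) for n in netlist]              # tech mapping
    placed    = {i: blk for i, blk in enumerate(mapped)}           # placement
    bitstream = [(loc, blk, "routed") for loc, blk in placed.items()]  # routing

    # 6. The result: a configuration for the array, plus host-side code
    return {"configuration": bitstream, "host_code": sw}

result = compile_for_rc([
    {"name": "fir_filter", "compute_bound": True},
    {"name": "file_io",    "compute_bound": False},
])
print(len(result["configuration"]), len(result["host_code"]))  # 1 1
```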

    Since FPGAs must pay an area penalty because of their reconfigurability, device capacity can sometimes be a concern. Systems that are configured only at power-up are able to accelerate only as much of the program as will fit within the programmable structures. Additional areas of a program might be accelerated by reusing the reconfigurable hardware during program execution. This process is known as run-time reconfiguration (RTR). While this style of computing has the benefit of allowing for the acceleration of a greater portion of an application, it also introduces the overhead of configuration, which limits the amount of acceleration possible. Because configuration can take milliseconds or longer, rapid and efficient configuration is a critical issue. Methods such as configuration compression and the partial reuse of already-programmed configurations can be used to reduce this overhead.
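Configuration compression can take many forms; as a minimal sketch, run-length encoding (only one illustrative choice, not the scheme used by any particular device) exploits the long constant runs typical of configuration bitstreams:

```python
# Run-length encode a configuration bitstream: bitstreams often
# contain long runs of identical bits, which compress well.
def rle_encode(bits):
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([b, 1])       # start a new run
    return runs

def rle_decode(runs):
    return [b for b, n in runs for _ in range(n)]

bitstream = [0] * 12 + [1, 0, 1] + [0] * 9
encoded = rle_encode(bitstream)
assert rle_decode(encoded) == bitstream
print(len(bitstream), "bits ->", len(encoded), "runs")  # 24 bits -> 5 runs
```

Fewer symbols to transfer means less time spent in the configuration phase, which is exactly where RTR systems pay their overhead.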

    This article presents a survey of current research in hardware and software systems for reconfigurable computing, as well as techniques that specifically target run-time reconfigurability. We lead off this discussion by examining the technology required for reconfigurable computing, followed by a more in-depth examination of the various hardware structures used in reconfigurable systems. Next, we look at the software required for compilation of algorithms to configurable computers, and the trade-offs between hand-mapping and automatic compilation. Finally, we discuss run-time reconfigurable systems, which further utilize the intrinsic flexibility of configurable computing platforms by optimizing the hardware not only for different applications, but for different operations within a single application as well.

    This survey does not seek to cover every technique and research project in the area of reconfigurable computing. Instead, it hopes to serve as an introduction to this rapidly evolving field, bringing interested readers quickly up to speed on developments from the last half-decade. Those interested in further background can find coverage of older techniques and systems elsewhere [Rose et al. 1993; Hauck and Agarwal 1996; Vuillemin et al. 1996; Mangione-Smith et al. 1997; Hauck 1998b].

    2. TECHNOLOGY

    Reconfigurable computing as a concept has been in existence for quite some time [Estrin et al. 1963]. Even general-purpose processors use some of the same basic ideas, such as reusing computational components for independent computations, and using multiplexers to control the routing between these components. However, the term reconfigurable computing, as it is used in current research (and within this survey), refers to systems incorporating some form of hardware programmability—customizing how the hardware is used using a number



    Fig. 1. A programming bit for SRAM-based FPGAs [Xilinx 1994] (left) and a programmable routing connection (right).

    of physical control points. These control points can then be changed periodically in order to execute different applications using the same hardware.

    The recent advances in reconfigurable computing are for the most part derived from the technologies developed for FPGAs in the mid-1980s. FPGAs were originally created to serve as a hybrid device between PALs and Mask-Programmable Gate Arrays (MPGAs). Like PALs, FPGAs are fully electrically programmable, meaning that the physical design costs are amortized over multiple application circuit implementations, and the hardware can be customized nearly instantaneously. Like MPGAs, they can implement very complex computations on a single chip, with devices currently in production containing the equivalent of over a million gates. Because of these features, FPGAs had been primarily viewed as glue-logic replacement and rapid-prototyping vehicles. However, as we show throughout this article, the flexibility, capacity, and performance of these devices have opened up completely new avenues in high-performance computation, forming the basis of reconfigurable computing.

    Most current FPGAs and reconfigurable devices are SRAM-programmable (Figure 1 left), meaning that SRAM¹ bits are connected to the configuration points in the FPGA, and programming the SRAM bits configures the FPGA.

    ¹ The term “SRAM” is technically incorrect for many FPGA architectures, given that the configuration memory may or may not support random access. In fact, the configuration memory tends to be continually read in order to perform its function. However, this is the generally accepted term in the field and correctly conveys the concept of static volatile memory using an easily understandable label.

    Thus, these chips can be programmed and reprogrammed about as easily as a standard static RAM. In fact, one research project, the PAM project [Vuillemin et al. 1996], considers a group of one or more FPGAs to be a RAM unit that performs computation between the memory write (sending the configuration information and input data) and memory read (reading the results of the computation). This leads some to use the term Programmable Active Memory or PAM.

    One example of how the SRAM configuration points can be used is to control routing within a reconfigurable device [Chow et al. 1999a]. To configure the routing on an FPGA, typically a passgate structure is employed (see Figure 1 right). Here the programming bit will turn on a routing connection when it is configured with a true value, allowing a signal to flow from one wire to another, and will disconnect these resources when the bit is set to false. With a proper interconnection of these elements, which may include millions of routing choice points within a single device, a rich routing fabric can be created.
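Conceptually (a toy model, not any vendor's switch architecture), each programming bit gates whether a signal propagates between two wires:

```python
# Toy model of programmable routing: each switch is a passgate
# controlled by one configuration bit; a true bit connects two wires.
class RoutingSwitch:
    def __init__(self):
        self.bit = False              # the SRAM configuration bit

    def configure(self, value):
        self.bit = bool(value)

    def propagate(self, signal):
        # When enabled, the signal flows through; otherwise the
        # destination wire is left undriven (modeled here as None).
        return signal if self.bit else None

sw = RoutingSwitch()
sw.configure(True)
print(sw.propagate(1))   # 1 (wires connected)
sw.configure(False)
print(sw.propagate(1))   # None (wires disconnected)
```

A real device composes millions of such choice points into channels and switch boxes; the principle at each point is just this one bit.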

    Another example of how these configuration bits may be used is to control multiplexers, which will choose between the output of different logic resources within the array. For example, to provide optional stateholding elements, a D flip-flop (DFF) may be included with a multiplexer selecting whether to forward the latched or unlatched signal value (see Figure 2 left). Thus, for systems that require stateholding, the programming bits controlling the multiplexer would be configured to select the DFF output, while systems that do not need this function would choose the bypass route that sends the input directly to the output. Similar structures
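A cycle-level toy model of this optional state element (illustrative only; the mux select plays the role of the configuration bit):

```python
# Optional state-holding: a configuration bit selects between the
# D flip-flop output (registered) and a bypass (combinational).
class OptionalDFF:
    def __init__(self, use_dff):
        self.use_dff = use_dff        # configuration bit driving the mux
        self.state = 0                # DFF contents

    def clock(self, d):
        out = self.state if self.use_dff else d   # mux choice
        self.state = d                            # DFF latches the input
        return out

reg = OptionalDFF(use_dff=True)
print(reg.clock(1), reg.clock(0))    # 0 1  (output lags by one cycle)
comb = OptionalDFF(use_dff=False)
print(comb.clock(1), comb.clock(0))  # 1 0  (input bypasses the DFF)
```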



    Fig. 2. D flip-flop with optional bypass (left) and a 3-input LUT (right).

    can choose between other on-chip functionalities, such as fixed-logic computation elements, memories, carry chains, or other functions.

    Finally, the configuration bits may be used as control signals for a computational unit or as the basis for computation itself. As a control signal, a configuration bit may determine whether an ALU performs an addition, subtraction, or other logic computation. On the other hand, with a structure such as a lookup table (LUT), the configuration bits themselves form the result of the computation (see Figure 2 right). These elements are essentially small memories provided for computing arbitrary logic functions. LUTs can compute any function of N inputs (where N is the number of control signals for the LUT’s multiplexer) by programming the 2^N programming bits with the truth table of the desired function. Thus, if all programming bits except the one corresponding to the input pattern 111 were set to zero, a 3-input LUT would act as a 3-input AND gate, while programming it with all ones except in 111 would compute a NAND.
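As a sketch, a LUT can be modeled as a table indexed by the input bits (a hypothetical helper class, not vendor code):

```python
# An N-input LUT is a 2^N-entry memory: the configuration bits are the
# truth table, and the inputs select one entry, like a multiplexer tree.
class LUT:
    def __init__(self, truth_table):
        assert len(truth_table) in (2, 4, 8, 16)   # 2^N entries
        self.table = truth_table

    def evaluate(self, *inputs):
        index = 0
        for bit in inputs:                 # pack input bits into an index
            index = (index << 1) | bit
        return self.table[index]

# All zeros except entry 0b111: a 3-input AND gate.
and3 = LUT([0, 0, 0, 0, 0, 0, 0, 1])
print(and3.evaluate(1, 1, 1), and3.evaluate(1, 0, 1))  # 1 0

# All ones except entry 0b111: a 3-input NAND gate.
nand3 = LUT([1, 1, 1, 1, 1, 1, 1, 0])
print(nand3.evaluate(1, 1, 1), nand3.evaluate(0, 0, 0))  # 0 1
```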

    3. HARDWARE

    Reconfigurable computing systems use FPGAs or other programmable hardware to accelerate algorithm execution by mapping compute-intensive calculations to the reconfigurable substrate. These hardware resources are frequently coupled with a general-purpose microprocessor that is responsible for controlling the reconfigurable logic and executing program code that cannot be efficiently accelerated. In

    very closely coupled systems, the reconfigurability lies within customizable functional units on the regular datapath of the microprocessor. On the other hand, a reconfigurable computing system can be as loosely coupled as a networked stand-alone unit. Most reconfigurable systems are categorized somewhere between these two extremes, frequently with the reconfigurable hardware acting as a coprocessor to a host microprocessor. The programmable array itself can be comprised of one or more commercially available FPGAs, or can be a custom device designed specifically for reconfigurable computing.

    The design of the actual computation blocks within the reconfigurable hardware varies from system to system. Each unit of computation, or logic block, can be as simple as a 3-input lookup table (LUT), or as complex as a 4-bit ALU. This difference in block size is commonly referred to as the granularity of the logic block, where the 3-input LUT is an example of a very fine-grained computational element, and a 4-bit ALU is an example of a quite coarse-grained unit. The finer-grained blocks are useful for bit-level manipulations, while the coarse-grained blocks are better optimized for standard datapath applications. Some architectures employ different sizes or types of blocks within a single reconfigurable array in order to efficiently support different types of computation. For example, memory is frequently embedded within the reconfigurable hardware to provide temporary data storage, forming a heterogeneous structure composed of both logic blocks and memory blocks [Ebeling et al. 1996; Altera 1998; Lucent 1998; Marshall et al. 1999; Xilinx 1999].



    The routing between the logic blocks within the reconfigurable hardware is also of great importance. Routing contributes significantly to the overall area of the reconfigurable hardware. Yet, when the percentage of logic blocks used in an FPGA becomes very high, automatic routing tools frequently have difficulty achieving the necessary connections between the blocks. Good routing structures are therefore essential to ensure that a design can be successfully placed and routed onto the reconfigurable hardware.

    Once a circuit has been programmed onto the reconfigurable hardware, it is ready to be used by the host processor during program execution. The run-time operation of a reconfigurable system occurs in two distinct phases: configuration and execution. The programming of the reconfigurable hardware is under the control of the host processor. This host processor directs a stream of configuration data to the reconfigurable hardware, and this configuration data is used to define the actual operation of the hardware. Configurations can be loaded solely at start-up of a program, or periodically during runtime, depending on the design of the system. More concepts involved in run-time reconfiguration (the dynamic reconfiguration of devices during computation execution) are discussed in a later section.

    The actual execution model of the reconfigurable hardware varies from system to system. For example, the NAPA system [Rupp et al. 1998] by default suspends the execution of the host processor during execution on the reconfigurable hardware. However, simultaneous computation can occur with the use of fork-and-join primitives, similar to multiprocessor programming. REMARC [Miyamori and Olukotun 1998] is a reconfigurable system that uses a pipelined set of execution phases within the reconfigurable hardware. These pipeline stages overlap with the pipeline stages of the host processor, allowing for simultaneous execution. In the Chimaera system [Hauck et al. 1997], the reconfigurable hardware is constantly executing based upon the input values held in a subset of the host processor’s registers. A call to the Chimaera unit is in actuality only a fetch of the result value. This value is stable and valid after the correct input values have been written to the registers and have filtered through the computation.

    In the next sections, we consider in greater depth the hardware issues in reconfigurable computing, including both logic and routing. To support the computation demands of reconfigurable computing, we consider the logic block architectures of these devices, including possibly the integration of heterogeneous logic resources within a device. Heterogeneity also extends between chips, where one of the most important concerns is the coupling of the reconfigurable logic with standard, general-purpose processors. However, reconfigurable devices are more than just logic devices; the routing resources are at least as important as logic resources, and thus we consider interconnect structures, including 1D-oriented devices that are beginning to appear.

    3.1. Coupling

    Frequently, reconfigurable hardware is coupled with a traditional microprocessor. Programmable logic tends to be inefficient at implementing certain types of operations, such as variable-length loops and branch control. In order to run an application in a reconfigurable computing system most efficiently, the areas of the program that cannot be easily mapped to the reconfigurable logic are executed on a host microprocessor. Meanwhile, the areas with a high density of computation that can benefit from implementation in hardware are mapped to the reconfigurable logic. For the systems that use a microprocessor in conjunction with reconfigurable logic, there are several ways in which these two computation structures may be coupled, as Figure 3 shows.

    First, reconfigurable hardware can be used solely to provide reconfigurable functional units within a host processor [Razdan and Smith 1994; Hauck et al. 1997]. This allows for a traditional programming environment with the



    Fig. 3. Different levels of coupling in a reconfigurable system. Reconfigurable logic is shaded.

    addition of custom instructions that may change over time. Here, the reconfigurable units execute as functional units on the main microprocessor datapath, with registers used to hold the input and output operands.

    Second, a reconfigurable unit may be used as a coprocessor [Wittig and Chow 1996; Hauser and Wawrzynek 1997; Miyamori and Olukotun 1998; Rupp et al. 1998; Chameleon 2000]. A coprocessor is, in general, larger than a functional unit, and is able to perform computations without the constant supervision of the host processor. Instead, the processor initializes the reconfigurable hardware and either sends the necessary data to the logic, or provides information on where this data might be found in memory. The reconfigurable unit performs the actual computations independently of the main processor, and returns the results after completion. This type of coupling allows the reconfigurable logic to operate for a large number of cycles without intervention from the host processor, and generally permits the host processor and the reconfigurable logic to execute simultaneously. This reduces the overhead incurred by the use of the reconfigurable logic, compared to a reconfigurable functional unit that must communicate with the host processor each time a reconfigurable “instruction” is used. One idea that is somewhat of a hybrid between the first and second coupling methods is the use of programmable hardware within a configurable cache [Kim et al. 2000]. In this situation, the reconfigurable

    logic is embedded into the data cache. This cache can then be used as either a regular cache or as an additional computing resource, depending on the target application.
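The coprocessor-style interaction described above (initialize, hand off data, compute independently, return results) can be sketched as follows; the interface is hypothetical, not a real device driver:

```python
# Sketch of the host/coprocessor protocol: the host loads a
# configuration and hands off data; the unit then computes for many
# cycles without host supervision and returns results on completion.
class ReconfigCoprocessor:
    def __init__(self):
        self.configuration = None

    def initialize(self, configuration):
        self.configuration = configuration    # "load the bitstream"

    def execute(self, data):
        # A Python function stands in for the mapped circuit; the host
        # is free to do other work between initialize() and execute().
        return [self.configuration(x) for x in data]

cop = ReconfigCoprocessor()
cop.initialize(lambda x: x * x)     # "configure" a squaring circuit
print(cop.execute([1, 2, 3]))       # [1, 4, 9]
```

In contrast, a reconfigurable functional unit would pay a host interaction on every single operation rather than once per batch.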

    Third, an attached reconfigurable processing unit [Vuillemin et al. 1996; Annapolis 1998; Laufer et al. 1999] behaves as if it is an additional processor in a multiprocessor system or an additional compute engine accessed semifrequently through external I/O. The host processor’s data cache is not visible to the attached reconfigurable processing unit. There is, therefore, a higher delay in communication between the host processor and the reconfigurable hardware, such as when communicating configuration information, input data, and results. This communication is performed through specialized primitives similar to multiprocessor systems. However, this type of reconfigurable hardware does allow for a great deal of computation independence, by shifting large chunks of a computation over to the reconfigurable hardware.

    Finally, the most loosely coupled form of reconfigurable hardware is that of an external stand-alone processing unit [Quickturn 1999a, 1999b]. This type of reconfigurable hardware communicates infrequently with a host processor (if present). This model is similar to that of networked workstations, where processing may occur for very long periods of time without a great deal of communication. In the case of the Quickturn systems, however, this hardware is geared



    more towards emulation than reconfigurable computing.

    Each of these styles has distinct benefits and drawbacks. The tighter the integration of the reconfigurable hardware, the more frequently it can be used within an application or set of applications due to a lower communication overhead. However, the hardware is unable to operate for significant portions of time without intervention from a host processor, and the amount of reconfigurable logic available is often quite limited. The more loosely coupled styles allow for greater parallelism in program execution, but suffer from higher communications overhead. In applications that require a great deal of communication, this can reduce or remove any acceleration benefits gained through this type of reconfigurable hardware.

    3.2. Traditional FPGAs

    Before discussing the detailed architecture design of reconfigurable devices in general, we will first describe the logic and routing of FPGAs. These concepts apply directly to reconfigurable systems using commercial FPGAs, such as PAM [Vuillemin et al. 1996] and Splash 2 [Arnold et al. 1992; Buell et al. 1996], and many also extend to architectures designed specifically for reconfigurable computing. Hardware concepts applying specifically to architectures designed for reconfigurable computing, as well as variations on the generic FPGA description provided here, are discussed following this section. More detailed surveys of FPGA architectures themselves can be found elsewhere [Brown et al. 1992a; Rose et al. 1993].

    Since the introduction of FPGAs in the mid-1980s, there have been many different investigations into what computation element(s) should be built into the array [Rose et al. 1993]. One could consider FPGAs that were created with PAL-like product term arrays, or multiplexer-based functionality, or even basic fixed functions such as simple NAND and XOR gates. In fact, many such architectures have been built. However, it seems to be fairly well

    Fig. 4. A basic logic block, with a 4-input LUT, carry chain, and a D-type flip-flop with bypass.

    established that the best function block for a standard FPGA, a device whose primary role is the implementation of random digital logic, is the one found in the first devices deployed—the lookup table (Figure 2 right). As described in the previous section, an N-input LUT is basically a memory that, when programmed appropriately, can compute any function of up to N inputs. This flexibility, with relatively simple routing requirements (each input need only be routed to a single multiplexer control input), turns out to be very powerful for logic implementation. Although it is less area-efficient than fixed logic blocks, such as a standard NAND gate, the truth is that most current FPGAs use less than 10% of their chip area for logic, devoting the majority of the silicon real estate to routing resources.

    ACM Computing Surveys, Vol. 34, No. 2, June 2002.

  • Reconfigurable Computing 179

    Fig. 5. A generic island-style FPGA routing architecture.

    The typical FPGA has a logic block with one or more 4-input LUT(s), optional D flip-flops (DFF), and some form of fast carry logic (Figure 4). The LUTs allow any function to be implemented, providing generic logic. The flip-flop can be used for pipelining, registers, state-holding functions for finite state machines, or any other situation where clocking is required. Note that the flip-flops will typically include programmable set/reset lines and clock signals, which may come from global signals routed on special resources, or could be routed via the standard interconnect structures from some other input or logic block. The fast carry logic is a special resource provided in the cell to speed up carry-based computations, such as addition, parity, wide AND operations, and other functions. These resources bypass the general routing structure, connecting instead directly between neighbors in the same column. Since there are very few routing choices in the carry chain, and thus less delay on the computation, the inclusion of these resources can significantly speed up carry-based computations.
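The carry chain can be modeled in the same illustrative style: each cell produces one sum bit through its LUT, while its carry output feeds the next cell in the column directly, bypassing general routing. The cell and chain below sketch the idea and are not any specific device's circuitry:

```python
# Illustrative model of a carry chain: each cell computes one sum bit and
# passes its carry directly to the next cell, never entering the general
# routing fabric.

def full_adder_cell(a, b, carry_in):
    """One logic cell: the LUT computes the sum bit; the carry uses the fast path."""
    total = a + b + carry_in
    return total & 1, total >> 1  # (sum bit, carry out)

def ripple_add(a_bits, b_bits):
    """Add two equal-length little-endian bit vectors via a carry chain."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder_cell(a, b, carry)
        out.append(s)
    return out, carry
```

Because each carry hop is a dedicated wire rather than a trip through switchboxes, the critical path of such an adder is far shorter than the same circuit built from general routing alone.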

    Just as there has been a great deal of experimentation in FPGA logic block architectures, there has been equally as much investigation into interconnect structures. As logic blocks have basically standardized on LUT-based structures, routing resources have become primarily island-style, with logic surrounded by general routing channels.

    Most FPGA architectures organize their routing structures as a relatively smooth sea of routing resources, allowing fast and efficient communication along the rows and columns of logic blocks. As shown in Figure 5, the logic blocks are embedded in a general routing structure, with input and output signals attaching to the routing fabric through connection blocks. The connection blocks provide programmable multiplexers, selecting which of the signals in the given routing channel will be connected to the logic block's terminals. These blocks also connect shorter local wires to longer-distance routing resources. Signals flow from the logic block into the connection block, and then along longer wires within the routing channels. At the switchboxes, there are connections between the horizontal and vertical routing resources to allow signals to change their routing direction. Once the signal has traversed through routing resources and intervening switchboxes, it arrives at the destination logic block through one of its local connection blocks. In this manner, relatively arbitrary interconnections can be achieved between the logic blocks in the system.

    Within a given routing channel, there may be a number of different lengths of routing resources. Some local interconnections may only move between adjacent logic blocks (carry chains are a good example of this), providing high-speed local interconnect. Medium-length lines may run the width of several logic blocks, providing for some longer-distance interconnect. Finally, longlines that run the entire chip width or height may provide for more global signals. Also, many architectures contain special “global lines” that provide high-speed, and often low-skew, connections to all of the logic blocks in the array. These are primarily used for clocks, resets, and other truly global signals.

    While the routing architecture of an FPGA is typically quite complex—the connection blocks and switchboxes surrounding a single logic block typically have thousands of programming points—it is designed to support fairly arbitrary interconnection patterns. Most users ignore the exact details of these architectures and allow the automatic physical design tools to choose appropriate resources to achieve a given interconnect pattern.

    3.3. Logic Block Granularity

  • 180 K. Compton and S. Hauck

    Fig. 6. The functional unit from a Xilinx 6200 cell [Xilinx 1996].

    Most reconfigurable hardware is based upon a set of computation structures that are repeated to form an array. These structures, commonly called logic blocks or cells, vary in complexity from a very small and simple block that can calculate a function of only three inputs, to a structure that is essentially a 4-bit ALU. Some of these block types are configurable, in that the actual operation is determined by a set of loaded configuration data. Other blocks are fixed structures, and the configurability lies in the connections between them. The size and complexity of the basic computing blocks is referred to as the block's granularity.

    An example of a very fine-grained logic block can be found in the Xilinx 6200 series of FPGAs [Xilinx 1996]. The functional unit from one of these cells, as shown in Figure 6, can implement any two-input function and some three-input functions. However, although this type of architecture is useful for very fine-grained bit manipulation, it can be too fine-grained to efficiently implement many types of circuits, such as multipliers. Similarly, finite state machines are frequently too complex to easily map to a reasonable number of very fine-grained logic blocks. However, finite state machines are also too dependent upon single-bit values to be efficiently implemented in a very coarse-grained architecture. This type of circuit is more suited to an architecture that provides more connections and computational power per logic block, yet still provides sufficient capability for bit-level manipulation.

    The logic cell in the Altera FLEX 10K architecture [Altera 1998] is a fine-grained structure that is somewhat coarser than the 6200. This architecture mainly consists of a single 4-input LUT with a flip-flop. Additionally, there is specialized carry-chain circuitry that helps to accelerate addition, parity, and other operations that use a carry chain. These types of logic blocks are useful for fine-grained bit-level manipulation of data, as can frequently be found in encryption and image processing applications. Also, because the cells are fine-grained, computation structures of arbitrary bit widths can be created. This can be useful for implementing datapath circuits that are based on data widths not implemented on the host processor (5-bit multiply, 18-bit addition, etc.). Reconfigurable hardware can not only take advantage of small bit widths, but also large data widths. When a program uses bit widths in excess of what is normally available in a host processor, the processor must perform the computations using a number of extra steps in order to handle the full data width. A fine-grained architecture would be able to implement the full bit width in a single step, without the fetching, decoding, and execution of additional instructions, as long as enough logic cells are available.
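The extra steps a fixed-word host must take can be made concrete. The sketch below assumes a hypothetical 8-bit host word purely for illustration; an 18-bit add then costs three chained add-with-carry steps, where a fine-grained fabric would build an 18-bit adder and perform the operation in one step:

```python
# Illustrative sketch: adding two 18-bit values on a hypothetical 8-bit
# host requires several instructions (one add-with-carry per word), whereas
# a fine-grained fabric could implement an 18-bit adder directly.

WORD = 8  # host word size in bits (assumed for illustration)

def wide_add(a, b, width=18):
    """Multi-step addition, one host-word chunk at a time, carry between steps."""
    mask = (1 << WORD) - 1
    result, carry, steps = 0, 0, 0
    for shift in range(0, width, WORD):
        chunk = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (chunk & mask) << shift
        carry = chunk >> WORD
        steps += 1
    return result & ((1 << width) - 1), steps

total, steps = wide_add(0x2FFFF, 0x00001)  # two 18-bit operands
```

The three steps here stand in for the extra fetch/decode/execute cycles the text describes.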

    A number of reconfigurable systems use a granularity of logic block that we categorize as medium-grained [Xilinx 1994; Hauser and Wawrzynek 1997; Haynes and Cheung 1998; Lucent 1998; Marshall et al. 1999]. For example, Garp [Hauser and Wawrzynek 1997] is designed to perform a number of different operations on up to four 2-bit inputs. Another medium-grained structure was designed specifically to be embedded inside of a general-purpose FPGA to implement multipliers of a configurable bit width [Haynes and Cheung 1998]. The logic block used in the multiplier FPGA is capable of implementing a 4×4 multiplication, or of being cascaded into larger structures. The CHESS architecture [Marshall et al. 1999] also operates on 4-bit values, with each of its cells acting as a 4-bit ALU. Medium-grained logic blocks may be used to implement datapath circuits of varying bit widths, similar to the fine-grained structures. However, with the ability to perform more complex operations on a greater number of inputs, this type of structure can be used efficiently to implement a wider variety of operations.


    Fig. 7. One cell in the RaPiD-I reconfigurable architecture [Ebeling et al. 1996]. The registers, RAM, ALUs, and multiplier all operate on 16-bit values. The multiplier outputs a 32-bit result, split into the high 16 bits and the low 16 bits. All routing lines shown are 16-bit wide busses. The short parallel lines on the busses represent configurable bus connectors.

    Very coarse-grained architectures are primarily intended for the implementation of word-width datapath circuits. Because the logic blocks used are optimized for large computations, they will perform these operations much more quickly (and consume less chip area) than a set of smaller cells connected to form the same type of structure. However, because their composition is static, they are unable to leverage optimizations in the size of operands. The RaPiD architecture [Ebeling et al. 1996], shown in Figure 7, and the Chameleon architecture [Chameleon 2000] are examples of this very coarse-grained type of design. Each of these architectures is composed of word-sized adders, multipliers, and registers. If only three 1-bit values are required, then the use of these architectures suffers an unnecessary area and speed overhead, as all of the bits in the full word size are computed. However, these coarse-grained architectures can be much more efficient than fine-grained architectures for implementing functions closer to their basic word size.

    An alternate form of a coarse-grained system is one in which the logic blocks are actually very small processors, potentially each with its own instruction memory and/or data values. The REMARC architecture [Miyamori and Olukotun 1998] is composed of an 8 × 8 array of 16-bit processors. Each of these processors uses its own instruction memory in conjunction with a global program counter. This style of architecture closely resembles a single-chip multiprocessor, although with much simpler component processors, because the system is intended to be coupled with a host processor. The RAW project [Moritz et al. 1998] is a further example of a reconfigurable architecture based on a multiprocessor design.

    The granularity of the FPGA also has a potential effect on the reconfiguration time of the device. This is an important issue for run-time reconfiguration, which is discussed in further depth in a later section. A fine-grained array has many configuration points to perform very small computations, and thus requires more data bits during configuration.
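A rough calculation makes this trade-off visible. The per-block bit counts below are invented purely for illustration (they come from no datasheet): covering the same 16-bit-wide, 64-operation datapath with 1-bit cells demands far more configuration data than with 16-bit ALU cells:

```python
# Back-of-the-envelope comparison of configuration data volume versus
# granularity. All bit counts are hypothetical, chosen only to show the
# direction of the effect.

def config_bits(num_blocks, bits_per_block):
    """Total configuration bits for an array of identical blocks."""
    return num_blocks * bits_per_block

# Hypothetical fabrics covering a 16-bit datapath, 64 operations wide:
fine   = config_bits(num_blocks=16 * 64, bits_per_block=16 + 8)  # 1-bit LUT cells: 16 LUT bits + 8 routing bits
coarse = config_bits(num_blocks=64,      bits_per_block=5 + 8)   # 16-bit ALU cells: 5 opcode bits + 8 routing bits
```

Under these assumed numbers the fine-grained fabric needs roughly thirty times the configuration data, and hence proportionally longer reconfiguration time at a fixed configuration bandwidth.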

    3.4. Heterogeneous Arrays

    In order to provide greater performance or flexibility in computation, some reconfigurable systems provide a heterogeneous structure, where the capabilities of the logic cells are not the same throughout the system. One use of heterogeneity in reconfigurable systems is to provide multiplier function blocks embedded within the reconfigurable hardware [Haynes and Cheung 1998; Chameleon 2000; Xilinx 2001]. Because multiplication is one of the more difficult computations to implement efficiently in a traditional FPGA structure, the custom multiplication hardware embedded within a reconfigurable array allows a system to perform even that function well.

    Another use of heterogeneous structures is to provide embedded memory blocks scattered throughout the reconfigurable hardware. This allows storage of frequently used data and variables, and allows for quick access to these values due to the proximity of the memory to the logic blocks that access it. Memory structures embedded into the reconfigurable fabric come in two forms. The first is simply the use of available LUTs as RAM structures, as can be done in the Xilinx 4000 series [Xilinx 1994] and Virtex [Xilinx 1999] FPGAs. Although making these very small blocks into a larger RAM structure introduces overhead to the memory system, it does provide local, variable-width memory structures.

    Some architectures include dedicated memory blocks within their array, such as the Xilinx Virtex series [Xilinx 1999, 2001] and Altera [Altera 1998] FPGAs, as well as the CS2000 RCP (reconfigurable communications processor) device from Chameleon Systems, Inc. [Chameleon 2000]. These memory blocks have greater performance in large sizes than similar-sized structures built from many small LUTs. While these structures are somewhat less flexible than the LUT-based memories, they can also provide some customization. For example, the Altera FLEX 10K FPGA [Altera 1998] provides embedded memories that have a limited total number of wires, but allow a trade-off between the number of address lines and the data bit width.

    When embedded memories are not used for data storage by a particular configuration, the area that they occupy does not necessarily have to be wasted. By using the address lines of the memory as function inputs and the values stored in the memory as function outputs, logical expressions of a large number of inputs can be emulated [Altera 1998; Cong and Xu 1998; Wilton 1998; Heile and Leaver 1999]. In fact, because there may be more than one value output from the memory on a read operation, the memory structure may be able to perform multiple different computations (one for each bit of data output), provided that all necessary inputs appear on the address lines. In this manner, the embedded RAM behaves the same as a very large LUT. Therefore, embedded memory allows a programmer or a synthesis tool to perform a trade-off between logic and memory usage in order to achieve higher area efficiency.
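This trick can be sketched directly: program each data-output bit of the memory with an independent function of the address inputs, and a single read then evaluates all of them at once. The helper below is an illustration of the idea, not any vendor's actual flow:

```python
# Sketch of an embedded memory used as logic: with the address lines as
# function inputs, each data-output bit of the RAM computes an independent
# function of the same inputs, evaluated by one read.

def program_ram_as_logic(num_inputs, functions):
    """Fill a 2^num_inputs-word memory; bit i of each word is functions[i](input bits)."""
    ram = []
    for addr in range(1 << num_inputs):
        bits = [(addr >> i) & 1 for i in range(num_inputs)]
        word = 0
        for i, f in enumerate(functions):
            word |= f(bits) << i
        ram.append(word)
    return ram

# Two different 8-input functions, both computed by a single read:
ram = program_ram_as_logic(8, [
    lambda b: b.count(1) % 2,        # bit 0: parity of the 8 inputs
    lambda b: int(b.count(1) > 4),   # bit 1: majority-of-eight threshold
])
```

An 8-input, 2-output memory used this way replaces what would otherwise be a tree of several 4-input LUTs per output.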

    Furthermore, a few of the commercial FPGA companies have announced plans to include entire microprocessors as embedded structures within their FPGAs. Altera has demonstrated a preliminary ARM9-based Excalibur device, which combines reconfigurable hardware with an embedded ARM9 processor core [Altera 2001]. Meanwhile, Xilinx is working with IBM to include a PowerPC processor core within the Virtex-II FPGA [Xilinx 2000]. By contrast, Adaptive Silicon's focus is to provide reconfigurable logic cores to customers for embedding in their own system-on-a-chip (SoC) devices [Adaptive 2001].

    3.5. Routing Resources

    Interconnect resources are provided in a reconfigurable architecture to connect together the device's programmable logic elements. These resources are usually configurable, where the path of a signal is determined at compile time or run time rather than fabrication time. This flexible interconnect between logic blocks or computational elements allows a wide variety of circuit structures, each with its own interconnect requirements, to be mapped to the reconfigurable hardware. For example, the routing for FPGAs is generally island-style, with logic surrounded by routing channels, which contain several wires, potentially of varying lengths. Within this type of routing architecture, however, there are still variations. Some of these differences include the ratio of wires to logic in the system, how long each of the wires should be, and whether they should be connected in a segmented or hierarchical manner.

    Fig. 8. Segmented (left) and hierarchical (right) routing structures. The white boxes are logic blocks, while the dark boxes are connection switches.

    A step in the design of efficient routing structures for FPGAs and reconfigurable systems therefore involves examining the logic vs. routing area trade-off within reconfigurable architectures. One group has argued that the interconnect should constitute a much higher proportion of area in order to allow for successful routing under high-logic-utilization conditions [Takahara et al. 1998]. However, for FPGAs, high LUT utilization may not necessarily be the most desirable situation; rather, efficient routing usage may be of more importance [DeHon 1999]. This is because the routing resources occupy a much larger part of the area of an FPGA than the logic resources, and therefore the most area-efficient designs will be those that optimize their use of the routing resources rather than the logic resources. The amount of required routing does not grow linearly with the amount of logic present; therefore, larger devices require even greater amounts of routing per logic block than small ones [Trimberger et al. 1997b].

    There are two primary methods to provide both local and global routing resources, as shown in Figure 8. The first is the use of segmented routing [Betz and Rose 1999; Chow et al. 1999a]. In segmented routing, short wires accommodate local communications traffic. These short wires can be connected together using switchboxes to emulate longer wires. Frequently, segmented routing structures also contain longer wires to allow signals to travel efficiently over long distances without passing through a great number of switches. Hierarchical routing [Aggarwal and Lewis 1994; Lai and Wang 1997; Tsu et al. 1999] is the second method to provide both local and global communication. Routing within a group (or cluster) of logic blocks is at the local level, only connecting within that cluster. At the boundaries of these clusters, however, longer wires connect the different clusters together. This is potentially repeated at a number of levels. The idea behind the use of hierarchical structures is that, provided a good placement has been made onto the hardware, most communication should be local and only a limited amount of communication will traverse long distances. Therefore, the wiring is designed to fit this model, with a greater number of local routing wires in a cluster than long-distance routing wires between clusters.

    Because routing can occupy a large part of the area of a reconfigurable device, the type of routing used must be carefully considered. If the wires available are much longer than what is required to route a signal, the excess wire length is wasted. On the other hand, if the wires available are much shorter than necessary, the signal must pass through switchboxes that connect the short wires together into a longer wire, or through levels of the routing hierarchy. This induces additional delay and slows the overall operation of the circuit. Furthermore, the switchbox circuitry occupies area that might be better used for additional logic or wires.

    Fig. 9. A traditional two-dimensional island-style routing structure (left) and a one-dimensional routing structure (right). The white boxes represent logic elements.

    There are a few alternatives to the island style of routing resources. Systems such as RaPiD [Ebeling et al. 1996] use segmented bus-based routing, where signals are a full word in width. This is most common in the one-dimensional type of architecture, as discussed in the next section.

    3.6. One-Dimensional Structures

    Most current FPGAs are of the two-dimensional variety, as shown in Figure 9. This allows for a great deal of flexibility, as any signal can be routed on a nearly arbitrary path. However, providing this level of routing flexibility requires a great deal of routing area. It also complicates the placement and routing software, as the software must consider a very large number of possibilities.

    One solution is to use a more one-dimensional style of architecture, also depicted in Figure 9. Here, placement is restricted along one axis. With a more limited set of choices, the placement can be performed much more quickly. Routing is also simplified, because it is generally along a single dimension as well, with the other dimension generally only used for calculations requiring a shift operation. One drawback of the one-dimensional routing is that if there are not enough routing resources in a particular area of a mapped circuit, routing that circuit actually becomes more difficult than on a two-dimensional array that provides more alternatives. A number of different reconfigurable systems have been designed in this manner. Both Garp [Hauser and Wawrzynek 1997] and Chimaera [Hauck et al. 1997] are structures that provide cells that compute a small number of bit positions, and a row of these cells together computes the full data word. A row can only be used by a single configuration, making these designs one-dimensional. In this manner, each configuration occupies some number of complete rows. Although multiple narrow-width computations can fit within a single row, these structures are optimized for word-based computations that occupy the entire row. The NAPA architecture [Rupp et al. 1998] is similar, with a full column of cells acting as the atomic unit for a configuration, as is PipeRench [Cadambi et al. 1998; Goldstein et al. 2000].

    In some systems, the computation blocks in a one-dimensional structure operate on word-width values instead of single bits. Therefore, busses are routed instead of individual values. This also decreases the time required for routing, as the bits of a bus can be considered together rather than as separate routes. As shown previously in Figure 7, RaPiD [Ebeling et al. 1996] is basically a one-dimensional design that only includes word-width processing elements. The different computation units are organized in a single dimension along the horizontal axis. The general flow of information follows this layout, with the major routing busses also laid out in a horizontal manner. Additionally, all routing is of word-sized values, and therefore all routing is of busses, not individual wires. A few vertical resources are included in the architecture to allow signals to transfer between busses, or to travel from a bus to a computation node. However, the majority of the routing in this architecture is one-dimensional.

    Fig. 10. Mesh (left) and partial crossbar (right) interconnect topologies for multi-FPGA systems.

    3.7. Multi-FPGA Systems

    Reconfigurable systems that are composed of multiple FPGA chips interconnected on a single processing board have additional hardware concerns over single-chip systems. In particular, there is a need for an efficient connection scheme between the chips, as well as to external memory and the system bus. This is to provide for circuits that are too large to fit within a single FPGA, but may be partitioned over the multiple FPGAs available. A number of different interconnection schemes have been explored [Butts and Batcheller 1991; Hauck et al. 1998a; Hauck 1998; Khalid 1999], including meshes and crossbars, as shown in Figure 10. A mesh connects the nearest neighbors in the array of FPGA chips. This allows for efficient communication between the neighbors, but may require that some signals pass through an FPGA simply to create a connection between non-neighbors. Although this is quite possible, it uses valuable I/O resources on the FPGA that forms the routing bridge. One system that uses a mesh topology with additional board-level column and row busses is the P1 system developed within the PAM project [Vuillemin et al. 1996]. This architecture uses a central array of 16 commercial FPGAs with connections to nearest neighbors. However, four 16-bit row busses and four 16-bit column busses run the length of the array and facilitate communication between non-neighbor FPGAs.

    A crossbar attempts to remove this problem by using special routing-only chips to connect each FPGA potentially to any other FPGA. The inter-chip delays are more uniform, given that a signal travels the exact same “distance” to get from one FPGA to another, regardless of where those FPGAs are located. However, a crossbar interconnect does not scale easily with an increase in the number of FPGAs. The crossbar pattern of the chips is fixed at fabrication of the multi-FPGA board. Variants on these two basic topologies attempt to remove some of the problems encountered in mesh and crossbar topologies [Arnold et al. 1992; Varghese et al. 1993; Buell et al. 1996; Vuillemin et al. 1996; Lewis et al. 1997; Khalid and Rose 1998]. One of these variants can be found in the Splash 2 system [Arnold et al. 1992; Buell et al. 1996]. Its predecessor, Splash 1, used a linear systolic communication method. This type of connection was found to work quite well for a variety of applications. However, this highly constrained communication model made some types of computations difficult or even impossible. Therefore, Splash 2 was designed to include not only the linear connections of Splash 1 that were found to be useful for many applications, but also a crossbar network to allow any FPGA to communicate with any other FPGA on the same board. For multi-FPGA systems, because of the need for efficient communication between the FPGAs, determining the inter-chip routing topology is a very important step in the design process. More details on multi-FPGA system architectures can be found elsewhere [Hauck 1998b; Khalid 1999].
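The hop-count difference between the two topologies of Figure 10 can be stated numerically. The functions below are an illustrative model only: mesh cost grows with the Manhattan distance between chips (intermediate FPGAs burn I/O pins just forwarding signals), while a full crossbar gives every pair the same short path through the routing chip:

```python
# Illustrative hop-count model for the two multi-FPGA topologies.

def mesh_hops(a, b):
    """Chip-to-chip hops between FPGAs at grid coordinates a and b (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def crossbar_hops(a, b):
    """Any distinct pair communicates through the crossbar chip: uniformly 2 hops."""
    return 2 if a != b else 0

corner_to_corner = mesh_hops((0, 0), (3, 3))  # across a 4x4 mesh
```

The uniformity of the crossbar is what makes its inter-chip delays predictable; the price, as the text notes, is a fixed pattern that does not scale with the FPGA count.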

    3.8. Hardware Summary

    The design of reconfigurable hardware varies wildly from system to system. The reconfigurable logic may be used as a configurable functional unit, or may be a multi-FPGA stand-alone unit. Within the reconfigurable logic itself, the complexity of the core computational units, or logic blocks, varies from very simple to extremely complex, some implementing a 4-bit ALU or even a 16 × 16 multiplication. These blocks are not required to be uniform throughout the array, as the use of different types of blocks can add high-performance functionality in the case of specialized computation circuitry, or expanded storage in the case of embedded memory blocks. Routing resources also offer a variety of choices, primarily in the amount, length, and organization of the wires. Systems have been developed that fit into many different points within this design space, and no true “best” system has yet been agreed upon.

    4. SOFTWARE

    Although reconfigurable hardware has been shown to have significant performance benefits for some applications, it may be ignored by application programmers unless they are able to easily incorporate its use into their systems. This requires a software design environment that aids in the creation of configurations for the reconfigurable hardware. This software can range from a software assist in manual circuit creation to a complete automated circuit design system. Manual circuit description is a powerful method for the creation of high-quality circuit designs. However, it requires a great deal of background knowledge of the particular reconfigurable system employed, as well as a significant amount of design time. On the other end of the spectrum, an automatic compilation system provides a quick and easy way to program for reconfigurable systems. It therefore makes the use of reconfigurable hardware more accessible to general application programmers, but quality may suffer.

    Fig. 11. Three possible design flows for algorithm implementation on a reconfigurable system. Grey stages indicate manual effort on the part of the designer, while white stages are done automatically. The dotted lines represent paths to improve the resulting circuit. It should be noted that the middle design cycle is only one of the possible compromises between automatic and manual design.

    Both for manual and automatic circuit creation, the design process proceeds through a number of distinct phases, as indicated in Figure 11. Circuit specification is the process of describing the functions that are to be placed on the reconfigurable hardware. This can be done as simply as by writing a program in C that represents the functionality of the algorithm to be implemented in hardware. On the other hand, this can also be as complex as specifying the inputs, outputs, and operation of each basic building block in the reconfigurable system. Between these two methods is the specification of the circuit using generic complex components, such as adders and multipliers, which will be mapped to the actual hardware later in the design process. For descriptions in a high-level language (HLL), such as C/C++ or Java, or ones using complex building blocks, this code must be compiled into a netlist of gate-level components. For the HLL implementations, this involves generating computational components to perform the arithmetic and logic operations within the program, and separate structures to handle the program control, such as loop iterations and branching operations. Given a structural description, either generated from an HLL or specified by the user, each complex structure is replaced with a network of the basic gates that perform that function.

    Fig. 12. A wide function implemented with multiple LUTs.

    Once a detailed gate- or element-level description of the circuit has been created, these structures must be translated to the actual logic elements of the reconfigurable hardware. This stage is known as technology mapping, and is dependent upon the exact target architecture. For a LUT-based architecture, this stage partitions the circuit into a number of small subfunctions, each of which can be mapped to a single LUT [Brown et al. 1992a; Abouzeid et al. 1993; Sangiovanni-Vincentelli et al. 1993; Hwang et al. 1994; Chang et al. 1996; Hauck and Agarwal 1996; Yi and Jhon 1996; Chowdhary and Hayes 1997; Lin et al. 1997; Cong and Wu 1998; Pan and Lin 1998; Togawa et al. 1998; Cong et al. 1999]. Some architectures, such as the Xilinx 4000 series [Xilinx 1994], contain multiple LUTs per logic cell. These LUTs can be used either separately to generate small functions, or together to generate some wider-input functions [Inuani and Saul 1997; Cong and Hwang 1998]. By taking advantage of multiple LUTs and the internal routing within a single logic cell, functions with more inputs than can be implemented using a single LUT can efficiently be mapped into the FPGA architecture. Figure 12 shows one example of a wide function mapped to a multi-LUT FPGA logic cell.
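One generic way such multi-LUT mappings work is Shannon expansion: a 5-input function splits into two 4-input cofactors, one per LUT, with the fifth input driving a 2:1 multiplexer. The sketch below demonstrates the decomposition; it is a standard general technique, not necessarily the exact structure drawn in Figure 12:

```python
# Mapping a 5-input function onto two 4-input LUTs plus a 2:1 multiplexer
# via Shannon expansion on the fifth input.

def shannon_split(f):
    """Split f(x0..x4) into two 4-input cofactor LUTs selected by x4."""
    lut_lo = [f(*[(a >> i) & 1 for i in range(4)], 0) for a in range(16)]
    lut_hi = [f(*[(a >> i) & 1 for i in range(4)], 1) for a in range(16)]
    def mapped(x0, x1, x2, x3, x4):
        addr = x0 | x1 << 1 | x2 << 2 | x3 << 3
        return lut_hi[addr] if x4 else lut_lo[addr]  # the 2:1 mux on x4
    return mapped

# Check the two-LUT implementation against an arbitrary 5-input function:
f = lambda a, b, c, d, e: (a & b) ^ (c | d) ^ e
g = shannon_split(f)
```

Applied recursively, the same expansion maps functions of any width onto a tree of fixed-size LUTs, which is essentially what the cited mapping algorithms optimize.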

    For reconfigurable structures that include embedded memory blocks, the mapping stage may also consider using these memories as logic units when they are not being used for data storage. The memories act as very large LUTs, where the number of inputs is equal to the number of address lines. In order to use these memories as logic, the mapping software must analyze how much of the memory blocks are actually used as storage in a given mapping. It must then determine which are available to implement logic, and what part or parts of the circuit are best mapped to the memory [Cong and Xu 1998; Wilton 1998].

    After the circuit has been mapped, the resulting blocks must be placed onto the reconfigurable hardware. Each of these blocks is assigned to a specific location within the hardware, hopefully close to the other logic blocks with which it communicates. As FPGA capacities increase, the placement phase of circuit mapping becomes more and more time consuming. Floorplanning is a technique that can be used to alleviate some of this cost. A floorplanning algorithm first partitions the logic cells into clusters, where cells with a large amount of communication are grouped together. These clusters are then placed as units onto regions of the reconfigurable hardware. Once this global placement is complete, the actual placement algorithm performs detailed placement of the individual logic blocks within the boundaries assigned to the cluster [Sankar and Rose 1999].

    The use of a floorplanning tool is par-ticularly helpful for situations where thecircuit structure being mapped is of a dat-apath type. Large computational compo-nents or macros that are found in datapathcircuits are frequently composed of highlyregular logic. These structures are placedas entire units, and their component cellsare restricted to the floorplanned location[Shi and Bhatia 1997; Emmert and Bhatia1999]. This encourages the placer to find avery regular placement of these logic cells,resulting in a higher performance layoutof the circuit. Another technique for themapping and placement of datapath ele-ments is to perform both of these stepssimultaneously [Callahan et al. 1998].



This method also exploits the regularity of the datapath elements to generate mappings and placements quickly and efficiently.

Floorplanning is also important when dealing with hierarchically structured reconfigurable designs. In these architectures, the available resources have been grouped by the logic or routing hierarchy of the hardware. Because performance is best when routing lengths are minimized, the cells to be placed should be grouped such that cells that require a great deal of communication, or which are on a critical path, are placed together within a logic cluster on the hardware [Krupnova et al. 1997; Senouci et al. 1998].

After floorplanning, the individual logic blocks are placed into specific logic cells. One commonly used algorithm is simulated annealing [Shahookar and Mazumder 1991; Betz and Rose 1997; Sankar and Rose 1999]. This method takes an initial placement of the system, which can be generated (pseudo-)randomly, and performs a series of "moves" on that layout. A move is simply the changing of the location of a single logic cell, or the exchanging of the locations of two logic cells. These moves are attempted one at a time using random target locations. If a move improves the layout, the layout is changed to reflect that move. If a move makes the layout worse, it is still accepted a small percentage of the time; accepting a few "bad" moves helps the algorithm avoid local minima in the placement space. Other algorithms exist that do not rely on random movements [Gehring and Ludwig 1996], although these search a smaller area of the placement space for a solution, and may therefore be unable to find a solution that meets performance requirements if a design uses a high percentage of the reconfigurable resources.
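The annealing loop described above can be sketched in Python. This is a toy illustration rather than any cited tool's placer: the half-perimeter wirelength cost, the cooling schedule, and all parameter values are invented for the example.

```python
import math
import random

def anneal_placement(nets, n_cells, grid, iters=2000, t0=5.0, alpha=0.999):
    """Toy simulated-annealing placer: cells occupy slots on a grid x grid
    array; cost is total half-perimeter wirelength over all nets."""
    slots = [(x, y) for x in range(grid) for y in range(grid)]
    random.shuffle(slots)
    loc = {c: slots[c] for c in range(n_cells)}    # initial random placement

    def cost():
        total = 0
        for net in nets:                           # half-perimeter bounding box
            xs = [loc[c][0] for c in net]
            ys = [loc[c][1] for c in net]
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
        return total

    cur = best = cost()
    t = t0
    for _ in range(iters):
        a, b = random.sample(range(n_cells), 2)
        loc[a], loc[b] = loc[b], loc[a]            # move: swap two cells
        new = cost()
        # always accept improving moves; accept worsening ones with
        # probability e^((cur - new) / t), which shrinks as t cools
        if new <= cur or random.random() < math.exp((cur - new) / t):
            cur = new
            best = min(best, cur)
        else:
            loc[a], loc[b] = loc[b], loc[a]        # undo rejected move
        t *= alpha
    return loc, best
```

A chain of four cells on a 2-by-2 grid, for instance, should settle near the minimum total wirelength of 3.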

Finally, the different reconfigurable components comprising the application circuit are connected during the routing stage. Particular signals are assigned to specific portions of the routing resources of the reconfigurable hardware. This can become difficult if the placement causes many connected components to be placed far from one another, as signals that travel long distances use more routing resources than those that travel shorter ones. A good placement is therefore essential to the routing process. One of the challenges in routing for FPGAs and reconfigurable systems is that the available routing resources are limited. In general hardware design, the goal is to minimize the number of routing tracks used in a channel between rows of computation units, but the channels can be made as wide as necessary. In reconfigurable systems, however, the number of available routing tracks is determined at fabrication time, and the routing software must therefore work within these boundaries. Thus, FPGA routing concentrates on minimizing congestion within the available tracks [Brown et al. 1992b; McMurchie and Ebeling 1995; Alexander and Robins 1996; Chan and Schlag 1997; Lee and Wu 1997; Thakur et al. 1997; Wu and Marek-Sadowska 1997; Swartz et al. 1998; Nam et al. 1999]. Because routing is one of the more time-intensive portions of the design cycle, it can be helpful to determine whether a placed circuit can be routed before actually performing the routing step. This quickly informs the designer if changes need to be made to the layout or if a larger reconfigurable structure is required [Wood and Rutenbar 1997; Swartz et al. 1998].
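The idea of routing within fixed resources can be illustrated with a minimal breadth-first maze router over a grid of routing nodes, each with a fixed capacity. This is a deliberate simplification invented for the example, not the negotiated-congestion schemes cited above; real routers rip up and retry rather than fail outright.

```python
from collections import deque

def route(grid_w, grid_h, nets, capacity=1):
    """Toy maze router: route two-terminal nets one at a time with BFS,
    treating each grid node as a routing resource of fixed capacity."""
    use = {}                                      # node -> signals using it
    paths = []
    for src, dst in nets:
        prev = {src: None}
        q = deque([src])
        while q:
            node = q.popleft()
            if node == dst:
                break
            x, y = node
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if (0 <= nxt[0] < grid_w and 0 <= nxt[1] < grid_h
                        and nxt not in prev
                        and (use.get(nxt, 0) < capacity or nxt == dst)):
                    prev[nxt] = node
                    q.append(nxt)
        if dst not in prev:
            return None                           # unroutable with these tracks
        path, node = [], dst
        while node is not None:                   # walk back along predecessors
            path.append(node)
            use[node] = use.get(node, 0) + 1
            node = prev[node]
        paths.append(path[::-1])
    return paths
```

With capacity 1, a second net must detour around nodes already claimed by the first, mirroring how limited tracks force congestion-aware choices.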

Each of the design phases mentioned above may be implemented either manually or automatically using compiler tools. The operation of some of these individual steps is described in greater depth in the following sections.

    4.1. Hardware-Software Partitioning

For systems that include both reconfigurable hardware and a traditional microprocessor, the program must first be partitioned into sections to be executed on the reconfigurable hardware and sections to be executed in software on the microprocessor. In general, complex control sequences such as variable-length loops are more efficiently implemented in software,



while fixed datapath operations may be more efficiently executed in hardware.

Most compilers presented for reconfigurable systems generate only the hardware configuration for the system, rather than both hardware and software. In some cases, this is because the reconfigurable hardware may not be coupled with a host processor, so only a hardware configuration is necessary. For cases where reconfigurable hardware does operate alongside a host microprocessor, some systems currently require that the hardware compilation be performed separately from the software compilation, and special functions are called from within the software in order to configure and control the reconfigurable hardware. However, this requires effort on the part of the designer to identify the sections that should be mapped to hardware, and to translate these into special hardware functions. In order to make the use of the reconfigurable hardware transparent to the designer, the partitioning and programming of the hardware should occur simultaneously in a single programming environment.

For compilers that manage both the hardware and software aspects of application design, the hardware/software partitioning can be performed either manually, or automatically by the compiler itself. When the partitioning is performed by the programmer, compiler directives are used to mark sections of program code for hardware compilation. The NAPA C language [Gokhale and Stone 1998] provides pragma statements to allow a programmer to specify whether a section of code is to be executed in software on the Fixed Instruction Processor (FIP), or in hardware on the Adaptive Logic Processor (ALP). Cardoso and Neto [1999] present another compiler that requires the user to specify (using information gained through the use of profiling tools) which areas of code to map to the reconfigurable hardware.

Alternately, the hardware/software partitioning can be done automatically [Chichkov and Almeida 1997; Kress et al. 1997; Callahan et al. 2000; Li et al. 2000a]. In this case, the compiler uses cost functions based upon the amount of acceleration gained through the execution of a code fragment in hardware to determine whether the cost of configuration is overcome by the benefits of hardware execution.
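The shape of such a cost function can be sketched very simply: hardware wins when the accumulated software time exceeds the hardware execution time plus the one-time configuration overhead. The function name and all the numbers below are invented for illustration, not drawn from any cited system.

```python
def worth_mapping(sw_time_per_call, hw_time_per_call, calls, config_time):
    """Decide whether mapping a code fragment to hardware pays off:
    total software time must exceed hardware execution time plus the
    one-time configuration overhead (all times in seconds)."""
    sw_total = sw_time_per_call * calls
    hw_total = hw_time_per_call * calls + config_time
    return hw_total < sw_total

# a kernel called 1000 times: 50 us in software, 2 us in hardware,
# with 10 ms needed to load the configuration
assert worth_mapping(50e-6, 2e-6, 1000, 10e-3)
```

The same kernel called only 100 times would not amortize the configuration cost, so the compiler would leave it in software.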

    4.2. Circuit Specification

In order to use the reconfigurable hardware, designers must somehow be able to specify the operation of their custom circuits. Before high-level compilation tools are developed for a specific reconfigurable system, this is done through hand mapping of the circuit, where the designer specifies the operation of the components in the configurable system directly. Here, the designers utilize the basic building blocks of the reconfigurable system to create the desired circuit. Because of the great amount of time involved in manual circuit creation, this style of circuit specification is primarily useful only when a software front-end for circuit design is unavailable, or for the design of small circuits or circuits with very high performance requirements. However, for circuits that can reasonably be hand mapped, this approach potentially provides the smallest and fastest implementation.

Because not all designers can be intimately familiar with every reconfigurable architecture, some design tools abstract the specifics of the target architecture. Creating a circuit using a structural design language involves describing a circuit using building blocks such as gates, flip-flops, and latches [Bellows and Hutchings 1998; Gehring and Ludwig 1998; Hutchings et al. 1999]. The compiler then maps these modules to one or more basic components of the architecture of the reconfigurable system. Structural VHDL is one example of this type of programming, and commercial tools are available for compiling from this language into vendor-specific FPGAs [Synplicity 1999].

However, these two methods require that the designer possess either an intimate knowledge of the targeted reconfigurable hardware, or at least a working knowledge of the concepts involved



in hardware design. In order to allow a greater number of software developers to take advantage of reconfigurable computing, tools that allow for behavioral circuit descriptions are being developed. These systems trade some area and performance quality for greater flexibility and ease of use.

Behavioral circuit design is similar to software design because the designer indicates the steps a hardware subsystem must go through in order to perform the desired computation, rather than the actual composition of the circuit. These behavioral descriptions can be either in a generic hardware description language such as VHDL or Verilog, or in a general-purpose high-level language such as C/C++ or Java. The eventual goal of this type of compilation is to allow users to write programs in commonly used languages that compile equally well, without modification, to both a traditional software executable and to an executable which leverages reconfigurable hardware.

Working in this direction, Transmogrifier C [Galloway 1995] allows a subset of the C language to be used to describe hardware circuits. While multiplication, division, pointers, arrays, and a few other C language specifics are not supported, this system provides a behavioral method of circuit description using a primitive form of the C language. Similarly, the C++ programming environment used for the P1 system [Vuillemin et al. 1996] provides a hybrid method of description, using a combination of behavioral and structural design. Synopsys' CoCentric compiler [Synopsys 2000], which can be targeted to the Xilinx Virtex series of FPGAs, uses SystemC to provide for behavioral compilation of C/C++ with the assistance of a set of additional hardware-defining classes. Other compilers, such as Nimble [Li et al. 2000a] and the Garp compiler [Callahan et al. 2000], are fully behavioral C compilers, handling the full ANSI C language.

Although behavioral description, and HLL description in particular, provides a convenient method for the programming of reconfigurable systems, it does

suffer from the drawback that it tends to produce larger and slower designs than those generated by a structural description or hand mapping. Behavioral descriptions can leave many aspects of the circuit unspecified. For example, a compiler that encounters a while loop must generate complicated control structures in order to allow for an unspecified number of iterations. Also, in many HLL implementations, optimizations based upon the bit width of operands cannot be performed. The compiler is generally unaware of any application-specific limitations on the operand size; it only sees the programmer's choice of data format in the program. Problems such as these might be solved through additional programmer effort to replace while loops with for loops whenever possible, and to use compiler directives to indicate the exact sizes of operands [Galloway 1995; Gokhale and Stone 1998]. This method of hardware design falls between structural description and behavioral description in complexity: although the programmers do not need to know a great deal about hardware design, they are required to follow additional guidelines that are not required for software-only implementations.
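The while-to-for rewrite can be illustrated in a language-neutral way (Python here, standing in for the C-level transformation; the function and the `MAX_ITER` bound are invented for the example, not tied to Transmogrifier C or NAPA C syntax):

```python
def bounded_isqrt(n, MAX_ITER=16):
    """Hardware-friendly rewrite of integer square root: the data-dependent
    loop `while (r + 1) * (r + 1) <= n: r += 1` becomes a fixed-trip-count
    for loop, which a hardware compiler can unroll into a circuit of known
    depth. MAX_ITER also bounds the result's operand width."""
    r = 0
    for _ in range(MAX_ITER):            # fixed bound replaces unbounded while
        if (r + 1) * (r + 1) <= n:
            r += 1
    return r

assert bounded_isqrt(10) == 3
```

The extra iterations when the condition is already false cost nothing in hardware, since the unrolled stages exist either way; what matters is that the iteration count is known at compile time.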

    4.3. Circuit Libraries

The use of circuit or macro libraries can greatly simplify and speed the design process. By predesigning commonly used structures such as adders, multipliers, and counters, circuit creation for configurable systems becomes largely the assembly of high-level components, and only application-specific structures require detailed design. The actual architecture of the reconfigurable device can be abstracted, provided only library components are used, as these low-level details will already have been encapsulated within the library structures. Although the users of the circuit library may not know the intricacies of the destination architecture, they are still able to make use of architecture-specific optimizations, such as specialized carry



chains. This is because the components within a circuit library are created by designers very familiar with the details of the target architecture, who can take advantage of architecture specifics to make these components faster and smaller than a designer unfamiliar with the architecture likely could. An added benefit of the architecture abstraction is that the use of library components can also facilitate design migration from one architecture to another, because designers are not required to learn a new architecture, but only to indicate the new target for the library components. However, this does require that the circuit library contain implementations for more than one architecture.

One method for using library components is to simply instantiate them within an HDL design [Xilinx 1997; Altera 1999]. However, circuit libraries can also be used in general-language compilers by comparing the dataflow graph of the application to the dataflow graphs of the library macros [Cadambi and Goldstein 1999]. If a dataflow representation of a macro matches a portion of the application graph, the corresponding macro is used for that part of the configuration.
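The matching idea can be sketched as follows. This is a deliberately simplified version that matches a linearized operation sequence rather than a true dataflow graph, and the operation names are invented for the example:

```python
def match_macro(app_ops, macro_ops):
    """Naive macro matching: report every position where the macro's
    operation sequence appears as a contiguous run in the application's
    linearized op list. Real systems match subgraphs, not sequences."""
    n, m = len(app_ops), len(macro_ops)
    return [i for i in range(n - m + 1) if app_ops[i:i + m] == macro_ops]

# a multiply-accumulate macro found inside a load/compute/store chain
assert match_macro(["ld", "mul", "add", "st"], ["mul", "add"]) == [1]
```

Each reported position is a candidate site where the premapped macro can replace the matched operations in the configuration.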

Another benefit of circuit design with library macros is fast compilation. Because the library structures may have been premapped, preplaced, and prerouted (at least within the macro boundaries), the actual compile time is reduced to the time required to place the library components and route between them. For example, fast configuration was one of the main motivations for the creation of libraries for circuit design in the DISC reconfigurable image processing system [Hutchings 1997].

    4.4. Circuit Generators

Circuit generators fulfill a role similar to circuit libraries, in that they provide optimized high-level structures for use within larger applications. Again, designers are not required to understand the low-level details of particular architectures. However, circuit generators create semicustomized high-level structures automatically at compile time, as opposed to circuit libraries, which only provide static structures. For example, a circuit generator can create an adder structure of the exact bit width required by the designer, whereas a circuit library is likely to contain a limited number of adder structures, none of which may be of the correct size. Circuit generators are therefore more flexible than circuit libraries because of the customization allowed.
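The adder example above can be sketched as a tiny generator. The netlist syntax and component names here are invented for illustration; real generators such as PAM-Blox emit vendor-specific primitives rather than text:

```python
def ripple_adder_netlist(width):
    """Toy circuit generator: emit a structural netlist for a ripple-carry
    adder of the requested bit width, one full-adder instance per bit."""
    netlist = []
    carry = "cin"
    for i in range(width):
        # each full adder consumes the previous stage's carry-out
        netlist.append(
            f"FA{i}: sum[{i}], c{i + 1} = fulladd(a[{i}], b[{i}], {carry})")
        carry = f"c{i + 1}"
    netlist.append(f"cout = {carry}")
    return netlist
```

Calling `ripple_adder_netlist(13)` yields a 13-bit adder exactly sized to the application, where a static library might offer only 8-, 16-, and 32-bit variants.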

Some circuit generators, such as MacGen [Yasar et al. 1996], are executed at the command line using custom description files to generate physical design layout data files. Newer circuit generators, however, are functions or methods called from high-level language programs. PAM-Blox [Mencer et al. 1998], for example, is a set of circuit generators executed in C++ that generate structures for use with the PCI Pamette reconfigurable processing board. The circuit generator presented by Chu et al. [1998] contains a number of Java classes that allow a programmer to generate arbitrarily sized arithmetic and logical components for a circuit. Although the examples presented in that paper were mapped to a Xilinx 4000 series FPGA, the generator uses architecture-specific libraries for module generation; the target architecture can therefore be changed through the use of a different design library. The carry-lookahead circuit generator described by Stohmann and Barke [1996] is also retargetable, because it maps to an FPGA logic cell architecture defined by the user.

One drawback of circuit generators is that they depend on a regular logic and routing structure. Hierarchical routing structures (such as those present in the Xilinx 6200 series [Xilinx 1996]) and specialized heterogeneous logic blocks are frequently not accounted for, so some optimized features of a particular architecture may go unused. For these cases, a circuit macro from a library may provide a more highly optimized structure than one created with a circuit generator,



provided that the library macro fits the needs of the application.

    4.5. Partial Evaluation

Functions that are to be implemented on the reconfigurable array should occupy as little area as possible, so as to maximize the number of functions that can be mapped to the hardware. This, combined with the minimization of the delay incurred by each circuit, increases the overall acceleration of the application. Partial evaluation is the process of reducing the hardware requirements of a circuit structure through optimization based upon known static inputs. Specifically, if an input is known to be constant, that value can potentially be propagated through one or more gates in the structure at compile time, and only the portions of the circuit that depend on time-varying inputs need to be mapped to the reconfigurable structure.

One example of the usefulness of this operation is constant-coefficient multipliers. If one input to a multiplier is constant, the multiplier can be reduced from a general-purpose multiplier to a set of additions with static-length shifts between them, corresponding to the locations of 1s in the binary constant. This reduction lowers the area requirement of the circuit, and potentially improves performance because fewer gate delays are encountered on the critical path. Partial evaluation can also be performed in conjunction with circuit generation, where the constants passed to the generator function are used to simplify the created hardware circuit [Wang and Lewis 1997; Chu et al. 1998]. Other examples of this type of optimization for specific algorithms include the partial evaluation of DES encryption circuits [Leonard and Mangione-Smith 1997], and the partial evaluation of constant multipliers and fixed polynomial division circuits [Payne 1997].
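The constant-coefficient reduction can be shown concretely. The sketch below (invented for this example) extracts the set-bit positions of the constant, each of which becomes a static shift feeding an adder, so the general-purpose multiplier disappears:

```python
def const_mult_shift_adds(constant):
    """Partially evaluate y = constant * x: the positions of 1s in the
    binary constant become static shift amounts whose shifted copies of x
    are summed, replacing a general-purpose multiplier."""
    shifts = [i for i in range(constant.bit_length()) if (constant >> i) & 1]

    def multiply(x):
        # in hardware this is a tree of adders over hardwired shifts
        return sum(x << s for s in shifts)

    return shifts, multiply

shifts, mul10 = const_mult_shift_adds(10)    # 10 = 0b1010
assert shifts == [1, 3]                      # y = (x << 1) + (x << 3)
assert mul10(7) == 70
```

For the constant 10 only two adder inputs remain, which is both smaller and shallower than a full array multiplier.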

    4.6. Memory Allocation

As with traditional software programs, it may be necessary in reconfigurable computing to allocate memories to hold variables and other data. Off-chip memories may be added to the reconfigurable system. Alternately, if a reconfigurable system includes memory blocks embedded in the reconfigurable logic, these may be used, provided that the storage requirements do not surpass the available embedded memory. If multiple off-chip memories are available to a reconfigurable system, variables used in parallel should be placed into different memory structures so that they can be accessed simultaneously [Gokhale and Stone 1999]. When smaller embedded memory units are used, larger memories can be created from the smaller ones. In this case, however, it is desirable to ensure that each smaller memory is close to the computation that most requires its contents [Babb et al. 1999]. As mentioned earlier, the small embedded memories that are not allocated for data storage may be used to perform logic functions.
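The parallel-access constraint can be sketched as a greedy bank assignment: variables read in the same cycle must land in different banks. This is an invented illustration (it assumes a conflict-free assignment exists under the greedy order, and does not backtrack as a real allocator might):

```python
def assign_banks(parallel_groups, n_banks):
    """Greedy memory allocation: each inner list holds variables accessed
    in the same cycle, so its members must occupy distinct banks for the
    accesses to proceed simultaneously."""
    bank_of = {}
    for group in parallel_groups:
        # banks already claimed by previously assigned members of the group
        used = {bank_of[v] for v in group if v in bank_of}
        for v in group:
            if v not in bank_of:
                free = next(b for b in range(n_banks) if b not in used)
                bank_of[v] = free
                used.add(free)
    return bank_of
```

With two banks, variables `a` and `b` read together go to different banks, and `c` (read alongside `b`) is steered away from `b`'s bank.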

    4.7. Parallelization

One of the benefits of reconfigurable computing is the ability to execute multiple operations in parallel. In cases where circuits are specified using a structural hardware description language, the user specifies all structures and timing, and therefore either implicitly or explicitly specifies any parallel operation. For behavioral and HLL descriptions, however, there are two methods to incorporate parallelism: manual parallelization through special instructions or compiler directives, and automatic parallelization by the compiler.

To manually incorporate parallelism within an application, the programmer can specifically mark sections of code that should run as parallel threads, using operations similar to those found in traditional parallel compilers [Cronquist et al. 1998; Gokhale and Stone 1998]. For example, a signal/wait technique can be used to synchronize the different threads of the computation. The RaPiD-B language [Cronquist et al. 1998] is one that uses this methodology.



Although the NAPA C compiler [Gokhale and Stone 1998] requires programmers to mark the areas of code where the host processor and the reconfigurable hardware execute in parallel, it also detects and exploits fine-grained parallelism within computations destined for the reconfigurable hardware.

Automatic parallelization of inner loops is another common technique in reconfigurable hardware compilers to attempt to maximize the use of the reconfigurable hardware. The compiler selects the innermost loop level to be completely unrolled for parallel execution in hardware, potentially creating a heavily pipelined structure [Cronquist et al. 1998; Weinhardt and Luk 1999]. In these cases, outer loops may not have multiple iterations executing simultaneously; any loop reordering to improve the parallelism of the circuit must be done by the programmer. However, some compiler systems have taken this procedure a step further and focus on the parallelization of all loops within the program, not just the inner loops [Wang and Lewis 1997; Budiu and Goldstein 1999]. This type of compiler generates a control flow graph based upon the entire program source code. Loop unrolling is used to increase the available parallelism, and the graph is then used to schedule parallel operations in the hardware.
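What full unrolling buys can be seen in a four-element dot product (a generic sketch, not any cited compiler's output): the loop body's multiplies become independent operators that can run in the same cycle, feeding a balanced adder tree.

```python
def unroll_inner(a, b):
    """Full unroll of a fixed-trip-count inner loop. The original form is
        acc = 0
        for i in range(4):
            acc += a[i] * b[i]
    After unrolling, the four multiplies are independent (parallel
    hardware operators) and the sums form a two-level adder tree instead
    of a four-deep sequential accumulation."""
    p0, p1, p2, p3 = a[0] * b[0], a[1] * b[1], a[2] * b[2], a[3] * b[3]
    return (p0 + p1) + (p2 + p3)

assert unroll_inner([1, 2, 3, 4], [5, 6, 7, 8]) == 70
```

In software the two forms are equivalent; in hardware the unrolled form trades four multiplier instances for a critical path of one multiply and two adds.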

    4.8. Multi-FPGA System Software

When reconfigurable systems use more than one FPGA to form the complete reconfigurable hardware, there are additional compilation issues to deal with [Hauck and Agarwal 1996]. The design must first be partitioned into the different FPGA chips [Hauck 1995; Acock and Dimond 1997; Vahid 1997; Brasen and Saucier 1998; Khalid 1999]. This is generally done by placing each highly connected portion of the circuit into a single chip. Multi-FPGA systems have a limited number of I/O pins connecting the chips together, and therefore their use must be minimized in the overall circuit mapping. Also, by minimizing the amount of routing

required between the FPGAs, the number of paths with a high (inter-chip) delay is reduced, and the circuit may have an overall higher performance. Similarly, sections of the circuit that require a short delay time must be placed on the same chip. Global placement then determines which of the actual FPGAs in the multi-FPGA system will contain each of the partitions.

After the circuit has been partitioned into the different FPGA chips, the connections between the chips must be routed [Mak and Wong 1997; Ejnioui and Ranganathan 1999]. A global routing algorithm determines at a high level the connections between the FPGA chips. It first selects a region of output pins on the source FPGA for a given signal, and determines which (if any) routing switches or additional FPGAs the signal must pass through to reach the destination FPGA. Detailed routing and pin assignment [Slimane-Kade et al. 1994; Hauck and Borriello 1997; Mak and Wong 1997; Ejnioui and Ranganathan 1999] are then used to assign signals to traces on an existing multi-FPGA board, or to create traces for a multi-FPGA board that is to be created specifically to implement the given circuit.

Because multi-FPGA systems use inter-chip connections to allow the circuit partitions to communicate, they frequently require a higher proportion of I/O resources to logic in each chip than is normally required in single-FPGA use. For this reason, some research has focused on methods to allow pins of the FPGAs to be reused for multiple signals. This procedure is referred to as Virtual Wires [Babb et al. 1993; Agarwal 1995; Selvidge et al. 1995], and allows for a flexible trade-off between logic and I/O within a given multi-FPGA system. Signals are multiplexed onto a single wire using multiple virtual clock cycles, one per multiplexed signal, within a user clock cycle, thus pipelining the communication. In this manner, the I/O requirements of a circuit can be reduced, while the logic requirements are increased because of the added multiplexing circuitry.
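The time-multiplexing scheme can be modeled in a few lines (a behavioral sketch invented for this example, ignoring the real scheme's shift-register hardware and inter-chip timing):

```python
def virtual_wire_cycle(signals, n_virtual_clocks):
    """Virtual Wires sketch: several inter-chip signals share one physical
    pin, one signal per virtual clock slot within a single user clock
    cycle; the receiver demultiplexes by slot position."""
    assert len(signals) <= n_virtual_clocks, "not enough slots in user cycle"
    # sender side: drive one signal value onto the pin per virtual clock
    wire_trace = [signals[v] if v < len(signals) else 0
                  for v in range(n_virtual_clocks)]
    # receiver side: latch each slot back into its own internal signal
    received = wire_trace[:len(signals)]
    return wire_trace, received
```

Three logical signals thus cross the chip boundary on one pin in one user cycle, at the cost of the (modeled-away) multiplexing logic and a user clock that is several virtual clocks long.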



    4.9. Design Testing

After compilation, an application needs to be tested for correct operation before deployment. For hardware configurations that have been generated from behavioral descriptions, this is similar to the debugging of a software application. However, structurally and manually created circuits must be simulated and debugged with techniques based upon those from the design of general hardware circuits. For these structures, simulation and debugging are critical not only to ensure proper circuit operation, but also to prevent possible incorrect connections from causing a short within the circuit, which can damage the reconfigurable hardware.

There are several different methods of observing the behavior of a configuration during simulation. The contents of memory structures within the design can be viewed, modified, or saved. This allows on-the-fly customization of the simulated execution environment of the reconfigurable hardware, as well as a method for examining the computation results. The input and output values of circuit structures and substructures can also be viewed either on a generated schematic drawing or with a traditional waveform output. By examining these values, the operation of the circuit can be verified for correctness, and conflicts on individual wires can be seen. A number of simulation and debugging software systems have been developed that use some or all of these techniques [Arnold et al. 1992; Buell et al. 1996; Gehring and Ludwig 1996; Lysaght and Stockwood 1996; Bellows and Hutchings 1998; Hutchings et al. 1999; McKay and Singh 1999; Vasilko and Cabanis 1999].

    4.10. Software Summary

Reconfigurable hardware systems require software compilation tools to allow programmers to harness the benefits of reconfigurable computing. At one end of the spectrum, circuits for reconfigurable systems can be designed manually, leveraging all application-specific and

architecture-specific optimizations available to generate a high-performance application. However, this requires a great deal of time and effort on the part of the designer. At the opposite end of the spectrum is fully automatic compilation of a high-level language. Using the automatic tools, a software programmer can transparently utilize the reconfigurable hardware without the need for direct intervention. The circuits created using this method, while quickly and easily created, are generally larger and slower than manually created versions. The actual tools available for compilation onto reconfigurable systems fall at various points within this range; many are partially automated but require some amount of manual aid. Circuit designers for reconfigurable systems therefore face a trade-off between the ease of design and the quality of the final layout.

    5. RUN-TIME RECONFIGURATION

Frequently, the areas of a program that can be accelerated through the use of reconfigurable hardware are too numerous or complex to be loaded simultaneously onto the available hardware. For these cases, it is beneficial to be able to swap different configurations in and out of the reconfigurable hardware as they are needed during program execution (Figur