
Zippy - A coarse-grained reconfigurable array with support for hardware virtualization

Christian Plessl
Computer Engineering and Networks Lab
ETH Zürich, Switzerland
[email protected]

Marco Platzner
Department of Computer Science
University of Paderborn, Germany
[email protected]

Abstract

This paper motivates the use of hardware virtualization on coarse-grained reconfigurable architectures. We introduce Zippy, a coarse-grained multi-context hybrid CPU with architectural support for efficient hardware virtualization. The architectural details and the corresponding tool flow are outlined. As a case study, we compare the non-virtualized and the virtualized execution of an ADPCM decoder.

1 Introduction and Related Work

Coarse-grained reconfigurable architectures consist of ALU-based cell arrays and bus-based interconnects. Coarse-grained devices excel in area and energy efficiency for applications that require many arithmetic and logical operations on byte- and word-sized data. In recent years, many coarse-grained architectures have been proposed [14, 16, 12, 1]. Reconfigurable architectures arrange operations spatially, whereas processors arrange operations (mainly) in time. While processors can compute arbitrarily large applications, provided that sufficient memory exists, reconfigurable architectures run into a fundamental problem when an application exceeds the array size. Hardware virtualization denotes a set of techniques that try to overcome this limitation by utilizing the reconfigurability of reconfigurable devices. Previously, we have classified hardware virtualization approaches into three categories [13]: temporal partitioning, virtualized execution, and virtual machine:

• Temporal partitioning The first use of the term virtual hardware pointed to the analogy to virtual memory. In a virtual memory system, memory pages are swapped in and out of main memory, giving applications the illusion of a much larger address space than physically available. Similarly, reconfigurable computing systems can allow applications to use more hardware than physically available by swapping portions of the hardware in and out through a reconfiguration process. This concept leads to temporal partitioning, which allows an application of arbitrary size to be mapped to a reconfigurable device with insufficient hardware capacity. Temporal partitioning splits applications into smaller parts, where each part fits on the device, and runs these parts sequentially. Related work in architectures for hardware virtualization includes the time-multiplexed FPGA [15], the DPGA architecture [4], the Chameleon architecture [16], and the DRLE architecture [8].

• Virtualized execution Later, the term hardware virtualization was used in a slightly different way to describe tools and architectures that allow for a certain degree of independence between the synthesized application and the actual capacity of the reconfigurable device. This approach stresses the analogy to instruction-set processors, where the definition of an instruction set decouples applications from the actual hardware. Similarly, virtualized execution strives for a certain level of compatibility within a reconfigurable device family. This compatibility is achieved by defining a programming model and an atomic unit of computation. Applications are split into modules, where each module fits onto the atomic operator. The programming model specifies the coordination and communication between modules. A concrete instance of the execution architecture can implement any number of atomic operators. The execution order of the modules is determined either at run-time or at load-time. Examples of virtualized execution architectures are PipeRench [9], SCORE [3], and WASMI [8]. We have also investigated virtualized execution on the Zippy architecture in our previous work [6, 7].

• Virtual machine The virtual machine approach tries to achieve an even higher level of device independence by following the concepts of virtual machines and portable code. The application is specified for an abstract execution architecture, and a hardware virtual machine is responsible for remapping this specification from the abstract architecture model to the concrete execution architecture. Research in hardware virtual machines has been reported in [11, 10]. Although this approach seems intriguing since it promises platform-independent hardware, no efficient implementation of a hardware virtual machine has been demonstrated so far.

The Zippy project aims at the investigation of a hybrid processor consisting of an embedded CPU and a coarse-grained reconfigurable array. Hardware virtualization is a central aspect of the Zippy project. Previously, we have described the Zippy architecture with its basic support tools, featuring a system-level, cycle-accurate simulation environment [7, 6, 5].

In contrast, this paper focuses on the architectural features for hardware virtualization, especially for the temporal partitioning approach. Up to now, work in temporal partitioning has concentrated on fine-grained architectures with netlists or data-flow graphs as application models. For coarse-grained architectures, temporal partitioning of netlists has hardly been treated so far. Most studies assume that applications are specified in a high-level language and that only runtime-intensive loops are mapped to the coarse-grained architecture. This leads to a different set of problems, more related to parallelizing compilation technology, e.g., loop unrolling, vectorization, and modulo scheduling. We argue that efficient hardware virtualization requires the simultaneous development of the architecture and the corresponding tool flow, rather than building virtualization tools on top of an existing or fully specified architecture. The Zippy framework with its parameterizable models and tools forms an ideal environment for such an architecture-toolflow codesign. In Section 2, the Zippy architecture and the corresponding tool flow are presented. Section 3 discusses a case study demonstrating the benefits of temporal partitioning. Finally, Section 4 concludes the paper and points to further work.

2 The Zippy Architecture

The Zippy architecture is not a single, concrete architecture but an architectural simulation model of a hybrid CPU. The model integrates an embedded CPU core with a coarse-grained reconfigurable unit and can be widely parameterized to resemble whole families of hybrid CPUs. Zippy was created to provide an experimentation framework for hybrid CPUs and, specifically, hardware virtualization. Zippy architectures are modeled at a level of detail that is sufficient for system-wide, cycle-accurate simulation. Besides the simulation tools, Zippy includes a tool-chain to generate software and hardware executables. Along with our previous work on virtualized execution, the Zippy system architecture and the co-simulation framework have been presented in detail [7, 5, 6]. For our current work on temporal partitioning on coarse-grained arrays, the Zippy architecture has been extended. Hence, the following section concentrates on the changes to the architecture, in particular to the reconfigurable cell, that have been introduced to support temporal partitioning. Finally, the supporting software tools are briefly discussed.

Figure 1. Reconfigurable Unit Architecture (coprocessor interface, register interface, context sequencer, configuration memory, FIFOs, and the coarse-grained reconfigurable array)

2.1 System and Reconfigurable Unit Architecture

Zippy is composed of two main units, the CPU core and the reconfigurable unit (RU), attached via a coprocessor port. The coprocessor port is used for all data transfers between the CPU and the RU and exposes all RU functions to the CPU via read and write operations on the RU's register interface.

Figure 1 shows a schematic diagram of the reconfigurable unit. Zippy is a multi-context architecture, i.e., several configurations can be stored concurrently in the configuration memory. The RU can switch rapidly between these configurations. The activation and sequencing of configurations is controlled by the context sequencer. The FIFOs are used to pass input data and results between the CPU core and the reconfigurable array, and also between different configurations (contexts) of the RU. The register interface provides the CPU with access to the RU function blocks.

The reconfigurable array is the workhorse of the Zippy architecture. The array is built of computing cells, memory blocks, input and output ports, and an interconnection network. The interconnect splits into two substructures, a bus interconnect and a local interconnect structure. As with many parts of the Zippy architecture, the reconfigurable array is parameterized. Figure 2 shows the programmable bus interconnect of a 4x4 cell instance of the Zippy array. Programmable routing switches are indicated by small bus-driver symbols at the crossings of wires. There are three types of horizontal buses: the horizontal north buses (hbus_n), which connect cells in adjacent rows; the horizontal south buses (hbus_s), which connect cells in the same row; and the memory buses (hbus_mem), which connect all cells in a row to an on-chip memory block. Additionally, the vertical east buses (vbus_e) provide connectivity between the cells in the same column. For the following discussion we consider an array instance with 4x4 cells, 2 horizontal north and south buses, 2 vertical east buses, and 2 input and 2 output ports. The bit-width of the ALUs and buses is configured to 24 bits.

Figure 2. Reconfigurable Array: Bus Interconnect (memory blocks MEM0-MEM3, input ports INP0/INP1, output ports OUTP0/OUTP1, and the hbus_n, hbus_s, hbus_mem, and vbus_e buses)
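As an illustration of this parameterization, the following C sketch records the array-instance parameters used in this paper as a plain configuration record. The type and field names are ours and purely illustrative; whether the bus counts are per row/column and the single memory bus per row are assumptions read from the description above, not taken from the Zippy tool sources.

```c
/* Illustrative only: a hypothetical record of the array-instance parameters
 * used in this paper. Names are ours, not from the Zippy tools. */
typedef struct {
    int rows, cols;          /* array size in cells                        */
    int hbus_n, hbus_s;      /* horizontal north / south buses (assumed per row) */
    int hbus_mem;            /* memory buses per row (assumption: one)     */
    int vbus_e;              /* vertical east buses (assumed per column)   */
    int in_ports, out_ports; /* input / output ports of the array          */
    int data_width;          /* bit-width of ALUs and buses                */
} zippy_array_params_t;

/* the 4x4 instance discussed in Section 2.1 */
static const zippy_array_params_t zippy_4x4 = {
    .rows = 4, .cols = 4,
    .hbus_n = 2, .hbus_s = 2, .hbus_mem = 1, .vbus_e = 2,
    .in_ports = 2, .out_ports = 2,
    .data_width = 24
};
```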

In addition to the bus interconnect, the array also provides local interconnect between neighboring cells. Each cell can read the output data of all of its 8 immediate neighbors. The local interconnect is fully homogeneous thanks to cyclic continuation at the edges of the array.
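Cyclic continuation means that neighbor addresses simply wrap around at the array borders, so every cell sees a full set of 8 neighbors. A minimal sketch of such an addressing scheme, assuming row-major cell indices and the 4x4 instance from above (the function name and indexing are ours, for illustration only):

```c
/* Illustrative sketch: neighbor lookup with cyclic (torus-like) wrap-around,
 * as implied by the homogeneous local interconnect. Names are ours. */
#define ROWS 4
#define COLS 4

/* returns the linear index of the neighbor of cell (row, col) displaced by
 * (drow, dcol), with drow and dcol in {-1, 0, +1} */
static int neighbor(int row, int col, int drow, int dcol)
{
    int r = (row + drow + ROWS) % ROWS;   /* wrap vertically   */
    int c = (col + dcol + COLS) % COLS;   /* wrap horizontally */
    return r * COLS + c;
}
```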

The cells can read data from the FIFOs via the input buses (INP0/1) and can write data to the FIFOs via the output buses (OUTP0/1). The cells connect to the input and output buses through the horizontal north buses. The control signals for the FIFOs are generated by programmable controllers.

2.2 Reconfigurable Cell Architecture

Figure 3 presents a detailed view of the configurable cell. The cell has a rather versatile input structure with three inputs overall, an operator block, and an output structure. Each of the inputs can connect to one of five sources: any local neighbor, a horizontal bus, a vertical bus, a constant, or the cell's own output register (feedback path). The cell's configuration determines which of these input sources or constant values is actually selected.

The operator block is ALU-based and performs the cell computation. Figure 3 includes a table of the implemented cell operations.

Most arithmetic and logical operations are self-explanatory. The 'pass' operation passes the input value to the output (identity), which can be useful for routing purposes. The 'tstbitat' operations are used for bit tests: 'tstbitat0(value, mask)' takes an input value and a mask and returns 1 if all bits that are set in the mask are set to 0 in the value; otherwise the operator returns 0. The 'tstbitat1' operator works analogously for testing whether bits are set to 1. The 'mux(sel, a, b)' operator forwards input 'a' or input 'b' to the output, depending on the value of the least-significant bit of 'sel'. This operator is particularly useful for implementing control flow, i.e., data-dependent processing, on the reconfigurable array. Each row of cells has access to a shared ROM memory block, cf. Fig. 2. The 'rom(addr)' operation of a cell reads the contents at address 'addr' of the ROM associated with this cell. The contents of the ROM are part of the configuration.
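As a reading aid, the following C fragment sketches the semantics of the less common operators as we understand them from the description above. The 24-bit datapath is modeled here with 32-bit C integers, and the selection polarity of 'mux' (whether a set LSB picks 'a' or 'b') is not stated in the text, so that choice is an assumption.

```c
#include <stdint.h>

/* tstbitat0(value, mask): 1 if every bit set in mask is 0 in value */
static uint32_t tstbitat0(uint32_t value, uint32_t mask)
{
    return (value & mask) == 0;
}

/* tstbitat1(value, mask): 1 if every bit set in mask is 1 in value */
static uint32_t tstbitat1(uint32_t value, uint32_t mask)
{
    return (value & mask) == mask;
}

/* mux(sel, a, b): forwards a or b depending on the LSB of sel
 * (assumption: LSB = 1 selects a) */
static uint32_t mux(uint32_t sel, uint32_t a, uint32_t b)
{
    return (sel & 1u) ? a : b;
}
```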

Zippy is specifically designed to support hardware virtualization, in particular through time-multiplexed execution of circuits. To this end, the reconfigurable array is designed as a multi-context architecture, which can store several configurations on-chip concurrently. The context sequencer, which is part of the RU, can switch rapidly between these configurations and activate one of the contexts for execution. All contexts share the computation elements in the data-path, but each context has its own set of registers. This allows intermediate results generated by a context to be kept until the next invocation of that context and eliminates both the need for memory structures to hold these data and time-consuming context store and restore phases. Moreover, the Zippy cell provides means for a context to read registers of other contexts, which enables efficient data transfers between contexts.

Figure 3 shows the integration of these per-context register files into the Zippy cell architecture. A reconfigurable cell features one input register file per input and an output register file. The cell inputs are always registered, regardless of whether the data will later be used. That is, when a cell receives input data in a given execution cycle, the data is written to the input register selected by the context selector (ctxno). In the same cycle, the input data for the operator block is either a constant, the direct cell input, or a value read from any of the input registers; this choice is part of the configuration. The output of the operator block is also automatically registered, in the output register file controlled by the context selector. In contrast, the cell output is controlled by the configuration and can be selected from either the direct operator output or any of the output registers.

Figure 3. Reconfigurable Unit: Cell Model (implemented cell operations: arithmetic: add, sub, mul, neg; logic: not, and, nand, or, nor, xor, xnor, srl, sll, ror, rol, pass; test: eq, neq, lte, gte, tstbitat0, tstbitat1; memory: rom; other: mux)
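To make the per-context register behavior concrete, here is a small C model of a single cell's register files under the assumptions stated above: one input register file per input, one output register file, all indexed by the active context number. All names and the fixed sizes are ours, chosen only for illustration (the real array is parameterized).

```c
/* Illustrative model of the per-context registers of one Zippy cell.
 * Sizes and names are assumptions for this sketch only. */
#define SKETCH_CONTEXTS 8   /* number of on-chip contexts (parameterizable) */
#define CELL_INPUTS     3   /* the cell has three inputs                    */

typedef struct {
    int in_reg[CELL_INPUTS][SKETCH_CONTEXTS]; /* one input register file per input */
    int out_reg[SKETCH_CONTEXTS];             /* output register file              */
} zippy_cell_regs_t;

/* Inputs are always registered: write the incoming value into the input
 * register selected by the active context number (ctxno). */
static void cell_capture_input(zippy_cell_regs_t *c, int input, int ctxno, int value)
{
    c->in_reg[input][ctxno] = value;
}

/* The operator result is automatically registered in the output register
 * file of the active context. */
static void cell_capture_output(zippy_cell_regs_t *c, int ctxno, int result)
{
    c->out_reg[ctxno] = result;
}

/* A context may read output registers written by another context; this is
 * how data is passed between contexts (virtualization registers). */
static int cell_read_other_context(const zippy_cell_regs_t *c, int other_ctxno)
{
    return c->out_reg[other_ctxno];
}
```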

Generally, an application circuit contains (combinational) operators and registers. We call these registers user registers. A virtualized circuit additionally requires so-called virtualization registers that are used to transfer data between contexts. While the registers in the register-rich Zippy cell could be used arbitrarily, we have decided to implement user registers in the cells' input register files and the virtualization registers in the cells' output register files. This specific choice has been made to simplify the process of mapping virtualized applications to the array. At the same time, this choice imposes constraints on the mapping and the cell usage. For example, an operation in cell x writes its result to the corresponding output register of that cell. An operator mapped to cell y in a subsequent context that requires this data as an input has to read it from cell x. This leads to i) routing between the two cells and ii) the fact that the regular output of cell x cannot be used in the subsequent configuration. The operator block of cell x can, however, still be utilized.

2.3 Tool-flow and Simulation Environment

The Zippy software support comprises a number of tools that allow a designer to generate applications consisting of parts running in software and parts running in hardware. The software tool flow includes a C compiler and a function library for accessing the reconfigurable hardware from the software application. The compiler chain is based on SimpleScalar's versions of the GNU C compiler and utilities and has been described in [5]. For mapping applications onto the reconfigurable array, we have developed tools for technology mapping, placement, and routing. The applications are given in graph form and are converted to a textual description in the Zippy Netlist Format (ZNF); the map, place, and route tools operate on this format. While these algorithms have been specifically tailored to the Zippy architectural model, they have been inspired by techniques and algorithms known from FPGA placement and routing [2]. After running the tools, the resulting placed and routed netlist is passed to a configuration generator, which turns the array configuration into a format that can be loaded by the CPU at runtime.
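The function library itself is not detailed in this paper, so the following C fragment only sketches what a host-side invocation might look like given the RU features described in Section 2.1 (configuration upload, FIFOs, context sequencer). Every function name here is hypothetical and serves purely to illustrate the flow; it is not the actual Zippy library API.

```c
/* Hypothetical host-side usage sketch. The zippy_* prototypes below are
 * invented for illustration and do not come from the Zippy function library. */
extern void zippy_load_context(int ctx, const void *cfg);
extern void zippy_fifo_push(int fifo, int value);
extern int  zippy_fifo_pop(int fifo);
extern void zippy_sequencer_run(int num_contexts, int iterations);

void run_on_ru(const void *ctx_cfgs[], int num_contexts,
               const int *in_data, int *out_data, int n)
{
    /* load one configuration per context into the configuration memory */
    for (int c = 0; c < num_contexts; c++)
        zippy_load_context(c, ctx_cfgs[c]);

    /* stream input data to the RU through an input FIFO */
    for (int i = 0; i < n; i++)
        zippy_fifo_push(0, in_data[i]);

    /* let the context sequencer cycle through the contexts */
    zippy_sequencer_run(num_contexts, n);

    /* collect results from an output FIFO */
    for (int i = 0; i < n; i++)
        out_data[i] = zippy_fifo_pop(1);
}
```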

The performance evaluation of different Zippy architectures and virtualization techniques is performed with our co-simulation framework, which has been discussed in [5, 7].

3 Case Study

The case study illustrates the use of the Zippy architecture and in particular its support for hardware virtualization. We compare the execution of an ADPCM application on a large instance of the Zippy architecture with a smaller instance of the same architecture. The large instance requires more area but allows the complete application to be mapped into one configuration. The smaller instance requires hardware virtualization to run the application. Both implementations are compared to an implementation that uses only the CPU core.

3.1 Application

ADPCM is a well-established speech coding algorithm. Figure 4 presents our hardware implementation, which has been manually derived from the C-code reference implementation. The ADPCM netlist uses 31 combinational operators (which can be directly implemented by a cell), 3 dedicated registers, 1 input, and 1 output port. The dedicated registers can be implemented within the input register files of the Zippy cells, cf. Fig. 3. Thus, the hardware implementation requires an execution architecture with at least 31 cells.
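For reference, the core of the widely used IMA/DVI ADPCM decode step is sketched below in C. This is our own transcription of the standard public reference algorithm from which, according to the text, the netlist in Figure 4 was derived; it is not necessarily identical to the exact source used by the authors. Variable names are chosen to match the netlist labels (index, step, valpred) where possible, and the tables are the standard IMA index and step-size tables.

```c
#include <stdint.h>

static const int indexTable[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8,
    -1, -1, -1, -1, 2, 4, 6, 8
};

static const int stepsizeTable[89] = {
        7,     8,     9,    10,    11,    12,    13,    14,    16,    17,
       19,    21,    23,    25,    28,    31,    34,    37,    41,    45,
       50,    55,    60,    66,    73,    80,    88,    97,   107,   118,
      130,   143,   157,   173,   190,   209,   230,   253,   279,   307,
      337,   371,   408,   449,   494,   544,   598,   658,   724,   796,
      876,   963,  1060,  1166,  1282,  1411,  1552,  1707,  1878,  2066,
     2272,  2499,  2749,  3024,  3327,  3660,  4026,  4428,  4871,  5358,
     5894,  6484,  7132,  7845,  8630,  9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};

/* decoder state: predicted value and step index persist across samples
 * (initialize both to 0 before decoding a stream) */
typedef struct { int valpred; int index; } adpcm_state_t;

/* decode one 4-bit ADPCM code into a 16-bit PCM sample */
int16_t adpcm_decode_sample(adpcm_state_t *st, uint8_t delta)
{
    int step = stepsizeTable[st->index];          /* step from previous sample */

    /* update step index and clamp to [0, 88] */
    st->index += indexTable[delta & 0x0f];
    if (st->index < 0)  st->index = 0;
    if (st->index > 88) st->index = 88;

    /* reconstruct the difference: vpdiff = ((delta & 7) + 0.5) * step / 4 */
    int vpdiff = step >> 3;
    if (delta & 4) vpdiff += step;
    if (delta & 2) vpdiff += step >> 1;
    if (delta & 1) vpdiff += step >> 2;

    /* apply sign bit, then clamp the prediction to the 16-bit range */
    if (delta & 8) st->valpred -= vpdiff;
    else           st->valpred += vpdiff;
    if (st->valpred >  32767) st->valpred =  32767;
    if (st->valpred < -32768) st->valpred = -32768;

    return (int16_t)st->valpred;
}
```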


Figure 4. ADPCM: application netlist (partitioned into contexts 0, 1, and 2; inter-context values such as index, step, valpred, and the t_x_y virtualization registers are passed between contexts)

3.2 Experiments

• For the non-virtualized implementation, we have chosen a reconfigurable array of size 7x7. Although a 6x6 array would provide a sufficient number of cells, the dense interconnect structure of the ADPCM netlist easily leads to congestion and makes placement and routing on a 6x6 array rather difficult. Using a 7x7 array relaxes the implementation constraints and allows the tools to quickly find a routable implementation.

• For the virtualized implementation, the full netlist has been manually partitioned into three smaller netlists, such that each of them fits onto an array of size 4x4. Automatic partitioning has not been implemented at the current stage of the project. Figure 4 presents the division of the initial netlist into the three contexts. Virtualization registers have been inserted wherever data has to be transferred from one context to another; they are denoted by t_x_y in the figure. For each of these netlists, a configuration is generated using the Zippy place and route tools. Figure 4 shows that for this implementation all feedback paths of the circuit stay within single contexts. Note that this is not a requirement for hardware virtualization: thanks to the virtualization registers, feedback cycles between contexts are also possible.

• The pure software implementation uses the C source code of the ADPCM reference implementation. The code has been compiled with SimpleScalar's GNU C compiler, with optimization enabled (-O).

3.3 Experimental Setup and Results

The performance results for the pure CPU implementation have been obtained by simulation with SimpleScalar. SimpleScalar's configuration parameters have been chosen to match a typical embedded CPU core with in-order execution and static branch prediction, cf. [5].

Table 1 summarizes the results of the case study and demonstrates the trade-off involved in hardware virtualization. The superior performance of the single-context, non-virtualized implementation is paid for with more than three times the hardware effort of the virtualized, 3-context implementation (49 vs. 16 cells). Thus, hardware virtualization is useful in two situations: first, it offers a sensible approach to reducing the amount of required hardware while still exploiting computation in space; second, when the given hardware is too small to accommodate the overall circuit, hardware virtualization is mandatory.

The results for the pure software implementation have been determined by measuring the number of cycles needed for decoding 512 blocks of 1024 samples.

The non-virtualized implementation decodes 1 sample per cycle. The virtualized implementation requires 3 cycles for decoding one ADPCM sample: each of the three contexts is activated and executed for one cycle, and this schedule is repeated cyclically.

  Implementation    cycles per sample   relative speedup   array size
  CPU only                67.2                 1.0          n.a.
  single context           1.0                67.2          49 cells
  3-context                3.0                22.4          16 cells

Table 1. Performance of the three different ADPCM implementations

The maximal clock frequency for the reconfigurable array is application-specific and would have to be determined by analyzing the longest path of the mapped and routed circuit. The timing model we currently use is rather coarse and does not consider mapping and routing. However, we envision an embedded CPU core that runs at a modest clock frequency, which justifies the assumption that both the CPU core and the reconfigurable unit run at the same clock rate.
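The cyclic three-context schedule described above can be summarized in a few lines of C. This is a behavioral sketch of the schedule only, not of the sequencer hardware, and the function name is a placeholder.

```c
/* Behavioral sketch of the cyclic context schedule used for the virtualized
 * ADPCM decoder: contexts 0, 1, 2 each run for one cycle per sample. */
#define ADPCM_CONTEXTS 3

/* placeholder for "activate context ctx and run the array for one cycle" */
extern void execute_context_for_one_cycle(int ctx);

static void decode_samples(int num_samples)
{
    for (int s = 0; s < num_samples; s++)            /* one iteration per sample */
        for (int ctx = 0; ctx < ADPCM_CONTEXTS; ctx++)
            execute_context_for_one_cycle(ctx);      /* 3 cycles per sample */
}
```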

4 Conclusions and Future Work

We have presented the Zippy hybrid CPU, which employs a coarse-grained, multi-context reconfigurable array. A distinctive feature of Zippy is its support for hardware virtualization. While one hardware virtualization approach, namely virtualized execution, has already been presented in [5], this paper focused on the approach termed temporal partitioning. The Zippy architecture and the simulation environment allow us to perform studies with hardware virtualization techniques. This is of utmost importance, since no commercial implementations of such architectures are readily available. With a case study we have demonstrated the feasibility of hardware virtualization and the involved trade-off between execution time and required hardware.

Future work will focus on the development of algorithms that automatically perform the temporal partitioning of applications onto the Zippy architecture. Our long-term goal is to create an end-to-end tool flow that starts with a hardware description (in the ZNF netlist format) and creates an implementation that uses hardware virtualization if needed.

References

[1] V. Baumgartner, F. May, A. Nuckel, M. Vorbach, and M. Weinhardt. PACT XPP – a self-reconfigurable data processing architecture. In Proc. 1st Int. Conf. on Engineering of Reconfigurable Systems and Algorithms (ERSA), pages 64–70, 2001.

[2] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers, 1999.

[3] E. Caspi, M. Chu, R. Huang, J. Yeh, J. Wawrzynek, and A. DeHon. Stream computations organized for reconfigurable execution (SCORE). In Proc. 10th Int. Conf. on Field Programmable Logic and Applications (FPL), pages 605–614, 2000.

[4] A. DeHon. DPGA utilization and application. In Proc. 4th ACM Int. Symp. on Field-Programmable Gate Arrays (FPGA), pages 115–121, 1996.

[5] R. Enzler, C. Plessl, and M. Platzner. Co-simulation of a hybrid multi-context architecture. In Proc. 3rd Int. Conf. on Engineering of Reconfigurable Systems and Algorithms (ERSA), pages 174–180. CSREA Press, 2003.

[6] R. Enzler, C. Plessl, and M. Platzner. Virtualizing hardware with multi-context reconfigurable arrays. In Proc. 13th Int. Conf. on Field Programmable Logic and Applications (FPL), pages 151–160, 2003.

[7] R. Enzler, C. Plessl, and M. Platzner. System-level performance evaluation of reconfigurable processors. Microprocessors and Microsystems, 29(2–3):63–73, Apr. 2005.

[8] T. Fujii, K.-i. Furuta, M. Motomura, M. Nomura, M. Mizuno, K.-i. Anjo, K. Wakabayashi, Y. Hirota, Y.-e. Nakazawa, H. Itoh, and M. Yamashina. A dynamically reconfigurable logic engine with a multi-context/multi-mode unified-cell architecture. In 46th IEEE Int. Solid-State Circuits Conf. (ISSCC), Dig. Tech. Papers, pages 364–365, 1999.

[9] S. C. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. R. Taylor. PipeRench: A reconfigurable architecture and compiler. IEEE Computer, 33(4):70–77, Apr. 2000.

[10] Y. Ha, P. Schaumont, M. Engels, S. Vernalde, F. Potargent, L. Rijnders, and H. De Man. A hardware virtual machine for the networked reconfiguration. In Proc. IEEE Int. Workshop on Rapid System Prototyping, pages 194–199, 2000.

[11] Y. Ha, S. Vernalde, P. Schaumont, M. Engels, R. Lauwereins, and H. De Man. Building a virtual framework for networked reconfigurable hardware and software objects. Journal of Supercomputing, 21(2):131–144, Feb. 2002.

[12] T. Miyamori and K. Olukotun. REMARC: Reconfigurable multimedia array coprocessor. IEICE Trans. on Information and Systems, E82-D(2):389–397, Feb. 1999.

[13] C. Plessl and M. Platzner. Virtualization of hardware – introduction and survey. In Proc. 4th Int. Conf. on Engineering of Reconfigurable Systems and Algorithms (ERSA), pages 63–69. CSREA Press, 2004.

[14] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho. MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications. IEEE Trans. on Computers, 49(5):465–481, May 2000.

[15] S. Trimberger, D. Carberry, A. Johnson, and J. Wong. A time-multiplexed FPGA. In Proc. 5th IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), pages 22–28, 1997.

[16] X. Tang, M. Aalsma, and R. Jou. A compiler directed approach to hiding configuration latency in Chameleon processors. In Proc. 10th Int. Conf. on Field Programmable Logic and Applications (FPL), pages 29–38, 2000.