IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 22, NO. 6, JUNE 2014

Improving Energy Efficiency in FPGA Through Judicious Mapping of Computation to Embedded Memory Blocks

Anandaroop Ghosh, Somnath Paul, Member, IEEE, Jongsun Park, Senior Member, IEEE, and Swarup Bhunia, Senior Member, IEEE

Abstract— Field-programmable gate arrays (FPGAs) are being increasingly used as a preferred prototyping and accelerator platform for diverse application domains, such as digital signal processing (DSP), security, and real-time multimedia processing. However, mapping of these applications to FPGA typically suffers from poor energy efficiency because of the high energy overhead of programmable interconnects (PI) in FPGA devices. This paper presents an energy-efficient heterogeneous application mapping framework in FPGA, where the conventional application mappings to logic and DSP blocks (for DSP-enhanced FPGA devices) are combined with judicious mapping of specific computations to embedded memory blocks. A complete mapping methodology including functional decomposition, fusion, and optimal packing of operations is proposed and efficiently used to reduce the large energy overhead of PIs. The effectiveness of the proposed methodology is verified for a set of common applications using a commercial FPGA system. Experimental results show that the proposed heterogeneous mapping approach achieves significant energy improvement for different input bit-widths (e.g., more than 35% energy savings with 8-bit or smaller inputs compared with the corresponding mapping in configurable logic blocks). For further reduction of energy, we propose an energy/accuracy tradeoff approach, where the input operand bit-width is dynamically truncated to reduce memory area and energy at the expense of modest degradation in output accuracy. We show that using a preferential truncation method, up to 88.6% energy savings can be achieved in a 32-tap finite impulse response filter with modest impact on the filter performance.

Index Terms— Embedded random access memory (RAM), energy-efficiency, field-programmable gate array (FPGA), memory-based computing.

I. INTRODUCTION

BECAUSE of their flexibility in application mapping, reduced design cost, and improved time-to-volume, field-programmable gate arrays (FPGAs) are widely used as reconfigurable computing platforms in embedded application domains, such as digital signal processing (DSP), multimedia, security, and graphics. FPGAs have also emerged as a preferred coprocessor platform, providing higher performance when working in conjunction with a processor for a variety of real-time applications [1]. However, application mapping to conventional FPGA platforms generally suffers from poor energy efficiency, primarily because of the large overhead of the elaborate programmable interconnect (PI) fabric. At scaled process technologies, PI accounts for up to 80% of power and 60% of delay in FPGAs [2].

Manuscript received October 1, 2012; revised May 9, 2013; accepted June 9, 2013. Date of publication August 1, 2013; date of current version May 20, 2014. This work was supported by the National Science Foundation under Grant CCF-0964514 and Grant ECCS-1002237.

A. Ghosh and S. Bhunia are with the Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106 USA (e-mail: [email protected]; [email protected]).

S. Paul is with Intel Corporation, Hillsboro, OR 97124 USA (e-mail: [email protected]).

J. Park is with the School of Electrical Engineering, Korea University, Seoul 136-701, Korea (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2013.2271696

FPGA vendors and academic researchers have investigated various device engineering options (such as low-k dielectric and multiple device thresholds) [3] as well as architecture-level techniques (e.g., clustered architecture) [4] to reduce the energy consumption of FPGA devices. However, these device-circuit-architecture level optimizations typically trade off the intrinsic flexibility in application mapping and performance, and the energy reduction techniques cannot be directly applied to compute-intensive signal processing or multimedia applications. Moreover, as interconnect delay does not scale as well as logic delay, the fine-grained architecture of FPGAs suffers from poor technological scalability of performance and energy compared with custom implementations. Consequently, for compute-intensive signal processing, security, or multimedia applications, which comprise a large set of embedded applications, energy efficiency does not scale significantly over FPGA generations.

Looking at FPGA devices across generations, an important observation is that modern FPGAs come with a large number of embedded memory blocks (EMBs). As presented in Fig. 1, driven by aggressive technology scaling, FPGA devices from different vendors are integrating larger EMBs with faster access speed. In addition, we observe a significant reduction in the access energy of EMBs in FPGA (e.g., 53% from Stratix II to Stratix III and 48% from Stratix III to Stratix IV). While many applications use the EMBs for storing inputs and intermediate data during data processing, a large part of the memory may remain unused for many compute-intensive applications [7]. If those unused EMBs can opportunistically be used for computation, in particular, to realize compute-intensive datapaths or functions, one can significantly improve both dynamic energy consumption and resource utilization. To minimize the dynamic energy consumption of an FPGA-based reconfigurable platform, an opportunistic mapping of computation to EMBs, which can drastically reduce the energy overhead of PIs while using the conventional FPGA architecture, would be the right solution to pursue.

1063-8210 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. Trend of EMBs in FPGA. (a) Size (MB) and (b) access speed (MHz) for Altera Stratix [5] and Xilinx Virtex [6] series of FPGA devices across different technology generations.

Some previous research works have also considered mapping logic functions to EMBs in FPGAs, but to improve the area or performance of the mapped applications rather than their energy. In [8] and [9], fine-grained applications are mapped into EMBs of FPGAs. Although those investigations lay a foundation for using EMBs for computation, the approaches are primarily focused on area or performance improvement by mapping simple Boolean functions in fine-grained applications (e.g., control logic). Mapping transcendental functions to a set of EMBs of FPGA is also reported in [10]. However, that approach does not consider energy optimization with a comprehensive heterogeneous mapping flow that accounts for the energy requirements of fine-grained logic, DSP blocks, and EMBs for energy-efficient mapping of applications to the FPGA platform. The authors in [11] show the power implications of mapping logic to EMBs of the Stratix-1 series of devices. Because of the power-hungry nature of EMBs in previous-generation FPGA devices, mapping logic in embedded memory results in a considerable increase in dynamic power compared with the configurable logic block (CLB)-based implementation. However, because of the improvement in memory power and performance across technology generations, in newer generations of FPGA, both energy and performance are dominated by the PI overhead.

This paper presents a novel energy-efficient application mapping methodology for FPGA, where compute-intensive operations are decomposed and mapped to EMBs to minimize the dynamic energy consumption in FPGA. First, we select appropriate computations for mapping into EMBs. By varying the input bit-width, we explore the optimal energy configurations for mapping the computations into sequential accesses of the EMBs. A complete mapping flow including decomposition, fusion, and packing of nodes into the available computational resources is also presented in this paper. While the conventional implementation of compute-intensive operations requires multiple logic levels connected through PIs that incur large performance and energy overhead, the proposed heterogeneous mapping approach, where a large multi-input/output function can be realized with one or a few lookup table (LUT) accesses, significantly improves energy and latency. In particular, this paper makes the following major contributions.

1) It develops a heterogeneous application mapping framework, which combines the conventional application mapping to logic and DSP blocks (for DSP-enhanced FPGA devices) with judicious mapping of specific computations in memory. It determines the complete mapping methodology including functional decomposition, fusion of small operations into a large one, and optimal packing of operations into a combination of EMBs and logic arrays.

2) It then analyzes the effectiveness of mapping coarse-grained compute-intensive operations to EMBs in FPGA. By mapping a number of common signal processing, scientific, and graphics applications on commercial state-of-the-art FPGA platforms [Altera Stratix IV (40 nm process), Xilinx Virtex VI (45 nm process)], the effectiveness of the proposed methodology is verified. It is shown that the opportunistic mapping of compute-intensive operations into EMBs in the form of large multi-input multi-output LUTs can significantly improve energy efficiency for many applications.

3) It also shows that the proposed heterogeneous mapping framework can be extended to make an efficient energy/accuracy tradeoff at run time. The operand bit-widths of complex operations can be dynamically truncated to achieve an exponential reduction in memory space. The resulting output-accuracy degradation can be reduced through: a) choice of optimal values for the truncated bits and b) preferential truncation of operands.

The rest of this paper is organized as follows. Section II presents the background work related to energy-efficient design in FPGA and memory-based computation. General characteristics of the applications amenable for mapping to EMBs are described in Section III, and Section IV focuses on the application mapping methodology including energy-efficient configuration of EMBs and the timing strategy. Section V presents the experimental results for several common complex datapath and complex function-dominated applications. Section VI shows the dynamic energy/accuracy tradeoff approach using the proposed heterogeneous application mapping, and finally, conclusions are drawn in Section VII.

II. BACKGROUND

In this section, we describe the device-circuit-architecture level approaches previously proposed to improve the energy consumption in FPGAs. The energy-efficient approaches are categorized as device/circuit-level, architecture-level, and software-level techniques.


A. Energy-Efficient Design in FPGA

Several system-level and device-circuit-architecture level low-power design techniques have been applied to FPGA to reduce its power consumption. A comprehensive description of the power optimization techniques has been provided in [12]. The techniques can be broadly classified into three categories.

Device/Circuit-Level Techniques:

1) Clock gating can be effectively used to turn off the unused portions of the FPGA resources to minimize the propagation of undesired signal values [13].

2) Dynamic voltage scaling is also used to adapt the supply voltage of an FPGA according to the temperature of the device during operation [14].

3) Moreover, logic array blocks (LABs) are made with a mix of high-threshold (Vt) and low-Vt transistors to achieve lower power while maintaining the performance target in a particular technology node [15].

Architecture-Level Techniques:

1) CLB LUTs have been made fracturable, with LUT sizes configurable from 4 to 7 inputs to map larger functions in a single LUT, thereby significantly reducing the routing power consumption [16].

2) Generally for EMBs, a coarse-grained, block-based architecture gives better power/energy results compared with a fine-grained random access memory (RAM) architecture [12].

3) The addition of more specialized units, such as DSP blocks with more embedded memories, has helped to reduce energy consumption for specific functions.

4) Finally, most commercial FPGAs have moved to a clustered architecture for computing blocks (e.g., CLBs), which has dramatically reduced the routing distance (and hence average route capacitance) between neighboring blocks that can be reached within 1–2 hops [17].

Software Techniques: Energy-efficient mapping algorithms for legacy FPGA hardware have been investigated earlier [18]. However, existing software techniques primarily focus on reducing PI overhead in logic and DSP-based implementations. They use memories only for storing large input data or intermediate values, and do not leverage the prospect of mapping complex operations to EMBs.

B. Computation With Memory

Computation with memory is an LUT-based computation approach, where a computing function is implemented using a 2-D memory array. The same memory array may also include multiple LUTs, and computing is performed by accessing the LUT with the right address values. Here, the table values are precomputed and used during the application mapping process. The downside of LUT-based computation is the exponential growth of the memory size with the increase in input operand bit-width. As a result, without employing effective decomposition techniques or approximate computing techniques as proposed in [19] and [20], LUT-based computation is only feasible in FPGA up to a certain LUT input bit-width. Altera Stratix IV FPGA devices support a maximum LUT input size of 16 in a single memory. In general application mapping scenarios, a large part of the existing EMBs in an FPGA remains unutilized. The use of idle memory blocks for fine-grained logic computation is proposed in [8] and [9]. It has been shown that the area or performance of an application can be improved through effective computation in memory.

III. USE OF EMBEDDED MEMORY BLOCKS FOR COMPUTATION

A majority of scientific and signal-processing applications include a set of common compute-intensive kernels, which essentially constitute basic mathematical operations such as addition and multiplication, as well as the evaluation of many complex functions such as sine, cosine, reciprocal, arctan, square root, exponentiation, and logarithm [21]. Traditionally, in an FPGA framework, the transcendental functions are mapped using the coordinate rotation digital computer (CORDIC) approach [22] or Taylor series expansion [23], which either requires a large number of computing resources or suffers from large latency. Mapping these functions to embedded memory arrays by holding the functional outputs is an attractive solution. In addition, the use of memory blocks in computing offers enormous parallel computing resources because of the presence of a large number (hundreds to thousands) of EMBs in a modern FPGA device. The focus of this paper is to achieve energy-efficient mapping of the complex functions, rather than making a memory-latency tradeoff, by utilizing the already existing decomposition techniques proposed in [19] and [20].

Computation with memory is also very effective in the case of complex datapaths, such as two constant coefficient multiplications followed by an addition. This kind of datapath is very common in multimedia and signal processing (e.g., discrete cosine transform and filtering) applications. An effective mapping methodology, which can leverage such decomposition algorithms and the large parallel computing resources offered by high-density EMBs in FPGA, can provide significant improvement in performance and energy efficiency. The following computations have been identified as amenable for mapping to memory (a small illustrative sketch follows the list).

1) Any function satisfying the maximum LUT input size (I). The function can be a regular transcendental one [such as sine(x), where x has a resolution ≤ I] or any arbitrary function (with input size ≤ I) obtained by fusing many simple ones.

2) Complex datapaths like two constant coefficient multiplications followed by an addition, with total input size ≤ I.

3) Easily bit-sliceable functions such as Galois field operations and a few datapath operations such as additions and multiplications.

4) Functions with arbitrary input bit-width which can be realized by cascading LUTs, each of input size ≤ I, e.g., transcendental functions with fixed-point operands.

5) Functions which are amenable to decomposition into an acceptable input size in other ways. For example, a multiplication realized using a logarithmic number system can be converted from a two-operand function to a single-operand function.
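As a concrete illustration of item 2), the following is a minimal Python sketch of computation with memory for a fused constmult-add datapath (C1·a + C2·b): the results are precomputed into a table indexed by the concatenated operands, so the whole datapath becomes a single memory lookup. The coefficients and the 4-bit operand width are hypothetical choices for illustration, not values from the paper.

```python
# Minimal sketch: realize y = C1*a + C2*b as one LUT access.
# C1, C2, and the 4-bit operand width are illustrative assumptions; any
# function of <= I input bits can be tabulated this way.
C1, C2, BITS = 7, 3, 4

# Precompute all 2^(2*BITS) outcomes; the address is the concatenation (a, b).
LUT = [C1 * (addr >> BITS) + C2 * (addr & (2**BITS - 1))
       for addr in range(2**(2 * BITS))]

def constmult_add(a, b):
    """One memory read replaces two constant multiplications and an addition."""
    return LUT[(a << BITS) | b]

assert constmult_add(5, 9) == C1 * 5 + C2 * 9
print(constmult_add(5, 9))   # 62
```

The table has 2^(2·BITS) entries, which is why this style of mapping is attractive only up to the maximum LUT input size I discussed above.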


Fig. 2. Application mapping steps using EMBs for computation.

IV. APPLICATION MAPPING METHODOLOGY

In this section, we present a heterogeneous application mapping process, where the combination of EMBs and other FPGA resources is efficiently used to map compute-intensive operations with significant energy savings in FPGA. A complete heterogeneous mapping flow including decomposition, fusion, and packing of nodes into the available computational resources is described in this section.

A. Mapping Flow

The application mapping process for the proposed framework is shown in Fig. 2. The input to the application mapping flow is a control data flow graph (CDFG) containing a hypergraph representation [G(V, E)] of the input application. The operations in the input application constitute the set of vertices (V), and the dataflow between them is represented as the edges (E). Types and subtypes of the operation nodes that are supported in the CDFG representation are summarized in Table I. Each operation node is characterized by the following fields: 1) name; 2) type; 3) subtype; 4) inputs; 5) outputs; and 6) bit-width. The name field is unique and specifies the name of the operation. Each type can have multiple subtypes. For example, both addition and subtraction fall under the broader type of operations that can be bit-sliced into suboperations with a smaller bit-width and a carry function (denoted as bitswC). Bit-width denotes the bit-width of the input operands. Bits represents the operations that are bit-sliceable without carry. Mult denotes the conventional two-input multiplication. Shift and rotate has one operand together with the shift/rotate direction as a subtype. Complex denotes a general class of LUT operations. The maximum number of inputs for each type of operation is fixed except for the select type, which represents N-to-one selection, where N is a variable. The major steps of the application mapping flow are described as follows.

TABLE I
OPERATION TYPES SUPPORTED IN THE FPGA MAPPER TOOL

1) Functional Decomposition: Decomposition is the process of replacing a vertex in a hypergraph having large inputs/outputs with multiple vertices satisfying the input–output constraint of the LUTs. For example, bit-sliceable and transcendental functions can be broken down in this step into multiple vertices satisfying the maximum input/output count of a LUT. Here, the resource constraints of the FPGA device along with the maximum input size of a LUT are read from a parameter file, which serves as an input to the FPGA mapper tool. Transcendental functions and other complex functions with fixed-point operands can be decomposed using either: 1) the bipartite/multipartite [19] or 2) the add-table lookup-add (ATA) [20]-based method to minimize the total memory requirement.

The idea behind bipartite and multipartite decomposition [19] is to divide the input space (2^α segments) into 2^γ larger intervals (where γ < α) so that the slope of the function can be considered constant within each larger interval. Hence, it contains 2^γ tables of offsets, each with 2^β entries. The decomposition allows storing a total of 2^α + 2^(γ+β) values instead of 2^(α+β). Another method for reducing the required memory size is the ATA method [20]. The main premise is to approximate a function f(x) using a Taylor series central difference method, and the errors induced in the approximation are calculated in each stage to ensure that the error is less than one unit in the last place. The basic method involves parallel additions, followed by parallel lookups and finally multistage additions. This method can be used to approximate all polynomial and transcendental functions which can be represented in a given fixed-point resolution. For functions which are decomposable with both multipartite and ATA, multipartite-based decomposition is selected as it requires smaller memory at iso-output-accuracy. A list of decomposition routines included in the proposed mapping flow is shown in Fig. 2.
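To make the table-size arithmetic concrete, the following is a minimal Python sketch of a bipartite decomposition of a single transcendental function. The function, the α/β/γ split, and the fixed-point format are illustrative assumptions, not the parameters used in the paper; the point is that 2^α + 2^(γ+β) stored values replace the 2^(α+β) entries of a monolithic table.

```python
import math

# Bipartite table approximation of f(x) = sin(pi/2 * x) for x in [0, 1),
# with x given as an n-bit unsigned fixed-point fraction (n = ALPHA + BETA).
# These parameters are hypothetical; they are chosen per function/accuracy.
ALPHA, BETA, GAMMA = 8, 8, 4          # n = 16 input bits, GAMMA < ALPHA
N = ALPHA + BETA

def f(x):
    return math.sin(math.pi / 2 * x)

def fprime(x):
    return math.pi / 2 * math.cos(math.pi / 2 * x)

# Table of initial values: one entry per 2^ALPHA segment start.
TIV = [f(x1 / 2**ALPHA) for x1 in range(2**ALPHA)]

# Tables of offsets: the slope is treated as constant over each of the
# 2^GAMMA coarse intervals, so the offset depends only on (x2, x0).
TO = [[fprime((x2 + 0.5) / 2**GAMMA) * (x0 / 2**N) for x0 in range(2**BETA)]
      for x2 in range(2**GAMMA)]

def lookup(x):
    """Approximate f(x / 2^N) as TIV + TO, i.e., two small memory reads."""
    x1 = x >> BETA                     # ALPHA MSBs -> TIV address
    x0 = x & (2**BETA - 1)             # BETA LSBs
    x2 = x1 >> (ALPHA - GAMMA)         # GAMMA MSBs -> coarse interval
    return TIV[x1] + TO[x2][x0]

# 2^ALPHA + 2^(GAMMA+BETA) stored values instead of 2^(ALPHA+BETA).
print("entries:", 2**ALPHA + 2**(GAMMA + BETA), "vs", 2**(ALPHA + BETA))
print("f(0.3) ~", lookup(int(0.3 * 2**N)), "exact:", f(0.3))
```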


Fig. 3. (a) Example of fusion of multiple nodes into a single node satisfying the LUT input/output count. (b) Specific example of multadd fusion for four-input operands.

2) Fusion: After decomposition, fusion involves the opportunistic reduction of the total number of operations by combining multiple operations into a single operation that can be suitably mapped into a LUT. Fusion incorporates two routines: 1) fusion of random LUT-based operations and 2) fusion of bit-sliceable operations. Fig. 3 shows an example of merging multiple nodes into a single node that satisfies the input/output constraints of a LUT. We have developed a heuristic for partitioning a target application into multi-input multi-output operations. The vertices inside each partition are fused to form a single vertex to be mapped as a LUT operation. This heuristic is inspired by the maximum fanout free cone and maximum fanout free subgraph approaches outlined in [8]. Through off-line analysis using an FPGA synthesis tool, we evaluate the effectiveness of mapping each merged node of specific input/output width to embedded memories. This is done by mapping a function realized as merged nodes into three alternative implementations (memory-based, normal logic, and DSP-based) and comparing their energy consumptions. After fusion, the following vertex types can be obtained: 1) multadds: the function (a ∗ b + c) shifted by shiftamt; 2) multaddsel: the function sel?(a ∗ b + c) : c; 3) adds: the function (a + b) shifted by shiftamt; 4) addsel: the function sel?(a + b) : a; 5) multadd: the function a ∗ b + c ∗ d; 6) mults: the function (a ∗ b) shifted by shiftamt; and 7) multsel: the function sel?(a ∗ b) : c. Note that a constraint on fusing multiple datapath operations is that they must share one or more input operands. The fusion process is constrained by the number of input/output bits and the granularity.

Fig. 4. Packing algorithm for energy-efficient mapping in an FPGA device.
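A minimal sketch of the fusion step under strong simplifying assumptions: the DFG is a dictionary of nodes, each with its external input bit count and a single consumer, and a producer is greedily absorbed into its only consumer whenever the fused node still fits the maximum LUT input size I. The real flow uses the fanout-free-cone-style heuristic and the off-line energy characterization described above; the data structures and the threshold below are hypothetical.

```python
MAX_LUT_INPUT_BITS = 16     # assumed maximum LUT input size I

# Hypothetical DFG: node -> (external input bits, consumer node or None).
dfg = {
    "mult1": (8, "add1"),   # 4-bit x 4-bit constant multiply feeding add1
    "mult2": (8, "add1"),
    "add1":  (0, None),     # all of add1's inputs come from mult1/mult2
}

def fuse(dfg, max_bits):
    """Greedily fuse producers into their single consumer while the fused
    node's external input bit count stays within the LUT input limit."""
    changed = True
    while changed:
        changed = False
        for node, (bits, consumer) in list(dfg.items()):
            if consumer is None or consumer not in dfg:
                continue
            c_bits, c_consumer = dfg[consumer]
            if bits + c_bits <= max_bits:       # fused node still fits a LUT
                dfg[consumer] = (bits + c_bits, c_consumer)
                del dfg[node]                   # node absorbed into consumer
                changed = True
    return dfg

print(fuse(dict(dfg), MAX_LUT_INPUT_BITS))
# {'add1': (16, None)} -> the constmult-add chain becomes one 16-input LUT op
```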

3) Packing: The decomposed and fused nodes in the DFG act as inputs to the packing stage. The sequence of packing steps is shown in Fig. 4. As shown in the figure, we use a mapping database in this stage, which contains a priori information on energy-efficient mapping of different operations with varying input bit-width. Such a database can be created through off-line analysis of different types/sizes of operations in a target FPGA device. When a function is mapped to memory, the mapping database helps us derive the energy-optimal configuration with the required LUT size. During the packing step, we also consider specific resource constraints of the FPGA device. For example, if a particular memory type (say M9K or M144K for Stratix IV) is insufficient, we have to take the next best energy-efficient configuration to meet the resource constraints.
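A minimal sketch of how such a mapping database could drive packing, with entirely hypothetical energy numbers and resource counts: each (operation type, bit-width) pair maps to candidate configurations with characterized energies, and the packer takes the most energy-efficient candidate that still fits the remaining device resources.

```python
# Hypothetical mapping database: (op_type, bit-width) -> candidate
# configurations with characterized energy per operation (pJ, illustrative).
MAPPING_DB = {
    ("constmult_add", 4): [("M9K", 12.0), ("M144K", 18.0), ("logic", 21.0)],
    ("constmult_add", 8): [("M144K", 35.0), ("logic", 33.0), ("DSP", 40.0)],
}

def pack(nodes, resources):
    """Assign each node the lowest-energy configuration that still fits;
    fall back to the next candidate when a resource type is exhausted."""
    assignment = {}
    for name, op_type, bits in nodes:
        for res, energy in sorted(MAPPING_DB[(op_type, bits)], key=lambda c: c[1]):
            if resources.get(res, 0) > 0:
                resources[res] -= 1
                assignment[name] = (res, energy)
                break
    return assignment

nodes = [("n0", "constmult_add", 4), ("n1", "constmult_add", 4)]
print(pack(nodes, {"M9K": 1, "M144K": 4, "logic": 100, "DSP": 8}))
# n0 -> M9K (cheapest); n1 -> M144K (next best once the M9K blocks run out)
```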

4) Memory-Logic Interface and Timing Strategy: In the heterogeneous mapping procedure, where computations can be mapped either to fine-grained normal logic, DSP datapaths, or EMBs, the following three mapping scenarios should be considered for timing assignment: 1) memory access followed by another memory access; 2) memory access followed by logic/DSP operations; and 3) logic/DSP operation followed by memory access. The mapping scenarios are shown in Fig. 5. Typically, EMBs in FPGA are synchronous, which requires latching the address at a clock edge. The output data from memory can be optionally latched. For the proposed memory-logic heterogeneous mapping approach, the minimum clock period is determined by the maximum access time of any memory block in the design. The maximum clock period is the latency of the logic in between two memory accesses, or between an input pin and a memory access, or between a memory block output and an output pin. As described in Algorithm 1, we propose a heuristic which tries to optimize the total latency of the input application by accommodating multiple compute nodes of the packed DFG in a single clock cycle. The algorithm initially assigns the clock period to be the access time of the largest memory block and calculates the latency of the mapped application. Then it increments the clock period by a step size as calculated in the algorithm and tries to accommodate the logic following a particular memory access in the same clock cycle (CC). This is because, in a Trimatrix memory block in Altera or in an 18 k*2 or 36 k*2 memory block in Xilinx, the input address is always latched. However, the output data latch is optional, and extra logic can be accommodated in the same CC. After the clock period is incremented, the latency for the application is recalculated and compared with the previous latency. The minimum latency is saved with the corresponding clock period. Such an iteration is continued until the maximum logic latency present in the design has been covered. Note that at each step, the algorithm tries to balance the delay for the input, internal, and output paths.

Fig. 5. Interfacing and timing strategies for three different cases. (a) Memory access followed by logic. (b) Logic followed by memory access. (c) Memory access followed by another memory access.
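The following is a minimal Python sketch of this timing heuristic (Algorithm 1), under the simplifying assumption that the packed design is a single linear chain of memory lookups and logic/DSP operations; the node list and delays are hypothetical. It sweeps the candidate clock period, always starts a new cycle at a memory access (since the address is latched), absorbs following logic into the remaining slack, and keeps the period that minimizes cycles × period.

```python
def best_clock_period(nodes, step):
    """nodes: list of ("mem", access_time) or ("logic", delay) in dataflow order."""
    min_cc = max(t for kind, t in nodes if kind == "mem")   # clock >= slowest EMB
    max_cc = min_cc + sum(t for kind, t in nodes if kind == "logic")
    best_latency, best_T = float("inf"), None
    T = min_cc
    while T <= max_cc:
        cycles, slack = 0, 0.0
        for kind, t in nodes:
            if kind == "mem" or t > slack:
                cycles += 1              # new clock cycle (address is latched here)
                slack = T - t
            else:
                slack -= t               # absorb logic into the current cycle
        if cycles * T < best_latency:
            best_latency, best_T = cycles * T, T
        T += step
    return best_T, best_latency

# Illustrative chain: lookup (2.5 ns), adder (1.0 ns), lookup (2.5 ns), shifter (0.8 ns).
print(best_clock_period([("mem", 2.5), ("logic", 1.0), ("mem", 2.5), ("logic", 0.8)], 0.5))
# -> (3.5, 7.0): two 3.5-ns cycles, each absorbing the logic that follows a lookup
```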

B. Energy-Efficient Configuration of RAM Blocks

In this subsection, the packing stage is elaborated in more detail. An important part of packing is to derive the optimal energy configuration of all the memory blocks instantiated in an application. For the energy-efficient mapping of memory blocks in an FPGA, there are two main approaches which the designer can consider [24].

1) The dynamic power of a static RAM (SRAM) EMB is dominated by the power consumed during bit-line precharging. To remove the unnecessary bit-line precharging of the unused memory blocks, designers can control a clock enable signal, shown in Fig. 6(a). This low-power approach is extremely helpful in applications where a particular SRAM block is accessed only once in many cycles.

Algorithm 1 Procedure Memory-Logic Timing
Determine the multi-cycle timing assignment, and generate the optimal clock period and latency for an application
Inputs: Packed resources for an application, timing step = S
Outputs: Multi-cycle timing assignment, optimal cycle time (CC), and latency

1: T = minCC = access time for the largest memory block;
2: maxCC = max(delay between any two memory ops, a memory op & primary in/out);
3: StepSize = S; Latency = infinity;
4: MaxIndex = ceil((maxCC − minCC)/S);
5: for (index = 0; index < MaxIndex; index++)
6:   T = T + StepSize ∗ index; NumCycle = 0;
7:   Mark all logic ops Li uncovered;
8:   for each memory operation Mi
9:     Assign a clock boundary at the input of Mi; NumCycle++;
10:    if (slack = T − accessTimeOf Mi) > 0
11:      repeat
12:        Assign logic op Lj following Mi into the clock cycle for Mi
13:        Mark Lj covered;
14:      until slack < 0
15:    endif
16:  endfor
17:  for each uncovered Li
18:    Assign a clock boundary at the end of Li; NumCycle++;
19:    if (slack = T − latencyOf Li) > 0
20:      repeat
21:        Assign logic op Lj following Li into the clock cycle for Li
22:        Mark Lj covered;
23:      until slack < 0
24:    endif
25:  endfor
26:  NewLatency = NumCycle ∗ T;
27:  if (NewLatency < Latency)
28:    Latency = NewLatency
29:    Save the timing assignment as the best one;
30:  endif
31: endfor

2) A single large memory block is broken into multiple small sub-banks with additional predecoding, so that only a single or a few smaller sub-banks are accessed during a memory access, keeping the unaccessed memory blocks turned off. The EMBs provided in the Stratix family [5] can be configured with a number of memory block depth and width combinations. Designers need to find a minimum-energy configuration for different sizes of memory, because a larger number of sub-banks results in a significant amount of routing energy for the predecoding logic. Fig. 6(b) and (c) show two possible implementations of a 64 k × 16 EMB. The implementation in Fig. 6(b) decomposes the 64 k × 16 logical RAM block into four physical RAM blocks, each of output size 16. As a result, for each memory access, only one of the memory blocks is precharged and the rest are turned off. For the memory architecture shown in Fig. 6(c), the physical memory blocks have an output size of 4, which means that all the memory blocks must be turned on for each memory access. The energy-optimal mapping results for a 4 k × 12 memory block are shown in Fig. 7(a), and the results for a 64 k × 16 memory block are shown in Fig. 7(b).

Fig. 6. (a) Trimatrix memory block with one address, one clock, and one combined clock enable port. (b) Implementation of a large memory block (16 k × 16) using smaller RAM blocks with additional predecoding to save access energy. (c) Alternative implementation of the same memory, which incurs more access energy.

C. Mapping Complex Datapath in Memory

For operands with smaller bit-width, when complex datapaths having more than two sequential computations are mapped to EMBs, the memory-based computations consume significantly smaller energy compared with the normal logic or DSP-based mapping. The datapath for two constant coefficient multiplications followed by an addition (constmult-add) has been optimally mapped to EMBs across different technology generations (Stratix II–IV, and Virtex VI), and the implementation results are compared with the conventional mappings to normal logic and DSP blocks. Fig. 8 shows the energy consumed per constmult-add operation for the three different mapping approaches in the Stratix series [5] and Virtex VI [6] devices. Tables II and III also show the energy savings percentages and the operational latencies of our memory-based mappings, respectively, when compared with mapping to normal logic and DSP blocks. Power estimates for the Stratix family of FPGA devices have been obtained using the Quartus PowerPlay analyzer [25]. The input node toggle rate has been assumed to be 50% with vectorless estimation for the rest of the nodes. For the Xilinx Virtex VI results, XPower Analyzer [26] has been used with the same input toggle rate and vectorless estimation for the rest of the nodes. We can notice from Table II that the energy savings obtained with the proposed mapping decrease as the operand bit-width increases. All the implementations in memory, logic, and DSP have been done with area-optimal synthesis constraints.

Fig. 7. Energy consumed by a memory block with different types and block depths for (a) 12 × 12 memory and (b) 16 × 16 memory.

Fig. 8. Comparison of the energy consumed per constmult-add operation with varying input bit-widths in (a) Altera Stratix II, (b) Altera Stratix III, (c) Altera Stratix IV, and (d) Xilinx Virtex VI devices.


TABLE II
ENERGY IMPROVEMENTS FOR A CONSTMULT-ADD OPERATION WITH THE PROPOSED MAPPING APPROACH COMPARED TO MAPPING IN LOGIC AND DSP BLOCKS FOR STRATIX II–IV AND VIRTEX VI FPGA DEVICES

TABLE III
COMPARISON OF OPERATIONAL LATENCY OF THE CONSTMULT-ADD OPERATION AMONG THE THREE MAPPING APPROACHES FOR STRATIX II–IV AND VIRTEX VI FPGA DEVICES

Fig. 9. Energy consumed per CORDIC operation with varying input bit-width.

D. Mapping Complex Functions in Memory

Conventionally, complex transcendental functions are computed in an FPGA using CORDIC. Fig. 9 shows the improvement in energy consumption for the LUT-based implementation with respect to normal CORDIC-based univariable function computation in Stratix IV [5]. The maximum input size of a LUT is 16 bits. The energy improvement is minimum (3.3×) in the case of 16-bit operands and maximum (8.2×) in the case of 11-bit input operands. However, if the input bit-width does not match the size of the LUT, efficient decomposition techniques such as multipartite [19] and ATA [20] have to be employed to minimize the number of memory banks at the cost of increased latency. Fig. 10 shows the optimization results for energy consumption in Stratix IV when bipartite, tripartite, and quadpartite decompositions [19] of transcendental functions are employed for 24-bit inputs.

V. APPLICATION MAPPING RESULTS

In this section, the effectiveness of the proposed application mapping approach is evaluated by measuring the energy consumption for several compute-intensive applications. We have considered five applications, namely, finite impulse response (FIR) filtering [27], coherence calculation in a cluster [28], discrete wavelet transform (DWT) [29], third-order polynomial evaluation [30], and the solution to the Schrodinger equation [31].

Fig. 10. Energy consumed per different transcendental functions when bipartite, tripartite, and quadpartite decompositions are employed for 24-bit inputs.

We have used a commercial FPGA platform from Altera (Stratix IV) [5] in our experiments. We also used Altera's Design Space Explorer in the Altera Quartus II 11.0 tool suite to implement three different mapping approaches (logic-based, DSP-based, and the proposed heterogeneous mapping). In all three implementations, the mapping of logic functions is optimized for power under an iso-delay target using the Quartus Design Space Explorer. Typically, FPGA synthesis tools provide an option for logic-to-memory mapping during physical synthesis. However, the built-in option is provided only for area optimization instead of energy. In other words, the logic-to-memory mapping option in commercial FPGA synthesis tools generally maps some of the sequential elements to small memory blocks (e.g., MLAB in Stratix IV) to reduce area or resources, and it does not consider mapping multi-input complex datapaths/functions to large embedded memory arrays (e.g., the M144K block in Stratix IV). Thus, the main goal of the proposed heterogeneous mapping approach is quite different from the option provided in the commercial synthesis tool.

Fig. 11. With varying input bit-width, energy consumed per (a) eight-tap FIR filter, (b) coherence calculation, and (c) calculation of the approximation coefficient in DWT in Stratix IV technology.

TABLE IV
COMPARISON OF RESOURCE USAGE FOR THREE COMPLEX DATAPATH APPLICATIONS

A. Eight-Tap FIR Filter

Because of the speedups in complex datapaths such as constmult-add, heterogeneous mapping of the FIR filter gives a significant advantage over conventional FPGA mappings [27]. The energy consumption results for an eight-tap FIR filter are shown in Fig. 11(a) with varying input bit-width. For the proposed heterogeneous mapping of memory and logic, significant energy savings are achieved up to 7-bit inputs (ranging from 37.5% for 4-bit inputs to 9.1% for 7-bit inputs) when compared with the logic-only implementation. Table IV shows the resource usage for the eight-tap filter.
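A minimal sketch of how such an eight-tap filter could be organized under the heterogeneous mapping: each pair of taps becomes one constmult-add lookup table held in an EMB, and the remaining adder tree stays in logic. The coefficients and the 4-bit unsigned sample width below are illustrative assumptions, not the filter used in the experiments.

```python
# Hypothetical 8-tap FIR mapped as four constmult-add lookups plus an adder tree.
COEFFS = [3, -1, 4, 2, 2, 4, -1, 3]      # illustrative coefficients, not from the paper
BITS = 4

# One LUT per coefficient pair: address = (x_even << BITS) | x_odd (2^8 entries each).
LUTS = []
for c0, c1 in zip(COEFFS[0::2], COEFFS[1::2]):
    LUTS.append([c0 * (a >> BITS) + c1 * (a & (2**BITS - 1))
                 for a in range(2**(2 * BITS))])

def fir_output(samples):          # samples: the 8 most recent inputs, newest first
    total = 0
    for i, lut in enumerate(LUTS):
        addr = (samples[2 * i] << BITS) | samples[2 * i + 1]
        total += lut[addr]        # EMB lookup replaces two multiplies and an add
    return total                  # the remaining additions map to FPGA logic

print(fir_output([1, 2, 3, 4, 5, 6, 7, 8]))   # 72
```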

B. Coherence Calculation in a Cluster

During the iterative procedure of k-means clustering, the coherence in a cluster has to be calculated to measure the quality of clustering [28]. According to the experimental results, mapping the coherence calculation to EMBs shows significant energy savings up to 6-bit inputs over a DSP-based implementation (ranging from 54.5% for 4-bit inputs to 16.7% for 6-bit inputs). Compared with the logic-only implementation, the proposed approach shows energy savings up to 8-bit inputs (ranging from 63.0% for 4-bit inputs to 33.3% for 8-bit inputs). The advantages in energy over different input bit-widths are shown in Fig. 11(b). The resource usage for different input bit-widths is shown in Table IV.

C. Calculation of Approximation Coefficient in DWT

The approximation computation in a DWT consists of two constant coefficient multiplications of odd samples followed by one addition [29]. Using the proposed memory-based computation, the first level of multiplications and the first addition are mapped in memory, whereas the rest of the datapath is mapped spatially in FPGA logic. The energy consumption results with varying input operand bit-widths are shown in Fig. 11(c). Energy savings are achieved up to 6-bit inputs compared with the logic-only implementation (ranging from 23.1% for 4-bit inputs to 33.3% for 6-bit inputs). The resource usage for different input bit-widths and mapping methodologies is shown in Table IV.

D. Third-Order Polynomial Evaluation

Newton Raphson’s (NR) method is widely used for theevaluation of roots for polynomial functions. Consider athird-order polynomial given by [30]

f(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0.    (1)


TABLE V

COMPARISON OF ENERGY, LATENCY, AND EDP FOR SEVERAL COMMON APPLICATIONS

The NR method employs the following iteration formula to move closer to the actual root:

X_{n+1} = X_n − f(X_n)/f'(X_n) = X_n − (a_3 X_n^3 + a_2 X_n^2 + a_1 X_n + a_0) / (3 a_3 X_n^2 + 2 a_2 X_n + a_1).    (2)

For a 24-bit input, an implementation with the ATA method [20] in the proposed framework needs 106 KB of memory, and it shows an average energy consumption of 0.2 nJ with a latency of 8.2 ns. On the other hand, the DSP-based mapping takes a latency of 55 ns and consumes 1 nJ of energy, while the logic-based implementation takes a similar latency of 55 ns but consumes 2.4 nJ of energy.
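For reference, a minimal Python sketch of the NR iteration in (2); in the proposed framework, the polynomial and derivative evaluations in each iteration are the parts realized through memory lookups (via the ATA decomposition), while the sketch below simply evaluates them in software with illustrative coefficients.

```python
def nr_root(a3, a2, a1, a0, x, iters=10):
    """Newton-Raphson iteration of (2) for f(x) = a3*x^3 + a2*x^2 + a1*x + a0."""
    for _ in range(iters):
        fx = a3 * x**3 + a2 * x**2 + a1 * x + a0
        dfx = 3 * a3 * x**2 + 2 * a2 * x + a1
        x = x - fx / dfx
    return x

# Illustrative polynomial (x-1)(x-2)(x-3); starting near 3.5 converges to the root 3.
print(nr_root(1, -6, 11, -6, x=3.5))
```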

E. Solution to Schrodinger Equation (1-D)

A very common scientific application dominated by transcendental functions is finding the solution of a time-independent Schrodinger wave equation for arbitrary periodic potentials. For a single dimension, the function is given by [31]

ψ_n(x) = √(2/L) · sin(nπx/L).    (3)

An FPGA-based evaluation of this function using the heterogeneous mapping with 24-bit inputs needs 193.96 KB of memory. The decomposition of the complex functions has been performed using the energy-efficient tripartite decomposition [19]. The heterogeneous mapping-based implementation has a latency of 25.8 ns and an average energy consumption of 0.7 nJ. The conventional logic-only implementation has a latency of 308.7 ns with 5.9 nJ of energy consumption, whereas the Schrodinger equation cannot be mapped into the DSP-based implementation. Because of the large latency of CORDIC-based computation, the proposed framework has significant energy savings over the conventional FPGA-based mapping, as shown in Table V.

F. Discussion

The simulation results show that the proposed approach can provide considerable energy savings for application kernels from diverse domains up to a specific bit-width. To analyze how the approach scales for larger designs, we mapped a 32-tap FIR filter using the proposed approach and compared the energy consumption results with two alternative implementations. Fig. 12 shows the comparison for varying input bit-width. We can observe that, for the proposed heterogeneous mapping of memory and logic, significant energy savings are achieved up to 6-bit inputs (ranging from 60.2% for 4-bit inputs to 47.7% for 6-bit inputs) when compared with the logic-only implementation. Compared with DSP-based mapping, the energy saving is 58.4% for 4-bit inputs and 56.0% for 5-bit inputs. In terms of resource usage, as observed in Table IV, the proposed approach reduces the resource usage in terms of logic elements, while increasing the memory resource requirements. As presented in Table VI, the three different mapping approaches show comparable mapping times for various DSP applications. The requirement of EMBs depends on the number of unique lookup operations and the size of the LUTs. We believe that the reduced logic resource requirement can be advantageous in many applications (e.g., signal processing applications) where application mapping is limited by combinational LUTs instead of embedded memory. For example, mapping of a large number of parallel instances of DCT for high-throughput image processing is limited by combinational LUTs. In this scenario, the proposed mapping approach can improve resource utilization considerably by using the idle EMBs.

Fig. 12. Energy consumed by a 32-tap FIR filter for varying input bit-widths.

VI. DYNAMIC ENERGY/ACCURACY TRADEOFF

In general, multimedia and signal processing algorithms aim at keeping the distortion of the output data as low as possible for a given input bit-width. However, in many of these applications, the best data quality may not always be required, and the output quality can be willingly traded off against computation energy to prolong the battery lifetime [32], [33].

As demonstrated in the previous sections, the opportunistic mapping of computation to memory can provide significant energy improvement in various compute-intensive operations. We can leverage this observation to develop an efficient approach for dynamically improving energy efficiency by switching to a lower operand bit-width at the expense of graceful degradation in output quality. At the lower bit-width, we employ the proposed heterogeneous mapping to significantly reduce the energy requirement compared with that of a logic-based mapping at the normal (higher) bit-width. The tradeoff between energy and output quality can be easily realized by truncating the operand bit-width such that the required memory size for mapping complex functions is exponentially reduced, leading to considerable energy saving. To minimize the effect of truncation on output quality degradation, we employ judicious selection of the truncated bits. As a case study of the energy and output-accuracy tradeoff, we have considered a common signal processing task, namely FIR filtering [34]. We have evaluated the proposed scheme for two different FIR architectures based on: 1) conventional arithmetic and 2) distributed arithmetic.

TABLE VI
MAPPING TIME IN SECONDS FOR THE APPLICATIONS USING DIFFERENT MAPPING APPROACHES ON STRATIX IV WITH QUARTUS 11.0

Fig. 13. Mean square errors (MSEs) with different bit allocations at the truncated bits for a 32-tap FIR filter.

A. Conventional Arithmetic-Based FIR Filter

As presented in Fig. 11(a), the heterogeneous mapping of FIR filtering shows energy savings when the input bit-width is smaller than or equal to 6. We can notice from the results that, even when the input bit-width of the FIR filter is larger, we can still obtain energy savings by truncating the operand bit-width, although the filter output quality degrades with the truncation. To minimize the output-accuracy degradation, the following truncation approaches are investigated.

1) Uniform Input Truncation: The simplest way is to equally truncate the input operands for all filter taps. The truncated bits can be filled with all zeros, or we can carefully allocate optimal values considering the input distribution to minimize the output-accuracy degradation. As a case study, we used a 32-tap low-pass equiripple FIR filter with a pass-band frequency of 9.6 kHz and a stop-band frequency of 12 kHz. Random 8-bit Gaussian inputs are used, and 3 bits from the LSB are truncated. The truncated bits are padded with all zeroes, or different three-bit values are allocated. MSE is used as a measure of output-accuracy, and Fig. 13 shows the simulation results. We can notice from the figure that a 3′b011 or 3′b100 allocation at the truncated positions gives rise to the minimum errors.
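A minimal Python sketch of this experiment: truncate the three LSBs of every input, try each candidate fill value at the truncated positions, and compare the filter-output MSE against the untruncated reference. The taps and input statistics below are illustrative stand-ins for the paper's equiripple design; with roughly uniform low-order bits, the MSE is minimized near the mid-range fills 3′b011 and 3′b100.

```python
import random

# Sketch of the uniform-truncation study: truncate the 3 LSBs of 8-bit inputs
# and compare candidate fill values at the truncated positions by output MSE.
random.seed(0)
taps = [random.uniform(-0.2, 0.2) for _ in range(32)]              # illustrative taps
inputs = [min(255, max(0, int(random.gauss(128, 30)))) for _ in range(4096)]

def fir(xs, coeffs):
    return [sum(c * xs[n - k] for k, c in enumerate(coeffs))
            for n in range(len(coeffs), len(xs))]

exact = fir(inputs, taps)
for fill in range(8):                       # candidate values for the 3 truncated bits
    trunc = [(x & ~0b111) | fill for x in inputs]
    approx = fir(trunc, taps)
    mse = sum((a - b) ** 2 for a, b in zip(exact, approx)) / len(exact)
    print(f"fill = 3'b{fill:03b}  MSE = {mse:.3f}")
```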

Fig. 14. Stop-band ripple magnitude change when each of the 32 FIR filter coefficients is changed to zero.

Fig. 15. MSE versus energy plots when the proposed heterogeneous mapping approach is used to implement the conventional 32-tap FIR filters with three different truncations.

2) Preferential Input Truncation: Among the 32 coefficients of the example FIR filter, some coefficients play a more important role in shaping the filter response than others [35]. Fig. 14 shows the stop-band ripple magnitude change when each of the 32 FIR filter coefficients is changed to zero. In this figure, a larger stop-band ripple change means that the corresponding coefficient is relatively more important than the coefficients generating smaller ripple changes. According to the importance differences found among the 32 coefficients, we formed four coefficient groups (most significant, first order, second order, and third order). When we make a tradeoff by truncating the input bit-width of the FIR filter, inputs with larger bit-widths are allocated to the more important coefficients, while smaller bit-widths are used for the less important coefficients. Three truncation modes (config.1, config.2, and config.3) are used in our experiments with an original input bit-width of 8. In config.1, 1-bit truncation is used for the inputs of the most significant coefficients shown in Fig. 14, 2-bit truncation for those of the first-order coefficients, and 3-bit and 4-bit truncation for those of the second-order and third-order coefficients, respectively. In config.2 and config.3, one more and two more bits of truncation than config.1 are used for the corresponding groups of coefficients. Fig. 15 shows the MSE versus energy plots for the three different truncation approaches. For the zero value allocation presented in Section VI-A-1, 1′b0, 2′b00, and 3′b000 are padded at the uniform 1-bit, 2-bit, and 3-bit truncations, respectively. For the optimal value allocation, 1′b1, 2′b01, 3′b011, and 4′b1000 are used at the corresponding truncated bit positions. As shown in Fig. 15, under an iso-energy condition of 100 pJ, the MSE for preferential truncation is 7.8, while the optimal value allocation and zero value allocation show MSEs of 88.2 and 280.8, respectively.

TABLE VII
RESOURCE USAGE AND ENERGY CONSUMPTION FOR A 32-TAP FIR FILTER USING HETEROGENEOUS, LOGIC-BASED, AND DSP-BASED MAPPING USING CONVENTIONAL ARITHMETIC

Table VII shows the resource requirements for the uniform truncation and preferential truncation approaches implemented through heterogeneous, logic-based, and DSP-based mapping. For the four modes of uniform truncation described in Table VII, heterogeneous mapping provides an average of 43.0% improvement in energy consumption compared with the logic-based mapping. However, it shows a slight degradation in average energy compared with the DSP-based mapping because of the high energy consumption for 1-bit truncation through heterogeneous mapping. Similarly, with preferential truncation, heterogeneous mapping provides an average of 60.3% improvement in energy consumption over the logic-based mapping and a 36.6% improvement over the DSP-based mapping.
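A minimal sketch of how a preferential truncation plan could be derived, assuming the per-coefficient importance metric of Fig. 14 is available as a list; the grouping into four equal-sized groups and the 1/2/3/4-bit budgets mirror config.1, but the helper below is a hypothetical illustration, not the tool used in the paper.

```python
def truncation_plan(importance, budgets=(1, 2, 3, 4)):
    """importance[i]: stop-band ripple change when tap i is zeroed (larger = more
    important). Returns the number of LSBs to truncate for each tap's input."""
    order = sorted(range(len(importance)), key=lambda i: -importance[i])
    group_size = len(importance) // len(budgets)
    plan = [0] * len(importance)
    for rank, tap in enumerate(order):
        group = min(rank // group_size, len(budgets) - 1)
        plan[tap] = budgets[group]      # most important taps lose the fewest bits
    return plan

# Illustrative importance values for an 8-tap example (not the paper's data).
print(truncation_plan([0.9, 0.1, 0.5, 0.3, 0.3, 0.5, 0.1, 0.9]))
# [1, 4, 2, 3, 3, 2, 4, 1]
```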

B. Distributed Arithmetic-Based FIR Filter

The input–output relationship of the linear time-invariant FIR filter is expressed in the following equation:

y(n) = Σ_{k=1}^{K} A_k ∗ x_k(n)    (4)

where y(n) is the output at time n, x_k(n) is the kth input at time n, and A_k is the kth coefficient. The input x_k can be expressed in 2's complement fractional form as

x_k = −x_{k0} + Σ_{b=1}^{B−1} x_{kb} 2^{−b}    (5)

where x_{kb} is a binary number. The equation can be extended to the following:

y = Σ_{k=1}^{K} A_k [ −x_{k0} + Σ_{b=1}^{B−1} x_{kb} ∗ 2^{−b} ].    (6)

Fig. 16. MSE versus energy plots when the proposed heterogeneous mapping approach is used to implement the DA-based 32-tap FIR filters with three different truncations.

Equation (6) can be extended to

$$
\begin{aligned}
y = {} & -\left[ x_{10} A_1 + x_{20} A_2 + x_{30} A_3 + \cdots + x_{K0} A_K \right] \\
       & + \left[ x_{11} A_1 + x_{21} A_2 + x_{31} A_3 + \cdots + x_{K1} A_K \right] 2^{-1} \\
       & + \cdots \\
       & + \left[ x_{1(B-1)} A_1 + x_{2(B-1)} A_2 + \cdots + x_{K(B-1)} A_K \right] 2^{-(B-1)}. \qquad (7)
\end{aligned}
$$

When the DA-based FIR filter [36] is mapped using the proposed heterogeneous mapping framework, the terms in the brackets of (7) are realized using memory lookups, referred to as distributed arithmetic lookup tables (DA-LUTs). The remaining shifts and additions are mapped onto combinational logic in the FPGA.
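A minimal Python sketch of this lookup-and-accumulate flow is given below, assuming unsigned fractional inputs so that the sign-bit term of (7) can be omitted; build_da_lut() and da_fir() are hypothetical helper names, and a real memory-block mapping would partition the taps into small groups rather than address one 2^K-entry table.

```python
# Minimal sketch of the DA lookup-and-accumulate flow in (7), assuming unsigned
# fractional inputs (the sign-bit plane of (7) is handled the same way, with a
# subtraction). build_da_lut() and da_fir() are hypothetical helper names; a
# memory-block mapping would partition the K taps into small groups instead of
# using one 2^K-entry table.

def build_da_lut(coeffs):
    """Precompute sum_k x_kb * A_k for every possible K-bit bit-plane pattern."""
    k = len(coeffs)
    return [sum(a for bit, a in enumerate(coeffs) if (addr >> bit) & 1)
            for addr in range(1 << k)]

def da_fir(samples, coeffs, lut, B=8):
    """Evaluate y by looking up each bit-plane and shift-adding the results."""
    assert len(samples) == len(coeffs)
    y = 0.0
    for b in range(B):                         # bit-plane index, MSB first
        addr = 0
        for k, x in enumerate(samples):        # gather bit b of every input
            addr |= ((x >> (B - 1 - b)) & 1) << k
        y += lut[addr] * 2.0 ** (-(b + 1))     # weight 2^-(b+1) for this plane
    return y

# A 4-tap example keeps the DA-LUT at 2^4 = 16 words.
coeffs = [0.25, 0.5, -0.125, 0.75]
lut = build_da_lut(coeffs)
print(da_fir([200, 13, 77, 5], coeffs, lut))   # matches sum(A_k * x_k / 256)
```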

1) Uniform Input Truncation: When uniform input truncation is used with heterogeneous mapping, the number of memory blocks decreases by one with each bit truncated at the input. A smaller number of memory blocks in the FPGA results in lower latency and energy.

Zero Value Allocation: Padding zeros at the truncated bits is the simplest approach to truncating inputs. With an original input bit-width of 8 and a memory word size of 16, 1-bit truncation at the input reduces the required memory size from 16 to 14 KB for the 32-tap FIR filtering operation, since each truncated input bit eliminates one bit-plane lookup table (2 KB here).

Optimal Value Allocation: The optimal values to be inserted at the truncated bit positions in the DA-based FIR filter are the same as those used in the conventional FIR filter.

2) Preferential Truncation: Preferential truncation is implemented in exactly the same way as for the conventional FIR filter. The preferential truncation approach is more effective when it is applied to the higher-order digits of the inputs.

Fig. 17. Comparison of energy consumption for the conventional versus DA-based implementations of FIR filters when implemented using (a) heterogeneous mapping, (b) logic-based mapping, and (c) DSP-based mapping.

Fig. 16 shows the MSE versus energy plots when the proposed heterogeneous mapping approach is used to implement the DA-based 32-tap FIR filter with the three different truncation methods. At an iso-energy consumption of 400 pJ, the MSE of the filter with preferential truncation is 7.75, while the zero value allocation and optimal value allocation architectures show MSEs of 293.7 and 92.5, respectively.

C. Conventional Versus DA-Based Implementation

Fig. 17 shows the comparison of energy consumption for the conventional versus DA-based FIR filter implementations when the three different mappings are employed. We can observe from the figure that the DA-based FIR filter generally consumes less energy (because of the smaller memory size) than the conventional FIR filter when the heterogeneous mapping approach is used. With logic-based mapping, the conventional FIR filter has lower energy consumption than the DA-based filter at smaller input bit-widths, whereas the DA-based architecture shows smaller energy consumption at larger operand bit-widths. For DSP-based mapping, because of the fine-grained nature of the mapping scheme, the number of DSP blocks in the DA implementation is larger than in the conventional FIR filter. It is interesting to note that an eight-tap DA-based EMB implementation achieves 2.4× energy savings compared to an eight-tap conventional-arithmetic (CA) DSP implementation.

VII. CONCLUSION

We have presented a heterogeneous application mapping framework for FPGA, where the conventional application mapping to logic and DSP blocks is effectively combined with judicious mapping of complex operations to EMBs for energy saving. Software algorithms for functional decomposition, fusion of operations, packing of individual coarse-grained operations, and timing assignment are used to minimize energy consumption. Unlike existing energy-saving approaches based on device-circuit-architecture level optimizations, the proposed approach does not require any modification of the FPGA device at any level. The approach has been evaluated on a commercial FPGA platform. Experimental results show that the proposed heterogeneous mapping process can significantly improve energy efficiency for many compute-intensive applications at specific operand bit-widths. We have also presented an efficient energy/output-accuracy tradeoff approach, where higher energy savings can be achieved at the expense of minor degradation in output quality through operand truncation. We have presented a preferential truncation method, which aims at minimizing the effect of truncation on the output while maintaining the energy-saving advantage.

With the continued scaling of memory and the emergence of novel high-density memory technologies, both volatile and nonvolatile, FPGA devices are expected to integrate larger memories with improved performance. This will increase the benefit of the proposed mapping approach by allowing more operations with larger bit-widths to be mapped into memory. The proposed heterogeneous mapping approach can be easily integrated with existing mapping flows and synthesis tools. For the seamless integration of both approaches in a unified mapping framework, the timing requirements of memory and logic operations must be met; we have presented a multicycle timing approach to address this issue. Future work will include extending the framework to emerging FPGA platforms with nonvolatile memory and supporting different precisions and data formats.

REFERENCES

[1] (2010). Embedded Intel Solutions [Online]. Available: http://www.embeddedintel.com/news.php?article=276

[2] A. Rahman, S. Das, A. P. Chandrakasan, and R. Reif, "Wiring requirement and three-dimensional integration technology for field programmable gate arrays," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 11, no. 1, pp. 44–54, Feb. 2003.

[3] Achieving Low Power in 65-nm Cyclone-III FPGAs, Altera, San Jose, CA, USA, 2007.

[4] F. Li, D. Chen, L. He, and J. Cong, "Architecture evaluation for power-efficient FPGAs," in Proc. Int. Symp. Field Program. Gate Arrays, 2003, pp. 175–184.

[5] (2013). Stratix FPGA: Low Power, High Performance [Online]. Available: http://www.altera.com/devices/fpga/stratix-fpgas/stratix/stratix/stx-index.jsp

[6] (2013). Xilinx Virtex Series of FPGAs [Online]. Available: http://www.xilinx.com/products/index.htm

[7] T. Good and M. Benaissa, "AES on FPGA from the fastest to the smallest," in Cryptographic Hardware and Embedded Systems. New York, NY, USA: Springer-Verlag, 2005.

[8] J. Cong and S. Xu, "Technology mapping for FPGAs with embedded memory blocks," in Proc. Int. Symp. Field Program. Gate Arrays, 1998, pp. 179–188.


[9] S. Wilton, "SMAP: Heterogeneous technology mapping for area reduction in FPGAs with embedded memory arrays," in Proc. Int. Symp. Field Program. Gate Arrays, 1998, pp. 171–178.

[10] R. Gutierrez, V. Torres, and J. Valls, "FPGA-implementation of atan(Y/X) based on logarithmic transformation and LUT-based techniques," J. Syst. Archit., vol. 56, no. 11, pp. 588–596, 2010.

[11] S. Chin, C. Lee, and S. Wilton, "Power implications of implementing logic using FPGA embedded memory arrays," in Proc. Int. Conf. Field Program. Logic Appl., 2006, pp. 1–8.

[12] J. Lamoureux and W. Luk, "An overview of low power techniques for field programmable gate arrays," in Proc. NASA/ESA Conf. Adapt. Hardw. Syst., 2008, pp. 338–345.

[13] W. G. Osborne, J. Coutinho, W. Luk, and O. Mencer, "Power-aware and branch-aware word-length optimization," in Proc. 16th Int. Symp. Field Program. Custom Comput. Mach., Apr. 2008, pp. 129–138.

[14] C. T. Chow, L. S. M. Tsui, P. H. W. Leong, W. Luk, and S. Wilton, "Dynamic voltage scaling for commercial FPGAs," in Proc. IEEE Int. Conf. Field Program. Technol., Dec. 2005, pp. 173–180.

[15] M. Klein, "Power consumption at 40 and 45 nm," Xilinx Inc., San Jose, CA, USA, Tech. Rep. WP298, 2009.

[16] (2009). FPGA Logic Cells Comparison [Online]. Available: http://www.1-core.com/library/digital/fpga-logic-cells/

[17] (2006). FPGA Architecture [Online]. Available: http://www.altera.com/literature/wp/wp-01003.pdf

[18] S. Choi, R. Scrofano, V. K. Prasanna, and J. Jang, "Energy efficient signal processing using FPGAs," in Proc. 11th Int. Symp. Field Program. Gate Arrays, 2003, pp. 225–234.

[19] F. D. Dinechin and A. Tisserand, "Multipartite table methods," IEEE Trans. Comput., vol. 54, no. 3, pp. 319–330, Mar. 2005.

[20] W. F. Wong and E. Goto, "Fast evaluation of the elementary functions in single precision," IEEE Trans. Comput., vol. 44, no. 3, pp. 453–457, Mar. 1995.

[21] (2006). The Landscape of Parallel Computing Research: A View from Berkeley [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

[22] W. B. Ligon, G. Monn, D. Stanzione, F. Stivers, and K. D. Underwood, "Implementation and analysis of numerical components for reconfigurable computing," in Proc. IEEE Aerosp. Appl. Conf., Mar. 1999, pp. 325–335.

[23] C. Brunelli, H. Berg, and D. Guevorkian, "Approximating sine functions using variable precision Taylor polynomials," in Proc. IEEE Workshop Signal Process. Syst., Oct. 2009, pp. 57–62.

[24] R. Tessier, V. Betz, D. Neto, A. Egier, and T. Gopalsamy, "Power-efficient RAM mapping algorithms for FPGA embedded memory blocks," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 26, no. 2, pp. 278–290, Feb. 2007.

[25] PowerPlay Power Estimator, Altera Corporation, San Jose, CA, USA, Jul. 2012.

[26] XPower Estimator User Guide, Xilinx Inc., San Jose, CA, USA, Jan. 2012.

[27] N. Sankarayya, K. Roy, and D. Bhattacharya, "Algorithms for low power and high speed FIR filter realization using differential coefficients," IEEE Trans. Circuits Syst., vol. 44, no. 6, pp. 488–497, Jun. 1997.

[28] C. Ding and X. He, "K-means clustering via principal component analysis," in Proc. Int. Conf. Mach. Learn., 2004, p. 29.

[29] S. Narasimhan, H. J. Chiel, and S. Bhunia, "Ultra-low power and robust digital-signal-processing hardware for implantable neural interface microsystems," IEEE Trans. Biomed. Circuits Syst., vol. 5, no. 2, pp. 169–178, Apr. 2011.

[30] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in Fortran. Cambridge, U.K.: Cambridge Univ. Press, 1992.

[31] (2009). Time Dependent Schrodinger Equation [Online]. Available: http://hyperphysics.phy-astr.gsu.edu/hbase/quantum/scheq.html

[32] V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy, "IMPACT: IMPrecise adders for low-power approximate computing," in Proc. Int. Symp. Low Power Electron. Design, 2011, pp. 409–414.

[33] K. Kunaparaju, S. Narasimhan, and S. Bhunia, "VaROT: Methodology for variation-tolerant DSP hardware design using post-silicon truncation of operand width," in Proc. 24th Int. Conf. VLSI Design, Jan. 2011, pp. 310–315.

[34] S. Lee, J. W. Choi, S. W. Kim, and J. Park, "A reconfigurable FIR filter architecture to trade off filter performance for dynamic power consumption," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 19, no. 12, pp. 2221–2228, Dec. 2011.

[35] N. Banerjee, J. H. Choi, and K. Roy, "A process variation aware low power synthesis methodology for fixed point FIR filters," in Proc. Int. Symp. Low Power Electron. Design, 2007, pp. 147–152.

[36] (2005). The Role of Distributed Arithmetic in FPGA-Based Signal Processing [Online]. Available: http://www.xilinx.com/appnotes/theory1.pdf

Anandaroop Ghosh received the B.E. (Hons.) degree in electrical engineering from Jadavpur University, Kolkata, India, in 2008. He is currently pursuing the M.S. degree with the Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA.

He was a Research Consultant with the Indian Institute of Technology (IIT), Kharagpur, India. His current research interests include exploring possibilities of hardware acceleration through memory-based reconfigurable frameworks and exploring efficient modeling and circuit design techniques in order to design an on-chip sensing scheme for a collection of DNA molecules utilizing silicon nanopores.

Somnath Paul (M'07) received the B.E. degree in electronics and telecommunication engineering from Jadavpur University, Kolkata, India, in 2005, and the Ph.D. degree in computer engineering from Case Western Reserve University, Cleveland, OH, USA.

He is currently a Research Scientist with Intel Labs, Intel Corporation. His current research interests include hardware-software co-design for energy efficiency, yield, and reliability in nanoscale technologies.

Dr. Paul is a recipient of the 2011 Outstanding Dissertation Award from the European Design and Automation Association and the 2012 Best Paper Award at the International Conference on VLSI Design.

Jongsun Park (M'05–SM'13) received the B.S. degree in electronics engineering from Korea University, Seoul, Korea, in 1998, and the M.S. and Ph.D. degrees in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 2000 and 2005, respectively.

He joined the Electrical Engineering Faculty, Korea University, in 2008. From 2005 to 2008, he was with the Signal Processing Technology Group, Marvell Semiconductor, Inc., Santa Clara, CA, USA. He was with the Digital Radio Processor System Design Group, Texas Instruments, Dallas, TX, USA, in 2002. His current research interests include variation-tolerant, low-power, and high-performance VLSI architectures and circuit designs for digital signal processing and digital communications.

Swarup Bhunia (M'05–SM'09) received the B.E. (Hons.) degree from Jadavpur University, Kolkata, India, the M.Tech. degree from the Indian Institute of Technology (IIT), Kharagpur, India, and the Ph.D. degree from Purdue University, West Lafayette, IN, USA, in 2005.

He is currently an Associate Professor of electrical engineering and computer science with Case Western Reserve University, Cleveland, OH, USA. He has over ten years of research and development experience with over 120 publications in peer-reviewed journals and premier conferences in the areas of VLSI design, CAD, and test techniques. His current research interests include low-power and robust design, hardware security and protection, adaptive nanocomputing, and implantable electronics.

Dr. Bhunia received the National Science Foundation Career Development Award in 2011, the Semiconductor Research Corporation Technical Excellence Award in 2005, several best paper awards and best paper nominations, and the SRC Inventor Recognition Award in 2009. He has served as an Associate Editor of the ACM Journal of Emerging Technologies, a Guest Editor of IEEE Design & Test of Computers in 2010, and on the technical program committees of a number of major IEEE/ACM conferences.