EJSR_49_4_15

    European Journal of Scientific Research

    ISSN 1450-216X Vol.49 No.4 (2011), pp. 634-641

    © EuroJournals Publishing, Inc. 2011

    http://www.eurojournals.com/ejsr.htm

    Slice Reduction Algorithm for Low Power and Area Efficient

    FIR Filter Using VLSI Design

    A. Hemalatha

    HOD, Dept of ECE, Periyar Centenary Polytechnic College, Vallam

    A. Shanmugam

    Principal, Bannari Amman Institute of Technology, Sathyamangalam

    Abstract

    Software-defined radio (SDR) is fast becoming a crucial element of wireless technology. SDR technology is predicted to replace many of the traditional methods of implementing transmitters and receivers while offering a wide range of advantages including adaptability, reconfigurability, and multifunctionality encompassing modes of operation, radio frequency bands, air interfaces, and waveforms. Research in this field is mainly directed towards improving the architecture and the computational efficiency of SDR systems. SDR refers to wireless communication in which the transmitter modulation and the receiver demodulation are both generated through software. The main advantage of this approach is flexibility, as the software runs on one common hardware platform for any type of receiver configuration. The most computationally intensive part of the wideband receiver of an SDR is the intermediate frequency (IF) processing block. Digital filtering is the main task in IF processing. The computational complexity of the finite impulse response (FIR) filters used in the IF processing block is dominated by the number of adders (and subtractors) in the coefficient multipliers. The proposed reconfigurable synthesized multiplier blocks offer significant area savings over traditional multiplier blocks for high-speed digital signal processor (DSP) systems implemented on field programmable gate array (FPGA) hardware platforms. In addition, software radio has recently gained much attention due to the need for integrated and reconfigurable communication systems. To this end, reconfigurability has become an important issue for future filter design. Previous research in this field has concentrated on minimizing multiplier block adder cost, but the results presented here demonstrate that this optimization goal does not minimize FPGA hardware. Minimizing multiplier block logic depth and pipeline registers is shown to have the greatest influence in reducing FPGA area cost. Fully pipelined, full-parallel transposed-form FIR filters with reconfigurable multiplier blocks were generated using the new and previous algorithms, implemented on an FPGA target, and the results compared. The proposed method offers average reductions in the adders and full adders needed for the coefficient multipliers over conventional FIR filter implementation methods.

    Introduction

    High-speed digital signal processor (DSP) systems are increasingly being implemented on field programmable gate array (FPGA) hardware platforms. This trend is being fuelled by insurmountable application-specific integrated circuit (ASIC) project costs and the flexibility and reconfigurability advantages of FPGAs over traditional DSPs and ASICs, respectively. More recently, Structured ASIC technology has yielded lower cost solutions than full custom ASIC by predefining several layers of silicon functionality, so that only a few fabrication layers need to be defined to implement the required design. However, the FPGA platform provides high performance and flexibility with the option to reconfigure, and it is the technology focused on for the remainder of this paper. There is a constant requirement for efficient use of FPGA resources, where occupying less hardware for a given system can yield significant cost-related benefits:

    (i) reduced power consumption;
    (ii) area for additional application functionality;
    (iii) potential to use a smaller, cheaper FPGA.

    Finite impulse response (FIR) digital filters are common DSP functions and are widely used in FPGA implementations. If very high sampling rates are required, full-parallel hardware must be used, where every clock edge feeds a new input sample and produces a new output sample. Such filters can be implemented on FPGAs using combinations of the general purpose logic fabric, on-board RAM and embedded arithmetic hardware. Full-parallel filters cannot share hardware over multiple clock cycles and so tend to occupy large amounts of resource. Hence, efficient implementation of such filters is important to minimize the hardware requirement. When implementing a DSP system on a platform containing dedicated arithmetic blocks, it is normal practice to utilize such blocks as far as possible in preference to any general purpose logic fabric. However, in some cases, there may not be enough blocks for the target application, or layout/routing constraints may prohibit their use. In such cases, the general purpose logic fabric must be utilized. In this paper, implementation of full-parallel filters using only the general purpose logic fabric is considered. Hence, the techniques presented here are also applicable to ASIC and Structured ASIC platforms. In Section 2, the filter type in use is described, followed by a discussion of multiplier block synthesis for low FPGA area in Section 3. A new multiplier block synthesis algorithm is presented in Section 4, followed by the generation and presentation of results (Section 5). Conclusions are given in Section 6.

    2. Transposed FIR Filter with Multiplier Block

    Figure 1 shows three full-parallel, fixed-coefficient FIR filter structures that are mathematically identical but differ in architecture. Derived from the standard FIR structure using cut-set retiming, the transposed FIR (Fig. 1b) yields an identical mathematical response but with several advantages for FPGA implementation:

    (i) no input sample shift registers are required since each sample is fed to each tap simultaneously;
    (ii) the pipelined addition chain maps efficiently;
    (iii) filter latency is reduced;
    (iv) identical tap coefficient magnitudes can share multiplication hardware because taps receive the input sample simultaneously.
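The equivalence of the standard and transposed structures can be checked numerically. The following is a minimal software sketch, not the authors' hardware description: the function names and the integer test data are assumptions made for illustration, and the list `chain` stands in for the registers of the pipelined addition chain.

```python
from typing import List


def fir_direct(coeffs: List[int], samples: List[int]) -> List[int]:
    """Standard FIR: a shift register of past inputs feeds a sum of products."""
    delay_line = [0] * len(coeffs)
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]  # shift the new sample in
        out.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return out


def fir_transposed(coeffs: List[int], samples: List[int]) -> List[int]:
    """Transposed FIR: every tap sees the current sample at the same time."""
    # chain[k] models the register holding the partial sum that feeds tap k.
    chain = [0] * len(coeffs)
    out = []
    for x in samples:
        products = [c * x for c in coeffs]  # all taps multiply the same input
        sums = [p + r for p, r in zip(products, chain)]
        out.append(sums[0])
        chain = sums[1:] + [0]  # register the partial sums for the next clock
    return out


if __name__ == "__main__":
    h = [3, -13, 21, 37]
    x = [1, 2, -3, 4, 0, 5, 7]
    print(fir_direct(h, x) == fir_transposed(h, x))
```

Because every tap multiplies the same input sample in `fir_transposed`, the list `products` is exactly what a shared multiplier block would have to supply.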


    Figure 1: Mathematically identical full-parallel FIR filter structures

    a Standard

    b Transposed

    c Transposed FIR with multiplier block

    Note that for maximum sampling rates, all multiplication hardware can be pipelined. In Fig. 1c, the coefficient multipliers of the transposed FIR have been replaced with a multiplier block (detailed in Section 3) that generates all required multiples of the filter input using cascaded adds, subtracts and shifts. This filter architecture is known to be a highly area efficient method of implementing fixed-coefficient, full-parallel FIR filters [1]. It is the multiplier block that determines filter implementation efficiency regardless of hardware platform. Effective synthesis of multiplier blocks for low FPGA area is the focus of this work.

    3. Multiplier Block Synthesis for Low FPGA Area

    3.1. Multiplication Hardware Operation and Area Estimation

    Figure 2 shows an example multiplier block (also referred to as a graph) that multiplies the input by 3, 13, 21 and 37 in two clock cycles (the logic depth of the block is also 2). Multiplication is achieved using only adds, subtracts and shifts, which map very efficiently to FPGA architectures. As an example, the input is fed to the ×3 adder untouched and after being left shifted once (multiplied by two). Hence, the output of the ×3 adder is 2x + x = 3x as required. This product can then be used as a graph output to be fed to the filter summation chain (refer to Fig. 1c) and, if required, routed internally to generate further multiples of the input. For efficiency, multiplier blocks need only generate positive, odd integers, since negative filter coefficient weightings can be restored at the summation chain by subtracting the positive equivalent generated by the multiplier block. Odd-valued block outputs can be left shifted en route to the summation chain to generate even-valued coefficient multiplications. Pipelining multiplier blocks ensures high clock rates are achieved when implemented on FPGA hardware. Note that multiplier blocks usually contain a mixture of adders and subtractors, but the adder cost of a block refers to the total number of adders and subtractors. Hence, the adder cost of Fig. 2 is 5. Note that adders may also be referred to as the graph vertices. In this paper, we use the Xilinx Virtex-II FPGA family [2] for implementation analysis and hence area will be measured in Virtex-II slices. The multiplier block in Fig. 2 is quoted as costing 25 slices. This is calculated by counting the number of flip-flops inferred by the multi-bit signals crossing pipeline boundaries and dividing by 2, since there are two flip-flops per slice. Equation (1) uses the set S, which contains the bit-widths of all N multi-bit signal pipeline boundary crossings, to obtain a slice estimate e:

    $$e = \frac{1}{2}\sum_{a=1}^{N} S_a \qquad (1)$$

    Figure 2: Multiplier block: five adders, two pipeline stages costing 25 slices
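The shift-and-add arithmetic described in Section 3.1 can be sketched in software. The wiring below is an assumption: it is one possible two-stage, five-adder arrangement that produces the multiples 3, 13, 21 and 37, and it is not taken from Fig. 2. The helper names `odd_fundamental` and `tap_product` are hypothetical and illustrate how negative and even coefficients can be recovered from positive, odd block outputs.

```python
from typing import Dict, Tuple


def multiplier_block(x: int) -> Dict[int, int]:
    """Multiples of x built from shifts, adds and subtracts only (assumed wiring)."""
    # Stage 1: two adders fed directly by the input.
    x3 = (x << 1) + x    # 2x + x
    x5 = (x << 2) + x    # 4x + x, used internally only
    # Stage 2: three adders/subtractors fed by the input and the stage 1 results.
    x13 = (x << 4) - x3  # 16x - 3x
    x21 = (x << 4) + x5  # 16x + 5x
    x37 = (x << 5) + x5  # 32x + 5x
    return {1: x, 3: x3, 13: x13, 21: x21, 37: x37}


def odd_fundamental(coefficient: int) -> Tuple[int, int, int]:
    """Split a nonzero coefficient into (sign, odd magnitude, left shift)."""
    if coefficient == 0:
        raise ValueError("coefficient must be nonzero")
    sign = -1 if coefficient < 0 else 1
    magnitude = abs(coefficient)
    shift = 0
    while magnitude % 2 == 0:
        magnitude //= 2
        shift += 1
    return sign, magnitude, shift


def tap_product(coefficient: int, x: int) -> int:
    """Coefficient times x, using only the block outputs, a shift and a sign."""
    sign, odd, shift = odd_fundamental(coefficient)
    return sign * (multiplier_block(x)[odd] << shift)


if __name__ == "__main__":
    print(tap_product(-26, 5))  # -26 = -(13 << 1)
```

In hardware the sign would be applied by using a subtractor in the summation chain and the shift would be plain wiring, so neither costs an extra adder inside the block.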

    3.2. Multiplier Block Synthesis Optimization Goals

    Multiplier block synthesis has received a great deal of attention in recent decades. The majority of research has concentrated on producing algorithms to synthesize multiplier blocks with the optimization goal of minimum adder cost. Bull and Horrocks [3] introduced the concept of representing multiplier blocks with graphs, showed the problem to be NP-complete and defined several minimal adder synthesis algorithms. Dempster and Macleod [4] identified limitations in this work and defined the n-dimensional reduced adder graph (RAG-n) algorithm, which is generally regarded as the primary reference for minimal adder multiplier block synthesis [5, p. 96]. Additional techniques using canonic signed digit (CSD) representation and subexpression sharing have also been proposed to minimize adder cost [6, 7]. Although the majority of research in this area has focused on full-parallel DSPs, recent work by Demirsoy et al. [8, 9] incorporates multiplexers to allow efficient FPGA implementation of time multiplexed filters and discrete cosine transform (DCT) processors. In [10], Dempster et al. defined the C1 synthesis algorithm with the optimization goal of minimizing multiplier block logic depth to reduce power consumption. C1 aims to minimize power consumed by reducing the number of logic transitions caused by long glitch paths through cascaded arithmetic logic. Note that logic toggling caused by glitch paths through more than one adder does not occur in fully pipelined multiplier blocks. In general, from a hardware perspective, the motivation for minimum adders has been to reduce filter complexity for very large scale integration (VLSI) implementation, where adder cost dominates the area requirement. However, FPGAs have a fixed architecture for implementing digital logic, not the blank canvas of VLSI/ASIC design. Hence, algorithms synthesizing multiplier blocks for low FPGA area must operate with regard to FPGA architectures for best results. In this paper, we show that the classic multiplier block synthesis goal of minimizing adders does not minimize FPGA hardware cost; we will, however, define a new algorithm that does.

    3.3. Comparing Adder Cost and Logic Depth for Low FPGA Area

    Figure 3 shows a multiplier block that generates the same multiples of the input as the block shown in Fig. 2. However, the Fig. 3 block uses only four adders, whereas Fig. 2 uses one more (five). Conversely, four logic levels are required for the Fig. 3 block and only two are required in Fig. 2. Most importantly, using (1), the Fig. 3 block requires 44 slices compared to the 25 slices of Fig. 2. This is due to the increased number of pipeline boundary crossings in Fig. 3 caused by the extra logic levels of the block. Hence, in this case, fewer adders does not mean less FPGA area. It should be noted that architecture-specific features may also influence area consumption. For example, a slight area reduction may be gained using the Virtex-II SRL16 shift registers to implement signal delays instead of individual flip-flops. However, such area reductions would not reduce the slice cost of Fig. 3 significantly. Also, there may be slices where a look-up table (LUT) is unused for multiplier block arithmetic logic but the corresponding flip-flop is in use to pipeline a signal. In such cases, synthesis/implementation tools may map logic for other functional blocks into the unused LUT. However, in a fully pipelined DSP data-path system, such functionality sharing of a single slice is less likely and would not significantly affect multiplier block area consumption. To summarise, the main research community goal of minimising adder cost is not applicable for minimising FPGA area cost. Instead, an algorithm performing synthesis for low FPGA area should aim to reduce signal pipeline crossings. This can be achieved by synthesising low-logic-depth multiplier blocks and minimising signal bit-widths as far as possible. Note that these low FPGA area optimisation principles apply to any FPGA architecture containing a LUT/flip-flop pair structure (including the Xilinx Virtex-II family selected in this instance).

    The new reduced slice graph multiplier block synthesis algorithm

    Figure 3: Multiplier block: four adders, four pipeline stages costing 44 slices
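The trade-off between adder cost and pipeline crossings can be reproduced with a small counting model. Everything in this sketch is an assumption rather than the authors' tool: the two graph wirings are hypothetical arrangements that merely match the captions (five adders in two stages, four adders in four stages), the input is taken to be 4 bits wide, a multiple v of the input is taken to need ceil(log2 v) extra bits, every adder output is registered, and every output is delayed to the final stage. Under these assumptions the flip-flop counts are 50 and 88, which by equation (1) correspond to the quoted 25 and 44 slices.

```python
from typing import Dict, Set, Tuple

# vertex value -> the two values it is built from (1 denotes the block input)
Graph = Dict[int, Tuple[int, int]]

GRAPH_DEPTH2: Graph = {3: (1, 1), 5: (1, 1), 13: (1, 3), 21: (1, 5), 37: (1, 5)}
GRAPH_DEPTH4: Graph = {3: (1, 1), 13: (1, 3), 21: (1, 13), 37: (1, 21)}
OUTPUTS: Set[int] = {3, 13, 21, 37}


def depth(graph: Graph, value: int) -> int:
    """Pipeline stage in which a value is produced (the input is stage 0)."""
    if value == 1:
        return 0
    return 1 + max(depth(graph, parent) for parent in graph[value])


def flip_flops(graph: Graph, outputs: Set[int], input_bits: int) -> int:
    """Total bits registered across all pipeline boundaries (assumed model)."""
    last_stage = max(depth(graph, v) for v in graph)
    total = 0
    for signal in [1, *graph]:
        first = max(depth(graph, signal), 1)  # first boundary the signal can cross
        if signal in outputs:
            last = last_stage                 # outputs are aligned to the final stage
        else:
            users = [v for v, parents in graph.items() if signal in parents]
            last = max((depth(graph, v) for v in users), default=0) - 1
        crossings = max(last - first + 1, 0)
        width = input_bits + (signal - 1).bit_length()  # input_bits + ceil(log2(signal))
        total += crossings * width
    return total


if __name__ == "__main__":
    for name, graph in (("depth 2", GRAPH_DEPTH2), ("depth 4", GRAPH_DEPTH4)):
        bits = flip_flops(graph, OUTPUTS, input_bits=4)
        print(name, "adders:", len(graph), "slice estimate:", bits / 2)
```

In the deeper graph the input and the early products have to be carried through several extra register stages, which is where the additional flip-flops come from despite the lower adder count.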

    4. Minimised Adder Graph Algorithm

    Before discussing the new reduced slice graph (RSG) synthesis algorithm, the minimised adder graph (MAG) algorithm must be introduced. MAG was defined by Dempster and Macleod [11] and generates minimum adder graphs consisting of shifts, adds and subtracts for implementing constant integer multiplication of individual values up to 12 bits (MAG was further extended to 19 bits by Gustafsson et al. [12]). Multiplication by a constant is common when implementing DSP on FPGAs, and efficient implementation has received attention from the research community. For example, Wirthlin and McMurtrey [13] utilise features of recent FPGA architectures to further reduce the area requirement of conventional constant coefficient multipliers. The implementation techniques described also allow multiplication constants to be changed, since it is not the structure of the logic that dictates multiplication values (as is the case with MAG) but rather the look-up table contents. Using the MAG algorithm, Dempster and Macleod demonstrate a 16% average reduction in adder cost over CSD representation and show CSD to reduce average adder cost by 33% over binary representation. From an FPGA perspective, for single coefficients, these adder reductions also translate into hardware savings. In general, for each integer value in range, MAG finds numerous graphs that all implement the required multiplication with the minimum number of adders/vertices. Dempster and Macleod suggest differentiating between these graphs by calculating the number of single-bit full adders required to implement each graph and selecting the graph requiring the fewest. This differentiation is motivated by attempting to minimise the VLSI area consumed when implementing the multiplication. MAG also generates a table containing the adder cost of each integer value in range.
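The comparison between binary and CSD representation can be made concrete with a short sketch. It is not the MAG algorithm: it only converts a positive constant to canonic signed digit form and counts the adders a straightforward implementation would need, namely one fewer than the number of nonzero digits. The function names are assumptions made for illustration.

```python
from typing import List


def csd_digits(n: int) -> List[int]:
    """Canonic signed digit form of n > 0, least significant digit first."""
    digits = []
    while n > 0:
        if n & 1:
            digit = 2 - (n & 3)  # +1 when n % 4 == 1, -1 when n % 4 == 3
            n -= digit           # leaves n divisible by 4, so the next digit is 0
        else:
            digit = 0
        digits.append(digit)
        n >>= 1
    return digits


def adders_binary(n: int) -> int:
    """Adders needed when every 1 bit of n contributes one shifted term."""
    return bin(n).count("1") - 1


def adders_csd(n: int) -> int:
    """Adders/subtractors needed when every nonzero CSD digit contributes one term."""
    return sum(1 for d in csd_digits(n) if d != 0) - 1


if __name__ == "__main__":
    for value in (7, 13, 45):
        print(value, csd_digits(value), adders_binary(value), adders_csd(value))
```

For 45 the CSD form still needs three adders, whereas a graph that reuses an intermediate product needs two (5x = 4x + x, then 45x = 8(5x) + 5x). That kind of reuse is what a graph-based search such as MAG can exploit and a digit representation cannot.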

    4.1. MAG Extensions and Modifications

    Our MAG implementation was extended beyond the 12-bit range (imposed by Dempster and Macleod for physical RAM constraints) using the generic graph extension cases illustrated in [11]. Also, we noted the potential for greater differentiation between the multiple graphs found by MAG for each integer value in range. Instead of using one level of differentiation with the single-bit full adder count metric described in [11], we implemented two metric levels (primary and secondary) to allow one best graph to be selected. For each integer, the primary metric selects the subset of graphs with minimum logic depth and, from that subset, the secondary metric selects the graph with the lowest sum of vertex values to minimize bit-widths. These new metrics reflect the low FPGA area observations described previously and, in general, leave only one best graph per integer, whereas the sole single-bit full adder metric selects multiple graphs from which an arbitrary choice must be made. Note that up to and including adder cost 3, for each integer in range, the modified MAG implementation stores and allows extension/branching from every graph found. This allows full searching of the graph space up to and including cost 4. From cost 4 graphs upwards (i.e. beyond 12 bits), only one best graph (as selected by the new primary/secondary metric system described previously) is stored and used per integer for extending to higher cost graphs. This restriction is for practical implementation concerns of available physical RAM and algorithm run-time.
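The two-level selection can be expressed as a lexicographic comparison. In this minimal sketch the `Candidate` record and the three candidate graphs for the constant 43 are hand-built illustrations, not output from the MAG implementation; each uses three adders and its construction is given in the comment.

```python
from typing import List, NamedTuple, Tuple


class Candidate(NamedTuple):
    vertices: Tuple[int, ...]  # values produced by the adders of the graph
    logic_depth: int           # number of cascaded adder levels


def select_best(candidates: List[Candidate]) -> Candidate:
    """Primary metric: minimum logic depth. Secondary metric: lowest vertex sum."""
    return min(candidates, key=lambda g: (g.logic_depth, sum(g.vertices)))


CANDIDATES_43 = [
    Candidate((3, 11, 43), 3),  # 3 = 2 + 1, 11 = 8 + 3, 43 = 4 * 11 - 1
    Candidate((7, 9, 43), 2),   # 7 = 8 - 1, 9 = 8 + 1, 43 = 4 * 9 + 7
    Candidate((3, 5, 43), 2),   # 3 = 2 + 1, 5 = 4 + 1, 43 = 16 * 3 - 5
]

if __name__ == "__main__":
    print(select_best(CANDIDATES_43))
```

The chained graph is rejected by the primary metric because of its depth of 3, and the secondary metric then prefers the vertices (3, 5, 43) over (7, 9, 43) because the smaller intermediate values imply narrower signals.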

    4.2. New RSG Algorithm Design

    Dempster and Macleod's RAG-n algorithm attempts to synthesize all required multiplications by initially placing all coefficients of adder cost 1 (determined using MAG) into the multiplier block and then building higher cost values using combinations of shifts, adds and subtracts of other adder outputs within the block. For example, in Fig. 3 (generated using RAG-n), the ×3 adder is synthesized first and all other coefficients are built from it. This approach leads to high logic depth blocks and hence pushes up the FPGA area requirement. Note that RAG-n synthesizes multiplier blocks in two stages:

    (i) optimal stage;
    (ii) suboptimal heuristic stage.

    If the multiplier block is fully synthesized after stage (i), Dempster and Macleod show that their algorithm ensures the absolute minimum number of adders to implement a given block. If stage (ii) is required, a suboptimal multiplier
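The build-up described for stage (i) can be illustrated with a simplified greedy sketch. This is not the published RAG-n algorithm: it does not use MAG cost tables and has no heuristic stage. It only repeats the step "add every remaining coefficient that one adder or subtractor can form from shifted values already in the block". All function names are assumptions.

```python
from typing import Iterable, List, Set, Tuple


def odd_part(n: int) -> int:
    """Strip factors of two, since shifts are free in hardware."""
    while n and n % 2 == 0:
        n //= 2
    return n


def one_adder_values(fundamentals: Set[int], limit: int) -> Set[int]:
    """Odd values up to limit of the form |(a << s) +/- b| for a, b in the block."""
    reachable = set()
    for a in fundamentals:
        for b in fundamentals:
            shift = 0
            # a and b are odd, so results grow with the shift once shift >= 1.
            while (a << shift) - b <= limit:
                for value in ((a << shift) + b, abs((a << shift) - b)):
                    value = odd_part(value)
                    if 0 < value <= limit:
                        reachable.add(value)
                shift += 1
    return reachable


def greedy_stage(coefficients: Iterable[int]) -> Tuple[List[int], List[int]]:
    """Return (values synthesized in order, values left for a heuristic stage)."""
    targets = {odd_part(abs(c)) for c in coefficients if c} - {1}
    limit = max(targets, default=1)
    fundamentals = {1}  # the block input
    order: List[int] = []
    while targets:
        found = sorted(targets & one_adder_values(fundamentals, limit))
        if not found:
            break  # nothing more is reachable with a single adder
        order.extend(found)
        fundamentals.update(found)
        targets.difference_update(found)
    return order, sorted(targets)


if __name__ == "__main__":
    print(greedy_stage([3, 13, 21, 37]))
```

For the coefficients 3, 13, 21 and 37 every value is reached with one adder each, four in total, which agrees with the adder count quoted for Fig. 3. The sketch makes no attempt to control logic depth.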


    5. Synthesis Algorithm Comparison Varying Coefficient Bit-Width

    VHDL filters were generated to compare RSG, RAG-n and C1 using the specifications of Section 3.2, with coefficient bit-width varied from 1 to 20 and filter length fixed at 51 taps. Note that the multiplier block solution space expands exponentially as coefficient width increases, meaning algorithm output is similar at lower widths. Also, as stated in Section 4.2, the MAG algorithm implementation only uses the full search space for integers up to adder cost 4. From cost 5 onwards, branching is only performed from the single best graph of each value for reasons of available RAM. Hence, from around 15-bit coefficients upwards (cost 5 upwards), synthesis algorithm results are likely to converge. Were the MAG algorithm allowed to store and branch from all graphs from cost 4 upwards, the difference in results between the algorithms shown around 12 bits would be expected to continue instead of converging at 15 bits. Figure 4a demonstrates that RSG requires less FPGA area in general, although Fig. 4b shows that RSG uses more adders in all cases. Figure 4c confirms that FPGA area is correlated with flip-flop usage. LUT data (Fig. 4e) is correlated with adder usage, with RSG using fractionally more LUTs in general. Multiplier block logic depth is shown in Fig. 4d. Results for filter generation with varying coefficient bit-width are shown in Table 1.

    Figure 4: Results for VHDL filter generation varying coefficient bit-width (length: 51 taps)

    a FPGA hardware area

    b Multiplier block adders

    c Flip-flop usage

    d Multiplier block logic depth

    e LUT usage


    Table 1: Filter generation varying coefficient bit-width

    6. Conclusions

    The classic research community optimisation metric of minimising multiplier block adder cost has been demonstrated not to minimise FPGA hardware for full-parallel pipelined FIR filters. Reducing flip-flop count through minimising multiplier logic depth has instead been shown to yield the lowest area solutions. The new RSG algorithm has been defined to embody this design principle. The results presented establish a clear area advantage of RSG over prior algorithms for typical filter parameters with comparable maximum clock rates. In addition, the industrial relevance of the transposed FIR with multiplier block architecture and the RSG algorithm has been established through comparison with filters implemented using the distributed arithmetic (DA) technique.

    References

    [1] Macpherson, K., Stirling, I., Rice, G., Garcia-Alis, D., and Stewart, R.: Arithmetic implementation techniques and methodologies for 3G uplink reception in Xilinx FPGAs. Third Int. Conf. on 3G Mobile Communication Technologies, 2002, (IEE Conf. Publ. no. 489), May 2002, pp. 191-195

    [2] Xilinx Inc., http://www.xilinx.com

    [3] Bull, D.R., and Horrocks, D.H.: Primitive operator digital filters, IEE Proc. G, Circuits Devices Syst., 1991, 138, (3), pp. 401-412

    [4] Dempster, A.G., and Macleod, M.D.: Use of minimum-adder multiplier blocks in FIR digital filters, IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., 1995, 42, (9), pp. 569-577

    [5] Meyer-Baese, U.: Digital signal processing with field programmable gate arrays (Springer-Verlag, Berlin, Heidelberg, 2001)

    [6] Gustafsson, O., and Wanhammar, L.: ILP modelling of the common subexpression sharing problem. 9th Int. Conf. on Electronics, Circuits and Systems, 2002, vol. 3, pp. 1171-1174

    [7] Y. Jang, S. Yang, "Low-power CSD linear phase FIR filter structure using vertical common subexpression". Electronics Letters, Vol. 38, Iss. 15, Jul 2002, pp. 777-779

    [8] A.G. Dempster, S.S. Demirsoy, I. Kale, "Designing multiplier blocks with low logic depth". Circuits and Systems, 2002. IEEE International Symposium on, Vol. 5, 2002, pp. V-773-V-776

    [9] A.G. Dempster, M.D. Macleod, "Constant integer multiplication using minimum adders". Circuits, Devices and Systems, IEE Proceedings, Vol. 141, Iss. 5, Oct 1994, pp. 407-413

    [10] K. Macpherson, "Low Hardware Cost, High Speed Full-Parallel FIR Digital Filters on Field Programmable Gate Arrays". PhD Thesis, University of Strathclyde, 2004

    [11] Xilinx Inc., "Distributed Arithmetic FIR Filter v8.0". http://www.xilinx.com

    [12] Synplicity Inc., http://www.synplicity.com