2 A Faster and Area Efficient FIR Filters using FPGA

Abstract:

This paper introduces a new algorithm that

synthesizes multiplier blocks so that the hardware

requirement is minimized as part of the

implementation of full-parallel finite impulse

response (FIR) filters. The techniques previously in

use are applicable to implementation on application-specific

integrated circuit (ASIC) and Structured

ASIC technologies; here, analysis is performed using field

programmable gate array (FPGA) hardware. Fully

pipelined, full-parallel transposed-form FIR filters

with multiplier blocks are generated using the new and

previous algorithms, implemented on an FPGA target,

and the results are compared. Minimizing multiplier block

logic depth and pipeline registers is shown to have

the greatest influence in reducing FPGA area cost.

Experimental results in terms of performance and

area show that the filters that are generated using the

distributed arithmetic technique for the new

algorithm provide lower-area solutions than

existing algorithms.

1 Introduction

High-speed digital signal processor

(DSP) systems are increasingly being implemented

on field programmable gate array (FPGA) hardware

platforms. Due to rapid advances in the technology,

the current generation of FPGAs contains a very high

number of Configurable Logic Blocks (CLBs), and

are becoming more feasible for implementing a wide

range of applications. More recently, Structured

ASIC technology has yielded lower cost solutions to

full custom ASIC by predefining several layers of

silicon functionality that require the definition of only

a few fabrication layers to implement the required

design. However, the FPGA platform provides high

performance and flexibility with the option to

reconfigure and is the technology focused on for the

remainder of this paper. There is a constant

requirement for efficient use of FPGA resources

where occupying less hardware for a given system

can yield significant cost-related benefits:

(i) Reduced power consumption;

(ii) Area for additional application functionality;

(iii) Potential to use a smaller, cheaper FPGA.

Finite impulse response (FIR) digital filters are

common DSP functions and are widely used in

FPGA implementations. When very high

sampling rates are required, full-parallel

hardware must be used where every clock edge

feeds a new input sample and produces a new

output sample. Such filters can be implemented

on FPGAs using combinations of the general

purpose logic fabric, on-board RAM and

embedded arithmetic hardware. Full-parallel

filters cannot share hardware over multiple

clock cycles and so tend to occupy large amounts

of resource. Hence, efficient implementation of

such filters is important to minimize hardware

requirements.

2. Transposed FIR Filter with Multiplier Block

The transposed FIR filter structure shown

in Fig. 1a is derived from the standard FIR

structure using cut-set retiming and yields an

identical mathematical response but with several

advantages for FPGA implementation:

(i) Each sample is fed to every tap simultaneously, which

eliminates the input sample shift registers;

(ii) The pipelined addition chain maps

efficiently;

(iii) Filter latency is reduced;

(iv) Multiplication hardware can be shared by

identical tap coefficient magnitudes.
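
The equivalence of the two structures can be checked with a short behavioural sketch in Python. It is not a hardware description: the function names and the example coefficients are chosen here for illustration only, and pipelining of the multipliers is ignored.

```python
def fir_direct(coeffs, samples):
    """Standard (direct-form) FIR: the input samples move through a shift register."""
    delay = [0] * len(coeffs)              # input sample shift register
    out = []
    for x in samples:
        delay = [x] + delay[:-1]           # shift in the newest sample
        out.append(sum(h * d for h, d in zip(coeffs, delay)))
    return out


def fir_transposed(coeffs, samples):
    """Transposed FIR: every tap sees the current sample; partial sums are registered."""
    n = len(coeffs)
    regs = [0] * (n - 1)                   # registers of the summation chain
    out = []
    for x in samples:
        products = [h * x for h in coeffs]  # same input sample at every tap
        out.append(products[0] + (regs[0] if regs else 0))
        # Each register takes its own tap product plus the previous value
        # of the register further along the chain.
        regs = [products[i + 1] + (regs[i + 1] if i + 1 < n - 1 else 0)
                for i in range(n - 1)]
    return out


coeffs = [3, 13, 21, 37]                   # illustrative coefficients
samples = [1, 2, 0, -1, 5, 7]
assert fir_direct(coeffs, samples) == fir_transposed(coeffs, samples)
```

In the transposed form, all products are formed from the same input sample in the same cycle, which is what allows taps with identical coefficient magnitudes to share one multiplication.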

Fig. 1 Mathematically identical full-parallel FIR

filter structures

a Transposed

b Transposed FIR with multiplier block

Note that for maximum sampling rates, all

multiplication hardware can be pipelined. In Fig.

1b, the coefficient multipliers of the transposed

FIR have been replaced with a multiplier block

(detailed in Section 3) that generates all required

multiples of the filter input using cascaded adds,

subtracts and shifts. This filter architecture is

known to be a highly area efficient method of

implementing fixed-coefficient, full-parallel FIR

filters [1]. It is the multiplier block that

determines filter implementation efficiency

regardless of hardware platform. Effective

synthesis of multiplier blocks for low FPGA area

is the focus of this work.

Fig. 2 Multiplier block: five adders, two

pipeline stages costing 25 slices

3. Operation and area estimation of

Multiplication hardware

Figure 2 shows an example multiplier block

(also referred to as a graph) that multiplies the

input by 3, 13, 21 and 37 in two clock cycles (the

‘logic depth’ of the block is also 2).

Multiplication is achieved using only adds,

subtracts and shifts which map very efficiently to

FPGA architectures. As an example, the input is

fed to the ‘3’ adder untouched and after being

left shifted once (multiplied by two). Hence, the

output of the ‘3’ adder is 2x + x = 3x as required.

This product can then be used as a graph output

to be fed to the filter summation chain (refer to

Fig. 1b) and, if required, routed internally to

generate further multiples of the input.
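
The shift-and-add principle can be sketched in Python as follows. The internal structure of the block is an assumption: the intermediate multiple 5x and the way each output is formed are one possible decomposition that uses five adders and a logic depth of 2, and are not read from Fig. 2.

```python
def multiplier_block(x):
    """Return (3x, 13x, 21x, 37x) using only shifts, adds and subtracts."""
    # Logic level 1: built directly from the input
    x3 = (x << 1) + x      # 2x + x  = 3x
    x5 = (x << 2) + x      # 4x + x  = 5x (assumed internal multiple)
    # Logic level 2: built from the input and level 1 outputs
    x13 = (x << 4) - x3    # 16x - 3x = 13x
    x21 = (x << 4) + x5    # 16x + 5x = 21x
    x37 = (x << 5) + x5    # 32x + 5x = 37x
    return x3, x13, x21, x37


# Every output matches an ordinary multiplication of the input.
for x in range(-8, 9):
    assert multiplier_block(x) == (3 * x, 13 * x, 21 * x, 37 * x)
```

In hardware, the shifts are fixed wiring and cost no logic, so only the five add and subtract operations consume resources.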

Pipelining multiplier blocks ensures high clock

rates are achieved when implemented on FPGA

hardware. Note that multiplier blocks usually

contain a mixture of adders and subtractors, but

the ‘adder cost’ of a block refers to the total number

of adders and subtractors combined. Hence, the adder cost

of Fig. 2 is 5. Note that adders may also be

referred to as the graph ‘vertices’. In this paper,

we use the Xilinx Virtex-II FPGA family [2] for

implementation analysis and hence area will be

measured in slices. Fig. 2 is quoted as costing 25

slices. This is calculated by counting the number

of flip-flops inferred by the multi-bit signals

crossing pipeline boundaries and dividing by 2

since there are two flip-flops per slice. Equation

(1) uses the set S which contains the bit-widths

of all N multi-bit signal pipeline boundary

crossings to obtain a slice estimate e:
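
From the description above (the flip-flops inferred by all pipeline boundary crossings, with two flip-flops per slice), Equation (1) can be written as follows. The element notation S_i for the bit-width of the i-th crossing is an assumption made here.

```latex
e = \frac{1}{2} \sum_{i=1}^{N} S_i \qquad (1)
```

For example, two hypothetical crossings of 12 and 14 bits would infer 26 flip-flops and give an estimate of e = 13 slices.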

4. Previous research in synthesizing

multiplier blocks

The majority of research has concentrated on

producing algorithms to synthesize multiplier

blocks with the optimization goal of minimum

adder cost. Bull and Horrocks [3]

demonstrated that the multiplier block of the

digital filter shown in Fig. 1b may be realized

through the application of graph synthesis

techniques employing only primitive arithmetic

operations (addition, subtraction and bit-wise

shift). They also showed that graph synthesis problems

are NP-complete and defined several minimum

adder graph algorithms. Dempster and Macleod

[4] identified limitations in this work and defined

the ‘n-dimensional reduced adder graph’ (RAGn)

algorithm which is generally regarded as the

primary reference for minimal adder multiplier

block synthesis [5, p. 96]. Additional techniques

using canonic signed digit (CSD) and

subexpression sharing have also been proposed

to minimize adder cost [6, 7]. Recent work by

Demirsoy et al. [8, 9] incorporates multiplexers

to allow efficient FPGA implementation of time

multiplexed filters and discrete cosine transform

(DCT) processors. In [10], Dempster et al.

defined the ‘C1’ synthesis algorithm with the

optimization goal of minimizing multiplier block

logic depth to reduce power consumption.

5. Adder cost and logic depth for low FPGA

area

Figure 3 shows a multiplier block that

generates the same multiples of the input as the

block shown in Fig. 2. However, the Fig. 3 block

uses only four adders, whereas Fig. 2 uses one

more (five). Conversely, four logic levels are

required for the Fig. 3 block and only two are

required in Fig. 2. Most importantly, using (1),

the Fig. 3 block requires 44 slices compared to

the 25 slices of Fig. 2. This is due to the

increased number of pipeline boundary crossings

in Fig. 3 caused by the extra logic levels of the

block. Hence, in this case, fewer adders does not

mean less FPGA area. It should be noted that

architecture specific features may also influence

area consumption. The main research community

goal of minimizing adder cost is not applicable

for minimizing FPGA area cost. Instead, an

algorithm performing synthesis for low FPGA

area should aim to reduce signal pipeline

crossings. This can be achieved by synthesizing

low-logic depth multiplier blocks and

minimizing signal bit-widths as far as possible.
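
The trade-off between adder cost and logic depth can be made concrete with a small Python sketch. Both graphs below are assumed decompositions that match the adder counts and logic depths quoted for Figs. 2 and 3; they are not taken from the figures themselves, and the names are illustrative.

```python
# Each adder is listed as: output multiple -> the two multiples it combines.
# The value 1 stands for the (possibly shifted) block input.
TWO_LEVEL = {3: (1, 1), 5: (1, 1), 13: (1, 3), 21: (1, 5), 37: (1, 5)}
CHAINED = {3: (1, 1), 13: (1, 3), 21: (1, 13), 37: (1, 21)}


def adder_cost(graph):
    """Number of adders and subtractors (graph vertices)."""
    return len(graph)


def logic_depth(graph):
    """Longest chain of adders between the input and any output."""
    def depth(value):
        if value == 1:                      # the block input itself
            return 0
        return 1 + max(depth(operand) for operand in graph[value])
    return max(depth(value) for value in graph)


assert (adder_cost(TWO_LEVEL), logic_depth(TWO_LEVEL)) == (5, 2)
assert (adder_cost(CHAINED), logic_depth(CHAINED)) == (4, 4)
```

The chained graph saves one adder but doubles the logic depth, so more signals have to be registered at each additional pipeline boundary.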

Fig. 3 Multiplier block: four adders, five pipeline

stages costing 44 slices

6. New RSG algorithm design

Dempster and Macleod’s RAG-n algorithm

attempts to synthesize all required

multiplications by initially placing all

coefficients of adder cost 1 (determined using

the MAG algorithm [11]) into the multiplier block and then

building higher cost values using combinations

of shifts, adds and subtracts of other adder

outputs within the block. For example, in Fig. 3

(generated using RAG-n), the ‘3’ adder is

synthesized first and all other coefficients are

built from it. This approach leads to high logic

depth blocks and hence pushes up FPGA area

requirement.

Note that RAG-n synthesizes multiplier blocks in

two stages:

(i) optimal stage;

(ii) suboptimal heuristic stage.

If the multiplier block is fully synthesized

after stage (i), Dempster and Macleod show that

their algorithm ensures the absolute minimum

number of adders to implement a given block. If

stage (ii) is required, a suboptimal multiplier

block will be synthesized in a reasonable time.

Stage (ii) employs the MAG algorithm to add

graph vertices to implement required products.

When designing the RSG algorithm, we ensured

that logic depth was controlled by not always

trying to build on existing adders within the

block. Instead, the vertices required to

implement a given value are added directly using

the ‘best graph’ data generated using the

modified and extended implementation of the

MAG algorithm. RSG starts with the highest cost

values and simply inserts the ‘best graph’

required for each, ensuring no duplicate adders

are created and that adder outputs are shared as

far as possible.
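
The insertion strategy described above can be sketched in Python. This is only an illustration under stated assumptions: the `BEST_GRAPH` table is hypothetical (real entries would come from the modified MAG implementation), the number of adders in each entry is used as a stand-in for its adder cost, and the function name is not from the original work.

```python
# Hypothetical 'best graph' table: for each target multiple, the adders
# (output -> operand multiples) needed to build it at low logic depth.
# The value 1 stands for the (possibly shifted) block input.
BEST_GRAPH = {
    3: {3: (1, 1)},
    13: {3: (1, 1), 13: (1, 3)},
    21: {5: (1, 1), 21: (1, 5)},
    37: {5: (1, 1), 37: (1, 5)},
}


def rsg_like_synthesis(coefficients, best_graph=BEST_GRAPH):
    """Insert the best graph of each coefficient, highest cost first."""
    block = {}
    by_cost = sorted(set(coefficients),
                     key=lambda c: len(best_graph[c]), reverse=True)
    for coefficient in by_cost:
        for output, operands in best_graph[coefficient].items():
            # An adder already in the block is shared, never duplicated.
            block.setdefault(output, operands)
    return block


block = rsg_like_synthesis([3, 13, 21, 37])
assert sorted(block) == [3, 5, 13, 21, 37]   # five adders, 3x and 5x shared
```

Because each coefficient brings its own shallow graph rather than being built on whatever adders already exist, the logic depth of the block is bounded by the deepest entry in the table.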

7. Results generation

7.1 Selecting and implementing algorithms for

comparison with RSG

In addition to implementing RAG-n for

comparison with RSG, we selected the C1

algorithm for the low-logic depth optimization

goal it embodies. RAG-n and C1 provide suitable

coverage of the optimization goal space to

validate the performance of RSG and also

highlight desirable characteristics of multiplier

blocks for low area FPGA implementation.

Beyond 12-bit coefficients, RAG-n and C1

used the MAG extensions. The distributed arithmetic (DA) technique

redistributes the order of calculation for FIR

filter sum of products equations and is highly

suitable for efficient FPGA implementation [5,

14]. Hence, in addition to comparing RSG with

other synthesis algorithms as part of transposed

FIR filters, a comparison was made between

transposed FIR filters (using RSG for block

synthesis) and equivalents implemented using

full-parallel DA. one-channel, fixed-coefficient,

fully parallel,
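
The reordering that DA performs on the sum of products can be illustrated with a minimal Python sketch. It assumes unsigned input samples and loops over the bit planes for clarity, whereas a full-parallel DA implementation would process them concurrently. The function names are illustrative and do not describe the internals of the Xilinx core [14].

```python
def build_da_lut(coeffs):
    """LUT addressed by one bit per tap: entry = sum of the selected coefficients."""
    taps = len(coeffs)
    return [sum(h for k, h in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << taps)]


def da_dot(coeffs, window, width):
    """Sum of products over `window` (unsigned, `width`-bit samples), bit plane by bit plane."""
    lut = build_da_lut(coeffs)
    acc = 0
    for bit in range(width):
        # Gather bit number `bit` of every sample into one LUT address.
        addr = 0
        for k, x in enumerate(window):
            addr |= ((x >> bit) & 1) << k
        acc += lut[addr] << bit             # weight the partial sum by 2**bit
    return acc


coeffs = [3, 13, 21, 37]                    # illustrative coefficients
window = [5, 0, 1023, 7]                    # 10-bit unsigned samples
assert da_dot(coeffs, window, 10) == sum(h * x for h, x in zip(coeffs, window))
```

No multiplier appears anywhere: the coefficients are folded into the look-up table and the samples only supply address bits.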

7.2 Experimental setup

Numerous single-rate,

transposed FIR filters were automatically

generated to compare the three multiplier block

synthesis algorithms (Section 8.1).

For each filter, various combinations of the

following aspects were varied:

(i) coefficient bit-width;

(ii) filter length.

Coefficient sets were uniformly distributed and

filter input width was 10 bits. For each

configuration of algorithm, coefficient bit-width

and filter length, ten unique coefficient sets were

applied to provide an average. Each filter was

synthesized using Synplicity Synplify Pro 7.3.4

[15] targeting a Xilinx Virtex-II (xc2v3000-

fg676-5) and implemented using Xilinx ISE

5.2.03i [2] to obtain overall filter device usage

and maximum clock rate. A period constraint of

177 MHz was chosen after manual

experimentation to determine a realistic value

for all filter configurations.

8. Results

8.1 Synthesis algorithm comparison varying

filter length

For this experiment, filters were generated to

compare the synthesis algorithms using the

parameters of Section 7.2, coefficient bit-width

fixed at 12 bits and filter length varied from 15

to 210. RSG requires the least FPGA area

with the slice and flip-flop cost (Figs. 4a and c)

of RAG-n and C1 diverging as filter length

increases. RSG requires more adders in general

(Fig. 4b) and slightly more LUTs (Fig. 4e). RSG

logic depth is constant due to the fixed coefficient

width of 12 bits and the logic depth

controlling properties of RSG. In this experiment,

all coefficients were covered by cost 3 graphs.

Filter maximum clock rate is similar for all three

algorithms with an expected gradual decline as

filter length increases.

8.2 Comparing transposed FIR with

multiplier block and full-parallel DA

To compare RSG filters with DA equivalents,

the parameters of Section 7.2 were applied with

filter length and coefficient bit-width being

varied from 15 to 210 and 2 to 20, respectively.

For the 2–10 bit coefficient range, Fig. 8a

establishes that the RSG filters consume

significantly less area than the DA

equivalents. Flip-flop usage (Fig. 8b) is

correlated with area for both RSG and DA. RSG

LUT consumption is less in all cases (Fig. 8e), although

not by as great a margin as the flip-flop results.

RSG generally uses few or no shift registers

for any coefficient bit-width except 2.

This is for two reasons: (i) the low-logic depth

multiplier blocks RSG synthesizes are unlikely

to contain delay lines suitable for SRL16

mapping; (ii) consecutive zero valued

coefficients in the filter impulse response

correspond to delays in the hardware filter

summation chain and the probability of such

conditions is higher at low coefficient bit-widths.

The DA architecture makes increasing use of

shift registers with filter length, and there is an

overall decline in maximum clock rate as the

critical path delay increases with filter length.

For coefficient bit-widths 12–20, the RSG filters

again generally require less area except for

20 bits, where the results are comparable (Fig. 5a).

Also, the RSG area advantage is seen to decrease

as coefficient bit-width increases. Flip-flop and

shift-register usage (Fig. 5b and 5c) follow

similar trends to bit-widths 2–10. However, from

16-bit coefficients upwards, RSG is shown to

consume more LUTs (Fig. 5e), which is the

primary cause of reducing RSG area advantage

with increasing bit-width. Maximum clock rates

(Fig. 5d) were generally well into the full-parallel

range, although for bit-widths 18 and 20,

RSG filter performance drops as filter length

increases. This experiment established that, for

typical DSP filter lengths and coefficient bit-widths,

RSG filters provide the lowest area

implementations capable of being clocked at

full-parallel rates.

9. Conclusions

The classic research community optimization

metric of minimizing multiplier block adder cost

has been demonstrated not to minimize FPGA

hardware for full-parallel pipelined FIR filters.

Reducing flip-flop count through minimizing

multiplier logic depth has instead been shown to

yield the lowest area solutions. The new RSG

algorithm has been defined to embody this

design principle. The results presented establish

a clear area advantage of RSG over prior

algorithms for typical filter parameters with

comparable maximum clock rates. In addition,

the industrial relevance of the transposed FIR

with multiplier block architecture and the RSG

algorithm has been established through

comparison with filters implemented using the

DA technique.

Fig. 4 Results for VHDL filter generation varying filter length (coefficients: 12 bits)

a FPGA hardware area

b Multiplier block adders

c Flip-flop usage

d Multiplier block logic depth

e LUT usage

Fig. 5 Comparing transposed FIR filter with multiplier block (RSG) and DA (coefficient bit-widths 12–20)

a FPGA hardware area

b Flip-flop usage

c Shift register usage

d Filter maximum clock rate

e LUT usage

10. References

1 Macpherson, K., Stirling, I., Rice, G., Garcia-Alis,

D., and Stewart, R.: ‘Arithmetic implementation

techniques and methodologies for 3G uplink

reception in Xilinx FPGAs’. Third Int. Conf. on 3G

Mobile Communication Technologies, 2002, (IEE

Conf. Publ. no. 489), May 2002, pp. 191–195

2 Xilinx Inc., http://www.xilinx.com

3 Bull, D.R., and Horrocks, D.H.: ‘Primitive

operator digital filters’, IEE Proc. G, Circuits

Devices Syst., 1991, 138, (3), pp. 401–412

4 Dempster, A.G., and Macleod, M.D.: ‘Use of

minimum-adder multiplier blocks in FIR digital

filters’, IEEE Trans. Circuits Syst. II, Analog Digit.

Signal Process., 1995, 42, (9),

pp. 569–577

5 Meyer-Baese, U.: ‘Digital signal processing with

field programmable gate arrays’ (Springer-Verlag,

Berlin, Heidelberg, 2001)

6 Gustafsson, O., and Wanhammar, L.: ‘ILP

modelling of the common subexpression sharing

problem’. 9th Int. Conf. on Electronics, Circuits

and Systems, 2002, vol. 3, pp. 1171–1174

7 Jang, Y., and Yang, S.: ‘Low-power CSD linear

phase FIR filter structure using vertical common

sub-expression’, Electron. Lett., 2002, 38, (15), pp.

777–779

8 Demirsoy, S.S., Dempster, A.G., and Kale, I.:

‘Design guidelines for reconfigurable multiplier

blocks’. IEEE Int. Symp. on Circuits and Systems,

26–28 May 2003, pp. IV293–IV296

9 Demirsoy, S.S., Beck, R., Dempster, A.G., and

Kale, I.: ‘Reconfigurable implementation of

recursive DCT kernels for reduced quantization

noise’. IEEE Int. Symp. on Circuits and

Systems, 26–28 May 2003, pp. IV289–IV292

10 Dempster, A.G., Demirsoy, S.S., and Kale, I.:

‘Designing multiplier blocks with low logic depth’.

IEEE Int. Symp. on Circuits and Systems, 2002,

vol. 5, pp. V-773–V-776

11 Dempster, A.G., and Macleod, M.D.: ‘Constant

integer multiplication using minimum adders’, IEE

Proc., Circuits Devices Syst., 1994, 141, (5), pp.

407–413

12 Gustafsson, O., Dempster, A.G., and

Wanhammar, L.: ‘Extended results for minimum-adder

constant integer multipliers’. IEEE Int.

Symp. on Circuits and Systems, 2002, vol. 1,

pp. I-73–I-76

13 Wirthlin, M.J., and McMurtrey, B.: ‘Efficient

constant coefficient multiplication using advanced

FPGA architectures’. Proc. 11th Int. Workshop on

Field-Programmable Logic and Applications, 2001,

pp. 555–564

14 Xilinx Inc.: ‘Distributed arithmetic FIR filter

v8.0’, http://www.xilinx.com

15 Synplicity Inc.,
