Abstract:
This paper introduces a new algorithm that
synthesizes multiplier blocks so that the hardware
requirement is minimized when they are used in
full-parallel finite impulse response (FIR) filters.
Although the techniques used are applicable to
application-specific integrated circuit (ASIC) and
Structured ASIC technologies, the analysis is
performed using field programmable gate array (FPGA)
hardware. Fully pipelined, full-parallel
transposed-form FIR filters with multiplier block
are generated using the new and previous algorithms,
implemented on an FPGA target and the results
compared. Minimizing multiplier block
logic depth and pipeline registers is shown to have
the greatest influence in reducing FPGA area cost.
Experimental results for performance and area show
that filters generated using the new algorithm
provide lower-area solutions than those generated
using existing algorithms and than equivalents
implemented using the distributed arithmetic
technique.
1. Introduction
High-speed digital signal processing (DSP)
systems are increasingly being implemented
on field programmable gate array (FPGA) hardware
platforms. Owing to rapid advances in technology,
the current generation of FPGAs contains a very
large number of configurable logic blocks (CLBs) and
is becoming feasible for implementing a wide
range of applications. More recently, Structured
ASIC technology has yielded lower cost alternatives
to full-custom ASIC by predefining several layers of
silicon functionality, so that only a few fabrication
layers need to be defined to implement the required
design. However, the FPGA platform provides high
performance and flexibility with the option to
reconfigure, and it is the technology considered in
the remainder of this paper. There is a constant
requirement for efficient use of FPGA resources
where occupying less hardware for a given system
can yield significant cost-related benefits:
(i) Reduced power consumption;
(ii) Area for additional application functionality;
(iii) Potential to use a smaller, cheaper FPGA.
Finite impulse response (FIR) digital filters are
common DSP functions and are widely used in
FPGA implementations. When very high
sampling rates are required, full-parallel
hardware must be used where every clock edge
feeds a new input sample and produces a new
output sample. Such filters can be implemented
on FPGAs using combinations of the general
purpose logic fabric, on-board RAM and
embedded arithmetic hardware. Full-parallel
filters cannot share hardware over multiple
clock cycles and so tend to occupy large amounts
of resource. Hence, efficient implementation of
such filters is important to minimize hardware
requirements.
2. Transposed FIR Filter With Multiplier Block
The transposed FIR filter structure shown
in Fig. 1a is derived from the standard FIR
structure using cut-set retiming. It yields an
identical mathematical response but has several
advantages for FPGA implementation:
(i) Each sample is fed to every tap simultaneously,
which eliminates the input sample shift registers;
(ii) The pipelined addition chain maps
efficiently;
(iii) Filter latency is reduced;
(iv) Multiplication hardware can be shared by
taps with identical coefficient magnitudes.
Fig. 1 Mathematically identical full-parallel FIR
filter structures
a Transposed
b Transposed FIR with multiplier block
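The equivalence of the standard and transposed structures can be checked numerically. The Python sketch below is illustrative only: the function names `fir_direct` and `fir_transposed` and the example coefficients are assumptions made here, not taken from the paper. It models the standard form with an input delay line and the transposed form with a registered addition chain, computing one product per distinct coefficient magnitude as in advantage (iv).

```python
from typing import List


def fir_direct(coeffs: List[int], samples: List[int]) -> List[int]:
    """Standard form: the input is shifted along a delay line feeding the taps."""
    delay_line = [0] * len(coeffs)
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]
        out.append(sum(c * d for c, d in zip(coeffs, delay_line)))
    return out


def fir_transposed(coeffs: List[int], samples: List[int]) -> List[int]:
    """Transposed form: every tap sees the current sample; sums are registered."""
    regs = [0] * (len(coeffs) - 1)  # registers of the addition chain
    out = []
    for x in samples:
        # One product per distinct coefficient magnitude, shared between taps.
        shared = {m: m * x for m in {abs(c) for c in coeffs}}
        products = [shared[abs(c)] if c >= 0 else -shared[abs(c)] for c in coeffs]
        chain_in = regs + [0]  # the last tap has no upstream partial sum
        sums = [p + r for p, r in zip(products, chain_in)]
        out.append(sums[0])    # tap 0 drives the filter output
        regs = sums[1:]        # remaining sums are clocked into the chain
    return out


coeffs = [3, -13, 21, 37, 21, -13, 3]   # example values only
samples = [5, -2, 7, 0, 1, 4, -6, 3, 9, -1]
assert fir_direct(coeffs, samples) == fir_transposed(coeffs, samples)
```

The model ignores output registering and multiplier pipelining, which add latency in hardware but do not change the impulse response.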
Note that for maximum sampling rates, all
multiplication hardware can be pipelined. In Fig.
1b, the coefficient multipliers of the transposed
FIR have been replaced with a multiplier block
(detailed in Section 3) that generates all required
multiples of the filter input using cascaded adds,
subtracts and shifts. This filter architecture is
known to be a highly area efficient method of
implementing fixed-coefficient, full-parallel FIR
filters [1]. It is the multiplier block that
determines filter implementation efficiency
regardless of hardware platform. Effective
synthesis of multiplier blocks for low FPGA area
is the focus of this work.
Fig. 2 Multiplier block: five adders, two
pipeline stages, costing 25 slices
3. Operation and area estimation of
Multiplication hardware
Figure 2 shows an example multiplier block
(also referred to as a graph) that multiplies the
input by 3, 13, 21 and 37 in two clock cycles (the
‘logic depth’ of the block is also 2).
Multiplication is achieved using only adds,
subtracts and shifts which map very efficiently to
FPGA architectures. As an example, the input is
fed to the ‘3’ adder untouched and after being
left shifted once (multiplied by two). Hence, the
output of the ‘3’ adder is 2x + x = 3x as required.
This product can then be used as a graph output
to be fed to the filter summation chain (refer to
Fig. 1b) and, if required, routed internally to
generate further multiples of the input.
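As an illustration of shift-and-add multiplication, the Python sketch below builds the products 3x, 13x, 21x and 37x in two stages using five adders and subtractors. The particular decomposition, including the internal 5x value, is an assumption chosen to match the adder count and logic depth quoted for Fig. 2; it is not taken from the figure itself, and the function name `multiplier_block` is hypothetical.

```python
from typing import Dict


def multiplier_block(x: int) -> Dict[int, int]:
    """Return {coefficient: coefficient * x} using only shifts, adds and subtracts."""
    # Stage 1 (logic depth 1)
    x3 = (x << 1) + x      # 2x + x
    x5 = (x << 2) + x      # 4x + x, internal value only
    # Stage 2 (logic depth 2)
    x13 = (x << 4) - x3    # 16x - 3x
    x21 = (x << 4) + x5    # 16x + 5x
    x37 = (x << 5) + x5    # 32x + 5x
    return {3: x3, 13: x13, 21: x21, 37: x37}


for x in range(-8, 8):
    assert all(product == c * x for c, product in multiplier_block(x).items())
```

In hardware the shifts are fixed wiring, so only the five additions and subtractions consume logic.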
Pipelining multiplier blocks ensures high clock
rates are achieved when implemented on FPGA
hardware. Note that multiplier blocks usually
contain a mixture of adders and subtractors, but
the ‘adder cost’ of a block refers to the number
of adders and subtractors. Hence, the adder cost
of Fig. 2 is 5. Note that adders may also be
referred to as the graph ‘vertices’. In this paper,
we use the Xilinx Virtex-II FPGA family [2] for
implementation analysis and hence area will be
measured in slices. Fig. 2 is quoted as costing 25
slices. This is calculated by counting the number
of flip-flops inferred by the multi-bit signals
crossing pipeline boundaries and dividing by 2
since there are two flip-flops per slice. Equation
(1) uses the set S, which contains the bit-widths
S_i of all N multi-bit signal pipeline boundary
crossings, to obtain a slice estimate e:

e = \frac{1}{2} \sum_{i=1}^{N} S_i    (1)
4. Previous research in synthesising
multiplier blocks
The majority of research has concentrated on
producing algorithms to synthesize multiplier
blocks with the optimization goal of minimum
adder cost. Bull and Horrocks [3]
demonstrated that the multiplication block of the
digital filter shown in Fig. 1b may be realized
through the application of graph synthesis
techniques employing only primitive arithmetic
operations (addition, subtraction and bit-wise
shift). They also showed that graph synthesis
problems are NP-complete and defined several
minimum adder graph algorithms. Dempster and Macleod
[4] identified limitations in this work and defined
the ‘n-dimensional reduced adder graph’ (RAG-n)
algorithm, which is generally regarded as the
primary reference for minimal adder multiplier
block synthesis [5, p. 96]. Additional techniques
using canonic signed digit (CSD) representation and
subexpression sharing have also been proposed
to minimize adder cost [6, 7]. Recent work by
Demirsoy et al. [8, 9] incorporates multiplexers
to allow efficient FPGA implementation of
time-multiplexed filters and discrete cosine
transform (DCT) processors. In [10], Dempster et al.
defined the ‘C1’ synthesis algorithm with the
optimization goal of minimizing multiplier block
logic depth to reduce power consumption.
5. Adder cost and logic depth for low FPGA
area
Figure 3 shows a multiplier block that
generates the same multiples of the input as the
block shown in Fig. 2. However, the Fig. 3 block
uses only four adders, whereas Fig. 2 uses one
more (five). Conversely, four logic levels are
required for the Fig. 3 block and only two are
required in Fig. 2. Most importantly, using (1),
the Fig. 3 block requires 44 slices compared to
the 25 slices of Fig. 2. This is due to the
increased number of pipeline boundary crossings
in Fig. 3 caused by the extra logic levels of the
block. Hence, in this case, fewer adders do not
mean less FPGA area. It should be noted that
architecture-specific features may also influence
area consumption. The main research community
goal of minimizing adder cost is therefore not
well suited to minimizing FPGA area cost. Instead, an
algorithm performing synthesis for low FPGA
area should aim to reduce signal pipeline
crossings. This can be achieved by synthesizing
low-logic depth multiplier blocks and
minimizing signal bit-widths as far as possible.
Fig. 3 Multiplier block: four adders, five pipeline
stages, costing 44 slices
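The effect of logic depth on the estimate of (1) can be explored with a simple model. The Python sketch below is an illustration under stated assumptions, not the procedure used by the authors: every adder output is registered, a signal is re-registered at each pipeline boundary until its last use, all outputs are aligned at the final boundary, and the product c·x is given the input width plus ceil(log2 c) bits. The two graphs, the 4-bit input width and all names are hypothetical choices made here.

```python
from typing import Dict, Iterable, List, Tuple

# An adder is stored as product -> (operand, operand); operand 1 is the raw
# input x. Shifts are wiring only, so they are omitted from the model.
Graph = Dict[int, Tuple[int, int]]


def logic_levels(graph: Graph) -> Dict[int, int]:
    """Logic level of every signal; the input x sits at level 0."""
    level = {1: 0}

    def visit(c: int) -> int:
        if c not in level:
            a, b = graph[c]
            level[c] = 1 + max(visit(a), visit(b))
        return level[c]

    for c in graph:
        visit(c)
    return level


def crossing_widths(graph: Graph, outputs: Iterable[int], input_width: int) -> List[int]:
    """Bit-width of every signal at every pipeline boundary it crosses."""
    level = logic_levels(graph)
    depth = max(level.values())
    last_needed = {c: 0 for c in level}
    for c, operands in graph.items():
        for op in operands:  # an adder at level k reads boundary k - 1
            last_needed[op] = max(last_needed[op], level[c] - 1)
    for c in outputs:        # assumed: outputs are aligned at the final boundary
        last_needed[c] = depth
    widths: List[int] = []
    for c, lvl in level.items():
        width = input_width + (c - 1).bit_length()  # input width + ceil(log2(c))
        crossings = last_needed[c] - max(lvl, 1) + 1
        widths += [width] * max(crossings, 0)
    return widths


def slice_estimate(widths: List[int]) -> float:
    """Estimate (1): total flip-flop bits divided by two flip-flops per slice."""
    return sum(widths) / 2


targets = (3, 13, 21, 37)
shallow = {3: (1, 1), 5: (1, 1), 13: (1, 3), 21: (1, 5), 37: (1, 5)}  # 5 adders, depth 2
deep = {3: (1, 1), 13: (1, 3), 21: (1, 13), 37: (1, 21)}              # 4 adders, depth 4

for name, g in (("shallow", shallow), ("deep", deep)):
    print(name, len(g), "adders,", slice_estimate(crossing_widths(g, targets, 4)), "slices")
```

Under these assumptions the shallow graph is estimated at 25 slices and the deep chain at 44. The agreement with the figures quoted for Figs. 2 and 3 depends on the widths and graphs chosen here and should not be read as confirmation of the exact graphs in those figures. The model does show the mechanism described above: the deep chain has fewer adders, but the input and every intermediate product must be carried across more pipeline boundaries.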
6. New RSG algorithm design
Dempster and Macleod’s RAG-n algorithm
attempts to synthesize all required
multiplications by initially placing all
coefficients of adder cost 1 (determined using
MAG) into the multiplier block and then
building higher cost values using combinations
of shifts, adds and subtracts of other adder
outputs within the block. For example, in Fig. 3
(generated using RAG-n), the ‘3’ adder is
synthesized first and all other coefficients are
built from it. This approach leads to high logic
depth blocks and hence pushes up FPGA area
requirement.
Note that RAG-n synthesizes multiplier blocks in
two stages:
(i) optimal stage;
(ii) suboptimal heuristic stage.
If the multiplier block is fully synthesized
after stage (i), Dempster and Macleod show that
their algorithm ensures the absolute minimum
number of adders to implement a given block. If
stage (ii) is required, a suboptimal multiplier
block will be synthesized in a reasonable time.
Stage (ii) employs the MAG algorithm to add
graph vertices to implement required products.
When designing the RSG algorithm, we ensured
that logic depth was controlled by not always
trying to build on existing adders within the
block. Instead, the vertices required to
implement a given value are added directly using
the ‘best graph’ data generated using the
modified and extended implementation of the
MAG algorithm. RSG starts with the highest cost
values and simply inserts the ‘best graph’
required for each, ensuring no duplicate adders
are created and that adder outputs are shared as
far as possible.
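The insertion step described above can be sketched as follows. This Python fragment is a loose illustration, not the RSG implementation: the `BEST_GRAPH` table is written by hand for four coefficients and is hypothetical, whereas the algorithm described in the text draws its ‘best graph’ data from a modified and extended MAG implementation whose contents and tie-breaking rules are not given here.

```python
from typing import Dict, List, Tuple

# (value, (a, shift_a), (b, shift_b), sign) encodes
# value*x = (a*x << shift_a) + sign * (b*x << shift_b); operand 1 is the input x.
Adder = Tuple[int, Tuple[int, int], Tuple[int, int], int]

# Hypothetical "best graph" table: the adders needed to build each coefficient.
BEST_GRAPH: Dict[int, List[Adder]] = {
    3:  [(3, (1, 1), (1, 0), +1)],
    13: [(3, (1, 1), (1, 0), +1), (13, (1, 4), (3, 0), -1)],
    21: [(5, (1, 2), (1, 0), +1), (21, (1, 4), (5, 0), +1)],
    37: [(5, (1, 2), (1, 0), +1), (37, (1, 5), (5, 0), +1)],
}


def synthesize(coefficients: List[int]) -> Dict[int, Adder]:
    """Insert each coefficient's best graph, highest adder cost first."""
    block: Dict[int, Adder] = {}
    by_cost = sorted(set(coefficients), key=lambda c: len(BEST_GRAPH[c]), reverse=True)
    for c in by_cost:
        for adder in BEST_GRAPH[c]:
            block.setdefault(adder[0], adder)  # never duplicate an existing adder
    return block


block = synthesize([3, 13, 21, 37, 21, 13, 3])
for value, (a, sa), (b, sb), sign in block.values():
    assert value == (a << sa) + sign * (b << sb)  # each adder is arithmetically valid
print(sorted(block))
```

Because each coefficient is built from its own shallow graph rather than from whatever adders already exist, no coefficient in this example ends up deeper than two logic levels, while the shared 3x and 5x adders are still created only once.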
7. Results generation
7.1 Selecting and implementing algorithms for
comparison with RSG
In addition to implementing RAG-n for
comparison with RSG, we selected the C1
algorithm for the low-logic depth optimization
goal it embodies. RAG-n and C1 provide suitable
coverage of the optimization goal space to
validate the performance of RSG and also
highlight desirable characteristics of multiplier
blocks for low area FPGA implementation.
Beyond 12-bit coefficients, RAG-n and C1
used the MAG extensions. The distributed
arithmetic (DA) technique redistributes the order
of calculation for FIR filter sum-of-products
equations and is highly suitable for efficient FPGA
implementation [5, 14]. Hence, in addition to
comparing RSG with other synthesis algorithms as
part of transposed FIR filters, a comparison was
made between transposed FIR filters (using RSG for
block synthesis) and equivalents implemented using
full-parallel DA. All filters were one-channel,
fixed-coefficient and fully parallel.
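The reordering that DA performs can be illustrated with a short Python sketch. It is a minimal model under simplifying assumptions (unsigned samples, a single look-up table covering all taps) and is not a description of the Xilinx DA core [14]; the function names are hypothetical. The sum of products is evaluated one bit-plane at a time: bit b of every sample forms a table address, and the table entry, which holds the sum of the selected coefficients, is weighted by 2^b.

```python
from typing import List


def build_lut(coeffs: List[int]) -> List[int]:
    """Entry `addr` holds the sum of the coefficients whose address bit is set."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << n)]


def da_sum_of_products(coeffs: List[int], samples: List[int], width: int) -> int:
    """Sum of c_k * x_k for unsigned `width`-bit samples, one bit-plane at a time."""
    lut = build_lut(coeffs)
    acc = 0
    for bit in range(width):
        # Bit `bit` of every sample forms the look-up address for this plane.
        addr = sum(((x >> bit) & 1) << k for k, x in enumerate(samples))
        acc += lut[addr] << bit  # weight the partial result by 2**bit
    return acc


coeffs = [3, 13, 21, 37]
samples = [5, 0, 9, 15]  # unsigned 4-bit example values
assert da_sum_of_products(coeffs, samples, 4) == sum(c * x for c, x in zip(coeffs, samples))
```

The loop models only the arithmetic. A full-parallel DA filter would evaluate all bit-planes in the same clock cycle, and long filters would split the table across groups of taps because its size grows as 2^n.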
7.2 Experimental setup
Numerous single-rate,
transposed FIR filters were automatically
generated to compare the three multiplier block
synthesis algorithms (Section 8.1).
For each filter, various combinations of the
following aspects were varied:
(i) coefficient bit-width;
(ii) filter length.
Coefficient sets were uniformly distributed and
filter input width was 10 bits. For each
configuration of algorithm, coefficient bit-width
and filter length, ten unique coefficient sets were
applied to provide an average. Each filter was
synthesized using Synplicity Synplify Pro 7.3.4
[15] targeting a Xilinx Virtex-II (xc2v3000-
fg676-5) and implemented using Xilinx ISE
5.2.03i [2] to obtain overall filter device usage
and maximum clock rate. A period constraint of
177 MHz was chosen after manual
experimentation to determine a realistic value
for all filter configurations.
8. Results
8.1 Synthesis algorithm comparison varying
filter length
For this experiment, filters were generated to
compare the synthesis algorithms using the
parameters of Section 7.2, with coefficient bit-width
fixed at 12 bits and filter length varied from 15
to 210. RSG requires the least FPGA area,
with the slice and flip-flop costs (Figs. 4a and 4c)
of RAG-n and C1 diverging as filter length
increases. RSG requires more adders in general
(Fig. 4b) and slightly more LUTs (Fig. 4e). RSG
logic depth is constant owing to the fixed coefficient
width of 12 bits and the logic depth
controlling properties of RSG. In this experiment,
all coefficients were covered by cost 3 graphs.
Filter maximum clock rate is similar for all three
algorithms with an expected gradual decline as
filter length increases.
8.2 Comparing transposed FIR with
multiplier block and full-parallel DA
To compare RSG filters with DA equivalents,
the parameters of Section 7.2 were applied with
filter length and coefficient bit-width being
varied from 15 to 210 and 2 to 20, respectively.
For the 2–10 bit coefficient range, Fig. 8a
establishes that the RSG filters consume
significantly less area than the DA
equivalents. Flip-flop usage (Fig. 8b) is
correlated with area for both RSG and DA. RSG
LUT consumption is less in all cases (Fig. 8e), although
not by as great a margin as the flip-flop results.
For shift-register usage RSG uses few or none,
in general, for all coefficient bit-widths except 2.
This is for two reasons: (i) the low-logic depth
multiplier blocks RSG synthesizes are unlikely
to contain delay lines suitable for SRL16
mapping; (ii) consecutive zero valued
coefficients in the filter impulse response
correspond to delays in the hardware filter
summation chain and the probability of such
conditions is higher at low coefficient bit-widths.
The DA architecture makes increasing use of
shift registers as filter length grows. There is also an
overall decline in maximum clock rate as the
critical path delay increases with filter length.
For coefficient bit-widths 12–20, the RSG filters
again generally require less area except for 20-
bits where the results are comparable (Fig. 5a).
Also, the RSG area advantage is seen to decrease
as coefficient bit-width increases. Flip-flop and
shift-register usage (Fig. 5b and 5c) follow
similar trends to bit-widths 2–10. However, from
16-bit coefficients upwards, RSG is shown to
consume more LUTs (Fig. 5e), which is the
primary cause of reducing RSG area advantage
with increasing bit-width. Maximum clock rates
(Fig. 5d) were generally well into the full-parallel
range, although for bit-widths 18 and 20
RSG filter performance drops as filter length
increases. This experiment established that, for
typical DSP filter lengths and coefficient
bit-widths, RSG filters provide the lowest area
implementations capable of being clocked at
full-parallel rates.
9. Conclusions
The classic research community optimisation
metric of minimizing multiplier block adder cost
has been demonstrated not to minimise FPGA
hardware for full-parallel pipelined FIR filters.
Reducing flip-flop count through minimizing
multiplier logic depth has instead been shown to
yield the lowest area solutions. The new RSG
algorithm has been defined to embody this
design principle. The results presented establish
a clear area advantage of RSG over prior
algorithms for typical filter parameters with
comparable maximum clock rates. In addition,
the industrial relevance of the transposed FIR
with multiplier block architecture and the RSG
algorithm has been established through
comparison with filters implemented using the
DA technique.
Fig. 4 Results for VHDL filter generation varying filter length (coefficients: 12 bits)
a FPGA hardware area
b Multiplier block adders
c Flip-flop usage
d Multiplier block logic depth
e LUT usage
Fig. 5 Comparing transposed FIR filter with multiplier block (RSG) with DA (coefficient bit-widths 12–20)
a FPGA hardware area
b Flip-flop usage
c Shift register usage
d Filter maximum clock rate
e LUT usage
10. References
1 Macpherson, K., Stirling, I., Rice, G.,
Garcia-Alis, D., and Stewart, R.: ‘Arithmetic
implementation techniques and methodologies for 3G
uplink reception in Xilinx FPGAs’. Third Int. Conf.
on 3G Mobile Communication Technologies, 2002, (IEE
Conf. Publ. no. 489), May 2002, pp. 191–195
2 Xilinx Inc., http://www.xilinx.com
3 Bull, D.R., and Horrocks, D.H.: ‘Primitive
operator digital filters’, IEE Proc. G, Circuits
Devices Syst., 1991, 138, (3), pp. 401–412
4 Dempster, A.G., and Macleod, M.D.: ‘Use of
minimum-adder multiplier blocks in FIR digital
filters’, IEEE Trans. Circuits Syst. II, Analog Digit.
Signal Process., 1995, 42, (9),
pp. 569–577
5 Meyer-Baese, U.: ‘Digital signal processing with
field programmable gate arrays’ (Springer-Verlag,
Berlin, Heidelberg, 2001)
6 Gustafsson, O., and Wanhammar, L.: ‘ILP
modelling of the common subexpression sharing
problem’. 9th Int. Conf. on Electronics, Circuits
and Systems, 2002, vol. 3, pp. 1171–1174
7 Jang, Y., and Yang, S.: ‘Low-power CSD linear
phase FIR filter structure using vertical common
sub-expression’, Electron. Lett., 2002, 38, (15), pp.
777–779
8 Demirsoy, S.S., Dempster, A.G., and Kale, I.:
‘Design guidelines for reconfigurable multiplier
blocks’. IEEE Int. Symp. on Circuits and Systems,
26–28 May 2003, pp. IV293–IV296
9 Demirsoy, S.S., Beck, R., Dempster, A.G., and
Kale, I.: ‘Reconfigurable implementation of
recursive DCT kernels for reduced quantization
noise’. IEEE Int. Symp. on Circuits and
Systems, 26–28 May 2003, pp. IV289–IV292
10 Dempster, A.G., Demirsoy, S.S., and Kale, I.:
‘Designing multiplier blocks with low logic depth’.
IEEE Int. Symp. on Circuits and Systems, 2002,
vol. 5, pp. V-773–V-776
11 Dempster, A.G., and Macleod, M.D.: ‘Constant
integer multiplication using minimum adders’, IEE
Proc., Circuits Devices Syst., 1994, 141, (5), pp.
407–413
12 Gustafsson, O., Dempster, A.G., and
Wanhammar, L.: ‘Extended results for minimum-adder
constant integer multipliers’. IEEE Int.
Symp. on Circuits and Systems, 2002, vol. 1,
pp. I-73–I-76
13 Wirthlin, M.J., and McMurtrey, B.: ‘Efficient
constant coefficient multiplication using advanced
FPGA architectures’. Proc. 11th Int. Workshop on
Field-Programmable Logic and Applications, 2001,
pp. 555–564
14 Xilinx Inc.: ‘Distributed arithmetic FIR filter
v8.0’, http://www.xilinx.com
15 Synplicity Inc.,