Two High-Performance Adaptive Filter Implementation Schemes Using Distributed Arithmetic

600 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS, VOL. 58, NO. 9, SEPTEMBER 2011

Rui Guo and Linda S. DeBrunner

Abstract—Distributed arithmetic (DA) is used to design bit-level architectures for vector–vector multiplication, with a direct application to the implementation of convolution, which is necessary for digital filters. In this brief, two novel DA-based implementation schemes are proposed for adaptive finite-impulse response filters. Different from conventional DA techniques, our proposed schemes use coefficients as addresses to access a series of lookup tables (LUTs) storing sums of delayed and scaled input samples. Two smart LUT updating methods are developed, and least-mean-square adaptation is performed to update the weights and minimize the mean square error between the estimated and desired output. Results show that our two high-performance designs achieve high speed, low computation complexities, and low area cost.

Index Terms—Adaptive filter, distributed arithmetic (DA), finite-impulse response (FIR), least mean square (LMS), lookup table (LUT), multiply accumulate (MAC), offset-binary coding (OBC).

I. INTRODUCTION

MOST PORTABLE electronic devices such as cellular phones, personal digital assistants, and hearing aids require digital signal processing (DSP) for high performance. Due to the increased demand for the implementation of sophisticated DSP algorithms, low-cost designs, i.e., designs with low area and power cost, are needed to make these handheld devices small with good performance.

Various types of DSP operations are employed in practice. Filtering is one of the most widely used signal processing operations [1]. For FIR filters, the output y(n) is a linear convolution of the weights $w_k$ and the inputs. For an Nth-order FIR filter, the generation of each output sample y(n) takes N + 1 multiply-accumulate (MAC) operations. Since general-purpose multipliers require significant chip area, alternate methods of implementing multiplication are often used, particularly when the coefficient values are known prior to implementation. Distributed arithmetic (DA) is one way to implement convolution without multipliers, where the MAC operations are replaced by a series of LUT accesses and summations. Techniques such as ROM decomposition [2] and offset-binary coding (OBC) [7] can reduce the LUT size, which would otherwise increase exponentially with the filter length N + 1 for conventional DA.

Manuscript received July 20, 2010; revised December 9, 2010, February 10, 2011, and April 20, 2011; accepted June 6, 2011. Date of publication August 22, 2011; date of current version September 14, 2011. This paper was recommended by Associate Editor P. K. Meher.

The authors are with the Department of Electrical and Computer Engineering, Florida State University, Tallahassee, FL 32310 USA (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TCSII.2011.2161168

However, in many applications such as echo cancelation and system identification, coefficient adaptation is needed. This adaptation makes it challenging to implement DA-based adaptive filters at low cost because the LUTs must be updated. Several approaches have been developed for DA-based adaptive filters from the point of view of reducing logic complexity [3]–[6], [8]. Recently, a DA-based FIR adaptive filter implementation scheme has been presented in [5], [6], and [8], which uses extra “auxiliary” LUTs to help in the updating; however, memory usage is doubled.

In this brief, two novel LMS adaptation-based DA implementation schemes are proposed for FIR adaptive filter implementation. The first proposed algorithm updates the LUTs in a similar way as described in [5], [6], and [8] but without the need for auxiliary LUTs. The second proposed algorithm incorporates an OBC-based LUT updating scheme that reduces memory usage. It is shown that our two proposed schemes both outperform that described in [5], [6], and [8], with the second proposed algorithm requiring less memory usage but more computation cost than our first proposed algorithm.

This brief is organized as follows. Section II describes the background of DA and OBC. Then, we present our proposed schemes for the DA-based FIR adaptive filter in Section III. A performance comparison of different DA-based implementations is made in Section IV. Our conclusions are given in Section V.

II. BACKGROUND

A. DA

DA was first studied by Croisier et al. [9] in 1973 and popularized by Peled and Liu [10]. Quantization effects in the DA system were analyzed in [11] and [12]. Useful tutorials on DA were provided in [7] and [13]. DA is used to design bit-level architecture for vector multiplication [2]. Traditionally, for filters implemented using DA, the input samples are used as addresses to access a series of LUTs whose entries are sums of coefficients. Consider a discrete Nth-order FIR filter with constant coefficients, and input samples coded as B-bit two’s complement numbers with only the sign bit to the left of the binary point as follows:

$$x(n-k) = -x_{k0} + \sum_{j=1}^{B-1} x_{kj}\,2^{-j}. \tag{1}$$

1549-7747/$26.00 © 2011 IEEE


Using (1) to compute the FIR output gives

$$y(n) = -\sum_{k=0}^{N} w_k x_{k0} + \sum_{j=1}^{B-1}\left[\sum_{k=0}^{N} w_k x_{kj}\right] 2^{-j}. \tag{2}$$

With $C_j = \sum_{k=0}^{N} w_k x_{kj}$, $\forall j \in [1, B-1]$, and $C_0 = -\sum_{k=0}^{N} w_k x_{k0}$, (2) can be rewritten as

$$y(n) = \sum_{j=0}^{B-1} C_j\,2^{-j}. \tag{3}$$

The $C_j$ values can be precomputed and stored in a LUT with the input used as the address. This technique allows the FIR filter with known coefficients to be implemented without general-purpose multipliers. This implementation requires a LUT with a size that increases exponentially with the number of taps N + 1, which results in a large time cost for accessing the LUT for a high-order filter. Therefore, reducing the LUT size improves system performance as well as area cost. One possible way to reduce LUT size, which is called ROM decomposition, replaces a longer address by shorter addresses, and the data read from smaller LUTs are accumulated to generate the output. For a 64-tap FIR filter, by breaking the LUT with $2^{64}$ entries into smaller LUTs with 4-bit addresses, only $(64/4) \times 2^4 = 2^8$ entries are required.
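The LUT-based evaluation of (3) can be illustrated with a short sketch. It is not taken from the brief: the function names, the use of exact fractions, and the choice of tap 0 as the least significant address bit are assumptions made only for illustration.

```python
from fractions import Fraction

def twos_complement_bits(value_int, B):
    """Bits [x_0, ..., x_{B-1}] of a B-bit two's-complement word, sign bit first."""
    pattern = value_int & ((1 << B) - 1)
    return [(pattern >> (B - 1 - j)) & 1 for j in range(B)]

def build_lut(weights):
    """LUT[a] = sum of weights[k] over every set bit k of address a."""
    taps = len(weights)
    return [sum(w for k, w in enumerate(weights) if (a >> k) & 1)
            for a in range(1 << taps)]

def da_fir_output(samples_int, weights, B):
    """Evaluate (3); samples_int[k] / 2**(B-1) is the value of x(n-k)."""
    lut = build_lut(weights)
    rows = [twos_complement_bits(x, B) for x in samples_int]
    y = Fraction(0)
    for j in range(B):
        # Bit j of every sample forms the LUT address (assumed: tap k -> address bit k).
        address = sum(rows[k][j] << k for k in range(len(weights)))
        c_j = lut[address]
        # The sign-bit term C_0 enters with a negative sign, as in (2).
        y += (-c_j if j == 0 else c_j) * Fraction(1, 2 ** j)
    return y

def rom_decomposed_entries(taps, m):
    """Total entries when one LUT is split into taps/m LUTs with m-bit addresses."""
    return (taps // m) * 2 ** m
```

For a four-tap filter the single LUT has 16 entries, and `rom_decomposed_entries(64, 4)` reproduces the $2^8$ entries quoted above for the 64-tap case.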

B. OBC

OBC can be used to reduce the LUT size by a factor of 2 to $2^{N-1}$ [7]. By rewriting the input from (1), OBC is derived as follows:

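$$x(n-k) = \frac{1}{2}\left\{x(n-k) - [-x(n-k)]\right\} \tag{4}$$

$$-x(n-k) = -\bar{x}_{k0} + \sum_{j=1}^{B-1} \bar{x}_{kj}\,2^{-j} + 2^{-(B-1)} \tag{5}$$

where $\bar{x}_{kj}$ denotes the complement of bit $x_{kj}$. Substituting (1) and (5) into (4) gives

$$x(n-k) = \frac{1}{2}\left[-(x_{k0} - \bar{x}_{k0}) + \sum_{j=1}^{B-1} (x_{kj} - \bar{x}_{kj})\,2^{-j} - 2^{-(B-1)}\right]. \tag{6}$$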

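By defining $D_{kj}$ as $x_{kj} - \bar{x}_{kj}$, the output from the FIR filter can be written as

$$\begin{aligned}
y(n) &= \sum_{k=0}^{N} \frac{w_k}{2}\left[-D_{k0} + \sum_{j=1}^{B-1} D_{kj}\,2^{-j} - 2^{-(B-1)}\right] \\
     &= -\sum_{k=0}^{N} \frac{w_k D_{k0}}{2} + \sum_{j=1}^{B-1}\left[\sum_{k=0}^{N} \frac{w_k D_{kj}}{2}\right] 2^{-j} - \sum_{k=0}^{N} \frac{w_k}{2}\,2^{-(B-1)}.
\end{aligned} \tag{7}$$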

By defining $E_j$ as $\sum_{k=0}^{N} (w_k D_{kj}/2)$ and $E_{\text{extra}}$ as $-\sum_{k=0}^{N} (w_k/2)$, (7) can be rewritten as

$$y(n) = -E_0 + \sum_{j=1}^{B-1} E_j\,2^{-j} + E_{\text{extra}}\,2^{-(B-1)}. \tag{8}$$

TABLE I
LUT CONTENTS FOR FOUR-TAP FIR WITH OBC CODING [2]

Fig. 1. DA-based bit-serial architecture for implementing K-tap FIR filter with OBC.

The OBC scheme is described in (4)–(8). The LUT contents for a four-tap FIR filter are given in Table I. It can be observed that the first half and the second half of this LUT are mirrored vertically. Therefore, its size can be halved by using $x_{0j}$ to control the sign of each entry at the cost of a slightly increased hardware complexity. The hardware circuit for implementing a K-tap filter is shown in Fig. 1, where j starts from j = B − 1 and decreases by 1 each cycle until j = 0. S1 is 0 when j = 0 and 1 otherwise, and S2 is 1 when j = B − 1 and 0 otherwise.
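The mirror property and (8) can be checked numerically with the following sketch. It is only an illustration under stated assumptions: the helper names are hypothetical, weights are exact fractions, tap 0 is taken as the least significant address bit, and the stored half is chosen as the addresses whose tap-0 bit is 0. It does not model the S1/S2 control of Fig. 1.

```python
from fractions import Fraction

def twos_complement_bits(value_int, B):
    """Bits [x_0, ..., x_{B-1}] of a B-bit two's-complement word, sign bit first."""
    pattern = value_int & ((1 << B) - 1)
    return [(pattern >> (B - 1 - j)) & 1 for j in range(B)]

def build_half_obc_lut(weights):
    """Store E = sum_k w_k D_k / 2 only for addresses whose tap-0 bit is 0."""
    taps = len(weights)
    half = []
    for h in range(1 << (taps - 1)):
        address = h << 1                      # tap-0 bit forced to 0
        signs = [1 if (address >> k) & 1 else -1 for k in range(taps)]
        half.append(sum(w * s for w, s in zip(weights, signs)) / 2)
    return half

def obc_lookup(half, address, taps):
    """Recover E for any address; the tap-0 bit selects the sign."""
    if address & 1:
        mirrored = ~address & ((1 << taps) - 1)   # complementing flips every D_k
        return -half[mirrored >> 1]
    return half[address >> 1]

def obc_fir_output(samples_int, weights, B):
    """Evaluate (8); samples_int[k] / 2**(B-1) is the value of x(n-k)."""
    taps = len(weights)
    half = build_half_obc_lut(weights)
    rows = [twos_complement_bits(x, B) for x in samples_int]
    e = [obc_lookup(half, sum(rows[k][j] << k for k in range(taps)), taps)
         for j in range(B)]
    e_extra = -sum(weights) / 2
    y = -e[0] + sum(e[j] * Fraction(1, 2 ** j) for j in range(1, B))
    return y + e_extra * Fraction(1, 2 ** (B - 1))
```

With four taps the halved table holds 8 entries instead of 16, and the result agrees with the direct sum of products for fractional weights.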

III. PROPOSED SCHEMES

For an FIR filter with LMS adaptation, which involves the automatic update of filter weights in accordance with the estimation error, conventional DA performance suffers from the intensive computation required to rebuild LUTs. Work has been done in [3]–[6], [8], [14], and [15] to reduce the computation workload for LUT updating by recomputing only a few LUT entries. In [5], [6], and [8], the authors proposed an efficient LUT updating method that uses auxiliary LUTs, where only half of the entries need to be recomputed. The techniques proposed in this brief eliminate the need for auxiliary LUTs. Our first scheme uses a LUT updating method similar to that in [5], [6], and [8] but without the need for auxiliary LUTs since our proposed scheme stores the sums of delayed inputs in LUTs. In our second scheme, OBC is incorporated, and a new updating method is introduced to further reduce the memory usage. Unlike [15], the algorithms proposed in this brief implement the coefficient updating and filter operations concurrently.

Fig. 2. LUT update from t = n to t = n + 1 (modified from [6]).

A. First Proposed Scheme

Conventional DA stores the sums of weights (coefficients) in LUTs and uses the inputs as addresses. This approach works very well for nonadaptive filters with constant coefficients. However, for adaptive filters, adaptation is necessary for the weights, and the input registers must be updated. By exploiting the commutative property of convolution, the same filtering operation can be obtained by storing sums of delayed input samples in LUTs and by using the binary coefficients as addresses. With the inputs and coefficients having a common bit width, as is often the case, this change does not increase latency; however, computation and memory cost are reduced for a bit-serial DA design. If the inputs and coefficients have different lengths, there is some difference in latency, which we have considered in Section IV. Representing the weights $w_k$ in two’s complement form using B bits gives

$$w_k = -w_{k0} + \sum_{j=1}^{B-1} w_{kj}\,2^{-j}. \tag{9}$$

Using (9) to calculate the output yields

$$y(n) = -\sum_{k=0}^{N} x(n-k)\,w_{k0} + \sum_{j=1}^{B-1}\left[\sum_{k=0}^{N} x(n-k)\,w_{kj}\right] 2^{-j}. \tag{10}$$

Similar to (2), the term in square brackets has only $2^{N+1}$ possible values; therefore, a LUT can be used. The left table shown in Fig. 2 gives the LUT values for a four-tap FIR filter. When the time index t = n + 1, (10) becomes

$$y(n+1) = -\sum_{k=0}^{N} x(n-k+1)\,w_{k0} + \sum_{j=1}^{B-1}\left[\sum_{k=0}^{N} x(n-k+1)\,w_{kj}\right] 2^{-j}. \tag{11}$$

Fig. 2 shows graphically how the LUTs can be updated. Specifically, it can be observed from the term in square brackets that the new input sample x(n + 1) is not used for the new entries whose least significant address bit (LSAB) $w_{0j}$ is 0. Since all the combinations of inputs x(n), x(n − 1), . . . , x(n − N) are included in the old LUT at t = n, these new entries with LSAB being 0 can be obtained directly by copying the corresponding entries from the old LUT, as indicated by the arrows in Fig. 2. A closer look shows that each of the remaining new entries, with LSAB being 1, can be generated by adding x(n + 1) to the preceding entry in the new LUT. Mathematically, the new entries $T_i(n+1)$ can be obtained from the old entries $T_i(n)$ by (12) and (13) with the entry index $i \in [0, 2^{N+1} - 1]$ as follows:

$$T_i(n+1) = T_{i/2}(n) \quad \forall i \in \{i \mid i \bmod 2 = 0\} \tag{12}$$

$$T_i(n+1) = T_{i-1}(n+1) + x(n+1) \quad \forall i \in \{i \mid i \bmod 2 = 1\}. \tag{13}$$
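The update in (12) and (13) can be compared against a full rebuild of the table with a minimal sketch. The function names are hypothetical, and the LUT is assumed to be indexed so that tap k drives address bit k, with $w_{0j}$ as the LSAB as stated above.

```python
def build_lut(window):
    """window[k] holds x(n-k); LUT[a] sums window[k] over every set bit k of a."""
    return [sum(x for k, x in enumerate(window) if (a >> k) & 1)
            for a in range(1 << len(window))]

def update_lut(old_lut, new_sample):
    """Apply (12) and (13) to move the LUT from time n to time n + 1."""
    new_lut = [0] * len(old_lut)
    for i in range(len(old_lut)):
        if i % 2 == 0:
            new_lut[i] = old_lut[i // 2]                 # (12): copy, no addition
        else:
            new_lut[i] = new_lut[i - 1] + new_sample     # (13): one addition
    return new_lut

# Usage: x(n..n-3) = 3, -1, 4, 2 and the new sample x(n+1) = 7.
updated = update_lut(build_lut([3, -1, 4, 2]), 7)
```

Only the odd-indexed entries cost an addition, which is where the $2^{m-1}$ additions per LUT counted in Section IV come from.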

In this brief, LMS adaptation is chosen to update the weights $w_k$, as shown in

$$w_k(n+1) = w_k(n) + \mu e(n)\,x(n-k) \tag{14}$$

$$e(n) = d(n) - \mathbf{w}(n)\mathbf{x}(n) \tag{15}$$

where d(n) is the desired output, $\mathbf{w}(n) = [w_0(n), w_1(n), \ldots, w_N(n)]$, $\mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-N)]^T$, and e(n) is the error between the desired and estimated output.

The top-level circuit diagram of this proposed scheme for an example four-tap FIR filter is shown in Fig. 3. The method for updating the DA_LUT block in our first scheme is similar to that used for updating the auxiliary LUT in [5], [6], and [8], which we refer to as the DA0 scheme. Fig. 4 shows more details of the implementation. The Addr Gen block has to generate the addresses in the order shown in Fig. 4. As shown in the weight update block in Fig. 3, the multiplication in (14) can be simplified to a shift by assuming that the step size is a power of 2 and by quantizing the product of the error and the step size to a power of 2. In contrast to DA0, our scheme uses no auxiliary LUTs but only main LUTs. The two multipliers controlled by a0 cooperate to read the proper entry according to Fig. 2 for updating the new entry each cycle. The updating of LUTs and weights used as addresses can be performed concurrently, which reduces latency. In DA0, two types of LUTs are necessary: the auxiliary LUTs need to be updated first, and then, the updates of the main LUTs are executed.
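The effect of quantizing $\mu e(n)$ to a power of 2 can be explored with a small floating-point sketch of (14) and (15). It is not the circuit of Fig. 3: the function names, the round-down quantization rule, the noise-free system-identification setup, and the uniform random input are all assumptions chosen for illustration.

```python
import math
import random

def power_of_two_floor(value):
    """Largest power of two not exceeding |value|, carrying the sign of value."""
    if value == 0.0:
        return 0.0
    _, exponent = math.frexp(abs(value))    # abs(value) = m * 2**exponent, 0.5 <= m < 1
    return math.copysign(math.ldexp(1.0, exponent - 1), value)

def lms_identify(true_weights, mu, steps, seed=0):
    """Adapt weights toward true_weights using a power-of-two scaled LMS update."""
    rng = random.Random(seed)
    taps = len(true_weights)
    weights = [0.0] * taps
    window = [0.0] * taps                   # window[k] holds x(n-k)
    for _ in range(steps):
        window = [rng.uniform(-1.0, 1.0)] + window[:-1]
        desired = sum(h * x for h, x in zip(true_weights, window))
        error = desired - sum(w * x for w, x in zip(weights, window))   # (15)
        scale = power_of_two_floor(mu * error)   # in hardware this becomes a shift
        weights = [w + scale * x for w, x in zip(weights, window)]      # (14)
    return weights

estimate = lms_identify([0.5, -0.25, 0.125, 0.0625], mu=0.125, steps=3000)
```

Because `scale` is always a signed power of two, each product `scale * x` corresponds to shifting x(n − k), so the weight update needs only additions.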

B. Second Proposed Scheme

In Section II-B, OBC is shown to reduce the number of LUT entries without increasing the number of LUTs required. In this section, we propose a new scheme that combines OBC with our first proposed scheme. Because of the commutative property, as in our first proposed approach, the sums of delayed input samples are stored in LUTs coded using OBC with binary coefficients as the address, as derived in (16)–(19). Equation (17) indicates that, with $w_{kj}$ as the address, LUTs can still be used to store $F_j$, as shown on the left of Fig. 5 for a four-tap FIR filter. Although the size of LUTs is reduced by applying OBC, the LUT updating still suffers from high computation cost since the oldest sample is included in every entry, e.g., x(n − 3) in Fig. 5. The second entry with address 001 at time t = n needs to be updated by adding (x(n − 3) − 2x(n − 2) + x(n + 1))/2.


Fig. 3. Top-level circuit diagram for the four-tap adaptive FIR filter.

Fig. 4. Detailed DA_LUT block for the four-tap adaptive FIR filter.

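The rest of the entries need to be updated with approximately the same computation cost. Equations (16)–(19) are as follows:

$$w_k = -w_{k0} + \sum_{j=1}^{B-1} w_{kj}\,2^{-j} \tag{16}$$

$$F_j = \sum_{k=0}^{N} \frac{x(n-k)\,(w_{kj} - \bar{w}_{kj})}{2} \tag{17}$$

$$F_{\text{extra}} = \sum_{k=0}^{N} \frac{x(n-k)}{2} \tag{18}$$

$$y(n) = -F_0 + \sum_{j=1}^{B-1} F_j\,2^{-j} - F_{\text{extra}}\,2^{-(B-1)} \tag{19}$$

where $\bar{w}_{kj}$ denotes the complement of bit $w_{kj}$.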

To reduce the computation workload, we propose a smart updating algorithm. Fig. 5 shows how the update works for a four-tap FIR filter, with precomputed Q = (x(n + 1) + x(n − 3))/2. Mathematically, the new entries $T_i(n+1)$ can be updated by (20) and (21) with the entry index $i \in [0, 2^N - 1]$ as follows:

$$T_i(n+1) = Q + T_{2i+1}(n) \quad \forall i \in \{i \mid i < 2^{N-1}\} \tag{20}$$

$$T_i(n+1) = Q - T_{2(2^N-1-i)}(n) \quad \forall i \in \{i \mid i \ge 2^{N-1}\}. \tag{21}$$

Fig. 5. LUT update for the four-tap adaptive FIR filter.

Fig. 6. Area comparison. (a) Chip area and (b) 32-tap FIR filter synthesis results.
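The update in (20) and (21) can be exercised with the sketch below. The entry layout of Fig. 5 is not reproduced in the text, so the layout used here is an assumption: x(n) always enters with a positive sign, and bit N − k of the index i negates x(n − k) for k = 1, ..., N. This choice is consistent with the address-001 example given earlier, but the function names and the layout itself are hypothetical.

```python
from fractions import Fraction

def build_obc_lut(window):
    """Assumed layout: T_i = (x(n) + sum_k s_k x(n-k)) / 2 for k = 1..N,
    where s_k = -1 if bit (N - k) of i is set and +1 otherwise.
    window[k] holds the integer sample x(n-k)."""
    N = len(window) - 1
    lut = []
    for i in range(1 << N):
        total = Fraction(window[0])
        for k in range(1, N + 1):
            sign = -1 if (i >> (N - k)) & 1 else 1
            total += sign * window[k]
        lut.append(total / 2)
    return lut

def update_obc_lut(old_lut, newest, oldest):
    """Apply (20) and (21) with Q = (x(n+1) + x(n-N)) / 2."""
    size = len(old_lut)                       # 2**N entries
    q = Fraction(newest + oldest, 2)
    new_lut = []
    for i in range(size):
        if i < size // 2:
            new_lut.append(q + old_lut[2 * i + 1])            # (20)
        else:
            new_lut.append(q - old_lut[2 * (size - 1 - i)])   # (21)
    return new_lut

# Usage: x(n..n-3) = 3, -1, 4, 2 and the new sample x(n+1) = 7.
old_window = [3, -1, 4, 2]
updated = update_obc_lut(build_obc_lut(old_window), 7, old_window[-1])
```

Under this layout each new entry costs one addition or subtraction once Q is available, which matches the $(2^{m-1} + 1)$ additions per LUT counted in Section IV.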

To update $F_{\text{extra}}$, we could add together all of the first entries from the sub-LUTs used by ROM decomposition, which requires num − 1 additions, where num is the number of sub-LUTs. Another method is to subtract half of the oldest input sample from $F_{\text{extra}}$ and add half of the newest sample to $F_{\text{extra}}$, which requires two operations every cycle, since division by two can be realized by right shifting. The latter method is chosen to update $F_{\text{extra}}$ for this proposed scheme. Performances for the different schemes are compared in Section IV.
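The chosen sliding update of $F_{\text{extra}}$ amounts to the following minimal sketch. The function name is hypothetical, and exact fractions stand in for the right shift used in hardware.

```python
from fractions import Fraction

def update_f_extra(f_extra, newest, oldest):
    """Slide F_extra = sum_k x(n-k) / 2 by one sample: one subtraction, one addition."""
    return f_extra - Fraction(oldest, 2) + Fraction(newest, 2)

# Usage: window x(n..n-3) = 3, -1, 4, 2 gives F_extra = 4; the new sample is 7.
f_next = update_f_extra(Fraction(3 - 1 + 4 + 2, 2), newest=7, oldest=2)
```

The cost is independent of the filter length, unlike the alternative of summing one entry from each sub-LUT.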

IV. PERFORMANCE COMPARISON

To compare our schemes with DA0, synthesis results from a 0.18-µm standard cell library for implementing FIR filters with 8-bit inputs and weights are presented in Fig. 6(a). It is shown that, since extra auxiliary LUTs are necessary for the DA0 scheme, significant area savings can be achieved by our proposed schemes. Similar results are obtained in Fig. 6(b) by implementing a 32-tap FIR filter on a Stratix II EP2S15F672I4 FPGA. In addition to the area advantage shown in Fig. 6, our proposed algorithm also has an advantage of reduced latency.

Formulations for certain critical measurements are derived to compare the performances of our two proposed schemes with DA0. Since these three implementation schemes have a similar clock rate, the memory usage and the computation cost for each scheme are derived and compared. The memory usage is measured by the number of LUT entries. If the step size is assumed, and the estimation error is scaled, to be a power of 2 so that multiplication simplifies to a shift, the only critical computation used for these three schemes is addition. The computation cost is estimated as the number of additions required per filter cycle, including LUT updating and data filtering.

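The memory usages for DA0 and our two proposed schemes, i.e., $M_{\text{DA0}}$, $M_{\text{proposed1}}$, and $M_{\text{proposed2}}$, respectively, are estimated as follows:

$$M_{\text{DA0}} = (2^m - 1)\left(\frac{K}{m}\right) \cdot 2 \tag{22}$$

$$M_{\text{proposed1}} = (2^m - 1)\left(\frac{K}{m}\right) \tag{23}$$

$$M_{\text{proposed2}} = 2^{m-1}\left(\frac{K}{m}\right) \tag{24}$$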

where K is the filter length and m is the number of bits required for the LUT address when ROM decomposition is used. K and m are assumed to be powers of 2. Since our proposed first scheme does not use any auxiliary LUTs, its memory usage is exactly half of that of DA0. In our second proposed scheme, the memory usage is reduced further by using OBC for the LUTs. With OBC-coded LUTs, our second proposed scheme requires the least memory usage, which is less than 30% of that of DA0, since $M_{\text{proposed2}}/M_{\text{DA0}} = 2^3/((2^4 - 1) \times 2) \approx 26.7\%$ for the case of m = 4.
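As a quick numerical check of (22)–(24), the entry counts can be tabulated with a few lines of code. The function name and the example values K = 32 and m = 4 are chosen here only for illustration.

```python
def lut_entries(K, m):
    """Number of LUT entries per scheme from (22)-(24); K and m are powers of 2."""
    num_luts = K // m
    return {
        "DA0": (2 ** m - 1) * num_luts * 2,     # main plus auxiliary LUTs
        "proposed1": (2 ** m - 1) * num_luts,   # main LUTs only
        "proposed2": 2 ** (m - 1) * num_luts,   # OBC-coded LUTs
    }

counts = lut_entries(K=32, m=4)
ratio = counts["proposed2"] / counts["DA0"]     # 64 / 240
```

The ratio does not depend on K, because the factor K/m cancels.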

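If W and B are the bit widths of the inputs and coefficients, respectively, then the numbers of additions required per filter cycle for these three schemes are estimated as follows:

$$A_{\text{DA0}} = 2^{m-1}\left(\frac{K}{m}\right) + 2^m\left(\frac{K}{m}\right) + \left(\frac{K}{m}\right) \cdot W - 1 \tag{25}$$

$$A_{\text{proposed1}} = 2^{m-1}\left(\frac{K}{m}\right) + K + \left(\frac{K}{m}\right) \cdot B - 1 \tag{26}$$

$$A_{\text{proposed2}} = (2^{m-1} + 1)\left(\frac{K}{m}\right) + K + \left(\frac{K}{m}\right) \cdot B + 1. \tag{27}$$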

The three addends in (25)–(27) count the additions for updating the LUTs, updating the coefficients, and summing up all the entries read from the LUTs, respectively. To examine the effects of changing the ratio of the input width W and the coefficient width B, we plot the savings of additions for different values in Fig. 7.
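The addition counts in (25)–(27) can be evaluated in the same way. The function name is hypothetical, and the example values K = 32, m = 4, and W = B = 8 are chosen here only to match the 8-bit, 32-tap setting mentioned earlier in this section.

```python
def addition_counts(K, m, W, B):
    """Additions per filter cycle from (25)-(27)."""
    num_luts = K // m
    return {
        "DA0": 2 ** (m - 1) * num_luts + 2 ** m * num_luts + num_luts * W - 1,
        "proposed1": 2 ** (m - 1) * num_luts + K + num_luts * B - 1,
        "proposed2": (2 ** (m - 1) + 1) * num_luts + K + num_luts * B + 1,
    }

adds = addition_counts(K=32, m=4, W=8, B=8)
saving1 = 1 - adds["proposed1"] / adds["DA0"]   # fraction of additions saved
```

For these values the counts are 255, 159, and 169, so the second scheme needs 10 more additions per cycle than the first.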

It is shown that our two proposed schemes require fewer additions than DA0, whereas the proposed second scheme needs slightly more addition operations than our first proposed approach, which is due to the precomputation of Q and the updating of $F_{\text{extra}}$ every cycle.

V. CONCLUSION

In this brief, two different DA-based schemes were presented for FIR adaptive filter implementation. In contrast to conventional DA-based schemes, our schemes store the sums of delayed input samples in LUTs and use the binary coefficients as addresses. It was shown that, since no auxiliary LUTs are required, our first proposed scheme needs exactly half the memory usage required by the previous work, whereas the second proposed scheme needs less than 30% of that required by the previous work. In addition, our two proposed schemes both have low computation cost, with the second scheme requiring slightly more addition operations than the first one. Unlike the previous work, in our schemes, the updating of LUTs and coefficients can be executed concurrently, which enables low latency.

Fig. 7. (a) Savings by the proposed first scheme. (b) Savings by the proposed second scheme.

REFERENCES

[1] S. K. Mitra, Digital Signal Processing: A Computer-Based Approach, 2nd ed. New York: McGraw-Hill, 2001.

[2] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Hoboken, NJ: Wiley, 1999.

[3] C. H. Wei and J. J. Lou, “Multimemory block structure for implementing a digital adaptive filter using distributed arithmetic,” Proc. Inst. Elect. Eng., vol. 133, no. 1, pt. G, pp. 19–26, Feb. 1986.

[4] C. F. N. Cowan and J. Mavor, “New digital adaptive-filter implementation using distributed-arithmetic techniques,” Proc. Inst. Elect. Eng., vol. 128, no. 4, pt. F, pp. 225–230, Feb. 1981.

[5] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “LMS adaptive filters using distributed arithmetic for high throughput,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 7, pp. 1327–1337, Jul. 2005.

[6] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “A novel high performance distributed arithmetic adaptive filter implementation on an FPGA,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2004, vol. 5, pp. V-161–V-164.

[7] S. A. White, “Applications of distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Mag., vol. 6, no. 3, pp. 4–19, Jul. 1989.

[8] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, “An FPGA implementation for a high throughput adaptive filter using distributed arithmetic,” in Proc. 12th Annu. IEEE Symp. Field-Programmable Custom Comput. Mach., 2004, pp. 324–325.

[9] A. Croisier, D. Esteban, M. Levilion, and V. Rizo, “Digital filter for PCM encoded signals,” U.S. Patent 3 777 130, Dec. 4, 1973.

[10] A. Peled and B. Liu, “A new hardware realization of digital filters,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-22, no. 6, pp. 456–462, Dec. 1974.

[11] K. Kammeyer, “Quantization error analysis of the distributed arithmetic,” IEEE Trans. Circuits Syst., vol. CAS-24, no. 12, pp. 681–689, Dec. 1977.

[12] F. Taylor, “An analysis of the distributed-arithmetic digital filters,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, no. 5, pp. 1165–1170, Oct. 1986.

[13] K. Kammeyer, “Digital filter realization in distributed arithmetic,” in Proc. Eur. Conf. Circuit Theory Des., Genoa, Italy, 1976.

[14] C. F. N. Cowan, S. G. Smith, and J. H. Elliott, “A digital adaptive filter using a memory-accumulator architecture: Theory and realization,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-31, no. 3, pp. 541–549, Jun. 1983.

[15] W. Huang and D. V. Anderson, “Modified sliding-block distributed arithmetic with offset binary coding for adaptive filters,” J. Signal Process. Syst., vol. 63, no. 1, pp. 153–163, Apr. 13, 2010.
