
[dsp tips&tricks]

Benny Sällberg

IEEE SIGNAL PROCESSING MAGAZINE [144] SEPTEMBER 2013 1053-5888/13/$31.00©2013IEEE

Faster Subband Signal Processing

Subband signal processing is an important tool in numerous applications such as acoustic echo cancellation, noise reduction, signal enhancement, adaptive beamforming, and signal separation (see, for instance, [1]–[3] for example applications). Subband signal processing uses a filter bank to split each input signal into a set of frequency signals, each covering a fraction of the input signal bandwidth; see the illustration in Figure 1. Subband processing provides an efficient way to divide and conquer tedious problems with a set of parallel and smaller subband algorithms. In many cases, subband processing is performed together with decimation, which reduces the dimensionality of the data in the subband algorithm. The focus in this article is on temporal subband processing, i.e., time-frequency transformation, although spatial subband processing may also gain from the discussion here.

Subband algorithms are conventionally derived with the subband index as the center of gravity, i.e., by treating each subband as independent of the others. However, the subband-centered point of view is not necessarily optimal with regard to implementation efficiency. The reason is that subband algorithm code comprises a number of (nested) inner and outer loops. Each loop has a certain cost associated with it, inner loops in particular. Every time a loop is executed, the cost of that loop adds to the total implementation cost. The cost due to frequent loop calls may contribute a nonnegligible share of an implementation's total cost.

This column discusses an approach to minimize the cost associated with loops in subband signal processing. An alternative implementation approach is presented that basically promotes longer inner (rather than outer) loops. It will be shown that this alternative leads to a large performance improvement over the conventional approach under certain conditions. Subband adaptive filters (SAFs) are used to demonstrate the concept and the processing gains that can be achieved by the alternative approach.

FILTER BANK

A key component of subband processing is the filter bank (see, e.g., [4] and [5]). The analysis part of a filter bank transforms a full band signal into a set of subband signals, each representing a smaller frequency band of the original signal. The synthesis part of the filter bank transforms the subband signals back into a full band representation.

It is assumed that two time-domain input signals x[n] and d[n] are transformed into K subband signals; see Figure 2. The input subband signals are denoted X_k[i] and D_k[i] for k in

Digital Object Identifier 10.1109/MSP.2013.2265479

Date of publication: 20 August 2013

[FIG1] Each of K subband signals covers a fraction of the input signal bandwidth. (The figure shows a function-versus-frequency axis from 0 to 2π with panels labeled Subband 0, Subband 1, …, Subband K−1.)

[FIG2] Open-loop subband adaptive filtering. (An analysis filter bank splits x[n] into X_0[i], …, X_{K−1}[i] and d[n] into D_0[i], …, D_{K−1}[i]; the blocks SAF_0, …, SAF_{K−1} produce Y_0[i], …, Y_{K−1}[i]; and a synthesis filter bank reconstructs y[n].)


[0, K − 1]. The subband sample index i relates to the full band sample index n as n = iR, where R is the number of full band input samples between each subband processing tick, i.e., the analysis block length. The subband input signals X_k[i] and D_k[i] are fed to an SAF that produces a subband output signal Y_k[i]. The synthesis filter bank reconstructs the output signal y[n] from the subband output signals Y_k[i].

SUBBAND ADAPTIVE FILTERS

Two SAF algorithms that are common in real applications are used to benchmark the alternative approach: the least mean squares (LMS) and the recursive least squares (RLS); see, e.g., [1].

It is assumed that each SAF comprises a subband finite impulse response filter with L_sub coefficients per subband and that an open-loop configuration is used [6], as illustrated in Figure 2. This system configuration suits many adaptive filter applications, such as noise reduction and system identification.

The SAF weights W_k[i, l] and the input signal memory X_k[i] are arranged in vectors

\mathbf{W}_k[i] = (W_k[i,0], W_k[i,1], \ldots, W_k[i, L_{sub}-1])^T,   (1)

\mathbf{X}_k[i] = (X_k[i], X_k[i-1], \ldots, X_k[i - L_{sub}+1])^T,   (2)

where the superscript (\cdot)^T denotes the matrix transpose. For each sample i, the subband adaptive filter output signal is computed as

Y_k[i] = \sum_{l=0}^{L_{sub}-1} W_k^*[i,l]\, X_k[i-l] = \mathbf{W}_k^H[i]\,\mathbf{X}_k[i], \quad for k \in \{0, 1, \ldots, K-1\},   (3)

where the superscript (\cdot)^* denotes the complex conjugate and the superscript (\cdot)^H denotes the complex conjugate transpose, i.e., the Hermitian transpose. The difference between the desired signal and the filter output signal forms an error signal

E_k[i] = D_k[i] - Y_k[i].   (4)

Common to the selected algorithms is the goal of minimizing the mean square error E[|E_k[i]|^2]. The LMS algorithm approximates the expectation operator with sample values using the stochastic gradient approach. The filter weight update equation of the LMS is (see, e.g., [1])

\mathbf{W}_k[i+1] = \mathbf{W}_k[i] - \mu E_k^*[i]\,\mathbf{X}_k[i],   (5)

where \mu is a step size constant. The RLS method uses an exponentially weighted window to approximate the expectation operator (other variants exist; see, e.g., [1]) and has the following weight update equation (see, e.g., [1]):

\mathbf{W}_k[i+1] = \mathbf{W}_k[i] + E_k^*[i]\,\mathbf{P}_k[i]\,\mathbf{X}_k[i],   (6)

where the inverse coherence matrix \mathbf{P}_k[i] is recursively updated using the forgetting factor \lambda as

\mathbf{P}_k[i+1] = \lambda^{-1}\mathbf{P}_k[i] - \lambda^{-1}\mathbf{P}_k[i]\mathbf{X}_k[i] \times \mathbf{X}_k^H[i]\mathbf{P}_k[i] \left(\lambda + \mathbf{X}_k^H[i]\mathbf{P}_k[i]\mathbf{X}_k[i]\right)^{-1}.   (7)
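For concreteness, one step of the rank-one update in (7) can be sketched in C for a real-valued case; the 2 × 2 dimension, the function name, and the numeric values below are illustrative only, not from the original column:

```c
#define L 2  /* illustrative subband filter length */

/* One step of (7) with real-valued data and a symmetric P
   (P remains symmetric if initialized so):
   P <- (1/lambda) * (P - (P x)(P x)^T / (lambda + x^T P x)).
   This keeps P equal to the inverse of the exponentially
   weighted correlation matrix R <- lambda * R + x x^T. */
void rls_p_update(float P[L][L], const float x[L], float lambda)
{
    float Px[L];
    /* Px = P * x */
    for (int a = 0; a < L; a++) {
        Px[a] = 0.0f;
        for (int b = 0; b < L; b++)
            Px[a] += P[a][b] * x[b];
    }
    /* xPx = x^T P x */
    float xPx = 0.0f;
    for (int a = 0; a < L; a++)
        xPx += x[a] * Px[a];
    float denom = lambda + xPx;
    /* rank-one downdate and rescale by 1/lambda */
    for (int a = 0; a < L; a++)
        for (int b = 0; b < L; b++)
            P[a][b] = (P[a][b] - Px[a] * Px[b] / denom) / lambda;
}
```

The update avoids any explicit matrix inversion, which is the point of the matrix inversion lemma form in (7).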

[FIG3] The processing steps using (a) the conventional and (b) the alternative implementation approaches while computing the subband filter output signal in (3) for K = 8 subbands and L_sub = 3 filter taps per subband. (In (a), steps 1 through 8 each multiply-accumulate three W^*X products; in (b), steps 1 through 3 each process all eight subbands.)


FASTER SUBBAND PROCESSING

Subband algorithm implementations usually comprise several (nested) loops. As mentioned in the introduction, every loop has a certain cost associated with it. It is desirable to have a minimum number of inner-loop calls, since that keeps the associated loop cost, and therefore also the total implementation cost, minimal. The focus here is to show an alternative implementation approach that improves implementation efficiency over the conventional approach.

As an example, consider the computation of all filter output signals in (3) for K = 8 subbands and L_sub = 3 filter taps per subband, as illustrated in Figure 3. The conventional implementation approach executes small inner loops eight times, whereas the alternative approach executes long inner loops only three times. If we assume a constant cost per loop, the cost due to loops is expected to be larger in the conventional implementation approach than in the alternative approach. It is stressed that, although the two implementation approaches order their loops differently, they implement the same signal processing function.
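The equivalence of the two loop orderings can be sketched in C for the K = 8, L_sub = 3 example of Figure 3, using real-valued data for brevity (the signals in the column are complex). The function names and array layouts below are illustrative, not from the original listings:

```c
#define K    8   /* number of subbands (example from the text) */
#define LSUB 3   /* filter taps per subband */

/* Conventional: outer loop over subbands, short inner loop over taps.
   Data are stored subband-major: x[k][l] is tap l of subband k. */
void filter_conventional(float x[K][LSUB], float w[K][LSUB], float y[K])
{
    for (int k = 0; k < K; k++) {   /* starts a short inner loop 8 times */
        y[k] = 0.0f;
        for (int l = 0; l < LSUB; l++)
            y[k] += w[k][l] * x[k][l];
    }
}

/* Alternative: outer loop over taps, long inner loop over subbands.
   Data are stored tap-major: xt[l][k] is tap l of subband k, so each
   inner loop sweeps contiguous memory across all subbands. */
void filter_alternative(float xt[LSUB][K], float wt[LSUB][K], float y[K])
{
    for (int k = 0; k < K; k++)
        y[k] = 0.0f;
    for (int l = 0; l < LSUB; l++)  /* starts a long inner loop 3 times */
        for (int k = 0; k < K; k++)
            y[k] += wt[l][k] * xt[l][k];
}
```

Both functions compute exactly the sum in (3); only the loop order and the memory layout differ.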

To get a picture of the implementation cost due to the execution of loops, assume that the average cost of executing a single loop is C_Loop. It is assumed that all factors that do not contribute to the actual signal processing are included in this cost. This includes, for instance, the instructions of the loop prolog/epilog code that do not explicitly implement signal processing, together with the cost of restarting the instruction pipeline and memory usage. Counting the number of times the loops are executed and multiplying this number by the average cost of executing a single loop yields an overall loop cost for a particular implementation.

Code listings for the subband LMS algorithm implemented using the conventional approach and the alternative approach are shown in "Appendix A." The implementation of the subband LMS algorithm is illustrated in Figure 4. The conventional LMS implementation comprises one outer loop and two inner loops, where each inner loop is executed K times. This gives a total loop cost of (2K + 1)C_Loop. The alternative approach has three outer loops and two inner loops, where each inner loop is executed L_sub times. This gives a total loop cost of (2L_sub + 3)C_Loop. Using this model, if 2K + 1 > 2L_sub + 3, the total cost of the conventional approach exceeds the cost of the alternative approach. The loop cost is computed in a similar manner for the subband RLS algorithm and listed in Table 1.
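Under this constant-cost-per-loop model, the loop-call counts can be tabulated programmatically. The following sketch encodes the LMS counts from the text and the RLS counts as read from Table 1 (the function names are illustrative):

```c
/* Loop calls per processing tick under the constant-cost-per-loop model.
   Multiply each count by C_Loop to obtain the total loop cost. */
int lms_loop_calls_conventional(int K)           { return 2 * K + 1; }
int lms_loop_calls_alternative(int Lsub)         { return 2 * Lsub + 3; }
int rls_loop_calls_conventional(int K, int Lsub) { return 2 * K * (Lsub + 2) + 1; }
int rls_loop_calls_alternative(int Lsub)         { return Lsub * (2 * Lsub + 5) + 5; }
```

For the Table 1 example (K = 64, L_sub = 8), these evaluate to 129, 19, 1,281, and 173 loop calls, respectively.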

According to Table 1, in the specific case when the number of subbands is K = 64 and the subband filter length is

[FIG4] The implementation of the subband LMS algorithm using (a) the conventional approach and (b) the alternative approach, where the total loop-related costs of the implementations are (2K + 1)C_Loop and (2L_sub + 3)C_Loop, respectively. (In (a), an outer loop over subbands k = 0…K−1 contains one inner loop over coefficients l = 0…L_sub−1 that computes the output signal Y for subband k, the computation of the error signal E for subband k, and a second inner loop over coefficients that updates the coefficients of subband k. In (b), an outer loop over taps l = 0…L_sub−1 sweeps all subbands k = 0…K−1 to compute the output signal Y, one loop over subbands computes the error signal E for all subbands, and a second loop over taps updates the lth filter coefficient for all subbands.)


APPENDIX A

Shown here are code listings for the conventional and the alternative implementation approaches of the subband LMS algorithm. All variable initializations are omitted for clarity of presentation. It is assumed that, for each block of R input samples, an analysis filter bank computes the subband signals of x[n] and d[n] and places them in the complex-valued vectors X and D, respectively.

LMS

The LMS algorithm is implemented in the conventional approach as follows:

    void LMS(complex_float Y[], complex_float D[])
    {                                                      // #Loop calls
        for (k = 0; k < K; k++) {       // Loop over subbands         1
            // Compute filter output
            Y[k] = cvecdotconjf(&X[k*Lsub], &W[k*Lsub], Lsub);     // K
            // Compute error signal
            E_temp.re = -mu*( D[k].re - Y[k].re);
            E_temp.im = -mu*(-D[k].im + Y[k].im);
            // Update the coefficients
            cvecsmacf(&W[k*Lsub], E_temp, &X[k*Lsub], Lsub);       // K
        }
    }

Total number of loop calls: 2K + 1.

Now consider the alternative approach to implementing a subband LMS. Instead of looping over the subbands, the code computes all subbands at once, one element (tap) at a time.

    void LMS_alternative(complex_float Y[], complex_float D[])
    {                                                      // #Loop calls
        // Compute the filter output for tap 0
        cvecvmltconjf(&X[0*K], &W[0*K], Y, K);                     // 1
        // Successively add remaining taps
        for (j = 1; j < Lsub; j++) {    // Loop over taps             1
            cvecvmacconjf(&X[j*K], &W[j*K], Y, K);            // Lsub-1
        }
        // Compute the error signal for all subbands
        for (k = 0; k < K; k++) {       // Loop over subbands         1
            E[k].re = -mu * (D[k].re - Y[k].re);
            E[k].im = -mu * (D[k].im - Y[k].im);
        }
        // Update the filter vector for all subbands,
        // one tap at a time
        for (j = 0; j < Lsub; j++) {    // Loop over taps             1
            cvecvmacconjf(&X[j*K], E, &W[j*K], K);              // Lsub
        }
    }

Total number of loop calls: 2 L_sub + 3.

The support functions used above implement the following functionality:

• complex_float cvecdotconjf(complex_float* U, complex_float* V, const int N)
Returns the inner product between vector U and vector V (conjugated), according to \sum_{n=0}^{N-1} V^*[n] \cdot U[n], where the superscript * denotes the complex conjugate.

• cvecsmacf(complex_float* Y, const complex_float C, complex_float* V, int N)
Scales the vector V by the constant C and accumulates the result in the vector Y, as Y[n] = Y[n] + C \cdot V[n] for n \in \{0, \ldots, N-1\}.

• cvecvmltconjf(complex_float* U, complex_float* V, complex_float* Y, int N)
Computes the element-wise multiplication between the elements of vector U and the elements of vector V (conjugated) and stores the result in the vector Y, as Y[n] = V^*[n] \cdot U[n] for n \in \{0, \ldots, N-1\}.

• cvecvmacconjf(complex_float* U, complex_float* V, complex_float* Y, int N)
Computes the element-wise multiplication between the elements of vector U and the elements of vector V (conjugated) and accumulates the result to the vector Y, as Y[n] = Y[n] + V^*[n] \cdot U[n] for n \in \{0, \ldots, N-1\}.
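The column's inner-loop routines are hand-written in assembler for the target DSP. As a point of reference only, portable C versions matching the mathematical definitions above might look as follows; this is a sketch, and the complex_float layout is assumed here to be a plain re/im struct:

```c
typedef struct { float re, im; } complex_float;  /* assumed layout */

/* Inner product: sum over n of conj(V[n]) * U[n]. */
complex_float cvecdotconjf(const complex_float *U, const complex_float *V, int N)
{
    complex_float acc = {0.0f, 0.0f};
    for (int n = 0; n < N; n++) {
        acc.re += V[n].re * U[n].re + V[n].im * U[n].im;
        acc.im += V[n].re * U[n].im - V[n].im * U[n].re;
    }
    return acc;
}

/* Scale-and-accumulate: Y[n] += C * V[n]. */
void cvecsmacf(complex_float *Y, const complex_float C, const complex_float *V, int N)
{
    for (int n = 0; n < N; n++) {
        Y[n].re += C.re * V[n].re - C.im * V[n].im;
        Y[n].im += C.re * V[n].im + C.im * V[n].re;
    }
}

/* Element-wise multiply: Y[n] = conj(V[n]) * U[n]. */
void cvecvmltconjf(const complex_float *U, const complex_float *V, complex_float *Y, int N)
{
    for (int n = 0; n < N; n++) {
        Y[n].re = V[n].re * U[n].re + V[n].im * U[n].im;
        Y[n].im = V[n].re * U[n].im - V[n].im * U[n].re;
    }
}

/* Element-wise multiply-accumulate: Y[n] += conj(V[n]) * U[n]. */
void cvecvmacconjf(const complex_float *U, const complex_float *V, complex_float *Y, int N)
{
    for (int n = 0; n < N; n++) {
        Y[n].re += V[n].re * U[n].re + V[n].im * U[n].im;
        Y[n].im += V[n].re * U[n].im - V[n].im * U[n].re;
    }
}
```

These C versions trade the SIMD and pipeline optimizations of the assembler originals for portability; they are useful mainly for desktop verification of the subband code.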


L_sub = 8, the LMS has a 129/19 ≈ 6.8 times higher loop cost in the conventional approach than in the alternative implementation approach. For the same configuration, the RLS has a 1,281/173 ≈ 7.4 times higher loop cost in the conventional approach than in the alternative implementation approach.

PERFORMANCE ANALYSIS

This section introduces an empirical analysis of the performance gain from using the alternative approach over the conventional subband-centered point of view. The analysis is useful for researchers and engineers working with MATLAB and digital signal processors (DSPs). The two selected SAF algorithms are implemented in MATLAB and on an Analog Devices SHARC ADSP-21262 DSP. The DSP is a floating- (and fixed-) point processor with a pipelined instruction architecture. It provides two parallel computational units, each comprising a multiplier, one arithmetic logic unit, and one shifter. The two computational units execute the same code, but on different data, i.e., single instruction, multiple data (SIMD) processing is supported. During the analysis, efforts were made to optimize the DSP code. The DSP framework is written in the C language with the compiler set to maximally optimize for speed (e.g., it allows loop unrolling). Functions that compute inner loops, such as vector dot products, are written in assembly language to maximize instruction pipeline usage. The inner loops use the SIMD mode throughout and use data registers as intermediate data holders to yield maximal throughput.

The performance is measured in MATLAB by using a Monte Carlo simulation averaging over 1,000 experiments on a desktop computer (Intel Core i7-2600 CPU at 3.4 GHz and 16 GB RAM, running 64-bit Windows 7 Professional and MATLAB version R2011b). The MATLAB clocking functions tic and toc are used to assess the time it takes for each implementation to complete processing all subbands. The average execution time is denoted T. On the DSP, the number of clock cycles required to process all subbands is measured. Analog Devices' integrated development environment has a mode where the DSP is simulated and the number of consumed clock cycles is measured. The number of clock cycles is denoted N. The performance measures are computed for a varying number of subbands K \in \{32, 64, 128\} and subband filter lengths L_{sub} \in \{2, 4, 6, 8\}. These ranges are selected because they are representative of many acoustical applications. Two performance measures are computed:

P_T = T_{Conv} / T_{Alt},   (8)

P_N = N_{Conv} / N_{Alt},   (9)

where the subindex "Conv" refers to the conventional implementation approach and the subindex "Alt" refers to the alternative implementation approach. Hence, the performance measures in (8) and (9) are relative to the alternative approach. If a performance measure is equal to unity, both alternatives require the same processing time/computational load. If, on the other hand, a performance measure is above unity, an implementation according to the subband-centered approach requires more processing time/computational load than the alternative approach. Finally, if a performance measure is less than unity, the presented alternative approach requires more processing time/computational resources than the straightforward subband-centered approach.

RESULTS

The timing results from the MATLAB Monte Carlo simulation are provided in Figure 5. It stands clear that the alternative implementation provides a large execution time saving over the subband-centered approach, ranging from a 6.5 times to an 85 times speed improvement. The largest saving in computation time occurs when K = 128 and L_sub = 2 and for the most computationally heavy algorithm, the RLS. As the number of subband filter coefficients

[TABLE 1] Total loop cost for each algorithm implementation, where C_Loop is the average cost for a single loop, K is the number of subbands, and L_sub is the subband filter length. The last two rows show an example configuration where K = 64 and L_sub = 8.

ALGORITHM   CONFIGURATION             CONVENTIONAL APPROACH        ALTERNATIVE APPROACH
LMS         Any K > 0, L_sub > 0      (2K + 1)C_Loop               (2L_sub + 3)C_Loop
RLS         Any K > 0, L_sub > 0      [2K(L_sub + 2) + 1]C_Loop    [L_sub(2L_sub + 5) + 5]C_Loop
LMS         Ex.: K = 64, L_sub = 8    129 C_Loop                   19 C_Loop
RLS         Ex.: K = 64, L_sub = 8    1,281 C_Loop                 173 C_Loop

[FIG5] The average time consumption excess P_T for LMS (solid) and RLS (dashed), where the number of subbands is K = 32 (squares), K = 64 (circles), and K = 128 (triangles). (Axes: subband filter length L_sub, from 2 to 8, versus P_T, from 0 to 100.)


increases, the computation time excess gets smaller, but it still provides a 6.5 times improvement over the straightforward approach for the selected configurations. This empirical analysis may prove helpful for researchers and engineers conducting offline experiments in MATLAB; it is a reminder to keep the largest data dimension in the execution's inner loop.

The focus is now on the DSP implementation of the selected algorithms. The results from the analysis of the DSP implementation are given in Figure 6. Higher performance is found by implementing according to the alternative approach than the subband-centered approach in all tested cases. The results show that the alternative implementation approach requires up to 5.3 times fewer cycles compared to the conventional approach.

CONCLUSIONS

This column discusses the importance of considering data dimensions during the implementation of a subband algorithm. Improved implementation efficiency is achieved by considering the data dimensionality during development. The conventional subband-centered approach is altered into an alternative approach where the inner loops, instead of the outer loops, run over subbands. The actual signal processing is, however, identical in both approaches. Two subband adaptive filter structures are used to benchmark the alternative approach: the LMS and the RLS. The results show a performance improvement of up to 85 times less execution time in MATLAB for the RLS when implemented using the alternative approach instead of the conventional approach.

On a DSP, the performance improvement is up to 5.3 times fewer clock cycles required by the alternative approach over the conventional approach for the LMS. The alternative approach allows researchers and engineers to conduct subband processing experiments much faster and more efficiently than if they followed the conventional implementation approach. On a DSP, the gain in the number of clock cycles can, for instance, be used to conserve power, e.g., to prolong battery lifetime in portable equipment.

AUTHOR

Benny Sällberg ([email protected]) is a research director at Exaudio AB, Sweden.

REFERENCES

[1] S. Haykin, Adaptive Filter Theory, 4th ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.

[2] A. Gilloire and M. Vetterli, "Adaptive filtering in subbands," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-88), 1988, vol. 3, pp. 1572–1575.

[3] A. Gilloire and M. Vetterli, “Adaptive filtering in subbands with critical sampling: Analysis, experiments, and application to acoustic echo cancellation,” IEEE Trans. Signal Processing, vol. 40, no. 8, pp. 1862–1875, 1992.

[4] R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing. Englewood Cliffs, NJ: Prentice Hall, 1983.

[5] P. P. Vaidyanathan, Multirate Systems and Filter Banks. Englewood Cliffs, NJ: Prentice Hall, 1993.

[6] D. R. Morgan and J. C. Thi, "A delayless subband adaptive filter architecture," IEEE Trans. Signal Processing, vol. 43, no. 8, pp. 1819–1830, 1995.

[SP]

[FIG6] The excess in number of clock cycles P_N for LMS (solid) and RLS (dashed), where the number of subbands is K = 32 (squares), K = 64 (circles), and K = 128 (triangles). (Axes: filter length L_sub, from 2 to 8, versus P_N, from 1 to 6.)
