An Efficient DSP-Based Implementation of a Fast Convolution Approach with Non-Uniform Partitioning

Andrea Primavera¹, Stefania Cecchi¹, Laura Romoli¹, Francesco Piazza¹ and Marco Moschetti²

¹ A3lab - DII - Università Politecnica delle Marche - Ancona - Italy

² Korg Italy - Osimo (AN) - Italy

5th European DSP in Education and Research Conference, 13th and 14th September 2012, Amsterdam, Netherlands.


1 Fast Convolution
• Introduction
• State of the art

2 Proposed Algorithm

3 Efficient DSP Implementation
• Target
• UPOLS implementation
• Memory management
• FFT/IFFT operations
• Psychoacoustic expedients
• Final remarks

4 Results
• Case study: artificial reverberator
• UPOLS performance
• NUPOLS performance

5 Conclusion
• Conclusion
• Questions





FIR filtering is probably one of the most recurrent operations in DSP. It is an expensive task, especially for long impulse responses (IRs) and low I/O latency.

Problem

Low-latency convolution with minimization of the computational cost.

State of the Art

In the last 30 years, fast convolution algorithms have been deeply investigated:

• OverLap and Save (OLS), OverLap and Add (OLA).

• Uniform Partitioned OverLap and Save (UPOLS).

• Non Uniform Partitioned OverLap and Save (NUPOLS).







Proposed Solution

We propose an efficient DSP-based real-time implementation of a fast convolution approach with non-uniform partitioning (NUPOLS), taking into account:

• The OMAP-L137 target platform.

• Efficient partitioning.

• Usage of smart DSP expedients.

• Psychoacoustic improvement.




Assuming a linear time-invariant system, the linear convolution between the input signal x and the system impulse response h is defined as follows:

$$ y(t) = x(t) * h(t) = \int_{-\infty}^{\infty} x(t - \tau)\, h(\tau)\, d\tau. \tag{1} $$

For discrete-time signals and an impulse response of finite length N, this becomes:

$$ y[n] = x[n] * h[n] = \sum_{m=0}^{N-1} x[n - m]\, h[m]. \tag{2} $$

Time Domain Convolution

The convolution is performed using equation (2).

LATENCY: Theoretically zero.
COMPUTATIONAL COST: N − 1 additions and N multiplications per output sample.
CONSIDERATIONS: Too expensive for long IRs.
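The direct evaluation of equation (2) can be sketched in a few lines of Python. This is only an illustration under simple assumptions (finite input, samples outside the input taken as zero); the function name `direct_convolution` is chosen here and is not part of the original slides.

```python
def direct_convolution(x, h):
    """Linear convolution y[n] = sum_m h[m] * x[n - m], as in equation (2)."""
    n_out = len(x) + len(h) - 1
    y = [0.0] * n_out
    for n in range(n_out):
        # Each output sample costs up to N multiplications (N = len(h)).
        for m in range(len(h)):
            if 0 <= n - m < len(x):
                y[n] += h[m] * x[n - m]
    return y


print(direct_convolution([1.0, 2.0, 3.0], [1.0, 1.0]))  # [1.0, 3.0, 5.0, 3.0]
```

The inner loop makes the per-sample cost proportional to the IR length, which is why this form becomes impractical for long impulse responses.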






Frequency Domain Convolution

Considering the circular convolution and the DFT property:

$$ y[n] = x[n] \circledast_N h[n] = \sum_{m=0}^{N-1} x[(n - m)_N]\, h[m], \tag{3} $$

$$ x[n] \circledast_N h[n] \;\leftrightarrow\; X[k]\, H[k], \tag{4} $$

it follows that the convolution can be computed in the frequency domain.

OverLap and Save (OLS)

OLS allows one to convert a circular convolution into a linear convolution.

LATENCY: Equal to K samples, with K > N.
COMPUTATIONAL COST: $\frac{2L \log L}{K} + \frac{L}{K}$ complex multiplications (with K a power of 2 and L = 2K for 50% overlap).
CONSIDERATIONS: I/O latency is too high for long IRs.
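A small numerical check of equations (3) and (4) may help here. The sketch below uses a naive O(n²) DFT written with the standard library only (an assumption made to keep the example dependency-free; a real implementation would use an FFT), and all function names are illustrative. It does not implement the block-wise overlap-save buffering; it only shows that the product of DFTs yields the circular convolution, and that zero-padding turns it into the linear one.

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT / inverse DFT, used only for illustration."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def circular_convolution(x, h):
    """Equation (3): y[n] = sum_m x[(n - m) mod N] * h[m]."""
    n = len(x)
    return [sum(x[(i - m) % n] * h[m] for m in range(n)) for i in range(n)]


def circular_convolution_dft(x, h):
    """Equation (4): multiply the DFTs, then transform back."""
    X, H = dft(x), dft(h)
    return [v.real for v in dft([a * b for a, b in zip(X, H)], inverse=True)]


def linear_convolution_dft(x, h):
    """Zero-pad both sequences so the circular result equals the linear one."""
    size = len(x) + len(h) - 1
    xp = list(x) + [0.0] * (size - len(x))
    hp = list(h) + [0.0] * (size - len(h))
    return circular_convolution_dft(xp, hp)
```

For example, `linear_convolution_dft([1.0, 2.0, 3.0], [1.0, 1.0])` should agree, up to rounding, with the time-domain result of equation (2).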






Uniform Partitioned OverLap and Save (UPOLS)

The IR is partitioned into sections of equal size; an OLS is then applied to each sub-filter.

LATENCY: Equal to K samples, with K arbitrarily chosen.
COMPUTATIONAL COST: $\frac{2L \log L}{K} + \frac{LP}{K}$ complex multiplications and $\frac{L(P-1)}{K}$ additions (with K a power of 2, P the number of partitions and L = 2K for 50% overlap).
CONSIDERATIONS: Computational cost higher than OLS.

Non Uniform Partitioned OverLap and Save (NUPOLS)

The IR is partitioned into sections of increasing size, reducing the computational cost with respect to the UPOLS algorithm.

LATENCY: Theoretically zero.
COMPUTATIONAL COST: It depends on the adopted partitioning.
CONSIDERATIONS: It is difficult to find the optimal partitioning.
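To get a feeling for the OLS and UPOLS cost expressions, they can be evaluated numerically. The sketch below assumes a base-2 logarithm (the slides only write "log") and L = 2K; the function names and the example values are illustrative choices, not figures from the presentation.

```python
import math


def ols_cost(K):
    """Complex multiplications per output sample for OLS: (2 L log L + L) / K."""
    L = 2 * K
    return (2 * L * math.log2(L) + L) / K


def upols_cost(K, P):
    """Complex multiplications and additions per output sample for UPOLS."""
    L = 2 * K
    mults = (2 * L * math.log2(L) + L * P) / K
    adds = L * (P - 1) / K
    return mults, adds


# Hypothetical comparison: one large OLS block versus many 64-sample partitions.
print(ols_cost(32768))       # 66.0
print(upols_cost(64, 413))   # (854.0, 824.0)
```

With these example values, the partitioned scheme trades a much lower latency (64 instead of 32768 samples) for a higher multiplication count per sample, in line with the consideration that UPOLS costs more than OLS.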





An efficient DSP-based implementation of a low-latency fast convolution is proposed, based on the NUPOLS algorithm.

[Figure: Block diagram of the non-uniform partitioned overlap and save algorithm. g(t): impulse response; x(t): input signal; g_i(t): i-th sub-filter.]




[Figure: Block diagram of the proposed approach.]

• First UPOLS: characterized by a small block size (e.g., 64 samples), which sets the desired input/output latency.

• Second UPOLS: uses a larger frame size, which minimizes the computational cost required to perform the convolution operation.





The real-time implementation of the proposed approach has been carried out on the Texas Instruments OMAP-L137 evaluation board.

Hardware features

- Dual-core System-on-Chip
- 300 MHz ARM926EJ-S RISC MPU
- 300 MHz C674x VLIW floating-point DSP
- 128 KByte shared RAM
- 64 MByte SDRAM
- Enhanced Direct Memory Access Controller 3 (EDMA3)
- 2 I/O audio channels
- 32 KByte L1P program RAM/cache (DSP side)
- 32 KByte L1D data RAM/cache (DSP side)
- 256 KByte L2 unified mapped RAM/cache (DSP side)

• Design constraints: sample frequency 48 kHz, latency 64 samples, stereo implementation, floating-point implementation.

• ARM: used to manage the control parameters.

• DSP: used to perform the DSP operations, exploiting its own libraries (i.e., DSPLib) and DMA engine.
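The design constraints translate into a time and cycle budget per processed block. The short calculation below is a back-of-the-envelope sketch using only the figures quoted above (48 kHz, 64 samples, 300 MHz DSP clock); it ignores any overhead such as interrupts or cache misses, and the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 48_000      # sample frequency from the design constraints
BLOCK_SIZE = 64              # latency in samples
DSP_CLOCK_HZ = 300_000_000   # C674x clock frequency

# Time available to process one block before the next one arrives.
latency_ms = 1000.0 * BLOCK_SIZE / SAMPLE_RATE_HZ

# Upper bound on the DSP cycles that can be spent on each block.
cycles_per_block = DSP_CLOCK_HZ * BLOCK_SIZE // SAMPLE_RATE_HZ

print(f"{latency_ms:.3f} ms per block, {cycles_per_block} cycles per block")
```

All the processing for both stereo channels has to fit within this budget at every block.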






The UPOLS algorithm implementation can be summarized in three main phases:

• Impulse response partitioning

• Input signal partitioning

• Filtering

[Figure: UPOLS block diagram. The impulse response h(t) of length N is split into blocks of K samples. The input x(t) is split into blocks x0, x1, x2, ..., xn of K samples; each L-point FFT is multiplied by the filters H1, H2, H3, ..., the products are summed, and the last K points of each L-point IFFT form the output blocks y0, y1, y2, ..., yn of y(t).]




• Impulse response partitioning

- The impulse response h is partitioned into P blocks h_n of length K.

- The filter set H_n is obtained by applying an L-point FFT to each zero-padded block h_n (with L = 2K, 50% overlap).

- The set of P filters is then stored in a delay line held in the external memory.

- The operation is performed offline using a MATLAB script.





• Input signal partitioning

- The input signal x is partitioned into blocks of length K.

- The frequency-domain block X_n is obtained by applying an L-point FFT to the input vector composed of the previous frame x_{n−1} and the new frame x_n (50% overlap).

- The vector X_n is stored in a delay line held in the external memory together with the P − 1 previous blocks.





• Filtering

- The output block Y_n is obtained through the filtering operation:

$$ Y_n = \sum_{i=0}^{P-1} X_{n-P+1+i}\, H_{P-1-i} \tag{5} $$

- The time-domain output block y_n is composed of the last K samples of the L-point IFFT of Y_n.
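The three phases can be put together in a compact reference sketch. It is written in plain Python with a naive DFT so that it stays self-contained (the DSP implementation uses optimized FFT routines and external memory, neither of which is modelled here), and the function names are illustrative. The delay line `X` holds the P most recent input spectra, with `X[0]` the newest, so the sum over `p` corresponds to equation (5).

```python
import cmath


def dft(x, inverse=False):
    """Naive DFT / inverse DFT, used only to keep the sketch dependency-free."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def direct_convolution(x, h):
    """Time-domain reference used to check the result."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for m, hm in enumerate(h):
            y[n + m] += xn * hm
    return y


def upols(x, h, K):
    """Uniformly partitioned overlap-save convolution with block size K."""
    L = 2 * K
    P = -(-len(h) // K)                      # number of partitions (ceiling)
    # Phase 1: impulse response partitioning (zero-pad each block to L points).
    H = []
    for p in range(P):
        seg = list(h[p * K:(p + 1) * K])
        H.append(dft(seg + [0.0] * (L - len(seg))))
    X = [[0j] * L for _ in range(P)]         # frequency-domain delay line
    prev = [0.0] * K
    y = []
    for start in range(0, len(x), K):
        # Phase 2: input partitioning, previous frame followed by the new one.
        block = list(x[start:start + K])
        block += [0.0] * (K - len(block))
        X = [dft(prev + block)] + X[:-1]
        prev = block
        # Phase 3: filtering, then keep the last K samples of the IFFT.
        Y = [sum(X[p][k] * H[p][k] for p in range(P)) for k in range(L)]
        y.extend(v.real for v in dft(Y, inverse=True)[K:])
    return y
```

For a short test signal, the output of `upols(x, h, K)` should match the first samples of `direct_convolution(x, h)` up to rounding; the tail of the response beyond the last input block is not flushed in this sketch.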



Complex multiplications and accesses to data in external memory are the main bottlenecks in a fast convolution implementation.

HOW TO SOLVE THESE PROBLEMS?

Adopted Solution

• The NUPOLS algorithm reduces both the number of complex multiplications and the number of memory accesses compared to the UPOLS approach.

• The DMA engine allows transfers from/into external memory to run in parallel with the processing operations.






Parallelization of the transfers from/into external memory (executed by the DMA engine) and the processing operations.

[Figure: Kernel used for the UPOLS algorithms. (i) Basic approach, with blocks Read Hn (blocking), Read Xn (blocking) and Compute Yn. (ii) Improved approach, with blocks Read Hn (blocking), Read Xn+1 (non-blocking), Compute Yn and Read Xn (blocking).]
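The idea of overlapping transfers with computation can be mimicked on a desktop machine. In the sketch below a single worker thread stands in for the DMA engine; this is only an analogy, and `read_block`, `process`, `run_basic` and `run_prefetch` are hypothetical helpers, not EDMA3 calls or code from the presentation.

```python
from concurrent.futures import ThreadPoolExecutor


def read_block(memory, index):
    """Stand-in for a transfer from external memory (hypothetical helper)."""
    return list(memory[index])


def process(block):
    """Stand-in for the per-block computation (here: a sum of squares)."""
    return sum(v * v for v in block)


def run_basic(memory):
    # (i) Basic approach: every read blocks, then the block is processed.
    return [process(read_block(memory, i)) for i in range(len(memory))]


def run_prefetch(memory):
    # (ii) Improved approach: the read of block i+1 is issued before
    # block i is processed, so transfer and computation can overlap.
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(read_block, memory, 0)
        for i in range(len(memory)):
            current = pending.result()       # wait for the outstanding transfer
            if i + 1 < len(memory):
                pending = dma.submit(read_block, memory, i + 1)  # non-blocking
            results.append(process(current))
    return results


external_memory = [[float(i + j) for j in range(4)] for i in range(8)]
```

Both functions return the same results; the second one simply hides the transfer time behind the processing of the previous block.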




The workload required for the FFT/IFFT computation can be reduced by taking advantage of the stereo implementation and of the real-valued nature of the audio signal.

FFT Optimization

• Two L-point FFTs/IFFTs of real sequences may be calculated through one FFT/IFFT of a complex sequence.

• The symmetry property of the FFT has been exploited. This decreases the number of accesses to the external memory and the number of frequency multiplications from L to (K + 1) for each of the P processed frequency blocks.
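The first point relies on a standard identity: if the two real channels are packed as the real and imaginary parts of one complex sequence, their spectra can be separated using conjugate symmetry. A minimal sketch follows, using a naive DFT for self-containment (an assumption of this example; the function names are illustrative).

```python
import cmath


def dft(x):
    """Naive DFT, used only for illustration."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def two_real_dfts(left, right):
    """Spectra of two real sequences from a single complex DFT."""
    n = len(left)
    Z = dft([complex(a, b) for a, b in zip(left, right)])
    left_spec, right_spec = [], []
    for k in range(n):
        zc = Z[(-k) % n].conjugate()            # conj(Z[N - k])
        left_spec.append((Z[k] + zc) / 2)       # spectrum of the real part
        right_spec.append((Z[k] - zc) / 2j)     # spectrum of the imaginary part
    return left_spec, right_spec
```

Because each resulting spectrum is conjugate-symmetric, only the first K + 1 bins of an L = 2K point transform carry independent information, which is the property used in the second point.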






Psychoacoustics allows one to reduce the number of complex multiplications and memory accesses.

Psychoacoustic Optimization

All the components (frequency bins) above a certain cut-off frequency fc (e.g., 18 kHz) are left out.
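The saving can be estimated by counting the bins that survive the cut-off. The sketch below assumes L = 2K, a 48 kHz sample rate, that only the K + 1 non-redundant bins are processed, and that the bin falling exactly on the cut-off is kept; the function name and the last assumption are illustrative choices rather than details from the slides.

```python
def retained_bins(block_size, sample_rate, cutoff_hz):
    """Return (kept, total) non-redundant bins for an FFT of size 2 * block_size."""
    fft_size = 2 * block_size
    total = block_size + 1                       # bins 0 .. K of a real signal
    highest = cutoff_hz * fft_size // sample_rate  # last bin at or below cutoff
    kept = min(total, highest + 1)
    return kept, total


print(retained_bins(64, 48_000, 18_000))    # (49, 65)
print(retained_bins(2048, 48_000, 18_000))  # (1537, 2049)
```

With an 18 kHz cut-off at 48 kHz, roughly a quarter of the bins, and therefore of the multiplications and memory accesses per partition, can be skipped.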




HOW TO PARALLELIZE THE 2 UPOLS?

In a low-latency context, a multithreaded approach does not guarantee high performance on the DSP board.

Adopted Solution

A manual partitioning of the code has been carried out, aiming to uniformly distribute the FFT/IFFT operations and the complex multiplications of both UPOLS stages throughout the processing.




The manual partitioning aims to uniformly distribute the FFT/IFFT operations and the complex multiplications related to the larger UPOLS over the K2/K1 iterations necessary to respect the processing constraint.

Iteration  Operation            Iteration  Operation
1          Large FFT 3/3        17         MAC Left Channel
2          MUL Left Channel     18         MAC Left Channel
3          MUL Right Channel    19         MAC Right Channel
4          Large IFFT 1/3       20         MAC Right Channel
5          Large IFFT 2/3       21         MAC Right Channel
6          Large IFFT 3/3       22         MAC Right Channel
7          MAC Left Channel     23         MAC Right Channel
8          MAC Left Channel     24         MAC Right Channel
9          MAC Left Channel     25         MAC Right Channel
10         MAC Left Channel     26         MAC Right Channel
11         MAC Left Channel     27         MAC Right Channel
12         MAC Left Channel     28         MAC Right Channel
13         MAC Left Channel     29         MAC Right Channel
14         MAC Left Channel     30         MAC Right Channel
15         MAC Left Channel     31         Large FFT 1/3
16         MAC Left Channel     32         Large FFT 2/3

Distribution of the UPOLS operations in a NUPOLS implementation with K1 = 64 and K2 = 2048.
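The distribution can be expressed as a simple lookup from the small-block iteration to the slice of large-UPOLS work executed in it. The sketch below only reproduces the operation sequence listed for K1 = 64 and K2 = 2048; the function names are illustrative, and how each slice is actually implemented on the DSP is not modelled.

```python
K1, K2 = 64, 2048
ITERATIONS = K2 // K1   # small-block iterations per large block (32)


def build_schedule():
    """One large-UPOLS sub-task per small-block iteration."""
    schedule = ["Large FFT 3/3", "MUL Left Channel", "MUL Right Channel"]
    schedule += [f"Large IFFT {part}/3" for part in (1, 2, 3)]
    schedule += ["MAC Left Channel"] * 12
    schedule += ["MAC Right Channel"] * 12
    schedule += ["Large FFT 1/3", "Large FFT 2/3"]
    return schedule


def task_for_block(block_index, schedule):
    """Sub-task to run alongside the small UPOLS for a given block index."""
    return schedule[block_index % len(schedule)]


schedule = build_schedule()
for block_index in range(3):
    print(block_index + 1, task_for_block(block_index, schedule))
```

Spreading the work this way keeps the per-iteration load roughly even, instead of concentrating the large FFT, IFFT and multiplications in a single 64-sample slot.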




Fast convolution can be employed in many different real-time audio applications.

Case Study: Artificial Reverberator

Digital artificial reverberation is an application that clearly exposes the limits of real-time FIR filtering.

• Convolutions with long IRs are needed to simulate large environments.

• Low input/output latencies are required in musical instruments.

Tests

Several tests have been carried out to evaluate the effectiveness of the proposed approach, comparing the workload required by the UPOLS and NUPOLS implementations.




UPOLS PERFORMANCE

[Figure: Workload (axis 0 to 100) versus impulse response length (0.1 to 0.5 s) of the Uniform Partitioned Overlap and Save algorithm (K = 64). (a) Classic implementation. (b) Psychoacoustic approach.]

Considerations

• The maximum impulse response length is about 0.55 s (while guaranteeing real-time performance).

• The approach is not suitable for the simulation of large reverberating environments in musical instruments.
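To put the 0.55 s limit in perspective, the number of partitions and of frequency-domain multiplications per block can be estimated. The sketch assumes a 48 kHz sample rate, K = 64, and K + 1 multiplications per partition and channel (the symmetry-based count quoted earlier); the function name is illustrative and the result is an estimate, not a measured figure.

```python
def partitions_needed(ir_seconds, sample_rate=48_000, block_size=64):
    """Number of K-sample partitions covering an impulse response."""
    ir_samples = round(ir_seconds * sample_rate)
    return -(-ir_samples // block_size)          # ceiling division


P = partitions_needed(0.55)
mults_per_block = P * (64 + 1)   # per channel, using K + 1 bins per partition

print(P, mults_per_block)        # 413 26845
```

Every 64-sample block therefore involves several hundred partitions, each read from external memory, which is consistent with the uniform scheme saturating well below reverberation-length IRs.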




NUPOLS PERFORMANCE

[Figure: Workload (axis 0 to 100) versus impulse response length (0 to 5 s) of the NUPOLS algorithm with 4 different partitionings: (i) K1 = 64, K2 = 2048; (ii) K1 = 64, K2 = 1024; (iii) K1 = 64, K2 = 512; (iv) optimal partitioning, with regions labelled K2 = 512, K2 = 1024 and K2 = 2048. Mean (a) and max (b) workload for the classic implementation. Mean (c) and max (d) workload using the psychoacoustic approach.]




NUPOLS PERFORMANCE

[Figure: NUPOLS workload (axis 0 to 50) as a function of the processing iteration (5 to 30), IR length = 3.164 s. (a) NUPOLS workload. (b) Workload of the small UPOLS (K1 = 64). (c) Workload of the large UPOLS (K2 = 2048).]

Partitioning          Internal Memory Usage
K1 = 64, K2 = 2048    100 kB
K1 = 64, K2 = 1024    50 kB
K1 = 64, K2 = 512     30 kB

Considerations

• Evident improvement in terms of performance with respect to the approach based on uniform partitioning.

• It is possible to perform a stereo convolution with an impulse response of length 6 s using about 50% of the DSP resources.




In conclusion:

• A novel approach for fast convolution computation has been proposed, based on non-uniform partitioning of the impulse response.

• Two UPOLS stages with different frame sizes are introduced: the desired input/output latency is obtained through the UPOLS with the smaller frame size, while the other UPOLS is exploited to decrease the number of memory accesses and complex multiplications.

• A DSP-based real-time implementation has been realized, and several experiments have been carried out considering digital reverberation as a case study.





QUESTIONS?