An Efficient DSP-Based Implementation of a Fast Convolution Approach with non Uniform Partitioning

Andrea Primavera (1), Stefania Cecchi (1), Laura Romoli (1), Francesco Piazza (1) and Marco Moschetti (2)

(1) A3lab - DII - Università Politecnica delle Marche - Ancona - ITALY
(2) Korg Italy - Osimo (AN) - ITALY

5th European DSP in Education and Research Conference, 13th and 14th September, 2012, Amsterdam, Netherlands.

DESCRIPTION

"Finite impulse response convolution is one of the most widely used operations in the digital signal processing field for filtering. Low computationally demanding techniques are therefore essential for calculating convolutions with low input/output latency in real scenarios, considering that the real time requirements are strictly related to the impulse response length. In this context, an efficient DSP implementation of a fast convolution approach is presented with the aim of lowering the workload required in applications like reverberation. It is based on a non uniform partitioning of the impulse response and a psychoacoustic technique derived from the human ear sensitivity. Several results are reported to prove the effectiveness of the proposed approach, including comparisons with existing state-of-the-art techniques."


1 Fast Convolution
   Introduction
   State of the art

2 Proposed Algorithm

3 Efficient DSP Implementation
   Target
   UPOLS implementation
   Memory management
   FFT/IFFT operations
   Psychoacoustic expedients
   Final remarks

4 Results
   Case study: artificial reverberator
   UPOLS performance
   NUPOLS performance

5 Conclusion
   Conclusion
   Questions


FIR filtering is probably one of the most recurrent operations in DSP. It is an expensive task, especially for long impulse responses (IRs) and low I/O latency.

Problem

Low latency convolution with computational cost minimization.

State of the Art

In the last 30 years, fast convolution algorithms have been deeply investigated:

• OverLap and Save (OLS), OverLap and Add (OLA).
• Uniform Partitioned OverLap and Save (UPOLS).
• Non Uniform Partitioned OverLap and Save (NUPOLS).


Proposed Solution

We propose an efficient DSP-based real-time implementation of a fast convolution approach with non uniform partitioning (NUPOLS), taking into account:

• The target platform (OMAP L137).
• Efficient partitioning.
• Usage of smart DSP expedients.
• Psychoacoustic improvement.


Assuming a linear time-invariant system, the linear convolution between the input signal x and the system impulse response h is defined as follows:

$$y(t) = x(t) * h(t) = \int_{-\infty}^{\infty} x(t - \tau)\, h(\tau)\, d\tau. \qquad (1)$$

For discrete-time signals and an impulse response of finite length N, this becomes:

$$y[n] = x[n] * h[n] = \sum_{m=0}^{N-1} x[n - m]\, h[m]. \qquad (2)$$

Time Domain Convolution

The convolution is performed using equation (2).
LATENCY: Theoretically zero.
COMPUTATIONAL COST: N − 1 additions and N multiplications per output sample.
CONSIDERATIONS: Too expensive for long IRs.
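As a minimal sketch of the time-domain approach, the standard discrete convolution sum can be evaluated sample by sample. The function name `direct_convolution` is only illustrative and does not come from the slides; the inner loop makes the per-sample cost of N multiplications visible.

```python
def direct_convolution(x, h):
    """Linear convolution: each output sample is a sum of products h[m] * x[n - m]."""
    n_out = len(x) + len(h) - 1
    y = [0.0] * n_out
    for n in range(n_out):
        acc = 0.0
        for m in range(len(h)):          # up to N multiply-accumulates per output sample
            if 0 <= n - m < len(x):      # samples outside the input are treated as zero
                acc += h[m] * x[n - m]
        y[n] = acc
    return y


print(direct_convolution([1.0, 2.0, 3.0], [1.0, 0.5]))  # prints [1.0, 2.5, 4.0, 1.5]
```

Each output sample is available as soon as its input sample arrives, which is why the latency is theoretically zero, but the work grows linearly with the IR length.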


Frequency Domain Convolution

Considering the circular convolution and the DFT property:

$$y[n] = x[n] \circledast_N h[n] = \sum_{m=0}^{N-1} x[(n - m)_N]\, h[m], \qquad (3)$$

$$x[n] \circledast_N h[n] \leftrightarrow X[k]\, H[k], \qquad (4)$$

where $(\cdot)_N$ denotes the index taken modulo N, it follows that the convolution can be computed in the frequency domain.

OverLap and Save (OLS)

OLS makes it possible to convert a circular convolution into a linear convolution.
LATENCY: Equal to K samples, with K > N.
COMPUTATIONAL COST: $\frac{2L \log L}{K} + \frac{L}{K}$ complex multiplications (with K a power of 2 and L = 2K for 50% overlap).
CONSIDERATIONS: I/O latency is too high for long IRs.
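The overlap-and-save idea can be sketched in a few lines of plain Python. This is an illustrative reference version, not the DSP code from the talk: the helper names (`fft`, `ifft`, `overlap_save`, `direct`) are assumptions, the FFT is a simple recursive radix-2 routine, and the impulse response is assumed to fit in one block of K samples.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(a):
    return [v / len(a) for v in fft(a, inverse=True)]


def overlap_save(x, h, K):
    """Filter x with h (len(h) <= K) using L = 2K point FFTs and 50% overlap."""
    assert len(h) <= K
    L = 2 * K
    H = fft(list(h) + [0.0] * (L - len(h)))       # filter spectrum, computed once
    n_out = len(x) + len(h) - 1
    n_blocks = -(-n_out // K)                     # ceiling division
    xp = list(x) + [0.0] * (n_blocks * K - len(x))
    prev = [0.0] * K
    y = []
    for b in range(n_blocks):
        cur = xp[b * K:(b + 1) * K]
        X = fft(prev + cur)                       # previous frame followed by new frame
        yb = ifft([X[k] * H[k] for k in range(L)])
        y.extend(v.real for v in yb[K:])          # keep the last K points, drop the aliased ones
        prev = cur
    return y[:n_out]


def direct(x, h):
    """Reference time-domain convolution used only to check the result."""
    return [sum(h[m] * x[n - m] for m in range(len(h)) if 0 <= n - m < len(x))
            for n in range(len(x) + len(h) - 1)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
h = [1.0, 0.5, 0.25]
y_fast = overlap_save(x, h, K=4)
max_err = max(abs(a - b) for a, b in zip(y_fast, direct(x, h)))
print(max_err < 1e-9)
```

The first K points of each inverse transform are contaminated by circular wrap-around and are discarded, which is what turns the circular convolution into a linear one.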


Uniform Partitioned OverLap and Save (UPOLS)

The IR is partitioned into sections of equal size, then an OLS is applied to each sub-filter.
LATENCY: Equal to K samples, with K arbitrarily chosen.
COMPUTATIONAL COST: $\frac{2L \log L}{K} + \frac{LP}{K}$ complex multiplications and $\frac{L(P-1)}{K}$ additions (with K a power of 2, P the number of partitions and L = 2K for 50% overlap).
CONSIDERATIONS: Computational cost higher than OLS.

Non Uniform Partitioned OverLap and Save (NUPOLS)

The IR is partitioned into sections of increasing size, reducing the computational cost with respect to the UPOLS algorithm.
LATENCY: Theoretically zero.
COMPUTATIONAL COST: It depends on the adopted partitioning.
CONSIDERATIONS: It is difficult to find the optimal partitioning.
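To get a feel for the trade-off between OLS and UPOLS, the two cost expressions can be evaluated numerically. The sketch below assumes a base-2 logarithm (the slides only write "log") and uses a 1 s impulse response at 48 kHz as an illustrative case; the function names are hypothetical.

```python
import math


def ols_cost(K):
    """OLS complex multiplications: (2 L log L + L) / K with L = 2K (log base 2 assumed)."""
    L = 2 * K
    return (2 * L * math.log2(L) + L) / K


def upols_cost(K, P):
    """UPOLS complex multiplications and additions for P partitions of K samples."""
    L = 2 * K
    mults = (2 * L * math.log2(L) + L * P) / K
    adds = L * (P - 1) / K
    return mults, adds


N = 48_000                       # illustrative IR length: 1 s at 48 kHz

K_ols = 65_536                   # smallest power of two larger than N
print("OLS:   latency", K_ols, "samples, cost", ols_cost(K_ols))

K_upols = 64
P = -(-N // K_upols)             # ceiling division: 750 partitions
print("UPOLS: latency", K_upols, "samples, cost", upols_cost(K_upols, P))
```

Under these assumptions the single-block OLS is cheap but needs a latency of tens of thousands of samples, while UPOLS with K = 64 reaches the low latency at a much higher multiplication count; this gap is what the non uniform partitioning targets.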


An efficient DSP-based implementation of a low latency fast convolution is proposed, based on the NUPOLS algorithm.

[Figure: Block diagram of the non uniform partitioned overlap and save algorithm. g(t): impulse response; x(t): input signal; g_i(t): i-th sub-filter.]


[Figure: Block diagram of the proposed approach. g(t): impulse response; x(t): input signal; g_i(t): i-th sub-filter.]

• First UPOLS: characterized by a small block size (e.g., 64 samples), which sets the desired input/output latency.
• Second UPOLS: uses a larger frame size to minimize the computational cost required to perform the convolution operation.
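A two-stage partitioning of this kind can be sketched as a split of the impulse response into a head, handled with small blocks, and a tail, handled with large blocks. The slides do not state where the small-block section ends, so `head_len` below is a purely hypothetical parameter, as are the function names.

```python
def split_impulse_response(g, K1, K2, head_len):
    """Split g into K1-sized blocks for the first head_len samples and K2-sized blocks for the rest.

    head_len is an assumed parameter; the split point used in the talk is not given.
    """
    assert head_len % K1 == 0, "head must contain a whole number of small blocks"

    def blocks(segment, K):
        out = []
        for start in range(0, len(segment), K):
            blk = list(segment[start:start + K])
            blk += [0.0] * (K - len(blk))        # zero-pad the last block
            out.append(blk)
        return out

    return blocks(g[:head_len], K1), blocks(g[head_len:], K2)


g = [float(i) for i in range(10)]
small_blocks, large_blocks = split_impulse_response(g, K1=2, K2=4, head_len=4)
print(small_blocks)   # sub-filters for the first (low latency) UPOLS
print(large_blocks)   # sub-filters for the second (low cost) UPOLS
```

Each list of blocks would then feed its own UPOLS instance, and the two outputs are summed after delaying the tail contribution by `head_len` samples.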


The real-time implementation of the proposed approach was carried out on the Texas Instruments OMAP-L137 evaluation board.

Hardware features

• Dual-core System-on-Chip:
  - 300 MHz ARM926EJ-S RISC MPU
  - 300 MHz C674x VLIW floating-point DSP
• 128 KByte shared RAM
• 64 MByte SDRAM
• Enhanced Direct Memory Access Controller 3 (EDMA3)
• 2 I/O audio channels
• 32 KByte L1P program RAM/cache (DSP side)
• 32 KByte L1D data RAM/cache (DSP side)
• 256 KByte L2 unified mapped RAM/cache (DSP side)

• Design constraints: sample frequency 48 kHz, latency 64 samples, stereo implementation, floating-point implementation.
• ARM: used to manage the control parameters.
• DSP: used to perform the DSP operations, exploiting its own libraries (e.g., DSPLib) and DMA engine.
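The design constraints translate directly into a time and cycle budget per audio block. The following back-of-the-envelope sketch uses only the figures quoted above (48 kHz, 64 samples, 300 MHz); it ignores cache misses, memory stalls and interrupt overhead, so the real budget is smaller. The function name is illustrative.

```python
def block_budget(fs_hz, block_samples, cpu_hz):
    """Return (block period in seconds, CPU cycles available per block)."""
    period_s = block_samples / fs_hz
    cycles = cpu_hz * block_samples // fs_hz
    return period_s, cycles


period_s, cycles = block_budget(fs_hz=48_000, block_samples=64, cpu_hz=300_000_000)
print(f"block period: {period_s * 1e3:.3f} ms")   # about 1.333 ms
print(f"cycle budget: {cycles} cycles")           # 400000 cycles for both stereo channels
```

Everything the convolver does for one block of both channels has to fit inside this window, which is why the per-iteration workload matters as much as the average one.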


The UPOLS algorithm implementation can be summarized in three main phases:

• Impulse response partitioning
• Input signal partitioning
• Filtering

[Figure: UPOLS block diagram. The impulse response h(t), of length N, is split into blocks of K samples, giving the filters H1, H2, H3, ... The input x(t) is split into blocks x0, x1, x2, ..., xn of K samples; each goes through an L-point FFT, is multiplied by the filters, and the products are summed. The last K points of each L-point IFFT form the output blocks y0, y1, y2, ..., yn of y(t).]


• Impulse response partitioning
  - The impulse response h is partitioned into P blocks h_n of length K.
  - The filter set H_n is obtained by applying an L-point FFT to each block h_n (with L = 2K, 50% overlap).
  - The set of P filters is then stored in a delay line held in the external memory.
  - The operation is performed offline using a Matlab script.


• Input signal partitioning
  - The input signal x is partitioned into blocks of length K.
  - The frequency-domain block X_n is obtained by applying an L-point FFT to the input vector composed of the new frame x_n and the previous frame x_{n−1} (50% overlap).
  - The vector X_n is stored in a delay line held in the external memory together with the P − 1 previous blocks.


• Filtering
  - The output block Y_n is obtained through filtering operations:

$$Y_n = \sum_{i=0}^{P-1} X_{n-P+1+i}\, H_{P-1-i} \qquad (5)$$

  - The time-domain output signal y_n is composed of the last K samples of the L-point IFFT of Y_n.
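The three phases can be put together in a compact reference sketch. This is plain Python for illustration, not the DSP implementation described in the talk: all function names are assumptions, the FFT is a simple recursive radix-2 routine, and the frequency-domain delay line lives in an ordinary list rather than in external memory.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even, odd = fft(a[0::2], inverse), fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def ifft(a):
    return [v / len(a) for v in fft(a, inverse=True)]


def upols(x, h, K):
    """Uniformly partitioned overlap-save with block size K and L = 2K."""
    L = 2 * K
    P = -(-len(h) // K)                                   # number of partitions
    # Phase 1: impulse response partitioning (done once, offline in the talk).
    H = []
    for p in range(P):
        blk = list(h[p * K:(p + 1) * K])
        H.append(fft(blk + [0.0] * (L - len(blk))))
    n_out = len(x) + len(h) - 1
    n_blocks = -(-n_out // K)
    xp = list(x) + [0.0] * (n_blocks * K - len(x))
    delay_line = [[0j] * L for _ in range(P)]             # delay_line[j] holds X_{n-j}
    prev, y = [0.0] * K, []
    for n in range(n_blocks):
        # Phase 2: input partitioning, FFT of [previous frame, new frame].
        cur = xp[n * K:(n + 1) * K]
        delay_line.insert(0, fft(prev + cur))
        delay_line.pop()
        # Phase 3: Y_n is the sum over partitions of X_{n-j} * H_j, as in equation (5).
        Y = [sum(delay_line[j][k] * H[j][k] for j in range(P)) for k in range(L)]
        y.extend(v.real for v in ifft(Y)[K:])             # last K samples of the IFFT
        prev = cur
    return y[:n_out]


def direct(x, h):
    """Reference time-domain convolution used only to check the result."""
    return [sum(h[m] * x[n - m] for m in range(len(h)) if 0 <= n - m < len(x))
            for n in range(len(x) + len(h) - 1)]


x_sig = [float(i % 5) - 2.0 for i in range(20)]
h_ir = [0.5 ** i for i in range(10)]
y_upols = upols(x_sig, h_ir, K=4)
upols_err = max(abs(a - b) for a, b in zip(y_upols, direct(x_sig, h_ir)))
print(upols_err < 1e-9)
```

Only one forward FFT and one inverse FFT are needed per block regardless of P; the per-block cost that grows with the IR length is the multiply-accumulate over the delay line.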


Complex multiplications and accesses to external memory data are the main bottlenecks in fast convolution implementation.

HOW TO SOLVE THESE PROBLEMS?

Adopted Solution

• The NUPOLS algorithm reduces both the number of complex multiplications and the number of memory accesses compared to the UPOLS approach.
• The DMA engine makes it possible to run transfers from/to external memory in parallel with processing operations.


Parallelization of the transfers from/into external memory (executed by the DMA engine) and the processing operations.

[Figure: Kernel used for UPOLS algorithms. (i) Basic approach: Read H_n (blocking), Read X_n (blocking), Compute Y_n. (ii) Improved approach: Read H_n (blocking), Read X_{n+1} (non-blocking), Compute Y_n, Read X_n (blocking).]
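The idea behind the improved kernel, requesting the next block while the current one is being processed, can be illustrated on a host machine with a worker thread standing in for the DMA engine. This is only an analogy under that assumption: it does not use the EDMA3 API, and `read_block` and `compute` are hypothetical callbacks.

```python
from concurrent.futures import ThreadPoolExecutor


def process_with_prefetch(read_block, compute, n_blocks):
    """Overlap the transfer of block n + 1 with the computation on block n."""
    results = []
    if n_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as transfer:      # stand-in for the DMA engine
        pending = transfer.submit(read_block, 0)
        for n in range(n_blocks):
            current = pending.result()                       # blocking wait for block n
            if n + 1 < n_blocks:
                pending = transfer.submit(read_block, n + 1) # non-blocking request for block n + 1
            results.append(compute(current))                 # runs while the transfer proceeds
    return results


external_memory = [[1, 2], [3, 4], [5, 6]]
sums = process_with_prefetch(lambda n: list(external_memory[n]), sum, len(external_memory))
print(sums)  # prints [3, 7, 11]
```

With a blocking read before every computation the processor idles during each transfer; with the prefetch pattern the wait shrinks to whatever part of the transfer is not hidden behind the computation.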


The workload required for FFT/IFFT computation can be reduced by taking advantage of the stereo implementation and of the real nature of the audio signal.

FFT Optimization

• Two L-point FFTs/IFFTs of real sequences may be calculated through one FFT/IFFT of a complex sequence.
• The symmetry property of the FFT has been exploited. This decreases the number of accesses to the external memory and the number of frequency multiplications from L to (K + 1) for each of the P processed frequency blocks.
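The first trick is a standard identity: packing two real sequences a and b into z = a + jb, one complex FFT yields both spectra through the conjugate symmetry of real-signal transforms. The sketch below is illustrative (the names `fft` and `fft_two_real` are not from the slides) and checks the result against two separate transforms.

```python
import cmath


def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even, odd = fft(a[0::2]), fft(a[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + t, even[k] - t
    return out


def fft_two_real(a, b):
    """Spectra of two real sequences (e.g. left and right channel) from one complex FFT."""
    n = len(a)
    Z = fft([complex(p, q) for p, q in zip(a, b)])
    A = [(Z[k] + Z[-k % n].conjugate()) / 2 for k in range(n)]
    B = [(Z[k] - Z[-k % n].conjugate()) / 2j for k in range(n)]
    return A, B


left = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, -3.0]
right = [0.5, -0.5, 0.25, 0.0, 1.0, 1.0, -1.0, 2.0]
A, B = fft_two_real(left, right)
err_left = max(abs(u - v) for u, v in zip(A, fft(left)))
err_right = max(abs(u - v) for u, v in zip(B, fft(right)))
print(err_left < 1e-9 and err_right < 1e-9)
```

The same symmetry explains the second bullet: for a real signal, bins K + 1 to L − 1 are the complex conjugates of bins K − 1 to 1, so only bins 0 to K need to be stored and multiplied.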


Psychoacoustic Optimization

Psychoacoustics makes it possible to reduce the number of complex multiplications and memory accesses.

All the components (frequency bins) above a certain cut-off frequency fc (e.g., 18 kHz) are left out.
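The saving can be estimated by counting how many of the K + 1 unique bins of an L = 2K point transform lie at or below the cut-off. The sketch assumes the 48 kHz sample rate and the 18 kHz example cut-off quoted in the slides, and assumes a bin exactly at fc is kept; the function name is illustrative.

```python
def bins_kept(fs_hz, K, fc_hz):
    """Return (bins kept, unique bins) for an L = 2K point FFT of a real signal."""
    L = 2 * K
    unique_bins = K + 1                          # bins 0..K after exploiting symmetry
    last_bin = min(K, (fc_hz * L) // fs_hz)      # highest bin at or below fc
    return last_bin + 1, unique_bins


for K in (64, 2048):
    kept, total = bins_kept(fs_hz=48_000, K=K, fc_hz=18_000)
    print(f"K = {K}: {kept} of {total} bins processed")
```

With these numbers roughly a quarter of the complex multiplications and of the related memory traffic is skipped for every processed frequency block.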


HOW TO PARALLELIZE THE 2 UPOLS?

In a low latency context, a multithreaded approach does not guarantee high performance on the DSP board.

Adopted Solution

The code has been manually partitioned in order to distribute the FFT/IFFT operations and the complex multiplications of both UPOLSs uniformly throughout the processing.


The manual partitioning aims to uniformly distribute the FFT/IFFT operations and the complex multiplications related to the larger UPOLS over the K2/K1 iterations necessary to respect the processing constraint.

Iteration  Operation            Iteration  Operation
1          Large FFT 3/3        17         MAC Left Channel
2          MUL Left Channel     18         MAC Left Channel
3          MUL Right Channel    19         MAC Right Channel
4          Large IFFT 1/3       20         MAC Right Channel
5          Large IFFT 2/3       21         MAC Right Channel
6          Large IFFT 3/3       22         MAC Right Channel
7          MAC Left Channel     23         MAC Right Channel
8          MAC Left Channel     24         MAC Right Channel
9          MAC Left Channel     25         MAC Right Channel
10         MAC Left Channel     26         MAC Right Channel
11         MAC Left Channel     27         MAC Right Channel
12         MAC Left Channel     28         MAC Right Channel
13         MAC Left Channel     29         MAC Right Channel
14         MAC Left Channel     30         MAC Right Channel
15         MAC Left Channel     31         Large FFT 1/3
16         MAC Left Channel     32         Large FFT 2/3

Distribution of the UPOLS operations in a NUPOLS implementation with K1 = 64 and K2 = 2048.
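One way to picture this manual partitioning is as a static lookup table indexed by the small-block iteration. The operation labels below are copied from the table; the dispatch function and its name are assumptions made for illustration, since the slides do not show how the schedule is coded.

```python
K1, K2 = 64, 2048

# One slice of large-UPOLS work per small-block iteration, in table order.
SCHEDULE = (
    ["Large FFT 3/3", "MUL Left Channel", "MUL Right Channel",
     "Large IFFT 1/3", "Large IFFT 2/3", "Large IFFT 3/3"]
    + ["MAC Left Channel"] * 12
    + ["MAC Right Channel"] * 12
    + ["Large FFT 1/3", "Large FFT 2/3"]
)


def large_upols_task(small_block_index):
    """Slice of large-UPOLS work to run alongside a given small block (0-based index)."""
    return SCHEDULE[small_block_index % (K2 // K1)]


print(len(SCHEDULE) == K2 // K1)   # one entry for each of the 32 iterations
print(large_upols_task(0), "|", large_upols_task(31), "|", large_upols_task(32))
```

Because every small-block iteration carries one bounded slice of the large transform or of its multiply-accumulate work, no single iteration has to absorb the whole cost of a 2048-sample block.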


Case Study: Artificial Reverberator

Fast convolution can be employed in many different real-time audio applications.

Digital artificial reverberation is the application that most clearly exposes the limits of real-time FIR filtering:

• Convolutions with long IRs are needed to simulate large environments.
• Low input/output latencies are required in musical instruments.

Tests

Several tests have been carried out to evaluate the effectiveness of the proposed approach, comparing the workload required by the UPOLS and NUPOLS implementations.


UPOLS PERFORMANCE

[Figure: Workload of the Uniform Partitioned Overlap and Save algorithm (K = 64) versus impulse response length [s]. (a) Classic implementation. (b) Psychoacoustic approach.]

Considerations

• The maximum impulse response length that still guarantees real-time performance is about 0.55 s.
• The approach is not suitable for the simulation of large reverberating environments in musical instruments.


NUPOLS PERFORMANCE

[Figure: Workload of the NUPOLS algorithm versus impulse response length [s] with 4 different partitionings: (i) K1 = 64, K2 = 2048; (ii) K1 = 64, K2 = 1024; (iii) K1 = 64, K2 = 512; (iv) optimal partitioning. Mean (a) and max (b) workload for the classic implementation. Mean (c) and max (d) workload using the psychoacoustic approach.]


[Figure: NUPOLS workload as a function of the processing iteration (IR length = 3.164 s). (a) Workload of the NUPOLS. (b) Workload of the small UPOLS (K1 = 64). (c) Workload of the large UPOLS (K2 = 2048).]

Partitioning           Internal Memory Usage
K1 = 64, K2 = 2048     100 kB
K1 = 64, K2 = 1024     50 kB
K1 = 64, K2 = 512      30 kB

Considerations

• Evident improvement in terms of performance with respect to the uniform partitioning based approach.
• It is possible to perform a stereo convolution with an impulse response of length 6 s using about 50% of the DSP resources.


In conclusion:

• A novel approach for fast convolution computation has been proposed, based on non uniform partitioning of the impulse response.
• Two UPOLSs are introduced with two different frame sizes: the desired input/output latency is obtained through the UPOLS with the smaller frame size, while the other UPOLS is exploited to decrease the number of memory accesses and complex multiplications.
• A DSP-based real-time implementation has been developed, and several experimental results have been reported considering digital reverberation as a case study.


QUESTIONS?
