

Audio Engineering Society

Convention e-Brief 111
Presented at the Conference on Immersive and Interactive Audio
2019 March 27 – 29, York, UK

This Engineering Brief was selected on the basis of a submitted synopsis. The author is solely responsible for its presentation, and the AES takes no responsibility for the contents. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Audio Engineering Society.

SPARTA & COMPASS: Real-time implementations of linear and parametric spatial audio reproduction and processing methods

Leo McCormack1 and Archontis Politis1

1Aalto University, School of Electrical Engineering, Department of Signal Processing and Acoustics, Espoo, Finland

Correspondence should be addressed to Leo McCormack ([email protected])

ABSTRACT

This paper summarises recent developments at the Acoustics Lab, Aalto University, Finland, regarding real-time implementations of both fundamental and advanced methods for spatialisation, production, visualisation, and manipulation of spatial sound scenes. The implementations can be roughly categorised into panning tools, for both binaural and arbitrary loudspeaker setups, and linear processing tools based on the Ambisonics framework; the latter of which includes: decoders for loudspeakers or headphones and sound scene activity visualisers, which are based on either non-parametric beamforming or parametric high-resolution methods. Additionally, more advanced reproduction and spatial editing tools are detailed, which are based on the parametric COMPASS framework.

Introduction

Recording, production, and reproduction of spatial sound scenes began with the invention of stereophony [1] and have been evolving ever since. The primary aims of spatial sound capture, panning, and playback have been to offer high perceptual quality and engagement. In the case of recording, the objective is to preserve the spatial characteristics of the original scene as much as possible. To achieve these aims, more recent approaches employ sophisticated digital signal processing techniques, which provide increased flexibility during both the recording and reproduction stages. This work presents and discusses a series of real-time implementations of spatial sound tools developed at the Acoustics Lab, Aalto University, Finland, which have been made publicly available as an open-source VST audio plug-in suite, under the name Spatial Audio Real-time Applications (SPARTA)1,2.

Many of the plug-ins are implemented in the time-frequency domain, which allows spectral processing; an uncommon feature compared to existing spatialisation tools, which are often implemented solely in the time domain. Each plug-in focuses on a certain application. The most fundamental operation is spatialisation of monophonic sound source signals, which has been implemented for loudspeaker and headphone playback by employing amplitude panning or binaural filters, respectively. With regard to recording spatial sound scenes, spherical microphone arrays are often employed, followed by suitable spatial encoding, which converts the microphone signals into spherical harmonic (SH) signals (also referred to as Ambisonic or B-format signals) [2, 3]. An implementation of such an encoder is demonstrated, which integrates a wide range of options, including control over: the microphone sensor directions, array size, baffle type, microphone directivity, and the encoding strategy [4, 5, 6, 7, 8]. Additionally, the plug-in can visualise the encoding performance per frequency, based on a few established metrics [4, 6].

1 The SPARTA & COMPASS plug-in suites can be downloaded from here: http://research.spa.aalto.fi/projects/sparta_vsts/

2 The SPARTA source code (GPLv3 license) can be found here: https://github.com/leomccormack/SPARTA

Ambisonics offers a unified framework for linear and non-parametric storage, manipulation and reproduction of spatial sound scenes, formulated in the spherical harmonic domain (SHD). Basic ambisonic operations, such as encoding a monophonic sound source signal at a certain direction or rotating the sound scene [9], are also demonstrated. Due to the generality of SH encoding, decoding ambisonic material may be easily adapted to either: arbitrary loudspeaker setups via time-invariant real-valued gains, or matched to a set of binaural filters for generic or personalised headphone listening. Both loudspeaker and binaural decoders, based on multiple decoding strategies [10, 11, 12], have been implemented. Furthermore, due to the frequency-dependent nature of the plug-ins, different decoders may be assigned at different frequency ranges, providing the ability to independently optimise the low- and high-frequency localisation.

More recently, a set of non-linear and signal-dependent spatial audio processing methods have been developed. These methods aim to achieve higher perceptual accuracy in reproduction and also permit the manipulation of the spatial properties of a recording, in a manner not possible with the traditional linear Ambisonics framework. In this case, the approach relies on the estimation of spatial parameters that correspond to the sound scene. These methods are often referred to as being parametric, and were traditionally associated with coding and compression of spatial sound. However, parametric methods have also been applied to SH signals for the purpose of reproduction [13, 14, 15, 16, 17], and have been proven to offer high perceptual quality, even with low spatial resolution material. A recent framework for such parametric spatial audio processing is the Coding and Multi-directional Decomposition of Ambisonic Sound Scenes (COMPASS) framework [18]. Real-time implementations of loudspeaker and headphone decoders for ambisonic material, employing the COMPASS framework, are also described. Unlike traditional linear ambisonic decoders, these parametric alternatives feature enhancement controls and advanced parametric processing options, such as re-balancing of the dry and reverberant components in the spatial recording. Furthermore, the framework has also been employed for the up-mixing of lower-order ambisonic material to higher-order material, and for suppressing or amplifying sounds from various regions in the sound scene.

1 General implementation details

Both the SPARTA and COMPASS plug-in suites employ the Spatial_Audio_Framework3 and JUCE for the internal processing and graphical user interfaces (GUI), respectively. For the plug-ins which utilise SH signal input and/or output, the Ambisonic Channel Number (ACN) ordering convention is used. In addition, both orthonormalised (N3D) and semi-normalised (SN3D) schemes are supported; note that ACN/SN3D is more popularly referred to as the AmbiX format. The maximum transform order for the relevant SPARTA plug-ins is 7th order, whereas for COMPASS plug-ins, the maximum input order is 3rd order. Furthermore, all plug-ins that feature a binaural output option permit the user to import their own head-related impulse responses (HRIRs), in order to attain personalised headphone listening, via the SOFA (Spatially Oriented Format for Acoustics) standard. Additionally, plug-ins that allow the user to specify the directions for the loudspeakers, sources, or sensors support the saving and loading of the directions via JSON (JavaScript Object Notation) configuration files. Source directions may also be controlled via suitable host automation. The plug-ins also display warning messages, should some aspects of the current configuration not meet the internal processing requirements.
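The channel conventions above can be illustrated with a short sketch (Python here, since the paper itself contains no code): the ACN index for spherical harmonic degree n and order m, and the per-degree factor relating N3D to SN3D normalisation.

```python
import numpy as np

def acn_index(n: int, m: int) -> int:
    """Ambisonic Channel Number for SH degree n and order m (-n..n)."""
    return n * (n + 1) + m

def n3d_to_sn3d_factor(n: int) -> float:
    """Per-degree scaling that converts N3D-normalised signals to SN3D."""
    return 1.0 / np.sqrt(2 * n + 1)

# Example: a 1st-order (FOA) signal set occupies ACN channels 0..3
channels = [acn_index(n, m) for n in range(2) for m in range(-n, n + 1)]
```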

For the time-frequency transform, the alias-free short-time Fourier transform (afSTFT)4 implementation by Juha Vilkamo (which is optimised to be robust to temporal artefacts during aggressive time-frequency domain manipulations) is employed [19]. Furthermore, due to the heavy use of linear algebra operations, the code also conforms to the relevant BLAS [20] and LAPACK [21] standards, for which Intel's MKL implementations are employed for both Windows (64-bit) and Mac OSX (10.10 or above) versions.

3 The Spatial_Audio_Framework (ISC license) can be found here: https://github.com/leomccormack/Spatial_Audio_Framework

4 The alias-free STFT (MIT license) can be found here: https://github.com/jvilkamo/afSTFT

AES Conference on Immersive and Interactive Audio, York, UK, 2019 March 27 – 29

2 The SPARTA Plug-in suite

The SPARTA plug-in suite offers an array of fundamental tools for spatialisation, production and visualisation of spatial sound scenes. It includes elementary binaural and loudspeaker spatialisers, and processing tools based on the Ambisonics framework. Regarding the latter, plug-ins are included to address sound-field reproduction, rotation, spatial encoding, visualisation, and frequency-dependent dynamic range compression.

2.1 SPARTA | AmbiBIN

The AmbiBIN plug-in is a binaural ambisonic decoder for headphone playback of SH signals (see Fig. 1). It employs a virtual loudspeaker approach [22] and offers a max_rE weighting option [11], which concentrates the beamformer signal energy towards the look-direction of a given virtual loudspeaker. This weighting option also serves to reduce the side-lobes of the directional spread inherent in Ambisonics, for a source emanating from any direction, and concentrates the directional energy in an optimal manner. Furthermore, the number of virtual loudspeakers increases with the transform order, and they are uniformly distributed. The plug-in also features a built-in rotator and head-tracking support, which may be realised by sending appropriate OSC messages to a user-defined OSC port. The rotation angles and matrices [9] are updated after the time-frequency transform, which allows for reduced latency compared to its loudspeaker counterpart (AmbiDEC) paired with the Rotator plug-in.
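As an illustration, the max_rE weights mentioned above are commonly approximated (following the closed-form expression popularised by Zotter and Frank in [11]) by evaluating Legendre polynomials at a fixed, order-dependent argument. The sketch below is a minimal rendition of that approximation, not the plug-in's actual code.

```python
import numpy as np
from numpy.polynomial import legendre

def max_re_weights(order: int) -> np.ndarray:
    """Per-degree max_rE weights: g_n = P_n(cos(137.9 deg / (N + 1.51))),
    a common closed-form approximation of the max-rE condition."""
    x = np.cos(np.deg2rad(137.9) / (order + 1.51))
    # the coefficient list [0]*n + [1] selects Legendre polynomial P_n
    return np.array([legendre.legval(x, [0] * n + [1])
                     for n in range(order + 1)])

w3 = max_re_weights(3)  # one weight per degree 0..3, shared by all orders m
```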

2.2 SPARTA | AmbiDEC

AmbiDEC is a frequency-dependent ambisonic decoder for loudspeaker playback (see Fig. 1). The loudspeaker directions can be user-specified for up to 64 channels, or alternatively, presets for popular 2D and 3D set-ups can be selected. The plug-in also features a built-in binauraliser, which allows the user to audition the loudspeaker output via convolution of the individual loudspeaker signals with interpolated head-related transfer functions (HRTFs) corresponding to each loudspeaker direction.

Fig. 1: Linear ambisonic decoders for headphones (top) and loudspeakers (bottom).

The plug-in employs a dual decoding approach, whereby different decoder settings may be selected for the low and high frequencies; the cross-over frequency may also be dictated by the user. Several ambisonic decoders have been integrated, including recent perceptually-motivated approaches, such as the All-Round Ambisonic Decoder (AllRAD) [11] and the Energy-Preserving Ambisonic Decoder (EPAD) [12]. The max_rE weighting [11] may also be enabled for either decoder. Furthermore, in the case of non-ideal SH signals as input (i.e. those that are derived from real microphone arrays), the decoding order may be specified for the appropriate frequency ranges. In these cases, energy-preserving (EP) or amplitude-preserving (AP) normalisation may be selected to help maintain consistent loudness between decoding orders.

2.3 SPARTA | AmbiDRC

The AmbiDRC plug-in is a frequency-dependent SHD dynamic range compressor, based on the design presented in [23]. The gain factors are derived by analysing the omni-directional (zeroth order) component for each frequency band, and are then applied to all of the input SH components; the spatial properties of the original signals therefore remain unchanged. The implementation also stores the frequency-dependent gain factors over time and plots them on the GUI, to provide the user with visual feedback (see Fig. 2).
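The core idea can be shown with a single-band, broadband sketch (hypothetical parameter names; the plug-in itself operates per frequency band with proper attack/release ballistics):

```python
import numpy as np

def shd_compress(sh_frame: np.ndarray, thresh_db: float = -20.0,
                 ratio: float = 4.0) -> np.ndarray:
    """Derive a gain from the omni (ACN 0) channel and apply it to all
    SH channels, preserving the spatial properties of the frame.
    sh_frame: (n_channels, n_samples) block of SH signals."""
    omni = sh_frame[0]
    level_db = 10.0 * np.log10(np.mean(omni ** 2) + 1e-12)
    over = max(0.0, level_db - thresh_db)
    gain_db = -over * (1.0 - 1.0 / ratio)   # static compression curve
    gain = 10.0 ** (gain_db / 20.0)
    return gain * sh_frame                  # same gain on every channel
```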



Fig. 2: Frequency-dependent DRC for SH signals.

2.4 SPARTA | AmbiENC

AmbiENC is an elementary ambisonic encoder, which accepts multiple monophonic input signals (up to 64 channels) and spatially encodes them into SH signals at the specified directions (see Fig. 3). Essentially, these SH signals describe a synthetic anechoic sound-field, where the spatial resolution of the encoding is determined by the transform order. Several presets have been integrated into the plug-in for convenience (which allow for 22.x etc. audio to be encoded into 1-7th order SH signals, for example). The panning window is also fully mouse driven, and uses an equirectangular representation of the sphere to depict the azimuth and elevation angles of each source.
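For reference, under the ACN/SN3D convention used by the suite, first-order encoding reduces to the following gains (a sketch; the plug-in supports up to 7th order, which requires a full real-SH evaluation):

```python
import numpy as np

def foa_encode_gains(azi_deg: float, elev_deg: float) -> np.ndarray:
    """First-order ambisonic encoding gains (ACN ordering: W, Y, Z, X)
    with SN3D normalisation, for a source at the given direction."""
    azi, ele = np.deg2rad(azi_deg), np.deg2rad(elev_deg)
    return np.array([
        1.0,                        # W (ACN 0)
        np.cos(ele) * np.sin(azi),  # Y (ACN 1)
        np.sin(ele),                # Z (ACN 2)
        np.cos(ele) * np.cos(azi),  # X (ACN 3)
    ])

def encode(mono: np.ndarray, azi_deg: float, elev_deg: float) -> np.ndarray:
    """Encode a mono signal into FOA SH signals of shape (4, n_samples)."""
    return np.outer(foa_encode_gains(azi_deg, elev_deg), mono)
```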

2.5 SPARTA | Array2SH

Fig. 3: Monophonic (top) and microphone array (bottom) ambisonic encoders.

The Array2SH plug-in, originally presented in [24], spatially encodes microphone array signals into SH signals (see Fig. 3). The plug-in utilises analytical solutions, which ascertain the frequency- and order-dependent influence that the array has on the initial estimate of the SH signals. The plug-in allows the user to specify: the array type (spherical or cylindrical), whether the array has an open or rigid enclosure, the radius of the baffle, the radius of the array (in cases where the sensors protrude out from the baffle), the sensor coordinates (up to 64 channels), the sensor directivity (omni-dipole-cardioid), and the speed of sound. The plug-in then determines the order-dependent equalisation curves, which need to be imposed onto the initial SH signals estimate, in order to remove the influence of the array itself [25, 26]. However, especially for higher orders, this often results in large-scale amplification of the low frequencies (including the sensor noise at these frequencies that accompanies it); therefore, two popular regularisation approaches have been integrated into the plug-in, which allow the user to make a compromise between noise amplification and transform accuracy [4, 27]. The target and regularised equalisation curves are depicted on the GUI to provide visual feedback. The plug-in also offers a filter design option based on band-passing the individual orders with a linear-phase filterbank, and subsequently shaping their combined encoding directional response with max_rE weights and diffuse-field equalisation, as proposed by Franz Zotter in [28]; this design can be more suitable for reproduction purposes. In addition, for high frequencies that exceed the spatial aliasing limit, where directional encoding fails, all filter designs have the option to impose diffuse-field equalisation on the aliased components, in order to avoid undesirable boosting of the high frequencies.
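The regularisation trade-off can be sketched for the simplest case: an open array of omnidirectional sensors, where the order-dependent radial response is b_n(kr) = 4π i^n j_n(kr). The Tikhonov-style inverse below is only one of several possible regularisation schemes, and the mapping from a "Max Gain" figure to the regularisation constant is an assumption for illustration, not the plug-in's actual rule.

```python
import numpy as np
from scipy.special import spherical_jn

def radial_eq_open(order: int, f: np.ndarray, r: float,
                   max_gain_db: float = 20.0, c: float = 343.0) -> np.ndarray:
    """Regularised inverse of the radial response b_n(kr) = 4*pi*i^n*j_n(kr)
    of an open spherical array of omnidirectional sensors.
    Returns an (order+1, len(f)) array of equalisation filters."""
    kr = 2 * np.pi * f * r / c
    lam = 10.0 ** (-max_gain_db / 20.0)   # assumed gain-limit -> Tikhonov map
    eq = np.zeros((order + 1, len(f)), dtype=complex)
    for n in range(order + 1):
        b = 4 * np.pi * (1j ** n) * spherical_jn(n, kr)
        eq[n] = np.conj(b) / (np.abs(b) ** 2 + lam ** 2)  # regularised 1/b
    return eq
```

The magnitude of the regularised inverse is bounded by 1/(2λ), which caps the low-frequency (and sensor-noise) amplification that an exact inverse would otherwise apply at higher orders.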

For convenience, the specifications of several commercially available microphone arrays have been integrated as presets, including: MH Acoustic's Eigenmike, the Zylia array, and various A-format microphone arrays. However, owing to the flexible nature of the plug-in, users may also design their own customised microphone array, while having a convenient means of obtaining the corresponding SH signals. For example, a four-capsule open-body hydrophone array was presented in [29], which utilised the plug-in as the first step in visualising and auralising an underwater sound scene in real-time. Furthermore, a 19-sensor modular microphone array was recently developed and presented in [30], which featured MEMS microphones that protruded from the surface of a rigid scatterer via variable-length stilts. This allowed control over the sensor radii and, in turn, the ability to more readily optimise the array for a given target application.

The plug-in can also depict two objective metrics, as described in [6, 31], which correspond to the spatial encoding performance of the array in question, namely: the spatial correlation and the diffuse level difference. During this performance analysis, the encoding matrices are first applied to a simulated array, which is described by multi-channel transfer functions of plane waves for 812 points on the surface of the array. Ideally, the resulting encoded array responses should exactly resemble the spherical harmonic functions at each grid point; any deviation from this would suggest a loss in encoding performance. The spatial correlation is a metric that describes how similar the generated patterns are to the ideal SH patterns: a value of 1 indicates that the patterns are perfect, whereas a value of 0 means that they are uncorrelated. Therefore, the spatial aliasing frequency can be observed for each order, as the point where the spatial correlation tends towards 0. The level difference is then determined as the difference between the mean level over all directions (i.e. the energy sum of the encoded response over all points) and that of the ideal components. One can observe that higher permitted amplification limits, Max Gain (dB), will result in a wider frequency range of useful spherical harmonic components; however, this will also result in noisier signals.
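The spatial correlation metric can be sketched as follows: a minimal re-implementation of the idea, assuming the encoded and ideal responses are sampled over the same direction grid.

```python
import numpy as np

def spatial_correlation(encoded: np.ndarray, ideal: np.ndarray) -> np.ndarray:
    """Per-SH-component correlation between encoded array responses and
    the ideal SH patterns, both of shape (n_sh, n_grid_dirs).
    Returns values in [0, 1]; 1 means the pattern is reproduced perfectly."""
    num = np.abs(np.sum(encoded * np.conj(ideal), axis=1))
    den = np.linalg.norm(encoded, axis=1) * np.linalg.norm(ideal, axis=1)
    return num / np.maximum(den, 1e-12)
```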

Note that this ability to balance the noise amplification with the accuracy of the spatial encoding (to better suit a given application) has some important ramifications. For example, the perceived fidelity of ambisonic decoded audio can often be degraded if the noise amplification is set too high. However, for sound-field visualisation, or for beamformers which employ appropriate post-filtering, this noise amplification is often less problematic; therefore, such applications may benefit from a more aggressive encoding setting.

Fig. 4: Elementary binaural (top) and loudspeaker (bottom) spatialisers.

2.6 SPARTA | Binauraliser

The Binauraliser plug-in convolves input audio (up to 64 channels) with interpolated HRTFs, in the time-frequency domain (see Fig. 4). The plug-in first extracts the inter-aural time differences (ITDs) for each of the default or imported HRIRs, via the cross-correlation between the low-pass filtered left and right time-domain responses. The HRTFs are then interpolated at run-time by applying triangular-spherical interpolation on the HRTF measurement grid, which may be equivalently realised through the application of amplitude-normalised VBAP gains [32]. Note that these interpolation gains are applied to the HRTF magnitude responses and ITDs separately, which are then re-combined per frequency band. Presets for popular 2D and 3D formats are included for convenience; however, the directions for up to 64 channels can be independently controlled. Head-tracking is also supported via OSC messages, in the same manner as the Rotator and AmbiBIN plug-ins.
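The ITD-extraction step can be sketched as follows; a crude FFT brick-wall low-pass stands in for whatever filter the plug-in actually uses:

```python
import numpy as np

def extract_itd(hrir_left: np.ndarray, hrir_right: np.ndarray,
                fs: float, cutoff_hz: float = 1500.0) -> float:
    """Estimate the ITD (in seconds) of an HRIR pair as the lag that
    maximises the cross-correlation of the low-pass filtered responses.
    A positive value means the left response is the later (delayed) one."""
    n = len(hrir_left)
    keep = np.fft.rfftfreq(n, 1.0 / fs) <= cutoff_hz  # brick-wall low-pass
    lowpass = lambda h: np.fft.irfft(np.fft.rfft(h) * keep, n)
    xcorr = np.correlate(lowpass(hrir_left), lowpass(hrir_right), mode="full")
    lag = np.argmax(xcorr) - (n - 1)                  # lag in samples
    return lag / fs
```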

2.7 SPARTA | Panner

Panner is a frequency-dependent 2D and 3D panning plug-in, based on the Vector-base Amplitude Panning (VBAP) method [32] or Multiple-Direction Amplitude Panning (MDAP) [33]; the latter is employed when the spread parameter is set to more than 0 degrees (see Fig. 4). Presets for popular 2D and 3D formats are included for convenience; however, the directions for up to 64 channels can be independently controlled for both the inputs and outputs, allowing, for example, 9.x input audio to be panned for a 22.2 setup. The panning is frequency-dependent in order to accommodate the method described in [34], which allows for more consistent loudness of sources when they are panned in-between the loudspeaker directions, under different playback conditions. One should set the pValue Coeff parameter to 0 for a standard room with moderate reverberation, 0.5 for a low-reverberation listening room, and 1 for an anechoic chamber.
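For a single loudspeaker pair, the 2D VBAP gains of [32] amount to a small linear solve; a minimal sketch:

```python
import numpy as np

def vbap_2d_gains(src_deg: float, spkr_deg: tuple) -> np.ndarray:
    """Amplitude-panning gains for one loudspeaker pair (2D VBAP):
    solve p = L^T g, then normalise so that ||g|| = 1."""
    p = np.array([np.cos(np.deg2rad(src_deg)), np.sin(np.deg2rad(src_deg))])
    L = np.array([[np.cos(np.deg2rad(a)), np.sin(np.deg2rad(a))]
                  for a in spkr_deg])     # rows: loudspeaker unit vectors
    g = np.linalg.solve(L.T, p)           # p = L^T g
    return g / np.linalg.norm(g)          # energy normalisation
```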

2.8 SPARTA | PowerMap

The Powermap plug-in depicts the relative sound-field activity at multiple directions, using a colour-map and an equirectangular representation of the sphere (see Fig. 5). The design is based on the software described in [35]. The colour yellow indicates higher sound energy, or a higher likelihood of a source emanating from a particular direction, whereas blue indicates lower sound energy or likelihood.

The plug-in integrates a variety of different localisation functions, all of which are formulated to operate in the SHD. These include beamformer-based approaches, such as Plane-Wave Decomposition (PWD) [36] and Minimum-Variance Distortion-less Response (MVDR) [37]. Also catered for are subspace-based approaches, such as Multiple Signal Classification (MUSIC) [25] and Minimum-Norm algorithms [37]. The Cross-Pattern Coherence (CroPaC) algorithm is also featured [38], which employs an iterative side-lobe suppression operation, as described in [35]. Furthermore, the analysis order per frequency band is user definable. Additionally, presets for commercially available higher-order microphone arrays have been included, which also provide approximate starting values, based on theoretical objective analysis of the arrays [4, 6]. The plug-in employs an 812-point uniformly-distributed spherical grid, which is interpolated to attain a 2D image using amplitude-normalised VBAP gains [32] (i.e. triangular-spherical interpolation).
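As an illustration of the beamforming-based options, an MVDR power map in the SHD can be sketched as follows (broadband, single band; the steering vectors are the real SH vectors of the scan grid):

```python
import numpy as np

def mvdr_powermap(sh_block: np.ndarray, steering: np.ndarray,
                  diag_load: float = 1e-3) -> np.ndarray:
    """MVDR power estimates over a grid of directions, in the SHD.
    sh_block: (n_sh, n_samples) SH signals for one frequency band.
    steering: (n_sh, n_dirs) real SH vectors y(dir) of the scan grid."""
    C = sh_block @ sh_block.conj().T / sh_block.shape[1]   # spatial covariance
    C = C + diag_load * np.trace(C).real / C.shape[0] * np.eye(C.shape[0])
    Cinv = np.linalg.inv(C)
    denom = np.einsum('id,ij,jd->d', steering.conj(), Cinv, steering)
    return 1.0 / denom.real           # P_mvdr(dir) = 1 / (y^H C^-1 y)
```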

2.9 SPARTA | Rotator

Fig. 5: Ambisonic sound-field visualiser.

Fig. 6: Ambisonic rotator.

The Rotator plug-in applies a SHD rotation matrix [9] to the input SH signals (see Fig. 6). The rotation angles can be controlled using a head-tracker and OSC messages. The head-tracker should be configured to send a three-element vector, ypr[3], to OSC port 9000 (default), where ypr[0], ypr[1], ypr[2] correspond to the yaw, pitch, and roll angles, respectively. The sign of the angles may also be flipped, +/−, in order to support a wider range of devices. The rotation order [yaw-pitch-roll (default) or roll-pitch-yaw] may also be user specified.
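The yaw-pitch-roll rotation is easiest to see at first order, where the dipole components rotate exactly like Cartesian coordinates. This is a sketch with assumed angle and sign conventions; the plug-in computes full higher-order rotation matrices as per [9].

```python
import numpy as np

def rotation_matrix(yaw: float, pitch: float, roll: float) -> np.ndarray:
    """3x3 rotation from yaw-pitch-roll angles (radians), applied in
    yaw-pitch-roll order: R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def rotate_foa(sh: np.ndarray, yaw: float, pitch: float,
               roll: float) -> np.ndarray:
    """Rotate FOA signals (ACN: W, Y, Z, X). W is invariant; the dipoles
    (X, Y, Z) rotate like Cartesian coordinates."""
    R = rotation_matrix(yaw, pitch, roll)
    out = sh.copy()
    out[[3, 1, 2]] = R @ sh[[3, 1, 2]]   # rows ordered as X, Y, Z
    return out
```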

2.10 SPARTA | SLDoA

SLDoA is a spatially-localised direction-of-arrival (DoA) estimator (see Fig. 7), originally presented in [24]. The plug-in first employs VBAP beam patterns (for directions that are uniformly distributed on the surface of a sphere) to approximate spatially-biased zeroth- and first-order signals in a least-squares sense. These are subsequently utilised for the active-intensity vector estimation [39]. This allows for DoA estimation in several spatially-localised sectors for each sub-band, whereby estimates made within one sector have reduced susceptibility to interferers present in other sectors.
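At first order, the active-intensity DoA estimate underlying this process reduces to averaging the product of the omni and dipole signals. This is a broadband sketch; SLDoA applies the same idea per sector and per band, using the spatially-biased sector signals.

```python
import numpy as np

def doa_active_intensity(foa: np.ndarray) -> np.ndarray:
    """Broadband DoA unit vector from FOA signals in ACN ordering
    (W, Y, Z, X). With this encoding convention, the time-averaged
    product of the omni and dipole signals points towards the source."""
    w = foa[0]
    v = np.array([np.mean(w * foa[3]),   # x
                  np.mean(w * foa[1]),   # y
                  np.mean(w * foa[2])])  # z
    return v / np.linalg.norm(v)
```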

Fig. 7: The SLDoA sound-field visualiser.

The low-frequency DoA estimates are depicted with blue icons, mid-frequencies with green, and high-frequencies with red. The size of each icon and its opacity correspond to the sector energies, which are normalised for each frequency band. The plug-in employs twice the number of sectors as the analysis order, with the exception of the first-order analysis, which derives the DoA estimate from the traditional active-intensity vector. The analysis order per frequency band, and the frequency range at which to perform the analysis, are also user definable, in a similar manner to the Powermap plug-in. This approach to sound-field visualisation and/or DoA estimation may represent a more computationally efficient alternative, when compared to the algorithms that are integrated into the Powermap plug-in, for example.

3 The COMPASS plug-in suite

Ambisonic decoding is traditionally defined as a linear, time-invariant and signal-independent operation. In the simplest case, reproduction is achieved with a single frequency-independent matrix consisting of real-valued gains. Operations which modify the sound scene may also be conveniently defined in the SHD, in a linear fashion. The operations performed on the SH signals may be simple order-preserving exercises, such as mirroring and rotating. However, more complex operations may also be carried out, such as directional loudness modifications, which return signals of a higher order than that of the original [40].

An alternative to the linear Ambisonics ecosystem is to apply a parametric approach, which is also applicable to the tasks of decoding, modification, and reproduction of multi-channel recordings. Parametric approaches operate in the time-frequency domain and are signal-dependent. They rely on the extraction of spatial parameters from the multi-channel input, which they subsequently utilise to achieve a more informed decoding, or to allow advanced sound scene manipulations.

Parametric methods also have the potential to offer super-resolution, in the sense that, even for small compact arrays of a few microphones, they can estimate and reproduce sounds with a spatial resolution that far surpasses the inherent limitations of non-parametric methods. Another advantage is that they perform an "objectification" of the sound scene: as an intermediate representation, instead of signals describing the sound scene as a whole, they produce signals and parameters which describe the scene content, such as the directional sounds and their directions, the ambient sound, and the energy relations between them. This information provides an intuitive representation of the sound scene and can be used to flexibly manipulate or modify the spatial sound content [41].

However, while parametric approaches may yield enhanced flexibility and spatial resolution, they are also associated with their own limitations. For instance, rather than elementary linear filtering, adaptive spectral processing with dedicated multi-channel analysis and synthesis stages is required; this renders them more computationally demanding than their non-parametric counterparts. Furthermore, additional effort must be afforded in the implementation of these methods, in order to minimise spectral processing artefacts and/or musical noise. Finally, the methods must also be designed in such a manner as to be robust to estimation errors and to provide a perceptually acceptable output; this includes cases where the input sound scene deviates wildly from the assumptions made in the parametric sound-field model. Of the parametric methods operating on SH signals, the most well-known example that has demonstrated robust performance is Directional Audio Coding (DirAC) [13, 17]. Originally intended for the reproduction of first-order (FOA) signals, DirAC was later extended for higher-order (HOA) input [16]. Other methods include High-Angular Plane-wave Expansion (HARPEX) [14] for FOA signals, and the ambisonic sparse recovery framework of [15], for both FOA and HOA signals. For a comparison between these published parametric methods, the reader is referred to Fig. 8.


Fig. 8: Comparison of parametric ambisonic processing methods. M in this case refers to the number of ambisonic channels for each order.

The parametric COMPASS framework for SH signals was recently proposed in [18]. It is based on both array processing and spatial filtering techniques, employing a sound-field model that comprises multiple narrowband source signals and a residual ambient component. The method is applicable to any input order, and offers more robust estimation and a greater number of extracted source signals with each increase in order. The ambient signal populates all the available SH channels and has its own time-variant spatial distribution, which predominantly encapsulates the remaining reverberant and diffuse components. The source and ambient signals may then be spatialised or modified independently. A similar approach to COMPASS, formulated for FOA input only, has also been presented in [42].
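As a rough numerical illustration of this sound-field model, one time-frequency tile of SH signals can be written as a sum of K directional sources plus a residual ambient component. The sketch below is not plug-in code; it assumes first-order signals with ACN channel ordering and N3D normalisation:

```python
import numpy as np


def foa_steering_vector(azi, elev):
    """Real first-order SH vector (ACN ordering, N3D normalisation)
    for a plane wave incident from (azi, elev), in radians."""
    return np.array([
        1.0,
        np.sqrt(3.0) * np.cos(elev) * np.sin(azi),  # Y
        np.sqrt(3.0) * np.sin(elev),                # Z
        np.sqrt(3.0) * np.cos(elev) * np.cos(azi),  # X
    ])


# x = sum_k y(theta_k) s_k + ambient, for one time-frequency tile
rng = np.random.default_rng(0)
doas = [(0.0, 0.0), (np.pi / 2, 0.0)]    # K = 2 source directions
s = rng.standard_normal(len(doas))       # narrowband source signals
ambient = 0.1 * rng.standard_normal(4)   # residual ambient SH component

Y = np.stack([foa_steering_vector(a, e) for a, e in doas], axis=1)  # (4, K)
x = Y @ s + ambient                      # modelled SH signal vector
```

COMPASS runs this model in the opposite direction: it estimates the DoAs, source signals, and ambience from the observed SH signals, with higher input orders supporting a larger number of sources K.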

3.1 COMPASS | Decoder

The Decoder plug-in5 is a parametrically enhanced ambisonic decoder for arbitrary loudspeaker setups (see

5 A video demo of the COMPASS Decoder can be found here: http://research.spa.aalto.fi/projects/compass_vsts/plugins.html

Fig. 9). The GUI for the plug-in offers much of the functionality provided by the AmbiDEC plug-in, but features additional frequency-dependent balance controls between the extracted direct and ambient components, as well as a control between fully parametric decoding and linear ambisonic decoding.

The Diffuse-to-Direct control allows the user to lend more prominence to the direct sound components (an effect similar to subtle dereverberation), or to the ambient components (an effect similar to emphasising the reverberation in the recording); when set in the middle, the two are balanced. Note that the parametric processing can be quite aggressive: if the plug-in is set to fully direct rendering in a complex multi-source sound scene with low-order input signals, artefacts may appear. With more conservative settings, however, such artefacts should become imperceptible.
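A common way to realise such a balance control is an energy-preserving crossfade between the two streams. The sketch below is illustrative only and does not reproduce the actual COMPASS weighting law:

```python
import numpy as np


def diffuse_to_direct(direct, ambient, balance):
    """Energy-preserving mix of the direct and ambient streams.
    balance = 0 -> fully ambient, 0.5 -> balanced, 1 -> fully direct."""
    g_dir = np.sqrt(balance)        # gains satisfy g_dir**2 + g_amb**2 = 1
    g_amb = np.sqrt(1.0 - balance)
    return g_dir * direct + g_amb * ambient
```

The square-root law keeps the summed energy of the two gains constant, so sweeping the control does not alter the overall loudness of uncorrelated direct and ambient streams.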

The Linear-to-Parametric control allows the user to dictate a mix between standard linear ambisonic decoding and the COMPASS parametric decoding. This control


Fig. 9: COMPASS processing flow chart for loudspeaker or headphone reproduction. [Flow-chart summary: the array input signals enter a time-frequency transform (TFT/FB); DoA estimation and source number detection drive a source beamformer and an ambience beamformer; the K source DoAs steer directional synthesis via panning or HRTFs for the directional signals, while the ambient signal is decorrelated for diffuse synthesis; both streams use the target setup information and are returned to the time domain via the inverse TFT/FB.]

can be used in cases where the parametric processing is too aggressive, or if the user prefers some degree of the localisation blur offered by linear ambisonic decoding.
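Conceptually, this control amounts to a crossfade between the outputs of the two decoding paths. A minimal sketch follows; the plug-in itself operates per frequency band, which is not shown here:

```python
import numpy as np


def linear_to_parametric(linear_out, parametric_out, alpha):
    """Blend the linear ambisonic decode with the parametric decode.
    alpha = 0 -> fully linear, alpha = 1 -> fully parametric."""
    return (1.0 - alpha) * linear_out + alpha * parametric_out
```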

The plug-in is considered by the authors as a production tool and, due to its advanced time-frequency processing, requires large audio buffer sizes. Therefore, the plug-in may not be suitable for interactive input with low-latency requirements. For applications such as interactive binaural rendering with head-tracking, the Binaural variant described below is more appropriate.

3.2 COMPASS | Binaural

The Binaural decoder is an optimised version of the COMPASS Decoder for binaural playback, which bypasses loudspeaker rendering and uses HRTFs directly

Fig. 10: Parametric Ambisonic decoder plug-ins.

instead. For the plug-in parameters, refer to the description of the COMPASS Decoder above. Additionally, the plug-in can receive OSC rotation angles from a head-tracker at a user-specified port, as detailed in the Rotator plug-in description.

This decoder is intended primarily for head-tracked binaural playback of ambisonic content at interactive update rates, usually in conjunction with a head-mounted display (HMD). The plug-in requires an audio buffer size of at least 512 samples (approximately 10 ms at 48 kHz). The averaging parameters may be employed to influence the responsiveness of the parametric analysis and synthesis, thus providing the user with a means to adjust them optimally for a particular input sound scene.

3.3 COMPASS | Upmixer

The Upmixer plug-in employs COMPASS for the task of up-mixing a lower-order ambisonic recording to a higher-order ambisonic recording. It is intended for users who already work with a preferred linear ambisonic decoding work-flow for higher-order content, and wish to incorporate lower-order material with increased spatial resolution. The user may up-mix first-, second-, or third-order material up to seventh-order material (64 channels).
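The channel counts follow directly from the number of spherical harmonics up to a given order N, i.e. (N + 1)^2:

```python
def ambisonic_channels(order: int) -> int:
    """Number of SH channels for a given ambisonic order."""
    return (order + 1) ** 2


for n in (1, 2, 3, 7):
    print(n, ambisonic_channels(n))  # 1 -> 4, 2 -> 9, 3 -> 16, 7 -> 64
```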


Fig. 11: Parametric Upmixer plug-in.

3.4 COMPASS | SpatEdit

The SpatEdit plug-in is a parametric spatial editor, which allows the user to emphasise or attenuate specific directional regions in the sound scene. The plug-in uses the analysis part of the COMPASS framework, namely the multi-source DoA estimates and the ambient component, in order to extract or modify sounds which emanate from user-defined spatial targets. These targets are controlled via markers, which may be placed at arbitrary angles on the azimuth-elevation plane. The number of markers is dictated by the transform order; higher orders allow the placement of more markers and hence finer spatial control. Additionally, the user may specify a gain value for each target, which can amplify, attenuate, or attempt to eliminate sound incident from that direction. In the linear processing mode, a reasonable degree of spatial enhancement or separation can be attained with zero distortion, especially at higher orders. However, should the user wish to conduct more aggressive manipulations of the sound scene, the plug-in may also operate in the parametric mode. In this case, the target source signals can become more spatially selective, by taking the analysed parameters into account. For each target, an additional enhancement value can be specified, which essentially determines how spatially selective the parametric processing may aspire to be. Higher values can separate and modify the sounds coming from the targets to a much greater degree, but with the undesirable possibility of introducing more artefacts. Since artefacts in parametric processing are always scene-dependent, the user is responsible for ascertaining which settings work best for a given input scene.

The plug-in may be configured either to output the individual beams and the residual, or to output the manipulated scene in the SHD via suitable re-encoding. This allows the user to process sounds from individual regions separately (e.g., to apply equalisation), or to receive the modified scene conveniently re-packaged in the input ambisonic format for immediate rendering or further ambisonic processing.
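The linear mode of such an editor can be sketched as follows. This is hypothetical code, not SpatEdit's implementation; it assumes first-order input (ACN/N3D) and simple matched beamformers. Each target beam is extracted, scaled by its gain, and the difference is re-encoded into the SH-domain scene:

```python
import numpy as np


def foa_steering_vector(azi, elev):
    """Real first-order SH vector (ACN/N3D) for direction (azi, elev)."""
    return np.array([1.0,
                     np.sqrt(3.0) * np.cos(elev) * np.sin(azi),
                     np.sqrt(3.0) * np.sin(elev),
                     np.sqrt(3.0) * np.cos(elev) * np.cos(azi)])


def edit_scene(sh_in, targets, gains):
    """Linear-mode spatial editing: beamform towards each target marker,
    apply its gain, and re-encode the change into the scene.
    sh_in: (M, T) SH signals; targets: list of (azi, elev); gains: list."""
    M = sh_in.shape[0]
    out = sh_in.astype(float).copy()
    for (azi, elev), g in zip(targets, gains):
        y = foa_steering_vector(azi, elev)    # (M,)
        beam = (y / M) @ sh_in                # matched beamformer, unit gain on target
        out += np.outer(y, (g - 1.0) * beam)  # replace beam contribution with scaled one
    return out


# demo: a single plane wave from (0.3, 0.1) rad, eliminated with gain 0
sh = np.outer(foa_steering_vector(0.3, 0.1), np.ones(8))
edited = edit_scene(sh, [(0.3, 0.1)], [0.0])  # approximately silence
```

The parametric mode described above would additionally sharpen each beam using the analysed DoAs and energy ratios, which is beyond this linear sketch.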

Fig. 12: Parametric spatial editor plug-in.

Acknowledgements

We would like to thank all of the Aalto Acoustics Lab researchers who have contributed to, or supported, the creation of these plug-in suites, namely Dr. Symeon Delikaris-Manias, Dr. Sakari Tervo, Prof. Ville Pulkki, and Prof. Tapio Lokki.

Our many thanks are also extended to Daniel Rudrich and Dr. Franz Zotter from the Institute for Electronic Music and Acoustics (IEM), Graz, Austria, for their cross-collaborative efforts. Special thanks are also extended to all those who submitted bug reports during the beta releases of the plug-ins.

References

[1] Blumlein, A. D., “Improvements in and relating to sound-transmission, sound-recording and sound-reproducing systems,” British Patent No. 394325, 1931.

[2] Rafaely, B., Fundamentals of spherical array processing, Springer, 2015.

[3] Jarrett, D. P., Habets, E. A., and Naylor, P. A., Theory and applications of spherical microphone array processing, Springer, 2017.

[4] Moreau, S., Daniel, J., and Bertet, S., “3D sound field recording with higher order ambisonics – Objective measurements and validation of a 4th order spherical microphone,” in 120th Convention of the AES, pp. 20–23, 2006.

[5] Jin, C. T., Epain, N., and Parthy, A., “Design, optimization and evaluation of a dual-radius spherical microphone array,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22(1), pp. 193–204, 2014.


[6] Politis, A. and Gamper, H., “Comparing modeled and measurement-based spherical harmonic encoding filters for spherical microphone arrays,” in Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on, pp. 224–228, IEEE, 2017.

[7] Alon, D. L. and Rafaely, B., “Spatial decomposition by spherical array processing,” in Parametric time-frequency domain spatial audio, pp. 25–48, Wiley, 2018.

[8] Schörkhuber, C. and Höldrich, R., “Ambisonic microphone encoding with covariance constraint,” in Proceedings of the International Conference on Spatial Audio, 2017.

[9] Ivanic, J. and Ruedenberg, K., “Rotation Matrices for Real Spherical Harmonics. Direct Determination by Recursion,” The Journal of Physical Chemistry A, 102(45), pp. 9099–9100, 1998.

[10] Gerzon, M. A., “Periphony: With-height sound reproduction,” Journal of the Audio Engineering Society, 21(1), pp. 2–10, 1973.

[11] Zotter, F. and Frank, M., “All-round ambisonic panning and decoding,” Journal of the Audio Engineering Society, 60(10), pp. 807–820, 2012.

[12] Zotter, F., Pomberger, H., and Noisternig, M., “Energy-preserving ambisonic decoding,” Acta Acustica united with Acustica, 98(1), pp. 37–47, 2012.

[13] Pulkki, V., “Spatial sound reproduction with directional audio coding,” Journal of the Audio Engineering Society, 55(6), pp. 503–516, 2007.

[14] Berge, S. and Barrett, N., “High angular resolution planewave expansion,” in Proc. of the 2nd International Symposium on Ambisonics and Spherical Acoustics, pp. 6–7, May 2010.

[15] Wabnitz, A., Epain, N., McEwan, A., and Jin, C., “Upscaling ambisonic sound scenes using compressed sensing techniques,” in Applications of Signal Processing to Audio and Acoustics (WASPAA), 2011 IEEE Workshop on, pp. 1–4, IEEE, 2011.

[16] Politis, A., Vilkamo, J., and Pulkki, V., “Sector-based parametric sound field reproduction in the spherical harmonic domain,” IEEE Journal of Selected Topics in Signal Processing, 9(5), pp. 852–866, 2015.

[17] Politis, A., McCormack, L., and Pulkki, V., “Enhancement of ambisonic binaural reproduction using directional audio coding with optimal adaptive mixing,” in Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017 IEEE Workshop on, pp. 379–383, IEEE, 2017.

[18] Politis, A., Tervo, S., and Pulkki, V., “COMPASS: Coding and Multidirectional Parameterization of Ambisonic Sound Scenes,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6802–6806, IEEE, 2018.

[19] Vilkamo, J. and Backstrom, T., “Time–Frequency Processing: Methods and Tools,” in Parametric Time-Frequency Domain Spatial Audio, pp. 3–23, John Wiley & Sons, 2018.

[20] Hanson, R. J., Krogh, F. T., and Lawson, C., “A proposal for standard linear algebra subprograms,” 1973.

[21] Anderson, E., Bai, Z., Dongarra, J., Greenbaum, A., McKenney, A., Du Croz, J., Hammarling, S., Demmel, J., Bischof, C., and Sorensen, D., “LAPACK: A portable linear algebra library for high-performance computers,” in Proceedings of the 1990 ACM/IEEE Conference on Supercomputing, pp. 2–11, IEEE Computer Society Press, 1990.

[22] Bernschütz, B., Giner, A. V., Pörschmann, C., and Arend, J., “Binaural reproduction of plane waves with reduced modal order,” Acta Acustica united with Acustica, 100(5), pp. 972–983, 2014.

[23] McCormack, L. and Välimäki, V., “FFT-based dynamic range compression,” in Sound and Music Computing Conference, 2017.

[24] McCormack, L., Delikaris-Manias, S., Farina, A., Pinardi, D., and Pulkki, V., “Real-Time Conversion of Sensor Array Signals into Spherical Harmonic Signals with Applications to Spatially Localized Sub-Band Sound-Field Analysis,” in Audio Engineering Society Convention 144, Audio Engineering Society, 2018.


[25] Teutsch, H., Modal array signal processing: principles and applications of acoustic wavefield decomposition, Springer, 2007.

[26] Williams, E. G., Fourier acoustics: sound radiation and nearfield acoustical holography, Academic Press, 1999.

[27] Bernschütz, B., Pörschmann, C., Spors, S., and Weinzierl, S., “Soft-Limiting der modalen Amplitudenverstärkung bei sphärischen Mikrofonarrays im Plane Wave Decomposition Verfahren” [Soft-limiting of the modal amplification of spherical microphone arrays in the plane wave decomposition method], in Proceedings of the 37. Deutsche Jahrestagung für Akustik (DAGA 2011), pp. 661–662, 2011.

[28] Zotter, F., “A linear-phase filter-bank approach to process rigid spherical microphone array recordings,” in Proc. IcETRAN, Palic, Serbia, 2018.

[29] Delikaris-Manias, S., McCormack, L., Huhtakallio, I., and Pulkki, V., “Real-time underwater spatial audio: a feasibility study,” in Audio Engineering Society Convention 144, Audio Engineering Society, 2018.

[30] González, R., Pearce, J., and Lokki, T., “Modular Design for Spherical Microphone Arrays,” in 2018 AES International Conference on Audio for Virtual and Augmented Reality, Audio Engineering Society, 2018.

[31] Bertet, S., Daniel, J., and Moreau, S., “3D sound field recording with higher order ambisonics – objective measurements and validation of spherical microphone,” in Audio Engineering Society Convention 120, Audio Engineering Society, 2006.

[32] Pulkki, V., “Virtual Sound Source Positioning Using Vector Base Amplitude Panning,” Journal of the Audio Engineering Society, 45(6), pp. 456–466, 1997.

[33] Pulkki, V., “Uniform spreading of amplitude panned virtual sources,” in Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 187–190, IEEE, 1999.

[34] Laitinen, M.-V., Vilkamo, J., Jussila, K., Politis, A., and Pulkki, V., “Gain normalization in amplitude panning as a function of frequency and room reverberance,” in Audio Engineering Society Conference: 55th International Conference: Spatial Audio, Audio Engineering Society, 2014.

[35] McCormack, L., Delikaris-Manias, S., and Pulkki, V., “Parametric acoustic camera for real-time sound capture, analysis and tracking,” in Proceedings of the 20th International Conference on Digital Audio Effects (DAFx-17), pp. 412–419, 2017.

[36] Jarrett, D. P., Habets, E. A., and Naylor, P. A., “3D source localization in the spherical harmonic domain using a pseudointensity vector,” in EUSIPCO, 2010.

[37] Van Trees, H. L., Optimum array processing: Part IV of detection, estimation, and modulation theory, John Wiley & Sons, 2004.

[38] Delikaris-Manias, S. and Pulkki, V., “Cross pattern coherence algorithm for spatial filtering applications utilizing microphone arrays,” IEEE Transactions on Audio, Speech, and Language Processing, 21(11), pp. 2356–2367, 2013.

[39] Fahy, F. J. and Salmon, V., “Sound intensity,” The Journal of the Acoustical Society of America, 88(4), pp. 2044–2045, 1990.

[40] Kronlachner, M., Spatial transformations for the alteration of ambisonic recordings, Master's thesis, Institute of Electronic Music and Acoustics, University of Music and Performing Arts Graz, Graz, Austria, 2014.

[41] Politis, A., Pihlajamäki, T., and Pulkki, V., “Parametric spatial audio effects,” in Proceedings of the 15th International Conference on Digital Audio Effects (DAFx-12), York, UK, September 2012.

[42] Kolundzija, M. and Faller, C., “Advanced B-Format Analysis,” in Audio Engineering Society Convention 144, Audio Engineering Society, 2018.
