
BINAURAL AUDIO SOURCE REMIXING WITH MICROPHONE ARRAY LISTENING DEVICES

Ryan M. Corey and Andrew C. Singer

University of Illinois at Urbana-Champaign

ABSTRACT

Augmented listening devices, such as hearing aids and augmented reality headsets, enhance human perception by changing the sounds that we hear. Microphone arrays can improve the performance of listening systems in noisy environments, but most array-based listening systems are designed to isolate a single sound source from a mixture. This work considers a source-remixing filter that alters the relative level of each source independently. Remixing rather than separating sounds can help to improve perceptual transparency: it causes less distortion to the signal spectrum and especially to the interaural cues that humans use to localize sounds in space.

Index Terms— Microphone array processing, hearing aids, augmented reality, beamforming, audio source separation

1. INTRODUCTION

Audio signal processing can be used to change the way that humans experience the world around them. Augmented listening (AL) devices, such as hearing aids and augmented reality headsets [1], enhance human hearing by altering the sounds presented to a listener. One of the most important functions of listening devices is to help humans to hear better in noisy environments with many competing sound sources. Unfortunately, many listening devices today perform poorly in noisy environments where users need them most.

One reliable way to reduce unwanted noise and improve intelligibility in noisy situations is with microphone array processing. Arrays of several spatially separated microphones can be used to perform beamforming and isolate sounds from a particular direction [2]. Beamformers are widely used in machine listening systems, for example for automatic speech recognition, to preprocess a sound source of interest. For decades, researchers have tried to incorporate similar beamforming technology into human listening devices [3–6]. Most past studies have taken the same approach used in machine listening: steering a beam to isolate a single sound source of interest and attenuate all others. Although this approach has been shown to improve intelligibility in noise [3, 4], listening devices using large arrays have never been commercially successful.

The human auditory system is different from a machine listening algorithm. Isolating a single sound source, while it might improve intelligibility, would seem unnatural to the listener and it could interfere with the auditory system’s natural source separation and scene analysis capabilities. Humans rely on spectral and temporal patterns as well as spatial cues, such as interaural time and level differences, to distinguish between sound sources and remain aware of their environment [7, 8]. Single-target beamformers can distort the spectral and spatial cues of all non-target sound sources [9].

This material is based upon work supported by the National Science Foundation under Grant No. 1919257.

Fig. 1: The proposed source-remixing system uses a microphone array and space-time filters to apply separate processing to many sound sources, altering the auditory perception of the listener.

Beamformers can be designed to preserve the interaural cues of multiple sources by deliberately including components of non-target sources in the filter output [10–12]. Early binaural beamformers simply added a fraction of the unprocessed signal into the output [13–15], while more recent methods apply explicit constraints to the filter’s response to those signals [16, 17]. It is also possible to constrain only the interaural cues and not the spectral distortion of the sources [18, 19]. In general, the more the background sources are preserved, the less their interaural cues are distorted [15].

In this work, we approach augmented listening as a source remixing problem. Rather than trying to isolate a single sound source, we use a microphone array to apply different processing to many sound sources, as shown in Fig. 1. The system is analogous to the mixing process in a music or film studio, where each instrument, dialogue track, and sound effect in the mixture is individually adjusted to provide a pleasant listening experience. The processing applied to each source depends on the type of sound, the acoustic environment, and the listener’s preferences. For example, a normal-hearing user in a quiet room might prefer no processing at all, while a hearing-impaired user in a noisy restaurant could use aggressive noise reduction similar to single-target beamforming. A similar remixing approach has been proposed for television broadcasts with object-based audio encodings: listeners can tune the relative levels of sources to trade off between immersiveness and intelligibility [20, 21]. Using a powerful microphone array, we could allow listeners to perform that same tuning for real-world sounds.

In this work, we consider the choice of relative levels of sound sources in a source-remixing filter. Further perceptual research and clinical studies will be needed to understand how to select these levels to optimize the listening experience for different AL applications. In the meantime, we can characterize the engineering tradeoffs of such a system. Intuitively, the less we try to separate the sound sources—that is, the more transparent the listening experience—the easier it should be to implement the remixing system and the more natural it should sound to the listener. Studies of musical source separation have shown that remixing can reduce unpleasant artifacts and distortion compared to complete separation [22]. Applying independent processing to different sound sources can also reduce the distortion introduced by nonlinear processing algorithms such as dynamic range compression [23]. Here we focus on the benefits of remixing for spectral and spatial distortion in a linear time-invariant space-time filter. We show that the less the mixture is altered, the lower the resulting distortion.

2. SPACE-TIME FILTER FOR SOURCE REMIXING

2.1. Signal model

Consider an array of M microphones. By convention, it is assumed that microphone 1 is in or near the left ear and microphone 2 is in or near the right ear; these act as reference microphones for the purposes of preserving interaural cues. The mixture is assumed to include N source channels. A source channel is defined as a set of sounds that are to be processed as a group, that is, for which the desired response of the system is the same. A source channel could be an individual talker, a group of talkers, diffuse noise, or all sounds of a certain type, for example.

Let x(t) ∈ R^M be the vector of continuous-time signals observed by the microphones. It is the sum of N source images c_n(t) due to the source channels:

x(t) = \sum_{n=1}^{N} c_n(t).    (1)

Let y(t) ∈ R^2 be the desired output of the system. By convention, channel 1 is output to the left ear and channel 2 is output to the right ear. For the purposes of this work, assume that the desired processing to be applied to each source channel n is linear and time-invariant so that it can be characterized by an impulse response matrix g_n(τ) ∈ R^{2×M}. The desired output is therefore

y(t) = \sum_{n=1}^{N} \underbrace{\int_{-\infty}^{\infty} g_n(τ) \, c_n(t - τ) \, dτ}_{d_n(t)}.    (2)

The signal d_n(t) is the desired output image for source channel n. In a perceptually transparent listening system, the listener should perceive d_n(t) as a natural sound that replaces c_n(t). To ensure that d_n(t) sounds natural, it should have a similar frequency spectrum to c_n(t)—or a deliberately altered spectrum, for example to amplify high frequencies for hearing-impaired listeners—and it should be perceived as coming from the same direction. It should also have imperceptibly low delay, which limits the amount of frequency-selective processing that can be applied [24].

Because this work is designed to highlight interaural cue preservation, we restrict our attention to desired responses of the form

g_n(τ) = g_n(τ) \, [e_1 \; e_2]^T,    (3)

where e_m is the unit vector with value 1 in position m and 0 elsewhere. In other words, the desired output in the left (respectively right) ear is the source signal as observed by the left (respectively right) reference microphone and processed by a diotic impulse response g_n(τ). If this desired response could be achieved exactly, the same processing would be applied to the signals in both ears and the interaural cues of all source channels would be perfectly preserved.
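As a concrete illustration, the binaurally matched response of (3) can be evaluated numerically at a single frequency. The following is a minimal sketch in Python/NumPy, assuming the convention above that microphones 1 and 2 (array indices 0 and 1 here) are the left and right references; the function and argument names are illustrative, not part of the paper.

import numpy as np

def desired_response(g_n, n_mics, left_ref=0, right_ref=1):
    """Binaurally matched desired response of Eq. (3) at one frequency.

    g_n is the scalar diotic response G_n(Omega). The returned 2 x M matrix
    routes the left reference microphone to the left output and the right
    reference microphone to the right output, both scaled by g_n.
    """
    G_n = np.zeros((2, n_mics), dtype=complex)
    G_n[0, left_ref] = g_n
    G_n[1, right_ref] = g_n
    return G_n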

2.2. Space-time filtering

The output \hat{y}(t) ∈ R^2 of the system is produced by a linear time-invariant space-time filter w(τ) ∈ R^{2×M} such that

\hat{y}(t) = \int_{-\infty}^{\infty} w(τ) \, x(t - τ) \, dτ    (4)

         = \sum_{n=1}^{N} \underbrace{\int_{-\infty}^{\infty} w(τ) \, c_n(t - τ) \, dτ}_{\hat{d}_n(t)}.    (5)

The signal \hat{d}_n(t) ∈ R^2 is the output image for source channel n; it is the processed version of the original source image c_n(t) perceived by the listener.

In an AL system, sound signals must be processed in real time with a total delay of no more than a few milliseconds to avoid disturbing distortion. Thus, w(τ) should be a causal filter that predicts a possibly delayed version of y(t). It has been shown that such a constraint limits the performance of a space-time filter, especially for small arrays [24]. Because space-time filters are most conveniently studied in the frequency domain, w(τ) is allowed to be noncausal in our mathematical analysis, but the discrete-time filters implemented in Sec. 4 are causal with realistic delay constraints.
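For reference, the discrete-time counterpart of the filtering in (4) is a sum of per-microphone convolutions. Below is a minimal sketch assuming causal FIR taps stored as a (2, M, L) array; the names are illustrative and not from the paper.

import numpy as np

def apply_space_time_filter(w, x):
    """Apply a two-output FIR space-time filter to microphone signals.

    w : array of shape (2, M, L), FIR taps from each microphone to each ear
    x : array of shape (M, T), microphone signals
    Returns the binaural output of shape (2, T).
    """
    n_out, n_mics, _ = w.shape
    T = x.shape[1]
    y = np.zeros((n_out, T))
    for i in range(n_out):
        for m in range(n_mics):
            # Convolve microphone m with its tap vector and accumulate.
            y[i] += np.convolve(x[m], w[i, m])[:T]
    return y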

2.3. Weighted remixing filter

We would like to choose w(τ) so that \hat{d}_n(t) ≈ d_n(t) for all n. To make the space-time filter as flexible as possible, we will use the multiple-speech-distortion-weighted multichannel Wiener filter (MSDW-MWF) [25]. The MSDW-MWF is a generalization of the well-known speech-distortion-weighted multichannel Wiener filter (SDW-MWF) [26], which allows the system designer to trade noise for spectral distortion of a single target source. The MSDW-MWF minimizes the following weighted squared-error cost function:

Cost = \sum_{n=1}^{N} λ_n \, E\left[ \left\| \hat{d}_n(t) - d_n(t) \right\|^2 \right],    (6)

where E denotes statistical expectation and λ_n is a distortion weight that controls the relative importance of each sound source.

If the source images are statistically uncorrelated with each other, then the noncausal MSDW-MWF is given by the frequency-domain expression

W(Ω) = \left( \sum_{n=1}^{N} λ_n G_n(Ω) R_{c_n}(Ω) \right) \Big( \underbrace{ \textstyle\sum_{n=1}^{N} λ_n R_{c_n}(Ω) }_{R_x(Ω)} \Big)^{-1},    (7)

where G_n(Ω) is the Fourier transform of g_n(τ) and R_{c_n}(Ω) is the power spectral density of source image c_n(t) for n = 1, . . . , N, which must be measured or estimated. It is assumed that R_x(Ω) is invertible, which can be achieved by including a diffuse noise channel with full-rank spectral density. Many commonly used space-time filters can be derived from the MSDW-MWF: the multichannel Wiener filter, which minimizes mean squared error, is a special case with λ_n = 1 for all n. Linearly constrained filters such as the minimum variance distortionless response beamformer are limiting cases as the distortion weights approach infinity [2].
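The filter in (7) is computed independently at each frequency from the per-channel spectral densities. A minimal sketch, assuming the desired responses G_n and spectral density matrices R_{c_n} are already available at a single frequency (all names here are illustrative):

import numpy as np

def msdw_mwf(G, R_c, lam):
    """Noncausal MSDW-MWF of Eq. (7) at one frequency.

    G   : list of N desired-response matrices, each 2 x M
    R_c : list of N source-image spectral density matrices, each M x M
    lam : length-N sequence of distortion weights lambda_n
    Returns the 2 x M filter W(Omega).
    """
    n_mics = R_c[0].shape[0]
    num = np.zeros((2, n_mics), dtype=complex)        # sum_n lambda_n G_n R_cn
    R_x = np.zeros((n_mics, n_mics), dtype=complex)   # weighted sum labeled R_x in (7)
    for G_n, R_cn, lam_n in zip(G, R_c, lam):
        num += lam_n * (G_n @ R_cn)
        R_x += lam_n * R_cn
    return num @ np.linalg.inv(R_x)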

The remainder of the analysis will be performed in the frequency domain and we omit the frequency variable Ω for brevity.


2.4. Spectral distortion of the remixing filter

The following identity will be useful in analyzing the performance of the MSDW-MWF:

W = \sum_{m=1}^{N} λ_m G_m R_{c_m} R_x^{-1}    (8)

  = G_n + \sum_{m=1}^{N} λ_m (G_m - G_n) R_{c_m} R_x^{-1}.    (9)

If the filter is computed using the true source statistics, then the weighted error spectral density of the MSDW-MWF is given by

R_{err} = \sum_{n=1}^{N} λ_n (G_n - W) R_{c_n} (G_n - W)^H    (10)

        = \sum_{n=1}^{N} \sum_{m=1}^{N} λ_n λ_m (G_n - G_m) R_{c_m} R_x^{-1} R_{c_n} (G_n - W)^H    (11)

        = \sum_{n=1}^{N} \sum_{m=1}^{N} λ_n λ_m (G_n - G_m) R_{c_m} R_x^{-1} R_{c_n} G_n^H,    (12)

where the last step follows from the orthogonality principle. Thus, the performance of the remixing filter can be expressed in terms of pairs of source channels. It is clear from this expression that if G_n = G_m, then the (m, n) source pair contributes nothing to the error of the system: if we wish to apply the same processing to both sources, then we need not separate them.

If each desired response has the form of (3) and if the Fourier transform G_n(Ω) of g_n(τ) is real-valued for all n = 1, . . . , N, then by manipulating (12) the weighted error spectra in the left and right ears can be written

R_{err}^{left} = \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} λ_n λ_m \left| G_n - G_m \right|^2 e_1^T R_{c_m} R_x^{-1} R_{c_n} e_1    (13)

R_{err}^{right} = \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} λ_n λ_m \left| G_n - G_m \right|^2 e_2^T R_{c_m} R_x^{-1} R_{c_n} e_2.    (14)

Thus, the performance of the remixing filter depends on the differences between desired responses and on the spatial and spectral separability of the signal pairs.
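To make the pairwise structure of (13) concrete, the left-ear weighted error spectrum at a single frequency could be evaluated as below. This is a minimal sketch under the same assumptions as (13), with scalar real desired responses and known spectral densities; the names are illustrative. Because the (m, n) and (n, m) terms are complex conjugates, only the real part of each term contributes to the sum.

import numpy as np

def left_ear_error_spectrum(G, lam, R_c, R_x):
    """Weighted left-ear error spectrum of Eq. (13) at one frequency.

    G   : length-N array of scalar desired responses G_n
    lam : length-N array of distortion weights lambda_n
    R_c : list of N source spectral density matrices, each M x M
    R_x : M x M weighted mixture spectral density, assumed invertible
    """
    R_x_inv = np.linalg.inv(R_x)
    e1 = np.zeros(R_x.shape[0])
    e1[0] = 1.0                       # left reference microphone
    err = 0.0
    for n in range(len(G)):
        for m in range(len(G)):
            pair = e1 @ R_c[m] @ R_x_inv @ R_c[n] @ e1
            err += 0.5 * lam[n] * lam[m] * abs(G[n] - G[m]) ** 2 * np.real(pair)
    return err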

3. DISTORTION OF INTERAURAL CUES

The interaural level difference (ILD) and interaural phase difference (IPD) can both be derived from the interaural transfer function (ITF). If the source image c_n(t) for channel n has a Fourier transform C_n(Ω), then the input and output ITFs for source channel n are

ITF_n^{in} = \frac{e_2^T C_n}{e_1^T C_n}    (15)

ITF_n^{out} = \frac{e_2^T \hat{D}_n}{e_1^T \hat{D}_n} = \frac{e_2^T W C_n}{e_1^T W C_n}.    (16)

The ILD and IPD are the magnitude and phase of the ITF:

ILD_n = 20 \log_{10} \left| ITF_n \right| \quad \text{and} \quad IPD_n = ∠\, ITF_n.    (17)
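In code, these cues reduce to the ratio of the two reference-microphone spectra. A minimal sketch, assuming C_left and C_right are the complex spectra of one source image at the left and right reference microphones (illustrative names):

import numpy as np

def interaural_cues(C_left, C_right):
    """ILD (dB) and IPD (radians) from the ITF of Eqs. (15)-(17)."""
    itf = C_right / C_left                # ITF: right reference over left reference
    ild = 20.0 * np.log10(np.abs(itf))    # interaural level difference
    ipd = np.angle(itf)                   # interaural phase difference
    return ild, ipd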

Fig. 2: Five loudspeakers were placed around an acoustically treated laboratory. A wearable array includes up to 16 microphones.

3.1. Spatial distortion of the MSDW-MWF

The output ITF for the MSDW-MWF (9) with binaurally matched responses (3) is

ITF_n^{out} = \frac{ G_n e_2^T C_n + \sum_{m=1}^{N} λ_m (G_m - G_n) \, e_2^T R_{c_m} R_x^{-1} C_n }{ G_n e_1^T C_n + \sum_{m=1}^{N} λ_m (G_m - G_n) \, e_1^T R_{c_m} R_x^{-1} C_n }    (18)

for n = 1, . . . , N. Notice that if the second terms in the numerator and denominator were removed, the output ITF would be identical to the input ITF. This would be the case if the same processing were applied to every source channel so that all G_m - G_n = 0 or if the sources were fully separable so that R_{c_m} R_x^{-1} C_n = 0 for m ≠ n. The errors in the ILD and IPD are the real and imaginary parts, respectively, of the logarithm of ITF_n^{out} / ITF_n^{in}. If G_n and ITF_n^{in} are nonzero, then the ITF error for source channel n can be written

∆ITF_n = \ln \frac{ 1 + \sum_{m=1}^{N} λ_m \frac{G_m - G_n}{G_n} \frac{e_2^T R_{c_m} R_x^{-1} C_n}{e_2^T C_n} }{ 1 + \sum_{m=1}^{N} λ_m \frac{G_m - G_n}{G_n} \frac{e_1^T R_{c_m} R_x^{-1} C_n}{e_1^T C_n} }.    (19)

3.2. First-order approximation

Using the first-order approximation ln(1 + u) ≈ u for both the numerator and denominator of (19), the logarithmic ITF error is

∆ITF_n ≈ \sum_{m=1}^{N} λ_m \frac{G_m - G_n}{G_n} \left( \frac{e_2^T R_{c_m} R_x^{-1} C_n}{e_2^T C_n} - \frac{e_1^T R_{c_m} R_x^{-1} C_n}{e_1^T C_n} \right).    (20)

Furthermore, if diffuse noise were negligible and every source channel were well modeled by a rank-1 spectral density matrix R_{c_n} ≈ R_{s_n} A_n A_n^H with C_n parallel to the acoustic transfer function A_n for n = 1, . . . , N, then the ITF error would be

∆ITF_n ≈ \sum_{m=1}^{N} λ_m R_{s_m} \frac{G_m - G_n}{G_n} A_m^H R_x^{-1} A_n \left( \frac{e_2^T A_m}{e_2^T A_n} - \frac{e_1^T A_m}{e_1^T A_n} \right).    (21)

Thus, spatial distortion depends on the power and distortion weight of each interfering source, the relative difference in desired responses between sources, the spatial separability of the sources, and the difference in interaural cues between source channels. Distant sound sources that have different acoustic transfer functions will be easier to separate (small A_m^H R_x^{-1} A_n) but also have more different interaural cues (large e_2^T A_m / e_2^T A_n - e_1^T A_m / e_1^T A_n). Meanwhile, if A_m is parallel to A_n, the source channels have the same interaural cues and therefore the source pair does not introduce spatial distortion.

Fig. 3: Interaural cue preservation for a binaural source-remixing filter with 4 microphones and different desired responses. [Plots of mean ILD error (dB) and mean IPD error (radians) versus frequency (Hz) for the mild, aggressive, and single-target responses.]

4. EXPERIMENTS WITH A WEARABLE ARRAY

To evaluate the performance of the source-remixing system for augmented listening applications, it was applied to a wearable microphone array in a challenging noisy environment. Sixteen microphones were spread across the body of a mannequin, as shown in Fig. 2. One microphone was placed just outside each ear canal. Although this mannequin does not have realistic head-related transfer functions, it is suitable for this experiment because its head has contralateral attenuation that is only slightly weaker than that of a human [27]. The sound sources were five speech clips derived from the CSTR VCTK corpus [28] and played through loudspeakers as well as the diffuse mechanical and ventilation noise in the room.

Impulse response measurements and long-term average speech and noise spectra were used to design causal discrete-time MSDW-MWFs with a delay of 16 ms and a unit pulse response length of 256 ms. The experimental ITFs of the five speech sources were measured in the STFT domain using their sample cross-correlations [15]:

ITFinn [f ] =

∑k e

T1 Cstft,n[k, f ]CH

stft,n[k, f ]e2∑k e

T1 Cstft,n[k, f ]CH

stft,n[k, f ]e1(22)

ITFoutn [f ] =

∑k e

T1 Dstft,n[k, f ]DH

stft,n[k, f ]e2∑k e

T1 Dstft,n[k, f ]DH

stft,n[k, f ]e1

, (23)

for n = 1, . . . , 5. The ILD and IPD errors were computed from the ITFs and averaged over the five directional sources.
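A minimal sketch of the empirical estimate in (22), assuming the STFT coefficients of one source image are stored as an array of shape (frames, frequencies, microphones) with the left and right reference microphones at indices 0 and 1 (an illustrative convention); applying the same routine to the filtered output gives (23).

import numpy as np

def empirical_itf(C_stft):
    """Estimate the ITF of Eq. (22) from STFT frames of one source image."""
    left = C_stft[:, :, 0]
    right = C_stft[:, :, 1]
    # Numerator: sum over frames k of e_1^T C C^H e_2 = C_left * conj(C_right).
    num = np.sum(left * np.conj(right), axis=0)
    # Denominator: sum over frames k of e_1^T C C^H e_1 = |C_left|^2.
    den = np.sum(np.abs(left) ** 2, axis=0)
    return num / den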

Figure 3 shows the performance of earpieces with M = 4 total microphones for three sets of target responses, summarized as gain vectors in the sketch after the list:

1. Mild remixing with gains 1.0, 0.8, 0.7, 0.6, and 0.5 on the speech channels and 20 dB attenuation on the noise channel,

2. Aggressive remixing with gains 1.0, 0.4, 0.3, 0.2, and 0.1 on the speech channels and 20 dB attenuation on noise, and

3. Single-target beamforming with G_1 = 1 and G_2 = · · · = G_6 = 0.

Fig. 4: Interaural cue preservation for a binaural source-remixing filter with different array configurations. [Plots of mean ILD error (dB) and mean IPD error (radians) versus frequency (Hz) for M = 4, M = 8, and M = 16 microphones.]

The mild filter has low interaural cue distortion, the single-target beamformer severely distorts the non-target sources, and the aggressive filter falls in between. Distortion is mild below a few hundred hertz; these wavelengths are much larger than a human head and so the ILD and IPD of all sources are close to zero.

A larger wearable array should be able to apply complex remixing to more sources than a small earpiece-based array can. Figure 4 shows the ILD and IPD distortion for the “aggressive” remixing responses with arrays of different sizes. The four-microphone earpiece array does not have enough degrees of freedom to preserve the interaural cues of all five directional sources. The 8-microphone head-mounted array does better, and the 16-microphone upper-body array produces little distortion in any of the source channels.

5. CONCLUSIONS

The design of source-remixing filters requires a tradeoff between audio enhancement—removing and altering different sound sources to improve intelligibility—and perceptual transparency. Filters that alter the signal less, that is, that apply similar desired responses to the different source channels, cause less spectral distortion and less interaural cue distortion. They also sound more immersive and natural to the listener. However, they do not provide as much benefit in complex noise. Space-time remixing filters with large wearable microphone arrays could provide the advantages of both approaches: they have enough spatial resolution to meaningfully suppress strong background noise, but they have enough degrees of freedom to ensure that those attenuated noise sources sound natural.

A full understanding of source-remixing filter design tradeoffs will require new clinical research. The choice of desired responses will likely depend on the nature of the sources and environment and on the preferences of the individual. Compared to conventional beamforming and source separation, audio source remixing provides a more versatile approach to human sensory augmentation that could dramatically—but seamlessly—change how we perceive our world.


6. REFERENCES

[1] V. Valimaki, A. Franck, J. Ramo, H. Gamper, and L. Savioja, “Assisted listening using a headset: Enhancing audio perception in real, augmented, and virtual environments,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 92–99, 2015.

[2] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 25, no. 4, pp. 692–730, 2017.

[3] W. Soede, F. A. Bilsen, and A. J. Berkhout, “Assessment of a directional microphone array for hearing-impaired listeners,” The Journal of the Acoustical Society of America, vol. 94, no. 2, pp. 799–808, 1993.

[4] J. E. Greenberg and P. M. Zurek, Microphone-Array Hearing Aids, pp. 229–253. Berlin: Springer Berlin Heidelberg, 2001.

[5] S. Doclo, S. Gannot, M. Moonen, and A. Spriet, “Acoustic beamforming for hearing aid applications,” in Handbook on Array Processing and Sensor Networks (S. Haykin and K. R. Liu, eds.), pp. 269–302, Wiley, 2008.

[6] S. Doclo, W. Kellermann, S. Makino, and S. E. Nordholm, “Multichannel signal enhancement algorithms for assisted listening devices,” IEEE Signal Processing Magazine, vol. 32, no. 2, pp. 18–30, 2015.

[7] A. S. Bregman, Auditory scene analysis: The perceptual organization of sound. MIT Press, 1994.

[8] A. W. Bronkhorst, “The cocktail party phenomenon: A review of research on speech intelligibility in multiple-talker conditions,” Acta Acustica United with Acustica, vol. 86, no. 1, pp. 117–128, 2000.

[9] S. Doclo, T. J. Klasen, T. Van den Bogaert, J. Wouters, and M. Moonen, “Theoretical analysis of binaural cue preservation using multi-channel Wiener filtering and interaural transfer functions,” in International Workshop on Acoustic Echo and Noise Control (IWAENC), 2006.

[10] T. Van den Bogaert, Preserving Binaural Cues in Noise Reduction Algorithms for Hearing Aids. PhD thesis, KU Leuven, June 2008.

[11] D. Marquardt, Development and evaluation of psychoacoustically motivated binaural noise reduction and cue preservation techniques. PhD thesis, Carl von Ossietzky University of Oldenburg, 2016.

[12] A. Koutrouvelis, Multi-Microphone Noise Reduction for Hearing Assistive Devices. PhD thesis, Delft University of Technology, 2018.

[13] T. J. Klasen, T. Van den Bogaert, M. Moonen, and J. Wouters, “Binaural noise reduction algorithms for hearing aids that preserve interaural time delay cues,” IEEE Transactions on Signal Processing, vol. 55, no. 4, pp. 1579–1585, 2007.

[14] T. Van den Bogaert, S. Doclo, J. Wouters, and M. Moonen, “Speech enhancement with multichannel Wiener filter techniques in multimicrophone binaural hearing aids,” The Journal of the Acoustical Society of America, vol. 125, no. 1, pp. 360–371, 2009.

[15] B. Cornelis, S. Doclo, T. Van den Bogaert, M. Moonen, and J. Wouters, “Theoretical analysis of binaural multimicrophone noise reduction techniques,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 342–355, 2010.

[16] D. Marquardt, E. Hadad, S. Gannot, and S. Doclo, “Theoretical analysis of linearly constrained multi-channel Wiener filtering algorithms for combined noise reduction and binaural cue preservation in binaural hearing aids,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2384–2397, 2015.

[17] E. Hadad, S. Doclo, and S. Gannot, “The binaural LCMV beamformer and its performance analysis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 24, no. 3, pp. 543–558, 2016.

[18] E. Hadad, D. Marquardt, S. Doclo, and S. Gannot, “Theoretical analysis of binaural transfer function MVDR beamformers with interference cue preservation constraints,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 12, pp. 2449–2464, 2015.

[19] A. I. Koutrouvelis, R. C. Hendriks, J. Jensen, and R. Heusdens, “Improved multi-microphone noise reduction preserving binaural cues,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 460–464, 2016.

[20] J.-M. Jot, B. Smith, and J. Thompson, “Dialog control and enhancement in object-based audio systems,” in Audio Engineering Society Convention, 2015.

[21] B. Shirley, M. Meadows, F. Malak, J. Woodcock, and A. Tidball, “Personalized object-based audio for hearing impaired TV viewers,” Journal of the Audio Engineering Society, vol. 65, pp. 293–303, 2017.

[22] H. Wierstorf, D. Ward, R. Mason, E. M. Grais, C. Hummersone, and M. D. Plumbley, “Perceptual evaluation of source separation for remixing music,” in Audio Engineering Society Convention, 2017.

[23] R. M. Corey and A. C. Singer, “Dynamic range compression for noisy mixtures using source separation and beamforming,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017.

[24] R. M. Corey, N. Tsuda, and A. C. Singer, “Delay-performance tradeoffs in causal microphone array processing,” in International Workshop on Acoustic Signal Enhancement (IWAENC), 2018.

[25] S. Markovich-Golan, S. Gannot, and I. Cohen, “A weighted multichannel Wiener filter for multiple sources scenarios,” in IEEE Convention of Electrical & Electronics Engineers in Israel, 2012.

[26] S. Doclo, A. Spriet, J. Wouters, and M. Moonen, “Speech distortion weighted multichannel Wiener filtering techniques for noise reduction,” in Speech Enhancement (J. Benesty, S. Makino, and J. Chen, eds.), pp. 199–228, Springer, 2005.

[27] R. M. Corey, N. Tsuda, and A. C. Singer, “Acoustic impulse response measurements for wearable audio devices,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019.

[28] C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2017.