
Page 1: Title slide

Augmenting and Exploiting Auditory Perception for Complex Scene Analysis

Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. Electrical Eng., Columbia Univ., NY USA
[email protected] · http://labrosa.ee.columbia.edu/

Outline:
1. Human Auditory Scene Analysis
2. Computational Acoustic Scene Analysis
3. Virtual and Augmented Audio
4. Future Audio Analysis & Display

Page 2: 1. Human Auditory Scene Analysis

“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman’90)

• Now: Hearing as the model for machine perception

• Future: Machines to enhance human perception

Page 3: Auditory Scene Analysis

• Listeners organize sound mixtures into discrete perceived sources based on within-signal cues (audio + ...):
  - common onset + continuity
  - harmonicity
  - spatial, modulation, ...
  - learned “schema”

[Figure: spectrogram of reynolds-mcadams-dpwe.wav; frequency / kHz (0–4) vs. time / sec (0–4), level / dB (−40 to +30)]

Bregman ’90; Darwin & Carlyon ’95

Page 4: Spatial Hearing

• People perceive sources based on cues
  - spatial (binaural) cues: interaural time difference (ITD) and interaural level difference (ILD) (see the sketch below)

Blauert ’96

[Figure: left- and right-ear waveforms of shatr78m3 (2.2–2.235 s) showing the interaural delay; schematic of the path-length difference and head shadow (high frequencies) for an off-axis source]
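The ITD cue in the figure can be computed directly from two-channel audio. Below is a minimal sketch (not from the talk) that estimates the ITD by cross-correlating the two ear signals over a ±1 ms range; the signal names and search range are illustrative assumptions.

```python
# Minimal sketch: estimate interaural time difference (ITD) by
# cross-correlating left/right ear signals over a plausible lag range.
import numpy as np

def estimate_itd(left, right, sr, max_itd_s=1e-3):
    """Delay (s) of `right` relative to `left`
    (positive = the sound reached the left ear first)."""
    max_lag = int(round(max_itd_s * sr))
    xc = np.correlate(right, left, mode="full")
    center = len(left) - 1                       # index of zero lag
    xc = xc[center - max_lag: center + max_lag + 1]
    lags = np.arange(-max_lag, max_lag + 1)
    return lags[np.argmax(xc)] / sr

# toy check: right ear delayed by 0.5 ms relative to left
sr = 16000
noise = np.random.randn(sr)
d = int(0.0005 * sr)
left, right = noise, np.concatenate([np.zeros(d), noise[:-d]])
print(estimate_itd(left, right, sr))             # ~ +0.0005 s
```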

Page 5: Human Performance: Spatial Info

• Task: Coordinate Response Measure (CRM)
  - “Ready Baron go to green eight now”
  - 256 variants, 16 speakers
  - correct = color and number for “Baron”
• Accuracy as a function of spatial separation:
  - A, B same speaker

Brungart et al. ’02
[Audio example: crm-11737+16515.wav]

Page 6: Human Performance: Source Info

• CRM varying the level and voice character (same spatial location)
  - energetic vs. informational masking

Brungart et al. ’01

[Figure: “Monaural Informational Masking: Level Differences”; level difference can help in speech segregation; listen for the quieter talker]

Page 7: Human Hearing: Limitations

• Sensor number: just 2 ears
• Sensor location: short, horizontal baseline
• Sensor performance: local dynamic range
• Processing: attention & memory limits; integration time

[Figure: intensity / dB vs. log frequency, showing the absolute threshold and the masked threshold raised around a masking tone]

Page 8: 2. Computational Scene Analysis

• Central idea: segment time-frequency into sources based on perceptual grouping cues
  ... the principal cue is harmonicity (see the sketch below)

Brown & Cooke ’94; Okuno et al. ’99; Hu & Wang ’04 ...

[Diagram: input mixture → front end (signal features / maps: onset, period, freq. modulation over time-frequency) → object formation → grouping rules → discrete objects / source groups; i.e. segment, then group]
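As a concrete illustration of harmonicity-based grouping (a simplified sketch, not a reimplementation of the systems cited above), the following assumes a single voiced target and builds a binary time-frequency mask from a per-frame autocorrelation pitch estimate; all parameter values are assumptions.

```python
# Minimal sketch: harmonicity-based time-frequency grouping.
# A per-frame pitch is estimated by autocorrelation, then STFT bins
# near harmonics of that pitch are assigned to the voiced source.
import numpy as np

def harmonic_mask(x, sr, n_fft=1024, hop=256, fmin=80.0, fmax=400.0, tol=0.03):
    win = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    mask = np.zeros((len(freqs), n_frames), dtype=bool)
    for t in range(n_frames):
        frame = x[t * hop: t * hop + n_fft] * win
        # crude pitch estimate: autocorrelation peak in the plausible lag range
        ac = np.correlate(frame, frame, mode="full")[n_fft - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)
        lag = lo + np.argmax(ac[lo:hi])
        f0 = sr / lag
        # mark bins lying within a relative tolerance of any harmonic of f0
        harm_num = np.rint(freqs / f0)
        near = np.abs(freqs - harm_num * f0) < tol * f0
        mask[:, t] = near & (harm_num >= 1)
    return mask

# usage: mask = harmonic_mask(signal, 16000); apply it to the STFT of `signal`
```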

Page 9: Spatial Info: Microphone Arrays

• If interference is diffuse, can simply boost energy from the target direction
  - e.g. shotgun mic; delay-and-sum beamforming (see the sketch below)
  - off-axis spectral coloration
  - many variants: filter & sum, sidelobe cancellation ...

[Figure: delay-and-sum array schematic with element spacing D and per-element delays; beam patterns (0 to −40 dB) for wavelengths λ = 4D, 2D, and D; Δx = c · D]

Benesty, Chen, Huang ’08
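For illustration, here is a minimal delay-and-sum sketch assuming a uniform linear array with known element spacing; it applies fractional steering delays in the frequency domain so that the chosen direction adds coherently. Array geometry, sample rate, and parameter names are assumptions, not details from the talk.

```python
# Minimal sketch: time-domain delay-and-sum beamforming toward a
# steering angle, with fractional delays applied in the frequency domain.
import numpy as np

def delay_and_sum(x, sr, spacing_m, angle_deg, c=343.0):
    """Steer an M-mic linear array (x: shape (M, n_samples)) toward
    `angle_deg` (0 deg = broadside) and return the summed output."""
    n_mics, n = x.shape
    # per-mic arrival-time offsets for a plane wave from angle_deg
    mic_pos = np.arange(n_mics) * spacing_m
    delays = mic_pos * np.sin(np.deg2rad(angle_deg)) / c   # seconds
    delays -= delays.min()                                  # relative delays
    X = np.fft.rfft(x, axis=1)
    f = np.fft.rfftfreq(n, 1.0 / sr)
    # advance each channel by its delay so the target direction adds in phase
    phase = np.exp(2j * np.pi * f[None, :] * delays[:, None])
    y = np.fft.irfft(X * phase, n=n, axis=1)
    return y.mean(axis=0)

# usage: out = delay_and_sum(mic_signals, sr=16000, spacing_m=0.05, angle_deg=30)
```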

Page 10: Independent Component Analysis

• Separate “blind” combinations by maximizing independence of outputs (see the sketch below)

Bell & Sejnowski ’95; Smaragdis ’98

[Diagram: sources s1, s2 mixed through gains a11, a12, a21, a22 to give microphone signals m1, m2; unmixing weights adapted by gradient steps Δa ∝ −∂MutInfo/∂a. Plots: scatter of mix 1 vs. mix 2, and kurtosis vs. unmixing parameter, asking whether kurtosis can serve as a measure of independence]

kurt(y) = E[(y − µ)⁴] − 3   (excess kurtosis, for unit-variance y)
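To make the kurtosis criterion concrete, here is a toy two-source sketch (not the Bell & Sejnowski infomax algorithm): synthetic sources are mixed, the mixtures are whitened, and the unmixing rotation is chosen to maximize the absolute excess kurtosis defined above. Sources, mixing matrix, and parameters are illustrative assumptions.

```python
# Minimal sketch: two-source ICA by whitening the mixtures and searching
# rotation angles for the projection with maximum |excess kurtosis|.
import numpy as np

def kurt(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0                        # excess kurtosis

rng = np.random.default_rng(0)
s = np.vstack([np.sign(rng.standard_normal(5000)),      # binary +/-1 source
               rng.uniform(-1, 1, 5000)])               # uniform source (both non-Gaussian)
A = np.array([[1.0, 0.6], [0.4, 1.0]])                  # "unknown" mixing matrix
x = A @ s

# whiten the mixtures (zero mean, identity covariance)
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(x))
z = np.diag(d ** -0.5) @ E.T @ x

# after whitening, unmixing is just a rotation: pick the angle whose
# projection is most non-Gaussian
angles = np.linspace(0, np.pi, 360)
best = max(angles, key=lambda a: abs(kurt(np.cos(a) * z[0] + np.sin(a) * z[1])))
y1 = np.cos(best) * z[0] + np.sin(best) * z[1]
y2 = -np.sin(best) * z[0] + np.cos(best) * z[1]
print(abs(kurt(y1)), abs(kurt(y2)))                     # recovered components
```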

Page 11: Environmental Scene Analysis

• Find the pieces a listener would report

[Figure: spectrogram panels (f/Hz vs. time/s, level −70 to −40 dB) for a City sound mixture and extracted elements Noise1, Noise2/Click1, and Wefts 1–12; listener-identified events with report counts: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]

Ellis ’96

Page 12: “Superhuman” Speech Analysis

• IBM’s 2006 Iroquois speech separation system
  Key features:
  - detailed state combinations
  - large speech recognizer
  - exploits grammar constraints
  - 34 per-speaker models
• “Superhuman” performance ... in some conditions

Kristjansson, Hershey et al. ’06

[Chart: recognition accuracy (0–100%) vs. target-to-masker ratio (6, 3, 0, −3, −6, −9 dB) for same-gender speakers; conditions: SDL recognizer, no dynamics, acoustic dynamics, grammar dynamics, human listeners]

[Embedded image: first page of T. Kristjansson & J. Hershey, “High Resolution Signal Reconstruction,” ASRU 2003: a speech enhancement / robust recognition framework that exploits the harmonic structure of speech with a high-frequency-resolution model in the log-spectrum domain]

Page 13: Meeting Recorders

• Distributed mics in meeting room
• Between-mic correlations locate sources (see the sketch below)

Janin et al. ’03; Ellis & Liu ’04

[Figure: “mr-2000-11-02-1440: PZM xcorr lags”: scatter of between-mic cross-correlation lags (lag 1-2 / ms vs. lag 3-4 / ms) and inferred talker positions (x = mic) in room coordinates x, y, z / m]
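One common way to obtain such between-mic lags (a sketch under the assumption of a single dominant talker, not necessarily the exact method of Ellis & Liu ’04) is the phase transform, GCC-PHAT, which whitens the cross-spectrum so the correlation peak is sharp even in reverberant rooms.

```python
# Minimal sketch: between-mic time-lag estimation with GCC-PHAT.
import numpy as np

def gcc_phat_lag(a, b, sr, max_lag_s=0.02):
    """Delay (s) of channel b relative to channel a."""
    n = len(a) + len(b)
    A = np.fft.rfft(a, n)
    B = np.fft.rfft(b, n)
    R = np.conj(A) * B
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n)
    max_lag = int(max_lag_s * sr)
    # rearrange so index 0 corresponds to lag -max_lag .. +max_lag
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])
    return (np.argmax(np.abs(cc)) - max_lag) / sr
```

Pairwise lags from several mic pairs can then be combined (e.g. by least squares over candidate positions) to infer talker locations, as in the figure above.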

Page 14: Environmental Sound Classification

• Trained models using e.g. “texture” features (see the sketch below)
• Paradoxical results

[Diagram: texture-feature pipeline: sound → automatic gain control → mel filterbank (18 channels) → band envelopes → modulation energy via FFT into octave bins 0.5, 1, 2, 4, 8, 16 Hz (18 × 6); envelope statistics mean, var, skew, kurt (18 × 4); cross-band envelope correlations (318 samples); histogram]

Ellis, Zheng, McDermott ’11
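A rough sketch of texture-style features in the spirit of the diagram above (not the exact feature code of Ellis, Zheng & McDermott ’11): mel-band envelopes, their first four moments, and octave-band modulation energies. Filterbank size, frame parameters, and band edges are assumptions, and the AGC, cross-band correlation, and histogram stages are omitted for brevity.

```python
# Minimal sketch: mel-band envelope moments + octave-band modulation energy.
import numpy as np

def mel(f):  return 2595.0 * np.log10(1.0 + f / 700.0)
def imel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def texture_features(x, sr, n_bands=18, n_fft=512, hop=128):
    # STFT magnitude
    win = np.hanning(n_fft)
    frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop] * win
    spec = np.abs(np.fft.rfft(frames, axis=1)).T            # (freq, time)
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # triangular mel filterbank -> per-band envelopes
    edges = imel(np.linspace(mel(50.0), mel(sr / 2), n_bands + 2))
    fb = np.zeros((n_bands, len(freqs)))
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        fb[b] = np.clip(np.minimum((freqs - lo) / (mid - lo),
                                   (hi - freqs) / (hi - mid)), 0, None)
    env = fb @ spec                                          # (band, time)
    # per-band envelope moments: mean, var, skew, excess kurtosis
    mu, sd = env.mean(axis=1), env.std(axis=1)
    z = (env - mu[:, None]) / (sd[:, None] + 1e-9)
    moments = np.stack([mu, sd ** 2, (z ** 3).mean(axis=1),
                        (z ** 4).mean(axis=1) - 3], axis=1)  # (band, 4)
    # modulation energy in octave bins 0.5, 1, 2, 4, 8, 16 Hz
    env_sr = sr / hop
    mod = np.abs(np.fft.rfft(env - mu[:, None], axis=1)) ** 2
    mf = np.fft.rfftfreq(env.shape[1], 1.0 / env_sr)
    bins = [0.5, 1, 2, 4, 8, 16, 32]
    modE = np.stack([mod[:, (mf >= bins[i]) & (mf < bins[i + 1])].sum(axis=1)
                     for i in range(6)], axis=1)             # (band, 6)
    return np.concatenate([moments.ravel(), modE.ravel()])
```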

Page 15: Audio Lifelogs

• Body-worn continuous recording
• Long time windows for episode-scale segmentation, clustering, and classification

[Figure: audio lifelog timelines, 09:00–18:00, for 2004-09-13 and 2004-09-14, segmented and labeled into episodes such as preschool, cafe, lecture, office, outdoor, lab, group, meeting2, compmtg, with talker labels including Ron, Manuel, Arroyo?, Lesser, Mike, Sambarta?]

Lee & Ellis ’04

Page 16: Machines: Current Limitations

• Separating overlapping sources: blind source separation
• Separating individual events: segmentation
• Learning & classifying source categories: recognition of individual sounds and classes

Page 17: 3. Virtual and Augmented Audio

• Audio signals can be effectively spatialized by convolving with Head-Related Impulse Responses (HRIRs) (see the sketch below)
• Auditory localization also uses head-motion cues

Brown & Duda ’98

[Figure: HRIR_021 left- and right-ear impulse responses at 0° elevation as a function of time / ms and azimuth / deg]
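Spatialization by HRIR convolution is straightforward to sketch. The example below assumes you already have measured left/right HRIRs for the desired direction (e.g. from a public measured set); the array names are placeholders, not data from the talk.

```python
# Minimal sketch: spatialize a mono signal by convolving it with the
# head-related impulse response of each ear.
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Return an (n_samples, 2) stereo array for headphone playback."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    n = max(len(left), len(right))
    out = np.zeros((n, 2))
    out[:len(left), 0] = left
    out[:len(right), 1] = right
    # normalize to avoid clipping when written as 16-bit audio
    return out / (np.max(np.abs(out)) + 1e-9)

# usage (hypothetical HRIR arrays for 30 deg azimuth, 0 deg elevation):
# stereo = spatialize(signal, hrir_l_30az_0el, hrir_r_30az_0el)
```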

Page 18: Augmented Audio Reality

• Pass-through and/or mix-in

[Embedded image: excerpt from a DAFx-08 paper on an augmented reality audio (ARA) headset and mixer: insert earphones with external microphones; analog equalization restores the open-ear-canal quarter-wave resonance (roughly 2–4 kHz), damps the blocked-canal half-wave resonance (roughly 5–10 kHz), and compensates boosted low frequencies from leakage; low latency is essential to avoid comb filtering against sound leaking past the earpiece]

Härmä et al. ’04; Hearium ’12

[Embedded image: block diagram of an ARA system from a Journal of Electrical and Computer Engineering article: virtual sounds plus virtual acoustic processing feed an ARA mixer and hear-through headset, with binaural microphone signals optionally preprocessed for transmission to a distant user; discussion of analog vs. digital (DSP) equalization and the latency / comb-filtering trade-off]

QuietPro+

Hearium

Page 19: 4. Future Audio Analysis & Display

• Better Scene Analysis
  - overcoming the limitations of human hearing: sensors, geometry
• Challenges
  - fine source discrimination
  - modeling & classification (language ID?)
  - integrating through time: single location, sparse sounds

[Diagram: sound mixture → Object 1 percept, Object 2 percept, Object 3 percept]

Page 20: Ad-Hoc Mic Array

• Multiple sensors, real-time sharing, long-baseline beamforming
• Challenges
  - precise relative (dynamic) localization
  - precise absolute registration

Page 21: Sound Visualization

• Making acoustic information visible: “synesthesia”
• Challenges
  - source formation & classification
  - registration: sensors, display

O’Donovan et al. ’07

www.ultra-gunfirelocator.com

Page 22: Auditory Display

• Acoustic channel complements vision
  - acoustic alarms
  - verbal information
  - “zoomed” ambience
  - instant replay
• Challenges
  - information management & prioritization
  - maximally exploit perceptual organization

http://www.youtube.com/watch?v=v1uyQZNg2vE

Page 23: Summary

• Human Scene Analysis: spatial & source information
• Computational Scene Analysis: spatial & source information, world knowledge
• Augmented Audition: selective pass-through + insertion

Page 24: References

• A. Bregman, Auditory Scene Analysis, MIT Press, 1990.

• C. Darwin & R. Carlyon, “Auditory grouping” Hbk of Percep. & Cogn. 6: Hearing, 387–424, Academic Press, 1995.

• J. Blauert, Spatial Hearing (revised ed.), MIT Press, 1996.

• D. Brungart & B. Simpson, “The effects of spatial separation in distance on the informational and energetic masking of a nearby speech signal”, JASA 112(2), Aug. 2002.

• D. Brungart, B. Simpson, M. Ericson, K. Scott, “Informational and energetic masking effects in the perception of multiple simultaneous talkers,” JASA 110(5), Nov. 2001.

• G. Brown & M. Cooke, “Computational auditory scene analysis,” Comp. Speech & Lang. 8(4), 297–336, 1994.

• H. Okuno, T. Nakatani, T. Kawabata, “Listening to two simultaneous speeches,” Speech Communication 27, 299–310, 1999.

• G. Hu and D.L. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Tr. Neural Networks, 15(5), Sep. 2004.

• J. Benesty, J. Chen, Y. Huang, Microphone Array Signal Processing, Springer, 2008.

• A. Bell, T. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, vol. 7 no. 6, pp. 1129-1159, 1995.

• P. Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Intl. Workshop on Independence & Artificial Neural Networks, Tenerife, Feb. 1998.

• D. Ellis, “Prediction-Driven Computational Auditory Scene Analysis,” Ph.D. thesis, MIT EECS, 1996.

• T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath, “Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system,” Proc. Interspeech, 97–100, 2006.

• A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, C. Wooters, “The ICSI Meeting Corpus,” Proc. ICASSP-03, pp. I-364–367, 2003.

• D. Ellis and J. Liu, Speaker turn segmentation based on between-channel differences, NIST Meeting Recognition Workshop @ ICASSP, pp. 112-117, Montreal, May 2004.

• D. Ellis, X. Zheng, and J. McDermott, Classifying soundtracks with audio texture features, Proc. IEEE ICASSP, pp. 5880-5883, Prague, May 2011.

• D. Ellis and K.S. Lee, Minimal-Impact Audio-Based Personal Archives, First ACM workshop on Continuous Archiving and Recording of Personal Experiences CARPE-04, New York, pp. 39-47, Oct 2004.

• C. P. Brown, R. O. Duda, A structural model for binaural sound synthesis, IEEE Tr. Speech & Audio 6(5):476-488, 1998.

• A. Härmä, J. Jakka, M. Tikander, M. Karjalainen, T. Lokki, J. Hiipakka, G. Lorho, “Augmented reality audio for mobile and wearable appliances,” J. Audio Eng. Soc. 52(6): 618–639, 2004.

• A. O'Donovan, R. Duraiswami, J. Neumann, Microphone arrays as generalized cameras for integrated audio visual processing, IEEE CVPR, 1-8, 2007.
