Auditory Perception for Complex Scenes - Dan Ellis 2013-03-27 /23
1. Human Auditory Scene Analysis
2. Computational Acoustic Scene Analysis
3. Virtual and Augmented Audio
4. Future Audio Analysis & Display
Augmenting and Exploiting Auditory Perception for Complex Scene Analysis
Dan Ellis
Laboratory for Recognition and Organization of Speech and Audio
Dept. of Electrical Engineering, Columbia University, NY, USA
[email protected]  http://labrosa.ee.columbia.edu/
1. Human Auditory Scene Analysis
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?” (after Bregman’90)
• Now: Hearing as the model for machine perception
• Future: Machines to enhance human perception
Auditory Scene Analysis
• Listeners organize sound mixtures into discrete perceived sources based on within-signal cues (audio + ...)
  common onset + continuity
  harmonicity
  spatial, modulation, ...
  learned "schema"
[Figure: spectrogram of reynolds-mcadams-dpwe.wav; freq / kHz (0-4) vs. time / sec (0-4), level / dB (-40 to 30)]
Bregman '90; Darwin & Carlyon '95
Spatial Hearing
• People perceive source locations based on binaural cues: ITD (interaural time difference) and ILD (interaural level difference)
Blauert '96
[Figure: left/right waveforms of shatr78m3 around 2.2 s; the path length difference between the ears gives the ITD, and head shadow at high frequencies gives the ILD]
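The path-length-difference cue can be quantified with the classic Woodworth spherical-head approximation of ITD; this is a generic sketch, not from the slides, and the function name and 8.75 cm head radius are illustrative assumptions:

```python
import math

def itd_woodworth(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Woodworth spherical-head approximation of the interaural time difference:
    ITD = (a / c) * (theta + sin(theta)), with theta the azimuth in radians."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / c) * (theta + math.sin(theta))

# A source directly to one side (90 deg) gives an ITD near the classic ~0.65 ms maximum
itd_side = itd_woodworth(90.0)
```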
Human Performance: Spatial Info
• Task: Coordinate Response Measure
  "Ready Baron go to green eight now"
  256 variants, 16 speakers
  correct = color and number for "Baron"
• Accuracy as a function of spatial separation:
  A, B same speaker
Brungart et al. '02
[Audio example: crm-11737+16515.wav]
Human Performance: Source Info
• CRM varying the level and voice character (same spatial location)
  energetic vs. informational masking
Brungart et al. '01
Monaural Informational Masking
Level Differences
Level difference can help in speech segregation
Listen for the quieter talker!
Human Hearing: Limitations
• Sensor number: just 2 ears
• Sensor location: short, horizontal baseline
• Sensor performance: local dynamic range
• Processing: attention & memory limits, integration time
[Figure: masking; intensity / dB vs. log freq, showing a masking tone, the raised masked threshold around it, and the absolute threshold]
2. Computational Scene Analysis
• Central idea: segment time-frequency into sources based on perceptual grouping cues
  ... principal cue is harmonicity
Brown & Cooke '94; Okuno et al. '99; Hu & Wang '04; ...
[Diagram: input mixture → Front end → signal features (maps over time/freq: onset, period, freq. mod.) → Segment → Object formation → Group (grouping rules) → discrete objects / source groups]
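The harmonicity cue can be illustrated with a toy time-frequency mask that keeps only bins lying near harmonics of an estimated f0. This is a sketch, not any of the cited systems; the function name and the 4% tolerance are invented for illustration:

```python
import numpy as np

def harmonic_mask(freqs_hz, f0_hz, tol=0.04):
    """Binary mask keeping frequency bins within `tol` (relative) of a harmonic of f0."""
    n = np.round(freqs_hz / f0_hz)   # nearest harmonic number per bin
    n[n < 1] = 1
    return np.abs(freqs_hz - n * f0_hz) / (n * f0_hz) < tol

freqs = np.arange(0.0, 4000.0, 50.0)        # spectrogram bin centre frequencies
mask = harmonic_mask(freqs, f0_hz=200.0)    # keep bins near multiples of 200 Hz
```

Applied to a spectrogram frame, such a mask passes the harmonic energy of a 200 Hz voice while suppressing off-harmonic interference.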
Spatial Info: Microphone Arrays
• If interference is diffuse, can simply boost energy from the target direction
  e.g. shotgun mic: delay-and-sum
  off-axis spectral coloration
  many variants: filter-and-sum, sidelobe cancellation, ...
[Figure: delay-and-sum array with element spacing D and per-element delays; directional response shown for wavelengths λ = 4D, 2D, D]
Benesty, Chen, Huang ’08
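Delay-and-sum beamforming can be sketched in a few lines: align each microphone by its steering delay, then average, so the target adds coherently while diffuse noise averages down. This is a toy time-domain version with whole-sample delays; all names and values are illustrative:

```python
import numpy as np

def delay_and_sum(mic_signals, delays_samples):
    """Align each mic by its (whole-sample) steering delay, then average.
    Real designs use fractional delays or full filter-and-sum weights."""
    aligned = [np.roll(x, -d) for x, d in zip(mic_signals, delays_samples)]
    return np.mean(aligned, axis=0)

fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
rng = np.random.default_rng(0)
# Two mics: the second hears the target 3 samples later; noise is independent per mic
mics = [target + 0.5 * rng.standard_normal(fs),
        np.roll(target, 3) + 0.5 * rng.standard_normal(fs)]
out = delay_and_sum(mics, delays_samples=[0, 3])
# Coherent target adds in phase; incoherent noise power is roughly halved
```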
Independent Component Analysis
• Separate “blind” combinations by maximizing independence of outputs
Bell & Sejnowski '95; Smaragdis '98
[Diagram: sources s1, s2 mixed via gains a11, a12, a21, a22 into mics m1, m2; unmixing adapted by gradient ascent, Δa ∝ −∂MutInfo/∂a]
[Figure: scatter of mix 1 vs. mix 2, and kurtosis vs. unmixing angle, peaking at the source directions s1, s2 (kurtosis as a measure of independence?)]

kurt(y) = E[((y − µ) / σ)⁴] − 3
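A minimal sketch of kurtosis-driven separation (illustrative only, not the cited Bell & Sejnowski or Smaragdis algorithms): whiten two observed mixtures, then brute-force the unmixing rotation angle that maximizes output kurtosis, assuming super-Gaussian sources:

```python
import numpy as np

def kurtosis(y):
    """Excess kurtosis: kurt(y) = E[((y - mu) / sigma)^4] - 3; zero for a Gaussian."""
    z = (y - y.mean()) / y.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 20000))             # two independent super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])       # "unknown" mixing matrix
x = A @ s                                    # observed mixtures

# Whiten the mixtures, then scan rotation angles for maximum |kurtosis| of output 1
evals, evecs = np.linalg.eigh(np.cov(x))
z = np.diag(evals ** -0.5) @ evecs.T @ x
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
best = max(angles, key=lambda th: abs(kurtosis(np.cos(th) * z[0] + np.sin(th) * z[1])))
y1 = np.cos(best) * z[0] + np.sin(best) * z[1]   # recovers one source up to scale/sign
```

For whitened data, the kurtosis of a rotated output peaks exactly when the rotation lines up with one of the independent sources, which is the curve the slide's "kurtosis vs. angle" plot shows.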
Environmental Scene Analysis
• Find the pieces a listener would report
[Figure: analysis of a "City" scene (f/Hz vs. time/s, level in dB): Noise1; Noise2, Click1; Wefts 1-4, 5, 6-7, 8, 9-12; and objects with listener agreement: Horn1 (10/10), Crash (10/10), Horn2 (5/10), Truck (7/10), Horn3 (5/10), Squeal (6/10), Horn4 (8/10), Horn5 (10/10)]
Ellis '96
“Superhuman” Speech Analysis
• IBM’s 2006 Iroquois speech separation system
  Key features:
  detailed state combinations
  large speech recognizer
  exploits grammar constraints
  34 per-speaker models
• “Superhuman” performance ... in some conditions
Kristjansson, Hershey et al. '06
[Figure: accuracy (0-100%) vs. SNR (6 dB to −9 dB) for same-gender speakers; SDL recognizer with no dynamics, acoustic dynamics, and grammar dynamics, compared to human listeners]
HIGH RESOLUTION SIGNAL RECONSTRUCTION
Trausti Kristjansson, Machine Learning and Applied Statistics, Microsoft Research
John Hershey, Machine Perception Lab, University of California, San Diego
ABSTRACT

We present a framework for speech enhancement and robust speech recognition that exploits the harmonic structure of speech. We achieve substantial gains in signal-to-noise ratio (SNR) of enhanced speech as well as considerable gains in accuracy of automatic speech recognition in very noisy conditions.

The method exploits the harmonic structure of speech by employing a high frequency resolution speech model in the log-spectrum domain and reconstructs the signal from the estimated posteriors of the clean signal and the phases from the original noisy signal.

We achieve a gain in signal-to-noise ratio of 8.38 dB for enhancement of speech at 0 dB. We also present recognition results on the Aurora 2 data set. At 0 dB SNR, we achieve a reduction of relative word error rate of 43.75% over the baseline, and 15.90% over the equivalent low-resolution algorithm.

1. INTRODUCTION

A long-standing goal in speech enhancement and robust speech recognition has been to exploit the harmonic structure of speech to improve intelligibility and increase recognition accuracy.

The source-filter model of speech assumes that speech is produced by an excitation source (the vocal cords) which has strong regular harmonic structure during voiced phonemes. The overall shape of the spectrum is then formed by a filter (the vocal tract). In non-tonal languages the filter shape alone determines which phone component of a word is produced (see Figure 2). The source, on the other hand, introduces fine structure in the frequency spectrum that in many cases varies strongly among different utterances of the same phone.

This fact has traditionally inspired the use of smooth representations of the speech spectrum, such as the Mel-frequency cepstral coefficients, in an attempt to accurately estimate the filter component of speech in a way that is invariant to the non-phonetic effects of the excitation [1].

There are two observations that motivate the consideration of high frequency resolution modelling of speech for noise robust speech recognition and enhancement. First is the observation that most noise sources do not have harmonic structure similar to that of voiced speech. Hence, voiced speech sounds should be more easily distinguishable from environmental noise in a high-dimensional signal space¹.
[Fig. 1: The noisy input vector (dot-dash line), the corresponding clean vector (solid line), and the estimate of the clean speech (dotted line), with the shaded area indicating the uncertainty of the estimate (one standard deviation); energy / dB vs. frequency / Hz. Notice that the uncertainty on the estimate is considerably larger in the valleys between the harmonic peaks. This reflects the lower SNR in these regions. The vector shown is frame 100 from Figure 2.]

A second observation is that in voiced speech, the signal power is concentrated in areas near the harmonics of the fundamental frequency, which show up as parallel ridges in ...

[Footnote 1: Even if the interfering signal is another speaker, the harmonic structure of the two signals may differ at different times, and the long-term pitch contour of the speakers may be exploited to separate the two sources [2].]

© 2003 IEEE, ASRU 2003
Meeting Recorders
• Distributed mics in meeting room
• Between-mic correlations locate sources
Janin et al. '03; Ellis & Liu '04
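The between-mic correlation idea can be sketched as a time-difference-of-arrival (TDOA) estimate from the cross-correlation peak. This is a toy illustration, not the cited systems; the names and parameter values are invented:

```python
import numpy as np

def tdoa_xcorr(x1, x2, max_lag):
    """Estimate the delay of x2 relative to x1 (in samples) from the cross-correlation peak."""
    full = np.correlate(x2, x1, mode="full")
    center = len(x1) - 1                      # index corresponding to zero lag
    window = full[center - max_lag:center + max_lag + 1]
    return int(np.argmax(window)) - max_lag

rng = np.random.default_rng(0)
src = rng.standard_normal(4000)
mic1 = src + 0.1 * rng.standard_normal(4000)
# Second mic hears the same source 7 samples later, plus its own noise
mic2 = np.concatenate([np.zeros(7), src[:-7]]) + 0.1 * rng.standard_normal(4000)
lag = tdoa_xcorr(mic1, mic2, max_lag=20)      # expect 7 samples
```

Lags between several mic pairs, converted to distances via the speed of sound, constrain the talker position, which is what the slide's lag-vs-lag plot encodes.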
[Figure: inferred talker positions (x = mic) in x / m, y / m, z / m, derived from PZM cross-correlation lags (lag 1-2 / ms vs. lag 3-4 / ms); meeting mr-2000-11-02-1440]
Environmental Sound Classification
• Trained models using e.g. "texture" features
• Paradoxical results
[Diagram: Sound → automatic gain control → mel filterbank (18 chans) → envelope correlation → cross-band correlations (318 samples); FFT → octave bins 0.5, 1, 2, 4, 8, 16 Hz → modulation energy (18 × 6); mean, var, skew, kurt (18 × 4); histogram]
Ellis, Zheng, McDermott '11
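One ingredient of such texture features, the modulation energy of a subband envelope in octave-spaced bins, can be roughly sketched as follows. The function name and octave edges are illustrative, not the paper's exact implementation:

```python
import numpy as np

def modulation_energies(envelope, fs_env, octave_edges=(0.5, 1, 2, 4, 8, 16, 32)):
    """Energy of a subband envelope in octave-spaced modulation-frequency bins."""
    spec = np.abs(np.fft.rfft(envelope - envelope.mean())) ** 2
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs_env)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(octave_edges[:-1], octave_edges[1:])])

fs_env = 100.0                                   # envelope sample rate, Hz
t = np.arange(0.0, 10.0, 1.0 / fs_env)
env = 1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)    # envelope modulated at 4 Hz
e = modulation_energies(env, fs_env)             # energy concentrates in the 4-8 Hz bin
```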
Audio Lifelogs
• Body-worn continuous recording
• Long time windows for episode-scale segmentation, clustering, and classification
[Figure: two days of lifelog audio (2004-09-13 and 2004-09-14), 09:00-18:00, segmented and clustered into episodes labeled preschool, cafe, lecture, office, outdoor, group, lab, meeting2, postlec, DSP03, compmtg, with speakers Ron, Manuel, Arroyo?, Lesser, Mike, Sambarta?]
Lee & Ellis '04
Machines: Current Limitations
• Separating overlapping sources: blind source separation
• Separating individual events: segmentation
• Learning & classifying source categories: recognition of individual sounds and classes
3. Virtual and Augmented Audio
• Audio signals can be effectively spatialized by convolving with Head-Related Impulse Responses (HRIRs)
• Auditory localization also uses head-motion cues
Brown & Duda ’98
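The HRIR convolution can be sketched directly. The impulse responses below are crude stand-ins for measured HRIRs (a delayed, attenuated impulse for the far ear), purely for illustration:

```python
import numpy as np

def spatialize(mono, hrir_left, hrir_right):
    """Render a mono signal over headphones by convolving with left/right HRIRs."""
    return np.stack([np.convolve(mono, hrir_left), np.convolve(mono, hrir_right)])

fs = 44100
hl = np.zeros(64)
hl[0] = 1.0                # near (left) ear: direct impulse
hr = np.zeros(64)
hr[13] = 0.5               # far ear: ~0.3 ms later and attenuated, as for a source on the left
mono = np.random.default_rng(0).standard_normal(fs)
stereo = spatialize(mono, hl, hr)
```

With measured HRIRs the same two convolutions reproduce the direction-dependent delay, level, and spectral cues of a real source.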
[Figure: HRIR_021 left and right impulse responses at 0° elevation, 0° azimuth (time / ms), and left/right responses as a function of azimuth (−45° to +45°)]
Augmented Audio Reality
• Pass-through and/or mix-in
Proc. of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008
Figure 2: Left: Headset used in the evaluation. Microphone is on the left and headphone outlet on the right. Middle: Headset fitted in ear. Right: Prototype ARA mixer.

Although multimodal communication is in most cases preferred over a single modality, one of the strengths of ARA techniques is that they can be used practically anywhere, anytime, hands-free, and eyes-free.

This paper introduces results of a pilot study on the usability of an ARA headset when used for longer periods of time in everyday-life situations. In addition to general usability, the goal was also to find out different usability aspects of such headsets. A group of four subjects wore an ARA headset in everyday-life conditions and reported the observations in a diary.

2. ARA MIXER AND HEADSET

An ARA mixer and headset was constructed and evaluated from the point of view of pseudo-acoustic sound quality and usability in practice. The headset was constructed from a noise-cancelling headphone (Philips SHN2500) that contains insert-type earphones and electret microphones integrated together as shown in Fig. 2. The original functionality was changed by replacing the external electronics box for noise cancellation with an ARA mixer designed in the project, also shown in Fig. 2.

The earphones fit quite tightly into the ear canal entrance, while the microphones remain about 1 cm out from that point, which could mean degradation of spatial perception and inconvenience when using a telephone. The microphones also have some directivity at the highest frequencies, which means that sound coloration is dependent on sound source direction. The main technical problem was, however, the acoustic effects inside the ear canal and how to compensate for them, as well as acoustic leakage of external sound into the ear canal.

2.1. Coloration of Pseudo-acoustics due to the ARA Headset

The external ear modifies the sound field in many ways while transmitting sound waves through the ear canal opening to the eardrum. Normally, the ear canal is open and acts as a quarter-wavelength resonator, with one end closed by the eardrum and the other end open to the air. For an open ear canal, the first ear canal resonance occurs at around 2-4 kHz depending on the length of the canal [4]. When a headphone blocks the ear canal, this resonance disappears, and the sound field is perceived as unnatural. In this case, the ear canal is closed at both ends and starts to act more like a half-wavelength resonator [5]. The lowest resonance now occurs at around 5-10 kHz depending on the length of the ear canal and the fitting of the earplug.

In order to make the headset acoustically more transparent, equalization is needed to recreate the quarter-wave resonance and to damp the half-wave resonance. The frequency response of the headset also causes coloration, which has to be equalized.

Figure 3: Filter sections in the equalizer.

The headset (especially an insert type) attenuates the mid-range and high frequencies coming from outside quite efficiently. However, there is always some leakage through the headset and also between the skin and the cushion of the headphone [6]. Low frequencies can leak through the headphone quite effectively. The leakage from the real environment sums up in the ear canal with the pseudo-acoustic representation produced by the transducer. This summing causes coloration especially at low frequencies and deteriorates the pseudo-acoustic experience [1]. The amplification of low frequencies has to be equalized.

2.2. Equalization Properties of the ARA Mixer

The ARA mixer of this study is presented in more detail in [2]. The mixer includes a mixing section for transmitting the microphone signals to the earphones, and also for mixing external sound sources into the pseudo-acoustic environment. The mixer also includes an adjustable equalization section to make the pseudo-acoustics sound as natural as possible. For the lowest latency possible, the mixer was constructed with analog electronics. This is important since the addition of the acoustic leakage and the delayed pseudo-acoustic sound creates a comb-filtering effect resulting in coloration. Even a fraction of a millisecond of latency in processing can be disturbing. Typical digital audio hardware and software can introduce several milliseconds of latency, and are therefore not suitable for the task.

The equalizer section has two parametric resonators and a high-pass filter. The resonators can be used to recreate the missing quarter-wave resonance and to damp the added half-wave resonance. The high-pass filter is used to compensate for boosted low-frequency reproduction due to sound leakage between the headphone and skin. Figure 3 shows a block diagram of the equalization section of the mixer.

During the evaluation described in Section 3, two of the testees used individually tuned equalization, and the other two used generic equalization. For individual equalization, the headset responses were measured in an anechoic chamber. The transfer function from a loudspeaker in front of the testee into the ear canal of the testee was first measured without a headset, and then the measurement was repeated with the headset on. The lower plot in Fig. 4 shows the measurement results without a headset (black line) and with a headset (grey line); no equalization is used in the headset in this case. The measurement data clearly demonstrate the quarter-wave resonance around 2.5 kHz when there is no headset, and how it disappears when the headset is put on. Also, pronounced low frequencies and the added half-wave resonance at around 8 kHz are clearly visible. The upper plot in Fig. 4 shows measurement data for the same subject when the equalization is accurately tuned and switched on. For the generic (non-individual) equalization, shown in Fig. 5, the equalization curve was computed as the average of four different individual measurements.

It should be noted that due to the simple implementation of the equalization circuit, even the individual equalization settings differ noticeably from the measured curve. Therefore, the individual equalizer setting is still an approximation of the real measured curve. A small-scale quality evaluation with four listeners was ...
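The comb-filtering effect described above, leaked sound summing with a latency-delayed pseudo-acoustic copy, can be illustrated numerically. The 1 ms latency and 0.3 leak gain below are invented values for illustration:

```python
import numpy as np

def comb_magnitude(freqs_hz, delay_s, leak=0.3):
    """|H(f)| for y(t) = leak * x(t) + x(t - delay): direct leakage plus the
    latency-delayed pseudo-acoustic path. Notches fall at odd multiples of 1/(2*delay)."""
    w = 2 * np.pi * np.asarray(freqs_hz, dtype=float)
    return np.abs(leak + np.exp(-1j * w * delay_s))

f = np.linspace(0.0, 8000.0, 1601)
mag = comb_magnitude(f, delay_s=1e-3)   # 1 ms of processing latency
# Notches (|H| = 0.7) at 500, 1500, 2500 Hz; peaks (|H| = 1.3) at 0, 1000, 2000 Hz
```

Shrinking either the latency or the leak gain flattens the comb, which is why the analog (near-zero-latency) mixer and a well-sealed insert earphone both help.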
Härmä et al. '04; Hearium '12
Journal of Electrical and Computer Engineering
[Figure 1: Block diagram of an ARA system: virtual sounds → virtual acoustic processing → ARA mixer; preprocessing for transmission → signal to distant user]
Figure 1 shows the block diagram of a complete ARA system [1]. The first block on the left is used to create virtual sounds, which can be embedded into the pseudoacoustic representation. Furthermore, it can utilize location and orientation information to keep virtual audio objects in place while the user moves and turns his/her head. An essential part in creating a realistic ARA system is the ARA mixer, which routes and mixes all the signals involved in the system. Furthermore, the ARA mixer takes care of the equalization needed to create an acoustically transparent headset [4]. The headset is similar to common in-ear headphones, which are nowadays widely used with portable media players and smartphones. The preprocessing block can be used to send the user's binaural microphone signals to a distant user for communication purposes. The ARA mixer and headset are the user-worn devices of the system.

Even if the ARA headset were to provide perfect sound quality and hear-through experience, it still requires useful applications in order to really benefit the user [5]. The most promising applications include full audio bandwidth (20 Hz-20 kHz) binaural telephony and audio conferencing with distant subjects panned around the user [6, 7]. Furthermore, the ARA technology enables location-based information services, such as virtual tourist guides and audio memos [1].

A previous prototype of the ARA mixer was constructed with analog electronics to avoid latency [8]. This is important because parts of the ambient sounds leak through and around the headset into the ear canal, and if the pseudoacoustic representation is delayed by the ARA mixer, it results in a comb-filter effect when added to the leaked sound. However, there is great interest in replacing the bulky and expensive analog components with digital signal processing (DSP). The digital implementation would bring several benefits when compared to the analog implementation, including programmability, which would enable convenient use of individualized equalization curves; ease of design; and precision. The downside is that a digital implementation introduces more delay than an analog implementation, which causes the comb-filtering effect in the perceived signal. However, a digital implementation of the ARA mixer can be realized using a low-latency DSP due to the pronounced attenuation capability of the in-ear headset, which can dramatically reduce the comb-filter effect.

The aim of this paper is to study whether the ARA equalizer can be implemented using DSP, and whether the latency between the pseudoacoustic representation and the leakage through the headphone deteriorates the perceived sound excessively. The digital implementation of the ARA equalizer could bring many enhancements compared to the analog implementation, but only if the sound quality remains sufficiently good.

This paper is organized as follows. Section 2 describes the principles of the ARA technology. Section 3 concentrates on digital filters and their latency properties. Section 4 presents the group delay estimation of a passive mechanism. Section 5 focuses on the implementation of the digital ARA equalizer. Section 6 introduces a case study of a digital ARA mixer, and Section 7 concludes the paper.
2. ARA Technology
The ARA hardware has been specially designed and built for this purpose [4]. It consists of the ARA headset and the ARA mixer. The basis of the ARA headset is that it must be able to accurately reproduce the surrounding sound environment. In order to do that, the headset has two external microphones in addition to the earphone drivers. The quality of the reproduction of the pseudoacoustic environment must be sufficient to allow users to continuously wear the ARA headset for long periods of time nonstop.

However, because of the headphone acoustics, the pseudoacoustic representation is not an exact copy of the surrounding sounds. Thus, an equalizer is needed to correct the pseudoacoustic representation. Originally, the equalization was designed to be analog in order to have as low latency as possible [4]. Furthermore, the ARA mixer is used to embed virtual sound objects into the user's sound environment as well as to connect all the additional devices into the ARA system.

2.1. Headphone Acoustics. In normal listening with open ears, the incident sound waves are modified by the listener's body, head, and outer ear. When sounds are recorded with a conventional microphone situated away from the body and played back through headphones, the modifications caused by the body of the listener are lost. However, when microphones are placed binaurally near the ear canal entrances, the majority of these modifications are preserved.

The main difference when using in-ear headphones compared to listening with open ears is that the headphones occlude the ear canals completely. An open ear canal acts as a quarter-wavelength resonator, that is, like a tube with one ...
QuietPro+
Hearium
4. Future Audio Analysis & Display
• Better Scene Analysis
  overcoming the limitations of human hearing: sensors, geometry
• Challenges
  fine source discrimination
  modeling & classification (language ID?)
  integrating through time: single location, sparse sounds
[Diagram: sound mixture → Object 1, Object 2, Object 3 percepts]
Ad-Hoc Mic Array
• Multiple sensors, real-time sharing, long-baseline beamforming
• Challenges
  precise relative (dynamic) localization
  precise absolute registration
Sound Visualization
• Making acoustic information visible: "synesthesia"
• Challenges
  source formation & classification
  registration: sensors, display
O’Donovan et al. ’07
www.ultra-gunfirelocator.com
Auditory Display
• Acoustic channel complements vision
  acoustic alarms
  verbal information
  "zoomed" ambience
  instant replay
• Challenges
  information management & prioritization
  maximally exploit perceptual organization
http://www.youtube.com/watch?v=v1uyQZNg2vE
Summary
• Human Scene Analysis
  spatial & source information
• Computational Scene Analysis
  spatial & source information
  world knowledge
• Augmented Audition
  selective pass-through + insertion
References• A. Bregman, Auditory Scene Analysis, MIT Press, 1990.
• C. Darwin & R. Carlyon, “Auditory grouping” Hbk of Percep. & Cogn. 6: Hearing, 387–424, Academic Press, 1995.
• J. Blauert, Spatial Hearing (revised ed.), MIT Press, 1996.
• D. Brungart & B. Simpson, “The effects of spatial separation in distance on the informational and energetic masking of a nearby speech signal”, JASA 112(2), Aug. 2002.
• D. Brungart, B. Simpson, M. Ericson, K. Scott, “Informational and energetic masking effects in the perception of multiple simultaneous talkers,” JASA 110(5), Nov. 2001.
• G. Brown & M. Cooke, “Computational auditory scene analysis,” Comp. Speech & Lang. 8(4), 297–336, 1994.
• H. Okuno, T. Nakatani, T. Kawabata, “Listening to two simultaneous speeches,” Speech Communication 27, 299–310, 1999.
• G. Hu and D.L. Wang, “Monaural speech segregation based on pitch tracking and amplitude modulation,” IEEE Tr. Neural Networks, 15(5), Sep. 2004.
• J. Benesty, J. Chen, Y. Huang, Microphone Array Signal Processing, Springer, 2008.
• A. Bell, T. Sejnowski, “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, vol. 7 no. 6, pp. 1129-1159, 1995.
• P. Smaragdis, “Blind separation of convolved mixtures in the frequency domain,” Intl. Wkshp. on Indep. & Artif. Neural Networks, Tenerife, Feb. 1998.
• D. Ellis, “Prediction-Driven Computational Auditory Scene Analysis,” Ph.D. thesis, MIT EECS, 1996.
• T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath, “Super-human multi-talker speech recognition: The IBM 2006 speech separation challenge system,” Proc. Interspeech, pp. 97-100, 2006.
• A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, C. Wooters, The ICSI Meeting Corpus, Proc. ICASSP-03, pp. I-364--367, 2003.
• D. Ellis and J. Liu, Speaker turn segmentation based on between-channel differences, NIST Meeting Recognition Workshop @ ICASSP, pp. 112-117, Montreal, May 2004.
• D. Ellis, X. Zheng, and J. McDermott, Classifying soundtracks with audio texture features, Proc. IEEE ICASSP, pp. 5880-5883, Prague, May 2011.
• D. Ellis and K.S. Lee, Minimal-Impact Audio-Based Personal Archives, First ACM workshop on Continuous Archiving and Recording of Personal Experiences CARPE-04, New York, pp. 39-47, Oct 2004.
• C. P. Brown, R. O. Duda, A structural model for binaural sound synthesis, IEEE Tr. Speech & Audio 6(5):476-488, 1998.
• A. Härmä, J. Jakka, M. Tikander, M. Karjalainen, T. Lokki, J. Hiipakka, G. Lorho, “Augmented reality audio for mobile and wearable appliances,” J. Audio Eng. Soc. 52(6): 618-639, 2004.
• A. O'Donovan, R. Duraiswami, J. Neumann, Microphone arrays as generalized cameras for integrated audio visual processing, IEEE CVPR, 1-8, 2007.