
LabROSA

Sound, mixtures, learning @ Snowbird - Dan Ellis 2002-04-04 - 1

Sound, Mixtures, and Learning

Dan Ellis <[email protected]>

Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/

Outline

1. Human sound organization
2. Computational Auditory Scene Analysis
3. Speech models and knowledge
4. Sound mixture recognition
5. Learning opportunities


Human sound organization

• Analyzing and describing complex sounds:
  - continuous sound mixture → distinct events
• Hearing is ecologically grounded:
  - reflects ‘natural scene’ properties
  - subjective, not canonical (ambiguity)
  - mixture analysis as the primary goal

[Figure: spectrogram of a 12 s sound scene (0–4000 Hz, level from −60 to 0 dB), with the analysis into labeled events: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant).]


Sound mixtures

• The sound ‘scene’ is almost always a mixture
  - there is always stuff going on
  - sound is ‘transparent’, but with a big energy range
• Need information related to our ‘world model’
  - i.e. separate objects: a wolf howling in a blizzard is the same as a wolf howling in a rainstorm
  - whole-signal statistics won’t do this
• ‘Separateness’ is similar to independence
  - objects/sounds that change in isolation
  - but: depends on the situation, e.g. passing car vs. mechanic’s diagnosis

[Figure: spectrograms (0–1 s, 0–4 kHz): Speech + Noise = Speech + Noise mixture.]


Auditory scene analysis (Bregman 1990)

• How do people analyze sound mixtures?
  - break the mixture into small elements (in time-frequency)
  - elements are grouped into sources using cues
  - sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
  - cues: common onset/offset/modulation, harmonicity, spatial location, ...

[Diagram (after Darwin, 1996): frequency analysis feeds onset, harmonicity, and position maps, which drive a grouping mechanism that yields source properties.]


Cues to simultaneous grouping

• Elements + attributes
• Common onset
  - simultaneous energy has a common source
• Periodicity
  - energy in different bands with the same cycle
• Other cues
  - spatial (ITD/IID), familiarity, ...

[Figure: spectrogram, 0–9 s, 0–8000 Hz.]


The effect of context

• Context can create an ‘expectation’: i.e. a bias towards a particular interpretation
• e.g. Bregman’s “old-plus-new” principle:
  A change in a signal will be interpreted as an added source whenever possible
  - a different division of the same energy depending on what preceded it

[Figure: spectrograms (0.0–1.2 s, 0–2 kHz): the same energy divided differently depending on the preceding context.]


Outline

1. Human sound organization
2. Computational Auditory Scene Analysis
   - sound source separation
   - bottom-up models
   - top-down constraints
3. Speech models and knowledge
4. Sound mixture recognition
5. Learning opportunities


Computational Auditory Scene Analysis (CASA)

• Goal: automatic sound organization; systems to ‘pick out’ sounds in a mixture
  - ... like people do
• E.g. voice against a noisy background
  - to improve speech recognition
• Approach:
  - psychoacoustics describes grouping ‘rules’
  - ... so just implement them?

[Diagram: a CASA system maps an input mixture to Object 1, 2, 3, ... descriptions.]


CASA front-end processing

• Correlogram: loosely based on known/possible physiology
  - linear filterbank cochlear approximation
  - static nonlinearity
  - zero-delay slice is like a spectrogram
  - periodicity from delay-and-multiply detectors

[Diagram: sound → cochlea filterbank → envelope follower → short-time autocorrelation over a delay line, per frequency channel, giving a correlogram slice (frequency × lag, over time).]
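The channel-by-channel pipeline above can be sketched in a few lines. This is a toy illustration, not the talk's front end: a second-order resonator stands in for a real cochlear filter, half-wave rectification for the static nonlinearity, and plain short-time autocorrelation for the delay-and-multiply stage; the channel centres and 100 Hz test signal are invented for the demo.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
f0 = 100.0
# harmonic source: first 10 harmonics of 100 Hz
x = sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, 11))

def resonator(x, fc, fs, r=0.98):
    """Second-order resonator: a crude stand-in for one cochlear channel."""
    theta = 2 * np.pi * fc / fs
    y = np.zeros_like(x)
    for n in range(2, len(x)):
        y[n] = x[n] + 2 * r * np.cos(theta) * y[n - 1] - r * r * y[n - 2]
    return y

centers = [200, 500, 1000, 2000]              # a few channel centre frequencies
summary = np.zeros(200)
for fc in centers:
    ch = np.maximum(resonator(x, fc, fs), 0.0)  # static nonlinearity (rectify)
    ch = ch - ch.mean()
    ac = np.correlate(ch, ch, 'full')[len(ch) - 1:len(ch) - 1 + 200]
    summary += ac / ac[0]                     # normalized autocorrelation

lag = int(np.argmax(summary[20:]) + 20)       # skip the zero-lag peak
print(fs / lag)                               # estimated F0, near 100 Hz
```

Summing the per-channel autocorrelations gives a common peak at the period shared by all harmonics, which is how the correlogram surfaces the fundamental.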


The Representational Approach (Brown & Cooke 1993)

• Implement psychoacoustic theory
  - ‘bottom-up’ processing
  - uses common onset & periodicity cues
• Able to extract voiced speech:

[Diagram: input mixture → front end (signal feature maps: onset, period, frequency modulation) → object formation (discrete objects) → grouping rules → source groups. Figure: spectrograms (0.2–1.0 s, 100–3000 Hz) of the input mixture (brn1h.aif) and the extracted voiced speech (brn1h.fi.aif).]


Problems with ‘bottom-up’ CASA

• Circumscribing time-frequency elements
  - need to have ‘regions’, but they are hard to find
• Periodicity is the primary cue
  - how to handle aperiodic energy?
• Resynthesis via masked filtering
  - cannot separate within a single t-f element
• Bottom-up leaves no ambiguity or context
  - how to model illusions?


Restoration in sound perception

• Auditory ‘illusions’ = hearing what’s not there
• The continuity illusion
• SWS (sine-wave speech)
  - duplex perception

[Figure: spectrogram (0–1.4 s, up to 4000 Hz) of a continuity-illusion stimulus, and an auditory (Bark-scale) spectrogram of a sine-wave speech example.]


Adding top-down constraints

Perception is not direct, but a search for plausible hypotheses

• Data-driven (bottom-up)...
  - objects irresistibly appear
• vs. prediction-driven (top-down)
  - match observations with the parameters of a world-model
  - need world-model constraints...

[Diagrams: (bottom-up) input mixture → front end (signal features) → object formation (discrete objects) → grouping rules → source groups; (prediction-driven) input mixture → front end → compare & reconcile (prediction errors) → hypothesis management (hypotheses) → predict & combine (periodic and noise components give predicted features, fed back to the comparison).]


Aside: Optimal techniques (ICA, ABF) (Bell & Sejnowski etc.)

• General idea: drive a parameterized separation algorithm to maximize the independence of its outputs
• Attractions:
  - mathematically rigorous, minimal assumptions
• Problems:
  - limitations of the separation algorithm (N x N)
  - essentially bottom-up

[Diagram: sources s1, s2 mixed through a11, a12, a21, a22 into observations m1, m2; the unmixing parameters are adapted by the gradient −δMutInfo/δa.]
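The Bell & Sejnowski idea can be sketched with the natural-gradient Infomax update on a toy 2 × 2 instantaneous mixture. This is only an illustrative sketch under invented conditions (Laplacian toy sources, a logistic nonlinearity, a hand-picked mixing matrix), not code from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# two independent super-Gaussian (Laplacian) sources, mixed 2 x 2
S = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
X = A @ S

# whiten the mixtures (zero mean, identity covariance)
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(X @ X.T / n)
X = E @ np.diag(d ** -0.5) @ E.T @ X

# natural-gradient Infomax: push the outputs towards independence
W = np.eye(2)
for _ in range(1000):
    Y = W @ X
    g = 1.0 / (1.0 + np.exp(-Y))                   # logistic nonlinearity
    W = W + 0.05 * (np.eye(2) + (1 - 2 * g) @ Y.T / n) @ W

Y = W @ X
# each output should match one source (up to sign/permutation/scale)
corr = np.corrcoef(np.vstack([Y, S]))[:2, 2:]
print(np.abs(corr).max(axis=1))                    # both near 1
```

Note the two limitations from the slide: the algorithm assumes as many microphones as sources (N x N), and nothing in the update knows what a ‘source’ is beyond statistical independence.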


Outline

1. Human sound organization
2. Computational Auditory Scene Analysis
3. Speech models and knowledge
   - automatic speech recognition
   - subword states
   - cepstral coefficients
4. Sound mixture recognition
5. Learning opportunities


Speech models & knowledge

• Standard speech recognition
• ‘State of the art’ word-error rates (WERs):
  - 2% (dictation) to 30% (telephone conversations)

[Diagram: sound → feature calculation (feature vectors) → acoustic classifier (phone probabilities; acoustic model parameters) → HMM decoder (phone/word sequence; word models, e.g. “s ah t”, and language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → understanding/application.]


Speech units

• Speech is highly variable
  - simple templates won’t do
  - spectral variation (voice quality)
  - time-warp problems
• Match short segments (states), allow repeats
  - model with pseudo-stationary slices of ~10 ms
• Speech models are distributions p(X|q)

[Figure: spectrogram (0–0.45 s, 0–4000 Hz) labeled with the states sil g w eh n sil.]


Speech features: Cepstra

• Idea: decorrelate & summarize spectral slices, making them easier to model:

  X_m[l] = IDFT{ log S[mH, k] }

  - C0 ‘normalizes out’ the average log energy
• Decorrelated pdfs fit diagonal Gaussians
  - the DCT is close to PCA for log spectra

[Figure: auditory spectrum and cepstral coefficients over ~150 frames; covariance matrices of the two feature sets (the cepstral one is nearly diagonal); example joint distribution of cepstral coefficients (10, 15), roughly an axis-aligned Gaussian.]
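One convenient consequence of the log in X_m[l] = IDFT{ log S[mH, k] }: filtering multiplies magnitude spectra, so the log turns it into addition and cepstra of source and channel simply add. A minimal sketch, with invented positive toy spectra standing in for S[mH, k]:

```python
import numpy as np

def cepstrum(spec_mag):
    # X_m[l] = IDFT{ log S[mH, k] }, as on the slide, for one frame
    return np.fft.irfft(np.log(spec_mag))

rng = np.random.default_rng(1)
# toy positive magnitude spectra for a source and a channel (129 bins)
source = np.exp(rng.standard_normal(129))
channel = np.exp(0.1 * rng.standard_normal(129))

# filtering multiplies magnitude spectra; cepstra simply add
combined = cepstrum(source * channel)
print(np.allclose(combined, cepstrum(source) + cepstrum(channel)))  # True
```

This additivity is what makes channel effects a constant offset in the cepstral domain, which the later feature-normalization slide exploits.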


Acoustic model training

• Goal: describe p(X|q) with e.g. GMMs
• Training data labels come from:
  - manual phonetic annotation
  - the ‘best path’ from an earlier classifier (Viterbi)
  - EM: joint estimation of labels & pdfs

[Diagram: labeled training examples {xn, ωxn} → sort according to class → estimate the conditional pdf p(x|ω1) for class ω1.]
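The sort-then-estimate step can be sketched with a single diagonal Gaussian per class, a stand-in for the GMMs mentioned above; the 2-D data, class means, and test point are all synthetic:

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic labeled training examples {x_n, omega_n}: two classes in 2-D
means = {0: np.array([0.0, 0.0]), 1: np.array([3.0, -2.0])}
labels = rng.integers(0, 2, size=5000)
X = np.array([rng.normal(means[w], 1.0) for w in labels])

# sort according to class, estimate a conditional pdf per class
model = {}
for w in (0, 1):
    Xw = X[labels == w]
    model[w] = (Xw.mean(axis=0), Xw.var(axis=0))  # diagonal Gaussian p(x|omega_w)

def log_lik(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# classify a point near class 1's mean
x = np.array([2.9, -1.8])
pred = max(model, key=lambda w: log_lik(x, *model[w]))
print(pred)  # 1
```

EM training replaces the hard sort with soft (or re-estimated) label assignments, but the per-class density estimation step is the same.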


HMM decoding

• Feature vectors cannot be reliably classified into phonemes
• Use top-down constraints to get good results:
  - allowable phonemes
  - a dictionary of known words
  - a grammar of possible sentences
• The decoder searches all possible state sequences
  - at least notionally; pruning makes it tractable

[Figure: spectrogram (0–2.5 s, 0–4 kHz) with aligned phone and word transcriptions: “WE DO HAVE JUST A FEW DAYS LEFT”.]
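The search over state sequences is the Viterbi algorithm; a minimal sketch on a toy two-state HMM (all probabilities invented for illustration):

```python
import numpy as np

# toy HMM: 2 states, 2 observation symbols
start = np.array([0.5, 0.5])
trans = np.array([[0.8, 0.2],   # p(next state | current state)
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.1],    # p(symbol | state)
                 [0.1, 0.9]])
obs = [0, 0, 1, 1]

# Viterbi: best state sequence by dynamic programming (log domain)
logd = np.log(start) + np.log(emit[:, obs[0]])
back = []
for o in obs[1:]:
    scores = logd[:, None] + np.log(trans)   # [from, to]
    back.append(scores.argmax(axis=0))
    logd = scores.max(axis=0) + np.log(emit[:, o])

# trace back the best path
state = int(logd.argmax())
path = [state]
for bp in reversed(back):
    state = int(bp[state])
    path.append(state)
path.reverse()
print(path)  # [0, 0, 1, 1]
```

Real decoders apply the same recursion over phone/word-constrained state graphs, with beam pruning to keep the search tractable.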


Outline

1. Human sound organization
2. Computational Auditory Scene Analysis
3. Speech models and knowledge
4. Sound mixture recognition
   - feature invariance
   - mixtures including
   - general mixtures
5. Learning opportunities


Sound mixture recognition

• The biggest problem in speech recognition is background noise interference
• Feature invariance approach:
  - use features that reflect only the speech
  - e.g. normalization, mean subtraction
  - but: what about non-static noise?
• Or: more complex models of the signal:
  - HMM decomposition
  - missing-data recognition
• Generalize to other, multiple sounds


Feature normalization

• Idea: capture feature variations, not absolute level
• Hence: calculate the average level & subtract it:

  X[k] = S[k] − mean{ S[k] }

• This factors out a fixed channel frequency response:

  s[n] = h[n] * e[n]
  log S[k] = log H[k] + log E[k]

• Normalize the variance to handle added noise?

[Figure: spectrograms and cepstra (0–2.5 s) of clean/MGG95958 and N2_SNR5/MGG95958.]
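Since log S[k] = log H[k] + log E[k], a fixed channel is a constant per-band offset in the log domain, and subtracting the per-band mean removes it exactly. A minimal sketch with synthetic log-spectral frames (the channel shape here is invented):

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic log-spectral features: 100 frames x 20 bands
clean = rng.standard_normal((100, 20))

# a fixed (time-invariant) channel adds a constant offset per band:
# log S[k] = log H[k] + log E[k]
channel = np.linspace(-2.0, 1.0, 20)
filtered = clean + channel

def cmn(frames):
    # X[k] = S[k] - mean{S[k]}: subtract the per-band average over time
    return frames - frames.mean(axis=0)

# after mean normalization the channel is gone
print(np.allclose(cmn(filtered), cmn(clean)))  # True
```

The slide's caveat applies: additive noise is not a constant log-domain offset, so mean (or even variance) normalization only partially helps there.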


HMM decomposition (e.g. Varga & Moore 1991, Roweis 2000)

• The total signal model has independent state sequences for the 2+ component sources
• New combined state space q' = {q1, q2}
  - new observation pdfs p(Xi | q1i, q2i) for each state combination

[Diagram: two HMMs (model 1, model 2) jointly explaining the observations over time.]
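Because the sources evolve independently, the combined state space q' = {q1, q2} is the Cartesian product of the component state spaces, and its transition matrix is the Kronecker product of the component transition matrices. A small sketch with invented toy transition matrices:

```python
import numpy as np

# toy transition matrices for two independent source models
A1 = np.array([[0.9, 0.1],
               [0.2, 0.8]])        # source 1: 2 states
A2 = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1],
               [0.3, 0.1, 0.6]])   # source 2: 3 states

# combined model over q' = {q1, q2}: transitions factorize because the
# sources evolve independently, giving the Kronecker product
A = np.kron(A1, A2)                # 6 x 6: one row per (q1, q2) pair

print(A.shape)                     # (6, 6)
print(A.sum(axis=1))               # each row still sums to 1
```

The state count multiplies (here 2 × 3 = 6), which is the exponential growth that makes decomposition of many or large models impractical.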


Problems with HMM decomposition

• The combined state space is exponentially large (q^N combinations for N sources of q states each)...
• Normalization no longer holds!
  - each source has a different gain → model at various SNRs?
  - models typically don’t use the overall energy C0
  - each source has a different channel H[k]
• Modeling every possible sub-state combination is inefficient, inelegant and impractical


Missing data recognition (Cooke, Green, Barker @ Sheffield)

• Energy overlaps in time-frequency hide features
  - some observations are effectively missing
• Use missing feature theory...
  - integrate over the missing data xm under the model M:

  p(x|M) = ∫ p(xp | xm, M) p(xm | M) dxm

• Effective in speech recognition
• Problem: finding the missing-data mask

[Figure: word error rate vs. SNR (−5 to 20 dB and clean) on AURORA 2000 Test Set A, comparing MD discrete SNR, MD soft SNR, HTK clean training, and HTK multi-condition training. Diagram: "1754" + noise → mask based on a stationary noise estimate → missing-data recognition → "1754".]
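For the diagonal Gaussians used in acoustic models, the integral over the missing dimensions has a closed form: the marginal is just the product over the present dimensions. A sketch checking this against brute-force numerical integration, with an invented toy 3-D model:

```python
import numpy as np

mu = np.array([0.0, 2.0, -1.0])
var = np.array([1.0, 0.5, 2.0])

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

x = np.array([0.3, 1.5, 0.0])
present = np.array([True, True, False])  # dimension 2 is masked/missing

# missing-data likelihood: integrating the missing dimension out of a
# diagonal Gaussian just drops that dimension
p_marginal = np.prod(gauss(x[present], mu[present], var[present]))

# brute-force check: numerically integrate the full joint over x_m
grid = np.linspace(-20, 20, 20001)
joint = np.prod(gauss(x[present, None], mu[present, None],
                      var[present, None]), axis=0) * gauss(grid, mu[2], var[2])
p_numeric = np.sum(0.5 * (joint[1:] + joint[:-1]) * np.diff(grid))

print(p_marginal, p_numeric)  # should agree
```

This is why missing-data recognition is cheap at decode time; the hard part, as the slide says, is deciding which dimensions to mask.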


Maximum-likelihood data mask (Jon Barker @ Sheffield)

• Search over sound-fragment interpretations
• The decoder searches over the data mask K:

  p(M, K | x) ∝ p(x | K, M) p(K | M) p(M)

  - how to estimate p(K)?

[Diagram: "1754" + noise → mask based on a stationary noise estimate, split into subbands; ‘grouping’ (common onset/offset, spectro-temporal proximity) applied within bands → multisource decoder → "1754".]


Multi-source decoding

• Search for more than one source
• Mutually-dependent data masks
• Use CASA processing to propose masks
  - locally coherent regions
  - p(K|q)
• Theoretical vs. practical limits

[Diagram: observations O(t) jointly explained by masks and states K1(t), q1(t) and K2(t), q2(t).]


General sound mixtures

• Search for a generative explanation:

[Diagram, generation model: source models M1, M2 → source signals Y1, Y2 → channel parameters C1, C2 → received signals X1, X2 → observations O, with model dependence Θ. Analysis structure: observations O → fragment formation → mask allocation {Ki} → likelihood evaluation p(X|Mi,Ki) → model fitting {Mi}.]


Outline

1. Human sound organization
2. Computational Auditory Scene Analysis
3. Speech models and knowledge
4. Sound mixture recognition
5. Opportunities for learning
   - learnable aspects of modeling
   - tractable decoding
   - some examples


Opportunities for learning

• Per-model feature distributions P(Y|M)
  - e.g. analyzing isolated sound databases
• Channel modifications P(X|Y)
  - e.g. by comparing multi-mic recordings
• Signal combinations P(O|{Xi})
  - determined by acoustics
• Patterns of model combinations P({Mi})
  - loose dependence between sources
• Search for the most likely explanations:

  P({Mi} | O) ∝ P(O | {Xi}) P({Xi} | {Mi}) P({Mi})

• Short-term structure: repeating events


Source models

• The speech recognition lesson: use the data as much as possible
  - what can we do with unlimited data feeds?
• Data sources:
  - clean data corpora
  - identify near-clean segments in real sound
• Model types:
  - templates
  - parametric/constraint models
  - HMMs


What are the HMM states?

• No sub-units are defined for nonspeech sounds
• The final states depend on the EM initialization:
  - labels
  - clusters
  - transition matrix
• We have ideas of what we’d like to get
  - investigate features/initialization to get there

[Figure: spectrogram (1.15–2.25 s, 0–8 kHz) of "dogBarks2" with the learned state sequence s4 s7 s3 s4 s3 s4 s2 s3 s2 s5 s3 s2 s7 s3 s2 s4 s2 s3 s7.]


Tractable decoding

• A speech decoder notionally searches all states
• Parametric models give an infinite space
  - need closed-form partial explanations
  - examine the residual, iterate, converge
• Need general cues to get started; return to Auditory Scene Analysis:
  - onsets
  - harmonic patterns
  - then parametric fitting
• Need multiple-hypothesis search, pruning, efficiency tricks
• Learning? Parameters for new source events
  - e.g. from artificial (hence labeled) mixtures


Example: Alarm sound detection

• Alarm sounds have a particular structure
  - people ‘know them when they hear them’
• Isolate alarms in sound mixtures
  - sinusoid peaks have invariant properties
• Learn model parameters from examples

[Figure: spectrograms (0–5000 Hz) of a speech + alarm mixture and the extracted sinusoid components, with the resulting Pr(alarm) trace over 0–9 s.]
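Picking out stable sinusoidal peaks — the invariant property noted above — can be sketched with frame-by-frame spectral peak-picking on a synthetic alarm tone buried in noise; the tone frequency, frame size, and noise level are all invented for the demo:

```python
import numpy as np

fs = 8000
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(4)

# synthetic 'alarm': a steady 1 kHz tone buried in background noise
x = np.sin(2 * np.pi * 1000 * t) + 0.05 * rng.standard_normal(len(t))

# frame-by-frame spectral peak-picking
frame_len = 256
n_frames = len(x) // frame_len
peaks = []
for i in range(n_frames):
    frame = x[i * frame_len:(i + 1) * frame_len]
    spec = np.abs(np.fft.rfft(frame))
    peaks.append(int(np.argmax(spec)))

# a peak that stays in the same bin across frames is alarm-like
peaks = np.array(peaks)
stable = np.bincount(peaks).max() / n_frames
freq = np.bincount(peaks).argmax() * fs / frame_len
print(freq, stable)   # ~1000 Hz, stable in (nearly) every frame
```

A learned alarm model would put distributions on such track properties (frequency stability, duration, repetition) rather than hand-set thresholds.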


Example: Music transcription (e.g. Masataka Goto)

• High-quality training material: synthesizer sample kits
• Ground truth available: musical scores
• Find ML explanations for scores
  - guided by multiple pitch tracking (hypothesis search)
• Applications in similarity matching


Summary

• Sound contains lots of information
  ... but it’s always mixed up
• Psychologists describe ASA
  ... but bottom-up computer models don’t work
• Speech recognition works for isolated speech
  ... by exploiting top-down, context constraints
• Speech in mixtures via multiple-source models
  ... practical combinatorics are the main problem
• Generalize this idea for all sounds
  ... need models of ‘all sounds’
  ... plus models of channel modification
  ... plus ways to propose segmentations
  ... plus missing-data recognition


Further reading

[BarkCE00] J. Barker, M.P. Cooke & D. Ellis (2000). “Decoding speech in the presence of other sound sources,” Proc. ICSLP-2000, Beijing. ftp://ftp.icsi.berkeley.edu/pub/speech/papers/icslp00-msd.pdf

[Breg90] A.S. Bregman (1990). Auditory Scene Analysis: The Perceptual Organization of Sound, MIT Press.

[Chev00] A. de Cheveigné (2000). “The Auditory System as a Separation Machine,” Proc. Intl. Symposium on Hearing. http://www.ircam.fr/pcm/cheveign/sh/ps/ATReats98.pdf

[CookE01] M. Cooke & D. Ellis (2001). “The auditory organization of speech and other sources in listeners and computational models,” Speech Communication (accepted for publication). http://www.ee.columbia.edu/~dpwe/pubs/tcfkas.pdf

[Ellis99] D.P.W. Ellis (1999). “Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis...,” Speech Communication 27. http://www.ee.columbia.edu/~dpwe/pubs/spcomcasa98.pdf

[Roweis00] S. Roweis (2000). “One microphone source separation,” Proc. NIPS 2000. http://www.ee.columbia.edu/~dpwe/papers/roweis-nips2000.pdf