
  • Dan Ellis Audio Signal Recognition 2003-11-13 - 1 / 25

    Audio Signal Recognition for Speech, Music, and Environmental Sounds

    1. Pattern Recognition for Sounds
    2. Speech Recognition
    3. Other Audio Applications
    4. Observations and Conclusions

    Dan Ellis <[email protected]>
    Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
    Columbia University, New York
    http://labrosa.ee.columbia.edu/

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 2 / 25

    Pattern Recognition for Sounds

    Pattern recognition is abstraction
    - continuous signal → discrete labels
    - an essential part of understanding? information extraction

    Sound is a challenging domain
    - sounds can be highly variable
    - human listeners are extremely adept

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 3 / 25

    Pattern classification

    Classes are defined as distinct regions in some feature space

    - e.g. formant frequencies to define vowels

    Issues

    - finding segments to classify

    - transforming to an appropriate feature space

    - defining the class boundaries

    [Figure: spectrogram of "ay ao" with formant tracks (f/Hz vs. time/s), and a scatter
    plot of Pols vowel formants in F2/Hz vs. F1/Hz space: "u" (x), "o" (o), "a" (+),
    with a new observation x to be classified]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 4 / 25

    Classification system parts

    Sensor → signal
    Pre-processing/segmentation (e.g. STFT, locate vowels) → segment
    Feature extraction (e.g. formant extraction) → feature vector
    Classification → class
    Post-processing (e.g. context constraints, costs/risk)

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 5 / 25

    Feature extraction

    Feature choice is critical to performance
    - make important aspects explicit, remove irrelevant details
    - equivalent representations can perform very differently in practice
    - major opening for domain knowledge (cleverness)

    Mel-Frequency Cepstral Coefficients (MFCCs): ubiquitous speech features
    - DCT of log spectrum on an auditory frequency scale
    - approximately decorrelated ...
    (a short code sketch follows below)

    [Figure: Mel spectrogram (freq. channel vs. time/sec) and the corresponding MFCCs
    (cepstral coef. vs. time/sec)]
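
    A minimal sketch of the MFCC computation described above (log mel spectrum
    followed by a DCT), assuming Python with the librosa library; the file name
    and frame parameters are illustrative and not taken from the slides.

    # Sketch: MFCC feature extraction (librosa assumed; "speech.wav" is a placeholder)
    import librosa

    y, sr = librosa.load("speech.wav", sr=16000)            # waveform at 16 kHz
    # mel spectrogram -> log -> DCT, i.e. "DCT of log spectrum on auditory scale"
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,      # 13 cepstral coefficients
                                n_fft=400, hop_length=160)  # ~25 ms windows, 10 ms hop
    print(mfcc.shape)   # (13, num_frames): one feature vector per 10 ms frame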

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 6 / 25

    Statistical Interpretation

    Observations are random variables whose distribution depends on the class:
    source distributions p(x|i)
    - reflect variability in the feature
    - reflect noise in observation
    - generally have to be estimated from data (rather than known in advance)

    [Figure: generative view - a hidden, discrete class i produces a continuous
    observation x via p(x|i); inference goes back from x to Pr(i|x); plot of
    p(x|i) vs. x for classes 1-4]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 7 / 25

    Priors and posteriors

    Bayesian inference can be interpreted as updating prior beliefs with
    new information, x:

        Pr(i|x) = p(x|i) Pr(i) / Σj p(x|j) Pr(j)

        posterior probability = likelihood × prior probability / evidence,
        where evidence = p(x) = Σj p(x|j) Pr(j)

    Posterior is prior scaled by likelihood & normalized by evidence
    (so Σi Pr(i|x) = 1)

    Minimize the probability of error by choosing the maximum a posteriori
    (MAP) class:  argmaxi Pr(i|x)
    (sketched numerically below)
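
    A small numerical sketch of the Bayes-rule update and MAP decision above;
    the class count and the likelihood/prior values are made up for illustration.

    # Sketch: posterior update and MAP decision (illustrative numbers, not from the slides)
    import numpy as np

    prior = np.array([0.5, 0.3, 0.2])          # Pr(i) for 3 classes
    likelihood = np.array([0.01, 0.08, 0.02])  # p(x|i) evaluated at the observed x

    evidence = np.sum(likelihood * prior)      # p(x) = sum_j p(x|j) Pr(j)
    posterior = likelihood * prior / evidence  # Pr(i|x); sums to 1
    map_class = int(np.argmax(posterior))      # MAP decision

    print(posterior, map_class)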

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 8 / 25

    Practical implementation

    Optimal classifier is argmaxi Pr(i|x), but we don't know Pr(i|x)

    So, model the conditional distributions p(x|i),
    then use Bayes' rule to find the MAP class

    [Diagram: labeled training examples {xn} → sort according to class →
    estimate conditional pdf for class 1, p(x|1)]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 9 / 25

    Gaussian models

    Model data distributions via a parametric model
    - assume a known form, estimate a few parameters

    E.g. Gaussian in 1 dimension:

        p(x|i) = [1 / (√(2π) σi)] · exp( -1/2 · ((x - μi) / σi)² )

        (the 1 / (√(2π) σi) factor is the normalization)

    For higher dimensions, need a mean vector μi and a d x d covariance matrix Σi

    Fit more complex distributions with mixtures...
    (see the fitting sketch below)

    [Figure: 2-D Gaussian shown as a surface plot and as contours in the plane]
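
    A minimal sketch of the recipe from the last two slides: estimate a Gaussian
    per class, then pick the MAP class. NumPy/SciPy and the synthetic 2-D
    "formant-like" data are my assumptions; none of the numbers come from the
    Pols vowel data.

    # Sketch: per-class Gaussian fit + MAP classification (synthetic data, not the Pols vowels)
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    # Two made-up vowel classes in a 2-D (F1, F2)-like feature space
    train = {0: rng.normal([300, 800], [40, 80], size=(100, 2)),
             1: rng.normal([600, 1200], [50, 100], size=(100, 2))}

    # Estimate mean vector and covariance matrix for each class
    models = {i: multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))
              for i, X in train.items()}
    prior = {i: 0.5 for i in models}             # assume equal priors

    def map_class(x):
        # argmax_i p(x|i) Pr(i)  (the evidence cancels in the argmax)
        scores = {i: models[i].pdf(x) * prior[i] for i in models}
        return max(scores, key=scores.get)

    print(map_class([350, 850]))   # -> 0 for a point near the first class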

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 10 / 25

    Gaussian models for formant data

    Single Gaussians are a reasonable fit for this data

    Extrapolation of decision boundaries can be surprising

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 11 / 25

    Outline

    1. Pattern Recognition for Sounds

    2. Speech Recognition
       - How it's done
       - What works, and what doesn't

    3. Other Audio Applications

    4. Observations and Conclusions

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 12 / 25

    How to recognize speech?

    Cross correlate templates?
    - waveform?
    - spectrogram?
    - time-warp problems

    Classify short segments as phones (or ...), handle time-warp later
    - model with slices of ~10 ms
    - pseudo-piecewise-stationary model of words:

    [Figure: spectrogram (freq/Hz vs. time/s) of a word labeled with the phone
    sequence "sil g w eh n sil"]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 13 / 25

    Speech Recognizer Architecture

    Almost all current systems are the same:

    sound → Feature calculation → feature vectors
    → Acoustic classifier (acoustic model parameters) → phone probabilities
    → HMM decoder (word models, language model) → phone / word sequence
    → Understanding / application...

    - word models: e.g. "sat" → s ah t
    - language model: e.g. p("sat"|"the","cat"), p("saw"|"the","cat")

    Biggest source of improvement is increase in training data
    - .. along with algorithms to take advantage
    (a toy decoding sketch follows below)
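
    The HMM decoder box above is, at its core, a Viterbi search for the most
    likely state (phone) sequence given per-frame acoustic probabilities. A
    minimal sketch with made-up transition and observation probabilities, not
    anything trained:

    # Sketch: Viterbi decoding over HMM states (toy numbers, not a trained acoustic model)
    import numpy as np

    def viterbi(log_trans, log_obs, log_init):
        """log_trans: (S,S) state transitions; log_obs: (T,S) per-frame state
        log-likelihoods (e.g. from an acoustic classifier); log_init: (S,) priors."""
        T, S = log_obs.shape
        delta = log_init + log_obs[0]            # best score ending in each state
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + log_trans  # (prev state, next state)
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_obs[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):            # trace back the best path
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    # 3 toy states, 4 frames
    lt = np.log(np.array([[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]) + 1e-12)
    lo = np.log(np.array([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7]]))
    print(viterbi(lt, lo, np.log(np.array([1.0, 1e-6, 1e-6]))))  # most likely state sequence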

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 14 / 25

    Speech: Progress

    Annual NIST evaluations
    - steady progress (?), but still order(s) of magnitude worse than human listeners

    [Figure: word error rate vs. year, 1990-2005, with axis marks at 30% and 3%]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 15 / 25

    Speech: Problems

    Natural, spontaneous speech is weird!
    - coarticulation
    - deletions
    - disfluencies → is word transcription even a sensible approach?

    Other major problems
    - speaking style, rate, accent
    - environment / background...

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 16 / 25

    Speech: What works, what doesn't

    What works:
    Techniques:
    - MFCC features + GMM/HMM systems trained with Baum-Welch (EM)
    - Using lots of training data
    Domains:
    - Controlled, low-noise environments
    - Constrained, predictable contexts
    - Motivated, co-operative users

    What doesn't work:
    Techniques:
    - rules based on insight
    - perceptual representations (except when they do...)
    Domains:
    - spontaneous, informal speech
    - unusual accents, voice quality, speaking style
    - variable, high-noise background / environment

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 17 / 25

    Outline

    1. Pattern Recognition for Sounds

    2. Speech Recognition

    3. Other Audio Applications
       - Meeting recordings
       - Alarm sounds
       - Music signal processing

    4. Observations and Conclusions

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 18 / 25

    Other Audio Applications: ICSI Meeting Recordings corpus

    Real meetings, 16-channel recordings, 80 hrs
    - released through NIST/LDC

    Classification example: detecting emphasized utterances based on f0 contour
    (Kennedy & Ellis 03)
    - per-speaker normalized f0 as a one-dimensional feature → simple threshold
      classification (see the sketch below)

    [Figure: f0/Hz contours for Speaker 1 and Speaker 2, axis marks from 55 to 1760 Hz]
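
    A minimal sketch of the per-speaker f0 normalization and threshold rule
    described above; the f0 values, the use of the median, and the threshold are
    placeholders of my own, not the parameters used by Kennedy & Ellis.

    # Sketch: emphasis detection by thresholding per-speaker normalized f0
    # (placeholder values; not the actual Kennedy & Ellis parameters)
    import numpy as np

    def is_emphasized(utt_f0, speaker_f0_history, threshold=1.5):
        """utt_f0: f0 track (Hz) for one utterance; speaker_f0_history: that
        speaker's f0 samples so far. Flags the utterance if its median f0 is
        more than `threshold` times the speaker's median f0."""
        norm = np.median(utt_f0) / np.median(speaker_f0_history)
        return norm > threshold

    history = np.array([110.0, 120.0, 115.0, 105.0])                 # speaker's usual pitch
    print(is_emphasized(np.array([180.0, 190.0, 175.0]), history))   # True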

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 19 / 25

    Personal Audio

    LifeLog / MyLifeBits / Remembrance Agent:
    - easy to record everything you hear

    Then what?
    - prohibitive to review
    - applications if access easier?

    Automatic content analysis / indexing...
    - find features to classify into e.g. locations

    [Figure: long-duration spectrograms (freq/Hz and freq/Bark) of several hours of
    personal audio, time axis in minutes and clock time from 14:30 to 18:30]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 20 / 25

    Alarm sound detection

    Alarm sounds have particular structure
    - clear even at low SNRs
    - potential applications...

    Contrast two systems: (Ellis 01)
    - standard, global features, P(X|M)
    - sinusoidal model, fragments, P(M,S|Y)
    - error rates high, but interesting comparisons...

    [Figure: spectrogram of "Restaurant + alarms (snr 0 ns 6 al 8)" (freq/kHz vs.
    time/sec) with MLP classifier output and sound object classifier output]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 21 / 25

    Music signal modeling

    Use a machine listener to navigate large music collections
    - e.g. unsigned bands on MP3.com

    Classification to label:
    - notes, chords, singing, instruments
    - .. information to help cluster music

    Artist models based on feature distributions
    - measure similarity between users' collections and new music?
      (Berenzweig & Ellis 03) - see the sketch below

    [Figure: cepstral-feature distributions (third vs. fifth cepstral coef) for
    madonna and bowie, and an artist space spanning Country to Electronica]
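
    A rough sketch of the "artist models based on feature distributions" idea:
    fit a mixture model to each artist's frame-level features and score new
    material under each model. The use of scikit-learn's GaussianMixture and the
    random stand-in features are my assumptions, not the actual Berenzweig &
    Ellis setup.

    # Sketch: artist similarity via per-artist feature-distribution models
    # (random stand-in features; GaussianMixture is an assumption, not the paper's exact model)
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    # Stand-ins for per-frame cepstral features from two artists' catalogs
    artist_frames = {"artistA": rng.normal(0.0, 1.0, size=(2000, 13)),
                     "artistB": rng.normal(0.5, 1.2, size=(2000, 13))}

    models = {name: GaussianMixture(n_components=8, covariance_type="diag",
                                    random_state=0).fit(X)
              for name, X in artist_frames.items()}

    new_track = rng.normal(0.0, 1.0, size=(500, 13))    # frames of an unseen track
    # Average log-likelihood of the new track under each artist model
    scores = {name: m.score(new_track) for name, m in models.items()}
    print(max(scores, key=scores.get))                  # most similar artist model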

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 22 / 25

    Outline

    1. Pattern Recognition for Sounds

    2. Speech Recognition

    3. Other Audio Applications

    4. Observations and Conclusions
       - Model complexity
       - Sound mixtures

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 23 / 25

    Observations and Conclusions: Training and test data

    Balance model/data size to avoid overfitting:

    Diminishing returns from more data:
    (see the toy example below)

    [Figure 1: error rate vs. training data or parameters, with training-data error
    falling while test-data error turns up in the overfitting region]

    [Figure 2: "WER for PLP12N-8k nets vs. net size & training data" - word error
    rate (%) vs. hidden-layer units (500-4000) and training set hours (9.25-74),
    showing an optimal parameter/data ratio and constant-training-time contours]
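
    A toy illustration of the overfitting point above, using synthetic 1-D
    regression data and polynomial degree as a stand-in for "model size"; this is
    my own illustration, not the WER experiment on the slide.

    # Sketch: compare training and test error as model size grows
    # (synthetic data; the growing train/test gap is the overfitting symptom)
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 30)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)          # training set
    x_te = np.linspace(0, 1, 100)
    y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, 0.3, size=x_te.size)  # test set

    for degree in (1, 3, 6, 9):                      # increasing "model size"
        coef = np.polyfit(x, y, degree)
        tr = np.mean((np.polyval(coef, x) - y) ** 2)
        te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
        print(f"degree {degree}: train MSE {tr:.3f}  test MSE {te:.3f}")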

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 24 / 25

    Beyond classification

    No free lunch: classifier can only do so much
    - always need to consider other parts of the system

    Features
    - impose a ceiling on system performance
    - improved features allow simpler classifiers

    Segmentation / mixtures
    - e.g. speech-in-noise: only a subset of feature dimensions is available
      → missing-data approaches...

    [Figure: mixture Y(t) composed of sources S1(t) and S2(t)]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 25 / 25

    Summary

    Statistical Pattern Recognition
    - exploit training data for probabilistically-correct classifications

    Speech recognition
    - successful application of statistical PR
    - .. but many remaining frontiers

    Other audio applications
    - meetings, alarms, music
    - classification is information extraction

    Current challenges
    - variability in speech
    - acoustic mixtures

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 26 / 25

    Extra slides

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 27 / 25

    Neural network classifiers

    Instead of estimating p(x|i) and using Bayes' rule,
    can also try to estimate the posteriors Pr(i|x) directly (the decision boundaries)

    Sums over nonlinear functions of sums give a large range of decision surfaces...

    e.g. Multi-layer perceptron (MLP):

        yk = F[ Σj wjk · F[ Σi wij · xi ] ]

    Problem is finding the weights wij ... (training)
    (a forward-pass sketch follows below)

    [Figure: MLP with input layer (x1, x2, x3), hidden layer (h1, h2), output layer
    (y1, y2), weights wij and wjk, nonlinearity F[]]
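
    A minimal NumPy sketch of the MLP forward pass yk = F[Σj wjk F[Σi wij xi]];
    the sigmoid nonlinearity and the random weights are illustrative assumptions
    (the slide leaves F and the training method unspecified beyond "finding the
    weights").

    # Sketch: MLP forward pass y_k = F[ sum_j w_jk * F[ sum_i w_ij * x_i ] ]
    # (random weights and a sigmoid F chosen for illustration)
    import numpy as np

    def F(a):                        # nonlinearity; sigmoid assumed here
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(0)
    w_ij = rng.normal(size=(3, 2))   # input (3) -> hidden (2) weights
    w_jk = rng.normal(size=(2, 2))   # hidden (2) -> output (2) weights

    x = np.array([0.2, -0.5, 1.0])   # input feature vector (x1, x2, x3)
    h = F(x @ w_ij)                  # hidden activations h_j
    y = F(h @ w_jk)                  # output activations y_k, e.g. class posteriors
    print(y)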

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 28 / 25

    Neural net classifier

    Models boundaries, not density p(x|i)

    Discriminant training
    - concentrate on boundary regions
    - needs to see all classes at once

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 29 / 25

    Why is Speech Recognition hard?

    Why not match against a set of waveforms?
    - waveforms are never (nearly!) the same twice
    - speakers minimize information/effort in speech

    Speech variability comes from many sources:
    - speaker-dependent (SD) recognizers must handle within-speaker variability
    - speaker-independent (SI) recognizers must also deal with variation between speakers
    - all recognizers are afflicted by background noise, variable channels

    Need recognition models that:
    - generalize, i.e. accept variations in a range, and
    - adapt, i.e. tune in to a particular variant

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 30 / 25

    Within-speaker variability

    Timing variation:
    - word duration varies enormously
    - fast speech reduces vowels

    Speaking style variation:
    - careful/casual articulation
    - soft/loud speech

    Contextual effects:
    - speech sounds vary with context, role: "How do you do?"

    [Figure: spectrogram (frequency vs. time, 0-3.0 s) of "SO I THOUGHT ABOUT THAT
    I THINK IT'S STILL POSSIBLE AND" with its reduced phone transcription]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 31 / 25

    Between-speaker variability

    Accent variation
    - regional / mother tongue

    Voice quality variation
    - gender, age, huskiness, nasality

    Individual characteristics
    - mannerisms, speed, prosody

    [Figure: spectrograms (freq/Hz vs. time/s) of the same material from two
    speakers, mbma0 and fjdm2]

  • Dan Ellis Audio Signal Recognition 2003-11-13 - 32 / 25

    Environment variability

    Background noise
    - fans, cars, doors, papers

    Reverberation
    - boxiness in recordings

    Microphone channel
    - huge effect on relative spectral gain

    [Figure: spectrograms (freq/Hz vs. time/s) of the same utterance from a close
    mic and a tabletop mic]
