Electrical Engineering, Columbia University of Speech and ...dpwe/talks/ROSAtalk-2000-10.pdfChoir Voice (pleasant) Analysis. Lab ROSA ROSAtalk - Dan Ellis 2000-10-13 - 3 ... - speech

LabROSA

ROSAtalk - Dan Ellis 2000-10-13 - 1

Recognition & Organizationof Speech and Audio

Dan EllisElectrical Engineering, Columbia University

<[email protected]>

Outline

Sound ‘organization’

Background & related work

Existing projects

Future projects

Summary & conclusions

1

2

3

4

5

LabROSA


Organization of sound mixtures

• Core operation:

Converting continuous, scalar signalinto discrete, symbolic representation

1

0 2 4 6 8 10 12 time/s

frq/Hz

0

2000

4000

Voice (evil)

Stab

Rumble Strings

Choir

Voice (pleasant)

Analysis

LabROSA


Positioning sound organization

• Draws on many techniques

• Abuts/overlaps various areas

Soundorganization

Speechrecognition Audio

processing

Patternrecognition

Machinelearning

Artificialintelligence

Psychoacoustics& physiologySignal

processing

Image/videoanalysis

LabROSA


About auditory perception

• Received waveform is a mixture

- two sensors, N signals ...- need knowledge-based constraints

• Psychoacoustics: the study of human sound organization

- ‘auditory scene analysis’ (Bregman’90)

• Auditory perception is ecologically grounded

- scene analysis is preconscious (

→

illusions)- perceived organization:

real-world objects + events (transient)- subjective

not

canonical (ambiguity)

LabROSA


Key themes for Lab ROSA

• Sound organization

- recovering/constructing abstraction hierarchy- at an instant (sources)- along time (segmentation)

• Scene analysis

- need to find attributes according to objects- use attributes to form objects- ... plus constraints of knowledge

• Exploiting large data sets (the ASR lesson)

- supervised/labelled: pattern recognition- unsupervised: structure discovery, clustering

• Special-purpose cases:

- speech recognition- source-specific recognizers

• ... within a ‘complete explanation’

LabROSA


Applications for sound organization

What do people do with their ears?

• Robots

- intelligence requires awareness- Sony’s AIBO: dog-hearing

• Human-computer interface

- .. includes knowing when (& why) you’ve failed

• Archive indexing & retrieval

- pure audio archives- true multimedia content analysis

• Content ‘understanding’

- intelligent classification & summarization

• Autonomous monitoring

• Broader ‘structure discovery’ algorithms

LabROSA


Outline



- Audio coding & compression- Automatic Speech Recognition- Computational Auditory Scene Analysis- Multimedia information retrieval

Existing projects

Future projects


1

2

3

4

5

LabROSA


Audio coding & compression

• Goal is reconstruction, not abstraction

• But criteria are ‘subjective’: want same

percept

, not same waveform

• MPEG-Audio:

- filterbanks- information-theoretic coding- psychoacoustic masking of quantization noise

• MPEG-4 ‘Structured Audio’

- computer music synthesis model- instrument definition + control stream- automatic analysis?

LabROSA


Automatic Speech Recognition (ASR)

• Standard speech recognition structure:

• ‘State of the art’ word-error rates (WERs):

- 2% (dictation) - 30% (telephone conversations)

• Segmentation of speech & nonspeech

- ... recognizer wouldn’t notice!

Featurecalculation

sound

Acousticclassifier

feature vectorsAcoustic model

parameters

HMMdecoder

Understanding/application...

phone probabilities

phone / word sequence

Word models

Language modelp("sat"|"the","cat")p("saw"|"the","cat")

s ah t

D A

T A

LabROSA


Spoken document retrieval

• Text-based IR on ASR transcripts

- e.g. news broadcasts (CMU’s Informedia, Thisl)

• Recognition errors are not the limiting factor

- TREC-98 results: average precision 0.5

→

0.4

• Weak at word level, but OK over paragraphs

- replay the audio, don’t show the text!

LabROSA


Computational Auditory Scene Analysis(CASA)

• Implement psychoacoustic theory? (Brown’92)

- what are the features? how are they used?

• Top down constraints are needed (Klassner’96)

inputmixture

signalfeatures

(maps)

discreteobjects

Front end Objectformation

Groupingrules

Sourcegroups

onset

period

frq.mod

time

freq

TIME1.0 2.0 3.0 4.0 sec

Buzzer-Alarm

Glass-Clink

Phone-RingSiren-Chirp

420 Hz460 Hz500 Hz

1475 Hz

2230 Hz

2540 Hz

3760 Hz

2350 Hz

1675 Hz

950 Hz

1.0 2.0 3.0 4.0 secTIME

LabROSA


Audio Information Retrieval

• Searching in a database of audio

- speech .. use ASR- text annotations .. search them- sound effects library?

• e.g. Muscle Fish “SoundFisher” browser

- define multiple ‘perceptual’ feature dimensions- search by proximity in (weighted) feature space

- features are ‘global’ for each soundfile,no attempt to separate mixtures

- segmentation...

Segmentfeatureanalysis

Sound segmentdatabase


Seach/comparison

Results

Query example

Feature vectors

LabROSA


Music analysis

• Automatic transcription (score recovery)

- classic ‘hard problem’: can people do it even?- recent success in reduced forms

e.g. melody, drum track (Goto’00)

• Instrument identification

- ideas from speaker identification (basic PR)+ instrument family hierarchies (Martin’99)

• Fingerprinting

- spot recordings despite noise, distortion- relies on

perceptual

invariants

• Music clustering

- e.g. music recommendation based on signal- correlate objective features with user ratings?

LabROSA


Multimedia description

• MPEG-7 ‘Metadata’

- MPEG is known for audio/video

compression

standards;

- also develop standards for

search and indexing

• MPEG-7 is a standard format for

metadata

:Well-defined categories for content description

• Focus is on framework & infrastructure

• Audio descriptor categories:

- from ASR- from computer music community- uses still to emerge

LabROSA


Outline



Existing projects

- Acoustic change detection- Robust speech recognition- Nonspeech event detection- Prediction-driven CASA

Future projects


1

2

3

4

5

LabROSA


Acoustic change detection(with Williams/Sheffield, Ferreiros/UPMadrid)

• Approaches:- ‘metric’: find instants of maximal change- ‘model-based’: best alignment of model set

- ‘bayesian’: generate models when warranted

• Typically agnostic about underlying problem- use any features, find any changes

• Good for ASR adaptation, otherwise...

0 0.05 0.1 0.15 0.2 0.25 0.3Dynamism

Dynamism vs. Entropy for 2.5 second segments of speecn/music

0

0.5

1

1.5

2

2.5

3

3.5

Ent

ropy

Speech Music Speech+Music

0

2000

4000

frq/Hz

0 2 4 6 8 10 12 time/s0

20

40

Spectrogram

Posteriors

speech music speech+music

LabROSA


Speech feature combination(with Bilmes/UW, Hermansky/OGI, ICSI)

• ‘Multistream’ approaches

- streams can correct each other → big gains

• Which feature streams to combine?- low mutual information between classifiers

indicates complementary streams

Feature 1calculation

Acousticclassifier

HMMdecoder

Feature 2calculation

Inputsound

Acousticclassifier

Speechfeatures

Phoneprobabilities

Wordhypotheses

Posteriorcombination

0

0.1

0.2

0.3

0.4

0.5

PLPa PLPb MSGa MSGb CMI/bits

PLP

aP

LPb

MS

Ga

MS

Gb

Conditional Mutual Information between feature streams

LabROSA


Tandem speech recognition(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU)

• Neural net estimates phone posteriors;but Gaussian mixtures model finer detail

• Combine them!

• 50% relative improvement over GMMs alone- different statistical modeling schemes

get different info from same training data

Speechfeatures

Featurecalculation

Inputsound

Neural netclassifier

Nowaydecoder

Phoneprobabilities

Words

s ah t

C0

C1

C2

Cktn

tn+w

h#pclbcltcldcl

Hybrid Connectionist-HMM ASR

Speechfeatures

Featurecalculation

Inputsound

Gauss mixmodels

HTKdecoder

Subwordlikelihoods

Words

s ah t

Conventional ASR (HTK)

Speechfeatures

Featurecalculation

Inputsound

Neural netclassifier

Phoneprobabilities

C0

C1

C2

Cktn

tn+w

h#pclbcltcldcl

Tandem modeling

Gauss mixmodels

HTKdecoder

Subwordlikelihoods

Words

s ah t

LabROSA


Alarm sound detection

• Deconstructing sound mixtures

- representation of energy in time-frequency- formation of atomic elements- grouping by common properties (onset &c.)

• Alarm sounds have particular structure- people ‘know them when they hear them’- build a generic detector?

freq

/ H

z

1 1.5 2 2.50

1000

2000

3000

4000

5000

time / sec1 1.5 2 2.5

0

1000

2000

3000

4000

5000 1

1 1.5 2 2.50

1000

2000

3000

4000

5000

LabROSA


Prediction-driven CASA

• Data-driven (bottom-up) fails for noisy, ambiguous sounds (most mixtures!)

• Need top-down constraints:

- fit vocabulary of generic elements to sound... bottom of a hierarchy?

- account for entire scene- driven by prediction failures- pursue alternative hypotheses

inputmixture

signalfeatures

(maps)

discreteobjects

Front end Objectformation

Groupingrules

Sourcegroups

inputmixture

signalfeatures

predictionerrors

hypotheses

predictedfeaturesFront end Compare

& reconcile

Hypothesismanagement

Predict& combinePeriodic

components

Noisecomponents

LabROSA


PDCASA example

−70

−60

−50

−40

dB

200400

100020004000

f/Hz Noise1

200400

100020004000

f/Hz Noise2,Click1

200400

100020004000

f/Hz City

0 1 2 3 4 5 6 7 8 9

50100200400

1000

Horn1 (10/10)

Crash (10/10)

Horn2 (5/10)

Truck (7/10)

Horn3 (5/10)

Squeal (6/10)

Horn4 (8/10)Horn5 (10/10)

0 1 2 3 4 5 6 7 8 9time/s

200400

100020004000

f/Hz Wefts1−4

50100200400

1000

Weft5 Wefts6,7 Weft8 Wefts9−12

LabROSA


Outline



Existing projects

Future projects- Meeting Recorder- Missing-data recognition & CASA for ASR- Structure from audio-video features- Speech & speaker recognition- Music organization- Audio archive structure discovery


1

2

3

4

5

provisional

concretesomewhat

LabROSA


Meeting recorder(with ICSI, UW, SRI, IBM)

• Microphones in conventional meetings- for transcription/summarization/retrieval- informal, overlapped speech

• Data collection (ICSI and ...):

• Research: ASR, nonspeech, organization- unprecedented data, new applications

LabROSA


Missing data recognition & CASA(with Barker, Cooke, Green/Sheffield)

• Missing-data recognition- integrate across ‘don’t-know’ values- ‘perfect’ mask → excellent performance in noise

• Multi-source decoder- Viterbi search of sound-fragment interpretations

• CASA for masks/fragments- larger fragments → quicker search

"1754" + noise

Common Onset/Offset

MultisourceDecoder

Spectro-Temporal Proximity

Mask split into subbands

stationary noise estimate

`Grouping' applied within bands:

"1754"

Mask based on

LabROSA


Structure from audio-video features(Peng Xu)

• HMM modeling of sports video

• Distribution of camera motion labels, color features- also need within-state sequential structure

• Add features from audio- could be orthogonal/complementary

• Audio feature toolkit?- simple feature vectors, boundaries, classes- wealth of potential applications!

inter replay start middle end

LabROSA


Speech & speaker recognition

• Words are not enough;Confidence-tagged alternate word hypotheses

• Other useful information:- speaker change detection- speaker characterization- phrasing & timing- prosodic cues to dialog state- laughter, pauses, etc.

• Integration with other analyses- segmentation for adaptation- nonspeech events to ignore- video-derived information...

LabROSA


Music organization

• Music is a special case- lots of structure- highly significant

• Trick is to find meaningful, tractable questions- boundary between speech and music?

• New (counter-intuitive) approaches?- perceive as whole, not by voice (Scheirer’00)

→ global features for chord structures- generic ‘event’ cues + local feature classification- more provisional notion of instruments/voices

LabROSA


CASA for audio retrieval

• Muscle Fish system uses global features:

• Mixtures → need elements & objects:

- features calculated on grouped subsets


Sound segmentdatabase


Seach/comparison

Results

Query example

Feature vectors

Genericelementanalysis

Continuous audioarchive

Genericelementanalysis

Seach/comparison

Results

Query example

Element representations

Objectformation

(Objectformation)

Word-to-classmapping

Objects + properties

Properties aloneSymbolic query

rushing water

LabROSA


Audio archive structure discovery

• What can you do with a large unlabeled training set (e.g. multimedia clips from the web)?- bootstrap learning: look for common patterns- have to learn generalizations in parallel:

e.g. self-organizing maps, EM HMMs- post-filtering by humans may find ‘meaning’ in

clusters

• Associated text annotations provide a very small amount of labeling- .. but for a very large number of examples

– sufficient to obtain purchase?- maximize label utility through NLP-type

operations (expansion, disambiguation etc.)- goal is automatic term-to-feature mapping

for term-based content queries

LabROSA


Outline



Existing projects

Future projects


1

2

3

4

5

LabROSA


Summary

DO

MA

INS

AP

PLI

CA

TIO

NS

ROSA

• Broadcast• Movies• Lectures

• Meetings• Personal recordings• Location monitoring

• Speech recognition• Speech characterization• Nonspeech recognition

• Object-based structure discovery & learning

• Scene analysis• Audio-visual integration• Music analysis

• Structuring• Search• Summarization• Awareness• Understanding

LabROSA


Conclusions

• Sound is more than just speech!- speech is a special case- most auditory perceivers don’t understand

speech

• Object-based analysis is critical- it’s what people do- the world presents acoustic mixtures

• Whole-scene representation is the way- it’s what people do- provides mutual constraints of overlap

• Broad range of approachesfor a broad range of phenomena

Documents

Electrical Engineering, Columbia University of Speech and ...dpwe/talks/ROSAtalk-2000-10.pdfChoir Voice (pleasant) Analysis. Lab ROSA ROSAtalk - Dan Ellis 2000-10-13 - 3 ... - speech