Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 1
Recognition & Organizationof Speech and Audio
Dan EllisElectrical Engineering, Columbia University
Outline
Sound ‘organization’
Background & related work
Existing projects
Future projects
Summary & conclusions
1
2
3
4
5
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 2
Organization of sound mixtures
• Core operation:
Converting continuous, scalar signalinto discrete, symbolic representation
1
0 2 4 6 8 10 12 time/s
frq/Hz
0
2000
4000
Voice (evil)
Stab
Rumble Strings
Choir
Voice (pleasant)
Analysis
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 3
Positioning sound organization
• Draws on many techniques
• Abuts/overlaps various areas
Soundorganization
Speechrecognition Audio
processing
Patternrecognition
Machinelearning
Artificialintelligence
Psychoacoustics& physiologySignal
processing
Image/videoanalysis
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 4
About auditory perception
• Received waveform is a mixture
- two sensors, N signals ...- need knowledge-based constraints
• Psychoacoustics: the study of human sound organization
- ‘auditory scene analysis’ (Bregman’90)
• Auditory perception is ecologically grounded
- scene analysis is preconscious (
→
illusions)- perceived organization:
real-world objects + events (transient)- subjective
not
canonical (ambiguity)
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 5
Key themes for Lab ROSA
• Sound organization
- recovering/constructing abstraction hierarchy- at an instant (sources)- along time (segmentation)
• Scene analysis
- need to find attributes according to objects- use attributes to form objects- ... plus constraints of knowledge
• Exploiting large data sets (the ASR lesson)
- supervised/labelled: pattern recognition- unsupervised: structure discovery, clustering
• Special-purpose cases:
- speech recognition- source-specific recognizers
• ... within a ‘complete explanation’
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 6
Applications for sound organization
What do people do with their ears?
• Robots
- intelligence requires awareness- Sony’s AIBO: dog-hearing
• Human-computer interface
- .. includes knowing when (& why) you’ve failed
• Archive indexing & retrieval
- pure audio archives- true multimedia content analysis
• Content ‘understanding’
- intelligent classification & summarization
• Autonomous monitoring
• Broader ‘structure discovery’ algorithms
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 7
Outline
Sound ‘organization’
Background & related work
- Audio coding & compression- Automatic Speech Recognition- Computational Auditory Scene Analysis- Multimedia information retrieval
Existing projects
Future projects
Summary & conclusions
1
2
3
4
5
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 8
Audio coding & compression
• Goal is reconstruction, not abstraction
• But criteria are ‘subjective’: want same
percept
, not same waveform
• MPEG-Audio:
- filterbanks- information-theoretic coding- psychoacoustic masking of quantization noise
• MPEG-4 ‘Structured Audio’
- computer music synthesis model- instrument definition + control stream- automatic analysis?
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 9
Automatic Speech Recognition (ASR)
• Standard speech recognition structure:
• ‘State of the art’ word-error rates (WERs):
- 2% (dictation) - 30% (telephone conversations)
• Segmentation of speech & nonspeech
- ... recognizer wouldn’t notice!
Featurecalculation
sound
Acousticclassifier
feature vectorsAcoustic model
parameters
HMMdecoder
Understanding/application...
phone probabilities
phone / word sequence
Word models
Language modelp("sat"|"the","cat")p("saw"|"the","cat")
s ah t
D A
T A
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 10
Spoken document retrieval
• Text-based IR on ASR transcripts
- e.g. news broadcasts (CMU’s Informedia, Thisl)
• Recognition errors are not the limiting factor
- TREC-98 results: average precision 0.5
→
0.4
• Weak at word level, but OK over paragraphs
- replay the audio, don’t show the text!
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 11
Computational Auditory Scene Analysis(CASA)
• Implement psychoacoustic theory? (Brown’92)
- what are the features? how are they used?
• Top down constraints are needed (Klassner’96)
inputmixture
signalfeatures
(maps)
discreteobjects
Front end Objectformation
Groupingrules
Sourcegroups
onset
period
frq.mod
time
freq
TIME1.0 2.0 3.0 4.0 sec
Buzzer-Alarm
Glass-Clink
Phone-RingSiren-Chirp
420 Hz460 Hz500 Hz
1475 Hz
2230 Hz
2540 Hz
3760 Hz
2350 Hz
1675 Hz
950 Hz
1.0 2.0 3.0 4.0 secTIME
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 12
Audio Information Retrieval
• Searching in a database of audio
- speech .. use ASR- text annotations .. search them- sound effects library?
• e.g. Muscle Fish “SoundFisher” browser
- define multiple ‘perceptual’ feature dimensions- search by proximity in (weighted) feature space
- features are ‘global’ for each soundfile,no attempt to separate mixtures
- segmentation...
Segmentfeatureanalysis
Sound segmentdatabase
Segmentfeatureanalysis
Seach/comparison
Results
Query example
Feature vectors
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 13
Music analysis
• Automatic transcription (score recovery)
- classic ‘hard problem’: can people do it even?- recent success in reduced forms
e.g. melody, drum track (Goto’00)
• Instrument identification
- ideas from speaker identification (basic PR)+ instrument family hierarchies (Martin’99)
• Fingerprinting
- spot recordings despite noise, distortion- relies on
perceptual
invariants
• Music clustering
- e.g. music recommendation based on signal- correlate objective features with user ratings?
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 14
Multimedia description
• MPEG-7 ‘Metadata’
- MPEG is known for audio/video
compression
standards;
- also develop standards for
search and indexing
• MPEG-7 is a standard format for
metadata
:Well-defined categories for content description
• Focus is on framework & infrastructure
• Audio descriptor categories:
- from ASR- from computer music community- uses still to emerge
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 15
Outline
Sound ‘organization’
Background & related work
Existing projects
- Acoustic change detection- Robust speech recognition- Nonspeech event detection- Prediction-driven CASA
Future projects
Summary & conclusions
1
2
3
4
5
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 16
Acoustic change detection(with Williams/Sheffield, Ferreiros/UPMadrid)
• Approaches:- ‘metric’: find instants of maximal change- ‘model-based’: best alignment of model set
- ‘bayesian’: generate models when warranted
• Typically agnostic about underlying problem- use any features, find any changes
• Good for ASR adaptation, otherwise...
0 0.05 0.1 0.15 0.2 0.25 0.3Dynamism
Dynamism vs. Entropy for 2.5 second segments of speecn/music
0
0.5
1
1.5
2
2.5
3
3.5
Ent
ropy
Speech Music Speech+Music
0
2000
4000
frq/Hz
0 2 4 6 8 10 12 time/s0
20
40
Spectrogram
Posteriors
speech music speech+music
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 17
Speech feature combination(with Bilmes/UW, Hermansky/OGI, ICSI)
• ‘Multistream’ approaches
- streams can correct each other → big gains
• Which feature streams to combine?- low mutual information between classifiers
indicates complementary streams
Feature 1calculation
Acousticclassifier
HMMdecoder
Feature 2calculation
Inputsound
Acousticclassifier
Speechfeatures
Phoneprobabilities
Wordhypotheses
Posteriorcombination
0
0.1
0.2
0.3
0.4
0.5
PLPa PLPb MSGa MSGb CMI/bits
PLP
aP
LPb
MS
Ga
MS
Gb
Conditional Mutual Information between feature streams
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 18
Tandem speech recognition(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU)
• Neural net estimates phone posteriors;but Gaussian mixtures model finer detail
• Combine them!
• 50% relative improvement over GMMs alone- different statistical modeling schemes
get different info from same training data
Speechfeatures
Featurecalculation
Inputsound
Neural netclassifier
Nowaydecoder
Phoneprobabilities
Words
s ah t
C0
C1
C2
Cktn
tn+w
h#pclbcltcldcl
Hybrid Connectionist-HMM ASR
Speechfeatures
Featurecalculation
Inputsound
Gauss mixmodels
HTKdecoder
Subwordlikelihoods
Words
s ah t
Conventional ASR (HTK)
Speechfeatures
Featurecalculation
Inputsound
Neural netclassifier
Phoneprobabilities
C0
C1
C2
Cktn
tn+w
h#pclbcltcldcl
Tandem modeling
Gauss mixmodels
HTKdecoder
Subwordlikelihoods
Words
s ah t
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 19
Alarm sound detection
• Deconstructing sound mixtures
- representation of energy in time-frequency- formation of atomic elements- grouping by common properties (onset &c.)
• Alarm sounds have particular structure- people ‘know them when they hear them’- build a generic detector?
freq
/ H
z
1 1.5 2 2.50
1000
2000
3000
4000
5000
time / sec1 1.5 2 2.5
0
1000
2000
3000
4000
5000 1
1 1.5 2 2.50
1000
2000
3000
4000
5000
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 20
Prediction-driven CASA
• Data-driven (bottom-up) fails for noisy, ambiguous sounds (most mixtures!)
• Need top-down constraints:
- fit vocabulary of generic elements to sound... bottom of a hierarchy?
- account for entire scene- driven by prediction failures- pursue alternative hypotheses
inputmixture
signalfeatures
(maps)
discreteobjects
Front end Objectformation
Groupingrules
Sourcegroups
inputmixture
signalfeatures
predictionerrors
hypotheses
predictedfeaturesFront end Compare
& reconcile
Hypothesismanagement
Predict& combinePeriodic
components
Noisecomponents
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 21
PDCASA example
−70
−60
−50
−40
dB
200400
100020004000
f/Hz Noise1
200400
100020004000
f/Hz Noise2,Click1
200400
100020004000
f/Hz City
0 1 2 3 4 5 6 7 8 9
50100200400
1000
Horn1 (10/10)
Crash (10/10)
Horn2 (5/10)
Truck (7/10)
Horn3 (5/10)
Squeal (6/10)
Horn4 (8/10)Horn5 (10/10)
0 1 2 3 4 5 6 7 8 9time/s
200400
100020004000
f/Hz Wefts1−4
50100200400
1000
Weft5 Wefts6,7 Weft8 Wefts9−12
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 22
Outline
Sound ‘organization’
Background & related work
Existing projects
Future projects- Meeting Recorder- Missing-data recognition & CASA for ASR- Structure from audio-video features- Speech & speaker recognition- Music organization- Audio archive structure discovery
Summary & conclusions
1
2
3
4
5
provisional
concretesomewhat
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 23
Meeting recorder(with ICSI, UW, SRI, IBM)
• Microphones in conventional meetings- for transcription/summarization/retrieval- informal, overlapped speech
• Data collection (ICSI and ...):
• Research: ASR, nonspeech, organization- unprecedented data, new applications
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 24
Missing data recognition & CASA(with Barker, Cooke, Green/Sheffield)
• Missing-data recognition- integrate across ‘don’t-know’ values- ‘perfect’ mask → excellent performance in noise
• Multi-source decoder- Viterbi search of sound-fragment interpretations
• CASA for masks/fragments- larger fragments → quicker search
"1754" + noise
Common Onset/Offset
MultisourceDecoder
Spectro-Temporal Proximity
Mask split into subbands
stationary noise estimate
`Grouping' applied within bands:
"1754"
Mask based on
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 25
Structure from audio-video features(Peng Xu)
• HMM modeling of sports video
• Distribution of camera motion labels, color features- also need within-state sequential structure
• Add features from audio- could be orthogonal/complementary
• Audio feature toolkit?- simple feature vectors, boundaries, classes- wealth of potential applications!
inter replay start middle end
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 26
Speech & speaker recognition
• Words are not enough;Confidence-tagged alternate word hypotheses
• Other useful information:- speaker change detection- speaker characterization- phrasing & timing- prosodic cues to dialog state- laughter, pauses, etc.
• Integration with other analyses- segmentation for adaptation- nonspeech events to ignore- video-derived information...
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 27
Music organization
• Music is a special case- lots of structure- highly significant
• Trick is to find meaningful, tractable questions- boundary between speech and music?
• New (counter-intuitive) approaches?- perceive as whole, not by voice (Scheirer’00)
→ global features for chord structures- generic ‘event’ cues + local feature classification- more provisional notion of instruments/voices
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 28
CASA for audio retrieval
• Muscle Fish system uses global features:
• Mixtures → need elements & objects:
- features calculated on grouped subsets
Segmentfeatureanalysis
Sound segmentdatabase
Segmentfeatureanalysis
Seach/comparison
Results
Query example
Feature vectors
Genericelementanalysis
Continuous audioarchive
Genericelementanalysis
Seach/comparison
Results
Query example
Element representations
Objectformation
(Objectformation)
Word-to-classmapping
Objects + properties
Properties aloneSymbolic query
rushing water
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 29
Audio archive structure discovery
• What can you do with a large unlabeled training set (e.g. multimedia clips from the web)?- bootstrap learning: look for common patterns- have to learn generalizations in parallel:
e.g. self-organizing maps, EM HMMs- post-filtering by humans may find ‘meaning’ in
clusters
• Associated text annotations provide a very small amount of labeling- .. but for a very large number of examples
– sufficient to obtain purchase?- maximize label utility through NLP-type
operations (expansion, disambiguation etc.)- goal is automatic term-to-feature mapping
for term-based content queries
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 30
Outline
Sound ‘organization’
Background & related work
Existing projects
Future projects
Summary & conclusions
1
2
3
4
5
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 31
Summary
DO
MA
INS
AP
PLI
CA
TIO
NS
ROSA
• Broadcast• Movies• Lectures
• Meetings• Personal recordings• Location monitoring
• Speech recognition• Speech characterization• Nonspeech recognition
• Object-based structure discovery & learning
• Scene analysis• Audio-visual integration• Music analysis
• Structuring• Search• Summarization• Awareness• Understanding
LabROSA
ROSAtalk - Dan Ellis 2000-10-13 - 32
Conclusions
• Sound is more than just speech!- speech is a special case- most auditory perceivers don’t understand
speech
• Object-based analysis is critical- it’s what people do- the world presents acoustic mixtures
• Whole-scene representation is the way- it’s what people do- provides mutual constraints of overlap
• Broad range of approachesfor a broad range of phenomena