An overview of Speech Recognition research at ICSI
Dan Ellis, International Computer Science Institute, Berkeley CA
ASR for TICSP - Dan Ellis 1999sep02
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
3. Overview of current projects
4. Some details: combinations, News retrieval, SpeechCorder
5. Conclusions
About ICSI
http://www.icsi.berkeley.edu/

• Founded 1988 as ‘portal’ between U.S. and European academic systems
- attached to UC Berkeley (but independent)
- about 100 people at any time

• Infrastructure funding from European government/industry
- .. in return for hosting visitors
- Germany, Italy, Spain, Switzerland, Netherlands

• Research groups consist of staff, visitors and UC Berkeley graduate students

• Original vision: ‘massively parallel systems’
- has diversified since then
Groups at ICSI

• “Realization” (Real Speech Collective?)
- Nelson Morgan, Steve Greenberg, Dan Ellis
- Speech recognition for realistic conditions
- Systems support for ASR applications

• ACIRI (AT&T Center for Internet Research at ICSI)
- Network routing, traffic, security

• Theory
- mathematical complexity, coding theory

• Applications
- natural language understanding
- connectionist models of cognition

• Networks
- multimedia communications applications
ICSI Realization highlights

• Connectionist framework for ASR (Morgan & Bourlard)

• RASTA front-end processing (Hermansky & Morgan)

• The Ring Array Processor (RAP)

• SPERT/T0
- SBUS boards with custom 8-way vector µprocessor
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
- visualizing speech recognition
- building a recognizer
- some issues in ASR
3. Overview of current projects
4. Some details
5. Conclusions
The Hybrid Connectionist-HMM system

• Conventional ASR: symbols S, observations X:

  S* = argmax_S P(S | X)
     = argmax_S P(S, X) / P(X)
     = argmax_S ∏_i P(Xi | Si) · P(Si | Si-1)

• P(Xi | Si) is the acoustic likelihood model, e.g. GMM

• Connectionist approach replaces it with the posterior, P(Si | Xi)
[Figure: system block diagram - input sound → feature calculation over windows tn .. tn+w → neural network acoustic model → posterior probabilities for context-independent phone classes (C0, C1, C2 .. Ck; h#, pcl, bcl, tcl, dcl ..) → HMM decoder, using word pronunciation models and a grammar → word sequence hypotheses]
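In hybrid decoding, the network's posteriors are commonly converted to scaled likelihoods by dividing by the class priors, since P(Xi | Si) ∝ P(Si | Xi) / P(Si). A minimal numpy sketch of that conversion (the array names are illustrative, not from the talk):

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, priors, floor=1e-8):
    """Convert per-frame phone posteriors P(q|x) into scaled
    likelihoods P(x|q)/P(x) = P(q|x)/P(q) for HMM decoding."""
    posteriors = np.asarray(posteriors, dtype=float)
    priors = np.maximum(np.asarray(priors, dtype=float), floor)
    return posteriors / priors  # broadcasts the prior over frames

# Toy example: 2 frames x 3 phone classes
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])
scaled = posteriors_to_scaled_likelihoods(post, priors)
```

The flooring of the priors only guards against division by zero for classes absent from the training set.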
Visualizing speech recognition

• Speech as a sequence of discrete symbols qi

[Figure: recognition pipeline - sound → front end → feature vectors → phone classifier (acoustic models) → label probabilities → HMM decoder (word models, grammar) → phone & word labelling]
Building a recognizer

• Define pronunciation models
- application vocabulary
- standard dictionaries + phonetic rules?

• Build language model
- P(Wi | Wi-1, ...)
- count n-grams in example texts?

• Train acoustic model
- choose features to suit conditions
- train neural net on large labeled corpus
- relabel & retrain?
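Estimating P(Wi | Wi-1) by counting n-grams, as in the language-model step above, can be sketched in a few lines (maximum-likelihood bigrams only, no smoothing; the function and data are illustrative):

```python
from collections import Counter

def bigram_lm(sentences):
    """Estimate P(w | prev) by counting bigrams in example text.
    Toy MLE sketch: a real system would smooth and back off."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])                   # history counts
        bigrams.update(zip(padded[:-1], padded[1:]))   # pair counts
    return lambda w, prev: (bigrams[(prev, w)] / unigrams[prev]
                            if unigrams[prev] else 0.0)

p = bigram_lm([["the", "dow", "rose"], ["the", "dow", "fell"]])
```

Here `p("dow", "the")` is 1.0 because every occurrence of "the" is followed by "dow" in the toy corpus.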
How much training data?

• The bulk of recent improvements derives from larger training sets

• Largest model: 2.5M parameters, 32M patterns
= 24 days to train on TetraSPERT, 10^15 ops

[Figure: WER% for PLP12N nets vs. net size & training data - hidden layers of 500 to 4000 units, training sets of 9.25 to 74 hours; WER falls from roughly 44% to 32% as both grow]
Some issues in ASR

• ‘Spectrogram reading’ paradigm
- short-time features, concatenative models

• Goal: classifier accuracy / Word Error Rate (WER)
- objective measures, but quite opaque
- normalization vs. generalization

• HMM requires search over all word sequences
- dominates processing time in large-vocabulary tasks

• The best solution (e.g. features) depends on the task
- RASTA plus delta-features good for small vocab
- plain normalized PLP best for Broadcast News
- modulation spectrum features best for combo...

• Key challenge = Robustness
- to: task, acoustic conditions, speaking rate, style, accent ...
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
3. Overview of current projects
- Front-end features
- Pronunciation and language modeling
- Modeling alternatives
- Other topics
4. Some details
5. Conclusions
The modulation-filtered spectrogram
(Brian Kingsbury)

• Goal: invariance to variable acoustics
- filter out irrelevant modulations
- channel adaptation (on-line automatic gain control)
- multiple representations

• Comparison:

[Figure: processing diagram - speech → Bark-scale power-spectral filterbank → lowpass (0-16 Hz) and bandpass (2-16 Hz) envelope filtering paths, each followed by automatic gain control stages with time constants τ = 160 ms and 320 ms → lowpass and bandpass features; spectrogram panels compare plp log-spectral features with msg1a features over ~3.5 s of speech]
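The core idea of the comparison above - keep only slow envelope modulations in each subband - can be sketched with a simple lowpass filter over the envelope trajectories. This is an illustration of the principle under stated assumptions (2nd-order Butterworth, zero-phase filtering), not the exact MSG front end:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def modulation_lowpass(envelopes, frame_rate_hz, cutoff_hz=16.0):
    """Lowpass-filter each subband envelope trajectory over time,
    keeping only modulations below cutoff_hz."""
    b, a = butter(2, cutoff_hz / (frame_rate_hz / 2.0))
    return filtfilt(b, a, envelopes, axis=0)  # filter along time axis

# Toy input: 200 frames x 4 subbands at 100 frames/s,
# carrying a 4 Hz modulation that should pass a 16 Hz cutoff
t = np.arange(200) / 100.0
env = np.outer(np.sin(2 * np.pi * 4 * t), np.ones(4))
smooth = modulation_lowpass(env, frame_rate_hz=100.0)
```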
Data-driven feature design
(Mike Shire)

• ‘Optimal’ features for different conditions
- subband envelope domain
- linear-discriminant analysis (LDA) for filter coeffs
- separation of labeled classes is optimized

• Modulation-frequency domain responses for clean, reverb, mixture:

[Figure: magnitude responses (dB, roughly 0 to -30 dB, vs. modulation frequency in Hz on a log axis) of the 1st and 2nd discriminant filters derived from clean, light reverb, severe reverb, and clean+severe reverb training conditions]
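The LDA step can be sketched directly: treat each temporal window of a subband envelope as a feature vector, and find the projections that maximize between-class over within-class scatter - each resulting direction is, in effect, an FIR filter on the envelope. A numpy-only Fisher LDA sketch (the data here are synthetic, purely for illustration):

```python
import numpy as np

def lda_directions(X, y):
    """Fisher LDA: directions maximizing between-class scatter
    relative to within-class scatter. X is (samples, dims); if dims
    are taps of a temporal window, each column of the result is a
    discriminant filter."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    return vecs.real[:, order]  # columns sorted by discriminability

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(2, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
W = lda_directions(X, y)
```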
Automatic pronunciation extraction
(Eric Fosler)

• Canonical pronunciations are too limited

• Phonetic rules overproduce (→ homonyms)

• Filter candidates against acoustics:

• Decision trees for canonical → real mapping

[Figure: FSG building - a dictionary (jones=jh.., dow=d aw, the=dh ax) plus phone trees and letter-to-phone trees generate a finite-state grammar with alternatives (e.g. d aw / d aa / ax d aw for "dow"); aligning the FSG against the acoustics yields a phone alignment (dh dh dh dh ax ax dcl d ..) and per-phone acoustic confidences (dh=0.24 ax=0.4 dcl=0.81 ..); the FSG decoder outputs word strings such as "The Dow Jones"]
Topic modeling (Latent Semantic Analysis)
(Dan Gildea & Thomas Hofmann)

• Bayesian model:
- p(word | doc) = ∑t p(word | topic) p(topic | doc)
- EM modeling of p(word | topic) & p(topic | doc) over training set
- p(topic | doc) estimated from context in recognition

• Use to modify language model weights
- p(word) ∝ ptri(word) · ptop(word) / puni(word)
- Trigram language model perplexity of 109 reduced 17%

• Use for topic segmentation?
Multiband systems
(Adam Janin / Nikki Mirghafori)

• Separate recognizers look at different bands
- Fletcher/Allen model of human speech recog.
- noise/corruption in one channel is limited
- how to combine results?

• Weighted average of all possible combos
- p(S | a,b,c,d) = ∑B p(S | B,a,b,c,d) · p(B)
- B ranges over 16 possible band combinations
- p(B) from? constant, local feature (entropy)

[Figure: multiband architecture - the front end splits the speech signal into four subbands; per-band MLP probability estimators feed a merger MLP, whose outputs drive a Viterbi decoder producing the recognized words]
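The weighted average over band subsets can be sketched as follows. Since the talk does not specify the per-subset models p(S | B, ...), this sketch substitutes a geometric mean of the included bands' posteriors as a stand-in estimate - that substitution is an assumption of the example, not the ICSI system:

```python
import numpy as np
from itertools import chain, combinations

def multiband_merge(band_posteriors, p_B=None):
    """Weighted average over all non-empty subsets B of bands:
    p(S|a,b,c,d) = sum_B p(S|B,...) p(B). Each subset's estimate is
    the renormalized geometric mean of its bands' posteriors
    (an illustrative stand-in for trained per-subset models)."""
    n_bands, n_classes = band_posteriors.shape
    subsets = list(chain.from_iterable(
        combinations(range(n_bands), k) for k in range(1, n_bands + 1)))
    if p_B is None:
        p_B = np.ones(len(subsets)) / len(subsets)  # constant weights
    merged = np.zeros(n_classes)
    for w, B in zip(p_B, subsets):
        est = np.exp(np.log(band_posteriors[list(B)] + 1e-12).mean(0))
        merged += w * est / est.sum()
    return merged / merged.sum()

bands = np.array([[0.8, 0.1, 0.1],   # per-band phone posteriors
                  [0.6, 0.3, 0.1],
                  [0.1, 0.1, 0.8],   # one corrupted band disagrees
                  [0.7, 0.2, 0.1]])
p = multiband_merge(bands)
```

The averaging over subsets limits the damage the one corrupted band can do to the final decision.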
Buried Markov Models
(Jeff Bilmes)

• Increasing the scope of the classifier input

• Add state-dependent sparse links:

[Figure: HMM state sequence Qt-1 = qt-1, Qt = qt, Qt+1 = qt+1, Qt+2 = qt+2 over observation streams X and Y, with additional sparse dependency links between observations]

• How to add links?
- maximum conditional mutual information
- greedy algorithm

• How to model?
- linear dependence as first attempt
Connectionist speaker recognition
(Dominique Genoud)

• Use neural networks to model speakers rather than phones?

• Specialize a phone classifier for a particular speaker?

• Do both at once for “Twin-output MLP”:

[Figure: a 9-frame context window (t0-4 .. t0+4) of 12 features each, 9x12 = 108 inputs, feeds an MLP of size 108:2000:107; the output layer holds 53 'speaker' phones, 53 'world' phones, and non-speech, producing posterior probability vectors]
Acoustic Segment Classification
(Gethin Williams)

• Features from posteriors show utterance type:
- average per-frame entropy
- ‘dynamism’ - mean squared 1st-order difference
- average energy of ‘silence’ label
- covariance matrix distance to clean speech

• 1.3% error on 2.5 second speech-music test set

• Use for finding segment boundaries?

[Figure: spectrogram (0-4000 Hz) and phone posteriors over ~12 s of speech / music / speech+music, and a scatter plot of dynamism (0-0.3) vs. entropy (0-3.5) for 2.5 second segments, with the three classes forming distinct clusters]
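The first two posterior-based features above are easy to compute from a (frames × classes) posterior matrix. A sketch (the toy "speech-like" and "music-like" matrices are illustrative):

```python
import numpy as np

def posterior_stats(posteriors):
    """Average per-frame entropy and 'dynamism' (mean squared
    1st-order difference) of a (frames, classes) posterior matrix."""
    p = np.clip(posteriors, 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1).mean()
    dynamism = (np.diff(posteriors, axis=0) ** 2).mean()
    return entropy, dynamism

# Speech-like posteriors: confident and switching between classes.
# Music-like posteriors: flat and static.
speech = np.eye(4)[[0, 0, 1, 1, 2, 3]]
music = np.full((6, 4), 0.25)
e_s, d_s = posterior_stats(speech)
e_m, d_m = posterior_stats(music)
```

Speech gives low entropy and high dynamism; non-speech that the classifier cannot fit gives the opposite, which is what makes the scatter plot separable.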
Perceptual experiments
(Steven Greenberg et al.)

• Use spectrally-sparse speech: the signal is reduced to four narrow spectral ‘slits’

• Effect of temporal alignment between bands:

[Figure: waveform and spectrogram (0-8 kHz) of the original signal and of slits 1-4, with intelligibility (WER%) compared for synchronized slits and for slits de-synchronized by 25, 50 and 75 ms]
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
3. Overview of current projects
4. Some details:
- Information stream combination
- Broadcast News spoken-document retrieval
- SpeechCorder PDA
5. Conclusions
4.1 Information stream combinations

• Task: AURORA noisy digits
- continuous digits
- test: 4 noise types x 7 SNR levels
- train: mixed clean/noisy data

• Feature design evaluation
- intermediate representation for mobile phones
- evaluation specifies GMM-HMM (HTK) system

• Baseline results:

Feature   System   WER ratio
mfcc      HTK      100.0%
plp       Hybrid    89.6%
msg       Hybrid    87.1%
msg       HTK      205.0%
msg KG    HTK      184.5%

• Can we combine features advantageously?
Combination schemes

• Simple probability combination works best:
P(qi | X1, X2) = P(qi | X1) · P(qi | X2) / P(qi) ... if X1 ⊥ X2 | q

[Figure: three architectures for combining two feature streams - (1) hypothesis combination: each feature calculation feeds its own acoustic classifier and HMM decoder, and the two sets of word hypotheses are merged into final hypotheses; (2) posterior combination: the two classifiers' phone probabilities are merged before a single HMM decoder; (3) feature combination: the two feature streams are merged before a single acoustic classifier and HMM decoder]
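The posterior-combination rule on this slide is a one-liner per frame; a sketch with illustrative toy values:

```python
import numpy as np

def combine_posteriors(p1, p2, priors):
    """Posterior combination under a conditional independence
    assumption (X1 ⊥ X2 | q):
    P(q|X1,X2) ∝ P(q|X1) P(q|X2) / P(q), renormalized."""
    combined = p1 * p2 / np.maximum(priors, 1e-12)
    return combined / combined.sum(axis=-1, keepdims=True)

# Two streams agreeing on class 0 yield a sharper combined posterior
p_plp = np.array([0.6, 0.3, 0.1])
p_msg = np.array([0.7, 0.2, 0.1])
priors = np.array([1 / 3, 1 / 3, 1 / 3])
p = combine_posteriors(p_plp, p_msg, priors)
```

When the streams agree, the product sharpens the decision; when they disagree, it hedges - which is the behavior the WER results below reflect.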
Posterior multiplication

Features    System          WER ratio
plp + msg   Feature combo    74.1%
plp + msg   Prob. combo      63.0%
plp + msg   HTK on probs.    51.6%
4.2 Spoken document retrieval

• Based on DARPA/NIST Broadcast News

• Training material recorded off-air
- ABC, CNN, CSPAN, NPR
- 200 hour training set (TREC: 550 hour archive)
- training: word transcriptions + speaker time boundaries

• Best WER results:
- 1996: HTK: 27%
- 1997: HTK: 16% (but: easier; 22% on 1996 eval)
- 1998: LIMSI: 14% (SPRACH: 21%)

• Some clear conclusions
- one classifier for all conditions (or male/female)
- feature adaptation (VTLN, MLLR, SAT)
- importance of segmentation
- more data is useful
Applications for BN systems
CTION
DER
ODE OF VE PARTY OD BUYS IS
ASR for TICSP - Dan Ellis 1999sep02 - 26
• Live transcription- subtitles- transcripts- but: more than words?
• Video editing- precision word-time alignments- commercial systems by IBM, Virage, etc.
• Information Retrieval (IR)- TREC/MUC ‘spoken documents’- tolerant of word error rate, e.g.:
F0: THE VERY EARLY RETURNS OF THE NICARAGUAN PRESIDENTIAL ELESEEMED TO FADE BEFORE THE LOCAL MAYOR ON A LOT OF LAW
F4: AT THIS STAGE OF THE ACCOUNTING FOR SEVENTY SCOTCH ONE LEADANIEL ORTEGA IS IN SECOND PLACE THERE WERE TWENTY THREE PRESIDENTIAL CANDIDATES OF THE ELECTION
F5: THE LABOR MIGHT DO WELL TO REMEMBER THE LOST A MAJOR EPISTRANSATLANTIC CONNECT TO A CORPORATION IN BOTH CONSERVATIOFFICIALS FROM BRITAIN GOING TO WASHINGTON THEY WENT TO WOGEORGE BUSH ON HOW TO WIN A SECOND TO NONE IN LONDON THISSTEPHEN BEARD FOR MARKETPLACE
Thematic Indexing of Spoken Language
(Thisl)

• EC collaboration, BBC providing data

• 1000+ hr archive data

• IR is key factor
- stop lists
- weighting schemes
- query expansion

[Figure: system diagram - a receiver takes in the broadcast control, text, audio and video streams; segmentation and ASR feed NLP and IR components over a database/archive, accessed via http from the Thisl GUI in response to queries]

Thisl GUI demo

• Tcl/Tk front-end to Thisl IR engine

• Spoken query input: SPRACH demo / Abbot

• NLP integration: prolog lattice parser
4.3 SpeechCorder

• Convergence of interesting problems:
- ubiquitous PDAs
- multimedia processing
- very fast, low-power CPU design
- resource-bound speech recognition

• ICSI / UC Berkeley / MIT collaboration
- ICSI: speech & audio processing
- UCB: user interface design
- MIT: new low-power CPUs

• Current proposal
- PDA
- meeting / memo recorder
- .. also for sociological study?
(new Human Centered Computing consortium)

The SpeechCorder GUI

• Live annotation of recognized speech

• Application issues:
- correcting errorful transcriptions
- finding places in the recording
- annotations (speaker, notes, emphasis)
SpeechCorder: Audio

• Speech recognition
- not close-mic’d
- speaker ID
- low-power / low-memory / vectorized
- non-local processing for ‘best’ transcript?

• Other audio issues
- identifying speech versus nonspeech
- finding / indexing nonspeech events...
Outline

1. About ICSI
2. Hybrid connectionist-HMM speech recognition
3. Overview of current projects
4. Some details
5. Conclusions
- more criticisms of ASR
- future work
Conclusions

• The downside of objective evaluation
- research priority has been pragmatic goal: reduce WER
- human speech recog. uses many constraints
- grammatic/semantic constraints implicit in word sequence statistics (grammar)
- automatic analysis of large corpora is possible & helpful

• The problems with a grammar
- unexpected (unseen) phrases are discounted
- highly brittle alternatives
- masks underlying performance

• A more scientific approach
- first work on the underlying phoneme classifier
- follow nonsense syllable performance (Fletcher)
The signal model in speech recognition

• Systems & approach have been optimized for the speech-alone situation
- minimize classifier parameters, maximize use of ‘feature space’
- e.g. cepstra [example]

• Possibly non-lexical data thrown away
- pitch
- timing/rhythm
- speaker identification

• Dire consequences
- .. dealing with nonspeech sounds
- .. distinguishing success & failure

• Popular focus of research
- e.g. segmental models, pitch features
- fail to obtain robust improvements
Future work

• Continue improving robustness
- better features
- better pronunciations
- better modeling

• Still looking for a good architecture
- multiband
- multistream
- more adaptation
- more contextual dependence

• Speech recognition: useful for applications
- archive indexing, summarizing
- personal devices, new interfaces
- tie-in to general audio analysis...