Presented by: Fang-Hui Chu
Boosting HMM acoustic models in large vocabulary speech recognition
Carsten Meyer, Hauke Schramm
Philips Research Laboratories, Germany
SPEECH COMMUNICATION 2006
2
AdaBoost introduction
• The AdaBoost algorithm was introduced as a method for transforming a “weak” learning rule into a “strong” one
• The basic idea is to train a series of classifiers based on the classification performance of the previous classifier on the training data
• In multi-class classification, a popular variant is the AdaBoost.M2 algorithm
• AdaBoost.M2 is applicable when a mapping $h_t : X \times Y \to [0,1]$ can be defined for classifier $h_t$ which is related to the classification criterion
3
AdaBoost.M2 (Freund and Schapire, 1997)
(The slide shows the AdaBoost.M2 pseudo-code; training stops if the pseudo-loss $\epsilon_t \geq 1/2$.)
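The AdaBoost.M2 round can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the function name, array shapes, and toy scores are illustrative, but the pseudo-loss and weight update follow Freund and Schapire's published formulas.

```python
import numpy as np

def adaboost_m2_step(D, h_correct, h_wrong):
    """One AdaBoost.M2 round.

    D         : (n, k) distribution over pairs (pattern i, incorrect label y)
    h_correct : (n,)   scores h_t(x_i, y_i) in [0, 1]
    h_wrong   : (n, k) scores h_t(x_i, y) for the incorrect labels y
    Returns (D_next, beta, pseudo_loss); boosting stops if pseudo_loss >= 1/2.
    """
    # Pseudo-loss: eps_t = 1/2 * sum_{i,y} D(i,y) (1 - h(x_i,y_i) + h(x_i,y))
    pseudo_loss = 0.5 * np.sum(D * (1.0 - h_correct[:, None] + h_wrong))
    beta = pseudo_loss / (1.0 - pseudo_loss)
    # Update: D(i,y) <- D(i,y) * beta^{(1/2)(1 + h(x_i,y_i) - h(x_i,y))}
    D_next = D * beta ** (0.5 * (1.0 + h_correct[:, None] - h_wrong))
    D_next /= D_next.sum()  # renormalize to a distribution
    return D_next, beta, pseudo_loss
```

A classifier that scores correct labels higher than incorrect ones yields a pseudo-loss below 1/2 and hence $\beta_t < 1$, so hard pairs gain weight in the next round.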
4
AdaBoost introduction
• The update rule is designed to guarantee an upper bound on the training error of the combined classifier which is exponentially decreasing with the number of individual classifiers
• In multi-class problems, the weights $D_t(i, y)$ are summed up to give a weight $w_t(i)$ for each training pattern $i$:

$$w_t(i) = \sum_{y \neq y_i} D_t(i, y)$$
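Summing the pair weights into per-pattern weights is a one-liner; the numbers below are purely illustrative.

```python
import numpy as np

# Toy weight table D_t(i, y): 3 training patterns, 2 incorrect labels each
# (hypothetical values, for illustration only).
D_t = np.array([[0.10, 0.05],
                [0.20, 0.15],
                [0.30, 0.20]])

# w_t(i) = sum over the incorrect labels y of D_t(i, y)
w_t = D_t.sum(axis=1)
```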
5
Introduction
• Why are there only a few studies so far applying boosting to acoustic model training?
– Speech recognition is an extremely complex large-scale classification problem
• The main motivation to apply AdaBoost to speech recognition is
– Its theoretical foundation, providing explicit bounds on the training error and, in terms of margins, on the generalization error
6
Introduction
• In most previous applications to speech recognition, boosting was applied to classifying each individual feature vector to a phoneme symbol [ICASSP04][Dimitrakakis]
– This requires the phoneme posterior probabilities
• The problem is that conventional HMM speech recognizers do not involve an intermediate phoneme classification step for individual feature vectors
– So the frame-level boosting approach cannot be applied straightforwardly
7
Utterance approach for boosting in ASR
• An intuitive way of applying boosting to HMM speech recognition is at the utterance level
– Thus, boosting is used to improve upon an initial ranking of candidate word sequences
• The utterance approach has two advantages:
– First, it is directly related to the sentence error rate
– Second, it is computationally much less expensive than boosting applied at the level of feature vectors
8
Utterance approach for boosting in ASR
• In the utterance approach, we define the input patterns $x_i$ to be the sequence of feature vectors corresponding to the entire utterance $i$
• $y$ denotes one possible candidate word sequence of the speech recognizer, $y_i$ being the correct word sequence for utterance $i$
• The a posteriori confidence measure is calculated on the basis of the N-best list $L_i$ for utterance $i$:

$$h_t(x_i, y) = \frac{p(y)\, p(x_i \mid y)}{\sum_{z \in L_i} p(z)\, p(x_i \mid z)}$$
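The N-best normalization above is a softmax over combined log-scores. A small sketch, assuming each hypothesis carries one joint log-score $\log(p(y)\,p(x_i \mid y))$ (the function name and input format are illustrative):

```python
import math

def nbest_confidence(log_scores):
    """A-posteriori confidence h_t(x_i, y) over an N-best list L_i.

    log_scores maps each hypothesis y in the N-best list to
    log(p(y) * p(x_i | y)); the result is normalized over the list.
    """
    m = max(log_scores.values())  # shift for numerical stability
    unnorm = {y: math.exp(s - m) for y, s in log_scores.items()}
    z = sum(unnorm.values())
    return {y: u / z for y, u in unnorm.items()}
```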
9
Utterance approach for boosting in ASR
• Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight $w_t(i)$ for each training utterance $i$
• Subsequently, the weights are used in maximum likelihood and discriminative training of the Gaussian mixture models:

$$F_{ML,t} = \sum_{i=1}^{N} w_t(i) \log p(x_i \mid y_i)$$

$$F_{MMI,t} = \sum_{i=1}^{N} w_t(i) \log \frac{p(x_i \mid y_i)\, p(y_i)}{\sum_{y} p(x_i \mid y)\, p(y)}$$
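The weighted criteria amount to multiplying each utterance's contribution by $w_t(i)$. A toy sketch with hypothetical log-scores (in practice the denominator sum over $y$ runs over the N-best list):

```python
import math

def weighted_ml(w, logp_correct):
    """F_ML,t = sum_i w_t(i) * log p(x_i | y_i)."""
    return sum(wi * lp for wi, lp in zip(w, logp_correct))

def weighted_mmi(w, logp_correct_joint, logp_all_joint):
    """F_MMI,t = sum_i w_t(i) * log( p(x_i|y_i) p(y_i) / sum_y p(x_i|y) p(y) ).

    logp_all_joint[i] lists log(p(x_i|y) p(y)) for the competing
    hypotheses y of utterance i.
    """
    total = 0.0
    for wi, lp_c, lps in zip(w, logp_correct_joint, logp_all_joint):
        m = max(lps)  # log-sum-exp for the denominator
        log_denom = m + math.log(sum(math.exp(l - m) for l in lps))
        total += wi * (lp_c - log_denom)
    return total
```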
10
Utterance approach for boosting in ASR
• Some problems are encountered when applying it to large-scale continuous speech applications:
– N-best lists of reasonable length (e.g. N=100) generally contain only a tiny fraction of the possible classification results
• This has two consequences:
– In training, it may lead to sub-optimal utterance weights
– In recognition, Eq. (1) cannot be applied appropriately

$$h_t(x_i, y) = \frac{p(y)\, p(x_i \mid y)}{\sum_{z \in L_i} p(z)\, p(x_i \mid z)}$$

$$H(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \ln\!\left(\frac{1}{\beta_t}\right) h_t(x, y) \qquad (1)$$
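The combined decision rule of Eq. (1) can be sketched directly; the dict-of-confidences data structure is a toy stand-in for real recognizer output.

```python
import math

def combined_hypothesis(h_per_iteration, betas):
    """Eq. (1): H(x) = argmax_y sum_t ln(1/beta_t) * h_t(x, y).

    h_per_iteration: one dict per boosting iteration t, mapping each
    candidate word sequence y to its confidence h_t(x, y).
    """
    scores = {}
    for h_t, beta_t in zip(h_per_iteration, betas):
        alpha_t = math.log(1.0 / beta_t)  # stronger classifiers weigh more
        for y, conf in h_t.items():
            scores[y] = scores.get(y, 0.0) + alpha_t * conf
    return max(scores, key=scores.get)
```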
11
Utterance approach for CSR--Training
• Training
– A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in “chopping” the training data
– For long sentences, this simply means inserting additional sentence break symbols at silence intervals with a given minimum length
– This reduces the number of possible classifications of each sentence “fragment”, so that the resulting N-best lists should cover a sufficiently large fraction of hypotheses
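The chopping step can be sketched on a time-aligned transcript; the `(token, duration)` representation and the `"<sil>"` marker are hypothetical simplifications of a real alignment.

```python
def chop_at_silences(segments, min_sil=0.3):
    """Insert sentence breaks at silence intervals of at least min_sil
    seconds. segments: (token, duration) pairs from a time alignment,
    with "<sil>" marking silence. Returns the sentence fragments.
    """
    fragments, current = [], []
    for token, dur in segments:
        if token == "<sil>":
            if dur >= min_sil and current:  # long silence: break here
                fragments.append(current)
                current = []
        else:
            current.append(token)
    if current:
        fragments.append(current)
    return fragments
```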
12
Utterance approach for CSR--Decoding
• Decoding: lexical approach for model combination
– A single-pass decoding setup, where the combination of the boosted acoustic models is realized at the lexical level
– The basic idea is to add a new pronunciation model by “replicating” the set of phoneme symbols in each boosting iteration (e.g. by appending the suffix “_t” to the phoneme symbol)
– The new phoneme symbols “au”, “au_1”, “au_2”, … represent the underlying acoustic model of boosting iteration $t$
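The replication idea can be sketched as pure string manipulation; the dict-based lexicon is an illustrative simplification of a real decoder lexicon.

```python
def replicate_phonemes(transcription, t):
    """Append the iteration suffix "_t" to every phoneme of a
    pronunciation, e.g. "au f" -> "au_1 f_1" for t = 1."""
    return " ".join(f"{ph}_{t}" for ph in transcription.split())

def extend_lexicon(lexicon, t):
    """Add, for each word, a new transcription over the phoneme set of
    boosting iteration t, keeping the existing variants."""
    return {word: variants + [replicate_phonemes(variants[0], t)]
            for word, variants in lexicon.items()}
```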
13
Utterance approach for CSR--Decoding
• Decoding: lexical approach for model combination (cont.)
– Add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding phoneme set (“sic a”, “sic_1 a_1”, …)
– Use the reweighted training data to train the boosted classifier $M_t$
– Decoding is then performed using the extended lexicon and the set of acoustic models, weighted by their unigram prior probabilities (weighted summation), which are estimated on the training data
14
In more detail
(Figure: the boosting training loop. In boosting iteration t, the utterance weights $w_t(i)$ computed on the phonetically transcribed training corpus drive ML/MMI training of model $M_t$ with phoneme suffix “_t”; the lexicon is extended with the corresponding pronunciation variants (“sic a”, “sic_1 a_1”, …), and the models $M_1, M_2, \ldots, M_t$ are combined, either unweighted or weighted, for decoding.)
Decoding with pronunciation variants:

$$\hat{w}_1^N = \arg\max_{w_1^N} p(w_1^N \mid x) = \arg\max_{w_1^N} \sum_{v_1^N \in R(w_1^N)} p(w_1^N, v_1^N \mid x),$$

where

$$p(w_1^N, v_1^N \mid x) \propto \prod_{i=1}^{N} p(w_i \mid w_{i-m+1}^{i-1})\, p(v_i \mid w_i)\, p(x_i \mid v_i)$$
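For a single word, the sum over pronunciation variants can be sketched as follows; the `(log_prior, variants)` candidate structure and all scores are toy values, not recognizer output.

```python
import math

def word_score(log_prior, variants):
    """log p(w) + log sum_{v in R(w)} p(v|w) p(x|v), computed from
    per-variant log-score pairs (log p(v|w), log p(x|v))."""
    terms = [lv + lx for lv, lx in variants]
    m = max(terms)  # log-sum-exp over the variants
    return log_prior + m + math.log(sum(math.exp(s - m) for s in terms))

def decode(candidates):
    """argmax over candidate words of the variant-summed score;
    candidates maps word -> (log_prior, variants)."""
    return max(candidates, key=lambda w: word_score(*candidates[w]))
```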
16
Weighted model combination
$$p(w_1^N \mid x) \propto \sum_{t=1}^{T} \sum_{v_1^N \in R_t(w_1^N)} \prod_{i=1}^{N} p(w_i \mid w_{i-m+1}^{i-1})\, p(v_i \mid w_i, t)\, p(x_i \mid v_i, t)\, p(t),$$

where, for simplicity, the model prior is set to

$$p(t) \propto \ln\frac{1}{\beta_t}$$
• Word level model combination
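A minimal sketch of the model-prior simplification $p(t) \propto \ln(1/\beta_t)$ used in the weighted combination (in the experiments, the priors are estimated on the training data instead):

```python
import math

def model_priors(betas):
    """Priors over the boosted models, p(t) proportional to ln(1/beta_t),
    normalized to sum to one."""
    raw = [math.log(1.0 / b) for b in betas]
    z = sum(raw)
    return [r / z for r in raw]
```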
17
Experiments
• Isolated word recognition
– Telephone-bandwidth large vocabulary isolated word recognition
– SpeechDat(II) German material
• Continuous speech recognition– Professional dictation and Switchboard
18
Isolated word recognition
• Database:
– Training corpus: consists of 18k utterances (4.3h) of city, company, first and family names
– Evaluations:
• LILI test corpus: 10k single word utterances (3.5h); 10k words lexicon; (matched conditions)
• Names corpus: an in-house collection of 676 utterances (0.5h); two different decoding lexica: 10k lex, 190k lex; (acoustic conditions are matched, whereas there is a lexical mismatch)
• Office corpus: 3.2k utterances (1.5h), recorded over microphone in clean conditions; 20k lexicon; (an acoustic mismatch to the training conditions)
19
Isolated word recognition
• Boosting ML models
20
Isolated word recognition
• Combining boosting and discriminative training
– The experiments in isolated word recognition showed that boosting may improve the best test error rates
21
Continuous speech recognition
• Database
– Professional dictation
• An in-house data collection of real-life recordings of medical reports
• The acoustic training corpus consists of about 58h of data
• Evaluations were carried out on two test corpora:
– Development corpus: 5.0h of speech
– Evaluation corpus: 3.3h of speech
– Switchboard
• Consisting of spontaneous conversations recorded over telephone lines; 57h (73h) of male (female) speech
• Evaluation corpus:
– Containing about 1h (0.5h) of male (female) speech
22
Continuous speech recognition
• Professional dictation:
23
• Switchboard:
24
Conclusions
• In this paper, a boosting approach which can be applied to any HMM-based speech recognizer was presented and evaluated
• The increased recognizer complexity and thus decoding effort of the boosted systems is a major drawback compared to other training techniques like discriminative training
25
References
• [ICASSP02][C.Meyer] Utterance-Level Boosting of HMM Speech Recognizers
• [ICML02][C.Meyer] Towards Large Margin Speech Recognizers by Boosting and Discriminative Training
• [ICSLP00][C.Meyer] Rival Training: Efficient Use of Data in Discriminative Training
• [ICASSP00][Schramm and Aubert] Efficient Integration of Multiple Pronunciations in a Large Vocabulary Decoder