Presented by: Fang-Hui Chu
Boosting HMM acoustic models in large vocabulary speech recognition
Carsten Meyer, Hauke Schramm
Philips Research Laboratories, Germany
SPEECH COMMUNICATION 2006
2
AdaBoost introduction
• The AdaBoost algorithm was introduced as a method for transforming a “weak” learning rule into a “strong” one
• The basic idea is to train a series of classifiers based on the classification performance of the previous classifier on the training data
• In multi-class classification, a popular variant is the AdaBoost.M2 algorithm
• AdaBoost.M2 is applicable when a mapping $h_t : X \times Y \to [0,1]$ can be defined for classifier $h_t$ which is related to the classification criterion
3
AdaBoost.M2 (Freund and Schapire, 1997)
(The slide shows the AdaBoost.M2 pseudo-code; training stops if the pseudo-loss $\epsilon_t \geq 1/2$.)
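The AdaBoost.M2 round can be sketched as follows. This is a minimal numpy sketch, not the paper's implementation: the function name, array shapes, and toy scores are illustrative, but the pseudo-loss and weight update follow Freund and Schapire's published formulas.

```python
import numpy as np

def adaboost_m2_step(D, h_correct, h_wrong):
    """One AdaBoost.M2 round.

    D         : (n, k) distribution over pairs (pattern i, incorrect label y)
    h_correct : (n,)   scores h_t(x_i, y_i) in [0, 1]
    h_wrong   : (n, k) scores h_t(x_i, y) for the incorrect labels y
    Returns (D_next, beta, pseudo_loss); boosting stops if pseudo_loss >= 1/2.
    """
    # Pseudo-loss: eps_t = 1/2 * sum_{i,y} D(i,y) (1 - h(x_i,y_i) + h(x_i,y))
    pseudo_loss = 0.5 * np.sum(D * (1.0 - h_correct[:, None] + h_wrong))
    beta = pseudo_loss / (1.0 - pseudo_loss)
    # Update: D(i,y) <- D(i,y) * beta^{(1/2)(1 + h(x_i,y_i) - h(x_i,y))}
    D_next = D * beta ** (0.5 * (1.0 + h_correct[:, None] - h_wrong))
    D_next /= D_next.sum()  # renormalize to a distribution
    return D_next, beta, pseudo_loss
```

A classifier that scores correct labels higher than incorrect ones yields a pseudo-loss below 1/2 and hence $\beta_t < 1$, so hard pairs gain weight in the next round.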
4
AdaBoost introduction
• The update rule is designed to guarantee an upper bound on the training error of the combined classifier which is exponentially decreasing with the number of individual classifiers
• In multi-class problems, the weights $D_t(i, y)$ are summed up to give a weight $w_t(i)$ for each training pattern $i$:

$$w_t(i) = \sum_{y \neq y_i} D_t(i, y)$$
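Summing the pair weights into per-pattern weights is a one-liner; the numbers below are purely illustrative.

```python
import numpy as np

# Toy weight table D_t(i, y): 3 training patterns, 2 incorrect labels each
# (hypothetical values, for illustration only).
D_t = np.array([[0.10, 0.05],
                [0.20, 0.15],
                [0.30, 0.20]])

# w_t(i) = sum over the incorrect labels y of D_t(i, y)
w_t = D_t.sum(axis=1)
```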
5
Introduction
• Why are there only a few studies so far applying boosting to acoustic model training?
– Speech recognition is an extremely complex large-scale classification problem
• The main motivation to apply AdaBoost to speech recognition is
– Its theoretical foundation, providing explicit bounds on the training error and, in terms of margins, on the generalization error
6
Introduction
• In most previous applications to speech recognition, boosting was applied to classifying each individual feature vector to a phoneme symbol [ICASSP04][Dimitrakakis]
– This requires the phoneme posterior probabilities
• The problem is that conventional HMM speech recognizers do not involve an intermediate phoneme classification step for individual feature vectors
– So the frame-level boosting approach cannot be applied straightforwardly
7
Utterance approach for boosting in ASR
• An intuitive way of applying boosting to HMM speech recognition is at the utterance level
– Thus, boosting is used to improve upon an initial ranking of candidate word sequences
• The utterance approach has two advantages:
– First, it is directly related to the sentence error rate
– Second, it is computationally much less expensive than boosting applied at the level of feature vectors
8
Utterance approach for boosting in ASR
• In the utterance approach, we define the input patterns $x_i$ to be the sequence of feature vectors corresponding to the entire utterance $i$
• $y$ denotes one possible candidate word sequence of the speech recognizer, $y_i$ being the correct word sequence for utterance $i$
• The a posteriori confidence measure is calculated on the basis of the N-best list $L_i$ for utterance $i$:

$$h_t(x_i, y) = \frac{p(y)\, p(x_i \mid y)}{\sum_{z \in L_i} p(z)\, p(x_i \mid z)}$$
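The N-best normalization above is a softmax over combined log-scores. A small sketch, assuming each hypothesis carries one joint log-score $\log(p(y)\,p(x_i \mid y))$ (the function name and input format are illustrative):

```python
import math

def nbest_confidence(log_scores):
    """A-posteriori confidence h_t(x_i, y) over an N-best list L_i.

    log_scores maps each hypothesis y in the N-best list to
    log(p(y) * p(x_i | y)); the result is normalized over the list.
    """
    m = max(log_scores.values())  # shift for numerical stability
    unnorm = {y: math.exp(s - m) for y, s in log_scores.items()}
    z = sum(unnorm.values())
    return {y: u / z for y, u in unnorm.items()}
```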
9
Utterance approach for boosting in ASR
• Based on the confidence values and the AdaBoost.M2 algorithm, we calculate an utterance weight $w_t(i)$ for each training utterance $i$
• Subsequently, the weights are used in maximum likelihood and discriminative training of the Gaussian mixture models:

$$F_{ML,t} = \sum_{i=1}^{N} w_t(i) \log p(x_i \mid y_i)$$

$$F_{MMI,t} = \sum_{i=1}^{N} w_t(i) \log \frac{p(x_i \mid y_i)\, p(y_i)}{\sum_{y} p(x_i \mid y)\, p(y)}$$
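The weighted criteria amount to multiplying each utterance's contribution by $w_t(i)$. A toy sketch with hypothetical log-scores (in practice the denominator sum over $y$ runs over the N-best list):

```python
import math

def weighted_ml(w, logp_correct):
    """F_ML,t = sum_i w_t(i) * log p(x_i | y_i)."""
    return sum(wi * lp for wi, lp in zip(w, logp_correct))

def weighted_mmi(w, logp_correct_joint, logp_all_joint):
    """F_MMI,t = sum_i w_t(i) * log( p(x_i|y_i) p(y_i) / sum_y p(x_i|y) p(y) ).

    logp_all_joint[i] lists log(p(x_i|y) p(y)) for the competing
    hypotheses y of utterance i.
    """
    total = 0.0
    for wi, lp_c, lps in zip(w, logp_correct_joint, logp_all_joint):
        m = max(lps)  # log-sum-exp for the denominator
        log_denom = m + math.log(sum(math.exp(l - m) for l in lps))
        total += wi * (lp_c - log_denom)
    return total
```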
10
Utterance approach for boosting in ASR
• Some problems are encountered when applying it to large-scale continuous speech applications:
– N-best lists of reasonable length (e.g. N=100) generally contain only a tiny fraction of the possible classification results
• This has two consequences:
– In training, it may lead to sub-optimal utterance weights
– In recognition, Eq. (1) cannot be applied appropriately

$$h_t(x_i, y) = \frac{p(y)\, p(x_i \mid y)}{\sum_{z \in L_i} p(z)\, p(x_i \mid z)}$$

$$H(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \ln\!\left(\frac{1}{\beta_t}\right) h_t(x, y) \qquad (1)$$
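The combined decision rule of Eq. (1) can be sketched directly; the dict-of-confidences data structure is a toy stand-in for real recognizer output.

```python
import math

def combined_hypothesis(h_per_iteration, betas):
    """Eq. (1): H(x) = argmax_y sum_t ln(1/beta_t) * h_t(x, y).

    h_per_iteration: one dict per boosting iteration t, mapping each
    candidate word sequence y to its confidence h_t(x, y).
    """
    scores = {}
    for h_t, beta_t in zip(h_per_iteration, betas):
        alpha_t = math.log(1.0 / beta_t)  # stronger classifiers weigh more
        for y, conf in h_t.items():
            scores[y] = scores.get(y, 0.0) + alpha_t * conf
    return max(scores, key=scores.get)
```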
11
Utterance approach for CSR--Training
• Training
– A convenient strategy to reduce the complexity of the classification task and to provide more meaningful N-best lists consists in “chopping” the training data
– For long sentences, this simply means inserting additional sentence break symbols at silence intervals with a given minimum length
– This reduces the number of possible classifications of each sentence “fragment”, so that the resulting N-best lists should cover a sufficiently large fraction of hypotheses
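The chopping step can be sketched on a time-aligned transcript; the `(token, duration)` representation and the `"<sil>"` marker are hypothetical simplifications of a real alignment.

```python
def chop_at_silences(segments, min_sil=0.3):
    """Insert sentence breaks at silence intervals of at least min_sil
    seconds. segments: (token, duration) pairs from a time alignment,
    with "<sil>" marking silence. Returns the sentence fragments.
    """
    fragments, current = [], []
    for token, dur in segments:
        if token == "<sil>":
            if dur >= min_sil and current:  # long silence: break here
                fragments.append(current)
                current = []
        else:
            current.append(token)
    if current:
        fragments.append(current)
    return fragments
```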
12
Utterance approach for CSR--Decoding
• Decoding: lexical approach for model combination
– A single-pass decoding setup, where the combination of the boosted acoustic models is realized at the lexical level
– The basic idea is to add a new pronunciation model by “replicating” the set of phoneme symbols in each boosting iteration (e.g. by appending the suffix “_t” to the phoneme symbol)
– The new phoneme symbols “au”, “au_1”, “au_2”, … represent the underlying acoustic model of boosting iteration $t$
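The replication idea can be sketched as pure string manipulation; the dict-based lexicon is an illustrative simplification of a real decoder lexicon.

```python
def replicate_phonemes(transcription, t):
    """Append the iteration suffix "_t" to every phoneme of a
    pronunciation, e.g. "au f" -> "au_1 f_1" for t = 1."""
    return " ".join(f"{ph}_{t}" for ph in transcription.split())

def extend_lexicon(lexicon, t):
    """Add, for each word, a new transcription over the phoneme set of
    boosting iteration t, keeping the existing variants."""
    return {word: variants + [replicate_phonemes(variants[0], t)]
            for word, variants in lexicon.items()}
```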
13
Utterance approach for CSR--Decoding
• Decoding: lexical approach for model combination (cont.)
– Add to each phonetic transcription in the decoding lexicon a new transcription using the corresponding phoneme set (“sic a”, “sic_1 a_1”, …)
– Use the reweighted training data to train the boosted classifier $M_t$
– Decoding is then performed using the extended lexicon and the set of acoustic models, weighted by their unigram prior probabilities (weighted summation), which are estimated on the training data
14
In more detail
(Figure: the boosting training loop. In boosting iteration t, the utterance weights $w_t(i)$ computed on the phonetically transcribed training corpus drive ML/MMI training of model $M_t$ with phoneme suffix “_t”; the lexicon is extended with the corresponding pronunciation variants (“sic a”, “sic_1 a_1”, …), and the models $M_1, M_2, \ldots, M_t$ are combined, either unweighted or weighted, for decoding.)
Decoding with pronunciation variants:

$$\hat{w}_1^N = \arg\max_{w_1^N} p(w_1^N \mid x) = \arg\max_{w_1^N} \sum_{v_1^N \in R(w_1^N)} p(w_1^N, v_1^N \mid x),$$

where

$$p(w_1^N, v_1^N \mid x) \propto \prod_{i=1}^{N} p(w_i \mid w_{i-m+1}^{i-1})\, p(v_i \mid w_i)\, p(x_i \mid v_i)$$
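For a single word, the sum over pronunciation variants can be sketched as follows; the `(log_prior, variants)` candidate structure and all scores are toy values, not recognizer output.

```python
import math

def word_score(log_prior, variants):
    """log p(w) + log sum_{v in R(w)} p(v|w) p(x|v), computed from
    per-variant log-score pairs (log p(v|w), log p(x|v))."""
    terms = [lv + lx for lv, lx in variants]
    m = max(terms)  # log-sum-exp over the variants
    return log_prior + m + math.log(sum(math.exp(s - m) for s in terms))

def decode(candidates):
    """argmax over candidate words of the variant-summed score;
    candidates maps word -> (log_prior, variants)."""
    return max(candidates, key=lambda w: word_score(*candidates[w]))
```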
16
Weighted model combination
$$p(w_1^N \mid x) \propto \sum_{t=1}^{T} \sum_{v_1^N \in R_t(w_1^N)} \prod_{i=1}^{N} p(w_i \mid w_{i-m+1}^{i-1})\, p(v_i \mid w_i, t)\, p(x_i \mid v_i, t)\, p(t),$$

where, for simplicity, the model prior is set to

$$p(t) \propto \ln\frac{1}{\beta_t}$$
• Word level model combination
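A minimal sketch of the model-prior simplification $p(t) \propto \ln(1/\beta_t)$ used in the weighted combination (in the experiments, the priors are estimated on the training data instead):

```python
import math

def model_priors(betas):
    """Priors over the boosted models, p(t) proportional to ln(1/beta_t),
    normalized to sum to one."""
    raw = [math.log(1.0 / b) for b in betas]
    z = sum(raw)
    return [r / z for r in raw]
```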
17
Experiments
• Isolated word recognition
– Telephone-bandwidth large vocabulary isolated word recognition
– SpeechDat(II) German material
• Continuous speech recognition– Professional dictation and Switchboard
18
Isolated word recognition
• Database:
– Training corpus: consists of 18k utterances (4.3h) of city, company, first and family names
– Evaluations:
• LILI test corpus: 10k single word utterances (3.5h); 10k words lexicon; (matched conditions)
• Names corpus: an in-house collection of 676 utterances (0.5h); two different decoding lexica: 10k lex, 190k lex; (acoustic conditions are matched, whereas there is a lexical mismatch)
• Office corpus: 3.2k utterances (1.5h), recorded over microphone in clean conditions; 20k lexicon; (an acoustic mismatch to the training conditions)
19
Isolated word recognition
• Boosting ML models
20
Isolated word recognition
• Combining boosting and discriminative training
– The experiments in isolated word recognition showed that boosting may improve the best test error rates
21
Continuous speech recognition
• Database
– Professional dictation
• An in-house data collection of real-life recordings of medical reports
• The acoustic training corpus consists of about 58h of data
• Evaluations were carried out on two test corpora:
– Development corpus: 5.0h of speech
– Evaluation corpus: 3.3h of speech
– Switchboard
• Consisting of spontaneous conversations recorded over telephone lines; 57h (73h) of male (female) speech
• Evaluation corpus:
– Containing about 1h (0.5h) of male (female) speech
22
Continuous speech recognition
• Professional dictation:
23
• Switchboard:
24
Conclusions
• In this paper, a boosting approach which can be applied to any HMM-based speech recognizer was presented and evaluated
• The increased recognizer complexity and thus decoding effort of the boosted systems is a major drawback compared to other training techniques like discriminative training
25
References
• [ICASSP02][C.Meyer] Utterance-Level Boosting of HMM Speech Recognizers
• [ICML02][C.Meyer] Towards Large Margin Speech Recognizers by Boosting and Discriminative Training
• [ICSLP00][C.Meyer] Rival Training: Efficient Use of Data in Discriminative Training
• [ICASSP00][Schramm and Aubert] Efficient Integration of Multiple Pronunciations in a Large Vocabulary Decoder