ICASSP 2009: Acoustic Model Survey
Yueng-Tien,Lo
Discriminative Training of Hierarchical Acoustic Models For Large
Vocabulary Continuous Speech Recognition
Hung-An Chang and James R. Glass
MIT Computer Science and Artificial Intelligence Laboratory
Cambridge, Massachusetts, 02139, USA
{hung_an, glass}@csail.mit.edu
Outline

• Introduction
• Hierarchical Gaussian Mixture Models
• Discriminative Training
• Experiments
• Conclusion
Introduction-1
• In recent years, discriminative training methods have demonstrated considerable success.
• Another approach to improve the acoustic modeling component is to utilize a more flexible model structure by constructing a hierarchical tree for the models.
Introduction-2
• A hierarchical acoustic modeling scheme that can be trained using discriminative methods for LVCSR tasks
• A model hierarchy is constructed by first using a top-down divisive clustering procedure to create a decision tree; hierarchical layers are then selected by using different stopping criteria to traverse the decision tree.
Hierarchical Gaussian Mixture Models
• A tree structure represents the current status of clustering; each node in the tree represents a set of acoustic contexts and their corresponding training data.
• In general, the node and the question are chosen such that a clustering objective, such as the data log-likelihood, is optimized at each step.
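The greedy split selection described above can be sketched as follows. This is an illustrative toy (1-D features, single-Gaussian node likelihoods, hypothetical function names), not the paper's implementation:

```python
import math

def gaussian_ll(values):
    """Log-likelihood of 1-D values under a single ML-fitted Gaussian."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-6)  # floor avoids log(0)
    return sum(-0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)
               for v in values)

def best_split(leaves, questions):
    """Pick the (leaf, question) pair with the largest log-likelihood gain."""
    best_gain, best_choice = 0.0, None
    for i, leaf in enumerate(leaves):
        for q in questions:
            yes = [(c, v) for c, v in leaf if q(c)]
            no = [(c, v) for c, v in leaf if not q(c)]
            if len(yes) < 2 or len(no) < 2:
                continue  # require enough data on both sides of the split
            gain = (gaussian_ll([v for _, v in yes])
                    + gaussian_ll([v for _, v in no])
                    - gaussian_ll([v for _, v in leaf]))
            if gain > best_gain:
                best_gain, best_choice = gain, (i, yes, no)
    return best_choice

def cluster(data, questions, max_leaves):
    """Top-down divisive clustering: split greedily until no gain remains."""
    leaves = [data]
    while len(leaves) < max_leaves:
        choice = best_split(leaves, questions)
        if choice is None:
            break
        i, yes, no = choice
        leaves[i:i + 1] = [yes, no]  # replace the leaf by its two children
    return leaves
```

Here each data point is a (context, value) pair and a question is a yes/no predicate on the context, mirroring the phonetic-question setup in the slides.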
Model Hierarchy Construction
• A hierarchical model can be constructed from a decision tree by running the clustering algorithm multiple times using different stopping criteria.
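One way to realize the multiple-stopping-criteria idea is to cut a single decision tree at several depth limits, collecting one frontier per layer. A minimal sketch, assuming a hypothetical nested-dict tree representation:

```python
def layer_at(node, stop):
    """Frontier of a decision tree under stopping predicate `stop`.

    `node` is a leaf (any non-dict value) or a dict with 'yes'/'no'
    children and a 'depth' field.
    """
    if not isinstance(node, dict) or stop(node):
        return [node]
    return layer_at(node['yes'], stop) + layer_at(node['no'], stop)

def build_hierarchy(tree, depth_limits):
    """One layer per stopping criterion; here, increasing depth limits."""
    return [layer_at(tree, lambda n, d=d: n['depth'] >= d) for d in depth_limits]
```

Any monotone stopping criterion (minimum data count, likelihood-gain threshold) can replace the depth test without changing the traversal.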
Model Scoring
• The leaf nodes of a hierarchical model represent the set of its output acoustic labels.
• The log-likelihood of a feature vector x with respect to a label c can be computed by

l_c(x) = log( Σ_{m=1}^{M_c} w_{c,m} N(x; μ_{c,m}, σ_{c,m}) ),

where m is the index of the mixture component, M_c is the total number of mixture components of c, w_{c,m} is the mixture weight of the m-th component of c, and N(x; μ_{c,m}, σ_{c,m}) is the multivariate Gaussian density function with mean vector μ_{c,m} and standard deviation σ_{c,m}.
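The GMM log-likelihood above can be computed stably with the log-sum-exp trick. A small self-contained sketch (1-D diagonal-covariance case, hypothetical function names):

```python
import math

def log_gaussian(x, mean, std):
    """Log of a 1-D Gaussian density."""
    return (-0.5 * math.log(2 * math.pi * std ** 2)
            - (x - mean) ** 2 / (2 * std ** 2))

def gmm_log_likelihood(x, weights, means, stds):
    """l_c(x) = log( sum_m w_m N(x; mu_m, sigma_m) ), via log-sum-exp."""
    terms = [math.log(w) + log_gaussian(x, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)  # factor out the largest term to avoid underflow
    return top + math.log(sum(math.exp(t - top) for t in terms))
```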
Discriminative Training : MCE training
• MCE training seeks to minimize the number of incorrectly recognized utterances in the training set by increasing the difference between the log-likelihood score of the correct transcription and that of an incorrect hypothesis.
• The per-utterance loss can be written as

Loss(X_n, λ) = log( 1 + exp( ς d(X_n, Y_n, S_n, λ) ) ),

where d is the misclassification measure comparing the score of the correct transcription against the competing hypotheses, Y_n denotes the transcription of the n-th utterance, S_n denotes the set of N-best incorrect hypotheses, N denotes the size of the N-best list, and ς is set to 1.0 in our experiments.
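The sigmoid-smoothed MCE loss for one utterance can be sketched as below, assuming the misclassification measure is simply the score gap between the best incorrect hypothesis and the correct transcription (a simplification of the paper's N-best formulation):

```python
import math

def mce_loss(correct_score, incorrect_scores, zeta=1.0):
    """Smoothed MCE loss for a single utterance.

    d = best incorrect log-likelihood minus the correct transcription's
    log-likelihood; the loss log(1 + exp(zeta * d)) shrinks toward zero
    as the margin of the correct answer grows.
    """
    d = max(incorrect_scores) - correct_score
    return math.log1p(math.exp(zeta * d))
```

Because the loss is smooth in the scores, it can be minimized with gradient methods, which is what the next slide exploits.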
Discriminative Training with Hierarchical Model
• Given the loss function L, the gradient can be decomposed into a combination of the gradients of the individual acoustic scores, as would be done for discriminative training of non-hierarchical models:

∂L/∂λ = Σ_{n=1}^{N} Σ_{x∈X_n} Σ_p (∂L/∂a(x, p)) (∂a(x, p; λ)/∂λ),

where a(x, p; λ) denotes the acoustic model score for feature x and output label p. For a hierarchical model, the score is a weighted combination over the K levels of the hierarchy:

a(x, p; λ) = Σ_{k=1}^{K} w_k l(x, A_k(p); λ),

where w_k denotes the relative weight of the k-th level and A_k(p) denotes the ancestor label of p at level k.
• As a result, training a hierarchical model reduces to first computing the gradients with respect to all acoustic scores, as would be done when training a non-hierarchical model, and then distributing the contribution of each gradient across the levels according to w_k.
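The score combination and gradient distribution can be sketched as follows; the table-based interfaces (`x_scores`, `ancestors`) are hypothetical stand-ins for the model's per-level scores and the tree's ancestor mapping:

```python
def hierarchical_score(x_scores, ancestors, weights, p):
    """a(x, p) = sum_k w_k * l_k(x, A_k(p)): weighted sum of per-level scores.

    x_scores[k][label] holds the level-k score l_k for a given label;
    ancestors[k][p] maps leaf label p to its ancestor class at level k.
    """
    return sum(w * x_scores[k][ancestors[k][p]] for k, w in enumerate(weights))

def distribute_gradient(grad_a, ancestors, weights, p, grads):
    """Chain rule: dL/dl_k(A_k(p)) += w_k * dL/da(x, p), per level k."""
    for k, w in enumerate(weights):
        label = ancestors[k][p]
        grads[k][label] = grads[k].get(label, 0.0) + w * grad_a
```

Note how a single leaf-level gradient fans out to every ancestor, which is how the hierarchy shares training signal across levels.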
Experiments- MIT Lecture Corpus
• The MIT Lecture Corpus contains audio recordings and manual transcriptions for approximately 300 hours of MIT lectures from eight different courses, and nearly 100 MITWorld seminars given on a variety of topics.
SUMMIT Recognizer
• The SUMMIT landmark-based speech recognizer first computes a set of perceptually important time points as landmarks based on an acoustic difference measure, and extracts a feature vector around each landmark.
• The acoustic landmarks are represented by a set of diphone labels to model the left and right contexts of the landmarks.
• The diphones are clustered using top-down decision tree clustering and are modeled by a set of GMM parameters.
Baseline Models
• Although discriminative training always reduced the WERs, the improvements became less effective as the number of mixture components increased.
• This fact suggests that discriminative training has made the model over-fit the training data as the number of parameters increases.
Hierarchical Models
• Table 1 summarizes the WERs of the hierarchical model before and after discriminative training.
• The statistics of log-likelihood differences show that the hierarchy prevents large decreases in the log-likelihood and can potentially make the model more resilient to over-fitting.
Conclusion
• In this paper, we have described how to construct a hierarchical model for LVCSR tasks using top-down decision tree clustering, and how to combine a hierarchical acoustic model with discriminative training.
• In the future, we plan to further improve the model by applying a more flexible hierarchical structure.
Discriminative Pronunciation Learning Using Phonetic Decoder and
Minimum-Classification-Error Criterion
Oriol Vinyals, Li Deng, Dong Yu, and Alex Acero
Microsoft Research, Redmond, WA
International Computer Science Institute, Berkeley, CA
Outline

• Introduction
• Discriminative Pronunciation Learning
• Experimental Evaluation
• Discussions and Future Work
Introduction-1
• Standard pronunciations are derived from existing dictionaries and do not cover the full diversity of ways in which the same word can be pronounced.
• On the one hand, alternative pronunciations, obtained typically by manual addition or by maximum likelihood learning, increase the coverage of pronunciation variability.
• On the other hand, they may also lead to greater confusability between different lexical items.
Introduction-2
• To overcome the difficulty of increased lexical confusability while introducing alternative pronunciations, one can use discriminative learning to intentionally minimize the confusability.
• In our work, we directly exploit the MCE for selecting the more “discriminative” pronunciation alternatives, where these alternatives are derived from high-quality N-best lists or lattices in the phonetic recognition results.
Discriminative Pronunciation Learning-1
• In speech recognition exploiting alternative pronunciations, the decision rule can be approximated by
Ŵ = argmax_W p(X|W) p(W)
  ≈ argmax_W max_q p(S_q, X|W) p(W)
  = argmax_W max_q p(S_q|W) p(X|S_q, W) p(W)
  ≈ argmax_W max_q p(S_q|W) p(X|S_q) p(W),

• where X is the sequence of acoustic observations (generally feature vectors extracted from the speech waveform), W is a sequence of hypothesized words or a sentence, and q = 1, 2, . . . , N is the index to multiple phone sequences that are alternative pronunciations of sentence W. Each S_q is associated with the q-th path in the lattice of word pronunciations, and has an implicit dependency on W.
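The final decision rule, maximizing over pronunciation alternatives in the log domain, might look like the sketch below; the probability callables are hypothetical interfaces, not the recognizer's actual API:

```python
import math

def decode(X, words, pronunciations, p_x_given_s, p_s_given_w, p_w):
    """W_hat = argmax_W max_q p(S_q|W) p(X|S_q) p(W), scored in the log domain."""
    best_word, best_score = None, -math.inf
    for w in words:
        for s in pronunciations[w]:  # max over alternative pronunciations q
            score = (math.log(p_s_given_w(s, w))
                     + math.log(p_x_given_s(X, s))
                     + math.log(p_w(w)))
            if score > best_score:
                best_word, best_score = w, score
    return best_word
```

Taking the max over q, rather than summing, is the Viterbi-style approximation stated in the last line of the decision rule.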
Discriminative Pronunciation Learning-2
• Selects pronunciation(s) from the phonetic recognition results for each word such that the sentence error rate in the training data is minimized.
The discriminant for hypothesis W_j and the misclassification measure are

D_j(X_r, Λ) = max_q p(S_q|W_j) P(X_r|S_q) p(W_j),

d_r(X_r) = -log D_i(X_r, Λ) + log[ (1/M) Σ_{j≠i} D_j(X_r, Λ) ],

and the per-sentence loss is the sigmoid

l(d_r(X_r)) = 1 / (1 + e^{-d_r(X_r)}).
Discriminative Pronunciation Learning-3
• where X_r is the observation from the r-th sentence or utterance, W_i refers to the correct sequence of words for the r-th sentence, W_j refers to all the incorrect hypotheses (obtained from N-best lists produced by the word decoder), M is the total number of hypotheses considered for a particular sentence, and Λ is the family of parameters to be estimated to optimize the objective function.
Discriminative Pronunciation Learning-4
• We now define the “MCE score”:
• Note that Δ(p, w) is an approximation of the number of corrected errors after using the new pronunciation p.
• Thus, if the value is positive, then there is an improvement of the recognizer performance measured by training-set error reduction with the new pronunciation p for word w.
Δ(p, w) = l(d | Λ) - l(d | Λ_(p,w)),

where (p, w) is a pronunciation-word pair.
Discriminative Pronunciation Learning-5
• One of the limitations of finding the pairs in a greedy way is the assumption that errors corrected by a particular pair are uncorrelated with errors corrected by any other pair.
S = { (p, w) ∈ T | Δ(p, w) > 0 },

where the set T denotes all possible (p, w) pairs.
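The greedy selection of pairs with a positive MCE score can be sketched as a simple filter over the candidate set T; `delta` is a hypothetical stand-in for the Δ(p, w) estimate:

```python
def select_pairs(candidates, delta):
    """Greedily keep every (pronunciation, word) pair with positive MCE score.

    Treats pairs as independent: the errors corrected by one pair are assumed
    uncorrelated with those corrected by any other, which is the greedy
    assumption the slide points out as a limitation.
    """
    return [(p, w) for p, w in candidates if delta(p, w) > 0]
```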
Experimental Evaluation
• In the experimental evaluation of the discriminative pronunciation learning technique described in the preceding section, we used the Windows-Live-Search-for-Mobile (WLS4M) database, which consists of very large quantities of short utterances from spoken queries to the WLS4M engine.
Baseline System
• A standard HMM-GMM speech recognizer provided as part of the Microsoft Speech API (SAPI)
• The acoustic model is trained on about 3000 hours of speech, using PLP features and an HLDA transformation.
• A bigram language model is used, trained on realistic system deployment data.
• The recognizer has a dictionary with 64,000 distinct entries.
Performance results
• The comparative performance results between the baseline recognizer and the new one with MCE-based pronunciation learning are summarized in Table 2.
Discussions and Future Work
• A maximum-likelihood-based approach to generating multiple pronunciations creates greater flexibility in the ways people pronounce words, but the addition of new entries often vastly enlarges confusability across different lexical items.
• This is because the maximum likelihood learning does not take into account such lexical confusability.