ICASSP 2009: Acoustic Model Survey
Yueng-Tien,Lo
Discriminative Training of Hierarchical Acoustic Models For Large
Vocabulary Continuous Speech Recognition
Hung-An Chang and James R. Glass
MIT Computer Science and Artificial Intelligence Laboratory
Cambridge, Massachusetts, 02139, USA
{hung_an, glass}@csail.mit.edu
Outline

• Introduction
• Hierarchical Gaussian Mixture Models
• Discriminative Training
• Experiments
• Conclusion
Introduction-1
• In recent years, discriminative training methods have demonstrated considerable success.
• Another approach to improve the acoustic modeling component is to utilize a more flexible model structure by constructing a hierarchical tree for the models.
Introduction-2
• A hierarchical acoustic modeling scheme that can be trained using discriminative methods for LVCSR tasks
• A model hierarchy is constructed by first using a top-down divisive clustering procedure to create a decision tree; hierarchical layers are then selected by using different stopping criteria to traverse the decision tree.
Hierarchical Gaussian Mixture Models
• A tree structure represents the current status of clustering; each node in the tree represents a set of acoustic contexts and their corresponding training data.
• In general, the node and the question are chosen such that a clustering objective, such as the data log-likelihood, is optimized at each step.
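The greedy split selection described above can be sketched as follows. This is an illustrative toy (1-D features, single-Gaussian node likelihoods, hypothetical function names), not the paper's implementation:

```python
import math

def gaussian_ll(values):
    """Log-likelihood of 1-D values under a single ML-fitted Gaussian."""
    n = len(values)
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-6)  # floor avoids log(0)
    return sum(-0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)
               for v in values)

def best_split(leaves, questions):
    """Pick the (leaf, question) pair with the largest log-likelihood gain."""
    best_gain, best_choice = 0.0, None
    for i, leaf in enumerate(leaves):
        for q in questions:
            yes = [(c, v) for c, v in leaf if q(c)]
            no = [(c, v) for c, v in leaf if not q(c)]
            if len(yes) < 2 or len(no) < 2:
                continue  # require enough data on both sides of the split
            gain = (gaussian_ll([v for _, v in yes])
                    + gaussian_ll([v for _, v in no])
                    - gaussian_ll([v for _, v in leaf]))
            if gain > best_gain:
                best_gain, best_choice = gain, (i, yes, no)
    return best_choice

def cluster(data, questions, max_leaves):
    """Top-down divisive clustering: split greedily until no gain remains."""
    leaves = [data]
    while len(leaves) < max_leaves:
        choice = best_split(leaves, questions)
        if choice is None:
            break
        i, yes, no = choice
        leaves[i:i + 1] = [yes, no]  # replace the leaf by its two children
    return leaves
```

Here each data point is a (context, value) pair and a question is a yes/no predicate on the context, mirroring the phonetic-question setup in the slides.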
Model Hierarchy Construction
• A hierarchical model can be constructed from a decision tree by running the clustering algorithm multiple times using different stopping criteria.
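One way to realize the multiple-stopping-criteria idea is to cut a single decision tree at several depth limits, collecting one frontier per layer. A minimal sketch, assuming a hypothetical nested-dict tree representation:

```python
def layer_at(node, stop):
    """Frontier of a decision tree under stopping predicate `stop`.

    `node` is a leaf (any non-dict value) or a dict with 'yes'/'no'
    children and a 'depth' field.
    """
    if not isinstance(node, dict) or stop(node):
        return [node]
    return layer_at(node['yes'], stop) + layer_at(node['no'], stop)

def build_hierarchy(tree, depth_limits):
    """One layer per stopping criterion; here, increasing depth limits."""
    return [layer_at(tree, lambda n, d=d: n['depth'] >= d) for d in depth_limits]
```

Any monotone stopping criterion (minimum data count, likelihood-gain threshold) can replace the depth test without changing the traversal.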
Model Scoring
• The leaf nodes of a hierarchical model represent the set of its output acoustic labels.
• The log-likelihood of a feature vector x with respect to a label c can be computed by

l_c(x) = log( Σ_{m=1}^{M_c} w_{c,m} N(x; μ_{c,m}, σ_{c,m}) ),

where m is the index of the mixture component, M_c is the total number of mixture components of c, w_{c,m} is the mixture weight of the m-th component of c, and N(x; μ_{c,m}, σ_{c,m}) is the multivariate Gaussian density function with mean vector μ_{c,m} and standard deviation σ_{c,m}.
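The GMM log-likelihood above can be computed stably with the log-sum-exp trick. A small self-contained sketch (1-D diagonal-covariance case, hypothetical function names):

```python
import math

def log_gaussian(x, mean, std):
    """Log of a 1-D Gaussian density."""
    return (-0.5 * math.log(2 * math.pi * std ** 2)
            - (x - mean) ** 2 / (2 * std ** 2))

def gmm_log_likelihood(x, weights, means, stds):
    """l_c(x) = log( sum_m w_m N(x; mu_m, sigma_m) ), via log-sum-exp."""
    terms = [math.log(w) + log_gaussian(x, m, s)
             for w, m, s in zip(weights, means, stds)]
    top = max(terms)  # factor out the largest term to avoid underflow
    return top + math.log(sum(math.exp(t - top) for t in terms))
```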
Discriminative Training : MCE training
• MCE training seeks to minimize the number of incorrectly recognized utterances in the training set by increasing the difference between the log-likelihood score of the correct transcription and that of an incorrect hypothesis.
• The per-utterance loss can be written as

Loss(X_n, λ) = log( 1 + exp( ς d(X_n, Y_n, S_n, λ) ) ),

where d is the misclassification measure comparing the score of the correct transcription against the competing hypotheses, Y_n denotes the transcription of the n-th utterance, S_n denotes the set of N-best incorrect hypotheses, N denotes the size of the N-best list, and ς is set to 1.0 in our experiments.
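The sigmoid-smoothed MCE loss for one utterance can be sketched as below, assuming the misclassification measure is simply the score gap between the best incorrect hypothesis and the correct transcription (a simplification of the paper's N-best formulation):

```python
import math

def mce_loss(correct_score, incorrect_scores, zeta=1.0):
    """Smoothed MCE loss for a single utterance.

    d = best incorrect log-likelihood minus the correct transcription's
    log-likelihood; the loss log(1 + exp(zeta * d)) shrinks toward zero
    as the margin of the correct answer grows.
    """
    d = max(incorrect_scores) - correct_score
    return math.log1p(math.exp(zeta * d))
```

Because the loss is smooth in the scores, it can be minimized with gradient methods, which is what the next slide exploits.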
Discriminative Training with Hierarchical Model
• Given the loss function L, the gradient can be decomposed into a combination of the gradients of the individual acoustic scores, as would be done for discriminative training of non-hierarchical models:

∂L/∂λ = Σ_{n=1}^{N} Σ_{x∈X_n} Σ_p (∂L/∂a(x, p)) (∂a(x, p; λ)/∂λ),

where a(x, p; λ) denotes the acoustic model score for feature x and output label p. For a hierarchical model, the score is a weighted combination over the K levels of the hierarchy:

a(x, p; λ) = Σ_{k=1}^{K} w_k l(x, A_k(p); λ),

where w_k denotes the relative weight of the k-th level and A_k(p) denotes the ancestor label of p at level k.
• As a result, training a hierarchical model reduces to first computing the gradients with respect to all acoustic scores, as would be done when training a non-hierarchical model, and then distributing the contribution of each gradient across the levels according to w_k.
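The score combination and gradient distribution can be sketched as follows; the table-based interfaces (`x_scores`, `ancestors`) are hypothetical stand-ins for the model's per-level scores and the tree's ancestor mapping:

```python
def hierarchical_score(x_scores, ancestors, weights, p):
    """a(x, p) = sum_k w_k * l_k(x, A_k(p)): weighted sum of per-level scores.

    x_scores[k][label] holds the level-k score l_k for a given label;
    ancestors[k][p] maps leaf label p to its ancestor class at level k.
    """
    return sum(w * x_scores[k][ancestors[k][p]] for k, w in enumerate(weights))

def distribute_gradient(grad_a, ancestors, weights, p, grads):
    """Chain rule: dL/dl_k(A_k(p)) += w_k * dL/da(x, p), per level k."""
    for k, w in enumerate(weights):
        label = ancestors[k][p]
        grads[k][label] = grads[k].get(label, 0.0) + w * grad_a
```

Note how a single leaf-level gradient fans out to every ancestor, which is how the hierarchy shares training signal across levels.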
Experiments- MIT Lecture Corpus
• The MIT Lecture Corpus contains audio recordings and manual transcriptions for approximately 300 hours of MIT lectures from eight different courses, and nearly 100 MITWorld seminars given on a variety of topics.
SUMMIT Recognizer
• The SUMMIT landmark-based speech recognizer first computes a set of perceptually important time points as landmarks based on an acoustic difference measure, and extracts a feature vector around each landmark.
• The acoustic landmarks are represented by a set of diphone labels to model the left and right contexts of the landmarks.
• The diphones are clustered using top-down decision tree clustering and are modeled by a set of GMM parameters.
Baseline Models
• Although discriminative training always reduced the WERs, the improvements became less effective as the number of mixture components increased.
• This fact suggests that discriminative training has made the model over-fit the training data as the number of parameters increases.
Hierarchical Models
• Table 1 summarizes the WERs of the hierarchical model before and after discriminative training.
• The statistics of log-likelihood differences show that the hierarchy prevents large decreases in the log-likelihood and can potentially make the model more resilient to over-fitting.
Conclusion
• In this paper, we have described how to construct a hierarchical model for LVCSR tasks using top-down decision tree clustering, and how to combine a hierarchical acoustic model with discriminative training.
• In the future, we plan to further improve the model by applying a more flexible hierarchical structure.
Discriminative Pronunciation Learning Using Phonetic Decoder and
Minimum-Classification-Error Criterion
Oriol Vinyals, Li Deng, Dong Yu, and Alex Acero
Microsoft Research, Redmond, WA
International Computer Science Institute, Berkeley, CA
Outline

• Introduction
• Discriminative Pronunciation Learning
• Experimental Evaluation
• Discussions and Future Work
Introduction-1
• Standard pronunciations are derived from existing dictionaries and do not cover the full diversity of ways in which the same word can be pronounced.
• On the one hand, alternative pronunciations, obtained typically by manual addition or by maximum likelihood learning, increase the coverage of pronunciation variability.
• On the other hand, they may also lead to greater confusability between different lexical items.
Introduction-2
• To overcome the difficulty of increased lexical confusability while introducing alternative pronunciations, one can use discriminative learning to intentionally minimize the confusability.
• In our work, we directly exploit the MCE for selecting the more “discriminative” pronunciation alternatives, where these alternatives are derived from high-quality N-best lists or lattices in the phonetic recognition results.
Discriminative Pronunciation Learning-1
• In speech recognition exploiting alternative pronunciations, the decision rule can be approximated by
Ŵ = argmax_W p(X|W) p(W)
  ≈ argmax_W max_q p(S_q, X|W) p(W)
  = argmax_W max_q p(S_q|W) p(X|S_q, W) p(W)
  ≈ argmax_W max_q p(S_q|W) p(X|S_q) p(W),

• where X is the sequence of acoustic observations (generally feature vectors extracted from the speech waveform), W is a sequence of hypothesized words or a sentence, and q = 1, 2, . . . , N is the index to multiple phone sequences that are alternative pronunciations of sentence W. Each S_q is associated with the q-th path in the lattice of word pronunciations, and has an implicit dependency on W.
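The final decision rule, maximizing over pronunciation alternatives in the log domain, might look like the sketch below; the probability callables are hypothetical interfaces, not the recognizer's actual API:

```python
import math

def decode(X, words, pronunciations, p_x_given_s, p_s_given_w, p_w):
    """W_hat = argmax_W max_q p(S_q|W) p(X|S_q) p(W), scored in the log domain."""
    best_word, best_score = None, -math.inf
    for w in words:
        for s in pronunciations[w]:  # max over alternative pronunciations q
            score = (math.log(p_s_given_w(s, w))
                     + math.log(p_x_given_s(X, s))
                     + math.log(p_w(w)))
            if score > best_score:
                best_word, best_score = w, score
    return best_word
```

Taking the max over q, rather than summing, is the Viterbi-style approximation stated in the last line of the decision rule.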
Discriminative Pronunciation Learning-2
• Selects pronunciation(s) from the phonetic recognition results for each word such that the sentence error rate in the training data is minimized.
The discriminant for hypothesis W_j and the misclassification measure are

D_j(X_r, Λ) = max_q p(S_q|W_j) P(X_r|S_q) p(W_j),

d_r(X_r) = -log D_i(X_r, Λ) + log[ (1/M) Σ_{j≠i} D_j(X_r, Λ) ],

and the per-sentence loss is the sigmoid

l(d_r(X_r)) = 1 / (1 + e^{-d_r(X_r)}).
Discriminative Pronunciation Learning-3
• where X_r is the observation from the r-th sentence or utterance, W_i refers to the correct sequence of words for the r-th sentence, W_j refers to all the incorrect hypotheses (obtained from N-best lists produced by the word decoder), M is the total number of hypotheses considered for a particular sentence, and Λ is the family of parameters to be estimated to optimize the objective function.
Discriminative Pronunciation Learning-4
• We now define the “MCE score”:
• Note that Δ(p, w) is an approximation of the number of corrected errors after using the new pronunciation p.
• Thus, if the value is positive, then there is an improvement of the recognizer performance measured by training-set error reduction with the new pronunciation p for word w.
Δ(p, w) = l(d | Λ) - l(d | Λ_(p,w)),

where (p, w) is a pronunciation-word pair.
Discriminative Pronunciation Learning-5
• One of the limitations of finding the pairs in a greedy way is the assumption that errors corrected by a particular pair are uncorrelated with errors corrected by any other pair.
S = { (p, w) ∈ T | Δ(p, w) > 0 },

where the set T denotes all possible (p, w) pairs.
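The greedy selection of pairs with a positive MCE score can be sketched as a simple filter over the candidate set T; `delta` is a hypothetical stand-in for the Δ(p, w) estimate:

```python
def select_pairs(candidates, delta):
    """Greedily keep every (pronunciation, word) pair with positive MCE score.

    Treats pairs as independent: the errors corrected by one pair are assumed
    uncorrelated with those corrected by any other, which is the greedy
    assumption the slide points out as a limitation.
    """
    return [(p, w) for p, w in candidates if delta(p, w) > 0]
```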
Experimental Evaluation
• In the experimental evaluation of the discriminative pronunciation learning technique described in the preceding section, we used the Windows-Live-Search-for-Mobile (WLS4M) database, which consists of very large quantities of short utterances from spoken queries to the WLS4M engine.
Baseline System
• A standard HMM-GMM speech recognizer provided as part of the Microsoft Speech API (SAPI)
• The acoustic model is trained on about 3000 hours of speech, using PLP features and an HLDA transformation.
• A bigram language model is used, trained on realistic system deployment data.
• The recognizer has a dictionary with 64,000 distinct entries.
Performance results
• The comparative performance results between the baseline recognizer and the new one with MCE-based pronunciation learning are summarized in Table 2.
Discussions and Future Work
• A maximum-likelihood-based approach to generating multiple pronunciations creates greater flexibility in the ways people pronounce words, but the addition of new entries often vastly enlarges confusability across different lexical items.
• This is because the maximum likelihood learning does not take into account such lexical confusability.