Ch 5b: Discriminative Training (temporal model)

Page 1

Ch 5b: Discriminative Training (temporal model)

14.2.2002 Ilkka Aho

Page 2

Abbreviations

• MCE = Minimum Classification Error

• MMI = Maximum Mutual Information

• STLVQ = Shift-Tolerant Learning Vector Quantization

• TDNN = Time-Delay Neural Network

• HMM = Hidden Markov Model

• DP = Dynamic Programming

• DTW = Dynamic Time Warping

• GPD = Generalized Probabilistic Descent

• PBMEC = Prototype-Based Minimum Error Classifier

Page 3

Basics

• Prototype-based methods use class representatives (a sample or an average of samples) to classify new patterns

• The MCE framework is used for discriminative training (MMI is also possible); the standard MCE quantities are sketched below

• A central concern is the design or learning of prototypes that will yield good classification performance
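
For reference, the standard MCE quantities can be sketched as follows (a common GPD-style formulation; the smoothing parameters \eta and a are conventions of this sketch, not values fixed by the slides). With g_k(x;\Lambda) the discriminant function of category k under parameters \Lambda:

    d_k(x) = -g_k(x;\Lambda)
             + \frac{1}{\eta}\log\left[\frac{1}{M-1}\sum_{j\neq k} e^{\eta\, g_j(x;\Lambda)}\right],
    \qquad
    \ell_k(x) = \frac{1}{1+e^{-a\, d_k(x)}}

A positive d_k signals a misclassification of x, and the sigmoid \ell_k smooths the zero-one error count into a continuous, differentiable loss that gradient methods such as GPD can minimize.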

Page 4

STLVQ for Speech Recognition

• The LVQ algorithm in its basic form is a method for static pattern recognition

• STLVQ handles a stream of dynamically varying patterns (fig. 1.)

• STLVQ is much simpler than the TDNN model, but yielded very good results on the same phoneme recognition tasks
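
A minimal sketch of the shift-tolerant idea in Python (the function name, the window size, and the accumulate-then-vote scoring are illustrative assumptions; the actual architecture is the one in fig. 1):

    import numpy as np

    def stlvq_classify(frames, prototypes, labels, win=7):
        # frames:     (T, d) stream of feature frames
        # prototypes: (P, win * d) window-sized reference vectors
        # labels:     (P,) integer category of each prototype
        # Slide the window over the stream, score each shifted position
        # with nearest-prototype (LVQ) matching, and accumulate per-class
        # evidence; the category with the best total is the decision.
        scores = np.zeros(labels.max() + 1)
        for t in range(frames.shape[0] - win + 1):
            x = frames[t:t + win].ravel()               # one shifted window
            d = np.sum((prototypes - x) ** 2, axis=1)   # squared Euclidean
            p = int(np.argmin(d))                       # nearest prototype
            scores[labels[p]] += -d[p]                  # closer = stronger vote
        return int(np.argmax(scores))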

Page 5

Figure 1. STLVQ system architecture.

Page 6

Limitations and Strengths of STLVQ

• STLVQ assumes only a single phoneme as an input token

• Training and testing datasets are obtained from manually labeled speech databases

• How to extend the phoneme recognition to word or sentence recognition?

• LVQ is applied locally

Page 7

Expanding the Scope of LVQ for Speech Recognition

• Representation of longer speech sequences such as entire utterances

• Global optimization

• Application to continuous speech recognition

• A need for some kind of time warping or normalization

• How to merge the discriminative power of LVQ with the sequential modeling abilities of HMMs?

• Two methods: LVQ-HMM (fig. 2.) and HMM-LVQ (fig. 3.)

Page 8

Figure 2. LVQ-HMM architecture.

Page 9

Figure 3. HMM-LVQ architecture.

Page 10

MCE Interpretation of LVQ

• A prototype-based implementation of the MCE framework

• The LVQ classification rule is based on the Euclidean distance between a pattern vector and each category's reference vectors

• The category of the nearest reference vector is given as the classification decision

• Figures 4, 5, and 6 demonstrate the smoothness of the MCE loss; the rule and the loss are sketched below
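
A minimal sketch of this rule, together with the smoothed loss whose shape figures 4-6 illustrate (Python; the steepness parameter a corresponds to the a in the captions, everything else is illustrative):

    import numpy as np

    def classify(x, prototypes, labels):
        # Nearest-reference-vector rule: the category of the prototype
        # closest to x in Euclidean distance is the decision.
        d = np.linalg.norm(prototypes - x, axis=1)
        return labels[np.argmin(d)]

    def mce_loss(x, y, prototypes, labels, a=1.0):
        # Misclassification measure: distance to the nearest prototype of
        # the correct category y minus distance to the nearest prototype
        # of any other category (> 0 means an error). The sigmoid smooths
        # the zero-one loss of figure 4; a sets its steepness (figs. 5, 6).
        d = np.linalg.norm(prototypes - x, axis=1)
        m = d[labels == y].min() - d[labels != y].min()
        return 1.0 / (1.0 + np.exp(-a * m))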

Page 11

Figure 4. Average empirical loss measured over 10 samples from a one-dimensional, two-class classification problem. The ideal zero-one loss is used in calculating the overall loss.

Page 12

Figure 5. Now a sigmoidal MCE loss, a = 0.1, is used in calculating the overall loss.

Page 13

Figure 6. The same situation as in Figure 5, except that a = 1.0.

Page 14

Prototype-based Methods Using DP

• DP is used to find the path through a grid of local matches between prototype and test sample frames that has the best overall score

• When calculating the distance between the input utterance and the reference utterance, it is more practical to use the top path, or the top few paths, than every possible DP path

• Nonlinear compression and stretching of prototypes

• DTW is a specific application of DP techniques to speech processing
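
A minimal DTW sketch under common assumptions (squared-Euclidean local frame distance; stretch, compress, and match as the allowed steps):

    import numpy as np

    def dtw_distance(a, b):
        # DP over the grid of local distances between frame sequences
        # a (T1, d) and b (T2, d); D[i, j] is the best accumulated score
        # of any path aligning the first i frames of a with the first j
        # frames of b.
        T1, T2 = len(a), len(b)
        D = np.full((T1 + 1, T2 + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, T1 + 1):
            for j in range(1, T2 + 1):
                local = np.sum((a[i - 1] - b[j - 1]) ** 2)
                D[i, j] = local + min(D[i - 1, j],       # stretch prototype
                                      D[i, j - 1],       # compress prototype
                                      D[i - 1, j - 1])   # frame-to-frame match
        return D[T1, T2]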

Page 15

MCE-Trained Prototypes and DTW

• The idea is to define the MCE loss in terms of a discriminant function that reflects the structure of a straightforward DTW-based recognizer (one such form is sketched after this list)

• The loss function has to be continuous and differentiable so that some gradient-based optimization technique (for example, GPD) can be used to minimize the overall loss

• The loss function also has to reflect classification performance

• Good results in the Bell Labs E-set task and in phoneme recognition tasks
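
One hedged way to write such a discriminant, consistent with the DTW sketch on page 14 and the MCE quantities on page 3 (the notation \mathcal{R}_k for category k's prototype set is an assumption of this sketch):

    g_k(X;\Lambda) = -\min_{r \in \mathcal{R}_k} \mathrm{DTW}(X, r)

Plugging this g_k into the misclassification measure d_k and the sigmoidal loss \ell_k makes the overall objective continuous and differentiable in the prototype frames, so GPD applies.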

Page 16

PBMEC

• PBMEC models prototypes at a finer grain than MCE-trained DTW

• PBMEC prototypes are modeled within phonetic or subphonetic states

• Word models are formed by connecting different states together

• Multi-state PBMEC (fig. 7.)

• The discriminant function for a category is defined as the final accumulated score of the best DP path for that category (fig. 8.)

• The MCE-GPD update rule for PBMEC pulls the nearest reference vectors for the correct category closer to the input and pushes the nearest reference vectors for the incorrect category away (see the sketch after this list)

• MCE-GPD in the context of speech recognition using phoneme models (fig. 9.)
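
A hedged sketch of that push/pull update for a single static input (full PBMEC training propagates the loss gradient through the DP alignment of fig. 8; the learning rate eps and sigmoid slope a are illustrative):

    import numpy as np

    def mce_gpd_step(x, y, prototypes, labels, a=1.0, eps=0.01):
        # One GPD step on the sigmoidal MCE loss with squared-Euclidean
        # distances: pull the nearest reference vector of the correct
        # category y toward the input x, push the nearest reference
        # vector of the best-scoring incorrect category away from it.
        d = np.sum((prototypes - x) ** 2, axis=1)
        correct = np.where(labels == y)[0]
        wrong = np.where(labels != y)[0]
        i = correct[np.argmin(d[correct])]        # nearest correct prototype
        j = wrong[np.argmin(d[wrong])]            # nearest incorrect prototype
        ell = 1.0 / (1.0 + np.exp(-a * (d[i] - d[j])))
        scale = a * ell * (1.0 - ell)             # derivative of the sigmoid
        prototypes[i] += eps * scale * (x - prototypes[i])   # pull closer
        prototypes[j] -= eps * scale * (x - prototypes[j])   # push away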

Page 17

Figure 7. Multi-state PBMEC architecture.

Page 18

Figure 8. Final DP score.

Page 19

Figure 9. DP segmentations for the words "aida" and "taira".

Page 20

HMM design based on MCE

• The prototype-like nature of HMMs

• The MCE framework can be applied to HMMs in much the same way as in the case of the PBMEC model (see the sketch after this list)

• HMM state likelihood and discriminant function

• MCE misclassification measure and loss

• Calculation of the MCE gradient for HMMs

• There are a very large number of applications of MCE-trained HMMs

• Some of the best context-independent results have been reported for the Texas Instruments-Massachusetts Institute of Technology (TIMIT) database
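
A sketch of how the HMM quantities slot into this framework, assuming the usual Viterbi (best-path) discriminant; the misclassification measure and smoothed loss then follow the forms sketched on page 3:

    g_k(X;\Lambda) = \max_{\theta} \sum_{t=1}^{T}
        \left[\log a_{\theta_{t-1}\theta_t} + \log b_{\theta_t}(x_t)\right]

Here a_{ij} are the transition probabilities, b_j(x_t) is the state likelihood (for example, a Gaussian mixture), and the MCE gradient is taken with respect to the HMM parameters \Lambda along the best path \theta.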

Page 21

Homework Question

Explain the main differences between the following methods in speech recognition:

• STLVQ

• Prototype-based DP (the DTW technique)

• HMM design based on MCE