Sequence Scoring Experiments Using the TIMIT Corpus and the HTK Recognition Framework
Author: Arthur Gerald Kunkle
Committee Chair: Dr. Veton Z. Këpuska
Slide 2
ASR Defined Automatic Speech Recognition (ASR) is the mapping of an acoustic signal into a string of words. ASR systems play a major role in Human-Machine Interaction (HMI): speech has the potential to be a far more intuitive way to command a machine than existing input methods such as the keyboard and mouse.
Slide 3
Early ASR Systems The earliest ASR systems modeled the natural resonances that occur as air flows through the vocal tract to create sounds. Example: to recognize the digit "five", the system would determine that the vowel sound /ay/ matched the correct digit. Limitation: the utterance could contain only a single digit and no other word or non-speech event that might confuse the system.
Slide 4
ASR Improvements ASR system development in the 1980s and 1990s introduced the use of Hidden Markov Models (HMMs), which have remained in wide use over the past two decades, with improvements made on a continual basis. ASR received interest from DARPA, leading to new and notable ASR systems such as CMU Sphinx (Carnegie Mellon University). DARPA also formalized the tasks and evaluation criteria used to measure ASR system performance.
Slide 5
Major Tasks in ASR History
Slide 6
Timeline of ASR Achievements
Slide 7
Characteristics of ASR Systems ASR systems are defined by the tasks they are designed to solve. We have already discussed some example tasks. Tasks involve the following parameters: Vocabulary Size, Fluency, Environmental Effects, and Speaker Characteristics.
Slide 8
Vocabulary Size Milestones in ASR systems are often related to how large a vocabulary a system can handle while keeping the error rate at a minimum. Simple task vocabulary: recognizing the digits zero through nine, plus "oh". These eleven words are the in-vocabulary (INV) words. Any word the system encounters outside this set is known as an out-of-vocabulary (OOV) word.
Slide 9
ASR Tasks and Vocabulary Sizes

Task Name                       | Vocabulary Size    | Word Error Rate (%)
Texas Instruments (TI) Digits   | 11 (zero-nine, oh) | 0.5
Wall Street Journal 1           | 5,000              | 3
Wall Street Journal 2           | 20,000             | 3
Broadcast News                  | 64,000+            | 10
Conversational Telephone Speech | 64,000+            | 20

As the vocabulary size of a task increases, so does the Word Error Rate (WER). WER is the standard evaluation metric for speech recognition.
Slide 10
Example WER Calculation This example compares an ASR system's output hypothesis for a string of numbers against the true reference string. The bottom line marks the type of each error as it occurs in the transcription (D = deletion, S = substitution, I = insertion).

Reference:  ONE  TWO  THREE  FOUR  FIVE  SIX  SEVEN  *****
Hypothesis: ***  TWO  *****  FIVE  FIVE  SIX  SEVEN  ONE
Evaluation:  D          D      S                       I

WER is the number of errors divided by the number of reference words: here (1 substitution + 2 deletions + 1 insertion) / 7 words ≈ 57%.
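As an illustration (not part of the original slides), WER can be computed with a standard dynamic-programming word edit distance. The following minimal Python sketch uses the example strings from this slide; the function name is ours:

def wer(reference, hypothesis):
    """Word Error Rate via Levenshtein distance over words:
    WER = (substitutions + deletions + insertions) / len(reference)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("ONE TWO THREE FOUR FIVE SIX SEVEN",
          "TWO FIVE FIVE SIX SEVEN ONE"))   # 4 errors / 7 words ≈ 0.57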
Slide 11
ASR System Fluency Fluency measures the rigidity of the input speech. In isolated-word recognition, the speech to be processed is surrounded by known silence or pause; examples include digit recognition and command-and-control tasks. Continuous-speech systems must take non-speech events and the segmentation of real words into account, which is much harder to accomplish!
Slide 12
Other ASR System Parameters Environmental noise and channel characteristics: recording instruments may be located at different distances from each speaker and may pick up other noises in addition to speech. Speaker-dependent characteristics: speaker dialect and accent.
Slide 13
Wake-up-Word Paradigm The Wake-up-Word (WUW) ASR problem: detect a single word or phrase when spoken in an alerting context, while rejecting all other words, phrases, sounds, noises, and other acoustic events with virtually 100% accuracy, including the same word or phrase of interest spoken in a non-alerting (i.e., referential) context.
Slide 14
WUW Example Application The user utters the WUW "Computer" to alert a machine to perform various commands. When the user utters the command phrase "Computer, begin presentation", WUW technology should detect that "Computer" was spoken in the alerting context and perform the requested command. If the user utters the phrase "I want to buy a new computer", WUW technology must detect that "computer" was used in a non-alerting context and avoid parsing a command.
Slide 15
WUW Problem Areas
Detecting WUW Context: the WUW system must be able to notify the host system that attention is required in certain circumstances and with high accuracy. Unlike keyword spotting, WUW dictates that these occurrences only be reported during an alerting context. This context can be determined using features such as leading and trailing silence, differences in the long-term average of speech features, and prosodic information (pitch, intonation, rhythm, etc.).
Identifying WUW: after identifying the correct context for a spoken utterance, the WUW paradigm is responsible for determining whether the utterance contains the pre-defined Wake-up-Word used for command (e.g., "Computer") with a high degree of accuracy, e.g., > 99%.
Correct Rejection of Non-WUW: similar to identification of the WUW, the system must also be capable of filtering speech tokens that are not WUWs with practically 100% accuracy, to guarantee 0% false acceptances.
Slide 16
Current WUW System The current WUW system is used in practical applications such as PowerPoint Commander, Elevator Simulator, Car Inspection System, and Nursing Call Center.
Slide 17
Motivations for External Scoring Toolkit
Support for standard speech recognition testing data sets: provide support for evaluating the TIMIT data set in order to evaluate novel scoring methods against a broader class of words.
Integration of standard toolkits: utilize the Hidden Markov Model Toolkit (HTK) and the SVM library (LIBSVM) to build and evaluate HMM and SVM models. Using industry-standard frameworks has the benefit of a well-documented environment and previous results.
Integration of novel scoring techniques with standard toolkits: the novel method used in the WUW system must be integrated with the existing workflow in the HTK framework in order to augment the technique and evaluate its effectiveness against additional data sources.
MATLAB-based analysis and experimentation tools: once results are obtained using the SeqRec tools for HTK and LIBSVM, MATLAB scripts are used to visualize the results.
Support for One-Class SVM modeling: a technique that allows a recognition model to be built on only INV data scores. This SVM type will be applied to WUW and its benefits and disadvantages explored.
Slide 18
SeqRec System Overview In order to further explore and refine
the unique speech recognition elements of the WUW system, the
Sequence Recognizer (SeqRec) Toolkit was developed.
Slide 19
Speech Recognition Goals Speech recognition systems often assume speech is a realization of some message encoded as a sequence of one or more discrete symbols. Speech is normally converted into a sequence of equally spaced discrete parameter vectors (typically every 10 ms). This makes the assumption that a speech waveform can be regarded as a stationary process over each short sampling interval.
Slide 20
Speech Recognition Goals, contd. The speech recognizer's job is to create a mapping between the sequences of speech frames and the underlying speech symbols that constitute the utterance.
Slide 21
Probability Theory of ASR What is the most likely discrete symbol sequence out of all valid sequences in the language L, given some acoustic input O? The acoustic input is a sequence of discrete observations: O = o_1, o_2, ..., o_T. The symbol sequence is defined as: W = w_1, w_2, ..., w_N. The fundamental ASR system goal is then: W* = argmax over W in L of P(W | O).
Slide 22
Probability Theory of ASR, contd. Applying Bayes' theorem: P(W | O) = P(O | W) P(W) / P(O). The new quantities are easier to compute than P(W | O). P(W) is the prior probability of the sequence itself, calculated using prior knowledge of occurrences of the sequence W. P(O) is the prior probability of the acoustic input occurring.
Slide 23
Probability Theory of ASR, contd. P(O) is not needed, because it is constant over the argmax: we maximize over all possible sequences for the same acoustic input, so W* = argmax over W in L of P(O | W) P(W). The probability P(O | W), the likelihood of the acoustic input O given the sequence W, is defined as the observation likelihood (often referred to as the acoustic score). This quantity can be determined using the Hidden Markov Model.
Slide 24
Elements of HMMs The set of N states constituting the model. Although the state assignment of each observation vector is hidden, the exact number of states often carries a physical significance.
Slide 25
Elements of HMMs, contd. The transition probability matrix A. Each element a_ij represents the probability of transitioning from state i to state j. Each row of this matrix must sum to 1 to be valid.
Slide 26
Elements of HMMs, contd. The emission probabilities B. Each b_i(o_t) expresses the probability of observation o_t being generated while in state i. Note that the beginning and end states of an HMM do not have an associated emission probability.
Slide 27
Elements of HMMs, contd. The initial state distribution π, where π_i is the probability of starting in state i.
Slide 28
Elements of HMMs, contd. All the parameters of an HMM are expressed in compact form as: λ = (A, B, π).
Slide 29
ASR HMMs An ASR HMM is normally used to model a phoneme, the smallest distinguishable sound unit in a language. Phoneme HMMs generally have three emitting states in order to model the transition-in, steady-state, and transition-out regions of the phoneme. A whole-word HMM is created by simply concatenating the phoneme HMMs used to spell the word in question.
Slide 30
Acoustic Scores Using HMMs So how do we use HMMs to calculate the probability of an observation sequence, given a specific model? Restated: score how well a given model matches an input observation sequence. For HMMs, each hidden state produces only a single observation, so the length of the traversed state sequence equals the length of the observation sequence.
Slide 31
Acoustic Scores Using HMMs, contd. The actual state sequence Q = q_1, ..., q_T that the observation sequence traverses is hidden. Assuming independence of observations, we calculate the joint probability of a particular observation sequence and a particular state sequence: P(O, Q | λ) = π_{q_1} b_{q_1}(o_1) · a_{q_1 q_2} b_{q_2}(o_2) · ... · a_{q_{T-1} q_T} b_{q_T}(o_T). This probability must then be summed across all valid state sequences in the model: P(O | λ) = Σ over all Q of P(O, Q | λ).
Slide 32
Acoustic Scores Using HMMs, contd. While this solution is valid, it requires O(N^T) computations, and for speech processing applications of HMMs these parameters can become quite large. In order to reduce the amount of computation needed, the forward algorithm is used.
Slide 33
Forward Algorithm The forward algorithm is a dynamic programming technique that uses a table to store intermediate values as it builds the final probability of the observation sequence. Each cell is calculated by summing over the extensions of all paths that lead to the current cell.
Slide 34
Forward Algorithm, contd. The forward algorithm is a three-step process:
1. Initialization: α_1(i) = π_i b_i(o_1), for 1 ≤ i ≤ N.
2. Induction: α_{t+1}(j) = [ Σ_{i=1..N} α_t(i) a_ij ] b_j(o_{t+1}).
3. Termination: P(O | λ) = Σ_{i=1..N} α_T(i).
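To make the three steps concrete, here is a minimal numpy sketch of the forward algorithm for a discrete-observation HMM (not from the thesis; the toy model values are invented for illustration):

import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: P(O | lambda) in O(N^2 * T) instead of O(N^T).

    A   : (N, N) transition matrix, A[i, j] = P(state j at t+1 | state i at t)
    B   : (N, M) emission matrix, B[i, k] = P(symbol k | state i)
    pi  : (N,)   initial state distribution
    obs : (T,)   observation sequence as symbol indices
    """
    T = len(obs)
    alpha = np.zeros((T, len(pi)))
    alpha[0] = pi * B[:, obs[0]]                 # 1. initialization
    for t in range(1, T):
        # 2. induction: sum over all paths into each state, then emit obs[t]
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()                       # 3. termination

# Toy 2-state model with 3 discrete symbols (values are illustrative only).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
print(forward(A, B, pi, [0, 1, 2]))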
Slide 35
HMM Parameter Re-estimation HMM parameter re-estimation asks how we should adjust the model parameters in order to maximize the acoustic score. This problem is addressed by using the Baum-Welch algorithm, an instance of the Expectation-Maximization (EM) algorithm.
Slide 37
HMM Parameter Re-estimation, contd. Goal for re-estimating the transition probability matrix A: set each a_ij to the expected number of transitions from state i to state j, divided by the expected number of transitions out of state i. Goal for re-estimating the emission probability distributions: set each b_j(o) to the expected number of times state j emits o, divided by the expected number of times in state j.
Slide 38
HMM Parameter Re-estimation, contd. These calculations lead to the Baum-Welch re-estimation equations. (See Rabiner for details and derivations.)
Slide 39
HMM Parameter Re-estimation, contd. If a current model λ is re-estimated using the EM algorithm to create a new, refined model λ̄, then either:
1. The initial model defines a critical point of the likelihood function, in which case λ̄ = λ (no HMM parameter updates were made), or
2. A new model has been discovered that describes an HMM under which the observation sequence O is more likely to have been produced.
The final model produced by EM is called the maximum likelihood HMM.
Slide 40
Speech-Specific HMM Recognition The previous section presented the fundamentals of using HMMs to perform general sequence recognition. There are some additional concepts associated specifically with the speech recognition task domain: feature representation of speech, and Gaussian Mixture Model distributions.
Slide 41
Feature Representation of Acoustic Speech Signals The input to
an ASR system is normally a continuous speech waveform. This input
must be transformed into a sequence of acoustic feature vectors,
each of which captures a small amount of information within the
original waveform.
Slide 42
Feature Representation of Acoustic Speech Signals, contd.
Pre-emphasis This stage is used to amplify energy in the
high-frequencies of the input speech signal. This allows
information in these regions to be more recognizable during HMM
model training and recognition.
Slide 43
Feature Representation of Acoustic Speech Signals, contd. Windowing This stage slices the input signal into discrete time segments. A Hamming window is commonly used to prevent the edge effects associated with the sharp transitions of a rectangular window.
Slide 44
Feature Representation of Acoustic Speech Signals, contd.
Discrete Fourier Transform DFT is applied to the windowed speech
signal, resulting in the magnitude and phase representation of the
signal.
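A minimal numpy sketch of these first three front-end stages (pre-emphasis, windowing, DFT); the 25 ms/10 ms frame sizes and the 0.97 pre-emphasis coefficient are typical values assumed here, not taken from the slides:

import numpy as np

def frame_signal(x, fs, frame_ms=25, step_ms=10, alpha=0.97):
    """Pre-emphasis, framing, Hamming windowing, and DFT magnitude."""
    # 1. Pre-emphasis: boost high frequencies with a first-order filter.
    x = np.append(x[0], x[1:] - alpha * x[:-1])
    # 2. Slice into overlapping frames (e.g. 25 ms windows every 10 ms).
    frame_len = int(fs * frame_ms / 1000)
    step = int(fs * step_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // step
    frames = np.stack([x[i * step : i * step + frame_len]
                       for i in range(n_frames)])
    # 3. Hamming window suppresses the edge effects of a rectangular cut.
    frames *= np.hamming(frame_len)
    # 4. Magnitude spectrum of each windowed frame.
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 16000                                         # TIMIT sampling rate
spectrum = frame_signal(np.random.randn(fs), fs)   # 1 s of noise as a stand-in
print(spectrum.shape)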
Slide 45
Feature Representation of Acoustic Speech Signals, contd. Mel Filter Bank Human hearing is less sensitive at frequencies above 1000 Hz, so the spectrum is warped using the logarithmic Mel scale. A bank of filters is constructed, with filters distributed linearly below 1000 Hz and spaced logarithmically above 1000 Hz.
Slide 46
Feature Representation of Acoustic Speech Signals, contd. Inverse DFT The IDFT of the log Mel spectrum is computed, resulting in the cepstrum. This representation is valuable because it separates the characteristics of the source and filter of the speech waveform. The first 12 values of the resulting cepstrum are retained. Delta MFCC Features In order to capture the changes in speech from frame to frame, the first and second derivatives of the MFCC coefficients are also calculated and included.
Slide 47
Feature Representation of Acoustic Speech Signals, contd.
Energy Feature This step is performed in parallel with the MFCC
feature extraction and involves calculating the total energy of the
input frame.
Slide 48
Feature Representation of Acoustic Speech Signals, contd. The front end results in a 39-element observation vector for each frame of speech:

Feature Type                       | Count
Cepstral Coefficients              | 12
Delta Cepstral Coefficients        | 12
Double Delta Cepstral Coefficients | 12
Energy Coefficient                 | 1
Delta Energy Coefficient           | 1
Double Delta Energy Coefficient    | 1
Total                              | 39
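As a sketch of how the delta and double-delta features extend the static vector to 39 elements (numpy's central differences stand in here for HTK's regression-based delta formula):

import numpy as np

def add_deltas(static):
    """Append first- and second-order deltas to a (T, 13) matrix of
    12 cepstral coefficients + 1 energy term, giving (T, 39) vectors."""
    delta = np.gradient(static, axis=0)     # frame-to-frame change
    delta2 = np.gradient(delta, axis=0)     # change of the change
    return np.hstack([static, delta, delta2])

frames = np.random.randn(100, 13)           # placeholder static features
print(add_deltas(frames).shape)             # -> (100, 39)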
Slide 49
Gaussian Mixture Models Until now, the emission probability
associated with an HMM state was left as a general probability
distribution. In most ASR systems, these output probabilities are
continuous-density multivariate output distributions. The most
common form of this distribution used in speech recognition is the
Gaussian Mixture Model (GMM).
Slide 50
Gaussian Mixture Models, contd. A simple Gaussian distribution describing a one-dimensional random variable X is described by its mean μ and variance σ²: N(x; μ, σ²) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²)).
Slide 51
Gaussian Mixture Models, contd. Assume a simple (though impractical) ASR system exists in which a single-variable Gaussian is used. Each HMM state would have an emission probability that assumes the values of each observation vector are normally distributed.
Slide 52
Gaussian Mixture Models, contd. Recall that each observation is actually a D-element vector (where D = 39 for common MFCC representations). The distribution extends to the multivariate Gaussian, in which the mean is a vector of length D and the covariance is a D x D matrix.
Slide 53
Gaussian Mixture Models, contd. What if some of the features do not follow a strict normal distribution? This is actually quite common. In order to account for complex, non-normal distributions, the Gaussian Mixture Model is used: M Gaussian mixture components are combined, with the contribution of each given by a scalar weight, b(o) = Σ_{m=1..M} c_m N(o; μ_m, Σ_m), where the weights sum to 1.
Slide 54
GMM Example An example of a non-normal, one-dimensional probability distribution that is more effectively modeled using a GMM with 3 mixture components.
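A small Python sketch of evaluating such a 3-mixture GMM density; the weights, means, and standard deviations below are invented purely for illustration:

import numpy as np
from scipy.stats import norm

weights = [0.5, 0.3, 0.2]            # scalar mixture weights, sum to 1
means = [-2.0, 0.5, 3.0]
stdevs = [0.8, 0.5, 1.2]

def gmm_pdf(x):
    """Weighted sum of M Gaussian densities."""
    return sum(w * norm.pdf(x, mu, sd)
               for w, mu, sd in zip(weights, means, stdevs))

xs = np.linspace(-5, 6, 5)
print([round(gmm_pdf(x), 4) for x in xs])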
Slide 55
The Hidden Markov Modeling Toolkit (HTK) HTK is a well-established framework, primarily designed to build HMM-based systems for speech processing and speech recognition.
Slide 56
HTK Data Preparation Tools These tools provide mechanisms to convert arbitrarily formatted speech sound files and textual transcriptions into a uniform format suitable for HMM model training. The raw waveform audio must also be converted to MFCCs. Supporting data such as the phonetic dictionary must be properly formatted to ensure all pronunciations are available prior to training.
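For illustration, a typical configuration file for HCopy, the HTK tool that performs this waveform-to-MFCC conversion, might look like the sketch below. The parameter names are standard HTK configuration variables; the values are common defaults from the HTK Book, assumed here rather than taken from the SeqRec setup:

# Hypothetical HCopy configuration: convert NIST-format TIMIT audio to
# 39-dimensional MFCC_E_D_A features (MFCC + energy + deltas + accelerations).
SOURCEFORMAT = NIST
TARGETKIND   = MFCC_E_D_A
TARGETRATE   = 100000.0   # frame step: 10 ms (units of 100 ns)
WINDOWSIZE   = 250000.0   # window length: 25 ms
USEHAMMING   = T          # apply a Hamming window
PREEMCOEF    = 0.97       # pre-emphasis coefficient
NUMCHANS     = 26         # Mel filterbank channels
NUMCEPS      = 12         # cepstral coefficients kept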
Slide 57
HTK Training Tools These tools use the HTK-formatted data from the previous stage to define, initialize, and re-estimate the set of HMM models.
Slide 58
HTK Testing Tools Tools for generating text hypotheses given a set of unknown speech data. HTK provides features for full speech recognition; SeqRec only needs the tools that generate the acoustic scores.
Slide 59
TIMIT Corpus Experiments will use the Texas Instruments and Massachusetts Institute of Technology (TIMIT) corpus. It contains recordings of 630 speakers in 8 dialects of U.S. English. Each speaker is assigned 10 sentences to read that are carefully designed to contain a wide range of phonetic variability. Each utterance is recorded as a 16-bit waveform file sampled at 16 kHz. TIMIT has two partitions: TRAIN, used to generate the HMM models, and TEST, unseen by the SeqRec system until the final evaluation.
Slide 60
TIMIT Experiment Data Set The 24 words with the highest count of occurrences in the database, varying in length from ~7 frames for "a" to ~39 frames for "greasy". Highlighted words are a subset that will be used to show detailed experiment results.

Word   | TRAIN Count | TEST Count | Phonemic Pronunciation | Phonemic Length | Frame Length
the    | 1603        | 599        | dh ah                  | 2               | 8.26
to     | 1018        | 352        | t uw                   | 2               | 10.69
in     | 947         | 313        | ih n                   | 2               | 13.14
a      | 867         | 301        | ah                     | 1               | 6.69
all    | 545         | 223        | ao l                   | 2               | 20.67
that   | 612         | 215        | dh ae t                | 3               | 31.02
she    | 572         | 208        | sh iy                  | 2               | 19.86
an     | 571         | 207        | ae n                   | 2               | 9.83
your   | 565         | 202        | y ao r                 | 3               | 12.31
me     | 517         | 193        | m iy                   | 2               | 12.11
of     | 455         | 185        | ah v                   | 2               | 11.99
had    | 526         | 183        | hh ae d                | 3               | 24.54
like   | 518         | 179        | l ay k                 | 3               | 23.65
year   | 473         | 177        | y ih r                 | 3               | 30.75
and    | 492         | 175        | ah n d                 | 3               | 14.13
dark   | 473         | 171        | d aa r k               | 4               | 33.2
water  | 479         | 170        | w ao t er              | 4               | 28.35
ask    | 464         | 169        | ae s k                 | 3               | 28.12
carry  | 463         | 169        | k ae r iy              | 4               | 36.51
suit   | 462         | 168        | s uw t                 | 3               | 34.99
greasy | 462         | 168        | g r iy s iy            | 5               | 39.05
wash   | 469         | 168        | w aa sh                | 3               | 35.07
oily   | 470         | 168        | oy l iy                | 3               | 33.38
rag    | 470         | 168        | r ae g                 | 3               | 34.23
Slide 61
The HTK Recipe The versatility of the HTK Toolkit presents a steep learning curve. The HTK Recipe is used by SeqRec to provide a known-good starting point for creating a well-trained set of monophone HMM models based on TIMIT.
Slide 62
Isolated Word Recognition Result Format Red: normalized acoustic scores for INV tokens evaluated against the INV HMM. Blue: normalized acoustic scores for OOV tokens evaluated against the INV HMM.
Slide 63
Isolated Word Recognition Result Format, contd. CDFs are plotted for each score distribution, with the OOV CDF reversed. The point where the two CDFs intersect is the operating threshold.
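A sketch of locating that operating threshold numerically, assuming (as in these plots) that INV tokens score higher than OOV tokens; the names and toy score distributions are illustrative:

import numpy as np

def operating_threshold(inv_scores, oov_scores):
    """Threshold where the INV CDF crosses the reversed OOV CDF,
    i.e. where the false-rejection and false-acceptance rates are equal."""
    lo = min(inv_scores.min(), oov_scores.min())
    hi = max(inv_scores.max(), oov_scores.max())
    grid = np.linspace(lo, hi, 1000)
    fr = np.array([(inv_scores < t).mean() for t in grid])   # INV rejected as OOV
    fa = np.array([(oov_scores >= t).mean() for t in grid])  # OOV accepted as INV
    return grid[np.argmin(np.abs(fr - fa))]

rng = np.random.default_rng(0)
print(operating_threshold(rng.normal(1.0, 0.3, 500),    # INV-like scores
                          rng.normal(0.0, 0.3, 500)))   # OOV-like scores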
Slide 64
Isolated Word Recognition Result Format, contd. FA Rate: false acceptances of OOV words as INV. FR Rate: false rejections of INV words as OOV. Total Error Rate = FA Rate + FR Rate.
Slide 65
HMM Biasing Prior to scoring, the monophone HMM models constituting the INV word are re-estimated against only the INV training data. This allows SeqRec to simulate the performance improvement of context-dependent models. It was found experimentally that performing two re-estimations yielded the optimal increase in performance.
Slide 66
HMM Biasing and Increased Recognizer Performance
Slide 67
Baseline Monophone HMM Results The TIMIT single-word recognizer performance baseline was established using monophone HMMs with 1, 8, and 16 Gaussian mixture components.
Slide 68
Validation of Results The same TIMIT data set was evaluated against third-party WSJ models from the author of the HTK Recipe procedure, and the average Total Error Rate was compared to that of the SeqRec models.
Slide 69
Baseline Results Observations In general, a higher number of mixture components in the GMMs yields lower error rates. This is expected, due to the complex distributions of many of the MFCC features used to represent the speech data. The HMM models generated by SeqRec perform slightly better than the WSJ models, since the WSJ models are re-estimated many times using data from a much broader set than TIMIT alone. Overall, the baseline experiments show that 16-mixture TIMIT monophone HMMs yield the lowest average Total Error Rate, 20.07%.
Slide 70
Incorporating Additional Scores A key feature of the existing WUW system is the application of an additional scoring method. Score 2 can be computed using the same HTK tools that were used to determine the acoustic score.
Slide 71
Distribution of Multiple Scores When combined, Score 1 and Score 2 each contribute unique information to the recognition task. Score 2 shifts the INV score distribution below the OOV results.
Slide 72
Introduction to SVM With two scores, the simple one-dimensional binary classifier can no longer be used. Support Vector Machines (SVMs) are a set of learning methods that can be used to build a complex classification model for data with multiple features.
Slide 73
Fundamentals of SVM Classifiers Consider a task requiring the binary classification of m data points, each having classification label +1 or -1. Each data point is represented by a d-dimensional collection of attributes (also known as a feature vector).
Slide 74
Discriminant Plane The vector w describes the orientation and b the offset of a discriminant plane that can be used to classify the data: f(x) = sign(w · x + b). There are an infinite number of such planes that can be applied to a set of points.
Slide 75
Maximal Margin The plane given by the solid line provides the
best solution because it would be more robust to additional data
that exhibit perturbations from the training set. This plane is
said to provide the maximal margin between the two classes of data
points.
Slide 76
Maximal Margin, contd. For linearly separable data, a method for determining the maximum margin between the two classes is to maximize the margin between two parallel supporting planes. The distance between these planes is maximized to determine the optimal plane for classification.
Slide 77
Maximal Margin, contd. Maximizing the margin is equivalent to maximizing the distance between the two supporting planes. This is solved using the following Quadratic Programming problem: minimize (1/2)||w||² over w and b, subject to y_i (w · x_i + b) ≥ 1 for i = 1, ..., m.
Slide 78
Linearly Inseparable Data For this type of data, a slack variable ξ_i is introduced into each constraint and added as a weighted penalty term: minimize (1/2)||w||² + C Σ ξ_i, subject to y_i (w · x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0. Practically, the C parameter represents a trade-off between classification error and maximal margin.
Slide 79
Alternate Form of the QP Writing the classification rule in its dual form reveals that the maximum-margin hyperplane is only a function of the support vectors: the training data that lie on the margin (the orange data points in the previous slides).
Slide 80
Non-Linear Classification For many data distributions, a simple linear plane cannot be effectively applied to classify points. The data distribution shown would be best classified using an elliptical classification surface.
Slide 81
Non-Linear Classification, contd. Consider 2-dimensional training data with attributes [r, s]. To construct a quadratic discriminant function, the 2-dimensional input can be mapped into a 5-dimensional feature space described by [r, s, rs, r², s²]. A linear discriminant can then be computed in this new feature space and substituted back into the original linear discriminant function, taking the mapping function into account.
Slide 82
Non-Linear Classification, contd. The existing Quadratic Programming problem can be modified to use the mapping function φ(x). For practical usage of SVM, it is not feasible to calculate the mapping function explicitly. SVMs work around this issue by using kernel functions, which allow the inner product φ(x_i) · φ(x_j) to be evaluated without explicitly knowing the mapping function.
Slide 83
Non-Linear Classification, contd. Final form of the Quadratic Programming problem: maximize Σ α_i − (1/2) Σ Σ α_i α_j y_i y_j K(x_i, x_j), subject to 0 ≤ α_i ≤ C and Σ α_i y_i = 0. The popular kernel functions used in SVMs are:

Kernel                      | K(x, y)
Linear                      | x · y
Polynomial (degree d)       | (x · y + 1)^d
Radial Basis Function (RBF) | exp(−γ ||x − y||²)
Sigmoid                     | tanh(κ x · y + c)
Slide 84
Summary of SVM Procedure
1. Select the C parameter (recall this is the trade-off between classification error and margin maximization).
2. Select the kernel function and any kernel-specific parameter values.
3. Solve the Quadratic Programming problem to determine the set of support vectors and multipliers.
4. Recover the threshold variable b using the set of support vectors.
5. Apply the SVM to classify a new data point x using the final classification function.
Slide 85
Example of SVM Polynomial Kernel
Slide 86
Applying SVM to SeqRec LIBSVM is a software library that provides tools allowing users to easily and quickly implement SVM-based classifiers. svm-scale: scales the features of input data. svm-train: trains an SVM model using a set of labeled training data; it supports the popular kernel functions and specification of the C parameter. svm-predict: takes unlabeled data and a previously generated SVM model and outputs the classification label hypothesis determined by applying the decision function.
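As a sketch of the svm-train / svm-predict workflow in Python: scikit-learn's SVC wraps LIBSVM internally, so it can stand in for those command-line tools here. The data is a toy stand-in for the (Score 1, Score 2) pairs; the C and γ values are the ones a later slide reports for TIMIT "greasy":

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy stand-in for (score 1, score 2) pairs: two overlapping clouds.
inv = rng.normal([-1.0, -1.0], 0.5, size=(100, 2))   # INV tokens, label +1
oov = rng.normal([1.0, 1.0], 0.5, size=(100, 2))     # OOV tokens, label -1
X = np.vstack([inv, oov])
y = np.array([1] * 100 + [-1] * 100)

# fit() solves the QP (steps 3-4 of the procedure); predict() applies the
# final classification function (step 5).
clf = SVC(C=8.0, kernel="rbf", gamma=0.03125)
clf.fit(X, y)
print(clf.predict([[-0.8, -1.2], [1.1, 0.9]]))       # -> [ 1 -1]
# Signed decision values: magnitude reflects confidence in the class label.
print(clf.decision_function([[-0.8, -1.2], [1.1, 0.9]]))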
Slide 87
SVM Parameter Search The RBF kernel will be used for the experiments with TIMIT. The two parameters that must be selected when applying the RBF kernel are C and γ. A common method of parameter searching is cross-validation; LIBSVM provides an implementation known as v-fold cross-validation. The training data set is first sub-divided into v subsets. Each subset is then tested using a classifier trained on the other v−1 subsets. This is repeated for every subset, allowing each instance of the whole training set to be predicted once. The cross-validation accuracy is the percentage of data correctly classified by this procedure.
Slide 88
SVM Parameter Search, contd. Cross-validation has the property of avoiding overtraining. If parameters were chosen that yielded the best classification accuracy on the entire training data set, the SVM might be too specific and would falsely reject unseen data. A cross-validated SVM may show worse accuracy during the model-building stage, but in general it will perform better against unseen data. Cross-validation accuracy is computed across a grid of C and γ parameter ranges, as sketched below.
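A sketch of this grid search using scikit-learn's cross-validated search (which performs the same v-fold procedure); the powers-of-two ranges follow the LIBSVM practical guide and are an assumption, since the exact ranges are not reproduced on this slide:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumed search grid: powers of two, as recommended in the LIBSVM guide.
param_grid = {
    "C": [2.0 ** k for k in range(-5, 16, 2)],
    "gamma": [2.0 ** k for k in range(-15, 4, 2)],
}

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # placeholder (score 1, score 2, duration)
y = np.where(X[:, 0] + X[:, 2] > 0, 1, -1)     # placeholder +/-1 labels

# 5-fold cross-validation (v = 5): each fold is predicted by a model
# trained on the other v-1 folds.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)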
Slide 89
Applying SVM to Multiple Scores The TIMIT "greasy" TRAIN score data yields γ = 0.03125 and C = 8 as the best parameters. To evaluate the model on unseen data, the SVM model is then applied to the TEST scores 1 & 2 for TIMIT "greasy". LIBSVM is able to output decision values in addition to the binary class labels; a greater magnitude of the decision value means greater confidence that the point belongs to the chosen class. These values can then be treated as a single-dimensional input to the original binary classifier.
Slide 90
Two-Class SVM TIMIT "greasy" The Total Error Rate is reduced from 2.45% to 0.97%. The recognition rate is 2.55 times better!
Slide 91
Incorporating the Word Duration Feature As opposed to the TIMIT "greasy" example considered so far, the score distributions for some words are highly correlated and do not exhibit good performance using just Scores 1 & 2. One possible cause is that the shorter the time duration of the word, the more apparent any errors in the hand-labeled durations become.
Slide 92
Incorporating the Word Duration Feature, contd. SVMs are capable of handling data with many features, so it makes sense to treat the length of the scored word as a feature itself. If two phonetically similar words such as "a" and "and" produce very similar acoustic scores, duration could intuitively be used to increase the reliability of the decision.
Slide 93
SVM with Duration: TIMIT "and" This lowers the original monophone classifier error rate from 61.83% to 32.95%, a relative improvement of 88%, or 1.88 times. Notice that SVM applied without the duration feature is essentially useless for this particular word.
Slide 94
One-Class SVM The SVMs considered thus far have operated by classifying data vectors into one of two classes. This requires a database of acoustic scores for both the INV word and all other words. One-Class SVM is a class of SVM models that depends on having only a single class of data available for classification.
Slide 95
One-Class SVM, contd. Problem statement: suppose that some data set has a probability distribution P in feature space. Find a simple subset S of the feature space such that the probability that a test point drawn from P lies outside of S is bounded by some a priori value.
Slide 96
One-Class SVM, contd. The strategy is to map the data into kernel feature space (as in regular SVM) and then separate the data from the origin with maximum margin; the origin is treated as the only original member of the negative class. This results in a modification of the Two-Class SVM Quadratic Programming problem, while the classification function has the same form as in Two-Class SVM.
Slide 97
One-Class SVM: ν Parameter The modified QP introduces the ν parameter. As ν approaches 0, the upper bound in the second inequality becomes very large and has decreasing impact on the expression; this leads to a hard-margin problem, because the penalty for errors becomes infinite. As ν is increased, the misclassification penalty is relaxed and errors are allowed. Notice the effect of ν on outliers when the penalty for errors is low.
Slide 98
One-Class SVM Parameter Search The cross-validation grid-search strategy is applied for One-Class SVM parameter optimization, with the following changes (see the sketch below):
1. The ν parameter is searched for instead of the cost parameter C.
2. The input SVM training data now includes only INV TRAIN data.
3. The One-Class SVM model is evaluated against INV and OOV TRAIN data, and this accuracy is recorded in order to evaluate the effect of ν on the overall error rate.
4. The parameters that yield the highest accuracy from (3) are selected.
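A minimal One-Class SVM sketch using scikit-learn's OneClassSVM (also backed by LIBSVM); the ν and γ values and the toy INV cloud are invented for illustration:

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
inv_train = rng.normal([-1.0, -1.0], 0.5, size=(200, 2))   # INV-only scores

# nu bounds the fraction of training errors (and lower-bounds the fraction
# of support vectors): small nu -> hard margin, larger nu -> more tolerance.
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5)
model.fit(inv_train)                 # note: no labels and no OOV data needed

test = np.array([[-1.1, -0.9],       # INV-like point
                 [2.0, 2.0]])        # OOV-like outlier
print(model.predict(test))           # -> [ 1 -1]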
Slide 99
One-Class SVM TIMIT "greasy" This results in a Total Error Rate of 1.10%, compared to the 0.88% achieved by the Two-Class SVM classifier.
Slide 100
One-Class SVM Observations The number of support vectors required for a competitive One-Class SVM model is much lower than the number required for Two-Class SVM (54 for Two-Class versus 19 for One-Class on TIMIT "greasy"). The time to train the One-Class SVM model is also much lower, because only the INV data has to be considered in the Quadratic Programming optimization that determines the maximum-margin classifier (2.330 s for Two-Class versus 0.001 s for One-Class on TIMIT "greasy"). The overall performance is generally lower for One-Class SVM models: the absence of negative information entails a price, and one should not expect results as good as when this information is available.
Slide 101
Final SeqRec Experiment Configurations The following techniques will be evaluated against the 25-word TIMIT test subset:
1. Score 1 Classification (Code: Score 1)
2. (Score 1 + Score 2) Classification with Two-Class SVM (Code: CSVMND)
3. (Score 1 + Score 2 + Duration) Classification with Two-Class SVM (Code: CSVM)
4. (Score 1 + Score 2 + Duration) Classification with One-Class SVM (Code: OSVM)
All acoustic scores are generated using 16-mixture monophone HMMs generated by SeqRec. The TIMIT test set is divided into two groups to increase graph readability.
Slide 102
Evaluation Metrics The Total Error Rate metric will be the primary criterion of performance for each method. The Relative Error Rate Reduction (RERR) and Error Rate Reduction (ERRR) will be calculated and used to compare performance between two methods. With B the Baseline Total Error Rate and N the New Total Error Rate: RERR = (B − N) / N × 100%, and ERRR = B / N (so an ERRR of 2.0 means the error rate was halved).
Slide 103
Manual Parameter Selections Experimentation revealed that the grid-search method does not always yield the most appropriate parameters for Two-Class SVM. The following words were found to perform considerably better for the Two-Class SVM models when using the manually selected parameters listed in the right columns, as opposed to the grid-search parameters in the left columns. Using these values for the problem words in the TIMIT test data set demonstrates the actual capabilities of the SeqRec classifier.

Word  | C (Grid) | γ (Grid) | C (Select) | γ (Select)
ask   | 2048     | 8        | 8          | 8
all   | 8192     | 2        | 32         | 8
water | 8192     | 0.5      | 32         | 2
year  | 2048     | 2        | 8192       | 2
in    | 8192     | 8        |            | 0.5
that  | 512      | 8        | 8192       | 0.5
Slide 104
SeqRec Results TIMIT Test Set 1
Slide 105
Word   | RERR (%) CSVMND | ERRR CSVMND | RERR (%) CSVM | ERRR CSVM | RERR (%) OSVM | ERRR OSVM
suit   | 391             | 4.91        | 723           | 8.23      | 90            | 1.90
greasy | 146             | 2.46        | 172           | 2.72      | 117           | 2.17
year   | 23              | 1.23        | 126           | 2.26      | 8             | 1.08
rag    | 518             | 6.18        | 147           | 2.47      | 19            | 1.19
wash   | 1432            | 15.32       | 4303          | 44.03     | 2249          | 23.49
carry  | 407             | 5.07        | 1152          | 12.52     | 1152          | 12.52
water  | 156             | 2.56        | 180           | 2.80      | 197           | 2.97
she    | 124             | 2.24        | 344           | 4.44      | 245           | 3.45
all    | 22              | 1.22        | 82            | 1.82      | 59            | 1.59
dark   | 590             | 6.90        | 737           | 8.37      | 465           | 5.65
had    | 205             | 3.05        | 253           | 3.53      | 575           | 6.75
ask    | 158             | 2.58        | 324           | 4.24      | 169           | 2.69
oily   | 89              | 1.89        | 228           | 3.28      | 111           | 2.11
me     | 175             | 2.75        | 368           | 4.68      | 211           | 3.11
like   | 389             | 4.89        | 538           | 6.38      | 522           | 6.22
Slide 106
SeqRec Results TIMIT Test Set 2
Slide 107
Word    | RERR (%) CSVMND | ERRR CSVMND | RERR (%) CSVM | ERRR CSVM | RERR (%) OSVM | ERRR OSVM
your    | 251             | 3.51        | 360           | 4.60      | 137           | 2.37
that    | 106             | 2.06        | 133           | 2.33      | 10            | 1.10
an      | 64              | 1.64        | 301           | 4.01      | 79            | 1.79
in      | -15             | 0.85        | 116           | 2.16      | 53            | 1.53
of      | 65              | 1.65        | 198           | 2.98      | 55            | 1.55
to      | 261             | 3.61        | 725           | 8.25      | 278           | 3.78
and     | -36             | 0.64        | 88            | 1.88      | -8            | 0.92
the     | 275             | 3.75        | 673           | 7.73      | 258           | 3.58
Average | 245             | 3.45        | 532           | 6.32      | 301           | 4.00
Slide 108
Concluding Remarks The SeqRec system successfully integrated off-the-shelf speech recognition and SVM frameworks to create a working single-word classification system that shows remarkable error rate improvements on the well-known TIMIT data set. Two-Class SVM scoring with the Duration feature achieved an average RERR of 532%, leading to a single-word recognition system with an overall average Total Error Rate of 5.4%, compared to the baseline of 20.6%. The highest gain was for the TIMIT word "wash": the baseline Total Error Rate was 2.51% and the Two-Class SVM with Duration Total Error Rate was 0.06%, an RERR of 4303%. One-Class SVM is indeed a viable method for significantly reducing recognizer error, with an average RERR of 301%; it outperforms Two-Class SVM without the Duration feature.
Slide 109
Acknowledgements A very special thank you to Dr. Këpuska for his dedication to the field of speech recognition and for allowing me to participate in a very exciting part of it! Thanks to FIT's ECE Department for the support provided to this field of study.
Slide 110
Wrap-up (Time Permitting) Show Individual TIMIT Word Results in
MATLAB. Future Work Topics. Questions from the audience.