KEYWORD SPOTTING USING HIDDEN MARKOV MODELS
by
Şevket Duran
B.S. in E.E., Boğaziçi University, 1997
Submitted to the Institute for Graduate Studies in
Science and Engineering in partial fulfillment of
the requirements for the degree of
Master of Science
in
Electrical Engineering
Boğaziçi University
2001
ACKNOWLEDGEMENTS
To Dr. Levent M. Arslan:
Thank you for the sacrifices of your personal time that you have made unselfishly
to help me prepare this thesis.
Thank you for your encouraging me to study in the area of speech processing.
It is a privilege for me to be your student.
Şevket Duran
ABSTRACT
KEYWORD SPOTTING USING HIDDEN MARKOV MODELS
The aim of a keyword spotting system is to detect a small set of keywords in continuous speech. It is important to obtain the highest possible keyword detection rate without increasing the number of false insertions. Modeling only the keywords is not enough: to separate keywords from non-keywords, models for out-of-vocabulary words are needed as well. Out-of-vocabulary modeling is done with garbage models, whose structure and type have a great effect on overall system performance.
The subject of this MS thesis is to examine context-independent phonemes as garbage models and to evaluate the performance of different criteria as confidence measures for out-of-vocabulary word rejection. Two databases were collected over telephone lines, one for keyword spotting experiments and one for isolated word recognition experiments.
For keyword spotting, the use of monophone models together with a one-state general garbage model gives the best performance. For confidence measures, using average phoneme likelihoods together with phoneme durations performs best.
ÖZET
SAKLI MARKOV MODELLERİ KULLANILARAK ANAHTAR
KELİME YAKALAMA
Anahtar kelime yakalama sisteminin amacı sürekli bir sesin içinde barınan küçük bir anahtar kelimeler gurubu ortaya çıkarmaktır. Bu sistemde önemli olan, kelime olmadığı halde hata verme oranını artırmaksızın olası en yüksek anahtar kelime bulma oranını elde etmektir. Bunun için sadece anahtar kelimeleri modelleme yapmak yeterli değildir. Anahtar kelimeleri, olmayanlardan ayırmak için, sözlük dışı kelimelerin modellemesi de gerekmektedir. Bu modelleme, yapısı ve türü itibarıyla tüm sistem performansı üzerinde büyük etkisi bulunan garbage modellemesi ile yapılmaktadır. Bu tezin konusu garbage modelleri olarak bağımsız içerikli sesbirimleri (monophone) incelemek ve sözlük dışı kelime dışlamaları için güvenilirlik oranları bazında değişik kriterlerin performansını değerlendirmektir. Anahtar kelime yakalama ve telefon üzerinden yalıtılmış ses tanıma denemeleri için iki veritabanı oluşturuldu. Anahtar kelime bulma için en iyi performansı tek fazlı genel garbage modelleme ile birlikte tek-sesbirimsel modellerin kullanılması verdi. Güvenilirlik oranları içinse süreleri ile ortalama sesbirim benzeşmelerinin birlikte kullanımı en iyi performansı gösterdi.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
ÖZET
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
2. BACKGROUND
   2.1. Speech Recognition Problem
   2.2. Speech Recognition Process
      2.2.1. Gathering Digital Speech Input
   2.3. Feature Extraction
   2.4. Hidden Markov Model
      2.4.1. Assumptions in the Theory of HMMs
         2.4.1.1. The Markov Assumption
         2.4.1.2. The Stationarity Assumption
         2.4.1.3. The Output Independence Assumption
      2.4.2. Three Basic Problems of HMMs
         2.4.2.1. The Evaluation Problem
         2.4.2.2. The Decoding Problem
         2.4.2.3. The Learning Problem
      2.4.3. The Evaluation Problem and the Forward Algorithm
      2.4.4. The Decoding Problem and the Viterbi Algorithm
      2.4.5. The Learning Problem
         2.4.5.1. Maximum Likelihood (ML) Criterion
         2.4.5.2. Baum-Welch Algorithm
      2.4.6. Types of Hidden Markov Models
   2.5. Use of HMMs in Speech Recognition
      2.5.1. Subword Unit Selection
      2.5.2. Word Networks
      2.5.3. Training of HMMs
      2.5.4. Recognition
         2.5.4.1. Viterbi Based Recognition
         2.5.4.2. N-Best Search
   2.6. Keyword Spotting Problem
3. PROPOSED KEYWORD SPOTTING ALGORITHM
   3.1. Introduction
   3.2. Experiment Data
   3.3. Performance of a System
   3.4. System Structure
   3.5. Performance of Monophone Models for Isolated Word Recognition
4. CONFIDENCE MEASURES FOR ISOLATED WORD RECOGNITION
   4.1. Introduction
   4.2. Experiment Data
   4.3. Minimum Edit Distance
   4.4. Phoneme Durations
   4.5. Garbage Model Using Same 512 Mixtures
   4.6. Comparison of Confidence Measures
5. CONCLUSION
APPENDIX A: SENTENCES USED FOR KEYWORD SPOTTING
APPENDIX B: MINIMUM EDIT DISTANCE ALGORITHM
REFERENCES
LIST OF FIGURES
Figure 2.1. The waveform and spectrogram of “ev” and “ben eve”
Figure 2.2. The waveform and spectrogram of “okul” and “holding”
Figure 2.3. Components of a typical recognition system
Figure 2.4. The spectrogram of /S/ sound and /e/ sound in word Sevket
Figure 2.5. Flowchart of deriving Mel Frequency Cepstrum Coefficients
Figure 2.6. A simple isolated speech unit recognizer that uses null-grammar
Figure 2.7. The expanded network using the best match triphones
Figure 2.8. The null-grammar network showing the underlying states
Figure 3.1. General structure of the proposed keyword spotter
Figure 3.2. ROC points for different alternatives for garbage model
Figure 3.3. ROC points for different number of keywords for keyword spotting
Figure 3.4. Network structure for the keyword spotter used as a post-processor for isolated word recognizer
Figure 3.5. ROC curves for monophone and garbage model based out-of-vocabulary word rejection
Figure 4.1. ROC curves before/after applying Minimum Edit Distance Revision
Figure 4.2. Forced alignment of the waveform for keyword “iSZbankasIZkurucu”
Figure 4.3. Forced alignment of the waveform for keyword “milpa”
Figure 4.4. ROC curves for phoneme duration based confidence measure
Figure 4.5. Likelihood profiles for “ceylangiyim” and the base garbage model proposed
Figure 4.6. ROC curves for different emphasis values for power value 1
Figure 4.7. ROC curves for different power values with emphasis set to 1
Figure 4.8. ROC curve for phoneme duration based confidence measure and confidence measure with likelihood ratio scoring included
LIST OF TABLES
Table 3.1. Database used for keyword spotting
Table 3.2. Number of occurrences of the keywords used for keyword spotting tests
Table 3.3. Results for monophone model based out-of-vocabulary word rejection for isolated word recognition
Table 3.4. Results for general garbage model based out-of-vocabulary word rejection for isolated word recognition
Table 4.1. Average phoneme durations in Turkish
Table 4.2. Computation time required with/without phoneme duration evaluation
1. INTRODUCTION
Communication between people and computers through more natural interfaces is an important issue if computers are to be part of our daily lives. To interact with a computer you normally have to use your hands, whether the device is a keyboard, a mouse, or the dialing pad of your phone when you access information on a computer over a telephone line. A more natural interface for input is speech.
Human-computer interaction via speech involves speech recognition [1, 2, 3] and
speech synthesis [4]. Speech recognition is the conversion of speech signal into text and
synthesis is the opposite. Speech recognition may range from understanding simple
commands to getting all information in speech signal such as all words, the meaning and
the emotional state of the speaker.
After many years of work, speech recognition is at a level mature enough to be used in practical applications. This is due to the algorithms that have been developed and the increase in computational power.
Speech recognition may be speaker dependent or speaker independent. If the
application is for home use, and the same person will use the same microphone at the same
place, then the problem is simple and you don’t need a robust algorithm. But if it is an
application that will recognize speech over a public telephone network where speaker
variability and the environment that speech passes through are different among different
calls, you need a robust algorithm.
If recognition of isolated words or phrases is the problem, the task is easier, as long as the speakers give only the required input. If the speakers also use other words in addition to the keywords you require, you need to perform “keyword spotting”, which means recognizing the keywords among other non-keyword filler words. Going further, recognizing all of the words from a large vocabulary is called dictation, which is a harder task. We will be dealing with the keyword-spotting problem in this thesis.
For speech recognition, the digitized speech signal that is in the time domain must
be transformed into another domain. Generally some part of the speech is taken and a
feature vector is derived to represent that part. Next these feature vectors are used to guess
the sequence of words that generated this speech signal. We need algorithms that account
for the variability in the speech signal. The most common technique for acoustic modeling
is called hidden Markov modeling (HMM). We have used this model in this thesis.
In order to have an operating-system-independent notation, we preferred not to use non-ANSI characters from the Turkish character set. We have used lower case letters for characters that are in the ANSI character set, and upper case letters for Turkish characters: /S/ instead of /ş/, /U/ instead of /ü/, and so on. We use /Z/ for interword silence, so the phrase “savaş alanı” is represented as “savaSZalanI”.
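As a sketch, the transliteration just described can be expressed as a small routine. The pairs /ş/→/S/ and /ü/→/U/ and the /Z/ silence marker are stated above; the remaining Turkish-letter pairs are my extrapolation of the same convention, not taken from the text:

```python
# Sketch of the thesis's ASCII notation for Turkish text: lower case for
# ANSI characters, upper case for Turkish-specific letters, and Z for
# inter-word silence. Only the S and U substitutions are named in the
# text; the other pairs below are assumed from the same convention.
TURKISH_TO_ASCII = {
    "ş": "S", "ü": "U", "ç": "C", "ö": "O", "ğ": "G", "ı": "I",
}

def to_ascii_notation(text: str) -> str:
    """Replace Turkish characters and mark inter-word silence with Z."""
    out = "".join(TURKISH_TO_ASCII.get(ch, ch) for ch in text)
    return out.replace(" ", "Z")

print(to_ascii_notation("savaş alanı"))  # savaSZalanI
```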
In this thesis we investigate some models for garbage models for keyword spotting
and try to find some confidence measure for detection of out-of-vocabulary words in an
isolated word recognizer.
In chapter 2, we give the theory of each step in speech recognition process and give
details of the techniques we have used. In chapter 3, we study the keyword-spotting
algorithm we have proposed and conclude that using the monophone models of the words
and a one-state 16-mixture general garbage model with different bonus values gives the
best performance. We evaluate the performance of monophone models as garbage model
for isolated word recognition. In chapter 4, we evaluate some measures to obtain a good
confidence measure and decide both likelihood and phoneme duration are important for
obtaining a good confidence measure. Finally, in chapter 5 we give our conclusions from
these experiments and suggest some directions for future study.
2. BACKGROUND
2.1. Speech Recognition Problem
The speech signal is different depending on whether the input is given as isolated words or as continuous speech. If the speaker knows that a computer will try to recognize the speech, then he/she may pause between words. In continuous speech, however, some sounds will disappear and sometimes there will be no silence between words. It may also be hard to pronounce a word in a particular context; an exaggerated example is a tongue twister like “SemsiZpaSaZpasajIndaZsesiZbUzUSesiceler”. Even in normal cases there is great
difference in the characteristics of the speech signal. Figure 2.1 shows the same /e/ sound
in “ev” and “ben eve”. The waveforms are shown at the top of the figure. The
spectrograms at the bottom show the energy at different frequencies versus time. The
darkness shows the amplitude. The effect of the context can be seen on the characteristics
of the /e/ sound. The effect of the context leads us to model each phoneme according to the
neighboring phonemes.
Figure 2.1. The waveform and spectrogram of “ev” (on the left) and “ben eve” (on
the right).
Spontaneous speech may contain other fillers that are not words like “ee” or
“hImm”. It is another difficulty in continuous speech recognition.
The task should be known while designing the algorithm. If we are to use a
recognizer in continuous speech recognition, the training data should consist of continuous
speech as well.
The main difficulty of the speech recognition problem comes from the variability of
the source of the signal. First, the characteristics of phonemes, the smallest sound units, are
dependent on the context in which they appear. An example of phonetic variability is the acoustic difference of the phoneme /o/ in the Turkish words “okul” and “holding”; see Figure 2.2, where the marked region corresponds to the /o/ sound.
Figure 2.2. The waveform and spectrogram of “okul” (on the left) and “holding” (on the
right)
The environment also causes variability. The same speaker will say a word
differently according to his physical and emotional state, speaking rate, or voice quality.
The difference in the vocal tract size and shape of different people also causes variability.
The problem is to find the meaningful information in the speech signal. The meaningful
information is different for speech recognition and speaker recognition. The same
information in the speech signal may be necessary for some application and redundant for
some other application. For speaker independent speech recognition we have to get rid of
as many speaker-related features as possible.
2.2. Speech Recognition Process
Figure 2.3 shows the main components of a typical speech recognition system. The
digitized speech signal is first transformed into a set of useful measurements or features at
a fixed rate, typically once every 10 milliseconds. These measurements are then used to
search for the most likely word candidate, making use of constraints imposed by the
acoustic, lexical, and language models. Throughout this process, training data are used to
determine the values of the model parameters.
Figure 2.3. Components of a typical speech recognition system
2.2.1. Gathering Digital Speech Input
Speech recognition is the process of converting digital speech signal into words.
To capture speech signal we need a device that converts physical speech wave into digital
signal. This may be a microphone that converts the speech into analog signal and a sound
card that is an A/D converter that converts the analog signal to the digital signal. Another
way of obtaining digital speech input is to use a telephone card that converts the analog
signal that comes from the telephone line into digital signal. There are also devices that can
take the digital signal coming from E-1 or T-1 lines directly. Dialogic has JCT LS240 and
JCT LS300 for T-1 lines and E-1 lines respectively. We have been using a JCT LS120
card, which is a speech-processing card for 12 analog lines. We have used an 8 kHz sampling rate, which is the standard sampling rate for telephone lines, and converted the µ-law encoded signal into a 16-bit linear encoded signal before processing.
2.3. Feature Extraction
To get rid of redundancies in the speech signal mentioned earlier we have to represent the
signal by only taking the perceptually most important speaker-independent features [5].
Passing the excitation signal generated by the larynx through the vocal tract produces the
speech signal. We are interested in the properties of the speech generated by the overall
shape of the vocal tract. To distinguish phonemes better (the voiced /unvoiced distinction),
we examine if the vocal folds are vibrating but ignore the variations in the frequency of
vibration. The spectrum of voiced sounds has several sharp peaks, which are called
formant frequencies. The spectrum of unvoiced sounds looks like white noise spectrum.
Figure 2.4 shows the spectrum of the unvoiced sound /S/ and the voiced sound /e/.
Figure 2.4. The spectrum (found using 256 point FFT) of /S/ sound (on the left) and /e/
sound (on the right) in word Sevket
Since our ears are insensitive to phase effects we use the power spectrum as a basis
for speech recognition front-end. The power spectrum is represented on a log scale. When
the overall gain of the signal varies, the shape of the log power spectrum is the same but
shifted up or down. The convolutional effects of the telephone lines are multiplied with the
signal on the linear power spectrum. In log power spectrum the effect is additive. Since a
voiced speech waveform corresponds to convolution of a quasi-periodic excitation signal
and a time-varying filter (shape of the vocal tract), we can separate them in the log power
spectrum. Assigning a lower limit to the log function solves the problem of low energy
levels at some part of the spectrum.
Before computing short-term power spectra, the waveform is processed by a
simple pre-emphasis filter to give a 6 dB/octave increase in gain. This makes the average
speech spectrum roughly flat.
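The pre-emphasis stage can be sketched as a first-order difference filter. The coefficient 0.97 used here is a typical choice and my assumption, since the text only specifies the roughly 6 dB/octave rise in gain:

```python
def pre_emphasize(x, a=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - a * x[n-1].
    The coefficient a = 0.97 is a common choice (an assumption here);
    the thesis only states the filter's 6 dB/octave slope."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

samples = [0.0, 1.0, 1.0, 0.5]
# Slowly varying (low-frequency) parts are attenuated, sharp changes kept.
print(pre_emphasize(samples))
```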
We have to extract the effects caused by the shape of the vocal tract. One method is
to predict the coefficients of the filter that corresponds to the shape of the vocal tract. The
vocal tract is assumed to be a combination of lossless tubes with different radius. The
number of parameters derived corresponds to the number of tubes assumed. The filter is
assumed to be an all-pole linear filter. The parameters are called Linear Predictive Coding
(LPC) parameters and the procedure is known as LPC analysis. There are different
methods to calculate these coefficients [6].
To calculate the short-term spectra we take overlapping portions of the waveform.
We take a frame of 25 milliseconds and multiply it with a window function to avoid
artificial high frequencies. We use a Hamming window. Then we apply Fourier transform.
We have to get rid of the harmonic structure at the multiples of the fundamental frequency, $f_0$, because it is the effect of the excitation signal. The smoothed spectrum without the effect of the excitation signal corresponds to the Fourier transform of the LPC parameters.
We use a different method and group components of the power spectrum and form
frequency bands. Grouping is not linear; the human ear sensitivity is taken into account.
The bands are linear up to 1 kHz and logarithmic at higher frequencies. The frequency
bands are broader at higher frequencies. The positions of the bands are set according to the
mel frequency scale [7].
The relation between mel frequency scale and linear frequency scale is as follows:
$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (2.1)$$
Figure 2.5. Flowchart for deriving Mel Frequency Cepstrum Coefficients
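Equation (2.1) translates directly into code. As a small check, the mel scale is calibrated so that 1000 Hz maps to roughly 1000 mel, while higher frequencies are compressed:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Mel scale of Eq. (2.1): Mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))  # ≈ 1000 mel
print(hz_to_mel(4000.0))  # well below 4000 mel: the scale compresses high frequencies
```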
To calculate the filterbank coefficients, the magnitude coefficients of the spectrum
are accumulated after windowing with these triangular windows. Triangular filters are
spread over the whole frequency range from zero up to the Nyquist frequency. We have
chosen 16 filter banks.
Since the shape of the spectrum imposed by the vocal tract is smooth, energy levels
in adjacent bands are correlated. We have to remove correlation since in further statistical
analysis we assume that feature vector elements are uncorrelated and use a diagonal
variance vector. Removing the correlation helps the number of parameters to be reduced
without loss of useful information. The discrete cosine transform (a version of the Fourier
transform using only cosine basis functions) converts the set of log energies to a set of
cepstral coefficients, which are largely uncorrelated. The formula for Discrete Cosine
Transform is:
$$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left(\frac{\pi i}{N}(j - 0.5)\right), \qquad i = 1, \ldots, P \qquad (2.2)$$

where $m_j$ are the log filter bank amplitudes, $N$ is the number of filterbank channels, which we set to 16, and $P$ is the required number of cepstral coefficients, which we set to 12. Figure 2.5 shows the steps in obtaining Mel Frequency Cepstrum Coefficients (MFCCs).
Many systems use the rate of change of the short-term power spectrum as
additional information. The simplest way to obtain this dynamic information is to take the
difference between consecutive frames. But this is too sensitive to random interframe
variations. So, linear trends are estimated over sequences of typically five or seven frames
[8]. We use five frames, so there will be a delay of two frame steps in real-time operation.

$$d_t = G \left( 2c_{t+2} + c_{t+1} - c_{t-1} - 2c_{t-2} \right) \qquad (2.3)$$

where $d_t$ is the difference evaluated at time $t$; $c_{t+2}$, $c_{t+1}$, $c_{t-1}$ and $c_{t-2}$ are the coefficients at times $t+2$, $t+1$, $t-1$ and $t-2$, respectively; and $G$ is a gain factor selected as 0.375.
Some systems use the acceleration features as well as linear rates of change. These
second-order dynamic features need longer sequences of frames for reliable estimation [9].
Since cepstral coefficients are largely uncorrelated, probability estimates are easier
in further analysis. We can simply calculate Euclidean distances from reference model
vectors. Statistically based methods weigh coefficients by the inverse of their standard
deviations computed around their overall means.
Current representations concentrate on the spectrum envelope and ignore
fundamental frequency; but we know that even in isolated-word recognition fundamental
frequency contours carry important information.
At the acoustic phonetic level, speaker variability is typically modeled using
statistical techniques applied to large amounts of training data. Effects of context at the
acoustic phonetic level are handled by training separate models for phonemes in different
contexts; this is called context dependent acoustic modeling.
Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Another technique is to add the different pronunciations directly to the word network; after common nodes in the network are merged, the paths correspond to the different pronunciations of the same word.
2.4. Hidden Markov Model
The most widely used recognition algorithm of the past fifteen years is the Hidden Markov Model (HMM) [10, 11, 12]. Although there have been some attempts at using neural networks, they have not been very successful.
The Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transition probabilities are assigned to the transitions among the states. In a particular state an outcome or observation can be generated according to the associated probability distribution. An external observer can only see the outcome, not the states; therefore the states are hidden from the outside.
The following part presents the theory of HMMs, taken from the tutorial [3]. The
advanced reader can skip this part.
In order to define an HMM completely, the following elements are needed:
• The number of states of the model, N.
• The number of observation symbols in the alphabet, M. If the observations are
continuous then M is infinite.
• A set of state transition probabilities, $A = \{a_{ij}\}$:

$$a_{ij} = p\{q_{t+1} = j \mid q_t = i\}, \qquad 1 \le i, j \le N \qquad (2.4)$$

where $q_t$ denotes the state at time $t$ and $a_{ij}$ is the transition probability from state $i$ to state $j$. Transition probabilities should satisfy the normal stochastic constraints,

$$a_{ij} \ge 0, \qquad 1 \le i, j \le N \qquad (2.5)$$

and

$$\sum_{j=1}^{N} a_{ij} = 1, \qquad 1 \le i \le N \qquad (2.6)$$
• A probability distribution in each of the states, $B = \{b_j(k)\}$:

$$b_j(k) = p\{o_t = \nu_k \mid q_t = j\}, \qquad 1 \le j \le N, \quad 1 \le k \le M \qquad (2.7)$$

where $\nu_k$ denotes the $k$th observation symbol in the alphabet and $o_t$ the current observation vector. The following stochastic constraints must be satisfied:

$$b_j(k) \ge 0, \qquad 1 \le j \le N, \quad 1 \le k \le M \qquad (2.8)$$

and

$$\sum_{k=1}^{M} b_j(k) = 1, \qquad 1 \le j \le N \qquad (2.9)$$
If the observations are continuous then we will have to use a continuous probability
density function, instead of a set of discrete probabilities. In this case we specify
the parameters of the probability density function. Usually the probability density is
approximated by a weighted sum of M Gaussian distributions,
$$b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t, \mu_{jm}, \Sigma_{jm}) \qquad (2.10)$$

where $c_{jm}$ are the mixture weights for the $j$th state's $m$th mixture, $\mu_{jm}$ are the mean vectors, and $\Sigma_{jm}$ are the covariance matrices. The $c_{jm}$ should satisfy the stochastic constraints,

$$c_{jm} \ge 0, \qquad 1 \le j \le N, \quad 1 \le m \le M \qquad (2.11)$$

and

$$\sum_{m=1}^{M} c_{jm} = 1, \qquad 1 \le j \le N \qquad (2.12)$$
• The initial state distribution, $\pi = \{\pi_i\}$, where

$$\pi_i = p\{q_1 = i\}, \qquad 1 \le i \le N \qquad (2.13)$$

Therefore the compact notation

$$\lambda = (A, B, \pi) \qquad (2.14)$$

can be used to denote an HMM with discrete probability distributions, while

$$\lambda = (A, c_{jm}, \mu_{jm}, \Sigma_{jm}, \pi) \qquad (2.15)$$

denotes one with continuous densities.
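The elements above can be collected in a minimal container for the discrete case, checking the stochastic constraints of Eqs. (2.5)-(2.6), (2.8)-(2.9) and (2.13). The two-state numbers are purely illustrative:

```python
# A minimal container for a discrete-observation HMM lambda = (A, B, pi).
# The toy two-state model is illustrative only, not from the thesis.
class HMM:
    def __init__(self, A, B, pi):
        self.A = A    # N x N transition probabilities a_ij
        self.B = B    # N x M observation probabilities b_j(k)
        self.pi = pi  # initial state distribution
        self._check()

    def _check(self, tol=1e-9):
        for row in self.A + self.B + [self.pi]:
            assert all(p >= 0 for p in row)   # non-negativity constraints
            assert abs(sum(row) - 1.0) < tol  # each distribution sums to one

toy = HMM(A=[[0.7, 0.3], [0.4, 0.6]],
          B=[[0.9, 0.1], [0.2, 0.8]],
          pi=[0.5, 0.5])
print(len(toy.A))  # N = 2 states
```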
2.4.1. Assumptions in the Theory of HMMs
For the sake of mathematical and computational tractability, the following assumptions are made in the theory of HMMs.
2.4.1.1. The Markov Assumption. As given in the definition of HMMs, transition probabilities are defined as

$$a_{ij} = p\{q_{t+1} = j \mid q_t = i\}, \qquad 1 \le i, j \le N \qquad (2.16)$$
In other words it is assumed that the next state is dependent only upon the current
state. This is called the Markov assumption and the resulting model becomes actually a
first order HMM. However the next state may depend on past k states and it is possible to
obtain such a model, called a kth order HMM. But a higher order HMM will have a higher
complexity.
2.4.1.2. The Stationarity Assumption. Here it is assumed that state transition probabilities are independent of the actual time at which the transitions take place. Mathematically,

$$p\{q_{t_1+1} = j \mid q_{t_1} = i\} = p\{q_{t_2+1} = j \mid q_{t_2} = i\} \qquad (2.17)$$

for any $t_1$ and $t_2$.
2.4.1.3. The Output Independence Assumption. This is the assumption that the current output (observation) is statistically independent of the previous outputs (observations). We can formulate this assumption mathematically by considering a sequence of observations,

$$O = o_1, o_2, \ldots, o_T \qquad (2.18)$$

Then according to the assumption, for an HMM $\lambda$,

$$p\{O \mid q_1, q_2, \ldots, q_T, \lambda\} = \prod_{t=1}^{T} p(o_t \mid q_t, \lambda) \qquad (2.19)$$

However, unlike the other two, this assumption has very limited validity. In some cases it is not a fair assumption and therefore becomes a severe weakness of HMMs.
2.4.2. Three Basic Problems of HMMs
Once we have an HMM, there are three problems of interest.
2.4.2.1. The Evaluation Problem. Given an HMM $\lambda$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, what is the probability that the observations are generated by the model, $p\{O \mid \lambda\}$?

2.4.2.2. The Decoding Problem. Given a model $\lambda$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, what is the most likely state sequence in the model that produced the observations?

2.4.2.3. The Learning Problem. Given a model $\lambda$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, how should we adjust the model parameters $(A, B, \pi)$ in order to maximize $p\{O \mid \lambda\}$?

The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to continuous recognition as well as to segmentation. The learning problem must be solved if we want to train an HMM for subsequent use in recognition tasks.
2.4.3. The Evaluation Problem and the Forward Algorithm
We have a model $\lambda = (A, B, \pi)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, and $p\{O \mid \lambda\}$ must be found. If we calculate this quantity using simple probabilistic arguments, the number of operations is on the order of $N^T$. This is very large even if the length of the sequence, $T$, is small. The idea of keeping the multiplications that are common led to the use of an auxiliary variable, called the forward variable and denoted $\alpha_t(i)$.

The forward variable is defined as the probability of the partial observation sequence $o_1, o_2, \ldots, o_t$ terminating at state $i$. Mathematically,

$$\alpha_t(i) = p\{o_1, o_2, \ldots, o_t, q_t = i \mid \lambda\} \qquad (2.20)$$

Then it is easy to see that the following recursive relationship holds:

$$\alpha_{t+1}(j) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_t(i)\, a_{ij}, \qquad 1 \le j \le N, \quad 1 \le t \le T-1 \qquad (2.21)$$

where

$$\alpha_1(j) = \pi_j\, b_j(o_1), \qquad 1 \le j \le N \qquad (2.22)$$

Using this recursion we can calculate $\alpha_T(i)$, $1 \le i \le N$, and then the required probability is given by

$$p\{O \mid \lambda\} = \sum_{i=1}^{N} \alpha_T(i) \qquad (2.23)$$

The complexity of this method, known as the forward algorithm, is proportional to $N^2 T$, which is linear with respect to $T$, whereas the direct calculation has exponential complexity.
In a similar way the backward variable $\beta_t(i)$ is defined as the probability of the partial observation sequence $o_{t+1}, o_{t+2}, \ldots, o_T$, given that the current state is $i$. Mathematically,

$$\beta_t(i) = p\{o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda\} \qquad (2.24)$$

As in the case of $\alpha_t(i)$, there is a recursive relationship which can be used to calculate $\beta_t(i)$ efficiently:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad 1 \le i \le N, \quad 1 \le t \le T-1 \qquad (2.25)$$

where

$$\beta_T(i) = 1, \qquad 1 \le i \le N \qquad (2.26)$$

Further we can see that

$$\alpha_t(i)\, \beta_t(i) = p\{O, q_t = i \mid \lambda\}, \qquad 1 \le i \le N, \quad 1 \le t \le T \qquad (2.27)$$

Therefore this gives another way to calculate $p\{O \mid \lambda\}$, using both forward and backward variables:

$$p\{O \mid \lambda\} = \sum_{i=1}^{N} p\{O, q_t = i \mid \lambda\} = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \qquad (2.28)$$

The last equation is very useful, especially in deriving the formulas required for gradient based training.
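The forward and backward recursions of Eqs. (2.21)-(2.26) can be sketched directly in code. The two-state model and short observation string are illustrative only; a practical implementation would work with scaled or log-domain values to avoid underflow on long utterances:

```python
# Forward recursion (Eqs. 2.21-2.22) and backward recursion (Eqs. 2.25-2.26)
# for a discrete-observation HMM; toy parameters, illustrative only.
def forward(A, B, pi, obs):
    N = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]        # Eq. (2.22)
    for t in range(1, len(obs)):
        alpha.append([B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j]
                                         for i in range(N))
                      for j in range(N)])                      # Eq. (2.21)
    return alpha

def backward(A, B, obs):
    N = len(A)
    T = len(obs)
    beta = [[1.0] * N]                                         # Eq. (2.26)
    for t in range(T - 2, -1, -1):
        # beta[0] currently holds the row for time t+1
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j]
                            for j in range(N))
                        for i in range(N)])                    # Eq. (2.25)
    return beta

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 1, 0]

alpha = forward(A, B, pi, obs)
beta = backward(A, B, obs)
p_forward = sum(alpha[-1])                                     # Eq. (2.23)
# Eq. (2.28): the same probability from alpha and beta at any time t
p_both = sum(a * b for a, b in zip(alpha[0], beta[0]))
print(abs(p_forward - p_both) < 1e-12)  # True
```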
2.4.4. The Decoding Problem and the Viterbi Algorithm
In this case we want to find the most likely state sequence for a given sequence of
observations, toooO ,...,, 21= and a model, ),,( πλ BA= .
The solution to this problem depends upon the way “most likely state sequence” is
defined. One approach is to find the most likely state tq at t=t and to concatenate all such
' tq 's. But some times this method does not give a physically meaningful state sequence.
Therefore we would need another method which has no such problems. In this method,
commonly known as Viterbi algorithm [13], the whole state sequence with the maximum
likelihood is found. In order to facilitate the computation we define an auxiliary variable,
},...,,,,...,,{max)( 121121
1...21λδ −−
−== ttt
tqqqt oooiqqqqpi
)29.2(
which gives the highest probability that partial observation sequence and state sequence up
to t=t can have, when the current state is i. It is easy to observe that the following recursive
relationship holds.
=
≤≤++ ijtNitjt aiobj )(max)()(111 δδ , Ni ≤≤1 , 11 −≤≤ Tt )30.2(
where,
)()( 11 obj jjπδ = , Ni ≤≤1 )31.2(
So the procedure to find the most likely state sequence starts from the calculation of
$\delta_T(j)$, $1 \le j \le N$, using the recursion (2.30), while always keeping a pointer to the
"winning state" in the maximum finding operation. Finally the state $j^*$ is found, where

$$j^* = \arg\max_{1 \le j \le N} \delta_T(j) \qquad (2.32)$$

and starting from this state, the sequence of states is back-tracked as the pointer in each
state indicates. This gives the required set of states. This whole algorithm can be
interpreted as a search in a graph whose nodes are formed by the states of the HMM at
each time instant $t$, $1 \le t \le T$.
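The procedure of Eqns (2.29)–(2.32) — induction with back-pointers followed by back-tracking — can be sketched as follows. The two-state model and its emission likelihoods are illustrative values, not taken from the thesis:

```python
import numpy as np

def viterbi(A, B_obs, pi):
    """Viterbi decoding per Eqns (2.29)-(2.32): returns the single most
    likely state sequence and its probability. B_obs[t, i] = b_i(o_t)."""
    T, N = B_obs.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)   # back-pointers to the "winning state"
    delta[0] = pi * B_obs[0]            # Eqn (2.31)
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A        # delta_t(i) * a_ij for all i, j
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B_obs[t]  # Eqn (2.30)
    # Back-track from the best final state, Eqn (2.32).
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[T - 1].max()

# Toy 2-state model with precomputed emission likelihoods.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.6, 0.4])
B_obs = np.array([[0.5, 0.1], [0.4, 0.3], [0.7, 0.2]])
path, score = viterbi(A, B_obs, pi)
print(path)  # [0, 0, 0]
```

In a real recogniser the products above are replaced by sums of log probabilities to avoid underflow.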
2.4.5. The Learning Problem
Generally, the learning problem is how to adjust the HMM parameters so that the
given set of observations (called the training set) is represented by the model in the best
way for the intended application. Thus it should be clear that the quantity we wish to
optimize during the learning process can differ from application to application. In
other words, there may be several optimization criteria for learning. The criterion we will
use is the Maximum Likelihood (ML) criterion.
2.4.5.1. Maximum Likelihood (ML) Criterion. In ML we try to maximize the probability
of a given sequence of observations $O^w$, belonging to a given class $w$, given the HMM
$\lambda_w$ of the class $w$, with respect to the parameters of the model $\lambda_w$. This probability is the
total likelihood of the observations and can be expressed mathematically as

$$L_{tot} = p\{O^w \mid \lambda_w\} \qquad (2.33)$$

However, since we consider only one class $w$ at a time, we can drop the subscripts
and superscripts $w$. Then the ML criterion can be given as

$$L_{tot} = p\{O \mid \lambda\} \qquad (2.34)$$
However, there is no known way to analytically solve for the model $\lambda = (A, B, \pi)$
which maximizes the quantity $L_{tot}$. But we can choose model parameters such that it is
locally maximized, using an iterative procedure like the Baum-Welch method [12].
2.4.5.2. Baum-Welch Algorithm. This method can be derived using simple "occurrence
counting" arguments or using calculus to maximize the auxiliary quantity

$$Q(\lambda, \bar{\lambda}) = \sum_{q} p\{q \mid O, \lambda\} \log p\{O, q \mid \bar{\lambda}\} \qquad (2.35)$$

over $\bar{\lambda}$ [3]. A special feature of the algorithm is its guaranteed convergence.

To describe the Baum-Welch algorithm (also known as the Forward-Backward algorithm),
we need to define two more auxiliary variables, in addition to the forward and backward
variables defined in a previous section. These variables can however be expressed in terms
of the forward and backward variables.
The first of these variables is defined as the probability of being in state $i$ at time $t$ and
in state $j$ at time $t+1$. Formally,

$$\xi_t(i,j) = p\{q_t = i, q_{t+1} = j \mid O, \lambda\} \qquad (2.36)$$

This is the same as

$$\xi_t(i,j) = \frac{p\{q_t = i, q_{t+1} = j, O \mid \lambda\}}{p\{O \mid \lambda\}} \qquad (2.37)$$
Using forward and backward variables this can be expressed as

$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)} \qquad (2.38)$$
The second variable is the a posteriori probability

$$\gamma_t(i) = p\{q_t = i \mid O, \lambda\} \qquad (2.39)$$

that is, the probability of being in state $i$ at time $t$, given the observation sequence and the
model.
In terms of forward and backward variables this can be expressed as

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} \qquad (2.40)$$
It can be seen that the relationship between $\gamma_t(i)$ and $\xi_t(i,j)$ is given by

$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j), \qquad 1 \le i \le N,\; 1 \le t \le T-1 \qquad (2.41)$$
Now it is possible to describe the Baum-Welch learning process, where the parameters of the
HMM are updated in such a way as to maximize the quantity $p\{O \mid \lambda\}$. Assuming a starting
model $\lambda = (A, B, \pi)$, we calculate the $\alpha$'s and $\beta$'s using the recursions (2.21) and (2.25),
and then the $\xi$'s and $\gamma$'s using (2.38) and (2.41). The next step is to update the HMM parameters
according to Eqns (2.42) to (2.44), known as the re-estimation formulas:

$$\bar{\pi}_i = \gamma_1(i), \qquad 1 \le i \le N \qquad (2.42)$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad 1 \le i \le N,\; 1 \le j \le N \qquad (2.43)$$

$$\bar{b}_j(k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}, \qquad 1 \le j \le N,\; 1 \le k \le M \qquad (2.44)$$
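One full re-estimation step, combining the forward-backward passes with Eqns (2.42)–(2.44), might look like the sketch below for a discrete-observation HMM. This is an illustration only: the thesis itself uses tied Gaussian mixtures, and the discrete case is shown because its output update (2.44) is the one given above.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step (Eqns 2.42-2.44) for a discrete HMM.
    obs is a sequence of symbol indices; B[j, k] = b_j(k)."""
    T, N = len(obs), A.shape[0]
    B_obs = B[:, obs].T                       # B_obs[t, i] = b_i(o_t)
    # Forward and backward passes (Eqns 2.21 and 2.25).
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B_obs[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B_obs[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B_obs[t + 1] * beta[t + 1])
    pO = alpha[-1].sum()
    # gamma (Eqn 2.40) and xi (Eqn 2.38); the xi denominator equals p(O | lambda).
    gamma = alpha * beta / pO
    xi = (alpha[:-1, :, None] * A[None] * (B_obs[1:] * beta[1:])[:, None, :]) / pO
    # Re-estimation formulas (2.42)-(2.44).
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi

# Hypothetical 2-state, 2-symbol model and a short observation sequence.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.4], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
A2, B2, pi2 = baum_welch_step(A, B, pi, [0, 1, 0, 0, 1])
```

Each iteration is guaranteed not to decrease the training likelihood, which is the convergence property noted above.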
2.4.6. Types of Hidden Markov Models
HMMs can be classified according to the nature of the elements of the B matrix,
which are distribution functions.
In discrete HMMs, distributions are defined on finite spaces. Observations are
vectors of symbols in a finite alphabet of $N$ different elements. For each one of the $Q$
vector components, a discrete density $\{w(k),\, k = 1, \ldots, N\}$ is defined, and the distribution is
obtained by multiplying the probabilities of each component.
Another possibility is to define distributions as probability densities on continuous
observation spaces. In this case, the functional form of the distributions has to have certain
characteristics, in order to have a manageable number of statistical parameters to estimate.
The density functions are usually Gaussian or Laplacian. The statistics can be
characterized by the mean vector and the covariance matrix. HMMs with these kinds of
distributions are usually referred to as continuous HMMs. A large number of base densities
has to be used in every mixture. Since most of the time the training data is not sufficient,
different models share the same distributions. Different models are expressed in terms of
base distribution functions using different weights. This type of HMM is called a semi-
continuous HMM [11]. Base densities are assumed to be statistically independent; so the
distributions associated with model transitions are products of the component density
functions. Parameters of statistical models are estimated using iterative learning algorithms
[14]. The likelihood of a set of training data increases at each step.
2.5. Use of HMMs in Speech Recognition
In a statistical framework, a set of elementary probabilistic models of basic
linguistic units (e.g., phonemes) is used to build word representations. A sequence of
acoustic parameters, extracted from a speech signal, is seen as an output of a HMM which
is formed by concatenating elementary processes. The underlying state sequence
corresponds to the meaningful combinations of the phonemes. The transitions between the
states of the phoneme correspond to the variability in duration. The stochastic observable
outputs correspond to the spectral variability.
It is not practical to train whole-word models for each word. Words are usually
represented as networks of phonemes. Each path in a word network represents a
pronunciation of the word.
2.5.1. Subword Unit Selection
The same phoneme can have different acoustic distributions of observations if
pronounced in different contexts. Allophone models of a phoneme are models of that
phoneme in different contexts. The decision as to how many allophones should be
considered for a given phoneme may depend on many factors, e.g., the availability of
enough training data to determine the model parameters.
A conceptually interesting approach is the use of polyphones [15]. In principle, an
allophone should be considered for every different word in which a phoneme appears. If
the vocabulary is large, it is unlikely that there are enough data to train all these allophone
models, so models for allophones of phonemes are considered at a different level of detail
(word, syllable, triphone, diphone, context independent phoneme). We have been using
790 triphones during the tests. These are selected to cover most of the triphones in Turkish
[16]. We have been using capital letters for Turkish characters that are not in the ANSI
character set and /Z/ for silence. For example we have used /C/ instead of /ç/.
Another approach consists of choosing allophones by clustering possible contexts.
This choice can be made automatically with Classification and Regression Trees (CART).
A CART is a binary tree having a phoneme at the root and, associated with each node, a
question $Q_i$ about the context. Questions $Q_i$ are of the type, "Is the previous phoneme a
nasal consonant?" For each possible answer (YES or NO) there is a link to another node
with which other questions are associated. There are algorithms for growing and pruning
CARTs based on automatically assigning questions to a node from a manually determined
pool of questions. The leaves of the tree may be simply labeled by an allophone symbol
[17, 18].
We use a score to find the best match for a triphone that is not in the HMM list.
A score is assigned to each pair of phonemes according to their spectral similarity; for
example, the similarity score of /m/ and /n/ is 0.6768. The score is calculated automatically
using the spectral distance, so it is easy to obtain for a new language. The total similarity
score is calculated by summing the weighted scores. The center phoneme has a weight of 1
and the weight decreases exponentially.
$$S(X, H) = \sum_{i=-ctx}^{ctx} s(x_i, h_i)\, W^{|i|} \qquad (2.45)$$

where $S(X, H)$ is the similarity score between the unseen triphone $X$ and the triphone $H$ in the list,
$s(x_i, h_i)$ is the similarity score between the phonemes at position $i$,
$ctx$ is the context level, and
$W$ is a weighting factor, which we choose as 0.1.

So if a triphone is in the HMM list, there will be an exact match with a score of 1.200.
We have expanded some words in terms of the best match triphones below:
penguen Z-p+e t-e+n e-n+i y-g+u g-u+l l-e+n e-n+Z
milpa Z-m+e m-i+l i-l+g Z-p+a s-a+Z
bossa Z-b+o b-o+r k-s+t k-s+a s-a+Z
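A sketch of Eqn (2.45) for $ctx = 1$ (triphones) is given below. The pair-score table is hypothetical except for the /m/–/n/ value of 0.6768 quoted above, and identical phonemes are assumed to score 1:

```python
def triphone_similarity(unseen, candidate, pair_score, ctx=1, W=0.1):
    """Similarity S(X, H) of Eqn (2.45): a weighted sum of phoneme-pair
    scores, weight 1 at the centre and decaying by a factor W per context
    position. Triphones are phoneme tuples, e.g. ('m', 'a', 'l') for m-a+l."""
    total = 0.0
    for i in range(-ctx, ctx + 1):
        x, h = unseen[i + ctx], candidate[i + ctx]
        # Identical phonemes score 1; otherwise look the pair up (symmetric).
        s = 1.0 if x == h else pair_score.get((x, h), pair_score.get((h, x), 0.0))
        total += s * W ** abs(i)
    return total

# Hypothetical pair-score table; 0.6768 for /m/-/n/ is the value quoted above.
pair_score = {('m', 'n'): 0.6768}

# An exact match scores 0.1 + 1.0 + 0.1 = 1.200, as stated in the text.
assert abs(triphone_similarity(('s', 'a', 'n'), ('s', 'a', 'n'), pair_score) - 1.2) < 1e-9
# A near match substituting /m/ for /n/ in the right context scores slightly less.
print(triphone_similarity(('s', 'a', 'm'), ('s', 'a', 'n'), pair_score))  # about 1.16768
```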
Each allophone model is an HMM made of states, transitions and probability
distributions. In order to improve the estimation of the statistical parameters of these
models, some distributions can be the same or tied. For example, the distributions for the
central portion of the allophones of a given phoneme can be tied reflecting the fact that
they represent the stable (context-independent) physical realization of the central part of
the phoneme, uttered with a stationary configuration of the vocal tract.
Another approach consists of having clusters of distributions characterized by the
same set of Gaussian probability density functions. Allophone distributions are built by
considering mixtures with the same components but with different weights [14].
2.5.2. Word Networks
Isolated recognition in general means recognition of speech based on any kind of
isolated speech unit, which can be a word, a subword, or even a concatenation of words.
However, only isolated word recognition has direct practical applications.
In a simple isolated speech unit recognition task, where the vocabulary contains N
speech units, we can use the system depicted in Figure 2.6.
Figure 2.6. A simple isolated speech unit recognizer that uses null-grammar
We have a finite network, since only one of the words can be spoken. The speech
contains an initial silence and a final silence. This simple network used for isolated word
recognition should be expanded because we do not have models for each different word.
We should generate networks using the HMMs for the best match triphones. We have two
silence models: Z-Z+Z for the interword silence and X-X+X for the silence at the
beginning and the end of the word. The X-X+X model differs in its self-transition
probability; since this probability is high, we can compensate for loose end-pointing. The
expanded network will be as in Figure 2.7.
Figure 2.7. The expanded network using the best match triphones
Figure 2.8. The null-grammar network showing the underlying states
For the silence models we have a one-state HMM, and for triphones we have three-
state HMMs. The reason for choosing three-state models is that we can model the
transition between the adjacent phonemes with the first and the third states. The second
state stands for the steady state of the phoneme. Actually we have two more states at the
beginning and the end of the models that produce no observation. They are not shown in
the figure for the sake of simplicity. Figure 2.8 shows the actual HMM network that
contains 3-state models for triphones and 1-state model for the silence.
2.5.3. Training of the HMMs
We will use semicontinuous HMMs, which means every output probability
distribution of each state is a linear combination of Gaussian density functions. 512
mixtures have been proposed for optimal performance [16]. Using tied mixtures makes
recognition easier because only the probabilities of 512 Gaussian density functions are
calculated for each feature vector. The probability of each triphone is a combination of
these pdf's.
In training we need to estimate the means and variances of Gaussian densities,
mixture coefficients and state transition matrix for each triphone.
The core process in training a set of subword models (phonemes) involves
embedded training. We use a program called "train" for this purpose. Continuously spoken
utterances are parametrized as the training data. In embedded training, re-estimates of the
complete set of subword HMMs are done simultaneously. For each input utterance we
need a transcription of the speech, but labeling is not required. Labeling means specifying
the boundaries of phonemes in that utterance. A single composite HMM is built for each
input utterance using the transcription of the input speech and the initial HMMs. This
composite HMM collects statistics for the re-estimation of the parameters. When all of the
training utterances have been processed, the total set of accumulated statistics is used to
re-estimate the parameters of all of the phone HMMs.
To find the initial parameters for the HMMs before embedded re-estimation we
have two choices. One method is to assign a global mean and variance to all Gaussian
distributions in all HMMs.

Another method is to begin with a small set of hand-labelled training data to
initialize the mean and variance of each Gaussian density. Then the Baum-Welch method is
used to update the mean and/or variance of each Gaussian density.
2.5.4. Recognition
To find the most probable sequence we need a search space, and that search
space is represented with a word network. In the case of isolated word recognition it is a
simple network of N words between start and end nodes. In the case of keyword spotting
we need a different kind of network, which we will discuss in the next chapter. After
building the network using the nearest triphones we have to find the optimal path for a
given observation, i.e., the speech unit sequence.

Then it is possible to trace the corresponding speech unit sequence via the state
sequence. In order to calculate the optimal state sequence ($q^*$) we can use the Viterbi
algorithm directly, or the level building method, which is a variant of the Viterbi algorithm.
Since Viterbi based recognition is suboptimal unless each speech unit corresponds to an
HMM state, some attempts have been made to develop efficient methods for calculating
the sentence likelihoods. The N-best algorithm is one of these.
2.5.4.1. Viterbi Based Recognition. The Viterbi score $\delta_t(i)$ can be computed for all the
states in the language model $\Lambda$ at time $t$, and the computation can then advance to the time
instant $t+1$, in an inductive manner, as formulated in [13]. This procedure is known as time
synchronous Viterbi search because it completely processes time $t$ before going on to time $t+1$.
Finally a backtracking pass gives the required state sequences.
Viterbi search can be very expensive if the number of states is large. When the
number of states is large, at every time instant a large portion of states have an
accumulated likelihood which is much less than the highest one, so it is expected that a
path passing through one of these states will not become the best path at the end of the
utterance. This consideration leads to a complexity reduction technique called beam search
[19]. Beam search neglects states whose accumulated score is lower than the best one
minus a given threshold. Pruning less likely paths avoids extra computation. However, if
the pruning threshold is chosen poorly, the best path can be lost. In practice, good tuning of
the beam threshold results in a gain in speed by an order of magnitude, while introducing a
negligible amount of search errors.
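The pruning rule — discard any state whose accumulated score falls below the best score minus a threshold — can be sketched in a few lines. Scores are log likelihoods and the values are purely illustrative:

```python
def beam_prune(scores, beam_width):
    """Beam pruning: keep only states whose accumulated log score is within
    beam_width of the best score at this frame. scores maps state -> score."""
    best = max(scores.values())
    return {s: v for s, v in scores.items() if v >= best - beam_width}

# Hypothetical accumulated log-likelihoods of four active states at one frame.
frame_scores = {0: -10.0, 1: -12.5, 2: -25.0, 3: -11.0}
active = beam_prune(frame_scores, beam_width=5.0)
print(sorted(active))  # [0, 1, 3] -- state 2 falls outside the beam
```

A narrow beam risks pruning the eventual best path, which is exactly the failure mode described above.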
2.5.4.2. N-best Search. The N-best search algorithm is very similar to the time synchronous
Viterbi search. Since the purpose of the N-best method is to find the optimum speech unit
sequence instead of the optimum state sequence, a summing operation should be done
instead of the maximum finding operation. However, if we completely drop the maximum
finding operation it becomes the forward algorithm, and we are back where we started.
Therefore pruning is performed at every state (in addition to the beam pruning),
keeping only the first N paths with the highest scores. Therefore even this algorithm does
not give the theoretically optimum sentence. At the end, the algorithm gives the N most likely
sentences; for a simple task without post-processors, N=1 is enough. We used the token
passing paradigm [20].
2.6. Keyword Spotting Problem
In the keyword spotting problem a continuously spoken utterance is tested for the
existence of a keyword. The speech signal may contain any combination of silences,
keywords and non-keywords. The words that are not keywords are called garbage or out-
of-vocabulary words. We have discussed how to model a word using HMMs for its
phonemes. We need to model the out-of-vocabulary words in some form.

Since we use tied mixtures for modeling the output produced at each state of the
network, we will be using Gaussian mixtures for garbage models. We may choose to use
fewer mixtures to model garbage words. These Gaussian mixtures will have greater
variances, which means they are more general models.
3. PROPOSED KEYWORD SPOTTING ALGORITHM
3.1. Introduction
The best we can do to detect keywords in a speech signal is to recognize all of the
words using a Large Vocabulary Continuous Speech Recognizer (LVCSR). However, the
cost is very high, and when the context of the keywords is unknown it is impossible to use an
LVCSR system. Instead, it is common to model all of the out-of-vocabulary words with
garbage models. The garbage model we have used is a 16-mixture model. The reason for
using a small number of mixtures is to model the general properties of the speech signal;
a small number of mixtures means greater variances of the Gaussian distribution functions.
We have used the notation J-J+J for the garbage model.
Another idea we have tried can be described as follows: if we create a network with
the triphone model and the monophone model of the same word, the triphone model of the word
should get the best score if the keyword exists. If the keyword does not exist, the
monophone or the garbage model will get the best score. The reason for this hypothesis is
that the monophone models represent the context independent phonemes and the triphone
models represent the context dependent phonemes. Monophone models have greater
variances, to model all phonemes more generally. We have used 32-mixture monophone
models.
In order to favor some of the models we have added a bonus value to all transition
probabilities of the HMMs on that path. Although the probability of an event cannot be
greater than one and the probabilities of mutually exclusive events should sum to 1, adding
some value to the transition probabilities works fine, because it increases the
probability of passing through some path. This corresponds to a one-pass implementation of
Likelihood Ratio Scoring.
3.2. Experiment Data
The details of the database that we have used for keyword spotting are described in
Table 3.1 and Table 3.2. We asked 12 speakers to read 20 different sentences from a sheet
over the telephone. Since some speakers made mistakes during recording, 44 of the
sentences were removed from the database. Therefore the total number of sentences that we
used in our simulations was 196. The sentences are given in Appendix A. In the
simulations we tested 1-, 3-, 5- and 10-keyword sets. The keywords in each set and their
numbers of occurrences are listed in Table 3.2.
Table 3.1. Database used for keyword spotting
Number of sentences 196
Number of words 3,391
Number of speakers 12
Sound Quality 8 kHz µ-law encoded telephony signal
Total record time 15 minutes
Table 3.2. Number of occurrences of the keywords used for keyword spotting tests
Medya Holding 12
Ankara + Yargıtay + Fenerbahçe 36
Pejo + Sabah + Ankara + Medya Holding + Yargıtay 86
Türkiye + Pejo + Pirelli + Devalüasyon + Fenerbahçe + rüşvet +
Sabah + Yargıtay + Medya Holding + Rahmi Koç 167
3.3. Performance of a System
For a keyword spotting system there are two kinds of errors: false alarms and misses.
If there is no keyword and the keyword spotter detects a keyword, it is called a false
alarm. If there is a keyword present but the keyword spotter cannot detect it, it is called
a miss. The same recognizer cannot reduce false alarms without increasing the miss rate.
Receiver Operating Characteristic (ROC) curves show the probability of detection
versus the probability of false alarm [20]. The probability of detection is given by dividing the
number of detections by the number of keywords. The probability of false alarm is given by
dividing the total number of false alarms by the total number of words that are not keywords.
However, since we do not know much about the out-of-vocabulary words, false alarms per
hour is a much more reasonable measure than the probability of false alarm.
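The two quantities plotted at each operating point can be computed as in this sketch; the counts are illustrative, not results from the experiments:

```python
def roc_point(n_detected, n_keywords, n_false_alarms, record_hours):
    """One ROC operating point for a keyword spotter: detection probability
    and false alarms per hour of speech."""
    p_detect = n_detected / n_keywords
    fa_per_hour = n_false_alarms / record_hours
    return p_detect, fa_per_hour

# Hypothetical numbers: 30 of 36 keyword occurrences found, 4 false alarms
# over a 15-minute (0.25 h) database.
p, fa = roc_point(30, 36, 4, 15 / 60)
print(round(p, 3), fa)  # 0.833 16.0
```

Sweeping the bonus values and recomputing this pair for each setting produces the ROC points discussed below.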
3.4. System Structure
Figure 3.1 shows the general network structure that can be used to detect the
keyword "sabah". It is an infinite-state network. As long as input data is available, the
same path may be used multiple times. Since the silence model Z-Z+Z is always used, this
grammar attempts to model each word in the speech signal as either the keyword or garbage.
Figure 3.1. General structure of the proposed keyword spotter
We have tested the performance of the keyword spotter using
a. the triphone model of the word and one garbage model
b. the triphone model and the monophone model of the word
c. the triphone model and the monophone model of the word and a garbage model
In order to obtain a Receiver Operating Characteristic (ROC) we measured the
probability of detection and the number of false alarms per hour. Each time the test is
performed we have used different combinations of bonus values for the monophone model of
the word and the garbage model. We have a program that counts the false alarm/miss rate
automatically. We performed the test using only the monophone models of the keywords
as the garbage model or a 1-state 16-mixture general garbage model.
Figure 3.2 shows the ROC points for the tests we have performed. The operating point
of the ROC is determined by the bonus values of the monophone models and the garbage model.
However, these bonus values are related; as a result, the points obtained using different
combinations do not form a curve. We therefore prefer to call these figures ROC points.
Figure 3.2. ROC points for different alternatives for garbage model
As can be seen, using the monophone models of the keywords as the garbage model
gives approximately the same performance as using the 1-state general garbage model.
However, when we use a combination of these two alternatives with different bonus values,
we get better performance. Actually monophones are good candidates for garbage
modeling. The recognizer cannot model very short utterances using the monophone model
of the word, so a better alternative may be using an ergodic model for the monophone
model of the keyword. In an ergodic model a transition from each phoneme to every other is
possible, unlike the left-to-right models we have been using. However, the cost of using an
ergodic model is too high. So the monophone model of the word coupled with a general
1-state garbage model is a better alternative.
We have tested the system for different numbers of keywords. As the number of
keywords increases, the performance of the system decreases, as expected. Figure 3.3
shows the ROC points for 1, 3, 5 and 10 keywords. All of the operating points are shown
together in this figure regardless of the structure used. Of course, using monophone models
of the keywords as the garbage model has given better results for multiple keywords, too.
Figure 3.3. ROC points for different number of keywords for keyword spotting simulations
3.5. Performance of Monophone Models for Isolated Word Recognition
Monophone models did not result in an increase in performance when used in the
keyword spotting task. The reason may be the infinite-state network used for keyword
spotting. We have used the same idea in isolated word recognition.
We have decided to use our keyword spotter as a post-processor for the isolated word
recognizer. The output of the isolated word recognition is used as the keyword for the
keyword spotter. The keyword spotter creates a temporary network using the triphone
model of the word and the monophone model of the word.
Figure 3.4. Network structure for the keyword spotter used as a post-processor for
isolated word recognizer
The network is a finite-state network as shown in Figure 3.4. If the triphone model of the
output gets the higher likelihood, the post-processor decides that the word is from the vocabulary;
otherwise we assume that it is garbage. We have performed the experiment using different
bonus values. Table 3.3 and Table 3.4 show the results of the experiments. Figure 3.5
shows the ROC curve for monophone models and the garbage model.
Table 3.3. Results for monophone model based out-of-vocabulary word rejection for
isolated word recognition
Bonus Value   Probability of Detection   Probability of False Alarm
0.70          0.993                      0.634
0.80          0.984                      0.581
0.90          0.976                      0.528
1.00          0.954                      0.477
1.10          0.931                      0.429
1.20          0.910                      0.376
1.25          0.896                      0.343
1.30          0.883                      0.323
1.40          0.855                      0.278
1.50          0.814                      0.222
1.75          0.720                      0.142
2.00          0.578                      0.075
2.50          0.296                      0.031
3.00          0.101                      0.006
Table 3.4. Results for general garbage model based out-of-vocabulary word rejection for
isolated word recognition
Bonus Value   Probability of Detection   Probability of False Alarm
0.90          0.993                      0.912
1.00          0.988                      0.883
1.10          0.986                      0.871
1.20          0.981                      0.839
1.30          0.981                      0.810
1.40          0.981                      0.778
1.50          0.981                      0.733
1.60          0.979                      0.705
1.80          0.954                      0.625
2.00          0.931                      0.541
2.25          0.888                      0.438
2.50          0.831                      0.333
3.00          0.709                      0.173
3.25          0.604                      0.123
3.50          0.523                      0.089
4.00          0.299                      0.037
Figure 3.5. ROC curves for monophone and garbage model based out-of-vocabulary word
rejection
4. CONFIDENCE MEASURES FOR ISOLATED WORD
RECOGNITION
4.1. Introduction
Out-of-vocabulary word rejection is an important issue in isolated word recognition
as well as in keyword spotting. If somebody dials an automated information retrieval
system and asks about a subject that is irrelevant to that service, the system will interpret
the request as one of the known inputs. Of course the user will not notice that he uttered an
out-of-vocabulary word, and will complain about the service: telling a person who wants
information about a hospital about the weather is nonsense. The user may also give an input
accidentally, especially to a barge-in-enabled system. A barge-in-enabled
system stops playing the prompt as soon as it detects voice activity of sufficient duration.
For a pleasing service, out-of-vocabulary word rejection may be as important as recognizing
the true item from the vocabulary.
A recognition system may produce a likelihood score instead of rejecting or
accepting a hypothesis. The confidence value is a score that shows how confident the
recognizer is about the recognition result. However, a raw confidence value is not suitable for an
average system designer; it should be converted into a more meaningful value like percent
confidence. The percent confidence is a number between zero and 100 and reflects how
likely the result is: if you group the results that have a percent confidence of 75, the
recognition rate will be 75 per cent. A recognizer cannot assign 100 per cent confidence to a result.
The confidence value is useful especially when you have a chance to ask the speaker
to say the word again. If the confidence value is too low, the system may tell the user
that it could not understand. If the confidence value is high, the system may directly
advance to the next step in the scenario. If the confidence value is medium, the system may
need to confirm the recognition result. A higher percent confidence may be required for
critical applications. Using different threshold values for confidence values corresponds to
changing the operating point on an ROC curve.
4.2. Experiment Data
We have collected 1,176 recordings of isolated words/phrases from 4 speakers. Each
speaker spoke the names of the stocks exchanged at İMKB as isolated phrases once. Some
of the recordings are very bad and some have end-pointing errors, but we have kept them for
an objective evaluation. We set aside half of the words and used the remaining ones in the
recognition system.
4.3. Minimum Edit Distance
The robustness of the recognizer requires that it operate in different situations
without great degradation in performance. Due to the variability of speech there is no
strict decision about speech units. As a result, it is natural that the recognizer will match a
very close word if the spoken word is not in the vocabulary. For example, our HMM based
recognizer can make the distinction between the words "kardemirZbe" and "kardemirZde" if
both of them are in the vocabulary. If one of them is removed from the vocabulary and the
incoming speech signal corresponds to the removed word, the recognizer will match the
remaining word. This feature can be useful if a speaker says "sabancolding" instead of
"sabancIZholding". So we have decided not to penalize the recognizer if the recognized
word and the actual word are very close to each other.
In order to decide the similarity of two words we have used the minimum edit
distance criterion. The minimum edit distance between two strings is defined as the
minimum number of editing operations (insertion, deletion, substitution) needed to
transform one string into the other [22]. The minimum edit distance algorithm is an example
of the class of dynamic programming algorithms and is given in Appendix B. The costs of
insertion, deletion and substitution may be assigned according to the specific task. For
speech recognition, assigning a cost of 1 for insertion and deletion and 2 for substitution is
reasonable [23].
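A minimal sketch of the dynamic programming recurrence with these costs (the actual implementation used in the experiments is the one in Appendix B):

```python
def min_edit_distance(a, b, ins=1, dele=1, sub=2):
    """Minimum edit distance by dynamic programming, with the costs
    suggested for speech recognition: insertion 1, deletion 1, substitution 2."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele                     # delete all of a
    for j in range(1, n + 1):
        d[0][j] = j * ins                      # insert all of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # delete a[i-1]
                          d[i][j - 1] + ins,       # insert b[j-1]
                          d[i - 1][j - 1] + cost)  # match / substitute
    return d[m][n]

# The two near-homophones from the text differ by one substitution (cost 2),
# so a confusion between them falls under the <= 3 MEDR threshold used below.
assert min_edit_distance("kardemirZbe", "kardemirZde") == 2
```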
If the recognizer does not reject an out-of-vocabulary word that is very close to a word
in the vocabulary, we counted it as a detection and called this Minimum Edit Distance Revision
(MEDR). If the minimum edit distance is less than or equal to 3, we assumed the result to be
correct. The results got better, as expected. In Figure 4.1 we show the results with and
without Minimum Edit Distance Revision. Since the number of detections is increased, the
ROC is closer to the upper left corner. The remaining figures reflect the MEDR-applied
results unless otherwise stated.
Figure 4.1. ROC curves before/after applying Minimum Edit Distance Revision (MEDR)
4.4. Phoneme Durations
The HMMs we have used assign transition probabilities to each state to model
temporal variability. This requires that each state be used during at least two adjacent
frames in order to be part of the best path. However, it does not impose any maximum
duration constraint on the states or the phonemes.
We have developed a visual tool that displays the likelihood of a given word using
the input speech signal and the state sequence. During the N-best search we may choose to
keep the phoneme information (phoneme index, end time and probability at that instant)
for each token accumulated. If we keep phoneme information for all paths (some of
which will be pruned after a few frames), it will require extra memory and CPU time.
We have preferred finding the phoneme boundaries after the recognition has
ended and the best path has been found. After finding the best path for the input speech
signal, we create a temporary network with only the recognized word and tell the
recognizer to keep the phoneme information. As a result, the recognizer does a forced
alignment for the specified word. We compare the phoneme durations found with the
average durations of each phoneme. Average phoneme durations are shown in
Table 4.1. If the difference is not small enough, we assign a penalty score to the
phoneme. The formula is empirical and can be summarized as follows:

difference = |phoneme duration - average duration|
ratio = difference / average duration

ratio > 4.0 ⇒ penalty score = difference * 15000
2.5 < ratio ≤ 4.0 ⇒ penalty score = difference * 5000
1.5 < ratio ≤ 2.5 ⇒ penalty score = difference * 2000
ratio < 0.2 ⇒ penalty score = difference * 2000
0.2 ≤ ratio ≤ 1.5 ⇒ penalty score = 0
The penalty scores of all phonemes are summed, and this total serves as a
confidence measure.
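The empirical penalty rule above can be sketched directly; the band edges are interpreted as closed where the text leaves them open, and the small duration dictionary is just a subset of Table 4.1:

```python
# Subset of the average phoneme durations from Table 4.1 (seconds).
AVG_DURATION = {"a": 0.114238, "e": 0.100183, "i": 0.089455, "k": 0.106367}

def duration_penalty(phoneme, duration):
    """Empirical penalty for a phoneme duration that deviates from its average."""
    avg = AVG_DURATION[phoneme]
    difference = abs(duration - avg)
    ratio = difference / avg
    if ratio > 4.0:
        return difference * 15000
    if ratio > 2.5:
        return difference * 5000
    if ratio > 1.5:
        return difference * 2000
    if ratio < 0.2:
        return difference * 2000
    return 0.0  # 0.2 <= ratio <= 1.5: plausible duration, no penalty

def word_penalty(alignment):
    """Sum of per-phoneme penalties; the total is used as a confidence measure."""
    return sum(duration_penalty(ph, dur) for ph, dur in alignment)
```

Note that durations between 0.2 and 1.5 times the average away from the mean incur no penalty, while very short deviations below 0.2 also score zero because the difference itself is small.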
Table 4.1. Average phoneme durations in Turkish
Phoneme   Average Duration (sec)     Phoneme   Average Duration (sec)
/C/       0.104476                   /i/       0.089455
/G/       0.051474                   /j/       0.132000
/I/       0.087539                   /k/       0.106367
/O/       0.111259                   /l/       0.059551
/S/       0.136918                   /m/       0.080046
/U/       0.075290                   /n/       0.075376
/Z/       0.052331                   /o/       0.114173
/a/       0.114238                   /p/       0.092847
/b/       0.056495                   /r/       0.065306
/c/       0.084662                   /s/       0.130187
/d/       0.065811                   /t/       0.107484
/e/       0.100183                   /u/       0.084297
/f/       0.081055                   /v/       0.053175
/g/       0.064541                   /y/       0.078936
/h/       0.065209                   /z/       0.093300
Sometimes low penalty scores are obtained for meaningless results. This can happen
when a long word is matched against a shorter utterance: if the recognizer cannot find
enough data to proceed to the end of the path, it completes the word anyway, in order to
compensate for end-point errors. We have therefore added a simple additional confidence
measure: the total number of phonemes forwarded without input data, denoted N. The
situation can be seen in Figure 4.2. The recognizer claims that the speech signal
corresponds to the word “iSZbankasIZkurucu”, but the waveform is unlikely to belong to
such a long word. The number of forwarded phonemes is 9 in this case.
Figure 4.2. Forced alignment of the waveform for keyword “iSZbankasIZkurucu”
Figure 4.3 shows a typical case where the recognition result deserves a high penalty.
Although the number of forwarded phonemes is only 1, the phoneme durations do not seem
reasonable. The recognizer has assigned a penalty score of 1571 for this recognition.
In order to model phoneme durations more accurately, the effect of context should
be considered, since the same phoneme may have different duration statistics in different
contexts. Using the mean and variance of each phoneme's duration is one choice, with the
deviation from the average duration serving as the measure. However, the median of the
durations may be a more robust statistic, and percentile ranks may be used instead of the
standard deviation. In any case, the statistics should be derived from durations found
using forced alignment.
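As a sketch of the suggested robust statistics, the following computes a median and a percentile acceptance band from a phoneme's forced-alignment durations; the 5th-95th percentile band is an illustrative choice, not taken from the thesis:

```python
import numpy as np

def duration_stats(durations):
    """Robust duration statistics from forced-alignment durations of one phoneme:
    the median plus a 5th-95th percentile acceptance band (band width is an
    illustrative assumption)."""
    d = np.asarray(durations, dtype=float)
    return {"median": float(np.median(d)),
            "lo": float(np.percentile(d, 5)),
            "hi": float(np.percentile(d, 95))}

def within_band(duration, stats):
    """True if a candidate duration falls inside the acceptance band."""
    return stats["lo"] <= duration <= stats["hi"]
```

Unlike a mean/variance pair, the median and percentile band are not pulled toward a few extreme alignments, which matters when some forced alignments are wrong.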
Figure 4.3. Forced alignment of the waveform for keyword “milpa”
Figure 4.4 shows ROC curves for various values of N, the threshold on the number of
forwarded phonemes.
Figure 4.4. ROC curves for phoneme duration based confidence measure
If the recognizer is to be operated at low false alarm rates, a small threshold for N
should be selected. However, increasing the threshold beyond 5 does not improve the
performance much. The optimum operating point for many applications is N = 4.
4.5. Garbage Model Using the Same 512 Mixtures
During the N-best search, the likelihoods of the triphones are calculated as
weighted sums of Gaussian mixture probabilities. Since we already have the probabilities
of these mixtures, we can compute a garbage probability at each frame. This garbage
modeling technique has been used for speaker recognition [10]. The garbage likelihood and
the likelihood of “ceylangiyim” are shown in Figure 4.5. The shaded area between the two
likelihood profiles may serve as the basis of a confidence measure; the question is how to
convert this area into one.
Figure 4.5. Likelihood profiles for “ceylangiyim” and the proposed base garbage model
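A per-frame garbage score from the shared mixture pool can be sketched as follows. The diagonal-covariance Gaussian evaluation is standard; the pooling rule (averaging the top-N mixture log likelihoods) is our assumption for illustration, since the thesis only states that the garbage probability is derived from the same 512 mixtures:

```python
import numpy as np

def frame_log_likelihoods(frame, means, inv_vars, log_norms):
    """Log likelihood of one feature frame under each diagonal-covariance Gaussian
    in the shared mixture pool (log_norms holds the precomputed log normalizers)."""
    diff = frame - means                                    # (n_mix, dim)
    return log_norms - 0.5 * np.sum(diff * diff * inv_vars, axis=1)

def garbage_log_likelihood(mix_ll, top_n=8):
    """Garbage score for the frame: mean of the top_n mixture log likelihoods.
    The top-N averaging rule is an assumption, not taken from the thesis."""
    top = np.sort(mix_ll)[-top_n:]
    return float(np.mean(top))
```

Because the same mixture evaluations are already computed during the N-best search, this garbage score adds almost no cost per frame.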
We have tried a few approaches. First of all, only phonemes that are meaningful for
us are considered: the triphones X-X+X and Z-Z+Z, which correspond to long silence and
short (inter-word) silence respectively, are excluded from the calculations. If the forced
alignment is unsuccessful, we do not attempt to compute a confidence measure; instead we
directly assign a low score. Since a wrong alignment leads to wrong decisions about the
likelihood ratio, we again penalize excessive deviation from the average phoneme
duration. The penalty method we have chosen is to decrease the probability of that
phoneme artificially.
Since the area corresponds to the log difference of the two likelihood profiles shown,
it is reasonable to take the difference at each frame and sum these differences, which
gives the likelihood ratio over the whole path. However, we have experimented with a few
variations to see whether this is really the best measure. For each frame we take the
difference, weight it according to the characteristics of the phoneme, raise it to some
power, and sum over the frames. Finally we normalize by the accumulated weights.
Mathematically,
S = ( Σ_{i=1}^{totalframes} γ_i (p_i − g_i)^β ) / ( Σ_{i=1}^{totalframes} γ_i )        (4.1)

where p_i and g_i denote the log likelihoods of the recognized word and of the garbage
model at frame i.
Here γ_i is the weighting coefficient, equal to γ for the phonemes /a/, /e/, /i/, /I/,
/o/, /O/, /u/, /U/, /S/, /s/, /z/ and to 1 for all other phonemes.
We expect the recognizer to discriminate these phonemes better than the remaining
ones, so γ is given values greater than or equal to 1. Unfortunately we did not obtain any
improvement. Figure 4.6 illustrates this result: the ROC curves correspond to the results
obtained with β = 1 and γ values of 1, 2 and 3. All curves are very close.
Figure 4.6. ROC curves for different emphasis values for power value 1
Figure 4.7. ROC curves for different power values with emphasis set to 1
After this experiment we varied the value of β while keeping γ = 1, since changing
γ did not matter. The best value was again β = 1, which means using the difference
according to its real meaning (a difference of log likelihoods is a likelihood ratio).
Figure 4.7 illustrates this case.
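Equation 4.1 can be implemented directly; p and g are the per-frame log likelihoods of the recognized word and the garbage model, the emphasized phoneme set follows Section 4.5, and β = γ = 1 recovers the plain accumulated log likelihood ratio (the function and variable names are ours):

```python
# Phonemes weighted by gamma in Equation 4.1 (Section 4.5).
EMPHASIZED = set("aeiIoOuUSsz")

def confidence(p, g, phonemes, gamma=1.0, beta=1.0):
    """Weighted, powered, normalized sum of per-frame log likelihood differences
    (Equation 4.1). p[i], g[i]: word/garbage log likelihoods at frame i;
    phonemes[i]: phoneme label active at frame i."""
    num = den = 0.0
    for pi, gi, ph in zip(p, g, phonemes):
        w = gamma if ph in EMPHASIZED else 1.0
        num += w * (pi - gi) ** beta
        den += w
    return num / den
```

With gamma > 1 the vowel and sibilant frames dominate the average, which is the variation that, per the experiments above, did not improve the ROC curves.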
4.6. Comparison of Confidence Measures
Finally, we compare the confidence measure based on phoneme duration alone with
the one that combines the 512-mixture garbage model with phoneme duration. Since the
latter adds a likelihood ratio score to the duration constraint, it gives a better result.
Figure 4.8 clearly shows the increase in performance.
Figure 4.8. ROC curve for phoneme duration based confidence measure and confidence
measure with likelihood ratio scoring included
Performing the phoneme alignment after the recognition step costs extra time. The
average recognition times required are shown in Table 4.2.
Table 4.2. Computation time required with/without phoneme duration evaluation
Recognition 1.35 seconds
Recognition + Monophone based rejection 1.62 seconds
Recognition + Phoneme duration evaluation 1.62 seconds
5. CONCLUSION
In this study, we have examined the performance of monophone models of the
keywords as garbage models. Their gain over the 1-state general garbage model is very
small in keyword spotting tasks. However, there is a significant performance increase for
the monophone keyword models over the garbage models when the finite-state keyword
spotter is used as a post-processor to an isolated word recognizer.
Several confidence measures were tested for isolated word recognition. The best
performance was achieved when phoneme duration information and average phoneme
likelihoods were used together. The same pool of Gaussian functions was used for the
garbage likelihood.
The duration-based confidence measure can be examined in more detail. Since the
duration of a phoneme depends on the context in which it occurs, context information
should be incorporated. The minimum and maximum durations of each phoneme could be
used instead of the average durations. Durations obtained by forced alignment are more
appropriate for deriving the phoneme statistics. Finally, the monophone model of each
phoneme could be used as a garbage model for likelihood ratio scoring.
APPENDIX A: SENTENCES USED FOR KEYWORD SPOTTING
1. Sabah grubu içinde yer alan sabah pazarlamanın sabah otomobilden ve medya
holdingten çeşitli alacaklarının bulunduğu tespit edildi
2. Aradan zaman geçiyor. ne olup ne bittiğini merak eden Ankara polonya
büyükelçiliğinden ses çıkmadığını görünce varşovaya soruyor
3. Taşların artık yerine oturmaya başladığını söyleyen Rahmi Koç gelecek dönemde çok
daha itibarlı bir Türkiyenin ortaya çıkacağını söyledi
4. Fenerbahçe teknik direktörü rüştünün tıbba göre kupa finalinde oynaması imkansız
değil dedi ve ekledi: maça daha iki gün var, son dakikaya kadar bekleyeceğim
5. Kahvenin egzersiz sırasında adelelerin tükettiği gılikojen miktarını azaltararak daha
uzun süre egzersiz yapmaya yardımcı olduğu belirtiliyor
6. Konuyla ilgili araştırma haberini internetten yayınlayan riyıl eyç gurubu spor yapan
insanların gerçek yaşlarından dokuz yaş daha genç kalabileceklerine değiniyor
7. Türkiye ihtiyacı olan dış katkıyı gerektiği oranda bulmak için zorlanacak
8. ikinci el otomobil piyasasında şubat ayında yaşanan devalüasyonun ardından satışlar
durma noktasına geldi
9. internet üzerinden düzenlenen pejo dizayn yarışmasının kazanan tasarımcısı cenevre
otomobil fuarında açıklandı
10. Oldukça cesur ve yenilikçi bir şekilde dizayn edilen model şık kelimesini hak ediyor.
Dizayn halihazırda bütün bu özellikleriyle pejonun internet sitesini ziyaret eden
otomobil meraklılarının kalbini kazanmayı başardı
11. Yarışmaya en fazla ilginin geldiği ülkeler arasında fıransa, rusya, italya ve
kolombiyanın yanısıra Türkiye de bulunuyor
12. Barselona pirelli en ileri teknoloji ile ürettiği yeni yüksek performans lastikleri pe altı
ve pe yediyi piyasaya sunuyor
13. Beyaz enerji operasyonu kapsamında gündeme gelen rüşvet çarkına adı karışanları
zor günler bekliyor
14. Dekoder sahiplerinin muhatabı teleondan önce futbol federasyonu olmalıdır dedi.
çünkü anlaşmanın koşullarını koyan federasyondur
15. Silah avadanlık ve mühimmat hakkında personele gerekli açıklamaların
yapılmasından sonra eğitici tatbiki olarak günlük haftalık bakımlar ile atış menzili
bakımlarının uygulanmasına geçer
16. Savcı dosya münderecatına göre suçu işlediği sabit olan abuzettin beyin tecekanın
bilmem kaçıncı maddesine göre cezalandırılması talep olunur dedi
17. İstanbul memorial hastanesince düzenlenen insan gen haritası projesi embriyoda veya
erken gebelik döneminde genetik tanı başlıklı toplantıya katılan halifa türkiyede ilk
defa embriyolarda genetik inceleme tekniği başlattıklarını belirtti
18. Yargıtay bozma kararını af yasasının çıkmasından üç gün önce vererek şahinin
davanın ertelenmesi olanağından faydalanmasını da önledi
19. Göstericiler vaşingtın post gazetesi binasını da çürük meyve yağmuruna tuttu ve
suçlu buş katil buş sıloganları attı
20. Hazine enflasyon hedefi kadar zam önerisine memur ve işçi arasında giderek
büyüyen ücret uçurumunu gerekçe gösterirken türkiş bu yaklaşımı kışkırtma olarak
nitelendirdi
APPENDIX B: MINIMUM EDIT DISTANCE ALGORITHM
function MIN-EDIT-DISTANCE(target, source) returns min-distance
n←LENGTH(target)
m←LENGTH(source)
Create a distance matrix distance[n+1, m+1]
distance[0, 0]←0
for each column i from 1 to n do
    distance[i, 0]←distance[i-1, 0] + ins-cost(targeti)
for each row j from 1 to m do
    distance[0, j]←distance[0, j-1] + del-cost(sourcej)
for each column i from 1 to n do
    for each row j from 1 to m do
        distance[i, j]←MIN( distance[i-1, j] + ins-cost(targeti),
                            distance[i-1, j-1] + subst-cost(sourcej, targeti),
                            distance[i, j-1] + del-cost(sourcej) )
return distance[n, m]
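A direct Python transcription of the pseudocode above, assuming unit insert/delete/substitute costs by default:

```python
def min_edit_distance(target, source,
                      ins_cost=lambda c: 1,
                      del_cost=lambda c: 1,
                      subst_cost=lambda s, t: 0 if s == t else 1):
    """Dynamic-programming minimum edit distance between source and target."""
    n, m = len(target), len(source)
    # distance[i][j]: cost of turning source[:j] into target[:i]
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        distance[i][0] = distance[i - 1][0] + ins_cost(target[i - 1])
    for j in range(1, m + 1):
        distance[0][j] = distance[0][j - 1] + del_cost(source[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance[i][j] = min(
                distance[i - 1][j] + ins_cost(target[i - 1]),
                distance[i - 1][j - 1] + subst_cost(source[j - 1], target[i - 1]),
                distance[i][j - 1] + del_cost(source[j - 1]))
    return distance[n][m]
```

With unit costs this is the plain Levenshtein distance used by the MEDR procedure of Section 4.3.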
REFERENCES
1. Deller, J. R., J. H. L. Hansen and J.G. Proakis, Discrete-Time Processing of Speech
Signals. IEEE Press, 2000
2. Young, S., “A Review of Large-Vocabulary Continuous-Speech Recognition”. IEEE
Signal Processing Magazine, September 1996.
3. Rabiner, L. R., “A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition.” Proceedings of the IEEE, 77(2) pp. 257-286, February 1989.
4. Sproat, R., Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Kluwer
Academic Publishers, Massachusetts, Chapter 5, 1998.
5. Hermansky, H., “Perceptual Linear Predictive (PLP) Analysis For Speech”. Journal of
the Acoustical Society of America, 87(4) pp. 1738-1752, April 1990
6. Markel, J. D. and A. H. Gray, Linear Prediction of Speech. Springer-Verlag, Berlin,
1976
7. Davis, S. B. and P. Mermelstein, “Comparison of Parametric Representations For
Monosyllabic Word Recognition in Continuously Spoken Sentences”. IEEE
Transactions on Acoustics, Speech and Signal Processing, ASSP-28 pp. 357--366,
August 1980.
8. Furui, S., “Speaker-Independent Isolated Word Recognition Using Dynamic Features of
The Speech Spectrum”. IEEE Transactions on Acoustics, Speech and Signal
Processing, 29(1) pp. 59-59, 1986.
9. Applebaum, T. H. and B. A. Hanson, “Regression Features For Recognition of Speech
in Quiet and in Noise”, Proceedings of the 1989 International Conference on
Acoustics, Speech, and Signal Processing, Glasgow, Scotland, May 1989, pp. 985-988
10. Furui, S., Digital Speech Processing, Synthesis and Recognition Second Edition
Revised and Expanded. Marcel Dekker, NY, Feb. 2001
11. Huang, X. D., Y. Ariki and M. Jack, Hidden Markov Models for Speech Recognition.
Edinburgh University Press, 1990.
12. Rabiner, L. R. and B. Juang, Fundamentals of Speech Recognition. Prentice-Hall,
Englewood Cliffs, NJ, 1993.
13. Forney G. D., “The Viterbi Algorithm”. Proc. IEEE, Vol 61, pp. 268-278, 1973.
14. Digalakis V. and H. Murveit, “Genones: Optimizing the Degree of Mixture Tying in a
Large Vocabulary Hidden Markov Model Based Speech Recognizer”, Proceedings of
the 1994 International Conference on Acoustics, Speech, and Signal Processing,
Adelaide, Australia, April 1994, pp. 537-540.
15. Shukat-Talamazzini E. G., H. Niemann, W. Eckert, T. Kuhn and S. Rieck, “Acoustic
Modeling of Sub-Word Units in the ISADORA Speech Recognizer”, Proceedings of
the 1992 International Conference on Acoustics, Speech, and Signal Processing, San
Francisco, March 1992, pp. 577-580.
16. Yapanel, U., “Garbage Modeling Techniques For a Turkish Keyword Spotting System”,
M.S. Thesis, Bogazici University, 2000.
17. Bahl L. R., P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo and M. A. Picheny,
“Decision Trees for Phonological Rules in Continuous Speech”, Proceedings of the
1991 International Conference on Acoustics, Speech, and Signal Processing, Toronto,
May 1991, pp. 185-188.
18. Hon, H.-W. and K.-F. Lee, “CMU Robust Vocabulary-Independent Speech
Recognition System”, Proceedings of the 1991 International Conference on Acoustics,
Speech, and Signal Processing, Toronto, May 1991, pp. 889-892
19. Ney H., D. Mergel, A. Noll and A. Paesler, “Data Driven Search Organization for
Continuous Speech Recognition”. IEEE Transactions on Signal Processing, 40(2) pp.
272-281, February 1992.
20. Young, S. J., N. H. Russell and J. H. S. Thornton, Token Passing: a Simple Conceptual
Model for Connected Speech Recognition Systems. Technical Report, Cambridge
University Engineering Department, July 1989.
21. Peterson W. W., T. G. Birdsall and W. C. Fox, “The theory of signal detectability”.
IRE Trans. Info. Theory, PGIT-4, pp. 171-212, September 1954
22. Jurafsky D. and J. H. Martin, Speech and Language Processing. Prentice-Hall Upper
Saddle River, NJ, 2000, pp. 153-157.
23. Levenshtein, V. I., “Binary codes capable of correcting deletions, insertions, and
reversals”. Cybernetics and Control Theory, 10(8), pp. 707-710, 1966.