KEYWORD SPOTTING USING HIDDEN MARKOV MODELS
by
Şevket Duran
B.S. in E.E., Boğaziçi University, 1997
Submitted to the Institute for Graduate Studies in
Science and Engineering in partial fulfillment of
the requirements for the degree of
Master of Science
in
Electrical Engineering
Boğaziçi University
2001
ACKNOWLEDGEMENTS
To Dr. Levent M. Arslan:
Thank you for the sacrifices of your personal time that you have made unselfishly
to help me prepare this thesis.
Thank you for your encouraging me to study in the area of speech processing.
It is a privilege for me to be your student.
Şevket Duran
ABSTRACT
KEYWORD SPOTTING USING HIDDEN MARKOV MODELS
The aim of a keyword spotting system is to detect a small set of keywords in continuous speech. It is important to obtain the highest possible keyword detection rate without increasing the number of false insertions. Modeling only the keywords is not enough: to separate keywords from non-keywords, models for out-of-vocabulary words are needed as well. Out-of-vocabulary modeling is done with garbage models, whose structure and type have a great effect on overall system performance.
The subject of this MS thesis is to examine context-independent phonemes as garbage models and to evaluate the performance of different criteria as confidence measures for out-of-vocabulary word rejection. Two databases were collected over telephone lines, one for keyword spotting experiments and one for isolated word recognition experiments.
For keyword spotting, the use of monophone models together with a one-state general garbage model gives the best performance. For confidence measures, using average phoneme likelihoods together with phoneme durations performs best.
ÖZET
SAKLI MARKOV MODELLERİ KULLANILARAK ANAHTAR
KELİME YAKALAMA
Anahtar kelime yakalama sisteminin amacı sürekli bir sesin içinde barınan küçük bir anahtar kelimeler gurubu ortaya çıkarmaktır. Bu sistemde önemli olan, kelime olmadığı halde hata verme oranını artırmaksızın olası en yüksek anahtar kelime bulma oranını elde etmektir. Bunun için sadece anahtar kelimeleri modelleme yapmak yeterli değildir. Anahtar kelimeleri, olmayanlardan ayırmak için, sözlük dışı kelimelerin modellemesi de gerekmektedir. Bu modelleme, yapısı ve türü itibarıyla tüm sistem performansı üzerinde büyük etkisi bulunan garbage modellemesi ile yapılmaktadır. Bu tezin konusu garbage modelleri olarak bağımsız içerikli sesbirimleri (monophone) incelemek ve sözlük dışı kelime dışlamaları için güvenilirlik oranları bazında değişik kriterlerin performansını değerlendirmektir. Anahtar kelime yakalama ve telefon üzerinden yalıtılmış ses tanıma denemeleri için iki veritabanı oluşturuldu. Anahtar kelime bulma için en iyi performansı tek fazlı genel garbage modelleme ile birlikte tek-sesbirimsel modellerin kullanılması verdi. Güvenilirlik oranları içinse süreleri ile ortalama sesbirim benzeşmelerinin birlikte kullanımı en iyi performansı gösterdi.
TABLE OF CONTENTS
ACKNOWLEDGEMENTS
ABSTRACT
ÖZET
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
2. BACKGROUND
   2.1. Speech Recognition Problem
   2.2. Speech Recognition Process
      2.2.1. Gathering Digital Speech Input
   2.3. Feature Extraction
   2.4. Hidden Markov Model
      2.4.1. Assumptions in the Theory of HMMs
         2.4.1.1. The Markov Assumption
         2.4.1.2. The Stationarity Assumption
         2.4.1.3. The Output Independence Assumption
      2.4.2. Three Basic Problems of HMMs
         2.4.2.1. The Evaluation Problem
         2.4.2.2. The Decoding Problem
         2.4.2.3. The Learning Problem
      2.4.3. The Evaluation Problem and the Forward Algorithm
      2.4.4. The Decoding Problem and the Viterbi Algorithm
      2.4.5. The Learning Problem
         2.4.5.1. Maximum Likelihood (ML) Criterion
         2.4.5.2. Baum-Welch Algorithm
      2.4.6. Types of Hidden Markov Models
   2.5. Use of HMMs in Speech Recognition
      2.5.1. Subword Unit Selection
      2.5.2. Word Networks
      2.5.3. Training of HMMs
      2.5.4. Recognition
         2.5.4.1. Viterbi Based Recognition
         2.5.4.2. N-Best Search
   2.6. Keyword Spotting Problem
3. PROPOSED KEYWORD SPOTTING ALGORITHM
   3.1. Introduction
   3.2. Experiment Data
   3.3. Performance of a System
   3.4. System Structure
   3.5. Performance of Monophone Models for Isolated Word Recognition
4. CONFIDENCE MEASURES FOR ISOLATED WORD RECOGNITION
   4.1. Introduction
   4.2. Experiment Data
   4.3. Minimum Edit Distance
   4.4. Phoneme Durations
   4.5. Garbage Model Using Same 512 Mixtures
   4.6. Comparison of Confidence Measures
5. CONCLUSION
APPENDIX A: SENTENCES USED FOR KEYWORD SPOTTING
APPENDIX B: MINIMUM EDIT DISTANCE ALGORITHM
REFERENCES
LIST OF FIGURES
Figure 2.1. The waveform and spectrogram of “ev” and “ben eve”
Figure 2.2. The waveform and spectrogram of “okul” and “holding”
Figure 2.3. Components of a typical recognition system
Figure 2.4. The spectrogram of /S/ sound and /e/ sound in word Sevket
Figure 2.5. Flowchart of deriving Mel Frequency Cepstrum Coefficients
Figure 2.6. A simple isolated speech unit recognizer that uses null-grammar
Figure 2.7. The expanded network using the best match triphones
Figure 2.8. The null-grammar network showing the underlying states
Figure 3.1. General structure of the proposed keyword spotter
Figure 3.2. ROC points for different alternatives for garbage model
Figure 3.3. ROC points for different number of keywords for keyword spotting
Figure 3.4. Network structure for the keyword spotter used as a post-processor for isolated word recognizer
Figure 3.5. ROC curves for monophone and garbage model based out-of-vocabulary word rejection
Figure 4.1. ROC curves before/after applying Minimum Edit Distance Revision
Figure 4.2. Forced alignment of the waveform for keyword “iSZbankasIZkurucu”
Figure 4.3. Forced alignment of the waveform for keyword “milpa”
Figure 4.4. ROC curves for phoneme duration based confidence measure
Figure 4.5. Likelihood profiles for “ceylangiyim” and the base garbage model proposed
Figure 4.6. ROC curves for different emphasis values for power value 1
Figure 4.7. ROC curves for different power values with emphasis set to 1
Figure 4.8. ROC curve for phoneme duration based confidence measure and confidence measure with likelihood ratio scoring included
LIST OF TABLES
Table 3.1. Database used for keyword spotting
Table 3.2. Number of occurrences of the keywords used for keyword spotting tests
Table 3.3. Results for monophone model based out-of-vocabulary word rejection for isolated word recognition
Table 3.4. Results for general garbage model based out-of-vocabulary word rejection for isolated word recognition
Table 4.1. Average phoneme durations in Turkish
Table 4.2. Computation time required with/without phoneme duration evaluation
1. INTRODUCTION
Communication between people and computers through more natural interfaces is an important issue if computers are to be part of our daily lives. To interact with a computer you normally have to use your hands, whether the device is a keyboard, a mouse, or the dialing pad of your phone when you access information on a computer over a telephone line. A more natural interface for input is speech.
Human-computer interaction via speech involves speech recognition [1, 2, 3] and
speech synthesis [4]. Speech recognition is the conversion of speech signal into text and
synthesis is the opposite. Speech recognition may range from understanding simple
commands to getting all information in speech signal such as all words, the meaning and
the emotional state of the speaker.
After many years of work, speech recognition is at a level mature enough to be used in practical applications. This is due to the algorithms that have been developed and the increase in computational power.
Speech recognition may be speaker dependent or speaker independent. If the
application is for home use, and the same person will use the same microphone at the same
place, then the problem is simple and you don’t need a robust algorithm. But if it is an
application that will recognize speech over a public telephone network where speaker
variability and the environment that speech passes through are different among different
calls, you need a robust algorithm.
If recognition of isolated words or phrases is the problem, the task is easier, as long as the speakers give only the required input. If the speakers also use other words in addition to the keywords you require, you need to perform “keyword spotting”, which means recognizing the keywords among other non-keyword filler words. Going further, recognizing all of the words from a large vocabulary is called dictation, which is a harder task. We will be dealing with the keyword-spotting problem in this thesis.
For speech recognition, the digitized speech signal that is in the time domain must
be transformed into another domain. Generally some part of the speech is taken and a
feature vector is derived to represent that part. Next these feature vectors are used to guess
the sequence of words that generated this speech signal. We need algorithms that account
for the variability in the speech signal. The most common technique for acoustic modeling
is called hidden Markov modeling (HMM). We have used this model in this thesis.
In order to have an operating-system-independent notation, we preferred not to use non-ANSI characters from the Turkish character set. We have used lower case letters for characters that are in the ANSI character set, and upper case letters for Turkish characters: /S/ instead of /ş/, /U/ instead of /ü/, and so on. We use /Z/ for interword silence, so the phrase “savaş alanı” is represented as “savaSZalanI”.
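As a sketch, the transliteration just described can be expressed as a small routine. The pairs /ş/→/S/ and /ü/→/U/ and the /Z/ silence marker are stated above; the remaining Turkish-letter pairs are my extrapolation of the same convention, not taken from the text:

```python
# Sketch of the thesis's ASCII notation for Turkish text: lower case for
# ANSI characters, upper case for Turkish-specific letters, and Z for
# inter-word silence. Only the S and U substitutions are named in the
# text; the other pairs below are assumed from the same convention.
TURKISH_TO_ASCII = {
    "ş": "S", "ü": "U", "ç": "C", "ö": "O", "ğ": "G", "ı": "I",
}

def to_ascii_notation(text: str) -> str:
    """Replace Turkish characters and mark inter-word silence with Z."""
    out = "".join(TURKISH_TO_ASCII.get(ch, ch) for ch in text)
    return out.replace(" ", "Z")

print(to_ascii_notation("savaş alanı"))  # savaSZalanI
```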
In this thesis we investigate some models for garbage models for keyword spotting
and try to find some confidence measure for detection of out-of-vocabulary words in an
isolated word recognizer.
In chapter 2, we give the theory of each step in speech recognition process and give
details of the techniques we have used. In chapter 3, we study the keyword-spotting
algorithm we have proposed and conclude that using the monophone models of the words
and a one-state 16-mixture general garbage model with different bonus values gives the
best performance. We evaluate the performance of monophone models as garbage model
for isolated word recognition. In chapter 4, we evaluate some measures to obtain a good
confidence measure and decide both likelihood and phoneme duration are important for
obtaining a good confidence measure. Finally, in chapter 5 we give our conclusions from
these experiments and suggest some directions for future study.
2. BACKGROUND
2.1. Speech Recognition Problem
The speech signal is different depending on whether the input is given as isolated words or as continuous speech. If the speaker knows that a computer will try to recognize the speech, then he/she may pause between words. In continuous speech, however, some sounds will disappear and sometimes there will be no silence between words. It may also be hard to pronounce a word in a particular context; an exaggerated example is a tongue twister like “SemsiZpaSaZpasajIndaZsesiZbUzUSesiceler”. Even in normal cases there is great
difference in the characteristics of the speech signal. Figure 2.1 shows the same /e/ sound
in “ev” and “ben eve”. The waveforms are shown at the top of the figure. The
spectrograms at the bottom show the energy at different frequencies versus time. The
darkness shows the amplitude. The effect of the context can be seen on the characteristics
of the /e/ sound. The effect of the context leads us to model each phoneme according to the
neighboring phonemes.
Figure 2.1. The waveform and spectrogram of “ev” (on the left) and “ben eve” (on
the right).
Spontaneous speech may contain other fillers that are not words like “ee” or
“hImm”. It is another difficulty in continuous speech recognition.
The task should be known while designing the algorithm. If we are to use a
recognizer in continuous speech recognition, the training data should consist of continuous
speech as well.
The main difficulty of the speech recognition problem comes from the variability of
the source of the signal. First, the characteristics of phonemes, the smallest sound units, are
dependent on the context in which they appear. An example of phonetic variability is the acoustic difference of the phoneme /o/ in the Turkish words “okul” and “holding”; see Figure 2.2, where the marked region corresponds to the /o/ sound.
Figure 2.2. The waveform and spectrogram of “okul” (on the left) and “holding” (on the
right)
The environment also causes variability. The same speaker will say a word
differently according to his physical and emotional state, speaking rate, or voice quality.
The difference in the vocal tract size and shape of different people also causes variability.
The problem is to find the meaningful information in the speech signal. The meaningful
information is different for speech recognition and speaker recognition. The same
information in the speech signal may be necessary for some application and redundant for
some other application. For speaker independent speech recognition we have to get rid of
as many speaker-related features as possible.
2.2. Speech Recognition Process
Figure 2.3 shows the main components of a typical speech recognition system. The
digitized speech signal is first transformed into a set of useful measurements or features at
a fixed rate, typically once every 10 milliseconds. These measurements are then used to
search for the most likely word candidate, making use of constraints imposed by the
acoustic, lexical, and language models. Throughout this process, training data are used to
determine the values of the model parameters.
Figure 2.3. Components of a typical speech recognition system
2.2.1. Gathering Digital Speech Input
Speech recognition is the process of converting digital speech signal into words.
To capture speech signal we need a device that converts physical speech wave into digital
signal. This may be a microphone that converts the speech into analog signal and a sound
card that is an A/D converter that converts the analog signal to the digital signal. Another
way of obtaining digital speech input is to use a telephone card that converts the analog
signal that comes from the telephone line into digital signal. There are also devices that can
take the digital signal coming from E-1 or T-1 lines directly. Dialogic has JCT LS240 and
JCT LS300 for T-1 lines and E-1 lines respectively. We have been using a JCT LS120
card, which is a speech-processing card for 12 analog lines. We have used an 8 kHz sampling rate, which is the standard sampling rate for telephone lines, and converted the µ-law encoded signal into a 16-bit linear encoded signal before processing.
2.3. Feature Extraction
To get rid of redundancies in the speech signal mentioned earlier we have to represent the
signal by only taking the perceptually most important speaker-independent features [5].
Passing the excitation signal generated by the larynx through the vocal tract produces the
speech signal. We are interested in the properties of the speech generated by the overall
shape of the vocal tract. To distinguish phonemes better (the voiced /unvoiced distinction),
we examine if the vocal folds are vibrating but ignore the variations in the frequency of
vibration. The spectrum of voiced sounds has several sharp peaks, which are called
formant frequencies. The spectrum of unvoiced sounds looks like white noise spectrum.
Figure 2.4 shows the spectrum of the unvoiced sound /S/ and the voiced sound /e/.
Figure 2.4. The spectrum (found using 256 point FFT) of /S/ sound (on the left) and /e/
sound (on the right) in word Sevket
Since our ears are insensitive to phase effects we use the power spectrum as a basis
for speech recognition front-end. The power spectrum is represented on a log scale. When
the overall gain of the signal varies, the shape of the log power spectrum is the same but
shifted up or down. The convolutional effects of the telephone lines are multiplied with the
signal on the linear power spectrum. In log power spectrum the effect is additive. Since a
voiced speech waveform corresponds to convolution of a quasi-periodic excitation signal
and a time-varying filter (shape of the vocal tract), we can separate them in the log power
spectrum. Assigning a lower limit to the log function solves the problem of low energy
levels at some part of the spectrum.
Before computing short-term power spectra, the waveform is processed by a
simple pre-emphasis filter to give a 6 dB/octave increase in gain. This makes the average
speech spectrum roughly flat.
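The pre-emphasis stage can be sketched as a first-order difference filter. The coefficient 0.97 used here is a typical choice and my assumption, since the text only specifies the roughly 6 dB/octave rise in gain:

```python
def pre_emphasize(x, a=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - a * x[n-1].
    The coefficient a = 0.97 is a common choice (an assumption here);
    the thesis only states the filter's 6 dB/octave slope."""
    return [x[0]] + [x[n] - a * x[n - 1] for n in range(1, len(x))]

samples = [0.0, 1.0, 1.0, 0.5]
# Slowly varying (low-frequency) parts are attenuated, sharp changes kept.
print(pre_emphasize(samples))
```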
We have to extract the effects caused by the shape of the vocal tract. One method is
to predict the coefficients of the filter that corresponds to the shape of the vocal tract. The
vocal tract is assumed to be a combination of lossless tubes with different radius. The
number of parameters derived corresponds to the number of tubes assumed. The filter is
assumed to be an all-pole linear filter. The parameters are called Linear Predictive Coding
(LPC) parameters and the procedure is known as LPC analysis. There are different
methods to calculate these coefficients [6].
To calculate the short-term spectra we take overlapping portions of the waveform.
We take a frame of 25 milliseconds and multiply it with a window function to avoid
artificial high frequencies. We use a Hamming window. Then we apply Fourier transform.
We have to get rid of the harmonic structure at the multiples of the fundamental frequency, $f_0$, because it is the effect of the excitation signal. The smoothed spectrum without the effect of the excitation signal corresponds to the Fourier transform of the LPC parameters.
We use a different method and group components of the power spectrum and form
frequency bands. Grouping is not linear; the human ear sensitivity is taken into account.
The bands are linear up to 1 kHz and logarithmic at higher frequencies. The frequency
bands are broader at higher frequencies. The positions of the bands are set according to the
mel frequency scale [7].
The relation between mel frequency scale and linear frequency scale is as follows:
$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (2.1)$$
Figure 2.5. Flowchart for deriving Mel Frequency Cepstrum Coefficients
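Equation (2.1) translates directly into code. As a small check, the mel scale is calibrated so that 1000 Hz maps to roughly 1000 mel, while higher frequencies are compressed:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Mel scale of Eq. (2.1): Mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))  # ≈ 1000 mel
print(hz_to_mel(4000.0))  # well below 4000 mel: the scale compresses high frequencies
```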
To calculate the filterbank coefficients, the magnitude coefficients of the spectrum
are accumulated after windowing with these triangular windows. Triangular filters are
spread over the whole frequency range from zero up to the Nyquist frequency. We have
chosen 16 filter banks.
Since the shape of the spectrum imposed by the vocal tract is smooth, energy levels
in adjacent bands are correlated. We have to remove correlation since in further statistical
analysis we assume that feature vector elements are uncorrelated and use a diagonal
variance vector. Removing the correlation helps the number of parameters to be reduced
without loss of useful information. The discrete cosine transform (a version of the Fourier
transform using only cosine basis functions) converts the set of log energies to a set of
cepstral coefficients, which are largely uncorrelated. The formula for Discrete Cosine
Transform is:
$$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left(\frac{\pi i}{N}(j - 0.5)\right), \qquad i = 1, \ldots, P \qquad (2.2)$$

where $m_j$ are the log filter bank amplitudes, $N$ is the number of filterbank channels, which we set to 16, and $P$ is the required number of cepstral coefficients, which we set to 12. Figure 2.5 shows the steps in obtaining Mel Frequency Cepstrum Coefficients (MFCCs).
Many systems use the rate of change of the short-term power spectrum as
additional information. The simplest way to obtain this dynamic information is to take the
difference between consecutive frames. But this is too sensitive to random interframe
variations. So, linear trends are estimated over sequences of typically five or seven frames
[8]. We use five frames, so there will be a delay of two frame steps in real-time operation.

$$d_t = G \left( 2c_{t+2} + c_{t+1} - c_{t-1} - 2c_{t-2} \right) \qquad (2.3)$$

where $d_t$ is the difference evaluated at time $t$; $c_{t+2}$, $c_{t+1}$, $c_{t-1}$ and $c_{t-2}$ are the coefficients at times $t+2$, $t+1$, $t-1$ and $t-2$, respectively; and $G$ is a gain factor selected as 0.375.
Some systems use the acceleration features as well as linear rates of change. These
second-order dynamic features need longer sequences of frames for reliable estimation [9].
Since cepstral coefficients are largely uncorrelated, probability estimates are easier
in further analysis. We can simply calculate Euclidean distances from reference model
vectors. Statistically based methods weigh coefficients by the inverse of their standard
deviations computed around their overall means.
Current representations concentrate on the spectrum envelope and ignore
fundamental frequency; but we know that even in isolated-word recognition fundamental
frequency contours carry important information.
At the acoustic phonetic level, speaker variability is typically modeled using
statistical techniques applied to large amounts of training data. Effects of context at the
acoustic phonetic level are handled by training separate models for phonemes in different
contexts; this is called context dependent acoustic modeling.
Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Another technique is to add the different pronunciations directly to the word network; after common nodes in the network are merged, the paths correspond to the different pronunciations of the same word.
2.4. Hidden Markov Model
The most widely used recognition algorithm of the past fifteen years is the Hidden Markov Model (HMM) [10, 11, 12]. Although there have been some attempts at using neural networks, they have not been very successful.
The Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) probability distribution. Transition probabilities are assigned to the transitions among the states. In a particular state an outcome or observation can be generated according to the associated probability distribution. An external observer can only see the outcome, not the states; therefore the states are hidden from the outside.
The following part presents the theory of HMMs, taken from the tutorial [3]. The
advanced reader can skip this part.
In order to define an HMM completely, the following elements are needed:
• The number of states of the model, N.
• The number of observation symbols in the alphabet, M. If the observations are
continuous then M is infinite.
• A set of state transition probabilities, $A = \{a_{ij}\}$:

$$a_{ij} = p\{q_{t+1} = j \mid q_t = i\}, \qquad 1 \le i, j \le N \qquad (2.4)$$

where $q_t$ denotes the state at time $t$ and $a_{ij}$ is the transition probability from state $i$ to state $j$. Transition probabilities should satisfy the normal stochastic constraints,

$$a_{ij} \ge 0, \qquad 1 \le i, j \le N \qquad (2.5)$$

and

$$\sum_{j=1}^{N} a_{ij} = 1, \qquad 1 \le i \le N \qquad (2.6)$$
• A probability distribution in each of the states, $B = \{b_j(k)\}$:

$$b_j(k) = p\{o_t = \nu_k \mid q_t = j\}, \qquad 1 \le j \le N, \quad 1 \le k \le M \qquad (2.7)$$

where $\nu_k$ denotes the $k$th observation symbol in the alphabet and $o_t$ the current observation vector. The following stochastic constraints must be satisfied:

$$b_j(k) \ge 0, \qquad 1 \le j \le N, \quad 1 \le k \le M \qquad (2.8)$$

and

$$\sum_{k=1}^{M} b_j(k) = 1, \qquad 1 \le j \le N \qquad (2.9)$$
If the observations are continuous then we will have to use a continuous probability
density function, instead of a set of discrete probabilities. In this case we specify
the parameters of the probability density function. Usually the probability density is
approximated by a weighted sum of M Gaussian distributions,
$$b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t, \mu_{jm}, \Sigma_{jm}) \qquad (2.10)$$

where $c_{jm}$ are the mixture weights for the $j$th state's $m$th mixture, $\mu_{jm}$ are the mean vectors, and $\Sigma_{jm}$ are the covariance matrices. The $c_{jm}$ should satisfy the stochastic constraints,

$$c_{jm} \ge 0, \qquad 1 \le j \le N, \quad 1 \le m \le M \qquad (2.11)$$

and

$$\sum_{m=1}^{M} c_{jm} = 1, \qquad 1 \le j \le N \qquad (2.12)$$
• The initial state distribution, $\pi = \{\pi_i\}$, where

$$\pi_i = p\{q_1 = i\}, \qquad 1 \le i \le N \qquad (2.13)$$

Therefore the compact notation

$$\lambda = (A, B, \pi) \qquad (2.14)$$

can be used to denote an HMM with discrete probability distributions, while

$$\lambda = (A, c_{jm}, \mu_{jm}, \Sigma_{jm}, \pi) \qquad (2.15)$$

denotes one with continuous densities.
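The elements above can be collected in a minimal container for the discrete case, checking the stochastic constraints of Eqs. (2.5)-(2.6), (2.8)-(2.9) and (2.13). The two-state numbers are purely illustrative:

```python
# A minimal container for a discrete-observation HMM lambda = (A, B, pi).
# The toy two-state model is illustrative only, not from the thesis.
class HMM:
    def __init__(self, A, B, pi):
        self.A = A    # N x N transition probabilities a_ij
        self.B = B    # N x M observation probabilities b_j(k)
        self.pi = pi  # initial state distribution
        self._check()

    def _check(self, tol=1e-9):
        for row in self.A + self.B + [self.pi]:
            assert all(p >= 0 for p in row)   # non-negativity constraints
            assert abs(sum(row) - 1.0) < tol  # each distribution sums to one

toy = HMM(A=[[0.7, 0.3], [0.4, 0.6]],
          B=[[0.9, 0.1], [0.2, 0.8]],
          pi=[0.5, 0.5])
print(len(toy.A))  # N = 2 states
```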
2.4.1. Assumptions in the Theory of HMMs
For the sake of mathematical and computational tractability, the following assumptions are made in the theory of HMMs.
2.4.1.1. The Markov Assumption. As given in the definition of HMMs, transition probabilities are defined as

$$a_{ij} = p\{q_{t+1} = j \mid q_t = i\}, \qquad 1 \le i, j \le N \qquad (2.16)$$
In other words it is assumed that the next state is dependent only upon the current
state. This is called the Markov assumption and the resulting model becomes actually a
first order HMM. However the next state may depend on past k states and it is possible to
obtain such a model, called a kth order HMM. But a higher order HMM will have a higher
complexity.
2.4.1.2. The Stationarity Assumption. Here it is assumed that state transition probabilities are independent of the actual time at which the transitions take place. Mathematically,

$$p\{q_{t_1+1} = j \mid q_{t_1} = i\} = p\{q_{t_2+1} = j \mid q_{t_2} = i\} \qquad (2.17)$$

for any $t_1$ and $t_2$.
2.4.1.3. The Output Independence Assumption. This is the assumption that the current output (observation) is statistically independent of the previous outputs (observations). We can formulate this assumption mathematically by considering a sequence of observations,

$$O = o_1, o_2, \ldots, o_T \qquad (2.18)$$

Then according to the assumption, for an HMM $\lambda$,

$$p\{O \mid q_1, q_2, \ldots, q_T, \lambda\} = \prod_{t=1}^{T} p(o_t \mid q_t, \lambda) \qquad (2.19)$$

However, unlike the other two, this assumption has very limited validity. In some cases it is not a fair assumption and therefore becomes a severe weakness of HMMs.
2.4.2. Three Basic Problems of HMMs
Once we have an HMM, there are three problems of interest.
2.4.2.1. The Evaluation Problem. Given an HMM $\lambda$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, what is the probability that the observations are generated by the model, $p\{O \mid \lambda\}$?

2.4.2.2. The Decoding Problem. Given a model $\lambda$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, what is the most likely state sequence in the model that produced the observations?

2.4.2.3. The Learning Problem. Given a model $\lambda$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, how should we adjust the model parameters $(A, B, \pi)$ in order to maximize $p\{O \mid \lambda\}$?

The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to continuous recognition as well as to segmentation. The learning problem must be solved if we want to train an HMM for subsequent use in recognition tasks.
2.4.3. The Evaluation Problem and the Forward Algorithm
We have a model $\lambda = (A, B, \pi)$ and a sequence of observations $O = o_1, o_2, \ldots, o_T$, and $p\{O \mid \lambda\}$ must be found. If we calculate this quantity using simple probabilistic arguments, the number of operations is on the order of $N^T$. This is very large even if the length of the sequence, $T$, is small. The idea of keeping the multiplications that are common led to the use of an auxiliary variable, called the forward variable and denoted $\alpha_t(i)$.

The forward variable is defined as the probability of the partial observation sequence $o_1, o_2, \ldots, o_t$ terminating at state $i$. Mathematically,

$$\alpha_t(i) = p\{o_1, o_2, \ldots, o_t, q_t = i \mid \lambda\} \qquad (2.20)$$

Then it is easy to see that the following recursive relationship holds:

$$\alpha_{t+1}(j) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_t(i)\, a_{ij}, \qquad 1 \le j \le N, \quad 1 \le t \le T-1 \qquad (2.21)$$

where

$$\alpha_1(j) = \pi_j\, b_j(o_1), \qquad 1 \le j \le N \qquad (2.22)$$

Using this recursion we can calculate $\alpha_T(i)$, $1 \le i \le N$, and then the required probability is given by

$$p\{O \mid \lambda\} = \sum_{i=1}^{N} \alpha_T(i) \qquad (2.23)$$

The complexity of this method, known as the forward algorithm, is proportional to $N^2 T$, which is linear with respect to $T$, whereas the direct calculation has exponential complexity.
In a similar way the backward variable $\beta_t(i)$ is defined as the probability of the partial observation sequence $o_{t+1}, o_{t+2}, \ldots, o_T$, given that the current state is $i$. Mathematically,

$$\beta_t(i) = p\{o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda\} \qquad (2.24)$$

As in the case of $\alpha_t(i)$, there is a recursive relationship which can be used to calculate $\beta_t(i)$ efficiently:

$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad 1 \le i \le N, \quad 1 \le t \le T-1 \qquad (2.25)$$

where

$$\beta_T(i) = 1, \qquad 1 \le i \le N \qquad (2.26)$$

Further we can see that

$$\alpha_t(i)\, \beta_t(i) = p\{O, q_t = i \mid \lambda\}, \qquad 1 \le i \le N, \quad 1 \le t \le T \qquad (2.27)$$

Therefore this gives another way to calculate $p\{O \mid \lambda\}$, using both forward and backward variables:

$$p\{O \mid \lambda\} = \sum_{i=1}^{N} p\{O, q_t = i \mid \lambda\} = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i) \qquad (2.28)$$

The last equation is very useful, especially in deriving the formulas required for gradient based training.
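The forward and backward recursions of Eqs. (2.21)-(2.26) can be sketched directly in code. The two-state model and short observation string are illustrative only; a practical implementation would work with scaled or log-domain values to avoid underflow on long utterances:

```python
# Forward recursion (Eqs. 2.21-2.22) and backward recursion (Eqs. 2.25-2.26)
# for a discrete-observation HMM; toy parameters, illustrative only.
def forward(A, B, pi, obs):
    N = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(N)]]        # Eq. (2.22)
    for t in range(1, len(obs)):
        alpha.append([B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j]
                                         for i in range(N))
                      for j in range(N)])                      # Eq. (2.21)
    return alpha

def backward(A, B, obs):
    N = len(A)
    T = len(obs)
    beta = [[1.0] * N]                                         # Eq. (2.26)
    for t in range(T - 2, -1, -1):
        # beta[0] currently holds the row for time t+1
        beta.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * beta[0][j]
                            for j in range(N))
                        for i in range(N)])                    # Eq. (2.25)
    return beta

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 1, 0]

alpha = forward(A, B, pi, obs)
beta = backward(A, B, obs)
p_forward = sum(alpha[-1])                                     # Eq. (2.23)
# Eq. (2.28): the same probability from alpha and beta at any time t
p_both = sum(a * b for a, b in zip(alpha[0], beta[0]))
print(abs(p_forward - p_both) < 1e-12)  # True
```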
2.4.4. The Decoding Problem and the Viterbi Algorithm
In this case we want to find the most likely state sequence for a given sequence of
observations, toooO ,...,, 21= and a model, ),,( πλ BA= .
The solution to this problem depends upon the way “most likely state sequence” is
defined. One approach is to find the most likely state tq at t=t and to concatenate all such
' tq 's. But some times this method does not give a physically meaningful state sequence.
Therefore we would need another method which has no such problems. In this method,
commonly known as Viterbi algorithm [13], the whole state sequence with the maximum
likelihood is found. In order to facilitate the computation we define an auxiliary variable,
},...,,,,...,,{max)( 121121
1...21λδ −−
−== ttt
tqqqt oooiqqqqpi
)29.2(
which gives the highest probability that partial observation sequence and state sequence up
to t=t can have, when the current state is i. It is easy to observe that the following recursive
relationship holds.
=
≤≤++ ijtNitjt aiobj )(max)()(111 δδ , Ni ≤≤1 , 11 −≤≤ Tt )30.2(
where,
)()( 11 obj jjπδ = , Ni ≤≤1 )31.2(
So the procedure to find the most likely state sequence starts from the calculation of
$\delta_T(j)$, $1 \le j \le N$, using the recursion (2.30), while always keeping a pointer to the
"winning state" in the maximum finding operation. Finally the state $j^*$ is found, where

$$j^* = \arg\max_{1 \le j \le N} \delta_T(j) \qquad (2.32)$$

and starting from this state, the sequence of states is back-tracked as the pointer in each
state indicates. This gives the required set of states. This whole algorithm can be
interpreted as a search in a graph whose nodes are formed by the states of the HMM at
each time instant $t$, $1 \le t \le T$.
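The procedure of Eqns (2.29)–(2.32) — induction with back-pointers followed by back-tracking — can be sketched as follows. The two-state model and its emission likelihoods are illustrative values, not taken from the thesis:

```python
import numpy as np

def viterbi(A, B_obs, pi):
    """Viterbi decoding per Eqns (2.29)-(2.32): returns the single most
    likely state sequence and its probability. B_obs[t, i] = b_i(o_t)."""
    T, N = B_obs.shape
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)   # back-pointers to the "winning state"
    delta[0] = pi * B_obs[0]            # Eqn (2.31)
    for t in range(1, T):
        trans = delta[t - 1][:, None] * A        # delta_t(i) * a_ij for all i, j
        psi[t] = trans.argmax(axis=0)
        delta[t] = trans.max(axis=0) * B_obs[t]  # Eqn (2.30)
    # Back-track from the best final state, Eqn (2.32).
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[T - 1].max()

# Toy 2-state model with precomputed emission likelihoods.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.6, 0.4])
B_obs = np.array([[0.5, 0.1], [0.4, 0.3], [0.7, 0.2]])
path, score = viterbi(A, B_obs, pi)
print(path)  # [0, 0, 0]
```

In a real recogniser the products above are replaced by sums of log probabilities to avoid underflow.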
2.4.5. The Learning Problem
Generally, the learning problem is how to adjust the HMM parameters so that the
given set of observations (called the training set) is represented by the model in the best
way for the intended application. Thus it should be clear that the quantity we wish to
optimize during the learning process can differ from application to application. In
other words, there may be several optimization criteria for learning. The criterion we will
use is the Maximum Likelihood (ML) criterion.
2.4.5.1. Maximum Likelihood (ML) Criterion. In ML we try to maximize the probability
of a given sequence of observations $O^w$, belonging to a given class $w$, given the HMM
$\lambda_w$ of the class $w$, with respect to the parameters of the model $\lambda_w$. This probability is the
total likelihood of the observations and can be expressed mathematically as

$$L_{tot} = p\{O^w \mid \lambda_w\} \qquad (2.33)$$

However, since we consider only one class $w$ at a time, we can drop the subscripts
and superscripts $w$. Then the ML criterion can be given as

$$L_{tot} = p\{O \mid \lambda\} \qquad (2.34)$$
However, there is no known way to analytically solve for the model $\lambda = (A, B, \pi)$
which maximizes the quantity $L_{tot}$. But we can choose model parameters such that it is
locally maximized, using an iterative procedure like the Baum-Welch method [12].
2.4.5.2. Baum-Welch Algorithm. This method can be derived using simple "occurrence
counting" arguments or using calculus to maximize the auxiliary quantity

$$Q(\lambda, \bar{\lambda}) = \sum_{q} p\{q \mid O, \lambda\} \log p\{O, q \mid \bar{\lambda}\} \qquad (2.35)$$

over $\bar{\lambda}$ [3]. A special feature of the algorithm is its guaranteed convergence.

To describe the Baum-Welch algorithm (also known as the Forward-Backward algorithm),
we need to define two more auxiliary variables, in addition to the forward and backward
variables defined in a previous section. These variables can however be expressed in terms
of the forward and backward variables.
The first of these variables is defined as the probability of being in state $i$ at time $t$ and
in state $j$ at time $t+1$. Formally,

$$\xi_t(i,j) = p\{q_t = i, q_{t+1} = j \mid O, \lambda\} \qquad (2.36)$$

This is the same as

$$\xi_t(i,j) = \frac{p\{q_t = i, q_{t+1} = j, O \mid \lambda\}}{p\{O \mid \lambda\}} \qquad (2.37)$$
Using forward and backward variables this can be expressed as

$$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)} \qquad (2.38)$$
The second variable is the a posteriori probability

$$\gamma_t(i) = p\{q_t = i \mid O, \lambda\} \qquad (2.39)$$

that is, the probability of being in state $i$ at time $t$, given the observation sequence and the
model.
In terms of forward and backward variables this can be expressed as

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)} \qquad (2.40)$$
It can be seen that the relationship between $\gamma_t(i)$ and $\xi_t(i,j)$ is given by

$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i,j), \qquad 1 \le i \le N,\; 1 \le t \le T-1 \qquad (2.41)$$
Now it is possible to describe the Baum-Welch learning process, where the parameters of the
HMM are updated in such a way as to maximize the quantity $p\{O \mid \lambda\}$. Assuming a starting
model $\lambda = (A, B, \pi)$, we calculate the $\alpha$'s and $\beta$'s using the recursions (2.21) and (2.25),
and then the $\xi$'s and $\gamma$'s using (2.38) and (2.41). The next step is to update the HMM parameters
according to Eqns (2.42) to (2.44), known as the re-estimation formulas:

$$\bar{\pi}_i = \gamma_1(i), \qquad 1 \le i \le N \qquad (2.42)$$

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad 1 \le i \le N,\; 1 \le j \le N \qquad (2.43)$$

$$\bar{b}_j(k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}, \qquad 1 \le j \le N,\; 1 \le k \le M \qquad (2.44)$$
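One full re-estimation step, combining the forward-backward passes with Eqns (2.42)–(2.44), might look like the sketch below for a discrete-observation HMM. This is an illustration only: the thesis itself uses tied Gaussian mixtures, and the discrete case is shown because its output update (2.44) is the one given above.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation step (Eqns 2.42-2.44) for a discrete HMM.
    obs is a sequence of symbol indices; B[j, k] = b_j(k)."""
    T, N = len(obs), A.shape[0]
    B_obs = B[:, obs].T                       # B_obs[t, i] = b_i(o_t)
    # Forward and backward passes (Eqns 2.21 and 2.25).
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B_obs[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B_obs[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B_obs[t + 1] * beta[t + 1])
    pO = alpha[-1].sum()
    # gamma (Eqn 2.40) and xi (Eqn 2.38); the xi denominator equals p(O | lambda).
    gamma = alpha * beta / pO
    xi = (alpha[:-1, :, None] * A[None] * (B_obs[1:] * beta[1:])[:, None, :]) / pO
    # Re-estimation formulas (2.42)-(2.44).
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi

# Hypothetical 2-state, 2-symbol model and a short observation sequence.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.4], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
A2, B2, pi2 = baum_welch_step(A, B, pi, [0, 1, 0, 0, 1])
```

Each iteration is guaranteed not to decrease the training likelihood, which is the convergence property noted above.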
2.4.6. Types of Hidden Markov Models
HMMs can be classified according to the nature of the elements of the B matrix,
which are distribution functions.
In discrete HMMs, distributions are defined on finite spaces. Observations are
vectors of symbols in a finite alphabet of $N$ different elements. For each one of the $Q$
vector components, a discrete density $\{w(k),\, k = 1, \ldots, N\}$ is defined, and the distribution is
obtained by multiplying the probabilities of each component.
Another possibility is to define distributions as probability densities on continuous
observation spaces. In this case, the functional form of the distributions has to have certain
characteristics, in order to have a manageable number of statistical parameters to estimate.
The density functions are usually Gaussian or Laplacian. The statistics can be
characterized by the mean vector and the covariance matrix. HMMs with these kinds of
distributions are usually referred to as continuous HMMs. A large number of base densities
has to be used in every mixture. Since most of the time the training data is not sufficient,
different models share the same distributions. Different models are expressed in terms of
base distribution functions using different weights. This type of HMM is called a semi-
continuous HMM [11]. Base densities are assumed to be statistically independent; so the
distributions associated with model transitions are products of the component density
functions. Parameters of statistical models are estimated using iterative learning algorithms
[14]. The likelihood of a set of training data increases at each step.
2.5. Use of HMMs in Speech Recognition
In a statistical framework, a set of elementary probabilistic models of basic
linguistic units (e.g., phonemes) is used to build word representations. A sequence of
acoustic parameters, extracted from a speech signal, is seen as an output of a HMM which
is formed by concatenating elementary processes. The underlying state sequence
corresponds to the meaningful combinations of the phonemes. The transitions between the
states of the phoneme correspond to the variability in duration. The stochastic observable
outputs correspond to the spectral variability.
It is not practical to train whole-word models for each word. Words are usually
represented as networks of phonemes. Each path in a word network represents a
pronunciation of the word.
2.5.1. Subword Unit Selection
The same phoneme can have different acoustic distributions of observations if
pronounced in different contexts. Allophone models of a phoneme are models of that
phoneme in different contexts. The decision as to how many allophones should be
considered for a given phoneme may depend on many factors, e.g., the availability of
enough training data to determine the model parameters.
A conceptually interesting approach is the use of polyphones [15]. In principle, an
allophone should be considered for every different word in which a phoneme appears. If
the vocabulary is large, it is unlikely that there are enough data to train all these allophone
models, so models for allophones of phonemes are considered at a different level of detail
(word, syllable, triphone, diphone, context independent phoneme). We have been using
790 triphones during the tests. These are selected to cover most of the triphones in Turkish
[16]. We have been using capital letters for Turkish characters that are not in the ANSI
character set and /Z/ for silence. For example we have used /C/ instead of /ç/.
Another approach consists of choosing allophones by clustering possible contexts.
This choice can be made automatically with Classification and Regression Trees (CART).
A CART is a binary tree having a phoneme at the root and, associated with each node, a
question $Q_i$ about the context. Questions $Q_i$ are of the type, "Is the previous phoneme a
nasal consonant?" For each possible answer (YES or NO) there is a link to another node
with which other questions are associated. There are algorithms for growing and pruning
CARTs based on automatically assigning questions to a node from a manually determined
pool of questions. The leaves of the tree may be simply labeled by an allophone symbol
[17, 18].
We use a score to find the best match for a triphone that is not in the HMM list.
A score is assigned to each pair of phonemes according to their spectral similarity; for
example, the similarity score of /m/ and /n/ is 0.6768. The score is calculated automatically
using the spectral distance, so it is easy to obtain for a new language. The total similarity
score is calculated by summing the weighted scores. The center phoneme has a weight of 1
and the weight decreases exponentially.
$$S(X, H) = \sum_{i=-ctx}^{ctx} s(x_i, h_i)\, W^{|i|} \qquad (2.45)$$

where $S(X, H)$ is the similarity score between the unseen triphone $X$ and the triphone $H$ in the list,
$s(x_i, h_i)$ is the similarity score between the phonemes at position $i$,
$ctx$ is the context level, and
$W$ is a weighting factor, which we choose as 0.1.

So if a triphone is in the HMM list, there will be an exact match with a score of 1.200.
We have expanded some words in terms of the best match triphones below:
penguen Z-p+e t-e+n e-n+i y-g+u g-u+l l-e+n e-n+Z
milpa Z-m+e m-i+l i-l+g Z-p+a s-a+Z
bossa Z-b+o b-o+r k-s+t k-s+a s-a+Z
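A sketch of Eqn (2.45) for $ctx = 1$ (triphones) is given below. The pair-score table is hypothetical except for the /m/–/n/ value of 0.6768 quoted above, and identical phonemes are assumed to score 1:

```python
def triphone_similarity(unseen, candidate, pair_score, ctx=1, W=0.1):
    """Similarity S(X, H) of Eqn (2.45): a weighted sum of phoneme-pair
    scores, weight 1 at the centre and decaying by a factor W per context
    position. Triphones are phoneme tuples, e.g. ('m', 'a', 'l') for m-a+l."""
    total = 0.0
    for i in range(-ctx, ctx + 1):
        x, h = unseen[i + ctx], candidate[i + ctx]
        # Identical phonemes score 1; otherwise look the pair up (symmetric).
        s = 1.0 if x == h else pair_score.get((x, h), pair_score.get((h, x), 0.0))
        total += s * W ** abs(i)
    return total

# Hypothetical pair-score table; 0.6768 for /m/-/n/ is the value quoted above.
pair_score = {('m', 'n'): 0.6768}

# An exact match scores 0.1 + 1.0 + 0.1 = 1.200, as stated in the text.
assert abs(triphone_similarity(('s', 'a', 'n'), ('s', 'a', 'n'), pair_score) - 1.2) < 1e-9
# A near match substituting /m/ for /n/ in the right context scores slightly less.
print(triphone_similarity(('s', 'a', 'm'), ('s', 'a', 'n'), pair_score))  # about 1.16768
```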
Each allophone model is an HMM made of states, transitions and probability
distributions. In order to improve the estimation of the statistical parameters of these
models, some distributions can be the same or tied. For example, the distributions for the
central portion of the allophones of a given phoneme can be tied reflecting the fact that
they represent the stable (context-independent) physical realization of the central part of
the phoneme, uttered with a stationary configuration of the vocal tract.
Another approach consists of having clusters of distributions characterized by the
same set of Gaussian probability density functions. Allophone distributions are built by
considering mixtures with the same components but with different weights [14].
2.5.2. Word Networks
Isolated recognition in general means recognition of speech based on any kind of
isolated speech unit, which can be a word, a subword, or even a concatenation of words.
However, only isolated word recognition has direct practical applications.
In a simple isolated speech unit recognition task, where the vocabulary contains N
speech units, we can use the system depicted in Figure 2.6.
Figure 2.6. A simple isolated speech unit recognizer that uses null-grammar
We have a finite network, since only one of the words can be spoken. The speech
contains an initial silence and a final silence. This simple network used for isolated word
recognition should be expanded because we do not have models for each different word.
We should generate networks using the HMMs for the best match triphones. We have two
silence models: Z-Z+Z for the interword silence and X-X+X for the silence at the
beginning and the end of the word. The X-X+X model differs in its self-transition
probability; since this probability is high, we can compensate for loose end-pointing. The
expanded network will be as in Figure 2.7.
Figure 2.7. The expanded network using the best match triphones
Figure 2.8. The null-grammar network showing the underlying states
For the silence models we have a one-state HMM, and for triphones we have three-
state HMMs. The reason for choosing three-state models is that we can model the
transition between the adjacent phonemes with the first and the third states. The second
state stands for the steady state of the phoneme. Actually we have two more states at the
beginning and the end of the models that produce no observation. They are not shown in
the figure for the sake of simplicity. Figure 2.8 shows the actual HMM network that
contains 3-state models for triphones and 1-state model for the silence.
2.5.3. Training of the HMMs
We will use semicontinuous HMMs, which means every output probability
distribution of each state is a linear combination of Gaussian density functions. 512
mixtures have been proposed for optimal performance [16]. Using tied mixtures makes
recognition easier because only the probabilities of 512 Gaussian density functions are
calculated for each feature vector. The probability of each triphone is a combination of
these pdf's.
In training we need to estimate the means and variances of Gaussian densities,
mixture coefficients and state transition matrix for each triphone.
The core process in training a set of subword models (phonemes) involves
embedded training. We use a program called "train" for this purpose. Continuously spoken
utterances are parametrized as the training data. In embedded training, re-estimates of the
complete set of subword HMMs are done simultaneously. For each input utterance we
need a transcription of the speech, but labeling is not required. Labeling means specifying
the boundaries of phonemes in that utterance. A single composite HMM is built for each
input utterance using the transcription of the input speech and the initial HMMs. This
composite HMM collects statistics for the re-estimation of the parameters. When all of the
training utterances have been processed, the total set of accumulated statistics is used to
re-estimate the parameters of all of the phone HMMs.
To find the initial parameters for the HMMs before embedded re-estimation we
have two choices. One method is to assign a global mean and variance to all Gaussian
distributions in all HMMs.

Another method is to begin with a small set of hand-labelled training data to
initialize the mean and variance of each Gaussian density. Then the Baum-Welch method is
used to update the mean and/or variance of each Gaussian density.
2.5.4. Recognition
To find the most probable sequence we need a search space, and that search
space is represented with a word network. In the case of isolated word recognition it is a
simple network of N words between start and end nodes. In the case of keyword spotting
we need a different kind of network, which we will discuss in the next chapter. After
building the network using the nearest triphones we have to find the optimal path for a
given observation, i.e., the speech unit sequence.

Then it is possible to trace the corresponding speech unit sequence via the state
sequence. In order to calculate the optimal state sequence ($q^*$) we can use the Viterbi
algorithm directly, or the level building method, which is a variant of the Viterbi algorithm.
Since Viterbi based recognition is suboptimal unless each speech unit corresponds to an
HMM state, some attempts have been made to develop efficient methods for calculating
the sentence likelihoods. The N-best algorithm is one of these.
2.5.4.1. Viterbi Based Recognition. The Viterbi score $\delta_t(i)$ can be computed for all the
states in the language model $\Lambda$ at time $t$, and the computation can then advance to the time
instant $t+1$, in an inductive manner, as formulated in [13]. This procedure is known as time
synchronous Viterbi search because it completely processes time $t$ before going on to time $t+1$.
Finally a backtracking pass gives the required state sequences.
Viterbi search can be very expensive if the number of states is large. When the
number of states is large, at every time instant a large portion of states have an
accumulated likelihood which is much less than the highest one, so it is expected that a
path passing through one of these states will not become the best path at the end of the
utterance. This consideration leads to a complexity reduction technique called beam search
[19]. Beam search neglects states whose accumulated score is lower than the best one
minus a given threshold. Pruning less likely paths avoids extra computation. However, if
the pruning threshold is chosen poorly, the best path can be lost. In practice, good tuning of
the beam threshold results in a gain in speed by an order of magnitude, while introducing a
negligible amount of search errors.
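The pruning rule — discard any state whose accumulated score falls below the best score minus a threshold — can be sketched in a few lines. Scores are log likelihoods and the values are purely illustrative:

```python
def beam_prune(scores, beam_width):
    """Beam pruning: keep only states whose accumulated log score is within
    beam_width of the best score at this frame. scores maps state -> score."""
    best = max(scores.values())
    return {s: v for s, v in scores.items() if v >= best - beam_width}

# Hypothetical accumulated log-likelihoods of four active states at one frame.
frame_scores = {0: -10.0, 1: -12.5, 2: -25.0, 3: -11.0}
active = beam_prune(frame_scores, beam_width=5.0)
print(sorted(active))  # [0, 1, 3] -- state 2 falls outside the beam
```

A narrow beam risks pruning the eventual best path, which is exactly the failure mode described above.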
2.5.4.2. N-best Search. The N-best search algorithm is very similar to the time synchronous
Viterbi search. Since the purpose of the N-best method is to find the optimum speech unit
sequence instead of the optimum state sequence, a summing operation should be done
instead of the maximum finding operation. However, if we completely drop the maximum
finding operation it becomes the forward algorithm, and we are back where we started.
Therefore pruning is performed at every state (in addition to the beam pruning),
keeping only the first N paths with the highest scores. Therefore even this algorithm does
not give the theoretically optimum sentence. At the end, the algorithm gives the N most likely
sentences; for a simple task without post-processors, N=1 is enough. We used the token
passing paradigm [20].
2.6. Keyword Spotting Problem
In the keyword spotting problem a continuously spoken utterance is tested for the
existence of a keyword. The speech signal may contain any combination of silences,
keywords and non-keywords. The words that are not keywords are called garbage or out-
of-vocabulary words. We have discussed how to model a word using HMMs for its
phonemes. We need to model the out-of-vocabulary words in some form.

Since we use tied mixtures for modeling the output produced at each state of the
network, we will be using Gaussian mixtures for garbage models. We may choose to use
fewer mixtures to model garbage words. These Gaussian mixtures will have greater
variances, which means they are more general models.
3. PROPOSED KEYWORD SPOTTING ALGORITHM
3.1. Introduction
The best we can do to detect keywords in a speech signal is to recognize all of the
words using a Large Vocabulary Continuous Speech Recognizer (LVCSR). However, the
cost is very high, and when the context of the keywords is unknown it is impossible to use an
LVCSR system. Instead, it is common to model all of the out-of-vocabulary words with
garbage models. The garbage model we have used is a 16-mixture model. The reason for
using a small number of mixtures is to model the general properties of the speech signal;
a small number of mixtures means greater variances of the Gaussian distribution functions.
We have used the notation J-J+J for the garbage model.
Another idea we have tried can be described as follows: if we create a network with
the triphone model and the monophone model of the same word, the triphone model of the word
should get the best score if the keyword exists. If the keyword does not exist, the
monophone or the garbage model will get the best score. The reason for this hypothesis is
that the monophone models represent the context independent phonemes and the triphone
models represent the context dependent phonemes. Monophone models have greater
variances, to model all phonemes more generally. We have used 32-mixture monophone
models.
In order to favor some of the models we have added a bonus value to all transition
probabilities of the HMMs on that path. Although the probability of an event cannot be
greater than one and the probabilities of mutually exclusive events should sum to 1, adding
some value to the transition probabilities works fine, because it increases the
probability of passing through some path. This corresponds to a one-pass implementation of
Likelihood Ratio Scoring.
3.2. Experiment Data
The details of the database that we have used for keyword spotting are described in
Table 3.1 and Table 3.2. We asked 12 speakers to read 20 different sentences from a sheet
over the telephone. Since some speakers made mistakes during recording, 44 of the
sentences were removed from the database. Therefore the total number of sentences that we
used in our simulations was 196. The sentences are given in Appendix A. In the
simulations we tested 1-, 3-, 5- and 10-keyword sets. The keywords in each set and their
numbers of occurrences are listed in Table 3.2.
Table 3.1. Database used for keyword spotting
Number of sentences 196
Number of words 3,391
Number of speakers 12
Sound Quality 8 kHz µ-law encoded telephony signal
Total record time 15 minutes
Table 3.2. Number of occurrences of the keywords used for keyword spotting tests
Medya Holding 12
Ankara + Yargıtay + Fenerbahçe 36
Pejo + Sabah + Ankara + Medya Holding + Yargıtay 86
Türkiye + Pejo + Pirelli + Devalüasyon + Fenerbahçe + rüşvet +
Sabah + Yargıtay + Medya Holding + Rahmi Koç 167
3.3. Performance of a System
For a keyword spotting system there are two kinds of errors: false alarms and misses.
If there is no keyword and the keyword spotter detects a keyword, it is called a false
alarm. If there is a keyword present but the keyword spotter cannot detect it, it is called
a miss. The same recognizer cannot reduce false alarms without increasing the miss rate.
Receiver Operating Characteristic (ROC) curves show the probability of detection
versus the probability of false alarm [20]. The probability of detection is given by dividing the
number of detections by the number of keywords. The probability of false alarm is given by
dividing the total number of false alarms by the total number of words that are not keywords.
However, since we do not know much about the out-of-vocabulary words, false alarms per
hour is a much more reasonable measure than the probability of false alarm.
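The two quantities plotted at each operating point can be computed as in this sketch; the counts are illustrative, not results from the experiments:

```python
def roc_point(n_detected, n_keywords, n_false_alarms, record_hours):
    """One ROC operating point for a keyword spotter: detection probability
    and false alarms per hour of speech."""
    p_detect = n_detected / n_keywords
    fa_per_hour = n_false_alarms / record_hours
    return p_detect, fa_per_hour

# Hypothetical numbers: 30 of 36 keyword occurrences found, 4 false alarms
# over a 15-minute (0.25 h) database.
p, fa = roc_point(30, 36, 4, 15 / 60)
print(round(p, 3), fa)  # 0.833 16.0
```

Sweeping the bonus values and recomputing this pair for each setting produces the ROC points discussed below.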
3.4. System Structure
Figure 3.1 shows the general network structure that can be used to detect the
keyword "sabah". It is an infinite-state network. As long as input data is available, the
same path may be used multiple times. Since the silence model Z-Z+Z is always used, this
grammar attempts to model each word in the speech signal as either the keyword or garbage.
Figure 3.1. General structure of the proposed keyword spotter
We have tested the performance of the keyword spotter using
a. the triphone model of the word and one garbage model
b. the triphone model and the monophone model of the word
c. the triphone model and the monophone model of the word and a garbage model
In order to obtain a Receiver Operating Characteristic (ROC) we measured the
probability of detection and the number of false alarms per hour. Each time the test is
performed we have used different combinations of bonus values for the monophone model of
the word and the garbage model. We have a program that counts the false alarm/miss rate
automatically. We performed the test using only the monophone models of the keywords
as the garbage model or a 1-state 16-mixture general garbage model.
Figure 3.2 shows the ROC points for the tests we have performed. The operating point
of the ROC is determined by the bonus values of the monophone models and the garbage model.
However, these bonus values are related; as a result, the points obtained using different
combinations do not form a curve. We therefore prefer to call these figures ROC points.
Figure 3.2. ROC points for different alternatives for garbage model
As can be seen, using the monophone models of the keywords as the garbage model
gives approximately the same performance as using the 1-state general garbage model.
However, when we use a combination of these two alternatives with different bonus values,
we get better performance. Actually monophones are good candidates for garbage
modeling. The recognizer cannot model very short utterances using the monophone model
of the word, so a better alternative may be using an ergodic model for the monophone
model of the keyword. In an ergodic model a transition from each phoneme to every other is
possible, unlike the left-to-right models we have been using. However, the cost of using an
ergodic model is too high. So the monophone model of the word coupled with a general
1-state garbage model is a better alternative.
We have tested the system for different numbers of keywords. As the number of
keywords increases, the performance of the system decreases, as expected. Figure 3.3
shows the ROC points for 1, 3, 5 and 10 keywords. All of the operating points are shown
together in this figure regardless of the structure used. Of course, using monophone models
of the keywords as the garbage model has given better results for multiple keywords, too.
Figure 3.3. ROC points for different number of keywords for keyword spotting simulations
3.5. Performance of Monophone Models for Isolated Word Recognition
Monophone models did not result in an increase in performance when used in the
keyword spotting task. The reason may be the infinite-state network used for keyword
spotting. We have used the same idea in isolated word recognition.
We have decided to use our keyword spotter as a post-processor for the isolated word
recognizer. The output of the isolated word recognition is used as the keyword for the
keyword spotter. The keyword spotter creates a temporary network using the triphone
model of the word and the monophone model of the word.
Figure 3.4. Network structure for the keyword spotter used as a post-processor for
isolated word recognizer
The network is a finite-state network as shown in Figure 3.4. If the triphone model of the
output gets the higher likelihood, the post-processor decides that the word is from the vocabulary;
otherwise we assume that it is garbage. We have performed the experiment using different
bonus values. Table 3.3 and Table 3.4 show the results of the experiments. Figure 3.5
shows the ROC curve for monophone models and the garbage model.
Table 3.3. Results for monophone model based out-of-vocabulary word rejection for
isolated word recognition
Bonus Value   Probability of Detection   Probability of False Alarm
0.70          0.993                      0.634
0.80          0.984                      0.581
0.90          0.976                      0.528
1.00          0.954                      0.477
1.10          0.931                      0.429
1.20          0.910                      0.376
1.25          0.896                      0.343
1.30          0.883                      0.323
1.40          0.855                      0.278
1.50          0.814                      0.222
1.75          0.720                      0.142
2.00          0.578                      0.075
2.50          0.296                      0.031
3.00          0.101                      0.006
Table 3.4. Results for general garbage model based out-of-vocabulary word rejection for
isolated word recognition
Bonus Value   Probability of Detection   Probability of False Alarm
0.90          0.993                      0.912
1.00          0.988                      0.883
1.10          0.986                      0.871
1.20          0.981                      0.839
1.30          0.981                      0.810
1.40          0.981                      0.778
1.50          0.981                      0.733
1.60          0.979                      0.705
1.80          0.954                      0.625
2.00          0.931                      0.541
2.25          0.888                      0.438
2.50          0.831                      0.333
3.00          0.709                      0.173
3.25          0.604                      0.123
3.50          0.523                      0.089
4.00          0.299                      0.037
Figure 3.5. ROC curves for monophone and garbage model based out-of-vocabulary word
rejection
4. CONFIDENCE MEASURES FOR ISOLATED WORD
RECOGNITION
4.1. Introduction
Out-of-vocabulary word rejection is an important issue in isolated word recognition
as well as in keyword spotting. If somebody dials an automated information retrieval
system and asks about a subject that is irrelevant to that service, the system will interpret
the request as one of the known inputs. Of course the user will not notice that he uttered an
out-of-vocabulary word, and will complain about the service: telling a person who wants
information about a hospital about the weather is nonsense. The user may also give an input
accidentally, especially to a barge-in-enabled system. A barge-in-enabled
system stops playing the prompt as soon as it detects voice activity of sufficient duration.
For a pleasing service, out-of-vocabulary word rejection may be as important as recognizing
the true item from the vocabulary.
A recognition system may produce a likelihood score instead of rejecting or
accepting a hypothesis. The confidence value is a score that shows how confident the
recognizer is about the recognition result. However, a raw confidence value is not suitable for an
average system designer; it should be converted into a more meaningful value like percent
confidence. The percent confidence is a number between zero and 100 and reflects how
likely the result is: if you group the results that have a percent confidence of 75, the
recognition rate will be 75 per cent. A recognizer cannot assign 100 per cent confidence to a result.
The confidence value is useful especially when you have a chance to ask the speaker
to say the word again. If the confidence value is too low, the system may tell the user
that it could not understand. If the confidence value is high, the system may directly
advance to the next step in the scenario. If the confidence value is medium, the system may
need to confirm the recognition result. A higher percent confidence may be required for
critical applications. Using different threshold values for confidence values corresponds to
changing the operating point on an ROC curve.
4.2. Experiment Data
We have collected 1,176 recordings of isolated words/phrases from 4 speakers. Each
speaker spoke the names of the stocks exchanged at İMKB as isolated phrases once. Some
of the recordings are very bad and some have end-pointing errors, but we have kept them for
an objective evaluation. We set aside half of the words and used the remaining ones in the
recognition system.
4.3. Minimum Edit Distance
The robustness of the recognizer requires that it operate in different situations
without great degradation in performance. Due to the variability of speech there is no
strict decision about speech units. As a result, it is natural that the recognizer will match a
very close word if the spoken word is not in the vocabulary. For example, our HMM based
recognizer can make the distinction between the words "kardemirZbe" and "kardemirZde" if
both of them are in the vocabulary. If one of them is removed from the vocabulary and the
incoming speech signal corresponds to the removed word, the recognizer will match the
remaining word. This feature can be useful if a speaker says "sabancolding" instead of
"sabancIZholding". So we have decided not to penalize the recognizer if the recognized
word and the actual word are very close to each other.
In order to decide the similarity of two words we have used the minimum edit
distance criterion. The minimum edit distance between two strings is defined as the
minimum number of editing operations (insertion, deletion, substitution) needed to
transform one string into the other [22]. The minimum edit distance algorithm is an example
of the class of dynamic programming algorithms and is given in Appendix B. The costs of
insertion, deletion and substitution may be assigned according to the specific task. For
speech recognition, assigning a cost of 1 for insertion and deletion and 2 for substitution is
reasonable [23].
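A minimal sketch of the dynamic programming recurrence with these costs (the actual implementation used in the experiments is the one in Appendix B):

```python
def min_edit_distance(a, b, ins=1, dele=1, sub=2):
    """Minimum edit distance by dynamic programming, with the costs
    suggested for speech recognition: insertion 1, deletion 1, substitution 2."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele                     # delete all of a
    for j in range(1, n + 1):
        d[0][j] = j * ins                      # insert all of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,      # delete a[i-1]
                          d[i][j - 1] + ins,       # insert b[j-1]
                          d[i - 1][j - 1] + cost)  # match / substitute
    return d[m][n]

# The two near-homophones from the text differ by one substitution (cost 2),
# so a confusion between them falls under the <= 3 MEDR threshold used below.
assert min_edit_distance("kardemirZbe", "kardemirZde") == 2
```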
If the recognizer does not reject an out-of-vocabulary word that is very close to a word
in the vocabulary, we counted it as a detection and called this Minimum Edit Distance Revision
(MEDR). If the minimum edit distance is less than or equal to 3, we assumed the result to be
correct. The results got better, as expected. In Figure 4.1 we show the results with and
without Minimum Edit Distance Revision. Since the number of detections is increased, the
ROC is closer to the upper left corner. The remaining figures reflect the MEDR-applied
results unless otherwise stated.
Figure 4.1. ROC curves before/after applying Minimum Edit Distance Revision (MEDR)
4.4. Phoneme Durations
The HMMs we have used assign transition probabilities to each state to model
temporal variability. This requires that each state be used during at least two adjacent
frames in order to be part of the best path. However, it does not impose any maximum
duration constraint on the states or the phonemes.
We have developed a visual tool that displays the likelihood of a given word using
the input speech signal and the state sequence. During the N-best search we may choose to
keep the phoneme information (phoneme index, end time and probability at that instant)
for each token accumulated. If we keep phoneme information for all paths (some of
which will be pruned after a few frames), it will require extra memory and CPU time.
We have preferred finding the phoneme boundaries after the recognition has
ended and the best path has been found. After finding the best path for the input speech
signal, we create a temporary network with only the recognized word and tell the
recognizer to keep the phoneme information. As a result, the recognizer does a forced
alignment for the specified word. We compare the phoneme durations found with the
average durations of each phoneme. Average phoneme durations are shown in
Table 4.1. If the difference is not small enough, we assign a penalty score to the
phoneme. The formula is empirical and can be summarized as follows:

difference = |phoneme duration - average duration|
ratio = difference / average duration

ratio > 4.0 ⇒ penalty score = difference * 15000
2.5 < ratio ≤ 4.0 ⇒ penalty score = difference * 5000
1.5 < ratio ≤ 2.5 ⇒ penalty score = difference * 2000
ratio < 0.2 ⇒ penalty score = difference * 2000
0.2 ≤ ratio ≤ 1.5 ⇒ penalty score = 0
The penalty scores of all phonemes are summed, and this total serves as a
confidence measure.
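The empirical penalty rule above can be sketched directly; the band edges are interpreted as closed where the text leaves them open, and the small duration dictionary is just a subset of Table 4.1:

```python
# Subset of the average phoneme durations from Table 4.1 (seconds).
AVG_DURATION = {"a": 0.114238, "e": 0.100183, "i": 0.089455, "k": 0.106367}

def duration_penalty(phoneme, duration):
    """Empirical penalty for a phoneme duration that deviates from its average."""
    avg = AVG_DURATION[phoneme]
    difference = abs(duration - avg)
    ratio = difference / avg
    if ratio > 4.0:
        return difference * 15000
    if ratio > 2.5:
        return difference * 5000
    if ratio > 1.5:
        return difference * 2000
    if ratio < 0.2:
        return difference * 2000
    return 0.0  # 0.2 <= ratio <= 1.5: plausible duration, no penalty

def word_penalty(alignment):
    """Sum of per-phoneme penalties; the total is used as a confidence measure."""
    return sum(duration_penalty(ph, dur) for ph, dur in alignment)
```

Note that durations between 0.2 and 1.5 times the average away from the mean incur no penalty, while very short deviations below 0.2 also score zero because the difference itself is small.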
Table 4.1. Average phoneme durations in Turkish
Phoneme   Average Duration (sec)     Phoneme   Average Duration (sec)
/C/       0.104476                   /i/       0.089455
/G/       0.051474                   /j/       0.132000
/I/       0.087539                   /k/       0.106367
/O/       0.111259                   /l/       0.059551
/S/       0.136918                   /m/       0.080046
/U/       0.075290                   /n/       0.075376
/Z/       0.052331                   /o/       0.114173
/a/       0.114238                   /p/       0.092847
/b/       0.056495                   /r/       0.065306
/c/       0.084662                   /s/       0.130187
/d/       0.065811                   /t/       0.107484
/e/       0.100183                   /u/       0.084297
/f/       0.081055                   /v/       0.053175
/g/       0.064541                   /y/       0.078936
/h/       0.065209                   /z/       0.093300
Sometimes low penalty scores are obtained for meaningless results. This can happen
when a long word is matched against a shorter utterance: if the recognizer cannot find
enough data to proceed to the end of the path, it completes the word anyway, in order to
compensate for end-point errors. We have therefore added a simple additional confidence
measure: the total number of phonemes forwarded without input data, denoted N. The
situation can be seen in Figure 4.2. The recognizer claims that the speech signal
corresponds to the word “iSZbankasIZkurucu”, but the waveform is unlikely to belong to
such a long word. The number of forwarded phonemes is 9 in this case.
Figure 4.2. Forced alignment of the waveform for keyword “iSZbankasIZkurucu”
Figure 4.3 shows a typical case where the recognition result deserves a high penalty.
Although the number of forwarded phonemes is only 1, the phoneme durations do not seem
reasonable. The recognizer has assigned a penalty score of 1571 for this recognition.
In order to model phoneme durations more accurately, the effect of context should
be considered, since the same phoneme may have different duration statistics in different
contexts. Using the mean and variance of each phoneme's duration is one choice, with the
deviation from the average duration serving as the measure. However, the median of the
durations may be a more robust statistic, and percentile ranks may be used instead of the
standard deviation. In any case, the statistics should be derived from durations found
using forced alignment.
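As a sketch of the suggested robust statistics, the following computes a median and a percentile acceptance band from a phoneme's forced-alignment durations; the 5th-95th percentile band is an illustrative choice, not taken from the thesis:

```python
import numpy as np

def duration_stats(durations):
    """Robust duration statistics from forced-alignment durations of one phoneme:
    the median plus a 5th-95th percentile acceptance band (band width is an
    illustrative assumption)."""
    d = np.asarray(durations, dtype=float)
    return {"median": float(np.median(d)),
            "lo": float(np.percentile(d, 5)),
            "hi": float(np.percentile(d, 95))}

def within_band(duration, stats):
    """True if a candidate duration falls inside the acceptance band."""
    return stats["lo"] <= duration <= stats["hi"]
```

Unlike a mean/variance pair, the median and percentile band are not pulled toward a few extreme alignments, which matters when some forced alignments are wrong.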
Figure 4.3. Forced alignment of the waveform for keyword “milpa”
Figure 4.4 shows ROC curves for various values of N, the threshold on the number of
forwarded phonemes.
Figure 4.4. ROC curves for phoneme duration based confidence measure
If the recognizer is to be operated at low false alarm rates, a small threshold for N
should be selected. However, increasing the threshold beyond 5 does not improve the
performance much. The optimum operating point for many applications is N = 4.
4.5. Garbage Model Using the Same 512 Mixtures
During the N-best search, the likelihoods of the triphones are calculated as
weighted sums of Gaussian mixture probabilities. Since we already have the probabilities
of these mixtures, we can compute a garbage probability at each frame. This garbage
modeling technique has been used for speaker recognition [10]. The garbage likelihood and
the likelihood of “ceylangiyim” are shown in Figure 4.5. The shaded area between the two
likelihood profiles may serve as the basis of a confidence measure; the question is how to
convert this area into one.
Figure 4.5. Likelihood profiles for “ceylangiyim” and the proposed base garbage model
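A per-frame garbage score from the shared mixture pool can be sketched as follows. The diagonal-covariance Gaussian evaluation is standard; the pooling rule (averaging the top-N mixture log likelihoods) is our assumption for illustration, since the thesis only states that the garbage probability is derived from the same 512 mixtures:

```python
import numpy as np

def frame_log_likelihoods(frame, means, inv_vars, log_norms):
    """Log likelihood of one feature frame under each diagonal-covariance Gaussian
    in the shared mixture pool (log_norms holds the precomputed log normalizers)."""
    diff = frame - means                                    # (n_mix, dim)
    return log_norms - 0.5 * np.sum(diff * diff * inv_vars, axis=1)

def garbage_log_likelihood(mix_ll, top_n=8):
    """Garbage score for the frame: mean of the top_n mixture log likelihoods.
    The top-N averaging rule is an assumption, not taken from the thesis."""
    top = np.sort(mix_ll)[-top_n:]
    return float(np.mean(top))
```

Because the same mixture evaluations are already computed during the N-best search, this garbage score adds almost no cost per frame.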
We have tried a few approaches. First of all, only phonemes that are meaningful for
us are considered: the triphones X-X+X and Z-Z+Z, which correspond to long silence and
short (inter-word) silence respectively, are excluded from the calculations. If the forced
alignment is unsuccessful, we do not attempt to compute a confidence measure; instead we
directly assign a low score. Since a wrong alignment leads to wrong decisions about the
likelihood ratio, we again penalize excessive deviation from the average phoneme
duration. The penalty method we have chosen is to decrease the probability of that
phoneme artificially.
Since the area corresponds to the log difference of the two likelihood profiles shown,
it is reasonable to take the difference at each frame and sum these differences, which
gives the likelihood ratio over the whole path. However, we have experimented with a few
variations to see whether this is really the best measure. For each frame we take the
difference, weight it according to the characteristics of the phoneme, raise it to some
power, and sum over the frames. Finally we normalize by the accumulated weights.
Mathematically,
S = ( Σ_{i=1}^{totalframes} γ_i (p_i − g_i)^β ) / ( Σ_{i=1}^{totalframes} γ_i )        (4.1)

where p_i and g_i denote the log likelihoods of the recognized word and of the garbage
model at frame i.
Here γ_i is the weighting coefficient, equal to γ for the phonemes /a/, /e/, /i/, /I/,
/o/, /O/, /u/, /U/, /S/, /s/, /z/ and to 1 for all other phonemes.
We expect the recognizer to discriminate these phonemes better than the remaining
ones, so γ is given values greater than or equal to 1. Unfortunately we did not obtain any
improvement. Figure 4.6 illustrates this result: the ROC curves correspond to the results
obtained with β = 1 and γ values of 1, 2 and 3. All curves are very close.
Figure 4.6. ROC curves for different emphasis values for power value 1
Figure 4.7. ROC curves for different power values with emphasis set to 1
After this experiment we varied the value of β while keeping γ = 1, since changing
γ did not matter. The best value was again β = 1, which means using the difference
according to its real meaning (a difference of log likelihoods is a likelihood ratio).
Figure 4.7 illustrates this case.
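Equation 4.1 can be implemented directly; p and g are the per-frame log likelihoods of the recognized word and the garbage model, the emphasized phoneme set follows Section 4.5, and β = γ = 1 recovers the plain accumulated log likelihood ratio (the function and variable names are ours):

```python
# Phonemes weighted by gamma in Equation 4.1 (Section 4.5).
EMPHASIZED = set("aeiIoOuUSsz")

def confidence(p, g, phonemes, gamma=1.0, beta=1.0):
    """Weighted, powered, normalized sum of per-frame log likelihood differences
    (Equation 4.1). p[i], g[i]: word/garbage log likelihoods at frame i;
    phonemes[i]: phoneme label active at frame i."""
    num = den = 0.0
    for pi, gi, ph in zip(p, g, phonemes):
        w = gamma if ph in EMPHASIZED else 1.0
        num += w * (pi - gi) ** beta
        den += w
    return num / den
```

With gamma > 1 the vowel and sibilant frames dominate the average, which is the variation that, per the experiments above, did not improve the ROC curves.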
4.6. Comparison of Confidence Measures
Finally, we compare the confidence measure based on phoneme duration alone with
the one that combines the 512-mixture garbage model with phoneme duration. Since the
latter adds a likelihood ratio score to the duration constraint, it gives a better result.
Figure 4.8 clearly shows the increase in performance.
Figure 4.8. ROC curve for phoneme duration based confidence measure and confidence
measure with likelihood ratio scoring included
Performing the phoneme alignment after the recognition step costs extra time. The
average recognition times required are shown in Table 4.2.
Table 4.2. Computation time required with/without phoneme duration evaluation
Recognition 1.35 seconds
Recognition + Monophone based rejection 1.62 seconds
Recognition + Phoneme duration evaluation 1.62 seconds
5. CONCLUSION
In this study, we have examined the performance of monophone models of the
keywords as garbage models. Their gain over the 1-state general garbage model is very
small in keyword spotting tasks. However, there is a significant performance increase for
the monophone keyword models over the garbage models when the finite-state keyword
spotter is used as a post-processor to an isolated word recognizer.
Several confidence measures were tested for isolated word recognition. The best
performance was achieved when phoneme duration information and average phoneme
likelihoods were used together. The same pool of Gaussian functions was used for the
garbage likelihood.
The duration-based confidence measure can be examined in more detail. Since the
duration of a phoneme depends on the context in which it occurs, context information
should be incorporated. The minimum and maximum durations of each phoneme could be
used instead of the average durations. Durations obtained by forced alignment are more
appropriate for deriving the phoneme statistics. Finally, the monophone model of each
phoneme could be used as a garbage model for likelihood ratio scoring.
APPENDIX A: SENTENCES USED FOR KEYWORD SPOTTING
1. Sabah grubu içinde yer alan sabah pazarlamanın sabah otomobilden ve medya
holdingten çeşitli alacaklarının bulunduğu tespit edildi
2. Aradan zaman geçiyor. ne olup ne bittiğini merak eden Ankara polonya
büyükelçiliğinden ses çıkmadığını görünce varşovaya soruyor
3. Taşların artık yerine oturmaya başladığını söyleyen Rahmi Koç gelecek dönemde çok
daha itibarlı bir Türkiyenin ortaya çıkacağını söyledi
4. Fenerbahçe teknik direktörü rüştünün tıbba göre kupa finalinde oynaması imkansız
değil dedi ve ekledi: maça daha iki gün var, son dakikaya kadar bekleyeceğim
5. Kahvenin egzersiz sırasında adelelerin tükettiği gılikojen miktarını azaltararak daha
uzun süre egzersiz yapmaya yardımcı olduğu belirtiliyor
6. Konuyla ilgili araştırma haberini internetten yayınlayan riyıl eyç gurubu spor yapan
insanların gerçek yaşlarından dokuz yaş daha genç kalabileceklerine değiniyor
7. Türkiye ihtiyacı olan dış katkıyı gerektiği oranda bulmak için zorlanacak
8. ikinci el otomobil piyasasında şubat ayında yaşanan devalüasyonun ardından satışlar
durma noktasına geldi
9. internet üzerinden düzenlenen pejo dizayn yarışmasının kazanan tasarımcısı cenevre
otomobil fuarında açıklandı
10. Oldukça cesur ve yenilikçi bir şekilde dizayn edilen model şık kelimesini hak ediyor.
Dizayn halihazırda bütün bu özellikleriyle pejonun internet sitesini ziyaret eden
otomobil meraklılarının kalbini kazanmayı başardı
11. Yarışmaya en fazla ilginin geldiği ülkeler arasında fıransa, rusya, italya ve
kolombiyanın yanısıra Türkiye de bulunuyor
12. Barselona pirelli en ileri teknoloji ile ürettiği yeni yüksek performans lastikleri pe altı
ve pe yediyi piyasaya sunuyor
13. Beyaz enerji operasyonu kapsamında gündeme gelen rüşvet çarkına adı karışanları
zor günler bekliyor
14. Dekoder sahiplerinin muhatabı teleondan önce futbol federasyonu olmalıdır dedi.
çünkü anlaşmanın koşullarını koyan federasyondur
15. Silah avadanlık ve mühimmat hakkında personele gerekli açıklamaların
yapılmasından sonra eğitici tatbiki olarak günlük haftalık bakımlar ile atış menzili
bakımlarının uygulanmasına geçer
16. Savcı dosya münderecatına göre suçu işlediği sabit olan abuzettin beyin tecekanın
bilmem kaçıncı maddesine göre cezalandırılması talep olunur dedi
17. İstanbul memorial hastanesince düzenlenen insan gen haritası projesi embriyoda veya
erken gebelik döneminde genetik tanı başlıklı toplantıya katılan halifa türkiyede ilk
defa embriyolarda genetik inceleme tekniği başlattıklarını belirtti
18. Yargıtay bozma kararını af yasasının çıkmasından üç gün önce vererek şahinin
davanın ertelenmesi olanağından faydalanmasını da önledi
19. Göstericiler vaşingtın post gazetesi binasını da çürük meyve yağmuruna tuttu ve
suçlu buş katil buş sıloganları attı
20. Hazine enflasyon hedefi kadar zam önerisine memur ve işçi arasında giderek
büyüyen ücret uçurumunu gerekçe gösterirken türkiş bu yaklaşımı kışkırtma olarak
nitelendirdi
APPENDIX B: MINIMUM EDIT DISTANCE ALGORITHM
function MIN-EDIT-DISTANCE(target, source) returns min-distance
n←LENGTH(target)
m←LENGTH(source)
Create a distance matrix distance[n+1, m+1]
distance[0, 0]←0
for each column i from 1 to n do
    distance[i, 0]←distance[i-1, 0] + ins-cost(targeti)
for each row j from 1 to m do
    distance[0, j]←distance[0, j-1] + del-cost(sourcej)
for each column i from 1 to n do
    for each row j from 1 to m do
        distance[i, j]←MIN( distance[i-1, j] + ins-cost(targeti),
                            distance[i-1, j-1] + subst-cost(sourcej, targeti),
                            distance[i, j-1] + del-cost(sourcej) )
return distance[n, m]
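A direct Python transcription of the pseudocode above, assuming unit insert/delete/substitute costs by default:

```python
def min_edit_distance(target, source,
                      ins_cost=lambda c: 1,
                      del_cost=lambda c: 1,
                      subst_cost=lambda s, t: 0 if s == t else 1):
    """Dynamic-programming minimum edit distance between source and target."""
    n, m = len(target), len(source)
    # distance[i][j]: cost of turning source[:j] into target[:i]
    distance = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        distance[i][0] = distance[i - 1][0] + ins_cost(target[i - 1])
    for j in range(1, m + 1):
        distance[0][j] = distance[0][j - 1] + del_cost(source[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            distance[i][j] = min(
                distance[i - 1][j] + ins_cost(target[i - 1]),
                distance[i - 1][j - 1] + subst_cost(source[j - 1], target[i - 1]),
                distance[i][j - 1] + del_cost(source[j - 1]))
    return distance[n][m]
```

With unit costs this is the plain Levenshtein distance used by the MEDR procedure of Section 4.3.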
REFERENCES
1. Deller, J. R., J. H. L. Hansen and J.G. Proakis, Discrete-Time Processing of Speech
Signals. IEEE Press, 2000
2. Young, S., “A Review of Large-Vocabulary Continuous-Speech Recognition”. IEEE
Signal Processing Magazine, September 1996.
3. Rabiner, L. R., “A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition.” Proceedings of the IEEE, 77(2) pp. 257-286, February 1989.
4. Sproat, R., Multilingual Text-to-Speech Synthesis: The Bell Labs Approach. Kluwer
Academic Publishers, Massachusetts, Chapter 5, 1998.
5. Hermansky, H., “Perceptual Linear Predictive (PLP) Analysis For Speech”. Journal of
the Acoustical Society of America, 87(4) pp. 1738-1752, April 1990
6. Markel, J. D. and A. H. Gray, Linear Prediction of Speech. Springer-Verlag, Berlin,
1976
7. Davis, S. B. and P. Mermelstein, “Comparison of Parametric Representations For
Monosyllabic Word Recognition in Continuously Spoken Sentences”. IEEE
Transactions on Acoustics, Speech and Signal Processing, ASSP-28 pp. 357--366,
August 1980.
8. Furui, S., “Speaker-Independent Isolated Word Recognition Using Dynamic Features of
The Speech Spectrum”. IEEE Transactions on Acoustics, Speech and Signal
Processing, 29(1) pp. 59-59, 1986.
9. Applebaum, T. H. and B. A. Hanson, “Regression Features For Recognition of Speech
in Quiet and in Noise”, Proceedings of the 1989 International Conference on
Acoustics, Speech, and Signal Processing, Glasgow, Scotland, May 1989, pp. 985-988
10. Furui, S., Digital Speech Processing, Synthesis and Recognition Second Edition
Revised and Expanded. Marcel Dekker, NY, Feb. 2001
11. Huang, X. D., Y. Ariki and M. Jack, Hidden Markov Models for Speech Recognition.
Edinburgh University Press, 1990.
12. Rabiner, L. R. and B. Juang, Fundamentals of Speech Recognition. Prentice-Hall,
Englewood Cliffs, NJ, 1993.
13. Forney G. D., “The Viterbi Algorithm”. Proc. IEEE, Vol 61, pp. 268-278, 1973.
14. Digalakis V. and H. Murveit, “Genones: Optimizing the Degree of Mixture Tying in a
Large Vocabulary Hidden Markov Model Based Speech Recognizer”, Proceedings of
the 1994 International Conference on Acoustics, Speech, and Signal Processing,
Adelaide, Australia, April 1994, pp. 537-540.
15. Shukat-Talamazzini E. G., H. Niemann, W. Eckert, T. Kuhn and S. Rieck, “Acoustic
Modeling of Sub-Word Units in the ISADORA Speech Recognizer”, Proceedings of
the 1992 International Conference on Acoustics, Speech, and Signal Processing, San
Francisco, March 1992, pp. 577-580.
16. Yapanel, U., “Garbage Modeling Techniques For a Turkish Keyword Spotting System”,
M.S. Thesis, Bogazici University, 2000.
17. Bahl L. R., P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo and M. A. Picheny,
“Decision Trees for Phonological Rules in Continuous Speech”, Proceedings of the
1991 International Conference on Acoustics, Speech, and Signal Processing, Toronto,
May 1991, pp. 185-188.
18. Hon, H.-W. and K.-F. Lee, “CMU Robust Vocabulary-Independent Speech
Recognition System”, Proceedings of the 1991 International Conference on Acoustics,
Speech, and Signal Processing, Toronto, May 1991, pp. 889-892
19. Ney H., D. Mergel, A. Noll and A. Paesler, “Data Driven Search Organization for
Continuous Speech Recognition”. IEEE Transactions on Signal Processing, 40(2) pp.
272-281, February 1992.
20. Young, S. J., N. H. Russell and J. H. S. Thornton, Token Passing: a Simple Conceptual
Model for Connected Speech Recognition Systems. Technical Report, Cambridge
University Engineering Department, July 1989.
21. Peterson W. W., T. G. Birdsall and W. C. Fox, “The theory of signal detectability”.
IRE Trans. Info. Theory, PGIT-4, pp. 171-212, September 1954
22. Jurafsky D. and J. H. Martin, Speech and Language Processing. Prentice-Hall Upper
Saddle River, NJ, 2000, pp. 153-157.
23. Levenshtein, V. I., “Binary codes capable of correcting deletions, insertions, and
reversals”. Cybernetics and Control Theory, 10(8), pp. 707-710, 1966.