Speech Recognition for Hindi
M. Tech. Project Report
Submitted in partial fulfillment of the requirements
for the degree of
Master of Technology
by
Ripul Gupta
Roll No: 03305406
under the guidance of
Prof. G. Sivakumar
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai
Acknowledgement
I would like to thank Prof. G Sivakumar for his invaluable support, encouragement
and guidance, without which my MTech Project would have been an exercise in futility.
Ripul Gupta
Abstract
A speech interface to the computer is the next big step that computer science needs to take for
general users. Speech recognition will play an important role in taking technology to them.
The need is not only for a speech interface, but for a speech interface in local languages. Our
goal is to create speech recognition software that can recognise Hindi words.
This report takes a brief look at the basic building blocks of a speech recognition engine.
It discusses the implementation of the different modules: the Sound Recorder, the Feature Extractor,
and the HMM training and recognition modules are described in detail. The results of
the experiments that were conducted are also provided. The report ends with a conclusion
and future plans.
Contents
1 Introduction 1
1.1 Existing Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Speech Recognition - definition and Issues . . . . . . . . . . . . . . . . . . 2
1.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Design of the system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Overview of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Sound Recording and Word Detection 6
2.1 Sound Recorder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Word Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 The Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 The Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Feature Extractor 11
3.1 Windowing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Temporal Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Spectral Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Mel frequency cepstrum computation . . . . . . . . . . . . . . . . . 14
3.4 Feature Vector specification . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Knowledge Models 16
4.1 Acoustic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.1 Word Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.2 Phone Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.2.1 Context-Independent Phone Model . . . . . . . . . . . . . 18
4.1.2.2 Context-Dependent Phone Model . . . . . . . . . . . . . . 18
4.2 Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5 HMM Recognition and Training 21
5.1 HMM and Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.2 Recognition using HMM . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.2.1 Forward Variable . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.2.2 Occurrence Probability . . . . . . . . . . . . . . . . . . . . 26
5.1.3 Training the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.3.1 Segmental K-means Algorithm . . . . . . . . . . . . . . . 27
5.1.3.2 Viterbi algorithm . . . . . . . . . . . . . . . . . . . . . . . 27
6 Experimental Results 29
6.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.2 Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7 Conclusion and Future Work 31
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
8 Appendix 34
8.1 WAV file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
8.1.0.3 RIFF WAVE Chunk . . . . . . . . . . . . . . . . . . . . . 35
8.1.0.4 FMT SubChunk . . . . . . . . . . . . . . . . . . . . . . . 35
8.1.0.5 Data SubChunk . . . . . . . . . . . . . . . . . . . . . . . 35
8.2 Installation Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.1 Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.1.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.1.2 Recorder . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.1.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 37
8.2.1.4 HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
8.2.2 Data Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
8.3 Matrix Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.4 HMM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.5 Class Diagram of the System . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.6 Word Waveforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.7 MFCC and energy feature plots . . . . . . . . . . . . . . . . . . . . . . . . 41
List of Figures
1.1 Block diagram of Recognition System . . . . . . . . . . . . . . . . . . . . . 3
1.2 Block diagram of Training System . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Wave Plot for word paanch . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Energy Plot for word paanch . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Zero Crossing Plot for word paanch . . . . . . . . . . . . . . . . . . . . . 10
3.1 Block diagram of Feature Extractor . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Windowing of the speech signal . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Impulse Response of Hamming Window . . . . . . . . . . . . . . . . . . . . 13
3.4 Output of feature extractor . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5 Plot of lowest MFCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1 Word acoustic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Phone acoustic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.1 Diagrammatic Representation of HMM . . . . . . . . . . . . . . . . . . . . 21
5.2 Diagrammatic Representation of the Model . . . . . . . . . . . . . . . . . . 24
8.1 Format of a Wave file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
8.2 Hex Dump of a wave file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
8.3 Class Diagram of the System . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.4 Plots for word ek . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.5 Plots for word do . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.6 Second, third and fourth MFCC coefficient . . . . . . . . . . . . . . . . . . 45
8.7 Fifth, sixth and seventh MFCC coefficient . . . . . . . . . . . . . . . . . . 46
8.8 Eighth, ninth and tenth MFCC coefficient . . . . . . . . . . . . . . . . . . 47
8.9 Eleventh and Twelfth MFCC Coefficient and Normalised Energy . . . . . . 48
Chapter 1
Introduction
The keyboard, although a popular medium, is not very convenient, as it requires a certain
amount of skill for effective usage. A mouse, on the other hand, requires good hand-eye
co-ordination. It is also cumbersome for entering a non-trivial amount of text data and
hence requires the use of an additional medium such as a keyboard. Physically challenged people
find computers difficult to use. Partially blind people find reading from a monitor difficult.
Current computer interfaces also assume a certain level of literacy from the user. They
also expect the user to have a certain level of proficiency in English. In our country, where
the literacy level is as low as 50% in some states, these constraints have to be eliminated
if information technology is to reach the grass-roots level. A speech interface can help us
tackle these problems.
Speech synthesis and speech recognition together form a speech interface. A speech
synthesiser converts text into speech, so it can read out the textual contents of the
screen. A speech recogniser has the ability to understand spoken words and convert them
into text. We need such software to be available for Indian languages.
1.1 Existing Systems
Although some promising solutions are available for speech synthesis and recognition,
most of them are tuned to English. The acoustic and language models of these systems
are built for the English language. Most of them also require a lot of configuration before
they can be used.
There are also projects which have tried to adapt them to Hindi or other Indian languages.
[7] explains how an acoustic model can be generated using an existing acoustic model for
English.
ISIP [2] and Sphinx [1] are two of the better-known open source speech recognition
systems. [5] gives a comparison of public domain software tools for speech recognition.
Some commercial software, such as IBM's ViaVoice, is also available.
1.2 Speech Recognition - definition and Issues
Speech recognition refers to the ability to listen to spoken words (input in audio format),
identify the various sounds present in them, and recognise them as words of some known
language.
Speech recognition in the computer systems domain may then be defined as the ability of
computer systems to accept spoken words in audio format - such as WAV or raw - and then
generate their content in text format.
Speech recognition in the computer domain involves various steps, each with its own
issues. The steps required to make computers perform speech recognition are: voice
recording, word boundary detection, feature extraction, and recognition with the help of
knowledge models.
Word boundary detection is the process of identifying the start and the end of a spoken
word in the given sound signal. While analysing the sound signal, it sometimes becomes
difficult to identify the word boundary. This can be attributed to the various accents
people have, such as the duration of the pauses they leave between words while speaking.
Feature extraction refers to the conversion of the sound signal into a form suitable
for the following stages to use. It may include extracting parameters such
as the amplitude of the signal, the energy at different frequencies, etc.
Recognition involves mapping the given input (in the form of various features) to one of the
known sounds. This may involve the use of various knowledge models for precise identification
and ambiguity removal.
Knowledge models refer to models such as the phone acoustic model and language models,
which help the recognition system. To generate the knowledge models one needs to
train the system: during the training period the system is shown a set of inputs
and the outputs they should map to. This is often called supervised learning.
1.3 Problem Definition
The aim of this project is to build a speech recognition tool for the Hindi language. It
is an isolated word speech recognition tool. We have used a continuous HMM, which can
support a vector as an observation, for this purpose.
1.4 Design of the system
Visualised as a block diagram, the system has the following components: a sound
recording and word detection component, a feature extraction component, a speech
recognition component, and the acoustic and language models.
Figure 1.1: Block diagram of Recognition System
• Sound Recording and Word Detection component: This component is responsible
for taking input from the microphone and identifying the presence of words. Word
detection is done using the energy and zero crossing rate of the signal. The output of
Figure 1.2: Block diagram of Training System
this component can be a wave file or a direct feed for the feature extractor. The
component is discussed in detail in Chapter 2.
• Feature Extraction component: This component generates feature vectors for the
sound signals given to it. It generates Mel Frequency Cepstrum Coefficients and
normalised energy as the features used to uniquely identify the
given sound signal. This module is discussed in detail in Chapter 3.
• Recognition component: This is a continuous, multi-dimensional Hidden Markov
Model based component. It is the most important component of the system and
is responsible for finding the best match in the knowledge base for the incoming
feature vectors. This component is discussed in Chapter 5.
• Knowledge Model: This component consists of a word-based acoustic model. The
acoustic model is a representation of how a word sounds; the recognition system
makes use of it while recognising the sound signal.
The basic flow, once training is done, can be summarised as follows: the sound input is
taken from the sound recorder and fed to the feature extraction module. The feature
extraction module generates feature vectors, which are then forwarded to the
recognition component. The recognition component, with the help of the knowledge model,
comes up with the result.
During training, the above flow differs after the generation of the feature vectors: the
system takes the output of the feature extraction module and feeds it to the recognition
system to modify the knowledge base.
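The two flows can be sketched as below. All of the callables (`detect_words`, `extract_features`, `recognise`, `train`) are hypothetical stand-ins for the components described above, not the project's actual function names.

```python
def run_pipeline(audio, detect_words, extract_features, recognise, train=None):
    """Recognition flow from Section 1.4; passing `train` switches to the training flow."""
    results = []
    for segment in detect_words(audio):        # word boundary detection
        feats = extract_features(segment)      # feature vectors per word
        if train is not None:
            train(feats)                       # training: update the knowledge base
        else:
            results.append(recognise(feats))   # recognition: best match in the models
    return results
```

The only difference between the two modes is what consumes the feature vectors, which mirrors the description of the training flow above.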
1.5 Overview of the Report
The contents of the chapters are as follows.
• Chapter 2 discusses the recording of sound.
• Chapter 3 explains how feature extraction is achieved.
• Chapter 4 explains what an acoustic model is and how it is represented using an HMM.
• Chapter 5 explains how the implementation for training and recognition was done
for HMM.
• Chapter 7 concludes the report with a summary of the work done and the next
proposed steps.
• Chapter 8 is the appendix, which gives some additional information that might be of
interest.
Chapter 2
Sound Recording and Word
Detection
This component's responsibility is to accept input from a microphone and forward it to the
feature extraction module. Besides converting the signal into a suitable or desired form, it
also performs the important task of identifying the segments of the sound containing words. It
also provides for saving the sound into WAV files, which are needed by the training
component.
2.1 Sound Recorder
2.1.1 Features
The recorder takes input from the microphone and saves it or forwards it, depending on
which function is invoked. The recorder supports changing the sampling rate, the number
of channels and the sample size. The default sampling rate of the recorder is 44,100 samples
per second, at a size of 16 bits per sample, dual channel.
2.1.2 Design
Internally, it is the job of the Sound Reader class to take the input from the user. The Sound
Reader class takes the sampling rate, sample size and number of channels as parameters.
Sound Reader has three basic functions: open, close and read. The open function opens the
/dev/dsp device in read mode and makes the appropriate ioctl calls to set the device
parameters. The close function releases the dsp device. The read function reads from the dsp
device, checks whether a valid sound is present, and returns the sound content. The source
code of Yet Another Recorder (yarec) was referred to while writing the sound recorder.
The Recorder class takes care of converting the raw audio signal into the WAV format and
storing it in a file. The format of a WAV file is given in the appendix for reference.
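The raw-PCM-to-WAV conversion performed by the Recorder class can be sketched with Python's standard `wave` module; this is only an illustration of the container format, not the project's C++ code, which writes the header itself.

```python
import io
import wave

def to_wav_bytes(raw_pcm, rate=44100, channels=2, sampwidth=2):
    """Wrap raw little-endian PCM data in a RIFF/WAVE container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)   # dual channel by default
        w.setsampwidth(sampwidth)  # 16 bits per sample
        w.setframerate(rate)       # 44,100 samples per second
        w.writeframes(raw_pcm)
    return buf.getvalue()
```

The resulting bytes start with the RIFF/WAVE header described in the appendix, followed by the fmt and data sub-chunks.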
2.2 Word Detector
2.2.1 The Principle
In speech recognition it is important to detect when a word is spoken. The system detects
the regions of silence; anything other than silence is considered a spoken word by
the system. The system uses the energy pattern present in the sound signal and the zero
crossing rate to detect the silent regions. Using both is important, as energy alone tends
to miss some parts of sounds which are important. This technique is described in
[9].
2.2.2 The Method
For word detection a sample is taken every 10 milliseconds. The energy and zero crossing
rate for this duration are calculated. The energy is calculated by adding the squares of the
waveform values at each instant and dividing by the number of instants over the period
of the sample. The zero crossing rate is the number of times the value of the wave goes from
negative to positive or vice versa.
The Word Detector assumes that the first 100 milliseconds are silence. It uses the average
energy and average zero crossing rate obtained during this time to characterise the
background noise. The upper thresholds for energy and zero crossing are set to 2 times the
average values for the background noise. The lower thresholds are set to 0.75 times the upper
thresholds.
While detecting the presence of a word in the sound, if the energy or the zero crossing rate goes
above the upper threshold and stays above it for three consecutive samples, a word is assumed
to be present and recording starts. Recording continues until both the energy and the zero
crossing rate fall below the lower thresholds and stay there for at least 30 milliseconds.
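The frame measurements and threshold rule described above can be sketched as follows; the framing and the exact bookkeeping in the actual recorder may differ.

```python
def frame_energy(samples):
    """Average squared amplitude over one 10 ms frame."""
    return sum(s * s for s in samples) / len(samples)

def zero_crossing_rate(samples):
    """Count sign changes between consecutive samples (zero treated as positive)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

def thresholds(background_frames, measure):
    """Upper threshold = 2x the average background value; lower = 0.75x the upper."""
    avg = sum(measure(f) for f in background_frames) / len(background_frames)
    upper = 2.0 * avg
    return upper, 0.75 * upper
```

`thresholds` would be called once on the frames from the first 100 ms, separately for energy and for the zero crossing rate.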
Figure 2.1: Wave Plot for word paanch
2.2.3 Example
Figure 2.1 shows the waveform for the word paanch, sampled at 44,100 samples per second.
Figure 2.2 shows the energy pattern for the same word, and Figure 2.3 the zero crossing
rate, both calculated every 10 milliseconds. The example shows the importance of the zero
crossing plot: looking at the energy plot alone, we would assume that the sound
began at the 10th reading, when actually the word begins at the start. This is because of
the sound 'p' present at the start of the word; the actual start can be detected
by zero crossing rate analysis.
2.2.4 Results
Seven people recorded the numbers one to ten in Hindi. Table 2.1 shows how the
word detector program performed.
Figure 2.2: Energy Plot for word paanch
Number of people: 7
Number of words per speaker: 10
Number of non words detected as words: 8
Number of words broken into two parts: 3
Table 2.1: Word Detector Performance
One of the above recordings was made with a constant hum as background
noise; the system was found to be immune to it.
Figure 2.3: Zero Crossing Plot for word paanch
Chapter 3
Feature Extractor
Humans have the capacity to identify different types of sounds (phones). Phones put in
a particular order constitute a word. If we want a machine to identify a spoken word, it
will have to differentiate between different kinds of sounds the way humans perceive them.
The point to be noted in the case of humans is that, although one word spoken by different
people produces different sound waves, humans are able to identify those sound waves as
the same word. On the other hand, two sounds which are different are perceived as different by
humans. The reason is that even when the same phones or sounds are produced by different
speakers, they have common features. A good feature extractor should extract these
features and use them for further analysis and processing.
Figure 3.1 shows a block diagram of a typical feature extraction module.
Figure 3.1: Block diagram of Feature Extractor
3.1 Windowing
Features get periodically extracted. The time for which the signal is considered for pro-
cessing is called a window and the data acquired in a window is called as a frame. Typically
features are extracted once every 10ms, which is called as frame rate. The window dura-
tion is typically 25ms. Thus two consecutive frames have overlapping areas. Figure 3.2
shows two frames being extracted and also the overlapping zone.
Figure 3.2: Windowing of the speech signal
Different types of windows are in use:
• Rectangular window
• Bartlett window
• Hamming window
Of these, the most widely used is the Hamming window; the system uses it, as it
introduces the least amount of distortion. The impulse response of the Hamming window
is a raised cosine and is shown in Figure 3.3. The transfer function of the Hamming
window is

w(n) = 0.54 + 0.46 cos(nπ/m) (3.1)
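Equation 3.1 can be written as a short window function. This is a sketch; the symmetric index range n = -m..m is an assumption, since the report does not state the range of n.

```python
import math

def hamming_window(m):
    """Raised-cosine window w(n) = 0.54 + 0.46*cos(pi*n/m) for n in -m..m (Eq. 3.1)."""
    return [0.54 + 0.46 * math.cos(math.pi * n / m) for n in range(-m, m + 1)]
```

The window peaks at 1.0 in the centre and falls to 0.08 at the edges, which is what tapers each frame before spectral analysis.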
Features are then extracted from each frame. Most features fall into one of two
categories:
12
Figure 3.3: Impulse Response of Hamming Window
• Temporal features
– Energy normalisation
– Zero crossing rate
• Spectral features
– Power spectral analysis (FFT)
– Linear predictive analysis (LPC)
– Mel scale cepstral analysis (MEL)
– First order derivative (DELTA)
3.2 Temporal Features
Temporal features are easy to extract, simple and have easy physical interpretation. Tem-
poral features like average energy level, zero-crossing rate, root mean square, maximum
amplitude, etc can be extracted out as features.
13
3.3 Spectral Analysis
Spectral analysis gives us quite a lot of information about the spoken phone. Time
domain data is converted to Frequency domain by applying Fourier transform on it. This
process gives us the spectral information. Spectral information is the energy levels at
different frequencies in a given window. Thus features like frequency with maximum
energy, distance between frequencies of maximum and minimum energies, etc can be
extracted.
3.3.1 Mel frequency cepstrum computation
Mel frequency cepstrum computation(MFCC) is considered to be the best available ap-
proximation of human ear. It is known that human ear are more sensitive to higher
frequency. The spectral information can then be converted to MFCC by passing the sig-
nals through band pass filters where higher frequencies are artificially boosted, and then
doing a inverse Digital Fourier Transform(DFT) on it. This results in higher frequencies
being more prominent.
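The mel scale underlying these filters is not given in the report; for reference, a commonly used mapping from frequency in hertz to mels is the following (an addition here, not taken from the report):

```python
import math

def hz_to_mel(f):
    """Commonly used mel-scale mapping: mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)
```

The mapping is roughly linear at low frequencies and logarithmic at high ones, so equal mel intervals correspond to increasingly wide frequency bands.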
The Feature Extraction module is capable of producing different kinds of features from
the sound input. The possible features that can be extracted are energy, MFCCs, their
derivative coefficients and their second order derivative coefficients. The source code for this
module has been taken from the Internet-Accessible Speech Recognition Technology project [2].
All the generated features are output one frame per line, so each line of the
output contains the feature vector for one frame. Figure 3.4 shows an example of the output of
the feature extractor. A feature vector is simply a list of numbers describing the
properties of the sound in a frame.
Figure 3.4: Output of feature extractor
3.4 Feature Vector specification
Vectors were generated at a frame interval of 10 milliseconds, using a Hamming
window with a duration of 25 milliseconds. Twelve MFCCs and an energy level are generated
for each frame. These features can now be used either for recognition or for training the
HMM.
Figure 3.5: Plot of lowest MFCC
Figure 3.5 is a plot of the lowest-frequency MFCC. It can be seen that this plot
corresponds closely with the energy curve.
Chapter 4
Knowledge Models
For speech recognition, the system needs to know how the words sound. For this we need
to train the system. During training, using the data given by the user, the system
generates an acoustic model and a language model. These models are later used by the system
to map a sound to a word or a phrase.
4.1 Acoustic Model
The features extracted by the Feature Extraction module need to be compared against
a model to identify the sound that was produced as the word that was spoken. This model
is called the Acoustic Model.
There are two kinds of Acoustic Models:
• Word Model
• Phone Model
4.1.1 Word Model
Word models are generally used to small vocabulary systems. In this model the words
are modelled as whole. Thus each word needs to be modelled separately. If we need to
add support to recognise a new word, we will have to train the system for the word. In
the recognition process, the sound is matched against each of the model to find the best
Figure 4.1: Word acoustic model
match. This best match is assumed to be the spoken word. Building a model for a word
requires us to collect sound recordings of the word from various users. These recordings are
then used to train an HMM. Figure 4.1 shows a diagrammatic representation of
the word-based acoustic model.
4.1.2 Phone Model
In the phone model, instead of modelling the whole word, we model only parts of words,
generally phones, and the word itself is modelled as a sequence of phones. The heard sound
is matched against the parts, the parts are recognised, and the recognised parts are
put together to form a word. For example, the word ek is generated by the combination of
two phones, A and k. This is generally useful when we need a large-vocabulary system.
Adding a new word to the vocabulary is easy: since the sounds of the phones are already
known, only the possible phone sequences for the word, with their probabilities, need to be
added to the system. Figure 4.2 shows a diagrammatic representation of the phone-based
acoustic model.
Phone models can be further classified into:
Figure 4.2: Phone acoustic model
• Context-Independent Phone Model
• Context-Dependent Phone Model
4.1.2.1 Context-Independent Phone Model
In this model individual phones are modelled. The context that they occur is not modelled.
The good thing about this model is that the number of phone that have to be modelled
is small. Thus the complexity of the system is less.
4.1.2.2 Context-Dependent Phone Model
While modelling a phone, its neighbours are also considered. That means iy surrounded
by z and r is a separate entity from iy surrounded by h and r. This results in a growth
in the number of modelled phones, which increases the complexity. [6] shows that high
phone recognition accuracies can be obtained using context-dependent phone models.
In both the word acoustic model and the phone acoustic model we need to model silence and
filler words too. Filler words are the sounds that humans produce between two words.
Both these models can be implemented using either a Hidden Markov Model or a
Neural Network; the HMM is the more widely used technique in automatic speech recognition
systems.
4.2 Language Model
Although there are words that contain similar-sounding phones, humans generally do not
find it difficult to recognise them. This is mainly because they know the context,
and also have a fairly good idea of what words or phrases can occur in that context.
Providing this context to a speech recognition system is the purpose of the language model.
The language model specifies which words are valid in the language and in what
sequences they can occur.
4.2.1 Classification
Language Models can be classified into several categories:
Uniform Models Each word has equal probability of occurrence.
Stochastic Models Probability of occurrence of a word depends on the words preceding
it.
Finite State Languages Language uses a finite state network to define allowed word
sequences.
Context Free Grammar A context free grammar can be used to encode which kinds of
sentences are allowed.
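A stochastic model of the kind listed above can be sketched as a bigram table. This is a sketch, and the words and probabilities in the test below are illustrative, not taken from the project's data.

```python
def sentence_probability(words, bigram, start="<s>"):
    """P(w1..wn) as a product of bigram probabilities P(w_i | w_{i-1})."""
    p, prev = 1.0, start
    for w in words:
        p *= bigram.get((prev, w), 0.0)  # unseen word pairs get probability 0
        prev = w
    return p
```

A uniform model is the special case where every entry in the table has the same probability.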
4.3 Implementation
We have implemented a word acoustic model. The system has a model for each word that
it can recognise. The list of words can be considered the language model. Section
8.4 shows a sample model.
While recognising, the system needs to know where to locate the model for each word
and what word each model corresponds to. This information is stored in a flat file called
models in a directory called hmms.
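Loading such a lookup file can be sketched as below. The per-line layout (a word followed by its model file) is an assumption, since the report does not specify the format of the models file.

```python
def load_model_index(text):
    """Map each word to its HMM model file, assuming one 'word model-file' pair per line."""
    index = {}
    for line in text.splitlines():
        if line.strip():
            word, model_file = line.split()
            index[word] = model_file
    return index
```

The recogniser would then load each listed model file into memory before scoring begins.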
When a sound is given to the system to recognise, it compares each model with the
sound and finds the model that most closely matches it. The word corresponding
to that HMM model is given as the output. Details about the HMM models and their
training and recognition are given in the next chapter.
Chapter 5
HMM Recognition and Training
A Hidden Markov Model (HMM) is a state machine. The states of the model are represented
as nodes and the transitions as edges. The difference in the case of an HMM is
that a symbol does not uniquely identify a state: the new state is determined by the
symbol and the transition probabilities from the current state to the candidate states. [4] is
a tutorial on HMMs which shows how they can be used.
Figure 5.1 shows a diagrammatic representation of an HMM. The nodes drawn as circles
are states; O1 to O5 are observations. Observation O1 takes us to state S1. aij denotes
the transition probability between Si and Sj. It can be observed that the states also have
self transitions. If we are in state S1 and observation O2 is observed, we can either
go to state S2 or stay in state S1. The decision is made depending on the probability
of the observation at each of the two states and the transition probabilities.
Figure 5.1: Diagrammatic Representation of HMM
Thus an HMM is defined as:

λ = (Q, O, A, B, π) (5.1)

where
Q = {qi} is the set of all possible states,
O = {vi} is the set of all possible observations,
A = {aij}, where aij = P(Xt+1 = qj | Xt = qi), gives the transition probabilities,
B = {bi}, where bi(k) = P(Ot = vk | Xt = qi), gives the observation probabilities (of observation k at state i),
π = {πi}, where πi = P(X0 = qi), gives the initial state probabilities.
Xt denotes the state at time t, and Ot denotes the observation at time t.
5.1 HMM and Speech Recognition
HMMs can be classified by various criteria:
• Values of observations
– Discrete
– Continuous
• Dimension
– One Dimensional
– Multi Dimensional
• Probability density function
– Continuous density (Gaussian distribution) based
– Discrete density (Vector quantisation) based
When using an HMM for recognition, we provide the observations to the model and it
returns a number: the probability with which the model could have
produced that output (the observations). In speech recognition the observations are feature
vectors rather than just symbols, so each observation is a group of real
numbers. Thus, what we need for speech recognition is a continuous, multi-dimensional
HMM.
5.1.1 Implementation
Several HMM libraries were looked at:
• HTK: the HMM Tool Kit is a mature HMM implementation, but its license
does not allow redistribution of code.
• A C++ implementation of HMMs by Prof. Dekang Lin: the problem with this
implementation was that it is a discrete HMM implementation.
• GHMM: GHMM is an open source library for HMMs. It supports both discrete and
continuous HMMs, but it does not have support for more than one dimension.
A continuous HMM library, which supports vectors as observations, has therefore been
implemented in the project. The library uses the Gaussian probability distribution function.
Section 8.4 shows an XML file containing a specification of an HMM; the sample has five
states with a vector size of three.
The root tag in the HMM file is hmm, which indicates that the file contains an HMM
model. The tag has two attributes, states and dimension, indicating the number of states
and the vector size of an observation respectively. Each state lists its outgoing edges
with their probabilities; these are stored as transition tags inside the state. Each
transition tag carries the target state id and the probability of the transition. A
state also has one or more mixtures. A mixture consists of a mean vector and a variance
matrix, which are used to calculate the probability of an occurrence. The way of
calculating the probability is discussed in Subsection 5.1.2.2.
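The model file maps naturally onto a small in-memory structure. The following sketch mirrors the XML layout described above; all type and field names here are illustrative, not the project's actual class names:

```cpp
#include <vector>

// Illustrative in-memory mirror of the XML model described above.
// Names are hypothetical; the project's real classes may differ.
struct Transition {
    int targetState;      // id of the destination state
    double probability;   // probability of taking this edge
};

struct Mixture {
    std::vector<double> mean;                   // one value per dimension
    std::vector<std::vector<double>> variance;  // D x D covariance matrix
};

struct State {
    double pi;                            // initial-state probability
    std::vector<Transition> transitions;  // outgoing edges
    std::vector<Mixture> mixtures;        // one or more Gaussian mixtures
};

struct Hmm {
    int numStates;   // the "states" attribute
    int dimension;   // the "dimension" attribute (size of one observation)
    std::vector<State> states;
};
```

Loading the XML with xerces would then amount to filling such a structure, one State per state tag.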
Figure 5.2 shows a diagrammatic representation of the model. Circles are states and
the arrows indicate transitions; each transition has its probability. In the diagram it
can be seen that states also have self transitions. Note that not all transitions are
shown in the figure. Each state has a mean and a variance associated with it. The mean
is a vector of N real numbers, where N is the size of the observation, while the
variance is a matrix of size N ∗ N.

Figure 5.2: Diagrammatic Representation of the Model
5.1.2 Recognition using HMM
We need to recognise a word using the existing models of words that we have. Sound
recorder need to record the sound when it detects the presence of a word. This recorded
sound is then passed through feature vector extractor model. The output of the above
module is a list of features taken every 10 msec. This features are then passed to the
Recognition module for recognition.
The list of all the words that the system is trained for and their corresponding models
are given in a file called models present in the hmms. All models corresponding of the
words are then loaded in memory. The feature vectors generated by the feature vector
generator module act as the list of observation for the recognition module.
Probability of generation of the observation given a model, P (O|λ), is calculated for
each of the model using find probability function. The word corresponding to the HMM,
that gives the probability that is highest and is above the threshold, is considered to be
spoken.
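The selection step above can be sketched as follows. The scores are assumed to have been computed already by the find probability function; ScoredModel and recognise_word are illustrative names, not the project's:

```cpp
#include <string>
#include <vector>

// Sketch of the decision rule described above: among all word models whose
// P(O|lambda) exceeds the threshold, report the word with the highest score.
struct ScoredModel {
    std::string word;
    double probability;  // P(O | lambda) for this word's HMM
};

std::string recognise_word(const std::vector<ScoredModel>& scored,
                           double threshold) {
    std::string best;            // empty string means "no word recognised"
    double bestProb = threshold; // a score must beat the threshold to count
    for (const ScoredModel& m : scored) {
        if (m.probability > bestProb) {
            bestProb = m.probability;
            best = m.word;
        }
    }
    return best;
}
```

The threshold rejects utterances that none of the trained models explains well.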
5.1.2.1 Forward Variable
The forward variable is used to find the probability of a list of occurrences given an
HMM. For a model λ with N states, the probability of the observation sequence, P(O|λ),
is defined in terms of the forward variable α as

P(O|λ) = Σ_{i=1}^{N} α_T(i)    (5.2)

where α_{t+1}(j) is defined recursively as

α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_{ij} ] b_j(O_{t+1})    (5.3)

with α_1(i) = π_i b_i(O_1).
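The recursion can be sketched directly from Equations 5.2 and 5.3. In this hypothetical fragment the observation probabilities b_j(O_t) are assumed to be precomputed for every frame, and a[i][j] holds the transition probabilities:

```cpp
#include <vector>

// Sketch of the forward recursion (Eqs. 5.2-5.3). a[i][j] is the transition
// probability, pi[i] the initial probability, and b[t][j] = b_j(O_t) the
// observation probability, assumed precomputed for each frame t.
double forward_probability(const std::vector<std::vector<double>>& a,
                           const std::vector<double>& pi,
                           const std::vector<std::vector<double>>& b) {
    int N = (int)pi.size(), T = (int)b.size();
    std::vector<double> alpha(N);
    for (int i = 0; i < N; ++i) alpha[i] = pi[i] * b[0][i];  // alpha_1(i)
    for (int t = 1; t < T; ++t) {
        std::vector<double> next(N, 0.0);
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int i = 0; i < N; ++i) sum += alpha[i] * a[i][j];
            next[j] = sum * b[t][j];                         // Eq. 5.3
        }
        alpha = next;
    }
    double p = 0.0;
    for (int i = 0; i < N; ++i) p += alpha[i];               // Eq. 5.2
    return p;
}
```

In practice the alphas underflow for long utterances, so implementations usually scale them per frame or work in log space; that detail is omitted here.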
5.1.2.2 Occurrence Probability
For the forward variable to work we need b_i(O_t), the probability of a given occurrence
in a particular state. This value can be calculated with the multivariate normal
distribution formula. The probability of observation O_t occurring in state i is given by:

b_i(O_t) = (1 / ((2π)^{D/2} |V_i|^{1/2})) exp(−(1/2) (O_t − µ_i)^T V_i^{−1} (O_t − µ_i))    (5.4)

where
D is the dimension of the vector,
µ_i is the mean vector,
V_i is the covariance matrix,
|V_i| is the determinant of V_i, and
V_i^{−1} is the inverse of V_i.
The mean vector µ_i is obtained by:

µ_i = (1/N) Σ_{O_t ∈ i} O_t    (5.5)

The covariance matrix V_i can be obtained by:

V_i = (1/N) Σ_{O_t ∈ i} (O_t − µ_i)^T (O_t − µ_i)    (5.6)

where N is the number of observations assigned to state i. In Equation 5.6 the variance
is calculated by finding the distance vector between an observation and the mean; the
transpose of the distance vector is multiplied with the distance vector itself, giving a
D × D matrix, where D is the dimension of the feature vector.
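As an illustration of Equation 5.4, the following sketch evaluates the density for the special case of a diagonal covariance matrix, where |V| is the product of the diagonal entries and V^{-1} inverts them element-wise. The project itself handles full covariance matrices through its matrix library; this simplification only shows the shape of the formula:

```cpp
#include <cmath>
#include <vector>

// Eq. 5.4 for a diagonal covariance matrix: the determinant is the product
// of the diagonal entries and the quadratic form reduces to a weighted sum.
double gaussian_density(const std::vector<double>& x,
                        const std::vector<double>& mean,
                        const std::vector<double>& var) {  // diagonal of V
    const double PI = 3.14159265358979323846;
    int D = (int)x.size();
    double det = 1.0, quad = 0.0;
    for (int d = 0; d < D; ++d) {
        det *= var[d];
        double diff = x[d] - mean[d];
        quad += diff * diff / var[d];  // (O - mu)^T V^{-1} (O - mu)
    }
    return std::exp(-0.5 * quad) /
           (std::pow(2.0 * PI, D / 2.0) * std::sqrt(det));
}
```

For D = 1, mean 0 and variance 1, this reduces to the standard normal density 1/√(2π) at the mean.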
If the model is fed insufficient data, the covariance matrix that gets created has a
determinant very close to zero, which makes it impossible to compute the inverse matrix
needed for the observation probability. We faced this very problem.

The above methods created the need for a matrix library, which was built as part of the
system. Details about the matrix library are in the Appendix.
5.1.3 Training the Model
Before we can recognise a word we need to train the system. The Train command is used to
train the system for a new word. The command takes at least three parameters:
• The number of states, N, that the HMM model should have.
• The size of the feature vector, D.
• One or more filenames, each containing a training set.
To generate an initial HMM we take N equally spaced observations (feature vectors) from
the first training set; each one is used to train a separate state. After this, each
state has a mean vector of size D and a variance matrix of size D ∗ D containing all
zeros. Then, for each of the remaining observations, we find the Euclidean distance
between it and the mean vectors of the states and assign the observation to the closest
state for training. The states assigned to consecutive observations are tracked to find
the transition probabilities.

The mean and variance of the states are calculated as shown in Equations 5.5 and 5.6.
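The assignment step above can be sketched as a nearest-mean search. This is illustrative code, not the project's actual implementation:

```cpp
#include <cstddef>
#include <vector>

// Assign an observation to the state whose mean vector is closest in
// Euclidean distance, as in the initialisation step described above.
int closest_state(const std::vector<double>& obs,
                  const std::vector<std::vector<double>>& stateMeans) {
    int best = 0;
    double bestDist = 1e300;
    for (int s = 0; s < (int)stateMeans.size(); ++s) {
        double d = 0.0;
        for (std::size_t k = 0; k < obs.size(); ++k) {
            double diff = obs[k] - stateMeans[s][k];
            d += diff * diff;  // squared distance suffices for comparison
        }
        if (d < bestDist) { bestDist = d; best = s; }
    }
    return best;
}
```

Comparing squared distances avoids the square root without changing which state wins.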
5.1.3.1 Segmental K-means Algorithm
This algorithm modifies the initial model so as to maximise P(O, I|λ), where O is the
set of training observations and I is a state sequence in the given HMM. The optimal
path for a training set is denoted by I*. Observations that were assigned to a state
different from the one given by the optimal path are moved to that state; this improves
P(O, I*|λ). The model is then re-estimated with the changed assignments of observations.

We repeat this process iteratively until no more reassignments are needed. The
calculation of the means, variances, and transition probabilities is done as shown
before.
5.1.3.2 Viterbi algorithm
The Viterbi algorithm is described in [8]. It identifies the best path that a signal can
take through an HMM. Finding the best path is a search problem; Viterbi uses dynamic
programming to reduce the search space. For the first observation, we find the
probability of each state being the start state by taking the product of its initial
probability and its observation probability. For every subsequent observation, each
state finds a predecessor such that the probability of the predecessor multiplied by the
transition probability from the predecessor to itself is maximised.
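The recursion just described can be sketched as follows. As in the forward-variable sketch, the observation probabilities b[t][j] are assumed to be precomputed; the names are illustrative:

```cpp
#include <vector>

// Viterbi sketch: delta[t][j] is the probability of the best path ending in
// state j at time t; back[t][j] remembers the predecessor that achieved it.
double viterbi_best_path(const std::vector<std::vector<double>>& a,
                         const std::vector<double>& pi,
                         const std::vector<std::vector<double>>& b,
                         std::vector<int>& path) {
    int N = (int)pi.size(), T = (int)b.size();
    std::vector<std::vector<double>> delta(T, std::vector<double>(N, 0.0));
    std::vector<std::vector<int>> back(T, std::vector<int>(N, 0));
    for (int j = 0; j < N; ++j) delta[0][j] = pi[j] * b[0][j];
    for (int t = 1; t < T; ++t) {
        for (int j = 0; j < N; ++j) {
            double best = -1.0;
            int arg = 0;
            for (int i = 0; i < N; ++i) {  // pick the best predecessor
                double v = delta[t - 1][i] * a[i][j];
                if (v > best) { best = v; arg = i; }
            }
            delta[t][j] = best * b[t][j];
            back[t][j] = arg;
        }
    }
    int last = 0;  // best final state
    for (int j = 1; j < N; ++j)
        if (delta[T - 1][j] > delta[T - 1][last]) last = j;
    path.assign(T, 0);
    path[T - 1] = last;
    for (int t = T - 1; t > 0; --t) path[t - 1] = back[t][path[t]];
    return delta[T - 1][last];
}
```

The backtracking pass at the end recovers the optimal state sequence I* used by the segmental K-means step.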
Chapter 6
Experimental Results
6.1 Training
Ten HMM models, one for each digit in Hindi, were made. For this we used about 60 to 90
recorded sounds for each digit. Voices of four people were used to train the system.
Each model took between seven and fourteen iterations to converge.
6.2 Recognition
Recognition was tried on three kinds of sounds:
• Seen sound: the sound files used to train the models.
• Unseen sound, seen user: unused sound files of a user whose other sound files were
used for training.
• Unseen user: a user whose voice was not used for training.
The results of the experiment were as follows:

Type of sound           | No of sounds | Correct Recognition | Wrong Recognition
Seen Sound              | 185          | 154                 | 22
Unseen sound, seen user | 25           | 11                  | 0
Unseen user             | 25           | 15                  | 0

Table 6.1: Recognition Result
Chapter 7
Conclusion and Future Work
7.1 Conclusion
A sound recorder with word detection was implemented; it did well in identifying the
presence of sound on the microphone. An HMM library was built, which was useful for
recognition of speech and also for training a word-based acoustic model.

Word models for the digits in Hindi were generated. Each model was trained using about
sixty to ninety sound files collected from four people.

The trained models were used to recognise other words. The recogniser gave good results
when tested on the sounds used for training the models. For other sounds too, the
results were satisfactory.
7.2 Future Work
We have used a word-based acoustic model, which can serve only a limited vocabulary; we
would have to move towards a phone-based acoustic model. The lack of a good
public-domain acoustic model for Indian languages needs to be addressed.

There are a few public-domain speech recognition systems available, but they have their
drawbacks. For example, there is a speech recognition program called ocvolume; the
problem with ocvolume is that it does not use HMMs but vector quantisation, so it will
not scale up to many words. Similarly, there are helper programs and libraries
available; for example, GHMM is an open-source HMM library, but it does not support
multi-dimensional HMMs.

Thus, as future work, it would be useful to improve these libraries and programs so that
they can be integrated together and become more useful.
References
[1] CMU Sphinx - open source speech recognition engines.
[2] Internet-accessible speech recognition technology. http://www.cavs.msstate.edu/
hse/ies/projects/speech/index.html.
[3] Brent M. Dingle. Calculating determinants of symbolic and numeric matrices.
Technical report, Texas A&M University, 2005.
[4] Rakesh Dugad and U. B. Desai. A tutorial on hidden Markov models. http://uirvli.
ai.uiuc.edu/dugad/hmm_tut.html.
[5] Samudravijaya K and Maria Barot. A comparison of public domain software tools for
speech recognition. In Workshop on Spoken Language Processing, pages 125–131, 2003.
[6] L. Lamel and J. Gauvain. High performance speaker-independent phone recognition
using CDHMM. In EUROSPEECH-93.
[7] M. Kumar, N. Rajput, and A. Verma. A large-vocabulary continuous speech recognition
system for Hindi. IBM Journal of Research and Development.
[8] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, pages 257–286, 1989.
[9] L. R. Rabiner and M. R. Sambur. An algorithm for determining the endpoints of
isolated utterances. Bell System Technical Journal, vol. 54, pages 297–315, 1975.
Chapter 8
Appendix
8.1 WAV file format
A WAV file stores data in little-endian format and is written in the Resource
Interchange File Format (RIFF). In RIFF, a file is divided into chunks; each chunk has
headers which give information about the data that follows. A WAV file requires at
least two chunks: the Format chunk and the Data chunk. Figure 8.1 shows a graphical
representation of a minimal WAV file.
Figure 8.1: Format of a Wave file
8.1.0.3 RIFF WAVE Chunk
The RIFF WAVE chunk contains nothing but the headers of the WAV file. This chunk has
three headers: the first is the string “RIFF”, indicating that the file follows the RIFF
format; the second is an integer specifying the size of the content that follows; the
third is the string “WAVE”, identifying the file type.
8.1.0.4 FMT SubChunk
The FMT subchunk describes the format in which the sound information is stored. Like the
RIFF chunk, this chunk has a SubChunkID, which is “fmt ”. It also carries information
such as the sampling rate, bits per sample, number of channels, and audio format. The
size of this header is minimally 24 bytes and can be followed by extra headers.
8.1.0.5 Data SubChunk
The Data subchunk has only two headers, ID and Size, followed by the actual sound data.

Figure 8.2 shows part of a hex dump of a wave file.

Figure 8.2: Hex Dump of a wave file
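The chunk layout above can be read directly from a byte buffer. The following sketch assumes the canonical minimal 44-byte header (RIFF, then “fmt ” at offset 12) and skips the extra-header and chunk-scanning cases a robust reader would handle:

```cpp
#include <cstdint>
#include <cstring>

// Little-endian field readers: the WAV format stores multi-byte integers
// least-significant byte first, regardless of the host's endianness.
uint32_t read_u32le(const uint8_t* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
uint16_t read_u16le(const uint8_t* p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

struct WavInfo {
    uint16_t channels;
    uint32_t sampleRate;
    uint16_t bitsPerSample;
    bool valid;
};

// Parse the three headers described above from a minimal 44-byte layout.
WavInfo parse_wav_header(const uint8_t* buf, size_t len) {
    WavInfo w{0, 0, 0, false};
    if (len < 44) return w;
    if (std::memcmp(buf, "RIFF", 4) != 0 ||       // RIFF WAVE chunk
        std::memcmp(buf + 8, "WAVE", 4) != 0 ||
        std::memcmp(buf + 12, "fmt ", 4) != 0)    // FMT subchunk id
        return w;
    w.channels      = read_u16le(buf + 22);
    w.sampleRate    = read_u32le(buf + 24);
    w.bitsPerSample = read_u16le(buf + 34);
    w.valid = true;
    return w;
}
```

A full reader would instead walk the chunks by their size fields, since the fmt chunk may carry extra headers before the data chunk.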
8.2 Installation Instructions
The source code of the program and the data files can be downloaded from
http://anakin.ncst.ernet.in/ ripul/speech/ .
8.2.1 Source Code
Source code is present in three directories: recorder, dsp and hmm.
8.2.1.1 Prerequisites
The following things should be present on the system for compiling and using the system:
• g++ compiler, version 3.3
• xerces library for C++
• development files of xerces library.
8.2.1.2 Recorder
The recorder directory contains the source files needed to record sound from the
microphone. This part of the code builds two executables.

Recorder is used to record sound to a WAVE file. The command takes the destination
filename as a parameter. When the program is started, sound is recorded to the file
until the program receives a termination signal, sent by pressing Control-C (SIGINT).

RawRecorder can be used to record sound from the microphone and store it as raw files.
The command listens to the sound coming from the microphone and checks for the presence
of a word; if a word is present, it is recorded to a file. The output of the command is
files named 1.wav, 2.wav and so on, so if RawRecorder detects ten words, there will be
ten wav files. The recorded sound has a sample size of 16 bits, a sampling rate of
44100 Hz, and a single channel.
8.2.1.3 Feature Extraction
This module is used to extract features from a given raw file. The command takes many
options, which indicate what features need to be extracted. The executable file is named
extract feature. The help file for the command is present in the directory.
8.2.1.4 HMM
The hmm directory contains the implementation of HMM library. The source code of the
library generates two executables Train and Recognise.
Training the system for a new word requires sound files for that word. Features for the
sound files can be extracted using the extract feature command. The Train command can
then be used to train the system. The command needs the number of states that the model
should have, the size of the feature vector, and the files to be used for training: the
first argument is a number indicating the number of states, the second argument is the
size of the vector, and these are followed by one or more files containing the training
data.

The output of the Train command is the trained HMM in XML format, which should be
written to a file and put in the hmms directory. An entry also needs to be made in the
models file present in the same directory.

For recognition we first record the sound using the RawRecorder program and then
extract the features to get an mfcc file. The Recognise command takes one or more
filenames as arguments and tries to recognise the word in each file.
8.2.2 Data Files
The hmms directory has the models for the digits in Hindi. The data directory contains
the sound files that were used to train the system; it has ten directories, one for
each digit. The prefix in a file name indicates the name of the person who recorded the
sound.
8.3 Matrix Library
The library supports basic matrix functions such as Transpose, Multiply and Add, as well
as Determinant and Inverse. The determinant was first coded with the standard textbook
algorithm, which took a lot of time. [3] explains a fast method of calculating the
determinant, proposed by Erwin H. Bareiss; this algorithm was implemented and gives
quick results.

Later it was found that the determinant of the covariance matrix was so small that it
always evaluated to zero. For this reason, the determinant function also takes a
parameter called factor: before calculating the determinant, all the numbers in the
matrix are multiplied by the factor, artificially boosting the value. Similar boosting
was done while calculating the inverse matrix.
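The factor trick rests on the identity det(c·A) = c^N · det(A) for an N × N matrix, so after boosting, the true determinant can be recovered by dividing by factor^N. A tiny 2 × 2 demonstration (illustrative only, not the library's code):

```cpp
#include <cmath>

// Determinant of a 2x2 matrix [[a b], [c d]]. Scaling every entry by a
// factor f scales the determinant by f^2, so a boosted determinant must be
// divided by f^N (here N = 2) to recover the original value.
double det2(double a, double b, double c, double d) {
    return a * d - b * c;
}
```

Boosting keeps the intermediate values away from the underflow region without changing the mathematical result.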
8.4 HMM Model
The following is the contents of a file containing the model:
<hmm states="5" dimension="3">
  <state pi="0.897059" id="1">
    <transition probability="0.890071" state-id="1"/>
    <transition probability="0.0230496" state-id="2"/>
    <transition probability="0.0921986" state-id="3"/>
    <transition probability="0.00177305" state-id="4"/>
    <transition probability="0.00177305" state-id="5"/>
    <mixture weight="1" samples="564">
      <mean>
        -0.184761 -7.98322 3.35773
      </mean>
      <variance>
        1.30115 0.0139008 -0.628016
        0.0139008 2.27187 -0.19782
        -0.628016 -0.19782 2.23247
      </variance>
    </mixture>
  </state>
  <state pi="0.1" id="2">
    <transition probability="0.0116279" state-id="1"/>
    <transition probability="0.912791" state-id="2"/>
    <transition probability="0.0872093" state-id="3"/>
    <transition probability="0.0116279" state-id="4"/>
    <transition probability="0.00581395" state-id="5"/>
    <mixture weight="1" samples="240">
      <mean>
        -3.12424 2.09704 2.6186
      </mean>
      <variance>
        3.98811 -12.055 5.85019
        -12.055 58.1744 -29.146
        5.85019 -29.146 19.9382
      </variance>
    </mixture>
  </state>
  <state pi="0.1" id="3">
    <transition probability="0.0035461" state-id="1"/>
    <transition probability="0.00177305" state-id="2"/>
    <transition probability="0.881206" state-id="3"/>
    <transition probability="0.120567" state-id="4"/>
    <transition probability="0.00177305" state-id="5"/>
    <mixture weight="1" samples="564">
      <mean>
        1.21589 -2.90561 0.717873
      </mean>
      <variance>
        0.541226 0.503624 -0.801156
        0.503624 5.81307 -2.1564
        -0.801156 -2.1564 3.89482
      </variance>
    </mixture>
  </state>
  <state pi="0.1" id="4">
    <transition probability="0.00241546" state-id="1"/>
    <transition probability="0.00483092" state-id="2"/>
    <transition probability="0.00241546" state-id="3"/>
    <transition probability="0.838164" state-id="4"/>
    <transition probability="0.164251" state-id="5"/>
    <mixture weight="1" samples="414">
      <mean>
        1.40136 8.17534 -4.76438
      </mean>
      <variance>
        1.40487 -2.42026 -2.4986
        -2.42026 10.8383 3.07776
        -2.4986 3.07776 6.94241
      </variance>
    </mixture>
  </state>
  <state pi="0.1" id="5">
    <transition probability="0.00347222" state-id="1"/>
    <transition probability="0.236111" state-id="2"/>
    <transition probability="0.00347222" state-id="3"/>
    <transition probability="0.00347222" state-id="4"/>
    <transition probability="0.770833" state-id="5"/>
    <mixture weight="1" samples="288">
      <mean>
        -1.4302 7.82436 -3.31474
      </mean>
      <variance>
        1.11265 -1.01489 -0.718436
        -1.01489 6.40667 2.33577
        -0.718436 2.33577 3.16465
      </variance>
    </mixture>
  </state>
</hmm>
8.5 Class Diagram of the System
Figure 8.3 shows the class diagram of our system.
8.6 Word Waveforms
The waveform for the word paanch was shown in Chapter 2; the waveforms of a few more
words are shown here.
8.7 MFCC and energy feature plots
The plot of the lowest MFCC for the word paanch was shown in Chapter 3. The remaining
features, plotted against the sample time, are shown in this section.
Figure 8.3: Class Diagram of the System
Figure 8.4: Plots for word ek
Figure 8.5: Plots for word do
Figure 8.6: Second, third and fourth MFCC coefficient
Figure 8.7: Fifth, sixth and seventh MFCC coefficient
Figure 8.8: Eighth, ninth and tenth MFCC coefficient
Figure 8.9: Eleventh and Twelfth MFCC Coefficient and Normalised Energy