Speech Recognition for Hindi
M. Tech. Project Report
Submitted in partial fulfillment of the requirements
for the degree of
Master of Technology
by
Ripul Gupta
Roll No: 03305406
under the guidance of
Prof. G. Sivakumar
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai
Acknowledgement
I would like to thank Prof. G Sivakumar for his invaluable support, encouragement
and guidance, without which my MTech Project would have been an exercise in futility.
Ripul Gupta
Abstract
A speech interface to the computer is the next big step that computer science needs to take for
general users. Speech recognition will play an important role in taking technology to them.
The need is not only for a speech interface, but for a speech interface in local languages. Our
goal is to create speech recognition software that can recognise Hindi words.
This report takes a brief look at the basic building blocks of a speech recognition engine.
It discusses the implementation of the different modules: the Sound Recorder, the Feature Extractor,
and the HMM training and recognition modules are described in detail. The results of
the experiments that were conducted are also provided. The report ends with a conclusion
and future plans.
Contents
1 Introduction 1
1.1 Existing Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Speech Recognition - definition and Issues . . . . . . . . . . . . . . . . . . 2
1.3 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Design of the system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Overview of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Sound Recording and Word Detection 6
2.1 Sound Recorder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Word Detector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 The Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 The Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 Feature Extractor 11
3.1 Windowing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Temporal Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.3 Spectral Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3.1 Mel frequency cepstrum computation . . . . . . . . . . . . . . . . . 14
3.4 Feature Vector specification . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Knowledge Models 16
4.1 Acoustic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.1 Word Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.2 Phone Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.2.1 Context-Independent Phone Model . . . . . . . . . . . . . 18
4.1.2.2 Context-Dependent Phone Model . . . . . . . . . . . . . . 18
4.2 Language Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.2.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5 HMM Recognition and Training 21
5.1 HMM and Speech Recognition . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.1.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.1.2 Recognition using HMM . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.2.1 Forward Variable . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.2.2 Occurrence Probability . . . . . . . . . . . . . . . . . . . . 26
5.1.3 Training the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.3.1 Segmental K-means Algorithm . . . . . . . . . . . . . . . 27
5.1.3.2 Viterbi algorithm . . . . . . . . . . . . . . . . . . . . . . . 27
6 Experimental Results 29
6.1 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
6.2 Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7 Conclusion and Future Work 31
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
8 Appendix 34
8.1 WAV file format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
8.1.0.3 RIFF WAVE Chunk . . . . . . . . . . . . . . . . . . . . . 35
8.1.0.4 FMT SubChunk . . . . . . . . . . . . . . . . . . . . . . . 35
8.1.0.5 Data SubChunk . . . . . . . . . . . . . . . . . . . . . . . 35
8.2 Installation Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.1 Source Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.1.1 Prerequisites . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.1.2 Recorder . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2.1.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 37
8.2.1.4 HMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
8.2.2 Data Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
8.3 Matrix Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.4 HMM Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
8.5 Class Diagram of the System . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.6 Word Waveforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
8.7 MFCC and energy feature plots . . . . . . . . . . . . . . . . . . . . . . . . 41
List of Figures
1.1 Block diagram of Recognition System . . . . . . . . . . . . . . . . . . . . . 3
1.2 Block diagram of Training System . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Wave Plot for word paanch . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Energy Plot for word paanch . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Zero Crossing Plot for word paanch . . . . . . . . . . . . . . . . . . . . . 10
3.1 Block diagram of Feature Extractor . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Windowing of the speech signal . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Impulse Response of Hamming Window . . . . . . . . . . . . . . . . . . . . 13
3.4 Output of feature extractor . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.5 Plot of lowest MFCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1 Word acoustic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2 Phone acoustic model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.1 Diagrammatic Representation of HMM . . . . . . . . . . . . . . . . . . . . 21
5.2 Diagrammatic Representation of the Model . . . . . . . . . . . . . . . . . . 24
8.1 Format of a Wave file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
8.2 Hex Dump of a wave file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
8.3 Class Diagram of the System . . . . . . . . . . . . . . . . . . . . . . . . . . 42
8.4 Plots for word ek . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
8.5 Plots for word do . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
8.6 Second, third and fourth MFCC coefficient . . . . . . . . . . . . . . . . . . 45
8.7 Fifth, sixth and seventh MFCC coefficient . . . . . . . . . . . . . . . . . . 46
8.8 Eighth, ninth and tenth MFCC coefficient . . . . . . . . . . . . . . . . . . 47
8.9 Eleventh and Twelfth MFCC Coefficient and Normalised Energy . . . . . . 48
Chapter 1
Introduction
The keyboard, although a popular medium, is not very convenient, as it requires a certain
amount of skill for effective usage. A mouse, on the other hand, requires good hand-eye
co-ordination. It is also cumbersome for entering a non-trivial amount of text data and
hence requires the use of an additional medium such as a keyboard. Physically challenged people
find computers difficult to use. Partially blind people find reading from a monitor difficult.
Current computer interfaces also assume a certain level of literacy from the user. They
also expect the user to have a certain level of proficiency in English. In our country, where
the literacy level is as low as 50% in some states, these constraints have to be eliminated
if information technology is to reach the grass-roots level. A speech interface can help us
tackle these problems.
Speech synthesis and speech recognition together form a speech interface. A speech
synthesiser converts text into speech, so it can read out the textual contents of the
screen. A speech recogniser has the ability to understand spoken words and convert them
into text. We need such software to be available for Indian languages.
1.1 Existing Systems
Although some promising solutions are available for speech synthesis and recognition,
most of them are tuned to English. The acoustic and language models of these systems
are built for the English language. Most of them also require a lot of configuration before
they can be used.
There are also projects which have tried to adapt them to Hindi or other Indian languages.
[7] explains how an acoustic model can be generated using an existing acoustic model for
English.
ISIP [2] and Sphinx [1] are two of the better-known open source speech recognition
systems. [5] gives a comparison of public domain software tools for speech recognition.
Some commercial software, such as IBM's ViaVoice, is also available.
1.2 Speech Recognition - definition and Issues
Speech recognition refers to the ability to listen to spoken words (input in audio format),
identify the various sounds present in them, and recognise them as words of some known
language.
Speech recognition in the computer systems domain may then be defined as the ability of
computer systems to accept spoken words in audio format - such as WAV or raw - and then
generate their content in text format.
Speech recognition in the computer domain involves various steps, each with its own
issues. The steps required to make computers perform speech recognition are: voice
recording, word boundary detection, feature extraction, and recognition with the help of
knowledge models.
Word boundary detection is the process of identifying the start and the end of a spoken
word in the given sound signal. While analysing the sound signal, it sometimes becomes
difficult to identify the word boundary. This can be attributed to the various accents
people have, such as the duration of the pauses they leave between words while speaking.
Feature extraction refers to the conversion of the sound signal into a form suitable
for the following stages to use. It may include extracting parameters such
as the amplitude of the signal, the energy at different frequencies, etc.
Recognition involves mapping the given input (in the form of various features) to one of the
known sounds. This may involve the use of various knowledge models for precise identification
and ambiguity removal.
Knowledge models refer to models such as the phone acoustic model and language models,
which help the recognition system. To generate the knowledge models one needs to
train the system: during the training period the system is shown a set of inputs
and the outputs they should map to. This is often called supervised learning.
1.3 Problem Definition
The aim of this project is to build a speech recognition tool for the Hindi language. It
is an isolated word speech recognition tool. We have used a continuous HMM, which can
support a vector as an observation, for this purpose.
1.4 Design of the system
Visualised as a block diagram, the system has the following components: a sound
recording and word detection component, a feature extraction component, a speech
recognition component, and the acoustic and language models.
Figure 1.1: Block diagram of Recognition System
• Sound Recording and Word Detection component: This component is responsible
for taking input from the microphone and identifying the presence of words. Word
detection is done using the energy and zero crossing rate of the signal. The output of
Figure 1.2: Block diagram of Training System
this component can be a wave file or a direct feed for the feature extractor. The
component is discussed in detail in Chapter 2.
• Feature Extraction component: This component generates feature vectors for the
sound signals given to it. It generates Mel Frequency Cepstrum Coefficients and
normalised energy as the features used to uniquely identify the
given sound signal. This module is discussed in detail in Chapter 3.
• Recognition component: This is a continuous, multi-dimensional Hidden Markov
Model based component. It is the most important component of the system and
is responsible for finding the best match in the knowledge base for the incoming
feature vectors. This component is discussed in Chapter 5.
• Knowledge Model: This component consists of a word-based acoustic model. The
acoustic model is a representation of how a word sounds; the recognition system
makes use of it while recognising the sound signal.
The basic flow, once training is done, can be summarised as follows: the sound input is
taken from the sound recorder and fed to the feature extraction module. The feature
extraction module generates feature vectors, which are then forwarded to the
recognition component. The recognition component, with the help of the knowledge model,
comes up with the result.
During training, the above flow differs after the generation of the feature vectors: the
system takes the output of the feature extraction module and feeds it to the recognition
system to modify the knowledge base.
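The two flows can be sketched as below. All of the callables (`detect_words`, `extract_features`, `recognise`, `train`) are hypothetical stand-ins for the components described above, not the project's actual function names.

```python
def run_pipeline(audio, detect_words, extract_features, recognise, train=None):
    """Recognition flow from Section 1.4; passing `train` switches to the training flow."""
    results = []
    for segment in detect_words(audio):        # word boundary detection
        feats = extract_features(segment)      # feature vectors per word
        if train is not None:
            train(feats)                       # training: update the knowledge base
        else:
            results.append(recognise(feats))   # recognition: best match in the models
    return results
```

The only difference between the two modes is what consumes the feature vectors, which mirrors the description of the training flow above.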
1.5 Overview of the Report
The contents of the chapters are as follows.
• Chapter 2 discusses the recording of sound.
• Chapter 3 explains how feature extraction is achieved.
• Chapter 4 explains what an acoustic model is and how it is represented using an HMM.
• Chapter 5 explains how the implementation for training and recognition was done
for HMM.
• Chapter 7 concludes the report with a summary of the work done and the next
proposed steps.
• Chapter 8 is the appendix, which gives some additional information that might be of
interest.
Chapter 2
Sound Recording and Word
Detection
This component's responsibility is to accept input from a microphone and forward it to the
feature extraction module. Besides converting the signal into a suitable or desired form, it
also performs the important task of identifying the segments of the sound containing words. It
also provides for saving the sound into WAV files, which are needed by the training
component.
2.1 Sound Recorder
2.1.1 Features
The recorder takes input from the microphone and saves it or forwards it, depending on
which function is invoked. The recorder supports changing the sampling rate, the number
of channels and the sample size. The default sampling rate of the recorder is 44,100 samples
per second, at a size of 16 bits per sample, dual channel.
2.1.2 Design
Internally, it is the job of the Sound Reader class to take the input from the user. The Sound
Reader class takes the sampling rate, sample size and number of channels as parameters.
Sound Reader has three basic functions: open, close and read. The open function opens the
/dev/dsp device in read mode and makes the appropriate ioctl calls to set the device
parameters. The close function releases the dsp device. The read function reads from the dsp
device, checks whether a valid sound is present, and returns the sound content. The source
code of Yet Another Recorder (yarec) was referred to while writing the sound recorder.
The Recorder class takes care of converting the raw audio signal into the WAV format and
storing it in a file. The format of a WAV file is given in the appendix for reference.
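The raw-PCM-to-WAV conversion performed by the Recorder class can be sketched with Python's standard `wave` module; this is only an illustration of the container format, not the project's C++ code, which writes the header itself.

```python
import io
import wave

def to_wav_bytes(raw_pcm, rate=44100, channels=2, sampwidth=2):
    """Wrap raw little-endian PCM data in a RIFF/WAVE container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)   # dual channel by default
        w.setsampwidth(sampwidth)  # 16 bits per sample
        w.setframerate(rate)       # 44,100 samples per second
        w.writeframes(raw_pcm)
    return buf.getvalue()
```

The resulting bytes start with the RIFF/WAVE header described in the appendix, followed by the fmt and data sub-chunks.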
2.2 Word Detector
2.2.1 The Principle
In speech recognition it is important to detect when a word is spoken. The system detects
the regions of silence; anything other than silence is considered a spoken word by
the system. The system uses the energy pattern present in the sound signal and the zero
crossing rate to detect the silent regions. Using both is important, as energy alone tends
to miss some parts of sounds which are important. This technique is described in
[9].
2.2.2 The Method
For word detection a sample is taken every 10 milliseconds. The energy and zero crossing
rate for this duration are calculated. The energy is calculated by adding the squares of the
waveform values at each instant and dividing by the number of instants over the period
of the sample. The zero crossing rate is the number of times the value of the wave goes from
negative to positive or vice versa.
The Word Detector assumes that the first 100 milliseconds are silence. It uses the average
energy and average zero crossing rate obtained during this time to characterise the
background noise. The upper thresholds for energy and zero crossing are set to 2 times the
average values for the background noise. The lower thresholds are set to 0.75 times the upper
thresholds.
While detecting the presence of a word in the sound, if the energy or the zero crossing rate goes
above the upper threshold and stays above it for three consecutive samples, a word is assumed
to be present and recording starts. Recording continues until both the energy and the zero
crossing rate fall below the lower thresholds and stay there for at least 30 milliseconds.
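The frame measurements and threshold rule described above can be sketched as follows; the framing and the exact bookkeeping in the actual recorder may differ.

```python
def frame_energy(samples):
    """Average squared amplitude over one 10 ms frame."""
    return sum(s * s for s in samples) / len(samples)

def zero_crossing_rate(samples):
    """Count sign changes between consecutive samples (zero treated as positive)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))

def thresholds(background_frames, measure):
    """Upper threshold = 2x the average background value; lower = 0.75x the upper."""
    avg = sum(measure(f) for f in background_frames) / len(background_frames)
    upper = 2.0 * avg
    return upper, 0.75 * upper
```

`thresholds` would be called once on the frames from the first 100 ms, separately for energy and for the zero crossing rate.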
Figure 2.1: Wave Plot for word paanch
2.2.3 Example
Figure 2.1 shows the waveform for the word paanch, sampled at 44,100 samples per second.
Figure 2.2 shows the energy pattern for the same word, and Figure 2.3 the zero crossing
rate, both calculated every 10 milliseconds. The example shows the importance of the zero
crossing plot: looking at the energy plot alone, we would assume that the sound
began at the 10th reading, when actually the word begins at the start. This is because of
the sound 'p' present at the start of the word; the actual start can be detected
by zero crossing rate analysis.
2.2.4 Results
Seven people recorded the numbers one to ten in Hindi. Table 2.1 shows how the
word detector program performed.
Figure 2.2: Energy Plot for word paanch
Number of people: 7
Number of words per speaker: 10
Number of non words detected as words: 8
Number of words broken into two parts: 3
Table 2.1: Word Detector Performance
One of the above recordings was made with a constant hum as background
noise; the system was found to be immune to it.
Figure 2.3: Zero Crossing Plot for word paanch
Chapter 3
Feature Extractor
Humans have the capacity to identify different types of sounds (phones). Phones put in
a particular order constitute a word. If we want a machine to identify a spoken word, it
will have to differentiate between different kinds of sounds the way humans perceive them.
The point to be noted in the case of humans is that, although one word spoken by different
people produces different sound waves, humans are able to identify those sound waves as
the same word. On the other hand, two sounds which are different are perceived as different by
humans. The reason is that even when the same phones or sounds are produced by different
speakers, they have common features. A good feature extractor should extract these
features and use them for further analysis and processing.
Figure 3.1 shows a block diagram of a typical feature extraction module.
Figure 3.1: Block diagram of Feature Extractor
3.1 Windowing
Features get periodically extracted. The time for which the signal is considered for pro-
cessing is called a window and the data acquired in a window is called as a frame. Typically
features are extracted once every 10ms, which is called as frame rate. The window dura-
tion is typically 25ms. Thus two consecutive frames have overlapping areas. Figure 3.2
shows two frames being extracted and also the overlapping zone.
Figure 3.2: Windowing of the speech signal
Different types of windows are in use:
• Rectangular window
• Bartlett window
• Hamming window
Of these, the most widely used is the Hamming window; the system uses it, as it
introduces the least amount of distortion. The impulse response of the Hamming window
is a raised cosine and is shown in Figure 3.3. The transfer function of the Hamming
window is

w(n) = 0.54 + 0.46 cos(nπ/m) (3.1)
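Equation 3.1 can be written as a short window function. This is a sketch; the symmetric index range n = -m..m is an assumption, since the report does not state the range of n.

```python
import math

def hamming_window(m):
    """Raised-cosine window w(n) = 0.54 + 0.46*cos(pi*n/m) for n in -m..m (Eq. 3.1)."""
    return [0.54 + 0.46 * math.cos(math.pi * n / m) for n in range(-m, m + 1)]
```

The window peaks at 1.0 in the centre and falls to 0.08 at the edges, which is what tapers each frame before spectral analysis.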
Features are then extracted from each frame. Most features fall into one of two
categories:
12
Figure 3.3: Impulse Response of Hamming Window
• Temporal features
– Energy normalisation
– Zero crossing rate
• Spectral features
– Power spectral analysis (FFT)
– Linear predictive analysis (LPC)
– Mel scale cepstral analysis (MEL)
– First order derivative (DELTA)
3.2 Temporal Features
Temporal features are easy to extract, simple and have easy physical interpretation. Tem-
poral features like average energy level, zero-crossing rate, root mean square, maximum
amplitude, etc can be extracted out as features.
13
3.3 Spectral Analysis
Spectral analysis gives us quite a lot of information about the spoken phone. Time
domain data is converted to Frequency domain by applying Fourier transform on it. This
process gives us the spectral information. Spectral information is the energy levels at
different frequencies in a given window. Thus features like frequency with maximum
energy, distance between frequencies of maximum and minimum energies, etc can be
extracted.
3.3.1 Mel frequency cepstrum computation
Mel frequency cepstrum computation(MFCC) is considered to be the best available ap-
proximation of human ear. It is known that human ear are more sensitive to higher
frequency. The spectral information can then be converted to MFCC by passing the sig-
nals through band pass filters where higher frequencies are artificially boosted, and then
doing a inverse Digital Fourier Transform(DFT) on it. This results in higher frequencies
being more prominent.
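The mel scale underlying these filters is not given in the report; for reference, a commonly used mapping from frequency in hertz to mels is the following (an addition here, not taken from the report):

```python
import math

def hz_to_mel(f):
    """Commonly used mel-scale mapping: mel(f) = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)
```

The mapping is roughly linear at low frequencies and logarithmic at high ones, so equal mel intervals correspond to increasingly wide frequency bands.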
The Feature Extraction module is capable of producing different kinds of features from
the sound input. The possible features that can be extracted are energy, MFCCs, their
derivative coefficients and their second order derivative coefficients. The source code for this
module has been taken from the Internet-Accessible Speech Recognition Technology project [2].
All the generated features are output one frame per line, so each line of the
output contains the feature vector for one frame. Figure 3.4 shows an example of the output of
the feature extractor. A feature vector is simply a list of numbers describing the
properties of the sound in a frame.
Figure 3.4: Output of feature extractor
3.4 Feature Vector specification
Vectors were generated at a frame interval of 10 milliseconds, using a Hamming
window with a duration of 25 milliseconds. Twelve MFCCs and an energy level are generated
for each frame. These features can now be used either for recognition or for training the
HMM.
Figure 3.5: Plot of lowest MFCC
Figure 3.5 is a plot of the lowest-frequency MFCC. It can be seen that this plot
corresponds closely with the energy curve.
Chapter 4
Knowledge Models
For speech recognition, the system needs to know how the words sound. For this we need
to train the system. During training, using the data given by the user, the system
generates an acoustic model and a language model. These models are later used by the system
to map a sound to a word or a phrase.
4.1 Acoustic Model
The features extracted by the Feature Extraction module need to be compared against
a model to identify the sound that was produced as the word that was spoken. This model
is called the Acoustic Model.
There are two kinds of Acoustic Models:
• Word Model
• Phone Model
4.1.1 Word Model
Word models are generally used to small vocabulary systems. In this model the words
are modelled as whole. Thus each word needs to be modelled separately. If we need to
add support to recognise a new word, we will have to train the system for the word. In
the recognition process, the sound is matched against each of the model to find the best
Figure 4.1: Word acoustic model
match. This best match is assumed to be the spoken word. Building a model for a word
requires us to collect sound recordings of the word from various users. These recordings are
then used to train an HMM. Figure 4.1 shows a diagrammatic representation of
the word-based acoustic model.
4.1.2 Phone Model
In the phone model, instead of modelling the whole word, we model only parts of words,
generally phones, and the word itself is modelled as a sequence of phones. The heard sound
is matched against the parts, the parts are recognised, and the recognised parts are
put together to form a word. For example, the word ek is generated by the combination of
two phones, A and k. This is generally useful when we need a large-vocabulary system.
Adding a new word to the vocabulary is easy: since the sounds of the phones are already
known, only the possible phone sequences for the word, with their probabilities, need to be
added to the system. Figure 4.2 shows a diagrammatic representation of the phone-based
acoustic model.
Phone models can be further classified into:
Figure 4.2: Phone acoustic model
• Context-Independent Phone Model
• Context-Dependent Phone Model
4.1.2.1 Context-Independent Phone Model
In this model individual phones are modelled. The context that they occur is not modelled.
The good thing about this model is that the number of phone that have to be modelled
is small. Thus the complexity of the system is less.
4.1.2.2 Context-Dependent Phone Model
While modelling a phone, its neighbours are also considered. That means iy surrounded
by z and r is a separate entity from iy surrounded by h and r. This results in a growth
in the number of modelled phones, which increases the complexity. [6] shows that high
phone recognition accuracies can be obtained using context-dependent phone models.
In both the word acoustic model and the phone acoustic model we need to model silence and
filler words too. Filler words are the sounds that humans produce between two words.
Both these models can be implemented using either a Hidden Markov Model or a
Neural Network; the HMM is the more widely used technique in automatic speech recognition
systems.
4.2 Language Model
Although there are words that contain similar-sounding phones, humans generally do not
find it difficult to recognise them. This is mainly because they know the context,
and also have a fairly good idea of what words or phrases can occur in that context.
Providing this context to a speech recognition system is the purpose of the language model.
The language model specifies which words are valid in the language and in what
sequences they can occur.
4.2.1 Classification
Language Models can be classified into several categories:
Uniform Models Each word has equal probability of occurrence.
Stochastic Models Probability of occurrence of a word depends on the words preceding
it.
Finite State Languages Language uses a finite state network to define allowed word
sequences.
Context Free Grammar A context free grammar can be used to encode which kinds of
sentences are allowed.
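A stochastic model of the kind listed above can be sketched as a bigram table. This is a sketch, and the words and probabilities in the test below are illustrative, not taken from the project's data.

```python
def sentence_probability(words, bigram, start="<s>"):
    """P(w1..wn) as a product of bigram probabilities P(w_i | w_{i-1})."""
    p, prev = 1.0, start
    for w in words:
        p *= bigram.get((prev, w), 0.0)  # unseen word pairs get probability 0
        prev = w
    return p
```

A uniform model is the special case where every entry in the table has the same probability.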
4.3 Implementation
We have implemented a word acoustic model. The system has a model for each word that
it can recognise. The list of words can be considered the language model. Section
8.4 shows a sample model.
While recognising, the system needs to know where to locate the model for each word
and what word each model corresponds to. This information is stored in a flat file called
models in a directory called hmms.
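Loading such a lookup file can be sketched as below. The per-line layout (a word followed by its model file) is an assumption, since the report does not specify the format of the models file.

```python
def load_model_index(text):
    """Map each word to its HMM model file, assuming one 'word model-file' pair per line."""
    index = {}
    for line in text.splitlines():
        if line.strip():
            word, model_file = line.split()
            index[word] = model_file
    return index
```

The recogniser would then load each listed model file into memory before scoring begins.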
When a sound is given to the system to recognise, it compares each model with the
sound and finds the model that most closely matches it. The word corresponding
to that HMM model is given as the output. Details about the HMM models and their
training and recognition are given in the next chapter.
Chapter 5
HMM Recognition and Training
A Hidden Markov Model (HMM) is a state machine. The states of the model are represented
as nodes and the transitions as edges. The difference in the case of an HMM is
that a symbol does not uniquely identify a state: the new state is determined by the
symbol and the transition probabilities from the current state to the candidate states. [4] is
a tutorial on HMMs which shows how they can be used.
Figure 5.1 shows a diagrammatic representation of an HMM. The nodes drawn as circles
are states; O1 to O5 are observations. Observation O1 takes us to state S1. aij denotes
the transition probability between Si and Sj. It can be observed that the states also have
self transitions. If we are in state S1 and observation O2 is observed, we can either
go to state S2 or stay in state S1. The decision is made depending on the probability
of the observation at each of the two states and the transition probabilities.
Figure 5.1: Diagrammatic Representation of HMM
Thus an HMM is defined as:

λ = (Q, O, A, B, π) (5.1)

where
Q = {qi} is the set of all possible states,
O = {vi} is the set of all possible observations,
A = {aij}, where aij = P(Xt+1 = qj | Xt = qi), gives the transition probabilities,
B = {bi}, where bi(k) = P(Ot = vk | Xt = qi), gives the observation probabilities (of observation k at state i),
π = {πi}, where πi = P(X0 = qi), gives the initial state probabilities.
Xt denotes the state at time t, and Ot denotes the observation at time t.
5.1 HMM and Speech Recognition
HMMs can be classified by various criteria:
• Values of observations
– Discrete
– Continuous
• Dimension
– One Dimensional
– Multi Dimensional
• Probability density function
– Continuous density (Gaussian distribution) based
– Discrete density (Vector quantisation) based
When using an HMM for recognition, we provide the observations to the model and it
returns a number: the probability with which the model could have
produced that output (the observations). In speech recognition the observations are feature
vectors rather than just symbols, so each observation is a group of real
numbers. Thus, what we need for speech recognition is a continuous, multi-dimensional
HMM.
5.1.1 Implementation
Several HMM libraries were looked at:
• HTK: the HMM Tool Kit is a mature HMM implementation, but its license
does not allow redistribution of code.
• A C++ implementation of HMMs by Prof. Dekang Lin: the problem with this
implementation was that it is a discrete HMM implementation.
• GHMM: GHMM is an open source library for HMMs. It supports both discrete and
continuous HMMs, but it does not have support for more than one dimension.
A continuous HMM library, which supports vectors as observations, has therefore been
implemented in the project. The library uses the Gaussian probability distribution function.
Section 8.4 shows an XML file containing a specification of an HMM; the sample has five
states with a vector size of three.
The root tag in the HMM file is hmm, which indicates that the file contains an HMM
model. The tag has two attributes, states and dimension, indicating the number of states
and the vector size of an observation respectively. Each state lists its outgoing edges
with their probabilities; these are stored as transition tags inside the state. Each
transition tag carries the target state id and the probability of the transition. A
state also has one or more mixtures. A mixture consists of a mean vector and a variance
matrix, which are used to calculate the probability of an occurrence. The way of
calculating the probability is discussed in Subsection 5.1.2.2.
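The model file maps naturally onto a small in-memory structure. The following sketch mirrors the XML layout described above; all type and field names here are illustrative, not the project's actual class names:

```cpp
#include <vector>

// Illustrative in-memory mirror of the XML model described above.
// Names are hypothetical; the project's real classes may differ.
struct Transition {
    int targetState;      // id of the destination state
    double probability;   // probability of taking this edge
};

struct Mixture {
    std::vector<double> mean;                   // one value per dimension
    std::vector<std::vector<double>> variance;  // D x D covariance matrix
};

struct State {
    double pi;                            // initial-state probability
    std::vector<Transition> transitions;  // outgoing edges
    std::vector<Mixture> mixtures;        // one or more Gaussian mixtures
};

struct Hmm {
    int numStates;   // the "states" attribute
    int dimension;   // the "dimension" attribute (size of one observation)
    std::vector<State> states;
};
```

Loading the XML with xerces would then amount to filling such a structure, one State per state tag.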
Figure 5.2 shows a diagrammatic representation of the model. Circles are states and
the arrows indicate transitions; each transition has its probability. In the diagram it
can be seen that states also have self transitions. Note that not all transitions are
shown in the figure. Each state has a mean and a variance associated with it. The mean
is a vector of N real numbers, where N is the size of the observation, while the
variance is a matrix of size N ∗ N.

Figure 5.2: Diagrammatic Representation of the Model
5.1.2 Recognition using HMM
We need to recognise a word using the existing models of words that we have. Sound
recorder need to record the sound when it detects the presence of a word. This recorded
sound is then passed through feature vector extractor model. The output of the above
module is a list of features taken every 10 msec. This features are then passed to the
Recognition module for recognition.
The list of all the words that the system is trained for and their corresponding models
are given in a file called models present in the hmms. All models corresponding of the
words are then loaded in memory. The feature vectors generated by the feature vector
generator module act as the list of observation for the recognition module.
Probability of generation of the observation given a model, P (O|λ), is calculated for
each of the model using find probability function. The word corresponding to the HMM,
that gives the probability that is highest and is above the threshold, is considered to be
spoken.
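The selection step above can be sketched as follows. The scores are assumed to have been computed already by the find probability function; ScoredModel and recognise_word are illustrative names, not the project's:

```cpp
#include <string>
#include <vector>

// Sketch of the decision rule described above: among all word models whose
// P(O|lambda) exceeds the threshold, report the word with the highest score.
struct ScoredModel {
    std::string word;
    double probability;  // P(O | lambda) for this word's HMM
};

std::string recognise_word(const std::vector<ScoredModel>& scored,
                           double threshold) {
    std::string best;            // empty string means "no word recognised"
    double bestProb = threshold; // a score must beat the threshold to count
    for (const ScoredModel& m : scored) {
        if (m.probability > bestProb) {
            bestProb = m.probability;
            best = m.word;
        }
    }
    return best;
}
```

The threshold rejects utterances that none of the trained models explains well.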
5.1.2.1 Forward Variable
The forward variable is used to find the probability of a list of occurrences given an
HMM. For a model λ with N states, the probability of the observation sequence, P(O|λ),
is defined in terms of the forward variable α as

P(O|λ) = Σ_{i=1}^{N} α_T(i)    (5.2)

where α_{t+1}(j) is defined recursively as

α_{t+1}(j) = [ Σ_{i=1}^{N} α_t(i) a_{ij} ] b_j(O_{t+1})    (5.3)

with α_1(i) = π_i b_i(O_1).
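The recursion can be sketched directly from Equations 5.2 and 5.3. In this hypothetical fragment the observation probabilities b_j(O_t) are assumed to be precomputed for every frame, and a[i][j] holds the transition probabilities:

```cpp
#include <vector>

// Sketch of the forward recursion (Eqs. 5.2-5.3). a[i][j] is the transition
// probability, pi[i] the initial probability, and b[t][j] = b_j(O_t) the
// observation probability, assumed precomputed for each frame t.
double forward_probability(const std::vector<std::vector<double>>& a,
                           const std::vector<double>& pi,
                           const std::vector<std::vector<double>>& b) {
    int N = (int)pi.size(), T = (int)b.size();
    std::vector<double> alpha(N);
    for (int i = 0; i < N; ++i) alpha[i] = pi[i] * b[0][i];  // alpha_1(i)
    for (int t = 1; t < T; ++t) {
        std::vector<double> next(N, 0.0);
        for (int j = 0; j < N; ++j) {
            double sum = 0.0;
            for (int i = 0; i < N; ++i) sum += alpha[i] * a[i][j];
            next[j] = sum * b[t][j];                         // Eq. 5.3
        }
        alpha = next;
    }
    double p = 0.0;
    for (int i = 0; i < N; ++i) p += alpha[i];               // Eq. 5.2
    return p;
}
```

In practice the alphas underflow for long utterances, so implementations usually scale them per frame or work in log space; that detail is omitted here.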
5.1.2.2 Occurrence Probability
For the forward variable to work we need b_i(O_t), the probability of a given occurrence
in a particular state. This value can be calculated with the multivariate normal
distribution formula. The probability of observation O_t occurring in state i is given by:

b_i(O_t) = (1 / ((2π)^{D/2} |V_i|^{1/2})) exp(−(1/2) (O_t − µ_i)^T V_i^{−1} (O_t − µ_i))    (5.4)

where
D is the dimension of the vector,
µ_i is the mean vector,
V_i is the covariance matrix,
|V_i| is the determinant of V_i, and
V_i^{−1} is the inverse of V_i.
The mean vector µ_i is obtained by:

µ_i = (1/N) Σ_{O_t ∈ i} O_t    (5.5)

The covariance matrix V_i can be obtained by:

V_i = (1/N) Σ_{O_t ∈ i} (O_t − µ_i)^T (O_t − µ_i)    (5.6)

where N is the number of observations assigned to state i. In Equation 5.6 the variance
is calculated by finding the distance vector between an observation and the mean; the
transpose of the distance vector is multiplied with the distance vector itself, giving a
D × D matrix, where D is the dimension of the feature vector.
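As an illustration of Equation 5.4, the following sketch evaluates the density for the special case of a diagonal covariance matrix, where |V| is the product of the diagonal entries and V^{-1} inverts them element-wise. The project itself handles full covariance matrices through its matrix library; this simplification only shows the shape of the formula:

```cpp
#include <cmath>
#include <vector>

// Eq. 5.4 for a diagonal covariance matrix: the determinant is the product
// of the diagonal entries and the quadratic form reduces to a weighted sum.
double gaussian_density(const std::vector<double>& x,
                        const std::vector<double>& mean,
                        const std::vector<double>& var) {  // diagonal of V
    const double PI = 3.14159265358979323846;
    int D = (int)x.size();
    double det = 1.0, quad = 0.0;
    for (int d = 0; d < D; ++d) {
        det *= var[d];
        double diff = x[d] - mean[d];
        quad += diff * diff / var[d];  // (O - mu)^T V^{-1} (O - mu)
    }
    return std::exp(-0.5 * quad) /
           (std::pow(2.0 * PI, D / 2.0) * std::sqrt(det));
}
```

For D = 1, mean 0 and variance 1, this reduces to the standard normal density 1/√(2π) at the mean.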
If the model is fed insufficient data, the covariance matrix that gets created has a
determinant very close to zero, which makes it impossible to compute the inverse matrix
needed for the observation probability. We faced this very problem.

The above methods created the need for a matrix library, which was built as part of the
system. Details about the matrix library are in the Appendix.
5.1.3 Training the Model
Before we can recognise a word we need to train the system. The Train command is used to
train the system for a new word. The command takes at least three parameters:
• The number of states, N, that the HMM model should have.
• The size of the feature vector, D.
• One or more filenames, each containing a training set.
To generate an initial HMM we take N equally spaced observations (feature vectors) from
the first training set; each one is used to train a separate state. After this, each
state has a mean vector of size D and a variance matrix of size D ∗ D containing all
zeros. Then, for each of the remaining observations, we find the Euclidean distance
between it and the mean vectors of the states and assign the observation to the closest
state for training. The states assigned to consecutive observations are tracked to find
the transition probabilities.

The mean and variance of the states are calculated as shown in Equations 5.5 and 5.6.
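The assignment step above can be sketched as a nearest-mean search. This is illustrative code, not the project's actual implementation:

```cpp
#include <cstddef>
#include <vector>

// Assign an observation to the state whose mean vector is closest in
// Euclidean distance, as in the initialisation step described above.
int closest_state(const std::vector<double>& obs,
                  const std::vector<std::vector<double>>& stateMeans) {
    int best = 0;
    double bestDist = 1e300;
    for (int s = 0; s < (int)stateMeans.size(); ++s) {
        double d = 0.0;
        for (std::size_t k = 0; k < obs.size(); ++k) {
            double diff = obs[k] - stateMeans[s][k];
            d += diff * diff;  // squared distance suffices for comparison
        }
        if (d < bestDist) { bestDist = d; best = s; }
    }
    return best;
}
```

Comparing squared distances avoids the square root without changing which state wins.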
5.1.3.1 Segmental K-means Algorithm
This algorithm modifies the initial model so as to maximise P(O, I|λ), where O is the
set of training observations and I is a state sequence in the given HMM. The optimal
path for a training set is denoted by I*. Observations that were assigned to a state
different from the one given by the optimal path are moved to that state; this improves
P(O, I*|λ). The model is then re-estimated with the changed assignments of observations.

We repeat this process iteratively until no more reassignments are needed. The
calculation of the means, variances, and transition probabilities is done as shown
before.
5.1.3.2 Viterbi algorithm
The Viterbi algorithm is described in [8]. It identifies the best path that a signal can
take through an HMM. Finding the best path is a search problem; Viterbi uses dynamic
programming to reduce the search space. For the first observation, we find the
probability of each state being the start state by taking the product of its initial
probability and its observation probability. For every subsequent observation, each
state finds a predecessor such that the probability of the predecessor multiplied by the
transition probability from the predecessor to itself is maximised.
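The recursion just described can be sketched as follows. As in the forward-variable sketch, the observation probabilities b[t][j] are assumed to be precomputed; the names are illustrative:

```cpp
#include <vector>

// Viterbi sketch: delta[t][j] is the probability of the best path ending in
// state j at time t; back[t][j] remembers the predecessor that achieved it.
double viterbi_best_path(const std::vector<std::vector<double>>& a,
                         const std::vector<double>& pi,
                         const std::vector<std::vector<double>>& b,
                         std::vector<int>& path) {
    int N = (int)pi.size(), T = (int)b.size();
    std::vector<std::vector<double>> delta(T, std::vector<double>(N, 0.0));
    std::vector<std::vector<int>> back(T, std::vector<int>(N, 0));
    for (int j = 0; j < N; ++j) delta[0][j] = pi[j] * b[0][j];
    for (int t = 1; t < T; ++t) {
        for (int j = 0; j < N; ++j) {
            double best = -1.0;
            int arg = 0;
            for (int i = 0; i < N; ++i) {  // pick the best predecessor
                double v = delta[t - 1][i] * a[i][j];
                if (v > best) { best = v; arg = i; }
            }
            delta[t][j] = best * b[t][j];
            back[t][j] = arg;
        }
    }
    int last = 0;  // best final state
    for (int j = 1; j < N; ++j)
        if (delta[T - 1][j] > delta[T - 1][last]) last = j;
    path.assign(T, 0);
    path[T - 1] = last;
    for (int t = T - 1; t > 0; --t) path[t - 1] = back[t][path[t]];
    return delta[T - 1][last];
}
```

The backtracking pass at the end recovers the optimal state sequence I* used by the segmental K-means step.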
Chapter 6
Experimental Results
6.1 Training
Ten HMM models, one for each digit in Hindi, were made. For this we used about 60 to 90
recorded sounds for each digit. Voices of four people were used to train the system.
Each model took between seven and fourteen iterations to converge.
6.2 Recognition
Recognition was tried on three kinds of sounds:
• Seen sound: the sound files used to train the models.
• Unseen sound, seen user: unused sound files of a user whose other sound files were
used for training.
• Unseen user: a user whose voice was not used for training.
The results of the experiment were as follows:

Type of sound           | No of sounds | Correct Recognition | Wrong Recognition
Seen Sound              | 185          | 154                 | 22
Unseen sound, seen user | 25           | 11                  | 0
Unseen user             | 25           | 15                  | 0

Table 6.1: Recognition Result
Chapter 7
Conclusion and Future Work
7.1 Conclusion
A sound recorder with word detection was implemented; it did well in identifying the
presence of sound on the microphone. An HMM library was built, which was useful for
recognition of speech and also for training a word-based acoustic model.

Word models for the digits in Hindi were generated. Each model was trained using about
sixty to ninety sound files collected from four people.

The trained models were used to recognise other words. The recogniser gave good results
when tested on the sounds used for training the models. For other sounds too, the
results were satisfactory.
7.2 Future Work
We have used a word-based acoustic model, which can serve only a limited vocabulary; we
would have to move towards a phone-based acoustic model. The lack of a good
public-domain acoustic model for Indian languages needs to be addressed.

There are a few public-domain speech recognition systems available, but they have their
drawbacks. For example, there is a speech recognition program called ocvolume; the
problem with ocvolume is that it does not use HMMs but vector quantisation, so it will
not scale up to many words. Similarly, there are helper programs and libraries
available; for example, GHMM is an open-source HMM library, but it does not support
multi-dimensional HMMs.

Thus, as future work, it would be useful to improve these libraries and programs so that
they can be integrated together and become more useful.
References
[1] CMU Sphinx - open source speech recognition engines.
[2] Internet-accessible speech recognition technology. http://www.cavs.msstate.edu/
hse/ies/projects/speech/index.html.
[3] Brent M. Dingle. Calculating determinants of symbolic and numeric matrices.
Technical report, Texas A&M University, 2005.
[4] Rakesh Dugad and U. B. Desai. A tutorial on hidden Markov models. http://uirvli.
ai.uiuc.edu/dugad/hmm_tut.html.
[5] Samudravijaya K and Maria Barot. A comparison of public domain software tools for
speech recognition. In Workshop on Spoken Language Processing, pages 125–131, 2003.
[6] L. Lamel and J. Gauvain. High performance speaker-independent phone recognition
using CDHMM. In EUROSPEECH-93.
[7] M. Kumar, N. Rajput, and A. Verma. A large-vocabulary continuous speech recognition
system for Hindi. IBM Journal of Research and Development.
[8] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in
speech recognition. Proceedings of the IEEE, pages 257–286, 1989.
[9] L. R. Rabiner and M. R. Sambur. An algorithm for determining the endpoints of
isolated utterances. Bell System Technical Journal, vol. 54, pages 297–315, 1975.
Chapter 8
Appendix
8.1 WAV file format
A WAV file stores data in little-endian format and is written in the Resource
Interchange File Format (RIFF). In RIFF, a file is divided into chunks; each chunk has
headers which give information about the data that follows. A WAV file requires at
least two chunks: the Format chunk and the Data chunk. Figure 8.1 shows a graphical
representation of a minimal WAV file.
Figure 8.1: Format of a Wave file
8.1.0.3 RIFF WAVE Chunk
The RIFF WAVE chunk contains nothing but the headers of the WAV file. This chunk has
three headers: the first is the string “RIFF”, indicating that the file follows the RIFF
format; the second is an integer specifying the size of the content that follows; the
third is the string “WAVE”, identifying the file type.
8.1.0.4 FMT SubChunk
The FMT subchunk describes the format in which the sound information is stored. Like the
RIFF chunk, this chunk has a SubChunkID, which is “fmt ”. It also carries information
such as the sampling rate, bits per sample, number of channels, and audio format. The
size of this header is minimally 24 bytes and can be followed by extra headers.
8.1.0.5 Data SubChunk
The Data subchunk has only two headers, ID and Size, followed by the actual sound data.

Figure 8.2 shows part of a hex dump of a wave file.

Figure 8.2: Hex Dump of a wave file
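The chunk layout above can be read directly from a byte buffer. The following sketch assumes the canonical minimal 44-byte header (RIFF, then “fmt ” at offset 12) and skips the extra-header and chunk-scanning cases a robust reader would handle:

```cpp
#include <cstdint>
#include <cstring>

// Little-endian field readers: the WAV format stores multi-byte integers
// least-significant byte first, regardless of the host's endianness.
uint32_t read_u32le(const uint8_t* p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
uint16_t read_u16le(const uint8_t* p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

struct WavInfo {
    uint16_t channels;
    uint32_t sampleRate;
    uint16_t bitsPerSample;
    bool valid;
};

// Parse the three headers described above from a minimal 44-byte layout.
WavInfo parse_wav_header(const uint8_t* buf, size_t len) {
    WavInfo w{0, 0, 0, false};
    if (len < 44) return w;
    if (std::memcmp(buf, "RIFF", 4) != 0 ||       // RIFF WAVE chunk
        std::memcmp(buf + 8, "WAVE", 4) != 0 ||
        std::memcmp(buf + 12, "fmt ", 4) != 0)    // FMT subchunk id
        return w;
    w.channels      = read_u16le(buf + 22);
    w.sampleRate    = read_u32le(buf + 24);
    w.bitsPerSample = read_u16le(buf + 34);
    w.valid = true;
    return w;
}
```

A full reader would instead walk the chunks by their size fields, since the fmt chunk may carry extra headers before the data chunk.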
8.2 Installation Instructions
The source code of the program and the data files can be downloaded from
http://anakin.ncst.ernet.in/ ripul/speech/ .
8.2.1 Source Code
Source code is present in three directories: recorder, dsp and hmm.
8.2.1.1 Prerequisites
The following things should be present on the system for compiling and using the system:
• g++ compiler, version 3.3
• xerces library for C++
• development files of xerces library.
8.2.1.2 Recorder
The recorder directory contains the source files needed to record sound from the
microphone. This part of the code builds two executables.

Recorder is used to record sound to a WAVE file. The command takes the destination
filename as a parameter. When the program is started, sound is recorded to the file
until the program receives a termination signal, sent by pressing Control-C (SIGINT).

RawRecorder can be used to record sound from the microphone and store it as raw files.
The command listens to the sound coming from the microphone and checks for the presence
of a word; if a word is present, it is recorded to a file. The output of the command is
files named 1.wav, 2.wav and so on, so if RawRecorder detects ten words, there will be
ten wav files. The recorded sound has a sample size of 16 bits, a sampling rate of
44100 Hz, and a single channel.
8.2.1.3 Feature Extraction
This module is used to extract features from a given raw file. The command takes many
options, which indicate what features need to be extracted. The executable file is named
extract feature. The help file for the command is present in the directory.
8.2.1.4 HMM
The hmm directory contains the implementation of HMM library. The source code of the
library generates two executables Train and Recognise.
Training the system for a new word requires sound files for that word. Features for the
sound files can be extracted using the extract feature command. The Train command can
then be used to train the system. The command needs the number of states that the model
should have, the size of the feature vector, and the files to be used for training: the
first argument is a number indicating the number of states, the second argument is the
size of the vector, and these are followed by one or more files containing the training
data.

The output of the Train command is the trained HMM in XML format, which should be
written to a file and put in the hmms directory. An entry also needs to be made in the
models file present in the same directory.

For recognition we first record the sound using the RawRecorder program and then
extract the features to get an mfcc file. The Recognise command takes one or more
filenames as arguments and tries to recognise the word in each file.
8.2.2 Data Files
The hmms directory has the models for the digits in Hindi. The data directory contains
the sound files that were used to train the system; it has ten directories, one for
each digit. The prefix in a file name indicates the name of the person who recorded the
sound.
8.3 Matrix Library
The library supports basic matrix functions such as Transpose, Multiply and Add, as well
as Determinant and Inverse. The determinant was first coded with the standard textbook
algorithm, which took a lot of time. [3] explains a fast method of calculating the
determinant, proposed by Erwin H. Bareiss; this algorithm was implemented and gives
quick results.

Later it was found that the determinant of the covariance matrix was so small that it
always evaluated to zero. For this reason, the determinant function also takes a
parameter called factor: before calculating the determinant, all the numbers in the
matrix are multiplied by the factor, artificially boosting the value. Similar boosting
was done while calculating the inverse matrix.
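The factor trick rests on the identity det(c·A) = c^N · det(A) for an N × N matrix, so after boosting, the true determinant can be recovered by dividing by factor^N. A tiny 2 × 2 demonstration (illustrative only, not the library's code):

```cpp
#include <cmath>

// Determinant of a 2x2 matrix [[a b], [c d]]. Scaling every entry by a
// factor f scales the determinant by f^2, so a boosted determinant must be
// divided by f^N (here N = 2) to recover the original value.
double det2(double a, double b, double c, double d) {
    return a * d - b * c;
}
```

Boosting keeps the intermediate values away from the underflow region without changing the mathematical result.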
8.4 HMM Model
The following is the contents of a file containing the model:
<hmm states="5" dimension="3">
  <state pi="0.897059" id="1">
    <transition probability="0.890071" state-id="1"/>
    <transition probability="0.0230496" state-id="2"/>
    <transition probability="0.0921986" state-id="3"/>
    <transition probability="0.00177305" state-id="4"/>
    <transition probability="0.00177305" state-id="5"/>
    <mixture weight="1" samples="564">
      <mean>
        -0.184761 -7.98322 3.35773
      </mean>
      <variance>
        1.30115 0.0139008 -0.628016
        0.0139008 2.27187 -0.19782
        -0.628016 -0.19782 2.23247
      </variance>
    </mixture>
  </state>
  <state pi="0.1" id="2">
    <transition probability="0.0116279" state-id="1"/>
    <transition probability="0.912791" state-id="2"/>
    <transition probability="0.0872093" state-id="3"/>
    <transition probability="0.0116279" state-id="4"/>
    <transition probability="0.00581395" state-id="5"/>
    <mixture weight="1" samples="240">
      <mean>
        -3.12424 2.09704 2.6186
      </mean>
      <variance>
        3.98811 -12.055 5.85019
        -12.055 58.1744 -29.146
        5.85019 -29.146 19.9382
      </variance>
    </mixture>
  </state>
  <state pi="0.1" id="3">
    <transition probability="0.0035461" state-id="1"/>
    <transition probability="0.00177305" state-id="2"/>
    <transition probability="0.881206" state-id="3"/>
    <transition probability="0.120567" state-id="4"/>
    <transition probability="0.00177305" state-id="5"/>
    <mixture weight="1" samples="564">
      <mean>
        1.21589 -2.90561 0.717873
      </mean>
      <variance>
        0.541226 0.503624 -0.801156
        0.503624 5.81307 -2.1564
        -0.801156 -2.1564 3.89482
      </variance>
    </mixture>
  </state>
  <state pi="0.1" id="4">
    <transition probability="0.00241546" state-id="1"/>
    <transition probability="0.00483092" state-id="2"/>
    <transition probability="0.00241546" state-id="3"/>
    <transition probability="0.838164" state-id="4"/>
    <transition probability="0.164251" state-id="5"/>
    <mixture weight="1" samples="414">
      <mean>
        1.40136 8.17534 -4.76438
      </mean>
      <variance>
        1.40487 -2.42026 -2.4986
        -2.42026 10.8383 3.07776
        -2.4986 3.07776 6.94241
      </variance>
    </mixture>
  </state>
  <state pi="0.1" id="5">
    <transition probability="0.00347222" state-id="1"/>
    <transition probability="0.236111" state-id="2"/>
    <transition probability="0.00347222" state-id="3"/>
    <transition probability="0.00347222" state-id="4"/>
    <transition probability="0.770833" state-id="5"/>
    <mixture weight="1" samples="288">
      <mean>
        -1.4302 7.82436 -3.31474
      </mean>
      <variance>
        1.11265 -1.01489 -0.718436
        -1.01489 6.40667 2.33577
        -0.718436 2.33577 3.16465
      </variance>
    </mixture>
  </state>
</hmm>
8.5 Class Diagram of the System
Figure 8.3 shows the class diagram of our system.
8.6 Word Waveforms
The waveform for the word paanch was shown in Chapter 2; the waveforms of a few more
words are shown here.
8.7 MFCC and energy feature plots
The plot of the lowest MFCC for the word paanch was shown in Chapter 3. The remaining
features, plotted against the sample time, are shown in this section.
Figure 8.3: Class Diagram of the System
Figure 8.4: Plots for word ek
Figure 8.5: Plots for word do
Figure 8.6: Second, third and fourth MFCC coefficient
Figure 8.7: Fifth, sixth and seventh MFCC coefficient
Figure 8.8: Eighth, ninth and tenth MFCC coefficient
Figure 8.9: Eleventh and Twelfth MFCC Coefficient and Normalised Energy