Let us pray to the Almighty to illuminate our intellect towards the righteous path.
An Empirical Approach for Optimization of Acoustic Models in Hindi Speech Recognition Systems
R. K. Aggarwal, Dept. of Computer Engineering,
National Institute of Technology (NIT), Kurukshetra, Haryana, India.
Automatic Speech Recognition (ASR)
• The goal of ASR is to convert a speech signal into its equivalent text message, independent of the device, the speaker, or the environment.
• It is a pattern recognition problem in which features are extracted and a model is used for training and testing.
[Block diagram: speech input → preprocessing → feature extraction → model generation (training) / pattern classification (testing) → recognized words]
Statistical Approach to ASR
[Block diagram: speech sound from microphone → front end (pre-processing, feature extraction producing LPCC/MFCC parameters) → back end (recognizer with acoustic modeling and language modeling) → recognized speech]
Statistical framework of ASR
• State-of-the-art speech recognition systems use Gaussian-mixture output probability distributions in HMMs together with context-dependent phone models. To handle the large number of HMM state parameters, many similar states of the model are tied, and the data corresponding to all these states are used to train one global state. HMMs with this type of sharing were proposed in the literature under the names semi-continuous and tied-mixture HMMs.
• The main components of a statistical ASR system are feature extraction, the acoustic models (HMMs), the language model, and the hypothesis search unit. The acoustic model typically consists of two parts: the first describes how a word sequence can be represented by sub-word units, and the second maps each sub-word unit to acoustic observations. In the language model, rules are introduced to capture the linguistic restrictions of the language and to allow rejection of invalid phoneme sequences.
• The acoustic and language models resulting from the training procedure are used as knowledge sources during decoding.
Work Significance
Difficulty in designing ASR for Indian languages
• For European-language ASR systems, where large standard databases (e.g. TIMIT, the Switchboard corpus) are available to model acoustic variability, high degrees of mixture tying have been applied:
 – 4000 to 8000 total tied states
 – a range of 8 to 128 Gaussian mixtures
• The same convention cannot be followed for Indian languages, as the databases prepared by various research groups are relatively small and phonetically not very rich.
Solution
• In this paper we present a solution for finding the right degree of mixture tying by empirically observing the performance of a Hindi speech recognition system on a small, self-prepared database.
Front-end Design
The front end of a speech recognizer mainly covers:
• Preprocessing
 – Receiving the speech sound from the speaker.
 – Filtering the background noise to achieve the highest possible signal-to-noise ratio (SNR).
 – Digitizing the analog speech signal.
• Feature Extraction (Parametric Transformation)
 – Extracting the set of properties of an utterance that have acoustic correlation to the speech signal.
 – The Perceptual Linear Prediction (PLP) feature extraction technique, based on the working of the human auditory system, is used in the front end.
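A common preprocessing step in such front ends (not named explicitly on this slide, so treat it as an illustrative assumption) is pre-emphasis, a first-order high-pass filter applied before feature extraction. A minimal sketch, with the conventional coefficient 0.97:

```python
def pre_emphasis(signal, alpha=0.97):
    """High-pass pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].
    Boosts high frequencies to compensate for the spectral tilt of speech;
    alpha = 0.97 is a conventional choice, not taken from these slides."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

# A constant (DC) signal is almost entirely suppressed after the first sample.
out = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```

After the first sample, each output equals 1 − 0.97 = 0.03, showing how the filter attenuates slowly varying components while preserving rapid changes.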
PLP
Block diagram of perceptual linear predictive (PLP) speech analysis
PLP Feature Extraction
Critical band resolution
• Critical band analysis is the basis for almost all models based on the auditory system. It approximates the ear's ability to discriminate between different frequencies. Experiments have shown that about 25 critical bands exist over the frequency range of human hearing, which spans 20 Hz to 20 kHz.
• The critical bands have a constant width of 100 Hz for center frequencies up to 500 Hz; the bandwidths increase as the center frequency increases further.
• It is a frequency-domain transformation, which can be implemented as a filter bank of bandpass filters. Bark scaling is used for the filter bank, since the linear frequency scale is inadequate for representing the auditory system: the human auditory system behaves linearly with frequency at low frequencies but logarithmically at higher frequencies.
• One critical band corresponds to a 1.5 mm step along the basilar membrane, containing about 1200 primary auditory nerve fibers.
PLP Feature Extraction
• To obtain the auditory spectrum, 17 critical-band filter outputs are used. Their center frequencies are equally spaced in the Bark domain, defined by

 z(f) = 6 \ln\left( \frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1} \right)

 where f is the frequency in Hz; z maps the range 0–5 kHz into the range 0–17 Bark (i.e. 0 ≤ z ≤ 17 Bark).
• Each band is simulated by a spectral weighting Ψ(z − z_k), where z_k are the center frequencies in Bark and Ψ is the critical-band masking curve,

 \Psi(z) = \begin{cases} 10^{2.5(z+0.5)} & -1.3 \le z \le -0.5 \\ 1 & -0.5 < z < 0.5 \\ 10^{-(z-0.5)} & 0.5 \le z \le 2.5 \\ 0 & \text{otherwise.} \end{cases}

• Finally, the feature vector consists of 39 values: 12 cepstral coefficients plus one energy term, 13 delta cepstral coefficients, and 13 delta-delta coefficients.
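The Bark warping above can be sketched directly; the inverse is what places the 17 equally spaced center frequencies back on the Hz axis. A minimal sketch of both directions:

```python
import math

def hz_to_bark(f):
    """Bark warping z(f) = 6 * ln(f/600 + sqrt((f/600)**2 + 1)) = 6 * asinh(f/600)."""
    return 6.0 * math.asinh(f / 600.0)

def bark_to_hz(z):
    """Inverse warping, used to place the filter center frequencies in Hz."""
    return 600.0 * math.sinh(z / 6.0)

# 17 center frequencies, equally spaced (1 Bark apart) over the 0-17 Bark range
centers_hz = [bark_to_hz(z) for z in range(1, 18)]
```

Evaluating `hz_to_bark(5000)` gives approximately 17 Bark, consistent with the slide's mapping of 0–5 kHz onto 0–17 Bark.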
11
RASTA (Relative Spectral)
Noise & Channel Compensation Technique
• The linguistic components of speech are governed by the rate of change of the vocal tract shape.
• The rate of change of the non-linguistic components (i.e. the noise) in speech often lies outside the typical rate of change of the vocal tract shape.
• The relative spectral (RASTA) technique takes advantage of this fact and suppresses spectral components that change more slowly or more quickly than the typical rate of change of speech.
• RASTA has often been combined with the PLP method and implemented as an IIR filter; the same filter is used for all frequency bands.
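A minimal sketch of that IIR filter, using the band-pass transfer function given by Hermansky & Morgan (1994); the exact coefficients are an assumption from that paper, not from these slides:

```python
def rasta_filter(traj):
    """Apply the RASTA band-pass IIR filter to one log-spectral trajectory.
    Transfer function (Hermansky & Morgan, 1994), up to a pure delay:
        H(z) = 0.1 * (2 + z^-1 - z^-3 - 2*z^-4) / (1 - 0.98*z^-1)
    The feed-forward taps sum to zero, so constant (very slowly varying)
    channel effects are suppressed; the pole smooths very fast fluctuations."""
    b = [0.2, 0.1, 0.0, -0.1, -0.2]   # numerator taps
    a = 0.98                          # feedback (pole) coefficient
    y_prev, out = 0.0, []
    for n in range(len(traj)):
        y = a * y_prev + sum(bk * traj[n - k] for k, bk in enumerate(b) if n - k >= 0)
        out.append(y)
        y_prev = y
    return out
```

Feeding in a constant trajectory (a stationary channel offset) drives the output toward zero, which is exactly the suppression of too-slowly-changing components described above.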
Tools used for ASR
• HTK 3.4.1
 – Developed at Cambridge University.
 – Written in C.
 – Supports the Linux platform; for a Windows environment it requires the interfacing software CYGWIN.
• SPHINX 4
 – Developed at Carnegie Mellon University (CMU).
 – Written in Java.
• MATLAB
• Julius
HTK 3.4.1 is the most widely used ASR tool.
Univariate Gaussian/Normal Distribution

The normal density f(x) is

 f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

where μ and σ are the two parameters of the Gaussian distribution, viz. the mean and the standard deviation respectively.

The probability distribution F(x) is given by:

 F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left( -\frac{(t-\mu)^2}{2\sigma^2} \right) dt

It is a continuous probability distribution, used when only one observation is under consideration, e.g. the height of students.
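The density and distribution above can be sketched with the standard library; F(x) is expressed through the error function rather than by numerical integration:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Normal density f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x-mu)**2 / (2*sigma**2))."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def gauss_cdf(x, mu, sigma):
    """Distribution F(x), written via erf: F(x) = (1 + erf((x-mu)/(sigma*sqrt(2)))) / 2."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
```

At the mean, F(μ) = 0.5 and the density peaks at 1/(σ√(2π)), which is a quick sanity check for both functions.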
Multivariate Gaussian
• When more than one observation is under consideration
 – e.g. the height, weight and IQ level of a student
 – the 39-dimensional MFCC feature vector in the case of ASR.

 N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^{T} \Sigma^{-1} (x-\mu) \right)

where μ is the n-dimensional mean vector,
Σ is the n×n covariance matrix, and
|Σ| is the determinant of the covariance matrix Σ.
Multivariate Gaussian cont..

 \mu = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_{39}) \end{bmatrix}, \qquad
 \Sigma = E\left[ (X-\mu)(X-\mu)^{T} \right] = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m} \\ \vdots & & \ddots & \vdots \\ \sigma_{m1} & \sigma_{m2} & \cdots & \sigma_{mm} \end{bmatrix}

If the feature vector components are uncorrelated, the covariances among them will be zero. In this situation only the diagonal elements are considered, as they represent the variances.
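With a diagonal covariance, the density factorizes per dimension, so the log-density is cheap to evaluate; this is why ASR systems favor diagonal-covariance Gaussians. A minimal sketch:

```python
import math

def diag_gauss_logpdf(x, mu, var):
    """Log-density of an n-dimensional Gaussian with diagonal covariance.
    With zero off-diagonal covariances, |Sigma| is the product of the
    variances and the quadratic form splits into per-dimension terms."""
    n = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

At the mean with unit variances, the log-density reduces to −(n/2)·ln(2π), matching the normalizer in the formula above.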
Mixture of Gaussians
Mixture of 3 Gaussians
Illustration of a mixture of 3 Gaussians in a two-dimensional space:
1. Fig 1: Contours of constant density for each of the mixture components, with the 3 components shown in red, blue and green.
2. Fig 2: Contours of the marginal probability density p(x) of the mixture distribution.
3. Fig 3: A surface plot of the distribution p(x).
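The mixture density illustrated above is a weighted sum of component Gaussians. A minimal sketch with a hypothetical 3-component, two-dimensional mixture (the weights, means and variances below are invented for illustration, not taken from the figure):

```python
import math

def diag_gauss_pdf(x, mu, var):
    # Density of a diagonal-covariance multivariate Gaussian.
    n = len(x)
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    det = 1.0
    for v in var:
        det *= v
    return math.exp(-0.5 * quad) / math.sqrt((2.0 * math.pi) ** n * det)

def gmm_pdf(x, weights, means, variances):
    """Mixture density p(x) = sum_k w_k * N(x; mu_k, Sigma_k)."""
    return sum(w * diag_gauss_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical 3-component mixture (like the red, blue, green components)
weights = [0.5, 0.3, 0.2]
means = [(0.0, 0.0), (2.0, 1.0), (-1.0, 2.0)]
variances = [(1.0, 1.0), (0.5, 0.5), (0.8, 0.3)]
p = gmm_pdf((0.0, 0.0), weights, means, variances)
```

Because the weights sum to one and each component is a proper density, p(x) is itself a proper density; this same construction gives the state output distributions b_j(o_t) used later in the HMMs.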
Review of ASR
Illustration of the speech recognition process:
• The raw speech waveform is first parameterized into discrete 39-dimensional feature vectors at the front end. In the statistical framework, these feature vectors are called observation vectors at the back end.
• The word string corresponding to the observation vectors is then decoded by the recognizer.
Hidden Markov Model
• Speech characteristics
 – In speech there are two types of variability:
  • Spectral variability
  • Temporal variability
 – To model these variabilities, a doubly stochastic process is required, one process for each type of variability.
HMM Structure
• Extended Markov Chain or Stochastic Finite State Machine
 – Temporal variability is covered by the normal working of the Markov chain.
 – To cover spectral variability, each state of the chain is characterized by a special type of pdf, i.e., a mixture of multivariate Gaussians.

[Figure: left-to-right HMM with three emitting states, non-emitting start and end nodes, transition probabilities a_ij, and output densities b_j(o_t) evaluated on the observation vectors o_1 ... o_6]
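Scoring an observation sequence against such an HMM is done with the forward algorithm. A minimal sketch with discrete emissions for brevity (the transition values and emission table below are invented for illustration; real systems replace B with the mixture-of-Gaussian densities b_j(o_t), but the recursion is identical):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: alpha[t][j] = P(o_1..o_t, state_t = j).
    pi: initial state probabilities, A: transition matrix,
    B[j][o]: probability that state j emits symbol o.
    Returns the total likelihood P(O | model) = sum_j alpha[T][j]."""
    n = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    return sum(alpha)

# Hypothetical left-to-right HMM: 3 emitting states, 2 observation symbols
pi = [1.0, 0.0, 0.0]
A = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
B = [[0.9, 0.1],
     [0.5, 0.5],
     [0.2, 0.8]]
lik = forward(pi, A, B, [0, 0, 1])
```

The zeros below the diagonal of A encode the left-to-right topology: a state can only loop on itself or move forward, which is how the chain models temporal variability.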
Illustration

Transition matrix of a left-to-right HMM (only self-loops and forward transitions have non-zero probability):

 A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & 0.0 \\ 0.0 & a_{22} & a_{23} & a_{24} \\ 0.0 & 0.0 & a_{33} & a_{34} \\ 0.0 & 0.0 & 0.0 & a_{44} \end{bmatrix}

[Figure: six-state example S1–S6 with transition probabilities labeled on the arcs and initial state probabilities π_1 = 0.5, π_2 = 0.0, π_3 = 0.0, π_4 = 0.5, π_5 = 0.0, π_6 = 0.0]
Unit Selection in Acoustic Models
• Whole Word Model
 – Successful for domain-specific problems where only a small vocabulary is required.
• Syllable Model
 – HMMs are generated on the basis of the syllables normally used in different languages.
• CI Phone Model
 – These models are simple but unable to capture the variations of a phone with respect to its context.
• Triphone Model (Context-Dependent Phone Model)
 – The preceding and succeeding phones are grouped with the middle phone to improve performance. In general, for 48 phones, 48×48×48 triphone combinations can be generated, which is very difficult to manage. To cope with this problem, tied-state clustering is performed on triphone models.
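The combinatorial explosion in the last bullet is easy to make concrete. A sketch with a hypothetical 48-phone inventory, using the common "left-center+right" triphone naming convention (the phone names themselves are placeholders):

```python
# Hypothetical 48-phone inventory; a triphone "l-c+r" models phone c with
# left context l and right context r (HTK-style naming convention).
phones = ["p%d" % i for i in range(48)]
triphones = {"%s-%s+%s" % (l, c, r)
             for l in phones for c in phones for r in phones}
count = len(triphones)  # 48**3 = 110592 distinct context-dependent units
```

With 110,592 logical models but a small training corpus, most triphones are seen rarely or never, which is exactly why the tied-state clustering of the next slides is needed.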
States Clustering in Triphone HMMs
Need for state tying
• A typical system might have approximately 2400 states with 4 mixture components per state, giving about 800k parameters in total, or approximately 4800 states with 8 mixture components per state, giving about 3000k parameters in total.
• The training data is not sufficient to generate an appropriate Gaussian mixture model for each state.
• To address this problem in context-dependent models, many similar states of the model are tied, and the data corresponding to all these states are used to train one global state. This provides a large amount of data for each state, hence the parameters are well estimated.
• HMMs with this type of sharing were proposed in the literature under the names semi-continuous and tied-mixture HMMs.
State Clustering in Triphone HMMs
What is state clustering?
• Acoustically similar states are tied to form state clusters. The clusters are known as senones or genones, names given by various research groups.
• State clusters are formed by building a cluster tree with a bottom-up approach.
Tree-based clustering
• The leaf nodes in the tree correspond to individual HMM states. Acoustically similar states are clustered to form the next higher level. This iteration is performed until the desired number of clusters is reached.
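The bottom-up iteration above can be sketched as plain agglomerative clustering. This is a simplification: the distance here is squared Euclidean distance between cluster centroids of state mean vectors, whereas real systems use likelihood-based distances over full state distributions.

```python
def bottom_up_cluster(vectors, target):
    """Agglomerative (bottom-up) clustering: start with one cluster per
    state vector and repeatedly merge the two closest clusters until
    `target` clusters remain (each merge = one level up the tree)."""
    clusters = [[v] for v in vectors]

    def centroid(c):
        return [sum(vals) / len(c) for vals in zip(*c)]

    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = centroid(clusters[i]), centroid(clusters[j])
                d = sum((a - b) ** 2 for a, b in zip(ci, cj))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

# Four hypothetical state means; the two nearby pairs end up in separate clusters.
result = bottom_up_cluster([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], 2)
```

Stopping the merging at a chosen `target` is the knob the experiments later tune: the number of tied states (genones).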
Tree-based clustering approach
Experimental Setup
HMM state topology
• Whole-word models and cross-word triphone HMMs with a linear left-to-right topology were used to compute the score of a sequence of features against its phonetic transcription.
• In the triphone model, 3 states per phone, along with dummy (non-emitting) initial and final nodes, were used, with no state skipping permitted. For the whole-word model, seven states per word were used.
Training & Testing
• The experiments were performed on a set of speech data consisting of six hundred Hindi words recorded by 10 male and 10 female speakers. Each model was trained using various utterances of each word. Testing was carried out on one hundred randomly chosen words spoken by different speakers, i.e., one hundred test words in total.
Experiment with different mixtures
[Plot: word recognition accuracy (0.5–0.9) versus number of Gaussian mixture components (4–24) for the standard PLP and PLP+RASTA front ends]
Experiments were performed six times with different numbers of Gaussians, using the triphone model as the fundamental speech unit and the MLE technique for HMM parameter estimation. Maximum accuracy was observed with sixteen Gaussian mixtures. This is far fewer than in European-language ASR, where normally 64 Gaussian mixtures have been used to achieve optimum results.
Experiments with Vocabulary Sizes
Dictionary size (words): 200 / 400 / 600
Whole word accuracy:  92% / 84% / 80%
Sub word accuracy:   90% / 88% / 86%

Two models, the whole-word model and the sub-word triphone model, were investigated with various vocabulary sizes.
The smaller the vocabulary, the lesser the chance of confusion, and hence the better the accuracy should be.
For small vocabularies up to 200 words, the whole-word model gives maximum accuracy; beyond that, the triphone model must be used for better accuracy.
Sixteen Gaussian mixtures were used in training of the model to get best results.
Experiments with Various Tied States (Genones)

Tied states: 180 / 750 / 1250 / 1700 / 2400
Accuracy:  83% / 88% / 88% / 85% / 84%
The number of Gaussian mixtures used for each case of tied states is sixteen. With the help of a decision tree, the mixtures of each state were tied for each base phone, and the training-data triphones were mapped into a smaller set of tied-state triphones. Each state position of each phone has a binary tree associated with it.
Maximum accuracy was observed at around one thousand tied states.
Conclusion
• To avoid overfitting of the data and to minimize computational overhead, an appropriate degree of mixture tying is very important.
• Experimental results have shown that only 16 Gaussian mixtures and around one thousand tied states yield optimal performance in the context of the databases available for Indian languages, while in the case of European languages:
 – the total number of tied states in a large-vocabulary speaker-independent system typically ranges between 5,000 and 10,000 states;
 – a range of 32 to 128 mixtures is used.
• For a small vocabulary the whole-word model is enough, but as the vocabulary size increases, the triphone model is required to achieve optimum results. The word recognition accuracy of whole-word models decreases more rapidly than that of sub-word models.
References
• A.E. Rosenberg, L.R. Rabiner, J. Wilpon, D. Kahn. 1983. Demisyllable-Based Isolated Word Recognition System. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-31(3): 713-726.
• A. Sharma, M.C. Shrotriya, O. Farooq, Z.A. Abbasi. 2008. Hybrid Wavelet Based LPC Features for Hindi Speech Recognition. International Journal of Information and Communication Technology, Inderscience publisher, vol. 1, pp. 373-381.
• A. Sixtus and H. Ney. 2002. From Within-Word Model Search to Crossword Model Search in Large Vocabulary Continuous Speech Recognition. Computer Speech and Language. 16(2): 245-271.
• C. Becchetti and K.P. Ricotti. 2004. Speech Recognition Theory and C++ Implementation. John Wiley.
• D. Klatt. 1986. Problem of Variability in Speech Recognition and in Models of Speech Perception. In J.S. Perkell and D.H. Klatt (editors), Invariance and Variability in Speech Processes, 300-320. Lawrence Erlbaum Associates, Hillsdale, N.J.
References contd…
• Douglas O'Shaughnessy. 2003. Interacting With Computers by Voice: Automatic Speech Recognition and Synthesis. Proceedings of the IEEE, 91(9): 1272-1305.
• F. Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.
• H. Hermansky. 1990. Perceptual Linear Predictive (PLP) Analysis of Speech. Journal of the Acoustical Society of America, 87: 1738-1752.
• H. Hermansky and N. Morgan. 1994. RASTA Processing of Speech. IEEE Transactions on Speech and Audio Processing, 2(4): 578-589.
• H. Hermansky, S. Sharma. 1999. Temporal Patterns (TRAPs) in ASR of Noisy Speech. Proc. of IEEE Conference on Acoustics, Speech and Signal Processing.
• J. Koehler, N. Morgan, H. Hermansky, H. G. Hirsch and G. Tong. 1994. Integrating RASTA-PLP into Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1: 421-424.
References contd…
• J. Baker, P. Bamberg et al. 1992. Large Vocabulary Recognition of Wall Street Journal Sentences at Dragon Systems. Proc. DARPA Speech and Natural Language Workshop, 387-392.
• J. Picone. 1993. Signal Modeling Techniques in Speech Recognition. Proceedings of the IEEE, 81(9): 1215-1247.
• K.F. Lee. 1989. Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic.
• L.E. Baum and J.A. Eagon. 1967. An Inequality with Applications to Statistical Estimation for Probabilistic Functions of Markov Processes and to a Model for Ecology. Bulletin of the American Mathematical Society, 73: 360-363.
• L.R. Bahl, P.F. Brown, P.V. de Souza and R.L. Mercer. 1986. Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition. Proceedings of IEEE ICASSP, 49-52.
References contd…
• L.R. Rabiner. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2): 257-286.
• L.R. Rabiner and R.W. Schafer. 2007. Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing, vol. 1, issue 1-2, pp. 33-73.
• Li Deng, D. O'Shaughnessy. 2003. Speech Processing: A Dynamic and Optimization-Oriented Approach. Marcel Dekker Inc., New York-Basel.
• M. Gales and S. Young. 2007. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 1(3): 195-304.
• M. Hwang and X. Huang. 1992. Subphonetic Modeling with Markov States—Senone. In Proc. of IEEE ICASSP, 33-36.
• M.J. Hunt, M. Lennig, P. Mermelstein. 1980. Experiments in Syllable-Based Recognition of Continuous Speech. Proceedings of IEEE ICASSP, 880-883.
References contd…
• M. Kumar, A. Verma, and N. Rajput. 2004. A Large Vocabulary Speech Recognition System for Hindi. IBM Journal of Research and Development, vol. 48, pp. 703-715.
• M.Y. Hwang, X. Huang and F. Alleva. 1993. Predicting Unseen Triphones with Senones. Proc. IEEE ICASSP-93, II: 311-314.
• Nagendra Goel, Samuel Thomas, Mohit Agarwal et al. 2010. Approaches to Automatic Lexicon Learning With Limited Training Examples. Proc. of IEEE Conference on Acoustics, Speech and Signal Processing.
• Pablo Fetter, Alfred Kaltenmeier, Thomas Kuhn and Peter Regel-Brietzmann. 1996. Improved Modeling of OOV Words in Spontaneous Speech. Int. Conf. on Acoustics, Speech, and Signal Processing.
References contd…
• Rivarol Vergin, Douglas O'Shaughnessy, Azarshid Farhat. 1999. Generalized Mel Frequency Cepstral Coefficients for Large-Vocabulary Speaker-Independent Continuous Speech Recognition. IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, pp. 525-532.
• R.K. Aggarwal and M. Dave. 2008. Implementing a Speech Recognition System Interface for Indian Languages. Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, IIIT Hyderabad.
• R.K. Aggarwal and M. Dave. 2010. Effects of Mixtures in Statistical Modeling of Hindi Speech Recognition Systems. Proceedings of the 2nd International Conference on Intelligent Human Computer Interaction, Springer.
• R. Schwartz, Y. Chow, O. Kimball, S. Roucos, M. Krasner, J. Makhoul. 1985. Context-Dependent Modeling for Acoustic-Phonetic Recognition of Continuous Speech. IEEE International Conference on Acoustics, Speech and Signal Processing.
References contd…
• S.J. Young. 1992. The General Use of Tying in Phoneme-Based HMM Speech Recognizers. Int. Conf. on Acoustics, Speech, and Signal Processing, 569-572.
• S.J. Young, J.J. Odell and P.C. Woodland. 1994. Tree-Based State Tying for High Accuracy Acoustic Modeling. Proceedings of the Human Language Technology Workshop, 307-312.
• S.J. Young and P.C. Woodland. 1993. The Use of State Tying in Continuous Speech Recognition. Proc. ESCA Eurospeech, 3: 2203-2206, Berlin, Germany.
• S. Young, M. Gales, D. Povey et al. 2006. The HTK Book. Cambridge University Engineering Department.
• V. Digalakis and H. Murveit. 1994. Genones: Optimizing the Degree of Tying in a Large Vocabulary HMM-Based Speech Recognizer. Proc. of IEEE ICASSP, 537-540.
THANKS