Let us pray to the Almighty to illuminate our intellect towards the righteous path.
An Empirical Approach for Optimization of Acoustic Models in Hindi Speech Recognition Systems
R. K. Aggarwal, Dept. of Computer Engineering,
National Institute of Technology (NIT), Kurukshetra, Haryana, India.
Automatic Speech Recognition (ASR)
• The goal of ASR is to convert a speech signal into its equivalent text message, independent of the device, the speaker, or the environment.
• It is a pattern recognition problem in which features are extracted and a model is used for training and testing.
[Block diagram: speech input → preprocessing → feature extraction → model generation (training) / pattern classification (testing) → recognized words]
Statistical Approach to ASR
[Block diagram: speech sound from microphone → front end (pre-processing, feature extraction producing LPCC/MFCC parameters) → back end (recognizer with acoustic modeling and language modeling) → recognized speech]
Statistical framework of ASR
• State-of-the-art speech recognition systems use Gaussian-mixture output probability distributions in HMMs together with context-dependent phone models. To handle the large number of HMM state parameters, many similar states of the model are tied, and the data corresponding to all these states are used to train one global state. HMMs with this type of sharing were proposed in the literature under the names semi-continuous and tied-mixture HMMs.
• The main components of a statistical ASR system are feature extraction, the acoustic models (HMMs), the language model, and the hypothesis search unit. The acoustic model typically consists of two parts: the first describes how a word sequence can be represented by sub-word units, and the second maps each sub-word unit to acoustic observations. In the language model, rules are introduced to capture the linguistic restrictions of the language and to allow rejection of invalid phoneme sequences.
• The acoustic and language models resulting from the training procedure are used as knowledge sources during decoding.
Work Significance
Difficulty in designing ASR for Indian languages
• For European-language ASR systems, where large standard databases (e.g. TIMIT, the Switchboard corpus) are available to model acoustic variability, high degrees of mixture tying have been applied:
 – 4000 to 8000 total tied states
 – a range of 8 to 128 Gaussian mixtures
• The same convention cannot be followed for Indian languages, as the databases prepared by various research groups are relatively small and phonetically not very rich.
Solution
• In this paper we present a solution for finding the right degree of mixture tying by empirically observing the performance of a Hindi speech recognition system on a small, self-prepared database.
Front-end Design
The front end of a speech recognizer mainly covers:
• Preprocessing
 – Receiving the speech sound from the speaker.
 – Filtering the background noise to achieve the highest possible signal-to-noise ratio (SNR).
 – Digitizing the analog speech signal.
• Feature Extraction (Parametric Transformation)
 – Extracting the set of properties of an utterance that have acoustic correlation to the speech signal.
 – The Perceptual Linear Prediction (PLP) feature extraction technique, based on the working of the human auditory system, is used in the front end.
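A common preprocessing step in such front ends (not named explicitly on this slide, so treat it as an illustrative assumption) is pre-emphasis, a first-order high-pass filter applied before feature extraction. A minimal sketch, with the conventional coefficient 0.97:

```python
def pre_emphasis(signal, alpha=0.97):
    """High-pass pre-emphasis filter: y[n] = x[n] - alpha * x[n-1].
    Boosts high frequencies to compensate for the spectral tilt of speech;
    alpha = 0.97 is a conventional choice, not taken from these slides."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

# A constant (DC) signal is almost entirely suppressed after the first sample.
out = pre_emphasis([1.0, 1.0, 1.0, 1.0])
```

After the first sample, each output equals 1 − 0.97 = 0.03, showing how the filter attenuates slowly varying components while preserving rapid changes.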
PLP
Block diagram of perceptual linear predictive (PLP) speech analysis
PLP Feature Extraction
Critical band resolution
• Critical band analysis is the basis for almost all models based on the auditory system. It approximates the ear's ability to discriminate between different frequencies. Experiments have shown that about 25 critical bands exist over the frequency range of human hearing, which spans 20 Hz to 20 kHz.
• The critical bands have a constant width of 100 Hz for center frequencies up to 500 Hz; the bandwidths increase as the center frequency increases further.
• It is a frequency-domain transformation, which can be implemented as a filter bank of bandpass filters. Bark scaling is used for the filter bank, since the linear frequency scale is inadequate for representing the auditory system: the human auditory system behaves linearly with frequency at low frequencies but logarithmically at higher frequencies.
• One critical band corresponds to a 1.5 mm step along the basilar membrane, containing about 1200 primary auditory nerve fibers.
PLP Feature Extraction
• To obtain the auditory spectrum, 17 critical-band filter outputs are used. Their center frequencies are equally spaced in the Bark domain, defined by

 z(f) = 6 \ln\left( \frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1} \right)

 where f is the frequency in Hz; z maps the range 0–5 kHz into the range 0–17 Bark (i.e. 0 ≤ z ≤ 17 Bark).
• Each band is simulated by a spectral weighting Ψ(z − z_k), where z_k are the center frequencies in Bark and Ψ is the critical-band masking curve,

 \Psi(z) = \begin{cases} 10^{2.5(z+0.5)} & -1.3 \le z \le -0.5 \\ 1 & -0.5 < z < 0.5 \\ 10^{-(z-0.5)} & 0.5 \le z \le 2.5 \\ 0 & \text{otherwise.} \end{cases}

• Finally, the feature vector consists of 39 values: 12 cepstral coefficients plus one energy term, 13 delta cepstral coefficients, and 13 delta-delta coefficients.
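The Bark warping above can be sketched directly; the inverse is what places the 17 equally spaced center frequencies back on the Hz axis. A minimal sketch of both directions:

```python
import math

def hz_to_bark(f):
    """Bark warping z(f) = 6 * ln(f/600 + sqrt((f/600)**2 + 1)) = 6 * asinh(f/600)."""
    return 6.0 * math.asinh(f / 600.0)

def bark_to_hz(z):
    """Inverse warping, used to place the filter center frequencies in Hz."""
    return 600.0 * math.sinh(z / 6.0)

# 17 center frequencies, equally spaced (1 Bark apart) over the 0-17 Bark range
centers_hz = [bark_to_hz(z) for z in range(1, 18)]
```

Evaluating `hz_to_bark(5000)` gives approximately 17 Bark, consistent with the slide's mapping of 0–5 kHz onto 0–17 Bark.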
11
RASTA (Relative Spectral)
Noise & Channel Compensation Technique
• The linguistic components of speech are governed by the rate of change of the vocal tract shape.
• The rate of change of the non-linguistic components (i.e. the noise) in speech often lies outside the typical rate of change of the vocal tract shape.
• The relative spectral (RASTA) technique takes advantage of this fact and suppresses spectral components that change more slowly or more quickly than the typical rate of change of speech.
• RASTA has often been combined with the PLP method and implemented as an IIR filter; the same filter is used for all frequency bands.
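A minimal sketch of that IIR filter, using the band-pass transfer function given by Hermansky & Morgan (1994); the exact coefficients are an assumption from that paper, not from these slides:

```python
def rasta_filter(traj):
    """Apply the RASTA band-pass IIR filter to one log-spectral trajectory.
    Transfer function (Hermansky & Morgan, 1994), up to a pure delay:
        H(z) = 0.1 * (2 + z^-1 - z^-3 - 2*z^-4) / (1 - 0.98*z^-1)
    The feed-forward taps sum to zero, so constant (very slowly varying)
    channel effects are suppressed; the pole smooths very fast fluctuations."""
    b = [0.2, 0.1, 0.0, -0.1, -0.2]   # numerator taps
    a = 0.98                          # feedback (pole) coefficient
    y_prev, out = 0.0, []
    for n in range(len(traj)):
        y = a * y_prev + sum(bk * traj[n - k] for k, bk in enumerate(b) if n - k >= 0)
        out.append(y)
        y_prev = y
    return out
```

Feeding in a constant trajectory (a stationary channel offset) drives the output toward zero, which is exactly the suppression of too-slowly-changing components described above.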
Tools used for ASR
• HTK 3.4.1
 – Developed at Cambridge University.
 – Written in C.
 – Supports the Linux platform; for a Windows environment it requires the interfacing software CYGWIN.
• SPHINX 4
 – Developed at Carnegie Mellon University (CMU).
 – Written in Java.
• MATLAB
• Julius
HTK 3.4.1 is the most widely used ASR tool.
Univariate Gaussian/Normal Distribution

The normal density f(x) is

 f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

where μ and σ are the two parameters of the Gaussian distribution, viz. the mean and the standard deviation respectively.

The probability distribution F(x) is given by:

 F(x) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left( -\frac{(t-\mu)^2}{2\sigma^2} \right) dt

It is a continuous probability distribution, used when only one observation is under consideration, e.g. the height of students.
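The density and distribution above can be sketched with the standard library; F(x) is expressed through the error function rather than by numerical integration:

```python
import math

def gauss_pdf(x, mu, sigma):
    """Normal density f(x) = 1/(sigma*sqrt(2*pi)) * exp(-(x-mu)**2 / (2*sigma**2))."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

def gauss_cdf(x, mu, sigma):
    """Distribution F(x), written via erf: F(x) = (1 + erf((x-mu)/(sigma*sqrt(2)))) / 2."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))
```

At the mean, F(μ) = 0.5 and the density peaks at 1/(σ√(2π)), which is a quick sanity check for both functions.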
Multivariate Gaussian
• When more than one observation is under consideration
 – e.g. the height, weight and IQ level of a student
 – the 39-dimensional MFCC feature vector in the case of ASR.

 N(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x-\mu)^{T} \Sigma^{-1} (x-\mu) \right)

where μ is the n-dimensional mean vector,
Σ is the n×n covariance matrix, and
|Σ| is the determinant of the covariance matrix Σ.
Multivariate Gaussian cont..

 \mu = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_{39}) \end{bmatrix}, \qquad
 \Sigma = E\left[ (X-\mu)(X-\mu)^{T} \right] = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m} \\ \vdots & & \ddots & \vdots \\ \sigma_{m1} & \sigma_{m2} & \cdots & \sigma_{mm} \end{bmatrix}

If the feature vector components are uncorrelated, the covariances among them will be zero. In this situation only the diagonal elements are considered, as they represent the variances.
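With a diagonal covariance, the density factorizes per dimension, so the log-density is cheap to evaluate; this is why ASR systems favor diagonal-covariance Gaussians. A minimal sketch:

```python
import math

def diag_gauss_logpdf(x, mu, var):
    """Log-density of an n-dimensional Gaussian with diagonal covariance.
    With zero off-diagonal covariances, |Sigma| is the product of the
    variances and the quadratic form splits into per-dimension terms."""
    n = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

At the mean with unit variances, the log-density reduces to −(n/2)·ln(2π), matching the normalizer in the formula above.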
Mixture of Gaussians
Mixture of 3 Gaussians
Illustration of a mixture of 3 Gaussians in a two-dimensional space:
1. Fig 1: Contours of constant density for each of the mixture components, with the 3 components shown in red, blue and green.
2. Fig 2: Contours of the marginal probability density p(x) of the mixture distribution.
3. Fig 3: A surface plot of the distribution p(x).
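The mixture density illustrated above is a weighted sum of component Gaussians. A minimal sketch with a hypothetical 3-component, two-dimensional mixture (the weights, means and variances below are invented for illustration, not taken from the figure):

```python
import math

def diag_gauss_pdf(x, mu, var):
    # Density of a diagonal-covariance multivariate Gaussian.
    n = len(x)
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    det = 1.0
    for v in var:
        det *= v
    return math.exp(-0.5 * quad) / math.sqrt((2.0 * math.pi) ** n * det)

def gmm_pdf(x, weights, means, variances):
    """Mixture density p(x) = sum_k w_k * N(x; mu_k, Sigma_k)."""
    return sum(w * diag_gauss_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical 3-component mixture (like the red, blue, green components)
weights = [0.5, 0.3, 0.2]
means = [(0.0, 0.0), (2.0, 1.0), (-1.0, 2.0)]
variances = [(1.0, 1.0), (0.5, 0.5), (0.8, 0.3)]
p = gmm_pdf((0.0, 0.0), weights, means, variances)
```

Because the weights sum to one and each component is a proper density, p(x) is itself a proper density; this same construction gives the state output distributions b_j(o_t) used later in the HMMs.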
Review of ASR
Illustration of the speech recognition process:
• The raw speech waveform is first parameterized into discrete 39-dimensional feature vectors at the front end. In the statistical framework, these feature vectors are called observation vectors at the back end.
• The word string corresponding to the observation vectors is then decoded by the recognizer.
Hidden Markov Model
• Speech characteristics
 – In speech there are two types of variability:
  • Spectral variability
  • Temporal variability
 – To model these variabilities, a doubly stochastic process is required, one process for each type of variability.
HMM Structure
• Extended Markov Chain or Stochastic Finite State Machine
 – Temporal variability is covered by the normal working of the Markov chain.
 – To cover spectral variability, each state of the chain is characterized by a special type of pdf, i.e., a mixture of multivariate Gaussians.

[Figure: left-to-right HMM with three emitting states, non-emitting start and end nodes, transition probabilities a_ij, and output densities b_j(o_t) evaluated on the observation vectors o_1 ... o_6]
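Scoring an observation sequence against such an HMM is done with the forward algorithm. A minimal sketch with discrete emissions for brevity (the transition values and emission table below are invented for illustration; real systems replace B with the mixture-of-Gaussian densities b_j(o_t), but the recursion is identical):

```python
def forward(pi, A, B, obs):
    """Forward algorithm: alpha[t][j] = P(o_1..o_t, state_t = j).
    pi: initial state probabilities, A: transition matrix,
    B[j][o]: probability that state j emits symbol o.
    Returns the total likelihood P(O | model) = sum_j alpha[T][j]."""
    n = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(n)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    return sum(alpha)

# Hypothetical left-to-right HMM: 3 emitting states, 2 observation symbols
pi = [1.0, 0.0, 0.0]
A = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
B = [[0.9, 0.1],
     [0.5, 0.5],
     [0.2, 0.8]]
lik = forward(pi, A, B, [0, 0, 1])
```

The zeros below the diagonal of A encode the left-to-right topology: a state can only loop on itself or move forward, which is how the chain models temporal variability.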
Illustration

Transition matrix of a left-to-right HMM (only self-loops and forward transitions have non-zero probability):

 A = \begin{bmatrix} a_{11} & a_{12} & a_{13} & 0.0 \\ 0.0 & a_{22} & a_{23} & a_{24} \\ 0.0 & 0.0 & a_{33} & a_{34} \\ 0.0 & 0.0 & 0.0 & a_{44} \end{bmatrix}

[Figure: six-state example S1–S6 with transition probabilities labeled on the arcs and initial state probabilities π_1 = 0.5, π_2 = 0.0, π_3 = 0.0, π_4 = 0.5, π_5 = 0.0, π_6 = 0.0]
Unit Selection in Acoustic Models
• Whole Word Model
 – Successful for domain-specific problems where only a small vocabulary is required.
• Syllable Model
 – HMMs are generated on the basis of the syllables normally used in different languages.
• CI Phone Model
 – These models are simple but unable to capture the variations of a phone with respect to its context.
• Triphone Model (Context-Dependent Phone Model)
 – The preceding and succeeding phones are grouped with the middle phone to improve performance. In general, for 48 phones, 48×48×48 triphone combinations can be generated, which is very difficult to manage. To cope with this problem, tied-state clustering is performed on triphone models.
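The combinatorial explosion in the last bullet is easy to make concrete. A sketch with a hypothetical 48-phone inventory, using the common "left-center+right" triphone naming convention (the phone names themselves are placeholders):

```python
# Hypothetical 48-phone inventory; a triphone "l-c+r" models phone c with
# left context l and right context r (HTK-style naming convention).
phones = ["p%d" % i for i in range(48)]
triphones = {"%s-%s+%s" % (l, c, r)
             for l in phones for c in phones for r in phones}
count = len(triphones)  # 48**3 = 110592 distinct context-dependent units
```

With 110,592 logical models but a small training corpus, most triphones are seen rarely or never, which is exactly why the tied-state clustering of the next slides is needed.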
States Clustering in Triphone HMMs
Need for state tying
• A typical system might have approximately 2400 states with 4 mixture components per state, giving about 800k parameters in total, or approximately 4800 states with 8 mixture components per state, giving about 3000k parameters in total.
• The training data is not sufficient to generate an appropriate Gaussian mixture model for each state.
• To address this problem in context-dependent models, many similar states of the model are tied, and the data corresponding to all these states are used to train one global state. This provides a large amount of data for each state, hence the parameters are well estimated.
• HMMs with this type of sharing were proposed in the literature under the names semi-continuous and tied-mixture HMMs.
State Clustering in Triphone HMMs
What is state clustering?
• Acoustically similar states are tied to form state clusters. The clusters are known as senones or genones, names given by various research groups.
• State clusters are formed by building a cluster tree with a bottom-up approach.
Tree-based clustering
• The leaf nodes in the tree correspond to individual HMM states. Acoustically similar states are clustered to form the next higher level. This iteration is performed until the desired number of clusters is reached.
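The bottom-up iteration above can be sketched as plain agglomerative clustering. This is a simplification: the distance here is squared Euclidean distance between cluster centroids of state mean vectors, whereas real systems use likelihood-based distances over full state distributions.

```python
def bottom_up_cluster(vectors, target):
    """Agglomerative (bottom-up) clustering: start with one cluster per
    state vector and repeatedly merge the two closest clusters until
    `target` clusters remain (each merge = one level up the tree)."""
    clusters = [[v] for v in vectors]

    def centroid(c):
        return [sum(vals) / len(c) for vals in zip(*c)]

    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = centroid(clusters[i]), centroid(clusters[j])
                d = sum((a - b) ** 2 for a, b in zip(ci, cj))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

# Four hypothetical state means; the two nearby pairs end up in separate clusters.
result = bottom_up_cluster([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], 2)
```

Stopping the merging at a chosen `target` is the knob the experiments later tune: the number of tied states (genones).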
Tree-based clustering approach
Experimental Setup
HMM state topology
• Whole-word models and cross-word triphone HMMs with a linear left-to-right topology were used to compute the score of a sequence of features against its phonetic transcription.
• In the triphone model, 3 states per phone, along with dummy (non-emitting) initial and final nodes, were used, with no state skipping permitted. For the whole-word model, seven states per word were used.
Training & Testing
• The experiments were performed on a set of speech data consisting of six hundred Hindi words recorded by 10 male and 10 female speakers. Each model was trained using various utterances of each word. Testing was carried out on one hundred randomly chosen words spoken by different speakers, i.e., one hundred test words in total.
Experiment with different mixtures
[Plot: word recognition accuracy (0.5–0.9) versus number of Gaussian mixture components (4–24) for the standard PLP and PLP+RASTA front ends]
Experiments were performed six times with different numbers of Gaussians, using the triphone model as the fundamental speech unit and the MLE technique for HMM parameter estimation. Maximum accuracy was observed with sixteen Gaussian mixtures. This is far fewer than in European-language ASR, where normally 64 Gaussian mixtures have been used to achieve optimum results.
Experiments with Vocabulary Sizes
Dictionary size (words): 200 / 400 / 600
Whole word accuracy:  92% / 84% / 80%
Sub word accuracy:   90% / 88% / 86%

Two models, the whole-word model and the sub-word triphone model, were investigated with various vocabulary sizes.
The smaller the vocabulary, the lesser the chance of confusion, and hence the better the accuracy should be.
For small vocabularies up to 200 words, the whole-word model gives maximum accuracy; beyond that, the triphone model must be used for better accuracy.
Sixteen Gaussian mixtures were used in training of the model to get best results.
Experiments with Various Tied States (Genones)

Tied states: 180 / 750 / 1250 / 1700 / 2400
Accuracy:  83% / 88% / 88% / 85% / 84%
The number of Gaussian mixtures used for each case of tied states is sixteen. With the help of a decision tree, the mixtures of each state were tied for each base phone, and the training-data triphones were mapped into a smaller set of tied-state triphones. Each state position of each phone has a binary tree associated with it.
Maximum accuracy was observed at around one thousand tied states.
Conclusion
• To avoid overfitting of the data and to minimize computational overhead, an appropriate degree of mixture tying is very important.
• Experimental results have shown that only 16 Gaussian mixtures and around one thousand tied states yield optimal performance in the context of the databases available for Indian languages, while in the case of European languages:
 – the total number of tied states in a large-vocabulary speaker-independent system typically ranges between 5,000 and 10,000 states;
 – a range of 32 to 128 mixtures is used.
• For a small vocabulary the whole-word model is enough, but as the vocabulary size increases, the triphone model is required to achieve optimum results. The word recognition accuracy of whole-word models decreases more rapidly than that of sub-word models.
References
• A.E. Rosenberg, L.R. Rabiner, J. Wilpon, D. Kahn. 1983. Demisyllable-Based Isolated Word Recognition System. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-31(3): 713-726.
• A. Sharma, M.C. Shrotriya, O. Farooq, Z.A. Abbasi. 2008. Hybrid Wavelet Based LPC Features for Hindi Speech Recognition. International Journal of Information and Communication Technology, Inderscience publisher, vol. 1, pp. 373-381.
• A. Sixtus and H. Ney. 2002. From Within-Word Model Search to Crossword Model Search in Large Vocabulary Continuous Speech Recognition. Computer Speech and Language. 16(2): 245-271.
• C. Becchetti and K.P. Ricotti. 2004. Speech Recognition Theory and C++ Implementation. John Wiley.
• D. Klatt. 1986. Problem of Variability in Speech Recognition and in Models of Speech Perception. In J.S. Perkell and D.H. Klatt (editors), Invariance and Variability in Speech Processes, 300-320. Lawrence Erlbaum Associates, Hillsdale, N.J.
References contd…
• Douglas O'Shaughnessy. 2003. Interacting With Computers by Voice: Automatic Speech Recognition and Synthesis. Proceedings of the IEEE, 91(9): 1272-1305.
• F. Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press.
• H. Hermansky. 1990. Perceptual Linear Predictive (PLP) Analysis of Speech. Journal of the Acoustical Society of America, 87: 1738-1752.
• H. Hermansky and N. Morgan. 1994. RASTA Processing of Speech. IEEE Transactions on Speech and Audio Processing, 2(4): 578-589.
• H. Hermansky, S. Sharma. 1999. Temporal Patterns (TRAPs) in ASR of Noisy Speech. Proc. of IEEE Conference on Acoustics, Speech and Signal Processing.
• J. Koehler, N. Morgan, H. Hermansky, H. G. Hirsch and G. Tong. 1994. Integrating RASTA-PLP into Speech Recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1: 421-424.
References contd…
• J. Baker, P. Bamberg et al. 1992. Large Vocabulary Recognition of Wall Street Journal Sentences at Dragon Systems. Proc. DARPA Speech and Natural Language Workshop, 387-392.
• J. Picone. 1993. Signal Modeling Techniques in Speech Recognition. Proceedings of the IEEE, 81(9): 1215-1247.
• K.F. Lee. 1989. Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic.
• L.E. Baum and J.A. Eagon. 1967. An Inequality with Applications to Statistical Estimation for Probabilistic Functions of Markov Processes and to a Model for Ecology. Bulletin of the American Mathematical Society, 73: 360-363.
• L.R. Bahl, P.F. Brown, P.V. de Souza and R.L. Mercer. 1986. Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition. Proceedings of IEEE ICASSP, 49-52.
References contd…
• L.R. Rabiner. 1989. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2): 257-286.
• L.R. Rabiner and R.W. Schafer. 2007. Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing, vol. 1, issue 1-2, pp. 33-73.
• Li Deng, D. O'Shaughnessy. 2003. Speech Processing: A Dynamic and Optimization-Oriented Approach. Marcel Dekker Inc., New York-Basel.
• M. Gales and S. Young. 2007. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing, 1(3): 195-304.
• M. Hwang and X. Huang. 1992. Subphonetic Modeling with Markov States—Senone. In Proc. of IEEE ICASSP, 33-36.
• M.J. Hunt, M. Lennig, P. Mermelstein. 1980. Experiments in Syllable-Based Recognition of Continuous Speech. Proceedings of IEEE ICASSP, 880-883.
References contd…
• M. Kumar, A. Verma, and N. Rajput. 2004. A Large Vocabulary Speech Recognition System for Hindi. IBM Journal of Research and Development, vol. 48, pp. 703-715.
• M.Y. Hwang, X. Huang and F. Alleva. 1993. Predicting Unseen Triphones with Senones. Proc. IEEE ICASSP-93, II: 311-314.
• Nagendra Goel, Samuel Thomas, Mohit Agarwal et al. 2010. Approaches to Automatic Lexicon Learning With Limited Training Examples. Proc. of IEEE Conference on Acoustics, Speech and Signal Processing.
• Pablo Fetter, Alfred Kaltenmeier, Thomas Kuhn and Peter Regel-Brietzmann. 1996. Improved Modeling of OOV Words in Spontaneous Speech. Int. Conf. on Acoustics, Speech, and Signal Processing.
References contd…
• Rivarol Vergin, Douglas O'Shaughnessy, Azarshid Farhat. 1999. Generalized Mel Frequency Cepstral Coefficients for Large-Vocabulary Speaker-Independent Continuous Speech Recognition. IEEE Transactions on Speech and Audio Processing, vol. 7, no. 5, pp. 525-532.
• R.K. Aggarwal and M. Dave. 2008. Implementing a Speech Recognition System Interface for Indian Languages. Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, IIIT Hyderabad.
• R.K. Aggarwal and M. Dave. 2010. Effects of Mixtures in Statistical Modeling of Hindi Speech Recognition Systems. Proceedings of the 2nd International Conference on Intelligent Human Computer Interaction, Springer.
• R. Schwartz, Y. Chow, O. Kimball, S. Roucos, M. Krasner, J. Makhoul. 1985. Context-Dependent Modeling for Acoustic-Phonetic Recognition of Continuous Speech. IEEE International Conference on Acoustics, Speech and Signal Processing.
References contd…
• S.J. Young. 1992. The General Use of Tying in Phoneme-Based HMM Speech Recognizers. Int. Conf. on Acoustics, Speech, and Signal Processing, 569-572.
• S.J. Young, J.J. Odell and P.C. Woodland. 1994. Tree-Based State Tying for High Accuracy Acoustic Modeling. Proceedings of the Human Language Technology Workshop, 307-312.
• S.J. Young and P.C. Woodland. 1993. The Use of State Tying in Continuous Speech Recognition. Proc. ESCA Eurospeech, 3: 2203-2206, Berlin, Germany.
• S. Young, M. Gales, D. Povey et al. 2006. The HTK Book. Cambridge University Engineering Department.
• V. Digalakis and H. Murveit. 1994. Genones: Optimizing the Degree of Tying in a Large Vocabulary HMM-Based Speech Recognizer. Proc. of IEEE ICASSP, 537-540.
THANKS