Phone Classification Using HMM/SVM System and Normalization Technique
Mohammed Sidi Yakoub1, Roger Nkambou1, Sid-Ahmed Selouani2
1UQAM, CP 8888, Succ. Centre-Ville, Montreal, QC, H3C 3P8, Canada
2LARIHS Lab., Moncton University, UMCS Shippagan, 218 J. D. Gauthier Boul., Shippagan, NB, E8S 1P1, Canada
1sidiyakoub.mohammed@courrier.uqam.ca, nkambou.roger@uqam.ca, 2selouani@umcs.ca
Abstract—Support vector machines (SVMs) were originally developed for binary classification and later extended to multi-class classification. Because of their power and their suitability for hard classification problems, we have chosen them for automatic speech recognition (ASR). The aim of this paper is to investigate the use of multi-class SVM classification coupled with an HMM for TIMIT phones. SVMs require that all training and test samples have feature vectors of the same size. Because phone signals vary in length, even within the same phone class, we apply a normalization technique, zero padding and resampling, to all data samples so that they yield feature vectors of equal size. After mapping the 61 TIMIT phones to 46 phones and conducting tests using LibSVM and HTK, we obtained a classification accuracy of 91.26% with the hybrid HMM/SVM system and 71.41% with the HMM-based system. These results show that the hybrid HMM/SVM system with the normalization technique outperforms the HMM-based system, improving recognition accuracy by 19.8%. These results encourage us to use this hybrid system and normalization technique in our future work on spoken dialogue systems.
Keywords—Automatic speech recognition; Hidden Markov Model; LibSVM; multi-class classification; Mel-Frequency Cepstral Coefficients; normalization; Support Vector Machines.
I. INTRODUCTION
Most automatic speech recognition (ASR) systems are built using hidden Markov models (HMMs). However, HMM classifiers alone often do not reach the low recognition error rates that many ASR applications require. To tackle this problem, different approaches have been proposed: building ASR as a predictive Artificial Neural Network (ANN) system, a Support Vector Machine (SVM) system, a hybrid HMM/ANN system, or a hybrid HMM/SVM system [1]. The aim of this paper is to investigate the use of SVMs with a normalization technique to improve phone recognition.
This paper is organized as follows: Section 2 gives a brief review of SVM theory and its application to linear and non-linear classification of two or more classes. Section 3 describes the overall ASR system used for phone classification: the HMM-based ASR system, the SVM system, the normalization technique, and the hybrid HMM/SVM system. Section 4 describes the experiments and how tests were conducted using TIMIT [2], HTK [3], and LibSVM [4]. Finally, Section 5 concludes this work and gives indications for future work.
II. SUPPORT VECTOR MACHINES
SVM classifiers [5][6] are known to be successful and powerful in the field of machine learning. They are characterized by a maximum-margin solution, they can handle high-dimensional data, and their training is guaranteed to converge to the minimum of the associated cost function. Because of these characteristics, SVMs are discriminative classifiers well suited to hard classification problems.
SVM is a popular classification technique. From training data, an SVM generates a model that predicts the target class of test data. Originally, SVMs predicted two classes (two-class classification) and were later extended to predict multiple classes (multi-class classification). SVMs perform both linear and non-linear classification, using kernels to map their inputs into high-dimensional feature spaces. The next subsections give a brief overview of SVM theory.
A. Linear classification
Two-class classification is performed by finding a hyperplane that separates the data of two classes, +1 and -1. The SVM classifier estimates whether an input vector x belongs to the first class (y = +1) or the second class (y = -1). Many separators may exist, but the optimal one is the one with the maximum margin M. This separating hyperplane is called the maximum-margin linear classifier. The data points of the two classes located at the edges of that margin are the support vectors. The goal of classification is therefore to define the margin of the linear classifier as the width by which the boundary can be widened before hitting data points, and to maximize that width. This implies that only the support vectors matter; all other training samples can be ignored. Figure 1 illustrates this classification. This is the simplest kind of SVM.
Mathematically, the classification goal is to maximize the margin M while classifying all samples correctly. This linear SVM problem is formulated as a quadratic optimization problem, given by equation (1).
Given a training set of sample-class pairs (xi, yi), i = 1, ..., l, where xi ∈ Rⁿ and yi ∈ {+1, -1}, the SVM requires the solution of equation (1). Quadratic optimization problems are a well-known class of programming problems, and many algorithms exist for solving them; these algorithms find the w and b that satisfy equation (1).
978-1-4799-4796-6/13/$31.00 © 2013 IEEE
Fig. 1. Linear classification of two classes
Equation (1) works when the data are clean. If the data are noisy, equation (1) is replaced by equation (2), where the penalty parameter C controls overfitting and the slack variables ξi allow the misclassification of noisy samples. Even so, the formulation in equation (2) does not work properly when the data are too complex and nonlinear; such data cannot be classified linearly, and nonlinear classification is needed.
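To make the role of C and the slack variables concrete, the objective of equation (2) can be evaluated directly on toy data (a minimal sketch; the data points and candidate separator are illustrative, not taken from the paper):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Primal objective of equation (2): 0.5*||w||^2 + C * sum of slacks.

    The slack xi_i = max(0, 1 - y_i*(w.x_i + b)) measures how far
    sample i falls inside (or beyond) the margin."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * np.sum(slacks)

# Toy 2-D data: the last point is "noisy" and sits on the wrong side.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-3.0, -1.0], [0.5, 0.5]])
y = np.array([+1.0, +1.0, -1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0  # a candidate separator

# A larger C penalizes the noisy point's slack more heavily.
print(soft_margin_objective(w, b, X, y, C=1.0))    # 3.0
print(soft_margin_objective(w, b, X, y, C=100.0))  # 201.0
```

Only the misclassified point contributes a nonzero slack here, so increasing C changes the objective from 3.0 to 201.0: a high C forces the optimizer toward fitting the noise, a low C tolerates it for a wider margin.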
B. Nonlinear classification
Equation (2) is not sufficient to formulate the classification problem when the data are too complex and nonlinear. The idea behind nonlinear classification is to map the original input space into a feature space of higher dimensionality in which the classes of the training set become separable. The nonlinear case is given by equation (3), where the nonlinear function φ(xi) performs that mapping. Figure 2 illustrates this classification.
The solution of this quadratic optimization problem relies on the function φ(xi) (and on other parameters omitted here for simplicity), which may be impossible to evaluate since the feature space dimensionality can be infinite. However, we do not need to know φ(xi); we only need the dot product φ(xi)ᵀφ(xj), which can be evaluated using a kernel function K(xi, xj).
Different kernel functions exist: the simple linear kernel, the radial basis function (RBF) kernel K(x, y) = exp(−γ‖x − y‖²), the polynomial kernel, and the sigmoid kernel. In our work, the RBF kernel is used.
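As a concrete illustration, the RBF kernel can be written in a few lines (a minimal numpy sketch; the sample vectors and the value of γ are illustrative):

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """RBF kernel K(x, y) = exp(-gamma * ||x - y||^2).

    It evaluates the dot product phi(x).phi(y) of an
    infinite-dimensional feature mapping without computing phi."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# K(x, x) is always 1, and K decays as the points move apart.
x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
print(rbf_kernel(x, x, gamma=0.5))  # 1.0
print(rbf_kernel(x, y, gamma=0.5))  # exp(-2.5)
```

The kernel value depends only on the squared distance between the two inputs, which is why the mapping φ itself never has to be evaluated.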
C. Multi-class classification
SVM was first introduced for two-class classification and later generalized to multi-class classification.
Two well-known approaches [7] for multi-class classification combine a number of binary classifiers: one-against-one and one-against-the-rest. Let k be the number of classes to classify. The first approach constructs k(k − 1)/2 SVM classifiers, each trained on data from two classes, so that every class is confronted with every other class separately (one-against-one). The second approach constructs k SVM classifiers, each comparing one class against all the rest (one-against-the-rest). In our work, one-against-one is used.
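The one-against-one scheme can be sketched as follows (the pairwise decision function below is a toy stand-in, not a trained SVM):

```python
from itertools import combinations

def one_vs_one_pairs(classes):
    """All unordered class pairs; one binary SVM is trained per pair,
    giving k*(k-1)/2 classifiers in total."""
    return list(combinations(classes, 2))

def vote(x, pairs, decide):
    """One-against-one prediction: each pairwise classifier casts a
    vote for one of its two classes; the class with the most votes
    wins. `decide(x, a, b)` is a stand-in for a trained binary SVM."""
    tally = {}
    for a, b in pairs:
        winner = decide(x, a, b)
        tally[winner] = tally.get(winner, 0) + 1
    return max(tally, key=tally.get)

# With the paper's 46 phone classes, 46*45/2 = 1035 classifiers.
pairs = one_vs_one_pairs(range(46))
print(len(pairs))  # 1035

# Toy demo: a "classifier" that prefers the class label closer to x.
closest = lambda x, a, b: a if abs(x - a) <= abs(x - b) else b
print(vote(7.2, one_vs_one_pairs([0, 7, 20]), closest))  # 7
```

The pair count grows quadratically in k, which is why the paper's 46-class problem already needs 1035 binary models.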
D. LibSVM library
LibSVM [4] is a library implementing SVMs that uses the one-against-one method for multi-class classification.
    w·xi + b ≥ +1  if yi = +1
    w·xi + b ≤ −1  if yi = −1        ≡    yi(w·xi + b) ≥ 1  for all i

    Maximize M = 2/‖w‖    ≡    Minimize Φ(w) = ½ wᵀw

Quadratic optimization problem (1):

    min Φ(w) = ½ wᵀw
    subject to  yi(wᵀxi + b) ≥ 1  for all i                          (1)

Quadratic optimization problem (2):

    min Φ(w) = ½ wᵀw + C Σ(i=1..l) ξi
    subject to  yi(wᵀxi + b) ≥ 1 − ξi,  ξi ≥ 0  for all i            (2)

Quadratic optimization problem (3):

    min Φ(w) = ½ wᵀw + C Σ(i=1..l) ξi
    subject to  yi(wᵀφ(xi) + b) ≥ 1 − ξi,  ξi ≥ 0  for all i         (3)
Fig. 2. Nonlinear classification of two classes.
It includes efficient multi-class classification, probability estimates, various kernels (e.g., linear, polynomial, radial basis function (RBF), sigmoid), automatic model selection with cross-validation contour plots, etc. In LibSVM, the choice of one-against-one over one-against-the-rest is justified by the fact that training with the former is shorter, while their performance is comparable [7].
With k classes, the one-against-one method trains k(k − 1)/2 binary models (classifiers). Two techniques can be used to find the optimal classification parameters (i.e., the penalty parameter and the kernel parameters): 1) select the optimal parameters of each decision function separately for every pair of classes, so that each model has its own parameters; or 2) use the same parameters for all models, choosing the values that give the highest overall performance. LibSVM uses the latter, as it was shown in [7] that the two techniques give similar performance.
III. THE OVERALL SYSTEM
Figure 3 shows the HMM/SVM hybrid system we propose to
classify TIMIT phones.
A. HMM-based system: reference system
The HMM-based speech recognition system is built using the HTK tools [3]. Each phone is modeled by a 5-state HMM with two non-emitting states (the 1st and 5th) and a mixture of 8 Gaussians.
Mel-Frequency Cepstral Coefficients (MFCCs), delta coefficients, and cepstral pseudo-energy are computed on the TIMIT database [2] and used to train and test the system. We use this HMM-based system as the reference for comparison with the hybrid system.
Fig. 3. The HMM/SVM hybrid system using normalization.
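The 5-state left-to-right topology with non-emitting entry and exit states can be sketched as a transition matrix (a minimal numpy sketch; the transition probabilities are illustrative placeholders, since HTK estimates them during training):

```python
import numpy as np

# 5-state left-to-right HMM: states 0 and 4 are non-emitting
# (entry/exit), states 1-3 emit observations. A[i, j] is the
# probability of moving from state i to state j.
# The probability values are illustrative, not trained.
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],  # entry -> first emitting state
    [0.0, 0.6, 0.4, 0.0, 0.0],  # self-loop or advance
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.4],  # last emitting state -> exit
    [0.0, 0.0, 0.0, 0.0, 0.0],  # exit state: no outgoing transitions
])

# Every state except the exit has outgoing probability mass 1, and no
# transition goes backward (upper-triangular = left-to-right).
print(A[:4].sum(axis=1))                  # [1. 1. 1. 1.]
print(bool(np.allclose(A, np.triu(A))))   # True
```

The non-emitting entry and exit states carry no observation densities; they exist so that phone models can be concatenated into word or sentence models.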
B. Normalization technique
SVM requires that the features used for classification have the same size for all samples in the training and test data. To respect this constraint, we normalize the samples by applying zero padding and/or resampling to each phone signal. Together, these two techniques allow us to normalize the whole data set: for each phone class, we find a signal of maximal length (MAX) and apply the two techniques to the other signals of the same class to bring their size to MAX.
The normalization of training and test data is performed according to the following steps:
Fig. 4. Normalization technique: zero padding and resampling applied to signals from TIMIT.
1st step: for each phone class in the training data, find a signal of maximal length (e.g., class1 = 'AA' → MAX1, class2 = 'SH' → MAX2, etc.).
2nd step: apply zero padding to each phone class in the training data using its corresponding MAX value. The resulting training data produce the same features as the original (features such as MFCCs are identical whether or not the signal is padded with zeros).
3rd step: apply resampling or zero padding to each phone class in the test data: for each phone signal of class i longer than MAXi, use resampling; otherwise use zero padding. The resampling used here is based on a linear-phase FIR filter, as described in [8].
4th step: apply zero padding, if necessary, to the training and test data to make sure that all phone signals have the same length.
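These steps can be sketched as follows (a minimal numpy sketch; np.interp stands in for the paper's linear-phase FIR resampler [8], and the signal lengths are illustrative):

```python
import numpy as np

def zero_pad(signal, target_len):
    """Append zeros so the signal reaches target_len samples."""
    return np.concatenate([signal, np.zeros(target_len - len(signal))])

def resample(signal, target_len):
    """Shrink or stretch a signal to target_len samples.

    Linear interpolation is used here as a simple stand-in for the
    linear-phase FIR resampler described in the paper [8]."""
    old = np.linspace(0.0, 1.0, num=len(signal))
    new = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new, old, signal)

def normalize(signal, max_len):
    """Test-data rule: resample if longer than MAX, else zero-pad."""
    if len(signal) > max_len:
        return resample(signal, max_len)
    return zero_pad(signal, max_len)

MAX = 1200  # maximal length found for this phone class (step 1)
short = np.ones(800)
long_ = np.ones(1500)
print(len(normalize(short, MAX)), len(normalize(long_, MAX)))  # 1200 1200
```

After normalization, every signal of the class has exactly MAX samples, so the subsequent MFCC extraction yields feature vectors of identical size.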
Figure 4 illustrates an example of zero padding and resampling applied to signals from TIMIT: after normalization (MAX = 1200), the three signals all have the same size of 1200 samples, and the MFCC calculation gives a vector of 130 features for each signal, according to the HTK configuration file we used.
C. SVM system
Phone classification is a multi-class classification problem in which each phone is a class. For k classes, we built k(k − 1)/2 classifiers using LibSVM with the one-against-one method. The architecture of the SVM system for training and testing is shown in figure 5. The features used for training and testing are the same as those used with the HMM-based system (MFCCs + deltas + energy). The normalization technique and cross-validation are used during both the training and test phases.
D. HMM/SVM hybrid system
The HMM/SVM hybrid system shown in figure 3 couples the HMM-based system with the SVM system; both are trained on the same data. The former segments the test signals and generates a transcription/label file (e.g., HTK generates one file containing the transcription of all test signals), while the latter uses this transcription/label file to extract each phone class from the test signals, normalize it, and reclassify it using the k(k − 1)/2 SVM classifiers prepared during the training phase.
IV. EXPERIMENTS AND RESULTS
A. Experiments setup
Experiments were conducted on the TIMIT database using Matlab, HTK, and LibSVM. The 1260 'sa' sentences were excluded since they could bias the results [9]. The remaining 5040 utterances were used for training and test: the training set comprises 3696 utterances from 671 speakers (male and female), and the test set comprises 1344 utterances from 168 speakers (male and female). Before conducting
Fig. 5. SVM System architecture using normalization.
training and test, the 61 phones were mapped to 46 phones as follows:
- Delete q; (pau, epi, h#) → sil; ux → uw; ax-h → ax; hv → hh; dx → d; nx → n; eng → ng.
- (bcl, b) → b; (dcl, d) → d; (dcl, jh) → jh; (gcl, g) → g; (kcl, k) → k; (tcl, t) → t; (tcl, ch) → ch; (pcl, p) → p.
- After analyzing the TIMIT database, the remaining bcl, dcl, gcl, kcl, tcl, and pcl closures were deleted, replaced by the phone closest to the pronunciation (e.g., dcl → d), or merged with the previous or next phone (e.g., (s, tcl) → s).
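The one-to-one part of these rules can be written as a simple lookup (a sketch; the context-dependent closure rules, such as (bcl, b) → b, need the neighboring phone and are not captured here):

```python
# One-to-one part of the 61 -> 46 TIMIT phone mapping described above.
# Context-dependent rules (closure+release pairs like (bcl, b) -> b)
# require inspecting the neighboring label and are omitted.
PHONE_MAP = {
    "pau": "sil", "epi": "sil", "h#": "sil",
    "ux": "uw", "ax-h": "ax", "hv": "hh",
    "dx": "d", "nx": "n", "eng": "ng",
}
DELETED = {"q"}

def map_phone(p):
    """Map a TIMIT phone label; None means the label is deleted."""
    if p in DELETED:
        return None
    return PHONE_MAP.get(p, p)

print(map_phone("ux"), map_phone("h#"), map_phone("q"), map_phone("aa"))
# uw sil None aa
```

Labels not covered by any rule (e.g., 'aa') pass through unchanged, which is what reduces the inventory from 61 to 46 classes.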
We therefore have 46 different classes, each corresponding to one phone. Three systems were tested:
1) HMM-based system: training and test were conducted using the HTK tools. First, the speech was pre-emphasized using a filter with transfer function 1 − 0.97z⁻¹. Thirteen Mel-Frequency Cepstral Coefficients (MFCCs) were calculated on a Hamming window of 25 ms, advanced by 10 ms per frame. An FFT is performed to calculate the magnitude spectrum of each frame, which is averaged into 24 triangular bins arranged at equal mel-frequency intervals; a cosine transform is then applied to obtain the 13 MFCCs. The normalized log energy is also computed, augmenting the 13 MFCCs to 14 features. Thirteen deltas and their energy were calculated as well, so feature extraction generates a vector of 28 features per frame. Each phone is modeled by a 5-state HMM with two non-emitting states (the 1st and 5th) and a mixture of Gaussians. We tried mixtures of 2, 4, 6, 8, and 12 Gaussians; the best phone accuracy was obtained with 8. This system is used as the reference against which the hybrid system is compared.
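The front end of this feature extraction (pre-emphasis, framing, windowing, magnitude spectrum) can be sketched as follows (a minimal numpy sketch of the standard pipeline; the mel filterbank averaging and cosine transform are omitted for brevity, and 16 kHz is TIMIT's sample rate):

```python
import numpy as np

def frames_spectra(signal, fs=16000, win_ms=25, step_ms=10):
    """Pre-emphasis, 25 ms Hamming frames every 10 ms, |FFT|.

    This is the front end of the MFCC pipeline described above;
    mel binning and the DCT step are omitted."""
    # Pre-emphasis: y[n] = x[n] - 0.97 * x[n-1]  (i.e., 1 - 0.97 z^-1)
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    win = int(fs * win_ms / 1000)    # 400 samples at 16 kHz
    step = int(fs * step_ms / 1000)  # 160 samples at 16 kHz
    n_frames = 1 + (len(emph) - win) // step
    hamming = np.hamming(win)
    spectra = []
    for i in range(n_frames):
        frame = emph[i * step : i * step + win] * hamming
        spectra.append(np.abs(np.fft.rfft(frame)))
    return np.array(spectra)

# A 1200-sample signal (75 ms) yields 1 + (1200-400)//160 = 6 frames.
sig = np.random.default_rng(0).standard_normal(1200)
print(frames_spectra(sig).shape)  # (6, 201)
```

The 10 ms frame step is what makes signals of equal length produce feature matrices of equal size, which is exactly the property the normalization technique relies on.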
2) SVM system: first, normalization was applied to the training and test sets. Then the same speech processing used with the HMM-based system was applied to the normalized data to extract features (i.e., 28 features (MFCCs + deltas + energy) per frame).
After this speech processing, LibSVM is used to generate the 1035 one-against-one classifiers from the normalized training set (46 phones → k = 46 → k(k − 1)/2 = 1035) as follows:
1st step: each feature was linearly scaled to [-1, +1]; the same scaling factors used on the training set were applied to the test set.
2nd step: use the RBF kernel K(x, y) = exp(−γ‖x − y‖²).
3rd step: find the optimal parameters C (the penalty parameter) and γ (the kernel parameter) using cross-validation. The optimal values found were (C, γ) = (32.0, 0.0078125).
4th step: use these optimal parameters to train the 1035 classifiers.
After those steps, the classification test consists of running the 1035 classifiers with LibSVM on the normalized test set.
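The scaling and grid-search steps can be sketched as follows (a minimal numpy sketch; the exponential grid follows LibSVM's usual practice, and the paper's optimum (C, γ) = (2⁵, 2⁻⁷) lies on it; the cross-validation scoring of each grid point is not reproduced here):

```python
import numpy as np

def fit_scaler(X_train):
    """Per-feature linear scaling to [-1, +1], fitted on training data.

    The same (min, max) factors must be reused on the test set."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard constant features
    return lambda X: 2.0 * (X - lo) / span - 1.0

# Exponentially growing (C, gamma) grid, as commonly used with LibSVM;
# each pair would be scored by cross-validation and the best kept.
grid = [(2.0 ** c, 2.0 ** g) for c in range(-5, 16, 2)
                             for g in range(-15, 4, 2)]

# The optimum reported in the paper sits on this grid:
print((32.0, 0.0078125) in grid)  # True: (2**5, 2**-7)

X_train = np.array([[0.0, 10.0], [4.0, 30.0], [2.0, 20.0]])
scale = fit_scaler(X_train)
print(scale(X_train).min(), scale(X_train).max())  # -1.0 1.0
```

Reusing the training-set scaling factors on the test set is essential: fitting a fresh scaler on the test data would shift the features the classifiers were trained on.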
3) HMM/SVM hybrid system: this system couples the HMM-based system with the SVM system. The former is run on the test set, and the transcription/label file generated after phone recognition is fed to the latter. The SVM system uses this transcription/label file to extract each phone signal from the test set, normalize it, and run the 1035 classifiers to reclassify the test set.
B. Results
The HMM-based system was run on the test set; the best phone accuracy was 71.41%, obtained with a mixture of 8 Gaussians.
The SVM system was first run on the normalized test set segmented manually (i.e., extracted from the TIMIT training set); the accuracy was 95.82% using cross-validation with the optimal parameters (C, γ) = (32.0, 0.0078125).
The HMM/SVM hybrid system was then run on the test set: first, the test set was segmented automatically by the HMM-based system and the transcription/label file was sent to the SVM system; the SVM system then extracted the phone signals using this file, normalized them, and classified them. The accuracy obtained was 91.26%.
Comparing the results of the HMM-based system and the HMM/SVM hybrid system, we conclude that the hybrid system with the normalization technique outperforms the standard HMM system. Table 1 shows the recognition accuracy of the different systems.
Table 1. Phone recognition accuracy of the different systems
(Hamming window size = 25 ms)

System                                            Accuracy (%)
1) HMM: automatic segmentation                        71.41
2) SVM: manual segmentation + normalization           95.82
3) Hybrid HMM/SVM: automatic segmentation
   + normalization                                    91.26
Table 2 compares this system with some previously reported results on TIMIT phone classification using different approaches. The best previously reported result was that of [13].
V. CONCLUSION
We have shown in this paper the power of an HMM/SVM hybrid system, combined with a normalization technique, for classifying phones. The results are significant and encourage us to use this technique in our next work: speech recognition in the context of a spoken dialogue system.
Table 2. Reported results on TIMIT phone classification.

System                                                    Accuracy (%)
HMM [9]                                                       66.08
CDHMM [10]                                                    72.90
TRAPs, temporal context division + lattice rescoring [11]     79.04
GMMs trained as SVMs [12]                                     69.90
Deep Belief Networks [13]                                     79.30
HMM (this work)                                               71.41
HMM/SVM + Normalization (this work)                           91.26
REFERENCES
[1] R. Solera-Urena, J. Padrell-Sendra, D. Martin-Iglesias, A. Gallardo-
Antolin, C. Pelaez-Moreno and F. Diaz-de-Maria, “SVMs for
Automatic Speech Recognition: A Survey”, Springer, pp 190-216,
2007.
[2] J. S. Garofolo et al., "TIMIT Acoustic-Phonetic Continuous Speech Corpus", Philadelphia: Linguistic Data Consortium, 1993.
[3] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell,
D. Ollason, D. Povey, V. Valtchev, P.Woodland. “The HTK Book
for HTK Version 3.4”, March 2009.
[4] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for support vector
machines”, ACM Transactions on Intelligent Systems and
Technology, 2:27:1--27:27, 2011. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] B. Boser, I. Guyon, and V. Vapnik, "A training algorithm for optimal margin classifiers", in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, pp. 144–152, 1992.
[6] C. Cortes and V. Vapnik, "Support-vector networks", Machine Learning, 20:273–297, 1995.
[7] C.-W. Hsu and C.-J. Lin, "A comparison of methods for multi-class support vector machines", IEEE Transactions on Neural Networks, vol. 13, pp. 415–425, 2002.
[8] T.W. Parks and C.S. Burrus “Digital Filter Design”, John Wiley &
Sons, pp. 54-83, 1987.
[9] K. F. Lee and H. W. Hon, “Speaker-independent phone recognition
using hidden Markov models”, IEEE Trans. Acoust., Speech,
Signal Process., vol. 37, no. 11, pp. 1641–1648, Nov. 1989.
[10] L.F. Lamel and J.L. Gauvain “High Performance Speaker
Independent Phone Recognition using CDHMM”, Proceedings of
Eurospeech, Germany, September, 1993.
[11] S.M. Siniscalchi et al. “High-accuracy phone recognition by
combining high-performance lattice generation and knowledge
based rescoring”, IEEE International Conference on Acoustics,
Speech and Signal Processing, Hawaii, April 2007.
[12] F. Sha and L.K. Saul “Large margin Gaussian mixture modelling
for phonetic classification and recognition”, IEEE International
Conference on Acoustics, Speech and Signal Processing, France,
May 2006.
[13] A. Mohamed et al. “Acoustic Modeling using Deep Belief
Networks”, IEEE Transactions on Audio, Speech, and Language
Processing, 2011.