
Phone Classification Using HMM/SVM System and Normalization Technique

Mohammed Sidi Yakoub¹, Roger Nkambou¹, Sid-Ahmed Selouani²

¹ UQAM, CP 8888, Succ. Centre-Ville, Montreal, QC, H3C 3P8, Canada
² LARIHS lab., Moncton University, UMCS Shippagan, 218J D Gauthier Boul., Shippagan, NB, E8S 1P1, Canada

[email protected], [email protected], [email protected]

Abstract—Support vector machines (SVM) were originally developed for binary classification and later extended to multi-class classification. Because of their power and their suitability for hard classification problems, we have chosen them for automatic speech recognition (ASR). The aim of this paper is to investigate the use of SVM multi-class classification coupled with HMM for TIMIT phones. SVM requires all training and test samples to have feature vectors of the same size. Because phone signals vary in length, even for the same phone, we apply a normalization technique, zero padding and resampling, to all data samples so that their feature vectors have the same size. After mapping the 61 TIMIT phones into 46 phones and conducting tests using LibSVM and HTK, we obtained a classification accuracy of 91.26% with the hybrid HMM/SVM system and 71.41% with the HMM-based system. These results show that the hybrid HMM/SVM system using the normalization technique outperforms the HMM-based system, improving recognition accuracy by 19.8 percentage points. They encourage us to use this hybrid system and normalization technique in our next work, in the context of a spoken dialogue system.

Keywords—Automatic speech recognition; Hidden Markov Model; LibSVM; multi-class classification; Mel Frequency Cepstral Coefficients; normalization; Support Vector Machines.

I. INTRODUCTION

Most automatic speech recognition (ASR) systems are built using hidden Markov models (HMM). However, HMM classifiers alone have not achieved recognition error rates low enough for many ASR applications. To tackle this problem, different approaches have been proposed: building the ASR system as a predictive Artificial Neural Network (ANN) system, as a Support Vector Machine (SVM) system, as a hybrid HMM/ANN system, or as a hybrid HMM/SVM system [1]. The aim of this paper is to investigate the use of SVM with a normalization technique to improve phone recognition.

This paper is organized as follows: Section 2 gives a brief review of SVM theory and of its application to linear and non-linear classification with two or more classes. Section 3 describes the overall ASR system used for phone classification: the HMM-based ASR system, the SVM system, the normalization technique and the hybrid HMM/SVM system. Section 4 describes the experiments and how tests were conducted using TIMIT [2], HTK [3] and LibSVM [4]. Finally, Section 5 concludes this work and outlines future work.

II. SUPPORT VECTOR MACHINES

SVM classifiers [5][6] are known to be successful and powerful in the field of machine learning. They are characterized by a maximum-margin solution, they can handle high-dimensional data, and their training converges to the minimum of the associated cost function. These characteristics make SVMs discriminative classifiers well suited to hard classification problems.

SVM is a popular classification technique: using training data, SVM generates a model that predicts the target class of test data. Originally, SVM was used to predict between two classes, which is called two-class classification, and it was extended to predict multiple classes, which is called multi-class classification. SVM performs linear classification as well as non-linear classification, using kernels to map its inputs into high-dimensional feature spaces. The next subsections give a brief overview of SVM theory.

A. Linear classification

A two-class classifier is built by finding a hyperplane that separates the data of the two classes, +1 and -1. The SVM classifier estimates whether an input vector x belongs to the first class ($y = +1$) or to the second class ($y = -1$). Many separators can be found, but the optimal one is the one with the maximum margin M; this separating hyperplane is called the maximum-margin linear classifier. The data points of the two classes located on the borders of that margin are the support vectors. The goal of the classification is therefore to define the margin of the linear classifier as the width by which the boundary can be widened before hitting the data points at the edges, and to maximize it. This implies that only the support vectors matter, and all other training samples can be ignored. Figure 1 illustrates this classification. This is the simplest kind of SVM.

Mathematically, the classification goal is to maximize the margin M while classifying all samples correctly. This linear SVM problem is formulated as the quadratic optimization problem given by equation (1).

Given a training set of sample-class pairs $(x_i, y_i)$, $i = 1, \ldots, l$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$,


the SVM requires the solution of equation (1). Quadratic optimization problems are a well-known class of programming problems, and many algorithms exist for solving them; these algorithms seek the $w$ and $b$ that satisfy equation (1).
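As an illustration, the minimal sketch below fits a linear two-class SVM on synthetic data and reads off $w$, $b$, the support vectors and the margin width $M = 2/\|w\|$. It uses scikit-learn's SVC, which is built on LIBSVM [4]; this tooling and the data are assumptions of the sketch, not the setup used in the paper.

    # Minimal sketch: linear two-class SVM on synthetic data.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) - 2, rng.randn(20, 2) + 2])  # two separable clouds
    y = np.array([-1] * 20 + [+1] * 20)

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # Only the support vectors determine the decision function w^T x + b.
    print("support vectors:\n", clf.support_vectors_)
    print("w =", clf.coef_[0], " b =", clf.intercept_[0])
    print("margin width M =", 2 / np.linalg.norm(clf.coef_[0]))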

Fig. 1. Linear classification of two classes: the separating hyperplane $w^T x + b = 0$, the margin boundaries $w^T x + b = \pm 1$ of width M, and the support vectors on those boundaries.

Equation (1) works when the data is clean. If the data is noisy, equation (1) is replaced by equation (2), where the penalty parameter $C$ controls overfitting and slack variables $\xi_i$ allow the misclassification of noisy samples. However, even the formulation in equation (2) does not work properly when the data samples are too complex to be separated linearly; in that case, nonlinear classification is needed.

B. Nonlinear classification

Equation (2) is not enough to formulate the classification problem when the data is too complex and nonlinear. The idea behind nonlinear classification is to map the original input space into a feature space of higher dimensionality in which the classes of the training set become separable. The nonlinear case is given by equation (3), where the nonlinear function $\phi(x_i)$ performs that mapping. Figure 2 illustrates this classification.

The solution of this quadratic optimization problem relies on the function $\phi(x_i)$ (and on other parameters omitted here for simplicity), which may be impossible to evaluate explicitly since the feature space can be infinite-dimensional. However, we do not need to know $\phi(x_i)$ itself; we only need the dot product $\phi(x_i)^T \phi(x_j)$, which can be evaluated through a kernel function $K(x_i, x_j)$.

Different kernel functions exist: the simple linear kernel, the radial basis function (RBF) kernel $K(x, y) = \exp(-\gamma \|x - y\|^2)$, the polynomial kernel and the sigmoid kernel. In our work, the RBF kernel is used.
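For concreteness, the kernel value can be evaluated directly from the two input vectors without ever forming $\phi(x)$; a minimal sketch follows (the vectors are arbitrary, and the $\gamma$ value anticipates the one found in Section IV):

    # Minimal sketch: evaluate the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2)
    # directly from the inputs, without forming the mapping phi.
    import numpy as np

    def rbf_kernel(x, y, gamma):
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return np.exp(-gamma * np.dot(diff, diff))

    print(rbf_kernel([1.0, 2.0, 3.0], [1.5, 1.0, 2.5], gamma=0.0078125))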

C. Multi-class classification

SVM was introduced first for two-class classification and was later generalized to multi-class classification. Two well-known approaches [7] to multi-class classification combine a number of binary classifiers: one-against-one and one-against-the-rest. Let k be the number of classes to classify. The first approach constructs k(k-1)/2 SVM classifiers, each trained on the data of one pair of classes, so that each class is confronted with every other class separately (one-against-one). The second approach constructs k SVM classifiers, each comparing one class against all the rest (one-against-the-rest). In our work, one-against-one is used.

To summarize, the quadratic optimization problems referenced above are the following. In the separable linear case, the constraints $w^T x_i + b \ge +1$ if $y_i = +1$ and $w^T x_i + b \le -1$ if $y_i = -1$ are equivalent to $y_i(w^T x_i + b) \ge 1 \;\; \forall i$, and maximizing the margin $M = 2/\|w\|$ is equivalent to minimizing $\Phi(w) = \frac{1}{2} w^T w$:

$\min_{w,b} \; \Phi(w) = \frac{1}{2} w^T w \quad$ subject to $\quad y_i (w^T x_i + b) \ge 1, \;\; \forall i \qquad (1)$

$\min_{w,b,\xi} \; \Phi(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad$ subject to $\quad y_i (w^T x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i \qquad (2)$

$\min_{w,b,\xi} \; \Phi(w) = \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad$ subject to $\quad y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i \qquad (3)$

D. LibSVM library

LibSVM [4] is a library implementing SVM that uses the one-against-one method for multi-class classification.


Fig. 2. Nonlinear classification of two classes: the mapping $x_i \to \phi(x_i)$ makes the classes linearly separable in the feature space.

It includes efficient multi-class classification, probability estimates, various kernels (e.g., linear, polynomial, radial basis function (RBF), sigmoid), automatic model selection with contour plots of the cross-validation accuracy, etc. In LibSVM, the choice of one-against-one rather than one-against-the-rest is justified by the fact that training is shorter with the former while the performance of the two methods is comparable [7].

With k classes, the one-against-one method yields k(k-1)/2 binary models (classifiers) to train. Two techniques can be used to find the optimal classification parameters (i.e., the penalty parameter and the kernel parameters): 1) selecting the optimal parameters of each decision function separately for every pair of classes, so that each model has its own parameters; or 2) using the same parameters for all models and keeping the values that give the highest overall performance. LibSVM uses the latter, since it was shown in [7] that the two techniques give similar performance.
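As a minimal sketch of one-against-one training with shared parameters, scikit-learn's SVC (also built on LIBSVM and using one-against-one internally) can stand in for the LibSVM command-line tools; the data below is synthetic and the tooling is an assumption of the sketch:

    # Minimal sketch: one-against-one multi-class SVM with shared (C, gamma).
    # For k classes, k*(k-1)/2 pairwise binary classifiers are trained.
    import numpy as np
    from sklearn.svm import SVC

    k = 4                                      # 46 classes in the paper
    rng = np.random.RandomState(0)
    X = rng.randn(200, 10)                     # synthetic feature vectors
    y = rng.randint(0, k, size=200)            # synthetic class labels

    clf = SVC(kernel="rbf", C=32.0, gamma=0.0078125,
              decision_function_shape="ovo").fit(X, y)

    # One decision value per pairwise classifier: k*(k-1)/2 = 6 columns here.
    print(clf.decision_function(X[:1]).shape)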

III. THE OVERALL SYSTEM

Figure 3 shows the HMM/SVM hybrid system we propose to classify TIMIT phones.

A. HMM-based system: reference system

The HMM-based speech recognition system is built using the HTK tools [3]. Each phoneme is modeled by a 5-state HMM with two non-emitting states (the 1st and 5th) and a mixture of 8 Gaussians. Mel-Frequency Cepstral Coefficients (MFCCs), delta coefficients and cepstral pseudo-energy are computed on the TIMIT database [2] and used to train and test the system. We use this HMM-based system as the reference system for comparison with the hybrid system.

Fig. 3. The HMM/SVM hybrid system using normalization: signal → HMM-based system → transcription/label files → normalization → SVM system → classification result.

B. Normalization technique

SVM requires that the features used for classification have the same size for all samples in the training and test data. To respect this constraint, we normalize the samples by applying zero padding and/or resampling to each phone signal; these two techniques allow us to normalize the whole data set. To do so, we search, for each phone class, for the signal of maximal length (MAX), and we apply the two techniques to the other signals of the same class to bring their size to MAX. The normalization of the training and test data is performed according to the following steps:



1st step: Search, for each phone class in the training data, for the signal of maximal length (e.g., class1 = 'AA' → MAX1, class2 = 'SH' → MAX2, etc.).

2nd step: Apply zero padding to each phone class in the training data using its corresponding MAX value. The resulting training data produce the same features as the original (features such as MFCCs extracted from a signal are the same whether or not it is padded with zeros).

3rd step: Apply resampling or zero padding to each phone class in the test data: for each phone signal of class i whose length is greater than MAXi, use resampling; otherwise, use zero padding. The resampling technique used here is based on a linear-phase FIR filter, as described in [8]. (A sketch of both operations is given after these steps.)

4th step: Apply zero padding, if necessary, to the training and test data to make sure that all phone signals have the same length.
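The sketch below shows the two operations for 1-D NumPy signal arrays, an assumption about the data layout; scipy.signal.resample_poly resamples through a linear-phase FIR filter, in the spirit of [8]. The per-class MAX search of the 1st step is shown over a hypothetical train_signals dictionary mapping each class to its list of signals.

    # Minimal sketch of the normalization, assuming 1-D NumPy signals.
    import numpy as np
    from scipy.signal import resample_poly

    def normalize_length(signal, max_len):
        """Bring a phone signal to max_len samples: zero-pad short signals,
        resample longer ones with a linear-phase FIR filter (cf. [8])."""
        n = len(signal)
        if n < max_len:
            return np.pad(signal, (0, max_len - n))            # zero padding
        if n > max_len:
            return resample_poly(signal, up=max_len, down=n)   # FIR resampling
        return signal

    # 1st step: maximal signal length per phone class in the training data.
    train_signals = {"AA": [np.ones(900), np.ones(1200)]}      # hypothetical data
    max_len = {c: max(len(s) for s in sigs) for c, sigs in train_signals.items()}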

Fig. 4. Normalization technique: zero padding and resampling applied to signals from TIMIT, with the maximal signal as reference.

Figure 4 illustrates an example of zero padding and resampling applied to signals from TIMIT: after normalization (MAX = 1200), the three signals have the same size of 1200 samples, and the MFCC calculation yields a vector of 130 features per signal, according to the HTK configuration file we used.

C. SVM system

Phone classification is a multi-class classification problem where each phone is a class. For k classes, we built k(k-1)/2 classifiers using LibSVM with the one-against-one method. The architecture of the SVM system for training and testing is shown in figure 5. The features used for training and testing are the same as those used with the HMM-based system (MFCCs + deltas + energy). The normalization technique and cross-validation are used during the training and test phases.

D. HMM/SVM hybrid system

The HMM/SVM hybrid system shown in figure 3 couples the HMM-based system with the SVM system, both trained on the same data. The former segments the test signals and generates a transcription/label file (e.g., HTK generates one file containing the transcription of all test signals), whereas the latter uses this transcription/label file to extract each phone class from the test signals, normalize it, and reclassify it using the k(k-1)/2 SVM classifiers prepared during the training phase.
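A high-level sketch of one hybrid pass over a test utterance is given below. It reuses normalize_length from the sketch in Section III.B, and every other helper (read_htk_labels, extract_features, svm_predict) is a hypothetical placeholder for the HTK and LibSVM tooling described above, not an actual API.

    # High-level sketch of the hybrid pass over one test utterance.
    # read_htk_labels, extract_features and svm_predict are hypothetical
    # placeholders for the HTK and LibSVM tooling described in the text.
    def hybrid_classify(utterance, label_file, svm_models, max_len):
        results = []
        for start, end, hmm_phone in read_htk_labels(label_file):    # HMM segmentation
            segment = utterance[start:end]                           # phone signal
            segment = normalize_length(segment, max_len[hmm_phone])  # Section III.B
            feats = extract_features(segment)                        # MFCCs + deltas + energy
            results.append(svm_predict(svm_models, feats))           # SVM reclassification
        return results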

IV. EXPERIMENTS AND RESULTS

A. Experiments setup

Experiments were conducted on the TIMIT database using Matlab, HTK and LibSVM. The 1260 'sa' sentences were excluded since they could bias the results [9]. The remaining 5040 utterances were used for training and testing: the training set is composed of 3696 utterances from 462 speakers (male and female), and the test set is composed of 1344 utterances from 168 speakers (male and female).


Before conducting training and testing, the 61 phones were mapped to 46 phones as follows (a partial sketch of this mapping is given below):

- Delete q; (pau, epi, h#) → sil; ux → uw; ax-h → ax; hv → hh; dx → d; nx → n; eng → ng.
- (bcl, b) → b; (dcl, d) → d; (dcl, jh) → jh; (gcl, g) → g; (kcl, k) → k; (tcl, t) → t; (tcl, ch) → ch; (pcl, p) → p.
- After analyzing the TIMIT database, the remaining bcl, dcl, gcl, kcl, tcl, pcl were deleted, replaced by the phones closest to the pronunciation (e.g., dcl → d), or merged with the previous or next phone (e.g., (s, tcl) → s).
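The partial sketch below expresses the unconditional rules as a Python dictionary; the context-dependent closure cases (e.g., (dcl, jh) → jh, (s, tcl) → s) would need sequence-level handling and are not shown.

    # Partial sketch of the 61 -> 46 TIMIT phone mapping described above.
    # Closure+release pairs such as (dcl, jh) -> jh or (s, tcl) -> s are
    # context dependent and need sequence-level handling (not shown).
    PHONE_MAP = {
        "pau": "sil", "epi": "sil", "h#": "sil",
        "ux": "uw", "ax-h": "ax", "hv": "hh",
        "dx": "d", "nx": "n", "eng": "ng",
    }
    DELETED = {"q"}

    def map_phone(phone):
        if phone in DELETED:
            return None                 # deleted from the transcription
        return PHONE_MAP.get(phone, phone)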

Therefore, we have 46 different classes, each corresponding to one phone. Three systems were tested:

1) HMM-based system: training and testing were conducted using the HTK tools. First, the speech is pre-emphasized using a filter with transfer function 1 - 0.97z⁻¹. Thirteen Mel Frequency Cepstral Coefficients (MFCCs) are calculated on a 25 msec Hamming window advanced by 10 msec per frame: an FFT is performed to compute the magnitude spectrum of each frame, which is averaged into 24 triangular bins arranged at equal Mel-frequency intervals, and a cosine transform is applied to obtain the 13 MFCCs. The normalized log energy is also computed, augmenting the 13 MFCCs to 14 features. Thirteen deltas and the delta energy are calculated as well. This feature extraction generates a vector of 28 features for each frame.
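A sketch of an equivalent front end with the python_speech_features package is shown below; this tooling is an assumption (the paper used HTK), and the log energy here only approximates HTK's normalized log energy.

    # Sketch of the front end with python_speech_features (the paper used
    # HTK; the log energy below only approximates HTK's normalized energy).
    import numpy as np
    import scipy.io.wavfile as wav
    from python_speech_features import mfcc, delta, sigproc

    rate, signal = wav.read("phone.wav")                 # hypothetical input file
    cep = mfcc(signal, samplerate=rate,
               winlen=0.025, winstep=0.010,              # 25 ms window, 10 ms step
               numcep=13, nfilt=24,                      # 13 MFCCs, 24 Mel bins
               preemph=0.97,                             # pre-emphasis 1 - 0.97z^-1
               appendEnergy=False, winfunc=np.hamming)   # Hamming window

    frames = sigproc.framesig(signal, 0.025 * rate, 0.010 * rate)
    log_e = np.log(np.maximum(np.sum(frames.astype(float) ** 2, axis=1), 1e-10))

    static = np.hstack([cep, log_e[:, None]])            # 13 MFCCs + energy = 14
    features = np.hstack([static, delta(static, 2)])     # + 14 deltas = 28 per frame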

Each phone is modeled by a 5-state HMM with two non-emitting states (the 1st and 5th) and a mixture of Gaussians. We tried mixtures of 2, 4, 6, 8 and 12 Gaussians; the best phone accuracy was obtained with 8. This system is used as the reference system against which the hybrid system is compared.

2) SVM system: first, the normalization is applied to the training and test sets. Then, the same speech processing used with the HMM-based system is applied to the normalized data to extract features (i.e., 28 features (MFCCs + deltas + energy) per frame). After this speech processing, LibSVM is used to generate the 1035 one-against-one classifiers from the normalized training set (46 phones → k = 46 → k(k - 1)/2 = 1035) as follows:

1st step: Each feature is linearly scaled to [-1, +1]; the scaling factors computed on the training set are also applied to the test set.

2nd step: The RBF kernel $K(x, y) = \exp(-\gamma \|x - y\|^2)$ is used.

3rd step: The optimal parameters C (the penalty parameter) and γ (the kernel parameter) are found using cross-validation; the optimal values found were (C, γ) = (32.0, 0.0078125).

4th step: The optimal parameters found previously are used to train the 1035 classifiers.

After those steps, the classification test consists of running the 1035 classifiers with LibSVM on the normalized test set (these steps are sketched below).
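The four steps can be sketched with scikit-learn standing in for the LibSVM command-line tools (an assumption), on synthetic stand-ins for the phone feature vectors; the grid bounds follow the usual LIBSVM practice of searching powers of two.

    # Sketch of steps 1-4 with scikit-learn in place of the LibSVM tools.
    # X_train/y_train/X_test are synthetic stand-ins for the phone features.
    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X_train, y_train = rng.randn(120, 28), rng.randint(0, 3, 120)
    X_test = rng.randn(40, 28)

    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)   # 1st step: scale,
    X_train_s = scaler.transform(X_train)                       # reusing the same
    X_test_s = scaler.transform(X_test)                         # factors on test

    grid = GridSearchCV(SVC(kernel="rbf"),                      # 2nd step: RBF
                        param_grid={"C": [2.0 ** p for p in range(-5, 16, 2)],
                                    "gamma": [2.0 ** p for p in range(-15, 4, 2)]},
                        cv=5)                                   # 3rd step: CV search
    grid.fit(X_train_s, y_train)                                # 4th step: train
    print(grid.best_params_)      # the paper found C = 32.0, gamma = 0.0078125
    y_pred = grid.predict(X_test_s)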

Fig. 5. SVM system architecture using normalization. Training: signal → normalization → feature extraction → LibSVM + cross-validation → k(k-1)/2 one-against-one classifiers. Test: signal → normalization → feature extraction → LibSVM → classification result.


3) HMM/SVM hybrid system: this system couples the HMM-based system with the SVM system. The former is run on the test set, and the transcription/label file generated after phone recognition is fed to the latter. The SVM system uses the transcription/label file to extract each phone signal from the test set, normalize it, and run the 1035 classifiers to reclassify the test set.

B. Results

The HMM-based system was run on the test set; the best phone accuracy was 71.41%, obtained with the 8-Gaussian mixture.

The SVM system was run first on the normalized test set segmented manually (i.e., extracted from the TIMIT training set); the accuracy was 95.82%, using cross-validation with the optimal parameters (C, γ) = (32.0, 0.0078125).

The HMM/SVM hybrid system was run on the test set: first, the test set was segmented automatically by the HMM-based system and the transcription/label file was sent to the SVM system; the SVM system then extracted the phone signals using this file, normalized them and classified them. The accuracy obtained was 91.26%.

Comparing the results of the HMM-based system and the HMM/SVM hybrid system, we conclude that the hybrid system with the normalization technique outperforms the standard HMM system. Table 1 shows the recognition accuracy of the different systems.

Table 1. Phone recognition accuracy of the different systems (Hamming window size = 25 msec).

System                                                        Accuracy (%)
1) HMM: automatic segmentation                                71.41
2) SVM: manual segmentation, normalization                    95.82
3) Hybrid HMM/SVM: automatic segmentation, normalization      91.26

Table 2 compares our system with some previously reported results on TIMIT phone classification using different approaches; the best of these was reported by [13].

V. CONCLUSION

We have shown in this paper the power of an HMM/SVM hybrid system for classifying phones using the normalization technique. The results are significant and encourage us to use this technique in our next work, speech recognition in the context of a spoken dialogue system.

Table 2. Reported results on TIMIT phone classification.

System                                                        Accuracy (%)
HMM [9]                                                       66.08
CDHMM [10]                                                    72.90
TRAPs, temporal context division + lattice rescoring [11]     79.04
GMMs trained as SVMs [12]                                     69.90
Deep Belief Networks [13]                                     79.30
HMM (this work)                                               71.41
HMM/SVM + Normalization (this work)                           91.26

REFERENCES

[1] R. Solera-Urena, J. Padrell-Sendra, D. Martin-Iglesias, A. Gallardo-Antolin, C. Pelaez-Moreno and F. Diaz-de-Maria, "SVMs for Automatic Speech Recognition: A Survey", Springer, pp. 190-216, 2007.
[2] J. S. Garofolo et al., "TIMIT Acoustic-Phonetic Continuous Speech Corpus", Linguistic Data Consortium, Philadelphia, 1993.
[3] S. Young, G. Evermann, T. Hain, D. Kershaw, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev and P. Woodland, "The HTK Book for HTK Version 3.4", March 2009.
[4] C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines", ACM Transactions on Intelligent Systems and Technology, 2:27:1-27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[5] B. Boser, I. Guyon and V. Vapnik, "A Training Algorithm for Optimal Margin Classifiers", in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, pp. 144-152, 1992.
[6] C. Cortes and V. Vapnik, "Support-Vector Networks", Machine Learning, 20:273-297, 1995.
[7] C.-W. Hsu and C.-J. Lin, "A Comparison of Methods for Multi-class Support Vector Machines", IEEE Transactions on Neural Networks, 13:415-425, 2002.
[8] T. W. Parks and C. S. Burrus, "Digital Filter Design", John Wiley & Sons, pp. 54-83, 1987.
[9] K. F. Lee and H. W. Hon, "Speaker-Independent Phone Recognition Using Hidden Markov Models", IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 11, pp. 1641-1648, Nov. 1989.
[10] L. F. Lamel and J. L. Gauvain, "High Performance Speaker-Independent Phone Recognition Using CDHMM", Proceedings of Eurospeech, Germany, September 1993.
[11] S. M. Siniscalchi et al., "High-Accuracy Phone Recognition by Combining High-Performance Lattice Generation and Knowledge-Based Rescoring", IEEE International Conference on Acoustics, Speech and Signal Processing, Hawaii, April 2007.
[12] F. Sha and L. K. Saul, "Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition", IEEE International Conference on Acoustics, Speech and Signal Processing, France, May 2006.
[13] A. Mohamed et al., "Acoustic Modeling Using Deep Belief Networks", IEEE Transactions on Audio, Speech, and Language Processing, 2011.
