
Speaker Verification Using Boosted Cepstral Features with Gaussian Distributions

Ahmad Salman¹, Ejaz Muhammad², Khawar Khurshid³
¹,²National University of Sciences and Technology, Pakistan; ³Michigan State University, USA

Abstract- An effective yet simple approach for text-dependent speaker verification is presented in this paper. The basic idea employs the fundamentals of the Gaussian Mixture Model (GMM), a popular technique for speaker recognition in modern state-of-the-art systems. We introduce a novel technique for creating unique speaker models from spectral and prosodic features of speech, which are then boosted to obtain a final robust, discriminating speaker identity. A multi-class Adaptive Boosting (AdaBoost) algorithm is used for the classification of each speaker model. The Gaussian distributions of the Mel-Frequency Cepstral Coefficients (MFCCs) of each speaker are given an offset derived from the pitch information in the speech signals. A GMM combines the individual speaker models according to a set of mixing weights; our approach instead categorizes the speaker models into weak-learned sets, which are then linearly and optimally combined to form a strong classifier. Our algorithm achieves 98% correct speaker verification on a set of 16 speakers. On average, the false acceptance rate is 2% while the false rejection rate is 0%.

Index terms- Speech processing, speaker verification, AdaBoost, cepstral coefficients, Gaussian distribution.

I. INTRODUCTION

Automatic speaker identification (ASI) and automatic speaker verification (ASV) systems have always been demanding in terms of robustness and accuracy for modern state-of-the-art security applications. In speaker verification, the task is to decide whether a speaker corresponds to a particular voice model or not. The speaker with a known model is called the target speaker or claimant, while a speaker unknown to the system is called a background speaker or imposter. Beyond verification itself, the goal is to increase the reliability of the algorithm while applying different ASV techniques. Different approaches for speaker verification have been proposed so far. Some popular methods are dedicated to text-independent speaker recognition, including speaker verification using vector quantization [1] and Gaussian mixture models [2]. Other techniques employ text-dependent algorithms, e.g. dynamic time warping (DTW) [3] and hidden Markov models (HMM) [4].

Text-independent ASV systems do not take into account what was spoken but try to categorize the speaker's voice model in general. A text-dependent system uses the same text for training and testing, which implies the comparison of word models. Both text-independent and text-dependent algorithms use almost the same speech parameters as features to categorize each class of speakers; the difference usually lies in the type of classification technique.

F. Soong and A. Rosenberg proposed a speaker verification technique [1] using vector quantization (VQ). Vector quantization can be applied successfully to ASI/ASV systems because it avoids the difficulty of segmenting speech into phones. The VQ technique is usually used for text-independent ASV, in which frequently occurring feature entries represent a speaker and are used for training the codebooks. Recognition accuracy with test utterances of 10, 5 and 3 seconds was 96%, 87% and 79% respectively. The results showed that VQ can be applied efficiently to short utterances, thereby securing a place for it in text-dependent ASV systems as well.

Gaussian mixture models (GMMs) are a more advanced approach compared to VQ. GMMs are similar to codebooks in that clusters in the feature space are also estimated. In addition to mean vectors, the covariances of the clusters are computed, resulting in a more detailed speaker model, with a recognition accuracy of 96.8% reported in the work of Douglas A. Reynolds [2].

Dynamic time warping (DTW) assigns each training vector a particular class identity without any further processing. Each test vector is then aligned to each of the training sequences such that some minimum distance criterion is met. This type of classification is obviously suited to text-dependent ASV.

Another very successful approach is hidden Markov models (HMMs), which outclass the DTW and VQ techniques for text-dependent ASV. An HMM is a statistical model that can be viewed as a combination of the DTW and GMM approaches, and it represents a more sophisticated model of a speaker's utterance. Nevertheless, the DTW approach sometimes performs better than HMMs when the training data set is insufficient.

In this paper we use the fundamental idea of GMMs to assign each speaker a separate identity. Any ASV process can be divided into two main tasks: one is feature extraction from the speech signals; the other is the classification of the extracted features, performed by some learning algorithm. For features we use spectral as well as prosodic features of speech. The spectral features involve the calculation of filter-bank Mel-Frequency Cepstral Coefficients (MFCCs), which are among the most reliable features used in almost every ASI/ASV system.



We observed that the average pitch period for a specific utterance by a speaker can be taken as a discriminating feature, as it remains almost the same across other similar utterances. Therefore the MFCC clusters for the speech clips of a particular speaker can be viewed as Gaussian distributions with specific means and variances. The samples of these distributions, when collected over time frames, resemble each other for a fixed utterance. The final features are formed by giving these distributions offsets equal to weighted pitch periods calculated for each speech clip. The features of each speaker are learned by the AdaBoost classifier. We employ multi-class AdaBoost as a combination of binary AdaBoost models, each of which discriminates between two speakers; therefore each speaker can be compared with every other speaker in a separate binary model. The process of feature extraction and classification is explained in detail in the upcoming sections.

We organize this paper as follows. In Section II, we present the principles and methodology of the feature extraction process, followed by its implementation. Section III explains the mathematical background of the AdaBoost algorithm in our scenario. Results and comparisons are given in Section IV, followed by conclusions in Section V.

II. FEATURE EXTRACTION

The aim of the feature extraction procedure is to collect the necessary information from the speech data in such a way that a pattern recognition process can make class distinctions easily. The features of a particular speech clip for a person act like an identity mark, much like his fingerprints. The feature extraction process should reduce the dimensionality of the data as much as possible to avoid the 'curse of dimensionality'.

In our work, we start with mel-frequency cepstral coefficients (MFCCs) as the basic feature. MFCCs can be regarded as 'standard' features for speech and speaker recognition systems. First, the speech signal is cut into overlapping frames of equal length. The speech signal was sampled at 8 kHz, and each frame is 20 ms long with an overlap of 10 ms. Each frame in the time domain is converted into an MFCC vector. These feature vectors represent the cepstral properties of the original signal. MFCCs are calculated using the triangular filter bank procedure [5]. Fig. 1 shows the overall procedure of MFCC calculation, and Fig. 2 shows the frequency response of the filter bank.

Each speaker is asked to speak a particular word at least four times, so each clip gives an MFCC matrix in which each column is the MFCC vector of a different frame. The MFCCs are computed with a pre-emphasis factor of 0.97. After the MFCC calculation, Cepstral Mean Subtraction (CMS) and RASTA processing are applied to overcome linear channel distortions. We then find the Gaussian distribution, through the mean and variance, of a particular cepstrum cluster, taking one cepstrum from each of the four MFCC matrices. The samples of these distributions, when collected over time, give a particular pattern.

We used MFCC-2 and MFCC-3 as the fundamental features; this is shown in Fig. 3 for MFCC-2. The MFCC-2 and MFCC-3 features are then averaged over the four utterances of each of the 16 speakers.
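As an illustration only, the following Python sketch shows one way to compute per-frame MFCCs with the settings stated above (8 kHz sampling, 20 ms frames, 10 ms hop, pre-emphasis 0.97, CMS) using the librosa library; the function names and the choice of 13 coefficients are ours, RASTA filtering is omitted for brevity, and this is not the authors' code.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do


def mfcc_features(wav_path, sr=8000, n_mfcc=13):
    """Per-frame MFCCs: 20 ms frames, 10 ms hop, pre-emphasis 0.97, CMS."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)          # pre-emphasis filter
    frame, hop = int(0.020 * sr), int(0.010 * sr)           # 160 / 80 samples
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame, hop_length=hop, n_mels=40)
    mfcc -= mfcc.mean(axis=1, keepdims=True)                 # cepstral mean subtraction
    return mfcc                                              # shape: (n_mfcc, n_frames)


def averaged_coeff_track(clips, coeff=2):
    """Average one cepstral coefficient (e.g. MFCC-2) frame by frame
    over several utterances of the same word by one speaker."""
    tracks = [mfcc_features(c)[coeff] for c in clips]
    n = min(len(t) for t in tracks)                          # align to the shortest clip
    return np.mean([t[:n] for t in tracks], axis=0)
```

In this sketch the frame-wise mean over the four utterances plays the role of the time-sampled Gaussian mean described above; how MFCC-2 is indexed depends on the MFCC implementation used.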

Fig. 1 Filterbank-based calculation procedure of MFCCs (speech signal, windowing, FFT, filter bank, 20·log, cepstral transform, cepstral vectors).

Fig. 2 Frequency response of the filterbank (40 filters); amplitude versus frequency (Hz).

Fig. 3 Time-sampled MFCC-2 Gaussian distributions for four utterances of a speaker.

The next task is to find the pitch periods from the utterances of each speaker. This is done using linear prediction (LP) analysis. The popular Levinson-Durbin algorithm is employed to calculate the prediction coefficients and the residual signal frame by frame.


The autocorrelation of the residual signal is then computed over lags of 40-120 samples, and the lag giving the maximum autocorrelation value is the pitch period estimate for that frame. In a similar fashion the pitch periods of all frames of a speech clip are derived. We observed that for the same word uttered by a speaker, the average pitch remains almost the same.
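A minimal sketch of this pitch estimator follows, assuming an LPC order of 10 (the paper does not state the order), the 40-120 sample lag range given above, and no voicing detection; all names are ours.

```python
import numpy as np
import librosa
from scipy.signal import lfilter


def frame_pitch_period(frame, order=10, min_lag=40, max_lag=120):
    """Pitch period (in samples) of one 20 ms frame at 8 kHz:
    Levinson-Durbin LPC, inverse filtering to get the residual,
    then the lag of the autocorrelation peak in [min_lag, max_lag]."""
    a = librosa.lpc(frame, order=order)          # prediction-error filter A(z)
    residual = lfilter(a, [1.0], frame)          # LP residual (inverse filtering)
    r = np.correlate(residual, residual, mode='full')[len(residual) - 1:]
    return int(np.argmax(r[min_lag:max_lag + 1]) + min_lag)


def average_pitch(frames):
    """Average pitch period over the frames of one speech clip."""
    return float(np.mean([frame_pitch_period(f) for f in frames]))
```

With 160-sample frames the 40-120 sample lag range corresponds to roughly 66-200 Hz, which covers typical speaking pitch.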

Now the four sampled distributions for MFCC-2 are averaged, as are the four distributions for MFCC-3. In this way we get two mean distributions, for MFCC-2 and MFCC-3 respectively. These distributions are given an offset equal to the average pitch period, and the two offset distributions from each speaker are passed to the classification step. Suitable weights or gain factors are used to enhance the difference between the pitch offsets. These steps are illustrated mathematically as follows.

Suppose $X = \{x_1, x_2, x_3, \ldots, x_K\}$ is the set of cepstral coefficient vectors, where $k = 1, 2, 3, \ldots, K$ indexes the speech clips of each speaker ($K = 4$ in our case). The Gaussian distribution with mean $\mu$ and variance $v$ fitted to the $K$ MFCC values of the $i$-th cluster, with $i = 1, 2, 3, \ldots, N$ and $N$ the total number of frames (clusters), is given in (1):

$$p_i(x_k) = \mathcal{N}(x_k;\, \mu_i, v_i) \qquad (1)$$

so the averaged MFCC-2 and MFCC-3 vectors are

$$\beta_m(p_i) = \{p_{m1}, p_{m2}, p_{m3}, \ldots, p_{mN}\} \qquad (2)$$

where $m = 2, 3$ stands for the MFCC-2 and MFCC-3 distributions respectively, and $p_{m1}, p_{m2}, \ldots, p_{mN}$ are the frame-wise (time-wise) samples of the $N$ distributions. Equation (2) gives the feature representation before the pitch offset; this is developed in more detail in the next section.

III. ADABOOST CLASSIFICATION

As mentioned in the previous sections, the performance of any ASI/ASV system depends critically on the type of learning algorithm used. In our work, although we claim to utilize the basic idea of GMM-based ASV, we did not use its classification technique. Once a GMM is established from the individual Gaussian models of the features, an imposter speaker is normally judged by a log-likelihood ratio (LLR) function.

In this paper we adopt an AdaBoost-based algorithm, inspired by [6], to decide whether a speaker is actually a claimant or an imposter. We derive a binary discriminator for the features of each speaker that is accurate and fast and uses only the simple MFCC distributions and a pitch measure for speaker classification. Results show that without any prior knowledge of the signal structure, we can reach an accuracy of up to 98%. This discussion is organized into the following subsections.

A. Training Feature

For a good binary discriminator we have to find a training feature that separates the two classes (the features of two different speakers) with the largest possible margin. We use the Gaussian distributions of the MFCCs, as explained in Section II, along with the weighted pitch periods of the respective speech signals in the training set, i.e. (2) finally becomes

$$\beta_m(p_i) = \lambda P_{avg} + \{p_{m1}, p_{m2}, p_{m3}, \ldots, p_{mN}\} \qquad (3)$$

where $P_{avg}$ is the average pitch value of a speech clip and $\lambda$ is a gain factor in the range 0.5-10. This gain factor helps to amplify the difference between the features in the training set.
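For concreteness, a short sketch of how the training feature of (3) might be assembled; the function names and the default gain of 2.0 are our choices within the stated 0.5-10 range, not values given in the paper.

```python
import numpy as np


def pitch_offset_track(mean_track, avg_pitch, gain=2.0):
    """Eq. (3): add a gain-weighted average-pitch offset to the frame-wise
    mean MFCC track of one speaker."""
    return gain * avg_pitch + np.asarray(mean_track)


def speaker_feature(track_mfcc2, track_mfcc3, avg_pitch, gain=2.0):
    """One training vector per speaker: the offset MFCC-2 and MFCC-3 tracks stacked."""
    return np.concatenate([pitch_offset_track(track_mfcc2, avg_pitch, gain),
                           pitch_offset_track(track_mfcc3, avg_pitch, gain)])
```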

B. AdaBoost

First we should become familiar with some terminology:

* $x_i \in X$ ... $i$-th sample of the training vector $x$.
* $y_i \in Y$ ... label attributed to $x_i$: -1 for speaker-A and +1 for speaker-B.
* $h_t$ ... weak classifier.
* $D$ ... weights.
* $\varepsilon_t$ ... weighted error.
* $\alpha_t$ ... confidence measure.
* $H(x)$ ... final strong classifier.

AdaBoost is an ensemble (or meta-learning) method that constructs a classifier in an iterative fashion. In each iteration it uses the vote of a learning routine called a weak learner. The decision of a weak learner is weighted before the next iteration, with weights proportional to the correctness of the corresponding weak classifier. Precisely speaking, AdaBoost is an algorithm for constructing a strong classifier as a linear combination of weak classifiers:

$$f(x) = \sum_{t=1}^{T} \alpha_t h_t(x) \qquad (4)$$

The AdaBoost algorithm is given below.

Given: training set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$ where $x_i \in X$, $y_i \in Y = \{-1, +1\}$.
Initialize the weights $D_1(i) = 1/m$.
For $t = 1, \ldots, T$:

* Find the weak classifier $h_t$ that minimizes the weighted error $\varepsilon_t$, i.e.
  $$h_t = \arg\min_{h} \varepsilon_t = \sum_{i=1}^{m} D_t(i)\,[y_i \neq h(x_i)]$$
* If $\varepsilon_t \geq 1/2$ then stop.
* Set
  $$\alpha_t = \frac{1}{2}\ln\!\left(\frac{1 + r_t}{1 - r_t}\right), \qquad r_t = \sum_{i=1}^{m} D_t(i)\, h_t(x_i)\, y_i$$
* Update the weights:
  $$D_{t+1}(i) = \frac{D_t(i)\exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t}, \qquad Z_t = 2\sqrt{\varepsilon_t(1 - \varepsilon_t)} \ \text{(a normalizing factor)}$$

Output the final classifier:

$$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$$

Looking at the algorithm, we can see that the weights of wrongly classified examples are increased and those of correctly classified examples are decreased. In this way more weight is given to the misclassified data and less to the correctly classified data; the weight acts like an upper bound on the error of a given example, and if $\varepsilon_t < 1/2$, an improvement in the next iteration is guaranteed. After training, the AdaBoost algorithm yields a model containing the confidence measures $\alpha_t$, the error bound, the weighted errors $\varepsilon_t$, the normalizing factors $Z_t$, the weights and the iteration rules. This model is then used to decide whether a speaker is a claimant or an imposter: each test utterance is first converted to the form of (3) and then classified. So far we have discussed the AdaBoost algorithm in its binary form. For a complete ASV system, where the number of speakers to be classified may be more than two, we have to implement AdaBoost in its multi-class form. Binary AdaBoost can be converted into multi-class form in two ways. One is the one-against-all technique, in which classification is performed between each class and the remaining ones. The other is the one-against-one strategy, in which a classification model is created between each pair of classes. For our work we used the former technique. Fig. 4 shows the multi-class strategy of AdaBoost.
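To make the training loop concrete, here is a compact sketch of binary AdaBoost using one-dimensional decision stumps as the weak learners; the paper does not specify its weak learner, so the stump choice and all names here are illustrative only.

```python
import numpy as np


def train_stump(X, y, D):
    """Best decision stump (feature, threshold, polarity) under example weights D."""
    n, d = X.shape
    best = (0, 0.0, 1, np.inf)                    # (feature, threshold, polarity, error)
    for j in range(d):
        for thresh in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thresh) >= 0, 1, -1)
                err = D[pred != y].sum()
                if err < best[3]:
                    best = (j, thresh, pol, err)
    return best


def adaboost_train(X, y, rounds=100):
    """Binary AdaBoost with stump weak learners; labels y are in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                       # initial uniform weights D_1(i) = 1/m
    model = []                                    # list of (alpha, feature, threshold, polarity)
    for _ in range(rounds):
        j, thresh, pol, err = train_stump(X, y, D)
        if err >= 0.5:                            # weak learner no better than chance: stop
            break
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1.0 - err) / err)   # confidence of this weak classifier
        pred = np.where(pol * (X[:, j] - thresh) >= 0, 1, -1)
        D *= np.exp(-alpha * y * pred)            # up-weight mistakes, down-weight hits
        D /= D.sum()                              # normalization (the Z_t factor)
        model.append((alpha, j, thresh, pol))
    return model


def adaboost_score(model, x):
    """Signed score f(x); sign(f(x)) is the strong classifier H(x)."""
    return sum(a * (1 if p * (x[j] - t) >= 0 else -1) for a, j, t, p in model)
```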

C. Example

In our experiments, 16 different speakers uttered the same word four times each. The purpose of the repetition is to account for the pitch and gain variations each time a person speaks. The next step is feature extraction: each speech clip is converted into MFCCs and finally brought to the form of (3). For classification, each AdaBoost model represents two speakers with four clips from each. Features of speaker-1 are labeled '+1' and features of speaker-2 are labeled '-1'; the features together with their labels constitute a complete training set, which is then used to train the AdaBoost algorithm. All the other models for the other speakers are created in the same fashion (Fig. 4) to form a complete ASV system. An unknown speaker first claims the identity of a speaker known to the ASV system; the models related to that particular identity are activated and give their decisions in favor of or against the unknown speaker.
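A sketch of how the activated binary models could be combined into a verification decision, reusing the adaboost_score helper from the previous sketch; the paper does not state the voting or thresholding rule, so the unanimity default below, the dictionary layout, and the assumption that the claimed speaker is the '+1' class in his own models are ours.

```python
import numpy as np


def verify(claimed_id, test_feature, pairwise_models, min_votes=1.0):
    """Accept the claim if the claimed speaker wins at least a fraction
    `min_votes` of his binary AdaBoost models (decision rule assumed).

    pairwise_models[claimed_id] maps each other speaker id to the AdaBoost
    model trained with the claimed speaker labeled +1."""
    models = pairwise_models[claimed_id]
    votes = [adaboost_score(m, test_feature) > 0 for m in models.values()]
    return np.mean(votes) >= min_votes            # True = claimant, False = imposter
```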

Fig. 4 Multiclass AdaBoost one-against-all technique for N speakers: the extracted speech features feed a bank of binary models, each comparing a pair of speakers.

Fig. 5 shows the complete training set for two speakers, based on (3). We can see that the distributions differ not only in their means and variances; the pitch offsets set the two classes distinctly apart from each other.

Fig. 5 Training set: features of Speaker-A (left) and features of Speaker-B (right).

The error measures of the AdaBoost algorithm are shown in Fig. 6, followed by the discriminated data in Fig. 7, both for one model (two speakers). For good classification the error bound should stay below 0.2 and the weighted error cannot be greater than 0.5.

IV. PERFORMANCE EVALUATION

Our proposed ASV scheme, Speaker Verification Using AdaBoost Classification (SVAC), is evaluated against several currently used ASV techniques: text-dependent continuous density hidden Markov models (CDHMM), text-dependent vector quantization (VQ), text-dependent DTW [7] and text-constrained GMM (TCGMM) [9].


All these techniques were tested in the same scenario, in which 16 different speakers recorded their voices four times each for a fixed 3-second utterance. All the algorithms were tested across the 16 speakers for correct acceptance and false acceptance, i.e. correct decisions for claimants and incorrect decisions for imposters. Table I shows the overall comparison of the techniques for correct decisions, where a decision is correct when a claimant is accepted and also when an imposter is rejected. The VQ and GMM speaker verification techniques are usually used for text-independent speaker verification, but when applied to the text-dependent case they can perform very well [8-9].

TABLE I
Speaker Verification Performance

Speaker Model    % Correct Identification (3 sec. utterance)
CDHMM            96 ± 1.4
VQ               96 ± 1.6
TCGMM            97 ± 1.4
DTW              96 ± 1.1
SVAC             97 ± 1.3

Fig. 6 Error graph of the AdaBoost algorithm: error bound (dotted line) and weighted error (solid line) versus iterations.

Fig. 7 Data discrimination: speaker-A (gray) and speaker-B (black), with the decision boundary.

V. CONCLUSIONS

A new speaker verification approach is proposed in this paper. It shows that, without resorting to complex modelling, the efficient use of simple features such as cepstra and the pitch of the speech signal is enough to keep the error rates of the ASV system significantly low. The attraction of the AdaBoost classifier in our approach is that it builds a set of binary models for each speaker, so if we want to introduce a new speaker to the ASV system, we can simply append his set of models. Our SVAC approach is valid for the text-dependent scenario, but it would be interesting to modify it for the text-independent case in the future.

VI. ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments.



REFERENCES

[1] F. Soong, A. Rosenberg, L. Rabiner, and B. Juang, "A Vector Quantization Approach to Speaker Recognition," Proc. Int. Conf. Acoustics, Speech and Signal Processing, vol. 1, pp. 387-390, Tampa, FL, 1985.

[2] D. A. Reynolds, "Speaker Identification and Verification Using Gaussian Mixture Speaker Models," Speech Communication, vol. 17, pp. 91-108, Aug. 1995.

[3] S. Furui, "Cepstral Analysis Technique for Automatic Speaker Verification," IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-29, pp. 254-272, 1981.

[4] J. J. Webb and E. L. Rissanen, "Speaker Identification Experiments Using HMMs," Proc. ICASSP-93, vol. 2, pp. 387-390, 1993.

[5] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences," IEEE Trans. Acoustics, Speech and Signal Processing, vol. ASSP-28, no. 4, pp. 357-366, Aug. 1980.

[6] P. Viola and M. J. Jones, "Robust Real-Time Object Detection," International Conference on Computer Vision, p. 747, 2001.

[7] K. Yu, J. Mason and J. Oglesby, "Speaker Recognition Using HMMs, DTW and VQ," Speech Research Group, University of Wales Swansea.

[8] J. T. Buck, D. K. Burton, and J. E. Shore, "Text-Dependent Speaker Recognition Using Vector Quantization," Proc. ICASSP-85, vol. 1, pp. 391-394, 1985.

[9] D. E. Sturim, D. A. Reynolds, R. B. Dunn, and T. F. Quatieri, "Speaker Verification Using Text-Constrained Gaussian Mixture Models," Proc. IEEE ICASSP, vol. 1, pp. 677-680, Orlando, FL, 2002.