
IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 3, MARCH 2011 505

Speaker Verification With Feature-Space MAPLR Parameters

Donglai Zhu, Member, IEEE, Bin Ma, Senior Member, IEEE, and Haizhou Li, Senior Member, IEEE

Abstract—This paper studies a new technique that characterizes a speaker by the difference between the speaker and a cohort of background speakers in the form of feature-space maximum a posteriori linear regression (fMAPLR). The fMAPLR is a linear regression function that projects speaker-dependent features to speaker-independent ones, also known as an affine transform. It consists of two sets of parameters, bias vectors and transform matrices. The former, representing the first-order information, is more robust than the latter, the second-order information. We propose a flexible tying scheme that allows the bias vectors and the matrices to be associated with different regression classes, such that both parameters are given sufficient statistics in a speaker verification task. We formulate a maximum a posteriori (MAP) algorithm for the estimation of feature transform parameters, which further alleviates possible numerical problems. The fMAPLR parameters are then vectorized and compared via a support vector machine (SVM). We conduct the experiments on the National Institute of Standards and Technology (NIST) 2006 and 2008 Speaker Recognition Evaluation databases. The experiments show that the proposed technique consistently outperforms the baseline Gaussian mixture model (GMM)-SVM speaker verification system.

Index Terms—Feature transform, maximum a posteriori, speaker recognition, support vector machine (SVM).

I. INTRODUCTION

SPEAKER verification is a process that verifies a speaker's claimed identity by using the speaker's voice. Just like other pattern recognition tasks, speaker verification typically involves three common steps: feature extraction, speaker modeling, and a classification decision. Many state-of-the-art systems are based on cepstral features for their robustness and ease of use. In this paper, we continue to pursue the effective use of cepstral features in a new speaker modeling framework.

Successful speaker verification is typically carried out either by log-likelihood ratios of Gaussian mixture models (GMMs) [1], or by a discriminative distance derived from support vector machines (SVMs) [2]–[5]. The former is based on generative training of speaker models, such as maximum-likelihood (ML) or maximum a posteriori (MAP) estimation [6], while the latter is based on discriminative training of speaker models, such as GLDS-SVM [2] and bag of N-grams [3].

Manuscript received November 24, 2009; revised March 05, 2010; accepted May 12, 2010. Date of publication May 24, 2010; date of current version December 03, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Mark Gales.

The authors are with the Department of Human Language Technology, Institute for Infocomm Research, Singapore 138632 (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2010.2051269

Recent studies have taken advantage of both techniques. Among the most effective techniques are GMM-SVM and MLLR-SVM. GMM-SVM models a speaker by constructing a supervector of means derived from generative Gaussian components [4], while MLLR-SVM models a speaker by using the transform parameters in maximum-likelihood linear regression (MLLR) [5].

In SVM-based speaker modeling, one of the fundamental issues is to choose the SVM feature to represent a collection of cepstral features of a speaker. The SVM feature is often referred to as a supervector because it is a large vector formed by the concatenation of multiple smaller vectors, e.g., GMM means in GMM-SVM and MLLR transform rows in MLLR-SVM [4], [5]. The two methods differ in the following aspects. First, GMM-SVM adapts GMM means with the MAP technique while MLLR-SVM learns a small number of transforms with MLLR adaptation [7], [8]. Second, unlike the supervector in GMM-SVM that models the cepstral observations of a speaker, the supervector in MLLR-SVM models the difference between a speaker-dependent and a speaker-independent model. Third, a GMM supervector is derived from the means of a mixture of Gaussian components, while an MLLR supervector is derived from phone-dependent Gaussian components of hidden Markov models (HMMs). In short, from a speaker modeling point of view, GMM-SVM enjoys a simple and robust modeling process, whereas MLLR-SVM requires some prior knowledge about phone clusters and the phonetic transcripts of training data.

Several studies have attempted to use phone-independent MLLR parameters to circumvent the need for phonetic transcripts. In [9], MLLR is estimated based on a GMM and normalized with GMM parameters as the SVM kernel. Unfortunately, the results show that the GMM-based MLLR kernel method is not as effective as GMM-SVM. In [10], MLLR and constrained MLLR (CMLLR) are estimated based on a GMM, which does not yield better performance than GLDS-SVM. In general, these studies lead to the conclusion that the GMM-based MLLR kernel is not as effective as the HMM-based MLLR kernel, citing the data fragmentation problem, which is addressed in [5] by sharing the MLLR transforms across phone HMMs so that they are independent of the choice of words. We note that GMM models the overall cepstral distribution, which conflates speaker characteristics with the choice of words.

Studies also show that multiple classes of transforms are preferred because one transform is insufficient to discriminate speakers; e.g., eight classes of transforms are adopted for each gender in [5]. On the other hand, the number of regression classes and the tying of transforms are sensitive choices because the transform estimation may encounter numerical problems, such as a singular matrix inverse, due to insufficient adaptation data.


To overcome this issue, one way is to use dynamic regression class generation, which sets the number of classes in an unsupervised manner according to the adaptation data [11]. An alternative is to use the MAP estimate to train the transform, e.g., MAPLR [12], [13] and its feature-space version [14]. The difference between MAP and ML estimation lies in the assumption of an appropriate prior distribution of the parameters to be estimated. In ML estimation, we assume the parameters are fixed but unknown, without any prior knowledge; in MAP estimation, we assume the parameters belong to some prior probability density functions (pdfs), which has been proven effective in dealing with sparse training data.

The speaker model proposed in this paper is conceptually motivated by the MLLR-SVM work in the sense that both use affine transforms as parameters to model the speaker difference. In MLLR-SVM, the affine transform parameters are considered as unknown variables to be estimated from the statistics. From the speaker modeling point of view, a speaker is therefore modeled by such an affine transform. In this paper, we bring the idea of MLLR a step forward by considering the affine transform parameters as belonging to some prior pdfs. In this way, a speaker is truly modeled by a set of prior pdfs from which an affine transform is derived as an observation instance. In terms of training strategy, the proposed method has much in common with GMM-SVM because we formulate the training under the MAP estimation paradigm. Because the transforms are carried out on the cepstral features, as opposed to the Gaussian components, the method is called feature-space maximum a posteriori linear regression (fMAPLR), and the resulting speaker verification system is referred to as fMAPLR-SVM hereinafter.

Note that there are many more parameters in a transform matrix than in a bias vector. This often leads to numerical problems in the MLLR estimation of the transform matrix [15]. This lopsided parameter count prompts us to look into different tying schemes for the transform matrices and biases across Gaussian components [16], [17]. In other words, we allow transform matrices and bias vectors to be associated with separate regression classes [18]. Motivated by this idea, we started with a study that estimates fMAPLR parameters based on the universal background model (UBM) [19]. We further extended the work to a joint estimation of both the GMM parameters and the fMAPLR parameters [20]. In this paper, we conduct a comprehensive study of the fMAPLR-SVM framework. We present the MAP adaptation of fMAPLR parameters based on the GMM, which is jointly estimated with the prior pdfs of the fMAPLR parameters. Similar to speaker adaptive training (SAT) [21], we use the method of alternating variables to iteratively estimate both the hyperparameters and the fMAPLR parameters. The fMAPLR parameters are vectorized and normalized with rank normalization [5] to construct the SVM kernel.

Fig. 1 compares three SVM-based speaker verification techniques. With the speaker data, speaker-dependent (SD) parameters are estimated based on speaker-independent (SI) models. Note that MLLR-SVM and fMAPLR-SVM model the speaker difference while GMM-SVM models the speaker itself. On the other hand, both GMM-SVM and fMAPLR-SVM assume prior pdfs for speaker models.

Fig. 1. Diagram of three closely related techniques. (a) MLLR-SVM, (b) GMM-SVM, and (c) fMAPLR-SVM. MLLR-SVM and fMAPLR-SVM model the speaker difference while GMM-SVM models the speaker directly. GMM-SVM and fMAPLR-SVM are trained with the MAP method while MLLR-SVM is trained with ML.

In particular, fMAPLR-SVM allows the GMM and the prior pdfs of fMAPLR parameters to be jointly estimated on the background data, which yields a canonical GMM rather than the UBM.

Although fMAPLR-SVM is motivated by MLLR-SVM [5], it has several clear advantages, as summarized next. First, we decouple the regression classes between the transform matrices and bias vectors, thus allowing flexible tying of fMAPLR parameters. This is different from MLLR or CMLLR, where the transform matrix and bias vector in an affine transform always share the same regression class [15], [22]. Second, we estimate the transform parameters with the MAP criterion as opposed to the ML criterion in MLLR and CMLLR, thus alleviating numerical problems due to insufficient training data. Third, we define the transform on the feature space, which can be viewed as a joint transform between means and variances [15].

While GMM-SVM and fMAPLR-SVM both assume prior pdfs for speaker models, fMAPLR-SVM offers some practical advantages. First, the GMM and the prior pdfs of fMAPLR parameters can be jointly estimated on the background data, which leads to a canonical GMM rather than the UBM. Second, the transforms that are shared among Gaussian components can be efficiently estimated with limited adaptation data. In this paper, we only compare GMM-SVM and fMAPLR-SVM in experiments because one can easily develop an fMAPLR-SVM system based on a GMM-SVM system without additional resources, such as phonetic transcripts of data.

An important issue in speaker recognition is the intraspeaker intersession variability, such as channel effects. The variability can be alleviated in different domains. In the feature domain, the variability in feature vectors can be normalized by methods such as feature mapping [23]. In the model domain, model parameters are compensated by methods such as joint factor analysis for GMM [24], nuisance attribute projection (NAP) [25], and within-class covariance normalization [26] for SVM. Note that the model-domain methods can be implemented in the feature domain as well [27]. In the score domain,


scores are normalized by methods such as Hnorm [1] and Tnorm [28]. Since fMAPLR-SVM works under the SVM framework, we perform the NAP on the SVM kernel and the Tnorm on the SVM scores.

This paper is organized as follows. Section II formulates the feature transform, including the definition of fMAPLR parameters with their prior pdfs, and the estimation of parameters and hyperparameters. Section III describes the SVM with the fMAPLR supervector. Section IV reports the experimental results. Finally, we conclude in Section V.

II. SPEAKER-DEPENDENT FEATURE TRANSFORM

Let us assume that a speech utterance spoken by a speaker is represented by a sequence of feature vectors $X=\{x_1,\ldots,x_T\}$, where $x_t$ are $D$-dimensional vectors. We define the fMAPLR function that maps the speaker's feature vector $x_t$ to a speaker-independent feature vector $\hat{x}_t$ as follows:

$$\hat{x}_t = A_i x_t + b_j \qquad (1)$$

where $A_i$ $(i=1,\ldots,I)$ is a nonsingular $D\times D$ matrix and $b_j$ $(j=1,\ldots,J)$ is a $D$-dimensional vector. We consider $\Lambda=\{A_i,b_j \mid i=1,\ldots,I;\ j=1,\ldots,J\}$ as an fMAPLR parameter set. A speaker can then be characterized by $\Lambda$, i.e., a set of transform matrices and bias vectors.

Let us model $\hat{x}_t$ with a GMM

$$p(\hat{x}_t;\lambda)=\sum_{k=1}^{K} w_k\,\mathcal{N}(\hat{x}_t;\mu_k,\Sigma_k) \qquad (2)$$

which is defined by a set of parameters $\lambda=\{w_k,\mu_k,\Sigma_k\}_{k=1}^{K}$, where $K$ is the number of Gaussian components, $w_k$ are Gaussian mixture weights, $\mathcal{N}(\cdot)$ is a Gaussian pdf, $\mu_k$ are mean vectors, and $\Sigma_k$ are diagonal covariance matrices. The transform classes are associated with Gaussian components in the GMM by sharing the transform across mixture components. A centroid splitting algorithm with a Euclidean distance measure is used to map each component $k$ to two separate classes: a matrix class $i=\alpha(k)$ and a bias class $j=\beta(k)$. This is also called tying of parameters. In this way, the matrix parameters and bias parameters in an affine transform belong to two different classes. Typically we have $I\le K$ and $J\le K$. It is noted that a transform matrix has $D^2$ parameters while a bias vector only has $D$ parameters. The choice of $I$ and $J$ thus regulates the number of parameters. By allowing different combinations of $I$ and $J$, we are able to define a speaker model with the desired number of parameters.
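As an illustration of (1) and the tying scheme, here is a minimal NumPy sketch; the class maps alpha and beta and all parameter values are placeholders for illustration, not the paper's trained values.

```python
import numpy as np

# Illustration of the fMAPLR transform in (1) under the tying scheme:
# Gaussian component k selects matrix class alpha[k] and bias class
# beta[k]. The class maps and parameter values are placeholders.
D, K, I, J = 36, 512, 2, 512
rng = np.random.default_rng(0)

A = np.stack([np.eye(D) for _ in range(I)])   # I transform matrices
b = np.zeros((J, D))                          # J bias vectors
alpha = rng.integers(0, I, size=K)            # component -> matrix class
beta = np.arange(K) % J                       # component -> bias class

def fmaplr_transform(x, k):
    """x_hat = A_{alpha(k)} x + b_{beta(k)} for a frame aligned to Gaussian k."""
    return A[alpha[k]] @ x + b[beta[k]]

x_hat = fmaplr_transform(rng.standard_normal(D), k=7)
```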

Given the speech $X$ and the GMM $\lambda$, MAP adaptation of the fMAPLR parameter set $\Lambda$ is to maximize the following posterior pdf:

$$p(\Lambda\,|\,X;\lambda,\Theta)\ \propto\ p(X\,|\,\Lambda,\lambda)\prod_{i=1}^{I}p(A_i)\prod_{j=1}^{J}p(b_j) \qquad (3)$$

where $p(A_i)$ and $p(b_j)$ are the prior pdfs of $A_i$ and $b_j$, respectively. We call $\Theta$ the set of hyperparameters that regulates the distribution of the fMAPLR parameters.

In summary, we have three sets of parameters: 1) the GMM parameter set $\lambda$, 2) the hyperparameter set $\Theta$, and 3) the fMAPLR parameter set $\Lambda$. $\lambda$ and $\Theta$ are estimated on the background data, and $\Lambda$ is estimated on the speaker's data.

A. Prior pdfs

The prior pdf of $A_i$ is defined as a matrix variate normal pdf [13]

$$p(A_i)\ \propto\ \exp\!\Big(-\tfrac{1}{2}\,\mathrm{tr}\big[\Psi_i^{-1}(A_i-M_i)^{\top}\Phi_i^{-1}(A_i-M_i)\big]\Big) \qquad (4)$$

where $\{M_i,\Phi_i,\Psi_i\}$ are the hyperparameters that control $p(A_i)$, with $M_i$ a $D\times D$ mean matrix, and $\Phi_i$ and $\Psi_i$ the $D\times D$ between-row and between-column covariance matrices. We fix $\Phi_i=\phi I$ and $\Psi_i=I$, with $\phi$ being a scalar control parameter and $I$ an identity matrix. When the value of $\phi$ gets smaller, the MAP estimate of $A_i$ becomes closer to the prior parameter $M_i$; on the contrary, the MAP estimate gets closer to the ML estimate.

The prior pdf of $b_j$ is defined as a normal pdf

$$p(b_j)=\mathcal{N}(b_j;\nu_j,\Omega_j) \qquad (5)$$

where $\{\nu_j,\Omega_j\}$ are the hyperparameters that control $p(b_j)$, with $\nu_j$ being a $D$-dimensional mean vector, and $\Omega_j$ a diagonal covariance matrix.

B. Estimation of Hyperparameters

We jointly estimate the hyperparameters $\Theta=\{M_i,\nu_j,\Omega_j\}$ and the GMM parameters $\lambda$ to maximize the likelihood on the background data. The likelihood is presented in (10), derived in Appendix VI-A. The estimation is carried out by the method of alternating variables, where the hyperparameters are updated iteratively in multiple steps, each estimating one subset of hyperparameters while fixing the others. Fixing $\lambda$, $\{\nu_j\}$, and $\{\Omega_j\}$, the estimation of $\{M_i\}$ is similar to the estimation of CMLLR [15]. Fixing $\{M_i\}$, the estimation of $\lambda$, $\{\nu_j\}$, and $\{\Omega_j\}$ is similar to the estimation of the integrated model [29]–[31], [17]; its derivation is presented in Appendix VI-B. In summary, $\Theta$ and $\lambda$ are jointly estimated with the following steps (a schematic of the alternation is sketched after the list).

Step 1) Initialization: The GMM parameter set $\lambda$ is initialized with the UBM, which is trained with ML on the background data. We set $M_i=I$ and $\Omega_j=\omega I$, where $I$ is an identity matrix and $\omega$ is a positive value.

Step 2) Estimation of $\{M_i\}$ by fixing $\lambda$, $\{\nu_j\}$, and $\{\Omega_j\}$: The updating formula of $M_i$ is similar to CMLLR [15]. Several EM iterations can be performed.

Step 3) Estimation of $\lambda$, $\{\nu_j\}$, and $\{\Omega_j\}$ by fixing $\{M_i\}$: The updating formulas are shown in (14)–(18). Several EM iterations can be performed.

Step 4) Iteration between Step 2 and Step 3 until a stopping criterion is satisfied.
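The alternation of Steps 1–4 can be summarized in a control-flow sketch. The two update routines below are no-op placeholders standing in for the CMLLR-style and integrated-model updates, so only the structure (not the numerics) is meaningful; all names are illustrative.

```python
import numpy as np

def update_M_cmllr_style(data, lam, nu, Omega, M):
    # Placeholder for Step 2: row-by-row CMLLR-style re-estimation of M_i.
    return M

def update_integrated_model(data, lam, nu, Omega, M):
    # Placeholder for Step 3: EM updates (14)-(18) of the GMM and bias priors.
    return lam, nu, Omega

def estimate_hyperparameters(data, dim, I, J, omega=0.01, n_outer=3):
    lam = {"dim": dim}                               # Step 1: GMM taken from the UBM
    M = [np.eye(dim) for _ in range(I)]              # prior mean matrices M_i
    nu = [np.zeros(dim) for _ in range(J)]           # prior bias means nu_j
    Omega = [omega * np.ones(dim) for _ in range(J)] # diagonal prior variances
    for _ in range(n_outer):                         # Step 4: alternate
        M = update_M_cmllr_style(data, lam, nu, Omega, M)                  # Step 2
        lam, nu, Omega = update_integrated_model(data, lam, nu, Omega, M)  # Step 3
    return lam, M, nu, Omega
```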

Page 4: Speaker Verification With Feature-Space MAPLR Parameters

508 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 3, MARCH 2011

C. Estimation of fMAPLR Parameters

Given the hyperparameters $\Theta$ and the GMM parameters $\lambda$, the fMAPLR parameters $\Lambda$ are estimated by maximizing the posterior pdf in (3). Similar to the estimation of the hyperparameters, we adopt the method of alternating variables to estimate $\Lambda$. Fixing $\{b_j\}$, the estimation of $\{A_i\}$ is derived in Appendix VI-C; fixing $\{A_i\}$, the estimation of $\{b_j\}$ is derived in Appendix VI-D. In summary, $\Lambda$ is estimated with the following steps (the bias update of Step 3 is sketched after the list).

Step 1) Initialization: $A_i$ are set to be identity matrices and $b_j$ are set to be zero vectors.

Step 2) Estimation of $\{A_i\}$ by fixing $\{b_j\}$: The updating formula for the $l$th row of $A_i$ is shown in (22). Several EM iterations can be performed.

Step 3) Estimation of $\{b_j\}$ by fixing $\{A_i\}$: The updating formula for $b_j$ is shown in (23). Several EM iterations can be performed.

Step 4) Iteration between Step 2 and Step 3 until a stopping criterion is satisfied.
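As a sketch of Step 3, the MAP bias update takes a closed form under diagonal covariances; the version below follows the element-wise form of (23) as reconstructed in this paper, shown for the special case J = K with the transform matrices fixed to the identity. All variable names are illustrative.

```python
import numpy as np

def map_update_bias(x, gamma, mu, sigma2, nu, omega2):
    """MAP update of per-Gaussian bias vectors (Step 3), element-wise
    form of (23), with J = K and A fixed to the identity.

    x: (T, D) frames; gamma: (T, K) occupation probabilities;
    mu, sigma2: (K, D) GMM means and diagonal variances;
    nu, omega2: (K, D) prior means and diagonal prior variances."""
    occ = gamma.sum(axis=0)[:, None]            # (K, 1) soft counts
    resid = occ * mu - gamma.T @ x              # sum_t gamma * (mu - x_t)
    num = nu / omega2 + resid / sigma2          # prior term + data term
    den = 1.0 / omega2 + occ / sigma2
    return num / den                            # (K, D) MAP bias estimates
```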

D. Discussion

With the definition of the fMAPLR function [(1)], we can set the number $I$ of transform matrices and the number $J$ of bias vectors separately. For example, we may set $J>I$ in order to estimate "precise" bias vectors as well as "robust" transform matrices. It would be interesting to conduct an inquiry into how fMAPLR-SVM is related to some existing techniques. To start, let us revisit (1).

If we associate the transform matrices and the bias vectors with the same regression classes, i.e., $I=J$ and $\alpha(k)=\beta(k)$ for all $k$, (1) can be rewritten as

$$\hat{x}_t = A_i x_t + b_i \qquad (6)$$

which takes the same form as the CMLLR transform function [15]. Just like CMLLR, if we combine $A_i$ and $b_i$ into an extended transform matrix $W_i=[A_i\ \ b_i]$ and define an extended feature vector $\xi_t=[x_t^{\top}\ \ 1]^{\top}$, then the updating formula for the rows of $A_i$ [(22)] can easily be generalized to estimate $W_i$. However, our fMAPLR updating formula differs from that of CMLLR in that fMAPLR-SVM follows a MAP estimation process while CMLLR follows an ML one. Examining (22), one can observe that the MAP estimate introduces a smoothing mechanism that is regulated by $\phi$; the MAP estimate approximates the ML one as $\phi\to\infty$. fMAPLR with (6) may look similar to the MLLR transform function [7]. However, they are fundamentally different because fMAPLR is applied to the features, while MLLR is applied to the model parameters. Of course, the formulations of fMAPLR and MLLR become very different when the transform matrices and bias vectors of fMAPLR belong to different regression classes, i.e., $\alpha(k)\neq\beta(k)$.

If we define $K$ bias vectors, each associated with a Gaussian component in the GMM, i.e., $J=K$ and $\beta(k)=k$, and assume $A_i=I$, $\nu_k=0$, and $\Omega_k=\Sigma_k/\tau$ for a positive scalar $\tau$, (23) can be rewritten as

$$b_k=\frac{\sum_{t=1}^{T}\gamma_t(k)\,(\mu_k-x_t)}{\tau+\sum_{t=1}^{T}\gamma_t(k)} \qquad (7)$$

If we compare (7) with the MAP updating formula of the Gaussian mean vector [6], we see that $b_k$ is actually the difference between the original mean vector $\mu_k$ and the MAP-adapted mean vector, thus literally modeling the speaker difference. This also helps explain the different motivations behind fMAPLR-SVM and GMM-SVM.
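A quick numeric check of this interpretation, with toy values and hard occupation counts (all quantities below are placeholders):

```python
import numpy as np

# Toy check: with A = I, nu = 0, and Omega_k = Sigma_k / tau, the MAP bias
# b_k from (7) equals the gap between the original mean mu_k and the
# relevance-MAP adapted mean.
rng = np.random.default_rng(1)
T, D, tau = 50, 4, 16.0
x = rng.standard_normal((T, D)) + 0.5       # frames aligned to Gaussian k
gamma = np.ones(T)                          # hard occupation for simplicity
mu = np.zeros(D)                            # original mean vector

b = (gamma[:, None] * (mu - x)).sum(0) / (tau + gamma.sum())       # (7)
mu_map = (tau * mu + (gamma[:, None] * x).sum(0)) / (tau + gamma.sum())
assert np.allclose(b, mu - mu_map)          # b_k = mu_k - MAP-adapted mean
```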

We have reported other variants of the fMAPLR speaker modeling technique. It is worth noting that this paper not only proposes a new method, but also offers a unified framework that links up the fMAPLR family of speaker modeling techniques. If we do not update the GMM parameters when estimating the hyperparameters, the fMAPLR parameters are estimated based on the UBM. We presented this method in [19], which is referred to as fMAPLR(UBM)-SVM hereinafter. If we assume that the transformed features $\hat{x}_t$ in (1) are not truly speaker independent but contain "residual" speaker information, an SD-GMM can be adapted from the SI-GMM using these residual SD features, which are transformed from the observed features by the estimated fMAPLR parameters. Both the SD-GMM means and the fMAPLR parameters can then be used for speaker modeling. We presented this method in [20], which is referred to as fMAPLR(+MEAN)-SVM hereinafter. fMAPLR-SVM can be viewed as a simplified version of fMAPLR(+MEAN)-SVM where $\hat{x}_t$ are assumed to be truly speaker independent, and it is shown in this paper to be superior to fMAPLR(+MEAN)-SVM. We will compare fMAPLR-SVM with the two previously reported fMAPLR variants in Section IV-D.

III. SVM WITH fMAPLR SUPERVECTOR

The fMAPLR parameters $\Lambda$ as a speaker model are vectorized to work in an SVM, as shown in Fig. 2, where $a_{i,l}$ is the $l$th row in $A_i$. For each speech utterance, the fMAPLR parameters are concatenated to form a supervector consisting of $ID^2+JD$ elements. We call the parameters in the supervector free parameters because they describe the speaker model space under the parameter tying scheme. An SVM is trained for each target speaker by regarding the target speaker's training supervectors as positive samples and the supervectors from the background data set as negative samples. Our experiments are implemented using the SVMTorch software with a linear inner-product kernel [32].

The supervectors need to be scaled to the same dynamic ranges. We adopt rank normalization, which has been successfully used to normalize the elements of MLLR matrices in the MLLR-SVM method [5]. It replaces each value in the supervectors with its rank among the background data samples on a given dimension, and then scales the ranks to a value between 0 and 1.


Fig. 2. Vectorization of fMAPLR parameters for speaker modeling.

The rank normalization warps the distribution to be uniform on each dimension of the background vectors, which may result in better robustness for the SVM classifier.
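A minimal sketch of this normalization; the tie-breaking convention (right-side search) is an assumption:

```python
import numpy as np

def rank_normalize(background, vectors):
    """background: (N, P) background supervectors; vectors: (M, P)
    supervectors to normalize. Each element is replaced by its rank
    among the background samples on that dimension, scaled to [0, 1]."""
    order = np.sort(background, axis=0)         # per-dimension sorted values
    N = background.shape[0]
    out = np.empty(vectors.shape, dtype=float)
    for d in range(vectors.shape[1]):
        ranks = np.searchsorted(order[:, d], vectors[:, d], side="right")
        out[:, d] = ranks / N
    return out

bg = np.random.default_rng(5).standard_normal((200, 8))
normed = rank_normalize(bg, bg[:3])             # values lie in [0, 1]
```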

In the SVM framework, NAP has been used to project out a subspace from the original supervector space [25]. The subspace, aiming to model nuisance effects in speech, is trained on a set of training data recorded from many different speakers, each speaking multiple speech segments over different channels. We perform the NAP on the fMAPLR supervector. Finally, the output SVM scores are normalized with Tnorm, which further compensates for the nuisance effects [28].
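A sketch of the NAP projection itself; the nuisance basis here is a random rank-40 orthonormal matrix for illustration, whereas in practice it is trained on multi-channel data as described above:

```python
import numpy as np

# NAP removes a low-rank nuisance subspace U from a supervector:
# v' = (I - U U^T) v. Rank 40 matches the setting used in Section IV.
P, rank = 2048, 40
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((P, rank)))   # orthonormal basis

def nap_project(v, U):
    return v - U @ (U.T @ v)

v_comp = nap_project(rng.standard_normal(P), U)
```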

IV. EXPERIMENTS

We evaluate fMAPLR-SVM on the National Institute of Standards and Technology (NIST) 2006 and 2008 speaker recognition evaluation (SRE) tasks [33]. Experiments are performed on English telephone speech data under the core conditions. The telephone speech data are collected for the Mixer corpus by the Linguistic Data Consortium using the Fishboard platform. Under the core condition, both enrollment and test data are a conversation of approximately five minutes each. We choose the 1-conversation training data in the NIST 2004 SRE corpus as the background data. The NAP training data include 4934 speech utterances of 2.5-min duration, by 186 female and 124 male speakers, from the NIST 2004 SRE corpus. We use the 1-conversation training data in the NIST 2005 SRE corpus for training cohort models in Tnorm score normalization.

The speech utterances are segmented at a 20-ms frame length with a 12.5-ms frame shift. Each frame is converted to a 36-dimensional feature vector composed of 12 MFCC coefficients and their first- and second-order derivatives. The feature sequences are filtered by a RASTA filter. An energy-based voice activity detection (VAD) process is then used to remove non-speech frames. Finally, an utterance-based mean and variance normalization is performed on the feature vectors.
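The last two front-end steps are simple enough to sketch in NumPy; the energy threshold below is an assumption, and MFCC extraction and RASTA filtering are omitted:

```python
import numpy as np

def energy_vad(frames, rel_db=30.0):
    """Drop frames whose log energy falls more than rel_db below the
    utterance peak; the threshold value is an assumption."""
    e = 10 * np.log10(np.sum(frames**2, axis=1) + 1e-10)
    return frames[e > e.max() - rel_db]

def mvn(frames):
    """Utterance-based mean and variance normalization."""
    return (frames - frames.mean(axis=0)) / (frames.std(axis=0) + 1e-10)

feats = mvn(energy_vad(np.random.default_rng(3).standard_normal((500, 36))))
```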

We compare the fMAPLR-SVM method with the GMM-SVM method. Two gender-dependent UBMs are trained on the background data, each consisting of 512 Gaussian components. In GMM-SVM, GMM means are adapted from the UBMs with the MAP estimation. In fMAPLR-SVM, the UBMs are used to initialize the GMM parameters in the estimation of hyperparameters. In GMM-SVM, the free parameters consist of normalized Gaussian means, so the number of free parameters is the number of Gaussian components multiplied by the feature dimension, i.e., $K\times D$.

TABLE I
PERFORMANCE OF GMM-SVM AND fMAPLR-SVM ON MALE SPEAKER DATA OF THE NIST 2006 SRE CORE CONDITION CORPUS. fMAPLR IS DEFINED WITH 512 BIAS VECTORS ESTIMATED WITH DIFFERENT $\omega$

In fMAPLR-SVM, the free parameters consist of transform matrices and bias vectors, so the number of free parameters is $ID^2+JD$. In both methods, NAP is performed with a matrix rank of 40 and Tnorm is performed on the SVM scores.
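As a quick check of the supervector sizes implied by this setup (the fMAPLR configuration below is one of those evaluated in Section IV-F):

```python
# Supervector sizes (#FP) for D = 36, K = 512:
D, K = 36, 512
fp_gmm_svm = K * D                    # GMM-SVM: 512 * 36 = 18432 elements
I, J = 2, 512                         # an fMAPLR configuration (Section IV-F)
fp_fmaplr_svm = I * D**2 + J * D      # 2 * 1296 + 512 * 36 = 21024 elements
print(fp_gmm_svm, fp_fmaplr_svm)
```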

A. Preset Parameters

In the estimation of the hyperparameters, we initialize $\Omega_j$, which denote the variances in the prior pdf of $b_j$ in (5), as identity matrices multiplied by a positive value $\omega$ controlling the width of the distribution. Equation (23) shows that the MAP estimation of $b_j$ gets closer to the ML estimation when the value of $\omega$ becomes larger. Without loss of generality, we evaluate the performance of different $\omega$ values on the male speaker data of the NIST 2006 SRE corpus for quick turnaround in experiments. The fMAPLR is defined with no transform matrix and 512 bias vectors, each associated with a Gaussian component in the GMM, so that fMAPLR-SVM and GMM-SVM have the same number of free parameters. It is shown that only marginal performance improvement is obtained by running multiple EM iterations in the estimation of both the hyperparameters and the fMAPLR parameters, and therefore one EM iteration is performed in the estimation of the parameters in the following experiments. The $\omega$ value ranges between 0.001 and 1, which corresponds to the variance range in the GMM. Results are shown in Table I, including the equal error rate (EER) and the minimum detection cost function (DCF). Compared with the GMM-SVM method, all four $\omega$ values achieve a performance improvement. It is also shown that changing the $\omega$ value within this range has a marginal effect on the recognition performance, so we simply pick one value in this range for the following experiments.

In the prior pdf of $A_i$ [(4)], we need to predetermine the scale parameter $\phi$. Equation (22) shows that the MAP estimation of $A_i$ gets closer to the ML estimation when the value of $\phi$ becomes larger. We evaluate the performance of different $\phi$ values using the male speaker data of the NIST 2006 SRE core condition corpus. The fMAPLR is defined with one transform matrix and no bias vector. Table II shows the results with different $\phi$ values ranging between 1 and 1000. Similar to $\omega$, changing the $\phi$ value also has a marginal effect on the EER and DCF results. Therefore, we fix $\phi$ in the following experiments. Note that the same setting is adopted in [13]. As $\phi$ approaches infinity, the fMAPLR estimation numerically corresponds to CMLLR with a single transform matrix and zero bias. Empirically, we estimate the CMLLR kernel based on the UBM and report the results in Table II, referred to as CMLLR-SVM. In Table II, fMAPLR-SVM shows slightly better EERs than CMLLR-SVM,


TABLE II
PERFORMANCE OF CMLLR-SVM AND fMAPLR-SVM ON MALE SPEAKER DATA OF THE NIST 2006 SRE CORE CONDITION CORPUS. CMLLR AND fMAPLR ARE BOTH DEFINED WITH A SINGLE TRANSFORM MATRIX AND WITHOUT BIAS. fMAPLR IS ESTIMATED WITH DIFFERENT $\phi$

TABLE III
PERFORMANCE OF GMM-SVM AND fMAPLR-SVM ON MALE SPEAKER DATA OF THE NIST 2006 SRE CORE CONDITION CORPUS. fMAPLR IS DEFINED WITH $J$ BIAS VECTORS

probably because of the MAP adaptation and the joint estimation of the GMM and the prior pdfs of fMAPLR parameters. Compared with the results in Table I, the performance of fMAPLR with one transform matrix is inferior to the performance of the GMM-SVM method and the fMAPLR-SVM method with 512 bias vectors. Similar results are observed in [9], [10], where the GMM-based MLLR kernels do not outperform the GMM-SVM method. The results suggest that the problem of data fragmentation in GMM-based methods can severely affect the estimation of the transform matrix [5].

B. fMAPLR With Bias Vectors Only

To appreciate the effect of the matrix and bias parameters in the fMAPLR speaker model, we next study their respective performance in a speaker recognition task individually. To start with, we investigate the performance of fMAPLR-SVM with bias vectors only, i.e., fMAPLR is defined with no transform matrix. In this case, the second step in the estimation of both the hyperparameters and the fMAPLR parameters is skipped. We evaluate the performance with the number of bias vectors ranging from 32 to 512 on the male speaker data of the NIST 2006 SRE core condition corpus. Results are presented in Table III. "#FP" denotes the number of free parameters in units of $D$; therefore, we have #FP $=512$ for GMM-SVM and #FP $=J$ for fMAPLR-SVM. It is shown that the fMAPLR-SVM method with 256 bias vectors outperforms the GMM-SVM method although the fMAPLR kernel has fewer free parameters. This suggests that it is more effective to jointly estimate the GMM and the prior pdfs of fMAPLR parameters, which leads to a canonical GMM that better models the speaker-independent data than the UBM trained with multiple speakers.

C. fMAPLR With Transform Matrices Only

We further investigate the performance of fMAPLR-SVM with transform matrices only, i.e., fMAPLR is defined with no bias vector. In this case, only the GMM parameters $\lambda$ need to be updated in Step 3 in the estimation of hyperparameters,

TABLE IV
PERFORMANCE OF GMM-SVM, CMLLR-SVM AND fMAPLR-SVM ON MALE SPEAKER DATA OF THE NIST 2006 SRE CORE CONDITION CORPUS. fMAPLR IS DEFINED WITH $I$ TRANSFORM MATRICES. CMLLR IS DEFINED WITH $I$ EXTENDED TRANSFORM MATRICES

and Step 3 in the estimation of fMAPLR parameters is skipped. In Table IV, we compare the performance of fMAPLR-SVM, GMM-SVM, and CMLLR-SVM on the male speaker data of the NIST 2006 SRE core condition corpus. For both fMAPLR and CMLLR, the number of transform matrices varies from 1 to 5. CMLLR is defined with extended transform matrices, which include a bias vector for each matrix [15], so that its number of free parameters (#FP) is larger than that of fMAPLR. It is shown that among the five values, the setting of $I=2$ yields the best results for both fMAPLR-SVM and CMLLR-SVM. This observation also coincides with what is reported in [34], where 2-class MLLR outperforms both 1-class and 4-class MLLR. This suggests that the number of regression classes needs to be appropriately set in the estimation of transform matrices. CMLLR-SVM is slightly better than fMAPLR-SVM in most cases, probably due to the inclusion of a bias vector in each of the transform matrices in CMLLR. A comparison with GMM-SVM shows that matrix parameters are not as effective as bias parameters when used alone.

D. Comparison With Two Variants of fMAPLR-SVM

As discussed in Section II-D, we have studied two variants of fMAPLR-SVM in [19] and [20], referred to as fMAPLR(UBM)-SVM and fMAPLR(+MEAN)-SVM, respectively. In this section, we compare the performance of the two variants with fMAPLR-SVM and GMM-SVM. Defining fMAPLR with one transform matrix and 512 bias vectors, we conduct experiments on the male speaker data of the NIST 2006 SRE core condition corpus and obtain the results in Table V. First, it is shown that fMAPLR(UBM)-SVM yields better performance than GMM-SVM. The two methods work in the same way when performing MAP adaptation over the UBM, but differ in the speaker model: one uses fMAPLR parameters, the other uses GMM means. Second, fMAPLR-SVM achieves better performance than fMAPLR(+MEAN)-SVM. The latter differs from the former in that GMM means are further adapted on the fMAPLR-transformed features and used in the SVM supervector together with the fMAPLR parameters. This reaffirms our proposal that fMAPLR parameters effectively characterize the speakers. Further adaptation of GMM means on the fMAPLR-transformed features may not help, at least in the NIST 2006 SRE test. Third, fMAPLR-SVM achieves better performance than fMAPLR(UBM)-SVM. fMAPLR-SVM


TABLE V
PERFORMANCE OF GMM-SVM, fMAPLR-SVM AND ITS TWO VARIANTS ON MALE SPEAKER DATA OF THE NIST 2006 SRE CORE CONDITION CORPUS

TABLE VI
PERFORMANCE OF SVM AND DOT-PRODUCT SCORING ON MALE SPEAKER DATA OF THE NIST 2006 SRE CORE CONDITION CORPUS. RESULTS ARE COMPARED BEFORE NORMALIZATION (RAW), AFTER TNORM AND ZTNORM

differs from fMAPLR(UBM)-SVM in that it jointly estimates the GMM and the prior pdfs of fMAPLR parameters. The result validates the advantage of the joint estimation over piecemeal estimation in the modeling process.

E. Dot-Product Scoring Versus SVM Scoring

In SVM scoring, the support vectors in a speaker model can be collapsed down into a single supervector $m$, and the score will be a dot-product $m^{\top}v$, where $v$ is the supervector of a testing speech utterance [2]. In view of this, it is interesting to see how the dot-product score works in comparison with the SVM output. Given the speaker supervector $m$ and the testing supervector $v$, the dot-product score is simply $m^{\top}v$. In comparison with SVM, the dot-product has the advantages of saving computational cost and avoiding the problem of selecting SVM training background data. In Table VI, we compare the performance of SVM and dot-product scoring on the male speaker data of the NIST 2006 SRE core condition corpus. We use the 1-conversation training data in the NIST 2004 SRE corpus as the Znorm cohort data set. It is shown that ZTnorm may not be helpful for SVM but is essential for the dot-product. After ZTnorm, the dot-product yields similar performance to SVM. The results suggest that the dot-product can be a substitute for SVM when using appropriate ZTnorm cohort data.
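To make the scoring comparison concrete, here is a small sketch of dot-product scoring followed by Tnorm over an assumed cohort; all vectors are random placeholders:

```python
import numpy as np

# Dot-product scoring with Tnorm: the test supervector is scored against
# the speaker model and against a cohort, and the raw score is normalized
# by the cohort score statistics.
rng = np.random.default_rng(4)
P = 2048
m = rng.standard_normal(P)               # collapsed speaker model supervector
v = rng.standard_normal(P)               # test utterance supervector
cohort = rng.standard_normal((50, P))    # cohort speaker supervectors

raw = m @ v                              # dot-product score
cohort_scores = cohort @ v               # test vector vs. cohort models
tnorm = (raw - cohort_scores.mean()) / cohort_scores.std()  # Tnorm score
```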

F. Performance on NIST 2006 SRE

We evaluate the fMAPLR-SVM method with different numbers of transform matrices and bias vectors on the NIST 2006 SRE data to see the contribution of the parameters. The number of transform matrices is set to 0, 1, or 2, and the number of bias vectors is set to 256 or 512, resulting in six combinations of fMAPLR. Table VII summarizes the results for male, female, and all speakers. First, the performance improves as the number of transform matrices increases if we fix the number of bias vectors, which validates the effectiveness of the transform matrices. Second, the method with 512 bias vectors yields better performance than that with 256 bias vectors, which suggests that a larger number of bias vectors helps. The best performance is achieved with two transform matrices and 512 bias vectors. Fig. 3 compares the detection error tradeoff (DET) curves

Fig. 3. DET curves on NIST 2006 SRE.

Fig. 4. DET curves on NIST 2008 SRE.

TABLE VII
COMPARISON OF GMM-SVM AND fMAPLR-SVM ON NIST 2006 SRE. fMAPLR IS DEFINED WITH $I$ TRANSFORM MATRICES AND $J$ BIAS VECTORS

of GMM-SVM and fMAPLR-SVM, where the fMAPLR is defined with one transform matrix and 512 bias vectors for consistency with the DET curve in Fig. 4. It is clearly shown


TABLE VIII
FUSION OF fMAPLR-SVM, GMM-SVM AND CMLLR-SVM ON NIST 2006 SRE. fMAPLR IS DEFINED WITH TWO TRANSFORM MATRICES AND 512 BIAS VECTORS. CMLLR IS DEFINED WITH TWO EXTENDED TRANSFORM MATRICES ESTIMATED BASED ON THE UBM

TABLE IX
COMPARISON OF GMM-SVM AND fMAPLR-SVM ON NIST 2008 SRE. fMAPLR IS DEFINED WITH $I$ TRANSFORM MATRICES AND $J$ BIAS VECTORS

that the fMAPLR-SVM method achieves a better overall DET curve than the GMM-SVM method.

We also try the fusion of fMAPLR-SVM, GMM-SVM, and CMLLR-SVM by combining the scores from these methods. Table VIII shows the fusion results on the full data set of the NIST 2006 SRE corpus. We observe that the fusion of GMM-SVM and CMLLR-SVM improves the performance over either method, suggesting that transform matrices and GMM means offer complementary information. However, the fusion of fMAPLR-SVM with GMM-SVM, CMLLR-SVM, or both does not obtain additional gain in comparison with a single fMAPLR-SVM. This is probably because fMAPLR defined with both transform matrices and bias vectors already contains the information of the transform matrices in CMLLR-SVM and of the GMM means in GMM-SVM.

G. Performance on NIST 2008 SRE

In this section, we evaluate the fMAPLR-SVM method on the NIST 2008 SRE. The experimental setup is the same as that in the above evaluation on the NIST 2006 SRE data. The results summarized in Table IX lead to the same conclusions as those on the NIST 2006 SRE data, except that the best performance is achieved by the method with one transform matrix and 512 bias vectors. It indicates that the 2008 data are better modeled by one transform matrix than by two, which again demonstrates that an appropriate number of regression classes is crucial for the estimation of transform matrices. Fig. 4 compares the DET curves on the NIST 2008 SRE data between GMM-SVM and fMAPLR-SVM, where fMAPLR is defined with one transform matrix and 512 bias vectors. The DET-curve comparison confirms the effectiveness of the fMAPLR-SVM method.

V. CONCLUSION

We proposed an fMAPLR-SVM method for speaker recognition, where the SVM kernel is constructed with fMAPLR parameters. A flexible form of fMAPLR is defined where the transform matrices and bias vectors are associated with separate regression classes. The number of bias vectors can be set larger than the number of transform matrices because the estimation of bias vectors is more robust given the limited amount of speaker samples. The fMAPLR parameters are estimated with MAP adaptation to prevent the possible numerical problems in the ML estimation of transform matrices, as well as to integrate prior information into the transform estimation. The estimation of fMAPLR parameters is based on a GMM which is jointly estimated with the prior pdfs of the fMAPLR parameters, in order to produce a canonical GMM that models the speaker-independent data better than the UBM. The hyperparameters and fMAPLR parameters are updated using the method of alternating variables.

Experiments conducted on the NIST 2006 and 2008 SRE tasks lead to the following conclusions. First, it is shown that the recognition performance is not sensitive to the values of the two preset parameters ($\omega$ and $\phi$) varying over a relatively wide range, which confirms the robustness of the fMAPLR estimation. Second, for fMAPLR with bias vectors only, the fMAPLR-SVM method with fewer free parameters outperforms the GMM-SVM method with more free parameters. This validates the effectiveness of the canonical GMM produced by the joint estimation with the prior pdfs of fMAPLR parameters. Third, for fMAPLR with transform matrices only, the performance of the fMAPLR-SVM method is inferior to that of the GMM-SVM method. Similar results are observed for the MLLR kernel in [9] and [10]. This indicates that the problem of data fragmentation in GMM-based methods can severely affect the estimation of transform matrices. Fourth, the fMAPLR-SVM method achieves the best performance with a few transform matrices and a large number of bias vectors on the two tasks. This confirms that the performance is optimized with the flexible form of fMAPLR, where the number of bias vectors is set larger than the number of transform matrices.


The fMAPLR-SVM method continues to suffer from the data fragmentation problem due to the use of a GMM, which models frames independently. One way to address the problem is to estimate the fMAPLR parameters based on phoneme HMMs instead of the GMM. Furthermore, transforms can be more accurately estimated with multipass HMM decoding. In [5], MLLR transforms are estimated based on triphone HMMs with two recognition passes, where the first pass estimates three transforms using phone-loop decoding and the second pass estimates nine transforms using word references from the first-pass recognition, which includes bigram language model (LM) decoding, lattice generation, and higher-order LM rescoring. This leads to an SVM kernel consisting of eight MLLR transforms from the second pass for each gender, while the SVM kernel in our method consists of one or two transform matrices only. To this end, one direction of our future work is to study the fMAPLR-SVM method based on triphone HMMs.

APPENDIX

Likelihood With Respect to Hyperparameters: Let us assume that the background data set is recorded from $S$ speakers and denoted as $\mathcal{X}=\{X^{(1)},\ldots,X^{(S)}\}$, where $X^{(s)}=\{x^{(s)}_1,\ldots,x^{(s)}_{T_s}\}$ is a feature vector sequence. According to (1), an observed feature vector is first rotated by the transform and then decomposed into two hidden factors: a speaker-independent feature vector $\hat{x}$ and a bias vector $b$. The likelihood of $\mathcal{X}$ with respect to the hyperparameters is then written as

$$p(\mathcal{X};\Theta,\lambda)=\prod_{s=1}^{S}\prod_{t=1}^{T_s}\sum_{k=1}^{K}w_k\iint p\big(x^{(s)}_t\,|\,\hat{x},b,k\big)\,\mathcal{N}(\hat{x};\mu_k,\Sigma_k)\,\mathcal{N}\big(b;\nu_{\beta(k)},\Omega_{\beta(k)}\big)\,d\hat{x}\,db \qquad (8)$$

where $\mathcal{N}(\hat{x};\mu_k,\Sigma_k)$ and $\mathcal{N}(b;\nu_{\beta(k)},\Omega_{\beta(k)})$ are defined in (2) and (5), respectively. Considering that the hyperparameter to be estimated in the prior pdf of $A_i$ is the mean matrix $M_i$, we define

$$p\big(x_t\,|\,\hat{x},b,k\big)=|M_{\alpha(k)}|\,\mathcal{N}\big(\hat{x}-M_{\alpha(k)}x_t-b;\,0,\,\epsilon I\big) \qquad (9)$$

where $|M_{\alpha(k)}|$ is the determinant of $M_{\alpha(k)}$, and the Gaussian has zero mean and a covariance of $\epsilon I$. Assuming $\epsilon\to 0$, the Gaussian becomes a Kronecker delta function $\delta(\hat{x}-M_{\alpha(k)}x_t-b)$. Substituting (2), (5), and (9) in (8), we get

$$p(\mathcal{X};\Theta,\lambda)=\prod_{s=1}^{S}\prod_{t=1}^{T_s}\sum_{k=1}^{K}w_k\,|M_{\alpha(k)}|\,\mathcal{N}\big(M_{\alpha(k)}x^{(s)}_t;\,\mu_k-\nu_{\beta(k)},\,\Sigma_k+\Omega_{\beta(k)}\big) \qquad (10)$$
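The collapse from (8)–(9) to (10) rests on the standard identity for marginalizing a Gaussian-distributed bias; a short sketch in the notation above (the reconstructed forms of (8)–(10) are assumed):

```latex
% After the delta function collapses \hat{x} to M x_t + b, the remaining
% integral over the hidden bias b is a convolution of two Gaussians:
\int \mathcal{N}\!\big(M x_t + b;\, \mu_k, \Sigma_k\big)\,
     \mathcal{N}\!\big(b;\, \nu, \Omega\big)\, db
  \;=\; \mathcal{N}\!\big(M x_t;\, \mu_k - \nu,\; \Sigma_k + \Omega\big)
% because M x_t = \hat{x} - b is a difference of independent Gaussian
% variables, so the means subtract and the covariances add.
```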

Estimation of $\lambda$ and $\{\nu_j,\Omega_j\}$ by Fixing $\{M_i\}$: Assuming $\{M_i\}$ is fixed and replacing $A_i$ with the mean matrix $M_i$ in the prior pdf, we rewrite (1) as

$$M_{\alpha(k)}x_t=\hat{x}_t-b_t \qquad (11)$$

where the observed variable $M_{\alpha(k)}x_t$ is decomposed into two hidden variables, the speaker-independent feature vector $\hat{x}_t$ and the speaker-dependent bias vector $b_t$. With the Gaussian mixture sequence denoted by $\mathcal{K}$, the EM auxiliary function for (10) is defined as

$$Q(\Theta;\hat{\Theta})=\sum_{\mathcal{K}}\iint p\big(\hat{X},B,\mathcal{K}\,|\,\mathcal{X};\hat{\Theta}\big)\,\log p\big(\mathcal{X},\hat{X},B,\mathcal{K};\Theta\big)\,d\hat{X}\,dB \qquad (12)$$

where $\hat{X}$ and $B$ denote the hidden sequences of speaker-independent features and biases. Assuming $\hat{x}$ and $b$ are statistically independent and denoting the hidden variables by frame sequences, we get

$$p\big(\mathcal{X},\hat{X},B,\mathcal{K};\Theta\big)=\prod_{s=1}^{S}\prod_{t=1}^{T_s}w_{k_t}\,\mathcal{N}\big(\hat{x}^{(s)}_t;\mu_{k_t},\Sigma_{k_t}\big)\,\mathcal{N}\big(b^{(s)}_t;\nu_{\beta(k_t)},\Omega_{\beta(k_t)}\big)\,p\big(x^{(s)}_t\,|\,\hat{x}^{(s)}_t,b^{(s)}_t,k_t\big) \qquad (13)$$

where $w_{k_t}$ is the Gaussian mixture weight, and the two Gaussian pdfs are defined in (2) and (5), respectively. Note that we define $\tilde{b}=-b$, and therefore the sign of the mean values is inverted. Substituting (2), (5), and (13) in (12), we obtain an auxiliary function weighted by $\gamma_t(k)$, the occupation probability of Gaussian component $k$ at time $t$ of the current observation, where the likelihood for each frame of observation is calculated according to (10).

Differentiating with respect to $\lambda$ and $\{\nu_j,\Omega_j\}$, and equating to zero, we get the updating formulas (14)–(18) for $w_k$, $\mu_k$, $\Sigma_k$, $\nu_j$, and $\Omega_j$, in which the required expected values of the hidden variables $\hat{x}_t$ and $b_t$ are calculated from their posterior statistics.


Estimation of $\{A_i\}$ by Fixing $\{b_j\}$: The EM auxiliary function for the posterior pdf in (3) is defined as

$$Q(\Lambda;\hat{\Lambda})=\sum_{t=1}^{T}\sum_{k=1}^{K}\gamma_t(k)\,\log p\big(x_t\,|\,k;\Lambda\big)+\sum_{i=1}^{I}\log p(A_i)+\sum_{j=1}^{J}\log p(b_j) \qquad (19)$$

where the likelihood for each frame of observation is calculated as follows:

$$p\big(x_t\,|\,k;\Lambda\big)=|A_{\alpha(k)}|\,\mathcal{N}\big(A_{\alpha(k)}x_t+b_{\beta(k)};\,\mu_k,\Sigma_k\big) \qquad (20)$$

By fixing $\{b_j\}$, we can estimate $A_i$ with a row-by-row update. As a function of the $l$th row $a_l$ of $A_i$, (19) can be rewritten as

$$Q(a_l)=\beta_i\log\big|c_l a_l^{\top}\big|-\tfrac{1}{2}\,a_l G^{(l)}a_l^{\top}+a_l k^{(l)\top}-\tfrac{1}{2\phi}\big(a_l-m_l\big)\big(a_l-m_l\big)^{\top}+\mathrm{const} \qquad (21)$$

where $c_l$ is the extended cofactor row vector with $|A_i|=c_l a_l^{\top}$, $m_l$ is the $l$th row of the prior mean matrix $M_i$, $\beta_i$ is the total occupation count of the regression class, and $G^{(l)}$ and $k^{(l)}$ are sufficient statistics accumulated from the data. Differentiating (21) with respect to $a_l$ and equating to zero, we get

$$a_l=\big(\eta\,c_l+k^{(l)}+\phi^{-1}m_l\big)\big(G^{(l)}+\phi^{-1}I\big)^{-1} \qquad (22)$$

where the scalar $\eta=\beta_i/(c_l a_l^{\top})$ satisfies a quadratic equation obtained by substituting (22) back into its definition. Thus, the value of $a_l$ is obtained to maximize (19).

Estimation of $\{b_j\}$ by Fixing $\{A_i\}$: We estimate $b_j$ by fixing $\{A_i\}$. The auxiliary function in (19) is rewritten with respect to $b_j$. Differentiating with respect to $b_j$ and equating to zero, we get, element by element,

$$b_{j,l}=\frac{\nu_{j,l}/\omega_{j,l}+\sum_{k:\beta(k)=j}\sum_{t=1}^{T}\gamma_t(k)\big(\mu_{k,l}-a_l x_t\big)/\sigma_{k,l}}{1/\omega_{j,l}+\sum_{k:\beta(k)=j}\sum_{t=1}^{T}\gamma_t(k)/\sigma_{k,l}} \qquad (23)$$

where $a_l$ is the $l$th row of $A_{\alpha(k)}$, and $b_{j,l}$, $\nu_{j,l}$, $\omega_{j,l}$, $\mu_{k,l}$, and $\sigma_{k,l}$ are the $l$th elements of $b_j$, $\nu_j$, $\mathrm{diag}(\Omega_j)$, $\mu_k$, and $\mathrm{diag}(\Sigma_k)$, respectively.

REFERENCES

[1] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, pp. 19–41, 2000.

[2] W. M. Campbell, "Generalized linear discriminant sequence kernels for speaker recognition," in Proc. ICASSP, 2002, pp. 161–164.

[3] W. M. Campbell, J. P. Campbell, D. A. Reynolds, D. A. Jones, and T. R. Leek, "Phonetic speaker recognition with support vector machines," in Proc. NIPS, 2003, pp. 1377–1384.

[4] W. M. Campbell, D. E. Sturim, D. A. Reynolds, and A. Solomonoff, "SVM based speaker verification using a GMM supervector kernel and NAP variability compensation," in Proc. ICASSP, 2006, pp. 97–100.

[5] A. Stolcke, S. S. Kajarekar, L. Ferrer, and E. Shriberg, "Speaker recognition with session variability normalization based on MLLR adaptation transforms," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 7, pp. 1987–1998, Sep. 2007.

[6] J.-L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291–298, Apr. 1994.

[7] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Comput. Speech Lang., vol. 9, pp. 171–185, 1995.

[8] M. J. F. Gales and P. C. Woodland, "Mean and variance adaptation within the MLLR framework," Comput. Speech Lang., vol. 10, pp. 249–264, 1996.

[9] Z. N. Karam and W. M. Campbell, "A new kernel for SVM MLLR based speaker recognition," in Proc. Interspeech, 2007, pp. 290–293.

[10] M. Ferràs, C. C. Leung, C. Barras, and J.-L. Gauvain, "MLLR techniques for speaker recognition," in Proc. IEEE Odyssey'08: Speaker Lang. Workshop, 2008.

[11] C. J. Leggetter and P. C. Woodland, "Flexible speaker adaptation using maximum likelihood linear regression," in Proc. ARPA SLS Technol. Workshop, 1995, pp. 110–115.

[12] W. Chou, "Maximum posterior linear regression with elliptically symmetric matrix variate priors," in Proc. Eurospeech, 1999, pp. 1–4.

[13] O. Siohan, C. Chesta, and C.-H. Lee, "Hidden Markov model adaptation using maximum a posteriori linear regression," in Proc. Workshop Robust Methods for Speech Recognition in Adverse Conditions, 1999, pp. 147–150.

[14] X. Lei, J. Hamaker, and X. He, "Robust feature space adaptation for telephony speech recognition," in Proc. ICSLP, 2006, pp. 773–776.

[15] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Comput. Speech Lang., vol. 12, pp. 75–98, 1998.

[16] M. Rahim and B.-H. Juang, "Signal bias removal by maximum likelihood estimation for robust telephone speech recognition," IEEE Trans. Speech Audio Process., vol. 4, no. 1, pp. 19–30, Jan. 1996.

[17] A. Sankar and C.-H. Lee, "A maximum likelihood approach to stochastic matching for robust speech recognition," IEEE Trans. Speech Audio Process., vol. 4, no. 3, pp. 190–202, May 1996.

[18] Q. Huo and D. Zhu, "Robust speech recognition based on structured modeling, irrelevant variability normalization and unsupervised online adaptation," in Proc. ICASSP, 2009, pp. 4637–4640.

[19] D. Zhu, B. Ma, and H. Li, "Using MAP estimation of feature transformation for speaker recognition," in Proc. Interspeech, 2008, pp. 849–852.

[20] D. Zhu, B. Ma, and H. Li, "Joint MAP adaptation of feature transformation and Gaussian mixture model for speaker recognition," in Proc. ICASSP, 2009, pp. 4045–4048.

[21] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, "A compact model for speaker adaptive training," in Proc. ICSLP, 1996, pp. 1137–1140.

[22] V. V. Digalakis, D. Rtischev, and L. G. Neumeyer, "Speaker adaptation using constrained estimation of Gaussian mixtures," IEEE Trans. Speech Audio Process., vol. 3, no. 5, pp. 357–366, Sep. 1995.

[23] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, pp. 19–41, 2000.

[24] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, and P. Dumouchel, "A study of interspeaker variability in speaker verification," IEEE Trans. Audio, Speech, Lang. Process., vol. 16, no. 5, pp. 980–988, Jul. 2008.

[25] A. Solomonoff, W. M. Campbell, and I. Boardman, "Advances in channel compensation for SVM speaker recognition," in Proc. ICASSP, 2005, pp. 629–632.

[26] A. O. Hatch, S. Kajarekar, and A. Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," in Proc. Interspeech, 2006, pp. 1471–1474.

[27] C. Vair, D. Colibro, F. Castaldo, E. Dalmasso, and P. Laface, "Channel factors compensation in model and feature domain for speaker recognition," in Proc. IEEE Odyssey'06: Speaker Lang. Workshop, 2006.

[28] R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score normalization for text-independent speaker verification systems," Digital Signal Process., vol. 10, pp. 42–54, 2000.

[29] R. C. Rose, E. M. Hofstetter, and D. A. Reynolds, "Integrated models of signal and background with application to speaker identification in noise," IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 245–257, Apr. 1994.

[30] M. Afify, Y. Gong, and J.-P. Haton, "A general joint additive and convolutive bias compensation approach applied to noisy Lombard speech recognition," IEEE Trans. Speech Audio Process., vol. 6, no. 6, pp. 524–538, Nov. 1998.

[31] J. Wu and Q. Huo, "A switching linear Gaussian hidden Markov model and its application to nonstationary noise compensation for robust speech recognition," in Proc. ICASSP, 2003, pp. 977–980.

[32] R. Collobert and S. Bengio, "SVMTorch: Support vector machines for large-scale regression problems," J. Mach. Learn. Res., vol. 1, pp. 143–160, 2001.

[33] NIST Speaker Recognition Evaluation [Online]. Available: http://www.itl.nist.gov/iad/mig/tests/sre

[34] Z. N. Karam and W. M. Campbell, "A multi-class MLLR kernel for SVM speaker recognition," in Proc. ICASSP, 2008, pp. 4117–4120.

Donglai Zhu (M'07) received the B.Eng. and Ph.D. degrees in electronics engineering from the University of Science and Technology of China, Hefei, in 1998 and 2003, respectively.

From 2003 to 2005, he was a Research Assistant at the University of Hong Kong. Since 2005, he has been a Research Fellow in the Human Language Technology Department, Institute for Infocomm Research, Singapore. His research interests include robust speech recognition, speaker recognition, spoken language recognition, and machine learning.

Bin Ma (M'00–SM'06) received the B.Sc. degree in computer science from Shandong University, Jinan, China, in 1990, the M.Sc. degree in pattern recognition and artificial intelligence from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, in 1993, and the Ph.D. degree in computer engineering from The University of Hong Kong in 2000.

He was a Research Assistant from 1993 to 1996 at the National Laboratory of Pattern Recognition, IACAS. In 2000, he joined Lernout & Hauspie Asia Pacific as a Researcher focusing on the speech recognition of multiple Asian languages. From 2001 to 2004, he worked for InfoTalk Corp., Ltd., as a Senior Researcher and a Senior Technical Manager engaging in mix-lingual telephony speech recognition and embedded speech recognition. Since 2004, he has been a Research Scientist and the Group Leader of the Speech Processing Group, Institute for Infocomm Research, Singapore. He now serves as a Subject Editor of Speech Communication. His current research interests include robust speech recognition, speaker and language recognition, spoken document retrieval, natural language processing, and machine learning.

Haizhou Li (M'91–SM'01) is currently the Principal Scientist and Department Head of Human Language Technology, Institute for Infocomm Research, Singapore. He is also the Program Manager of Social Robotics at the Science and Engineering Research Council of A*Star in Singapore. He has worked on speech and language technology in academia and industry since 1988. He taught in the University of Hong Kong (1988–1990), South China University of Technology (1990–1994), and Nanyang Technological University (2006–present). He was a Visiting Professor at CRIN, France (1994–1995), and at the University of New South Wales, Australia (2008). As a technologist, he was appointed as Research Manager in the Apple-ISS Research Centre (1996–1998), Research Director in Lernout & Hauspie Asia Pacific (1999–2001), and Vice President in InfoTalk Corp., Ltd. (2001–2003). His current research interests include automatic speech recognition, speaker and language recognition, and natural language processing. He has published over 150 technical papers in international journals and conferences. He holds five international patents.

Dr. Li now serves as an Associate Editor of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, and the Springer International Journal of Social Robotics. He is an elected Board Member of the International Speech Communication Association (ISCA, 2009–2013), a Vice President of the Chinese and Oriental Language Information Processing Society (COLIPS, 2009–2011), an Executive Board Member of the Asian Federation of Natural Language Processing (AFNLP, 2006–2010), and a Member of the ACL. He was a recipient of the National Infocomm Award of Singapore in 2001. He was named one of the two Nokia Visiting Professors 2009 by the Nokia Foundation in recognition of his contribution to speaker and language recognition technologies.