
Page 1: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES

S. Raghavan, G. Lazarou and J. Picone

Intelligent Electronic Systems, Center for Advanced Vehicular Systems, Mississippi State University

URL: http://www.cavs.msstate.edu/hse/ies/publications/conferences/ieee_secon/2006/support_vectors/

SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES

Page 2: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES

Page 2 of 15: Speaker Verification Using Support Vector Machines

Speaker Verification

• Speaker verification uses voice as a biometric to determine the authenticity of a user.

• Speaker verification systems consist of two essential operations:

– Enrollment: the system learns the speaker's acoustic characteristics and builds a speaker model or template.

– Verification: the claimed speaker's utterance is compared against the model and a likelihood score is computed. A threshold is used to discriminate the true speaker from an impostor.

This presentation focuses on the classifier portion of the speaker verification system.

Page 3: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


• This is an example of a distribution of the fourth cepstral coefficient from one of the utterances taken from the NIST 2003 Speaker Recognition dataset.

• The distribution cannot be modeled by a single Gaussian.

• By using a mixture of two Gaussians, we can achieve a more accurate representation of the distribution.
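The single-Gaussian vs. two-Gaussian comparison above can be sketched as follows. This is an illustrative example using scikit-learn and synthetic bimodal data, not the authors' code or the NIST data:

```python
# Illustration: a bimodal 1-D feature (standing in for the cepstral
# coefficient described above) is modeled better by two Gaussians
# than by one. Data here is synthetic, not from NIST 2003.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 500),
                    rng.normal(1.0, 0.8, 500)]).reshape(-1, 1)

for k in (1, 2):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(x)
    # score() returns the average per-sample log-likelihood
    print(k, "component(s): avg log-likelihood =", round(gmm.score(x), 3))
```

The two-component model attains a higher average log-likelihood on this bimodal sample, matching the observation above.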

Baseline System

• The baseline system used a Gaussian Mixture Model (GMM) as the underlying classifier.

• The system used 39 dimensional Mel Frequency Cepstral Coefficients (MFCCs) as input features.

[Block diagram: a train utterance is converted into a sequence of N MFCC feature frames and used to train a statistical model; a test utterance is scored against that model (compute likelihood) to reach a decision.]

Page 4: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


• The probability of obtaining the correct hypothesis given the test data is a posterior probability, and it can be decomposed into a prior and a conditional probability using Bayes' rule.

• The parameters of P(A|S) are estimated from the data using Maximum Likelihood Estimation (MLE).

• The parametric form of P(A|S) is generally assumed to be Gaussian.

• Since the acoustic data may not actually be Gaussian, we end up with approximations that lead to classification errors.

Drawbacks of Using Gaussian Mixtures

P(S|A) = P(A|S) P(S) / P(A)

where:
P(S|A): probability of the speaker given the acoustics (posterior)
P(A|S): probability of the acoustics given the speaker
P(S): prior probability of the speaker
P(A): evidence (normalizing term)
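The decomposition can be checked with a toy numeric example. The numbers below are hypothetical, chosen only to exercise the formula:

```python
# Toy numeric check of Bayes' rule for speaker verification.
# All probabilities here are made up for illustration.
p_s = 0.01              # P(S): prior probability of the claimed speaker
p_a_given_s = 0.30      # P(A|S): likelihood of the acoustics for that speaker
p_a_given_not_s = 0.05  # likelihood of the acoustics under the impostor model

# P(A): evidence, marginalized over speaker / non-speaker
p_a = p_a_given_s * p_s + p_a_given_not_s * (1 - p_s)

# P(S|A): posterior probability of the speaker given the acoustics
p_s_given_a = p_a_given_s * p_s / p_a
print(round(p_s_given_a, 4))
```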

Page 5: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


Optimal Decision Boundary

• Maximum likelihood convergence does not translate to optimal classification if the a priori assumptions about the data are incorrect.

• At best one can approximate the distribution using a mixture of Gaussians, but the problem of finding the optimal decision boundary remains.

• An SVM makes no Gaussian assumption. Instead it is a purely discriminative approach in which the boundaries are learned directly from the data.

Page 6: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


Support Vector Machines (SVMs)

• SVMs are binary classifiers that learn the decision region through discriminative training.

• SVMs transform data to a higher dimensional space using kernel functions.

• In the higher dimensional space, SVMs can construct a linear hyperplane to form the decision boundary; in the original lower-dimensional space this corresponds to a non-linear decision boundary.

• SVMs require labeled data for training.

• SVMs find an optimal decision boundary.

[Figure: two classes separated by candidate hyperplanes C0, C1, C2 with margin hyperplanes H1 and H2 and normal vector w; C0 is the optimal classifier.]

• Hyperplanes C0-C2 achieve zero empirical risk. C0 generalizes optimally.

• The data points that define the boundary are called support vectors.
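The properties listed above (discriminative training, kernel mapping, support vectors) can be sketched with scikit-learn, which is an assumed stand-in for whatever SVM trainer the authors used:

```python
# Sketch: a binary RBF-kernel SVM learns a boundary from labeled
# synthetic data; only a small subset of the training points end up
# as support vectors. C=50 and gamma=0.019 are the values the
# presentation later selects; the data is synthetic.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
class1 = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
class2 = rng.normal(loc=[2, 2], scale=0.5, size=(100, 2))
X = np.vstack([class1, class2])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="rbf", C=50, gamma=0.019).fit(X, y)
print("support vectors:", clf.n_support_.sum(), "of", len(X))
```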

Page 7: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


Using SVMs for Optimal Decision Boundary

• Hyperplane: w · x + b = 0

• Constraints: y_i (w · x_i + b) − 1 ≥ 0

• Quadratic optimization of a Lagrange functional minimizes the risk criterion (maximizes the margin). Only a small portion of the training points become support vectors.

• Final classifier: f(x) = Σ_{i ∈ SVs} α_i y_i K(x, x_i) + b

• LBG classifier: classification error rate 27.88%. SVM classifier: classification error rate 21.15%.
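The final-classifier form can be verified numerically. In scikit-learn (an assumed stand-in, not the trainer used in the paper), `dual_coef_` already stores the products α_i·y_i for each support vector, so reconstructing f(x) by hand reproduces `decision_function`:

```python
# Check that f(x) = sum over SVs of alpha_i * y_i * K(x, x_i) + b
# matches the library's own decision function. Synthetic data.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

gamma = 0.019
clf = SVC(kernel="rbf", C=50, gamma=gamma).fit(X, y)

x_test = np.array([[1.5, 1.5]])
K = rbf_kernel(x_test, clf.support_vectors_, gamma=gamma)  # K(x, x_i)
# dual_coef_ holds alpha_i * y_i for each support vector
f = (clf.dual_coef_ @ K.T + clf.intercept_).item()
assert np.isclose(f, clf.decision_function(x_test).item())
```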

Page 8: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


• Speaker verification requires the system to make a binary decision.

• The decision boundary can be learned from labeled data.

• An SVM classifier is ideally suited for such a task. It requires two sets of labeled data: in-class and out-of-class.

• The impostor data can consist of data from all speakers excluding the speaker whose data is used as in-class.

• A classifier must be trained for each speaker.

• During verification, the speaker's utterance is scored against the claimed identity's classifier.

Using SVMs for Speaker Verification

f(x) = Σ_{i ∈ SVs} α_i y_i K(x, x_i) + b
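The one-model-per-speaker scheme described above can be sketched as follows. This is a hedged illustration with scikit-learn and synthetic per-speaker features, not the authors' pipeline:

```python
# For each speaker, train an SVM with that speaker's features as the
# in-class set and the pooled features of all other speakers as the
# out-of-class (impostor) set. Features here are synthetic stand-ins
# for 39-dimensional MFCCs (reduced to 4-D for brevity).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Hypothetical per-speaker feature matrices: {speaker_id: (frames, dims)}
features = {s: rng.normal(loc=3.0 * s, scale=1.0, size=(80, 4))
            for s in range(3)}

models = {}
for speaker, in_class in features.items():
    out_class = np.vstack([f for s, f in features.items() if s != speaker])
    X = np.vstack([in_class, out_class])
    y = np.array([1] * len(in_class) + [-1] * len(out_class))
    models[speaker] = SVC(kernel="rbf", C=50, gamma=0.019).fit(X, y)

print(len(models), "speaker models trained")
```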

Page 9: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


Block Diagram of SVM Based Speaker Verification

• The input data contains 39-dimensional MFCCs.

• The in-class data corresponds to a particular speaker and the out-of-class data contains features of all other speakers.

• The SVM model contains the support vectors and the weights.

[Block diagram. TRAINING: in-class data (the target speaker) and out-of-class data (all other speakers) are used to train the SVM model via Structural Risk Minimization (SRM); the model contains the support vectors and weights, and a model is created for every speaker in the training data set. TESTING: the test data is scored against the claimed speaker's model, f(x) = Σ_{i ∈ SVs} α_i y_i K(x, x_i) + b, giving a distance from the hyperplane between −1 and 1; if the distance is below a threshold the speaker is rejected, otherwise accepted.]

Page 10: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


Experimental Design

• Database: NIST 2003. The development train and test sets contained 60 and 78 utterances, respectively. Each utterance was approximately 2 minutes long.

• Standard 39-dimensional MFCCs were used as features.

• The in-class data for training contained the entire feature set from the speaker utterance.

• The out-of-class data for training contained randomly picked feature vectors from all the remaining speakers in the training data set. Note: the SVM trainer was not exposed to all of the available out-of-class data; this was done to speed up training.

• During testing, the distance of every MFCC test vector from the hyperplane is computed and averaged. If this average is greater than zero we accept the speaker; otherwise we reject the speaker.
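The averaging rule above can be sketched as follows. The `verify` helper and the synthetic data are assumptions for illustration; only the decision rule (average distance vs. a zero threshold) comes from the presentation:

```python
# Sketch of the verification decision: average the per-frame distances
# from the claimed speaker's hyperplane; accept if above the threshold.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
in_class = rng.normal(0.0, 1.0, (200, 4))    # target speaker frames
out_class = rng.normal(3.0, 1.0, (200, 4))   # impostor frames
X = np.vstack([in_class, out_class])
y = np.array([1] * 200 + [-1] * 200)
model = SVC(kernel="rbf", C=50, gamma=0.019).fit(X, y)

def verify(model, test_frames, threshold=0.0):
    """Average per-frame hyperplane distance; accept if above threshold."""
    avg = model.decision_function(test_frames).mean()
    return avg, avg > threshold

true_avg, accepted = verify(model, rng.normal(0.0, 1.0, (100, 4)))
imp_avg, imp_accepted = verify(model, rng.normal(3.0, 1.0, (100, 4)))
print(f"true speaker: {true_avg:+.3f} accepted={accepted}")
print(f"impostor:     {imp_avg:+.3f} accepted={imp_accepted}")
```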

• A Detection Error Trade-off (DET) curve was plotted using the true-speaker and impostor-speaker distances, and compared with the baseline GMM speaker verification system.

Page 11: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


Analyzing the Effect of Training Parameters

The parameters were varied based on the Detection Cost Function (DCF):

CDet = [CMiss × P(Miss|Target) × P(Target)] + [CFalseAlarm × P(FalseAlarm|NonTarget) × (1 − P(Target))]

The NIST DET curve analysis program was used to obtain the DCF values.
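The DCF above is a direct weighted sum and is easy to transcribe. The default costs and prior below are the values commonly used in the NIST speaker recognition evaluations (CMiss = 10, CFalseAlarm = 1, P(Target) = 0.01); the error rates passed in are illustrative:

```python
# Detection Cost Function, transcribed from the formula above.
def dcf(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """CDet = CMiss*P(Miss|Target)*P(Target)
            + CFalseAlarm*P(FalseAlarm|NonTarget)*(1 - P(Target))"""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# Illustrative operating point: 16% miss, 5% false alarm
print(round(dcf(p_miss=0.16, p_fa=0.05), 4))
```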

Two main parameters were analyzed:

• Penalty (C): this parameter penalizes training errors. It was varied from 10 to 100 and no significant change in performance was noted, so a mid-range value of 50 was chosen.

• Kernel parameter (gamma): for gamma values between 2.5 and 0.02 there was no change in the distance scores of the utterances in the test set. Performance was stable between 0.03 and 0.01, and the best performance was observed at gamma = 0.019.

Page 12: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


Effect of Penalty and Kernel Parameters

Minimum DCF as a function of gamma (C = 50):

  Gamma   Min DCF
  0.010   0.2125
  0.015   0.2168
  0.019   0.1406
  0.030   0.2305

DET curves for various values of the RBF kernel parameter Gamma

• The DET curve helps in selecting a decision threshold.

• The DET curve also helps in choosing an optimal operating region for the system.

[DET curves: miss probability (in %) vs. false alarm probability (in %)]

Page 13: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


SVM vs. GMM (Baseline System)

  System           EER    Min DCF
  Baseline (GMM)   25%    0.2124
  SVM              16%    0.1406

A comparison of baseline and SVM performance:

• The EER improved by 9% absolute.

• The Min-DCF improved by 33% relative.

• The results are promising since only a small portion of the entire data was used as out-of-class data.

• Also, the system uses a very simple averaging strategy to make the yes/no decision. A more intelligent approach would yield better results.
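The EER quoted above is the operating point where the miss rate equals the false-alarm rate. A minimal sketch of estimating it from true-speaker and impostor score arrays (synthetic here, not the paper's scores):

```python
# Estimate the equal error rate (EER): sweep thresholds over the
# pooled scores and find where miss and false-alarm rates cross.
import numpy as np

def eer(true_scores, impostor_scores):
    """Return the EER estimated at the threshold where the rates cross."""
    thresholds = np.sort(np.concatenate([true_scores, impostor_scores]))
    best = (1.0, None)
    for t in thresholds:
        p_miss = np.mean(true_scores < t)      # true speakers rejected
        p_fa = np.mean(impostor_scores >= t)   # impostors accepted
        gap = abs(p_miss - p_fa)
        if gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2)
    return best[1]

rng = np.random.default_rng(5)
true_s = rng.normal(1.0, 1.0, 1000)    # synthetic true-speaker scores
imp_s = rng.normal(-1.0, 1.0, 1000)    # synthetic impostor scores
print("EER ≈", round(eer(true_s, imp_s), 3))
```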

[DET curves: miss probability (in %) vs. false alarm probability (in %)]

Page 14: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


Conclusions and Future Work

Conclusions:

• The SVM system performed significantly better than the GMM baseline system.

– The Equal Error Rate improved by 9% absolute.

– The Min-DCF value improved by 33% relative.

• The effect of the RBF kernel parameter (gamma) was analyzed; the best performance (min DCF of 0.1406) was obtained at gamma = 0.019.

• A complete SVM-based speaker verification framework has been laid out.

Future Work:

• Improve the training efficiency of SVMs using a subsampling strategy: determine the support vectors from small subsets of the training data.

• Investigate Relevance Vector Machines (RVMs) for speaker verification.

• Use more robust features to train the classifier (features that capture the nonlinearities present in the speech signal).

• Research better techniques for determining the score for a speaker.

Page 15: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


Resources

• Pattern Recognition Applet: compare popular algorithms on standard or custom data sets

• Speech Recognition Toolkits: compare SVMs and RVMs to standard approaches using a state-of-the-art ASR toolkit

• Foundation Classes: generic C++ implementations of many popular statistical modeling approaches

Page 16: SPEAKER VERIFICATION USING SUPPORT VECTOR MACHINES


References

• J. P. Campbell, "Speaker Recognition: A Tutorial," Proceedings of the IEEE, pp. 1437-1462, September 1997.

• D. A. Reynolds, "Speaker Identification and Verification Using Gaussian Mixture Speaker Models," Speech Communication, pp. 91-108, 1995.

• A. Ganapathiraju, "Support Vector Machines for Speech Recognition," Ph.D. Dissertation, Department of Electrical and Computer Engineering, Mississippi State University, January 2002.

• W. M. Campbell, E. Singer, P. A. Torres-Carrasquillo, and D. A. Reynolds, "Language Recognition with Support Vector Machines," Proc. Odyssey: The Speaker and Language Recognition Workshop, Toledo, Spain, pp. 41-44, June 2004.

• J. Picone, "Signal Modeling Techniques in Speech Recognition," Proceedings of the IEEE, vol. 81, no. 9, pp. 1215-1247, September 1993.

• "NIST 2003 Speaker Recognition Evaluation Plan," http://www.nist.gov/speech/tests/spk/2003/doc/2003-spkrec-evalplan-v2.2.pdf, March 2006.

• A. Martin, G. Doddington, M. Ordowski, and M. Przybocki, "The DET Curve in Assessment of Detection Task Performance," Proceedings of Eurospeech, vol. 4, pp. 1895-1898, 1997.