SPEAKER VERIFICATION

EXPLORATION OF SPEAKER MODELLING AND SCORE NORMALIZATION METHOD FOR DEVELEPMENT OF A VOICE PASSWORD BASED SPEAKER VERIFICATION SYSTEM

Under guidance of NAME-AJAY KUMAR PASWAN

Dr. G. Pradhan M.TECH yr (ECE Dept.)

NIT PATNA (ECE Dept.) ROLL NO-1229011

OUTLINE

Introduction Literature review on speaker verification system Summary of literature review Motivation for present work Baseline speaker verification system Proposed speaker verification system Summary and Contribution Future scope

INTRODUCTION

Speaker verification: A process of verifying identity claim of a person from his/her voice To improve the security level, recent technology turned towards using biometric

features over non-biometric features With the emergence of mobile technology a person can remotely access the

system, so, remote monitoring is possible

Speaker verification can be divided into Text-independent Text-dependent Voice password

Text- independent system has lesser performance than text-dependent due to additional phonetic variability between training and testing speech Text- independent system requires more data for training and testing

BRIEF HISTORY Research in the field of speaker recognition was

initially carried out in 1950s in Bell laboratories using isolated digits [1].

1960- 1990 most of the research was focused on extraction of speaker specific information from the speech data, and development of text dependent speaker verification system.

In 1990-2005 the speaker recognition method shifted from template based pattern matching to statistical modeling. Different statistical modeling method like GMM and GMM-UBM are proposed.

2005- 2014 most of the research was focused on compensation of mismatches and development of practical authentication systems. Different compensation methods like JAFA, i-vectors and LDA, WCCN, PLDA are proposed.1. K. H. Davis, et. al., “Automatic recognition of spoken digits,” J.A.S.A., 24 (6), pp. 637-642, 1952.

MODULAR REPRESENTATION OF VOICE PASS WORD BASED SPEAKER VERIFICATION SYSTEM

Fig: Voice password speaker verification system

Training Reference model

Speech

Identity claim

Testing

Speech R

Accept/reject

Pre-

processing

Feature

extraction

Model

Building

Pre-

processing

Feature

extraction

comparison

Decision logic

PREPROCESSING Preprocessing is an important step in a speaker verification

system. This also called voice activity detection (VAD). VAD separates speech region from non-speech regions[2-3] It is very difficult to implement a VAD algorithm which works

consistently for different type of data VAD algorithms can be classified in two groups

Feature based approach Statistical model based approach

Each of the VAD method have its own merits and demerits depending on accuracy, complexity etc.

Due to simplicity most of the speaker verification systems use signal energy for VAD.2. J. P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol. 85, no. 9, pp. 1437–1462, September 1997.3. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995.

FEATURE EXTRACTION The speech signal along with speaker information contains

many other redundant information like recording sensor, channel, environment etc.

The speaker specific information in the speech signal[2] Unique speech production system Physiological Behavioral aspects

Feature extraction module transforms speech to a set of feature vectors of reduce dimensions To enhance speaker specific information Suppress redundant information[2-4]

4. F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451, 2004.

An ideal feature Robust to environmental and recording condition Contains less intra-speaker variability More inter-speaker variability

Most of the state-of-the-art speaker verification systems use Mel-frequency Cepstral Coefficient (MFCC) appended to it’s first and second order derivative as the feature vectors Easy to extract Provides best performance compared to other

features MFCC mostly contains information about the

resonance structure of the vocal tract system

STEPS FOR MFCC COMPUTATION

Windowing of signal using Hamming window DFT spectrum: Discrete Fourier transform is calculated for

each window frame by following DFT equation. X (k) = ; Mel-Spectrum Mel-Spectrum can be calculated by passing

the Fourier transform of the signal through mel-filter bank, mel-bank filter is a set of band pass filter

The mel-frequency related to the linear frequency as fmel = 2595log10ቀ1+ 𝑓700ቁ

Discrete cosine transform (DCT): Discrete cosine transform convert mel–spectrum on

log scale to cepstral coefficients Unlike spectral feature which are highly correlated ,

cepstral features produce a more decorrelated , compact representation.

DCT of k log filter bank , spectral values, {log(Sk)}Kk=1 ,

into L cepstral coefficient Cn = (Sk) cos n = 1 ,2 ,3 , …… L

Typically L = 13 MFCC coefficient are calculated per frame , which is feature vector of that frame. The cepstral coefficient are usually static feature, they

contain the information about a particular frame only, so to get dynamics of the signal first and second derivative of cepstral coefficient is computed.

SPEAKER MODELING Speaker models the statistical information present

in the feature vectors it enhances the speaker information and suppress the redundant information

For text independent speaker verification speaker modeling technique used is vector quantization(VQ), Gaussian mixer model(GMM)[5], GMM-universal back ground model(GMM-UBM)[6], Artificial neural network(ANN)and support vector machine(SVM)

The Gaussian mixer model is most widely used for speaker verification systems5. D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, vol. 17, pp. 91–108, March 1995.6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, pp. 19–41, January 2000.

Gaussian model assumes the feature vectors follow a Gaussian distribution, characterized by mean vectors, covariance matrix and weights

The data unseen in the training which appear in the test data will trigger a low score

Though GMM is quit powerful but it need large training data to properly estimate model parameter

GMM is available powerful and versatile parameter estimation algorithm, expectation-maximization.

Pattern comparison Testing phase test feature vectors are

compared with claimed model to get similarity between training and testing speech

Different similarity measure is done for used modeling method

Euclidean distance [8] for VQ , log likelihood score(LLS)[7] and log likelihood score ratio(LLSR) for GMM-UBM.7. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on speech and audio processing, vol. 3, no. 1, pp. 72–83, January 19958. F. K. Soong and A. E. Rosenberg, “On the use of instantaneous and transitional spectral information in speaker recognition,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871–879, June 1988

PERFORMANCE MEASURE: A perfect SV system should accept all true claim

and reject all the false claims Depending on the variability between the training

and testing speech some true claim may be rejected and some false claim may be accepted

Therefore the speaker verification performance is measured in term of false rejection rate (FRR) and false acceptance rate (FAR), more meaningfully in term of equal error rate(EER)[9].

In order to improve the visualization of the SV performance, the detection error tradeoff(DTF) curve is used to performance measure

9.F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451, 2004

SUMMARY ON REVIEW

The performance of speaker verification system is mostly depend upon quality of speech signal

The performance of the system degraded significantly under mismatched conditions

The phonetic variability between training and testing speech is another major source of mismatch

Speaker verification for text-dependent mode is performed with DTW algorithm, HMM

GMM is useful for modeling of system in text-independent mode

MOTIVATION FOR PRESENT WORK

Most of the application where speech signal of short duration used around 3-5ms, but Speaker verification system provide poor performance for short duration speech signal

This degradation of performance is due to phonetic variability between training and testing speech data

The phonetic variability may be reduced by artificially generating multiple utterance, taking feature around Glottal closure instants(GCI)

Most of the SV system develop score normalization using on cohort centric normalization. The speaker centric score normalization may provide better result.

OBJECTIVE OF THIS THESIS WORK

To develop voice password based speaker verification

To study impact of text-mismatch on the performance of voice password based speaker verification system

Develop a voice password based speaker verification system in text-independent mode

Explore method to model speaker information in limited data condition

Study and Explore the advantages of speaker centric score normalization

DATABASE COLLECTION

Total database collection = 100 speaker Male speaker ,85 and Female speaker ,15

Number of repetition for train= 3 session Number of repetition for test=5 session Format of file naming = 8765538857_NAMCF

BASELINE SPEAKER VERIFICATION SYSTEM

For Baseline speaker verification the following parameter are used VAD threshold is taken 0.1 of average energy Baseline uses MFCC appended with first and

second order derivative , i.e. delta(Δ) and delta delta(ΔΔ) for feature extraction

Feature vector: It uses 39 dimension feature vector and 20ms frame size with shift 2ms.

Modeling: GMM GMM size: 8, 16, 32, 64.

3.6 EXPERIMENTAL RESULT

train

test

GMM size

8 16 32 64

Vp Name Vp name Vp Name Vp Name

Vp 17.52 42.26 24.74 43.92 28.8 44.32 38.14 46.39

name 39.17 17.52 41.23 20.61 43.29 27.83 39.17 45.36

Table: Baseline result.

1 2 5 10 20 40 60 80 90 95 98 99 1

2

5

10

20

40

60

80

90

95

98

99

False Alarm probability (in %)

Mis

s p

robabili

ty (

in %

)

original baseline gmm

Fig: baseline DET plot

GENERATION OF MULTIPLE UTTERANCE BY ADDING WHITE NOISE TO TRAINING SPEECH

Motivation It covers entire spectrum of speech signal Addition of white noise will reduce phonetic

variability as it covers entire spectrum Feature are calculated white noise added for

training and clear for testing Modeling of train with white noise added, and

test data is clear White noise used range [-10,-

5,0,+5,+10,+20]db VAD used in clean speech for reference index Reference index is used to find speech region

EXPERIMENTAL RESULT

Train

test

GMM SIZE

8 16 32 64

Vp Name Vp Name vp Name Vp Name

Vp 14.4 35.05 18.55 37.11 22.86 40.20 28.86 44.32

Name 35.05 12.37 35.05 14.40 39.17 19.58 40.20 26.80

Table : Result based on white noise added.

1 2 5 10 20 40 60 80 90 95 98 99 1

2

5

10

20

40

60

80

90

95

98

99


Mis

s p

robabili

ty (

in %

)


white noise added

Fig: comparison of baseline with white noise added

By adding white noise the phonetic variability of the training data reduce.Hence the performance is increased from baseline.

MAXIMUM A POSTERIOR (MAP) ADAPTATION METHOD:

In Gaussian mixer model for Modeling of speaker, it is necessary sufficient training data must be available to make model of speaker

There is a another method available, maximum a posteriori (MAP) adaptation, of a background model trained on the speech data of several other speaker

It may be useful for evaluation of statistical model which may be useful for short duration speech data

Maximum a posterior (MAP) adaptation takes the prior information of existing model and changes their parameter according to new training data

Train

test

GMM SIZE

8 16 32 64

Vp Name Vp Name Vp Name Vp Name

Vp 14.43 36.08 11.34 34.02 15.46 36.08 20.61 37.11

Name 34.02 12.37 34.02 14.43 40.20 15.46 40.20 24.74

Table : Map adaptation of clean data on noise model.

EXPERIMENTAL RESULT:

1 2 5 10 20 40 60 80 90 95 98 99 1

2

5

10

20

40

60

80

90

95

98

99


Mis

s pr

obab

ility

(in

%)


white noise added

map adaptation

Fig: comparison of above two model with map adaptation

RESIDUAL MFCC FROM GCI

Computation of residual phase through Linear prediction analysis The speech signal produced is convolution of

excitation source and vocal track system The speaker verification system required

speaker specific information The feature around glottal closure instants

(GCI) are more speaker specific[10]

10. B. Yanganarayana and P. Satyanarayana Murty, “Enhancement of reverberant speech using LP residual signal,”IEEE Trans.Speech Audio Process.,vol. 14, pp. 774-784, May 2006.

ZERO FREQUENCY FILTERING (ZFF) METHOD

The ZFF method is most useful for evaluating the various parameter of prosodic parameter

It is the best available method to calculate expressive parameter for various emotions

The feature around CGI can be computed using ZFF[11]

Periodically located epoch in voiced speech signal represent the glottal closure instants

11. K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech signals,” IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602–1614, Nov. 2008

EXPERIMENTAL RESULT FOR RESIDUAL MFCC FROM GCI

Train

test

GMM SIZE

8 16 32 64

Vp name Vp name Vp Name Vp Name

Vp 20.6 35.05 19.58 38.14 25.77 39.17 30.92 42.26

Name 36.08 24.74 36.08 25.77 40.20 31.95 46.39 35.05

Table : result of residual MFCC from GCI

DET CURVE FOR COMPARISON OF DIFFERENT PROPOSED METHOD WITH BASELINE METHOD

1 2 5 10 20 40 60 80 90 95 98 99 1

2

5

10

20

40

60

80

90

95

98

99


Mis

s p

robabili

ty (

in %

)

baseline gmm

gmm white noise added

map adaptation

residual around gci

SCORE NORMALIZATION

The speech data used for the development of model and testing varies between the speakers

For the same speaker quality and quantity of test data varies between the trials so the verification score varies between the trials

Compensation of different variability at the score level is commonly known as score normalization

The score normalization helps to reduce degradation and mismatch effect that are not compensated at feature and model levels

It also transforms scores from different trials into a similar range so that a common speaker independent verification threshold can be used

BASELINE WITH SCORE NORMALIZATION EXPERIMENTAL RESULT

Train and test speech Session 1(initial) After score normalization

name train name test 16.49 14.43

name train vp test 38.14 37.11

vp train vp test 19.58 22.68

vp train name test 40.20 40.20

Table : experimental result GMM baseline with score normalization

Fig : DET plot of baseline initial and after score normalization

1 2 5 10 20 40 60 80 90 95 98 99 1

2

5

10

20

40

60

80

90

95

98

99


Mis

s pr

obab

ility

(in

%)

score norm baseline gmm

GMM WITH WHITE NOISE ADDED TRAIN






Table : experimental result of GMM with white noise added train

Fig : DET plot for GMM with white noise and after score normalization

1 2 5 10 20 40 60 80 90 95 98 99 1

2

5

10

20

40

60

80

90

95

98

99


Mis

s p

robabili

ty (

in %

)


score norm white noise added

MAP ADAPTATION OF CLEAN DATA ON NOISY MODEL






Table : result of map adaptation of noisy train on clean test initial and after score

Fig : DET plot for map adaptation of noisy train initial and after SN

1 2 5 10 20 40 60 80 90 95 98 99 1

2

5

10

20

40

60

80

90

95

98

99


Mis

s p

robabili

ty (

in %

)



score norm map adaptation

RESIDUAL FEATURE FROM 3MS GCI






Table : result of residual MFCC around 3ms GCI:

Fig: DET plot for residual MFCC around GCI initial and after score normalization.

1 2 5 10 20 40 60 80 90 95 98 99 1

2

5

10

20

40

60

80

90

95

98

99


Mis

s p

robabili

ty (

in %

)



score norm map adaptation

score norm residual around gci

The residual MFCC calculate the feature around glottal closure instances(GCI)

Residual MFCC is specific to speaker only and it does not contain information about utterances

It may provide better result combined with above technique

DET PLOT FOR COMPARISON OF DIFFERENT MODELING TECHNIQUE

1 2 5 10 20 40 60 80 90 95 98 99 1

2

5

10

20

40

60

80

90

95

98

99


Mis

s p

robability (

in %

)

baseline gmm

white noise added

map adaptationresidual around gci

score nor of map addapttation

Fig: comparison of different modeling technique

SUMMARY OF DIFFERENT MODELING TECHNIQUE

The baseline best score is is 17.52. By using white noise the best result is

improved from 17.52 to 14.4 for vp and 17.52 to 12.37 for name

The result is further improved using MAP adaptation from 14.4 to 12.37 and for name 12.37 to 11.34

By using score normalization technique the score is reduced upto 8.24 for vp and 9.27 for name.

COTRIBUTION:

Database is collected for future research A method is proposed to model the limited

data by generating multiple utterance of speech through adding controlled white noise addition to clean speech

The performance of speaker centric score normalization under limited data condition is address

FUTURE SCOPE:

Extraction of feature to reduce the impact of phonetic variability

Different residue of behavioral feature may be extracted in addition to MFCC for speaker verification

In this project we considered GMM modeling technique in next work many other technique

may be used like JAFA, i-vector etc.

REFERENCES 1. K. H. Davis, et. al., “Automatic recognition of spoken digits,”

J.A.S.A., 24 (6), pp. 637-642, 1952. 2. J. P. Campbell, “Speaker Recognition: A Tutorial,” Proc. IEEE, vol.

85, no. 9, pp. 1437–1462, September 1997. 3. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker

identification using Gaussian mixture speaker models,” IEEE Trans. on speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995.

4. F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451, 2004.

5. D. A. Reynolds, “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, vol. 17, pp. 91–108, March 1995.

6. D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, pp. 19–41, January 2000.

7. D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on speech and audio processing, vol. 3, no. 1, pp. 72–83, January 1995

8. F. K. Soong and A. E. Rosenberg, “On the use of instantaneous and transitional spectral information in speaker recognition,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 36, no. 6, pp. 871–879, June 1988

9.F. Bimbot, J. Bonastreand, C. Fredouille, G. Gravier, I. Chagnolleau, S. Meignier, T. Merlin, J. Garcia, D. Delacretaz, and D. A. Reynolds, “A tutorial on text-independent speaker verification,” EURASIP Journal on Applied Signal Processing, vol. 4, pp. 430–451, 2004

10. B. Yanganarayana and P. Satyanarayana Murty, “Enhancement of reverberant speech using LP residual signal,”IEEE Trans.Speech Audio Process.,vol. 14, pp. 774-784, May 2006.

11. K. S. R. Murty and B. Yegnanarayana, “Epoch extraction from speech signals,” IEEE Trans. Audio, Speech and Language Process., vol. 16, no. 8, pp. 1602–1614, Nov. 2008

thank you

Technology

SPEAKER VERIFICATION