Speaker Identification using Gaussian Mixture Model

2000/05/03 1


Presented by CWJ


Reference

D. A. Reynolds and R. C. Rose, “Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models”, IEEE Trans. on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995.


Outline

1. Introduction to Speaker Recognition

2. Gaussian Mixture Speaker Model (GMM)

3. Experimental Evaluation


Introduction to Speaker Recognition

A. Some definitions of speaker recognition (S.R.)

1. Two tasks of Speaker Recognition

-- Speaker Identification (this paper)

e.g. voice mail labeling

-- Speaker Verification

e.g. financial transactions


2. Two forms of spoken input

-- Text-dependent

-- Text-independent (this paper)

3. System Range

-- Closed Set (this paper)

-- Open Set


B. Several Methods used in Speaker Recognition

[Figure: timeline of speaker recognition methods from 1985 to 1995, showing VQ, NN, HMM, and GMM]


1. Use long-term averages of acoustic features (spectrum, pitch, …): the first and earliest approach

Idea : Average out the factors that cause intra-speaker variation, leaving only the speaker-dependent component.

Drawback : requires long speech utterances (> 20 s)


2. Training a speaker-dependent (SD) model for each speaker

Explicit segmentation: HMM

Implicit segmentation: VQ, GMM


HMM:

Advantage : Text-independent

Drawback : a significant increase in

computational complexity

VQ:

Advantage : unsupervised clustering

Drawback : Text-dependent


3. The use of discriminative Neural Network (NN)

※ Model the decision function that best discriminates between speakers

Advantage : fewer parameters and higher performance than the VQ model

Drawback : the network must be retrained when a new speaker is added to the system.


GMM :

Advantage : Text-Independent

probabilistic framework (robust)

computationally efficient

easy to implement


The Gaussian mixture model (GMM)

A. Model Interpretations

Speech Recognition: the GMM is used at the state level


Speaker Recognition: Speaker k

[Figure: the GMM of speaker k. Component i, with mean μ_i, covariance Σ_i, and mixture weight p_i, represents one acoustic class.]

1. Each Gaussian component models an acoustic class


2. A GMM gives a better approximation to arbitrarily shaped densities.


B. Signal Analysis


C. Model Description

Gaussian Mixture Density

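$$p(\vec{x} \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(\vec{x})$$

where $\vec{x}$ is a D-dimensional random vector and

$$b_i(\vec{x}) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\vec{x} - \vec{\mu}_i)' \, \Sigma_i^{-1} \, (\vec{x} - \vec{\mu}_i) \right\}$$

Model parameters:

$$\lambda = \{ p_i, \vec{\mu}_i, \Sigma_i \}, \quad i = 1, \ldots, M$$

Covariance matrix options: nodal, grand, global

Nodal, diagonal (this paper)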
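The mixture density above can be evaluated directly once the covariance is restricted to the nodal, diagonal case. The following is a minimal sketch under that assumption; the function names and the toy parameter values are illustrative and do not come from the paper. It works in the log domain (log-sum-exp) to avoid underflow, which is an implementation choice made here, not something the slides prescribe.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log of b_i(x) when the covariance is diagonal (var holds the diagonal)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + quad)

def gmm_log_density(x, weights, means, variances):
    """log p(x | lambda) = log sum_i p_i b_i(x), computed with log-sum-exp."""
    terms = [math.log(p) + log_gaussian_diag(x, m, v)
             for p, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Toy 2-component model in D = 2 dimensions (all values are made up).
weights = [0.4, 0.6]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
print(gmm_log_density([0.5, -0.2], weights, means, variances))
```

With a single component of weight 1, the function reduces to the log of an ordinary Gaussian density, which is a quick sanity check.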


D. ML Parameter Estimation

Step:

1. Begin with an initial model $\lambda$.

2. Estimate a new model $\bar{\lambda}$ such that $p(X \mid \bar{\lambda}) \ge p(X \mid \lambda)$.

3. Repeat step 2 until convergence is reached.


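Mixture Weights

$$\bar{p}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)$$

Means

$$\bar{\vec{\mu}}_i = \frac{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda) \, \vec{x}_t}{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)}$$

Variances (per element of the vectors)

$$\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda) \, x_t^2}{\sum_{t=1}^{T} p(i \mid \vec{x}_t, \lambda)} - \bar{\mu}_i^2$$

A posteriori probability of acoustic class i

$$p(i \mid \vec{x}_t, \lambda) = \frac{p_i \, b_i(\vec{x}_t)}{\sum_{k=1}^{M} p_k \, b_k(\vec{x}_t)}$$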
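One iteration of these re-estimation formulas can be sketched in a few lines. The code below assumes diagonal covariances and plain Python lists; the names `em_step` and `log_likelihood` and the toy data are hypothetical, chosen only for illustration. It is a sketch under those assumptions and omits safeguards such as variance limiting.

```python
import math

def component_likelihood(x, mean, var):
    """b_i(x) for a Gaussian with diagonal covariance."""
    norm = math.prod(2.0 * math.pi * v for v in var) ** -0.5
    quad = sum((a - m) ** 2 / v for a, m, v in zip(x, mean, var))
    return norm * math.exp(-0.5 * quad)

def em_step(X, weights, means, variances):
    """One EM iteration; returns the re-estimated (weights, means, variances)."""
    T, M, D = len(X), len(weights), len(X[0])
    # A posteriori probabilities p(i | x_t, lambda) for every frame.
    post = []
    for x in X:
        joint = [p * component_likelihood(x, m, v)
                 for p, m, v in zip(weights, means, variances)]
        total = sum(joint)
        post.append([j / total for j in joint])
    new_w, new_mu, new_var = [], [], []
    for i in range(M):
        n_i = sum(post[t][i] for t in range(T))
        mu = [sum(post[t][i] * X[t][d] for t in range(T)) / n_i
              for d in range(D)]
        var = [sum(post[t][i] * X[t][d] ** 2 for t in range(T)) / n_i - mu[d] ** 2
               for d in range(D)]
        new_w.append(n_i / T)
        new_mu.append(mu)
        new_var.append(var)
    return new_w, new_mu, new_var

def log_likelihood(X, weights, means, variances):
    """log p(X | lambda), assuming independent frames."""
    return sum(math.log(sum(p * component_likelihood(x, m, v)
                            for p, m, v in zip(weights, means, variances)))
               for x in X)

# Toy 1-D data with two clusters (made-up values) and a rough initial model.
X = [[-2.1], [-1.9], [-2.0], [2.0], [2.2], [1.8]]
weights, means, variances = [0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]]
for iteration in range(5):
    weights, means, variances = em_step(X, weights, means, variances)
    print(iteration, log_likelihood(X, weights, means, variances))
```

The printed log-likelihood should not decrease from one iteration to the next, which is the property the estimation steps rely on.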


E. Speaker Identification

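A group of speakers S = {1, 2, …, S} is represented by GMMs λ1, λ2, …, λS.

$$\hat{S} = \arg\max_{1 \le k \le S} \Pr(\lambda_k \mid X) = \arg\max_{1 \le k \le S} \frac{p(X \mid \lambda_k) \, \Pr(\lambda_k)}{p(X)}$$

Assuming equally likely speakers, and noting that p(X) is the same for all speaker models:

$$\hat{S} = \arg\max_{1 \le k \le S} p(X \mid \lambda_k)$$

Taking the logarithm and assuming independent observations:

$$\hat{S} = \arg\max_{1 \le k \le S} \sum_{t=1}^{T} \log p(\vec{x}_t \mid \lambda_k)$$

in which

$$p(\vec{x}_t \mid \lambda_k) = \sum_{i=1}^{M} p_i \, b_i(\vec{x}_t)$$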


Experimental Evaluation

A. Performance Evaluation

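$$\underbrace{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_T}_{\text{Segment 1}}, \vec{x}_{T+1}, \vec{x}_{T+2}, \ldots$$

$$\vec{x}_1, \underbrace{\vec{x}_2, \ldots, \vec{x}_T, \vec{x}_{T+1}}_{\text{Segment 2}}, \vec{x}_{T+2}, \ldots$$

e.g. frame rate = 10 ms, T = 500

the length of a test utterance = 5 seconds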


% correct identification = (# of correctly identified segments / total # of segments) × 100
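The segment-based scoring can be sketched as follows. The helper names are hypothetical, and integers stand in for feature vectors. Each segment is shifted by one frame from the previous one, and the score is the percentage of segments assigned to the true speaker.

```python
def overlapping_segments(frames, T):
    """Segment 1 is frames[0:T], segment 2 is frames[1:T+1], and so on."""
    return [frames[s:s + T] for s in range(len(frames) - T + 1)]

def percent_correct(decisions, true_speaker):
    """Percentage of segment-level decisions that match the true speaker."""
    correct = sum(1 for d in decisions if d == true_speaker)
    return 100.0 * correct / len(decisions)

# Seven frames and T = 5 give three overlapping segments.
frames = list(range(7))
for segment in overlapping_segments(frames, 5):
    print(segment)

# Hypothetical per-segment decisions for a test utterance from speaker 3.
print(percent_correct([3, 3, 1, 3], true_speaker=3))
```

With a 10 ms frame rate, T = 500 frames corresponds to 500 × 0.01 s = 5 s of speech per segment.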


C. Algorithmic Issues

1. Model Initialization :

-- Use the means of speaker-independent (SI), context-dependent subword HMMs and their global variance.

-- Randomly choose 50 vectors as the initial model means, and use an identity matrix as the starting covariance matrix.
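The second initialization option can be sketched as below. The uniform initial mixture weights and the fixed random seed are assumptions added for this example; the slides do not state how the weights are initialized. The identity covariance is stored as a vector of ones per component, matching the diagonal case.

```python
import random

def random_init(X, M, seed=0):
    """Pick M training vectors as initial means; start with identity covariances."""
    rng = random.Random(seed)                    # fixed seed: assumption, for repeatability
    means = [list(x) for x in rng.sample(X, M)]  # M distinct training vectors
    D = len(X[0])
    variances = [[1.0] * D for _ in range(M)]    # diagonal of the identity matrix
    weights = [1.0 / M] * M                      # uniform weights: assumption
    return weights, means, variances

# Toy training set of 100 two-dimensional vectors (made-up values).
X = [[float(i), float(-i)] for i in range(100)]
weights, means, variances = random_init(X, M=50)
print(len(means), variances[0], weights[0])
```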


2. Variance Limiting :

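When training a nodal variance GMM, the magnitude of a variance can become very small,

so apply the constraint

$$\bar{\sigma}_i^2 = \begin{cases} \sigma_i^2 & \text{if } \sigma_i^2 > \sigma_{\min}^2 \\ \sigma_{\min}^2 & \text{if } \sigma_i^2 \le \sigma_{\min}^2 \end{cases}$$

The minimum variance, $\sigma_{\min}^2$, is determined empirically.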
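The constraint amounts to flooring each variance element after re-estimation. A minimal sketch follows; the function name and the floor value of 0.01 are made up for the example, since the slides say only that the minimum is chosen empirically.

```python
def limit_variances(variances, var_min):
    """Replace every variance element that is at or below var_min with var_min."""
    return [[max(v, var_min) for v in component] for component in variances]

# Two components with two variance elements each (made-up values).
variances = [[0.5, 1e-6], [0.001, 2.0]]
print(limit_variances(variances, var_min=0.01))
```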


3. Model Order :

I. Performance versus model order.

Model orders: M = 1, 2, 4, 8, 16, 32, 64


II. Performance for different amounts of training data and model orders

III. Performance versus model order for models trained with 30, 60, and 90 s of speech.


4. Spectral Variability Compensation :

1) Frequency Warping :

$$f' = \frac{f_{\max} - f_{\min}}{f_N} \, f + f_{\min}$$

$f_N$ : original Nyquist frequency
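The warping is a linear map that compresses the original band from 0 to the Nyquist frequency into the band from f_min to f_max. The sketch below assumes a passband of 300 to 3300 Hz and a Nyquist frequency of 4000 Hz; these numbers are chosen only for illustration.

```python
def warp_frequency(f, f_min, f_max, f_nyquist):
    """Map f in [0, f_nyquist] linearly onto [f_min, f_max]."""
    return (f_max - f_min) / f_nyquist * f + f_min

# Illustrative values (assumptions): 300-3300 Hz passband, 4000 Hz Nyquist frequency.
for f in (0.0, 2000.0, 4000.0):
    print(f, warp_frequency(f, f_min=300.0, f_max=3300.0, f_nyquist=4000.0))
```

The two endpoints give a quick check: 0 Hz maps to f_min and the Nyquist frequency maps to f_max.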


2) Spectral Shape Compensation :

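Assumption :

[Figure: Speaker → Channel (with a frequency response over f) → Signal Processing → mel-cepstral feature vector]

$$\vec{z} = \vec{x} + \vec{h}$$

where $\vec{z}$ is the observed cepstral vector, $\vec{x}$ is the cepstral vector of the input speech, and $\vec{h}$ is the cepstral vector of the channel filter.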


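‧ Mean normalization for a time-invariant (T.I.) channel filter (CMS)

$$\vec{m} = \frac{1}{T} \sum_{t=1}^{T} \vec{z}_t, \qquad \vec{z}_t^{\,comp} = \vec{z}_t - \vec{m}$$

‧ Use a “channel invariant” feature (delta-cepstral)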
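Mean normalization can be sketched as follows. The function name and the toy vectors are assumptions for illustration. The example also shows why the method removes a time-invariant channel: adding the same offset h to every frame leaves the compensated vectors unchanged.

```python
def cepstral_mean_subtraction(Z):
    """Subtract the utterance-level mean vector m from every frame z_t."""
    T, D = len(Z), len(Z[0])
    m = [sum(z[d] for z in Z) / T for d in range(D)]
    return [[z[d] - m[d] for d in range(D)] for z in Z]

# Toy cepstral vectors (made-up values) and a constant channel offset h.
Z = [[1.0, 2.0], [3.0, 4.0]]
h = [0.5, -0.25]
Z_channel = [[z[d] + h[d] for d in range(2)] for z in Z]

print(cepstral_mean_subtraction(Z))
print(cepstral_mean_subtraction(Z_channel))
```

Both calls print the same compensated vectors, because the constant offset is absorbed into the mean.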


5. Large Population Performance :
