Intra-Class Variability Modeling for Speech Processing


H. Aronowitz (IBM) Intra-Class Variability Modeling for Speech Processing June 08 1

Dr. Hagai Aronowitz

IBM Haifa Research Lab

Presentation is available online at: http://aronowitzh.googlepages.com/




Given labeled training segments from class + and class –, classify unlabeled test segments

Classification framework

1. Represent speech segments in segment-space

2. Learn a classifier in segment-space
• SVMs
• NNs
• Bayesian classifiers
• …

Speech Classification: Proposed framework



Outline

1 Introduction to GMM based classification

2 Mapping speech segments into segment space

3 Intra-class variability modeling

4 Speaker diarization

5 Summary



GMM based speaker recognition: estimate Pr(yt|S)

1. Train a universal background model (UBM) GMM using EM

2. For every target speaker S: train a GMM GS by applying MAP-adaptation

Text-Independent Speaker Recognition: GMM-Based Algorithm [Reynolds 1995]

Assuming frame independence:

\[ \Pr(y_1,\dots,y_T \mid S) = \prod_{t=1}^{T} \Pr(y_t \mid S) \]

\[ \Pr(Y \mid S) = \;? \]

[Figure: the UBM and the adapted speaker GMMs Q1 (speaker #1) and Q2 (speaker #2), with Gaussian means μ1, μ2, μ3, in the R^26 MFCC feature space]
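The frame-independence scoring described on this slide can be sketched as follows. This is a minimal illustration, not the system from the talk: the toy dimensions (2 Gaussians in 2-D rather than a large UBM in R^26), the helper names `gmm_frame_loglik` and `segment_loglik`, and the mean shift that stands in for MAP adaptation are all assumptions made for the example.

```python
import numpy as np

def gmm_frame_loglik(Y, weights, means, variances):
    """log Pr(y_t | GMM) for every frame of Y, for a diagonal-covariance GMM.

    Y: (T, D) frames; weights: (G,); means, variances: (G, D).
    """
    diff = Y[:, None, :] - means[None, :, :]                    # (T, G, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss                     # (T, G)
    peak = log_joint.max(axis=1, keepdims=True)                 # log-sum-exp over components
    return peak[:, 0] + np.log(np.exp(log_joint - peak).sum(axis=1))

def segment_loglik(Y, gmm):
    """Frame independence: log Pr(Y | GMM) is the sum of the per-frame terms."""
    return gmm_frame_loglik(Y, *gmm).sum()

# Toy data (hypothetical): 2 Gaussians in 2-D.
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
variances = np.ones((2, 2))
ubm = (weights, np.array([[0.0, 0.0], [3.0, 3.0]]), variances)
# Stand-in for MAP adaptation: the speaker model only shifts the UBM means.
speaker = (weights, ubm[1] + np.array([0.5, -0.5]), variances)

# Draw 200 test frames from the speaker model.
component = rng.integers(0, 2, size=200)
Y = speaker[1][component] + rng.standard_normal((200, 2))

llr = segment_loglik(Y, speaker) - segment_loglik(Y, ubm)
print(f"log-likelihood ratio, speaker vs. UBM: {llr:.1f}")
```

Every frame is visited once per Gaussian, so the cost of this scoring grows linearly with the length of the audio.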



1. Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency

2. GMM scoring is inefficient – linear in the length of the audio

3. GMM scoring does not support indexing

GMM Based Algorithm - Analysis











Mapping Speech Segments into Segment Space: GMM scoring approximation 1/4

Definitions

X: training session for target speaker

Y: test session

Q: GMM trained for X

P: GMM trained for Y

Goal

Compute Pr(Y |Q) using GMMs P and Q only

Motivation

1. Efficient speaker recognition and indexing

2. More accurate modeling



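\[ \frac{1}{T}\log\Pr(Y \mid Q) = \frac{1}{T}\sum_{t=1}^{T}\log\Pr(y_t \mid Q) \;\approx\; \int_x \Pr(x \mid P)\,\log\Pr(x \mid Q)\,dx = -H(P,Q) \tag{1} \]

$-H(P,Q)$: negative cross entropy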
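A small numeric check of the idea behind (1): the average frame log-likelihood under Q approaches the negative cross entropy between the test model P and Q. The sketch below assumes single 1-D Gaussians for P and Q (chosen only because the integral then has a closed form); the function names and parameter values are illustrative.

```python
import numpy as np

def gauss_logpdf(x, mu, var):
    """log N(x; mu, var) for a 1-D Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def neg_cross_entropy_gauss(mu_p, var_p, mu_q, var_q):
    """Closed form of the integral of P(x) log Q(x) dx for 1-D Gaussians P and Q."""
    return (-0.5 * np.log(2 * np.pi * var_q)
            - (var_p + (mu_p - mu_q) ** 2) / (2 * var_q))

rng = np.random.default_rng(0)
mu_p, var_p = 0.0, 1.0      # model of the test session (P)
mu_q, var_q = 0.5, 2.0      # model of the target speaker (Q)

# Frames of the test session, drawn from P.
Y = mu_p + np.sqrt(var_p) * rng.standard_normal(200_000)

avg_loglik = gauss_logpdf(Y, mu_q, var_q).mean()      # (1/T) sum_t log Pr(y_t | Q)
closed_form = neg_cross_entropy_gauss(mu_p, var_p, mu_q, var_q)
print(avg_loglik, closed_form)
```

The closed form contains a log-variance term, a squared mean difference scaled by the variance of Q, and a variance ratio. These are the same three kinds of terms that appear per Gaussian pair in the matching based approximation.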


Approximating the cross entropy between two GMMs

1. Matching based lower bound [Aronowitz 2004]

2. Unscented-transform based approximation [Goldberger & Aronowitz 2005]

3. Other options in [Hershey 2007]



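Matching based approximation

\[ -H(P,Q) \;\approx\; \sum_{g=1}^{G} w_g^P \max_j \left\{ \log w_j^Q - \sum_{d=1}^{D}\log\sigma_{j,d}^Q - \frac{1}{2}\sum_{d=1}^{D}\frac{\left(\mu_{g,d}^P-\mu_{j,d}^Q\right)^2}{\left(\sigma_{j,d}^Q\right)^2} - \frac{1}{2}\sum_{d=1}^{D}\frac{\left(\sigma_{g,d}^P\right)^2}{\left(\sigma_{j,d}^Q\right)^2} \right\} + C \tag{2} \]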


Assuming weights and covariance matrices are speaker independent (+ some approximations):

\[ -H(P,Q) \;\approx\; -\sum_{g=1}^{G}\sum_{d=1}^{D} w_g\,\frac{\left(\mu_{g,d}^P-\mu_{g,d}^Q\right)^2}{2\,\sigma_{g,d}^2} + C \tag{3} \]

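Mapping T is induced:

\[ T:\ \mathrm{GMM} \to \mathbb{R}^{G\cdot D}, \qquad T(\mathrm{GMM}) = \hat{\mu}; \qquad \hat{\mu}_{g\cdot D+d} = \frac{\sqrt{w_g}}{\sigma_{g,d}}\,\mu_{g,d}^{\mathrm{GMM}} \tag{4} \]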
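A minimal sketch of the induced mapping, assuming the scaling by sqrt(w_g)/σ_{g,d} with speaker-independent (UBM) weights and variances. Under that assumption, half the squared Euclidean distance between two mapped GMMs equals the weighted sum of squared mean differences. The function names and the random toy parameters are hypothetical.

```python
import numpy as np

def supervector(means, ubm_weights, ubm_variances):
    """T(GMM): scale every mean by sqrt(w_g) / sigma_{g,d} and flatten to R^(G*D)."""
    scaled = np.sqrt(ubm_weights)[:, None] * means / np.sqrt(ubm_variances)
    return scaled.reshape(-1)

def weighted_mean_distance(means_p, means_q, ubm_weights, ubm_variances):
    """sum_g sum_d w_g (mu^P_{g,d} - mu^Q_{g,d})^2 / (2 sigma_{g,d}^2)."""
    return np.sum(ubm_weights[:, None] * (means_p - means_q) ** 2 / (2 * ubm_variances))

rng = np.random.default_rng(0)
G, D = 4, 3                                   # toy sizes, far smaller than a real system
w = rng.dirichlet(np.ones(G))                 # shared weights
var = rng.uniform(0.5, 2.0, size=(G, D))      # shared diagonal variances
means_p = rng.standard_normal((G, D))         # means of the test-session GMM P
means_q = rng.standard_normal((G, D))         # means of the target GMM Q

sv_p = supervector(means_p, w, var)
sv_q = supervector(means_q, w, var)
half_sq_dist = 0.5 * np.sum((sv_p - sv_q) ** 2)
direct = weighted_mean_distance(means_p, means_q, w, var)
print(half_sq_dist, direct)
```

Once sessions are mapped this way, scoring a test session against a target reduces to a distance between two fixed-length vectors, which is what makes indexing possible.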



Results


Figure and Table taken from: H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", in IEEE Trans. on Audio, Speech & Language Processing, September 2007.



1. Anchor modeling projection [Sturim 2001]

• efficient but inaccurate

2. MLLR transforms [Stolcke 2005]

• accurate but inefficient

3. Kernel-PCA-based mapping [Aronowitz 2007c]

Given a set of objects and a kernel function (a dot product between each pair of objects), finds a mapping of the objects into R^n which preserves the kernel function.

• accurate & efficient

Other Mapping Techniques











Intra-Class Variability Modeling [Aronowitz 2005b]: Introduction

The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
• channel, noise
• language
• stress, emotion, aging

The frame independence assumption does not hold in these cases!

\[ \Pr(y_1,\dots,y_T \mid S) = \prod_{t=1}^{T}\Pr(y_t \mid S) \tag{1} \]

Instead, we can use a more relaxed assumption:

\[ \Pr(y_1,\dots,y_T \mid S, f) = \prod_{t=1}^{T}\Pr(y_t \mid S, f) \tag{2} \]

which leads to:

\[ \Pr(y_1,\dots,y_T \mid S) = \int \Pr(y_1,\dots,y_T, f \mid S)\,df = \int \Pr(f \mid S)\prod_{t=1}^{T}\Pr(y_t \mid S, f)\,df \tag{3} \]
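The difference between the two assumptions can be illustrated with a toy model. The sketch assumes a 1-D speaker with mean `m_s`, a Gaussian session factor f with variance `tau2` that shifts every frame of a session, and Gaussian frame noise with variance `sigma2`. These modeling choices and all names are assumptions made for the example. The session-level likelihood integrates over f numerically.

```python
import numpy as np

def log_normal(x, mean, var):
    """log N(x; mean, var), elementwise."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def loglik_frame_independent(y, m_s, sigma2, tau2):
    """Frames scored independently, each with the marginal N(m_s, sigma2 + tau2)."""
    return log_normal(y, m_s, sigma2 + tau2).sum()

def loglik_session_model(y, m_s, sigma2, tau2, grid_size=4001):
    """log of the integral over f of Pr(f|S) * prod_t Pr(y_t|S,f), on a grid of f values."""
    half_width = 8.0 * np.sqrt(tau2)
    f = np.linspace(-half_width, half_width, grid_size)
    log_integrand = (log_normal(f, 0.0, tau2)
                     + log_normal(y[:, None], m_s + f[None, :], sigma2).sum(axis=0))
    peak = log_integrand.max()                       # factor out the peak for stability
    return peak + np.log(np.exp(log_integrand - peak).sum() * (f[1] - f[0]))

rng = np.random.default_rng(1)
m_s, sigma2, tau2 = 0.0, 1.0, 1.0
session_shift = 1.5                                  # one draw of f, shared by all frames
y = m_s + session_shift + np.sqrt(sigma2) * rng.standard_normal(50)

ll_independent = loglik_frame_independent(y, m_s, sigma2, tau2)
ll_session = loglik_session_model(y, m_s, sigma2, tau2)
print(ll_independent, ll_session)
```

In this toy case the frames are jointly Gaussian with covariance sigma2·I + tau2·11ᵀ, so the numeric integral can be compared against a closed form. In the GMM setting no such closed form is available, which motivates modeling the session directly in GMM space.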



Old vs. New Generative Models

Old model: Speaker (a GMM) → frame sequence (frames generated independently)

New model: Speaker (a PDF over GMM space) → Session GMM (a GMM) → frame sequence (frames generated independently)



Session-GMM Space

[Figure: session-GMM space with the session GMMs of speakers #1, #2 and #3; two points of speaker #1 are labeled "GMM for session A of speaker #1" and "GMM for session B of speaker #1"]



Modeling in Session-GMM space 1/2

Recall mapping T induced by the GMM approximation analysis:

\[ T:\ \mathrm{GMM} \to \mathbb{R}^{G\cdot D}, \qquad T(\mathrm{GMM}) = \hat{\mu}; \qquad \hat{\mu}_{g\cdot D+d} = \frac{\sqrt{w_g}}{\sigma_{g,d}}\,\mu_{g,d}^{\mathrm{GMM}} \]

• $\hat{\mu}$ is called a supervector
• A speaker is modeled by a multivariate normal distribution in supervector space:

\[ \Pr(\hat{\mu} \mid S) \sim N(\mu_S, \Sigma) \tag{3} \]

• A typical dimension of $\Sigma$ is 50,000 × 50,000
• $\Sigma$ is estimated robustly using PCA + regularization: the covariance is assumed to be a low rank matrix with an additional non-zero (noise) diagonal



Modeling in Session-GMM Space 2/2: Estimating covariance matrix

[Figure: supervector space, showing sessions 1 and 2 of speakers #1, #2 and #3 with covariance $\Sigma$; delta supervector space, showing the session differences with covariance $2\Sigma$]
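One way to estimate a low-rank-plus-diagonal covariance from delta supervectors is sketched below. It assumes two sessions per development speaker, so that the speaker mean cancels in the difference and the differences have covariance 2Σ. The particular regularization (spreading the residual variance over the remaining directions, with a floor) and all names are assumptions for illustration; the talk's exact recipe is not given on the slides.

```python
import numpy as np

def estimate_low_rank_covariance(session_a, session_b, rank, noise_floor=1e-3):
    """Estimate Sigma ~ U diag(lam) U^T + noise * I from pairs of same-speaker sessions.

    session_a, session_b: (n_speakers, dim) supervectors of two sessions per speaker.
    """
    delta = session_a - session_b                 # speaker mean cancels; cov(delta) = 2 * Sigma
    n, dim = delta.shape
    # PCA through an SVD of the data matrix, so the dim x dim matrix is never formed.
    _, s, vt = np.linalg.svd(delta / np.sqrt(2.0 * n), full_matrices=False)
    eigvals = s ** 2                              # eigenvalues of the sample estimate of Sigma
    # Residual variance spread evenly over the discarded directions, with a floor.
    noise = max(noise_floor, eigvals[rank:].sum() / (dim - rank))
    U = vt[:rank].T
    lam = np.maximum(eigvals[:rank] - noise, 0.0)
    return U, lam, noise

def mahalanobis_sq(x, U, lam, noise):
    """x^T Sigma^{-1} x via the Woodbury identity, using only the low-rank factors."""
    proj = U.T @ x
    return (x @ x - np.sum(lam / (lam + noise) * proj ** 2)) / noise

# Toy development set (hypothetical sizes; a real supervector has tens of thousands of dims).
rng = np.random.default_rng(2)
n_speakers, dim, true_rank = 300, 20, 2
B = 2.0 * rng.standard_normal((dim, true_rank))          # session-variability directions
speaker_means = 5.0 * rng.standard_normal((n_speakers, dim))

def draw_sessions():
    z = rng.standard_normal((n_speakers, true_rank))
    return speaker_means + z @ B.T + 0.1 * rng.standard_normal((n_speakers, dim))

U, lam, noise = estimate_low_rank_covariance(draw_sessions(), draw_sessions(), rank=true_rank)
print(lam, noise)
```

With this factorization a score against a speaker mean needs only `rank` projections per supervector, which matters when the full matrix would have about 50,000 × 50,000 entries.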



Experimental Setup

Datasets
• The covariance matrix $\Sigma$ is estimated from the NIST-2006-SRE corpus
• Evaluation is done on the NIST-2004-SRE corpus

System description
• ETSI MFCC (13-cep + 13-delta-cep)
• Energy based voice activity detector
• Feature warping
• 2048 Gaussians
• Target models are adapted from GI-UBM
• ZT-norm score normalization



Results

38% reduction in EER



Other Modeling Techniques

• NAP+SVMs [Campbell 2006]
• Factor Analysis [Kenny 2005]
• Kernel-PCA [Aronowitz 2007c]

Kernel-PCA based algorithm

• Model each supervector as s + u, where
  s ∈ S: common speaker subspace
  u ∈ U: speaker unique subspace
• S is spanned by a set of development supervectors (700 speakers)
• U is the orthogonal complement of S in supervector space
• Intra-speaker variability is modeled separately in S and in U
• U was found to be more discriminative than S
• EER was reduced by 44% compared to baseline GMM



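Kernel-PCA Based Modeling

[Figure: sessions x and y in session space are mapped by the kernel-induced function f to f(x) and f(y) in feature space; K-PCA over the anchor sessions gives the projections Tx and Ty in the common speaker subspace (R^n), and the residuals ux and uy lie in the speaker unique subspace]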











Goals

• Detect speaker changes – “speaker segmentation”

• Cluster speaker segments - “speaker clustering”

Motivation for new method

Current algorithms do not exploit available training data!

(besides tuning thresholds, etc.)

Method

Explicitly model inter-segment intra-speaker variability from labeled training data, and use it in the metric used by the change-detection / clustering algorithms.

Trainable Speaker Diarization [Aronowitz 2007d]



Speaker recognition on pairs of 3s segments

Dev data
• BNAD05 (5hr) - Arabic, broadcast news

Eval data
• BNAT05 – Arabic, broadcast news (207 target models, 6756 test segments)

System                                                             EER (%)
Anchor modeling (baseline)                                         15.1
Anchor modeling - Kernel based scoring                             10.8
Kernel-PCA projection (CSS)                                        8.8
Kernel-PCA projection (CSS) + inter-segment variability modeling   7.4



Speaker change detection

• 2 adjacent sliding windows (3s each)

• Speaker verification scoring + normalization

Speaker clustering

• Speaker verification scoring + normalization

• Bottom-up clustering

Speaker Error Rate (SER) on BNAT05

• Anchor modeling (baseline): 12.9%

• Kernel-PCA based method: 7.9%

Speaker Diarization System & Experiments
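The change-detection step (two adjacent sliding windows compared by a score) can be sketched as follows. This is a deliberately simplified stand-in: each window is represented by its mean feature vector and compared with a squared Euclidean distance, whereas the system described here uses speaker verification scoring with normalization. Window length is given in frames, and the function name and toy data are hypothetical.

```python
import numpy as np

def change_scores(frames, win):
    """Distance between the means of two adjacent windows at every candidate boundary."""
    scores = np.full(len(frames), -np.inf)        # boundaries without two full windows stay -inf
    for t in range(win, len(frames) - win + 1):
        left = frames[t - win:t].mean(axis=0)     # window ending at the candidate boundary
        right = frames[t:t + win].mean(axis=0)    # window starting at the candidate boundary
        scores[t] = np.sum((left - right) ** 2)
    return scores

# Toy recording (hypothetical): 300 frames of one speaker followed by 300 of another.
rng = np.random.default_rng(3)
speaker_a = rng.standard_normal((300, 4))
speaker_b = 3.0 + rng.standard_normal((300, 4))
frames = np.vstack([speaker_a, speaker_b])

scores = change_scores(frames, win=100)
change_point = int(np.argmax(scores))
print("detected speaker change near frame", change_point)
```

Replacing the squared distance with a metric that accounts for inter-segment intra-speaker variability is the part that is learned from labeled training data in the trainable approach.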











• A method for mapping speech segments into a GMM supervector space was described

• Intra-speaker inter-session variability is modeled in GMM supervector space

Speaker recognition

• EER was reduced by 38% on the NIST-2004 SRE

• A corresponding kernel-PCA based approach reduces EER by 44%

Speaker diarization

• SER for speaker diarization was reduced by 39%.

Summary 1/2



• Speaker recognition [Aronowitz 2005b; Aronowitz 2007c]

• Speaker diarization (“who spoke when”) [Aronowitz 2007d]

• VAD (voice activity detection) [Aronowitz 2007a]

• Language identification [Noor & Aronowitz 2006]

• Gender identification [Bocklet 2008]

• Age detection [Bocklet 2008]

• Channel/bandwidth classification [Aronowitz 2007d]

Summary 2/2: Algorithms based on the proposed framework



[1] D. A. Reynolds et al., "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, 17, 91-108, 1995.

[2] D. E. Sturim et al., "Speaker indexing in large audio databases using anchor models", in Proc. ICASSP, 2001.

[3] H. Aronowitz, D. Burshtein, A. Amir, "Speaker indexing in audio archives using test utterance Gaussian mixture modeling", in Proc. ICSLP, 2004.

[4] H. Aronowitz, D. Burshtein, A. Amir, "A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification", in Proc. ICASSP, 2005.

[5] P. Kenny et al., "Factor Analysis Simplified", in Proc. ICASSP, 2005.

[6] H. Aronowitz, D. Irony, D. Burshtein, "Modeling Intra-Speaker Variability for Speaker Recognition", in Proc. Interspeech, 2005.

[7] J. Goldberger and H. Aronowitz, "A distance measure between GMMs based on the unscented transform and its application to speaker recognition", in Proc. Interspeech, 2005.

[8] H. Aronowitz, D. Burshtein, "Efficient Speaker Identification and Retrieval", in Proc. Interspeech, 2005.

Bibliography 1/2



[9] A. Stolcke et al., "MLLR Transforms as Features in Speaker Recognition", in Proc. Interspeech, 2005.

[10] E. Noor, H. Aronowitz, "Efficient language Identification using Anchor Models and Support Vector Machines", in Proc. ISCA Odyssey Workshop, 2006.

[11] W. M. Campbell et al., "SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation", in Proc. ICASSP, 2006.

[12] H. Aronowitz, "Segmental modeling for audio segmentation", in Proc. ICASSP, 2007.

[13] J. R. Hershey and P. A. Olsen, "Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models", in Proc. ICASSP, 2007.

[14] H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", in IEEE Trans. on Audio, Speech & Language Processing, September 2007.

[15] H. Aronowitz, "Speaker Recognition using Kernel-PCA and Intersession Variability Modeling", in Proc. Interspeech, 2007.

[16] H. Aronowitz, "Trainable Speaker Diarization", in Proc. Interspeech, 2007.

[17] T. Bocklet et al., "Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines", in Proc. ICASSP, 2008.

Bibliography 2/2



Presentation is available online at: http://aronowitzh.googlepages.com/

Thanks!



Backup slides



Kernel-PCA Based Mapping 2/5

[Figure: sessions x and y in session space are mapped by f() to f(x) and f(y) in the dot-product feature space (kernel trick); the anchor sessions are marked]

Goals:
- Map sessions into feature space
- Model in feature space



Given
- kernel K
- n anchor sessions $A_1,\dots,A_n$

Find an orthonormal basis for $\mathrm{span}\{f(A_1),\dots,f(A_n)\}$

Method

1) Compute eigenvectors of the centralized kernel-matrix $k_{i,j} = K(A_i, A_j)$.

2) Normalize eigenvectors by square-roots of corresponding eigenvalues → $\{v_i\}$

3) $\{f_i\}$ for $f_i = \left(f(A_1),\dots,f(A_n)\right) v_i$ is the requested basis
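The three steps above can be sketched with NumPy. Several things here are assumptions for the sake of a runnable example: sessions are stood in for by plain vectors, the kernel is an RBF kernel, the kernel vector of a new session is centered in the same way as the anchor kernel matrix, and near-zero eigenvalues are dropped. The names `rbf_kernel`, `fit_kernel_pca` and `project` are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Toy kernel between two 'sessions' (here just vectors)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def fit_kernel_pca(anchors, kernel, tol=1e-10):
    """Steps 1-2: eigenvectors of the centered anchor kernel matrix, scaled by 1/sqrt(eigenvalue)."""
    n = len(anchors)
    K = np.array([[kernel(a, b) for b in anchors] for a in anchors])
    H = np.eye(n) - np.full((n, n), 1.0 / n)       # centering matrix
    eigvals, eigvecs = np.linalg.eigh(H @ K @ H)
    keep = eigvals > tol * eigvals.max()           # centering removes at least one direction
    V = eigvecs[:, keep] / np.sqrt(eigvals[keep])  # columns are the normalized v_i
    return K, H, V

def project(x, anchors, kernel, K, H, V):
    """Coordinates of f(x) in the basis {f_i}, computed from kernel evaluations only."""
    k_x = np.array([kernel(x, a) for a in anchors])
    k_centered = H @ (k_x - K.mean(axis=1))
    return V.T @ k_centered

rng = np.random.default_rng(4)
anchors = rng.standard_normal((8, 3))              # 8 anchor 'sessions'
K, H, V = fit_kernel_pca(anchors, rbf_kernel)

# Project the anchors themselves: dot products should reproduce the centered kernel.
P = np.array([project(a, anchors, rbf_kernel, K, H, V) for a in anchors])
print(np.abs(P @ P.T - H @ K @ H).max())
```

For the anchor sessions the dot products of the projected vectors reproduce the centered kernel matrix, which is the sense in which the mapping preserves the kernel on the span of the anchors.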




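Common speaker subspace - $C = \mathrm{span}\{f(A_1),\dots,f(A_n)\}$

Speaker unique subspace - $U = F / \mathrm{span}\{f(A_1),\dots,f(A_n)\}$

Given sessions x, y, $f(x), f(y)$ may be uniquely represented as:

\[ f(x) = c_x + u_x \ \text{and}\ f(y) = c_y + u_y \quad \text{with}\ c_x, c_y \in C \ \text{and}\ u_x, u_y \in U \]

\[ T:\ x \mapsto \begin{pmatrix} v_1 & \cdots & v_n \end{pmatrix}^{\top} \begin{pmatrix} K(x, A_1) \\ \vdots \\ K(x, A_n) \end{pmatrix} \]

is a mapping x→R^n with the property:

\[ \lVert T(x) - T(y) \rVert^2 = \lVert c_x - c_y \rVert^2 \]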




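[Figure: sessions x and y in session space are mapped to f(x) and f(y) in feature space; K-PCA over the anchor sessions gives Tx and Ty in the common speaker subspace (R^n), with residuals ux and uy in the speaker unique subspace]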




Modeling in Segment-GMM Supervector Space

[Figure: the frame sequences of segment #1, segment #2, …, segment #n are mapped to points in segment-GMM supervector space, with regions labeled music, speech and silence]



Segmental Modeling for Audio Segmentation

Goal

• Segment audio accurately and robustly into speech / silence / music segments.

Novel idea

• Acoustic modeling is usually done on a frame-basis.

• Segmentation/classification is usually done on a segment-basis (using smoothing).

Why not explicitly model whole segments?

Note: speaker, noise, music-context, channel (etc.) are constant during a segment.



Speech / Silence Segmentation – Results 1/2

[Figure: SPEECH / SILENCE SEGMENTATION. Silence miss probability vs. speech miss probability on logarithmic axes; curves: IBM EVAL06, IBM EVAL06 no-pad, GMM baseline, Segmental]

System            EER     FA @ FR=0.5%   FR @ FA=1%
EVAL06            FA=24.2% @ FR=0.25%
GMM baseline      2.9%    7.9%           29.6%
Segmental         1.7%    5.1%           2.7%
Error reduction   41%     35%            91%



Speech / Silence Segmentation – Results 2/2

[Figure: SPEECH / MUSIC SEGMENTATION. Music miss probability vs. speech miss probability on logarithmic axes; curves: IBM EVAL06, IBM EVAL06 no-pad, GMM baseline, Segmental]

System            EER     FA @ FR=0.5%   FR @ FA=1%
EVAL06            FA=69% @ FR=0.25%
GMM baseline      1.43%   3.4%           3.2%
Segmental         1.27%   2.0%           1.9%
Error reduction   11%     41%            41%



LID in Session Space

[Figure: training sessions of English, Arabic and French and a test session, shown as points in session space]



1. Front end: shifted delta cepstrum (SDC).

2. Represent every train/test session by a GMM super-vector.

3. Train a linear SVM to classify GMM super-vectors.

Results

• EER=4.1% on the NIST-03 Eval (30sec sessions).

LID in Session Space - Algorithm



Anchor Modeling Projection

• Speaker indexing [Sturim et al., 2001]

• Intersession variability modeling in projected space [Collet et al., 2005]

• Speaker clustering [Reynolds et al., 2004]

• Speaker segmentation [Collet et al., 2006]

• Language identification [Noor and Aronowitz, 2006]

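Given: anchor models λ1,…,λn and session X = x1,…,xF

Projection:

\[ X \mapsto \left(\hat{s}(X \mid \lambda_1), \dots, \hat{s}(X \mid \lambda_n)\right) \]

\[ \hat{s}(X \mid \lambda_i) = \frac{1}{F}\log\frac{\Pr(X \mid \lambda_i)}{\Pr(X \mid \lambda_{UBM})} \quad \text{(average normalized log-likelihood)} \]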
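A minimal sketch of the anchor projection: each coordinate is the session's average per-frame log-likelihood under one anchor model, normalized by the UBM log-likelihood. The diagonal-covariance GMM scorer, the toy anchor models (UBM means shifted by fixed offsets) and the function names are assumptions made for the example.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Total log-likelihood of frames X (F, D) under a diagonal-covariance GMM."""
    diff = X[:, None, :] - means[None, :, :]                    # (F, G, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss                     # (F, G)
    peak = log_joint.max(axis=1, keepdims=True)
    return float(np.sum(peak[:, 0] + np.log(np.exp(log_joint - peak).sum(axis=1))))

def anchor_projection(X, anchors, ubm):
    """Vector of average normalized log-likelihoods, one entry per anchor model."""
    ubm_ll = gmm_loglik(X, *ubm)
    return np.array([(gmm_loglik(X, *anchor) - ubm_ll) / len(X) for anchor in anchors])

# Toy models (hypothetical): a 2-component UBM in 2-D and three anchors with shifted means.
rng = np.random.default_rng(5)
weights = np.array([0.5, 0.5])
variances = np.ones((2, 2))
ubm = (weights, np.array([[0.0, 0.0], [2.0, 2.0]]), variances)
offsets = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.0, 1.0])]
anchors = [(weights, ubm[1] + off, variances) for off in offsets]

# A session of 150 frames drawn from the third anchor model.
component = rng.integers(0, 2, size=150)
X = anchors[2][1][component] + rng.standard_normal((150, 2))

projection = anchor_projection(X, anchors, ubm)
print(projection)
```

The projected vector has one dimension per anchor, so its size is fixed regardless of session length.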



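Intra-Class Variability Modeling: Introduction

The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
• Noise
• Channel
• Language
• Changing speaker characteristics – stress, emotion, aging

The frame independence assumption does not hold in these cases!

\[ \Pr(y_1,\dots,y_T \mid S) = \prod_{t=1}^{T}\Pr(y_t \mid S) \tag{1} \]

Instead, we get:

\[ \Pr(y_1,\dots,y_T \mid S) = \int \Pr(y_1,\dots,y_T, f \mid S)\,df = \int \Pr(f \mid S)\prod_{t=1}^{T}\Pr(y_t \mid S, f)\,df \tag{2} \]

In terms of the session GMM $G_{S,f}$: $\Pr(y_t \mid G_{S,f})$ and $\Pr(G_{S,f} \mid S)$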
