Unsupervised adaptation for speaker detection
Jean-François Bonastre, LIA, Avignon
IBM Seminar 14th September 2006
J.F. Bonastre, IBM Seminar, September 14th 2006 2
Outline
Introduction to UBM/GMM
ALIZE/LIA_SpkDet toolkit
Unsupervised adaptation: why and how?
Soft, continuous speaker model adaptation
Is NIST SRE really a speaker detection task?
Artificially modified impostor voice
Introduction to UBM/GMM
Speaker detection task (NIST, 1conv)
[Diagram: for each target speaker (e.g. Joe), the evaluation database provides one training segment per speaker and test segments; the system compares a test segment to Joe's training segment and outputs a score and a Yes/No decision.]
More at www.nist.gov/speech
Introduction to UBM/GMM
UBM/GMM approach
[Diagram: acoustic parameters extracted from a set of speakers train the UBM with EM-ML; the target speaker model (one per speaker X) is obtained by adapting the UBM on that speaker's training data. At test time, the acoustic parameters of test segment Y are compared to both models, and the comparison yields the decision.]
Introduction to UBM/GMMIntroduction to UBM/GMM
ALIZE/ALIZE/LIA_SpkDetLIA_SpkDet toolkittoolkitClassical Gaussian Mixture toolkit
EM ML/MAP, Gaussian component sharingSmall HMM / Viterbi (designed for segmentation)
ALIZE = low level model/featureLIA_SpkDet = « High Level » system
Current LIA research system (evaluated during NIST-SRE)Feature norm/warping/mappingZ/H/T NormBayes factor Analysys soon…Design For NIST and Demos
Open source www.lia.univ-avignon.fr/heberges/ALIZE
Introduction to UBM/GMM
ALIZE/LIA_SpkDet toolkit
Unsupervised adaptation: Why? and How?
In the NIST SRE 1side-1side task, only one training segment is available per speaker.
It can be short, noisy, or content specific, and it covers only one session.
In real-world commercial applications: it is difficult to request a lot of time from a client; it is interesting to launch the system with few data and then improve the speaker representation; and it is difficult to obtain multiple sessions for a given speaker's training set.
Unsupervised adaptation: Why? and How?
Build X, a basic speaker model, using the available training data.
When a verification is requested for a segment Y1:
compute the score between X and Y1;
if (score > AdaptT), adapt X using Y1, yielding X1;
else X1 = X.
When a second query Y2 comes in, compute score(Y2, X1);
if (score > AdaptT), adapt again, and so on.
See Barras (OD04), Mirghafori (ICSLP02), van Leeuwen (NIST04).
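The hard-decision loop above can be sketched as follows; the `score` and `adapt` callables and the toy threshold are illustrative placeholders, not the LIA implementation.

```python
# Sketch of hard-decision unsupervised adaptation: adapt the speaker model
# only when a trial's score clears the adaptation threshold AdaptT.

def unsupervised_adapt(model, trials, score, adapt, adapt_threshold):
    """Process trials in order; adapt the model after each accepted trial."""
    scores = []
    for y in trials:
        s = score(model, y)          # score between current model X_i and trial Y_i
        scores.append(s)
        if s > adapt_threshold:      # hard decision
            model = adapt(model, y)  # X_{i+1} = adapt(X_i, Y_i)
    return model, scores

# Toy usage: the "model" is a scalar, updated by interpolation.
score_fn = lambda m, y: -abs(m - y)        # closer trial => higher score
adapt_fn = lambda m, y: 0.8 * m + 0.2 * y  # simple interpolation update
final, s = unsupervised_adapt(0.0, [0.1, 5.0, 0.2], score_fn, adapt_fn, -1.0)
```

Only the first and third trials clear the threshold here; the far-away trial (5.0) is rejected and never pollutes the model.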
Unsupervised adaptation: Why? and How?
ORACLE performance is very good.
"Unsupervised Online Adaptation for Speaker Verification over the Telephone", Claude Barras, Sylvain Meignier, Jean-Luc Gauvain, Speaker Odyssey 2004.
Unsupervised adaptation: Why? and How?
"Unsupervised model adaptation for speaker verification", Alexandre Preti, Jean-François Bonastre, ICSLP04.
[DET plot: Baseline vs. Oracle]
Unsupervised adaptation: Why? and How?
Problem: results during the evaluation campaigns are quite poor.
Unsupervised adaptation: Why? and How?
"Unsupervised model adaptation for speaker verification", Alexandre Preti, Jean-François Bonastre, ICSLP04.
[DET plot: Baseline and hard-decision adaptation vs. Oracle]
Unsupervised adaptation: Why? and How?
Problem: results during the evaluation campaigns are quite poor. Why?
The adaptation is done only when we have a good score, i.e. a good match between the current speaker model and the test file.
So a model is adapted only if the initial model is already good enough, and only if the mismatch between sessions is not too large. This works well only on large datasets (not with test-by-test adaptation)…
This is a problem with the NIST protocol.
Unsupervised adaptation: Why? and How?
[Score-distribution plot: the area for a sure hard decision lies beyond the region of interest.]
We adapt only when we are "sure" not to make a mistake, so we don't get enough client tests!
Unsupervised adaptation: Why? and How?
Suppress the hard decision: no decision means continuous adaptation. The model is then adapted with good client data, but also with bad, impostor data.
Hence: weighted adaptation.
Soft, continuous speaker model adaptation
Ongoing work - Unpublished
Alexandre Preti and JF Bonastre
The idea
Always adapt the client models with incoming data, but weight the data by p(X=Y), an a posteriori probability (and not a log-likelihood): P(X=Y | LRx).
What is needed? The traditional score, LLR(test data | client model), and something to transform that score into the posterior P(X=Y | LRx).
WMAP
P(X=Y | LRx) = p(LRx | X=Y) · p(X=Y) / [ p(LRx | X=Y) · p(X=Y) + p(LRx | X≠Y) · p(X≠Y) ]
where:
p(X=Y) = the prior probability of a target
p(X≠Y) = the prior probability of an impostor
p(LRx | X=Y) = the score likelihood under the target distribution
p(LRx | X≠Y) = the score likelihood under the impostor distribution
We model the target and impostor score distributions by GMMs learned on a development set.
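As a minimal sketch, the WMAP weight can be computed with single Gaussians standing in for the development-set score GMMs; the priors and distribution parameters below are illustrative.

```python
import math

def gauss_pdf(x, mu, sigma):
    # 1-D Gaussian density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def wmap(score, p_target=0.1, tar=(2.0, 1.0), imp=(-2.0, 1.0)):
    """P(X=Y | LRx) by Bayes' rule over the target/impostor score models."""
    num = gauss_pdf(score, *tar) * p_target
    den = num + gauss_pdf(score, *imp) * (1.0 - p_target)
    return num / den
```

A high trial score yields a weight near 1 (adapt strongly); a low score yields a weight near 0 (the trial barely contributes).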
WMAP
[Plot: target/impostor score distributions (x-axis: scores) and the resulting probability of a target, used as the adaptation weight.]
Updating a model
New target model = MAP(UBM, {selected trials; weights} + initial training data)
The weights are integrated in the EM/ML estimation (thanks to the ALIZE toolkit).
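A minimal sketch of a weight-integrated MAP update, for a single Gaussian mean with a relevance factor `r` (the names and the relevance-factor form are illustrative, not the exact ALIZE estimation):

```python
def weighted_map_mean(ubm_mean, data, weights, r=16.0):
    """mu = (sum_i w_i * x_i + r * mu_ubm) / (sum_i w_i + r)."""
    num = sum(w * x for x, w in zip(data, weights)) + r * ubm_mean
    den = sum(weights) + r
    return num / den
```

With all weights at 0 the model stays at the UBM; heavily weighted data pulls the mean toward the data mean, which is exactly the soft behaviour WMAP provides.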
Protocols
Two different protocols:
BATCH: all the trials involving a target are used to adapt its model. The model is adapted before computing the final scores.
NIST SRE unsupervised adaptation mode: the update of a speaker model is allowed only with this speaker's previous trial segments (done ndx line by ndx line).
Obviously, there is more adaptation data in the BATCH protocol.
TNORM Score Normalization
Basic T-norm (with 2.5 min of training data) is not well suited for the BATCH protocol: the amount of training data is now far from 2.5 min., so a T-norm handling different amounts of training data is needed (all targets of NIST SRE 2004 using unsupervised adaptation).
As the amount of training data is limited in the NIST protocol, basic T-norm should perform well there (160 targets of NIST SRE 2004).
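T-norm itself is simple to state; a sketch with an illustrative impostor cohort (the cohort scores would come from scoring the test segment against impostor models):

```python
import statistics

def tnorm(raw_score, cohort_scores):
    # Normalize a trial score by the mean/stdev of the same test segment
    # scored against a cohort of impostor models.
    mu = statistics.mean(cohort_scores)
    sigma = statistics.stdev(cohort_scores)
    return (raw_score - mu) / sigma
```

The mismatch above comes from the cohort models: they are trained on 2.5 min. of data, while a batch-adapted target model is not.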
Results: BATCH Protocol
Score distributions are learned on NIST SRE 2004; results are given on NIST SRE 2005.
10% DCF relative improvement
35% EER relative improvement
Results: NIST Protocol
15% DCF relative improvement
33% EER relative improvement
Analysis, target trials

                 Accepted total   Accepted non-target   Accepted target
Baseline              1103               115                  978
NIST Unsup. A.        1170               130                 1040
Batch A.              1064                89                  975

(1231 target trials in total.)
With adaptation we accept more target trials, and we reject more impostor trials.
Conclusion (ongoing work)
An unsupervised adaptation method without a hard decision threshold, with DCF and EER improvements.
The gain would be larger for commercial applications, as the target/impostor ratio is better there!
Two protocols: the same gain but two different behaviours. Other problems remain to be addressed: score normalisation (using score models?) and threshold estimation/adaptation.
NIST-SRE: is it really a speaker detection task?
Is it really a speaker detection task?
The evaluation campaign is a very interesting framework for research: a large dataset, clear protocols, and a focus on one problem yielding large improvements year after year.
But: one sort of data, one task, and few analyses of the results in terms of data/phonetics/knowledge (the focus is only on "performance").
Is it really a speaker detection task?
Improvements came thanks to: H-norm = Z-norm plus something linked to two sorts of phones; T-norm = environment/phone mismatch; feature warping/mapping…
These also reduce environment/phone mismatch (but they do have an effect on the classifiers)…
Is it really a speaker detection task?
Example: Bayesian factor analysis (Patrick Kenny). Quite recent, but already large improvements. Dedicated (currently) to channel effects. At least 10 labs are implementing PK's approach or something close to it. But who tested whether the channel is the key factor?
Artificially modified impostor voice
We have all been using UBM/GMM for x years, but which sort of information is modeled by UBM/GMM classifiers?
Experiment: if we have a voice example of a targeted speaker, and we know the speaker recognition technique, is it possible to transform the voice of someone else in order to cheat the system?
"Transfer Function-Based Voice Transformation for Speaker Recognition", Jean-François Bonastre, Driss Matrouf, Corinne Fredouille, Speaker Odyssey 2006.
Experimental context
NIST-SRE 2005 (1conv-1conv), male only: 1231 target tests, 12317 non-target tests.
Each test is a pair (segment Y, client S). Classical test: Y=S? If it is a non-target test, transform Y using S: Y'=Vtrans(Y,S). New test (segment Y', client S): Y'=S?
Voice transformation
Frame by frame + overlap-add.
Estimate a new cepstral target for an impostor frame y as a combination of the component means of the target speaker's model.
From the cepstral target, derive a transfer function.
Filter the frame in order to replace the original transfer function by the new one.
All other parameters are taken from the original signal.
Results (1): impostor score distribution
[Plot of impostor score distributions: (1) normal; (2) transformed, using a non-training segment of X; (3) transformed, using speaker X's training segment.]
Results (2): DET and error rates

                                  False acceptance   Miss detection
1 - baseline                           0.88 %            27.45 %
2 - transformed (non-train seg.)      49.72 %            27.45 %
3 - transformed (train seg.)          96.55 %            27.45 %

False acceptance goes from 0.88 % to 96.55 %!! (same threshold)
Examples
[Audio examples slide: NIST SRE segments 7396 and 8049 (NCFB_A scores -1.94, 4.84, 0.46, 5.47) and French broadcast-news voices (Alain Passerel, Driss Redragui, Fabrice Drouelle, Franck Mathevon, Joel Collado).]
Conclusion
We are using efficient UBM/GMM systems, but we don't know which information they use. We now know it is possible to cheat the system.
Caution is needed for forensic/national-security applications (Bonastre et al., Eurospeech 2003).
In this experiment, we used knowledge of: the feature extraction; the method (UBM/GMM, number of components, number of top Gaussians); the world model.
To be extended.
Thanks!
For inviting me, and for your attention.
Questions?
Annex: UBM-GMM and protocol
Paradigm
P(target speaker | speech data). Bayesian hypothesis test: the likelihood ratio (LR).
The UBM (Universal Background Model) represents the alternative hypothesis. It is usually learned with hundreds of hours of speech, using the EM algorithm and multiple iterations, plus some tricks (variance flooring?).
The front-end is a cepstral analysis of the speech (MFCC, LFCC) + derivatives…
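A toy sketch of the resulting detection score: the average frame log-likelihood ratio between a target GMM and the UBM (1-D diagonal Gaussians; the model values in the usage are illustrative):

```python
import math

def log_gauss(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def log_gmm(x, weights, means, variances):
    # log sum_k w_k N(x; mu_k, var_k), with the max trick for stability
    logs = [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def llr(frames, target, ubm):
    # average frame LLR: positive favours the target hypothesis
    return sum(log_gmm(x, *target) - log_gmm(x, *ubm) for x in frames) / len(frames)
```

Each model is a `(weights, means, variances)` triple; real systems use 2048 multivariate components, but the score has the same shape.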
Experimental Protocol
Based upon the NIST SRE 2005 database: primary condition, 1side-1side, male set; 280 speakers; utterances 2.5 min. long (containing speech); 13624 tests (951 target tests); impostors: 200 speakers from the background model.
Commonalities: (16+16Δ) LFCC features (300-3 kHz), T-normed systems, 2048-component UBM.
The LIA_SpkDet system
GMM/UBM, 2048 diagonal components. 16 cepstral coefficients + 16 deltas (50 coefficients in 2006). Frame selection based on a 3-component GMM modeling of the energy. Feature normalization: mean removal and variance normalization. Score normalization: T-norm.
Annex: Voice transformation
Voice transformation (1): independent processing and overlap-add
[Diagram: frame-by-frame processing with overlap. Each frame y_i of Y is transformed independently, Y'_i = VT(y_i, S), and the transformed frames are added back with overlap to produce Y' = VT(Y, S).]
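The frame-by-frame processing with overlap-add can be sketched as follows (Hann window, 50% overlap; the per-frame `transform` callable stands in for VT(y_i, S) and is an assumption of this sketch):

```python
import math

def overlap_add(signal, frame_len, transform):
    hop = frame_len // 2
    # Hann window: with 50% overlap, adjacent windows sum to exactly 1
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        frame = transform(frame)            # per-frame transformation
        for n in range(frame_len):          # add back at the original position
            out[start + n] += frame[n]
    return out
```

With the identity transform, interior samples are reconstructed exactly, since this window sums to 1 at 50% overlap; any per-frame filtering slots in as `transform`.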
Voice transformation (2): find the target transfer function (per frame)
From the acoustic parameter vector (cepstral) and the target speaker model, compute the a posteriori probabilities of each component. Build the cepstral target by combining the means of the components, weighted by the probability vector. The cepstral target then yields a transfer function.
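Building the cepstral target from the a posteriori probabilities can be sketched with scalar "cepstra" and a toy 2-component model (the model values in the test are illustrative, not the paper's 2048-component setup):

```python
import math

def posteriors(x, weights, means, variances):
    # a posteriori probability of each Gaussian component given frame x
    joint = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def cepstral_target(x, weights, means, variances):
    # combine the component means, weighted by the posterior vector
    post = posteriors(x, weights, means, variances)
    return sum(p * m for p, m in zip(post, means))
```

A frame close to one component is pulled almost entirely toward that component's mean; ambiguous frames land on a weighted mixture of means.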
Voice transformation (3): two parallel GMMs
The ASR side uses several feature normalization techniques, so it is not possible to come back to the signal from that feature space. Hence two parallel models per target, with a one-to-one tying of the components:
• a master model, in the ASR feature space, used to compute the a posteriori probabilities for Y;
• a filtering model, in the filtering feature space, used to combine the means and estimate the target for filtering.
Voice transformation (4): filtering
For each frame, build the cepstral target, derive the target transfer function Hx(f), and estimate the frame's original transfer function Hy(f). The frame of y is then filtered by
H(f) = Hx(f) / Hy(f)
so that the original transfer function is replaced by the target one, yielding Y'.
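The correction filter H(f) = Hx(f)/Hy(f) can be sketched per frequency bin (magnitude spectra as plain lists; the small epsilon guarding against division by zero is an assumption of this sketch):

```python
def correction_filter(h_target, h_original, eps=1e-12):
    """H(f) = Hx(f) / Hy(f), per frequency bin."""
    return [hx / (hy + eps) for hx, hy in zip(h_target, h_original)]

def apply_filter(spectrum, h):
    # multiply the frame's spectrum by the correction filter
    return [s * g for s, g in zip(spectrum, h)]
```

Applying the filter to a spectrum shaped by Hy reshapes it toward Hx, which is exactly the substitution of transfer functions described above.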
Client and impostor score distributions
DET curves using the transformation for all the tests
Parallel models
[Diagram: ASR feature domain vs. filtering feature domain. The ASR target speaker model provides the EM E-step hidden variables; the EM M-step is then applied in the filtering domain to obtain the filtering target speaker model.]