Unsupervised adaptation for speaker detection
Jean-François Bonastre, LIA, Avignon
IBM Seminar 14th September 2006
J.F. Bonastre, IBM Seminar, September 14th 2006 2
Outline
Introduction to UBM/GMM
ALIZE/LIA_SpkDet toolkit
Unsupervised adaptation: why and how?
Soft, continuous speaker model adaptation
Is NIST SRE really a speaker detection task?
Artificially modified impostor voice
Introduction to UBM/GMM
Speaker detection task (NIST, 1conv)
[Diagram: for each target speaker (e.g. Joe), the evaluation database provides one training segment per speaker and test segments; the system compares a test segment to Joe's training segment and outputs a score and a Yes/No decision.]
More at www.nist.gov/speech
Introduction to UBM/GMM
UBM/GMM approach
[Diagram: acoustic parameters extracted from a set of speakers train the UBM with EM-ML; the target speaker model (one per speaker X) is obtained by adapting the UBM on that speaker's training data. At test time, the acoustic parameters of test segment Y are compared to both models, and the comparison yields the decision.]
Introduction to UBM/GMMIntroduction to UBM/GMM
ALIZE/ALIZE/LIA_SpkDetLIA_SpkDet toolkittoolkitClassical Gaussian Mixture toolkit
EM ML/MAP, Gaussian component sharingSmall HMM / Viterbi (designed for segmentation)
ALIZE = low level model/featureLIA_SpkDet = « High Level » system
Current LIA research system (evaluated during NIST-SRE)Feature norm/warping/mappingZ/H/T NormBayes factor Analysys soon…Design For NIST and Demos
Open source www.lia.univ-avignon.fr/heberges/ALIZE
Introduction to UBM/GMM
ALIZE/LIA_SpkDet toolkit
Unsupervised adaptation: Why? and How?
In the NIST SRE 1side-1side task, only one training segment is available per speaker.
It can be short, noisy, or content specific, and it covers only one session.
In real-world commercial applications: it is difficult to request a lot of time from a client; it is interesting to launch the system with few data and then improve the speaker representation; and it is difficult to obtain multiple sessions for a given speaker's training set.
Unsupervised adaptation: Why? and How?
Build X, a basic speaker model, using the available training data.
When a verification is requested for a segment Y1:
compute the score between X and Y1;
if (score > AdaptT), adapt X using Y1, yielding X1;
else X1 = X.
When a second query Y2 comes in, compute score(Y2, X1);
if (score > AdaptT), adapt again, and so on.
See Barras (OD04), Mirghafori (ICSLP02), van Leeuwen (NIST04).
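The hard-decision loop above can be sketched as follows; the `score` and `adapt` callables and the toy threshold are illustrative placeholders, not the LIA implementation.

```python
# Sketch of hard-decision unsupervised adaptation: adapt the speaker model
# only when a trial's score clears the adaptation threshold AdaptT.

def unsupervised_adapt(model, trials, score, adapt, adapt_threshold):
    """Process trials in order; adapt the model after each accepted trial."""
    scores = []
    for y in trials:
        s = score(model, y)          # score between current model X_i and trial Y_i
        scores.append(s)
        if s > adapt_threshold:      # hard decision
            model = adapt(model, y)  # X_{i+1} = adapt(X_i, Y_i)
    return model, scores

# Toy usage: the "model" is a scalar, updated by interpolation.
score_fn = lambda m, y: -abs(m - y)        # closer trial => higher score
adapt_fn = lambda m, y: 0.8 * m + 0.2 * y  # simple interpolation update
final, s = unsupervised_adapt(0.0, [0.1, 5.0, 0.2], score_fn, adapt_fn, -1.0)
```

Only the first and third trials clear the threshold here; the far-away trial (5.0) is rejected and never pollutes the model.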
Unsupervised adaptation: Why? and How?
ORACLE performance is very good.
"Unsupervised Online Adaptation for Speaker Verification over the Telephone", Claude Barras, Sylvain Meignier, Jean-Luc Gauvain, Speaker Odyssey 2004.
Unsupervised adaptation: Why? and How?
"Unsupervised model adaptation for speaker verification", Alexandre Preti, Jean-François Bonastre, ICSLP04.
[DET plot: Baseline vs. Oracle]
Unsupervised adaptation: Why? and How?
Problem: results during the evaluation campaigns are quite poor.
Unsupervised adaptation: Why? and How?
"Unsupervised model adaptation for speaker verification", Alexandre Preti, Jean-François Bonastre, ICSLP04.
[DET plot: Baseline and hard-decision adaptation vs. Oracle]
Unsupervised adaptation: Why? and How?
Problem: results during the evaluation campaigns are quite poor. Why?
The adaptation is done only when we have a good score, i.e. a good match between the current speaker model and the test file.
So a model is adapted only if the initial model is already good enough, and only if the mismatch between sessions is not too large. This works well only on large datasets (not with test-by-test adaptation)…
This is a problem with the NIST protocol.
Unsupervised adaptation: Why? and How?
[Score-distribution plot: the area for a sure hard decision lies beyond the region of interest.]
We adapt only when we are "sure" not to make a mistake, so we don't get enough client tests!
Unsupervised adaptation: Why? and How?
Suppress the hard decision: no decision means continuous adaptation. The model is then adapted with good client data, but also with bad, impostor data.
Hence: weighted adaptation.
Soft, continuous speaker model adaptation
Ongoing work - Unpublished
Alexandre Preti and JF Bonastre
The idea
Always adapt the client models with incoming data, but weight the data by p(X=Y), an a posteriori probability (and not a log-likelihood): P(X=Y | LRx).
What is needed? The traditional score, LLR(test data | client model), and something to transform that score into the posterior P(X=Y | LRx).
WMAP
P(X=Y | LRx) = p(LRx | X=Y) · p(X=Y) / [ p(LRx | X=Y) · p(X=Y) + p(LRx | X≠Y) · p(X≠Y) ]
where:
p(X=Y) = the prior probability of a target
p(X≠Y) = the prior probability of an impostor
p(LRx | X=Y) = the score likelihood under the target distribution
p(LRx | X≠Y) = the score likelihood under the impostor distribution
We model the target and impostor score distributions by GMMs learned on a development set.
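As a minimal sketch, the WMAP weight can be computed with single Gaussians standing in for the development-set score GMMs; the priors and distribution parameters below are illustrative.

```python
import math

def gauss_pdf(x, mu, sigma):
    # 1-D Gaussian density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def wmap(score, p_target=0.1, tar=(2.0, 1.0), imp=(-2.0, 1.0)):
    """P(X=Y | LRx) by Bayes' rule over the target/impostor score models."""
    num = gauss_pdf(score, *tar) * p_target
    den = num + gauss_pdf(score, *imp) * (1.0 - p_target)
    return num / den
```

A high trial score yields a weight near 1 (adapt strongly); a low score yields a weight near 0 (the trial barely contributes).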
WMAP
[Plot: target/impostor score distributions (x-axis: scores) and the resulting probability of a target, used as the adaptation weight.]
Updating a model
New target model = MAP(UBM, {selected trials; weights} + initial training data)
The weights are integrated in the EM/ML estimation (thanks to the ALIZE toolkit).
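A minimal sketch of a weight-integrated MAP update, for a single Gaussian mean with a relevance factor `r` (the names and the relevance-factor form are illustrative, not the exact ALIZE estimation):

```python
def weighted_map_mean(ubm_mean, data, weights, r=16.0):
    """mu = (sum_i w_i * x_i + r * mu_ubm) / (sum_i w_i + r)."""
    num = sum(w * x for x, w in zip(data, weights)) + r * ubm_mean
    den = sum(weights) + r
    return num / den
```

With all weights at 0 the model stays at the UBM; heavily weighted data pulls the mean toward the data mean, which is exactly the soft behaviour WMAP provides.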
Protocols
Two different protocols:
BATCH: all the trials involving a target are used to adapt its model. The model is adapted before computing the final scores.
NIST SRE unsupervised adaptation mode: the update of a speaker model is allowed only with this speaker's previous trial segments (done ndx line by ndx line).
Obviously, there is more adaptation data in the BATCH protocol.
TNORM Score Normalization
Basic T-norm (with 2.5 min of training data) is not well suited for the BATCH protocol: the amount of training data is now far from 2.5 min., so a T-norm handling different amounts of training data is needed (all targets of NIST SRE 2004 using unsupervised adaptation).
As the amount of training data is limited in the NIST protocol, basic T-norm should perform well there (160 targets of NIST SRE 2004).
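T-norm itself is simple to state; a sketch with an illustrative impostor cohort (the cohort scores would come from scoring the test segment against impostor models):

```python
import statistics

def tnorm(raw_score, cohort_scores):
    # Normalize a trial score by the mean/stdev of the same test segment
    # scored against a cohort of impostor models.
    mu = statistics.mean(cohort_scores)
    sigma = statistics.stdev(cohort_scores)
    return (raw_score - mu) / sigma
```

The mismatch above comes from the cohort models: they are trained on 2.5 min. of data, while a batch-adapted target model is not.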
Results: BATCH Protocol
Score distributions are learned on NIST SRE 2004; results are given on NIST SRE 2005.
10% DCF relative improvement
35% EER relative improvement
Results: NIST Protocol
15% DCF relative improvement
33% EER relative improvement
Analysis, target trials

                 Accepted total   Accepted non-target   Accepted target
Baseline              1103               115                  978
NIST Unsup. A.        1170               130                 1040
Batch A.              1064                89                  975

(1231 target trials in total.)
With adaptation we accept more target trials, and we reject more impostor trials.
Conclusion (ongoing work)
An unsupervised adaptation method without a hard decision threshold, with DCF and EER improvements.
The gain would be larger for commercial applications, as the target/impostor ratio is better there!
Two protocols: the same gain but two different behaviours. Other problems remain to be addressed: score normalisation (using score models?) and threshold estimation/adaptation.
NIST-SRE: is it really a speaker detection task?
Is it really a speaker detection task?
The evaluation campaign is a very interesting framework for research: a large dataset, clear protocols, and a focus on one problem yielding large improvements year after year.
But: one sort of data, one task, and few analyses of the results in terms of data/phonetics/knowledge (the focus is only on "performance").
Is it really a speaker detection task?
Improvements came thanks to: H-norm = Z-norm plus something linked to two sorts of phones; T-norm = environment/phone mismatch; feature warping/mapping…
These also reduce environment/phone mismatch (but they do have an effect on the classifiers)…
Is it really a speaker detection task?
Example: Bayesian factor analysis (Patrick Kenny). Quite recent, but already large improvements. Dedicated (currently) to channel effects. At least 10 labs are implementing PK's approach or something close to it. But who tested whether the channel is the key factor?
Artificially modified impostor voice
We have all been using UBM/GMM for x years, but which sort of information is modeled by UBM/GMM classifiers?
Experiment: if we have a voice example of a targeted speaker, and we know the speaker recognition technique, is it possible to transform the voice of someone else in order to cheat the system?
"Transfer Function-Based Voice Transformation for Speaker Recognition", Jean-François Bonastre, Driss Matrouf, Corinne Fredouille, Speaker Odyssey 2006.
Experimental context
NIST-SRE 2005 (1conv-1conv), male only: 1231 target tests, 12317 non-target tests.
Each test is a pair (segment Y, client S). Classical test: Y=S? If it is a non-target test, transform Y using S: Y'=Vtrans(Y,S). New test (segment Y', client S): Y'=S?
Voice transformation
Frame by frame + overlap-add.
Estimate a new cepstral target for an impostor frame y as a combination of the component means of the target speaker's model.
From the cepstral target, derive a transfer function.
Filter the frame in order to replace the original transfer function by the new one.
All other parameters are taken from the original signal.
Results (1): impostor score distribution
[Plot of impostor score distributions: (1) normal; (2) transformed, using a non-training segment of X; (3) transformed, using speaker X's training segment.]
Results (2): DET and error rates

                                  False acceptance   Miss detection
1 - baseline                           0.88 %            27.45 %
2 - transformed (non-train seg.)      49.72 %            27.45 %
3 - transformed (train seg.)          96.55 %            27.45 %

False acceptance goes from 0.88 % to 96.55 %!! (same threshold)
Examples
[Audio examples slide: NIST SRE segments 7396 and 8049 (NCFB_A scores -1.94, 4.84, 0.46, 5.47) and French broadcast-news voices (Alain Passerel, Driss Redragui, Fabrice Drouelle, Franck Mathevon, Joel Collado).]
Conclusion
We are using efficient UBM/GMM systems, but we don't know which information they use. We now know it is possible to cheat the system.
Caution is needed for forensic/national-security applications (Bonastre et al., Eurospeech 2003).
In this experiment, we used knowledge of: the feature extraction; the method (UBM/GMM, number of components, number of top Gaussians); the world model.
To be extended.
Thanks!
For inviting me, and for your attention.
Questions?
Annex: UBM-GMM and protocol
Paradigm
P(target speaker | speech data). Bayesian hypothesis test: the likelihood ratio (LR).
The UBM (Universal Background Model) represents the alternative hypothesis. It is usually learned with hundreds of hours of speech, using the EM algorithm and multiple iterations, plus some tricks (variance flooring?).
The front-end is a cepstral analysis of the speech (MFCC, LFCC) + derivatives…
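A toy sketch of the resulting detection score: the average frame log-likelihood ratio between a target GMM and the UBM (1-D diagonal Gaussians; the model values in the usage are illustrative):

```python
import math

def log_gauss(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def log_gmm(x, weights, means, variances):
    # log sum_k w_k N(x; mu_k, var_k), with the max trick for stability
    logs = [math.log(w) + log_gauss(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def llr(frames, target, ubm):
    # average frame LLR: positive favours the target hypothesis
    return sum(log_gmm(x, *target) - log_gmm(x, *ubm) for x in frames) / len(frames)
```

Each model is a `(weights, means, variances)` triple; real systems use 2048 multivariate components, but the score has the same shape.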
Experimental Protocol
Based upon the NIST SRE 2005 database: primary condition, 1side-1side, male set; 280 speakers; utterances 2.5 min. long (containing speech); 13624 tests (951 target tests); impostors: 200 speakers from the background model.
Commonalities: (16+16Δ) LFCC features (300-3 kHz), T-normed systems, 2048-component UBM.
The LIA_SpkDet system
GMM/UBM, 2048 diagonal components. 16 cepstral coefficients + 16 deltas (50 coefficients in 2006). Frame selection based on a 3-component GMM modeling of the energy. Feature normalization: mean removal and variance normalization. Score normalization: T-norm.
Annex: Voice transformation
Voice transformation (1): independent processing and overlap-add
[Diagram: frame-by-frame processing with overlap. Each frame y_i of Y is transformed independently, Y'_i = VT(y_i, S), and the transformed frames are added back with overlap to produce Y' = VT(Y, S).]
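The frame-by-frame processing with overlap-add can be sketched as follows (Hann window, 50% overlap; the per-frame `transform` callable stands in for VT(y_i, S) and is an assumption of this sketch):

```python
import math

def overlap_add(signal, frame_len, transform):
    hop = frame_len // 2
    # Hann window: with 50% overlap, adjacent windows sum to exactly 1
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        frame = transform(frame)            # per-frame transformation
        for n in range(frame_len):          # add back at the original position
            out[start + n] += frame[n]
    return out
```

With the identity transform, interior samples are reconstructed exactly, since this window sums to 1 at 50% overlap; any per-frame filtering slots in as `transform`.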
Voice transformation (2): find the target transfer function (per frame)
From the acoustic parameter vector (cepstral) and the target speaker model, compute the a posteriori probabilities of each component. Build the cepstral target by combining the means of the components, weighted by the probability vector. The cepstral target then yields a transfer function.
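Building the cepstral target from the a posteriori probabilities can be sketched with scalar "cepstra" and a toy 2-component model (the model values in the test are illustrative, not the paper's 2048-component setup):

```python
import math

def posteriors(x, weights, means, variances):
    # a posteriori probability of each Gaussian component given frame x
    joint = [w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]

def cepstral_target(x, weights, means, variances):
    # combine the component means, weighted by the posterior vector
    post = posteriors(x, weights, means, variances)
    return sum(p * m for p, m in zip(post, means))
```

A frame close to one component is pulled almost entirely toward that component's mean; ambiguous frames land on a weighted mixture of means.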
Voice transformation (3): two parallel GMMs
The ASR side uses several feature normalization techniques, so it is not possible to come back to the signal from that feature space. Hence two parallel models per target, with a one-to-one tying of the components:
• a master model, in the ASR feature space, used to compute the a posteriori probabilities for Y;
• a filtering model, in the filtering feature space, used to combine the means and estimate the target for filtering.
Voice transformation (4): filtering
For each frame, build the cepstral target, derive the target transfer function Hx(f), and estimate the frame's original transfer function Hy(f). The frame of y is then filtered by
H(f) = Hx(f) / Hy(f)
so that the original transfer function is replaced by the target one, yielding Y'.
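The correction filter H(f) = Hx(f)/Hy(f) can be sketched per frequency bin (magnitude spectra as plain lists; the small epsilon guarding against division by zero is an assumption of this sketch):

```python
def correction_filter(h_target, h_original, eps=1e-12):
    """H(f) = Hx(f) / Hy(f), per frequency bin."""
    return [hx / (hy + eps) for hx, hy in zip(h_target, h_original)]

def apply_filter(spectrum, h):
    # multiply the frame's spectrum by the correction filter
    return [s * g for s, g in zip(spectrum, h)]
```

Applying the filter to a spectrum shaped by Hy reshapes it toward Hx, which is exactly the substitution of transfer functions described above.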
Client and impostor score distributions
DET curves using the transformation for all the tests
Parallel models
[Diagram: ASR feature domain vs. filtering feature domain. The ASR target speaker model provides the EM E-step hidden variables; the EM M-step is then applied in the filtering domain to obtain the filtering target speaker model.]