Speaker Detection Without Models

Dan Gillick

July 27, 2004

Motivation

Want to develop a speaker ID algorithm that:

• captures sequential information
• takes advantage of extended data
• combines well with existing baseline systems

The Algorithm

• Rather than build models (GMM, HMM, etc.) to describe the information in the training data, we directly compare test data frames to training data frames.

• We compare sequences of frames because we believe there is information in sequences that systems like the GMM do not capture.

• The comparisons are guided by token-level alignments extracted from a speech recognizer.

Front-End

Using 40 MFCC features per 10 ms frame:

– 19 cepstral coefficients and energy (C0)

– Their deltas
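As a rough illustration of how a 40-dimensional frame is assembled, the sketch below stacks 20 static coefficients (19 cepstra plus C0) with their deltas. The delta estimate (`np.gradient` along the time axis) is an assumption, since the slides do not say how the deltas were computed, and `add_deltas` is a hypothetical helper name.

```python
import numpy as np

def add_deltas(static):
    """Append time derivatives to static features.

    static: array of shape (num_frames, 20), i.e. 19 cepstra + C0 per frame.
    Returns an array of shape (num_frames, 40).
    """
    static = np.asarray(static, dtype=float)
    # Assumption: finite differences along the time axis stand in for the
    # delta computation, which the slides do not specify.
    deltas = np.gradient(static, axis=0)
    return np.hstack([static, deltas])

# Random stand-in features: 300 frames correspond to 3 seconds of speech.
features = add_deltas(np.random.default_rng(0).normal(size=(300, 20)))
print(features.shape)
```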

The Algorithm: Overview

Cut the test and target data into tokens

– use word or phone-level time-alignments from the SRI recognizer

– note that these alignments have lots of errors (both word errors and alignment errors)
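Cutting the data into tokens could look roughly like the sketch below. The alignment layout (`label`, start time, end time in seconds) is hypothetical and is not the SRI recognizer's actual output format; `cut_into_tokens` is likewise an illustrative name.

```python
import numpy as np

FRAME_SHIFT_S = 0.010  # 10 ms frames, as in the front-end

def cut_into_tokens(features, alignment):
    """Group frame sequences by token label.

    features:  array of shape (num_frames, num_features).
    alignment: iterable of (label, start_s, end_s) entries (hypothetical layout).
    Returns a dict mapping each label to a list of frame sequences.
    """
    tokens = {}
    for label, start_s, end_s in alignment:
        first = int(round(start_s / FRAME_SHIFT_S))
        last = int(round(end_s / FRAME_SHIFT_S))
        segment = features[first:last]
        if len(segment) > 0:  # skip empty or out-of-range alignments
            tokens.setdefault(label, []).append(segment)
    return tokens

features = np.zeros((100, 40))
alignment = [("hello", 0.00, 0.30), ("my", 0.30, 0.42), ("hello", 0.60, 0.85)]
tokens = cut_into_tokens(features, alignment)
print({label: [len(seg) for seg in segs] for label, segs in tokens.items()})
```

Because the alignments contain errors, a segment may cover the wrong frames; the sketch makes no attempt to detect that.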

The Algorithm: Overview

Compare test and target data

1. Take the first test token

2. Find every instance of this token in the target data

3. Measure the distance between the test token and each target instance

4. Move on to the next test token
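The four steps above can be sketched as a simple loop. The data layout and the names `compare` and `truncated_distance` are assumptions for illustration; the default distance just cuts the longer token off at the length of the shorter one, and any other frame line-up could be passed in instead.

```python
import numpy as np

def truncated_distance(a, b):
    """Sum of frame-wise Euclidean distances, cutting the longer token short."""
    n = min(len(a), len(b))
    return float(np.linalg.norm(a[:n] - b[:n], axis=1).sum())

def compare(test_tokens, target_tokens, distance=truncated_distance):
    """For each test token, measure its distance to every target instance.

    test_tokens:   list of (label, frames) pairs in order of occurrence.
    target_tokens: dict mapping label -> list of frame arrays.
    Returns a list of (label, num_frames, distances). Tokens that never
    occur in the target data are skipped (an assumption).
    """
    results = []
    for label, frames in test_tokens:               # steps 1 and 4
        instances = target_tokens.get(label, [])    # step 2
        if not instances:
            continue
        dists = [distance(frames, inst) for inst in instances]  # step 3
        results.append((label, len(frames), dists))
    return results

test_tokens = [("hello", np.zeros((3, 2)))]
target_tokens = {"hello": [np.ones((3, 2)), np.zeros((5, 2))]}
for label, num_frames, dists in compare(test_tokens, target_tokens):
    print(label, num_frames, dists)
```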

The Algorithm

“Take the first test token”: grab the sequence of frames corresponding to this token according to the recognizer output

[Figure: test data and training data; the test token is “Hello”]

The Algorithm

“Find every instance of this token in the target data”

[Figure: test token “Hello” and its training-data instances Hello (1), Hello (2), Hello (3)]

The Algorithm

“Measure the distance between the test token and each target instance”: distance = sum of the (Euclidean) distances between frames of the test and target instances

[Figure: test token “Hello” compared with each training instance by a Euclidean distance function]

Hello (1): Distance = 25
Hello (2): Distance = 40
Hello (3): Distance = 18

The Algorithm: Distance Function

But these instances have different lengths. How do we line up the frames? Here are some possibilities:

1. Line up the first frames and cut off the longer at the shorter

2. Use a sliding window approach: slide the shorter through the longer, taking the best (smallest) total distance.

3. Use dynamic time warping (DTW)
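The three line-up options can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the DTW variant (unit steps, no path constraints, no length normalization) is an assumption, and all function names are hypothetical.

```python
import numpy as np

def frame_distances(a, b):
    """Matrix of Euclidean distances between every frame of a and of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def truncate_distance(a, b):
    """Option 1: line up the first frames and cut the longer token off."""
    n = min(len(a), len(b))
    return float(np.linalg.norm(a[:n] - b[:n], axis=1).sum())

def sliding_window_distance(a, b):
    """Option 2: slide the shorter token through the longer, keep the best."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    n = len(short)
    return min(
        float(np.linalg.norm(short - long_[off:off + n], axis=1).sum())
        for off in range(len(long_) - n + 1)
    )

def dtw_distance(a, b):
    """Option 3: dynamic time warping (assumed variant, see lead-in)."""
    d = frame_distances(a, b)
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i, j] = d[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1]
            )
    return float(cost[-1, -1])

# One-dimensional toy "frames": b is a with every frame repeated twice.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]])
print(truncate_distance(a, b), sliding_window_distance(a, b), dtw_distance(a, b))
```

In this toy case only DTW recovers a distance of zero, because it may repeat frames of the shorter token to absorb the difference in speaking rate.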

[Figure: Hello (test) compared with Hello (3) by the Euclidean distance function; Distance = 18]

The Algorithm: Take the 1-Best

Now what do we do with these scores? There are a number of options, but we only keep the 1-best score, i.e. the smallest distance. One motivation for this decision is that we are mainly interested in positive information.

Test token Hello:
Hello (1): Distance = 25
Hello (2): Distance = 40
Hello (3): Distance = 18

Token Score = 18
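In code, the 1-best rule is just a minimum over the distances to the target instances. `token_score` is a hypothetical helper name; the numbers are the ones from the example above.

```python
def token_score(distances):
    """Keep only the best (smallest) distance among the target instances."""
    return min(distances)

print(token_score([25, 40, 18]))  # prints 18
```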

The Algorithm: Scoring

So we accumulate scores for each token. What do we do with these? Some options:

1. Average them, normalizing either by the number of tokens or by the total number of frames (Basic score)

2. Focus on some subset of the scores

a. Positive evidence (Hit score): ∑ [ (#frames) / (k^score) ]

b. Negative evidence: ∑ [ (#frames*target count) / (k^(M-score)) ]

Hello: Token Score = 18
my: Token Score = 16.5
name: Token Score = 21
Etc…
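The scoring options can be sketched as below. The slides do not give values for `k` or `M`, so the defaults here are placeholders, and the frame counts and target counts in the example are made up; only the three token scores come from the slide.

```python
def basic_score(tokens, per_frame=False):
    """Average token score, normalized by token count or by total frames."""
    total = sum(score for score, _, _ in tokens)
    denom = sum(frames for _, frames, _ in tokens) if per_frame else len(tokens)
    return total / denom

def hit_score(tokens, k=1.5):
    """Positive evidence: sum of #frames / k^score (k is a placeholder)."""
    return sum(frames / k ** score for score, frames, _ in tokens)

def negative_score(tokens, k=1.5, M=30.0):
    """Negative evidence: sum of #frames * target count / k^(M - score).

    k and M are placeholders; the slides do not state their values.
    """
    return sum(frames * count / k ** (M - score) for score, frames, count in tokens)

# Each entry: (token score, number of frames, target count).
# Scores are from the slide; frame counts and target counts are invented.
tokens = [(18.0, 30, 3), (16.5, 12, 5), (21.0, 25, 2)]
print(basic_score(tokens))
print(basic_score(tokens, per_frame=True))
print(hit_score(tokens), negative_score(tokens))
```

With a base `k` greater than 1, the hit score is dominated by tokens with small distances, which matches the stated interest in positive information.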

Normalization

• Most systems use a UBM (universal background model) to center the test pieces
– Since this system has no model, we create a background by lumping together speech from a number of different held-out speakers and running the algorithm with this group as training data

• ZNorm to center the “models”
– Find the mean score for each “model” or training set by running a number of held-out imposters against each one.
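A minimal sketch of the ZNorm step, assuming the per-target statistics are estimated from held-out imposter scores. The slide only mentions the mean; dividing by the standard deviation is the usual Z-norm convention and is included here as an optional assumption. How the background score is combined with the target score is not stated in the slides, so it is left out.

```python
import numpy as np

def znorm_stats(imposter_scores):
    """Mean and standard deviation of imposter scores against one target."""
    scores = np.asarray(imposter_scores, dtype=float)
    return float(scores.mean()), float(scores.std())

def znorm(raw_score, mean, std=None):
    """Center a raw score; optionally scale by the imposter std (assumption)."""
    centered = raw_score - mean
    if std is not None and std > 0:
        return centered / std
    return centered

mean, std = znorm_stats([2.0, 4.0, 6.0])  # made-up imposter scores
print(znorm(5.0, mean), znorm(5.0, mean, std))
```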

Results

Results reported on split 1 (of 6) of Switchboard I

(1624 test vs. target scores)

Results

TOKEN           STYLE  BKG  ZNORM  BSCR EER  HS EER  COMB EER  COMB DCF
word unigrams   sw     14   none   6.82      4.83
word unigrams   dtw    14   none   4.16      3.16
word unigrams   dtw    14   16     2.66      2.16    2.00      0.0416
word bigrams    sw     14   16     5.80      3.68
word bigrams    dtw    14   16     2.83      2.16    1.83      0.0447
phone unigrams  dtw    14   16     2.64      2.48    1.98      0.0560
phone bigrams   dtw    14   16     1.83      1.83    1.33      0.0333
phone trigrams  dtw    14   16     1.65      1.65    1.16      0.0345

For reference: GMM performance on the same data set: 0.67% EER; 0.0491 DCF

Style: sw = sliding window, dtw = dynamic time warping; Bkg: # of speakers in the bkg set; Znorm: # of speakers in the znorm set; BSCR = basic score; HS = hit score; COMB = combined score; EER in %

Results

How do positive and negative evidence compare?

Word-bigrams + bkg (positive evidence) 3.16% EER

Word-bigrams + bkg (negative evidence) 26.5% EER

Results

How is the system affected by errorful recognizer transcripts?

Word bigrams + bkg + znorm (recognized transcripts) 1.83% EER

Word bigrams + bkg + znorm (true transcripts) 1.16% EER

Results

How does the system combine with the GMM?

This experiment was done on the first half (splits 1,2,3) of Switchboard I

System                     EER   DCF
SRI GMM system             0.97  0.04806
Best phone-bigram system   1.46  0.06110
GMM + phone-bigrams        0.49  0.02040

Future Stuff

• Try larger background population, larger znorm set
• Try other, non-Euclidean distance functions
• Change the front-end features (Feature mapping)
• Run the system on Switchboard II; 2004 eval. data
• Dynamic token selection

– While the system works well already, perhaps its real strength is one which has not been exploited. Since there are no models, we might dynamically select the longest available frame sequences in the test and target data for scoring.

Thanks

Steve (wrote all the DTW code, versions 1 through 5…)

Barry (tried to make my slides fancy)

Barbara

Everyone else in the Speaker ID group