The 2004 MIT Lincoln Laboratory Speaker Recognition System

D. A. Reynolds, W. Campbell, T. Gleason, C. Quillen, D. Sturim, P. Torres-Carrasquillo, A. Adami (ICASSP 2005)

CS298 Seminar

Shaunak Chatterjee

09-23-2011

Speaker Recognition Systems - ICSIfractor/fall2011/pres1.pdf · • Speaker adaptive cohort selection for Tnorm in text-independent speaker verification – Sturim, Reynolds (2005)




Actually …

• Robust text-independent speaker identification using Gaussian mixture speaker models – Reynolds, Rose (1995)

• Speaker verification using adapted Gaussian mixture models – Reynolds, Quatieri, Dunn (2000)

• Speaker recognition based on idiolectal differences between speakers – Doddington (2001)

• Generalized linear discriminant sequence kernels for speaker recognition – Campbell (2002)

• Modeling prosodic dynamics for speaker recognition – Adami, Mihaescu, Reynolds, Godfrey (2003)

• Speaker adaptive cohort selection for Tnorm in text-independent speaker verification – Sturim, Reynolds (2005)

• The 2004 MIT Lincoln Laboratory Speaker Recognition System – Reynolds et al (2005)

• The MIT Lincoln Laboratory 2008 Speaker Recognition System – Sturim, Campbell, Karam, Reynolds, Richardson (2009)


Douglas A. Reynolds

• PhD (Georgia Tech, 1992)

• Currently Senior Member of Technical Staff at MIT Lincoln Lab

• Most cited author in speaker recognition (by far?)

• Contributed several key ideas currently used in robust speaker recognition systems

• MIT Lincoln Lab has won numerous awards at the NIST SRE over the years


What can we learn from speech?

Slide courtesy: Reynolds, Heck


Speaker Recognition

Identification

• No identity claim is made

• Classification

Verification

• Identity claim is made

• Binary decision

• Open-set vs closed-set

• Text-dependent vs text-independent


Applications

• (Telephonic) Transaction Authentication

• Access Control

– Physical facilities

– Computer and data networks

• Parole Monitoring

• Information Retrieval

– Audio indexing in call centers

• Forensics


Components of a speaker recognition system

Slide courtesy: Reynolds, Heck

Universal Background Model

Background’s “Voiceprint”


Phases of speaker verification

Slide courtesy: Reynolds, Heck


Feature Extraction

• Pre-processing

– Bandlimiting

– Silence, noise removal

– Channel bias removal (e.g. RASTA)

• Feature computation

– MFCCs computed every 10 ms over a 20 ms window

– F0 and energy features

– Phonetic features
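As a rough illustration of the "every 10 ms over a 20 ms window" framing above, the following sketch splits a signal into overlapping analysis frames. The 8 kHz sample rate and the function name `frame_signal` are assumptions made for this example; the slides do not state them.

```python
def frame_signal(samples, sample_rate, window_ms=20.0, hop_ms=10.0):
    """Split a list of samples into overlapping analysis frames."""
    window = int(round(sample_rate * window_ms / 1000.0))  # samples per frame
    hop = int(round(sample_rate * hop_ms / 1000.0))        # samples between frame starts
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames


# One second of dummy audio at an assumed 8 kHz telephone sample rate.
signal = [0.0] * 8000
frames = frame_signal(signal, sample_rate=8000)
```

Each frame would then be passed to the cepstral analysis; consecutive frames overlap by half a window.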


Speaker models

Slide courtesy: Reynolds, Heck


Gaussian mixture models (GMMs)

• Trained using EM

• Often converges within 5 iterations

• Wide range of choices to constrain parameters
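EM maximizes the likelihood of the training frames under the mixture. As a minimal sketch of that quantity, the function below evaluates the log density of a single feature vector under a GMM. Diagonal covariances are an assumption here (one common way to constrain the parameters), and all names are illustrative.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log density of one feature vector x under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_gauss = 0.0
        for xi, mi, vi in zip(x, mu, var):
            log_gauss += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(math.log(w) + log_gauss)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))
```

Summing this value over all frames of an utterance gives the utterance log-likelihood used for scoring.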


Why GMMs? - I

[Figure: histogram of one cepstral coefficient for a 25-second speech sequence; panels: unimodal distribution, Gaussian mixture model, vector quantization (VQ)]

[Reynolds 95]


Why GMMs? - II

Each component of the GMM can be interpreted as modeling a speaker-dependent vocal tract configuration

[Reynolds 95] Image: wikipedia


Text-dependent vs text-independent


Slide courtesy: Reynolds, Heck


Hypothesis testing
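In the adapted-GMM framework listed earlier (Reynolds, Quatieri, Dunn, 2000), verification is commonly posed as a likelihood-ratio test between a claimed-speaker model and the background model. The notation below is an assumption for illustration: $X = \{x_1, \dots, x_T\}$ are the test frames, $\lambda_{\mathrm{spk}}$ and $\lambda_{\mathrm{UBM}}$ are the two models, and $\theta$ is a decision threshold.

```latex
\Lambda(X) = \frac{1}{T} \sum_{t=1}^{T}
  \left[ \log p(x_t \mid \lambda_{\mathrm{spk}}) - \log p(x_t \mid \lambda_{\mathrm{UBM}}) \right],
\qquad
\text{accept the identity claim if } \Lambda(X) \ge \theta
```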


2004 MIT Lincoln Lab Speaker Recognition System (MITLL)

• Seven core systems

– Spectral based

• GMM-UBM

• (Spectral) SVM

– Prosodic based

• Pitch and Energy GMM

• Slope and duration n-gram

– Phonetic based

• Phone N-grams

• Phone SVM

– Idiolectal based


Feature Extraction – GMM-UBM

• 19-dimensional MFCC every 10ms using a 20ms window

• Bandlimiting: 300-3138Hz

• RASTA filtering

– To reduce channel bias effects

• Δ-cepstral coefficients computed for ±2 frames

• Silence removal, feature mapping, normalization
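The Δ-cepstral step can be sketched with the widely used regression formula over ±2 frames. The slides only give the window, so the exact formula and the edge handling (replicating the first and last frame) are assumptions in this example.

```python
def delta(frames, k_max=2):
    """Regression-based delta features over +/- k_max frames (edges replicated)."""
    n = len(frames)
    denom = 2.0 * sum(k * k for k in range(1, k_max + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, k_max + 1):
                ahead = frames[min(t + k, n - 1)][d]   # clamp at the last frame
                behind = frames[max(t - k, 0)][d]      # clamp at the first frame
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


# A one-dimensional ramp: interior deltas equal the slope of the ramp.
cepstra = [[float(t)] for t in range(6)]
deltas = delta(cepstra)
```

The delta vectors are typically appended to the static cepstra, doubling the feature dimension.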


UBM training

• Gender-independent 2048-mixture UBM trained on the Switchboard and OGI National Cellular Database corpora

– The MIXER corpus (the test data) was not used

• Target models (for individual speakers) are derived by Bayesian adaptation of the UBM parameters using each speaker's training data from MIXER

– “compensating” for UBM
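A minimal sketch of Bayesian (MAP) adaptation, assuming only the mixture means are adapted and a relevance factor of 16 is used; both choices follow common practice for adapted GMMs and are not stated on the slide. The frame-to-mixture posteriors are taken as given inputs, and all names are illustrative.

```python
def map_adapt_means(ubm_means, posteriors, frames, relevance=16.0):
    """MAP-adapt UBM means toward enrollment frames.

    posteriors[t][i] is the responsibility of mixture i for frame t.
    """
    adapted = []
    for i, mu in enumerate(ubm_means):
        n_i = sum(post[i] for post in posteriors)   # soft count for mixture i
        alpha = n_i / (n_i + relevance)             # data-dependent mixing weight
        new_mu = []
        for d, mu_d in enumerate(mu):
            if n_i > 0.0:
                first_moment = sum(post[i] * x[d] for post, x in zip(posteriors, frames)) / n_i
            else:
                first_moment = mu_d                 # no data: keep the UBM mean
            new_mu.append(alpha * first_moment + (1.0 - alpha) * mu_d)
        adapted.append(new_mu)
    return adapted
```

Mixtures that see a lot of enrollment data move toward the speaker's statistics, while rarely observed mixtures stay close to the UBM.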


Support Vector Machines (SVM)


SVM - II


SVM - III


Spectral SVM (for speech)

• Campbell (2002) showed that good performance in speaker recognition tasks could be achieved using sequence kernels

• Sequence kernel: provides a numerical comparison of speech utterances as entire sequences

• Campbell introduced a novel sequence kernel derived from generalized linear discriminants


SVM setup in MITLL

• Same front-end processing as before

• Background (or the other class) for every speaker consisted of a set of speakers taken from Switchboard

– Current speaker under training had target of +1 and every other speaker had target of -1

• SVM training was performed using the GLDS kernel
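The idea behind the GLDS kernel can be sketched as follows: each utterance is mapped to the average polynomial expansion of its frames, and two utterances are compared with a normalized inner product of those averages. The degree-2 expansion and the diagonal approximation of the normalization matrix are assumptions chosen to keep the example small; function names are hypothetical.

```python
from itertools import combinations_with_replacement


def expand(x, degree=2):
    """All monomials of the entries of x up to the given degree (incl. constant 1)."""
    terms = [1.0]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), deg):
            prod = 1.0
            for i in idx:
                prod *= x[i]
            terms.append(prod)
    return terms


def mean_expansion(frames, degree=2):
    """Average polynomial expansion of a whole utterance."""
    total = None
    for x in frames:
        b = expand(x, degree)
        total = b if total is None else [a + c for a, c in zip(total, b)]
    return [v / len(frames) for v in total]


def glds_kernel(frames_x, frames_y, r_diag, degree=2):
    """b_x^T R^{-1} b_y with R approximated by its diagonal r_diag."""
    bx = mean_expansion(frames_x, degree)
    by = mean_expansion(frames_y, degree)
    return sum(a * c / r for a, c, r in zip(bx, by, r_diag))
```

Because the kernel compares whole utterances through fixed-length vectors, a standard SVM trainer can be used with +1 and -1 targets as described above.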


Prosodic based systems

• Prosody: the rhythm, stress and intonation of speech

• Spectral approaches focus on capturing short-term information

• Prosodic systems can model long-term information

• Two systems in 2004 MITLL SRS

– Distribution based pitch/energy classifier

– Pitch/energy sequence modeling system


Pitch and Energy GMM

• Very similar to GMM-UBM

– Main difference: feature set

• Log F0 and log energy estimated every 10ms using RAPT – Robust Algorithm for Pitch Tracking (Talkin 1995)

• Δ features (over 50ms window) appended

• Silence and noisy region removal

• UBM: 512 components (Switchboard)


What is F0?

• Fundamental frequency of a human voice

– Typically 85-180 Hz in males

– Typically 165-255 Hz in females

– Range is below most band limits

– Higher harmonics are transmitted

– F0 is not static


Slope and duration n-gram - I

• The dynamics of F0 and energy also convey information about speaker identity

• Dynamics of both trajectories jointly represent certain prosodic gestures characteristic of a speaker (Adami et al, 2003)


Slope and duration n-gram - II

• F0 and energy trajectories converted into a sequence of tokens

– Each token reflects a joint state of the trajectories (rising or falling)
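A deliberately simplified sketch of the tokenization: each frame transition is labeled with the joint direction of the F0 and energy tracks, and runs of identical labels are merged into (token, duration) pairs. Frame-level slope signs stand in for the segment-level trajectory fitting of the real system, so this is an assumption-laden toy, not the published method.

```python
def slope_sign(prev, cur):
    """'+' for a rising (or flat) step, '-' for a falling step."""
    return "+" if cur >= prev else "-"


def tokenize(f0, energy):
    """Return [(token, duration_in_frames), ...] from two equal-length tracks."""
    segments = []
    for t in range(1, len(f0)):
        token = slope_sign(f0[t - 1], f0[t]) + slope_sign(energy[t - 1], energy[t])
        if segments and segments[-1][0] == token:
            segments[-1] = (token, segments[-1][1] + 1)  # extend the current run
        else:
            segments.append((token, 1))
    return segments


f0 = [100, 105, 110, 108, 104]
energy = [60, 62, 64, 63, 61]
tokens = tokenize(f0, energy)
```

The resulting token sequence can then be modeled with n-grams, with the duration available as an extra attribute of each token.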


Phonetic based system - I

• Gender independent phone recognition

• Phone recognizers trained on phonetically marked speech from OGI multi-language corpus

• Output token streams were processed to produce a sequence of token symbols


Phonetic based system – II

• Two systems

– Standard n-gram modeling

• Bi-gram model estimated for each speaker (for each phone/language)

• UBM from Switchboard

• 6 scores fused

– Phone SVM

• Very similar to Spectral SVM
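The n-gram branch can be sketched as a bigram log-likelihood ratio between a speaker model and a background model, averaged over the bigrams of the test token stream. The probability floor used for unseen bigrams is an assumption for this example, as are the function names; one such score would be produced per phone recognizer before fusion.

```python
import math
from collections import Counter


def bigram_counts(tokens):
    """Count adjacent token pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def bigram_llr(test_tokens, speaker_counts, background_counts, floor=1e-3):
    """Average log-likelihood ratio of test bigrams: speaker model vs background."""
    spk_total = sum(speaker_counts.values())
    bkg_total = sum(background_counts.values())
    test = bigram_counts(test_tokens)
    score = 0.0
    for bigram, count in test.items():
        # Counter returns 0 for unseen bigrams; the floor avoids log(0).
        p_spk = max(speaker_counts[bigram] / spk_total, floor)
        p_bkg = max(background_counts[bigram] / bkg_total, floor)
        score += count * math.log(p_spk / p_bkg)
    return score / sum(test.values())


speaker = bigram_counts("a b a b a b".split())
background = bigram_counts("a b c a c b".split())
```

A positive score means the test stream looks more like the speaker model than the background; a negative score means the opposite.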


Idiolectal differences

• Only look at content!

• It is possible to determine the authorship of papers and literary works from their text alone


• Conversational speech content is typically less constrained than written text and therefore potentially more distinctive

• Unfortunately, a lot of data is needed for reasonable accuracy


MITLL idiolectal based system

• Only considered bigrams

– Trigrams and higher did not improve performance

• Switchboard data used to create UBM

• BBN Byblos 3.0 used for speech-to-text conversion


System fusion

• Perceptron classifier
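Score-level fusion with a perceptron can be sketched as learning one weight per core system plus a bias. The slide does not give the training procedure, so the classic perceptron update rule, the learning rate, and the toy two-system scores below are all assumptions.

```python
def train_perceptron(score_vectors, labels, epochs=20, lr=0.1):
    """Classic perceptron rule; labels are +1 (target) or -1 (impostor)."""
    weights = [0.0] * len(score_vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(score_vectors, labels):
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            if y * activation <= 0.0:  # misclassified: nudge toward the label
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def fuse(weights, bias, x):
    """Fused score for one trial given its per-system scores x."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Each row holds the scores of two hypothetical core systems for one trial.
trials = [[2.0, 1.5], [1.0, 2.5], [-1.0, -0.5], [-2.0, 0.5]]
labels = [1, 1, -1, -1]
weights, bias = train_perceptron(trials, labels)
```

The fused score is then thresholded like any single-system score.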


Performance measure

Slide courtesy: Reynolds, Heck
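Verification performance is usually summarized by the trade-off between miss and false-alarm rates as the decision threshold moves, which is what a DET curve plots. The sketch below computes both rates at a threshold and a simple approximation of the equal error rate; the scores are made up for illustration and the function names are hypothetical.

```python
def error_rates(target_scores, impostor_scores, threshold):
    """Return (miss_rate, false_alarm_rate) when accepting scores >= threshold."""
    misses = sum(1 for s in target_scores if s < threshold)
    false_alarms = sum(1 for s in impostor_scores if s >= threshold)
    return misses / len(target_scores), false_alarms / len(impostor_scores)


def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: operating point where the two rates are closest."""
    best = None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        miss, fa = error_rates(target_scores, impostor_scores, threshold)
        if best is None or abs(miss - fa) < abs(best[0] - best[1]):
            best = (miss, fa)
    return (best[0] + best[1]) / 2.0


targets = [2.0, 1.5, 0.2, 3.0]
impostors = [-1.0, 0.5, -0.3, 0.1]
eer = equal_error_rate(targets, impostors)
```

Sweeping the threshold over all observed scores and plotting the resulting rate pairs gives the points of the trade-off curve.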


DET – different scenarios


Slide courtesy: Reynolds, Heck


Results - I


Results - II


No gain from higher-level information

• All development data was in English

– Could have led to a bias in the UBMs

• SRE04 dataset had substantial channel mismatch

– More difficult task, potentially masks gains

• Both are essentially mismatches between training and test distributions/data


Results - III

• All pool: all languages

• Common pool: English only

• Clear indication of cross-lingual degradation

• N-gram system reduces error significantly


Conclusions

• 2004 MITLL system attempted to exploit other levels of information (prosodic, phonetic, idiolectal) to better characterize and recognize a speaker

• 7 core systems

• Generative, discriminative and discrete classifiers

• Results on the “challenging” MIXER corpus (SRE04)

• Fusion approaches that succeeded previously need to be better tailored to cross-lingual environments


2008 MITLL Speaker Recognition system (Interspeech 2009)

• Two main themes

– Variational nuisance modeling to allow for better compensation for channel variation

– Fuse systems targeting different linguistic tiers of information (high and low)


QUESTIONS?

Thanks for the attention!
