Speaker and Language Recognition: A Guided Safari
Doug Reynolds
2008 Odyssey Workshop
This work was sponsored by the Department of Defense under Air Force contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
MIT Lincoln Laboratory
Odyssey 2008
Roadmap
• The odyssey from 1994 to 2008
• The scenic route through NIST speaker and language recognition evaluations
• The expedition into future territories
The Odyssey: Martigny, Switzerland, April 5–7, 1994
• First workshop focused solely on speaker recognition
  – Helped form working relationships among the international SID community
• 46 papers, 6 tutorials/keynotes
• 65 attendees
• Technologies: TD-HMMs, TI-GMMs, MLP, LVQ, RBF, DTW, LTA
• Corpora: home-grown (digits, words, phrases, 10–30 speakers), YOHO, TIMIT, NTIMIT, KING, POLYPHONE, SWB1
• Very difficult to compare results
  – Varying corpora, experiment designs, and measures of performance
• Large emphasis on text-dependent applications (telecoms)
  – Some papers on forensic SV (humans and machines)
The Odyssey: Avignon, France, April 20–23, 1998
• Focus on forensic and commercial applications
• 40 papers, 5 keynotes
• 78 participants
• Technologies: more emphasis on statistical approaches (HMM, GMM, AHS)
• Corpora: still a diverse, small set (less home-grown); more TIMIT and SPIDRE (SWB)
  – Europeans showing lead in common corpora/experiments (POLYCOST, VERIVOX, CAVE)
• Increasing buzz about dot-com speech/speaker companies
• Some lasting themes in talks
  – Doddington: getting to know the speaker
  – Champod: LRs as evidence in a Bayesian framework
• Some friction between the automatic speaker recognition community and the expert human speaker examiner community
  – ASR crowd pressed for measured error rates
  – Examiner crowd pressed for transparency and explanation in results
The Odyssey: Crete, Greece, June 18–22, 2001
• Start of official "Odyssey" workshop series
  – Originally set for Tel Aviv, Israel
• 40 papers, 3 keynotes
• 75 participants
• Technologies: more papers on new tasks (biometrics, diarization) and on practical issues (robustness, channel compensation, threshold setting)
• Corpora: many more papers using SRE corpora and experiment design
• Bayesian framework taking hold for forensic applications
• Several speaker verification companies (PerSay, Nuance, VoiceVault)
The Odyssey: Toledo, Spain, May 31–June 3, 2004
• Co-occurrence with NIST SRE 2004 workshop
• 61 papers, 4 keynotes
• 147 participants
• Technologies: GMM, SVM, NAP, LFA, high-level features, adaptation, audio-video, LID
• Corpora: SRE corpora/protocol dominant for TI telephone; RT BNEWS data for diarization; TNO/NFI field forensic corpus
• Text-dependent work focusing more on user phrases (fewer digit strings)
The Odyssey: San Juan, Puerto Rico, June 28–30, 2006
• Co-occurrence with NIST SRE 2006 workshop
  – Followed LRE 2005 in December
• 60 papers, 1 keynote
• 103 participants
• Technologies: GMM-SVM, NAP/LFA, GMM-MMI, high-level features, robustness
• Corpora: dominated by SRE and LRE corpora/protocol
The Odyssey: Stellenbosch, South Africa, January 21–24, 2008
• Expect to see continued trends in
  – Common corpora/evaluations
  – High-quality papers and novel topics
• More fish pictures …
Roadmap
• The odyssey from 1994 to 2008
• The scenic route through NIST speaker and language recognition evaluations
• The expedition into future territories
NIST Speaker/Language Recognition Evaluations
• Recurring NIST evaluations of speaker/language recognition technology
• Aim: provide a common paradigm for comparing technologies
• Focus: conversational telephone speech (text-independent)
[Diagram: NIST (evaluation coordinator) and the Linguistic Data Consortium (data provider) support an evaluate/improve cycle between technology developers and technology consumers, who supply application-domain parameters; technologies are compared on a common task.]
NIST SRE/LRE: Pre-history
• 1992: Rutgers Summer Workshop
• 1993: DARPA SID eval
  – 3 sites: Dragon (LVCSR), ITT (NN), MITLL (GMM)
  – Early SWB1; 1–4 conversation train; 24 targets
  – 111 target tests, 466 impostor tests; 10–60 s tests
  – Speaker-dependent ROC; intro of Swets normal-deviate plot (DET)
  – Areas under ROC; Pd at Pf = 10%
  – Sun SPARC 10, 20 MB
• 1994: Informal LRE
  – 4 sites: OGI, MITLL, MIT, ITT
  – OGI 12-language corpus
• 1994: Martigny Workshop
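The normal-deviate (DET) plot introduced here warps miss and false-alarm probabilities through the inverse standard-normal CDF, so systems with roughly Gaussian score distributions trace straight lines. A minimal sketch of the coordinate transform (the function name is illustrative, not from any NIST tool):

```python
from statistics import NormalDist

def det_point(p_miss: float, p_fa: float) -> tuple[float, float]:
    """Map one operating point to DET-plot coordinates (probit scale)."""
    inv = NormalDist().inv_cdf  # inverse standard-normal CDF
    return inv(p_fa), inv(p_miss)

# e.g. a system with 10% miss at 1% false alarm
x, y = det_point(p_miss=0.10, p_fa=0.01)
```

Sweeping the decision threshold and plotting many such points yields the DET curve used in every NIST evaluation since.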
NIST SRE/LRE: Formal Start
• 1995: SRE 1
  – 6 sites: BBN (uni-Gauss), Dragon (LVCSR), Ensigma (ergodic HMM), INRS (phone HMMs), ITT (NN), MITLL (GMM, order 64, cohort)
  – SWB1; 26 targets
  – Train: 10 s, 30 s, 4 sessions; Test: 5 s, 10 s, 30 s; separate target and impostor tests
  – Area under ROC; Pd at Pf = 3/10/5; closed-set error
  – Same/different phone effect; speaker-dependent ROC
• 1996: SRE 2
  – 11 sites: Ensigma, ITT, MITLL, SRI, CAIP, INRS, Dragon, BBN, LIMSI, AT&T, Sanders
  – Broad phone HMM, LVCSR, VQ, adapted GMM, SVM, world/cohorts, h-norm, anchor models
  – SWB1; 40 targets
  – Train: 2 min, 1 session, 1 handset / 2 handsets; Test: 3 s, 10 s, 30 s; separate target and impostor tests
  – Pooled DET, DCF
• 1996: LRE 1
  – 4–5 sites
  – PPRLM, GMM-CEP, syllabic models, fusion
  – CallFriend; 12 languages, 3 dialects
  – Test: 3 s, 10 s, 30 s
  – DET, DCF
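The DCF that displaced ROC-area scoring weights misses and false alarms by application-dependent costs and a target prior. A hedged sketch with the classic SRE parameter values (C_miss = 10, C_fa = 1, P_target = 0.01 — treat these as illustrative defaults, not a claim about any one year's evaluation plan):

```python
def dcf(p_miss: float, p_fa: float,
        c_miss: float = 10.0, c_fa: float = 1.0,
        p_target: float = 0.01) -> float:
    """Expected detection cost at one operating point (illustrative params)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

cost = dcf(p_miss=0.10, p_fa=0.01)
```

Minimizing this cost over the threshold sweep gives the minDCF figures quoted in the trend plots later in this deck.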
NIST SRE/LRE: Steady Progress
• 1997: SRE 3
  – 8 sites
  – Pitch features; handset/mic detector and compensation using more dev data
  – SWB2p1; all speakers act as targets and impostors (current paradigm)
  – Train: 2 min, 1 session, 1 handset / 2 handsets
  – Test: 3 s, 10 s, 30 s; no cross-sex trials; matched and mismatched test phone
  – DET, DCF
• 1998: SRE 4 (Avignon Workshop)
  – 12 sites
  – Phone sequences (BBN), sequence models (Dragon)
  – SWB2p2
  – Train: 2 min, 1 session; 2 sessions; all minus 2 sessions
  – Test: 3 s, 10 s, 30 s; SN/DN and handset-type side knowledge
  – Human performance at 3 s
• 1999: SRE 5
  – 13 sites
  – T-norm, system fusion
  – SWB2p3
  – Train: 2 min, 2 sessions
  – Test: varying durations (0–15, 15–30, 30–45, >45 s), different numbers
  – New tasks: 2-speaker test, speaker tracking
NIST SRE/LRE: New Directions
• 2000: SRE 6 (Odyssey Workshop)
  – 12 sites (first shark sighting)
  – SMS
  – SWB2p1/p2, AHUMADA
  – Train: 2 min, 1 session; Test: variable, 0–60 s
  – New tasks: 2-speaker train & test, N-speaker segmentation
• 2001: SRE 7
  – 13 sites
  – Per-frame SVM, fusion, text-constrained GMM, word & phone N-grams
  – SWB2p1/p2, AHUMADA, SWB2p4 (cell)
  – Extended-data task (SWB1)
• 2002: SRE 8 (JHU SuperSID Workshop)
  – 24 sites
  – Feature mapping, high-level features, MLP fusion
  – SWB2p5 (cell), SWB2p2/p3 (ext), FBI VoiceDB (multimodal), BNEWS (seg), Meeting (seg)
• 2003: SRE 9
  – 19 sites
  – SVM-GLDS, phone SVM, NERFs
  – SWB2p5 (cell), SWB2p2/p3 (ext)
• 2003: LRE 2
  – 6 sites
  – PPRLM, GMM-SDC, SVM-SDC, fusion
  – CallFriend; 12 languages
  – Test: 3 s, 10 s, 30 s
NIST SRE/LRE: Current Period
• 2004: SRE 10 (Odyssey Workshop)
  – 24 sites
  – Large system fusion
  – Mixer1; bilingual speakers
• 2005: SRE 11
  – 27 sites
  – LFA, SVM-MLLR
  – Mixer2, MMSR
  – Cross-channel microphones; calibrated LLRs
• 2005: LRE 3
  – 11 sites
  – GMM-MMI, TRAPS/NN decoder, phone lattices, PPR-BinTree, PPR-SVM
  – OHSU, Mixer1/2; 7 languages, 3 dialects/accents
• 2006: SRE 12 (Odyssey Workshop)
  – 36 sites
  – SVM-GSV, spectral-only systems
  – Mixer2+3, MMSR
  – Bilingual, cross-channel; multi-site collaboration
• 2007: LRE 4
  – 21 sites
  – SVM-GSV, higher-order n-grams, fLFA/fNAP
  – Mixer5, OHSU; 14 languages, 5 dialects/accents
  – Calibrated LLRs
• 2008: Odyssey WS, SRE 13, JHU WS
NIST SRE: How are we doing?
[Figure: DCF versus year, 1996–2006, for a sampling of tasks (28 conditions in SRE04) across corpora Swb1, Swb2p1–p5, Mixer1–3, and MMSR. Tasks plotted include: landline 1-speaker, 2-min train / 30-s test (also under the 40-target-speaker paradigm); cellular 1-speaker, 2-min train / 30-s test; landline and cellular/landline 2-speaker detection; Ahumada (Spanish); multimodal (FBI); cell/landline 1-speaker, 8-conversation train / 1-conversation test; cross-mic, 1-conversation train (tel) / 1-conversation test (mic); cross-language.]
SRE Performance Trends 2001–2007: Lincoln Systems
[Figure: EER (%) and minDCF×100 versus year, 2001–2006, for the 1conv4w–1conv4w and 8conv4w–1conv4w conditions on SWB1, SWB2, and Mixer2–3 data, annotated with each year's system advances: 2001 text-constrained GMM, word n-grams; 2002 SuperSID high-level features; 2003 feature mapping, SVM-GLDS; 2004 phone/word SVM, GMM-ATNORM; 2005 NAP, TC-SVM, word/phone lattices; 2006 SVM-GSV, GMM-LFA, multi-feature SVM-GLDS, SVM-MLLR+NAP.]
• Consistent and steady improvement for the data/task focus
• New data sets designed to be more challenging
• New features, classifiers, and compensations drive error rates down over time
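The EER figures tracked in these trend plots come from sweeping a single decision threshold over target and impostor scores until the miss and false-alarm rates meet. A toy sketch of that computation (the scores are made up for illustration; real SRE trial lists run to tens of thousands of scores):

```python
def eer(target_scores, impostor_scores):
    """Approximate equal error rate: operating point where P_miss == P_fa."""
    best = 1.0
    for t in sorted(set(target_scores) | set(impostor_scores)):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        best = min(best, max(p_miss, p_fa))  # closest approach to the EER point
    return best

low = eer([3.0, 4.0, 5.0], [0.0, 1.0, 2.0])   # fully separable scores
high = eer([1.0, 2.0, 3.0], [0.0, 2.5, 4.0])  # heavily overlapping scores
```

On a finite trial list the miss/false-alarm curves are step functions that rarely cross exactly, so taking the minimum of max(P_miss, P_fa) over thresholds is one common approximation.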
LRE Performance Trends 1996–2007: Lincoln Systems
[Figure: EER (%) on 30 s, 10 s, and 3 s tests for CallFriend (12 languages, 1996), OHSU (7 languages, 2005), and Mixer3 (14 languages, 2007).]
Year | Main LID technology
1996 | PPRLM
2003 | + GMM-SDC, SVM-SDC
2005 | + phone lattices, SVM with n-grams, binary trees
2007 | + TRAPS tokenizers, fLFA, fNAP, GMM-MMI, SVM-GSV, calibrated LLRs
Roadmap
• The odyssey from 1994 to 2008
• The scenic route through NIST speaker and language recognition evaluations
• The expedition into future territories
The Expedition: Evaluations
• The evaluation paradigm has clearly helped propel speaker and language R&D forward
  – Common focus
  – Comparable results and repeatable experiments
  – Collaboration
• But there are some issues to consider
  – Proliferation of tasks and conditions can dilute and fragment community effort
  – Evaluations are application-dependent
    · The tasks, conditions, and data are representative of some application(s); are these being set in a meaningful way?
  – Performance numbers need context
    · Time-pressed, less-technical potential users want a yes/no answer to "will it or won't it work for my application?"
  – Speaker and language recognition research increasingly relies on data-driven discovery
    · Does performance depend on highly matched dev data? Are performance gains due to technology or data?
The Expedition: Research
• Speaker and language research is built on three core areas
  – Speech science: understanding how speaker/language information is conveyed in the speech signal and how to robustly extract measures of this information
  – Pattern recognition: techniques and algorithms to effectively represent and compare salient patterns in data
  – Data-driven discovery: effectively using data to apply, refine, and improve systems built from the above
• Current speaker/language research is heavily weighted toward data-driven discovery
  – Cure or curse?
  – Are we discovering underlying problems to address in research, or just where we want more data?
MIT Lincoln Laboratory2
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory3
Odyssey 2008
The OdysseyMartigny Switzerland ndash April 5-7 1994
bull First workshop focused solely on speaker recognition
ndash Helped form working relationships among international SID community
bull 46 papers 6 tutorialskeynotesbull 65 attendeesbull Technologies TD-HMMs TI-GMMs MLP
LVQ RBF DTW LTAbull Corpora Home grown (digits words
phrases 10-30 speakers) YOHO TIMIT NTIMIT KING POLYPHONE SWB1
bull Very difficult to compare resultsndash Varying corpora experiment designs
measures of performancebull Large emphasis on text-dependent
applications (telcoms)ndash Some papers on forensic SV (human and
machines)
MIT Lincoln Laboratory4
Odyssey 2008
The OdysseyAvignon France ndash April 20-23 1998
bull Focus on forensic and commercial applications
bull 40 papers 5 keynotesbull 78 participantsbull Technologies More emphasis on statistical
approaches (HMM GMM AHS)bull Corpora Still diverse small set (less home-
grown) more TIMIT and SPIDRE (SWB)ndash Europeans showing lead in common
corporaexperiments (POLYCOST VERIVOX CAVE)
bull Increasing buzz about dot-com speechspeaker companies
bull Some lasting themes in talksndash Doddington getting to know the speakerndash Champod LRs as evidence in Baysian
framework
bull Some friction between automatic speaker recognition community and expert human speaker examiner community
ndash ASR crowd pressed for measured error rate
ndash Examiner crowd pressed for transparency and explanation in results
MIT Lincoln Laboratory5
Odyssey 2008
The OdysseyCrete Greece ndash June 18-22 2001
bull Start of official ldquoOdysseyrdquo workshop series
ndash Originally set for Tel-Aviv Israelbull 40 papers 3 keynotesbull 75 participantsbull Technologies More papers on new
tasks (biometrics diarization) addressing practical issues (robustness channel compensation threshold setting)
bull Corpora Many more papers using SRE corpora and experiment design
bull Bayesian framework taking hold for forensic applications
bull Several speaker verification companies (PerSay Nuance VoiceVault)
MIT Lincoln Laboratory6
Odyssey 2008
The OdysseyToledo Spain ndash May 31- June 3 2004
bull Co-occurrence with NIST SRE 2004 workshop
bull 61 paper 4 keynotesbull 147 participantsbull Technologies GMM SVM NAP LFA
high-level features adaptation audio-video LID
bull Corpora SRE corporaprotocol dominant for TI-Telephone RT BNEWS data for diarization TNONFI field forensic corpus
bull Text-dependent work focusing more on user phrases (less digit strings)
MIT Lincoln Laboratory7
Odyssey 2008
The OdysseySan Juan Puerto Rico ndash June 28-30 2006
bull Co-occurrence with NIST SRE 2006 workshop
ndash Followed LRE 2005 in December bull 60 papers 1 keynotebull 103 participantsbull Technologies GMM-SVM NAPLFA
GMM-MMI high-level features robustness
bull Corpora Dominated by SRE and LRE corporaprotocol
MIT Lincoln Laboratory8
Odyssey 2008
The OdysseyStellenbosch South Africa ndash January 21-24 2008
bull Expect to see continued trends inndash Common corporaevaluationsndash High-quality papers and novel topics
bull More fish pictures hellip
MIT Lincoln Laboratory9
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory10
Odyssey 2008
NIST SpeakerLanguage Recognition Evaluations
bull Recurring NIST evaluations of speakerlanguage recognition technology
bull Aim Provide a common paradigm for comparing technologies
bull Focus Conversational telephone speech (text-independent)
Evaluation Coordinator
Linguistic Data Consortium
Data Provider Comparison of technologies on common task
Evaluate
Improve
Technology ConsumersApplication domain parameters
Technology Developers
MIT Lincoln Laboratory11
Odyssey 2008
NIST SRELREPre-history
1992 1993
Rutgers Summer Workshop
1994Informal LRE bull 4 sites OGI MITLL MIT ITTbull OGI 12 lang corpus
MartignyWorkshop
DARPA SID evalbull 3 sites Dragon (LVCSR) ITT
(NN) MITLL (GMM)bull Early SWB1bull 1-4 conv trainbull 24 tgtsbull 111 tgt-test 466 imp-testbull 10-60s testbull Speaker dependent ROCbull Intro of Swets Normal-Deviate
plot (DET)bull Areas Under ROC PdPf=10bull Sun Sparc 1020 MB
MIT Lincoln Laboratory12
Odyssey 2008
NIST SRELREFormal Start
1995 1996SRE 1 bull 6 sites BBN (uniGauss)
Dragon (LVCSR) Ensigma(ergodic HMM) INRS (phone HMMs) ITT (NN) MITLL (GMM o64 cohort)
bull SWB1bull 26 tgtsbull Train 10s 30s 4 sessbull Test 5s 10s 30s Separate
tgt and imp testsbull Area Under ROC
PdPf=3105 closed set error
bull SameDiff phone effectbull Speaker dependent ROC
SRE 2 bull 11 sites Ensigma ITT
MITLL SRI CAIP INRS Dragon BBN LIMSI ATampT Sanders
bull Broad phone HMM LVCSR VQ adapted GMM SVM worldcohorts hnorm anchor models
bull SWB1bull 40 tgtsbull Train 2 min - 1 sess 1
handset 2 handset bull Test 3s 10s 30s Separate
tgt and imp testsbull Pooled DET DCF
LRE 1 bull 4-5 sitesbull PPRLM GMM-CEP
Syllabic models fusion
bull Callfriendbull 12 languages 3
dialectsbull Test 3s 10s 30sbull DET DCF
MIT Lincoln Laboratory13
Odyssey 2008
NIST SRELRESteady Progress
Avignon Workshop
1997 1998SRE 3 bull 8 sites bull Pitch features handset mic
detectorcomp using more dev data
bull SWB2p1bull All speakers act as tgts
and imposters (current paradigm)
bull Train 2 min - 1 sess 1 handset 2 handset
bull Test 3s 10s 30sbull No cross-sex trials
matched and mismatched test phone
bull DET DCF
SRE 4bull 12 sitesbull Phone sequences (BBN)
sequence models (Dragon)
bull SWB2p2bull Train 2 min - 1 sess 2
sess all - 2 sessbull Test 3s 10s 30sbull SNDN and HS type side
knowledgebull Human performance
3s
1999SRE 5bull 13 sitesbull T-norm system fusionbull SWB2p3bull Train 2 min - 2 sessbull Test varying duration (0-
15 15-30 30-45gt45) diff number
bull New tasks 2-spkr test speaker tracking
MIT Lincoln Laboratory14
Odyssey 2008
NIST SRELRENew Directions
Odyssey Workshop
JHU SuperSIDWorkshop
2000 2001SRE 6bull 12 sites (First shark
sighting) bull SMS bull SWB2p1p2
AHUMADAbull Train 2 min - 1 sessbull Test variable 0-60 bull New tasks 2-spkr
train amp test N-speaker segmentation
20032002SRE 7bull 13 sitesbull Per-frame SVM
Fusion text-constrained GMM word amp phone N-gram
bull SWB2p1p2 AHUMADA SWB2p4 (cell)
bull Extended data task (SWB1)
SRE 8bull 24 sitesbull Feat map
high-level features mlpfusion
bull SWB2p5 (cell) SWB2p2p3 (ext) FBIVoiceDB(Multi Modal) BNEWS (seg) Meeting (seg)
SRE 9bull 19 sitesbull SVM GLDS phone
svm nerfs)bull SWB2p5 (cell)
SWB2p2p3 (ext)
LRE 2bull 6 sitesbull PPRLM GMM-SDC
SVM-SDC fusionbull Callfriendbull 12 languagesbull Test 3s 10s 30s
MIT Lincoln Laboratory15
Odyssey 2008
NIST SRELRECurrent Period
Odyssey Workshop
Odyssey Workshop
2004 2005 20072006LRE 4bull 21 sitesbull SVM-GSV ho
ngrams fLFAfNAPbull Mixer5 OHSUbull 14 languages 5
dialectsaccentsbull Calibrated LLRs
LRE 3bull 11 sitesbull GMM-MMI TRAPSNN-
decoder phone lattice
2008
Odyssey WS SRE 13 JHU WS
SRE 10bull 24 sitesbull Large system
fusionbull Mixer1bull Bilingual
speakers
SRE 11bull 27 sitesbull LFA SVM-MLLRbull Mixer2 MMSRbull Cross-channel
microphonesbull Calibrated LLRs
SRE 12bull 36 sitesbull SVM-GSV
spectral-only systems
bull Mixer2+3 MMSRbull Bilingual cross-
channelbull Multi-site
collaboration
PPR-BinTree PPR-SVMbull OHSU Mixer12bull 7 languages 3
dialectsaccents
MIT Lincoln Laboratory16
Odyssey 2008
NIST SREHow are we doing
0
001002
003004
005
006007
008
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
Year
DC
F
Landline 1sp 2 min train 30 sec test
Cellular 1sp 2 min train 30 sec test
Landline 2-speaker detection
Ahumada(Spanish)
Multimodal (FBI)Landline 1sp (40
target speaker paradigm)
Cellularland 2-speaker detection
CellLand 1sp 8-conv train 1-conv test
Cross-mic1-conv train (tel) 1-conv test (mic)
Swb1 Swb2p1 Swb2p3 Swb2p4 Swb2p5 Mixer1 Mixer3
bull Sampling of tasks shown 28 in SRE04
Cross-language
Swb2p2 Mixer2 MMSR
MIT Lincoln Laboratory17
Odyssey 2008
0123456789
10
0
1
2
3
4
2001 2002 2003 2004 2005 2006
SRE Performance Trends 2001-2007Lincoln Systems
bull Consistent and steady improvement for datatask focus
EER ()1conv4w1conv4w
8conv4w1conv4w
minDCFx100
2001 2002 2003 2004 2005 2006SWB1 SWB2 MIXER2-3
bull New data sets designed to be more challenging
bull New features classifiers and compensations drive error rates down over time
SVM-GSV GMM-LFA MultiFeatSVM-GLDS SVM-MLLR+NAP
2006
NAP TC-SVM wordphone lattices2005
PhoneWord-SVM GMM-ATNORM2004
Feature Mapping SVM-GLDS2003
SuperSID High-level features2002
Text-const GMM word-ngram2001
MIT Lincoln Laboratory18
Odyssey 2008
0
10
20
30
40
1996 2003 2005 2005 2007 2007
EER
()
30s 10s 3s
CallFriend(12-lang)
OHSU(7-lang)
Mixer3(14-lang)
113
32 421014
LRE Performance Trends 1996-2007Lincoln Systems
19
Year Main LID Technology
1996 PPRLM2003 + GMM-SDC SVM-SDC2005 + Phone lattices SVM w ngrams
Binary Trees2007 + TRAPS tokenizers fLFA fNAP
GMM-MMI SVM-GSV calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The ExpeditionEvaluations
bull The evaluation paradigm has clearly helped propel speaker and language RampD forward
ndash Common focus ndash Comparable results and repeatable experimentsndash Collaboration
bull But there are some issues to considerndash Proliferation of tasks and conditions can dilute and fragment
community effortndash Evaluations are application-dependent
The tasks conditions and data are representative of some application(s)
Are these being set in a meaningful wayndash Performance numbers need context
Time-pressed less-technical potential users want yesno to ldquowill it or wonrsquot it work for my applicationrdquo
ndash Speaker and language recognition research increasingly relies on data driven discovery
Does performance depend on highly matched dev data Are performance gains due to technology or data
MIT Lincoln Laboratory21
Odyssey 2008
The ExpeditionResearch
bull Speaker and language research are built on three core areas
ndash Speech Science Understanding how speakerlanguage information is conveyed in the speech signal and how to robustly extract measures of this information
ndash Pattern Recognition Techniques and algorithms to effectively represent and compare salient patterns in data
ndash Data Driven Discovery Effectively using data to apply refine and improve systems built from above
bull Current speakerlanguage research is heavily weighted toward data driven discovery
ndash Cure or cursendash Are we discovering underlying problems to address in
research or just where we want more data
MIT Lincoln Laboratory3
Odyssey 2008
The OdysseyMartigny Switzerland ndash April 5-7 1994
bull First workshop focused solely on speaker recognition
ndash Helped form working relationships among international SID community
bull 46 papers 6 tutorialskeynotesbull 65 attendeesbull Technologies TD-HMMs TI-GMMs MLP
LVQ RBF DTW LTAbull Corpora Home grown (digits words
phrases 10-30 speakers) YOHO TIMIT NTIMIT KING POLYPHONE SWB1
bull Very difficult to compare resultsndash Varying corpora experiment designs
measures of performancebull Large emphasis on text-dependent
applications (telcoms)ndash Some papers on forensic SV (human and
machines)
MIT Lincoln Laboratory4
Odyssey 2008
The OdysseyAvignon France ndash April 20-23 1998
bull Focus on forensic and commercial applications
bull 40 papers 5 keynotesbull 78 participantsbull Technologies More emphasis on statistical
approaches (HMM GMM AHS)bull Corpora Still diverse small set (less home-
grown) more TIMIT and SPIDRE (SWB)ndash Europeans showing lead in common
corporaexperiments (POLYCOST VERIVOX CAVE)
bull Increasing buzz about dot-com speechspeaker companies
bull Some lasting themes in talksndash Doddington getting to know the speakerndash Champod LRs as evidence in Baysian
framework
bull Some friction between automatic speaker recognition community and expert human speaker examiner community
ndash ASR crowd pressed for measured error rate
ndash Examiner crowd pressed for transparency and explanation in results
MIT Lincoln Laboratory5
Odyssey 2008
The OdysseyCrete Greece ndash June 18-22 2001
bull Start of official ldquoOdysseyrdquo workshop series
ndash Originally set for Tel-Aviv Israelbull 40 papers 3 keynotesbull 75 participantsbull Technologies More papers on new
tasks (biometrics diarization) addressing practical issues (robustness channel compensation threshold setting)
bull Corpora Many more papers using SRE corpora and experiment design
bull Bayesian framework taking hold for forensic applications
bull Several speaker verification companies (PerSay Nuance VoiceVault)
MIT Lincoln Laboratory6
Odyssey 2008
The OdysseyToledo Spain ndash May 31- June 3 2004
bull Co-occurrence with NIST SRE 2004 workshop
bull 61 paper 4 keynotesbull 147 participantsbull Technologies GMM SVM NAP LFA
high-level features adaptation audio-video LID
bull Corpora SRE corporaprotocol dominant for TI-Telephone RT BNEWS data for diarization TNONFI field forensic corpus
bull Text-dependent work focusing more on user phrases (less digit strings)
MIT Lincoln Laboratory7
Odyssey 2008
The OdysseySan Juan Puerto Rico ndash June 28-30 2006
bull Co-occurrence with NIST SRE 2006 workshop
ndash Followed LRE 2005 in December bull 60 papers 1 keynotebull 103 participantsbull Technologies GMM-SVM NAPLFA
GMM-MMI high-level features robustness
bull Corpora Dominated by SRE and LRE corporaprotocol
MIT Lincoln Laboratory8
Odyssey 2008
The OdysseyStellenbosch South Africa ndash January 21-24 2008
bull Expect to see continued trends inndash Common corporaevaluationsndash High-quality papers and novel topics
bull More fish pictures hellip
MIT Lincoln Laboratory9
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory10
Odyssey 2008
NIST SpeakerLanguage Recognition Evaluations
bull Recurring NIST evaluations of speakerlanguage recognition technology
bull Aim Provide a common paradigm for comparing technologies
bull Focus Conversational telephone speech (text-independent)
Evaluation Coordinator
Linguistic Data Consortium
Data Provider Comparison of technologies on common task
Evaluate
Improve
Technology ConsumersApplication domain parameters
Technology Developers
MIT Lincoln Laboratory11
Odyssey 2008
NIST SRELREPre-history
1992 1993
Rutgers Summer Workshop
1994Informal LRE bull 4 sites OGI MITLL MIT ITTbull OGI 12 lang corpus
MartignyWorkshop
DARPA SID evalbull 3 sites Dragon (LVCSR) ITT
(NN) MITLL (GMM)bull Early SWB1bull 1-4 conv trainbull 24 tgtsbull 111 tgt-test 466 imp-testbull 10-60s testbull Speaker dependent ROCbull Intro of Swets Normal-Deviate
plot (DET)bull Areas Under ROC PdPf=10bull Sun Sparc 1020 MB
MIT Lincoln Laboratory12
Odyssey 2008
NIST SRELREFormal Start
1995 1996SRE 1 bull 6 sites BBN (uniGauss)
Dragon (LVCSR) Ensigma(ergodic HMM) INRS (phone HMMs) ITT (NN) MITLL (GMM o64 cohort)
bull SWB1bull 26 tgtsbull Train 10s 30s 4 sessbull Test 5s 10s 30s Separate
tgt and imp testsbull Area Under ROC
PdPf=3105 closed set error
bull SameDiff phone effectbull Speaker dependent ROC
SRE 2 bull 11 sites Ensigma ITT
MITLL SRI CAIP INRS Dragon BBN LIMSI ATampT Sanders
bull Broad phone HMM LVCSR VQ adapted GMM SVM worldcohorts hnorm anchor models
bull SWB1bull 40 tgtsbull Train 2 min - 1 sess 1
handset 2 handset bull Test 3s 10s 30s Separate
tgt and imp testsbull Pooled DET DCF
LRE 1 bull 4-5 sitesbull PPRLM GMM-CEP
Syllabic models fusion
bull Callfriendbull 12 languages 3
dialectsbull Test 3s 10s 30sbull DET DCF
MIT Lincoln Laboratory13
Odyssey 2008
NIST SRELRESteady Progress
Avignon Workshop
1997 1998SRE 3 bull 8 sites bull Pitch features handset mic
detectorcomp using more dev data
bull SWB2p1bull All speakers act as tgts
and imposters (current paradigm)
bull Train 2 min - 1 sess 1 handset 2 handset
bull Test 3s 10s 30sbull No cross-sex trials
matched and mismatched test phone
bull DET DCF
SRE 4bull 12 sitesbull Phone sequences (BBN)
sequence models (Dragon)
bull SWB2p2bull Train 2 min - 1 sess 2
sess all - 2 sessbull Test 3s 10s 30sbull SNDN and HS type side
knowledgebull Human performance
3s
1999SRE 5bull 13 sitesbull T-norm system fusionbull SWB2p3bull Train 2 min - 2 sessbull Test varying duration (0-
15 15-30 30-45gt45) diff number
bull New tasks 2-spkr test speaker tracking
MIT Lincoln Laboratory14
Odyssey 2008
NIST SRELRENew Directions
Odyssey Workshop
JHU SuperSIDWorkshop
2000 2001SRE 6bull 12 sites (First shark
sighting) bull SMS bull SWB2p1p2
AHUMADAbull Train 2 min - 1 sessbull Test variable 0-60 bull New tasks 2-spkr
train amp test N-speaker segmentation
20032002SRE 7bull 13 sitesbull Per-frame SVM
Fusion text-constrained GMM word amp phone N-gram
bull SWB2p1p2 AHUMADA SWB2p4 (cell)
bull Extended data task (SWB1)
SRE 8bull 24 sitesbull Feat map
high-level features mlpfusion
bull SWB2p5 (cell) SWB2p2p3 (ext) FBIVoiceDB(Multi Modal) BNEWS (seg) Meeting (seg)
SRE 9bull 19 sitesbull SVM GLDS phone
svm nerfs)bull SWB2p5 (cell)
SWB2p2p3 (ext)
LRE 2bull 6 sitesbull PPRLM GMM-SDC
SVM-SDC fusionbull Callfriendbull 12 languagesbull Test 3s 10s 30s
MIT Lincoln Laboratory15
Odyssey 2008
NIST SRELRECurrent Period
Odyssey Workshop
Odyssey Workshop
2004 2005 20072006LRE 4bull 21 sitesbull SVM-GSV ho
ngrams fLFAfNAPbull Mixer5 OHSUbull 14 languages 5
dialectsaccentsbull Calibrated LLRs
LRE 3bull 11 sitesbull GMM-MMI TRAPSNN-
decoder phone lattice
2008
Odyssey WS SRE 13 JHU WS
SRE 10bull 24 sitesbull Large system
fusionbull Mixer1bull Bilingual
speakers
SRE 11bull 27 sitesbull LFA SVM-MLLRbull Mixer2 MMSRbull Cross-channel
microphonesbull Calibrated LLRs
SRE 12bull 36 sitesbull SVM-GSV
spectral-only systems
bull Mixer2+3 MMSRbull Bilingual cross-
channelbull Multi-site
collaboration
PPR-BinTree PPR-SVMbull OHSU Mixer12bull 7 languages 3
dialectsaccents
MIT Lincoln Laboratory16
Odyssey 2008
NIST SREHow are we doing
0
001002
003004
005
006007
008
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
Year
DC
F
Landline 1sp 2 min train 30 sec test
Cellular 1sp 2 min train 30 sec test
Landline 2-speaker detection
Ahumada(Spanish)
Multimodal (FBI)Landline 1sp (40
target speaker paradigm)
Cellularland 2-speaker detection
CellLand 1sp 8-conv train 1-conv test
Cross-mic1-conv train (tel) 1-conv test (mic)
Swb1 Swb2p1 Swb2p3 Swb2p4 Swb2p5 Mixer1 Mixer3
bull Sampling of tasks shown 28 in SRE04
Cross-language
Swb2p2 Mixer2 MMSR
MIT Lincoln Laboratory17
Odyssey 2008
0123456789
10
0
1
2
3
4
2001 2002 2003 2004 2005 2006
SRE Performance Trends 2001-2007Lincoln Systems
bull Consistent and steady improvement for datatask focus
EER ()1conv4w1conv4w
8conv4w1conv4w
minDCFx100
2001 2002 2003 2004 2005 2006SWB1 SWB2 MIXER2-3
bull New data sets designed to be more challenging
bull New features classifiers and compensations drive error rates down over time
SVM-GSV GMM-LFA MultiFeatSVM-GLDS SVM-MLLR+NAP
2006
NAP TC-SVM wordphone lattices2005
PhoneWord-SVM GMM-ATNORM2004
Feature Mapping SVM-GLDS2003
SuperSID High-level features2002
Text-const GMM word-ngram2001
MIT Lincoln Laboratory18
Odyssey 2008
0
10
20
30
40
1996 2003 2005 2005 2007 2007
EER
()
30s 10s 3s
CallFriend(12-lang)
OHSU(7-lang)
Mixer3(14-lang)
113
32 421014
LRE Performance Trends 1996-2007Lincoln Systems
19
Year Main LID Technology
1996 PPRLM2003 + GMM-SDC SVM-SDC2005 + Phone lattices SVM w ngrams
Binary Trees2007 + TRAPS tokenizers fLFA fNAP
GMM-MMI SVM-GSV calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The Expedition: Evaluations
• The evaluation paradigm has clearly helped propel speaker and language R&D forward
  – Common focus
  – Comparable results and repeatable experiments
  – Collaboration
• But there are some issues to consider
  – Proliferation of tasks and conditions can dilute and fragment community effort
  – Evaluations are application-dependent
    The tasks, conditions, and data are representative of some application(s)
    Are these being set in a meaningful way?
  – Performance numbers need context
    Time-pressed, less-technical potential users want a yes/no answer to "will it or won't it work for my application?"
  – Speaker and language recognition research increasingly relies on data-driven discovery
    Does performance depend on highly matched dev data? Are performance gains due to technology or data?
The Expedition: Research
• Speaker and language research are built on three core areas
  – Speech science: understanding how speaker/language information is conveyed in the speech signal, and how to robustly extract measures of this information
  – Pattern recognition: techniques and algorithms to effectively represent and compare salient patterns in data
  – Data-driven discovery: effectively using data to apply, refine, and improve systems built from the above
• Current speaker/language research is heavily weighted toward data-driven discovery
  – Cure or curse?
  – Are we discovering underlying problems to address in research, or just where we want more data?
Swb1 Swb2p1 Swb2p3 Swb2p4 Swb2p5 Mixer1 Mixer3
bull Sampling of tasks shown 28 in SRE04
Cross-language
Swb2p2 Mixer2 MMSR
MIT Lincoln Laboratory17
Odyssey 2008
0123456789
10
0
1
2
3
4
2001 2002 2003 2004 2005 2006
SRE Performance Trends 2001-2007Lincoln Systems
bull Consistent and steady improvement for datatask focus
EER ()1conv4w1conv4w
8conv4w1conv4w
minDCFx100
2001 2002 2003 2004 2005 2006SWB1 SWB2 MIXER2-3
bull New data sets designed to be more challenging
bull New features classifiers and compensations drive error rates down over time
SVM-GSV GMM-LFA MultiFeatSVM-GLDS SVM-MLLR+NAP
2006
NAP TC-SVM wordphone lattices2005
PhoneWord-SVM GMM-ATNORM2004
Feature Mapping SVM-GLDS2003
SuperSID High-level features2002
Text-const GMM word-ngram2001
MIT Lincoln Laboratory18
Odyssey 2008
0
10
20
30
40
1996 2003 2005 2005 2007 2007
EER
()
30s 10s 3s
CallFriend(12-lang)
OHSU(7-lang)
Mixer3(14-lang)
113
32 421014
LRE Performance Trends 1996-2007Lincoln Systems
19
Year Main LID Technology
1996 PPRLM2003 + GMM-SDC SVM-SDC2005 + Phone lattices SVM w ngrams
Binary Trees2007 + TRAPS tokenizers fLFA fNAP
GMM-MMI SVM-GSV calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The ExpeditionEvaluations
bull The evaluation paradigm has clearly helped propel speaker and language RampD forward
ndash Common focus ndash Comparable results and repeatable experimentsndash Collaboration
bull But there are some issues to considerndash Proliferation of tasks and conditions can dilute and fragment
community effortndash Evaluations are application-dependent
The tasks conditions and data are representative of some application(s)
Are these being set in a meaningful wayndash Performance numbers need context
Time-pressed less-technical potential users want yesno to ldquowill it or wonrsquot it work for my applicationrdquo
ndash Speaker and language recognition research increasingly relies on data driven discovery
Does performance depend on highly matched dev data Are performance gains due to technology or data
MIT Lincoln Laboratory21
Odyssey 2008
The ExpeditionResearch
bull Speaker and language research are built on three core areas
ndash Speech Science Understanding how speakerlanguage information is conveyed in the speech signal and how to robustly extract measures of this information
ndash Pattern Recognition Techniques and algorithms to effectively represent and compare salient patterns in data
ndash Data Driven Discovery Effectively using data to apply refine and improve systems built from above
bull Current speakerlanguage research is heavily weighted toward data driven discovery
ndash Cure or cursendash Are we discovering underlying problems to address in
research or just where we want more data
MIT Lincoln Laboratory 8
Odyssey 2008
The Odyssey: Stellenbosch, South Africa – January 21–24, 2008
• Expect to see continued trends in:
  – Common corpora/evaluations
  – High-quality papers and novel topics
• More fish pictures …
MIT Lincoln Laboratory 9
Odyssey 2008
Roadmap
• The odyssey from 1994 to 2008
• The scenic route through NIST speaker and language recognition evaluations
• The expedition into future territories
MIT Lincoln Laboratory 10
Odyssey 2008
NIST Speaker/Language Recognition Evaluations
• Recurring NIST evaluations of speaker/language recognition technology
• Aim: provide a common paradigm for comparing technologies
• Focus: conversational telephone speech (text-independent)
[Diagram: NIST (evaluation coordinator) and the Linguistic Data Consortium (data provider) support an evaluate/improve cycle between technology developers and technology consumers; technologies are compared on a common task with application-domain parameters.]
MIT Lincoln Laboratory 11
Odyssey 2008
NIST SRE/LRE: Pre-history
1992–1993
• Rutgers Summer Workshop
• DARPA SID eval
  – 3 sites: Dragon (LVCSR), ITT (NN), MITLL (GMM)
  – Early SWB1; 1–4 conv train; 24 tgts; 111 tgt-test, 466 imp-test; 10–60 s test
  – Speaker-dependent ROC; intro of Swets normal-deviate plot (DET)
  – Areas under ROC; Pd@Pf=10; Sun Sparc 1020 MB
1994
• Informal LRE
  – 4 sites: OGI, MITLL, MIT, ITT
  – OGI 12-lang corpus
• Martigny Workshop
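The Swets normal-deviate plot introduced in this eval is the DET plot that later NIST evaluations standardized on: the miss and false-alarm rates are both warped through the inverse standard-normal CDF, so a system whose score distributions are roughly Gaussian traces a near-straight line. A minimal sketch of that axis transform, assuming only the standard probit mapping (the function name and example rates are mine, not from the slide):

```python
from statistics import NormalDist

def det_coords(p_miss, p_fa):
    """Map a (miss, false-alarm) operating point onto DET-plot axes.

    A DET plot is a ROC drawn on normal-deviate (probit) scales:
    each error probability p is plotted at the standard-normal
    quantile inv_cdf(p) instead of at p itself.
    """
    nd = NormalDist()  # standard normal, mean 0, sigma 1
    return nd.inv_cdf(p_miss), nd.inv_cdf(p_fa)

# An equal-error operating point lands on the diagonal of the plot.
x, y = det_coords(0.10, 0.10)
```

Sweeping a threshold over a system's scores and plotting these coordinates for each operating point yields the familiar DET curve.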
MIT Lincoln Laboratory 12
Odyssey 2008
NIST SRE/LRE: Formal Start
1995–1996
SRE 1
• 6 sites: BBN (uni-Gauss), Dragon (LVCSR), Ensigma (ergodic HMM), INRS (phone HMMs), ITT (NN), MITLL (GMM o64 cohort)
• SWB1; 26 tgts
• Train: 10s, 30s; 4 sess
• Test: 5s, 10s, 30s; separate tgt and imp tests
• Area under ROC; Pd@Pf=3105; closed-set error
• Same/Diff phone effect
• Speaker-dependent ROC
SRE 2
• 11 sites: Ensigma, ITT, MITLL, SRI, CAIP, INRS, Dragon, BBN, LIMSI, AT&T, Sanders
• Broad phone HMM, LVCSR, VQ, adapted GMM, SVM, world/cohorts, hnorm, anchor models
• SWB1; 40 tgts
• Train: 2 min – 1 sess; 1 handset / 2 handset
• Test: 3s, 10s, 30s; separate tgt and imp tests
• Pooled DET; DCF
LRE 1
• 4–5 sites
• PPRLM, GMM-CEP, syllabic models, fusion
• CallFriend; 12 languages, 3 dialects
• Test: 3s, 10s, 30s
• DET; DCF
MIT Lincoln Laboratory 13
Odyssey 2008
NIST SRE/LRE: Steady Progress
1997–1999 (Avignon Workshop, 1998)
SRE 3
• 8 sites
• Pitch features; handset mic detector/comp using more dev data
• SWB2p1
• All speakers act as tgts and imposters (current paradigm)
• Train: 2 min – 1 sess; 1 handset / 2 handset
• Test: 3s, 10s, 30s; no cross-sex trials; matched and mismatched test phone
• DET; DCF
SRE 4
• 12 sites
• Phone sequences (BBN); sequence models (Dragon)
• SWB2p2
• Train: 2 min – 1 sess, 2 sess, all – 2 sess
• Test: 3s, 10s, 30s
• SN/DN and HS type side knowledge
• Human performance, 3s
SRE 5 (1999)
• 13 sites
• T-norm; system fusion
• SWB2p3
• Train: 2 min – 2 sess
• Test: varying duration (0–15, 15–30, 30–45, >45); diff number
• New tasks: 2-spkr test, speaker tracking
MIT Lincoln Laboratory 14
Odyssey 2008
NIST SRE/LRE: New Directions
2000–2003 (Odyssey Workshop; JHU SuperSID Workshop)
SRE 6
• 12 sites (first shark sighting)
• SMS
• SWB2p1/p2, AHUMADA
• Train: 2 min – 1 sess
• Test: variable 0–60
• New tasks: 2-spkr train & test, N-speaker segmentation
SRE 7
• 13 sites
• Per-frame SVM; fusion; text-constrained GMM; word & phone N-gram
• SWB2p1/p2, AHUMADA, SWB2p4 (cell)
• Extended data task (SWB1)
SRE 8
• 24 sites
• Feat map; high-level features; MLP fusion
• SWB2p5 (cell), SWB2p2/p3 (ext), FBI VoiceDB (multimodal), BNEWS (seg), Meeting (seg)
SRE 9
• 19 sites
• SVM GLDS; phone SVM; NERFs
• SWB2p5 (cell), SWB2p2/p3 (ext)
LRE 2
• 6 sites
• PPRLM, GMM-SDC, SVM-SDC, fusion
• CallFriend; 12 languages
• Test: 3s, 10s, 30s
MIT Lincoln Laboratory 15
Odyssey 2008
NIST SRE/LRE: Current Period
2004–2008 (Odyssey Workshops; 2008: Odyssey WS, SRE, JHU WS)
SRE 10
• 24 sites
• Large system fusion
• Mixer1
• Bilingual speakers
SRE 11
• 27 sites
• LFA; SVM-MLLR
• Mixer2, MMSR
• Cross-channel microphones
• Calibrated LLRs
SRE 12
• 36 sites
• SVM-GSV; spectral-only systems
• Mixer2+3, MMSR
• Bilingual, cross-channel
• Multi-site collaboration
LRE 3
• 11 sites
• GMM-MMI; TRAPS/NN-decoder; phone lattice; PPR-BinTree; PPR-SVM
• OHSU, Mixer1/2
• 7 languages, 3 dialects/accents
LRE 4
• 21 sites
• SVM-GSV; ho ngrams; fLFA/fNAP
• Mixer5, OHSU
• 14 languages, 5 dialects/accents
• Calibrated LLRs
MIT Lincoln Laboratory 16
Odyssey 2008
NIST SRE: How are we doing?
[Chart: DCF (0–0.08) vs. year (1996–2006) for a sampling of SRE tasks: landline 1sp, 2 min train / 30 sec test; cellular 1sp, 2 min train / 30 sec test; landline 2-speaker detection; Ahumada (Spanish); multimodal (FBI); landline 1sp (40-target-speaker paradigm); cellular/land 2-speaker detection; cell/land 1sp, 8-conv train / 1-conv test; cross-mic, 1-conv train (tel) / 1-conv test (mic); cross-language. Corpora: Swb1, Swb2p1, Swb2p2, Swb2p3, Swb2p4, Swb2p5, Mixer1, Mixer2, Mixer3, MMSR.]
• Sampling of tasks shown; 28 in SRE04
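The DCF on the vertical axis is NIST's detection cost function, a weighted sum of the miss and false-alarm rates under an assumed target prior; for the core speaker-detection task of this period the parameters were Cmiss = 10, CFA = 1, Ptarget = 0.01. A one-function sketch (the function name and example error rates are illustrative):

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """NIST detection cost function (DCF).

    Weighted sum of the two error rates; with these defaults a
    false alarm is cheap individually, but non-targets dominate
    (99% prior), so both error types carry real weight.
    """
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

# e.g. 8% misses and 2% false alarms:
cost = detection_cost(0.08, 0.02)  # 10*0.08*0.01 + 1*0.02*0.99 = 0.0278
```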
MIT Lincoln Laboratory 17
Odyssey 2008
SRE Performance Trends 2001–2007 (Lincoln systems)
[Chart: EER (%) and minDCF×100 vs. year (2001–2006) for the 1conv4w–1conv4w and 8conv4w–1conv4w conditions; data sets SWB1, SWB2, MIXER2-3.]
• Consistent and steady improvement for data/task focus
• New data sets designed to be more challenging
• New features, classifiers, and compensations drive error rates down over time:
  – 2001: text-const GMM, word n-gram
  – 2002: SuperSID, high-level features
  – 2003: feature mapping, SVM-GLDS
  – 2004: phone/word-SVM, GMM-ATNORM
  – 2005: NAP, TC-SVM, word/phone lattices
  – 2006: SVM-GSV, GMM-LFA, MultiFeat SVM-GLDS, SVM-MLLR+NAP
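The EER and minDCF summaries used on these axes both come from sweeping a decision threshold over the trial scores: EER is the rate at which misses and false alarms balance, and minDCF is the lowest detection cost any threshold achieves. An illustrative O(n²) sketch (the function name and toy scores are mine; the DCF weights are the NIST speaker-detection parameters of this period):

```python
def eer_and_min_dcf(target_scores, nontarget_scores,
                    c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Sweep a threshold over pooled scores; return (EER, minDCF).

    At each candidate threshold t, trials scoring >= t are accepted:
    targets below t are misses, non-targets at or above t are false
    alarms. EER is taken where the two rates are closest.
    """
    best_gap, eer, min_dcf = float("inf"), None, float("inf")
    for t in sorted(set(target_scores) | set(nontarget_scores)):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        min_dcf = min(min_dcf,
                      c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target))
        if abs(p_miss - p_fa) < best_gap:
            best_gap, eer = abs(p_miss - p_fa), (p_miss + p_fa) / 2
    return eer, min_dcf

eer, mdcf = eer_and_min_dcf([2.0, 1.5, 0.8], [0.5, 0.9, -1.0])
```

Production scoring tools sort the scores once and walk them in a single pass, but the exhaustive sweep above shows the definitions directly.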
MIT Lincoln Laboratory 18
Odyssey 2008
LRE Performance Trends 1996–2007 (Lincoln systems)
[Chart: EER (%) (0–40) at 30s, 10s, and 3s test durations across the 1996–2007 evals; corpora CallFriend (12-lang), OHSU (7-lang), Mixer3 (14-lang).]
Year / main LID technology:
• 1996: PPRLM
• 2003: + GMM-SDC, SVM-SDC
• 2005: + phone lattices, SVM w/ n-grams, binary trees
• 2007: + TRAPS tokenizers, fLFA, fNAP, GMM-MMI, SVM-GSV, calibrated LLRs
MIT Lincoln Laboratory 19
Odyssey 2008
Roadmap
• The odyssey from 1994 to 2008
• The scenic route through NIST speaker and language recognition evaluations
• The expedition into future territories
MIT Lincoln Laboratory 20
Odyssey 2008
The Expedition: Evaluations
• The evaluation paradigm has clearly helped propel speaker and language R&D forward
  – Common focus
  – Comparable results and repeatable experiments
  – Collaboration
• But there are some issues to consider
  – Proliferation of tasks and conditions can dilute and fragment community effort
  – Evaluations are application-dependent: the tasks, conditions, and data are representative of some application(s). Are these being set in a meaningful way?
  – Performance numbers need context: time-pressed, less-technical potential users want a yes/no answer to "will it or won't it work for my application?"
  – Speaker and language recognition research increasingly relies on data-driven discovery: does performance depend on highly matched dev data? Are performance gains due to technology or data?
MIT Lincoln Laboratory 21
Odyssey 2008
The Expedition: Research
• Speaker and language research is built on three core areas:
  – Speech science: understanding how speaker/language information is conveyed in the speech signal, and how to robustly extract measures of this information
  – Pattern recognition: techniques and algorithms to effectively represent and compare salient patterns in data
  – Data-driven discovery: effectively using data to apply, refine, and improve systems built from the above
• Current speaker/language research is heavily weighted toward data-driven discovery
  – Cure or curse?
  – Are we discovering underlying problems to address in research, or just where we want more data?
MIT Lincoln Laboratory21
Odyssey 2008
The ExpeditionResearch
bull Speaker and language research are built on three core areas
ndash Speech Science Understanding how speakerlanguage information is conveyed in the speech signal and how to robustly extract measures of this information
ndash Pattern Recognition Techniques and algorithms to effectively represent and compare salient patterns in data
ndash Data Driven Discovery Effectively using data to apply refine and improve systems built from above
bull Current speakerlanguage research is heavily weighted toward data driven discovery
ndash Cure or cursendash Are we discovering underlying problems to address in
research or just where we want more data
MIT Lincoln Laboratory12
Odyssey 2008
NIST SRE/LRE: Formal Start

1995–1996

SRE 1
• 6 sites: BBN (uni-Gauss), Dragon (LVCSR), Ensigma (ergodic HMM), INRS (phone HMMs), ITT (NN), MITLL (GMM o64, cohort)
• SWB1
• 26 targets
• Train: 10s, 30s, 4 sessions
• Test: 5s, 10s, 30s; separate target and impostor tests
• Area under ROC, PdPf=3105, closed-set error
• Same/different phone effect
• Speaker-dependent ROC

SRE 2
• 11 sites: Ensigma, ITT, MITLL, SRI, CAIP, INRS, Dragon, BBN, LIMSI, AT&T, Sanders
• Broad phone HMM, LVCSR, VQ, adapted GMM, SVM, world/cohorts, hnorm, anchor models
• SWB1
• 40 targets
• Train: 2 min – 1 session, 1 handset, 2 handsets
• Test: 3s, 10s, 30s; separate target and impostor tests
• Pooled DET, DCF

LRE 1
• 4–5 sites
• PPRLM, GMM-CEP, syllabic models, fusion
• CallFriend
• 12 languages, 3 dialects
• Test: 3s, 10s, 30s
• DET, DCF
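SRE 2's pooled DET and DCF scoring became the standard way results were compared in later evaluations. Below is a minimal sketch of the detection cost function and the "minDCF" reported in trend charts; the cost weights (C_miss = 10, C_fa = 1, P_target = 0.01) are the values NIST conventionally used, assumed here rather than stated on this slide.

```python
def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """NIST-style detection cost: expected cost of misses and false alarms
    at one operating point, weighted by the target prior."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def min_dcf(target_scores, impostor_scores, **cost_params):
    """Minimum detection cost over all decision thresholds."""
    best = float("inf")
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        best = min(best, detection_cost(p_miss, p_fa, **cost_params))
    return best
```

Under these weights a system that rejects everything costs C_miss × P_target = 0.1, which is why DCF axes in the charts that follow top out well below 1.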
MIT Lincoln Laboratory13
Odyssey 2008
NIST SRE/LRE: Steady Progress

Avignon Workshop

1997–1998

SRE 3
• 8 sites
• Pitch features; handset/mic detector and compensation using more dev data
• SWB2p1
• All speakers act as targets and impostors (current paradigm)
• Train: 2 min – 1 session, 1 handset, 2 handsets
• Test: 3s, 10s, 30s
• No cross-sex trials; matched and mismatched test phone
• DET, DCF

SRE 4
• 12 sites
• Phone sequences (BBN), sequence models (Dragon)
• SWB2p2
• Train: 2 min – 1 session, 2 sessions, all – 2 sessions
• Test: 3s, 10s, 30s
• SN/DN and handset-type side knowledge
• Human performance, 3s

1999

SRE 5
• 13 sites
• T-norm, system fusion
• SWB2p3
• Train: 2 min – 2 sessions
• Test: varying duration (0–15, 15–30, 30–45, >45 s), different number
• New tasks: 2-speaker test, speaker tracking
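The T-norm (test normalization) that appeared around SRE 5 rescales a target model's raw score using the mean and standard deviation of the scores a cohort of impostor models produces on the same test utterance. A sketch under that definition; the cohort numbers in the usage note below are made up for illustration.

```python
from statistics import mean, stdev

def t_norm(raw_score, cohort_scores):
    """Standardize a raw model score by the impostor-cohort mean and
    standard deviation estimated on the same test utterance."""
    return (raw_score - mean(cohort_scores)) / stdev(cohort_scores)
```

For example, a raw score of 2.0 against a cohort scoring [0.0, 1.0, 2.0] normalizes to 1.0: one cohort standard deviation above the impostor mean. Because the statistics are per-utterance, T-norm stabilizes the decision threshold across test conditions.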
MIT Lincoln Laboratory14
Odyssey 2008
NIST SRE/LRE: New Directions

Odyssey Workshop; JHU SuperSID Workshop

2000–2001

SRE 6
• 12 sites (first shark sighting)
• SMS
• SWB2p1/p2, AHUMADA
• Train: 2 min – 1 session
• Test: variable, 0–60 s
• New tasks: 2-speaker train & test, N-speaker segmentation

SRE 7
• 13 sites
• Per-frame SVM fusion, text-constrained GMM, word & phone N-grams
• SWB2p1/p2, AHUMADA, SWB2p4 (cell)
• Extended data task (SWB1)

2002–2003

SRE 8
• 24 sites
• Feature mapping, high-level features, MLP fusion
• SWB2p5 (cell), SWB2p2/p3 (ext), FBI VoiceDB (multimodal), BNEWS (seg), Meeting (seg)

SRE 9
• 19 sites
• SVM-GLDS, phone SVM, NERFs
• SWB2p5 (cell), SWB2p2/p3 (ext)

LRE 2
• 6 sites
• PPRLM, GMM-SDC, SVM-SDC, fusion
• CallFriend
• 12 languages
• Test: 3s, 10s, 30s
MIT Lincoln Laboratory15
Odyssey 2008
NIST SRE/LRE: Current Period

Odyssey Workshops

2004–2008

SRE 10 (2004)
• 24 sites
• Large system fusion
• Mixer1
• Bilingual speakers

SRE 11 (2005)
• 27 sites
• LFA, SVM-MLLR
• Mixer2, MMSR
• Cross-channel microphones
• Calibrated LLRs

SRE 12 (2006)
• 36 sites
• SVM-GSV, spectral-only systems
• Mixer2+3, MMSR
• Bilingual, cross-channel
• Multi-site collaboration

LRE 3 (2005)
• 11 sites
• GMM-MMI, TRAPS/NN-decoder, phone lattice, PPR-BinTree, PPR-SVM
• OHSU, Mixer1/2
• 7 languages, 3 dialects/accents

LRE 4 (2007)
• 21 sites
• SVM-GSV, high-order n-grams, fLFA/fNAP
• Mixer5, OHSU
• 14 languages, 5 dialects/accents
• Calibrated LLRs

2008: Odyssey WS, SRE 13, JHU WS
MIT Lincoln Laboratory16
Odyssey 2008
NIST SRE: How are we doing?

[Chart: DCF (0–0.08) vs. year (1996–2006) for a sampling of SRE tasks; 28 conditions in SRE04 alone. Tasks shown include: landline 1-speaker (2 min train, 30 s test); cellular 1-speaker (2 min train, 30 s test); landline 2-speaker detection; Ahumada (Spanish); multimodal (FBI); landline 1-speaker (40-target-speaker paradigm); cellular/landline 2-speaker detection; cell/landline 1-speaker (8-conv train, 1-conv test); cross-mic (1-conv train (tel), 1-conv test (mic)); cross-language. Corpora by year: SWB1, SWB2p1, SWB2p2, SWB2p3, SWB2p4, SWB2p5, Mixer1, Mixer2, Mixer3, MMSR.]
MIT Lincoln Laboratory17
Odyssey 2008
SRE Performance Trends 2001–2007 (Lincoln Systems)

[Chart: EER (%) and minDCF×100 vs. year (2001–2006) for the 1conv4w–1conv4w and 8conv4w–1conv4w conditions; corpora progress from SWB1 through SWB2 to MIXER2–3.]

• Consistent and steady improvement for data/task focus
• New data sets designed to be more challenging
• New features, classifiers, and compensations drive error rates down over time

Year | Technology
2001 | Text-constrained GMM, word n-grams
2002 | SuperSID, high-level features
2003 | Feature mapping, SVM-GLDS
2004 | Phone/Word-SVM, GMM-ATNORM
2005 | NAP, TC-SVM, word/phone lattices
2006 | SVM-GSV, GMM-LFA, MultiFeat SVM-GLDS, SVM-MLLR+NAP
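The EER trend lines in these charts can be recomputed from raw trial scores by sweeping a decision threshold to the operating point where miss and false-alarm rates are equal. A simple sweep-based sketch (coarser than the interpolated estimate evaluation tooling typically reports):

```python
def eer(target_scores, impostor_scores):
    """Equal error rate: the error at the threshold where P_miss and
    P_fa are (nearly) equal, found by a discrete threshold sweep."""
    best_gap, best = float("inf"), 1.0
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        p_miss = sum(s < thr for s in target_scores) / len(target_scores)
        p_fa = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if abs(p_miss - p_fa) < best_gap:
            best_gap, best = abs(p_miss - p_fa), (p_miss + p_fa) / 2
    return best
```

Perfectly separated score lists give an EER of 0; heavily overlapped lists approach 0.5, which is why the steady drop in these curves is a meaningful summary of progress.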
MIT Lincoln Laboratory18
Odyssey 2008
LRE Performance Trends 1996–2007 (Lincoln Systems)

[Chart: EER (%) (0–40) at 30 s, 10 s, and 3 s test durations across evaluations from 1996 to 2007, on CallFriend (12-lang), OHSU (7-lang), and Mixer3 (14-lang).]

Year | Main LID technology
1996 | PPRLM
2003 | + GMM-SDC, SVM-SDC
2005 | + Phone lattices, SVM with n-grams, binary trees
2007 | + TRAPS tokenizers, fLFA/fNAP, GMM-MMI, SVM-GSV, calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
• The odyssey from 1994 to 2008
• The scenic route through NIST speaker and language recognition evaluations
• The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The Expedition: Evaluations

• The evaluation paradigm has clearly helped propel speaker and language R&D forward
  – Common focus
  – Comparable results and repeatable experiments
  – Collaboration
• But there are some issues to consider
  – Proliferation of tasks and conditions can dilute and fragment community effort
  – Evaluations are application-dependent: the tasks, conditions, and data are representative of some application(s). Are these being set in a meaningful way?
  – Performance numbers need context: time-pressed, less-technical potential users want a yes/no answer to "will it or won't it work for my application?"
  – Speaker and language recognition research increasingly relies on data-driven discovery: does performance depend on highly matched dev data? Are performance gains due to technology or data?
MIT Lincoln Laboratory21
Odyssey 2008
The Expedition: Research

• Speaker and language research are built on three core areas
  – Speech Science: understanding how speaker/language information is conveyed in the speech signal, and how to robustly extract measures of this information
  – Pattern Recognition: techniques and algorithms to effectively represent and compare salient patterns in data
  – Data-Driven Discovery: effectively using data to apply, refine, and improve systems built from the above
• Current speaker/language research is heavily weighted toward data-driven discovery
  – Cure or curse?
  – Are we discovering underlying problems to address in research, or just where we want more data?
MIT Lincoln Laboratory13
Odyssey 2008
NIST SRELRESteady Progress
Avignon Workshop
1997 1998SRE 3 bull 8 sites bull Pitch features handset mic
detectorcomp using more dev data
bull SWB2p1bull All speakers act as tgts
and imposters (current paradigm)
bull Train 2 min - 1 sess 1 handset 2 handset
bull Test 3s 10s 30sbull No cross-sex trials
matched and mismatched test phone
bull DET DCF
SRE 4bull 12 sitesbull Phone sequences (BBN)
sequence models (Dragon)
bull SWB2p2bull Train 2 min - 1 sess 2
sess all - 2 sessbull Test 3s 10s 30sbull SNDN and HS type side
knowledgebull Human performance
3s
1999SRE 5bull 13 sitesbull T-norm system fusionbull SWB2p3bull Train 2 min - 2 sessbull Test varying duration (0-
15 15-30 30-45gt45) diff number
bull New tasks 2-spkr test speaker tracking
MIT Lincoln Laboratory14
Odyssey 2008
NIST SRELRENew Directions
Odyssey Workshop
JHU SuperSIDWorkshop
2000 2001SRE 6bull 12 sites (First shark
sighting) bull SMS bull SWB2p1p2
AHUMADAbull Train 2 min - 1 sessbull Test variable 0-60 bull New tasks 2-spkr
train amp test N-speaker segmentation
20032002SRE 7bull 13 sitesbull Per-frame SVM
Fusion text-constrained GMM word amp phone N-gram
bull SWB2p1p2 AHUMADA SWB2p4 (cell)
bull Extended data task (SWB1)
SRE 8bull 24 sitesbull Feat map
high-level features mlpfusion
bull SWB2p5 (cell) SWB2p2p3 (ext) FBIVoiceDB(Multi Modal) BNEWS (seg) Meeting (seg)
SRE 9bull 19 sitesbull SVM GLDS phone
svm nerfs)bull SWB2p5 (cell)
SWB2p2p3 (ext)
LRE 2bull 6 sitesbull PPRLM GMM-SDC
SVM-SDC fusionbull Callfriendbull 12 languagesbull Test 3s 10s 30s
MIT Lincoln Laboratory15
Odyssey 2008
NIST SRELRECurrent Period
Odyssey Workshop
Odyssey Workshop
2004 2005 20072006LRE 4bull 21 sitesbull SVM-GSV ho
ngrams fLFAfNAPbull Mixer5 OHSUbull 14 languages 5
dialectsaccentsbull Calibrated LLRs
LRE 3bull 11 sitesbull GMM-MMI TRAPSNN-
decoder phone lattice
2008
Odyssey WS SRE 13 JHU WS
SRE 10bull 24 sitesbull Large system
fusionbull Mixer1bull Bilingual
speakers
SRE 11bull 27 sitesbull LFA SVM-MLLRbull Mixer2 MMSRbull Cross-channel
microphonesbull Calibrated LLRs
SRE 12bull 36 sitesbull SVM-GSV
spectral-only systems
bull Mixer2+3 MMSRbull Bilingual cross-
channelbull Multi-site
collaboration
PPR-BinTree PPR-SVMbull OHSU Mixer12bull 7 languages 3
dialectsaccents
MIT Lincoln Laboratory16
Odyssey 2008
NIST SREHow are we doing
0
001002
003004
005
006007
008
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
Year
DC
F
Landline 1sp 2 min train 30 sec test
Cellular 1sp 2 min train 30 sec test
Landline 2-speaker detection
Ahumada(Spanish)
Multimodal (FBI)Landline 1sp (40
target speaker paradigm)
Cellularland 2-speaker detection
CellLand 1sp 8-conv train 1-conv test
Cross-mic1-conv train (tel) 1-conv test (mic)
Swb1 Swb2p1 Swb2p3 Swb2p4 Swb2p5 Mixer1 Mixer3
bull Sampling of tasks shown 28 in SRE04
Cross-language
Swb2p2 Mixer2 MMSR
MIT Lincoln Laboratory17
Odyssey 2008
0123456789
10
0
1
2
3
4
2001 2002 2003 2004 2005 2006
SRE Performance Trends 2001-2007Lincoln Systems
bull Consistent and steady improvement for datatask focus
EER ()1conv4w1conv4w
8conv4w1conv4w
minDCFx100
2001 2002 2003 2004 2005 2006SWB1 SWB2 MIXER2-3
bull New data sets designed to be more challenging
bull New features classifiers and compensations drive error rates down over time
SVM-GSV GMM-LFA MultiFeatSVM-GLDS SVM-MLLR+NAP
2006
NAP TC-SVM wordphone lattices2005
PhoneWord-SVM GMM-ATNORM2004
Feature Mapping SVM-GLDS2003
SuperSID High-level features2002
Text-const GMM word-ngram2001
MIT Lincoln Laboratory18
Odyssey 2008
0
10
20
30
40
1996 2003 2005 2005 2007 2007
EER
()
30s 10s 3s
CallFriend(12-lang)
OHSU(7-lang)
Mixer3(14-lang)
113
32 421014
LRE Performance Trends 1996-2007Lincoln Systems
19
Year Main LID Technology
1996 PPRLM2003 + GMM-SDC SVM-SDC2005 + Phone lattices SVM w ngrams
Binary Trees2007 + TRAPS tokenizers fLFA fNAP
GMM-MMI SVM-GSV calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The ExpeditionEvaluations
bull The evaluation paradigm has clearly helped propel speaker and language RampD forward
ndash Common focus ndash Comparable results and repeatable experimentsndash Collaboration
bull But there are some issues to considerndash Proliferation of tasks and conditions can dilute and fragment
community effortndash Evaluations are application-dependent
The tasks conditions and data are representative of some application(s)
Are these being set in a meaningful wayndash Performance numbers need context
Time-pressed less-technical potential users want yesno to ldquowill it or wonrsquot it work for my applicationrdquo
ndash Speaker and language recognition research increasingly relies on data driven discovery
Does performance depend on highly matched dev data Are performance gains due to technology or data
MIT Lincoln Laboratory21
Odyssey 2008
The ExpeditionResearch
bull Speaker and language research are built on three core areas
ndash Speech Science Understanding how speakerlanguage information is conveyed in the speech signal and how to robustly extract measures of this information
ndash Pattern Recognition Techniques and algorithms to effectively represent and compare salient patterns in data
ndash Data Driven Discovery Effectively using data to apply refine and improve systems built from above
bull Current speakerlanguage research is heavily weighted toward data driven discovery
ndash Cure or cursendash Are we discovering underlying problems to address in
research or just where we want more data
MIT Lincoln Laboratory14
Odyssey 2008
NIST SRELRENew Directions
Odyssey Workshop
JHU SuperSIDWorkshop
2000 2001SRE 6bull 12 sites (First shark
sighting) bull SMS bull SWB2p1p2
AHUMADAbull Train 2 min - 1 sessbull Test variable 0-60 bull New tasks 2-spkr
train amp test N-speaker segmentation
20032002SRE 7bull 13 sitesbull Per-frame SVM
Fusion text-constrained GMM word amp phone N-gram
bull SWB2p1p2 AHUMADA SWB2p4 (cell)
bull Extended data task (SWB1)
SRE 8bull 24 sitesbull Feat map
high-level features mlpfusion
bull SWB2p5 (cell) SWB2p2p3 (ext) FBIVoiceDB(Multi Modal) BNEWS (seg) Meeting (seg)
SRE 9bull 19 sitesbull SVM GLDS phone
svm nerfs)bull SWB2p5 (cell)
SWB2p2p3 (ext)
LRE 2bull 6 sitesbull PPRLM GMM-SDC
SVM-SDC fusionbull Callfriendbull 12 languagesbull Test 3s 10s 30s
MIT Lincoln Laboratory15
Odyssey 2008
NIST SRELRECurrent Period
Odyssey Workshop
Odyssey Workshop
2004 2005 20072006LRE 4bull 21 sitesbull SVM-GSV ho
ngrams fLFAfNAPbull Mixer5 OHSUbull 14 languages 5
dialectsaccentsbull Calibrated LLRs
LRE 3bull 11 sitesbull GMM-MMI TRAPSNN-
decoder phone lattice
2008
Odyssey WS SRE 13 JHU WS
SRE 10bull 24 sitesbull Large system
fusionbull Mixer1bull Bilingual
speakers
SRE 11bull 27 sitesbull LFA SVM-MLLRbull Mixer2 MMSRbull Cross-channel
microphonesbull Calibrated LLRs
SRE 12bull 36 sitesbull SVM-GSV
spectral-only systems
bull Mixer2+3 MMSRbull Bilingual cross-
channelbull Multi-site
collaboration
PPR-BinTree PPR-SVMbull OHSU Mixer12bull 7 languages 3
dialectsaccents
MIT Lincoln Laboratory16
Odyssey 2008
NIST SREHow are we doing
0
001002
003004
005
006007
008
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
Year
DC
F
Landline 1sp 2 min train 30 sec test
Cellular 1sp 2 min train 30 sec test
Landline 2-speaker detection
Ahumada(Spanish)
Multimodal (FBI)Landline 1sp (40
target speaker paradigm)
Cellularland 2-speaker detection
CellLand 1sp 8-conv train 1-conv test
Cross-mic1-conv train (tel) 1-conv test (mic)
Swb1 Swb2p1 Swb2p3 Swb2p4 Swb2p5 Mixer1 Mixer3
bull Sampling of tasks shown 28 in SRE04
Cross-language
Swb2p2 Mixer2 MMSR
MIT Lincoln Laboratory17
Odyssey 2008
0123456789
10
0
1
2
3
4
2001 2002 2003 2004 2005 2006
SRE Performance Trends 2001-2007Lincoln Systems
bull Consistent and steady improvement for datatask focus
EER ()1conv4w1conv4w
8conv4w1conv4w
minDCFx100
2001 2002 2003 2004 2005 2006SWB1 SWB2 MIXER2-3
bull New data sets designed to be more challenging
bull New features classifiers and compensations drive error rates down over time
SVM-GSV GMM-LFA MultiFeatSVM-GLDS SVM-MLLR+NAP
2006
NAP TC-SVM wordphone lattices2005
PhoneWord-SVM GMM-ATNORM2004
Feature Mapping SVM-GLDS2003
SuperSID High-level features2002
Text-const GMM word-ngram2001
MIT Lincoln Laboratory18
Odyssey 2008
0
10
20
30
40
1996 2003 2005 2005 2007 2007
EER
()
30s 10s 3s
CallFriend(12-lang)
OHSU(7-lang)
Mixer3(14-lang)
113
32 421014
LRE Performance Trends 1996-2007Lincoln Systems
19
Year Main LID Technology
1996 PPRLM2003 + GMM-SDC SVM-SDC2005 + Phone lattices SVM w ngrams
Binary Trees2007 + TRAPS tokenizers fLFA fNAP
GMM-MMI SVM-GSV calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The ExpeditionEvaluations
bull The evaluation paradigm has clearly helped propel speaker and language RampD forward
ndash Common focus ndash Comparable results and repeatable experimentsndash Collaboration
bull But there are some issues to considerndash Proliferation of tasks and conditions can dilute and fragment
community effortndash Evaluations are application-dependent
The tasks conditions and data are representative of some application(s)
Are these being set in a meaningful wayndash Performance numbers need context
Time-pressed less-technical potential users want yesno to ldquowill it or wonrsquot it work for my applicationrdquo
ndash Speaker and language recognition research increasingly relies on data driven discovery
Does performance depend on highly matched dev data Are performance gains due to technology or data
MIT Lincoln Laboratory21
Odyssey 2008
The ExpeditionResearch
bull Speaker and language research are built on three core areas
ndash Speech Science Understanding how speakerlanguage information is conveyed in the speech signal and how to robustly extract measures of this information
ndash Pattern Recognition Techniques and algorithms to effectively represent and compare salient patterns in data
ndash Data Driven Discovery Effectively using data to apply refine and improve systems built from above
bull Current speakerlanguage research is heavily weighted toward data driven discovery
ndash Cure or cursendash Are we discovering underlying problems to address in
research or just where we want more data
MIT Lincoln Laboratory15
Odyssey 2008
NIST SRELRECurrent Period
Odyssey Workshop
Odyssey Workshop
2004 2005 20072006LRE 4bull 21 sitesbull SVM-GSV ho
ngrams fLFAfNAPbull Mixer5 OHSUbull 14 languages 5
dialectsaccentsbull Calibrated LLRs
LRE 3bull 11 sitesbull GMM-MMI TRAPSNN-
decoder phone lattice
2008
Odyssey WS SRE 13 JHU WS
SRE 10bull 24 sitesbull Large system
fusionbull Mixer1bull Bilingual
speakers
SRE 11bull 27 sitesbull LFA SVM-MLLRbull Mixer2 MMSRbull Cross-channel
microphonesbull Calibrated LLRs
SRE 12bull 36 sitesbull SVM-GSV
spectral-only systems
bull Mixer2+3 MMSRbull Bilingual cross-
channelbull Multi-site
collaboration
PPR-BinTree PPR-SVMbull OHSU Mixer12bull 7 languages 3
dialectsaccents
MIT Lincoln Laboratory16
Odyssey 2008
NIST SREHow are we doing
0
001002
003004
005
006007
008
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
Year
DC
F
Landline 1sp 2 min train 30 sec test
Cellular 1sp 2 min train 30 sec test
Landline 2-speaker detection
Ahumada(Spanish)
Multimodal (FBI)Landline 1sp (40
target speaker paradigm)
Cellularland 2-speaker detection
CellLand 1sp 8-conv train 1-conv test
Cross-mic1-conv train (tel) 1-conv test (mic)
Swb1 Swb2p1 Swb2p3 Swb2p4 Swb2p5 Mixer1 Mixer3
bull Sampling of tasks shown 28 in SRE04
Cross-language
Swb2p2 Mixer2 MMSR
MIT Lincoln Laboratory17
Odyssey 2008
0123456789
10
0
1
2
3
4
2001 2002 2003 2004 2005 2006
SRE Performance Trends 2001-2007Lincoln Systems
bull Consistent and steady improvement for datatask focus
EER ()1conv4w1conv4w
8conv4w1conv4w
minDCFx100
2001 2002 2003 2004 2005 2006SWB1 SWB2 MIXER2-3
bull New data sets designed to be more challenging
bull New features classifiers and compensations drive error rates down over time
SVM-GSV GMM-LFA MultiFeatSVM-GLDS SVM-MLLR+NAP
2006
NAP TC-SVM wordphone lattices2005
PhoneWord-SVM GMM-ATNORM2004
Feature Mapping SVM-GLDS2003
SuperSID High-level features2002
Text-const GMM word-ngram2001
MIT Lincoln Laboratory18
Odyssey 2008
0
10
20
30
40
1996 2003 2005 2005 2007 2007
EER
()
30s 10s 3s
CallFriend(12-lang)
OHSU(7-lang)
Mixer3(14-lang)
113
32 421014
LRE Performance Trends 1996-2007Lincoln Systems
19
Year Main LID Technology
1996 PPRLM2003 + GMM-SDC SVM-SDC2005 + Phone lattices SVM w ngrams
Binary Trees2007 + TRAPS tokenizers fLFA fNAP
GMM-MMI SVM-GSV calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The ExpeditionEvaluations
bull The evaluation paradigm has clearly helped propel speaker and language RampD forward
ndash Common focus ndash Comparable results and repeatable experimentsndash Collaboration
bull But there are some issues to considerndash Proliferation of tasks and conditions can dilute and fragment
community effortndash Evaluations are application-dependent
The tasks conditions and data are representative of some application(s)
Are these being set in a meaningful wayndash Performance numbers need context
Time-pressed less-technical potential users want yesno to ldquowill it or wonrsquot it work for my applicationrdquo
ndash Speaker and language recognition research increasingly relies on data driven discovery
Does performance depend on highly matched dev data Are performance gains due to technology or data
MIT Lincoln Laboratory21
Odyssey 2008
The ExpeditionResearch
bull Speaker and language research are built on three core areas
ndash Speech Science Understanding how speakerlanguage information is conveyed in the speech signal and how to robustly extract measures of this information
ndash Pattern Recognition Techniques and algorithms to effectively represent and compare salient patterns in data
ndash Data Driven Discovery Effectively using data to apply refine and improve systems built from above
bull Current speakerlanguage research is heavily weighted toward data driven discovery
ndash Cure or cursendash Are we discovering underlying problems to address in
research or just where we want more data
MIT Lincoln Laboratory16
Odyssey 2008
NIST SREHow are we doing
0
001002
003004
005
006007
008
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
Year
DC
F
Landline 1sp 2 min train 30 sec test
Cellular 1sp 2 min train 30 sec test
Landline 2-speaker detection
Ahumada(Spanish)
Multimodal (FBI)Landline 1sp (40
target speaker paradigm)
Cellularland 2-speaker detection
CellLand 1sp 8-conv train 1-conv test
Cross-mic1-conv train (tel) 1-conv test (mic)
Swb1 Swb2p1 Swb2p3 Swb2p4 Swb2p5 Mixer1 Mixer3
bull Sampling of tasks shown 28 in SRE04
Cross-language
Swb2p2 Mixer2 MMSR
MIT Lincoln Laboratory17
Odyssey 2008
0123456789
10
0
1
2
3
4
2001 2002 2003 2004 2005 2006
SRE Performance Trends 2001-2007Lincoln Systems
bull Consistent and steady improvement for datatask focus
EER ()1conv4w1conv4w
8conv4w1conv4w
minDCFx100
2001 2002 2003 2004 2005 2006SWB1 SWB2 MIXER2-3
bull New data sets designed to be more challenging
bull New features classifiers and compensations drive error rates down over time
SVM-GSV GMM-LFA MultiFeatSVM-GLDS SVM-MLLR+NAP
2006
NAP TC-SVM wordphone lattices2005
PhoneWord-SVM GMM-ATNORM2004
Feature Mapping SVM-GLDS2003
SuperSID High-level features2002
Text-const GMM word-ngram2001
MIT Lincoln Laboratory18
Odyssey 2008
0
10
20
30
40
1996 2003 2005 2005 2007 2007
EER
()
30s 10s 3s
CallFriend(12-lang)
OHSU(7-lang)
Mixer3(14-lang)
113
32 421014
LRE Performance Trends 1996-2007Lincoln Systems
19
Year Main LID Technology
1996 PPRLM2003 + GMM-SDC SVM-SDC2005 + Phone lattices SVM w ngrams
Binary Trees2007 + TRAPS tokenizers fLFA fNAP
GMM-MMI SVM-GSV calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The ExpeditionEvaluations
bull The evaluation paradigm has clearly helped propel speaker and language RampD forward
ndash Common focus ndash Comparable results and repeatable experimentsndash Collaboration
bull But there are some issues to considerndash Proliferation of tasks and conditions can dilute and fragment
community effortndash Evaluations are application-dependent
The tasks conditions and data are representative of some application(s)
Are these being set in a meaningful wayndash Performance numbers need context
Time-pressed less-technical potential users want yesno to ldquowill it or wonrsquot it work for my applicationrdquo
ndash Speaker and language recognition research increasingly relies on data driven discovery
Does performance depend on highly matched dev data Are performance gains due to technology or data
MIT Lincoln Laboratory21
Odyssey 2008
The ExpeditionResearch
bull Speaker and language research are built on three core areas
ndash Speech Science Understanding how speakerlanguage information is conveyed in the speech signal and how to robustly extract measures of this information
ndash Pattern Recognition Techniques and algorithms to effectively represent and compare salient patterns in data
ndash Data Driven Discovery Effectively using data to apply refine and improve systems built from above
bull Current speakerlanguage research is heavily weighted toward data driven discovery
ndash Cure or cursendash Are we discovering underlying problems to address in
research or just where we want more data
MIT Lincoln Laboratory17
Odyssey 2008
0123456789
10
0
1
2
3
4
2001 2002 2003 2004 2005 2006
SRE Performance Trends 2001-2007Lincoln Systems
bull Consistent and steady improvement for datatask focus
EER ()1conv4w1conv4w
8conv4w1conv4w
minDCFx100
2001 2002 2003 2004 2005 2006SWB1 SWB2 MIXER2-3
bull New data sets designed to be more challenging
bull New features classifiers and compensations drive error rates down over time
SVM-GSV GMM-LFA MultiFeatSVM-GLDS SVM-MLLR+NAP
2006
NAP TC-SVM wordphone lattices2005
PhoneWord-SVM GMM-ATNORM2004
Feature Mapping SVM-GLDS2003
SuperSID High-level features2002
Text-const GMM word-ngram2001
MIT Lincoln Laboratory18
Odyssey 2008
0
10
20
30
40
1996 2003 2005 2005 2007 2007
EER
()
30s 10s 3s
CallFriend(12-lang)
OHSU(7-lang)
Mixer3(14-lang)
113
32 421014
LRE Performance Trends 1996-2007Lincoln Systems
19
Year Main LID Technology
1996 PPRLM2003 + GMM-SDC SVM-SDC2005 + Phone lattices SVM w ngrams
Binary Trees2007 + TRAPS tokenizers fLFA fNAP
GMM-MMI SVM-GSV calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The ExpeditionEvaluations
bull The evaluation paradigm has clearly helped propel speaker and language RampD forward
ndash Common focus ndash Comparable results and repeatable experimentsndash Collaboration
bull But there are some issues to considerndash Proliferation of tasks and conditions can dilute and fragment
community effortndash Evaluations are application-dependent
The tasks conditions and data are representative of some application(s)
Are these being set in a meaningful wayndash Performance numbers need context
Time-pressed less-technical potential users want yesno to ldquowill it or wonrsquot it work for my applicationrdquo
ndash Speaker and language recognition research increasingly relies on data driven discovery
Does performance depend on highly matched dev data Are performance gains due to technology or data
MIT Lincoln Laboratory21
Odyssey 2008
The ExpeditionResearch
bull Speaker and language research are built on three core areas
ndash Speech Science Understanding how speakerlanguage information is conveyed in the speech signal and how to robustly extract measures of this information
ndash Pattern Recognition Techniques and algorithms to effectively represent and compare salient patterns in data
ndash Data Driven Discovery Effectively using data to apply refine and improve systems built from above
bull Current speakerlanguage research is heavily weighted toward data driven discovery
ndash Cure or cursendash Are we discovering underlying problems to address in
research or just where we want more data
MIT Lincoln Laboratory18
Odyssey 2008
0
10
20
30
40
1996 2003 2005 2005 2007 2007
EER
()
30s 10s 3s
CallFriend(12-lang)
OHSU(7-lang)
Mixer3(14-lang)
113
32 421014
LRE Performance Trends 1996-2007Lincoln Systems
19
Year Main LID Technology
1996 PPRLM2003 + GMM-SDC SVM-SDC2005 + Phone lattices SVM w ngrams
Binary Trees2007 + TRAPS tokenizers fLFA fNAP
GMM-MMI SVM-GSV calibrated LLRs
MIT Lincoln Laboratory19
Odyssey 2008
Roadmap
bull The odyssey from 1994 to 2008
bull The scenic route through NIST speaker and language recognition evaluations
bull The expedition into future territories
MIT Lincoln Laboratory20
Odyssey 2008
The ExpeditionEvaluations
bull The evaluation paradigm has clearly helped propel speaker and language RampD forward
ndash Common focus ndash Comparable results and repeatable experimentsndash Collaboration
bull But there are some issues to considerndash Proliferation of tasks and conditions can dilute and fragment
community effortndash Evaluations are application-dependent
The tasks conditions and data are representative of some application(s)
Are these being set in a meaningful wayndash Performance numbers need context
Time-pressed less-technical potential users want yesno to ldquowill it or wonrsquot it work for my applicationrdquo
ndash Speaker and language recognition research increasingly relies on data driven discovery
Does performance depend on highly matched dev data Are performance gains due to technology or data
MIT Lincoln Laboratory21
Odyssey 2008
The Expedition: Research

• Speaker and language research is built on three core areas
– Speech Science: understanding how speaker/language information is conveyed in the speech signal, and how to robustly extract measures of this information
– Pattern Recognition: techniques and algorithms to effectively represent and compare salient patterns in data
– Data-Driven Discovery: effectively using data to apply, refine, and improve systems built from the above
• Current speaker/language research is heavily weighted toward data-driven discovery
– Cure or curse?
– Are we discovering underlying problems to address in research, or just where we want more data?