MIT Lincoln Laboratory Nuance Communications
Automatic Speaker Recognition: Recent Progress, Current Applications, and Future Trends
Douglas A. Reynolds, PhD, Senior Member of Technical Staff, M.I.T. Lincoln Laboratory
Larry P. Heck, PhD, Speaker Verification R&D, Nuance Communications
Outline
• Introduction and applications
• General theory
• Performance
• Conclusion and future directions
Extracting Information from Speech
Goal: Automatically extract information transmitted in speech signal
• Speech recognition: speech signal → words (e.g., “How are you?”)
• Language recognition: speech signal → language name (e.g., English)
• Speaker recognition: speech signal → speaker name (e.g., James Wilson)
Introduction: Identification
• Determines who is talking from set of known voices
• No identity claim from user (many to one mapping)
• Often assumed that unknown voice must come from set of known speakers - referred to as closed-set identification
Whose voice is this?
Introduction: Verification/Authentication/Detection
• Determine whether person is who they claim to be
• User makes identity claim: one to one mapping
• Unknown voice could come from large set of unknown speakers - referred to as open-set verification
• Adding “none-of-the-above” option to closed-set identification gives open-set identification
Is this Bob’s voice?
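The three task variants above (closed-set identification, open-set identification, verification) differ only in how per-speaker match scores are used. The following is a minimal sketch under stated assumptions: the function names, the score dictionary, and the threshold values are hypothetical illustrations, not part of the original slides, and the scores are assumed to come from some speaker-model comparison where higher means a better match.

```python
from typing import Dict, Optional


def identify_closed_set(scores: Dict[str, float]) -> str:
    """Closed-set identification: pick the best-scoring enrolled speaker."""
    return max(scores, key=lambda name: scores[name])


def identify_open_set(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Open-set identification: closed-set choice plus "none of the above"."""
    best = identify_closed_set(scores)
    # Return None ("none of the above") if even the best match is too weak.
    return best if scores[best] >= threshold else None


def verify(scores: Dict[str, float], claimed: str, threshold: float) -> bool:
    """Verification: one-to-one check of the claimed identity only."""
    return scores[claimed] >= threshold


# Hypothetical match scores for one unknown utterance (higher = better match).
example_scores = {"Bob": 1.8, "Sally": -0.4, "James": 0.3}
```

With these example scores, closed-set identification always names an enrolled speaker, while the open-set variant can decline to name anyone when the threshold is set above the best score.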
Introduction: Speech Modalities
Application dictates different speech modalities:
• Text-dependent recognition
– Recognition system knows text spoken by person
– Examples: fixed phrase, prompted phrase
– Used for applications with strong control over user input
– Knowledge of spoken text can improve system performance
• Text-independent recognition
– Recognition system does not know text spoken by person
– Examples: User selected phrase, conversational speech
– Used for applications with less control over user input
– More flexible system but also more difficult problem
– Speech recognition can provide knowledge of spoken text
Introduction: Voice as a Biometric
• Biometric: a human-generated signal or attribute for authenticating a person’s identity
• Voice is a popular biometric:
– natural signal to produce
– does not require a specialized input device
– ubiquitous: telephones and microphone-equipped PCs
• Voice biometric can be combined with other forms of security
– Something you have - e.g., badge
– Something you know - e.g., password
– Something you are - e.g., voice
• Strongest security: combining all three (have, know, are)
Introduction: Applications
• Access control
– Physical facilities
– Data and data networks
• Transaction authentication
– Toll fraud prevention
– Telephone credit card purchases
– Bank wire transfers
• Monitoring
– Remote time and attendance logging
– Home parole verification
– Prison telephone usage
• Information retrieval
– Customer information for call centers
– Audio indexing (speech skimming device)
• Forensics
– Voice sample matching
General Theory: Components of Speaker Verification System
[Figure: input speech (“My name is Bob”) with identity claim (Bob) → feature extraction → speaker model (Bob’s “voiceprint”) and impostor model (impostor “voiceprints”) → decision → ACCEPT or REJECT]
General Theory: Phases of Speaker Verification System
Two distinct phases to any speaker verification system:
• Enrollment phase: enrollment speech for each speaker (e.g., Bob, Sally) → feature extraction → model training → voiceprints (models) for each speaker
• Verification phase: input speech with a claimed identity (e.g., Sally) → feature extraction → verification decision (e.g., “Accepted!”)
General Theory: Features for Speaker Recognition
• Humans use several levels of perceptual cues for speaker recognition
Hierarchy of Perceptual Cues, from high-level cues (learned traits, difficult to automatically extract) to low-level cues (physical traits, easy to automatically extract):
– Semantics, diction, pronunciations, idiosyncrasies (associated with socio-economic status, education, place of birth)
– Prosodics, rhythm, speed, intonation, volume modulation (associated with personality type, parental influence)
– Acoustic aspect of speech: nasal, deep, breathy, rough (associated with anatomical structure of vocal apparatus)
• There are no exclusive speaker-identifying cues
• Low-level acoustic cues are most applicable for automatic systems
General Theory: Features for Speaker Recognition
• Desirable attributes of features for an automatic system (Wolf ‘72)
– Practical: occur naturally and frequently in speech; easily measurable
– Robust: not change over time or be affected by speaker’s health; not be affected by reasonable background noise nor depend on specific transmission characteristics
– Secure: not be subject to mimicry
• No feature has all these attributes
• Features derived from spectrum of speech have proven to be the most effective in automatic systems
General Theory: Speech Production
• Speech production model: source-filter interaction
– Anatomical structure (vocal tract/glottis) conveyed in speech spectrum
[Figure: glottal pulses → vocal tract → speech signal]
General Theory: Features for Speaker Recognition
• Speech is a continuous evolution of the vocal tract
– Need to extract time series of spectra
– Use a sliding window - 20 ms window, 10 ms shift
[Figure: each windowed frame → Fourier transform → magnitude]
• Produces time-frequency evolution of the spectrum
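The sliding-window analysis above can be sketched as follows. This is a minimal illustration, not the presenters' front end: it uses only the Python standard library and a naive DFT for clarity, and the Hamming taper, the function names, and the 8 kHz sample rate in the usage note are assumptions not stated in the slides. Only the 20 ms window and 10 ms shift come from the source.

```python
import cmath
import math
from typing import List, Sequence


def frame_signal(samples: Sequence[float], sample_rate: int,
                 window_ms: float = 20.0, shift_ms: float = 10.0) -> List[Sequence[float]]:
    """Cut the signal into overlapping frames (20 ms window, 10 ms shift)."""
    win = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * shift_ms / 1000.0))
    return [samples[s:s + win] for s in range(0, len(samples) - win + 1, hop)]


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitude of the DFT of one frame, for bins 0 .. n/2."""
    n = len(frame)
    # Hamming taper: an assumed (common) choice, not specified in the slides.
    tapered = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / max(n - 1, 1)))
               for i, x in enumerate(frame)]
    return [abs(sum(tapered[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def spectrogram(samples: Sequence[float], sample_rate: int) -> List[List[float]]:
    """Time series of magnitude spectra: one spectrum per 10 ms shift."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, sample_rate)]
```

For example, at an assumed 8 kHz sample rate each frame holds 160 samples and successive frames start 80 samples apart, so a 0.1 s signal yields 9 frames of 81 frequency bins each. A production system would use an FFT rather than this quadratic-time DFT.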
General Theory: Speaker Models
• Speaker models (voiceprints) represent voice biometric in compact and generalizable form
[Figure: example HMM with states h-a-d]
• Modern speaker verification systems use Hidden Markov Models (HMMs)
– HMMs are statistical models of how a speaker produces sounds
– HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states.
– Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties.
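To make the role of states and transitions concrete, the sketch below scores an observation sequence against an HMM with the scaled forward algorithm. It covers likelihood evaluation only, not the EM training mentioned above. It assumes a toy discrete-symbol HMM; the function name and parameter layout (start, trans, emit) are hypothetical and chosen for illustration, whereas real systems model continuous spectral features.

```python
import math
from typing import List, Sequence


def hmm_log_likelihood(obs: Sequence[int],
                       start: Sequence[float],
                       trans: Sequence[Sequence[float]],
                       emit: Sequence[Sequence[float]]) -> float:
    """Return log P(obs | HMM) using the forward algorithm with scaling.

    obs   : non-empty sequence of observation symbol indices
    start : start[i] = probability of beginning in state i
    trans : trans[i][j] = probability of moving from state i to state j
    emit  : emit[i][k] = probability that state i emits symbol k
    """
    n_states = len(start)
    # Forward probabilities for the first observation.
    alpha: List[float] = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate through the transitions, then weight by the emission.
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n_states))
                     * emit[j][obs[t]]
                     for j in range(n_states)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the sequence is impossible under this model
        # Rescale to avoid underflow; accumulate the log of the scale factors.
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_like
```

In a left-to-right model, such as one for a fixed phrase, the transition matrix forbids going backwards, so an utterance with sounds in the wrong order receives a very low (here, negative infinite) log-likelihood.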
General Theory: Speaker Models
Form of HMM depends on the application:
• Fixed phrase: word/phrase models (e.g., “Open sesame”)
• Prompted phrases/passwords: phoneme models (e.g., /s/ /i/ /x/)
• General speech (text-independent): single-state HMM
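A single-state HMM has no temporal structure, so it scores every frame independently with one emission density. The sketch below assumes that density is a Gaussian mixture with diagonal covariances, a common choice for text-independent systems but an assumption here, since the slides do not specify the emission density. All function and parameter names are hypothetical.

```python
import math
from typing import Sequence


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float],
                      var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_frame_log_likelihood(x: Sequence[float],
                             weights: Sequence[float],
                             means: Sequence[Sequence[float]],
                             variances: Sequence[Sequence[float]]) -> float:
    """Log likelihood of one feature frame under the mixture.

    Mixture weights are assumed to be strictly positive and to sum to one.
    """
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    # Log-sum-exp keeps the sum numerically stable.
    return peak + math.log(sum(math.exp(v - peak) for v in logs))


def average_log_likelihood(frames: Sequence[Sequence[float]],
                           weights: Sequence[float],
                           means: Sequence[Sequence[float]],
                           variances: Sequence[Sequence[float]]) -> float:
    """Average per-frame log likelihood; frame order does not matter."""
    total = sum(gmm_frame_log_likelihood(x, weights, means, variances)
                for x in frames)
    return total / len(frames)
```

Because the score is an average over frames, shuffling the frames leaves it unchanged, which is why this form suits general speech where the text is unknown.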
General Theory: Verification Decision
Verification decision approaches have roots in signal detection theory
• 2-class hypothesis test:
– H0: the speaker is an impostor
– H1: the speaker is indeed the claimed speaker
• Statistic computed on test utterance S as log-likelihood ratio:
$$\Lambda(S) = \log \frac{\text{likelihood that } S \text{ came from speaker HMM}}{\text{likelihood that } S \text{ did not come from speaker HMM}}$$
[Figure: feature extraction → speaker model (+) and impostor model (−) → decision → accept or reject]
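In the log domain the likelihood ratio becomes a difference of two log-likelihoods, which matches the "+" and "−" inputs to the decision block. The sketch below is a minimal illustration: the function names, the default threshold of 0.0, and the example log-likelihood values are assumptions, and the two inputs are taken to be the log-likelihoods of the utterance under the speaker model and under the impostor model.

```python
def log_likelihood_ratio(speaker_log_like: float, impostor_log_like: float) -> float:
    """Log of (likelihood under speaker model / likelihood under impostor model)."""
    return speaker_log_like - impostor_log_like


def decide(speaker_log_like: float, impostor_log_like: float,
           threshold: float = 0.0) -> str:
    """Accept the identity claim (H1) if the ratio reaches the threshold.

    The threshold is an assumed tunable parameter; raising it trades
    fewer false accepts for more false rejects.
    """
    ratio = log_likelihood_ratio(speaker_log_like, impostor_log_like)
    return "accept" if ratio >= threshold else "reject"
```

For example, with hypothetical log-likelihoods of -41.2 under the speaker model and -45.0 under the impostor model, the ratio is 3.8 and the claim is accepted at threshold 0.0 but rejected at threshold 5.0.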
Verification Performance: Evaluating Speaker Verification Systems
• There are many factors to consider in design of an evaluation of a speaker verification system
– Speech quality: channel and microphone characteristics; noise level and type; variability between enrollment and verification speech
– Speech modality: fixed/prompted/user-selected phrases; free text
– Speech duration: duration and number of sessions of enrollment and verification speech
– Speaker population: size and composition
• Most importantly: The evaluation data and design should match the target application domain of interest
Verification Performance: Evaluating Speaker Verification Systems
[Figure: example performance curve; x-axis: probability of false accept (in %), y-axis: probability of false reject (in %); Equal Error Rate (EER) = 1%]
Application operating point depends on relative costs of the two error types
• High security, e.g., wire transfer:
– False acceptance is very costly
– Users may tolerate rejections for security
• High convenience, e.g., toll fraud:
– False rejections alienate customers
– Any fraud rejection is beneficial
• Balance: equal error rate
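The two error types and the operating-point trade-off can be computed directly from trial scores. The sketch below is illustrative only: the function names and score lists are hypothetical, candidate thresholds are restricted to the observed scores, the EER is approximated at the threshold where the two rates are closest, and the cost function assumes target and impostor trials are equally likely.

```python
from typing import List, Sequence, Tuple


def error_rates(target_scores: Sequence[float], impostor_scores: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false accept rate, false reject rate), accepting scores >= threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in target_scores if s < threshold)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(target_scores))


def candidate_thresholds(target_scores: Sequence[float],
                         impostor_scores: Sequence[float]) -> List[float]:
    """Observed scores, sorted, used as the thresholds to try."""
    return sorted(set(target_scores) | set(impostor_scores))


def equal_error_rate(target_scores: Sequence[float],
                     impostor_scores: Sequence[float]) -> Tuple[float, float]:
    """Approximate (EER, threshold) where the two error rates are closest."""
    best_gap, best_eer, best_threshold = None, 0.0, 0.0
    for th in candidate_thresholds(target_scores, impostor_scores):
        fa, fr = error_rates(target_scores, impostor_scores, th)
        gap = abs(fa - fr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer, best_threshold = gap, (fa + fr) / 2.0, th
    return best_eer, best_threshold


def min_cost_threshold(target_scores: Sequence[float],
                       impostor_scores: Sequence[float],
                       cost_false_accept: float,
                       cost_false_reject: float) -> float:
    """Threshold minimizing the weighted error cost (equal priors assumed)."""
    def cost(th: float) -> float:
        fa, fr = error_rates(target_scores, impostor_scores, th)
        return cost_false_accept * fa + cost_false_reject * fr
    return min(candidate_thresholds(target_scores, impostor_scores), key=cost)
```

Weighting false accepts heavily (the wire-transfer case) pushes the chosen threshold up, while weighting false rejects heavily (the toll-fraud case) pushes it down.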
Verification Performance: NIST Speaker Verification Evaluations
• Annual NIST evaluations of speaker verification technology (since 1995)
• Aim: Provide a common paradigm for comparing technologies
• Focus: Conversational telephone speech (text-independent)
[Figure: evaluation cycle linking the evaluation coordinator, the data provider (Linguistic Data Consortium), and technology developers: comparison of technologies on common task, evaluate, improve]
Verification Performance: Range of Performance
[Figure: performance curves; x-axis: probability of false accept (in %), y-axis: probability of false reject (in %); arrow labeled “Increasing constraints”]
• Text-dependent (Combinations): clean data, single microphone, large amount of train/test speech
• Text-independent (Conversational): telephone data, multiple microphones, moderate amount of training data
• Text-dependent (Digit strings): telephone data, multiple microphones, small amount of training data
• Text-independent (Read sentences): military radio data, multiple radios & microphones, moderate amount of training data
Verification Performance: Human vs. Machine
• Motivation for comparing human to machine
– Evaluating speech coders and potential forensic applications
• Schmidt-Nielsen and Crystal used NIST evaluation (DSP Journal, January 2000)
– Same amount of training data
– Matched handset-type tests
– Mismatched handset-type tests
– Used 3-sec conversational utterances from telephone speech
[Figure: error rates for computer and human listeners under matched and mismatched handset conditions; annotations: “Humans 44% better”, “Humans 15% worse”]
Verification Performance: Application Deployments
Application
• Voice authentication based on spoken phone number
• Provides secure access to customer record & credit card information
Implementation
• Edify telephony platform
• Performance @ 1% EER
Benefits
• Security
• Personalization
Volume
• 250k customers enrolled currently @ 20K calls/day
• 5 million customers will enroll by Q2 ‘00 @ 150K calls/day
Verification Performance: Speaker + Knowledge Verification
Example dialogue over the telephone:
• System: “Please enter your account number” / User: “5551234”
• System: “Say your date of birth” / User: “October 13, 1964”
• System: “You’re accepted by the system”
[Figure: the spoken response goes to Authenticate Knowledge (checked against account data) and Authenticate Voice (checked against voiceprints), leading to Accept or Reject; knowledge plus biometric: voice over telephone]
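One way to combine the two checks is to require both to pass. The sketch below assumes that rule; the slides show both authentication blocks feeding an accept/reject outcome but do not state how they are combined. The Account class, the function name, the voice score, and the threshold are all hypothetical, and the recognized date of birth is assumed to come from a speech recognizer run on the same utterance that is scored against the voiceprint.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Account:
    """Hypothetical customer record: the knowledge on file for one account."""
    date_of_birth: str


def authenticate(account_number: str, recognized_dob: str, voice_score: float,
                 accounts: Dict[str, Account], threshold: float = 0.0) -> bool:
    """Accept only if both the knowledge check and the voice check pass."""
    record = accounts.get(account_number)
    if record is None:
        return False  # unknown account: nothing to verify against
    knowledge_ok = recognized_dob == record.date_of_birth
    # voice_score is assumed to be a match score (for example a log-likelihood
    # ratio) of the utterance against the account holder's voiceprint.
    voice_ok = voice_score >= threshold
    return knowledge_ok and voice_ok


# Example data mirroring the dialogue in the slide.
example_accounts = {"5551234": Account(date_of_birth="October 13, 1964")}
```

A single spoken answer serves both checks here: what was said supplies the knowledge factor, and how it was said supplies the biometric factor.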
Conclusions
• Speaker verification is one of the few recognition areas where machines can outperform humans
• Speaker verification technology is a viable technique currently available for applications
• Speaker verification can be augmented with other authentication techniques to add further security
Future Directions
• Research will focus on using speaker verification techniques for more unconstrained, uncontrolled situations
– Audio search and retrieval
– Increasing robustness to channel variabilities
– Incorporating higher levels of knowledge into decisions
• Speaker recognition technology will become an integral part of speech interfaces
– Personalization of services and devices
– Unobtrusive protection of transactions and information