MIT Lincoln Laboratory Nuance Communications Automatic Speaker Recognition Recent Progress, Current Applications, and Future Trends Douglas A. Reynolds,

MIT Lincoln Laboratory Nuance Communications

Automatic Speaker Recognition: Recent Progress, Current Applications, and Future Trends

Douglas A. Reynolds, PhD, Senior Member of Technical Staff, M.I.T. Lincoln Laboratory

Larry P. Heck, PhD, Speaker Verification R&D, Nuance Communications


Outline

• Introduction and applications

• General theory

• Performance

• Conclusion and future directions


Extracting Information from Speech

• Speech recognition: speech signal → words (“How are you?”)

• Language recognition: speech signal → language name (English)

• Speaker recognition: speech signal → speaker name (James Wilson)

Goal: Automatically extract information transmitted in speech signal


Introduction: Identification

• Determines who is talking from set of known voices

• No identity claim from user (many to one mapping)

• Often assumed that unknown voice must come from set of known speakers - referred to as closed-set identification

[Figure: Whose voice is this?]


Introduction: Verification/Authentication/Detection

• Determine whether person is who they claim to be

• User makes identity claim: one to one mapping

• Unknown voice could come from large set of unknown speakers - referred to as open-set verification

• Adding “none-of-the-above” option to closed-set identification gives open-set identification

[Figure: Is this Bob’s voice?]


Introduction: Speech Modalities

Application dictates different speech modalities:

• Text-dependent recognition

– Recognition system knows text spoken by person

– Examples: fixed phrase, prompted phrase

– Used for applications with strong control over user input

– Knowledge of spoken text can improve system performance

• Text-independent recognition

– Recognition system does not know text spoken by person

– Examples: user-selected phrase, conversational speech

– Used for applications with less control over user input

– More flexible system but also more difficult problem

– Speech recognition can provide knowledge of spoken text


Introduction: Voice as a Biometric

• Biometric: a human-generated signal or attribute for authenticating a person’s identity

• Voice is a popular biometric:

– natural signal to produce

– does not require a specialized input device

– ubiquitous: telephones and microphone-equipped PCs

• Voice biometric can be combined with other forms of security

– Something you have - e.g., badge

– Something you know - e.g., password

– Something you are - e.g., voice

[Figure: Have, Know, and Are combined give the strongest security.]


Introduction: Applications

• Access control

– Physical facilities

– Data and data networks

• Transaction authentication

– Toll fraud prevention

– Telephone credit card purchases

– Bank wire transfers

• Monitoring

– Remote time and attendance logging

– Home parole verification

– Prison telephone usage

• Information retrieval

– Customer information for call centers

– Audio indexing (speech skimming device)

• Forensics

– Voice sample matching



General Theory: Components of Speaker Verification System

[Figure: block diagram. Input speech (“My Name is Bob”) → feature extraction → speaker model (Bob’s “voiceprint”, selected by the identity claim “Bob”) and impostor model (impostor “voiceprints”) → decision → ACCEPT or REJECT.]


General Theory: Phases of Speaker Verification System

Two distinct phases to any speaker verification system

• Enrollment phase: enrollment speech for each speaker (Bob, Sally) → feature extraction → model training → voiceprints (models) for each speaker

• Verification phase: input speech with claimed identity (Sally) → feature extraction → verification decision (“Accepted!”)




General Theory: Features for Speaker Recognition

• Humans use several levels of perceptual cues for speaker recognition

Hierarchy of Perceptual Cues, from high-level cues (learned traits, difficult to automatically extract) to low-level cues (physical traits, easy to automatically extract):

– Semantics, diction, pronunciations, idiosyncrasies (shaped by socio-economic status, education, place of birth)

– Prosodics, rhythm, speed, intonation, volume modulation (shaped by personality type, parental influence)

– Acoustic aspect of speech: nasal, deep, breathy, rough (shaped by anatomical structure of vocal apparatus)

• There are no exclusive speaker-identifying cues

• Low-level acoustic cues most applicable for automatic systems



• Desirable attributes of features for an automatic system (Wolf ‘72)

– Practical: occur naturally and frequently in speech; easily measurable

– Robust: not change over time or be affected by speaker’s health; not be affected by reasonable background noise nor depend on specific transmission characteristics

– Secure: not be subject to mimicry

• No feature has all these attributes

• Features derived from spectrum of speech have proven to be the most effective in automatic systems


General Theory: Speech Production

• Speech production model: source-filter interaction

– Anatomical structure (vocal tract/glottis) conveyed in speech spectrum

[Figure: glottal pulses → vocal tract → speech signal]

• Speech is a continuous evolution of the vocal tract

– Need to extract time series of spectra

– Use a sliding window - 20 ms window, 10 ms shift

[Figure: each windowed segment → Fourier transform → magnitude]

• Produces time-frequency evolution of the spectrum
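
The sliding-window analysis on this slide can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `short_time_spectrum`, the Hamming taper, the 8 kHz sample rate, and the synthetic 1 kHz tone are all choices made here, not details given in the slides. Only the 20 ms window and 10 ms shift come from the text.

```python
import numpy as np

def short_time_spectrum(signal, sample_rate, window_ms=20.0, shift_ms=10.0):
    """Return magnitude spectra, one row per analysis frame."""
    window_len = int(round(sample_rate * window_ms / 1000.0))
    shift_len = int(round(sample_rate * shift_ms / 1000.0))
    if len(signal) < window_len:
        raise ValueError("signal is shorter than one analysis window")
    n_frames = 1 + (len(signal) - window_len) // shift_len
    # Assumption: a Hamming taper; the slides specify only window and shift sizes.
    taper = np.hamming(window_len)
    frames = np.stack([
        signal[i * shift_len : i * shift_len + window_len] * taper
        for i in range(n_frames)
    ])
    # Fourier transform of each frame, keeping only the magnitude.
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical input: one second of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * t)
spectra = short_time_spectrum(tone, sample_rate)
```

Each row of `spectra` is one frame, so the array as a whole is the time-frequency evolution the slide refers to. Practical front ends usually apply further processing to these spectra, which is outside the scope of this sketch.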


General Theory: Speaker Models

[Figure: the speaker verification block diagram (feature extraction, speaker model, impostor model, decision), repeated with the speaker model for Bob singled out.]



• Speaker models (voiceprints) represent voice biometric in compact and generalizable form

[Figure: example model over the sound sequence h-a-d]

• Modern speaker verification systems use Hidden Markov Models (HMMs)

– HMMs are statistical models of how a speaker produces sounds

– HMMs represent underlying statistical variations in the speech state (e.g., phoneme) and temporal changes of speech between the states.

– Fast training algorithms (EM) exist for HMMs with guaranteed convergence properties.



Form of HMM depends on the application

• Fixed phrase (e.g., “Open sesame”): word/phrase models

• Prompted phrases/passwords (e.g., /s/ /i/ /x/): phoneme models

• General speech (text-independent): single-state HMM
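
For the text-independent case, a single-state HMM has no temporal structure to learn, so training reduces to fitting the state's output density. The sketch below assumes that density is a mixture of diagonal-covariance Gaussians and fits it with EM; the slides do not state the density form, and the function names, the two-component setting, and the synthetic "feature frames" are hypothetical.

```python
import numpy as np

def log_gaussians(x, means, variances):
    """Log density of every frame under every diagonal Gaussian: (frames, components)."""
    diff = x[:, None, :] - means[None, :, :]
    quad = np.sum(diff ** 2 / variances, axis=2)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    return -0.5 * (quad + log_norm)

def train_gmm(x, n_components=2, n_iter=20, seed=0):
    """Fit mixture weights, means, and variances with EM; also return the
    average log-likelihood measured at the start of each iteration."""
    rng = np.random.default_rng(seed)
    n_frames = x.shape[0]
    # Initialise means from randomly chosen frames, variances from the global variance.
    means = x[rng.choice(n_frames, size=n_components, replace=False)].copy()
    variances = np.tile(x.var(axis=0) + 1e-6, (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    history = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each frame.
        log_joint = np.log(weights) + log_gaussians(x, means, variances)
        log_frame = np.logaddexp.reduce(log_joint, axis=1)
        history.append(float(log_frame.mean()))
        resp = np.exp(log_joint - log_frame[:, None])
        # M-step: re-estimate parameters from the soft counts.
        counts = resp.sum(axis=0)
        weights = counts / n_frames
        means = (resp.T @ x) / counts[:, None]
        variances = np.maximum((resp.T @ (x ** 2)) / counts[:, None] - means ** 2, 1e-6)
    return weights, means, variances, history

# Hypothetical 2-dimensional "feature frames" drawn from two clusters.
data_rng = np.random.default_rng(1)
frames = np.vstack([data_rng.normal(-2.0, 0.5, size=(200, 2)),
                    data_rng.normal(3.0, 1.0, size=(200, 2))])
weights, means, variances, history = train_gmm(frames)
```

The values in `history` should never decrease from one iteration to the next, which is the convergence property of EM mentioned on the previous slide.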




General Theory: Verification Decision

Verification decision approaches have roots in signal detection theory

• 2-class hypothesis test:

– H0: the speaker is an impostor

– H1: the speaker is indeed the claimed speaker

• Statistic computed on test utterance S as log-likelihood ratio:

$$\Lambda(S) = \log \frac{\text{likelihood that } S \text{ came from the speaker HMM}}{\text{likelihood that } S \text{ did not come from the speaker HMM}}$$

• The decision block thresholds this statistic: accept when it is high, reject when it is low

[Figure: feature extraction → speaker model (+) and impostor model (−) → decision → accept or reject]
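
A minimal sketch of the log-likelihood-ratio decision follows. To keep it short, a single diagonal Gaussian stands in for both the speaker HMM and the impostor model, the score is averaged per frame, and the threshold of 0.0 is arbitrary; all of these, along with the function names and synthetic frames, are assumptions made for illustration and not the presenters' configuration.

```python
import numpy as np

def avg_log_likelihood(frames, mean, variance):
    """Average per-frame log density under one diagonal Gaussian."""
    per_frame = -0.5 * np.sum(
        (frames - mean) ** 2 / variance + np.log(2.0 * np.pi * variance), axis=1)
    return float(per_frame.mean())

def verify(frames, speaker_model, impostor_model, threshold=0.0):
    """Return (score, decision); score = log p(S | speaker) - log p(S | impostor)."""
    score = (avg_log_likelihood(frames, *speaker_model)
             - avg_log_likelihood(frames, *impostor_model))
    return score, ("accept" if score >= threshold else "reject")

# Hypothetical models, each a (mean, variance) pair.
speaker_model = (np.array([1.0, -1.0]), np.array([1.0, 1.0]))
impostor_model = (np.array([0.0, 0.0]), np.array([1.0, 1.0]))

rng = np.random.default_rng(0)
claimed_frames = rng.normal(loc=[1.0, -1.0], scale=1.0, size=(300, 2))
other_frames = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(300, 2))

true_score, true_decision = verify(claimed_frames, speaker_model, impostor_model)
false_score, false_decision = verify(other_frames, speaker_model, impostor_model)
```

Working in the log domain turns the ratio into a difference, which matches the "+" and "−" inputs to the decision block in the diagram. Raising `threshold` makes the system stricter, and lowering it makes it more permissive.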





Verification Performance: Evaluating Speaker Verification Systems

• There are many factors to consider in design of an evaluation of a speaker verification system

– Speech quality: channel and microphone characteristics; noise level and type; variability between enrollment and verification speech

– Speech modality: fixed/prompted/user-selected phrases; free text

– Speech duration: duration and number of sessions of enrollment and verification speech

– Speaker population: size and composition

• Most importantly: The evaluation data and design should match the target application domain of interest


Verification Performance: Evaluating Speaker Verification Systems

[Figure: Example Performance Curve. Probability of false reject (in %) versus probability of false accept (in %); Equal Error Rate (EER) = 1 %.]

• High security, e.g., wire transfer: false acceptance is very costly; users may tolerate rejections for security

• High convenience, e.g., toll fraud: false rejections alienate customers; any fraud rejection is beneficial

• Balance: operating near the equal error rate

Application operating point depends on relative costs of the two error types
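
The trade-off between the two error types can be made concrete with a small sketch that sweeps a decision threshold over verification scores. The synthetic score distributions, the function names, and the cost values below are hypothetical and chosen only for illustration; they do not reproduce the curve on the slide.

```python
import numpy as np

def error_rates(target_scores, impostor_scores, threshold):
    """False-accept and false-reject rates when accepting scores >= threshold."""
    false_accept = float(np.mean(np.asarray(impostor_scores) >= threshold))
    false_reject = float(np.mean(np.asarray(target_scores) < threshold))
    return false_accept, false_reject

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the point where the two error rates are closest."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    rates = [error_rates(target_scores, impostor_scores, t) for t in thresholds]
    fa, fr = min(rates, key=lambda r: abs(r[0] - r[1]))
    return (fa + fr) / 2.0

def best_threshold(target_scores, impostor_scores, cost_fa, cost_fr):
    """Threshold minimising cost_fa * FA + cost_fr * FR over the observed scores."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    costs = []
    for t in thresholds:
        fa, fr = error_rates(target_scores, impostor_scores, t)
        costs.append(cost_fa * fa + cost_fr * fr)
    return float(thresholds[int(np.argmin(costs))])

# Hypothetical scores: true speakers score higher on average than impostors.
rng = np.random.default_rng(0)
target_scores = rng.normal(2.0, 1.0, size=1000)
impostor_scores = rng.normal(-2.0, 1.0, size=1000)

eer = equal_error_rate(target_scores, impostor_scores)
# Wire-transfer style: false accepts are far more costly than false rejects.
secure_threshold = best_threshold(target_scores, impostor_scores, 100.0, 1.0)
# Toll-fraud style: false rejects are far more costly than false accepts.
convenient_threshold = best_threshold(target_scores, impostor_scores, 1.0, 100.0)
```

With these scores the high-security setting ends up with a higher threshold than the high-convenience setting, which is the shift along the curve that the slide describes.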


Verification Performance: NIST Speaker Verification Evaluations

• Annual NIST evaluations of speaker verification technology (since 1995)

• Aim: Provide a common paradigm for comparing technologies

• Focus: Conversational telephone speech (text-independent)

[Figure: evaluation cycle linking the evaluation coordinator, the data provider (Linguistic Data Consortium), and technology developers: comparison of technologies on common task, evaluate, improve.]


Verification Performance: Range of Performance

[Figure: probability of false reject (in %) versus probability of false accept (in %) for four conditions, with an arrow labeled “Increasing constraints”.]

• Text-dependent (combinations): clean data, single microphone, large amount of train/test speech

• Text-independent (conversational): telephone data, multiple microphones, moderate amount of training data

• Text-dependent (digit strings): telephone data, small amount of training data

• Text-independent (read sentences): military radio data, multiple radios & microphones


Verification Performance: Human vs. Machine

• Motivation for comparing human to machine

– Evaluating speech coders and potential forensic applications

• Schmidt-Nielsen and Crystal used NIST evaluation (DSP Journal, January 2000)

– Same amount of training data

– Matched handset-type tests

– Mismatched handset-type tests

– Used 3-sec conversational utterances from telephone speech

[Figure: error rates for computer and human listeners in matched and mismatched handset conditions; annotations read “Humans 44% better” and “Humans 15% worse”.]


Verification Performance: Application Deployments

Application

• Voice authentication based on spoken phone number

• Provides secure access to customer record & credit card information

Implementation

• Edify telephony platform

• Performance @ 1% EER

Benefits

• Security

• Personalization

Volume

• 250k customers enrolled currently @ 20K calls/day

• 5 million customers will enroll by Q2 ‘00 @ 150K calls/day


Verification Performance: Speaker + Knowledge Verification

[Figure: voice over telephone is checked two ways. “Authenticate knowledge” checks the response against data (knowledge); “authenticate voice” checks it against voice prints (biometric); the outcome is accept or reject.]

Example dialog:

• System: Please enter your account number

• Caller: “5551234”

• System: Say your date of birth

• Caller: “October 13, 1964”

• System: You’re accepted by the system





Conclusions

• Speaker verification is one of the few recognition areas where machines can outperform humans

• Speaker verification technology is a viable technique currently available for applications

• Speaker verification can be augmented with other authentication techniques to add further security


Future Directions

• Research will focus on using speaker verification techniques for more unconstrained, uncontrolled situations

– Audio search and retrieval

– Increasing robustness to channel variabilities

– Incorporating higher levels of knowledge into decisions

• Speaker recognition technology will become an integral part of speech interfaces

– Personalization of services and devices

– Unobtrusive protection of transactions and information
