Automatic Speaker Verification

Zouhir Wakaf, PhD

Outline
- Introduction
- Speaker identification vs verification
- Speaker verification overview
- The parts of a speaker verification system
- Evaluation of speaker verification performance
- Applications
- Future directions

Introduction: Extracting Information from Speech

Goal: automatically extract the information transmitted in the speech signal

[Figure: the speech signal feeds two tasks. Speech recognition outputs the words ("How are you?"); speaker recognition outputs the speaker identity ("Dr. Ahmad").]

Speaker Identification

Determine the speaker identity
- Selection between a set of known voices
- The user does not claim an identity

Closed-set identification
- Assume that all speakers are known to the system

Open-set identification
- Possibility that the speaker is not among the speakers known to the system

[Figure: an unknown voice compared against the known speakers: "Whose voice is this?"]

Speaker Verification

Synonyms: authentication, detection

User claims an identity

System task: accept or reject the identity claim

The voice can come from outside the set of known speakers (open set); if all speakers are known to the system, the task is closed set

Impostor: any voice other than that of the true identity

[Figure: a voice checked against a claimed identity: "Is this Ahmad's voice?"]

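Identification vs Verification

[Figure: feature extraction feeds a speaker model (+) and an impostor model (-); the two scores are combined and the claim is accepted if the result exceeds a threshold.]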
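The difference between the two tasks can be illustrated with a minimal Python sketch. The function names, the example scores (made-up log-likelihoods), and the threshold values are hypothetical and are not taken from the slides.

```python
def identify(scores):
    """Closed-set identification: return the best-scoring enrolled speaker."""
    return max(scores, key=scores.get)


def identify_open_set(scores, threshold):
    """Open-set identification: the best match may still be rejected as unknown."""
    best = identify(scores)
    return best if scores[best] >= threshold else None


def verify(speaker_score, impostor_score, threshold=0.0):
    """Verification: accept or reject a single claimed identity."""
    return (speaker_score - impostor_score) > threshold


# Hypothetical log-likelihood scores of one utterance against each voiceprint.
scores = {"Ahmad": -41.2, "Salma": -38.5}
impostor_score = -40.0  # score under a hypothetical impostor model

print(identify(scores))                          # no identity is claimed
print(identify_open_set(scores, threshold=-30))  # may answer "nobody known"
print(verify(scores["Salma"], impostor_score))   # claim: "I am Salma"
print(verify(scores["Ahmad"], impostor_score))   # claim: "I am Ahmad"
```

Identification compares the utterance with every enrolled model, whereas verification only needs the claimed speaker's model and the impostor model.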

Speech Modalities

Application dictates different speech modalities:

Text-dependent recognition
- Recognition system knows the text spoken by the person
- Examples: fixed phrase, prompted phrase
- Used for applications with strong control over user input
- Knowledge of the spoken text can improve system performance
- Prompting may reduce the risk of impostors using voice recordings

Text-independent recognition
- Recognition system does not know the text spoken by the person
- Examples: user-selected phrase, conversational speech
- Used for applications with less control over user input
- More flexible system but also a more difficult problem
- Speech recognition can provide knowledge of the spoken text

Speech for Identification

Speech is easily produced

It does not require advanced input devices

Can be applied using telephones and PCs

Can be supplemented with:

- a password phrase, to improve security
- personal knowledge

Speaker Verification
- Which features?
- How to model the speaker
- How to model the impostors
- How to make the decision so as to minimize the probability of error

Phases of Speaker Verification System

There are two distinct phases to any speaker verification system:

Enrolment phase: enrolment speech for each speaker (e.g., Ahmad, Salma) -> feature extraction -> model training -> voiceprints (models) for each speaker

Verification phase: test speech plus a claimed identity (e.g., "Salma") -> feature extraction -> verification decision against the claimed speaker's voiceprint (e.g., "Accepted!")

Features for Speaker Recognition
- Humans use several levels of perceptual cues for speaker recognition
- There are no exclusive speaker identity cues
- Low-level acoustic cues (physical traits) are the most applicable for automatic systems

Desirable attributes of features for an automatic system:

Practical
- Occur naturally and frequently in speech
- Easily measurable

Robust
- Do not change over time and are not affected by the speaker's health
- Are not affected by reasonable background noise and do not depend on specific transmission characteristics

Secure
- Are not subject to mimicry

Features for Speaker Recognition

No feature has all these attributes

Features derived from the spectrum of speech have proven to be the most effective in automatic systems

Typically: mel-frequency cepstral coefficients (MFCCs)
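As an illustration of what spectrum-derived features look like in practice, here is a minimal NumPy sketch of a textbook MFCC pipeline (framing, Hamming window, power spectrum, mel filterbank, log, DCT). The slides do not specify any of the parameters; the 8 kHz sampling rate, 25 ms frames, 10 ms step, 26 filters, and 13 coefficients are assumptions, and common refinements such as pre-emphasis and liftering are left out.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising slope
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling slope
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank


def mfcc(signal, sample_rate=8000, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_filters=26, n_ceps=13):
    """Return an array of shape (n_frames, n_ceps); needs at least one full frame."""
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    n_frames = 1 + (len(signal) - flen) // fstep
    # Slice the signal into overlapping frames and apply a Hamming window.
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(flen)
    # Power spectrum of each (zero-padded) frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Mel filterbank energies, floored to avoid log(0).
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    log_energies = np.log(np.maximum(power @ fbank.T, 1e-10))
    # DCT-II (unnormalised) decorrelates the log energies.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T


# Usage on a synthetic one-second tone (a stand-in for real speech).
t = np.arange(8000) / 8000.0
features = mfcc(np.sin(2 * np.pi * 440.0 * t))
print(features.shape)
```

Each row of `features` is one feature vector; a sequence of such vectors is what the speaker and impostor models are trained and scored on.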

Speaker Models

Speaker models (voiceprints) represent the voice biometric in a compact and generalizable form

Modern speaker verification systems use Hidden Markov Models (HMMs)

HMMs are statistical models of how a speaker produces sounds

[Figure: an HMM with the states h, a, d]

HMMs represent the underlying statistical variations within a speech state (e.g., a phoneme) and the temporal changes of speech between the states

Fast training algorithms (EM) exist for HMMs, with guaranteed convergence properties

Speaker Models

The form of the HMM depends on the application:
- Fixed phrase: word/phrase models (e.g., "Open Sesame")
- Prompted phrases/passwords: phoneme models (e.g., /s/ /i/ /x/)
- Text-independent (general speech): single-state HMM

Text-Independent Speaker Verification

The impostor model is built using speech from all speakers

It is a GMM with a large number of mixture components

The speaker model is built using speaker adaptation, which requires only a relatively small amount of speech
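One common way to realise this adaptation step is mean-only MAP adaptation of the impostor (background) GMM, sketched below in NumPy. The slides only say "speaker adaptation", so the choice of MAP adaptation, the diagonal covariances, the relevance factor of 16, and all names and toy values are assumptions.

```python
import numpy as np


def log_gauss_diag(X, means, variances):
    """Log density of every frame under every diagonal Gaussian, shape (T, M)."""
    diff = X[:, None, :] - means[None, :, :]
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    return -0.5 * (log_norm + np.sum(diff ** 2 / variances[None, :, :], axis=2))


def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Shift the background-model means towards the enrolment frames X."""
    # Posterior probability of each mixture component for each frame.
    log_p = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)
    # Soft counts and per-component data means.
    n = post.sum(axis=0)
    data_means = (post.T @ X) / np.maximum(n, 1e-10)[:, None]
    # Components that saw a lot of data move more; unseen ones stay put.
    alpha = (n / (n + relevance))[:, None]
    return alpha * data_means + (1.0 - alpha) * means


# Toy background model with two components, and enrolment data near (1, 1).
rng = np.random.default_rng(0)
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_vars = np.ones((2, 2))
enrol = rng.normal(loc=1.0, scale=0.5, size=(200, 2))

speaker_means = map_adapt_means(enrol, ubm_weights, ubm_means, ubm_vars)
print(speaker_means)
```

Because only the components that the enrolment frames actually touch are moved, a usable speaker model can be derived from far less speech than training a large GMM from scratch would need.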

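Verification Decision

The decision is a 2-class hypothesis test:
- H0: the speaker is an impostor
- H1: the speaker is indeed the claimed speaker

The statistic computed on the test utterance S is a log-likelihood ratio:

$$\Lambda(S) = \log \frac{\text{likelihood that } S \text{ came from the speaker HMM}}{\text{likelihood that } S \text{ did not come from the speaker HMM}}$$

[Figure: feature extraction feeds the speaker model (+) and the impostor model (-); the claim is accepted if the resulting score exceeds a threshold.]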
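The log-likelihood ratio test can be sketched as follows, using diagonal-covariance GMMs for both the speaker and the impostor model. The model parameters, the per-frame averaging of the log-likelihood, the threshold of 0, and the function names are illustrative assumptions rather than details from the slides.

```python
import numpy as np


def gmm_avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X under a diagonal GMM."""
    diff = X[:, None, :] - means[None, :, :]
    log_comp = (np.log(weights)[None, :]
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2 / variances[None, :, :], axis=2))
    # Log-sum-exp over the mixture components, computed stably.
    m = log_comp.max(axis=1, keepdims=True)
    frame_ll = m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))
    return float(np.mean(frame_ll))


def verify(X, speaker_gmm, impostor_gmm, threshold=0.0):
    """Return (score, accepted), where score is the log-likelihood ratio."""
    score = gmm_avg_loglik(X, *speaker_gmm) - gmm_avg_loglik(X, *impostor_gmm)
    return score, score > threshold


# Toy models: each is a tuple (weights, means, variances).
speaker_gmm = (np.array([1.0]), np.array([[1.0, 1.0]]), np.ones((1, 2)))
impostor_gmm = (np.array([0.5, 0.5]),
                np.array([[0.0, 0.0], [5.0, 5.0]]), np.ones((2, 2)))

rng = np.random.default_rng(1)
genuine = rng.normal(loc=1.0, scale=0.5, size=(300, 2))   # true speaker
intruder = rng.normal(loc=5.0, scale=0.5, size=(300, 2))  # someone else

genuine_score, genuine_ok = verify(genuine, speaker_gmm, impostor_gmm)
intruder_score, intruder_ok = verify(intruder, speaker_gmm, impostor_gmm)
print(genuine_score, genuine_ok)
print(intruder_score, intruder_ok)
```

Raising the threshold makes the system reject more impostors at the cost of rejecting more genuine speakers, which is the trade-off examined when evaluating performance.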

Verification Performance: Evaluating Speaker Verification Systems

There are many factors to consider in evaluating speaker verification systems:

Speech quality
- Channel and microphone characteristics
- Noise level and type
- Variability between enrolment and verification speech

Speech modality
- Fixed/prompted/user-selected phrases
- Free text

Speech duration
- Duration and number of sessions of enrolment and verification speech

Speaker population
- Size and composition

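DET Curve

[Figure: detection error trade-off (DET) curve]

The relative importance of the two error types (false acceptance and false rejection) depends on the application!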
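The points of a DET curve can be obtained by sweeping the decision threshold over the verification scores and measuring both error rates at each setting. The sketch below assumes that higher scores mean "more likely the claimed speaker"; the function names and the example scores are hypothetical.

```python
import numpy as np


def error_rates(target_scores, impostor_scores, threshold):
    """Return (false acceptance rate, false rejection rate) at one threshold."""
    far = float(np.mean(np.asarray(impostor_scores) > threshold))
    frr = float(np.mean(np.asarray(target_scores) <= threshold))
    return far, frr


def det_points(target_scores, impostor_scores):
    """One (threshold, FAR, FRR) triple per observed score value."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    return [(float(t),) + error_rates(target_scores, impostor_scores, t)
            for t in thresholds]


def equal_error_rate(target_scores, impostor_scores):
    """Operating point where FAR and FRR are closest; returns (EER, threshold)."""
    t, far, frr = min(det_points(target_scores, impostor_scores),
                      key=lambda p: abs(p[1] - p[2]))
    return (far + frr) / 2.0, t


# Made-up scores from genuine trials and from impostor trials.
target = np.array([2.0, 1.5, 0.5, 3.0])
impostor = np.array([-1.0, 0.0, 1.0, -2.0])

eer, eer_threshold = equal_error_rate(target, impostor)
print(eer, eer_threshold)
```

A DET plot draws the false rejection rate against the false acceptance rate on normal-deviate axes. A high-security application would choose a threshold with a low false acceptance rate, while a convenience-oriented one would favour a low false rejection rate.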

Applications

Transaction authentication
- Toll fraud prevention
- Telephone credit card purchases
- Telephone brokerage (e.g., stock trading)

Access control
- Physical facilities
- Computers and data networks

Monitoring
- Remote time and attendance logging
- Home parole verification
- Prison telephone usage

Information retrieval
- Customer information for call centers
- Audio indexing (speech skimming device)

Forensics
- Voice sample matching (e.g., a recorded threat against a suspect's voice)

Future Directions

Research will focus on using speaker recognition for more unconstrained, uncontrolled situations
- Audio search and retrieval
- Increasing robustness to channel variability
- Incorporating higher levels of knowledge into decisions

Speaker recognition technology will become an integral part of speech interfaces
- Personalization of services and devices
- Unobtrusive protection of transactions and information
