34
APSIPA Asia-Pacific Signal and Information Processing Association APSIPA Asia-Pacific Signal and Information Processing Association Speaker Verification – The present and future of voiceprint based security Prof. Eliathamby Ambikairajah Head of School of Electrical Engineering & Telecommunications, University of New South Wales, Australia 21 Oct 2013

Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Asia-Pacific Signal and Information Processing Association

Speaker Verification – The present and future of voiceprint based security

Prof. Eliathamby Ambikairajah Head of School of Electrical Engineering & Telecommunications,

University of New South Wales, Australia

21 Oct 2013

Page 2: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Outline

• Introduction

• Speaker Verification Applications

• Speaker Verification System

• Performance measure

• NIST Speaker Recognition Evaluation (SRE)

• Discussion

1

Page 3: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

“How are you?”

Introduction

• Speech conveys several types of information

– Linguistic: message and language information

– Paralinguistic : emotional and physiological characteristics

Speech Recognition

Language Recognition

Speaker Recognition

Emotion Recognition

Accent Recognition

“How are you?” English Hsing Ming Happy Taiwanese

Linguistic Paralinguistic

2

Page 4: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Introduction

Speech Recognition

Language Recognition

Speaker Recognition

Emotion Recognition

Accent Recognition

“How are you?”

3

Speaker Identification

determines who is speaking given a set of

enrolled speakers

Speaker Verification

determines if the unknown voice is from

the claimed speaker

Speaker Diarization

partition an input audio stream into

homogeneous segments according to the speaker

identity

Page 5: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia 4

Speaker Identification

determines who is speaking given a set of

enrolled speakers

Speaker Verification

determines if the unknown voice is from

the claimed speaker

Speaker Diarization

partition an input audio stream into

homogeneous segments according to the speaker

identity

3.5 4 4.5 5 5.5

x 104

-0.1

-0.05

0

0.05

0.1

0.15

0.2

Model repositorySpeaker 1

Model

Unknown Speaker

Best Matching Speaker

Speaker 2 Model

Speaker M Model

Model repositorySpeaker 1

Model

Claimed Speaker 2

RejectSpeaker 2

Model

Speaker M Model

3.5 4 4.5 5 5.5

x 104

-0.1

-0.05

0

0.05

0.1

0.15

0.2

3.5 4 4.5 5 5.5

x 104

-0.1

-0.05

0

0.05

0.1

0.15

0.2

3.5 4 4.5 5 5.5

x 104

-0.1

-0.05

0

0.05

0.1

0.15

0.2

Speaker 1

Speaker 2

Speaker 1

Page 6: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Speaker Verification Applications - Biometrics

5

Physical facilities

Access control

Telephone credit card purchases

Transaction authentication

Page 7: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Speaker Verification System – Basic Overview

• In automatic speaker verification,

– The front-end converts speech signal into a more convenient representation (typically a set of feature vectors)

– The back-end compares this representation to a model of a speaker to determine how well they match

6

3 . 5 4 4 . 5 5 5 . 5

x 1 04

- 0 . 1

- 0 . 0 5

0

0 . 0 5

0 . 1

0 . 1 5

0 . 2

Feature Extraction

ClassificationSpeech

Speaker Model

Decision MakingAccept/Reject

Front-end Back-end

Page 8: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

UBM: represent general, speaker independent model to be compared against a person-specific model when making an accept or reject decision.

Speaker Models

Speaker 1 Model

John’s Model

Speaker Verification System

I am John

Determine level of Match

Determine level of Match

Likelihood of Generic Male

Likelihood of John

Likelihood Ratio Decision Making

NOT JOHN

Universal Background Models (UBM)

Generic Male

Generic Female

Feature Extraction

c0 c1 cnc2

Page 9: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

Speaker Verification System – Speaker Enrolment

8

Universal Background Models (UBM)

Generic Male

Generic Female

Speaker 1

Speaker 2

Speaker N

Feature Extraction

ModelTraining

Speaker x1

Speaker x2

Speaker xM

Feature Extraction

Feature Extraction

Feature Extraction

Model Adaptation

Model Adaptation

Model Adaptation

Background male speaker data

Target male speaker data

3 . 5 4 4 . 5 5 5 . 5

x 1 04

- 0 . 1

- 0 . 0 5

0

0 . 0 5

0 . 1

0 . 1 5

0 . 2

3 . 5 4 4 . 5 5 5 . 5

x 1 04

- 0 . 1

- 0 . 0 5

0

0 . 0 5

0 . 1

0 . 1 5

0 . 2

Creating a male UBM

Speaker Models

Speaker x1 Model

Speaker x2 Model

Speaker xM Model

Creating male speaker-specific models

Step 1

Step 2

Page 10: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Detailed Speaker Verification System

9

Feature

Extraction

Speaker

Modelling

Classification

(Scoring)SpeechAccept/

Reject

Feature

Normalisation

Model

Normalisation

Score

Normalisation

Decision

Making

· Cepstral Mean Subtraction

(CMS)

· RelAtive SpecTrAl (RASTA)

· Feature Warping

· Feature Mapping

· Nuisance Attribute

Projection (NAP)

· Joint Factor Analysis (JFA)

· i-vectors

· Within Class Covariance

Normalisation (WCCN)

· Linear Discriminant Analysis

(LDA)

· Probabilistic Linear

Discriminant Analysis

(PLDA)

· Zero-normalisation

(Z-norm)

· Test-normalisation

(T-Norm)

Page 11: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

Front-end: Feature Extraction

10

3.5 4 4.5 5 5.5

x 104

-0.1

-0.05

0

0.05

0.1

0.15

0.2

Frame 1 Frame 2 Frame 3 Frame N

25ms 25ms 25ms 25ms

Feature Vector

Feature Vector

Feature Vector

Feature Vector

Windowing

Feature Extraction

Feature Normalisation

BASIC FEATURES

-10 -5 0 5 100

0.2

0.4

-10 -5 0 5 100

0.1

0.2

0.3

0.4

Windowing

Feature Extraction

Windowing

Feature Extraction

Windowing

Feature Extraction

c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2

Normalised Feature Vector

c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2NORMALISED FEATURES

Normalised Feature Vector

Normalised Feature Vector

Normalised Feature Vector

Co distribution

Normalised Co

distribution

Page 12: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia 11

Temporal Derivative

c0 c1 cnc2 c0 c1 cnc2 c0 c1 cnc2

Normalised Feature vectors

(Frame 1)

Normalised Feature vectors

(Frame 2)

Normalised Feature vectors

(Frame P)

d0d1 dnd2

Delta Feature vectors

(Frame 1)

Delta Feature vectors

(Frame 2)

Delta Feature vectors

(Frame P)

d0d1 dnd2 d0d1 dnd2

Temporal Derivative

a0a1 ana2

Acceleration Feature vectors

(Frame 1)

Acceleration Feature vectors

(Frame 2)

Acceleration Feature vectors

(Frame P)

a0a1 ana2 a0a1 ana2

c0 c1 cnc2 d0d1 dnd2 a0a1 ana2Frame 1 Features:

(e.g: 39 dimensions)

Page 13: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Detailed Speaker Verification System

12

Feature

Extraction

Speaker

Modelling

Classification

(Scoring)SpeechAccept/

Reject

Feature

Normalisation

Model

Normalisation

Score

Normalisation

Decision

Making

· Cepstral Mean Subtraction

(CMS)

· RelAtive SpecTrAl (RASTA)

· Feature Warping

· Feature Mapping

· Nuisance Attribute

Projection (NAP)

· Joint Factor Analysis (JFA)

· i-vectors

· Within Class Covariance

Normalisation (WCCN)

· Linear Discriminant Analysis

(LDA)

· Probabilistic Linear

Discriminant Analysis

(PLDA)

· Zero-normalisation

(Z-norm)

· Test-normalisation

(T-Norm)

Page 14: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

Speaker Modelling

Probability density function approximated by 3-component Gaussian mixture models

Each Gaussian mixture consist of a mean (µ), covariance (Σ) and weight (w)

-8 -6 -4 -2 0 2 4 6 8 100

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

Overall

PDF

Weighted

Gaussian 3

Weighted

Gaussian 2

Weighted

Gaussian 1

All Weights

must sum to 1

13

4 4.5 5 5.5 6 6.5 7 7.5 8 8.5 93

4

5

6

7

8

9

45

67

89

3

4

5

6

7

8

9

0

0.2

0.4FEATURE SPACE

MODELLING PROBABILITY DISTRIBUTION

Dim

en

sio

n 1

(C

0)

Dimension 2 (C1)

Page 15: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Database for creating UBM (example)

• Training set

– 56 male speakers (each speaker consists of 2 minutes of active speech) for creating the UBM

• Target set

– 20 male speakers (each speaker consists of 2 minutes of active speech) for speaker-specific model

• Test set

– 250 male utterances (each speaker has many test utterances) with the known identity

14

-1000 0 1000 2000 3000 4000 5000 6000 7000 8000 9000-4

-3

-2

-1

0

1

2

0 500 1000 1500 2000 2500 3000-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

0 500 1000 1500 2000 2500-1

-0.8

-0.6

-0.4

-0.2

0

0.2

0.4

0.6

0.8

1

Page 16: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia 15

UBM

Target Speaker Data

Target Model

-6 -4 -2 0 2 4 6 8-0.1

0

0.1

0.2

0.3

0.4

-6 -4 -2 0 2 4 6 8-0.1

-0.05

0

0.05

0.1

0.15

0.2

0.25

0.3

Mean = 0.9

Covariance = 0.9

Mean = 0.8

Covariance = 0.5

Weight = 0.2

Weight = 0.3

Feature Dimension 1

Fe

atu

re D

ime

nsio

n 2

Feature Dimension 1

Fe

atu

re D

ime

nsio

n 2

Universal Background Model (UBM) consists

of 1024 Gaussian mixtures

Target speaker model consists of 1024

Gaussian mixtures

Gaussian mixture consists of a mean (µ), covariance (Σ) and weight (w)

1

2

1024

998

1024

998

Page 17: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

Representing GMMs

16

GMM

Mixture 1 Mixture 2 Mixture 1024

1x1

- W

eig

ht

39

x1 M

ean

s

39

x1 C

ova

rian

ces

1x1

- W

eig

ht

39

x1 M

ean

s

39

x1 C

ova

rian

ces

1x1

- W

eig

ht

39

x1 M

ean

s

39

x1 C

ova

rian

ces

1x1024 Weight vector

39x1024 Means matrix

39x1024 Covariances matrix

GMM REPRESENTATION

The UBM and each speaker model is a GMM

Each of them will be represented by a vector of weights, a matrix of means and a matrix of covariances

Page 18: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

Universal Background Models

Generic Male

Generic Female

Speaker Models

Speaker 1 Model

John’s Model

Decision Making

Likelihood S came from speaker model Likelihood S did not come from speaker model

Score, L = log

𝑳 ≶ 𝜽

Reject/Accept

Determine level of Match

Determine level of Match

Likelihood of Generic Male

Likelihood of John

17

Feature Extraction

Page 19: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Score Normalisation

Speaker Verification System 1

Speaker Verification System 2

Speaker Verification System N

Speech

Different Systems perform speaker

verification in parallel

18

Page 20: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Score Normalisation

Speaker Verification System 1

Speaker Verification System 2

Speaker Verification System N

Speech

Score 1

Score 2

Score N

May not fall in the same range. i.e.,

NOT directly comparable

19

Page 21: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Score Normalisation

Speaker Verification System 1

Speaker Verification System 2

Speaker Verification System N

Speech

Score 1

Score 2

Score N

Score Normalisation

Score Normalisation

Score Normalisation

NormalisedScore 1

NormalisedScore 2

NormalisedScore N

Normalised scores (L) are comparable

20

Page 22: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Fusion

21

Final score will be a weighted sum of score from each system

Speaker Verification System 1

Speaker Verification System 2

Speaker Verification System N

Speech

Score Normalisation

Score Normalisation

Score Normalisation

Normalised

Score 1, s1

Normalised

Score 2, s2

Normalised

Score N, sN

Final Score =

w1s1 + w2s2 +…+wNsN

Fusion

Page 23: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

Performance measure

• Types of error:

– Misses: valid identity is rejected o Probability of miss: ratio of the number of falsely rejected

speaker tests to the total number of correct speaker trials.

– False alarms: invalid identity is accepted o Probability of false alarm: ratio of the number of falsely accepted

speaker tests to the total number of impostor trials

TRUE SPEAKER

IMPOSTER

ACCEPT CLAIM REJECT CLAIM

CORRECT DECISION

CORRECT DECISION

FALSE ACCEPTANCE

MISS

22

Page 24: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

Performance measure - Detection error trade-off (DET) curve

False Acceptance Rate (in %)

Mis

s R

ate

(in

%)

Each point on the curve corresponds to a different 𝜃

23

Page 25: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

Performance measure - Detection error trade-off (DET) curve

Equal Error Rate (EER) = 1 %

Wire Transfer:

False acceptance is very costly

Users may tolerate rejections for security

Customization:

False rejections alienate customers

Any customization is beneficial

Application operating point depends on relative costs of the two error types

High Convenience

High Security

Balance

False Acceptance Rate (in %)

Mis

s R

ate

(in

%)

24

Page 26: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

NIST Speaker Recognition Evaluation (SRE)

• Ongoing text independent speaker recognition evaluations conducted by NIST (http://www.itl.nist.gov/iad/mig/tests/spk/)

– driving force in advancing the state-of-the-art

– Conditions for different amounts of data o 10 sec.

o 3-5 minutes

o 8 minutes

o Separate channel and summed channel conditions

– English-speakers, non-English speakers, multilingual speakers

25

Page 27: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

NIST SRE Trends

• 1996 – First SRE in current series

• 2000 – AHUMADA Spanish data, first non-English speech

• 2001 – Cellular data, Automatic Speech Recognition (ASR) transcripts provided

• 2005 – Multiple languages with bilingual speakers, room mic recordings, cross-channel trials

• 2008 – Interview data

• 2010 – High and low vocal effort, aging, HASR (Human-Assisted Speaker Recognition) Evaluation

• 2012 – Broad range of test conditions, with added noise and reverberation, target speakers defined beforehand

26

Page 28: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Basic System

Feature

Extraction

MAP adaptation

Log-likelihood

UBM

Sp

ee

ch

27

Page 29: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Trends

• In 2004’s: Classification

Feature

Extraction

MAP adaptation

Extract Supervectors

UBM

Sp

ee

ch

SVM Scoring

28

Page 30: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Trends

• In 2005’s: Channel compensation - NAP

Feature

Extraction

MAP adaptation

Extract Supervectors

UBM

Sp

ee

ch

NAP SVM Scoring

29

Page 31: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Trends

• In 2007’s: Channel compensation - JFA

Feature

Extraction

Baum-Welch statistics

estimation

Factor analysis

Speaker Factor

ExtractionWCCN SVM

UBMJFA

Hyperparameters (V, U , D)

Sp

ee

ch

30

Page 32: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Trends

• In 2009’s: Channel compensation – i-vector

Feature

Extraction

Baum-Welch statistics

estimation

Factor analysis

i-vector extraction

WCCN LDACosine

Distance Scoring

UBMTotal

Variability Matrix

Sp

ee

ch

31

Page 33: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia

Trends

• In 2009’s: Channel compensation – PLDA

Feature

Extraction

Baum-Welch statistics

estimation

Factor analysis

i-vector extraction

PLDA Log-likelihood

UBMTotal

Variability Matrix

Sp

ee

ch

32

Page 34: Speaker Verification The present and future of voiceprint based … Distinguished Lecture... · 2013. 11. 13. · Speaker Verification System I am John Determine level of Match Determine

APSIPA Asia-Pacific Signal and Information Processing Association

APSIPA Distinguished Lecture Series @ IIU, Malaysia 33