Data-Challenged Speaker Recognition
Hagai Aronowitz, IBM Research - Haifa

Google and SRI talk, September 2016

Page 1: Google and SRI talk September 2016

Data-Challenged Speaker Recognition
Hagai Aronowitz, IBM Research - Haifa

Page 2: Google and SRI talk September 2016


Mobile Person Recognition

Agenda

1. Speaker ID basics

2. Inter-Dataset Variability Compensation (IDVC)

3. Text-dependent Speaker ID with a small devset

4. Audiovisual synchrony detection

Page 3: Google and SRI talk September 2016


Speaker ID Basics

Page 4: Google and SRI talk September 2016


State-of-the-Art Overview

1. Low-level features (1 per 10 ms)
   – Spectral-based (MFCCs + Δ + ΔΔ)

2. High-level features (1 per session)
   – GMM supervectors (concatenated means)
   – I-vectors (GMM supervectors + factor analysis)

3. Modeling and scoring high-level features
   – PLDA: jointly model within-speaker and between-speaker variabilities

4. Calibration
   – Score normalization
   – Convert normalized scores into log-likelihood ratios
   – Use side information (SNR, …)

[Pipeline diagram (Speaker Recognition Overview): low-level feature extraction → x1,…,xT → high-level feature extraction → ϕ, ϕ' → scoring → score → calibration → LLR]

Page 5: Google and SRI talk September 2016


State-of-the-Art Overview

1. Low-level features (1 per 10 ms) – Spectral-based (MFCCs + Δ + ΔΔ)


• Speech is divided into frames of 20ms (with 50% overlap)

• Signal in each frame is assumed stationary

• Frame t is represented by a vector xt

• $x_t \in \mathbb{R}^{39}$ represents the spectral characteristics of frame t
(13 MFCCs + 13 delta MFCCs + 13 double-delta MFCCs)
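As a rough illustration, this front end can be sketched in a few lines of Python. This is a minimal sketch, not the exact IBM front end; librosa and the specific frame parameters are assumptions consistent with the 20 ms / 50% overlap setup described above.

```python
# Sketch: 39-dimensional MFCC + delta + double-delta features, one per 10 ms.
import librosa
import numpy as np

y, sr = librosa.load("session.wav", sr=16000)   # hypothetical input file

n_fft = int(0.020 * sr)                  # 20 ms frames
hop_length = int(0.010 * sr)             # 50% overlap -> 10 ms hop

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop_length)
d1 = librosa.feature.delta(mfcc)           # delta MFCCs
d2 = librosa.feature.delta(mfcc, order=2)  # double-delta MFCCs

x = np.vstack([mfcc, d1, d2]).T          # (T, 39): one vector x_t per frame
```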


Page 6: Google and SRI talk September 2016


State-of-the-Art Overview

2. High-level features (1 per session)

– GMM supervectors (concatenated means)


• A UBM is trained on the dev data (many speakers)

• UBM is adapted to the low-level features of a given session

• GMM means are stacked into a supervector
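A minimal sketch of this step, assuming an sklearn Gaussian mixture as the UBM and the common mean-only MAP adaptation with relevance factor r (the slide does not specify the adaptation details):

```python
# Sketch: UBM training, mean-only MAP adaptation, supervector stacking.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(dev_frames, n_components=512):
    """dev_frames: (N, 39) low-level features pooled over many speakers."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(dev_frames)

def gmm_supervector(ubm, X, r=16.0):
    """MAP-adapt the UBM means to session frames X (T, 39) and stack them."""
    post = ubm.predict_proba(X)               # (T, G) frame posteriors
    N = post.sum(axis=0)                      # per-Gaussian counts
    F = post.T @ X                            # per-Gaussian first-order sums
    Ex = F / np.maximum(N[:, None], 1e-8)     # posterior means
    alpha = (N / (N + r))[:, None]            # data/prior interpolation
    means = alpha * Ex + (1 - alpha) * ubm.means_
    return means.ravel()                      # supervector of stacked means
```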


Page 7: Google and SRI talk September 2016


State-of-the-Art Overview

2. High-level features (1 per session)

– I-vectors (GMM supervectors + factor analysis)

Factor analysis is used for high-level feature extraction:

$M = m + T\phi$

M: supervector for a given session
m: overall mean (UBM supervector)
T: low-rank rectangular matrix (total variability matrix)
ϕ: standard normal random vector (the i-vector)


Page 8: Google and SRI talk September 2016


I-vector Extraction

1. Estimate Baum-Welch (BW) statistics

− Counts: $N_g = \sum_t p(g \mid X_t)$

− Sums: $F_g = \sum_t p(g \mid X_t)\, X_t$

using $p(g \mid X_t) = \dfrac{w_g\, \mathcal{N}(X_t;\, \mu_g, \Sigma_g)}{\sum_{\tilde g} w_{\tilde g}\, \mathcal{N}(X_t;\, \mu_{\tilde g}, \Sigma_{\tilde g})}$

2. The i-vector MAP estimate is $\hat\phi = L^{-1}\, T^t\, \Sigma^{-1} F$

with $L = I + T^t\, \Sigma^{-1} N\, T$

where N and F stack the per-Gaussian counts and (centered) sums, and Σ is a stacking of the UBM covariance matrices
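The two steps translate directly into numpy; this is a sketch under the usual assumptions (diagonal UBM covariances, F centered around the UBM means), not a production extractor:

```python
# Sketch: i-vector MAP estimate from Baum-Welch statistics.
import numpy as np

def extract_ivector(X, post, ubm_means, ubm_vars, T_mat):
    """X: (T, F) frames; post: (T, G) posteriors p(g|X_t);
    ubm_means, ubm_vars: (G, F); T_mat: (G*F, R) total variability matrix."""
    G, F = ubm_means.shape
    R = T_mat.shape[1]
    N = post.sum(axis=0)                         # counts N_g
    Fc = post.T @ X - N[:, None] * ubm_means     # centered sums
    L = np.eye(R)                                # L = I + T' Sigma^-1 N T
    rhs = np.zeros(R)                            # T' Sigma^-1 F
    for g in range(G):
        Tg = T_mat[g * F:(g + 1) * F]            # block of T for Gaussian g
        Tg_s = Tg / ubm_vars[g][:, None]         # Sigma_g^-1 T_g
        L += N[g] * (Tg.T @ Tg_s)
        rhs += Tg_s.T @ Fc[g]
    return np.linalg.solve(L, rhs)               # phi-hat = L^-1 T' Sigma^-1 F
```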


Page 9: Google and SRI talk September 2016


State-of-the-Art Overview

3. Modeling and scoring high-level features – PLDA: jointly model within-speaker and between-speaker variabilities

[Illustration: i-vector space with global mean µ; a speaker ("green speaker") sits at offset s from µ with between-speaker covariance B, and its sessions scatter around it with channel offsets c and within-speaker covariance W]

Page 10: Google and SRI talk September 2016


Probabilistic Linear Discriminant Analysis (PLDA)


The PLDA framework assumes that i-vectors are distributed according to

$\phi = \mu + s + c$

ϕ - i-vector
µ - global mean
s - speaker component
c - channel / within-speaker variability component

s and c are assumed to be independent and normally distributed:

$s \sim \mathcal{N}(0, B), \qquad c \sim \mathcal{N}(0, W)$

The PLDA model is parameterized by {µ, B, W}

Page 11: Google and SRI talk September 2016


PLDA: Details


PLDA training

Hyper-parameters W and B are trained using EM (on a devset)

PLDA scoring

Given i-vectors x and y:

$p(x) = \int_s p(x \mid s)\, p(s)\, ds = \mathcal{N}(x;\, \mu,\, \Sigma_{\mathrm{tot}})$

$p(x, y \mid \mathrm{H}_{\mathrm{same}}) = \int_s p(x \mid s)\, p(y \mid s)\, p(s)\, ds = \mathcal{N}\!\left(\begin{bmatrix} x \\ y \end{bmatrix};\, \begin{bmatrix} \mu \\ \mu \end{bmatrix},\, \begin{bmatrix} \Sigma_{\mathrm{tot}} & B \\ B & \Sigma_{\mathrm{tot}} \end{bmatrix}\right)$

$\mathrm{score} = \log \dfrac{p(x, y \mid \mathrm{H}_{\mathrm{same}})}{p(x)\, p(y)}$

with $\Sigma_{\mathrm{tot}} = B + W$

Page 12: Google and SRI talk September 2016


PLDA: Details (cont.)


PLDA scoring (cont.)

for µ=0:

$\mathrm{score} = x^t Q x + y^t Q y + 2\, x^t P y + \mathrm{const}$

with

$Q = \Sigma_{\mathrm{tot}}^{-1} - \left(\Sigma_{\mathrm{tot}} - B\, \Sigma_{\mathrm{tot}}^{-1} B\right)^{-1}$

$P = \Sigma_{\mathrm{tot}}^{-1} B \left(\Sigma_{\mathrm{tot}} - B\, \Sigma_{\mathrm{tot}}^{-1} B\right)^{-1}$
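A short sketch of this closed form (assuming centered i-vectors, i.e. µ = 0 as above):

```python
# Sketch: closed-form PLDA verification score from {B, W}.
import numpy as np

def plda_score(x, y, B, W):
    """x, y: centered i-vectors; B, W: between/within covariances."""
    S_tot = B + W
    S_inv = np.linalg.inv(S_tot)
    M = np.linalg.inv(S_tot - B @ S_inv @ B)
    Q = S_inv - M
    P = S_inv @ B @ M
    return x @ Q @ x + y @ Q @ y + 2 * (x @ P @ y)   # constant term dropped
```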

Page 13: Google and SRI talk September 2016


State-of-the-Art Overview

4. Calibration
   – Score normalization
   – Convert normalized scores into log-likelihood ratios
   – Use side information (SNR, …)



Score normalization is not required if models are trained properly

─ with lots of data

─ no mismatch

Otherwise, score normalization is essential because it cancels out modeling inaccuracies and various artifacts

Page 14: Google and SRI talk September 2016


Znorm


Znorm standardizes the distribution of scores for speaker S, given a calibration set of impostor test sessions

Let φ(S,Y) be the score of test session Y for speaker S

Estimate the mean and variance of φ(S,·):

$\mu_Z(S) = \mathrm{E}_Y[\varphi(S, Y)], \qquad \sigma_Z^2(S) = \mathrm{Var}_Y[\varphi(S, Y)]$

Standardize φ(S,Y):

$\mathrm{Znorm}(S, Y) = \dfrac{\varphi(S, Y) - \mu_Z(S)}{\sigma_Z(S)}$

R. Auckenthaler et al., "Score Normalization for Text-Independent Speaker Verification Systems," Digital Signal Processing, vol. 10, no. 1-3, 2000.
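In code, Znorm is a few lines; this sketch assumes a generic score function and a list of impostor sessions:

```python
# Sketch: Znorm score normalization for one enrolled speaker.
import numpy as np

def znorm_params(score, speaker, impostor_sessions):
    """Estimate mu_Z(S) and sigma_Z(S) from impostor test sessions."""
    scores = np.array([score(speaker, y) for y in impostor_sessions])
    return scores.mean(), scores.std(ddof=1)

def znorm_score(score, speaker, test_session, mu_z, sigma_z):
    return (score(speaker, test_session) - mu_z) / sigma_z
```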


Page 15: Google and SRI talk September 2016


Speaker ID: Progress of State-of-the-Art

Year 1995 2001 2005 2008 2011 2016

Algorithm GMM GMM-UBM NAP JFA PLDA DNN+PLDA

EER 10% 6% 4% 2% 1% 0.6%

[Chart: EER (in %) vs. year, 1995–2015]


Page 16: Google and SRI talk September 2016


Inter-Dataset Variability Compensation (IDVC)

Page 17: Google and SRI talk September 2016


Robustness to Data Mismatch

The i-vector PLDA scheme requires

• 1000s of subjects with several recordings per subject

• Matched recording devices and channels

In practice, we may have plenty of data from a source domain, but limited or no data from the target domain

• We address the setup when no data is available from the target domain

• Our method is named Inter-Dataset Variability Compensation (IDVC)


Page 18: Google and SRI talk September 2016


The Domain Robustness Challenge

A major topic in the JHU-2013 speaker recognition workshop

Setup

• Eval data: NIST 2010 SRE – telephone data

• SWB: Phone calls recorded during 1997–2004 (Switchboard 1 & 2)

• MIXER: Data from NIST evaluations 2004–2008

Goal

• Reduce degradation when training on SWB (no use of MIXER)


PLDA training EER (in %)

MIXER 2.4

SWB 8.2


Page 19: Google and SRI talk September 2016


Motivation for the IDVC Method

Find a projection P in i-vector space that minimizes the distance between the PLDA models for the source and the (unknown) target data.

[Diagram: source and target data are both mapped through projection P; PLDA models are then trained on the projected source data and the projected target data]


Page 20: Google and SRI talk September 2016


IDVC with Unseen Target Data

How can we find projection P if target data is unknown?

• We divide the available dev data into homogeneous subsets

• We estimate P

• We expect/hope P to generalize well to unseen target data

Success is based on the availability of heterogeneous dev data

• Existence of distinct subsets is required

• Subsets may either be labeled or found using clustering


Page 21: Google and SRI talk September 2016


Finding Projection P

• We represent each subset by a vector in the hyper-parameter space

• We find and remove a subspace of the i-vector space that would effectively reduce variability in the PLDA hyper-parameter space

[Diagram: each subset is represented by a point (µ, B, W) in the PLDA hyper-parameter space]


Page 22: Google and SRI talk September 2016


IDVC - Outline

1. Partition the dev data into distinct homogeneous subsets

2. Estimate PLDA hyper-parameters {µ, B, W} for each subset

3. Estimate “bad” i-vector subspaces: Sµ, SW, SB

4. Join the "bad" subspaces to form a single "bad" subspace: $S = S_\mu \cup S_W \cup S_B$

• Subspace S is removed from the i-vectors as a pre-processing cleanup step
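A minimal sketch of the µ-only variant of this outline (steps 1-2 reduce to one mean i-vector per subset; the subspace is found by PCA as described on the next slide):

```python
# Sketch: mu-only IDVC - PCA on per-subset mean i-vectors, then removal.
import numpy as np

def idvc_basis(subsets, dim):
    """subsets: list of (n_i, D) i-vector arrays, one per homogeneous subset."""
    mus = np.stack([s.mean(axis=0) for s in subsets])  # one mu_i per subset
    mus -= mus.mean(axis=0)
    _, _, Vt = np.linalg.svd(mus, full_matrices=False) # PCA of the means
    return Vt[:dim].T                                  # (D, dim) "bad" subspace

def remove_subspace(phi, V):
    """Pre-processing cleanup: project the "bad" subspace out of i-vector phi."""
    return phi - V @ (V.T @ phi)
```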


Page 23: Google and SRI talk September 2016


Estimating Subspace for Hyper-Parameter µ

• PCA is applied on the set of vectors {µi}

• The optimal dimension of Sµ is a function of the expected magnitude of mismatch

• Note: speaker labels are not required for estimating the {µi} hyper-parameters

• This is the original IDVC method developed in JHU-2013:

H. Aronowitz, "Inter Dataset Variability Compensation for Speaker Recognition", in Proc. ICASSP, 2014.


Page 24: Google and SRI talk September 2016


Estimating Subspaces for Covariance Matrices

• This is the extended IDVC method

• Given a set of n covariance matrices {Wi}, we denote the mean covariance by $\bar{W}$

• Goal: find directions in the i-vector space with maximal variability of normalized variance across different datasets

• Define: v - unit vector (direction in i-vector space)

• The variance of Wi along direction v equals $v^t W_i v$

• Define the variability of normalized variance w.r.t. v:

$\mathrm{var}_i\!\left(\dfrac{v^t W_i v}{v^t \bar{W} v}\right)$

H. Aronowitz, “Compensating Inter-Dataset Variability in PLDA Hyper-Parameters for Robust Speaker Recognition”, in Proc. Speaker Odyssey, 2014.


Page 25: Google and SRI talk September 2016


Estimating Subspaces for Covariance Matrices (2)

Algorithm

1. Whiten the i-vector space with respect to $\bar{W}$
   • Calculate the square root R of $\bar{W}^{-1}$
   • After whitening: $\tilde{W}_i = R\, W_i\, R$

2. Compute $\Omega = \frac{1}{n} \sum_i \tilde{W}_i^2$

3. Find the k top eigenvectors of Ω: $v_1, \dots, v_k$

4. Span the "bad" subspace using $R^{-1} v_1, \dots, R^{-1} v_k$
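A sketch of this algorithm in numpy/scipy (following the reconstruction above; the B hyper-parameter is handled analogously to W):

```python
# Sketch: "bad" subspace for the covariance hyper-parameters.
import numpy as np
from scipy.linalg import sqrtm, eigh

def covariance_bad_subspace(W_list, k):
    """W_list: per-subset covariance matrices; returns (D, k) basis vectors."""
    W_bar = np.mean(W_list, axis=0)
    R = np.real(sqrtm(np.linalg.inv(W_bar)))       # square root of W_bar^-1
    W_tilde = [R @ W @ R for W in W_list]          # whitened covariances
    Omega = np.mean([Wt @ Wt for Wt in W_tilde], axis=0)
    _, vecs = eigh(Omega)                          # eigenvalues ascending
    V = vecs[:, -k:]                               # k top eigenvectors of Omega
    return np.linalg.inv(R) @ V                    # span of the "bad" subspace
```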

Page 26: Google and SRI talk September 2016


Estimating Subspaces for Covariance Matrices (3)

Claim 1: In the whitened space, the top eigenvector v1 of Ω maximizes the objective on the whitened matrices:

$v_1 = \arg\max_v\, \mathrm{var}_i\!\left(\dfrac{v^t \tilde{W}_i v}{v^t v}\right)$

Claim 2: v1, transformed back to the original i-vector space, maximizes the variability of the normalized variances in the original space:

$\arg\max_v\, \mathrm{var}_i\!\left(\dfrac{v^t W_i v}{v^t \bar{W} v}\right)$

Page 27: Google and SRI talk September 2016


Estimating Subspaces for Covariance Matrices (4)

Proof of Claim 1:

Let v be a unit vector; after whitening, $\bar{\tilde{W}} = I$, so $v^t \bar{\tilde{W}} v = 1$. Then

$\arg\max_v\, \mathrm{var}_i\!\left(\dfrac{v^t \tilde{W}_i v}{v^t v}\right) = \arg\max_v\, \mathrm{var}_i\!\left(v^t \tilde{W}_i v\right) = \arg\max_v \left[\tfrac{1}{n} \sum_i \left(v^t \tilde{W}_i v\right)^2 - 1\right] \overset{(*)}{=} \arg\max_v\, v^t \Omega\, v = v_1$

Page 28: Google and SRI talk September 2016


Estimating Subspaces for Covariance Matrices (5)

Proof of Claim 2:

Substitute $v = R\,u$ and use $\tilde{W}_i = R\, W_i\, R$ together with $R\, \bar{W}\, R = I$: then $v^t W_i v = u^t \tilde{W}_i u$ and $v^t \bar{W} v = u^t u$, so the original-space objective at v equals the whitened-space objective of Claim 1 at u. The maximization therefore reduces to Claim 1, and the maximizer corresponds to v1 mapped back to the original i-vector space.

Page 29: Google and SRI talk September 2016


Estimation without Speaker Labels

• Estimation of subspaces SW and SB requires speaker labels

• T denotes the total covariance matrix of the i-vectors of the dev data (in a given subset)

• T can be estimated without speaker labels

• Note that for typical datasets, $T \approx B + W$

→ An i-vector subspace that contains high inter-dataset variability in the T hyper-parameter will also contain high inter-dataset variability in either W or B (and vice versa)


Page 30: Google and SRI talk September 2016


Partitioning the Dev Data

Switchboard data consists of 6 releases:

Code      Description
97S62     SWB-1 Release 2
98S75     SWB-2 Phase I
99S79     SWB-2 Phase II
2001S13   SWB Cellular Part 1
2002S06   SWB-2 Phase III
2004S07   SWB Cellular Part 2

We defined 3 different partitions:

GI-6    Each release in a separate subset
GD-12   A further partition into male and female subsets
GI-2    One telephone (97S62) and one cellular (2004S07) subset


Page 31: Google and SRI talk September 2016


Selected Results

Method                                     EER (in %)
Baseline (dev=MIXER)                       2.4
Baseline (dev=SWB)                         8.2
IDVC: µ only                               3.8
IDVC: µ, B(30), W(30)                      3.0
No speaker labels, IDVC: µ, T(30), GI-2    3.3


Results use the GD-12 partition (GI-6 gives similar results)

Page 32: Google and SRI talk September 2016


IDVC Results: µ only

IDVC     EER (in %)   DCF (old)   DCF (new)
No       8.2          0.33        0.69
µ (10)   3.8          0.19        0.53

Page 33: Google and SRI talk September 2016


IDVC Results: W / B only (PLDA trained on SWB)

IDVC      EER (in %)   DCF (old)   DCF (new)
No        8.2          0.33        0.69
W (100)   3.4          0.16        0.50
B (100)   3.5          0.16        0.50

Page 34: Google and SRI talk September 2016


Conclusions

1. IDVC has been shown to effectively reduce the influence of dataset variability in the domain robustness challenge

• EER was decreased by 63% (90% recovery)

• DCF (old) was decreased by 58% (91% recovery)

• DCF (new) was decreased by 32% (71% recovery)

2. IDVC works well even when trained

• on two subsets only

• without speaker labels


Page 35: Google and SRI talk September 2016


Text Dependent Speaker ID with a Small Devset

Page 36: Google and SRI talk September 2016


Training with a Small Devset

Why train with a small devset?

TD training is more effective than TI training

TD data collection is expensive

Techniques

1. Robust estimation of the GMM-supervector covariance matrix. Key idea: soft independence of parameters

2. Stabilization of the score-normalization parameters. Key idea: remove a subspace of the high-level-feature space that accounts for maximal estimation error of the score-normalization parameters

3. Minimize the combined effect of within-speaker variability and hyper-parameter estimation error using a matched filter on the high-level features


Page 37: Google and SRI talk September 2016


GMM Supervectors

A GMM supervector is a stacking of g (~500) Gaussian means, each of which is f (~40) dimensional


Page 38: Google and SRI talk September 2016

GMM-NAP System Description

[Diagram 1, NAP Subspace Estimation: MFCC features from pairs of dev sessions → GMM supervectors → GMM-nuisance supervectors → intersession covariance matrix → NAP subspace]

[Diagram 2, GMM-NAP Scoring: MFCC features → GMM supervector → NAP compensation → scoring (dot product) → ZT-score normalization]

Page 39: Google and SRI talk September 2016


Text Dependent Speaker ID with a Small Devset

Robust Estimation of Supervector Covariance

Page 40: Google and SRI talk September 2016


Robust Estimation of Supervector Covariance

[Diagram: groups G1, G2, G3 of covariance-matrix entries induced by the GMM-supervector structure]

Define groups G0,G1,… inspired by GMM-supervector structure

Estimate distribution of covariance matrix entries for each group (i.e., mean and variance) from observed data

Estimate each covariance entry:

$\hat{w}_c = \arg\max_w\, p\left(w \mid G_0, G_1, \dots\right)$

─ Assuming normal distributions and group independence, the estimate reduces to a precision-weighted combination of the observed entry and its group statistics


Page 41: Google and SRI talk September 2016


Gaussian-based Smoothing (GBS)


1. Whiten features (in supervector space)

Estimate a linear transformation of the feature space that minimizes the average (over all Gaussians) off-diagonal correlations in the supervector covariance matrix, and normalizes the average diagonal elements to 1

2. Smooth the sample covariance matrix (Relaxed Block Diagonal):

• Covariance between two supervector components that correspond to different feature coefficients and different Gaussians is zero:
  $C_{(f_1, g_1), (f_2, g_2)} = 0$ for $f_1 \neq f_2,\ g_1 \neq g_2$

• Covariance between two supervector components that correspond to different feature coefficients (but the same Gaussian) depends on the Gaussian index only:
  $C_{(f_1, g), (f_2, g)} = \alpha_g$ for $f_1 \neq f_2$

• Covariance between two supervector components that correspond to the same feature coefficient depends on the Gaussian indices only:
  $C_{(f, g_1), (f, g_2)} = \beta_{g_1, g_2}$

3. De-whiten (reverse step 1)
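A sketch of the smoothing in step 2, operating on an already-whitened sample covariance of the supervectors (steps 1 and 3 are omitted; tying each class of entries to its mean is one plausible realization of the rules above):

```python
# Sketch: Relaxed Block Diagonal smoothing of a (G*F, G*F) covariance matrix.
import numpy as np

def gbs_smooth(C, G, F):
    C4 = C.reshape(G, F, G, F)                # C4[g1, f1, g2, f2]
    S = np.zeros_like(C4)
    diag = np.eye(F, dtype=bool)
    for g1 in range(G):
        for g2 in range(G):
            block = C4[g1, :, g2, :]
            beta = np.trace(block) / F        # same coefficient: beta_{g1,g2}
            S[g1, :, g2, :] = beta * np.eye(F)
            if g1 == g2:                      # same Gaussian, different coeffs
                S[g1, :, g2, :][~diag] = block[~diag].mean()   # alpha_g
    return S.reshape(G * F, G * F)
```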

Page 42: Google and SRI talk September 2016


Text Dependent Speaker ID with a Small Devset

Score Stabilization

Page 43: Google and SRI talk September 2016


Motivation

Score normalization
• Essential for GMM-NAP
• Contributes to PLDA when PLDA is poorly trained
  − Small devset or data mismatch

Observation
Score normalization is very sensitive to devset size

Experiment
• Common passphrase
• Data: Wells Fargo
• Large dev: 200 speakers × 4 sessions
• Small dev: 20/30 speakers × 2 sessions
• NAP trained on the small dev

Score-norm data   20 speakers EER (in %)   30 speakers EER (in %)
Large dev         1.7                      1.5
Small dev         2.8                      2.4
Degradation       65%                      60%


Page 44: Google and SRI talk September 2016


Znorm

Znorm normalizes impostor scores for an enrolled supervector s to zero mean and unit variance

x - test supervector
φ(·,·) - scoring function
µZ(s), σZ(s) - mean and std parameters for supervector s

$\mathrm{Znorm}(s, x) = \dfrac{\varphi(s, x) - \mu_Z(s)}{\sigma_Z(s)}$   (1)

$\mu_Z(s) = \mathrm{E}_{x'}\left[\varphi(s, x')\right]$   (2)

$\sigma_Z^2(s) = \mathrm{var}_{x'}\left[\varphi(s, x')\right]$   (3)

Page 45: Google and SRI talk September 2016

Score Stabilization: Motivation

• Accuracy degrades because the estimates of the score-norm parameters µ(s) and σ(s) are too noisy

• We want to reduce estimation noise of the score-norm parameters

• We seek a low-dimensional subspace of the supervector space that accounts for high variability in the score-norm parameters

• Our analysis is done for Znorm but is valid for Tnorm and ZTnorm as well


Page 46: Google and SRI talk September 2016

Method

Given a devset of supervectors X = {x1,…,xn}, the unbiased estimates of the Znorm parameters are

$\hat{\mu}_Z(s, X) = \frac{1}{n} \sum_{x \in X} \varphi(s, x)$   (4)

$\hat{\sigma}_Z^2(s, X) = \frac{1}{n-1} \sum_{x \in X} \left(\varphi(s, x) - \hat{\mu}_Z(s, X)\right)^2$   (5)

Goal: minimize the expected variance of Znorm(s,x):

$\mathrm{E}_{s,x}\!\left[\mathrm{var}_X\!\left(\dfrac{\varphi(s, x) - \hat{\mu}_Z(s, X)}{\hat{\sigma}_Z(s, X)}\right)\right]$   (6)

In practice, we independently minimize the expected variances of $\hat{\mu}_Z(s, X)$ and $\hat{\sigma}_Z(s, X)$

Page 47: Google and SRI talk September 2016


Stabilization of σZ(s,X)

Assumptions

• Impostor scores distribute normally (per speaker)

• Scoring is a dot product; WLOG $\mathrm{E}[x] = \mathrm{E}[s] = 0$

Note that

$\mu_Z(s) = 0, \qquad \hat{\sigma}_Z^2(s, X) = s^t\, \widehat{\mathrm{cov}}_{x \in X}(x)\, s$   (7)

Therefore

$\mathrm{var}_X\!\left(\hat{\sigma}_Z^2(s, X)\right) = \tfrac{2}{n-1}\, \sigma_Z^4(s) = \tfrac{2}{n-1} \left(s^t\, \mathrm{cov}(x)\, s\right)^2$   (8)

$\mathrm{E}_s\!\left[\mathrm{var}_X\!\left(\hat{\sigma}_Z^2(s, X)\right)\right] \propto \tfrac{2}{n-1}\, \mathrm{tr}\!\left(\mathrm{cov}^4(x)\right)$   (9)

Conclusion: the subspace to be removed is spanned by the top eigenvectors of cov(x), which is the total variability covariance matrix
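A sketch of this conclusion as a pre-processing step (the rank of the removed subspace is the SS-n parameter in the result tables below):

```python
# Sketch: score stabilization by removing top total-variability directions.
import numpy as np

def stabilization_basis(dev_supervectors, rank=25):
    """PCA of the devset supervectors: top eigenvectors of cov(x)."""
    Xc = dev_supervectors - dev_supervectors.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:rank].T                     # (D, rank)

def stabilize(s, V):
    """Remove the unstable subspace from supervector s before scoring."""
    return s - V @ (V.T @ s)
```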

Page 48: Google and SRI talk September 2016


Stabilization of µZ(s,X) / σZ(s,X)

Further assumption: $\hat{\sigma}_Z(s, X)$ is already stabilized, i.e. $\hat{\sigma}_Z(s, X) \approx \sigma_Z(s, X)$. Then

$\dfrac{\mathrm{var}_X\!\left(\hat{\mu}_Z(s, X)\right)}{\sigma_Z^2(s, X)} = \dfrac{\tfrac{1}{n}\, s^t\, \mathrm{cov}(x)\, s}{s^t\, \mathrm{cov}(x)\, s} = \dfrac{1}{n}$   (13)

→ no further stabilization is possible using a subspace-removal technique

Page 49: Google and SRI talk September 2016


Dataset: Text Dependent (Wells Fargo)

• 200 speakers for dev + 550 for eval

• 2 landline (LL) + 2 cellular (CC) sessions per speaker

• Common passphrase (0-1-2-3-4-5-6-7-8-9)

• 3 repetitions for enrollment, 1 for verification

• Reduced: 20-50 speakers with 2/4 sessions per speaker


Page 50: Google and SRI talk September 2016


Text Dependent Results (ML-based intersession matrix estimation)

SS-n: rank of the removed subspace is n

System                  20LC  20    30CC  30LL  30RR  30LC  30    50LC  50    200LLCC
NAP 10                  2.8   2.5   3.2   3.3   3.5   2.4   2.1   1.8   1.6   1.0
NAP 10 SS-10            2.4   2.0   2.7   2.4   2.4   2.0   1.8   1.6   1.4   1.1
NAP 10 SS-25            2.3   2.0   2.4   2.4   2.4   2.1   1.8   1.7   1.5   1.1
NAP 10 SS-50            2.3   1.9   2.4   2.6   2.5   2.1   1.8   1.8   1.5   1.1
NAP 10 Norm-full        1.7   1.6   2.0   2.0   1.9   1.5   1.4   1.5   1.2   1.0
EER reduction (SS-25)   18%   20%   25%   27%   31%   13%   14%   6%    6%    -10%
Recovery rate (SS-25)   45%   56%   67%   69%   69%   33%   43%   33%   25%   -10%

Page 51: Google and SRI talk September 2016


Text Dependent Results (2) (GBS-based intersession matrix estimation)

SS-n: rank of the removed subspace is n

System                  20LC  20    30CC  30LL  30RR  30LC  30    50LC  50    200LLCC
GBS                     2.5   2.3   2.7   2.7   2.7   2.2   2.1   1.8   1.8   1.6
GBS SS-10               2.1   2.0   2.2   2.1   2.1   1.9   1.8   1.7   1.6   1.3
GBS SS-25               2.1   1.9   2.2   2.1   2.0   1.9   1.8   1.7   1.6   1.3
GBS SS-50               2.1   1.8   2.2   2.3   2.4   1.9   1.7   1.7   1.6   1.4
EER reduction (SS-25)   16%   17%   19%   22%   26%   14%   14%   6%    11%   19%

Page 52: Google and SRI talk September 2016


Text Dependent Results (3)

SS25: rank of the removed subspace is 25

[Chart: EER (in %) per subset (20LC, 20, 30CC, 30LL, 30RR, 30LC, 30, 50LC, 50, 200LLCC) for NAP10, NAP10+SS25, GBS, and GBS+SS25]

Page 53: Google and SRI talk September 2016


Conclusions

1. Removing the top eigenvectors of the total variability covariance matrix stabilizes the score-normalization parameters

2. The method reduces error when the devset is small

3. The method combines well with Gaussian-based smoothing (GBS) for NAP estimation

Page 54: Google and SRI talk September 2016


Text Dependent Speaker ID with a Small Devset

Matched Filter for Speaker Recognition

Page 55: Google and SRI talk September 2016


Matched Filter for Speaker Recognition

Motivation

• Cosine distance is fundamental in speaker ID. It motivates:
  − Score normalization
  − I-vector centering
  − I-vector length normalization
  − Domain-mismatch compensation (such as IDVC)

• Questions:
  − Why is cosine distance needed?
  − Are our generative models in line with cosine distance?

• Cosine-distance-based geometry requires knowledge of the center
  − How can we handle uncertainty in the mean?
    o Domain mismatch: IDVC
    o Training from a small devset: this work

Page 56: Google and SRI talk September 2016


Speaker Geometry

[Figure: speaker geometry in i-vector space. Modified from: N. Dehak et al., "Front-End Factor Analysis for Speaker Verification", IEEE Transactions on Audio, Speech, and Language Processing, May 2011.]

Page 57: Google and SRI talk September 2016


Matched Filters

• An observed signal x is the sum of a desirable signal s and an additive noise v:

$x = s + v$   (1)

• We seek a filter h that maximizes the output signal-to-noise ratio, where the output is the inner product of the filter and the observed signal x:

$h = \alpha\, \mathrm{cov}^{-1}(v)\, s$   (2)

where α is a scaling constant (which cancels out after score normalization).

• The scoring function is therefore:

$\mathrm{score} = s^t\, \mathrm{cov}^{-1}(v)\, x$   (3)

Page 58: Google and SRI talk September 2016


Speaker ID Generative Model

Given a high-level feature vector x extracted from a session, consider the model

$x = c_x\,(s + n_x)$   (4)

s - mean high-level feature vector representing the speaker
nx - session-dependent intra-speaker nuisance vector
µ - center/mean of the speaker-population distribution of high-level features
cx - session-dependent scaling factor (the basis for score normalization, cosine-distance scoring, and i-vector length normalization)

Page 59: Google and SRI talk September 2016


Speaker ID Generative Model (cont.)

Given a pair of high-level features x and y, we apply the matched-filter framework:

$\tilde{x}, \tilde{y}$ - x and y after centering and length normalization; up to scaling, $\tilde{x} = (s - \mu) + n_x - \delta$, and similarly for $\tilde{y}$   (5)

δ - center estimation error (bias)

We can apply matched-filter theory, for which

W - intra-speaker covariance matrix: $W = \mathrm{cov}(n)$

Δ - center-uncertainty covariance matrix: $\Delta = \mathrm{cov}(\delta)$   (6)

Page 60: Google and SRI talk September 2016


Speaker ID Generative Model (cont.)

Note that a reasonable estimate for Δ is based on the sample total covariance matrix T (estimated from the devset):

$\Delta \approx T / m$   (7)

m - number of speakers in the development dataset

We obtain the following scoring function:

$f(x, y) = \tilde{y}^t \left(W + T/m\right)^{-1} \tilde{x}$   (8)
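A sketch of scoring function (8); W, T, and m come from the devset, and x̃, ỹ are assumed already centered and length-normalized:

```python
# Sketch: matched-filter scoring with center-uncertainty compensation.
import numpy as np

def matched_filter_scorer(W, T, m):
    """W: intra-speaker covariance; T: total covariance; m: # dev speakers."""
    H = np.linalg.inv(W + T / m)          # (W + Delta)^-1 with Delta ~ T/m
    def score(x_tilde, y_tilde):
        return y_tilde @ H @ x_tilde      # f(x, y) of Eq. (8)
    return score
```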

Page 61: Google and SRI talk September 2016


Training with a Small Devset: Results

System                     20LC  20    30CC  30LL  30LC  30    50LC  50    200LLCC
Baseline                   2.8   2.5   3.2   3.3   2.4   2.1   1.8   1.6   1.0
Robust Smoothing (RS)      2.5   2.3   2.7   2.7   2.2   2.1   1.8   1.8   1.6
Score stabilization (SS)   2.3   2.0   2.4   2.4   2.1   1.8   1.7   1.5   1.1
Matched filter             2.2   1.9   2.3   2.3   1.9   1.7   1.6   1.4   0.9
RS + SS                    2.1   1.9   2.2   2.1   1.9   1.8   1.7   1.6   1.3
Error reduction (in %), RS + SS:          25    24    31    36    21    14    6     -     -
Error reduction (in %), Matched filter:   21    24    28    30    21    19    11    13    10

[1] H. Aronowitz, "Exploiting Supervector Structure for Speaker Recognition Trained on a Small Development Set", in Proc. Interspeech, 2015.
[2] H. Aronowitz, "Score Stabilization for Speaker Recognition Trained on a Small Development Set", in Proc. Interspeech, 2015.
[3] H. Aronowitz, "Speaker Recognition Using Matched Filters", in Proc. ICASSP, 2016.


Page 62: Google and SRI talk September 2016


Audiovisual Synchrony Detection

Page 63: Google and SRI talk September 2016


Mobile Biometric Authentication


Background

• User is authenticated using a mobile device (smartphone/tablet)

• Input is audiovisual

• Biometric authentication is done using either: − Speaker ID

− Face ID

− Fusion of both

• EERs of ~0.1% may be obtained by combining the speaker and face modalities

• This emphasizes the need for liveness / spoofing detection


Page 64: Google and SRI talk September 2016


Spoofing Attacks


Audio modality

• Playback

• Splicing (audio ‘cut and paste’)

• Voice transformation

• TTS (mostly adaptive-TTS)

Face modality

• Video attack

• Photo attack

Audiovisual modality

• Modalities spoofed independently
  − Modalities not synchronized

• Modalities spoofed jointly
  − Much harder
  − Synchrony is not easy to obtain

Page 65: Google and SRI talk September 2016


Synchrony Detection

Related work

• Not an easy task
  − Bregler and Konig report that mutual information between the audio and video streams was maximal when the lag between them was ~120 ms
  − Lags are highly context-sensitive

• Performance is limited
  − Accuracy
  − Robustness
  − Training requirements

• Previous works target general-purpose synchrony detection
  − Text independence
  − Speaker independence
  − Correlation-based approaches obtain poor performance


Page 66: Google and SRI talk September 2016


Text Dependent Synchrony Detection


Setup

• Fixed passphrase (e.g. My voice is my password, OK Google, etc.)

Goal

• Given an enrollment clip from a target user, verify the synchronization of a test clip supposedly spoken by the target speaker

Main idea

• Exploit text dependency

• Avoid direct comparison of speech and image (apples to oranges)

• The method is applicable to user-selected passphrases as well


Page 67: Google and SRI talk September 2016


Outline of Method


• Temporal alignments are computed independently for the audio and visual modalities

• Alignments are expected to be similar for synchronized recordings


Page 68: Google and SRI talk September 2016


Audio-based Alignment

• Voice Activity Detection (VAD): Energy-based

• MFCC feature extraction

• Dynamic Time Warping (DTW)
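As a rough sketch of these three steps (librosa is an assumption, and the energy-based VAD step is omitted):

```python
# Sketch: DTW alignment of enrollment and test MFCC sequences.
import librosa

def audio_alignment(enroll_wav, test_wav, sr=16000):
    ye, _ = librosa.load(enroll_wav, sr=sr)
    yt, _ = librosa.load(test_wav, sr=sr)
    Ce = librosa.feature.mfcc(y=ye, sr=sr, n_mfcc=13)   # (13, T_enroll)
    Ct = librosa.feature.mfcc(y=yt, sr=sr, n_mfcc=13)   # (13, T_test)
    D, wp = librosa.sequence.dtw(X=Ce, Y=Ct, metric="euclidean")
    return wp[::-1]    # warping path as (enroll_frame, test_frame) pairs
```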


Page 69: Google and SRI talk September 2016


Visual-based Alignment

• Detect mouth region of interest and normalize it

• Extract visual features

• Dynamic Time Warping (DTW)


Page 70: Google and SRI talk September 2016


Mouth Detection and Normalization

• Face detection using Viola-Jones

• Lips detection using Active Shape Model (ASM)

• A 50x30 pixel region is cropped around the lips


Page 71: Google and SRI talk September 2016


Visual Features

Histogram of Oriented Gradients (HOG)

• A popular shallow feature
  − each sample is evenly divided into 16×16 cells
  − for each cell, we calculate a histogram of 8 gradient-orientation bins (in 0–2π)

• We found HOG unsuitable for finding good DTW alignments between clips

Image credit: https://github.com/pavitrakumar78/Python-Custom-Digit-Recognition
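For reference, the HOG baseline can be sketched with scikit-image (an assumption; the exact cell geometry used on the 50x30 crops is not fully specified on the slide):

```python
# Sketch: HOG descriptor of a normalized mouth crop, 8 orientation bins.
from skimage.feature import hog

def hog_features(lip_crop):
    """lip_crop: (30, 50) grayscale mouth region."""
    return hog(lip_crop, orientations=8, pixels_per_cell=(16, 16),
               cells_per_block=(1, 1), feature_vector=True)
```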


Page 72: Google and SRI talk September 2016


Deep Visual Features

Architecture

• The input is a stack of 5 consecutive 50×30 lip crops

• 3 convolution layers

• 2 Fully Connected (FC) layers

• ReLUs applied after each layer, except for the last FC layer

[Diagram: input lip-stack vectors → convolutional and FC layers → visual features]
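A sketch of such a network in PyTorch (a framework assumption; channel counts, kernel sizes, and the embedding size are illustrative, as the slide fixes only the layer structure):

```python
# Sketch: 3 conv layers + 2 FC layers; ReLU after each layer except the last.
import torch
import torch.nn as nn

class LipEmbedder(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 7 * 12, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),        # last FC layer: no ReLU
        )

    def forward(self, x):                   # x: (N, 5, 30, 50) lip stack
        return self.fc(self.conv(x))
```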


Page 73: Google and SRI talk September 2016


DNN Training

• The network is trained using a Siamese architecture

• In each iteration the network is given a pair of lip stacks, V1 and V2, and a label y (0/1: synchronized/unsynchronized)

• The loss function L is

$L = \dfrac{1}{2N} \sum_{i=1}^{N} \left[(1 - y_i)\, d_i^2 + y_i \max(0,\ \delta - d_i)^2\right]$

where yi and di are the label and the Euclidean distance for the i-th lip-stack pair, respectively, and δ is a predefined margin
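This is the standard contrastive loss; a sketch in PyTorch under the label convention above (y = 0 synchronized, y = 1 unsynchronized):

```python
# Sketch: contrastive loss for the Siamese lip-stack network.
import torch

def contrastive_loss(f1, f2, y, delta=1.0):
    """f1, f2: (N, D) embeddings of the paired lip stacks; y: (N,) labels."""
    d = torch.norm(f1 - f2, dim=1)                   # Euclidean distances d_i
    pos = (1 - y) * d.pow(2)                         # pull synchronized pairs
    neg = y * torch.clamp(delta - d, min=0).pow(2)   # push others past margin
    return (pos + neg).mean() / 2
```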


Page 74: Google and SRI talk September 2016


DNN Training: Creation of Training Pairs

Goal

Encourage DNN to produce visual features that mimic the correspondence between MFCC features

Positive pairs

1. Select a pair of clips with the same person and same text

2. Find the optimal DTW alignment for the audio stream using MFCCs

3. Map the audio-based alignment into a visual-based alignment

4. For each pair of aligned visual frames form a positive training sample using the corresponding lip stacks

Negative pairs

1. Select visual pairs corresponding to pairs of audio frames with maximal pairwise MFCC distance (off the alignment path)


Page 75: Google and SRI talk September 2016


DNN Training: Creation of Training Pairs

Example

[Figure: MFCC distance matrix of two clips; fa and fc represent a negative pair, while fb and fd represent a positive pair]


Page 76: Google and SRI talk September 2016


Dataset and Setup

• Data recorded with an iPad 2 and an iPhone 5 held at arm's length

• 41 subjects x 2 or 3 sessions on each device

• A session includes a subject repeating 3 times:

− My voice is my password

− Please verify me with the number

• We used 5-fold cross-validation to train the DNN

− Per fold, 80% of the users were used for training and the remaining 20% for evaluation

• Average utterance duration is 1.5s (of speech)

• ~13K clip pairs used for training (incl. cross-device)

• 60 positive+60 negative samples per clip pair

• 10% of training set used for learning progress evaluation


Page 77: Google and SRI talk September 2016


Photo Attack Detection

Setup

• Positive trials
  − Synchronized clip

• Negative trials
  − Audio: authentic ("playback attack")
  − Visual: a static image of the subject held in front of the camera and slightly moved to mimic 'liveness'

Results (EER in %)


Page 78: Google and SRI talk September 2016


Video Attack Detection

Setup

• Positive trials (4.2K)
  − Synchronized clip

• Negative trials (4.2K)
  − Audio: authentic ("playback attack")
  − Visual: the visual stream is taken from a different clip ("playback attack") and is chopped to match the audio endpoints

Results (EER in %)


Page 79: Google and SRI talk September 2016


Conclusion

• We introduced a text-dependent audiovisual synchrony detection scheme

• The availability of enrollment audiovisual clips is exploited

• The same passphrase must be used for both enrollment and authentication

• There is no direct comparison of audio-based and visual-based low-level features
