© 2016 IBM Corporation
Data Challenged Speaker Recognition Hagai Aronowitz IBM Research - Haifa
Mobile Person Recognition
Agenda
1. Speaker ID basics
2. Inter-Dataset Variability Compensation (IDVC)
3. Text-dependent Speaker ID with a small devset
4. Audiovisual synchrony detection
Speaker ID Basics
State-of-the-Art Overview
1. Low-level features (one per 10 ms)
   – Spectral-based (MFCCs + Δ + ΔΔ)
2. High-level features (one per session)
   – GMM supervectors (concatenated means)
   – I-vectors (GMM supervectors + factor analysis)
3. Modeling and scoring of high-level features
   – PLDA: jointly models within-speaker and between-speaker variabilities
4. Calibration
   – Score normalization
   – Convert normalized scores into log-likelihood ratios
   – Use side information (SNR, …)
Speaker Recognition Overview
[Pipeline diagram] x1,…,xT → low-level feature extraction → high-level feature extraction (ϕ) → scoring (ϕ' → score) → calibration (score → LLR)
State-of-the-Art Overview
1. Low-level features (1 per 10ms) – Spectral based (MFCCs + Δ + Δ Δ)
• Speech is divided into frames of 20 ms (with 50% overlap)
• The signal within each frame is assumed stationary
• Frame t is represented by a vector xt ∈ ℝ³⁹, which captures the spectral characteristics of frame t (13 MFCCs + 13 delta MFCCs + 13 double-delta MFCCs)
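The framing and delta computation above can be sketched in numpy. The MFCC computation itself is assumed to happen elsewhere, so a random 13-dimensional feature matrix stands in for real cepstra:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20, overlap=0.5):
    """Split a 1-D signal into fixed-length frames (20 ms, 50% overlap)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]                         # (n_frames, frame_len)

def add_deltas(features):
    """Append first- and second-order differences (delta, double-delta)."""
    delta = np.gradient(features, axis=0)
    ddelta = np.gradient(delta, axis=0)
    return np.concatenate([features, delta, ddelta], axis=1)

# 1 second of audio at 16 kHz -> 20 ms frames with a 10 ms hop
sig = np.random.randn(16000)
frames = frame_signal(sig, 16000)              # 99 frames of 320 samples
mfcc = np.random.randn(len(frames), 13)        # stand-in for 13 MFCCs per frame
feats = add_deltas(mfcc)                       # 39-dim: MFCC + delta + double-delta
```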
State-of-the-Art Overview
2. High-level features (1 per session)
– GMM supervectors (concatenated means)
• A UBM is trained on the dev data (many speakers)
• UBM is adapted to the low-level features of a given session
• GMM means are stacked into a supervector
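The three steps above can be sketched with classical relevance-MAP mean adaptation in numpy; the tiny UBM, the relevance factor r, and the random session data are illustrative stand-ins, not the deck's actual configuration:

```python
import numpy as np

def map_adapt_means(X, weights, means, covs, r=16.0):
    """Relevance-MAP adaptation of UBM means to one session's frames X.
    Returns the adapted means stacked into a supervector."""
    g, f = means.shape
    # per-frame Gaussian log-likelihoods (diagonal covariances assumed)
    ll = -0.5 * (((X[:, None, :] - means) ** 2) / covs).sum(-1) \
         - 0.5 * np.log(2 * np.pi * covs).sum(-1) + np.log(weights)
    post = np.exp(ll - ll.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)          # responsibilities p(g | x_t)
    N = post.sum(0)                             # zeroth-order statistics
    F = post.T @ X                              # first-order statistics
    alpha = (N / (N + r))[:, None]              # adaptation coefficients
    adapted = alpha * (F / np.maximum(N, 1e-8)[:, None]) + (1 - alpha) * means
    return adapted.reshape(g * f)               # stacked supervector

rng = np.random.default_rng(0)
g, f = 8, 4                                     # tiny UBM for illustration
weights = np.full(g, 1.0 / g)
means = rng.standard_normal((g, f))
covs = np.ones((g, f))
X = rng.standard_normal((200, f))               # one session's low-level features
sv = map_adapt_means(X, weights, means, covs)   # supervector of length g*f
```

With a very large relevance factor the adaptation vanishes and the supervector falls back to the stacked UBM means.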
State-of-the-Art Overview
2. High-level features (1 per session)
– I-vectors (GMM supervectors + factor analysis)
Factor analysis used for high-level feature extraction
M: supervector for a given session
m : overall mean (UBM supervector)
T : rectangular matrix of low-rank (total variability matrix)
ϕ: standard normal random vector (i-vector)
M = m + Tϕ
I-vector Extraction
1. Estimate Baum–Welch (BW) statistics
   − Counts: N_g = Σ_t p(g | X_t)
   − Sums: F_g = Σ_t p(g | X_t) X_t
   using p(g | X_t) = w_g N(X_t; µ_g, Σ_g) / Σ_{g'} w_{g'} N(X_t; µ_{g'}, Σ_{g'})
2. The i-vector MAP estimate is
   ϕ̂ = L⁻¹ Tᵗ Σ⁻¹ F
   with L = I + Tᵗ Σ⁻¹ N T
   where N and F stack the per-Gaussian statistics and Σ is a stacking of the UBM covariance matrices
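The MAP estimate on this slide can be written directly in numpy. T, Σ, and the statistics below are random toy stand-ins (a diagonal Σ is assumed, as is standard):

```python
import numpy as np

def extract_ivector(N, F, T, Sigma):
    """i-vector MAP estimate: phi = L^{-1} T' Sigma^{-1} F,
    with L = I + T' Sigma^{-1} N T (N and Sigma expanded per supervector dim)."""
    g = len(N)
    f = len(F) // g
    N_big = np.repeat(N, f)                 # counts expanded to supervector dims
    TtSinv = T.T / Sigma                    # T' Sigma^{-1} for diagonal Sigma
    L = np.eye(T.shape[1]) + (TtSinv * N_big) @ T
    return np.linalg.solve(L, TtSinv @ F)

rng = np.random.default_rng(0)
g, f, r = 16, 8, 5                          # Gaussians, feature dim, i-vector rank
T = rng.standard_normal((g * f, r)) * 0.1   # toy total-variability matrix
Sigma = np.ones(g * f)                      # stacked (diagonal) UBM covariances
N = rng.uniform(1, 10, g)                   # zeroth-order statistics
F = rng.standard_normal(g * f)              # centered first-order statistics
phi = extract_ivector(N, F, T, Sigma)       # r-dimensional i-vector
```

As a sanity check, with zero counts the prior dominates: L = I, so the estimate reduces to TᵗΣ⁻¹F.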
State-of-the-Art Overview
3. Modeling and scoring high-level features – PLDA: Jointly model within-speaker and
between-speaker variabilities
[Diagram] i-vector space: each speaker s is an offset from the global mean µ with covariance B (between-speaker); that speaker's sessions c scatter around s with covariance W (within-speaker).
Probabilistic Linear Discriminant Analysis (PLDA)
The PLDA framework assumes that i-vectors are distributed according to:
ϕ = µ + s + c
ϕ – i-vector
µ – global mean
s – speaker component
c – channel / within-speaker variability component
s and c are assumed independent and normally distributed:
s ~ N(0, B),  c ~ N(0, W)
The PLDA model is parameterized by {µ, B, W}
PLDA: Details
PLDA training
Hyper-parameters W and B are trained using EM (on a devset)
PLDA scoring
Given i-vectors x and y:
p(x) = ∫ p(s) p(x | s) ds = N(x; µ, Σ_tot)
score = log p(x, y | H_same) − log p(x) − log p(y)
p(x, y | H_same) = ∫ p(s) p(x | s) p(y | s) ds = N( [x; y]; [µ; µ], [Σ_tot B; B Σ_tot] )
with Σ_tot = B + W
PLDA: Details (cont.)
PLDA scoring (cont.)
for µ=0:
score = xᵗQx + yᵗQy + 2xᵗPy + const
with
Q = Σ_tot⁻¹ − (Σ_tot − B Σ_tot⁻¹ B)⁻¹
P = Σ_tot⁻¹ B (Σ_tot − B Σ_tot⁻¹ B)⁻¹
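A quick numerical check of this closed form, with toy covariances rather than trained PLDA parameters: the quadratic-form score should differ from a directly computed log-likelihood ratio only by a global factor of ½ and an additive constant.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d)); B = A @ A.T + d * np.eye(d)     # between-speaker cov
Cw = rng.standard_normal((d, d)); W = Cw @ Cw.T + d * np.eye(d)  # within-speaker cov
S = B + W                                                        # Sigma_tot

M = S - B @ np.linalg.solve(S, B)           # Sigma_tot - B Sigma_tot^{-1} B
Q = np.linalg.inv(S) - np.linalg.inv(M)
P = np.linalg.solve(S, B) @ np.linalg.inv(M)

def llr_closed(x, y):
    # the slide's score, up to a global factor of 1/2
    return 0.5 * (x @ Q @ x + y @ Q @ y + 2 * x @ P @ y)

def llr_direct(x, y):
    joint = np.block([[S, B], [B, S]])      # same-speaker joint covariance (mu = 0)
    return (mvn.logpdf(np.concatenate([x, y]), cov=joint)
            - mvn.logpdf(x, cov=S) - mvn.logpdf(y, cov=S))

x, y = rng.standard_normal(d), rng.standard_normal(d)
u, w = rng.standard_normal(d), rng.standard_normal(d)
# the two formulations differ only by a pair-independent constant
c1 = llr_direct(x, y) - llr_closed(x, y)
c2 = llr_direct(u, w) - llr_closed(u, w)
```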
State-of-the-Art Overview
4. Calibration – Score normalization – Convert normalized scores into log-likelihood-ratios – Use side information (SNR, …)
Score normalization is not required if models are trained properly
─ with lots of data
─ no mismatch
Otherwise, score normalization is essential because it cancels out modeling inaccuracies and various artifacts
Znorm
Znorm standardizes the distribution of scores for speaker S
given a calibration set of imposter test sessions
Let φ(S,Y) be the score of test session Y for speaker S
Estimate mean and variance of φ(S,·)
Standardize φ(S,Y)
µ_Z(S) = E_Y[ φ(S, Y) ]
σ_Z²(S) = Var_Y[ φ(S, Y) ]
φ_Znorm(S, Y) = ( φ(S, Y) − µ_Z(S) ) / σ_Z(S)
R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score Normalization for Text-Independent Speaker Verification Systems," Digital Signal Processing, vol. 10, no. 1–3, 2000.
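A minimal Znorm sketch (the impostor cohort here is synthetic):

```python
import numpy as np

def znorm(raw_score, impostor_scores):
    """Standardize a raw score using impostor statistics for the same speaker."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores, ddof=1)
    return (raw_score - mu) / sigma

rng = np.random.default_rng(0)
impostors = rng.normal(2.0, 0.5, size=1000)   # cohort scores phi(S, .)
normed = znorm(impostors, impostors)          # impostor scores after Znorm
```

By construction the normalized impostor scores have zero mean and unit variance; genuine-speaker scores are mapped onto the same scale.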
Speaker ID: Progress of State-of-the-Art
Year       1995   2001      2005   2008   2011   2016
Algorithm  GMM    GMM-UBM   NAP    JFA    PLDA   DNN+PLDA
EER        10%    6%        4%     2%     1%     0.6%
[Plot] EER (in %) vs. year, 1995–2015
Inter-Dataset Variability Compensation (IDVC)
Robustness to Data Mismatch
The i-vector PLDA scheme requires
• 1000s of subjects with several recordings per subject
• Matched recording devices and channels
In practice, we may have plenty of data from a source domain, but limited or no data from the target domain
• We address the setup when no data is available from the target domain
• Our method is named Inter Dataset Variability Compensation (IDVC)
The Domain Robustness Challenge
A major topic in the JHU-2013 speaker recognition workshop
Setup
• Eval data: NIST 2010 SRE – telephone data
• SWB: Phone calls recorded during 1997–2004 (Switchboard 1 & 2)
• MIXER: Data from NIST evaluations 2004–2008
Goal
• Reduce degradation when training on SWB (no use of MIXER)
PLDA training EER (in %)
MIXER 2.4
SWB 8.2
Motivation for the IDVC Method
Find a projection P in i-vector space that minimizes the distance between the PLDA models estimated for the source and the (unknown) target data
[Diagram] Source data → P → projected source data → PLDA model; target data → P → projected target data → PLDA model
IDVC with Unseen Target Data
How can we find projection P if the target data is unknown?
• We divide the available dev data into homogeneous subsets
• We estimate P
• We expect/hope P generalizes well to unseen target data
Success depends on the availability of heterogeneous dev data
• Distinct subsets must exist
• Subsets may either be labeled or found using clustering
Finding Projection P
• We represent each subset by a vector in the hyper-parameter space
• We find and remove a subspace of the i-vector space that would
effectively reduce variability in the PLDA hyper-parameter space
IDVC – Outline
1. Partition the dev data into distinct homogeneous subsets
2. Estimate PLDA hyper-parameters {µ, B, W} for each subset
3. Estimate “bad” i-vector subspaces: Sµ, SW, SB
4. Join the “bad” subspaces into a single “bad” subspace: S = Sµ ∪ SW ∪ SB
• Subspace S is removed from each i-vector as a pre-processing cleanup step
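A sketch of the original (µ-only) IDVC: PCA on the per-subset i-vector means, then removal of the top directions as a pre-processing projection. The subset data is synthetic and the rank k is illustrative:

```python
import numpy as np

def idvc_mu_projection(subsets, k=2):
    """Estimate the 'bad' subspace S_mu by PCA on the per-subset i-vector
    means, and return a projection that removes its top-k directions."""
    mus = np.stack([s.mean(axis=0) for s in subsets])   # one mean per subset
    mus = mus - mus.mean(axis=0)                        # center the means
    # principal directions of inter-dataset mean variability
    _, _, Vt = np.linalg.svd(mus, full_matrices=False)
    V = Vt[:k].T                                        # (dim, k) bad subspace
    return np.eye(mus.shape[1]) - V @ V.T               # removal projection

rng = np.random.default_rng(0)
dim = 50
offsets = rng.standard_normal((4, dim)) * 5.0           # per-dataset shifts
subsets = [rng.standard_normal((200, dim)) + o for o in offsets]
P = idvc_mu_projection(subsets, k=3)
cleaned = [s @ P for s in subsets]                      # cleanup step
```

After the projection, the inter-subset spread of the means essentially vanishes, while within-subset structure is preserved outside the removed subspace.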
Estimating Subspace for Hyper-Parameter µ
• PCA is applied on the set of vectors {µi}
• The optimal dimension of Sµ is a function of the expected
magnitude of mismatch
• Note: Speaker labels are not required for estimating the {µi}
hyper-parameters
• This is the original IDVC method developed in JHU-2013:
H. Aronowitz, "Inter Dataset Variability Compensation for Speaker Recognition", in Proc. ICASSP, 2014.
Estimating Subspaces for Covariance Matrices
• This is the extended IDVC method
• Given a set of n covariance matrices {Wi}, denote the mean covariance by W̄
• Goal: find directions in the i-vector space with maximal variability of normalized variance across different datasets
• Define: v – unit vector (direction in i-vector space)
• The variance of Wi along direction v equals vᵗWiv
• Define the variability of normalized variance w.r.t. v:
  var_i( vᵗWiv / vᵗW̄v )
H. Aronowitz, "Compensating Inter-Dataset Variability in PLDA Hyper-Parameters for Robust Speaker Recognition", in Proc. Speaker Odyssey, 2014.
Estimating Subspaces for Covariance Matrices (2)
Algorithm
1. Whiten the i-vector space with respect to W̄
   • Calculate the square root R of W̄⁻¹
   • After whitening: W̃i = R Wi R
2. Compute Ω = (1/n) Σ_i (W̃i − I)²
3. Find the k top eigenvectors of Ω: v1,…,vk
4. Span the “bad” subspace using R⁻¹v1,…,R⁻¹vk
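A numpy sketch of this algorithm. Note one assumption about the garbled slide: Ω is taken here as the mean of (W̃i − I)², the natural reading of the reconstructed step 2; the covariances below are random SPD stand-ins.

```python
import numpy as np

def cov_subspace(W_list, k=1):
    """Steps 1-4: whiten w.r.t. the mean covariance, build Omega from the
    whitened deviations, and map its top-k eigenvectors back with R^{-1}."""
    Wbar = np.mean(W_list, axis=0)
    vals, vecs = np.linalg.eigh(Wbar)            # Wbar is SPD
    R = vecs @ np.diag(vals ** -0.5) @ vecs.T    # R = Wbar^{-1/2}
    R_inv = vecs @ np.diag(vals ** 0.5) @ vecs.T
    I = np.eye(Wbar.shape[0])
    Omega = np.mean([(R @ W @ R - I) @ (R @ W @ R - I) for W in W_list], axis=0)
    evals, evecs = np.linalg.eigh(Omega)         # ascending eigenvalues
    V = evecs[:, ::-1][:, :k]                    # top-k eigenvectors v1..vk
    return R_inv @ V, Omega                      # bad-subspace directions R^{-1} v_j

rng = np.random.default_rng(0)
d, n = 6, 5
W_list = []
for _ in range(n):
    A = rng.standard_normal((d, d))
    W_list.append(A @ A.T + d * np.eye(d))       # per-subset within-covariances
bad_dirs, Omega = cov_subspace(W_list, k=2)
```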
Estimating Subspaces for Covariance Matrices (3)
Claim 1: In the whitened space, the top eigenvector v1 of Ω maximizes the objective on the whitened matrices:
v1 = argmax_v var_i( vᵗW̃iv / vᵗv )
Claim 2: Transformed back to the original space, v1 maximizes the variability of the normalized variances:
Rv1 / ‖Rv1‖ = argmax_v var_i( vᵗWiv / vᵗW̄v )
Estimating Subspaces for Covariance Matrices (4)
Proof of Claim 1:
Let v be a unit vector. In the whitened space the mean covariance is the identity, so vᵗW̄̃v = vᵗIv = 1, and the mean of vᵗW̃iv over i equals 1. Hence
argmax_v var_i( vᵗW̃iv / vᵗW̄̃v )
 = argmax_v (1/n) Σ_i ( vᵗW̃iv − 1 )²
 = argmax_v (1/n) Σ_i ( vᵗ(W̃i − I)v )²
 ≈(*) argmax_v vᵗΩv
which is maximized by the top eigenvector v1 of Ω
(*) uses (vᵗAv)² ≤ vᵗA²v for symmetric A and unit v, with equality when v is an eigenvector of A
Estimating Subspaces for Covariance Matrices (5)
Proof of Claim 2:
Using Wi = R⁻¹W̃iR⁻¹ and W̄ = R⁻¹R⁻¹:
argmax_v var_i( vᵗWiv / vᵗW̄v )
 = argmax_v var_i( (R⁻¹v)ᵗ W̃i (R⁻¹v) / (R⁻¹v)ᵗ(R⁻¹v) )
Substituting ṽ = R⁻¹v (so v = Rṽ, and the ratio is scale-invariant in ṽ):
 = R · argmax_ṽ var_i( ṽᵗW̃iṽ / ṽᵗṽ ) = Rv1 (by Claim 1, up to normalization)
Estimation without Speaker Labels
• Estimating the subspaces SW and SB requires speaker labels
• Let T denote the total covariance matrix of the i-vectors of the dev data (in a given subset)
• T can be estimated without speaker labels
• Note that for typical datasets, T ≈ B + W
→ An i-vector subspace that contains high inter-dataset variability in the T hyper-parameter will also contain high inter-dataset variability in either W or B (and vice versa)
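The identity behind "T ≈ B + W" is the law of total variance; with consistent (ddof = 0) scatter estimates the decomposition is exact, which a short simulation confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
n_spk, n_sess, d = 500, 10, 8
S = rng.standard_normal((n_spk, d)) * 2.0                     # speaker components
X = S[:, None, :] + rng.standard_normal((n_spk, n_sess, d))   # sessions

flat = X.reshape(-1, d)
mu = flat.mean(axis=0)
T = np.einsum('ni,nj->ij', flat - mu, flat - mu) / len(flat)  # total cov (no labels)

spk_means = X.mean(axis=1)
dev_b = spk_means - mu
B = np.einsum('ni,nj->ij', dev_b, dev_b) / n_spk              # between-speaker cov
dev_w = (X - spk_means[:, None, :]).reshape(-1, d)
W = np.einsum('ni,nj->ij', dev_w, dev_w) / len(dev_w)         # within-speaker cov

err = np.abs(T - (B + W)).max()   # exact ANOVA decomposition: T = B + W
```

Only T is computable without speaker labels (B and W need the labels), which is what the slide exploits.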
Partitioning the Dev Data
Switchboard data consists of 6 releases:
Code     Description
97S62    SWB-1 Release 2
98S75    SWB-2 Phase I
99S79    SWB-2 Phase II
2001S13  SWB Cellular Part 1
2002S06  SWB-2 Phase III
2004S07  SWB Cellular Part 2
We defined 3 different partitions:
GI-6   Each release in a separate subset
GD-12  A further partition into male and female subsets
GI-2   One telephone (97S62) and one cellular (2004S07) subset
Selected Results
Method                                     EER (in %)
Baseline (dev=MIXER)                       2.4
Baseline (dev=SWB)                         8.2
IDVC: µ only                               3.8
IDVC: µ, B(30), W(30)                      3.0
No speaker labels, IDVC: µ, T(30), GI-2    3.3
Results using GD-12 (GI-6 gives similar results)
IDVC Results: µ only
IDVC    EER (in %)   DCF (old)   DCF (new)
No      8.2          0.33        0.69
µ (10)  3.8          0.19        0.53
IDVC Results: W / B only (dev=SWB)
IDVC     EER (in %)   DCF (old)   DCF (new)
No       8.2          0.33        0.69
W (100)  3.4          0.16        0.50
B (100)  3.5          0.16        0.50
Conclusions
1. IDVC has been shown to effectively reduce the influence of dataset variability in the domain robustness challenge
   • EER was decreased by 63% (90% recovery)
   • DCF (old) was decreased by 58% (91% recovery)
   • DCF (new) was decreased by 32% (71% recovery)
2. IDVC works well even when trained
   • on two subsets only
   • without speaker labels
Text Dependent Speaker ID with a Small Devset
Training with a Small Devset
Why train with a small devset?
• TD training is more effective than TI training
• TD data collection is expensive
Techniques
1. Robust estimation of the GMM-supervector covariance matrix. Key idea: soft independence of parameters
2. Stabilization of the score-normalization parameters. Key idea: remove a subspace of the high-level-feature space that accounts for maximal estimation error of the score-normalization parameters
3. Minimize the combined effect of within-speaker variability and hyper-parameter estimation error using a matched filter on the high-level features
Speaker Recognition: Training with a Small Devset
GMM Supervectors
A GMM supervector is a stacking of g (~500) Gaussian mean vectors, each f (~40)-dimensional
GMM-NAP System Description
[Diagram] GMM-NAP scoring: enrollment and test sessions are each mapped MFCC features → GMM supervector → NAP compensation; the compensated supervectors are compared by a dot product, followed by ZT-score normalization.
[Diagram] NAP subspace estimation: MFCC features of dev sessions → GMM nuisance supervectors → intersession covariance matrix → NAP subspace.
Text Dependent Speaker ID with a Small Devset
Robust Estimation of Supervector Covariance
Robust Estimation of Supervector Covariance
Define groups G0, G1, … inspired by the GMM-supervector structure
Estimate the distribution of the covariance-matrix entries for each group (i.e., mean µ_g and variance σ_g²) from the observed data
Estimate each covariance entry w_c by MAP:
ŵ_c = argmax_{w_c} p(w_c | G0, G1, …)
Assuming normal distributions and group independence, the estimate shrinks the observed entry toward its group mean, with weights given by the inverse variances:
ŵ_c = ( w_c / σ′² + µ_g / σ_g² ) / ( 1/σ′² + 1/σ_g² )
where g is the group of entry c and σ′² is the observation-noise variance
Gaussian-based Smoothing (GBS)
1. Whiten the features (in supervector space)
   Estimate a linear transformation of the feature space that minimizes the average (over all Gaussians) off-diagonal correlations in the supervector covariance matrix, and normalizes the average diagonal elements to 1
2. Smooth the sample covariance matrix
   • Relaxed block-diagonal structure:
     C(g1,f1),(g2,f2) = 0 for g1 ≠ g2, f1 ≠ f2
   • The covariance between two supervector components that correspond to different feature coefficients (but the same Gaussian) depends on the Gaussian index only:
     C(g,f1),(g,f2) = α_g for f1 ≠ f2
   • The covariance between two supervector components that correspond to the same feature coefficient depends on the Gaussian indices only:
     C(g1,f),(g2,f) = β_{g1,g2}
3. De-whiten (reverse step 1)
Text Dependent Speaker ID with a Small Devset
Score Stabilization
Motivation
Score normalization
• Essential for GMM-NAP
• Contributes to PLDA when PLDA is poorly trained
  − Small devset or data mismatch
Observation: score normalization is very sensitive to devset size
Experiment
• Common passphrase
• Data: Wells Fargo
• Large dev: 200 speakers × 4 sessions
• Small dev: 20/30 speakers × 2 sessions
• NAP trained on the small dev
Score-Norm Data   20 speakers EER (in %)   30 speakers EER (in %)
Large dev         1.7                      1.5
Small dev         2.8                      2.4
Degradation       65%                      60%
Znorm
Znorm normalizes impostor scores for an enrolled supervector s to zero mean and unit variance
x – test supervector
φ(s, x) – scoring function
µ_Z(s), σ_Z(s) – mean and std parameters for supervector s
φ_Znorm(s, x) = ( φ(s, x) − µ_Z(s) ) / σ_Z(s)   (1)
µ_Z(s) = E_{x′}[ φ(s, x′) ]   (2)
σ_Z²(s) = var_{x′}[ φ(s, x′) ]   (3)
Score Stabilization: Motivation
• Accuracy degrades because the estimates of the score-norm parameters µ(s) and σ(s) are too noisy
• We want to reduce the estimation noise of the score-norm parameters
• We seek a low-dimensional subspace of the supervector space that accounts for high variability in the score-norm parameters
• Our analysis is done for Znorm but is valid for Tnorm and ZTnorm as well
Method
Given a devset of supervectors X = {x1,…,xn}, the unbiased estimates of the Znorm parameters are
µ̂_Z(s, X) = (1/n) Σ_{x∈X} φ(s, x)   (4)
σ̂_Z²(s, X) = (1/(n−1)) Σ_{x∈X} ( φ(s, x) − µ̂_Z(s, X) )²   (5)
Goal: minimize the expected variance of φ_Znorm(s, x):
E_{s,x}[ var_X( ( φ(s, x) − µ̂_Z(s, X) ) / σ̂_Z(s, X) ) ]   (6)
In practice, we independently minimize the expected variances of µ̂_Z(s, X) and σ̂_Z(s, X)
Stabilization of σ_Z(s, X)
Assumptions
• Impostor scores are normally distributed (per speaker)
• Scoring is a dot product: φ(s, x) = sᵗx; WLOG E[s] = E[x] = 0
Note that
µ_Z(s) = 0,  σ_Z²(s) = sᵗ cov(x) s   (7)
For normally distributed scores,
var_X( σ̂_Z²(s, X) ) = (2/(n−1)) σ_Z⁴(s) = (2/(n−1)) ( sᵗ cov(x) s )²   (8)
Therefore
E_s[ var_X( σ̂_Z²(s, X) ) ] ∝ tr( cov²(x) )   (9)
Conclusion: the subspace to be removed is spanned by the top eigenvectors of cov(x), which is the total variability covariance matrix
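A small simulation of this conclusion: synthetic supervectors with one dominant total-variability direction, random speakers, and resampled devsets. Removing the top eigenvector of cov(x) sharply shrinks the expected variance of the σ̂_Z² estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_spk, n_rep = 20, 50, 20, 100
scales = np.ones(d); scales[0] = 10.0        # one dominant direction in cov(x)
speakers = rng.standard_normal((n_spk, d))

v = np.zeros(d); v[0] = 1.0                  # top eigenvector of cov(x)
P = np.eye(d) - np.outer(v, v)               # subspace-removal projection

def mean_var_of_sigma2(proj):
    """E_s[var_X(sigma_hat_Z^2)] estimated by resampling devsets X."""
    per_spk = []
    for s in speakers:
        sig2 = [(rng.standard_normal((n, d)) * scales @ proj @ s).var(ddof=1)
                for _ in range(n_rep)]
        per_spk.append(np.var(sig2))
    return np.mean(per_spk)

before = mean_var_of_sigma2(np.eye(d))
after = mean_var_of_sigma2(P)                # much smaller after removal
```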
Stabilization of µ_Z(s, X) / σ_Z(s, X)
Further assumption: σ̂_Z(s, X) is already stabilized, i.e., σ̂_Z(s, X) ≈ σ_Z(s, X)
Then
var_X( µ̂_Z(s, X) ) / σ_Z²(s) = ( (1/n) sᵗ cov(x) s ) / ( sᵗ cov(x) s ) = 1/n   (13)
→ the relative estimation error does not depend on s, so no further stabilization is possible using a subspace-removal technique
Dataset
Text-dependent (Wells Fargo)
• 200 speakers for dev + 550 for eval
• 2 landline (LL) + 2 cellular (CC) sessions per speaker
• Common passphrase (0-1-2-3-4-5-6-7-8-9)
• 3 repetitions for enrollment, 1 for verification
• Reduced: 20-50 speakers with 2/4 sessions per speaker
Text Dependent Results
ML-based intersession matrix estimation; SS-n: rank of the removed subspace
System            20LC  20   30CC  30LL  30RR  30LC  30   50LC  50   200LLCC
NAP 10            2.8   2.5  3.2   3.3   3.5   2.4   2.1  1.8   1.6  1.0
NAP 10 SS-10      2.4   2.0  2.7   2.4   2.4   2.0   1.8  1.6   1.4  1.1
NAP 10 SS-25      2.3   2.0  2.4   2.4   2.4   2.1   1.8  1.7   1.5  1.1
NAP 10 SS-50      2.3   1.9  2.4   2.6   2.5   2.1   1.8  1.8   1.5  1.1
NAP 10 Norm-full  1.7   1.6  2.0   2.0   1.9   1.5   1.4  1.5   1.2  1.0
EER reduction (SS-25)   18%  20%  25%  27%  31%  13%  14%  6%   6%   -10%
Recovery rate (SS-25)   45%  56%  67%  69%  69%  33%  43%  33%  25%  -10%
Text Dependent Results (2)
GBS-based intersession matrix estimation; SS-n: rank of the removed subspace
System     20LC  20   30CC  30LL  30RR  30LC  30   50LC  50   200LLCC
GBS        2.5   2.3  2.7   2.7   2.7   2.2   2.1  1.8   1.8  1.6
GBS SS-10  2.1   2.0  2.2   2.1   2.1   1.9   1.8  1.7   1.6  1.3
GBS SS-25  2.1   1.9  2.2   2.1   2.0   1.9   1.8  1.7   1.6  1.3
GBS SS-50  2.1   1.8  2.2   2.3   2.4   1.9   1.7  1.7   1.6  1.4
EER reduction (SS-25)  16%  17%  19%  22%  26%  14%  14%  6%  11%  19%
Text Dependent Results (3)
SS25: rank of the removed subspace is 25
[Plot] EER (in %) per subset (20LC … 200LLCC) for NAP10, NAP10+SS25, GBS, and GBS+SS25
Conclusions
1. Removing the top eigenvectors of the total variability covariance matrix stabilizes the score-normalization parameters
2. The method reduces error when the devset is small
3. The method combines well with Gaussian-based smoothing (GBS) for NAP estimation
Text Dependent Speaker ID with a Small Devset
Matched Filter for Speaker Recognition
Motivation
• Cosine distance is fundamental in speaker ID. It motivates:
  − Score normalization
  − I-vector centering
  − I-vector length normalization
  − Domain-mismatch problems and solutions (such as IDVC)
• Questions:
  − Why is cosine distance needed?
  − Are our generative models in line with cosine distance?
• Cosine-distance-based geometry requires knowledge of the center
  − How can we handle uncertainty in the mean?
    o Domain mismatch: IDVC
    o Training from a small devset: this work
Modified from: Dehak, N., et al.: Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech and Language Processing, May 2011.
Speaker Geometry
Matched Filters
• An observed signal x is the sum of a desirable signal s and an additive noise v:
  x = s + v   (1)
• We seek a filter h that maximizes the output signal-to-noise ratio, where the output is the inner product of the filter and the observed signal x:
  h = α cov⁻¹(v) s   (2)
  where α is a scaling constant (which cancels out after score normalization)
• The scoring function is therefore:
  score = sᵗ cov⁻¹(v) x   (3)
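A numerical sketch of the matched-filter property: h = cov⁻¹(v)s attains the maximal output SNR, sᵗcov⁻¹(v)s, and in particular beats plain dot-product scoring (s and the noise covariance below are random toys):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
s = rng.standard_normal(d)                      # desired signal
A = rng.standard_normal((d, d))
C = A @ A.T + 0.1 * np.eye(d)                   # noise covariance cov(v)

h = np.linalg.solve(C, s)                       # matched filter (alpha = 1)

def out_snr(filt):
    """Output SNR of filt' x for x = s + v."""
    return (filt @ s) ** 2 / (filt @ C @ filt)

snr_mf = out_snr(h)                             # equals s' C^{-1} s
snr_plain = out_snr(s)                          # plain dot-product scoring
```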
Speaker ID Generative Model
Given a high-level feature vector x extracted from a session, consider the model
x = µ + c_x (s + n_x)   (4)
s – mean high-level feature vector representing the speaker
n_x – session-dependent intra-speaker nuisance vector
µ – center/mean of the speaker-population distribution of high-level features
c_x – session-dependent scaling factor (the basis for score normalization, cosine-distance scoring, and i-vector length normalization)
Speaker ID Generative Model (cont.)
Given a pair of high-level features x and y, we apply the matched-filter framework
x̃, ỹ – x and y after centering and length normalization
δ – center estimation error (bias)
For a same-speaker pair, the difference is driven by the nuisance terms (and the bias δ):
x̃ − ỹ = (1/c_x) n_x − (1/c_y) n_y
We can apply matched-filter theory with
W – intra-speaker covariance matrix: W = cov(n)   (5)
Δ – center-uncertainty covariance matrix: Δ = cov(δ)   (6)
Speaker ID Generative Model (cont.)
Note that a reasonable estimate for Δ is based on the sample total covariance matrix T (estimated from the devset):
Δ = T / m   (7)
m – number of speakers in the development dataset
We obtain the following scoring function:
f(x, y) = ỹᵗ ( W + T/m )⁻¹ x̃   (8)
Training with a Small Devset : Results
61
[1] H. Aronowitz, "Exploiting Supervector Structure for Speaker Recognition Trained on a Small Development Set", in Proc. Interspeech, 2015.
[2] H. Aronowitz, "Score Stabilization for Speaker Recognition Trained on a Small Development Set", in Proc. Interspeech, 2015.
[3] H. Aronowitz, "Speaker Recognition Using Matched Filters", in Proc. ICASSP, 2016.
System 20LC 20 30CC 30LL 30LC 30 50LC 50 200 LLCC
Baseline 2.8 2.5 3.2 3.3 2.4 2.1 1.8 1.6 1.0
Robust Smoothing (RS) 2.5 2.3 2.7 2.7 2.2 2.1 1.8 1.8 1.6
Score stabilization (SS) 2.3 2.0 2.4 2.4 2.1 1.8 1.7 1.5 1.1
Matched filter 2.2 1.9 2.3 2.3 1.9 1.7 1.6 1.4 0.9
RS + SS 2.1 1.9 2.2 2.1 1.9 1.8 1.7 1.6 1.3
Error reduction (in %), RS + SS:         25 24 31 36 21 14 6 - -
Error reduction (in %), matched filter:  21 24 28 30 21 19 11 13 10
Audiovisual Synchrony Detection
Mobile Biometric Authentication
Background
• User is authenticated using a mobile device (smartphone/tablet)
• Input is audiovisual
• Biometric authentication is done using either: − Speaker ID
− Face ID
− Fusion of both
• EERs of ~0.1% may be obtained combining the speaker and face modalities
• Emphasizes the need for liveness detection / spoofing detection
Spoofing Attacks
Audio modality
• Playback
• Splicing (audio 'cut and paste')
• Voice transformation
• TTS (mostly adaptive TTS)
Face modality
• Video attack
• Photo attack
Audiovisual modality
• Modalities spoofed independently
  − Modalities not synchronized
• Modalities spoofed jointly
  − Much harder
  − Synchrony is not easy to obtain
Synchrony Detection
Related work
• Not an easy task
  − Bregler and Konig report that the mutual information between the audio and video streams was maximal when the lag between them was ~120 ms
  − Lags are highly context-sensitive
• Performance is limited in accuracy, robustness, and training requirements
• Previous work targets general-purpose synchrony detection
  − Text independence
  − Speaker independence
  − Correlation-based approaches obtain poor performance
Text Dependent Synchrony Detection
Setup
• Fixed passphrase (e.g., My voice is my password, OK Google, etc.)
Goal
• Given an enrollment clip from a target user, verify the synchronization of a test clip supposedly spoken by the target speaker
Main idea
• Exploit text dependency
• Avoid direct comparison of speech and images (apples to oranges)
• The method is applicable to user-selected passphrases as well
Outline of Method
• Temporal alignments are computed independently for the audio and visual modalities
• Alignments are expected to be similar for synchronized recordings
[Diagram] Enrollment and test clips are aligned separately in each modality
Audio-based Alignment
• Voice Activity Detection (VAD): Energy-based
• MFCC feature extraction
• Dynamic Time Warping (DTW)
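A self-contained DTW sketch (the full O(nm) dynamic program with backtracking); the toy 1-D "feature" sequences stand in for MFCC matrices:

```python
import numpy as np

def dtw_path(a, b):
    """Dynamic time warping between two feature sequences (frames x dims).
    Returns the optimal alignment path as a list of (i, j) frame pairs."""
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1],
                                               D[i - 1, j - 1])
    # backtrack from the end to recover the optimal path
    path, i, j = [(n - 1, m - 1)], n, m
    while (i, j) != (1, 1):
        opts = [(D[i - 1, j - 1], (i - 1, j - 1)),
                (D[i - 1, j], (i - 1, j)),
                (D[i, j - 1], (i, j - 1))]
        _, (i, j) = min(opts)
        path.append((i - 1, j - 1))
    return path[::-1]

# toy "MFCC" sequences: b is a time-stretched copy of a
a = np.sin(np.linspace(0, 3, 20))[:, None]
b = np.sin(np.linspace(0, 3, 30))[:, None]
path = dtw_path(a, b)
```

The same routine aligns the visual feature sequences; comparing the two paths is the basis of the synchrony score.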
Visual-based Alignment
• Detect mouth region of interest and normalize it
• Extract visual features
• Dynamic Time Warping (DTW)
Mouth Detection and Normalization
• Face detection using Viola-Jones
• Lips detection using an Active Shape Model (ASM)
• A 50×30-pixel region is cropped around the lips
Visual Features: Histogram of Oriented Gradients (HOG)
• A popular shallow feature
  − each sample is evenly divided into 16×16 cells
  − for each cell, we calculate a histogram of 8 gradient-orientation bins (in 0–2π)
• We found HOG unsuitable for finding good DTW alignments between clips
Image credit: https://github.com/pavitrakumar78/Python-Custom-Digit-Recognition
Deep Visual Features
Architecture
• The input is a stack of 5 consecutive 50×30 lip crops
• 3 convolutional layers
• 2 fully connected (FC) layers
• ReLUs applied after each layer, except for the last FC layer
[Diagram] Input stacks → convolutional and FC layers → visual feature vectors
DNN Training
• The network is trained using a Siamese architecture
• In each iteration the network is given a pair of lip stacks, V1 and V2, and a label y (0/1: synchronized/unsynchronized)
• The loss function L is
  L = (1/2N) Σ_{i=1}^{N} [ (1 − y_i) d_i² + y_i max(δ − d_i, 0)² ]
  where y_i and d_i are the label and the Euclidean distance for the i-th lip-stack pair, respectively, and δ is a predefined margin
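The contrastive loss above, using the slide's label convention (0 = synchronized), worked through on toy distances:

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss over N pairs: d = Euclidean distances, y = labels
    (0 = synchronized, 1 = unsynchronized)."""
    d, y = np.asarray(d, float), np.asarray(y, float)
    pos = (1 - y) * d ** 2                       # pull synchronized pairs together
    neg = y * np.maximum(margin - d, 0.0) ** 2   # push others beyond the margin
    return (pos + neg).sum() / (2 * len(d))

# a synchronized pair at distance 0 and an unsynchronized pair beyond the
# margin incur no loss; the two middle pairs are penalized
loss = contrastive_loss([0.0, 0.5, 0.2, 1.5], [0, 0, 1, 1], margin=1.0)
# loss = (0 + 0.25 + 0.64 + 0) / 8 = 0.11125
```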
DNN Training: Creation of Training Pairs
Goal: encourage the DNN to produce visual features that mimic the correspondences between MFCC features
Positive pairs
1. Select a pair of clips with the same person and the same text
2. Find the optimal DTW alignment for the audio streams using MFCCs
3. Map the audio-based alignment into a visual-based alignment
4. For each pair of aligned visual frames, form a positive training sample from the corresponding lip stacks
Negative pairs
1. Select visual pairs corresponding to pairs of audio frames with maximal pairwise MFCC distance (off the alignment path)
DNN Training: Creation of Training Pairs
Example
MFCC distance matrix of two clips. fa and fc represent a negative pair while fb and fd represent a positive pair
Dataset and Setup
• Data recorded with an iPad 2 & iPhone 5 held at arm's length
• 41 subjects × 2 or 3 sessions on each device
• A session includes a subject repeating 3 times:
  − My voice is my password
  − Please verify me with the number
• We used 5-fold cross-validation to train the DNN
  − Per fold, 80% of the users are used for training and the remaining 20% for evaluation
• Average utterance duration is 1.5 s (of speech)
• ~13K clip pairs used for training (incl. cross-device)
• 60 positive + 60 negative samples per clip pair
• 10% of the training set used for monitoring learning progress
Photo Attack Detection
Setup
• Positive trials
  − Synchronized clips
• Negative trials
  − Audio: authentic ("playback attack")
  − Visual: a static image of the subject held in front of the camera and slightly moved to mimic 'liveness'
Results (EER in %)
Video Attack Detection
Setup
• Positive trials (4.2K)
  − Synchronized clips
• Negative trials (4.2K)
  − Audio: authentic ("playback attack")
  − Visual: the visual stream is taken from a different clip ("playback attack") and chopped to match the audio endpoints
Results (EER in %)
Conclusion
• We introduced a text-dependent audiovisual synchrony detection scheme
• The availability of enrollment audiovisual clips is exploited
• The same passphrase must be used for both enrollment and authentication
• There is no direct comparison of audio- and visual-based low-level features