© 2016 IBM Corporation
Data Challenged Speaker Recognition Hagai Aronowitz IBM Research - Haifa
Mobile Person Recognition
Agenda
1. Speaker ID basics
2. Inter-Dataset Variability Compensation (IDVC)
3. Text-dependent Speaker ID with a small devset
4. Audiovisual synchrony detection
Speaker ID Basics
State-of-the-Art Overview
1. Low-level features (one per 10 ms)
   – Spectral-based (MFCCs + Δ + ΔΔ)
2. High-level features (one per session)
   – GMM supervectors (concatenated means)
   – I-vectors (GMM supervectors + factor analysis)
3. Modeling and scoring of high-level features
   – PLDA: jointly models within-speaker and between-speaker variabilities
4. Calibration
   – Score normalization
   – Convert normalized scores into log-likelihood ratios
   – Use side information (SNR, …)
Speaker Recognition Overview
[Pipeline diagram] x1,…,xT → low-level feature extraction → high-level feature extraction (ϕ) → scoring (ϕ' → score) → calibration (score → LLR)
State-of-the-Art Overview
1. Low-level features (1 per 10ms) – Spectral based (MFCCs + Δ + Δ Δ)
• Speech is divided into frames of 20 ms (with 50% overlap)
• The signal within each frame is assumed stationary
• Frame t is represented by a vector xt ∈ ℝ³⁹, which captures the spectral characteristics of frame t (13 MFCCs + 13 delta MFCCs + 13 double-delta MFCCs)
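The framing and delta computation above can be sketched in numpy. The MFCC computation itself is assumed to happen elsewhere, so a random 13-dimensional feature matrix stands in for real cepstra:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20, overlap=0.5):
    """Split a 1-D signal into fixed-length frames (20 ms, 50% overlap)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]                         # (n_frames, frame_len)

def add_deltas(features):
    """Append first- and second-order differences (delta, double-delta)."""
    delta = np.gradient(features, axis=0)
    ddelta = np.gradient(delta, axis=0)
    return np.concatenate([features, delta, ddelta], axis=1)

# 1 second of audio at 16 kHz -> 20 ms frames with a 10 ms hop
sig = np.random.randn(16000)
frames = frame_signal(sig, 16000)              # 99 frames of 320 samples
mfcc = np.random.randn(len(frames), 13)        # stand-in for 13 MFCCs per frame
feats = add_deltas(mfcc)                       # 39-dim: MFCC + delta + double-delta
```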
State-of-the-Art Overview
2. High-level features (1 per session)
– GMM supervectors (concatenated means)
• A UBM is trained on the dev data (many speakers)
• UBM is adapted to the low-level features of a given session
• GMM means are stacked into a supervector
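The three steps above can be sketched with classical relevance-MAP mean adaptation in numpy; the tiny UBM, the relevance factor r, and the random session data are illustrative stand-ins, not the deck's actual configuration:

```python
import numpy as np

def map_adapt_means(X, weights, means, covs, r=16.0):
    """Relevance-MAP adaptation of UBM means to one session's frames X.
    Returns the adapted means stacked into a supervector."""
    g, f = means.shape
    # per-frame Gaussian log-likelihoods (diagonal covariances assumed)
    ll = -0.5 * (((X[:, None, :] - means) ** 2) / covs).sum(-1) \
         - 0.5 * np.log(2 * np.pi * covs).sum(-1) + np.log(weights)
    post = np.exp(ll - ll.max(1, keepdims=True))
    post /= post.sum(1, keepdims=True)          # responsibilities p(g | x_t)
    N = post.sum(0)                             # zeroth-order statistics
    F = post.T @ X                              # first-order statistics
    alpha = (N / (N + r))[:, None]              # adaptation coefficients
    adapted = alpha * (F / np.maximum(N, 1e-8)[:, None]) + (1 - alpha) * means
    return adapted.reshape(g * f)               # stacked supervector

rng = np.random.default_rng(0)
g, f = 8, 4                                     # tiny UBM for illustration
weights = np.full(g, 1.0 / g)
means = rng.standard_normal((g, f))
covs = np.ones((g, f))
X = rng.standard_normal((200, f))               # one session's low-level features
sv = map_adapt_means(X, weights, means, covs)   # supervector of length g*f
```

With a very large relevance factor the adaptation vanishes and the supervector falls back to the stacked UBM means.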
State-of-the-Art Overview
2. High-level features (1 per session)
– I-vectors (GMM supervectors + factor analysis)
Factor analysis used for high-level feature extraction
M: supervector for a given session
m : overall mean (UBM supervector)
T : rectangular matrix of low-rank (total variability matrix)
ϕ: standard normal random vector (i-vector)
M = m + Tϕ
I-vector Extraction
1. Estimate Baum–Welch (BW) statistics
   − Counts: N_g = Σ_t p(g | X_t)
   − Sums: F_g = Σ_t p(g | X_t) X_t
   using p(g | X_t) = w_g N(X_t; µ_g, Σ_g) / Σ_{g'} w_{g'} N(X_t; µ_{g'}, Σ_{g'})
2. The i-vector MAP estimate is
   ϕ̂ = L⁻¹ Tᵗ Σ⁻¹ F
   with L = I + Tᵗ Σ⁻¹ N T
   where N and F stack the per-Gaussian statistics and Σ is a stacking of the UBM covariance matrices
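The MAP estimate on this slide can be written directly in numpy. T, Σ, and the statistics below are random toy stand-ins (a diagonal Σ is assumed, as is standard):

```python
import numpy as np

def extract_ivector(N, F, T, Sigma):
    """i-vector MAP estimate: phi = L^{-1} T' Sigma^{-1} F,
    with L = I + T' Sigma^{-1} N T (N and Sigma expanded per supervector dim)."""
    g = len(N)
    f = len(F) // g
    N_big = np.repeat(N, f)                 # counts expanded to supervector dims
    TtSinv = T.T / Sigma                    # T' Sigma^{-1} for diagonal Sigma
    L = np.eye(T.shape[1]) + (TtSinv * N_big) @ T
    return np.linalg.solve(L, TtSinv @ F)

rng = np.random.default_rng(0)
g, f, r = 16, 8, 5                          # Gaussians, feature dim, i-vector rank
T = rng.standard_normal((g * f, r)) * 0.1   # toy total-variability matrix
Sigma = np.ones(g * f)                      # stacked (diagonal) UBM covariances
N = rng.uniform(1, 10, g)                   # zeroth-order statistics
F = rng.standard_normal(g * f)              # centered first-order statistics
phi = extract_ivector(N, F, T, Sigma)       # r-dimensional i-vector
```

As a sanity check, with zero counts the prior dominates: L = I, so the estimate reduces to TᵗΣ⁻¹F.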
State-of-the-Art Overview
3. Modeling and scoring high-level features – PLDA: Jointly model within-speaker and
between-speaker variabilities
[Diagram] i-vector space: each speaker s is an offset from the global mean µ with covariance B (between-speaker); that speaker's sessions c scatter around s with covariance W (within-speaker).
Probabilistic Linear Discriminant Analysis (PLDA)
The PLDA framework assumes that i-vectors are distributed according to:
ϕ = µ + s + c
ϕ – i-vector
µ – global mean
s – speaker component
c – channel / within-speaker variability component
s and c are assumed independent and normally distributed:
s ~ N(0, B),  c ~ N(0, W)
The PLDA model is parameterized by {µ, B, W}
PLDA: Details
PLDA training
Hyper-parameters W and B are trained using EM (on a devset)
PLDA scoring
Given i-vectors x and y:
p(x) = ∫ p(s) p(x | s) ds = N(x; µ, Σ_tot)
score = log p(x, y | H_same) − log p(x) − log p(y)
p(x, y | H_same) = ∫ p(s) p(x | s) p(y | s) ds = N( [x; y]; [µ; µ], [Σ_tot B; B Σ_tot] )
with Σ_tot = B + W
PLDA: Details (cont.)
PLDA scoring (cont.)
for µ=0:
score = xᵗQx + yᵗQy + 2xᵗPy + const
with
Q = Σ_tot⁻¹ − (Σ_tot − B Σ_tot⁻¹ B)⁻¹
P = Σ_tot⁻¹ B (Σ_tot − B Σ_tot⁻¹ B)⁻¹
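A quick numerical check of this closed form, with toy covariances rather than trained PLDA parameters: the quadratic-form score should differ from a directly computed log-likelihood ratio only by a global factor of ½ and an additive constant.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d)); B = A @ A.T + d * np.eye(d)     # between-speaker cov
Cw = rng.standard_normal((d, d)); W = Cw @ Cw.T + d * np.eye(d)  # within-speaker cov
S = B + W                                                        # Sigma_tot

M = S - B @ np.linalg.solve(S, B)           # Sigma_tot - B Sigma_tot^{-1} B
Q = np.linalg.inv(S) - np.linalg.inv(M)
P = np.linalg.solve(S, B) @ np.linalg.inv(M)

def llr_closed(x, y):
    # the slide's score, up to a global factor of 1/2
    return 0.5 * (x @ Q @ x + y @ Q @ y + 2 * x @ P @ y)

def llr_direct(x, y):
    joint = np.block([[S, B], [B, S]])      # same-speaker joint covariance (mu = 0)
    return (mvn.logpdf(np.concatenate([x, y]), cov=joint)
            - mvn.logpdf(x, cov=S) - mvn.logpdf(y, cov=S))

x, y = rng.standard_normal(d), rng.standard_normal(d)
u, w = rng.standard_normal(d), rng.standard_normal(d)
# the two formulations differ only by a pair-independent constant
c1 = llr_direct(x, y) - llr_closed(x, y)
c2 = llr_direct(u, w) - llr_closed(u, w)
```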
State-of-the-Art Overview
4. Calibration – Score normalization – Convert normalized scores into log-likelihood-ratios – Use side information (SNR, …)
Score normalization is not required if models are trained properly
─ with lots of data
─ no mismatch
Otherwise, score normalization is essential because it cancels out modeling inaccuracies and various artifacts
Znorm
Znorm standardizes the distribution of scores for speaker S
given a calibration set of imposter test sessions
Let φ(S,Y) be the score of test session Y for speaker S
Estimate mean and variance of φ(S,·)
Standardize φ(S,Y)
µ_Z(S) = E_Y[ φ(S, Y) ]
σ_Z²(S) = Var_Y[ φ(S, Y) ]
φ_Znorm(S, Y) = ( φ(S, Y) − µ_Z(S) ) / σ_Z(S)
R. Auckenthaler, M. Carey, and H. Lloyd-Thomas, "Score Normalization for Text-Independent Speaker Verification Systems," Digital Signal Processing, vol. 10, no. 1–3, 2000.
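A minimal Znorm sketch (the impostor cohort here is synthetic):

```python
import numpy as np

def znorm(raw_score, impostor_scores):
    """Standardize a raw score using impostor statistics for the same speaker."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores, ddof=1)
    return (raw_score - mu) / sigma

rng = np.random.default_rng(0)
impostors = rng.normal(2.0, 0.5, size=1000)   # cohort scores phi(S, .)
normed = znorm(impostors, impostors)          # impostor scores after Znorm
```

By construction the normalized impostor scores have zero mean and unit variance; genuine-speaker scores are mapped onto the same scale.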
Speaker ID: Progress of State-of-the-Art
Year       1995   2001      2005   2008   2011   2016
Algorithm  GMM    GMM-UBM   NAP    JFA    PLDA   DNN+PLDA
EER        10%    6%        4%     2%     1%     0.6%
[Plot] EER (in %) vs. year, 1995–2015
Inter-Dataset Variability Compensation (IDVC)
Robustness to Data Mismatch
The i-vector PLDA scheme requires
• 1000s of subjects with several recordings per subject
• Matched recording devices and channels
In practice, we may have plenty of data from a source domain, but limited or no data from the target domain
• We address the setup when no data is available from the target domain
• Our method is named Inter Dataset Variability Compensation (IDVC)
The Domain Robustness Challenge
A major topic in the JHU-2013 speaker recognition workshop
Setup
• Eval data: NIST 2010 SRE – telephone data
• SWB: Phone calls recorded during 1997–2004 (Switchboard 1 & 2)
• MIXER: Data from NIST evaluations 2004–2008
Goal
• Reduce degradation when training on SWB (no use of MIXER)
PLDA training EER (in %)
MIXER 2.4
SWB 8.2
Motivation for the IDVC Method
Find a projection P in i-vector space that minimizes the distance between the PLDA models estimated for the source and the (unknown) target data
[Diagram] Source data → P → projected source data → PLDA model; target data → P → projected target data → PLDA model
IDVC with Unseen Target Data
How can we find projection P if the target data is unknown?
• We divide the available dev data into homogeneous subsets
• We estimate P
• We expect/hope P generalizes well to unseen target data
Success depends on the availability of heterogeneous dev data
• Distinct subsets must exist
• Subsets may either be labeled or found using clustering
Finding Projection P
• We represent each subset by a vector in the hyper-parameter space
• We find and remove a subspace of the i-vector space that would
effectively reduce variability in the PLDA hyper-parameter space
IDVC – Outline
1. Partition the dev data into distinct homogeneous subsets
2. Estimate PLDA hyper-parameters {µ, B, W} for each subset
3. Estimate “bad” i-vector subspaces: Sµ, SW, SB
4. Join the “bad” subspaces into a single “bad” subspace: S = Sµ ∪ SW ∪ SB
• Subspace S is removed from each i-vector as a pre-processing cleanup step
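A sketch of the original (µ-only) IDVC: PCA on the per-subset i-vector means, then removal of the top directions as a pre-processing projection. The subset data is synthetic and the rank k is illustrative:

```python
import numpy as np

def idvc_mu_projection(subsets, k=2):
    """Estimate the 'bad' subspace S_mu by PCA on the per-subset i-vector
    means, and return a projection that removes its top-k directions."""
    mus = np.stack([s.mean(axis=0) for s in subsets])   # one mean per subset
    mus = mus - mus.mean(axis=0)                        # center the means
    # principal directions of inter-dataset mean variability
    _, _, Vt = np.linalg.svd(mus, full_matrices=False)
    V = Vt[:k].T                                        # (dim, k) bad subspace
    return np.eye(mus.shape[1]) - V @ V.T               # removal projection

rng = np.random.default_rng(0)
dim = 50
offsets = rng.standard_normal((4, dim)) * 5.0           # per-dataset shifts
subsets = [rng.standard_normal((200, dim)) + o for o in offsets]
P = idvc_mu_projection(subsets, k=3)
cleaned = [s @ P for s in subsets]                      # cleanup step
```

After the projection, the inter-subset spread of the means essentially vanishes, while within-subset structure is preserved outside the removed subspace.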
Estimating Subspace for Hyper-Parameter µ
• PCA is applied on the set of vectors {µi}
• The optimal dimension of Sµ is a function of the expected
magnitude of mismatch
• Note: Speaker labels are not required for estimating the {µi}
hyper-parameters
• This is the original IDVC method developed in JHU-2013:
H. Aronowitz, "Inter Dataset Variability Compensation for Speaker Recognition", in Proc. ICASSP, 2014.
Estimating Subspaces for Covariance Matrices
• This is the extended IDVC method
• Given a set of n covariance matrices {Wi}, denote the mean covariance by W̄
• Goal: find directions in the i-vector space with maximal variability of normalized variance across different datasets
• Define: v – unit vector (direction in i-vector space)
• The variance of Wi along direction v equals vᵗWiv
• Define the variability of normalized variance w.r.t. v:
  var_i( vᵗWiv / vᵗW̄v )
H. Aronowitz, "Compensating Inter-Dataset Variability in PLDA Hyper-Parameters for Robust Speaker Recognition", in Proc. Speaker Odyssey, 2014.
Estimating Subspaces for Covariance Matrices (2)
Algorithm
1. Whiten the i-vector space with respect to W̄
   • Calculate the square root R of W̄⁻¹
   • After whitening: W̃i = R Wi R
2. Compute Ω = (1/n) Σ_i (W̃i − I)²
3. Find the k top eigenvectors of Ω: v1,…,vk
4. Span the “bad” subspace using R⁻¹v1,…,R⁻¹vk
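A numpy sketch of this algorithm. Note one assumption about the garbled slide: Ω is taken here as the mean of (W̃i − I)², the natural reading of the reconstructed step 2; the covariances below are random SPD stand-ins.

```python
import numpy as np

def cov_subspace(W_list, k=1):
    """Steps 1-4: whiten w.r.t. the mean covariance, build Omega from the
    whitened deviations, and map its top-k eigenvectors back with R^{-1}."""
    Wbar = np.mean(W_list, axis=0)
    vals, vecs = np.linalg.eigh(Wbar)            # Wbar is SPD
    R = vecs @ np.diag(vals ** -0.5) @ vecs.T    # R = Wbar^{-1/2}
    R_inv = vecs @ np.diag(vals ** 0.5) @ vecs.T
    I = np.eye(Wbar.shape[0])
    Omega = np.mean([(R @ W @ R - I) @ (R @ W @ R - I) for W in W_list], axis=0)
    evals, evecs = np.linalg.eigh(Omega)         # ascending eigenvalues
    V = evecs[:, ::-1][:, :k]                    # top-k eigenvectors v1..vk
    return R_inv @ V, Omega                      # bad-subspace directions R^{-1} v_j

rng = np.random.default_rng(0)
d, n = 6, 5
W_list = []
for _ in range(n):
    A = rng.standard_normal((d, d))
    W_list.append(A @ A.T + d * np.eye(d))       # per-subset within-covariances
bad_dirs, Omega = cov_subspace(W_list, k=2)
```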
Estimating Subspaces for Covariance Matrices (3)
Claim 1: In the whitened space, the top eigenvector v1 of Ω maximizes the objective on the whitened matrices:
v1 = argmax_v var_i( vᵗW̃iv / vᵗv )
Claim 2: Transformed back to the original space, v1 maximizes the variability of the normalized variances:
Rv1 / ‖Rv1‖ = argmax_v var_i( vᵗWiv / vᵗW̄v )
Estimating Subspaces for Covariance Matrices (4)
Proof of Claim 1:
Let v be a unit vector. In the whitened space the mean covariance is the identity, so vᵗW̄̃v = vᵗIv = 1, and the mean of vᵗW̃iv over i equals 1. Hence
argmax_v var_i( vᵗW̃iv / vᵗW̄̃v )
 = argmax_v (1/n) Σ_i ( vᵗW̃iv − 1 )²
 = argmax_v (1/n) Σ_i ( vᵗ(W̃i − I)v )²
 ≈(*) argmax_v vᵗΩv
which is maximized by the top eigenvector v1 of Ω
(*) uses (vᵗAv)² ≤ vᵗA²v for symmetric A and unit v, with equality when v is an eigenvector of A
Estimating Subspaces for Covariance Matrices (5)
Proof of Claim 2:
Using Wi = R⁻¹W̃iR⁻¹ and W̄ = R⁻¹R⁻¹:
argmax_v var_i( vᵗWiv / vᵗW̄v )
 = argmax_v var_i( (R⁻¹v)ᵗ W̃i (R⁻¹v) / (R⁻¹v)ᵗ(R⁻¹v) )
Substituting ṽ = R⁻¹v (so v = Rṽ, and the ratio is scale-invariant in ṽ):
 = R · argmax_ṽ var_i( ṽᵗW̃iṽ / ṽᵗṽ ) = Rv1 (by Claim 1, up to normalization)
Estimation without Speaker Labels
• Estimating the subspaces SW and SB requires speaker labels
• Let T denote the total covariance matrix of the i-vectors of the dev data (in a given subset)
• T can be estimated without speaker labels
• Note that for typical datasets, T ≈ B + W
→ An i-vector subspace that contains high inter-dataset variability in the T hyper-parameter will also contain high inter-dataset variability in either W or B (and vice versa)
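The identity behind "T ≈ B + W" is the law of total variance; with consistent (ddof = 0) scatter estimates the decomposition is exact, which a short simulation confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
n_spk, n_sess, d = 500, 10, 8
S = rng.standard_normal((n_spk, d)) * 2.0                     # speaker components
X = S[:, None, :] + rng.standard_normal((n_spk, n_sess, d))   # sessions

flat = X.reshape(-1, d)
mu = flat.mean(axis=0)
T = np.einsum('ni,nj->ij', flat - mu, flat - mu) / len(flat)  # total cov (no labels)

spk_means = X.mean(axis=1)
dev_b = spk_means - mu
B = np.einsum('ni,nj->ij', dev_b, dev_b) / n_spk              # between-speaker cov
dev_w = (X - spk_means[:, None, :]).reshape(-1, d)
W = np.einsum('ni,nj->ij', dev_w, dev_w) / len(dev_w)         # within-speaker cov

err = np.abs(T - (B + W)).max()   # exact ANOVA decomposition: T = B + W
```

Only T is computable without speaker labels (B and W need the labels), which is what the slide exploits.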
Partitioning the Dev Data
Switchboard data consists of 6 releases:
Code     Description
97S62    SWB-1 Release 2
98S75    SWB-2 Phase I
99S79    SWB-2 Phase II
2001S13  SWB Cellular Part 1
2002S06  SWB-2 Phase III
2004S07  SWB Cellular Part 2
We defined 3 different partitions:
GI-6   Each release in a separate subset
GD-12  A further partition into male and female subsets
GI-2   One telephone (97S62) and one cellular (2004S07) subset
Selected Results
Method                                     EER (in %)
Baseline (dev=MIXER)                       2.4
Baseline (dev=SWB)                         8.2
IDVC: µ only                               3.8
IDVC: µ, B(30), W(30)                      3.0
No speaker labels, IDVC: µ, T(30), GI-2    3.3
Results using GD-12 (GI-6 gives similar results)
IDVC Results: µ only
IDVC    EER (in %)   DCF (old)   DCF (new)
No      8.2          0.33        0.69
µ (10)  3.8          0.19        0.53
IDVC Results: W / B only (dev=SWB)
IDVC     EER (in %)   DCF (old)   DCF (new)
No       8.2          0.33        0.69
W (100)  3.4          0.16        0.50
B (100)  3.5          0.16        0.50
Conclusions
1. IDVC has been shown to effectively reduce the influence of dataset variability in the domain robustness challenge
   • EER was decreased by 63% (90% recovery)
   • DCF (old) was decreased by 58% (91% recovery)
   • DCF (new) was decreased by 32% (71% recovery)
2. IDVC works well even when trained
   • on two subsets only
   • without speaker labels
Text Dependent Speaker ID with a Small Devset
Training with a Small Devset
Why train with a small devset?
• TD training is more effective than TI training
• TD data collection is expensive
Techniques
1. Robust estimation of the GMM-supervector covariance matrix. Key idea: soft independence of parameters
2. Stabilization of the score-normalization parameters. Key idea: remove a subspace of the high-level-feature space that accounts for maximal estimation error of the score-normalization parameters
3. Minimize the combined effect of within-speaker variability and hyper-parameter estimation error using a matched filter on the high-level features
Speaker Recognition: Training with a Small Devset
GMM Supervectors
A GMM supervector is a stacking of g (~500) Gaussian mean vectors, each f (~40)-dimensional
GMM-NAP System Description
[Diagram] GMM-NAP scoring: enrollment and test sessions are each mapped MFCC features → GMM supervector → NAP compensation; the compensated supervectors are compared by a dot product, followed by ZT-score normalization.
[Diagram] NAP subspace estimation: MFCC features of dev sessions → GMM nuisance supervectors → intersession covariance matrix → NAP subspace.
Text Dependent Speaker ID with a Small Devset
Robust Estimation of Supervector Covariance
Robust Estimation of Supervector Covariance
Define groups G0, G1, … inspired by the GMM-supervector structure
Estimate the distribution of the covariance-matrix entries for each group (i.e., mean µ_g and variance σ_g²) from the observed data
Estimate each covariance entry w_c by MAP:
ŵ_c = argmax_{w_c} p(w_c | G0, G1, …)
Assuming normal distributions and group independence, the estimate shrinks the observed entry toward its group mean, with weights given by the inverse variances:
ŵ_c = ( w_c / σ′² + µ_g / σ_g² ) / ( 1/σ′² + 1/σ_g² )
where g is the group of entry c and σ′² is the observation-noise variance
Gaussian-based Smoothing (GBS)
1. Whiten the features (in supervector space)
   Estimate a linear transformation of the feature space that minimizes the average (over all Gaussians) off-diagonal correlations in the supervector covariance matrix, and normalizes the average diagonal elements to 1
2. Smooth the sample covariance matrix
   • Relaxed block-diagonal structure:
     C(g1,f1),(g2,f2) = 0 for g1 ≠ g2, f1 ≠ f2
   • The covariance between two supervector components that correspond to different feature coefficients (but the same Gaussian) depends on the Gaussian index only:
     C(g,f1),(g,f2) = α_g for f1 ≠ f2
   • The covariance between two supervector components that correspond to the same feature coefficient depends on the Gaussian indices only:
     C(g1,f),(g2,f) = β_{g1,g2}
3. De-whiten (reverse step 1)
Text Dependent Speaker ID with a Small Devset
Score Stabilization
Motivation
Score normalization
• Essential for GMM-NAP
• Contributes to PLDA when PLDA is poorly trained
  − Small devset or data mismatch
Observation: score normalization is very sensitive to devset size
Experiment
• Common passphrase
• Data: Wells Fargo
• Large dev: 200 speakers × 4 sessions
• Small dev: 20/30 speakers × 2 sessions
• NAP trained on the small dev
Score-Norm Data   20 speakers EER (in %)   30 speakers EER (in %)
Large dev         1.7                      1.5
Small dev         2.8                      2.4
Degradation       65%                      60%
Znorm
Znorm normalizes impostor scores for an enrolled supervector s to zero mean and unit variance
x – test supervector
φ(s, x) – scoring function
µ_Z(s), σ_Z(s) – mean and std parameters for supervector s
φ_Znorm(s, x) = ( φ(s, x) − µ_Z(s) ) / σ_Z(s)   (1)
µ_Z(s) = E_{x′}[ φ(s, x′) ]   (2)
σ_Z²(s) = var_{x′}[ φ(s, x′) ]   (3)
Score Stabilization: Motivation
• Accuracy degrades because the estimates of the score-norm parameters µ(s) and σ(s) are too noisy
• We want to reduce the estimation noise of the score-norm parameters
• We seek a low-dimensional subspace of the supervector space that accounts for high variability in the score-norm parameters
• Our analysis is done for Znorm but is valid for Tnorm and ZTnorm as well
Method
Given a devset of supervectors X = {x1,…,xn}, the unbiased estimates of the Znorm parameters are
µ̂_Z(s, X) = (1/n) Σ_{x∈X} φ(s, x)   (4)
σ̂_Z²(s, X) = (1/(n−1)) Σ_{x∈X} ( φ(s, x) − µ̂_Z(s, X) )²   (5)
Goal: minimize the expected variance of φ_Znorm(s, x):
E_{s,x}[ var_X( ( φ(s, x) − µ̂_Z(s, X) ) / σ̂_Z(s, X) ) ]   (6)
In practice, we independently minimize the expected variances of µ̂_Z(s, X) and σ̂_Z(s, X)
Stabilization of σ_Z(s, X)
Assumptions
• Impostor scores are normally distributed (per speaker)
• Scoring is a dot product: φ(s, x) = sᵗx; WLOG E[s] = E[x] = 0
Note that
µ_Z(s) = 0,  σ_Z²(s) = sᵗ cov(x) s   (7)
For normally distributed scores,
var_X( σ̂_Z²(s, X) ) = (2/(n−1)) σ_Z⁴(s) = (2/(n−1)) ( sᵗ cov(x) s )²   (8)
Therefore
E_s[ var_X( σ̂_Z²(s, X) ) ] ∝ tr( cov²(x) )   (9)
Conclusion: the subspace to be removed is spanned by the top eigenvectors of cov(x), which is the total variability covariance matrix
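A small simulation of this conclusion: synthetic supervectors with one dominant total-variability direction, random speakers, and resampled devsets. Removing the top eigenvector of cov(x) sharply shrinks the expected variance of the σ̂_Z² estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_spk, n_rep = 20, 50, 20, 100
scales = np.ones(d); scales[0] = 10.0        # one dominant direction in cov(x)
speakers = rng.standard_normal((n_spk, d))

v = np.zeros(d); v[0] = 1.0                  # top eigenvector of cov(x)
P = np.eye(d) - np.outer(v, v)               # subspace-removal projection

def mean_var_of_sigma2(proj):
    """E_s[var_X(sigma_hat_Z^2)] estimated by resampling devsets X."""
    per_spk = []
    for s in speakers:
        sig2 = [(rng.standard_normal((n, d)) * scales @ proj @ s).var(ddof=1)
                for _ in range(n_rep)]
        per_spk.append(np.var(sig2))
    return np.mean(per_spk)

before = mean_var_of_sigma2(np.eye(d))
after = mean_var_of_sigma2(P)                # much smaller after removal
```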
Stabilization of µ_Z(s, X) / σ_Z(s, X)
Further assumption: σ̂_Z(s, X) is already stabilized, i.e., σ̂_Z(s, X) ≈ σ_Z(s, X)
Then
var_X( µ̂_Z(s, X) ) / σ_Z²(s) = ( (1/n) sᵗ cov(x) s ) / ( sᵗ cov(x) s ) = 1/n   (13)
→ the relative estimation error does not depend on s, so no further stabilization is possible using a subspace-removal technique
Dataset
Text-dependent (Wells Fargo)
• 200 speakers for dev + 550 for eval
• 2 landline (LL) + 2 cellular (CC) sessions per speaker
• Common passphrase (0-1-2-3-4-5-6-7-8-9)
• 3 repetitions for enrollment, 1 for verification
• Reduced: 20-50 speakers with 2/4 sessions per speaker
Text Dependent Results
ML-based intersession matrix estimation; SS-n: rank of the removed subspace
System            20LC  20   30CC  30LL  30RR  30LC  30   50LC  50   200LLCC
NAP 10            2.8   2.5  3.2   3.3   3.5   2.4   2.1  1.8   1.6  1.0
NAP 10 SS-10      2.4   2.0  2.7   2.4   2.4   2.0   1.8  1.6   1.4  1.1
NAP 10 SS-25      2.3   2.0  2.4   2.4   2.4   2.1   1.8  1.7   1.5  1.1
NAP 10 SS-50      2.3   1.9  2.4   2.6   2.5   2.1   1.8  1.8   1.5  1.1
NAP 10 Norm-full  1.7   1.6  2.0   2.0   1.9   1.5   1.4  1.5   1.2  1.0
EER reduction (SS-25)   18%  20%  25%  27%  31%  13%  14%  6%   6%   -10%
Recovery rate (SS-25)   45%  56%  67%  69%  69%  33%  43%  33%  25%  -10%
Text Dependent Results (2)
GBS-based intersession matrix estimation; SS-n: rank of the removed subspace
System     20LC  20   30CC  30LL  30RR  30LC  30   50LC  50   200LLCC
GBS        2.5   2.3  2.7   2.7   2.7   2.2   2.1  1.8   1.8  1.6
GBS SS-10  2.1   2.0  2.2   2.1   2.1   1.9   1.8  1.7   1.6  1.3
GBS SS-25  2.1   1.9  2.2   2.1   2.0   1.9   1.8  1.7   1.6  1.3
GBS SS-50  2.1   1.8  2.2   2.3   2.4   1.9   1.7  1.7   1.6  1.4
EER reduction (SS-25)  16%  17%  19%  22%  26%  14%  14%  6%  11%  19%
Text Dependent Results (3)
SS25: rank of the removed subspace is 25
[Plot] EER (in %) per subset (20LC … 200LLCC) for NAP10, NAP10+SS25, GBS, and GBS+SS25
Conclusions
1. Removing the top eigenvectors of the total variability covariance matrix stabilizes the score-normalization parameters
2. The method reduces error when the devset is small
3. The method combines well with Gaussian-based smoothing (GBS) for NAP estimation
Text Dependent Speaker ID with a Small Devset
Matched Filter for Speaker Recognition
Motivation
• Cosine distance is fundamental in speaker ID. It motivates:
  − Score normalization
  − I-vector centering
  − I-vector length normalization
  − Domain-mismatch problems and solutions (such as IDVC)
• Questions:
  − Why is cosine distance needed?
  − Are our generative models in line with cosine distance?
• Cosine-distance-based geometry requires knowledge of the center
  − How can we handle uncertainty in the mean?
    o Domain mismatch: IDVC
    o Training from a small devset: this work
Modified from: Dehak, N., et al.: Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech and Language Processing, May 2011.
Speaker Geometry
Matched Filters
• An observed signal x is the sum of a desirable signal s and an additive noise v:
  x = s + v   (1)
• We seek a filter h that maximizes the output signal-to-noise ratio, where the output is the inner product of the filter and the observed signal x:
  h = α cov⁻¹(v) s   (2)
  where α is a scaling constant (which cancels out after score normalization)
• The scoring function is therefore:
  score = sᵗ cov⁻¹(v) x   (3)
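A numerical sketch of the matched-filter property: h = cov⁻¹(v)s attains the maximal output SNR, sᵗcov⁻¹(v)s, and in particular beats plain dot-product scoring (s and the noise covariance below are random toys):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
s = rng.standard_normal(d)                      # desired signal
A = rng.standard_normal((d, d))
C = A @ A.T + 0.1 * np.eye(d)                   # noise covariance cov(v)

h = np.linalg.solve(C, s)                       # matched filter (alpha = 1)

def out_snr(filt):
    """Output SNR of filt' x for x = s + v."""
    return (filt @ s) ** 2 / (filt @ C @ filt)

snr_mf = out_snr(h)                             # equals s' C^{-1} s
snr_plain = out_snr(s)                          # plain dot-product scoring
```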
Speaker ID Generative Model
Given a high-level feature vector x extracted from a session, consider the model
x = µ + c_x (s + n_x)   (4)
s – mean high-level feature vector representing the speaker
n_x – session-dependent intra-speaker nuisance vector
µ – center/mean of the speaker-population distribution of high-level features
c_x – session-dependent scaling factor (the basis for score normalization, cosine-distance scoring, and i-vector length normalization)
Speaker ID Generative Model (cont.)
Given a pair of high-level features x and y, we apply the matched-filter framework
x̃, ỹ – x and y after centering and length normalization
δ – center estimation error (bias)
For a same-speaker pair, the difference is driven by the nuisance terms (and the bias δ):
x̃ − ỹ = (1/c_x) n_x − (1/c_y) n_y
We can apply matched-filter theory with
W – intra-speaker covariance matrix: W = cov(n)   (5)
Δ – center-uncertainty covariance matrix: Δ = cov(δ)   (6)
Speaker ID Generative Model (cont.)
Note that a reasonable estimate for Δ is based on the sample total covariance matrix T (estimated from the devset):
Δ = T / m   (7)
m – number of speakers in the development dataset
We obtain the following scoring function:
f(x, y) = ỹᵗ ( W + T/m )⁻¹ x̃   (8)
Training with a Small Devset : Results
61
[1] H. Aronowitz, "Exploiting Supervector Structure for Speaker Recognition Trained on a Small Development Set", in Proc. Interspeech, 2015.
[2] H. Aronowitz, "Score Stabilization for Speaker Recognition Trained on a Small Development Set", in Proc. Interspeech, 2015.
[3] H. Aronowitz, "Speaker Recognition Using Matched Filters", in Proc. ICASSP, 2016.
System 20LC 20 30CC 30LL 30LC 30 50LC 50 200 LLCC
Baseline 2.8 2.5 3.2 3.3 2.4 2.1 1.8 1.6 1.0
Robust Smoothing (RS) 2.5 2.3 2.7 2.7 2.2 2.1 1.8 1.8 1.6
Score stabilization (SS) 2.3 2.0 2.4 2.4 2.1 1.8 1.7 1.5 1.1
Matched filter 2.2 1.9 2.3 2.3 1.9 1.7 1.6 1.4 0.9
RS + SS 2.1 1.9 2.2 2.1 1.9 1.8 1.7 1.6 1.3
Error reduction (in %), RS + SS:         25 24 31 36 21 14 6 - -
Error reduction (in %), matched filter:  21 24 28 30 21 19 11 13 10
Audiovisual Synchrony Detection
Mobile Biometric Authentication
Background
• User is authenticated using a mobile device (smartphone/tablet)
• Input is audiovisual
• Biometric authentication is done using either: − Speaker ID
− Face ID
− Fusion of both
• EERs of ~0.1% may be obtained combining the speaker and face modalities
• Emphasizes the need for liveness detection / spoofing detection
Spoofing Attacks
Audio modality
• Playback
• Splicing (audio 'cut and paste')
• Voice transformation
• TTS (mostly adaptive TTS)
Face modality
• Video attack
• Photo attack
Audiovisual modality
• Modalities spoofed independently
  − Modalities not synchronized
• Modalities spoofed jointly
  − Much harder
  − Synchrony is not easy to obtain
Synchrony Detection
Related work
• Not an easy task
  − Bregler and Konig report that the mutual information between the audio and video streams was maximal when the lag between them was ~120 ms
  − Lags are highly context-sensitive
• Performance is limited in accuracy, robustness, and training requirements
• Previous work targets general-purpose synchrony detection
  − Text independence
  − Speaker independence
  − Correlation-based approaches obtain poor performance
Text Dependent Synchrony Detection
Setup
• Fixed passphrase (e.g., My voice is my password, OK Google, etc.)
Goal
• Given an enrollment clip from a target user, verify the synchronization of a test clip supposedly spoken by the target speaker
Main idea
• Exploit text dependency
• Avoid direct comparison of speech and images (apples to oranges)
• The method is applicable to user-selected passphrases as well
Outline of Method
• Temporal alignments are computed independently for the audio and visual modalities
• Alignments are expected to be similar for synchronized recordings
[Diagram] Enrollment and test clips are aligned separately in each modality
Audio-based Alignment
• Voice Activity Detection (VAD): Energy-based
• MFCC feature extraction
• Dynamic Time Warping (DTW)
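A self-contained DTW sketch (the full O(nm) dynamic program with backtracking); the toy 1-D "feature" sequences stand in for MFCC matrices:

```python
import numpy as np

def dtw_path(a, b):
    """Dynamic time warping between two feature sequences (frames x dims).
    Returns the optimal alignment path as a list of (i, j) frame pairs."""
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1],
                                               D[i - 1, j - 1])
    # backtrack from the end to recover the optimal path
    path, i, j = [(n - 1, m - 1)], n, m
    while (i, j) != (1, 1):
        opts = [(D[i - 1, j - 1], (i - 1, j - 1)),
                (D[i - 1, j], (i - 1, j)),
                (D[i, j - 1], (i, j - 1))]
        _, (i, j) = min(opts)
        path.append((i - 1, j - 1))
    return path[::-1]

# toy "MFCC" sequences: b is a time-stretched copy of a
a = np.sin(np.linspace(0, 3, 20))[:, None]
b = np.sin(np.linspace(0, 3, 30))[:, None]
path = dtw_path(a, b)
```

The same routine aligns the visual feature sequences; comparing the two paths is the basis of the synchrony score.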
Visual-based Alignment
• Detect mouth region of interest and normalize it
• Extract visual features
• Dynamic Time Warping (DTW)
Mouth Detection and Normalization
• Face detection using Viola-Jones
• Lips detection using an Active Shape Model (ASM)
• A 50×30-pixel region is cropped around the lips
Visual Features: Histogram of Oriented Gradients (HOG)
• A popular shallow feature
  − each sample is evenly divided into 16×16 cells
  − for each cell, we calculate a histogram of 8 gradient-orientation bins (in 0–2π)
• We found HOG unsuitable for finding good DTW alignments between clips
Image credit: https://github.com/pavitrakumar78/Python-Custom-Digit-Recognition
Deep Visual Features
Architecture
• The input is a stack of 5 consecutive 50×30 lip crops
• 3 convolutional layers
• 2 fully connected (FC) layers
• ReLUs applied after each layer, except for the last FC layer
[Diagram] Input stacks → convolutional and FC layers → visual feature vectors
DNN Training
• The network is trained using a Siamese architecture
• In each iteration the network is given a pair of lip stacks, V1 and V2, and a label y (0/1: synchronized/unsynchronized)
• The loss function L is
  L = (1/2N) Σ_{i=1}^{N} [ (1 − y_i) d_i² + y_i max(δ − d_i, 0)² ]
  where y_i and d_i are the label and the Euclidean distance for the i-th lip-stack pair, respectively, and δ is a predefined margin
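The contrastive loss above, using the slide's label convention (0 = synchronized), worked through on toy distances:

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """Contrastive loss over N pairs: d = Euclidean distances, y = labels
    (0 = synchronized, 1 = unsynchronized)."""
    d, y = np.asarray(d, float), np.asarray(y, float)
    pos = (1 - y) * d ** 2                       # pull synchronized pairs together
    neg = y * np.maximum(margin - d, 0.0) ** 2   # push others beyond the margin
    return (pos + neg).sum() / (2 * len(d))

# a synchronized pair at distance 0 and an unsynchronized pair beyond the
# margin incur no loss; the two middle pairs are penalized
loss = contrastive_loss([0.0, 0.5, 0.2, 1.5], [0, 0, 1, 1], margin=1.0)
# loss = (0 + 0.25 + 0.64 + 0) / 8 = 0.11125
```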
DNN Training: Creation of Training Pairs
Goal: encourage the DNN to produce visual features that mimic the correspondences between MFCC features
Positive pairs
1. Select a pair of clips with the same person and the same text
2. Find the optimal DTW alignment for the audio streams using MFCCs
3. Map the audio-based alignment into a visual-based alignment
4. For each pair of aligned visual frames, form a positive training sample from the corresponding lip stacks
Negative pairs
1. Select visual pairs corresponding to pairs of audio frames with maximal pairwise MFCC distance (off the alignment path)
DNN Training: Creation of Training Pairs
Example
MFCC distance matrix of two clips. fa and fc represent a negative pair while fb and fd represent a positive pair
Dataset and Setup
• Data recorded with an iPad 2 & iPhone 5 held at arm's length
• 41 subjects × 2 or 3 sessions on each device
• A session includes a subject repeating 3 times:
  − My voice is my password
  − Please verify me with the number
• We used 5-fold cross-validation to train the DNN
  − Per fold, 80% of the users are used for training and the remaining 20% for evaluation
• Average utterance duration is 1.5 s (of speech)
• ~13K clip pairs used for training (incl. cross-device)
• 60 positive + 60 negative samples per clip pair
• 10% of the training set used for monitoring learning progress
Photo Attack Detection
Setup
• Positive trials
  − Synchronized clips
• Negative trials
  − Audio: authentic ("playback attack")
  − Visual: a static image of the subject held in front of the camera and slightly moved to mimic 'liveness'
Results (EER in %)
Video Attack Detection
Setup
• Positive trials (4.2K)
  − Synchronized clips
• Negative trials (4.2K)
  − Audio: authentic ("playback attack")
  − Visual: the visual stream is taken from a different clip ("playback attack") and chopped to match the audio endpoints
Results (EER in %)
Conclusion
• We introduced a text-dependent audiovisual synchrony detection scheme
• The availability of enrollment audiovisual clips is exploited
• The same passphrase must be used for both enrollment and authentication
• There is no direct comparison of audio- and visual-based low-level features