Introduction Mapping Modeling Speaker Diarization Summary
H. Aronowitz (IBM) Intra-Class Variability Modeling for Speech Processing June 08 1
Dr. Hagai Aronowitz
IBM Haifa Research Lab
Presentation is available online at: http://aronowitzh.googlepages.com/
Intra-Class Variability Modeling for Speech Processing
Speech Classification: Proposed framework
Given labeled training segments from class + and class –, classify unlabeled test segments
Classification framework
1. Represent speech segments in segment-space
2. Learn a classifier in segment-space
• SVMs
• NNs
• Bayesian classifiers
• …
Outline: Intra-Class Variability Modeling for Speech Processing
1 Introduction to GMM based classification
2 Mapping speech segments into segment space
3 Intra-class variability modeling
4 Speaker diarization
5 Summary
Text-Independent Speaker Recognition: GMM-Based Algorithm [Reynolds 1995]
GMM based speaker recognition
Estimate Pr(yt|S)
1. Train a universal background model (UBM) GMM using EM
2. For every target speaker S: train a GMM GS by applying MAP-adaptation
Pr(Y|S) = ?
Assuming frame independence:
$\Pr(Y \mid S) = \Pr(y_1, \ldots, y_T \mid S) = \prod_{t=1}^{T} \Pr(y_t \mid S)$
[Figure: the UBM and the adapted GMMs Q1 (speaker #1) and Q2 (speaker #2), with means μ1, μ2, μ3, in the R^26 MFCC feature space]
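The scoring recipe on this slide can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: diagonal covariances, mean-only MAP adaptation, a relevance factor of 16, and a toy two-Gaussian UBM are all choices made here, not details given in the slides. The function names are hypothetical.

```python
import numpy as np

def _log_joint(frames, weights, means, variances):
    """log(w_g * N(y_t; mu_g, diag(var_g))) for every frame t and Gaussian g."""
    frames, means, variances = (np.asarray(a, dtype=float)
                                for a in (frames, means, variances))
    diff = frames[:, None, :] - means[None, :, :]                 # (T, G, D)
    log_gauss = -0.5 * np.sum(np.log(2 * np.pi * variances)
                              + diff ** 2 / variances, axis=2)    # (T, G)
    return np.log(np.asarray(weights, dtype=float)) + log_gauss

def frame_loglik(frames, weights, means, variances):
    """log Pr(y_t | GMM) per frame, using a stable log-sum-exp over Gaussians."""
    lj = _log_joint(frames, weights, means, variances)
    m = lj.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))

def segment_loglik(frames, weights, means, variances):
    """log Pr(y_1..y_T | S) = sum_t log Pr(y_t | S) under frame independence."""
    return frame_loglik(frames, weights, means, variances).sum()

def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of the UBM (relevance factor is an assumed value)."""
    frames, means = np.asarray(frames, dtype=float), np.asarray(means, dtype=float)
    lj = _log_joint(frames, weights, means, variances)
    post = np.exp(lj - frame_loglik(frames, weights, means, variances)[:, None])
    counts = post.sum(axis=0)                                     # soft counts per Gaussian
    data_means = (post.T @ frames) / np.maximum(counts, 1e-10)[:, None]
    alpha = (counts / (counts + relevance))[:, None]
    return alpha * data_means + (1.0 - alpha) * means

# Toy example: a 2-Gaussian UBM in 2-D and one speaker centred at (1, 1).
rng = np.random.default_rng(0)
ubm = (np.array([0.5, 0.5]), np.array([[0.0, 0.0], [5.0, 5.0]]), np.ones((2, 2)))
train = rng.normal(loc=1.0, size=(200, 2))
speaker = (ubm[0], map_adapt_means(train, *ubm), ubm[2])
test = rng.normal(loc=1.0, size=(100, 2))
llr_per_frame = (segment_loglik(test, *speaker) - segment_loglik(test, *ubm)) / len(test)
```

Note that the cost of `segment_loglik` grows linearly with the number of frames, which is the inefficiency the analysis slide points out.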
GMM Based Algorithm - Analysis
1. Invalid frame independence assumption: factors such as channel, emotion, lexical variability, and speaker aging cause frame dependency
2. GMM scoring is inefficient: linear in the length of the audio
3. GMM scoring does not support indexing
Mapping Speech Segments into Segment Space: GMM scoring approximation 1/4
Definitions
X: training session for target speaker
Y: test session
Q: GMM trained for X
P: GMM trained for Y
Goal
Compute Pr(Y |Q) using GMMs P and Q only
Motivation
1. Efficient speaker recognition and indexing
2. More accurate modeling
Mapping Speech Segments into Segment Space: GMM scoring approximation 2/4
$\frac{1}{T} \log \Pr(Y \mid Q) = \frac{1}{T} \sum_{t=1}^{T} \log \Pr(y_t \mid Q) \xrightarrow{T \to \infty} \int_x \Pr(x \mid P) \log \Pr(x \mid Q)\, dx = -H(P, Q)$   (1)
The limit is the negative cross entropy between P and Q.
Approximating the cross entropy between two GMMs
1. Matching based lower bound [Aronowitz 2004]
2. Unscented-transform based approximation [Goldberger & Aronowitz 2005]
3. Other options in [Hershey 2007]
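As a rough illustration of the first option, the sketch below computes a matching-based lower bound on $\int P(x) \log Q(x)\,dx$ for two diagonal-covariance GMMs: each Gaussian of P is matched to the single best-scoring weighted Gaussian of Q, and the cross term between two Gaussians is evaluated in closed form. The function names and the `(weights, means, variances)` tuple layout are assumptions made for this example, and it is not claimed to reproduce the cited paper line by line.

```python
import numpy as np

def expected_log_gauss(mu_p, var_p, mu_q, var_q):
    """Closed form of the integral of N(x; mu_p, var_p) * log N(x; mu_q, var_q)
    over x, for diagonal Gaussians (var_* are per-dimension variances)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var_q)
                         + (var_p + (mu_p - mu_q) ** 2) / var_q)

def matching_neg_cross_entropy(p, q):
    """Lower bound on the integral of P(x) log Q(x) dx.

    p and q are (weights, means, variances) tuples. Because a sum of positive
    terms is at least its largest term, replacing log Q by the log of one
    weighted Gaussian of Q can only decrease the value.
    """
    (w_p, mu_p, var_p), (w_q, mu_q, var_q) = p, q
    total = 0.0
    for g in range(len(w_p)):
        scores = [np.log(w_q[j]) + expected_log_gauss(mu_p[g], var_p[g], mu_q[j], var_q[j])
                  for j in range(len(w_q))]
        total += w_p[g] * max(scores)
    return total
```

For single-Gaussian models the bound is exact, which gives a quick sanity check: with P = Q a unit-variance Gaussian in two dimensions, the value is minus the entropy, $-(1 + \log 2\pi)$.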
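Mapping Speech Segments into Segment Space: GMM scoring approximation 3/4
Matching based approximation
$-H(P, Q) \approx \sum_{g=1}^{G} w_g^P \max_j \left\{ \log w_j^Q - \sum_{d=1}^{D} \log \sigma_{j,d}^Q - \frac{1}{2} \sum_{d=1}^{D} \frac{(\sigma_{g,d}^P)^2}{(\sigma_{j,d}^Q)^2} - \frac{1}{2} \sum_{d=1}^{D} \frac{(\mu_{g,d}^P - \mu_{j,d}^Q)^2}{(\sigma_{j,d}^Q)^2} \right\} + C$   (2)
Assuming weights and covariance matrices are speaker independent (+ some approximations):
$-H(P, Q) \approx -\sum_{g=1}^{G} w_g \sum_{d=1}^{D} \frac{(\mu_{g,d}^P - \mu_{g,d}^Q)^2}{2 \sigma_{g,d}^2} + C$   (3)
Mapping T is induced:
$T: \mathrm{GMM} \to \mathbb{R}^{G \cdot D}, \quad T(\mathrm{GMM}) = \hat{\mu}; \quad \hat{\mu}_{g,d} = \frac{\sqrt{w_g}\, \mu_{g,d}^{\mathrm{GMM}}}{\sigma_{g,d}}$ (entries stacked over all g = 1..G and d = 1..D)   (4)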
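A small numerical check may help to see why the mapping is "induced". Assuming the reading of T in which each entry is $\sqrt{w_g}\,\mu_{g,d}/\sigma_{g,d}$, the data-dependent part of approximation (3) equals $-\frac{1}{2}\|T(P) - T(Q)\|^2$, so scoring reduces to a squared Euclidean distance between supervectors. The function names below are hypothetical and the parameters are random toy values.

```python
import numpy as np

def supervector(means, weights, variances):
    """T(GMM): stack sqrt(w_g) * mu_{g,d} / sigma_{g,d} over all g and d."""
    return (np.sqrt(weights)[:, None] * means / np.sqrt(variances)).ravel()

def approx_neg_cross_entropy(mu_p, mu_q, weights, variances):
    """Data-dependent part of approximation (3), with the constant C dropped."""
    return -np.sum(weights[:, None] * (mu_p - mu_q) ** 2 / (2.0 * variances))

# Toy GMM parameters: G = 4 Gaussians in D = 3 dimensions, shared weights and variances.
rng = np.random.default_rng(0)
weights = rng.dirichlet(np.ones(4))
variances = rng.uniform(0.5, 2.0, size=(4, 3))
mu_p, mu_q = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))

direct = approx_neg_cross_entropy(mu_p, mu_q, weights, variances)
via_map = -0.5 * np.sum((supervector(mu_p, weights, variances)
                         - supervector(mu_q, weights, variances)) ** 2)
```

Here `direct` and `via_map` agree up to floating-point rounding, which is what allows efficient scoring and indexing in supervector space.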
Mapping Speech Segments into Segment Space: GMM scoring approximation 4/4
Results
Figure and table taken from: H. Aronowitz, D. Burshtein, “Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)”, in IEEE Trans. on Audio, Speech & Language Processing, September 2007.
Other Mapping Techniques
1. Anchor modeling projection [Sturim 2001]
• efficient but inaccurate
2. MLLR transforms [Stolcke 2005]
• accurate but inefficient
3. Kernel-PCA-based mapping [Aronowitz 2007c]
• Given a set of objects and a kernel function (a dot product between each pair of objects), finds a mapping of the objects into R^n which preserves the kernel function
• accurate & efficient
Intra-Class Variability Modeling [Aronowitz 2005b]: Introduction
The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
• channel, noise
• language
• stress, emotion, aging
The frame independence assumption does not hold in these cases!
$\Pr(y_1, \ldots, y_T \mid S) = \prod_{t=1}^{T} \Pr(y_t \mid S)$   (1)
Instead, we can use a more relaxed assumption:
$\Pr(y_1, \ldots, y_T \mid S, f) = \prod_{t=1}^{T} \Pr(y_t \mid S, f)$   (2)
which leads to:
$\Pr(y_1, \ldots, y_T \mid S) = \int \Pr(y_1, \ldots, y_T \mid S, f) \Pr(f \mid S)\, df = \int \Pr(f \mid S) \prod_{t=1}^{T} \Pr(y_t \mid S, f)\, df$   (3)
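The difference between the strict and the relaxed assumption can be illustrated with a one-dimensional toy model. The choices below are assumptions made purely for illustration and are not taken from the slides: the session factor f is an additive offset with f ~ N(0, tau2), and frames are Gaussian around mu + f. The integral over f is evaluated on a grid.

```python
import numpy as np

def log_normal(x, mean, var):
    """Elementwise log N(x; mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def loglik_independent(y, mu, var):
    """Frames independent given the speaker only (no session factor)."""
    return log_normal(y, mu, var).sum()

def loglik_session(y, mu, var, tau2, n_grid=4001):
    """Frames independent given speaker AND session offset f; f is integrated out.

    The integral over f ~ N(0, tau2) is approximated by a Riemann sum on a grid
    covering +/- 10 standard deviations of the prior.
    """
    f = np.linspace(-10 * np.sqrt(tau2), 10 * np.sqrt(tau2), n_grid)
    log_integrand = (log_normal(f, 0.0, tau2)
                     + log_normal(y[:, None], mu + f[None, :], var).sum(axis=0))
    return np.logaddexp.reduce(log_integrand) + np.log(f[1] - f[0])

# One session whose frames are all shifted by the same offset (here 2.0).
rng = np.random.default_rng(1)
y = 2.0 + rng.normal(size=50)
ll_indep = loglik_independent(y, mu=0.0, var=1.0)
ll_sess = loglik_session(y, mu=0.0, var=1.0, tau2=4.0)
```

In this toy setting the session model explains the common shift once, through f, instead of penalising it in every frame, so `ll_sess` is much larger than `ll_indep`.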
Old vs. New Generative Models
Old model: Speaker → a GMM → frame sequence (frames generated independently)
New model: Speaker → a PDF over GMM space → session GMM (a GMM) → frame sequence (frames generated independently)
Session-GMM Space
[Figure: session-GMM space with separate regions for speaker #1, speaker #2 and speaker #3; two points in the region of speaker #1 mark the GMM for session A and the GMM for session B of that speaker]
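Modeling in Session-GMM Space 1/2
Recall mapping T induced by the GMM approximation analysis:
$T: \mathrm{GMM} \to \mathbb{R}^{G \cdot D}, \quad T(\mathrm{GMM}) = \hat{\mu}; \quad \hat{\mu}_{g,d} = \frac{\sqrt{w_g}\, \mu_{g,d}^{\mathrm{GMM}}}{\sigma_{g,d}}$ (entries stacked over all g = 1..G and d = 1..D)
• $\hat{\mu}$ is called a supervector
• A speaker is modeled by a multivariate normal distribution in supervector space:
$\Pr(\hat{\mu} \mid S) \sim N(\mu_s, \tilde{\Sigma})$, with $\tilde{\Sigma}$ a $GD \times GD$ covariance matrix   (3)
• A typical dimension of $\tilde{\Sigma}$ is 50,000*50,000
• $\tilde{\Sigma}$ is estimated robustly using PCA + regularization: the covariance is assumed to be a low-rank matrix with an additional non-zero (noise) diagonal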
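Modeling in Session-GMM Space 2/2: Estimating covariance matrix
[Figure: in supervector space, each of speaker #1, speaker #2 and speaker #3 has sessions 1 and 2 distributed with covariance $\tilde{\Sigma}$; the differences between same-speaker sessions are pooled in delta supervector space, where they are distributed with covariance $2\tilde{\Sigma}$]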
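One possible way to realise "PCA + regularization" is sketched below. It assumes that differences between two same-speaker supervectors are distributed as N(0, 2Σ), keeps the top principal directions, and replaces the remaining spectrum by a single isotropic noise variance (a probabilistic-PCA-style choice that is an assumption of this sketch, not a detail given in the slides). Function names are hypothetical. The full GD × GD matrix is never formed.

```python
import numpy as np

def estimate_covariance(deltas, rank):
    """Estimate Sigma as low rank plus a noise diagonal.

    deltas: (n, dim) differences between supervectors of the same speaker,
            assumed to follow N(0, 2 * Sigma). Requires rank < dim.
    Returns (basis, eigvals, noise): the top `rank` principal directions (rows),
    their variances, and the isotropic noise variance used elsewhere.
    """
    n, dim = deltas.shape
    _, sing, vt = np.linalg.svd(deltas, full_matrices=False)
    eigvals = np.zeros(dim)
    eigvals[:len(sing)] = sing ** 2 / (2 * n)     # spectrum of delta^T delta / (2n)
    noise = max(eigvals[rank:].mean(), 1e-8)      # average discarded variance, floored
    return vt[:rank], eigvals[:rank], noise

def log_density(x, mean, basis, eigvals, noise):
    """log N(x; mean, Sigma) with Sigma = B^T diag(eigvals - noise) B + noise * I."""
    r = x - mean
    proj = basis @ r                              # coordinates in the low-rank subspace
    resid = r @ r - proj @ proj                   # squared norm outside the subspace
    dim, rank = len(r), len(eigvals)
    logdet = np.sum(np.log(eigvals)) + (dim - rank) * np.log(noise)
    quad = np.sum(proj ** 2 / eigvals) + resid / noise
    return -0.5 * (dim * np.log(2 * np.pi) + logdet + quad)
```

Scoring a test supervector against a speaker model then costs O(rank × dim) rather than O(dim²), which matters at the dimensions quoted on the previous slide.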
Experimental Setup
Datasets
• $\tilde{\Sigma}$ is estimated from the NIST-2006-SRE corpus
• Evaluation is done on the NIST-2004-SRE corpus
System description
• ETSI MFCC (13-cep + 13-delta-cep)
• Energy based voice activity detector
• Feature warping
• 2048 Gaussians
• Target models are adapted from GI-UBM
• ZT-norm score normalization
Results
38% reduction in EER
Other Modeling Techniques
• NAP+SVMs [Campbell 2006]
• Factor Analysis [Kenny 2005]
• Kernel-PCA [Aronowitz 2007c]
Kernel-PCA based algorithm
• Model each supervector as s + u, where
  s ∈ S: common speaker subspace
  u ∈ U: speaker unique subspace
• S is spanned by a set of development supervectors (700 speakers)
• U is the orthogonal complement of S in supervector space
• Intra-speaker variability is modeled separately in S and in U
• U was found to be more discriminative than S
• EER was reduced by 44% compared to baseline GMM
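The split of a supervector into a component in the common speaker subspace and a component in its orthogonal complement can be sketched with a plain orthogonal projection. This linear-algebra version is only an illustration of the decomposition (the cited algorithm works through a kernel), and the function name is hypothetical.

```python
import numpy as np

def split_supervector(x, dev):
    """Decompose x into s (in the span of the development supervectors) and u
    (in the orthogonal complement), so that x = s + u.

    dev: (m, dim) array whose rows are development supervectors spanning S.
    """
    # Orthonormal basis of span(dev) from the SVD, dropping near-zero directions.
    _, sing, vt = np.linalg.svd(dev, full_matrices=False)
    basis = vt[sing > 1e-10 * sing.max()]
    s = basis.T @ (basis @ x)
    return s, x - s
```

With 700 development speakers and supervectors of much higher dimension, S is a small subspace and most of the dimensions fall into U, which is consistent with U carrying substantial discriminative information.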
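Kernel-PCA Based Modeling
[Figure: sessions x and y in session space are mapped by the kernel-induced function f to f(x) and f(y) in feature space; K-PCA over the anchor sessions projects them to Tx and Ty in the common speaker subspace (R^n), leaving residuals ux and uy in the speaker unique subspace]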
Trainable Speaker Diarization [Aronowitz 2007d]
Goals
• Detect speaker changes: “speaker segmentation”
• Cluster speaker segments: “speaker clustering”
Motivation for new method
Current algorithms do not exploit available training data (besides tuning thresholds, etc.)!
Method
Explicitly model inter-segment intra-speaker variability from labeled training data, and use it in the metric used by the change-detection and clustering algorithms.
Speaker recognition on pairs of 3s segments
Dev data
• BNAD05 (5hr): Arabic, broadcast news
Eval data
• BNAT05: Arabic, broadcast news (207 target models, 6756 test segments)
System                                                             EER (%)
Anchor modeling (baseline)                                         15.1
Anchor modeling - Kernel based scoring                             10.8
Kernel-PCA projection (CSS)                                        8.8
Kernel-PCA projection (CSS) + inter-segment variability modeling   7.4
Speaker Diarization System & Experiments
Speaker change detection
• 2 adjacent sliding windows (3s each)
• Speaker verification scoring + normalization
Speaker clustering
• Speaker verification scoring + normalization
• Bottom-up clustering
Speaker Error Rate (SER) on BNAT05
• Anchor modeling (baseline): 12.9%
• Kernel-PCA based method: 7.9%
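The bottom-up clustering step can be sketched as plain agglomerative clustering over per-segment vectors. In this sketch the Euclidean distance stands in for the normalized speaker-verification score used in the actual system, and average linkage with a fixed stopping threshold is an assumed design choice; none of these details are specified in the slides.

```python
import numpy as np

def bottom_up_cluster(vectors, threshold):
    """Average-linkage agglomerative clustering.

    vectors:   (n, dim) array, one vector per speech segment.
    threshold: merging stops once the closest pair of clusters is farther
               apart than this value.
    Returns a list of clusters, each a list of segment indices.
    """
    vectors = np.asarray(vectors, dtype=float)
    dist = np.linalg.norm(vectors[:, None] - vectors[None], axis=2)
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best:
                    best, pair = d, (a, b)
        if best > threshold:
            break
        a, b = pair
        clusters[a] += clusters[b]      # merge the closest pair
        del clusters[b]
    return clusters
```

A trainable metric, as proposed in the talk, would replace `dist` with scores that account for inter-segment intra-speaker variability learned from labeled data.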
Summary 1/2
• A method for mapping speech segments into a GMM supervector space was described
• Intra-speaker inter-session variability is modeled in GMM supervector space
Speaker recognition
• EER was reduced by 38% on the NIST-2004 SRE
• A corresponding kernel-PCA based approach reduces EER by 44%
Speaker diarization
• SER for speaker diarization was reduced by 39%.
Summary 2/2: Algorithms based on the proposed framework
• Speaker recognition [Aronowitz 2005b; Aronowitz 2007c]
• Speaker diarization (“who spoke when”) [Aronowitz 2007d]
• VAD (voice activity detection) [Aronowitz 2007a]
• Language identification [Noor & Aronowitz 2006]
• Gender identification [Bocklet 2008]
• Age detection [Bocklet 2008]
• Channel/bandwidth classification [Aronowitz 2007d]
Bibliography 1/2
[1] D. A. Reynolds et al., "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, 17, 91-108, 1995.
[2] D. E. Sturim et al., "Speaker indexing in large audio databases using anchor models", in Proc. ICASSP, 2001.
[3] H. Aronowitz, D. Burshtein, A. Amir, "Speaker indexing in audio archives using test utterance Gaussian mixture modeling", in Proc. ICSLP, 2004.
[4] H. Aronowitz, D. Burshtein, A. Amir, "A session-GMM generative model using test utterance Gaussian mixture modeling for speaker verification", in Proc. ICASSP, 2005.
[5] P. Kenny et al., "Factor Analysis Simplified", in Proc. ICASSP, 2005.
[6] H. Aronowitz, D. Irony, D. Burshtein, "Modeling Intra-Speaker Variability for Speaker Recognition", in Proc. Interspeech, 2005.
[7] J. Goldberger and H. Aronowitz, "A distance measure between GMMs based on the unscented transform and its application to speaker recognition", in Proc. Interspeech, 2005.
[8] H. Aronowitz, D. Burshtein, "Efficient Speaker Identification and Retrieval", in Proc. Interspeech, 2005.
Bibliography 2/2
[9] A. Stolcke et al., "MLLR Transforms as Features in Speaker Recognition", in Proc. Interspeech, 2005.
[10] E. Noor, H. Aronowitz, "Efficient Language Identification using Anchor Models and Support Vector Machines", in Proc. ISCA Odyssey Workshop, 2006.
[11] W. M. Campbell et al., "SVM Based Speaker Verification Using a GMM Supervector Kernel and NAP Variability Compensation", in Proc. ICASSP, 2006.
[12] H. Aronowitz, "Segmental modeling for audio segmentation", in Proc. ICASSP, 2007.
[13] J. R. Hershey and P. A. Olsen, "Approximating the Kullback Leibler Divergence Between Gaussian Mixture Models", in Proc. ICASSP, 2007.
[14] H. Aronowitz, D. Burshtein, "Efficient Speaker Recognition Using Approximated Cross Entropy (ACE)", in IEEE Trans. on Audio, Speech & Language Processing, September 2007.
[15] H. Aronowitz, "Speaker Recognition using Kernel-PCA and Intersession Variability Modeling", in Proc. Interspeech, 2007.
[16] H. Aronowitz, "Trainable Speaker Diarization", in Proc. Interspeech, 2007.
[17] T. Bocklet et al., "Age and Gender Recognition for Telephone Applications Based on GMM Supervectors and Support Vector Machines", in Proc. ICASSP, 2008.
Presentation is available online at: http://aronowitzh.googlepages.com/
Thanks!
Backup slides
Kernel-PCA Based Mapping 2/5
Goals:
- Map sessions into feature space
- Model in feature space
[Figure: the kernel trick maps sessions x and y, together with the anchor sessions, from session space through f() to f(x) and f(y) in the dot-product feature space]
Kernel-PCA Based Mapping 3/5
Given
- kernel K
- n anchor sessions $A_1, \ldots, A_n$
Find an orthonormal basis for $\mathrm{span}\{f(A_1), \ldots, f(A_n)\}$
Method
1) Compute eigenvectors of the centralized kernel-matrix $k_{i,j} = K(A_i, A_j)$.
2) Normalize eigenvectors by square-roots of corresponding eigenvalues → $\{v_i\}$
3) $\{f_i\}$ for $f_i = (f(A_1), \ldots, f(A_n))\, v_i$ is the requested basis
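Kernel-PCA Based Mapping 4/5
Common speaker subspace: $C = \mathrm{span}\{f(A_1), \ldots, f(A_n)\}$
Speaker unique subspace: $U = F / \mathrm{span}\{f(A_1), \ldots, f(A_n)\}$
Given sessions x, y, f(x) and f(y) may be uniquely represented as:
$f(x) = c_x + u_x$ and $f(y) = c_y + u_y$, with $c_x, c_y \in C$ and $u_x, u_y \in U$
$T: x \mapsto (v_1, \ldots, v_n)^{\top} \, (K(x, A_1), \ldots, K(x, A_n))^{\top}$
T is a mapping x→R^n with the property:
$\|T(x) - T(y)\|^2 = \|c_x - c_y\|^2$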
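The method and the distance-preserving property can be checked numerically with a short sketch. Two assumptions are made here that the slides do not spell out: eigenvectors with (near-)zero eigenvalue are dropped, and the kernel row of a new session is centered in the same way as the anchor kernel matrix, so that the subspace is the span of the centered anchor images. Function names are hypothetical.

```python
import numpy as np

def fit_kpca(anchors, kernel):
    """Steps 1-2: eigen-decompose the centralized anchor kernel matrix.

    Returns the raw kernel matrix K and V, whose columns are the eigenvectors
    divided by the square roots of their eigenvalues (the v_i).
    """
    n = len(anchors)
    K = np.array([[kernel(a, b) for b in anchors] for a in anchors])
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one        # centralized kernel matrix
    eigval, eigvec = np.linalg.eigh(Kc)
    keep = eigval > 1e-10 * eigval.max()              # drop null directions
    return K, eigvec[:, keep] / np.sqrt(eigval[keep])

def kpca_map(x, anchors, kernel, K, V):
    """T(x): coordinates of f(x) in the orthonormal basis {f_i}."""
    kx = np.array([kernel(x, a) for a in anchors])
    kx_c = kx - kx.mean() - K.mean(axis=0) + K.mean()  # center the new kernel row
    return V.T @ kx_c

# Toy check with a linear kernel, where f is the identity and the projection
# onto the anchor subspace can be computed directly.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(3, 5))
linear = lambda a, b: float(a @ b)
K, V = fit_kpca(anchors, linear)
x, y = rng.normal(size=5), rng.normal(size=5)
t_dist = np.linalg.norm(kpca_map(x, anchors, linear, K, V)
                        - kpca_map(y, anchors, linear, K, V))
```

With any other positive-definite kernel the same code applies unchanged; only the explicit verification against f is no longer available, which is the point of the kernel trick.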
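Kernel-PCA Based Mapping 5/5
[Figure: sessions x and y in session space are mapped to f(x) and f(y) in feature space; K-PCA over the anchor sessions projects them to Tx and Ty in the common speaker subspace (R^n), leaving residuals ux and uy in the speaker unique subspace]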
Modeling in Segment-GMM Supervector Space
[Figure: the frame sequences of segment #1, segment #2, ..., segment #n are each mapped to a point in segment-GMM supervector space, which contains separate regions for speech, silence and music]
Segmental Modeling for Audio Segmentation
Goal
• Segment audio accurately and robustly into speech / silence / music segments.
Novel idea
• Acoustic modeling is usually done on a frame-basis.
• Segmentation/classification is usually done on a segment-basis (using smoothing).
Why not explicitly model whole segments?
Note: speaker, noise, music-context, channel (etc.) are constant during a segment.
Speech / Silence Segmentation - Results 1/2
[Figure: SPEECH / SILENCE SEGMENTATION. Silence miss probability vs. speech miss probability (log axes, 10^-2 to 10^-1) for IBM EVAL06, IBM EVAL06 no-pad, GMM baseline and Segmental]
System            EER     FA @ FR=0.5%   FR @ FA=1%
EVAL06            FA=24.2% @ FR=0.25%
GMM baseline      2.9%    7.9%           29.6%
Segmental         1.7%    5.1%           2.7%
Error reduction   41%     35%            91%
Speech / Music Segmentation - Results 2/2
[Figure: SPEECH / MUSIC SEGMENTATION. Music miss probability vs. speech miss probability (log axes, 10^-3 to 10^-1) for IBM EVAL06, IBM EVAL06 no-pad, GMM baseline and Segmental]
System            EER     FA @ FR=0.5%   FR @ FA=1%
EVAL06            FA=69% @ FR=0.25%
GMM baseline      1.43%   3.4%           3.2%
Segmental         1.27%   2.0%           1.9%
Error reduction   11%     41%            41%
LID in Session Space
[Figure: session space with regions for English, Arabic and French; training sessions and a test session are shown as points]
LID in Session Space - Algorithm
1. Front end: shifted delta cepstrum (SDC).
2. Represent every train/test session by a GMM super-vector.
3. Train a linear SVM to classify GMM super-vectors.
Results
• EER=4.1% on the NIST-03 Eval (30sec sessions).
Anchor Modeling Projection
• Speaker indexing [Sturim et al., 2001]
• Intersession variability modeling in projected space [Collet et al., 2005]
• Speaker clustering [Reynolds et al., 2004]
• Speaker segmentation [Collet et al., 2006]
• Language identification [Noor and Aronowitz, 2006]
Given: anchor models λ1,…,λn and session X = x1,…,xF
$\hat{s}_i(X) = \frac{1}{F} \log \frac{\Pr(X \mid \lambda_i)}{\Pr(X \mid \mathrm{UBM})}$ = average normalized log-likelihood
Projection: $X \mapsto (\hat{s}_1(X), \ldots, \hat{s}_n(X))$
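The projection can be sketched as follows, assuming diagonal-covariance GMMs stored as `(weights, means, variances)` tuples and frames treated as independent, so that the normalized log-likelihood becomes a per-frame average. The function names and the tuple layout are assumptions of this example.

```python
import numpy as np

def avg_frame_loglik(frames, weights, means, variances):
    """(1/F) * log Pr(X | GMM) for a diagonal-covariance GMM, frames independent."""
    diff = frames[:, None, :] - means[None, :, :]                       # (F, G, D)
    log_joint = np.log(weights) - 0.5 * np.sum(
        np.log(2 * np.pi * variances) + diff ** 2 / variances, axis=2)  # (F, G)
    m = log_joint.max(axis=1, keepdims=True)
    return np.mean(m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1)))

def anchor_projection(frames, anchors, ubm):
    """Map a session to the vector of average log-likelihood ratios
    against each anchor model, normalized by the UBM."""
    base = avg_frame_loglik(frames, *ubm)
    return np.array([avg_frame_loglik(frames, *a) - base for a in anchors])
```

The output dimension equals the number of anchor models, so sessions of any length end up as fixed-size vectors that can be indexed or fed to a segment-space classifier.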
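Intra-Class Variability Modeling: Introduction
The classic GMM algorithm does not explicitly model intra-speaker inter-session variability:
• Noise
• Channel
• Language
• Changing speaker characteristics: stress, emotion, aging
The frame independence assumption does not hold in these cases!
$\Pr(y_1, \ldots, y_T \mid S) = \prod_{t=1}^{T} \Pr(y_t \mid S)$   (1)
Instead, we get:
$\Pr(y_1, \ldots, y_T \mid S) = \int \Pr(y_1, \ldots, y_T \mid S, f) \Pr(f \mid S)\, df = \int \Pr(f \mid S) \prod_{t=1}^{T} \Pr(y_t \mid S, f)\, df$   (2)
where the frame term is $\Pr(y_t \mid G_{S,f})$ and the prior term is $\Pr(G_{S,f} \mid S)$.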