Minimum Phoneme Error Based Heteroscedastic Linear Discriminant Analysis
for Speech Recognition
Bing Zhang and Spyros Matsoukas, BBN Technologies
Presented by Shih-Hung Liu, 2006/05/16
2
Outline
• Review PCA, LDA, HLDA
• Introduction
• MPE-HLDA
• Experimental results
• Conclusions
3
Review - PCA
$$\bar X = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad T = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar X)(x_i - \bar X)^T$$

$$y_i = \theta_p^T x_i$$

where the columns of $\theta_p$ are the $p$ leading eigenvectors of the total covariance $T$.
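The PCA review above can be rendered as a short NumPy sketch: estimate the global mean and total covariance $T$, then project onto the $p$ leading eigenvectors. This is an illustrative sketch, not code from the paper; the function name is ours.

```python
import numpy as np

def pca_project(X, p):
    """Project N x n data rows onto the top-p principal axes.

    Mirrors the slide: X_bar = (1/N) sum_i x_i,
    T = (1/N) sum_i (x_i - X_bar)(x_i - X_bar)^T, and y_i = theta_p^T x_i,
    where the columns of theta_p are the p leading eigenvectors of T.
    """
    X_bar = X.mean(axis=0)
    Xc = X - X_bar
    T = Xc.T @ Xc / X.shape[0]            # n x n total covariance
    _, eigvecs = np.linalg.eigh(T)        # eigh returns ascending eigenvalues
    theta_p = eigvecs[:, ::-1][:, :p]     # reorder to take the p leading ones
    return X @ theta_p                    # y_i = theta_p^T x_i, as row vectors
```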
4
Review - LDA
$$\hat\theta_p = \arg\max_{\theta_p} \frac{|\theta_p^T B\, \theta_p|}{|\theta_p^T W\, \theta_p|}$$

$$W = \frac{1}{N}\sum_{j=1}^{J} N_j W_j, \qquad B = \frac{1}{N}\sum_{j=1}^{J} N_j (\bar X_j - \bar X)(\bar X_j - \bar X)^T$$

$$W_j = \frac{1}{N_j}\sum_{i:\,g(i)=j}(x_i - \bar X_j)(x_i - \bar X_j)^T, \qquad j = 1,\dots,J$$

$$y_i = \theta_p^T x_i$$
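The LDA criterion above reduces to an eigenproblem: the columns of $\hat\theta_p$ are the leading eigenvectors of $W^{-1}B$. A minimal NumPy sketch (illustrative only; the function name is ours, not the paper's):

```python
import numpy as np

def lda_project(X, labels, p):
    """LDA per the slide: maximize |theta^T B theta| / |theta^T W theta|.

    W = (1/N) sum_j N_j W_j          (within-class scatter)
    B = (1/N) sum_j N_j (m_j - m)(m_j - m)^T  (between-class scatter)
    Solution: the p leading eigenvectors of W^{-1} B.
    """
    N, n = X.shape
    X_bar = X.mean(axis=0)
    W = np.zeros((n, n))
    B = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mj = Xj.mean(axis=0)
        W += (Xj - mj).T @ (Xj - mj) / N          # = (N_j / N) * W_j
        B += len(Xj) / N * np.outer(mj - X_bar, mj - X_bar)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]        # descending eigenvalues
    theta_p = eigvecs[:, order[:p]].real
    return X @ theta_p
```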
5
Review - HLDA
$$P(x_i) = \frac{|\theta|}{(2\pi)^{n/2}\,|\Sigma_{g(i)}|^{1/2}}\; \exp\!\left(-\tfrac{1}{2}(y_i - \mu_{g(i)})^T \Sigma_{g(i)}^{-1}(y_i - \mu_{g(i)})\right)$$

with class-dependent parameters in the first $p$ dimensions and shared parameters in the remaining $n-p$:

$$\mu_j = \begin{pmatrix} \mu_j^{(p)} \\ \mu_0^{(n-p)} \end{pmatrix}, \qquad \Sigma_j = \begin{pmatrix} \Sigma_j^{(p)} & 0 \\ 0 & \Sigma_0^{(n-p)} \end{pmatrix}, \qquad j = 1,\dots,J$$

$$y_i = \theta^T x_i$$
6
Review - HLDA
$$L(x;\theta) = \sum_{i=1}^{N}\log P(x_i;\theta) = \sum_{i=1}^{N}\left[\log|\theta| - \frac{1}{2}(y_i - \mu_{g(i)})^T\Sigma_{g(i)}^{-1}(y_i - \mu_{g(i)}) - \frac{n}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_{g(i)}|\right]$$

ML estimates for a fixed $\theta$:

$$\mu_j^{(p)} = \theta_p^T \bar X_j, \qquad \Sigma_j^{(p)} = \theta_p^T W_j\, \theta_p, \qquad j = 1,\dots,J$$

$$\mu_0^{(n-p)} = \theta_{n-p}^T \bar X, \qquad \Sigma_0^{(n-p)} = \theta_{n-p}^T T\, \theta_{n-p}$$
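For a fixed projection $\theta$, the HLDA ML estimates are just projections of the per-class and global statistics: class-dependent mean and covariance in the first $p$ dimensions, shared ones in the remaining $n-p$. A sketch under that reading (the function name and return layout are ours):

```python
import numpy as np

def hlda_ml_params(X, labels, theta, p):
    """ML estimates for fixed theta, per the slide:
    mu_j^(p) = theta_p^T X_bar_j,   Sigma_j^(p)   = theta_p^T W_j theta_p
    mu_0     = theta_{n-p}^T X_bar, Sigma_0^(n-p) = theta_{n-p}^T T theta_{n-p}
    The first p columns of theta span the class-discriminating subspace.
    """
    N, n = X.shape
    X_bar = X.mean(axis=0)
    Xc = X - X_bar
    T = Xc.T @ Xc / N                     # global (total) covariance
    theta_p, theta_np = theta[:, :p], theta[:, p:]
    mu0 = theta_np.T @ X_bar              # shared mean, nuisance dimensions
    Sigma0 = theta_np.T @ T @ theta_np    # shared covariance
    class_params = {}
    for j in np.unique(labels):
        Xj = X[labels == j]
        mj = Xj.mean(axis=0)
        Wj = (Xj - mj).T @ (Xj - mj) / len(Xj)
        class_params[j] = (theta_p.T @ mj, theta_p.T @ Wj @ theta_p)
    return class_params, (mu0, Sigma0)
```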
7
Introduction for MPE-HLDA
• In speech recognition systems, feature analysis is usually employed for better classification accuracy and complexity control.
• In recent years, extensions to classical LDA have been widely adopted.
• Among them, HDA seeks to remove the equal-variance constraint imposed by LDA.
• ML-HLDA takes the HMM structure (e.g., diagonal-covariance Gaussian mixture state distributions) into consideration.
8
Introduction for MPE-HLDA
• Despite the differences between the above techniques, they share some common limitations.
• First, none of them assumes any prior knowledge of confusable hypotheses, so the features they select may be suboptimal for recognition.
• Second, their objective functions are not directly related to the WER.
• For example, we found that HLDA could select totally non-discriminative features while still improving its objective function, by mapping all training samples to a single point in space along some dimensions.
9
Introduction
• LDA and HLDA
– Better classification accuracy
– Some common limitations:
• None of them assumes any prior knowledge of confusable hypotheses
• Their objective functions do not directly relate to the word error rate (WER)
– As a result, it is often unknown whether the selected features will do well in testing just by looking at the values of the objective functions
• Minimum Phoneme Error
– Minimizes phoneme errors in lattice-based training frameworks
– Since this criterion is closely related to WER, MPE-HLDA tends to be more robust than other projection methods, which makes it potentially better suited for a wider variety of features
10
MPE-HLDA
• MPE-HLDA model
• MPE-HLDA aims at minimizing the expected number of phoneme errors introduced by the MPE-HLDA model in a given hypothesis lattice, or, equivalently, at maximizing the function
$$\mu_m = A\hat\mu_m, \qquad C_m = \mathrm{diag}\!\left(A\hat\Sigma_m A^T\right), \qquad o_t = A\hat o_t$$

We will refer to $(\mu_m, C_m)$ as the MPE-HLDA model.

$$F_{MPE}(O) = \sum_{r=1}^{R}\sum_{w_r} P(w_r \mid O_r)\, A(w_r) \qquad (4)$$

where $R$ is the total number of training utterances, $O_r$ is the sequence of $p$-dimensional observation vectors in utterance $r$, and $A(w_r)$ is the "raw accuracy" score of word hypothesis $w_r$.
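The objective in (4) is a posterior-weighted sum of raw accuracy scores over all hypotheses of all lattices. A toy sketch; the list-of-(posterior, accuracy) layout is a hypothetical stand-in for a real lattice structure:

```python
def mpe_objective(utterances):
    """F_MPE = sum over utterances r and hypotheses w_r of P(w_r|O_r) * A(w_r).

    Each entry of `utterances` is a list of (posterior, raw_accuracy) pairs,
    one per hypothesis in that utterance's lattice (hypothetical layout).
    """
    return sum(post * acc
               for hyps in utterances
               for post, acc in hyps)
```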
11
MPE-HLDA
• $P(w_r \mid O_r)$ is the posterior probability of hypothesis $w_r$ in the lattice:

$$P(w_r \mid O_r) = \frac{P(O_r \mid w_r)^k\, P(w_r)}{\sum_{w_r'} P(O_r \mid w_r')^k\, P(w_r')}$$

$P(w_r)$ is the language model probability of hypothesis $w_r$, and $k$ is a scaling factor used to reduce the dynamic range of the acoustic scores, thereby avoiding the concentration of all posterior mass on the top-1 hypothesis of the lattice.
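Computing these posteriors amounts to a softmax over per-hypothesis scores in which the acoustic log-likelihood is scaled by $k$; with $k < 1$ the distribution flattens, which is exactly the point. A sketch (function name and input layout are assumptions, not from the paper):

```python
import numpy as np

def hypothesis_posteriors(log_acoustic, log_lm, k):
    """P(w|O) proportional to P(O|w)^k * P(w): scale the acoustic
    log-likelihoods by k, add the LM log-probabilities, then normalize."""
    scores = k * np.asarray(log_acoustic) + np.asarray(log_lm)
    scores = scores - scores.max()   # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()
```

With $k = 1$ the best hypothesis absorbs nearly all the mass; shrinking $k$ spreads it over competitors, which keeps the MPE statistics informative.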
12
MPE-HLDA
• It can be shown that the derivative of (4) with respect to A is
$$\frac{\partial F_{MPE}(O)}{\partial A} = k \sum_{r=1}^{R} \sum_{q_r} D(q,r)\, \frac{\partial \log P(O_r \mid q, r)}{\partial A} \qquad (6)$$

where

$$D(q,r) = P(q \mid O_r)\left[\bar A(q,r) - \bar A(r)\right]$$

$\bar A(r)$ is the MPE score of utterance $r$ (average accuracy over all hypotheses), and $\bar A(q,r)$ is the average accuracy over all hypotheses that contain arc $q_r$.
13
MPE-HLDA
• $$\frac{\partial \log P(O_r \mid q, r)}{\partial A} = \sum_{t=S_{q_r}}^{E_{q_r}} \sum_m \gamma_{mt}\, \frac{\partial \log P(o_t \mid m)}{\partial A}$$

$S_{q_r}$ and $E_{q_r}$ are the begin and end times of arc $q_r$, and $\gamma_{mt}$ denotes the posterior probability of Gaussian $m$ in arc $q_r$ at time $t$.

$$\frac{\partial \log P(o_t \mid m)}{\partial A} = C_m^{-1}\left[\left(P_{mt}\, C_m^{-1} - I\right) A \hat\Sigma_m - R_{mt}\right]$$

where

$$P_{mt} = \mathrm{diag}\!\left[(o_t - \mu_m)(o_t - \mu_m)^T\right], \qquad R_{mt} = (o_t - \mu_m)(\hat o_t - \hat\mu_m)^T$$
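The per-frame Gaussian derivative can be checked against finite differences. The sketch below implements one reconstructed form of the formula, $\partial \log P(o_t \mid m)/\partial A = C_m^{-1}[(P_{mt} C_m^{-1} - I) A \hat\Sigma_m - R_{mt}]$, with $o_t = A\hat o_t$, $\mu_m = A\hat\mu_m$ and $C_m = \mathrm{diag}(A\hat\Sigma_m A^T)$; it is an illustration under those assumptions, not BBN's code:

```python
import numpy as np

def dlogP_dA(A, o_hat, mu_hat, Sigma_hat):
    """Analytic derivative of log P(o_hat|m) w.r.t. the p x n projection A,
    for the model o = A o_hat, mu = A mu_hat, C = diag(A Sigma_hat A^T):
        dlogP/dA = C^{-1} [ (P C^{-1} - I) A Sigma_hat - R ]
    with P = diag((o-mu)(o-mu)^T) and R = (o-mu)(o_hat-mu_hat)^T."""
    e_hat = o_hat - mu_hat
    e = A @ e_hat                               # o - mu in projected space
    Cinv = np.diag(1.0 / np.diag(A @ Sigma_hat @ A.T))
    P = np.diag(e ** 2)                         # diagonal outer product
    R = np.outer(e, e_hat)                      # cross outer product
    return Cinv @ ((P @ Cinv - np.eye(len(e))) @ A @ Sigma_hat - R)

def logP(A, o_hat, mu_hat, Sigma_hat):
    """Gaussian log-density of the projected frame, up to an A-free constant."""
    e = A @ (o_hat - mu_hat)
    c = np.diag(A @ Sigma_hat @ A.T)
    return -0.5 * (np.sum(np.log(c)) + np.sum(e ** 2 / c))
```

The test below compares the analytic gradient against a central finite-difference estimate of `logP`.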
R o o
14
MPE-HLDA
• Therefore, Eq.(6) can be rewritten as
$$\frac{\partial F_{MPE}(O)}{\partial A} = k \sum_m \left[ C_m^{-1}\left(C_m^{-1} g_m - \gamma_m I\right) A \hat\Sigma_m - J_m \right] \qquad (12)$$

where

$$\gamma_m = \sum_r \sum_{q_r} D(q,r) \sum_{t=S_{q_r}}^{E_{q_r}} \gamma_{mt}, \qquad g_m = \sum_r \sum_{q_r} D(q,r) \sum_{t=S_{q_r}}^{E_{q_r}} \gamma_{mt}\, P_{mt}$$

$$J_m = C_m^{-1} \sum_r \sum_{q_r} D(q,r) \sum_{t=S_{q_r}}^{E_{q_r}} \gamma_{mt}\, R_{mt}$$

(Matrix dimensions annotated on the slide: 39×39 and 39×162.)
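The statistics $\gamma_m$, $g_m$ and the unscaled part of $J_m$ are plain weighted sums, so they can be accumulated in one pass without holding any full covariance in memory. A sketch; the flat list of (D(q,r), m, gamma_mt, diag of P_mt, R_mt) tuples is a hypothetical layout chosen only to mirror the triple summation:

```python
import numpy as np

def accumulate_stats(entries, n_mix, p, n):
    """One-pass accumulation of the per-Gaussian MPE statistics:
    gamma_m (scalar occupancy), g_m (sum of the diagonal P_mt, kept as a
    vector), and the raw sum of R_mt, to be scaled by C_m^{-1} afterwards
    to form J_m."""
    gamma = np.zeros(n_mix)
    g = np.zeros((n_mix, p))
    J_raw = np.zeros((n_mix, p, n))
    for D, m, gmt, P_diag, R in entries:
        w = D * gmt                  # combined arc / Gaussian weight
        gamma[m] += w
        g[m] += w * P_diag
        J_raw[m] += w * R
    return gamma, g, J_raw
```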
15
MPE-HLDA Implementation
• In theory, the derivative of the MPE-HLDA objective function can be computed based on Eq. (12), via a single forward-backward pass over the training lattices. In practice, however, it is not possible to fit all the full covariance matrices in memory.
• Two steps:
– First, run a forward-backward pass over the training lattices to accumulate the statistics
– Second, use these statistics together with the full covariance matrices to synthesize the derivative
• The paper uses gradient descent to update the projection matrix:

$$A_{j+1}^{(i)} = A_j^{(i)} + \epsilon\, \frac{\partial F_{MPE}(\hat M^{(i)}, A)}{\partial A}\bigg|_{A = A_j^{(i)}}$$
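The update rule itself is a plain gradient step on the projection matrix. A generic sketch, assuming only a callable that returns the derivative of the maximized objective at the current $A$ (the quadratic stand-in below is a toy; the real gradient would be synthesized from the lattice statistics):

```python
import numpy as np

def optimize_projection(A0, grad_fn, step=0.1, iters=200):
    """Iterate A_{j+1} = A_j + eps * dF/dA(A_j), as on the slide.
    grad_fn(A) must return the gradient of the objective being maximized."""
    A = A0.copy()
    for _ in range(iters):
        A = A + step * grad_fn(A)
    return A

# Toy stand-in objective F(A) = -||A - A_star||^2, whose gradient is
# 2 * (A_star - A); used only to exercise the update loop.
A_star = np.ones((2, 3))
A_opt = optimize_projection(np.zeros((2, 3)), lambda A: 2.0 * (A_star - A))
```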
16
MPE-HLDA Overview
(Slide diagram: a one-page summary of the MPE-HLDA setup, collecting the global feature projection matrix $A$, the model $(\mu_m, C_m)$ with $\mu_m = A\hat\mu_m$, $C_m = \mathrm{diag}(A\hat\Sigma_m A^T)$ and $o_t = A\hat o_t$, the objective $F_{MPE}(O)$ of (4), and the derivative chain of (6) linking $F_{MPE}$ to $A$.)
17
MPE-HLDA Overview
(Slide diagram: the derivative $\partial F_{MPE}(O)/\partial A$ of (12), assembled from the accumulated statistics $\gamma_m$, $g_m$ and $J_m$ defined on slide 14.)
18
Implementation
• 1. Initialize the feature projection matrix $A^{(0)}$ by LDA or HLDA, and the MPE-HLDA model $\hat M^{(0)}$
• 2. Set $i = 1$
• 3. Compute covariance statistics in the original feature space:
– (a) Do a maximum likelihood update of the MPE-HLDA model $\hat M^{(i-1)}$ in the feature space defined by $A^{(i-1)}$
– (b) Do single-pass retraining using $\hat M^{(i-1)}$ to generate $\hat\mu_m^{(i)}$ and $\hat\Sigma_m^{(i)}$ in the original feature space
• 4. Optimize the feature projection matrix:
– (a) Set $j = 0$, $A_j^{(i)} = A^{(i-1)}$
– (b) Project $\hat\mu_m^{(i)}$ and $\hat\Sigma_m^{(i)}$ using $A_j^{(i)}$ to get the model $\hat M^{(i)}$ in the reduced subspace
– (c) Run a forward-backward pass on the lattices using $\hat M^{(i)}$ to compute $\gamma_m$, $g_m$ and $J_m$
– (d) Use the $\gamma_m$, $g_m$ and $J_m$ statistics from 4(c) to compute the MPE derivative
– (e) Update $A_j^{(i)}$ to $A_{j+1}^{(i)}$ using gradient descent: $A_{j+1}^{(i)} = A_j^{(i)} + \epsilon\, \partial F_{MPE}(\hat M^{(i)}, A_j^{(i)})/\partial A$
– (f) Set $j = j + 1$, go to 4(b) unless converged
• 5. Optionally, set $A^{(i)} = A_{j+1}^{(i)}$, $i = i + 1$, and go to 3
19
Experiments
• DARPA EARS research project
• CTS: 800/2300 hrs for ML training, 370 hrs of held-out data for MPE-HLDA training
• BN: 600 hrs from Hub4 and TDT for ML training, 330 hrs of held-out data for MPE-HLDA estimation
• PLP (15 dim) plus 1st, 2nd and 3rd derivative coefficients (60 dim)
• EARS 2003 Evaluation test set
20
Experiments
21
Experiments
22
Conclusions
• We have taken a first look at a new feature analysis method, MPE-HLDA.
• Results show that it is effective in reducing recognition error, and that it is more robust than other commonly used analysis methods such as LDA and HLDA.