Presented by: Fang-Hui Chu
A Survey of Large Margin Hidden Markov Model
Xinwei Li, Hui Jiang
York University
Reference Papers
• [Xinwei Li] [M.S. thesis, Sep. 2005], "Large Margin HMMs for Speech Recognition"
• [Xinwei Li] [ICASSP 2005], "Large Margin HMMs for Speech Recognition"
• [Chaojun Liu] [ICASSP 2005], "Discriminative Training of CDHMMs for Maximum Relative Separation Margin"
• [Xinwei Li] [ASRU 2005], "A Constrained Joint Optimization Method for LME"
• [Hui Jiang] [IEEE Trans. SAP, 2006], "Large Margin HMMs for Speech Recognition"
• [Jinyu Li] [ICSLP 2006], "Soft Margin Estimation of HMM Parameters"
Outline
• Large Margin HMMs
• Analysis of Margin in CDHMM
• Optimization methods for large margin HMM estimation
• Soft Margin Estimation for HMM
Large Margin HMMs for ASR
• In ASR, given any speech utterance X, a speech recognizer chooses the word Ŵ as output based on the plug-in MAP decision rule as follows:

$$\hat{W} = \arg\max_{W} p(W \mid X) = \arg\max_{W} p(W)\, p(X \mid W) \approx \arg\max_{W} p(W)\, p(X \mid \lambda_W) = \arg\max_{W} \mathcal{F}(X \mid \lambda_W)$$

where $\mathcal{F}(X \mid \lambda_W) = \log\left[p(W)\, p(X \mid \lambda_W)\right]$ is the discriminant function and $\lambda_W$ denotes the HMM of word W

• For a speech utterance X_i, assuming its true word identity is W_i, the multiclass separation margin for X_i is defined as

$$d(X_i) = \mathcal{F}(X_i \mid \lambda_{W_i}) - \max_{W_j \in \Omega,\, W_j \neq W_i} \mathcal{F}(X_i \mid \lambda_{W_j}) = \min_{W_j \in \Omega,\, W_j \neq W_i} \left[\mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j})\right]$$

where Ω denotes the set of all possible words
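As a toy illustration of this definition (the words and score values below are made up, with scores[w] standing in for F(X_i | λ_w)):

```python
# Multiclass separation margin d(X_i) from per-word discriminant scores.
scores = {"yes": -41.2, "no": -44.0, "stop": -47.5}   # hypothetical F(X_i | w)
w_true = "yes"                                        # assumed true word W_i

best_competitor = max(v for w, v in scores.items() if w != w_true)
margin = scores[w_true] - best_competitor
print(margin)   # ~2.8: positive, so X_i is correctly classified
```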
Large Margin HMMs for ASR
• According to the statistical learning theory [Vapnik], the generalization error rate of a classifier in new test sets is theoretically bounded by a quantity related to its margin
• Motivated by the large margin principle, even for utterances in the training set that already have positive margin, we may still want to maximize the minimum margin to build an HMM-based large margin classifier for ASR
Large Margin HMMs for ASR
• Given a set of training data D = { X1, X2,…,XT}, we usually know the true word identities for all utterances in D, denoted as L = {W1, W2,…,WT}
• First, from all utterances in D, we need to identify a subset of utterances S as

$$S = \left\{\, X_i \;\middle|\; X_i \in D \ \text{and}\ 0 \leq d(X_i) \leq \varepsilon \,\right\}$$

where ε > 0 is a preset positive number

• We call S the support vector set; each utterance in S is called a support token, which has a relatively small positive margin among all utterances in the training set D
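Continuing the toy illustration, the support set is just a filter on per-utterance margins (the margin values and the threshold ε are made up):

```python
# Support token selection: S = { X_i in D : 0 <= d(X_i) <= eps }.
margins = {"utt1": 0.4, "utt2": 3.9, "utt3": 0.05, "utt4": -0.7}
eps = 1.0

S = [utt for utt, d in margins.items() if 0.0 <= d <= eps]
print(S)   # ['utt1', 'utt3']: utt2 is far from the boundary, utt4 is misclassified
```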
Large Margin HMMs for ASR
• This idea leads to estimating the HMM models Λ based on the criterion of maximizing the minimum margin of all support tokens, which is called large margin estimation (LME) of HMMs:

$$\tilde{\Lambda} = \arg\max_{\Lambda} \min_{X_i \in S} d(X_i) = \arg\max_{\Lambda} \min_{X_i \in S}\ \min_{W_j \in \Omega,\, W_j \neq W_i} \left[\mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j})\right]$$

or, equivalently, as the minimax problem

$$\tilde{\Lambda} = \arg\min_{\Lambda} \max_{X_i \in S}\ \max_{W_j \in \Omega,\, W_j \neq W_i} \left[\mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i})\right]$$

• The HMM models Λ̃ estimated in this way are called large margin HMMs
Analysis of Margin in CDHMM
• Adopting the Viterbi method to approximate the summation over all state sequences with the single optimal Viterbi path, the discriminant function can be expressed as

$$\mathcal{F}(X \mid \lambda_W) = \log p(W) + \sum_{t=1}^{R} \left[\log a_{s^*_{t-1} s^*_t} + \log w_{s^*_t l^*_t} + \log \mathcal{N}\!\left(x_t;\, \mu_{s^*_t l^*_t}, \Sigma_{s^*_t l^*_t}\right)\right]$$

where s* = (s*_1, …, s*_R) is the optimal Viterbi state sequence, l* the corresponding optimal Gaussian mixture components, a the state transition probabilities, and w the mixture weights. Expanding the diagonal-covariance Gaussians gives

$$\mathcal{F}(X \mid \lambda_W) = \log p(W) + \sum_{t=1}^{R} \left[\log a_{s^*_{t-1} s^*_t} + \log w_{s^*_t l^*_t}\right] - \frac{1}{2} \sum_{t=1}^{R} \sum_{d=1}^{D} \left[\frac{(x_{td} - \mu_{s^*_t l^*_t d})^2}{\sigma^2_{s^*_t l^*_t d}} + \log\!\left(2\pi\, \sigma^2_{s^*_t l^*_t d}\right)\right]$$
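To make the expansion concrete, here is a minimal sketch that scores one utterance along a given optimal path with diagonal-covariance Gaussians; the Viterbi search itself is omitted, and all names below are our own illustrative choices:

```python
import numpy as np

def path_log_score(X, states, comps, log_pw, log_a, log_w, mu, var):
    """Viterbi-approximated discriminant F(X | lambda_W).

    X: (R, D) feature vectors; states, comps: optimal path s*_t, l*_t from
    a prior Viterbi pass; log_a: log transition probabilities; log_w: log
    mixture weights; mu, var: (n_states, n_mix, D) diagonal Gaussians.
    The initial-state log probability is folded into log_pw for brevity.
    """
    score = log_pw                                    # log p(W)
    for t in range(len(X)):
        s, l = states[t], comps[t]
        if t > 0:
            score += log_a[states[t - 1], s]          # log a_{s*_{t-1} s*_t}
        score += log_w[s, l]                          # log w_{s*_t l*_t}
        diff = X[t] - mu[s, l]                        # diagonal Gaussian log-pdf
        score -= 0.5 * np.sum(diff ** 2 / var[s, l]
                              + np.log(2 * np.pi * var[s, l]))
    return score
```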
Analysis of Margin in CDHMM
• Here, we only consider estimating the mean vectors. The discriminant functions for the true word W_i and a competing word W_j then become

$$\mathcal{F}(X_i \mid \lambda_{W_i}) = C_i - \frac{1}{2} \sum_{t=1}^{R} \sum_{d=1}^{D} \frac{(x_{itd} - \mu_{s^*_{it} l^*_{it} d})^2}{\sigma^2_{s^*_{it} l^*_{it} d}}$$

$$\mathcal{F}(X_i \mid \lambda_{W_j}) = C_j - \frac{1}{2} \sum_{t=1}^{R} \sum_{d=1}^{D} \frac{(x_{itd} - \mu_{s^*_{jt} l^*_{jt} d})^2}{\sigma^2_{s^*_{jt} l^*_{jt} d}}$$

where (s*_it, l*_it) and (s*_jt, l*_jt) denote the optimal Viterbi paths of X_i against λ_{W_i} and λ_{W_j}, and C_i, C_j collect all terms that do not depend on the mean vectors.

In this case, the discriminant functions can be represented as a summation of quadratic terms related to the mean values of the CDHMMs
Analysis of Margin in CDHMM
• As a result, the decision margin can be represented in a standard diagonal quadratic form:

$$d_{ij}(X_i) = \mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j}) = C_{ij} + \frac{1}{2} \sum_{t=1}^{R} \sum_{d=1}^{D} \left[\frac{(x_{itd} - \mu_{s^*_{jt} l^*_{jt} d})^2}{\sigma^2_{s^*_{jt} l^*_{jt} d}} - \frac{(x_{itd} - \mu_{s^*_{it} l^*_{it} d})^2}{\sigma^2_{s^*_{it} l^*_{it} d}}\right]$$

so we can see that each feature dimension contributes to the decision margin separately.

• Thus, for each feature vector x_it, we can divide all of its dimensions into two parts:

$$D_{t1} = \left\{ d \;\middle|\; \mu_{s^*_{it} l^*_{it} d} \neq \mu_{s^*_{jt} l^*_{jt} d} \ \text{or}\ \sigma^2_{s^*_{it} l^*_{it} d} \neq \sigma^2_{s^*_{jt} l^*_{jt} d} \right\}, \qquad D_{t2} = \{1, \dots, D\} \setminus D_{t1}$$

where the dimensions in D_{t2} contribute nothing to the margin
Analysis of Margin in CDHMM
• After some math manipulation, we have:

$$d_{ij}(X_i) = C_{ij} + \frac{1}{2} \sum_{t=1}^{R} \sum_{d \in D_{t1}} \left(A_{itd}\, x_{itd}^2 + B_{itd}\, x_{itd} + L_{itd}\right)$$

where

$$A_{itd} = \frac{1}{\sigma^2_{s^*_{jt} l^*_{jt} d}} - \frac{1}{\sigma^2_{s^*_{it} l^*_{it} d}} \quad \text{(coefficient of the quadratic function)}$$

$$B_{itd} = 2\left(\frac{\mu_{s^*_{it} l^*_{it} d}}{\sigma^2_{s^*_{it} l^*_{it} d}} - \frac{\mu_{s^*_{jt} l^*_{jt} d}}{\sigma^2_{s^*_{jt} l^*_{jt} d}}\right) \quad \text{(coefficient of the linear function)}$$

$$L_{itd} = \frac{\mu^2_{s^*_{jt} l^*_{jt} d}}{\sigma^2_{s^*_{jt} l^*_{jt} d}} - \frac{\mu^2_{s^*_{it} l^*_{it} d}}{\sigma^2_{s^*_{it} l^*_{it} d}}$$
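A quick one-dimension numeric check of this decomposition (all parameter values are arbitrary test numbers):

```python
import numpy as np

mu_i, var_i = 0.5, 1.0     # Gaussian aligned on the true-word path
mu_j, var_j = -0.2, 2.0    # Gaussian aligned on the competing-word path
x = 1.3                    # one feature dimension x_{itd}

# Left side: difference of the two Gaussian exponents.
lhs = 0.5 * ((x - mu_j) ** 2 / var_j - (x - mu_i) ** 2 / var_i)

# Right side: the A/B/L coefficients from the slide.
A = 1.0 / var_j - 1.0 / var_i
B = 2.0 * (mu_i / var_i - mu_j / var_j)
L = mu_j ** 2 / var_j - mu_i ** 2 / var_i
rhs = 0.5 * (A * x ** 2 + B * x + L)

print(np.isclose(lhs, rhs))   # True
```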
Analysis of Margin in CDHMM

[Three figure-only slides; plots not recoverable from the text.]
Optimization methods for large margin HMM estimation

• An iterative localized optimization method
• A constrained joint optimization method
• A semidefinite programming method
Iterative localized optimization
• In order to keep increasing the margin while the margins of all samples remain positive, both of the models involved must be moved together
– if we keep one of the models fixed, the other model cannot be moved too far under the constraint that all samples must have positive margin
– otherwise the margin for some tokens will become negative
• Instead of optimizing the parameters of all models at the same time, only one selected model is adjusted in each step of the optimization
• The process then iterates to update another model until the optimal margin is achieved (a control-flow sketch follows)
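The control flow described above might be sketched as follows; the callables and the rule for choosing the target model are assumptions for illustration, not the authors' exact algorithm:

```python
def iterative_localized_lme(models, support_tokens, margin_fn, update_fn,
                            n_iters=20):
    """Skeleton of iterative localized optimization.

    models: dict mapping word -> HMM parameters; margin_fn(models, X)
    returns (margin, true_word, best_competing_word) for support token X;
    update_fn(models, word, tokens) re-estimates one model while all
    others stay fixed. Both callables are user-supplied assumptions.
    """
    for step in range(n_iters):
        # locate the support token that currently has the minimum margin
        _, w_true, w_comp = min((margin_fn(models, X) for X in support_tokens),
                                key=lambda m: m[0])
        # adjust only one of the two models relevant to that token
        target = w_true if step % 2 == 0 else w_comp
        models[target] = update_fn(models, target, support_tokens)
    return models
```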
Iterative localized optimization
• How to select the target model in each step?
– The model should be relevant to the support token with the minimum margin

• The minimax optimization can be re-formulated as:

$$\lambda_m^{(n+1)} = \arg\min_{\lambda_m}\ \max_{X_i \in S,\; j:\, W_j = m \,\text{or}\, W_i = m} \left[\mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i})\right]$$

subject to

$$\mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j}) \geq 0 \quad \text{for all } X_i \in S \text{ and all } j \text{ with } W_j = m \text{ or } W_i = m$$

where λ_m is the model selected for updating at iteration n + 1
Iterative localized optimization
• Approximated by a summation of exponential functions:

$$\max_{X_i \in S,\; j:\, W_j = m \,\text{or}\, W_i = m} \left[\mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i})\right] \approx \frac{1}{\eta} \log \sum_{X_i \in S} \sum_{j:\, W_j = m \,\text{or}\, W_i = m} \exp\left\{\eta \left[\mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i})\right]\right\}$$

• This yields the smoothed objective

$$Q(\lambda_m) = \frac{1}{\eta} \log \sum_{X_i \in S} \sum_{j:\, W_j = m \,\text{or}\, W_i = m} \exp\left\{-\eta\, d_{ij}(X_i)\right\}$$

where η (η ≥ 1) is a constant that controls the tightness of the approximation
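A small numeric sketch of this log-sum-exp smoothing (η and the score differences are made-up values):

```python
import numpy as np
from scipy.special import logsumexp

def smooth_max(values, eta):
    """Differentiable upper bound on max(values): (1/eta) log sum exp(eta*v)."""
    return logsumexp(eta * np.asarray(values, dtype=float)) / eta

diffs = [-2.0, -1.5, -0.1]           # F(X_i|W_j) - F(X_i|W_i) over support pairs
print(max(diffs))                    # -0.1: the true (non-smooth) max
print(smooth_max(diffs, eta=1.0))    # loose upper bound
print(smooth_max(diffs, eta=100.0))  # nearly exact: tightens as eta grows
```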
Iterative localized optimization

[figure-only slide]
Constrained Joint optimization
• Introduce some constraints to make the optimization problem bounded
• In this way, the optimization can be performed jointly with respect to all model parameters
Constrained Joint optimization
• In order to bound the margin contribution from the linear part:

$$R_1(X_i; W_i, W_j) = \sum_{t=1}^{R} \sum_{d \in D_{t1}} B_{itd}^2 \leq g^2$$

• In order to bound the margin contribution from the quadratic part:

$$R_2(X_i; W_i, W_j) = \sum_{t=1}^{R} \sum_{d \in D_{t1}} \left(B_{itd} - B^{(0)}_{itd}\right)^2 \leq G^2$$

where B^{(0)}_{itd} is computed from the initial models and g, G are preset bounds
Constrained Joint optimization
• Reformulate the large margin estimation as the following constrained minimax optimization problem:

$$\tilde{\Lambda} = \arg\min_{\Lambda}\ \max_{X_i \in S,\; W_j \neq W_i} \left[\mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i})\right]$$

subject to

$$R_1(X_i; W_i, W_j) \leq g^2 \qquad \forall\, X_i \in S,\; W_j \neq W_i$$

$$R_2(X_i; W_i, W_j) \leq G^2 \qquad \forall\, X_i \in S,\; W_j \neq W_i$$

$$\mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j}) \geq 0 \qquad \forall\, X_i \in S,\; W_j \neq W_i$$
Constrained Joint optimization
• The constrained minimization problem can be transformed into an unconstrained minimization problem:

$$O(\Lambda) = Q(\Lambda) + c_1 P_1(\Lambda) + c_2 P_2(\Lambda)$$

with the penalty terms

$$P_1(\Lambda) = \sum_{X_i \in S} \sum_{j \neq i} \max\left\{0,\; R_1(X_i; W_i, W_j) - g^2\right\}, \qquad P_2(\Lambda) = \sum_{X_i \in S} \sum_{j \neq i} \max\left\{0,\; R_2(X_i; W_i, W_j) - G^2\right\}$$

and the smoothed minimax term

$$Q(\Lambda) = \frac{1}{\eta} \log \sum_{X_i \in S} \sum_{j \neq i} \exp\left\{\eta \left[\mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i})\right]\right\} = \frac{1}{\eta} \log \sum_{X_i \in S} \sum_{j \neq i} \exp\left\{-\eta\, d_{ij}(X_i)\right\}$$
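A compact sketch of this penalty construction, assuming hinge penalties max(0, R − bound) and illustrative weights c1, c2:

```python
import numpy as np

def penalized_objective(Q_val, R1_vals, R2_vals, g2, G2, c1=1.0, c2=1.0):
    """Unconstrained surrogate O = Q + c1*P1 + c2*P2.

    Q_val: smoothed minimax objective Q(Lambda); R1_vals, R2_vals: the
    constraint quantities over all (X_i, W_j) pairs; g2, G2: their bounds.
    The hinge form and the weights c1, c2 are assumptions for illustration.
    """
    P1 = np.sum(np.maximum(0.0, np.asarray(R1_vals) - g2))   # violations of R1
    P2 = np.sum(np.maximum(0.0, np.asarray(R2_vals) - G2))   # violations of R2
    return Q_val + c1 * P1 + c2 * P2
```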
Constrained Joint optimization

[figure-only slide]
Soft Margin estimation
• Model separation measure and frame selection:

$$d_{LLR}(X_i) = \log \frac{l(X_i \mid W_{target})}{l(X_i \mid W_{comp})}$$

$$d_{SME}(X_i) = \frac{1}{n_i} \sum_{j} \log \frac{l(x_{ij} \mid W_{target})}{l(x_{ij} \mid W_{comp})}\; I\!\left(x_{ij} \in F_i\right)$$

where x_{ij} is the j-th frame of X_i, F_i is the set of selected frames, and n_i is the number of selected frames

• SME objective function and sample selection:

$$L_{SME}(\Lambda) = \frac{\lambda}{\rho} + \frac{1}{N} \sum_{i=1}^{N} \left(\rho - d_{SME}(X_i)\right) I\!\left(X_i \in U\right) = \frac{\lambda}{\rho} + \frac{1}{N} \sum_{i=1}^{N} \left(\rho - \frac{1}{n_i} \sum_{j} \log \frac{l(x_{ij} \mid W_{target})}{l(x_{ij} \mid W_{comp})}\, I(x_{ij} \in F_i)\right) I\!\left(X_i \in U\right)$$

where U is the set of selected samples, those whose separation falls below the soft margin ρ, so the summand acts as the hinge function

$$\left(\rho - x\right)_+ = \begin{cases} \rho - x, & \text{if } \rho - x > 0 \\ 0, & \text{otherwise} \end{cases}$$

and λ is a constant balancing the margin against the empirical loss
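A compact sketch of the SME objective with this hinge (the separations, ρ, and λ are made-up values):

```python
import numpy as np

def sme_loss(d, rho, lam):
    """L_SME = lam/rho + (1/N) * sum_i max(0, rho - d_i).

    d: per-utterance separation measures d_SME(X_i); only samples with
    d_i <= rho (misclassified or inside the soft margin) contribute.
    """
    hinge = np.maximum(0.0, rho - np.asarray(d, dtype=float))
    return lam / rho + hinge.mean()

print(sme_loss([1.2, 0.3, -0.5], rho=0.8, lam=0.1))   # ~0.725
```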
Soft Margin estimation
• Difference between SME and LME:
– LME neglects the misclassified samples; consequently, LME often needs a very good preliminary estimate from the training set
– SME works on all the training data, both the correctly classified and the misclassified samples
– However, SME must first choose the margin ρ heuristically