
Page 1

Presented by: Fang-Hui Chu

A Survey of Large Margin Hidden Markov Model

Xinwei Li, Hui Jiang

York University

Page 2

Reference Papers

• [Xinwei Li] [M.S. thesis] [Sep. 2005], “Large Margin HMMs for SR”

• [Xinwei Li] [ICASSP 05], “Large Margin HMMs for SR”

• [Chaojun Liu] [ICASSP 05], “Discriminative training of CDHMMs for Maximum Relative Separation Margin”

• [Xinwei Li] [ASRU 05], “A constrained joint optimization method for LME”

• [Hui Jiang] [SAP 2006], “Large Margin HMMs for SR”

• [Jinyu Li] [ICSLP 06], “Soft Margin Estimation of HMM parameters”

Page 3

Outline

• Large Margin HMMs

• Analysis of Margin in CDHMM

• Optimization methods for large margin HMM estimation

• Soft Margin Estimation for HMM

Page 4

Large Margin HMMs for ASR

• In ASR, given any speech utterance X, a speech recognizer chooses the word Ŵ as output based on the plug-in MAP decision rule as follows:

• For a speech utterance Xi with true word identity Wi, the multiclass separation margin for Xi is defined as

$$\hat{W} = \arg\max_{W \in \Omega} p(W \mid X) = \arg\max_{W \in \Omega} p(W)\, p(X \mid \lambda_W) = \arg\max_{W \in \Omega} \mathcal{F}(X \mid \lambda_W)$$

where the discriminant function is $\mathcal{F}(X \mid \lambda_W) = \log\big[\, p(W)\, p(X \mid \lambda_W) \,\big]$ and Ω denotes the set of all possible words.

$$d(X_i) = \mathcal{F}(X_i \mid \lambda_{W_i}) - \max_{W_j \in \Omega,\ W_j \neq W_i} \mathcal{F}(X_i \mid \lambda_{W_j}) = \min_{W_j \in \Omega,\ W_j \neq W_i} \big[\, \mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j}) \,\big]$$
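A minimal Python sketch of this margin computation, assuming the discriminant scores F(X_i|λ_W) have already been computed for every candidate word; the helper name and score values below are illustrative only:

```python
def separation_margin(scores, true_word):
    """d(X_i) = F(X_i | lam_{W_i}) - max_{W != W_i} F(X_i | lam_W).

    scores: dict mapping each candidate word W to its discriminant score
            F(X_i | lam_W), e.g. log[p(W) p(X_i | lam_W)].
    true_word: the correct transcription W_i of utterance X_i.
    """
    competing = [s for w, s in scores.items() if w != true_word]
    return scores[true_word] - max(competing)

# toy example with three candidate words; the true word is "yes"
scores = {"yes": -102.3, "no": -110.8, "maybe": -105.1}
print(separation_margin(scores, "yes"))  # 2.8 > 0, i.e. correctly classified
```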

Page 5

Large Margin HMMs for ASR

• According to statistical learning theory [Vapnik], the generalization error rate of a classifier on new test sets is theoretically bounded by a quantity related to its margin

• Motivated by the large margin principle, even when all utterances in the training set already have positive margin, we may still want to maximize the minimum margin to build an HMM-based large margin classifier for ASR

Page 6

Large Margin HMMs for ASR

• Given a set of training data D = { X1, X2,…,XT}, we usually know the true word identities for all utterances in D, denoted as L = {W1, W2,…,WT}

• First, from all utterances in D, we need to identify a subset of utterances S as

$$S = \{\, X_i \mid X_i \in D \ \text{and}\ 0 \le d(X_i) \le \varepsilon \,\}$$

where ε > 0 is a preset positive number.

• We call S the support vector set; each utterance in S is called a support token, which has a relatively small positive margin among all utterances in the training set D
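A minimal sketch of this support-token selection, assuming the margin d(X_i) of every training utterance is already available; the identifiers and values below are illustrative:

```python
def support_set(margins, eps):
    """Select support tokens S = {X_i in D : 0 <= d(X_i) <= eps}.

    margins: list of (utterance_id, margin) pairs over the training set D.
    eps:     preset positive threshold epsilon.
    """
    return [uid for uid, d in margins if 0.0 <= d <= eps]

margins = [("utt1", 0.4), ("utt2", -1.2), ("utt3", 3.7), ("utt4", 1.1)]
print(support_set(margins, eps=1.5))  # ['utt1', 'utt4']; utt2 is misclassified, utt3 is far from the boundary
```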

Page 7

Large Margin HMMs for ASR

• This idea leads to estimating the HMM models Λ based on the criterion of maximizing the minimum margin of all support tokens, which is named large margin estimation (LME) of HMMs

$$\tilde{\Lambda} = \arg\max_{\Lambda} \min_{X_i \in S} d(X_i), \qquad d(X_i) = \min_{W_j \in \Omega,\ W_j \neq W_i} \big[\, \mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j}) \,\big]$$

$$\tilde{\Lambda} = \arg\max_{\Lambda} \min_{X_i \in S,\ W_j \neq W_i} \big[\, \mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j}) \,\big] = \arg\min_{\Lambda} \max_{X_i \in S,\ W_j \neq W_i} \big[\, \mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i}) \,\big]$$

The HMM models $\tilde{\Lambda}$ estimated in this way are called large margin HMMs.
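A sketch of the raw LME objective, the minimum margin over the support set, which would then be maximized over the HMM parameters; the `discriminant` callable and the token structure are hypothetical placeholders:

```python
def lme_objective(model_params, support_tokens, discriminant):
    """Minimum margin over all support tokens (to be maximized over model_params).

    support_tokens: iterable of (X, true_word, competing_words) triples.
    discriminant(X, W, model_params) should return F(X | lam_W).
    """
    margins = []
    for X, true_word, competing_words in support_tokens:
        f_true = discriminant(X, true_word, model_params)
        f_rival = max(discriminant(X, w, model_params) for w in competing_words)
        margins.append(f_true - f_rival)
    return min(margins)
```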

Page 8

Analysis of Margin in CDHMM

• Adopting the Viterbi method to approximate the summation over state sequences with the single optimal Viterbi path, the discriminant function can be expressed as

$$\mathcal{F}(X \mid \lambda_W) \approx \log p(W) + \sum_{t=1}^{R} \Big[\, \log a_{s^*_{t-1} s^*_t} + \log w_{s^*_t l^*_t} + \log \mathcal{N}\big(x_t \,;\, \mu_{s^*_t l^*_t}, \Sigma_{s^*_t l^*_t}\big) \,\Big]$$

$$= \log p(W) + \sum_{t=1}^{R} \Big[\, \log a_{s^*_{t-1} s^*_t} + \log w_{s^*_t l^*_t} \,\Big] + \sum_{t=1}^{R} \sum_{d=1}^{D} \Big[ -\frac{1}{2}\log\big(2\pi\,\sigma^2_{s^*_t l^*_t d}\big) - \frac{\big(x_{td} - \mu_{s^*_t l^*_t d}\big)^2}{2\,\sigma^2_{s^*_t l^*_t d}} \Big]$$

where $s^* = (s^*_1, \dots, s^*_R)$ is the optimal Viterbi state sequence, $l^* = (l^*_1, \dots, l^*_R)$ indexes the Gaussian mixture components along that path, and D is the feature dimension.
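A sketch of this Viterbi-approximated discriminant for diagonal-covariance Gaussians, assuming the per-frame alignment (transition probability, mixture weight, mean and variance of the aligned component) has already been extracted; the array shapes and names are illustrative:

```python
import numpy as np

def viterbi_discriminant(X, log_prior, log_trans, log_mixw, means, variances):
    """F(X | lam_W) ~= log p(W) + sum_t [log a + log w + log N(x_t; mu, Sigma)].

    X:          (R, D) feature vectors.
    log_prior:  scalar log p(W).
    log_trans:  (R,) log transition probabilities along the Viterbi path.
    log_mixw:   (R,) log mixture weights of the aligned Gaussian components.
    means, variances: (R, D) diagonal Gaussian parameters of those components.
    """
    log_gauss = -0.5 * (np.log(2.0 * np.pi * variances) + (X - means) ** 2 / variances)
    return log_prior + log_trans.sum() + log_mixw.sum() + log_gauss.sum()

# toy check: 3 frames, 2 dimensions
R, D = 3, 2
X = np.zeros((R, D))
print(viterbi_discriminant(X, np.log(0.1), np.full(R, np.log(0.9)),
                           np.full(R, np.log(0.5)), np.zeros((R, D)), np.ones((R, D))))
```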

Page 9

Analysis of Margin in CDHMM

• Here, we only consider estimating the Gaussian mean vectors

$$\mathcal{F}(X_i \mid \lambda_{W_i}) = C_i - \sum_{t=1}^{R} \sum_{d=1}^{D} \frac{\big(x_{itd} - \mu_{s^{i*}_t l^{i*}_t d}\big)^2}{2\,\sigma^2_{s^{i*}_t l^{i*}_t d}}, \qquad \mathcal{F}(X_i \mid \lambda_{W_j}) = C_j - \sum_{t=1}^{R} \sum_{d=1}^{D} \frac{\big(x_{itd} - \mu_{s^{j*}_t l^{j*}_t d}\big)^2}{2\,\sigma^2_{s^{j*}_t l^{j*}_t d}}$$

where $(s^{i*}_t, l^{i*}_t)$ and $(s^{j*}_t, l^{j*}_t)$ denote the state and mixture component aligned to frame t by the Viterbi paths of $\lambda_{W_i}$ and $\lambda_{W_j}$, and $C_i$, $C_j$ collect all terms independent of the mean vectors.

In this case, the discriminant functions can be represented as a summation of quadratic terms related to the mean values of the CDHMMs.

Page 10

Analysis of Margin in CDHMM

• As a result, the decision margin can be represented in a standard diagonal quadratic form:

$$d_{ij}(X_i) = \mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j}) = C_i - C_j + \sum_{t=1}^{R} \sum_{d=1}^{D} \Big[ \frac{\big(x_{itd} - \mu_{s^{j*}_t l^{j*}_t d}\big)^2}{2\,\sigma^2_{s^{j*}_t l^{j*}_t d}} - \frac{\big(x_{itd} - \mu_{s^{i*}_t l^{i*}_t d}\big)^2}{2\,\sigma^2_{s^{i*}_t l^{i*}_t d}} \Big]$$

so each feature dimension contributes to the decision margin separately.

• Thus, for each feature vector x_it, we can divide all of its dimensions into two parts:

$$D_{t1} = \big\{\, d \mid \sigma^2_{s^{i*}_t l^{i*}_t d} = \sigma^2_{s^{j*}_t l^{j*}_t d} \,\big\}, \qquad D_{t2} = \big\{\, d \mid \sigma^2_{s^{i*}_t l^{i*}_t d} \neq \sigma^2_{s^{j*}_t l^{j*}_t d} \,\big\}$$
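A sketch of this diagonal quadratic decision margin when only the mean vectors are estimated; the constant folding in the prior, transition and weight terms, and the aligned-component arrays, are illustrative assumptions:

```python
import numpy as np

def decision_margin(X, means_i, vars_i, means_j, vars_j, const_ij=0.0):
    """d_ij(X_i) = C_i - C_j + sum_t sum_d [(x - mu_j)^2 / (2 var_j) - (x - mu_i)^2 / (2 var_i)].

    All arrays are (R, D): parameters of the Gaussian components aligned to each
    frame by the Viterbi paths of the correct model W_i and a competitor W_j.
    const_ij folds in all terms that do not depend on the mean vectors.
    """
    per_dim = (X - means_j) ** 2 / (2.0 * vars_j) - (X - means_i) ** 2 / (2.0 * vars_i)
    return const_ij + per_dim.sum()

R, D = 3, 2
X = np.zeros((R, D))
print(decision_margin(X, np.zeros((R, D)), np.ones((R, D)),
                      np.ones((R, D)), np.ones((R, D))))  # 3.0: the rival means are farther from the data
```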

Page 11

Analysis of Margin in CDHMM

• After some math manipulation, we have:

$$d_{ij}(X_i) = C_{ij} + \sum_{t=1}^{R} \Big[ \sum_{d \in D_{t1}} \big( A_{itd}\, x_{itd} + B_{itd} \big) + \sum_{d \in D_{t2}} L_{itd}\big( x_{itd} \big) \Big]$$

For a dimension $d \in D_{t1}$ (equal variances) the contribution reduces to a linear function of $x_{itd}$ with coefficients $A_{itd}$ and $B_{itd}$, while for $d \in D_{t2}$ (different variances) completing the square leaves a quadratic function $L_{itd}(x_{itd})$; all coefficients and the constant $C_{ij}$ depend only on the means and variances of the two Gaussian components aligned to frame t.

Page 12

Analysis of Margin in CDHMM

Page 13

Analysis of Margin in CDHMM

Page 14

Analysis of Margin in CDHMM

Page 15

Optimization methods for LM HMM estimation

• An iterative localized optimization method

• A constrained joint optimization method

• Semidefinite programming method

Page 16

Iterative localized optimization

• To increase the margin without bound while keeping the margins of all samples positive, both models involved must be moved together
– if we keep one of the models fixed, the other model cannot be moved too far under the constraint that all samples must have positive margin
– otherwise the margin for some tokens will become negative

• Instead of optimizing parameters of all models at the same time, only one selected model will be adjusted in each step of optimization

• Then the process iterates to update another model until the optimal margin is achieved

Page 17

Iterative localized optimization

• How to select the target model in each step?
– The model should be the one relevant to the support token with the minimum margin

• The minimax optimization can be re-formulated as:

$$\tilde{\lambda}_m^{(n+1)} = \arg\min_{\lambda_m}\ \max_{\substack{X_i \in S,\ W_j \neq W_i \\ (W_j = m \ \text{or}\ W_i = m)}} \big[\, \mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i}) \,\big]$$

subject to

$$\mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i}) \le 0 \quad \text{for all } X_i \in S,\ W_j \neq W_i \ \text{with}\ (W_j = m \ \text{or}\ W_i = m)$$

where m is the model selected for updating in the current iteration.
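A high-level sketch of this iterative localized procedure: update one model at a time, picking the model tied to the worst (minimum-margin) support token; the helper callables and the token attributes below are hypothetical:

```python
def iterative_localized_lme(models, support_tokens, margin_of, optimize_single_model, n_iters=10):
    """Update one HMM per step while keeping all other HMMs fixed.

    models:   dict mapping word -> HMM parameters.
    margin_of(token, models) returns the current margin of a support token.
    optimize_single_model(word, models, support_tokens) returns new parameters
    for that word's HMM from the constrained single-model optimization.
    """
    for _ in range(n_iters):
        # the support token with the minimum margin decides which model to touch
        worst = min(support_tokens, key=lambda tok: margin_of(tok, models))
        target = worst.true_word          # hypothetical attribute of a token object
        models[target] = optimize_single_model(target, models, support_tokens)
    return models
```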

Page 18

Iterative localized optimization

• The max operation can be approximated by a summation of exponential functions:

$$\max_{\substack{X_i \in S,\ W_j \neq W_i \\ (W_j = m \ \text{or}\ W_i = m)}} \big[\, \mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i}) \,\big] \approx \frac{1}{\eta} \log \sum_{\substack{X_i \in S,\ W_j \neq W_i \\ (W_j = m \ \text{or}\ W_i = m)}} \exp\Big\{ \eta \big[\, \mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i}) \,\big] \Big\}, \qquad \eta \ge 1$$

$$Q(\lambda_m) = \frac{1}{\eta} \log \sum_{\substack{X_i \in S,\ W_j \neq W_i \\ (W_j = m \ \text{or}\ W_i = m)}} \exp\Big\{ \eta \big[\, \mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i}) \,\big] \Big\} = \frac{1}{\eta} \log \sum_{\substack{X_i \in S,\ W_j \neq W_i \\ (W_j = m \ \text{or}\ W_i = m)}} \exp\big\{ -\eta\, d_{ij}(X_i) \big\}$$
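A small sketch of this smoothing: the hard max is replaced by a differentiable log-sum-exp whose tightness is controlled by η; the shift by the maximum is only for numerical stability:

```python
import numpy as np

def smoothed_max(diffs, eta=10.0):
    """(1/eta) * log sum_k exp(eta * diffs[k]) ~= max_k diffs[k] for large eta.

    diffs: the terms F(X_i | lam_{W_j}) - F(X_i | lam_{W_i}) = -d_ij(X_i)
           collected over support tokens and competing models.
    """
    diffs = np.asarray(diffs, dtype=float)
    m = diffs.max()
    return m + np.log(np.exp(eta * (diffs - m)).sum()) / eta

print(smoothed_max([-3.0, -0.5, -2.1], eta=10.0))  # close to the true max, -0.5
```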

Page 19

Iterative localized optimization

Page 20

Constrained Joint optimization

• Introduce some constraints to make the optimization problem bounded

• In this way, the optimization can be performed jointly with respect to all model parameters

Page 21

Constrained Joint optimization

• In order to bound the margin contribution from the linear part, a constraint

$$R_1(X_i \mid \lambda_{W_i}, \lambda_{W_j}) \;\le\; g$$

is imposed, where $R_1$ accumulates the squared linear-part coefficients over $t = 1, \dots, R$ and $d \in D_{t1}$.

• In order to bound the margin contribution from the quadratic part, a second constraint

$$R_2(X_i \mid \lambda_{W_i}, \lambda_{W_j}) \;\le\; G$$

is imposed, where $R_2$ accumulates the squared quadratic-part terms, measured relative to their values under the initial models (superscript (0)), over $t = 1, \dots, R$ and $d \in D_{t2}$.

Page 22

Constrained Joint optimization

• Reformulate the large margin estimation as the following constrained minimax optimization problem:

$$\tilde{\Lambda} = \arg\min_{\Lambda}\ \max_{X_i \in S,\ W_j \neq W_i} \big[\, \mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i}) \,\big]$$

subject to

$$R_1(X_i \mid \lambda_{W_i}, \lambda_{W_j}) \le g, \qquad R_2(X_i \mid \lambda_{W_i}, \lambda_{W_j}) \le G, \qquad \mathcal{F}(X_i \mid \lambda_{W_i}) - \mathcal{F}(X_i \mid \lambda_{W_j}) \ge 0$$

for all $X_i \in S$ and $W_j \neq W_i$.

Page 23

Constrained Joint optimization

• The constrained minimax problem can be transformed into an unconstrained minimization problem by adding penalty terms

$$O(\Lambda) = Q(\Lambda) + \gamma_1\, P_1(\Lambda) + \gamma_2\, P_2(\Lambda)$$

with the penalty terms

$$P_1(\Lambda) = \sum_{X_i \in S,\ W_j \neq W_i} \max\big( 0,\ R_1(X_i \mid \lambda_{W_i}, \lambda_{W_j}) - g \big), \qquad P_2(\Lambda) = \sum_{X_i \in S,\ W_j \neq W_i} \max\big( 0,\ R_2(X_i \mid \lambda_{W_i}, \lambda_{W_j}) - G \big)$$

and the smoothed minimax term

$$Q(\Lambda) = \frac{1}{\eta} \log \sum_{X_i \in S,\ W_j \neq W_i} \exp\Big\{ \eta \big[\, \mathcal{F}(X_i \mid \lambda_{W_j}) - \mathcal{F}(X_i \mid \lambda_{W_i}) \,\big] \Big\} = \frac{1}{\eta} \log \sum_{X_i \in S,\ W_j \neq W_i} \exp\big\{ -\eta\, d_{ij}(X_i) \big\}$$
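A sketch of this penalty-based unconstrained objective, assuming Q(Λ) and the constraint values R1, R2 have been evaluated elsewhere; the penalty weights are illustrative:

```python
def penalized_objective(Q, R1_vals, R2_vals, g, G, gamma1=1.0, gamma2=1.0):
    """O(Lambda) = Q(Lambda) + gamma1 * P1(Lambda) + gamma2 * P2(Lambda).

    Q:        value of the smoothed minimax objective Q(Lambda).
    R1_vals:  R1(X_i | lam_{W_i}, lam_{W_j}) for every support token / rival pair.
    R2_vals:  R2(X_i | lam_{W_i}, lam_{W_j}) for the same pairs.
    g, G:     constraint bounds; only violations contribute to the hinge penalties.
    """
    P1 = sum(max(0.0, r - g) for r in R1_vals)
    P2 = sum(max(0.0, r - G) for r in R2_vals)
    return Q + gamma1 * P1 + gamma2 * P2

print(penalized_objective(Q=-0.5, R1_vals=[0.8, 1.4], R2_vals=[0.2], g=1.0, G=1.0))  # -0.1: only the 1.4 violation is penalized
```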

Page 24

Constrained Joint optimization

Page 25

Soft Margin estimation

• Model separation measure and frame selection

• SME objective function and sample selection

$$d_{LLR}(X_i) = \log \frac{l(X_i \mid W_i^{\mathrm{target}})}{l(X_i \mid W_i^{\mathrm{comp}})}$$

$$d_{SME}(X_i) = \frac{1}{n_i} \sum_{j} \log \frac{l(X_{ij} \mid W_i^{\mathrm{target}})}{l(X_{ij} \mid W_i^{\mathrm{comp}})}\ I\big(X_{ij} \in F_i\big)$$

where $l(\cdot)$ denotes the likelihood, $X_{ij}$ is the j-th frame of utterance $X_i$, $F_i$ is the set of selected frames (frames on which the target and competing hypotheses differ), $n_i$ is the number of selected frames, and $I(\cdot)$ is the indicator function.

$$L_{SME}(\Lambda) = \frac{\lambda}{\rho} + \frac{1}{N} \sum_{i=1}^{N} \Big( \rho - \frac{1}{n_i} \sum_{j} \log \frac{l(X_{ij} \mid W_i^{\mathrm{target}})}{l(X_{ij} \mid W_i^{\mathrm{comp}})}\ I\big(X_{ij} \in F_i\big) \Big)\, I\big(X_i \in U\big) = \frac{\lambda}{\rho} + \frac{1}{N} \sum_{i=1}^{N} \ell\big( \rho - d_{SME}(X_i) \big)$$

$$\ell(\rho - x) = \begin{cases} \rho - x, & \rho - x \ge 0 \\ 0, & \text{otherwise} \end{cases}$$

where $U$ is the set of selected samples (utterances with $d_{SME}(X_i) \le \rho$), $N$ is the number of training utterances, $\rho$ is the soft margin, and $\lambda$ balances the margin term against the empirical loss.
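A sketch of the SME objective with the hinge loss above, assuming the per-utterance separation measures d_SME(X_i) have already been computed; ρ and λ are the heuristic soft margin and trade-off weight:

```python
import numpy as np

def sme_objective(margins, rho, lam):
    """L_SME = lam / rho + (1/N) * sum_i max(0, rho - d_SME(X_i))."""
    margins = np.asarray(margins, dtype=float)
    hinge = np.maximum(0.0, rho - margins)   # only samples with d_SME(X_i) <= rho contribute
    return lam / rho + hinge.mean()

print(sme_objective([0.2, -0.5, 1.8, 0.9], rho=1.0, lam=0.5))  # 0.5 + (0.8 + 1.5 + 0.0 + 0.1)/4 = 1.1
```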

Page 26

Soft Margin estimation

• Difference between SME and LME
– LME neglects the misclassified samples; consequently, LME often needs a very good preliminary model estimate from the training set
– SME works on all the training data, both the correctly classified and the misclassified samples
– However, SME must first choose the margin ρ heuristically