Forward-backward algorithm
LING 575
Week 2: 01/10/08
Some facts
• The forward-backward algorithm is a special case of EM.
• EM stands for “expectation maximization”.
• EM falls into the general framework of maximum-likelihood estimation (MLE).
Outline
• Maximum likelihood estimate (MLE)
• EM in a nutshell
• Forward-backward algorithm
• Hw1
• Additional slides
– EM main ideas
– EM for PM models
MLE
What is MLE?
• Given
– A sample X={X1, …, Xn}
– A list of parameters θ
• We define
– Likelihood of the data: $P(X \mid \theta)$
– Log-likelihood of the data: $L(\theta) = \log P(X \mid \theta)$
• Given X, find

$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$
MLE (cont)
• Often we assume that the $X_i$'s are independent and identically distributed (i.i.d.).
• Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.
$$\begin{aligned}
\theta_{ML} &= \arg\max_{\theta} L(\theta) \\
&= \arg\max_{\theta} \log P(X \mid \theta) \\
&= \arg\max_{\theta} \log P(X_1, \ldots, X_n \mid \theta) \\
&= \arg\max_{\theta} \log \prod_{i} P(X_i \mid \theta) \\
&= \arg\max_{\theta} \sum_{i} \log P(X_i \mid \theta)
\end{aligned}$$
An easy case
• Assuming:
– A coin has a probability p of being a head, and 1 − p of being a tail.
– Observation: we toss the coin N times, and the result is a set of Hs and Ts, with m Hs.
• What is the value of p based on MLE, given the observation?
An easy case (cont)
$$L(\theta) = \log P(X \mid \theta) = \log\big(p^{m}(1-p)^{N-m}\big) = m \log p + (N-m)\log(1-p)$$

$$\frac{dL}{dp} = \frac{d\big(m \log p + (N-m)\log(1-p)\big)}{dp} = \frac{m}{p} - \frac{N-m}{1-p} = 0$$

$$\Rightarrow\ p = m/N$$
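As a quick numerical check of this result (not from the original slides), we can evaluate the log-likelihood over a grid of p values and confirm the maximum sits at m/N:

```python
import numpy as np

# Coin-toss MLE: m heads out of N tosses (made-up numbers).
N, m = 20, 7

# L(p) = m*log(p) + (N-m)*log(1-p), evaluated on a grid.
p_grid = np.linspace(0.001, 0.999, 9999)
loglik = m * np.log(p_grid) + (N - m) * np.log(1 - p_grid)

print(p_grid[np.argmax(loglik)])  # ~0.35, found numerically
print(m / N)                      # 0.35, the closed-form MLE
```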
EM: basic concepts
Basic setting in EM
• X is a set of data points: the observed data.
• $\theta$ is a parameter vector.
• EM is a method to find $\theta_{ML}$ where

$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$

• Calculating $P(X \mid \theta)$ directly is hard.
• Calculating $P(X, Y \mid \theta)$ is much simpler, where Y is "hidden" data (or "missing" data).
The basic EM strategy
• Z = (X, Y)
– Z: complete data ("augmented data")
– X: observed data ("incomplete" data)
– Y: hidden data ("missing" data)
The “missing” data Y
• Y need not necessarily be missing in the practical sense of the word.
• It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ).
• There could be many possible Ys.
Examples of EM
• HMM: X (observed) = sentences; Y (hidden) = state sequences; θ = $a_{ij}$, $b_{ijk}$; algorithm = forward-backward.
• PCFG: X (observed) = sentences; Y (hidden) = parse trees; θ = $P(A \rightarrow B\ C)$; algorithm = inside-outside.
• MT: X (observed) = parallel data; Y (hidden) = word alignments; θ = $t(f \mid e)$, $d(a_j \mid j, l, m)$, …; algorithm = IBM models.
• Coin toss: X (observed) = head-tail sequences; Y (hidden) = coin id sequences; θ = $p_1$, $p_2$, $\lambda$; algorithm = N/A.
Basic idea
• Consider a set of starting parameters
• Use these to “estimate” the missing data
• Use “complete” data to update parameters
• Repeat until convergence
The EM algorithm
• Start with an initial estimate $\theta^0$.
• Repeat until convergence:
– E-step: calculate

$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log P(x_i, y \mid \theta)$$

– M-step: find

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$
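In code, the loop has the following shape; this is a minimal sketch in which `e_step` and `m_step` are hypothetical callables standing in for the model-specific formulas above:

```python
import numpy as np

def em(x, theta0, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM loop (a sketch; e_step and m_step are hypothetical).
    e_step(x, theta) -> expected counts under P(y | x, theta)
    m_step(counts)   -> new parameters maximizing Q(theta, theta_t)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        counts = e_step(x, theta)                     # E-step
        new_theta = m_step(counts)                    # M-step
        if np.max(np.abs(new_theta - theta)) < tol:   # convergence test
            return new_theta
        theta = new_theta
    return theta
```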
EM Highlights
• It is a general algorithm for missing data problems.
• It requires “specialization” to the problem at hand.
• Some classes of problems have a closed-form solution for the M-step. For example:
– The forward-backward algorithm for HMMs
– The inside-outside algorithm for PCFGs
– EM in the IBM MT models
The forward-backward algorithm
Notation
• A sentence: $O_{1,T} = o_1 \ldots o_T$, where T is the sentence length.
• The state sequence: $X_{1,T+1} = X_1 \ldots X_{T+1}$.
• t: a time index, ranging from 1 to T+1.
• $X_t$: the state at time t.
• i, j: states $s_i$, $s_j$.
• k: word $w_k$ in the vocabulary.
Forward probability
It is the probability of producing $o_{1,t-1}$ while ending up in state $s_i$:

$$\alpha_i(t) \stackrel{\text{def}}{=} P(o_{1,t-1},\, X_t = i)$$

Initialization: $\alpha_i(1) = \pi_i$

Induction: $\alpha_j(t+1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_{ij o_t}$
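A minimal sketch of this computation for an arc-emission HMM, where $b_{ijk}$ is the probability of emitting word $w_k$ on the arc from $s_i$ to $s_j$ (the function name and array layout are my own):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward pass (a sketch).
    pi:  (N,)      initial state probabilities pi_i
    A:   (N, N)    transition probabilities a_ij
    B:   (N, N, V) arc-emission probabilities b_ijk
    obs: list of T word indices o_1..o_T
    Returns alpha of shape (T+1, N), with alpha[t-1, i] = alpha_i(t).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                  # alpha_i(1) = pi_i
    for t in range(T):
        # alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{ij o_t}
        alpha[t + 1] = alpha[t] @ (A * B[:, :, obs[t]])
    return alpha
```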
Backward probability
• It is the probability of producing the sequence $O_{t,T}$, given that at time t we are in state $s_i$:

$$\beta_i(t) \stackrel{\text{def}}{=} P(o_{t,T} \mid X_t = i)$$

Initialization: $\beta_i(T+1) = 1$

Induction: $\beta_i(t) = \sum_{j} a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$
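The matching backward pass, under the same assumptions and array layout as the `forward` sketch above:

```python
import numpy as np

def backward(A, B, obs):
    """Backward pass (a sketch; same layout as forward above).
    Returns beta of shape (T+1, N), with beta[t-1, i] = beta_i(t).
    """
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                  # beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):
        # beta_i(t) = sum_j a_ij * b_{ij o_t} * beta_j(t+1)
        beta[t] = (A * B[:, :, obs[t]]) @ beta[t + 1]
    return beta
```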
Calculating the prob of the observation
$$P(O) = \sum_{i=1}^{N} \alpha_i(T+1)$$

$$P(O) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$$

$$P(O) = \sum_{i=1}^{N} P(O,\, X_t = i) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t) \quad \text{(for any } t\text{)}$$
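Using the `forward` and `backward` sketches above with made-up parameters, all three formulas should return the same number:

```python
import numpy as np

N, V = 2, 3
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
rng = np.random.default_rng(0)
B = rng.dirichlet(np.ones(V), size=(N, N))   # B[i, j] sums to 1 over words

obs = [0, 2, 1, 1]
alpha = forward(pi, A, B, obs)
beta = backward(A, B, obs)

print(alpha[-1].sum())             # sum_i alpha_i(T+1)
print((pi * beta[0]).sum())        # sum_i pi_i * beta_i(1)
print((alpha[2] * beta[2]).sum())  # sum_i alpha_i(t) * beta_i(t), any t
```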
$\gamma_{ij}(t)$ is the probability of traversing a certain arc at time t given O (denoted by $p_t(i,j)$ in M&S):

$$\gamma_{ij}(t) \stackrel{\text{def}}{=} P(X_t = i,\, X_{t+1} = j \mid O) = \frac{P(X_t = i,\, X_{t+1} = j,\, O)}{P(O)} = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$
$\gamma_i(t)$ is the probability of being in state i at time t given O:

$$\gamma_i(t) \stackrel{\text{def}}{=} P(X_t = i \mid O) = \sum_{j=1}^{N} P(X_t = i,\, X_{t+1} = j \mid O) = \sum_{j=1}^{N} \gamma_{ij}(t)$$
Expected counts
• Calculate expected counts by summing over the time index t.
• Expected number of transitions from state i to j in O:

$$\sum_{t=1}^{T} \gamma_{ij}(t)$$

• Expected number of transitions from state i in O:

$$\sum_{t=1}^{T} \gamma_i(t) = \sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)$$
Update parameters
$$\hat{\pi}_i = \text{expected frequency in state } i \text{ at time } 1 = \gamma_i(1)$$

$$\hat{a}_{ij} = \frac{\text{expected \# of transitions from state } i \text{ to } j}{\text{expected \# of transitions from state } i} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_i(t)} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)}$$

$$\hat{b}_{ijk} = \frac{\text{expected \# of transitions from state } i \text{ to } j \text{ with } w_k \text{ observed}}{\text{expected \# of transitions from state } i \text{ to } j} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$
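The same updates in array form, building on `arc_posteriors` above (a sketch; a real implementation would smooth or guard against zero expected counts in the denominators):

```python
import numpy as np

def update_parameters(gamma, obs, V):
    """Re-estimate (pi, A, B) from gamma[t-1, i, j] = gamma_ij(t)."""
    T, N, _ = gamma.shape
    state_post = gamma.sum(axis=2)                    # gamma_i(t)
    pi_hat = state_post[0]                            # gamma_i(1)
    # a_ij = sum_t gamma_ij(t) / sum_t gamma_i(t)
    A_hat = gamma.sum(axis=0) / state_post.sum(axis=0)[:, None]
    # b_ijk = sum_{t: o_t = w_k} gamma_ij(t) / sum_t gamma_ij(t)
    B_hat = np.zeros((N, N, V))
    for t, o in enumerate(obs):
        B_hat[:, :, o] += gamma[t]
    B_hat /= gamma.sum(axis=0)[:, :, None]
    return pi_hat, A_hat, B_hat
```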
Final formulae
$$\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)}$$

$$\hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$
The inner loop for forward-backward algorithm
Given an input sequence and an HMM $\mu = (S, K, \Pi, A, B)$:

1. Calculate the forward probability:
• Base case: $\alpha_i(1) = \pi_i$
• Recursive case: $\alpha_j(t+1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_{ij o_t}$

2. Calculate the backward probability:
• Base case: $\beta_i(T+1) = 1$
• Recursive case: $\beta_i(t) = \sum_{j} a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$

3. Calculate the expected counts:

$$\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$

4. Update the parameters:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)} \qquad \hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$
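Chaining the earlier sketches, one pass of this inner loop might look like:

```python
def inner_loop(pi, A, B, obs):
    """One forward-backward re-estimation step (a sketch)."""
    alpha = forward(pi, A, B, obs)                     # step 1
    beta = backward(A, B, obs)                         # step 2
    gamma = arc_posteriors(alpha, beta, A, B, obs)     # step 3
    return update_parameters(gamma, obs, B.shape[2])   # step 4

# Repeated until convergence:
# for _ in range(50):
#     pi, A, B = inner_loop(pi, A, B, obs)
```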
Summary
• A way of estimating parameters for an HMM:
– Define the forward and backward probabilities, which can be calculated efficiently (DP).
– Given an initial parameter setting, we re-estimate the parameters at each iteration.
– The forward-backward algorithm is a special case of the EM algorithm for PM models.
Hw1
Arc-emission HMM:

$$\hat{b}_{ijk} = \frac{\text{expected \# of transitions from state } i \text{ to } j \text{ with } w_k \text{ observed}}{\text{expected \# of transitions from state } i \text{ to } j} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$

Q1: How to estimate the emission probability in a state-emission HMM?
Given an input sequence and an HMM $\mu = (S, K, \Pi, A, B)$:

1. Calculate the forward probability:
• Base case: $\alpha_i(1) = \pi_i$
• Recursive case: $\alpha_j(t+1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_{ij o_t}$

2. Calculate the backward probability:
• Base case: $\beta_i(T+1) = 1$
• Recursive case: $\beta_i(t) = \sum_{j} a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$

3. Calculate the expected counts:

$$\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$

4. Update the parameters:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)} \qquad \hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$

Q2: How to modify the algorithm when there are multiple input sentences (e.g., a set of sentences)?
EM: main ideas
Additional slides
Idea #1: find θ that maximizes the likelihood of training data
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$
Idea #2: find the $\theta^t$ sequence

There is no analytical solution, so use an iterative approach: find a sequence $\theta^0, \theta^1, \ldots, \theta^t, \ldots$

s.t.

$$l(\theta^0) \leq l(\theta^1) \leq \ldots \leq l(\theta^t) \leq \ldots$$
Idea #3: find $\theta^{t+1}$ that maximizes a tight lower bound of $l(\theta) - l(\theta^t)$

$$l(\theta) - l(\theta^t) \geq \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}\right] \qquad \text{(a tight lower bound)}$$
Idea #4: find $\theta^{t+1}$ that maximizes the Q function

$$\begin{aligned}
\theta^{(t+1)} &= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}\right] && \text{(the lower bound of } l(\theta) - l(\theta^t)\text{)} \\
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\big[\log P(x_i, y \mid \theta)\big] && \text{(the Q function)}
\end{aligned}$$
The Q-function
• Define the Q-function (a function of θ):
– Y is a random vector.
– X = ($x_1$, $x_2$, …, $x_n$) is a constant (vector).
– $\theta^t$ is the current parameter estimate and is a constant (vector).
– $\theta$ is the variable (vector) that we wish to adjust.
• The Q-function is the expected value of the complete data log-likelihood P(X,Y|θ) with respect to Y given X and θt.
$$Q(\theta, \theta^t) = E\big[\log P(X, Y \mid \theta) \,\big|\, X, \theta^t\big] = \sum_{Y} P(Y \mid X, \theta^t)\, \log P(X, Y \mid \theta) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log P(x_i, y \mid \theta)$$
The EM algorithm
• Start with an initial estimate $\theta^0$.
• Repeat until convergence:
– E-step: calculate

$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log P(x_i, y \mid \theta)$$

– M-step: find

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$
Important classes of EM problem
• Products of multinomial (PM) models
• Exponential families
• Gaussian mixture
• …
The EM algorithm for PM models
PM models
$$p(x, y \mid \Theta) = \prod_{i} p_{1i}^{c_{1i}(x,y)} \times \prod_{i} p_{2i}^{c_{2i}(x,y)} \times \cdots \times \prod_{i} p_{mi}^{c_{mi}(x,y)}$$

where $\Theta = (\theta_1, \ldots, \theta_m)$ is a partition of all the parameters, and for any $j$, $\sum_{i} p_{ji} = 1$.
HMM is a PM
$$p(x, y \mid \theta) = \prod_{i,j} p(s_j \mid s_i)^{c(s_i \rightarrow s_j,\, x, y)} \times \prod_{i,j,k} p(s_i \xrightarrow{w_k} s_j)^{c(s_i \xrightarrow{w_k} s_j,\, x, y)}$$

where $a_{ij} = p(s_j \mid s_i)$ and $b_{ijk} = p(s_i \xrightarrow{w_k} s_j)$.
PCFG
• PCFG: each sample point is a pair (x, y), where
– x is a sentence
– y is a possible parse tree for that sentence

$$P(x, y \mid \theta) = \prod_{i=1}^{n} P(A_i \rightarrow \alpha_i \mid A_i)$$

For example:

$$P(x, y) = P(S \rightarrow NP\ VP \mid S) \cdot P(NP \rightarrow Jim \mid NP) \cdot P(VP \rightarrow sleeps \mid VP)$$
PCFG is a PM
$$p(x, y \mid \theta) = \prod_{A \rightarrow \alpha} p(A \rightarrow \alpha)^{c(A \rightarrow \alpha,\, x, y)}$$

where for every nonterminal $A$: $\sum_{\alpha} p(A \rightarrow \alpha) = 1$.
Q-function for PM
$$\begin{aligned}
Q(\theta, \theta^t) &= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log P(x_i, y \mid \theta) \\
&= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log \Big( \prod_{k} \prod_{j} p_{kj}^{C_k(j, x_i, y)} \Big) \\
&= \Big( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_1(j, x_i, y) \log p_{1j} \Big) + \cdots + \Big( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_m(j, x_i, y) \log p_{mj} \Big)
\end{aligned}$$

Because the parameters are partitioned into separate multinomials, $Q(\theta, \theta^t)$ decomposes into one term per multinomial; the first term is $Q_1(\theta, \theta^t)$, and each term can be maximized independently.
Maximizing the Q function
Maximize

$$Q_1(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y)\, \log p_j$$

subject to the constraint

$$\sum_{j} p_j = 1$$

Use Lagrange multipliers:

$$\hat{Q}(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y)\, \log p_j + \lambda \Big(1 - \sum_{j} p_j\Big)$$
Optimal solution
$$\hat{Q}(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y)\, \log p_j + \lambda \Big(1 - \sum_{j} p_j\Big)$$

$$\frac{\partial \hat{Q}}{\partial p_j} = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C(j, x_i, y)\,/\,p_j - \lambda = 0$$

$$\Rightarrow\quad p_j = \frac{1}{\lambda} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C(j, x_i, y)$$

where $1/\lambda$ is a normalization factor and the double sum is the expected count of the j-th outcome.
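In code, this M-step amounts to normalizing the expected counts; a toy sketch with made-up values:

```python
import numpy as np

# expected_counts[j] stands for the double sum above (made-up values).
expected_counts = np.array([4.2, 1.3, 0.5])

# lambda turns out to be the total expected count, so that sum_j p_j = 1.
p = expected_counts / expected_counts.sum()
print(p, p.sum())   # a valid multinomial distribution
```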
PCFG example
• Calculate expected counts
• Update parameters
The EM algorithm for PM models
// for each iteration
//     for each training example x_i
//         for each possible y
//             accumulate expected counts (E-step)
//     for each parameter
//         compute the normalization factor
//     for each parameter
//         re-estimate the parameter (M-step)
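A sketch of this loop for a model with a single multinomial (every helper name is hypothetical, standing in for the model-specific pieces; a full PM implementation would normalize within each multinomial of the partition separately):

```python
def em_for_pm(data, p0, candidates_fn, counts_fn, iters=20):
    """EM for a PM model with one multinomial (a sketch).
    p0:            dict parameter j -> initial probability p_j
    candidates_fn: x -> iterable of possible hidden values y
    counts_fn:     (x, y) -> dict of counts C(j, x, y)
    """
    p = dict(p0)
    for _ in range(iters):                       # for each iteration
        expected = {j: 0.0 for j in p}
        for x in data:                           # for each training example
            ys = list(candidates_fn(x))
            joint = []
            for y in ys:                         # for each possible y
                pxy = 1.0                        # P(x,y|p) = prod_j p_j**C(j,x,y)
                for j, c in counts_fn(x, y).items():
                    pxy *= p[j] ** c
                joint.append(pxy)
            total = sum(joint)
            for y, pxy in zip(ys, joint):
                post = pxy / total               # P(y | x, p)
                for j, c in counts_fn(x, y).items():
                    expected[j] += post * c      # E-step: expected counts
        z = sum(expected.values())               # normalization factor
        p = {j: expected[j] / z for j in p}      # M-step: re-estimate
    return p
```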
EM: An example
Forward-backward algorithm
• An HMM is a PM (product of multinomials) model.
• The forward-backward algorithm is a special case of the EM algorithm for PM models.
• X (observed data): each data point is an observation sequence $O_{1,T}$.
• Y (hidden data): the state sequence $X_{1,T+1}$.
• $\theta$ (parameters): $a_{ij}$, $b_{ijk}$, $\pi_i$.
Expected counts
$$\begin{aligned}
count(s_i \rightarrow s_j) &= \sum_{Y} P(Y \mid X, \theta^t) \cdot count(s_i \rightarrow s_j,\, X, Y) \\
&= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1,T}, \theta^t) \cdot count(s_i \rightarrow s_j,\, O_{1,T}, X_{1,T+1}) \\
&= \sum_{t=1}^{T} P(X_t = i,\, X_{t+1} = j \mid O_{1,T}, \theta^t) \\
&= \sum_{t=1}^{T} \gamma_{ij}(t)
\end{aligned}$$
Expected counts (cont)
$$\begin{aligned}
count(s_i \xrightarrow{w_k} s_j) &= \sum_{Y} P(Y \mid X, \theta^t) \cdot count(s_i \xrightarrow{w_k} s_j,\, X, Y) \\
&= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1,T}, \theta^t) \cdot count(s_i \xrightarrow{w_k} s_j,\, O_{1,T}, X_{1,T+1}) \\
&= \sum_{t=1}^{T} P(X_t = i,\, X_{t+1} = j \mid O_{1,T}, \theta^t) \cdot \delta(o_t, w_k) \\
&= \sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)
\end{aligned}$$
Summary
• The EM algorithm:
– An iterative approach
– $L(\theta)$ is non-decreasing at each iteration
– An optimal solution in the M-step exists for many classes of problems
• The EM algorithm for PM models:
– Simpler formulae
– Three special cases:
• The inside-outside algorithm
• The forward-backward algorithm
• The IBM models for MT
Relations among the algorithms
• The generalized EM
– The EM algorithm
• PM models
– Inside-outside algorithm
– Forward-backward algorithm
– IBM models
• Gaussian mixtures
Strengths of EM
• Numerical stability: in every iteration of the EM algorithm, it increases the likelihood of the observed data.
• EM handles parameter constraints gracefully.
Problems with EM
• Convergence can be very slow on some problems and is intimately related to the amount of missing information.
• It is guaranteed to improve the likelihood of the training corpus, which is different from directly reducing errors.
• It is not guaranteed to reach a global maximum (it can get stuck at local maxima, saddle points, etc.), so the initial estimate is important.