Forward-backward algorithm
LING 575
Week 2: 01/10/08
Some facts
• The forward-backward algorithm is a special case of EM.
• EM stands for “expectation maximization”.
• EM falls into the general framework of maximum-likelihood estimation (MLE).
Outline
• Maximum likelihood estimate (MLE)
• EM in a nutshell
• Forward-backward algorithm
• Hw1
• Additional slides
– EM main ideas
– EM for PM models
MLE
What is MLE?
• Given
– A sample X={X1, …, Xn}
– A list of parameters θ
• We define
– Likelihood of the data: $P(X \mid \theta)$
– Log-likelihood of the data: $L(\theta) = \log P(X \mid \theta)$
• Given X, find

$$\theta_{ML} = \arg\max_{\theta} L(\theta)$$
MLE (cont)
• Often we assume that the $X_i$'s are independent and identically distributed (i.i.d.).
• Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.
$$\begin{aligned}
\theta_{ML} &= \arg\max_{\theta} L(\theta) \\
&= \arg\max_{\theta} \log P(X \mid \theta) \\
&= \arg\max_{\theta} \log P(X_1, \ldots, X_n \mid \theta) \\
&= \arg\max_{\theta} \log \prod_{i} P(X_i \mid \theta) \\
&= \arg\max_{\theta} \sum_{i} \log P(X_i \mid \theta)
\end{aligned}$$
An easy case
• Assuming:
– A coin has a probability p of being a head, and 1 − p of being a tail.
– Observation: we toss the coin N times, and the result is a set of Hs and Ts, with m Hs.
• What is the value of p based on MLE, given the observation?
An easy case (cont)
$$L(\theta) = \log P(X \mid \theta) = \log\big(p^{m}(1-p)^{N-m}\big) = m \log p + (N-m)\log(1-p)$$

$$\frac{dL}{dp} = \frac{d\big(m \log p + (N-m)\log(1-p)\big)}{dp} = \frac{m}{p} - \frac{N-m}{1-p} = 0$$

$$\Rightarrow\ p = m/N$$
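As a quick numerical check of this result (not from the original slides), we can evaluate the log-likelihood over a grid of p values and confirm the maximum sits at m/N:

```python
import numpy as np

# Coin-toss MLE: m heads out of N tosses (made-up numbers).
N, m = 20, 7

# L(p) = m*log(p) + (N-m)*log(1-p), evaluated on a grid.
p_grid = np.linspace(0.001, 0.999, 9999)
loglik = m * np.log(p_grid) + (N - m) * np.log(1 - p_grid)

print(p_grid[np.argmax(loglik)])  # ~0.35, found numerically
print(m / N)                      # 0.35, the closed-form MLE
```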
EM: basic concepts
Basic setting in EM
• X is a set of data points: the observed data.
• $\theta$ is a parameter vector.
• EM is a method to find $\theta_{ML}$ where

$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$

• Calculating $P(X \mid \theta)$ directly is hard.
• Calculating $P(X, Y \mid \theta)$ is much simpler, where Y is "hidden" data (or "missing" data).
The basic EM strategy
• Z = (X, Y)
– Z: complete data ("augmented data")
– X: observed data ("incomplete" data)
– Y: hidden data ("missing" data)
The “missing” data Y
• Y need not necessarily be missing in the practical sense of the word.
• It may just be a conceptually convenient technical device to simplify the calculation of P(X | θ).
• There could be many possible Ys.
Examples of EM
• HMM: X (observed) = sentences; Y (hidden) = state sequences; θ = $a_{ij}$, $b_{ijk}$; algorithm = forward-backward.
• PCFG: X (observed) = sentences; Y (hidden) = parse trees; θ = $P(A \rightarrow B\ C)$; algorithm = inside-outside.
• MT: X (observed) = parallel data; Y (hidden) = word alignments; θ = $t(f \mid e)$, $d(a_j \mid j, l, m)$, …; algorithm = IBM models.
• Coin toss: X (observed) = head-tail sequences; Y (hidden) = coin id sequences; θ = $p_1$, $p_2$, $\lambda$; algorithm = N/A.
Basic idea
• Consider a set of starting parameters
• Use these to “estimate” the missing data
• Use “complete” data to update parameters
• Repeat until convergence
The EM algorithm
• Start with an initial estimate $\theta^0$.
• Repeat until convergence:
– E-step: calculate

$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log P(x_i, y \mid \theta)$$

– M-step: find

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$
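In code, the loop has the following shape; this is a minimal sketch in which `e_step` and `m_step` are hypothetical callables standing in for the model-specific formulas above:

```python
import numpy as np

def em(x, theta0, e_step, m_step, max_iters=100, tol=1e-6):
    """Generic EM loop (a sketch; e_step and m_step are hypothetical).
    e_step(x, theta) -> expected counts under P(y | x, theta)
    m_step(counts)   -> new parameters maximizing Q(theta, theta_t)
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        counts = e_step(x, theta)                     # E-step
        new_theta = m_step(counts)                    # M-step
        if np.max(np.abs(new_theta - theta)) < tol:   # convergence test
            return new_theta
        theta = new_theta
    return theta
```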
EM Highlights
• It is a general algorithm for missing data problems.
• It requires “specialization” to the problem at hand.
• Some classes of problems have a closed-form solution for the M-step. For example:
– The forward-backward algorithm for HMMs
– The inside-outside algorithm for PCFGs
– EM in the IBM MT models
The forward-backward algorithm
Notation
• A sentence: $O_{1,T} = o_1 \ldots o_T$, where T is the sentence length.
• The state sequence: $X_{1,T+1} = X_1 \ldots X_{T+1}$.
• t: a time index, ranging from 1 to T+1.
• $X_t$: the state at time t.
• i, j: states $s_i$, $s_j$.
• k: word $w_k$ in the vocabulary.
Forward probability
It is the probability of producing $o_{1,t-1}$ while ending up in state $s_i$:

$$\alpha_i(t) \stackrel{\text{def}}{=} P(o_{1,t-1},\, X_t = i)$$

Initialization: $\alpha_i(1) = \pi_i$

Induction: $\alpha_j(t+1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_{ij o_t}$
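A minimal sketch of this computation for an arc-emission HMM, where $b_{ijk}$ is the probability of emitting word $w_k$ on the arc from $s_i$ to $s_j$ (the function name and array layout are my own):

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward pass (a sketch).
    pi:  (N,)      initial state probabilities pi_i
    A:   (N, N)    transition probabilities a_ij
    B:   (N, N, V) arc-emission probabilities b_ijk
    obs: list of T word indices o_1..o_T
    Returns alpha of shape (T+1, N), with alpha[t-1, i] = alpha_i(t).
    """
    T, N = len(obs), len(pi)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                  # alpha_i(1) = pi_i
    for t in range(T):
        # alpha_j(t+1) = sum_i alpha_i(t) * a_ij * b_{ij o_t}
        alpha[t + 1] = alpha[t] @ (A * B[:, :, obs[t]])
    return alpha
```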
Backward probability
• It is the probability of producing the sequence $O_{t,T}$, given that at time t we are in state $s_i$:

$$\beta_i(t) \stackrel{\text{def}}{=} P(o_{t,T} \mid X_t = i)$$

Initialization: $\beta_i(T+1) = 1$

Induction: $\beta_i(t) = \sum_{j} a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$
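The matching backward pass, under the same assumptions and array layout as the `forward` sketch above:

```python
import numpy as np

def backward(A, B, obs):
    """Backward pass (a sketch; same layout as forward above).
    Returns beta of shape (T+1, N), with beta[t-1, i] = beta_i(t).
    """
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0                                  # beta_i(T+1) = 1
    for t in range(T - 1, -1, -1):
        # beta_i(t) = sum_j a_ij * b_{ij o_t} * beta_j(t+1)
        beta[t] = (A * B[:, :, obs[t]]) @ beta[t + 1]
    return beta
```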
Calculating the prob of the observation
$$P(O) = \sum_{i=1}^{N} \alpha_i(T+1)$$

$$P(O) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$$

$$P(O) = \sum_{i=1}^{N} P(O,\, X_t = i) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t) \quad \text{(for any } t\text{)}$$
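Using the `forward` and `backward` sketches above with made-up parameters, all three formulas should return the same number:

```python
import numpy as np

N, V = 2, 3
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
rng = np.random.default_rng(0)
B = rng.dirichlet(np.ones(V), size=(N, N))   # B[i, j] sums to 1 over words

obs = [0, 2, 1, 1]
alpha = forward(pi, A, B, obs)
beta = backward(A, B, obs)

print(alpha[-1].sum())             # sum_i alpha_i(T+1)
print((pi * beta[0]).sum())        # sum_i pi_i * beta_i(1)
print((alpha[2] * beta[2]).sum())  # sum_i alpha_i(t) * beta_i(t), any t
```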
$\gamma_{ij}(t)$ is the probability of traversing a certain arc at time t given O (denoted by $p_t(i,j)$ in M&S):

$$\gamma_{ij}(t) \stackrel{\text{def}}{=} P(X_t = i,\, X_{t+1} = j \mid O) = \frac{P(X_t = i,\, X_{t+1} = j,\, O)}{P(O)} = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$
$\gamma_i(t)$ is the probability of being in state i at time t given O:

$$\gamma_i(t) \stackrel{\text{def}}{=} P(X_t = i \mid O) = \sum_{j=1}^{N} P(X_t = i,\, X_{t+1} = j \mid O) = \sum_{j=1}^{N} \gamma_{ij}(t)$$
Expected counts
• Calculate expected counts by summing over the time index t.
• Expected number of transitions from state i to j in O:

$$\sum_{t=1}^{T} \gamma_{ij}(t)$$

• Expected number of transitions from state i in O:

$$\sum_{t=1}^{T} \gamma_i(t) = \sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)$$
Update parameters
$$\hat{\pi}_i = \text{expected frequency in state } i \text{ at time } 1 = \gamma_i(1)$$

$$\hat{a}_{ij} = \frac{\text{expected \# of transitions from state } i \text{ to } j}{\text{expected \# of transitions from state } i} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \gamma_i(t)} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)}$$

$$\hat{b}_{ijk} = \frac{\text{expected \# of transitions from state } i \text{ to } j \text{ with } w_k \text{ observed}}{\text{expected \# of transitions from state } i \text{ to } j} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$
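The same updates in array form, building on `arc_posteriors` above (a sketch; a real implementation would smooth or guard against zero expected counts in the denominators):

```python
import numpy as np

def update_parameters(gamma, obs, V):
    """Re-estimate (pi, A, B) from gamma[t-1, i, j] = gamma_ij(t)."""
    T, N, _ = gamma.shape
    state_post = gamma.sum(axis=2)                    # gamma_i(t)
    pi_hat = state_post[0]                            # gamma_i(1)
    # a_ij = sum_t gamma_ij(t) / sum_t gamma_i(t)
    A_hat = gamma.sum(axis=0) / state_post.sum(axis=0)[:, None]
    # b_ijk = sum_{t: o_t = w_k} gamma_ij(t) / sum_t gamma_ij(t)
    B_hat = np.zeros((N, N, V))
    for t, o in enumerate(obs):
        B_hat[:, :, o] += gamma[t]
    B_hat /= gamma.sum(axis=0)[:, :, None]
    return pi_hat, A_hat, B_hat
```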
Final formulae
$$\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)}$$

$$\hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$
The inner loop for forward-backward algorithm
Given an input sequence and an HMM $\mu = (S, K, \Pi, A, B)$:

1. Calculate the forward probability:
• Base case: $\alpha_i(1) = \pi_i$
• Recursive case: $\alpha_j(t+1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_{ij o_t}$

2. Calculate the backward probability:
• Base case: $\beta_i(T+1) = 1$
• Recursive case: $\beta_i(t) = \sum_{j} a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$

3. Calculate the expected counts:

$$\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$

4. Update the parameters:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)} \qquad \hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$
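Chaining the earlier sketches, one pass of this inner loop might look like:

```python
def inner_loop(pi, A, B, obs):
    """One forward-backward re-estimation step (a sketch)."""
    alpha = forward(pi, A, B, obs)                     # step 1
    beta = backward(A, B, obs)                         # step 2
    gamma = arc_posteriors(alpha, beta, A, B, obs)     # step 3
    return update_parameters(gamma, obs, B.shape[2])   # step 4

# Repeated until convergence:
# for _ in range(50):
#     pi, A, B = inner_loop(pi, A, B, obs)
```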
Summary
• A way of estimating parameters for an HMM:
– Define the forward and backward probabilities, which can be calculated efficiently (DP).
– Given an initial parameter setting, we re-estimate the parameters at each iteration.
– The forward-backward algorithm is a special case of the EM algorithm for PM models.
Hw1
Arc-emission HMM:

$$\hat{b}_{ijk} = \frac{\text{expected \# of transitions from state } i \text{ to } j \text{ with } w_k \text{ observed}}{\text{expected \# of transitions from state } i \text{ to } j} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$

Q1: How to estimate the emission probability in a state-emission HMM?
Given an input sequence and an HMM $\mu = (S, K, \Pi, A, B)$:

1. Calculate the forward probability:
• Base case: $\alpha_i(1) = \pi_i$
• Recursive case: $\alpha_j(t+1) = \sum_{i} \alpha_i(t)\, a_{ij}\, b_{ij o_t}$

2. Calculate the backward probability:
• Base case: $\beta_i(T+1) = 1$
• Recursive case: $\beta_i(t) = \sum_{j} a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$

3. Calculate the expected counts:

$$\gamma_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)}$$

4. Update the parameters:

$$\hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{j=1}^{N} \gamma_{ij}(t)} \qquad \hat{b}_{ijk} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)}{\sum_{t=1}^{T} \gamma_{ij}(t)}$$

Q2: How to modify the algorithm when there are multiple input sentences (e.g., a set of sentences)?
EM: main ideas
Additional slides
Idea #1: find θ that maximizes the likelihood of training data
$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$
Idea #2: find the $\theta^t$ sequence

There is no analytical solution, so use an iterative approach: find a sequence $\theta^0, \theta^1, \ldots, \theta^t, \ldots$

s.t.

$$l(\theta^0) \leq l(\theta^1) \leq \ldots \leq l(\theta^t) \leq \ldots$$
Idea #3: find $\theta^{t+1}$ that maximizes a tight lower bound of $l(\theta) - l(\theta^t)$

$$l(\theta) - l(\theta^t) \geq \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}\right] \qquad \text{(a tight lower bound)}$$
Idea #4: find $\theta^{t+1}$ that maximizes the Q function

$$\begin{aligned}
\theta^{(t+1)} &= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\left[\log \frac{P(x_i, y \mid \theta)}{P(x_i, y \mid \theta^t)}\right] && \text{(the lower bound of } l(\theta) - l(\theta^t)\text{)} \\
&= \arg\max_{\theta} \sum_{i=1}^{n} E_{P(y \mid x_i, \theta^t)}\big[\log P(x_i, y \mid \theta)\big] && \text{(the Q function)}
\end{aligned}$$
The Q-function
• Define the Q-function (a function of θ):
– Y is a random vector.
– X = ($x_1$, $x_2$, …, $x_n$) is a constant (vector).
– $\theta^t$ is the current parameter estimate and is a constant (vector).
– $\theta$ is the variable (vector) that we wish to adjust.
• The Q-function is the expected value of the complete data log-likelihood P(X,Y|θ) with respect to Y given X and θt.
$$Q(\theta, \theta^t) = E\big[\log P(X, Y \mid \theta) \,\big|\, X, \theta^t\big] = \sum_{Y} P(Y \mid X, \theta^t)\, \log P(X, Y \mid \theta) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log P(x_i, y \mid \theta)$$
The EM algorithm
• Start with an initial estimate $\theta^0$.
• Repeat until convergence:
– E-step: calculate

$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log P(x_i, y \mid \theta)$$

– M-step: find

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$
Important classes of EM problem
• Products of multinomial (PM) models
• Exponential families
• Gaussian mixture
• …
The EM algorithm for PM models
PM models
$$p(x, y \mid \Theta) = \prod_{i} p_{1i}^{c_{1i}(x,y)} \times \prod_{i} p_{2i}^{c_{2i}(x,y)} \times \cdots \times \prod_{i} p_{mi}^{c_{mi}(x,y)}$$

where $\Theta = (\theta_1, \ldots, \theta_m)$ is a partition of all the parameters, and for any $j$, $\sum_{i} p_{ji} = 1$.
HMM is a PM
$$p(x, y \mid \theta) = \prod_{i,j} p(s_j \mid s_i)^{c(s_i \rightarrow s_j,\, x, y)} \times \prod_{i,j,k} p(s_i \xrightarrow{w_k} s_j)^{c(s_i \xrightarrow{w_k} s_j,\, x, y)}$$

where $a_{ij} = p(s_j \mid s_i)$ and $b_{ijk} = p(s_i \xrightarrow{w_k} s_j)$.
PCFG
• PCFG: each sample point is a pair (x, y), where
– x is a sentence
– y is a possible parse tree for that sentence

$$P(x, y \mid \theta) = \prod_{i=1}^{n} P(A_i \rightarrow \alpha_i \mid A_i)$$

For example:

$$P(x, y) = P(S \rightarrow NP\ VP \mid S) \cdot P(NP \rightarrow Jim \mid NP) \cdot P(VP \rightarrow sleeps \mid VP)$$
PCFG is a PM
$$p(x, y \mid \theta) = \prod_{A \rightarrow \alpha} p(A \rightarrow \alpha)^{c(A \rightarrow \alpha,\, x, y)}$$

where for every nonterminal $A$: $\sum_{\alpha} p(A \rightarrow \alpha) = 1$.
Q-function for PM
$$\begin{aligned}
Q(\theta, \theta^t) &= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log P(x_i, y \mid \theta) \\
&= \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, \log \Big( \prod_{k} \prod_{j} p_{kj}^{C_k(j, x_i, y)} \Big) \\
&= \Big( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_1(j, x_i, y) \log p_{1j} \Big) + \cdots + \Big( \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C_m(j, x_i, y) \log p_{mj} \Big)
\end{aligned}$$

Because the parameters are partitioned into separate multinomials, $Q(\theta, \theta^t)$ decomposes into one term per multinomial; the first term is $Q_1(\theta, \theta^t)$, and each term can be maximized independently.
Maximizing the Q function
Maximize

$$Q_1(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y)\, \log p_j$$

subject to the constraint

$$\sum_{j} p_j = 1$$

Use Lagrange multipliers:

$$\hat{Q}(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y)\, \log p_j + \lambda \Big(1 - \sum_{j} p_j\Big)$$
Optimal solution
$$\hat{Q}(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t) \sum_{j} C(j, x_i, y)\, \log p_j + \lambda \Big(1 - \sum_{j} p_j\Big)$$

$$\frac{\partial \hat{Q}}{\partial p_j} = \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C(j, x_i, y)\,/\,p_j - \lambda = 0$$

$$\Rightarrow\quad p_j = \frac{1}{\lambda} \sum_{i=1}^{n} \sum_{y} P(y \mid x_i, \theta^t)\, C(j, x_i, y)$$

where $1/\lambda$ is a normalization factor and the double sum is the expected count of the j-th outcome.
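In code, this M-step amounts to normalizing the expected counts; a toy sketch with made-up values:

```python
import numpy as np

# expected_counts[j] stands for the double sum above (made-up values).
expected_counts = np.array([4.2, 1.3, 0.5])

# lambda turns out to be the total expected count, so that sum_j p_j = 1.
p = expected_counts / expected_counts.sum()
print(p, p.sum())   # a valid multinomial distribution
```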
PCFG example
• Calculate expected counts
• Update parameters
The EM algorithm for PM models
// for each iteration
//     for each training example x_i
//         for each possible y
//             accumulate expected counts (E-step)
//     for each parameter
//         compute the normalization factor
//     for each parameter
//         re-estimate the parameter (M-step)
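A sketch of this loop for a model with a single multinomial (every helper name is hypothetical, standing in for the model-specific pieces; a full PM implementation would normalize within each multinomial of the partition separately):

```python
def em_for_pm(data, p0, candidates_fn, counts_fn, iters=20):
    """EM for a PM model with one multinomial (a sketch).
    p0:            dict parameter j -> initial probability p_j
    candidates_fn: x -> iterable of possible hidden values y
    counts_fn:     (x, y) -> dict of counts C(j, x, y)
    """
    p = dict(p0)
    for _ in range(iters):                       # for each iteration
        expected = {j: 0.0 for j in p}
        for x in data:                           # for each training example
            ys = list(candidates_fn(x))
            joint = []
            for y in ys:                         # for each possible y
                pxy = 1.0                        # P(x,y|p) = prod_j p_j**C(j,x,y)
                for j, c in counts_fn(x, y).items():
                    pxy *= p[j] ** c
                joint.append(pxy)
            total = sum(joint)
            for y, pxy in zip(ys, joint):
                post = pxy / total               # P(y | x, p)
                for j, c in counts_fn(x, y).items():
                    expected[j] += post * c      # E-step: expected counts
        z = sum(expected.values())               # normalization factor
        p = {j: expected[j] / z for j in p}      # M-step: re-estimate
    return p
```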
EM: An example
Forward-backward algorithm
• An HMM is a PM (product of multinomials) model.
• The forward-backward algorithm is a special case of the EM algorithm for PM models.
• X (observed data): each data point is an observation sequence $O_{1,T}$.
• Y (hidden data): the state sequence $X_{1,T+1}$.
• $\theta$ (parameters): $a_{ij}$, $b_{ijk}$, $\pi_i$.
Expected counts
$$\begin{aligned}
count(s_i \rightarrow s_j) &= \sum_{Y} P(Y \mid X, \theta^t) \cdot count(s_i \rightarrow s_j,\, X, Y) \\
&= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1,T}, \theta^t) \cdot count(s_i \rightarrow s_j,\, O_{1,T}, X_{1,T+1}) \\
&= \sum_{t=1}^{T} P(X_t = i,\, X_{t+1} = j \mid O_{1,T}, \theta^t) \\
&= \sum_{t=1}^{T} \gamma_{ij}(t)
\end{aligned}$$
Expected counts (cont)
$$\begin{aligned}
count(s_i \xrightarrow{w_k} s_j) &= \sum_{Y} P(Y \mid X, \theta^t) \cdot count(s_i \xrightarrow{w_k} s_j,\, X, Y) \\
&= \sum_{X_{1,T+1}} P(X_{1,T+1} \mid O_{1,T}, \theta^t) \cdot count(s_i \xrightarrow{w_k} s_j,\, O_{1,T}, X_{1,T+1}) \\
&= \sum_{t=1}^{T} P(X_t = i,\, X_{t+1} = j \mid O_{1,T}, \theta^t) \cdot \delta(o_t, w_k) \\
&= \sum_{t=1}^{T} \gamma_{ij}(t)\, \delta(o_t, w_k)
\end{aligned}$$
Summary
• The EM algorithm:
– An iterative approach
– $L(\theta)$ is non-decreasing at each iteration
– An optimal solution in the M-step exists for many classes of problems
• The EM algorithm for PM models:
– Simpler formulae
– Three special cases:
• The inside-outside algorithm
• The forward-backward algorithm
• The IBM models for MT
Relations among the algorithms
• The generalized EM
– The EM algorithm
• PM models
– Inside-outside algorithm
– Forward-backward algorithm
– IBM models
• Gaussian mixtures
Strengths of EM
• Numerical stability: in every iteration of the EM algorithm, it increases the likelihood of the observed data.
• EM handles parameter constraints gracefully.
Problems with EM
• Convergence can be very slow on some problems and is intimately related to the amount of missing information.
• It is guaranteed to improve the likelihood of the training corpus, which is different from directly reducing errors.
• It is not guaranteed to reach a global maximum (it can get stuck at local maxima, saddle points, etc.), so the initial estimate is important.