Three fundamental questions with Hidden Markov Models
Once the structural elements of an HMM have been defined, we may ask three
different kinds of questions:
1. Given the model $\lambda = (A, B, \pi)$, what is the probability of
observing $O = (o_1, o_2, \ldots, o_T)$, that is $P(O|\lambda)$?
2. Given the sequence of observations $O = (o_1, o_2, \ldots, o_T)$ and the
HMM defined by $\lambda = (A, B, \pi)$, what is the most likely state sequence
$q = (q_1, q_2, \ldots, q_T)$ that best explains the observations?
3. How can we learn $A$, $B$ and $\pi$ given a sequence of observations such
that $P(O|\lambda)$ is maximum?
Question 1: pattern recognition.
Question 2: find the best hidden sequence of states (not directly observable);
there is no single correct state sequence, but we can find the one that
optimizes some criterion.
Question 3: model training; given some example sequences we can fit a hidden
Markov model. This is the first step in many applications, since often we do
not have the model parameters needed to answer the previous questions.
Solution to problem 1, probability estimation
This problem answers the question about the probability of observing a
sequence of observations $O = (o_1, o_2, \ldots, o_T)$ given an HMM $\lambda$,
that is $P(O|\lambda)$. Since we assume the observations to be independent of
each other (recall the conditional independence property of the HMM Bayesian
network), the probability of $O = (o_1, o_2, \ldots, o_T)$ being generated by
a sequence of states $q$ can be computed as:

$$P(O|q, B) = b_{q_1}(o_1) \, b_{q_2}(o_2) \cdots b_{q_T}(o_T) = \prod_{i=1}^{T} b_{q_i}(o_i)$$
and the probability of the state sequence $q$ is:

$$P(q|A, \pi) = \pi_{q_1} \, a_{q_1 q_2} \, a_{q_2 q_3} \cdots a_{q_{T-1} q_T} = \pi_{q_1} \prod_{i=1}^{T-1} a_{q_i q_{i+1}}$$
The joint probability of $O$ and $q$, that is the probability of $O$ and $q$
occurring at the same time, is the product of the two previous terms:

$$P(O, q|\lambda) = P(O|q, B) \cdot P(q|A, \pi) = \pi_{q_1} b_{q_1}(o_1) \, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)$$
We aim at finding $P(O|\lambda)$, obtained by summing the joint probability
over all possible sequences of states $q$:

$$P(O|\lambda) = \sum_{q} P(O|q, B) \, P(q|A, \pi) = \sum_{q_1 \ldots q_T} \pi_{q_1} b_{q_1}(o_1) \, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) = \sum_{q_1 \ldots q_T} \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)$$
The interpretation of this is the following. At time t = 1 the process starts in state
q1 with probability πq1 and it generates observation o1 with probability bq1(o1).
Time goes on changing from t to t+1 and a transition from q1 to q2 happens with
probability aq1q2 generating symbol o2 with probability bq2(o2). This iterates up to
time T.
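The direct summation described above can be sketched in Python/NumPy for a
discrete HMM; the function name `prob_direct` and the array conventions
(`A[i, j]` transition, `B[i, k]` emission, integer observation symbols) are
our own, not from the text:

```python
import itertools
import numpy as np

def prob_direct(A, B, pi, obs):
    """Direct evaluation of P(O|lambda) by enumerating all N**T state paths.

    A[i, j] : transition probability from state i to state j
    B[i, k] : probability of emitting symbol k from state i
    pi[i]   : initial state probability
    obs     : sequence of observed symbol indices
    """
    N = len(pi)
    T = len(obs)
    total = 0.0
    for q in itertools.product(range(N), repeat=T):  # all state sequences
        p = pi[q[0]] * B[q[0], obs[0]]               # pi_{q1} * b_{q1}(o1)
        for t in range(1, T):
            p *= A[q[t-1], q[t]] * B[q[t], obs[t]]   # a_{q_{t-1} q_t} * b_{q_t}(o_t)
        total += p
    return total
```

As the next paragraph explains, this enumeration is only usable for toy-sized
$N$ and $T$; it serves here as an executable restatement of the formula.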
This computation is infeasible in practice due to its exponential complexity
with respect to the sequence length $T$. Precisely, it requires $(2T-1)N^T$
multiplications and $N^T - 1$ sums. Even for small $N$ and $T$, such as $N=5$
(states) and $T=100$ (observations), we would need
$(2 \cdot 100 - 1) \cdot 5^{100} \approx 1.6 \cdot 10^{72}$ multiplications
and $5^{100} - 1 \approx 8.0 \cdot 10^{69}$ sums. To get some practical use
out of an HMM we need a more efficient algorithm, such as the
forward-backward algorithm, which is linear in $T$.
Forward Procedure
Consider the forward variable:

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = S_i | \lambda)$$

where $t$ is the time and $S_i$ the state. $\alpha_t(i)$ is the joint
probability of observing the sequence $o_1, o_2, \ldots, o_t$ and being in
state $S_i$ at time $t$, given the model $\lambda$.
The values of $\alpha_{t+1}(j)$ can be obtained by summing the forward
variable over all the $N$ states at time $t$, multiplying it by the
corresponding transition probabilities $a_{ij}$ and by the emission
probability $b_j(o_{t+1})$. The algorithm is the following:
1. Initialization
$$t = 1; \quad \alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N$$
2. Induction
$$\alpha_{t+1}(j) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_t(i) \, a_{ij}, \quad 1 \le j \le N$$
3. Time update
$$t = t + 1$$
If $t < T$ GOTO 2, otherwise GOTO 4.
4. Final calculation
$$P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)$$
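As a sketch, the forward procedure above can be written in NumPy; the helper
name `forward` and the 0-based time index (row $t$ corresponds to time $t+1$
in the text) are our own conventions:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Unscaled forward procedure.

    alpha[t, i] is the joint probability of o_1..o_{t+1} and state S_i.
    Returns the full alpha matrix and P(O|lambda) = sum_i alpha[T-1, i].
    """
    N = len(pi)
    T = len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization: pi_i * b_i(o_1)
    for t in range(1, T):                          # induction
        alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
    return alpha, alpha[-1].sum()                  # final: P(O|lambda)
```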
The single step is summarized by the following graph (Figure 3.6):
Summary of one step of the alpha computation
The following graph shows the exponential number of paths in the space of
observations.
Maximum number of transitions between states and selected paths
Each $\alpha_{t+1}(j)$ requires about $2N$ operations: the $N$ values
$\alpha_t(i)$ are multiplied by the corresponding transition probabilities
$a_{ij}$, summed, and weighted by the probability of observing $o_{t+1}$.
Number of operations required for the computation of the alpha variable
At each time step we have $N$ forward variables, each with the same
complexity. The induction is correct since it follows directly from:

$$P(o_1 \ldots o_{t+1}, q_{t+1} = j | \lambda) = \left[ \sum_{i=1}^{N} P(o_1 \ldots o_t, q_t = i | \lambda) \, a_{ij} \right] b_j(o_{t+1})$$
When using this algorithm we need $N(N+1)(T-1) + N$ multiplications and
$N(N-1)(T-1)$ sums. For $N = 5$ states and $T = 100$ observations it requires
$5 \cdot 6 \cdot 99 + 5 = 2975$ multiplications and
$5 \cdot 4 \cdot 99 = 1980$ sums, a significant gain with respect to the
direct approach, which requires about $10^{72}$ multiplications and $10^{69}$
sums.
Backward Procedure
The previous recursion can also be applied backward, defining the backward
variable:

$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T | q_t = S_i, \lambda)$$

stating the probability of observing the sequence from $t+1$ to the end,
given state $S_i$ at time $t$ and the HMM $\lambda$. Note that the forward
variable is defined as a joint probability while the backward variable is a
conditional probability; it can be computed similarly.
Maximum number of transitions between states and selected paths
The backward algorithm is obtained by the following steps:
1. Initialization
$$t = T - 1; \quad \beta_T(i) = 1, \quad 1 \le i \le N$$
2. Induction
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j), \quad 1 \le i \le N$$
3. Time update
$$t = t - 1$$
If $t > 0$ GOTO 2, otherwise end.
Note that the initialization sets $\beta_T(i) = 1$ for all $i$. Each step of
the algorithm can be summarized as follows:
Summary of one step in beta variable computation
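The backward recursion described above can be sketched similarly; the helper
name `backward` and 0-based indexing are our own conventions:

```python
import numpy as np

def backward(A, B, obs):
    """Backward procedure: beta[t, i] = P(o_{t+2}..o_T | q = S_i, lambda).

    beta[T-1, i] = 1 for all i; the induction runs from t = T-2 down to 0.
    """
    N = A.shape[0]
    T = len(obs)
    beta = np.ones((T, N))                        # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                # induction, backwards in time
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    return beta
```

As a consistency check, $P(O|\lambda)$ can be recovered from the backward
variables as $\sum_i \pi_i \, b_i(o_1) \, \beta_1(i)$.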
The whole computation of the forward and backward variables for $N$ states
has complexity $O(N) \cdot O(N) = O(N^2)$ per time step.
Forward and backward computation has complexity proportional to the squared number of states
Since $\alpha_t(i) \beta_t(i)$ represents the probability of observing the
sequence $O$ and being in state $S_i$ at time $t$ given the HMM $\lambda$, we
can write the sequence probability in terms of forward and backward
variables:

$$P(O|\lambda) = \sum_{i=1}^{N} P(O, q_t = S_i | \lambda) \quad \text{for all } t$$
$$= \sum_{i=1}^{N} \alpha_t(i) \beta_t(i) \quad \text{for all } t$$
$$= \sum_{i=1}^{N} \alpha_T(i)$$
Scaling of Forward and Backward variables
As usually happens in statistics, we must deal with numerical issues. The
computation of $\alpha_t(i)$ and $\beta_t(i)$ multiplies probabilities, all
smaller than 1, so they tend to zero exponentially with the number of symbols
in the sequence. Using common floating point, for $t$ high enough (e.g.
$t > 100$) we can easily underflow. The basic procedure multiplies
$\alpha_t(i)$, at each $t$ and in all states $1 \le i \le N$, by a
coefficient $c_t$ that depends only on time and not on the state; this factor
will also be used to scale $\beta_t(i)$ in the following.
Let:
1. $\hat{\alpha}_t(i)$ be the scaled variable at each iteration
2. $\hat{\hat{\alpha}}_t(i)$ the local version before scaling
3. $c_t$ the scaling coefficient
The algorithm becomes the following:
1. Initialization
$$t = 2; \quad \alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N$$
$$\hat{\hat{\alpha}}_1(i) = \alpha_1(i), \quad 1 \le i \le N$$
$$c_1 = \frac{1}{\sum_{i=1}^{N} \alpha_1(i)}$$
$$\hat{\alpha}_1(i) = c_1 \alpha_1(i)$$
2. Induction
$$\hat{\hat{\alpha}}_t(i) = b_i(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{ji}, \quad 1 \le i \le N$$
$$c_t = \frac{1}{\sum_{i=1}^{N} \hat{\hat{\alpha}}_t(i)}$$
$$\hat{\alpha}_t(i) = c_t \hat{\hat{\alpha}}_t(i), \quad 1 \le i \le N$$
3. Time update
$$t = t + 1$$
If $t \le T$ GOTO 2, otherwise GOTO 4.
4. Final calculation
$$\log P(O|\lambda) = -\sum_{t=1}^{T} \log c_t$$
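The scaled forward procedure above can be sketched as follows; the name
`forward_scaled` and the convention of returning the scaling coefficients
alongside the log-likelihood are our own:

```python
import numpy as np

def forward_scaled(A, B, pi, obs):
    """Scaled forward procedure.

    At each step the local alpha is normalized to sum to 1 by c[t];
    log P(O|lambda) is then recovered as -sum_t log c[t].
    """
    N = len(pi)
    T = len(obs)
    alpha_hat = np.zeros((T, N))
    c = np.zeros(T)
    a = pi * B[:, obs[0]]                  # alpha_1
    c[0] = 1.0 / a.sum()                   # scaling coefficient c_1
    alpha_hat[0] = c[0] * a
    for t in range(1, T):
        a = B[:, obs[t]] * (alpha_hat[t-1] @ A)   # local value before scaling
        c[t] = 1.0 / a.sum()
        alpha_hat[t] = c[t] * a
    log_prob = -np.log(c).sum()            # log P(O|lambda)
    return alpha_hat, c, log_prob
```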
Step 2 can be rewritten as:

$$\hat{\alpha}_t(i) = c_t \hat{\hat{\alpha}}_t(i) = \frac{b_i(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{ji}}{\sum_{k=1}^{N} b_k(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{jk}}, \quad 1 \le i \le N$$
By induction, the scaled variable can be obtained from the unscaled one as:

$$\hat{\alpha}_{t-1}(j) = \left( \prod_{\tau=1}^{t-1} c_\tau \right) \alpha_{t-1}(j), \quad 1 \le j \le N$$

The induction step can be computed as before, but shifted by one time unit:

$$\alpha_t(i) = b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j) \, a_{ji}, \quad 1 \le i \le N$$
We can now write:

$$\hat{\alpha}_t(i) = \frac{b_i(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{ji}}{\sum_{k=1}^{N} b_k(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{jk}}$$
$$= \frac{b_i(o_t) \sum_{j=1}^{N} \left( \prod_{\tau=1}^{t-1} c_\tau \right) \alpha_{t-1}(j) \, a_{ji}}{\sum_{k=1}^{N} b_k(o_t) \sum_{j=1}^{N} \left( \prod_{\tau=1}^{t-1} c_\tau \right) \alpha_{t-1}(j) \, a_{jk}}$$
$$= \frac{\left( \prod_{\tau=1}^{t-1} c_\tau \right) b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j) \, a_{ji}}{\left( \prod_{\tau=1}^{t-1} c_\tau \right) \sum_{k=1}^{N} b_k(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j) \, a_{jk}}$$
$$= \frac{\alpha_t(i)}{\sum_{k=1}^{N} \alpha_t(k)}, \quad 1 \le i \le N$$
So each term $\alpha_t(i)$ is effectively scaled by the sum of $\alpha_t(k)$
over all states. At the end of the algorithm we cannot use
$\hat{\alpha}_T(i)$ directly in the computation of $P(O|\lambda)$, because
the variable is already scaled. However, we can apply the following property:
$$\prod_{\tau=1}^{T} c_\tau \sum_{i=1}^{N} \alpha_T(i) = 1$$
$$\prod_{\tau=1}^{T} c_\tau \cdot P(O|\lambda) = 1$$
$$P(O|\lambda) = \frac{1}{\prod_{\tau=1}^{T} c_\tau}$$
Using this estimate, however, $P(O|\lambda)$ could still be so small as to
underflow machine precision; for this reason we suggest using:

$$\log P(O|\lambda) = \log \frac{1}{\prod_{\tau=1}^{T} c_\tau} = -\sum_{t=1}^{T} \log c_t$$
The scaled backward algorithm can be defined more easily, since it uses the
same factors as the forward one. The notation is similar:
1. $\hat{\beta}_t(i)$ the scaled variable at each iteration
2. $\hat{\hat{\beta}}_t(i)$ the local version before scaling
3. $c_t$ the scaling coefficient
The algorithm works as follows:
1. Initialization
$$t = T - 1; \quad \beta_T(i) = 1, \quad 1 \le i \le N$$
$$\hat{\beta}_T(i) = c_T \beta_T(i), \quad 1 \le i \le N$$
2. Induction
$$\hat{\hat{\beta}}_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j), \quad 1 \le i \le N$$
$$\hat{\beta}_t(i) = c_t \hat{\hat{\beta}}_t(i), \quad 1 \le i \le N$$
3. Time update
$$t = t - 1$$
If $t > 0$ GOTO 2, otherwise end.
Solution to problem 2, optimal state sequence
This time the problem is finding the most likely state sequence that
generated a given sequence of observations. While problem 1 has an exact
solution, in this case we can only apply an approximate algorithm. This is
due to the difficulty of defining an optimality criterion.
One common approach is to define the most likely state $q_t$ at each time
step $t$, introducing the variable:
$$\gamma_t(i) = P(q_t = S_i | O, \lambda)$$

That is the probability of being in state $S_i$ at time $t$ given the
sequence of observations $O$ and the HMM $\lambda$. An alternative
formulation is:

$$\gamma_t(i) = P(q_t = S_i | O, \lambda) = \frac{P(O, q_t = S_i | \lambda)}{P(O|\lambda)} = \frac{P(O, q_t = S_i | \lambda)}{\sum_{i=1}^{N} P(O, q_t = S_i | \lambda)}$$
Since $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = S_i | \lambda)$ and
$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T | q_t = S_i, \lambda)$, the
term $P(O, q_t = S_i | \lambda)$ can be written as the joint probability:

$$P(O, q_t = S_i | \lambda) = P(o_1, \ldots, o_t, q_t = S_i | \lambda) \cdot P(o_{t+1}, \ldots, o_T | q_t = S_i, \lambda) = \alpha_t(i) \beta_t(i)$$

Now we can rewrite:

$$\gamma_t(i) = \frac{\alpha_t(i) \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)}$$
The most likely state at time $t$ is:

$$q_t^* = \arg\max_{1 \le i \le N} [\gamma_t(i)], \quad 1 \le t \le T$$
Even if $q^*$ maximizes the expected number of correct states, there can be
issues with the whole sequence, since we are not taking transition
probabilities into account. For instance, if some transition has null
probability, $a_{ij} = 0$, the path through the optimal states may not be
valid. To generate a valid path we can apply the Viterbi algorithm, based on
dynamic programming. $\gamma_t(i)$ is not used in the Viterbi algorithm, but
it will be used in the solution of problem 3.
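The per-time-step decoding described above can be sketched as follows; the
name `posterior_states` is our own, and the function expects forward and
backward matrices already computed:

```python
import numpy as np

def posterior_states(alpha, beta):
    """Most likely state at each time step: q*_t = argmax_i gamma_t(i).

    gamma[t, i] = alpha[t, i] * beta[t, i] / sum_i alpha[t, i] * beta[t, i].
    Note: the resulting sequence may traverse zero-probability transitions,
    which is exactly the limitation that motivates the Viterbi algorithm.
    """
    g = alpha * beta
    gamma = g / g.sum(axis=1, keepdims=True)   # normalize each time step
    return gamma, gamma.argmax(axis=1)
```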
Viterbi Algorithm
To find the single best sequence of states $q = (q_1, q_2, \ldots, q_T)$ that
is optimal with respect to the observation sequence
$O = (o_1, o_2, \ldots, o_T)$ and the HMM $\lambda$, in order to maximize
$P(q|O, \lambda)$, we can apply a dynamic programming approach named the
Viterbi algorithm [Forney 1973] [Viterbi 1967].
Since:

$$P(q|O, \lambda) = \frac{P(q, O|\lambda)}{P(O|\lambda)}$$

maximizing $P(q|O, \lambda)$ is equivalent to maximizing $P(q, O|\lambda)$ in
Viterbi.
The basic idea is similar to the forward procedure, but here the computation
considers only two consecutive time steps, $t$ and $t+1$, starting at $t = 1$
and finishing at $t = T$. The main difference is in the computation between
the two time steps: Viterbi looks for and memorizes the highest value,
corresponding to the best-so-far sequence of states, while the forward
procedure computes the sum of all results. At the end of the calculation, the
best values, which the algorithm keeps track of, can be used to recover the
state sequence through a backtracking process.
Let us consider the following:

$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 \ldots q_{t-1}, q_t = S_i, o_1 \ldots o_t | \lambda)$$

That is the probability of observing $o_1 \ldots o_t$ on the best path ending
in state $S_i$ at time $t$, given the HMM $\lambda$.
Using induction we can compute the probability at the next step:

$$\delta_{t+1}(j) = b_j(o_{t+1}) \max_{1 \le i \le N} [\delta_t(i) \, a_{ij}]$$

To find the sequence of states we need to keep track of the argument that
maximizes this expression, for all $t$ and $j$. This can easily be done using
an array $\psi_t(j)$.
Here is the complete algorithm:
1. Initialization
$$t = 2; \quad \delta_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N$$
$$\psi_1(i) = 0, \quad 1 \le i \le N$$
2. Induction
$$\delta_t(j) = b_j(o_t) \max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}], \quad 1 \le j \le N$$
$$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}], \quad 1 \le j \le N$$
3. Update
$$t = t + 1$$
If $t \le T$ GOTO 2, otherwise GOTO 4.
4. Final calculation
$$P^* = \max_{1 \le i \le N} [\delta_T(i)]$$
$$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]$$
5. State sequence reconstruction
a. Initialization
$$t = T - 1$$
b. Backtracking
$$q_t^* = \psi_{t+1}(q_{t+1}^*)$$
c. Update
$$t = t - 1$$
If $t \ge 1$ GOTO b, otherwise end.
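The complete algorithm above, including backtracking through $\psi$, can be
sketched as follows (the helper name `viterbi` and 0-based indexing are our
own conventions):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm: best state sequence q* maximizing P(q, O | lambda).

    delta[t, j] : best-path probability ending in state j at time t
    psi[t, j]   : argmax predecessor state, used for backtracking
    """
    N = len(pi)
    T = len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):                           # induction
        scores = delta[t-1][:, None] * A            # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = B[:, obs[t]] * scores.max(axis=0)
    best_prob = delta[-1].max()                     # P*
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()                      # q*_T
    for t in range(T - 2, -1, -1):                  # backtracking
        q[t] = psi[t+1][q[t+1]]
    return q, best_prob
```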
As previously stated, we might encounter underflow problems, so we need a
slightly different formulation of this algorithm.
An alternative Viterbi algorithm
To overcome the underflow issues due to probability multiplication we can use
the logarithm of the model parameters, transforming all these multiplications
into sums. We might encounter other issues when some parameters are equal to
zero, as usually happens with the matrices $A$ and $\pi$, but we can solve
them by adding a small number to those parameters.
This new Viterbi algorithm is:
1. Preprocessing
$$\tilde{\pi}_i = \log(\pi_i), \quad 1 \le i \le N$$
$$\tilde{a}_{ij} = \log(a_{ij}), \quad 1 \le i, j \le N$$
2. Initialization
$$t = 2; \quad \tilde{b}_i(o_1) = \log(b_i(o_1)), \quad 1 \le i \le N$$
$$\tilde{\delta}_1(i) = \log(\delta_1(i)) = \log(\pi_i b_i(o_1)) = \tilde{\pi}_i + \tilde{b}_i(o_1), \quad 1 \le i \le N$$
$$\psi_1(i) = 0, \quad 1 \le i \le N$$
3. Induction
$$\tilde{b}_j(o_t) = \log(b_j(o_t)), \quad 1 \le j \le N$$
$$\tilde{\delta}_t(j) = \log(\delta_t(j)) = \log\left( b_j(o_t) \max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}] \right) = \tilde{b}_j(o_t) + \max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}], \quad 1 \le j \le N$$
$$\psi_t(j) = \arg\max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}], \quad 1 \le j \le N$$
4. Update
$$t = t + 1$$
If $t \le T$ GOTO 3, otherwise GOTO 5.
5. Final computation
$$\tilde{P}^* = \max_{1 \le i \le N} [\tilde{\delta}_T(i)]$$
$$q_T^* = \arg\max_{1 \le i \le N} [\tilde{\delta}_T(i)]$$
6. State sequence reconstruction
a. Initialization
$$t = T - 1$$
b. Backtracking
$$q_t^* = \psi_{t+1}(q_{t+1}^*)$$
c. Update
$$t = t - 1$$
If $t \ge 1$ GOTO b, otherwise end.
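The log-domain variant can be sketched as follows; the name `viterbi_log` and
the `eps` flooring of zero parameters (mirroring the small-constant trick
mentioned above) are our own assumptions:

```python
import numpy as np

def viterbi_log(A, B, pi, obs, eps=1e-300):
    """Viterbi in the log domain: products become sums, avoiding underflow.

    Zero parameters are floored at eps before taking logarithms, so that
    log(0) never occurs; eps is small enough not to affect valid paths.
    """
    logA = np.log(np.maximum(A, eps))      # preprocessing: a~_ij = log a_ij
    logB = np.log(np.maximum(B, eps))
    logpi = np.log(np.maximum(pi, eps))    # pi~_i = log pi_i
    N, T = len(pi), len(obs)
    d = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    d[0] = logpi + logB[:, obs[0]]         # initialization
    for t in range(1, T):                  # induction: max of sums
        scores = d[t-1][:, None] + logA
        psi[t] = scores.argmax(axis=0)
        d[t] = logB[:, obs[t]] + scores.max(axis=0)
    q = np.zeros(T, dtype=int)
    q[-1] = d[-1].argmax()
    for t in range(T - 2, -1, -1):         # backtracking
        q[t] = psi[t+1][q[t+1]]
    return q, d[-1].max()                  # best path and its log probability
```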
For instance, let us consider a model with $N = 3$ states and $T = 8$
observations. During initialization ($t = 1$) we find the values of
$\delta_1(1)$, $\delta_1(2)$ and $\delta_1(3)$; suppose $\delta_1(2)$ is the
maximum. At the next iteration, $t = 2$, we find $\delta_2(1)$, $\delta_2(2)$
and $\delta_2(3)$; this time $\delta_2(1)$ is the highest. And so forth,
obtaining $\delta_3(3)$, $\delta_4(2)$, $\delta_5(2)$, $\delta_6(1)$,
$\delta_7(3)$ and $\delta_8(3)$. The optimal sequence is summarized in the
following graph (Figure 3.12):
Figure 3.12: Sequence of states explored by the Viterbi algorithm
Solution to problem 3, parameter estimation
Training, when referring to HMMs, means estimating the model parameters
$\lambda = (A, B, \pi)$. Given an observation sequence $O$ we look for the
model $\lambda^*$ that maximizes $P(O|\lambda)$ (maximum likelihood
principle); this can be written as:

$$\lambda^* = \arg\max_{\lambda} [P(O|\lambda)]$$

This is the most challenging problem, since there is no analytical solution
for the parameters that maximize the probability of observing a given
sequence; for this reason a local maximum of the likelihood is sought. The
most used strategies to obtain such a solution are the Baum-Welch algorithm
[Baum 1972] [Baum et al. 1970] [Dempster et al. 1977],
Expectation-Maximization (EM), and gradient descent. All of these use an
iterative approach to improve the likelihood $P(O|\lambda)$; among them the
Baum-Welch algorithm has some advantages:
- it is numerically stable
- it converges to a local optimum
- it has linear convergence
For these reasons Baum-Welch is the most widely used. This method belongs to
the class of algorithms used in statistics to deal with missing values; the
EM algorithm is the general form [Bilmes 1998] [McLachlan and Krishnan 1997]
[Redner and Walker 1984] and gives a procedure for local optimization.
In general, with maximum likelihood estimation we look for the parameter
$\theta$ describing a distribution that is supposed to generate the data $X$,
so that the likelihood $\xi(\theta|X)$ is maximum. When some data are
missing, $\theta$ defines the distribution both for $X$ and for the missing
data $Y$. The idea behind the EM algorithm is to compute the complete-data
likelihood $\xi(\theta|X, Y)$ and find a maximum for it with the following
iterative algorithm:
1. EXPECTATION (E-step): use the parameter estimated at the last iteration,
$\theta_i$, and the constant $X$. Compute the expected value of the
complete-data log-likelihood as a function of $\theta$ with respect to the
random variable $Y$, defined by $\theta_i$ and $X$.
2. MAXIMIZATION (M-step): select the value of $\theta$ that maximizes this
expectation, and then set $\theta_{i+1} = \theta$.
The EM algorithm does not decrease the likelihood and converges to a local
maximum, as does the Baum-Welch algorithm. To describe the estimation
procedure we need to introduce the following variable:

$$\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, \lambda) = \frac{P(q_t = S_i, q_{t+1} = S_j, O | \lambda)}{P(O|\lambda)}$$
+
It represents the probability of being in state Si at time t and in state Sj at time
t+1, given the HMM λ and the sequence of observations O, as summarized in the
following graph (Figure 3.13).
Figure 3.13: Computation procedure for the probability of moving from one
state to the following one
The following schema reports the posterior probability $\gamma_t(i)$
(Figure 3.14):
Figure 3.14: Derivation of the posterior probability given a state $i$
For these reasons it holds that:

$$\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{P(O|\lambda)} = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}$$
As previously mentioned in problem 2, $\gamma_t(i)$ is the probability of
being in state $S_i$ at time $t$, given the whole sequence of observations
$O$ and the HMM $\lambda$. At this point, we can write:

$$\gamma_t(i) = P(q_t = S_i | O, \lambda) = \sum_{j=1}^{N} P(q_t = S_i, q_{t+1} = S_j | O, \lambda) = \sum_{j=1}^{N} \xi_t(i, j)$$

If we sum $\gamma_t(i)$ over $t$ we obtain a number that can be interpreted
as the expected number of visits to state $S_i$; excluding $t = T$, it is the
expected number of transitions out of state $S_i$. If we take the same
summation over $\xi_t(i, j)$ we obtain the expected number of transitions
from $S_i$ to $S_j$.
To summarize:

$$\gamma_1(i) = \text{probability of starting in state } S_i$$
$$\sum_{t=1}^{T-1} \gamma_t(i) = \text{expected number of transitions from state } S_i \text{ in the sequence } O$$
$$\sum_{t=1}^{T-1} \xi_t(i, j) = \text{expected number of transitions from state } S_i \text{ to state } S_j \text{ in } O$$
From these definitions we can derive the estimation formulas:

$$\bar{\pi}_i = \text{expected frequency of state } S_i \text{ at time } t = 1 = \gamma_1(i) = \frac{\alpha_1(i) \beta_1(i)}{\sum_{i=1}^{N} \alpha_1(i) \beta_1(i)} = \frac{\alpha_1(i) \beta_1(i)}{\sum_{i=1}^{N} \alpha_T(i)}$$

$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} = \frac{\sum_{t=1}^{T-1} P(q_t = S_i, q_{t+1} = S_j | O, \lambda)}{\sum_{t=1}^{T-1} P(q_t = S_i | O, \lambda)} = \frac{\sum_{t=1}^{T-1} \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i) \beta_t(i)}$$

$$\bar{b}_j(k) = \frac{\text{expected number of observations of symbol } k \text{ in state } S_j}{\text{expected number of occurrences of state } S_j} = \frac{\sum_{t=1,\, o_t = k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} = \frac{\sum_{t=1}^{T} \gamma_t(j) \, \delta(o_t, k)}{\sum_{t=1}^{T} \gamma_t(j)} = \frac{\sum_{t=1}^{T} \alpha_t(j) \beta_t(j) \, \delta(o_t, k)}{\sum_{t=1}^{T} \alpha_t(j) \beta_t(j)}$$

where

$$\delta(o_t, k) = \begin{cases} 1 & \text{if } o_t = k \\ 0 & \text{otherwise} \end{cases}$$
It is useful to remember that:

$$P(O|\lambda) = \sum_{i=1}^{N} \alpha_t(i) \beta_t(i) = \sum_{i=1}^{N} \alpha_T(i)$$
$$P(O, q_t = S_i | \lambda) = \alpha_t(i) \beta_t(i)$$
$$P(O, q_t = S_i, q_{t+1} = S_j | \lambda) = \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$$
These parameters need to satisfy the stochastic normalization constraints:

$$\sum_{i=1}^{N} \pi_i = 1$$
$$\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$$
$$\sum_{k=0}^{M-1} b_j(k) = 1, \quad 1 \le j \le N$$
Using the scaling approach we have:

$$\xi_t(i, j) = \frac{\hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)}$$
$$= \frac{\left( \prod_{\tau=1}^{t} c_\tau \right) \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \left( \prod_{\tau=t+1}^{T} c_\tau \right) \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \left( \prod_{\tau=1}^{t} c_\tau \right) \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \left( \prod_{\tau=t+1}^{T} c_\tau \right) \beta_{t+1}(j)}$$
$$= \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}$$

since the common factor $\prod_{\tau=1}^{T} c_\tau$ cancels between numerator
and denominator.
And also that:

$$\gamma_t(i) = \frac{\hat{\alpha}_t(i) \hat{\beta}_t(i)}{\sum_{i=1}^{N} \hat{\alpha}_t(i) \hat{\beta}_t(i)} = \frac{\left( \prod_{\tau=1}^{t} c_\tau \right) \alpha_t(i) \left( \prod_{\tau=t}^{T} c_\tau \right) \beta_t(i)}{\sum_{i=1}^{N} \left( \prod_{\tau=1}^{t} c_\tau \right) \alpha_t(i) \left( \prod_{\tau=t}^{T} c_\tau \right) \beta_t(i)} = \frac{\alpha_t(i) \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)}$$
The estimated variables are therefore the same in both formulations. Using
the scaled forward and backward variables we get:

$$\xi_t(i, j) = \frac{\hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)} = \hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)$$

where the last equality holds because the denominator equals
$\prod_{\tau=1}^{T} c_\tau \cdot P(O|\lambda) = 1$.
Since we use $\xi_t(i, j)$ and $\gamma_t(i)$ to compute $A$ and $\pi$, the
probabilities are equivalent after scaling; avoiding numerical issues, we can
write:

$$\bar{\pi}_i = \gamma_1(i) = \frac{\hat{\alpha}_1(i) \hat{\beta}_1(i)}{\sum_{i=1}^{N} \hat{\alpha}_1(i) \hat{\beta}_1(i)}$$

but also:

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)}{\sum_{t=1}^{T-1} \hat{\alpha}_t(i) \hat{\beta}_t(i)}$$
At the same time we can use these variables to estimate the coefficients of
the emission matrix $B$.
This process iterates for a given number of iterations, or until the
difference between the models before and after an iteration falls below a
threshold. The following schema (Figure 3.15) summarizes the steps to compute
these parameters starting from an initial guess.
Figure 3.15: Schema of the iterative algorithm for HMM parameter estimation:
from the sequences of observations, estimate $\alpha$ and $\beta$, then
$\gamma$ and $\xi$; compute the new model $\theta^*$ from the current model
$\theta$ and repeat until a threshold or the maximum number of iterations is
reached.
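One iteration of the reestimation procedure described above can be sketched
as follows. This is a simplified sketch: the name `baum_welch_step` is our
own, and scaling is omitted for clarity, so it is only usable on short
sequences:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch reestimation step for a single observation sequence.

    Computes gamma and xi from the forward/backward variables and returns
    the updated (A, B, pi) plus P(O|lambda) under the *input* parameters.
    Iterating this step until the likelihood stabilizes gives the training.
    """
    N, M = B.shape
    T = len(obs)
    # forward variables
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
    # backward variables
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    prob = alpha[-1].sum()                        # P(O|lambda)
    gamma = alpha * beta / prob                   # gamma[t, i]
    # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O|lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob
    # reestimation formulas
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):                            # sum gamma over t with o_t = k
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, prob
```

Repeating the step should never decrease the likelihood, which is the EM
guarantee mentioned in the text.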
Multiple sequences of observations
In the Baum-Welch algorithm we need to estimate many parameters; to obtain
robust estimates we need a sufficient amount of training data. We cannot use
a single sequence, otherwise the model would recognize only that one and
would not perform properly on new sequences. For this reason the algorithm
must be adapted to multiple sequences. Let $O$ be a set of $R$ sequences of
observations:

$$O = [O^1, O^2, \ldots, O^R]$$

where $O^r = (o_1^r, o_2^r, \ldots, o_{T_r}^r)$ is the $r$-th observation
sequence, with length $T_r$. The new likelihood to be maximized,
$P(O|\lambda)$, becomes:

$$P(O|\lambda) = \prod_{r=1}^{R} P(O^r|\lambda)$$
The solution to this problem is similar to the standard procedure: the
variables are estimated for each training example and accumulated, so that
the parameters average the optimal solution over the sequences. This
procedure is repeated until an optimal value is found or a maximum number of
iterations is reached.
The formulas become:

$$\bar{\pi}_i = \frac{\sum_{r=1}^{R} \hat{\alpha}_1^r(i) \hat{\beta}_1^r(i)}{\sum_{r=1}^{R} \sum_{i=1}^{N} \hat{\alpha}_1^r(i) \hat{\beta}_1^r(i)}$$

$$\bar{a}_{ij} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r-1} \hat{\alpha}_t^r(i) \, a_{ij} \, b_j(o_{t+1}^r) \, \hat{\beta}_{t+1}^r(j)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r-1} \hat{\alpha}_t^r(i) \hat{\beta}_t^r(i)}$$

$$\bar{b}_j(k) = \frac{\sum_{r=1}^{R} \sum_{t=1,\, o_t^r = k}^{T_r} \gamma_t^r(j)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma_t^r(j)}$$
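The accumulate-then-divide scheme above can be sketched as follows. As
before, this is a simplified sketch under our own naming
(`baum_welch_step_multi`): scaling is omitted for clarity, and $\bar{\pi}$ is
taken as the average of $\gamma_1$ over the sequences:

```python
import numpy as np

def baum_welch_step_multi(A, B, pi, sequences):
    """One reestimation step over multiple observation sequences.

    Numerators and denominators of the update formulas are accumulated
    over all sequences before the final division.
    """
    N, M = B.shape
    num_A = np.zeros((N, N)); den_A = np.zeros(N)
    num_B = np.zeros((N, M)); den_B = np.zeros(N)
    num_pi = np.zeros(N)
    for obs in sequences:
        T = len(obs)
        alpha = np.zeros((T, N)); alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):                     # forward pass
            alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
        beta = np.ones((T, N))
        for t in range(T - 2, -1, -1):            # backward pass
            beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
        prob = alpha[-1].sum()
        gamma = alpha * beta / prob
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob
        # accumulate per-sequence statistics
        num_pi += gamma[0]
        num_A += xi.sum(axis=0); den_A += gamma[:-1].sum(axis=0)
        for k in range(M):
            num_B[:, k] += gamma[np.array(obs) == k].sum(axis=0)
        den_B += gamma.sum(axis=0)
    return (num_A / den_A[:, None],
            num_B / den_B[:, None],
            num_pi / len(sequences))
```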
Parameter initialization
Before we can start training, it is important to initialize our guess for
the model properly, so that we can approach a global optimum as closely as
possible. Usually a uniform distribution is used for $\pi$ and $A$, but if we
apply the left-right model, $\pi$ has probability 1 for the first state and
zero for the others.
For instance, we could have:

$$\pi = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}_{4 \times 1} \qquad A = \begin{bmatrix} 0.5 & 0.5 & 0 & 0 \\ 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0.5 & 0.5 \\ 0 & 0 & 0 & 1 \end{bmatrix}_{4 \times 4}$$
The emission matrix requires a good initial estimate to obtain proper
convergence. This is usually obtained by a uniform segmentation of the states
over each example, collecting all the observations assigned to a given state
$S_j$ from the training examples. Then a clustering algorithm is applied to
find initial estimates for each state.
The training set, i.e., the sequences of observed data, could be too small
when we cannot collect enough data. In this case, we might have almost no
occurrences of unlikely states or observations; this results in incorrect
estimates of the model parameters.
If the number of symbol sequences is so small that some observations never
occur, initial zero estimates of the model parameters will never be changed
and will yield zero probability for new sequences. This is due to an improper
estimate of the parameters caused by the scarce data. To reduce this
phenomenon there are a few strategies [Rabiner 1989] [Lee 1988], such as the
use of a codebook, reducing the number of symbols for each state, or reducing
the whole number of states. The easiest way to face data shortage is to
introduce a constant value $\varepsilon$ as a lower bound on each model
parameter at each iteration. This turns into:
$$\pi_i = \begin{cases} \pi_i & \text{if } \pi_i \ge \varepsilon_\pi \\ \varepsilon_\pi & \text{if } \pi_i < \varepsilon_\pi \end{cases} \qquad a_{ij} = \begin{cases} a_{ij} & \text{if } a_{ij} \ge \varepsilon_a \\ \varepsilon_a & \text{if } a_{ij} < \varepsilon_a \end{cases} \qquad b_i(o_t) = \begin{cases} b_i(o_t) & \text{if } b_i(o_t) \ge \varepsilon_b \\ \varepsilon_b & \text{if } b_i(o_t) < \varepsilon_b \end{cases}$$
In practice $\varepsilon_\pi = \varepsilon_a = \varepsilon_b = 0.0001$.
Moreover, as with any probability distribution, we need to satisfy the
normalization property:

$$\sum_{i=1}^{N} \pi_i = 1$$
$$\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$$
$$\sum_{k=0}^{M-1} b_j(k) = 1, \quad 1 \le j \le N$$
Since these probabilities are changed at every iteration, this property might
not hold (especially if we use a smoothing approach as described above). This
simply requires computing, at each iteration:

$$\pi_i \leftarrow \frac{\pi_i}{\sum_{k=1}^{N} \pi_k}, \quad 1 \le i \le N$$
$$a_{ij} \leftarrow \frac{a_{ij}}{\sum_{k=1}^{N} a_{ik}}, \quad 1 \le i, j \le N$$
$$b_j(o_t) \leftarrow \frac{b_j(o_t)}{\sum_{k=0}^{M-1} b_j(k)}, \quad 1 \le j \le N, \; 0 \le o_t \le M-1$$
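The flooring and renormalization steps described above can be sketched
together; the helper name `floor_and_renormalize` and the single shared
`eps` (the text uses $\varepsilon_\pi = \varepsilon_a = \varepsilon_b$) are
our own conventions:

```python
import numpy as np

def floor_and_renormalize(A, B, pi, eps=1e-4):
    """Apply the epsilon lower bound to every parameter, then renormalize
    each distribution so the stochastic constraints hold again."""
    pi = np.maximum(pi, eps)
    pi = pi / pi.sum()                          # sum_i pi_i = 1
    A = np.maximum(A, eps)
    A = A / A.sum(axis=1, keepdims=True)        # each row of A sums to 1
    B = np.maximum(B, eps)
    B = B / B.sum(axis=1, keepdims=True)        # each emission row sums to 1
    return A, B, pi
```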