Three fundamental questions with Hidden Markov Models
Once the structural elements of an HMM have been defined, we may ask three
different kinds of questions:
1. Given the model $\lambda = (A, B, \pi)$, what is the probability of
observing $O = (o_1, o_2, \ldots, o_T)$, that is $P(O|\lambda)$?
2. Given the sequence of observations $O = (o_1, o_2, \ldots, o_T)$ and the
HMM defined by $\lambda = (A, B, \pi)$, what is the most likely state sequence
$q = (q_1, q_2, \ldots, q_T)$ that best explains the observations?
3. How can we learn $A$, $B$ and $\pi$ given a sequence of observations such
that $P(O|\lambda)$ is maximum?
Question 1: pattern recognition.
Question 2: find the best hidden sequence of states (not directly observable);
there is no single correct state sequence, but we can find the one that
optimizes some criterion.
Question 3: model training; given some example sequences we can fit a hidden
Markov model. This is the first step in many applications, since often we do
not have the model parameters needed to answer the previous questions.
Solution to problem 1, probability estimation
This problem answers the question about the probability of observing a
sequence of observations $O = (o_1, o_2, \ldots, o_T)$ given an HMM $\lambda$,
that is $P(O|\lambda)$. Since we assume the observations to be independent of
each other (recall the conditional independence property of the HMM Bayesian
network), the probability of $O = (o_1, o_2, \ldots, o_T)$ being generated by
a sequence of states $q$ can be computed as:

$$P(O|q, B) = b_{q_1}(o_1) \, b_{q_2}(o_2) \cdots b_{q_T}(o_T) = \prod_{i=1}^{T} b_{q_i}(o_i)$$
and the probability of the state sequence $q$ is:

$$P(q|A, \pi) = \pi_{q_1} \, a_{q_1 q_2} \, a_{q_2 q_3} \cdots a_{q_{T-1} q_T} = \pi_{q_1} \prod_{i=1}^{T-1} a_{q_i q_{i+1}}$$
The joint probability of $O$ and $q$, that is the probability of $O$ and $q$
occurring at the same time, is the product of the two previous terms:

$$P(O, q|\lambda) = P(O|q, B) \cdot P(q|A, \pi) = \pi_{q_1} b_{q_1}(o_1) \, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) = \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)$$
We aim at finding $P(O|\lambda)$, obtained by summing the joint probability
over all possible sequences of states $q$:

$$P(O|\lambda) = \sum_{q} P(O|q, B) \, P(q|A, \pi) = \sum_{q_1 \ldots q_T} \pi_{q_1} b_{q_1}(o_1) \, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T) = \sum_{q_1 \ldots q_T} \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t)$$
The interpretation of this is the following. At time t = 1 the process starts in state
q1 with probability πq1 and it generates observation o1 with probability bq1(o1).
Time goes on changing from t to t+1 and a transition from q1 to q2 happens with
probability aq1q2 generating symbol o2 with probability bq2(o2). This iterates up to
time T.
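The direct summation described above can be sketched in Python/NumPy for a
discrete HMM; the function name `prob_direct` and the array conventions
(`A[i, j]` transition, `B[i, k]` emission, integer observation symbols) are
our own, not from the text:

```python
import itertools
import numpy as np

def prob_direct(A, B, pi, obs):
    """Direct evaluation of P(O|lambda) by enumerating all N**T state paths.

    A[i, j] : transition probability from state i to state j
    B[i, k] : probability of emitting symbol k from state i
    pi[i]   : initial state probability
    obs     : sequence of observed symbol indices
    """
    N = len(pi)
    T = len(obs)
    total = 0.0
    for q in itertools.product(range(N), repeat=T):  # all state sequences
        p = pi[q[0]] * B[q[0], obs[0]]               # pi_{q1} * b_{q1}(o1)
        for t in range(1, T):
            p *= A[q[t-1], q[t]] * B[q[t], obs[t]]   # a_{q_{t-1} q_t} * b_{q_t}(o_t)
        total += p
    return total
```

As the next paragraph explains, this enumeration is only usable for toy-sized
$N$ and $T$; it serves here as an executable restatement of the formula.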
This computation is infeasible in practice due to its exponential complexity
with respect to the sequence length $T$. Precisely, it requires $(2T-1)N^T$
multiplications and $N^T - 1$ sums. Even for small $N$ and $T$, such as $N=5$
(states) and $T=100$ (observations), we would need
$(2 \cdot 100 - 1) \cdot 5^{100} \approx 1.6 \cdot 10^{72}$ multiplications
and $5^{100} - 1 \approx 8.0 \cdot 10^{69}$ sums. To get some practical use
out of an HMM we need a more efficient algorithm, such as the
forward-backward algorithm, which is linear in $T$.
Forward Procedure
Consider the forward variable:

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = S_i | \lambda)$$

where $t$ is the time and $S_i$ the state. $\alpha_t(i)$ is the joint
probability of observing the sequence $o_1, o_2, \ldots, o_t$ and being in
state $S_i$ at time $t$, given the model $\lambda$.
The values of $\alpha_{t+1}(j)$ can be obtained by summing the forward
variable over all the $N$ states at time $t$, multiplying it by the
corresponding transition probabilities $a_{ij}$ and by the emission
probability $b_j(o_{t+1})$. The algorithm is the following:
1. Initialization
$$t = 1; \quad \alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N$$
2. Induction
$$\alpha_{t+1}(j) = b_j(o_{t+1}) \sum_{i=1}^{N} \alpha_t(i) \, a_{ij}, \quad 1 \le j \le N$$
3. Time update
$$t = t + 1$$
If $t < T$ GOTO 2, otherwise GOTO 4.
4. Final calculation
$$P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)$$
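As a sketch, the forward procedure above can be written in NumPy; the helper
name `forward` and the 0-based time index (row $t$ corresponds to time $t+1$
in the text) are our own conventions:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Unscaled forward procedure.

    alpha[t, i] is the joint probability of o_1..o_{t+1} and state S_i.
    Returns the full alpha matrix and P(O|lambda) = sum_i alpha[T-1, i].
    """
    N = len(pi)
    T = len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization: pi_i * b_i(o_1)
    for t in range(1, T):                          # induction
        alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
    return alpha, alpha[-1].sum()                  # final: P(O|lambda)
```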
The single step is summarized by the following graph (Figure 3.6):
Summary of one step of the alpha computation
The following graph shows the exponential number of paths in the space of
observations.
Maximum number of transitions between states and selected paths
Each $\alpha_{t+1}(j)$ requires about $2N$ operations: the $N$ values
$\alpha_t(i)$ are multiplied by the corresponding transition probabilities
$a_{ij}$, summed, and weighted by the probability of observing $o_{t+1}$.
Number of operations required for the computation of the alpha variable
At each time step we have $N$ forward variables, each with the same
complexity. The induction is correct since it follows directly from:

$$P(o_1 \ldots o_{t+1}, q_{t+1} = j | \lambda) = \left[ \sum_{i=1}^{N} P(o_1 \ldots o_t, q_t = i | \lambda) \, a_{ij} \right] b_j(o_{t+1})$$
When using this algorithm we need $N(N+1)(T-1) + N$ multiplications and
$N(N-1)(T-1)$ sums. For $N = 5$ states and $T = 100$ observations it requires
$5 \cdot 6 \cdot 99 + 5 = 2975$ multiplications and
$5 \cdot 4 \cdot 99 = 1980$ sums, a significant gain with respect to the
direct approach, which requires about $10^{72}$ multiplications and $10^{69}$
sums.
Backward Procedure
The previous recursion can also be applied backward, defining the backward
variable:

$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T | q_t = S_i, \lambda)$$

stating the probability of observing the sequence from $t+1$ to the end,
given state $S_i$ at time $t$ and the HMM $\lambda$. Note that the forward
variable is defined as a joint probability while the backward variable is a
conditional probability; it can be computed similarly.
Maximum number of transitions between states and selected paths
The backward algorithm is obtained by the following steps:
1. Initialization
$$t = T - 1; \quad \beta_T(i) = 1, \quad 1 \le i \le N$$
2. Induction
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j), \quad 1 \le i \le N$$
3. Time update
$$t = t - 1$$
If $t > 0$ GOTO 2, otherwise end.
Note that the initialization sets $\beta_T(i) = 1$ for all $i$. Each step of
the algorithm can be summarized as follows:
Summary of one step in beta variable computation
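The backward recursion described above can be sketched similarly; the helper
name `backward` and 0-based indexing are our own conventions:

```python
import numpy as np

def backward(A, B, obs):
    """Backward procedure: beta[t, i] = P(o_{t+2}..o_T | q = S_i, lambda).

    beta[T-1, i] = 1 for all i; the induction runs from t = T-2 down to 0.
    """
    N = A.shape[0]
    T = len(obs)
    beta = np.ones((T, N))                        # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                # induction, backwards in time
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    return beta
```

As a consistency check, $P(O|\lambda)$ can be recovered from the backward
variables as $\sum_i \pi_i \, b_i(o_1) \, \beta_1(i)$.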
The whole computation of the forward and backward variables for $N$ states
has complexity $O(N) \cdot O(N) = O(N^2)$ per time step.
Forward and backward computation has complexity proportional to the squared number of states
Since $\alpha_t(i) \beta_t(i)$ represents the probability of observing the
sequence $O$ and being in state $S_i$ at time $t$ given the HMM $\lambda$, we
can write the sequence probability in terms of forward and backward
variables:

$$P(O|\lambda) = \sum_{i=1}^{N} P(O, q_t = S_i | \lambda) \quad \text{for all } t$$
$$= \sum_{i=1}^{N} \alpha_t(i) \beta_t(i) \quad \text{for all } t$$
$$= \sum_{i=1}^{N} \alpha_T(i)$$
Scaling of Forward and Backward variables
As usually happens in statistics, we must deal with numerical issues. The
computation of $\alpha_t(i)$ and $\beta_t(i)$ multiplies probabilities, all
smaller than 1, so they tend to zero exponentially with the number of symbols
in the sequence. Using common floating point, for $t$ high enough (e.g.
$t > 100$) we can easily underflow. The basic procedure multiplies
$\alpha_t(i)$, at each $t$ and in all states $1 \le i \le N$, by a
coefficient $c_t$ that depends only on time and not on the state; this factor
will also be used to scale $\beta_t(i)$ in the following.
Let:
1. $\hat{\alpha}_t(i)$ be the scaled variable at each iteration
2. $\hat{\hat{\alpha}}_t(i)$ the local version before scaling
3. $c_t$ the scaling coefficient
The algorithm becomes the following:
1. Initialization
$$t = 2; \quad \alpha_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N$$
$$\hat{\hat{\alpha}}_1(i) = \alpha_1(i), \quad 1 \le i \le N$$
$$c_1 = \frac{1}{\sum_{i=1}^{N} \alpha_1(i)}$$
$$\hat{\alpha}_1(i) = c_1 \alpha_1(i)$$
2. Induction
$$\hat{\hat{\alpha}}_t(i) = b_i(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{ji}, \quad 1 \le i \le N$$
$$c_t = \frac{1}{\sum_{i=1}^{N} \hat{\hat{\alpha}}_t(i)}$$
$$\hat{\alpha}_t(i) = c_t \hat{\hat{\alpha}}_t(i), \quad 1 \le i \le N$$
3. Time update
$$t = t + 1$$
If $t \le T$ GOTO 2, otherwise GOTO 4.
4. Final calculation
$$\log P(O|\lambda) = -\sum_{t=1}^{T} \log c_t$$
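The scaled forward procedure above can be sketched as follows; the name
`forward_scaled` and the convention of returning the scaling coefficients
alongside the log-likelihood are our own:

```python
import numpy as np

def forward_scaled(A, B, pi, obs):
    """Scaled forward procedure.

    At each step the local alpha is normalized to sum to 1 by c[t];
    log P(O|lambda) is then recovered as -sum_t log c[t].
    """
    N = len(pi)
    T = len(obs)
    alpha_hat = np.zeros((T, N))
    c = np.zeros(T)
    a = pi * B[:, obs[0]]                  # alpha_1
    c[0] = 1.0 / a.sum()                   # scaling coefficient c_1
    alpha_hat[0] = c[0] * a
    for t in range(1, T):
        a = B[:, obs[t]] * (alpha_hat[t-1] @ A)   # local value before scaling
        c[t] = 1.0 / a.sum()
        alpha_hat[t] = c[t] * a
    log_prob = -np.log(c).sum()            # log P(O|lambda)
    return alpha_hat, c, log_prob
```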
Step 2 can be rewritten as:

$$\hat{\alpha}_t(i) = c_t \hat{\hat{\alpha}}_t(i) = \frac{b_i(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{ji}}{\sum_{k=1}^{N} b_k(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{jk}}, \quad 1 \le i \le N$$
By induction, the scaled variable can be obtained from the unscaled one as:

$$\hat{\alpha}_{t-1}(j) = \left( \prod_{\tau=1}^{t-1} c_\tau \right) \alpha_{t-1}(j), \quad 1 \le j \le N$$

The induction step can be computed as before, but shifted by one time unit:

$$\alpha_t(i) = b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j) \, a_{ji}, \quad 1 \le i \le N$$
We can now write:

$$\hat{\alpha}_t(i) = \frac{b_i(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{ji}}{\sum_{k=1}^{N} b_k(o_t) \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j) \, a_{jk}}$$
$$= \frac{b_i(o_t) \sum_{j=1}^{N} \left( \prod_{\tau=1}^{t-1} c_\tau \right) \alpha_{t-1}(j) \, a_{ji}}{\sum_{k=1}^{N} b_k(o_t) \sum_{j=1}^{N} \left( \prod_{\tau=1}^{t-1} c_\tau \right) \alpha_{t-1}(j) \, a_{jk}}$$
$$= \frac{\left( \prod_{\tau=1}^{t-1} c_\tau \right) b_i(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j) \, a_{ji}}{\left( \prod_{\tau=1}^{t-1} c_\tau \right) \sum_{k=1}^{N} b_k(o_t) \sum_{j=1}^{N} \alpha_{t-1}(j) \, a_{jk}}$$
$$= \frac{\alpha_t(i)}{\sum_{k=1}^{N} \alpha_t(k)}, \quad 1 \le i \le N$$
So each term $\alpha_t(i)$ is effectively scaled by the sum of $\alpha_t(k)$
over all states. At the end of the algorithm we cannot use
$\hat{\alpha}_T(i)$ directly in the computation of $P(O|\lambda)$, because
the variable is already scaled. However, we can apply the following property:
$$\prod_{\tau=1}^{T} c_\tau \sum_{i=1}^{N} \alpha_T(i) = 1$$
$$\prod_{\tau=1}^{T} c_\tau \cdot P(O|\lambda) = 1$$
$$P(O|\lambda) = \frac{1}{\prod_{\tau=1}^{T} c_\tau}$$
Using this estimate, however, $P(O|\lambda)$ could still be so small as to
underflow machine precision; for this reason we suggest using:

$$\log P(O|\lambda) = \log \frac{1}{\prod_{\tau=1}^{T} c_\tau} = -\sum_{t=1}^{T} \log c_t$$
The scaled backward algorithm can be defined more easily, since it uses the
same factors as the forward one. The notation is similar:
1. $\hat{\beta}_t(i)$ the scaled variable at each iteration
2. $\hat{\hat{\beta}}_t(i)$ the local version before scaling
3. $c_t$ the scaling coefficient
The algorithm works as follows:
1. Initialization
$$t = T - 1; \quad \beta_T(i) = 1, \quad 1 \le i \le N$$
$$\hat{\beta}_T(i) = c_T \beta_T(i), \quad 1 \le i \le N$$
2. Induction
$$\hat{\hat{\beta}}_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j), \quad 1 \le i \le N$$
$$\hat{\beta}_t(i) = c_t \hat{\hat{\beta}}_t(i), \quad 1 \le i \le N$$
3. Time update
$$t = t - 1$$
If $t > 0$ GOTO 2, otherwise end.
Solution to problem 2, optimal state sequence
This time the problem is finding the most likely state sequence that
generated a given sequence of observations. While problem 1 has an exact
solution, in this case we can only apply an approximate algorithm. This is
due to the difficulty of defining an optimality criterion.
One common approach is to define the most likely state $q_t$ at each time
step $t$, introducing the variable:
$$\gamma_t(i) = P(q_t = S_i | O, \lambda)$$

That is the probability of being in state $S_i$ at time $t$ given the
sequence of observations $O$ and the HMM $\lambda$. An alternative
formulation is:

$$\gamma_t(i) = P(q_t = S_i | O, \lambda) = \frac{P(O, q_t = S_i | \lambda)}{P(O|\lambda)} = \frac{P(O, q_t = S_i | \lambda)}{\sum_{i=1}^{N} P(O, q_t = S_i | \lambda)}$$
Since $\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = S_i | \lambda)$ and
$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T | q_t = S_i, \lambda)$, the
term $P(O, q_t = S_i | \lambda)$ can be written as the joint probability:

$$P(O, q_t = S_i | \lambda) = P(o_1, \ldots, o_t, q_t = S_i | \lambda) \cdot P(o_{t+1}, \ldots, o_T | q_t = S_i, \lambda) = \alpha_t(i) \beta_t(i)$$

Now we can rewrite:

$$\gamma_t(i) = \frac{\alpha_t(i) \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)}$$
The most likely state at time $t$ is:

$$q_t^* = \arg\max_{1 \le i \le N} [\gamma_t(i)], \quad 1 \le t \le T$$
Even if $q^*$ maximizes the expected number of correct states, there can be
issues with the whole sequence, since we are not taking transition
probabilities into account. For instance, if some transition has null
probability, $a_{ij} = 0$, the path through the optimal states may not be
valid. To generate a valid path we can apply the Viterbi algorithm, based on
dynamic programming. $\gamma_t(i)$ is not used in the Viterbi algorithm, but
it will be used in the solution of problem 3.
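The per-time-step decoding described above can be sketched as follows; the
name `posterior_states` is our own, and the function expects forward and
backward matrices already computed:

```python
import numpy as np

def posterior_states(alpha, beta):
    """Most likely state at each time step: q*_t = argmax_i gamma_t(i).

    gamma[t, i] = alpha[t, i] * beta[t, i] / sum_i alpha[t, i] * beta[t, i].
    Note: the resulting sequence may traverse zero-probability transitions,
    which is exactly the limitation that motivates the Viterbi algorithm.
    """
    g = alpha * beta
    gamma = g / g.sum(axis=1, keepdims=True)   # normalize each time step
    return gamma, gamma.argmax(axis=1)
```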
Viterbi Algorithm
To find the single best sequence of states $q = (q_1, q_2, \ldots, q_T)$ that
is optimal with respect to the observation sequence
$O = (o_1, o_2, \ldots, o_T)$ and the HMM $\lambda$, in order to maximize
$P(q|O, \lambda)$, we can apply a dynamic programming approach named the
Viterbi algorithm [Forney 1973] [Viterbi 1967].
Since:

$$P(q|O, \lambda) = \frac{P(q, O|\lambda)}{P(O|\lambda)}$$

maximizing $P(q|O, \lambda)$ is equivalent to maximizing $P(q, O|\lambda)$ in
Viterbi.
The basic idea is similar to the forward procedure, but here the computation
considers only two consecutive time steps, $t$ and $t+1$, starting at $t = 1$
and finishing at $t = T$. The main difference is in the computation between
the two time steps: Viterbi looks for and memorizes the highest value,
corresponding to the best-so-far sequence of states, while the forward
procedure computes the sum of all results. At the end of the calculation, the
best values, which the algorithm keeps track of, can be used to recover the
state sequence through a backtracking process.
Let us consider the following:

$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1 \ldots q_{t-1}, q_t = S_i, o_1 \ldots o_t | \lambda)$$

That is the probability of observing $o_1 \ldots o_t$ on the best path ending
in state $S_i$ at time $t$, given the HMM $\lambda$.
Using induction we can compute the probability at the next step:

$$\delta_{t+1}(j) = b_j(o_{t+1}) \max_{1 \le i \le N} [\delta_t(i) \, a_{ij}]$$

To find the sequence of states we need to keep track of the argument that
maximizes this expression, for all $t$ and $j$. This can easily be done using
an array $\psi_t(j)$.
Here is the complete algorithm:
1. Initialization
$$t = 2; \quad \delta_1(i) = \pi_i b_i(o_1), \quad 1 \le i \le N$$
$$\psi_1(i) = 0, \quad 1 \le i \le N$$
2. Induction
$$\delta_t(j) = b_j(o_t) \max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}], \quad 1 \le j \le N$$
$$\psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}], \quad 1 \le j \le N$$
3. Update
$$t = t + 1$$
If $t \le T$ GOTO 2, otherwise GOTO 4.
4. Final calculation
$$P^* = \max_{1 \le i \le N} [\delta_T(i)]$$
$$q_T^* = \arg\max_{1 \le i \le N} [\delta_T(i)]$$
5. State sequence reconstruction
a. Initialization
$$t = T - 1$$
b. Backtracking
$$q_t^* = \psi_{t+1}(q_{t+1}^*)$$
c. Update
$$t = t - 1$$
If $t \ge 1$ GOTO b, otherwise end.
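The complete algorithm above, including backtracking through $\psi$, can be
sketched as follows (the helper name `viterbi` and 0-based indexing are our
own conventions):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm: best state sequence q* maximizing P(q, O | lambda).

    delta[t, j] : best-path probability ending in state j at time t
    psi[t, j]   : argmax predecessor state, used for backtracking
    """
    N = len(pi)
    T = len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                    # initialization
    for t in range(1, T):                           # induction
        scores = delta[t-1][:, None] * A            # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = B[:, obs[t]] * scores.max(axis=0)
    best_prob = delta[-1].max()                     # P*
    q = np.zeros(T, dtype=int)
    q[-1] = delta[-1].argmax()                      # q*_T
    for t in range(T - 2, -1, -1):                  # backtracking
        q[t] = psi[t+1][q[t+1]]
    return q, best_prob
```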
As previously stated, we might encounter underflow problems, so we need a
slightly different formulation of this algorithm.
An alternative Viterbi algorithm
To overcome the underflow issues due to probability multiplication we can use
the logarithm of the model parameters, transforming all these multiplications
into sums. We might encounter other issues when some parameters are equal to
zero, as usually happens with the matrices $A$ and $\pi$, but we can solve
them by adding a small number to those parameters.
This new Viterbi algorithm is:
1. Preprocessing
$$\tilde{\pi}_i = \log(\pi_i), \quad 1 \le i \le N$$
$$\tilde{a}_{ij} = \log(a_{ij}), \quad 1 \le i, j \le N$$
2. Initialization
$$t = 2; \quad \tilde{b}_i(o_1) = \log(b_i(o_1)), \quad 1 \le i \le N$$
$$\tilde{\delta}_1(i) = \log(\delta_1(i)) = \log(\pi_i b_i(o_1)) = \tilde{\pi}_i + \tilde{b}_i(o_1), \quad 1 \le i \le N$$
$$\psi_1(i) = 0, \quad 1 \le i \le N$$
3. Induction
$$\tilde{b}_j(o_t) = \log(b_j(o_t)), \quad 1 \le j \le N$$
$$\tilde{\delta}_t(j) = \log(\delta_t(j)) = \log\left( b_j(o_t) \max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{ij}] \right) = \tilde{b}_j(o_t) + \max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}], \quad 1 \le j \le N$$
$$\psi_t(j) = \arg\max_{1 \le i \le N} [\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}], \quad 1 \le j \le N$$
4. Update
$$t = t + 1$$
If $t \le T$ GOTO 3, otherwise GOTO 5.
5. Final computation
$$\tilde{P}^* = \max_{1 \le i \le N} [\tilde{\delta}_T(i)]$$
$$q_T^* = \arg\max_{1 \le i \le N} [\tilde{\delta}_T(i)]$$
6. State sequence reconstruction
a. Initialization
$$t = T - 1$$
b. Backtracking
$$q_t^* = \psi_{t+1}(q_{t+1}^*)$$
c. Update
$$t = t - 1$$
If $t \ge 1$ GOTO b, otherwise end.
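The log-domain variant can be sketched as follows; the name `viterbi_log` and
the `eps` flooring of zero parameters (mirroring the small-constant trick
mentioned above) are our own assumptions:

```python
import numpy as np

def viterbi_log(A, B, pi, obs, eps=1e-300):
    """Viterbi in the log domain: products become sums, avoiding underflow.

    Zero parameters are floored at eps before taking logarithms, so that
    log(0) never occurs; eps is small enough not to affect valid paths.
    """
    logA = np.log(np.maximum(A, eps))      # preprocessing: a~_ij = log a_ij
    logB = np.log(np.maximum(B, eps))
    logpi = np.log(np.maximum(pi, eps))    # pi~_i = log pi_i
    N, T = len(pi), len(obs)
    d = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    d[0] = logpi + logB[:, obs[0]]         # initialization
    for t in range(1, T):                  # induction: max of sums
        scores = d[t-1][:, None] + logA
        psi[t] = scores.argmax(axis=0)
        d[t] = logB[:, obs[t]] + scores.max(axis=0)
    q = np.zeros(T, dtype=int)
    q[-1] = d[-1].argmax()
    for t in range(T - 2, -1, -1):         # backtracking
        q[t] = psi[t+1][q[t+1]]
    return q, d[-1].max()                  # best path and its log probability
```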
For instance, let us consider a model with $N = 3$ states and $T = 8$
observations. During initialization ($t = 1$) we find the values of
$\delta_1(1)$, $\delta_1(2)$ and $\delta_1(3)$; suppose $\delta_1(2)$ is the
maximum. At the next iteration, $t = 2$, we find $\delta_2(1)$, $\delta_2(2)$
and $\delta_2(3)$; this time $\delta_2(1)$ is the highest. And so forth,
obtaining $\delta_3(3)$, $\delta_4(2)$, $\delta_5(2)$, $\delta_6(1)$,
$\delta_7(3)$ and $\delta_8(3)$. The optimal sequence is summarized in the
following graph (Figure 3.12):
Figure 3.12: Sequence of states explored by the Viterbi algorithm
Solution to problem 3, parameter estimation
Training, when referring to HMMs, means estimating the model parameters
$\lambda = (A, B, \pi)$. Given an observation sequence $O$ we look for the
model $\lambda^*$ that maximizes $P(O|\lambda)$ (maximum likelihood
principle); this can be written as:

$$\lambda^* = \arg\max_{\lambda} [P(O|\lambda)]$$

This is the most challenging problem, since there is no analytical solution
for the parameters that maximize the probability of observing a given
sequence; for this reason a local maximum of the likelihood is sought. The
most used strategies to obtain such a solution are the Baum-Welch algorithm
[Baum 1972] [Baum et al. 1970] [Dempster et al. 1977],
Expectation-Maximization (EM), and gradient descent. All of these use an
iterative approach to improve the likelihood $P(O|\lambda)$; among them the
Baum-Welch algorithm has some advantages:
- it is numerically stable
- it converges to a local optimum
- it has linear convergence
For these reasons Baum-Welch is the most widely used. This method belongs to
the class of algorithms used in statistics to deal with missing values; the
EM algorithm is the general form [Bilmes 1998] [McLachlan and Krishnan 1997]
[Redner and Walker 1984] and gives a procedure for local optimization.
In general, with maximum likelihood estimation we look for the parameter
$\theta$ describing a distribution that is supposed to generate the data $X$,
so that the likelihood $\xi(\theta|X)$ is maximum. When some data are
missing, $\theta$ defines the distribution both for $X$ and for the missing
data $Y$. The idea behind the EM algorithm is to compute the complete-data
likelihood $\xi(\theta|X, Y)$ and find a maximum for it with the following
iterative algorithm:
1. EXPECTATION (E-step): use the parameter estimated at the last iteration,
$\theta_i$, and the constant $X$. Compute the expected value of the
complete-data log-likelihood as a function of $\theta$ with respect to the
random variable $Y$, defined by $\theta_i$ and $X$.
2. MAXIMIZATION (M-step): select the value of $\theta$ that maximizes this
expectation, and then set $\theta_{i+1} = \theta$.
The EM algorithm does not decrease the likelihood and converges to a local
maximum, as does the Baum-Welch algorithm. To describe the estimation
procedure we need to introduce the following variable:

$$\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, \lambda) = \frac{P(q_t = S_i, q_{t+1} = S_j, O | \lambda)}{P(O|\lambda)}$$
+
It represents the probability of being in state Si at time t and in state Sj at time
t+1, given the HMM λ and the sequence of observations O, as summarized in the
following graph (Figure 3.13).
Figure 3.13: Computation procedure for the probability of moving from one
state to the following one
The following schema reports the posterior probability $\gamma_t(i)$
(Figure 3.14):
Figure 3.14: Derivation of the posterior probability given a state $i$
For these reasons it holds that:

$$\xi_t(i, j) = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{P(O|\lambda)} = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}$$
As previously mentioned in problem 2, $\gamma_t(i)$ is the probability of
being in state $S_i$ at time $t$, given the whole sequence of observations
$O$ and the HMM $\lambda$. At this point, we can write:

$$\gamma_t(i) = P(q_t = S_i | O, \lambda) = \sum_{j=1}^{N} P(q_t = S_i, q_{t+1} = S_j | O, \lambda) = \sum_{j=1}^{N} \xi_t(i, j)$$

If we sum $\gamma_t(i)$ over $t$ we obtain a number that can be interpreted
as the expected number of visits to state $S_i$; excluding $t = T$, it is the
expected number of transitions out of state $S_i$. If we take the same
summation over $\xi_t(i, j)$ we obtain the expected number of transitions
from $S_i$ to $S_j$.
To summarize:

$$\gamma_1(i) = \text{probability of starting in state } S_i$$
$$\sum_{t=1}^{T-1} \gamma_t(i) = \text{expected number of transitions from state } S_i \text{ in the sequence } O$$
$$\sum_{t=1}^{T-1} \xi_t(i, j) = \text{expected number of transitions from state } S_i \text{ to state } S_j \text{ in } O$$
From these definitions we can derive the estimation formulas:

$$\bar{\pi}_i = \text{expected frequency of state } S_i \text{ at time } t = 1 = \gamma_1(i) = \frac{\alpha_1(i) \beta_1(i)}{\sum_{i=1}^{N} \alpha_1(i) \beta_1(i)} = \frac{\alpha_1(i) \beta_1(i)}{\sum_{i=1}^{N} \alpha_T(i)}$$

$$\bar{a}_{ij} = \frac{\text{expected number of transitions from state } S_i \text{ to state } S_j}{\text{expected number of transitions from state } S_i} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} = \frac{\sum_{t=1}^{T-1} P(q_t = S_i, q_{t+1} = S_j | O, \lambda)}{\sum_{t=1}^{T-1} P(q_t = S_i | O, \lambda)} = \frac{\sum_{t=1}^{T-1} \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \alpha_t(i) \beta_t(i)}$$

$$\bar{b}_j(k) = \frac{\text{expected number of observations of symbol } k \text{ in state } S_j}{\text{expected number of occurrences of state } S_j} = \frac{\sum_{t=1,\, o_t = k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} = \frac{\sum_{t=1}^{T} \gamma_t(j) \, \delta(o_t, k)}{\sum_{t=1}^{T} \gamma_t(j)} = \frac{\sum_{t=1}^{T} \alpha_t(j) \beta_t(j) \, \delta(o_t, k)}{\sum_{t=1}^{T} \alpha_t(j) \beta_t(j)}$$

where

$$\delta(o_t, k) = \begin{cases} 1 & \text{if } o_t = k \\ 0 & \text{otherwise} \end{cases}$$
It is useful to remember that:

$$P(O|\lambda) = \sum_{i=1}^{N} \alpha_t(i) \beta_t(i) = \sum_{i=1}^{N} \alpha_T(i)$$
$$P(O, q_t = S_i | \lambda) = \alpha_t(i) \beta_t(i)$$
$$P(O, q_t = S_i, q_{t+1} = S_j | \lambda) = \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)$$
These parameters need to satisfy the stochastic normalization constraints:

$$\sum_{i=1}^{N} \pi_i = 1$$
$$\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$$
$$\sum_{k=0}^{M-1} b_j(k) = 1, \quad 1 \le j \le N$$
Using the scaling approach we have:

$$\xi_t(i, j) = \frac{\hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)}$$
$$= \frac{\left( \prod_{\tau=1}^{t} c_\tau \right) \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \left( \prod_{\tau=t+1}^{T} c_\tau \right) \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \left( \prod_{\tau=1}^{t} c_\tau \right) \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \left( \prod_{\tau=t+1}^{T} c_\tau \right) \beta_{t+1}(j)}$$
$$= \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}$$

since the common factor $\prod_{\tau=1}^{T} c_\tau$ cancels between numerator
and denominator.
And also that:

$$\gamma_t(i) = \frac{\hat{\alpha}_t(i) \hat{\beta}_t(i)}{\sum_{i=1}^{N} \hat{\alpha}_t(i) \hat{\beta}_t(i)} = \frac{\left( \prod_{\tau=1}^{t} c_\tau \right) \alpha_t(i) \left( \prod_{\tau=t}^{T} c_\tau \right) \beta_t(i)}{\sum_{i=1}^{N} \left( \prod_{\tau=1}^{t} c_\tau \right) \alpha_t(i) \left( \prod_{\tau=t}^{T} c_\tau \right) \beta_t(i)} = \frac{\alpha_t(i) \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \beta_t(i)}$$
The estimated variables are therefore the same in both formulations. Using
the scaled forward and backward variables we get:

$$\xi_t(i, j) = \frac{\hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)} = \hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)$$

where the last equality holds because the denominator equals
$\prod_{\tau=1}^{T} c_\tau \cdot P(O|\lambda) = 1$.
Since we use $\xi_t(i, j)$ and $\gamma_t(i)$ to compute $A$ and $\pi$, the
probabilities are equivalent after scaling; avoiding numerical issues, we can
write:

$$\bar{\pi}_i = \gamma_1(i) = \frac{\hat{\alpha}_1(i) \hat{\beta}_1(i)}{\sum_{i=1}^{N} \hat{\alpha}_1(i) \hat{\beta}_1(i)}$$

but also:

$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \hat{\alpha}_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \hat{\beta}_{t+1}(j)}{\sum_{t=1}^{T-1} \hat{\alpha}_t(i) \hat{\beta}_t(i)}$$
At the same time we can use these variables to estimate the coefficients of
the emission matrix $B$.
This process iterates for a given number of iterations, or until the
difference between the models before and after an iteration falls below a
threshold. The following schema (Figure 3.15) summarizes the steps to compute
these parameters starting from an initial guess.
Figure 3.15: Schema of the iterative algorithm for HMM parameter estimation:
from the sequences of observations, estimate $\alpha$ and $\beta$, then
$\gamma$ and $\xi$; compute the new model $\theta^*$ from the current model
$\theta$ and repeat until a threshold or the maximum number of iterations is
reached.
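One iteration of the reestimation procedure described above can be sketched
as follows. This is a simplified sketch: the name `baum_welch_step` is our
own, and scaling is omitted for clarity, so it is only usable on short
sequences:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch reestimation step for a single observation sequence.

    Computes gamma and xi from the forward/backward variables and returns
    the updated (A, B, pi) plus P(O|lambda) under the *input* parameters.
    Iterating this step until the likelihood stabilizes gives the training.
    """
    N, M = B.shape
    T = len(obs)
    # forward variables
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
    # backward variables
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    prob = alpha[-1].sum()                        # P(O|lambda)
    gamma = alpha * beta / prob                   # gamma[t, i]
    # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O|lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob
    # reestimation formulas
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):                            # sum gamma over t with o_t = k
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_A, new_B, new_pi, prob
```

Repeating the step should never decrease the likelihood, which is the EM
guarantee mentioned in the text.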
Multiple sequences of observations
In the Baum-Welch algorithm we need to estimate many parameters; to obtain
robust estimates we need a sufficient amount of training data. We cannot use
a single sequence, otherwise the model would recognize only that one and
would not perform properly on new sequences. For this reason the algorithm
must be adapted to multiple sequences. Let $O$ be a set of $R$ sequences of
observations:

$$O = [O^1, O^2, \ldots, O^R]$$

where $O^r = (o_1^r, o_2^r, \ldots, o_{T_r}^r)$ is the $r$-th observation
sequence, with length $T_r$. The new likelihood to be maximized,
$P(O|\lambda)$, becomes:

$$P(O|\lambda) = \prod_{r=1}^{R} P(O^r|\lambda)$$
The solution to this problem is similar to the standard procedure: the
variables are estimated for each training example and accumulated, so that
the parameters average the optimal solution over the sequences. This
procedure is repeated until an optimal value is found or a maximum number of
iterations is reached.
The formulas become:

$$\bar{\pi}_i = \frac{\sum_{r=1}^{R} \hat{\alpha}_1^r(i) \hat{\beta}_1^r(i)}{\sum_{r=1}^{R} \sum_{i=1}^{N} \hat{\alpha}_1^r(i) \hat{\beta}_1^r(i)}$$

$$\bar{a}_{ij} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r-1} \hat{\alpha}_t^r(i) \, a_{ij} \, b_j(o_{t+1}^r) \, \hat{\beta}_{t+1}^r(j)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r-1} \hat{\alpha}_t^r(i) \hat{\beta}_t^r(i)}$$

$$\bar{b}_j(k) = \frac{\sum_{r=1}^{R} \sum_{t=1,\, o_t^r = k}^{T_r} \gamma_t^r(j)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \gamma_t^r(j)}$$
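The accumulate-then-divide scheme above can be sketched as follows. As
before, this is a simplified sketch under our own naming
(`baum_welch_step_multi`): scaling is omitted for clarity, and $\bar{\pi}$ is
taken as the average of $\gamma_1$ over the sequences:

```python
import numpy as np

def baum_welch_step_multi(A, B, pi, sequences):
    """One reestimation step over multiple observation sequences.

    Numerators and denominators of the update formulas are accumulated
    over all sequences before the final division.
    """
    N, M = B.shape
    num_A = np.zeros((N, N)); den_A = np.zeros(N)
    num_B = np.zeros((N, M)); den_B = np.zeros(N)
    num_pi = np.zeros(N)
    for obs in sequences:
        T = len(obs)
        alpha = np.zeros((T, N)); alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):                     # forward pass
            alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)
        beta = np.ones((T, N))
        for t in range(T - 2, -1, -1):            # backward pass
            beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
        prob = alpha[-1].sum()
        gamma = alpha * beta / prob
        xi = (alpha[:-1, :, None] * A[None, :, :]
              * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob
        # accumulate per-sequence statistics
        num_pi += gamma[0]
        num_A += xi.sum(axis=0); den_A += gamma[:-1].sum(axis=0)
        for k in range(M):
            num_B[:, k] += gamma[np.array(obs) == k].sum(axis=0)
        den_B += gamma.sum(axis=0)
    return (num_A / den_A[:, None],
            num_B / den_B[:, None],
            num_pi / len(sequences))
```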
Parameter initialization
Before we can start training, it is important to initialize our guess for
the model properly, so that we can approach a global optimum as closely as
possible. Usually a uniform distribution is used for $\pi$ and $A$, but if we
apply the left-right model, $\pi$ has probability 1 for the first state and
zero for the others.
For instance, we could have:

$$\pi = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}_{4 \times 1} \qquad A = \begin{bmatrix} 0.5 & 0.5 & 0 & 0 \\ 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0.5 & 0.5 \\ 0 & 0 & 0 & 1 \end{bmatrix}_{4 \times 4}$$
The emission matrix requires a good initial estimate to obtain proper
convergence. This is usually obtained by a uniform segmentation of the states
over each example, collecting all the observations assigned to a given state
$S_j$ from the training examples. Then a clustering algorithm is applied to
find initial estimates for each state.
The training set, i.e., the sequences of observed data, could be too small
when we cannot collect enough data. In this case, we might have almost no
occurrences of unlikely states or observations; this results in incorrect
estimates of the model parameters.
If the number of symbol sequences is so small that some observations never
occur, initial zero estimates of the model parameters will never be changed
and will yield zero probability for new sequences. This is due to an improper
estimate of the parameters caused by the scarce data. To reduce this
phenomenon there are a few strategies [Rabiner 1989] [Lee 1988], such as the
use of a codebook, reducing the number of symbols for each state, or reducing
the whole number of states. The easiest way to face data shortage is to
introduce a constant value $\varepsilon$ as a lower bound on each model
parameter at each iteration. This turns into:
$$\pi_i = \begin{cases} \pi_i & \text{if } \pi_i \ge \varepsilon_\pi \\ \varepsilon_\pi & \text{if } \pi_i < \varepsilon_\pi \end{cases} \qquad a_{ij} = \begin{cases} a_{ij} & \text{if } a_{ij} \ge \varepsilon_a \\ \varepsilon_a & \text{if } a_{ij} < \varepsilon_a \end{cases} \qquad b_i(o_t) = \begin{cases} b_i(o_t) & \text{if } b_i(o_t) \ge \varepsilon_b \\ \varepsilon_b & \text{if } b_i(o_t) < \varepsilon_b \end{cases}$$
In practice $\varepsilon_\pi = \varepsilon_a = \varepsilon_b = 0.0001$.
Moreover, as with any probability distribution, we need to satisfy the
normalization property:

$$\sum_{i=1}^{N} \pi_i = 1$$
$$\sum_{j=1}^{N} a_{ij} = 1, \quad 1 \le i \le N$$
$$\sum_{k=0}^{M-1} b_j(k) = 1, \quad 1 \le j \le N$$
Since these probabilities are changed at every iteration, this property might
not hold (especially if we use a smoothing approach as described above). This
simply requires computing, at each iteration:

$$\pi_i \leftarrow \frac{\pi_i}{\sum_{k=1}^{N} \pi_k}, \quad 1 \le i \le N$$
$$a_{ij} \leftarrow \frac{a_{ij}}{\sum_{k=1}^{N} a_{ik}}, \quad 1 \le i, j \le N$$
$$b_j(o_t) \leftarrow \frac{b_j(o_t)}{\sum_{k=0}^{M-1} b_j(k)}, \quad 1 \le j \le N, \; 0 \le o_t \le M-1$$
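The flooring and renormalization steps described above can be sketched
together; the helper name `floor_and_renormalize` and the single shared
`eps` (the text uses $\varepsilon_\pi = \varepsilon_a = \varepsilon_b$) are
our own conventions:

```python
import numpy as np

def floor_and_renormalize(A, B, pi, eps=1e-4):
    """Apply the epsilon lower bound to every parameter, then renormalize
    each distribution so the stochastic constraints hold again."""
    pi = np.maximum(pi, eps)
    pi = pi / pi.sum()                          # sum_i pi_i = 1
    A = np.maximum(A, eps)
    A = A / A.sum(axis=1, keepdims=True)        # each row of A sums to 1
    B = np.maximum(B, eps)
    B = B / B.sum(axis=1, keepdims=True)        # each emission row sums to 1
    return A, B, pi
```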