Hidden Markov Models
A Hidden Markov Model consists of
1. A sequence of states $\{X_t \mid t \in T\} = \{X_1, X_2, \ldots, X_T\}$, and
2. A sequence of observations $\{Y_t \mid t \in T\} = \{Y_1, Y_2, \ldots, Y_T\}$.
• The sequence of states $\{X_1, X_2, \ldots, X_T\}$ forms a Markov chain moving amongst the M states $\{1, 2, \ldots, M\}$.
• The observation $Y_t$ comes from a distribution that is determined by the current state of the process $X_t$ (or possibly by past observations and past states).
• The states $\{X_1, X_2, \ldots, X_T\}$ are unobserved (hence hidden).
Some basic problems: from the observations $\{Y_1, Y_2, \ldots, Y_T\}$,
1. Determine the sequence of states $\{X_1, X_2, \ldots, X_T\}$.
2. Determine (or estimate) the parameters of the stochastic process that is generating the states and the observations.
Example 1
• A person is rolling two sets of dice (one is balanced, the other is unbalanced). He switches between the two sets of dice using a Markov transition matrix.
• The states are the dice.
• The observations are the numbers rolled each time.
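A minimal simulation may make the setup concrete. The sketch below (Python) generates rolls from such a two-dice HMM; the transition matrix and the loaded die's face probabilities are illustrative assumptions, not values from the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) parameters: state 0 = fair dice, state 1 = loaded dice.
P = np.array([[0.95, 0.05],    # transition matrix: P[i, j] = P(X_{t+1} = j | X_t = i)
              [0.10, 0.90]])
fair   = np.full(6, 1/6)                      # fair die: uniform over faces 1..6
loaded = np.array([.1, .1, .1, .1, .1, .5])   # loaded die favours sixes
emit = [fair, loaded]

def simulate(T):
    """Generate T (state, roll) pairs; only the rolls would be observed."""
    states, rolls = [], []
    x = 0                                     # start with the fair dice
    for _ in range(T):
        rolls.append(rng.choice(6, p=emit[x]) + 1)
        states.append(x)
        x = rng.choice(2, p=P[x])
    return states, rolls

states, rolls = simulate(20)
print("hidden states:", states)
print("observations :", rolls)
```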
Example 2
• The Markov chain has two states.
• The observations (given the states) are independent Normal.
• Both the mean and the variance depend on the state.
HMM AR.xls
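In the spirit of that spreadsheet demonstration, here is a small generator for this two-state Normal-emission model; all parameter values below are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed illustrative parameters for the two-state Gaussian HMM.
P  = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix
mu = np.array([0.0, 3.0])                 # state-dependent means
sd = np.array([1.0, 0.5])                 # state-dependent standard deviations

def generate(T):
    """Return T observations from the two-state Normal-emission HMM."""
    x, ys = 0, []
    for _ in range(T):
        ys.append(rng.normal(mu[x], sd[x]))
        x = rng.choice(2, p=P[x])
    return np.array(ys)

print(generate(10))
```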
Speech Recognition
• When a word is spoken the vocalization process goes through a sequence of states.
• The sound produced is relatively constant when the process remains in the same state.
• Recognizing the sequence of states and the duration of each state allows one to recognize the word being spoken.
• The interval of time when the word is spoken is broken into small (possibly overlapping) subintervals.
• In each subinterval one measures the amplitudes of various frequencies in the sound (using Fourier analysis). The vector of amplitudes $Y_t$ is assumed to have a multivariate normal distribution in each state, with the mean vector and covariance matrix being state dependent.
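As a rough sketch of this feature-extraction step, the fragment below splits a signal into small overlapping subintervals and computes one amplitude vector per subinterval with the FFT; the signal, sampling rate, and frame sizes are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed toy signal standing in for a spoken word (1 s at 8 kHz).
fs = 8000
signal = rng.normal(size=fs)

frame, hop = 256, 128            # small, overlapping subintervals
frames = [signal[s:s + frame] for s in range(0, len(signal) - frame + 1, hop)]

# One amplitude vector Y_t per subinterval, via the FFT.
Y = np.array([np.abs(np.fft.rfft(f)) for f in frames])
print(Y.shape)                   # (number of subintervals, number of frequencies)
```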
Hidden Markov Models for Biological Sequences

Consider the motif: [AT][CG][AC][ACGT]*A[TG][GC]. Some realizations:

A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
Hidden Markov model of the same motif: [AT][CG][AC][ACGT]*A[TG][GC]

[Figure: a profile HMM for the motif. Each match state carries an emission distribution over {A, C, G, T} (e.g. .8/.2 splits between the two letters allowed at a position), the insert state for [ACGT]* emits A, C, G, T with probabilities .2, .4, .2, .2, and the arrows carry transition probabilities (1.0 between consecutive match states, .4/.6 around the insert state).]
Computing the Likelihood

Let $\pi_{ij} = P[X_{t+1} = j \mid X_t = i]$ and $\Pi = (\pi_{ij})$ = the Markov chain transition matrix. Let $\pi^0_i = P[X_1 = i]$ and $\pi^0 = (\pi^0_1, \pi^0_2, \ldots, \pi^0_M)$ = the initial distribution over the states. Then
$$P[X_1 = i_1, X_2 = i_2, \ldots, X_T = i_T] = \pi^0_{i_1}\,\pi_{i_1 i_2}\,\pi_{i_2 i_3}\cdots\pi_{i_{T-1} i_T}.$$
Now assume that
$$P[Y_t = y_t \mid X_1 = i_1, X_2 = i_2, \ldots, X_t = i_t] = P[Y_t = y_t \mid X_t = i_t] = p(y_t \mid \theta_{i_t}) = \theta_{i_t y_t}.$$
Then
$$P[X_1 = i_1, \ldots, X_T = i_T, Y_1 = y_1, \ldots, Y_T = y_T] = P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}]$$
$$= \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
Therefore
$$P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T] = P[\mathbf{Y} = \mathbf{y}]$$
$$= \sum_{i_1, \ldots, i_T} \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T} = L(\pi^0, \Pi, \theta),$$
where $\theta = (\theta_1, \theta_2, \ldots, \theta_M)$.
In the case when $Y_1, Y_2, \ldots, Y_T$ are continuous random variables or continuous random vectors, let $f(y \mid \theta_i)$ denote the conditional density of $Y_t$ given $X_t = i$. Then the joint density of $Y_1, Y_2, \ldots, Y_T$ is given by
$$L(\pi^0, \Pi, \theta) = f(y_1, y_2, \ldots, y_T) = f(\mathbf{y}) = \sum_{i_1, \ldots, i_T} \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T},$$
where $\theta_{i_t y_t} = f(y_t \mid \theta_{i_t})$.
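For a tiny model, the likelihood can be computed directly from this definition by summing over all $M^T$ state paths. The sketch below does exactly that with assumed toy parameters; its exponential cost is what motivates the efficient methods that follow.

```python
import itertools
import numpy as np

# Assumed toy model: M = 2 states, K = 2 observation symbols.
pi0   = np.array([0.6, 0.4])                 # initial distribution pi^0
P     = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition matrix pi_ij
theta = np.array([[0.9, 0.1], [0.2, 0.8]])   # theta[i, y] = P(Y_t = y | X_t = i)
y     = [0, 1, 1, 0]                         # an observed sequence

def likelihood_bruteforce(y, pi0, P, theta):
    """P(Y = y) as an explicit sum over all M^T state paths."""
    M, T = len(pi0), len(y)
    total = 0.0
    for path in itertools.product(range(M), repeat=T):
        p = pi0[path[0]] * theta[path[0], y[0]]
        for t in range(1, T):
            p *= P[path[t-1], path[t]] * theta[path[t], y[t]]
        total += p
    return total

print(likelihood_bruteforce(y, pi0, P, theta))  # cost grows as M^T
```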
Efficient Methods for Computing the Likelihood

The Forward Method

Let $\mathbf{Y}^{(t)} = (Y_1, Y_2, \ldots, Y_t)$ and $\mathbf{y}^{(t)} = (y_1, y_2, \ldots, y_t)$, and consider
$$\alpha_t(i_t) = P[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i_t] = P[Y_1 = y_1, \ldots, Y_t = y_t, X_t = i_t].$$
Then
$$\alpha_1(i_1) = P[Y_1 = y_1, X_1 = i_1] = \pi^0_{i_1}\theta_{i_1 y_1}$$
and
$$\alpha_{t+1}(i_{t+1}) = P[\mathbf{Y}^{(t+1)} = \mathbf{y}^{(t+1)}, X_{t+1} = i_{t+1}]$$
$$= \sum_{i_t} P[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, Y_{t+1} = y_{t+1}, X_t = i_t, X_{t+1} = i_{t+1}]$$
$$= \sum_{i_t} P[Y_{t+1} = y_{t+1}, X_{t+1} = i_{t+1} \mid \mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i_t]\; P[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i_t]$$
$$= \sum_{i_t} \pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}\,\alpha_t(i_t).$$
In summary:
1. $\alpha_1(i_1) = \pi^0_{i_1}\theta_{i_1 y_1}$
2. $\alpha_{t+1}(i_{t+1}) = \sum_{i_t} \pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}\,\alpha_t(i_t)$
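A minimal implementation of this recursion, using the same assumed toy parameters as the brute-force sketch above; the final sum $\sum_{i_T} \alpha_T(i_T)$ reproduces $P[\mathbf{Y} = \mathbf{y}]$ at $O(M^2 T)$ cost.

```python
import numpy as np

def forward(y, pi0, P, theta):
    """alpha[t, i] = P(Y_1..Y_{t+1} = y_1..y_{t+1}, X_{t+1} = i) (0-indexed t)."""
    T, M = len(y), len(pi0)
    alpha = np.zeros((T, M))
    alpha[0] = pi0 * theta[:, y[0]]                # alpha_1(i) = pi0_i * theta_{i y_1}
    for t in range(1, T):
        # alpha_{t+1}(j) = sum_i alpha_t(i) * pi_ij * theta_{j, y_{t+1}}
        alpha[t] = (alpha[t-1] @ P) * theta[:, y[t]]
    return alpha

pi0   = np.array([0.6, 0.4])
P     = np.array([[0.7, 0.3], [0.4, 0.6]])
theta = np.array([[0.9, 0.1], [0.2, 0.8]])
y     = [0, 1, 1, 0]

alpha = forward(y, pi0, P, theta)
print("P(Y = y) =", alpha[-1].sum())   # agrees with the brute-force value
```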
The Backward Procedure

Let $\mathbf{Y}^{*(t)} = (Y_t, Y_{t+1}, \ldots, Y_T)$ and $\mathbf{y}^{*(t)} = (y_t, y_{t+1}, \ldots, y_T)$, and consider
$$\beta^*_t(i_t) = P[\mathbf{Y}^{*(t+1)} = \mathbf{y}^{*(t+1)} \mid X_t = i_t] = P[Y_{t+1} = y_{t+1}, \ldots, Y_T = y_T \mid X_t = i_t].$$
Note that
$$\beta^*_{T-1}(i_{T-1}) = P[\mathbf{Y}^{*(T)} = \mathbf{y}^{*(T)} \mid X_{T-1} = i_{T-1}] = \sum_{i_T} \pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
Now
$$\beta^*_t(i_t) = P[\mathbf{Y}^{*(t+1)} = \mathbf{y}^{*(t+1)} \mid X_t = i_t]$$
$$= \sum_{i_{t+1}} P[Y_{t+1} = y_{t+1}, \mathbf{Y}^{*(t+2)} = \mathbf{y}^{*(t+2)}, X_{t+1} = i_{t+1} \mid X_t = i_t]$$
$$= \sum_{i_{t+1}} P[X_{t+1} = i_{t+1} \mid X_t = i_t]\; P[Y_{t+1} = y_{t+1} \mid X_{t+1} = i_{t+1}]\; P[\mathbf{Y}^{*(t+2)} = \mathbf{y}^{*(t+2)} \mid X_{t+1} = i_{t+1}]$$
$$= \sum_{i_{t+1}} \pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}\,\beta^*_{t+1}(i_{t+1}).$$
Then
$$P[\mathbf{Y} = \mathbf{y}] = \sum_{i_1} P[Y_1 = y_1, \mathbf{Y}^{*(2)} = \mathbf{y}^{*(2)}, X_1 = i_1]$$
$$= \sum_{i_1} P[X_1 = i_1]\; P[Y_1 = y_1 \mid X_1 = i_1]\;\beta^*_1(i_1) = \sum_{i_1} \pi^0_{i_1}\theta_{i_1 y_1}\,\beta^*_1(i_1)$$
$$= P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T].$$
Prediction of states from the observations and the model:

Consider $\alpha_T(i_T) = P[\mathbf{Y} = \mathbf{y}, X_T = i_T]$. Thus
$$P[X_T = i_T \mid \mathbf{Y} = \mathbf{y}] = \frac{P[X_T = i_T, \mathbf{Y} = \mathbf{y}]}{P[\mathbf{Y} = \mathbf{y}]} = \frac{\alpha_T(i_T)}{\sum_{i_T} \alpha_T(i_T)}.$$
Also, since
$$P[X_t = i_t, \mathbf{Y} = \mathbf{y}] = P[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i_t]\; P[\mathbf{Y}^{*(t+1)} = \mathbf{y}^{*(t+1)} \mid X_t = i_t] = \alpha_t(i_t)\,\beta^*_t(i_t),$$
we have
$$P[X_t = i_t \mid \mathbf{Y} = \mathbf{y}] = \frac{P[X_t = i_t, \mathbf{Y} = \mathbf{y}]}{P[\mathbf{Y} = \mathbf{y}]} = \frac{\alpha_t(i_t)\,\beta^*_t(i_t)}{\sum_{i_T} \alpha_T(i_T)}.$$
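A sketch of the backward recursion together with the state posterior $\alpha_t(i_t)\beta^*_t(i_t)/P[\mathbf{Y} = \mathbf{y}]$ just derived; the forward pass is repeated here so the fragment runs on its own, and the parameters are the same assumed toy values.

```python
import numpy as np

def forward(y, pi0, P, theta):
    """alpha[t, i] = P(Y_1..Y_{t+1} = y_1..y_{t+1}, X_{t+1} = i) (0-indexed t)."""
    T, M = len(y), len(pi0)
    alpha = np.zeros((T, M))
    alpha[0] = pi0 * theta[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ P) * theta[:, y[t]]
    return alpha

def backward(y, P, theta):
    """beta[t, i] = beta*_{t+1}(i) = P(Y_{t+2}..Y_T | X_{t+1} = i); beta[T-1] = 1."""
    T, M = len(y), P.shape[0]
    beta = np.ones((T, M))
    for t in range(T - 2, -1, -1):
        # beta*_t(i) = sum_j pi_ij * theta_{j, y_{t+1}} * beta*_{t+1}(j)
        beta[t] = P @ (theta[:, y[t + 1]] * beta[t + 1])
    return beta

pi0   = np.array([0.6, 0.4])
P     = np.array([[0.7, 0.3], [0.4, 0.6]])
theta = np.array([[0.9, 0.1], [0.2, 0.8]])
y     = [0, 1, 1, 0]

alpha, beta = forward(y, pi0, P, theta), backward(y, P, theta)
posterior = alpha * beta / alpha[-1].sum()   # row t is P(X_t = i | Y = y); rows sum to 1
print(posterior)
```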
The Viterbi Algorithm (Viterbi Paths)
Suppose that we know the parameters of the Hidden Markov Model, and suppose in addition that we have observed the sequence of observations $Y_1, Y_2, \ldots, Y_T$.

Now consider determining the sequence of states $X_1, X_2, \ldots, X_T$.

Recall that
$$P[X_1 = i_1, \ldots, X_T = i_T, Y_1 = y_1, \ldots, Y_T = y_T] = P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}]$$
$$= \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
Consider the problem of determining the sequence of states $i_1, i_2, \ldots, i_T$ that maximizes the above probability. This is equivalent to maximizing
$$P[\mathbf{X} = \mathbf{i} \mid \mathbf{Y} = \mathbf{y}] = P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}] / P[\mathbf{Y} = \mathbf{y}].$$

We want to maximize $P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}]$. Equivalently, we want to minimize $U(i_1, i_2, \ldots, i_T)$, where
$$U(i_1, i_2, \ldots, i_T) = -\ln P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}]$$
$$= -\left[\ln(\pi^0_{i_1}\theta_{i_1 y_1}) + \ln(\pi_{i_1 i_2}\theta_{i_2 y_2}) + \cdots + \ln(\pi_{i_{T-1} i_T}\theta_{i_T y_T})\right].$$
• Minimization of $U(i_1, i_2, \ldots, i_T)$ can be achieved by dynamic programming.
• This can be thought of as finding the shortest distance through a grid of points: one point per state at each stage $t$, starting at the unique point in stage 0 and moving from a point in stage $t$ to a point in stage $t+1$ in an optimal way.
• The distances between points in stage $t$ and points in stage $t+1$ are equal to:
$$d_1(i_0, i_1) = -\ln(\pi^0_{i_1}\theta_{i_1 y_1}) \quad\text{if } t = 0, \quad\text{and}$$
$$d_{t+1}(i_t, i_{t+1}) = -\ln(\pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}) \quad\text{if } t \geq 1.$$

Let
$$U_t(i_1, i_2, \ldots, i_t) = d_1(i_0, i_1) + d_2(i_1, i_2) + \cdots + d_t(i_{t-1}, i_t)$$
$$= -\left[\ln(\pi^0_{i_1}\theta_{i_1 y_1}) + \ln(\pi_{i_1 i_2}\theta_{i_2 y_2}) + \cdots + \ln(\pi_{i_{t-1} i_t}\theta_{i_t y_t})\right]$$
and
$$V_t(i_t) = \min_{i_1, \ldots, i_{t-1}} U_t(i_1, i_2, \ldots, i_t).$$
Then
$$V_1(i_1) = -\ln(\pi^0_{i_1}\theta_{i_1 y_1}), \qquad i_1 = 1, 2, \ldots, M,$$
and
$$V_{t+1}(i_{t+1}) = \min_{i_t}\left[V_t(i_t) - \ln(\pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}})\right] = \min_{i_t}\left[V_t(i_t) + d_{t+1}(i_t, i_{t+1})\right],$$
$$i_{t+1} = 1, 2, \ldots, M; \quad t = 1, \ldots, T-2.$$
Summary of calculations for the Viterbi path

1. $V_1(i_1) = -\ln(\pi^0_{i_1}\theta_{i_1 y_1})$, for $i_1 = 1, 2, \ldots, M$.
2. $V_{t+1}(i_{t+1}) = \min_{i_t}\left[V_t(i_t) - \ln(\pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}})\right]$, for $i_{t+1} = 1, 2, \ldots, M$ and $t = 1, \ldots, T-2$.
3. $V = \min_{i_T}\,\min_{i_{T-1}}\left[V_{T-1}(i_{T-1}) - \ln(\pi_{i_{T-1} i_T}\theta_{i_T y_T})\right] = \min_{i_1, \ldots, i_T} U(i_1, i_2, \ldots, i_T)$.

Recording the minimizing state at each step and backtracking from the final minimizer recovers the Viterbi path itself.
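A sketch of this dynamic program, minimizing the accumulated $-\ln$ terms and backtracking to recover the path; parameters are assumed toy values.

```python
import numpy as np

def viterbi(y, pi0, P, theta):
    """Most probable state path, by minimizing U = -log P(X = i, Y = y)."""
    T, M = len(y), len(pi0)
    V = np.zeros((T, M))          # V[t, i] = min cost of a path ending in state i at time t
    back = np.zeros((T, M), int)  # argmin predecessors, for backtracking
    V[0] = -np.log(pi0 * theta[:, y[0]])
    for t in range(1, T):
        # cost[i, j] = V_t(i) + d_{t+1}(i, j) = V_t(i) - ln(pi_ij * theta_{j, y_{t+1}})
        cost = V[t-1][:, None] - np.log(P * theta[:, y[t]][None, :])
        back[t] = cost.argmin(axis=0)
        V[t] = cost.min(axis=0)
    # Backtrack from the final minimizer to recover the Viterbi path.
    path = [int(V[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

pi0   = np.array([0.6, 0.4])
P     = np.array([[0.7, 0.3], [0.4, 0.6]])
theta = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1, 0], pi0, P, theta))
```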
An alternative approach to prediction of states from the observations and the model:

It can be shown that:
$$P[X_t = i_t \mid \mathbf{Y} = \mathbf{y}] = \frac{\alpha_t(i_t)\,\beta^*_t(i_t)}{\sum_{i_t} \alpha_t(i_t)\,\beta^*_t(i_t)} = \frac{\alpha_t(i_t)\,\beta^*_t(i_t)}{\sum_{i_T} \alpha_T(i_T)}.$$

Backward Probabilities

1. $\beta^*_{T-1}(i_{T-1}) = \sum_{i_T} \pi_{i_{T-1} i_T}\theta_{i_T y_T}$
2. $\beta^*_t(i_t) = \sum_{i_{t+1}} \pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}\,\beta^*_{t+1}(i_{t+1})$
HMM generator (normal).xls
Estimation of Parameters of a Hidden Markov Model
If both the sequence of observations $Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T$ and the sequence of states $X_1 = i_1, X_2 = i_2, \ldots, X_T = i_T$ are observed, then the likelihood is given by:
$$L(\pi^0, \Pi, \theta) = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T},$$
and the log-likelihood is given by:
$$l(\pi^0, \Pi, \theta) = \ln L(\pi^0, \Pi, \theta) = \ln \pi^0_{i_1} + \ln \theta_{i_1 y_1} + \ln \pi_{i_1 i_2} + \ln \theta_{i_2 y_2} + \cdots + \ln \pi_{i_{T-1} i_T} + \ln \theta_{i_T y_T}$$
$$= \sum_{i=1}^{M} f_{i0} \ln \pi^0_i + \sum_{i=1}^{M}\sum_{j=1}^{M} f_{ij} \ln \pi_{ij} + \sum_{i=1}^{M} \sum_{t:\,X_t = i} \ln f(y_t \mid \theta_i),$$
where
$f_{i0}$ = the number of times state $i$ occurs as the first state,
$f_{ij}$ = the number of times state $i$ changes to state $j$, and
$f(y_t \mid \theta_i)$ (or $p(y_t \mid \theta_i)$ in the discrete case) is the density of the observations $y_t$ for which $X_t = i$.

In this case the Maximum Likelihood estimates are:
$$\hat{\pi}^0_i = \frac{f_{i0}}{1}, \qquad \hat{\pi}_{ij} = \frac{f_{ij}}{\sum_{j=1}^{M} f_{ij}},$$
and $\hat{\theta}_i$ = the MLE of $\theta_i$ computed from the observations $y_t$ where $X_t = i$.
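With the states observed, these MLEs are simple normalized counts. The sketch below estimates $\pi^0$ and $\Pi$ from several fully observed state sequences (assumed toy data); with $k$ realizations, the initial-state estimate divides by $k$ rather than by 1.

```python
import numpy as np

def mle_complete(state_seqs, M):
    """MLE of pi^0 and Pi from fully observed state sequences (counts / totals)."""
    f0 = np.zeros(M)        # f_i0: times state i occurs as the first state
    f  = np.zeros((M, M))   # f_ij: times state i is followed by state j
    for seq in state_seqs:
        f0[seq[0]] += 1
        for a, b in zip(seq[:-1], seq[1:]):
            f[a, b] += 1
    pi0_hat = f0 / f0.sum()
    Pi_hat  = f / f.sum(axis=1, keepdims=True)
    return pi0_hat, Pi_hat

# Three short observed state sequences over M = 2 states (assumed data).
seqs = [[0, 0, 1, 1, 0], [0, 1, 1, 1, 0], [1, 1, 0, 0, 0]]
pi0_hat, Pi_hat = mle_complete(seqs, M=2)
print(pi0_hat)
print(Pi_hat)
```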
MLE (states unknown)
If only the sequence of observations $Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T$ is observed, then the likelihood is given by:
$$L(\pi^0, \Pi, \theta) = \sum_{i_1, \ldots, i_T} \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
• It is difficult to find the Maximum Likelihood Estimates directly from the likelihood function.
• The techniques that are used are:
1. The Segmental K-means Algorithm
2. The Baum-Welch (E-M) Algorithm
The Segmental K-means Algorithm
In this method the parameters $\lambda = (\pi^0, \Pi, \theta)$ are adjusted to maximize
$$L(\lambda; \mathbf{y}, \mathbf{i}) = L(\pi^0, \Pi, \theta; \mathbf{y}, \mathbf{i}) = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T},$$
where $\mathbf{i} = (i_1, i_2, \ldots, i_T)$ is the Viterbi path.

Consider this with the special case: the observations $\{Y_1, Y_2, \ldots, Y_T\}$ are continuous multivariate Normal with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$ when $X_t = i$, i.e.
$$f(\mathbf{y}_t \mid \theta_i) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\left[-\tfrac{1}{2}(\mathbf{y}_t - \boldsymbol{\mu}_i)' \boldsymbol{\Sigma}_i^{-1} (\mathbf{y}_t - \boldsymbol{\mu}_i)\right].$$

1. Pick arbitrarily M centroids $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_M$. Assign each of the T observations $\mathbf{y}_t$ (kT if multiple realizations are observed) to a state $i_t$ by determining: $\min_i \|\mathbf{y}_t - \mathbf{a}_i\|$.
2. Then
$$\hat{\pi}^0_i = \frac{\text{number of times } i_1 = i}{k}, \qquad \hat{\pi}_{ij} = \frac{\text{number of transitions from } i \text{ to } j}{\text{number of transitions from } i}.$$
3. And
$$\hat{\boldsymbol{\mu}}_i = \frac{1}{N_i}\sum_{t:\,i_t = i} \mathbf{y}_t, \qquad \hat{\boldsymbol{\Sigma}}_i = \frac{1}{N_i}\sum_{t:\,i_t = i} (\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)(\mathbf{y}_t - \hat{\boldsymbol{\mu}}_i)'.$$
4. Calculate the Viterbi path $(i_1, i_2, \ldots, i_T)$ based on the parameters of steps 2 and 3.
5. If there is a change in the sequence $(i_1, i_2, \ldots, i_T)$, repeat steps 2 to 4.
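A sketch of steps 1 to 3 for univariate data and a single realization; the quantile-based centroids, the toy data, and the omission of the Viterbi re-labelling loop (steps 4 and 5) are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def segmental_kmeans_init(Y, M):
    """Steps 1-3 for one realization of univariate data: nearest-centroid state
    assignment, then re-estimation of pi^0, Pi and the per-state Normal
    parameters; the Viterbi re-labelling loop (steps 4-5) is omitted."""
    # Step 1: M (arbitrary) centroids; here, evenly spaced quantiles of the data.
    centroids = np.quantile(Y, np.linspace(0, 1, M))
    labels = np.abs(Y[:, None] - centroids[None, :]).argmin(axis=1)
    # Step 2: pi^0 and Pi estimated from the assigned labels.
    pi0 = np.bincount(labels[:1], minlength=M).astype(float)   # one realization
    f = np.zeros((M, M))
    for a, b in zip(labels[:-1], labels[1:]):
        f[a, b] += 1
    Pi = f / f.sum(axis=1, keepdims=True)
    # Step 3: per-state sample mean and variance.
    mu  = np.array([Y[labels == i].mean() for i in range(M)])
    var = np.array([Y[labels == i].var() for i in range(M)])
    return pi0, Pi, mu, var

Y = np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)])  # assumed toy data
print(segmental_kmeans_init(Y, M=2))
```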
The Baum-Welch (E-M) Algorithm
• The E-M algorithm was designed originally to handle “Missing observations”.
• In this case the missing observations are the states {X1, X2, ... , XT}.
• Assuming a model, the states are estimated by finding their expected values under this model. (The E part of the E-M algorithm).
• With these values the model is estimated by Maximum Likelihood Estimation (The M part of the E-M algorithm).
• The process is repeated until the estimated model converges.
The E-M Algorithm

Let $f(\mathbf{Y}, \mathbf{X}; \theta) = L(\mathbf{Y}, \mathbf{X}; \theta)$ denote the joint distribution of $\mathbf{Y}, \mathbf{X}$. Consider the function:
$$Q(\theta, \theta^{(1)}) = E_{\mathbf{X}}\left[\ln L(\mathbf{Y}, \mathbf{X}; \theta) \,\middle|\, \mathbf{Y}, \theta^{(1)}\right].$$
Starting with an initial estimate $\theta^{(1)}$ of $\theta$, a sequence of estimates $\theta^{(m)}$ is formed by finding $\theta^{(m+1)}$ to maximize $Q(\theta, \theta^{(m)})$ with respect to $\theta$.
Example: Sampling from Mixtures

Let $y_1, y_2, \ldots, y_n$ denote a sample from the density:
$$f(y \mid \tau_1, \tau_2, \ldots, \tau_m, \theta_1, \theta_2, \ldots, \theta_m) = \tau_1 g(y \mid \theta_1) + \tau_2 g(y \mid \theta_2) + \cdots + \tau_m g(y \mid \theta_m),$$
where $\tau_1 + \tau_2 + \cdots + \tau_m = 1$ and $g(y \mid \theta_i)$ is known except for $\theta_i$.

Suppose that $m = 2$ and let $x_1, x_2, \ldots, x_n$ denote independent random variables taking on the value 1 with probability $\tau_1$ and 0 with probability $1 - \tau_1$. Suppose that $y_i$ comes from the density
$$f(y \mid \theta_1, \theta_2) = x_i\, g(y \mid \theta_1) + (1 - x_i)\, g(y \mid \theta_2).$$
We will also assume that $g(y \mid \theta_i)$ is Normal with mean $\mu_i$ and standard deviation $\sigma_i$.

Thus the joint distribution of $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$ is:
$$f(\mathbf{y}, \mathbf{x} \mid \tau_1, \mu_1, \mu_2, \sigma_1, \sigma_2) = \prod_{i=1}^{n}\left[x_i\, \tau_1\, \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}} + (1 - x_i)(1 - \tau_1)\, \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(y_i - \mu_2)^2}{2\sigma_2^2}}\right].$$
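A sketch of the resulting E-M iteration for this two-component Normal mixture: the E-step replaces each $x_i$ by its conditional expectation $w_i = P[x_i = 1 \mid y_i]$ under the current parameters, and the M-step re-maximizes with the $w_i$ as weights. Data and starting values are assumed.

```python
import numpy as np

rng = np.random.default_rng(2)

def norm_pdf(y, mu, sigma):
    return np.exp(-(y - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)

def em_mixture(y, n_iter=50):
    """E-M for a two-component Normal mixture (tau, mu1, mu2, sigma1, sigma2)."""
    tau, mu1, mu2, s1, s2 = 0.5, y.min(), y.max(), y.std(), y.std()  # crude start
    for _ in range(n_iter):
        # E-step: w_i = E[x_i | y_i, current params] = P(component 1 | y_i)
        p1 = tau * norm_pdf(y, mu1, s1)
        p2 = (1 - tau) * norm_pdf(y, mu2, s2)
        w = p1 / (p1 + p2)
        # M-step: weighted maximum likelihood estimates
        tau = w.mean()
        mu1 = (w * y).sum() / w.sum()
        mu2 = ((1 - w) * y).sum() / (1 - w).sum()
        s1 = np.sqrt((w * (y - mu1)**2).sum() / w.sum())
        s2 = np.sqrt(((1 - w) * (y - mu2)**2).sum() / (1 - w).sum())
    return tau, mu1, mu2, s1, s2

y = np.concatenate([rng.normal(0, 1, 200), rng.normal(3, 0.5, 100)])  # assumed data
print(em_mixture(y))
```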
In the case of an HMM, the log-likelihood is given by:
$$l(\pi^0, \Pi, \theta) = \ln L(\pi^0, \Pi, \theta) = \ln \pi^0_{i_1} + \ln \theta_{i_1 y_1} + \ln \pi_{i_1 i_2} + \ln \theta_{i_2 y_2} + \cdots + \ln \pi_{i_{T-1} i_T} + \ln \theta_{i_T y_T}$$
$$= \sum_{i=1}^{M} f_{i0} \ln \pi^0_i + \sum_{i=1}^{M}\sum_{j=1}^{M} f_{ij} \ln \pi_{ij} + \sum_{i=1}^{M} \sum_{t:\,X_t = i} \ln f(y_t \mid \theta_i),$$
where
$f_{i0}$ = the number of times state $i$ occurs as the first state,
$f_{ij}$ = the number of times state $i$ changes to state $j$, and
$f(y_t \mid \theta_i)$ (or $p(y_t \mid \theta_i)$ in the discrete case) is the density of the observations $y_t$ for which $X_t = i$.
Let
$$\gamma_t(i, j) = P[X_t = i, X_{t+1} = j \mid \mathbf{Y} = \mathbf{y}] = \frac{P[X_t = i, X_{t+1} = j, \mathbf{Y} = \mathbf{y}]}{P[\mathbf{Y} = \mathbf{y}]}$$
$$= \frac{P[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i, X_{t+1} = j, Y_{t+1} = y_{t+1}, \mathbf{Y}^{*(t+2)} = \mathbf{y}^{*(t+2)}]}{P[\mathbf{Y} = \mathbf{y}]}$$
$$= \frac{\alpha_t(i)\,\pi_{ij}\,\theta_{j\,y_{t+1}}\,\beta^*_{t+1}(j)}{P[\mathbf{Y} = \mathbf{y}]}.$$
Then
$$\sum_{t=1}^{T-1} \gamma_t(i, j) = \text{expected number of transitions from state } i \text{ to state } j.$$
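A sketch computing $\sum_t \gamma_t(i, j)$ from the forward and backward quantities; the forward and backward passes are repeated so the fragment is self-contained, and the parameters are assumed toy values.

```python
import numpy as np

def forward(y, pi0, P, theta):
    T, M = len(y), len(pi0)
    alpha = np.zeros((T, M))
    alpha[0] = pi0 * theta[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ P) * theta[:, y[t]]
    return alpha

def backward(y, P, theta):
    T, M = len(y), P.shape[0]
    beta = np.ones((T, M))
    for t in range(T - 2, -1, -1):
        beta[t] = P @ (theta[:, y[t+1]] * beta[t+1])
    return beta

def expected_transitions(y, pi0, P, theta):
    """sum_t gamma_t(i, j) = expected number of i -> j transitions given Y = y."""
    alpha, beta = forward(y, pi0, P, theta), backward(y, P, theta)
    py = alpha[-1].sum()                       # P(Y = y)
    M = len(pi0)
    counts = np.zeros((M, M))
    for t in range(len(y) - 1):
        # gamma_t(i, j) = alpha_t(i) pi_ij theta_{j, y_{t+1}} beta*_{t+1}(j) / P(Y = y)
        counts += alpha[t][:, None] * P * (theta[:, y[t+1]] * beta[t+1])[None, :] / py
    return counts

pi0   = np.array([0.6, 0.4])
P     = np.array([[0.7, 0.3], [0.4, 0.6]])
theta = np.array([[0.9, 0.1], [0.2, 0.8]])
print(expected_transitions([0, 1, 1, 0], pi0, P, theta))
```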
The E-M Re-estimation Formulae

Case 1: The observations $\{Y_1, Y_2, \ldots, Y_T\}$ are discrete with K possible values and
$$\theta_{iy} = P[Y_t = y \mid X_t = i].$$

Case 2: The observations $\{Y_1, Y_2, \ldots, Y_T\}$ are continuous multivariate Normal with mean vector $\boldsymbol{\mu}_i$ and covariance matrix $\boldsymbol{\Sigma}_i$ when $X_t = i$, i.e.
$$f(\mathbf{y}_t \mid \theta_i) = \frac{1}{(2\pi)^{p/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\left[-\tfrac{1}{2}(\mathbf{y}_t - \boldsymbol{\mu}_i)' \boldsymbol{\Sigma}_i^{-1} (\mathbf{y}_t - \boldsymbol{\mu}_i)\right].$$
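As one concrete instance for Case 1, the sketch below performs a single Baum-Welch iteration using the standard re-estimation updates (expected counts from the E-step, then normalization in the M-step); these are the standard textbook updates, assumed here rather than taken from the slides, and the parameter values are toy assumptions.

```python
import numpy as np

def baum_welch_step(y, pi0, P, theta):
    """One E-M iteration for the discrete case (Case 1): expected counts from
    the forward/backward quantities, then normalize to get new parameters."""
    T, M, K = len(y), len(pi0), theta.shape[1]
    alpha = np.zeros((T, M)); beta = np.ones((T, M))
    alpha[0] = pi0 * theta[:, y[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ P) * theta[:, y[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = P @ (theta[:, y[t+1]] * beta[t+1])
    py = alpha[-1].sum()
    gamma = alpha * beta / py                      # P(X_t = i | Y = y)
    xi = np.zeros((M, M))                          # sum_t P(X_t=i, X_{t+1}=j | Y=y)
    for t in range(T - 1):
        xi += alpha[t][:, None] * P * (theta[:, y[t+1]] * beta[t+1])[None, :] / py
    # M-step: re-estimate pi^0, Pi, theta from the expected counts.
    new_pi0 = gamma[0]
    new_P = xi / xi.sum(axis=1, keepdims=True)
    new_theta = np.zeros((M, K))
    for k in range(K):
        new_theta[:, k] = gamma[np.array(y) == k].sum(axis=0)
    new_theta /= new_theta.sum(axis=1, keepdims=True)
    return new_pi0, new_P, new_theta

pi0   = np.array([0.6, 0.4])
P     = np.array([[0.7, 0.3], [0.4, 0.6]])
theta = np.array([[0.9, 0.1], [0.2, 0.8]])
print(baum_welch_step([0, 1, 1, 0, 0, 1], pi0, P, theta))
```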
Measuring distance between two HMMs

Let
$$\lambda_1 = (\pi^0_1, \Pi_1, \theta_1) \quad\text{and}\quad \lambda_2 = (\pi^0_2, \Pi_2, \theta_2)$$
denote the parameters of two different HMM models. We now consider defining a distance between these two models.

The Kullback-Leibler distance

Consider two discrete distributions $p_1(\mathbf{y})$ and $p_2(\mathbf{y})$ ($f_1(\mathbf{y})$ and $f_2(\mathbf{y})$ in the continuous case), and define
$$I(p_1, p_2) = \sum_{\mathbf{y}} p_1(\mathbf{y}) \ln \frac{p_1(\mathbf{y})}{p_2(\mathbf{y})} = E_{p_1}\left[\ln p_1(\mathbf{y}) - \ln p_2(\mathbf{y})\right].$$
These measures of distance between the two distributions are not symmetric but can be made symmetric by the following:
$$I^s(p_1, p_2) = \frac{I(p_1, p_2) + I(p_2, p_1)}{2}.$$
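A small sketch of $I(p_1, p_2)$ and its symmetrized version for two discrete distributions on a common finite support; the example distributions are assumed.

```python
import numpy as np

def kl(p1, p2):
    """I(p1, p2) = sum_y p1(y) * ln(p1(y) / p2(y))."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sum(p1 * np.log(p1 / p2)))

def kl_symmetric(p1, p2):
    """I^s(p1, p2) = (I(p1, p2) + I(p2, p1)) / 2."""
    return (kl(p1, p2) + kl(p2, p1)) / 2

p1 = [0.7, 0.2, 0.1]   # assumed example distributions
p2 = [0.5, 0.3, 0.2]
print(kl(p1, p2), kl(p2, p1), kl_symmetric(p1, p2))  # note I(p1,p2) != I(p2,p1)
```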
In the case of a Hidden Markov Model,
$$p_i(\mathbf{y}) = p(\mathbf{y} \mid \lambda_i) = p(\mathbf{y} \mid \pi^0_i, \Pi_i, \theta_i) = \sum_{\mathbf{i}} p(\mathbf{y}, \mathbf{i} \mid \pi^0_i, \Pi_i, \theta_i),$$
where
$$p(\mathbf{y}, \mathbf{i} \mid \pi^0, \Pi, \theta) = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$
The computation of $I(p_1, p_2)$ in this case is formidable.
Juang and Rabiner distance

Let $\mathbf{Y}^{(i)} = (Y^{(i)}_1, Y^{(i)}_2, \ldots, Y^{(i)}_T)$ denote a sequence of observations generated from the HMM with parameters
$$\lambda_i = (\pi^0_i, \Pi_i, \theta_i).$$
Let $\mathbf{i}^*(\mathbf{y}^{(i)}) = (i^*_1, i^*_2, \ldots, i^*_T)$ denote the optimal (Viterbi) sequence of states for the sequence $\mathbf{y}^{(i)}$, assuming HMM model $\lambda_i = (\pi^0_i, \Pi_i, \theta_i)$.