
Hidden Markov Models

A Hidden Markov Model consists of

1. A sequence of states {Xt | t ∈ T} = {X1, X2, ... , XT}, and

2. A sequence of observations {Yt | t ∈ T} = {Y1, Y2, ... , YT}.

• The sequence of states {X1, X2, ... , XT} forms a Markov chain moving amongst the M states {1, 2, …, M}.

• The observation Yt comes from a distribution that is determined by the current state of the process, Xt (or possibly by past observations and past states).

• The states, {X1, X2, ... , XT}, are unobserved (hence hidden).

[Diagram: hidden states X1 → X2 → X3 → ⋯ → XT forming a Markov chain, with each state Xt emitting the observation Yt.]

The Hidden Markov Model

Some basic problems: given the observations {Y1, Y2, ... , YT},

1. Determine the sequence of states {X1, X2, ... , XT}.

2. Determine (or estimate) the parameters of the stochastic process that is generating the states and the observations.

Examples

Example 1

• A person is rolling two sets of dice (one is balanced, the other is unbalanced). He switches between the two sets of dice using a Markov transition matrix.

• The states are the dice.

• The observations are the numbers rolled each time.

[Histogram: probabilities of the totals 2 through 12 for the balanced dice.]

[Histogram: probabilities of the totals 2 through 12 for the unbalanced dice.]
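As a concrete illustration of this example, the following is a small simulation sketch (not part of the original slides); the switching probabilities, the unbalanced-die weights, and names such as `simulate` are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hidden states: 0 = balanced pair of dice, 1 = unbalanced pair.
# Assumed switching probabilities and unbalanced single-die weights.
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])          # P[i, j] = P(next state j | current state i)
balanced_die   = np.full(6, 1.0 / 6.0)
unbalanced_die = np.array([0.3, 0.1, 0.1, 0.1, 0.1, 0.3])  # loaded toward 1 and 6
die_probs = [balanced_die, unbalanced_die]

def simulate(T):
    """Generate T (state, total-of-two-dice) pairs from the hidden Markov model."""
    states, totals = [], []
    state = 0                         # start with the balanced dice
    for _ in range(T):
        faces = rng.choice(6, size=2, p=die_probs[state]) + 1
        states.append(state)
        totals.append(int(faces.sum()))
        state = rng.choice(2, p=P[state])
    return np.array(states), np.array(totals)

states, totals = simulate(200)
print("fraction of time on unbalanced dice:", states.mean())
print("first 20 observed totals:", totals[:20])
```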

Example 2

• The Markov chain has two states.

• The observations (given the states) are independent Normal.

• Both the mean and the variance depend on the state.

HMM AR.xls

Example 3 – Dow Jones

[Chart: Dow Jones index level over roughly 80 days.]

[Chart: daily changes of the Dow Jones over the same period.]

Hidden Markov Model??

[Chart: daily changes of the Dow Jones.]

Bear and Bull Market?

[Chart: Dow Jones index level, suggesting bear- and bull-market regimes.]

Speech Recognition

• When a word is spoken the vocalization process goes through a sequence of states.

• The sound produced is relatively constant when the process remains in the same state.

• Recognizing the sequence of states and the duration of each state allows one to recognize the word being spoken.

• The interval of time when the word is spoken is broken into small (possibly overlapping) subintervals.

• In each subinterval one measures the amplitudes of various frequencies in the sound (using Fourier analysis). The vector of amplitudes Yt is assumed to have a multivariate normal distribution in each state, with the mean vector and covariance matrix being state dependent.

Hidden Markov Models for Biological Sequences

Consider the motif: [AT][CG][AC][ACGT]*A[TG][GC]. Some realizations:

A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C

[Diagram: HMM for the motif, with match-state emission probabilities 1: A .8, T .2; 2: C .8, G .2; 3: A .8, C .2; 4: A 1.0; 5: G .2, T .8; 6: C .8, G .2; and an insert state emitting A .2, C .4, G .2, T .2 with a .4 self-loop. Transitions of probability 1.0 link the other match states, with probabilities .6 and .4 on the arrows into and around the insert state.]

Hidden Markov model of the same motif:

[AT][CG][AC][ACGT]*A[TG][GC]

Profile HMMs

[Diagram: profile HMM architecture, running from a Begin state to an End state.]

Computing the Likelihood

Let $\pi_{ij} = P[X_{t+1} = j \mid X_t = i]$ and $\Pi = (\pi_{ij})$ = the Markov chain transition matrix. Let $\pi^0_i = P[X_1 = i]$ and $\pi^0 = \left(\pi^0_1, \pi^0_2, \ldots, \pi^0_M\right)$ = the initial distribution over the states. Then

$$P[X_1 = i_1, X_2 = i_2, \ldots, X_T = i_T] = \pi^0_{i_1}\,\pi_{i_1 i_2}\,\pi_{i_2 i_3}\cdots\pi_{i_{T-1} i_T}.$$

Now assume that

$$P[Y_t = y_t \mid X_1 = i_1, X_2 = i_2, \ldots, X_t = i_t] = P[Y_t = y_t \mid X_t = i_t] = p(y_t \mid \theta_{i_t}) = \theta_{i_t y_t}.$$

Then

$$P[X_1 = i_1, \ldots, X_T = i_T,\ Y_1 = y_1, \ldots, Y_T = y_T] = P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}]$$
$$= \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$

Therefore

$$P[Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T] = P[\mathbf{Y} = \mathbf{y}]$$
$$= \sum_{i_1, i_2, \ldots, i_T} \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}$$
$$= L\!\left(\pi^0, \Pi, \theta\right), \quad\text{where } \theta = (\theta_1, \theta_2, \ldots, \theta_M).$$

In the case when $Y_1, Y_2, \ldots, Y_T$ are continuous random variables or continuous random vectors, let $f(y \mid \theta_i)$ denote the conditional density of $Y_t$ given $X_t = i$. Then the joint density of $Y_1, Y_2, \ldots, Y_T$ is given by

$$L\!\left(\pi^0, \Pi, \theta\right) = f(y_1, y_2, \ldots, y_T) = f(\mathbf{y}) = \sum_{i_1, i_2, \ldots, i_T} \pi^0_{i_1} f(y_1 \mid \theta_{i_1})\,\pi_{i_1 i_2} f(y_2 \mid \theta_{i_2})\cdots\pi_{i_{T-1} i_T} f(y_T \mid \theta_{i_T}),$$

i.e. $\theta_{i y_t}$ is replaced by $f(y_t \mid \theta_i)$.
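The sum above ranges over all M^T state sequences, so it can be evaluated literally only for very small T. The sketch below does exactly that for an assumed toy model (two states, three observation symbols, made-up parameter values); it is intended only as a reference point for the efficient recursions that follow.

```python
import itertools
import numpy as np

# Assumed toy parameters: M = 2 states, K = 3 observation symbols.
pi0   = np.array([0.6, 0.4])                  # initial distribution pi^0
Pi    = np.array([[0.7, 0.3],
                  [0.2, 0.8]])                # transition matrix (pi_ij)
theta = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])           # emission probabilities theta_{iy}

y = [0, 2, 1, 2]                              # an observed sequence y_1, ..., y_T
T, M = len(y), len(pi0)

def brute_force_likelihood(y):
    """P[Y = y] = sum over all state sequences of P[X = i, Y = y]."""
    total = 0.0
    for path in itertools.product(range(M), repeat=T):
        p = pi0[path[0]] * theta[path[0], y[0]]
        for t in range(1, T):
            p *= Pi[path[t - 1], path[t]] * theta[path[t], y[t]]
        total += p
    return total

print("P[Y = y] by brute force:", brute_force_likelihood(y))
```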

Efficient Methods for computing Likelihood

The Forward Method

Let $\mathbf{Y}^{(t)} = \left(Y_1, Y_2, \ldots, Y_t\right)$ and $\mathbf{y}^{(t)} = \left(y_1, y_2, \ldots, y_t\right)$. Consider

$$\alpha_t(i_t) = P\!\left[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i_t\right] = P\!\left[Y_1 = y_1, \ldots, Y_t = y_t, X_t = i_t\right].$$

Note that

$$\alpha_1(i_1) = P\!\left[Y_1 = y_1, X_1 = i_1\right] = \pi^0_{i_1}\theta_{i_1 y_1}.$$

Also

$$\alpha_{t+1}(i_{t+1}) = P\!\left[\mathbf{Y}^{(t+1)} = \mathbf{y}^{(t+1)}, X_{t+1} = i_{t+1}\right]$$
$$= \sum_{i_t} P\!\left[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, Y_{t+1} = y_{t+1}, X_t = i_t, X_{t+1} = i_{t+1}\right]$$
$$= \sum_{i_t} P\!\left[Y_{t+1} = y_{t+1}, X_{t+1} = i_{t+1} \mid \mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i_t\right] P\!\left[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i_t\right]$$
$$= \sum_{i_t} \pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}\,\alpha_t(i_t) = \theta_{i_{t+1} y_{t+1}} \sum_{i_t} \pi_{i_t i_{t+1}}\,\alpha_t(i_t).$$

Then

$$P[\mathbf{Y} = \mathbf{y}] = P\!\left[\mathbf{Y}^{(T)} = \mathbf{y}^{(T)}\right] = \sum_{i_T} P\!\left[\mathbf{Y}^{(T)} = \mathbf{y}^{(T)}, X_T = i_T\right] = \sum_{i_T} \alpha_T(i_T).$$
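A minimal sketch of this forward recursion, reusing the same assumed toy parameters as the brute-force sketch above; `forward` and the array layout are illustrative choices, not part of the slides.

```python
import numpy as np

pi0   = np.array([0.6, 0.4])
Pi    = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
theta = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
y = [0, 2, 1, 2]

def forward(y, pi0, Pi, theta):
    """alpha[t, i] = P[Y_1 = y_1, ..., Y_{t+1} = y_{t+1}, X_{t+1} = i] (0-based t)."""
    T, M = len(y), len(pi0)
    alpha = np.zeros((T, M))
    alpha[0] = pi0 * theta[:, y[0]]                       # alpha_1(i) = pi0_i * theta_{i y_1}
    for t in range(1, T):
        # alpha_{t+1}(j) = theta_{j y_{t+1}} * sum_i alpha_t(i) pi_{ij}
        alpha[t] = theta[:, y[t]] * (alpha[t - 1] @ Pi)
    return alpha

alpha = forward(y, pi0, Pi, theta)
print("P[Y = y] by the forward method:", alpha[-1].sum())  # matches the brute-force value
```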

The Backward Procedure

Let $\mathbf{Y}^{*(t)} = \left(Y_{t+1}, Y_{t+2}, \ldots, Y_T\right)$ and $\mathbf{y}^{*(t)} = \left(y_{t+1}, y_{t+2}, \ldots, y_T\right)$. Consider

$$\beta^*_t(i_t) = P\!\left[\mathbf{Y}^{*(t)} = \mathbf{y}^{*(t)} \mid X_t = i_t\right] = P\!\left[Y_{t+1} = y_{t+1}, Y_{t+2} = y_{t+2}, \ldots, Y_T = y_T \mid X_t = i_t\right].$$

Note that

$$\beta^*_{T-1}(i_{T-1}) = P\!\left[\mathbf{Y}^{*(T-1)} = \mathbf{y}^{*(T-1)} \mid X_{T-1} = i_{T-1}\right] = P\!\left[Y_T = y_T \mid X_{T-1} = i_{T-1}\right] = \sum_{i_T} \pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$

Now

$$\beta^*_{t-1}(i_{t-1}) = P\!\left[\mathbf{Y}^{*(t-1)} = \mathbf{y}^{*(t-1)} \mid X_{t-1} = i_{t-1}\right]$$
$$= \sum_{i_t} P\!\left[Y_t = y_t, \mathbf{Y}^{*(t)} = \mathbf{y}^{*(t)}, X_t = i_t \mid X_{t-1} = i_{t-1}\right]$$
$$= \sum_{i_t} P\!\left[\mathbf{Y}^{*(t)} = \mathbf{y}^{*(t)} \mid X_t = i_t\right] P\!\left[Y_t = y_t, X_t = i_t \mid X_{t-1} = i_{t-1}\right]$$
$$= \sum_{i_t} \pi_{i_{t-1} i_t}\theta_{i_t y_t}\,\beta^*_t(i_t).$$

Then

$$P[\mathbf{Y} = \mathbf{y}] = P\!\left[\mathbf{Y}^{*(0)} = \mathbf{y}^{*(0)}\right] = \sum_{i_1} P\!\left[Y_1 = y_1, \mathbf{Y}^{*(1)} = \mathbf{y}^{*(1)}, X_1 = i_1\right]$$
$$= \sum_{i_1} P\!\left[\mathbf{Y}^{*(1)} = \mathbf{y}^{*(1)} \mid X_1 = i_1, Y_1 = y_1\right] P\!\left[X_1 = i_1, Y_1 = y_1\right]$$
$$= \sum_{i_1} \beta^*_1(i_1)\,\pi^0_{i_1}\theta_{i_1 y_1} = P\!\left[Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T\right].$$
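A matching sketch of the backward recursion with the same assumed toy parameters. It uses the common convention β*_T(i) = 1, which reproduces the β*_{T−1} formula above after one step of the general recursion.

```python
import numpy as np

pi0   = np.array([0.6, 0.4])
Pi    = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
theta = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
y = [0, 2, 1, 2]

def backward(y, pi0, Pi, theta):
    """beta[t, i] = P[Y_{t+2} = y_{t+2}, ..., Y_T = y_T | X_{t+1} = i] (0-based t),
    with the convention beta[T-1, i] = 1."""
    T, M = len(y), len(pi0)
    beta = np.ones((T, M))
    for t in range(T - 2, -1, -1):
        # beta*_t(i) = sum_j pi_{ij} theta_{j y_{t+1}} beta*_{t+1}(j)
        beta[t] = Pi @ (theta[:, y[t + 1]] * beta[t + 1])
    return beta

beta = backward(y, pi0, Pi, theta)
likelihood = np.sum(pi0 * theta[:, y[0]] * beta[0])
print("P[Y = y] by the backward method:", likelihood)  # same value as before
```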

Prediction of states from the observations and the model:

Consider

$$\alpha_T(i_T) = P\!\left[\mathbf{Y}^{(T)} = \mathbf{y}^{(T)}, X_T = i_T\right] = P\!\left[Y_1 = y_1, \ldots, Y_T = y_T, X_T = i_T\right] = P[\mathbf{Y} = \mathbf{y}, X_T = i_T].$$

Thus

$$P[X_T = i_T \mid \mathbf{Y} = \mathbf{y}] = \frac{P[X_T = i_T, \mathbf{Y} = \mathbf{y}]}{P[\mathbf{Y} = \mathbf{y}]} = \frac{\alpha_T(i_T)}{\sum_{i_T} \alpha_T(i_T)}.$$

Also

$$P[X_t = i_t \mid \mathbf{Y} = \mathbf{y}] = \frac{P[X_t = i_t, \mathbf{Y} = \mathbf{y}]}{P[\mathbf{Y} = \mathbf{y}]} = \frac{P\!\left[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i_t\right] P\!\left[\mathbf{Y}^{*(t)} = \mathbf{y}^{*(t)} \mid X_t = i_t\right]}{P[\mathbf{Y} = \mathbf{y}]} = \frac{\alpha_t(i_t)\,\beta^*_t(i_t)}{\sum_{i_T} \alpha_T(i_T)}.$$

The Viterbi Algorithm (Viterbi Paths)

Suppose that we know the parameters of the Hidden Markov Model. Suppose in addition that we have observed the sequence of observations Y1, Y2, ... , YT.

Now consider determining the sequence of States X1, X2, ... , XT.

Recall that

$$P[X_1 = i_1, \ldots, X_T = i_T,\ Y_1 = y_1, \ldots, Y_T = y_T] = P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}]$$
$$= \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$

Consider the problem of determining the sequence of states $i_1, i_2, \ldots, i_T$ that maximizes the above probability. This is equivalent to maximizing

$$P[\mathbf{X} = \mathbf{i} \mid \mathbf{Y} = \mathbf{y}] = \frac{P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}]}{P[\mathbf{Y} = \mathbf{y}]}.$$

The Viterbi Algorithm

We want to maximize

$$P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}] = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$

Equivalently, we want to minimize $U(i_1, i_2, \ldots, i_T)$, where

$$U(i_1, i_2, \ldots, i_T) = -\ln P[\mathbf{X} = \mathbf{i}, \mathbf{Y} = \mathbf{y}] = -\left[\ln\!\left(\pi^0_{i_1}\theta_{i_1 y_1}\right) + \ln\!\left(\pi_{i_1 i_2}\theta_{i_2 y_2}\right) + \cdots + \ln\!\left(\pi_{i_{T-1} i_T}\theta_{i_T y_T}\right)\right].$$

• Minimization of U(i1, i2, ... , iT) can be achieved by Dynamic Programming.

• This can be thought of as finding the shortest path through a grid of points: starting at the unique point in stage 0 and moving from a point in stage t to a point in stage t+1 in an optimal way.

• The distances between points in stage t and points in stage t+1 are equal to:

$$d(0, i_1) = -\ln\!\left(\pi^0_{i_1}\,\theta_{i_1 y_1}\right) \quad\text{for } t = 0, \qquad d_t(i_t, i_{t+1}) = -\ln\!\left(\pi_{i_t i_{t+1}}\,\theta_{i_{t+1} y_{t+1}}\right) \quad\text{for } t \ge 1.$$

Dynamic Programming

[Diagram: grid with stages 0, 1, 2, …, T−1, T (one point per state in each stage); successive edges have lengths d(0, i1), d1(i1, i2), …, dT−1(iT−1, iT).]

Let

$$U_t(i_1, i_2, \ldots, i_t) = -\left[\ln\!\left(\pi^0_{i_1}\theta_{i_1 y_1}\right) + \ln\!\left(\pi_{i_1 i_2}\theta_{i_2 y_2}\right) + \cdots + \ln\!\left(\pi_{i_{t-1} i_t}\theta_{i_t y_t}\right)\right]$$

and

$$V_t(i_t) = \min_{i_1, i_2, \ldots, i_{t-1}} U_t(i_1, i_2, \ldots, i_t).$$

Then

$$V_1(i_1) = -\ln\!\left(\pi^0_{i_1}\theta_{i_1 y_1}\right) = d(0, i_1), \qquad i_1 = 1, 2, \ldots, M,$$

and

$$V_{t+1}(i_{t+1}) = \min_{i_t}\left[V_t(i_t) - \ln\!\left(\pi_{i_t i_{t+1}}\theta_{i_{t+1} y_{t+1}}\right)\right] = \min_{i_t}\left[V_t(i_t) + d_t(i_t, i_{t+1})\right],$$
$$i_{t+1} = 1, 2, \ldots, M; \quad t = 1, \ldots, T-2.$$

Finally

$$\min_{i_1, i_2, \ldots, i_T} U(i_1, i_2, \ldots, i_T) = \min_{i_T} V_T(i_T), \qquad\text{where } V_T(i_T) = \min_{i_{T-1}}\left[V_{T-1}(i_{T-1}) - \ln\!\left(\pi_{i_{T-1} i_T}\theta_{i_T y_T}\right)\right].$$

Summary of calculations of the Viterbi Path

1. $V_1(i_1) = -\ln\!\left(\pi^0_{i_1}\,\theta_{i_1 y_1}\right)$, for $i_1 = 1, 2, \ldots, M$.

2. $V_{t+1}(i_{t+1}) = \min_{i_t}\left[V_t(i_t) - \ln\!\left(\pi_{i_t i_{t+1}}\,\theta_{i_{t+1} y_{t+1}}\right)\right]$, for $i_{t+1} = 1, 2, \ldots, M$ and $t = 1, \ldots, T-2$.

3. $\displaystyle \min_{i_1, i_2, \ldots, i_T} U(i_1, i_2, \ldots, i_T) = \min_{i_T}\left[\min_{i_{T-1}}\left(V_{T-1}(i_{T-1}) - \ln\!\left(\pi_{i_{T-1} i_T}\,\theta_{i_T y_T}\right)\right)\right].$

The Viterbi path $(\hat{i}_1, \hat{i}_2, \ldots, \hat{i}_T)$ is recovered by keeping track of the minimizing state at each step and tracing back from the final minimizer.
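A sketch of these Viterbi calculations for a discrete-observation HMM with assumed toy parameters (not from the slides); the final loop tracks the minimizing states backwards to recover the path.

```python
import numpy as np

pi0   = np.array([0.6, 0.4])
Pi    = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
theta = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
y = [0, 2, 1, 2]

def viterbi(y, pi0, Pi, theta):
    """Minimize U(i_1, ..., i_T) = -ln P[X = i, Y = y] by dynamic programming."""
    T, M = len(y), len(pi0)
    V = np.zeros((T, M))                  # V[t, i] = V_{t+1}(i)
    back = np.zeros((T, M), dtype=int)    # argmin pointers for the traceback
    V[0] = -np.log(pi0 * theta[:, y[0]])  # V_1(i) = -ln(pi0_i theta_{i y_1})
    for t in range(1, T):
        # V_{t+1}(j) = min_i [ V_t(i) - ln(pi_{ij} theta_{j y_{t+1}}) ]
        cost = V[t - 1][:, None] - np.log(Pi * theta[:, y[t]][None, :])
        back[t] = np.argmin(cost, axis=0)
        V[t] = np.min(cost, axis=0)
    # Trace the minimizing states back to recover the Viterbi path.
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmin(V[-1])
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path, np.min(V[-1])

path, U_min = viterbi(y, pi0, Pi, theta)
print("Viterbi path:", path + 1)          # states reported as 1, ..., M
print("max P[X = i, Y = y]:", np.exp(-U_min))
```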

An alternative approach to prediction of states from the observations and the model:

It can be shown that:

$$P[X_t = i_t \mid \mathbf{Y} = \mathbf{y}] = \frac{\alpha_t(i_t)\,\beta^*_t(i_t)}{\sum_{i_T} \alpha_T(i_T)} = \frac{\alpha_t(i_t)\,\beta^*_t(i_t)}{\sum_{i_t} \alpha_t(i_t)\,\beta^*_t(i_t)}.$$

Forward Probabilities: $\alpha_t(i_t) = P\!\left[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i_t\right]$

1. $\alpha_1(i_1) = \pi^0_{i_1}\,\theta_{i_1 y_1}$

2. $\alpha_{t+1}(i_{t+1}) = \theta_{i_{t+1} y_{t+1}} \sum_{i_t} \alpha_t(i_t)\,\pi_{i_t i_{t+1}}$

Backward Probabilities: $\beta^*_t(i_t) = P\!\left[\mathbf{Y}^{*(t)} = \mathbf{y}^{*(t)} \mid X_t = i_t\right]$

1. $\beta^*_{T-1}(i_{T-1}) = \sum_{i_T} \pi_{i_{T-1} i_T}\,\theta_{i_T y_T}$

2. $\beta^*_{t-1}(i_{t-1}) = \sum_{i_t} \pi_{i_{t-1} i_t}\,\theta_{i_t y_t}\,\beta^*_t(i_t)$

HMM generator (normal).xls
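As an illustration of these recursions, the sketch below combines the forward and backward probabilities to compute P[X_t = i | Y = y] for the same assumed toy model as before; the array names `alpha`, `beta`, and `gamma` are illustrative choices.

```python
import numpy as np

pi0   = np.array([0.6, 0.4])
Pi    = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
theta = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
y = [0, 2, 1, 2]
T, M = len(y), len(pi0)

# Forward probabilities alpha_t(i) and backward probabilities beta*_t(i).
alpha = np.zeros((T, M))
beta = np.ones((T, M))
alpha[0] = pi0 * theta[:, y[0]]
for t in range(1, T):
    alpha[t] = theta[:, y[t]] * (alpha[t - 1] @ Pi)
for t in range(T - 2, -1, -1):
    beta[t] = Pi @ (theta[:, y[t + 1]] * beta[t + 1])

# P[X_t = i | Y = y] = alpha_t(i) beta*_t(i) / sum_j alpha_t(j) beta*_t(j)
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
for t in range(T):
    print(f"t = {t + 1}: P[X_t = i | Y = y] =", np.round(gamma[t], 3))
```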

Estimation of Parameters of a Hidden Markov Model

If both the sequence of observations Y1, Y2, ... , YT and the sequence of states X1, X2, ... , XT are observed, say Y1 = y1, Y2 = y2, ... , YT = yT and X1 = i1, X2 = i2, ... , XT = iT, then the Likelihood is given by:

$$L\!\left(\pi^0, \Pi, \theta\right) = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}$$

and the log-Likelihood is given by

$$l\!\left(\pi^0, \Pi, \theta\right) = \ln L\!\left(\pi^0, \Pi, \theta\right) = \ln \pi^0_{i_1} + \ln \theta_{i_1 y_1} + \ln \pi_{i_1 i_2} + \ln \theta_{i_2 y_2} + \cdots + \ln \pi_{i_{T-1} i_T} + \ln \theta_{i_T y_T}$$

$$= \sum_{i=1}^{M} f^0_i \ln \pi^0_i + \sum_{i=1}^{M} \sum_{j=1}^{M} f_{ij} \ln \pi_{ij} + \sum_{i=1}^{M} \sum_{t:\, X_t = i} \ln \theta_{i y_t}$$

where

$f^0_i$ = the number of times state $i$ occurs as the first state,

$f_{ij}$ = the number of times state $i$ changes to state $j$, and

$\theta_{i y_t} = f(y_t \mid \theta_i)$ (or $p_i(y_t)$ in the discrete case), the last sum being over all observations $y_t$ where $X_t = i$.

In this case the Maximum Likelihood estimates are:

$\hat{\theta}_i$ = the MLE of $\theta_i$ computed from the observations $y_t$ where $X_t = i$,

$$\hat{\pi}^0_i = \frac{f^0_i}{\sum_{i=1}^{M} f^0_i}, \qquad\text{and}\qquad \hat{\pi}_{ij} = \frac{f_{ij}}{\sum_{j=1}^{M} f_{ij}}.$$
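When the states are observed, these estimates reduce to simple counting. The sketch below illustrates this for discrete emissions with made-up data; the sequences and all variable names are assumptions.

```python
import numpy as np

M, K = 2, 3                                   # assumed numbers of states and symbols

# Assumed fully observed data: several (state, observation) sequences.
state_seqs = [[0, 0, 1, 1, 1, 0], [1, 1, 0, 0, 0, 0]]
obs_seqs   = [[0, 1, 2, 2, 1, 0], [2, 2, 0, 1, 0, 0]]

f0   = np.zeros(M)                            # f_i^0: counts of the first state
f    = np.zeros((M, M))                       # f_ij: counts of transitions i -> j
emit = np.zeros((M, K))                       # counts of symbol y emitted in state i

for states, obs in zip(state_seqs, obs_seqs):
    f0[states[0]] += 1
    for t in range(len(states) - 1):
        f[states[t], states[t + 1]] += 1
    for s, o in zip(states, obs):
        emit[s, o] += 1

pi0_hat   = f0 / f0.sum()                     # initial distribution estimate
Pi_hat    = f / f.sum(axis=1, keepdims=True)  # pi_hat_ij = f_ij / sum_j f_ij
theta_hat = emit / emit.sum(axis=1, keepdims=True)  # discrete-emission MLE

print("pi0_hat:", pi0_hat)
print("Pi_hat:\n", Pi_hat)
print("theta_hat:\n", theta_hat)
```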

MLE (states unknown)

If only the sequence of observations Y1 = y1, Y2 = y2, ... , YT = yT is observed, then the Likelihood is given by:

$$L\!\left(\pi^0, \Pi, \theta\right) = \sum_{i_1, i_2, \ldots, i_T} \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\,\pi_{i_2 i_3}\theta_{i_3 y_3}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$

• It is difficult to find the Maximum Likelihood Estimates directly from the Likelihood function.

• The techniques that are used are:

  1. The Segmental K-means Algorithm

  2. The Baum-Welch (E-M) Algorithm

The Segmental K-means Algorithm

In this method the parameters $\lambda = \left(\pi^0, \Pi, \theta\right)$ are adjusted to maximize

$$L\!\left(\pi^0, \Pi, \theta \mid \mathbf{y}, \mathbf{i}\right) = L(\lambda \mid \mathbf{y}, \mathbf{i}) = \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}$$

where $\mathbf{i} = \left(i_1, i_2, \ldots, i_T\right)$ is the Viterbi path.

Consider this with the following special case.

Case: The observations {Y1, Y2, ... , YT} are continuous multivariate Normal with mean vector $\mu_i$ and covariance matrix $\Sigma_i$ when $X_t = i$, i.e.

$$f(\mathbf{y} \mid \theta_i) = \frac{1}{(2\pi)^{p/2}\left|\Sigma_i\right|^{1/2}} \exp\!\left[-\tfrac{1}{2}\left(\mathbf{y} - \mu_i\right)'\Sigma_i^{-1}\left(\mathbf{y} - \mu_i\right)\right].$$

1. Pick arbitrarily M centroids a1, a2, …, aM. Assign each of the T observations yt (kT if multiple realizations are observed) to a state it by finding the nearest centroid:

$$\min_i \left\|\mathbf{y}_t - \mathbf{a}_i\right\|.$$

2. Then

$$\hat{\pi}^0_i = \frac{\text{Number of times state } i \text{ occurs at time } t = 1}{k}, \qquad \hat{\pi}_{ij} = \frac{\text{Number of transitions from } i \text{ to } j}{\text{Number of transitions from } i}.$$

3. And

$$\hat{\mu}_i = \frac{1}{N_i}\sum_{t:\, i_t = i} \mathbf{y}_t, \qquad \hat{\Sigma}_i = \frac{1}{N_i}\sum_{t:\, i_t = i} \left(\mathbf{y}_t - \hat{\mu}_i\right)\left(\mathbf{y}_t - \hat{\mu}_i\right)', \qquad\text{where } N_i = \text{the number of } t \text{ with } i_t = i.$$

4. Calculate the Viterbi path (i1, i2, …, iT) based on the parameters of steps 2 and 3.

5. If there is a change in the sequence (i1, i2, …, iT), repeat steps 2 to 4.
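A compact sketch of the Segmental K-means loop for a univariate-normal special case (the slides treat the multivariate case); the simulated data, the starting centroids, and the small smoothing constants are assumptions added to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 2

# Assumed observed sequence: two well-separated regimes (illustration only).
y = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(5.0, 1.0, 60)])
T = len(y)

def viterbi_path(y, pi0, Pi, mu, sigma):
    """Most probable state sequence under a univariate-normal HMM."""
    logf = -0.5 * ((y[:, None] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    V = np.full((T, M), np.inf)
    back = np.zeros((T, M), dtype=int)
    V[0] = -np.log(pi0) - logf[0]
    for t in range(1, T):
        cost = V[t - 1][:, None] - np.log(Pi) - logf[t][None, :]
        back[t] = np.argmin(cost, axis=0)
        V[t] = np.min(cost, axis=0)
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmin(V[-1])
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Step 1: assign each observation to the nearest of M arbitrary centroids.
centroids = np.array([y.min(), y.max()])
path = np.argmin(np.abs(y[:, None] - centroids), axis=1)

for _ in range(20):
    # Steps 2 and 3: re-estimate pi0, Pi, mu, sigma from the current segmentation.
    pi0 = np.array([np.mean(path[0] == i) for i in range(M)]) + 1e-6
    Pi = np.full((M, M), 1e-6)
    for t in range(T - 1):
        Pi[path[t], path[t + 1]] += 1
    Pi /= Pi.sum(axis=1, keepdims=True)
    mu = np.array([y[path == i].mean() for i in range(M)])
    sigma = np.array([y[path == i].std() + 1e-6 for i in range(M)])
    # Steps 4 and 5: recompute the Viterbi path and stop when it no longer changes.
    new_path = viterbi_path(y, pi0 / pi0.sum(), Pi, mu, sigma)
    if np.array_equal(new_path, path):
        break
    path = new_path

print("estimated means:", mu, " estimated sds:", sigma)
print("estimated transition matrix:\n", np.round(Pi, 3))
```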

The Baum-Welch (E-M) Algorithm

• The E-M algorithm was designed originally to handle “Missing observations”.

• In this case the missing observations are the states {X1, X2, ... , XT}.

• Assuming a model, the states are estimated by finding their expected values under this model. (The E part of the E-M algorithm).

• With these values the model is estimated by Maximum Likelihood Estimation (The M part of the E-M algorithm).

• The process is repeated until the estimated model converges.

The E-M Algorithm

Let $f(\mathbf{y}, \mathbf{x} \mid \theta) = L(\theta \mid \mathbf{y}, \mathbf{x})$ denote the joint distribution of $\mathbf{Y}, \mathbf{X}$. Consider the function

$$Q\!\left(\theta, \hat{\theta}\right) = E_{\mathbf{X}}\!\left[\ln L(\theta \mid \mathbf{Y}, \mathbf{X}) \mid \mathbf{Y} = \mathbf{y}, \hat{\theta}\right].$$

Starting with an initial estimate $\hat{\theta}^{(1)}$ of $\theta$, a sequence of estimates $\hat{\theta}^{(m)}$ is formed by finding $\hat{\theta}^{(m+1)}$ to maximize $Q\!\left(\theta, \hat{\theta}^{(m)}\right)$ with respect to $\theta$.

The sequence of estimates $\hat{\theta}^{(m)}$ converges to a local maximum of the likelihood $L(\theta \mid \mathbf{y}) = f(\mathbf{y} \mid \theta)$.

Example: Sampling from Mixtures

Let $y_1, y_2, \ldots, y_n$ denote a sample from the density

$$f(y \mid \theta_1, \ldots, \theta_m, \pi_1, \ldots, \pi_m) = \pi_1 g(y \mid \theta_1) + \pi_2 g(y \mid \theta_2) + \cdots + \pi_m g(y \mid \theta_m)$$

where $\pi_1 + \pi_2 + \cdots + \pi_m = 1$ and $g(y \mid \theta_i)$ is known except for $\theta_i$.

Suppose that $m = 2$ and let $x_1, x_2, \ldots, x_n$ denote independent random variables taking on the value 1 with probability $\pi$ and 0 with probability $1 - \pi$. Suppose that $y_i$ comes from the density

$$f(y \mid \theta_1, \theta_2) = x_i\, g(y \mid \theta_1) + (1 - x_i)\, g(y \mid \theta_2).$$

We will also assume that $g(y \mid \theta_i)$ is normal with mean $\mu_i$ and standard deviation $\sigma_i$.

Thus the joint distribution of $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$ is

$$f(\mathbf{y}, \mathbf{x} \mid \pi, \mu_1, \mu_2, \sigma_1, \sigma_2) = \prod_{i=1}^{n} \left[\pi \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}}\right]^{x_i} \left[(1 - \pi) \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(y_i - \mu_2)^2}{2\sigma_2^2}}\right]^{1 - x_i}$$
$$= L(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2 \mid \mathbf{y}, \mathbf{x}),$$

and the log-likelihood is

$$l(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2 \mid \mathbf{y}, \mathbf{x}) = \ln L(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2 \mid \mathbf{y}, \mathbf{x})$$
$$= \sum_{i=1}^{n} x_i\left[\ln \pi - \ln \sigma_1 - \tfrac{1}{2}\ln(2\pi) - \frac{(y_i - \mu_1)^2}{2\sigma_1^2}\right] + \sum_{i=1}^{n} (1 - x_i)\left[\ln(1 - \pi) - \ln \sigma_2 - \tfrac{1}{2}\ln(2\pi) - \frac{(y_i - \mu_2)^2}{2\sigma_2^2}\right].$$

Setting the derivative with respect to $\pi$ equal to zero,

$$\frac{\partial\, l(\pi, \mu_1, \mu_2, \sigma_1, \sigma_2 \mid \mathbf{y}, \mathbf{x})}{\partial \pi} = \sum_{i=1}^{n}\left[\frac{x_i}{\pi} - \frac{1 - x_i}{1 - \pi}\right] = 0$$

or

$$(1 - \pi)\sum_{i=1}^{n} x_i = \pi\left(n - \sum_{i=1}^{n} x_i\right), \qquad\text{i.e.}\qquad \sum_{i=1}^{n} x_i = n\pi.$$

Hence

$$\hat{\pi} = \frac{\sum_{i=1}^{n} x_i}{n}.$$

Similarly, setting the derivatives with respect to $\mu_1$, $\sigma_1$, $\mu_2$ and $\sigma_2$ equal to zero gives weighted estimates, e.g.

$$\hat{\mu}_1 = \frac{\sum_{i=1}^{n} x_i\, y_i}{\sum_{i=1}^{n} x_i}, \qquad \hat{\sigma}_1^2 = \frac{\sum_{i=1}^{n} x_i\, (y_i - \hat{\mu}_1)^2}{\sum_{i=1}^{n} x_i},$$

with $x_i$ replaced by $1 - x_i$ for $\hat{\mu}_2$ and $\hat{\sigma}_2^2$. Since the $x_i$ are not observed, in the E-M algorithm each $x_i$ is replaced by its expected value given the data and the current parameter estimates,

$$E[x_i \mid y_i] = \frac{\pi\, \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}}}{\pi\, \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-\frac{(y_i - \mu_1)^2}{2\sigma_1^2}} + (1 - \pi)\, \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-\frac{(y_i - \mu_2)^2}{2\sigma_2^2}}}.$$
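A sketch of the resulting E-M iteration for the two-component normal mixture, on an assumed simulated sample; the starting values and the stopping rule (a fixed number of iterations) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed sample from a two-component normal mixture (parameters made up).
y = np.concatenate([rng.normal(-1.0, 1.0, 150), rng.normal(3.0, 0.8, 50)])
n = len(y)

def phi(y, mu, sigma):
    """Normal density."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Initial guesses.
p, mu1, s1, mu2, s2 = 0.5, y.min(), y.std(), y.max(), y.std()

for _ in range(200):
    # E step: replace x_i by E[x_i | y_i] under the current parameters.
    w = p * phi(y, mu1, s1)
    x_hat = w / (w + (1 - p) * phi(y, mu2, s2))
    # M step: weighted versions of the complete-data MLEs.
    p = x_hat.mean()
    mu1 = np.sum(x_hat * y) / np.sum(x_hat)
    s1 = np.sqrt(np.sum(x_hat * (y - mu1) ** 2) / np.sum(x_hat))
    mu2 = np.sum((1 - x_hat) * y) / np.sum(1 - x_hat)
    s2 = np.sqrt(np.sum((1 - x_hat) * (y - mu2) ** 2) / np.sum(1 - x_hat))

print(f"p = {p:.3f}, mu1 = {mu1:.3f}, s1 = {s1:.3f}, mu2 = {mu2:.3f}, s2 = {s2:.3f}")
```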

In the case of an HMM the log-Likelihood is given by:

$$l\!\left(\pi^0, \Pi, \theta\right) = \ln L\!\left(\pi^0, \Pi, \theta\right) = \ln \pi^0_{i_1} + \ln \theta_{i_1 y_1} + \ln \pi_{i_1 i_2} + \ln \theta_{i_2 y_2} + \cdots + \ln \pi_{i_{T-1} i_T} + \ln \theta_{i_T y_T}$$

$$= \sum_{i=1}^{M} f^0_i \ln \pi^0_i + \sum_{i=1}^{M} \sum_{j=1}^{M} f_{ij} \ln \pi_{ij} + \sum_{i=1}^{M} \sum_{t:\, X_t = i} \ln \theta_{i y_t}$$

where $f^0_i$ = the number of times state $i$ occurs as the first state, $f_{ij}$ = the number of times state $i$ changes to state $j$, $\theta_{i y_t} = f(y_t \mid \theta_i)$ (or $p_i(y_t)$ in the discrete case), and the last sum, for each state $i$, is over all observations $y_t$ where $X_t = i$.

Recall the forward and backward probabilities

$$\alpha_t(i) = P\!\left[\mathbf{Y}^{(t)} = \mathbf{y}^{(t)}, X_t = i\right] \quad\text{and}\quad \beta^*_t(i) = P\!\left[\mathbf{Y}^{*(t)} = \mathbf{y}^{*(t)} \mid X_t = i\right],$$

and let

$$\gamma_t(i) = P[X_t = i \mid \mathbf{Y} = \mathbf{y}] = \frac{\alpha_t(i)\,\beta^*_t(i)}{\sum_{j} \alpha_t(j)\,\beta^*_t(j)}.$$

Then

$$\sum_{t=1}^{T-1} \gamma_t(i) = \text{Expected number of transitions from state } i.$$

Let

$$\xi_t(i, j) = P[X_t = i, X_{t+1} = j \mid \mathbf{Y} = \mathbf{y}] = \frac{P[X_t = i, X_{t+1} = j, \mathbf{Y} = \mathbf{y}]}{P[\mathbf{Y} = \mathbf{y}]} = \frac{\alpha_t(i)\,\pi_{ij}\,\theta_{j y_{t+1}}\,\beta^*_{t+1}(j)}{\sum_{j} \alpha_T(j)}.$$

Then

$$\sum_{t=1}^{T-1} \xi_t(i, j) = \text{Expected number of transitions from state } i \text{ to state } j.$$

The E-M Re-estimation Formulae

Case 1: The observations {Y1, Y2, ... , YT} are discrete with K possible values and $\theta_{iy} = P[Y_t = y \mid X_t = i]$. Then

$$\hat{\pi}^0_i = \gamma_1(i), \qquad \hat{\pi}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad\text{and}\qquad \hat{\theta}_{iy} = \frac{\sum_{t:\, y_t = y} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}.$$

Case 2: The observations {Y1, Y2, ... , YT} are continuous multivariate Normal with mean vector $\mu_i$ and covariance matrix $\Sigma_i$ when $X_t = i$, i.e.

$$f(\mathbf{y} \mid \theta_i) = \frac{1}{(2\pi)^{p/2}\left|\Sigma_i\right|^{1/2}} \exp\!\left[-\tfrac{1}{2}\left(\mathbf{y} - \mu_i\right)'\Sigma_i^{-1}\left(\mathbf{y} - \mu_i\right)\right].$$

Then

$$\hat{\pi}^0_i = \gamma_1(i), \qquad \hat{\pi}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)},$$

$$\hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\,\mathbf{y}_t}{\sum_{t=1}^{T} \gamma_t(i)}, \qquad\text{and}\qquad \hat{\Sigma}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\left(\mathbf{y}_t - \hat{\mu}_i\right)\left(\mathbf{y}_t - \hat{\mu}_i\right)'}{\sum_{t=1}^{T} \gamma_t(i)}.$$
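A sketch of these re-estimation formulae for Case 1 (discrete observations), iterated on an assumed toy sequence. No scaling of α and β* is used, so it is only suitable for short sequences; all parameter values and the iteration count are assumptions.

```python
import numpy as np

# Assumed toy parameters and data for a discrete-emission HMM.
pi0   = np.array([0.6, 0.4])
Pi    = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
theta = np.array([[0.5, 0.4, 0.1],
                  [0.1, 0.3, 0.6]])
y = np.array([0, 2, 1, 2, 2, 0, 1, 2, 2, 2, 0, 0, 1, 2, 0])
T, M, K = len(y), 2, 3

for _ in range(50):
    # Forward and backward probabilities.
    alpha = np.zeros((T, M)); beta = np.ones((T, M))
    alpha[0] = pi0 * theta[:, y[0]]
    for t in range(1, T):
        alpha[t] = theta[:, y[t]] * (alpha[t - 1] @ Pi)
    for t in range(T - 2, -1, -1):
        beta[t] = Pi @ (theta[:, y[t + 1]] * beta[t + 1])
    L = alpha[-1].sum()                                  # P[Y = y]

    # gamma_t(i) = P[X_t = i | Y = y], xi_t(i, j) = P[X_t = i, X_{t+1} = j | Y = y]
    gamma = alpha * beta / L
    xi = (alpha[:-1, :, None] * Pi[None, :, :] *
          theta[:, y[1:]].T[:, None, :] * beta[1:, None, :]) / L

    # Re-estimation formulae (Case 1: discrete observations).
    pi0 = gamma[0]
    Pi = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    theta = np.vstack([[gamma[y == k, i].sum() for k in range(K)]
                       for i in range(M)]) / gamma.sum(axis=0)[:, None]

print("estimated pi0:", np.round(pi0, 3))
print("estimated Pi:\n", np.round(Pi, 3))
print("estimated theta:\n", np.round(theta, 3))
print("log-likelihood:", np.log(L))
```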

Measuring distance between two HMMs

Let $\lambda_1 = \left(\pi^0_1, \Pi_1, \theta_1\right)$ and $\lambda_2 = \left(\pi^0_2, \Pi_2, \theta_2\right)$ denote the parameters of two different HMM models. We now consider defining a distance between these two models.

The Kullback-Leibler distance

Consider two discrete distributions $p_1(\mathbf{y})$ and $p_2(\mathbf{y})$ ($f_1(\mathbf{y})$ and $f_2(\mathbf{y})$ in the continuous case), then define

$$I(p_1, p_2) = \sum_{\mathbf{y}} p_1(\mathbf{y}) \ln\!\left[\frac{p_1(\mathbf{y})}{p_2(\mathbf{y})}\right] = E_{p_1}\!\left[\ln p_1(\mathbf{y}) - \ln p_2(\mathbf{y})\right]$$

and in the continuous case:

$$I(f_1, f_2) = \int f_1(\mathbf{y}) \ln\!\left[\frac{f_1(\mathbf{y})}{f_2(\mathbf{y})}\right] d\mathbf{y} = E_{f_1}\!\left[\ln f_1(\mathbf{y}) - \ln f_2(\mathbf{y})\right].$$

These measures of distance between the two distributions are not symmetric but can be made symmetric by the following:

$$I_s(p_1, p_2) = \frac{I(p_1, p_2) + I(p_2, p_1)}{2}.$$
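A minimal sketch of these two quantities for discrete distributions given as probability vectors (assumed strictly positive on a common support); the function names are illustrative.

```python
import numpy as np

def kl(p1, p2):
    """I(p1, p2) = sum_y p1(y) ln(p1(y) / p2(y)) for discrete distributions."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    return float(np.sum(p1 * np.log(p1 / p2)))

def kl_symmetric(p1, p2):
    """I_s(p1, p2) = [I(p1, p2) + I(p2, p1)] / 2."""
    return 0.5 * (kl(p1, p2) + kl(p2, p1))

# Two assumed discrete distributions over the same support.
p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.3, 0.5]
print("I(p1, p2)  =", kl(p1, p2))
print("I(p2, p1)  =", kl(p2, p1))
print("I_s(p1,p2) =", kl_symmetric(p1, p2))
```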

In the case of a Hidden Markov model,

$$p_i(\mathbf{y}) = p\!\left(\mathbf{y} \mid \lambda_i\right) = p\!\left(\mathbf{y} \mid \pi^0_i, \Pi_i, \theta_i\right)$$

where

$$p\!\left(\mathbf{y} \mid \pi^0, \Pi, \theta\right) = \sum_{i_1, i_2, \ldots, i_T} \pi^0_{i_1}\theta_{i_1 y_1}\,\pi_{i_1 i_2}\theta_{i_2 y_2}\cdots\pi_{i_{T-1} i_T}\theta_{i_T y_T}.$$

The computation of $I(p_1, p_2)$ in this case is formidable.

Juang and Rabiner distance

Let $\mathbf{Y}^{(i)} = \left(Y^{(i)}_1, Y^{(i)}_2, \ldots, Y^{(i)}_T\right)$ denote a sequence of observations generated from the HMM with parameters $\lambda_i = \left(\pi^0_i, \Pi_i, \theta_i\right)$.

Let $\mathbf{i}^{*(j)}(\mathbf{y}) = \left(i_1, i_2, \ldots, i_T\right)$ denote the optimal (Viterbi) sequence of states for the observations $\mathbf{y}$ assuming HMM model $\lambda_j = \left(\pi^0_j, \Pi_j, \theta_j\right)$.

Then define

$$D(\lambda_1, \lambda_2) \overset{\text{def}}{=} \lim_{T \to \infty} \frac{1}{T}\left[\ln p\!\left(\mathbf{Y}^{(1)}_T, \mathbf{i}^{*(1)}\!\left(\mathbf{Y}^{(1)}_T\right) \mid \lambda_1\right) - \ln p\!\left(\mathbf{Y}^{(1)}_T, \mathbf{i}^{*(2)}\!\left(\mathbf{Y}^{(1)}_T\right) \mid \lambda_2\right)\right]$$

and

$$D_s(\lambda_1, \lambda_2) = \frac{D(\lambda_1, \lambda_2) + D(\lambda_2, \lambda_1)}{2}.$$
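A Monte Carlo sketch of this distance for two assumed discrete-emission HMMs (all parameter values are made up): the limit is approximated by one long simulated sequence, and the Viterbi-path log probability is computed under each model.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two assumed discrete-emission HMMs, lambda1 and lambda2.
lam1 = dict(pi0=np.array([0.5, 0.5]),
            Pi=np.array([[0.9, 0.1], [0.2, 0.8]]),
            theta=np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]))
lam2 = dict(pi0=np.array([0.5, 0.5]),
            Pi=np.array([[0.6, 0.4], [0.4, 0.6]]),
            theta=np.array([[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]))

def sample(lam, T):
    """Generate an observation sequence of length T from the model."""
    x = rng.choice(2, p=lam["pi0"])
    y = np.empty(T, dtype=int)
    for t in range(T):
        y[t] = rng.choice(3, p=lam["theta"][x])
        x = rng.choice(2, p=lam["Pi"][x])
    return y

def viterbi_logprob(y, lam):
    """max over state sequences of ln P[X = i, Y = y] under the model."""
    logpi0, logPi, logth = np.log(lam["pi0"]), np.log(lam["Pi"]), np.log(lam["theta"])
    V = logpi0 + logth[:, y[0]]
    for t in range(1, len(y)):
        V = np.max(V[:, None] + logPi, axis=0) + logth[:, y[t]]
    return float(np.max(V))

def D(lam_a, lam_b, T=5000):
    """Juang-Rabiner style distance, approximated with a single long sequence."""
    y = sample(lam_a, T)
    return (viterbi_logprob(y, lam_a) - viterbi_logprob(y, lam_b)) / T

Ds = 0.5 * (D(lam1, lam2) + D(lam2, lam1))
print("symmetrized distance D_s(lambda1, lambda2) approx.:", Ds)
```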