PGM 2002/03 Tirgul 1 Hidden Markov Models

Page 1

PGM 2002/03 Tirgul 1

Hidden Markov Models

Page 2

Introduction

Hidden Markov Models (HMMs) are one of the most common forms of probabilistic graphical models, although they were developed long before the notion of general graphical models existed (1913). They are used to model time-invariant, limited-horizon processes that have both an underlying mechanism (hidden states) and an observable consequence. They have been extremely successful in language modeling and speech recognition systems and are still the most widely used technique in these domains.

Page 3

Markov Models

A Markov process or model assumes that we can predict the future based just on the present (or on a limited horizon into the past):

Let {X_1,…,X_T} be a sequence of random variables taking values in {1,…,N}; then the Markov properties are:

Limited Horizon: $P(X_{t+1} = i \mid X_1, \dots, X_t) = P(X_{t+1} = i \mid X_t)$

Time invariant (stationary): $P(X_{t+1} = i \mid X_t) = P(X_2 = i \mid X_1)$

Page 4

Describing a Markov Chain

A Markov chain can be described by the transition matrix A and the initial state probabilities Q:

$a_{ij} = P(X_{t+1} = j \mid X_t = i) \qquad q_i = P(X_1 = i)$

or alternatively, as a state-transition diagram:

[Figure: a 4-state transition diagram with edge probabilities 0.6, 0.4, 0.3, 0.7, and 1.0]

and we calculate:

$P(X_1,\dots,X_T) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_T \mid X_{T-1}) = q_{X_1} \prod_{t=1}^{T-1} A(X_t, X_{t+1})$
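To make the calculation concrete, here is a minimal Python sketch (the function name chain_prob and the 2-state example chain are mine, not from the slides):

```python
import numpy as np

def chain_prob(A, q, X):
    """P(X_1,...,X_T) = q_{X_1} * prod_t A(X_t, X_{t+1}),
    with states given as 0-based indices."""
    p = q[X[0]]
    for t in range(len(X) - 1):
        p *= A[X[t], X[t + 1]]          # one factor per transition
    return p

# A hypothetical 2-state chain, just to exercise the function:
A = np.array([[0.6, 0.4],
              [0.3, 0.7]])
q = np.array([1.0, 0.0])
print(chain_prob(A, q, [0, 0, 1, 1]))   # 1.0 * 0.6 * 0.4 * 0.7 = 0.168
```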

Page 5

Hidden Markov Models

In a Hidden Markov Model (HMM) we do not observe the sequence that the model passed through, but only some probabilistic function of it. Thus, it is a Markov model with the addition of emission probabilities:

For example:

Observed: The House is on fire

States: def noun verb prep noun

$b_{ik} = P(Y_t = k \mid X_t = i)$

Page 6

Why use HMMs?

• A lot of real-life processes are composed of underlying events generating surface phenomena. Tagging parts of speech is a common example.

• We can usually think of processes as having a limited horizon (we can easily extend to the case of a constant horizon larger than 1)

• We have an efficient training algorithm using EM

• Once the model is set, we can easily run it (a runnable sketch follows below):

t = 1, start in state i with probability q_i

forever:
  move from state i to state j with probability a_ij
  emit y_t = k with probability b_ik
  t = t + 1
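A minimal Python sketch of this generative loop, run for a finite T rather than forever (sample_hmm and the use of NumPy's Generator are my choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility

def sample_hmm(A, B, q, T):
    """Run the model: pick a start state from q, then alternate
    emitting a symbol from B and moving via A."""
    N, K = B.shape
    states, obs = [], []
    x = rng.choice(N, p=q)                  # t=1: start in state i with probability q_i
    for _ in range(T):
        states.append(x)
        obs.append(rng.choice(K, p=B[x]))   # emit y_t = k with probability b_{ik}
        x = rng.choice(N, p=A[x])           # move from i to j with probability a_ij
    return states, obs
```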

Page 7

The fundamental questions

Likelihood: Given a model λ = (A,B,Q), how do we efficiently compute the likelihood of an observation, P(Y|λ)?

Decoding: Given the observation sequence Y and a model λ, what state sequence explains it best (MPE)? This is, for example, the tagging process of an observed sentence.

Learning: Given an observation sequence Y, and a generic model, how do we estimate the parameters that define the best model to describe the data?

Page 8

Computing the Likelihood

For any state sequence (X_1,…,X_T):

$P(Y \mid X) = \prod_{t=1}^{T} P(y_t \mid X_t) = b_{x_1 y_1} b_{x_2 y_2} \cdots b_{x_T y_T}$

$P(X_1, \dots, X_T) = q_{x_1} a_{x_1 x_2} a_{x_2 x_3} \cdots a_{x_{T-1} x_T}$

using P(Y,X) = P(Y|X) P(X) we get:

$P(Y) = \sum_X P(Y, X) = \sum_X P(Y \mid X)\, P(X) = \sum_X q_{x_1} \prod_{t=1}^{T-1} a_{x_t x_{t+1}} \prod_{t=1}^{T} b_{x_t y_t}$

But, we have O(T N^T) multiplications!
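For concreteness, a brute-force Python sketch that computes P(Y) exactly this way; the name likelihood_brute_force is mine, and the enumeration makes the O(T N^T) blow-up explicit:

```python
import numpy as np
from itertools import product

def likelihood_brute_force(A, B, q, y):
    """Sum P(Y, X) = P(Y|X) P(X) over every one of the N^T state sequences."""
    N, T = len(q), len(y)
    total = 0.0
    for X in product(range(N), repeat=T):
        p = q[X[0]]
        for t in range(T - 1):
            p *= A[X[t], X[t + 1]]      # q_{x_1} * prod_t a_{x_t x_{t+1}}
        for t in range(T):
            p *= B[X[t], y[t]]          # prod_t b_{x_t y_t}
        total += p
    return total
```

This is only feasible for tiny N and T, which is exactly the point of the trellis algorithm on the next slide.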

Page 9

The trellis (lattice) algorithm

To compute the likelihood, we need to enumerate over all paths in the lattice (all possible instantiations of X_1…X_T).

But… some starting subpath (blue) is common to many continuing paths (blue+red)

Idea: using dynamic programming, calculate a path in terms of shorter sub-paths

Page 10

The trellis (lattice) algorithm

We build a matrix of the probability of being at state i at time t, α_t(i) = P(x_t = i, y_1 y_2 … y_{t-1}), where each column is a function of the previous one (forward procedure):

[Figure: one step of the trellis; node j at time t+1 sums the incoming terms α_t(1) a_{1j} b_{1 y_t}, α_t(2) a_{2j} b_{2 y_t}, …, α_t(N) a_{Nj} b_{N y_t}]

$\alpha_1(i) = q_i$

$\alpha_{t+1}(i) = \sum_{j=1}^{N} \alpha_t(j)\, a_{ji}\, b_{j y_t}$

$P(Y) = \sum_{i=1}^{N} \alpha_{T+1}(i)$
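A Python sketch of the forward procedure under the slides' convention, where α_t(i) excludes the emission of y_t (so the factor b_{j y_t} is applied while moving from column t to column t+1); forward is my name for it:

```python
import numpy as np

def forward(A, B, q, y):
    """Trellis forward pass. Row t of the result holds alpha_{t+1} of the
    slides (0-based storage); P(Y) is the sum of the last row."""
    T, N = len(y), len(q)
    alpha = np.zeros((T + 1, N))
    alpha[0] = q                                  # alpha_1(i) = q_i
    for t in range(T):
        # alpha_{t+1}(i) = sum_j alpha_t(j) * b_{j, y_t} * a_{ji}
        alpha[t + 1] = (alpha[t] * B[:, y[t]]) @ A
    return alpha
```

Each of the T steps costs O(N^2), so the whole table takes O(T N^2) work instead of O(T N^T).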

Page 11

The trellis (lattice) algorithm (cont.)

We can similarly define a backward procedure for filling the matrix from the end:

$\beta_t(i) = P(y_{t+1} \cdots y_T \mid X_t = i)$

$\beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_{j y_{t+1}}\, \beta_{t+1}(j)$

$P(Y) = \sum_{i=1}^{N} q_i\, b_{i y_1}\, \beta_1(i)$

And we can easily combine:

$P(Y, X_t = i) = P(y_1 \cdots y_T, X_t = i) = P(y_1 \cdots y_{t-1}, X_t = i)\, P(y_t \cdots y_T \mid X_t = i, y_1 \cdots y_{t-1})$
$= P(y_1 \cdots y_{t-1}, X_t = i)\, P(y_t \cdots y_T \mid X_t = i) = \alpha_t(i)\, b_{i y_t}\, \beta_t(i)$
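A matching backward sketch under the same convention (the names are again my choosing; the combination check reuses the forward function sketched on the previous slide):

```python
import numpy as np

def backward(A, B, y):
    """Trellis backward pass: row t-1 holds beta_t of the slides.
    beta_T(i) = 1 and beta_t(i) = sum_j a_{ij} b_{j, y_{t+1}} beta_{t+1}(j)."""
    T, N = len(y), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta

def joint_with_state(A, B, q, y, t):
    """P(Y, X_t = i) = alpha_t(i) * b_{i, y_t} * beta_t(i); t is 1-based.
    The result sums to P(Y) for every choice of t."""
    alpha, beta = forward(A, B, q, y), backward(A, B, y)
    return alpha[t - 1] * B[:, y[t - 1]] * beta[t - 1]
```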

Page 12

Finding the best state sequence

We would like to find the most likely path (and not just the most likely state at each time slice).

The Viterbi algorithm is an efficient trellis method for finding the MPE:

$\hat{X} = \arg\max_X P(X \mid Y) = \arg\max_X P(X, Y)$

$\delta_1(i) = q_i$

$\delta_{t+1}(i) = \max_{j} \delta_t(j)\, a_{ji}\, b_{j y_t}, \qquad \gamma_{t+1}(i) = \arg\max_{j} \delta_t(j)\, a_{ji}\, b_{j y_t}$

$P(\hat{X}) = \max_i \delta_{T+1}(i), \qquad \hat{X}_{T+1} = \arg\max_i \delta_{T+1}(i)$

and we reconstruct the path:

$\hat{X}_t = \gamma_{t+1}(\hat{X}_{t+1})$
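A Python sketch of Viterbi in the same convention (delta and psi are my array names; psi plays the role of γ):

```python
import numpy as np

def viterbi(A, B, q, y):
    """Return the most likely state sequence x_1..x_T and its probability."""
    T, N = len(y), len(q)
    delta = np.zeros((T + 1, N))
    psi = np.zeros((T + 1, N), dtype=int)     # psi = gamma of the slides
    delta[0] = q                              # delta_1(i) = q_i
    for t in range(T):
        # scores[j, i] = delta_t(j) * a_{ji} * b_{j, y_t}
        scores = delta[t][:, None] * A * B[:, y[t]][:, None]
        delta[t + 1] = scores.max(axis=0)
        psi[t + 1] = scores.argmax(axis=0)
    path = np.zeros(T + 1, dtype=int)
    path[T] = delta[T].argmax()               # X_{T+1} = argmax_i delta_{T+1}(i)
    for t in range(T - 1, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]     # X_t = gamma_{t+1}(X_{t+1})
    return path[:T], delta[T].max()
```

The structure is identical to the forward pass with the sum replaced by max, plus the γ bookkeeping needed for backtracking.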

Page 13

The Casino HMM

A casino switches from a fair die (state F) to a loaded one (state U) with probability 0.05, and the other way around with probability 0.1. The game starts with the fair die with probability 0.95 and with the loaded one with probability 0.05. The casino, honestly, reads off the number that was rolled.

$A = \begin{pmatrix} 0.95 & 0.05 \\ 0.1 & 0.9 \end{pmatrix} \qquad B = \begin{pmatrix} \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{1}{6} \\ \tfrac{1}{10} & \tfrac{1}{10} & \tfrac{1}{10} & \tfrac{1}{10} & \tfrac{1}{10} & \tfrac{1}{2} \end{pmatrix}$

$P(3151166661,\, FFFFFUUUUU) = 0.95 \cdot \left(\tfrac{1}{6}\right)^{5} \cdot (0.95)^{4} \cdot 0.05 \cdot (0.9)^{4} \cdot \left(\tfrac{1}{2}\right)^{4} \cdot \tfrac{1}{10}$

Page 14

The Casino HMM (cont.)

What is the likelihood of 3151166661?

Y= 3 1 5 1 1 6 6 6 6 1

α_1(1)=0.95, α_1(2)=0.05

α_2(1)=0.95·0.95·1/6 + 0.05·0.1·1/10 = 0.1509

α_2(2)=0.95·0.05·1/6 + 0.05·0.9·1/10 = 0.0124

α_3(1)=0.0240, α_3(2)=0.0025

α_4(1)=0.0038, α_4(2)=0.0004

α_5(1)=0.0006, α_5(2)=0.0001

… all smaller than 0.0001!
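These numbers can be reproduced with the forward sketch from the trellis slide (die faces shifted to 0-based column indices):

```python
import numpy as np

A = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.array([[1/6] * 6,                # fair die
              [0.1] * 5 + [0.5]])       # loaded die
q = np.array([0.95, 0.05])
y = [d - 1 for d in (3, 1, 5, 1, 1, 6, 6, 6, 6, 1)]

alpha = forward(A, B, q, y)             # forward() as sketched earlier
print(alpha[1])                         # ~[0.1509, 0.0124], as above
print(alpha[len(y)].sum())              # P(3151166661)
```

The quickly shrinking values are why practical implementations work in log space or rescale each column.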

Page 15

The Casino HMM (cont.)

What explains 3151166661 best?

Y= 3 1 5 1 1 6 6 6 6 1

δ_1(1)=0.95, δ_1(2)=0.05

δ_2(1)=max(0.95·0.95·1/6, 0.05·0.1·1/10) = 0.1504

δ_2(2)=max(0.95·0.05·1/6, 0.05·0.9·1/10) = 0.0079

δ_3(1)=0.0238, δ_3(2)=0.0013

δ_4(1)=0.0006, δ_4(2)=0.0002

δ_5(1)=0.0001, δ_5(2)=0.0000…

Page 16

The Casino HMM (cont.)

An example of reconstruction using Viterbi (Durbin):

Rolls 3151162464466442453113216311641521336

Die 0000000000000000000000000000000000000

Viterbi 0000000000000000000000000000000000000

Rolls 2514454363165662656666665116645313265

Die 0000000011111111111111111111100000000

Viterbi 0000000000011111111111111111100000000

Rolls 1245636664631636663162326455236266666

Die 0000111111111111111100011111111111111

Viterbi 0000111111111111111111111111111111111

Page 17

Learning

If we were given both X and Y, we could choose

$(A, B, Q) = \arg\max_{A,B,Q} P(Y_{\text{training}}, X_{\text{training}} \mid A, B, Q)$

Using the Maximum Likelihood principle, we simply set each parameter to the corresponding relative frequency.

What do we do when we have only Y?

$(A, B, Q) = \arg\max_{A,B,Q} P(Y_{\text{training}} \mid A, B, Q)$

ML here does not have a closed-form formula!

Page 18

EM (Baum-Welch)

Idea: use the current guess to complete the data and re-estimate.

Thm: Likelihood of observables never decreases!!! (to be proved later in the course)

Problem: gets stuck at sub-optimal solutions.

E-Step: “guess” X using Y and the current parameters.

M-Step: re-estimate the parameters using the current completion of the data.

Page 19

Parameter Estimation We define the expected number of transitions from state i to j at time t:

$p_t(i,j) = P(X_t = i, X_{t+1} = j \mid Y) = \frac{\alpha_t(i)\, b_{i y_t}\, a_{ij}\, b_{j y_{t+1}}\, \beta_{t+1}(j)}{\sum_m \sum_n \alpha_t(m)\, b_{m y_t}\, a_{mn}\, b_{n y_{t+1}}\, \beta_{t+1}(n)}$

The expected # of transitions from i to j in Y is then

$\sum_{t=1}^{T-1} p_t(i,j)$
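A Python sketch of p_t(i,j) built on the forward/backward sketches from the trellis slides (pairwise_posteriors is my name; the normalizer num.sum() equals P(Y)):

```python
import numpy as np

def pairwise_posteriors(A, B, q, y):
    """p[t-1, i, j] = p_t(i, j) = P(X_t = i, X_{t+1} = j | Y), t = 1..T-1."""
    T, N = len(y), len(q)
    alpha, beta = forward(A, B, q, y), backward(A, B, y)
    p = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # alpha_t(i) * b_{i, y_t} * a_{ij} * b_{j, y_{t+1}} * beta_{t+1}(j)
        num = (alpha[t] * B[:, y[t]])[:, None] * A * (B[:, y[t + 1]] * beta[t + 1])
        p[t] = num / num.sum()                # normalize by P(Y)
    return p
```

The expected transition counts of the slide are then p.sum(axis=0).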

Page 20

Parameter Estimation (cont.)

We use the EM re-estimation formulas with the expected counts we already have:

$\hat{q}_i = \text{expected frequency in state } i \text{ at time } 1 = \sum_{j} p_1(i,j)$

$\hat{a}_{ij} = \frac{\text{expected \# of transitions from state } i \text{ to } j}{\text{expected \# of transitions from state } i} = \frac{\sum_t p_t(i,j)}{\sum_t \sum_j p_t(i,j)}$

$\hat{b}_{ik} = \frac{\text{expected \# of emissions of } k \text{ from state } i}{\text{expected \# of emissions from state } i} = \frac{\sum_{t:\, y_t = k} \sum_j p_t(i,j)}{\sum_t \sum_j p_t(i,j)}$
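A sketch of one full re-estimation step on a single training sequence, using pairwise_posteriors from the previous slide. Note that, exactly as in the formulas above, all sums run over t = 1..T-1, so the final emission y_T is not counted (function and variable names are mine):

```python
import numpy as np

def reestimate(A, B, q, y):
    """One EM (Baum-Welch) update of (q, A, B)."""
    N, K = B.shape
    p = pairwise_posteriors(A, B, q, y)       # p[t-1, i, j] = p_t(i, j)
    q_new = p[0].sum(axis=1)                  # sum_j p_1(i, j)
    A_new = p.sum(axis=0)                     # expected transition counts
    A_new /= A_new.sum(axis=1, keepdims=True)
    gamma = p.sum(axis=2)                     # expected presence in state i at time t
    B_new = np.zeros((N, K))
    for t in range(len(y) - 1):
        B_new[:, y[t]] += gamma[t]            # expected emissions of y_t from each state
    B_new /= B_new.sum(axis=1, keepdims=True)
    return q_new, A_new, B_new
```

Iterating reestimate from an initial guess is the Baum-Welch loop; by the theorem on the previous slide, each iteration never decreases P(Y).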

Page 21

Application: Sequence Pair Alignment

DNA sequences are “strings” over a four-letter alphabet {C,G,A,T}.

Real-life sequences often differ by way of mutation of a letter or even a deletion. A fundamental problem in computational biology is to align such sequences:

Input: Output:

CTTTACGTTACTTACG CTTTACGTTAC-TTACG

CTTAGTTACGTTAG C-TTA-GTAACGTTA-G

How can we use an HMM for such a task?

Page 22

Sequence Pair Alignment (cont.)

We construct an HMM with 3 states that emits pairs of letters:

(M)atch: probability of emission of aligned pairs (high probability for matching letters, low for a mismatch)

(D)elete1: Emission of a letter in the first sequence and an insert in the second sequence

(D)elete2: The converse of D1

Transition matrix A:

      M     D1    D2
M     0.9   0.05  0.05
D1    0.95  0.05  0
D2    0.95  0     0.05

Initial probabilities Q:

M     0.9
D1    0.05
D2    0.05

Matrix B: if in M, emit the same letter with probability 0.24; if in D1 or D2, emit all letters uniformly.

Page 23

Sequence Pair Alignment (cont.)

How do we align 2 new sequences?

We also need (B)egin and (E)nd states to signify the start and end of the sequence.

• From B we will have the same transition probabilities as from M, and we will never return to it.

• We will have a probability of 0.05 to reach E from any state, and we will never leave it. This probability determines the average length of an alignment.

We now just do Viterbi with some extra technical details (Programming ex1)

Page 24

Extensions

HMMs have been used so extensively that it is impossible to even begin on the many forms they take. Several extensions, however, are worth mentioning:

• using different kinds of transition matrices

• using continuous observations

• using a larger horizon

• assigning probabilities to wait times

• using several training sets

• …