Hidden Markov Models
Yves Moreau
Katholieke Universiteit Leuven
Regular expressions

Alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC

Regular expression: [AT][CG][AC][ACGT]*A[TG][GC]

Problem: the regular expression does not distinguish the exceptional sequence TGCTAGG from the consensus ACACATC
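As a quick check (a minimal sketch using Python's standard re module; the variable names are illustrative, not from the slides), both the consensus and the exceptional sequence match the expression, which is precisely the problem:

    import re

    # Profile regular expression from the slide.
    motif = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

    print(bool(motif.fullmatch("ACACATC")))  # consensus -> True
    print(bool(motif.fullmatch("TGCTAGG")))  # exceptional -> True as well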
Hidden Markov Models

[Figure: profile HMM for the alignment above. Emission probabilities per state:
state 1: A .8, T .2
state 2: C .8, G .2
state 3: A .8, C .2
insert state: A .2, C .4, G .2, T .2
state 4: A 1
state 5: G .2, T .8
state 6: C .8, G .2
Transitions along the main chain are 1.0, except state 3 -> insert (.6) and state 3 -> state 4 (.4); the insert state loops on itself with probability .4 and exits to state 4 with probability .6]
Sequence score

Score of the consensus sequence ACACATC, multiplying emission and transition probabilities along the path:

$P(\text{ACACATC}) = 0.8 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \cdot 0.6 \cdot 0.4 \cdot 0.6 \cdot 1 \cdot 1 \cdot 0.8 \cdot 1 \cdot 0.8 \approx 4.7 \times 10^{-2}$
Log odds

Use logarithms for scaling and normalize by a random background model

Log odds for a sequence S of length L:

$\log \text{odds}(S) = \log P(S) - L \log 0.25 = \log \frac{P(S)}{0.25^{L}}$

[Figure: the same HMM with log-odds scores. Emissions:
state 1: A 1.16, T -0.22
state 2: C 1.16, G -0.22
state 3: A 1.16, C -0.22
insert state: A -0.22, C 0.47, G -0.22, T -0.22
state 4: A 1.39
state 5: G -0.22, T 1.16
state 6: C 1.16, G -0.22
Transitions: 0 along the main chain, -0.51 for state 3 -> insert and insert -> state 4, -0.92 for state 3 -> state 4 and the insert self-loop]
Log odds

$\log \text{odds}(\text{ACACATC}) = 1.16 + 0 + 1.16 + 0 + 1.16 - 0.51 + 0.47 - 0.51 + 1.39 + 0 + 1.16 + 0 + 1.16 = 6.65$
Sequence Log odds
ACAC--ATC (consensus) 6.7
ACA---ATG 4.9
TCAACTATC 3.0
ACAC--AGC 5.3
AGA---ATC 4.9
ACCG--ATC 4.6
TGCT--AGG (exceptional) -0.97
Markov chain

Probabilistic model of a DNA sequence
Sequence: $x = x_1, x_2, \dots, x_L$ (e.g., with $x_i \in \{A, C, G, T\}$)
Transition probabilities: $a_{st} = P(x_i = t \mid x_{i-1} = s)$

[Figure: example of a Markov chain over the four states A, C, G, T with transitions between all pairs]
Markov property

Markov property: "The future depends only on the present, not on the past"

Probability of a sequence through Bayes' rule:

$P(x) = P(x_1, x_2, \dots, x_L) = P(x_L \mid x_{L-1}, \dots, x_1)\, P(x_{L-1} \mid x_{L-2}, \dots, x_1) \cdots P(x_1)$

With the Markov property this factorizes as:

$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$
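A minimal sketch of this factorization in log space (the function name and the uniform background model are assumptions, not from the slides):

    import numpy as np

    ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

    def markov_log_prob(seq, log_init, log_trans):
        # log P(x) = log P(x_1) + sum_{i>=2} log a_{x_{i-1} x_i}
        idx = [ALPHABET[s] for s in seq]
        logp = log_init[idx[0]]
        for prev, cur in zip(idx, idx[1:]):
            logp += log_trans[prev, cur]
        return logp

    # Uniform background: every base and transition has probability 0.25.
    log_init = np.log(np.full(4, 0.25))
    log_trans = np.log(np.full((4, 4), 0.25))
    print(markov_log_prob("ACACATC", log_init, log_trans))  # = 7 * log 0.25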
Beginning and end of a sequence

Computation of the probability is not homogeneous
Length distribution is not modeled: P(length = L) is left unspecified

Solution: model the beginning and the end of the sequence explicitly
The probability of observing a sequence of a given length then decreases with the length of the sequence

Sequence: $\emptyset, x_1, \dots, x_L, \emptyset$
Beginning: $a_{0s} = P(x_1 = s)$
End: $a_{t0} = P(\emptyset \mid x_L = t)$

[Figure: the Markov chain over A, C, G, T extended with begin and end states]
Hidden Markov Model

In a hidden Markov model, we observe the symbol sequence x but we want to reconstruct the hidden state sequence (path $\pi$)

Transition probabilities (beginning: $a_{0l}$, end: $a_{k0}$):

$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$

Emission probabilities:

$e_k(b) = P(x_i = b \mid \pi_i = k)$

Joint probability of the sequence $x = x_1, \dots, x_L$ and the path $\pi$ (with $\pi_{L+1} = 0$, the end state):

$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$
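A minimal sketch of this joint probability in log space (array layout and names are assumptions; the final end transition $a_{k0}$ is dropped for simplicity):

    import numpy as np

    def joint_log_prob(obs, path, log_start, log_trans, log_emit):
        # log P(x, pi) = log a_{0 pi_1} + sum_i [log e_{pi_i}(x_i) + log a_{pi_i pi_{i+1}}]
        logp = log_start[path[0]] + log_emit[path[0], obs[0]]
        for i in range(1, len(obs)):
            logp += log_trans[path[i - 1], path[i]] + log_emit[path[i], obs[i]]
        return logp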
Casino (I) – problem setup

The casino uses mostly a fair die but sometimes switches to a loaded die
We observe the outcome x of the successive throws but want to know when the die was fair or loaded (path $\pi$)

[Figure: two-state HMM. Fair state: faces 1-6 each with probability 1/6. Loaded state: faces 1-5 each with probability 1/10, face 6 with probability 1/2. Transitions: fair -> loaded 0.05, loaded -> fair 0.1, fair -> fair 0.95, loaded -> loaded 0.9]
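For the sketches below, the casino model can be written down directly (the 50/50 starting distribution is an assumption; the slide does not specify one):

    import numpy as np

    # State 0 = fair, state 1 = loaded.
    START = np.array([0.5, 0.5])               # assumed initial distribution
    TRANS = np.array([[0.95, 0.05],            # fair -> fair, fair -> loaded
                      [0.10, 0.90]])           # loaded -> fair, loaded -> loaded
    EMIT = np.array([[1/6] * 6,                # fair die: faces 1..6
                     [1/10] * 5 + [1/2]])      # loaded die: face 6 with prob 1/2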
Estimation of the sequence and state probabilities

The Viterbi algorithm

We look for the most probable path $\pi^*$:

$\pi^* = \arg\max_{\pi} P(x, \pi) = \arg\max_{\pi} P(\pi \mid x)$

This problem can be tackled by dynamic programming
Let us define $v_k(i)$ as the probability of the most probable path that ends in state k with the emission of symbol $x_i$:

$v_k(i) = \max_{\pi_1, \dots, \pi_{i-1}} P(x_1, \dots, x_i, \pi_1, \dots, \pi_{i-1}, \pi_i = k)$

Then we can compute this probability recursively as

$v_l(i+1) = e_l(x_{i+1}) \max_k \left( v_k(i)\, a_{kl} \right)$
The Viterbi algorithm

The Viterbi algorithm grows the best path dynamically
Initial condition: sequence in the beginning state
Traceback pointers to follow the best path (= decoding)

Initialization ($i = 0$): $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$
Recursion ($i = 1, \dots, L$): $v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right)$; $\mathrm{ptr}_i(l) = \arg\max_k \left( v_k(i-1)\, a_{kl} \right)$
Termination: $P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$; $\pi_L^* = \arg\max_k \left( v_k(L)\, a_{k0} \right)$
Traceback ($i = L, \dots, 1$): $\pi_{i-1}^* = \mathrm{ptr}_i(\pi_i^*)$
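A minimal log-space sketch of this recursion (names are illustrative; the explicit end state $a_{k0}$ is dropped, so termination simply takes the best final cell):

    import numpy as np

    def viterbi(obs, start, trans, emit):
        K, L = trans.shape[0], len(obs)
        ls, lt, le = np.log(start), np.log(trans), np.log(emit)
        v = np.full((L, K), -np.inf)         # v[i, k]: log prob of best path ending in k at i
        ptr = np.zeros((L, K), dtype=int)    # traceback pointers
        v[0] = ls + le[:, obs[0]]            # initialization
        for i in range(1, L):                # recursion
            for l in range(K):
                scores = v[i - 1] + lt[:, l]
                ptr[i, l] = np.argmax(scores)
                v[i, l] = le[l, obs[i]] + scores[ptr[i, l]]
        path = [int(np.argmax(v[-1]))]       # termination (no end state here)
        for i in range(L - 1, 0, -1):        # traceback
            path.append(int(ptr[i, path[-1]]))
        return path[::-1]

    # Example with the casino model above; obs are die faces minus 1 (0-based).
    rolls = np.array([0, 2, 5, 5, 5, 1, 5, 5, 4, 5])
    print(viterbi(rolls, START, TRANS, EMIT))  # 0 = fair, 1 = loaded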
Casino (II) – Viterbi

[Figure: Viterbi decoding of a sequence of die throws: predicted fair/loaded states versus the true states]
The forward algorithm

The forward algorithm lets us compute the probability P(x) of a sequence w.r.t. an HMM:

$P(x) = \sum_{\pi} P(x, \pi)$

This is important for the computation of posterior probabilities and the comparison of HMMs
The sum over all paths (exponentially many) can be computed by dynamic programming
Let us define $f_k(i)$ as the probability of the sequence up to symbol $x_i$ for the paths that end in state k:

$f_k(i) = P(x_1, \dots, x_i, \pi_i = k)$

Then we can compute this probability recursively as

$f_l(i+1) = e_l(x_{i+1}) \sum_k f_k(i)\, a_{kl}$
The forward algorithm

The forward algorithm grows the total probability dynamically from the beginning to the end of the sequence
Initial condition: sequence in the beginning state
End: all states converge to the end state

Initialization ($i = 0$): $f_0(0) = 1$, $f_k(0) = 0$ for $k > 0$
Recursion ($i = 1, \dots, L$): $f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$
End: $P(x) = \sum_k f_k(L)\, a_{k0}$
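A minimal sketch with per-position rescaling (a common way to avoid underflow; the end transitions $a_{k0}$ are again omitted, which amounts to assuming they are uniform):

    import numpy as np

    def forward_logprob(obs, start, trans, emit):
        f = start * emit[:, obs[0]]              # f_k(1)
        logp = 0.0
        for o in obs[1:]:
            c = f.sum()                          # rescale to avoid underflow
            logp += np.log(c)
            f = (f / c) @ trans * emit[:, o]     # f_l(i) = e_l(x_i) sum_k f_k(i-1) a_kl
        return logp + np.log(f.sum())            # log P(x)

    print(forward_logprob(rolls, START, TRANS, EMIT))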
The backward algorithm

The backward algorithm lets us compute the probability of the complete sequence together with the condition that symbol $x_i$ is emitted from state k:

$P(x, \pi_i = k) = P(x_1, \dots, x_i, \pi_i = k)\, P(x_{i+1}, \dots, x_L \mid x_1, \dots, x_i, \pi_i = k) = P(x_1, \dots, x_i, \pi_i = k)\, P(x_{i+1}, \dots, x_L \mid \pi_i = k)$

This is important to compute the probability of a given state at symbol $x_i$
$P(x_1, \dots, x_i, \pi_i = k)$ can be computed by the forward algorithm: it is $f_k(i)$
Let us define $b_k(i)$ as the probability of the rest of the sequence for the paths that pass through state k at symbol $x_i$:

$b_k(i) = P(x_{i+1}, \dots, x_L \mid \pi_i = k)$
The backward algorithm

The backward algorithm grows the probability $b_k(i)$ dynamically backwards (from end to beginning)
Border condition: start in the end state

Initialization ($i = L$): $b_k(L) = a_{k0}$ for all k
Recursion ($i = L-1, \dots, 1$): $b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$
Termination: $P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$

Once both forward and backward probabilities are available, we can compute the posterior probability of the state:

$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}$
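A compact sketch computing both matrices and the posterior (unscaled probabilities for clarity, so suitable for short sequences only; $b_k(L) = 1$ assumes no modeled end state):

    import numpy as np

    def forward_backward(obs, start, trans, emit):
        K, L = trans.shape[0], len(obs)
        f = np.zeros((L, K))
        b = np.zeros((L, K))
        f[0] = start * emit[:, obs[0]]                       # forward initialization
        for i in range(1, L):
            f[i] = emit[:, obs[i]] * (f[i - 1] @ trans)      # forward recursion
        b[L - 1] = 1.0                                       # backward init (a_k0 = 1 assumed)
        for i in range(L - 2, -1, -1):
            b[i] = trans @ (emit[:, obs[i + 1]] * b[i + 1])  # backward recursion
        px = f[L - 1].sum()                                  # P(x)
        return f, b, f * b / px                              # posterior P(pi_i = k | x)

    f, b, post = forward_backward(rolls, START, TRANS, EMIT)
    print(post[:, 0])  # posterior probability of "fair" at every position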
Posterior decoding

Instead of using the most probable path for decoding (Viterbi), we can use the path of the most probable states:

$\hat{\pi}_i = \arg\max_k P(\pi_i = k \mid x)$

The path $\hat{\pi}$ can be "illegal" ($P(\hat{\pi} \mid x) = 0$)
This approach can also be used when we are interested in a function g(k) of the state (e.g., labeling):

$G(i \mid x) = \sum_k P(\pi_i = k \mid x)\, g(k)$
Casino (III) – posterior decoding
Posterior probability of the state “fair” w.r.t. the die throws
Casino (IV) – posterior decoding

New situation: $P(\pi_{i+1} = \mathrm{FAIR} \mid \pi_i = \mathrm{FAIR}) = 0.99$
Viterbi decoding cannot detect the cheating from 1000 throws, while posterior decoding does

[Figure: modified two-state HMM. Fair state: faces 1-6 each with probability 1/6. Loaded state: faces 1-5 each with probability 1/10, face 6 with probability 1/2. Transitions: fair -> loaded 0.01, loaded -> fair 0.1, fair -> fair 0.99, loaded -> loaded 0.9]
Parameter estimation for HMMs
Choice of the architecture
For the parameter estimation, we assume that the architecture of the HMM is known
The choice of architecture is an essential design decision
Duration modeling
“Silent states” for gaps
Parameter estimation with known paths

HMM with parameters $\theta$ (transition and emission probabilities)
Training set D of N sequences $x^1, \dots, x^N$
The score of the model is the likelihood of the parameters given the training data:

$\mathrm{Score}(\mathcal{D}, \theta) = \log P(x^1, \dots, x^N \mid \theta) = \sum_{j=1}^{N} \log P(x^j \mid \theta)$
Parameter estimation with known paths

If the state paths are known, the parameters are estimated through counts (how often is a transition used, how often is a symbol produced by a given state)
Use 'pseudocounts' if necessary:
$A_{kl}$ = number of transitions from k to l in the training set + pseudocount $r_{kl}$
$E_k(b)$ = number of emissions of b from k in the training set + pseudocount $r_k(b)$

$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$ and $e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$
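A minimal sketch of these counts (function and argument names are illustrative; K states, M symbols, and seqs/paths are lists of index sequences):

    import numpy as np

    def estimate_known_paths(seqs, paths, K, M, r=1.0):
        A = np.full((K, K), r)                  # transition counts + pseudocount
        E = np.full((K, M), r)                  # emission counts + pseudocount
        for x, pi in zip(seqs, paths):
            for i in range(len(x)):
                E[pi[i], x[i]] += 1
                if i > 0:
                    A[pi[i - 1], pi[i]] += 1
        a = A / A.sum(axis=1, keepdims=True)    # a_kl = A_kl / sum_l' A_kl'
        e = E / E.sum(axis=1, keepdims=True)    # e_k(b) = E_k(b) / sum_b' E_k(b')
        return a, e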
Parameter estimation with unknown paths: Viterbi training

Strategy: iterative method
Suppose that the parameters are known and find the best paths with Viterbi decoding
Use these paths to re-estimate the parameters
Iterate until convergence

Viterbi training does not maximize the likelihood of the parameters
Viterbi training converges exactly in a finite number of steps

$\theta^{\mathrm{Vit}} = \arg\max_{\theta} P(x^1, \dots, x^N \mid \theta, \pi^*(x^1), \dots, \pi^*(x^N))$
Parameter estimation with unknown paths: Baum-Welch training

Strategy: parallel to Viterbi training, but we use the expected values of the transition and emission counts (instead of counts from the best path only)

For the transitions:

$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$

$A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1)$

For the emissions:

$E_k(b) = \sum_j \sum_{\{i \mid x_i^j = b\}} P(\pi_i = k \mid x^j, \theta) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)$
Parameter estimation with unknown paths: Baum-Welch training

Initialization: choose arbitrary model parameters
Recursion:
Set all transition and emission variables to their pseudocounts
For all sequences j = 1, ..., N:
Compute $f_k(i)$ for sequence j with the forward algorithm
Compute $b_k(i)$ for sequence j with the backward algorithm
Add the contributions to A and E
Compute the new model parameters $a_{kl} = A_{kl} / \sum_{l'} A_{kl'}$ and $e_k(b) = E_k(b) / \sum_{b'} E_k(b')$
Compute the log-likelihood of the model
End: stop when the change in log-likelihood falls below some threshold or when the maximum number of iterations is exceeded
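One iteration of this recursion as a sketch, reusing the unscaled forward_backward above (so again only suitable for short sequences; r is a pseudocount):

    import numpy as np

    def baum_welch_step(seqs, start, trans, emit, r=0.1):
        K, M = emit.shape
        A = np.full((K, K), r)                  # expected transition counts
        E = np.full((K, M), r)                  # expected emission counts
        for x in seqs:
            f, b, post = forward_backward(x, start, trans, emit)
            px = f[-1].sum()
            for i in range(len(x) - 1):
                # A_kl += f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)
                A += np.outer(f[i], emit[:, x[i + 1]] * b[i + 1]) * trans / px
            for i, o in enumerate(x):
                E[:, o] += post[i]              # E_k(b) += P(pi_i = k | x)
        return A / A.sum(1, keepdims=True), E / E.sum(1, keepdims=True)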
Casino (V) – Baum-Welch training

[Figure: original model versus models estimated by Baum-Welch.

Original model:
fair: faces 1-6 each 1/6; loaded: faces 1-5 each 1/10, face 6: 1/2
transitions: fair -> loaded 0.05, loaded -> fair 0.1, fair -> fair 0.95, loaded -> loaded 0.9

Estimated from 300 throws:
fair: 1: 0.19, 2: 0.19, 3: 0.23, 4: 0.08, 5: 0.23, 6: 0.08; loaded: 1: 0.07, 2: 0.10, 3: 0.10, 4: 0.17, 5: 0.05, 6: 0.52
transitions: fair -> loaded 0.27, loaded -> fair 0.29, fair -> fair 0.73, loaded -> loaded 0.71

Estimated from 30000 throws:
fair: 1: 0.17, 2: 0.17, 3: 0.17, 4: 0.17, 5: 0.17, 6: 0.15; loaded: 1: 0.10, 2: 0.11, 3: 0.10, 4: 0.11, 5: 0.10, 6: 0.48
transitions: fair -> loaded 0.07, loaded -> fair 0.12, fair -> fair 0.93, loaded -> loaded 0.88]
Numerical stability

Many expressions contain products of many probabilities
This causes underflow when we compute these expressions
For Viterbi, this can be solved by working with logarithms
For the forward and backward algorithms, we can work with an approximation to the logarithm of a sum or with rescaled variables
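For completeness, the standard log-sum-exp trick used when summing probabilities in log space (a generic sketch, not specific to these slides):

    import numpy as np

    def logsumexp(z):
        # log(sum_k exp(z_k)) without overflow: factor out the maximum
        m = np.max(z)
        return m + np.log(np.sum(np.exp(z - m)))

    # A log-space forward recursion then reads:
    #   lf[i, l] = log_emit[l, obs[i]] + logsumexp(lf[i - 1] + log_trans[:, l])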
Summary

Hidden Markov Models
Computation of sequence and state probabilities
Viterbi computation of the best state path
The forward algorithm for the computation of the probability of a sequence
The backward algorithm for the computation of state probabilities
Parameter estimation for HMMs
Parameter estimation with known paths
Parameter estimation with unknown paths
Viterbi training
Baum-Welch training