
Page 1: Class Hidden Markov Models

1

Class Hidden Markov Models

Page 2: Class Hidden Markov Models

2

Markov Chains: First order Markov process

• set of states, Si

• transition probabilities, aij

○ the probability of each state is dependent only on the previous state

○ trivial constraints (transitions must be true probabilities): $a_{ij} \ge 0 \;\wedge\; \sum_{j=1}^{N} a_{ij} = 1$

[State diagram: four states A, C, G, T, with the transition arrows leaving A labeled a_AA, a_AC, a_AG, a_AT]

Transition probabilities (rows = current state i, columns = next state j):

       A     C     G     T
  A  0.18  0.27  0.43  0.12
  C  0.17  0.37  0.27  0.19
  G  0.16  0.34  0.38  0.12
  T  0.08  0.36  0.38  0.18

$a_{ij} = P(s_t = S_j \mid s_{t-1} = S_i)$

Page 3: Class Hidden Markov Models

3

Markov Chains: Probabilities

• for a sequence of observed states O = (s_0, s_1, …, s_T), in general

  $P(O) = P(s_T \mid s_{T-1}, \ldots, s_0)\, P(s_{T-1} \mid s_{T-2}, \ldots, s_0) \cdots P(s_0)$

  for the Markov chain

  $P(O) = P(s_0) \prod_{t=1}^{T} a_{s_{t-1} s_t}$
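Not part of the original slides, but a minimal Python sketch of this calculation may help. It uses the dinucleotide transition matrix from the previous slide; the uniform initial distribution is an assumption of the sketch.

    import math

    states = "ACGT"
    a = {  # a[i][j] = P(next = j | current = i), from the transition table above
        "A": {"A": 0.18, "C": 0.27, "G": 0.43, "T": 0.12},
        "C": {"A": 0.17, "C": 0.37, "G": 0.27, "T": 0.19},
        "G": {"A": 0.16, "C": 0.34, "G": 0.38, "T": 0.12},
        "T": {"A": 0.08, "C": 0.36, "G": 0.38, "T": 0.18},
    }

    def log_prob(seq, start=None):
        # log P(O) = log P(s_0) + sum_t log a[s_{t-1}][s_t]
        start = start or {s: 0.25 for s in states}  # uniform start (assumed)
        lp = math.log(start[seq[0]])
        for prev, cur in zip(seq, seq[1:]):
            lp += math.log(a[prev][cur])
        return lp

    print(log_prob("CGCTTA"))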

Page 4: Class Hidden Markov Models

4

Markov Chains: What is it good for?

• Model comparison – for a given model (set of transition probabilities), what is the probability of seeing the observed data?
  ○ The best model (maximum likelihood model) can be found by simply counting the observed transitions in a sufficiently large set of data (see the sketch after this list)
• Generating random sequences according to specific background models
• Birth/death processes – probability of extinction
• Branching processes (trees)
• Markov chain Monte Carlo (MCMC)
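A minimal sketch of that counting estimate, assuming the training data is a list of DNA strings; the function and variable names are illustrative, not from the slides.

    from collections import Counter

    def fit_markov_chain(seqs, alphabet="ACGT"):
        # count observed transitions i -> j over all training sequences
        counts = Counter()
        for seq in seqs:
            for prev, cur in zip(seq, seq[1:]):
                counts[(prev, cur)] += 1
        # normalize each row so that sum_j a[i][j] = 1
        a = {}
        for i in alphabet:
            row_total = sum(counts[(i, j)] for j in alphabet)
            a[i] = {j: counts[(i, j)] / row_total if row_total else 1.0 / len(alphabet)
                    for j in alphabet}
        return a

    print(fit_markov_chain(["CGCTTAGCTATCGCATTCGG"]))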

Page 5: Class Hidden Markov Models

5

Hidden Markov Model: A more complicated model

• let each state emit a character, c_k, according to a set of emission probabilities, b_ik, where $b_{ik} = P(c_k \mid S_i)$

• For a set of observed characters O = (o_0, …, o_T) and states (s_0, …, s_T), $P(o_t) = b_{s_t o_t}\, P(s_t)$

Page 6: Class Hidden Markov Models

6

Hidden Markov Model: Two state model for sequences

• maybe S0 is exon and S1 is intron

• what if you have a set of observed characters, but you want to know what the state is (or the most likely state)

• the state information is the hidden part of the hidden Markov model

S0: P(A) = 0.148, P(C) = 0.334, P(G) = 0.365, P(T) = 0.154

S1: P(A) = 0.262, P(C) = 0.247, P(G) = 0.238, P(T) = 0.253

CGCTTAGCTATCGCATTCGGCTACGATCTAGCTACGTAGCTATGCCGATGCATTATTACGCGATCTCGATCG

C G C T T A

S0 S0 S0 S1 S1 S1
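For concreteness, the two-state model above could be written down as plain Python dictionaries. The emission probabilities come from the slide, while the transition matrix and initial distribution below are placeholders, since the slide does not give them.

    # Two-state HMM from the slide: S0 (exon-like) and S1 (intron-like).
    states = ["S0", "S1"]
    emit = {  # emission probabilities from the slide
        "S0": {"A": 0.148, "C": 0.334, "G": 0.365, "T": 0.154},
        "S1": {"A": 0.262, "C": 0.247, "G": 0.238, "T": 0.253},
    }
    trans = {  # transition probabilities: assumed values, not from the slide
        "S0": {"S0": 0.9, "S1": 0.1},
        "S1": {"S0": 0.1, "S1": 0.9},
    }
    init = {"S0": 0.5, "S1": 0.5}  # assumed initial distribution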

Page 7: Class Hidden Markov Models

7

Hidden Markov Model: Rabiner's three basic problems

• what is the probability of an observed sequence, O, given a model? (evaluation)
  ○ joint probability of observations and state sequence – P(O, S | θ)
  ○ as useful as a Markov chain
• what is the optimal sequence of states that "explains" the observed data? (decoding)
  ○ optimality criterion?
• how can one adjust the model parameters to maximize the probability of the observed data given the model? (learning)

Page 8: Class Hidden Markov Models

8

Hidden Markov Model: Evaluation

• for a model with parameters θ, what is $P(O, S \mid \theta)$, or more simply $P(O \mid \theta)$?
• model parameters θ
  ○ Q – a set of states
  ○ A – the state transition matrix
  ○ B – the emission probabilities
  ○ ω – an initial probability distribution
• observations O = (o_0, …, o_T)

Page 9: Class Hidden Markov Models

9

Hidden Markov Model: Evaluation

• Assume the observations are independent given the states:

  $P(O \mid \pi, \theta) = \prod_{t=0}^{T} b_{\pi_t o_t}$

  The probability of a particular state sequence (or path), π, is

  $P(\pi \mid \theta) = \omega_{\pi_0} \prod_{t=1}^{T} a_{\pi_{t-1} \pi_t}$

• There are $N^T$ state sequences and O(T) calculations per sequence, so the brute force complexity of summing over all paths is $O(T N^T)$

Page 10: Class Hidden Markov Models

10

Hidden Markov Model: Forward algorithm

• $\alpha_t(i) = P(o_0, \ldots, o_t, s_t = S_i \mid \theta)$ is the probability of observing the partial sequence $o_0, \ldots, o_t$ and being in state $S_i$ at time t

  $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\, a_{ij} \right] b_{j o_{t+1}}$

  with $\alpha_0(i) = \omega_i\, b_{i o_0}$

• complexity $O(N^2 T)$ (see the sketch below)
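A minimal sketch of the forward recursion, in the same dictionary style as the two-state example above (the parameter names states, init, trans, emit follow that sketch). Real code should work in log space or with scaling to avoid underflow, as in the Mann reference at the end.

    def forward(obs, states, init, trans, emit):
        # alpha[t][i] = P(o_0..o_t, s_t = i | theta); O(N^2 T) overall
        alpha = [{i: init[i] * emit[i][obs[0]] for i in states}]
        for t in range(1, len(obs)):
            alpha.append({
                j: sum(alpha[t - 1][i] * trans[i][j] for i in states) * emit[j][obs[t]]
                for j in states
            })
        return alpha

    alpha = forward("CGCTTA", states, init, trans, emit)
    print(sum(alpha[-1].values()))  # P(O | theta)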

Page 11: Class Hidden Markov Models

11

Hidden Markov Model: Backward algorithm

• almost the same as the forward algorithm
• $\beta_t(i) = P(o_{t+1}, \ldots, o_T \mid s_t = S_i, \theta)$ is the probability of observing the rest of the partial sequence given the state at time t

  $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_{j o_{t+1}}\, \beta_{t+1}(j)$

  with initial condition $\beta_T(i) = 1$
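A matching sketch of the backward recursion, under the same assumptions as the forward sketch above.

    def backward(obs, states, trans, emit):
        # beta[t][i] = P(o_{t+1}..o_T | s_t = i, theta), with beta_T(i) = 1
        T = len(obs)
        beta = [{i: 1.0 for i in states} for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in states:
                beta[t][i] = sum(trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                                 for j in states)
        return beta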

Page 12: Class Hidden Markov Models

12

Hidden Markov Model: Problem 2 – Decoding

• what is the optimal sequence of states that "explains" the observed data?

• optimality criteria
  ○ the path π that maximizes the correct number of individual states, i.e., the path where the states are individually most likely
  ○ the most probable single path: maximize $P(\pi \mid O, \theta)$, or equivalently $P(\pi, O \mid \theta)$ (Viterbi algorithm)
• The optimal path is the path that maximizes $P(\pi, O \mid \theta)$

  let $\delta_t(i)$ be the highest probability of a path ending in state i at time t:
  $\delta_{t+1}(j) = \left[ \max_i \delta_t(i)\, a_{ij} \right] b_{j o_{t+1}}$

• keep track of the argument i that maximizes $\delta_{t+1}(j)$ at each position t (backpointers), then trace back to recover the optimal path (see the sketch below)
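A minimal sketch of Viterbi decoding with backpointers, again reusing the dictionary-style parameters from the earlier sketches.

    def viterbi(obs, states, init, trans, emit):
        # delta[t][j]: probability of the best path ending in state j at time t
        # psi[t][j]: the state at t-1 on that best path (backpointer)
        delta = [{i: init[i] * emit[i][obs[0]] for i in states}]
        psi = [{}]
        for t in range(1, len(obs)):
            delta.append({})
            psi.append({})
            for j in states:
                best = max(states, key=lambda i: delta[t - 1][i] * trans[i][j])
                psi[t][j] = best
                delta[t][j] = delta[t - 1][best] * trans[best][j] * emit[j][obs[t]]
        # trace back from the most probable final state
        last = max(states, key=lambda i: delta[-1][i])
        path = [last]
        for t in range(len(obs) - 1, 0, -1):
            path.append(psi[t][path[-1]])
        return list(reversed(path))

    print(viterbi("CGCTTA", states, init, trans, emit))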

Page 13: Class Hidden Markov Models

13

Hidden Markov Model: Problem 3 – Learning

• find the parameters, θ, that maximize $P(O \mid \theta)$
• No analytical solution; requires an iterative solution (Baum-Welch algorithm)
  1. start with an initial model θ
  2. compute new parameters θ′ from θ and the observations O
  3. if the improvement in likelihood is below a threshold, stop
  4. else, accept θ′ as the new θ and go to 2
• With the Baum-Welch algorithm the likelihood is proven to be greater than or equal to its previous value at each step

Page 14: Class Hidden Markov Models

14

Hidden Markov Model: Training

• need to update the transition probabilities; the probability of being in state i at time t and state j at time t+1 is

  $\xi_t(i,j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_{j o_{t+1}}\, \beta_{t+1}(j)}{P(O \mid \theta)}$

• the probability of being in state i at time t, given the observed sequence O, is

  $\gamma_t(i) = \dfrac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \theta)}$

  or, in terms of ξ, $\gamma_t(i) = \sum_{j} \xi_t(i,j)$

Page 15: Class Hidden Markov Models

15

Hidden Markov Model: Training

• Derived quantities
  ○ expected number of times state i is used: $\sum_{t} \gamma_t(i)$
  ○ expected number of transitions from state i to state j: $\sum_{t} \xi_t(i,j)$
• BW parameter updates (see the sketch below)
  ○ probability of starting in state i: $\hat{\omega}_i = \gamma_0(i)$
  ○ $\hat{a}_{ij} = \dfrac{\sum_t \xi_t(i,j)}{\sum_t \gamma_t(i)}$
  ○ $\hat{b}_{ik} = \dfrac{\sum_{t:\, o_t = c_k} \gamma_t(i)}{\sum_t \gamma_t(i)}$
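A hedged sketch of one Baum-Welch re-estimation step, built from the forward/backward sketches above; a single training sequence and the DNA alphabet are assumptions of the sketch.

    def baum_welch_step(obs, states, init, trans, emit, alphabet="ACGT"):
        alpha = forward(obs, states, init, trans, emit)
        beta = backward(obs, states, trans, emit)
        T = len(obs)
        p_obs = sum(alpha[-1].values())  # P(O | theta)
        # gamma[t][i] = P(s_t = i | O, theta)
        gamma = [{i: alpha[t][i] * beta[t][i] / p_obs for i in states} for t in range(T)]
        # xi[t][(i, j)] = P(s_t = i, s_{t+1} = j | O, theta)
        xi = [{(i, j): alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] / p_obs
               for i in states for j in states} for t in range(T - 1)]
        new_init = {i: gamma[0][i] for i in states}
        new_trans = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1])
                         for j in states} for i in states}
        new_emit = {i: {c: sum(g[i] for g, o in zip(gamma, obs) if o == c) /
                           sum(g[i] for g in gamma)
                        for c in alphabet} for i in states}
        return new_init, new_trans, new_emit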

Page 16: Class Hidden Markov Models

16

Class Hidden Markov Model: Why?

• The basic HMM has no way to include the known states of classified training data in the optimization; it is generally trained in an unsupervised learning approach.

• HMMs are not discriminative models
• How do you discriminate?
  ○ separate the training data into classes and train one model per class
  ○ for any set of observations, compare the probabilities of the observed data given each model and choose the best model (a likelihood ratio, for example; see the sketch below)
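A minimal sketch of that likelihood-ratio comparison, assuming one HMM has been trained per class and the forward() sketch above is available; the model names are illustrative.

    import math

    def classify(obs, model_exon, model_intron):
        # each model is a (states, init, trans, emit) tuple for forward()
        p1 = sum(forward(obs, *model_exon)[-1].values())
        p2 = sum(forward(obs, *model_intron)[-1].values())
        llr = math.log(p1) - math.log(p2)  # log-likelihood ratio
        return ("exon model" if llr > 0 else "intron model"), llr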

Page 17: Class Hidden Markov Models

17

Class Hidden Markov Model: Hidden Markov Models for Labeled Sequences (Anders Krogh, ISMB 1994)

• Assume you have a sequence of observed symbols, (s0, s1, …, sT), and a sequence of (observed) classes (c0, c1, …, cT)

• Each state at time t emits an observed symbol and an observed class (label); the classes are treated similarly to the emission probabilities: each state has a probability of emitting character a and class label x
• Most often a state will emit a single class, i.e., for two classes x and y, a state has either P(x) = 1 and P(y) = 0, or P(x) = 0 and P(y) = 1
• What is the probability of the class labels given this model?

  $P(c \mid s, \theta) = \dfrac{P(s, c \mid \theta)}{P(s \mid \theta)}$

• $P(s \mid \theta)$ is the basic HMM calculation described earlier and is solved using the forward/backward algorithm

Page 18: Class Hidden Markov Models

18

Class Hidden Markov Model

• in the general case, multiple paths through the model can give the same labeling; in the special case where each state emits only a single class label, you can use Viterbi to find the most probable class labeling of sequence s

• Maximum likelihood parameter estimates

Page 19: Class Hidden Markov Models

19

Class Hidden Markov Model 

Page 20: Class Hidden Markov Models

20

Class Hidden Markov Model: Calculating the gradient

Page 21: Class Hidden Markov Models

21

Class Hidden Markov Model: Calculating the gradient

• basically the same as

• the overall likelihood and optimization

Page 22: Class Hidden Markov Models

22

Class Hidden Markov Model

• Overall gradient of the likelihood
• Intuitively
  ○ permissible paths – the state paths given in φ
• $m_k$ is the number of times $\theta_k$ is used in permissible paths
  ○ slightly modified forward/backward
  ○ $\alpha_t$ and $\beta_t$ are zero except along the permissible paths (see the sketch below)
• $n_k$ is the number of times $\theta_k$ is used in all possible paths
  ○ use the standard forward/backward algorithm
• Straightforward application of EM (Baum-Welch) is likely to give negative probabilities (eq. 16)
• Krogh proposes an iterative method, including multiple training sequences, μ
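A hedged sketch of the "clamped" forward pass over permissible paths: α_t(j) is forced to zero whenever the class label of state j disagrees with the observed label at position t. The state_class mapping (one class label per state) and the function name are assumptions of this sketch, and it covers only the single-label-per-state special case.

    def forward_labeled(obs, labels, states, init, trans, emit, state_class):
        # alpha restricted to permissible paths: zero whenever the state's
        # class label disagrees with the observed label at that position
        def ok(j, t):
            return state_class[j] == labels[t]

        alpha = [{i: init[i] * emit[i][obs[0]] if ok(i, 0) else 0.0 for i in states}]
        for t in range(1, len(obs)):
            alpha.append({
                j: (sum(alpha[t - 1][i] * trans[i][j] for i in states) * emit[j][obs[t]])
                   if ok(j, t) else 0.0
                for j in states
            })
        return alpha  # summing the last column gives P(observations, labels | theta)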

Page 23: Class Hidden Markov Models

23

References: To learn about HMMs

• Rabiner L. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257-286, 1989.
  ○ This is the ultimate origin; everything comes back to here, including basic nomenclature and the definition of the problems.
• Durbin R, Eddy S, Krogh A, Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998.
  ○ Detailed applications to sequence alignment, phylogenetic trees, RNA folding, and other biological models.
• Mann T. Numerically stable hidden Markov model implementation. http://bozeman.genome.washington.edu/compbio/mbt599_2006/hmm_scaling_revised.pdf, 2006.
  ○ Detailed pseudocode for implementation of HMM calculations in log space to avoid numerical underflow. Very useful if you are writing your own code.
• De Fonzo V, Aluffi-Pentini F, Parisi V. Hidden Markov Models in Bioinformatics. Current Bioinformatics 2, 49-61, 2007.
  ○ A more recent review that gives a clear outline of the algorithms and lots of references to more recent applications in computational biology.
• Kanungo T. Hidden Markov Models. http://www.kanungo.com/software/hmmtut.pdf
• Kanungo T. UMDHMM: hidden Markov model toolkit. In "Extended Finite State Models of Language", Kornai A (ed), Cambridge University Press, 1999. Software download: http://www.kanungo.com/software/software.html#umdhmm