Bioinformatics Algorithms
David Hoksza
http://siret.ms.mff.cuni.cz/hoksza
Hidden Markov Models
based on:
Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme
Mitchison. Biological sequence analysis: probabilistic models of
proteins and nucleic acids. Cambridge university press, 1998.
Outline
• Markov chains
• Hidden Markov models (HMMs)
  • Definition
• Decoding
• Viterbi, forward, backward algorithm
• Parameter estimation
• Baum-Welch algorithm
Motivation examples
• Dishonest casino
  • Casino which, with a given probability, switches between loaded and fair dice
• CpG islands
  • When a CG dinucleotide appears in the human genome, the C is often methylated and mutates into T → lower CG frequency than expected from the individual frequencies of C and G
  • Methylation is suppressed in some regions, such as promoters → higher CG content than elsewhere → CpG islands
Markov chains
• A Markov chain models a discrete stochastic process going through discrete states in discrete time with the Markov property
• Markov property – future states depend solely on the present state, i.e. given the present, the future does not depend on the past
• Transition probability – probability of getting from one state to another:
  a_st = P(x_i = t | x_{i−1} = s)
• For a probabilistic model and a sequence of states x:
  P(x) = P(x_L, x_{L−1}, …, x_1)
       = P(x_L | x_{L−1}, …, x_1) P(x_{L−1} | x_{L−2}, …, x_1) … P(x_1)
       = P(x_L | x_{L−1}) P(x_{L−1} | x_{L−2}) … P(x_2 | x_1)
       = P(x_1) ∏_{i=2}^{L} a_{x_{i−1} x_i}
• To obtain a more homogeneous description, begin (B) and sometimes end (E) states are added to the model, setting P(x_1 = s) = a_Bs and P(E | x_L = t) = a_tE
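The factorization above can be sketched directly in code; the two-state chain and its transition numbers below are made up for illustration.

```python
# Minimal sketch of P(x) = P(x_1) * prod_{i=2}^{L} a_{x_{i-1} x_i}
# for a first-order Markov chain. The two-state chain is a made-up toy.

init = {'A': 0.5, 'B': 0.5}                  # begin-state transitions a_Bs
trans = {'A': {'A': 0.6, 'B': 0.4},
         'B': {'A': 0.3, 'B': 0.7}}          # transition probabilities a_st

def chain_prob(x):
    """Probability of the state sequence x under the chain above."""
    p = init[x[0]]
    for s, t in zip(x, x[1:]):               # consecutive state pairs
        p *= trans[s][t]
    return p
```

Because each transition row sums to 1, the probabilities of all sequences of a fixed length sum to 1.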
Discriminating using Markov chains
• When we know the probability of a sequence given a model, we can carry out a likelihood ratio test
• Markov chain model for CpG island regions (+) and for the remaining regions (−)
  • a+_st = c+_st / Σ_t' c+_st' , where c+_st is the frequency of letter s followed by t in the + regions (transition probability)
  • Similarly for the "−" regions
S(x) = log [ P(x | model+) / P(x | model−) ] = Σ_{i=1}^{L} log ( a+_{x_{i−1} x_i} / a−_{x_{i−1} x_i} ) = Σ_{i=1}^{L} β_{x_{i−1} x_i}
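The score S(x) can be sketched as a sum of precomputed log ratios β; the +/− transition tables below are invented toy values over a two-letter alphabet, not estimates from real CpG data.

```python
import math

# Toy +/- transition tables over the alphabet {C, G}; values are invented
# for illustration, not estimated from real CpG-island data.
a_plus  = {('C', 'C'): 0.5, ('C', 'G'): 0.5, ('G', 'C'): 0.5, ('G', 'G'): 0.5}
a_minus = {('C', 'C'): 0.9, ('C', 'G'): 0.1, ('G', 'C'): 0.1, ('G', 'G'): 0.9}

def log_odds(x):
    """S(x) = sum_i log(a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i})."""
    return sum(math.log(a_plus[p] / a_minus[p]) for p in zip(x, x[1:]))
```

A positive score means the sequence is better explained by the "+" (island) model.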
Distribution of length-normalized scores of the sequences (CpG island sequences in black)
Hidden Markov models
• In Markov chains, the state is coupled with the observation and thus directly visible to the observer
• In hidden Markov models (HMM) the state is not visible to the observer → when an event is observed the model can be in different states which give rise to that event with different probabilities
• In the context of CpG islands, at a given position we can be either in an island or not, but we cannot tell just by observing the letter
• In the context of the dishonest casino, we don't know whether at a given moment a fair or loaded die is being used
Hidden Markov models - definition
• Sequence of states needs to be decoupled from the sequence of observations/symbols
  • path π – sequence of states, which follows a Markov chain
  • transition probabilities a_kl – probability of getting to l in the next step when being in k
    a_kl = P(π_i = l | π_{i−1} = k)
  • emission (observation) probabilities e_k(b) – probability of a given observation in a given state (the state emits/generates the observation → generative model)
    e_k(b) = P(x_i = b | π_i = k)
P(x, π) = a_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}} , with π_{L+1} = 0
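As a concrete sketch of P(x, π), consider the dishonest-casino model with a fair (F) and a loaded (L) die; the parameter values below are illustrative, and the end state is ignored (a_k0 = 1).

```python
# Dishonest-casino HMM sketch; parameter values are illustrative.
states = ('F', 'L')
a0 = {'F': 0.5, 'L': 0.5}                        # a_{0 pi_1}: begin transitions
a  = {'F': {'F': 0.95, 'L': 0.05},
      'L': {'F': 0.10, 'L': 0.90}}               # transition probabilities a_kl
e  = {'F': {r: 1 / 6 for r in range(1, 7)},      # fair die: uniform
      'L': {r: 0.5 if r == 6 else 0.1 for r in range(1, 7)}}  # loaded die

def joint_prob(x, pi):
    """P(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i pi_{i+1}},
    ignoring the end state (a_{k0} = 1)."""
    p = a0[pi[0]]
    for i in range(len(x)):
        p *= e[pi[i]][x[i]]                      # emission at position i
        if i + 1 < len(pi):
            p *= a[pi[i]][pi[i + 1]]             # transition to next state
    return p
```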
Most probable state path
• Decoding – finding the most probable sequence of states that gave rise to the sequence of observations
• Many paths could have generated a given sequence, but we are usually interested only in the most probable one
  π* = argmax_π P(x, π)
Example: the observation CGCG could have been generated by paths C+G+C+G+, C−G−C−G−, C+G−C+G−, … – which one is most probable?
Viterbi algorithm
• Recursive procedure
• Given probabilities v_k(i) of the most probable paths ending in k for observation x_i, the probabilities of the states for observation x_{i+1} can be obtained as
  v_l(i+1) = e_l(x_{i+1}) max_k ( v_k(i) a_kl )

initialization: v_0(0) = 1; v_k(0) = 0 for k > 0
recursion (i = 1, …, L):
  v_l(i) = e_l(x_i) max_k ( v_k(i−1) a_kl )
  ptr_i(l) = argmax_k ( v_k(i−1) a_kl )
termination:
  P(x, π*) = max_k ( v_k(L) a_k0 )
  π*_L = argmax_k ( v_k(L) a_k0 )
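The recursion and traceback above can be sketched as follows; the casino-style parameters are illustrative and the end-state transitions a_k0 are taken as 1.

```python
# Illustrative casino-style model (fair F vs. loaded L die).
states = ('F', 'L')
a0 = {'F': 0.5, 'L': 0.5}
a  = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.10, 'L': 0.90}}
e  = {'F': {r: 1 / 6 for r in range(1, 7)},
      'L': {r: 0.5 if r == 6 else 0.1 for r in range(1, 7)}}

def viterbi(x):
    """Most probable path pi* and P(x, pi*); end transitions a_k0 = 1."""
    v = {k: a0[k] * e[k][x[0]] for k in states}      # v_k(1)
    ptr = []                                         # back-pointers ptr_i(l)
    for sym in x[1:]:
        nv, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] * a[k][l])
            back[l] = best
            nv[l] = e[l][sym] * v[best] * a[best][l]
        ptr.append(back)
        v = nv
    last = max(states, key=v.get)                    # best final state
    path = [last]
    for back in reversed(ptr):                       # traceback
        path.append(back[path[-1]])
    return ''.join(reversed(path)), v[last]
```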
source: Durbin, Richard, et al. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998.
Probability of a sequence
• How probable is it to see a sequence of observations x?
• For Markov chains we have only one possible path, but that is not the case with HMMs:
  P(x) = Σ_π P(x, π)
• Exhaustively enumerating all paths is not computationally feasible
• A good, but unnecessary, approximation is to consider only the most probable path π*
Probability of a sequence - forward algorithm
• Similarly to Viterbi, we can compute the probability up to the (i+1)-st observation given the probabilities for the first i observations
  f_k(i) = P(x_1 … x_i, π_i = k)
  f_l(i+1) = e_l(x_{i+1}) Σ_k f_k(i) a_kl

initialization: f_0(0) = 1; f_k(0) = 0 for k > 0
recursion (i = 1, …, L): f_l(i) = e_l(x_i) Σ_k f_k(i−1) a_kl
termination: P(x) = Σ_k f_k(L) a_k0
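The forward recursion is the Viterbi recursion with max replaced by a sum; again the casino-style parameters are illustrative and a_k0 is taken as 1.

```python
# Illustrative casino-style model (fair F vs. loaded L die).
states = ('F', 'L')
a0 = {'F': 0.5, 'L': 0.5}
a  = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.10, 'L': 0.90}}
e  = {'F': {r: 1 / 6 for r in range(1, 7)},
      'L': {r: 0.5 if r == 6 else 0.1 for r in range(1, 7)}}

def forward(x):
    """P(x) = sum over all paths, in O(L * |states|^2); a_k0 taken as 1."""
    f = {k: a0[k] * e[k][x[0]] for k in states}          # f_k(1)
    for sym in x[1:]:
        f = {l: e[l][sym] * sum(f[k] * a[k][l] for k in states)
             for l in states}                            # f_l(i)
    return sum(f.values())                               # termination
```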
Probability of a state
• Probability of a state for given observation
• Given the whole sequence, what is the probability that observation x_i was generated by state k, i.e. P(π_i = k | x) – the posterior probability
• First we compute 𝑃(𝑥, 𝜋𝑖 = 𝑘) and by conditioning we get 𝑃 𝜋𝑖 = 𝑘 𝑥
P(x, π_i = k) = P(π_i = k | x) P(x)  ⇒  P(π_i = k | x) = P(x, π_i = k) / P(x)
Probability of a state – backward algorithm
P(x, π_i = k) = P(x_1 … x_i, π_i = k) P(x_{i+1} … x_L | x_1 … x_i, π_i = k)
             = P(x_1 … x_i, π_i = k) P(x_{i+1} … x_L | π_i = k)
             = f_k(i) b_k(i)
• b_k(i) can be obtained using a backward recursion, in contrast to the forward recursion used to obtain f_k(i)

initialization: b_k(L) = a_k0 for all k
recursion (i = L−1, …, 1): b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
termination: P(x) = Σ_l a_0l e_l(x_1) b_l(1)

P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) b_k(i) / P(x)
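A sketch combining the forward and backward passes to obtain the posteriors P(π_i = k | x); the casino-style parameters are illustrative and a_k0 is taken as 1.

```python
# Illustrative casino-style model (fair F vs. loaded L die).
states = ('F', 'L')
a0 = {'F': 0.5, 'L': 0.5}
a  = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.10, 'L': 0.90}}
e  = {'F': {r: 1 / 6 for r in range(1, 7)},
      'L': {r: 0.5 if r == 6 else 0.1 for r in range(1, 7)}}

def posteriors(x):
    """P(pi_i = k | x) = f_k(i) b_k(i) / P(x); a_k0 taken as 1."""
    # forward pass: f[i][k] = f_k(i+1)
    f = [{k: a0[k] * e[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({l: e[l][sym] * sum(f[-1][k] * a[k][l] for k in states)
                  for l in states})
    # backward pass: b_k(L) = a_k0 = 1
    b = [{k: 1.0 for k in states}]
    for sym in reversed(x[1:]):
        b.insert(0, {k: sum(a[k][l] * e[l][sym] * b[0][l] for l in states)
                     for k in states})
    px = sum(f[-1].values())                         # P(x)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
```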
The x-axis shows the roll number; the shaded areas correspond to loaded-die rolls.
Posterior decoding
• Having the posterior probabilities of the states, we can define an alternative decoding to the Viterbi one – posterior decoding
  π̂_i = argmax_k P(π_i = k | x)
• We might be interested in a property derived from the sequence of states, not the states themselves → function g(k) on the states
  G(i|x) = Σ_k P(π_i = k | x) g(k)
• For CpG islands, set g(k) = 1 for k ∈ {A+, C+, G+, T+} and g(k) = 0 for k ∈ {A−, C−, G−, T−}; then G(i|x) is the posterior probability of nucleotide i being in a CpG island
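Given per-position posteriors, both the decoding and G(i|x) are one-liners; the posterior values below are made up for illustration of a +/− (island / non-island) model.

```python
def posterior_decode(post):
    """pi_hat_i = argmax_k P(pi_i = k | x) for each position i."""
    return [max(p, key=p.get) for p in post]

def expected_g(post, g):
    """G(i|x) = sum_k P(pi_i = k | x) g(k)."""
    return [sum(pk * g(k) for k, pk in p.items()) for p in post]

# Made-up posteriors for two positions of a +/- model:
post = [{'+': 0.7, '-': 0.3}, {'+': 0.2, '-': 0.8}]
decoded = posterior_decode(post)                               # ['+', '-']
island_prob = expected_g(post, lambda k: 1 if k == '+' else 0)  # [0.7, 0.2]
```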
Viterbi vs posterior decoding – loaded dice
Lower probability of going from a fair die to a loaded one; Viterbi decoding in such a case stays in the fair state for every roll.
Viterbi vs posterior decoding – CpG islands
• 41 sequences, each with one putative island

Viterbi decoding
• 2 false negatives
• 121 false positives
  • short predicted islands → postprocessing: concatenate islands closer than 500 bp, remove islands shorter than 500 bp
  • 67 false positives after postprocessing

Posterior decoding
• 2 false negatives
• 236 false positives
  • 83 after filtration
Parameter estimation
• Parameters of the model (transition and emission probabilities) might not be known and need to be estimated
• Training data, i.e. labeled observations (paths), are available
  • We can count frequencies of state transitions A_kl and of emissions in given states E_k(b) and use them for parameter estimation
  • Often pseudocounts are added to A and E to account for the fact that we see only a sample of the data; e.g. if A_kl or E_k(b) is 0, we should still allow for the possibility that we just have not seen that transition or emission in the training data
    • A_kl = A_kl + r_kl , E_k(b) = E_k(b) + r_k(b)
  • Pseudocounts (low vs. high) encode our prior beliefs
• Maximum likelihood estimators:
  a_kl = A_kl / Σ_l' A_kl'      e_k(b) = E_k(b) / Σ_b' E_k(b')
• Training data, i.e. labeled observations (paths), are not available
  • Baum-Welch algorithm
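A sketch of the maximum likelihood estimators with a uniform pseudocount r; the function name and the (path, observation) data layout are my own choices for illustration.

```python
def estimate(labelled, states, symbols, r=1.0):
    """ML estimates of a_kl and e_k(b) from labelled (path, observations)
    pairs, with a uniform pseudocount r added to every count."""
    A = {k: {l: r for l in states} for k in states}       # A_kl + r_kl
    E = {k: {b: r for b in symbols} for k in states}      # E_k(b) + r_k(b)
    for pi, x in labelled:
        for k, l in zip(pi, pi[1:]):                      # transition counts
            A[k][l] += 1
        for k, b in zip(pi, x):                           # emission counts
            E[k][b] += 1
    # normalize counts into probabilities
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a, e
```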
Baum-Welch algorithm
• Let’s iteratively compute 𝐴𝑘𝑙 and 𝐸𝑘(𝑏) from existing values of 𝑎𝑘𝑙 and 𝑒𝑘(𝑏) by considering the most probable paths, and use it to update 𝑎𝑘𝑙and 𝑒𝑘(𝑏)
P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)

  f_k(i) … probability of the sequence up to position i, ending in state k
  a_kl … probability of the transition from k to l
  e_l(x_{i+1}) … probability of emitting x_{i+1} in l
  b_l(i+1) … probability of the rest of the sequence

A_kl = Σ_j 1/P(x^j) Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1)
E_k(b) = Σ_j 1/P(x^j) Σ_{i: x^j_i = b} f_k^j(i) b_k^j(i)
Initialization
• Pick arbitrary model parameters

Recurrence
• Set A and E to zero (or to pseudocounts)
• For each training sequence j
  • Calculate f_k(i) for j using the forward algorithm
  • Calculate b_k(i) for j using the backward algorithm
  • Add the contributions of j to A and E
• Calculate the new model parameters (a_kl, e_k)
• Calculate the new log likelihood of the model

Termination
• Stop if the change in log likelihood is less than some threshold or the maximum number of iterations is exceeded
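The whole loop can be sketched as follows; this is a simplified variant (pseudocount 1, fixed begin distribution a0, end transitions a_k0 = 1, and a fixed iteration count instead of a log-likelihood threshold).

```python
def baum_welch(seqs, states, symbols, a0, a, e, n_iter=10):
    """Baum-Welch (EM) re-estimation sketch; a_k0 taken as 1."""
    for _ in range(n_iter):
        A = {k: {l: 1.0 for l in states} for k in states}    # pseudocounts
        E = {k: {s: 1.0 for s in symbols} for k in states}
        for x in seqs:
            # forward: f[i][k] = f_k(i+1)
            f = [{k: a0[k] * e[k][x[0]] for k in states}]
            for sym in x[1:]:
                f.append({l: e[l][sym] * sum(f[-1][k] * a[k][l] for k in states)
                          for l in states})
            # backward: bw[i][k] = b_k(i+1), with b_k(L) = 1
            bw = [{k: 1.0 for k in states}]
            for sym in reversed(x[1:]):
                bw.insert(0, {k: sum(a[k][l] * e[l][sym] * bw[0][l]
                                     for l in states) for k in states})
            px = sum(f[-1].values())                         # P(x)
            # expected emission and transition counts
            for i in range(len(x)):
                for k in states:
                    E[k][x[i]] += f[i][k] * bw[i][k] / px
                    if i + 1 < len(x):
                        for l in states:
                            A[k][l] += (f[i][k] * a[k][l] * e[l][x[i + 1]]
                                        * bw[i + 1][l] / px)
        # M step: renormalize expected counts into new parameters
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
        e = {k: {s: E[k][s] / sum(E[k].values()) for s in symbols} for k in states}
    return a, e
```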
Comparison of the original casino model parameters with parameters estimated from 300 rolls and from 30 000 rolls.
Log transformation
• Multiplying many numbers between 0 and 1 will result in underflow
• For the Viterbi algorithm, we simply take logarithms of the probabilities
  • x̃ = log x
  • V_l(i+1) = ẽ_l(x_{i+1}) + max_k ( V_k(i) + ã_kl )
• For forward and backward algorithms, the transformation can be done as well, but the logs of the parameters can’t be precomputed due to having sums in the formulas
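A sketch of the log-space Viterbi recursion; it assumes all parameters are strictly positive (so the logarithms exist) and single-character state names.

```python
import math

def viterbi_log(x, states, a0, a, e):
    """Viterbi in log space: V_l(i+1) = log e_l(x_{i+1}) + max_k (V_k(i) + log a_kl).
    Assumes strictly positive parameters; a_k0 taken as 1."""
    V = {k: math.log(a0[k]) + math.log(e[k][x[0]]) for k in states}
    paths = {k: k for k in states}           # best path ending in each state
    for sym in x[1:]:
        nV, npaths = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[k] + math.log(a[k][l]))
            nV[l] = math.log(e[l][sym]) + V[best] + math.log(a[best][l])
            npaths[l] = paths[best] + l
        V, paths = nV, npaths
    last = max(states, key=V.get)
    return paths[last], V[last]
```

Sums of logs never underflow the way products of small probabilities do, so long sequences stay numerically safe.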