Bioinformatics Algorithms
David Hoksza
http://siret.ms.mff.cuni.cz/hoksza
Hidden Markov Models
based on:
Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme
Mitchison. Biological sequence analysis: probabilistic models of
proteins and nucleic acids. Cambridge university press, 1998.
Outline
• Markov chains
• Hidden Markov models (HMMs)
  • Definition
• Decoding
• Viterbi, forward, backward algorithm
• Parameter estimation
• Baum-Welch algorithm
Motivation examples
• Dishonest casino
  • Casino which, with a given probability, switches between loaded and fair dice
• CpG islands
  • When a CG dinucleotide appears in the human genome, the C is often methylated and mutates into T → lower CG frequency than expected from the individual frequencies of C and G
  • Methylation is suppressed in some regions, such as promoters → higher CG content than elsewhere → CpG islands
Markov chains
• A Markov chain models a discrete stochastic process going through discrete states in discrete time with the Markov property
• Markov property – future states depend solely on the present state, i.e. given the present, the future does not depend on the past
• Transition probability – probability of getting from one state to another:
  a_st = P(x_i = t | x_{i−1} = s)
• For a probabilistic model and a sequence of states x:
  P(x) = P(x_L, x_{L−1}, …, x_1)
       = P(x_L | x_{L−1}, …, x_1) P(x_{L−1} | x_{L−2}, …, x_1) … P(x_1)
       = P(x_L | x_{L−1}) P(x_{L−1} | x_{L−2}) … P(x_2 | x_1)
       = P(x_1) ∏_{i=2}^{L} a_{x_{i−1} x_i}
• To obtain a more homogeneous description, begin (B) and sometimes end (E) states are added to the model, setting P(x_1 = s) = a_Bs and P(E | x_L = t) = a_tE
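The factorization above can be sketched directly in code; the two-state chain and its transition numbers below are made up for illustration.

```python
# Minimal sketch of P(x) = P(x_1) * prod_{i=2}^{L} a_{x_{i-1} x_i}
# for a first-order Markov chain. The two-state chain is a made-up toy.

init = {'A': 0.5, 'B': 0.5}                  # begin-state transitions a_Bs
trans = {'A': {'A': 0.6, 'B': 0.4},
         'B': {'A': 0.3, 'B': 0.7}}          # transition probabilities a_st

def chain_prob(x):
    """Probability of the state sequence x under the chain above."""
    p = init[x[0]]
    for s, t in zip(x, x[1:]):               # consecutive state pairs
        p *= trans[s][t]
    return p
```

Because each transition row sums to 1, the probabilities of all sequences of a fixed length sum to 1.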
Discriminating using Markov chains
• When we know the probability of a sequence given a model, we can carry out a likelihood ratio test
• Markov chain model for CpG island regions (+) and for the remaining regions (−)
  • a+_st = c+_st / Σ_t' c+_st' , where c+_st is the frequency of letter s followed by t in the + regions (transition probability)
  • Similarly for the "−" regions
S(x) = log [ P(x | model+) / P(x | model−) ] = Σ_{i=1}^{L} log ( a+_{x_{i−1} x_i} / a−_{x_{i−1} x_i} ) = Σ_{i=1}^{L} β_{x_{i−1} x_i}
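The score S(x) can be sketched as a sum of precomputed log ratios β; the +/− transition tables below are invented toy values over a two-letter alphabet, not estimates from real CpG data.

```python
import math

# Toy +/- transition tables over the alphabet {C, G}; values are invented
# for illustration, not estimated from real CpG-island data.
a_plus  = {('C', 'C'): 0.5, ('C', 'G'): 0.5, ('G', 'C'): 0.5, ('G', 'G'): 0.5}
a_minus = {('C', 'C'): 0.9, ('C', 'G'): 0.1, ('G', 'C'): 0.1, ('G', 'G'): 0.9}

def log_odds(x):
    """S(x) = sum_i log(a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i})."""
    return sum(math.log(a_plus[p] / a_minus[p]) for p in zip(x, x[1:]))
```

A positive score means the sequence is better explained by the "+" (island) model.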
Distribution of length-normalized scores of the sequences (CpG island sequences in black)
Hidden Markov models
• In Markov chains, the state is coupled with the observation and thus directly visible to the observer
• In hidden Markov models (HMM) the state is not visible to the observer → when an event is observed the model can be in different states which give rise to that event with different probabilities
• In the context of CpG islands, at a given position we can be either in an island or not, but we cannot tell just by observing the letter
• In the context of the dishonest casino, we don't know whether at a given moment a fair or loaded die is being used
Hidden Markov models - definition
• Sequence of states needs to be decoupled from the sequence of observations/symbols
  • path π – sequence of states, which follows a Markov chain
  • transition probabilities a_kl – probability of getting to l in the next step when being in k
    a_kl = P(π_i = l | π_{i−1} = k)
  • emission (observation) probabilities e_k(b) – probability of a given observation in a given state (the state emits/generates the observation → generative model)
    e_k(b) = P(x_i = b | π_i = k)
P(x, π) = a_{0 π_1} ∏_{i=1}^{L} e_{π_i}(x_i) a_{π_i π_{i+1}} , with π_{L+1} = 0
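As a concrete sketch of P(x, π), consider the dishonest-casino model with a fair (F) and a loaded (L) die; the parameter values below are illustrative, and the end state is ignored (a_k0 = 1).

```python
# Dishonest-casino HMM sketch; parameter values are illustrative.
states = ('F', 'L')
a0 = {'F': 0.5, 'L': 0.5}                        # a_{0 pi_1}: begin transitions
a  = {'F': {'F': 0.95, 'L': 0.05},
      'L': {'F': 0.10, 'L': 0.90}}               # transition probabilities a_kl
e  = {'F': {r: 1 / 6 for r in range(1, 7)},      # fair die: uniform
      'L': {r: 0.5 if r == 6 else 0.1 for r in range(1, 7)}}  # loaded die

def joint_prob(x, pi):
    """P(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i pi_{i+1}},
    ignoring the end state (a_{k0} = 1)."""
    p = a0[pi[0]]
    for i in range(len(x)):
        p *= e[pi[i]][x[i]]                      # emission at position i
        if i + 1 < len(pi):
            p *= a[pi[i]][pi[i + 1]]             # transition to next state
    return p
```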
Most probable state path
• Decoding – finding the most probable sequence of states that gave rise to the sequence of observations
• Many paths could have generated a given sequence, but we are usually interested only in the most probable one
  π* = argmax_π P(x, π)
Example: the observation CGCG could have been generated by paths C+G+C+G+, C−G−C−G−, C+G−C+G−, … – which one is most probable?
Viterbi algorithm
• Recursive procedure
• Given probabilities v_k(i) of the most probable paths ending in k for observation x_i, the probabilities of the states for observation x_{i+1} can be obtained as
  v_l(i+1) = e_l(x_{i+1}) max_k ( v_k(i) a_kl )

initialization: v_0(0) = 1; v_k(0) = 0 for k > 0
recursion (i = 1, …, L):
  v_l(i) = e_l(x_i) max_k ( v_k(i−1) a_kl )
  ptr_i(l) = argmax_k ( v_k(i−1) a_kl )
termination:
  P(x, π*) = max_k ( v_k(L) a_k0 )
  π*_L = argmax_k ( v_k(L) a_k0 )
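The recursion and traceback above can be sketched as follows; the casino-style parameters are illustrative and the end-state transitions a_k0 are taken as 1.

```python
# Illustrative casino-style model (fair F vs. loaded L die).
states = ('F', 'L')
a0 = {'F': 0.5, 'L': 0.5}
a  = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.10, 'L': 0.90}}
e  = {'F': {r: 1 / 6 for r in range(1, 7)},
      'L': {r: 0.5 if r == 6 else 0.1 for r in range(1, 7)}}

def viterbi(x):
    """Most probable path pi* and P(x, pi*); end transitions a_k0 = 1."""
    v = {k: a0[k] * e[k][x[0]] for k in states}      # v_k(1)
    ptr = []                                         # back-pointers ptr_i(l)
    for sym in x[1:]:
        nv, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] * a[k][l])
            back[l] = best
            nv[l] = e[l][sym] * v[best] * a[best][l]
        ptr.append(back)
        v = nv
    last = max(states, key=v.get)                    # best final state
    path = [last]
    for back in reversed(ptr):                       # traceback
        path.append(back[path[-1]])
    return ''.join(reversed(path)), v[last]
```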
source: Durbin, Richard, et al. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge university press, 1998.
Probability of a sequence
• How probable is it to see a sequence of observations x?
• For Markov chains we have only one possible path, but that is not the case with HMMs:
  P(x) = Σ_π P(x, π)
• Exhaustively enumerating all paths is not computationally feasible
• A good, but unnecessary, approximation is to consider only the most probable path π*
Probability of a sequence - forward algorithm
• Similarly to Viterbi, we can compute the probability up to the (i+1)-st observation given the probabilities for the first i observations
  f_k(i) = P(x_1 … x_i, π_i = k)
  f_l(i+1) = e_l(x_{i+1}) Σ_k f_k(i) a_kl

initialization: f_0(0) = 1; f_k(0) = 0 for k > 0
recursion (i = 1, …, L): f_l(i) = e_l(x_i) Σ_k f_k(i−1) a_kl
termination: P(x) = Σ_k f_k(L) a_k0
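The forward recursion is the Viterbi recursion with max replaced by a sum; again the casino-style parameters are illustrative and a_k0 is taken as 1.

```python
# Illustrative casino-style model (fair F vs. loaded L die).
states = ('F', 'L')
a0 = {'F': 0.5, 'L': 0.5}
a  = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.10, 'L': 0.90}}
e  = {'F': {r: 1 / 6 for r in range(1, 7)},
      'L': {r: 0.5 if r == 6 else 0.1 for r in range(1, 7)}}

def forward(x):
    """P(x) = sum over all paths, in O(L * |states|^2); a_k0 taken as 1."""
    f = {k: a0[k] * e[k][x[0]] for k in states}          # f_k(1)
    for sym in x[1:]:
        f = {l: e[l][sym] * sum(f[k] * a[k][l] for k in states)
             for l in states}                            # f_l(i)
    return sum(f.values())                               # termination
```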
Probability of a state
• Probability of a state for given observation
• Given the whole sequence, what is the probability that observation x_i was generated by state k, i.e. P(π_i = k | x) – the posterior probability
• First we compute 𝑃(𝑥, 𝜋𝑖 = 𝑘) and by conditioning we get 𝑃 𝜋𝑖 = 𝑘 𝑥
P(x, π_i = k) = P(π_i = k | x) P(x)  ⇒  P(π_i = k | x) = P(x, π_i = k) / P(x)
Probability of a state – backward algorithm
P(x, π_i = k) = P(x_1 … x_i, π_i = k) P(x_{i+1} … x_L | x_1 … x_i, π_i = k)
             = P(x_1 … x_i, π_i = k) P(x_{i+1} … x_L | π_i = k)
             = f_k(i) b_k(i)
• b_k(i) can be obtained using a backward recursion, in contrast to the forward recursion used to obtain f_k(i)

initialization: b_k(L) = a_k0 for all k
recursion (i = L−1, …, 1): b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
termination: P(x) = Σ_l a_0l e_l(x_1) b_l(1)

P(π_i = k | x) = P(x, π_i = k) / P(x) = f_k(i) b_k(i) / P(x)
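A sketch combining the forward and backward passes to obtain the posteriors P(π_i = k | x); the casino-style parameters are illustrative and a_k0 is taken as 1.

```python
# Illustrative casino-style model (fair F vs. loaded L die).
states = ('F', 'L')
a0 = {'F': 0.5, 'L': 0.5}
a  = {'F': {'F': 0.95, 'L': 0.05}, 'L': {'F': 0.10, 'L': 0.90}}
e  = {'F': {r: 1 / 6 for r in range(1, 7)},
      'L': {r: 0.5 if r == 6 else 0.1 for r in range(1, 7)}}

def posteriors(x):
    """P(pi_i = k | x) = f_k(i) b_k(i) / P(x); a_k0 taken as 1."""
    # forward pass: f[i][k] = f_k(i+1)
    f = [{k: a0[k] * e[k][x[0]] for k in states}]
    for sym in x[1:]:
        f.append({l: e[l][sym] * sum(f[-1][k] * a[k][l] for k in states)
                  for l in states})
    # backward pass: b_k(L) = a_k0 = 1
    b = [{k: 1.0 for k in states}]
    for sym in reversed(x[1:]):
        b.insert(0, {k: sum(a[k][l] * e[l][sym] * b[0][l] for l in states)
                     for k in states})
    px = sum(f[-1].values())                         # P(x)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
```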
The x-axis shows the roll number; the shaded areas correspond to loaded-die rolls.
Posterior decoding
• Having the posterior probabilities of the states, we can define an alternative decoding to the Viterbi one – posterior decoding
  π̂_i = argmax_k P(π_i = k | x)
• We might be interested in a property derived from the sequence of states, not the states themselves → function g(k) on the states
  G(i|x) = Σ_k P(π_i = k | x) g(k)
• For CpG islands, set g(k) = 1 for k ∈ {A+, C+, G+, T+} and g(k) = 0 for k ∈ {A−, C−, G−, T−}; then G(i|x) is the posterior probability of nucleotide i being in a CpG island
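Given per-position posteriors, both the decoding and G(i|x) are one-liners; the posterior values below are made up for illustration of a +/− (island / non-island) model.

```python
def posterior_decode(post):
    """pi_hat_i = argmax_k P(pi_i = k | x) for each position i."""
    return [max(p, key=p.get) for p in post]

def expected_g(post, g):
    """G(i|x) = sum_k P(pi_i = k | x) g(k)."""
    return [sum(pk * g(k) for k, pk in p.items()) for p in post]

# Made-up posteriors for two positions of a +/- model:
post = [{'+': 0.7, '-': 0.3}, {'+': 0.2, '-': 0.8}]
decoded = posterior_decode(post)                               # ['+', '-']
island_prob = expected_g(post, lambda k: 1 if k == '+' else 0)  # [0.7, 0.2]
```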
Viterbi vs posterior decoding – loaded dice
Lower probability of going from a fair die to a loaded one; Viterbi decoding in such a case stays in the fair state for every roll.
Viterbi vs posterior decoding – CpG islands
• 41 sequences, each with one putative island

Viterbi decoding
• 2 false negatives
• 121 false positives
  • short predicted islands → postprocessing: concatenate islands closer than 500 bp, remove islands shorter than 500 bp
  • 67 false positives after postprocessing

Posterior decoding
• 2 false negatives
• 236 false positives
  • 83 after filtration
Parameter estimation
• Parameters of the model (transition and emission probabilities) might not be known and need to be estimated
• Training data, i.e. labeled observations (paths), are available
  • We can count frequencies of state transitions A_kl and of emissions in given states E_k(b) and use them for parameter estimation
  • Often pseudocounts are added to A and E to account for the fact that we see only a sample of the data; e.g. if A_kl or E_k(b) is 0, we should still allow for the possibility that we just have not seen that transition or emission in the training data
    • A_kl = A_kl + r_kl , E_k(b) = E_k(b) + r_k(b)
  • Pseudocounts (low vs. high) encode our prior beliefs
• Maximum likelihood estimators:
  a_kl = A_kl / Σ_l' A_kl'      e_k(b) = E_k(b) / Σ_b' E_k(b')
• Training data, i.e. labeled observations (paths), are not available
  • Baum-Welch algorithm
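A sketch of the maximum likelihood estimators with a uniform pseudocount r; the function name and the (path, observation) data layout are my own choices for illustration.

```python
def estimate(labelled, states, symbols, r=1.0):
    """ML estimates of a_kl and e_k(b) from labelled (path, observations)
    pairs, with a uniform pseudocount r added to every count."""
    A = {k: {l: r for l in states} for k in states}       # A_kl + r_kl
    E = {k: {b: r for b in symbols} for k in states}      # E_k(b) + r_k(b)
    for pi, x in labelled:
        for k, l in zip(pi, pi[1:]):                      # transition counts
            A[k][l] += 1
        for k, b in zip(pi, x):                           # emission counts
            E[k][b] += 1
    # normalize counts into probabilities
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a, e
```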
Baum-Welch algorithm
• Let’s iteratively compute 𝐴𝑘𝑙 and 𝐸𝑘(𝑏) from existing values of 𝑎𝑘𝑙 and 𝑒𝑘(𝑏) by considering the most probable paths, and use it to update 𝑎𝑘𝑙and 𝑒𝑘(𝑏)
P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x)

  f_k(i) … probability of the sequence up to position i, ending in state k
  a_kl … probability of the transition from k to l
  e_l(x_{i+1}) … probability of emitting x_{i+1} in l
  b_l(i+1) … probability of the rest of the sequence

A_kl = Σ_j 1/P(x^j) Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1)
E_k(b) = Σ_j 1/P(x^j) Σ_{i: x^j_i = b} f_k^j(i) b_k^j(i)
Initialization
• Pick arbitrary model parameters

Recurrence
• Set A and E to zero (or to pseudocounts)
• For each training sequence j
  • Calculate f_k(i) for j using the forward algorithm
  • Calculate b_k(i) for j using the backward algorithm
  • Add the contributions of j to A and E
• Calculate the new model parameters (a_kl, e_k)
• Calculate the new log likelihood of the model

Termination
• Stop if the change in log likelihood is less than some threshold or the maximum number of iterations is exceeded
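The whole loop can be sketched as follows; this is a simplified variant (pseudocount 1, fixed begin distribution a0, end transitions a_k0 = 1, and a fixed iteration count instead of a log-likelihood threshold).

```python
def baum_welch(seqs, states, symbols, a0, a, e, n_iter=10):
    """Baum-Welch (EM) re-estimation sketch; a_k0 taken as 1."""
    for _ in range(n_iter):
        A = {k: {l: 1.0 for l in states} for k in states}    # pseudocounts
        E = {k: {s: 1.0 for s in symbols} for k in states}
        for x in seqs:
            # forward: f[i][k] = f_k(i+1)
            f = [{k: a0[k] * e[k][x[0]] for k in states}]
            for sym in x[1:]:
                f.append({l: e[l][sym] * sum(f[-1][k] * a[k][l] for k in states)
                          for l in states})
            # backward: bw[i][k] = b_k(i+1), with b_k(L) = 1
            bw = [{k: 1.0 for k in states}]
            for sym in reversed(x[1:]):
                bw.insert(0, {k: sum(a[k][l] * e[l][sym] * bw[0][l]
                                     for l in states) for k in states})
            px = sum(f[-1].values())                         # P(x)
            # expected emission and transition counts
            for i in range(len(x)):
                for k in states:
                    E[k][x[i]] += f[i][k] * bw[i][k] / px
                    if i + 1 < len(x):
                        for l in states:
                            A[k][l] += (f[i][k] * a[k][l] * e[l][x[i + 1]]
                                        * bw[i + 1][l] / px)
        # M step: renormalize expected counts into new parameters
        a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
        e = {k: {s: E[k][s] / sum(E[k].values()) for s in symbols} for k in states}
    return a, e
```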
Comparison of the original casino model parameters with parameters estimated from 300 rolls and from 30 000 rolls.
Log transformation
• Multiplying many numbers between 0 and 1 will result in underflow
• For the Viterbi algorithm, we simply take logarithms of the probabilities
  • x̃ = log x
  • V_l(i+1) = ẽ_l(x_{i+1}) + max_k ( V_k(i) + ã_kl )
• For forward and backward algorithms, the transformation can be done as well, but the logs of the parameters can’t be precomputed due to having sums in the formulas
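A sketch of the log-space Viterbi recursion; it assumes all parameters are strictly positive (so the logarithms exist) and single-character state names.

```python
import math

def viterbi_log(x, states, a0, a, e):
    """Viterbi in log space: V_l(i+1) = log e_l(x_{i+1}) + max_k (V_k(i) + log a_kl).
    Assumes strictly positive parameters; a_k0 taken as 1."""
    V = {k: math.log(a0[k]) + math.log(e[k][x[0]]) for k in states}
    paths = {k: k for k in states}           # best path ending in each state
    for sym in x[1:]:
        nV, npaths = {}, {}
        for l in states:
            best = max(states, key=lambda k: V[k] + math.log(a[k][l]))
            nV[l] = math.log(e[l][sym]) + V[best] + math.log(a[best][l])
            npaths[l] = paths[best] + l
        V, paths = nV, npaths
    last = max(states, key=V.get)
    return paths[last], V[last]
```

Sums of logs never underflow the way products of small probabilities do, so long sequences stay numerically safe.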