A retrospective look at our models

A retrospective look at our modelsFirst we saw the finite state automaton

The rigid non-stochastic nature of these structures ultimately limited their usefulness to us as models of

DNA

1 2 3 4 5 6 7 8S eg g g gc gc tc

c

a

The finite state automaton above is non-deterministic, but is NOT probabilistic

Sequences under test are either elements of the set of accepted sequences, or are rejected

Markov ModelWhat do we need to probabilistically model DNA sequences?

A C

G T

States

Transition probabilities

Here each observation necessarily corresponded to the underlying biological state…which also limited the usefulness and generality of

the concept as a model for DNA or amino acid sequences

HMM TopologyMany different HMM topologies are possible…

a

b

RS e

Careful selection of alternative HMM topologies will allow us to address a variety of different problems using essentially the same algorithmic toolkit

HMM TopologyDuration modeling with HMMs

Consider our CpG island model with two underlying states….

How long does our model dwell in a particular state?

This should be familiar as an exponentially decaying value

CpG +

S e

CpG -

P(L residues) = (1-p)·pL-1

p


What if this doesn’t accurately reflect the real distribution of lengths?

CpG +

S e

CpG -

P(L residues) =

(1-p)·pL-1

p

P

L


Assuming that each state shown here has the same emission frequencies, this forms an example of a submodel topology that will

guarantee a minimum of four symbols drawn from the same underlying emission distribution, but with a length distribution that

is still geometric. The duration of the model has been tweaked.

There are many more complex possibilities depending on the properties of the length distribution you seek to model

etc.

Note: this and numerous other figures that follow in this presentation are after Durbin et al.

etc.


The topology shown in this variation can model any desired distribution of lengths varying between 2 and 6

etc.

etc.


Without belaboring the derivation, we’ll just mention that this negative binomial distribution offers a great deal of flexibility in modeling length distributions

p

etc. etc.

p p p

1-p 1-p 1-p 1-p

Where n = number of repetitive states, so here n = 4

𝑷 (𝒍 )=( 𝒍−𝟏𝒏−𝟏)𝒑𝒍−𝒏(𝟏−𝒑 )𝒏


By selecting appropriate values for the number of repeating states and the probability of re-entering each state, you can flexibly model the length distribution

http://www.boost.org/doc/libs/1_40_0/libs/math/doc/sf_and_dist/html/math_toolkit/dist/dist_ref/dists/negative_binomial_dist.html


In practical HMMs we would probably use n much smaller than in these examples, but with probability of state re-entry much closer to 1



HMM TopologyA trivial HMM is equivalent to a Position Specific Scoring Matrix

. . . M1 M2 MLS e

The transitions are deterministic and Pr{aMi→Mi+1}= 1 but the emissions correspond to the estimated amino acid

or nucleotide frequencies in the columns of a PSSM

We refer collectively to M1…Mj as match states This simple model cannot account for indels and does not

capture all of the information present in a multiple alignment

HMM TopologyInsertions are handled with special states with background-like emission

M1 Mj Mj+1S e

Insertion states correspond to states that do not match anything in the model. They almost always have emission

probabilities matching the background distribution

We still need a way to deal with deletions…

. . .

Ij

HMM TopologyHow do insertions affect the overall probability of a sequence?

M1 Mj Mj+1S e

Assuming log-odds scoring relative to some background distribution, emissions from Ij will not contribute to the overall score, and only the transitions will matter

. . .

Ij

log aMj→Ij + (k-1)·log aIj→Ij

+ log aIj→Mj+1

For an insertion of length k:

HMM TopologyHow do insertions affect the overall probability of a sequence?

M1 Mj Mj+1S e

This is equivalent to the familiar gap opening + gap extension penalties used in many sequence alignment methods. This is therefore a form of affine gap scoring

. . .

Ij

log aMj→Ij + (k-1)·log aIj→Ij + log aIj→Mj+1

For an insertion of length k:

HMM TopologyHow best to handle deletions?

S e

The problem with allowing arbitrarily long gaps in any practically sized model is that we would need to

estimate far too many transition probabilities

Estimation is the hardest HMM problem, so we should avoid an unnecessary proliferation of transitions wherever possible!

HMM TopologyIn general we handle deletions by adding silent states…

S e

We can use a sequence of silent-state transitions to move from any match state to any other match state!

Note that not all Dj→Dj+1 transitions need have the same probability

Mj

Dj

The Profile HMM If we combine all these features, we have the famous profile HMM

S e

We can use a sequence of silent-state transitions to move from any match state to any other match state!

Profile HMMs fully generalise the concept of pairwise alignment!

Mj

Dj

Ij

The Profile HMM If we combine all these features, we have the famous profile HMM

S e

Profile HMMs are extensively used for the identification of new members of conserved sequence families

This relies on the ability to estimate parameters for the profile HMM on the basis of multiple sequence alignments

Mj

Dj

Ij

Basic estimation for profile HMMs

We can now count simply count the transitions and emissions to calculate our maximum likelihood estimators from frequencies. But what about missing observations?

Consider a multiple sequence alignment

#pos 01234567890 >glob1 VGA--HAGEY>glob2 V----NVDEV>glob3 VEA--DVAGH>glob4 VKG------D>glob5 VYS--TYETS>glob6 FNA--NIPKH>glob7 IAGADNGAGV *** *****First heuristic: positions with more than 50% gaps will be modeled as

inserts, the remainder as matches. In this example only the starred columns will correspond to matches


Let’s start by simply applying Laplace’s rule: add one pseudocount to every frequency calculation

Parameter estimation from multiple alignmentsSecond heuristic: use pseudocounts to fill in missing observations..

Here B is the total number of pseudocounts, and q represents the fraction of the total number that have been allocated to that

particular transition or emission


These are the estimates when we apply Laplace’s rule for adding pseudocounts..

Consider a multiple sequence alignment

#pos 01234567890 >glob1 VGA--HAGEY>glob2 V----NVDEV>glob3 VEA--DVAGH>glob4 VKG------D>glob5 VYS--TYETS>glob6 FNA--NIPKH>glob7 IAGADNGAGV *** *****eM1(V) = 6/27, eM1(I) = eM1(F) = 2/27, eM1(all other aa) = 1/27

aM1→M2 = 7/10, aM1→D2 = 2/10, aM1→I1 = 1/10, etc.

Scoring with profile HMMs

We could calculate separate scores for the models, then combine, but this is inefficient

Since the state path is unknown we have two basic options:

• Employ the Viterbi algorithm to determine the joint probability of the observed sequence and the most probable state path

• Employ the forward (or backward) algorithm to determine the full probability summed over all possible state paths

The state path of sequences under test are always unknown

Either way, one fly in the ointment is that we are usually interested in a score expressed as a log-odds ratio relative to our random model:

𝑷 (𝒙|𝒓𝒂𝒏𝒅𝒐𝒎 )=∏𝒊𝒒𝒙 𝒊

Scoring with profile HMMs

In other words, we need only adjust and “log_floatize” our emissions, which is something we can do up front in the self.emissions dict rather than in the algorithms

We could rewrite the recurrences to accommodate log-odds scoring. For example, Viterbi might look like this:

Modifying Viterbi and Forward for profile HMMs

This would be a serious nuisance except we already have a log_float class!!! So, we can instead just alter our emission terms to the form below:

emission = log_float( )

Handling multiple sequences in Python

class FastA(list): # note derivation from list, not object

def __init__(self):#stuff

def read_file(self, filepath):# stuff

def calculate_MLE(self, pseudocounts)# stuffreturn (transitions, emissions)

I’ll provide more information about the key methods in the form of a Python docstring

Inheritance is a powerful tool for deriving new classes

Our goal is to develop a class, FastA, that behaves like a python list that serves as a container for sequences read from a file. Each element will contain a tuple of the

annotation and the sequence stored as a string. It should also be able to use these sequences to then calculate “dict-of-dict” transition and emission distributions suitable

for use in our increasingly sophisticated HMM class

Documents

A retrospective look at our models