Lecture 10 – Models of DNA Sequence Evolution Correct for multiple substitutions in calculating pairwise genetic distances. Derive transformation probabilities

Lecture 10 – Models of DNA Sequence Evolution

Correct for multiple substitutions in calculating pairwise genetic distances.

Derive transformation probabilities for likelihood-based methods.

$$\mathrm{Prob}(R_r \mid \tau) = \pi_m \times P_{m,k}(v_{3,1}) \times P_{k,A}(v_{1,w}) \times P_{k,G}(v_{1,x}) \times P_{m,l}(v_{3,2}) \times P_{l,C}(v_{2,y}) \times P_{l,C}(v_{2,z})$$

It is the $P_{i,j}$ terms that we need a substitution model to calculate.
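As a rough illustration of how the $P_{i,j}$ terms enter the product above, the sketch below evaluates it for a four-tip tree with tip states A, G, C, C. Two things here are assumptions and not part of the slide: the Jukes-Cantor probabilities are used as a stand-in for $P_{i,j}$, and the product is summed over all unobserved internal states m, k, and l (the slide shows only a single term). The branch values are made up.

```python
import math
from itertools import product

BASES = "ACGT"

def p_jc(i, j, v):
    """Stand-in P_ij: Jukes-Cantor probability for a branch with alpha * t = v."""
    decay = math.exp(-4.0 * v)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def term(m, k, l, v):
    """One product term for fixed internal states m (root), k and l."""
    return (0.25  # pi_m: equal base frequencies assumed
            * p_jc(m, k, v["3,1"]) * p_jc(k, "A", v["1,w"]) * p_jc(k, "G", v["1,x"])
            * p_jc(m, l, v["3,2"]) * p_jc(l, "C", v["2,y"]) * p_jc(l, "C", v["2,z"]))

def site_likelihood(v):
    """Sum the term over all 4**3 assignments of the unobserved states."""
    return sum(term(m, k, l, v) for m, k, l in product(BASES, repeat=3))

# Hypothetical branch values, keyed by the subscripts used in the formula.
branches = {"3,1": 0.05, "1,w": 0.1, "1,x": 0.1,
            "3,2": 0.05, "2,y": 0.1, "2,z": 0.1}
print(site_likelihood(branches))
```

Any other substitution model could be swapped in by replacing `p_jc` and the root frequency.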

The models typically used are Markov processes.

A Poisson process is a stochastic process that can be used to model events in time.

The time between events is exponentially distributed, with rate $\lambda$.
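A minimal simulation sketch of this idea: event times are generated by adding up exponentially distributed waiting times. The rate, time window, and seed are arbitrary illustrative values, and the function name is hypothetical.

```python
import random

def simulate_event_times(rate, t_max, rng):
    """Event times of a Poisson process on [0, t_max], built from exponential gaps."""
    times = []
    t = rng.expovariate(rate)          # waiting time to the first event
    while t <= t_max:
        times.append(t)
        t += rng.expovariate(rate)     # waiting time to the next event
    return times

rng = random.Random(1)                 # fixed seed so the run is repeatable
counts = [len(simulate_event_times(2.0, 5.0, rng)) for _ in range(2000)]
mean_count = sum(counts) / len(counts)
print(mean_count)                      # expected to be near rate * t_max = 10
```

The average number of events per window should approach rate times window length as the number of replicates grows.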

Jukes-Cantor Model

The probability of a site remaining constant is: $p_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$

The probability of a site changing to a particular other nucleotide is: $p_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$

$\alpha$ is the rate at which any nucleotide changes to any other per unit time.
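The two formulas can be evaluated directly. The sketch below tabulates them for an illustrative rate (the value of `alpha` is made up) and shows that the probability of staying plus three times the probability of a specific change is always 1, and that both approach 1/4 for long times.

```python
import math

def jc_same(alpha, t):
    """p_ii(t): probability the site shows the same nucleotide after time t."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * t)

def jc_diff(alpha, t):
    """p_ij(t): probability of one particular different nucleotide after time t."""
    return 0.25 - 0.25 * math.exp(-4.0 * alpha * t)

alpha = 0.01  # illustrative rate, not taken from the lecture
for t in (0.0, 10.0, 100.0, 1000.0):
    same, diff = jc_same(alpha, t), jc_diff(alpha, t)
    # The last column is the row sum p_ii + 3 * p_ij.
    print(t, round(same, 4), round(diff, 4), round(same + 3 * diff, 4))
```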

Given that the state at the site is $i$ at $t_0$, we start by estimating the probability of state $i$ at that site at $t_1$.

$p_i(0) = 1$

$p_i(1) = 1 - 3\alpha$

Now, what is the probability of this site having state $i$ at $t_2$?

There are two ways for the site to have state $i$ at $t_2$:

1 – It still hasn't changed since time $t_0$.

2 – It has changed to something else and back again.

Therefore, $p_i(2) = (1 - 3\alpha)\,p_i(1) + \alpha\,[1 - p_i(1)]$, where

$(1 - 3\alpha)\,p_i(1)$ = probability of no change at the site during the step from $t_1$ to $t_2$, $(1 - 3\alpha)$, times the probability of the site having state $i$ at time $t_1$, $p_i(1)$,

and

$\alpha\,[1 - p_i(1)]$ = probability of a change to $i$, $\alpha$, times the probability that the site is not in state $i$ at time $t_1$, $1 - p_i(1)$.

We have a recurrence equation.

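$p_i(t+1) = (1 - 3\alpha)\,p_i(t) + \alpha\,[1 - p_i(t)] = p_i(t) - 3\alpha\,p_i(t) + \alpha - \alpha\,p_i(t)$

We can calculate the change in $p_i(t)$ across time, $\Delta t$.

$p_i(t+1) - p_i(t) = -3\alpha\,p_i(t) + \alpha - \alpha\,p_i(t)$

so

$\Delta p_i(t) = -4\alpha\,p_i(t) + \alpha$

and, in continuous time,

$\frac{dp_i(t)}{dt} = -4\alpha\,p_i(t) + \alpha$

This is a first-order linear differential equation, with solution: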

$p_i(t) = \frac{1}{4} + \left(p_i(0) - \frac{1}{4}\right) e^{-4\alpha t}$
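One way to see that the continuous-time solution agrees with the step-by-step recurrence is to iterate the recurrence for many small steps and compare. This is a sketch under the assumption that the per-step rate is small, so the two only agree approximately; the numbers are illustrative.

```python
import math

def recurrence(alpha, steps, p0=1.0):
    """Iterate p(t+1) = (1 - 3a) p(t) + a (1 - p(t)) over unit time steps."""
    p = p0
    for _ in range(steps):
        p = (1.0 - 3.0 * alpha) * p + alpha * (1.0 - p)
    return p

def closed_form(alpha, t, p0=1.0):
    """Continuous-time solution p(t) = 1/4 + (p(0) - 1/4) exp(-4 a t)."""
    return 0.25 + (p0 - 0.25) * math.exp(-4.0 * alpha * t)

alpha, steps = 0.001, 100   # small per-step rate, made-up values
print(recurrence(alpha, steps), closed_form(alpha, steps))
print(recurrence(alpha, steps, p0=0.0), closed_form(alpha, steps, p0=0.0))
```

The gap between the two columns shrinks as `alpha` gets smaller, which is the discrete-to-continuous limit used in the derivation.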

We have a probability that a site has a particular nucleotide after time t, given in terms of its initial state.

If $i = j$, $p_i(0) = 1$.

Therefore, $p_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$

If $i \neq j$, $p_i(0) = 0$, and $p_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$

$\alpha$ is an instantaneous rate, so we've modeled branch length (rate × time) explicitly in our expectations.
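These expectations are what allow a pairwise distance to be corrected for multiple substitutions. As a sketch: the expected proportion of differing sites is $p = 1 - p_{ii}(t) = \frac{3}{4}(1 - e^{-4\alpha t})$, and the expected number of substitutions per site is $d = 3\alpha t$, which gives the standard Jukes-Cantor correction $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$. The function names below are illustrative.

```python
import math

def jc_distance(p):
    """JC-corrected distance from the observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def expected_p(d):
    """Expected proportion of differing sites for a true distance d = 3 * alpha * t."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

for p in (0.05, 0.3, 0.6):
    # The corrected distance is always at least as large as the raw proportion.
    print(p, jc_distance(p))
```

The correction grows quickly as `p` approaches 0.75, the value at which the sequences are effectively saturated under this model.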

The JC model makes several assumptions.

1) All substitutions are equally likely; we have a single substitution type.

2) Base frequencies are assumed to be equal; each of the four nucleotides occurs at 25% of sites.

3) Each site has the same probability of experiencing a substitution as any other; we have an equal-rates model.

4) The process is constant through time.

5) Sites are independent of each other.

6) Substitution is a Markov process.

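$$Q = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}$$

Q-matrix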

Substitution types and base frequencies.

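For the general case:

$$Q = \begin{pmatrix} -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\ \mu g\pi_A & -\mu(g\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\ \mu h\pi_A & \mu j\pi_C & -\mu(h\pi_A + j\pi_C + f\pi_T) & \mu f\pi_T \\ \mu i\pi_A & \mu k\pi_C & \mu l\pi_G & -\mu(i\pi_A + k\pi_C + l\pi_G) \end{pmatrix}$$

where $\mu$ is the average instantaneous substitution rate, $a, b, c, \ldots, l$ are relative rate parameters (one of them is set to 1), and the $\pi_i$ are the frequencies of the base that is being substituted to. Rows and columns are ordered A, C, G, T.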

Note that this is not symmetric, and therefore, the full model is non-reversible.

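Setting $a = g$, $b = h$, $c = i$, $d = j$, $e = k$, and $f = l$ makes the model time-reversible:

$$Q = \begin{pmatrix} -\mu(a\pi_C + b\pi_G + c\pi_T) & \mu a\pi_C & \mu b\pi_G & \mu c\pi_T \\ \mu a\pi_A & -\mu(a\pi_A + d\pi_G + e\pi_T) & \mu d\pi_G & \mu e\pi_T \\ \mu b\pi_A & \mu d\pi_C & -\mu(b\pi_A + d\pi_C + f\pi_T) & \mu f\pi_T \\ \mu c\pi_A & \mu e\pi_C & \mu f\pi_G & -\mu(c\pi_A + e\pi_C + f\pi_G) \end{pmatrix}$$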

General Time-Reversible Model

There are six relative transformation rates (one of which is set to 1).

There are four base frequencies that must sum to 1.

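Note that this is not a symmetric matrix, but it can be decomposed into $R$ and $\Pi$.

$$R = \begin{pmatrix} -\mu(a+b+c) & \mu a & \mu b & \mu c \\ \mu a & -\mu(a+d+e) & \mu d & \mu e \\ \mu b & \mu d & -\mu(b+d+f) & \mu f \\ \mu c & \mu e & \mu f & -\mu(c+e+f) \end{pmatrix}$$

$$\Pi = \begin{pmatrix} \pi_A & 0 & 0 & 0 \\ 0 & \pi_C & 0 & 0 \\ 0 & 0 & \pi_G & 0 \\ 0 & 0 & 0 & \pi_T \end{pmatrix}$$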
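A small sketch of how the GTR rate matrix can be assembled from the symmetric relative rates and the base frequencies. Each off-diagonal entry is the relative rate times the frequency of the target base, and the diagonal is then set so that every row sums to zero, matching the Q-matrix form used in this lecture. The function name and parameter values are illustrative assumptions.

```python
def gtr_q(rates, freqs, mu=1.0):
    """Build a 4x4 GTR rate matrix (order A, C, G, T).

    rates: dict with relative rates 'a'..'f'; freqs: [pi_A, pi_C, pi_G, pi_T].
    """
    a, b, c, d, e, f = (rates[name] for name in "abcdef")
    # Symmetric relative-rate part (diagonal left at zero for now).
    r = [[0, a, b, c],
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    q = [[mu * r[i][j] * freqs[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        q[i][i] = -sum(q[i][j] for j in range(4) if j != i)
    return q

rates = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0, "f": 1.0}  # made up
freqs = [0.1, 0.2, 0.3, 0.4]                                          # made up
q = gtr_q(rates, freqs)
for row in q:
    print([round(x, 3) for x in row])
```

Time-reversibility shows up as detailed balance: for every pair of bases, $\pi_i q_{ij} = \pi_j q_{ji}$, even though the matrix itself is not symmetric.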

[Figure: Visual GTR]

Common Simplifications

Transition type substitutions occur at a higher rate than transversion substitutions.

The K2P model was the first to address this.

So we set $b = e = \kappa$ (for transitions), and $a = c = d = f = 1$ (for transversions).

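For K2P:

$$Q = \begin{pmatrix} -\mu(\kappa + 2)/4 & \mu/4 & \mu\kappa/4 & \mu/4 \\ \mu/4 & -\mu(\kappa + 2)/4 & \mu/4 & \mu\kappa/4 \\ \mu\kappa/4 & \mu/4 & -\mu(\kappa + 2)/4 & \mu/4 \\ \mu/4 & \mu\kappa/4 & \mu/4 & -\mu(\kappa + 2)/4 \end{pmatrix}$$

All $\pi_i = \frac{1}{4}$

where $\alpha = \mu\kappa/4$ and $\beta = \mu/4$. Thus, $\kappa = \alpha/\beta$.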
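The K2P matrix can be sketched by classifying each pair of bases as a transition (A↔G, C↔T) or a transversion. The parameter values are made up, and the function name is illustrative.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def k2p_q(mu, kappa):
    """K2P rate matrix (order A, C, G, T) with equal base frequencies of 1/4."""
    q = []
    for i in BASES:
        row = []
        for j in BASES:
            if i == j:
                row.append(-mu * (kappa + 2.0) / 4.0)   # one transition, two transversions
            elif (i, j) in TRANSITIONS:
                row.append(mu * kappa / 4.0)            # alpha in the slide's notation
            else:
                row.append(mu / 4.0)                    # beta in the slide's notation
        q.append(row)
    return q

q = k2p_q(mu=1.0, kappa=4.0)
alpha, beta = q[0][2], q[0][1]      # A->G (transition), A->C (transversion)
print(alpha, beta, alpha / beta)    # the ratio recovers kappa
```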

Hasegawa-Kishino-Yano (HKY) Model

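For HKY:

$$Q = \begin{pmatrix} -\mu(\kappa\pi_G + \pi_Y) & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\ \mu\pi_A & -\mu(\kappa\pi_T + \pi_R) & \mu\pi_G & \mu\kappa\pi_T \\ \mu\kappa\pi_A & \mu\pi_C & -\mu(\kappa\pi_A + \pi_Y) & \mu\pi_T \\ \mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & -\mu(\kappa\pi_C + \pi_R) \end{pmatrix}$$

where $\alpha = \mu\kappa$, $\beta = \mu$, $\pi_R = \pi_A + \pi_G$, and $\pi_Y = \pi_C + \pi_T$.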
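A sketch of the HKY matrix, built the same way as K2P but with each rate weighted by the frequency of the target base. With all frequencies set to 1/4 it reduces to the K2P entries. Frequencies and parameter values are made up for illustration.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_q(mu, kappa, freqs):
    """HKY rate matrix (order A, C, G, T); freqs maps each base to its frequency."""
    q = []
    for i in BASES:
        row = []
        for j in BASES:
            if i == j:
                row.append(0.0)                 # placeholder, filled in below
            else:
                rate = mu * freqs[j]
                if (i, j) in TRANSITIONS:
                    rate *= kappa               # transitions are scaled by kappa
                row.append(rate)
        row[BASES.index(i)] = -sum(row)         # rows of Q sum to zero
        q.append(row)
    return q

freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
q = hky_q(mu=1.0, kappa=4.0, freqs=freqs)
equal = hky_q(mu=1.0, kappa=4.0, freqs={b: 0.25 for b in BASES})
print(q[0])
print(equal[0])   # same entries as the K2P matrix with mu = 1, kappa = 4
```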

There are lots of other models that restrict the Q-matrix.

Some common models

There are 203 special cases of the GTR rate matrix, or 406 if each may also have either equal or unequal base frequencies.
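The figure of 203 can be checked by brute force, assuming that a "special case" means a way of grouping the six relative rates a through f into classes that are constrained to be equal (a set partition). The sketch enumerates these groupings as label patterns; for example, K2P corresponds to the pattern 0,1,0,0,1,0 (b and e share one rate, the rest share another).

```python
def rate_partitions(n=6):
    """Yield one label tuple per way of grouping n rate parameters into classes."""
    def extend(prefix, used):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        # A new parameter may join any existing class or open the next new one.
        for label in range(used + 1):
            yield from extend(prefix + [label], max(used, label + 1))
    yield from extend([], 0)

schemes = list(rate_partitions(6))
print(len(schemes))        # number of rate groupings
print(2 * len(schemes))    # doubled for equal vs. unequal base frequencies
```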

Calculating Transformation Probabilities.

So the Q & R matrices we’ve been discussing define the instantaneous rates of substitutions from one nucleotide to another.

Convert the rates to probabilities by matrix exponentiation:

$P(t) = e^{Qt}$
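A minimal numerical sketch of this conversion, using only the standard library. The matrix exponential is approximated by scaling and squaring with a truncated Taylor series (one of several possible methods, chosen here for brevity), and the result for the Jukes-Cantor Q-matrix is compared with the closed-form probabilities derived earlier. The rate and time values are illustrative.

```python
import math

def mat_mul(x, y):
    """Product of two square matrices given as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(a, squarings=10, terms=12):
    """Approximate exp(a): Taylor series on a / 2**squarings, then repeated squaring."""
    n = len(a)
    scaled = [[v / 2 ** squarings for v in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def jc_q(alpha):
    """Jukes-Cantor rate matrix: alpha off the diagonal, -3 * alpha on it."""
    return [[-3 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def transition_matrix(q, t):
    """P(t) = exp(Q t)."""
    return mat_exp([[v * t for v in row] for row in q])

alpha, t = 0.1, 2.0
p = transition_matrix(jc_q(alpha), t)
print(p[0][0], 0.25 + 0.75 * math.exp(-4 * alpha * t))   # numerical vs. closed form
print(p[0][1], 0.25 - 0.25 * math.exp(-4 * alpha * t))
```

For models without a simple closed form, the same `transition_matrix` call applies to any valid Q-matrix.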

Jukes-Cantor

K2P

Again, it is these $P_{ij}$ that are used in the likelihood function.
