Molecular Evolution and Phylogenetic Tree Reconstruction


Phylogenetic Trees

• Nodes: species

• Edges: time of independent evolution

• Edge length represents evolution time (also known as genetic distance), which is not necessarily chronological time

Inferring Phylogenetic Trees

Trees can be inferred by several criteria:

• Morphology of the organisms (can lead to mistakes!)

• Sequence comparison

Example:

Mouse:  ACAGTGACGCCCCAAACGT
Rat:    ACAGTGACGCTACAAACGT
Baboon: CCTGTGACGTAACAAACGA
Chimp:  CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA

Inferring Phylogenetic Trees

• Sequence-based methods: Deterministic (Parsimony), Probabilistic (SEMPHY)

• Distance-based methods: UPGMA, Neighbor-Joining

• Can compute distances from sequences

Distance Between Two Sequences

Basic principles:

• Degree of sequence difference is proportional to length of independent sequence evolution

• Only use positions where alignment is certain – avoid areas with (too many) gaps

Distance Between Two Sequences

Given sequences xi, xj,

Define

dij = distance between the two sequences

One possible definition:

dij = fraction f of sites u where xi[u] ≠ xj[u]

Better scores are derived by modeling evolution as a continuous change process
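The simple definition above (fraction of differing sites) can be sketched as follows. This is a minimal illustration: the function name `p_distance` and the choice of "-" as the gap symbol are assumptions, and the sequences are the mouse, rat and human rows from the example.

```python
def p_distance(x_i: str, x_j: str, gap: str = "-") -> float:
    """Fraction of compared sites u where x_i[u] != x_j[u]."""
    if len(x_i) != len(x_j):
        raise ValueError("sequences must be aligned to the same length")
    # Only use columns where neither sequence has a gap (assumed gap symbol).
    pairs = [(a, b) for a, b in zip(x_i, x_j) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


mouse = "ACAGTGACGCCCCAAACGT"
rat = "ACAGTGACGCTACAAACGT"
human = "CCTGTGACGTAGCAAACGA"

print(p_distance(mouse, rat))    # 2 differing sites out of 19
print(p_distance(mouse, human))  # 6 differing sites out of 19
```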

Outline

• Molecular Evolution

• Distance Methods: UPGMA / Average Linkage, Neighbor-Joining

• Sequence Methods: Deterministic (Parsimony), Probabilistic (SEMPHY)

Molecular Evolution

Q: How can we model evolution on nucleotide level? (ignore gaps, focus on substitutions)

A: Consider what happens at a specific position for small time interval Δt

• P(t) = vector of probabilities of {A,C,G,T} at time t

• μAC = rate of transition from A to C per unit time

• μA = μAC + μAG + μAT = rate of transition out of A

• pA(t+Δt) = pA(t) – pA(t) μA Δt + pC(t) μCA Δt + …

Molecular Evolution

In matrix/vector notation, we get

P(t+Δt) = P(t) + Q P(t) Δt

where Q is the substitution rate matrix

Molecular Evolution

• This is a differential equation:

P’(t) = Q P(t)

• A substitution rate matrix Q implies a probability distribution over {A,C,G,T} at each position, including stationary (equilibrium) frequencies πA, πC, πG, πT

• Each Q is an evolutionary model (some work better than others)
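To make the update rule P(t+Δt) = P(t) + Q P(t) Δt concrete, here is a small numerical sketch. It assumes a Jukes-Cantor rate matrix (every substitution has the same rate) with an arbitrary illustrative rate `ALPHA = 0.1`. The closed form it is compared against, ¼ + ¾ e^(-4αt), is the Jukes-Cantor probability of staying at the starting nucleotide.

```python
import math

ALPHA = 0.1  # assumed rate for each specific substitution (illustrative value)
NUC = "ACGT"

# Jukes-Cantor rate matrix: every off-diagonal entry is ALPHA, and each
# diagonal entry is minus the total rate out of that nucleotide.
Q = [[-3 * ALPHA if a == b else ALPHA for b in NUC] for a in NUC]


def euler_step(p, dt):
    """One step of P(t + dt) = P(t) + Q P(t) dt."""
    return [p[a] + dt * sum(Q[a][b] * p[b] for b in range(4)) for a in range(4)]


def evolve(p0, t, steps=10000):
    """Integrate P'(t) = Q P(t) from 0 to t with small Euler steps."""
    p = list(p0)
    for _ in range(steps):
        p = euler_step(p, t / steps)
    return p


p = evolve([1.0, 0.0, 0.0, 0.0], t=2.0)  # start at nucleotide A with certainty
closed_form = 0.25 + 0.75 * math.exp(-4 * ALPHA * 2.0)
print(p[0], closed_form)  # the two values should agree closely
```

As t grows, every entry of P(t) approaches ¼, the stationary frequencies of this model.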

Evolutionary Models

• Jukes-Cantor

• Kimura

• Felsenstein

• HKY

Estimating Distances

• Solve the differential equation and compute expected evolutionary time given sequences

• Jukes-Cantor

• Kimura
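As an illustration, the standard closed-form distance corrections for these two models can be sketched as below. The formulas are the textbook Jukes-Cantor and Kimura two-parameter corrections, not taken from the slides; the function names are assumptions, and the Kimura function expects an ungapped pairwise alignment.

```python
import math

PURINES = {"A", "G"}


def jukes_cantor_distance(f: float) -> float:
    """Expected substitutions per site, given fraction f of differing sites."""
    if not 0 <= f < 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for f >= 3/4")
    return -0.75 * math.log(1 - 4 * f / 3)


def kimura_distance(x_i: str, x_j: str) -> float:
    """Kimura two-parameter distance from an ungapped pairwise alignment."""
    n = len(x_i)
    transitions = transversions = 0
    for a, b in zip(x_i, x_j):
        if a == b:
            continue
        # A<->G and C<->T are transitions; everything else is a transversion.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


mouse = "ACAGTGACGCCCCAAACGT"
rat = "ACAGTGACGCTACAAACGT"
print(jukes_cantor_distance(2 / 19))
print(kimura_distance(mouse, rat))
```

Both corrected distances are slightly larger than the raw fraction of differing sites, because they account for multiple substitutions at the same position.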

A simple clustering method for building a tree

UPGMA (unweighted pair group method using arithmetic averages), also called the Average Linkage Method

Given two disjoint clusters Ci, Cj of sequences, define

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$

Claim: if Ck = Ci ∪ Cj, then the distance to another cluster Cl is

$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$

Algorithm: Average Linkage

Initialization:
Assign each xi into its own cluster Ci
Define one leaf per sequence, height 0

Iteration:
Find two clusters Ci, Cj s.t. dij is minimal
Let Ck = Ci ∪ Cj
Define node connecting Ci, Cj, and place it at height dij/2
Delete Ci, Cj

Termination:
When two clusters Ci, Cj remain, place root at height dij/2
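A compact sketch of the average linkage procedure follows. The data layout is an assumption made for illustration: distances are a dict keyed by pairs of names, a tree is either a leaf name or a triple (left, right, height), and merged clusters are labeled by concatenating names. The test data is a small ultrametric example on five leaves v, w, x, y, z.

```python
def upgma(names, d):
    """Average linkage clustering; returns a nested (left, right, height) tree."""
    dist = {frozenset(pair): v for pair, v in d.items()}
    # Each cluster label maps to (subtree, number of leaves in the cluster).
    clusters = {name: (name, 1) for name in names}
    while len(clusters) > 1:
        # Find the two clusters Ci, Cj with minimal distance.
        i, j = min(
            ((a, b) for a in clusters for b in clusters if a < b),
            key=lambda pair: dist[frozenset(pair)],
        )
        d_ij = dist[frozenset((i, j))]
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        k = i + j  # hypothetical label for the merged cluster Ck
        # Size-weighted average gives the distance from Ck to every other Cl.
        for l in clusters:
            d_il, d_jl = dist[frozenset((i, l))], dist[frozenset((j, l))]
            dist[frozenset((k, l))] = (d_il * n_i + d_jl * n_j) / (n_i + n_j)
        # The node joining Ci and Cj is placed at height dij / 2.
        clusters[k] = ((tree_i, tree_j, d_ij / 2), n_i + n_j)
    (tree, _), = clusters.values()
    return tree


d = {
    ("v", "w"): 6, ("v", "x"): 8, ("v", "y"): 8, ("v", "z"): 8,
    ("w", "x"): 8, ("w", "y"): 8, ("w", "z"): 8,
    ("x", "y"): 4, ("x", "z"): 4,
    ("y", "z"): 2,
}
print(upgma(["v", "w", "x", "y", "z"], d))
```

On this input the merges happen at heights 1 (y, z), 2 (x with yz), 3 (v, w) and 4 (root).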


Average Linkage Example

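Initial distances:

     v  w  x  y  z
v    0  6  8  8  8
w       0  8  8  8
x          0  4  4
y             0  2
z                0

After merging y and z (node at height 1):

     v  w  x  yz
v    0  6  8  8
w       0  8  8
x          0  4
yz            0

After merging x and yz (node at height 2):

     v  w  xyz
v    0  6  8
w       0  8
xyz        0

After merging v and w (node at height 3):

     vw  xyz
vw   0   8
xyz      0

Finally, vw and xyz are joined at the root (height 4).

[Figure: resulting tree over leaves v, w, x, y, z, with internal nodes at heights 1, 2, 3, 4]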

Ultrametric Distances and Molecular Clock

Definition:
A distance function d(.,.) is ultrametric if for any three distances dij ≤ dik ≤ djk, it is true that dij ≤ dik = djk

The Molecular Clock:
The evolutionary distance between species x and y is 2 × the Earth time to reach the nearest common ancestor
That is, the molecular clock has constant rate in all species

[Figure: tree over leaves 1, 4, 2, 3, 5 drawn against a time axis in years]

The molecular clock results in ultrametric distances
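The ultrametric condition can be checked directly on a distance matrix: in every triple of leaves, the two largest of the three pairwise distances must be equal. The sketch below assumes distances stored in a dict keyed by name pairs; the function name and tolerance are illustrative. The first matrix is ultrametric, the second is additive but not ultrametric.

```python
from itertools import combinations


def is_ultrametric(names, d, tol=1e-9):
    """True if in every triple the two largest pairwise distances are equal."""
    def dist(a, b):
        return d[(a, b)] if (a, b) in d else d[(b, a)]

    for i, j, k in combinations(names, 3):
        _, mid, large = sorted([dist(i, j), dist(i, k), dist(j, k)])
        if abs(large - mid) > tol:
            return False
    return True


names = ["v", "w", "x", "y", "z"]
clock_like = {
    ("v", "w"): 6, ("v", "x"): 8, ("v", "y"): 8, ("v", "z"): 8,
    ("w", "x"): 8, ("w", "y"): 8, ("w", "z"): 8,
    ("x", "y"): 4, ("x", "z"): 4, ("y", "z"): 2,
}
not_clock_like = {
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
}
print(is_ultrametric(names, clock_like))      # True
print(is_ultrametric(names, not_clock_like))  # False
```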

Ultrametric Distances & Average Linkage

Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances

Proof: Exercise


Weakness of Average Linkage

Molecular clock: all species evolve at the same rate (Earth time)

However, certain species (e.g., mouse, rat) evolve much faster

Example where UPGMA messes up:

[Figure: the correct tree (left) and the tree built by average linkage, the AL tree (right), on leaves 1, 2, 3, 4]

Additive Distances

Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them

Given a tree T & additive distances dij, can uniquely reconstruct edge lengths:

• Find two neighboring leaves i, j, with common parent k

• Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m ≠ i, j

[Figure: tree with nodes numbered 1 to 13, with the distance d1,4 marked]

Reconstructing Additive Distances Given T

[Figure: tree T with leaves v, w, x, y, z and edge lengths 5, 4, 7, 3, 3, 4, 6]

If we know T and D, but do not know the length of each edge, we can reconstruct those lengths

Distance matrix D:

     v   w   x   y   z
v    0  10  17  16  16
w        0  15  14  14
x            0   9  15
y                0  14
z                    0



D1 (leaves v, w replaced by their parent a):

     a   x   y   z
a    0  11  10  10
x        0   9  15
y            0  14
z                0

dax = ½ (dvx + dwx – dvw)

day = ½ (dvy + dwy – dvw)

daz = ½ (dvz + dwz – dvw)
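The reduction step can be sketched in a few lines. This is an illustration under assumed conventions: distances live in a dict keyed by `frozenset` pairs, and the helper names `dist` and `join_neighbors` are hypothetical. Starting from D, it replaces v, w by a, then x, y by b, then b, z by c.

```python
def dist(d, a, b):
    """Look up a symmetric distance; a node is at distance 0 from itself."""
    return 0 if a == b else d[frozenset((a, b))]


def join_neighbors(d, nodes, i, j, k):
    """Replace neighbors i, j by parent k using d_km = (d_im + d_jm - d_ij) / 2."""
    rest = [m for m in nodes if m not in (i, j)]
    new_d = {frozenset((a, b)): dist(d, a, b) for a in rest for b in rest if a != b}
    for m in rest:
        new_d[frozenset((k, m))] = (dist(d, i, m) + dist(d, j, m) - dist(d, i, j)) / 2
    return new_d, [k] + rest


# frozenset("vw") is the set {"v", "w"}, so two-letter keys name leaf pairs.
D = {frozenset(pair): value for pair, value in [
    ("vw", 10), ("vx", 17), ("vy", 16), ("vz", 16),
    ("wx", 15), ("wy", 14), ("wz", 14),
    ("xy", 9), ("xz", 15), ("yz", 14),
]}

D1, nodes1 = join_neighbors(D, list("vwxyz"), "v", "w", "a")
D2, nodes2 = join_neighbors(D1, nodes1, "x", "y", "b")
D3, nodes3 = join_neighbors(D2, nodes2, "b", "z", "c")

print(dist(D1, "a", "x"), dist(D1, "a", "y"), dist(D1, "a", "z"))
print(dist(D2, "a", "b"), dist(D2, "b", "z"))
print(dist(D3, "a", "c"))
```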



D2 (leaves x, y replaced by their parent b):

     a   b   z
a    0   6  10
b        0  10
z            0

D3 (nodes b, z replaced by their parent c):

     a   c
a    0   3
c        0

d(a, c) = 3
d(b, c) = d(a, b) – d(a, c) = 3
d(c, z) = d(a, z) – d(a, c) = 7
d(b, x) = d(a, x) – d(a, b) = 5
d(b, y) = d(a, y) – d(a, b) = 4
d(a, w) = d(z, w) – d(a, z) = 4
d(a, v) = d(z, v) – d(a, z) = 6

Correct!!!

Neighbor-Joining

• Guaranteed to produce the correct tree if distance is additive

• May produce a good tree even when distance is not additive

Step 1: Finding neighboring leaves

Define

$$D_{ij} = (N - 2)\, d_{ij} - \sum_{k \ne i} d_{ik} - \sum_{k \ne j} d_{jk}$$

Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighbors

[Figure: four-leaf tree on leaves 1, 2, 3, 4 with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]
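A small numeric check shows why the corrected quantity Dij is needed. The layout of the four-leaf tree is an assumption here: leaves 1 and 2 are taken to be neighbors with pendant edges 0.1 and 0.4, leaves 3 and 4 likewise, and the internal edge has length 0.1. Under that assumption the closest pair of leaves is not a pair of neighbors, while the pair minimizing Dij is.

```python
from itertools import combinations

# Assumed tree layout (not confirmed by the figure): 1, 2 on one side of the
# internal edge, 3, 4 on the other.
pendant = {1: 0.1, 2: 0.4, 3: 0.1, 4: 0.4}
side = {1: "left", 2: "left", 3: "right", 4: "right"}
INTERNAL = 0.1

leaves = [1, 2, 3, 4]
N = len(leaves)


def d(i, j):
    """Path length between leaves i and j in the assumed tree."""
    if i == j:
        return 0.0
    crossing = INTERNAL if side[i] != side[j] else 0.0
    return pendant[i] + pendant[j] + crossing


def D(i, j):
    """Neighbor-joining criterion: (N - 2) d_ij minus both row sums."""
    row_i = sum(d(i, k) for k in leaves if k != i)
    row_j = sum(d(j, k) for k in leaves if k != j)
    return (N - 2) * d(i, j) - row_i - row_j


pairs = list(combinations(leaves, 2))
closest = min(pairs, key=lambda p: d(*p))
nj_pick = min(pairs, key=lambda p: D(*p))
print("closest pair:", closest)      # leaves 1 and 3, which are not neighbors
print("pair minimizing D:", nj_pick)  # a true neighbor pair
```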

Algorithm: Neighbor-Joining

Initialization:
Define T to be the set of leaf nodes, one per sequence
Let L = T

Iteration:
Pick i, j s.t. Dij is minimal
Define a new node k, and set dkm = ½ (dim + djm – dij) for all m ∈ L
Add k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dik, where $r_i = (N - 2)^{-1} \sum_{k \ne i} d_{ik}$
Remove i, j from L; Add k to L

Termination:
When L consists of two nodes i, j, add the edge between them, of length dij
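The algorithm above can be sketched as follows. Assumptions in this sketch: distances are a dict keyed by `frozenset` pairs, N is taken to be the current size of L at each iteration, internal nodes get hypothetical names "n0", "n1", …, and the result is a list of edges (node, node, length). The input is a five-leaf additive matrix whose underlying tree has edge lengths 3, 3, 4, 4, 5, 6, 7.

```python
def neighbor_joining(names, d):
    """Return the tree as a list of edges (node, node, length)."""
    dist = dict(d)
    L = list(names)
    edges = []

    def get(a, b):
        return 0.0 if a == b else dist[frozenset((a, b))]

    next_id = 0
    while len(L) > 2:
        n = len(L)
        total = {i: sum(get(i, m) for m in L) for i in L}
        # Pick the pair i, j minimizing D_ij = (n - 2) d_ij - total_i - total_j.
        i, j = min(
            ((a, b) for pos, a in enumerate(L) for b in L[pos + 1:]),
            key=lambda p: (n - 2) * get(*p) - total[p[0]] - total[p[1]],
        )
        k = f"n{next_id}"  # hypothetical name for the new internal node
        next_id += 1
        d_ij = get(i, j)
        # d_ik = (d_ij + r_i - r_j) / 2 with r_i = total_i / (n - 2).
        d_ik = 0.5 * (d_ij + (total[i] - total[j]) / (n - 2))
        edges += [(i, k, d_ik), (j, k, d_ij - d_ik)]
        for m in L:
            if m not in (i, j):
                dist[frozenset((k, m))] = 0.5 * (get(i, m) + get(j, m) - d_ij)
        L = [m for m in L if m not in (i, j)] + [k]
    i, j = L
    edges.append((i, j, get(i, j)))
    return edges


D = {frozenset(pair): value for pair, value in [
    ("vw", 10), ("vx", 17), ("vy", 16), ("vz", 16),
    ("wx", 15), ("wy", 14), ("wz", 14),
    ("xy", 9), ("xz", 15), ("yz", 14),
]}
for edge in neighbor_joining(list("vwxyz"), D):
    print(edge)
```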

Parsimony

• One of the most popular methods:
GIVEN multiple alignment
FIND tree & history of substitutions explaining alignment

Idea: Find the tree that explains the observed sequences with a minimal number of substitutions

Two computational subproblems:

1. Find the parsimony cost of a given tree (easy)

2. Search through all tree topologies (hard)

Example: Parsimony Cost of One Column

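[Figure: tree with leaves A, B, A, A and leaf sets {A}, {B}, {A}, {A}. The parent of A and B gets the set {A, B} and the cost is incremented (C++); the remaining internal nodes get {A}. Total cost C = 1]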

Parsimony Scoring

Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions

Initialization:
Set cost C = 0; node k = 2N – 1 (the root)

Iteration:
If k is a leaf, set Rk = { xk[u] } // Rk is simply the character of kth species
If k is not a leaf, let i, j be the daughter nodes (compute Ri, Rj first):
Set Rk = Ri ∩ Rj if the intersection is nonempty
Set Rk = Ri ∪ Rj, and increment C, if the intersection is empty

Termination:
Minimal cost of tree for column u = C
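The scoring pass can be written as a short recursion. This is a sketch under an assumed tree encoding: a leaf is its character for the column, and an internal node is a pair (left, right). The two test trees are illustrative topologies, not necessarily the ones drawn on the slides.

```python
def fitch(tree):
    """Return (R, C): candidate character set at this node and minimal cost."""
    if isinstance(tree, str):  # leaf: R is just the observed character
        return {tree}, 0
    (r_i, c_i), (r_j, c_j) = fitch(tree[0]), fitch(tree[1])
    common = r_i & r_j
    if common:
        # Nonempty intersection: no extra substitution is needed here.
        return common, c_i + c_j
    # Empty intersection: take the union and pay one substitution.
    return r_i | r_j, c_i + c_j + 1


caterpillar = ((("A", "B"), "A"), "A")  # assumed topology for leaves A, B, A, A
balanced = (("B", "A"), ("B", "A"))     # assumed topology for leaves B, A, B, A

print(fitch(caterpillar))  # root set {'A'}, cost 1
print(fitch(balanced))     # root set {'A', 'B'}, cost 2
```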

Example

[Figure: two example trees, one with leaves A, A, A, B and one with leaves B, A, B, A, showing the sets Rk computed at each node]

Parsimony Traceback

Traceback:

1. Choose an arbitrary nucleotide from R2N – 1 for the root

2. Having chosen nucleotide r for parent k:
If r ∈ Ri, choose r for daughter i
Else, choose an arbitrary nucleotide from Ri

Easy to see that this traceback produces some assignment of cost C

Another Parsimony Algorithm

Let C(v) be cost for subtree rooted at node v
Let C(v,x) be cost for subtree rooted at v if we force v to have value x

Initialization:
For each leaf v,
C(v) = 0
C(v,x) = 0 if x is the input character that labels v; C(v,x) = ∞ otherwise

Iteration:
Let u, w be children of v
C(v,x) = min(C(u) + 1, C(u,x)) + min(C(w) + 1, C(w,x))
C(v) = min over x of C(v,x)

Termination:
Minimal cost is C(root)
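This recurrence translates almost directly into code. The sketch assumes a nucleotide alphabet and the same nested-pair tree encoding as a plain recursion would use (leaf = character, internal node = (left, right)); the function names are illustrative.

```python
import math

ALPHABET = "ACGT"


def costs(tree):
    """Return {x: C(v, x)} for the subtree rooted at v."""
    if isinstance(tree, str):
        # Leaf: cost 0 for the observed character, infinity for any other.
        return {x: 0 if x == tree else math.inf for x in ALPHABET}
    c_u, c_w = costs(tree[0]), costs(tree[1])
    best_u, best_w = min(c_u.values()), min(c_w.values())  # C(u) and C(w)
    # Each child either keeps value x, or takes its best value plus 1 change.
    return {
        x: min(best_u + 1, c_u[x]) + min(best_w + 1, c_w[x])
        for x in ALPHABET
    }


def parsimony_cost(tree):
    """C(root) = min over x of C(root, x)."""
    return min(costs(tree).values())


print(parsimony_cost(((("A", "C"), "A"), "A")))  # 1
print(parsimony_cost((("C", "A"), ("C", "A"))))  # 2
```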

Probabilistic Methods

A more refined measure of evolution along a tree than parsimony

P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)

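If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1,

$$P(x_1, x_2, x_{root} \mid t_1, t_2) = p_A \cdot \tfrac{1}{4}\left(1 + 3e^{-4\alpha}\right) \cdot \tfrac{1}{4}\left(1 - e^{-4\alpha}\right) = \left(\tfrac{1}{4}\right)^3 \left(1 + 3e^{-4\alpha}\right)\left(1 - e^{-4\alpha}\right)$$

[Figure: root xroot with leaves x1 and x2 on branches of lengths t1 and t2]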
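The two-leaf calculation can be reproduced numerically. The sketch assumes the Jukes-Cantor substitution probabilities ¼(1 + 3e^(-4αt)) for no change and ¼(1 – e^(-4αt)) for a specific change, uniform root frequencies of ¼, and an arbitrary illustrative rate α = 0.1.

```python
import math


def jc_prob(a: str, b: str, t: float, alpha: float) -> float:
    """P(b | a, t) under Jukes-Cantor with rate alpha."""
    e = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * e) if a == b else 0.25 * (1 - e)


def two_leaf_joint(x1, x2, x_root, t1, t2, alpha):
    """P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)."""
    p_root = 0.25  # uniform stationary frequency of each nucleotide
    return p_root * jc_prob(x_root, x1, t1, alpha) * jc_prob(x_root, x2, t2, alpha)


alpha = 0.1  # assumed rate, for illustration only
value = two_leaf_joint("A", "C", "A", 1.0, 1.0, alpha)
e = math.exp(-4 * alpha)
print(value, 0.25 ** 3 * (1 + 3 * e) * (1 - e))  # the two numbers match
```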


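• If we know all internal labels xu,

$$P(x_1, \ldots, x_N, x_{N+1}, \ldots, x_{2N-1} \mid T, t) = P(x_{root}) \prod_{j \ne root} P(x_j \mid x_{parent(j)}, t_{j,\,parent(j)})$$

• Usually we don’t know the internal labels, therefore

$$P(x_1, \ldots, x_N \mid T, t) = \sum_{x_{N+1}} \sum_{x_{N+2}} \cdots \sum_{x_{2N-1}} P(x_1, \ldots, x_{2N-1} \mid T, t)$$

[Figure: tree with root xroot, internal nodes xu, and leaves x1, x2, …, xN]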

Felsenstein’s Likelihood Algorithm

Define:

and recursively compute:

Felsenstein’s Likelihood Algorithm

Now using u and U we can compute:

and


Given M (ungapped) alignment columns of N sequences,

• Define likelihood of a tree:

$$L(T, t) = P(\text{Data} \mid T, t) = \prod_{m=1}^{M} P(x_{1m}, \ldots, x_{Nm} \mid T, t)$$

Maximum Likelihood Reconstruction:

• Given data X = (xij), find a topology T and length vector t that maximize likelihood L(T, t)
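For a fixed topology and branch lengths, the likelihood can be evaluated with the standard post-order ("pruning") recursion, which sums over internal labels one node at a time. The sketch below is one possible formulation, not necessarily the exact one used in the lecture. It assumes Jukes-Cantor substitution probabilities with rate α = 1, uniform root frequencies, and a hypothetical tree encoding where a leaf is the index of a sequence and an internal node is (left, t_left, right, t_right).

```python
import math
from itertools import product

NUC = "ACGT"


def jc_prob(a, b, t, alpha=1.0):
    """P(b | a, t) under Jukes-Cantor (assumed model, assumed rate)."""
    e = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * e) if a == b else 0.25 * (1 - e)


def partial(tree, column):
    """Return {a: P(leaves below this node | node has nucleotide a)}."""
    if isinstance(tree, int):  # leaf: index of a sequence in the column
        return {a: 1.0 if column[tree] == a else 0.0 for a in NUC}
    left, t_left, right, t_right = tree
    p_l, p_r = partial(left, column), partial(right, column)
    return {
        a: sum(jc_prob(a, b, t_left) * p_l[b] for b in NUC)
        * sum(jc_prob(a, c, t_right) * p_r[c] for c in NUC)
        for a in NUC
    }


def column_likelihood(tree, column):
    """P(x1, ..., xN | T, t) for one alignment column."""
    p_root = partial(tree, column)
    return sum(0.25 * p_root[a] for a in NUC)


def log_likelihood(tree, columns):
    """log L(T, t): sum of per-column log likelihoods."""
    return sum(math.log(column_likelihood(tree, col)) for col in columns)


tree = ((0, 0.1, 1, 0.2), 0.3, 2, 0.4)  # three leaves, illustrative lengths
print(log_likelihood(tree, ["AAA", "ACA", "GGT"]))
```

Maximum likelihood reconstruction then searches over topologies and branch lengths for the pair (T, t) with the largest value of this function.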

Current popular methods

HUNDREDS of programs available!
http://evolution.genetics.washington.edu/phylip/software.html#methods

Some recommended programs:

• Discrete, Parsimony-based: Rec-1-DCM3
http://www.cs.utexas.edu/users/tandy/mp.html
Tandy Warnow and colleagues

• Probabilistic: SEMPHY
http://www.cs.huji.ac.il/labs/compbio/semphy/
Nir Friedman and colleagues

