Molecular Evolution and Phylogenetic Tree Reconstruction
[Figure: example phylogenetic trees on leaves 1–5]
Phylogenetic Trees
• Nodes: species
• Edges: time of independent evolution
• Edge length represents evolution time
  • AKA genetic distance
  • Not necessarily chronological time
Inferring Phylogenetic Trees
Trees can be inferred by several criteria:
• Morphology of the organisms
  • Can lead to mistakes!
• Sequence comparison
Example:
Mouse:  ACAGTGACGCCCCAAACGT
Rat:    ACAGTGACGCTACAAACGT
Baboon: CCTGTGACGTAACAAACGA
Chimp:  CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA
Inferring Phylogenetic Trees
• Sequence-based methods
  • Deterministic (Parsimony)
  • Probabilistic (SEMPHY)
• Distance-based methods
  • UPGMA
  • Neighbor-Joining
• Can compute distances from sequences
Distance Between Two Sequences
Basic principles:
• Degree of sequence difference is proportional to length of independent sequence evolution
• Only use positions where alignment is certain – avoid areas with (too many) gaps
Distance Between Two Sequences
Given sequences xi, xj,
Define
dij = distance between the two sequences
One possible definition:
dij = fraction f of sites u where xi[u] ≠ xj[u]
Better scores are derived by modeling evolution as a continuous change process
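A minimal Python sketch of this simple definition, applied to sequences from the example slide. The function name p_distance is hypothetical, and the sketch assumes the sequences are already aligned, ungapped, and of equal length.

```python
def p_distance(x_i, x_j):
    """Fraction of sites u where x_i[u] != x_j[u] (aligned, equal length assumed)."""
    if len(x_i) != len(x_j):
        raise ValueError("sequences must have the same aligned length")
    mismatches = sum(1 for a, b in zip(x_i, x_j) if a != b)
    return mismatches / len(x_i)

# Sequences copied from the example slide
mouse = "ACAGTGACGCCCCAAACGT"
rat = "ACAGTGACGCTACAAACGT"
chimp = "CCTGTGACGTAGCAAACGA"
human = "CCTGTGACGTAGCAAACGA"

print(p_distance(mouse, rat))    # 2 mismatching sites out of 19
print(p_distance(chimp, human))  # identical sequences give 0.0
```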
Outline
• Molecular Evolution
• Distance Methods
  • UPGMA / Average Linkage
  • Neighbor-Joining
• Sequence Methods
  • Deterministic (Parsimony)
  • Probabilistic (SEMPHY)
Molecular Evolution
Q: How can we model evolution on nucleotide level? (ignore gaps, focus on substitutions)
A: Consider what happens at a specific position for small time interval Δt
• P(t) = vector of probabilities of {A,C,G,T} at time t
• μAC = rate of transition from A to C per unit time
• μA = μAC + μAG + μAT rate of transition out of A
• pA(t+Δt) = pA(t) – pA(t) μA Δt + pC(t) μCA Δt + …
Molecular Evolution
In matrix/vector notation, we get
P(t+Δt) = P(t) + Q P(t) Δt
where Q is the substitution rate matrix
Molecular Evolution
• This is a differential equation:
P’(t) = Q P(t)
• A substitution rate matrix Q implies a probability distribution over {A,C,G,T} at each position, including stationary (equilibrium) frequencies πA, πC, πG, πT
• Each Q is an evolutionary model (some work better than others)
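As a rough illustration, the update P(t+Δt) = P(t) + Q P(t) Δt can be iterated numerically. The sketch below assumes a Jukes-Cantor rate matrix (every substitution has the same rate α, so Q is symmetric and its orientation does not matter); the function names, α = 0.1, and the step count are hypothetical choices, not part of the slides.

```python
import math

def jukes_cantor_q(alpha):
    """Rate matrix with rate alpha for every substitution and -3*alpha on the diagonal."""
    return [[-3 * alpha if r == c else alpha for c in range(4)] for r in range(4)]

def evolve(p0, q, t, steps=1000):
    """Iterate P(t + dt) = P(t) + Q P(t) dt from time 0 to time t."""
    dt = t / steps
    p = list(p0)
    for _ in range(steps):
        flow = [sum(q[r][c] * p[c] for c in range(4)) for r in range(4)]
        p = [p[r] + flow[r] * dt for r in range(4)]
    return p

alpha = 0.1                                 # hypothetical rate
start = [1.0, 0.0, 0.0, 0.0]                # position is A at time 0
p = evolve(start, jukes_cantor_q(alpha), t=1.0)
exact_same = 0.25 * (1 + 3 * math.exp(-4 * alpha))  # known Jukes-Cantor solution
print(p[0], exact_same)                     # the two values should be close
```

With a long enough time the vector approaches the stationary frequencies, which are all ¼ under this model.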
Evolutionary Models
• Jukes-Cantor
• Kimura
• Felsenstein
• HKY
Estimating Distances
• Solve the differential equation and compute expected evolutionary time given sequences
• Jukes-Cantor
• Kimura
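The formulas on this slide did not survive extraction. As a hedged illustration only, the sketch below uses the standard Jukes-Cantor correction, d = –¾ ln(1 – 4p/3), where p is the observed fraction of differing sites and d is the expected number of substitutions per site. The function name is hypothetical.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance from the observed fraction p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

p_observed = 2 / 19                      # e.g. mouse vs. rat in the earlier example
d = jukes_cantor_distance(p_observed)
print(p_observed, d)                     # d is slightly larger than p
```

The corrected distance exceeds the raw fraction because some sites have changed more than once, which a simple mismatch count cannot see.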
A simple clustering method for building tree
UPGMA (unweighted pair group method using arithmetic averages), or the Average Linkage Method
Given two disjoint clusters Ci, Cj of sequences,
dij = (1 / (|Ci| |Cj|)) Σ_{p ∈ Ci, q ∈ Cj} dpq
Claim that if Ck = Ci ∪ Cj, then distance to another cluster Cl is:
dkl = (dil |Ci| + djl |Cj|) / (|Ci| + |Cj|)
Algorithm: Average Linkage
Initialization:
  Assign each xi into its own cluster Ci
  Define one leaf per sequence, height 0
Iteration:
  Find two clusters Ci, Cj s.t. dij is min
  Let Ck = Ci ∪ Cj
  Define node connecting Ci, Cj, and place it at height dij/2
  Delete Ci, Cj
Termination:
  When two clusters i, j remain, place root at height dij/2
Average Linkage Example
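       v    w    x    y    z
v      0    6    8    8    8
w           0    8    8    8
x                0    4    4
y                     0    2
z                          0

[Figure: resulting tree on leaves y, z, x, w, v, with the merges numbered 1–4]

After merging y and z:
       v    w    x    yz
v      0    6    8    8
w           0    8    8
x                0    4
yz                    0

After merging x and yz:
       v    w    xyz
v      0    6    8
w           0    8
xyz              0

After merging v and w:
       vw   xyz
vw     0    8
xyz         0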
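The merge sequence of this example can be reproduced with a short Python sketch of average linkage. The function name upgma, the tuple-keyed input table, and the nested-parenthesis cluster labels are hypothetical conventions chosen for illustration.

```python
from itertools import combinations

def upgma(names, dist):
    """Average linkage; returns the merges as (cluster_i, cluster_j, height)."""
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    size = {name: 1 for name in names}          # cluster label -> number of leaves
    merges = []
    while len(size) > 1:
        # Find the two clusters with minimal distance
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        k = f"({i},{j})"
        # Distance from the merged cluster to every other cluster l
        for l in size:
            if l not in (i, j):
                d[frozenset((k, l))] = (
                    d[frozenset((i, l))] * size[i] + d[frozenset((j, l))] * size[j]
                ) / (size[i] + size[j])
        size[k] = size.pop(i) + size.pop(j)
        merges.append((i, j, dij / 2))           # new node placed at height dij/2
    return merges

# Distance table from the example slide
table = {
    ("v", "w"): 6, ("v", "x"): 8, ("v", "y"): 8, ("v", "z"): 8,
    ("w", "x"): 8, ("w", "y"): 8, ("w", "z"): 8,
    ("x", "y"): 4, ("x", "z"): 4, ("y", "z"): 2,
}
merges = upgma("vwxyz", table)
for left, right, height in merges:
    print(left, right, height)
```

Ties between equally close pairs are broken by iteration order here, which is an arbitrary choice.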
Ultrametric Distances and Molecular Clock
Definition:
A distance function d(.,.) is ultrametric if for any three distances dij ≤ dik ≤ djk, it is true that dij ≤ dik = djk
The Molecular Clock:
The evolutionary distance between species x and y is 2 × the Earth time to reach the nearest common ancestor
That is, the molecular clock has constant rate in all species
[Figure: tree on leaves 1, 4, 2, 3, 5 drawn against a time axis in years]
The molecular clock results in ultrametric distances
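A small Python sketch of the ultrametric test: for every triple of leaves, the two largest of the three pairwise distances must be equal. The function name and tolerance are hypothetical. The first table is the one from the Average Linkage example; the second is the additive table used later in the deck, which is not ultrametric.

```python
from itertools import combinations

def is_ultrametric(names, dist, tol=1e-9):
    """True if, in every triple, the two largest pairwise distances are equal."""
    d = {frozenset(pair): v for pair, v in dist.items()}
    for i, j, k in combinations(names, 3):
        _, middle, largest = sorted(
            (d[frozenset((i, j))], d[frozenset((i, k))], d[frozenset((j, k))])
        )
        if abs(largest - middle) > tol:
            return False
    return True

clock_like = {
    ("v", "w"): 6, ("v", "x"): 8, ("v", "y"): 8, ("v", "z"): 8,
    ("w", "x"): 8, ("w", "y"): 8, ("w", "z"): 8,
    ("x", "y"): 4, ("x", "z"): 4, ("y", "z"): 2,
}
additive_only = {
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
}
print(is_ultrametric("vwxyz", clock_like))     # True
print(is_ultrametric("vwxyz", additive_only))  # False
```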
Ultrametric Distances & Average Linkage
Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances
Proof: Exercise
Weakness of Average Linkage
Molecular clock: all species evolve at the same rate (Earth time)
However, certain species (e.g., mouse, rat) evolve much faster
Example where UPGMA messes up:
[Figure: the correct tree (left) and the tree produced by average linkage (right), on leaves 1–4]
Additive Distances
Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them
Given a tree T & additive distances dij, can uniquely reconstruct edge lengths:
• Find two neighboring leaves i, j, with common parent k
• Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m ≠ i, j
[Figure: tree with nodes numbered 1–13, with the path distance d1,4 marked]
Reconstructing Additive Distances Given T
[Figure: tree T with leaves v, w, x, y, z and edge lengths 5, 4, 7, 3, 3, 4, 6]
D:
       v    w    x    y    z
v      0   10   17   16   16
w           0   15   14   14
x                0    9   15
y                     0   14
z                          0
If we know T and D, but do not know the length of each edge, we can reconstruct those lengths
Reconstructing Additive Distances Given T
[Figure: tree T with internal node a, the common parent of v and w]
dax = ½ (dvx + dwx – dvw)
day = ½ (dvy + dwy – dvw)
daz = ½ (dvz + dwz – dvw)
D1:
       a    x    y    z
a      0   11   10   10
x           0    9   15
y                0   14
z                     0
Reconstructing Additive Distances Given T
[Figure: tree T with internal nodes a, b, c]
D1:
       a    x    y    z
a      0   11   10   10
x           0    9   15
y                0   14
z                     0
D2:
       a    b    z
a      0    6   10
b           0   10
z                0
D3:
       a    c
a      0    3
c           0
d(a, c) = 3
d(b, c) = d(a, b) – d(a, c) = 3
d(c, z) = d(a, z) – d(a, c) = 7
d(b, x) = d(a, x) – d(a, b) = 5
d(b, y) = d(a, y) – d(a, b) = 4
d(a, w) = d(z, w) – d(a, z) = 4
d(a, v) = d(z, v) – d(a, z) = 6
Correct!!!
[Figure: tree T with the recovered edge lengths 5, 4, 7, 3, 3, 4, 6]
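The reduction D → D1 → D2 → D3 can be sketched in Python by repeatedly replacing two neighboring leaves i, j with their parent k, using dkm = ½ (dim + djm – dij). The helper names pairs and merge_neighbors are hypothetical, and the sketch assumes the neighbor pairs are read off the known tree T.

```python
def pairs(table):
    """Turn {("v", "w"): 10, ...} into a lookup keyed by unordered pairs."""
    return {frozenset(pair): value for pair, value in table.items()}

def merge_neighbors(d, i, j, k):
    """Replace neighboring leaves i, j by their parent k in an additive table."""
    dij = d[frozenset((i, j))]
    others = {m for pair in d for m in pair} - {i, j}
    # Keep every entry that does not involve i or j
    reduced = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
    for m in others:
        reduced[frozenset((k, m))] = (d[frozenset((i, m))] + d[frozenset((j, m))] - dij) / 2
    return reduced

D = pairs({
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
})
D1 = merge_neighbors(D, "v", "w", "a")    # a is the parent of v and w
D2 = merge_neighbors(D1, "x", "y", "b")   # b is the parent of x and y
D3 = merge_neighbors(D2, "b", "z", "c")   # c joins b and z

# Leaf edge lengths follow by subtraction, as on the slide
d_av = D[frozenset(("z", "v"))] - D1[frozenset(("a", "z"))]
d_bx = D1[frozenset(("a", "x"))] - D2[frozenset(("a", "b"))]
print(D3[frozenset(("a", "c"))], d_av, d_bx)
```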
Neighbor-Joining
• Guaranteed to produce the correct tree if distance is additive
• May produce a good tree even when distance is not additive
Step 1: Finding neighboring leaves
Define
Dij = (N – 2) dij – Σ_{k ≠ i} dik – Σ_{k ≠ j} djk
Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighbors
[Figure: tree on leaves 1, 2, 3, 4 with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]
Algorithm: Neighbor-Joining
Initialization:
  Define T to be the set of leaf nodes, one per sequence
  Let L = T
Iteration:
  Pick i, j s.t. Dij is minimal
  Define a new node k, and set dkm = ½ (dim + djm – dij) for all m ∈ L
  Add k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dik, where ri = (N – 2)^(–1) Σ_{k ≠ i} dik
  Remove i, j from L; Add k to L
Termination:
  When L consists of two nodes i, j, add the edge between them, of length dij
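A compact Python sketch of these steps, run on the additive distance table from the earlier reconstruction example. It assumes N means the current number of nodes in L at each iteration; the function name, the internal node labels n0, n1, …, and the tie-breaking by iteration order are hypothetical choices.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return the tree as a list of edges (node, node, length)."""
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    active = list(names)                     # the set L
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        total = {i: sum(d[frozenset((i, m))] for m in active if m != i) for i in active}

        def corrected(pair):
            i, j = pair
            return (n - 2) * d[frozenset((i, j))] - total[i] - total[j]

        i, j = min(combinations(active, 2), key=corrected)
        dij = d[frozenset((i, j))]
        k = f"n{next_id}"
        next_id += 1
        # Edge lengths to the new node: dik = 1/2 (dij + ri - rj), djk = dij - dik
        dik = 0.5 * (dij + (total[i] - total[j]) / (n - 2))
        edges.append((i, k, dik))
        edges.append((j, k, dij - dik))
        for m in active:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (d[frozenset((i, m))] + d[frozenset((j, m))] - dij)
        active = [m for m in active if m not in (i, j)] + [k]
    i, j = active
    edges.append((i, j, d[frozenset((i, j))]))
    return edges

table = {
    ("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
    ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
    ("x", "y"): 9, ("x", "z"): 15, ("y", "z"): 14,
}
edges = neighbor_joining("vwxyz", table)
for a, b, length in edges:
    print(a, b, length)
```

On this table the recovered edge lengths match the tree T of the earlier example, as expected for additive distances.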
Parsimony
• One of the most popular methods:
  GIVEN multiple alignment
  FIND tree & history of substitutions explaining alignment
Idea: Find the tree that explains the observed sequences with a minimal number of substitutions
Two computational subproblems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)
Example: Parsimony Cost of One Column
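[Figure: tree over leaves A, B, A, A. The leaves get sets {A}, {B}, {A}, {A}; one internal node gets {A, B} and C is incremented (C++); the other internal nodes get {A}; final cost C = 1]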
Parsimony Scoring
Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions
Initialization:
  Set cost C = 0; node k = 2N – 1 (the root)
Iteration:
  If k is a leaf, set Rk = { xk[u] } // Rk is simply the character of kth species
  If k is not a leaf,
    Let i, j be the daughter nodes;
    Set Rk = Ri ∩ Rj if intersection is nonempty
    Set Rk = Ri ∪ Rj, and increment C if intersection is empty
Termination:
  Minimal cost of tree for column u, = C
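A minimal recursive sketch of this scoring in Python. Instead of numbering the nodes 1 … 2N – 1, it assumes the tree is given as nested pairs with one-character strings at the leaves; that representation and the function name are hypothetical.

```python
def parsimony_sets(tree):
    """Return (R, C): the candidate set at this node and the substitution count below it."""
    if isinstance(tree, str):            # leaf: R is simply its character
        return {tree}, 0
    left, right = tree
    r_i, c_i = parsimony_sets(left)
    r_j, c_j = parsimony_sets(right)
    common = r_i & r_j
    if common:                           # nonempty intersection: no new substitution
        return common, c_i + c_j
    return r_i | r_j, c_i + c_j + 1      # empty intersection: take the union, increment C

root_set, cost = parsimony_sets((("A", "B"), ("A", "A")))
print(root_set, cost)                    # {'A'} 1
```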
Example
[Figure: two worked examples of the set labelling, one on leaves A, A, A, B and one on leaves B, A, B, A, with internal nodes labelled {A}, {A,B}, or {B}]
Parsimony Traceback
Traceback:
1. Choose an arbitrary nucleotide from R2N – 1 for the root
2. Having chosen nucleotide r for parent k,
   If r ∈ Ri choose r for daughter i
   Else, choose arbitrary nucleotide from Ri
Easy to see that this traceback produces some assignment of cost C
Another Parsimony Algorithm
Let C(v) be cost for subtree rooted at node v
Let C(v,x) be cost for subtree rooted at v if we force v to have value x
Initialization:
  For each leaf v
  C(v) = 0
  C(v,x) = 0 if x is input character that labels v; C(v,x) = ∞ otherwise
Iteration:
  Let u, w be children of v
  C(v,x) = min(C(u) + 1, C(u,x)) + min(C(w) + 1, C(w,x))
  C(v) = min over x of C(v,x)
Termination:
  Minimal cost is C(root)
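The recurrence can be sketched in Python as follows, with both children u and w contributing a term. The nested-pair tree representation, the function names, and the default alphabet are hypothetical choices for illustration.

```python
import math

def cost_table(tree, alphabet="ACGT"):
    """Return {x: C(v, x)} for the subtree rooted at v."""
    if isinstance(tree, str):            # leaf: cost 0 for its own label, infinity otherwise
        return {x: (0 if x == tree else math.inf) for x in alphabet}
    u, w = (cost_table(child, alphabet) for child in tree)
    c_u, c_w = min(u.values()), min(w.values())   # C(u) and C(w)
    return {x: min(c_u + 1, u[x]) + min(c_w + 1, w[x]) for x in alphabet}

def parsimony_cost(tree, alphabet="ACGT"):
    """C(root): the minimum of C(root, x) over all x."""
    return min(cost_table(tree, alphabet).values())

print(parsimony_cost((("A", "C"), ("A", "A"))))   # 1
print(parsimony_cost((("A", "C"), ("G", "T"))))   # 3
```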
Probabilistic Methods
A more refined measure of evolution along a tree than parsimony
P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)
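If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1,
P(x1, x2, xroot | t1, t2) = pA · ¼(1 + 3e^(–4α)) · ¼(1 – e^(–4α)) = (¼)^3 (1 + 3e^(–4α))(1 – e^(–4α))
[Figure: root xroot with two children, x1 on a branch of length t1 and x2 on a branch of length t2]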
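The Jukes-Cantor example above can be checked numerically with a short Python sketch. The function names and the value α = 0.1 are hypothetical; the root probability is taken to be ¼ for every nucleotide, as in the closed form on the slide.

```python
import math

def jc_transition(a, b, t, alpha):
    """Jukes-Cantor probability that a becomes b along a branch of length t."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)

def joint_probability(x1, x2, x_root, t1, t2, alpha):
    """P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)."""
    p_root = 0.25
    return p_root * jc_transition(x_root, x1, t1, alpha) * jc_transition(x_root, x2, t2, alpha)

alpha = 0.1   # hypothetical substitution rate
value = joint_probability("A", "C", "A", 1, 1, alpha)
closed_form = 0.25 ** 3 * (1 + 3 * math.exp(-4 * alpha)) * (1 - math.exp(-4 * alpha))
print(value, closed_form)   # the two numbers should agree
```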
Probabilistic Methods
• If we know all internal labels xu,
P(x1, x2, …, xN, xN+1, …, x2N-1 | T, t) = P(xroot) ∏_{j ≠ root} P(xj | xparent(j), tj, parent(j))
• Usually we don’t know the internal labels, therefore
P(x1, x2, …, xN | T, t) = Σ_{xN+1} Σ_{xN+2} … Σ_{x2N-1} P(x1, x2, …, x2N-1 | T, t)
[Figure: tree with root xroot, an internal node xu, and leaves x1, x2, …, xN]
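The sum over unknown internal labels does not have to enumerate every combination: it can be pushed down the tree one node at a time. The Python sketch below does this for a single alignment column under Jukes-Cantor. It is an illustration of the idea only, not the notation of the following slides; the tree encoding (a leaf is a character, an internal node is a pair of (child, branch length) entries), the function names, and α = 0.1 are hypothetical.

```python
import math

ALPHABET = "ACGT"

def jc(a, b, t, alpha):
    """Jukes-Cantor probability that a becomes b along a branch of length t."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)

def conditional(tree, alpha):
    """Return {x: P(leaves below this node | this node has label x)}."""
    if isinstance(tree, str):
        return {x: 1.0 if x == tree else 0.0 for x in ALPHABET}
    result = {x: 1.0 for x in ALPHABET}
    for child, t in tree:
        below = conditional(child, alpha)
        for x in ALPHABET:
            # Sum over the unknown label y of the child
            result[x] *= sum(jc(x, y, t, alpha) * below[y] for y in ALPHABET)
    return result

def column_likelihood(tree, alpha=0.1):
    """P(x1, ..., xN | T, t) with a uniform distribution on the root label."""
    return sum(0.25 * p for p in conditional(tree, alpha).values())

two_leaves = (("A", 1.0), ("C", 1.0))
three_leaves = ((two_leaves, 0.5), ("G", 2.0))
print(column_likelihood(two_leaves), column_likelihood(three_leaves))
```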
Felsenstein’s Likelihood Algorithm
Define:
and recursively compute:
Felsenstein’s Likelihood Algorithm
Now using u and U we can compute:
and
Probabilistic Methods
Given M (ungapped) alignment columns of N sequences,
• Define likelihood of a tree:
L(T, t) = P(Data | T, t) = ∏_{m=1…M} P(x1m, …, xNm | T, t)
Maximum Likelihood Reconstruction:
• Given data X = (xij), find a topology T and length vector t that maximize likelihood L(T, t)
Current popular methods
HUNDREDS of programs available!
http://evolution.genetics.washington.edu/phylip/software.html#methods
Some recommended programs:
• Discrete, parsimony-based: Rec-1-DCM3
  http://www.cs.utexas.edu/users/tandy/mp.html
  Tandy Warnow and colleagues
• Probabilistic: SEMPHY
  http://www.cs.huji.ac.il/labs/compbio/semphy/
  Nir Friedman and colleagues