9/1/2005 1
Ultrametric phylogenies
By Sivan Yogev. Based on Chapter 11 of “Inferring Phylogenies” by J. Felsenstein
Introduction – additive trees
In the last lecture we saw the concept of distance-based phylogenetic trees
d(i,j) is the distance between the objects indexed i and j
In particular, we discussed additive sets, in which:
For each i: d(i,i) = 0, and for each j ≠ i: d(i,j) > 0
For each i,j: d(i,j) = d(j,i)
For each i,j,k: d(i,k) ≤ d(i,j) + d(j,k) [triangle inequality]
Any subset of four objects can be labelled i,j,k,l such that d(i,j) + d(k,l) ≤ d(i,l) + d(j,k) = d(i,k) + d(j,l) [four-point condition]
An additive set defines a tree. Every tree defines an additive distance matrix between its leaves
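The four-point condition can be checked directly on a distance matrix. The sketch below is only an illustration: the function name `is_additive`, the tolerance `tol`, and the small example tree (pendant edges 1, 2, 3, 4 and an internal edge 5) are assumptions, not part of the lecture. It assumes the matrix is already symmetric with a zero diagonal and tests that, for every four objects, the two largest of the three pairwise sums are equal.

```python
from itertools import combinations

def is_additive(d, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix d."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        # The three ways to split four objects into two pairs.
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # The two largest sums must be equal.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Hypothetical tree ((a,b),(c,d)): pendant edges 1, 2, 3, 4, internal edge 5.
tree_d = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]

# The mammal distances used in the UPGMA example (Bear, Raccoon, Weasel, Seal, Dog).
mammals = [
    [0, 26, 34, 29, 32],
    [26, 0, 42, 44, 48],
    [34, 42, 0, 44, 51],
    [29, 44, 44, 0, 50],
    [32, 48, 51, 50, 0],
]

print(is_additive(tree_d))   # distances read off a tree
print(is_additive(mammals))  # measured distances
```

Going over all quadruples costs O(n^4), so this is meant for small inputs only.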
Molecular clocks
Let us assume that “stable” mutations in the genome occur uniformly over long time periods
This defines a “molecular clock” – each mutation stands for a constant period of time
We can therefore approximate the time since any two taxa diverged from their last common ancestor by the number of differences between the genomes in conserved regions
Ultrametric trees
Given a group of taxa with distances, if we assume the “molecular clock” model and wish to find the evolutionary tree, the number of mutations from the last common ancestor to every taxon should be similar
This means that the distance from the root of the evolutionary tree to each leaf is the same
Such a tree is called an Ultrametric tree
Ultrametric trees (cont.)
If we have a set of objects with a distance between them, we want to know if this set is ultrametric
For ultrametric sets, these conditions hold:
For each i: d(i,i) = 0, and for each j ≠ i: d(i,j) > 0
For each i,j: d(i,j) = d(j,i)
For each i,j,k: d(i,k) ≤ max{d(i,j), d(j,k)} [ultrametric condition]
The last condition can be replaced by this one: any subset of three objects can be labelled i,j,k such that d(i,j) ≤ d(j,k) = d(i,k)
Ultrametric trees (cont.)
An ultrametric set is also additive
The opposite is not always true
[Figure: nested sets, with ultrametric matrices inside additive matrices inside distance matrices]
Ultrametric decision
Given a set of n objects with distances, we want to determine if the set is ultrametric
The naïve approach – go over all triplets, and check if the ultrametric condition holds
Complexity: O(n^3)
More efficient algorithms exist (Gusfield gives a simple O(n^2 log n) and a more sophisticated O(n^2) algorithm with partial proofs)
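The naïve O(n^3) check can be sketched as follows. This is a minimal illustration, not Gusfield's algorithm: the name `is_ultrametric`, the tolerance `tol`, and the three-object example matrix are assumptions. It uses the three-point form of the condition, in which the two largest distances of every triplet must be equal.

```python
from itertools import combinations

def is_ultrametric(d, tol=1e-9):
    """Naive O(n^3) test of the three-point condition on a symmetric matrix d."""
    n = len(d)
    for i, j, k in combinations(range(n), 3):
        # Sort the three pairwise distances of the triplet.
        smallest, middle, largest = sorted([d[i][j], d[i][k], d[j][k]])
        # Ultrametric: the two largest distances coincide.
        if abs(largest - middle) > tol:
            return False
    return True

# Hypothetical ultrametric set: two close objects, one equally far from both.
small = [
    [0, 2, 4],
    [2, 0, 4],
    [4, 4, 0],
]

# Mammal distances from the UPGMA example (Bear, Raccoon, Weasel, Seal, Dog).
mammals = [
    [0, 26, 34, 29, 32],
    [26, 0, 42, 44, 48],
    [34, 42, 0, 44, 51],
    [29, 44, 44, 0, 50],
    [32, 48, 51, 50, 0],
]

print(is_ultrametric(small))
print(is_ultrametric(mammals))  # Bear, Raccoon, Weasel give 26, 34, 42
```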
Approximations
However, for most biological data there is no accurate “ultrametric solution”
This means that some heuristic is needed
The most popular method is UPGMA, which stands for Unweighted Pair Group Method using Arithmetic mean
Introduced by Sokal and Michener (1958)
UPGMA
Input: A set of n objects, with a distance between every two objects
Output: an ultrametric tree with the given objects as leaves
The main data structures used by the algorithm are a graph G=(V,E) which contains trees with the objects as leaves, and a distance matrix between each two roots of trees in the graph
UPGMA (cont.)
Initialization: each object is in a separate tree, with the distances given by the input
We will use an example of 5 mammal species:

         Bear  Raccoon  Weasel  Seal  Dog
Bear        0       26      34    29   32
Raccoon    26        0      42    44   48
Weasel     34       42       0    44   51
Seal       29       44      44     0   50
Dog        32       48      51    50    0

[Figure: initial forest of five single-leaf trees: Bear, Raccoon, Weasel, Seal, Dog]
UPGMA (cont.)
We iterate until there is only one tree
At each iteration we perform:
Find the two trees x and y with minimal distance d(x,y)
Add a new node, and connect the roots of x and y to this node. The result is a new tree z. The height of the root of z is d(x,y)/2
Compute the distance between z and the other remaining trees (without x and y)
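The iteration above can be sketched in a few lines of Python. This is a simple O(n^3) illustration under stated assumptions, not a reference implementation: the function name `upgma`, the nested-parenthesis labels for new trees, and the tie-breaking rule (the first minimal pair found wins) are all choices made here. The distance from the new tree to each remaining tree is the leaf-count-weighted average of the two old distances.

```python
def upgma(names, d):
    """Return (root_label, height) where height maps every node to its height."""
    active = list(names)                    # roots of the current trees
    size = {name: 1 for name in names}      # number of leaves under each root
    height = {name: 0.0 for name in names}  # leaves sit at height 0
    dist = {(a, b): d[i][j]
            for i, a in enumerate(names)
            for j, b in enumerate(names) if i != j}

    while len(active) > 1:
        # Find the two trees x and y with minimal distance.
        x, y = min(((a, b) for i, a in enumerate(active) for b in active[i + 1:]),
                   key=lambda pair: dist[pair])
        # Join them under a new root z at height d(x, y) / 2.
        z = f"({x},{y})"
        size[z] = size[x] + size[y]
        height[z] = dist[(x, y)] / 2
        # Distance from z to every other remaining tree t (size-weighted average).
        active = [t for t in active if t not in (x, y)]
        for t in active:
            d_zt = (size[x] * dist[(x, t)] + size[y] * dist[(y, t)]) / size[z]
            dist[(z, t)] = dist[(t, z)] = d_zt
        active.append(z)

    return active[0], height

names = ["Bear", "Raccoon", "Weasel", "Seal", "Dog"]
mammals = [
    [0, 26, 34, 29, 32],
    [26, 0, 42, 44, 48],
    [34, 42, 0, 44, 51],
    [29, 44, 44, 0, 50],
    [32, 48, 51, 50, 0],
]

root, height = upgma(names, mammals)
print(root)
for node, h in height.items():
    if h > 0:
        print(node, h)
```

The length of the branch from a tree's root up to its parent is the difference between the two heights, which is how the branch lengths in the worked example are obtained.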
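UPGMA (cont.)
First iteration: the minimal distance in the matrix is d(Bear, Raccoon) = 26, so Bear and Raccoon are joined into a new tree BR whose root is at height 26/2 = 13
[Figure: Bear and Raccoon joined under the new node BR, each by a branch of length 13; Weasel, Seal and Dog remain separate leaves]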
UPGMA (cont.)
Update computation: denote the number of leaves in the tree x by n_x (so n_z = n_x + n_y); then for each t ≠ x,y we set:

$$D(z,t) = \frac{n_x}{n_z} D(x,t) + \frac{n_y}{n_z} D(y,t)$$

After joining Bear and Raccoon into BR, the matrix becomes:

          BR  Weasel  Seal  Dog
BR         0      38  36.5   40
Weasel    38       0    44   51
Seal    36.5      44     0   50
Dog       40      51    50    0
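As a worked check of the BR row, take x = Bear and y = Raccoon, so n_x = n_y = 1 and n_z = 2. Each new entry is then the plain average of the Bear and Raccoon distances to the remaining species:

```latex
\begin{align*}
D(BR, \text{Weasel}) &= \tfrac{1}{2}\cdot 34 + \tfrac{1}{2}\cdot 42 = 38 \\
D(BR, \text{Seal})   &= \tfrac{1}{2}\cdot 29 + \tfrac{1}{2}\cdot 44 = 36.5 \\
D(BR, \text{Dog})    &= \tfrac{1}{2}\cdot 32 + \tfrac{1}{2}\cdot 48 = 40
\end{align*}
```

When the two joined trees have different sizes, the weights differ: each tree contributes in proportion to its number of leaves, so every leaf carries the same weight in the average.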
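UPGMA (cont.)
Second iteration: the minimal distance is now D(BR, Seal) = 36.5, so BR and Seal are joined into a new tree BRS whose root is at height 36.5/2 = 18.25. The branch from BR up to BRS has length 18.25 - 13 = 5.25
[Figure: BR (height 13) and Seal joined under the new node BRS at height 18.25; Weasel and Dog remain separate leaves]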
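UPGMA (cont.)
Third iteration:

         BRS  Weasel   Dog
BRS        0      40  43.3
Weasel    40       0    51
Dog     43.3      51     0

The minimal distance is D(BRS, Weasel) = 40, so BRS and Weasel are joined into a new tree BRSW whose root is at height 40/2 = 20. The branch from BRS up to BRSW has length 20 - 18.25 = 1.75
[Figure: BRS (height 18.25) and Weasel joined under the new node BRSW at height 20; Dog remains a separate leaf]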
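UPGMA (cont.)
Fourth (and last) iteration:

        BRSW    Dog
BRSW       0  45.25
Dog    45.25      0

The only remaining distance is D(BRSW, Dog) = 45.25, so BRSW and Dog are joined into the final tree BRSWD whose root is at height 45.25/2 = 22.625. The branch from BRSW up to BRSWD has length 22.625 - 20 = 2.625
[Figure: final tree with internal nodes BR (height 13), BRS (18.25), BRSW (20) and root BRSWD (22.625) over the leaves Bear, Raccoon, Seal, Weasel, Dog]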
UPGMA - complexity
A simple implementation takes n-1 iterations, where in each iteration we find the minimal distance in O(n^2), for a total complexity of O(n^3)
We can keep a list of the smallest distance in each row. This way it takes O(n) to find the minimal distance, while updating the list is also O(n) at each iteration. Therefore, the total complexity is O(n^2).
Ultrametric evaluation
UPGMA gives us an ultrametric tree
Is this tree the best possible?
Depends on how we measure the quality of an approximated tree for a given matrix
Let U(i,j) be the distance in the ultrametric tree U between the objects indexed i and j
The L∞ norm is defined by:

$$L_\infty = \max_{i,j} |d(i,j) - U(i,j)|$$
Ultrametric evaluation (cont.)
There is an O(n^2) algorithm for finding the ultrametric tree U with minimal L∞ norm (Farach, Kannan and Warnow, 1995)
Is this tree the best possible?
It would be better to include all distances
The L1 norm is defined by:

$$L_1 = \sum_{i,j} |d(i,j) - U(i,j)|$$

Finding U with minimal L1 norm is NP-hard! (Day, 1987)
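Both norms are easy to evaluate for a given tree, even though minimizing the L1 norm is hard. The sketch below scores the UPGMA tree from the mammal example. It rests on several assumptions: the function names are made up here, U(i,j) is taken to be twice the height of the lowest common ancestor of i and j (26, 36.5, 40 and 45.25 for the four internal nodes), and both norms run over unordered pairs i < j, so a sum over ordered pairs would double the L1 value.

```python
def linf_norm(d, u):
    """Largest absolute deviation between d and u over all pairs i < j."""
    n = len(d)
    return max(abs(d[i][j] - u[i][j]) for i in range(n) for j in range(i + 1, n))

def l1_norm(d, u):
    """Sum of absolute deviations between d and u over all pairs i < j."""
    n = len(d)
    return sum(abs(d[i][j] - u[i][j]) for i in range(n) for j in range(i + 1, n))

# Input distances (Bear, Raccoon, Weasel, Seal, Dog).
mammals = [
    [0, 26, 34, 29, 32],
    [26, 0, 42, 44, 48],
    [34, 42, 0, 44, 51],
    [29, 44, 44, 0, 50],
    [32, 48, 51, 50, 0],
]

# Tree distances of the UPGMA result: twice the height of the common ancestor.
upgma_tree = [
    [0, 26, 40, 36.5, 45.25],
    [26, 0, 40, 36.5, 45.25],
    [40, 40, 0, 40, 45.25],
    [36.5, 36.5, 40, 0, 45.25],
    [45.25, 45.25, 45.25, 45.25, 0],
]

print(linf_norm(mammals, upgma_tree))  # worst single pair (Bear, Dog: 32 vs 45.25)
print(l1_norm(mammals, upgma_tree))    # total deviation over all ten pairs
```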