Chap. 7. Building Trees
Fixation in Neutral Model
Advantageous mutation with a fitness 1 + s (s: selection coefficient)
If there are m copies of the mutation, the mean fitness of the population is E[W] = [m(1+s) + (N-m)]/N
Wright-Fisher model: a gene copy is selected for the next generation with a probability proportional to its fitness, a = m(1+s)/(N E[W])
The number of copies in the next generation is still binomially distributed
Exact analysis by the diffusion model
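A minimal Wright-Fisher simulation of this process (a sketch; the function names, seeds, and parameter values are illustrative, not from the slides):

```python
import random

def wright_fisher_step(m, N, s, rng=None):
    """One Wright-Fisher generation with selection.

    m mutant copies of fitness 1+s among N gene copies; mean fitness
    E[W] = (m*(1+s) + (N-m)) / N, and each copy in the next generation
    is a mutant with probability a = m*(1+s) / (N*E[W]).
    The new mutant count is therefore binomial(N, a).
    """
    rng = rng or random.Random(0)
    mean_w = (m * (1 + s) + (N - m)) / N
    a = m * (1 + s) / (N * mean_w)
    return sum(1 for _ in range(N) if rng.random() < a)  # binomial(N, a) draw

def run_to_absorption(N=100, s=0.05, rng=None):
    """Iterate from a single mutant copy until fixation (m=N) or loss (m=0)."""
    rng = rng or random.Random(1)
    m = 1
    while 0 < m < N:
        m = wright_fisher_step(m, N, s, rng)
    return m == N  # True if the mutation fixed
```

Repeating `run_to_absorption` many times estimates the fixation probability that the diffusion model gives exactly.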
[Figure: example tree with taxa A–I drawn against a time axis; branch lengths (units of 1, 2, or 6) are marked on the edges, with one unit indicated by a scale bar.]
Tree nomenclature
[Figure: the same tree with taxa A–I and branch lengths against the time axis; the leaves are labeled as taxa.]
Tree nomenclature: clades
Clade ABF (monophyletic group)
Examples of clades
Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10
Phylogenetic Methods
A family of related sequences that evolved from a common ancestor is studied with phylogenetic trees showing the order of evolution
We want a tree representation showing: divergence among species; evolutionary distance
Usually unrooted
[Figure: the three possible unrooted tree topologies for four taxa A, B, C, D.]
Phylogenetic Trees
A rooted tree provides the direction of evolution and its distances
An unrooted tree is less informative
Finding a root: use a known species relationship; if none is known, use the mid-point method: find a point on the tree such that the mean distance to the leaves is identical on either side – this assumes the same evolution rate in all lineages
Tree Construction
Multiple sequences are aligned
Use JC or other models to compute pairwise evolutionary distances
From the distance matrix, use a clustering method: join the closest two clusters to form a larger one; recompute distances between all clusters; repeat the two steps above until all species are connected
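The distance step above can be sketched with the Jukes-Cantor model, d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites (a toy sketch; the sequences and names are made up):

```python
import math

def jc_distance(s1, s2):
    """Jukes-Cantor evolutionary distance between two aligned DNA sequences:
    with p the fraction of differing (ungapped) sites,
    d = -(3/4) * ln(1 - 4p/3)."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    p = sum(a != b for a, b in pairs) / len(pairs)
    return -0.75 * math.log(1 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Pairwise JC distances for a dict {name: aligned sequence}."""
    names = sorted(seqs)
    return {(a, b): jc_distance(seqs[a], seqs[b])
            for a in names for b in names if a < b}
```

The resulting matrix is the input to the clustering loop described above.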
Tree-building methods fall into two classes, distance-based and character-based:
Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. Examples of distance-based algorithms are UPGMA and neighbor-joining.
Character-based methods include maximum parsimony and maximum likelihood. Parsimony analysis involves the search for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa.
Tree-building methods
Example with Globin
Distance-based tree: calculate the pairwise alignments; if two sequences are related, put them next to each other on the tree
Character-based tree: identify positions that best describe how characters (amino acids) are derived from common ancestors
Tree from Distance Matrix
Given a weighted tree, with weights on edges representing evolutionary distances
Additive distances: di,c + dc,j = Di,j
Find the nearest leaves and combine them under the same parent; it is not easy to find neighboring leaves
Reconstructing the tree: shorten all hanging edges of the tree
Reduce the length of every hanging edge by the same small amount δ; the distance matrix is then reduced by 2δ
Find a leaf with 0 weight and remove it
Additive matrix
Tree-building methods: UPGMA
UPGMA is the unweighted pair group method using arithmetic mean.
Tree-building methods: UPGMA
Step 1: compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.
Tree-building methods: UPGMA
Step 2: Find the two proteins with the smallest pairwise distance. Cluster them.
Tree-building methods: UPGMA
Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them.
Tree-building methods: UPGMA
Step 4: Keep going. Cluster.
Tree-building methods: UPGMA
Step 5: Last cluster! This is your tree.
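The clustering loop in the steps above can be sketched as follows (a simplified implementation; it returns only the topology as nested tuples, omitting branch heights):

```python
def upgma(dist, names):
    """UPGMA sketch. dist[(a, b)] holds the distance for every ordered
    pair of leaf names. Cluster-to-cluster distance is the arithmetic
    mean over leaf pairs, maintained with the size-weighted update
    d(ab, c) = (|a| d(a,c) + |b| d(b,c)) / (|a| + |b|)."""
    size = {n: 1 for n in names}
    subtree = {n: n for n in names}
    d = {tuple(sorted(p)): v for p, v in dist.items()}
    active = set(names)
    while len(active) > 1:
        # join the closest pair of active clusters
        a, b = min((p for p in d if p[0] in active and p[1] in active),
                   key=d.get)
        merged = a + "+" + b
        for c in active - {a, b}:
            dac = d[tuple(sorted((a, c)))]
            dbc = d[tuple(sorted((b, c)))]
            d[tuple(sorted((merged, c)))] = (
                (size[a] * dac + size[b] * dbc) / (size[a] + size[b]))
        size[merged] = size[a] + size[b]
        subtree[merged] = (subtree[a], subtree[b])
        active -= {a, b}
        active.add(merged)
        root = merged
    return subtree[root]
```

On an ultrametric matrix this recovers the true rooted topology; with unequal rates it may not (see the caveat below).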
UPGMA is a simple approach for making trees.
• An UPGMA tree is always rooted.
• An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong.
• While UPGMA is simple, it is less accurate than the neighbor-joining approach (described next).
Distance-based methods: UPGMA trees
Page 256
UPGMA Method
Distance between two clusters is defined as the mean of the distances between species in the two clusters
Human cluster vs. chimpanzee/pygmy chimpanzee cluster: the mean of the human-chimpanzee and human-pygmy chimpanzee distances
Produces a rooted tree
Tree distance between chimpanzee and pygmy chimpanzee is 0.0149/2
All species end right-aligned (because the same molecular evolution rate is assumed in every species) – for this reason the method is not much used
Neighbor-Joining (NJ) Method
Additive distances in an unrooted tree: the distance between two species is the sum of the branch lengths connecting them
NJ method: construct an unrooted tree whose branch lengths are as close as possible to the distance matrix among species
Algorithm: join two neighbors, and replace them by a new internal node; keep repeating this step until all species are covered
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa.
Begin by placing all the taxa in a star-like structure.
Making trees using neighbor-joining
Page 259
Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.
Tree-building methods: Neighbor joining
Define the distance from X to Y by
dXY = 1/2(d1Y + d2Y – d12)
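Combining this update with the standard pair-selection criterion Q(i,j) = (n-2) d(i,j) - Σk d(i,k) - Σk d(j,k) of Saitou and Nei gives one joining step; a sketch (the toy distance values in the test come from a small additive tree):

```python
def nj_join_step(d, taxa):
    """One neighbor-joining step (sketch).

    d[(a, b)]: current pairwise distances (both orders present).
    Picks the pair (i, j) minimizing
        Q(i, j) = (n-2) d(i, j) - sum_k d(i, k) - sum_k d(j, k),
    then computes the distance from the new internal node X to every
    remaining taxon Y as d(X, Y) = (d(i, Y) + d(j, Y) - d(i, j)) / 2.
    """
    n = len(taxa)
    r = {a: sum(d[(a, b)] for b in taxa if b != a) for a in taxa}
    i, j = min(((a, b) for a in taxa for b in taxa if a < b),
               key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
    new = {y: (d[(i, y)] + d[(j, y)] - d[(i, j)]) / 2
           for y in taxa if y not in (i, j)}
    return (i, j), new
```

Repeating the step on the reduced matrix until two nodes remain yields the full unrooted NJ tree.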
Tree-building methods: character based
Rather than pairwise distances between proteins,evaluate the aligned columns of amino acidresidues (characters).
As an example of tree-building using maximum parsimony, consider these four taxa:
AAG, AAA, GGA, AGA
How might they have evolved from a common ancestor such as AAA?
Page 261
[Figure: the three candidate tree topologies for taxa AAG, AAA, GGA, AGA, each with ancestral nodes labeled (e.g. AAA) and the number of changes marked on each branch.]
Tree-building methods: Maximum parsimony
Cost = 3; Cost = 4; Cost = 4
In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).
Page 261
Parsimony
Use the simplest possible explanation of the data, i.e. the one with the fewest assumptions
Binary states: 0 for the ancestral character, 1 for the derived character
Character 0 may be the ancestral tetrapod forelimb bone structure; 1 may be the bone structure in the bird wing
C, D possess a derived character not possessed by A, B
Tree a: the character must have evolved on the + branch
Tree b: evolved once (+), and lost (*)
Tree c: evolved independently (+) on two branches
Parsimony criterion – tree a is the simplest (a single state change)
The main idea of character-based methods is to find the tree with the shortest branch lengths possible. Thus we seek the most parsimonious (“simple”) tree.
• Identify informative sites. For example, constant characters are not parsimony-informative.
• Construct trees, counting the number of changes required to create each tree. For about 12 taxa or fewer, evaluate all possible trees exhaustively; for > 12 taxa perform a heuristic search.
• Select the shortest tree (or trees).
Making trees using character-based methods
Page 260
Small Parsimony Problem
Characters in the string are independent, so the problem can be solved independently for each character
Assume that each leaf is labeled by a single character, and solve this more general problem
The length of an edge is defined by the Hamming distance: for a k-letter alphabet, dH(v,w) = 0 if v = w, and 1 otherwise
Small Parsimony Problem: find the most parsimonious labeling of the internal vertices
input: tree T with each leaf labeled by an m-character string
output: labeling of internal vertices of T minimizing the parsimony score
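For unit edge costs, the small parsimony problem is solved per character by Fitch's algorithm; a sketch on rooted binary trees given as nested tuples (the earlier example taxa AAG, AAA, GGA, AGA reappear in the test):

```python
def fitch(node):
    """Fitch's algorithm for the (unweighted) small parsimony problem on
    a single character. A leaf is a one-letter string; an internal node
    is a pair (left, right). Returns (candidate state set, min #changes).
    """
    if isinstance(node, str):
        return {node}, 0
    (ls, lc), (rs, rc) = fitch(node[0]), fitch(node[1])
    common = ls & rs
    if common:                 # children agree: intersect, no new change
        return common, lc + rc
    return ls | rs, lc + rc + 1  # disagree: union, count one change

def parsimony_score(topology, length):
    """Sum the per-character Fitch score over m-character leaf labels."""
    def site(node, i):
        if isinstance(node, str):
            return node[i]
        return (site(node[0], i), site(node[1], i))
    return sum(fitch(site(topology, i))[1] for i in range(length))
```

On the four-taxon example this reproduces the costs 3 and 4 quoted for the alternative topologies.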
Weighted Small Parsimony Problem
David Sankoff: dynamic programming, 1975
For an internal vertex v with children u and w:
s_t(v) = min_i {s_i(u) + δ_i,t} + min_j {s_j(w) + δ_j,t}
Weighted Small Parsimony Problem: find the labeling of internal vertices with minimal weighted parsimony score
input: tree T with each leaf labeled by a letter of a k-letter alphabet, and a k×k scoring matrix
output: labeling of internal vertices of T minimizing the weighted parsimony score
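A sketch of the Sankoff recursion on a single character (tree as nested tuples; `delta` is the k×k scoring matrix as a dict keyed by state pairs):

```python
INF = float("inf")

def sankoff(node, alphabet, delta):
    """Sankoff's dynamic programming (1975) for weighted small parsimony
    on one character. Returns {state t: s_t(node)}. For an internal
    vertex v with children u, w:
        s_t(v) = min_i(s_i(u) + delta[(i, t)]) + min_j(s_j(w) + delta[(j, t)])
    A leaf labeled x has s_x = 0 and s_t = infinity for t != x."""
    if isinstance(node, str):
        return {t: (0 if t == node else INF) for t in alphabet}
    su = sankoff(node[0], alphabet, delta)
    sw = sankoff(node[1], alphabet, delta)
    return {t: min(su[i] + delta[(i, t)] for i in alphabet) +
               min(sw[j] + delta[(j, t)] for j in alphabet)
            for t in alphabet}

# With a unit-cost matrix, the minimum over root states equals the
# unweighted (Fitch) parsimony score.
unit = {(i, t): 0 if i == t else 1 for i in "AG" for t in "AG"}
```

The minimal score is min over t of s_t(root); a traceback over the minimizing choices recovers the internal labeling.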
Parsimony Example
Five species, scored for 6 characters with state 0 or 1 each
Calculate how many changes of state are needed on a given tree, for example:
Species 1 2 3 4 5 6
Alpha 1 0 0 1 1 0
Beta 0 0 1 0 0 0
Gamma 1 1 0 0 0 0
Delta 1 1 0 1 1 1
Epsilon 0 0 1 1 1 0
[Figure: the example tree with leaves alpha, delta, gamma, beta, epsilon.]
Species 1
Alpha 1
Beta 0
Gamma 1
Delta 1
Epsilon 0
[Figure: the two alternative reconstructions of character 1 on the tree. Red: state 1; regular type: state 0.]
Reconstruction of Character 1
Reconstruction of Character 2
Species 2
Alpha 0
Beta 0
Gamma 1
Delta 1
Epsilon 0
[Figure: alternative reconstructions of character 2 on the tree.]
Reconstruction of Character 3
Species 3
Alpha 0
Beta 1
Gamma 0
Delta 0
Epsilon 1
[Figure: alternative reconstructions of character 3 on the tree.]
Reconstruction of character 4, 5
[Figure: reconstructions of characters 4 and 5 on the tree.]
Species 4 5
Alpha 1 1
Beta 0 0
Gamma 0 0
Delta 1 1
Epsilon 1 1
Species 6
Alpha 0
Beta 0
Gamma 0
Delta 1
Epsilon 0
Reconstruction of character 6
[Figure: reconstruction of character 6 on the tree.]
Reconstruction with All Changes
[Figure: the tree with the changes for all six characters marked on branches: 4; 2,6; 2,5; 1,3; 5; 4.]
• Total number of changes, taking a random choice where more than one reconstruction is possible: 1 + 2 + 2 + 2 + 1 + 1 = 9
Most Parsimonious Trees
[Figure: three rooted drawings of the most parsimonious tree over alpha, delta, gamma, beta, epsilon, each with changes 2; 6; 4,5; 1,3; 4,5 marked on branches.]
• The three trees are identical when unrooted
How to determine Branch Lengths
Given an unrooted tree, use the average over all possible reconstructions of each character
[Figure: left, the tree with per-branch change counts (4; 2,6; 2,5; 1,3; 5; 4); right, branch lengths averaged over all reconstructions (1; 1.5; 0.5; 2.5; 1; 1.5; 1).]
Large Parsimony Problem
NP-complete
Greedy heuristic: start with an arbitrary tree; move from one tree to another if it lowers the parsimony score, using nearest-neighbor interchange
Large Parsimony Problem: find a tree with n leaves having the minimal parsimony score
input: an n×m matrix M describing n species, each represented by an m-character string
output: a tree T with n leaves labeled by the n rows of matrix M, and a labeling of the internal vertices with minimal parsimony score over all possible trees and labelings
Modifying Trees
Nearest Neighbor Interchange
Swap two adjacent branches: erase an interior branch and the two branches connected to it, then reconnect the resulting subtrees in an alternative arrangement
• Likelihood ratios
• Example: predict helices and loops in a protein
• Known info: helices have a high content of hydrophobic residues
• ph and pl: frequencies of an amino acid being in a helix or a loop
• Lh and Ll: likelihoods that a sequence of N amino acids is in a helix or a loop
• Lh = ∏i=1..N ph(ai), Ll = ∏i=1..N pl(ai)
• Rather than the likelihoods, their ratio carries more information
• Lh/Ll: is the sequence more likely to be a helical or a loop region?
• S = ln(Lh/Ll) = ∑i=1..N ln(ph(ai)/pl(ai)): positive for a helical region
• Partition a sequence into N-amino-acid segments (N=300)
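A toy sketch of the log-odds score; the two-letter alphabet (H = hydrophobic, P = polar) and the frequency values are invented for illustration, not taken from the slides:

```python
import math

# Hypothetical per-residue frequencies under the two models.
p_h = {"H": 0.7, "P": 0.3}   # residue frequencies inside helices
p_l = {"H": 0.4, "P": 0.6}   # residue frequencies inside loops

def log_odds(segment):
    """S = ln(Lh/Ll) = sum_i ln(p_h(a_i) / p_l(a_i)).
    S > 0 favors the helix model for this segment."""
    return sum(math.log(p_h[a] / p_l[a]) for a in segment)
```

Summing log ratios instead of multiplying likelihoods avoids numerical underflow on long segments.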
Probabilistic Models
• The previous example has two hypotheses (helix or loop)
• The sequence is described by models 0 and 1
• Models 0 and 1 are defined by ph and pl
• Generalize to k hypotheses: models Mk (k=0,1,2,…)
• Given a test dataset D, what is the probability that D is described by each of the models?
• Known info: prior probabilities Pprior(Mk) for each model, from other information sources
• Compute the likelihood of D according to each of the models: L(D|Mk)
• Of interest is not the probability of D arising from Mk but the probability of D being described by Mk
• Namely, the posterior probability Ppost(Mk|D) ∝ L(D|Mk) Pprior(Mk)
• Ppost(Mk|D) = L(D|Mk) Pprior(Mk) / ∑i L(D|Mi) Pprior(Mi)
• => Bayesian probability
Prior and Posterior Probs.
• Basic principles
• We make inference using posterior probabilities
• If the posterior probability of one model is higher, it can be chosen as the best model with confidence
• Special case: two models
• Two prior probabilities: Pprior0, Pprior1
• Pposti = Li Ppriori / (L0 Pprior0 + L1 Pprior1)
• Log-odds score: S′ = ln(L1 Pprior1 / L0 Pprior0) = ln(L1/L0) + ln(Pprior1/Pprior0) = S + ln(Pprior1/Pprior0)
• The difference between S′ and S is simply an additive constant, so the ranking will be identical whether we use S′ or S
• Warning: if Pprior1 is small, S has to be high to make S′ positive
• When Pprior0 = Pprior1, S′ = S
• Ppost1 = 1/(1 + L0 Pprior0 / L1 Pprior1) = 1/(1 + exp(−S′))
• S′ = 0 → Ppost1 = 1/2; when S′ is large and positive, Ppost1 ≈ 1
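The two-model posterior can be computed either directly or through the logistic form 1/(1 + exp(−S′)); both give the same value (a sketch):

```python
import math

def posterior_model1(L0, L1, prior0, prior1):
    """Ppost1 = L1*prior1 / (L0*prior0 + L1*prior1); equivalently
    1 / (1 + exp(-S')) with S' = ln(L1/L0) + ln(prior1/prior0)."""
    s_prime = math.log(L1 / L0) + math.log(prior1 / prior0)
    return 1.0 / (1.0 + math.exp(-s_prime))
```

With equal priors the posterior reduces to the likelihood ratio alone, as the S′ = S case states.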
Bayesian Prob.
• Given a model of sequence evolution and a proposed tree structure, compute the likelihood that the known sequences would have evolved on that tree
• ML chooses the tree that maximizes this likelihood
• Three parameters: tree topology; branch lengths; values of the parameters in the rate matrix
Maximum Likelihood (ML) Phylogeny
• Given a model of sequence evolution at a site
• Likelihood of ancestor X: L(X) = PXA(t1) PXG(t2)
• L(Y) = PYG(t4) ∑X L(X) PYX(t3)
• L(W) = ∑Y ∑Z L(Y) PWY(t5) L(Z) PWZ(t6)
• Total likelihood for the site: L = ∑W πW L(W), where πW is the equilibrium probability
• Is equal to the posterior probability of different clades
What is Likelihood in ML Tree ?
[Figure: tree for one site with observed leaves A, G, G, T, T; internal nodes X, Y, Z; root W; branch lengths t1–t6.]
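The recursion above (Felsenstein's pruning algorithm) can be sketched under the Jukes-Cantor model; for simplicity every branch gets the same length t here, whereas the figure uses separate lengths t1–t6:

```python
import math

BASES = "ACGT"

def p_jc(a, b, t):
    """Jukes-Cantor transition probability for branch length t
    (t in expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def cond_likelihood(node, t):
    """Pruning: {state x: L(x)} for a subtree. Leaves are observed
    bases; internal nodes are (left, right) pairs. L(x) multiplies,
    over the two children, sum_y P_xy(t) * L_child(y)."""
    if isinstance(node, str):
        return {x: 1.0 if x == node else 0.0 for x in BASES}
    left = cond_likelihood(node[0], t)
    right = cond_likelihood(node[1], t)
    return {x: sum(p_jc(x, y, t) * left[y] for y in BASES) *
               sum(p_jc(x, y, t) * right[y] for y in BASES)
            for x in BASES}

def site_likelihood(tree, t=0.1):
    """Total likelihood for one site: L = sum_W pi_W L(W), pi_W = 1/4."""
    root = cond_likelihood(tree, t)
    return sum(0.25 * root[x] for x in BASES)
```

Multiplying (or summing logs) over sites gives L(data|tree), the quantity ML maximizes over topologies and branch lengths.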
• The ML tree maximizes the total likelihood of the data given the tree, i.e., L(data|tree)
• We want to compute the posterior probability P(tree|data)
• From Bayes' theorem, P(tree|data) = L(data|tree) Pprior(tree) / ∑ L(data|tree) Pprior(tree) (summation over all possible trees)
• Namely, the posterior probability ∝ L(data|tree) Pprior(tree)
• The problem is the summation over all possible trees
• Moreover, what we really want is, given the data, the posterior probability that a particular clade of interest is present:
Ppost(clade|data) = ∑trees containing the clade L(data|tree) Pprior(tree) / ∑all trees L(data|tree) Pprior(tree)
• In practice, Ppost(clade|data) = (number of sampled trees containing the clade) / (total number of trees in the sample)
Computing Posterior Prob.
Maximum likelihood is an alternative to maximum parsimony. It is computationally intensive. A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process.
What are the tree topology and branch lengths that have the greatest likelihood of producing the observed data set?
ML is implemented in the TREE-PUZZLE program, as well as PAUP and PHYLIP.
Making trees using maximum likelihood
Page 262
(1) Reconstruct all possible quartets A, B, C, D. For 12 myoglobins there are 495 possible quartets.
(2) Puzzling step: begin with one quartet tree. N-4 sequences remain. Add them to the branches systematically, estimating the support for each internal branch. Report a consensus tree.
Maximum likelihood: Tree-Puzzle
Maximum likelihood tree
Quartet puzzling
Bayesian inference of phylogeny with MrBayes
Calculate:
Pr[Tree | Data] = Pr[Data | Tree] × Pr[Tree] / Pr[Data]
Pr[Tree | Data] is the posterior probability distribution of trees. Ideally this involves a summation over all possible trees. In practice, Markov chain Monte Carlo (MCMC) runs are used to estimate the posterior probability distribution.
Notably, Bayesian approaches require you to specify prior assumptions about the model of evolution.
Bootstrapping is a commonly used approach to measuring the robustness of a tree topology. Given a branching order, how consistently does an algorithm find that branching order in a randomly permuted version of the original data set?
Stage 5: Evaluating trees: bootstrapping
Page 266
To bootstrap, make an artificial dataset obtained by randomly sampling columns from your multiple sequence alignment. Make the dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates. Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. >70% is considered significant.
In 61% of the bootstrap resamplings, ssrbp and btrbp (pig and cow RBP) formed a distinct clade. In 39% of the cases, another protein joined the clade (e.g. ecrbp), or one of these two sequences joined another clade.
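The column-resampling step of bootstrapping can be sketched as follows (tree building and clade counting for each replicate are elided):

```python
import random

def bootstrap_replicates(alignment, n_reps=100, rng=None):
    """Resample alignment columns with replacement; each pseudo-alignment
    has the same length as the original. `alignment`: {taxon: sequence},
    all sequences of equal length."""
    rng = rng or random.Random(0)
    length = len(next(iter(alignment.values())))
    reps = []
    for _ in range(n_reps):
        cols = [rng.randrange(length) for _ in range(length)]
        reps.append({t: "".join(s[c] for c in cols)
                     for t, s in alignment.items()})
    return reps
```

A tree is then built from each replicate, and the support for a clade is the fraction of replicate trees that contain it.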
Bootstrapping
A method of assessing the reliability of trees
The numbers in the rooted tree are called bootstrap percentages
Distances computed from models are subject to chance fluctuations
Bootstrapping addresses the question of whether these fluctuations are influencing the tree configuration
Bootstrapping deliberately constructs sequence data sets that differ by some small random fluctuations from the real sequences, and checks whether the same tree topology is obtained
Randomized sequences are constructed by sampling columns
Generate 100 or 1,000 randomized sequence sets, and compute what percentage of the randomized trees contain the same group
A 77% bootstrap value is considered to be reliable
e.g. 24% – doubtful that they form a clade; 71% – human/chimpanzee/pygmy chimpanzee
Between two high figures: chimpanzee/pygmy chimpanzee always form a clade; gorilla/human/chimpanzee/pygmy chimpanzee always form a clade
(gorilla,(human,chimpanzees)) appears more frequently than (human,(gorilla,chimpanzees)) or (chimpanzees,(gorilla,human))
Thus, we can conclude (human, chimpanzees) is more reliable
A consensus tree can be constructed: the frequency of each possible clade is determined, and the consensus tree is built by adding clades in order of decreasing frequency
Tree Optimization
Evaluate trees according to the least-squares error E = ∑i,j (dij − dtree,ij)² / dij²
Fitch and Margoliash, 1967
Clustering methods such as NJ and UPGMA have a well-defined algorithm and produce a single tree, but no explicit criterion
An optimization approach has a well-defined criterion, but no well-defined algorithm: many alternative trees must be constructed and each tested against the criterion
Other optimization approaches:
Maximum likelihood – choose the tree on which the likelihood of observing the given sequences is highest
Parsimony – choose the tree for which the fewest substitutions are required in the sequences
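The least-squares criterion is a one-liner (sketch; `d_obs` and `d_tree` map species pairs to observed and tree-implied distances):

```python
def ls_error(d_obs, d_tree):
    """Fitch-Margoliash least-squares criterion:
    E = sum over pairs (i, j) of (d_ij - d_tree_ij)^2 / d_ij^2."""
    return sum((d_obs[p] - d_tree[p]) ** 2 / d_obs[p] ** 2 for p in d_obs)
```

Dividing by d_ij² downweights errors on large distances, which are estimated less precisely.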
Tree Space
The number of distinct trees grows as a double factorial n!! (product of odd numbers)
For N species: (2N−5)!! unrooted trees, (2N−3)!! rooted trees
N=7: 9·7·5·3·1 = 945; N=10: about 2.0×10⁶
Consider a ‘tree space’ as the set of all possible tree topologies
Two trees are neighbors if they differ by a topological change known as a nearest-neighbor interchange (NNI)
With NNI, an internal branch of a tree is selected, and a subtree at one end is swapped with a subtree at the other end of the internal branch
Tree 4 is not a neighbor of tree 1 (they are related instead by subtree pruning and regrafting (SPR))
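The double-factorial counts can be checked directly (a sketch):

```python
def num_unrooted(n):
    """Number of unrooted binary topologies for n >= 3 taxa: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # 3 * 5 * ... * (2n-5)
        count *= k
    return count

def num_rooted(n):
    """(2n-3)!! rooted topologies: rooting a tree on n taxa is the same
    count as an unrooted tree on n+1 taxa."""
    return num_unrooted(n + 1)
```

The explosive growth (945 trees for 7 taxa, about two million for 10) is why exhaustive search is limited to roughly a dozen taxa.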
Optimization in Tree Space
Hill-climbing algorithm:
Given an initial tree (from a distance matrix, for example)
Find a neighboring tree that is better; if found, move to this new tree, and search its neighbors
Continue until a local optimum is reached (no better neighbors are found)
Cannot guarantee a global optimum
Heuristic search:
Start with three random species, and construct an unrooted tree
Add one species at a time, connecting it in the optimal way
Continue with different initial random triples of species, each time producing a local optimum
Repeat this enough times, and one may claim a global optimum
ML vs. Parsimony
Parsimony is fast – ML requires each tree topology to be optimized
ML is model-based; parsimony's implicit model is equal substitution rates
Parsimony can incorporate models, but it is not clear what the weights should be
Parsimony tries to minimize the number of substitutions, irrespective of the branch lengths
ML allows for changes being more likely to happen on longer branches; on a long branch, there is no reason to try to minimize the number of substitutions
Parsimony is strong for evaluating trees based on qualitative characters