Chap. 7. Building Trees
Fixation in Neutral Model
Advantageous mutation with a fitness 1 + s (s: selection coefficient)
If there are m copies of the mutation, the mean fitness of the population is E[W] = [m(1+s) + (N-m)]/N
Wright-Fisher model: a gene copy is selected for the next generation with a probability proportional to its fitness, a = m(1+s)/(N E[W])
The number of copies in the next generation is still binomially distributed
Exact analysis by the diffusion model
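A minimal Wright-Fisher simulation of this process (a sketch; the function names, seeds, and parameter values are illustrative, not from the slides):

```python
import random

def wright_fisher_step(m, N, s, rng=None):
    """One Wright-Fisher generation with selection.

    m mutant copies of fitness 1+s among N gene copies; mean fitness
    E[W] = (m*(1+s) + (N-m)) / N, and each copy in the next generation
    is a mutant with probability a = m*(1+s) / (N*E[W]).
    The new mutant count is therefore binomial(N, a).
    """
    rng = rng or random.Random(0)
    mean_w = (m * (1 + s) + (N - m)) / N
    a = m * (1 + s) / (N * mean_w)
    return sum(1 for _ in range(N) if rng.random() < a)  # binomial(N, a) draw

def run_to_absorption(N=100, s=0.05, rng=None):
    """Iterate from a single mutant copy until fixation (m=N) or loss (m=0)."""
    rng = rng or random.Random(1)
    m = 1
    while 0 < m < N:
        m = wright_fisher_step(m, N, s, rng)
    return m == N  # True if the mutation fixed
```

Repeating `run_to_absorption` many times estimates the fixation probability that the diffusion model gives exactly.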
[Figure: example tree with taxa A–I drawn against a time axis; branch lengths (units of 1, 2, or 6) are marked on the edges, with one unit indicated by a scale bar.]
Tree nomenclature
[Figure: the same tree with taxa A–I and branch lengths against the time axis; the leaves are labeled as taxa.]
Tree nomenclature: clades
Clade ABF (monophyletic group)
Examples of clades
Lindblad-Toh et al., Nature 438: 803 (2005), fig. 10
Phylogenetic Methods
A family of related sequences that evolved from a common ancestor is studied with phylogenetic trees showing the order of evolution
We want a tree representation showing: divergence among species; evolutionary distance
Usually unrooted
[Figure: the three possible unrooted tree topologies for four taxa A, B, C, D.]
Phylogenetic Trees
A rooted tree provides the direction of evolution and its distances
An unrooted tree is less informative
Finding a root: use a known species relationship; if none is known, use the mid-point method: find a point on the tree such that the mean distance to the leaves is identical on either side – this assumes the same evolution rate in all lineages
Tree Construction
Multiple sequences are aligned
Use JC or other models to compute pairwise evolutionary distances
From the distance matrix, use a clustering method: join the closest two clusters to form a larger one; recompute distances between all clusters; repeat the two steps above until all species are connected
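The distance step above can be sketched with the Jukes-Cantor model, d = -(3/4) ln(1 - 4p/3), where p is the observed fraction of differing sites (a toy sketch; the sequences and names are made up):

```python
import math

def jc_distance(s1, s2):
    """Jukes-Cantor evolutionary distance between two aligned DNA sequences:
    with p the fraction of differing (ungapped) sites,
    d = -(3/4) * ln(1 - 4p/3)."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    p = sum(a != b for a, b in pairs) / len(pairs)
    return -0.75 * math.log(1 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Pairwise JC distances for a dict {name: aligned sequence}."""
    names = sorted(seqs)
    return {(a, b): jc_distance(seqs[a], seqs[b])
            for a in names for b in names if a < b}
```

The resulting matrix is the input to the clustering loop described above.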
Tree-building methods fall into two classes, distance-based and character-based:
Distance-based methods involve a distance metric, such as the number of amino acid changes between the sequences, or a distance score. Examples of distance-based algorithms are UPGMA and neighbor-joining.
Character-based methods include maximum parsimony and maximum likelihood. Parsimony analysis involves the search for the tree with the fewest amino acid (or nucleotide) changes that account for the observed differences between taxa.
Tree-building methods
Example with Globin
Distance-based tree: calculate the pairwise alignments; if two sequences are related, put them next to each other on the tree
Character-based tree: identify positions that best describe how characters (amino acids) are derived from common ancestors
Tree from Distance Matrix
Given a weighted tree, with weights on edges representing evolutionary distances
Additive distances: di,c + dc,j = Di,j
Find the nearest leaves and combine them under the same parent; it is not easy to find neighboring leaves
Reconstructing the tree: shorten all hanging edges of the tree
Reduce the length of every hanging edge by the same small amount δ; the distance matrix is then reduced by 2δ
Find a leaf with 0 weight and remove it
Additive matrix
Tree-building methods: UPGMA
UPGMA is the unweighted pair group method using arithmetic mean.
Tree-building methods: UPGMA
Step 1: compute the pairwise distances of all the proteins. Get ready to put the numbers 1-5 at the bottom of your new tree.
Tree-building methods: UPGMA
Step 2: Find the two proteins with the smallest pairwise distance. Cluster them.
Tree-building methods: UPGMA
Step 3: Do it again. Find the next two proteins with the smallest pairwise distance. Cluster them.
Tree-building methods: UPGMA
Step 4: Keep going. Cluster.
Tree-building methods: UPGMA
Step 5: Last cluster! This is your tree.
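The clustering loop in the steps above can be sketched as follows (a simplified implementation; it returns only the topology as nested tuples, omitting branch heights):

```python
def upgma(dist, names):
    """UPGMA sketch. dist[(a, b)] holds the distance for every ordered
    pair of leaf names. Cluster-to-cluster distance is the arithmetic
    mean over leaf pairs, maintained with the size-weighted update
    d(ab, c) = (|a| d(a,c) + |b| d(b,c)) / (|a| + |b|)."""
    size = {n: 1 for n in names}
    subtree = {n: n for n in names}
    d = {tuple(sorted(p)): v for p, v in dist.items()}
    active = set(names)
    while len(active) > 1:
        # join the closest pair of active clusters
        a, b = min((p for p in d if p[0] in active and p[1] in active),
                   key=d.get)
        merged = a + "+" + b
        for c in active - {a, b}:
            dac = d[tuple(sorted((a, c)))]
            dbc = d[tuple(sorted((b, c)))]
            d[tuple(sorted((merged, c)))] = (
                (size[a] * dac + size[b] * dbc) / (size[a] + size[b]))
        size[merged] = size[a] + size[b]
        subtree[merged] = (subtree[a], subtree[b])
        active -= {a, b}
        active.add(merged)
        root = merged
    return subtree[root]
```

On an ultrametric matrix this recovers the true rooted topology; with unequal rates it may not (see the caveat below).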
UPGMA is a simple approach for making trees.
• An UPGMA tree is always rooted.
• An assumption of the algorithm is that the molecular clock is constant for sequences in the tree. If there are unequal substitution rates, the tree may be wrong.
• While UPGMA is simple, it is less accurate than the neighbor-joining approach (described next).
Distance-based methods: UPGMA trees
Page 256
UPGMA Method
Distance between two clusters is defined as the mean of the distances between species in the two clusters
Human cluster vs. chimpanzee/pygmy chimpanzee cluster: the mean of the human-chimpanzee and human-pygmy chimpanzee distances
Produces a rooted tree
Tree distance between chimpanzee and pygmy chimpanzee is 0.0149/2
All species end right-aligned (because the same molecular evolution rate is assumed in every species) – for this reason the method is not much used
Neighbor-Joining (NJ) Method
Additive distances in an unrooted tree: the distance between two species is the sum of the branch lengths connecting them
NJ method: construct an unrooted tree whose branch lengths are as close as possible to the distance matrix among species
Algorithm: join two neighbors, and replace them by a new internal node; keep repeating this step until all species are covered
The neighbor-joining method of Saitou and Nei (1987) is especially useful for making a tree having a large number of taxa.
Begin by placing all the taxa in a star-like structure.
Making trees using neighbor-joining
Page 259
Tree-building methods: Neighbor joining
Next, identify neighbors (e.g. 1 and 2) that are most closely related. Connect these neighbors to other OTUs via an internal branch, XY. At each successive stage, minimize the sum of the branch lengths.
Tree-building methods: Neighbor joining
Define the distance from X to Y by
dXY = 1/2(d1Y + d2Y – d12)
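Combining this update with the standard pair-selection criterion Q(i,j) = (n-2) d(i,j) - Σk d(i,k) - Σk d(j,k) of Saitou and Nei gives one joining step; a sketch (the toy distance values in the test come from a small additive tree):

```python
def nj_join_step(d, taxa):
    """One neighbor-joining step (sketch).

    d[(a, b)]: current pairwise distances (both orders present).
    Picks the pair (i, j) minimizing
        Q(i, j) = (n-2) d(i, j) - sum_k d(i, k) - sum_k d(j, k),
    then computes the distance from the new internal node X to every
    remaining taxon Y as d(X, Y) = (d(i, Y) + d(j, Y) - d(i, j)) / 2.
    """
    n = len(taxa)
    r = {a: sum(d[(a, b)] for b in taxa if b != a) for a in taxa}
    i, j = min(((a, b) for a in taxa for b in taxa if a < b),
               key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
    new = {y: (d[(i, y)] + d[(j, y)] - d[(i, j)]) / 2
           for y in taxa if y not in (i, j)}
    return (i, j), new
```

Repeating the step on the reduced matrix until two nodes remain yields the full unrooted NJ tree.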
Tree-building methods: character based
Rather than pairwise distances between proteins,evaluate the aligned columns of amino acidresidues (characters).
As an example of tree-building using maximum parsimony, consider these four taxa:
AAG, AAA, GGA, AGA
How might they have evolved from a common ancestor such as AAA?
Page 261
[Figure: the three candidate tree topologies for taxa AAG, AAA, GGA, AGA, each with ancestral nodes labeled (e.g. AAA) and the number of changes marked on each branch.]
Tree-building methods: Maximum parsimony
Cost = 3; Cost = 4; Cost = 4
In maximum parsimony, choose the tree(s) with the lowest cost (shortest branch lengths).
Page 261
Parsimony
Use the simplest possible explanation of the data, i.e. the one with the fewest assumptions
Binary states: 0 for the ancestral character, 1 for the derived character
Character 0 may be the ancestral tetrapod forelimb bone structure; 1 may be the bone structure in the bird wing
C, D possess a derived character not possessed by A, B
Tree a: the character must have evolved on the + branch
Tree b: evolved once (+), and lost (*)
Tree c: evolved independently (+) on two branches
Parsimony criterion – tree a is the simplest (a single state change)
The main idea of character-based methods is to find the tree with the shortest branch lengths possible. Thus we seek the most parsimonious (“simple”) tree.
• Identify informative sites. For example, constant characters are not parsimony-informative.
• Construct trees, counting the number of changes required to create each tree. For about 12 taxa or fewer, evaluate all possible trees exhaustively; for > 12 taxa perform a heuristic search.
• Select the shortest tree (or trees).
Making trees using character-based methods
Page 260
Small Parsimony Problem
Characters in the string are independent, so the problem can be solved independently for each character
Assume that each leaf is labeled by a single character, and solve this more general problem
The length of an edge is defined by the Hamming distance: for a k-letter alphabet, dH(v,w) = 0 if v = w, and 1 otherwise
Small Parsimony Problem: find the most parsimonious labeling of the internal vertices
input: tree T with each leaf labeled by an m-character string
output: labeling of internal vertices of T minimizing the parsimony score
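For unit edge costs, the small parsimony problem is solved per character by Fitch's algorithm; a sketch on rooted binary trees given as nested tuples (the earlier example taxa AAG, AAA, GGA, AGA reappear in the test):

```python
def fitch(node):
    """Fitch's algorithm for the (unweighted) small parsimony problem on
    a single character. A leaf is a one-letter string; an internal node
    is a pair (left, right). Returns (candidate state set, min #changes).
    """
    if isinstance(node, str):
        return {node}, 0
    (ls, lc), (rs, rc) = fitch(node[0]), fitch(node[1])
    common = ls & rs
    if common:                 # children agree: intersect, no new change
        return common, lc + rc
    return ls | rs, lc + rc + 1  # disagree: union, count one change

def parsimony_score(topology, length):
    """Sum the per-character Fitch score over m-character leaf labels."""
    def site(node, i):
        if isinstance(node, str):
            return node[i]
        return (site(node[0], i), site(node[1], i))
    return sum(fitch(site(topology, i))[1] for i in range(length))
```

On the four-taxon example this reproduces the costs 3 and 4 quoted for the alternative topologies.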
Weighted Small Parsimony Problem
David Sankoff: dynamic programming, 1975
For an internal vertex v with children u and w:
s_t(v) = min_i {s_i(u) + δ_i,t} + min_j {s_j(w) + δ_j,t}
Weighted Small Parsimony Problem: find the labeling of internal vertices with minimal weighted parsimony score
input: tree T with each leaf labeled by a letter of a k-letter alphabet, and a k×k scoring matrix
output: labeling of internal vertices of T minimizing the weighted parsimony score
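A sketch of the Sankoff recursion on a single character (tree as nested tuples; `delta` is the k×k scoring matrix as a dict keyed by state pairs):

```python
INF = float("inf")

def sankoff(node, alphabet, delta):
    """Sankoff's dynamic programming (1975) for weighted small parsimony
    on one character. Returns {state t: s_t(node)}. For an internal
    vertex v with children u, w:
        s_t(v) = min_i(s_i(u) + delta[(i, t)]) + min_j(s_j(w) + delta[(j, t)])
    A leaf labeled x has s_x = 0 and s_t = infinity for t != x."""
    if isinstance(node, str):
        return {t: (0 if t == node else INF) for t in alphabet}
    su = sankoff(node[0], alphabet, delta)
    sw = sankoff(node[1], alphabet, delta)
    return {t: min(su[i] + delta[(i, t)] for i in alphabet) +
               min(sw[j] + delta[(j, t)] for j in alphabet)
            for t in alphabet}

# With a unit-cost matrix, the minimum over root states equals the
# unweighted (Fitch) parsimony score.
unit = {(i, t): 0 if i == t else 1 for i in "AG" for t in "AG"}
```

The minimal score is min over t of s_t(root); a traceback over the minimizing choices recovers the internal labeling.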
Parsimony Example
Five species, scored for 6 characters with state 0 or 1 each
Calculate how many changes of state are needed on a given tree, for example:
Species 1 2 3 4 5 6
Alpha 1 0 0 1 1 0
Beta 0 0 1 0 0 0
Gamma 1 1 0 0 0 0
Delta 1 1 0 1 1 1
Epsilon 0 0 1 1 1 0
[Figure: the example tree with leaves alpha, delta, gamma, beta, epsilon.]
Species 1
Alpha 1
Beta 0
Gamma 1
Delta 1
Epsilon 0
[Figure: the two alternative reconstructions of character 1 on the tree. Red: state 1; regular type: state 0.]
Reconstruction of Character 1
Reconstruction of Character 2
Species 2
Alpha 0
Beta 0
Gamma 1
Delta 1
Epsilon 0
[Figure: alternative reconstructions of character 2 on the tree.]
Reconstruction of Character 3
Species 3
Alpha 0
Beta 1
Gamma 0
Delta 0
Epsilon 1
[Figure: alternative reconstructions of character 3 on the tree.]
Reconstruction of character 4, 5
[Figure: reconstructions of characters 4 and 5 on the tree.]
Species 4 5
Alpha 1 1
Beta 0 0
Gamma 0 0
Delta 1 1
Epsilon 1 1
Species 6
Alpha 0
Beta 0
Gamma 0
Delta 1
Epsilon 0
Reconstruction of character 6
[Figure: reconstruction of character 6 on the tree.]
Reconstruction with All Changes
[Figure: the tree with the changes for all six characters marked on branches: 4; 2,6; 2,5; 1,3; 5; 4.]
• Total number of changes, taking a random choice where more than one reconstruction is possible: 1 + 2 + 2 + 2 + 1 + 1 = 9
Most Parsimonious Trees
[Figure: three rooted drawings of the most parsimonious tree over alpha, delta, gamma, beta, epsilon, each with changes 2; 6; 4,5; 1,3; 4,5 marked on branches.]
• The three trees are identical when unrooted
How to determine Branch Lengths
Given an unrooted tree, use the average over all possible reconstructions of each character
[Figure: left, the tree with per-branch change counts (4; 2,6; 2,5; 1,3; 5; 4); right, branch lengths averaged over all reconstructions (1; 1.5; 0.5; 2.5; 1; 1.5; 1).]
Large Parsimony Problem
NP-complete
Greedy heuristic: start with an arbitrary tree; move from one tree to another if it lowers the parsimony score, using nearest-neighbor interchange
Large Parsimony Problem: find a tree with n leaves having the minimal parsimony score
input: an n×m matrix M describing n species, each represented by an m-character string
output: a tree T with n leaves labeled by the n rows of matrix M, and a labeling of the internal vertices with minimal parsimony score over all possible trees and labelings
Modifying Trees
Nearest Neighbor Interchange
Swap two adjacent branches: erase an interior branch and the two branches connected to it, then reconnect the resulting subtrees in an alternative arrangement
• Likelihood ratios
• Example: predict helices and loops in a protein
• Known info: helices have a high content of hydrophobic residues
• ph and pl: frequencies of an amino acid being in a helix or a loop
• Lh and Ll: likelihoods that a sequence of N amino acids is in a helix or a loop
• Lh = ∏i=1..N ph(ai), Ll = ∏i=1..N pl(ai)
• Rather than the likelihoods, their ratio carries more information
• Lh/Ll: is the sequence more likely to be a helical or a loop region?
• S = ln(Lh/Ll) = ∑i=1..N ln(ph(ai)/pl(ai)): positive for a helical region
• Partition a sequence into N-amino-acid segments (N=300)
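A toy sketch of the log-odds score; the two-letter alphabet (H = hydrophobic, P = polar) and the frequency values are invented for illustration, not taken from the slides:

```python
import math

# Hypothetical per-residue frequencies under the two models.
p_h = {"H": 0.7, "P": 0.3}   # residue frequencies inside helices
p_l = {"H": 0.4, "P": 0.6}   # residue frequencies inside loops

def log_odds(segment):
    """S = ln(Lh/Ll) = sum_i ln(p_h(a_i) / p_l(a_i)).
    S > 0 favors the helix model for this segment."""
    return sum(math.log(p_h[a] / p_l[a]) for a in segment)
```

Summing log ratios instead of multiplying likelihoods avoids numerical underflow on long segments.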
Probabilistic Models
• The previous example has two hypotheses (helix or loop)
• The sequence is described by models 0 and 1
• Models 0 and 1 are defined by ph and pl
• Generalize to k hypotheses: models Mk (k=0,1,2,…)
• Given a test dataset D, what is the probability that D is described by each of the models?
• Known info: prior probabilities Pprior(Mk) for each model, from other information sources
• Compute the likelihood of D according to each of the models: L(D|Mk)
• Of interest is not the probability of D arising from Mk but the probability of D being described by Mk
• Namely, the posterior probability Ppost(Mk|D) ∝ L(D|Mk) Pprior(Mk)
• Ppost(Mk|D) = L(D|Mk) Pprior(Mk) / ∑i L(D|Mi) Pprior(Mi)
• => Bayesian probability
Prior and Posterior Probs.
• Basic principles
• We make inference using posterior probabilities
• If the posterior probability of one model is higher, it can be chosen as the best model with confidence
• Special case: two models
• Two prior probabilities: Pprior0, Pprior1
• Pposti = Li Ppriori / (L0 Pprior0 + L1 Pprior1)
• Log-odds score: S′ = ln(L1 Pprior1 / L0 Pprior0) = ln(L1/L0) + ln(Pprior1/Pprior0) = S + ln(Pprior1/Pprior0)
• The difference between S′ and S is simply an additive constant, so the ranking will be identical whether we use S′ or S
• Warning: if Pprior1 is small, S has to be high to make S′ positive
• When Pprior0 = Pprior1, S′ = S
• Ppost1 = 1/(1 + L0 Pprior0 / L1 Pprior1) = 1/(1 + exp(−S′))
• S′ = 0 → Ppost1 = 1/2; when S′ is large and positive, Ppost1 ≈ 1
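The two-model posterior can be computed either directly or through the logistic form 1/(1 + exp(−S′)); both give the same value (a sketch):

```python
import math

def posterior_model1(L0, L1, prior0, prior1):
    """Ppost1 = L1*prior1 / (L0*prior0 + L1*prior1); equivalently
    1 / (1 + exp(-S')) with S' = ln(L1/L0) + ln(prior1/prior0)."""
    s_prime = math.log(L1 / L0) + math.log(prior1 / prior0)
    return 1.0 / (1.0 + math.exp(-s_prime))
```

With equal priors the posterior reduces to the likelihood ratio alone, as the S′ = S case states.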
Bayesian Prob.
• Given a model of sequence evolution and a proposed tree structure, compute the likelihood that the known sequences would have evolved on that tree
• ML chooses the tree that maximizes this likelihood
• Three parameters: tree topology; branch lengths; values of the parameters in the rate matrix
Maximum Likelihood (ML) Phylogeny
• Given a model of sequence evolution at a site
• Likelihood of ancestor X: L(X) = PXA(t1) PXG(t2)
• L(Y) = PYG(t4) ∑X L(X) PYX(t3)
• L(W) = ∑Y ∑Z L(Y) PWY(t5) L(Z) PWZ(t6)
• Total likelihood for the site: L = ∑W πW L(W), where πW is the equilibrium probability
• Is equal to the posterior probability of different clades
What is Likelihood in ML Tree ?
[Figure: tree for one site with observed leaves A, G, G, T, T; internal nodes X, Y, Z; root W; branch lengths t1–t6.]
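The recursion above (Felsenstein's pruning algorithm) can be sketched under the Jukes-Cantor model; for simplicity every branch gets the same length t here, whereas the figure uses separate lengths t1–t6:

```python
import math

BASES = "ACGT"

def p_jc(a, b, t):
    """Jukes-Cantor transition probability for branch length t
    (t in expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def cond_likelihood(node, t):
    """Pruning: {state x: L(x)} for a subtree. Leaves are observed
    bases; internal nodes are (left, right) pairs. L(x) multiplies,
    over the two children, sum_y P_xy(t) * L_child(y)."""
    if isinstance(node, str):
        return {x: 1.0 if x == node else 0.0 for x in BASES}
    left = cond_likelihood(node[0], t)
    right = cond_likelihood(node[1], t)
    return {x: sum(p_jc(x, y, t) * left[y] for y in BASES) *
               sum(p_jc(x, y, t) * right[y] for y in BASES)
            for x in BASES}

def site_likelihood(tree, t=0.1):
    """Total likelihood for one site: L = sum_W pi_W L(W), pi_W = 1/4."""
    root = cond_likelihood(tree, t)
    return sum(0.25 * root[x] for x in BASES)
```

Multiplying (or summing logs) over sites gives L(data|tree), the quantity ML maximizes over topologies and branch lengths.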
• The ML tree maximizes the total likelihood of the data given the tree, i.e., L(data|tree)
• We want to compute the posterior probability P(tree|data)
• From Bayes' theorem, P(tree|data) = L(data|tree) Pprior(tree) / ∑ L(data|tree) Pprior(tree) (summation over all possible trees)
• Namely, the posterior probability ∝ L(data|tree) Pprior(tree)
• The problem is the summation over all possible trees
• Moreover, what we really want is, given the data, the posterior probability that a particular clade of interest is present:
Ppost(clade|data) = ∑trees containing the clade L(data|tree) Pprior(tree) / ∑all trees L(data|tree) Pprior(tree)
• In practice, Ppost(clade|data) = (number of sampled trees containing the clade) / (total number of trees in the sample)
Computing Posterior Prob.
Maximum likelihood is an alternative to maximum parsimony. It is computationally intensive. A likelihood is calculated for the probability of each residue in an alignment, based upon some model of the substitution process.
What are the tree topology and branch lengths that have the greatest likelihood of producing the observed data set?
ML is implemented in the TREE-PUZZLE program, as well as PAUP and PHYLIP.
Making trees using maximum likelihood
Page 262
(1) Reconstruct all possible quartets A, B, C, D. For 12 myoglobins there are 495 possible quartets.
(2) Puzzling step: begin with one quartet tree. N-4 sequences remain. Add them to the branches systematically, estimating the support for each internal branch. Report a consensus tree.
Maximum likelihood: Tree-Puzzle
Maximum likelihood tree
Quartet puzzling
Bayesian inference of phylogeny with MrBayes
Calculate:
Pr[Tree | Data] = Pr[Data | Tree] × Pr[Tree] / Pr[Data]
Pr[Tree | Data] is the posterior probability distribution of trees. Ideally this involves a summation over all possible trees. In practice, Markov chain Monte Carlo (MCMC) runs are used to estimate the posterior probability distribution.
Notably, Bayesian approaches require you to specify prior assumptions about the model of evolution.
Bootstrapping is a commonly used approach to measuring the robustness of a tree topology. Given a branching order, how consistently does an algorithm find that branching order in a randomly permuted version of the original data set?
Stage 5: Evaluating trees: bootstrapping
Page 266
To bootstrap, make an artificial dataset obtained by randomly sampling columns from your multiple sequence alignment. Make the dataset the same size as the original. Do 100 (to 1,000) bootstrap replicates. Observe the percent of cases in which the assignment of clades in the original tree is supported by the bootstrap replicates. >70% is considered significant.
In 61% of the bootstrap resamplings, ssrbp and btrbp (pig and cow RBP) formed a distinct clade. In 39% of the cases, another protein joined the clade (e.g. ecrbp), or one of these two sequences joined another clade.
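The column-resampling step of bootstrapping can be sketched as follows (tree building and clade counting for each replicate are elided):

```python
import random

def bootstrap_replicates(alignment, n_reps=100, rng=None):
    """Resample alignment columns with replacement; each pseudo-alignment
    has the same length as the original. `alignment`: {taxon: sequence},
    all sequences of equal length."""
    rng = rng or random.Random(0)
    length = len(next(iter(alignment.values())))
    reps = []
    for _ in range(n_reps):
        cols = [rng.randrange(length) for _ in range(length)]
        reps.append({t: "".join(s[c] for c in cols)
                     for t, s in alignment.items()})
    return reps
```

A tree is then built from each replicate, and the support for a clade is the fraction of replicate trees that contain it.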
Bootstrapping
A method of assessing the reliability of trees
The numbers in the rooted tree are called bootstrap percentages
Distances computed from models are subject to chance fluctuations
Bootstrapping addresses the question of whether these fluctuations are influencing the tree configuration
Bootstrapping deliberately constructs sequence data sets that differ by some small random fluctuations from the real sequences, and checks whether the same tree topology is obtained
Randomized sequences are constructed by sampling columns
Generate 100 or 1,000 randomized sequence sets, and compute what percentage of the randomized trees contain the same group
A 77% bootstrap value is considered to be reliable
e.g. 24% – doubtful that they form a clade; 71% – human/chimpanzee/pygmy chimpanzee
Between two high figures: chimpanzee/pygmy chimpanzee always form a clade; gorilla/human/chimpanzee/pygmy chimpanzee always form a clade
(gorilla,(human,chimpanzees)) appears more frequently than (human,(gorilla,chimpanzees)) or (chimpanzees,(gorilla,human))
Thus, we can conclude (human, chimpanzees) is more reliable
A consensus tree can be constructed: the frequency of each possible clade is determined, and the consensus tree is built by adding clades in order of decreasing frequency
Tree Optimization
Evaluate trees according to the least-squares error E = ∑i,j (dij − dtree,ij)² / dij²
Fitch and Margoliash, 1967
Clustering methods such as NJ and UPGMA have a well-defined algorithm and produce a single tree, but no explicit criterion
An optimization approach has a well-defined criterion, but no well-defined algorithm: many alternative trees must be constructed and each tested against the criterion
Other optimization approaches:
Maximum likelihood – choose the tree on which the likelihood of observing the given sequences is highest
Parsimony – choose the tree for which the fewest substitutions are required in the sequences
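The least-squares criterion is a one-liner (sketch; `d_obs` and `d_tree` map species pairs to observed and tree-implied distances):

```python
def ls_error(d_obs, d_tree):
    """Fitch-Margoliash least-squares criterion:
    E = sum over pairs (i, j) of (d_ij - d_tree_ij)^2 / d_ij^2."""
    return sum((d_obs[p] - d_tree[p]) ** 2 / d_obs[p] ** 2 for p in d_obs)
```

Dividing by d_ij² downweights errors on large distances, which are estimated less precisely.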
Tree Space
The number of distinct trees grows as a double factorial n!! (product of odd numbers)
For N species: (2N−5)!! unrooted trees, (2N−3)!! rooted trees
N=7: 9·7·5·3·1 = 945; N=10: about 2.0×10⁶
Consider a ‘tree space’ as the set of all possible tree topologies
Two trees are neighbors if they differ by a topological change known as a nearest-neighbor interchange (NNI)
With NNI, an internal branch of a tree is selected, and a subtree at one end is swapped with a subtree at the other end of the internal branch
Tree 4 is not a neighbor of tree 1 (they are related instead by subtree pruning and regrafting (SPR))
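The double-factorial counts can be checked directly (a sketch):

```python
def num_unrooted(n):
    """Number of unrooted binary topologies for n >= 3 taxa: (2n-5)!!"""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # 3 * 5 * ... * (2n-5)
        count *= k
    return count

def num_rooted(n):
    """(2n-3)!! rooted topologies: rooting a tree on n taxa is the same
    count as an unrooted tree on n+1 taxa."""
    return num_unrooted(n + 1)
```

The explosive growth (945 trees for 7 taxa, about two million for 10) is why exhaustive search is limited to roughly a dozen taxa.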
Optimization in Tree Space
Hill-climbing algorithm:
Given an initial tree (from a distance matrix, for example)
Find a neighboring tree that is better; if found, move to this new tree, and search its neighbors
Continue until a local optimum is reached (no better neighbors are found)
Cannot guarantee a global optimum
Heuristic search:
Start with three random species, and construct an unrooted tree
Add one species at a time, connecting it in the optimal way
Continue with different initial random triples of species, each time producing a local optimum
Repeat this enough times, and one may claim a global optimum
ML vs. Parsimony
Parsimony is fast – ML requires each tree topology to be optimized
ML is model-based; parsimony's implicit model is equal substitution rates
Parsimony can incorporate models, but it is not clear what the weights should be
Parsimony tries to minimize the number of substitutions, irrespective of the branch lengths
ML allows for changes being more likely to happen on longer branches; on a long branch, there is no reason to try to minimize the number of substitutions
Parsimony is strong for evaluating trees based on qualitative characters