BINF6201/8201 Molecular phylogenetic methods 4 11-10-2011


Maximum likelihood methods

So far we have only considered a single site (configuration). The likelihood for all sites is the product of the likelihoods for each site if all the sites evolve independently.

Suppose there are s homologous sequences each with N nucleotides. Let Dn be the n-th column of the multiple alignment.

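$$D_n = (d_{1n}, d_{2n}, \ldots, d_{sn})^{\mathsf{T}}$$

For a tree T, let $f(D_n \mid \theta_1, \theta_2, \ldots, \theta_m, T)$ be the likelihood of tree T for the n-th site, where $\theta_1, \theta_2, \ldots, \theta_m$ are the unknown parameters such as the branch lengths. Using the previous case as an example, with $\theta_i = v_i$, we have

$$f(D_n \mid \theta_1, \theta_2, \ldots, \theta_m, T) = h(i, j, k, l \mid v_1, v_2, v_3, v_4, v_5, T) = \sum_{x} \sum_{y} \{ g_x P_{xl}(v_4) P_{xk}(v_3) P_{xy}(v_5) P_{yi}(v_1) P_{yj}(v_2) \}.$$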
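As a rough illustration of the single-site likelihood, the sketch below sums over the unknown bases x and y at the two internal nodes of a four-sequence tree. It assumes the Jukes-Cantor model (so every $g_x$ is 1/4) and assumes that branches $v_1$ and $v_2$ lead from y to i and j, $v_3$ and $v_4$ lead from x to k and l, and $v_5$ joins x and y. The function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def jc69_prob(a, b, v):
    """Jukes-Cantor probability that base a is replaced by base b along a
    branch of length v (expected substitutions per site)."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(i, j, k, l, v):
    """Likelihood of one alignment column (i, j, k, l).
    v = (v1, v2, v3, v4, v5) are the five branch lengths (assumed layout)."""
    v1, v2, v3, v4, v5 = v
    total = 0.0
    for x, y in product(BASES, repeat=2):   # unknown bases at the internal nodes
        total += (0.25                      # g_x under Jukes-Cantor
                  * jc69_prob(x, l, v4)
                  * jc69_prob(x, k, v3)
                  * jc69_prob(x, y, v5)
                  * jc69_prob(y, i, v1)
                  * jc69_prob(y, j, v2))
    return total

lik = site_likelihood("A", "A", "G", "G", (0.1, 0.2, 0.3, 0.4, 0.05))
```

A quick sanity check on such a function is that the likelihoods of all 256 possible columns add up to 1 for any choice of branch lengths.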

For simplicity, let's assume the sequences are homogeneous, i.e., all sites evolve at the same rate. Then the likelihood function for the entire sequence for the tree T is

$$L(\theta_1, \theta_2, \ldots, \theta_m \mid D, T) = \prod_{n=1}^{N} f(D_n \mid \theta_1, \theta_2, \ldots, \theta_m, T).$$

Here, we treat L as a function of the parameters. We then search for the values of $\theta_1, \theta_2, \ldots, \theta_m$ that maximize L given the topology of the tree T; this maximum of L is called the ML value of the tree T.

Finding the ML value can be a slow process. We do this for all possible tree topologies, and identify the one that has the largest ML value as the inferred phylogenetic tree of the s sequences.

Clearly, different substitution models may result in different trees. When the number of OTUs is large, a heuristic tree search algorithm should be used for evaluating the alternative trees.
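To make the product over sites concrete, here is a minimal sketch for the simplest possible case: two aligned sequences separated by a single branch of length v under the Jukes-Cantor model (an assumption chosen for brevity, not the four-sequence tree above). Because the product of many small numbers underflows, the sketch sums log-likelihoods instead. The crude grid search and the example sequences are hypothetical.

```python
import math

def log_likelihood(v, seq1, seq2):
    """Log-likelihood of branch length v for two aligned sequences (JC69).
    Sites are assumed independent, so per-site log-likelihoods are summed."""
    e = math.exp(-4.0 * v / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)   # g_x * P_xx(v)
    p_diff = 0.25 * (0.25 - 0.25 * e)   # g_x * P_xy(v), x != y
    return sum(math.log(p_same if a == b else p_diff)
               for a, b in zip(seq1, seq2))

def ml_branch_length(seq1, seq2, grid_size=5000, v_max=5.0):
    """Crude grid search for the v that maximizes the log-likelihood."""
    grid = [v_max * (i + 1) / grid_size for i in range(grid_size)]
    return max(grid, key=lambda v: log_likelihood(v, seq1, seq2))

seq1 = "ACGTACGTACGTACGTACGT"
seq2 = "ACGTACGAACGTACCTACGT"   # differs from seq1 at 2 of 20 sites
v_hat = ml_branch_length(seq1, seq2)
```

For this two-sequence case the maximum can also be found analytically, as the Jukes-Cantor distance $-\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$ with p the fraction of differing sites, so the grid search should land close to that value. Real ML programs optimize many parameters at once with numerical optimizers rather than a grid.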

Heuristic tree search using predefined clusters

Although the tree space can be very large, the majority of trees have extremely low likelihood values for a given set of OTUs, so we can safely ignore these unpromising trees and focus on the promising ones.

To reduce the search space, we can predefine clusters as input if their relationships are known.

Then the problem becomes to examine the (105) possible trees generated by connecting these predefined groups, instead of an astronomically large number of unrooted trees:

$$N_U = (2N-5)!! = (2 \times 23 - 5)!! = 41!!$$
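The size of the tree space can be checked with a few lines of code. The sketch below evaluates the double factorial $(2N-5)!!$; the function names are hypothetical. Note that 105 equals $7!!$, which would be consistent with six predefined groups, although the slide does not state the number of groups.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (for odd k)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def n_unrooted_trees(n_otus):
    """Number of unrooted bifurcating trees for n_otus OTUs: (2N - 5)!!"""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs")
    return double_factorial(2 * n_otus - 5)

print(n_unrooted_trees(6))    # 7!! = 105
print(n_unrooted_trees(23))   # 41!!, roughly 1.3e25
```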

The ML value is computed for each tree, and the one with the largest ML value is returned as the inferred tree.

Because this algorithm examines all possible trees connecting the predefined groups, the global optimum is guaranteed if the predefined groups are correct.

When the simple J-C model was used and a homogeneous substitution rate was assumed, the resulting ML tree was similar to the NJ and parsimony trees, with the same problem of misplacing the tree shrews inside the primate group.

Maximum likelihood trees for primates

However, when the more sophisticated HKY substitution model was used, with six Γ-distribution rate categories and invariant sites, the tree constructed by the ML method placed the tree shrews outside of the primate group.

Nevertheless, there are three trifurcations on this tree. At a trifurcation point, any of the three clusters can be an outgroup of the other two, and the three resulting trees have the same ML value.

Comparison of parsimony and maximum likelihood methods

Parsimony methods make only one assumption, that all changes on the branches are equally probable; however, this assumption may not hold.

Because so few assumptions are made in parsimony methods, their proponents believe that these methods can be applied to any sequence data.

Parsimony methods are also relatively fast, so they can be applied to larger data sets.

ML methods make assumptions about the evolutionary model. They need to optimize all of the model parameters to find the ML value, so they are computationally intensive and very slow.

When evolutionary models are properly selected, ML methods tend to achieve better results than parsimony methods.

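Heuristic tree search using quartet puzzling

The quartet puzzling algorithm is a very fast heuristic algorithm for exploring the promising trees.

Step 1: Compute the ML values of the three possible trees for every set of four sequences, and record the best ML tree for each set. For n sequences there are $3\binom{n}{4}$ such trees.

[Figure: the three possible unrooted trees for sequences 1, 2, 3 and 4, and the best ML tree selected for each quartet drawn from sequences 1 to 6.]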
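The bookkeeping in Step 1 can be sketched as follows. The code only enumerates the quartets and their three candidate topologies; computing the ML value of each candidate is left out. The function names and the tuple representation of a topology are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

def quartet_topologies(a, b, c, d):
    """The three possible unrooted topologies for four sequences,
    each written as a pair of sister pairs."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

def all_quartet_trees(sequences):
    """Map every set of four sequences to its three candidate topologies."""
    return {q: quartet_topologies(*q) for q in combinations(sequences, 4)}

trees = all_quartet_trees([1, 2, 3, 4, 5, 6])
n_quartets = len(trees)                          # C(6, 4) = 15
n_trees = sum(len(t) for t in trees.values())    # 3 * C(6, 4) = 45
```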

Step 2: Randomly pick four sequences, and place them in the tree according to their best ML quartet tree.

[Figure: the starting quartet tree for sequences 1, 2, 3 and 4.]

Step 3: Randomly pick one of the remaining sequences and add it to the tree, such that the growing tree contains the maximum number of best ML quartet trees. Repeat this process until all sequences have been added to the tree.

For example, if sequence 5 is randomly picked, and if one or both of the following trees are the best ML quartet trees involving 1, 2, 3, 4, and 5,

[Figure: two quartet trees containing sequence 5.]

then the resulting tree will be

[Figure: the tree for sequences 1 to 5.]

Then the last sequence, 6, is added to the tree. If the following tree has the best ML value among all quartet trees containing sequence 6,

[Figure: a quartet tree for sequences 6, 1, 3 and 4.]

then the resulting tree will be

[Figure: the tree for sequences 1 to 6.]

The whole process is repeated many times with the sequences selected in different orders, because the resulting tree depends on the order of sequence selection.

The tree that occurs most frequently is chosen as the inferred tree.
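The placement criterion in Step 3 can be sketched by counting how many best ML quartet trees a candidate tree agrees with. In this sketch a tree is represented by one side of each internal branch (a set of leaves), and a quartet ab|cd agrees with the tree if some internal branch separates {a, b} from {c, d}. The representation, the function names, and the list of best quartets are all hypothetical and not taken from the slides.

```python
def displays(splits, quartet):
    """True if some internal branch separates the two pairs of the quartet.
    splits: iterable of frozensets, one side of each internal branch.
    quartet: ((a, b), (c, d))."""
    (a, b), (c, d) = quartet
    for side in splits:
        if {a, b} <= side and not ({c, d} & side):
            return True
        if {c, d} <= side and not ({a, b} & side):
            return True
    return False

def support(splits, best_quartets):
    """Number of best ML quartet trees that the candidate tree agrees with."""
    return sum(displays(splits, q) for q in best_quartets)

# Five ways of attaching sequence 5 to a starting tree 12|34 (hypothetical).
candidates = {
    "next to 1": [frozenset({1, 5}), frozenset({3, 4})],
    "next to 2": [frozenset({2, 5}), frozenset({3, 4})],
    "internal":  [frozenset({1, 2}), frozenset({3, 4})],
    "next to 3": [frozenset({1, 2}), frozenset({3, 5})],
    "next to 4": [frozenset({1, 2}), frozenset({4, 5})],
}
# Hypothetical outcome of Step 1 for the quartets drawn from sequences 1 to 5.
best_quartets = [((1, 2), (3, 4)), ((1, 5), (2, 3)), ((1, 5), (2, 4)),
                 ((1, 5), (3, 4)), ((2, 5), (3, 4))]
scores = {name: support(s, best_quartets) for name, s in candidates.items()}
best_position = max(scores, key=scores.get)
```

With these made-up quartets, attaching sequence 5 next to sequence 1 agrees with all five best quartet trees, so that position would be chosen.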

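Bayesian phylogenetic methods

Bayes' theorem: if A and B are two events, then

$$P(AB) = P(B \mid A)P(A) = P(A \mid B)P(B),$$

$$P(B \mid A) = \frac{P(A \mid B)P(B)}{P(A)}.$$

[Figure: the sample space partitioned into events T1, T2, …, T12, together with an event D.]

If T1, T2, …, Tn are events that partition the sample space, and D is an event from the sample space, then

$$P(D) = P(D \mid T_1)P(T_1) + P(D \mid T_2)P(T_2) + \cdots + P(D \mid T_n)P(T_n) = \sum_{i=1}^{n} P(D \mid T_i)P(T_i),$$

and therefore

$$P(T_j \mid D) = \frac{P(D \mid T_j)P(T_j)}{P(D)} = \frac{P(D \mid T_j)P(T_j)}{\sum_{i=1}^{n} P(D \mid T_i)P(T_i)}.$$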

For N OTUs, there are n = (2N-5)!! possible unrooted trees, which form a partition of the tree space. Let D be the alignment of the N OTUs; we do not know which tree is most likely to account for D.

[Figure: the tree space partitioned into tree1, tree2, …, treen.]

In the ML method, we compute the probability (likelihood) that D is generated by each tree:

$$L(\text{tree}_i) = P(D \mid \text{tree}_i).$$

We find the maximum likelihood $ML = \max [P(D \mid \text{tree}_i)]$ by changing the parameters (branch lengths or substitution rates) on each tree i, and return the tree that has the largest ML.

In Bayesian methods, we compute the probability of a tree given the observed alignment of the N OTUs, which is called the posterior probability, $P(\text{tree}_j \mid D)$.

Using Bayes' theorem, we have

$$P(\text{tree}_j \mid D) = \frac{P(D \mid \text{tree}_j)P(\text{tree}_j)}{\sum_{i=1}^{n} P(D \mid \text{tree}_i)P(\text{tree}_i)},$$

where $P(\text{tree}_i)$ is called the prior probability.

Calculating the denominator of the posterior probability can be difficult, because we have to enumerate all possible trees, together with their branch lengths and substitution rates.

However, the denominator is the same constant for all possible trees, so the posterior probability of each tree is simply proportional to the likelihood of the tree multiplied by its prior probability:

$$P(\text{tree}_j \mid D) \propto P(D \mid \text{tree}_j)P(\text{tree}_j).$$

If we can generate a large sample of trees such that the frequency of a tree is proportional to its likelihood multiplied by its prior probability, then the posterior probability can be easily estimated by

$$P(\text{tree}_j \mid D) \approx \frac{\text{number of trees with the same topology as tree}_j}{\text{total number of trees in the sample}}.$$
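Both ways of obtaining a posterior can be sketched in a few lines: the exact normalization over all trees, which is only feasible when the trees can be enumerated, and the frequency estimate from a sample. The function names, likelihood values, and topology labels below are made up for illustration.

```python
from collections import Counter

def posterior(likelihoods, priors):
    """P(tree_j | D) = P(D | tree_j) P(tree_j) / sum_i P(D | tree_i) P(tree_i)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)            # the denominator, P(D)
    return [x / evidence for x in joint]

def posterior_from_sample(sampled_topologies):
    """Estimate posteriors as the fraction of sampled trees with each topology."""
    counts = Counter(sampled_topologies)
    total = sum(counts.values())
    return {topo: n / total for topo, n in counts.items()}

# Hypothetical likelihoods for three candidate trees, with a uniform prior.
post = posterior([0.002, 0.006, 0.002], [1 / 3, 1 / 3, 1 / 3])
# Hypothetical sample of topologies, e.g. as produced by an MCMC run.
est = posterior_from_sample(["t2", "t1", "t2", "t2", "t3"])
```

With a uniform prior the prior cancels, so the exact posterior here is just the likelihoods rescaled to sum to 1.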

The Markov chain Monte Carlo method for sampling

Markov chain Monte Carlo (MCMC) is a method for generating a sample from the entire sample space, such that the frequency of each individual in the sample is proportional to its likelihood of generating the observed data.

If we have no preference for any tree before seeing the data, we can use a non-informative uniform prior probability, and therefore

$$P(\text{tree}_j \mid D) = \frac{P(D \mid \text{tree}_j)P(\text{tree}_j)}{\sum_{i=1}^{n} P(D \mid \text{tree}_i)P(\text{tree}_i)} = \frac{P(D \mid \text{tree}_j)}{\sum_{i=1}^{n} P(D \mid \text{tree}_i)} \propto P(D \mid \text{tree}_j).$$

The MCMC method begins with a trial tree T1 and computes its likelihood, L1. A move is then made on this tree that changes it by a small amount in any of the following:

1. Branch length;
2. Rate of substitution;
3. Topology, by a nearest-neighbor interchange tree move.

The likelihood L2 of the new tree T2 is computed, which is usually slightly different from L1.

If L2 > L1, then T2 is accepted, and it becomes an element in the sample.
If L2 < L1, then T2 is accepted with probability L2 / L1; otherwise the chain stays at T1.

This rule of selection is called the Metropolis algorithm.

Therefore the MCMC method favors hill-climbing moves, but also allows downhill moves with a certain probability.

The result is that the equilibrium probabilities of observing the different trees in the sample are proportional to the likelihoods of the trees.

To see this, suppose that we have only two trees, so MCMC moves back and forth between them with transition probabilities r12 and r21.

[Figure: two trees T1 and T2, with transition probability r12 from T1 to T2 and r21 from T2 to T1.]

Let p1 and p2 be the equilibrium probabilities of these trees in the sample. At equilibrium, the probabilities of observing these trees during the sampling process should be constant, so the flow from T1 to T2 must equal the flow from T2 to T1:

$$p_1 r_{12} = p_2 r_{21}, \quad \text{or} \quad \frac{r_{21}}{r_{12}} = \frac{p_1}{p_2}.$$

This property is called detailed balance. For trees to appear in the sample in proportion to their likelihoods, we need

$$\frac{p_1}{p_2} = \frac{L_1}{L_2}.$$

Therefore, we have

$$\frac{r_{21}}{r_{12}} = \frac{L_1}{L_2}.$$

This means that to generate the desired sample, we should set the ratio of the transition probabilities equal to the ratio of the likelihoods.

The MCMC algorithm does just this, because:

if L2 > L1, we set r12 = 1 and r21 = L1/L2; therefore, r21/r12 = L1/L2;

if L2 < L1, we set r12 = L2/L1 and r21 = 1; therefore, r21/r12 = L1/L2.
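The two-tree argument can be checked empirically with a toy Metropolis sampler. The sketch below is only an illustration under simplifying assumptions: the "trees" are indices into a list of fixed likelihood values, and each move proposes one of the other trees uniformly at random. Real programs instead perturb branch lengths, rates, and topology, and work with log-likelihoods. The function name and the numbers are hypothetical.

```python
import random

def metropolis_sample(likelihoods, n_steps, seed=0):
    """Sample tree indices with the Metropolis rule and return the
    frequency of each tree in the sample."""
    rng = random.Random(seed)
    current = 0
    counts = [0] * len(likelihoods)
    for _ in range(n_steps):
        # Symmetric proposal: pick one of the other trees uniformly at random.
        proposal = rng.randrange(len(likelihoods) - 1)
        if proposal >= current:
            proposal += 1
        ratio = likelihoods[proposal] / likelihoods[current]
        # Uphill moves are always accepted; downhill moves with probability L2 / L1.
        if ratio >= 1.0 or rng.random() < ratio:
            current = proposal
        counts[current] += 1   # a rejected move counts the current tree again
    return [c / n_steps for c in counts]

# Two trees with L1 = 1 and L2 = 3: the frequencies should approach 1/4 and 3/4.
freqs = metropolis_sample([1.0, 3.0], n_steps=200_000)
```

Counting the current tree again after a rejected move is what makes the sample frequencies converge to the ratio of likelihoods; dropping rejected steps would bias the sample.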

The top four trees for the Platyrrhini group by MCMC

To compute the likelihoods, the HKY substitution model was used, with six Γ-distribution rate categories and invariant sites.

Most parts of the tree are well defined, except for the following groups.

[Figure: the top four trees. Annotations: "The same as in the tree constructed by NJ and parsimony methods"; "The position of the capuchin varies".]

The top seven trees for principal groups by MCMC

[Figure: the top seven trees. Annotations: "The same as by NJ and parsimony"; "The position of the capuchin varies".]

The uncertainty in these trees indicates that more sequences are needed to resolve the problem.

Popular phylogenetic tree construction programs

PHYLIP

• Developed by Joseph Felsenstein;
• Implements most known distance methods such as UPGMA and NJ, maximum parsimony and ML methods;
• The most recent release is version 3.69, which contains more than 50 programs;
• Command line interface;
• The package can be freely downloaded at http://evolution.genetics.washington.edu/phylip.html

PAUP (Phylogenetic Analysis Using Parsimony)

• Written by David Swofford;
• Includes parsimony, distance matrix, invariants, and maximum likelihood methods and many indices and statistical tests;
• Described at http://paup.csit.fsu.edu/
• Unfortunately, it is now commercialized by Sinauer Associates, selling for $85-150/package.

MEGA (Molecular Evolutionary Genetics Analysis)

• Developed by Sudhir Kumar and colleagues;
• Contains parsimony, distance and likelihood methods for molecular data (nucleic acid sequences and protein sequences);
• Can do bootstrapping, consensus trees, and a variety of data editing tasks;
• Has a sequence alignment function using an implementation of ClustalW;
• A GUI-based program;
• Contains tree display functions.

TREE-PUZZLE

• Written by Korbinian Strimmer;
• A program for maximum likelihood analysis of nucleotide and amino acid alignments;
• Infers phylogenies by quartet puzzling;
• Supports all popular models of sequence evolution of nucleotides and proteins, and can take rate heterogeneity among sites into account;
• Compatible with PHYLIP files;
• The current version also has features for parallel computation using the MPI message-passing interface if this is available;
• Freely available at http://www.tree-puzzle.de/.

MrBayes

• A program for the Bayesian estimation of phylogenetic trees;
• Able to analyze nucleotide, amino acid, restriction site, and morphological data;
• Freely available at http://mrbayes.csit.fsu.edu/

Tree View

• A program for visualizing and printing trees;
• Free at http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
