17
CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction Date-September-26,2004 Lecturer: Wing-Kin Sung Scribe: Cheng Weiwei, Yang Liang 7.1 Introduction Deoxyribonucleic Acid or DNA is commonly known to be responsible for en- coding the information of life. Through sexual reproduction, DNA is passed on as hereditary material to offspring. However, sometimes ’mistakes’ might occur which are passed on to the offspring. These mistakes cause the DNA, instead of being identical to the parent DNA, to change a little (one or more of the bases in the strand might have been changed). Such ’mistakes’ are known as mutations. Through many generations of reproductions, with mutations going on between every generation, bringing about little changes each time in the offspring, differ- ent species emerge(or evolve). This is where phylogenetic comes in. Phylogenetic is the study of the genetic relationship among different species. 7.1.1 Mitochondrial DNA and Inheritance The evolution of modern man is one of the interesting topics in phylogenetics. Be- fore going further into details of this topic, a particular aspect which has proven to be significant in this study will be discussed - mitochondrial DNA (mtDNA). The mtDNA is not located in the cell nucleus, it resides, instead, in an outlying compartment called a mitochondrion. The mitochondria are organelles located outside the nucleus in the cytoplasm of the cell. These organelles are responsible for energy transfer and are basically the ’powerhouses’ of the cell. There are about 1,700 mitochondria in every human cell; each includes an identical circu- lar double-stranded mtDNA consists of 16500 base pairs. The mtDNA contains mere 37 genes compared with the 50,000 to 100,000 genes in nuclear DNA. And these few mtDNA genes are devoted largely to the mitochondria’s principal job of producing energy for the chemical reactions in a cell. The pointwise mutation substitution rates of mtDNA are roughly 10 times faster than nuclear DNA. Since human is one of the newest species, fast mutation rates 7-1

CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1

Lecture 7: Phylogenetic Trees ReconstructionDate-September-26,2004

Lecturer: Wing-Kin Sung Scribe: Cheng Weiwei, Yang Liang

7.1 Introduction

Deoxyribonucleic Acid or DNA is commonly known to be responsible for en-coding the information of life. Through sexual reproduction, DNA is passed onas hereditary material to offspring. However, sometimes ’mistakes’ might occurwhich are passed on to the offspring. These mistakes cause the DNA, instead ofbeing identical to the parent DNA, to change a little (one or more of the bases inthe strand might have been changed). Such ’mistakes’ are known as mutations.Through many generations of reproductions, with mutations going on betweenevery generation, bringing about little changes each time in the offspring, differ-ent species emerge(or evolve). This is where phylogenetic comes in. Phylogeneticis the study of the genetic relationship among different species.

7.1.1 Mitochondrial DNA and Inheritance

The evolution of modern man is one of the interesting topics in phylogenetics. Be-fore going further into details of this topic, a particular aspect which has provento be significant in this study will be discussed - mitochondrial DNA (mtDNA).

The mtDNA is not located in the cell nucleus, it resides, instead, in an outlyingcompartment called a mitochondrion. The mitochondria are organelles locatedoutside the nucleus in the cytoplasm of the cell. These organelles are responsiblefor energy transfer and are basically the ’powerhouses’ of the cell. There areabout 1,700 mitochondria in every human cell; each includes an identical circu-lar double-stranded mtDNA consists of 16500 base pairs. The mtDNA containsmere 37 genes compared with the 50,000 to 100,000 genes in nuclear DNA. Andthese few mtDNA genes are devoted largely to the mitochondria’s principal jobof producing energy for the chemical reactions in a cell.

The pointwise mutation substitution rates of mtDNA are roughly 10 times fasterthan nuclear DNA. Since human is one of the newest species, fast mutation rates

7-1

Page 2: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-2

should have a better reflection of its evolution.

Everyone inherits the mtDNA from his/her mother, with the reason as follows:Whenever an egg cell is fertilized, nuclear chromosomes from a sperm cell enterthe egg and combine with the egg’s nuclear DNA, producing a mixture of bothparents’ genetic code. The mtDNA from the sperm cell, however, is left behind,outside of the egg cell. So the fertilized egg contains a mixture of the fatherand mother’s nuclear DNA and an exact copy of the mother’s mtDNA but noneof the father’s mtDNA. This result shows that mtDNA is maternally inherited;every individual in the same maternal lineage should have the identical mtDNAsequence. This means that the mtDNA in the cells of a person’s body are copiesof his or her mother’s mtDNA, and all the mother’s mtDNA is a copy of hermother’s, and so on. Therefore, all human beings inherit the mtDNA from thecommon mother of human (Eve).

7.1.2 Finding the Origin of human

Even though everyone in the Earth today has inherited his or her mtDNA fromone person who lived long ago, our mtDNA is not exactly alike. Random muta-tions have altered the genetic code over the millennia. But these mutations areorganized, in a way. For example, let’s say that 10,000 years after the most recentcommon ancestor, one of the mtDNA branches experienced a mutation. Anotherbranch might experience a mutation in a different location. This alteration wouldalso be passed on.

By looking at the similarities and differences of the mtDNAs of the nowadays in-dividuals, researchers try to reconstruct where the branching took place. From anoriginal 1987 Nature article, the three authors (Rebecca Cann, Mark Stoneking,and Allan Wilson) looked at the mtDNA of 147 people from continents aroundthe world (though for Africans, they relied on African Americans). Later, withthe help of a computer program, they put together a sort of family tree, groupingthose with the most similar DNA together, then grouping the groups, and thengrouping the groups of groups. The final tree showed that one of the two primarybranches consisted only of African mtDNA and that the other branch consistedof mtDNA from all over the world, including Africa. From this, they inferredthat the most recent common mtDNA ancestor might be an African woman.This study was done under the assumption of a constant molecular clock andsuch phylogenetic tree implies that the common ancestor of modern man appearsroughly 200,000 years ago.

The above theory is based on the idea that rates of change in genes and proteinsare constant over time. It claims that in any given DNA sequence, mutations ac-cumulate at an approximately constant rate as long as the DNA sequence retains

Page 3: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-3

its original functions. The difference between the sequences of a DNA segment(or protein) in two species would then be proportional to the time since thespecies diverged from a common ancestor (coalescence time). This time may bemeasured in arbitrary units and then it can be calibrated in millions of years forany given gene if the fossil record of that species happens to be rich.

Moreover, such a theory does not hold in general. Molecular clocks do not behavemetronomically, i.e., neutral mutations do not yield literally constant rates ofmolecular change but are expected to yield constant average rates of change overa long period of time (known as ’stochastically constant’ rates). In general, rRNAevolves slowly whereas mtDNA rapidly. Fast rate of mutation has the drawback ofrandomizing the sequences. It is also possible that reverse mutations will occurand the comparisons will be very difficult. These facts together with furtherresearch show that Wilson’s clock is not accurate and the analysis only supportsthe common ancestor of modern man appears from 100,000 to 1 million yearsago.

7.1.3 Phylogeny

The tree built in Wilson’s project is known as phylogeny. Phylogeny is definedas the reconstruction of the evolutionary history of a set of species. Usually, it isa leaf-labeled tree where the internal nodes refer the hypothetical ancestors andthe leaves are labeled by the species. Two species that look alike will be repre-sented as neighboring external branches and will be joined to a common parentbranch. The objective of phylogenetic analysis is to analyze all the branchingrelationships in a tree and their respective branch lengths. As an example, wecan look at the phylogeny of lizards as illustrated in Figure 7.1.

C. tigris D. dorsalis C.draconoides U. scoparia P. platyrhinos

Figure 7.1: Phylogeny of lizards

Page 4: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-4

A phylogenetic tree, also known as a cladogram or a dendrogram, is a tree de-scribing several life forms and their relations. As shown in Figure 7.1, the edgesof the tree represent the evolutionary relationships. Time is the vertical dimen-sion with the current time at the bottom and earlier times above it. There arefive extant species (species currently living) with their respective names stated.The lines above the extant species represent the same species in the past. Thepoint that two lines converge to should be interpreted as the point when the twospecies diverges from a common ancestral species, and this point is the commonancestral species. And so it goes until eventually, some time in the past, all thespecies derived from just one species, the one displayed as the top point.

7.1.4 Rooted and Unrooted Trees

Most methods for the inference of phylogeny yield trees that are unrooted, i.e.the common ancestor is unknown. Thus from a tree itself, it is impossible totell which species is branched off before the others. Estimating the root duringthe tree reconstruction is scientifically difficult with pure computing technology,as one can only compare similarity of the sequences but cannot identify whichsequence occurs first.

Rooted trees are reconstructed by systematic biologists, and the technique gener-ally employed is to use an outgroup. An outgroup is a species which is clearly lessrelated to all other species in the phylogeny. For example, if we are interested inthe phylogenetic relationship of primates, we might use some other mammal thatis clearly not a primate, say a mouse, as the outgroup. Once the best unrootedtree containing the outgroup is constructed, the unrooted tree can be ”rooted”on the edge separating the outgroup from the rest of the species.

The problem with using outgroups is that what appears to be an outgroup maynot be an outgroup (it is difficult to decide which species is an outgroup). Evenwhen the species is definitely an outgroup, it may be difficult to locate the edge towhich it should attach because it might be very different from the ”ingroup”. Forexample, a bacterium is clearly an outgroup member with respect to primates,but they are too different to even easily determine homologous characters (theshared attributes due to common ancestry).

There is also a not-so-commonly used rooting method called mid-point rooting. Inmid-point rooting the longest edge-weighted path between two terminal speciesis found and the tree is rooted at the mid-point of this path. The rationale of thismethod comes from making a molecular-clock assumption; however, it is oftenfound to be violated.

Furthermore, seeking rooted versions of trees just increases the probability of

Page 5: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-5

error, since it adds another aspect of the tree which can be incorrectly analyzed.Examples of a rooted and an unrooted tree are illustrated in Figure 7.2.

Sequence A

Sequence B

Sequence C

Sequence DRooted tree

Sequence A

Sequence B

Sequence C

Sequence D

Unrooted tree

Figure 7.2: Structure of evolutionary

7.1.5 Phylogenetic Tree Reconstruction

One of the main problems of Phylogenetic analysis is phylogenetic tree recon-struction. For any collection of data, there will be some ancestral relationshipamong the sampled sequences. Phylogenetic tree reconstruction is the process ofinferring these ancestral relationships and presenting them using a tree structure.

Two types of input are used to reconstruct the phylogenetic tree, one is characterbased(Maximum Parsimony and maximum likelihood) and the other is distancebased (Unweighted Pair Group with Arithmetic Mean- UPGMA, TransformedDistance Methods and Neighbor Relation Methods).

In this lecture, only the character based method will be explained. For thismethod, each species is described by a set of characters (i.e. traits). A charactercan be a base in a specific position in its DNA sequence such as the number of

Page 6: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-6

eyes of the species. Those characters are taken as input, the output will be a treewhich best explains the input.

Figure 7.3 demonstrates the character based method. The left figure is a S*Cmatrix, of which each the ijth entry is the state of the ith species for the jth

character, while the right figure shows the probable best topology to explain thematrix of the left figure.

Figure 7.3: character based method

Parsimony and maximum likelihood are two of the most popular character basedmethods used to evaluate trees. Parsimony is a NP-hard problem, its solutionsare generally obtained using heuristics (mostly hill-climbing), although exact al-gorithms based upon branch-and-bound approaches are used for small enoughdata sets (generally speaking of up to about 17 species). Solutions for the ”max-imum likelihood” tree are obtained through similar hill-climbing searches. Butsince the ”maximum likelihood” method evaluates each fixed leaf-labeled topol-ogy, it is more computationally expensive. Consequently, maximum likelihoodhas not been used as frequently as parsimony for tree reconstruction. Details ofthose two methods will be given in section 7.2 and 7.3 respectively.

Statisticians have addressed the performance of phylogenetic reconstruction (orestimation) methods with respect to the accuracy of the unrooted leaf-labeledtree obtained by the method. Methods which will recover the true tree (i.e. leaf-labeled tree) with arbitrarily high probability, given long enough sequences, aresaid to be statisticallyconsistent for that tree.

7.1.6 Applications of Phylogeny

Aside from the more obvious applications of understanding the history of lifeand analyzing rapidly mutating viruses such as HIV, phylogeny has several other

Page 7: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-7

uses: phylogeny helps in the prediction of the structure of proteins and RNAs,helps to explain and predict gene expression, helps to predict ligand structure fordrug design and helps to design enhanced organisms such as rice and wheat.

Another example is multiple sequence alignment. Most multiple sequence align-ment programs used in practice rely on a phylogenetic tree in order to speedup the computation, and the use of phylogenetic analysis in sequence alignmentitself is one significant application.

When the sequences of two nucleic acid or protein molecules found in two differentorganisms are similar, they are likely to have been derived from a common ances-tor sequence. Sequence alignments reveal conserved and non-conserved regionsbetween evolutionarily related sequences. When two sequences have a closed evo-lutionary relationship, they are known as homologous. Therefore, one can saythat sequence homology is a function of similarity. However, computing similar-ity requires the alignment of the sequences.

The fact that a gap of any length can occur within a sequence introduces theproblem of judging how many individual changes have occurred within the or-ganism and in what order. These gaps can thus be used as phylogenetic markers.Knowledge of changes within an organism through mutation over time gives usknowledge of regions within a sequence that are relatively conserved. Theseconserved regions, which may be present in a particular protein, might be thefunctional region of this protein. Focusing on the conserved regions of sequencesobtained from different levels in the phylogenetic tree, one might thus be ableto predict the structure through homology modeling with another known pro-tein with a similar function. Structural similarity will then aid in the rationaldesign of ligands as drug compounds which might prove to have a high bindingaffinity with these structurally conserved regions, thus helping to produce moreefficacious and specific drug compounds.

7.2 Parsimony

7.2.1 What is Parsimony

The principle of parsimony is defined as ”a scientific rule which states that ifthere exists two answers to a problem or a question, and if, for one answer tobe true, well-established laws of logic and science must be re-written, ignored, orsuspended in order to allow it to be true, and for the other answer to be trueno such accommodation need to be made, then the simpler of the two answersis much more likely to be correct.” Putting it in a simpler way, parsimony is aprinciple that states that the simplest explanation that explains the greatest num-

Page 8: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-8

ber of observations is preferred to more complex explanations.

Parsimony is one of the most popular character based method for the phylogenyreconstruction problem in the systematic biology literature. Generally the muta-tion is rare, and should be minimized during reconstruction.Thus the idea behindparsimony is to build the phylogeny with the fewest point mutations.

Formal Definition:

• Let S be a set of (DNA or Protein) sequences.

• Denote H(x,y) be the hamming distance between two sequences x and y.(Hamming distance is the number of characters that are different betweensequences x and y.)

• The most parsimonious tree is a tree T leaf-labeled by S and each inter-nal node is assigned a sequence such that H(T ) = Σ(x,y)∈E(T )H(x, y) isminimized. Note that H(T) is called the parsimony length of T.

Figure 7.4: Examples of parsimonious tree

The input matrix in Figure 7.3 has 4 species, each of which is represented bya sequence of 4 characters. Figure 7.4 shows two most parsimonious trees of thisinput. Both trees in Figure 7.4 have the parsimony length 3 which is the min-imum number of mutations of the input matrix. In this case, the parsimonioustree is not unique.

Note: Although there is no rule that requires nature to follow the simplestpath, and results can vary based on which pieces of evidence are used, Parsimonyis nonetheless a strong basis for the scientific work of paleontologists.

Page 9: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-9

7.2.2 Computational Problems

The computational problems related to the parsimony approach is NP-hard.Here, we categorized them in two groups:

• Small Parsimony problem: To find the parsimony length of a given treetopology.

• Large Parsimony problem: To find the most parsimonious tree.

7.2.2.1 Small Parsimony Problem

Input: Given a set S of sequences and the topology of a rooted phylogenetic treeT with leave labeled by SGoal: Find the parsimony length of T.Problem statement:

1. What is the minimum number of changes for this topology?

2. What is the optimal labeling of the internal nodes?

This problem can be solved using Fitch’s algorithm [F71] in polynomial time.As characters are mutually independent, we can handle an input with sequencesof mcharacters by considering each character separately. The simple case is justconsider each sequence with only one character, to obtain the optimal score andthe optimal labelling for the entire data, simply run the algorithm once for eachcharacter.

Each sequence has one character(Simple case)Input: a leaf-labelled tree T where each leaf v is labelled by a single charactervc.Output: a fully-labelled tree which is also the most parsimonious tree of T .

Algorithm:The algorithm first puts all characters at the leaves of the tree and then it triesto label every internal node of tree, such that every node has label equal to oneof its children.Formally the algorithm is as follows.

1. For every leaf v, let Sv = {vc}.2. For every internal node v with children u, w,

let Sv = Su ∩ Sw if Su ∩ Sw 6= Φ ; and Su ∪ Sw otherwise

3. For every node v in preorder,

Page 10: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-10

• Let u be its parent. If uc ∈ Sv set vc ← uc; otherwise, assign anycharacter in Sv to vc.

In Steps 1&2, the algorithm tries to compute Sv for each node, it has to traversethe tree T in postorder - starting with the leaves and walking its way down tothe root.

For the last step, given the set Sv, to determine the value vc for each internalnodes v, this time, the tree was traversed in pre-order, i.e. start from the root up.

The result of this algorithm is a fully-labelled tree. The number of mutations inthis tree is equal to the number of times Su ∩ Sw was empty, in step 2.

A C G C G A C G C G

{AC}*

{ACG}* {CG}*

C*

G* G*

{CG} G

- Each asterisk(*) requires a change in one of the edges to its children

Figure 7.5: Small parsimony problem(simple case)

Figure 7.5 shows an example for the implementation of the above algorithm. Letk be the size of alphabet which is 4 for DNA and 20 for protein. For each internalnode of the tree, one have to consider k cases, thus the time complexity of theabove algorithm should be O(nk).

Each sequence has m characters:To solve the problem with each sequence having m characters, first note that theith character and the jth character are independent for any i and j. The problemof having m characters in a sequence can, thus, be solved using m instance of thesimple case problem. Therefore, the time complexity would be O(mnk).

Page 11: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-11

7.2.2.2 Large Parsimony Problem

Input: A set S of sequences.Output: The most parsimonious tree.Problem statement:

1. What is the optimal phylogeny for these species, i.e., the one minimizingthe parsimony length?

The large parsimony problem is a NP-hard problem, i.e. the problem most proba-bly cannot be solved in a polynomial time. However, it can be 2-approximated inpolynomial time(which means that the constructed tree has parsimony length atmost 2 time worse than that of the most parsimonious tree), using the followingalgorithm.

Approximation algorithm:

• Given a set S of sequences, G(S) is defined to be a weighted complete graphwhose note set is bijectively labeled by the species in S and each edge (i,j) has weight H(i, j), where H(i, j) is the hamming distance between theith and jth sequences.

• return the minimum spanning tree, T , of G(S).

Theorem 7.1 Let T be a minimum spanning tree of G(S). Then, the parsimonylength of T is at most twice that of the most parsimonious tree.

Proof:

• Let T ∗ be the most parsimonious tree.

• Let C be an Euler cycle of T ∗. (Note that in the Euler cycle C of T ∗, everyedge in T ∗ appears twice.)

• Create from C a smaller tour, C∗, which contains only the nodes of G(S)ordered in the way in which they appear in C.

• Let the weight of the tour C be denoted by w(C), and define it to be thesum of the weights of the edges in C. Similarly define w(C∗). it is easy tosee the following:

w(C) = 2w(T ∗)

since C is Euler cycle, and thatw(C∗) ≤ w(C)

since Hamming distance satisfy the triangle inequality.

• If we delete any single edge from C∗ we create a path P such thatw(P ) ≤ w(C∗)

Page 12: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-12

• Consider T, a minimum spanning tree for the graph G(S). Since P is alsoa spanning tree, it follows that

w(T ) ≤ w(P ) ≤ w(C∗) ≤ w(C)= 2 w(T ∗)

The main contribution of this result is an upper bound on the parsimonylength of the most parsimonious tree for a given input.

An example demonstrating this algorithm is shown in Figure 7.6. S is the inputmatrix as shown in Figure 7.3. G(S) is a 4-complete weighted graph constructedfrom S. The minimum spanning tree of G(S) is denoted in red edges. T is thecorresponding parsimonious tree constructed from that minimum spanning tree.

Figure 7.6: large Parsimony

Maximum parsimony is not guaranteed to be statistically consistent, i.e. givenmore characters, maximum parsimony may not be able to recover the true treewith arbitrarily high probability. In other words, it may not be able to convergeto the same tree when more characters are provided.

7.3 Maximum Likelihood Based Model

This section studies a maximum likelihood based model of phylogenetic trees re-constructions.

7.3.1 What is a Maximum Likelihood Model

The use of maximum likelihood estimators in evolutionary trees was pioneered byJ.Felsenstein, A.Edwards, and E.Thompson. The general idea behind maximum

Page 13: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-13

likelihood estimators is the observation that

Pr(Model|Data) = Pr(ModelandData)/Pr(Data)= Pr(Data|Model)Pr(Model)/Pr(Data)

In this formulation P (Model|Data) is proportional to P (Data|Model); therefore,given a set of data D, maximum likelihood tries to find a model M such thatP (D|M) is maximized. A maximum likelihood model consists of two components.They are

• A rooted tree which models the evolution relationship;

• Every edge is associated with a stochastic model of evolution.

The stochastic model of evolution assumes that the sequence characters are inde-pendently and identically distributed (iid.). It also assumes that the trees havethe Markov property, i.e. the evolution occurs at one subtree is independent tothe other parts of the trees.

Examples of models include Cavender-Felsenstein model (also called Cavender-Farris model)[CF87], and Jukes-Cantor model [JC69]. In this lecture, we onlystudy the former one in detail.

7.3.2 Cavender-Felsenstein Model

We now introduce the Canveder-Felsenstein Model [CF87] in details. This isthe simplest possible Markov model of evolution, and under this model (as wellas under more general iid. models), we can establish bounds on the convergencerates of different methods.

7.3.2.1 Definition of the Model

The model assumes that each character has two states, i.e {0, 1} Cavender-Felsenstein Model consists of the following.

• The topology of a rooted evolutionary tree T , the root is a fixed leaf andthe distribution at the root is uniform;

• A stochastic model of evolution for every edge e, that is, for every characteri, a mutation probability pi(e) for each edge e in T .

For this CF model, we have the following assumptions:

• ∀i, ∀e = (u, v) ∈ T , we have 0 < pi(e) < 0.5, andu = 0 u = 1

v = 0 Pr(u = 0|v = 0) = 1− pi(e) Pr(u = 1|v = 0) = pi(e)v = 1 Pr(u = 0|v = 1) = pi(e) Pr(u = 1|v = 1) = 1− pi(e)

Page 14: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-14

• Pr(u|v) = Pr(v|u);

• For the root r, Pr(r = 0) = Pr(r = 1) = 0.5.

Thus given a data set Di for character i of all species, we may use (T, pi) to rep-resent a model, where T is the topology of the evolution tree, and pi : E(T ) →(0, 0.5) is the mutation probability for each edge in T , where E(T ) is the edgeset of T .

We denote Pr(Di|T, pi) as the probability of the data is Di given that the modelis (T, pi). Now we present an example for calculating Pr(Di|T, pi).

Tu

b

r

a

c

Figure 7.7: Cavender-Felsentein Model

Consider three species a, b, and c. Assume the tree topology T of the model isas shown in Figure 7.7. Suppose the data Di says: ai = 1, bi = 1, ci = 0; Weobtain the probability Pr(Di|T, pi) of the given data Di and model (T, pi),

k=0,1;j=0,1

Pr(ri = k)Pr(ai = 1|ri = k)Pr(ui = j|ri = k)Pr(bi = 1|ui = j)Pr(ci = 0|ui = j).

In the example, each edge will maintain a probability table. If all the tree nodesin T are in fixed states, the mutation probability can be obtained easily by theproperties of Markov Model. However, there are two internal nodes r and u, weare not sure about their states, so the mutation probability calculated by summa-rizing the probability calculated for all the possible states of those internal nodes.

For the case that the data is D = D1 ∪ ... ∪ Dn, given the model (T, p1, ..., pn).The probability is obtained as follows.

Page 15: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-15

Theorem 7.2 Given the data D = ∪1≤i≤nDi and the model (T, {pi|1 ≤ i ≤ n}),where T is the topology of the evolution and pi is the mutation probability forcharacter i, we obtain that

Pr(D|T, p1, .., pn) =n∏

i=1

Pr(Di|T, pi). (7.1)

7.3.2.2 Computational Problems

There are two computational problems for Cavender-Felsenstein model.

• Likelihood of a modelGive the model M , for any data D, try to compute Pr(D|M);

• Find model with maximum likelihoodGive data D, try to find a model M which maximizes Pr(D|M).

We now discuss the algorithms of computing these two problems as following.

7.3.2.3 Likelihood of a model

We may describe the computation problem of the likelihood of a model as

• Input

Data D: m species where each species is characterized by n character;

Model M : M = (T, p1, ..., pn);

• Aim

Compute Pr(D|M).

It is obvious that we can compute Pr(D|M) using (7.1), but it will take expo-nential time. However, we can define the likelihood recursively and compute thevalue using dynamic programming as following.

• Definition: For a particular character i, let Li(v, s) be the likelihood ofthe subtree rooted at v, given that character i has state s.

• Basis: for all leaf v, and state s,

Li(v, s) =

{1 if vi = s0 otherwise.

(7.2)

• Recurrence: Traverse the tree in postorder, for every internal node v withchildren, says, u and w,

Li(v, s) =

y=0,1

Li(u, y)Pr(ui = y|vi = s)

x=0,1

Li(w, x)Pr(wi = x|vi = s)

(7.3)

Page 16: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-16

• For the Root: Finally, for the root, we have

L =m∏

i=1

s=0,1

(1

2Li(root, s)

) (7.4)

Now we analyze the time complexity of the dynamic programming algorithm.

• For every node v and every state s, Li(v, s) can be computed in O(1) timeaccording to (7.3);

• Since there are n nodes and m characters, all Li(v, s) can be computed inO(mn) time;

• For L, it can be computed in O(m) time;

• In total, Likelihood of a tree can be computed in O(mn) time.

7.3.2.4 Find model with maximum likelihood

We describe the computation problem of the likelihood of a model as follows

• Input

Data D: m species where each species is characterized by n character;

• Aim

Find a model M = (T, p1, ..., pn) which maximizes Pr(D|M)

This is a difficult problem, although we still don’t know whether it is NP-hardor not. A practical solution is to use heuristic method to get close to optima (like[PAUP]).

7.3.3 Final remark for Maximum Likelihood

For the Cavender-Felsenstein model, maximum likelihood is statistically consis-tent,that is, the maximum likelihood method is able to recover the true tree witharbitrarily high probability when more characters are available.

Page 17: CS5238 Combinatorial methods in bioinformatics …ksung/cs5238/2004Sem1/note/...CS5238 Combinatorial methods in bioinformatics 2004/2005 Semester 1 Lecture 7: Phylogenetic Trees Reconstruction

Lecture 7: Phylogenetic Trees Reconstruction 7-17

References

[CF87] J.A. Cavender, and J. Felsenstein, “Invariants of phylogenies:Simple case with discrete states”, Journal of Classification, 4, 57-71.1987.

[F71] W. M. Fitch, “Toward defining the course of evolution: minimumchange for a specified tree topology”,Systematic Zoology, 20:406-416,1971.

[JC69] T. H. Jukes, and C. Cantor, “Mammalian Protein Metabolism”,chapter Evolution of protein molecules, pages 21-132, Academic Press,New York, 1969.

[JW99] K. Junhyong Kim, and T. Warnow, “Tutorial on PhylogeneticTree Estimation Tutorial for ISMB 99”, 1999.

[PAUP] http://newfish.mbl.edu/Course/Software/PAUP/, Marine BiologicalLaboratory