
Algorithms and Complexity in Phylogenetics · 2008-08-21


Algorithms and Complexity in Phylogenetics

Darwin 1837 Notebook B: Transmutation of species www.darwin-online.org.uk

Understanding and determining the evolutionary relationships between species has been a central goal of biology since Darwin first developed the Theory of Evolution.

This page from his 1837 notebook shows one of the first pictures of a ‘phylogenetic tree’.

The text reads:

“I think

<<diagram>>

Thus between A & B immense gap of relation. C & B the finest gradation, B & D rather greater distinction. Thus genera would be formed. -”

<sideways>“Case must be that one generation then should be as many living as now. To do this and to have many species in same genus (as is) requires extinction.”

Origins

-------GGC-CGTGTTCTTGCGGTGGCTG--AAGATGGTCC
       ||  |||  |||| | ||||  |   |||||||||
CGCATGCGGAACGT--TCTTTCCGTGG--GAACAGATGGTCC
|  |||||| |||   |||| | ||||  ||| |||||||||
C--ATGCGGCACG---TCTTGCGGTGG--GAAAAGATGGTCC

Finding good trees

Comparing trees

Using the information

DNA and protein sequencing has provided an immense amount of data to work with. The algorithmic challenge is to efficiently turn this data into useful information. In Phylogenetics we are trying to find an evolutionary tree that best fits the data.

One technique is to compute a pairwise distance matrix for the set of species being examined, based on the number of discrepancies between their sequences. This matrix can be used to measure how well a proposed tree fits the data.
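As a rough illustration of this first step, the sketch below counts pairwise discrepancies between aligned sequences. It assumes the sequences are already aligned to equal length and skips any column in which either sequence has a gap; the function names, the gap handling and the toy sequences are illustrative assumptions rather than the method used in [1].

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol, as in the alignment shown on the poster


def discrepancies(a, b):
    """Count columns where two aligned sequences differ, ignoring gap columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y and GAP not in (x, y))


def distance_matrix(seqs):
    """Return a symmetric nested dict dist[name1][name2] of discrepancy counts."""
    names = sorted(seqs)
    dist = {n: {m: 0 for m in names} for n in names}
    for n, m in combinations(names, 2):
        dist[n][m] = dist[m][n] = discrepancies(seqs[n], seqs[m])
    return dist


# Hypothetical toy alignment, not data from the poster.
toy = {"x": "ACGT-A", "y": "ACCTTA", "z": "A-GTTA"}
matrix = distance_matrix(toy)
```

In practice raw counts are usually normalized or corrected for multiple substitutions before a tree is fitted; that step is left out here.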

A ‘hill climbing’ algorithm then adjusts the tree to continually improve the fit. But when does this process lead to the best possible tree?

By analyzing the mathematics of the fitness measure and the combinatorics of the trees we can prove bounds on the data accuracy required to lead to the correct tree [1].
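The hill climbing idea can be sketched on the smallest interesting case, a tree on four species, where there are only three possible unrooted topologies. The score used here is the four-point sum d(a,b) + d(c,d) for the split ab|cd, chosen only for brevity; it is an assumption of this sketch and not the balanced minimum evolution criterion analysed in [1]. All function names are hypothetical.

```python
def quartet_score(split, dist):
    """Four-point sum for the split ((a, b), (c, d)); lower means a better fit."""
    (a, b), (c, d) = split
    return dist[a][b] + dist[c][d]


def neighbours(split):
    """The two other quartet topologies, each one topological move away."""
    (a, b), (c, d) = split
    return [((a, c), (b, d)), ((a, d), (b, c))]


def hill_climb(start, dist):
    """Move to the best neighbouring topology until no neighbour improves the score."""
    current = start
    while True:
        best = min(neighbours(current), key=lambda s: quartet_score(s, dist))
        if quartet_score(best, dist) >= quartet_score(current, dist):
            return current
        current = best


# Hypothetical distances generated by the tree ab|cd
# (pendant edges of length 1, internal edge of length 2).
dist = {
    "a": {"a": 0, "b": 2, "c": 4, "d": 4},
    "b": {"a": 2, "b": 0, "c": 4, "d": 4},
    "c": {"a": 4, "b": 4, "c": 0, "d": 2},
    "d": {"a": 4, "b": 4, "c": 2, "d": 0},
}
result = hill_climb((("a", "c"), ("b", "d")), dist)
```

With four species every topology is a neighbour of every other, so the search cannot get stuck. On larger trees a hill climbing search can in general stop at a local optimum, which is why bounds on the required data accuracy are of interest.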

Analysis using different methods, or different data sets, can lead to contradicting trees on one set of species. We need appropriate methods of comparing trees and measuring the discrepancies. Subtree Prune and Regraft (SPR) distance is a biologically motivated measure of tree discrepancy.

Complexity

Unfortunately, it transpires that determining the SPR distance is NP-complete [2]. This suggests that the problem is computationally intractable: unless P = NP, no algorithm can efficiently determine the distance between an arbitrary pair of trees, regardless of the techniques applied.

Fixed Parameter Tractability

Despite this negative result, further analysis of the combinatorial structures underlying SPR distance has allowed the development of fixed parameter algorithms [2, 3]. Such an algorithm runs efficiently when a given parameter is known to be relatively small, even when the overall input size may be large. In this case we can efficiently determine the SPR distance between trees which are known to differ in a limited number of places.
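In symbols, and as the general definition rather than the specific bound proved in [2, 3], a fixed parameter algorithm has a running time of the form below. The notation is introduced here for illustration: n is the input size, k is the parameter (here the SPR distance), f is any computable function of k alone, and c is a constant that does not depend on k.

```latex
T(n, k) = O\bigl(f(k) \cdot n^{c}\bigr)
```

The exponential cost is confined to f(k), so the running time stays polynomial in n whenever k is small.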

Approximation Algorithm

An alternative approach to dealing with NP-completeness is to try to find approximate solutions. Our understanding of the common structures between trees has enabled us to develop a 3-approximation algorithm [3]. This approach runs quickly on an arbitrary pair of input trees, and guarantees that the result is close to the correct answer. Our 3-approximation gives the best guarantee currently known.
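Stated as an inequality, with notation introduced here for illustration: if d_SPR(T_1, T_2) is the true distance and A(T_1, T_2) is the value returned by a 3-approximation algorithm for this minimization problem, then the returned value is never below the true distance and never more than three times it.

```latex
d_{\mathrm{SPR}}(T_1, T_2) \;\le\; A(T_1, T_2) \;\le\; 3\, d_{\mathrm{SPR}}(T_1, T_2)
```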

Beyond establishing the ancestral relationships between species, phylogenetic trees also enable us to learn more about the nature of evolution and make quantitative evaluations of biodiversity.

Hybridization

A reticulation event, e.g. a hybridization or a horizontal gene transfer (HGT), causes the DNA sequences from transferred genes to have a different evolutionary history (i.e. phylogenetic tree) to the rest of the genome (see right). Conversely, comparison of the trees associated with different genes can reveal how much hybridization has occurred in the past, and between which ancestral species.

As with SPR distance, it is NP-complete to determine the minimum amount of hybridization required to explain the differences between two trees [4]. Again, it is possible to develop fixed parameter algorithms [5], and these have been implemented and tested on a database of grasses [6].

Nature Reserve Selection

There is great public and political interest in conserving the world’s ‘biodiversity’. Although the intuitive idea of biodiversity is easy to grasp, it can be difficult to formalize and quantify the biodiversity of a region, reserve or given set of species. Furthermore, it is not clear how to best select a set of species or regions to be the focus of resources and efforts.

Phylogenetic trees can provide a concrete and logical way to quantify biodiversity [7]. This allows us to examine the complexity of the selection problems, and develop algorithms to solve or approximate them in various circumstances [8, 9].

[1] M. Bordewich, O. Gascuel, K. T. Huber, and V. Moulton, Consistency of topological moves based on the balanced minimum evolution principle of phylogenetic inference, Transactions on Computational Biology and Bioinformatics, in press.

[2] M. Bordewich and C. Semple, On the computational complexity of the rooted subtree prune and regraft distance, Annals of Combinatorics 8 (2004), 409–423.

[3] M. Bordewich, C. McCartin, and C. Semple, A 3-approximation algorithm for the subtree distance between phylogenies, Journal of Discrete Algorithms 6 (2008), no. 3, 458–471.

[4] M. Bordewich and C. Semple, Computing the minimum number of hybridisation events for a consistent evolutionary history, Discrete Applied Mathematics 155 (2007), no. 8, 914–928.

[5] M. Bordewich and C. Semple, Computing the hybridisation number of two phylogenetic trees is fixed parameter tractable, Transactions on Computational Biology and Bioinformatics 4 (2007), no. 3, 458–466.

[6] M. Bordewich, S. Linz, K. St. John, and C. Semple, A reduction algorithm for computing the hybridization number of two trees, Evolutionary Bioinformatics 3 (2007), 86–98.

[7] M. Bordewich, A. Rodrigo, and C. Semple, Selecting Taxa to Save or Sequence: Desirable Criteria and a Greedy Solution, preprint (2008).

[8] M. Bordewich, C. Semple, and A. Spillner, Optimizing phylogenetic diversity across two trees, Preprint NI07068-PLG, Isaac Newton Institute, Cambridge, U.K., 2007.

[9] M. Bordewich and C. Semple, Nature reserve selection problem: A tight approximation algorithm, Transactions on Computational Biology and Bioinformatics 5 (2008), no. 2, 275–280.

The SPR operation

[Figure panels: Subtree, Prune, Regraft]

Given two phylogenetic trees, one can be transformed into the other using SPR operations. The minimum number of SPR operations required is known as the SPR distance between the trees.
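A single SPR operation on a rooted binary tree can be sketched as follows. The representation (leaves as strings, internal nodes as pairs) and all function names are assumptions made for this illustration. The code performs one operation only and does not compute the SPR distance, which is the hard problem discussed above. It assumes that the pruned subtree occurs in the tree and that the regraft target occurs in what remains.

```python
def is_leaf(tree):
    return isinstance(tree, str)


def prune(tree, sub):
    """Remove the subtree `sub` and suppress the resulting degree-two node."""
    if tree == sub:
        raise ValueError("cannot prune the whole tree")
    if is_leaf(tree):
        return tree
    left, right = tree
    if left == sub:
        return right
    if right == sub:
        return left
    return (prune(left, sub), prune(right, sub))


def regraft(tree, sub, target):
    """Attach `sub` on the edge directly above the subtree `target`."""
    if tree == target:
        return (tree, sub)
    if is_leaf(tree):
        return tree
    left, right = tree
    return (regraft(left, sub, target), regraft(right, sub, target))


def spr(tree, sub, target):
    """One subtree prune and regraft operation."""
    return regraft(prune(tree, sub), sub, target)


def canon(tree):
    """Order children consistently so that equal trees compare equal."""
    if is_leaf(tree):
        return tree
    return tuple(sorted((canon(child) for child in tree), key=str))


# Hypothetical example: move leaf "a" next to leaf "d".
before = ((("a", "b"), "c"), "d")
after = spr(before, "a", "d")
```

Because one operation turns `before` into `after` and the two trees differ, the SPR distance between them is exactly 1.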

Diversity measures

Under various models of evolution we can quantify the biodiversity of a set of species using phylogenetic trees.

Intuitively, conserving a panda, lizard, penguin and kiwi (above) will capture more of the diversity represented in the trees than selecting a snake, lizard, penguin and hummingbird (below).
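One common way to make this intuition concrete is phylogenetic diversity: the total length of the tree edges needed to connect the chosen species to the root. The sketch below assumes that measure; the tree shape, the internal node names and every edge length are invented for illustration and are not taken from the poster's figures.

```python
def phylogenetic_diversity(parent, species):
    """Sum the lengths of all edges on root-to-leaf paths for the chosen species.

    `parent` maps each non-root node to a (parent node, edge length) pair,
    so each edge is identified by the node at its lower end.
    """
    covered = set()
    for leaf in species:
        node = leaf
        # Walk towards the root, stopping once we reach an edge already counted.
        while node in parent and node not in covered:
            covered.add(node)
            node = parent[node][0]
    return sum(parent[node][1] for node in covered)


# Hypothetical rooted tree with made-up edge lengths.
parent = {
    "panda": ("root", 5.0),
    "reptiles_birds": ("root", 1.0),
    "squamates": ("reptiles_birds", 2.0),
    "birds": ("reptiles_birds", 2.0),
    "snake": ("squamates", 2.0),
    "lizard": ("squamates", 2.0),
    "penguin": ("birds", 2.0),
    "kiwi": ("birds", 2.0),
    "hummingbird": ("birds", 2.0),
}

with_panda = phylogenetic_diversity(parent, ["panda", "lizard", "penguin", "kiwi"])
without_panda = phylogenetic_diversity(parent, ["snake", "lizard", "penguin", "hummingbird"])
```

On this toy tree the first selection scores higher because the panda sits on a long edge that no other chosen species shares.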

Gene trees vs species trees

An HGT between the frog ancestor and lizard ancestor (above) means that some genes suggest frogs and lizards are more closely related than lizards and snakes (below).

Magnus Bordewich
