Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Combinatorial and graph-theoretic problems in evolutionary tree reconstruction. Tandy Warnow, Department of Computer Sciences, University of Texas at Austin. Reconstructing the “Tree” of Life. Handling large datasets: millions of species. The “Tree of Life” is not really a tree.

Combinatorial and graph-theoretic problems in evolutionary tree reconstruction

Tandy Warnow

Department of Computer Sciences

University of Texas at Austin

Reconstructing the “Tree” of Life

Handling large datasets: millions of species

The “Tree of Life” is not really a tree: reticulate evolution

Evolution informs about everything in biology

• Big genome sequencing projects just produce data -- so what?

• Evolutionary history relates all organisms and genes, and helps us understand and predict
– interactions between genes (genetic networks)
– drug design
– predicting functions of genes
– influenza vaccine development
– origins and spread of disease
– origins and migrations of humans

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Challenges in estimating phylogenies

• Computational: almost all “good” approaches for estimating phylogenies involve solving NP-hard problems

• Statistical

• Data

Major methods for phylogeny reconstruction

• Biology: heuristics for NP-hard optimization problems

• Linguistics: an exact algorithm for an NP-hard optimization problem

Outline for the rest of the talk

• NP-hard and polynomial time problems

• Phylogeny reconstruction in biology: the challenge is to develop better heuristics for NP-hard problems

• Phylogeny reconstruction in linguistics: the NP-hard perfect phylogeny problem, and how we solve it exactly

A polynomial-time problem

• 2-colorability: Given a graph G = (V,E), determine if we can assign the colors red and blue to the vertices of G so that no edge connects vertices of the same color.

• Greedy Algorithm. Start with one vertex and make it red, then make all its neighbors blue, then make all their uncolored neighbors red, and keep going, starting again from any uncolored vertex whenever a connected component is finished. If you succeed in coloring the graph without making two adjacent nodes the same color, the graph can be 2-colored; otherwise it cannot.

• Running time: O(n+m), where n is the number of vertices and m is the number of edges.
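The greedy algorithm above can be sketched as a breadth-first search. This is a minimal illustration, not code from the talk: the function name `two_color`, the vertex numbering 0..n-1 and the edge-list input are all assumptions made for the example.

```python
from collections import deque


def two_color(n, edges):
    """Return a list of colors for vertices 0..n-1, or None if G is not 2-colorable.

    Hypothetical helper: vertices are numbered 0..n-1 and edges is a list of pairs.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    color = [None] * n
    for start in range(n):              # restart in every connected component
        if color[start] is not None:
            continue
        color[start] = "red"
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if color[v] is None:
                    # give the neighbor the opposite color
                    color[v] = "blue" if color[u] == "red" else "red"
                    queue.append(v)
                elif color[v] == color[u]:
                    return None         # an edge joins two same-colored vertices
    return color


square = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle: 2-colorable
triangle = [(0, 1), (1, 2), (2, 0)]         # a 3-cycle: not 2-colorable
print(two_color(4, square))
print(two_color(3, triangle))
```

Each vertex enters the queue once and each edge is examined twice, which is where the O(n+m) bound comes from.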

What about this?

• 3-colorability: Given graph G, determine if we can assign red, blue, and green to the vertices in G so that no edge connects vertices of the same color.

A brute-force solution seems to require O(3^n) time, where n is the number of vertices.
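To make the brute-force bound concrete, here is a small sketch that tries every one of the 3^n assignments of colors to vertices. The function name and the edge-list input format are assumptions for illustration only; it is only practical for very small graphs.

```python
from itertools import product


def three_color_brute_force(n, edges):
    """Try all 3^n colorings of vertices 0..n-1; return the first valid one or None."""
    for coloring in product(("red", "blue", "green"), repeat=n):
        # a coloring is valid if no edge joins two vertices of the same color
        if all(coloring[u] != coloring[v] for u, v in edges):
            return list(coloring)
    return None


triangle = [(0, 1), (1, 2), (2, 0)]
k4 = [(u, v) for u in range(4) for v in range(u + 1, 4)]   # complete graph on 4 vertices
print(three_color_brute_force(3, triangle))
print(three_color_brute_force(4, k4))
```

The triangle needs all three colors, while the complete graph on four vertices has no valid 3-coloring, so the search exhausts all 81 assignments before giving up.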

• Some decision problems can be solved in polynomial time:
– Can graph G be 2-colored?
– Does graph G have an Eulerian tour?

• Some decision problems do not seem to be solvable in polynomial time:
– Can graph G be 3-colored?
– Does graph G have a Hamiltonian cycle?

• 3-colorability is provably NP-hard. What does this mean?

P vs. NP, continued

• The “big” question in theoretical computer science is:
– Is it possible to solve an NP-hard problem in polynomial time?

• If the answer is “yes” for any NP-hard problem, then every problem in NP can be solved in polynomial time, so P=NP. This is generally not believed.

Coping with NP-hard problems

Since NP-hard problems may not be solvable in polynomial time, the options are:
– Solve the problem exactly (but use lots of time on some inputs)
– Use heuristics which may not solve the problem exactly (and which might be computationally expensive, anyway)

General comments for NP-hard optimization problems

• Getting exact solutions may not be possible for some problems on some inputs, without spending a great deal of time.

• You may not know when you have an optimal solution, if you use a heuristic.

• Sometimes exact solutions may not be necessary, and approximate solutions may suffice. But, how good an approximation do you need?

DNA Sequence Evolution

Figure: a rooted tree of DNA sequence evolution with a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). Root: AAGACTT. Intermediate sequences: AAGGCCT, TGGACTT, AGGGCAT, TAGCCCT, AGCACTT. Present-day sequences: TAGCCCA, TAGACTT, AGCGCTT, AGCACAA, AGGGCAT.

Molecular Systematics

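Figure: the sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT and AGGGCAT label five taxa U, V, W, X and Y; the goal is to reconstruct the tree relating U, V, W, X and Y.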

Maximum Parsimony

• Given a set S of sequences of the same length over the nucleotide alphabet {A,C,T,G}, find a tree leaf-labelled by S, with other DNA sequences (of the same length) labelling the internal nodes, so as to minimize the “length” of the tree (the sum of the Hamming distances on the edges).

• NP-hard!
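Scoring one candidate tree is the easy part; the hard part is searching over trees and internal labels. The sketch below only computes the “length” of a single, fully labelled tree. The function names, the tiny 3-letter sequences and the tree shape are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(labels, edges):
    """Parsimony length: sum of Hamming distances over all edges of a labelled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical example: four leaves (a, b, c, d) and two internal nodes (x, y).
labels = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA", "x": "ACT", "y": "GTT"}
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]
print(tree_length(labels, edges))   # 0 + 1 + 2 + 0 + 1 = 4
```

Maximum Parsimony asks for the tree topology and internal labels that make this number as small as possible.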

Solving NP-hard problems exactly is … unlikely

• Number of (unrooted) binary trees on n leaves is (2n-5)!!

• If each tree on 1000 taxa could be analyzed in 0.001 seconds, finding the best tree by exhaustive search would take about 6 x 10^2846 millennia (about 1.9 x 10^2860 trees at 0.001 seconds each).

#leaves   #trees
4         3
5         15
6         105
7         945
8         10395
9         135135
10        2027025
20        2.2 x 10^20
100       1.7 x 10^182
1000      1.9 x 10^2860
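The (2n-5)!! formula is easy to evaluate exactly with arbitrary-precision integers, which gives a way to check any entry in the table. This is an illustrative sketch; the function names are assumptions, and the large values are printed with a truncated (not rounded) leading part.

```python
def num_unrooted_binary_trees(n):
    """(2n-5)!! = 1 * 3 * 5 * ... * (2n-5), the number of unrooted binary trees on n leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n - 4, 2):    # odd numbers 3, 5, ..., 2n-5
        count *= k
    return count


def scientific(count):
    """Format a large integer as 'd.d x 10^e', truncating the mantissa."""
    digits = str(count)
    if len(digits) <= 8:
        return digits
    return f"{digits[0]}.{digits[1]} x 10^{len(digits) - 1}"


for n in (4, 5, 6, 7, 8, 9, 10, 20, 100, 1000):
    print(n, scientific(num_unrooted_binary_trees(n)))
```

The count grows faster than exponentially in n, which is why exhaustive search is hopeless beyond a few dozen taxa.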

Research: we try to develop better heuristics

Figure: Comparison of TNT to Rec-I-DCM3(TNT) on one large dataset. x-axis: hours (0 to 24). y-axis: average MP score above optimal, shown as a percentage of the optimal (0 to 0.2). Curves: current best techniques; DCM boosted version of best techniques.

Summary (so far)

• Optimization problems in biology are almost all NP-hard, and heuristics may run for months before finding local optima.

• The challenge here is to find better heuristics, since exact solutions are very unlikely to ever be achievable on large datasets.

Possible Indo-European tree (Ringe, Warnow and Taylor 2000)

Phylogenies of Languages

• Languages evolve over time, just as biological species do (geographic and other separations induce changes that over time make different dialects incomprehensible -- and new languages appear)

• The result can be modelled as a rooted tree

• The interesting thing is that many characteristics of languages evolve without back mutation or parallel evolution -- so a “perfect phylogeny” is possible!

“Homoplasy-Free” Evolution (perfect phylogenies)

Figure: two example trees, marked YES and NO, contrasting evolution without and with homoplasy.

Historical Linguistic Data

• A character is a function that maps a set of languages, L, to a set of states.

• Three kinds of characters:
– Phonological (sound changes)
– Lexical (meanings based on a wordlist)
– Morphological (grammatical features)

Cognate Classes

• Two words w1 and w2 are in the same cognate class if they evolved from the same word through sound changes.

• French “champ” and Italian “campo” are both descendants of Latin “campus”; thus the two words belong to the same cognate class.

• Spanish “mucho” and English “much” are not in the same cognate class.

The Ringe-Warnow Model of Language Evolution

• The nodes of the tree which contain elements of the same cognate class should form a rooted connected subgraph of the true tree

• The model is known as Character Compatibility or Perfect Phylogeny.

Perfect Phylogeny

• A phylogeny T for a set S of taxa is a perfect phylogeny if each state of each character occupies a subtree (no character has back-mutations or parallel evolution)

Perfect phylogenies, cont.

• A=(0,0), B=(0,1), C=(1,3), D=(1,2) has a perfect phylogeny!

• A=(0,0), B=(0,1), C=(1,0), D=(1,1) does not have a perfect phylogeny!
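Checking whether one given, fully labelled tree is a perfect phylogeny is straightforward: for every character, the nodes sharing a state must form a connected piece of the tree. The sketch below does only that check. The internal nodes X and Y, their labels and the tree shape are hypothetical choices for illustration, and the function does not search over trees, which is the hard part of the problem.

```python
from collections import defaultdict, deque


def _connected(nodes, adj):
    """True if the given set of nodes induces a connected subgraph."""
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == nodes


def is_perfect_phylogeny(labels, edges):
    """labels: node -> tuple of character states (all nodes labelled); edges: tree edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    num_chars = len(next(iter(labels.values())))
    for c in range(num_chars):
        by_state = defaultdict(set)
        for node, states in labels.items():
            by_state[states[c]].add(node)
        # every state of character c must occupy a connected subtree
        if not all(_connected(nodes, adj) for nodes in by_state.values()):
            return False
    return True


edges = [("A", "X"), ("B", "X"), ("X", "Y"), ("C", "Y"), ("D", "Y")]
yes = {"A": (0, 0), "B": (0, 1), "C": (1, 3), "D": (1, 2), "X": (0, 0), "Y": (1, 3)}
no = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1), "X": (0, 0), "Y": (1, 0)}
print(is_perfect_phylogeny(yes, edges))   # True
print(is_perfect_phylogeny(no, edges))    # False
```

A False result only says that this particular labelled tree fails; for the second input, the claim on the slide is the stronger one that no tree and labelling can succeed.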

A perfect phylogeny

• A = 0 0
• B = 0 1
• C = 1 3
• D = 1 2

Figure: a perfect phylogeny with leaves A, B, C and D.

A perfect phylogeny

• A = 0 0
• B = 0 1
• C = 1 3
• D = 1 2
• E = 0 3
• F = 1 3

Figure: a perfect phylogeny containing the nodes A, B, C, D, E and F.

The Perfect Phylogeny Problem

• Given a set S of taxa (species, languages, etc.) determine if a perfect phylogeny T exists for S.

• The problem of determining whether a perfect phylogeny exists is NP-hard (McMorris et al. 1994, Steel 1991).

Triangulated Graphs

• A graph is triangulated (chordal) if it has no induced simple cycles of size four or more, i.e., every cycle on four or more vertices has a chord.
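One simple way to test whether a graph is triangulated uses a standard fact: a graph is triangulated exactly when its vertices can be removed one at a time, always removing a vertex whose remaining neighbors are pairwise adjacent (a simplicial vertex). The sketch below is a straightforward, unoptimized version of that idea; the names are assumptions, and faster linear-time methods exist.

```python
def _is_clique(nodes, adj):
    """True if every pair of the given nodes is adjacent."""
    nodes = list(nodes)
    return all(
        nodes[j] in adj[nodes[i]]
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
    )


def is_triangulated(n, edges):
    """Test chordality of a graph on vertices 0..n-1 by simplicial elimination."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(range(n))
    while remaining:
        # look for a vertex whose current neighbors form a clique
        simplicial = next((v for v in remaining if _is_clique(adj[v], adj)), None)
        if simplicial is None:
            return False            # some chordless cycle of length >= 4 remains
        for w in adj[simplicial]:
            adj[w].discard(simplicial)
        remaining.remove(simplicial)
        del adj[simplicial]
    return True


four_cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(is_triangulated(4, four_cycle))              # False
print(is_triangulated(4, four_cycle + [(0, 2)]))   # True: the chord 0-2 splits the cycle
```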

Triangulating Colored Graphs: An Example

A graph that can be c-triangulated

Triangulating Colored Graphs: An Example

A graph that cannot be c-triangulated

Triangulating Colored Graphs (TCG)

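Triangulating Colored Graphs: given a vertex-colored graph G, determine if G can be c-triangulated, i.e., whether edges can be added to G to make it triangulated without ever joining two vertices of the same color.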

The PP and TCG Problems

• Buneman’s Theorem: A perfect phylogeny exists for a set S if and only if the associated character state intersection graph can be c-triangulated.

• The PP and TCG problems are polynomially equivalent and NP-hard.

A no-instance of Perfect Phylogeny

• A = 0 0
• B = 0 1
• C = 1 0
• D = 1 1

Figure: an input to perfect phylogeny (left) of four sequences described by two characters, and its partition intersection graph, whose vertices are the states 0 and 1 of each character. Note that the partition intersection graph is 2-colored.
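For two characters the test is especially simple. A triangle would need three differently colored vertices, so a 2-colored graph can only be c-triangulated if it has no cycles at all. The sketch below builds the partition intersection graph for two characters (one vertex per character state, one edge per combination of states seen in some taxon) and checks for cycles with a union-find structure. The function name and input format are assumptions for illustration.

```python
def two_character_perfect_phylogeny_exists(taxa):
    """taxa maps a taxon name to (state of character 1, state of character 2)."""
    # One edge per distinct pair of co-occurring states; duplicates collapse in the set.
    edges = {(("c1", s1), ("c2", s2)) for s1, s2 in taxa.values()}

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False                    # this edge closes a cycle
        parent[ru] = rv
    return True


yes_instance = {"A": (0, 0), "B": (0, 1), "C": (1, 3), "D": (1, 2)}
no_instance = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
print(two_character_perfect_phylogeny_exists(yes_instance))   # True
print(two_character_perfect_phylogeny_exists(no_instance))    # False
```

In the no-instance the four edges form a 4-cycle on the vertices 0 and 1 of each character, and any chord would have to join two states of the same character.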

Solving the PP Problem Using Buneman’s Theorem

“Yes” Instance of PP:

      c1   c2   c3
s1    3    2    1
s2    1    2    2
s3    1    1    3
s4    2    1    1

Some special cases are easy

• Binary character perfect phylogeny solvable in linear time

• r-state characters solvable in polynomial time for each r (combinatorial algorithm)

• Two character perfect phylogeny solvable in polynomial time (produces 2-colored graph)

• k-character perfect phylogeny solvable in polynomial time for each k (produces k-colored graphs -- connections to Robertson-Seymour graph minor theory)

Constructing trees in historical linguistics

• Maximum Compatibility: given the input matrix for the set S of languages described by the set C of characters, find a tree T leaf-labelled by S on which a maximum number of the characters in C are compatible (i.e., evolve without homoplasy).

• NP-hard.

The Indo-European (IE) Dataset

• 24 languages

• 22 phonological characters, 15 morphological characters, and 333 lexical characters.

• Total number of working characters is 390 (multiple character coding, and parallel development)

• A phylogenetic tree T on the IE dataset (Ringe, Taylor and Warnow)

• T is compatible with all but 16 characters

• Resolves most of the significant controversies in Indo-European evolution; shows however that Germanic is a problem (not treelike)

Phylogenetic Tree of the IE Dataset (Ringe, Warnow, and Taylor)

Explaining remaining incompatibilities

• We modelled the remaining incompatibilities as undetected borrowing between languages.

• This leads to the mathematical model of “perfect phylogenetic networks”

Modelling borrowing: Networks and Trees within Networks

“Perfect Phylogenetic Network” for IE

Nakhleh et al., Language 2005

Summary

• NP-hard optimization problems abound in phylogeny reconstruction, and in computational biology in general, and need very accurate solutions

• Many real problems have beautiful and natural combinatorial and graph-theoretic formulations

Acknowledgements

• NSF and the David and Lucile Packard Foundation (funding)

• Collaborators Bernard Moret (UNM CS), Donald Ringe (Penn Linguistics)

• Students: Usman Roshan and Luay Nakhleh

Phylolab, U. Texas

Please visit us at http://www.cs.utexas.edu/users/phylo/
