
OPTIMALITY OF THE NEIGHBOR JOINING ALGORITHM AND FACES OF THE BALANCED MINIMUM EVOLUTION POLYTOPE

David Haws

Joint work with Ruriko Yoshida and Terrell Hodge

To appear in Bulletin of Mathematical Biology

Figure 19.1 Genomes 3 (© Garland Science 2007)

ORIGINS OF SPECIES

GENE TREE IN A SPECIES TREE

Maddison WP (1997) Gene trees in species trees. Systematic Biology 46: 523-536

PHYLOGENETIC RECONSTRUCTION

Observe alignment of DNA for n species.

1  AGCCCGTCGC…
2  AGCTCGTCCC…
3  GGCTCGACCC…
n  AGCCGGATCC…

Find binary tree that best describes the evolutionary history of the n species.

PHYLOGENETIC RECONSTRUCTION

Maximum likelihood estimation methods (MLE): These methods describe evolution in terms of a discrete-state, continuous-time Markov process.

Bayesian inference methods: Use Bayes’ theorem and MCMC to estimate the posterior distribution rather than obtaining a point estimate.

And distance based methods…

DISTANCE BASED METHODS

Observe alignment of DNA for n species.

1  AGCCCGTCGC…
2  AGCTCGTCCC…
3  GGCTCGACCC…
…
n  AGCCGGATCC…

Compute an “evolutionary” distance between each pair of DNA sequences.

      1     2     …     n
1     0     0.3   …     0.5
2     0.3   0     …     0.4
…     …     …     …     …
n     0.5   0.4   …     0

NOTE: We still need to find the binary tree that best fits this distance matrix.
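As a rough illustration of the "compute an evolutionary distance" step, the sketch below computes the proportion of mismatched sites (p-distance) and applies the Jukes-Cantor correction. The choice of Jukes-Cantor is an assumption made for illustration; the slides do not say which distance was used, and only the visible 10-character prefixes of the sequences are available, so the numbers will not match the matrix shown above.

```python
import math

def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)

def jukes_cantor(p):
    """Jukes-Cantor correction of a p-distance (defined only for p < 3/4)."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Only the visible prefixes from the slide; the real sequences are longer.
alignment = {
    "1": "AGCCCGTCGC",
    "2": "AGCTCGTCCC",
    "3": "GGCTCGACCC",
    "n": "AGCCGGATCC",
}

names = list(alignment)
D = {
    (a, b): jukes_cantor(p_distance(alignment[a], alignment[b]))
    for a in names
    for b in names
}

for a in names:
    print(a, ["%.3f" % D[a, b] for b in names])
```

The resulting dictionary is symmetric with zeros on the diagonal, which is all the later tree-fitting step requires.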

DISTANCE BASED METHOD OVERVIEW

1  AGCCCGTCGC…
2  AGCTCGTCCC…
3  GGCTCGACCC…
…
n  AGCCGGATCC…

      1     2     …     n
1     0     0.3   …     0.5
2     0.3   0     …     0.4
…     …     …     …     …
n     0.5   0.4   …     0

Find binary tree T that “best” describes the distance matrix D.

I.e., consider D fixed and explore all binary trees to find best tree T.

Pipeline: Align DNA → Compute distance matrix D → Find binary tree T given D

Binary tree here means bifurcating tree.

DISTANCE MATRIX (FROM A TREE)

A distance matrix for a tree T is a matrix D where Dij is the mutation distance between species i and j.

      1    2    3    4    5    6
1     0    6    8    9   12   11
2     6    0    6    7   10    9
3     8    6    0    3    6    5
4     9    7    3    0    5    4
5    12   10    6    5    0    5
6    11    9    5    4    5    0

BALANCED MINIMUM EVOLUTION

BME is a weighted least squares distance based method which puts more emphasis on the shorter distances.

Given a distance matrix D, the BME method can assign edge lengths to any binary tree topology T with n leaves.

Goal of BME, given fixed D, is to find the binary tree T with the smallest sum of total branch lengths ∆D(T) (assigned by BME).

min ∆D(T) for all (2n-5)!! tree topologies.
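To give a sense of how quickly the search space grows, here is a small sketch that evaluates (2n-5)!! directly. The function name is illustrative only.

```python
def num_unrooted_binary_topologies(n):
    """Return (2n-5)!! = 1 * 3 * 5 * ... * (2n-5) for n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):
        count *= odd
    return count

for n in (4, 5, 6, 7, 10, 20):
    print(n, num_unrooted_binary_topologies(n))
```

For n = 4, 5, 6, 7 this gives 3, 15, 105, 945 topologies, so exhaustive search is only realistic for very small n.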

[Figure: a binary tree topology T + the distance matrix D = branch lengths assigned by BME.]

PAUPLIN’S FORMULA

If ∆D(T) is the sum of branch lengths of the tree topology T estimated by BME given D, then Pauplin’s formula is

$$\Delta_D(T) = \sum_{i<j} W_{ij}(T)\, D_{ij},$$

where

$$W_{ij}(T) = 2^{\,1 - (\text{number of branches between } i \text{ and } j \text{ in } T)}$$

for a particular tree topology T.

EXAMPLE

For the tree topology above, we have W(T) = (1/2, 1/4, 1/8, 1/8, 1/4, 1/8, 1/8, 1/4, 1/4, 1/2). The index is lexicographic: 01, 02, 03, 04, 12, 13, …, 34.
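The example vector can be recomputed by counting branches on each leaf-to-leaf path. This is a minimal sketch under two assumptions: the weight is read as 2 raised to (1 minus the number of branches between i and j), and the pictured topology is ((0,1),2,(3,4)), which is inferred from the vector itself because the figure is not reproduced here. The internal node labels 5, 6, 7 are arbitrary.

```python
from collections import defaultdict, deque
from fractions import Fraction

def pauplin_weights(edges, n):
    """Weights 2^(1 - path length) for leaves 0..n-1, in lexicographic pair order."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    weights = []
    for i in range(n):
        # Breadth-first search gives the number of branches from leaf i to every node.
        depth = {i: 0}
        queue = deque([i])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    queue.append(v)
        for j in range(i + 1, n):
            weights.append(Fraction(2) ** (1 - depth[j]))
    return weights

# Assumed topology ((0,1),2,(3,4)) with internal nodes 5, 6, 7.
example_edges = [(0, 5), (1, 5), (5, 6), (2, 6), (6, 7), (3, 7), (4, 7)]
W = pauplin_weights(example_edges, 5)
print([str(w) for w in W])
```

Exact fractions are used so the output can be compared entry by entry with the vector on the slide.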

BME AS A LINEAR PROGRAM

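Given Pauplin’s formula, the BME method is thus given by the following linear program:

$$\min_{w} \; \sum_{i<j} D_{ij}\, w_{ij}$$

such that

$$w \in P_n,$$

where

$$P_n = \operatorname{conv}\{\, W(T) : T \text{ is a binary tree topology with } n \text{ leaves} \,\}.$$

We call Pn the BME polytope.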

The set of all objectives D such that Tt is minimal is the normal cone at the vertex W(Tt). We call this cone the BME cone of Tt.
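Because a linear objective over a polytope attains its minimum at a vertex, and the vertices are the vectors W(T), the program can be solved for very small n by simply evaluating every topology. The sketch below does this for the 6-taxon distance matrix shown earlier in the talk. It is a brute-force illustration only, not how BME software works, and the helper names are made up for this example.

```python
from collections import defaultdict, deque
from itertools import combinations

def all_topologies(n):
    """Edge lists of all unrooted binary topologies on leaves 0..n-1 (n >= 3).

    Internal nodes are numbered n, n+1, ...; each new leaf is attached to
    every edge of every smaller tree in turn.
    """
    trees = [[(0, n), (1, n), (2, n)]]
    for leaf in range(3, n):
        new_node = n + leaf - 2
        trees = [
            edges[:k] + edges[k + 1:] + [(u, new_node), (new_node, v), (leaf, new_node)]
            for edges in trees
            for k, (u, v) in enumerate(edges)
        ]
    return trees

def branch_counts(edges, n):
    """Number of branches on the path between every pair of leaves."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    counts = {}
    for i in range(n):
        depth = {i: 0}
        queue = deque([i])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    queue.append(v)
        for j in range(i + 1, n):
            counts[i, j] = depth[j]
    return counts

def bme_length(D, edges):
    """Pauplin's formula: sum over i<j of 2^(1 - branches(i,j)) * D[i][j]."""
    n = len(D)
    counts = branch_counts(edges, n)
    return sum(2.0 ** (1 - counts[i, j]) * D[i][j]
               for i, j in combinations(range(n), 2))

# The 6-taxon matrix from the talk (species 1..6 become indices 0..5).
D = [
    [0, 6, 8, 9, 12, 11],
    [6, 0, 6, 7, 10, 9],
    [8, 6, 0, 3, 6, 5],
    [9, 7, 3, 0, 5, 4],
    [12, 10, 6, 5, 0, 5],
    [11, 9, 5, 4, 5, 0],
]

topologies = all_topologies(len(D))
best = min(topologies, key=lambda edges: bme_length(D, edges))
best_length = bme_length(D, best)
print(len(topologies), best_length)
```

The enumeration visits 105 topologies for n = 6; the count grows as (2n-5)!!, which is why this approach does not scale.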

BME POLYTOPE

W. Day (1987) showed that finding the tree topology minimizing ∆D(T) is NP-hard.

Current BME software uses hill-climbing heuristics.

The BME polytope lies in R^{n(n-1)/2} and has dimension n(n-1)/2 − n.

Lemma [Eickmeyer and Yoshida, 2008] Vertices of Pn are the BME vectors of unrooted binary trees with n leaves. The star phylogeny lies in the interior of the BME polytope, and all other BME vectors lie on the boundary of the BME polytope.

COMBINATORICS OF THE BME POLYTOPES

For up to n = 7 taxa, Eickmeyer et al. computed BME polytopes and studied their structure.

n    Dimension    f-vector
4    2            (3, 3)
5    5            (15, 105, 250, 210, 52)
6    9            (105, 5460, ?, ?, ?, 90262)
7    14           (945, 445410, ?, ?, ?, ?, ?)

All pairs of binary tree topologies T1, T2 on n ≤ 6 taxa can be cooptimal. For n = 7, there is one combinatorial type of non-edge.
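The statement about cooptimal pairs can be cross-checked against the first two f-vector entries: if every pair of vertices forms an edge, the edge count must equal the number of vertex pairs. The sketch below does this arithmetic using only the numbers in the table, and also checks Euler's relation for the two f-vectors that are listed in full.

```python
from math import comb

# (number of vertices, number of edges) taken from the table, keyed by n.
vertices_and_edges = {4: (3, 3), 5: (15, 105), 6: (105, 5460), 7: (945, 445410)}

# Complete f-vectors from the table, with the polytope dimension.
complete_f_vectors = {4: (2, [3, 3]), 5: (5, [15, 105, 250, 210, 52])}

def non_adjacent_pairs(num_vertices, num_edges):
    """Number of vertex pairs that are NOT joined by an edge of the polytope."""
    return comb(num_vertices, 2) - num_edges

def satisfies_euler(dimension, f_vector):
    """Euler's relation for a d-polytope: sum (-1)^k f_k = 1 - (-1)^d."""
    alternating = sum((-1) ** k * f_k for k, f_k in enumerate(f_vector))
    return alternating == 1 - (-1) ** dimension

for n, (v, e) in vertices_and_edges.items():
    print(n, non_adjacent_pairs(v, e))

for n, (d, f) in complete_f_vectors.items():
    print(n, satisfies_euler(d, f))
```

For n = 4, 5, 6 the number of non-adjacent pairs is 0, and for n = 7 it is 630, all of which belong to the single combinatorial type of non-edge mentioned above.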

COMBINATORIAL TYPE OF NON-EDGE

n = 7.

EDGES OF THE BME POLYTOPE

We still do not understand all pairs of trees which will form edges on the BME polytope.

If we understand the edges, we might be able to devise a competitive alternative to FastME (current software) that improves trees by walking along edges on the BME polytope, rather than performing nearest-neighbor interchange (NNI), or subtree-prune-regraft (SPR) moves.

Edge-walking (known as the simplex algorithm in linear programming) works very well in practice.

SUBTREE PRUNE REGRAFT (SPR) MOVE

1 Select a subtree.

2 Detach the selected subtree.

3 Attempt to regraft it onto another branch of the remaining tree, in such a way that a new tree is formed.
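The three steps can be sketched on an unrooted tree stored as a list of edges. This is an illustrative implementation only: the function name, the edge-list representation, and the convention that the regraft target is named in the tree after the pruned subtree's attachment node has been removed are all assumptions of this example.

```python
def spr_move(edges, prune_edge, target_edge):
    """Perform one SPR move on an unrooted binary tree given as an edge list.

    prune_edge  = (p, s): s is the root of the subtree to move and p is the
                  internal (degree-3) node where it is currently attached.
    target_edge = (u, v): an edge of the remaining tree, named as it exists
                  after p has been suppressed, onto which the subtree is regrafted.
    """
    p, s = prune_edge
    edge_set = {frozenset(e) for e in edges}

    # Steps 1-2: select the subtree hanging off s and detach it.
    edge_set.remove(frozenset((p, s)))
    pruned_nodes, stack = {s}, [s]
    while stack:
        a = stack.pop()
        for e in edge_set:
            if a in e:
                (b,) = e - {a}
                if b not in pruned_nodes:
                    pruned_nodes.add(b)
                    stack.append(b)

    # p now has degree 2: remove it and join its two remaining neighbors.
    x, y = [next(iter(e - {p})) for e in edge_set if p in e]
    edge_set -= {frozenset((p, x)), frozenset((p, y))}
    edge_set.add(frozenset((x, y)))

    # Step 3: regraft by subdividing the target edge with p.
    target = frozenset(target_edge)
    if target not in edge_set or target <= pruned_nodes:
        raise ValueError("target must be an edge of the remaining tree")
    u, v = target
    edge_set.remove(target)
    edge_set |= {frozenset((u, p)), frozenset((p, v)), frozenset((p, s))}
    return sorted(tuple(sorted(e)) for e in edge_set)

# Hypothetical 5-leaf tree ((0,1),2,(3,4)) with internal nodes 5, 6, 7.
tree = [(0, 5), (1, 5), (5, 6), (2, 6), (6, 7), (3, 7), (4, 7)]
# Move leaf 0 next to leaf 3.
moved = spr_move(tree, prune_edge=(5, 0), target_edge=(3, 7))
print(moved)
```

In the result, leaves 0 and 3 form a cherry at node 5 and leaves 1 and 2 form a cherry at node 6. Regrafting onto the edge created by the pruning step reproduces the original topology, so a caller that wants a genuinely new tree should skip that target.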

SPR MOVE ADJACENCY

Theorem [H., Hodge, and Yoshida, 2010] If a pair of binary tree topologies T1, T2 on n taxa are adjacent by a subtree prune regraft (SPR) move, then they can be cooptimal in terms of Pauplin’s formula for BME.

This means that a pair of binary tree topologies T1, T2 on n taxa that are adjacent by an SPR move are also adjacent by an edge of the BME polytope.

COMPARING BME TO NEIGHBOR JOINING

Neighbor Joining (NJ) method: A highly popular distance based method used in phylogenetics. [Saitou, Nei 1987], [Studier, Keppler 1988].

Given a fixed distance matrix D, NJ computes a tree topology by recursively joining two nodes which are ‘close’.

Specifically, NJ joins the nodes a and b which have minimal Q-value:

$$Q_{ab} = (n-2)\, D_{ab} - \sum_{k} D_{ak} - \sum_{k} D_{bk}$$

NJ: FAST AND CONSISTENT

Nodes a, b are then replaced by a single new node z, which is the root of the cherry (a,b), and distances Dzk are defined as Dzk = (Dak + Dbk − Dab)/2. Neighbor joining is then applied recursively on the remaining nodes, until a binary tree is obtained.

Neighbor joining based on elements of the matrix Q is consistent: Given a tree metric D = DT as input, NJ will correctly output tree T.
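The recursion can be sketched in a few lines. This example assumes the standard Saitou-Nei formulation, namely the Q-criterion Q_ab = (m-2) D_ab - sum_k D_ak - sum_k D_bk over the m current nodes and the reduction D_zk = (D_ak + D_bk - D_ab)/2. It records only which leaf sets are joined, not branch lengths, and ties are broken by taking the first minimal pair, which is an arbitrary choice of this sketch.

```python
def neighbor_joining(D):
    """Return the leaf sets joined by NJ, in order, until three nodes remain."""
    n = len(D)
    dist = {(i, j): float(D[i][j]) for i in range(n) for j in range(n) if i != j}
    clusters = {i: frozenset([i]) for i in range(n)}  # node id -> leaves below it
    joins = []
    next_id = n
    while len(clusters) > 3:
        nodes = sorted(clusters)
        m = len(nodes)
        row_sum = {a: sum(dist[a, k] for k in nodes if k != a) for a in nodes}
        # Pick the pair (a, b) with minimal Q-value.
        a, b = min(
            ((a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]),
            key=lambda ab: (m - 2) * dist[ab] - row_sum[ab[0]] - row_sum[ab[1]],
        )
        # Replace a and b by a new node z and compute its distances.
        z = next_id
        next_id += 1
        for k in nodes:
            if k not in (a, b):
                d = 0.5 * (dist[a, k] + dist[b, k] - dist[a, b])
                dist[z, k] = dist[k, z] = d
        clusters[z] = clusters.pop(a) | clusters.pop(b)
        joins.append(clusters[z])
    return joins

# The 6-taxon tree metric from the talk (species 1..6 become indices 0..5).
D = [
    [0, 6, 8, 9, 12, 11],
    [6, 0, 6, 7, 10, 9],
    [8, 6, 0, 3, 6, 5],
    [9, 7, 3, 0, 5, 4],
    [12, 10, 6, 5, 0, 5],
    [11, 9, 5, 4, 5, 0],
]
joins = neighbor_joining(D)
print([sorted(c) for c in joins])
```

On this input the first pair joined is species 1 and 2 (indices 0 and 1). Later steps contain ties in the Q-values, but either choice yields the same unrooted topology, with splits 12|3456, 123|456 and 1234|56.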

NEIGHBOR JOINING CONES

Elements of Q are linear in the distances.

So picking a cherry (a,b) means the distances satisfy linear inequalities.

After picking cherry (a,b) and replacing it with a new node z, the new distances Dzk are linear in the old distances: Dzk = (Dak + Dbk − Dab)/2.
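To make the linearity concrete, here is the first step written out, assuming the standard Q-criterion Q_ab = (n-2) D_ab - Σ_k D_ak - Σ_k D_bk. NJ picks the cherry (a,b) first exactly when Q_ab is no larger than any other Q-value, that is:

```latex
(n-2)\,D_{ab} - \sum_{k} D_{ak} - \sum_{k} D_{bk}
\;\le\;
(n-2)\,D_{cd} - \sum_{k} D_{ck} - \sum_{k} D_{dk}
\qquad \text{for all pairs } \{c,d\} \neq \{a,b\}.
```

Each inequality is homogeneous and linear in the entries of D, so scaling D by a positive constant does not change which cherry is picked. Substituting the linear update for Dzk into the next round's Q-values again gives homogeneous linear inequalities in the original distances.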

NEIGHBOR JOINING CONES

Thus NJ will output a particular tree topology T, and pick cherries in a particular order, iff the original distances Dij satisfy certain linear inequalities.

These inequalities define a cone (apex 0) in R^{n(n-1)/2}, called an NJ cone.

NJ will output a particular tree topology T iff the pairwise distances lie in a union of NJ cones.

ISSUES WITH NEIGHBOR JOINING

Neighbor joining is fast and consistent, but it isn’t based on a model of speciation.

The NJ algorithm is a greedy algorithm optimizing the BME criterion [Gascuel, Steel 2006].

Neighbor joining outputs a tree topology T iff the data lies in a union of cones. The union of these cones need not be convex.

In fact, the NJ regions are not convex: there are distance matrices D, D’ such that NJ produces the same tree T1 when run on input D or D’, but produces a different tree T2 ≠ T1 when run on the input (D + D’)/2.

NJ AND BME CONES

Theorem [H., Hodge, Yoshida (2010)]

Given a tree T with any number of taxa, and any particular order σ of picking its pairs of leaves, the BME cone of T and the NJ cone of T and σ have an intersection of positive measure.

This result is particularly important in phylogenetics: it shows that even though the NJ algorithm is greedy, for any order of picking leaf pairs there are dissimilarity maps on which the NJ algorithm returns the BME tree.

FACES OF BME POLYTOPE

A clade of a binary tree T is the subtree given by an internal node and all its descendants.

Blue and red boxes are clades, while green is not a clade.
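The definition translates directly into code when a rooted binary tree is written as nested tuples. The sketch below lists the leaf set of every clade, one per internal node. The example tree is hypothetical and is not the tree in the figure.

```python
def clades(tree):
    """Leaf sets of all clades of a rooted tree given as nested tuples.

    A leaf is any non-tuple label; each tuple is an internal node, and its
    clade is the set of all leaves below it.
    """
    found = []

    def leaves_below(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves_below(child) for child in node))
        found.append(below)
        return below

    leaves_below(tree)
    return found

def pairwise_disjoint(collection):
    """True if no leaf appears in two different sets of the collection."""
    seen = set()
    for leaf_set in collection:
        if seen & leaf_set:
            return False
        seen |= leaf_set
    return True

# Hypothetical rooted tree on leaves 1..6.
tree = (((1, 2), 3), (4, (5, 6)))
for clade in clades(tree):
    print(sorted(clade))
```

Here {1, 2} and {5, 6} are disjoint clades, while {2, 3} is not a clade at all because no internal node has exactly those leaves below it.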

CLADE-FACES OF THE BME POLYTOPE

Theorem [H., Hodge, Yoshida (2010)]

Every disjoint collection of clades C1,C2,…,Ck gives a face of the BME polytope,

We can now describe a large class of faces of the BME polytope.

Note: A clade-face is a lower-dimensional BME polytope.

SUMMARY

BME is a consistent distance based phylogenetic reconstruction method with a strong biological interpretation.

The BME method is equivalent to a linear program over the BME polytope. Until recently, nothing was known about this polytope in general.

SPR moves are edges of the BME polytope and disjoint clades are faces!

We hope to exploit this new knowledge of the BME polytope to develop new algorithms.

We strengthened the connection between the hugely popular NJ method and the BME method.

OPTIMALITY OF THE NEIGHBOR JOINING ALGORITHM AND FACES OF THE BALANCED MINIMUM EVOLUTION POLYTOPE

Thank you!

To appear in the Bulletin of Mathematical Biology

Available: http://arxiv.org/abs/1004.2073

David Haws, University of Kentucky

www.davidhaws.net
