55
Computing the all-pairs quartet distance on a set of evolutionary trees Martin Stissing 1 Thomas Mailund 1,2 Christian Nørgaard Storm Pedersen 1 Gerth Stølting Brodal 1 Rolf Fagerberg 3 (1) University of Aarhus, (2) Oxford University, (3) University of Southern Denmark

Presentation at APBC 2007

  • Upload
    mailund

  • View
    440

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Presentation at APBC 2007

Computing the all-pairs quartet distance on a set of evolutionary trees

Martin Stissing 1

Thomas Mailund 1,2

Christian Nørgaard Storm Pedersen 1

Gerth Stølting Brodal 1

Rolf Fagerberg 3

(1) University of Aarhus, (2) Oxford University, (3) University of Southern Denmark

Page 2: Presentation at APBC 2007

dc

b

aQuartet: Four named species in an unrooted tree

dc

b

aQuartet topology: The topology of the quartet induced by the tree

Quartet distance: The number of quartets that differs in topology between the two trees

butterfly quartetbutterfly quartet butterfly quartet star quartet

Quartets and quartet distance

Page 3: Presentation at APBC 2007

Quartets and quartet distance

A

B

C

D E

A

B

C

D E

Page 4: Presentation at APBC 2007

Quartets and quartet distance

A

B

C

D E

A

C

B

D

A

B

C

DABCD

A

B

C

D E

Page 5: Presentation at APBC 2007

A

B

C

D E

A

C

B

E

A

C

B

D

A

B

C

E

A

B

C

DABCD

ABCE

A

B

C

D E

Quartets and quartet distance

Page 6: Presentation at APBC 2007

Quartets and quartet distance

A

B

C

D E

A

C

B

E

A

C

B

D

A

B

D

E

A

B

C

E

A

B

C

D

A

B

D

E

ABCD

ABCE

ABDE

A

B

C

D E

Page 7: Presentation at APBC 2007

Quartets and quartet distance

A

B

C

D E

A

C

B

E

A

C

B

D

A

B

D

E

A

C

D

E

A

B

C

E

A

B

C

D

A

B

D

E

A

C

D

E

ABCD

ABCE

ABDE

ACDE

A

B

C

D E

Page 8: Presentation at APBC 2007

Quartets and quartet distance

A

B

C

D E

A

C

B

E

A

C

B

D

B

C

D

E

A

B

D

E

A

C

D

E

A

B

C

E

A

B

C

D

B

C

D

E

A

B

D

E

A

C

D

E

ABCD

ABCE

ABDE

ACDE

BCDE

A

B

C

D E

Page 9: Presentation at APBC 2007

Quartets and quartet distance

Quartet distance = binom(5,4) - 3 = 5 - 3 = 2

A

B

C

D E

A

C

B

E

A

C

B

D

B

C

D

E

A

B

D

E

A

C

D

E

A

B

C

E

A

B

C

D

B

C

D

E

A

B

D

E

A

C

D

E

ABCD

ABCE

ABDE

ACDE

BCDE

A

B

C

D E

Page 10: Presentation at APBC 2007

Previous workG. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology., 27:27-33, 1978.

W. Day and C. R. Doucette.Expected behaviour of quartet distances between undirected tree.Proc. 18th International Numerical Taxonomy Conference, 1984

C. R. DoucetteAn efficient algorithm to compute quartet dissimilarity measures. O(n3)Bachelor of Science (Honours) Dissertation.Memorial University of Newfoundland, 1985

M. Steel and D. Penny.Distribution of tree comparison metrics—some new results.Systematic Biology, 42(2):126–141, 1993.

D. Bryant, J. Tsang, P. E. Kearney, and M. Li. O(n2)Computing the quartet distance between evolutionary trees.Proc. 11th Symp. on Discrete Algorithms (SODA), 285–286, 2000.

G. S. Brodal, R. Fagerberg, and C. N. S. Pedersen O(n log2 n)Computing the quartet distance in time O(n log n). O(n log n)Algorithmica, 38(2): 377-395, 2003.

Page 11: Presentation at APBC 2007

Previous workG. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology., 27:27-33, 1978.

W. Day and C. R. Doucette.Expected behaviour of quartet distances between undirected tree.Proc. 18th International Numerical Taxonomy Conference, 1984

C. R. DoucetteAn efficient algorithm to compute quartet dissimilarity measures. O(n3)Bachelor of Science (Honours) Dissertation.Memorial University of Newfoundland, 1985

M. Steel and D. Penny.Distribution of tree comparison metrics—some new results.Systematic Biology, 42(2):126–141, 1993.

D. Bryant, J. Tsang, P. E. Kearney, and M. Li. O(n2)Computing the quartet distance between evolutionary trees.Proc. 11th Symp. on Discrete Algorithms (SODA), 285–286, 2000.

G. S. Brodal, R. Fagerberg, and C. N. S. Pedersen O(n log2 n)Computing the quartet distance in time O(n log n). O(n log n)Algorithmica, 38(2): 377-395, 2003.

Algorithms are for fully resolved trees (binary trees) only.

Quartet distance between binary trees seems easier to compute ...

Page 12: Presentation at APBC 2007

Previous workG. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology., 27:27-33, 1978.

W. Day and C. R. Doucette.Expected behaviour of quartet distances between undirected tree.Proc. 18th International Numerical Taxonomy Conference, 1984

C. R. DoucetteAn efficient algorithm to compute quartet dissimilarity measures. O(n3)Bachelor of Science (Honours) Dissertation.Memorial University of Newfoundland, 1985

M. Steel and D. Penny.Distribution of tree comparison metrics—some new results.Systematic Biology, 42(2):126–141, 1993.

D. Bryant, J. Tsang, P. E. Kearney, and M. Li. O(n2)Computing the quartet distance between evolutionary trees.Proc. 11th Symp. on Discrete Algorithms (SODA), 285–286, 2000.

G. S. Brodal, R. Fagerberg, and C. N. S. Pedersen O(n log2 n)Computing the quartet distance in time O(n log n). O(n log n)Algorithmica, 38(2): 377-395, 2003.

Algorithms are for fully resolved trees (binary trees) only.

Quartet distance between binary trees seems easier to compute ...

C. Christiansen, T. Mailund, C. N. S. Pedersen, and M. RandersAlgorithms for computing the quartet distance between trees of arbitrary degreeProc of Workshop on Algorithms in Bioinformatics (WABI), 77-88, 2005 O(n3)

O(d2n2)

C. Christiansen, T. Mailund, C. N. S. Pedersen, M. Randers, and M. StissingFast calculation of the quartet distance between trees of arbitrary degreesAlgorithms for Molecular Biology, 1(16), 2006 O(dn2)

M. Stissing, C. N. S. Pedersen, T. Mailund, G. S. Brodal, and R. FagerbergComputing the quartet distance between evolutionary trees of bounded degreeProc. of APBC, 2007 O(d9 n log n)

Page 13: Presentation at APBC 2007

QDist

http://www.birc.au.dk/~mailund/qdist.html

QDist: Implements the O(n2) and O(n log2 n) time algorithm for computing the quartet distance between binary trees

Page 14: Presentation at APBC 2007

QDist

http://www.birc.au.dk/~mailund/qdist.html

... we want to compare 10,000 trees of size 200 using Qdist, is that possible? Takes about 2·10,0002 sec ≈ 6 years ...

QDist: Implements the O(n2) and O(n log2 n) time algorithm for computing the quartet distance between binary trees

Page 15: Presentation at APBC 2007

QDist

http://www.birc.au.dk/~mailund/qdist.html

... we want to compare 10,000 trees of size 200 using Qdist, is that possible? Takes about 2·10,0002 sec ≈ 6 years ...

QDist: Implements the O(n2) and O(n log2 n) time algorithm for computing the quartet distance between binary trees

Immediate solution: Perform O(k2) comparisons between two tree using time O(n log2 n) or O(n2) each ...

Page 16: Presentation at APBC 2007

QDist

http://www.birc.au.dk/~mailund/qdist.html

... we want to compare 10,000 trees of size 200 using Qdist, is that possible? Takes about 2·10,0002 sec ≈ 6 years ...

QDist: Implements the O(n2) and O(n log2 n) time algorithm for computing the quartet distance between binary trees

Immediate solution: Perform O(k2) comparisons between two tree using time O(n log2 n) or O(n2) each ...

Faster solution: Exploit similarities between the input trees by merging them into a single DAG-structure which allows “common subtrees” to be shared.

All comparisons are performed by comparing the DAG against the input trees, or against itself.

The speed-up depends on compactness of the DAG ...

Page 17: Presentation at APBC 2007

Counting shared (directed) quartetsIdea: Iterate over all pairs of edges in the two trees, and count for each pair how many associated quartets are shared. The problem is to define “associated” such each quartet is counted as most once ...

Directed quartets

b

c

a,de

1A1

B1

C1

b

c

a,de

2A2

B2

C2

Page 18: Presentation at APBC 2007

Counting using colouring

All colourings in one tree can be produced in O(n log n) –Smaller half trick

Each colouring can be counted in amortized O(log n) – Hierarchical decomposition and polynomial manipulations

Idea: Alternatively, colour leaves according to nodes in one tree and count compatible colourings in the other

Brodal et al.: Computing the Quartet Distance between Evolutionary Treesin Time O(n log n), Algoritmica 38(2), 2003.

Page 19: Presentation at APBC 2007

Comparing sets of trees

Observation: Shared sub-trees implies shared quartets.

Page 20: Presentation at APBC 2007

Comparing sets of trees

Observation: We can share results between trees by merging them into a DAG.

Page 21: Presentation at APBC 2007

Experiment – DAG sizeThe size of the DAG depends on the similarity of the input trees

Simulation setup

Simulating one binary tree with n leaves: evolve kn sequences of length m on the k tree, construct k neighbor joining trees from the resulting kn sequences

Page 22: Presentation at APBC 2007

Experiment – Running timeRunning time of “DAG versus DAG” algorithm on a standard Linux PC

The “DAG-DAG” approach takes between 20 and 500 seconds for the same comparisons depending on the similarity of the trees

The straightforward “all-against-all” approach takes 3000 second for comparing 100 trees of size 500

Page 23: Presentation at APBC 2007

Summary

Results

Utilizing shared tree structure speeds up the computation significantly ...

Future work

Extension to non-binary trees and trees with unequal leaf sets

Applications?

Page 24: Presentation at APBC 2007

Computing the quartet distance between evolutionary trees of

bounded degree

Martin Stissing 1

Christian Nørgaard Storm Pedersen 1

Thomas Mailund 1,2

Gerth Stølting Brodal 1

Rolf Fagerberg 3

(1) University of Aarhus, (2) Oxford University, (3) University of Southern Denmark

Page 25: Presentation at APBC 2007

Comparing partially resolved trees

Page 26: Presentation at APBC 2007

Previous workG. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology., 27:27-33, 1978.

W. Day and C. R. Doucette.Expected behaviour of quartet distances between undirected tree.Proc. 18th International Numerical Taxonomy Conference, 1984

C. R. DoucetteAn efficient algorithm to compute quartet dissimilarity measures. O(n3)Bachelor of Science (Honours) Dissertation.Memorial University of Newfoundland, 1985

M. Steel and D. Penny.Distribution of tree comparison metrics—some new results.Systematic Biology, 42(2):126–141, 1993.

D. Bryant, J. Tsang, P. E. Kearney, and M. Li. O(n2)Computing the quartet distance between evolutionary trees.Proc. 11th Symp. on Discrete Algorithms (SODA), 285–286, 2000.

G. S. Brodal, R. Fagerberg, and C. N. S. Pedersen O(n log2 n)Computing the quartet distance in time O(n log n). O(n log n)Algorithmica, 38(2): 377-395, 2003.

Algorithms are for fully resolved trees (binary trees) only.

Quartet distance between binary trees seems easier to compute ...

C. Christiansen, T. Mailund, C. N. S. Pedersen, and M. RandersAlgorithms for computing the quartet distance between trees of arbitrary degreeProc of Workshop on Algorithms in Bioinformatics (WABI), 77-88, 2005 O(n3)

O(d2n2)

C. Christiansen, T. Mailund, C. N. S. Pedersen, M. Randers, and M. StissingFast calculation of the quartet distance between trees of arbitrary degreesAlgorithms for Molecular Biology, 1(16), 2006 O(dn2)

M. Stissing, C. N. S. Pedersen, T. Mailund, G. S. Brodal, and R. FagerbergComputing the quartet distance between evolutionary trees of bounded degreeProc. of APBC, 2007 O(d9 n log n)

Page 27: Presentation at APBC 2007

Previous workG. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology., 27:27-33, 1978.

W. Day and C. R. Doucette.Expected behaviour of quartet distances between undirected tree.Proc. 18th International Numerical Taxonomy Conference, 1984

C. R. DoucetteAn efficient algorithm to compute quartet dissimilarity measures. O(n3)Bachelor of Science (Honours) Dissertation.Memorial University of Newfoundland, 1985

M. Steel and D. Penny.Distribution of tree comparison metrics—some new results.Systematic Biology, 42(2):126–141, 1993.

D. Bryant, J. Tsang, P. E. Kearney, and M. Li. O(n2)Computing the quartet distance between evolutionary trees.Proc. 11th Symp. on Discrete Algorithms (SODA), 285–286, 2000.

G. S. Brodal, R. Fagerberg, and C. N. S. Pedersen O(n log2 n)Computing the quartet distance in time O(n log n). O(n log n)Algorithmica, 38(2): 377-395, 2003.

Algorithms are for fully resolved trees (binary trees) only.

Quartet distance between binary trees seems easier to compute ...

C. Christiansen, T. Mailund, C. N. S. Pedersen, and M. RandersAlgorithms for computing the quartet distance between trees of arbitrary degreeProc of Workshop on Algorithms in Bioinformatics (WABI), 77-88, 2005 O(n3)

O(d2n2)

C. Christiansen, T. Mailund, C. N. S. Pedersen, M. Randers, and M. StissingFast calculation of the quartet distance between trees of arbitrary degreesAlgorithms for Molecular Biology, 1(16), 2006 O(dn2)

M. Stissing, C. N. S. Pedersen, T. Mailund, G. S. Brodal, and R. FagerbergComputing the quartet distance between evolutionary trees of bounded degreeProc. of APBC, 2007 O(d9 n log n)

Page 28: Presentation at APBC 2007

Idea: Iterate over all pairs of edges (or nodes) in the two trees, and count for each pair how many associated quartets are shared. The problem is to define “associated” such each quartet is counted as most once ...

Algorithms for partially resolved trees

Different solutionsType Time Space

Center based O(n3) O(n)Edge based O(|V||V'|·d·d') O(n2)Edge based O(n+|V||V'|·id·id') O(n+|V||V'|)Node based O(n+|V||V'|·min{id,id'}) O(n+|V||V'|)

- n is the number of species/leaves

- |V|, |V'| are number of internal nodes in T and T'

- id, id' are internal degree of T and T'

“worst case”-tree

|V|=id=O(n)

Page 29: Presentation at APBC 2007

QuartetDistQuartetDist: Computes quartet distance (or normalized similarity, i.e. quartet fit similarity) between general trees

http://www.birc.au.dk/~chrisc/qdist/

A simple GUIImplementation in Java by Martin Randers and Chris Christiansen

Page 30: Presentation at APBC 2007

Previous workG. Estabrook, F. McMorris, and C. Meacham. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology., 27:27-33, 1978.

W. Day and C. R. Doucette.Expected behaviour of quartet distances between undirected tree.Proc. 18th International Numerical Taxonomy Conference, 1984

C. R. DoucetteAn efficient algorithm to compute quartet dissimilarity measures. O(n3)Bachelor of Science (Honours) Dissertation.Memorial University of Newfoundland, 1985

M. Steel and D. Penny.Distribution of tree comparison metrics—some new results.Systematic Biology, 42(2):126–141, 1993.

D. Bryant, J. Tsang, P. E. Kearney, and M. Li. O(n2)Computing the quartet distance between evolutionary trees.Proc. 11th Symp. on Discrete Algorithms (SODA), 285–286, 2000.

G. S. Brodal, R. Fagerberg, and C. N. S. Pedersen O(n log2 n)Computing the quartet distance in time O(n log n). O(n log n)Algorithmica, 38(2): 377-395, 2003.

Algorithms are for fully resolved trees (binary trees) only.

Quartet distance between binary trees seems easier to compute ...

C. Christiansen, T. Mailund, C. N. S. Pedersen, and M. RandersAlgorithms for computing the quartet distance between trees of arbitrary degreeProc of Workshop on Algorithms in Bioinformatics (WABI), 77-88, 2005 O(n3)

O(d2n2)

C. Christiansen, T. Mailund, C. N. S. Pedersen, M. Randers, and M. StissingFast calculation of the quartet distance between trees of arbitrary degreesAlgorithms for Molecular Biology, 1(16), 2006 O(dn2)

M. Stissing, C. N. S. Pedersen, T. Mailund, G. S. Brodal, and R. FagerbergComputing the quartet distance between evolutionary trees of bounded degreeProc. of APBC, 2007 O(d9 n log n)

Page 31: Presentation at APBC 2007

Counting using colouring

Idea: Alternatively, colour leaves according to nodes in one tree and count compatible colourings in the other

This approach does not explicitly consider all pairs of nodes/edges, which makes it possible to achieve a sub-quadratic running time

Brodal et al.: Computing the Quartet Distance between Evolutionary Treesin Time O(n log n), Algoritmica 38(2), 2003.

Page 32: Presentation at APBC 2007

Counting using colouring

Idea: Alternatively, colour leaves according to nodes in one tree and count compatible colourings in the other

This approach does not explicitly consider all pairs of nodes/edges, which makes it possible to achieve a sub-quadratic running time

Ideas can be generalized to trees of maximum degree d . Use d colours instead of 3, use a more complex hierarchical decomposition tree ...

Brodal et al.: Computing the Quartet Distance between Evolutionary Treesin Time O(n log n), Algoritmica 38(2), 2003.

Page 33: Presentation at APBC 2007

Counting using colouring

Idea: Alternatively, colour leaves according to nodes in one tree and count compatible colourings in the other

This approach does not explicitly consider all pairs of nodes/edges, which makes it possible to achieve a sub-quadratic running time

Ideas can be generalized to trees of maximum degree d . Use d colours instead of 3, use a more complex hierarchical decomposition tree ...

Brodal et al.: Computing the Quartet Distance between Evolutionary Treesin Time O(n log n), Algoritmica 38(2), 2003.

Page 34: Presentation at APBC 2007

Counting using colouring

Idea: Alternatively, colour leaves according to nodes in one tree and count compatible colourings in the other

This approach does not explicitly consider all pairs of nodes/edges, which makes it possible to achieve a sub-quadratic running time

Ideas can be generalized to trees of maximum degree d . Use d colours instead of 3, use a more complex hierarchical decomposition tree ...

Brodal et al.: Computing the Quartet Distance between Evolutionary Treesin Time O(n log n), Algoritmica 38(2), 2003.

Efficiently calculated using hierarchical decomposition tree

Page 35: Presentation at APBC 2007

Shared “butterfly” quartets

Page 36: Presentation at APBC 2007

Shared “star” quartets

Page 37: Presentation at APBC 2007

Colouring

Page 38: Presentation at APBC 2007

Colouring

Page 39: Presentation at APBC 2007

Colouring

Page 40: Presentation at APBC 2007

Colouring

Page 41: Presentation at APBC 2007

Colouring

Page 42: Presentation at APBC 2007

Colouring

Page 43: Presentation at APBC 2007

Colouring

Page 44: Presentation at APBC 2007

Colouring

Page 45: Presentation at APBC 2007

Colouring

Page 46: Presentation at APBC 2007

Colouring

Smaller-half trick

Page 47: Presentation at APBC 2007

Colouring

Smaller-half trick

In total: All colourings takes time O(n log n) ...

Page 48: Presentation at APBC 2007

Hierarchical decomposition tree

Let T be an unrooted tree of size n with degree at most d. A component C in T is:

A single node in T, or

A connected subset of nodes in T with degree at most d.

A hierarchical decomposition tree H(T)of T is a rooted binary tree:

A leaf is simple component (single node in T).

An internal node is a composite component of its children.

Page 49: Presentation at APBC 2007

Hierarchical decomposition tree

Let T be an unrooted tree of size n with degree at most d. A component C in T is:

A single node in T, or

A connected subset of nodes in T with degree at most d.

A hierarchical decomposition tree H(T)of T is a rooted binary tree:

A leaf is simple component (single node in T).

An internal node is a composite component of its children.Lemma: A hierarchical decomposition tree H(T) of T of

height O(d log n) can be constructed in time O(dn)

Page 50: Presentation at APBC 2007

Hierarchical decomposition tree

The function F yields the number of quartets associated to nodes in C and compatible with colouring ...

Annotation of node C in H(T)

A d-tuple denoting the number of leaves in C coloured in each of the d colours.

A function F of d × “degree of C” variables – representing counts of each colour in the remaining leaves of T

Page 51: Presentation at APBC 2007

Hierarchical decomposition tree

The function F yields the number of quartets associated to nodes in C and compatible with colouring ...

Annotation of node C in H(T)

A d-tuple denoting the number of leaves in C coloured in each of the d colours.

A function F of d × “degree of C” variables – representing counts of each colour in the remaining leaves of T

The function at the root is a constant which counts the total number of quartets compatible with the current colouring in time O(1) ...

Page 52: Presentation at APBC 2007

Hierarchical decomposition tree

Page 53: Presentation at APBC 2007

Hierarchical decomposition tree

Page 54: Presentation at APBC 2007

Hierarchical decomposition tree

Observation: F is a polynomial of at most d2 variables of degree at most 4, i.e. It can be manipulated in time O(d8) ...

Lemma: Initial annotation takes time O(d9n). Updating the annotation when changing a colour takes amortized time O(d9 log n) ...

Page 55: Presentation at APBC 2007

Hierarchical decomposition tree

Observation: F is a polynomial of at most d2 variables of degree at most 4, i.e. It can be manipulated in time O(d8) ...

Lemma: Initial annotation takes time O(d9n). Updating the annotation when changing a colour takes amortized time O(d9 log n) ...

Summing it all upColouring leaves and updating H(T) takes amortized time O(log n × d9 log n) per node v1 in T1. Counting takes time O(1) per node v1 in T1. In total: O(d9 n log2 n).

... with “contractions” in H(T) the total time can be improved to O(d9 n log n) ...