Deciphering Reticulate Evolution Using Phylogenetic ... · Deciphering Reticulate Evolution Using...

Preview:

Citation preview

Deciphering Reticulate Evolution UsingPhylogenetic Reconciliation

Mukul S. Bansal

Department of Computer Science and Engineering,University of Connecticut,

USA

The Phylogenetic Networks WorkshopJuly 27, 2015

Supertrees or Supernetworks?

Gene Family Evolution

ProblemHow did any given gene family evolve?

I Gene families evolve inside species trees.I Affected by evolutionary events such as gene duplication,

horizontal gene transfer, and gene loss.

A B C D A B C D

y

x

y

z

a1 c1 b1 d1a2 c2

Species Tree S

x

z

Evolution of Gene Family G

Duplication

Loss

Transfer

Gene Tree on G

Definition: DTL Reconciliation

A B C D

y

Species Tree S

x

z

a1 c1 b1 d1a2 c2

Gene Tree on G

a1 c1 b1 d1a2 c2

G with evolutionary history

x,Σ

y,Δ

y,Σ z,Σ D,Θ

A B C D

x

y

z

Duplication

Loss

Transfer

Definition: DTL Reconciliation

A B C D

y

Species Tree S

x

z

a1 c1 b1 d1a2 c2

Gene Tree on G

a1 c1 b1 d1a2 c2

G with evolutionary history

x,Σ

y,Δ

y,Σ z,Σ D,Θ

Input: A gene tree for that gene family, and a trusted rootedspecies tree.Output: An evolutionary history of that gene family showinghorizontal gene transfers, gene duplications, losses, and speciationevents.

DTL Reconciliation Problem Formulation

Parsimony formulation:

I Costs are assigned to duplications, transfers, and losses.

I Goal: Find the reconciliation that minimizes the total cost.

I Easy to compute cost for a given reconciliation.

a1 c1 b1 d1a2 c2

x,Σ

y,Δ

y,Σ z,Σ D,Θ

Costs: L=1, Δ=2, Θ=3

2 + 3 + 2x1 = 7

1Δ, 1Θ, 2L

Different reconciliation could have different cost.

Applications of DTL Reconciliation

A B C D

y

Species Tree S

x

z

a1 c1 b1 d1a2 c2

Gene Tree on G

a1 c1 b1 d1a2 c2

G with evolutionary history

x,Σ

y,Δ

y,Σ z,Σ D,Θ

I Understanding how gene families evolve.

I Dating gene birth.

I Inferring orthologs/paralogs/xenologs.

I Gene tree error-correction.

I Whole genome species tree construction.

I Constructing species phylogenetic networks.

Applications of DTL Reconciliation

A B C D

y

Species Tree S

x

z

a1 c1 b1 d1a2 c2

Gene Tree on G

a1 c1 b1 d1a2 c2

G with evolutionary history

x,Σ

y,Δ

y,Σ z,Σ D,Θ

I Understanding how gene families evolve.

I Dating gene birth.

I Inferring orthologs/paralogs/xenologs.

I Gene tree error-correction.

I Whole genome species tree construction.

I Constructing species phylogenetic networks.

Background: Computing Optimal DTL Reconciliations

For undated species tree:I Best time-consistent reconciliation: NP-hard (Tofigh et al.,

2011; Ovadia et al., 2011)I If time-consistency not enforced: O(mn)-time algorithm

(Bansal et al., 2012)

For dated species trees:I Best time-consistent reconciliation: O(mn2)-time algorithm

(Doyon et al., 2010)

DTLI model:I Considers ILS at unresolved species tree nodes (Stolzer et al.,

2012)

A B C D

Species Tree S

a1 c1 b1 d1a2 c2

Gene Tree G

g sΣ

Background: Handling Multiple Optima

I There can be many optimal DTL reconciliations.

110

1001000

10000100000

100000010000000

1000000001E+091E+101E+111E+121E+131E+141E+151E+16

0 20 40 60 80 100 120 140 160 180 200

Num

ber o

f Sol

utio

ns

Gene Tree Size

Number of Optimal Solutions by Gene Tree Size

I Enumeration: Exponential in input size (Tofigh. et al., 2011; Chenet al., 2012)

I Uniform random sampling and aggregation: O(mn2)-time (Bansalet al., 2013)

I Compact representation in reconciliation graph: O(mn3)-time(Scornavacca et al., 2013)

I Median reconciliation (Nguyen et al., 2013)

Background: Gene Tree Error-Correction

Gene tree construction is highly error-prone. Garbage in, garbageout.

I First-attempts: Mowgli-NNI (Nguyen et al., 2012), AnGST(David and Alm, 2011)

I New methods: TreeFix-DTL (Bansal et al., 2015), ALE(szollosi et al. 2013) ), TERA (Scornavacca et al., 2015))

0

norm

alize

d RF

dist

ance

0.02

0.04

0.06

0.08

0.10

0.12

lowererror

0.14 B

rate 1 rate 3 rate 5 rate 10

RAxMLNOTUNG

Treefix

MowgliNNIAnGSTTreeFix-DTL

varying substitution rate

low-DTL medium-DTL high-DTL

A varying DTL rate

RAxMLNOTUNG

Treefix

MowgliNNIAnGSTTreeFix-DTL

sequence length 173 sequence length 333

RAxMLNOTUNGTreefix

MowgliNNIAnGSTTreeFix-DTL

C varying sequence length

Background: Event Cost Assignment

Fundamental questions:

I What are the “best” event costs to use?

I How do reconciliations vary as we change event costs?

Algorithm to partition event cost space into equivalence regionsbased on pareto-optimality of event counts: O(m5n logm)-time(Libeskind-Hadas et al., 2014)

Tra

nsfe

r co

st r

elat

ive

to d

uplic

atio

n

5

4

3

2

1

54321

<6, 1, 2, 3> Count = 3<6, 0, 3, 1> Count = 2<4, 0, 5, 0> Count = 8<4, 1, 4, 0> Count = 2<6, 2, 1, 5> Count = 1<5, 4, 0, 10> Count = 1

Loss cost relative to duplicationT

rans

fer

cost

rel

ativ

e to

dup

licat

ion

Loss cost relative to duplication1 2 3 4 5

1

2

3

4

5(a) (b) (c)

Constructing Phylogenetic Networks

I Microbial phylogenetic networks are distinct fromhybridization networks.

Problem formulationInput: A collection of gene families (sequence alignments) and areference species tree.Output: Species tree augmented with horizontal edges(representing transfer events), and labels for each vertical andhorizontal edge specifying the genes that traveled through it.

Gene Trees

Species Tree

Genes ...

Genes ...

Genes ... Genes ...

Genes ...

Constructing Phylogenetic Networks

Advantages of using a reference species tree:

I Improved complexity and scalability.

I Inferred network does not depend on reference tree.

I Customizable and easily interpretable network view.

Basic implementation:

1. Infer gene trees.

2. Use DTL reconciliation to reconcile individual gene trees withreference tree.

3. Aggregate transfers inferred for gene trees onto species tree;e.g., NOTUNG.

Challenges due to reconciliation uncertainty

I Multiple optima, gene tree error, and event cost assignmentconfound reconciliation accuracy.

73.15

0.90 1.29

19.10

5.55

0

10

20

30

40

50

60

70

80

90

100

100 [90, 99] [80, 89] [50, 79] [0, 49]

Frac

tion

of n

odes

(%)

Consistency (%)

Challenges due to reconciliation uncertainty

I Multiple optima, gene tree error, and event cost assignmentconfound reconciliation accuracy.

% p

reci

sion

020

6010

0

80 81 8194

% s

ensi

tivity

020

6010

0

8189 90

98

% p

reci

sion

020

6010

0

3871 76

96

% s

ensi

tivity

020

6010

0

68 68 73

92

RAxML AnGST TreeFix-DTL True

Duplications Transfers Losses

% p

reci

sion

020

6010

0

31

73 7685

% s

ensi

tivity

020

6010

0

7687 87 92

Possible Solution

Proposal: Distinguish between highly-supported andweakly-supported events across multiple optima, multiple eventcosts, and gene tree topologies.

I Easy to do for multiple optima and event costs based ondeveloped algorithms.

I No such algorithms for gene tree error.

Goal: Develop algorithms to generate alternative gene treetopologies and study variability of reconciliation across them.

Optimal Gene Tree Resolution (OGTR)

Input: A non-binary gene tree GN , a species tree S , and eventcosts.Output: Find a binary resolution GB of GN such that, the mostparsimonious DTL reconciliation of GB and S has smallestreconciliation cost.

I Provides rigorous framework for uniformly sampling from allcandidate gene trees.

I Non-binary gene tree obtained by collapsing weak edges.I Fundamental question for DTL reconciliation.

A B C D

Species Tree SGene Tree

A AC B C D A B C D

y

Species Tree S

x

z

Σ

Θ

Gene Tree

A AC B C D

Σ

Σ

Θ

(a) (b)

GN GB

(b)

OGTR is NP-hard

I OGTR is NP-hard for both undated and dated species trees(Kordi and Bansal, 2015).

I Surprising since problem is linear-time solvable underduplication-loss model (Zheng and Zhang, 2014).

Key Ideas:

I Reduction from minimum 3 set cover problem.

I Subtrees in species tree correspond to sets

I Subtrees in unresolved gene tree correspond to elements ofthe universe.

I Structure of gadget forces root of each subtree (element) tomap to a “set” in which that element occurs.

I Using more sets for the mapping results in higherreconciliation cost.

OGTR is NP-Hard

A Fixed-Parameter Algorithm for OGTR

I DP algorithms for binary gene trees can be extended to workwith non-binary gene trees.

I Consider all possible resolutions at each non-binary gene treenode and fill DP-table for subproblem with best score over allresolutions.

I Gives O(2k · k! · ln + mn) and O(2k · k! · ln + mn2)-timealgorithms for dated and undated species trees, respectively

A B C D

Species Tree S

a1 c1 b1 d1a2 c2

Gene Tree G

g s15

A Fixed-Parameter Algorithm for OGTR

(a) (b)

(c) (d)

20 40 60 80 100 12 140 0 500

1000 1500

2000 2500

30003500

40004500

5000

2 5 10 15 20 25 30 35 40

3 4 5 6 7 83-8 -6 -4 -2 0

2 4 6 8

10

3 4 5 6 7 8

6

8

10

12

14

16

18

20

(

1122

3344

5

Maximum out-degree

4500 5000

4000 3500 3000

2500 2000 1500

1000 500

0 0

(b

Percentage of non-binary nodes

50% 80%

30 35

50% 80%

Maximum out-degree Maximum out-degree

50

050505

105

45

4332211

Uniform Sampling of Optimal Resolutions

I DP framework allows for sampling optimal resolutionsuniformly at random.

I Samples can be combined with event cost and multipleoptima analyses to assess robustness of individual events andmappings.

I Enables identification of highly- and weakly-supported eventsand mappings for network construction.

Future Directions: Assigning Weakly-Supported Events

1. Leverage high-support events to improve assignment of allother events:

I Use highly-supported events to infer highways of transfers.I Improve assignment of other events based on inferred

highways.

2. Use gene tree and species tree branch lengths to choosebetween alternative scenarios for low-support events.

Thank You!

Questions!

Recommended