View
4
Download
0
Category
Preview:
Citation preview
Deciphering Reticulate Evolution UsingPhylogenetic Reconciliation
Mukul S. Bansal
Department of Computer Science and Engineering,University of Connecticut,
USA
The Phylogenetic Networks WorkshopJuly 27, 2015
Supertrees or Supernetworks?
Gene Family Evolution
ProblemHow did any given gene family evolve?
I Gene families evolve inside species trees.I Affected by evolutionary events such as gene duplication,
horizontal gene transfer, and gene loss.
A B C D A B C D
y
x
y
z
a1 c1 b1 d1a2 c2
Species Tree S
x
z
Evolution of Gene Family G
Duplication
Loss
Transfer
Gene Tree on G
Definition: DTL Reconciliation
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
a1 c1 b1 d1a2 c2
G with evolutionary history
x,Σ
y,Δ
y,Σ z,Σ D,Θ
A B C D
x
y
z
Duplication
Loss
Transfer
Definition: DTL Reconciliation
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
a1 c1 b1 d1a2 c2
G with evolutionary history
x,Σ
y,Δ
y,Σ z,Σ D,Θ
Input: A gene tree for that gene family, and a trusted rootedspecies tree.Output: An evolutionary history of that gene family showinghorizontal gene transfers, gene duplications, losses, and speciationevents.
DTL Reconciliation Problem Formulation
Parsimony formulation:
I Costs are assigned to duplications, transfers, and losses.
I Goal: Find the reconciliation that minimizes the total cost.
I Easy to compute cost for a given reconciliation.
a1 c1 b1 d1a2 c2
x,Σ
y,Δ
y,Σ z,Σ D,Θ
Costs: L=1, Δ=2, Θ=3
2 + 3 + 2x1 = 7
1Δ, 1Θ, 2L
Different reconciliation could have different cost.
Applications of DTL Reconciliation
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
a1 c1 b1 d1a2 c2
G with evolutionary history
x,Σ
y,Δ
y,Σ z,Σ D,Θ
I Understanding how gene families evolve.
I Dating gene birth.
I Inferring orthologs/paralogs/xenologs.
I Gene tree error-correction.
I Whole genome species tree construction.
I Constructing species phylogenetic networks.
Applications of DTL Reconciliation
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
a1 c1 b1 d1a2 c2
G with evolutionary history
x,Σ
y,Δ
y,Σ z,Σ D,Θ
I Understanding how gene families evolve.
I Dating gene birth.
I Inferring orthologs/paralogs/xenologs.
I Gene tree error-correction.
I Whole genome species tree construction.
I Constructing species phylogenetic networks.
Background: Computing Optimal DTL Reconciliations
For undated species tree:I Best time-consistent reconciliation: NP-hard (Tofigh et al.,
2011; Ovadia et al., 2011)I If time-consistency not enforced: O(mn)-time algorithm
(Bansal et al., 2012)
For dated species trees:I Best time-consistent reconciliation: O(mn2)-time algorithm
(Doyon et al., 2010)
DTLI model:I Considers ILS at unresolved species tree nodes (Stolzer et al.,
2012)
A B C D
Species Tree S
a1 c1 b1 d1a2 c2
Gene Tree G
g sΣ
Background: Handling Multiple Optima
I There can be many optimal DTL reconciliations.
110
1001000
10000100000
100000010000000
1000000001E+091E+101E+111E+121E+131E+141E+151E+16
0 20 40 60 80 100 120 140 160 180 200
Num
ber o
f Sol
utio
ns
Gene Tree Size
Number of Optimal Solutions by Gene Tree Size
I Enumeration: Exponential in input size (Tofigh. et al., 2011; Chenet al., 2012)
I Uniform random sampling and aggregation: O(mn2)-time (Bansalet al., 2013)
I Compact representation in reconciliation graph: O(mn3)-time(Scornavacca et al., 2013)
I Median reconciliation (Nguyen et al., 2013)
Background: Gene Tree Error-Correction
Gene tree construction is highly error-prone. Garbage in, garbageout.
I First-attempts: Mowgli-NNI (Nguyen et al., 2012), AnGST(David and Alm, 2011)
I New methods: TreeFix-DTL (Bansal et al., 2015), ALE(szollosi et al. 2013) ), TERA (Scornavacca et al., 2015))
0
norm
alize
d RF
dist
ance
0.02
0.04
0.06
0.08
0.10
0.12
lowererror
0.14 B
rate 1 rate 3 rate 5 rate 10
RAxMLNOTUNG
Treefix
MowgliNNIAnGSTTreeFix-DTL
varying substitution rate
low-DTL medium-DTL high-DTL
A varying DTL rate
RAxMLNOTUNG
Treefix
MowgliNNIAnGSTTreeFix-DTL
sequence length 173 sequence length 333
RAxMLNOTUNGTreefix
MowgliNNIAnGSTTreeFix-DTL
C varying sequence length
Background: Event Cost Assignment
Fundamental questions:
I What are the “best” event costs to use?
I How do reconciliations vary as we change event costs?
Algorithm to partition event cost space into equivalence regionsbased on pareto-optimality of event counts: O(m5n logm)-time(Libeskind-Hadas et al., 2014)
Tra
nsfe
r co
st r
elat
ive
to d
uplic
atio
n
5
4
3
2
1
54321
<6, 1, 2, 3> Count = 3<6, 0, 3, 1> Count = 2<4, 0, 5, 0> Count = 8<4, 1, 4, 0> Count = 2<6, 2, 1, 5> Count = 1<5, 4, 0, 10> Count = 1
Loss cost relative to duplicationT
rans
fer
cost
rel
ativ
e to
dup
licat
ion
Loss cost relative to duplication1 2 3 4 5
1
2
3
4
5(a) (b) (c)
Constructing Phylogenetic Networks
I Microbial phylogenetic networks are distinct fromhybridization networks.
Problem formulationInput: A collection of gene families (sequence alignments) and areference species tree.Output: Species tree augmented with horizontal edges(representing transfer events), and labels for each vertical andhorizontal edge specifying the genes that traveled through it.
Gene Trees
Species Tree
Genes ...
Genes ...
Genes ... Genes ...
Genes ...
Constructing Phylogenetic Networks
Advantages of using a reference species tree:
I Improved complexity and scalability.
I Inferred network does not depend on reference tree.
I Customizable and easily interpretable network view.
Basic implementation:
1. Infer gene trees.
2. Use DTL reconciliation to reconcile individual gene trees withreference tree.
3. Aggregate transfers inferred for gene trees onto species tree;e.g., NOTUNG.
Challenges due to reconciliation uncertainty
I Multiple optima, gene tree error, and event cost assignmentconfound reconciliation accuracy.
73.15
0.90 1.29
19.10
5.55
0
10
20
30
40
50
60
70
80
90
100
100 [90, 99] [80, 89] [50, 79] [0, 49]
Frac
tion
of n
odes
(%)
Consistency (%)
Challenges due to reconciliation uncertainty
I Multiple optima, gene tree error, and event cost assignmentconfound reconciliation accuracy.
% p
reci
sion
020
6010
0
80 81 8194
% s
ensi
tivity
020
6010
0
8189 90
98
% p
reci
sion
020
6010
0
3871 76
96
% s
ensi
tivity
020
6010
0
68 68 73
92
RAxML AnGST TreeFix-DTL True
Duplications Transfers Losses
% p
reci
sion
020
6010
0
31
73 7685
% s
ensi
tivity
020
6010
0
7687 87 92
Possible Solution
Proposal: Distinguish between highly-supported andweakly-supported events across multiple optima, multiple eventcosts, and gene tree topologies.
I Easy to do for multiple optima and event costs based ondeveloped algorithms.
I No such algorithms for gene tree error.
Goal: Develop algorithms to generate alternative gene treetopologies and study variability of reconciliation across them.
Optimal Gene Tree Resolution (OGTR)
Input: A non-binary gene tree GN , a species tree S , and eventcosts.Output: Find a binary resolution GB of GN such that, the mostparsimonious DTL reconciliation of GB and S has smallestreconciliation cost.
I Provides rigorous framework for uniformly sampling from allcandidate gene trees.
I Non-binary gene tree obtained by collapsing weak edges.I Fundamental question for DTL reconciliation.
A B C D
Species Tree SGene Tree
A AC B C D A B C D
y
Species Tree S
x
z
Σ
Θ
Gene Tree
A AC B C D
Σ
Σ
Θ
(a) (b)
GN GB
(b)
OGTR is NP-hard
I OGTR is NP-hard for both undated and dated species trees(Kordi and Bansal, 2015).
I Surprising since problem is linear-time solvable underduplication-loss model (Zheng and Zhang, 2014).
Key Ideas:
I Reduction from minimum 3 set cover problem.
I Subtrees in species tree correspond to sets
I Subtrees in unresolved gene tree correspond to elements ofthe universe.
I Structure of gadget forces root of each subtree (element) tomap to a “set” in which that element occurs.
I Using more sets for the mapping results in higherreconciliation cost.
OGTR is NP-Hard
A Fixed-Parameter Algorithm for OGTR
I DP algorithms for binary gene trees can be extended to workwith non-binary gene trees.
I Consider all possible resolutions at each non-binary gene treenode and fill DP-table for subproblem with best score over allresolutions.
I Gives O(2k · k! · ln + mn) and O(2k · k! · ln + mn2)-timealgorithms for dated and undated species trees, respectively
A B C D
Species Tree S
a1 c1 b1 d1a2 c2
Gene Tree G
g s15
A Fixed-Parameter Algorithm for OGTR
(a) (b)
(c) (d)
20 40 60 80 100 12 140 0 500
1000 1500
2000 2500
30003500
40004500
5000
2 5 10 15 20 25 30 35 40
3 4 5 6 7 83-8 -6 -4 -2 0
2 4 6 8
10
3 4 5 6 7 8
6
8
10
12
14
16
18
20
(
1122
3344
5
Maximum out-degree
4500 5000
4000 3500 3000
2500 2000 1500
1000 500
0 0
(b
Percentage of non-binary nodes
50% 80%
30 35
50% 80%
Maximum out-degree Maximum out-degree
50
050505
105
45
4332211
Uniform Sampling of Optimal Resolutions
I DP framework allows for sampling optimal resolutionsuniformly at random.
I Samples can be combined with event cost and multipleoptima analyses to assess robustness of individual events andmappings.
I Enables identification of highly- and weakly-supported eventsand mappings for network construction.
Future Directions: Assigning Weakly-Supported Events
1. Leverage high-support events to improve assignment of allother events:
I Use highly-supported events to infer highways of transfers.I Improve assignment of other events based on inferred
highways.
2. Use gene tree and species tree branch lengths to choosebetween alternative scenarios for low-support events.
Thank You!
Questions!
Recommended