ASTRAL-III: increased scalability and impacts of contracting low support branches
Chao Zhang Maryam Rabiee
Erfan Sayyari Siavash Mirarab
University of California, San Diego
Gene tree discordance
[Figure: the species tree on Orangutan, Gorilla, Chimp, and Human, shown next to gene trees (gene 1 through gene 1000) on the same four species.]
Causes of gene tree discordance include:
• Incomplete Lineage Sorting (ILS)
• Duplication and loss
• Horizontal Gene Transfer (HGT)
Incomplete Lineage Sorting (ILS)
A random process related to the coalescence of alleles across various populations
Tracing alleles through generations
Multi-species coalescent (MSC) model captures ILS
Unrooted quartets under MSC model
For a quartet (4 species), the unrooted species tree topology has at least 1/3 probability in gene trees (Allman, et al. 2010)
[Figure: the three unrooted quartet topologies on Orang., Gorilla, Chimp, and Human, with frequencies θ1 = 70%, θ2 = 15%, θ3 = 15%, for a species tree with internal branch length d = 0.8.]
The most frequent gene tree = The most likely species tree
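The 70% / 15% / 15% split on the slide agrees with the standard multi-species coalescent result for an unrooted quartet: with internal branch length d in coalescent units, the species tree topology has probability 1 - (2/3)e^(-d) and each alternative has probability (1/3)e^(-d). A minimal sketch of that arithmetic follows; the function name is ours and does not come from the talk.

```python
import math

def quartet_probabilities(d):
    """Return (concordant, alternative, alternative) gene tree probabilities
    for an unrooted quartet whose internal branch has length d (coalescent units)."""
    alternative = math.exp(-d) / 3.0
    concordant = 1.0 - 2.0 * alternative
    return concordant, alternative, alternative

theta1, theta2, theta3 = quartet_probabilities(0.8)
print(round(theta1, 2), round(theta2, 2), round(theta3, 2))  # roughly 0.7 0.15 0.15
```

As d shrinks to zero all three topologies approach 1/3, which is why the species tree topology never drops below 1/3 for a quartet.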
[Figure: a five-species tree on Orang., Gorilla, Chimp, Human, and Rhesus.]
More than 4 species
For >4 species, the species tree topology can be different from the most likely gene tree (called the anomaly zone) (Degnan, 2013)
1. Break gene trees into $\binom{n}{4}$ quartets of species
2. Find the dominant tree for all quartets of taxa
3. Combine quartet trees
Some tools (e.g., BUCKy-p [Larget, et al., 2010])
Alternative: weight all $3\binom{n}{4}$ quartet topologies by their frequency and find the optimal tree
Maximum Quartet Support Species Tree
• Optimization problem: Find the species tree with the maximum number of induced quartet trees shared with the collection of k input gene trees

$$\mathrm{Score}(T) = \sum_{i=1}^{k} |Q(T) \cap Q(t_i)|$$

where $Q(T)$ is the set of quartet trees induced by T and $t_i$ is a gene tree
• Theorem: Statistically consistent under the multi-species coalescent model when solved exactly
• NP-Hard [Lafond & Scornavacca, 2016]
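To make the score concrete, here is a brute-force sketch that enumerates quartets directly. It is meant only to illustrate the definition, not how ASTRAL computes it. Trees are encoded as nested tuples of leaf names, which is an assumption of this example, as are the function names.

```python
from itertools import combinations

def clusters(tree):
    """Return (leaf set, list of leaf sets below each internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clusters = clusters(child)
        leaves |= child_leaves
        found.extend(child_clusters)
    found.append(leaves)
    return leaves, found

def quartet_topologies(tree):
    """Map each resolved quartet of leaves to its split, e.g. {{a, b}, {c, d}}."""
    leaves, found = clusters(tree)
    result = {}
    for quartet in combinations(sorted(leaves), 4):
        q = frozenset(quartet)
        for cluster in found:
            side = cluster & q
            if len(side) == 2:  # this edge separates the quartet two against two
                result[q] = frozenset([side, q - side])
                break
    return result

def quartet_score(species_tree, gene_trees):
    """Sum over gene trees of the number of quartet trees shared with species_tree."""
    induced = quartet_topologies(species_tree)
    total = 0
    for gene_tree in gene_trees:
        gene_quartets = quartet_topologies(gene_tree)
        total += sum(1 for q, topo in induced.items() if gene_quartets.get(q) == topo)
    return total

species = ("Rhesus", ("Orang", ("Gorilla", ("Chimp", "Human"))))
gene2 = ("Rhesus", ("Orang", ("Chimp", ("Gorilla", "Human"))))
print(quartet_score(species, [species, gene2]))  # 5 shared with itself + 3 with gene2 = 8
```

Enumerating all quartets costs on the order of n^4 per tree, which is why the talk turns to dynamic programming instead.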
ASTRAL-I [Mirarab, et al., Bioinformatics, 2014]
• ASTRAL solves the problem using dynamic programming
• Introduced a constrained version of the problem
• Draws branches in the species tree from a given set X = {all bipartitions in all gene trees}
• The constrained version is statistically consistent
• Running time: $O(n^2 k|X|^{1.73})$
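The constraint set X can be pictured as follows. This is a hedged sketch: it assumes gene trees are nested tuples of leaf names, that every gene tree contains all taxa, and it records only non-trivial bipartitions; ASTRAL's actual data structures are not described on the slide.

```python
def leaf_sets(tree, out):
    """Append the leaf set below every internal node to `out`; return the node's leaf set."""
    if isinstance(tree, str):
        return frozenset([tree])
    below = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(below)
    return below

def bipartition_set(gene_trees):
    """Collect the distinct non-trivial bipartitions found in the gene trees."""
    X = set()
    for tree in gene_trees:
        found = []
        all_leaves = leaf_sets(tree, found)
        for side in found:
            other = all_leaves - side
            if len(side) >= 2 and len(other) >= 2:
                X.add(frozenset([side, other]))  # unordered pair of sides
    return X

gene_a = ("Rhesus", ("Orang", ("Gorilla", ("Chimp", "Human"))))
gene_b = ("Rhesus", ("Orang", ("Chimp", ("Gorilla", "Human"))))
X = bipartition_set([gene_a, gene_b])
print(len(X))  # 3: the two trees share the bipartition {Rhesus, Orang} | {the rest}
```

Because shared bipartitions are stored once, |X| grows with the amount of disagreement among gene trees rather than with k alone.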
ASTRAL-II [Mirarab and Warnow, Bioinformatics, 2015]
1. Faster calculation of the score function inside DP
• $O(nk|X|^{1.73})$ instead of $O(n^2 k|X|^{1.73})$
2. Add extra bipartitions to the set X using heuristics
• Consensus + support + subsampling species
• Using quartet-based distances to find likely branches
• Complete incomplete gene trees
3. Can handle input gene trees with polytomies in $O(n^3 k|X|^{1.73})$
ASTRAL used by the biologists
• Plants: Wickett, et al., 2014, PNAS
• Birds: Prum, et al., 2015, Nature
• Xenoturbella: Cannon et al., 2016, Nature
• Xenoturbella: Rouse et al., 2016, Nature
• Flatworms: Laumer, et al., 2015, eLife
• Shrews: Giarla, et al., 2015, Syst. Bio.
• Frogs: Yuan et al., 2016, Syst. Bio.
• Tomatoes: Pease, et al., 2016, PLoS Bio.
• Angiosperms: Huang et al., 2016, MBE
• Worms: Andrade, et al., 2015, MBE
ASTRAL-III: improved running time
Feature 1:
ASTRAL-III can now analyze multifurcating gene trees with the same running time as binary gene trees: $O(nk|X|^{1.73})$
Zhang et al. Page 14 of 34
[Plot: (a) average weight calculation time (seconds, log scale) and (b) set X cluster size (K), each against the contraction threshold (non, 0, 3, 5, 7, 10, 20, 33, 50, 75), in panels for 50, 200, 500, and 1000 gene trees; ASTRAL-II versus ASTRAL-III, sequence lengths 200 and 1600.]
Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time it takes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for both ASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbers of gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns with 400 and 800bp alignments.
3.2.2 Running time for large polytomies
ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and ASTRAL-III can have a different set X, we show the running time per each weight calculation (i.e., Eq. 3). As we contract low support branches and hence increase the prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II, whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.
3.2.3 The search space
Comparing the size of the search space (|X|) between ASTRAL-II and ASTRAL-III shows that, as intended, the search space is decreased in size for cases with no polytomy but can increase in the presence of polytomies (Fig. 6b). With no contraction, on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few error-prone gene trees (50 gene trees from 200bp alignments), the search space has reduced dramatically, but with many genes or high-quality gene trees, the reductions are minimal. Moreover, the search space for gene trees estimated from short alignments (e.g., 200bp) is several times larger than those based on longer alignments (e.g., 1600bp) for both methods. Contracting low support branches initially increases the search space because ASTRAL-III adds multiple resolutions per polytomy to X. Further contraction results in reductions in |X|, presumably because many polytomies exist and they are resolved similarly inside ASTRAL-III.
[Slide annotations on Fig. 6: "Low gene tree error", "High gene tree error", "more polytomies"]
ASTRAL-III: improved running time
Feature 2: Can exploit similarities between gene trees
• $O(D|X|^{1.73})$ where D is the sum of degrees of all unique gene tree nodes
• Overlay all gene trees onto a single polytree; use a bottom-up traversal to score a cluster
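One way to read the quantity D is sketched below, under our own assumptions: gene trees are nested tuples of leaf names, every gene tree contains all taxa, and two nodes count as the same when they partition the leaves identically. A root with two children is skipped because it is an artifact of rooting an unrooted tree. The names are hypothetical and this is not ASTRAL-III's polytree implementation.

```python
def node_partitions(tree):
    """Return, for each internal node of the unrooted tree, its partition of the leaves."""
    per_node = []

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        child_sets = [walk(child) for child in node]
        per_node.append(child_sets)
        return frozenset().union(*child_sets)

    all_leaves = walk(tree)
    partitions = []
    for child_sets in per_node:
        above = all_leaves - frozenset().union(*child_sets)
        sides = child_sets + ([above] if above else [])
        if len(sides) >= 3:  # degree-2 roots are not real nodes
            partitions.append(frozenset(sides))
    return partitions

def sum_unique_degrees(gene_trees):
    """D: total degree over distinct node partitions across all gene trees."""
    unique = set()
    for tree in gene_trees:
        unique.update(node_partitions(tree))
    return sum(len(partition) for partition in unique)

t1 = ("Rhesus", ("Orang", ("Gorilla", ("Chimp", "Human"))))
t2 = ("Rhesus", ("Orang", ("Chimp", ("Gorilla", "Human"))))
print(sum_unique_degrees([t1, t1]))  # 9: identical trees add nothing new
print(sum_unique_degrees([t1, t2]))  # 15: one node is shared between the two trees
```

In this reading D is at most on the order of nk and shrinks as gene trees agree more, which is how similarity between gene trees translates into less work.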
[Plot: running time (minutes) versus #genes, in panels for 500bp and 1500bp alignments; ASTRAL2 versus ASTRAL3.]
Figure S2 Running time versus k. Average running time of ASTRAL-II versus ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers of genes (k), shown in normal scale. A LOESS curve is fit to the data points. ASTRAL-II could not finish on $2^{14}$ genes in the allotted 48 hour time. Averages are over 4 runs.
[Plot: (a) average weight calculation time (seconds, log scale) and (b) set X cluster size, each against the contraction threshold (non, 0, 3, 5, 7, 10, 20, 33, 50, 75), in panels for 50, 200, 500, and 1000 gene trees; ASTRAL-II versus ASTRAL-III, sequence lengths 400 and 800.]
Figure S3 Weight calculation and |X| on S100. Average and standard error of (a) the time it takes to score a single tripartition using Eq. 3 and (b) search space size |X| for both ASTRAL-II (red) and ASTRAL-III (blue) on the S100 dataset. Running time is in log scale for varying numbers of gene trees (boxes) and sequence length 400 and 800 (line types).
[Plot: running time (minutes, log scale) versus #genes ($2^{8}$ to $2^{14}$), in panels for 500bp and 1500bp alignments; ASTRAL2 versus ASTRAL3. Slope labels: 2.27 and 2.09 (500bp panel), 2.29 and 2.06 (1500bp panel).]
Figure 5 Running time versus k. Average running times (4 replicates) are shown for ASTRAL-II and ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers of genes (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in the log/log space and line slopes are shown. ASTRAL-II did not finish on $2^{14}$ genes in 48 hours.
branches has minimal impact on the discordance (eight discordant branches with
binned MP-EST instead of nine). However, contracting low support branches with
3%–33% thresholds dramatically reduces the discordance with the reference tree (2,
2, 4, 2, 3, and 3 discordant branches, respectively, for 3%, 5%, 7%, 10%, 20%, and
33%). Three thresholds (3%, 5%, and 10%) produce an identical tree (Fig. 4d). The
remaining differences are among the branches that are deemed unresolved by Jarvis
et al. and change among the reference trees as well [5]. Contracting at 50% and 75%
thresholds, however, increases discordance to five and six branches, respectively.
Thus, consistent with simulations, contracting very low support branches seems
to produce the best results, when judged by similarity with the reference trees. To
summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with
all major relations in Jarvis et al., including the novel Columbea group, whereas
the unresolved tree missed important clades (Fig. 4).
3.2 RQ2: Running time improvements
We study the improvements in running time as various parameters change.
3.2.1 Varying the number of genes (k)
We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing the number of genes from $2^{8}$ to $2^{14}$ and forcing X to be the same for both versions to enable comparing impacts of improved weight calculation. We allow each replicate run to take up to two days. ASTRAL-II is not able to finish on the dataset with $k = 2^{14}$, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the running time over ASTRAL-II and the extent of the improvement depends on k (Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With $2^{13}$ genes, the largest value where both versions could run, ASTRAL-III finishes on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover, fitting a line to the average running time in the log-log scale graph reveals that on this dataset, the running time of ASTRAL-III on average grows as $O(k^{2.08})$, which is better than that of ASTRAL-II at $O(k^{2.28})$, and both are better than the theoretical worst case, which is $O(k^{2.726})$.
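The empirical exponents quoted above come from fitting a straight line in log-log space: if running time behaves like c·k^a, then log(time) is linear in log(k) with slope a. The sketch below shows that fit with ordinary least squares. The timing values are synthetic, generated from an exact power law purely for illustration, and are not the measurements from the paper.

```python
import math

def loglog_slope(ks, times):
    """Least-squares slope of log2(time) against log2(k)."""
    xs = [math.log2(k) for k in ks]
    ys = [math.log2(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

ks = [2 ** e for e in range(8, 15)]      # 2^8 ... 2^14 genes, as in the experiment
times = [1e-6 * k ** 2.08 for k in ks]   # synthetic stand-in data, NOT measured values
print(round(loglog_slope(ks, times), 2))  # 2.08, recovering the exponent used above
```

On real measurements the points scatter around the line, so the fitted slope is an average growth rate over the tested range rather than an asymptotic guarantee.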
ASTRAL-III: improved running time
Feature 3. An A*-like algorithm is used to trim the dynamic programming
• Compute an upper-bound on the score of a cluster without “expanding” it
• If the upper bound is below the best existing score, don’t expand
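The pruning idea can be illustrated with a toy branch-and-bound over a memoized dynamic program. Everything here is an assumption made for the example: the weight function is an arbitrary stand-in (not ASTRAL's tripartition weight), every split of a cluster is allowed, and the upper bound is the crude "number of remaining splits times the largest possible weight". The sketch only shows that skipping splits whose bound cannot beat the current best leaves the optimum unchanged.

```python
from itertools import combinations

W_MAX = 6  # largest value the toy weight below can take

def weight(a, b):
    """Hypothetical stand-in score for splitting a cluster into a and b."""
    return (sum(map(ord, a)) * sum(map(ord, b))) % (W_MAX + 1)

def splits(cluster):
    """Yield each division of a cluster into two non-empty parts exactly once."""
    first, *rest = sorted(cluster)
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            a = frozenset((first, *combo))
            yield a, cluster - a

def upper_bound(cluster):
    """A cluster of m leaves resolves into m - 1 splits, each worth at most W_MAX."""
    return W_MAX * (len(cluster) - 1)

def best_score(cluster, memo, stats, prune=True):
    if len(cluster) == 1:
        return 0
    if cluster in memo:
        return memo[cluster]
    best = -1
    for a, b in splits(cluster):
        w = weight(a, b)
        if prune and w + upper_bound(a) + upper_bound(b) <= best:
            stats["pruned"] += 1  # bound cannot beat the best split so far: do not expand
            continue
        score = w + best_score(a, memo, stats, prune) + best_score(b, memo, stats, prune)
        best = max(best, score)
    memo[cluster] = best
    return best

leaves = frozenset("abcdef")
pruned_stats, full_stats = {"pruned": 0}, {"pruned": 0}
with_pruning = best_score(leaves, {}, pruned_stats, prune=True)
without_pruning = best_score(leaves, {}, full_stats, prune=False)
print(with_pruning == without_pruning)  # True: pruning does not change the optimum
```

The tighter the bound, the more splits are skipped, which is why a sharper upper bound can pay off even though it costs more to compute.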
New features (journal version)
4. Restrict the search space to guarantee $|X| = O(nk)$
• Overall running time is $O(D(nk)^{1.73}) = O((nk)^{2.73})$
• Does not impact accuracy
• Further improves speed
5. The A*-like algorithm is further developed to compute even tighter upper bounds (complicated)
• Improves speed only empirically and slightly
Empirical running time
Empirical running time seems close to $O((nk)^2)$
[Plot: size of X (10k to 1M) versus #species (200 to 10000), panel label 1.09; running time (minutes, 1 to 1000) versus #species (200 to 10000), panel label 1.83.]
Figure S8 Empirical running time of ASTRAL-III with n. Average running time is shown for ASTRAL-III for datasets with varying n. Averages are over 20 replicates. One replicate of 2000 species dataset could not finish in 2 days and is removed from the analysis. Note that these datasets have factors other than n that change as well (e.g., the amount of ILS, etc.). Thus, these running times should be treated as ball-park estimates. Finally, we note that on the 10,000 dataset, we have only 2 replicates and not 20.
Low support branches
• Does it help to contract branches with low support?
• Yes, but only for very low supports
• Mostly helps in the presence of low support gene trees
• More genes allow for more filtering
Simulations: 100 taxa, SimPhy, ILS: around 46% true discordance; FastTree, support from bootstrapping
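Contracting a branch means deleting an internal node and attaching its children to its parent, which turns the parent into a polytomy. A minimal sketch follows, assuming a toy encoding where a leaf is a string and an internal node is a `(support, children)` pair. The encoding, the function name, and the choice to contract branches with support at or below the threshold are assumptions of this example.

```python
def contract(node, threshold):
    """Return a copy of the tree with internal branches of support <= threshold removed."""
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = contract(child, threshold)
        if not isinstance(child, str) and child[0] <= threshold:
            new_children.extend(child[1])  # splice grandchildren in: creates a polytomy
        else:
            new_children.append(child)
    return (support, new_children)

# The root's own support value is never examined; 100 is a placeholder.
gene_tree = (100, [(5, ["A", "B"]), (90, ["C", (10, ["D", "E"])])])
print(contract(gene_tree, 10))
# (100, ['A', 'B', (90, ['C', 'D', 'E'])])
```

Raising the threshold removes more branches, so a very high threshold discards real signal along with noise, in line with the observation that only very low thresholds help.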
Avian biological dataset
[Figure: four trees on the 48 avian species, with the Columbea group labeled. Panels: a) Binned MP-EST (coalescent-based published tree); b) ExaML concatenation (published tree); c) Unbinned ASTRAL - no contraction; d) Unbinned ASTRAL - 3%, 5%, 10% contraction.]
Figure 4 Avian dataset with 14,446 genes. Shown are reference trees from the original paper [5] using the coalescent-based binning (a) and concatenation (b), and two new trees using ASTRAL-III with no contraction (c) and with contraction with 3%, 5%, and 10% thresholds (d). Support values (bootstrap for a,b and local posterior probability for c,d) shown for all branches except those with full support; in (d), support is shown for 3%, 5%, and 10%, respectively. Branches conflicting with the reference coalescent-based tree are shown as dotted red lines.
MP-EST run on binned gene trees (i.e., binned MP-EST) produced results [5, 33] that were largely congruent with the concatenation using ExaML [45] and differed in only five branches with low support (Fig. 4ab); both trees were used as the reference [5]. Here, we test if simply contracting low support gene tree branches and using ASTRAL-III produces a tree that is congruent with these reference trees.
Similar to MP-EST, when ASTRAL-III is run on 14,446 gene trees with no contraction, the results differ in nine and 11 branches, respectively, with respect to the reference binned MP-EST and concatenation trees (Fig. 4c). Moreover, this tree contradicts some strong results from the avian analyses (e.g., not recovering the Columbea group [5]). ASTRAL-III with no contraction finishes in 32 hours, but with contraction, depending on the threshold, it takes 3 to 84 hours (> 50 hours for 0% – 20% thresholds and < 26 hours for 33% – 75%). Contracting 0%
14K gene trees with very low gene tree resolution (25% mean BS)
Avian biological dataset
16
Zhang et al. Page 12 of 34
[Figure 4: four tree panels over 48 avian species. a) Binned MP-EST (coalescent-based published tree); b) ExaML concatenation (published tree); c) Unbinned ASTRAL, no contraction; d) Unbinned ASTRAL, 3%, 5%, 10% contraction.]
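The contraction step tested on the avian gene trees can be illustrated with a small sketch. It is not the authors' pipeline: it assumes gene trees are Newick strings whose internal-node labels are support values, that names contain no whitespace, and that "contraction at threshold t" removes every internal branch with support at or below t (consistent with the 0% threshold having an effect in the text). The function names are hypothetical, and branch lengths are dropped for brevity.

```python
import re

def parse_newick(text):
    """Parse Newick into nested dicts with keys 'name', 'support', 'children'.
    Internal-node labels are read as support values; branch lengths are dropped."""
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", text))
    pos = 0

    def read_node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children.append(read_node())
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(read_node())
            pos += 1                      # consume ")"
        label = ""
        if pos < len(tokens) and tokens[pos] not in {"(", ")", ",", ";"}:
            label = tokens[pos].split(":")[0]
            pos += 1
        if not children:
            return {"name": label, "support": None, "children": []}
        try:
            support = float(label)
        except ValueError:                # unlabeled or non-numeric internal node
            support = None
        return {"name": "", "support": support, "children": children}

    return read_node()

def contract(node, threshold):
    """Return a copy in which internal branches with support <= threshold
    are collapsed, turning their parent into a polytomy (assumed semantics)."""
    kept = []
    for child in node["children"]:
        c = contract(child, threshold)
        if c["children"] and c["support"] is not None and c["support"] <= threshold:
            kept.extend(c["children"])    # lift grandchildren into the parent
        else:
            kept.append(c)
    return {"name": node["name"], "support": node["support"], "children": kept}

def to_newick(node):
    """Serialize back to Newick without the trailing semicolon."""
    if not node["children"]:
        return node["name"]
    inner = ",".join(to_newick(c) for c in node["children"])
    label = "" if node["support"] is None else format(node["support"], "g")
    return f"({inner}){label}"

def contract_newick(text, threshold):
    return to_newick(contract(parse_newick(text), threshold)) + ";"

example = "((A,B)95,((C,D)2,E)40,F);"
print(contract_newick(example, 10))   # ((A,B)95,(C,D,E)40,F);
```

Raising the threshold removes more branches, so each gene tree carries fewer but better supported groupings into the summary method.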
Moving forward …
• ASTRAL, like all other two-step approaches, is sensitive to errors in the input gene trees
• What else can be done to reduce impacts of gene tree error?
• ASTRAL scales to 10K species. We have 90K bacterial genomes.
• Can we scale further? … divide-and-conquer …
[Figure 6 plots. (a) Average weight calculation time (seconds, log scale, axis ticks from 2^-12 to 2^-6) against contraction threshold (non, 0, 3, 5, 7, 10, 20, 33, 50, 75), in boxes for 50, 200, 500, and 1000 gene trees; series for sequence lengths 1600 and 200 and for ASTRAL-II and ASTRAL-III. (b) Set X cluster size (K, axis from 0 to 100) against the same contraction thresholds and boxes.]
Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time it takes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for both ASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbers of gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns with 400 and 800bp alignments.
3.2.2 Running time for large polytomies
ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the
running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and
ASTRAL-III can have a different set X, we show the running time per weight
calculation (i.e., Eq. 3). As we contract low-support branches and hence increase the
prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II,
whereas in ASTRAL-III the weight calculation time remains flat or even decreases.
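One possible way to build intuition for the growth seen with polytomies is to count how much work a scorer would do if it enumerated every triple of sides around a gene-tree node of degree d. That enumeration strategy is an assumption made here for illustration; this section does not describe how either ASTRAL version evaluates Eq. 3, and `child_triples` is a hypothetical helper.

```python
from math import comb

def child_triples(degree: int) -> int:
    """Number of unordered triples of sides around a node with `degree` sides.
    Assumption: a naive scorer visits each such triple once."""
    return comb(degree, 3)

# A binary (fully resolved) node has degree 3; contraction raises the degree.
for d in (3, 10, 50, 100):
    print(d, child_triples(d))
```

Under this assumption the per-node work grows cubically with the degree, which is the kind of behavior a method would need to avoid for the weight calculation time to stay flat as more branches are contracted.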
3.2.3 The search space
Comparing the size of the search space (|X|) between ASTRAL-II and ASTRAL-III
shows that, as intended, the search space is decreased in size for cases with no
polytomy but can increase in the presence of polytomies (Fig. 6b). With no contraction,
on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few
error-prone gene trees (50 gene trees from 200bp alignments), the search space is
reduced dramatically, but with many genes or high-quality gene trees, the reductions
are minimal. Moreover, the search space for gene trees estimated from short
alignments (e.g., 200bp) is several times larger than for those based on longer alignments
(e.g., 1600bp) for both methods. Contracting low-support branches initially
increases the search space because ASTRAL-III adds multiple resolutions per polytomy
to X. Further contraction results in reductions in |X|, presumably because
many polytomies exist and they are resolved similarly inside ASTRAL-III.
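To make the notion of a search space concrete, the sketch below collects the distinct bipartitions observed in a set of gene trees. It assumes that X is seeded from the bipartitions present in the input gene trees; the extra resolutions that ASTRAL-III adds per polytomy are not modeled, so this shows only the baseline effect of conflict and contraction on |X|. Trees are written as nested tuples of taxon names, and all function names are hypothetical.

```python
def leaves(tree):
    """Leaf set of a tree given as nested tuples of taxon-name strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def bipartitions(tree):
    """Non-trivial bipartitions of one tree, each stored as the side
    that excludes a fixed reference taxon so both orientations match."""
    all_taxa = leaves(tree)
    ref = min(all_taxa)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if ref in side:
            side = all_taxa - side
        if 2 <= len(side) <= len(all_taxa) - 2:   # skip trivial splits
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found

def search_space(gene_trees):
    """Union of bipartitions over all gene trees (baseline for X)."""
    x = set()
    for tree in gene_trees:
        x |= bipartitions(tree)
    return x

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))     # conflicts with t1 on the A/B/C grouping
t2_contracted = (("A", "C", "B"), ("D", "E"))  # conflicting branch collapsed

print(len(search_space([t1, t2])))             # 3
print(len(search_space([t1, t2_contracted])))  # 2
```

Conflicting, error-prone gene trees contribute distinct bipartitions and inflate the set, while collapsing an unsupported branch removes its bipartition. This matches the direction of the trends described above for short alignments and for heavy contraction.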
Zhang et al. Page 14 of 34
(a)
50 200 500 1000
non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75
2−12
2−10
2−8
2−6
contraction
Avg
weig
ht c
alcu
latio
n tim
e (s
econ
ds) 1600
200
ASTRAL−II
ASTRAL−III
(b)
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
●
●● ● ● ●
●●
●
●●
● ● ● ● ● ●●
● ●
● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●
●●
●
●● ● ● ●
●
●●
●●
● ● ● ● ●
●
●●
●
● ● ● ● ● ●●
●●
●
●● ● ● ● ●
●●
●
●
●
●●
●●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
● ● ● ● ● ●
●
●
●
●
●
● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
50 200 500 1000
non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 750
25
50
75
100
contraction
Set X
clu
ster
size
(K)
Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time ittakes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for bothASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbersof gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns forwith 400 and 800bp alignments.
3.2.2 Running time for large polytomies
ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the
running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and
ASTRAL-III can have a di↵erent set X, we show the running time per each weight
calculation (i.e., Eq. 3). As we contract low support branches and hence increase the
prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II,
whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.
3.2.3 The search space
Comparing the size of the search space (|X|) between ATRAL-II and ASTRAL-
III shows that as intended, the search space is decreased in size for cases with no
polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac-
tion, on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few
error-prone gene trees (50 gene trees from 200bp alignments), the search space has
reduced dramatically but with many genes or high-quality gene trees, the reduc-
tions are minimal. Moreover, the search space for gene trees estimated from short
alignments (e.g., 200bp) is several times larger than those based on longer align-
ments (e.g., 1600bp) for both methods. Contracting low support branches initially
increases the search space because ASTRAL-III adds multiple resolutions per poly-
tomy to X. Further contraction results in reductions in |X|, presumably because
many polytomies exist and they are resolved similarly inside ASTRAL-III.
Zhang et al. Page 14 of 34
(a)
50 200 500 1000
non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75
2−12
2−10
2−8
2−6
contraction
Avg
weig
ht c
alcu
latio
n tim
e (s
econ
ds) 1600
200
ASTRAL−II
ASTRAL−III
(b)
● ● ● ● ● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
●
●● ● ● ●
●●
●
●●
● ● ● ● ● ●●
● ●
● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●
●●
●
●● ● ● ●
●
●●
●●
● ● ● ● ●
●
●●
●
● ● ● ● ● ●●
●●
●
●● ● ● ● ●
●●
●
●
●
●●
●●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
● ● ● ● ● ●
●
●
●
●
●
● ● ● ● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
50 200 500 1000
non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 75 non 0 3 5 7 10 20 33 50 750
25
50
75
100
contraction
Set X
clu
ster
size
(K)
Figure 6 Weight calculation and |X| on S100. Average and standard error of (a) the time ittakes to score a single tripartition using Eq. 3 and (b) search space size |X| are shown for bothASTRAL-II and ASTRAL-III on the S100 dataset. Running time is in log scale. We vary numbersof gene trees (boxes) and sequence length (200 and 1600). See Fig. S3 for similar patterns forwith 400 and 800bp alignments.
3.2.2 Running time for large polytomies
ASTRAL-III has a clear advantage compared to ASTRAL-II with respect to the
running time when gene trees include polytomies (Fig. 6a). Since ASTRAL-II and
ASTRAL-III can have a di↵erent set X, we show the running time per each weight
calculation (i.e., Eq. 3). As we contract low support branches and hence increase the
prevalence of polytomies, the weight calculation time quickly grows for ASTRAL-II,
whereas, in ASTRAL-III, the weight calculation time remains flat, or even decreases.
3.2.3 The search space
Comparing the size of the search space (|X|) between ATRAL-II and ASTRAL-
III shows that as intended, the search space is decreased in size for cases with no
polytomy but can increase in the presence of polytomies (Fig. 6b). With no contrac-
tion, on average, |X| is always smaller for ASTRAL-III than ASTRAL-II. With few
error-prone gene trees (50 gene trees from 200bp alignments), the search space has
reduced dramatically but with many genes or high-quality gene trees, the reduc-
tions are minimal. Moreover, the search space for gene trees estimated from short
alignments (e.g., 200bp) is several times larger than those based on longer align-
ments (e.g., 1600bp) for both methods. Contracting low support branches initially
increases the search space because ASTRAL-III adds multiple resolutions per poly-
tomy to X. Further contraction results in reductions in |X|, presumably because
many polytomies exist and they are resolved similarly inside ASTRAL-III.
Zhang et al. Page 14 of 34
[Figure 6 panels: (a) average weight calculation time in seconds (log scale) and (b) Set X cluster size (K), each plotted against the contraction threshold (non, 0, 3, 5, 7, 10, 20, 33, 50, 75) for 50, 200, 500, and 1000 gene trees, with 200bp and 1600bp alignments, for ASTRAL-II and ASTRAL-III.]
Further improvements
• Use a polytree to overlay all the input gene trees into one data structure
• Allows us to spend time on each unique node in the gene trees only once
• O(|D|(nm)^1.73)
• 3X running time improvement
unique nodes in gene trees = O(nm)
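The idea of handling each unique node only once can be illustrated with a small sketch. It assumes gene trees are given as nested Python tuples of leaf names and identifies a node by the set of leaves below it; both choices are simplifications made here for illustration and are not the polytree used by ASTRAL-III. The function names are hypothetical.

```python
from collections import Counter

def clusters(tree):
    """Return (leaves of `tree`, leaf sets below each of its internal nodes)."""
    if isinstance(tree, str):                  # a leaf name
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def unique_nodes(gene_trees):
    """Count how often each distinct cluster occurs across all gene trees."""
    counts = Counter()
    for tree in gene_trees:
        counts.update(clusters(tree)[1])
    return counts

# Three toy gene trees on four species (illustrative data only).
gene_trees = [
    ((("Human", "Chimp"), "Gorilla"), "Orang"),
    ((("Human", "Chimp"), "Gorilla"), "Orang"),
    ((("Human", "Gorilla"), "Chimp"), "Orang"),
]
counts = unique_nodes(gene_trees)
total_nodes = sum(counts.values())      # internal nodes summed over all trees
distinct_nodes = len(counts)            # nodes that need to be processed once
```

In this toy input the three trees contain nine internal nodes but only four distinct clusters, so any per-node work could be done four times and reused through the stored counts.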
Dynamic programming
[Diagram: the species set L is divided into a cluster A and its complement L-A; A is divided further into A’ and A-A’, and u labels the tripartition (A’ | A-A’ | L-A).]
• Recursively break subsets of species into smaller subsets
• w(u): Compare u against input gene trees and compute quartets from gene trees satisfied by u
• Exponential running time
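To make w(u) concrete, the following is a brute-force sketch that enumerates every quartet of every gene tree and checks it against a tripartition u. It assumes nested-tuple gene trees and treats a quartet as satisfied by u when one of its two pairs lies inside a single part of u and the other pair is divided between the two remaining parts. This enumeration is only for illustration; it is far slower than a practical weight calculation, and all function names are hypothetical.

```python
from itertools import combinations

def clusters(tree):
    """Return (leaves of `tree`, leaf sets below each of its internal nodes)."""
    if isinstance(tree, str):                      # a leaf name
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clusters(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found

def quartet_split(tree_clusters, quartet):
    """Return the two pairs the tree separates `quartet` into, or None if unresolved."""
    q = frozenset(quartet)
    for cluster in tree_clusters:
        side = cluster & q
        if len(side) == 2:                         # an edge separates side from q - side
            return side, q - side
    return None

def satisfies(u, split):
    """True if one pair lies in one part of u and the other pair is
    divided between the two remaining parts."""
    for same, apart in (split, split[::-1]):
        x, y = tuple(apart)
        for i in range(3):
            others = [u[j] for j in range(3) if j != i]
            if same <= u[i] and ((x in others[0] and y in others[1]) or
                                 (x in others[1] and y in others[0])):
                return True
    return False

def weight(u, gene_trees):
    """w(u): number of (gene tree, quartet) pairs satisfied by tripartition u."""
    total = 0
    for tree in gene_trees:
        leaves, tree_clusters = clusters(tree)
        for quartet in combinations(sorted(leaves), 4):
            split = quartet_split(tree_clusters, quartet)
            if split is not None and satisfies(u, split):
                total += 1
    return total

gene_trees = [
    ((("Human", "Chimp"), "Gorilla"), "Orang"),
    ((("Human", "Chimp"), "Gorilla"), "Orang"),
    ((("Human", "Gorilla"), "Chimp"), "Orang"),
]
u1 = (frozenset({"Human"}), frozenset({"Chimp"}), frozenset({"Gorilla", "Orang"}))
u2 = (frozenset({"Human", "Gorilla"}), frozenset({"Chimp"}), frozenset({"Orang"}))
w1 = weight(u1, gene_trees)
w2 = weight(u2, gene_trees)
```

Here u1 is supported by the two gene trees that group Human with Chimp, and u2 only by the third tree.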
Constrained version
[Diagram: the same division of L into A, A’, A-A’, and L-A, with clusters labeled A1 through A10.]
• Restrict “branches” in the species tree to a given constraint set X.
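A minimal sketch of the constrained recursion is given below. It assumes that a cluster A may only be divided into two parts that both belong to X, that each division contributes the weight of the tripartition (A’ | A-A’ | L-A), and that the weight function is supplied by the caller. The names `constrained_dp`, `tri`, and `toy_weights`, the single-letter species names, and the weight values are all made up for illustration.

```python
from functools import lru_cache

def constrained_dp(L, X, weight):
    """Best total weight (and one tree achieving it) using only clusters in X."""
    L = frozenset(L)
    # Singletons and the full set are always allowed (an assumption of this sketch).
    X = {frozenset(c) for c in X} | {frozenset([s]) for s in L} | {L}
    no_tree = (float("-inf"), None)

    @lru_cache(maxsize=None)
    def solve(A):
        if len(A) == 1:
            return 0, next(iter(A))
        best = no_tree
        for A1 in X:
            A2 = A - A1
            if A1 < A and A2 in X:             # both halves must be in X
                s1, t1 = solve(A1)
                s2, t2 = solve(A2)
                score = s1 + s2 + weight((A1, A2, L - A))
                if score > best[0]:
                    best = (score, (t1, t2))
        return best                             # stays at -inf if X cannot split A

    return solve(L)

def tri(*parts):
    """Order-independent key for a tripartition given as three strings of letters."""
    return frozenset(frozenset(p) for p in parts)

# Hypothetical weights for species H, C, G, O.
toy_weights = {
    tri("H", "C", "GO"): 2, tri("HC", "G", "O"): 2,
    tri("H", "G", "CO"): 1, tri("HG", "C", "O"): 1,
}
score, tree = constrained_dp(
    "HCGO", ["HC", "HG", "HCG"],
    lambda u: toy_weights.get(frozenset(u), 0),
)
```

Because each cluster in X is solved once and cached, the work is governed by the number of valid ways to divide clusters of X rather than by the number of all subsets of L.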
Asymptotic running time?
• Simple discrete math question:
• X = a set of subsets of some set L. Y = { (a, b) ∈ X × X | a ∩ b = ∅, a ∪ b ∈ X }
• Clearly, |Y| < |X|^2
• What’s the maximum |Y| with respect to |X|?
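For small inputs, Y can be enumerated directly from its definition. The sketch below does this by checking all ordered pairs, which takes quadratic time in |X| and is meant only to make the question tangible; the function name and the example X are illustrative assumptions.

```python
def valid_pairs(X):
    """Y = all ordered pairs (a, b) of members of X with a, b disjoint and a | b in X."""
    X = {frozenset(c) for c in X}
    return {(a, b) for a in X for b in X if not (a & b) and (a | b) in X}

# Example X over species H, C, G, O, written as strings of single-letter names.
X = ["H", "C", "G", "O", "HC", "HG", "HCG", "HCGO"]
Y = valid_pairs(X)
size_X, size_Y = len(X), len(Y)
```

In this example |X| = 8 and |Y| = 10, well below |X|^2 = 64.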