Upload
others
View
30
Download
3
Embed Size (px)
Citation preview
ASTRAL Tutorial
• Instructor: Tandy Warnow, [email protected]
• Main developer: Siavash Mirarab, [email protected]
• Github site: https://github.com/smirarab/ASTRAL
• Email: [email protected]
1
This tutorial• Basic ASTRAL approach and statistical consistency
• How to use ASTRAL
• Using Best ML gene trees instead of Multi-Locus Bootstrapping
• Collapsing low support branches
• Modifying the search space
• Whether to filter or not
• Using multiple individuals
• Using SIESTA with ASTRAL (handling multiple optimal solutions)
• Interpreting the output
• Understanding branch support
• Understanding branch lengths
• Literature about ASTRAL
• Another Symposium (right before Evolution 2018) on Phylogenomics, in Montpellier!
2
Unrooted quartets under MSC model
For a quartet (4 species), the most probable unrooted quartet tree (among the gene trees) is the unrooted species tree topology (Allman, et al. 2010)
3
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
θ2=15% θ3=15%θ1=70%Gorilla
Orang.
Chimp
Human
d=0.8
Unrooted quartets under MSC model
For a quartet (4 species), the most probable unrooted quartet tree (among the gene trees) is the unrooted species tree topology (Allman, et al. 2010)
3
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
θ2=15% θ3=15%θ1=70%Gorilla
Orang.
Chimp
Human
d=0.8
The most frequent gene tree = The most likely species tree
4
Orang.Gorilla ChimpHuman Rhesus
More than 4 speciesFor 5 or more species, the unrooted species tree topology can be different from the most probable gene tree (called “anomaly zone”) (Degnan, 2013)
4
Orang.Gorilla ChimpHuman Rhesus
1. Break gene trees into (n 4 ) quartets of species
2. Find the dominant tree for all quartets of taxa
3. Combine quartet trees
Some tools (e.g.. BUCKy-p [Larget, et al., 2010])
More than 4 speciesFor 5 or more species, the unrooted species tree topology can be different from the most probable gene tree (called “anomaly zone”) (Degnan, 2013)
4
Orang.Gorilla ChimpHuman Rhesus
1. Break gene trees into (n 4 ) quartets of species
2. Find the dominant tree for all quartets of taxa
3. Combine quartet trees
Some tools (e.g.. BUCKy-p [Larget, et al., 2010])
More than 4 speciesFor 5 or more species, the unrooted species tree topology can be different from the most probable gene tree (called “anomaly zone”) (Degnan, 2013)
Alternative: weight all 3(n
4 ) quartet topologies by their frequency
and find the optimal tree
Maximum Quartet Support Species Tree• Optimization problem:
•
5
Find the species tree with the maximum number of induced quartet trees shared with the collection of m input gene trees
the set of quartet trees induced by T
a gene treeScore(T ) =
mX
1
|Q(T ) \Q(ti)|
Maximum Quartet Support Species Tree• Optimization problem:
•
• Statistically consistent under the multi-species coalescent model when solved exactly
5
Find the species tree with the maximum number of induced quartet trees shared with the collection of m input gene trees
the set of quartet trees induced by T
a gene treeScore(T ) =
mX
1
|Q(T ) \Q(ti)|
Maximum Quartet Support Species Tree• Optimization problem:
•
• Statistically consistent under the multi-species coalescent model when solved exactly
5
Find the species tree with the maximum number of induced quartet trees shared with the collection of m input gene trees
the set of quartet trees induced by T
a gene tree
NP-Hard [Lafond & Scornavacca, 2016]
Score(T ) =mX
1
|Q(T ) \Q(ti)|
Maximum Quartet Support Species Tree• Optimization problem:
•
• Statistically consistent under the multi-species coalescent model when solved exactly
• ASTRAL: an exact solution using dynamic programming••
5
Find the species tree with the maximum number of induced quartet trees shared with the collection of m input gene trees
the set of quartet trees induced by T
a gene tree
NP-Hard [Lafond & Scornavacca, 2016]
Score(T ) =mX
1
|Q(T ) \Q(ti)|
ASTRAL versions• ASTRAL-I (up to v. 4.7.3) only searches bipartitions in gene trees
• Running time increases polynomially with the number of genes and the number of species
6
ASTRAL versions• ASTRAL-I (up to v. 4.7.3) only searches bipartitions in gene trees
• Running time increases polynomially with the number of genes and the number of species
• ASTRAL-II (v. 4.7.4 to 5.1.0) increased the search space heuristically (not guaranteed to be polynomial)
• Improved the running time
• Can handle polytomies in input gene trees
6
ASTRAL versions• ASTRAL-I (up to v. 4.7.3) only searches bipartitions in gene trees
• Running time increases polynomially with the number of genes and the number of species
• ASTRAL-II (v. 4.7.4 to 5.1.0) increased the search space heuristically (not guaranteed to be polynomial)
• Improved the running time
• Can handle polytomies in input gene trees
• ASTRAL-III (since v. 5.1.1) changes search space again to make sure it grows polynomially with input size
• Improved running time for multifurcating trees; makes it feasible to remove very low support branches from gene trees
6
ASTRAL used by the biologistsEarly use: • Plants: Wickett, et al., 2014, PNAS • Birds: Prum, et al., 2015, Nature • Xenoturbella, Cannon et al., 2016, Nature • Xenoturbella, Rouse et al., 2016, Nature • Flatworms: Laumer, et al., 2015, eLife • Shrews: Giarla, et al., 2015, Syst. Bio. • Frogs: Yuan et al., 2016, Syst. Bio. • Tomatoes: Pease, et al., 2016, PLoS Bio. • Angiosperms: Huang et al., 2016, MBE • Worms: Andrade, et al., 2015, MBE
ASTRAL-II
ASTRAL
ASTRAL-III
bestML versus MLBS
8
• Multi-locus Bootstrapping (MLBS) is possible
• See -b option here: https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md#multi-locus-bootstrapping
• You need a file with bootstrap replicates for each gene tree
• MLBS not suggested as seem to degrade accuracy compared to simply using ML gene trees.
Bioinformatics, 2014, Mirarab et al.
Low support branches• Does it help to contract
branches with low support?
• Yes, but only for very low support branches
9
Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping
genes
All replicates
BMC Bioinformatics, 2018, Zhang et al.
Low support branches• Does it help to contract
branches with low support?
• Yes, but only for very low support branches
• Mostly helps in the presence of low support gene trees
9
Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping
genes
All replicates
BMC Bioinformatics, 2018, Zhang et al.
Low support branches• Does it help to contract
branches with low support?
• Yes, but only for very low support branches
• Mostly helps in the presence of low support gene trees
• More genes allows for more filtering
9
Simulations: 100 taxa, simphy, ILS: around 46% true discordance FastTree, support from bootstrapping
genes
All replicates
BMC Bioinformatics, 2018, Zhang et al.
Modifying the Search Space
• For small enough datasets, you can run ASTRAL in *exact* mode, which will find the globally optimal tree! (-x option)
• ASTRAL’s default usage will not run in exact mode, and will constrain the search space largely using the gene trees.
• It may be beneficial to *expand* the search space, using the “-e” or “-f” options (see tutorial at GitHub site). In particular, you can add species trees you’ve estimated using other methods, such as concatenation, SVDquartets, or species trees that other studies have suggested.
10
Should you filter?• Filtering genes based on missing data?
• Generally not beneficial (see Molloy and Warnow, Systematic Biology 2018)
• Filtering genes based on gene tree estimation error?
• Depends on conditions (see Molloy and Warnow, Systematic Biology 2018)
• Filtering fragmentary sequences from genes while keeping the gene?
• Often beneficial (see Sayyari, Whitfield, and Mirarab, MBE 2018)
11
Multiple individuals
• What if we sample multiple individuals from each species?
• In recently diverged species individuals can be non-monophyletic in gene trees
• Sampling multiple individuals may provide extra signal
12
Handling multiple individuals• Handling multiple individuals is supported
(the relevant paper has been in review for a long time)
• The mapping between names in gene trees and names in the species tree should be provided using the -a option. Two formats are supported (either can be used):
13
cat: siamesecat12,persiancat17,wildcat dog: dog1,labrador,bulldog-1,bulldog-2 horse: pony,shire-2,mustang110
cat 3 siamesecat12 persiancat17 wildcat dog 4 dog1 labrador bulldog-1 bulldog-2 horse 3 pony shire-2 mustang110
Format 1:
Format 2:
Using SIESTA with ASTRAL
• ASTRAL solves an optimization problem, and there can be several optimal solutions.
• You can use SIESTA (Vachaspati and Warnow, BMC Genomics 2018) to enumerate the optimal trees and compute a consensus tree. Open source software at https://github.com/pranjalv123.
14
Local posterior probability• Recall quartet frequencies follow a multinomial distribution
16
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
m2 = 63 m3 = 57m1 = 80
θ2 θ3θ1
Local posterior probability• Recall quartet frequencies follow a multinomial distribution
• P (gene tree seen m1/m times = species tree) = P(θ1 > 1/3)
• Can be solved analytically. The resulting measure is called “the local posterior probability” (localPP)
• Handle n>4 by averaging quartet scores
16
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
m2 = 63 m3 = 57m1 = 80
θ2 θ3θ1
Quartet support vs. localPP
17
!3Background: How many targets are enough? The number of loci required for confidently resolving a phylogenetic relationship depends on its “hardness”. Discordant gene trees due to incomplete lineage sorting (ILS) can make branches of a species tree very difficult to resolve. ILS is a function of the branch length and population size. This means that shorter branches or larger population sizes increase ILS, and can make species tree reconstruction more difficult. The co-PI has recently developed a new approach for estimating the support of a branch in a coalescent-based framework based on the proportion of gene trees that support quartet topologies in gene trees[31]. This new approach is analogous to finding the probability that a three-sided die is loaded towards each facet by observing outcomes of many tosses. We can also ask the reverse question: if we have an estimate of the degree of bias of a die, how many tosses are needed to confidently find its loaded side. Similarly, this approach[31] sheds light on the relationship between the number of genes required to achieve a level of support for a branch of a certain length in coalescent units (Fig. 3). While the co-PI’s method of calculating support, which is implemented in ASTRAL [32, 33], gives the basic mathematical framework, more method development is needed (see below).
Research Approach Specimens Suitable specimens of more than 80 species of Sabellidae and >120 Terebelliformia species have been collected from localities around the world in the last 15 years by the PI. Further collecting will be undertaken for some key taxa for a few more transcriptomes and for targeted DNA capture. We also have agreements from other experts in Sabellidae and Terebelliformia to provide other needed ethanol-preserved specimens for targeted capture. We will obtain ~ 50% of the known sabellid diversity and ~25% of terebelliforms for targeted capture sequencing (~500 species total). Our plan is to sample all genera, particularly focusing on their type species. This should allow for the generation of robust phylogenies, from which we will revise the taxonomy. Specimens will be vouchered at the SIO Benthic Invertebrate Collection and biodiversity information curated on the Encyclopedia of Life. Sequencing (transcriptomes and targeted capture of DNA) A targeted capture approach will be used with an appropriate number of loci (see below) to generate robust phylogenies of Sabellidae and Terebelliformia. Two different sets of targets will be designed from transcriptomes, across Sabellidae and Terebelliformia repectively. We have already generated new transcriptomes for nine sabellids, a serpulid and a fabriciid (bold terminals in Fig. 4), and nine Terebelliformia (bold terminals in Fig. 5) and combined this data with the few publicly available transcriptomes. Based on direct sequencing data already obtained for several genes for Sabellidae and Terebelliformia [Rouse in prep.; Stiller et al. in prep.], our sampling arguably spans the extant diversity of each clade. Several more transcriptomes will be needed to ensure the targeted capture methods will be robust. Methods for generating these transcriptomes will follow those previously used by us[6, 34]. How many targets are needed? A targeted capture pipeline typically starts from a large number of loci sequenced from a small number of species using genome-wide approaches (e.g., transcriptomics)[7, 8, 35]. This initial dataset is then used to select a smaller subset of loci for the targeted capture phase, which involves a larger set of species. To reduce the cost and effort, one would want to minimize the number of loci in the second phase, as long as sufficient loci are selected to confidently resolve relationships of interest. To date the number of loci has been determined in an ad hoc manner, either by the ultra conserved elements discovered[7, 8] or limitations of the targeted capture technology[13]. Initial analyses of our two datasets demonstrates how the number of
Fig. 3. Impact of number of genes (colors) and amount of ILS (x-axis) on support (posterior probability) for a branch (y-axis). Under the coalescent model, for four species separated with an unrooted branch of length d, the probability of a gene tree being identical to the species tree is 1-2/3e-d (we use this as a measure of ILS). We measure support using local pp [31] for various numbers of gene trees (assuming no gene tree estimation error). More ILS (ie., low 1-2/3e-d) requires more genes to get high support.
quartet frequency
Increased number of genes (k ) ⇒ increased support
Decreased discordance ⇒ increased support
18
localPP is more accurate than bootstrapping
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00
False Positive Rate
Recall
MLBS Local PP
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00
False Positive Rate
Recall
MLBS Local PP
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00
False Positive Rate
Recall
MLBS Local PP
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00
False Positive Rate
Recall
MLBS Local PP
Avian simulated dataset (48 taxa, 1000 genes)
the ROC curves show (fig. 5), for the same number offalse positives branches, local posterior probabilities result inbetter recall than MLBS. This pattern is more pronouncedfor shorter alignments, which have increased gene treeerror. For example, for the 250 bp model condition, if wechoose a support threshold that results in 0.01 FPR, with localposterior values, we still recover 84% of correct branches,whereas with MLBS, the same FPR results in retaining70% of correct branches. Thus, for a desired level of precision,better recall can be obtained using local posteriorprobabilities.
Branch LengthBranch length accuracy on the avian dataset was a function ofgene tree estimation error whether ASTRAL or MP-EST wasused (table 4). With true gene trees, branch length log error
FIG. 4. Branch length accuracy on the A-200 dataset with Medium ILS. See supplementary figures S7 and S8, Supplementary Material online forlow and high ILS. The estimated branch length is plotted against the true branch length in log scale (base 10). Blue line: a fitted generalizedadditive model with smoothing (Wood 2011).
Table 3. Branch Length Accuracy on the A-200 Dataset.
Dataset n Log Err RMSE
True gt Est. gt True gt Est. gt
Low ILS 1,000 0.10 0.42 5.57 6.75Low ILS 200 0.16 0.44 6.22 6.99Low ILS 50 0.25 0.48 6.84 7.29Med ILS 1,000 0.03 0.20 0.22 0.86Med ILS 200 0.07 0.22 0.44 0.91Med ILS 50 0.13 0.26 0.74 1.05High ILS 1,000 0.06 0.15 0.03 0.13High ILS 200 0.11 0.18 0.07 0.15High ILS 50 0.18 0.24 0.14 0.19
NOTE.—Logarithmic error (Log Err) and RMSE are shown for true species treesscored with true gene trees or estimated gene trees (Est. gt).
250 500
1000 1500
0.00
0.25
0.50
0.75
1.00
0.00
0.25
0.50
0.75
1.00
0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00False Positive Rate
Rec
all
MLBS Local PP
FIG. 5. ROC curve for the avian dataset based on MLBS and local PPPPsupport values. Boxes show different numbers of sites per gene (con-trolling gene tree estimation error).
Fast Coalescent-Based Support Computation . doi:10.1093/molbev/msw079 MBE
9
by guest on May 28, 2016
http://mbe.oxfordjournals.org/
Dow
nloaded from
[Sayyari and Mirarab, MBE, 2016]
100Xfaster
Branch Length
• The level of discordance is a function of coalescent unit branch length
19
[Sayyari and Mirarab, MBE, 2016]
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
θ2=15% θ3=15%θ1=70%Gorilla
Orang.
Chimp
Human
d=0.8
✓1 = 1� 2
3e�d
Branch Length
• The level of discordance is a function of coalescent unit branch length
• A single quartet (n=4): just reverse the discordance formula to get the ML estimate
19
[Sayyari and Mirarab, MBE, 2016]
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
θ2=15% θ3=15%θ1=70%Gorilla
Orang.
Chimp
Human
d=0.8
✓1 = 1� 2
3e�d
Branch Length
• The level of discordance is a function of coalescent unit branch length
• A single quartet (n=4): just reverse the discordance formula to get the ML estimate
• n>4: average frequencies around a branch
19
[Sayyari and Mirarab, MBE, 2016]
Orang.
Gorilla Chimp
HumanOrang.
GorillaChimp
Human
Orang.
Gorilla
Chimp
Human
θ2=15% θ3=15%θ1=70%Gorilla
Orang.
Chimp
Human
d=0.8
✓1 = 1� 2
3e�d
Branch lengths accuracy
20
True gene trees
1500 1000
500 250
−7.5
−5.0
−2.5
0.0
2.5
−7.5
−5.0
−2.5
0.0
2.5
−7.5 −5.0 −2.5 0.0 2.5 −7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)
estim
ated
bra
nch
leng
th (l
og s
cale
)
−7.5
−5.0
−2.5
0.0
2.5
−7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)
estim
ated
bra
nch
leng
th (l
og s
cale
)
With true gene trees, ASTRAL correctly estimates BL
[Sayyari and Mirarab, MBE, 2016]
Branch lengths accuracy
20
True gene trees
1500 1000
500 250
−7.5
−5.0
−2.5
0.0
2.5
−7.5
−5.0
−2.5
0.0
2.5
−7.5 −5.0 −2.5 0.0 2.5 −7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)
estim
ated
bra
nch
leng
th (l
og s
cale
)
−7.5
−5.0
−2.5
0.0
2.5
−7.5 −5.0 −2.5 0.0 2.5true branch length (log scale)
estim
ated
bra
nch
leng
th (l
og s
cale
)
was only 0.06, corresponding to branches that are about 14%shorter or longer than the true branch. As gene tree estima-tion error increases with reduced number of sites (see table 1for gene tree error statistics), the branch length error also
increases. Thus, while 1,500 bp genes give 0.17 log error,250 bp genes result in 0.59 error, which corresponds tobranches that are on average 3.9 times too short or long.Moreover, unlike true gene trees, the error in branch lengthsestimated based on estimated gene trees is biased towardunderestimation (fig. 6), a pattern that increases in intensitywith shorter alignments.
ASTRAL and MP-EST have similar branch length accuracymeasured by log error for highly accurate gene trees, butASTRAL has an advantage with increased gene tree error(supplementary fig. S10, Supplementary Material online andtable 4). Error measured by root mean squared error (RMSE)(which emphasizes the accuracy of long branches) is compa-rable for the two methods, but MP-EST has a slight advantagegiven accurate gene trees.
Biological DatasetsFor each biological dataset, we show MLBS support and thelocal posterior probabilities, computed based on RAxML genetrees available from respective publications. We also collapsegene tree branches with <33% bootstrap support and usethese collapsed gene trees to draw local posterior probabili-ties. For ease of discussion, we show local posterior probabil-ities as percentages and refer to them simply as posterior orcollapsed posterior (for values based on collapsed gene trees).We discuss the confidence in important branches in eachtree.
1KP. Three of the key relationships studied by Wickett et al.(2014) are the sister branch to land plants, the base of theangiosperms, and the relationship among Bryophytes (horn-worts, liverworts, and mosses). In the ASTRAL tree, manybranches have full support regardless of the measure of thesupport used, but the remaining branches reveal interestingpatterns (fig. 7). The sister relationship betweenZygnematales and land plants receives a moderate 80% BS,but has 100% posterior. Wickett et al. (2014) also recoveredthis relationship by concatenation of various data partitions.There are 12 other branches that have collapsed posteriorsthat are at least 10% higher than BS (fig. 7); no branch hassubstantially higher BS than collapsed posterior. Collapsedposterior for monophyly of Bryophytes and for Amborellaas sister to other angiosperms are 100% (compared with97% and 93% BS, respectively).
When we collapse low-support branches in gene trees,posterior goes up for several branches: nine branches haveimprovements of 10% or more, and only two branches havecomparable reductions. An interesting case is Coleochaetalesas sister to Zygnematalesþ land plants, which has only 62%BS and 61% posterior, but has 100% collapsed posterior.Finally, note that several branches have low posterior, evenafter collapsing.
Our estimated branch lengths are short for several nodes.For example, the branch that unites (ChloranthalesþMagnoliids) and Eudicots has a length of 0.14 in coalescentunits. Other branches that have been historically hard to re-cover also tend to have short branches; however, these arenot necessarily extremely short branches that would
FIG. 6. ASTRAL branch length accuracy on the avian dataset. Logtransformed estimated branch lengths are shown versus true branchlengths, and a generalized additive model is fitted to the data. Onebranch with length 10"6 is trimmed out here, but full results, includ-ing MP-EST, is shown in supplementary figure S9, SupplementaryMaterial online.
Table 4. Branch Length Accuracy for the Avian Dataset.
No. of sites Log Err RMSE
ASTRAL MPEST ASTRAL MPEST
True gt. 0.06 (0.10) 0.07 (0.11) 0.44 (0.44) 0.30 (0.30)1,500 0.17 (0.20) 0.14 (0.18) 0.83 (0.83) 0.70 (0.70)1,000 0.22 (0.27) 0.22 (0.25) 1.08 (1.07) 1.01 (1.00)500 0.37 (0.42) 0.42 (0.46) 1.65 (1.64) 1.65 (1.64)250 0.59 (0.63) 0.81 (0.84) 2.25 (2.24) 2.28 (2.26)
NOTE.—Logarithmic error and root mean squared error are shown for true speciestrees scored with true gene trees or estimated gene trees with various numbers ofsites using ASTRAL and MP-EST. An extremely short branch with length 10"6 wasremoved from the calculations, but error including that branch is shownparenthetically.
Sayyari and Mirarab . doi:10.1093/molbev/msw079 MBE
10
by guest on May 28, 2016
http://mbe.oxfordjournals.org/
Dow
nloaded from
low gene tree error
High gene tree error
Moderate g.t. error
Medium g.t. error
With true gene trees, ASTRAL correctly estimates BLWith error-prone estimated gene trees, ASTRAL underestimates BL
[Sayyari and Mirarab, MBE, 2016]
Caveats• Branch length:
• Only for internal branches unless you have multiple individuals for a species
• In coalescent units, so the *true* value is still a function of population size and generation time in addition to actual time
• Local PP:
• Empirically better than BS support but based on many assumptions
21
Simple run
• Runs ASTRAL version 5.6.1 on 1KP-genetrees.tre and save results in 1KP-speciestrees.tre with logs saved to 1KP-log.txt.
• The dataset is from Wickett, Mirarab, et al. PNAS (2014), Phylotranscriptomic Analysis of the Origins and early Diversification of Land Plants.
23
java -jar ~/workspace/ASTRAL/astral.5.6.1.jar -i 1KP-genetrees.tre -o 1KP-speciestrees.tre 2> 1KP-log.txt
Open the tree in FigTree
• ASTRAL trees are unrooted.
• You will have to root them.
• You may have to open 2-3 times in FigTree (not sure why)
24
Equisetum_diffusum
Bazzania_trilobata
Rhynchostegium_serrulatum
Riccia_sp
Hibiscus_cannabinus
Bryum_argenteum
Juniperus_scopulorum
Vitis_vinifera
Eschscholzia_californicaPodophyllum_peltatum
Physcomitrella_patens
Mesostigma_viride
Rosulabryum_cf_capillare
Prumnopitys_andina
Acorus_americanus
Selaginella_moellendorffii_1kp
Carica_papaya
Roya_obtusa
Chlorokybus_atmophyticus
Spirogyra_sp
Sorghum_bicolor
Thuidium_delicatulum
Sciadopitys_verticillata
Pseudolycopodiella_caroliniana
Taxus_baccata
Sarcandra_glabra
Boehmeria_nivea
Persea_americana
Catharanthus_roseus
Arabidopsis_thaliana
Nuphar_advena
Spirotaenia_minuta
Cycas_rumphii
Penium_margaritaceum
Dioscorea_villosa
Cycas_micholitzii
Kadsura_heteroclita
Sphagnum_lescurii
Ceratodon_purpureus
Aquilegia_formosa
Angiopteris_evecta
Coleochaete_irregularis
Kochia_scoparia
Cunninghamia_lanceolata
Netrium_digitus
Marchantia_polymorpha
Inula_helenium
Nothoceros_aenigmaticus
Ipomoea_purpurea
Yucca_filamentosa
Cylindrocystis_cushleckae
Gnetum_montanum
Metzgeria_crassipilis
Ginkgo_bilobaPsilotum_nudum
Dendrolycopodium_obscurum
Monomastix_opisthostigma
Liriodendron_tulipifera
Mougeotia_sp
Mesotaenium_endlicherianum
Diospyros_malabarica
Ephedra_sinica
Allamanda_cathartica
Zamia_vazquezii
Rosmarinus_officinalis
Saruma_henryi
Pteridium_aquilinum
Sphaerocarpos_texanus
Klebsormidium_subtile
Colchicum_autumnale
Amborella_trichopoda
Pinus_taeda
Leucodon_brachypus
Pyramimonas_parkeae
Sabal_bermudana
Cosmarium_ochthodes
Chaetosphaeridium_globosum
Huperzia_squarrosa
Zea_mays
Chara_vulgaris
Medicago_truncatula
Nephroselmis_pyriformis
Tanacetum_parthenium
Alsophila_spinulosa
Populus_trichocarpa
Cedrus_libani
Anomodon_attenuatus
Oryza_sativa
Marchantia_emarginata
Brachypodium_distachyon
Cylindrocystis_brebissonii
Uronema_sp
Entransia_fimbriata
Nothoceros_vincentianus
Smilax_bona
Larrea_tridentata
Polytrichum_commune
Coleochaete_scutata
Selaginella_moellendorffii_genome
Welwitschia_mirabilis
Ophioglossum_petiolatum
Houttuynia_cordata
Hedwigia_ciliata
Length in coalescent units
• Often better to look at cladogram
25
Entransia_fimbriata
Smilax_bona
Mougeotia_sp
Klebsormidium_subtile
Angiopteris_evecta
Marchantia_emarginata
Pinus_taedaCedrus_libani
Acorus_americanus
Psilotum_nudum
Eschscholzia_californica
Ophioglossum_petiolatum
Hibiscus_cannabinus
Sarcandra_glabra
Amborella_trichopoda
Catharanthus_roseus
Cylindrocystis_brebissonii
Oryza_sativa
Bazzania_trilobata
Uronema_sp
Chlorokybus_atmophyticus
Pyramimonas_parkeae
Brachypodium_distachyon
Houttuynia_cordata
Carica_papaya
Populus_trichocarpa
Dioscorea_villosa
Ephedra_sinica
Selaginella_moellendorffii_1kp
Diospyros_malabarica
Cylindrocystis_cushleckaeMesotaenium_endlicherianum
Coleochaete_scutata
Pteridium_aquilinum
Sorghum_bicolor
Cycas_micholitzii
Chara_vulgaris
Cycas_rumphii
Nuphar_advena
Arabidopsis_thaliana
Prumnopitys_andina
Pseudolycopodiella_caroliniana
Nephroselmis_pyriformis
Metzgeria_crassipilis
Aquilegia_formosa
Riccia_sp
Chaetosphaeridium_globosum
Physcomitrella_patens
Vitis_vinifera
Leucodon_brachypus
Roya_obtusa
Sciadopitys_verticillata
Larrea_tridentata
Equisetum_diffusum
Penium_margaritaceum
Sphaerocarpos_texanus
Zea_mays
Liriodendron_tulipifera
Marchantia_polymorpha
Ipomoea_purpurea
Sabal_bermudana
Rosulabryum_cf_capillare
Medicago_truncatula
Dendrolycopodium_obscurum
Ceratodon_purpureus
Allamanda_cathartica
Selaginella_moellendorffii_genome
Yucca_filamentosa
Saruma_henryi
Mesostigma_viride
Podophyllum_peltatum
Thuidium_delicatulum
Inula_heleniumRosmarinus_officinalis
Welwitschia_mirabilis
Spirogyra_sp
Colchicum_autumnale
Cosmarium_ochthodes
Kadsura_heteroclita
Coleochaete_irregularis
Rhynchostegium_serrulatum
Monomastix_opisthostigma
Taxus_baccata
Gnetum_montanum
Persea_americana
Alsophila_spinulosa
Sphagnum_lescurii
Cunninghamia_lanceolata
Netrium_digitus
Spirotaenia_minuta
Kochia_scoparia
Huperzia_squarrosa
Juniperus_scopulorum
Polytrichum_commune
Zamia_vazquezii
Nothoceros_vincentianus
Tanacetum_parthenium
Ginkgo_biloba
Boehmeria_nivea
Hedwigia_ciliata
Nothoceros_aenigmaticus
Anomodon_attenuatus
Bryum_argenteum
Annotate branches
26
Entransia_fimbriata
Smilax_bona
Mougeotia_sp
Klebsormidium_subtile
Angiopteris_evecta
Marchantia_emarginata
Pinus_taedaCedrus_libani
Acorus_americanus
Psilotum_nudum
Eschscholzia_californica
Ophioglossum_petiolatum
Hibiscus_cannabinus
Sarcandra_glabra
Amborella_trichopoda
Catharanthus_roseus
Cylindrocystis_brebissonii
Oryza_sativa
Bazzania_trilobata
Uronema_sp
Chlorokybus_atmophyticus
Pyramimonas_parkeae
Brachypodium_distachyon
Houttuynia_cordata
Carica_papaya
Populus_trichocarpa
Dioscorea_villosa
Ephedra_sinica
Selaginella_moellendorffii_1kp
Diospyros_malabarica
Cylindrocystis_cushleckaeMesotaenium_endlicherianum
Coleochaete_scutata
Pteridium_aquilinum
Sorghum_bicolor
Cycas_micholitzii
Chara_vulgaris
Cycas_rumphii
Nuphar_advena
Arabidopsis_thaliana
Prumnopitys_andina
Pseudolycopodiella_caroliniana
Nephroselmis_pyriformis
Metzgeria_crassipilis
Aquilegia_formosa
Riccia_sp
Chaetosphaeridium_globosum
Physcomitrella_patens
Vitis_vinifera
Leucodon_brachypus
Roya_obtusa
Sciadopitys_verticillata
Larrea_tridentata
Equisetum_diffusum
Penium_margaritaceum
Sphaerocarpos_texanus
Zea_mays
Liriodendron_tulipifera
Marchantia_polymorpha
Ipomoea_purpurea
Sabal_bermudana
Rosulabryum_cf_capillare
Medicago_truncatula
Dendrolycopodium_obscurum
Ceratodon_purpureus
Allamanda_cathartica
Selaginella_moellendorffii_genome
Yucca_filamentosa
Saruma_henryi
Mesostigma_viride
Podophyllum_peltatum
Thuidium_delicatulum
Inula_heleniumRosmarinus_officinalis
Welwitschia_mirabilis
Spirogyra_sp
Colchicum_autumnale
Cosmarium_ochthodes
Kadsura_heteroclita
Coleochaete_irregularis
Rhynchostegium_serrulatum
Monomastix_opisthostigma
Taxus_baccata
Gnetum_montanum
Persea_americana
Alsophila_spinulosa
Sphagnum_lescurii
Cunninghamia_lanceolata
Netrium_digitus
Spirotaenia_minuta
Kochia_scoparia
Huperzia_squarrosa
Juniperus_scopulorum
Polytrichum_commune
Zamia_vazquezii
Nothoceros_vincentianus
Tanacetum_parthenium
Ginkgo_biloba
Boehmeria_nivea
Hedwigia_ciliata
Nothoceros_aenigmaticus
Anomodon_attenuatus
Bryum_argenteum
1
1
1
1
1
1
1
0.94
1
1
1
1
1
0.99
1
1
1
1
1
1
1
1
1
0.7
0.39
1
0.63
0.81
1
1
0.61
1
0.68
1
11
1
1
1
1
1
1
0.98
1
1
1
1
0.42
1
1
1
1
1
1
1
1
0.98
0.86
1
1
1
1
1
1
1
1 1
1
1
1
1
1
0.76
1
0.81
1
1
0.9
0.81
1
1
1
0.98
0.97
1
1
1
1
1
1
1
1
1
11
0.99
1
0.95
0.331
1
By default, nodes labelled by localPP
The log (1KP-log.txt) includes …
27
This is ASTRAL version 5.6.1 424 trees read from 1KP-genetrees.tre … Number of taxa: 103 (103 species) … Taxon occupancy: {Pseudolycopodiella_caroliniana=282, Kadsura_heteroclita=234, Entransia_fimbriata=294, … … Number of Clusters after addition by greedy: 11043 … Final quartet score is: 339023690 Final normalized quartet score is: 0.8946722590876793 … ASTRAL finished in 36.372 secs
Number of individuals
Number of species
Search space sizeNormalized
quartet scoreRaw quartet score
Main publications• Mirarab, Siavash, Rezwana Reaz, Md. Shamsuzzoha Bayzid, Théo
Zimmermann, M. S. Swenson, and Tandy Warnow. “ASTRAL: Genome-Scale Coalescent-Based Species Tree Estimation.” Bioinformatics 30, no. 17 (2014): i541–48. https://doi.org/10.1093/bioinformatics/btu462.
• Mirarab, S., and T. Warnow. “ASTRAL-II: Coalescent-Based Species Tree Estimation with Many Hundreds of Taxa and Thousands of Genes.” Bioinformatics 31, no. 12 (2015). https://doi.org/10.1093/bioinformatics/btv234.
• Zhang, Chao, Maryam Rabiee, Erfan Sayyari, and Siavash Mirarab. “ASTRAL-III: Polynomial Time Species Tree Reconstruction from Partially Resolved Gene Trees.” BMC Bioinformatics 19, no. S6 (2018): 153. https://doi.org/10.1186/s12859-018-2129-y.
• Sayyari, Erfan, and Siavash Mirarab. “Fast Coalescent-Based Computation of Local Branch Support from Quartet Frequencies.” Molecular Biology and Evolution 33, no. 7 (2016): 1654–68. https://doi.org/10.1093/molbev/msw079.
28
Some other ASTRAL-related papers• Testing for polytomies in species trees using quartet frequencies (Sayyari and Mirarab).
Genes (2018) (Note: Implemented in option -t 10)
• Filtering loci is not beneficial! (Molloy and Warnow), Systematic Biology (2018)
• ASTRAL consistent under models of missing data (Nute, Molloy, Chou, and Warnow). BMC Genomics (2018)
• SIESTA improves ASTRAL (Vachaspati and Warnow). BMC Genomics (2018)
• Visualizing Discordance using DiscoVista (Sayyari, Whitfield, and Mirarab). Molecular Phylogenetics and Evolution (2018)
• Fragmentary sequences can negatively impact ASTRAL trees (Sayyari, Whitfield, and Mirarab). Molecular Biology and Evolution (2017)
• How many genes does ASTRAL need? (Shekhar, Roch, and Mirarab). Transactions on Computational Biology and Bioinformatics (2017)
• Using ASTRAL as a supertree method (Vachaspati and Warnow). Bioinformatics (2017)
• Performance under ILS and HGT (Davidson, Vachaspati, Mirarab, and Warnow). BMC Genomics (2015).
29
For more info• Contact me: Tandy Warnow, [email protected]
• Main developer: Siavash Mirarab, [email protected]
• Software available at Github site: https://github.com/smirarab/ASTRAL
• See tutorial and README at GitHub site: https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md
• Email: [email protected]
• More related papers at http://tandy.cs.illinois.edu/papers-all.html and http://eceweb.ucsd.edu/~smirarab/publications.html
• Related software (ASTRID, SVDquest, FastRFS, SIESTA) on github, and linked from http://tandy.cs.illinois.edu/software.html. See also https://github.com/esayyari/DiscoVista for DiscoVista.
30
Phylogenomics Software Symposium
• Location: ISEM, Montpellier, France on August 17, 2018
• Travel awards available for up to $500 per person
• Tutorial on phylogenomics provided by Siavash Mirarab
• Research talks by Mirarab, Nakhleh, Kosiol, Tannier, Warnow (and other contributed talks)
• Free - no fee, but registration required (limited to 50 people)
• See http://tandy.cs.illinois.edu/PhyloSynth-Symposium.html for more information
31
Annotate (score) a given tree and compute quartet support
• Scores 1KP-speciestrees.tre based on the 1KP-genetrees.tre and annotates branches with the quartet score of the branch (-t 1), saving results into 1KP-speciestrees-qs.tre.
• Check out other annotation options at https://github.com/smirarab/ASTRAL/blob/master/astral-tutorial.md#extensive-branch-annotations
32
java -jar ~/workspace/ASTRAL/astral.5.6.1.jar -i 1KP-genetrees.tre -q 1KP-speciestrees.tre -o 1KP-speciestrees-qs.tre -t 1
Quartet scores as branch-specific estimate of gene tree discordance
33
Quartet score for the main resolution
Brachypodium_distachyon
Alsophila_spinulosa
Selaginella_moellendorffii_genome
Eschscholzia_californica
Riccia_sp
Ophioglossum_petiolatum
Kochia_scoparia
Ephedra_sinica
Sarcandra_glabra
Nephroselmis_pyriformis
Cycas_micholitzii
Liriodendron_tulipifera
Zamia_vazquezii
Chlorokybus_atmophyticus
Smilax_bona
Nothoceros_aenigmaticus
Pseudolycopodiella_caroliniana
Entransia_fimbriata
Marchantia_emarginata
Persea_americana
Sphaerocarpos_texanus
Ipomoea_purpurea
Chara_vulgaris
Nothoceros_vincentianus
Marchantia_polymorpha
Welwitschia_mirabilis
Mougeotia_sp
Podophyllum_peltatum
Taxus_baccata
Cylindrocystis_brebissoniiSpirogyra_sp
Rhynchostegium_serrulatum
Rosulabryum_cf_capillare
Physcomitrella_patens
Larrea_tridentata
Sciadopitys_verticillata
Cosmarium_ochthodesPenium_margaritaceum
Coleochaete_irregularis
Sorghum_bicolor
Ceratodon_purpureus
Ginkgo_biloba
Medicago_truncatula
Houttuynia_cordata
Gnetum_montanum
Metzgeria_crassipilis
Rosmarinus_officinalis
Nuphar_advena
Uronema_sp
Cedrus_libani
Mesotaenium_endlicherianum
Acorus_americanus
Catharanthus_roseus
Hedwigia_ciliataBryum_argenteum
Chaetosphaeridium_globosum
Sabal_bermudana
Cycas_rumphii
Kadsura_heteroclita
Equisetum_diffusum
Netrium_digitus
Cylindrocystis_cushleckae
Inula_helenium
Selaginella_moellendorffii_1kp
Polytrichum_commune
Populus_trichocarpa
Angiopteris_evecta
Roya_obtusa
Tanacetum_parthenium
Colchicum_autumnale
Monomastix_opisthostigma
Psilotum_nudum
Allamanda_cathartica
Diospyros_malabarica
Boehmeria_niveaVitis_vinifera
Sphagnum_lescurii
Coleochaete_scutata
Pinus_taeda
Bazzania_trilobata
Prumnopitys_andina
Carica_papaya
Anomodon_attenuatus
Dendrolycopodium_obscurum
Thuidium_delicatulum
Mesostigma_viride
Dioscorea_villosa
Aquilegia_formosa
Hibiscus_cannabinus
Spirotaenia_minuta
Zea_mays
Pteridium_aquilinum
Huperzia_squarrosa
Arabidopsis_thaliana
Amborella_trichopoda
Oryza_sativa
Juniperus_scopulorum
Saruma_henryi
Yucca_filamentosa
Leucodon_brachypus
Klebsormidium_subtile
Cunninghamia_lanceolata
Pyramimonas_parkeae
48.11
70.41
37.78
55.64
90.25
65.36
87.08
64.57
95.51
51.6
36.96
86.81
75.2
92.65
57.27
72.05
67.74
51.14
82.35
34.03
49.47
92.72
61.25
43.33
40.24
37.22
98.52
33.4898.4
34.3
95.77
69.74
45.03
70.15
76.64
93.95
43.26
58.96
96.9
45.5
42.97
93.18
78.39
98.12
50.75
41.94
91.86
67.53
93.79
58.95
49.52
38.95
99.97
89.7
66.4
39.62
41.88
82.16
71.69
58.96
84.83
47.94
51.89
51.75
96.57
39.25
36.52
81.83
57.51
58.89
95.29
99.11
97.73
90.51
46.9
54.59
64.15
68.29
42.25
41.61
38.67
90.38
43.86
60.47
88.99
88.19
97.39
53.11
81.34
45.5
62.82
84.47
53.71
53.04
38.09
43.5
40.93
62.11
69.27
46.05
42.53
• https://github.com/esayyari/DiscoVista • Sayyari, et. al. “DiscoVista: Interpretable Visualizations of
Gene Tree Discordance.” MPE 122 (2018): 110–15.
34
Discovista: visualizing discordance
• https://github.com/esayyari/DiscoVista • Sayyari, et. al. “DiscoVista: Interpretable Visualizations of
Gene Tree Discordance.” MPE 122 (2018): 110–15.
34
Discovista: visualizing discordance
Empirical running time as a function of the number of genes
35
Zhang et al. Page 13 of 34
●
●
●
●
●
●
●
●
●
●
●
●
●
2.272.09
●
●
●
●
●
●
●
●
●
●
●
●
●
2.292.06
500 1500
28 29 210 211 212 213 214 28 29 210 211 212 213 214
20
25
210
#Genes
Run
ning
tim
e (m
inut
es)
●
●
ASTRAL2
ASTRAL3
Figure 5 Running time versus k. Average running times (4 replicates) are shown for ASTRAL-IIand ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers ofgens (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in thelog/log space and line slopes are shown. ASTRAL-II did not finish on 214 genes in 48 hour.
branches has minimal impact on the discordance (eight discordant branches with
binned MP-EST instead of nine). However, contracting low support branches with
3%–33% thresholds dramatically reduces the discordance with the reference tree (2,
2, 4, 2, 3, and 3 discordant branches, respectively, for 3%, 5%, 7%, 10%, 20%, and
33%). Three thresholds (3%, 5%, and 10%) produce an identical tree (Fig. 4d). The
remaining di↵erences are among the branches that are deemed unresolved by Jarvis
et al. and change among the reference trees as well [5]. Contracting at 50% and 75%
thresholds, however, increases discordance to five and six branches, respectively.
Thus, consistent with simulations, contracting very low support branches seems
to produce the best results, when judged by similarity with the reference trees. To
summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with
all major relations in Jarvis et al., including the novel Columbea group, whereas
the unresolved tree missed important clades (Fig. 4).
3.2 RQ2: Running time improvements
We study the improvements in running time as various parameters change.
3.2.1 Varying the number of genes (k)
We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing
the number of genes from 28 to 214 and forcing X to be the same for both versions to
enable comparing impacts of improved weight calculation. We allow each replicate
run to take up to two days. ASTRAL-II is not able to finish on the dataset with
k = 214, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the
running time over ASTRAL-II and the extent of the improvement depends on k
(Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With
213 genes, the largest value where both versions could run, ASTRAL-III finishes
on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover,
fitting a line to the average running time in the log-log scale graph reveals that
on this dataset, the running time of ASTRAL-III on average grows as O(k2.08),
which is better than that of ASTRAL-II at O(k2.28), and both are better than the
theoretical worst case, which is O(k2.726).
Zhang et al. Page 13 of 34
●
●
●
●
●
●
●
●
●
●
●
●
●
2.272.09
●
●
●
●
●
●
●
●
●
●
●
●
●
2.292.06
500 1500
28 29 210 211 212 213 214 28 29 210 211 212 213 214
20
25
210
#Genes
Run
ning
tim
e (m
inut
es)
●
●
ASTRAL2
ASTRAL3
Figure 5 Running time versus k. Average running times (4 replicates) are shown for ASTRAL-IIand ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers ofgens (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in thelog/log space and line slopes are shown. ASTRAL-II did not finish on 214 genes in 48 hour.
branches has minimal impact on the discordance (eight discordant branches with
binned MP-EST instead of nine). However, contracting low support branches with
3%–33% thresholds dramatically reduces the discordance with the reference tree (2,
2, 4, 2, 3, and 3 discordant branches, respectively, for 3%, 5%, 7%, 10%, 20%, and
33%). Three thresholds (3%, 5%, and 10%) produce an identical tree (Fig. 4d). The
remaining di↵erences are among the branches that are deemed unresolved by Jarvis
et al. and change among the reference trees as well [5]. Contracting at 50% and 75%
thresholds, however, increases discordance to five and six branches, respectively.
Thus, consistent with simulations, contracting very low support branches seems
to produce the best results, when judged by similarity with the reference trees. To
summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with
all major relations in Jarvis et al., including the novel Columbea group, whereas
the unresolved tree missed important clades (Fig. 4).
3.2 RQ2: Running time improvements
We study the improvements in running time as various parameters change.
3.2.1 Varying the number of genes (k)
We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing
the number of genes from 28 to 214 and forcing X to be the same for both versions to
enable comparing impacts of improved weight calculation. We allow each replicate
run to take up to two days. ASTRAL-II is not able to finish on the dataset with
k = 214, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the
running time over ASTRAL-II and the extent of the improvement depends on k
(Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With
213 genes, the largest value where both versions could run, ASTRAL-III finishes
on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover,
fitting a line to the average running time in the log-log scale graph reveals that
on this dataset, the running time of ASTRAL-III on average grows as O(k2.08),
which is better than that of ASTRAL-II at O(k2.28), and both are better than the
theoretical worst case, which is O(k2.726).
Zhang et al. Page 13 of 34
●
●
●
●
●
●
●
●
●
●
●
●
●
2.272.09
●
●
●
●
●
●
●
●
●
●
●
●
●
2.292.06
500 1500
28 29 210 211 212 213 214 28 29 210 211 212 213 214
20
25
210
#Genes
Run
ning
tim
e (m
inut
es)
●
●
ASTRAL2
ASTRAL3
Figure 5 Running time versus k. Average running times (4 replicates) are shown for ASTRAL-IIand ASTRAL-III on the avian dataset with 500bp or 1500bp alignments with varying numbers ofgens (k), shown in log scale (see Fig. S2 for normal scale). A line is fit to the data points in thelog/log space and line slopes are shown. ASTRAL-II did not finish on 214 genes in 48 hour.
branches has minimal impact on the discordance (eight discordant branches with
binned MP-EST instead of nine). However, contracting low support branches with
3%–33% thresholds dramatically reduces the discordance with the reference tree (2,
2, 4, 2, 3, and 3 discordant branches, respectively, for 3%, 5%, 7%, 10%, 20%, and
33%). Three thresholds (3%, 5%, and 10%) produce an identical tree (Fig. 4d). The
remaining di↵erences are among the branches that are deemed unresolved by Jarvis
et al. and change among the reference trees as well [5]. Contracting at 50% and 75%
thresholds, however, increases discordance to five and six branches, respectively.
Thus, consistent with simulations, contracting very low support branches seems
to produce the best results, when judged by similarity with the reference trees. To
summarize, ASTRAL-III obtained on unbinned but collapsed gene trees agreed with
all major relations in Jarvis et al., including the novel Columbea group, whereas
the unresolved tree missed important clades (Fig. 4).
3.2 RQ2: Running time improvements
We study the improvements in running time as various parameters change.
3.2.1 Varying the number of genes (k)
We compare ASTRAL-III to ASTRAL-II on the avian simulated dataset, changing
the number of genes from 28 to 214 and forcing X to be the same for both versions to
enable comparing impacts of improved weight calculation. We allow each replicate
run to take up to two days. ASTRAL-II is not able to finish on the dataset with
k = 214, while ASTRAL-III finishes on all conditions. ASTRAL-III improves the
running time over ASTRAL-II and the extent of the improvement depends on k
(Fig. 5). With 1000 genes or more, there is at least a 2.1X improvement. With
213 genes, the largest value where both versions could run, ASTRAL-III finishes
on average 3.2 times faster than ASTRAL-II (234 versus 758 minutes). Moreover,
fitting a line to the average running time in the log-log scale graph reveals that
on this dataset, the running time of ASTRAL-III on average grows as O(k2.08),
which is better than that of ASTRAL-II at O(k2.28), and both are better than the
theoretical worst case, which is O(k2.726).
Avian simulations: n = 48 species k = 256 to 16,384 gene trees Relatively high ILS Low/moderate gene tree error 4 replicates per dot
Empirical running time seems to increase close to O(k2)
17 hours
36
Zhang et al. Page 34 of 34
●
● ●
● ●
●
1.09
10k
20k
50k
100k
200k
500k
1M
200 500 1000 2000 5000 10000#Species
Size
of X
●
●
●
●
●
●
1.83
1
10
100
1000
200 500 1000 2000 5000 10000#Species
Run
ning
tim
e (m
inut
es)
Figure S8 Empirical running time of ASTRAL-III with n. Average running time is shown forASTRAL-III for datasets with varying n. Averages are over 20 replicates. One replicate of 2000species dataset could not finish in 2 days and is removed from the analysis. Note that thesedatasets have factors other than n that change as well (e.g., the amount of ILS, etc.). Thus, theserunning times should be treated as ball-park estimates. Finally, we note that on the 10,000dataset, we have only 2 replicates and not 20.
Empirical running time as a function of the number of species
Empirical running time seems to increase close to O(n1.8)
Simphy simulations: n = 200 to 10,000 species k = 1000 gene trees Varying ILS Varying gene tree error 20 replicates everywhere, except for 10000 that has only 2 replicates
52 hours