Log Transformation Improves Dating of Phylogenies · Log Transformation Improves Dating of Phylogenies Uyen Mai,1 and Siavash Mirarab2* 1Department of Computer Science and Engineering,

Log Transformation Improves Dating of Phylogenies

Uyen Mai,1 and Siavash Mirarab2*

1Department of Computer Science and Engineering, UC San Diego, CA 92093, USA2Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA

*Corresponding author: [email protected]

Abstract

Phylogenetic trees inferred from sequence data often have branch lengths measured in theexpected number of substitutions and therefore, do not have divergence times estimated.These trees give an incomplete view of evolutionary histories since many applications ofphylogenies require time trees. Many methods have been developed to convert the inferredbranch lengths from substitution unit to time unit using calibration points, but none isuniversally accepted as they are challenged in both scalability and accuracy under complexmodels. Here, we introduce a new method that formulates dating as a non-convex optimiza-tion problem where the variance of log-transformed rate multipliers are minimized acrossthe tree. On simulated and real data, we show that our method, wLogDate, is often moreaccurate than alternatives and is more robust to various model assumptions.

Keywords: time tree, divergence time estimation, phylogenetic dating, molecular dating,non-convex optimization.

1 Introduction

Phylogenetic inference from sequence data does not reveal divergence time (i.e. exact timingof evolutionary events) unless paired with external timing information. Because the level ofdivergence given by sequence data is a function of unknown time and mutation rates, theinferred phylogenies measure branch lengths in the unit of expected number of substitutionsper site. Dating a phylogeny is the process of translating branch lengths from substitution unitto time unit. Such a process requires soft or hard constraints for the timing of some nodes of thetree, leaving the divergence times of the remaining nodes to be inferred. Timing informationis often represented in the form of calibration points obtained by carbon-dating fossils [16] orsampling times for leaves of the tree in fast-evolving viruses and bacteria. While dating isdifficult [26], it is crucial for understanding evolutionary processes [6, 11] and is a fundamentalneed in many clinical applications of phylogenetics and phylodynamics [39].

A large number of computational methods for dating phylogenies are available [e.g., see26? ]. A main point of differentiation between these methods is the clock model they as-sume [29]. Some methods rely on strict molecular clock [41] where rates are assumed to beclose to fixed [e.g., 18, 33, 37]. Other methods allow rates to be drawn independently froma stationary distribution [1, 4, 38] or model the evolution of rates with time [14] and allowcorrelated rates across branches [e.g., 3, 15, 20, 30, 34–36].

Another distinction between methods is the use of parametric models [28]. Many datingmethods use an explicit parametric statistical model and formulate dating as estimating param-eters in a maximum likelihood (ML) or Bayesian inference framework [e.g., 4, 18, 37, 38]. Yetthere are non-parametric methods that rely on optimization problems that do not compute like-lihood under explicit statistical model [e.g., 30, 35], albeit with some underlying assumptions.When the assumed parametric model is close to the reality, we expect parametric methods toperform well. However, parametric methods can be sensitive to model deviations, a problemthat my be avoided by non-parametric methods.

In this paper, we introduce LogDate, a new method of dating rooted phylogenies based ona non-parametric approach. We define mutation rates necessary to compute time unit branch

1

.CC-BY 4.0 International licenseauthor/funder. It is made available under aThe copyright holder for this preprint (which was not peer-reviewed) is the. https://doi.org/10.1101/2019.12.20.885582doi: bioRxiv preprint

https://doi.org/10.1101/2019.12.20.885582

http://creativecommons.org/licenses/by/4.0/

lengths as the product of a single global rate and a set of rate multipliers, one per branch.We seek to find the overall rate and all rate multipliers such that the log-transformed ratemultipliers have the minimum variance. This formulation gives us a constrained optimizationproblem, which although not convex, can be solved in a scalable fashion using the standardapproaches such as sequential least squares programming. While formulation of dating as anoptimization problem is not new [18, 37], our contribution here is the realization that log-transforming the rate multipliers results in more accurate dates. Although the LogDate methodis non-parametric, it corresponds to the ML estimate under specific parameters of a statisticalmodel. In extensive simulation studies and three biological data sets, we show that a weightedversion of LogDate, namely wLogDate, has higher accuracy in inferring node ages compared toalternative methods, including some that rely on time-consuming Bayesian inference.

2 Methods

2.1 Notations

For a rooted binary tree T with n leaves, we let [0, . . . , n − 2] uniquely index internal nodes(root is assigned 0) and let [n−1, . . . , 2n−2] index leaves. Let par(i) denote the parent of nodei and let cl(i) and cr(i) denote the left and right children of node i, respectively. We refer tothe edge between par(i) and i as ei.

We can measure each branch ei of T in either time unit or substitution unit. Let ti denotethe divergence time of node i, i.e. the time when species i diverged into cl(i) and cr(i). Thenτi = ti−tpar(i) is the length of the edge ei in time unit. We then let bi be the length of ei measuredin the unit of expected number of substitutions incurred along edge ei. Let τ = [τ1, . . . , τ2n−2]and b = [b1, . . . , b2n−1]. To convert from τi to bi, let µi be the mutation rate along branch ei,then bi = µiτi.

For models of sequence evolution that depend on bi only, µi and τi are not identifiable.Moreover, b is not observable directly. Instead, the tree generates sequences with length s,from which we can estimate b = [b1, . . . , b2n−1] using standard approaches (e.g., ML). Thus,the given data is b instead of b.

2.2 Dating as an optimization problem

Given the inferred branch lengths in substitution unit b and the divergence time of some nodes,the goal of phylogenetic dating is to find τ . Let f = [f1, . . . , f2n−2] be unknown variables whereeach fi = τi

bi. Our goal of finding τ is identical to finding f . Assuming the mutation rates on

the branches are distributed around a global rate µ, we define νi = fiµ = µτibi

. We formulate

dating as a constrained optimization problem on 2n − 1 variables x = [ν1, . . . , ν2n−2, µ]. Hardcalibration points and sampling times define linear constraints on x (detailed below). Becauseconstraints do not fully determine τ , we need extra assumptions on x, which can be motivatedby properties of the distribution of fi. The choice of the objective function and the optimizationstrategies distinguish different dating approaches and are our main contributions in this paper.

2.2.1 Linear constraints Ψ from sampling times

For each edge ei we need to find the scaling factor fi such that τi = fibi and for any pair of nodes(i, j) associated with sampling times (ti, tj), the following constraint ψ(i, j) must be satisfied

ψ(i, j) : µ(tj − ti) =∑

k∈P (m,j)

νk bk −∑

k∈P (i,m)

νk bk (1)

where m is the LCA of i and j and P (u, v) is the path connecting u and v. Thus, given k timepoints, k(k − 1)/2 constraints must hold. However, only k − 1 of these constraints imply allothers. Let t0 be the unknown time at the root of the tree. For k calibration points t1, . . . , tk,

2


https://doi.org/10.1101/2019.12.20.885582


we can setup k constraints of the form:

Ci : µ(ti − t0) =∑

k∈P (0,i)

νk bk

For any pair (i, j), the linear constraint given in Eq. 1 can be derived by Cj −Ci. Also, we canremove t0 from the set of constraints by subtracting C1 from all other constraints C2, . . . , Ck.This gives us the final k− 1 linear constraints on ν, which we denote as Ψ. We can easily buildΨ using Algorithm 1 (Supplementary material) in linear time.

2.2.2 Optimization Criteria

Recall that each fi is influenced both by the distribution of µi and the random error in thebi estimation of bi. In traditional strict-clock models [41] such as those used by Least-SquaresDating (LSD) [37] and Langley-Fitch (LF) [18], a constant rate is assumed throughout the tree(∀iµi = µ ). Under this model, the distribution of fi is determined by deviations of bi from bi.

Langley and Fitch [18] (LF) modeled the number of substitutions per sequence site on abranch i by a Poisson distribution with mean λ = µτi. The total observed number of substi-tutions can be modeled as sbi ∼ Poisson(sµτi) (s = the sequence length). The log likelihoodfunction is:

2n−2∑i=1

(sbi log(sbi)− log((sbi)!)

)+

2n−2∑i=1

sbi (log νi − νi) .

Given s and bi, LF finds x that maximizes likelihood subject to the constraints:

x∗ = arg minν

2n−2∑i=1

bi(νi − log νi

)subject to Ψ. (2)

To et al. 37 assume bi follows a Gaussian model bi ∼ Gaussian(µτi, σ2i ) and approximate

the variance by bis (the method includes smoothing strategies omitted here). Then, the negative

log likelihood function is proportional to:

2n−2∑i=1

(bi − µτi)2

σ2i≈

2n−2∑i=1

s

bi(bi − µτi)2 =

2n−2∑i=1

sbi(1− νi)2.

Thus, the ML estimate can be formulated as:

x∗ = arg minν

2n−2∑i=1

bi(1− νi)2 subject to Ψ. (3)

Both LF and LSD have convex formulations. Langley and Fitch 18 proved that their negativelog-likelihood function is convex and thus the local minimum is also the global minimum. Ourconstraint-based formulation of LF also can be easily proved convex by showing its Hessianmatrix is positive definite. To et al. 37 pointed out their objective function is a weighted least-squares. Using our formulation, we also see that Eq. 3 together with the calibration constraintsform a standard convex quadratic optimization problem which has a unique analytical solution.

2.3 LogDate Method

LF and LSD only seek to model the Poisson-like errors in b and ignore true rate heterogeneity.Strict-clock assumption is now believed to be unrealistic in many settings [12, 32? ], motivatingrelaxed clocks, typically by assuming that µis are drawn i.i.d. from some distribution [e.g.,1, 4, 38]. Most methods rely on presumed parametric distributions (typically, LogNormal,Exponential, or Gamma) and estimate parameters using ML [38], MAP [1], or MCMC [2, 4].

3


https://doi.org/10.1101/2019.12.20.885582


(a) (b)

LF

LogDate

LSD

0.00

0.25

0.50

0.75

1.00

1/4 1/2 1 2 4νi

Nor

mal

ized

pen

alty

0.05−0.95

0.2−0.8

1e−11

1e−07

1e−03

1e+01

0.0001 0.0010 0.0100 0.1000

bi

b i^b i

Figure 1: (a) The penalty associated to multiplying a single edge i with multiplier νi in LSD, LF, andLogDate approaches, as shown in Equations 2, 3, and 4. To allow comparison, we normalize the penaltyto be zero at ν = 1 and to be 1 at ν = 4. (b). The confidence interval of the ratio between estimatedand true branch length using the Poisson model. The confidence interval of Poisson follows 1

2χ22sbi

on

the lower side and 12χ

22sbi+2 on the upper side and is drawn for α

2 = 0.0.5 and α2 = 0.2 to get 0.2–0.8 and

0.05–0.095 intervals for 0.0001 ≤ bi ≤ 0.4.

To develop a non-parametric method that does not assume a strict clock, we avoid explicitassumptions on the model that generates the rate multipliers. Instead, we follow a simpleprinciple: we assume that given two solutions of x both satisfying the calibration constraints,the solution with less variability in fi values is more preferable. Thus, we prefer solutions thatminimize deviations from a strict clock while allowing deviations.

A natural way to formalize this principle is to minimize the variance of fi. This can beachieved by finding µ and all νi such that νi is centered at 1 and

∑2n−2i=1 (νi − 1)2 is minimized.

Interestingly, the ML function used by LSD (Eq. 3) is a weighted version of this approach.However, the minimum variance principle results in a fundamental asymmetry: multiplying ordividing the rate of a branch by the same factor is penalized differently (Fig 1a); e.g., the penaltyfor νi = 4 is more than ten times larger than νi = 1/4. The LF model is more symmetrical thanLSD but remains asymmetrical (Fig 1a). This property results from the asymmetric distributionof the Poisson distribution around its mean, especially for small mean, in log scale. For example,for small bi, the lower limit of the confidence interval of bi/bi is further from the mean on a logscale than the upper limit (Fig 1b). Thus, a very small bi/bi, which is within the realm of possibleoutcomes, is not penalized heavily. In contrast, going above the mean rate is heavily penalized.

Without a model of rate variation, we assert that the penalty for multiplying a branch by afactor of ν should be no different than dividing the branch by ν. Note that this assertion is onlywith respect to true variations of the mutation rate and not branch length estimation error.This simple property (we call it symmetry of ratios) is not offered by minimization of variance(note that LF and LSD do not seek to model true rate variation; they target branch lengthestimation error only). To ensure the symmetry of ratios, we propose taking the logarithm ofthe multipliers νi before minimizing their variance. This log-transformation is the main insightof our method.

4


https://doi.org/10.1101/2019.12.20.885582


2.3.1 LogDate optimization function

Given b and the set of calibration constraints described earlier, we seek to find x = [ν1, . . . , ν2n−2, µ]as:

x∗ = arg minν

2n−1∑1

log2(νi) subject to Ψ. (4)

This objective function satisfies the symmetry of ratio property that we desire (Fig. 1a).Since νi values are multipliers of rates around µ, it is also easy to see that assuming µ is the meanrate, this optimization problem minimizes the variance of the log-transformed rate multipliers(around their mean 1). Note that the objective function only depends on νi and µ is includedonly in the constraints. This setting reduces the complexity of the objective function and speedsup the optimization search. Since the values of νi close to 1 are preferred in Eq. 4, the optimalsolution would push µ to the mean rate.

It is straightforward to prove that if νi rate multipliers are drawn from a LogNormal dis-tribution with the symmetry of ratios (that is, with any parameters that result in mode 1),the LogDate optimization problem is identical to finding the ML solution. The formal proof isgiven in Claim 1 (Supplementary material).

The vanilla LogDate formulation, however, has a limitation: by allowing rates to varyfreely in a multiplicative way, it fails to model the Poisson process that governs branch length

estimation error ( bibi ). Following the Poisson model, i.e. sbi ∼ Poisson(sµτi), the variance of

bi can be computed by noting that var(sbi) = sµτi and thus var(bi) = µτis = bi

s . As Figure 1b

shows, when bi is small, bi can easily vary by several orders of magnitude around bi. Furthermore,these variances are not symmetrical; drawing values several factors smaller than the mean ismore likely than drawing values above the mean by the same factor. This lack of symmetry inthe Poisson process is not captured in our LogDate formulation.

Since the true branch length bi is unknown, a common practice is to use the estimated biin place of bi to estimate its variance as bi/s. This explains why both LF and LSD objectivefunctions (Eqs. 2 and 3) put a weight of bi to each term of νi. Following the same strategy,we propose weighting each log2(νi) term in a way that reduces the penalty and allows moredeviations in the log space if the branch is small (and is thus subject to higher error). Sincewe log transform νi, explicitly computing weights in a way that is guaranteed to cancel out thePoisson variations is challenging. However, since the weights are meant to reflect the varianceof bei (logarithmic scale), we infer that weights should be monotonically increasing with bi

(Fig. 1b). We use√bi as weights, a selection driven by simplicity and empirical performance.

The shortest branches require even more care. When the branch is very short, for a limited-size alignment, with high probability, we will see zero mutations. For these no-event branches,tree estimation tools report arbitrary small lengths. Therefore, using bi in place of bi to estimateits variance is not effective for very small branches. To deal with this challenge, Sanderson31 collapses all branches with length bi < 1/s). To et al. 37 proposed adding a smoothingconstant c/s to each bi to estimate bi variance, where c is a parameter that the user can tune.Following a similar strategy, we propose adding a small constant b to each bi. We choose b tobe the maximum branch length that produces no mutations with probability at least 1− α forα ∈ [0, 1]. Assuming sb ∼ Poisson(sb), it is easy to show that the maximum b that satisfies

Pr(sb = 0|b = b) ≥ 1 − α is b = − log(1−α)s . By default, we set α = 0.01 but allow the user to

adjust this parameter. Thus, we define the weighted LogDate (wLogDate) as follows:

x∗ = arg minν

2n−1∑1

√b+ bi log2(νi)

subject to Ψ.

(5)

5


https://doi.org/10.1101/2019.12.20.885582


2.3.2 Solving the optimization problem

LogDate problems (Eq. 4 and Eq. 5) are non-convex, and hence solving them is non-trivial.The problem is convex if 0 ≤ νi ≤ e. For small clock deviation and small estimation error inbi, the νi values should be small so that the problem becomes convex with one local minimum.However, as νi ≤ e is not guaranteed, we have to rely on gradient-based numerical optimizationmethods to search for multiple local minima and select the best solution we can find. To searchfor local minima, we use the Scipy solver with trust-constr [17] method. To help the solverwork efficiently, we incorporate three techniques that we next describe.

Computing Jacobian and Hessian matrices analytically helps speedup the search. Bytaking the partial derivative of each νi, we can compute the Jacobian, J , of Eq. 5. Also, sinceEq. 5 is separable, its Hessian H is a (2n − 2) × (2n − 2) diagonal matrix. Simple derivationsgive us:

J =[2

√b+ b1

log ν1ν1

, . . . , 2

√b+ b2n−2

log ν2n−2ν2n−2

]Tand Hii =2

√b+ bi

1− log νiν2i

.

Sparse matrix representation further saves space and computational time. The Hessianmatrix is diagonal, allowing us to store only the diagonal elements. In addition, the constraintsmatrix defined by Ψ is highly sparse. If all sampling times are given at the leaves, the numberof non-zero elements in our (n − 1) × (2n − 1) matrix is O(n log n) (Claim 3; Supplementarymaterial). If the tree is either caterpillar or balanced, the number of non-zeroes reduced to Θ(n).Thus, we use sparse matrix representation implemented in the Scipy package. This significantlyreduces the running time of LogDate.

Starting from multiple feasible initial points is necessary given that our optimizationproblem is non-convex. Providing initial points that are feasible (i.e. satisfied the calibra-tion constraints) helps the SciPy solver work efficiently. We designed a heuristic strategy tofind multiple initial points given sampling times t1, . . . , tn of all the leaves (as is common inphylodynamics).

We first describe the process to get a single initial point. We compute the root age t0 andµ using root-to-tip regression (RTT) [33]. Next, we scale all branches of T to conform with Ψas follow: let m = arg mini ti (breaking ties arbitrarily). Let d(r, i) denote the distance fromthe root r to node i and P (r,m) denote the path from r to m. For each node i in P (r,m), weset τi = bi(tm − t0)/d(r,m). Then going upward from m to r following P (m, r), for each edge(i, j) we compute tj = ti − τi and recursively apply the process on the clade i. At the root, weset tm to the second oldest (second minimum) sampling time and repeat the process on a newpath until all leaves are processed. Since RTT gives us µ, to find ν we simply set νi = µτi/bi.

To find multiple initial points, we repeatedly apply RTT to a set of randomly selected cladesof T and scale each clade using the aforementioned strategy. Specifically, we randomly select aset S of some internal nodes in the tree and add the root to S. Then, by a post-order traversal,we visit each u ∈ S and date the clade u using the scaling strategy described above. We thenremove the entire clade u from the tree but keep the node u as a leaf (note that the age of uis already computed) and repeat the process for the next node in S. The root will be the lastnode to be visited. After visiting the root, we have all the τi for all i. Finally, we set µ to themutation rate given by RTT at the root and for each branch i, compute νi as µτi

bi. In a tree of

n leaves, we have 2(n−1) − 1 ways to select the initial non-empty set S, giving us enough roomfor randomization.

6


https://doi.org/10.1101/2019.12.20.885582


2.4 Experiments

2.4.1 Datasets

Simulated data. To et al. 37 simulated a dataset of HIV env gene. Their time trees weregenerated based on a birth-death model with periodic sampling times. There are four treemodels, namely D995 11 10 (M1), D995 3 25 (M2), D750 11 10 (M3), and D750 3 25 (M4),each of which has 100 replicates for a total of 400 different tree topologies. M1 and M2 simulateintra-host HIV evolution and are ladder-like while M3 and M4 simulate inter-host evolution andare balanced. Also, M4 has much higher root-to-tip distance (mean: 57) compared to M1–M3(22, 33, and 29). Starting from conditions simulated by To et al. 37, we use the provided timetree to simulate the clock deviations. Using an uncorrelated model of the rates, we draw eachrate from one of three different distributions, each of which is centered at the value µ = 0.006as in To et al. 37. Thus, we set each µi to xiµ where xi is drawn from one of three distributions:LogNormal (mean:1.0, std: 0.4), Gamma (α = β = 6.05), and Exponential (λ = 1). Sequencesof length 1000 were simulated for each of the model conditions using SeqGen [24] under thesame settings as To et al. 37.

H1N1 2009 pandemic. We re-analyze the H1N1 biological data provided by To et al. 37which includes 892 H1N1pdm09 sequences collected worldwide between 13 March 2009 and 9June 2011. We reuse the estimated PhyML [8] trees, 100 bootstrap replicates, and all the resultsof the dating methods other than LogDate that are provided by To et al. 37.

San Diego HIV. We study a dataset of 926 HIV-1 subtype B pol sequences obtained in SanDiego between 1996 and 2018 as part of the PIRC study. We use IQTree [23] to infer a treeunder the GTR+Γ model, root the tree on 22 outgroups, then remove the outgroups. Becauseof the size, we could not run BEAST.

West African Ebola epidemic. We study the dataset of Zaire Ebola virus from Africa,which includes 1,610 near-full length genomes sampled between 17 March 2014 and 24 October2015. The data was collected and analyzed by Dudas et al. 5 using BEAST and re-analyzedby Volz and Frost 38 using IQTree to estimate the ML tree and treedater to infer node ages.We run LSD, LF, and wLogDate on the IQTree from Volz and Frost 38 and use the BEASTtrees from Dudas et al. 5, which include 1000 sampled trees (BEAST-1000) and the Maximumclade credibility tree (BEAST-MCC). To root the IQTree, we search for the rooting positionthat minimizes the triplet distance [27] between the IQTree and the BEAST-MCC tree.

2.4.2 Methods Compared

We compare wLogDate to three other methods: LSD, LF as implemented in r8s [31], andBEAST [2]. We used phyML [8] to re-estimate branch lengths of the true tree topology insubstitution unit and used each of the three methods wLogDate, LSD (in the rooted and con-strained mode), and LF to infer the time tree. wLogDate was run with 10 feasible startingpoints.

For the Bayesian method BEAST, we fixed the true rooted tree topology and only inferrednode ages. Following To et al. 37, we ran BEAST using HKY+Γ8 and coalescent with constantpopulation size tree prior. We used two clock models on the rate parameter: the strict-clock (i.e.fixed rate) and the LogNormal distribution. For the strict-clock prior, we set clock rate prior toa uniform distribution between 0 and 1. For the LogNormal prior, we set the ucld.mean priorto a uniform distribution between 0 and 1, and ucld.stdev prior to an exponential distributionwith parameter 1/3 (default). We always set the length of the MCMC chain to 107 generations,burn-in to 10%, and sampling to every 104 generations (identical to To et al. 37).

7


https://doi.org/10.1101/2019.12.20.885582


●●●●●

●●●

●

●

●●

●

●●

●

●

●

●

●

●●●●

●

●●

●●

● ●●

●

●●

●

●

●

●

●

●●●●●

●●●●

●●

●

●●

●

●●

●

●●

Lognormal Gamma Exponential

M1 M2 M3 M4 M1 M2 M3 M4 M1 M2 M3 M40.00

0.25

0.50

0.75

0.05

0.10

0.025

0.050

0.075

0.100

0.125

name

rmse

/hei

ght

● ● ● ● ●B_strict B_lnorm LSD LF wLogDate

Figure 2: Distributions of RMSE normalized by the tree height for internal node ages inferred by allmethods on model trees M1–M4, each with clock models Lognorm, Gamma, and Exponential. Boxesshow median, 10% and 90% quantiles; dots and error bars show mean and standard error (100 replicates).

2.4.3 Error Measurement

On the simulated datasets, where the ground truth is known, we compare the accuracy of themethods using several metrics. We compute the root mean squared error (RMSE) measuredbetween true and estimated vector of times (τ) and normalize it by tree height. We also rankmethods by RMSE rounded to two decimal digits (to avoid different ranks when errors aresimilar). Besides the exact timing of nodes, we examine the time to Most Recent CommonAncestor (tMRCA) of all nodes and the mutation rate. On real data, we show lineage-through-time (LTT) plots [22], which trace the number of lineages at any point in time, and comparetMRCA times to the literature.

3 Results

3.1 Simulated data

We first evaluate the convergence of the ScipPy optimization method across 10 starting points(Fig. S1). LogDate and wLogDate converge to a stable result after 50–200 iterations, dependingon the model condition. Convergence seems easier when rates are Gamma or LogDate andharder when the rates are Exponential. We then compared LogDate and wLogDate (Fig. S2).The wLogDate method is never worse than LogDate, and under some conditions (e.g., M4 withthe Exponential rate variation), weighting results in dramatic improvements. Thus, in the restof the paper, we focus on wLogDate, designated as the default way of using the method.

Measured by RMSE, the accuracy of all methods varies substantially across model trees(M1 – M4) and models of rate variation (Fig. 2). Comparing methods, for many conditions,wLogDate has the lowest error, and in many others, it is ranked second best (Table 1). Acrossall conditions, wLogDate has a mean rank of 1.75, followed by BEAST with strict clock witha mean rank 2; mean normalized RMSE of wLogDate, LF, BEAST-strict, BEAST-LogNormal,and LSD are 0.072, 0.074, 0.077, 0.087, and 0.116, respectively.

For all methods, errors are an order of magnitude smaller for the LogNormal and Gammamodels of rate variations compared to the Exponential model. In terms of trees, M4, which

8


https://doi.org/10.1101/2019.12.20.885582


model Clock model B lnorm B strict LF LSD wLogDateLogNormal 1 3 4 5 1

M4 Gamma 2 4 3 5 1Exponential 4 3 2 5 1LogNormal 2 3 3 5 1



M1 Gamma 5 1 1 4 1Exponential 2 1 3 3 5

Averaged rank 3.5 2 2.5 4.5 1.75Table 1: Ranking of the dating methods under different model conditions. For each model condition, theaveraged RMSE of all internal node ages is computed and ranked among the dating methods (roundedto two decimal digits). Best method shown in bold.

simulates inter-host evolution and high the largest height, presents the most challenging casefor all methods. Interestingly, wLogDate has the best accuracy under all parameters of M4 treeand also all parameters of M3 (thus, both inter-host conditions). On M1, all methods have verylow error and perform similarly (Fig. 2).

Among other methods, results are consistent with the literature. Despite its conceptualsimilarity to wLogDate, LSD has the worst accuracy. On M1 and M2, LSD is competitivewith other methods; however, on M3 and M4, it has a much higher error, especially with theExponential model of rate variation. With the LogNormal clock model, BEAST-LogNormalis better than BEAST-strict only for M4 but not for M1–M3; in fact, BEAST-LogNormalhas the highest error for the M2 condition. This result is surprising given the correct modelspecification. Nevertheless, BEAST-LogNormal is competitive only under the LogNormal modelof rate variation and is one of the two worst methods elsewhere. Thus, BEAST-LogNormal issensitive to model misspecification. In contrast, BEAST-strict is less sensitive to the model ofrate variation and ranks among the top three in most cases. In particular, BEAST-strict isalways the best method for intra-host ladder-like trees M1 and M2.

Accuracy of tMRCA follows similar patterns (Fig. 3). Again, the Exponential rate variationmodel is the most difficult case for all methods, resulting in biased results and highly variableerror rates for most methods. In all conditions of M3 and M4, wLogDate has the best accuracyand improves on the second best method by 9 – 66% (Table 2). For M1 and M2, BEAST-strictis often the best method. The mean tMRCA error of wLogDate across all conditions is 4.83(years), which is substantially better than the second best method, BEAST-strict (6.21).

In terms of the mutation rate, the distinction between methods is less pronounced (Table S1).wLogDate is the best method jointly with the two strict clock models BEAST-strict and LF.Overall, even though LF and wLogDate tend to over-estimate mutation rates, both have lessbiased results compared to other methods (Fig. 3). LSD and BEAST-LogNormal have thehighest errors; depending on the condition, each can overestimate or underestimate the ratebut LSD tends to underestimate while BEAST-LogNormal tends to overestimate. On M1,wLogDate and LF have a clear advantage over BEAST-strict, which tends to over-estimate

9


https://doi.org/10.1101/2019.12.20.885582


●●●●● ●●●●●●●●

●

●

●●●●● ●●●●●

●●

●●

●

●●●●

● ●●●●

● ●●

●

●

●

●●●●● ●●●●●

●●

●

●

●

M1 M2 M3 M4

Lognormal Gamma Exponential Lognormal Gamma Exponential Lognormal Gamma Exponential Lognormal Gamma Exponential

−300

−200

−100

0

−80

−60

−40

−20

0

−30

−20

−10

0

10

−10

−5

0

clockModel

tMR

CA

B_strict B_lnorm LSD LF wLogDate

●●

●

●

●●

●

●

●

●

●

●

●

●

●

●●●

●

● ●●●

●

● ●●

●

●

●

●●

●●●

●●

●

●●●

●

●

●●

●●●

●● ●●●

●● ●

●

●

●●

M1 M2 M3 M4

Lognormal Gamma Exponential Lognormal Gamma Exponential Lognormal Gamma Exponential Lognormal Gamma Exponential

1e−04

1e−03

1e−02

0.002

0.003

0.005

0.003

0.010

0.030

0.003

0.005

0.010

clockModel

mu

B_strict B_lnorm LSD LF wLogDate

Figure 3: The inferred (top) tMRCA and (bottom) expected mutation rate on different tree models andclock models. Distributions are over 100 replicates. The solid horizontal lines indicate the true mutationrate and tMRCA. Each black is the average of the inferred values for each method under each modelcondition. We remove 6 outlier data points (2 LF, 1 LSD, 2 BEAST-LogNormal, 1 BEAST-Strict) withexceptional incorrect tMRCA (< −350) in the M4/Exponential model.

the rate. On M2, the three methods have similar accuracy. For M3 and M4, BEAST-strictunder-estimates the rate under the Exponential model of rate variation, and wLogDate and LFare closer to the true value. For all methods, M4 is the most challenging case.

3.2 Biological data

On the H1N1 dataset, the best available evidence has suggested a tMRCA between December2008 and January 2009 [10, 19, 25]. wLogDate inferred the tMRCA to be 14 December 2008(Fig. 4a), which is consistent with the literature. LF and LSD both infer a slightly earliertMRCA (10 November 2008), followed by BEAST-strict and BEAST-lognorm (October 2008and July 2008), and finally BEAST runs using the phyML tree (Feb. 2008 for strict and July2007 for LogNormal). While the exact tMRCA is not known on this real data, the resultsdemonstrate that wLogDate, on a real data, produces times that match the presumed groundtruth.

On the HIV dataset, wLogDate inferred a tMRCA of 1958 with only a handful of lineagescoalescing in the 1950s and most others coalescing in 1960s and early 1970s (Fig. S3). Therecovered tMRCAs is within the range postulated in the literature for subtype B [7, 40] andthe fact that randomly sampled HIV lineages across USA tend to coalesce deep in the treeis a known phenomenon. LF and LSD recovered the tMRCA of 1952 and 1953, respectively.Comparing to wLogDate, these two strict-clock methods postulate an earlier burst of subtypeB (Fig. 4c). We were not able to run BEAST on this dataset.

On the Ebola dataset, the BEAST-1000 trees obtained from Dudas et al. 5 inferred thetMRCA to be between 13 September 2013 and 26 January 2014 (95% credible interval) and the

10


https://doi.org/10.1101/2019.12.20.885582


Tree Clock Model B strict B lnorm LF LSD wLogDate

Lognormal 6.99 9.50 6.66 7.38 6.11 ( 9% ↓)M4 Gamma 7.83 10.48 7.02 8.48 6.28 (12% ↓)

Exponential 43.5 140.9 116.2 62.2 32.5 (34% ↓)Lognormal 1.37 2.60 1.21 1.39 1.03 (17% ↓)

M3 Gamma 1.60 3.14 1.23 1.67 0.97 (27% ↓)Exponential 5.76 34.67 4.87 8.35 2.94 (66% ↓)

Lognormal 1.40 1.41 1.50 1.63 1.47 ( 5% ↑)M2 Gamma 1.54 1.44 1.75 1.92 1.66 (15% ↑)

Exponential 3.39 4.59 4.28 5.27 3.72 ( 10% ↑)Lognormal 0.28 0.28 0.30 0.37 0.30 ( 7% ↑)

M1 Gamma 0.27 0.29 0.32 0.35 0.30 (11% ↑)Exponential 0.60 1.11 0.79 0.82 0.69 (15% ↑)

Average 6.21 17.54 12.17 8.13 4.83Table 2: Mean absolute error of the inferred tMRCA of BEAST strict, BEAST lognorm, LF, LSD, andwLogDate. For wLogDate, parenthetically, we compare it with the best (↑) or second best (↓) methodfor each condition. We show percent improvement by wLogDate, as measured by the increase in theerror of the second best method (wLogDate or the alternative) divided by the error of the best method.

BEAST-MCC inferred the tMRCA to be 5 December 2013 as reported by Volz and Frost 38.Here, wLogDate inferred a tMRCA on 7 December 2013, which is very close to the estimate byBEAST. Both LF and LSD inferred an earlier tMRCA: 29 October 2013 for LF and 2 October2013 for LSD, but still within the 95 per cent credible interval of BEAST-1000. LTT plotsshowed a similar reconstruction by all methods for this dataset (Fig. 4d).

We also compare running times of dating methods on the three real biological datasets(Table S2). LSD was always the fastest, running in just seconds, compared to minutes for LFand wLogDate. LF is faster than wLogDate on the H1N1 and HIV data, while on Ebola data,wLogDate is faster. We report the running time for wLogDate as the sequential run of 10independent starting points and note that wLogDate can easily be parallelized.

4 Discussion and future work

We introduced (w)LogDate, a new method for dating phylogenies based on a non-convex op-timization problem. We showed that by log-transforming the rates before minimizing theirvariance, we obtain a method that performs much better than LSD, which is a similar methodwithout the log transformation. Our relatively simple method also outperformed other existingmethods, including the Bayesian methods, which are much slower. The improvements were mostpronounced in terms of the estimation of tMRCA and individual node ages and less so for themutation rate. Moreover, improvements are most visible under the hardest model conditions.

The log transformation results in a non-convex optimization problem, which is harder tosolve than the convex problems solved by LSD and LF. However, we note that the problemis convex for rate multipliers between 0 and e. In addition, given the advances in numericalmethods for solving non-convex optimization problems, insistence on convex problems seemsunnecessary. Our results indicate that this non-convex problem can be solved efficiently in thevaried settings we tested. The main benefit of the log transformation is that it allow us to definea scoring function that assigns symmetrical penalties for increased or decreased rates (Fig. 1a).

The accuracy of LogDate under varied conditions we tested is remarkable and is a testamentto the power of non-parametric methods. LogDate does not depend on a particular model of rate

11


https://doi.org/10.1101/2019.12.20.885582


(a) (b)

● ●

●

●

●●

●

2004

2005

2006

2007

2008

2009Y

ea

r (t

MR

CA

)

●

●

●

●

●

●

●B_strict

B_lnorm

phyML_B_strict

phyML_B_lnorm

LSD

LF

wLogDate0

100

200

300

400

2008 2009 2010 2011time

linea

ges

(c) (d)

0

200

400

600

1960 1980 2000 2020time

linea

ges

0

100

200

300

400

2014.0 2014.5 2015.0 2015.5time

linea

ges

Figure 4: (a) Inferred tMRCA of the H1N1 dataset. Boxplots represent the median, maximum, minimum,97.5% and 2.5% quantiles of the bootstrap estimates for LF, LSD, and wLogDate, and of the posteriordistribution for BEAST. Yellow dot shows the inferred tMRCA of the best ML or MAP tree. BEASTwas run with 4 different settings: B strict and B lnorm allow BEAST to infer both tree topology andbranch lengths, with strict and LogNormal clock models; phyML B strict and phyML B lnorm fixed thetopology to the rooted phyML tree given to BEAST. All other methods (LSD, LF, and wLogDate) wererun on the rooted phyML trees. Results for LSD, LF, and BEAST are all obtained from To et al. 37.(b) LTT plot for all methods on the H1N1 data. (c) LTT plot of fast methods on the HIV dataset. (d)LTT plot of BEAST, LSD, LF, and wLogDate on the Ebola dataset.

evolution and is based on a simple optimization problem. We emphasize that the parametricmodels used in practice are employed for mathematical convenience and not because of a strongbiological reason to believe that they capture real variations in rates. Thus, being free from aparametric model, in our judgment, is a strength of the method.

The LogDate approach can be further improved in several aspects. First, the current formu-lation of LogDate assumes a rooted phylogenetic tree, whereas most inferred trees are unrooted.Rooting phylogenies is a non-trivial problem and can also be done based on principles of min-imizing rate variation [21]. Similar to LSD, LogDate can be generalized to unrooted trees byrooting the tree on each branch, solving the optimization problem for each root, and choosingthe root that minimizes the (w)LogDate objective function. We leave the careful study of suchan approach to the future work.

Beyond rooting, the future work can explore the possibility of building a specialized solver forLogDate to gain speedup. One approach could be exploiting the special structure of the searchspace defined by the tree, which is the strategy employed by LSD to solve the least-squaresoptimization in linear time. Divide-and-conquer may also prove effective.

The weighting scheme used in LogDate is chosen heuristically to deal with the Poisson

12


https://doi.org/10.1101/2019.12.20.885582


process that governs deviations of estimated branch lengths from the true branch length. Infuture, the weighting schema should be studied more carefully, both in terms of theoreticalproperties and empirical performance. More generally, LogDate, like many other methods, isbased on an uncorrelated model: rates across branches do not have any relationships. Futurework should consider adding correlation terms to the optimization problem.

Finally, we described, implemented, and tested LogDate in the phylodynamic setting wheresampling times are available, as opposed to the paleontological setting with fossil calibrations.While the algorithms extends easily, treatment of fossil calibrations is a complex topic and addsother difficulties such as the uncertainty of times and placement of fossils on trees [9, 13]. Thus,applying LogDate in such settings requires further tweaks to the optimization formulation,including changing equality to inequality constraints.

In the studies of LogDate accuracy, we have left several questions for the future works.Most importantly, all the simulation models we used assumed that evolutionary rates, evenwhen variable, are drawn from a fixed distribution at random. Thus, rates are stationary anddo not change in distribution through time. Real evolution, in addition to changing rates, mayhave a direction, where, for example, rates may reduce with time. Moreover, it is reasonableto assume rates of adjacent branches are not independent, a condition we did not simulate.Future work should test LogDate under models of rate variation that include correlations andare time-sensitive. Moreover, to facilitate estimation of error, we used the true topology withestimated branch lengths. Future work should study the impact of the incorrect topology onLogDate and other dating methods.

Code availability. The LogDate code and all the data used here are available on https:

//github.com/uym2/LogDate.

5 AcknowledgmentsWe thank Susan B. Little for providing the San Diego HIV sequence dataset used in this study.

This work was supported by the National Science Foundation (NSF) grant III-1845967 anda developmental grant from the University of California, San Diego Center for AIDS Research(P30 AI036214), supported by the National Institutes of Health.

References[1] Akerborg, O., Sennblad, B., and Lagergren, J. 2008. Birth-death prior on phylogeny and

speed dating. BMC Evolutionary Biology , 8(1): 77.

[2] Drummond, A. J. and Rambaut, A. 2007. BEAST: Bayesian evolutionary analysis by sam-pling trees. BMC evolutionary biology , 7: 214.

[3] Drummond, A. J. and Suchard, M. A. 2010. Bayesian random local clocks, or one rate torule them all. BMC Biology , 8(1): 114.

[4] Drummond, A. J., Ho, S. Y. W., Phillips, M. J., and Rambaut, A. 2006. Relaxed Phyloge-netics and Dating with Confidence. PLoS Biology , 4(5): e88.

[5] Dudas, G., Carvalho, L. M., Bedford, T., Tatem, A. J., Baele, G., Faria, N. R., Park, D. J.,Ladner, J. T., Arias, A., Asogun, D., Bielejec, F., Caddy, S. L., Cotten, M., D’Ambrozio, J.,Dellicour, S., Di Caro, A., Diclaro, J. W., Duraffour, S., Elmore, M. J., Fakoli, L. S., Faye,O., Gilbert, M. L., Gevao, S. M., Gire, S., Gladden-Young, A., Gnirke, A., Goba, A., Grant,D. S., Haagmans, B. L., Hiscox, J. A., Jah, U., Kugelman, J. R., Liu, D., Lu, J., Malboeuf,C. M., Mate, S., Matthews, D. A., Matranga, C. B., Meredith, L. W., Qu, J., Quick, J., Pas,S. D., Phan, M. V. T., Pollakis, G., Reusken, C. B., Sanchez-Lockhart, M., Schaffner, S. F.,Schieffelin, J. S., Sealfon, R. S., Simon-Loriere, E., Smits, S. L., Stoecker, K., Thorne, L.,Tobin, E. A., Vandi, M. A., Watson, S. J., West, K., Whitmer, S., Wiley, M. R., Winnicki,S. M., Wohl, S., Wolfel, R., Yozwiak, N. L., Andersen, K. G., Blyden, S. O., Bolay, F.,Carroll, M. W., Dahn, B., Diallo, B., Formenty, P., Fraser, C., Gao, G. F., Garry, R. F.,

13


https://github.com/uym2/LogDate

https://github.com/uym2/LogDate

https://doi.org/10.1101/2019.12.20.885582


Goodfellow, I., Gunther, S., Happi, C. T., Holmes, E. C., Kargbo, B., Keıta, S., Kellam, P.,Koopmans, M. P. G., Kuhn, J. H., Loman, N. J., Magassouba, N., Naidoo, D., Nichol, S. T.,Nyenswah, T., Palacios, G., Pybus, O. G., Sabeti, P. C., Sall, A., Stroher, U., Wurie, I.,Suchard, M. A., Lemey, P., and Rambaut, A. 2017. Virus genomes reveal factors that spreadand sustained the ebola epidemic. Nature, 544(7650): 309–315.

[6] Forest, F. 2009. Calibrating the Tree of Life: fossils, molecules and evolutionary timescales.Annals of Botany , 104(5): 789–794.

[7] Gilbert, M. T. P., Rambaut, A., Wlasiuk, G., Spira, T. J., Pitchenik, A. E., and Worobey,M. 2007. The emergence of HIV/AIDS in the Americas and beyond. Proceedings of theNational Academy of Sciences, 104(47): 18566–18570.

[8] Guindon, S., Dufayard, J.-F., Lefort, V., Anisimova, M., Hordijk, W., and Gascuel, O. 2010.New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing thePerformance of PhyML 3.0. Systematic Biology , 59(3): 307–321.

[9] Heath, T. A. 2012. A hierarchical Bayesian model for calibrating estimates of species diver-gence times. Systematic biology , 61(5): 793–809.

[10] Hedge, J., Lycett, S. J., and Rambaut, A. 2013. Real-time characterization of the molecularepidemiology of an influenza pandemic. Biology Letters, 9(5): 20130331.

[11] Hillis, D. M., Moritz, C., and Mable, B. K. 1996. Molecular Systematics, volume 2nd.Sinauer Associates.

[12] Ho, S. Y. 2014. The changing face of the molecular evolutionary clock. Trends in Ecology& Evolution, 29(9): 496 – 503.

[13] Ho, S. Y. W. and Phillips, M. J. 2009. Accounting for Calibration Uncertainty in Phylo-genetic Estimation of Evolutionary Divergence Times. Systematic Biology , 58(3): 367–380.

[14] Huelsenbeck, J. P., Larget, B., and Swofford, D. 2000. A compound poisson process forrelaxing the molecular clock. Genetics, 154(4): 1879–92.

[15] Kishino, H., Thorne, J. L., and Bruno, W. J. 2001. Performance of a Divergence TimeEstimation Method under a Probabilistic Model of Rate Evolution. Molecular Biology andEvolution, 18(3): 352–361.

[16] Kodandaramaiah, U. 2011. Tectonic calibrations in molecular dating. Current Zoology ,57(1): 116–124.

[17] Lalee, M., Nocedal, J., and Plantenga, T. 1998. On the implementation of an algorithm forlarge-scale equality constrained optimization. SIAM Journal on Optimization, 8(3): 682–706.

[18] Langley, C. H. and Fitch, W. M. 1974. An examination of the constancy of the rate ofmolecular evolution. Journal of Molecular Evolution, 3(3): 161–177.

[19] Lemey, P., Suchard, M., and Rambaut, A. 2009. Reconstructing the initial global spreadof a human influenza pandemic: A bayesian spatial-temporal model for the global spread ofh1n1pdm. PLoS currents, 1: RRN1031–RRN1031. 20029613[pmid].

[20] Lepage, T., Bryant, D., Philippe, H., and Lartillot, N. 2007. A General Comparison ofRelaxed Molecular Clock Models. Molecular Biology and Evolution, 24(12): 2669–2680.

[21] Mai, U., Sayyari, E., and Mirarab, S. 2017. Minimum variance rooting of phylogenetictrees and implications for species tree reconstruction. PLOS ONE , 12(8): e0182238.

[22] Nee, S., May, R. M., and Harvey, P. H. 1994. The reconstructed evolutionary process.Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences,344(1309): 305–311.

[23] Nguyen, L. T., Schmidt, H. A., Von Haeseler, A., and Minh, B. Q. 2015. IQ-TREE: A fastand effective stochastic algorithm for estimating maximum-likelihood phylogenies. MolecularBiology and Evolution, 32(1).

[24] Rambaut, A. and Grass, N. C. 1997. Seq-Gen: an application for the Monte Carlo simu-lation of DNA sequence evolution along phylogenetic trees. Bioinformatics, 13(3): 235–238.

[25] Rambaut, A. and Holmes, E. 2009. The early molecular epidemiology of the swine-origin

14


https://doi.org/10.1101/2019.12.20.885582


a/h1n1 human influenza pandemic. PLoS currents, 1: RRN1003–RRN1003. 20025195[pmid].

[26] Rutschmann, F. 2006. Molecular dating of phylogenetic trees: A brief review of currentmethods that estimate divergence times. Diversity & Distributions, 12(1): 35–48.

[27] Sand, A., Holt, M. K., Johansen, J., Fagerberg, R., Brodal, G. S., Pedersen, C. N. S., andMailund, T. 2013. Algorithms for computing the triplet and quartet distances for binary andgeneral trees. Biology , 2(4): 1189–209.

[28] Sanderson, M. J. 1997. A Nonparametric Approach to Estimating Divergence Times in theAbsence of Rate Constancy. Molecular Biology and Evolution, 14(12): 1218–1231.

[29] Sanderson, M. J. 1998. Estimating rate and time in molecular phylogenies: beyond themolecular clock? In Molecular systematics of plants II , pages 242–264. Springer.

[30] Sanderson, M. J. 2002. Estimating Absolute Rates of Molecular Evolution and DivergenceTimes: A Penalized Likelihood Approach. Molecular Biology and Evolution, 19(1): 101–109.

[31] Sanderson, M. J. 2003. r8s: inferring absolute rates of molecular evolution and divergencetimes in the absence of a molecular clock. Bioinformatics, 19(2): 301–302.

[32] Schwartz, J. H. and Maresca, B. 2006. Do molecular clocks run at all? a critique ofmolecular systematics. Biological Theory , 1(4): 357–371.

[33] Shankarappa, R., Margolick, J. B., Gange, S. J., Rodrigo, A. G., Upchurch, D., Farzade-gan, H., Gupta, P., Rinaldo, C. R., Learn, G. H., He, X., Huang, X. L., and Mullins, J. I.1999. Consistent viral evolutionary changes associated with the progression of human im-munodeficiency virus type 1 infection. Journal of virology , 73(12): 10489–502.

[34] Snir, S., Wolf, Y. I., and Koonin, E. V. 2012. Universal Pacemaker of Genome Evolution.PLoS Computational Biology , 8(11): e1002785.

[35] Tamura, K., Battistuzzi, F. U., Billing-Ross, P., Murillo, O., Filipski, A., and Kumar, S.2012. Estimating divergence times in large molecular phylogenies. Proceedings of the NationalAcademy of Sciences, 109(47): 19333–19338.

[36] Thorne, J. L., Kishino, H., and Painter, I. S. 1998. Estimating the rate of evolution of therate of molecular evolution. Molecular Biology and Evolution, 15(12): 1647–1657.

[37] To, T.-H., Jung, M., Lycett, S., and Gascuel, O. 2015. Fast Dating Using Least-SquaresCriteria and Algorithms. Systematic Biology , 65(1): 82–97.

[38] Volz, E. M. and Frost, S. D. W. 2017. Scalable relaxed clock phylogenetic dating. VirusEvolution, 3(2). vex025.

[39] Volz, E. M., Koelle, K., and Bedford, T. 2013. Viral Phylodynamics. PLoS ComputationalBiology , 9(3).

[40] Wertheim, J. O., Fourment, M., and Kosakovsky Pond, S. L. 2012. Inconsistencies inEstimating the Age of HIV-1 Subtypes Due to Heterotachy. Molecular Biology and Evolution,29(2): 451–456.

[41] Zuckerkandl, E. 1962. Molecular disease, evolution, and genetic heterogeneity. Horizonsin biochemistry , pages 189–225.

15


https://doi.org/10.1101/2019.12.20.885582


Supplementary materialSupplementary text

Algorithm 1 Setup the linear constraints Ψ for tree T given a set of calibration points.

function SetupConstraints(T)Ψ← {}for w in post-order traversal of T do

if w is a leaf thenif w is calibrated then

nearest timepoint(w) ← welse

nearest timepoint(w) ← ∅else

(w1, w2) ← Children(w)u ← nearest timepoint(w1)v ← nearest timepoint(w2)if w is calibrated then

nearest timepoint(w) ← wif u! = ∅ then

Ψ← Ψ ∪ ψ(w, u)

if v! = ∅ thenΨ← Ψ ∪ ψ(w, v)

elseif u == ∅ then

nearest timepoint(w) ← velse

if v == ∅ thennearest timepoint(w) ← u

elsenearest timepoint(w) ← u if (d(w, u) < d(w, v)) else vΨ← Ψ ∪ ψ(u, v)

16


https://doi.org/10.1101/2019.12.20.885582


Claim 1. The optimal x∗ in Eq. 4 is the Maximum Likelihood (ML) estimate of the model ofrate variation where νi ∼ LogNormal(µ, σ2) and Mode(νi) = 1.

Proof. We have: νi ∼ LogNormal(µ, σ2) =⇒ Mode(νi) = eµ−σ2

=⇒ 1 = eµ−σ2

=⇒ µ = σ2

Therefore, the log-likelihood function under the LogNormal model of rate variation can bewritten as follows:

logLlG =∑2n−2

i=1 log(

1νiσ√2π

exp(− (log νi−µ)2

2σ2

))∝

∑2n−2i=1

(− log νi − log2 νi−2µ log νi

2µ

)∝ − log2 νi

(S1)

Thus, the ML estimate can be found by minimizing Eq. 4.

Lemma 2. The length of the shortest path from the root of a binary tree to its leaves is at mostlog n where n is the number of leaves in the tree.

Proof. Consider a rooted binary tree T with n leaves; let r be the root and h be the length ofthe shortest path from r to the leaves of T . We need to prove that h ≤ log2 n.

Let Di be the set of nodes in T with depth i, that is, Di = {w ∈ T |d(r, w) = i}. Wefirst prove that |Di| = 2i ∀i ≤ h where |Di| denotes the cardinality of Di. We prove this byinduction. The base case i = 0 holds since the root r is the only node with depth 0. Supposewe have |Dk| = 2k and k < h, we need to prove that if k + 1 ≤ h then |Dk+1| = 2k+1. Notethat a node v ∈ Dk+1 if and only if its parent par(v) ∈ Dk. Because T is a binary tree, eachnode in T must either has no child (leaf node) or two children (internal node). Since k < h,there must be no leaf node in Dk, otherwise, a leaf v in Dk has d(r, v) = k < h, which defines aroot-to-leaf path that is shorter than h and contradicts the definition of h. Thus, each node inDk has exactly 2 children, making |Dk+1| = 2 ∗ |Dk| = 2 ∗ 2k = 2k+1.

Now we have |Dh| = 2h. To prove that h ≤ log2 n, note that Dh contains a mixture ofleaves and internal nodes and each internal node in Dh must have more than one leaf below it.Therefore, the size of Dh is at most the size of the leaf set of T ; that is, |Dh| ≤ n. Thus, wehave 2h = |Dh| ≤ n =⇒ h ≤ log2 n.

Claim 3. If all the leaves have sampling times and there is no other calibration points givenfor internal nodes, the matrix corresponding to the constraints Ψ setup by Algorithm 1 hasO(n log(n)) non-zero elements, where n is the number of leaves in the input tree T .

Proof. Let P(w) denote the shortest path from a node w to its leaves and let |P(w)| denote thelength of this path. Let Tw be the clade of T below w and let |w| denote the size of this clade (i.e.the number of leaves below w). Applying lemma 2 on Tw, we have |P(w)| ≤ log2 |w| ≤ log2 nfor all w ∈ T .

Note that if all leaves have sampling times, Algorithm 1 adds exactly one constraint for eachinternal node in the tree. For each node w with two children cl(w) and cr(w), the non-zeroelements of the constraint added when node w is visited must locate on P(cl(w)), P(cr(w)),and the two branches (w, cl(w)) and (w, cr(w)). Let ηw denote the number of non-zero elementsof the constraint defined by node w, then ηw ≤ |P(cl(w))| + |P(cr(w))| + 1 + 1 ≤ 2 log2 n + 2.Thus, the total number of non-zeros in all constraints corresponding to the n−1 internal nodesis bounded above by (n− 1)(2 log2 n+ 2) ∈ O(n log n).

17


https://doi.org/10.1101/2019.12.20.885582


Supplementary figures and tables

Tree Model Clock Model BEAST strict BEAST lnorm LF LSD wLogDate

Lognormal 0.0007 0.0011 0.0004 0.0005 0.0004M1 Gamma 0.0009 0.0014 0.0005 0.0004 0.0006

Exponential 0.0009 0.0022 0.0008 0.0013 0.0009


Exponential 0.0013 0.0038 0.0015 0.0020 0.0015


Exponential 0.0010 0.0011 0.0009 0.0027 0.0009


Exponential 0.0020 0.0016 0.0016 0.0037 0.0017

Average 0.0008 0.0013 0.0008 0.0012 0.0008Table S1: Mean absolute error of the inferred mutation rate of BEAST strict, BEAST lognorm, LF,LSD, and wLogDate.

Data LSD LF wLogDate

H1N1 (n=892) < 1 sec 1 min 3 mins

HIV San Diego (n=904) < 1 sec 11 mins 24 mins

Ebola (n=1610) 2 secs 4 mins 3 minsTable S2: Running time of LSD, LF, and wLogDate on both biological datasets.

18


https://doi.org/10.1101/2019.12.20.885582


wLogDate

Lognormal

wLogDate

Gamma

wLogDate

Exponential

LogDate

Lognormal

LogDate

Gamma

LogDate

Exponential

0 30 60 90 0 25 50 75 100 125 0 50 100 150 200 250

0 25 50 75 100 0 20 40 60 0 50 100 150

500

1000

1500

2000

20

30

40

50

60

20

40

60

4

6

8

0

100

200

300

400

10

20

30

Iteration

Obj

ectiv

e

Figure S1: Objective value versus iteration of the LogDate and wLogDate runs on one arbitrarily selectedsimulated tree (M4, replicate 2). Each of the two methods were run using 10 random initial pointsgenerated using the strategy described in ??.

●●

●●●●

●

●

●●

●●

●●

●

●

●●

●●

●

●

●

●

Lognormal Gamma Exponential

M1 M2 M3 M4 M1 M2 M3 M4 M1 M2 M3 M40.0

0.2

0.4

0.6

0.8

0.02

0.04

0.06

0.08

0.025

0.050

0.075

rmse

● ●LogDate wLogDate

Figure S2: RMSE of the internal node ages inferred by LogDate and wLogDate (the weighted version).

19


https://doi.org/10.1101/2019.12.20.885582


0 5 1 0 1 5 2 0 2 5 3 0 3 5 4 0 4 5 5 0 5 5 6 0 6 5 7 0 7 5 8 0 8 5 9 0 9 5 100 105 110 115 120 125 130 135 140 145 150 155 160 165 170 175 180 185 190 195 200 205 210 215 220 225 230 235 240 245 250 255 260 265 270 275 280 285 290 295 300 305 310 315 320 325 330 335 340 345 350 355 360 365 370 375 380 385 390 395 400 405 410 415 420 425 430 435 440 445 450 455 460 465 470 475 480 485 490 495 500 505 510 515 520 525 530 535 540 545 550 555 560 565 570 575 580 585 590 595 600 605 610 615 620 625 630 635 640 645 650 655 660 665 670 675 680 685 690 695 700 705 710 715 720 725 730 735 740 745 750 755 760 765 770 775 780 785 790 795 800 805 810 815 820 825 830 835 840 845 850 855 860 865 870 875 880 885 890 895 900 905 910 915 920 925 930 935 940 945 950 955 960 965 970 975 980 985 990 995 1000 1005 1010 1015 1020 1025 1030 1035 1040 1045 1050 1055 1060 1065 1070 1075 1080 1085 1090 1095 1100 1105 1110 1115 1120 1125 1130 1135 1140 1145 1150 1155 1160 1165 1170 1175 1180 1185 1190 1195 1200 1205 1210 1215 1220 1225 1230 1235 1240 1245 1250 1255 1260 1265 1270 1275 1280 1285 1290 1295 1300 1305 1310 1315 1320 1325 1330 1335 1340 1345 1350 1355 1360 1365 1370 1375 1380 1385 1390 1395 1400 1405 1410 1415 1420 1425 1430 1435 1440 1445 1450 1455 1460 1465 1470 1475 1480 1485 1490 1495 1500 1505 1510 1515 1520 1525 1530 1535 1540 1545 1550 1555 1560 1565 1570 1575 1580 1585 1590 1595 1600 1605 1610 1615 1620 1625 1630 1635 1640 1645 1650 1655 1660 1665 1670 1675 1680 1685 1690 1695 1700 1705 1710 1715 1720 1725 1730 1735 1740 1745 1750 1755 1760 1765 1770 1775 1780 1785 1790 1795 1800 1805 1810 1815 1820 1825 1830 1835 1840 1845 1850 1855 1860 1865 1870 1875 1880 1885 1890 1895 1900 1905 1910 1915 1920 1925 1930 1935 1940 1945 1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 2015 2020 2025 2030 2035 2040 2045 2050 2055 2060 2065 2070 2075 2080 2085 2090 2095 2100 2105 2110 2115 2120 2125 2130 2135 2140 2145 2150 2155 2160 2165 2170 2175 2180 2185 2190 2195 2200 2205 2210 2215 2220 2225 2230 2235 2240 2245 2250 2255 2260 2265 2270 2275 2280 2285 2290 2295 2300 2305 2310 2315 2320 2325 2330 2335 2340 2345 2350 2355 2360 2365 2370 2375 2380 2385 2390 2395 2400 2405 2410 2415 2420 2425 2430 2435 2440 2445 2450 2455 2460 2465 2470 2475 2480 2485 2490 2495 2500 2505 2510 2515 2520 2525 2530 2535 2540 2545 2550 2555 2560 2565 2570 2575 2580 2585 2590 2595 2600 2605 2610 2615 2620 2625 2630 2635 2640 2645 2650 2655 2660 2665 2670 2675 2680 2685 2690 2695 2700 2705 2710 2715 2720 2725 2730 2735 2740 2745 2750 2755 2760 2765 2770 2775 2780 2785 2790 2795 2800 2805 2810 2815 2820 2825 2830 2835 2840 2845 2850 2855 2860 2865 2870 2875 2880 2885 2890 2895 2900 2905 2910 2915 2920 2925 2930 2935 2940 2945 2950 2955 2960 2965 2970 2975 2980 2985 2990 2995 3000 3005 3010 3015 3020 3025 3030 3035 3040 3045 3050 3055 3060 3065 3070 3075 3080 3085 3090 3095 3100 3105 3110 3115 3120 3125 3130 3135 3140 3145 3150 3155 3160 3165 3170 3175 3180 3185 3190 3195 3200 3205 3210 3215 3220 3225 3230 3235 3240 3245 3250 3255 3260 3265 3270 3275 3280 3285 3290 3295 3300 3305 3310 3315 3320 3325 3330 3335 3340 3345 3350 3355 3360 3365 3370 3375 3380 3385 3390 3395 3400 3405 3410 3415 3420 3425 3430 3435 3440 3445 3450 3455 3460 3465 3470 3475 3480 3485 3490 3495 3500 3505 3510 3515 3520 3525 3530 3535 3540 3545 3550 3555 3560 3565 3570 3575 3580 3585 3590 3595 3600 3605 3610 3615 3620 3625 3630 3635 3640 3645 3650 3655 3660 3665 3670 3675 3680 3685 3690 3695 3700 3705 3710 3715 3720 3725 3730 3735 3740 3745 3750 3755 3760 3765 3770 3775 3780 3785 3790 3795 3800 3805 3810 3815 3820 3825 3830 3835 3840 3845 3850 3855 3860 3865 3870 3875 3880 3885 3890 3895 3900 3905 3910 3915 3920 3925 3930 3935 3940 3945 3950 3955 3960 3965 3970 3975 3980 3985 3990 3995 4000 4005 4010 4015 4020 4025 4030 4035 4040 4045 4050 4055 4060 4065 4070 4075 4080 4085 4090 4095 4100 4105 4110 4115 4120 4125 4130 4135 4140 4145 4150 4155 4160 4165 4170 4175 4180 4185 4190 4195 4200 4205 4210 4215 4220 4225 4230 4235 4240 4245 4250 4255 4260 4265 4270 4275 4280 4285 4290 4295 4300 4305 4310 4315 4320 4325 4330 4335 4340 4345 4350 4355 4360 4365 4370 4375 4380 4385 4390 4395 4400 4405 4410 4415 4420 4425 4430 4435 4440 4445 4450 4455 4460 4465 4470 4475 4480 4485 4490 4495 4500 4505 4510 4515 4520 4525 4530 4535 4540 4545 4550 4555 4560 4565 4570 4575 4580 4585 4590 4595 4600 4605 4610 4615 4620 4625 4630 4635 4640 4645 4650 4655 4660 4665 4670 4675 4680 4685 4690 4695 4700 4705 4710 4715 4720 4725 4730 4735 4740 4745 4750 4755 4760 4765 4770 4775 4780 4785 4790 4795 4800 4805 4810 4815 4820 4825 4830 4835 4840 4845 4850 4855 4860 4865 4870 4875 4880 4885 4890 4895 4900 4905 4910 4915 4920 4925 4930 4935 4940 4945 4950 4955 4960 4965 4970 4975 4980 4985 4990 4995 5000 5005 5010 5015 5020 5025 5030 5035 5040 5045 5050 5055 5060 5065 5070 5075 5080 5085 5090 5095 5100 5105 5110 5115 5120 5125 5130 5135 5140 5145 5150 5155 5160 5165 5170 5175 5180 5185 5190 5195 5200 5205 5210 5215 5220 5225 5230 5235 5240 5245 5250 5255 5260 5265 5270 5275 5280 5285 5290 5295 5300 5305 5310 5315 5320 5325 5330 5335 5340 5345 5350 5355 5360 5365 5370 5375 5380 5385 5390 5395 5400 5405 5410 5415 5420 5425 5430 5435 5440 5445 5450 5455 5460 5465 5470 5475 5480 5485 5490 5495 5500 5505 5510 5515 5520 5525 5530 5535 5540 5545 5550 5555 5560 5565 5570 5575 5580 5585 5590 5595 5600 5605 5610 5615 5620 5625 5630 5635 5640 5645 5650 5655 5660 5665 5670 5675 5680 5685 5690 5695 5700 5705 5710 5715 5720 5725 5730 5735 5740 5745 5750 5755 5760 5765 5770 5775 5780 5785 5790 5795 5800 5805 5810 5815 5820 5825 5830 5835 5840 5845 5850 5855 5860 5865 5870 5875 5880 5885 5890 5895 5900 5905 5910 5915 5920 5925 5930 5935 5940 5945 5950 5955 5960 5965 5970 5975 5980 5985 5990 5995 6000 6005 6010 6015 6020 6025 6030 6035 6040 6045 6050 6055 6060 6065 6070 6075 6080 6085 6090 6095 6100 6105 6110 6115 6120 6125 6130 6135 6140 6145 6150 6155 6160 6165 6170 6175 6180 6185 6190 6195 6200 6205 6210 6215 6220 6225 6230 6235 6240 6245 6250 6255 6260 6265 6270 6275 6280 6285 6290 6295 6300 6305 6310 6315 6320 6325 6330 6335 6340 6345 6350 6355 6360 6365 6370 6375 6380 6385 6390 6395 6400 6405 6410 6415 6420 6425 6430 6435 6440 6445 6450 6455 6460 6465 6470 6475 6480 6485 6490 6495 6500 6505 6510 6515 6520 6525 6530 6535 6540 6545 6550 6555 6560 6565 6570 6575 6580 6585 6590 6595 6600 6605 6610 6615 6620 6625 6630 6635 6640 6645 6650 6655 6660 6665 6670 6675 6680 6685 6690 6695 6700 6705 6710 6715 6720 6725 6730 6735 6740 6745 6750 6755 6760 6765 6770 6775 6780 6785 6790 6795 6800 6805 6810 6815 6820 6825 6830 6835 6840 6845 6850 6855 6860 6865 6870 6875 6880 6885 6890 6895 6900 6905 6910 6915 6920 6925 6930 6935 6940 6945 6950 6955 6960 6965 6970 6975 6980 6985 6990 6995 7000 7005 7010 7015 7020 7025 7030 7035 7040 7045 7050 7055 7060 7065 7070 7075 7080 7085 7090 7095 7100 7105 7110 7115 7120 7125 7130 7135 7140 7145 7150 7155 7160 7165 7170 7175 7180 7185 7190 7195 7200 7205 7210 7215 7220 7225 7230 7235 7240 7245 7250 7255 7260 7265 7270 7275 7280 7285 7290 7295 7300 7305 7310 7315 7320 7325 7330 7335 7340 7345 7350 7355 7360 7365 7370 7375 7380 7385 7390 7395 7400 7405 7410 7415 7420 7425 7430 7435 7440 7445 7450 7455 7460 7465 7470 7475 7480 7485 7490 7495 7500 7505 7510 7515 7520 7525 7530 7535 7540 7545 7550 7555 7560 7565 7570 7575 7580 7585 7590 7595 7600 7605 7610 7615 7620 7625 7630 7635 7640 7645 7650 7655 7660 7665 7670 7675 7680 7685 7690 7695 7700 7705 7710 7715 7720 7725 7730 7735 7740 7745 7750 7755 7760 7765 7770 7775 7780 7785 7790 7795 7800 7805 7810 7815 7820 7825 7830 7835 7840 7845 7850 7855 7860 7865 7870 7875 7880 7885 7890 7895 7900 7905 7910 7915 7920 7925 7930 7935 7940 7945 7950 7955 7960 7965 7970 7975 7980 7985 7990 7995 8000 8005 8010 8015 8020 8025 8030 8035 8040 8045 8050 8055 8060 8065 8070 8075 8080 8085 8090 8095 8100 8105 8110 8115 8120 8125 8130 8135 8140 8145 8150 8155 8160 8165 8170 8175 8180 8185 8190 8195 8200 8205 8210 8215 8220 8225 8230 8235 8240 8245 8250 8255 8260 8265 8270 8275 8280 8285 8290 8295 8300 8305 8310 8315 8320 8325 8330 8335 8340 8345 8350 8355 8360 8365 8370 8375 8380 8385 8390 8395 8400 8405 8410 8415 8420 8425 8430 8435 8440 8445 8450 8455 8460 8465 8470 8475 8480 8485 8490 8495 8500 8505 8510 8515 8520 8525 8530 8535 8540 8545 8550 8555 8560 8565 8570 8575 8580 8585 8590 8595 8600 8605 8610 8615 8620 8625 8630 8635 8640 8645 8650 8655 8660 8665 8670 8675 8680 8685 8690 8695 8700 8705 8710 8715 8720 8725 8730 8735 8740 8745 8750 8755 8760 8765 8770 8775 8780 8785 8790 8795 8800 8805 8810 8815 8820 8825 8830 8835 8840 8845 8850 8855 8860 8865 8870 8875 8880 8885 8890 8895 8900 8905 8910 8915 8920 8925 8930 8935 8940 8945 8950 8955 8960 8965 8970 8975 8980 8985 8990 8995 9000 9005 9010 9015 9020 9025 9030 9035 9040 9045 9050 9055 9060 9065 9070 9075 9080 9085 9090 9095 9100 9105 9110 9115 9120 9125 9130 9135 9140 9145 9150 9155 9160 9165 9170 9175 9180 9185 9190 9195 9200 9205 9210 9215 9220 9225 9230 9235 9240 9245 9250 9255 9260 9265 9270 9275 9280 9285 9290 9295 9300 9305 9310 9315 9320 9325 9330 9335 9340 9345 9350 9355 9360 9365 9370 9375 9380 9385 9390 9395 9400 9405 9410 9415 9420 9425 9430 9435 9440 9445 9450 9455 9460 9465 9470 9475 9480 9485 9490 9495 9500 9505 9510 9515 9520 9525 9530 9535 9540 9545 9550 9555 9560 9565 9570 9575 9580 9585 9590 9595 9600 9605 9610 9615 9620 9625 9630 9635 9640 9645 9650 9655 9660 9665 9670 9675 9680 9685 9690 9695 9700 9705 9710 9715 9720 9725 9730 9735 9740 9745 9750 9755 9760 9765 9770 9775 9780 9785 9790 9795 9800 9805 9810 9815 9820 9825 9830 9835 9840 9845 9850 9855 9860 9865 9870 9875 9880 9885 9890 9895 9900 9905 9910 9915 9920 9925 9930 9935 9940 9945 9950 9955 9960 9965 9970 9975 9980 9985 9990 9995 10000 10005 10010 10015 10020 10025 10030 10035 10040 10045 10050 10055 10060 10065 10070 10075 10080 10085 10090 10095

Figure S3: A TimeTree of San Diego HIV epidemic according to wLogDate.

20


https://doi.org/10.1101/2019.12.20.885582


Documents

Log Transformation Improves Dating of Phylogenies · Log Transformation Improves Dating of Phylogenies Uyen Mai,1 and Siavash Mirarab2* 1Department of Computer Science and Engineering,