Shrinkage-based Random Local Clocks with Scalable Inference

Abstract Molecular clock models undergird modern methods of divergence-time estimation. Local clock models propose that the rate of molecular evolution is constant within phylogenetic subtrees. Current local clock inference procedures exhibit one or more weaknesses, namely they achieve limited scalability to trees with large numbers of taxa, impose model misspecification, or require a priori knowledge of the existence and location of clocks. To overcome these challenges, we present an autocorrelated, Bayesian model of heritable clock rate evolution that leverages heavy-tailed priors with mean zero to shrink increments of change between branch-specific clocks. We further develop an efficient Hamiltonian Monte Carlo sampler that exploits closed form gradient computations to scale our model to large trees. Inference under our shrinkage clock exhibits a speed-up compared to the popular random local clock when estimating branch-specific clock rates on a variety of simulated datasets. This speed-up increases with the size of the problem. We further show our shrinkage clock recovers known local clocks within a rodent and mammalian phylogeny. Finally, in a problem that once appeared computationally impractical, we investigate the heritable clock structure of various surface glycoproteins of influenza A virus in the absence of prior knowledge about clock placement. We implement our shrinkage clock and make it publicly available in the BEAST software package.


Introduction
Molecular clock models are ubiquitous phylogenetic instruments for divergence-time estimation with applications ranging from timing placental mammal radiation (Springer et al., 2003) to estimating influenza diversity (Davidson et al., 2014).To capture clock rate variation along the lineages of a phylogeny, Thorne et al. (1998) propose an autocorrelated, or "heritable" rate model while others (Drummond and Suchard, 2010;Yoder and Yang, 2000) assume there exist, at most, a small number of "local" clocks on any given tree.In each case, closely related lineages maintain similar or even identical evolutionary rates.Autocorrelated rate models are computationally appealing due to the induced smooth transition in rate from parent to child node along the tree but may inappropriately shrink large rate changes between adjacent nodes (Smith et al., 2010).On the other hand, local clock models allow large rate changes to exist but can be computationally unpalatable on large problems due to the combinatorial complexity of choosing (or learning) the number and location of local clocks.When these quantities are simultaneously learned with the tree, Drummond and Suchard (2010) call this the random local clock (RLC) model.
Due to these complications, some authors employ uncorrelated relaxed clocks such as the uncorrelated log-normal relaxed molecular clock (Drummond et al., 2006), but this generates excessive rate heterogeneity in cases where clock rate changes are thought to be more punctuated, for example between HIV subtypes (Bletsa et al., 2019).For a more in-depth review of various molecular clock models, see Ho and Duchêne (2014).Here we propose an autocorrelated clock model where we place a Bayesian bridge shrinkage prior on the increment between parent and child log branch rates.Among various shrinkage priors in the literature, the Bayesian bridge has a unique advantage in having both a collapsed spike-and-slab representation as well as a Gaussian scale-mixture form.The first representation intuitively places large mass near zero reflecting our a priori belief that most increments should be zero but has heavy tails that allow for estimating large rate changes in an approximately unbiased manner.Like many other shrinkage priors, the Bayesian bridge includes a "global scale" nuisance parameter about which learning, in the absence of prior information, typically limits the speed of inference.Polson et al. (2014) develop a framework to facilitate efficient Gibbs sampling of this nuisance parameter in a regression context and we utilize their approach here.On the other hand, the second representation of the Bayesian bridge as Gaussian scalemixture is differentiable almost everywhere.We exploit this feature and develop an efficient Hamiltonian Monte Carlo (HMC) sampler over the space of increments that employs recent work on closed form gradient representations (Ji et al., 2020) to make our shrinkage-clock inference scalable to large trees.Crucial to our inference, we define recursive algorithms to compute the requisite joint gradient of the log posterior in our transformed increment space with computational complexity that scales only linearly with the number of tips in the tree.
We implement our method in BEAST (Suchard et al., 2018), a popular software package for reconstructing rooted, time-measured phylogenies.Due to our efficient inference machinery, our shrinkage-clock achieves the tractable benefits of the autocorrelated rate model and simultaneously maintains the flexibility of more punctuated local clock models.
We demonstrate the inference gains of our approach versus the RLC across 20 different simulated data sets of a 40 taxa tree.We additionally compare the accuracy of our shrinkageclock to the RLC by studying the adaptive radiation of rodents and other mammals, and demonstrate utility of the heavy-tailed Bayesian bridge shrinkage prior by comparing it to the more ubiquitous Laplace prior.Finally, we deploy our shrinkage-clock to estimate the existence, location and magnitude of host-specific clock rates in surface glycoproteins of the influenza A virus.

Setup
Consider a rooted, bifurcating tree F with N tips and N − 1 internal (ancestral) nodes.We index tips i = 1, . . ., N and internal nodes i = N + 1, . . ., 2N − 2. We designate node 2N − 1 to be the root of the tree.Let pa(i) denote the parent of the ith node and let branch length t i connect node i with its parent.

The relaxed clock
Aligned molecular sequence data S evolve according to a continuous time Markov process defined by infinitesimal rate matrix Q.In our examples, Q is a 4 × 4 matrix that describes the relative substitution process between nucleotides along the branches in F , but in gen-eral, Q may be of larger dimension to accommodate alignments at the codon or amino acid resolution, see Yang (2014) for reference.Each site k of S evolves independently and identically according to Q but may have its own site-specific rate of evolution s k .A priori we specify that E[s k ] = 1.Under the relaxed clock model, the transition probability matrix for branch i, where branch-rate multiplier ρ i is the number of expected substitutions per unit time.
Under this transform, each branch-rate multiplier is the product of clock rate r i and location parameter γ scaled by the inverse of total expected substitutions per total tree time, (2) This results in one fewer degree of freedom since For heterochronous data, we estimate γ, but for ultrametric studies where the height of the tree is not identifiable we fix γ = 1.

Autocorrelated shrinkage-clock
We assume clock rates r = {r 1 , . . ., r 2N −2 } are autocorrelated and model the incremental difference φ i between branch i's clock rate and its parent lineage clock rate, Under this parameterization, the increments φ = {φ 1 , . . .φ 2N −2 } ∈ R 2N −2 are a linear transformation of log r.To shrink the total number of rate changes along the tree, we let φ i iid ∼ P φ such that E[φ i ] = 0. Typically, P φ may follow a Gaussian (Thorne et al., 1998) or Laplace distribution.We choose the flexible, heavy-tailed, Bayesian bridge prior (Polson et al., 2014) on the increments, where µ > 0 is termed the "global scale" and α ∈ (0, 1] changes the shape of P φ where smaller α places more mass near zero.See Figure ( 1) for a comparison of the bridge to common shrinkage priors.Choosing α to be close to 0 forces the Bayesian bridge prior closer to best subset selection when used in a regression setting, while α = 1 matches the Laplace prior.In all examples, we set α = 1 4 to enforce slightly stronger shrinkage than the default 0.5 employed by Polson et al. (2014).Since increments are independent, the joint prior is simply the product 3 Inference We follow the computationally efficient sampling approach outlined by Polson et al. (2014) and view the prior on the increments as a scale mixture of normals (West, 1987), where the local scale of branch i, λ i > 0 and draws from a one-sided stable distribution, see Nishimura and Suchard (2019); Polson et al. (2014) for more details.To improve convergence speed and maintain the benefits of our heavy-tailed prior, we employ the shrunken-shoulder regularization of Nishimura and Suchard (2019) and augment our bridge to have light tails past a reasonably large point.Our scale mixture prior on an increment becomes where slab width ξ bounds the variance of increments to ξ 2 .In the examples to follow, we set ξ = 2, effecting a weakly informative, generous upper bound on clock rate changes.
Specifically, a slab width of 2 asserts that there is at most 5% probability for r i to be greater than 50 × r pa(i) .
We are interested in learning about the posterior, where λ = {λ 1 , . . ., λ 2N −2 }, θ represents all relevant parameters that describe the molecular substitution model and, again, F is the phylogenetic tree.We place a relatively uninformative, Gamma prior on µ − 1 α with shape 1 and scale 2. Additionally, we place the CTMC conditional reference prior of Ferreira and Suchard (2008) on the location γ.We detail the priors on θ and F in each example of the sequel.
We use Markov chain Monte Carlo (MCMC) to marginalize over local scale parameters and approximate the posterior (9).Specifically, we employ a random-scan Metropolis-within-Gibbs (Levine and Casella, 2006;Liu, 2008) sampling approach to update the full conditional densities implicit in (9).Efficient sampling schemes for γ, θ, F are well described by Suchard et al. (2018), while Polson et al. (2014) outline the efficient scale-mixture approach we use to sample (µ, λ).Here, we turn our attention to sampling where typically ν ∼ N(0, M) (Neal, 2011).Often mass matrix M = I 2N −2 but HMC sampling may be improved by using an alternative M, such as an approximation of the Hessian of the log-posterior (Stan Development Team, 2017;Zhang and Sutton, 2011).For further reading on HMC, see Betancourt (2017); Neal (2011).While HMC samplers offer more efficient posterior exploration, they require computationally expensive gradient calculations that often diminish their usefulness.Here we exploit and extend recent work (Ji et al., 2020) on branch-specific clock rate gradients to facilitate fast inference of r under our shrinkage model.

Hamiltonian Monte Carlo increment sampler
We generate proposals in increment space, since φ are uncorrelated in the prior and we transform back to rate space as described by the linear transformation in equation ( 4).
HMC sampling of the rates requires the gradient of the rate log-posterior, To compute the gradient of the log-likelihood with respect to the increments, we first find the gradient with respect to clock rates r, where we compute all entries in ] with the computational O(N ) algorithm derived by Ji et al. (2020) and We complete the gradient L[ρ (r)] dr j dφ k and where transformation dr j dφ k follows directly from equation (4).To preserve the O(N ) gradient computation, we take advantage of the tree structure explicit in equation ( 14) and accumulate the gradient of the log-likelihood via one post-order traversal of the tree.To begin, let i and j be both daughters of node k in F , then We next turn our attention to the gradient of the log-prior, Since p(φ | λ, µ) is Gaussian, the first term gradient unwinds, Numerical solutions to the second term in (11) involve the change-of-variable Jacobian ∂ ∂φ k log dφ dr and appear to necessitate an O(N 2 ) sparse determinant computation.To facilitate faster computation of the transform, we index nodes of the tree such that i < j =⇒ i is not ancestral to j.Under this indexing, dφ dr is an upper triangular matrix with 1 r i along its diagonal, see Figure (2) for an example.Figure 2: Example tree with corresponding Jacobian matrix.Index i < j =⇒ i is not ancestral to j thus the Jacobian is upper-triangular and the determinant is the product of diagonal entries.
The gradient of the log determinant, where 1 is the indicator function and d k is the number of descendants of node k.All together, and we accumulate d k in one recursive post-order tree traversal by observing where, again, i and j are both daughters of k.
To further improve the proposals of our HMC sampler, we precondition the mass matrix M to be the current-state absolute value of the Hessian of the log-prior, This diagonal matrix weights momentum draws by prior increment precision.Intuitively, equation ( 20) improves HMC sampling by rescaling increment proposals by the variance of φ, allowing larger steps to be taken in dimensions with larger variance.See Neal (2011) for further discussion on mass matrix transformations.

Local clocks in three nuclear genes of rodents and other mammals
To verify the accuracy of our model, we turn to a well-studied example of adaptive radiation in mammals and rodents.We estimate the existence of four local clocks where we define a local clock on branch i if the posterior odds φ i > 0 is greater than 10 or less than 1 10 .The posterior odds here is equivalent to a Bayes factor since an increment is equally likely to be positive or negative under the prior.Furthermore, a Bayes factor greater than 10 is suggestive of "strong evidence" against an alternative hypothesis ( Kass and Raftery, 1995).In the approach, we do not make assumptions about the magnitude of local clocks on a tree and instead use posterior probability of increment sign to define a clock.To illustrate the benefits of using a heavy-tailed prior on the increments, we further compare the performance of our Bayesian bridge prior to the more usual Laplace prior for shrinkage (see Figure 1).We again fit our shrinkage-clock as described above but remove the slab and fix α = 1, thus placing a Laplace prior on each increment.We find the posterior mean of increment variance under both the Laplace and Bridge priors is 0.057 with 95% high posterior density (HPD) intervals (0.036, 0.089) and (0.035, 0.093) respectively.Despite having very similar variance, we find the posterior mean of the absolute maximum increment is 0.84 (0.57, 1.22) and 1.01 (0.70, 1.54) under the Laplace and Bridge priors respectively.This evidences induced smoothing of the clock rates under the exponential tails of a Laplace prior, that on average shrinks the largest increment by approximately 20%.

Simulation study
We compare the scalability of our shrinkage-clock to the RLC under a simulated example.
We generate 1000 character nucleotide sequences from a fixed 40 tip tree 20 times.In each simulation, there are 4 distinct lineages (A, B, C, D) of 10 taxa each.Time to most recent common ancestor (TMRCA) for each lineage is 40 years and tree height is 80 years.Lineages B, C and D evolve with a relative clock rate of 1.0 while the MRCA of lineage A starts a new clock with relative rate 2.0.
We compare the accumulation of effective sample size (ESS) per unit time of branchspecific clock rates under both our Bayesian bridge shrinkage-clock and the RLC while simultaneously inferring the phylogeny.ESS approximates the number of independent samples from a Markov chain and we use this metric to evaluate how well each inference procedure explores clock rate space.We report the results across all 20 simulated datasets in Figure ( 4).
The median ESS/second is 0.49 and 0.13 under the shrinkage-clock and the RLC respectively, exhibiting a 3.8-fold speed increase.Additionally, the "least well" explored clock rate across all simulations accumulated 3.5 × 10 −3 and 4.9 × 10 −4 ESS/second under the shrinkage-clock and RLC respectively, a 7.1-fold speed-up, for these relatively small taxon-count examples.
We further report here that inference under the shrinkage-clock without preconditioning the mass matrix results in a minimum and median ESS/second of 2.3 × 10 −4 and 1.5 × 10 −3 respectively.Overall, preconditioning the mass matrix improves ESS/second 15× for the worst-explored clock rate under the shrinkage-clock.
Averaged across all twenty simulations, the Bayes factor for the true clock rate change under our shrinkage clock is 5.25 while the second most likely clock rate has Bayes factor support of 0.56.Comparably, averaged over all runs, the Bayes factor of the one true clock under the RLC is 3.51.Additionally, the true relative clock-rate for the 'A' clade is 2.0 and we estimate 1.51 (0.90, 2.31) and 1.49 (0.98, 2.36)

Influenza A virus
We further demonstrate the scalability and utility of our shrinkage-clock model by examining the evolution of two major influenza A virus (IVA) surface glycoprotein subtypes: hemagglutinin (HA) H7 and neuraminidase (NA) N7.Both hemagglutinin and neuraminidase protein mutations impact IVA's epitope and allow IVA to escape adaptive immune responses (McAuley et al., 2019;Wilson and Cox, 1990).Worobey et al. (2014)  GTR substitution model with 4 category discrete-Γ site rate model.We depart from their example, however, in our use of tree prior.We employ a Bayesian skygrid prior (Gill et al., 2013) with 50 population size bins and a cutoff of 200 years instead of using the skyride prior (Minin et al., 2008)  We find no sharp local clocks exist with Bayes factor > 10 or < 1 10 on the NA N7 tree under our shrinkage-clock model but do see evidence for rate heterogeneity.Incidentally, the most likely clock occurs on the branch that begins the Eastern avian clade of the NA N7 tree (Figure 5).The second most probable clock is found in a subclade of the equine lineage.
On the other hand, we estimate the existence of seven local clocks on the HA H7 tree and report these in Figure ( 6).Overall, the mean posterior clock rate for NA N7 is lower than HA H7.We report the posterior mean and 95% HPD intervals of γ are 2.7 {2.1 − 3.3} × 10 −3 and 3.3 {2.8 − 3.9} × 10 −3 for NA N7 and HA H7 respectively.Furthermore, under our shrinkage-clock the posterior mean root heights and 95% HPD intervals of the N7 and H7 trees in absolute time are 1798 (1733-1855) and 1853 (1808 -1897) respectively.rate 2.17 H7 tree.Since three of these clocks belong to edges of tip nodes, this may reflect incomplete sampling or sequencing error.Despite inferring six more clocks than the host-specific model, the posterior mean estimate of root height under our shrinkage-clock is within five years of previous estimates (Worobey et al., 2014).
Our shrinkage-clock accumulates ESS/second of the worst explored clock rate 7.1× faster than the RLC across 78 branches of 20 simulated data sets.If ESS is used as stopping criteria for phylogenetic reconstruction, this could save users up to 85% of BEAST runtime.As the bridge exponent α approaches 0, the bridge prior density is more peaked near zero resulting in sharper increment shrinkage and thus better distinguishable local clocks.Nishimura and Suchard (2019) examine multiple α and report closer to optimal coverage for smaller α but with increasing computational cost due to mixing.For this reason, shrinkage-clock users may find it useful to adjust α depending on desired clock-rate coverage or to tackle even larger tree studies.We make all BEAST XML files used in this work publicly available at https://github.com/suchard-group/shrinkingclocks.

Figure 1 :
Figure1: The Bayesian bridge prior places more mass near 0 and has heavier tails compared to other common shrinkage priors.The Bayesian bridge reflects our a priori belief that local clocks are rare, but may arbitrarily speed-up or slow-down the rate of molecular evolution.
Since there are 2N − 2 correlated branch-rate multipliers, one for each branch of the tree, univariable MCMC sampling schemes for r scale poorly to large trees.To remedy this difficulty, we employ Hamiltonian Monte Carlo (HMC) to sample all r simultaneously and with high acceptance probability.HMC leverages the geometry of the high-dimensional branchrate multiplier space to propose states that are farther away than traditional proposals but stay within regions of high posterior density.HMC escapes entrapment by local extrema of the posterior by generating random momentum ν = {ν 1 , . . ., ν 2N −2 } in each dimension Huchon et al. (2002) andDouzery et al. (2003) examine the adaptive radiation of 21 rodents compared to 19 other placental mammals and two marsupial outgroups using the first two codon positions for three nuclear genes: ADRA2B, IRBP and vWF (2422 alignment sites).Douzery et al. (2003) establish the presence of clock variability within this set of taxa and report their best fitting model contains five local clocks.Drummond and Suchard (2010) use the RLC model to re-examine this claim and estimate the existence of between 6 and 12 local clocks.Here we employ our shrinkage-clock model to jointly infer the mammalian phylogeny as well as the number and location of local clocks.Because this is an ultrametric example, γ = 1.We follow the specifications ofDrummond and Suchard (2010) andDouzery et al. (2003) and use a general time reversible (GTR) substitution model with a 4 category discrete-Γ site rate model.We run ten separate Markov chains of our shrinkage-clock with ten different starting trees for 30M states and build a maximum clade credibility (MCC) tree from the combined results (Figure3).We further run 100 RLC chains with 100 different starting trees for 30M states and build a MCC tree for comparison.Under the combinatorial parameter space of the RLC, we observe suboptimal mixing and that some chains convergence to different modes, hence our choice for combining 100 independent chains; the 10 independent chains for the shrinkage-clock simply errs on the side of caution since each independent shrinkage-clock chain converges to the same topology.Incidentally, the shrinkage-clock MCC topology differs from the RLC MCC in two places.First, Bradypus attaches to one of two neighbor internal branches deep in the tree.Second, Anomalurus is more closely related to the Dipus than Castor under the shrinkage-clock.This second difference highlights the well-known difficulty in Anomalurus placement(Horner et al., 2007).Indeed,Blanga-Kanfi et al. (2009) find Anomalurus sits between Dipus and Castor in an analysis of six nuclear genes.The posterior probabilities of Bradypus and Anomalurus parent branches under the shrinkage-clock are almost equal (0.64 and 0.49 respectively).

Figure 3 :
Figure 3: Maximum clade credibility (MCC) tree under shrinkage-clock of mammalian and rodent radiation where branches are colored by posterior mean relative clock rates r.If branch i a new clock, it is labeled with the posterior probability φ i > 0. For comparison, local clocks of the random local clock (RLC) model are depicted as black triangles.Two local clocks of the RLC are excluded due to topological differences between the RLC and shrinkage-clock MCC trees.

Figure 4 :
Figure 4: Effective sample size of branch-specific clock rates per second of BEAST runtime under the shrinkage-clock and RLC during a full joint phylogenetic analysis.

Figure 5 :
Figure 5: Maximum clade credibility tree for influenza A's neuraminidase subtype N7.Branches are colored by posterior clock rates.The most probable local clock is reported and labeled with posterior probability that φ i > 0. The second most probable clock starts a sub-clade of the equine lineage and has a Bayes factor φ i > 0 of 0.351.

Figure 6 :
Figure 6: Maximum clade credibility tree for influenza A's hemagglutinin subtype H7.Branches are colored by posterior clock rates.Local clocks are labeled with a star and the posterior probability φ i > 0.
under the shrinkage-clock and RLC respectively.
Worobey et al. (2014)estimation is sensitive to molecular clock model specification.To consistently estimate divergence times,Worobey et al. (2014)allow the clock rates of various glycoprotein subtypes to vary only between viral hosts and find H7 and N7 each evolve slower in equine hosts than avian hosts.We re-examine this claim with our more general shrinkage-clock model that does not assume the existence of host-dependent clock rates.Specifically, we re-analyze 146 complete gene (1716 nt) sequences of H7 and 92 complete gene (1416 nt) sequences of N7.In each case we follow the model specifications ofWorobey et al. (2014)and employ a . While the number of taxa here is only approximately double or triple the number in the previous simulation study, the set of all possible local clocks under the RLC grows exponentially with the number of taxa, (31 to 63 orders of magnitude), challenging clock rate inference under the RLC.