A Phylodynamic Workflow to Rapidly Gain Insights into the Dispersal History and Dynamics of SARS-CoV-2 Lineages

Abstract Since the start of the COVID-19 pandemic, an unprecedented number of genomic sequences of SARS-CoV-2 have been generated and shared with the scientific community. The unparalleled volume of available genetic data presents a unique opportunity to gain real-time insights into the virus transmission during the pandemic, but also a daunting computational hurdle if analyzed with gold-standard phylogeographic approaches. To tackle this practical limitation, we here describe and apply a rapid analytical pipeline to analyze the spatiotemporal dispersal history and dynamics of SARS-CoV-2 lineages. As a proof of concept, we focus on the Belgian epidemic, which has had one of the highest spatial densities of available SARS-CoV-2 genomes. Our pipeline has the potential to be quickly applied to other countries or regions, with key benefits in complementing epidemiological analyses in assessing the impact of intervention measures or their progressive easement.


Inference of a time-scaled phylogenetic tree
To infer a time-scaled phylogenetic tree, we selected all non-Belgian sequences in the Nextstrain analysis, along with all available Belgian sequences in GISAID as of June 10, 2020, to be included in our analysis. Once we knew which were the accessions of interest, we downloaded the latest whole genome alignment from GISAID. We then cleaned the alignment by manually trimming the 5' and 3' untranslated regions (RefSeq NC_045512.2) and gap-only sites. To obtain a maximum likelihood phylogeny, we ran IQ-TREE 2.0.3 (Minh et al. 2020) under a general time reversible (Tavaré 1986) (GTR) model of nucleotide substitution with empirical base frequencies and four free rate (Yang 1995) site categories. This model configuration was selected as the best GTR model using IQ-TREE's ModelFinder tool. The tree was then inspected for outlier sequences using TempEst (Rambaut et al. 2016) 1.5.3 and, once the outliers were removed, time-calibrated using TreeTime 0.7.4 (Sagulenko et al. 2018). To replicate the Nextstrain workflow as closely as possible, we specified a clock rate of 8x10 -4 in TreeTime and removed samples that deviate more than four interquartile ranges from the root-to-tip regression (https://github.com/nextstrain/ncov).

Preliminary discrete phylogeographic analysis
We performed a preliminary phylogeographic analysis using the discrete diffusion model (Lemey et al. 2009) implemented in the software package BEAST 1.10 . The objective of this first analysis was to identify independent introduction events of SARS-CoV-2 lineages into Belgium. To this end, we used our timescaled phylogenetic tree as a fixed empirical tree and only considered two possible ancestral locations: "Belgium" and "non-Belgium". Bayesian inference through Markov chain Monte Carlo (MCMC) was run on this empirical tree for 10 6 generations and sampled every 1,000 generations. MCMC convergence and mixing properties were inspected using the program Tracer 1.7  to ensure that effective sample size (ESS) values associated with estimated parameters were all >200. After having discarded 10% of sampled trees as burn-in, a maximum clade credibility (MCC) tree was obtained using TreeAnnotator 1.10 . We used the resulting MCC tree to delineate Belgian clusters here defined as phylogenetic clades corresponding to independent introduction events in Belgium. In practice, we identified introduction events by comparing the locations assigned to each pair of nodes connected by the phylogenetic branches of this MCC tree, i.e. the most probable location inferred at internal nodes and the sampling location for tip nodes. We considered an introduction event to be the case when the location assigned to a node was "Belgium" and the location assigned to its parent node in the tree was "non-Belgium".

Continuous and post hoc phylogeographic analyses
We used the continuous diffusion model (Lemey et al. 2010) available in BEAST 1.10 ) to perform a spatially-explicit (or "continuous") phylogeographic reconstruction of the dispersal history of SARS-CoV-2 lineages in Belgium. We employed a relaxed random walk (RRW) diffusion model to generate a posterior distribution of trees whose internal nodes are associated with geographic coordinates (Lemey et al. 2010). Specifically, we used a Cauchy distribution to model the among-branch heterogeneity in diffusion velocity. We performed a distinct continuous phylogeographic reconstruction for each Belgian clade identified by the initial discrete phylogeographic inference, again fixing a time-scaled subtree as an empirical tree. As phylogeographic inference under the continuous diffusion model does not allow identical sampling coordinates assigned to the tips of the tree, we avoided assigning sampling coordinates using the centroid point of each administrative area of origin. For a given sampled sequence, we instead retrieved geographic coordinates from a point randomly sampled within its municipality of origin, which is the maximal level of spatial precision in available metadata. This approach avoids using the common "jitter" option that adds a restricted amount of noise to duplicated sampling coordinates. Using such a jitter could be problematic because it can move sampling coordinates to administrative areas neighbouring their actual administrative area of origin (Dellicour et al. 2018). Furthermore, the administrative areas considered here are municipalities and are rather small (there are currently 581 municipalities in Belgium). The clade-specific continuous phylogeographic reconstructions were only based on Belgian tip nodes for which the municipality of origin was known, i.e. 639 out of 740 genomic sequences. Furthermore, we only performed a continuous phylogeographic inference for Belgian clades linking a minimum of three tip nodes with a known sampling location (municipality).
Each Markov chain was run for 10 6 generations and sampled every 1,000 generations. As with the discrete phylogeographic inference, MCMC convergence/mixing properties were assessed with Tracer, and MCC trees (one per clade) were obtained with TreeAnnotator after discarding 10% of sampled trees as burn-in. We then used functions available in the R package "seraphim" (Dellicour et al. 2016) to extract spatiotemporal information embedded within the same 1,000 posterior trees and visualise the continuous phylogeographic reconstructions. We also used "seraphim" to estimate the following weighted lineage dispersal velocity: where di and ti are the geographic distance travelled (great-circle distance in km) and the time elapsed (in days) on each phylogeny branch, respectively. Weighted lineage dispersal velocity was estimated before and during the lockdown by selecting phylogeny branches for which both nodes occurred before or during the lockdown (using March 18, 2020, as a cut-off), respectively. Consequently, phylogeny branches with ancestral and youngest nodes occurring respectively before and during the lockdown were discarded for these estimates of lineage dispersal velocity. Furthermore, we also estimated the evolution of the weighted lineage dispersal velocity through time, computing estimates over a sliding window of two weeks allowing a minimum number of phylogeny branches to be considered in each time. The choice of this weighted metric was motivated by the low variance associated with its estimates, which results from a restricted contribution of phylogeny branches with short duration, making it particularly useful when aiming to compare different data sets or different time periods (Dellicour et al. 2019). Consequently, depending on the duration of branches overlapping each time slice, estimates obtained for successive time slices do not necessarily contribute equally to the overall estimate obtained for the entire period under consideration. We verified the robustness of our estimates through a subsampling procedure consisting of re-computing the weighted dispersal velocity after having randomly discarded 25% of branches in each of the 1,000 posterior trees.

Assessing the robustness of the pipeline to the selection of the starting tree
Because our approach is dependent on the time-scaled phylogenetic tree selected as a starting point of the pipeline, we performed an additional analysis to assess if our results are robust to the choice of that starting tree. Specifically, we performed five additional independent replicates of maximum likelihood phylogenetic inference in IQ-TREE using different starting seeds. We then re-ran our entire workflow using these alternative trees as starting points, which allowed us to investigate to what extent the topological uncertainty associated with the maximum likelihood inference step could have an impact on the different estimates. We reported and compared the results obtained when considering each starting tree (Table S1): the number of independent introduction events identified on the Belgian territory, and the weighted lineage dispersal velocity (before and during lockdown). Table S1. Robustness test of the pipeline to the selection of the starting tree. We here report estimates (and associated 95% HPD intervals) obtained when running our pipeline on distinct starting phylogenetic trees independently inferred using IQ-TREE.  Figure S1. Time-scaled phylogene c tree in which we iden fied Belgian clusters. A cluster is here defined as a phylogene c clade likely corresponding to a dis nct introduc on into the Belgian territory. We delineated these clusters by performing a simplis c discrete phylogeographic reconstruc on along the me-scaled phylogene c tree while only considering two poten al ancestral loca ons: "Belgium" and "outside Belgium". On the tree, lineages circula ng in Belgium are highlighted in green, and green nodes correspond to the most ancestral node of each Belgian cluster. Human population density (log-transformed) 03-05-2020 03-04-2020 03-03-2020 Figure S2. Spa ally-explicit phylogeographic reconstruc on of the dispersal history of SARS-CoV-2 lineages in the Province of Liège. Con nuous phylogeographic reconstruc on was performed along each Belgian clade (cluster) iden fied by the ini al discrete phylogeographic analysis. For each clade, we mapped the maximum clade credibility (MCC) branches located in the province of Liège, before and a er the 18 th March 2020 (i.e. the beginning of the lockdown).