link-ancestors: fast simulation of local ancestry with tree sequence software

Abstract Summary It is challenging to simulate realistic tracts of genetic ancestry on a scale suitable for simulation-based inference. We present an algorithm that enables this information to be extracted efficiently from tree sequences produced by simulations run with msprime and SLiM. Availability and implementation A C-based implementation of the link-ancestors algorithm is in tskit (https://tskit.dev/tskit/docs/stable/). We also provide a user-friendly wrapper for link-ancestors in tspop, a Python-based utility package.

Algorithm 1 The link-ancestors algorithm.
Set i ← 1, S ← ∅ and A n ← ∅ for all n ∈ N .
l ← max l i , l j , r ← min r i , r j S ← S ∪ c j : [l, r) j ← j + 1 end while end if 15: end if if p i ̸ = p i−1 or i = E then if p i ∈ P then for c j : [l j , r j ) ∈ S do L ← L ∪ l j , r j : p i , c j 20: end for end if Fast simulation of local ancestry with tree sequence software Figure 1: A pair of modern-day American human chromosomes are simulated with msprime to contain tracts of ancestry from three sources: Africa (AFR), Europe (EUR) and Asia (ASIA).This is an example of the information that can be extracted with the tspop wrapper for link-ancestors.See the Supplementary Information for more information about the simulation.

Details of the American admixture exploratory simulation
Our first demonstration simulation uses the American admixture model originally published by Browning et al. (2018) and implemented in stdpopsim (Adrion et al., 2020;Lauterbur et al., 2022).This model contains four modern-day populations representing contemporary Homo sapiens: AFR (Contemporary African population), EUR (Contemporary European population), ASIA (Contemporary Asian population) and ADMIX (Modern admixed population).The ADMIX population is modelled as an instantaneous pulse of gene flow from the other three populations 10 generations in the past.There is also a small amount of continual migration between the other three populations going back to 1000 generations in the past.See the stdpopsim catalog for full details of this model.We simulated a diploid region of length 243,199,373 base pairs with a uniform per-base, per-generation recombination rate of 1.15235 × 10 −8 , as per the stdpopsim defaults for Homo sapiens chromosome 1.
To ensure the correctness of our model, we exported the stdpopsim implementation into msprime code using the Demes package (Gower et al., 2022).We then added a census event at 1000 generations in the past.The ancestral nodes recorded at this census time were then used as input into link-ancestors.The choice of census time ensures that these ancestors belong to one of the three populations: AFR, EUR or ASIA.We generated 10 replicate tree sequences holding 1000 individuals (2000 chromosomes) from each of the three contemporary populations.Tracts of local ancestry from these ancestors in one simulated present-day individual are plotted in Figure 1.The code we ran is available at https://github.com/gtsambos/ooa-ancestry-example.

Details of the runtime and memory comparisons
On each of the replicates, we compared the performance of link-ancestors with several simpler tree-by-tree approaches using tskit's tree methods.Runtimes were calculated using Python's time module, while memory usage was estimated using the scalene command-line interface (Berger et al., 2020).Our results are shown in Table 2.
The initial approach that we considered, 'baseline', involved a loop through each tree T and each sample node.At each sample node, we iterated upwards through the tree to obtain a local genealogical path.Once we encountered an ancestral node from the census time of 100, we terminated the iteration and recorded a tract of local ancestry between the sample and the ancestor over the interval spanned by the tree.This procedure requires of order O(T n log(n)) operations for a dataset of n samples in a tree sequence of T trees; since many segments of ancestry span multiple trees, this is much less efficient in practice than link-ancestors.However, this procedure had substantial memory requirements, requiring over 128 Gb of RAM.This is likely due to the uncompressed nature of the output: since consecutive trees differ by just one subtree-prune-and-regraft operation, many samples will share a common census ancestor across many consecutive trees.Therefore, instead of proceeding with this approach, we ran a modified version, 'squash', that progressively compressed these neighbouring segments of identical ancestry.We also considered that some of the runtime differences were due to the fact that the computationally intensive part of our link-ancestors implementation is coded in C. To minimise the impact of this difference, we used the numba package to create a streamlined pre-compiled version of our 'squash' implementation; we refer to this as the 'numba' procedure.The 'numba' procedure was substantially faster than 'squash' on its own, and with a minimal cost to memory usage.However, both methods required several hours of computation, and were thus still orders of magnitudes slower than the implementation of link-ancestors in tskit, which ran in roughly 4 seconds.Differences in memory requirements between these three methods were negligible.

Details of the Out-of-Africa exploratory simulation
To highlight how the performance of link-ancestors varies according to the time of the census ancestors, we briefly describe some simulations using an older census time.We chose the Out-of-Africa model with archaic admixture originally published by Ragsdale and Gravel (2019) and also implemented in stdpopsim (Adrion et al., 2020;Lauterbur et al., 2022).This model contains three modern-day populations representing contemporary Homo sapiens: YRI (Yoruba, African), CEU (European, Utah) and CHB (Han Chinese, Beijing).There are two populations representing archaic hominins: NEA (Neanderthals) and ARC (a 'ghost' archaic African population).Under this model, there are small amounts of ancient introgression between the following pairs of populations: (YRI, ARC), (CEU, NEA), (CHB, NEA).There is also migration between the three Homo sapiens populations.The 'out-of-Africa' event is modelled as an instantaneous split within the YRI population 2093 generations in the past.See the stdpopsim catalog for full details of this model.The simulated contig was the same as before.We added a census event at 2093.5 generations in the past, just prior to the Out-of-Africa event in which the CEU and CHB lineages split from YRI.This choice of time ensures that the census ancestors belong to one of three populations: YRI, NEA and ARC.We generated 10 replicate tree sequences holding 100 individuals (200 chromosomes) from each of the three contemporary populations.(The smaller sample size was chosen to ensure that we could test link-ancestors on a local machine with <16 Gb of RAM).
Results of these tests are shown in Table 3.Although link-ancestors is still an order of magnitude faster the other approaches, the relative difference is less striking than in the American admixture example detailed in the previous section.It also requires a larger amount of memory.This is due to the fact that the tree-by-tree methods do not need to retain any information about ancestors in previous trees, unlike link-ancestors.We therefore expect that link-ancestors will prove most useful in settings where the admixture events of interest are not ancient.
Fast simulation of local ancestry with tree sequence software

Table 1 :
A reference of objects in the link-ancestors algorithm.
i Internal The child node of the ith edge in E N .E Internal The total number of edges in the edge table E N .E N Tree sequence The ordered set of edges in the tree sequence.i Internal An index for the edges in E N .j Internal An index for the segments in S. k Internal An index for the segments in A n .l i Internal The left coordinate of the ith edge in E N .i Internal The parent node of the ith edge in E N .r i Internal The right coordinate of the ith edge in E N .

Table 2 :
Means and standard deviations of runtimes and RAM usage from various approaches to extract local ancestry from tree sequences.The simulations used a demographic model of contemporary American human admixture using a census time of 100 generations in the past.

Table 3 :
Fast simulation of local ancestry with tree sequence software Means and standard deviations of runtimes and RAM usage from various approaches to extract local ancestry from tree sequences.The simulations used a demographic model of Neanderthal and other archaic human introgression using a census time of 2093.5 generations in the past.