scrm: efficiently simulating long sequences using the approximated coalescent with recombination

Motivation: Coalescent-based simulation software for genomic sequences allows the efficient in silico generation of short- and medium-sized genetic sequences. However, the simulation of genome-size datasets as produced by next-generation sequencing is currently only possible using fairly crude approximations. Results: We present the sequential coalescent with recombination model (SCRM), a new method that efficiently and accurately approximates the coalescent with recombination, closing the gap between current approximations and the exact model. We present an efficient implementation and show that it can simulate genomic-scale datasets with an essentially correct linkage structure. Availability and implementation: The open source implementation scrm is freely available at https://scrm.github.io under the conditions of the GPLv3 license. Contact: staab@bio.lmu.de or gerton.lunter@well.ox.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.


Algorithm outline
The following provides a high level overview of the algorithm SCRM for simulating the Ancestral Recombination Graph under the coalescent with recombination (CWR). The program scrm implements this algorithm with modifications to incorporate demographic features such as population size changes, population structure and migration. Most of these modifications are straight-forward manipulation of rates and have only a minor impact on the algorithm.
In general, SCRM alternates between moving along the sequence and moving backwards in time. The current sequence position is denoted with p, and the current time with t. The sequence is assumed to start at position 0 and end at position 1. SCRM simulates a binary tree 1 representing the genealogy of the simulated sequences. This tree changes when moving along the sequence. It consists of local and non-local branches (see section 1.1). The simulated sequences are assumed to be at the "lower" end of the tree, which correspond to recent times. Elements of the tree further in the past are referred to as "upwards" (as in Figure S1).

Glossary
The following terms are used for describing the algorithm SCRM : local branch A branch that carries genetic material ancestral to the simulated sequences at the current sequence position p. non-local branch A branch that was local for a previous sequence position p < p, but does not carry ancestral material for the current position p. local recombination A recombination that occurs on a local branch. non-local recombination A recombination that occurs on a non-local branch.
active branch While going backwards in time, up to two branches can be source of changes to the tree. An active branch can either be coalescing at its top, or be looking for non-local recombinations on itself. passive branch A branch that is not active.
local root The point on the tree where the two oldest local branches merge. There can be a non-local branch on top of the local root, so that the local root is not necessarily the actual root of the tree. coalescing If a branch that ends with the root of a tree is coalescing, the top of the branch looks for another branch to merge into. This results in a Coalescence event. looking for non-local recombinations If a non-local branch is active, it samples whether non-local recombinations are occurring on this branch. This results in a Non-local recombination or an End of branch encountered event.

Events and rates
When moving backwards in time, the following events can occur with rates depending on the active nodes or when arriving at a certain time points: 1. Coalescence. Each coalescing branch merges into any branch that exists at the current time with rate one. If two active branches are coalescing, the pair can also merge together with rate one.
2. Non-local recombination. On each branch that is looking for non-local recombinations, recombinations happen with a rate equal to the recombination rate times the sequence length since the branch was last checked for non-local recombinations or became non-local.
3. End of branch encountered. If no recombination occurs on a branch looking for recombinations, an end of branch event is created at the end of the branch.
4. Time of local root reached. When time t is less than the time of the local root t M RCA , one branch is active. When t = t M RCA , this event additionally activates the branch above the local root.

Algorithm
1. "Build an initial tree" 1.1 Initialize the position p = 0 at one end of the sequence.
1.2 Build an initial tree according to Kingman's coalescent.
2. "Move along the sequence to the next local recombination" 2.1 Sample the sequence length s until the next local recombination occurs according to an exponential distribution with rate equal to the length of the current local tree.
2.2 If p + s > 1, the algorithm has reach the end of the sequence and is finished. Otherwise, move along the sequence to the local recombination, e.g. p = p + s.
2.3 Sample the time t rec and the branch b rec of the recombination event, which occurs uniformly on the local tree.
3. "Modify the tree" 3.1 Remove the subtree below the local recombination from the tree, and make it a separate tree. Mark the branch above the root of the new tree as the only active branch. Mark every branch above b rec that has not at least one simulated sequence below it as non-local.
3.2 Set the time t to the time of the recombination t rec .
3.3 Calculate rates for all possible events (see "Events and rates"). Sample the time length t next until the next event happens according to the rates. Increase the time t = t + t next . Continue depending on the type of the event: • Coalescence. The merge is implemented. If the target branch is a local branch: Modification of tree is complete 2 . Continue with 2.1.
the other coalescing node: Modification of tree is complete. Continue with 2.1.
a non-local branch: The coalescing branch is no longer active. The target branches becomes active, and starts looking for postponed non-local recombinations on itself. Continue to go back in time, i.e. with 3.3. • Non-local recombination: The branch becomes passive and the part of the branch below the recombination event is marked as local. The subtree below the recombination is removed from the tree and becomes a separate tree. The root of this tree becomes active and starts coalescing. Continue with 3.3. • End of branch encountered : The branch is marked as local and not active. If the branch ends at the root of a tree: A new branch is created above the root, marked as active and starts coalescing. Continue with 3.3. leads to a local branch: Modification of tree is complete 2 . Continue with 2.1.
leads to a non-local branch: This branch becomes active and starts looking for recombinations. Continue with 3.3. • Time of local root reached. If there is no branch above the local root: Create a new branch above the root, mark it as active and coalescing. a (necessarily non-local) branch: This branch becomes active and starts looking for nonlocal recombinations. Continue with 3.3, now with two active branches. Figure S1: Example of the SCRM algorithm. Red denotes local branches, black non-local one. Active branches are yellow dotted. From (A) to (B) the algorithm moves along the sequence until a recombination occurs (black cross). The recombination is implemented (Step 3.1), which produces (C). The active branch is now coalescing and coalesces into the non-local branch above (D). The active branch is now looking for non-local recombinations, one of which occurs in (E). The recombination is implemented (F), and the active branch is coalescing again. In (G) the time of the local root reached event occurs. Both active branches are now coalescing and coalescent in (G). Now, Step 3 is finished and the algorithm continues with Step 2.

Testing the exact algorithm
To verify that both our algorithm and its implementation work correctly, we compared samples produced under the coalescent with recombination to ones produced by scrm without approximation. For the comparison, we selected six demographic models (Table S2) and simulated 10 5 samples with both ms 3 and scrm for each model. We compared the distribution of commonly used summary statistics including the time to the most recent common ancestor (TMRCA), total branch lengths, the number of mutations and recombinations, average pairwise differences, Fay and Wu's θ H (Fay and Wu, 2000), difference (denoted by H) between θ H and average pairwise differences, and Tajima's D (Tajima, 1989) using Kolmogorov-Smirnov-and χ 2 -tests. In all models the summary statistic's distributions seemed to be virtually identical. We tested for difference in the distributions Kolmogorov-Smirnov-and χ 2 -tests. None of the test found a significant difference at a 95%-confidence level (see Table S1).

Approximation
When building the ARG using SCRM, the effect that non-local branches have on the genealogy after the next recombination events decays with genetic distance. As we move on along the sequence, non-local branches that are not hit by other coalescing branches accumulate a high recombination rate. If such a branch is hit while incorporating the next recombination, the next non-local recombination on this branch will likely occur close to the coalescence point. As the algorithm continues coalescing above the non-local recombination, it will be in a similar state as it was before hitting the branch. Hence non-local branches with high accumulated recombination rates contribute only minor 'jumps' backwards in time to the algorithm. This observation is the key reason that linkage between sites decays with increasing genetic distance in the ARG. Different to the WH-Algorithm, SCRM is able to easily identify branches with a high accumulated recombination rate. We use this to introduce an approximation by removing branches that are not hit by coalescing branches while moving on the sequence for more than l bases, where l is an arbitrary threshold. This results in a sliding window of length l, within which SCRM produces the ARG, while it ignores some genetic linkage to positions outside of this window. If the threshold l is equal to sequence length, SCRM produces the exact ARG; SCRM is identical to the SMC' model if l = 0.

Effects of the approximation on summary statistics
We additionally compared the results of simulations of ms and scrm with approximation windows sizes of l = 0kB, l = 10kB, l = 50kB and l = 300kB using the same procedure as when comparing the exact algorithms above. In addition, we also calculated average TMRCA across the simulated sequence and compared the empirical distribution of its first four moments to ms ( Figure S3). Simulated data were genereated by the following command: where ${i} are ranged from 1 to 100000, and ${window} takes value 0, 10000, 50000, and 300000.
We tested how the approximation parameter affects the simulation output, and the distribution of common summary statistics (Figures S2). A clear difference in the distribution of summary statistics was found when comparing the SMC' to the CWR. As expected the difference decreases when increasing the window size. Some summary statistics appeared less sensitive to the approximation parameters than the others. For example, there is no significant difference in the distribution of Tajima'D or H when the approximation parameter is greater than 50kb. When using scrm in with a window length of 300kb or in exact mode, none of the summary statistics showed a significant difference from ms. A similar pattern is observed when looking at the distribution of average TMRCAs (Figures S3).

Effects of the approximation on linkage
Similar to the analysis of Eriksson et al. (2009), we used the correlation of the total branch length between two sites to quantify the difference in genetic linkage between the CWR and it's approximations produced by scrm. Let ρ(δ) denote the branch length correlation of two sites that are δ base pairs apart. We investigated how ρ(δ) differs between sequences generated with ms using the exact ARG and with scrm using the approximated ARG for two loci as a function of their distance δ for different values of the approximation parameter l.
Let L(i) denote the total branch length of the local genealogy at the site i, we define the branch length correlation between site i and i + δ as where we estimated the expected total branch length E[L] empirically using simulation with ms. The genealogies were generated by the following command: where ${i} are ranged from 1 to 1000, and ${window} takes value 0, 1000, 3000, 5000, 10000, 30000, 50000, 100000 and 300000. We found that when using the exact window approach to approximate the CWR, the correct LD is procured as long as both sites are inside the exact window. The linkage dropped rapidly if δ became larger than the window length l. As linkage decreased with distance between the sites, the maximal possible error introduced by the approximation is also decreasing if l becomes larger. The correlation ρ decreased surprisingly slowly. We still found measurable correlation for sites 100kb apart, which is equivalent to an average of 140 recombination events ( Figure S4). Ratio of observed and true autocorrelation ρ scrm 0kb scrm 10kb scrm 50kb scrm 300kb Figure S4: Ratio of observed and true autocorrelations in total branch length linkage of scrm, where the true autocorrelations are estimated by ms. The increasing variation for higher values of δ is due to the decreasing total linkage between sites.
4 Efficiency compared to existing software 4.1 Linkage comparison of total tree branch length We measured how efficiently scrm performs in comparison to the programs fastsimcoal and MaCS. Similarly to the comparison in section 3.2, we simulated 20 haplotypes with length of 10 Mb, using a recombination rate of 10 −8 under a panmictic model with each program, measured the run-time and used the integrated difference of ρ(δ) to ms to measure the accuracy of the programs. For software that offers a variable approximation level, we use different approximation parameters. We compared run-time and difference in linkage to the ARG for the programs fastsimcoal, MaCS and scrm, which all approximate the ARG. When implementing the SMC' model, all programs produced less linkage than the ARG. Simulations with scrm appeared to be more efficient than MaCS and fastsimcoal, and about 50 times faster than ms. When MaCS' history parameter and scrm's exact window were used, the linkage became more similar to the ARG. When used with a large history, results from MaCS showed more linkage than the ARG, while scrm approached the correct value, see Figure 2 in the main article.

Total number of mutation comparison between approximated models
The program Cosi2 (Shlyakhter et al., 2014) is an efficient simulator which generates the CWR backwards in time instead of following a sequential model. It simulates the CWR extremely efficiently for sequences on an order of several megabases, and also has an approximation feature. Unfortunately, cosi2 seems to be not capable of printing the simulated trees. Hence, it was not possible to compare cosi2's total branch length linkage with the rest of the programs in section 4.1. Instead, we used the distribution of the number of mutations for a separate comparison ( Figure S5).  Figure S5: Efficiency of the approximations of scrm and Cosi2. The runtime (x-axis) for simulating 40Mb with a recombination and mutation rate of 10 −8 using the indicated approximation parameter is drawn. For the y-axis, the empirical cumulative density function (ECDF) of the number of mutations is calculated from 10 6 simulations (as in Figure S2) and the maximal difference of this function from the ECDF of the exact simulation with Cosi2 is drawn. We use Cosi2 as reference in this figure as ms can not simulate the used scenario. We suppose that the minimal difference obtained by both programs reflects the stochasticity of obtaining the ECDF from a finite sample.