Simulating the dynamics of targeted capture sequencing with CapSim

Abstract Motivation Targeted sequencing using capture probes has become increasingly popular in clinical applications due to its scalability and cost-effectiveness. The approach also allows for higher sequencing coverage of the targeted regions resulting in better analysis statistical power. However, because of the dynamics of the hybridization process, it is difficult to evaluate the efficiency of the probe design prior to the experiments which are time consuming and costly. Results We developed CapSim, a software package for simulation of targeted sequencing. Given a genome sequence and a set of probes, CapSim simulates the fragmentation, the dynamics of probe hybridization and the sequencing of the captured fragments on Illumina and PacBio sequencing platforms. The simulated data can be used for evaluating the performance of the analysis pipeline, as well as the efficiency of the probe design. Parameters of the various stages in the sequencing process can also be evaluated in order to optimize the experiments. Availability and implementation CapSim is publicly available under BSD license at https://github.com/Devika1/capsim. Supplementary information Supplementary data are available at Bioinformatics online.


Modelling DNA fragment size and PacBio polymerase read length
We fitted a large number of probability distributions (over 80 distributions currently implemented in the Python Numpy V1.18 library) to the histograms of fragment size of the above mentioned data sets. Supp. Fig. 1 presents the fit of the most fitted distributions, including Log-Logistic (Fisk), Burr, Log-Normal, Log-Gamma, Levy, Mielke and Johnson SB distributions on four data sets. Of these distributions, we found distributions in the Log-Logistic family (Fisk, Burr, Mielke and Johnson SB) were consistently among the best fitted distributions, though Mielke and Johnson SB distributions seemed to be not robust in the presence of short fragments. We performed the same analysis for the histograms of polymerase read lengths of the PacBio sequencing data, and also found Fisk distribution fitted the best ( Figure 2).
The Log-Logistic distribution has been used for modeling of the distribution of wealth or income in economics [1] and for survival analysis. It is a heavy-tailed distribution that Effects of capture and sequencing on fragment size Supp. Fig. 3 and 4 present the fragment size distributions at various stages of sequencing on Illumina and Pacbio platforms respectively. Note that the fragments here included the 60 base adaptors from both ends. We noticed the fragments obtained from capturing tend to be slightly shorter than those prior to capture. On the capture Illumina data set, the  pre-capture fragment size distribution peaked at nearly 800 bp, but that from post-capture fragment peaked at over 700 bp (Supp. Fig. 3a and 3b). Similarly, for the PacBio data sets, the modes pre-capture and post-capture fragment size distributions were about 2000 bp and 1700 bp, respectively (Supp. Fig. 4a and 4b). This suggests the capture process has the tendency to select shorter fragments. We also noticed the fragments that were sequenced on both platforms were even shorter. The mode of the fragment size (without adaptors) distribution from Illumina data was just over 200 bp (Supp. Fig. 3). This is equivalent to a mode of 320 base-pairs for fragment with adapters, which is markedly shorter than the post-capture fragment library (mode of 700 basepairs). This is explained by the clustering process preferentially amplifies shorter libraries in a mixture of fragments 1 . The reduction of fragment sizes from captured library to sequenced fragments on the PacBio is less dramatic, from about to 1.7kb to 1.5kb (Supp. Fig. 4). We attribute this to that smaller fragments are more likely to be selected to translocate through the wells in the PacBio SMRT cell.

Implementation of CapSim
In a capture protocol, only the fragments that are hybridized with probes (captured) are retained for the sequencing step, while the others are washed away. Instead of simulating the fragmentation of the whole genomes where most fragments are not captured, CapSim marks the regions on the genome that can be captured. With this, CapSim only needs to sample fragments from these regions to significantly reduce running time. We examined the real captured data and found that there were fragments captured with the hybridisation of only 30 bp to a probe. We therefore used bowtie2 [3] with permissive parameters and allowing multiple position alignments (-local -very-sensitive-local -mp 8 -rdg 10,8 -rfg 10,8 -k 10000) to align probes to the genome sequence. The regions which have at least one probe aligning to are marked for sampling captured fragments.
As described previously, the clustering of fragments to the Illumina flowcell favours fragments library in the range of 100-200 bp. For a long fragment library, most of fragments will be not be clustered for sequencing, and hence the simulation of these fragments would be wasted. Instead, from the distribution of fragment size specified by the users for fragmentation (fragmentation fragment) and the fragment distribution preferred by the clustering process (clustering distribution), CapSim uses a Monte-Carlo technique to generate a distribution of fragments that actually clustered and sequenced. Briefly, CapSim iteratively samples from the fragmentation distribution. These samples are then selected or rejected following the clustering distribution. The selected samples are then used to generate the fragment size distribution of fragments that are actually clustered for sequencing. By generating this distribution before the simulation of fragmentation, CapSim can avoid the unnecessary simulation of the fragments that are not used for sequencing. We used Illumina capture sequencing data from NA12878 sample to assess the performance of the CapSim simulation and WesSim simulation. All 3 datasets were mapped to hg19 reference using bwa-mem. We used Picard to calculate the capture metrics (Fig. 5). Percentage of target bases at specified coverage was similar to real capture data and CapSim simulation data, atleast 70% of on-target bases had greater than 30x coverage. Only 12% of on-target capture was achieved with real capture sequencing data, and this is due to the capture design of repetitive regions. Low on-target rate is likely due to the alignment issues with repetitive sequences. Nevertheless, the exact alignment settings were used for all the datasets to avoid any bias in alignment. CapSim simulation data achieved 41% on-target, whereas WesSim achieved 93% on-target. Furthermore, figure. 6 shows the off-target regions which were detected by CapSim, but not by WesSim. These results demonstrate that CapSim simulation produced comparable results to real capture sequencing data than WesSim.

Performance of CapSim
Supp. Fig. 6: Sequence coverage distribution of Illumina sequencing data from real capture sequencing data on test sample (first panel), capture simulation data from CapSim (second panel) and WesSim (third panel). Position of the capture probes are shown in blue in the bottom capture track panels. (a) -(f) are different regions in the genome, which has off-target capture in real capture data in test sample, which was also successfully detected by CapSim.
CapSim was also used to generate PacBio sequencing data and the performance of CapSim was comparable to real capture PacBio sequencing data. WesSim does not support simulation of PacBio sequencing data, therefore there wasn't any comparable tools to assess the performance of CapSim for PacBio sequencing data Supp. Fig. 7: Sequence coverage distribution of PacBio sequencing data from real capture sequencing data on test sample (first panel) and capture simulation data from CapSim (second panel). Position of the capture probes are shown in blue in the bottom capture track panels.
(a) -(c) are different regions in the genome which were successfully simulated by CapSim.