SVEngine: an efficient and versatile simulator of genome structural variations with features of cancer clonal evolution

Abstract Background Simulating genome sequence data with variant features facilitates the development and benchmarking of structural variant analysis programs. However, there are only a few data simulators that provide structural variants in silico and even fewer that provide variants with different allelic fraction and haplotypes. Findings We developed SVEngine, an open-source tool to address this need. SVEngine simulates next-generation sequencing data with embedded structural variations. As input, SVEngine takes template haploid sequences (FASTA) and an external variant file, a variant distribution file, and/or a clonal phylogeny tree file (NEWICK) as input. Subsequently, it simulates and outputs sequence contigs (FASTAs), sequence reads (FASTQs), and/or post-alignment files (BAMs). All of the files contain the desired variants, along with BED files containing the ground truth. SVEngine's flexible design process enables one to specify size, position, and allelic fraction for deletions, insertions, duplications, inversions, and translocations. Finally, SVEngine simulates sequence data that replicate the characteristics of a sequencing library with mixed sizes of DNA insert molecules. To improve the compute speed, SVEngine is highly parallelized to reduce the simulation time. Conclusions We demonstrated the versatile features of SVEngine and its improved runtime comparisons with other available simulators. SVEngine's features include the simulation of locus-specific variant frequency designed to mimic the phylogeny of cancer clonal evolution. We validated SVEngine's accuracy by simulating genome-wide structural variants of NA12878 and a heterogeneous cancer genome. Our evaluation included checking various sequencing mapping features such as coverage change, read clipping, insert size shift, and neighboring hanging read pairs for representative variant types. Structural variant callers Lumpy and Manta and tumor heterogeneity estimator THetA2 were able to perform realistically on the simulated data. SVEngine is implemented as a standard Python package and is freely available for academic use .


FINDINGS Background
Next generation sequencing (NGS) has enabled researchers to detect and resolve complex genomic structural features at base-pair resolution. One can detect a variety of structural variations (SVs) including deletions, insertions, inversions, tandem duplications and translocations based on NGS whole genome sequence data [1]. A variety of algorithms have been developed for structural variant calling from NGS data. This includes programs such as Breakdancer, CNVnator, Delly, Haplotype Caller, Lumpy, SWAN, Pindel among others [2][3][4][5][6][7][8][9][10]. Even with these programs, accurate SV detection remains a significant challenge. For example, some SVs occur in lower allelic fractions as seen in tumors with intratumoral heterogeneity [11]. This is frequently the case in sequencing tumor samples, where cancer starts from a seeding clone and through clonal evolution, successively acquires additional rearrangements at lower allelic fractions.
Benchmarking structural variant callers with available ground truth data sets is critical for software tool development, bioinformatics pipeline testing and objective assessment of detection accuracy [12]. Whole genome data sets are available from high sequencing coverage with Illumina or Pacific Bioscience systems [13]. However, for those users who wish to generate new sequencing data sets with specific features, identification and generation of ground truth data sets is a laborious and cost-prohibitive endeavour. Moreover, it is extremely difficult to empirically determine the analytical consequences of different sample processing methods, experimental variability in library preparation and issues of sequencing bias in analysis [14].
Simulating NGS data provides an inexpensive alternative for assessing new algorithms in the context of sequencing data variation as noted [15]. With simulated datasets, one can start refining analysis procedures in silico. Simulated NGS datasets can incorporate the variability associated with NGS sequence data including: sequencing coverage; number of libraries and insert size; base error rates; tool parameters at the data analysis level. For in silico NGS data, a large number of SV characteristics can be readily designed including the number, the category, the size, the breakpoint sequence, the variant fraction and the haplotype for any given locus. As a result, investigators can use this simulated data to assess the potential performance and make the trade-off between analysis cost and sensitivity before even carrying out the experiment.
A number of programs generate NGS read sets to simulate metagenomics or single nucleotide polymorphisms are available [16][17][18][19][20][21][22]. Only recently have we seen the development and release of structural variant simulators. An early example is RSVSim [23], an R package which amends template sequence files with structural variant changes.
However, it requires an interactive R session thus does not support batch processing.
SCNVSim [24] improves upon RSVSim by providing a command line interface. It simulates somatic copy number variants given a number of desired SV events and/or contigs.
Nonetheless, both SCNVSim and RSVSim can only output mutated contig files (FASTA), which require external steps to simulate sequence reads (FASTQ) and output resulting alignments (BAM). VarSim [14] improves upon RSVSim and SCNVSim with integrated read simulation using read simulators such as ART [25]. Instead of using a template sequence file, BAMSurgeon [26] patches an existing alignment file to embed structural variants.
However, this application requires a high depth of coverage in the existing BAM file to successfully assemble a local contig for sequence patching. Moreover, the resulting structural variant may not have the exact breakpoints for the intended simulation. Overall, none of the listed tools provide a straightforward, joint control of an individual variant, including its exact breakpoints, ploidy and locus-specific allelic fraction. These features are particularly useful in simulating the clonal expansion of somatic structural variants as seen in tumors.
As a solution to the limitations of current structural variant simulators, we designed and implemented SVEngine, a full-featured simulation program suite. SVEngine is capable of generating short sequence read sets, such as produced by an Illumina system, for thousands of spike-in variants that cover different types, sizes, haplotypes and allelic fractions. Our application produces these simulated NGS data sets in a fraction of the time of other similar tools. SVEngine's flexibility for accepting different formats enables a user to generate whole genome or targeted sequencing data mimicking germ-line, somatic and complex clonal structured genomes with ease. It offers a high degree of allelic control through its parallelized divide-and-conquer planning scheme. In the simplest mode, users only need to provide the template (reference) sequences and a desired meta-distribution of type, size and variant frequency to receive a full set of resulting FASTA, FASTQ and BAM outputs along with the ground truth BED file.

SVEngine features and simulation performance
We compare the available features of SVEngine with other simulators that include RSVsim [23], SCNVsim [24], VarSim [14] and BAMsurgeon [26], as shown in Table 1. SVEngine and the other tools can simulate common types of copy number events, e.g. deletions and tandem duplications. All simulators except SCNVsim can simulate copy number neutral events, including insertions, inversions, and translocations. SVEngine improves the simulation of more complex SV events -it incorporates a variety of additional structural variant types originating from a combination of changes, such as inverted translocations, inverted duplications, duplicated translocation, and foreign sequence insertions. Users directly specify these events while preparing their input parameters -this process is more streamlined compared to other tools. For example, viral genome sequence insertion, which is a hallmark of the genomes of infected cells as seen in viral diseases and cancers [27], is easily achieved with SVEngine, but not available with other simulation software.
In terms of input/output flexibility and ease of use, SVEngine provides automates template sequence modification, read simulation and read mapping steps. These features are not found in other simulators of SV events. Also, SVEngine is the only tool which outputs a full set of simulation results in standard formats, including altered contig sequence (FASTA), simulated short reads (FASTQ) and alignment (BAM) files (Figure 1). At the input step, all tools take in template sequences in FASTA format as the starting material, while BAMsurgeon additionally requires a pre-existing alignment file in BAM format as input.
Overall, read coverage of this BAM file has to be large (typically >30x), to successfully assemble local contigs. Such requirements preclude the use of BAMSurgeon in applications generating low coverage and consequently limit its users to mimicking conditions based on available high-coverage BAMs. The VarSim tool needs structural variant prototypes from DGV [28] making it only applicable to the human genome. At the output step, RSVsim and SCNVSim provide modified sequence contigs in FASTA files. BAMsurgeon outputs modified alignment in BAM files, but without providing associated modified contigs. VarSim provides both contigs containing a variant and simulated short reads, but it still requires additional user effort to generate alignment files.
With regard to precise and versatile control of individual variants, SVEngine enables one to easily specify variant type, size, exact breakpoint, ploidy and allelic fraction for individual loci. Both BAMsurgeon and SVEngine also support exact breakpoints for individual variants.
However, in practice, the actual breakpoints generated by BAMsurgeon may differ from input, as a result of improvised local contig assembly. Another unique feature of SVEngine is the ability to specify multiple sequencing libraries, which can each have different insert size mean and standard deviation, intended coverage depth, and read length.
In addition to the features listed in Table 1, SVEngine allows user to designate some regions while avoiding others. Examples of such applications include simulating exome or targeted sequencing data sets. Moreover, this feature enables one to avoid complex regions such as telomeres and centromeres. SVEngine also features parallelized simulation by dividing genome into pieces, embedding variants into each piece and then stitching them together. Therefore, its performance can be boosted using the multi-core processors. Table 2 lists the runtime on a test set of 15,000 SV events into a 30x coverage whole genome sequencing simulation, including 2,500 events each of deletion, tandem duplication, inversion, translocation and domestic or foreign sequence insertion. In multi-processor mode, SVEngine has the shortest runtime in all three levels of simulation, i.e. obtaining altered contigs, simulated reads and alignments in FASTA, FASTQ and BAM formats in less than 10 min, 20 mins and 3 hours, respectively. Overall, SVEngine is 1x, 15x and 48x times faster than the single-process SVEngine run. The performance scales almost linearly with the added CPU power in generating the alignment output, because the read mapping time cost dominates other time costs including data serializing time. Moreover, even the singleprocess SVEngine (SVE-single) is more efficient than its other counterparts. For example, it took only 10 mins for SVE-single to generate all altered contigs when RSVSim and SCNVSim took several hours. SVE-single required half the time BAMSurgeon needs to generate all read alignments. All run time were all measured on 8 Intel® Core i7-4790 CPU @ 3.60GHz with 32 GB RAM computer.

Simulating cancer genome evolution
SVEngine provides a high degree of control over SV events with variable allelic fractionsthis feature enables one to simulate heterogeneous cancer genomes undergoing a phylogeny tree-structured clonal evolutionary process. As a demonstration, we present a simple example scenario in which we simulated with SVEngine ( Figure 2). To simplify the description of the phylogenetic process of cancer evolution, we use a binary tree representation of phylogeny. This binary tree is easily converted to a typical phylogeny tree by merging all nodes of identical cell subpopulations.
One example is a binary tree shown in Figure  As we can see, any terminal somatic genotype is completely determined by following the mutational path from the root down to a leaf node. We use a tertiary vector . Therefore, the final population frequency shows the derivation of the above quantities for the example binary tree. The sequence of events ensures a partial order that the mutant allele frequency is always higher for events occurring upstream, as compared to events occurring downstream on the same lineage. It is possible that terminal genotypes may not all coexist in extant populations. The proposed binary tree representation accommodates a deceased population by having zero proportion for such leaf node. SVEngine allows user to input a binary tree with relevant bifurcation fractions to structure the variant fractions that falls along the line of the evolutionary tree. For designating this feature, the input to SVEngine is in standard NEWICK format -a widely accepted format using parenthesis to encode nested tree structures [29]. Each internal node is labelled by the population splitting variant and weighted by the conditional splitting fraction. Each leaf node is labelled by associated  to 100% to simulate a twoclonal origin evolution. With SVEngine's high efficiency, the simulation is easily scaled to tens of thousands of variants with a tree having a more complex structure.

Simulating the multitude of structural variations
Current structural variation detection methods mostly rely on detecting altered read mapping features to identify structure changes [30]. The most important such features are read depth/coverage, read pair insert size, single ended read pairs (hanging reads), soft-clipped reads and split reads (clip/split reads). It is essential for structural variant simulators to correctly produce such feature changes corresponding to the causal event. In For an insertion (Figure 3, second row), the most noticeable change is the clustering of both the right and left hanging read pairs centering over the breakpoint. One observes a similar clustering for the left and right clip/split reads. As shown in Figure 3, an insertion exhibits fewer changes than other types of structural variants, and so insertions are generally the most difficult to detect. In the scenario of a tandem duplication (Figure 3

Simulation software and pipeline
SVEngine was developed as a standard Python package with a C extension. SVEngine provides two Python executables and one C command line executable: mutforge, tree2var and xwgsim, respectively. The mutforge command implements a parallelized algorithm that divides the template genome into blocks of contigs, spikes structural variants into the contigs, samples short reads from the altered contigs, and finally merges the short-read sets back into one file and performs the alignment. The tree2var command implements a procedure that determines variant fractions from an input phylogeny tree based on Equations (1) and (2) and a depth first search graph algorithm, and then substitutes these allele fractions in an input VAR file. The xwgsim command implements a modification to wgsim, which reduces the read sampling rate by 50% for the overlapping regions between contigs (i.e. ligation regions). The overlaps were designed so as to allow for the proper merging of contig-wise read sets. xwgsim only interacts with mutforge thus is mostly transparent to a user.
As shown in Figure 1, the required inputs to mutforge are three Once all inputs are provided, the SVEngine master process divides the template genome into blocks and serializes spike-in tasks to parallel worker processes. The worker process patches its assigned contig and if read pairs were required, it also calls xwgsim to simulate read pairs. The read pair subsets are then collected by the master process and merged, and if alignments were required, it also calls bwa-mem and samtools to map the reads to the reference.
The output of SVEngine has three levels. At the first level (contig), only two files would be generated: one is a FASTA file containing all the altered contigs, the other is the ground truth of spiked-in variants in a BED3 format file with the additional columns following the VAR format as in the input. At the second level (read pair), SVEngine additionally outputs the read 1 and read 2 of the simulated read pairs in two FASTQ files. Finally, at the third level (alignment), SVEngine provides the read alignment output to the given reference in a BAM file format. The runtime of SVEngine increases with the specified output level, as additional processing time will be required. Table 2 can be used as a reference for runtime estimates for different output levels.
The tree2var command simulates a clonal evolution scenario, which requires an additional tree input file (NEWICK). tree2var also takes a VAR file, which can be generated from mutforge with a META file input and in the dryrun mode. The user must insure that the identifier of the tree's internal nodes and the variant match each other, as this is used to identify and replace allele fraction with the value computed from tree phylogeny. The tree2var outputs a new VAR file which contains the rewritten allele fraction fields that reflects the clonal structure described by the user tree. For intuitive diagnostics, tree2var also outputs a ASCII text based plot of the parsed input tree. SVEngine's tree parsing interacts with DendroPy [29], which allows further functionality such as random tree simulations and many tree statistics. The output VAR file from tree2var then becomes the input to mutforge for actual read simulation.

A parallel simulation framework
SVEngine's major improvements to existing structural variant simulation tools involve one's ability to alter the allelic fraction, control of haplotypes and highly efficient parallelized simulation. These improvements were achieved through the core algorithm as illustrated in Fourth, when the worker processes are completed, the master process collects all simulated read pairs from all tasks and concatenates them into two final files, one for read 1 and the other for read 2. Also, it collects all original and altered contigs and concatenate them into one final sequence file. Finally, it performs read pair alignment to the reference genome using bwa-mem and samtools. This is last step, although sequential in SVEngine, is already thread parallelized by other required programs such as the bwa and samtools tools [22,32]. The user also has the option to continue the simulation to the end, which is equivalent to input the output VAR file into SVEngine for simulation in the next step.
To increase the sensitivity of SV detection, researchers may prepare multiple sequencing libraries with different molecular parameters for analysis. For example, different insert sizes enable the detection of a wider spectrum detection of SVs [33]. Longer sequence read length can boost the performance of some callers that employ remapping strategies [34]. A unique feature of SVEngine is its ability to simulate NGS data modelling multiple libraries with different mean insert size and standard deviation, coverage and read lengths. The feature is implemented within the worker process. When using a multi library task, SVEngine will call xwgsim multiple times to generate read pairs in accordance with the library specification.
SVEngine provides simulation data that target or masks specific genomic regions. This formats, such as BED or VCF files. Subsequently, these annotation formats are easily converted into an SVEngine VAR format input file using text processing utilities such as awk and sed. Using a VAR file generated in this way, SVEngine can easily embed these variants into simulation data. LCX and DMA generated and analyzed the data with assistance from CL. LCX and HPJ wrote the manuscript. All authors approved the manuscript.