CNAsim: improved simulation of single-cell copy number profiles and DNA-seq data from tumors

Abstract Summary CNAsim is a software package for improved simulation of single-cell copy number alteration (CNA) data from tumors. CNAsim can be used to efficiently generate single-cell copy number profiles for thousands of simulated tumor cells under a more realistic error model and a broader range of possible CNA mechanisms compared with existing simulators. The error model implemented in CNAsim accounts for the specific biases of single-cell sequencing that lead to read count fluctuation and poor resolution of CNA detection. For improved realism over existing simulators, CNAsim can (i) generate WGD, whole-chromosomal CNAs, and chromosome-arm CNAs, (ii) simulate subclonal population structure defined by the accumulation of chromosomal CNAs, and (iii) dilute the sampled cell population with both normal diploid cells and pseudo-diploid cells. The software can also generate DNA-seq data for sampled cells. Availability and implementation CNAsim is written in Python and is freely available as open-source software at https://github.com/samsonweiner/CNAsim.

A user-defined proportion of the sampled cells in the experiment can be taken from normal, healthy diploid cells outside of the tumor lineage. Samples taken from tumors of cancer patients are expected to contain a significant proportion of normal cells in addition to cancer cells [18]. Some experiments identify and discard normal cells before further downstream analysis through practices such as DAPI staining with fluorescence-activated cell sorting [27,43]; however, non-cancerous cells are often still included [22]. In certain situations, normal cells play a crucial role in analysis, for example in the SCOPE [62] method for CNA detection, which uses normal cells as negative control samples for read depth normalization. CNAsim dilutes the sampled population with normal cells by attaching observed cells directly to the root node with a special edge that contains no somatic mutations. Essentially, the added normal cells inherit an identical copy of the input genome placed at the root node.
In addition to normal cells, CNAsim can also simulate pseudo-normal cells. The presence of pseudo-diploid cells has been reported in existing scDNA-seq datasets [3,30,32,44]; they are typically defined as near-diploid cells that are seemingly unrelated to the major tumor populations, but have also diverged from the normal cells to varying degrees [44]. In CNAsim, pseudo-normal cells are introduced into the cell lineage tree along the edge between the root node and the founder node of the primary tumor lineage. This edge represents the period of time in which the initial abnormal population of cells acquires the set of cancer-causing mutations that results in neoplasia, and is therefore expected to carry a significant number of focal CNAs compared to edges within the tumor lineage (see Supplemental Note S7). If m is the total number of focal CNAs assigned to this edge, pseudo-normal cells are attached at random increments along the edge such that they inherit anywhere between 1 and m − 1 focal events.
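The attachment of pseudo-normal cells can be sketched as follows. Note that `attach_pseudo_normal_cells` is a hypothetical helper, not part of CNAsim's actual API, and the assumption that the inherited event count is drawn uniformly from 1 to m − 1 is ours; the source only states that attachment points are random increments along the edge.

```python
import random

def attach_pseudo_normal_cells(num_cells, m, rng=None):
    """For each pseudo-normal cell, sample how many of the m focal CNAs on
    the root-to-founder edge it inherits (anywhere between 1 and m - 1).
    A uniform draw over the increments is an assumption of this sketch."""
    rng = rng or random.Random()
    return [rng.randint(1, m - 1) for _ in range(num_cells)]
```

Each returned count corresponds to one pseudo-normal cell attached partway along the root-to-founder edge.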
The number of normal, pseudo-normal, and tumor cells is controlled by three parameters: the total number of sampled cells, the normal cell fraction, and the pseudo-normal cell fraction. The number of tumor cells is thus the remaining fraction, and is equivalent to the number of leaves in the main tumor lineage. By default, both fractions are set to 0.
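The arithmetic relating the three parameters can be illustrated with a minimal sketch; how CNAsim rounds the fractional counts is an assumption here, and `cell_counts` is an illustrative helper rather than the simulator's actual code.

```python
def cell_counts(total_cells, normal_frac=0.0, pseudo_frac=0.0):
    # Numbers of normal and pseudo-normal cells (rounded down here, as an
    # assumption); the remainder are tumor cells, i.e. the leaves of the
    # main tumor lineage.
    n_normal = int(total_cells * normal_frac)
    n_pseudo = int(total_cells * pseudo_frac)
    return n_normal, n_pseudo, total_cells - n_normal - n_pseudo
```

For example, 100 sampled cells with a normal fraction of 0.1 and a pseudo-normal fraction of 0.05 yields 10 normal, 5 pseudo-normal, and 85 tumor cells.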

S.2 Simulation of tumor cell lineage tree
The main tumor lineage is simulated separately from the normal and pseudo-normal cells. CNAsim simulates an exponentially growing tumor population under neutral coalescence [28], where the observed tumor cells are sampled from a much larger population. The sampled tumor cells are represented as the leaves of a cell lineage tree generated using the msprime package [4], and the common ancestor of all sampled tumor cells is referred to as the founder cell.
Neutral theory has been widely applied to, and observed in, tumor populations across cell types [5,9,35,45,46,58,63] and is used in existing single-cell tumor phylogeny simulators [49,55,64]. However, the dynamics of neutral evolution have primarily been observed in cancer only after the accumulation of genomic changes that result in neoplasia, and it is believed that selection plays an important role in the initial stages of cancer progression. Therefore, it is assumed that the founder cell is sampled from a small existing population of cancer cells after oncogenic initiation. While neutral evolution has been shown to dominate the genomic landscape during later stages of tumor growth [58,63], this does not necessarily imply the complete absence of selection [56]. In the context of tumor growth, selection among subclones may be a relevant factor in shaping the tumor landscape [54]. While a coalescent model for cancer subclone selection has yet to be fully developed [49], there have been attempts to model selection in the coalescent under limited scenarios, e.g., [4,10]. Accordingly, CNAsim allows the user to introduce selection into the tumor lineage through a user-defined number of selective sweeps that result in bursts of rapid coalescence [4]. To achieve this, the population model alternates between the standard coalescent [28] and a structured coalescent with selective backgrounds [10], implemented in msprime [4] as the SweepGenicSelection ancestry model. However, since this approach likely does not adequately model subclonal selection, and since neutral evolution is an ideal null model for comparison [12,49], the default number of sweeps in CNAsim is 0 (in other words, only the standard coalescent model is used). If the number of sweeps is > 0, each sweep is placed at random over the tumor lifespan, and the selective advantage of each sweep is controlled by a user-defined strength parameter [4].
CNAsim takes as input an exponential growth rate α used in the coalescent model. For convenience, we derive an appropriate value of α that is used as the default. We assume the tumor grows exponentially from a founder cell sampled from an initial population of N_t = 1000 cells that began t generations in the past [11,42,61]. Diagnosis occurs in the present (at time t = 0) when the tumor reaches a population of N_0 = 1 × 10^9 cells [5]. We approximate the time between onset and diagnosis to be 10 years [23,41], during which we assume a cell turnover of once per day [5,25]. The total number of generations is therefore t = 10 × 365 = 3650. This gives a growth rate of α = (ln(N_0) − ln(N_t)) / t ≈ 3.785 × 10^−3.
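This default can be reproduced with a few lines of arithmetic:

```python
import math

# Default growth-rate derivation: the tumor grows from N_t = 1000 cells at
# onset to N_0 = 1e9 cells at diagnosis, over t = 10 years of one
# generation (cell turnover) per day.
N_t = 1_000
N_0 = 1_000_000_000
t = 10 * 365  # total generations between onset and diagnosis

alpha = (math.log(N_0) - math.log(N_t)) / t
print(f"{alpha:.3e}")  # prints 3.785e-03
```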

S.3 Clonal model
After the cell lineage tree is generated, a user-defined number of ancestral tumor cells can be selected to represent subclonal founders. In addition to focal CNAs, subclonal founder cells undergo a Poisson-distributed number of chromosomal events. All ancestral and observed cells that are descendants of a subclonal founder inherit its chromosomal aberrations, and are considered to be part of the same subclone. This approach of defining clones with chromosomal events is motivated by clonal structures observed in real scDNA-seq datasets [43,51,66], but was largely ignored in the era of bulk sequencing [60].
Chromosomal events can also be added along the edges out of the founder cell to create a super-clonal structure over the distinct left and right branches. This feature mimics observations found in real data [66]. The number of chromosomal events for each super-clone is also Poisson-distributed, but has a separate mean from subclonal founders.
Subclonal founder cells are selected randomly among all ancestral nodes, excluding the founder of the main tumor lineage. To ensure a uniform selection with respect to tree depth, the probability of choosing an ancestral node is proportional to the number of leaves in the subtree induced by that node. For stricter control over subclone size, ancestral nodes can instead be chosen according to a user-defined Gaussian distribution. Specifically, the probability of selecting an ancestral node n with |n| descendant leaves is proportional to its density:

P(n) ∝ exp(−(|n| − µ)^2 / (2σ^2)),

where µ and σ are the mean number of cells in a subclone and the standard deviation, respectively. The Gaussian distribution can be positively skewed by setting the value of µ sufficiently high, with σ acting as a skew parameter.
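Gaussian-weighted founder selection might be sketched as follows. `select_subclone_founders` is an illustrative helper, not CNAsim's actual code, and sampling without replacement is an assumption of this sketch.

```python
import math
import random

def select_subclone_founders(nodes, num_subclones, mu, sigma, rng=None):
    """Select subclonal founders among ancestral nodes.

    `nodes` is a list of (node_id, num_leaves) pairs; the probability of
    selecting node n is proportional to the Gaussian density
    exp(-(|n| - mu)^2 / (2 * sigma^2)), where |n| is its leaf count.
    """
    rng = rng or random.Random()
    pool = [(node_id, math.exp(-((leaves - mu) ** 2) / (2 * sigma ** 2)))
            for node_id, leaves in nodes]
    chosen = []
    for _ in range(num_subclones):
        total = sum(w for _, w in pool)
        r = rng.random() * total
        acc = 0.0
        for i, (node_id, w) in enumerate(pool):
            acc += w
            if acc >= r:
                chosen.append(node_id)
                pool.pop(i)  # sample without replacement (an assumption)
                break
    return chosen
```

With µ = 10 and σ = 2, nodes inducing subtrees of roughly 10 leaves are strongly favored over very small or very large subtrees.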
As an alternative to using the size of the induced subtree as a selection criterion, the probability of selecting an ancestral node can be made proportional to its branch length. Edge lengths in trees simulated under the neutral coalescent often increase with tree height, ensuring that a variety of both small and large subclones are selected.

S.4 Simplified representation of genomes
Before simulating the evolution of cellular genomes along the cell lineage, CNAsim constructs an initial normal diploid genome which is placed at the root of the tree. This genome is equivalent to one copy each of the provided input haploid reference genomes, for two copies total. We note that passing two reference genomes is only necessary if one is interested in generating sequencing reads with known variants on a particular haplotype (see Supplementary Section S7). If a single haploid reference is provided, then the genome is equivalent to two identical copies of the provided reference. Rather than treating the initial genome as a long sequence of nucleotides, CNAsim partitions the sequence into an ordered array of non-overlapping regions of uniform length M. Each region is represented by a unique integer identifier, corresponding to a particular M-length DNA segment of the reference sequence. For example, a reference genome consisting of a single chromosome containing 1 × 10^8 bp and M = 1000 would be represented by two identical ordered vectors ⟨1, 2, ..., 100000⟩. When a CNA is generated along an edge of the tree and alters the inherited genome, this has the effect of augmenting the ordered array of regions. Conceptually, this is equivalent to an alteration of the corresponding DNA segment defined by each affected region. For example, if M = 1000, a 3000 bp deletion event starting at region 8 in the genome ⟨1, ..., 6, 7, 8, 9, 10, 11, 12, ..., 100000⟩ would result in ⟨1, ..., 6, 7, 11, 12, ..., 100000⟩. Using this condensed representation, the ordering and length of a genome is dynamically updated, which implicitly allows for overlapping events. Additionally, this representation naturally lends itself to generating ground truth copy number profiles through examining the number of regions assigned to each bin.
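The worked deletion example can be reproduced directly with Python lists:

```python
M = 1000  # region size in bp

# A single 1e8 bp (100 Mbp) chromosome as an ordered vector of region
# identifiers 1..100000.
chromosome = list(range(1, 100_001))

# A 3000 bp deletion (3000 / M = 3 regions) starting at region 8 removes
# regions 8, 9, and 10 from the vector.
start = chromosome.index(8)
after_deletion = chromosome[:start] + chromosome[start + 3:]
# after_deletion begins <1, ..., 6, 7, 11, 12, ...>
```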
When choosing a value for M, one has to consider the trade-off between computational efficiency and biological realism. In any case, an appropriate lower bound for M is the smallest scale at which a CNA can alter a genome. It is unclear where to draw the boundary between indels and CNAs, reflected in the many different definitions of the lower bound for CNA size in the literature. Typically, a variant has to be larger than 1000 base pairs to be classified as a CNA [15,48,53,59], but minimum sizes have been reported as small as 50 base pairs [37,47]. Naturally, larger values of M will improve the runtime of CNAsim and may be suitable for experiments with a large number of observed cells.
When partitioning each chromosome of the initial genome into regions, the number of regions is based solely on chromosome length. Chromosome lengths can be determined by the primary haploid reference genome passed as input by the user. This is a required input if CNAsim is used to generate sequencing reads, because a complete DNA sequence for each chromosome is needed for sampling reads. After the genomes are mutated along the cell lineage tree, when generating sequencing reads for a particular cell, the genome is first converted back into a full sequence using the reference. More specifically, given the simplified genome representation, a complete sequence is built by substituting each region with its corresponding M-length nucleotide sequence from the provided reference genome(s). If, on the other hand, CNAsim is used only to generate CNPs, an input reference genome is not necessary. Instead, the user has the option to select chromosome lengths derived from the human reference genome GRCh38. Additionally, there are parameters to directly control the number of chromosomes and chromosome length. If CNAsim is used to simulate chromosome-arm events, the location of the centromere must be specified. Because the ratio of short to long arm of each chromosome cannot be learned from an input reference genome, users have two options in either mode: they can use chromosome-arm ratios derived from GRCh38, or set a fixed ratio for all chromosomes.

S.5 CNA rates and properties
CNAsim can generate four types of CNAs, which are distinguished by size: focal CNAs, chromosome-arm CNAs, whole-chromosomal CNAs, and whole-genome duplication (WGD). The latter three are characterized as aneuploidy events and may only occur in select locations in the tree. On the other hand, focal CNAs can occur along all edges except those between normal cells and the root. The number of focal CNAs along an edge is drawn from a Poisson distribution with a user-defined mean λ. The value of λ can either be fixed for each edge, or it can be proportional to edge length. In the latter case, λ corresponds to the mean number of events for the average leaf edge length. Given that trees simulated under neutral coalescence generally have shorter edge lengths towards the leaves, it is expected that more events will be placed higher in the tree. Single-cell analysis of multiple triple-negative breast cancer samples and a melanoma tumor line revealed the presence of many tens to even hundreds of CNAs in the genome of an individual cell [19,43,60]. These findings are consistent with pan-cancer studies across thousands of samples and multiple cancer types [7,17,26,67], albeit with varying sequencing technologies and CNA detection methods. Furthermore, these studies may be under-reporting the true number of CNAs due to resolution constraints or simply filtering out smaller variants. To both simulate an appropriate number of events and ensure adequate genomic variation among observed cells, the value for λ should reflect the total number of cells in the experiment. For an experiment with n observed tumor cells, the expected number of CNAs accumulated in total for each observed cell is roughly λ · log n.
Focal CNAs are chosen to be deletions or amplifications according to a binomial distribution. Previous studies have reported either a relatively equal number of deletions and amplifications, or a slightly greater proportion of amplifications [7,67]. In either case, the number of deletions tends to scale with the number of amplifications [26], so the binomial parameter is p = 0.5 by default. Amplification events draw an additional number of copies to be gained from a geometric distribution [38]. The default geometric parameter is p = 0.5, for a mean of 1/p = 2 additional copies, based on the observation that the amplitude of most copy number gains is relatively low and higher copy number gains are rare [34,36]. All additional copies are placed in tandem with the original. The length of a focal CNA is drawn from an exponential distribution with mean β and a user-defined minimum length. Several probes into somatic CNA patterns have indicated that the frequency at which focal CNAs occur is inversely proportional to their sizes [7,29,33,67]. However, mean CNA lengths appear to vary significantly across experiments; while some studies have reported average lengths of less than 1 Mbp [1,2], others have found this value to be multiple Mbp [7,16,20,26,67]. This variance can be partially explained by inconsistent terminology and differences in technology and resolution. The default of β = 5 Mbp is well within existing estimates and is sufficient for meaningful evaluation. We note that in using the simplified genome representation (see Supplementary Section S4), focal CNA lengths must be scaled into units of M-length regions. That is, an event of length X bp will modify X/M regions, rounded to the nearest integer with a minimum of 1 region. In choosing the location where a focal CNA occurs, first the allele-specific chromosome is chosen with probability proportional to the chromosome length.
Then, a starting coordinate is chosen uniformly among all available positions. That is, for a chromosome of length L, the starting coordinate of an event of length X is chosen uniformly among ⟨1, ..., L − X⟩. We note that once the chromosome is selected, we ensure the event length is no greater than the chromosome length. To illustrate how focal CNAs alter the simplified genome, consider a default chromosome of length 100 Mbp with a region size of 1000 bp, represented by the vector ⟨1, 2, ..., 100000⟩. A 5000 bp deletion starting at the 10000 bp coordinate would alter the chromosome to be ⟨1, ..., 9, 15, ..., 100000⟩. If instead there was a 5000 bp duplication starting at the 10000 bp coordinate, the altered chromosome would be ⟨1, ..., 9, 10, 11, 12, 13, 14, 10, 11, 12, 13, 14, 15, ..., 100000⟩.
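The two worked examples above can be reproduced with a small helper. `apply_focal_cna` is hypothetical (not CNAsim's API), and it assumes 1-based genomic coordinates when converting base-pair positions to region indices.

```python
M = 1000  # region size in bp
chromosome = list(range(1, 100_001))  # a 100 Mbp chromosome

def apply_focal_cna(regions, start_bp, length_bp, kind, copies=1):
    """Apply a deletion or tandem duplication to a region vector.

    Assumes 1-based coordinates: the base at `start_bp` falls in region
    index (start_bp - 1) // M. Event lengths are scaled to regions,
    rounded to the nearest integer with a minimum of 1 region.
    """
    i = (start_bp - 1) // M                # first affected region index
    k = max(1, round(length_bp / M))       # number of affected regions
    if kind == "del":
        return regions[:i] + regions[i + k:]
    # Tandem amplification: extra copies are placed after the original.
    return regions[:i + k] + regions[i:i + k] * copies + regions[i + k:]

# 5000 bp deletion starting at the 10000 bp coordinate: removes regions 10-14.
deleted = apply_focal_cna(chromosome, 10_000, 5_000, "del")
# 5000 bp duplication at the same coordinate: regions 10-14 appear twice.
duplicated = apply_focal_cna(chromosome, 10_000, 5_000, "dup")
```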
Recent studies into intratumor heterogeneity (ITH) have indicated that driver mutations leading to clonal sweeps rarely occur at intermediate or later stages of tumor proliferation without external factors [13], suggesting that a considerable number of mutations initiate cancer growth and are pervasive across subclones [19,50,54]. To account for this, the value of λ is amplified along the edge between the root node and founder node by a user-defined scalar. A whole-genome duplication (WGD) event can be placed along this edge after focal CNAs have mutated the founder genome. This is based on evidence for WGDs arising early in carcinogenesis across many cancer types, after the accumulation of transforming driver mutations [8].
There are three locations in the tree where chromosomal CNAs can occur: the edge between root and founder, the edges out of the founder, and edges above clonal nodes. The numbers of events along these edges are Poisson-distributed with a separate mean for each location, denoted α1, α2, and α3, respectively. Tuning these parameters, along with the option of adding WGD, gives the user significant control over the mutational landscape of the population and the average ploidy. For example, a population-wide ploidy of three can be achieved by toggling WGD and using a high rate of chromosomal deletions above the founder. Data from pan-cancer studies show that 25% of the genome of a typical solid tumor is altered through whole-chromosomal and chromosome-arm CNAs [6]. To this effect, the default values for α1, α2, and α3 are such that the genome of the average observed cell in the population has undergone a median of 8 events.
Chromosomal CNAs are chosen to affect either a chromosome arm or a whole chromosome according to a binomial distribution. These two sources of large-scale alteration are thought to have different underlying biological mechanisms [26,31], and thus likely occur at different rates. Indeed, chromosome-arm CNAs are thought to occur more commonly than whole-chromosomal CNAs [7,57], and may also be the result of several focal CNAs. By default, chromosomal CNAs are chosen to be arm-level events with probability p = 0.75. Short and long arms are selected with equal probability. Both chromosome-arm CNAs and whole-chromosomal CNAs can be either a deletion or a duplication event, but not an amplification [7]. Whether the event is a deletion or a duplication is again determined by a binomial distribution. The probability of selecting a deletion, by default, depends on whether or not WGD is included. The pan-cancer study of [67] found that the average ploidy in tumors with WGD was 3.31 instead of 4, whereas tumors without WGD had an average ploidy of 1.99. This suggests that large amounts of loss are associated with WGD, whereas diploid genomes maintain a relative balance of gain and loss. In the presence of WGD, chromosomal CNAs are duplications with probability p = 0.2; otherwise, p = 0.5. Each chromosome created as a result of WGD or a whole-chromosomal duplication is treated as a new allele. Unlike focal CNAs, which occur in tandem with the original segment, WGD and whole-chromosomal duplications are the result of errors during segregation [26,31]. This implementation approach ensures consistency of any subsequent CNAs across the affected region.

S.6 Generating copy number profiles
When run under CNP mode, CNAsim generates copy number data directly, without the need to apply CNA detection algorithms to sequencing data. First, the M-length regions of the starting diploid genome are grouped into larger fixed-size bins. The choice of bin size depends on the expected coverage of the experiment and the resolution of the copy number detection method, for example on the order of 500 kbp [21] to 5 Mbp [66]. The default bin size is set to 1 Mbp. Each observed cellular genome is 'mapped' back to the starting diploid genome by counting the number of copies of the original M-length regions. Then, the copy number of each bin is computed as the average number of occurrences of all regions assigned to that bin. This results in the desired cell-by-bin copy number profile matrix.
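The binning step can be sketched as follows for a single haplotype; `copy_number_profile` is an illustrative helper, and a real profile would sum the counts over both alleles.

```python
def copy_number_profile(cell_regions, num_regions, bin_size_regions):
    """Map a mutated region vector back onto the starting genome and
    compute per-bin copy numbers as the average multiplicity of the
    original regions assigned to each bin (single haplotype)."""
    counts = [0] * (num_regions + 1)  # counts[r] = occurrences of region r
    for r in cell_regions:
        counts[r] += 1
    profile = []
    for start in range(1, num_regions + 1, bin_size_regions):
        window = counts[start:start + bin_size_regions]
        profile.append(sum(window) / len(window))
    return profile
```

For a 10-region haplotype where region 3 has been duplicated and region 8 deleted, binning into two 5-region bins gives copy numbers 1.2 and 0.8.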
Despite recent advancements in CNA detection algorithms, obtaining accurate copy number profiles remains unreliable [38]. CNAsim implements a realistic noise model designed to capture the error expected of real copy number profiles. The simulator models two independent sources of error at different steps in the single-cell DNA sequencing pipeline that may result in artifacts in the final copy number profiles. In scDNA-seq data, existing methods have reported a significant concentration of errors in the copy numbers at the boundaries between distinct copy number segments where CNAs have occurred [21,38]. This is likely because low and/or non-uniform coverage can blur the precise endpoints between segments. CNAsim captures this noise pattern as part of the boundary model. The idea is to extend the boundaries of contiguous segments with the same copy number such that a random number of neighboring bins are assigned the segment's copy number. Conversely, the boundaries may be shrunk such that bins at the edges of the segment are assigned the copy number of the neighboring segment. For each chromosome of each cell, we iterate across the copy number profile while keeping track of the length of the current segment. When a change in copy number occurs, we draw a new length for the segment from a Gaussian distribution with mean equal to the segment length and standard deviation equal to the segment length multiplied by a user-defined error rate r_b.
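A minimal sketch of the boundary model follows. `boundary_noise` is not CNAsim's actual code; the choice to preserve chromosome length by trimming or padding with the final segment's copy number is an assumption of this sketch.

```python
import random

def boundary_noise(profile, r_b, rng=None):
    """Perturb segment boundaries in one chromosome's copy number profile:
    each contiguous run of equal copy numbers is assigned a new length
    drawn from a Gaussian centred on its true length, with standard
    deviation (true length) * r_b."""
    rng = rng or random.Random()
    # Collapse the profile into (copy_number, length) segments.
    segments = []
    for cn in profile:
        if segments and segments[-1][0] == cn:
            segments[-1][1] += 1
        else:
            segments.append([cn, 1])
    noisy = []
    for cn, length in segments:
        new_len = max(0, round(rng.gauss(length, length * r_b)))
        noisy.extend([cn] * new_len)
    # Trim or pad with the final copy number to preserve chromosome length
    # (an assumption; the source does not specify this bookkeeping).
    noisy = noisy[:len(profile)]
    noisy += [segments[-1][0]] * (len(profile) - len(noisy))
    return noisy
```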
Next, random jitter is introduced to each bin independently as part of the jitter model. The intention is to mimic random fluctuations in read counts across the genome that are common to scDNA-seq technologies, the primary cause being non-uniform coverage. The challenge of working with noisy sequencing data has been addressed in recent years, as CNA detection algorithms specifically designed for scDNA-seq data usually apply a normalization routine [39,62]. However, many commonly used methods are still susceptible to fluctuations in read counts, while others are refactored methods originally designed for other sources of data such as bulk sequencing [52]. For each bin with copy number c, a new copy number is assigned based on a draw from a Gaussian distribution with mean equal to c and standard deviation equal to c · r_j, where r_j is a user-defined error rate.
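The jitter model amounts to one Gaussian draw per bin; whether negative draws are clamped at zero is an assumption in this sketch, and `jitter_noise` is an illustrative helper.

```python
import random

def jitter_noise(profile, r_j, rng=None):
    """Independently perturb each bin's copy number c with a draw from a
    Gaussian of mean c and standard deviation c * r_j. Clamping negative
    draws to zero is an assumption of this sketch."""
    rng = rng or random.Random()
    return [max(0.0, rng.gauss(c, c * r_j)) for c in profile]
```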
Default values of r_b and r_j were chosen by observing the error rates of various CNA detection methods in the benchmarking study of [38], which reported precision and recall values for three different methods over various experimental conditions. While the results of these experiments varied greatly across sequencing technologies and ploidies, the best performing method achieved precision in the range 0.4–0.75 and recall in the range 0.5–0.8. To choose appropriate defaults, we evaluated precision and recall values for all combinations of r_b ∈ {0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12} and r_j ∈ {0.05, 0.075, 0.1, 0.125, 0.15}. We found that setting r_b = 0.04 and r_j = 0.1 results in a precision and recall of 0.738 and 0.784, respectively, corresponding to the upper end of the ranges reported in [38].

S.7 Generating sequencing data
CNAsim can be used to generate sequencing reads for each observed cell. To generate reads for an observed cell, the mutated genome is first expanded from the simplified representation into a complete sequence by replacing regions with their corresponding DNA segments in the reference. If a single haploid reference is provided, the same region on either allele is substituted with the same corresponding nucleotide sequence. If two haploid reference genomes are provided, which we refer to as the paternal and maternal references, then a region on the paternal (resp. maternal) allele is substituted with the corresponding nucleotide sequence from the paternal (resp. maternal) reference genome. This feature is meant to enable the evaluation of computational methods involving phasing, for example CHISEL [66], a method for inferring allele- and haplotype-specific copy numbers. CNAsim next determines the number of reads assigned to the cell. If uniform coverage is used, the genome-wide expected read count r is computed as r = (|G| · C) / (2L), where |G| is the total length of the genome, C is the coverage, and L is the read length. The actual read count is sampled from a Poisson distribution with mean r. Paired-end sequencing reads are then generated using the short-read simulator dwgsim (https://github.com/nh13/DWGSIM), which takes the read count and the mutated reference as input.
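The uniform-coverage read count is simple arithmetic (the actual count is then a Poisson draw with this mean):

```python
def expected_read_count(genome_length, coverage, read_length):
    # r = |G| * C / (2 * L): each read pair covers 2L bases, so r read
    # pairs of length L achieve coverage C over a genome of length |G|.
    return genome_length * coverage / (2 * read_length)

# e.g. a 3 Gbp genome at 0.1x coverage with 150 bp reads:
r = expected_read_count(3_000_000_000, 0.1, 150)  # about 1,000,000 pairs
```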
If non-uniform coverage is used, the genome is partitioned into non-overlapping windows of fixed size, and read counts are generated for each window. This is necessary because dwgsim samples reads uniformly across the provided reference. For a window of size W, the expected number of reads is r′ = (W · C) / (2L). To simulate the effects of non-uniform read depth, an initial set of windows spaced at a fixed interval I each sample a coverage coefficient c_w. The value of c_w for these initial windows is drawn from a Beta distribution, scaled to a value in the range (0, 2), where c_w = 1 indicates that window w has exactly the expected read count. For the remaining windows, a cubic Bézier curve is generated over the entire set of windows, seeded with the values drawn from the Beta distribution. This ensures that read counts change smoothly across windows and avoids jagged peaks.
The resulting read count of window w is drawn from a Poisson distribution with mean r′ · c_w. The parameters of the Beta distribution control the degree of non-uniformity across windows. To model this, we follow the approach of SimSCSnTree [40], where a point on the Lorenz curve is used to represent the coverage non-uniformity of the desired sequencing technology. The Lorenz curve represented by that point is then transformed into the Beta distribution. For example, values of x = 0.5 and y = 0.27 will model the coverage non-uniformity of MALBAC [38].
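A simplified sketch of the coverage-coefficient construction follows. `coverage_coefficients` is a hypothetical helper, and smoothstep interpolation is used here as a simple stand-in for the cubic Bézier smoothing that CNAsim applies between anchor windows.

```python
import random

def coverage_coefficients(num_windows, interval, a, b, rng=None):
    """Draw coverage coefficients for anchor windows (every `interval`
    windows) from a Beta(a, b) distribution scaled to (0, 2), then smooth
    between anchors. CNAsim fits a cubic Bezier curve through the anchors;
    smoothstep easing is used here as a stand-in."""
    rng = rng or random.Random()
    anchors = list(range(0, num_windows, interval))
    values = {i: 2 * rng.betavariate(a, b) for i in anchors}
    coeffs = [0.0] * num_windows
    for j, i in enumerate(anchors):
        nxt = anchors[j + 1] if j + 1 < len(anchors) else None
        if nxt is None:
            # Past the last anchor, hold its value.
            for w in range(i, num_windows):
                coeffs[w] = values[i]
        else:
            for w in range(i, nxt):
                t = (w - i) / (nxt - i)
                s = t * t * (3 - 2 * t)  # smoothstep easing
                coeffs[w] = (1 - s) * values[i] + s * values[nxt]
    return coeffs
```

Each window's read count would then be a Poisson draw with mean r′ · c_w.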

S.8 Performance
We evaluated the resource requirements for generating sequencing data with CNAsim under three experimental conditions. Specifically, for each experiment we measured the total runtime, peak memory usage, and the final storage size. The results are shown in Table S1. Each experiment used the GRCh38 human reference genome assembly and a read length of 150, with all other parameters except those listed in the table kept at their defaults. For some of the experiments, we also evaluated the resource requirements of SimSCSnTree [40] and MosaicSim [55]. All experiments were run on an Intel Xeon 2.1 GHz processor with 64 GB of RAM using a single core, unless stated otherwise. We repeated the 100 cell, 0.1x coverage experiment using CNAsim with non-uniform coverage, this time making use of 8 parallel cores. This resulted in a reduction in runtime from 20.2 hr with a single core to 4.1 hr with 8 cores.
Next, we evaluated the resource requirements of using CNAsim to generate copy number profiles only. We found that a whole-genome 10000 cell dataset with a region length of 1000 bp took approximately 6 hours on a single core and used a negligible amount of memory and storage. It should be noted that the runtime scales linearly with the number of cells and inversely with the region length. For example, a whole-genome 1000 cell dataset with a region length of 1000 bp has a runtime equivalent to that of a whole-genome 10000 cell dataset with a region length of 10000 bp, each taking approximately 35 minutes to execute on a single core.

Table S2: An expanded list of features implemented in CNAsim and other relevant simulators for single-cell data. From left to right, these are SCSIM [24], SCSsim [65], SCSilicon [14], CellCoal [49], PSiTE [64], MosaicSim [55], and SimSCSnTree [40]. A check mark within brackets indicates that the feature is implemented partially or with limited functionality.