Sim3C: simulation of Hi-C and Meta3C proximity ligation sequencing technologies

Abstract Background Chromosome conformation capture (3C) and Hi-C DNA sequencing methods have rapidly advanced our understanding of the spatial organization of genomes and metagenomes. Many variants of these protocols have been developed, each with their own strengths. Currently there is no systematic means for simulating sequence data from this family of sequencing protocols, potentially hindering the advancement of algorithms to exploit this new datatype. Findings We describe a computational simulator that, given simple parameters and reference genome sequences, will simulate Hi-C sequencing on those sequences. The simulator models the basic spatial structure in genomes that is commonly observed in Hi-C and 3C datasets, including the distance-decay relationship in proximity ligation, differences in the frequency of interaction within and across chromosomes, and the structure imposed by cells. A means to model the 3D structure of randomly generated topologically associating domains is provided. The simulator considers several sources of error common to 3C and Hi-C library preparation and sequencing methods, including spurious proximity ligation events and sequencing error. Conclusions We have introduced the first comprehensive simulator for 3C and Hi-C sequencing protocols. We expect the simulator to have use in testing of Hi-C data analysis algorithms, as well as more general value for experimental design, where questions such as the required depth of sequencing, enzyme choice, and other decisions can be made in advance in order to ensure adequate statistical power with respect to experimental hypothesis testing.

Minimum Standards Reporting Checklist. Information essential to interpreting the data presented should be made available in the figure legends.
Have you included all the information requested in your manuscript?
Resources A description of all resources used, including antibodies, cell lines, animals and software tools, with enough information to allow them to be uniquely identified, should be included in the Methods section. Authors are strongly encouraged to cite Research Resource Identifiers (RRIDs) for antibodies, model organisms and tools, where possible.
Have you included the information requested as detailed in our Minimum Standards Reporting Checklist? Yes Availability of data and materials All datasets and code on which the conclusions of the paper rely must be either included in your submission or deposited in publicly available repositories (where available and ethically appropriate), referencing such data using a unique identifier in the references and in the "Availability of Data and Materials" section of your manuscript.  Within bioinformatics, data simulation has become an important proxy for real 3 experimental data when testing individual algorithms and, more broadly, whole analysis 4 workflows. Use of simulated data in testing can be motivated by the formal notion of 5 software validation, the direct comparative analysis of algorithms or even as an 6 exploratory technique when prototyping experimental design. For established 7 experimental methods, the ever-accumulating public data archive offers a route to 8 thorough real-data driven testing. For a chosen test however, a difficulty remains in 9 matching desired data characteristics to one or several public dataset(s). Further, as 10 fields such as DNA sequencing develop, new forms of experimental data appear for 11 which the public data archives contain few, if any, examples. Though performance on 12 real data is the ultimate arbiter of analytical value, few researchers would have the time 13 and financial resources to commit to its generation purely for software testing. 14 Simulation-driven development and testing has proven to be a highly cost effective and 15 time efficient approach. It offers the possibility to explore data characteristics as a near 16 continuum and subject software to a previously unavailable degree of testing 17 thoroughness. 18 Tools for simulating DNA sequencing reads have existed from the very early days of 19 genomics, beginning with the many anonymous implementations of simple DNA 20 shearing algorithms, up to the most recent highly detailed empirical model 21 simulators [13,14,19,30]. From read simulation in isolation, field advancements such as 22 metagenomics have been accompanied soon after by simulators reflecting their specific 23 data characteristics and evolving experimental methodology [2,16,34]. 24 We introduce Sim3C, a software package designed to simulate data generated by HiC 25 and other 3C-based proximity ligation (PL) sequencing protocols. The software includes 26 flexible support for a range of sequencing project scenarios and choice of three 3C 27 methods (HiC, Meta3C, DNase-HiC). The resulting output (paired-end FastQ) is easily 28 assimilated into existing workflows, opening the door to more thorough software testing, 29 such as the comparative analysis of clustering algorithms [8]. 30 3C sequencing 31 3C-based sequencing protocols, including Hi-C, 4C-seq, and Meta3C, have great 32 potential to address questions directed at the spatial organization of DNA in samples 33 ranging from eukaryotic tissue, to single cells, to microbial communities. The growing 34 use of these protocols creates a legitimate need for a simulator capable of generating 35 data with relevant characteristics. 36 Chromosome conformation capture (3C) was originally designed as a PCR-based 37 assay to measure interactions among a small number of defined regions of eukaryotic 38 chromosomes [7]. In 2009 Lieberman-Aiden [21] reported an extension of the protocol to 39 high throughput sequencing, enabling the global spatial arrangement of chromosomes to 40 be reconstructed at unprecedented resolution. All 3C protocols depend on an initial 41 formalin fixation step, which crosslinks proteins bound to DNA in vivo. Subsequently 42 cells are lysed and the DNA:protein complexes are sheared enzymatically and/or 43 physically to create free ends in the bound DNA strands. These free ends are then 44 subjected to a proximity ligation reaction, in which ligation of free ends preferentially 45 occurs among DNA strands cobound in a protein complex. The DNA:protein crosslinks 46 are then reversed, the DNA is purified, and an Illumina-compatible sequencing library is 47 constructed. In HiC protocols, the proximity ligation junctions can then be further 48 purified in the sequencing library. 49 2/23 3C-derived methods have found several applications beyond their initial use to 50 reconstruct 3D chromosome structure. For example, it has been shown that 3C-derived 51 data provide a valuable signal for genome scaffolding [5,10], as well as a signal that can 52 support genome-wide haplotype phasing [17,36]. 3C-derived data has also proven 53 valuable for metagenomics, where initial studies on mock communities demonstrated 54 that highly accurate genome reconstruction in mixed microbial communities could be 55 facilitated by proximity ligation sequence data [4,6,26]. Subsequent application to 56 naturally occurring microbial communities has also suggested that bacteriophage can be 57 linked to their hosts with this data type [24].

58
In the remainder of this manuscript we describe the Sim3C software and outline how 59 it can be used to simulate data for various 3C-derived experiments. Listing 1. A mock two genome community. For demonstration purposes, we assume that the plasmid (plas 1) is present in four copies and that there is a 0.4/0.6 relative abundance split between the two organisms (bac1, bac2) in the community § Listing 2. A mock four chromosome genome. Cellular abundance is a constant across the profile, while chr4 exists in two copies. Note that relative abundances specified in a profile are not required to sum to 1, but are normalised internally.

85
Sim3C models three forms of experimental noise: machine-based sequencing error, the 86 formation of spurious ligation products and the contamination of PL libraries with 87 WGS read-pairs.

107
Lastly, conventional WGS read-pairs represent a source of contamination within a 108 PL library, which even after HiC enrichment steps, are not completely eliminated. The 109 rates at which spurious and WGS read-pairs are injected into a simulation run are 110 controllable by the end-user.

111
Simulation modes 112 Since HiC was first introduced [21], the development of variants and extensions has 113 been continual [11,26,32,33]. Variants have often strived to further enhance the 114 discriminatory power of the original experiment, while seemingly adding yet more 115 complexity to an already challenging protocol (in-situ DNase HiC, sciHiC) [33].  protocol results in a gross mixture of both WGS and PL read-pairs, where only a small 131 percentage of the total read-pair yield (approx. 1%) will possess PL junctions [22]. The 132 enrichment process within HiC, however, is not perfectly efficient and WGS read-pairs 133 are still observed (approx. 10-50% of reads contain a PL product) [22]. DNase-HiC 134 replaces restriction digest with a non-specific endonuclease (e.g. DNase I) [23] or 135 mechanical DNA shearing process (e.g. sonication) [11]. In this operational mode,

142
After obtaining insert parameters, the HiC strategy (figure 1a) first tests if the insert 143 will represent a WGS or PL read-pair (∼ Bern(p ef f )), where efficiency p ef f is defined 144 in the sense of enrichment. When p ef f = 1, there is perfect filtering and all WGS 145 read-pairs are eliminated from the experiment. In the case of WGS, the iteration 146 reaches an end-point and the simulation emits a conventional read-pair drawn from the 147 community definition. In the case of PL, next a 3-tuple defining a cut-site is drawn 148 (gen 1 , chr 1 , x 1 ), where the categorical distribution over chromosomes is weighted by 149 relative abundances (A) and chromosomal copy-numbers (n cpy ); genomic position is 150 sampled uniformly from the set of restriction sites (sites(chr 1 )); and parent genome 151 (gen 1 ) is implicit from the chromosome. Next, a test for spurious ligation is performed 152 (∼ Bern(p spur )). If a spurious event has occurred, the 3-tuple defining the second site 153 (gen 2 , chr 2 , x 2 ) is drawn i.i.d. as the first. If not spurious, next a test for 154 inter-chromosomal (trans) ligation is performed. For inter-chromosomal events, as the 155 source genome is implicitly defined (gen 2 = gen 1 ), only the second chromosome and 156 position (chr 2 , x 2 ) are drawn. Chromosome chr2 is selected without replacement from 157 the same genome (gen 1 ), where the categorical distribution is adjusted for the removal 158 of the first chromosome, and genomic position x 2 on chr 2 is drawn i.i.d. as the first.

159
Lastly, an intra-chromosomal (cis) ligation must have occurred. As now both genome 160 and chromosome are implicitly defined (gen 2 = gen 1 , chr 2 = chr 1 ), all that is required 161 is to draw genomic position x 2 . Here, the separation s between x 1 and x 2 162 (s = |x 2 − x 1 |) is constrained to follow an empirically determined long-tailed mixture of 163 the geometric and uniform distributions (equation 1).

258
Additionally for the real contact map, long-range interactions away from either diagonal 259 can be seen to drop to a lower threshold than that produced from simulation.  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62 63 64 Figure 6. Bacterial contact maps. Observed HiC interactions for the monochromosomal genome of Caulobacter crescentus NA1000. Comparing (a) real experimental data [18], to the three simulation choices (b) traditional HiC, (c) DNase-HiC and (d) traditional HiC with TADs enabled. Sharp rectilinear modulations of the intensity within (a) and (b) indicate a reduction in PL observations within a given bin. Not due to 3D chromosome structure, rather such features can be attributed largely to mappability and low cut-site density. (c) Without an enzymatic constraint a significantly smoother field is apparent, yet still susceptible to mappability. (d) Enabling topologically associated domains (TADs) highlights the similarity between features produced merely from biases and what could be truly associated with 3D structure.  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 From the contact map of real HiC data (figure 7a), it can be seen that the rates of 287 intra-chromosomal and inter-chromosomal interactions are roughly equivalent in 288 magnitude. Across the eight chromosomes of S. stipitis, there is significant uniformity in 289 the degree of physical intimacy within and between all chromosomes. The subtleties of 290 this chromosomal organisation reveals a self-similar "fuzzy-x" pattern repeated between 291 all chromosomes across the contact map. The convergence point within the pattern is 292 attributed to centromere-SPB binding and has been used to predict centromere 293 locations [38]. It has been shown that the physical constraints generated from the 294 interaction of centromeres to the spindle pole body (SPB) and telomeres to the nuclear 295 envelope are sufficient to explain a number of experimental observations in real 296 data [12,39]. As Sim3C was derived from study of bacterial datasets, our simulation 297 model does not currently include a notion of these higher organism physical constraints. 298 Consequently, the contact map derived from simulated traditional HiC sequencing elicits 299 a flat field (figure 7b), where the intensity variation that does exist is a byproduct of 300 aforementioned factors such as mappability and cut-site density. For the runtime 301 parameters employed, the rate of intra-chromosomal contact is higher than that of 302 inter-chromosomal, making clear the boundaries between the eight chromosomes (figure 303 7b). Though our model is presently incomplete for higher organisms, there remains a 304 potential utility as an analytical or simply observational prior. Figure 7. Eukaryotic contact maps. Observed HiC interactions (a) real and (b) simulated data from the eight chromosome genome of the budding yeast Scheffersomyces stipitis CBS 6054 [6]. Grey dashed lines and alternating light and dark grey axes demarcate the boundaries between chromosomes. (b) Simulated data elicits a flat field and the clearly evident higher rate of intra-to inter-interactions makes for easily observable chromosomal boundaries within the map. (a) Contrastingly for real data, the similar rates of intra-chr and inter-chr interactions reveals the physical constraints imposed by centromere-SPB tethering on all eight chromosomes [38].

306
In the deconvolution of metagenomes, proximity ligation methods hold great potential 307 as new sources of information and have been investigated by the construction and 308 sequencing of synthetic communities [4,6,26]. We selected two previously constructed  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 (table 1). Intended as "proof of concept" experiments, neither community reflects a real 311 environment, but rather were intended to be easily interpreted and include interesting 312 features, such as: range of GC, single and multi-chromosomal genomes and strain-level 313 divergence. The HiC community involved five genotypes from four species, one genome 314 of two chromosomes (B. thailandensis), E. coli strains BL21 and K12 (Average 315 Nucleotide Identity, ANI 99%) and a wide overall GC range of 37-68% (table 2). Of 316 lower complexity, the Meta3C community involved three genomes from three species, 317 included one genome of two chromosomes (V. cholerae) and had a narrower GC range 318 of 44-51% (table 3). Relative to the single genome experiments above, a lower depth of 319 sequencing resulted in a lower overall contact map intensity ( figure 8). This is 320 particularly the case for Meta3C, where, by the nature of the method, a large proportion 321 (approx. 99%) of the sequencing yield is in reality conventional WGS read-pair data [26]. 322 As a direct result, in binning the Meta3C dataset, there were insufficient counts to fully 323 establish finer detail within the contact maps, leaving a smoother appearance.

324
As with single-genome experiments, metagenomic contact maps are locally 325 modulated by factors such as mappability and cut-site density. Importantly now for 326 metagenomes, the factors of relative abundance and GC content interact to alter the 327 observed intensity of each chromosome within the contact map.

328
As a first approximation and assuming agreement in nucleotide sampling frequency, 329 we expect n 0 = L/4 λ recognition sites for an enzyme of site length λ and DNA sequence 330 length L. The degree to which an enzyme and DNA sequence deviate from this estimate 331 could be described as how well they match, m = n x /n 0 . Poorer quality matches (m < 1) 332 occur when an enzyme's recognition site is underrepresented, while conversely, better 333 quality matches (m > 1) describe a situation of more recognition sites than expected.