Bisulfite probing reveals DNA structural intricacies

Abstract In recent decades, study of DNA structure has largely been focused on the interrelationships between nucleotides at the level of nearest neighbours. A little-utilized approach to probing structure on a larger scale is non-denaturing bisulfite modification of genomic DNA in conjunction with high-throughput sequencing. This technique revealed a marked gradient in reactivity increasing towards the 5′ end of poly-dC:dG mononucleotide repeats as short as two base pairs, suggesting that access of the anion may be greater at these points due to positive-roll bending not predicted by existing models. Consistent with this, the 5′ ends of these repeats are strikingly enriched at positions relative to the nucleosome dyad that bend towards the major groove, while their 3′ ends tend to sit outside these areas. Mutation rates are also higher at the 5′ ends of poly-dC:dG when CpG dinucleotides are excluded. These findings shed light on the mechanisms underlying bending/flexibility of the DNA double helix as well as the sequences that facilitate DNA packaging.


INTRODUCTION
Although the structure of the DNA double-helix at the level of the relative angular positions of adjacent base pairs (dinucleotide steps) has been explored in great detail, largerscale influences are less well understood. Sequence features such as mononucleotide runs have been known for decades to influence helical flexibility (1), but applications that involve prediction of DNA bending, such as the modelling of nucleosome binding, currently rely on dinucleotide models (2).
A well-established approach to examining fine structural aspects of native DNA at a broader scale using strand cleavage by the hydroxyl radical (3) is limited by incomplete understanding of its mechanism. An alternative method uses non-denaturing bisulfite treatment followed by PCR amplification and high-throughput sequencing. This is similar to commonly used methods for mapping cytosine methyla-tion using bisulfite, with the exception that double-stranded DNA is treated, so in theory only unmethylated cytosines at regions that are single-stranded for structural reasons should convert to uracil and ultimately be detectable as thymine by sequencing.
The main previous application of bisulfite as a probe of DNA structure has been in helping to characterize discrete non B-DNA structural features with single-stranded components including intramolecular triplexes, quadruplexes and cruciforms (4)(5)(6), and more recently to map R-loops genome wide (7). These studies were confined to examining relatively large structural entities, and bisulfite has generally been overlooked for finer probing because it reacts only very weakly with double-stranded DNA. An exception to this was an investigation of the Bcl-2 translocation breakpoint implicated in Burkitt's Lymphoma, which showed that non-denaturing bisulfite sequencing is sensitive enough to detect subtle conformational change at the translocation-associated sequence GGGCCC, noting that this might be explained by its B-A-form transition characteristics (8).
Specifically, it was proposed that the positive roll-bend angle and decreased base-pair lifetimes seen at the motif using crystallographic and spectroscopic methods could allow easier access for the bisulfite anion. Positive-roll/majorgroove-directed bending at the motif GGGCCC has been indicated by multiple other methods (8)(9)(10)(11), and base-pair stability is lowest at the central GC dinucleotide of this motif (10). A previous observation that bisulfite reactivity is significantly elevated in supercoiled plasmid DNA (4) is also relevant, since supercoiling has been associated with both bending and decreased base-pair stability (12). However, while it has been proposed that increased presence of divalent cations in the major groove of GGGCCC motifs could promote their observed bending, and that the same could occur at G:C-rich sequences in general (1), the idea that the increasing structural perturbation towards the central point of GGGCCC considered as the convergence of the 5 ends of two adjacent poly-C runs might also apply to the 5 ends of poly-C in other contexts has not been studied.
In addition to the motif GGGCCC, known influences on DNA bending that are not well modelled at the tetranucleotide or lower levels include A-tracts, defined as four or more consecutive A:T base pairs containing no TA dinucleotides (9,13,14). A-tracts are relatively inflexible, but bending of DNA flanking A-tracts has been shown, and this can be directed oppositely to GGGCCC, i.e. narrowing the minor groove, at junction points with non-A-tract DNA (1,9,14,15). If bisulfite reactivity reflects DNA bending, the effect of neighbouring A-tracts might therefore be suppressive.
Data on mutational patterns suggest an additional indirect test of differential bending properties between the ends of poly-dC:dG runs (hereafter referred to as poly-C), e.g. a correlation between DNA bending based on roll and tilt parameters and the mismatched base pairs that antecede inherited mutations has been shown (16). Evidence of majorgroove-directed bending by mismatched duplexes (31) and elevated structural perturbations by transversions including increased roll bending at A:C mismatches (32) are also relevant to the hypothesis that polar structure at poly-C influences mutability, though it is also notable that helical flexibility is expected to facilitate mismatch repair (16). Apparently contrary to the hypothesis of mutation-promoting structural abnormality at the 5 ends of poly-C runs, elevated polymorphism near the 3 ends of poly-C, thought to be due to an electrostatic hole created by loss of an electron due to oxidation, has been reported (17). However, that analysis included the motif CpG, the hypermutability of which is driven by the vulnerability of 5-methylcytosine to deamination, to which DNA bending/flexibility is not necessarily related (18). Analysis excluding CpG dinucleotides has not, to our knowledge, been published.
We sought to test these hypotheses using deep sequencing of bisulfite-modified protein-free native DNA, in conjunction with available datasets on de novo mutations of the human genome.

Reagents
Sodium bisulfite (cat. # 243973) and hydroquinone (cat. # 74347) were from Sigma Aldrich. The Wizard DNA cleanup kit (cat. # A7231) and Proteinase K (cat. # V3021) were from Promega. DNA extraction was done with a Norgen Biotek Blood DNA Isolation Mini Kit (Cat# 46300). Degenerate PCR primers were custom made by Integrated DNA Technologies.

DNA extraction
DNA was extracted from buffy coat cells of an individual human male according to the kit manufacturer's instructions. Briefly, Proteinase K was added to a microcentrifuge tube, followed by 200 l of buffy coat. Next, 300 l of lysis solution was added, and samples were vortexed and incubated at 55 • C for 10 min. Following ethanol precipitation, samples were given an additional incubation at 37 • C for 16-24 h with Proteinase-K, followed by re-precipitation with ethanol and suspension in 30 l dd H 2 O.

Bisulfite treatment
We carried out bisulfite modification as described (4). For each reaction, 0.3 g of sodium bisulfite was first dissolved in 1050 l dd H 2 O/g sodium bisulfite. To this was added 525.2 l of 2 M NaOH/g sodium bisulfite. Once dissolved, the pH of the solution was confirmed to be 5.2-5.3. In a 1.5 mL microcentrifuge tube, 457.5 l of the sodium bisulfite solution was then mixed with 12 l of 20 mM hydroquinone. To this was added 1 g of genomic DNA in 30 l of dd H 2 O with pipette mixing. The solution was then incubated for 16 h at 37 • C in the dark. The DNA was purified using a Wizard DNA clean-up kit according to the manufacturer's protocol, and then desulfonated with 0.3 M NaOH for 15 min at 37 • C. Following this, ethanol precipitation was carried out for 1 h at -20 • C, followed by three washes with 70% ethanol, and suspension in 30 l Tris-EDTA (pH 8.0).

PCR amplification
We PCR-amplified twelve bisulfite-modified promoter, coding and intergenic regions of 1.7-2 kb from human Chromosome 21. This size range was selected for convenient sequencing library preparation for Illumina Miseq. Degenerate primers were used to reduce PCR bias towards less reacted molecules. These had 50% A residues at each G, and 50% T at each C (see Supplementary Table S1 for primer sequences). Thermal cycling was as follows: initial denaturation at 95 • C for 3 min, followed by 10 cycles of 98 • C for 20 s, 60 • C for 2 min and 72 • C for 1:30, and then 25 cycles of 98 • C for 15 s, 45 • C for 45 s and 72 • C for 1:30, with a 5-min final extension at 72 • C. This protocol was intended to maximize yield while minimizing non-specific amplification due to the degenerate primers. It produced mostly clean products, as visualized under UV light following ethidium bromide staining. Amplicons were cleaned using magnetic beads. Preliminary experiments showed these PCR methods yielded bisulfite conversion levels comparable to published studies (4,8).

Sequencing
The amplicon library was sequenced with one Illumina MiSeq 2 × 300 base PE run, version 3 chemistry, at 5% volume (New Zealand Genomics Ltd; Massey University). Quality control included trimming both ends of all reads up to the first base with a sequencing quality score of 32 (p = 0.00063). All remaining reads shorter than 500 bp and all reads containing any base with a sequencing quality score lower than 22 (p = 0.0063) were also removed.

Alignment and variant mapping
Following quality control, a total of 21 kb was aligned to the reference sequence using Bowtie-2 with 'sensitive' settings. All bases covered by less than 500 reads were excluded from further analysis, as were those with mapping quality scores lower than 20 (p = 0.01). Each base was scored for numbers of each unexpected nucleotide using BAM-readcount (19). Both strands were combined in the results, i. e. a G → A transition was interpreted as C → T on the other strand, and polarity was accounted for in the analysis of poly-C runs, e.g. DCC*C*D was combined with H*G*GGH.

Statistical analyses
Statistical analysis was done in R (Version 4.1). Significant differences involving bending and mutation rate statistics vs nucleosome position were determined by two-tailed T-or chi-square tests. The effects of A-tracts and local A:T content on bisulfite reactivity were evaluated using a generalized linear model (quasibinomial family in R). For the analysis of cyclization data from Basu et al. (20), the mean of the three experiments ('c26', 'c29' and 'c31') was taken and used as the independent variable in linear regression against the frequencies of sequence motifs per 50mer, or bisulfite reactivities based on 4-or 7-bp sliding windows as specified. The plots of bisulfite reactivity vs nucleosome position (Supplementary Figure S3) were based on 7-bp sliding windows, with values determined by averaging reactivity from Dumelie et al. (7) at the central points of each heptameric context in the data. Mutation rate was defined as the number of de novo mutations of a motif, based on data from Kessler et al. divided by the number of instances of the motif within the region concerned.
Operations involving large datasets employed bespoke Bash or Python scripts, which available at https://doi.org/ 10.5281/zenodo.7613454. Genome-wide data on nucleosome dyad positions were from Gaffney et al. (2012). The level of roll-bending at each position of predicted 146 bp nucleosome-bound regions was based on data from x-ray crystal structures (2). For the purposes of this work, the positions of these regions were practically identical to those determined by other studies (21).

A gradient in bisulfite reactivity across poly-C
We examined just over 19 kb of protein-free genomic DNA from human Chromosome 21 with bisulfite treatment followed by PCR and deep sequencing. Out of a total of 7249 cytosines (Cs) covered by >500 reads, 99% appeared as thymine (T) in at least one read (Supplementary Table S2). poly-C showed substantially elevated reactivity, which increased with their length (Figure 1), while isolated C (DCD, where D = A, T or G) was converted in only just over 2% of reads.
Upon visual examination of read alignments, the most obvious pattern was sharply increasing reactivity from right-to-left at poly-C and left-to-right at poly-G, i.e. a 3 →5 (polar) gradient, even in the absence of 3 CpG dinucleotides, i.e. D(C) n W (Figure 1). This was evident for Poly-C runs as short as two base-pairs. The pattern was also present in a separate dataset of bisulfite reactivity in human genomic DNA (7) (Supplementary Figure S1). The larger scale of that dataset allowed investigation of repeats longer than 5 bp, but these much rarer sequences showed only shallow gradients. Nonetheless, the gradient was evident at the vast majority of C/G runs, including 94% of GG/CC in the data generated for this study. There were no differences in sequencing/PCR error (percentage of non-reference reads not attributable to bisulfite modification, which averaged less than 0.5%) or read depth between the 3' and 5' ends of poly-C considered as all types combined, or separated by repeat length (p > 0. 19 by T-test).

Exceptionally high reactivity of G n C n motifs
We confirmed the exceptionally high reactivity of the motif GGGCCC, previously found to react strongly with bisulfite (8). It also showed a high 5 :3 reactivity ratio (2.9 on average), i.e. modification peaked at the central GC dinucleotide. In our data, eight of the ten most reactive motifs were of the form G n C n (Table 1; see Supplementary Table S3 for the equivalent data from Dumelie and Jaffrey) (7). Consistent with Tsai et al. (8), A-forming potential measured at Table 1.
Contexts of the most bisulfite-reactive cytosines. Bases converted in more than 40% of reads were excluded due to the possibility of heterozygosity. The equivalent results from Dumelie and Jaffrey (7) (22,23) was not a significant predictor of bisulfite reactivity.

Absence of clear associations with established base-pair step parameters
We regressed the mean modification frequencies of each overlapping di-and tetranucleotide in the tested sequence against the canonical base-step parameters twist, roll, tilt, slide, shift and rise derived from molecular dynamics simulation (24). At the dinucleotide step level, slide (p = 0.005) and rise (p = 0.007) were weakly positively associated with reactivity, but at the tetranucleotide level the former was in a negative direction (p = 7 × 10 −8 ), while the latter was weaker (p = 0.024). Salient factors influencing the lack of associations with the bending parameters tilt and roll were (i) that the highly reactive GC dinucleotide is relatively rigid according to the step data, having the second-lowest roll value among dinucleotides and (ii) CA and AC, positioned at the 3 and 5 ends of poly-C respectively, are considered to be very high-and very low-roll respectively. These attributes are well established in dinucleotide step parameter datasets based on molecular dynamics simulation (25)(26)(27), though context-dependence of the bending characteristics of CA and GC has been shown (21).

Effects of flanking bases
Analysis of sequence to either side of each bisulfitemodified cytosine revealed clear contextual effects at flanking bases as distant as 3 bp (p < 10 −6 by Wilcoxon signedrank test; Figure 2). Most notably, 5 G neighbours increased reactivity while 5 T suppressed it. G as 3 neighbour decreased reactivity even when CpG dinucleotides with little or no modification (<0.4%; a threshold chosen because it corresponded to a step in the plot of reactivity at these dinucleotides; Supplementary Figure S2) were excluded in order to indicate the effect of CpG methylation. Given that reactivity of the motif GGGCCC peaks at its central GC dinucleotide, it is notable that GC dinucleotides in general were 46% more reactive than other dinucleotides, excluding CC. As noted in the previous section, this does not appear to be due to known intrinsic properties of the GC dinucleotide step. Although it is distinguished by high twist, this property is also shared with TC, which suppressed reactivity.

Nucleosome-binding data indicate polar bending at poly-C
Examination of the distribution of poly-C sequences relative to bound nucleosomes provides an indirect test of the hypothesis that the 5 ends of poly-C in general are bent towards the major groove. It has been known for several decades that frequencies of short GC-rich sequences, notably SS dinucleotides (two consecutive residues without an A or T) and the highly bisulfite-reactive GCC/GGC trinucleotide, show periodic maxima at positions where the major groove faces inward towards the protein and is narrowed, while AT-rich di-and trinucleotides are more often found with the opposite periodicity, i.e. in regions bent towards the minor groove (28,29) (Supplementary Figure S3  B).
In our data, bisulfite modification per cytosine is correlated with sites that undergo positive roll bending to a clear, though undramatic, extent, and the correlation increases with increasing strength of nucleosome positioning (Supplementary Figure S3C, D) and increasing reactivity of the motif analyzed (Supplementary Figure S3E, F). More interesting patterns emerge on closer examination. Most notably, the 5 ends of poly-C sequences are strikingly enriched in the central, most positively bent parts of these areas, while their 3 ends tend to sit away from the peaks (Figures 3 and Supplementary Figure S4-S7). This trend begins at n = 2 for G(C n ) and n = 3 for A(C n ). The phenomenon is still visible at n = 8 (Supplementary Figure S7), though longer repeats are rare. The combination of a significant positive association with roll for 5 ends (P < 0.05) and no such association for 3 ends is evident for n = 3 and 4 for A(C n ) and n = 2, 3 and 4 for G(C n ). Consistent with the suppressive effect on bisulfite reactivity of 5 T, T(C n ) sequences do not show the pattern.
Interestingly, previous studies of GC-rich motifs associated with positively roll-bent regions of the nucleosome have shown that the patterns tend to be reversed near the nucleosome dyad (28) (Supplementary Figure S3 B). It is therefore notable that the 5 ends of A(C n ) repeats ( Figure  3A, Supplementary Figures S4A, S5A, S6A and S7A), and the motifs most highly reactive with bisulfite (Supplementary Figure S3 F) are enriched in the three central peaks of positive-roll bending, though this is less clear for G(C n ), and bisulfite reactivity overall.

Suppressive effect of neighbouring A-tracts
Consistent with the hypothesis of major groove compression promoting bisulfite reactivity, and contrary to the expectation of higher reactivity due to greater DNA breathing in AT-rich regions, we observed decreasing reactivity with increasing flanking AT-content (Supplementary Table  S3). Total reactivity was lower in the presence of a neighbouring A-tract by 42%, 44% and 41% for DCD, DCCD and DCCCD respectively. The effect was greatest for 5' Atracts on the 5' C's of poly-C. A-tracts had a weaker influence when located to the 3' side, where they were associated with 24-26% reductions. Taking into account local ATcontent and the suppressive effect of 5' T residues, the influence of A-tracts was much reduced but remained significant (p < 0.02). Presence of a TA dinucleotide adjacent to the 5 end of poly-C had a positive effect on bisulfite reactivity Nucleic Acids Research, 2023, Vol. 51, No. 7 3265 Figure 2. The effect of flanking sequence on bisulfite reactivity at isolated C (DCD). The bars represent mean reactivity levels at position zero associated with each particular flanking base, e. g. the mean reactivity of the dinucleotide GC was just over 3%. Similar patterns were seen at poly-C runs. The total number of Cs included was 3912. Error bars are 1 SEM.
(P < 2.6 × 10 −5 in the model presented in Supplementary  Table S3). This is also consistent with a bending/flexibilitybased structural explanation for the reactivity gradient at poly-C, because this highly flexible dinucleotide is known to be prone to positive roll bending (24,30).

Correlation with bending measured by cyclization
An additional test of the link between bisulfite reactivity and DNA flexibility/bending is enabled by a dataset of 'cyclizability' scores measured for 82367 50-bp DNA fragments (20). Mean bisulfite reactivity per cytosine obtained for each 50mer using a 4 bp sliding window was correlated with mean cyclizability (r = 0.23; p < 10 −16 ). Much weaker associations were evident for G/C-content and number of A-tracts per sequence (r = 0.03 and -0.04 respectively). Poly-C runs of 4 bp or more were enriched by 36% in the half of sequences with the most positive cyclizability scores (p < 10 −16 ), and 3.6-fold in the top 1000 compared with the bottom 1000. While these correlations are quite weak, it is notable that the cyclizability data are noisy, with standard deviations from the three experiments exceeding the positive or negative magnitude of their mean scores in nearly half the tested sequences.
Also of interest in view of currently used dinucleotide step models, numbers of GC and CG dinucleotides per sequence were weak positive and negative predictors of cyclizability (r = 0.08 and -0.06, respectively), which is contrary to expectation based on their roll and tilt values. Further, the two strongest dinucleotide predictors were TA and TC (r = 0.21 and -0.16, respectively), and this is also consistent with the hypothesis that bisulfite reactivity reflects DNA bending in view of the strong negative and positive influences on reactivity of 5 -T and 5 -TA neighbours respectively.

Polar mutation biases at poly-C
Using a recent database of 93048 germline de novo mutations from human trios (31), we found that poly-C is more mutable at its 5 than its 3 end when adjacent 3 G residues (i.e. all CpG dinucleotides) are excluded, for all combinations of flanking bases, by an average of 28% for runs of 2-5 bp (Supplementary Table S4). Consistent with bisulfite reactivity patterns, the gradient is much less marked in the presence of a 5 T (Figure 4). It is much stronger for C → A transversions than for C → T transitions, but is oppositely directed for C → G transversions, which are enriched at poly-C 3 ends, though they are comparatively rare (Figure 4).
Location with respect to nucleosome positioning has previously been shown to affect patterns of mutation (32), and the enrichment of WW and SS dinucleotides at sites with negative-and positive-roll bending respectively can be explained in part by elevated C→T mutability in the former (33). However, mutability of C n runs divided between n = 2-3 and n = 4-6, as well as between 5 A, G and T adjacent residues, is similarly distributed between positive-and negative-roll-bending regions of the nucleosome (Figure 4).

DISCUSSION
Increased flexibility and/or base-pair opening at one end of poly-C repeats would be consistent with a hypothesis known as the 'broken stack' model of DNA bending (34). This predicts bending at runs of G:C base pairs due to a tendency for consecutive G residues to align/stack squarely (in parallel), necessitating periodic disruption in either stacking or Watson-Crick hydrogen bonding to accommodate the curved shape of the B-form double-helix. However, polarity in bending at poly-C was apparently not specifically invoked by the authors of this model.
Other well-known models of DNA bending have focussed on the role of A-tracts or cation binding (1). Our observation that poly-C is highly bisulfite-reactive is clearly consistent with the latter, given that its observed electropositive characteristic should attract the bisulfite anion (8). Our hypothesis of DNA bending predicts that the significance of the suppressive role of A-tracts is that they primarily direct bending oppositely to GGGCCC, i.e. narrowing the minor groove, which has been observed previously (9). The significance of the potentially related observation that neighbouring 5 T residues are especially suppressive of reactivity is less clear. However, it is notable in this context that repeats of TCCA/TGGA have been found to be the most significant feature in random sequence with outstandingly low nucleosome occupancy (40). The authors of that study showed that the nucleosome-resisting property of TGGA-rich fragments was not due to quadruplex formation, so resistance to bending is a plausible explanation. In this context it is also notable that, in our data, sequences with quadruplex-forming potential also show the gradient (data not shown), which is unexpected if they actually form the structures in the absence of DNA-binding proteins or polymerase-induced supercoiling.
Transient single-strandedness due DNA breathing appears unlikely to contribute heavily to the patterns we observed, given that flanking A/T-content is negatively cor-related with reactivity. Therefore, it is also noteworthy that steric inhibition should preclude the attachment of bisulfite to anti-form C, in which bases in the native Watson-Crick double-helix are locked (11). It may be reasonable to speculate that a permissive syn-form conformation, which has been demonstrated in the context of the double helix in Z-DNA (12), could play a role, at least in the context of existing structural perturbation. In the pH range we used for bisulfite treatment (5.2-5.3), some Cs are expected to be protonated in double-stranded DNA, as a result of which they can form a weakened Hoogsteen-type base pair with G, which adopts a syn configuration (35). The steric implications of the rotations about the sugar-base bond that these transformations would entail could presumably increase sensitivity to structural fluctuations.
Another possibility to be considered is that artefacts of bisulfite conversion contributed to the patterns we observed, for example if the conversion process stimulated reactivity at the 5 end of poly-C. However, this seems less likely in view of observations that, in RNA, the G:U mismatch has been shown to create a region of strong electronegative potential (36). Further experimental work to confirm our hypotheses will require high technical sensitivity, as major groove compression at GGGCCC was not seen in solution NMR (11), and 5 ends of poly-C aren't necessarily bent in crystal structures. It has been hypothesized that detectable bending at GGGCCC motifs may require the presence of multivalent cations (1), and it is notable that Mg 2+ concentration correlates positively with both nucleosome formation (37) and major-groove-directed bending (38). However, it appears that bisulfite probing may be sensitive enough to obviate the requirement for multivalent cations, as the principal cation in our reaction mixture was Na + , and there is some evidence that DNA binding of monovalent cations may be weak, at least for GC-rich tracts (39).
The phenomenon of increased electrostatic positive charge at poly-C is also reflected in enhanced reactivity of the hydroxyl radical at these sequences (3), though neither this nor other established methods indicate a polar structural gradient at poly-C (40,41). Bisulfite probing may therefore provide unique insight into the structural intricacies of GC-rich sequences.

DATA AVAILABILITY
Data are deposited in the Genbank short read archive under accession PRJNA910828. Scripts are available at Zen-odo, https://doi.org/10.5281/zenodo.7613454, and are also available upon request to the corresponding author.