Enzyme-free optical DNA mapping of the human genome using competitive binding

Abstract Optical DNA mapping (ODM) allows visualization of long-range sequence information along single DNA molecules. The data can, for example, be used for detecting long-range structural variations, for aiding DNA sequence assembly of complex genomes, and for mapping epigenetic marks and DNA damage across the genome. ODM traditionally utilizes sequence-specific marks based on nicking enzymes, combined with a DNA stain, YOYO-1, for detection of the DNA contour. Here we use a competitive binding approach, based on YOYO-1 and netropsin, which highlights the contour of the DNA molecules while simultaneously creating a continuous sequence-specific pattern, based on the AT/GC variation along the detected molecule. We demonstrate and validate competitive-binding-based ODM using bacterial artificial chromosomes (BACs) derived from the human genome and then turn to DNA extracted from white blood cells. We generalize our findings with in-silico simulations which show that we can map the vast majority of the human genome. Finally, we demonstrate the possibility of combining competitive binding with enzymatic labeling by mapping DNA damage sites induced by the cytotoxic drug etoposide to the human genome. Overall, we demonstrate that competitive-binding-based ODM has the potential to be used both as a standalone assay for studies of the human genome and in combination with enzymatic approaches, some of which are already commercialized.


Supporting Tables
Supporting Figures

Figure S1: The effect on match scores (C_max) of varying the number of acquired frames. A) Average normalized C_max values for different numbers of frames for BACs P18, C17, H8, J19 and J21 when fitted to their corresponding theoretical barcodes. B) Comparison of the theoretical P18 barcode (black) with experimental P18 barcodes consisting of different numbers of frames as indicated (gray).
With an increasing number of time frames the noise decreases, and hence the precision of the obtained fit to the theoretical reference increases. However, even when using a single time frame, 83% of the maximal score is obtained, and for 10 frames the number is as high as 96%. Figure S2: Fraction of correct hits as a function of allowed size variation. Experimental barcodes from BACs P18, C17, H8, J19 and J21 mapped to the human genome with different amounts of size variation allowed (1% steps).
The optimal amount of allowed size variation for the five examined BACs was always between 2% and 10%, with an overall optimal value of 5%. As expected, the fraction of correctly placed fragments initially increased as more stretching was applied, and later decreased, since allowing more stretching increases the probability that the experimental barcodes also map well to incorrect positions along the human genome.
The sensitivity to the amount of allowed size variation for an experimental barcode will vary depending on the degree of AT/GC variation in the underlying DNA sequence, which could explain the differences between the BACs. Moreover, minor experimental errors in pinpointing the correct extension of the DNA based on the reference lambda-DNA molecule could also affect the results. The precision increases with an increasing number of time frames, since the average extension can then be better estimated. Figure S3: Examples of fits of individual experimental BAC barcodes to the corresponding correct position along the human genome. All barcodes were mapped using 10 frames and up to 5% stretch (1% steps).
For all 5 BACs, individual fragments could be mapped to the genome with high precision. As expected, the noise level is higher than for the consensus barcodes of the BACs (Figure 2, main text). However, this does not seem to affect the ability to map the fragments to the correct position in the human genome to any large extent. Figure S4: Repetitive region found in an experimental fragment when matched to the human genome. Zoomed-in version of the repetitive region in Figure 4C (main text). The theoretical barcode is shown in black and the experimental barcode in green.
The ability of ODM to acquire long-range sequence information is a great advantage compared to traditional sequencing techniques. Using ODM, repeats can be quantified directly from the barcode, exemplified by the repetitive region of the GC-rich exons of the Protocadherin Alpha gene cluster, mapped in Figure S4. Figure S5: Impact of exposure time, light power and number of frames on data quality. Average C_max values with corresponding standard deviations (N=20) for experimental J21 BAC barcodes when compared to the corresponding theoretical barcode with different amounts of light, exposure times and numbers of time frames. The green bar represents the general settings used for acquiring the data in this study. The acquisition time for the camera to acquire ten images with 30 ms exposure is approximately the same as the time to obtain five images with 100 ms exposure.
In order to evaluate the possibility of increasing the throughput in future applications, experiments were performed with a lower exposure time per frame, compensated by a higher intensity of the excitation light. The results show that exposure times down to 30 ms can be used when compensated by increased light exposure, reducing the acquisition time by a factor of 2 compared to using 100 ms exposure. The results also show that the number of time frames can be reduced when using a higher light intensity, without any large effect on the data quality. Figure S6: Simulated p-values expected for different lengths and match scores. The expected p-value for match scores C_max of 0.9, 0.8, 0.7 and 0.6 for DNA molecules of varying length (kb).
The data presented in Figure 4 (main text) represent only a small fraction of the human genome. To generalize our results and study the effect of fragment size on the p-value of a fit between an experimental barcode and the human genome, in silico simulations were performed using randomized theoretical fragments of varying sizes (Figure S6, details in Supporting Methods). For each size (in steps of approximately 3 kb), one thousand randomized theoretical barcodes, preserving the statistical properties of human DNA, were compared to the randomized theoretical version of the human genome. The distribution of C_max scores for each size was then used to predict the level of significance (p-value) that would be obtained for an experimental barcode with different C_max scores (0.6-0.9).
The C_max obtained when comparing an experimental barcode to the human genome is typically in the range 0.7-0.9. The results shown in Figure S6 predict that it should be possible to obtain a p-value below our threshold of 10^-6 for all DNA molecules larger than approximately 275 kb, using a C_max > 0.7. Moreover, with a C_max of 0.9 we can successfully map fragments down to 110 kb or even smaller, demonstrating the high discriminatory power of the obtained barcodes. Figure S7: Number of matching positions for non-mappable regions within the human genome. The histogram depicts the number of additional positions to which each DNA fragment, deemed non-mappable in Figure 5 (main text), matches with a C_max greater than 0.81.
Many of the non-uniquely mappable regions only show a high degree of similarity to a few additional positions in the human genome (Figure S7). Hence, the correct match position for an experimental barcode from a non-mappable region can be narrowed down to only a few positions within the entire human genome.

Process experimental data
In this section we describe how the output of the nanofluidics-based fluorescence imaging experiments was processed using custom-written MATLAB software. The workflow can be summarized as: imaging output (movie) → kymograph → barcode.

Extract kymographs
As output from the nanofluidics experiments, we obtain three-dimensional arrays (movies) of fluorescently stained DNA molecules. These movies are then processed following the outline in [1] in order to generate kymographs. The method is summarized in Algorithm 1, and the steps are explained in more detail below.
Then the convolution is defined as in Algorithm 1.

3. Find the angle of the movie:

(a) Average the movie over time frames k = 1 . . . N.
(b) Use morphological top-hat filtering with a disk D of radius 12 pixels to correct for uneven illumination of the dark background, i.e.

A_corrected = A − imopen(A, D),

where imopen is a morphological image opening operation, first applying erosion and then dilation on the image.
(c) Find edges in A_corrected using the Sobel method, which uses the Sobel approximation to the derivative and returns edges at those points where the gradient of A_corrected is maximal.
(d) Apply a standard Hough transform. It uses a parametric representation of the line, i.e. each matrix element (i, j) is converted into a (ρ, θ) matrix. The angles are restricted to the range [−90 : 0.01 : 89.99]. The optimal angle then corresponds to the highest peak in this matrix.
(e) Use the detected angle to rotate the movie. Consequently, the nanochannel with the molecule will be oriented vertically. After the rotation, there might be some pixels at the edges of the rotated movie that do not correspond to pixels in the original movie. We refer to these as undefined pixels.
4. Find the foreground of the movie (simple amplification and thresholding):

(a) Compute the mean and standard deviation of the pixel intensities of the movie. Use these to substitute the undefined pixels with pseudo-random values (rng(0, 'twister') in MATLAB).
(b) Amplify the movie by convolution with a kernel whose matrix has elements ((x^2 + y^2)^(−0.25)) for x = −2, . . . , 2, y = −2, . . . , 2. Normalize and square the output movie;

(c) Further amplify the movie by convolution with the vector [−1.5, −0.25, 1.25, 2, 1.25, −0.25, −1.5]. Normalize the output movie;

(d) Average the movie along the time-frame dimension and along the row dimension, leaving the averaged column fluorescence profile. Find local extrema in this profile. From these, choose the maxima with intensities higher than the mean + 3 standard deviations of the intensities. The columns between the minima to the left and right of such a maximum are labeled as signal columns;

(e) To find foreground values in these columns, we define an intensity threshold which is the minimum of MATLAB's graythresh() (computed for all non-signal-column pixels) and mean() + 3·std() (computed only for those values that are below the graythresh() value);

(f) Find pixels that are identified as signal in all time frames (or in at least 20 time frames, if there are more than 20 time frames). Label the columns that have fewer than 20 such pixels as non-signal columns;

5. Locate the molecule along the channel:

(a) Extract intensity values for the signal columns of the time-averaged rotated movie;

(b) Apply a Gaussian filter with σ = 10 pixels and a kernel of size 50 × 3 pixels;

(c) Run MATLAB's findpeaks() to detect the molecule center index;

(d) Find the molecule row and column coordinates in the channel (to that end we fit a function a + f · (tanh(x − b) − tanh(x − c)));

6. Average the molecule over 3 columns, with a padding of 100 rows at the edges of the molecule, to generate a kymograph.
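The tanh-based edge model in step 5(d) can be sketched in Python (an illustrative translation of the fitted function only, not the MATLAB fitting code; the least-squares fit of the parameters a, f, b and c is omitted):

```python
import math

def edge_model(x, a, f, b, c):
    """Step-like intensity profile used to locate the molecule edges:
    a is the background level, f the amplitude, and b, c the positions
    of the left and right edges of the molecule along the channel."""
    return a + f * (math.tanh(x - b) - math.tanh(x - c))

# The profile is near the background level a outside [b, c] and
# near a + 2*f well inside the molecule.
profile = [edge_model(x, a=10.0, f=50.0, b=20.0, c=80.0) for x in range(100)]
```

Fitting this model to the averaged fluorescence profile yields the molecule extent as the interval [b, c].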

Align kymographs
To compensate for small thermal fluctuations of the DNA inside the nanochannels, features in the kymograph need to be aligned. This is done using the alignment method NRAlign, introduced in [2], which in turn is an improvement of the WPAlign method introduced in [1]. The steps of the algorithm are described in Algorithm 2.
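WPAlign and NRAlign are feature-based methods; as a much simpler stand-in, the basic idea of shifting kymograph rows into register can be sketched as a rigid integer-shift alignment toward the first row (our illustration only, with a circular shift for simplicity; this is not the published algorithms):

```python
def best_shift(ref, row, max_shift=5):
    """Return the integer shift of `row` (within +/- max_shift) that best
    matches `ref`, judged by the dot product of the overlapping parts."""
    def score(shift):
        if shift >= 0:
            a, b = ref[shift:], row[:len(row) - shift]
        else:
            a, b = ref[:shift], row[-shift:]
        return sum(x * y for x, y in zip(a, b))
    return max(range(-max_shift, max_shift + 1), key=score)

def align_rows(kymograph, max_shift=5):
    """Rigidly shift every row towards the first row. The real
    feature-based methods allow non-rigid, position-dependent shifts."""
    ref = kymograph[0]
    aligned = [ref]
    for row in kymograph[1:]:
        s = best_shift(ref, row, max_shift)
        aligned.append(row[-s:] + row[:-s] if s else row)  # circular shift
    return aligned
```

Replacing the rigid shift with a position-dependent warp is what distinguishes WPAlign/NRAlign from this toy version.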

Limit number of time-frames
Kymographs have a varying number of time-frames (rows), depending on how many snapshots of the molecule are used. We study the efficiency of the method based on always selecting a fixed number of time-frames, n_k. This reduces the set of kymographs to only those that have at least n_k rows, giving a reduced set of kymographs K̂.

From kymographs to time-averaged barcodes

To compute time-averaged barcodes, we first need to detect the edges in the kymograph. Once the barcodes and their bitmasks are computed, we can generate consensus barcodes or compare them to theory; here we use the same methods for this as in [3].

Bitmasks

A bitmask is a binary vector associated with the barcode; its values are ones except at the edges of the barcode. The number of zero-valued pixels at the edges is chosen based on the point spread function (300 nm), the nanometer-to-pixel conversion ratio of the experiment (130 nm/px), and a factor Δ = 3. The number of pixels set to zero at the edges of the bitmask is then 3 · 300/130 ≈ 7 pixels.
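The bitmask construction can be sketched as follows (the function name and defaults are ours, mirroring the values quoted above):

```python
def make_bitmask(barcode_len, psf_nm=300.0, px_nm=130.0, delta=3):
    """Binary mask that is 1 everywhere except for edge pixels, where the
    point spread function makes the barcode unreliable."""
    n_edge = round(delta * psf_nm / px_nm)  # 3 * 300 / 130 ~ 7 pixels
    mask = [1] * barcode_len
    for i in range(min(n_edge, barcode_len)):
        mask[i] = 0
        mask[barcode_len - 1 - i] = 0
    return mask

mask = make_bitmask(20)
```

Masked pixels are then excluded when computing match scores against theory.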

In-silico simulations
We here use in-silico simulations in order to i) compute a DNA barcode for a known DNA sequence, ii) simulate DNA barcodes using randomization, iii) compute p-values for matching experimental barcodes to theory, and iv) estimate how much of the human genome we can cover using this method.

Theory generation
Generating theory barcodes

DNA sequences of the 24 human chromosomes (GRCh38.p7 dataset) in .fa.gz format were downloaded from RefSeq, the NCBI Reference Sequence Database (ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/), and are summarized in Table SM1. DNA sequences were converted into theoretical barcodes as described in [3]. For completeness, the method's steps are summarized here:

1. A set of input parameters is provided, described in Table SM2. The competitive binding model used here, which we call 'literature', uses parameter values as described in [3]. concDNA, concY and concN are the concentrations of DNA, YOYO-1 and netropsin. psfSigmaWidth is the width of the point spread function in nanometers. meanBpExt is the scaling factor (nm/bp) for the nanometer to base-pair extension. pixelWidth nm is the nm to pixel conversion ratio.
2. The average netropsin and YOYO-1 binding probabilities along the DNA are computed. We then define a two-variable function of the free concentrations of netropsin and YOYO-1, C_1 and C_2, which we minimize over (C_1, C_2). For this we use multidimensional unconstrained nonlinear minimization (Nelder-Mead), using MATLAB's built-in function fminsearch with initial parameters concN and concY.
3. Any undefined nucleotides are changed into random letters using MATLAB's command randseq;

4. If the sequence is assumed to be linear, add N = 10000 base-pairs to the ends of the sequence;

5. If the sequence is longer than 500000 base-pairs, cut it into smaller parts of length 200000 (with 3000 base-pairs added at the ends) to save memory;

6. For each of these fragments, compute the probability p(i) that YOYO-1 is bound to base-pair i;

7. Convolve these fragments with a Gaussian and (if the sequence was cut into smaller fragments) stack them back together into one long sequence to obtain the barcode;

8. Remove the extra base-pairs at the ends if the sequence was linear (isLinearTF = 1);

9. Convert from base-pair resolution to pixel resolution using a moving average window.

Stretching theory

The theoretical barcode of the whole human genome at base-pair resolution takes a large amount of space and contains redundant information, since when doing the comparisons we are interested in the pixelated theory. Therefore, instead of saving the barcode at base-pair resolution, we save it at pixel resolution. This presents a challenge, because the theory at base-pair resolution depends on the nm to bp extension of the experiments. Since this can vary from experiment to experiment, covering all possibilities would require computing the theory for a very large number of cases.
Instead, we make the observation that a convolution of two Gaussians of widths σ_1 and σ_2 is still a Gaussian, with width

σ = √(σ_1² + σ_2²). (11)

Using this observation, we first compute the pixelated theory for one pre-chosen nm/bp ratio (this should not be smaller than the largest nm/bp expected in the experiments). In this case we chose it to be 0.3 nm/bp. This gives us a theory barcode convolved with a Gaussian of width σ pixels. However, if the actual nm/bp is different, then we need to compress the theory barcode by a factor s, i.e. all the Gaussians in the convolution then have width σ_1 = σ · s, s < 1, which is not equal to σ unless s = 1.
However, we can use our observation about the convolution of two Gaussians to correct σ. Since σ_1 = σ · s, we just need to convolve with a Gaussian of unknown width σ_2, which we find from equation 11 to be σ_2 = σ · √(1 − s²).

Stretching when comparing experiments to theory

A stretching factor of up to 5% at increments of 1% was used when comparing experiments to theory. We stretched the experiments to all possible versions using these stretching factors. For stretching, we used MATLAB's interp1 function, which returns interpolated values at the specified query points using linear interpolation.
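The width correction derived from equation 11 can be checked numerically (a minimal sketch; the function name is ours and the values are illustrative):

```python
import math

def correction_width(sigma, s):
    """Width sigma_2 of the extra Gaussian needed so that a barcode
    compressed by factor s (Gaussians of width sigma_1 = sigma * s),
    convolved once more, again has effective width sigma:
    sigma^2 = sigma_1^2 + sigma_2^2  =>  sigma_2 = sigma * sqrt(1 - s^2)."""
    if not 0 < s <= 1:
        raise ValueError("compression factor s must be in (0, 1]")
    return sigma * math.sqrt(1.0 - s * s)
```

For example, with sigma = 4 pixels and s = 0.6 the correction width is 3.2 pixels, and indeed sqrt(2.4^2 + 3.2^2) = 4.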

Randomized barcode
Here we use a simpler approach for generating randomized barcodes than in [3]. Our present approach is described by Algorithm 3. Briefly, to simulate a barcode of length lenBar, we first draw lenBar normally distributed (mean = 0, standard deviation = 1) random numbers. We then convolve this vector with a Gaussian kernel of width psf, given by the point spread function (see Algorithm 4).
Algorithm 4 takes lenBar, pixelWidth nm and psfSigmaWidth nm as input and outputs simBar. We expect that this method provides a realistic representation of a barcode at the pixel level, since at the base-pair level we do not expect long-range correlations unless there are repetitive/homogeneous regions. See Figure SM1 for an example of a randomized barcode generated using our new approach.
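The randomized-barcode generation of Algorithms 3-4 can be sketched in pure Python (our simplification of the MATLAB implementation: the 'same'-length convolution is truncated at the edges, and the kernel is cut off at 3 sigma):

```python
import math
import random

def sim_bar(len_bar, pixel_width_nm=130.0, psf_sigma_nm=300.0, seed=0):
    """Simulate a randomized barcode: white Gaussian noise convolved
    with a Gaussian kernel representing the point spread function."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(len_bar)]
    sigma = psf_sigma_nm / pixel_width_nm          # PSF width in pixels
    half = int(math.ceil(3 * sigma))               # kernel reach: 3 sigma
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-half, half + 1)]
    norm = sum(kernel)
    kernel = [k / norm for k in kernel]
    # 'same'-length convolution, truncating the kernel at the edges
    bar = []
    for i in range(len_bar):
        acc = 0.0
        for k in range(-half, half + 1):
            j = i + k
            if 0 <= j < len_bar:
                acc += noise[j] * kernel[k + half]
        bar.append(acc)
    return bar
```

The smoothing introduces the pixel-level correlations expected from the point spread function while leaving no long-range structure.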

Slow computation of p-values
In this section we explain how the p-values can be computed. We do not use this method in practice, but present it here for reference (in the next subsection we explain how we speed it up). Parameters for computing p-values are given in Table SM3. For example, strFac describes the stretching allowed when comparing experiments to theory, which here is up to 5% in 1% increments.
Briefly, we first simulate a long barcode of length equal to the length of the theory, which is lenLong pixels. Then, for each stretching factor, we compute numRnd randomized barcodes. The length of these barcodes depends on the stretching factor. We then compute the maximum score for each of these randomized barcodes against the theory. The scores for all positions when comparing a short barcode against a long barcode are computed by the function compute pcc(), which computes the match scores by sliding the shorter barcode along the longer barcode, in both directions (see [3]). The resulting matrix of match scores is averaged over the stretch factors, and the distribution parameters of a functional form are computed using the function compute distribution parameters(). Finally, compute p value is used to compute the p-value for a given match score cMaxVal.
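The sliding match score can be sketched as follows (a simplified stand-in for compute pcc() that slides in one direction only; the actual code also scans the reversed barcode):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def compute_pcc(short_bar, long_bar):
    """Pearson correlation of the short barcode against every placement
    along the long barcode; the best placement gives C_max."""
    n = len(short_bar)
    return [pearson(short_bar, long_bar[i:i + n])
            for i in range(len(long_bar) - n + 1)]
```

Taking the maximum of the returned scores (over placements, stretch factors and both orientations) yields C_max.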
The main differences between this method and that in [3] are that we change the way we generate randomized barcodes, we simulate the longer barcode (this is needed for the section to follow), and we add stretching.

Fast computation of p-values
Here we describe our new method for computing p-values as used in the main text. Our new method is considerably faster than the method presented in the previous subsection. The speed-up comes from the fact that we pre-generate a database for a range of different barcode lengths and an average bp/nm extension ratio. This database is then used to skip steps 2-8 in Algorithm 5.
To generate the p-value database, we use the parameters of Table SM3, as well as two additional parameters listed in Table SM4. The algorithm we use to compute the database is given in Algorithm 6. In this method we perform calculations similar to those in steps 2-8 of Algorithm 5. We compute a randomized long barcode. Then, instead of using different stretching factors in the first loop, we loop directly over barcode lengths. The inner loop differs from Algorithm 5 in that the first parameter to sim bar is the length of barcode i. The output of this calculation is a database file data, which can be saved as data.dat or in a simple text format, as we do when saving it to PVALTOT.txt. Once the database is computed, we can use it to compute a p-value for a barcode of length l. Given this length, we compute the different lengths the barcode can have when including the stretching factors strFac. This gives us the set of lengths

l_vec = { l · strFac_i | i = 1, . . . , length(strFac) }.

We use these to select a sub-matrix of the pre-generated database. Finally, we average the rows to get numRnd maximal correlation coefficients.
We then proceed with step 10 of Algorithm 5.
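The database lookup and row averaging described above can be sketched as follows (here db is a hypothetical dict mapping barcode length to a row of numRnd maximal scores, standing in for the PVALTOT.txt contents):

```python
def lookup_scores(db, l, str_fac):
    """Collect the database rows for every stretched length
    l_vec = { round(l * f) : f in str_fac } and average them
    element-wise, giving numRnd maximal correlation coefficients."""
    lengths = [round(l * f) for f in str_fac]
    rows = [db[n] for n in lengths]        # one row of scores per length
    num_rnd = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(num_rnd)]
```

The averaged scores then feed directly into step 10 of Algorithm 5 (fitting the score distribution and evaluating the p-value).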
Simulation of p-values for different lengths (Supporting Figure S6)

As before, we use a fixed nm/bp ratio of 0.2363 and stretching of up to 5% at 1% increments. We then use lengths from 200 to 800 pixels (in steps of 5 pixels), i.e. 110 kb to 440 kb. For a given length, we take all the match scores from the database for lengths within ±5%, and take the maximum over the dimension of different lengths. This gives us 1000 match scores. From these scores, we compute the parameters of the extreme value distribution, and then find the p-value threshold at which the match score falls below each of 4 values: 0.9, 0.8, 0.7 and 0.6. We finally plot the barcode lengths versus the p-value threshold for each of these 4 cases. The results can be seen in Supporting Figure S6.
Evaluation of mappable parts of the human genome (Figure 5)

We ran the evaluation using a nm to bp ratio of 0.269 with an average fragment size of 689 pixels, which corresponds to approximately 333 kilo base-pairs (more precisely, 689 · 130/0.269). The genome is cut into fragments. A full comparison would require cutting the genome into all possible L-mers. However, mainly to reduce the computational time, and because we do not expect it to have a significant effect on the overall coverage percentage, we cut the genome into non-overlapping fragments. Each of the 24 chromosomes is cut into N_i fragments of equal length lenBarcodes = 689 pixels (where i is the number of the chromosome). There are thus some pixels at the right ends of the chromosomes that are not included in the calculations, since we only consider non-overlapping fragments.
These fragments are then mapped against the non-overlapping part of the human genome, i.e. the whole human genome with the corresponding fragment deleted. To compare a fragment against the theory, we use the Pearson correlation coefficient C as a match score. The maximum correlation C_max is saved for each fragment. All fragments which obtained a maximum correlation coefficient greater than 0.81 to any other part of the human genome are considered to be non-mappable, due to the lack of uniqueness in the barcode. The results can be seen in Figure 5 (main text).
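The non-overlapping fragmentation used in this evaluation can be sketched as (function name is ours; leftover pixels at the right end are discarded, as described above):

```python
def cut_into_fragments(barcode, frag_len=689):
    """Cut a chromosome barcode into non-overlapping fragments of
    frag_len pixels; leftover pixels at the right end are discarded."""
    n = len(barcode) // frag_len
    return [barcode[i * frag_len:(i + 1) * frag_len] for i in range(n)]

fragments = cut_into_fragments(list(range(2000)))  # 2 fragments of 689 px
```

Each fragment is then scored against the rest of the genome to decide whether it is uniquely mappable.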