Loops are geometric catalysts for DNA integration

Abstract The insertion of DNA elements within genomes underpins both genetic diversity and disease when unregulated. Most of DNA insertions are not random and the physical mechanisms underlying the integration site selection are poorly understood. Here, we perform Molecular Dynamics simulations to study the insertion of DNA elements, such as viral DNA or transposons, into naked DNA or chromatin substrates. More specifically, we explore the role of loops within the polymeric substrate and discover that they act as ‘geometric catalysts’ for DNA integration by reducing the energy barrier for substrate deformation. Additionally, we discover that the 1D pattern and 3D conformation of loops have a marked effect on the distribution of integration sites. Finally, we show that loops may compete with nucleosomes to attract DNA integrations. These results may be tested in vitro and they may help to understand patterns of DNA insertions with implications in genome evolution and engineering.


INTRODUCTION
Genomes are constantly reshaped and manipulated during transcription, replication and cell division.They also need to be reshuffled during eukaryotic meiosis and undergo horizontal exchange of genetic material in prokaryotes.DNA transposition is one of the main factors driving genetic diversity and expansion, especially in plants.Transposons make up about 85% of the maize genome [1,2] and up about 50% of the human genome [3] and can be associated with the onset of diseases [4].At the same time, the infection and replication of many viruses, including HIV-1, require the integration of the viral DNA within the host genome.Given the large abundance of non-transcribed DNA within the human genome (up to 90% [5]), a random insertion process would most of the time lead to unsuccessful infections.Instead, HIV-1 and other lentiviruses, appear to integrate their DNA non-randomly and in vicinity of transcriptionally active genes [6][7][8].Additionally, both viral and transposable element insertions often appear clustered [6], e.g. the LINE-1 and Alu repeats [8].These clusters of repeated elements often contribute to the organisation and compartmentalization of the genome, thereby impacting the global genome organisation [9][10][11][12][13][14][15].Viceversa, activation of transcription of transposable elements can cause the unfolding of chromatin and the downstream spatial rearrangement of the genome [16].
Understanding the process of DNA insertion and the biophysical mechanisms driving non-random insertion patterns and clustered integrations is an important outstanding question [17,18].While it has been argued that DNA topology and chromatin structure have an impact on the integration site selection [19][20][21], a systematic quantification of how integration is favoured or * corresponding author, davide.michieletto@ed.ac.uk disfavoured in certain DNA and chromatin topologies remains an open challenge [22][23][24][25].
Recent works have highlighted the importance of spatial and physical constraints in the integration patterns of HIV-1 [7,20].For instance, genes that are located closer to the nuclear envelope are more often integrated by HIV [7].At the same time, insertion of transposable elements may be limited by the accessibility of transposase into chromatin [18], as employed in the "assay for transposase-accessible chromatin" (ATAC) [26][27][28].Additionally, previous simulations have suggested that different physical features may be important at different scales during the process of integration [20].For instance, HIV-1 viral DNA associated to the integrase enzyme first enters the nuclear envelope and diffuses through the nucleus within the large-scale chromatin mesh [29,30]; then, it likely binds and diffuses a chromatin fibre [31] and ultimately attempts to integrate within DNA [32], which requires elastic deformation of the DNA double-helix [33].DNA integration is a multi-scale problem, involving a 3D search in a complex environment [7,29] and a 1D search within a complex free energy landscape [17].For this reason, in order to fully understand DNA integration, one requires to dissect the important contributions at each length-scale [20].
In this work, we focus on the contribution of DNA looping, its 3D organisation, and competition with nucleosomes.Specifically, we first quantify how the length of DNA loops affects integration site selection and we discover that the longer the loops the less likely they are to attract integrations.Intriguingly, the rate of integration per unit length of loop is non-monotonic and displays a minimum: short loops are energetically favourable to integrate but difficult to find, while larger loops are less energetically favourable but more likely to be found and attract more integration attempts.We also find that the way loops are organised matters: clustered loops are systematically less integrated than sparse ones.Finally, we show that within a landscape where both loops and nucleosomes are present, loops are significantly more integrated if their length is shorter than the one of nucleosomal DNA.

METHODS
We model the insertion of a DNA element, e.g.transposons or viral DNA, into a DNA substrate, e.g.human or bacterial genome, using a coarse-grained bead-spring polymer model.Since integration is a relatively rare event that we want to study in isolation, using a coarsegrained model allows us to sample a far larger time and statistics than all-atom or oxDNA models.In our model, DNA is represented as a semi-flexible bead-spring chain polymer composed of N beads of diameter σ, which is set to be 2.5 nm (or 7.35 bp).The dynamics of each bead are determined by a Langevin equation: where r i is the position of the i-th bead, m i and γ i are respectively the mass and the friction coefficient of the i-th bead due to an implicit solvent.Energies are expressed in units of k B T , where k B is the Boltzmann constant and T is the system's temperature.Distances are expressed in units of σ.Time is expressed in units of the typical time for a bead to diffuse a distance of its size τ Br = σ 2 /D = 3πησ 3 /k B T , where D is the diffusion constant for a bead, and η the viscosity of the implicit solvent.For simplicity, all the particles have the same mass and friction coefficient.The last term in Eq. ( 1), δF , is a stochastic force with zero mean ⟨δF (t)⟩ = 0 and amplitude ⟨δF i,α (t)δF j,β (t ′ )⟩ = 2k B T γ δ ij δ αβ δ(t − t ′ ), where i and j run over particles and α and β run over the Cartesian components.This choice ensures detailed balance [34].The term containing the gradient ∇U i is the force acting on particle i due to all the other particles in the system as explained below.Consecutive polymer beads are connected together through an harmonic potential: Where k harm (set to 20k B T /σ 2 ) determines the strength of the spring and r 0 (set to 1.1σ) is the equilibrium bond distance.We implement the bending rigidity of the DNA through a Kratky-Porod potential between triplets of consecutive beads where t i is the tangent vector connecting beads i − 1 to i, and l p is the persistence length of the polymer which is set to be l p = 20σ = 50 nm.Finally, steric hindrance interactions between non-adjacent polymer beads are simulated through a LJ potential, shifted to be zero at r cut = 2 1 6 σ, as if r < r cut and 0 otherwise, where . ( The total potential energy acting on bead i is the sum of these three contributions.Equation ( 1) is integrated using the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) with a velocity-Verlet scheme [35].We set the integration time step to be ∆t = 0.01τ Br .Our model does not account for the torsional rigidity of the DNA, and we plan to include it in the future.
Using the model described above we simulate two chains, one circular representing viral DNA or a transposable element, and the other linear representing the target, or substrate DNA.Both polymers diffuse in a box and we attempt the integration of the circular polymer into the linear chain using a modified version of the "double-bridging" algorithm implemented in LAMMPS as fix bond/swap [36].This code allows us to perform "reconnection" moves and swap the bonds in the system that are connecting beads that are closer than R c = 2σ in 3D (see Fig. 1).The reconnections are attempted every 1 LAMMPS step.We implemented some modifications to this fix that allow us to perform recombination moves between viral and host polymers (inter-chain reconnections) while avoiding intra-chain (or self) reconnections [20,37,38].The reconnection moves are also weighted by a Metropolis test that assigns a probability of successful swap depending on the energy difference ∆U between the old and the new configuration before and after the swap.This probability is 1 if ∆U < 0 and p = e −∆U/k B T if ∆U ≥ 0. We also modified this part of the code to perform polymer reconnections that bypass the Metropolis test thus allowing non-equilibrium integration.These modified codes can be found at https://git.ecdf.ed.ac.uk/taplab .FIG. 2. a Sketch of our search procedure for nearest 1D regulatory elements in the genome b Histogram of 1D distances between nearest-neighbor pairs of regulatory elements, i.e. enhancer and promoters, within the human genome: 3% of these are below 500 bp and 8% below 1 kbp [39].c Sketch of the simulation setup, where we investigate the integration probability in loops of different sizes.d Histogram of integration sites probability along the polymer for loops of different sizes (80 bp in green and 448 bp in grey).e Probability of integration inside the looped region as a function of loop length, comparing equilibrium (green) and non-equilibrium (purple) algorithms.Random integration would yield the black line indicated by the arrow.
The simulations are performed as follows.We prepare an initial polymer configuration as a random walk and impose the loop by setting a harmonic bond (with strength increasing from 1 to 20 k B T during the equilibration) in between two beads at 1D distance l.For each of the > 1000 replicas, we let the system equilibrate for at least 3 10 5 LAMMPS steps.At the end of the equilibration, we allow the integration to happen.We stop the simulation after 10 7 timesteps, irrespectively of whether the integration has happened or not.We output the bond list every 10 6 steps and from that we reconstruct the topology of the polymers and can detect when and where the integration has happened.Using this information we then build a histogram of where the integration sites are along the polymer and can thus quantify how many are within and outside the looped regions.
In analogy with previous work, here we argue that short DNA loops may attract more integrations due to the fact that they have a lower energy barrier for elastic deformations [20].On the other hand, short DNA loops are energetically costly [46,56,57].In order to quantify how frequently short DNA loops may appear in the human genome, we compute the 1D distance between nearest-neighbouring regulatory elements, such as enhancers and promoters, as listed in the GenHancher database [39].Fig. 2a shows that about 8% of these regulatory elements are less than 1 kbp apart and 3% of them shorter than 500 bp.While only a fraction of them will be looped at any one time, these figures show that there is a considerable amount of short (< 1 kbp) potential loops in the human genome.On top of this, we note that the integrase itself can compete with nucleosomes to create an extremely short loop of DNA in the range of 5-10 bp [54].Likewise, we expect similar, if not greater proportion of short loops between regulatory elements in bacteria, such as the lac repressor [58] or deformations of the DNA substrate for instance created by nucleoidassociated proteins such as IHF [52,59].
Motivated by this, we aim to characterise how the 1D loop length affects the insertion of the circular DNA into the target substrate DNA.To do this, we perform simulations in which a loop of length ℓ is stabilised in a polymer of length N = 100 beads = 735 bp introducing a harmonic spring in between two beads.We then perform 1500 independent simulations to stochastically sample the probability distribution, p(x), of integrating at position x along the polymer.In Fig. 2d we show the profile of integrations, i.e. p(x), over the simulated polymer.One can appreciate that the case in which there is a short loop (80 bp) in the polymer yields a distribution p(x) peaked within the loop, whereas the case with a longer loop (448 bp) yields a constant distribution compatible with the random p rand = 1/N distribution.
To better quantify the enhancement of integrations within the looped region compared with the ones outside it, we compute the probability of integration per bead inside the loop as p in = I in /(ℓI tot ), where I in is the number of integrations that occur inside the loop and I tot the total number of integrations across our simulations.If the insertions were random we would expect I in /I tot = ℓ/N , recovering a random integration probability per bead p rand = 1/N .By analysing our simulations, we find that p in is significantly larger than the random probability when the polymer has a short loop (< 200 bp), and that p in decays to 1/N for large loop lengths.We understand this as follows.Short loops store a significant amount of energy in the form of looping.This implies that when the viral loop integrates into the pre-looped substrate, it lowers the (free) energy of the system because the loop is then longer and the curvature smaller.This is also in line with the experimental finding of favoured integration in DNA molecules carrying torsional stress [17] and can be explained as follows: the elastic energy paid for looping of an elastic rod of length ℓ and persistence length l p is with ϵ a numerical factor depending on the shape of the loop [60].Assuming the integration process to be an Arrhenius, energy activated, process we expect the integration in looped substrates to occur at a rate κ = exp ∆U/k B T where ∆U = U (after) − U (before) < 0. In figure (Fig. 3a, blue) we indeed show that an integration event within a loop (indicated with an arrow) lowers the energy of the system.On the contrary, for integrations in non-looped substrates ∆E out = 0, as there is no free energy gain (Fig. 3a, green).We can compute the energy difference before/after integration within a loop as ∆U = U (after)−U (before) ≈ a/(ℓ+l HIV )−a/ℓ ≈ −a/ℓ+const for ℓ ≪ l HIV and where a is a numerical constant.We test this approximation by measuring the energy change during an integration event and we plot the difference ∆U in Fig. 3b which indeed displays a 1/ℓ dependence, as expected.In turn, the probability of integration within a loop can be written as P (found|in)P (in) = P (in|found)P (found) (7) and where we can take P (found|in) = 1.In this equation P (found) is the probability of a viral DNA to find a loop, and P (in) is the probability of integration in the loop.In turn, we can write that the probability that there is an integration within the loop, conditional to the fact that the viral DNA has found the loop is P (in|found) = exp (−∆U/k B T ) ∼ exp (−a/ℓ).At the same time, the probability that the viral DNA finds the looped segment within the simulation box is P (found) = 4/3πσ 3 ℓ/V ∼ ℓ, where V is the volume of the box.In simulations, we can test this scaling by tracking the fraction of time that the viral loop spends in 3D proximity of the DNA loop as a function of different loop sizes; as shown in Fig. 3c, this quantity is to a good extent linearly increasing with the size of the loop.The same argument holds for integrations outside of the loop, where P (out|found) = exp (−∆U/k B T ) ∼ 1 (as there is no energy difference) and P (found) ∼ (N − ℓ).Finally, we can write that the fraction of viral loops integrated within the looped substrate is then expected to be where a is expected to be proportional to ∆U/k B T and b a constant.By plotting the fraction of integrations inside the looped region f in = p in ℓ we see that the data follows extremely well the prediction of Eq. ( 8) (see Fig. 3d).Intriguingly, our findings imply that there is an optimum length to reduce the amount of integrations within the looped segment.As one can appreciate from Fig. 3d, the minimum corresponds to a loop that is long enough to possess a small free energy gain upon integration, yet short enough to be difficult to find.

Integration in clustered and sparse loops
We now ask what happens to the distribution of integrations when there are many loops along the substrate.In particular, we are interested to understand what are the cooperative effects that appear when the loops are sparse and uniformly distributed along the substrate or when they are clustered in a short segment of the polymer (Fig. 4a).To investigate this question, we therefore perform simulations with many loops (all of the same length ℓ = 80 bp) formed along a polymer of 10.3 kbp in either sparse or clustered 1D arrangement (Fig. 4a).
Interestingly, we observe that the 3D conformations of the polymers are rather distinct, with the latter (clustered) case being more swollen and with the looped region occupying a smaller volume fraction than in the former (sparse) case (see snapshots in Fig. 4a and radius of gyration in Fig. 4c).
To then understand if this distinct 3D organisation due to the 1D arrangement affects the overall integration probability, we perform simulations with varying num-bers of loops while preserving constant loop size (Fig. 4d).
In line with what we discovered in the previous section, we observe that the sparse loops attract more integrations overall.Again, we argue that this is due to the smaller 3D volume fraction occupied by the 1d clustered loop arrangement.
The consequence of such a different 3D organisation is reflected on the distribution of integration events.Indeed, as shown in Fig. 4b-c, we observe that in the sparse case the integrations are uniformly distributed among all the loops in the polymer, however in the case of clustered loops, the ones at the two ends of the region hosting the loops display more integrations than the inner ones (Fig. 4c).We argue that this is due to a "screening" effect, whereby the marginal loops are typically the ones more exposed while the more internal ones are more tucked in within the clustered looped region.

Integration in looped euchromatin
We conclude this work by investigating the integration within a substrate that displays both loops and nucleosomes.Nucleosomes are assembled as in Ref. [20] where we have previously shown that DNA wrapped around histones attract more integrations due to their strongly bent states, in line with experiments [17,22,31,54].Here we ask what happens if in a segment of chromatin there is a competition, or a synergy, between looped and nucleosomal DNA and whether the two would enhance even more the integration in looped euchromatin.
To test this, we perform simulations of a polymer of length 10.3 kbp wrapped around 30 histones (where each block of nucleosomal DNA is 20 beads or ∼ 147 bp) and displaying a central looped region with varying loop lengths.Specifically, we compare two systems: one with loops of length ℓ = 80 bp (11 beads), and one with loops of length ℓ = 154 bp (21 beads) (see snapshots in Fig. 5a,c).We observe that shorter loops enhance integration within the looped region rather than into the nucleosomal DNA.Instead, having loops of approximately the same size as the nucleosomal DNA blocks leads to a more uniformly distributed integration profile, preserving only the preference of "bent" DNA regions over the rest of the polymer (Fig. 5b,d).We note that the spikes in the integration profile correspond to the location of a nucleosome or a loop.In line with what we discovered in the previous section, we again observe in the profile of integration probability, that the interior loops and nucleosomes are screened by the more exterior ones (i.e.P (s) is systematically smaller in the middle with respect to the edge of the polymer).
Finally, we highlight that recent cryo-EM structural data revealed that the intasome-nucleosome structure creates a short, highly bent DNA loop by shifting the nucleosome out of registry [54].Such a short (5-10 bp) loop would, in our model, favour even more the integration within this extremely bent loop-on-nucleosome feature.

I. CONCLUSIONS
How geometric features of DNA and chromatin may affect the integration site selection of retroviral DNA or transposable elements is far from understood.While it is widely accepted that the integration site selection is not random, the underpinning (bio)physical processes driving the selection are still largely unexplored.
In this work, we used computer simulations to explore the role of loops in DNA and chromatin in determining integration site selection.We use a model of reconnecting polymers in equilibrium, hence iso-energetic and satisfying detailed balance, that has already been successful at capturing the distribution of HIV integration site selection in the human genome [20].First, we discover that integrations are favoured in looped regions due to the release of bending energy.This is in line with experimental observations showing that integration is favoured in nucleosomes [22] and supercoiled substrates [17].In spite of the fact that the microscopic action of strand exchange is iso-energetic, the release of bending energy acts as a geometric catalyst for the integration site selection.
Intriguingly, we observe a competition between the free energy gained and the probability of finding such loops.More specifically, integration in small loops releases a larger amount of free energy compared with larger loops, but they are more difficult to find and vice-versa.In our simulations, this competition manifests itself with a nonmonotonic behaviour of the integration rate displaying a minimum around 200 bp.
We have also explored the effect of having many loops on the same substrate.Interestingly, the 3D conformation of the polymer strongly depends on whether the loops are sparse, i.e. uniformly distributed along the chain, or are clustered in 1D.In the latter case, the looped region occupies a smaller fraction of the simulation box and is, as a consequence, overall less integrated than the sparse loops.This many-loop system also displays a screening effect whereby 1D clustered loops display a non-uniform integration probability, that is largest for the loops that are most external to the cluster.Finally, we perform simulations of a chromatin substrate and discover that loops attract more integrations, as long as they are shorter than the nucleosomal DNA segments.Our simulations suggest that loops-onnucleosomes, such as the ones generated by the intasomenucleosome complex [54], are perhaps the most effective at attracting integrations.
We argue that our results will illuminate the role of loops as geometric catalysts in determining the integration site selection in vivo.We argue that our results could be tested by designing artificial loops either in vivo using, for instance, bivalent dCas9 [61] or in vitro DNA origami [62] or generic bridge proteins [46].

FIG. 1 .
FIG. 1. Snapshot from molecular dynamics simulations displaying an integration event within a DNA loop.Blue beads represent non-looped target DNA, purple beads looped target DNA and yellow beads represent viral DNA.

FIG. 3 .
FIG. 3. a Angle energy measured during the course of two simulations, one displaying an integration event within a loop of length 51 bp (blue trajectory) and one displaying an integration in the non-looped region of the polymer (green).b Scaling of the change in (free) energy before/after an integration event within a loop as a function of the loop size, displaying a ∆U ∼ 1/ℓ dependence.c Scaling of the fraction of time the viral DNA spends in the 3D vicinity of the looped section as a function of loop length and displaying a linear scaling.d Fraction of integrations within the looped region.The black line is a fit of Eq. (8) with ∆U/kBT and b as free parameters.

FIG. 4 .
FIG. 4. a Sketch and snapshots of our simulated setup.One can appreciate that sparse loops cover a larger volume fraction than clustered ones.b We keep the length of the loops constant to ℓ = 80 bp and vary the number of loops on the polymer, increasing the fraction of contour length covered by loops.c Radius of gyration of the region hosting loops.d Fraction of integrations inside loops normalised by the total amount of integrations.Polymers with clustered loops display a systematically smaller fraction of integrations occurring in loops.e-f Histogram of integration probability along the polymer in the case of (e) sparse and (f) clustered, loops.

FIG. 5 .
FIG. 5. a Snapshot of chromatin simulation with short l = 80 bp loops.b Probability of integration along the polymer with short loops.c Snapshot of chromatin simulation with loops comparable with nucleosome length l = 154 bp.d Probability of integration along the polymer with long loops.