Multiscale Bayesian simulations reveal functional chromatin condensation of gene loci

Abstract Chromatin, the complex assembly of DNA and associated proteins, plays a pivotal role in orchestrating various genomic functions. To aid our understanding of the principles underlying chromatin organization, we introduce Hi-C metainference, a Bayesian approach that integrates Hi-C contact frequencies into multiscale prior models of chromatin. This approach combines both bottom-up (the physics-based prior) and top-down (the data-driven posterior) strategies to characterize the 3D organization of a target genomic locus. We first demonstrate the capability of this method to accurately reconstruct the structural ensemble and the dynamics of a system from contact information. We then apply the approach to investigate the Sox2, Pou5f1, and Nanog loci of mouse embryonic stem cells using a bottom-up chromatin model at 1 kb resolution. We observe that the studied loci are conformationally heterogeneous and organized as crumpled globules, favoring contacts between distant enhancers and promoters. Using nucleosome-resolution simulations, we then reveal how the Nanog gene is functionally organized across the multiple scales of chromatin. At the local level, we identify diverse tetranucleosome folding motifs with a characteristic distribution along the genome, predominantly open at cis-regulatory elements and compact in between. At the larger scale, we find that enhancer–promoter contacts are driven by the transient condensation of chromatin into compact domains stabilized by extensive internucleosome interactions. Overall, this work highlights the condensed, but dynamic nature of chromatin in vivo, contributing to a deeper understanding of gene structure–function relationships.

equilibration of the average radius of gyration, neighboring distances and angles).For the model parametrization, the bond and angle potential strengths were determined directly from the corresponding 1CPN distributions by Boltzmann inversion, while the non-bonded interaction strength for theta condition was determined from 10 7 -steps 1kb model simulations with various strengths for polymer lengths of 50, 100, 200, 400, and 800 kb (Fig. S1D).
The same kernel parameters are used for the Hi-C metainference simulations.
All our MD simulations were performed using the software LAMMPS (2) patched with PLUMED 2 (3) implementing Hi-C metainference.In the 1kb model simulations, we integrated the equations of motion by Langevin dynamics using a timestep  = 0.005</ & , where the mass of the 1kb beads m is set to the default mass unit of 1 g/mole.Within the simulations, we set the bead sizes to the default LAMMPS unit of 1Å, but this is later rescaled to a=22nm when measuring actual physical distances.In the 1CPN simulations, we integrated the equations of motion by Langevin dynamics using a timestep of 60 fs and the default 1CPN settings (4).We set the DNA sequence parameter controlling nucleosome sliding to that of telomeric polyTTAGGG sequences, representing an average genomic sequence with intermediate nucleosome affinity.
For the metainference applications to the Sox2, Pou5f1, and Nanog loci of mESCs, the target systems were studied at 1kb resolution with N=128 replicas by running 10 6 -steps Hi-C metainference simulations (the first 10 5 steps of each simulation were discarded as equilibration).For the 1CPN metainference application to the 10kb Nanog locus, we chose the initial chromatin conformations among those sampled during the unbiased 1CPN trajectories, selecting for each of the 128 replicas the 10kb conformation with the smallest root mean square distance to the corresponding one observed at the last step of the 1kb metainference simulations (this ensures that the initial 1CPN ensemble is already in reasonable agreement with the target Micro-C map).The 1CPN Nanog simulation was 15x10 6 MD steps long and we discarded the first 5x10 6 steps as equilibration (based on the convergence of the scaling factor α).For all mESC applications, we ran simulations using 128 replicas.
Metainference allows a variety of error models and priors, and upon evaluation of convergence speed and accuracy, we chose to use a single Gaussian error model acting on all the entries of the target contact map.When integrating the Hi-C data into MD, we also include in the metainference bias distant loci with zero Hi-C reads (so pairs of loci where contacts are not observed in experiments), unless they correspond to a genomic location where all the entries are zero (which likely reflects issues with sequencing at this location).When using the 1kb model, the metainference biasing force originating from a given contact between two genomic locations is applied to the two corresponding 1kb beads, while for the 1CPN simulations the bias is applied to the corresponding nucleosome core particles.

Analysis of the copolymer trajectories by Markov state modeling
To analyze the conformational ensembles and the dynamics of the polymers for our Hi-C metainference validation, we first performed TICA dimensionality reduction (5) on the copolymer contact maps defined from the positions of every 5 th 1kb beads using the kernel cij(dij)=1/(1+dij/r0) with r0=22 nm.The coordinates of the polymers in our Hi-C metainference simulations have been transformed using the same linear TICA transformation determined from the copolymer simulations.Fig. 2D and SI Fig. 2D show the free energy landscapes along the two slowest TICA coordinates for the Hi-C metainference (128 replicas) and copolymer MD simulations, respectively.To quantitatively compare the probability distributions of the metainference and target copolymer ensembles, we divided the ensembles into 50 clusters by k-means clustering (6) using the 6 slowest TICA coordinates of the copolymer system.The Hi-C metainference conformations have been equivalently divided into 50 clusters based on the same cluster centers determined from the copolymer clustering.Fig. 2E indicates the Kullback-Leibler between the metainference and copolymer equilibrium probability distributions of these 50 clusters.To evaluate the dynamics, we estimated, for the metainference and copolymer systems, the transitions probabilities between the clusters by Markov state modeling using a lag-time of 10 4 MD steps (7) (Fig. 2G from main text).parametrize the 1kb prior model of chromatin (these are also described in SI Fig. 1).The prior 1CPN system is made of 50 nucleosomes with a uniform 200 bp nucleosome repeat region; here the enhancer-promoter distances represent the distances between the nucleosomes closest to the corresponding enhancer and promoter regions in the Hi-C metainference simulations.

Figure S1 .
Figure S1.Unbiased chromatin fiber MD simulations.A-C.Running averages over 32 independent 50-nucleosome 1CPN trajectories showing the equilibration of radius of gyration (A), and chromatin fiber neighbor distances (B) and cosine of the angles (C) over time.To compute the distances and angles, the chromatin fiber is coarse-grained to 1kb beads, each located at the center of mass of 5 neighboring nucleosomes.The neighbor distances and angles are calculated based on the positions of these 1kb beads along the fiber (as indicated in Fig.1Afrom the main text).For the parametrization of the 1kb model, the first 5x10 9 MD steps are discarded as equilibration.The dotted lines are exponential fits to the data.D. Scaling of the average radius of gyration (Rg) with chromatin fiber length in our 1kb resolution model, plotted on a logarithmic scale, based on 10 7 -steps MD simulations.The slope of 0.5 indicates that the radius of gyration grows as Rg~L 0.5 , as an ideal chain.

Figure S2 .
Figure S2.Hi-C metainference validation.A-C.Comparison between metainference (128 replicas) and target copolymer ensemble probability distributions of the distances between enhancer-like elements 1 and 2 (A) and 2 and 3 (B), and the radius of gyration (C).D. Free energy landscape of the target copolymer ensemble obtained by projecting the conformations along the two slowest TICA coordinates.E. Components of the two slowest TICA coordinates, capturing different interactions between the three enhancer-like elements of the copolymer.

Figure S3 .
Figure S3.Hi-C metainference of the Sox2 and Pou5f1 loci of mESCs using the 1kb model.A. Time autocorrelations of the distances (<Δd(t)Δd(t+τ)>t/<Δd 2 >) between the Sox2 and Pou5f1 promoters and their respective enhancers +100kb and +40kb away.B. Mean square displacement of the Sox2 enhancer-promoter 3d distance <[Δr(t+τ)-Δr(t)] 2 >t as a function of simulation timestep.The scaling factor to convert simulation time to real time (reported throughout the main text) is obtained by matching the typical chromatin experimental diffusion coefficient of 0.25 μm 2 /s (8) based on a linear fit within a small time window (dotted line).TODO add comparison of Sox2 and Pou5f1 contact probability between metainference and experiments.

Figure
Figure S5. A. Eigenvectors of the first four tetranucleosome principal components, projected on the tetranucleosome contact maps (the coordinates used for PCA).Darker contacts correspond to higher eigenvector components.B. For each nucleosome along the 10kb Nanog locus (x-axis), the average distance to the nearest nucleosome as observed in the 1CPN metainference ensemble.

Figure
Figure S6. A. Free energy landscape of the 10kb Nanog locus from metainference simulations projected along the first two principal components obtained from the chromatin contact maps, together with the cluster centers of the 4 k-means clusters shown in Fig. 6A from the main text (using the same color scheme).B. Representative 360-s unbiased MD trajectory using the 1CPN prior model (in the absence of metainference) and the corresponding snapshots of the initial and final chromatin configurations.C. Comparison of the probability distributions of the enhancer-promoter distances observed in the Hi-C metainference (blue) and unbiased (grey) MD simulations.The unbiased ensemble was generated from the same 36 4x10 9 -steps trajectories (excluding equilibration) used to