Decoding dissociation of sequence-specific protein–DNA complexes with non-equilibrium simulations

Abstract Sequence-specific protein–DNA interactions are crucial in processes such as DNA organization, gene regulation and DNA replication. Obtaining detailed insights into the recognition mechanisms of protein–DNA complexes through experiments is hampered by a lack of resolution in both space and time. Here, we present a molecular simulation approach to quantify the sequence specificity of protein–DNA complexes, that yields results fast, and is generally applicable to any protein–DNA complex. The approach is based on molecular dynamics simulations in combination with a sophisticated steering potential and results in an estimate of the free energy difference of dissociation. We provide predictions of the nucleotide specific binding affinity of the minor groove binding Histone-like Nucleoid Structuring (H-NS) protein, that are in agreement with experimental data. Furthermore, our approach offers mechanistic insight into the process of dissociation. Applying our approach to the major groove binding ETS domain in complex with three different nucleotide sequences identified the high affinity consensus sequence, quantitatively in agreement with experiments. Our protocol facilitates quantitative prediction of protein–DNA complex stability, while also providing high resolution insights into recognition mechanisms. As such, our simulation approach has the potential to yield detailed and quantitative insights into biological processes involving sequence-specific protein–DNA interactions.

Example input file to perform the steered molecular dynamics simulations using plumed [2,12] and gromacs [13,1]: # Setting units UNITS LENGTH=nm TIME=ps ENERGY=kj/mol # System information # Note that with this input file the file system.pdb must be provided MOLINFO MOLTYPE=protein STRUCTURE=system.pdb# Define the range of atoms for which periodic boundary effects must be removed # Note that pbc effects are removed only for the calculation of the CV # and that i and j represent the atom indices that include the protein-DNA complex WHOLEMOLECULES ENTITY0=i-j # Definition of the contactmap consising of pairwise distances # The labels a_0, a_n, and b_0, b_n indicate atom numbers.# The labels starting with a are atoms in the protein, ranging from 0 to n # The labels starting with b are atoms in the DNA, ranging from 0 to m cmap: CONTACTMAP ... Note that this input file is an example, and needs to be adjusted to be used, by replacing the labels a 0, b 0, a n and b m with the actual atom numbers of the atoms in the protein (labeled a) and the DNA (labeled b) included in the contact map.The input files as used in this work are provided on figshare: 10.6084/m9.figshare.c.6446950.
3 Sequence specificity of the ETS domain

Structure and function of ETS
The ETS protein family is one of the most ancient transcription factor families in animal evolution [3] and contains many different eukaryotic transcription factors with a highly conserved winged helixloop-helix motif [11].The ETS domain of the PU.1 transcription factor was the first structure to be resolved in complex with DNA, with X-ray crystallography, PDB-code 1PUE [7], revealing the ETS domain as a major groove binder.The ETS domain recognizes purine-rich sequences containing a 5'-(A/T)GGA(A/T)-3' consensus, and tolerates many variations among the flanking bases, as shown by equilibrium titration experiments [10].The ETS protein PU.1 forms sequence-specific complexes with no fewer than 65 different sequences with a wide range of affinities [6,9].We selected the ETS domain from PU.1 to further test our simulation protocol, aiming to predict differences in binding of the ETS domain to three different nucleotide sequences.The nucleotide sequence from the crystal structure is also the consensus sequence as identified with equilibrium titration experiments [10] and alchemical calculations [4]: 5'-AAAAGGGGAAGTGGG-3', with only differences in the terminal base pairs.As a low affinity sequence, we inverted the consensus sequence by changing A to G and T to C, and vice versa: 5'-CCCCTTTTCCTGAAA-3'.Finally, we included an anti-consensus sequence for which experiments [10] and alchemical calculations [4] showed that the ETS domain has a lower affinity: 5'-AAAAAAGGAAGGTGG-3'.Supplementary Figure S6 shows snapshots of the systems and the three nucleotide sequences investigated in this work.Supplementary Figure S6: A) Molecular representations of the ETS domain (brown) bound to the consensus sequence (green) with the DNA interacting amino acids shown as atomic licorice representations.B) ETS domain (brown) with labeled amino acids that are used for in the contact map for the analysis and steered molecular dynamics pulling coordinate as well as the three nucleotide sequences used in this work are shown in text: consenus (green), inverse (purple), and anti-consensus (red).

Characterization of the ETS-DNA complexes
As starting point for the simulation, we used the ETS domain from the PU.1 transcription factor in complex with DNA with the consensus sequence 5'-AAAAGGGGAAGTGGG-3', PDB code 1PUE [7], with the single stranded overhang residues removed.This sequence differs from the consensus sequence as identified in Ref. [10] in the terminal base pairs.For each protein-DNA complex we performed 4x 1 µs Molecular Dynamics simulations using the same protocol for the H-NS system as described in the methods section, with the exception that we used here a concentration of 150 mM NaCl to match in vitro experimental conditions [10].
Analysis of the MD simulations is similar as described in the main text, with the following adaptations to the contact map calculation.First, we identified the atom furthest from the alpha carbon (C α ) atom in the side chain of each amino acid residue in the protein.This atom served as a representative of the side chain for subsequent analyses.For all the DNA bases, we selected the N1 or N3 atoms based on whether the base was a pyrimidine or purine, respectively.These atoms were chosen as they are located at the core of the groove or near the helical axis of the DNA.We computed the distances between the representative atoms of the protein side chains and all DNA bases.Based on these distances we applied the switching function described in the methods section to obtain the number of contacts each protein residue has with the DNA.Then we summed the number of contacts each amino acid has with the DNA bases and normalized all values.Using the normalized contact count we proceeded to select only the amino acids that have higher count than 0.4, which resulted in the selection of 11 amino acids that form at the interface of the ETS and the dsDNA strand, shown in Supplementary Table S1.Next, we selected for each amino acid the hydrogen bond donor and acceptor atoms to construct a contact map between the N1/N3 atoms of the DNA sequence.
Supplementary Figure S7 shows the total contact count of the contact map for the cumulative data of the MD simulations for the three systems.See the previous paragraph for a description of the contact map calculation.The consensus and anti sequence show a bimodal distribution at 85 and 100 contacts, while the inverse sequence shows a uni-modal distribution around 85 contacts.In addition, we decomposed the total number of contacts by the individual contributions of the protein residues, which shows that ARG232 makes the most contacts with a count above 20 with the DNA.The anti-consensus sequence shows more variation for the ARG232 indicated by the second shoulder below 20 contacts.Overall, no significant difference between the sequences are observed, except that ARG235 makes more contacts with the DNA in the consensus sequence, compared to the inverse and anti sequences.In addition, the LYS249 has a bimodal distribution in which the peak at around 3 contacts has a higher maximum contact count binding mode for the consensus and anti sequence, but not for the inverse sequence.We also compared where the protein residues bind to the DNA by computing the median contact count for each residue with respect to the individual bases pairs of the DNA, which are shown as heat maps in Supplementary Figure S7C.The majority of the residues are Supplementary Table S1: Contact map atoms between ETS and DNA grooves.
map Supplementary Figure S1: A) Example of switching function used in the contact map with two different sets of parameters: the rational decay with r 0 = 0.4, d 0 = 0.25, nn = 2, mm = 4 in blue and the linear decay with r 0 = 3.0, d 0 = 0.3, nn = 1, mm = 12 in orange, and in gray a linear function for comparison.B) 2D kernel density estimates of the rational contact count, C QGR−minor , with respect to the linear contact count for H-NS in complex with the high affinity sequence and its GC-analogue respectively based on the FI MD simulations (3.5 µs cumulative simulation time).

1. 3
Analysis of Steered Molecular Dynamics simulations Supplementary Figure S4: A) Pulling coordinate with respect to the PMF, work of the individual SMD runs and RM SD DN A split based on the dissociation route of the high affinity system with left the R-G-Q path and right the Q-G-R path.The color gradient in the middle and lower panel correspond to the maximum work of the respective run (dark low work, light high work).B) Illustration of H-NS with the high affinity sequence along the pulling coordinate, λ, of the highest work SMD run following the Q-G-R route.Starting from λ = 0.4, the R114 residue does not dissociate.Instead, R114 pulls the central portion of the DNA strand along, causing the ends of the DNA to bend away from H-NS.Upon final detachment of R114, hydrogen bonds in multiple base pairs break, resulting in extremely high work and RMSD values for the DNA.Note that the runs with high work have a negligible contribution to the PMF.Supplementary Figure S5: 2D kernel density estimates (KDE) of the C QGR−minor contact count with respect to the center of mass distance in nm between: H-NS and DNA (COM DBD−DN A ), 4 central bases pairs and the QGR motif (COM cDN A−QGR ), and the hydrogen bond acceptors in the minor groove and hydrogen bond donors in the QGR motif (COM A−D ).The top row shows the KDEs of the high affinity sequence and the bottom row for the CG-analogue.In addition in each panel annotated with the correlation coefficient between the C QGR−minor and respective COM 2 Example input file for Steered Molecular Dynamics

#
reducing the total number of contacts from 108 to 65 # With KAPPA indicating the force constant of the spring in kJ/mol nm^2 MOVINGRESTRAINT .Store output in COLVAR file every 50 steps PRINT ARG=* FILE=COLVAR STRIDE=50 NE, NH1, NH2 ASN236 N, OD1, ND2 LYS239 N, NZ LYS249 N, NZ located in the helix that is bound to the major groove of the DNA.The two flanking residues LYS219 and LYS249 interact with the backbone region of the minor grooves in the 5' and 3' direction.The heat maps in Supplementary Figure S7 show that the arginines in the helix-loop-helix motif form the most contacts with the center of the nucleotide sequence, and the flanking lysines interact with the base pairs close to the DNA ends.Supplementary Figure S7: A) Cummulative density estimates of the contact count between the ETS domain and DNA for the MD simulations of the consensus (green), inverse (purple) and anti-consensus (red) sequence.B)) Decomposed density estimates of the number of contacts of each ETS residue with respect to the complete DNA.C) Heat-maps of mean contacts of the ETS residues interacting with the basepairs of the respective sequences.Supplementary Figure S8: A) PMF along C ET S−groove showing Boltzmann-weighted average work plot (solid lines) for each system (green consensus, purple inverse, and red anti-consensus).In addition with the standard deviation of all work-curves is shown with the shaded area above the Boltzmannweighted average work.B) Shows snapshots along the progression of the pulling coordinate at λ = 1.0, 0.5, and 0.0 for the lowest work SMD run of the consensus affinity sequence starting in the FI state (with the dsDNA in green and the ETS domain in brown)