Absolute binding-free energies between standard RNA/DNA nucleobases and amino-acid sidechain analogs in different environments

Despite the great importance of nucleic acid–protein interactions in the cell, our understanding of their physico-chemical basis remains incomplete. In order to address this challenge, we have for the first time determined potentials of mean force and the associated absolute binding free energies between all standard RNA/DNA nucleobases and amino-acid sidechain analogs in high- and low-dielectric environments using molecular dynamics simulations and umbrella sampling. A comparison against a limited set of available experimental values for analogous systems attests to the quality of the computational approach and the force field used. Overall, our analysis provides a microscopic picture behind nucleobase/sidechain interaction preferences and creates a unified framework for understanding and sculpting nucleic acid–protein interactions in different contexts. Here, we use this framework to demonstrate a strong relationship between nucleobase density profiles of mRNAs and nucleobase affinity profiles of their cognate proteins and critically analyze a recent hypothesis that the two may be capable of direct, complementary interactions.


Figure S1
Potential of mean force (PMF) curves for the binding between ADE, CYT, URA and THY and all side chains in water and in methanol 3

Figure S2
Free energy of bringing side chains and nucleobases from water to methanol and of bringing each pair of side chain and nucleobase from the unbound state in water to the bound state in methanol 4

Figure S3
Distributions of correlations between mRNA PUR content and protein affinity for nucleobases in methanol 5

Figure S4
Correlations between mRNA PUR content and protein affinity for nucleobases in water 6

Table S1
Correlations between binding free energy scales 7

Table S2
Median values of pairwise Pearson coefficients between sequence profiles of mRNA nucleobase content and their cognate proteins' profiles of binding free energy with a given nucleobase in water 8

Table S3
Median values of pairwise Pearson coefficients between sequence profiles of mRNA nucleobase content and their cognate proteins' profiles of binding free energy with a given nucleobase in methanol including ΔG W-->M (unb) 9

Table S4
Pairwise Pearson coefficients between binding free energies of side chains to nucleobases and average nucleobase codon content over the whole human proteome in water and methanol

Theory
Theory section explaining how the PMF, the binding free energies and the free energy difference of bringing the unbound state from water to methanol are calculated

Methods
Umbrella sampling and analysis of the matching between mRNA composition and cognate proteins' nucleobase affinity 16 S3 Figure S1. Potential of mean force (PMF) curves for ADE, CYT, URA and THY (rows) with all side chains in water (left) and in methanol (right).   Table S3. Median values of pairwise Pearson coefficients between sequence profiles of mRNA nucleobase content and their cognate proteins' profiles of binding free energy with a given nucleobase in methanol including ΔG W-->M (unb). Colors represent the P--values, obtained by shuffling the affinity scales 10 4 times.  Table  S4. Pairwise Pearson correlation coefficients between binding free energies of amino acids to a given nucleobase (scl) and the average nucleobase content of their codons over the whole human proteome (cdn). Results are shown for water (A) and methanol (B); colors represent the P--values, obtained by shuffling the affinity scales 10 6 times.

Potentials of mean force and binding free energy
The potential of mean force (PMF) is defined as the free energy along a reaction coordinate r, where r represents e.g. a specific angle or a distance. A prerequisite for the calculation of a PMF is sufficient sampling at each value of r. Unfortunately, this also includes some higher energy configurations that are, by definition, sampled only rarely in molecular dynamics (MD) simulations. Visiting such configurations in simulation can be enhanced by applying umbrella sampling (US), which makes use of biasing potentials to overcome energy barriers. The biasing potentials in this study are defined as harmonic distance restraints centered at several values along the reaction coordinate r. Although the performed simulations give rise to biased probabilities along r for each window separately, the weighted histogram analysis method (WHAM) is used to combine and unbias these results, leading to a raw potential of mean force, ΔG WHAM (r). This can in turn be used to obtain the probability P(r) to find the two molecules at a radial distance r.
Here, C 1 is a constant, k B is the Boltzmann constant and T the temperature in the units of Kelvin. The PMF is defined in terms of the radial distribution function g(r): The radial distribution function based on MD simulations is defined as the ratio between P(r) and the probability of finding the two molecules at distance r in the case of a homogenous distribution, as show in eq. 3.

=
!"# 3 Here, V box is the volume of the box and V(r) the volume available at distance r. This definition of g(r) includes the correction for the fact that the available volume increases with r, i.e. includes the Jacobian which accounts for the transformation of Cartesian positions of the molecules to intermolecular distances. The PMF can thus be obtained by combining equations 1--3: The PMF can also be used to calculate the free energy of binding, with a necessary prerequisite being that several corrections be applied. For this, two regions along the reaction coordinate r are defined, V b and V unb which correspond to bound and unbound states, respectively. The lower boundary of V b is simply defined as the minimum distance for which ΔG WHAM (r) is determined. For the upper boundary of V b , the distance for which ΔG WHAM (r) shows a maximum value is chosen. The unbound region V unb is defined as the region between x and y, where x and y are chosen such, that ΔG PMF (r) is flat in this whole region based on visual inspection. The concentration penalty as shown in eq. 5 accounts for the free energy associated with bringing the ligand from V unb , to the volume available at r. 5 S12 !"# Here, V unb is the volume available in the unbound state during the simulations. It can easily be shown that, apart from a constant, the sum of this penalty and ΔG PMF is the same as ΔG WHAM . The free energy difference between the bound and unbound state can now be calculated using: The definition used here for the unbound state is not very general, as it depends on the largest simulated distance. For this reason, a standard state correction is added to ΔG(V b ,V unb ), making the free energy independent of the choice of the unbound state.
Here, V 0 is the standard volume of 1.661 nm 3 . Although the free energy difference is hereby independent on the exact choice of the unbound state, it is still slightly dependent on the definition of the bound state. Scheme S1. Thermodynamic cycle for the calculation of the free energy difference between a molecule Q in water and in methanol. Subscripts W and M refer to the solvents water and methanol, D to the dummy state without partial charges and no Lennard--Jones interactions with the surroundings, N to the neutral state where all partial charges are zero, Q to the state with partial charges.

Free energy from unbound (water) to bound (methanol) state
The free energy of bringing the unbound state from water to methanol, ΔG W-->M (unb) can be obtained from the thermodynamic cycle as shown in scheme S1, where the subscripts W and M denote the solvents water and methanol, respectively, Q denotes a molecule with partial charges, N denotes the molecule with its partial charges set to zero and D denotes a dummy state, where the partial charges and Lennard--Jones interactions of all atoms of the molecule are set to zero, leaving only covalently bound dummy atoms.
The free energy difference between D W and D M is 0, because neither of these states interacts with the solvent. We thus only have to determine the free energy difference between D and Q in both solvents. Since the distance between the sidechain analog and nucleobase is larger than the cutoff of electrostatic interactions in the unbound state, we assume that their contributions to ΔG W-->M (unb) can be calculated from separate simulations. The thermodynamic cycle of scheme S1 is thus calculated separately for individual nucleobases as well as individual amino-acid side chains. The total free energy of changing a neutral, dummy molecule D into a charged, fully interacting molecule Q can be calculated with: 8 where ΔG std is the standard state correction. The raw charging free energy (ΔG raw ) is obtained from TI simulations with 11 equally distributed λ--states between 0 and 1, where λ=0 represents the neutral state and λ=1 the fully charged state. Simulations are 1 ns in length at each λ--state. Integration over <dH/dλ> λ using the trapezoidal rule results in ΔG raw . In order to obtain a method--independent charging free energy, several effects must be accounted for. The charging free energies were obtained from simulations with cutoff truncation, a reaction field correction and periodic boundary conditions, which requires the following correction terms (1-3); A. The type A correction accounts for the error in the solvent polarization caused by the use of effective electrostatic interactions, rather than Coulombic interactions. It can be divided into two parts, where the first one compensates for the neglected interactions that were outside of the cutoff sphere (A 1 ) and the second one compensates for the errors made within the cutoff sphere (A 2 ). B. The type B correction compensates for the error in the solvent polarization due to the finite microscopic size of the system and the periodicity. C. The type C correction accounts for the error in the potential at the ionic site due to an inappropriate summation scheme (C1). D. The type D correction compensates for the inaccurate dielectric permittivity of the solvent model. The formulas for the corrections described below are appropriate for simulations with a non-polarizable force field, periodic boundary conditions (PBC) with cubic boxes, electrostatics with cutoff truncation (CT) and Barker--Watts (BW) reaction field correction with a molecular--based cutoff (BM). Correction A1 can be calculated with: 9 where ε 0 is the permittivity of vacuum, q I the net charge of the 'ion', ε s ' is the permittivity of the solvent model (61.0 for SPC water and 18.6 for methanol) and R C the cutoff radius. The definition of the A 2 correction term is based on continuum--electrostatics calculations, but this numerical procedure can be substituted by an empirical expression as proposed by Reif & Hünenberger (eq 43. in ref (2)), with virtually no loss of accuracy. The empirical function used is:

10
where the optimized fitting coefficients bj, j=1…11 can be found in ref (2) and R I represents the ionic radius. The ionic volume is estimated by measuring the difference in box volume between the neutral state and the dummy state. The ion is assumed to be spherical and R I can thus be determined from V I = 4/3 π R I 3 . The type B correction term is related to the type A corrections and can be expressed as:

11
where μ CT and v CT can again be determined from empirical expression as proposed by ref (2) 12 with d μ,i and d v,i , i=1,2 representing the fitting coefficients as stated in ref (2). The C1 correction term is calculated analytically with:

13
where ε BW is the permittivity assigned to the medium outside of the cutoff sphere (which is ideally set to permittivity of the solvent model ε' s ), Ns is the number of solvent molecules, γ' s is the quadropole moment trace of the solvent model and L is the box length. Assuming that the solvent has a single van der Waals interaction site (which is its molecular center M) and that it is spherical, γ' s can be calculated through the analytical function: 14 where N is the number of atoms in the solvent molecule, q i is the partial charge of atoms i of the solvent and r i is the distance of atom i to the center M. The above mentioned assumptions are valid for the SPC water model and applying eq. 14 results in γ' s =0.0082e . nm 2 . In order to be able to calculate the C1 correction term for methanol, we assume that the oxygen atom is the molecular center of methanol, that methanol looks spherical and rotates isotropically around the oxygen, even though the carbon atom also is a van der Waals interaction site. The effective quadropole moment trace calculated with eq. 14 is then equal to 0.0103e . nm 2 . To test if these assumptions are reasonable for methanol, the charging free energies for ASP in methanol are calculated, once from simulations with ε BW equal to 18.6 (BM scheme) and once without reaction field correction beyond the cutoff sphere (ε BW =1, CM scheme). In the first case we use the effective quadropole moment trace to calculate the C1 correction and in the latter case, no C1 correction term is required (as can be easily seen from eq. 14). After applying all corrections, the difference between the BM and CM scheme was only around 2.5 kJ/mol (results not shown). Taking into account that CM calculations are very crude because of the large cutoff artifacts, we can say that the effective quadropole moment trace of 0.0103 e . nm 2 is able to account for the exclusion potential of methanol very well and we used this value in all further calculations. The D type correction is calculated with: 15 where ε s is the experimentally determined permittivity of the solvent. Here we have used ε s =78.4 for water and ε s =33.0 for methanol. The final charging free energy is thus determined by:

16
Note that the above corrections all scale with the net charge and thus will only contribute to the charging free energies of the charged amino acid side chains and not to the neutral amino acid side chains or nucleobases. The free energy of cavity formation, ΔG D-->N , is again calculated from TI simulations, now with 22 λ--states between 0 and 1, where λ=0 represents the neutral state (all atomic partial charges set to zero) and λ=1 the dummy state. A softcore--potential for the Lennard--Jones interactions with α LJ =0.5 is used to prevent numerical instabilities and the simulations are 1 ns in length at each λ-state. Integration over <dH/dλ> λ using the trapezoidal rule results in -ΔG D-->N . In addition to the corrections for electrostatic interactions, the standard state correction, ΔG std , has to be applied in order to compare to experimental values. This can be done by

17
where R is the ideal gas constant (8.314*10 --3 kJ/mol/K) , T is the temperature (298K), b° is the molality (=1mol/kg), ρ S is the density of the solvent at T (997 kg/m 3 for water and 788 kg/m 3 for methanol) and P° is the pressure (100 kJ/m 3 ). This results in standard state corrections of 7.36 and 7.95 kJ/mol for methanol and water simulations, respectively. Once ΔG D-->Q has been calculated for all amino acid side chains and nucleobases in both water and methanol, the values of ΔG W-->M (unb) can be determined with: and are given in figure S3A and S3B for the amino acid side chains and nucleobases, respectively. The free energy difference between a nucleobase and a sidechain analog in the unbound state in water to the bound state in water is then: where the values of ΔG bind (M) can be found in figure  2B and the final results ΔG W,unb-->M,bound are shown in figure S4C.

Molecular dynamics (MD) simulations
The simulation box lengths used for MD simulations ranged between 3.7 and 4.1 nm and contained anywhere between 1600 and 2300 water molecules. The systems were equilibrated by applying position restraints to the solute (initial force constant of 2.5x10 4 kJ mol --1 ) and generating velocities from the Maxwell--Boltzmann distribution at the temperature of 50 K. After 20 ps of simulation time, the force constant of the position restraints was lowered by a factor 10, whereas the temperature was increased by 50 K. This procedure was continued until a temperature of 250 K was reached. The last equilibration step was performed for 40 ps at the final temperature of 298 K, whereby position restraints were switched off, and instead the center--of--mass translation was removed every 1000 steps. The systems were weakly coupled (4) to an external bath with a temperature of 298 K and a coupling time of 0.1 ps, whereby the solute and solvent degrees of freedom were coupled to the heat bath independently. The pressure was kept constant at 1 atm using isotropic weak coupling with a compressibility of 7.51 x 10 --4 (kJ mol --1 nm --3 ) --1 and a relaxation time of 0.5 ps (4,5). Non--bonded interactions were calculated using a triple range cutoff scheme. Within the short--ranged cutoff of 0.8 nm, all interactions were calculated based on a pairlist which was generated every 5 steps. The interactions between 0.8 and 1.4 nm were calculated with every pairlist update and were kept constant between updates. Interactions beyond 1.4 nm were accounted for by a reaction field contribution with a dielectric permittivity of 61, which is appropriate for SPC water (6). All simulations described above were repeated with methanol as solvent, whereby the cubic boxes were slightly larger (box lengths ranging between 4.1 and 4.9 nm) and filled with anywhere between 1300 and 1800 methanol molecules. The pressure was again kept constant at 1 atm using isotropic weak coupling, but now with a compressibility of 2.08 x 10 --3 (kJ mol --1 nm --3 ) --1 (7,8). The dielectric permittivity used to calculate the reaction field contribution for interactions beyond 1.4 nm was equal to 18.6 as is appropriate for the methanol model used in these simulations (9).

Umbrella sampling
The restraining distances in US ranged between 0.2 and 1.9 nm, with a separation between individual steps of 0.1 nm. The weighted histogram analysis method (WHAM) (10) was used to unbias US simulations and to obtain PMF curves. In order to verify that convergence had been reached, the production runs were divided into 5 ns--long segments and the PMF was determined for each of them. In cases where the binding free energies calculated from these PMFs showed a discrepancy of more than 1.5 kJ mol --1 , the simulations at all windows were S17 prolonged by 5 ns, until convergence had been reached. Following this routine, the pairs GUA--HISB, THY--GLU and URA--GLU were prolonged up to 20, 20 and 15 ns, respectively, for the simulations in water. In methanol, the ADE--ARG, CYT--LYSH, GUA--ASN, GUA--GLU and GUA--HISB pairs were prolonged to a total simulation time of 15 ns or, in the case of CYT--ASP, 20 ns.

Matching between mRNA composition and cognate proteins' nucleobase affinity
The possibility of direct complementary interactions between mRNA and their cognate proteins was investigated based on the human proteome as extracted from the UniProtKB database (April 2013 release). Entries were required to be Swiss--Prot reviewed and have a protein existence annotation of at least "predicted" (i.e. proteins annotated as "uncertain" were excluded). The cross--references section of the UniProtKB entry provided the ENA (European Nucleotide Archive Database, http://www.ebi.ac.uk/ena) accession numbers of the mRNA sequences, from which the first one complying with the length criterion (RNA length = 3x protein length +3) was chosen. This procedure resulted in 16954 protein sequences together with their cognate mRNAs. A binding preference scale was defined for each of the four RNA nucleobases (A, C, G and U) in both environments (water and methanol) based on the binding free energies between amino-acid side--chain analogs and nucleobases. For histidine, the average of the binding free energies for the two neutral forms (HISA/HISB) was chosen. Throughout the text, the scales are designated as, for example, G scl (wat), which refers to the scale of binding free energies of amino--acid side--chain analogs to GUA in water. The Pearson correlation coefficients between the sequence profiles capturing nucleobase content of mRNAs and nucleobase--affinity profiles of their cognate protein sequences were evaluated after applying a sliding--window averaging procedure to each sequence, where the window consisted of 21 side chains or codons. For this analysis, the minimum length of proteins was set to 43 side chains, corresponding to 2w+1 where w is the size of the averaging window. In cases where glycine or proline side chains were present along the protein sequence, these side chains were excluded from the average over the window in question, since the binding free energies could not be determined for these side chains using our current setup. The corresponding codons in the cognate mRNA were consequently also excluded from the window average. These window averages were thus calculated over fewer than 21 side chains/codons. The significance of Pearson correlation coefficients was determined by random shuffling of the affinity scales. Each scale was shuffled 10 4 times and the Pearson correlations coefficients (R) between the nucleobase content of the mRNA and the nucleobase affinity of their cognate proteins were calculated for each of these shuffled scales. The P--values were then determined from the fraction of randomized scales which absolute value of R is greater than the R obtained with the original scale (|R| > |R original |). The significance of the effective free energies of interactions was evaluated by generating 10 6 randomized variants of each native mRNA sequence by exhaustively shuffling the positions of its codons. For each protein, the interaction energy was then not only determined for its cognate, native mRNA but also the shuffled ones. P--values were defined as the fraction of its randomized mRNAs that exhibited lower interaction free energies than the cognate mRNA.