Calculation of accurate interatomic contact surface areas for the quantitative analysis of non-bonded molecular interactions

Abstract Summary Intra- and intermolecular contact surfaces are routinely calculated for a large array of applications in bioinformatics but are typically approximated from differential solvent accessible surface area calculations and not calculated directly. These approximations do not properly take the effects of neighboring atoms into account and tend to deviate considerably from the true contact surface. We implemented an extension of the original Shrake-Rupley algorithm to accurately estimate interatomic contact surface areas of molecular structures and complexes. Our extended algorithm is able to calculate the contact area of an atom to all nearby atoms by directly calculating overlapping surface patches, taking into account the possible shielding effects of neighboring atoms. Here, we present a versatile software tool and web server for the calculation of contact surface areas, as well as buried surface areas and solvent accessible surface areas (SASA) for different types of biomolecules, such as proteins, nucleic acids and small organic molecules. Detailed results are provided in tab-separated values format for analysis and Protein Databank files for visualization. Direct contact surface area calculation resulted in improved accuracy in a benchmark with a non-redundant set of 245 protein–DNA complexes. SASA-based approximations underestimated protein–DNA contact surfaces on average by 40%. This software tool may be useful for surface-based intra- and intermolecular interaction analyses and scoring function development. Availability and implementation A web server, stand-alone binaries for Linux, MacOS and Windows and C++ source code are freely available from http://schuellerlab.org/dr_sasa/. Supplementary information Supplementary data are available at Bioinformatics online.


1.1
Calculation of the solvent accessible surface area of a molecule For the calculation of the solvent accessible surface area or SASA, the Shrake and Rupley algorithm was used (Shrake and Rupley, 1973). It consists in generating a spherical cloud of points for each atom, where each point is at a distance equivalent to the van der Waals (vdW) radius plus the radius of a water molecule (1.4 Å by default) from the atom center. Each point represents a surface segment whose size is equivalent to the total surface divided by the amount of points. After excluding the points that are located inside the volume of the point clouds of other atoms, it is possible to obtain the accessible surface area by counting the remaining points and multiplying by the surface area they represent. The point cloud must have distances between points as equivalent as possible in the spherical plane to remove inaccuracies in the surface calculation caused by the uneven distribution of points. This was solved by approximating the surface of a sphere by the tessellation algorithm implemented in the software Thomson Applet (Saff and Kuijlaars, 1997;Cecka et al., 2007) ( Figure S1). Briefly, the algorithm initially generates points on a sphere at random positions. Then, a gradient descent algorithm that simulates each point as an electrical charge constrained to move on the spherical surface is executed to subsequently optimize the position of these points. The optimization ends when the variance of the distance between points no longer decreases. We precalculated point clouds of a unit sphere with 15,092 points.
To optimize the search for interacting atoms, only those atoms that have their centers at a distance less than two times the sum of their van der Waals radii plus two times the van der Waals radius of a water molecule (1.4 Å) are considered as potentially interacting.

1.2
Calculation of contact surface areas To calculate the contact surfaces between two atoms, we extended the original Shrake-Rupley algorithm (Shrake and Rupley, 1973). In our modification, new variables were introduced to store the additional information of which atoms bury a certain surface point of another atom ( Figure S2). With this information it is possible to find groups of points of an atom that are buried by other unique groups of atoms, identifying all unique groups. Our algorithmic modifications include the structures pBuriedBy and AreaBuriedBy that allow to store the information of which points in the surface of an atom are buried by which other groups of atoms. The latter variable contains the list of unique set of atoms and the number of points that each of them buries ( Figure S2). When applied to the simplest example of a three-body system, the algorithm proceeds as follows ( Figure S3): The surface of body 1 buried by body 2 corresponds to the section in red, plus the purple section. Since the purple area was counted twice, it is divided by two for each atom. In general, the value of the buried surface area that a body B causes to a body A is equivalent to the sum of all overlaps divided by the number of atoms that participate with the group of body B. In the example of Figure S3, the contact surface area of body 1 that is buried by body 2 is equivalent to the section in red plus half of the purple section. This normalization of shared contact surfaces by the number of interacting atoms ensures that the sum of all contact surfaces is equal to the buried surface area calculated by the original Shrake-Rupley algorithm. Figure S1. Sphere surface approximation calculated by the Thomson Applet software. In this example a sphere was subdivided in 1,000 equivalent areas. Each subsurface is represented by a center point (not shown). In practice, dr_sasa works with a precalculated unit sphere with 15,092 points.  In this three-body example, the body 1 has its surface buried by bodies 2 and 3. Each point on surface of 1 represents a value equivalent to (total surface)/(number of points). The points in red represent the surface buried only by body 2, the blue points are buried only by 3, and the purple points are buried by bodies 2 and 3. In the outlined boxes in the center part of the figure, representations of maps of buried points to list of atoms are shown, which corresponds to variable pBuriedBy in the algorithm described in Fig S2. The outlined box on the right-hand side corresponds to the variable AreaBuriedBy of our algorithm and maps unique groups of atoms to lists of surface points, which correspond to contact surface areas.

1.3
Definitions of van der Waals radii Structures provided in PDB and Mol2 formats use different definitions of vdW radii. For PDB files the vdW radii definition is equivalent to the radii of the popular software tool NACCESS (Hubbard and Thornton, 1993;Chothia, 1975) and is provided in Table S1 and Table S2. Structures provided in the Mol2 format are assigned vdW radii according to the SYBYL atom types contained in Mol2 files. This vdW radii definition is equivalent to the one used by the molecular modeling program UCSF Chimera (Pettersen et al., 2004) based on the data of Tsai et al., 1999 (Table S3). vdW radii of ions were extracted from the CRC Handbook of Chemistry and Physics (Lide, 2001) for both PDB and Mol2 files, as in UCSF Chimera (Pettersen et al., 2004) (Table S4). Ions are assumed to be in their prevalent coordination number. Hydrogen atoms are ignored in all calculations. User-defined tables of vdW radii may be provided as plain text files as exemplified in the online documentation of dr_sasa (command line switch: -v).

1.4
Program usage This program receives inputs, sets its operation mode through the command line, and outputs its results as text files. The output tables are separated by tabs, facilitating the import of the data to most common spreadsheet software or easy parsing with custom user scripts. The standard mode of operation or operation mode 0 (command line switch: -m 0) calculates only the Solvent Accessible Surface Area (SASA) of the input molecules, where the output is a PDB file with the SASA of the atom (in Å 2 ) inserted in the B-factor column. The contact surface area mode of operation (mode 1, command line switch: -m 1) requires the selection of the chains to be considered as separate objects, or if the PDB or Mol2 file contains different types of biomolecules (protein and nucleic acids, protein and ligands, or any combination of the three) the program will automatically identify the separate objects and calculate their interactions. In this mode the output is presented as two types of matrices. The first type is a NxN matrix where N is the total number of atoms in the structure, and the sum of each column is equal to the total buried surface area of an atom. Theses sums are also saved in the B-factor column of a separate PDB output file for visualization. The second type of matrices are LxM sized, where L is the number of atoms of the first selected chain or molecule type and M is the number of atoms of the second selected chain or molecule type. These latter matrices come in pairs of files, where each value of the first matrix corresponds to the buried surface that chain A causes to B and the inverse is true for the second matrix. Output matrices for all possible pairs of defined or identified objects will be written. All these matrices are generated twice, at two levels: per atom and per residue. In addition, to generate intramolecular surface-based contact maps, all residues (command line switch: -m 2) and atoms (command line switch: -m 3) can be considered as objects. In these two operation modes 2 and 3, residues/atoms are compared all-against-all by first isolating a pair of residues/atoms and then calculating their contact surface. Modes of operation 1, 2, and 3 take only solvent exposed atoms into consideration. The last mode of operation (mode 4, command line switch: -m 4) calculates intermolecular contact surface areas without requiring that the contact surfaces are solvent accessible. This is especially useful for internal and deep ligand binding cavities with low solvent accessibility (see example from Fig. 1b in the main manuscript). The SASA per atom and BSA per atom are further classified according to the chemical nature of the corresponding atom into protein backbone/side chain, DNA backbone/base and polar/hydrophobic. These data are provided in two additional output files (.atmasa and .datmasa for SASA and BSA, respectively).

1.5
Generation of graphical contact map plots Graphical surface-based contact map plots ( Figure S4) can be generated with a python script freely available from our web site. Required inputs for the script is a *.by_atom.tsv or *.by_res.tsv matrix file and a corresponding .atmasa file (see Table  S5 for a description of the output files and their content). The contact maps may be generated per residue, per atom or mixed (per residue for proteins and per atom for small ligand molecules). Each cell of the contact map corresponds to the contact surface area (CSA in Å) of the atom/residue on the X axis, caused by the atom/residue on the Y axis. The bar plot on the X axis indicates relative BSA: BSA is calculated as the column sum of CSA, which is subsequently normalized by the SASA of the atom/residue on the X axis (BSArel = BSA/SASA).

1.6
Web server A web server to run dr_sasa is freely available from http://schuellerlab.org/dr_sasa/. The web server is able to run dr_sasa in all modes of operation described above and generates a compressed ZIP file for download, containing all output files. For user convenience, the web server does also generate contact map plots by default ( Figure S4).

1.7
Output files Pair of matrices containing the information of the CSA that atoms of object B cause to object A and vice versa, respectively. This information is indicated in the header, with the format "(obj_A)<-(obj_B)". The arrow indicates that the matrix contains the CSA of atoms of A, which are buried by B. The buried object is always placed first and corresponds to the X axis. The sum of all CSA values per column (column sums) corresponds to the BSA of the first object. The objects can be either molecular types, chains, or chain combinations. Pairs for ALL possible combinations will be generated. Summing all values equals the total buried surface area for the object indicated in the header. Created in operation modes 1 and 4.

by_res.tsv
NxN files containing the same information as the matrix pairs, where N is equal to the total number of atoms in the structure or residue, respectively, within the selected chains. Summing the columns of this file is equal to the BSA of the atom in X axis in operation modes 1 and 4. Please note that this sum has no meaning for operation modes 2 and 3. Generated in all operation modes except mode 0.
<structure file name>.overlaps File with details about all subsurface burial information. Deprecated, may be removed in upcoming software versions. Generated in all operation modes except mode 0. Optional for operation modes 2 and 3. Figure S4. Surface-based contact map of a protein-ligand complex. The map represents the serine protease factor Xa bound to the small molecule inhibitor rivaroxaban (PDB ID 2w26). The figure was generated with our provided python script and it is equivalent to those images generated by the web server.

Feature comparison
Notes: Y, yes, supported by default; U, requires custom user settings; CSA, contact surface area; BSA, buried surface area.

1.9
Protein-DNA validation dataset A non-redundant set of 245 protein-DNA complexes was obtained from our Protein-DNA Interface Database (PDIdb; Norambuena and Melo, 2010) and prepared as published before (Ribeiro et al., 2015). Briefly, the amino acid sequences of the protein chains of 922 protein-DNA interface complexes were clustered with the computer program BLASTClust (Altschul et al., 1990), according to a length coverage threshold of 90% and sequence identity of 70%. The resulting set is non-redundant in terms of the protein sequences and is available from our web site at http://melolab.org/pdidb/. A single structure (PDB code: 1qpi) was excluded from the analysis due to errors in the PDB file. All PDB files were processed with a script to remove atoms/residues/chains with alternative locations (altloc), keeping the location of higher occupancy, or in case of same occupancy, keeping the first position in sequential order.

Protein-ligand validation dataset
A protein-ligand dataset derived from PDBbind was prepared (Liu et al., 2017). We used 290 protein-ligand complex structures defined by the authors as the 'core set', last updated November 2016. The core set is a non-redundant collection of protein-ligand complexes for which experimentally measured binding affinity data (Kd, Ki, and IC50) are available. Ligands include small organic compounds and short peptides. Solvent atoms and atoms with alternate locations were removed, and the files were saved in the PDB format, and in addition in the Mol2 format with help of PyMOL (The PyMOL Molecular Graphics System, 2018).

Runtime comparison
dr_sasa is suitable for batch processing. SASA calculations for 290 protein-ligand complexes took 2.1 minutes on a 16thread x86 notebook computer (AMD Ryzen 7 1700 @ 3.2 GHz), equivalent to 0.4 seconds per structure (3437.7  2587.5 atoms per structure; Table S7). In addition, we have employed dr_sasa to calculate SASA and the interface contact surface area for 10,000 snapshots (PDB structures) extracted from a molecular dynamics trajectory of an "MarA-micF" protein-DNA complex (similar to PDB 1bl0; 1830 atoms). SASA calculations for the 10,000 snapshots took 36 min. on the 16thread x86 notebook computer, equivalent to 0.2 s per structure. CSA calculations on the same dataset took 214 min. (1.3 s per structure).

Improved accuracy of contact surface area calculations by our modified Shrake-Rupley algorithm
Contact surface areas may be estimated approximately by calculating differences in SASA of artificially rearranged molecular objects. This approximate solution was implemented in the web server DNAproDB (Sagendorf et al., 2017) and our own software tool PDIviz for the analysis of protein-DNA complexes (Ribeiro et al., 2015). Here, we show that these approximations are rather rough and produced an average relative difference of -40% compared to dr_sasa's direct calculation of interatomic contact surface areas (CSA).
To approximate CSA of protein atoms in contact with different DNA regions (e.g. protein surface buried by DNA bases), the following approach was implemented, which is based entirely on SASA calculations. First, an artificial protein-DNA complex is created by removing the atoms corresponding to certain DNA regions (bases, BB, major/minor groove) and its SASA is calculated. Next, SASA of the unmodified complex is calculated and subtracted from the SASA of the modified complex to give an estimate of CSA. Specifically: However, this is an approximate method and direct CSA calculation by our modified Shrake-Rupley algorithm should be more accurate. We determined CSA for protein atoms in contact with four different regions of the DNA double helix with dr_sasa (CSA mode 1) and compared these calculations with the results of the above approximate method estimated with (i) DNAproDB (Sagendorf et al., 2017), which employs the software tool FreeSASA internally (Mitternacht, 2016), and with (ii) dr_sasa (SASA mode 0). We report the absolute ( Figure S5, Figure S6, Table S8, and Table S9) and relative differences ( Figure S7, Figure S8, Table S10, and Table S11) per structure.
We interpreted the results of this section as increased accuracy of dr_sasa due to its ability to calculate CSA directly from surface overlaps. Another possible explanation for the observed relative difference of -40% would be an error in dr_sasa's CSA calculations. As a control experiment, we further demonstrate that all individual CSA's sum up to the total buried surface area (BSA), as expected (section 2.2). As BSA is typically estimated as differential SASA, we also show, as further control experiments, that SASA calculated by dr_sasa is similar to the values calculated by several other software tools (sections 2.3 and 2.4).        , where Diffrel(CSAi) denotes the difference (possibly signed) of a CSA estimate of structure i. N is the total number of differences and std. dev. refers to the standard deviation.

Validation of buried surface area calculations
In the previous section we show that dr_sasa's direct calculation of CSA is more accurate than estimating the contact surface from differential SASA. However, another possible explanation for the observed relative difference of -40% would be an error in dr_sasa's CSA calculations. As a control experiment we demonstrate here that all individual CSA's sum up to the total buried surface area (BSA), as expected.
The buried surface area (BSA) of a molecule i forming a molecular complex is typically calculated by subtracting the SASA in the free state from the complexed state: However, dr_sasa calculates BSA as the sum of all individual contact surface areas (CSA) of all atoms participating in the interaction as: where n is the number of interacting atoms and CSAi is the contact surface area of atom i. Our normalization scheme of contact surfaces shared between several atoms, as explained in section 1.2, ensures that this sum is equal to total BSA. This claim was validated on our protein-ligand dataset by calculating the difference between both methods per atom ( Figure S9 and Table S12) and per structure ( Figure S10 and Table S13). As expected, the average difference is zero with minor deviations due to limits in numerical floating point precision.  , where Diff(BSAi) denotes the difference (possibly signed) of a BSA estimate of atom i. N is the total number of differences and std. dev. refers to the standard deviation. All reported differences were zero with a precision of 3 decimal numbers. where Diff (BSAi) denotes the difference (possibly signed) of a BSA estimate of structure i. The protein-ligand dataset was employed. Box plots were drawn as in Figure S5. where Diff (BSAi) denotes the difference (possibly signed) of a BSA estimate of structure i. N is the total number of differences and std. dev. refers to the standard deviation.

Validation of solvent accessible surface area calculations for protein-ligand complexes
In section 2.1 we have shown that direct calculation of CSA is more accurate than relying on rough differential SASA estimates, and in section 2.2 we have demonstrated that these results are not likely caused by an error in dr_sasa, as all individual atomic CSA's indeed summed up to the total BSA.
Here, we further show that dr_sasa's calculation of plain SASA is similar to the values obtained by the software tools NACCESS (Hubbard and Thornton, 1993), MSMS (Sanner et al., 1996), and FreeSASA (Mitternacht, 2016). SASA was calculated for 290 protein-ligand complexes provided in the PDB format with the three software tools and the difference in surface area to dr_sasa was analyzed per atom ( Figure S11 and Table S14) and per structure ( Figure S12 and Table S15). The average difference per structure compared to NACCESS was -0.699  12.015 Å (mean  standard deviation) and was 2.265  12.584 Å for FreeSASA. However, for MSMS we obtained a large difference of -130.691  316.274 Å 2 per structure. As an explanation, we observed that in some cases MSMS determined a zero value for SASA for atoms which were solvent exposed as by dr_sasa/NACCESS/FreeSASA. In Table S16 we compared total SASA of 13 protein-ligand complexes calculated by all four software tools. These 13 protein-ligand complexes had large MSMS vs. dr_sasa differences (< -500 Å 2 ). 95% of these can be explained by atoms for which MSMS calculated an exposed surface of 0 Å 2 but were detected solvent exposed by dr_sasa. We conclude that these large deviations are a peculiarity of MSMS, as results of dr_sasa, NACCESS and FreeSASA were similar. In addition, these large deviations were outliers related to very large structures, as we obtained a low relative difference of < 1% per structure ( Figure S13 and Table S17). Figure S11. Comparison of SASA values of dr_sasa and three other SASA tools, per atom. The protein-ligand dataset was employed. We observed that MSMS determined a zero value for SASA for some atoms which were solvent-exposed, hence the large number of outliers. Differences were calculated as where Diff(SASAi) denotes the (possibly signed) difference of a SASA estimate of atom i. Box plots shown here were drawn as in Figure S5.  where Diff(SASAi) denotes the (possibly signed) difference of a SASA estimate of structure i. Box plots shown here were drawn as in Figure S5.   , where Diffrel(SASAi) denotes the relative difference of a SASA estimate of structure i. Box plots were drawn as in Figure S5. , where Diffrel(SASAi) denotes the relative difference of a SASA estimate of structure i. N is the total number of differences and std. dev. refers to the standard deviation.

2.4
Validation of solvent accessible surface area calculations for protein-DNA complexes We further validated plain SASA calculations for 245 protein-DNA complexes with dr_sasa and NACCESS. We report the difference in surface area per atom ( Figure S14 and Table S18) and per structure ( Figure S15 and Table S19). We obtained a low overall difference of -0.074 ± 1.804 Å 2 (mean ± standard deviation) per structure.

2.5
Effect of different vdW radii definitions on SASA calculation Structures provided in PDB format use a different vdW radii definition than structures provided in Mol2 format (c.f. section 1.3). To estimate the effect of the different vdW radii definitions, we calculated SASA for our protein-ligand set provided as PDB files and Mol2 files (converted from PDB by PyMOL) and calculated the difference by atom ( Figure S16 and Table  S20) and structure ( Figure S17 and Table S21), and the relative difference per structure ( Figure S18 and Table S22). We observed a low relative difference of 1% per protein, and a relative difference of 10% per ligand. This is an expected result, since the atom typing and vdW radii used for the Mol2 format allows for a more fine-grained description of ligand atoms.     , where Diffrel(SASAi) denotes the (possibly signed) relative difference of a SASA estimate of structure i. Box plots were drawn as in Figure S5. The structures of the protein-ligand set were provided either in Mol2 or in PDB format. Values were calculated as ( ) = (( 2 + 1) − ( + 1)) ( + 1) ⁄ , where Diffrel(SASAi) denotes the (possibly signed) relative difference of a SASA estimate of structure i. N is the total number of differences and std. dev. refers to the standard deviation.

2.6
Calculation of CSA without requiring that the contact surfaces are solvent accessible Surface-based interaction analysis of protein-ligand complexes is complicated by the fact that many ligand binding sites are deep cavities that may be poorly accessible by the rolling ball approximation. In these cases, the actual protein-ligand contact surface area will be underestimated if the binding pockets are poorly solvent exposed. We therefore implemented a variation of our CSA calculation, in which atoms are not required to be solvent accessible (mode 4 of dr_sasa). We compared the area difference of this mode against solvent exposed CSA calculations (mode 1 of dr_sasa) on our proteinligand dataset. Results are reported per atom ( Figure S19 and Table S23) and per structure ( Figure S20 and Table S24). Our results indicate that the relative difference in BSA without the requirement that atoms were solvent exposed was, on average, 10 times larger per protein structure compared to BSA of mode 1 ( Figure S21 and Table S25). of a BSA estimate of atom i. Box plots were drawn as in Figure S9.  of a BSA estimate of structure i. Box plots were drawn as in Figure S5. N is the total number of differences and std. dev. refers to the standard deviation. The protein-ligand dataset was employed. Figure S21. Comparison of BSA without and with the requirement that atoms are solvent exposed, relative difference per structure. The proteinligand dataset was employed. Values were calculated as ( ) = (( 4 + 1) − ( 1 + 1)) ( 1 + 1) ⁄ , where Diffrel (BSAi) denotes the relative difference of a BSA estimate of structure i between both modes. Box plots were drawn as in Figure S5. Table S25. Comparison of BSA without and with the requirement that atoms are solvent exposed, relative difference per structure. Values were calculated as ( ) = (( 4 + 1) − ( 1 + 1)) ( 1 + 1) ⁄ , where Diffrel(BSAi) denotes the relative difference of a BSA estimate of structure i between both modes. N is the total number of differences and std. dev. refers to the standard deviation. The protein-ligand dataset was employed.