iGNM 2.0: the Gaussian network model database for biomolecular structural dynamics

Gaussian network model (GNM) is a simple yet powerful model for investigating the dynamics of proteins and their complexes. GNM analysis became a broadly used method for assessing the conformational dynamics of biomolecular structures with the development of a user-friendly interface and database, iGNM, in 2005. We present here an updated version, iGNM 2.0 http://gnmdb.csb.pitt.edu/, which covers more than 95% of the structures currently available in the Protein Data Bank (PDB). Advanced search and visualization capabilities, both 2D and 3D, permit users to retrieve information on inter-residue and inter-domain cross-correlations, cooperative modes of motion, the location of hinge sites and energy localization spots. The ability of iGNM 2.0 to provide structural dynamics data on the large majority of PDB structures and, in particular, on their biological assemblies makes it a useful resource for establishing the bridge between structure, dynamics and function.


INTRODUCTION
Several studies in the last decade have drawn attention to the significance of intrinsic dynamics as a major determinant of the mechanism of action of proteins and their complexes (1)(2)(3)(4)(5). Intrinsic dynamics refers to the conformational changes intrinsically favored by the 3-dimensional (3D) structure. These are equilibrium motions that maintain the native fold while allowing for concerted subunit or domain rearrangements (global motions) or for more localized conformational changes such as loop motions or side chain rotations (local motions), often relevant to biological function. These motions underlie the adaptation of biomolecules to their functional interactions and play an essential role in allosteric signaling (6). As a consequence, an important question is to assess which structural elements (e.g. residues, secondary structures, domains or entire subunits) undergo large fluctuations away from their mean positions (i.e. those enjoying high mobility), or which structural elements provide adequate flexibility to enable conformational changes (e.g. hinge-bending sites) that may be relevant to function. Furthermore, it is often of interest to determine which structural elements are subject to strongly correlated (or anticorrelated) motions, toward gaining insights into allosterically coupled regions. The Gaussian Network Model (GNM), introduced almost two decades ago (7,8) has served as an efficient, yet powerful, tool for addressing these questions, supported by the iGNM database (9) and its online computation server (10). GNM provides information on the size of motions of individual structural elements as well as the correlations between the motions of these elements. It has proven useful in a broad range of applications, e.g. for predicting the elastic modulus of protein nanofibrils (11), evaluating the coexistence of stability and flexibility in proteins (12), quantifying entropic contributions to binding free energy (13), assessing the significance of collective dynamics in the mechanochemical activity of enzymes (14), and identifying dynamically coupled domains and interdomain binding sites (15), to name a few. The basic idea behind the GNM is that folded structures, under native state conditions, have access to a spectrum of motions (or modes), which can be delineated by a simple description, an elastic network representation of structure. Adoption of an elastic network model (ENM) permits to take advantage of the established theory and methods of macromolecular statistical mechanics (16). The solid physical foundations as well as mathematical simplicity led to the broad usage of ENMs for efficient and accurate determination of collective dynamics using normal mode analysis (NMA) methods (2,17,18).
A crucial feature of the GNM is its ability to easily decompose the motions into a spectrum of modes, and extract global (low frequency, slow or soft) modes, or local (high frequency, fast or stiff) modes, similar to NMA but in a significantly simpler and more efficient way. The former group of modes usually underlies cooperative functional events including allosteric rearrangements, and the latter relates to energy localization and folding nuclei (19)(20)(21)(22)(23)(24). Agreement between experimental data and GNM predictions consis-D416 Nucleic Acids Research, 2016, Vol. 44, Database issue tently gave support to the utility of the GNM, beside its conceptual simplicity and computational efficiency. Examples of experimental data that have been used in benchmarking GNM predictions include X-ray crystallographic B-factors (25), H/D exchange data (26), NMR data (27), conformational variability derived from the principal component analysis (PCA) of ensembles of structures resolved in different forms for a given biomolecule (28) -protein (5,14,29) or RNA (30,31).
The observed utility of the GNM for identifying dynamically coupled domains led to the development of servers for predicting the hinge sites in biomolecular structures (32,33), building on earlier work for visualizing molecular motions (34). Notable efforts have been made for evaluating and disseminating collective modes of motions using ENMs and/or NMA (35)(36)(37)(38)(39)(40)(41)(42)(43)(44), including the development of ENCoM server (45) for exploring the effect of mutations. Despite all these efforts, the DBs on ENM/NMAbased collective motions have been limited to a few studies such as iGNM (9) and Promode (37,42) DBs. In particular, Promode provides data on for 52 014 Protein Data Bank (PDB) (46) structures using an all-atom NMA in dihedral angle space. However, these are usually limited to single chains, or asymmetric units reported in the PDB; whereas for many applications, and in particular for multimeric systems, the dynamics of the biologically functional form, also called biological assembly (BA), is of interest.
We present in this study an updated version of iGNM, iGNM 2.0. The current version is a substantial advancement over the original iGNM DB developed in 2005. First, the total number of structures for which dynamics data are made available increased from 20 058 in version 1 to more than 100 000. Second, we took advantage of the improved techniques (Ajax, JQuery, HTM5, PHP and Highcharts) that enhanced the security and interoperability of the resource. Third, more results for each entry are reported compared to the earlier version, using interactive molecular viewers and charts. Fourth, iGNM 2.0 provides data not only for proteins, but for practically all types of PDB structures, including the complexes with DNA and RNA molecules or other substrates. Finally, GNM data are provided for the BA in the PDB, after assembling the biologically functional (usually multimeric) structure from the coordinates deposited for the asymmetric unit whenever applicable. The new database now provides access to pre-computed data on the dynamic properties of many supramolecular structures, which may help build plausible hypotheses for further studies.

GNM and mode spectral analysis
In the GNM, the nodes are identified by the ␣-carbons (of amino acids for proteins) and P, C4' and C2 atoms (of nucleotides for DNA/RNA molecules) ( Figure 1); and the springs are placed between all pairs of nodes/residues within a first inter-residue coordination shell in folded structures -identified to be r c ≈ 7.0 -7.5Å for folded proteins (47). The connectivity of the network is defined by the Kirchhoff matrix, . The off-diagonal elements of are ij = ji = −1 if nodes i and j are within r c , and zero otherwise; and the diagonal elements represent the coordination numbers (or degrees) of the residues (nodes), found from ii = − j ij where the summation is performed over all elements j, j = i. Knowledge of completely defines the network topology, and permits us to evaluate the intrinsically favored (or natural) modes of motion (relaxation) uniquely accessible to the structure. The ms fluctuations of residues where R i is the change in the position vector of node/residue i) directly scale with the diagonal elements of −1 ; and the cross-correlations between residue fluctuations scale with the off-diagonal elements, i.e.
The proportionality constant is 3k B T/γ , where k B is the Boltzmann constant, T is the absolute temperature and γ is the force constant assumed to be uniform for all springs in the network. The value of γ does not alter the 'distribution' of fluctuations nor does it affect the orientational crosscorrelations The fluctuation profile and the above cross-correlations are obtained without any parameters. Agreement with experiments without any adjustable parameter is the major strength of the GNM. Because the rows/columns of are not independent, −1 is the pseudo inverse obtained as where the summation is performed over the N-1 nonzero eigenvalues λ k of and the corresponding eigenvectors u k . The eigenvector u k represents the normalized distribution of displacements for the N nodes along the principal/normal (mode) axis k, and the eigenvalue λ k scales with the square frequency of the fluctuations along this axis. The contribution of mode k to ms fluctuations or cross-correlations scales with λ k −1 such that the lowest frequency mode (k = 1, λ 1 ≤ λ 2 ≤ . . . . λ N-1 ) makes the largest contribution. Details on the derivations of GNM equations can be found in our previous work (48).

Data set
All the structures deposited in PDB as of June 30, 2015 were downloaded (109 457 of them) (46). For each of NMR structures, the first model among those deposited in the PDB, was used in GNM calculations. Likewise, the first BA files were used for those having multiple BA records in PDB. Structures containing less than 12 nodes or more than ∼20 000 nodes as well as those having data for C ␣ -atoms only were filtered out, which led to 107 201 PDB files. The size and shape distributions of these structures are shown in coordinates) (15). The iGNM 2.0 therefore contains information on the dynamics of biological assemblies of up to 2 × 10 4 residues, and up to an axial ratio of ∼100.

Inputs: query and searching functions
The iGNM 2.0 offers two options for searching the database. The first is to directly enter the PDB 4-letter id. The second is an advanced query function that permits users to search the database using one or more properties, such as the experimental method, the resolution of the structure (if applicable), the structure name (or title word), an author name, the release date, the residue count or molecular weight. Users may also search by entering dynamic features such as degree of collectivity of the GNM modes. The user is then directed to a relational table that includes all the PDBs entries that match the search. These entries can be sorted by features such as residue count or resolution. The relational tables can be exported as plain text file (tsv or csv format) or Excel file (xls or xlsx format).

RESULTS
The Results page contains a J(s)mol window (on the left) illustrating the investigated structure (or its biological assembly, if applicable) color-coded by the square displacements of residues, and a panel of results (on the right). The panel contains seven clickable items described below, by way of an example, e.g. DNA/AlkB family demethylase complex (PDB id: 4NIH) (49).  Table 1 lists the correlation coefficients between experiments and theory averaged over all 97 959 PDB structures resolved by X-ray in our data set. Results are presented for different subsets of PDB structures: Subset S1 refers to the cases where the PDB structure accessible by default (also called asymmetric unit, Asym) is identical to the BA. Those in subset S2 are not. They consist of two groups: S2B where the BA is constructed by assembling multiple copies of the default structure reported in the PDB, and S2A where the BA is a part of the default structure. Note that the consideration of the entire BA constructed by assembling multiple copies of the asymmetric units is essential to obtaining a higher correlation with experiments (see subset S2B, last row in Table 1). Figure 2 panels C-F provide more information on the correlation coefficients obtained for BAs (dashed bars) and Asym (gray bars). Supplementary Figure S1 shows the same results for the 97 959 X-ray structures included in the iGNM 2.0. We note that the agreement with experiments improves with decreasing asymmetry (or axial ratio) and increasing size. 1 and 2, by default), as illustrated in Figure 3A. The lower half of the page displays the mode shapes (i.e. square displacements of residues driven by a given mode, plotted as a function of residue index). The ribbon diagram color code and mode shape for mode k are obtained from the diagonal elements of [C] k (see Equation 4). Results for both slow/soft modes and fast/stiff modes can be viewed. In the former case, the residue motions are (usually) uniformly-distributed across the structure (the modes are highly collective); in the latter, a number of sharp peaks appear in the mode shape (the modes are highly localized). Panel B in Figure 3 illustrates such selected modes.
(iii) Domain separations by dynamics (3D/2D). Each residue i moves in either the positive or negative direction along a given mode axis. The direction along mode k is given by the sign (+ or −) of the i th element of u k (each element corresponding to a residue or node). The subsets of residues moving in opposite directions are said to undergo anticorrelated movements in mode k. Each mode thus separates the structure into two subsets of residues that move in opposite directions (colored red and blue in Figure 3C). Note that in the global modes, residues in a given subset are spatially contiguous (they form coherent domains/subunits, etc.); whereas in the higher modes, they consist of multiple, more localized elements. Residues at the crossover regions between + and − directions define the interfaces between the anticorrelated domains in the global modes. The interface often includes a global hinge sites that plays a key mechanical role in enabling the relative movements of the domains. Likewise, key chemical residues (e.g. catalytic residues) whose precise (fixed) positioning is essential for activity usually lie at such interfacial regions, and as such they undergo minimal (if any) displacement in these modes minima (14). (iv) GNM connectivity model (3D/2D) page displays the topology of the network as an interactive 3D network model ( Figure 1) or a 2D representation similar to a contact map. (v) Cross-correlations (3D/2D) page displays the orientational correlations (C ij orient ) between pairs of nodes, for the user-selected mode. Two maps are simultaneously displayed, with the second permitting to focus on selected regions of the former. Maps for customized subsets of modes can be calculated using the online calculation engine for N < 1000 nodes. Figure  3D illustrates the map for demethylase complex, based on the three slowest modes. The colors distinguish the correlated (red) and anticorrelated (blue) pairs of residues.  a Results are reported as average correlations for the indicated subsets ± standard error. b In the subset S1, the BA is identical to the default structure accessible at the PDB; Subset S2 consists of two subgroups, S2A and S2B; in S2A, the BA is a part of the default PDB (the asymmetric unit); in S2B, it is assembled from multiple copies of the whole/part of the default PDB using the transformation matrices reported in the PDB.
(vi) Collectivity (2D) for a given mode k is a measure of the degree of cooperativity (between residues) in that mode, defined as (50) where, k is the mode number and i is the residue index. A larger collectivity value refers to a more distributive mode and vice versa. Usually soft modes are highly collective. Collectivity values are reported for soft modes that account for 1/10 of the overall dynamics. (vii) Results in plain text. All GNM results accessible in the above six output pages can be downloaded via HTTP, for further analysis and alternative visualizations. Modes at the low frequency end of the spectrum (the most favorable modes from energetic point of view) up to 40% of the spectrum are stored and can be retrieved for each PDB entry.

Database architecture of iGNM 2.0
The architecture of iGNM 2.0 is sketched in Figure 4 and detailed in its caption. Further information about technology used in visualization module can be found in the Supplementary Text S1. We want to particularly mention that our database is regularly updated by subroutines that identify newly added PDB files and GNM computations are subsequently performed by an off-line GNM engine to catch the pace of rapidly growing PDB (averaging ∼30 new structures per day in 2014

DISCUSSION AND CONCLUSION
The iGNM 2.0 is a significantly enhanced version of the database iGNM published a decade ago. The original database found broad usage and utility in investigating the equilibrium dynamics of structures deposited in the PDB. The new version will further facilitate its usage, now containing data on more than 95% of the current PDB content, in addition to offering a user-friendly interface with advanced 3D and 2D visualization and analysis capabilities. A unique feature of iGNM 2.0 is the accessibility of data for BAs, rather than the single chains or asymmetric units, thus providing insights into the dynamic properties of biologically functional entities. Note that the BAs may be very different from the asymmetric units both structurally and dynamically. The differences between GNM-predicted and X-ray crystallographic (PDB-reported) B-factors may originate from artifacts such as crystal contacts between replica on adjacent lattice sites of the crystal, which would lead to lower B-factors than those predicted (by the GNM) for the isolated molecule (51)(52)(53). Conversely, calculations performed for the asymmetric unit would miss the effect of inter-subunit contacts and depart from the B-factors that are reported for the BA. The subset S2B (Table 1) represents the latter case: there is a considerable improvement (from 0.513 to 0.610) in the correlation with experimental data, when the B-factors are computed for the entire BA composed of multiple subunits (and not for the asymmetric unit only, which would be retrieved as the default structure in the PDB). Exploring of the dynamics of the BA itself is therefore essential to extracting biologically meaningful results. Supplementary Figure S2 illustrates this feature for a voltage-gated sodium channel. The most collective (softest) modes of motions of BAs often underlie allosteric or largescale cooperative rearrangements of entire subunits or domains.
In previous work, different representations have been adopted for ENM nodes, e.g. Setny and Zacharias proposed the center of the ribose sugar in the backbone to be the best site for the nucleotide ENM node (54). Good agreement with experimental data on nucleotide-containing structures (DNA/RNA and their complexes) has been obtained in iGNM 2.0 by adopting a 3-node representation for each nucleotide, the nodes being placed at the sugar, the base and the phosphate groups. This representation has recently proven to accurately reproduce the principal modes sampled by microseconds simulations (31).
The present structural-proteome scale analysis clearly shows that the agreement of GNM predictions with experiments improves with the size of the investigated structure. This property became clear here by performing systematic computations for large structures and BAs. A close look at the correlations with experimental B-factors, in the range N < 1400, is presented in Supplementary Figure S3 panel A for three different subsets listed in Table 1. Another feature worth noting is that the GNM usually yields more accurate results for globular structures. We can see in Panel B the decrease in correlation with increasing axial ratio.
Examination of structures of even 10 4 residues showed that the accuracy of the results did not decrease with increasing size. In particular, we noted that the current iGNM 2.0 computations for the 14 PDB structures deposited to date in the PDB for various forms of the ribosome (30S, 40S or 70S subunits, complexed with different proteins) led to correlation coefficients of ∼0.7 with experimental data on residue fluctuations (see Supplementary Figure S4) in addition to indicating slow modes and cross-correlations consistent with experimentally observed rachet-like mechanism.
Large structures/assemblies are actually the most challenging systems for molecular dynamics simulations, and most simulations for systems of the order of 10 3 residues are limited to short durations, far from sampling the collective dynamics or cross-correlations that cooperatively involve the intact structures. In this respect the iGNM 2.0 is distinguished as a resource that provides information on the collective dynamics of this challenging set of structures.

SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.