Shared Signature Dynamics Tempered by Local Fluctuations Enables Fold Adaptability and Specificity

Abstract Recent studies have drawn attention to the evolution of protein dynamics, in addition to sequence and structure, based on the premise structure-encodes-dynamics-encodes-function. Of interest is to understand how functional differentiation is accomplished while maintaining the fold, or how intrinsic dynamics plays out in the evolution of structural variations and functional specificity. We performed a systematic computational analysis of 26,899 proteins belonging to 116 CATH superfamilies. Characterizing cooperative mechanisms and convergent/divergent features that underlie the shared/differentiated dynamics of family members required a methodology that lends itself to efficient analyses of large ensembles of proteins. We therefore introduced, SignDy, an integrated pipeline for evaluating the signature dynamics of families based on elastic network models. Our analysis confirmed that family members share conserved, highly cooperative (global) modes of motion. Importantly, our analysis discloses a subset of motions that sharply distinguishes subfamilies, which lie in a low-to-intermediate frequency regime of the mode spectrum. This regime has maximal impact on functional differentiation of families into subfamilies, while being evolutionarily conserved among subfamily members. Notably, the high-frequency end of the spectrum also reveals evolutionary conserved features across and within subfamilies; but in sharp contrast to global motions, high-frequency modes are minimally collective. Modulation of robust/conserved global dynamics by low-to-intermediate frequency fluctuations thus emerges as a versatile mechanism ensuring the adaptability of selected folds and the specificity of their subfamilies. SignDy further allows for dynamics-based categorization as a new layer of information relevant to distinctive mechanisms of action of subfamilies, beyond sequence or structural classifications.


Introduction
Studies in recent years have established the role of structural dynamics, also called intrinsic dynamics, in facilitating, if not driving, the interactions and function of biomolecular systems in the cell. Many biological events, including substrate recognition, binding and transport, allosteric signaling, communication and regulation, and mechanochemical responses, shortly referred to as protein actions, take advantage of the proteins' intrinsic dynamics . Intrinsic dynamics refers to collective thermal fluctuations in the conformational space, uniquely defined by the 3D architecture, or fold, under physiological conditions. Among the spectrum of motions intrinsically accessible to a structure, the modes of motions with the lowest frequency, called global modes, are often distinguished by their cooperativity and easy accessibility, hence their involvement in allosteric responses (Townsend et al. 2015), and qualification as soft modes.
Rapid evaluation of intrinsic dynamics with the help of elastic network models (ENMs) introduced around the turn of the century (Tirion 1996;Bahar et al. 1997;Hinsen et al. 2000;Atilgan et al. 2001) has enabled a deeper understanding of the functional significance of global motions (Tama and Sanejouand 2001;Ma 2005;Delarue 2008;Zheng et al. 2009;Fuglebakk et al. 2012Fuglebakk et al. , 2015Tirion 2015;Hsieh et al. 2016;Lopez-Blanco and Chacon 2016). ENMs present the important advantage of yielding a unique analytical solution for the collective dynamics of each structure, thus overcoming the sampling inaccuracies or computational time/memory limitations of conventional molecular dynamics simulations (Dror et al. 2012;Luitz et al. 2015;Bottaro and Lindorff-Larsen 2018;Srivastava et al. 2018), and lending themselves to large-scale analyses of ensembles of proteins. ENM-based studies revealed a close correspondence between the structural changes stabilized upon ligand binding and the intrinsic motions already accessible to the "unbound" protein prior to ligand-binding (Tobi and Bahar 2005;Skjaerven et al. 2011). This led to the concept of pre-existing paths of collective structural changes selectively favored upon specific substrate binding (Zheng et al. 2009;Meireles et al. 2011).
In parallel, the evolutionary significance of global modes of motion became clear (Carnevale et al. 2006;Maguid et al. 2006Maguid et al. , 2008Hollup et al. 2011;Bahar et al. 2017). Computations highlighted the coupling between sequence evolution and intrinsic dynamics (Liu and Bahar 2012;Zou et al. 2015;Echave et al. 2016), and experiments demonstrated that the changes in structure (or oligomerization state) stabilized by mutations bear close resemblance to structural changes that accommodate ligand binding (Perica et al. 2014). Evolvability of intrinsic dynamics thus emerged as a major mechanism enabling adaptability to environmental changes, intermolecular interactions, or even mutations (Tokuriki and Tawfik 2009;Haliloglu and Bahar 2015). Recent work further showed that intrinsic dynamics is a major determinant of the impact of missense mutations on function, and that the inclusion of ENM-based features in a machine learning classifier improves the accuracy of pathogenicity predictions .
These observations call for a rigorous evaluation of the conservation/differentiation of structural dynamics in relation to the evolution of sequence and structure in protein families/subfamilies using sufficiently large data sets, and dissecting the contribution of collective motions in different frequency regimes; and the need for a tool to accomplish this task. The present study aims at addressing these needs. We introduce a new interface, SignDy, for evaluating the Signature Dynamics of protein families, building on ENM theory and methods implemented in the application programming interface (API) ProDy (Bakan et al. 2014). Application to 116 superfamilies of proteins discloses basic principles for functional fitness and diversification: exploiting the robust global dynamics of a versatile fold, and gaining specificity via localized, yet impactful, fluctuations conserved among subfamily members but divergent across subfamilies. We further illustrate the utility of SignDy by way of application to three families of folds: 1) leucine transporter (LeuT), 2) periplasmic-binding protein type-1 (PBP-1), and 3) triosephosphate isomerase (TIM) barrel. SignDy proves to be an effective tool for quantitative evaluation of both generic dynamics of families, and specific dynamics of subfamilies, identifying the specific modes of motions that distinguish subfamilies (shared by subfamily members but sharply different across subfamilies), and learning how evolutionarily selected folds exploit collective modes of motions in different frequency regimes to reconcile a diversity of sequences and functions with the same architecture.

New Approaches
The results in this study are generated using a new computing and visualization interface, SignDy, designed to enable and automate the evaluation and comparison of the dynamics of structures belonging to evolutionarily related proteins. SignDy is built upon the Protein Dynamics (ProDy) API (Bakan et al. 2014) launched for bridging biomolecular structure and function via characterization of dynamics . With more than 100,000 unique visitors and $1.7 million downloads (reported by Google Analytics), ProDy serves as a major resource for exploring a wide range of phenomena, from collective dynamics to sequence coevolution. SignDy benefits from 1) theory and methods of ENMs Li et al. 2017), mainly the Gaussian network model (GNM) (Bahar et al. 1997) and the anisotropic network model (ANM) (Atilgan et al. 2001;Eyal et al. 2015); 2) the reconciliation (Chennubhotla and Bahar 2007) of physics-based theories of polymer statistical mechanics and machine learning (ML) algorithms of spectral graph theory; 3) the consolidation of theory and experiments to extract information on motions that facilitate ligand binding, molecular machinery, or allosteric signaling (Tobi and Bahar 2005;Zheng et al. 2009;Fuglebakk et al. 2012;Lopez-Blanco and Chacon 2016); 4) the Evol module (Bakan et al. 2014) for evaluating sequence (co)evolution and comparison with structural dynamics (Liu and Bahar 2012); and 5) NMWiz, an interactive visualization GUI that interoperates with VMD (Humphrey et al. 1996) and Chimera (Yang et al. 2012).
SignDy is designed as a pipeline composed of seven steps depicted in figure 1, described in the Materials and Methods and the supplementary methods, Supplementary Material online (additional information can be found in online tutorials; http://prody.csb.pitt.edu/signdy/; last accessed April 26, 2019): 1) selection of protein family members to be used as input; 2) structural alignment of members and identification of core residues; 3) refinement of the resulting ensemble and its associated multiple sequence alignment (MSA) based on sequence and structure similarity criteria to select a representative set of homologues; 4) computation of mode spectra using the GNM or ANM, identification of mode-mode matches between family members, and evaluation of the collectivity of modes; 5) characterization of signature dynamics, that is, mechanisms of global movements and interresidue cross-correlations shared among family members; 6) quantitative assessment of the conservation/divergence of structural dynamics between family members or subfamilies, broken down into different frequency regimes, and identification of specific modes that unify subfamily members and maximally discriminate between subfamilies, toward gaining insights into the mechanistic basis of functional differentiation of fold families into specific subfamilies; 7) classification of family members based on their dynamics in different frequency regimes, and comparison of the dynamics-based (frequency-dependent) distributions of family members with the distributions based on their sequence, structure, and subfamily function.
We use as metrics of conservation/divergence of structural dynamics among family members the correlation cosines cc k A; B ð Þbetween each matching mode k of members A and B, and the spectral overlap SO ij A; B ð Þ for sets of modes (i k j) in various frequency regimes. Averages over all pairs of members A and B belonging to specific pairs of subfamilies provide quantitative information on the conservation or differentiation of structural dynamics between subfamilies in different frequency regimes. We analyze the evolution of motions in the global (1 k 3), low-frequency (LF; 4 k 20), low-tointermediate frequency (LTIF; 21 k 60), and highfrequency (HF; k > 60) regimes by assessing which type of motions (global, LF, LTIF, or HF) are shared among family members, how mode collectivity and conservation relate to each other, and which modes accompany, if not control, the differentiation of families into subfamilies. Zhang et al. . doi:10.1093/molbev/msz102

A Unique Signature Dynamics Defined by Conserved Global Motions Characterizes Each Family
Figure 2a-c illustrates the signature dynamics for three folds, LeuT, PBP-1, and TIM barrel. Information on the corresponding data sets of proteins (Data Sets 1-3) can be found in the respective supplementary tables S1-S3, Supplementary Material online; their sequence, structure, and function distributions are presented in supplementary figure S1, Supplementary Material online. The average mobility profile of residues resulting from global modes of motion (up to k ¼ 3 [blue]) and LF motions (up to k ¼ 10 [orange] and 20 [green] modes) are displayed, along with their standard deviation (SD) and range within each family. Minima and maxima can be traced back to secondary structural elements (indicated by colored bars along the abscissa in a and c) and loops (or disordered regions), respectively. This is due to the high-packing density at secondary structural elements manifested by small-amplitude fluctuations at those regions. The minimal difference between the three curves in each panel indicates the robustness of the signature dynamics defined by global modes. The LF modes in the range 10 k 20, which are usually less collective than those in k 10, induce increased variations (shades) indicative of a differentiation among members while preserving the signature dynamics.
To assess the level of conservation of global modes within families, we evaluated the mode-mode correlation cosines cc k h i averaged over all family members for each equivalent mode k. The results are presented in figure 3a-c (green curves and shades for the respective averages and SDs). Sharp peaks at the lowest frequency end of the spectra and rapid decays with increasing mode number confirm the conservation of global modes.

Robust Global Modes Define Signature Dynamics
To confirm the dominance of global modes as a determinant of family signature dynamics, we examined their level of conservation within CATH superfamilies.  fig. 3d, green curve and shade for 1 k < 100) consistently show that global modes are highly conserved. The average correlation cosine for the top-ranking mode (k ¼ 1) of superfamily members is 0.80 6 0.19 and drops to 0.20 6 0.07 for k ¼ 20 (supplementary fig. S3a and b, Supplementary Material online). Higher modes display a plateau with minimal (0.1-0.2) correlation.
Larger proteins/domains have access to a broader conformational space and a wider spectrum of motions. One might expect their dynamics to be more heterogeneous, leading to weaker mode conservation among members. Computations (supplementary fig. S3, Supplementary Material online) showed, however, that the dependency of mode conservation propensity on protein size is minimal. The top-ranking modes exhibit strong correlations, irrespective of the size of the   h iis plotted as a function of the total number of modes, n, included in the analysis, together with the corresponding variation among family members (lighter blue band). The curves reflect two counter effects: first, there is a peak at the lowest frequency end, consistent with the conservation of global modes. The overlap rapidly decreases with increasing n, due to the dissimilarity of the newly added modes. This differentiation between family members is consistent with the rapid drop in mode conservation (green curve) shown in figure 3a-c for LeuT, PBP-1, TIM barrel families as well as that for CATH superfamilies ( fig. 3d). Then, a new regime is observed, the LTIF regime, which includes modes 20-60 approximately, where the spectral overlap is minimized (minima indicated by dashed vertical lines). Finally, an opposite effect takes over, manifested by an increase in overlap. This arises from the increased coverage of the space of conformational changes (shown in the orange curve), consistent with the theoretical limit of SO 1n A; B ð Þ!1 as the complete space of motions is considered. The minimum in SO 1n h i occurs for n 50. The LTIF regime where the cumulative spectral overlap is minimized emerges as a determinant of the specificity of family members. The percent contribution of the modes in this regime to the overall spectrum amounts to $25% (see the increase in the cumulative weight of modes [orange curves in fig. 3a-c] in this interval), which means a substantial contribution to alter dynamics, while retaining the generic behavior.
Further calculations performed for CATH superfamilies ( fig. 3d) corroborated the same trends. Supplementary   FIG. 2. Signature dynamics of each family is robustly defined by global motions uniquely defined by the fold. Panels (a-c) display the distributions of mean-square fluctuations (MSFs) of residues for the respective fold families LeuT, PBP-1, and TIM barrel. Mobility profiles driven by k ¼ 3 (blue), 10 (orange), and 20 (green) modes are presented, along with their SDs and ranges (bands in lighter shades). Horizontal bars along the abscissa indicate 1) the transmembrane (TM) helices of LeuT, 2) the upper lobe (UL) and lower lobe (LL) of PBP-1, and 3) the secondary structure (orange, b-strands; red, a-helices) of TIM barrel. Residue numbers along the abscissa refer to those retained in the ensemble (i.e., sequence positions whose occupancy in the MSA is 0.7 or higher), and deletions are not explicitly shown. (d-f) Ribbon diagrams of representative members, with core residues colorcoded by their mobilities in global modes (1 k 3; blue, minimal; red, maximal). Zhang et al. . doi:10. i, averaged over all superfamilies, is 0.55 6 0.25, despite the low (<0.10) cumulative weight of this small set of modes. The addition of modes in the LF regime lowers the cumulative overlap to 0.45 6 0.15, even though a larger subspace of conformational changes is sampled, indicating the dissimilarities in conformational motions among members in this regime. A high overlap ( SO all h i¼ 0.84 6 0.02) is recovered by the ensemble of all modes, which, by definition, forms a complete basis set that spans all possible conformational changes.
Overall, these data underscore the role of motions in the LTIF regime in differentiating family members within a given fold family, which will be further elaborated below.

Increased Sequence Heterogeneity in a Given Fold Family Manifests Itself by Higher Differentiation of Dynamics, Especially in the LTIF Regime
Our earlier work showed that sequentially conserved sites are also distinguished by their restricted fluctuations; or the mobility of residues, reflected by their mean square fluctuations (MSFs) around their mean positions, increases with increasing Shannon entropy (H) at the corresponding sequence position (Liu and Bahar 2012). That study established the correlation between sequence variation and conformational flexibility (RMSF). Here, we investigated one further property, the change in flexibility, DRMSF, at a given position among family members, which is a metric of the extent of differentiation in the equilibrium dynamics between family members.
To this aim, we first evaluated the level of sequence heterogeneity within each family, using Shannon entropy as a metric. The resulting distribution among 13,648 residues belonging to 77 CATH families (after excluding the small folds with N < 100 residues) is shown by the histogram (gray bars) in figure 3e. The histogram perfectly fits a lognormal distribution in support of the accurate sampling of sequence variabilities by the examined set (supplementary table S4 The results are presented for different subsets of modes: global (k 3), LF (4 k 20), LTIF (21 k 60), and HF (k ! 60) regimes. The bar plot in the inset displays the D RMSF h i averaged over all sequence entropies for the four respective groups. These results clearly show the dominant role of LTIF motions in imparting the member-specific differences in the fluctuation spectrum of individual family members, except for the high-sequence entropy region. In this case, differentiation of the modes shifts toward slower modes, as can be seen from the crossover between the LF and LTIF curves. The shift to LF modes reflects the earlier divergence of modes along the mode spectrum, in tandem with the higher divergence of sequence.
A closer examination shows that D RMSF h icontributed by the global modes is relatively flat with respect to sequence entropy in the range H 1:5. This insensitivity to sequence variations suggests that global dynamics are more conserved compared with sequence, presumably consistent with the slower divergence of structure, compared with sequence. Supplementary figure S3d, Supplementary Material online, further shows that diverging structures encode diverging dynamics despite the rather narrow root-mean-square deviation (RMSD) range. This dependency is stronger when all modes (red dots) are considered, as opposed to global modes (orange dots), confirming the increased differentiation of mode spectra with addition of higher modes. There is, however, some variation of spectral overlap with sequence identity (supplementary fig. S3e, Supplementary Material online), indicating that diverging sequences encode diverging dynamics too, which will become even clearer by focusing on subfamily dynamics next.

Differentiation of Protein Families into Specific Subfamilies Is Accompanied by the Evolution of LTIF Motions
Consider a family composed of m subfamilies (or a superfamily of m families). For example, the currently considered TIM family contains eight subfamilies (with at least four members). Subfamily classification is based on the specific functions of family members, for example, in the case of TIM barrel, we have aldolases class 1 (ALD1), glycosidases (GLYC), and phosphenolpyruvate binding domains (PEPE). Of interest is to assess to what extent subfamily members share similar modes among themselves, and to what extent they differ from other subfamily members. In other words, is the differentiation of fold families into specific subfamilies accompanied, if not driven, by a subset of modes that typifies the subfamily, and distinguishes it from all other subfamilies?
Note that subfamily members are not necessarily sequentially close or structurally close, but they belong to the same subfamily because of their shared biological (e.g., specific enzymatic) activities. In this respect, it is of interest to see if their common functions are supported by common mechanisms of action, or shared modes. Another way of asking the same question is which particular modes, or modes in which frequency regime, unify members within subfamilies, while ensuring maximal differentiation between subfamilies themselves. Toward this goal, we evaluated the spectral distances d ij m p ; m s between subfamilies p and s, composed of m p and m s members, respectively, based on the similarity of their modes i k j (see Materials and Methods and Supplementary Material online). Figure 4a-d and supplementary figure S4a-d, Supplementary Material online, illustrate the respective results for TIM and PBP-1 families. Results are presented for the global, LF, LTIF and HF frequency regimes (respective panels a-d) by color-coded matrices. Diagonal elements describe the level of conservation of dynamics within subfamilies; whereas off-diagonal terms represent the distances between pairs of subfamilies, with dark red entries indicating a strong divergence. We note that the LTIF modes are maximally distinctive across families, followed by LF modes, while the global modes and, interestingly, HF modes retain similarities. The strong discrimination provided by the LTIF regime between subfamilies-a feature apparent in the large-scale examination of CATH superfamilies, is now clearer with the subfamily-subfamily distance maps based on subfamily dynamics.
Further comparison of the conservation/divergence of structural dynamics across subfamilies with their sequence and structure similarities (panels e-g in fig. 4 and supplementary fig. S4, Supplementary Material online) reveals that the correlations (or lack thereof) between the mode spectra of subfamilies in different regimes closely parallel sequence properties, rather than structural similarities/dissimilarities. The latter was assessed by two metrics, average RMSD between subfamilies and average Template Modeling score (TM-score) (Zhang and Skolnick 2004), which yielded almost identical results. In other words, the division of families into subfamilies relates to the differentiation of their dynamics, more than the differentiation of their structure, in support of the direct relevance of motions/dynamics to subfamily function. Overall these results demonstrate that the specific mechanisms that distinguish subfamilies can be traced back to the intrinsic modes in the LTIF regime.

Evolutionary Conservation of Modes Shows a Unique Dependency on Their Collectivity
Global modes are usually known to be highly collective, that is, they cooperatively embody large portions of the structure. HF modes, on the other hand, are highly localized. In order to understand whether the conserved and not conserved modes in the same frequency range are characterized by different levels of collectivity, we compared the conservation profile of the modes and their collectivity profile observed in superfamilies. The results are illustrated for an example CATH superfamily (3 0 5 0 -cyclic nucleotide phosphodiesterase catalytic domain) in figure 5a, and similar results are shown for a series of CATH superfamilies in supplementary figure S5, Supplementary Material online. In each case, the green curve displays the conservation profile ( cc k h i similar to fig. 3a-c) and the red curve the collectivity profile (j k ) for all modes (1 k N À 1) obtained by the GNM. All the curves practically show the same trend: a positive correlation Zhang et al. . doi:10.1093/molbev/msz102 MBE between conservation and collectivity in the global and LF regimes, followed by the negative correlation in the LTIF and HF regimes, and strikingly an increase in conservation, accompanied by a decrease in collectivity at the fastest end of frequency spectrum, designated here as the very high-frequency (VHF) regime, already discerned in figure 3a-c.
Systematic analysis of all 77 CATH superfamilies led to the plots in figure 5 (panels b-f). In the case of global modes, the more collective modes are also those that tend to be evolutionarily conserved (panel b), and the same trend can also be seen in LF modes (in panel c) although we can detect some modes that exhibit the opposite behavior, that is, they exhibit high conservation despite having low collectivity. This type of anticorrelation dominates the rest of the spectrum, including the LTIF, HF, and VHF modes (panels d-f). Panel (g) displays all the results, thus allowing us to clearly view the complex relationship between collectivity and conservation, broken down into different regimes.

Conserved Local Motions Specific to Subfamilies Can Be Detected among HF Modes
It is interesting to observe peaks at relatively higher modes in the mode conservation curves (figs. 3a-c and 5a and supplementary fig. S5, green curves, Supplementary Material online). These signal the conservation of local events among subsets of members. Figure 4d and supplementary figure S4d, Supplementary Material online, further support the conservation of HF modes within subfamilies, and even across subfamilies. Early applications of the GNM pointed to evolutionarily conserved sites distinguished by HF modes relevant to stability, even though the high sensitivity of HF modes to structural details would preclude us from generalization (Bahar et al. 1998). The consolidation of such modes over all family members by SignDy provides a framework for identifying such critical sites, illustrated in supplementary figure S6, Supplementary Material online for the TIM barrel and PBP-1 families, which may assist in assessing the pathogenicity of single amino acid variants (SAVs) .

Dynamics-Based Dendrograms Distinguish between Substates and Subfamilies of Structural Homologs
Using sequence-, structure-, and dynamics-based distance metrics, we generated the maps and dendrograms presented in figure 6 and supplementary figure S7, Supplementary Material online, for the PBP-1 family, named after periplasmic binding proteins (PBPs) in bacteria that capture solute in the periplasm and supply them to ABC transporters (Quiocho and Ledvina 1996). This fold has been used in a number of other proteins, where it is involved in signal transduction in a variety of eukaryotic multidomain receptors (Felder et al. 1999 (d-f) under each map. They are colored from most similar in dark blue to most different in dark red. The members are reordered along the axis based on the dendrograms to aid with visualization and the numbering of family members along the axes corresponds to that in supplementary table S2, Supplementary Material online, which is based on the structure dendrogram. The color-code along the two axes refers to function annotations in supplementary figure S1b and table S2, Supplementary Material online.
At the sequence level ( fig. 6a and d), we observe a clear separation between bacterial and eukaryotic family members (highlighted in orange and pink, respectively). Smaller clusters with higher sequence similarity within these two groups (yellow, green, and blue boxes in the fig. 6a) correspond to functional groups such as different iGluRs (AMPA, kainate, delta, and NMDA receptors), class C G-protein-coupled receptors (GPCRs), and natriuretic peptide receptors (NPRs). The structure-based dendrogram ( fig. 6e) reveals more heterogeneity including a splitting of the bacterial, GPCR, and NPR structures into open and closed forms but performs less well at distinguishing subfamilies.
Dynamics-based classification based on global ANM or GNM modes ( fig. 6f and supplementary fig. S7a, Supplementary Material online, respectively) enables us to discriminate between active and inactive states, distinguished by conserved opening/closing and twisting motions  driven by the two signature ANM modes ( fig. 6g and supplementary movies 1

Application to LeuT Fold Family Shows How Signature Dynamics Favors Functional and/or Multimerization Mechanisms
The LeuT fold, first resolved for a bacterial leucine transporter (Yamashita et al. 2005), is shared by many prokaryotic and eukaryotic secondary transporters despite their low-sequence identity (Shi 2013;Drew and Boudker 2016). It is composed of 12 TM helices that alternate between outward-facing (OF) and inward-facing (IF) conformations during the transport cycle. The former favors the uptake of substrate from the extracellular (EC) region, and the latter its release to the intracellular (IC) region, accompanied by the cotransport of Na þ ions, and in some cases by the antiport of other substrates/ions (Kazmier et al. 2017). Family members include Shared Signature Dynamics Tempered by Local Fluctuations . doi:10.1093/molbev/msz102 MBE dopamine transporter (DAT), multihydrophobic amino acid transporter (MhsT), benzyl-hydantoin transporter (Mhp1), sodium/galactose transporter (vSGLT), glycine betaine transporter (BetP), carnitine/butyrobetaine antiporter (CaiT), and arginine/agmatine antiporter (AdiC). See supplementary table S1 and figure S1a, Supplementary Material online, for sequence and structure properties of the 85 members studied here, figures 2a and 3a for signature dynamics, mode conservation, and spectral overlap between family members.
Here, we focus on transport and multimerization mechanisms of LeuT members. Figure 7 and supplementary movies 3-5, Supplementary Material online, reveal how the three global modes operate in a complementary way to enable substrate transport: they divide the fold into two parts from three orthogonal perspectives, resulting each in concerted opposite-direction (anticorrelated) fluctuations (or breathing motions) of the respective parts. Their combination allows for the cooperative opening and closing of the central substrate/ion-binding pocket (supplementary fig. S8, Supplementary Material online). The close-to-zero values in figure 7a (indicated by vertical pink shades) indicate pivotal sites at the interface between oppositely moving substructures.
Closer examination reveals large displacements in EC loop 3 (EL3; known as helix H7 in BetP and CaiT) (black arrows in fig. 7a). The transporters exhibit large structural heterogeneities at this region (supplementary fig. S9a, Supplementary Material online). However, the movement of EL3 is not random. On the contrary, it is driven by a cooperative mode (ANM mode 2) that enables the transition between OF and IF states of the transporter; and further motion of BetP H7 along the same direction/mode allows for intersubunit contacts that stabilize the trimer  Another region distinguished by its conformational adaptability is IC loop 2 (IL2; red arrow in fig. 7a and c). This region undergoes large rearrangements during the OF $ IF transitions of LeuT (Krishnamurthy and Gouaux 2012), Mhp1 short; see the bar on the right). Results are displayed for four frequency regimes, global, LF, LTIF, and HF, in the respective panels (a-d), as indicated by the ranges i mode j. Diagonal terms show the average distances between members within subfamilies based on the motions in the particular frequency window; and the off-diagonal terms show those across subfamilies. The LTIF regime (modes 21-60) provides the sharpest discrimination between subfamilies; whereas modes in both the global (a) and HF (d) regimes are relatively conserved. For comparison, we present the sequence distances (e) and structural distances (f and g, using RMSD and TM-score as metrics) between subfamilies. Note that the subfamily-subfamily spectral distances in the LTIF regime (panel c) conform closely to their functional classification (panel h) defined by CATH, rather than their structural similarities (panels f and g), in strong support of the significance of LTIF motions in the evolution of function. Zhang et al. . doi:10.1093/molbev/msz102 MBE (Shimamura et al. 2010), MhsT (Malinauskaite et al. 2014), and BetP (Perez et al. 2012), the directions and the sizes of the deformations varying between members. The departure from the generic signature profile at this region suggests a role in imparting specificity (see also fig. 7c). Finally, the crosscorrelation maps ( fig. 7b and supplementary fig. S10, Supplementary Material online) highlight the structural elements that undergo coupled same-sense (red) or oppositesense (or anticorrelated; blue) motions. The largest variations in cross-correlations (lower map in fig. 7b) take place in the motions of TM6 with respect to TMs 1-3 and 10. These interhelical distances have been noted to define the extent of opening/closure of the EC and IC vestibules (Cheng and Bahar 2014;Drew and Boudker 2016). TM1 movements are shown here to be anticorrelated with respect to TM10 which forms a coherent block with TM5 and TM7. These observations are consistent with recent H/D exchange mass spectrometry experiments where partial unwinding of TM1, 5, 6, and 7 drives the OF ! IF transition (Merkle et al. 2018).

Discussion
In recent years, there has been an increasing interest in interpreting sequence evolutionary trends in the light of biophysical models, reconciling evolutionary biology, and structural biophysics (Liberles et al. 2012;Echave and Wilke 2017). Structural stability and related functions such as residue packing density are key constraints in sequence conservation and evolutionary change rate (Echave et al. 2016). Yet, stability alone is not sufficient for functionality. Many proteins achieve their function by virtue of their conformational flexibility (Zheng et al. 2009;Skjaerven et al. 2011;Haliloglu and Bahar 2015). While the conservation of sequence, or sequence evolution rate, closely relate to structural stability and thermodynamics, the conservation of structure and its evolution might be closely determined by its adaptability to functional requirements. The present study aimed at shedding light to the relation between biomolecular dynamics and evolution of structure and function. We examined subfamilies from the perspective of their structural dynamics and identified which frequency windows of the mode spectrum naturally provide the most discriminative description of subfamilies, that is, which modes entail motions shared among subfamily members but sharply divergent between subfamilies. Decomposition of the mode spectrum into the contribution of different frequency windows unambiguously revealed the evolutionary significance of a well-defined subset of modes, those lying in the LTIF regime. These modes endow subfamily MBE members with subfamily-specific motions, or mechanisms of action, and they provide maximal discrimination between subfamilies in accord with their functional categorization in the CATH database.
Pioneering studies that introduced the concept of evolution of structural dynamics and/or its relation to sequence evolution traditionally focused on experimental data, for example, a-carbon fluctuations (B-factors) (Maguid et al. 2006(Maguid et al. , 2008, the coupling between sequence variability and structural dynamics (Liu and Bahar 2012;Nevin Gerek et al. 2013), or diversity of conformers resolved for well-studied proteins in the PDB (Juritz et al. 2013). The present study, inspired by earlier observations and motivated by the need to gain a deeper understanding of the principles that control the conservation/divergence of structural dynamics led to design and implementation of a new interface, SignDy. SignDy permitted us to systematically analyze 15,636 proteins in 77 CATH superfamilies, and revealed features that could not be discerned if it were not for serial analysis of large ensembles of CATH superfamilies. We discerned for the first time the differences in the conservation of modes in different frequency regimes, and the close relationship between the dissimilarities in the LTIF modes and the structural variations and specific mechanisms of action that distinguish subfamilies.

Distinctive Evolution of Modes in Different Frequency Regimes and Relation to Differentiation into Subfamilies
We have conducted a thorough examination of the evolution of structural dynamics by focusing on four windows of mode spectra: global modes (k ¼ 1-3), slow (low frequency, LF) modes (k ¼ 4-20), LTIF modes (k ¼ 21-60), and fast (high frequency, HF) modes (k > 60). These ranges are estimated from the average behavior of 77 CATH superfamilies, and the FIG. 6. Categorization of family members based on their sequence, structure, and dynamics. (a-f) Distance matrices (a-c) and corresponding dendrograms (d-f) for PBP-1 family members based on (a and d) Hamming distance between sequences, (b and e) RMSD between structures and (c and f) spectral distance between global ANM modes (k ¼ 5). The numbers and colors along the axes correspond to the order of the conformers based on RMSD clustering and the subfamilies to which they belong (see supplementary table S2, Supplementary Material online). In the trees, each node represents a member and the colors and labels correspond to subfamilies along with conformational/functional states. In the sequence case (a and d), there is a clear distinction between bacteria and eukaryotes, highlighted in blue and orange, respectively. (g) The first two global signature ANM modes are shown with arrows illustrating the opposite motions of the upper and lower lobes. Mode 1 (left) shows a twisting/ untwisting motion and mode 2 (right) shows an opening/closing motion as shown in supplementary movies 1 and 2, Supplementary Material online. (h) Projection of family members onto a 2D space spanned by the first two signature ANM modes clusters family members based on global mode spectra akin to panels (c) and (f). (i) Projection of family members onto a 2D space spanned by the first two principal components of structural variation clusters family members akin to panels (b) and (e). Zhang et al. . doi:10.1093/molbev/msz102 MBE boundaries between these regimes may vary slightly among different protein families. Notably, different frequency regimes exhibited different relationships to the evolution of structure and function. The global modes are highly conserved across all members of the family, that is, they are resilient to change throughout evolution, presumably due to their role in defining the signature dynamics of the family. The LF regime, on the other hand, exhibits a dependency on the type of subfamily, thus underlying the differentiation of subfamily members in terms of their dynamics. This effect is further pronounced in the LTIF regime. The LTIF regime ensures maximal discrimination between the dynamics, or accessible mechanisms of action, of subfamily members, while also accomplishing the highest similarity among members within subfamilies. Major contribution to the specificity of subfamilies originates in the LTIF regime. Finally, the HF regime makes little contribution to structural divergence ( fig. 3e). Yet, the same regime has several "conserved" modes, similar to the global modes, but completely different in terms of their collectivity (see figs. 3a-c and 5 and supplementary FIG. 7. Generic and specific features of LeuT fold dynamics. (a) Global mode shapes and displacements along the global modes shared by family members (mean profiles, solid curves), and their differentiation (SDs; darker shaded area), and the full range of variations (lighter shaded area). Colored bars along the upper abscissa indicate the TM domains. Pink vertical bands indicate the residues lining the substrate-binding pocket, which show minimal spatial displacements. Red and black arrows indicate the locations of IL2 and EC3/H7, respectively, the high flexibility of which is essential to substrate recognition and multimerization. The ribbon diagrams generated for a representative LeuT structure (PDB ID: 2A65) are color-coded (from blue to red) by the size and direction of motions (from negative to positive) in each mode. (b) Generic covariance map (top) and its SD (bottom), based on k 20 modes. See more details in supplementary figure S10, Supplementary Material online. Specific residue pairs whose cross-correlations significantly depart from the generic covariance are indicated by white arrows (bottom). The curve along the left ordinate shows the row-average. The peak at TM6 suggests a driving role in eliciting cooperative changes.  . While HF motions are usually viewed as noise in molecular simulations, the current approach that yields an analytical solution (unique to each fold) reveals the evolutionary conservation of selected HF fluctuations among family members, across all subfamilies. In sharp contrast to global modes, HF modes are highly localized, but presumably important enough to biological function such that they are retained across subfamilies throughout evolution of sequence and structure. These findings link biomolecular structural dynamics (topologyencoded collective modes of motion) to the evolution of structure and function.

New Insights by Serial Examination of Large Ensembles of Protein Families
SignDy differs from existing computational methods for exploring structure-based dynamics and its evolution in several ways: first, it is fundamentally different from full atomic models and simulations, which do not lend themselves to systematic comparative analysis of hundreds, if not thousands, of proteins' dynamics, even with technological advances that permit to up to milliseconds dynamics of small proteins (Dror et al. 2012). Second, at the heart of the methodology is the prediction of structural dynamics and in particular, the global modes of motions and correlations by a model (ENM) that lends itself to analytical solutions. The adoption of such a method that dissects structural dynamics was essential to distinguishing conserved and divergent motions of families and superfamilies.
Many ENM-based predictive studies of comparative dynamics provided valuable insights on the evolution of motions in selected systems (Carnevale et al. 2006;Micheletti 2013;Dutta et al. 2015;Zou et al. 2015;Reuter 2016, 2018;). Other comparative studies highlighted the bridge between structural dynamics and sequence evolution (Liu and Bahar 2012;Nevin Gerek et al. 2013). Significant efforts have been deployed for developing interfaces that enable principal component analysis of structurally known sequence homologues, comparisons with ANM predictions Skjaerven et al. 2014), and comparison with sequence coevolution properties (Bakan et al. 2014). More recently, Reuter and coworkers (Tiwari and Reuter 2016) performed an insightful ENM analysis of 23 proteins belonging to five different families that share the TIM barrel fold, to highlight the adaptability of the fold to various functions by virtue of its intrinsic signature dynamics. However, a large-scale systematic study of superfamilies of protein folds that share very low-sequence identity and accommodate a diversity of functions has been a challenge due to many obstacles, starting from the selection/retrieval of (super)family members (which cannot be done by PDB BLAST search due to low-sequence identity), and the optimal structural alignment of members. SignDy provides automated tools that surmounts these obstacles and allows for comparing the dynamics of CATH superfamily members that share similar structures but minimal sequence identity and a broad range of functional diversity. Another strength of SignDy is the use of GNM (in addition to ANM), which has been shown in numerous applications to yield results in better agreement with experiments than ANM . The tool highlights features that could not be unambiguously detected upon examination of individual cases, such as the trade-off between adaptability and specificity as discussed next.

Compromise between Adaptability and Specificity
It is well known that sequence diverges much faster than structure. In other words, the sequence space is much larger than the structure/fold space. The mapping of various sequences into a small number of folds, or a relatively small set of fold superfamilies (e.g., $100 examined here that cover almost 1 = 4 of PDB structures), does not, however, prevent proteins from achieving a broad diversity of functions. The latter is enabled by conformational dynamics.
The present study suggests that conformational dynamics supports the selection of folds in two ways: first, all family members share the fold-encoded global modes, or signature dynamics, that presumably underlie the versatility of the fold, for example, the different members may exhibit different levels of interdomain opening, or global twisting, but these are all slight rearrangements along the shared soft modes, which facilitate the adaptation to different substrates. These signature modes are largely conserved across different structures of the same protein as observed in previous studies (Batista et al. 2010(Batista et al. , 2011Krieger et al. 2015;. Secondly, motions in the LTIF regime define the specificity of subfamilies. Members of subfamilies are unified by their shared motions, or mechanisms of actions, in that particular regime, and they are maximally differentiated from other subfamily members precisely by virtue of the differences in their specific motions in this regime. In summary, robust global dynamics is a unifying feature in favor of the selection of the family fold; whereas LTIF dynamics is the way the specificity requirement copes with common fold. An earlier study demonstrated that global modes are robust to perturbations, which could explain their conservation (Echave and Fernandez 2010). Their robustness to perturbation does not preclude the fact that these modes are also functionally significant, as confirmed in numerous studies. To the extent that functionality is a driving force for selecting structures, these robust modes that are functional would be expected to play a role in the selection or evolution of the structures that favor these modes.

Convergent versus Divergent Evolution
Despite the wealth of data on well-studied proteins such as TIM barrel proteins, it is still not clear whether their shared fold originates from common ancestry, or results from convergent evolution. Protein folds are presumed to be more susceptible to evolutionary convergence than sequences, but sequence-profile-based phylogenetic analysis can detect evolutionary relationships even among sequentially distant members of a given superfamily, in support of divergent evolution (Theobald and Wuttke 2005). Other studies show that fitness constraints enforce evolutionary paths that preserve protein structure despite sequence divergence down to 30% Zhang et al. . doi:10.1093/molbev/msz102 MBE sequence identity (Gilson et al. 2017). Yet, the currently examined superfamilies contain members with much lower sequence identity, and other studies suggest that there is a limit to amino acid divergence while maintaining the contact topology/fold of the protein (Porto et al. 2005). While the current study cannot ascertain whether the shared structures are maintained during divergent evolution of sequences, or selected by convergent evolution, we clearly distinguish robust signature dynamics shared by family members, as well as LF and LTIF modes that characterize subfamilies. It remains to be established whether the prevalence of robust global motions, and accessibility to selected LTIF modes drive the selection of these folds.

Future Directions
Current models and methods explain $60% of the observed variance in site-specific substitution rates in proteins, highlighting the limitations of state-of-the-art approaches (Echave et al. 2016), which are often based on machine learning methods of sequence analysis and other structure-based considerations such as local packing density and solvent accessibility. Previous analysis demonstrated that local packing density is a major determinant of evolutionary rate, while flexibility, as described by RMSFs is not. ENMs inherently account for packing density, but also provide a higher level of description of the complete topology. Notably, an ENMbased mechanistic study has been shown to account for site-specific evolutionary rates and their relationship with packing density and flexibility (Huang et al. 2014), and another assisted in improving our assessment of the impact of SAVs . While the current study does not aim at inferring causal relationships between structural dynamics and sequence evolution rate, the signature profiles and covariances obtained here upon mathematically exact evaluation and dissection of the coupled dynamics of all residues provide information furthering our understanding of site-specific evolutionary rates or impact of mutations.
Additional studies with SignDy by a wide range of users with expertise on particular proteins and families would provide deeper insights into the evolution of dynamics and its importance for function. A reasonable strategy for utilizing SignDy in characterizing family/subfamily dynamics vis-a-vis structure and function evolution would be: 1) generate the mode conservation and collectivity profiles for the investigated family (e.g., fig. 5a and multiple profiles in supplementary fig. S5, Supplementary Material online); 2) identify the conserved modes (peaks in the same figures, green curves) in different regimes; 3) examine the corresponding mode shapes (e.g., figs. 2 and 7 and supplementary fig. S6, Supplementary Material online) to 4) identify critical sites responsible for the evolutionarily conserved signature dynamics (minima in global modes) and stability (peaks in HF modes) as well as those susceptible to subfamily specific divergence (in conserved LTIF modes); and 5) generate dendrograms that provide information on dynamics similarities in different regimes, complementing sequence and structure similarities, among family members ( fig. 6 and supplementary fig. S7, Supplementary Material online). While subfamily-subfamily spectral distances have been analyzed here based on different frequency windows of structural dynamics ( fig. 4 and supplementary fig. S4, Supplementary Material online), computations may be performed for narrower windows or even individual modes, to identify the most discriminative modes and infer new design/engineering principles for alterations of function.

SignDy Architecture and Workflow
SignDy is designed as a pipeline composed of seven steps as depicted in figure 1. We present the steps below. Technical details are presented in the supplementary methods, Supplementary Material online, and online tutorials.
(1) Selection of protein family members. The input to SignDy can be entered or generated in three ways: 1) entering a Pfam (Finn et al. 2016) or CATH-Gene3D (Dawson et al. 2017) ID representative of a family; 2) providing a query PDB (Burley et al. 2017) or UniProt (The UniProt 2017) ID, or a sequence in FASTA format, so as to extract the corresponding structural homologues using either existing ProDy functions or a new protocol designed to retrieve homologues from the Dali server (Holm and Laakso 2016); or 3) submitting a list of PDB codes for homologous proteins.
(2) Structural alignment and definition of core residues. This task is conceptually simple but not trivial and critically important. We structurally aligned family members using sequence alignments, the CE structural alignment algorithm (Shindyalov and Bourne 1998), and the alignments output from the Dali server (Holm and Laakso 2016). Comparison of CE and Dali shows the closer superposition of structures achieved by Dali, and hence its use is suggested whenever available (see supplementary methods and supplementary fig. S11, Supplementary Material online). A "core" of N residues is identified for each fold, composed of those sites with high-sequence occupancy (>70%), and structurally aligned among all members.
(3) Assessment of sequence and structure similarities among family members and selection of a refined representative set of homologues. Overrepresented sequences and structures as well as highly dissimilar ones are filtered out as described in the supplementary methods, Supplementary Material online. Typically, the average sequence identity over all pairs within the family is around 0.20 6 0.12, while pairwise RMSDs remain <7.0 Å. This step yields a refined ensemble of M members, including a reference structure R. Supplementary figures S1 and S2, Supplementary Material online, display the distributions of the average sequence identities and average RMSDs calculated for the three example families and the 116 CATH superfamilies examined here, respectively. (4) Evaluation of mode spectra and conserved mechanisms, using the GNM ( MBE properties characterize each mode: shape/mechanism (i.e., distribution of residue movements), and frequency/rate. The modes are ordered from LF (slow/ soft, global) to HF (fast, local). We quantify the mode-mode matches between R and each of the other M À1 members of the family/ensemble. The resulting equivalent modes for each member are reordered to match the mode order of R, and the collectivity of each mode is computed (see supplementary methods, Supplementary Material online). (5) Identification of signature dynamics. The spatial mobility of core residues driven by global modes averaged over all members, and its variation across family members, define the "signature dynamics" of the family as illustrated in figure 2a-c for the LeuT, PBP-1, and TIM barrel families. Another generic property is the crosscorrelations or the N Â N covariance map between residue motions averaged over all M members, which can be evaluated for different frequency windows. (6) Quantitative assessment of conservation of individual modes and spectral overlap between family members, and between subfamilies. The level of conservation of mode k within a given family is measured by the mode-mode correlation cosine computed for the kth equivalent mode, averaged over all MðM À 1Þ=2 pairs of members. Another criterion for the extent of similarity between the mode spectra of members A and B; is the spectral overlap, SO ij A; B ð Þ; a cumulative property evaluated for the subset of i k j modes (see supplementary methods, Supplementary Material online). SO_ij (A,B) is evaluated for different frequency regimes by suitable selection of the indices i and j. Mode-mode correlation cosines (for individual modes) and spectral overlaps (for sets of modes) both serve as metrics for assessing the conservation of dynamics. (7) Classification of family members based on their dynamics. A dynamics-based dendrogram for the family (analogous to a phylogenetic tree) is calculated using the spectral distance between pairs of members A and B, d ij A; B ð Þ¼ cos À1 SO ij A; B ð Þ À Á ; as a metric. The differentiation of collective motions in different regimes ði k jÞ between the m subfamilies is obtained by averaging d ij A; B ð Þvalues over all members belonging to each pair of subfamilies. These subfamily-subfamily distances conveniently displayed in m Â m matrices for each frequency range ð Þ¼1 À seq identity fraction, respectively. We verified that the RMSDs yield results similar to those obtained with TM score, another measure of structural difference that overcomes some potential problems with the RMSD measure (Zhang and Skolnick 2004) (supplementary fig. S12, Supplementary Material online).

Data Availability
The data sets used for generating the results are presented in supplementary tables S1-S4, Supplementary Material online.

Code Availability
The source code for ProDy can be found on GitHub at https:// github.com/prody/ProDy; last accessed: 04/26/2019. ProDy and SignDy computing language (Python) is essential to extensibility and interoperation with a wealth of modeling tools. Functions for generating ensembles are available in the ensemble module; those for generating mode ensembles and analyzing signature dynamics can be found in the SignDy module; and those for retrieving data from CATH and Dali are available in the database module. All code for generating the results and figures presented in the study are available in the form of tutorials on the ProDy website: http://prody.csb. pitt.edu/tutorials/signdy_tutorial/; last accessed: 04/26/2019.

Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.