The DynaSig-ML Python package: automated learning of biomolecular dynamics–function relationships

Abstract   The DynaSig-ML ('Dynamical Signatures–Machine Learning') Python package allows the efficient, user-friendly exploration of 3D dynamics–function relationships in biomolecules, using datasets of experimental measurements from large numbers of sequence variants. It does so by predicting 3D structural dynamics for every variant using the Elastic Network Contact Model (ENCoM), a sequence-sensitive coarse-grained normal mode analysis model. Dynamical Signatures represent the fluctuation at every position in the biomolecule and are used as features fed into machine learning models of the user's choice. Once trained, these models can be used to predict experimental outcomes for theoretical variants. The whole pipeline can be run with just a few lines of Python and modest computational resources, and the compute-intensive steps are easily parallelized in the case of either large biomolecules or vast numbers of sequence variants. As an example application, we use the DynaSig-ML package to predict the experimental fitness of thousands of VIM-2 lactamase sequence variants from deep mutational scan data.
Availability and implementation: DynaSig-ML is open-source software available at https://github.com/gregorpatof/dynasigml_package.


Introduction
The Elastic Network Contact Model (ENCoM) is the only sequence-sensitive coarse-grained normal mode analysis model [1]. Its sequence sensitivity enables its use to predict the impact of sequence variants on biomolecular function through changes in predicted stability [2] and dynamics [3]. We recently extended ENCoM to work on RNA molecules and predicted microRNA maturation efficiency from a dataset of experimentally measured maturation efficiencies of over 26 000 sequence variants [4]. To do so, the ENCoM Dynamical Signatures, which are vectors of predicted structural fluctuations at every position in the system, were used as input variables in a LASSO multiple linear regression model [5] to predict maturation efficiency. To our knowledge, this coupling of coarse-grained normal mode analysis to machine learning in order to predict biomolecular function is the first of its kind. Furthermore, it can be applied to any biomolecule for which experimental data exist linking perturbations (such as mutations or ligand binding) to function. Indeed, ENCoM is currently applicable to proteins, nucleic acids, small molecules and their complexes [6]. Here we present the DynaSig-ML ('Dynamical Signatures–Machine Learning') Python package, which allows the implementation and automated replication of this novel protocol. As an example application, we apply DynaSig-ML to predict the experimental fitness of VIM-2 lactamase sequence variants, starting from deep mutational scan data. DynaSig-ML automatically computes the ENCoM Dynamical Signatures from a list of perturbed structures (mutations or ligand binding), stores them as lightweight serialized files, and can then train simple machine learning algorithms using the Dynamical Signatures. The first algorithm is LASSO regression, which allows the mapping of the learned coefficients onto the studied structure (automatically accomplished by DynaSig-ML plus two simple PyMOL [7] commands). As these coefficients represent the relationship
between flexibility change at specific positions and the predicted functional property, this mapping can be used to drive new biological hypotheses. The second machine learning model implemented is the multilayer perceptron (MLP), a type of feedforward neural network [8]. MLPs can learn complex relationships between the input variables and are thus more powerful than LASSO regression; however, the learned patterns cannot be mapped back onto the structure because of the MLP's complexity and the absence of linear independence between input variables. DynaSig-ML automatically generates graphs showing testing performance. Each of the steps needed to apply DynaSig-ML is documented online as part of a step-by-step tutorial (https://dynasigml.readthedocs.io).
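Stripped of the ENCoM step, the core of this protocol is ordinary supervised learning on per-position fluctuation vectors. The following is a minimal sketch with synthetic data, using scikit-learn directly rather than the DynaSig-ML API (which wraps these steps); all names, sizes and values are illustrative:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_variants, n_positions = 500, 100

# Stand-in for ENCoM Dynamical Signatures: one fluctuation value per position
signatures = rng.normal(1.0, 0.1, size=(n_variants, n_positions))

# Synthetic "measured" property, driven by flexibility at a few positions
true_coef = np.zeros(n_positions)
true_coef[[5, 42, 77]] = [3.0, -2.0, 1.5]
measured = signatures @ true_coef + rng.normal(0.0, 0.05, n_variants)

# Random train/test split of the variants
X_train, X_test, y_train, y_test = train_test_split(
    signatures, measured, test_size=0.2, random_state=0)

# LASSO with cross-validated regularization strength; the sparse, signed
# coefficients are what can later be mapped back onto the structure
model = LassoCV(cv=5).fit(X_train, y_train)
print(f"testing R^2 = {model.score(X_test, y_test):.2f}")
```

Swapping `LassoCV` for scikit-learn's `MLPRegressor` on the same feature matrix gives the second, nonlinear model described above, at the cost of interpretable coefficients.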

Implementation
DynaSig-ML runs the ENCoM model within NRGTEN, another user-friendly, extensively documented Python package [6]. The machine learning models are implemented using the scikit-learn Python package [9]. The numerical computing is accomplished by NumPy [10] and the performance graphs are generated with matplotlib [11], making these four packages the only dependencies of DynaSig-ML.

VIM-2 lactamase example
In order to illustrate a typical use case of DynaSig-ML, we applied it to study dynamics–function relationships from deep mutational scan (DMS) data on the VIM-2 lactamase enzyme [12]. VIM-2 (Verona integron-encoded metallo-β-lactamase 2) is a bacterial enzyme capable of degrading β-lactam antibiotics and represents a major source of worldwide antibiotic resistance [13]. The DMS dataset measured bacterial fitness for every VIM-2 sequence variant under various concentrations of antibiotics [12]. For this application, we use the fitness under the maximal concentration of ampicillin (128 µg/mL) at 37°C as the property the machine learning models try to predict from the Dynamical Signatures. Figure 1 illustrates the whole protocol used to start from the PDB [14] structure of VIM-2 lactamase [15], train the machine learning models, test their performance and map the LASSO coefficients back on the VIM-2 structure.
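DynaSig-ML performs the coefficient mapping automatically, but the underlying trick is simply placing each position's LASSO coefficient in the B-factor column of a PDB file, which PyMOL can then color and thicken. Below is a self-contained sketch with made-up coefficients and a minimal synthetic structure (not the package's own code):

```python
# Hypothetical per-residue LASSO coefficients (residue number -> coefficient);
# in the real pipeline these would come from the trained model
coefs = {1: -0.80, 2: 0.00, 3: 1.20}

def ca_line(serial, resname, resseq, b):
    """Minimal fixed-column PDB ATOM record for a C-alpha atom."""
    return (f"ATOM  {serial:5d}  CA  {resname} A{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b:6.2f}           C")

pdb_in = [ca_line(i, name, i, 20.0)
          for i, name in enumerate(["ALA", "GLY", "SER"], start=1)]

# Overwrite the B-factor field (columns 61-66) with each residue's coefficient
pdb_out = []
for line in pdb_in:
    resseq = int(line[22:26])   # residue number, columns 23-26
    b = coefs.get(resseq, 0.0)
    pdb_out.append(line[:60] + f"{b:6.2f}" + line[66:])

# In PyMOL, after loading the rewritten file:
#   spectrum b, blue_white_red
#   cartoon putty
print(pdb_out[2][60:66])  # "  1.20"
```

The two PyMOL commands in the final comment correspond to the blue-white-red color gradient and coefficient-magnitude cartoon thickness described in the Figure 1 caption.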
Interestingly, the Dynamical Signatures exhibit good complementarity with the other attributes used by Chen et al., which are the mean ∆∆G of folding calculated with Rosetta [17], the change in accessible surface area, and a vector of length 40 specifying the starting and mutated amino acid identities for the point mutation. The authors report a training coefficient of determination of R² = 0.55 when fitting a linear model with these input variables to all the available data. They do not report testing performance, so this R² of 0.55 can be seen as an upper bound for the performance of a linear model based on these properties. Since the dataset contains point mutations only, there can be no sequence redundancy between the training and testing sets. For this reason, we generated a random 80/20 train/test split for this application. The exact variants randomly picked for the testing set are available in the GitHub repository that accompanies the online DynaSig-ML tutorial (https://github.com/gregorpatof/dynasigml_vim2_example). When combining the Chen et al. attributes and the Dynamical Signatures, we obtain LASSO and MLP models reaching respective testing performances of R² = 0.61 and R² = 0.69. The enrichment factors at 10% (EF10%), values ranging from 0 to 10 that characterize the relative proportion of the top 10% measured variants recovered among the top predictions, are reported in Supplementary Table 1.

Supplementary information
Supplementary Figure 1 reports the testing performance of LASSO and MLP models when trained using the ENCoM Dynamical Signatures combined with the static predictors from Chen et al.
[1], which are the Rosetta ∆∆G of folding, the change in solvent accessible surface area, and 40 variables describing the starting and ending amino acid for the mutation (20 possibilities for each). In order to investigate the contributions of Dynamical Signatures and static descriptors alone, we also trained LASSO and MLP models using either one of these sets of variables alone. The testing performances are shown in Supplementary Figures 2 and 3. Interestingly, while the testing R² is lower for the DynaSig models (0.24 vs 0.52 for LASSO, 0.46 vs 0.58 for MLP), both EF10% values are identical (2.79 for LASSO and 3.26 for MLP). This led us to hypothesize that the models might be enriching the same variants as their top predictions; however, this is not the case. Supplementary Table 2 lists all variants that are in the top 10% for any of the tested models or for the experimentally measured fitness. Surprisingly, despite identical EF10% performances, the DynaSig and static models have few top predictions in common (14 out of 87 for the LASSO models, 29 out of 87 for the MLP models). This discrepancy in ranking the variants might explain the good complementarity of the two sets of features, especially when it comes to predictive R². Another striking finding is that the DynaSig models seem to benefit more from the power of the MLP, as R² almost doubles going from LASSO to MLP. This large gain in performance might be explained by the intrinsically nonlinear relationships between flexibility changes at different positions (the whole protein is a coupled system), which cannot be captured by linear regression.
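The EF10% metric used throughout these comparisons can be stated precisely in a few lines. A minimal sketch (the function name and synthetic data are illustrative, not the package's API):

```python
import numpy as np

def enrichment_factor(measured, predicted, top_frac=0.10):
    """Enrichment factor at the top `top_frac` fraction: the proportion of
    the truly top-ranked variants recovered among the top-ranked
    predictions, divided by top_frac. A random ranking gives ~1 in
    expectation; a perfect ranking gives 1 / top_frac (10 for EF10%)."""
    n = len(measured)
    k = max(1, round(n * top_frac))
    top_measured = set(np.argsort(measured)[-k:])
    top_predicted = set(np.argsort(predicted)[-k:])
    return len(top_measured & top_predicted) / k / top_frac

measured = np.arange(100, dtype=float)                      # synthetic fitness
perfect = measured.copy()                                   # perfect predictor
shuffled = np.random.default_rng(1).permutation(measured)   # random predictor

print(enrichment_factor(measured, perfect))   # 10.0
print(enrichment_factor(measured, shuffled))  # low; ~1 in expectation
```

Unlike R², this metric only rewards correct identification of the best variants, which is why two models with different R² can share an identical EF10%.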
When it comes to predicting experimental fitness for variants containing more than one mutation, the starting and ending amino acid vector used by Chen et al. cannot be used. However, from a biomolecular engineering point of view, variants containing multiple mutations are where computational methods really shine, as the number of possible variants grows exponentially with the number of mutations. The Dynamical Signatures, however, can be computed for any sequence variant as long as the assumption holds that the equilibrium structure does not change considerably. We have recently predicted maturation efficiency for miR-125a sequence variants containing up to 6 mutations [2]. In order to investigate what performance one could expect using only the static properties generalizable to variants with multiple mutations, we trained LASSO and MLP models using only the accessible surface area change and the predicted ∆∆G of folding. The testing performances are reported in Supplementary Figure 4; while the R² coefficients are fair (0.42 and 0.45 for LASSO and MLP), the EF10% values, at 1.86 for LASSO and 2.67 for MLP, are lower than with all other sets of predictors used. This illustrates the advantage of the Dynamical Signatures, as they are generalizable to multiple mutations and perform on par with the full static descriptors when it comes to EF10%.
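For concreteness, the length-40 amino acid descriptor of Chen et al. can be sketched as below (function name and mutation notation are illustrative); the construction makes plain why it is only defined for single point mutations:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutation_onehot(mutation):
    """Length-40 descriptor for a point mutation written like 'A57G':
    20 one-hot entries for the wild-type residue followed by 20 for the
    mutant residue. A variant with several mutations has no single
    wild-type/mutant pair, so this encoding cannot describe it."""
    wild_type, mutant = mutation[0], mutation[-1]
    vec = [0] * 40
    vec[AMINO_ACIDS.index(wild_type)] = 1
    vec[20 + AMINO_ACIDS.index(mutant)] = 1
    return vec

print(mutation_onehot("A57G"))  # two 1s: index 0 (Ala) and index 25 (Gly)
```

A Dynamical Signature, by contrast, always has one entry per position regardless of how many mutations the variant carries, which is what makes it generalizable.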
Both training and testing R² and EF10% values are provided in Supplementary Table 1 as a means of quick comparison between the different models. We obtain a similar training R² to Chen et al. using their static descriptors (0.54 for our implementation versus the reported 0.55), and the combination of Dynamical Signatures and static descriptors reaches training R² of 0.64 for LASSO and 0.96 for MLP.

Figure 1: ENCoM-DynaSig-ML pipeline applied to VIM-2 lactamase deep mutational scan data. The crystal structure of VIM-2 lactamase is used as a template to perform the 4343 point mutations with experimental fitness data using the MODELLER software [16]; all subsequent steps are performed using DynaSig-ML. For each of the in silico variants, a Dynamical Signature is computed with ENCoM. LASSO regression and multilayer perceptron models are trained using as input variables the Dynamical Signatures and the structural attributes used by Chen et al. as part of the VIM-2 deep mutational scan study [12]. In the case of the LASSO regression model, the independence of the input variables allows the mapping of the learned coefficients back onto the VIM-2 structure. The color gradient represents each coefficient, from blue for negative coefficients, through white for null coefficients, to red for positive coefficients; the largest absolute value coefficient has the brightest color. The sign of a coefficient captures the nature of the relationship between flexibility changes at that position and the experimental property of interest (in this case, evolutionary fitness): negative coefficients mean that rigidification of the position leads to higher fitness, while positive coefficients mean that softening of that position leads to higher fitness. The thickness of the cartoon represents the absolute value of the coefficients, i.e. their relative importance in the model.