ClustENMD: efficient sampling of biomolecular conformational space at atomic resolution

Abstract Summary Efficient sampling of conformational space is essential for elucidating functional/allosteric mechanisms of proteins and generating ensembles of conformers for docking applications. However, unbiased sampling is still a challenge especially for highly flexible and/or large systems. To address this challenge, we describe a new implementation of our computationally efficient algorithm ClustENMD that is integrated with ProDy and OpenMM softwares. This hybrid method performs iterative cycles of conformer generation using elastic network model for deformations along global modes, followed by clustering and short molecular dynamics simulations. ProDy framework enables full automation and analysis of generated conformers and visualization of their distributions in the essential subspace. Availability and implementation ClustENMD is open-source and freely available under MIT License from https://github.com/prody/ProDy. Supplementary information Supplementary data are available at Bioinformatics online.


Introduction
Mapping the conformational space of proteins has been a challenge, especially for large assemblies of complexes. Elastic network models (ENMs) and normal mode analysis (NMA) have proven to predict the global modes of motion of biomolecular systems, and particularly supramolecular machines in the last two decades, as shown in numerous comparisons with experimentally observed conformational changes (Bahar et al., 2010;Bakan and Bahar, 2009;Tama and Sanejouand, 2001). Thus, hybrid techniques combining ENM/NMA with molecular dynamics (MD) have come forth as computationally efficient means for elucidating transition pathways (Gur et al., 2013;Orellana et al., 2019) and for conformational sampling for large complexes (Costa et al., 2015;Kurkcuoglu et al., 2016), as we recently reviewed (Krieger et al., 2020).
ClustENM hybrid algorithm  has been introduced for unbiased sampling of the essential subspace spanned by the softest ENM modes through integration with clustering and energy minimization of conformers. Comparison with experimental data has shown the efficiency and utility of ClustENM for investigating highly flexible proteins like calmodulin (Kurkcuoglu and Doruker, 2016;Kurkcuoglu et al., 2016) as well as large assemblies such as the ribosome (Can et al., 2017;Guzel et al., 2020;Kurkcuoglu et al., 2016). More recently, ClustENM conformers have proven to facilitate protein-DNA and protein-protein ensemble docking (Kurkcuoglu and Bonvin, 2020) and prediction of cryptic allosteric pockets (Kaynak et al., 2020).
Such promising results have motivated us to further develop and implement the ClustENMD version in the widely used ProDy (Bakan et al., 2011;Zhang et al., 2021) application programming interface (API) via integration with OpenMM (Eastman et al., 2017) software. This version allows us to generate more realistic conformers by performing short MD simulations even for large allosteric complexes, together with high efficiency and full automation within a Python environment.

Methods and features
The ClustENMD algorithm is explained schematically in Figure 1A. In Step 1, the input structure is subjected to anisotropic network model (ANM) analysis to produce atomistic conformers using random deformations along linear combinations of a set of global ENM modes. In Step 2, the conformers are clustered based on their structural similarities, and a representative member is selected for each cluster. In Step 3, the representatives from the previous step are structurally relaxed by short MD simulations using OpenMM. The new conformers are then fed back to Step 1, each being used as a starting point for a new generation of conformers. This iterative procedure (Steps 1-3) is repeated for several generations to allow for sufficiently large excursions from the original energy minimum.
ClustENMD has the following features (see Supplementary Material including Tutorial for details): • Implemented as a class in ProDy • Integrated with OpenMM (Step 3) • Applicable to multimeric complexes/assemblies, comprising protein, RNA and/or DNA chains • Input structure either retrieved from the Protein Data Bank (Berman et al., 2000) (PDB) or provided by the user in PDB file format • Addition of hydrogens and any missing heavy atoms in the residues of the input structure by PDBFixer/OpenMM and/or in PDB/DCD format • Analysis of output conformers using diverse ProDy modules, e.g. ensemble analysis • Fully automated pipeline, from input PDB file to the generated ensemble of conformers • High-computational efficiency on GPU-architecture

Illustration
ClustENMD results for two case studies are presented in Figure 1 using simulations in implicit solvent model and heating up (HU) the system to 300 K. Figure 1B presents the population distribution for adenylate kinase (AK). AK is known to undergo a large conformational change (7 Å RMSD) between open (apo) and closed (substrate/inhibitor-bound) states. The contour plot corresponding to the population distribution of ClustENMD conformers is displayed on the two-dimensional (2D) space spanned by the two interdomain angles, namely LID-Core and NMP-Core (Beckstein et al., 2009). This plot is based on six independent runs, each comprising five generations (see Supplementary Table S1 for details on all systems/runs). Homologous experimental structures retrieved from the PDB using ProDy are shown on the same plot (black dots). ClustENMD conformers sample the two states as well as the transition region between them. See Supplementary Figure S1 for runs with more generations. Figure 1C displays the 2D space for hetero-dimeric HIV-1 reverse transcriptase (RT), a large enzyme (N ¼ 1000 residues). Here, the population distribution (contour plot) is projected onto the essential space of experimentally resolved structures. The axes denoted by the first two principal components (PC1 and PC2) are derived from the principal component analysis of 365 experimental structures (black circles) resolved for RT under different conditions (oligonucleotide/inhibitor-bound or -unbound). ClustENMD conformers (red circles) projected onto this space sample the close neighborhoods of most experimental structures (see Supplementary  Fig. S2 for other runs including those in explicit solvent). We also present in Supplementary Figure S3, the counterpart of Figure 1C for AK, i.e. the generated conformers projected onto the subspaces spanned by experimentally defined PCs. Supplementary Figures S4  and S5 further display the respective ensemble of RT conformers in the space spanned by fingers-thumb versus fingers-RNase H distances, and that of HIV-1 protease conformers projected onto PC1-PC2 subspace.
The high efficiency of ClustENMD is reflected by the average computing time for a 5-generation run that generates 300 conformers (presented in Fig. 1), which takes 8 min (AK, 214 residues) to 27 min (RT, 978 residues) on a single GPU platform with NVIDIA V R GeForce V R RTX 2080 Ti graphics card. For a comparison of computational efficiency and required resources, we refer to a recent enhanced sampling study on generating a detailed free energy landscape for AK using Gaussian accelerated replica-exchange umbrella sampling (Oshima et al., 2019); each of the 32 replica (of 2 Â 10 8 MD time steps) would require a couple of days on the same platform, as opposed to a total run time of 8 min for ClustENMD. We note that ClustENMD could be applied to the protein model refinement problem if the excursions/deformations from the initial model are restricted, possibly by adding restraints (Heo et al., 2021), in order to retain the initial model's conformational characteristics. It remains to be seen in future applications whether such an approach might help to efficiently sample conformations closer to the native structure.

Concluding remarks
The ClustENMD algorithm is implemented within ProDy (Bakan et al., 2011;Zhang et al., 2021), an open-source Python API for protein structure, dynamics and sequence analysis, containing multiple modules. The ProDy package (downloaded more than 2.1 million times and visited by 140 000þ unique users worldwide) ensures broad dissemination of ClustENMD to the research community in addition to providing accessory tools for analyses and visualization. The current version of ClustENMD is unique in performing unbiased sampling with high-computational efficiency, augmented by fully automated and user-friendly features upon integration with ProDy and OpenMM.

Funding
This work has been supported by the National Institutes of Health [P41GM103712 to I.B.]. HIV-1 RT. Conformational surface plotted along the first two PCs of experimental structures (black dots), onto which conformers and the initial structure (green circle) are projected. Cyan contours indicate the density levels of 903 ClustENMD conformers. The distributions in panels B and C were produced by kernel density estimate (KDE) using Gaussian kernels.