Prediction of pathogenic single amino acid substitutions using molecular fragment descriptors

Abstract Motivation Next Generation Sequencing technologies make it possible to detect rare genetic variants in individual patients. Currently, more than a dozen software and web services have been created to predict the pathogenicity of variants related with changing of amino acid residues. Despite considerable efforts in this area, at the moment there is no ideal method to classify pathogenic and harmless variants, and the assessment of the pathogenicity is often contradictory. In this article, we propose to use peptides structural formulas of proteins as an amino acid residues substitutions description, rather than a single-letter code. This allowed us to investigate the effectiveness of chemoinformatics approach to assess the pathogenicity of variants associated with amino acid substitutions. Results The structure-activity relationships analysis relying on protein-specific data and atom centric substructural multilevel neighborhoods of atoms (MNA) descriptors of molecular fragments appeared to be suitable for predicting the pathogenic effect of single amino acid variants. MNA-based Naïve Bayes classifier algorithm, ClinVar and humsavar data were used for the creation of structure-activity relationships models for 10 proteins. The performance of the models was compared with 11 different predicting tools: 8 individual (SIFT 4G, Polyphen2 HDIV, MutationAssessor, PROVEAN, FATHMM, MVP, LIST-S2, MutPred) and 3 consensus (M-CAP, MetaSVM, MetaLR). The accuracy of MNA-based method varies for the proteins (AUC: 0.631–0.993; MCC: 0.191–0.891). It was similar for both the results of comparisons with the other individual predictors and third-party protein-specific predictors. For several proteins (BRCA1, BRCA2, COL1A2, and RYR1), the performance of the MNA-based method was outstanding, capable of capturing the pathogenic effect of structural changes in amino acid substitutions. Availability and implementation The datasets are available as supplemental data at Bioinformatics online. A python script to convert amino acid and nucleotide sequences from single-letter codes to SD files is available at https://github.com/SmirnygaTotoshka/SequenceToSDF. The authors provide trial licenses for MultiPASS software to interested readers upon request.


Introduction
In 2021, the global market for the next generation sequencing (NGS) was estimated at $6.37 billion (https://www.preceden ceresearch.com/next-generation-sequencing-market), corresponding to approximately 7 million human genome-wide sequences. NGS allows to detect any genetic variant in an individual patient, serving as a bridge between contemporary and precision medicine. Those variants include single nucleotide polymorphisms (SNPs, also known as single nucleotide variants, SNVs), insertion and deletion, copy number variation and chromosomal rearrangement events. Since SNPs make up the majority variants [e.g. dbSNP contains over 1 billion records, (Sherry et al., 2001, https://www.ncbi.nlm. nih.gov/snp)], the greatest interest present nonsynonymous variants leading to a change in the amino acid (a.a.) sequence. Amino acids substitutions (single amino acid variants, SAVs), that lead to the disarrangement of normal functions of proteins, are deleterious, otherwise they are benign. Molecular consequences for some SNPs have annotations in specific databases (such as UniProt, HGMD, ClinVar, etc.). But most of missense SNPs related with SAVs have no experimental or/and clinical reports with annotations describing their pathogenicity. Thereby, predicting the SNP effect using computational tools is a common practice.
Such tools may differ in mathematical algorithms, the used characteristics of amino acids/nucleotide sequences, and in the scoring system. Two common strategies in developing pathogenicity predictors are the combination of properties driven by evolutionary/physicochemical consideration (i.e. individual tools) and aggregation of pre-existing pathogenicity predictors scores to produce new predictions (metapredictors) (Ö zkan et al. 2021). Individual tools are trained on large annotated data, various features may be in input vector including evolutionary (Carter et al. 2013, Wang et al. 2020, Won et al., 2021, conservation information (Li et al. 2009, Vaser et al. 2016, Malhis et al. 2020, and functional estimates (Capriotti et al. 2013, Niroula et al. 2015, Pejaver et al. 2020). In addition, diverse methods utilize properties of amino acids (Adzhubei et al. 2013, Reva et al. 2011, Qi et al. 2021) and proteins sequence context (Calabrese et al. 2009, Choi et al. 2012, Ló pez-Ferrando et al. 2017, Ancien et al. 2018). On the other hand, metapredictors (Jagadeesh et al. 2016, Kim et al. 2017) emphasize the adaptation of classification algorithms, such as Random Forests or Neural Networks, rather than leveraging unique features.
Despite the abundance of such software, the prediction of SAVs function impact remains a challenging and unsolved problem. The recent comparing of the methods (Ló pez-Ferrando et al. 2017) on selected number of genes has shown an inferior prediction accuracy compared to the average score obtained in the global test. The authors conclude that there exists the necessity to evolve specific predictors for protein families exhibiting nonstandard behavior. The decreased accuracy may be explained using heterogenic data in training models. Moreover, because all selected genes are associated with distinct diseases, the tools may be incompletely used in making clinical decisions. Protein-specific (PS) pathogenicity prediction strategy supposed data of specific gene/protein can complement current tools. Recent studies have shown the performance competitiveness of specific tools with nonspecific predictors (Torkamani and Schork 2007, Crockett et al. 2012, Riera et al. 2016, with a tendency to outperform them (Riera et al. 2016).
In chemoinformatics, such an approach is known as structure-activity relationship analysis (SAR), and it is successfully implemented in computer-aided drug design. Basically, SAR analysis describes the determination between a chemical structure represented in machine-readable formats and a biological activity of compounds (Tong et al. 2003, Guha 2013, Muratov et al. 2020. Besides, SAR models are usually built for a specific target without considering it directly. Since many targets and biological activities are sparsely studied, SARs are often applied to small or unbalanced datasets (Idakwo et al. 2020, Sakai et al. 2021. The main concept in this work is to consider amino acid sequences of peptides with centered SAV as structural formulas. Such descriptions may provide additional information for SAVs pathogenicity effect. Thus, we propose an approach for predicting the pathogenicity impact of SAVs in clinically relevant proteins consisting of PS classifiers trained on structural SAVs with the nearest amino acid neighbors representation.

Datasets and data preparation
The schematic diagram of the study is displayed in Fig. 1. The research object consists of 10 disease associated proteins [selected from the publication of Ló pez-Ferrando co-authors (Ló pez-Ferrando et al. 2017)] and their known missense variants from ClinVar (Landrum et al. 2018) and humsavar (http:// www.uniprot.org/docs/humsavar) databases dated August 2020. According to the ACMG Guidelines (Richards et al. 2015) classification used in ClinVar, "pathogenic" or "likely pathogenic" variants were indicated as "pathogenic" (1), likewise "benign" or "likely benign" composed as "benign" (0) class. All 2857 ClinVar selected records had minimum "one star" review status. The data from humsavar (1760 SAVs) was applied as previously classified (humsavar.txt file, released 26 February 2020). Subsequently, the overlap between two datasets was executed and conflicts in variants interpretation were resolved (depending on the last valid record date), the resulting 3820 SAVs, 841 benign and 2979 pathogenic, were recognized. Twenty records had conflicts in clinical interpretation, for most of them annotations from ClinVar were chosen. Only three Figure 1. The scheme of the study. The studied proteins are represented as the gene symbols. Annotated single amino acid variants (SAVs) were taken from ClinVar and Humsavar databases. The numbers under the arrows are the total number of SAVs. UniProt was used as a source of protein sequences. Based on protein sequences, peptides with different sizes were created (depending on the number of amino acids close to the central SAV). The models were trained only on the peptides with protein-related substitutions represented in structured-data (SD) format. Test sets predictions were made on corresponding training sets during the 5-fold cross validation procedure (5F-CV). For comparison, pathogenicity scores were obtained from the academic version of dbNSFP4.1 for 11 bioinformatic predictors. MNA, Multilevel Neighborhoods of Atoms descriptors.
variants had more recent entries in the humsavar database that were selected for the study. The general sizes of the datasets with the number of pathogenic and benign SAVs for each protein are represented in Table 1. The proteins had both pathogenic and benign SAVs, with minimum of 125 in a dataset, which made it possible to create more stable classification models. Since some chosen proteins have several known isoforms, the reference canonical sequences pursuant to SwissProt (The UniProt Consortium, 2021) were used.
The method concept of dataset creation for training and validation of SAR models is illustrated at Fig. 2. Firstly, every SAV was mapped to corresponding sequences of related protein and peptides with the length from 3 to 31 a.a. residues were obtained, respectively. The fragments had SAV in the center of peptides and 2-15 neighbors as natural amino acids. SAVs located at the edges of the protein shift from the fragment center. In summary, we had 15-peptide lengths for 10 selected proteins that were compiled to 150 structure-data files (SDFs). An SDF contains structural formulas of definite protein fragments described in MOL V3000 format, the SAV position, the coding protein gene name, and the effect of missense variant labels. Next, SDFs were divided into five training and five test datasets to make 5-fold cross-validation (5F-CV). Datasets generations, including fragments translation to structural formulas, were made with the original Python script. In the end, test sets contained unique peptides with the fixed length from distinct positions of the protein, similarly, training sets incorporated the remaining fragments. The division was taken into account sorting by the position label, resulting at least 100 records for training and 25 for test datasets, respectively. The datasets are avaible as CSV files in Supplementary materials.

PASS software and Multilevel Neighborhoods of Atoms (MNA) descriptors
To create classification models predicting the effect of SAVs, a special version of PASS (Prediction of Activity Spectra for Substances) software (MultiPASS version) was used. The PASS software was successfully implemented into several web applications to predict the biological activity of compounds based on their structural formula (Lagunin et al. 2000, Lagunin et al. 2013, Rudik et al. 2015, Lagunin et al. 2018. MultiPASS version was made specifically for amino acid and nucleotide sequences structure analysis. MultiPASS employs a uniform set of atom-centric substructural MNA descriptors for the representation of peptide structures and a modified Naïve Bayes classifier to model structure-activity relationships (Poroikov et al. 2000, Filimonov et al. 2014. Earlier this approach was successfully used to predict phosphorylation sites of proteins (Karasev et al. 2017) and epitope/MHC specificity for CDR3 TCR sequences (Smirnov et al. 2023). MNA descriptors are based on the molecular structure representation that includes hydrogen atoms in accordance with the valences and partial charges of atoms and does not specify bond types (Filimonov et al. 2014). MNA descriptors for each atom of the molecule are computed recursively as follows: the 0th level is the mark A of the atom itself (D 0 (A)¼ (À)A), where "À" is a mark added to nonring atoms, and any next level descriptor is the linear substructure notation D n (A) ¼ D n-1 (A)(D n-1 (B 1 )D n-1 (B 2 ). . .D n-1 (B k )), D nÀ1 (B i ) is the (n À 1)-level MNA descriptor for atom A's i-th immediate neighbor B i . According to the notation, descriptors include various distanced atom neighbors, and higher levels include previous ones (Fig. 3). Since the original PASS deals with small molecules, it uses 1-2 levels of MNA descriptors. In turn, the main feature of MultiPASS is the calculation of a higher-level of MNA descriptors (up to 15), which provides a possibility to describe the structure of macromolecules such as peptides.  For each predictable "activity", the PASS algorithm calculates two probabilities (Pa and Pi) for the studied compound. Pa (probability "to be active") and Pi (probability "to be inactive") estimates the belonging of the predicted compound to classes of active and inactive compounds, respectively (Filimonov et al. 2014). We used (Pa-Pi) > 0 as a final score for the pathogenic class in the predictions. Thus, models were built on diverse peptides lengths and MNA-level combinations.

Approach estimation
All protein classifiers were trained according to the 5F-CV procedure using from 1th to 15th MNA levels, which implies each one-fifth part of the dataset was used as an independent test set and each four-fifth part was used as a training set (Fig. 2). The predictions on test sets were summarized to compare the quality of models as the area under the receiver operating characteristic curve (AUC) and Mathew's correlation coefficient (MCC) measures, as well as sensitivity, specificity, and balanced accuracy metrics.
where MCC is the Matthew Correlation Coefficient MCC, TP is the true positive, TN is the true negative, FP is the false positive, and FN is the false negative.
To make a comparative test between the MNA-based approach and several existed bioinformatic predictors, the prediction scores were taken from dbNSFP4.1a (Liu et al. 2020). The academic version of dbNSFP4.1 contains transcriptspecific functional predictions scores and interpretations for human nonsynonymous and splice-site SNPs (Liu et al. 2020) made by the computational tools. We chose the scores belonging to protein canonical isoforms. The total number of SAVs is represented in Table 1. The cut-off points of appropriate methods were used for AUC and MCC calculation. The assessment metrics were gained with the original Python script and Scikit-learn library (Pedregosa et al. 2011).

MNA-based method performance
Overall, 150 datasets and 11 thousand SAR models were built. We studied how the length of peptides and the level of MNA descriptors influenced the classification models accuracy during 5F-CV procedure. It appeared that the models differ over a wide range of AUC and MCC values (Fig. 4,  Supplementary Fig. S1). Variants effects in some proteins (e.g. ATM and CFTR) are poorly predicted by the structural description, as most of those models have the AUC parameter  below 0.7. Nevertheless, several individual protein models have a better predictive power. A large range of AUC scores for FBN1 was given for SAR models created on different peptide length and levels of MNA descriptors. It may be caused by the predominance of pathogenic over neutral substitutions in the dataset (Table 1). During the study, we could not identify certain peptide length or MNA level parameters for the method which would lead to best SAR models on all proteins ( Supplementary Fig. S1). Instead, for each protein we selected the combinations of the peptide length and levels of MNA descriptors leading to the highest values of AUC. The list of proteins associated with diseases and best models' evaluation characteristics is given in Table 2. Self-recognition and the reference peptide tests were also made (Supplementary Table  S1). The models perform with an approximately 100% accuracy in the self-recognition test. In the reference peptide test, the benign labels for unmutated peptides (randomly generated from the reference sequence of the protein) were predicted with a similar accuracy tendency as for 5F-CV results. We studied the correlation between AUC values and the size of training sets, proportions of pathogenic and benign variants. There appeared to be no correlation between AUC values of models and the number of variants or their proportions in the training sets. (Supplementary Fig. S2). According to superior mean values of AUC and MCC scores, we chose models with optimal parameters for every protein, mainly 9-13 levels for MNA descriptors and 15-31 for peptide length. In case of close scores, we chose classifiers built on lower parameters ( Table 2, the parameters are put in Table 3).
We matched the MNA-based scores with the original datasets and related them to the primary structure with applied domains from Pfam database (Mistry et al. 2021), based on the calculation of SAV frequency in 100 a.a. sliding window ( Supplementary Fig. S14). For each position we calculated the number of pathogenic or benign variants in the window, the obtained values were divided by 100. Supplementary Fig. S3 shows that for some proteins the distribution of known pathogenic and benign SAVs varies along positions. For example, BRCA1 sequence has many pathogenic variants on the edges, and most benign variants are located in other parts of the protein. For proteins with such a controversial distribution of pathogenic and benign variants, we got classification models with the highest value of accuracy. In addition, AUC values of the models built on wholly related datasets (Tables 3 and 4) a AUC, area under the receiver operating characteristic curve; 5F-CV, AUC obtained in five cross-validation procedure; LOO-CV, AUC obtained in leaveone-out validation procedure; Sen., sensitivity; Spec., Specificity; BA, balanced accuracy; MCC, Matthew correlation coefficient; ATM, ataxia-telangiectasia mutated; CPR, cancer-predisposing syndrome; TCR, transmembrane conductance regulator. Prediction of pathogenic single amino acid substitutions using molecular fragment descriptors 5 were calculated by the leave-one-out cross-validation (LOO-CV) and the 20-fold validation (Supplementary Table S1, Supplementary Fig. S2C) procedure during the training. Analyzing AUC 5F-CV and AUC LOO-CV values offered no significant differences between them (Mann-Whitney test, P > 0.05). All of the statements above suggest the PS models are robust. We also made a 3D-charts for the PS models performance, where the area of optimal parameter values for a particular protein can be seen (Supplementary Figs S4-S13 in Supplementary Materials), the bumps on the plane reflect the moments when the MNA levels do not capture adjacent a.a. residues, which leads to a decrease in AUC.

Comparison with individual and consensus methods
As mentioned previously, our total dataset contains 3917 unique amino acid substitutions, whose gene annotations were downloaded from dbNSFP4.1a (Liu et al. 2020) using search_dbNSFP41a.jar (Table 1). Then, for the canonical isoforms, the prediction scores for individual tools such as SIFT 4G (Vaser et al. 2016 (Kim et al. 2017) were fetched. Although some of the methods lacked annotations for a couple of proteins, the overall dataset coverage was sufficient. The indicated coverage was calculated on the full dataset. For comparison, we used one model for each protein in accordance with the optimal combination of AUC and MCC values in the 5F-CV procedure.
Generally, the individual predictors demonstrated acceptable results of the classification. MNA-based predictions showed no significant differences in the pairwise MCC comparison with methods other than FATHMM (Student's t-test, P < 0.05). While SIFT 4G and Polyphen2 got the highest averages, in the protein scope our approach and PROVEAN yielded most of the highest scores (Table 3). MutPred, that predicts the molecular basis of diseases, also achieved a great accuracy, but the poorest coverage may indicate a decreased applicability domain (Table 4). As expected, metapredictors have prevailed mean AUC, in some conditions give a lower prediction power in MCC terms. Since MNA-based scores were obtained in the 5F-CV procedure, and it is not known whether dataset variants were used in the rest predictors trainings, the results of other methods may be overestimated in comparison with the proposed method. Nevertheless, MNAbased models for some proteins received the highest accuracy (Fig. 5, Tables 3 and 4). Remarkably, PROVEAN is also focused on predicting the functional effect of amino acid substitutions using sequence and evolutionary features, and in the case where it failed, the proposed structure-based method worked well. It should be also noted that FATHMM, PROVEAN, MVP, and M-CAP derive inconsistent estimates (MCC 0), despite the rather high average AUC. Therefore, researchers should choose an individual method carefully and then annotate specific SAVs.
We observed the cases in which the MNA-based method gave the correct prediction results, while other methods failed. Some of these variants are given in Table 5. To explain the  results, we determined the localization of these substitutions in the proteins and examined the features of their environment using UniProt feature viewer ( Supplementary Fig. S14). It turned out that all the substitutions were located in specific areas, such as the functional region ( Supplementary Fig.  S14A), the disulfide bond ( Supplementary Fig. S14B) or the helix (Supplementary Fig. S14C). Notably, pathogenic SAVs in BRCA2 and LDLR are mapped within unique peptides, whereas the benign SAV in CFTR is outside. In the first case isoleucine replaces methionine and both these residues belong to the nonpolar (hydrophobic) a.a. group, therefore such a substitution is considered conservative. The same picture is observed for Arg to Lys (both residues belong to the group of positively charged amino acids) and Ile to Phe [both residues belong to the group of nonpolar (hydrophobic) amino acids] substitutions. It is likely that our approach better captures changes in the structural properties of such areas.

Discussion
In this study, we have examined the feasibility of utilizing the structure-activity paradigm to predict the pathogenic effect of SAVs. For this purpose, 10 proteins associated with various pathological conditions in humans were selected. The choice was made based on the total number of the clinically annotated variants (minimum 125 SAVs) along with the availability of comparative assessment provided by other methods. In cheminformatics, even 10 structural formulas of molecules may be enough to create reasonable SAR models. The classification models were trained with variable parameters of the MNA descriptors levels and the peptide frame of the substituted a.a. residues and their surroundings; the search for the optimal parameters was carried out.
Finally, the 5-fold CV prediction results of the best models were compared with eleven computational prediction algorithms: eight individual [SIFT 4G (Vaser et al. 2016 (Kim et al. 2017)]. The MNA-based approach showed the similar predictive accuracy with the best of them in terms of AUC and MCC, albeit for a limited set of subjects under study. MNA descriptors have proven capable of fully describing large linear molecules such as peptides. Practically, by utilizing the properties of the primary structure alone, it was possible to achieve a comparable accuracy with other individual methods using the properties of amino acid sequences (Adzhubei et al. 2013, Qi et al. 2021). This can be explained by the relations of the secondary, tertiary, and primary structures properties. The obtained values are in agreement with the PS predictors of other authors (Crockett et al. 2012, Riera et al. 2016, although a direct comparison could not be provided. According to 5F-CV, LOO-CV, 20F-CV results and dataset size versus accuracy (Supplementary  Table S1, Supplementary Fig. S2C), it can be concluded that MNA-based performance does not correlate with the number of pathogenic variants in the training set. This can be seen as an advantage of the SAR approach as for many proteins there are only a few clinically annotated SAVs. The described results were achieved without the direct use of information about 2D or 3D structures of proteins, as well as the alignment or search for homologues. As already mentioned, we tested the capabilities of the SAR approach, so we deliberately did not use any evolutionary information, which is reported by many studies to make a significant contribution to predictions of pathogenic effect (Capriotti et al. 2013, Niroula et al. 2015. In perspective, the MNA-based approach may be improved by suppling additional features such as the evolutionary conservation, data on regulatory and binding sites, 2D and 3D structures of proteins.