Abstract

Motivation

Human immunodeficiency virus (HIV) drug resistance is a global healthcare issue. The emergence of drug resistance influenced the efficacy of treatment regimens, thus stressing the importance of treatment adaptation. Computational methods predicting the drug resistance profile from genomic data of HIV isolates are advantageous for monitoring drug resistance in patients. However, existing computational methods for drug resistance prediction are either not suitable for emerging HIV strains with complex mutational patterns or lack interpretability, which is of paramount importance in clinical practice. The approach reported here overcomes these limitations and combines high accuracy of predictions and interpretability of the models.

Results

In this work, a new methodology based on generative topographic mapping (GTM) for biological sequence space representation and quantitative genotype–phenotype relationships prediction purposes was introduced. The GTM-based resistance landscapes allowed us to predict the resistance of HIV strains based on sequencing and drug resistance data for three viral proteins [integrase (IN), protease (PR) and reverse transcriptase (RT)] from Stanford HIV drug resistance database. The average balanced accuracy for PR inhibitors was 0.89 ± 0.01, for IN inhibitors 0.85 ± 0.01, for non-nucleoside RT inhibitors 0.73 ± 0.01 and for nucleoside RT inhibitors 0.84 ± 0.01. We have demonstrated in several case studies that GTM-based resistance landscapes are useful for visualization and analysis of sequence space as well as for treatment optimization purposes. Here, GTMs were applied for the in-depth analysis of the relationships between mutation pattern and drug resistance using mutation landscapes. This allowed us to predict retrospectively the importance of the presence of particular mutations (e.g. V32I, L10F and L33F in HIV PR) for the resistance development. This study highlights some perspectives of GTM applications in clinical informatics and particularly in the field of sequence space exploration.

Availability and implementation

https://github.com/karinapikalyova/ISIDASeq.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

The global epidemic of human immunodeficiency virus (HIV) infection and acquired immunodeficiency syndrome (AIDS) is one of the major public health issues, affecting more than 38 million people worldwide (UNAIDS, 2020). The primary causative agent of HIV infection is the Human immunodeficiency virus 1 belonging to the Lentivirus genus, Retrovidae family (Knipe and Howley, 2013). The virus induces a progressive weakening of the immune system and if untreated, leads to a rise of opportunistic infections and, subsequently, death. The development of a preventive vaccine remains a challenge, and, for now, the treatment of the HIV/AIDS is based on the usage of antiretroviral therapy (ART; Ananworanich, 2015; Pavlakis and Felber, 2018). Since its introduction, ART has significantly extended life expectancy and improved the quality of life of people diagnosed with HIV infection (Knipe and Howley, 2013).

Currently recommended ART consists in the treatment by a combination of inhibitors of HIV enzymes including reverse transcriptase (RT), integrase (IN) and protease (PR; WHO, 2018). Despite being highly effective in many cases, ART failure can occur for some individuals due to the emergence of resistance against one or more inhibitors (Iyidogan and Anderson, 2014). Drug resistance testing is, therefore, essential for the selection of an optimal ART regimen (Günthard et al., 2018). The drug resistance can be assessed by either phenotypic or genotypic assays. Phenotypic assays provide information about the inhibitory activity of drugs against a particular virus variant in in vitro cell-based experiments. Although these assays provide information essential for establishing genotype–phenotype correlations, they are labor- and time-consuming, which restricts their use in routine clinical diagnostics. The genotyping assays, on the other hand, are based on genome sequencing, which is accessible, fast and inexpensive, but requires an accurate interpretation system allowing one to predict the resistance profile for the particular virus variant. Due to the rapidity and availability of sequencing, a large amount of sequence data was accumulated in public repositories, and interpretation systems based on various computational methods for predicting drug resistance were developed to assist in clinical diagnostics.

One of the largest and commonly used public HIV drug-resistance databases is the Stanford HIV drug resistance database (HIVDB; Rhee et al., 2003). It comprises more than 400 thousand sequences of HIV-1 proteins and over 60 thousand drug resistance profiles for RT, PR and IN proteins. Numerous algorithms were suggested for the interpretation of mutational patterns present in these data and the prediction of the mutations’ influence on drug resistance (Vercauteren and Vandamme, 2006). These algorithms can be broadly classified into rule based and machine-learning (ML) based. The rule-based approaches comprise a set of predefined rules according to which the variant is defined as resistant or susceptible. They are typically derived from published data on mutations associated with drug resistance and correlations between treatment regimen and virological response data (Vercauteren and Vandamme, 2006). Although, rule-based methods are still widely used due to their interpretability, they require a continuous update of the existing rules and are not very accurate when complex mutational patterns are present in sequences (Singh, 2017). Alternatively, ML algorithms can be used to overcome the aforementioned problems of rule-based approaches. Various ML algorithms were applied for HIV drug resistance profiling, including linear regression (Rhee et al., 2006; Yu et al., 2014), support vector machines (SVM; Khalid and Sezerman, 2018; Masso and Vaisman, 2013; Rhee et al., 2006), decision trees (Beerenwinkel et al., 2002; Rhee et al., 2006), random forest (RF; Shen et al., 2016; Tarasova et al., 2018), artificial neural networks (Pasomsub et al., 2010; Rhee et al., 2006; Sheik Amamuddy et al., 2017; Steiner et al., 2020; Wang and Larder, 2003) and Bayesian approaches (Tarasova et al., 2017). They allowed achieving better accuracy of drug resistance predictions as compared to rule-based methods.

Despite the high accuracy of resistance predictions, most ML algorithms lack intuitive interpretability inherent to rule-based methods. The ideal algorithm would need to combine high predictive accuracy and intuitive interpretation of the results of the predictions. One way to get an interpretable ML system is to combine the visualization of data with predictions that would be consistent with the results of the visualization. To represent the sequence data, dimensionality reduction methods can be used. In such approaches, biological sequences are encoded as objects in a multidimensional space and then projected to 2D or 3D spaces. In recent years, several dimensionality reduction algorithms such as: principal components analysis (PCA; Hotelling, 1933), self-organizing maps (SOMs; Kohonen, 1982), generative topographic mapping (GTM; Bishop et al., 1998), etc. were applied for sequence space analysis. The resulting maps obtained after applying aforementioned dimensionality reduction methods can be ‘coloured’ by any property of the items it hosts (e.g. the map hosting PR sequences can be coloured by associated drug resistance profiles). Hence, the ‘property landscapes’ are produced—the local ‘colour’ representing the mean of experimental properties of the items residing in the given point or zone of a map. Depending on the specific hypotheses and the pertinence/information-richness of the ‘training set’ of items used to create such landscapes, these can be used as predictive tools—by assuming that the unknown property of a new item can be read from the ‘colour’ of the landscape zone in which it resides. Regarding HIV, SOM and kernel PCA were applied for visual analysis of the HIV mutant sequences’ space (Drǎghici and Potter, 2003; Ramon et al., 2019).

Among other dimensionality reduction techniques, GTM seems to be particularly useful, as it has a rather unique ability to be both a visualization tool and a multitask quantitative predictive model. GTM was extensively studied for chemical space analysis and showed superior visualization ability and prediction accuracy in tasks related to quantitative structure–property relationships modelling as compared to other dimensionality reduction algorithms (Gaspar et al., 2016).

Herein, the GTM method was used for building interpretable maps of HIV sequence space and predicting HIV drug resistance profiles. We focussed on IN, PR and RT inhibitors due to the availability of sufficient data [fold ratio (FR) value—sequence of mutated protein pairs] in the databases. For most of the drugs, the balanced accuracy of GTM-based resistance predictions was comparable or slightly lower (by 0.01–0.03) than that of popular state-of-the-art ML algorithms: RF, SVM and gradient boosting (GB).

Additional analysis of GTMs allowed us to detect mutation patterns leading to the resistance. For this purpose, an interactive tool for sequence space exploration was developed and applied to the HIV drug resistance analysis. The developed approach enables one to work with high-dimensional genomic and proteomic data, all along by ensuring the rapid building of interpretable models. It does not only overcome the interpretability problem inherent to ML methods but also provides a general methodology applicable for various tasks related to phenotypes prediction. Hence, we believe that this approach will be useful in numerous tasks related to the analysis and exploration of sequence space and modelling of quantitative genotype–phenotype relationships (QGPR).

2 Materials and methods

2.1 Data acquisition and preprocessing

The HIV protein sequence data with associated resistance profiles accumulated in large databases provide the basis for computational drug resistance predictions. The resistance is expressed by FR value
(1)
where IC50(D)mutant—drug concentration at which the replication of the mutant virus strain is inhibited by 50%, IC50(D)ref—drug concentration at which the replication of highly susceptible reference virus strain is inhibited by 50%.

The above ratio is specific for each anti-HIV drug D, and these drugs were each targeted at a specific HIV protein. The challenge of this work is to predict FR(D) as a function of the mutations affecting the viral protein against which D was designed for (its primary target). In principle, mutations affecting other viral proteins might potentially indirectly impact on the effectiveness of the drug D, but the analysis of such effects, if ever observed, is beyond the purpose of the current paper. Specific models will thus be built for various drugs binding HIV RT, IN and PR sequences. Associated FR values from high quality filtered HIVDB genotype–phenotype dataset were used for modelling quantitative genotype–resistance relationships, where ‘genotype’ in this context designs the (mutant) protein sequences, the various ‘phenotypes’ being FR(D) for each drug binding to the target. Overall, 1707, 659 and 1958 mutant HIV RT, IN and PR sequences, respectively, with corresponding FR values measured for six RT, two IN or eight PR inhibitors were extracted from the HIVDB genotype–phenotype dataset. Each mutant protein sequence in the database was represented as a set of amino acid substitutions in certain positions in regard to the reference sequences. These mutant protein sequences were then completed by original non-mutated amino acid residues from the consensus sequence to form viral protein chains. The number of amino acid residues was 99 per HIV PR monomer, 288 for the HIV IN and 240 for the HIV RT. Although the p66 subunit of HIV RT comprises 560 amino acid residues, the filtered data on mutations and associated resistance profiles was available only for the first 240 amino acid residues of the RT p66 domain and, hence, only this data was used for modelling. Since the first three N-terminal amino acids were not systematically reported for all PR sequences, they were ignored altogether. Some of the sequences contained non-standard symbols, such as the ‘.’ symbol denoting unknown amino acid, ‘X’ symbol denoting a mixture of four and more amino acids, symbols denoting the presence of two or more amino acids at the same position (‘mixtures of amino acids’, e.g. ‘A/S’ for alanine–serine mixture), etc. The amino acid mixtures arise due to the noise in the experimental data or the coexistence of several HIV variants in a single patient’s virus sample. The presence of these amino acid mixture symbols indicates ambiguities in sequence data and can confound QGPR modelling (Ramon et al., 2019). Several approaches were suggested for the processing of such sequences. In a majority of approaches the sequences with symbols denoting insertions and deletions, unknown amino acids and mixtures of four and more amino acids (‘X’ symbol) were deleted (Masso and Vaisman, 2013; Ramon et al., 2019; Yu et al., 2014). By contrast, the strategies for processing the sequences with amino acid mixtures were diverse. Tarasova et al. (2018) addressed this problem by verifying the frequency of occurrence of amino acids from mixtures in the set of corresponding positions in the examined enzymes of other drug-resistant HIV-1 variants and keeping the most frequent one. Masso and Vaisman (2013) deleted all the sequences with the mixtures of amino acids. In contrast to the latter approach, the technique leading to the expansion of the dataset via combinatorial enumeration of all possible variants of the sequences containing mixtures was also used (Sheik Amamuddy et al., 2017; Shen et al., 2016; Yu et al., 2014). Ramon et al. (2019) suggested applying various kernels for handling amino acid mixtures, hence maintaining the authenticity of the data and reducing the chance of incorrect mutation patterns being introduced.

Taking into account the aforementioned information, all sequences containing symbols, which did not denote 20 standard amino acids were removed from the dataset. HIV IN sequences containing amino acid mixtures located in major drug resistance mutation positions—positions that, according to current knowledge (Rhee et al., 2003), are associated with significant changes in drug resistance (DRM positions), were removed. HIV PR and RT sequences of high quality filtered HIVDB genotype–phenotype dataset did not contain any mixtures in DRM positions. Amino acid mixture symbols not located in major drug resistance mutation positions were replaced by the symbol of the first amino acid from the mixture symbol (i.e. a mixture such as ‘A/S’ was replaced by the ‘A’ symbol). Moreover, only sequences corresponding to the most studied HIV subtype B were kept to increase the prediction performance for this subtype. Finally, 1581 sequences with known FR for PR inhibitors, 510 sequences with available FR for IN inhibitors and 1389 sequences with known FR for RT inhibitors were prepared. Classes ‘resistant’ and ‘susceptible’ were assigned to each of these sequences with respect to every drug based on the corresponding FR values and clinical thresholds extracted from HIVDB (Supplementary Table S1). The GTM-based cartography includes as a first step the unsupervised construction of a map spanning the relevant space of possible protein sequences as defined by a pool of representative items (the ‘frame set’) (Lin et al., 2020). The latter do not need to be annotated by FR—therefore, the frame set was expanded to include also observed mutant sequences for which FR values were not yet reported for the studied drugs. Separate frame sets for RT, IN and PR sequences were composed from the corresponding labelled sequences and unlabelled sequences from genotype-treatment dataset from HIVDB. In such a way, the individual frame sets consisted of 6591 sequences for IN, and 5000 sequences for PR and RT. The enrichment of the dataset with the randomly selected unlabelled sequences was performed to ensure more uniform coverage of the mutant HIV proteins sequence space. Since additional unlabelled sequences are genetically different from the labelled ones, the enriched dataset better delineates the sequence space of HIV mutants. The workflow used for the data preprocessing is shown in Supplementary Figure S2.

2.2 Descriptors

A prerequisite for applying most ML algorithms is encoding amino acid sequences into numeric vectors (Zamani and Kremer, 2011). Herein, various types of encoding schemes were used: k-mers (subsequences of amino acids of length k), one-hot encoded vectors, vectors derived from BLOSUM and HIVb amino acid substitution matrices. Reduced amino acid alphabets, combining amino acids by their physico-chemical properties, and allowing to simplify protein complexity (Zheng et al., 2019) were also applied. In addition to that, descriptors transformed with Principal Component Analysis (PCA) and kernel PCA were generated. The scheme of the process used for sequence descriptors generation is shown in Supplementary Figure S3. Presentation of the descriptors used in this work is given in Supplementary Appendix S1.

2.3 Generative topographic mapping

GTM is a nonlinear dimensionality reduction method that was introduced by Bishop et al. (1998). The GTM method is based on the injection of a flexible hypersurface (the ‘manifold’) into the high-dimensional data space formed by the given data points. In the current work, each data point corresponds to an HIV protein sequence encoded by a numerical vector of a fixed length (Supplementary Fig. S3). The manifold is trained in such a way to pass through the densest regions of frame set data space. Once the manifold is obtained, the data points are projected onto it. A 2D map results from the manifold unbending. Note that GTM considers each sequence not only as a single point but also as a probability distribution in its latent space (i.e. a probability to find the given sequence residing in the given node of the map). This allows one to create a cumulative landscape obtained by overlapping the probability distributions of all the training items (Kireeva et al., 2012). The latter can then be ‘coloured’ by associating to each node of the map the mean property/activity class of sequences residing in it, hereby generating fuzzy classification or continuous property landscapes. These landscapes can be used as predictive models at the same time providing data visualization. To address the class imbalance problem, probabilities of classes at nodes can be reweighted by the number of members of each class according to the Bayesian class normalization procedure. The GTM algorithm applied to amino acid sequences space is described in Figure 1 and its details are given in Supplementary Appendix S2.

Key steps of the GTM algorithm applied to amino acid sequence space. Each amino acid sequence is encoded by a numerical vector (1) defining its position in the N-dimensional descriptor space (2). The flexible manifold is fitted in a way to approach the data points followed by projection of the data points onto the manifold (3). A 2D map results from the unbending of the manifold. Each projected data point is characterized by a probability to be located in the nodes of a rectangular grid superposed with the manifold. (4) Each node is then associated (‘coloured’) with a weighted average of resistance values of residing sequences. Ensemble of the coloured nodes forms the resistance landscape (5)
Fig. 1.

Key steps of the GTM algorithm applied to amino acid sequence space. Each amino acid sequence is encoded by a numerical vector (1) defining its position in the N-dimensional descriptor space (2). The flexible manifold is fitted in a way to approach the data points followed by projection of the data points onto the manifold (3). A 2D map results from the unbending of the manifold. Each projected data point is characterized by a probability to be located in the nodes of a rectangular grid superposed with the manifold. (4) Each node is then associated (‘coloured’) with a weighted average of resistance values of residing sequences. Ensemble of the coloured nodes forms the resistance landscape (5)

Here, a separate manifold was built for each considered viral protein (IN, PR, RT). For a given manifold, several classification landscapes (models), each corresponding to a particular drug, were prepared. For example, eight landscapes predicting resistance for different drugs were prepared for the GTM manifold for HIV PR.

GTM has four hyperparameters: map resolution, number of hidden Radial Basis Functions (RBF), regularization coefficient and width of an RBF. Along with descriptors type, the best GTM hyperparameter values for the given modelling task must be found. For this purpose, genetic algorithm (GA) is applied.

2.4 Genetic algorithm

The GA is a stochastic evolutionary optimization algorithm adapted for parameter optimization problems. In our study, the GA was used as a method for selecting the optimal descriptors and hyperparameters of the GTM (Horvath et al., 2014). To build the GTM manifold, a set of ≥5000 HIV protein sequences (frame set; see the description above) is used. The selection of appropriate descriptors and the GTM hyperparameters was carried out according to the evaluation of the model’s predictive performance. The latter was estimated by an averaged balanced accuracy (average proportion of sequences predicted correctly for each class: susceptible and resistant) for each drug in a 5-folds cross-validation procedure
(2)
where TPf is the number of truly resistant sequences predicted as resistant in the fold f, TNf is the number of truly susceptible sequences predicted as susceptible in the fold f, FNf is the number of truly resistant sequences predicted as susceptible in the fold f and FPf is the number of truly susceptible sequences predicted as resistant in the fold f.

The best maps are thus the ones which are able to host, on their one manifold, a maximum of highly predictive resistant/susceptible binary classification landscapes (corresponding to the different drugs associated to that protein). Using a common manifold to model drug resistance of various drugs not only enhances the interpretation (by providing a common reference system to navigate the sequence space) but also presents the inherent benefits of multitask learning (Lin et al., 2019). However, different protein targets require distinct dedicated maps, each covering a given sequence space (of IN, PR and RT, respectively).

2.5 Comparison to other machine-learning methods

The GTM’s drug resistance predictive ability was compared with other state-of-the-art ML algorithms implemented in the scikit-learn library (v. 0.23.1; Pedregosa et al., 2011). The following hyperparameters were tuned during optimization (grid search):

  • RF (Breiman, 2001): number of trees (100, 300, 500, 1000), number of features (all features, squared root of the number of features, log2 of number of features), out-of-bag sampling (with and without), max depth of trees (full tree, 5, 10, 30), class weight (none, balanced, balanced_subsample);

  • SVM (Boser et al., 1992): regularization coefficient (0.1, 1, 10, 100, 1000), kernel coefficient (1, 0.1, 0.01, 0.001, 0.0001), kernel (‘rbf’, ‘linear’, ‘poly’, ‘sigmoid’);

  • GB (Friedman, 2001): number of trees (100, 300, 500, 1000), number of features (all features, squared root of the number of features, log2 of number of features), learning rate (0.0001, 0.001, 0.01, 0.1, 1.0), subsampling (0.5, 0.7, 1.0) max depth of trees (full tree, 5, 10, 30).

Evaluation of the model performance was made using 5-fold cross-validation repeated five times (the same as in the case of GTM).

2.6 Analysis of resistance determining mutation patterns

The combination of mutations in certain amino acid positions (mutation pattern) defines the position of the sequence on the map. Hence, one can extract sequences residing in the particular node or a group of nodes (a zone) and determine which particular mutation pattern is prevalent in these sequences. This analysis can be performed using the algorithms for the identification of specificity determining positions. In this work, the SDPred algorithm v.2 (Kalinina et al., 2009) was used. This algorithm is based on the calculation of the position-wise mutual information in two or more groups of aligned sequences. It allows identifying positions in which the amino acid distribution is significantly different between groups of sequences. The algorithm was applied in combination with the GTM: sequences residing in the specific node or zone of the GTM were considered as the target group and all other sequences projected onto the map in other nodes were considered as another group.

The graphical representation of aligned amino acid sequences (i.e. sequence logo) was performed with Logomaker python library (Tareen and Kinney, 2019). The sequence logo representation is composed from stacks of symbols that reflect the presence of particular amino acids in a specified position in a group of aligned sequences. The height of the amino acid symbols within the specified stack (position in a sequence) represents the frequency of occurrence of a certain amino acid in a group of aligned sequences.

3 Results and Discussion

3.1 Cartography of HIV proteins sequence space and drug resistance profiling

The GTMs for PR, RT and IN inhibitors represent frameworks for visualization of mutant HIV enzymes sequence space and for prediction of their resistance to drugs according to HIVDB thresholds (Fig. 2). The maps reflect the density distribution of mutant enzymes that are encoded by different encoding schemes. The colour of the maps indicates the mean resistance profile of the sequences residing in a given node of the map. The level of transparency reflects the number of the sequences located in the map. The maps with the highest prediction performance were constructed based on three types of descriptors: 4-mers for PR, one-hot encoded vectors for IN and PCA-transformed 1–3-mers for RT.

Resistance landscapes for (A) two PR inhibitors, (B) two nucleoside RT inhibitors, (C) two non-nucleoside RT inhibitors and (D) two IN inhibitors. Each node is coloured by a weighted average of drug resistance profiles of the residing sequences. Red zones are occupied by the resistant sequences, while the blue zones contain susceptible sequences. All colours in between correspond to mixed zones containing both of them. The transparency reflects how many sequences resided in the particular node
Fig. 2.

Resistance landscapes for (A) two PR inhibitors, (B) two nucleoside RT inhibitors, (C) two non-nucleoside RT inhibitors and (D) two IN inhibitors. Each node is coloured by a weighted average of drug resistance profiles of the residing sequences. Red zones are occupied by the resistant sequences, while the blue zones contain susceptible sequences. All colours in between correspond to mixed zones containing both of them. The transparency reflects how many sequences resided in the particular node

The map for nelfinavir HIV PR inhibitor shows a clear separation between resistant and susceptible PR sequences located on the map (Fig. 2A). The comparison of this map to the one for tipranavir reveals common regions that are coloured differently, i.e. the same zones on the map are coloured in red for nelfinavir, whereas being blue for tipranavir. This highlights the broader spectrum of mutant HIV variants that are susceptible to treatment with tipranavir. Tipranavir is a non-peptidomimetic drug and possesses a different binding profile from other peptidomimetic PR inhibitors. The high potency of tipranavir against even multidrug-resistant HIV PR arises due to a different kind of hydrogen bond patterns that it forms with the residues from the flap region of the PR (Wang et al., 2011). Overall, stronger binders such as tipranavir and darunavir are more difficult to be discouraged by mutations.

The resistance landscapes for the nucleoside RT inhibitors were, in general, similar (Fig. 2B). In contrast, whereas certain zones on landscapes for nucleoside RT inhibitors are red, the same zones for non-nucleoside RT inhibitors appear in blue. Therefore, non-nucleoside RT inhibitors (e.g. efavirenz, nevirapine) can be effective for HIV variants with mutant RT sequences residing in the bottom right corner of the resistance landscape (Fig. 2C). Thus, a simultaneous analysis of maps for nucleoside and non-nucleoside inhibitors could provide an insight to optimal drug combinations for ART regimens.

IN inhibitors are usually employed in combination with PR and RT inhibitors in treatment-experienced individuals. The IN landscapes for the inhibitors considered in this work (raltegravir and elvitegravir) are very similar (Fig. 2D). Indeed, raltegravir and elvitegravir possess comparable antiviral activity profiles and the presence of the cross-resistance between these drugs is common (Shimura and Kodama, 2009).

3.2 Comparison to other machine-learning methods

Various state-of-the-art ML methods were compared with GTM in terms of prediction accuracies. The GTM-based models allowed achieving high prediction performance in cross-validation (see Supplementary Fig. S6 and Table S2) comparable to the one of other ML methods for most of the drugs. In more detail, in case of multitask GTM, the balanced accuracy values ranged from 0.80–0.94 for PR inhibitors (the highest average prediction accuracy among all proteins) to 0.71–0.75 for non-nucleoside RT inhibitors (the lowest average prediction accuracy among all proteins). It should be noted that the GTM was run in a multitask mode (the same model applies to all inhibitors of a particular viral protein), whereas other approaches were used in a single-task mode, i.e. models were trained for each drug separately. The advantage of the multitask GTM lies in a versatility of the created maps. The GTM model (map) for a given protein is trained to be able to predict simultaneously the resistance against several inhibitors. Therefore, related resistance landscapes built on this map can be used for comparative analysis of resistance profiles of protein mutants to different drugs. Thus, the GTM implicitly enables an intuitive exploration of the resistance profiles and corresponding complex mutation patterns via resistance and mutation landscapes.

3.3 HIV mutation patterns

The maps can be used for the exploration of HIV mutation patterns and their influence on drug resistance. Since the presence of certain types of mutations defines the resistance profile of the HIV variants, the exploration of the relationship between resistance landscapes and the distribution of sequences with specific mutation patterns on the maps (mutation landscapes) can be used to associate mutation patterns with resistance (Figs 2 and 3). In this context, the maps can be used either to visualize the distribution of sequences containing the specific mutation pattern (mutation landscape) and compare it with the resistance landscapes or select a specific part of the map and determine which mutations lead to the resistance profile indicated on the map.

Mutation landscapes for HIV PR (A), IN (B), RT (C) built for ≥5000 sequences from the frame set. Red zones are predominantly occupied by the sequences with specified mutations, while the blue zones mostly contain sequences without specified mutation pattern. All colours in between correspond to mixed zones containing both sequences with and without specified mutation pattern in different proportions
Fig. 3.

Mutation landscapes for HIV PR (A), IN (B), RT (C) built for ≥5000 sequences from the frame set. Red zones are predominantly occupied by the sequences with specified mutations, while the blue zones mostly contain sequences without specified mutation pattern. All colours in between correspond to mixed zones containing both sequences with and without specified mutation pattern in different proportions

At first, the mutation patterns leading to drug resistance for HIV PR, IN and RT found in the literature (Ceccherini-Silberstein et al., 2007; Rhee et al., 2003) were analysed (Fig. 2). For example, the key resistance-associated mutation occurring in PR–V32I was investigated (Fig. 3). The HIV isolates with this mutation demonstrate high resistance against all the drugs, including highly potent darunavir and tipranavir (Fig. 2 and Supplementary. S7), which can be seen while comparing resistance and mutational landscapes. This is a case of the pan-resistance towards PR inhibitors. All of the PR variants residing in the top right corner of the resistance landscapes are highly mutated. The majority of sequences that contain the V32I mutation also contain other mutations inducing resistance to PR inhibitors. It was experimentally proven that the V32I influences the ability of other mutations to induce resistance against PR inhibitors (Aoki et al., 2018). This effect of the V32I mutation on drug resistance is indeed reflected on the resistance landscapes for all PR inhibitors by the presence of the upper right densely populated red cluster consisting of highly mutated PR sequences containing the V32I mutation among others (Fig. 2A and Supplementary Fig. S7). Similarly to the analysis of mutations in PR, both well-known (e.g. G140S, Q148H) and emerging resistance-inducing mutation patterns (L228H) in IN and RT, respectively, can also be analysed (Fig. 3). For instance, a comparison of the mutation landscapes with the resistance ones (Figs 2and 3, Supplementary Fig. S7) shows that these mutation patterns are abundant among mutant proteins conferring resistance.

To identify the residue positions, which are ‘specific’ to the sequences from a particular node or a group of nodes on the map, the SDPred (Kalinina et al., 2009) algorithm was applied (see Section 2). This algorithm, originally developed for the prediction of residues responsible for a particular functional specificity of a protein, was repurposed for analysis of predominant mutations that distinguish groups of sequences. Given the sequences from the selected zone of the GTM, the algorithm was used to predict the set of amino acid alignment positions that are different between the sequences residing in this zone and all other sequences. For example, the sequences located in a zone populated by HIV variants resistant to nelfinavir and susceptible to darunavir were compared to all other sequences (Fig. 4). Then, the corresponding specificity determining mutations were identified (Supplementary Table S3). Two mutations with the highest relative frequency of occurrence in the selected zone were established: D30N and N88D (Fig. 4A). They are thus the major mutations that differentiate the sequences from this particular zone and all other sequences residing in the remaining regions of the map. These mutations are indeed associated with strong resistance to nelfinavir according to the literature data (Rhee et al., 2003). Namely, the sequences with the D30N mutation pattern confer strong resistance to nelfinavir, whereas being susceptible to darunavir (Fig. 4A). This mutation commonly occurs in combination with N88D inducing cross-resistance to atazanavir and saquinavir. Two other mutations occurring together with N88D and D30N are L10F and L33F. These mutations are associated with reduced susceptibility to both first-generation (nelfinavir, fosamprenavir, indinavir) and second-generation (lopinavir, darunavir) drugs (Tremblay, 2008). When the mutations D30N, N88D, L10F, L33F occur in a sequence simultaneously (Fig. 4B), the resistance profile shifts towards high resistance region for indinavir, lopinavir, saquinavir, fosamprenavir and atazanavir in comparison to sequences with only D30N, N88D mutation pattern present. Hence, the presence of these mutations is important for resistance development. This example illustrates the applicability of the maps for in-depth analysis of mutation pattern-resistance relationships.

Scheme of the comparative analysis of the resistance and mutational landscapes for HIV PR. (A) The exploration of mutations (D30N, N88D) inducing the resistance to nelfinavir and susceptibility to darunavir from the selected zone of the map (shown as a black rectangle). (B) The exploration of additional mutations (L10F, L33F) inducing the resistance to fosamprenavir for the aforementioned selected zone. The coloured stacks of letters highlighted in yellow in the sequence logo at the bottom of the figure represent the amino acids in the specified position in a group of aligned sequences from the selected zone on the map. The coloured letters within the selected stacks represent the resistance determining mutations. The size of the letter reflects the frequency of occurrence of such amino acid in a defined position. The Z-scores (calculated with the SDPred) reflecting which positions within the sequences from the specified zone of the map contain the amino acid mutations that are more prevalent in sequences from this zone as compared to all others are shown above the corresponding letters
Fig. 4.

Scheme of the comparative analysis of the resistance and mutational landscapes for HIV PR. (A) The exploration of mutations (D30N, N88D) inducing the resistance to nelfinavir and susceptibility to darunavir from the selected zone of the map (shown as a black rectangle). (B) The exploration of additional mutations (L10F, L33F) inducing the resistance to fosamprenavir for the aforementioned selected zone. The coloured stacks of letters highlighted in yellow in the sequence logo at the bottom of the figure represent the amino acids in the specified position in a group of aligned sequences from the selected zone on the map. The coloured letters within the selected stacks represent the resistance determining mutations. The size of the letter reflects the frequency of occurrence of such amino acid in a defined position. The Z-scores (calculated with the SDPred) reflecting which positions within the sequences from the specified zone of the map contain the amino acid mutations that are more prevalent in sequences from this zone as compared to all others are shown above the corresponding letters

4 Conclusion

To sum up, in this work, a novel cartography-based methodology for protein sequence space exploration and phenotype profiling was suggested. The methodology is based on the nonlinear dimensionality reduction method GTM, which allows one to transform complex sequence data and associated phenotypic properties into interpretable two-dimensional maps. Predictive performance of GTM-based models was comparable or slightly lower than that of popular ML algorithms. This observation corresponds to earlier reported benchmarking studies of various biological activities of chemical compounds (Gaspar et al., 2015; Kireeva et al., 2012). Additionally, GTM highlights complex structure–property relationships, e.g. mutation patterns leading to drug resistance via the resistance and mutation landscapes. The usage of mutation pattern landscapes allows in-depth analysis of complex mutation patterns and leverages the intuitive understanding of their influence on resistance. Hence, this work introduces the first-in-class tool for HIV sequence space exploration and drug resistance analysis combining high predictive accuracy inherent to ML algorithms and interpretability specific to rule-based methods. The introduced methodology is universal and can be applied to other QGPR modelling tasks. Moreover, the GTM can be especially effective for the analysis of big data, since the initial GTM building only requires a limited number of representative samples. Then, the landscapes (QGPR models) can be rapidly built for any amount of available phenotypic data. Since the genomic and proteomic data are rapidly accumulating, the application of the GTM-based methodology for sequences space analysis has the potential to find a wide application as an accurate and interpretable ML framework for personalized medicine.

Data availability statement

Datasets containing pre-processed sequences of IN, PR, and RT and drug resistance data as well as the ISIDASeq script for k-mers counting are publicly available in the ISIDASeq repository at https://github.com/karinapikalyova/ISIDASeq, and can be accessed via DOI: https://doi.org/10.5281/zenodo.6091711

Funding

O.T. and V.P. thank the Russian Science Foundation [Grant 19-75-10097] for the support.

Conflict of Interest: none declared.

References

Ananworanich
J.
(
2015
)
What will it take to cure HIV?
 
Top. Antiviral Med
.,
23
,
80
84
.

Aoki
M.
 et al. (
2018
)
Mechanism of darunavir (DRV)’s high genetic barrier to HIV-1 resistance: a key V32I substitution in protease rarely occurs, but once it occurs, it predisposes HIV-1 to develop DRV resistance
.
MBio
,
9
,
e02425-17
. [10.1128/mBio.02425-17]

Beerenwinkel
N.
 et al. (
2002
)
Diversity and complexity of HIV-1 drug resistance: a bioinformatics approach to predicting phenotype from genotype
.
Proc. Natl. Acad. Sci. USA
,
99
,
8271
8276
.

Bishop
C.M.
 et al. (
1998
)
GTM: the generative topographic mapping
.
Neural Comput
.,
10
,
215
234
.

Boser
B.E.
 et al. (
1992
)
A training algorithm for optimal margin classifiers
. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92. Association for Computing Machinery, New York, NY, USA, pp.
144
152
.

Breiman
L.
(
2001
)
Random forests
.
Mach. Learn
.,
45
,
5
32
.

Ceccherini-Silberstein
F.
 et al. (
2007
)
Characterization and structural analysis of novel mutations in human immunodeficiency virus type 1 reverse transcriptase involved in the regulation of resistance to nonnucleoside inhibitors
.
J. Virol
.,
81
,
11507
11519
.

Drǎghici
S.
,
Potter
R.B.
(
2003
)
Predicting HIV drug resistance with neural networks
.
Bioinformatics
,
19
,
98
107
.

Friedman
J.H.
(
2001
)
Greedy function approximation: a gradient boosting machine
.
Ann. Stat
.,
29
,
1189
1232
.

Gaspar
H.A.
 et al. (
2015
)
Stargate GTM: bridging descriptor and activity spaces
.
J. Chem. Inf. Model
.,
55
,
2403
2410
.

Gaspar
H.A.
 et al. (
2016
)
Generative topographic mapping approach to chemical space analysis
.
ACS Symp. Ser
.,
1222
,
211
241
.

Günthard
H.F.
 et al. (
2018
)
Human immunodeficiency virus drug resistance: 2018 recommendations of the International Antiviral Society-USA Panel and 8 International Antiviral Society-USA
.
HIV Drug Resist. Recommend. CID
,
2019
,
177
.

Horvath
D.
 et al. (
2014
)
An evolutionary optimizer of libsvm models
.
Challenges
,
5
,
450
472
.

Hotelling
H.
(
1933
)
Analysis of a complex of statistical variables into principal components
.
J. Educ. Psychol
.,
24
,
417
441
.

Iyidogan
P.
,
Anderson
K.S.
(
2014
)
Current perspectives on HIV-1 antiretroviral drug resistance
.
Viruses
,
6
,
4095
4139
.

Kalinina
O.V.
 et al. (
2009
)
Combining specificity determining and conserved residues improves functional site prediction
.
BMC Bioinform
.,
10
,
174
.

Khalid
Z.
,
Sezerman
O.U.
(
2018
)
Prediction of HIV drug resistance by combining sequence and structural properties
.
IEEE/ACM Trans. Comput. Biol. Bioinform
.,
15
,
966
973
.

Kireeva
N.
 et al. (
2012
)
Generative topographic mapping (GTM): universal tool for data visualization, structure-activity modeling and dataset comparison
.
Mol. Inform
.,
31
,
301
312
.

Kohonen
T.
(
1982
)
Self-organized formation of topologically correct feature maps
.
Biol. Cybern
.,
43
,
59
69
.

Lin
A.
 et al. (
2019
)
Multi-task generative topographic mapping in virtual screening
.
J. Comput. Aided Mol. Des
.,
33
,
331
343
.

Lin
A.
 et al. (
2020
)
Parallel generative topographic mapping: an efficient approach for big data handling
.
Mol. Inf
.,
39
,
2000009
.

Masso
M.
,
Vaisman
I.I.
(
2013
)
Sequence and structure based models of HIV-1 protease and reverse transcriptase drug resistance
.
BMC Genomics
,
14
,
S3
. [10.1186/1471-2164-14-S4-S3]

Pasomsub
E.
 et al. (
2010
)
The application of artificial neural networks for phenotypic drug resistance prediction: evaluation and comparison with other interpretation systems
.
Jpn. J. Infect. Dis
.,
63
,
87
94
.

Pavlakis
G.N.
,
Felber
B.K.
(
2018
)
A new step towards an HIV/AIDS vaccine
.
Lancet
,
392
,
192
194
.

Pedregosa
F.
 et al. (
2011
)
scikit-learn: machine learning in Python
.
J. Mach. Learn. Res
.,
12
,
2825
2830
.

Ramon
E.
 et al. (
2019
) HIV Drug Resistance Prediction with Categorical Kernel Functions.
BMC Bioinformatics
20, 410 (2019), Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11466, pp. 233–244. [10.1007/978-3-030-17935-9_22]

Rhee
S.-Y.
 et al. (
2003
)
Human immunodeficiency virus reverse transcriptase and protease sequence database
.
Nucleic Acids Res
.,
31
,
298
303
.

Rhee
S.-Y.
 et al. (
2006
)
Genotypic predictors of human immunodeficiency virus type 1 drug resistance statistics medical sciences
.
Proc. Natl. Acad. Sci. USA
,
103
,
17355
17360
. www.pnas.orgcgidoi10.1073pnas.0607274103

Sheik Amamuddy
O.
 et al. (
2017
)
Improving fold resistance prediction of HIV-1 against protease and reverse transcriptase inhibitors using artificial neural networks
.
BMC Bioinform
.,
18
,
369
. [10.1186/s12859-017-1782-x]

Shen
C.
 et al. (
2016
)
Automated prediction of HIV drug resistance from genotype data
.
BMC Bioinform
.,
17
,
278
. [10.1186/s12859-016-1114-6]

Shimura
K.
,
Kodama
E.N.
(
2009
)
Elvitegravir: a new HIV integrase inhibitor
.
Antiviral Chem. Chemother
.,
20
,
79
85
.

Singh
Y.
(
2017
)
Machine learning to improve the effectiveness of ANRS in predicting HIV drug resistance
.
Healthcare Inf. Res
.,
23
,
271
276
.

Steiner
M.C.
 et al. (
2020
)
Drug resistance prediction using deep learning techniques on HIV-1 sequence data
.
Viruses
,
12
,
560
.

Tarasova
O.
 et al. (
2017
)
PASS-based approach to predict HIV-1 reverse transcriptase resistance
.
J. Bioinf. Comput. Biol
.,
15
,
1650040
.

Tarasova
O.
 et al. (
2018
)
A computational approach for the prediction of HIV resistance based on amino acid and nucleotide descriptors
.
Molecules
,
23
,
2751
.

Tareen
A.
,
Kinney
J.B.
(
2020
)

Logomaker: beautiful sequence logos in Python. Bioinformatics, 36, 2272–2274.

Tremblay
C.L.
(
2008
)
Combating HIV resistance—focus on darunavir
.
Ther. Clin. Risk Manage
.,
4
,
759
766
.

Vercauteren
J.
,
Vandamme
A.M.
(
2006
)
Algorithms for the interpretation of HIV-1 genotypic drug resistance information
.
Antiviral Res
.,
71
,
335
342
.

Wang
D.
,
Larder
B.
(
2003
)
Enhanced prediction of lopinavir resistance from genotype by use of artificial neural networks
.
J. Infect. Dis
.,
188
,
653
660
.

Wang
Y.
 et al. (
2011
)
The higher barrier of darunavir and tipranavir resistance for HIV-1 protease
.
Biochem. Biophys. Res. Commun
.,
412
,
737
742
.

WHO. (

2018
)
Interim Guidelines Suppl to the 2016 Consolidated Guidelines
.
WHO
.

Yu
X.
 et al. (
2014
)
Prediction of HIV drug resistance from genotype with encoded three-dimensional protein structure
.
BMC Genomics
,
15
,
S1
. [10.1186/1471-2164-15-S5-S1]

Zamani
M.
,
Kremer
S.C.
(
2011
)
Amino acid encoding schemes for machine learning methods
. In: 2011 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW 2011), IEEE, Atlanta, GA, USA, pp.
327
333
.

Zheng
L.
 et al. (
2019
)
RAACBook: a web server of reduced amino acid alphabet for sequence-dependent inference by using Chou’s five-step rule
.
Databases
,
2019
,
baz131
.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)
Associate Editor: Jonathan Wren
Jonathan Wren
Associate Editor
Search for other works by this author on:

Supplementary data