Multilabel classification for exploiting cross-resistance information in HIV-1 drug resistance prediction

Motivation: Antiretroviral treatment regimens can sufficiently suppress viral replication in human immunodeficiency virus (HIV)-infected patients and prevent the progression of the disease. However, one of the factors contributing to disease progression despite ongoing antiretroviral treatment is the emergence of drug resistance. The high mutation rate of HIV can lead to a fast adaptation of the virus under drug pressure and thus to failure of antiretroviral treatment due to the evolution of drug-resistant variants. Moreover, cross-resistance phenomena have frequently been found in HIV-1, leading to resistance not only against a drug from the current treatment, but also against other, not yet applied drugs. Automatic classification and prediction of drug resistance are increasingly important in HIV research as well as in clinical settings, and to this end, machine learning techniques have been widely applied. Nevertheless, cross-resistance information has not yet been taken explicitly into account. Results: In our study, we demonstrated the use of cross-resistance information to predict drug resistance in HIV-1. We tested a set of more than 600 reverse transcriptase sequences and corresponding resistance information for six nucleoside analogues. Based on multilabel classification models and cross-resistance information, we were able to significantly improve overall prediction accuracy for all drugs, compared with single binary classifiers without any additional information. Moreover, we identified drug-specific patterns within the reverse transcriptase sequences that can be used to determine an optimal order of the classifiers within the classifier chains. These patterns are in good agreement with known resistance mutations and support the use of classifier chains for drug resistance prediction.


INTRODUCTION
Infection with the human immunodeficiency virus (HIV) is one of the major human diseases, leading to about 2 million deaths yearly. Although antiretroviral treatment works well in principle, the high mutation rate of HIV frequently leads to a fast adaptation of the virus and thus to the development of drug-resistant viral strains. Tripathi et al. (2012) performed stochastic simulations of the within-host evolution of HIV-1. By estimating the structure of the HIV-1 quasispecies, they were able to calculate an error threshold of HIV-1. They discovered that HIV-1 has a mutation rate close to the error catastrophe and that the error threshold depends heavily on the recombination rate of HIV-1 (Tripathi et al., 2012). Pennings (2012) analyzed the evolution of resistance in HIV-1 due to standing genetic variation. She ascertained that, depending on the treatment, the probability that drug resistance evolves from standing genetic variation varies between 0 and 39%.
The most important parameters for the evolution of drug resistance are the effective population size of the virus before treatment and the fitness of the resistant virus during treatment. The evolution of drug resistance finally leads to a failure of antiretroviral treatment and thus to the progression of the disease. Experimental testing of viral resistance in patients has been widely used in research as well as in clinical settings to gain information about the way drug resistance evolves. In contrast to these experimental phenotypic assays, computational approaches offer the possibility to predict drug resistance in HIV-1 in an easy and fast way based on short sequence information of the viral genotype, e.g. the sequence of the viral reverse transcriptase (RT). Such computational models for predicting drug resistance in HIV-1 have been developed and widely applied. They are mainly based on statistical or machine learning methods that try to find a mapping from the sequence information to a 'resistance factor'. Usually, the IC50 ratios are used in these models to define resistance. (The concentration of a specific drug inhibiting 50% of viral replication, compared with cell culture experiments without the drug, is defined as the IC50, the 50% inhibitory concentration.) In general, drug resistance means reduced inhibition of viral replication by antiretroviral drugs, resulting in increased IC50 values. The IC50 values of drug-resistant isolates and of the HIV wild type are used to calculate resistance factors

RF = IC50(resistant strain) / IC50(wild type)

as a standardized measure of HIV drug resistance. The data are separated into the classes 'susceptible' and 'resistant' based on a drug-specific cutoff of the resistance factor.
A computational model is then trained on these data pairs (sequence and corresponding class) and can subsequently be used to predict the resistance factor or the resistance class for new, unseen sequences. For instance, Rhee et al. (2006) used five different statistical and machine learning methods (decision trees, artificial neural networks, support vector machines, least-squares regression and least angle regression) to predict drug resistance in HIV-1 for 16 drugs, including protease and RT inhibitors. Kierczak et al. (2009) developed a rough set-based model considering physico-chemical changes of mutated sequences compared with the wild-type strain to predict RT inhibitor resistance in HIV-1. Moreover, the same group developed the first systems biology approaches to HIV-1 drug resistance, showing networks of interacting positions (Kierczak et al., 2010). One important finding, however, has not been exploited in these models so far, namely the occurrence of cross-resistance. In the context of HIV-1, cross-resistance means that mutations leading to resistance against a specific drug, which is currently part of the antiretroviral treatment of a specific patient, also lead to resistance against other drugs that may or may not be part of the same treatment. (In some cases, albeit less frequently, they can also lead to a re-sensitization for other drugs.) In the current study, we analyzed cross-resistance in HIV-1 for a dataset of more than 600 RT sequences and six nucleoside analogues (NAs), namely Lamivudine (3TC), Abacavir (ABC), Zidovudine (AZT), Stavudine (d4T), Didanosine (ddI) and Tenofovir (TDF). Moreover, we developed a model that exploits knowledge about cross-resistance to improve the overall prediction accuracy for the whole repertoire of drugs used in this study.
To this end, we made use of novel methods for so-called multilabel classification (MLC), a generalization of conventional (polychotomous) classification that has recently gained increasing attention in machine learning (Tsoumakas and Katakis, 2007).
To the best of our knowledge, this is the first time that information about RT inhibitor cross-resistance has been explicitly integrated into HIV-1 drug resistance prediction models. In contrast to protease inhibitors (PIs) as well as non-nucleoside reverse transcriptase inhibitors (NNRTIs), NAs have fewer side effects (Stürmer et al., 2007). Moreover, the combination of different drug classes during therapy may lead to unpredictable interactions of PIs and NNRTIs with the cytochrome P450 system and thus may complicate the therapy. Therefore, one option might be the use of quadruple nucleoside therapy (Stürmer et al., 2007).

Data
For our analyses, we used a publicly available dataset consisting of RT sequences of HIV-1 with corresponding resistance factors (IC50 ratios) for six NAs, namely 3TC, ABC, AZT, d4T, ddI and TDF (Rhee et al., 2006). A summary of the dataset is shown in Table 1. For our method, we needed RT sequences for which IC50 ratios for all mentioned drugs are available, so we discarded the information about TDF (owing to the low number of sequences) and merged the sequences and IC50 ratios of the remaining drugs. Strains lacking phenotypic results for any drug analyzed in the current study were discarded before analysis. This yielded 614 RT sequences with IC50 ratios for 3TC, ABC, AZT, d4T and ddI. The class ratio (positive samples / negative samples) for all drugs ranges between 0.83 and 2.39. All sequences originated from subtype B strains.

Predictive modeling
The goal of our study was to build models that predict whether an RT sequence of a specific virus carries resistance mutations for the different NAs. Thus, we used drug-specific cutoffs for the IC50 ratios to separate the sequences into the classes 'resistant' and 'susceptible' for the different drugs. For 3TC and AZT, the cutoff was set to 3, for d4T and ddI it was set to 1.5, and for ABC it was set to 2. Sequences with an IC50 ratio above or equal to the cutoff are defined as 'resistant' (Table 1), whereas sequences with an IC50 ratio below the cutoff are defined as 'susceptible'. For instance, an RT sequence with a ratio of 10.3 for 3TC is defined as 'resistant', whereas another RT sequence with a ratio of 2.5 is defined as 'susceptible'.
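As a minimal sketch of this labeling step (not the authors' code; the cutoff dictionary simply restates the values given above):

```python
# Drug-specific cutoffs from the text; IC50 ratio >= cutoff => 'resistant'.
CUTOFFS = {"3TC": 3.0, "AZT": 3.0, "d4T": 1.5, "ddI": 1.5, "ABC": 2.0}

def label_resistance(ic50_ratios):
    """Map a dict of drug -> IC50 ratio to binary labels (1 = resistant)."""
    return {drug: int(ratio >= CUTOFFS[drug]) for drug, ratio in ic50_ratios.items()}

# The example from the text: a 3TC ratio of 10.3 is resistant, 2.5 is susceptible.
print(label_resistance({"3TC": 10.3}))  # {'3TC': 1}
print(label_resistance({"3TC": 2.5}))   # {'3TC': 0}
```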
In some studies, e.g. Rhee et al. (2006), additional resistance classes (e.g. intermediate resistance) are used; here, we restricted ourselves to the binary distinction between 'resistant' and 'susceptible'. As already demonstrated in several protein classification studies (e.g. Chowriappa et al., 2008; Dybowski et al., 2010; Heider et al., 2009, 2010), the use of physico-chemical properties, and especially of hydrophobicity characteristics (Kyte and Doolittle, 1982), leads to good prediction results. Thus, we encoded the RT protein sequences into hydrophobicity vectors and normalized them to length 240, the average sequence length in the dataset, using Interpol (Heider and Hoffmann, 2011). We eventually end up with a dataset consisting of 614 instances (RT sequences), each characterized by a normalized hydrophobicity vector of length 240. Moreover, each instance is associated with five binary class labels, namely resistance to ddI, AZT, d4T, 3TC and ABC. Our goal is to train a model that generalizes beyond these examples, i.e. one that accurately predicts each of the five outputs from any normalized hydrophobicity vector given as input.
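The encoding step can be sketched as follows. This is a simplified stand-in for Interpol, assuming plain linear interpolation of Kyte-Doolittle hydrophobicity values to a fixed length; the example sequence is an arbitrary illustration, not a real RT fragment:

```python
import numpy as np

# Kyte-Doolittle hydrophobicity scale (Kyte and Doolittle, 1982).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def encode(seq, target_len=240):
    """Encode a protein sequence as a hydrophobicity vector and linearly
    interpolate it to a fixed length (240 in the paper)."""
    v = np.array([KD[aa] for aa in seq.upper()])
    old_x = np.linspace(0.0, 1.0, num=len(v))   # original positions, rescaled
    new_x = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(new_x, old_x, v)

x = encode("MVPIKLKPGMDGPKVKQ")
print(x.shape)  # (240,)
```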

Multilabel classification
The above problem obviously falls into the realm of (binary) classification, a well-established and thoroughly explored subfield of statistics and machine learning. In fact, arguably the simplest way to solve it is to train one binary classifier for each of the five outputs, thereby splitting the original multi-output problem into five single-output problems. Each of these problems can then be solved individually, using the large repertoire of existing methods for binary classification.
This approach has an important disadvantage, however: it cannot take advantage of possible dependencies between the different outputs.
Modeling and exploiting such dependencies to improve prediction accuracy is one of the key goals of MLC. Intuitively, if the value of one output may (statistically) depend on the value of others, then predicting all outputs simultaneously should indeed be better than predicting them separately. This is the main argument against simple decomposition techniques like the one proposed above, called binary relevance (BR) learning in the context of MLC. More formally, let L = {λ_1, ..., λ_m} be a finite set of class labels (in our case the resistance for the five drugs), and let X be an instance space (in our case the 240-dimensional hydrophobicity vectors). An MLC task assumes a training set S = {(x_1, y_1), ..., (x_n, y_n)}, generated independently and identically according to a probability distribution P(X, Y) on X × Y. Here, Y is the set of possible label combinations, i.e. the power set of L. To ease notation, we define a label combination y as a binary vector y = (y_1, y_2, ..., y_m), in which y_j = 1 indicates the presence (relevance) and y_j = 0 the absence (irrelevance) of λ_j. Under this convention, the output space is given by Y = {0, 1}^m. The goal in MLC is to induce from S a model h : X → Y that correctly predicts the subset of relevant labels for unlabeled query instances x ∈ X. Recall that, in our context, 'relevance of a label' stands for 'resistance against a drug'; thus, a prediction ŷ = (1, 0, 1, 0, 0) would suggest that the instance (RT sequence) at hand is resistant against the first and the third drug while being susceptible to the others.
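For concreteness, the BR baseline can be sketched with scikit-learn; the synthetic data below stands in for the real hydrophobicity vectors and drug labels, so this is an illustration, not the study's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 240))                     # stand-ins for hydrophobicity vectors
Y = (rng.random(size=(100, 5)) < 0.5).astype(int)   # five binary drug labels

# Binary relevance: one independent classifier per label.
br_models = [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
             for j in range(Y.shape[1])]

def br_predict(x):
    """Predict the full label vector by querying each classifier separately."""
    return np.array([m.predict(x.reshape(1, -1))[0] for m in br_models])

print(br_predict(X[0]).shape)  # (5,)
```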

Performance metrics
The prediction of label subsets (vectors) instead of single labels suggests different types of performance metrics for MLC. Commonly used examples of such metrics, which also seem to be meaningful in the context of our application, include the Hamming loss, the subset 0/1 loss and the F-measure. Let y_i = (y_{i,1}, ..., y_{i,m}) denote the true label vector of an instance x_i and ŷ_i = (ŷ_{i,1}, ..., ŷ_{i,m}) the corresponding prediction produced by the classifier. The Hamming loss of the prediction ŷ_i is then defined as the fraction of labels whose relevance is incorrectly predicted:

HL(y_i, ŷ_i) = (1/m) Σ_{j=1}^{m} [[ y_{i,j} ≠ ŷ_{i,j} ]]

(For a predicate P, the expression [[P]] evaluates to 1 if P is true and to 0 if P is false.) The subset 0/1 loss simply checks whether the complete label subset is predicted correctly or not:

ZO(y_i, ŷ_i) = [[ y_i ≠ ŷ_i ]]

The F-measure essentially corresponds to the harmonic mean of the precision and recall of the prediction; it is defined as follows:

F(y_i, ŷ_i) = 2 Σ_{j=1}^{m} y_{i,j} ŷ_{i,j} / ( Σ_{j=1}^{m} y_{i,j} + Σ_{j=1}^{m} ŷ_{i,j} ),

where 0/0 = 1 by definition. The above metrics evaluate the prediction ŷ_i for an individual instance x_i, i.e. they are computed instance-wise. Correspondingly, in an experimental study, the average accuracy would be reported as the average over all instances x_i in the test data X_test. Apart from that, another option is of course to report accuracy in a label-wise manner, namely to compute standard performance metrics such as classification rate (percentage of correct predictions) or AUC (area under the receiver-operating characteristic curve) separately for each label λ_j ∈ L. Note that some metrics can be computed instance-wise as well as label-wise. For example, the label-wise version of the F-measure is given by

F(y_j, ŷ_j) = 2 Σ_{i=1}^{N} y_{i,j} ŷ_{i,j} / ( Σ_{i=1}^{N} y_{i,j} + Σ_{i=1}^{N} ŷ_{i,j} ),

where y_j = (y_{1,j}, ..., y_{N,j}) is the vector of values for the label λ_j and ŷ_j the corresponding vector of predictions.
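The three instance-wise metrics translate directly into code; the following sketch assumes the label vectors are given as 0/1 NumPy arrays or lists:

```python
import numpy as np

def hamming_loss(y, yhat):
    """Fraction of the m labels whose relevance is incorrectly predicted."""
    y, yhat = np.asarray(y), np.asarray(yhat)
    return np.mean(y != yhat)

def subset_zero_one_loss(y, yhat):
    """1 if the complete label vector is wrong, 0 if predicted exactly."""
    return int(not np.array_equal(np.asarray(y), np.asarray(yhat)))

def f_measure(y, yhat):
    """Instance-wise F-measure (harmonic mean of precision and recall),
    with 0/0 defined as 1."""
    y, yhat = np.asarray(y), np.asarray(yhat)
    denom = y.sum() + yhat.sum()
    return 1.0 if denom == 0 else 2.0 * (y * yhat).sum() / denom

y_true = [1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1]
print(hamming_loss(y_true, y_pred))          # 0.4
print(subset_zero_one_loss(y_true, y_pred))  # 1
print(f_measure(y_true, y_pred))             # 0.5
```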

Classifier chains
Several methods for MLC have been proposed in the literature. Here, we shall focus on a method called classifier chains (CCs; Read et al., 2011), which, despite having been introduced only recently, already enjoys great popularity. This is arguably due to the fact that it is based on a simple and elegant yet effective idea for capturing label dependencies. The CC method learns m binary classifiers (each one dealing with the BR problem associated with one label) linked along a chain, each time extending the feature space by all previous labels in the chain. For instance, if the chain follows the order λ_1 → λ_2 → ... → λ_m, then the classifier h_j responsible for predicting the relevance of λ_j is of the form

h_j : X × {0, 1}^{j-1} → {0, 1}    (5)

The training data for this classifier consists of (expanded) instances (x_i, y_{i,1}, ..., y_{i,j-1}) labeled with y_{i,j}, that is, original training instances x_i supplemented by the relevance of the labels λ_1, ..., λ_{j-1} preceding λ_j in the chain. Thus, the classifier h_j supposed to predict the label of class λ_j makes use of the preceding labels as additional input information, thereby capturing possible dependencies between the labels. Theoretically, the CC approach can be motivated by the product rule of probability (Dembczyński et al., 2010):

P(y | x) = Π_{k=1}^{m} P(y_k | x, y_1, ..., y_{k-1})    (6)

Note that, for training the classifier (5), any standard method for binary classification can be used (logistic regression, decision trees, support vector machines, etc.).
At prediction time, when a new instance x needs to be labeled, a label vector ŷ = (ŷ_1, ..., ŷ_m) is produced by successively querying each classifier h_j. Note, however, that the inputs of these classifiers are not well-defined, because the supplementary attributes y_1, ..., y_{j-1} are not available at prediction time. These missing values are therefore replaced by their respective predictions: the label y_1 used by h_2 as an additional input is replaced by ŷ_1 = h_1(x), the label y_2 used by h_3 is replaced by ŷ_2 = h_2(x, ŷ_1), and so forth. Thus, the prediction ŷ is of the form

ŷ = ( h_1(x), h_2(x, ŷ_1), ..., h_m(x, ŷ_1, ..., ŷ_{m-1}) )

The process of training a CC and using it for prediction is illustrated in Figure 1.
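A minimal classifier chain along these lines might look as follows. This is an illustrative sketch on synthetic data, not the paper's implementation (scikit-learn also ships its own `ClassifierChain`); label j receives the true preceding labels at training time and the predictions ŷ_1, ..., ŷ_{j-1} at test time:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class ClassifierChain:
    """Chain of binary classifiers; `order` lists label indices in chain order."""
    def __init__(self, order):
        self.order = order
        self.models = []

    def fit(self, X, Y):
        self.models = []
        for j in range(len(self.order)):
            # Append the TRUE labels of all predecessors in the chain.
            X_ext = np.hstack([X, Y[:, self.order[:j]]])
            self.models.append(
                LogisticRegression(max_iter=1000).fit(X_ext, Y[:, self.order[j]]))
        return self

    def predict(self, X):
        n = X.shape[0]
        Yhat = np.zeros((n, len(self.order)), dtype=int)
        preds = np.zeros((n, 0))
        for j, model in enumerate(self.models):
            # Append the PREDICTED labels of all predecessors in the chain.
            yj = model.predict(np.hstack([X, preds]))
            preds = np.hstack([preds, yj.reshape(-1, 1)])
            Yhat[:, self.order[j]] = yj
        return Yhat

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 240))
Y = (rng.random((100, 5)) < 0.5).astype(int)
cc = ClassifierChain(order=[0, 1, 2, 3, 4]).fit(X, Y)
print(cc.predict(X[:2]).shape)  # (2, 5)
```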

Ensembles of CCs
Realizing that the order of labels in the chain may influence the performance of the classifier, and that an optimal order is hard to anticipate, Read et al. (2011) propose the use of an ensemble of CC classifiers. This approach combines the predictions of different random orders and, moreover, uses a different sample of the training data to train each member of the ensemble. Ensembles of classifier chains (ECC) have been shown to increase prediction performance over CC by effectively using a simple voting scheme to aggregate the predicted relevance sets of the individual chains. For each label λ_j, relevance is predicted by thresholding the proportion ŵ_j of classifiers predicting y_j = 1 at a level t, i.e., ŷ_j = [[ ŵ_j ≥ t ]].
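The voting step can be sketched as follows, assuming each chain has already produced a 0/1 label vector (the three chains below are hypothetical):

```python
import numpy as np

def ecc_predict(chain_predictions, t=0.5):
    """Aggregate the {0,1} predictions of K chains by label-wise voting:
    label j is predicted relevant iff the fraction of chains voting 1 is >= t."""
    votes = np.mean(np.asarray(chain_predictions), axis=0)  # w_j per label
    return (votes >= t).astype(int)

# Three hypothetical chains, five labels:
chains = [[1, 0, 1, 0, 0],
          [1, 0, 0, 0, 0],
          [1, 1, 1, 0, 0]]
print(ecc_predict(chains))  # [1 0 1 0 0]
```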

RESULTS AND DISCUSSION
The major goal of our experimental study was to provide empirical evidence for the conjecture that capturing statistical dependencies between HIV-1 drugs is instrumental in learning classifiers for resistance prediction. Dependencies of that type are biologically plausible and suggested by the observation of cross-resistance; besides, they are also confirmed by our data: Table 2 shows the pair-wise associations between the binary class labels (drugs), expressed in the form of the phi coefficient. (This coefficient is equal to the Pearson correlation for binary variables and is also closely connected to the χ² statistic.) As can be seen, the dependency between the resistance for different drugs is positive throughout and especially high for the two pairs 3TC/ABC and AZT/d4T. This observation is in perfect agreement with attribute importance analyses on the basis of random forest classifiers that were trained for each class individually, which in turn are in partial agreement with recent expert-defined resistance mutations (Johnson et al., 2011) and other computational approaches, e.g. Kierczak et al. (2010). As can be seen in Figure 2, for 3TC as well as ABC, the most important sequence positions (often selected as discriminative attributes by the classifiers) are found in the C-terminal part, with the highest peak at position 184. For the other drugs, namely AZT, d4T and ddI, the highest peaks are spread over the C-terminal as well as the N-terminal part, e.g. positions 41, 70, 210 and 215. Some NA resistance patterns are well known (Stürmer et al., 2007), e.g. the so-called thymidine analogue mutations at positions 41, 65, 67, 70, 210, 215 and 219, leading to varying levels of AZT and d4T resistance (Antinori et al., 2006; Garcia-Lerma et al., 2003; Lafeuillade and Tardy, 2003). Another important mutation at position 184 is also reflected in the importance analyses: the mutation M184V is associated with high-level 3TC resistance as well as with ABC resistance.
For ABC, resistance mutations at positions 65, 74 and 115 have also been found during ABC therapy. Moreover, mutation patterns at position 151 in combination with mutations at positions 62, 69, 75, 77 and 116 are also associated with high-level resistance against AZT, 3TC and ABC (Sirivichayakul et al., 2003). Interestingly, position 65, which is associated with broad cross-resistance against almost all NAs except AZT, also has high importance for the AZT classification. Nevertheless, random forest importance analyses have some limitations: they only estimate the importance of a sequence position for the classification, but do not indicate whether a specific sequence position is positively or negatively associated with resistance; moreover, they provide no information about interacting sequence positions that contribute to resistance. For a comprehensive structural analysis and interpretation of resistance mutations in RT, see Kierczak et al. (2010). Interestingly, phylogenetic analyses of the sequences using a neighbor-joining approach (Gouy et al., 2010) as well as principal component analysis on the distance matrix showed that the sequences cannot be easily separated into the different resistance classes based only on the sequence information (Supplementary Figures S1 and S2). It is important to note, however, that a positive correlation between labels does not necessarily imply a benefit for prediction. In particular, while the above correlation is an unconditional measure of dependence between class labels, a multilabel classifier such as CC seeks to capture conditional dependencies, namely the dependence between class labels given the input information x.
Table 3 shows the average misclassification rates of classifiers (logistic regression) that have been trained for the individual class labels λ_i on different input information, namely (i) the original predictor variables x and (ii) this feature vector supplemented by the resistance information for one of the other drugs λ_j; thus, we simply assumed that λ_j was already known when λ_i needs to be predicted. As can be seen, 3TC benefits more from knowing ddI than from knowing ABC, and d4T benefits more from ABC than from AZT. Another possible effect that cannot be excluded and could have an influence on our findings is treatment history. Unfortunately, the treatment histories in the dataset are highly diverse. However, most of the patients have an unknown treatment history or have not been treated yet (see http://hivdb.stanford.edu). Thus, we assume that treatment history plays only a minor role in our model.
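The spirit of this experiment can be reproduced in a few lines; the sketch below uses synthetic data with two correlated labels as hypothetical stand-ins for a drug pair such as 3TC/ddI, so the numbers will differ from Table 3:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 240))
# Two correlated synthetic labels, both driven by the same latent signal:
z = X[:, 0] + rng.normal(scale=0.5, size=200)
y_i = (z > 0).astype(int)                              # label to be predicted
y_j = (z + rng.normal(scale=0.5, size=200) > 0).astype(int)  # "other drug" label

base = LogisticRegression(max_iter=1000)
acc_plain = cross_val_score(base, X, y_i, cv=10).mean()
acc_aug = cross_val_score(base, np.hstack([X, y_j.reshape(-1, 1)]), y_i, cv=10).mean()
print(f"without other label: {acc_plain:.3f}, with true label of other drug: {acc_aug:.3f}")
```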
The current study is related to the idea of CCs insofar as class labels λ_j are used as additional predictor variables for other labels λ_i. Here, however, we assumed the true values y_j of the additional predictors to be known, not only for training but also for prediction. In chaining, on the other hand, the true values y_j are only known in the training step, whereas for prediction, they have to be replaced by their estimates ŷ_j. Thus, although the above experiment confirms the potential benefit of label information for prediction purposes, it is not clear that label dependencies can indeed be exploited in a practically realistic setting where other labels are not known at prediction time.

[Figure 2 caption: On the x-axis the sequence positions are shown, whereas the y-axis represents the sum of all decreases in Gini impurity. Feature importance for five single random forests was assessed using the sum of all decreases in Gini impurity, which has been shown to be more robust compared with the mean decrease in accuracy (Calle and Urrea, 2010).]

The effect of chaining
To analyze the practical usefulness of classifier chaining, we compared the prediction accuracy of the following methods:

BR: A single binary classifier is trained independently for each of the five labels.

CCs: The five classifiers are trained according to the CC approach outlined in Section 2.4. The chain was constructed by sorting the labels in decreasing order of their individual (BR) prediction accuracy. (This is a commonly used rule of thumb, motivated by the observation that mistakes of a single classifier tend to be propagated along the rest of the chain (Senge et al., 2013); consequently, strong classifiers should be placed at the top and poor ones more toward the end of the chain.)

ECCs: The ECC method described in Section 2.5 was implemented with 10 chains, each time choosing the order of labels at random. The threshold t was taken as 1/2.
All methods were implemented with standard logistic regression as a base learner. Prediction performance was measured in terms of the Hamming loss, the subset 0/1 loss and the F-measure as instance-wise metrics, and the classification rate, the AUC and the F-measure as label-wise metrics. Each of these metrics was estimated by means of a 10-fold cross-validation repeated five times, and results are reported in terms of means and standard deviations. Moreover, we also indicate the ranking of the three methods, with the best-performing method on rank 1 and the worst on rank 3. The results for the instance-wise metrics are summarized in Table 4, those for the label-wise metrics in Table 5. Although the differences are not always statistically significant, as can be seen from the standard deviations, the overall picture is clear and obviously in favor of the chaining methods. In fact, chaining achieves systematic (albeit sometimes small) gains in comparison with standard BR learning. Among the two chaining methods, ECC performs even stronger than CC and typically yields the best results.
To make sure that the results are not too strongly influenced by the underlying base learner, we repeated the same experiments with random forests (Breiman, 2001) instead of logistic regression. These two learners exhibit different properties. In particular, while logistic regression fits a linear decision boundary in the instance space, decision trees are much more flexible and able to model highly non-linear concepts; this flexibility is further increased by the aggregation of different trees in the random forest approach. Thus, it comes as no surprise that the performance of all methods is in general improved. Nevertheless, in terms of relative comparison, the picture is more or less identical to the first experiment with logistic regression. Both chaining methods improve on BR, with ECC being even better than CC (Tables 6 and 7).

CONCLUSION
We conclude with an affirmative answer to one of the main questions of our study, namely whether or not cross-resistance information can be used to improve overall accuracy in drug resistance prediction. By using MLC methods, a relatively recent development in machine learning, we were able to exploit cross-resistance information for RT inhibitors. More concretely, our results are based on a specific MLC method called CCs.
We consider these results as promising and, therefore, intend to further explore this direction in future work. On the methodological side, we would like to try alternative MLC methods, including the probabilistic variant of CCs proposed by Dembczyński et al. (2010), but also approaches that are not based on the idea of chaining. As an interesting property of the former, let us mention that it produces not only binary predictions but also proper probability estimates of single labels or label combinations. Predictions of that kind are interesting not only for the minimization of various loss functions, but also for the purpose of representing uncertainty. Moreover, we want to include multiclass and regression models to be able to predict more classes, e.g. intermediate resistance, and even the resistance factors themselves.
On the application side, our study has focused on NAs so far, although a typical clinical treatment includes drugs from several classes. It would of course be interesting to test our approach for other types of antiretroviral drugs, for example NNRTIs, and for other target proteins, such as the HIV-1 protease and the corresponding PIs. So far, our approach is limited to specialized treatment cases and is thus not yet well applicable in clinical settings; in the future, however, we plan to adapt it for NNRTIs as well as for PIs. Moreover, all sequences used in the current study originated from subtype B strains, so the results of our model might be misleading if it is applied to other subtypes.
Funding: This work was partially supported by the Center for Synthetic Microbiology (SYNMIKRO), Marburg, Germany.
Conflict of Interest: none declared.