Loris Nanni, Sheryl Brahnam, Multi-label classifier based on histogram of gradients for predicting the anatomical therapeutic chemical class/classes of a given compound, Bioinformatics, Volume 33, Issue 18, September 2017, Pages 2837–2841, https://doi.org/10.1093/bioinformatics/btx278
Abstract
Given an unknown compound, is it possible to predict its Anatomical Therapeutic Chemical class/classes? This is a challenging yet important problem since such a prediction could be used to deduce not only a compound’s possible active ingredients but also its therapeutic, pharmacological and chemical properties, thereby substantially expediting the pace of drug development. The problem is challenging because some drugs and compounds belong to two or more ATC classes, making machine learning extremely difficult.
In this article a multi-label classifier system is proposed that incorporates information about a compound’s chemical–chemical interaction and its structural and fingerprint similarities to other compounds belonging to the different ATC classes. The proposed system reshapes a 1D feature vector to obtain a 2D matrix representation of the compound. This matrix is then described by a histogram of gradients that is fed into a Multi-Label Learning with Label-Specific Features classifier. Rigorous cross-validations demonstrate the superior prediction quality of this method compared with other state-of-the-art approaches developed for this problem, a superiority that is reflected particularly in the absolute true rate, the most important and harshest metric for assessing multi-label systems.
The MATLAB code for replicating the experiments presented in this article is available at https://www.dropbox.com/s/7v1mey48tl9bfgz/ToolPaperATC.rar?dl=0.
Supplementary data are available at Bioinformatics online.
1 Introduction
Being able to classify an unknown compound into its ATC (Anatomical Therapeutic Chemical) system is a problem that has significance for both drug development and basic research. Developed by the World Health Organization (WHO), the ATC system is a hierarchical classification system that categorizes compounds into 14 main groups: (i) alimentary tract and metabolism; (ii) blood and blood forming organs; (iii) cardiovascular system; (iv) dermatologicals; (v) genitourinary system and sex hormones; (vi) systemic hormonal preparations, excluding sex hormones and insulins; (vii) anti-infectives for systemic use; (viii) antineoplastic and immunomodulating agents; (ix) musculoskeletal system; (x) nervous system; (xi) antiparasitic products, insecticides and repellents; (xii) respiratory system; (xiii) sensory organs; and (xiv) various. In the last decade, many systems and webservers for predicting a compound's ATC classification have been developed (Chen, 2012; Cheng et al., 2016, 2017; Dunkel et al., 2008). Some of these systems, such as those of Dunkel et al. (2008) and Wu et al. (2013), can only map a compound to one ATC label, even though there is mounting evidence indicating that many compounds in systems biology and medicine can belong to more than one category. Therefore, mapping a compound to its ATC system is a multi-label problem, and predictive systems, as pointed out by Chen (2012), should address it as such.
In this work, we propose an ensemble composed of Multi-Label Learning with Label-Specific Features (LIFT) (Zhang and Wu, 2015) classifiers that are fed with Histograms of Gradients (HoG) descriptors extracted from 2D representations of a compound. The original 1D feature vector used to describe the compound is randomly sorted 50 times, generating 50 different 2D representations that are then used to train 50 LIFTs that are combined by sum rule. Finally, this ensemble is combined with LIFT trained on the original 1D feature vectors. Rigorous comparisons with other state-of-the-art approaches demonstrate the superiority of the proposed system.
The main strength of this approach lies in the adoption of 2D representations of patterns. Direct manipulation of matrices offers many advantages, with perhaps the most important being the possibility of extracting powerful state-of-the-art texture descriptors (Nanni et al., 2012). In this work, HoG was chosen as the descriptor so that the correlation among sets of features within a given neighborhood of a 2D representation could be investigated; this is different from coupling feature selection and classification.
As demonstrated by a series of recent publications (see, e.g. Chen et al., 2016a,b, 2017; Cheng et al., 2017; Jia et al., 2016a,b; Liu et al., 2017; Qiu et al., 2016) and in compliance with Chou's five-step rule (Chou, 2011), to establish a truly useful statistical predictor for a biological or biomedical system, one should follow five guidelines: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the statistical samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, we describe how to deal with these steps one-by-one.
2 Materials and methods
2.1 Benchmark dataset and original representation
To facilitate comparison, we use the benchmark dataset provided in (Chen, 2012), which contains a total of 3883 drugs that are divided nonexclusively into 14 ATC classes, with 3295 belonging to only one class, 370 belonging to two classes, 110 belonging to three classes, 37 belonging to four classes, 27 belonging to five classes and 44 belonging to six classes. None of the drugs belong to seven or more classes. Because the ATC classification problem is multi-label, each drug is encoded by a 14-bit label vector: for each class, a 1 indicates that the drug belongs to the corresponding class and a 0 indicates that it does not.
The dataset can be formulated in set notation as the union of its 14 class subsets, S = S1 ∪ S2 ∪ … ∪ S14, and a sample D ∈ S can be represented by three mathematical expressions reflecting its intrinsic correlation with the target to be predicted. First, via a 14D vector that represents its maximum interaction score with the drugs in each of the 14 subsets (for details, see Kanehisa et al., 2004; Kotera et al., 2012); these scores can be downloaded from Supplementary Material S4 of Chen (2012). Second, via a 14D vector that represents its maximum structural similarity score in the 14 subsets (for details, see Kotera et al., 2012); these scores can be downloaded from Supplementary Material S5 of Chen (2012). Third, via a 14D vector that represents its molecular fingerprint similarity score in the 14 subsets (for details, see Xiao et al., 2013); these scores can be downloaded from Supplementary Material S6 of Chen (2012).
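The full representation of a drug is therefore the concatenation of these three 14D score vectors into one 42D feature vector. A minimal Python sketch (the authors' released code is in MATLAB; the variable names and the random placeholder scores below are purely illustrative):

```python
import numpy as np

# Hypothetical per-drug score vectors, one entry per ATC class (14 classes):
interaction = np.random.rand(14)   # max chemical-chemical interaction scores
structure = np.random.rand(14)     # max structural similarity scores
fingerprint = np.random.rand(14)   # molecular fingerprint similarity scores

# Concatenate the three 14D vectors into the 42D feature vector (u = 42)
features = np.concatenate([interaction, structure, fingerprint])
assert features.shape == (42,)
```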
2.2 Vector-to-matrix operation
Reshaping the original vector into a matrix to investigate the correlation among sets of features in a given neighborhood can be accomplished as follows: let v be the original input vector of length u (u = 42 here, see Section 2.1) and let M be the output matrix obtained by a random rearrangement of v. Each entry of M can then be formulated as M(i, j) = v(P(k)), where k indexes the entries of M in row-major order and P is a random permutation of [1…u].
A simple approach for improving performance is to perform the reshaping n times (n = 50 here), each time with a different random sorting of the vector. For each resulting descriptor, a different LIFT (see Section 2.4) is trained, and the results are combined by sum rule.
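The reshaping step above can be sketched in Python as follows (the authors' released code is in MATLAB; the 6 × 7 matrix shape is an assumption consistent with u = 42, and `random_matrices` is an illustrative name):

```python
import numpy as np

def random_matrices(v, n=50, shape=(6, 7), seed=0):
    """Reshape the 1D feature vector v into n different 2D matrices,
    each built from an independent random permutation of its entries."""
    rng = np.random.default_rng(seed)
    return [v[rng.permutation(v.size)].reshape(shape) for _ in range(n)]

v = np.arange(42, dtype=float)   # stand-in for the 42D feature vector
mats = random_matrices(v)
assert len(mats) == 50 and mats[0].shape == (6, 7)
```

Each of the n matrices feeds one HoG + LIFT pipeline, and the n classifier scores are summed at test time.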
2.3 Histograms of gradients
The matrix M is described by HoG (Dalal and Triggs, 2005), which is implemented by dividing a matrix into small spatial regions, or cells (a 5 × 6 cell grid here), where each cell accumulates a local 1D histogram of gradient directions over the values it contains. Simple 1D [−1, 0, 1] masks with a smoothing scale of σ = 0 are used in this step, and each value casts a weighted vote for an edge orientation based on the orientation of the gradient element centered on it; votes are accumulated into orientation bins (nine here) evenly spaced over 0°–180°. These values are then normalized by accumulating a measure of local histogram energy over larger spatial regions, or blocks; this measure in turn is used to normalize all the cells within the block. The concatenated histogram entries of these normalized descriptor blocks are the descriptors used to represent M. In this work, normalization was performed on the whole descriptor using the L2 norm.
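The descriptor can be sketched as follows. This is a minimal Python approximation, not the authors' MATLAB implementation: `np.gradient`'s central differences stand in for the [−1, 0, 1] mask up to a constant factor, the exact cell geometry on the small matrix is not fully specified in the text (so the cell grid is a parameter), and block-level normalization is collapsed into the single whole-descriptor L2 normalization stated above.

```python
import numpy as np

def hog_descriptor(M, grid=(5, 6), n_bins=9):
    """Minimal HoG: magnitude-weighted orientation votes over 0-180 degrees,
    accumulated per cell of a grid[0] x grid[1] cell grid, then the whole
    descriptor is L2-normalized."""
    gy, gx = np.gradient(M)                         # gradients (no smoothing)
    mag = np.hypot(gx, gy)                          # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0    # unsigned orientation
    n_cy, n_cx = grid
    cy = max(M.shape[0] // n_cy, 1)                 # cell height
    cx = max(M.shape[1] // n_cx, 1)                 # cell width
    hist = np.zeros((n_cy, n_cx, n_bins))
    bin_w = 180.0 / n_bins
    for i in range(n_cy):
        for j in range(n_cx):
            a = ang[i * cy:(i + 1) * cy, j * cx:(j + 1) * cx].ravel()
            m = mag[i * cy:(i + 1) * cy, j * cx:(j + 1) * cx].ravel()
            idx = np.minimum((a // bin_w).astype(int), n_bins - 1)
            np.add.at(hist[i, j], idx, m)           # weighted orientation votes
    v = hist.ravel()
    return v / (np.linalg.norm(v) + 1e-12)          # L2 norm, whole descriptor
```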
2.4 Multi-label learning with label-specific features
The HoG descriptors are fed into a LIFT (Zhang and Wu, 2015) classifier, a two-step multi-label learning method. In step one, it constructs features specific to each label by performing clustering analysis on that label's positive and negative instances and mapping training and testing instances onto the clustering results. In step two, it induces a family of classifiers, one per label, each trained on these label-specific features rather than the original ones.
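The two steps can be sketched as follows. This is an illustrative simplification, not the authors' implementation: the label-specific features are the distances to k-means centers fitted separately on positive and negative instances, and logistic regression stands in for the binary learner LIFT actually uses (the function names and the cluster ratio are assumptions).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def lift_fit(X, Y, ratio=0.1):
    """Train one binary classifier per label on label-specific features.
    X: (n, d) descriptors; Y: (n, L) binary label matrix."""
    models = []
    for l in range(Y.shape[1]):
        pos, neg = X[Y[:, l] == 1], X[Y[:, l] == 0]
        k = max(1, int(ratio * min(len(pos), len(neg))))   # clusters per side
        km_p = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pos)
        km_n = KMeans(n_clusters=k, n_init=10, random_state=0).fit(neg)
        # label-specific features: distances to positive and negative centers
        Z = np.hstack([km_p.transform(X), km_n.transform(X)])
        clf = LogisticRegression(max_iter=1000).fit(Z, Y[:, l])
        models.append((km_p, km_n, clf))
    return models

def lift_predict_proba(models, X):
    """Per-label membership scores for new descriptors."""
    P = []
    for km_p, km_n, clf in models:
        Z = np.hstack([km_p.transform(X), km_n.transform(X)])
        P.append(clf.predict_proba(Z)[:, 1])
    return np.column_stack(P)
```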
3 Results and discussion
3.1 Set of five metrics for multi-label systems
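The five metrics reported in the tables below (aiming, coverage, accuracy, absolute true and absolute false) can be sketched as follows. The definitions are the standard ones from the multi-label evaluation literature (Chou-style set metrics) and should be checked against the paper's exact formulation; `multilabel_metrics` is an illustrative name.

```python
import numpy as np

def multilabel_metrics(Y_true, Y_pred):
    """Standard multi-label set metrics.
    Y_true, Y_pred: (N, M) binary arrays (N samples, M labels)."""
    Yt, Yp = np.asarray(Y_true, bool), np.asarray(Y_pred, bool)
    N, M = Yt.shape
    inter = (Yt & Yp).sum(1)                 # |true AND predicted| per sample
    union = (Yt | Yp).sum(1)                 # |true OR predicted| per sample
    return {
        "aiming":         np.mean(inter / np.maximum(Yp.sum(1), 1)),
        "coverage":       np.mean(inter / np.maximum(Yt.sum(1), 1)),
        "accuracy":       np.mean(inter / np.maximum(union, 1)),
        "absolute_true":  np.mean((Yt == Yp).all(1)),   # exact-match rate
        "absolute_false": np.mean((union - inter) / M), # mislabel rate
    }
```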
3.2 Cross-validation
Many cross-validation methods are used for statistical prediction, the most common being (i) independent test set, (ii) subsampling (K-fold cross-validation) and (iii) the jackknife test, which is considered the least arbitrary method and one that yields a unique outcome for a benchmark dataset (Chou, 2011). Accordingly, the jackknife test is used in this study.
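The jackknife test holds out each of the N patterns exactly once and predicts it with a model trained on the remaining N − 1. A minimal sketch of the index generation (`jackknife_splits` is an illustrative name):

```python
import numpy as np

def jackknife_splits(n):
    """Yield (train_indices, test_index) pairs for the jackknife
    (leave-one-out) test: each of the n patterns is held out exactly once."""
    for i in range(n):
        train = np.concatenate([np.arange(i), np.arange(i + 1, n)])
        yield train, i
```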
3.3 Performance of the proposed method and comparisons with the literature
Our ensemble and the LIFT classifier contain a parameter θ for assigning a pattern to a given class, and the predicted results obtained by the classifier depend on this parameter's value. If the score of a given pattern for a given class is higher than θ, then the pattern is assigned to that class. In Figure 1 we report the absolute true performance indicator obtained by varying θ for the following:

1. LIFT trained using the original 1D feature vectors (dashed line): θ ∈ {0.2, 0.35, 0.50, 0.65, 0.80, 0.95}.
2. Ensemble of the 50 LIFTs (combined by sum rule) trained using HoG (dotted line): θ ∈ {10, 12, 14, 16, 18, 20}.
3. Fusion by sum rule of 1 above, weighted by 50, and 2 above (solid line): θ ∈ {20, 24, 28, 32, 36, 40}. We call this approach EnsLIFT.
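The decision rule shared by these configurations can be sketched as follows. The weight w = 50 comes from the fusion described above; the exact scaling of the classifier scores before fusion is an assumption, and `fuse_and_threshold` is an illustrative name.

```python
import numpy as np

def fuse_and_threshold(score_1d, scores_hog, w=50, theta=30.0):
    """Sum-rule fusion: the single LIFT trained on the 1D vectors is weighted
    by w and added to the summed scores of the n HoG-based LIFTs; a pattern is
    assigned to every class whose fused score exceeds theta."""
    fused = w * np.asarray(score_1d) + np.sum(scores_hog, axis=0)
    return (fused > theta).astype(int)
```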
Clearly, EnsLIFT is the most stable and best-performing method, a superiority confirmed by the other performance indicators (see Table 1).
Table 1. Comparison with other state-of-the-art multi-label predictors

| Method | Aiming | Coverage | Accuracy | Absolute true | Absolute false |
|---|---|---|---|---|---|
| EnsLIFT (θ = 30) | 78.18% | 75.77% | 71.21% | 63.30% | 2.85% |
| LIFT | 83.93% | 68.18% | 67.78% | 61.11% | 3.13% |
| iATC-mISF | 67.83% | 67.10% | 66.41% | 60.98% | 5.85% |
| Chen (2012) | 50.76% | 75.79% | 49.38% | 13.83% | 8.83% |
| ML-KNN | 79.96% | 56.70% | 59.10% | 54.16% | — |
| RankSVM | 60.26% | 52.89% | 45.49% | 35.77% | — |
The final cross-validation results (using the jackknife test) are shown in Table 1.
To facilitate comparisons, the corresponding results obtained by iATC-mISF (Cheng et al., 2016) and by the prediction method of Chen (2012) are also provided. To demonstrate the power of the EnsLIFT predictor, we extend the comparison to cover two state-of-the-art predictors: Multi-label K-Nearest Neighbor (ML-KNN) (Li et al., 2012) and ranking Support Vector Machine (RankSVM) (Lee and Lin, 2014). Although not developed specifically for the ATC system prediction problem, both these methods were designed to handle multi-label systems. The absolute-true and absolute-false metrics are the two most important and harshest metrics for multi-label systems. For the absolute-true metric, the higher the percentage the better the multi-label predictor's performance; for the absolute-false metric, the reverse is the case: the lower the percentage the better the performance. The results in Table 1 clearly demonstrate that the proposed predictor is a powerful method for identifying ATC classes for unknown compounds.
In Table 2 we report the performance obtained using 5-fold cross-validation. As expected, this results in a slight decrease in performance. Nonetheless, EnsLIFT still outperforms LIFT.
Table 2. Performance obtained using 5-fold cross-validation

| Method | Aiming | Coverage | Accuracy | Absolute true | Absolute false |
|---|---|---|---|---|---|
| EnsLIFT (θ = 30) | 79.52% | 73.22% | 69.74% | 62.46% | 2.93% |
| LIFT | 83.38% | 66.51% | 66.09% | 59.46% | 3.30% |
In Figure 2 we report the absolute-true performance of EnsLIFT obtained by varying the value of n from 10 to 75. Clearly, the performance increases with higher values of n. For this reason, we compare in Table 4 below the performance obtained by a fusion of LIFT trained using the original 1D feature vectors and an ensemble of 75 LIFTs (combined by sum rule) trained using HoG. This new fusion is labeled EnsLIFTl.
Table 3. Performance obtained by varying the parameters of HoG

| Number of cells | Number of bins | Absolute true (%) |
|---|---|---|
| 5 × 6 | 9 | 61.73 |
| 5 × 6 | 7 | 53.59 |
| 5 × 6 | 11 | 57.30 |
| 4 × 5 | 9 | 61.50 |
| 6 × 7 | 9 | 62.22 |
| 7 × 8 | 9 | 62.30 |
| 8 × 9 | 9 | 62.07 |
Table 4. Performance obtained using n = 75 versus n = 50 and fine-tuning of the HoG parameters

| Method | Aiming | Coverage | Accuracy | Absolute true | Absolute false |
|---|---|---|---|---|---|
| EnsLIFThog (θ = 45) | 78.15% | 75.81% | 71.14% | 63.25% | 2.80% |
| EnsLIFTl (θ = 45) | 78.19% | 75.98% | 71.31% | 63.40% | 2.80% |
| EnsLIFT (θ = 30) | 78.18% | 75.77% | 71.21% | 63.30% | 2.85% |
In Table 3, we report the performance obtained by varying the parameters of HoG (with n = 50) and note that higher performance is obtained by increasing the number of cells.
The results in Table 3 motivated us to run a test using n = 75 coupled with the best parameter settings of HoG, which we label EnsLIFThog. As reported in Table 4, the absolute true performance of EnsLIFThog is 63.25%, similar to that of EnsLIFTl, where n = 75 is used with the original HoG parameters. The improvement of EnsLIFTl and EnsLIFThog with respect to EnsLIFT is negligible.
We validate these results by checking error independence using the well-known Yule's Q-statistic (Kuncheva and Whitaker, 2003), whose values are bounded by [−1, 1]. Classifiers that tend to recognize the same patterns correctly have a Q value greater than zero, whereas those that commit errors on different patterns have a Q value less than zero. The Q-statistic between LIFT trained using the original 1D feature vectors and the ensemble of 50 LIFTs (combined by sum rule) trained using HoG is 0.9374; since this value is below 1, the two classifiers are not fully correlated, which explains why their fusion, EnsLIFT, outperforms LIFT trained using the original 1D feature vectors.
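Yule's Q-statistic is computed from the pairwise counts of correct/incorrect decisions of the two classifiers; a minimal sketch (`yule_q` is an illustrative name, and the undefined case where the denominator is zero is not handled):

```python
import numpy as np

def yule_q(correct_a, correct_b):
    """Yule's Q-statistic between two classifiers, from their per-pattern
    correctness indicators (boolean arrays of equal length)."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n11 = np.sum(a & b)      # both correct
    n00 = np.sum(~a & ~b)    # both wrong
    n10 = np.sum(a & ~b)     # only A correct
    n01 = np.sum(~a & b)     # only B correct
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)
```

Identical classifiers give Q = 1; classifiers whose errors fall on disjoint patterns give Q close to −1.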
4 Conclusion
The focus of this paper was to find a good method for predicting a compound’s ATC class/classes. In the future we plan on testing this method on multiple benchmark datasets. Moreover, as pointed out in (Chou and Shen, 2009) and emphasized and demonstrated in a series of recent publications (see, e.g. Chen et al., 2016a,b), user-friendly and publicly accessible web-servers represent the future direction for practically developing a more useful predictor. As such, we shall make efforts in our future work to provide a web-server for the prediction method presented in this article.
Conflict of Interest: none declared.