Application of machine learning techniques to tuberculosis drug resistance analysis

Abstract

Motivation: Timely identification of Mycobacterium tuberculosis (MTB) resistance to existing drugs is vital to decrease mortality and to prevent the amplification of existing antibiotic resistance. Machine learning methods have been widely applied to predicting MTB resistance to a specific drug and to identifying resistance markers. However, they have not been validated on a large cohort of MTB samples from multiple centres across the world, in terms of either resistance prediction or resistance marker identification. Several machine learning classifiers and linear dimension reduction techniques were developed and compared on a cohort of 13 402 isolates collected from 16 countries across 6 continents and tested against 11 drugs.

Results: Compared with a conventional molecular diagnostic test, the area under the curve (AUC) of the best machine learning classifier increased for all drugs, most notably by 23.11%, 15.22% and 10.14% for pyrazinamide, ciprofloxacin and ofloxacin, respectively (P < 0.01). Logistic regression and gradient tree boosting were found to perform better than the other techniques. Moreover, combining logistic regression/gradient tree boosting with a sparse principal component analysis/non-negative matrix factorization step, compared with the classifier alone, enhanced the best performance in terms of F1-score by 12.54%, 4.61%, 7.45% and 9.58% for amikacin, moxifloxacin, ofloxacin and capreomycin, respectively, as well as increasing the AUC for amikacin and capreomycin. The results provide a comprehensive comparison of various techniques and confirm the applicability of machine learning to large, diverse tuberculosis datasets. Furthermore, mutation ranking showed the possibility of finding new resistance/susceptibility markers.

Availability and implementation: The source code can be found at http://www.robots.ox.ac.uk/ davidc/code.php

Supplementary information: Supplementary data are available at Bioinformatics online.

LR is easy to implement, is efficient to train and provides probabilities for its outcomes. However, LR cannot solve non-linear problems because its decision surface is linear, and it can suffer from high bias.
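As a minimal sketch of this kind of classifier, the snippet below fits an L1-regularised logistic regression on a binary isolate-by-SNP matrix and returns resistance probabilities. The data here are synthetic stand-ins, not the paper's genotype data, and the hyperparameters are illustrative defaults.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))   # 200 isolates x 50 candidate SNPs (toy data)
y = (X[:, 0] | X[:, 3]).astype(int)      # toy resistance signal driven by two SNPs

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# L1 penalty yields sparse coefficients, so non-zero weights flag candidate markers
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]    # per-isolate probability of resistance
print(clf.score(X_te, y_te))
```

The sparse coefficient vector (`clf.coef_`) is also what makes mutation ranking straightforward for the L1-penalised variant.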
SVM
SVM is a binary classifier that separates samples in the feature space by finding the widest-margin hyperplane. The hyperplane is optimised by maximising its distance to the closest training points of each class, known as the support vectors. Because the data are not always linearly separable, a kernel function can transform them into a linearly separable space and improve performance.
This model was run using LIBSVM. Radial basis function (SVM-RBF) and linear (SVM-Linear) kernels were considered in this work, and the Optunity package was used to optimise the kernel parameters.
SVM solves a convex optimisation problem, works for both linearly and non-linearly separable data, and can handle high-dimensional data. However, it needs enough samples from both the positive and negative classes, it is susceptible to overfitting, and it can demand substantial memory and CPU time.
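The set-up above can be sketched as follows. scikit-learn's `SVC` wraps LIBSVM, and a plain grid search stands in here for the Optunity hyperparameter optimisation used in the paper; the data and parameter grid are illustrative placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 40)).astype(float)
y = (X[:, 0] * X[:, 1]).astype(int)      # toy non-linear (AND-like) signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Tune the RBF kernel parameters C and gamma by cross-validated grid search
search = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]},
    cv=3,
)
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))
```

Swapping `kernel="rbf"` for `kernel="linear"` gives the SVM-Linear variant.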

PM
PM is a generative model based on class-conditional independence of the variables. Although its assumptions do not usually hold, PM often performs well. For TB analysis, the PM method calculates the probability that an isolate carries a given SNP. The prediction for a new data point is then derived from the probability of each class label, given the new example and the training data.
A Beta(1,0.25) prior for the resistant class and a Beta(0.25,1) prior for the susceptible class were considered.
PM is fast at predicting labels and can work with high-dimensional data. Nevertheless, if a category has no training data, the model assigns it zero probability and is unable to predict it. Its other limitation is the independence assumption.
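An illustrative hand-rolled version of such a class-conditional model is sketched below, using the asymmetric Beta priors stated above: Beta(1, 0.25) for the resistant class and Beta(0.25, 1) for the susceptible class. The per-SNP probability is the Beta posterior mean, (k + a)/(n + a + b) after observing k mutant calls in n isolates; the function names and the toy data are this sketch's own, not the paper's.

```python
import numpy as np

def fit_pm(X, y, priors={1: (1.0, 0.25), 0: (0.25, 1.0)}):
    """Per-class SNP probabilities with Beta(a, b) pseudo-counts."""
    theta, log_prior = {}, {}
    for c, (a, b) in priors.items():
        Xc = X[y == c]
        theta[c] = (Xc.sum(axis=0) + a) / (len(Xc) + a + b)  # posterior mean
        log_prior[c] = np.log(len(Xc) / len(X))
    return theta, log_prior

def predict_pm(X, theta, log_prior):
    """Assign each isolate the class with the highest log-posterior."""
    scores = []
    for c in theta:
        ll = X @ np.log(theta[c]) + (1 - X) @ np.log(1 - theta[c])
        scores.append(ll + log_prior[c])
    classes = np.array(list(theta))
    return classes[np.argmax(np.stack(scores), axis=0)]

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(200, 30))
y = X[:, 5]                              # toy label driven by a single SNP
theta, log_prior = fit_pm(X, y)
print(predict_pm(X, theta, log_prior)[:10])
```

Note how the Beta pseudo-counts keep every probability strictly inside (0, 1), which avoids the zero-probability failure mode described above.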

RF
RF is an averaging method based on building several independent classifiers. The model fits a number of decision tree (DT) classifiers on different subsets of the dataset and averages their results to produce the final predictions and improve performance.
100 estimators were considered for RF training.
Because it trains several independent trees, RF can be fitted in parallel, and it usually reduces variance. However, it needs heavier computational resources.
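A minimal sketch of this configuration, with the 100 estimators noted above and parallel tree fitting via `n_jobs`, might look as follows (the data are synthetic placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(300, 40))
y = (X[:, 0] | X[:, 1]).astype(int)      # toy signal from two SNPs

# 100 trees, fitted in parallel across all available cores
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=3)
rf.fit(X, y)
print(rf.score(X, y))

# Impurity-based importances give a simple mutation ranking
top_snps = np.argsort(rf.feature_importances_)[::-1][:5]
print(top_snps)
```

The `feature_importances_` attribute offers one route to the mutation ranking discussed later, although tree-based importances can be biased toward high-frequency variants.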
Adaboost
Adaboost is a boosting method based on sequentially building weak estimators (models that are only slightly better than random prediction, e.g. small DTs). Initially, all data samples have the same weight. At each successive iteration, the weights of training samples mislabelled by the boosted model in the previous step are increased, while the weights of the remaining samples are decreased. Hence, as iterations proceed, the influence of difficult samples grows. The predictions are then combined through a weighted majority vote, with the aim of reducing the bias of the combined classifier and producing a powerful ensemble.
A DT was used as the base classifier and 100 estimators were considered for training.
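A minimal sketch of this set-up is shown below; scikit-learn's `AdaBoostClassifier` uses a depth-1 decision tree (a stump) as its default base learner, matching the small DTs described above, and the data are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(300, 40))
# Toy label: resistance when at least 2 of 3 SNPs are mutant
y = (X[:, 0] + X[:, 2] + X[:, 4] >= 2).astype(int)

# 100 sequentially fitted stumps; later stumps focus on previously
# mislabelled samples through the reweighting scheme described above
ada = AdaBoostClassifier(n_estimators=100, random_state=4)
ada.fit(X, y)
print(ada.score(X, y))
```

Unlike RF, these 100 estimators cannot be fitted in parallel, since each stump depends on the sample weights produced by its predecessor.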
Adaboost is relatively resistant to overfitting but is sensitive to noisy data and outliers. Moreover, because its trees are built sequentially, Adaboost cannot be parallelised.

Most techniques performed similarly when combined with SPCA/SNMF, except for (F1 + SPCA + PM) for all drugs, (F1 + SPCA + SVM-RBF) for AK and CAP, and (F1 + SPCA + Adaboost) for MOX, OFX, KAN, CAP and CIP, in comparison with the other models. Furthermore, (F1 + SNMF + GBT) and (F1 + SNMF + LR-L1) were the top-performing models in terms of F1-score. Moreover, adding SPCA as the dimension-reduction step improved the F1-score by up to 12.54%, 4.61%, 7.45% and 9.58% for AK, MOX, OFX and CAP, respectively, compared with using the whole F1 feature set.

Table 13. Top 10 mutations ranked by LR-L1, LR-L2 and GBT; mutations associated with resistance/susceptibility to each given drug are shown in boldface. The other mutations are either known to be related to other drugs or are not in the library.

Supplementary J
Here, we compared the performance of (F1 + SPCA/SNMF + GBT) for all drugs, keeping 50, 100 and 150 components. Different numbers of components in (F1 + SNMF + GBT) gave similar performance for all drugs except CIP and RIF, for which keeping fewer components improved performance. A higher number of components in (F1 + SPCA + GBT) had no effect on performance for most drugs but improved it for MOX and OFX.