An end-to-end heterogeneous graph attention network for Mycobacterium tuberculosis drug-resistance prediction

Abstract Antimicrobial resistance (AMR) poses a threat to global public health. To mitigate the impacts of AMR, it is important to identify the molecular mechanisms of AMR and thereby determine optimal therapy as early as possible. Conventional machine learning-based drug-resistance analyses assume genetic variations to be homogeneous, thus not distinguishing between coding and intergenic sequences. In this study, we represent genetic data from Mycobacterium tuberculosis as a graph, and then adopt a deep graph learning method, a heterogeneous graph attention network ('HGAT-AMR'), to predict anti-tuberculosis (TB) drug resistance. The HGAT-AMR model is able to accommodate incomplete phenotypic profiles, as well as provide 'attention scores' of genes and single nucleotide polymorphisms (SNPs), both at a population level and for individual samples. These scores indicate the inputs to which the model 'pays attention' when making its drug-resistance predictions. The results show that the proposed model achieved the best area under the receiver operating characteristic curve (AUROC) for isoniazid and rifampicin (98.53% and 99.10%, respectively), the best sensitivity for three first-line drugs (94.91% for isoniazid, 96.60% for ethambutol and 90.63% for pyrazinamide), and maintained performance when the data were associated with incomplete phenotypes (i.e. for those isolates for which phenotypic data for some drugs were missing). We also demonstrate that the model identifies genes and SNPs associated with resistance to a particular drug while mitigating the confounding influence of the overall resistance profile, in a manner consistent with domain knowledge.


Supplement B
In this supplement, we first provide details of the PMI-based embedding initialisation, followed by details of the proposed HGAT-AMR model.

B.1 PMI-based embedding
The parameters required for PMI-based embedding are the embedding dimension (denoted d-in) and the cut-off thresholds on the counts of the most and least frequent SNPs across all isolates, analogous to the most and least frequent words across all sentences (denoted th-most and th-least, respectively). The values used in this paper are d-in = 128, th-most = 90% of the total number of isolates, and th-least = 3 counts. Given these thresholds, the most and least frequent SNPs are removed from the full dataset, and the PMI-based embedding method generates embeddings for the remaining SNPs. The embedding of each isolate is then obtained by aggregating the embeddings of its SNPs with an element-wise maximum operator. Finally, the isolate and SNP embeddings serve as the initial embeddings for the HGAT-AMR model in the next step.
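For concreteness, the following is a minimal Python sketch of this initialisation. The exact PMI variant and factorisation are not specified above, so this assumes positive PMI computed over SNP co-occurrence within isolates, followed by a truncated SVD; the function and variable names (e.g. `pmi_embeddings`, `isolates`) are hypothetical.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def pmi_embeddings(isolates, d_in=128, th_most=0.9, th_least=3):
    """isolates: list of sets of SNP identifiers, one set per isolate."""
    n = len(isolates)
    counts = Counter(s for iso in isolates for s in iso)
    # Remove the most and least frequent SNPs before embedding.
    vocab = sorted(s for s, c in counts.items() if th_least <= c <= th_most * n)
    idx = {s: i for i, s in enumerate(vocab)}
    # Co-occurrence counts of SNP pairs within the same isolate
    # (SNPs play the role of words; isolates play the role of sentences).
    co = np.zeros((len(vocab), len(vocab)))
    for iso in isolates:
        kept = [idx[s] for s in iso if s in idx]
        for i, j in combinations(kept, 2):
            co[i, j] += 1
            co[j, i] += 1
    # Positive PMI: max(0, log P(i,j) / (P(i) P(j))).
    total = co.sum() + 1e-12
    p_ij = co / total
    p_i = co.sum(axis=1, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / (p_i @ p_i.T))
    ppmi = np.nan_to_num(np.maximum(pmi, 0.0))
    # Truncated SVD yields d_in-dimensional SNP embeddings.
    u, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    k = min(d_in, len(vocab))
    snp_emb = u[:, :k] * np.sqrt(s[:k])
    # Isolate embeddings: element-wise maximum over each isolate's SNPs.
    iso_emb = np.stack([
        snp_emb[[idx[s] for s in iso if s in idx]].max(axis=0)
        if any(s in idx for s in iso) else np.zeros(k)
        for iso in isolates
    ])
    return snp_emb, iso_emb, vocab
```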

B.2 Architecture of HGAT-AMR
The hyperparameters of HGAT-AMR, and the details of the MTB graphs constructed from the different data subsets, are listed in the table below.
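As a complement to the tabulated hyperparameters, the sketch below illustrates how a single GAT-style attention layer over a bipartite isolate–SNP graph can produce per-edge 'attention scores' of the kind HGAT-AMR reports. This is a generic, illustrative PyTorch layer, not the published implementation; the node types, dimensions and names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeteroAttnLayer(nn.Module):
    def __init__(self, d_in=128, d_out=128):
        super().__init__()
        self.w_iso = nn.Linear(d_in, d_out, bias=False)   # isolate projection
        self.w_snp = nn.Linear(d_in, d_out, bias=False)   # SNP projection
        self.attn = nn.Linear(2 * d_out, 1, bias=False)   # attention scorer

    def forward(self, h_iso, h_snp, edges):
        # edges: (2, E) long tensor of (isolate_index, snp_index) pairs.
        src, dst = edges
        z_iso, z_snp = self.w_iso(h_iso), self.w_snp(h_snp)
        # Unnormalised attention logits, one per isolate-SNP edge.
        e = F.leaky_relu(
            self.attn(torch.cat([z_iso[src], z_snp[dst]], dim=-1))
        ).squeeze(-1)
        e = e - e.max()  # numerical stability before the softmax
        # Softmax over each isolate's SNP neighbourhood.
        denom = torch.zeros(h_iso.size(0), device=e.device)
        denom.index_add_(0, src, e.exp())
        alpha = e.exp() / (denom[src] + 1e-12)  # the 'attention scores'
        # Attention-weighted aggregation of SNP messages into isolates.
        out = torch.zeros_like(z_iso)
        out.index_add_(0, src, alpha.unsqueeze(-1) * z_snp[dst])
        return F.elu(out), alpha
```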

Supplement C
In this supplement, we provide the hyperparameter sets used in the grid search for the classical machine learning comparators, followed by the pipeline used to evaluate these comparator models (as shown in Figure S.1).
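The sketch below shows one way such a grid-search evaluation can be implemented with scikit-learn. The grids shown are placeholder values for illustration only; the actual hyperparameter sets are those listed in this supplement.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_comparators(X_train, y_train):
    """Grid-search the LR and SVM comparators with cross-validated AUROC."""
    grids = {
        "LR": (LogisticRegression(max_iter=1000),
               {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}),  # placeholder grid
        "SVM": (SVC(probability=True),
                {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),  # placeholder grid
    }
    best = {}
    for name, (clf, grid) in grids.items():
        search = GridSearchCV(clf, grid, scoring="roc_auc", cv=5)
        search.fit(X_train, y_train)
        best[name] = search.best_estimator_
    return best
```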

Supplement D
In this supplement, we provide details for training the HGAT-AMR model, followed by the pipeline for inductive training.
HGAT-AMR was trained with the Adam optimiser; the learning rate was set to 0.005, the weight decay to 5e-5, and the maximum number of epochs to 400. An early-stopping criterion was applied: when the validation accuracy fails to improve for 15 consecutive epochs, optimisation stops and the model with the best validation accuracy is saved. At evaluation, the threshold on the predicted probability is set to 0.5. Finally, the performance metrics of the three models (LR, SVM and HGAT) for predicting INH, PZA, EMB and RIF resistance are listed in Tables S4–S7.
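A minimal PyTorch sketch of this training loop under the stated settings follows; `model`, `graph`, and the label/mask tensors are placeholders standing in for the actual HGAT-AMR implementation and data.

```python
import copy
import torch

def train(model, graph, labels, train_mask, val_mask,
          max_epochs=400, patience=15):
    # Adam with the settings stated above: lr 0.005, weight decay 5e-5.
    opt = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-5)
    best_acc, best_state, wait = -1.0, None, 0
    for epoch in range(max_epochs):
        model.train()
        opt.zero_grad()
        logits = model(graph)
        # labels is a float tensor of 0/1 resistance phenotypes.
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            logits[train_mask], labels[train_mask])
        loss.backward()
        opt.step()
        # Validation accuracy at the 0.5 probability threshold.
        model.eval()
        with torch.no_grad():
            probs = torch.sigmoid(model(graph))
            pred = (probs[val_mask] > 0.5).float()
            acc = (pred == labels[val_mask]).float().mean().item()
        if acc > best_acc:
            best_acc, best_state, wait = acc, copy.deepcopy(model.state_dict()), 0
        else:
            wait += 1
            if wait >= patience:  # no improvement for 15 epochs: stop early
                break
    model.load_state_dict(best_state)  # keep the best validation model
    return model
```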