GOAT: Gene-level biomarker discovery from multi-Omics data using graph ATtention neural network for eosinophilic asthma subtype

Abstract Motivation Asthma is a heterogeneous disease where various subtypes are established and molecular biomarkers of the subtypes are yet to be discovered. Recent availability of multi-omics data paved a way to discover molecular biomarkers for the subtypes. However, multi-omics biomarker discovery is challenging because of the complex interplay between different omics layers. Results We propose a deep attention model named Gene-level biomarker discovery from multi-Omics data using graph ATtention neural network (GOAT) for identifying molecular biomarkers for eosinophilic asthma subtypes with multi-omics data. GOAT identifies genes that discriminate subtypes using a graph neural network by modeling complex interactions among genes as the attention mechanism in the deep learning model. In experiments with multi-omics profiles of the COREA (Cohort for Reality and Evolution of Adult Asthma in Korea) asthma cohort of 300 patients, GOAT outperforms existing models and suggests interpretable biological mechanisms underlying asthma subtypes. Importantly, GOAT identified genes that are distinct only in terms of relationship with other genes through attention. To better understand the role of biomarkers, we further investigated two transcription factors, CTNNB1 and JUN, captured by GOAT. We were successful in showing the role of the transcription factors in eosinophilic asthma pathophysiology in a network propagation and transcriptional network analysis, which were not distinct in terms of gene expression level differences. Availability and implementation Source code is available https://github.com/DabinJeong/Multi-omics_biomarker. The preprocessed data underlying this article is accessible in data folder of the github repository. Raw data are available in Multi-Omics Platform at http://203.252.206.90:5566/, and it can be accessible when requested.


Supplementary tables
Fig. S1: Eosinophilic asthma/non-eosinophilic asthma is the asthma subtype that is discriminable with multi-omics data Each bar plot shows the prediction accuracy of each subtype defined in Table S1.The right panel shows the bar plot of Eosinophilic asthma (EA) and non-eosinophilic asthma (NEA).Blue bars depict the 10-fold cross-validation prediction accuracy from random forest classifier using single omics data-methylome, transcriptome, proteome, and metabolome.On the contrary, the grey bar depicts baseline prediction accuracy which is the ratio of the majority subtype class over total samples.The more imbalanced the subtype class is, the higher the baseline prediction accuracy is.We compared the prediction accuracy of each omics data compared to the baseline.Our goal is to determine subtypes that show high prediction accuracy due to the explanatory power of each omics, not by the random guessing of subtype class to the majority class.Fig. S2: Important genes discovered solely with proteome is secretory/structural proteins List of proteins discovered as important features in random forest model using proteome data.It is the gene-level inspection of the results of proteome from Fig. S1.Feature importance is computed as permutation importance for 5 repeats using permutation importance method from sklearn.inspection module (Breiman, 2001).The proteins are selected as important features when the mean importance over the repeats is non-zero.The function of each protein is annotated by the Secretome database of Protein Atlas (Uhlén et al., 2019) and GeneCards (Stelzer et al., 2016).Gene ablation study Tightness of genes in a PPI network AUROC AUPRC Fig. S8: Existence of cooperative effect of biomarkers (A) Average node connectivity of genes in a protein-protein interaction network comparing the biomarkers identified by GOAT with other biomarker discovery methods based on statistical testing (Detail in Supplementary method H).Violin plots depict the null distribution of average node connectivity.The star denotes the test statistic, the average node connectivity of the specified biomarker set discovered with the methods in the x-axis.(B) Line plot of test AUPRC/AUROC decaying according to the number of genes excluded.To exclude the effect of the randomly selected genes in a model, all selected features were set to zero.For exclusion of the specified number of genes, random exclusion is repeated 10 times so that the interquartile range is depicted as shades.DEG, Differentially expressed gene; DEP, Differentially expressed protein; DAMgene, Differentially abundant metabolite (DAM)-related gene; AUPRC, Area under the precision-recall curve; AUROC, Area under the receiver operating characteristic curve.

C.
A.

D Data preprocessing
The multi-omics dataset of COREA cohort is downloaded from .Raw data can be accessible when requested.Methylation data is generated by bisulfite-sequencing and aligned to hg19 genome.Methylation status is measured in 3,419,515 CpG sites and transformed into gene-level methylation status by averaging methylation level of CpG sites within ± 2kb on transcription start site (TSS), using TSS annotation of hg19 from UCSC Genome Browser (Navarro Gonzalez et al., 2021).Transcriptome data is generated by RNA-seq using HISAT2 (Kim et al., 2015) (v2.2.0) and processed with samtools (Li et al., 2009) (v1.9) and HTSeq (Anders et al., 2015) (v0.12.4),where ucsc hg19 genome is utilized as a reference for each step.Raw read counts are normalized to counts per million.Proteome data is generated by SWATH mass spectrometry (MS), processed as a peak intensity file with Perseus (Tyanova et al., 2016) (v1.6.5), and processed as in protein-level abundance with R DEP package (Zhang et al., 2018).Metabolome data is generated via targeted liquid chromatography-MS of 400 endogenous metabolites and processed as a metabolite abundance with MetIDQ from Biocrates.Clinical subtypes are defined by the following clinical features (Table S1).

E Training and validation strategy E.1 Cross-validation scheme
We performed 10-fold cross-validation scheme where each set (N=30) became a held-out test set, while the other nine sets (N=270) were used for both biomarker prioritization (Stage 1) and biomarker identification (Stage 2) (Fig. S9A).Identical split of test set is fed into all methods in comparative analysis, including ours.To ensure there is no information leak, only samples from the training set were used to prioritize biomarkers by optimizing the network propagation parameters (stage 1) of our biomarker identification model (stage 2).

E.2 Parameter optimization
In the first stage of prioritization of biomarker candidates, hyperparameters are k was set to 100, α was set to 0.8, β was set to 0.5, and k ′ was set to 150.For hyperparameter tuning, prediction performance was assessed as 5-fold cross-validation accuracy within the training set, which is called validation accuracy.Parameters with the highest prediction performance for validation sets were saved for test set evaluation (Fig. S9B).For the hyperparameter sensitivity analysis shown in Fig. S9C, the prediction performance for the test set gradually decreases as k increases and the prediction performance becomes saturated as k ′ increases.We reported the test performance reported in Fig. 2 with the best combination of hyperparameters selected using validation accuracy, k = 100 and k ′ = 150.
In the second stage of identification of biomarkers, parameters consist of model parameter to be optimized during training process and hyperparameter to be designated outside of the model.For both model parameter optimization and hyperparameter tuning, prediction performance was evaluated as 5-fold cross-validation accuracy within training set and parameters with the highest performance were saved for test set evaluation (Fig. S9A).We used Adam Optimizer to optimize the model parameters.The hyperparameters were tuned by grid search, as a combination of learning rate [0.1, 0.01, 0.001, 0.0001], dropout rate [0.1, 0.2, 0.3, 0.4, 0.5], and feature dimension of node embeddings in graph neural network [4,8,16,32,64,128].Each combination of hyperparameters was fed into the model and evaluated as 5-fold cross-validation accuracy.Selected hyperparameters are 0.001 for learning rate, 0.2 for dropout rate, and 8 for feature dimension of node embeddings.

F Network propagation analysis
The specific network for the eosinophilic asthma / non-eosinophilic asthma subtype is constructed as a coexpression network of transcriptome data using samples of each subtype.Given a protein-protein interaction (PPI) network as a template, we computed edge weight as a Pearson correlation of each pair of genes in a PPI network.Edge weight higher than 0.3 is rescued to construct subtypespecific networks, W. For each transcription factor (TF), the node attribute of the selected TF is set to 1 and the resource of the node is propagated to all other nodes in a network according to the following equation, p t = (1 − α)Wp t−1 + αp 0

G Transcriptional network analysis
TRRUST is a database of gene regulatory networks whose interactions are retrieved from literature mining (Han et al., 2018).For the TFs, CTNNB1 and JUN, we trimmed downstream target genes of the TFs from TRRUST.The omic profile of the downstream target genes from the COREA cohort is examined.

H Biomarkers cooperativity
The tightness of genes in G in a protein-protein interaction network is defined as follows, where κ G (•) is the minimum number of nodes that must be removed to disconnect two non-adjacent nodes.We inferred null distribution with random sampling of |G| genes 100 times.Permutation test P value is computed according to the null distribution of a test statistic κ(G) of the specified genes.
Fig.S5: Subnetwork of biomarkers and functional analysis of gene modules, connected to CTNNB1 (A) Gene modules from network biomarkers.Each node in a network denotes a gene, and the edge is retrieved from the gene-interaction network exploited in graph neural network model.Rhombus indicates transcription factors while the circle indicates non-TFs.Blue nodes are functional genes that can be discovered by proteome or metabolome analysis, denoted DEP / DAM genes.Gene modules of functional genes are clustered via GLay algorithm(Su et al., 2010).Orange nodes are the genes solely by GOAT that are connected to the gene modules, denoted as mediator genes.Yellow nodes are the genes connected to CTNNB1.Blue shade indicates functional modules connected to CTNNB1.(B) Enriched gene ontology biological process terms of each module.TF, Transcription factor; DEP, Differentially expressed protein; DAMgene, Differentially abundant metabolite (DAM)-related gene.

Fig. S7 :
Fig. S7: CTNNB1/JUN-related genes show distinct pattern in Eosinophilic asthma vs. Non-eosinophilic asthma (A) Violin plot of gene expression of CTNNNB1 and JUN in transcriptome level comparing Eosinophilic asthma (EA) and Non-eosinophilic asthma (NEA).(B) Venn diagram of the genes most affected by CTNNNB1 and JUN, comparing EA/NEA (Detailed descriptions are in the Supplementary method F). (C) Venn diagram of biological process terms of the gene ontology enriched with subtype-specific genes discovered in B. (D) Violin plot of gene expression of JUN's target gene (TG) in proteome level comparing EA/NEA.(E) Upset plot to display all intersections of TF sets for each TG in D, sorted by size.TFs {gene} indicates the set of TFs for the gene retrieved from TRRUST database (Detailed descriptions are in Supplementary method G).Dark circles indicate sets that are part of the intersection, and horizontal bars indicate the size of each TF set.Blue shade indicates intersection shared by multiple TF sets.Deep blue shade indicates the intersection of TFs shared by all target genes.(Statistical annotations with t-test, ns: non-significant, **: p<0.01, ****: p<0.0001).TF, Transcription factor; EA, Eosinophilic asthma; NEA, Non-eosinophilic asthma.

Fig. S9 :
Fig. S9: Hyperparameter search for biomarker candidate prioritization (Stage 1) (A) Cross-validation scheme for hyperparameter search.5-fold CV was conducted within training set, where each fold was set as held-out set for validation and other folds were used for model training.Median of prediction accuracy across 5 CV-folds, called validation accuracy, was measured for hyperparameter tuning.(B) Line plot of validation accuracy with increasing k, k ′ .Red line in the plots depicts validation accuracy and light red shade depicts standard deviation of validation accuracy over 10-fold CV.Hyperparameter k signifies the number of DAMgenes from met-Propagate(Graham Linck et al., 2020) to be screened for the following procedures in stage 1 of GOAT and k ′ signifies the number of biomarker candidate genes from network propagation to be screened for the following procedures in stage 2 of GOAT.k, k ′ that showed best prediction performance within validation set are selected for test set evaluation.(C) Line plot of test AUPRC/AUROC with increasing k, k ′ .Blue line in the plots depicts test AUPRC/AUROC and light blue shade depicts standard deviation of test AUPRC/AUROC over 10-fold CV.CV, Cross-validation; Acc, Accuracy; AUPRC, Area under the precision-recall curve; AUROC, Area under the receiver operating characteristic curve; DAMgene, Differentially abundant metabolite (DAM)-related gene.11

Table S1 :
Subtype criterion Asthma subtypes classified by clinicians in Asan medical center.

Table S2 :
Over-representation analysis of gene modules connected via CTNNB1 Enriched gene ontology biological process (GOBP) terms with gene modules connected to CTNNB1.Over-representation analysis is conducted withEnrichR (Kuleshov et al., 2016)

Table S3 :
Over-representation analysis of gene modules connected via JUN Enriched gene ontology biological process (GOBP) terms with gene modules connected to JUN.Over-representation analysis is conducted withEnrichR (Kuleshov et al., 2016)

Table S4 :
TF enrichment TF enrichment analysis of the biomarkers discovered via EnrichR (Chen et al., 2013) based on TR-RUST Transcription Factors 2019 databse.