Lilikoi V2.0: a deep learning–enabled, personalized pathway-based R package for diagnosis and prognosis predictions using metabolomics data

Abstract Background previously we developed Lilikoi, a personalized pathway-based method to classify diseases using metabolomics data. Given the new trends of computation in the metabolomics field, it is important to update Lilikoi software. Results here we report the next version of Lilikoi as a significant upgrade. The new Lilikoi v2.0 R package has implemented a deep learning method for classification, in addition to popular machine learning methods. It also has several new modules, including the most significant addition of prognosis prediction, implemented by Cox-proportional hazards model and the deep learning–based Cox-nnet model. Additionally, Lilikoi v2.0 supports data preprocessing, exploratory analysis, pathway visualization, and metabolite pathway regression. Conculsion Lilikoi v2.0 is a modern, comprehensive package to enable metabolomics analysis in R programming environment.


INTRODUCTION
Metabolomics is an increasingly popular platform to systematically investigate metabolites as potential biomarkers for diseases [1]. With the rapid development in this field, data analysis is becoming a critical component to interpret and apply the results for translational and clinical research. However, currently the majority of metabolomics analysis workflows are provided as web-apps [1], limiting its adaptation by the bioinformatics community, and/or integration with other omics workflows in a programmable matter.
To address such needs, previously we developed Lilikoi, a personalized pathway-based method to classify diseases using metabolomics data [2]. Different from other metabolomics analysis packages, the personalized and pathway-based representation of metabolomics features is the highlight of Lilikoi version 1 (v1). Lilikoi v1 enables classifications using various machine learning methods. It has four modules: feature mapper, dimension transformer, feature selector, and classification predictor [2].
Here we report Lilikoi v2.0, as a significant upgrade for Lilikoi v1. The new mission is embarked by several recent trends or needs in the research community. Firstly, given the recent applications of deeplearning in the metabolomics and other genomics fields [3][4][5][6][7][8][9], it is important to enable metabolomics researchers to investigate such new approaches. We thus implemented a deep-learning neural network, as a new method in the classification module. Secondly, metabolomics have the potential to be prognosis markers [10], however, currently rarely metabolomics data analysis workflow is available for prognosis modeling and prediction. We herein implemented multiple methods for prognosis prediction, including Cox-Proportional Hazard (Cox-PH) model and Cox-nnet, a neural-network based model [5].
Thirdly, we augmented the pathway-based metabolomics analysis with metabolite-pathway regression and pathway visualization. Last but not least, we also include additional preprocessing methods for metabolomics data analysis (eg. normalization, imputation) and tools for exploratory data analysis (eg. PCA and t-SNE analysis, and source of variation analysis). In summary, Lilikoi v2.0 is a more mature, comprehensive and modern package to empower the metabolomics community.

Datasets
Three breast cancer metabolomics datasets were used to demonstrate the new functionalities of Lilikoi samples are ER+ and 67 samples are ER- [12]. Metabolomics in this dataset were based on GC-TOFMS.
The third dataset is shared by authors from an original National Cancer Institute (NCI) study, composed of 536 metabolites from 67 breast tumor samples and 65 tumor-adjacent noncancerous tissues [10]. In our analysis, we only used the 67 breast tumor samples for prognosis modeling.

Data preprocessing
For data preprocessing, we added normalization and imputation methods. Three normalization methods (standard normalization, quantile normalization and median-fold normalization) were implemented. We used the normalize.quantiles function in the preprocessCore package [13] to perform the Quantile normalization. For imputation, we used k-nearest neighbors (knn) method as the default method to impute missing values. The knn imputation was done by the impute.knn in the impute R package [14].

Exploratory analysis
Principal Component Analysis (PCA) is a feature selection technique [15]. It extracts the most important information in high-dimensional datasets. The t-SNE plot is a dimension reduction method to help users to visualize high-dimensional data [16] . We implemented the PCA and t-SNE plots in Lilikoi v2.0 via the M3C package [17]. We also added the source of variation analysis (SOV) for exploratory analysis, implemented by the Anova function in the car package [18]. SOV identifies the relationships between confounders and metabolomics data, based on ANOVA tests [19,20]. Any clinical variable with F-score bigger than the error term, whose F-score is 1, is deemed a confounder.

Metabolite to pathway level transformation
Most other pathway analysis tools for metabolomics data that use Fisher's exact test or hypergeometric test, and their performance was compared earlier [21]. Different from all these method, Lilikoi uses the Pathifier algorithm to perform the metabolites-pathway dimension transformation per sample [22] . For each pathway P in each patient i, a pathway dysregulation score (PDS) DP(i) between 0 and 1 is generated, based on the metabolites associated with this pathway. A larger PDS value represents a higher degree of dysregulation (larger deviation from the normal controls). As the result of the dimension transformation, a new pathway-level matrix is constructed, which can be used to substitute the original metabolomics profile matrix, for downstream classification or prognosis modeling.
Briefly, a PDS score DP(i) is calculated as the following: in the high-dimensional space dP made of metabolite vectors (where each metabolite belongs to pathway P), all samples form a data cloud, where sample i is a data point xi. The principle curve SP′ in this space dP is then computed using Hastie and Stuetzle's algorithm [23]. For each sample, the data point xi is projected onto the principle curve SP′.
The dysregulation score DP(i) of sample i is then defined as the distance from the start of the principal curve to the projected point on this curve. More details of applications of Pathifier on biomarker studies (prognosis or diagnosis) can be found in earlier publications [2,24,25].

Deep learning for classification
The deep learning algorithm in Lilikoi v2.0 is based on the h2o package [26][27][28]. It used a multi-layer neural network trained with stochastic gradient descent search to predict the diagnosis results. For the neural-network configuration, users are free to set parameters including activation function, hidden layer size, drop-out ratio, L1 and L2 regularization, size of one batch and adaptive learning rate decay factor. Users can also incorporate other control parameters like random discrete to optimize the hyperparameter setting to achieve the best deep learning performance. The running time for each classification method is calculated with the Sys.time() function in R and measured using slurm job scheduler on a dedicated group computer server cluster (consisting of 4 nodes (Dell PowerEdge C6420) of 2 X Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz, 192GB RAM). One processor and 50GB memory were reserved for each job.

Prognosis prediction
Lilikoi v2.0 enables prognosis prediction, at either the metabolite level using metabolite-sample matrix, or the pathway-level (after pathifier based pathway transformation) using pathway dysregulation score (PDS)-sample matrix. PDS is a normalized score between [0,1] that measures the degree of dysregulation of a pathway relative to the norm (controls). Currently two prognosis prediction methods are implemented: Cox-Proportional Hazards (Cox-PH) method [29] with penalization, and the neuralnetwork based Cox-nnet method [5]. Cox-nnet is based on the artificial neural network framework with a default of two-layer neural network: a hidden layer and an output layer [31]. The output layer is fit to the Cox regression. Lilikoi v2.0 imports Cox-nnet originally written in Python, using the reticulate package.
The hazard function of the Cox-PH model is: ℎ( | ) = ℎ 0 ( ) ( )with the log hazard ratio of = with its partial likelihood cost function : The Cox-nnet expands the Cox-PH hazard function above as: where is the output of the hidden layer, G is the activation function and W is the coefficient weight matrix between the input and hidden layer.
In the demonstration NCI data, we applied cross validation on the training dataset to determine the optimal L2 regularization lambda parameter. Cox-nnet supports three gradient descent algorithms: The fitness of the models is evaluated by two metrics: C-index and log rank p-values. C-index is a goodness of fit measure of survival models [33]. A C-index of 1 indicates that the model is the best model for prediction and C-index = 0.5 means that the model prediction is no better than a random guess. Log-rank p-value is based on the log-rank test [34,35] to evaluate the null hypothesis that no difference in survival exists between the high-risk and low-risk groups. Log-rank p-value less than 0.05 means that there is significant difference between these two groups. Users have the option to split the data by N folds cross-validation, where the model is trained on the N-1 fold data and evaluated on the rest 1 fold data.

Pathway level analysis
The selected pathway features from classification or prognosis prediction, can be visualized with the Pathview r package [36]. Currently, any KEGG pathway can be used as the input to render pathway in Lilikoi by the RCy3 R package [37]. predictor [2]. However, given the recent applications of deep-learning in the metabolomics and other genomics fields [3][4][5][6][7][8], it is important to enable metabolomics researchers to investigate such new approaches. We thus implemented deep-learning as a new method in the classification module.

Overview of updated functionalities in
Moreover, metabolomics have the potential to be prognosis markers [10], however, currently rarely metabolomics data analysis workflow is available to handle this issue. We herein implemented multiple methods for prognosis prediction, including Cox-Proportional Hazard (or Cox-PH) model and Coxnnet, a neural-network based model [5]. Additionally, we augmented the pathway-based metabolomics analysis with metabolite-pathway relationship analysis and pathway visualization. Last but not least, we also include additional preprocessing methods for metabolomics data analysis (eg. normalization, imputation) and tools for exploratory data analysis (eg. PCA and t-SNE analysis, and source of variation analysis).
Importantly, Lilikoi v2.0 has added the following new functionalities (marked in red text boxes), as shown in Fig. 1. A pre-processing module is added for the initial steps, where normalization and imputation are considered. A new exploratory data analysis module is also added, to enable dimension To investigate the metabolomics-phenotype data relationship, Lilikoi v2.0 has added the source of variation analysis between confounders and metabolomics data, based on ANOVA tests [18]. Any clinical confounder with F-score bigger than the error term, whose F-score is 1, needs to be adjusted for in differential metabolite tests, when using other clinical variable(s) for grouping.

Deep learning enabled classification module
Deep-learning enabled classification module is one of the highlighted functionalities of Lilikoi v2.0.
The deep learning framework uses the same dataset and adopts the same architecture as previously described [9]. The objective is to classify the 204 ER+ samples from the 67 ER-samples. We split the data roughly 4:1 ratio into training and testing data, with 10 fold cross-validation in the training data. We repeated this process 10 times randomly, to obtain averaged metrics.
We used the metabolite features as the inputs for deep-learning based classification, along with other popular methods: LDA, SVM, RF, RPART, LOG, and GBM (Methods). As shown in Fig. 2A and Table   1, deep learning on average performs the best overall in the training data, with the significantly higher F-1 statistic value (0.95) and sensitivity (0.98) than all other methods. F-1 statistic is a good unbiased metric given the unbalanced samples in ER+ and ER-classes. However, the specificity (0.75) in the training dataset is second to the lowest (SPEC of LDA=0.72). The advantage of deep learning is more pronounced in the testing dataset ( Fig. 2B  probably due to the size of the samples. As a word of caution, the computation time to run the deep-learning method is significantly longer than other machine learning methods, and it is only beneficial when the sample size is median (in the order of hundreds).

Prognosis prediction
Deep-learning enabled prognosis prediction is another unique functionalities of Lilikoi v2.0, compared to other metabolomics analysis packages and toolkits. To demonstrate prognosis analysis, we used the NCI dataset as described in Methods. As the unique feature of Lilikoi is pathway-level modeling, the metabolites intensity data are first transformed to pathway level data matrix (see Methods). Penalized survival analysis using Cox-PH model and Cox-nnet were conducted. For Cox-PH regression, L2 norm (Ridge) penalization was applied to select featured pathways. After fitting, the prognosis index (PI) was used to separate the patient into the high-risk vs low-risk groups using the first quantile of PI as the threshold. As shown by the Kaplan-Meier curves in Fig. 3, the Cox-PH model yields a C-index value of 0.65 and log rank p-value of 0.04 (Fig. 3A); Cox-nnet model yields slightly better results, with a Cindex value of 0.66 and log rank p-value of 0.02 (Fig. 3B).

Pathway downstream analysis
We used the metabolites expression information in the aforementioned workbench breast cancer dataset PR000284 as the cpd.data input of the pathview function. According to our featureSelection results, alanine aspartate and glutamate metabolism is one of the top pathways for metabolite data. Therefore, we demonstrate the pathway visualization, based on the Pathview R package. using "alanine aspartate and glutamate metabolism pathway" (Fig. 4). As shown in Fig. 4, 6 metabolites in this pathway have intensities. Asparagine has increased levels in ER-patients, due to the conversion from its substrate aspartate, which is reduced in ER-patients. The reduction of aspartate in ER-patients is consistent with the previous observation [38].
It is important to link the significant metabolites that contribute to the pathway features. For this, singlevariate regressions between metabolites and pathways are conducted, with the workbench dataset with 207 plasma samples (126 breast cancer cases and 81 control cases). The regression results (Fig. 5) can be visualized by the partite graph, where the yellow nodes represent pathway features, and the green nodes are metabolites significantly (p<0.05) associated with the pathways. show how each metabolite contributes to the selected pathways. The generic term "metabolic pathways", is associated with the largest number (86) of metabolites. Among them, isopentenyl pyrophosphate has the most weight on the edge. Many pathways related to amino acid synthesis and metabolism are highlighted. Users can also elect to exam the metabolites within a particular pathway, by individual bar graphs. As an example, we show the metabolites that are associated with "alanine aspartate and glutamate metabolism pathway" analysis packages [39][40][41]. One of the closest comprehensive packages MetaboAnalystR [42]. Some functions are similar between the two tools, such as classification using caret packages. However, there are some very significant differences between the two, such as the functionalities mentioned above. On the other hand, MetaboAnalystR provides other functionalities, such as time-series analysis, power analysis, and network explorer, which Lilikoi does not have yet.
Some practical challenges still exist, leaving room for the future development of Lilikoi. For example, mapping rate of metabolites and pathways can be further improved, by using better matching algorithms. Also, the current best classification model in Lilikoi is determined by users. We would like to automatically recommend the best classification model for users. This will be dependent on training a large set of metabolomics datasets for benchmarking, beyond of the scope of this report. Despite this, we recommend the users to pay more attention to the machine learning methods that are less prone to overfitting, such as random forest, given the fact that the majority of the datasets have median sample size (in the order of hundreds). The comparison between deep-learning and other machine-learning methods shows the advantages of gained accuracy of deep-learning method. However, such benefit is achieved at the cost of computation time. Moreover, the higher performance of deep-learning method is conditional on the sample sizes. Deep-learning method is superior when the patient size is at least a few hundred. Moreover, given the limited size of annotated metabolites (in the order of hundreds) in the assays, the pathway visualization is sparse, based on metabolites alone. Integration between metabolomics and other genomics data types is helpful to fill in the missing information, especially with the aid of deep-learning and machine-learning ensemble tools, such as DeepProg models that are developed by us and others [3,4,8,43].      Glutamate Metabolism pathway and its corresponding metabolites. *: p<0.05; **: p<0.01; ***:p<0.001.