Deep IDA: a deep learning approach for integrative discriminant analysis of multi-omics data with feature ranking—an application to COVID-19

Abstract

Motivation: Many diseases are complex, heterogeneous conditions that affect multiple organs in the body and depend on the interplay of several factors, including molecular and environmental factors, requiring a holistic approach to better understand disease pathobiology. Most existing methods for integrating data from multiple sources and classifying individuals into one of multiple classes or disease groups have focused on linear relationships despite the complexity of these relationships. On the other hand, methods for nonlinear association and classification studies are limited in their ability to identify variables that aid our understanding of the complexity of the disease, or can be applied to only two data types.

Results: We propose Deep Integrative Discriminant Analysis (Deep IDA), a deep learning method that learns complex nonlinear transformations of two or more views such that the resulting projections have maximum association and maximum separation. Further, we propose a feature ranking approach based on ensemble learning for interpretable results. We test Deep IDA on both simulated data and two large real-world datasets, including RNA sequencing, metabolomics, and proteomics data pertaining to COVID-19 severity. We identified signatures that better discriminated COVID-19 patient groups and that are related to neurological conditions, cancer, and metabolic diseases, corroborating current research findings and heightening the need to study the post-acute sequelae of COVID-19 to devise effective treatments and to improve patient care.

Availability and implementation: Our algorithms are implemented in PyTorch and available at: https://github.com/JiuzhouW/DeepIDA


Integrative Discriminant Analysis (IDA) for joint association and classification
Let S_b^d and S_w^d be the between-class and within-class covariance matrices for the d-th view, respectively. The first term in equation (1) maximizes the sum of the separations of the classes in each view, and the second term maximizes the sum of the squared pairwise correlations between views. Here, ρ controls the relative influence of separation and association in the optimization problem. The optimal solution of equation (1) was shown to solve D systems of eigenvalue problems [12]. The discriminant loadings (Γ_1, ..., Γ_D) correspond to the eigenvectors that maximally associate the views and separate the classes within each view. The views are projected onto the discriminant loadings to yield discriminant scores, and samples are classified using the nearest centroid rule. In the Method Section, we propose a novel nonlinear formulation of IDA using deep neural networks (DNNs) to study nonlinear associations among two or more views and separation of classes within a view. We also propose a homogeneous ensemble approach for feature ranking to identify the features contributing most to the association of the views and the separation of the classes within a view.
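The nearest centroid step on discriminant scores can be sketched as follows (a minimal NumPy illustration, not the authors' implementation; the function and variable names are ours):

```python
import numpy as np

def nearest_centroid_fit(scores, labels):
    """Compute one centroid per class from the discriminant scores."""
    classes = np.unique(labels)
    centroids = np.stack([scores[labels == k].mean(axis=0) for k in classes])
    return classes, centroids

def nearest_centroid_predict(scores, classes, centroids):
    """Assign each sample to the class of the nearest centroid."""
    dists = np.linalg.norm(scores[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]
```

In practice the scores would be the projections of each view onto the learned discriminant loadings.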

Theorems and Proofs
Theorem 1. Let S_t^d and S_b^d respectively be the total covariance and the between-class covariance for the top-level representations H_d, d = 1, ..., D. Let S_dj be the cross-covariance between top-level representations d and j. Then the solutions (Γ_1, ..., Γ_D) of equation (4) of the main text are eigenvectors corresponding respectively to the eigenvalues of D generalized eigenvalue systems.

Proof. Solving the optimization problem is equivalent to iteratively solving D generalized eigenvalue systems, one per view. Form the Lagrangian of the problem and set the first-order stationarity condition with respect to Γ_d (for every d = 1, ..., D) to zero. Rearranging this condition with Γ_j, j ≠ d, held fixed yields an eigenvalue problem for view d. Arrange the eigenvalues from largest to smallest and let Λ_d ∈ R^{o_d × o_d} denote the diagonal matrix of these values; for the top l largest eigenvalues, denote the corresponding eigenvectors as Γ_d = [γ_{d,1}, ..., γ_{d,l}]. Starting from d = 1, Γ_1 is updated; then Γ_2 is updated, and so on, until finally Γ_D is updated. We iterate until convergence, defined as the change in the objective between successive sweeps falling below a prespecified tolerance.

The solution f_d to the optimization problem in equation (5) of the main text for view d maximizes the sum of the top l eigenvalues of the corresponding eigenvalue system. To see this, fix d and let η_{d,1}, η_{d,2}, ..., η_{d,l} be the top l eigenvalues of that system; the objective for view d then equals the sum of the η_{d,r}, r = 1, ..., l.

1.4 Comparison of Deep IDA with some existing nonlinear methods

Unlike Deep CCA [1] and Deep generalized CCA [2], which use DNNs to learn transformations of nonlinearly transformed views that maximize the associations between two (Deep CCA) or more (Deep generalized CCA) views, we jointly maximize both the association of the views and the separation of the classes within each view. Thus, we do not need a sophisticated downstream classification algorithm, which would add another layer of computational complexity. Our proposed method is closely related to the multiview linear discriminant analysis network (MvLDAN) of [5], since we too find linear projections of nonlinearly transformed views that separate the classes within each view and maximize the correlation between the multiple views. In [5], the authors
proposed to solve the following optimization problem for the linear projection matrices A_1, ..., A_D and the neural network parameters (weights and biases): where T is a concatenation of the projection matrices from all views, S_b and S_w are the between-class and within-class covariance matrices for all views, respectively, and S_c is the cross-covariance matrix for all views.
Further, λ and β are regularization parameters. The authors then considered learning the parameters of the deep neural networks by maximizing the smallest eigenvalues of the generalized eigenvalue problem arising from equation (5) that do not exceed a threshold specified in advance. Although we share the same goal, our optimization formulation in equation (2) of the main text and our loss function are different. We constrain the total covariance matrix S_t instead of S_w and, as noted above, our loss function is bounded. As such, we do not have convergence issues when training our deep learning parameters. Deep LDA [3], a deep neural network-based linear discriminant analysis, suffers from this convergence issue. A major drawback of MvLDAN, Deep CCA, Deep generalized CCA, and most existing DNN methods is that they cannot be used to identify the variables contributing most to the association among the views and/or the separation of the classes. Recently, MOMA, a DNN method for multiview multi-task classification problems that uses an attention mechanism for interpretability, was proposed [10]. We note that although MOMA is developed for two or more views, the algorithm the authors provide is applicable to only two views.
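The block-coordinate eigen-updates described in the proof of Theorem 1 can be sketched generically as follows (a schematic only: `build_matrix` is a placeholder we introduce for the view-specific matrix c(·) whose exact form is given in the main text):

```python
import numpy as np

def block_eigen_updates(build_matrix, D, l, max_iter=100, tol=1e-6):
    """Cycle over views; each step solves one symmetric eigenvalue problem.

    build_matrix(d, Gammas) must return the symmetric matrix for view d
    given the current loadings of the other views (a stand-in for the
    matrix c(.) in the proof of Theorem 1).
    """
    Gammas = [None] * D
    prev = np.inf
    for _ in range(max_iter):
        total = 0.0
        for d in range(D):
            C = build_matrix(d, Gammas)
            vals, vecs = np.linalg.eigh(C)       # eigenvalues in ascending order
            idx = np.argsort(vals)[::-1][:l]     # indices of the top-l eigenvalues
            Gammas[d] = vecs[:, idx]
            total += vals[idx].sum()
        if abs(prev - total) < tol:              # convergence of the objective
            break
        prev = total
    return Gammas
```

Each pass updates Γ_1, then Γ_2, and so on, mirroring the sweep order in the proof.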

Optimization and Algorithm
The training process of Deep IDA is as follows:

• Feed forward and calculate the loss. The outputs of the D deep neural networks are H_1, ..., H_D, and each network has its own parameters (i.e., weights and biases). The loss is then calculated from these outputs based on the objective in equation (6) of the main text.

• Gradient within each sub-network. Since each view-specific sub-network is propagated separately, we can calculate the gradient of each sub-network independently. Denoting the neural network parameters (i.e., weights and biases) of the view-d network by θ_d, we calculate the partial derivative of the last layer with respect to the sub-network parameters, ∂H_d/∂θ_d. These networks may be shallow or have multiple layers, with nonlinear activation functions such as Leaky ReLU [8].
• Deep IDA update via gradient descent. By the chain rule, ∂L/∂θ_d = (∂L/∂H_d)(∂H_d/∂θ_d). We use the autograd function in the PyTorch package [11] to compute this gradient. At every optimization step, a stochastic gradient descent-based optimizer, such as Adam [6], is used to update the network parameters.
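One such training step can be sketched in PyTorch as follows. This is a simplified stand-in, not the authors' implementation: the loss below keeps only the per-view separation part of the objective (minus the sum of the top-l eigenvalues of S_t^{-1/2} S_b S_t^{-1/2}) and omits the cross-view association terms; all names are ours.

```python
import torch

def deep_ida_style_loss(H_list, labels, l=2, eps=1e-3):
    """Simplified per-view eigenvalue loss: for each top-level
    representation H, form the (ridge-regularized) total covariance S_t
    and the between-class covariance S_b, and subtract the top-l
    eigenvalues of S_t^{-1/2} S_b S_t^{-1/2}."""
    loss = H_list[0].new_zeros(())
    classes = labels.unique()
    for H in H_list:
        n, p = H.shape
        Hc = H - H.mean(0, keepdim=True)
        St = Hc.T @ Hc / (n - 1) + eps * torch.eye(p)      # total covariance
        w, Q = torch.linalg.eigh(St)
        St_inv_half = Q @ torch.diag(w.clamp_min(eps).rsqrt()) @ Q.T
        means = torch.stack([H[labels == k].mean(0) for k in classes])
        mb = means - H.mean(0, keepdim=True)
        Sb = mb.T @ mb / max(len(classes) - 1, 1)          # between-class covariance
        M = St_inv_half @ Sb @ St_inv_half
        loss = loss - torch.linalg.eigvalsh(M)[-l:].sum()  # minus top-l eigenvalues
    return loss

# one optimization step for two small view-specific sub-networks
torch.manual_seed(0)
X1, X2 = torch.randn(30, 8), torch.randn(30, 6)
y = torch.randint(0, 3, (30,))
net1 = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.LeakyReLU(0.1))
net2 = torch.nn.Sequential(torch.nn.Linear(6, 4), torch.nn.LeakyReLU(0.1))
opt = torch.optim.Adam(list(net1.parameters()) + list(net2.parameters()), lr=1e-3)
loss = deep_ida_style_loss([net1(X1), net2(X2)], y)
opt.zero_grad()
loss.backward()   # autograd supplies dL/dH_d and chains it through each theta_d
opt.step()
```

The `backward()` call is exactly the chain-rule step above: autograd differentiates the eigenvalue loss with respect to H_d and propagates through each sub-network's parameters.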
We describe the Deep IDA algorithm in Algorithm 1. We also describe in Algorithm 2 the approach for obtaining the linear projections using the output of the final layer. The body of the while loop in Algorithm 1 is:

3: Feed forward the network of each view with the latest weights and biases to obtain the final-layer outputs
4: Apply Algorithm 2 to obtain Γ_1, ..., Γ_D
5: Compute the eigenvalues of c and the corresponding loss
6: Compute the gradient of the weights and biases of each network with the PyTorch autograd function
7: Update the weights and biases with the specified learning rate α

Run Time
On a desktop with 32 GB of memory, a single run of Deep IDA typically takes two minutes to converge (around 50 epochs) on data with sample size 6,000 and feature size 2,000. In addition, the feature ranking process for one bootstrap averages around 4 hours for data of dimension 6,000 × 2,000. However, with parallel computing, it is possible to train 50 bootstraps concurrently within the same timeframe. For datasets with smaller sample sizes, like those in our real data analysis, a single Deep IDA run takes less than 10 seconds, and the total time for feature ranking based on 50 bootstraps is approximately 1.5 hours when using 15-core parallel computing.
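Running bootstraps concurrently can be sketched as below (a toy illustration with a thread pool; `one_bootstrap` is a hypothetical stand-in for a single Deep IDA bootstrap fit, here replaced by a per-feature summary — real runs would dispatch full model fits to separate processes or cluster jobs):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def one_bootstrap(X, seed):
    """Hypothetical stand-in for one bootstrap fit: resample the rows with
    replacement and return a per-feature score (here just the mean)."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx].mean(axis=0)

def run_bootstraps(X, n_boot=50, workers=4):
    """Launch the bootstrap fits concurrently and collect their scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(one_bootstrap, X, s) for s in range(n_boot)]
        return [f.result() for f in futures]
```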

Set-up
We conduct simulation studies to assess the performance of Deep IDA for varying data dimensions and as the relationships between the views, and within each view, become more complex. We demonstrate that Deep IDA is capable of i) simultaneously associating data from multiple views and discriminating among sample classes, and ii) identifying signal variables.

We consider two different simulation scenarios. In Scenario One, we simulate data to have linear relationships between views and linear decision boundaries between classes (refer to the Supplementary Material for set-up and results). In Scenario Two, we simulate data to have nonlinear relationships between views and nonlinear decision boundaries between classes. There are K = 3 classes and D = 2 or D = 3 views in Scenario One.

Comparison Methods
We compare Deep IDA with classification-, association-, and joint association and classification-based methods.
For classification-based methods, we consider the support vector machine (SVM) [4] on stacked views. For association-based methods, we consider nonlinear methods such as deep canonical correlation analysis (Deep CCA) [1] and deep generalized CCA (DGCCA) [2] when there are three or more views. The association-based methods only model nonlinear associations between views, so we follow them with SVM to perform classification using the learned low-dimensional representations. We also compare Deep IDA to SIDA [12], a joint association and classification method that models linear relationships between views and among classes in each view. We perform SIDA using the Matlab code the authors provide, and Deep CCA and DGCCA using the PyTorch code the authors provide. We couple Deep CCA and DGCCA with the teacher-student (TS) framework [9] to rank variables. We also investigate the performance of these methods when we use the variables selected by Deep IDA.
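The stacked-views SVM baseline amounts to column-concatenating the views before fitting one classifier. A minimal sketch with scikit-learn (kernel and regularization left at the library defaults, which need not match the settings used in the experiments):

```python
import numpy as np
from sklearn.svm import SVC

def svm_on_stacked_views(views_train, y_train, views_test):
    """Column-stack the views into one matrix and fit a single SVM."""
    Xtr = np.hstack(views_train)   # n x (p_1 + ... + p_D)
    Xte = np.hstack(views_test)
    clf = SVC().fit(Xtr, y_train)
    return clf.predict(Xte)
```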

Linear Simulations
We consider two simulation settings in this Scenario, and we simulate data similar to the simulations in [12]. In Setting One, there are D = 2 views, X_1 and X_2, with p_1 = 1,000 and p_2 = 1,000 variables, respectively.
There are K = 3 classes, each of size n_k = 180, k = 1, 2, 3, giving a total sample size of n = 540. Let X_d, d = 1, 2, be a concatenation of the data from the three classes. The combined data (X_k^1, X_k^2) for class k are simulated from N(µ_k, Σ), where µ_k^1 and µ_k^2 are the mean vectors for X_k^1 and X_k^2, respectively. The covariance matrix Σ of the combined data for each class is partitioned into Σ_1 and Σ_2, the covariances of X_1 and X_2, respectively, and Σ_12, the cross-covariance between the two views. We generate Σ_1 and Σ_2 as block diagonal with 2 blocks of size 10, between-block correlation 0, and each block a compound symmetric matrix with correlation 0.8. We generate the cross-covariance matrix Σ_12 from matrices V_1 and V_2, where the entries of V_1 ∈ ℜ^{20×2} are i.i.d. samples from U(0.5, 1); we similarly define V_2 for the second view, and we normalize both. We then scale Σ_12 to depict moderate association between the views. For class separation, we construct a mean matrix from Σ_d and a matrix A_d, and set its first, second, and third columns as the mean vectors for classes 1, 2, and 3, respectively. Here, the first column of A_1 ∈ ℜ^{p_1×2} is set to (c·1_10, 0_{p_1−10}) and the second column to (0_10, −c·1_10, 0_{p_1−20}); we fix c at 0.2. We set A_2 similarly, but fix c at 0.1 to allow a different class separation in each view.
In Setting Two, we simulate D = 3 views, X_d, d = 1, 2, 3, and each view is a concatenation of the data from three classes, as before. The combined data (X_k^1, X_k^2, X_k^3) for each class are simulated from N(µ_k, Σ), where µ_k is the combined mean vector for class k and µ_k^d ∈ ℜ^{p_d}, d = 1, 2, 3, are the mean vectors for X_k^d. The true covariance matrix Σ is defined as in Setting One with the following modifications. We include Σ_3, Σ_13, and Σ_23, and we set Σ_13 = Σ_23 = Σ_12. Like Σ_1 and Σ_2, Σ_3 is partitioned into signals and noise, and the covariance of the signal variables is also block diagonal with 2 blocks of size 10, between-block correlation 0, and each block a compound symmetric matrix with correlation 0.8.
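The block-diagonal, compound symmetric signal covariance used in both settings can be generated as in the following sketch (our own helper names; the cross-covariance and mean construction are omitted):

```python
import numpy as np

def compound_symmetric(size, rho):
    """Compound symmetric block: 1 on the diagonal, rho elsewhere."""
    return np.full((size, size), rho) + (1 - rho) * np.eye(size)

def signal_covariance(n_blocks=2, block_size=10, rho=0.8, p=1000):
    """Covariance with n_blocks compound symmetric blocks for the signal
    variables (between-block correlation 0); the remaining variables are
    independent noise with unit variance."""
    S = np.eye(p)
    for b in range(n_blocks):
        i = b * block_size
        S[i:i + block_size, i:i + block_size] = compound_symmetric(block_size, rho)
    return S
```

With the default arguments, the first 20 variables form the two correlated signal blocks, matching the "first 20 variables are important" evaluation criterion.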

Results for Linear Simulations
Table 2 gives classification accuracy for the methods and the true positive rates for the top 20 variables selected.
We implemented a three-hidden-layer network with dimensions 512, 256, and 64 for both Deep IDA and Deep CCA. The dimension of the output layer was set to 10. Table 7 lists the network structure used for each setting.
For Deep IDA + Bootstrap, the bootstrap algorithm proposed in the Methods Section was applied to the training data to choose the top 20 ranked variables. We then implemented Deep IDA on the training data using only the variables ranked in the top 20 in each view. The learned model and the testing data were used to obtain the test errors. To compare our feature ranking process with the teacher-student (TS) network approach, we also implemented Deep IDA without the bootstrap approach and used the learned model from Deep IDA in the TS framework for feature ranking. We also performed feature ranking using the learned models from Deep CCA (Setting One) and Deep GCCA (Setting Two). The average error rates for the nonlinear methods are higher than the error rate for SIDA, a linear method for joint association and classification analysis. This is not surprising, as the true relationships among the views, and among the classes within a view, are linear. Nevertheless, the average test error rate for Deep IDA based on the top 20 variables in each view from the bootstrap method (i.e., Deep IDA + Bootstrap) is lower than the average test error rates for Deep CCA and SVM (on stacked views). When we implemented Deep CCA, SVM, and DGCCA on the top 20 variables selected by our proposed method, we observed a decrease in the error rate across the methods. For instance, the error rates for Deep CCA using all variables versus the top 20 variables identified by our method were 33.17% and 22.95%, respectively. Further, compared to Deep IDA on all variables (i.e., Deep IDA + TS), Deep IDA + Bootstrap has a lower average test error, demonstrating the advantage of variable selection. In Setting Two, the classification accuracy for Deep GCCA was poor. In terms of variable selection, the proposed method was competitive with SIDA at identifying the top-ranked 20 variables. The TS framework for ranking variables was sub-optimal,
as evident from the true positive rates for Deep IDA + TS, Deep CCA + TS, and Deep GCCA + TS.

3 More Results From Real Data Analysis

Evaluation of the Noisy MNIST digits data
The original MNIST handwritten image dataset [7] consists of 70,000 grayscale images of handwritten digits, split into training, validation, and testing sets of 50,000, 10,000, and 10,000 images, respectively. The validation set was used to select the network parameters from the best epoch (lowest validation loss). Each image is 28 × 28 pixels and has an associated label denoting which digit (0-9) the image represents. In [13], a more complex and challenging noisy version of the original data was generated and used as a second view. First, all pixel values were scaled to [0, 1]. The images were randomly rotated at angles uniformly sampled from [−π/4, π/4], and the resulting images were used as view 1. Each rotated image was paired with an image of the same label from the original MNIST data; independent random noise generated from U(0, 1) was added, and the pixel values were truncated to [0, 1]. The transformed data constitute view 2. Figure 1 shows image plots of a digit for views 1 and 2. Of note, view 1 is informative for classification while view 2 is noisy. Therefore, an ideal multiview classification method should be able to extract the useful information from view 1 while disregarding the noise in view 2. The goal of this application is to evaluate how well the proposed method, without feature ranking, can classify the digits using the two views. Thus, we applied Deep IDA without feature ranking, and the competing methods, to the training data, and we used the learned models and the testing data to obtain the test classification accuracy.
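The two-view construction can be sketched roughly as follows (our own helper, using scipy; the exact pipeline in [13] may differ in details such as interpolation and how images are paired):

```python
import numpy as np
from scipy.ndimage import rotate

def make_noisy_views(img, pair_img, rng):
    """Sketch of the noisy-MNIST construction: view 1 is a randomly
    rotated copy of the digit image; view 2 is an image of the same digit
    with uniform noise added and pixel values truncated to [0, 1]."""
    angle = rng.uniform(-45, 45)                 # degrees, i.e. roughly [-pi/4, pi/4]
    view1 = rotate(img, angle, reshape=False)
    view2 = np.clip(pair_img + rng.uniform(0, 1, pair_img.shape), 0, 1)
    return view1, view2
```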
The validation data were used in Deep IDA and Deep CCA to choose the best model among all epochs. Table 9 in the Supplementary Material lists the network structure used in this analysis. Table 3 gives the test classification results of the methods. We observe that the test classification accuracy of the proposed method with nearest centroid classification is better than that of SVM on the stacked data, and is comparable to that of Deep CCA. We observe a slight advantage of the proposed method when we implement SVM on the final layer of Deep IDA.

Data pre-processing and application of Deep IDA and competing methods
Of the 128 patients, 125 had both omics and clinical data. We focused on the proteomics, RNA-seq, and metabolomics data in our analyses, since many lipids were not annotated. We formed a four-class classification problem using COVID-19 and ICU status. Our four groups were: COVID-19 and not admitted to the ICU (COVID Non-ICU), COVID-19 and admitted to the ICU (COVID ICU), no COVID-19 and admitted to the ICU (Non-COVID ICU), and no COVID-19 and not admitted to the ICU (Non-COVID Non-ICU). The frequency distribution of samples in these four groups was: 40% COVID ICU, 40% COVID Non-ICU, 8% Non-COVID Non-ICU, and 12% Non-COVID ICU. The initial dataset contained 18,212 genes, 517 proteins, and 111 metabolomics features. Prior to applying our method, we pre-processed the data as follows. All genes that were missing in our samples were removed from the dataset, leaving 15,106 genes. Skewed genes were transformed, and the transformed data were standardized to have mean zero and variance one. We filtered out genes with variance below the 25th percentile. We then used ANOVA on the standardized data to filter out (p-values > 0.05) genes with low potential to discriminate among the four groups. For the proteomics and metabolomics data, we standardized each molecule to have mean zero and variance one, pre-screened with ANOVA, and filtered out molecules with p-values > 0.05. Our final data were X_1 ∈ ℜ^{125×2,734} for the gene expression data, X_2 ∈ ℜ^{125×269} for the proteomics data, and X_3 ∈ ℜ^{125×66} for the metabolomics data.
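The standardize-then-ANOVA pre-screening step can be sketched as follows (our own helper using scipy; the gene-specific skewness and variance filters described above are not reproduced here):

```python
import numpy as np
from scipy.stats import f_oneway

def anova_prescreen(X, groups, alpha=0.05):
    """Standardize each feature to mean zero and variance one, then keep
    features whose one-way ANOVA p-value across the groups is <= alpha."""
    X = (X - X.mean(0)) / X.std(0)
    keep = []
    for j in range(X.shape[1]):
        samples = [X[groups == g, j] for g in np.unique(groups)]
        if f_oneway(*samples).pvalue <= alpha:
            keep.append(j)
    return X[:, keep], keep
```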

Here, µ_k^d = (1/n_k) Σ_{i=1}^{n_k} x_{ik}^d is the mean for class k in view d. Let S_dj, j < d, be the cross-covariance between the d-th and j-th views. Let M_d = S_w^{d,−1/2} S_b^d S_w^{d,−1/2} and N_dj = S_w^{d,−1/2} S_dj S_w^{j,−1/2}. The IDA method finds linear discriminant vectors (Γ_1, ..., Γ_D) that maximally associate the multiple views and separate the classes within each view by solving the optimization problem:

max over Γ_1, ..., Γ_D of  ρ Σ_{d=1}^D tr(Γ_d^T M_d Γ_d) + (2(1 − ρ)/(D(D − 1))) Σ_{d=1, d≠j}^D tr(Γ_d^T N_dj N_dj^T Γ_d),

subject to normalization constraints on the Γ_d.

Based on the objective in equation (6) of the main text, the final loss is calculated and denoted as L = −Σ_{d=1}^D Σ_{r=1}^l η_{d,r}.

• Gradient of the loss function. The loss function L depends on the estimated linear projections Γ_d, d = 1, ..., D, and since these linear projections use the outputs of the last layer of each network, no additional parameters are involved. Therefore, we calculate the gradient of the loss with respect to the view-specific output, i.e., ∂L/∂H_d, d = 1, ..., D.

Algorithm 1 :
Algorithm of Deep IDA. Input: training data X = {X_1, X_2, ..., X_D} and class labels y; the number of nodes in each layer of the D neural networks (including the dimensions of the linear subspaces to project onto, o_1, o_2, ..., o_D); learning rate α. Output: optimized weights and biases for the D neural networks and the corresponding estimates (f_1(X_1), ..., f_D(X_D)). 1: Initialization: assign random numbers to the weights and biases of the D neural networks. 2: while the loss has not converged do

Figure 1 :
Figure 1: An example of the Noisy MNIST data. For a subject with label "9", the view 1 observation is on the left and the view 2 observation is on the right.

Table 3 :
Noisy MNIST data: SVM was implemented on the stacked data. For Deep CCA + SVM, we trained SVM on the combined outputs (from views 1 and 2) of the last layer of Deep CCA. For Deep IDA + NCC, we implemented nearest centroid classification on the combined outputs of the last layer of Deep IDA. For Deep IDA + SVM, we trained SVM on the combined outputs of the last layer of Deep IDA.

Figure 3 :
Figure 3: Discrimination (3-D) plots: the COVID-19 patient groups are well separated in the training data. In the testing data, the COVID ICU group appears to be separated from the COVID Non-ICU and Non-COVID ICU groups, especially in the RNA-sequencing and proteomics data. Correlation plots (2-D): overall (combining all three discriminant scores), the mean correlation between the metabolomics and proteomics data was highest (0.4), while the mean correlation between the metabolomics and RNA-sequencing data was lowest (0.09).

Figure 4 :
Figure 4: Comparison of protein levels among COVID-19 patient groups (p-value < 0.05, Kruskal-Wallis test). COL18A1 was highly ranked by Deep IDA, and the other 8 proteins are shared by the "FXR/RXR Activation" and "LXR/RXR Activation" pathways. Protein expression levels for ALB, APOM, and TF are lower in patients with COVID-19 (especially those admitted to the ICU). Protein expression levels for AGT and CLU are higher in patients with COVID-19 admitted to the ICU compared to the other groups.

Supplementary Material for "Deep IDA: A Deep Learning Approach for Integrative Discriminant Analysis of Multi-omics Data with Feature Ranking: An Application to COVID-19"

Table 1 :
Unique features of Deep IDA compared to other methods. *Only applicable to two views. +Covariates could be added as an additional view in Deep GCCA. "Joint methods" refers to methods that perform integration and classification simultaneously.
In Scenario Two, there are K = 2 classes and D = 2 views. In all scenarios, we generate 20 Monte Carlo training, validation, and testing sets. We evaluate the proposed and existing methods using the following criteria: i) test accuracy, and ii) feature selection. For feature selection, we evaluate the methods' ability to select the true signals. In Scenario One, the first 20 variables are important, and in Scenario Two, the top 10% of variables in view 1 are signals. Since Deep IDA and the teacher-student (TS) framework rank features, and SIDA assigns zero weights to unimportant variables, for a fair comparison we only assess the number of signal variables among the top 20 (Scenario One) or top 10% (Scenario Two) variables selected by each method. We compare test accuracy for Deep IDA with and without the variable ranking approach proposed in this manuscript.
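The feature-selection criterion above (the fraction of true signals recovered among the top-ranked variables) can be computed as in this sketch (our own helper; `ranking_scores` stands for any per-variable importance produced by a method):

```python
import numpy as np

def top_k_tpr(ranking_scores, true_signal_idx, k=20):
    """True positive rate of the top-k ranked variables: the fraction of
    the true signal variables recovered among the k highest-scoring ones."""
    top_k = np.argsort(ranking_scores)[::-1][:k]
    hits = len(set(top_k) & set(true_signal_idx))
    return hits / len(true_signal_idx)
```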

Table 2 :
Linear Setting. RS: randomly selected tuning-parameter space to search. TPR-1: true positive rate for X_1; similarly for TPR-2. TS: teacher-student network. − indicates not applicable. "Deep IDA on selected top 20 variables" means we use the bootstrap algorithm to choose the top 20 ranked variables, train Deep IDA with those variables, and then use the learned model and the test data to obtain test errors.

Table 5 :
Top Diseases and Biological Functions from Ingenuity Pathway Analysis (IPA).

Table 6 :
Top Molecular and Cellular Functions from Ingenuity Pathway Analysis (IPA).

Table 7 :
Linear Simulations: network structures for all deep learning-based methods. To make fair comparisons, for each dataset the network structure for Deep CCA/Deep GCCA is the same as for the proposed Deep IDA method. The activation function is Leaky ReLU with parameter 0.1 by default. After activation, batch normalization is also applied. − indicates not applicable.

Table 8 :
Nonlinear Simulations: network structures for all deep learning-based methods. To make fair comparisons, for each dataset the network structure for Deep CCA/Deep GCCA is the same as for the proposed Deep IDA method. The activation function is Leaky ReLU with parameter 0.1 by default. After activation, batch normalization is also applied.

Table 9 :
Real Data Analysis: network structures for all deep learning-based methods. To make fair comparisons, for each dataset the network structure for Deep CCA/Deep GCCA is the same as for the proposed Deep IDA method. The activation function is Leaky ReLU with parameter 0.1 by default. After activation, batch normalization is also applied. For the COVID-19 data, we select the top 50 features for each view from bootstrap Deep IDA with input-512-20. − indicates not applicable.