mvlearnR and Shiny App for multiview learning

Abstract
Summary: The package mvlearnR and accompanying Shiny App are intended for integrating data from multiple sources, views, or modalities (e.g., genomics, proteomics, clinical, and demographic data). Most existing software packages for multiview learning are decentralized and offer limited capabilities, making it difficult for users to perform comprehensive integrative analysis. The new package wraps statistical and machine learning methods and graphical tools, providing a convenient and easy data integration workflow. For users with limited programming experience, we provide a Shiny Application to facilitate data integration anywhere and on any device. The methods have the potential to offer deeper insights into complex disease mechanisms.
Availability and implementation: mvlearnR is available from the following GitHub repository: https://github.com/lasandrall/mvlearnR. The web application is hosted on shinyapps.io and available at: https://multi-viewlearn.shinyapps.io/MultiView_Modeling/.


Introduction
Nowadays, multiple types of data (sometimes called views or modalities, e.g., genomics, proteomics, metabolomics) are frequently measured on the same sets of individuals, and this has opened an era of research on multiview learning. It is recognized that the mechanisms underlying complex diseases may be unraveled by approaches that go beyond analyzing each type of data separately. However, analyzing these data types to obtain useful information and knowledge is challenging because the data are complex, heterogeneous, and high-dimensional, and doing so requires a considerable level of analytical sophistication. Many methods have been proposed to associate multiview data (e.g., Hotelling [1936], Safo et al. [2018a], Horst [1961], Kettenring [1971], Lock et al. [2013], Safo and Lu [2023]).
These methods can be unsupervised or a combination of supervised and unsupervised techniques. The unsupervised methods first correlate the different data types to learn shared or view-independent low-dimensional representations Hotelling [1936], Safo et al. [2018a], Lock et al. [2013]. This is then followed by independent prediction analyses that use the learned low-dimensional representations. The unsupervised methods are useful for data exploration. The joint association and prediction methods combine supervised and unsupervised techniques such that assessing associations between multiple views is linked to predicting an outcome [Palzer et al., 2022, Safo et al., 2022, Wang and Safo, 2021, Moon and Lee, 2022]. The goal of these methods is to learn low-dimensional representations that have the potential to predict the outcome under consideration. Since the outcome is used in deriving the low-dimensional representations, they are naturally endowed with prediction capabilities, which enhances interpretability.
Most existing software packages for multiview learning tend to be decentralized, making it difficult for users to perform comprehensive integrative analysis. The mixOmics [Rohart et al., 2017] package for integration offers both supervised and unsupervised methods for multiview learning. However, the methods provided in mixOmics are limited. For instance, the outcome types are either continuous or categorical, not allowing for other types of outcomes (e.g., Poisson, time-to-event). The methods do not allow for the use of prior biological information, which can enhance interpretability. Importantly, users must be well versed in the R programming language, which is limiting.
We provide mvlearnR, an R package for multiview learning, which serves as comprehensive software for integrating data from multiple sources. The new package wraps statistical and machine learning methods and graphical tools, providing a convenient and easy data integration workflow. For users with limited programming experience, we provide a Shiny Application to facilitate data integration. Currently, mvlearnR can be used to:
• Prefilter each data type via differential analysis (DA). We provide both supervised and unsupervised options for DA or for filtering out noise variables prior to performing data integration.
• Integrate data from two sources using a variant of the popular unsupervised method for associating data from two views, i.e. canonical correlation analysis (CCA).
• Predict a clinical outcome using results from CCA. We provide four outcome distribution types (i.e., gaussian, binomial, poisson, and time-to-event).
• Jointly integrate data from two or more sources and discriminate between two or more classes. We provide an additional method that allows the user to incorporate prior biological structure (e.g., variable-variable relationships). These methods allow covariates to be included.
• Visualize results from DA or the integrative analysis methods. These plots include: volcano plots, UMAP plots, variable importance plots, discriminant plots, correlation plots, relevance network plots, loadings plots, and within- and between-view biplots. These visualization tools will help unravel complex relationships in multiview data.
• Demonstrate our integration workflow via already uploaded synthetic and real molecular and clinical data pertaining to COVID-19.
The rest of this paper presents the package and Shiny App, with more details in the Supplementary Material. We organize the paper as follows: First, we present the implementation details of the package and web application. Next, we present the filtering, supervised and unsupervised integration, and visualization methods in greater detail. Then, we demonstrate how mvlearnR can be used on real data and discuss the proper interpretation of the results. Finally, we discuss the limitations of the package and web application and potential future directions.

Methods and Implementation
In this section, we give details about the package and web application and summarize the methods implemented. In Table S1 of the supplementary material, we provide the currently available functions in mvlearnR and their descriptions.

The mvlearnR Web App and Package
The mvlearnR web app has a user-friendly interface (Figure 1), is ideal for users with limited programming expertise in R, and can be used anywhere and on any device. Leveraging state-of-the-art unsupervised Safo et al. [2018b] and supervised Safo et al. [2019] integrative analysis methods, the mvlearnR web server and package enable researchers to integrate molecular and clinical data, ultimately reducing the gap from raw molecular data to biological insights. The web application has four tabs. The first tab, 'Home', provides a brief overview of the methods and related links (Figure 1). The second tab, 'Supervised', is where the user implements supervised integrative analysis methods.
The third tab, 'Unsupervised', is where the user implements unsupervised integrative analysis methods. We provide options for the user to upload their own data or use example data. These tabs produce model outputs including classification performance, variable importance tables and plots, and several other plots to help the user understand the results. The fourth tab, 'Filtering', gives the user the option to filter and preprocess their data to a customizable lower-dimensional subset prior to data integration, using supervised and unsupervised filtering methods. The web application uses the R Shiny framework and is hosted at shinyapps.io.
In the R-package, we provide real data pertaining to COVID-19 severity and status. The data are from a study conducted by Overmyer et al. [2021] that collected blood samples from 102 participants with COVID-19 and 26 participants without COVID-19, which were quantified for metabolomics, RNA sequencing (RNA-Seq), proteomics, and lipidomics. We provide in this package the proteomics and RNA-Seq data as preprocessed in Lipman et al. [2022]. Disease severity was measured using the World Health Organization (WHO) 0-8 disease-specific scale (8 denotes death), and a score out of 45 that indicates the number of hospital-free days (HFD-45) Overmyer et al. [2021]. These two outcome variables and other metadata (e.g., age, sex, comorbidities) are provided in the R-package. We refer the reader to Overmyer et al. [2021] for more details on the available data. The R-package can be downloaded from GitHub at https://github.com/lasandrall/mvlearnR. We next describe the methods and functions.
Figure 1: Multiview Shiny App Interface. Our Shiny App allows non-users of R to seamlessly conduct integrative analysis.

Data Import and Filtering
We provide real data pertaining to COVID-19 severity and status, and several simulated datasets, to demonstrate the use of the package. Simulated data for two views and a binary outcome can be read into R as data(sidaData) and data(selpData). The COVID-19 data can be imported into R as data(COVIDData) [Figure S3]; this is a list with 3 entries: Proteomic, RNASeq, and Clinical data. Integrative analysis methods sometimes perform poorly on large datasets, so we provide supervised and unsupervised methods to filter data, helping users focus on variables that are more likely to yield meaningful findings after integration. In the R-package, the function filter.supervised() [Figure S5] can be used to filter each view when an outcome is available, via four methods: linear regression, logistic regression, t-test, and Kruskal-Wallis (KW) test. Supervised filtering allows the user to filter variables based on their association with an outcome. P-values can be adjusted for multiple hypothesis testing. The function filter.unsupervised() can be used to filter each view using unsupervised methods such as variance and interquartile range (IQR) filtering. We provide options to log2-transform variables, scale variables to have variance one, center variables to have mean zero, or normalize variables to have mean zero and variance one. Quality control checks (e.g., batch correction) and other forms of normalization specific to a particular omics data type should be done outside mvlearnR.
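The unsupervised filtering criteria (variance and IQR) amount to ranking columns by a spread statistic and keeping the most variable ones. The following is a minimal sketch of that idea, written in Python/NumPy for illustration only; `filter_unsupervised` is a hypothetical analogue, not the package's filter.unsupervised() implementation.

```python
import numpy as np

def filter_unsupervised(X, method="iqr", keep=0.5):
    """Keep the `keep` fraction of columns with the largest spread.

    method: "variance" or "iqr" (interquartile range), mirroring the
    two unsupervised criteria described in the text.
    """
    if method == "variance":
        score = X.var(axis=0, ddof=1)
    elif method == "iqr":
        q75, q25 = np.percentile(X, [75, 25], axis=0)
        score = q75 - q25
    else:
        raise ValueError("method must be 'variance' or 'iqr'")
    n_keep = max(1, int(round(keep * X.shape[1])))
    # indices of the retained (most variable) columns, in original order
    idx = np.argsort(score)[::-1][:n_keep]
    return np.sort(idx)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 3] *= 5.0   # inflate the spread of column 3
kept = filter_unsupervised(X, method="iqr", keep=0.2)
```

Supervised filtering replaces the spread statistic with a per-variable test against the outcome (e.g., a t-test p-value), but the keep-the-top-ranked logic is the same.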
The web application provides the filtering, scaling, centering, and normalization options. Regarding the data to be uploaded for filtering, the user can upload: i) Train and Test Sets, for when the uploaded data have already been split into training and testing sets. Filtering will be conducted only on the training data; after filtering is complete, a new test set will be constructed to contain the same variables as the filtered training data. ii) Full Dataset, for when the uploaded data have not been split into training and testing sets. If the user would like the app to create separate training and testing sets, we provide an option for this via the 'Pct in Training set' tab.

Unsupervised methods for associating data from two sources
We provide the sparse canonical correlation analysis (CCA) method SELPCCA, proposed in Safo et al. [2018a] and described in detail in the Supplementary Material, for unsupervised data integration. CCA Hotelling [1936] is a multivariate linear dimension reduction method for maximizing association between data from two sources. CCA finds weighted combinations of variables in each dataset that maximize the overall dependency structure between the pairs of data. In the context of the data integration strategies described in Picard et al. [2021], the function cvselpscca() [Figure S8] can be used to obtain low-dimensional linear representations that maximize associations between pairs of data, and to identify key variables contributing to the maximum correlation between the pairs of data. The main inputs to cvselpscca() are the training data and the number of canonical vectors, 'ncancorr', to be estimated, which defaults to 1 if not specified.
The outputs of cvselpscca() include: 'hatalpha' and 'hatbeta', the loadings or canonical vectors for the two views, respectively; 'optTau', the optimal tuning parameter for each data type; and 'maxcorr', the estimated canonical correlation coefficient, which shows the strength of the association between the data types. The canonical vectors can be visualized for more insight (see the Section on Visualizations and the Supplementary Material). Since these loadings are standardized to have unit norm, a variable with a larger weight contributes more to the association between the views. Please refer to the Use Case Section for a demonstration of the cvselpscca() function.
On the web application, the SELPCCA method is located on the 'Unsupervised' tab. The user can upload their data and then use the default hyperparameters to obtain the canonical vectors. After clicking the 'Run Model' button, a notification informs the user that the model is running. After completion, the top 20 selected variables from each view are printed.

Prediction with learned low-dimensional representations from unsupervised methods
Since SELPCCA is an unsupervised method, it can only be used to identify variables contributing to the maximal association between two views. SELPCCA is ideal as an exploratory method for identifying variables that contribute to the overall dependency structure between the views. If an outcome is available, one can associate the learned low-dimensional representation(s) with the outcome. We provide the function selpcca.pred() [Figure S15] for this purpose, where the results from the SELPCCA model are used to build a generalized linear model (GLM) or Cox prediction model. The required option 'family' is a string specifying the type of prediction model to build. Options are gaussian, binomial, poisson, and survival. When family = 'survival', a Cox proportional hazards model will be fitted.
Otherwise, a GLM will be used. The function predict() can be used to predict out-of-sample data using the learned low-dimensional representations (Figure S16). The performance of these predictions can be assessed on training or testing data using the function PerformanceMetrics() [Figure S16]. Currently, two family options are provided: 'binomial' and 'gaussian'.
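The two-step idea, learn low-dimensional scores first, then fit a GLM on them, can be sketched as follows. This is an illustrative stand-in (in Python/NumPy) for the binomial-family step, not the package's selpcca.pred() implementation; `fit_logistic` and `predict_class` are hypothetical helpers, and the canonical variates are simulated rather than estimated.

```python
import numpy as np

def fit_logistic(Z, y, steps=2000, lr=0.1):
    """Plain gradient-ascent logistic regression on low-dimensional scores.

    Z : (n, k) matrix of scores (e.g., canonical variates from both views
    stacked column-wise); y : 0/1 outcome vector.
    """
    Z1 = np.column_stack([np.ones(len(y)), Z])       # add intercept
    w = np.zeros(Z1.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Z1 @ w))
        w += lr * Z1.T @ (y - p) / len(y)            # log-likelihood gradient
    return w

def predict_class(Z, w):
    Z1 = np.column_stack([np.ones(Z.shape[0]), Z])
    return (1.0 / (1.0 + np.exp(-Z1 @ w)) > 0.5).astype(int)

rng = np.random.default_rng(1)
n = 200
y = rng.integers(0, 2, n)
# two simulated "canonical variates" whose means differ by class
Z = rng.normal(size=(n, 2)) + 1.5 * y[:, None]
w = fit_logistic(Z, y)
acc = (predict_class(Z, w) == y).mean()
```

Because the predictors are only the k canonical variates rather than the original p + q variables, the downstream model stays low-dimensional regardless of how many omics features went into the integration step.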
On the web application, the SELP-Predict method is located on the 'Supervised' tab. The results from the SELPCCA model are used to build a generalized linear model (GLM) or Cox prediction model. To implement this method, the user uploads their data, sets the distribution family, and can use the default hyperparameters to obtain the canonical variates. The required option 'family' is a string specifying the type of prediction model to build. Options are gaussian, binomial, poisson, and survival. When family = 'survival', a Cox proportional hazards model will be fitted; otherwise a GLM will be used. After model implementation, the user can view the model estimates, obtain prediction estimates, and visualize the top 20 selected variables for each view.

Supervised methods for associating data from two or more sources
Sparse integrative discriminant analysis (SIDA) Safo et al. [2022] is an integrative analysis method for jointly modeling associations between two or more views and creating separation of classes within each view. The algorithm considers both the overall association between the multiview data and the separation within each view when choosing discriminant vectors that are associated and optimally separate subjects. Thus, SIDA combines the advantages of linear discriminant analysis (LDA), a supervised learning method for maximizing separation between classes in one view, and CCA, an unsupervised learning method for maximizing correlation between two data types, and falls under the intermediate strategy for data integration described in Picard et al. [2021]. SIDA allows the user to select key variables that contribute to the maximum association of the views and separation of the classes. The function cvSIDA() performs n-fold cross-validation to select optimal tuning parameters for SIDA based on training data, and predicts training or testing class membership. The function cvSIDANet() incorporates prior structural information (e.g., gene-gene connectivity) in SIDA via the normalized Laplacian of a graph, thus encouraging the selection of predictors that are connected and behave similarly. This enhances interpretability. Covariates, if available, can be included via the option WithCovariates = TRUE.

Visualizations
Results from the supervised filtering approach can be visualized via volcano plots using the function volcanoPlot().
The filtered or original data can be visualized via uniform manifold approximation and projection (UMAP) [McInnes et al., 2018] with the function umapPlot(). We provide the function VarImportancePlot() to visualize the weights (in absolute value) of the low-dimensional loadings from cvselpscca(), cvSIDA(), and cvSIDANet(). Since the low-dimensional loadings are standardized to have unit norm, a variable with a larger weight contributes more to the association between the views (for the unsupervised integrative analysis methods) or to the association between the views and the discrimination of classes within each view (for the supervised integrative analysis methods). We provide the functions DiscriminantPlots() and CorrelationPlots() to visualize the class separation within each view and the correlations between pairs of views, respectively. We provide the function networkPlot() for a relevance network that shows variable-variable connections between pairs of views. We provide the function LoadingsPlot() to visualize the loadings for each view and to demonstrate the relationships between pairs of variables within each view. We provide the function WithinViewBiplot() to show the scores and loadings together for a specific view. The function BetweenViewBiplot() shows the scores and loadings from pairs of views together.
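The variable-importance ranking underlying such plots is simply the absolute values of a unit-norm loading vector, sorted. A minimal sketch of that ranking step, in Python for illustration; `top_loadings` is a hypothetical helper, not the package's VarImportancePlot() code, which also draws the bar chart.

```python
import numpy as np

def top_loadings(loading, names, k=20):
    """Rank variables by absolute weight on a (unit-norm) loading vector
    and return the top k as (name, |weight|) pairs."""
    order = np.argsort(np.abs(loading))[::-1][:k]
    return [(names[i], float(np.abs(loading[i]))) for i in order]

loading = np.array([0.1, -0.7, 0.05, 0.7, -0.05])
loading = loading / np.linalg.norm(loading)          # standardize to unit norm
names = ["v1", "v2", "v3", "v4", "v5"]
top = top_loadings(loading, names, k=3)
```

Note that the sign of a loading is informative for biplots and correlation plots, but for importance ranking only the magnitude matters, which is why the absolute value is taken before sorting.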
Use Case

Demonstration of SELPCCA and SELP Predict
In this subsection, we demonstrate the use of SELPCCA on multiomics data pertaining to COVID-19. Our goal is to associate the proteomics and RNASeq data to identify proteins and genes driving the overall dependency structure between the two molecular datasets. We then associate the canonical variates with COVID-19 status in a logistic regression model to investigate whether the canonical variates are able to discriminate between individuals with and without COVID-19.
We load the data with data(COVIDData) [Figure S3]. The number of cases (COVID-19) and non-cases (non-COVID-19) is 98 and 22, respectively. There are 264 proteins and 5,800 genes. In this analysis, View 1 corresponds to the proteomics data, and View 2 to the RNASeq data. We subset the data into 90% training and 10% testing, keeping the proportion of cases and non-cases similar to the proportion in the whole data [Figure S4]. We filter the data using the function filter.supervised() [Figure S5] with the options 'method' = 'logistic', 'padjust' = TRUE, 'adjmethod' = 'BH', and 'standardize' = TRUE. Our outcome is disease status ('DiseaseStatus.Indicator'). After univariate filtering by removing proteins and genes that are not statistically significant (adjusted p-value > 0.05), 87 proteins and 2,573 genes remain. We use the function volcanoPlot() to obtain volcano plots for proteins and genes [Figure S6]. We use the function umapPlot() to obtain UMAP plots of the filtered data to visualize how well the samples are separated [Figure S7].
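The stratified 90/10 split described above (class proportions preserved in both sets) can be sketched generically as follows; this is a Python illustration with a hypothetical helper `stratified_split`, not the package's splitting code.

```python
import numpy as np

def stratified_split(y, train_frac=0.9, seed=0):
    """Split sample indices into train/test, preserving class proportions.

    Each class is shuffled and split separately, so the train and test
    sets keep roughly the same case/non-case ratio as the full data.
    """
    rng = np.random.default_rng(seed)
    train, test = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        n_train = int(round(train_frac * len(idx)))
        train.extend(idx[:n_train])
        test.extend(idx[n_train:])
    return np.sort(train), np.sort(test)

y = np.array([1] * 98 + [0] * 22)   # 98 cases, 22 non-cases, as in the use case
tr, te = stratified_split(y, train_frac=0.9)
```

With 98 cases and 22 non-cases this yields 108 training and 12 testing samples, with the case proportion in each set close to the overall 98/120.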
To fit SELPCCA, we invoke the function cvselpscca() and set the number of canonical variates to 2 [Figure S8]. From running SELPCCA, we observed that 78 proteins and 32 genes have nonzero coefficients on the first CCA vector, which suggests that these proteins and genes maximize the correlation between the proteomics and RNASeq data (estimated correlation is 0.636). Further, 54 proteins and 9 genes have nonzero coefficients on the second CCA vector, with estimated correlation 0.599. The top 20 proteins (shown as UniProt IDs, UID) and genes with the highest absolute loadings for the first CCA vector are shown in Figure S9. These figures are obtained with the function VarImportancePlot(). Some of the highly ranked proteins for the first CCA vector include: Immunoglobulin lambda variable 3-1 (UID P01715); HLA class I histocompatibility antigen, B alpha chain (UID P30491); and Alpha-2-HS-glycoprotein (UID P02765). Some of the highly ranked genes on the first CCA vector include: ubiquitin conjugating enzyme E2 C (UBE2C), cell division cycle 6 (CDC6), cyclin A2 (CCNA2), and DEP domain containing 1B (DEPDC1B). We note that some of the highly ranked proteins (e.g., HLA class I histocompatibility antigen, B alpha chain [UID P30491] and Surfactant, pulmonary-associated protein B, isoform CRA_a [UID D6W5L6]) and genes (e.g., CDC6 and CCNA2) each had high log-odds ratios for discriminating between COVID-19 cases and non-cases.
We observe that the first canonical variate for each view is able to separate those with and without COVID-19 in the train (Figure S10) and test (Figure S11) sets. We use the function WithinViewBiplot() to visualize the sample discrimination and the canonical loadings for each view, and to assess how the top variables in each view are related to each other (Figure S12). The protein Immunoglobulin lambda 4-69 (UID A0A075B6H9) appears to be highly correlated with the protein Ferritin light chain (UID P02792). The gene UBE2C is loaded on the first CCA vector, and the genes UPK3A and GPR35 are loaded on the second CCA vector. We use between-view biplots to visualize biplots for both views (Figure S13). This plot allows us to assess how genes and proteins are related. Solid black vectors represent the loadings for the first view (proteins); dashed red vectors represent the loadings for the second view (genes). We generate this plot with the function BetweenViewBiplot(). The protein Immunoglobulin lambda variable 3-1 (UID P01715) appears to be positively correlated with the gene UBE2C.
For more insight into the association between the genes and proteins, we invoke the relevance network plot function networkPlot() [Figures 2 and S14]. The nodes of the graph represent variables for the pairs of views, and the edges represent the correlations between pairs of variables. Dashed and solid lines indicate negative and positive correlations, respectively. Circular nodes are View 1 variables (proteins), and rectangular nodes are View 2 variables (genes). We show edges with correlations of at least 0.58. The plot suggests that the protein Immunoglobulin lambda variable 3-1 (UID P01715) is highly positively correlated with many genes (including CDC6, CCNA2, and UBE2C), and the protein Alpha-2-HS-glycoprotein (UID P02765) is highly negatively correlated with many genes (including CCNA2 and CDC6).
In terms of prediction, we fitted a logistic regression model on the training data with the first two canonical variates as predictors. We used the function selpcca.pred() for this purpose [Figure S15]. Our results suggest that the first canonical variates for proteins and genes are significantly associated with COVID-19 status (p-value < 0.05). We predicted the test data from the learned model with the predict() function, and obtained train [Figure S16] and test [Figure S17] prediction estimates (e.g., accuracy, sensitivity, F1) with the PerformanceMetrics() function. In Figures S16 and S17, we observe that both train and test accuracy and F1 score are high, suggesting that the first two canonical variates can discriminate those with and without COVID-19.

Demonstration of SIDA
Unlike SELPCCA, which is an unsupervised method for integrating data from multiple sources, SIDA is a supervised data integration method. We demonstrate the use of SIDA on multiomics data pertaining to COVID-19. Our goal is to associate the proteomics and RNASeq data and discriminate between COVID-19 status groups in a joint model. We further identify proteins and genes that maximize both association and discrimination. We apply the function cvSIDA() [Figure S18] to obtain estimated SIDA discriminant vectors, correlation coefficients, and variables potentially contributing to the association of the views and the discrimination between samples within each view.
From implementing SIDA, we observed that 26 proteins and 23 genes have nonzero coefficients, which suggests that these proteins and genes maximize both the correlation between the proteomics and RNASeq data (estimated correlation from the train data is 0.42) and the separation between those with and without COVID-19. The top 20 proteins (shown as UniProt IDs, UID) and genes with the highest absolute loadings are shown in Figure S19. Some of the highly ranked proteins include: UID P04196, UID P14543, and UID E9PEK4. Some of the highly ranked genes include: GOLGA8Q, ADGB, TNFRSF6B, and SLC25A41.
We use the function DiscriminantPlots() [Figure S20] to visualize the separation of COVID-19 cases and non-cases.
From Figures S20 and S21, the classes are well separated in both the training (Figure S20) and testing (Figure S21) sets. We use the function CorrelationPlots() to visualize the strength of the association between the proteins and genes, as well as the separation. From Figure S22, we notice that the views are moderately correlated and the classes are well separated. For more insight into the association between the genes and proteins, we invoke the relevance network plot function networkPlot() [Figure S23]. The plot suggests that the gene FAM3D is negatively correlated with many proteins (e.g., P02766, P30491, Q08380), and positively correlated with proteins that include A0ADC4DFP6, E9PEK4, P04196, and D6W5L6.
In terms of prediction, we obtain the train and test error upon running cvSIDA() [Figure S18]. We used the function PerformanceMetrics() to obtain other performance metrics. In Figures S24 and S25, we observe that both train and test performance metrics are high, suggesting that the SIDA discriminant scores are able to discriminate those with and without COVID-19. The estimated train correlation is 0.41. Further, the performance metrics, especially the test performance metrics, are better for SIDA than for SELPCCA, which suggests that in this application, joint modeling of association and separation is better.

Discussion and Future Work
We have introduced an R package, mvlearnR, for integrating data from multiple sources. The package wraps statistical and machine learning methods and graphical tools, providing a convenient and easy data integration workflow. For users with limited programming experience, we provide a Shiny Application to facilitate data integration. Our multiview dashboard enables easy, user-friendly, comprehensive integrative analysis of molecular and clinical data from anywhere and on any device, without needing to know the R language. We offer a friendly web user interface using the R Shiny framework where users can integrate multiple datasets, and visualize and download results in an easy-to-use format. Currently, mvlearnR provides linear multivariate methods for integrative analysis and biomarker identification, and the methods can only be used for integrating cross-sectional data. However, we have developed integrative analysis methods for disease subtyping [Zhang et al., 2022] and for biomarker identification where we model nonlinear relationships between data from multiple sources and a clinical outcome Wang and Safo [2021], Wang et al. [2023], Safo and Lu [2023]. These methods, and other methods we develop in the future, will eventually be added to mvlearnR and the accompanying web application. Thus, we envision mvlearnR and our web application to be a one-stop place for comprehensive data integration, for both users of R (or Python) and non-users of these software.

The Methods
The methods SELPCCA and SIDA in mvlearnR have been described in [5] and [6], respectively. For the sake of completeness, we describe these methods below. Since these methods are extensions of canonical correlation analysis (CCA) [7] and linear discriminant analysis (LDA) [7], we briefly describe these two methods first.

Linear Discriminant Analysis
Suppose we have one set of data X with n samples on the rows and p variables on the columns. Assume the n samples are partitioned into K classes, with $n_k$ being the number of samples in class k, k = 1, ..., K. Assume each sample belongs to one of the K classes and let $y_i$ be the class membership for the ith subject. Let $X_k = (x_{1k}, \ldots, x_{n_k,k})^T \in \Re^{n_k \times p}$, $x_{ik} \in \Re^p$, be the data matrix for class k. Given these available data, we wish to predict the class membership $y_j$ of a new subject j using their high-dimensional information $z_j \in \Re^p$. Fisher's linear discriminant analysis (LDA) may be used to predict the class membership. For a K-class prediction problem, LDA finds K − 1 low-dimensional vectors (also called discriminant vectors), which are linear combinations of all available variables, such that the data projected onto these discriminant vectors have maximal separation between the classes and minimal separation within the classes. The within-class and between-class scatter matrices are defined respectively as
$$S_w = \sum_{k=1}^{K} \sum_{i=1}^{n_k} (x_{ik} - \mu_k)(x_{ik} - \mu_k)^T, \qquad S_b = \sum_{k=1}^{K} n_k (\mu_k - \mu)(\mu_k - \mu)^T,$$
where $\mu_k = \frac{1}{n_k}\sum_{i=1}^{n_k} x_{ik}$ is the mean vector for class k, and $\mu = \frac{1}{n}\sum_{k=1}^{K} n_k \mu_k$ is the combined class mean vector. The solutions to the LDA problem are the eigenvalue-eigenvector pairs of $S_w^{-1} S_b$: the discriminant vectors are the eigenvectors corresponding to the K − 1 largest eigenvalues. Note that LDA is only applicable to one set of data.
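The eigen-decomposition above can be checked numerically on a small dense example. The following Python/NumPy sketch implements exactly the $S_w^{-1} S_b$ eigenproblem described in the text (it is an illustration of classical LDA, not mvlearnR code, and assumes $S_w$ is invertible, i.e., n larger than p):

```python
import numpy as np

def fisher_lda(X, y):
    """Fisher discriminant directions: top eigenvectors of Sw^{-1} Sb.

    Returns up to K-1 discriminant vectors for a K-class problem.
    """
    classes = np.unique(y)
    mu = X.mean(axis=0)
    p = X.shape[1]
    Sw = np.zeros((p, p))
    Sb = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        muk = Xk.mean(axis=0)
        Sw += (Xk - muk).T @ (Xk - muk)                 # within-class scatter
        Sb += len(Xk) * np.outer(muk - mu, muk - mu)    # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1][: len(classes) - 1]
    return evecs.real[:, order]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 4)),
               rng.normal(2.0, 1.0, size=(60, 4))])
y = np.repeat([0, 1], 60)
W = fisher_lda(X, y)                 # one direction for K = 2
scores = X @ W[:, 0]
# nearest-centroid classification on the projected scores
c0, c1 = scores[y == 0].mean(), scores[y == 1].mean()
pred = (np.abs(scores - c1) < np.abs(scores - c0)).astype(int)
acc = (pred == y).mean()
```

For K = 2 the single discriminant direction is proportional to $S_w^{-1}(\mu_1 - \mu_0)$, and projecting onto it followed by a nearest-centroid rule recovers the classical LDA classifier.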

Canonical Correlation Analysis
Canonical correlation analysis, on the other hand, is applicable when there are two views. Now suppose we have two sets of data, $X_1 \in \Re^{n \times p}$ and $X_2 \in \Re^{n \times q}$, both measured on the same set of n subjects. The goal of CCA [7] is to find linear combinations of the variables in $X_1$, say $u = X_1\alpha$, and in $X_2$, say $v = X_2\beta$, such that the correlation between these linear combinations is maximized, i.e.,
$$\rho = \max_{\alpha,\beta} \operatorname{cor}(u, v) = \max_{\alpha,\beta} \frac{\alpha^T \Sigma_{12} \beta}{\sqrt{\alpha^T \Sigma_{11} \alpha}\,\sqrt{\beta^T \Sigma_{22} \beta}},$$
where $\Sigma_{12}$ is the p × q population covariance between $X_1$ and $X_2$, and $\Sigma_{11}$ and $\Sigma_{22}$ are the population covariances of $X_1$ and $X_2$, respectively. The population cross-covariances $\Sigma_{12}$, $\Sigma_{11}$, and $\Sigma_{22}$ are estimated by their sample versions $S_{12}$, $S_{11}$, and $S_{22}$, respectively. For low-dimensional settings, these sample versions are consistent estimators of the population covariances, for fixed p and q and large n. For high-dimensional settings, however, these sample versions are regularized. The vectors u and v are called the first canonical variates for $X_1$ and $X_2$, respectively. We note that the correlation coefficient is invariant to scaling, so one can set the denominator to one and consider the equivalent problem:
$$\max_{\alpha,\beta}\; \alpha^T \Sigma_{12} \beta \quad \text{subject to} \quad \alpha^T \Sigma_{11} \alpha = 1,\; \beta^T \Sigma_{22} \beta = 1.$$
Using Lagrangian multipliers, it can be shown that the CCA solution to this optimization problem is given by $\alpha = S_{11}^{-1/2} e_1$ and $\beta = S_{22}^{-1/2} f_1$, where $e_1$ and $f_1$ are the first left and right singular vectors of $S_{11}^{-1/2} S_{12} S_{22}^{-1/2}$, respectively. We note that maximizing the correlation is equivalent to maximizing the square of the correlation. To obtain subsequent canonical variates, one can impose the orthogonality constraints $\alpha_l^T \Sigma_{11} \alpha_m = 0$ and $\beta_l^T \Sigma_{22} \beta_m = 0$, $l \neq m$, in the optimization problem.
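The SVD characterization above translates directly into code. The following numpy sketch computes the first canonical pair by whitening the sample cross-covariance (illustrative only; the ridge term is our assumption for stability, and this is not the package's implementation):

```python
import numpy as np

def cca_first_pair(X1, X2, ridge=1e-6):
    """First canonical pair: alpha = S11^{-1/2} e1, beta = S22^{-1/2} f1,
    with e1, f1 the leading singular vectors of S11^{-1/2} S12 S22^{-1/2}."""
    X1 = X1 - X1.mean(axis=0)
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1.T @ X1 / n + ridge * np.eye(X1.shape[1])
    S22 = X2.T @ X2 / n + ridge * np.eye(X2.shape[1])
    S12 = X1.T @ X2 / n

    def inv_sqrt(S):
        # symmetric inverse square root via eigendecomposition
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    K = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    U, s, Vt = np.linalg.svd(K)
    alpha = inv_sqrt(S11) @ U[:, 0]
    beta = inv_sqrt(S22) @ Vt[0, :]
    return alpha, beta, s[0]     # s[0] is the first canonical correlation

# Two toy views driven by a shared latent signal
rng = np.random.default_rng(1)
z = rng.normal(size=(200, 1))
X1 = z @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(200, 4))
X2 = z @ rng.normal(size=(1, 3)) + 0.5 * rng.normal(size=(200, 3))
alpha, beta, rho = cca_first_pair(X1, X2)
```

Because both views share the latent signal z, the estimated first canonical correlation is large, and the variates $X_1\alpha$ and $X_2\beta$ recover the shared structure.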

Sparse CCA via SELP (SELPCCA)
The classical CCA method finds linear combinations of all available variables, and since the weights are typically all nonzero, it is difficult to interpret the findings. SELPCCA [5] is a variant of CCA that shrinks some of the weights of the low-dimensional representations α and β to zero, thus allowing one to identify the relevant variables contributing to the overall dependency structure in the data. In particular, the authors in [5] proposed to solve the following optimization problems:
$$\min_{\alpha} \|\alpha\|_1 \;\text{subject to}\; \|S_{12}\tilde{\beta}_1 - \tilde{\rho}_1 \tilde{S}_{11}\alpha\|_\infty \le \tau_1, \qquad \min_{\beta} \|\beta\|_1 \;\text{subject to}\; \|S_{21}\tilde{\alpha}_1 - \tilde{\rho}_1 \tilde{S}_{22}\beta\|_\infty \le \tau_2. \tag{1}$$
Here, $(\tilde{\alpha}_1, \tilde{\beta}_1)$ are the first nonsparse solutions to the CCA optimization problem, and $\tilde{\rho}_1$ is the corresponding eigenvalue. Of note, $\tilde{S}_{11}$ and $\tilde{S}_{22}$ are regularized versions of $S_{11}$ and $S_{22}$, respectively. The authors considered two forms of regularization: $\tilde{S}_{11} = I_p$ and $\tilde{S}_{22} = I_q$ for standardized data, and the ridge regularization $\tilde{S}_{11} = S_{11} + \sqrt{\log(p)/n}\, I_p$ and $\tilde{S}_{22} = S_{22} + \sqrt{\log(q)/n}\, I_q$. Also, $\tau_1$ and $\tau_2$ are tuning parameters chosen by cross-validation. See [5] for more details. For subsequent canonical directions $\tilde{\alpha}_i$ and $\tilde{\beta}_i$, i > 1, the authors solved the above sparse optimization problem on deflated data.
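An optimization of this Dantzig-selector type can be solved as a linear program. The sketch below is our reading of one such update, assuming the constraint form $\|c - \tilde{\rho}_1 \tilde{S}_{11}\alpha\|_\infty \le \tau_1$ with $c = S_{12}\tilde{\beta}_1$; it is an illustration, not the solver used in mvlearnR:

```python
import numpy as np
from scipy.optimize import linprog

def selp_direction(S11t, c, rho, tau):
    """Sketch of one SELP-type update: minimize ||alpha||_1 subject to
    ||c - rho * S11t @ alpha||_inf <= tau, written as a linear program
    with alpha = a_plus - a_minus, a_plus, a_minus >= 0."""
    p = S11t.shape[0]
    A = rho * S11t
    A_ub = np.block([[A, -A], [-A, A]])
    b_ub = np.concatenate([c + tau, tau - c])
    res = linprog(c=np.ones(2 * p), A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    assert res.success
    return res.x[:p] - res.x[p:]

# Standardized data: S11 ~ I_p, so the constraint acts componentwise
S11t = np.eye(5)
c = np.array([2.0, 0.1, -1.5, 0.05, 0.0])
alpha_dense = selp_direction(S11t, c, rho=1.0, tau=0.01)   # nearly alpha = c
alpha_sparse = selp_direction(S11t, c, rho=1.0, tau=0.5)   # small entries zeroed
```

With $\tilde{S}_{11} = I_p$ this reduces to soft thresholding of c at level $\tau_1$, which makes the role of the tuning parameter transparent: larger values of $\tau_1$ zero out more weights.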

Prediction with SELPCCA
CCA and SELPCCA are unsupervised multivariate methods whose main goal is to find canonical vectors that maximize the overall dependency structure between two views. In some biomedical applications, an outcome variable, say y, is available, and a specific interest might be to investigate how the canonical variates are related to this outcome. For this purpose, one can build regression models to associate the outcome with the canonical variates. In our package, we provide selpPredict() for this purpose. We provide the options gaussian, binomial, poisson, and survival to model continuous, categorical, count, or time-to-event outcomes, respectively. We also provide the function predict() to predict out-of-sample data using the learned low-dimensional representations or canonical variates.
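The idea behind this two-step prediction (not the selpPredict() implementation itself) can be sketched with a binary outcome: the learned canonical variates become predictors in a standard logistic regression, here fit with a minimal numpy-only gradient ascent on synthetic variates:

```python
import numpy as np

# Synthetic stand-ins for the canonical variates U and V; in practice
# these would come from SELPCCA.
rng = np.random.default_rng(3)
n = 200
u = rng.normal(size=n)                  # canonical variate, view 1
v = u + 0.3 * rng.normal(size=n)        # correlated variate, view 2
y = (u + v + 0.5 * rng.normal(size=n) > 0).astype(float)

Z = np.column_stack([np.ones(n), u, v]) # intercept + canonical variates
w = np.zeros(3)
for _ in range(500):                    # gradient ascent on the log-likelihood
    p = 1.0 / (1.0 + np.exp(-Z @ w))
    w += 0.1 * Z.T @ (y - p) / n
train_acc = np.mean((1.0 / (1.0 + np.exp(-Z @ w)) > 0.5) == (y > 0.5))
```

Because the outcome here is driven by the variates, the fitted model separates the classes well; the same logic extends to Gaussian, Poisson, or survival outcomes with the appropriate link.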

Sparse Integrative Discriminant Analysis (SIDA) for two or more views
Instead of the two-step CCA approach described above when there is a clinical outcome in addition to the views, the authors in [6] developed SIDA, a method for jointly maximizing association between two or more views while also maximizing separation between classes in each view. Whereas CCA is applicable to only two views, SIDA is applicable to two or more. SIDA combines the advantages of LDA, a supervised learning method for maximizing separation between classes in a view, and CCA, an unsupervised learning method for maximizing correlation between two data types. Suppose there are d = 1, …, D views. The authors solved an optimization problem for the basis discriminant vectors $\Gamma_d$ that trades off class separation within each view against association between views. Of note, ρ controls the influence of separation versus association in the optimization problem. The second term sums the unique pairwise squared correlations and weights them by $\frac{2}{D(D-1)}$ so that the sum of the squared correlations is one. The authors showed that the nonsparse basis discriminant directions for the d-th view, $\Gamma_d$, that maximize both associations and separations are given by the eigenvectors corresponding to the eigenvalues $\Lambda_d$ that iteratively solve the following eigensystems:
$$(c_1 M_d + c_2 \tilde{N}_d)\,\Gamma_d = \Gamma_d \Lambda_d, \qquad d = 1, \ldots, D,$$
where $c_1 = \rho$, $c_2 = \frac{2(1-\rho)}{D(D-1)}$, and $\tilde{N}_d = \sum_{j \neq d} N_{dj}\Gamma_j\Gamma_j^T N_{jd}$, d, j = 1, …, D, j ≠ d, sums all unique pairwise correlations of the d-th and j-th views.
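The iterative eigensystems described above suggest a simple alternating scheme: hold the other views' directions fixed, form the cross term, and update each view's directions by an eigendecomposition. The sketch below uses hypothetical inputs M_d and N_dj (in practice these are estimated from data); it illustrates the iteration, not the package's solver:

```python
import numpy as np

def sida_directions(Ms, Ns, rho=0.5, ell=1, iters=50, seed=0):
    """Alternating eigen-updates: Gamma_d <- top eigenvectors of
    c1 * M_d + c2 * sum_j N_dj Gamma_j Gamma_j^T N_dj^T."""
    D = len(Ms)
    c1, c2 = rho, 2 * (1 - rho) / (D * (D - 1))
    rng = np.random.default_rng(seed)
    # random orthonormal starting values
    Gammas = [np.linalg.qr(rng.normal(size=(M.shape[0], ell)))[0] for M in Ms]
    for _ in range(iters):
        for d in range(D):
            Nd = sum(Ns[(d, j)] @ Gammas[j] @ Gammas[j].T @ Ns[(d, j)].T
                     for j in range(D) if j != d)
            w, V = np.linalg.eigh(c1 * Ms[d] + c2 * Nd)
            Gammas[d] = V[:, -ell:]        # eigenvectors of largest eigenvalues
    return Gammas

# Toy symmetric positive semidefinite inputs for D = 2 views
rng = np.random.default_rng(4)
A1 = rng.normal(size=(4, 4)); A2 = rng.normal(size=(3, 3))
Ms = [A1 @ A1.T, A2 @ A2.T]
N12 = rng.normal(size=(4, 3))
Ns = {(0, 1): N12, (1, 0): N12.T}
G1, G2 = sida_directions(Ms, Ns)
```

Each update keeps the discriminant directions orthonormal within a view; varying rho shifts the solution between pure separation (rho near 1) and pure association (rho near 0).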
For sparsity or smoothness, they considered penalized versions of this optimization problem with a penalty term $P(\Gamma_d)$, taking one of two forms depending on whether sparsity only (data-driven) or sparsity and smoothness (knowledge-driven) is desired. The sparsity-only penalty applies the $\ell_{2,1}$ norm to the rows of $\Gamma_d$; the sparsity-and-smoothness penalty adds the term $\sum_i \|\tilde{\gamma}^{L_n}_i\|_2$, where $\tilde{\gamma}^{L_n}_i$ is the i-th row of the matrix product $L_n \Gamma_d$ and $L_n$ is the normalized Laplacian of a graph, which is view-dependent. Essentially, this term acts as a smoothing operator on the weight matrices $\Gamma_d$ so that variables that are connected within the d-th view are encouraged to be selected or neglected together. This was termed SIDANet (SIDA for structured or network data). SIDANet is applicable when prior biological information exists in the form of variable-variable connections, which guides the detection of connected variables that maximize both separation and association.
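The smoothness part of the penalty can be made concrete with a small graph. The sketch below builds a normalized Laplacian for a toy three-node chain and evaluates the row-wise penalty for a "smooth" and a "rough" loading matrix (illustrative values of our own choosing):

```python
import numpy as np

def normalized_laplacian(Adj):
    """Normalized graph Laplacian L_n = I - D^{-1/2} A D^{-1/2};
    isolated nodes get a zero row/column."""
    deg = Adj.sum(axis=1)
    dinv = np.where(deg > 0, 1.0 / np.sqrt(np.where(deg > 0, deg, 1.0)), 0.0)
    return np.eye(len(deg)) - dinv[:, None] * Adj * dinv[None, :]

# Toy chain graph 1 - 2 - 3 and two hypothetical loading matrices
Adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
Ln = normalized_laplacian(Adj)

Gamma_smooth = np.array([[1.0], [1.0], [1.0]])    # connected nodes share weights
Gamma_rough = np.array([[1.0], [-1.0], [1.0]])    # neighbors disagree

def smoothness_penalty(G):
    # sum of l2 norms of the rows of L_n @ G
    return np.linalg.norm(Ln @ G, axis=1).sum()
```

The penalty is smaller when connected variables carry similar weights, which is exactly the behavior that encourages network neighbors to be selected or dropped together.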

Visualizations of discriminant scores and canonical correlation variates
Once the discriminant vectors or canonical correlation vectors have been estimated with SIDA/SIDANet or SELPCCA, respectively, the scores for each view, representing projections of each view onto these vectors, can be computed. We provide DiscriminantPlots() and CorrelationPlots() to visualize these scores. The DiscriminantPlots() function can be used to visualize the first discriminant scores (for two classes) or the first and second discriminant scores (for more than two classes). Each point on this plot represents a score for an individual. Discriminant plots can showcase how well the discriminant vectors separate the classes. Correlation plots, on the other hand, can be used to visualize the strength of association between pairs of views as well as the separation of the classes. Figure S1 demonstrates sample figures that can be generated by DiscriminantPlots() and CorrelationPlots() in mvlearnR. We provide different color options for the graphs.

Visualizations of variables selected

Relevance Network
Relevance networks (RN) [8][9] have been proposed to visualize pairwise relationships of variables in integrative analysis. Here, nodes represent variables and edges represent variable associations. To construct an RN, the matrix of pairwise associations is obtained from the data. Then, for a specific cutoff (say 0.7), if an estimated correlation coefficient is greater in absolute value than this cutoff, an edge is drawn between the two variables; otherwise, no edge is drawn and the two variables are deemed not associated at this threshold. Further, variables/nodes with no link are not represented in the RN. The cutoff makes the graph less dense, which is especially useful when the number of variables is large. RNs have the potential to shed light on the complex associations between different types of data. We provide the function networkPlot() to visualize the pairwise relationships of variables selected by SELPCCA, SIDA, and SIDANet. We follow the ideas in [9] to construct the RN for SELPCCA, SIDA, and SIDANet. In particular, instead of computing the Pearson correlation coefficients between each pair of variables in each view separately, we construct a bipartite graph (or bigraph) where variables/nodes from view $X_d$ are connected to variables/nodes from view $X_j$, d ≠ j, d, j = 1, …, D. Similar to [9], we construct the bipartite graphs using a pairwise similarity matrix obtained directly from the outputs of our integrative analysis methods.
Suppose we have applied SELPCCA to two views and obtained the canonical loadings $\tilde{\alpha}_l$ and $\tilde{\beta}_l$, l = 1, …, L, for views 1 and 2, respectively. When L > 1, it is likely that, within a view, different variables will have nonzero coefficients in each canonical loading. Therefore, we deem a variable selected if it has a nonzero coefficient in at least one canonical loading. Let $A = (\tilde{\alpha}_1, \ldots, \tilde{\alpha}_L) \in \Re^{p \times L}$ be a concatenation of all L canonical loadings for view 1, and let $B = (\tilde{\beta}_1, \ldots, \tilde{\beta}_L) \in \Re^{q \times L}$ be defined similarly for view 2. Given the data restricted to the selected variables, we construct the canonical variates $U = X_1^{\text{selected}} A$ and $V = X_2^{\text{selected}} B$. Let p′ and q′ be the number of selected variables in views 1 and 2, respectively. To construct the p′ × q′ similarity matrix for views 1 and 2, we first project the data for the selected variables onto U (for view 1), V (for view 2), or a combination of U and V. Since $X_1$ and $X_2$ are analyzed simultaneously in CCA, projecting the data onto a combination of U and V seems natural; thus, we define Z = U + V. As noted in [9], the Z variables are closest to $X_d$ and $X_j$. Then, for the i-th variable $X^1_i \in X_1^{\text{selected}}$ (similarly $X^2_i \in X_2^{\text{selected}}$) and the l-th component $z_l \in Z$, we compute the scalar inner product $x^d_{il} = \langle X^d_i, z_l \rangle$, d = 1, 2; i = 1, …, p′ (or q′); l = 1, …, L, assuming the variables $X^d_i$ and $z_l$ are standardized to have variance one. We compute the similarity matrix M between views 1 and 2 as $M = x^1 (x^2)^T \in \Re^{p' \times q'}$, where $x^1$ is a p′ × L matrix with il-th entry $x^1_{il}$ and $x^2$ is a q′ × L matrix with il-th entry $x^2_{il}$. Each entry of the similarity matrix is between −1 and 1.
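The construction above can be sketched in a few lines of numpy (a sketch of the bigraph similarity idea, not the package's networkPlot() internals; the toy loadings and data are our own illustrations):

```python
import numpy as np

def similarity_matrix(X1s, X2s, A, B):
    """Bigraph similarity sketch.

    X1s, X2s: data restricted to the selected variables;
    A, B: matrices of canonical loadings for those variables.
    Returns the p' x q' matrix of products of variable-component
    correlations with the combined variates Z = U + V.
    """
    U = X1s @ A
    V = X2s @ B
    Z = U + V
    z = (Z - Z.mean(axis=0)) / Z.std(axis=0)          # standardized components
    n = X1s.shape[0]
    # correlations between each standardized variable and each z_l
    x1 = ((X1s - X1s.mean(0)) / X1s.std(0)).T @ z / n   # p' x L
    x2 = ((X2s - X2s.mean(0)) / X2s.std(0)).T @ z / n   # q' x L
    return x1 @ x2.T                                    # p' x q'

rng = np.random.default_rng(6)
X1s = rng.normal(size=(50, 4))
X2s = rng.normal(size=(50, 3))
A = rng.normal(size=(4, 2))
B = rng.normal(size=(3, 2))
M = similarity_matrix(X1s, X2s, A, B)
```

Each entry of M pairs one selected variable from view 1 with one from view 2; thresholding M then yields the relevance network edges.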
We construct the similarity matrix for SIDA and SIDANet in a similar fashion. Of note, for SIDA and SIDANet, L = K − 1. Although the $\ell_{2,1}$ norm encourages the same variables to be selected across all K − 1 components, empirically this is not always the case; therefore, we obtain overall variable selection as described above. We obtain the discriminant variates for the d-th and j-th views as $U = X_1^{\text{selected}} \Gamma_d$ and $V = X_2^{\text{selected}} \Gamma_j$, respectively. Given U and V, we compute the similarity matrix for SIDA and SIDANet as before.
We generate the relevance network from the similarity matrix. The nodes of the graph represent variables from the pairs of views, and the edges represent correlations between pairs of variables. The function networkPlot() in mvlearnR can be used to generate relevance networks. Please refer to the middle left and right panels of Figure S2 for sample relevance network plots. Dashed and solid lines indicate negative and positive correlations, respectively. Circle nodes are View 1 variables, and rectangular nodes are View 2 variables. We provide an option to tune the network through a correlation cutoff. For instance, in the middle right panel of Figure S2, we only show the graph for variables with correlations greater than 0.9. The plot suggests that variable V1031 from View 2 is highly correlated with variable V31 from View 1. Since there is no edge between variable V38 and variable V1037, the correlation between this pair of variables is smaller than 0.9.
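Thresholding the similarity matrix into an edge list is straightforward; the toy matrix below (our own values) shows how a cutoff of 0.9 keeps only the strongest positive and negative associations:

```python
import numpy as np

# Toy similarity matrix: rows = view 1 variables, cols = view 2 variables
M = np.array([[0.95, 0.20],
              [-0.10, -0.92]])
cutoff = 0.9
edges = [(i, j, M[i, j])
         for i in range(M.shape[0])
         for j in range(M.shape[1])
         if abs(M[i, j]) > cutoff]
# edges: [(0, 0, 0.95), (1, 1, -0.92)] -> one solid (positive) and one
# dashed (negative) edge; variables with no surviving edge are dropped
```

This mirrors how the correlation cutoff in networkPlot() controls graph density: raising the cutoff prunes weak edges and removes unconnected nodes.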

Variable importance plots
We provide the function VarImportancePlot() [top left panel of Figure S2] to visualize the weights (in absolute value) of the loadings. Since the loadings are standardized to have unit norm, a variable with a larger weight contributes more to the association between the views (for SELPCCA) or to the association between the views and the discrimination of classes within each view (for SIDA and SIDANet). We show only the top 20 variables and their weights, but one can view the data matrix for all variables.
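The ranking behind such a plot amounts to sorting absolute loadings and normalizing by the largest weight (a sketch of the idea, not the VarImportancePlot() code):

```python
import numpy as np

rng = np.random.default_rng(7)
loadings = rng.normal(size=40)
loadings /= np.linalg.norm(loadings)            # unit-norm loading vector

order = np.argsort(-np.abs(loadings))[:20]      # indices of the top 20 variables
importance = np.abs(loadings[order]) / np.abs(loadings[order[0]])
# importance[0] is 1.0 by construction; the rest decrease toward 0
```

The normalized values are what one would map onto the bar lengths of a variable importance plot.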

Loadings plots
We provide the function LoadingsPlot() [top right panel of Figure S2] to plot discriminant and canonical correlation vectors. These graphs are useful for visualizing how variables selected by SIDA/SIDANet and SELPCCA contribute to the first and second discriminant (for SIDA and SIDANet) or canonical correlation (for SELPCCA) vectors. Variables farther from the origin and close to the first or second axis have higher impact on the first or second discriminant/canonical vector, respectively. Variables farther from the origin and between the two axes contribute similarly to the first and second discriminant/canonical correlation vectors. In both situations, for SIDA and SIDANet, this suggests that these variables contribute more to the separation of classes and the association of views; for SELPCCA, it suggests that these variables contribute more to the association between the two views. This plot can only be generated for classification and association problems with three or more classes (SIDA and SIDANet), or for CCA problems with two or more canonical correlation vectors requested (i.e., ncancorr > 1 for SELPCCA). The angle between two vectors also gives an indication of how the two variables are correlated: vectors that are close to each other suggest that the variables have high positive correlation, vectors at about 90 degrees indicate that the two variables are uncorrelated, and vectors with an angle close to 180 degrees indicate that the two variables are negatively correlated.
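The angle interpretation reduces to the cosine between two variables' loading vectors on the first two axes. A minimal sketch with hypothetical loadings (our own values):

```python
import numpy as np

def loading_angle_cos(v1, v2):
    """Cosine of the angle between two loading vectors: near 1 suggests
    strong positive correlation, near 0 roughly uncorrelated, near -1
    negative correlation."""
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Hypothetical loadings of three variables on (axis 1, axis 2)
a = np.array([0.9, 0.1])
b = np.array([0.8, 0.2])    # close to a: small angle, cosine near 1
c = np.array([-0.1, 0.9])   # roughly 90 degrees from a, cosine near 0
```

Reading a loadings plot is therefore a matter of comparing vector directions, with vector length indicating overall contribution.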

Visualization of loadings and scores simultaneously
Biplots are useful for representing both loadings and discriminant scores/canonical correlation variates. We present biplots for each view and between views. In particular, we provide the function WithinViewBiplot() to visualize the scores and loadings for each view separately [Figure 1, bottom left panel]. We also provide the function BetweenViewBiplot() to graph scores and loadings for pairs of views. The scores are the sum of the scores for the two views (refer to the section on relevance networks for more explanation). Please refer to Figure 1, bottom right panel, for a sample between-view biplot. In this graph, dashed red vectors represent loadings for the second view, and solid black vectors represent loadings for the first view. Please refer to the section on loadings plots for a brief discussion on interpreting them. In this plot, we only show labels for the top 7 proteins and 3 genes with the largest absolute weights. The protein Immunoglobulin lambda variable 3-1 (UID P01715) appears to be positively correlated with the gene UBE2C. The protein P02792 appears to be negatively correlated with the genes UPK3A and GPR35. Class 0 is non-COVID-19 samples; Class 1 is COVID-19 samples.

Figure 2 :
Figure 2: Relevance network plot for SELPCCA. The nodes of the graph represent variables from the pairs of views, and the edges represent correlations between pairs of variables. Dashed and solid lines indicate negative and positive correlations, respectively. Circle nodes are View 1 variables (proteins), and rectangular nodes are View 2 variables (genes). We show edges with correlations of at least 0.58. The plot suggests that the protein P01715 is highly positively correlated with many genes, and the protein P02765 is highly negatively correlated with many genes.

Figure 3 :
Figure 3: Relevance network plot for SIDA. The nodes of the graph represent variables from the pairs of views, and the edges represent correlations between pairs of variables. Dashed and solid lines indicate negative and positive correlations, respectively. Circle nodes are View 1 variables (proteins), and rectangular nodes are View 2 variables (genes). We show edges with correlations of at least 0.1. The plot suggests that the gene FAM3D is negatively correlated with many proteins (e.g. P02766, P30491, Q08380) and positively correlated with proteins such as A0ADC4DFP6, E9PEK4, P04196, and D6W5L6.
(Notation for SIDA: $X^d = (X^d_1, \ldots, X^d_K)$, d = 1, 2, …, D, is a concatenation of the K classes in the d-th view; $S^d_b$ and $S^d_w$ are the between-class and within-class covariances for the d-th view; $S_{dj}$, j < d, is the cross-covariance between the d-th and j-th views; and $M_d = S_w^{d\,-1} S^d_b$.)

Figure S1 :
Figure S1: Sample discriminant plots for two classes (left panel) and three classes (middle panel), and a correlation plot (right panel). Discriminant plots can demonstrate whether the discriminant vectors separate the classes. Correlation plots demonstrate the strength of association between pairs of views; they also show how well the classes are separated.

Figure S2 :
Figure S2: Sample plots for visualizing variables within and between views. Top left panel: variable importance plot showing the relative importance of each variable. Variables with the largest absolute loadings rank highest; the weight for each variable is normalized relative to the largest weight. Top right panel: loadings plots. We provide loadings plots for each view to demonstrate the relationships between pairs of variables within each view. Variables that are farther from the origin have higher impact on the axis their vector is close to. Vectors that are close to each other suggest that the pair of variables is positively correlated; vectors with an angle of about 180 degrees suggest a negative correlation; vectors at a 90-degree angle suggest that the two variables are not correlated. Middle left and right panels: relevance networks showing variable-variable connections between pairs of views with correlations > 0.73 [left panel] and > 0.9 [right panel]. Dashed and solid lines indicate negative and positive correlations, respectively. The color key gives the direction of the associations: green indicates negative correlations and red indicates positive correlations. Bottom left panel: within-view biplot, showing the scores and loadings together for a specific view. Bottom right panel: between-view biplot, showing the scores and loadings from pairs of views together. The scores are the sum of the scores for each view. Solid and dashed lines represent vectors for Views 1 and 2, respectively. In this example, variable V31 (from View 1) has a higher correlation with variables V1031 and V1038 from View 2 than with variable V1001 from View 2. This finding is supported by the edges between V31 and V1038 and V1031 in the network plot, and the absence of an edge between V31 and V1001, given a correlation cutoff of 0.9.

Figure S5 :
Figure S5: Supervised filtering with logistic regression. Example output table from supervised filtering showing the top 10 molecules (sorted by p-value) with estimated coefficients (Coef), p-values (Pval), whether the variable is kept or filtered out (Keep), and the view (View). View 1 is the proteomics data, and View 2 is the RNASeq data.

Figure S8 :
Figure S8: Fitting SELPCCA with cross-validation on training data to choose optimal hyperparameters and to obtain overall canonical correlation vectors on the whole data based on the optimal hyperparameters. We use the function cvselpscca().

Figure S9 :
Figure S9: Variable importance plots for the first canonical vector for View 1 (proteins) and View 2 (genes). The top 20 variables with the largest absolute weights for the first canonical correlation vector from the application of SELPCCA are shown. For proteins, UniProt IDs are shown on the variable importance plot.

Figure S10 :
Figure S10: Discriminant plots for SELPCCA based on training data. Top panel: proteins; bottom panel: genes. We use the function DiscriminantPlots() to generate these plots. Class 0 is non-COVID-19 samples; Class 1 is COVID-19 samples.

Figure S11 :
Figure S11: Discriminant plots for SELPCCA based on testing data. Top panel: proteins; bottom panel: genes. We use the function DiscriminantPlots() to generate these plots. Class 0 is non-COVID-19 samples; Class 1 is COVID-19 samples.

Figure S12 :
Figure S12: Within-view biplots. Biplots are useful for representing both loadings and discriminant scores/canonical correlation variates. We provide the function WithinViewBiplot() to visualize the scores and loadings for each view separately. In this plot, we only show labels for the top 7 proteins and 3 genes with the largest absolute weights. Top panel: the protein IDs shown are loaded on both the first and second canonical vectors; protein ID A0A075B6H9 appears to be highly correlated with P02792. Bottom panel: the gene UBE2C is loaded on the first canonical vector; the genes UPK3A and GPR35 are loaded on the second canonical vector. Class 0 is non-COVID-19 samples; Class 1 is COVID-19 samples.

Figure S13 :
Figure S13: Between-view biplots are useful for visualizing biplots of both views; this plot allows us to assess how genes and proteins are related. Solid black vectors represent loadings for the first view (proteins); dashed red vectors represent loadings for the second view (genes). We generate this plot with the function BetweenViewBiplot(). In this plot, we only show labels for the top 7 proteins and 3 genes with the largest absolute weights. The protein Immunoglobulin lambda variable 3-1 (UID P01715) appears to be positively correlated with the gene UBE2C. The protein P02792 appears to be negatively correlated with the genes UPK3A and GPR35. Class 0 is non-COVID-19 samples; Class 1 is COVID-19 samples.

Figure S14 :
Figure S14: Relevance network plot. The nodes of the graph represent variables from the pairs of views, and the edges represent correlations between pairs of variables. Dashed and solid lines indicate negative and positive correlations, respectively. Circle nodes are View 1 variables (proteins), and rectangular nodes are View 2 variables (genes). We show edges with correlations of at least 0.58. The plot suggests that the protein P01715 is highly positively correlated with many genes, and the protein P02765 is highly negatively correlated with many genes.

Figure S15 :
Figure S15: Prediction with the first two canonical variates from SELPCCA. Results suggest that the first canonical variates for View 1 and View 2 are significantly associated with COVID-19 status (p-value < 0.05). That is, the first canonical variates for Views 1 (proteins) and 2 (genes) are able to discriminate between those with and without COVID-19.

Figure S16 :
Figure S16: Training performance metrics. We use the function PerformanceMetrics() to obtain these metrics. Top panel: train predictions.

Figure S17 :
Figure S17: Test performance metrics. We use the function PerformanceMetrics() to obtain these metrics. Top panel: test predictions.

Figure S18 :
Figure S18: Fitting SIDA with cross-validation on training data to choose optimal hyperparameters and to obtain overall discriminant vectors based on the optimal hyperparameters. We use the function cvSIDA().

Figure S19 :
Figure S19: Variable importance plots for View 1 (proteins) and View 2 (genes) after applying SIDA. The top 20 variables with the largest absolute weights are shown. For proteins, UniProt IDs are shown on the variable importance plot.

Figure S20 :
Figure S20: Discriminant plots based on training data. Top panel: proteins; bottom panel: genes. We use the function DiscriminantPlots() to generate these plots. Class 0 is non-COVID-19 samples; Class 1 is COVID-19 samples.

Figure S21 :
Figure S21: Discriminant plots based on testing data. Top panel: proteins; bottom panel: genes. We use the function DiscriminantPlots() to generate these plots. Class 0 is non-COVID-19 samples; Class 1 is COVID-19 samples.

Figure S22 :
Figure S22: Correlation plots based on training data. We use the function CorrelationPlots() to generate these plots. Class 0 is non-COVID-19 samples; Class 1 is COVID-19 samples.

Figure S23 :
Figure S23: Relevance network plot for SIDA. The nodes of the graph represent variables from the pairs of views, and the edges represent correlations between pairs of variables. Dashed and solid lines indicate negative and positive correlations, respectively. Circle nodes are View 1 variables (proteins), and rectangular nodes are View 2 variables (genes). We show edges with correlations of at least 0.1. The plot suggests that the gene FAM3D is negatively correlated with many proteins (e.g. P02766, P30491, Q08380) and positively correlated with proteins such as A0ADC4DFP6, E9PEK4, P04196, and D6W5L6.

Figure S24 :
Figure S24: Performance metrics. We use the function PerformanceMetrics() to obtain these metrics. Top panel: train predictions. Bottom panel: AUC plot.

Figure S25 :
Figure S25: Performance metrics. We use the function PerformanceMetrics() to obtain these metrics. Top panel: test predictions. Bottom panel: AUC plot.