Yifeng Li, Fang-Xiang Wu, Alioune Ngom, A review on machine learning principles for multi-view biological data integration, Briefings in Bioinformatics, Volume 19, Issue 2, March 2018, Pages 325–340, https://doi.org/10.1093/bib/bbw113
Abstract
Driven by high-throughput sequencing techniques, modern genomic and clinical studies are in strong need of integrative machine learning models that make better use of vast volumes of heterogeneous information for a deep understanding of biological systems and the development of predictive models. How data from multiple sources (called multi-view data) are incorporated in a learning system is a key step for successful analysis. In this article, we provide a comprehensive review on omics and clinical data integration techniques, from a machine learning perspective, for various analyses such as prediction, clustering, dimension reduction and association. We shall show that Bayesian models are able to use prior information and model measurements with various distributions; tree-based methods can either build a tree with all features or collectively make a final decision based on trees learned from each view; kernel methods fuse the similarity matrices learned from individual views into a final similarity matrix or learning model; network-based fusion methods are capable of inferring direct and indirect associations in a heterogeneous network; matrix factorization models have the potential to learn interactions among features from different views; and a range of deep neural networks can be integrated in multi-modal learning for capturing the complex mechanisms of biological systems.
Introduction
In this big data era, information grows almost exponentially in volume, variety and complexity [1]. For example, in current biomedical research, it is not uncommon to have access to a large amount of data from a single patient, such as clinical records (e.g. age, sex, histories, pathologies and therapeutics) and high-throughput omics data (e.g. genomics, transcriptomics, proteomics and metabolomics measurements), all under proper multi-party consent. In this article, we use the term ‘multi-view data’ to denote any kind of heterogeneous (or homogeneous) data that provide complementary information characterizing a biological object, phenomenon or system from various aspects. Such data may be of different types and from different sources, follow different statistical distributions, possess different semantics, suffer from different levels of imprecision and contain different kinds of uncertainty. Specifically, we are interested in four types of multi-view data: (1) multi-view data with different groups of samples measured by the same feature set (also called multi-class data), (2) multi-view data with the same set of objects (samples) but several distinct feature sets, (3) multi-view data measuring the same set of objects by the same set of features under different conditions (representable as a three-way tensor) and (4) multi-view data with different features and different sample sets for the same phenomenon or system, which can be further transformed into multi-relational data.
The type-2 and type-4 multi-view data described above are often referred to as multi-omics data. Generation of type-2 multi-omics data requires collaborative efforts within a big consortium such as the Cancer Genome Atlas [2], the Encyclopedia of DNA Elements Consortium [3], the Roadmap Epigenomics Project [4] and the Genotype-Tissue Expression Project [5]. The wide existence of type-4 multi-omics data is often owing to the uncoordinated contributions of independent (small) projects. Single-omics data describe a biological process at only one specific molecular level. For example, whole genome (or exome) sequencing [6] detects single-nucleotide and structural variations (genetic level); ChIP-seq [7] identifies transcription factor binding sites (protein–DNA interactomic level) and histone modifications (epigenomic level) across the entire human genome; DNase-seq [8] detects open chromatin regions harboring transcription factor binding loci (epigenomic level); whole genome bisulfite-seq [9] enables the construction of methylomes, which are key to understanding gene regulation, X chromosome inactivation and carcinogenesis (epigenetic level); RNA-seq [10] can be used to measure gene expression levels and to discover alternative splicing, gene fusions and novel isoforms (transcriptomic level); microRNA (miRNA)-seq [11] captures snapshots of the expression of miRNAs that regulate mRNA translation (translational level); and protein arrays [12] and mass spectrometry are useful for detecting concentrations of proteins [13] and metabolites [14] (proteomic and metabolomic levels). To identify acting pathways from DNA variations and epigenetic changes to proteins and metabolites, omics data at each level should be generated for the same tissues. The single-omics data enumerated above share the following characteristics: (1) high dimensionality, (2) redundancy, (3) highly correlated features and (4) non-negativity. On top of these, multi-omics data have the following characteristics: (1) mutual complementarity, (2) causality and (3) heterogeneity.
In bioinformatics, there are five types of data-driven analyses in which integrative machine learning methods are required. The first is the multi-class feature selection and classification problem: given multiple groups of objects measured using the same set of features, one is often interested in selecting the key features responsible for the separation of these groups. One example is the meta-analysis or pan-analysis of gene profiles from many distinct tumor tissues. Classification performance can be measured by the area under the receiver operating characteristic curve (auROC) for balanced data and the area under the precision-recall curve (auPRC) for imbalanced data. Second, integrating multi-omics data of the same set of labeled objects is expected to increase prediction (classification or regression) power, for example in the early detection of cancers based on multi-platform data. Third, in the above setting but without class labels, the task becomes unsupervised learning to discover novel groups of samples. Tumor subtyping is a commonly conducted analysis of this kind. Clustering performance can be computationally quantified using simulation studies, multi-view extensions of index criteria [15] and enrichment analysis. Fourth, given multiple heterogeneous feature sets observed for the same sample or group of samples, the interactions among inter-view features can be crucial for understanding the pathways of a phenotype. The obtained candidate pathways should be validated by computational enrichment analysis and wet-lab experiments. Last, given homogeneous and heterogeneous relations within and between multiple sets of biological entities from different molecular levels and clinical descriptions, inferring the relations between inter-set entities is termed an association study in a complex system. The findings should finally be tested by wet-lab experiments.
On the one hand, multi-view data provide an unprecedented opportunity to understand a complex biological system from different angles and levels (e.g. genotype–phenotype interactions [16] and cancer studies [17]) and to make precise data-driven predictions (e.g. drug response prediction [18]). For instance, intelligent learning systems have been successfully used in the genome-wide detection of cis-regulatory regions [19], combining sequence information, transcription factor binding, histone modifications, chromatin accessibility and 3D genome information (such as DNA shapes and genomic domain interactions) for a comprehensive description of cis-regulatory activities. On the other hand, such data pose a tough challenge for machine learning experts and data scientists, who must wisely optimize the use of these data for specific needs. According to when multi-view data are incorporated into the learning process, data fusion techniques can be classified as early, intermediate or late integration methods [20]. In early integration, features from different data sources are concatenated into a single feature vector before fitting an unsupervised or supervised model. In late integration, separate models are first learned for the individual views, and their outputs are then combined to make the final decision. An intermediate strategy embeds data integration within the learning process itself. Thus, the design of a computational intelligence model should be determined by the nature of the multi-view data, the needs of the analysis and the complexity of incorporating these multi-view data.
At the frontier of big biological data analytics, it has become indispensable to investigate the fundamental principles of integrating multi-view data, to provide bioinformaticians and data scientists with a bird's-eye view and a guide for choosing and devising multi-view methods. In this review, we focus on integrative machine learning principles for the five types of analyses using the four categories of multi-view data. This article is a significant extension of our preliminary discussion of data integration in a workshop [21]. During the preparation of this extension, we learned that other studies have also independently recognized data integration as an urgent need in current and future bioinformatics, for example [22], which highlights network methods and non-negative matrix factorizations (NMFs); our study, however, covers more comprehensive discussions, such as tree-based methods and multi-modal deep learning, as well as the latest advances, such as partial-least-squares-based models and similarity network fusion (SNF) approaches. To give readers an overview of this article, Table 1 summarizes the various machine-learning-based analyses for the four types of multi-view data. The following sections are organized by machine learning methodology. We discuss simple feature concatenation, Bayesian models, tree-based ensemble methods, multiple kernel learning, network-based methods, matrix factorizations and deep neural networks.
An overview of integrative analyses that can be conducted by machine learning methods on four types of multi-view data. Details regarding these methods and applications are described in separate sections
| Integrative method | Multi-class data (type-1) | Multi-feature-set data (type-2) | Tensor data (type-3) | Multi-relational data (type-4) |
|---|---|---|---|---|
| Feature concatenation | | | | |
| Bayesian models or networks | | | | |
| Ensemble learning | | | | |
| Multiple kernel learning | | Classification | | Association study |
| Network-based methods | | | | Association study |
| Multi-view matrix or tensor factorization | | | | Association study |
| Multi-modal learning | | | | |
Feature concatenation
LASSO is variable-selection consistent if and only if the training data X satisfy the irrepresentable condition [32]. Its generalizations, adaptive LASSO [33] and randomized LASSO with stability selection [34], are, however, consistent variable selection procedures. Sparse linear models using a robust loss function, such as the hinge loss, are robust to outlier samples [35]. Sparse regularization techniques reduce model complexity and thus help prevent overfitting in model selection.
The concatenated features require additional downstream processing that may lead to the loss of key information. First, because multi-view data are observed in different forms that may include continuous features, discrete features, characters and even graphic data types, converting them into acceptable types (e.g. continuous to discrete, categorical to binary coding) is necessary for certain models such as hidden Markov models. Second, features from multiple views usually have different scales. Particularly for discriminative models, it is often necessary to normalize or standardize the combined features to reduce bias and speed up training. The feature concatenation strategy followed by normalization is commonly used with linear models such as SVMs and LASSO. Moreover, feature concatenation is often unworkable with modern data that possess high dimensionality and rich structural information. For instance, converting medical text documents into a bag of words and combining it with vectorized image pixels ignores the importance of language semantics and of local structures in images.
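The concatenate–normalize–fit pipeline described above can be sketched as follows. This is a minimal illustration on synthetic two-view data; the view sizes, scales and LASSO penalty are arbitrary assumptions, not values from the reviewed studies.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two hypothetical views of the same 100 samples: e.g. expression (50 features)
# and methylation (30 features), on very different measurement scales.
X_expr = rng.normal(0, 1, (100, 50))
X_meth = rng.normal(0, 100, (100, 30))
y = X_expr[:, 0] + 0.5 * X_meth[:, 0] / 100 + rng.normal(0, 0.1, 100)

# Early integration: concatenate the views feature-wise, then standardize
# so that no view dominates purely because of its measurement scale.
X = np.hstack([X_expr, X_meth])
X = StandardScaler().fit_transform(X)

# A sparse linear model (LASSO) then selects a small subset of the 80 features.
model = Lasso(alpha=0.05).fit(X, y)
selected = np.flatnonzero(model.coef_)
print("selected features:", selected)
```

Without the standardization step, the LASSO coefficients of the large-scale methylation view would be shrunk far more aggressively than those of the expression view, which is exactly the bias the normalization is meant to remove.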
Bayesian methods to integrate prior knowledge
The set of parameters to be learned from the training data then expands to include those governing the priors.
Bayesian methods are well known for their capability to incorporate various kinds of prior knowledge into predictive or exploratory models. However, it may be difficult to find useful prior information. Furthermore, it is often hard to assume proper class-conditional distributions, especially for complex systems. In the case of many-class problems, finding a suitable class-conditional distribution for each individual class becomes unattainable in practice.
Bayesian methods for data of mixed types
Bayesian network classifiers. (A) General Bayesian network classifier. The class variable is treated as an ordinary node. (B) Naïve Bayes classifier. Features are assumed to be conditionally independent given the class variable as their common parent. (C) Tree-augmented naïve Bayes classifier. Sharing the class variable as parent, the features have a tree structure (in bold edges).
The naïve Bayes classifier is popular as a slim and swift BN model because no structure learning is needed and the inference of the class label is straightforward. The tree-augmented naïve Bayes classifier [44] (Figure 1C) relaxes the independence among features by allowing a tree structure. It outperforms the naïve Bayes classifier while keeping model learning efficient.
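The mixed-type setting can be sketched by fitting one naïve Bayes model per data type and combining their posteriors under the conditional-independence assumption. The synthetic continuous and binary views below are illustrative assumptions, not a published pipeline.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB

rng = np.random.default_rng(1)
n = 200
y = rng.integers(0, 2, n)

# Hypothetical mixed-type data: continuous expression values and binary
# mutation indicators, each informative about the class.
X_cont = rng.normal(0, 1, (n, 5)) + y[:, None] * 1.5
X_bin = (rng.random((n, 4)) < np.where(y[:, None] == 1, 0.8, 0.2)).astype(int)

# Fit one naive Bayes model per data type; conditional independence lets us
# factorize the likelihood across the two views.
g = GaussianNB().fit(X_cont, y)
b = BernoulliNB().fit(X_bin, y)

# Combine: log P(c | x_cont, x_bin) ∝ log P(c|x_cont) + log P(c|x_bin) - log P(c)
log_prior = np.log(np.bincount(y) / n)
scores = g.predict_log_proba(X_cont) + b.predict_log_proba(X_bin) - log_prior
pred = scores.argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```

The subtraction of the log prior avoids counting P(c) twice when the two per-view posteriors are multiplied.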
In model selection for BNs, the Bayesian information criterion (BIC) and the minimum description length (MDL) criterion are asymptotically consistent [45]. Model averaging and single-model selection with respect to a penalized criterion, such as BIC or MDL, help BNs avoid overfitting the data [46]. In the presence of incomplete data, BNs are robust to missing values, which can be ignored when computing sufficient statistics during parameter estimation.
Trees of mixed data types and ensemble learning
Decision trees can be considered integrative models because a mixture of discrete and continuous features can be handled simultaneously [47] without the need to normalize features. Classification and regression trees are representative rule-based prediction models. When recursively building a classification tree, the feature (or subset of features) that best splits the classes, in terms of a scoring function, is selected to create a node. At each node, rules are established to branch different classes downward. Unlike black-box models, the learned hierarchy of rules (the tree) is well interpretable. Meanwhile, decision trees can be applied to select features by ranking attributes according to their summed improvements in class purity. In a decision tree with $T$ internal nodes, the importance score of the $i$-th feature can be defined as
$$s_i = \sum_{t=1}^{T} \mathbb{1}(i,t)\, g(t),$$
where $\mathbb{1}(i,t)$ indicates whether the $i$-th feature is selected at the $t$-th node to split the corresponding data region, and $g(t)$ is the gain in class purity measured, for example, by the Gini index [47, 48]. Because each feature is used individually to learn decision rules, multi-view data of various types (discrete, categorical and continuous) can be considered together without normalization. The values of continuous variables are partitioned into intervals of different lengths, so decision rules can be created for continuous variables with a variety of distributions without standardizing the input data. In fact, decision trees are invariant under monotonic feature scaling and transformation. However, decision trees are sensitive to noise and thus generalize poorly. Moreover, building a decision tree for high-dimensional data can consume an unaffordable amount of time.
The overfitting issue of decision trees can be overcome by collective intelligence, that is, ensemble learning [49, 50], which builds a population of decision trees as weak learners to achieve state-of-the-art performance. Bagging [51] and boosting [52, 53] are popular ensemble learning models: bagging simply combines the decisions of multiple weak learners, while boosting tweaks the weak learners to focus on hard examples. However, trees learned in this manner may be highly correlated. Moreover, learning a collection of decision trees for multi-view data with many features is even more computationally expensive.
Random forests address the above two challenges by randomly sampling features during the construction of trees. Although the randomness degrades interpretability, the importance of features can still be obtained by out-of-bag (OOB) randomization or the Gini index. In the former method, the importance score of the i-th feature is defined as the difference in OOB errors between using the original OOB samples and using OOB samples in which the values of the i-th feature are permuted. In the latter method, the Gini indices [47] of the i-th feature in the individual trees of the forest are averaged into an importance score [48]. Although widely used in practice, many properties of random forest models remain unknown. It has only recently been shown that random forests are resistant to outliers [54]. Recent studies also show that not all random forest models are universally consistent [55, 56].
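The Gini-based importance score discussed above can be read directly from a fitted forest. The simulated data (one informative feature among ten) is an assumption for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
n = 300
y = rng.integers(0, 2, n)

# Ten features: only feature 0 carries class signal; the rest are noise.
X = rng.normal(0, 1, (n, 10))
X[:, 0] += y * 2.0

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Mean decrease in Gini impurity, averaged over all trees in the forest.
importances = rf.feature_importances_
print("most important feature:", importances.argmax())
```

The informative feature receives by far the largest averaged impurity decrease, while the noise features share the remainder.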
There are three ways to integrate data by ensemble learning. The first is to use the concatenated features as input to a random forest. The second is to build multiple trees for each data view, and then use all learned trees of all views to vote for the final decision [57, 58]. An example of using random forest as a late integration method is illustrated in Figure 2. More elegant combination methods are discussed in [59]. This ensemble-learning-based integration strategy has several advantages. First, the method is easy to manipulate and its outcomes are well interpretable. Second, class imbalance can be elegantly addressed by random forest in its bootstrapping step [60]. Third, the granularity of features can be carefully considered when sampling features [61]. However, because it is a late-integration strategy, interactions between features from separate sources cannot be detected. The third way is to derive new meta-features from multi-view data instead of using the original features. This idea comes from West's group, who incorporated both clinical factors and genomic data in predictive survival assessments [62]. A meta-feature (named a meta-gene in [62]) is defined as the first principal component of a cluster of genes grouped by a clustering algorithm. The model then grows a forest of statistical classification and prediction trees. In each tree, the features used in the nodes are chosen according to the significance of Bayes factor tests on the features (meta-genes and clinical factors). Multiple significant features can be distributed across multiple trees so that the correlations between trees are reduced. The final decision is a weighted combination of the decisions of all trees, with the trees' probabilities used as combination weights. One advantage of the meta-feature-based ensemble model is that information from different views can be incorporated via clustering during model learning. Because meta-features are used instead of the original features, the complexity of the trees is also reduced.
Late integration of multi-view data using ensemble learning (e.g. random forest).
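The per-view voting strategy of Figure 2 can be sketched as follows. The two synthetic views, forest sizes and soft-voting rule are illustrative assumptions rather than the cited implementations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n = 300
y = rng.integers(0, 2, n)

# Two hypothetical views, each weakly informative on its own.
views = [rng.normal(0, 1, (n, 8)) + y[:, None] * 0.8 * rng.random(8)
         for _ in range(2)]

# Late integration: one forest per view, then average the predicted class
# probabilities (soft voting) to reach the final decision.
forests = [RandomForestClassifier(n_estimators=100, random_state=0).fit(V, y)
           for V in views]
proba = np.mean([f.predict_proba(V) for f, V in zip(forests, views)], axis=0)
pred = proba.argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```

Soft voting keeps each view's model independent, which is what makes the strategy easy to extend to new views, but also why inter-view feature interactions stay invisible to it.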
Kernel learning and metric learning
Multiple kernel learning for data integration. Metric learning may be applied to learn suitable similarity matrices.
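A minimal sketch of the kernel-combination idea in the figure above: one base kernel per view, fused by a convex combination. The two synthetic views and the fixed weights are assumptions; a full MKL method would learn the weights jointly with the classifier.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(4)
n = 200
y = rng.integers(0, 2, n)

# Two hypothetical views; the class signal lives mostly in view 1.
X1 = rng.normal(0, 1, (n, 10)) + y[:, None] * 1.5
X2 = rng.normal(0, 1, (n, 20))

# One base kernel per view; fixed weights stand in for learned kernel weights.
K1, K2 = rbf_kernel(X1), rbf_kernel(X2)
weights = [0.7, 0.3]
K = weights[0] * K1 + weights[1] * K2   # still a valid positive semidefinite kernel

# Any kernel machine can consume the fused kernel directly.
clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy:", clf.score(K, y))
```

A non-negative combination of positive semidefinite kernels is itself positive semidefinite, which is why the fused matrix can be passed to a standard SVM unchanged.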
Network-based approaches to integrate multiple homogeneous networks
Multi-view data of cohort samples can be integrated in the sample space by network fusion methods. Although these methods are essentially nonlinear MKL models, they are mainly presented in the context of biological network mining, so we discuss them in this separate section. Given multiple networks with identical nodes but different edges, an established idea is to fuse these networks into a final network reflecting common and view-specific connections. As a typical example of this principle, SNF [76] integrates mRNA expression, DNA methylation and miRNA data of cohort cancer patients for tumor subtyping and survival prediction. SNF first constructs a sample-similarity network for each view, where a node represents a sample and a weighted edge reflects the similarity between two samples. It then uses a message-passing-based method to iteratively update each network, making it more similar to the other networks. Finally, a patient-similarity network is generated by fusing all individual networks. Essentially, this network-based method follows the kernel learning principle discussed in the ‘Kernel learning and metric learning’ section. Kernel clustering methods, for example spectral clustering, can be applied to the patient-similarity network to find distinct groups of patients, that is, subtypes. A comparative assessment on multi-omics cancer data showed that SNF outperforms iCluster [77] and feature concatenation in terms of the Cox log-rank test and a cluster index score.
Similarly, using this principle, multi-view data of cohort features can be integrated in the feature space. Here, we take the reconstruction of gene regulatory networks (GRNs) as an example. Given multiple RNA-seq data sets generated under different perturbations of the same cell system, a GRN can be learned for each data set and then fused by SNF into a final GRN. Likewise, GRNs learned by different algorithms can be fused by a method like SNF to achieve a robust result, implementing the philosophy of ‘the wisdom of crowds’ [78]. A similar network fusion method has been proposed in KeyPathwayMinerWeb for pathway enrichment analysis using multi-omics data [79]. Another related method, FuseNet, models multi-omics data using a Markov network that allows non-Gaussian distributions, and represents the model parameters by shared latent factors to collectively infer a holistic gene network [80].
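The iterative cross-network update at the heart of SNF can be sketched as below. This is a deliberately simplified version (fixed neighbourhood size, no extra normalization tricks) of the published algorithm, run on assumed toy two-cluster data.

```python
import numpy as np

def row_normalize(W):
    return W / W.sum(axis=1, keepdims=True)

def snf(similarities, k=5, iters=10):
    # Minimal SNF sketch: P[v] is the full transition matrix of view v,
    # S[v] a sparse local kernel keeping each sample's k nearest neighbours.
    P = [row_normalize(W) for W in similarities]
    S = []
    for W in similarities:
        mask = np.zeros_like(W)
        idx = np.argsort(-W, axis=1)[:, :k]
        np.put_along_axis(mask, idx, 1.0, axis=1)
        S.append(row_normalize(W * mask))
    for _ in range(iters):
        P_new = []
        for v in range(len(P)):
            # Diffuse the average of the *other* views through view v's
            # local structure, pulling the networks towards each other.
            P_avg = np.mean([P[u] for u in range(len(P)) if u != v], axis=0)
            P_new.append(row_normalize(S[v] @ P_avg @ S[v].T))
        P = P_new
    return np.mean(P, axis=0)   # fused sample-similarity network

# Toy example: two noisy views of the same two-cluster structure.
rng = np.random.default_rng(5)
labels = np.repeat([0, 1], 10)
def sim(noise):
    base = (labels[:, None] == labels[None, :]).astype(float)
    M = base + noise * rng.random((20, 20)) + np.eye(20)
    return (M + M.T) / 2

fused = snf([sim(0.5), sim(0.5)])
print(fused.shape)
```

After a few iterations the fused matrix retains the block structure shared by both views while averaging out view-specific noise, which is what makes spectral clustering on it effective.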
Network-based methods for fusing multiple relational data
In association studies, such as gene–disease and genotype–phenotype associations, the relation between two types of objects can be represented either by a matrix $X$, whose entry $x_{ij}$ indicates the strength of the relation between objects $i$ and $j$, or by a network where nodes represent objects and (weighted) edges indicate the presence of associations. Thus, association problems can be solved by kernel (relational) matrix factorization methods, by graphical methods, or by a mixture of both. Based on the number of relationships, association studies can be categorized into two-relational and multi-relational problems.
In two-relational association studies, the challenge is how to integrate multiple relational (or adjacency) matrices of two sets of biological entities. For example, in gene–disease studies, the question is how to integrate multiple known gene–disease relational matrices obtained by different measurements to infer new relevant candidate genes or diseases given a pivot set of genes. The kernelized Bayesian matrix factorization method is an effective way to infer a bipartite graph by integrating multiple data sources [86, 67]. Recently, Lan et al. [87] used this method to infer potential miRNA–disease associations by integrating sequence and functional information of miRNAs with semantic and functional information of diseases. Their experimental results demonstrated that this method not only effectively predicts unknown miRNA–disease associations, but also outperforms competing methods in terms of auROC.
A heterogeneous network can be constructed to represent homogeneous associations (such as gene–gene associations) in homo-networks and heterogeneous associations (such as gene–disease associations) in heter-networks. Given a network, a random walk is a path that starts at a prespecified node and randomly moves to its neighbor, then to neighbor’s neighbor, and so on. Random walks can explain the observed behaviors of many random processes and thus serve as a fundamental model for the recorded stochastic activities. Random walk methods have been applied on either two-relational heterogeneous networks (such as gene–phenotype associations [88], drug–target interactions [89] and miRNA–disease associations [90, 91]) or multi-relational heterogeneous networks (for example, drug–disease associations [92]) to infer novel candidate relations.
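The random-walk idea above is commonly realized as a random walk with restart (RWR) on the heterogeneous adjacency matrix. The sketch below uses a tiny assumed gene–disease network; node labels, edge choices and the restart probability are illustrative.

```python
import numpy as np

def random_walk_with_restart(A, seed_idx, restart=0.3, tol=1e-8):
    # Random walk with restart on a network given as an adjacency matrix A;
    # returns a stationary relevance score for every node.
    W = A / A.sum(axis=0, keepdims=True)      # column-normalized transitions
    p0 = np.zeros(A.shape[0])
    p0[seed_idx] = 1.0 / len(seed_idx)        # restart at the pivot nodes
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy heterogeneous network: nodes 0-2 are "genes", 3-4 are "diseases";
# the block adjacency matrix mixes gene-gene and gene-disease edges.
A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 1],
              [1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)
scores = random_walk_with_restart(A, seed_idx=[0])
print("disease ranking:", np.argsort(-scores[3:]) + 3)
```

Diseases directly connected to the pivot gene score higher than those reached only through intermediate genes, which is how RWR captures both direct and indirect associations.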
Tri-matrix factorizations, combined with network methods, are useful in association studies of multiple sets of biological entities, where pair-wise associations are represented in relational matrices. Figure 4 shows such an example of multi-relational associations, where each node represents a set of homogeneous objects and each edge represents a relational matrix. The data fusion approach with penalized matrix tri-factorization (DFMF) [93] is a model that can integrate multi-relational data. Given multiple sets of distinct biological objects $\mathcal{E}_1, \ldots, \mathcal{E}_r$ and their relations represented in matrices $R_{ij}$ (between $\mathcal{E}_i$ and $\mathcal{E}_j$, $i \neq j$), the basic idea is to decompose each given pair-wise association matrix as $R_{ij} \approx G_i S_{ij} G_j^{\mathsf{T}}$, where the rows of $G_i$ and $G_j$ are the latent factors for object sets $\mathcal{E}_i$ and $\mathcal{E}_j$, respectively, and $S_{ij}$ governs the interactions between $\mathcal{E}_i$ and $\mathcal{E}_j$. DFMF can only infer new associated objects within a directly given pair of associations, which is essentially a matrix completion problem. Compared with MKL, random forest and relational learning by matrix factorization, DFMF achieved higher auROC for the prediction of gene function and of pharmacologic actions. The methodology used in Medusa [94] extends DFMF to infer the most significant size-k modules of objects indirectly associated with a given set of pivot objects in a multiplex of association data. As illustrated in Figure 4, given RR–gene (RR stands for regulatory region), gene–pathway, gene–drug and disease–drug associations, the Medusa method can address two kinds of questions, for example: (1) given a pivot set of RRs associated with a subtype of breast cancer, how to detect other RRs that are also significantly associated with this tumor subtype [the so-called candidate-pivot-equivalence (CPE) regime]; and (2) given a pivot set of diseases, how to find the relevant RRs [the so-called candidate-pivot-inequivalence (CPI) regime]. To realize this, Medusa first uses collective matrix factorization (i.e. the DFMF model [93]) to generate latent data matrices for the individual associations, and then produces a connection matrix chaining a chosen set of latent data matrices from the source objects to the target objects. For CPE problems, a significant size-k module can be obtained based on connections; for CPI problems, the evaluation of candidate objects is based on visibility. Because there are multiple paths from the source objects to the target objects, Medusa combines all possible connection matrices to compute a score for each candidate object in the evaluation of size-k modules. In the prediction of gene–disease associations, Medusa obtained higher auPRC and auROC than random walk methods.
An example of multiple association studies represented in a multiplex heterogeneous network where each node represents a set of objects of the same type, and each edge represents an association matrix. Gene–gene, RR–gene, gene–drug, gene–pathway and drug–disease associations are given (marked in bold lines), whereas associations such as RR–disease and gene–disease associations are yet to be inferred (marked in dashed lines). There might exist multiple paths for indirect associations, for example in RR–disease associations, we can have RR-gene-pathway-drug-disease and RR-gene-drug-disease.
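The tri-factorization $R_{ij} \approx G_i S_{ij} G_j^{\mathsf{T}}$ underlying DFMF can be sketched for a single relational matrix with standard non-negative multiplicative updates. DFMF itself couples many such factorizations by sharing the $G$ factors across relations, which this toy single-matrix example does not do; the ranks, iteration count and random data are assumptions.

```python
import numpy as np

def tri_factorize(R, k1=3, k2=3, iters=500, seed=0):
    # Non-negative tri-factorization R ≈ G1 @ S @ G2.T via multiplicative
    # updates minimizing the squared reconstruction error.
    rng = np.random.default_rng(seed)
    n, m = R.shape
    G1 = rng.random((n, k1))
    G2 = rng.random((m, k2))
    S = rng.random((k1, k2))
    eps = 1e-9
    for _ in range(iters):
        G1 *= (R @ G2 @ S.T) / (G1 @ S @ G2.T @ G2 @ S.T + eps)
        G2 *= (R.T @ G1 @ S) / (G2 @ S.T @ G1.T @ G1 @ S + eps)
        S  *= (G1.T @ R @ G2) / (G1.T @ G1 @ S @ G2.T @ G2 + eps)
    return G1, S, G2

# Matrix completion view: reconstruct a low-rank non-negative relational matrix.
rng = np.random.default_rng(6)
true = rng.random((20, 2)) @ rng.random((2, 15))
G1, S, G2 = tri_factorize(true)
approx = G1 @ S @ G2.T
print("mean reconstruction error:", np.abs(true - approx).mean())
```

Once fitted, unobserved entries of the relational matrix are read off the reconstruction, which is exactly the matrix-completion use of tri-factorization described above.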
Feature extractions and matrix factorizations for detecting shared and view-specific components
While it is often challenging to combine features of multiple views in their original input spaces, new features generated by feature extraction methods can be easily combined. As illustrated in Figure 5, the idea is to extract new features from each data view first, then combine these new features, and finally apply a classification or clustering algorithm to the combined features. Depending on the nature of an individual data view, a feature extraction method learns the representations of samples in a new feature space. Matrix factorization methods, such as principal component analysis [95, 96], factor analysis (FA) [97, 98], NMF [99, 100, 101, 102], SR [65] and tensor decomposition methods [103, 104], are commonly used feature extraction models. Dimensionality reduction methods other than matrix decompositions, such as the autoencoder [105] and the restricted Boltzmann machine (RBM) [106], can be applied as well. There are several benefits of using feature extraction in data integration. First, the nature of each heterogeneous omics data view can be properly accounted for separately. Regardless of the original data types, the new features in the corresponding feature spaces are usually numeric, which makes concatenation easy. Second, the high dimensionality is dramatically reduced, making downstream analysis more efficient. Third, extracting new features separately for each data view implements the principle of divide and conquer, so computational complexity can be significantly reduced. Fourth, relational data can be well incorporated by kernel feature extraction methods [107]. However, one pitfall of the feature-extraction-based integrative principle is that interactions (correlation, dependency or association) between input features from different views cannot be taken into account in the separate feature extraction procedures.
An integrative predictive model based on separate view-wise feature extraction.
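The view-wise extraction scheme of Figure 5 can be sketched with PCA as the per-view extractor. The two high-dimensional synthetic views, component counts and downstream classifier are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 150
y = rng.integers(0, 2, n)

# Two high-dimensional hypothetical views of the same samples.
X1 = rng.normal(0, 1, (n, 500)) + y[:, None] * 0.5
X2 = rng.normal(0, 1, (n, 300)) + y[:, None] * 0.5

# Divide and conquer: extract a few components per view, then concatenate the
# low-dimensional numeric representations and fit a single classifier.
Z1 = PCA(n_components=5).fit_transform(X1)
Z2 = PCA(n_components=5).fit_transform(X2)
Z = np.hstack([Z1, Z2])

clf = LogisticRegression().fit(Z, y)
print("training accuracy:", clf.score(Z, y))
```

The classifier sees 10 numeric features instead of 800 heterogeneous ones, illustrating both the easy concatenation and the dimensionality reduction discussed above; any cross-view feature interactions are, as noted, lost in the separate extractions.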
To consider the interactions between features from different views, (Bayesian) multi-view matrix factorization methods can be applied to extract new features from the feature-wise concatenated matrix. By inducing group-wise (i.e. view-wise) sparsity on the basis matrix (i.e. factor loading matrix), Bayesian group factor analysis (GFA) [108, 109] detects ubiquitous and view-specific factors (see Figure 6), which is informative for discovering features from multiple views involved in a potential pathway. As a special case of GFA, Bayesian canonical correlation analysis (CCA) [110] only detects correlated factors between two views. Sparse GFA has been developed for biclustering multi-view data with co-occurring samples [111]. A simulation study revealed that sparse GFA could recover predefined biclusters more accurately, in terms of F1 score, than factor analysis for bicluster acquisition [112]. Sparse GFA was also applied to predict the drug responses of cancer cell lines supported by multi-omics data, where prediction was modeled as inferring missing values (the drug responses), cross-validation performance was measured by correlation coefficients and real predictions were validated by enrichment analysis.
Data integration based on Bayesian group factor analysis. Zero blocks are marked in white.
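The block-sparsity pattern of Figure 6 can be illustrated with a small sketch: given a loading matrix with a hand-made block structure (a stand-in for what a fitted GFA model would estimate; the view names, sizes and threshold here are hypothetical), each factor is labeled ubiquitous or view-specific according to the norms of its view-wise loading blocks:

```python
import numpy as np

rng = np.random.default_rng(1)
# loadings for 30 features (rows) stacked over two views, 3 factors (columns)
W = np.zeros((30, 3))                     # view 1: rows 0-19, view 2: rows 20-29
W[:20, 0] = rng.normal(size=20)           # factor 0 loads on view 1 only
W[20:, 1] = rng.normal(size=10)           # factor 1 loads on view 2 only
W[:, 2] = rng.normal(size=30)             # factor 2 loads on both (ubiquitous)

views = {"view1": slice(0, 20), "view2": slice(20, 30)}
for k in range(W.shape[1]):
    # a factor is active in a view if its loading block is not (near) zero
    active = [v for v, sl in views.items() if np.linalg.norm(W[sl, k]) > 1e-8]
    label = "ubiquitous" if len(active) == len(views) else "specific to " + active[0]
    print(f"factor {k}: {label}")
```

In a real GFA fit, the zero blocks are not exact zeros but are driven toward zero by the group-sparse prior, so a tolerance appropriate to the model's scale replaces the hard threshold used here.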
NMF has been extended to multi-view data for clustering and latent FA. Similar to iCluster [77], MultiNMF [113] and EquiNMF [114] let all views share a common coefficient matrix for collective clustering, while joint-NMF [115] restricts all views to share the same basis matrix for finding common factors among multi-omics data. Using a ubiquitous basis matrix together with view-specific basis and coefficient matrices, integrative-NMF [116] is able to detect homogeneous and heterogeneous factors in multi-omics data. Comparative studies showed that integrative-NMF significantly outperforms joint-NMF both on a simulated data set with heterogeneous noise, in terms of a module-detection score, and on a real ovarian cancer multi-omics data set, in terms of purity indices; the discovered modules were validated using pathway enrichment analysis. Although not in a non-negative setting but closely related, joint and individual clustering (JIC) [117] and joint and individual variation explained (JIVE) [118] use both common and view-specific coefficient matrices. A comparison showed that JIC is advantageous over iCluster both on simulated data, in terms of precision, and on multi-omics (RNA-seq and miRNA-seq) breast cancer data, in terms of validation using clinical information. This idea can easily be generalized to any matrix factorization, including NMF. Because high-throughput sequencing platforms generate read-count data, which are naturally non-negative, multi-view NMF models have great potential for various analyses such as tumor subtyping, pathway analysis and biomarker selection.
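The shared-coefficient idea can be sketched numerically (in the spirit of MultiNMF, not a reproduction of the published algorithms; dimensions, data and iteration count are toy assumptions): each non-negative view X_v is factorized as X_v ≈ W_v H with one coefficient matrix H shared across views, via Lee–Seung-style multiplicative updates on the summed squared error:

```python
import numpy as np

def multi_view_nmf(Xs, k, n_iter=200, eps=1e-9):
    """Toy shared-coefficient NMF: each non-negative view X_v (features x samples)
    is approximated as W_v @ H with H shared, so H clusters the common samples."""
    rng = np.random.default_rng(0)
    n = Xs[0].shape[1]                                 # samples shared across views
    Ws = [rng.random((X.shape[0], k)) for X in Xs]
    H = rng.random((k, n))
    for _ in range(n_iter):
        for v, X in enumerate(Xs):                     # view-specific basis updates
            Ws[v] *= (X @ H.T) / (Ws[v] @ H @ H.T + eps)
        num = sum(W.T @ X for W, X in zip(Ws, Xs))     # shared coefficient update
        den = sum(W.T @ W for W in Ws) @ H + eps
        H *= num / den
    return Ws, H

rng = np.random.default_rng(2)
Xs = [rng.random((40, 25)), rng.random((15, 25))]      # two views, 25 shared samples
Ws, H = multi_view_nmf(Xs, k=4)
err = sum(np.linalg.norm(X - W @ H) for X, W in zip(Xs, Ws))
```

Clustering the shared samples then reduces to reading off, for each column of H, the factor with the largest coefficient.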
In addition to matrix factorization models, their extensions to tensor decompositions [103, 129, 130] should also be considered for dimensionality reduction (in classification or clustering) and FA on multi-view data naturally represented by a tensor [131].
Many matrix decomposition models are non-convex from the optimization perspective; thus, their performance is affected by the initial values. Moreover, feature selection based on l1-norm regularized sparse matrix factorizations may suffer from inconsistency. Applying the strategy of stability selection to matrix factorizations can make the procedure consistent [132]. Matrix or tensor factorization models that assume Gaussian or Poisson distributions for the data are not robust to outlying data points. Robust factorizations, such as robust PLS [133], robust NMF [134] and tensor factorization using the l1-norm [135], have been developed to address this issue; in the near future, more work needs to be done on robust multi-view matrix factorizations. Missing values can be handled nicely by weighted matrix factorizations [101] and Bayesian matrix factorizations [136, 137], which can ignore missing entries during learning. Future development of new multi-view matrix factorization tools should include functionality for dealing with missing values. Overfitting is not an issue for penalized and Bayesian multi-view matrix factorizations.
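The weighting mechanism for missing values can be sketched with a weighted NMF in which a binary mask keeps missing entries out of the multiplicative updates (a toy illustration of the weighting idea, not any specific published implementation; data, rank and iteration count are assumptions):

```python
import numpy as np

def weighted_nmf(X, M, k, n_iter=300, eps=1e-9):
    """Weighted NMF sketch: the binary mask M (1 = observed, 0 = missing)
    keeps missing entries of X out of the multiplicative updates."""
    rng = np.random.default_rng(0)
    W = rng.random((X.shape[0], k))
    H = rng.random((k, X.shape[1]))
    Xo = M * X                                  # missing entries contribute zero
    for _ in range(n_iter):
        W *= (Xo @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ Xo) / (W.T @ (M * (W @ H)) + eps)
    return W, H

# toy low-rank non-negative data with roughly 20% of entries masked as missing
rng = np.random.default_rng(3)
X = rng.random((30, 3)) @ rng.random((3, 20))
M = (rng.random(X.shape) < 0.8).astype(float)
W, H = weighted_nmf(X, M, k=3)
rel = np.linalg.norm(M * (X - W @ H)) / np.linalg.norm(M * X)
```

After fitting on the observed entries only, the product W @ H provides imputed values at the masked positions.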
Multi-modal deep learning
The deep neural network-based [138] multi-modal structure, illustrated in Figure 7, is another option to integrate multi-view data with heterogeneous feature sets and to capture their high-level associations for prediction, clustering and handling incomplete data. The basic idea is to select a specific sub-network for each view and then integrate the outputs of the individual sub-networks in higher layers. The sub-networks provide the flexibility to choose appropriate deep learning models for the individual data views, such as the deep belief net (DBN) [139] or deep Boltzmann machine (DBM) [140] for binary, Gaussian or count data, the convolutional neural network [141] for image data, the recurrent neural network [142] for sequential signals and deep feature selection (DFS) [143] for choosing discriminative features. The sub-networks can be either directed or undirected, and the whole model can be supervised or unsupervised. The work in [144] is a typical example of multi-modal learning for image-text data. In this model, a Gaussian-Bernoulli DBM and a replicated softmax DBM are constructed for continuous image data and word-count data, respectively; a top layer connects both sub-networks to learn their joint representation. This multi-modal DBM can be applied to generate images given text, generate text given images, and classify or cluster samples based on the joint representations of multi-view samples. A multi-modal DBN has been used in bioinformatics to integrate gene expression, DNA methylation and drug response for tumor subtyping [145], showing performance superior to k-means clustering in terms of clinical discrepancy (survival time).
Additive multi-modal deep learning for data integration. Different deep learning models can be applied, as sub-networks, to individual data views. An integrative network combines information from the sub-networks. The model can be either directed or undirected; either supervised or unsupervised. Bottom-up arrows indicate a discriminative model. Downward or undirected connections indicate a generative model.
Multi-modal deep neural networks have five attractive strengths for data integration. First, when learning the model parameters, the sub-networks can be pretrained separately on their respective data views, and then the parameters of the entire network (including the integrative layers and sub-networks) can be globally fine-tuned; this component-wise learning can significantly reduce the cost of computation. Second, the heterogeneous information from different views can be jointly considered in the integrative layers for inference, classification and clustering [146]. Third, multi-modal networks can even learn from samples with missing views [144], which enables maximal use of the available data instead of merely using samples with complete views. Furthermore, a well-trained generative multi-modal network, such as a DBM, can be used to infer the profiles of missing views given other observed views from an individual, which is quite interesting, for instance, for predicting the impact of genetic variations and epigenetic changes on gene expression. Last but not least, the flexible and deep structure of multi-modal learning is appropriate for modeling complex systems, and thus has great potential to make full use of genomic data observed at various molecular levels.
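A minimal forward-pass sketch of the additive multi-modal structure (plain NumPy with random weights; all dimensions are hypothetical, and in practice each sub-network would be pretrained on its own view before the whole network is fine-tuned):

```python
import numpy as np

rng = np.random.default_rng(4)

def layer(X, W, b):
    """One dense layer with a tanh non-linearity."""
    return np.tanh(X @ W + b)

# hypothetical shapes: 8 samples, a 100-feature view and a 40-feature view
X1 = rng.normal(size=(8, 100))
X2 = rng.normal(size=(8, 40))

# view-specific sub-networks (a single hidden layer each in this sketch)
H1 = layer(X1, rng.normal(size=(100, 16)) * 0.1, np.zeros(16))
H2 = layer(X2, rng.normal(size=(40, 16)) * 0.1, np.zeros(16))

# integrative layer: a joint representation over the concatenated hidden codes
joint = layer(np.hstack([H1, H2]), rng.normal(size=(32, 8)) * 0.1, np.zeros(8))
print(joint.shape)   # (8, 8)
```

The joint representation can then feed a supervised output layer or be clustered directly, which mirrors how the integrative layers combine view-specific codes in Figure 7.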
The consistency of neural networks is well studied: the universal approximation theorem tells us that a feed-forward neural network having one hidden layer with proper parameters is able to approximate any continuous function [147, 148]. The great predictive power of multi-modal deep learning lies in its capability to model complex probability distributions and capture high-level semantics. The limits of RBMs and DBMs in approximating probability distributions are still under investigation. Robust deep learning models can be developed by using robust cost functions for data with outliers [149], and some experimental studies have reported that deep learning models are robust to noise [150, 151]. Overfitting is a problem for deep neural networks owing to their complex structures and large numbers of parameters; regularization (using l1- and l2-norms) and model averaging such as dropout [152] are effective techniques to avoid this issue. In addition, how to precisely capture and explicitly interpret inter-view feature interactions remains an open problem. The generalization of the shallow multi-view matrix factorization methods, discussed in the 'Feature extractions and matrix factorizations for detecting shared and view-specific components' section, to their deep versions poses a new challenge. Finally, it should be noted that, besides additive integrative deep learning models, other strategies, such as multiplicative and sequential integration, exist [153].
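The dropout technique mentioned above can be sketched in a few lines (the common "inverted dropout" formulation: the rescaling by 1/(1-p) keeps the expected activation unchanged, so no correction is needed at test time):

```python
import numpy as np

def dropout(H, p, rng, train=True):
    """Inverted dropout: at training time, zero each hidden unit with
    probability p and rescale survivors by 1/(1-p); identity at test time."""
    if not train:
        return H
    mask = rng.random(H.shape) >= p       # keep a unit with probability 1-p
    return H * mask / (1.0 - p)

rng = np.random.default_rng(5)
H = np.ones((4, 1000))                    # toy hidden activations
Hd = dropout(H, p=0.5, rng=rng)
print(Hd.mean())                          # close to 1.0 because of the rescaling
```

Each training step samples a fresh mask, so the network is effectively an average over an exponential number of thinned sub-networks, which is the model-averaging view of dropout.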
Discussion and conclusion
Recent developments in many bioinformatics topics, such as cancer diagnosis, precision medicine and health-informatics systems, have created a keen need for integrative machine learning models that incorporate all available data for better insight into complex biological systems and more precise solutions. In this review, we have investigated a variety of data integration principles from a machine learning perspective; their basic ideas, structures, asymptotic consistency, robustness, risk of overfitting, strengths and limitations are discussed, respectively. These methods, particularly multi-view matrix factorizations and multi-modal deep learning, will revolutionize the way information is used and play a key role in integrative bioinformatics.
Multi-omics data measured on the same set of patients (or samples) are key to identifying disease subtypes and pathways. However, at the moment, such data are almost unavailable for diseases other than cancers. In studies of complex diseases, such as multiple sclerosis, schizophrenia and autism spectrum disorder, there exist some independent RNA-seq and whole-genome (or exome) sequencing data, enabling investigations at either the transcriptomic or the genetic level. Compared with cancers, the signals in complex diseases might be too weak to be identified. Ideally, large-scale multi-omics data from the same batch of patients are necessary for comprehensive analysis at different molecular levels [154]. The difficulty of obtaining tissues and the lack of funding and human resources are the main challenges to generating such data. Therefore, researchers in non-cancer studies are strongly encouraged to shift their focus to integrative analysis and to work in a coordinated manner to ensure the quality and completeness of multi-platform omics data.
Feature selection methods should be carefully chosen according to the purpose of a specific analysis. Because biological data often have highly correlated features, a set of relevant features selected to computationally optimize predictive power may not make sense in terms of biological causality. If features are selected solely to pursue the highest classification performance, l1-norm regularized sparse models (e.g. LASSOs), sparse PLS, DFS or random forest should be considered. However, if one wants to globally examine the behavior of all features in multiple classes, BNs, feature clustering or multi-view matrix factorizations for feature pattern discovery should be taken into account.
We list open-source packages and tools for the seven categories of integrative models in Table 2. They are mainly implemented in Python, R and MATLAB, among which Python can serve as a promising platform for realizing integrative models because (1) a multi-dimensional array is passed to a function by reference (while array arguments are passed by value in R and MATLAB), which is critical for big data; (2) the friendly object-oriented programming paradigm enables the development of large packages; and (3) the support of machine learning (particularly deep learning) packages facilitates the implementation of intelligent integrative methods. Even though multi-modal neural networks are potentially useful for the fusion of multi-view data, satisfactory software packages are not yet available. Thus, a generic and comprehensive multi-modal package is eagerly expected in the near future, so that bioinformaticians can conveniently choose suitable types of sub-networks and define model structures.
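The reference-semantics point in (1) can be demonstrated directly: a NumPy array modified in place inside a function changes for the caller as well, with no copy made (the helper name below is ours, for illustration):

```python
import numpy as np

def standardize_inplace(X):
    """Modify the caller's array directly: passing a NumPy array to a
    function passes a reference, not a copy (unlike the value semantics
    of array arguments in R and MATLAB)."""
    X -= X.mean(axis=0)        # in-place update, visible to the caller
    X /= X.std(axis=0)

A = np.array([[1.0, 10.0], [3.0, 30.0]])
standardize_inplace(A)
print(A[0, 0])                 # -1.0: the original array was changed in place
```

For a large omics matrix this avoids duplicating gigabytes of data at every function call, which is why this property matters for big data pipelines.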
Implementations of machine learning methods for multi-view data analysis
| Method | Tool name [Ref.] | Functionality | URL | Language |
|---|---|---|---|---|
| Feature concatenation | glmnet [155] | LASSO, elastic net | cran.r-project.org/web/packages/glmnet | R |
| | scikit-learn [156] | LASSO, elastic net, SVM | scikit-learn.org | Python |
| | grplasso [157] | Group LASSO | cran.r-project.org/web/packages/grplasso | R |
| | SGL | Sparse group LASSO | cran.r-project.org/web/packages/SGL | R |
| | SPAMS [158] | (Sparse) group LASSO using proximal algorithms | spams-devel.gforge.inria.fr | R, Python, MATLAB |
| | ParsimonY | Overlapping group LASSO | github.com/neurospin/pylearn-parsimony | Python |
| | glasso [28] | Graphical LASSO | cran.r-project.org/web/packages/glasso | R |
| Bayesian models or networks | bnlearn [159] | Bayesian network learning and inference; does not support mixed types; naïve Bayes and tree-augmented naïve Bayes classifiers | cran.r-project.org/web/packages/bnlearn | R |
| Ensemble learning | Random forest [160] | Random forest | cran.r-project.org/web/packages/randomForest | R |
| | scikit-learn [156] | Random forest | scikit-learn.org | Python |
| Multiple kernel learning | Mklaren [161] | Simultaneous multiple kernel learning and low-rank approximation | github.com/mstrazar/mklaren | Python |
| | LibMKL [162] | Soft-margin MKLs | sites.google.com/site/xinxingxu666 | MATLAB |
| | SimpleMKL [74] | MKL SVMs | asi.insa-rouen.fr/enseignants/arakoto/code/mklindex.html | MATLAB |
| | GMKL [163] | Generalized MKL based on gradient descent and SVM | research.microsoft.com/en-us/um/people/manik/code/GMKL/download.html | MATLAB |
| Network-based methods | SNF [76] | Similarity network fusion | compbio.cs.toronto.edu/SNF | R, MATLAB |
| | KeyPathwayMiner [79] | Extract all maximally connected sub-networks | tomcat.compbio.sdu.dk/keypathwayminer | Java |
| | FuseNet [80] | Infer networks from multi-omics data | github.com/marinkaz/fusenet | Python |
| | scikit-fusion [93] | Data fusion based on DFMF | github.com/marinkaz/scikit-fusion | Python |
| | Medusa [94] | Collective-matrix-factorization-based indirect association discovery | github.com/marinkaz/medusa | Python |
| Multi-view matrix or tensor factorization | pls [164] | Partial least squares and principal component regression | cran.r-project.org/web/packages/pls | R |
| | spls [165] | Sparse PLS regression and classification; simultaneous dimension reduction and variable selection | cran.r-project.org/web/packages/spls | R |
| | O2PLS [166] | O2-PLS | github.com/selbouhaddani/O2PLS | R |
| | K-OPLS [167] | Kernel-based PLS | kopls.sourceforge.net | R, MATLAB |
| | CCAGFA [109] | GFA and CCA | cran.r-project.org/web/packages/CCAGFA | R |
| | GFAsparse [111] | Sparse GFA for biclustering | research.cs.aalto.fi/pml/software/GFAsparse | |
| | iCluster [77] | Integrative clustering of multiple genomic data types | cran.r-project.org/web/packages/iCluster | R |
| | r.jive [118] | JIVE | cran.r-project.org/web/packages/r.jive | R |
| | iNMF [116] | Integrative NMF | github.com/yangzi4/iNMF | Python |
| | MVMF | Multi-view NMFs for feature pattern discovery from multi-class data | github.com/yifeng-li/mvmf | Python |
| | Tensor Toolbox [168] | Operations on multi-way arrays | www.sandia.gov/tgkolda/TensorToolbox | MATLAB |
| | N-way Toolbox [169] | Multi-way PARAFAC, PLS and Tucker models | www.models.life.ku.dk/nwaytoolbox | MATLAB |
| | Sparse PARAFAC [170] | Sparse PARAFAC | www.models.life.ku.dk/sparafac | MATLAB |
| | CMTF [171] | Coupled matrix and tensor factorization | www.models.life.ku.dk/joda/CMTF_Toolbox | MATLAB |
| | NTFLAB | Non-negative tensor factorizations | www.bsp.brain.riken.jp/ICALAB/nmflab.html | MATLAB |
| Multi-modal learning | multimodal [144] | Multi-modal DBMs | www.cs.toronto.edu/nitish/multimodal, github.com/nitishsrivastava/deepnet | Python |
Finally, we hope this review will provide a guide for bioinformaticians to select suitable tools corresponding to specific problems. We also expect that machine learning engineers and biological data scientists can be inspired by this discussion to develop and share their own novel approaches, to push forward the study of integrative biological data analysis.
We provide a comprehensive review on biological data integration techniques from a machine learning perspective.
Bayesian models and decision trees are discussed for incorporating prior information and integrating data of mixed data types.
Tri-matrix factorizations and network-based methods are reviewed for two-relational and multi-relational association studies.
Multi-view matrix factorization models are investigated for detecting ubiquitous and view-specific components from multi-view omics data.
Multi-modal deep learning approaches are discussed for simultaneous use of multiple data sets in supervised and unsupervised settings.
Yifeng Li has been a research scientist at the National Research Council Canada since 2015. Recognized by the Governor General's Gold Medal, he obtained his Ph.D. from the University of Windsor in 2013. From 2013 to 2015, supported by an NSERC Postdoctoral Fellowship, he was a post-doctoral trainee at the University of British Columbia. His research interests include sparse machine learning models, deep learning models, matrix factorizations, feature selection, data integration, large-scale optimization, big data analysis in bioinformatics and health-informatics, gene regulation and cancer studies. He is a member of IEEE and CAIAC.
Fang-Xiang Wu is a professor in the Division of Biomedical Engineering and the Department of Mechanical Engineering at the University of Saskatchewan. His current research interests include computational and systems biology, genomic and proteomic data analysis, biological system identification and parameter estimation, and applications of control theory to biological systems. He has published more than 260 technical papers. Dr Wu serves as an editorial board member of three international journals, as a guest editor of several international journals and as a program committee chair or member of several international conferences. He has also reviewed papers for many international journals. He is a senior member of IEEE.
Alioune Ngom received his Ph.D. in 1998 from the University of Ottawa and is currently a professor at the School of Computer Science, University of Windsor. Before joining UWindsor in 2000, he held an assistant professor position at Lakehead University. His main research interests include, but are not limited to, computational intelligence and machine learning methods and their applications in computational biology and bioinformatics. His current research includes gene regulatory network reconstruction, protein complex identification, sparse representation learning, network clustering and biomarker selection. He is a member of IEEE.
Acknowledgements
We greatly appreciate the suggestions from the three anonymous reviewers, which helped make this review clearer and more comprehensive. We would like to thank Drs Youlian Pan (NRC), Alain Tchagang (NRC) and Raymond Ng (UBC) for providing valuable comments that improved this article. We also want to acknowledge Ping Luo (UofS) for searching for potential non-cancer multi-omics data.
Funding
The National Research Council Canada (NRC) and the Natural Sciences and Engineering Research Council of Canada.
References