Integrating multimodal data through interpretable heterogeneous ensembles

Abstract

Motivation: Integrating multimodal data represents an effective approach to predicting biomedical characteristics, such as protein functions and disease outcomes. However, existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. In particular, early and intermediate approaches that rely on a uniform integrated representation reinforce the consensus among the modalities, but may lose exclusive local information. The alternative late integration approach, which can address this challenge, has not been systematically studied for biomedical problems.

Results: We propose Ensemble Integration (EI) as a novel systematic implementation of the late integration approach. EI infers local predictive models from the individual data modalities using appropriate algorithms, and uses heterogeneous ensemble algorithms to integrate these local models into a global predictive model. We also propose a novel interpretation method for EI models. We tested EI on the problems of predicting protein function from multimodal STRING data, and mortality due to coronavirus disease 2019 (COVID-19) from multimodal data in electronic health records. We found that EI accomplished its goal of producing significantly more accurate predictions than each individual modality. It also performed better than several established early integration methods for each of these problems. The interpretation of a representative EI model for COVID-19 mortality prediction identified several disease-relevant features, such as laboratory test results (blood urea nitrogen and calcium), vital sign measurements (minimum oxygen saturation) and demographics (age). These results demonstrate the effectiveness of the EI framework for biomedical data integration and predictive modeling.

Availability and implementation: Code and data are available at https://github.com/GauravPandeyLab/ensemble_integration.

Supplementary information: Supplementary data are available at Bioinformatics Advances online.

learn respectively) were used. The only exceptions were specifying C=0.001 for SVM and M=100 for LR to control the time to convergence, based on our previous experience with these algorithms. However, to avoid overfitting, we did not optimize the parameters of any of the prediction algorithms individually for each dataset and/or label.
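As an illustration, a minimal sketch of these two non-default settings is shown below, assuming scikit-learn implementations of the base classifiers; interpreting M as a cap on the number of solver iterations is our assumption, not a detail stated here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# A small C keeps the SVM heavily regularized, bounding time to convergence.
svm = SVC(C=0.001, probability=True)

# M=100 read here as a maximum of 100 solver iterations (assumption).
lr = LogisticRegression(max_iter=100)
```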
All the training of the local and EI ensemble models was conducted in a nested cross-validation (Nested CV; Section 2.4) setup. In this setup, the whole dataset is split into five outer folds, which are further divided into inner folds. The inner folds are used for training the local models, while the outer folds are used for training and evaluating the ensembles. Nested CV also helps reduce overfitting during heterogeneous ensemble learning by separating the sets of examples on which the local and ensemble models are trained and evaluated (Whalen et al., 2016).
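To make this setup concrete, the following is a minimal, single-modality sketch of nested CV for a stacked ensemble, using scikit-learn on synthetic data; the actual EI implementation applies the same scheme across multiple data modalities and many local models.

```python
# Minimal single-modality sketch of the Nested CV setup (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in outer.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Inner folds: out-of-fold predictions of the local model become the
    # meta-features on which the ensemble (stacker) is trained, so the local
    # and ensemble models are never fit on the same examples' labels at once.
    local = RandomForestClassifier(random_state=0)
    meta = cross_val_predict(local, X_tr, y_tr, cv=5,
                             method="predict_proba")[:, [1]]
    stacker = LogisticRegression().fit(meta, y_tr)
    # Refit the local model on all inner data, then evaluate the stacked
    # ensemble on the held-out outer fold.
    local.fit(X_tr, y_tr)
    test_meta = local.predict_proba(X[test_idx])[:, [1]]
    print(stacker.score(test_meta, y[test_idx]))
```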
All the algorithms and their parameters are included in the EI code provided at the public GitHub repository mentioned above. Users of the code can also change these settings as they desire.
• Model: Note that our study was focused on proposing and evaluating prediction algorithms, such as EI and benchmarks like deepNF and Mashup, and not on proposing one or more specific models for our target problems. The only exception to this was the EI-based COVID-19 mortality prediction model that was interpreted in Section 3.3. We have shared this model through the GitHub repository mentioned above. We also hope that the results of the interpretation of this model will help shed light on COVID-19 pathophysiology, as well as help other researchers design and conduct related studies. More importantly, we hope that our EI framework provides a novel, reliable methodology for building specific models in other studies.
• Evaluation: As explained in Section 2.4, as well as the relevant subsections of Section 3 (Results), we rigorously evaluated our proposed EI framework and compared it with relevant benchmark approaches. Specifically, we used the Nested CV setup described above to fairly evaluate all the algorithms, as well as to reduce overfitting in the process. We also used a variety of evaluation metrics, most prominently Fmax, which was recommended by the Critical Assessment of Protein Function Annotation (CAFA) exercise (Radivojac et al., 2013) for the evaluation of supervised methods for unbalanced classes, like in PFP (a minimal sketch of this metric follows this list). We also evaluated the consistency of our EI interpretation method with other methods and evidence in the literature (Section 3.3). Thus, consistent with the focus of our study, we rigorously evaluated all the algorithms tested, and assessed the results they generated.
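For reference, Fmax is the maximum F-measure obtained over all thresholds on the continuous prediction scores. Below is a minimal sketch for the binary, per-label case; note that CAFA's protein-centric variant additionally averages precision and recall over proteins at each threshold.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def fmax(y_true, y_score):
    """Maximum F-measure over all thresholds on the prediction scores."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    # Elementwise F1 at each threshold; the epsilon avoids division by zero.
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return float(f1.max())
```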
We hope that the substantial details we have provided for our study, in accordance with the DOME recommendations, will aid its reproducibility and utility.

Supplementary Fig. 2: Overview of the workflow for identifying the best-performing algorithms for protein function prediction. These algorithms, namely (a) EI, (b) classifiers on integrated networks derived using deepNF and Mashup and (c) heterogeneous ensembles applied to the individual data modalities, were applied to the STRING data as described in Section 2.3.1. Also marked in the workflows are the layers (steps) at which data and/or information were integrated.
Based on the cross-validation results obtained, we identified and compared the best-performing algorithms in each of these categories for each GO term.
Overview of the EI model interpretation method. The method is based on local model ranks (LMRs, purple arrow) and local feature ranks (LFRs, red arrow). LMR denotes the importance of a local model derived from one of the data modalities (e.g., Local model(s) 1 derived from Modality 1) to the final EI model, while LFR denotes the contribution of each feature in the corresponding data modality (e.g., A-D in Modality 1) to a local model. The method averages the product of the LMR and LFR for each valid pair of local model and feature into a rank product score (RPS). The final ranking of all the features in terms of their importance is determined by sorting the RPSs in ascending order.
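As a concrete illustration of this scoring scheme, consider the following minimal sketch; the function name and input layout are hypothetical and not taken from the released EI code.

```python
import numpy as np

# Hypothetical layout: lmrs[m] is the rank (importance) of local model m
# within the EI model; lfrs[m][f] is the rank of feature f within model m.
def rank_product_scores(lmrs, lfrs):
    products = {}
    for m, lmr in lmrs.items():
        for f, lfr in lfrs[m].items():
            products.setdefault(f, []).append(lmr * lfr)
    # Average the LMR x LFR products over all valid (model, feature) pairs.
    rps = {f: float(np.mean(p)) for f, p in products.items()}
    # Lower RPS = more important, so the final ranking sorts in ascending order.
    return sorted(rps.items(), key=lambda kv: kv[1])

# Example: two local models, the first ranked higher (rank 1) in the ensemble.
print(rank_product_scores({"m1": 1, "m2": 2},
                          {"m1": {"A": 1, "B": 2}, "m2": {"A": 2, "C": 1}}))
```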