Optimal linear ensemble of binary classifiers

Abstract

Motivation: The integration of vast, complex biological data with computational models offers profound insights and predictive accuracy. Yet, such models face challenges: poor generalization and limited labeled data.

Results: To overcome these difficulties in binary classification tasks, we developed the Method for Optimal Classification by Aggregation (MOCA) algorithm, which addresses the problem of generalization by virtue of being an ensemble learning method and can be used in problems with limited or no labeled data. We developed both an unsupervised (uMOCA) and a supervised (sMOCA) variant of MOCA. For uMOCA, we show how to infer the MOCA weights in an unsupervised way; these weights are optimal under the assumption of class-conditioned independent classifier predictions. When it is possible to use labels, sMOCA uses empirically computed MOCA weights. We demonstrate the performance of uMOCA and sMOCA using simulated data as well as actual data previously used in Dialogue on Reverse Engineering and Methods (DREAM) challenges. We also propose an application of sMOCA to transfer learning, in which we take computational models pre-trained in a domain where labeled data are abundant and apply them to a different domain with less abundant labeled data.

Availability and implementation: GitHub repository, https://github.com/robert-vogel/moca.


Introduction
The size and complexity of biological datasets, such as whole genome sequencing, single-cell RNA sequencing, proteomics, and imaging, have expanded substantially in the last decade. Applications of data exploration techniques, as well as predictive analytics such as machine learning (ML), can be used to uncover novel biological insights or to create predictive models (Ezer and Whitaker 2019).
In fact, several ML-based models, such as MammaPrint (Slodkowska and Ross 2009), Decipher (Eric et al. 2016), and IDx-DR (Van Der Heijden et al. 2018), are already in use in the clinic. However, as has been documented in the literature (Norel et al. 2011, Scialdone et al. 2015, Hu and Greene 2018), ML models may not generalize well. This is a serious challenge in critical areas of biomedicine, such as disease diagnostics and prognosis. The poor generalization of ML models (Agarwal et al. 2005, Kallus and Zhou 2018) can be ascribed to overfitting and/or biases in the training data. Therefore, extensive and objective evaluation and benchmarking of ML models is critical. Crowd-sourced data competitions, e.g. Dialogue on Reverse Engineering and Methods (DREAM) challenges (Stolovitzky et al. 2009, Saez-Rodriguez et al. 2016), provide an objective platform for benchmarking and validating the different algorithms submitted by participants.
Independently trained predictive models can be combined into one aggregate predictive model. The endeavor of combining predictions from multiple algorithms is known as ensemble learning in the ML community (Whalen and Pandey 2013, Kim et al. 2021). By not relying on any single model, ensemble learning has the potential to produce more robust algorithms than any of its constituents (Whalen and Pandey 2013). In fact, many of the most popular ML models, such as random forests and boosting algorithms, are ensemble-based predictors.
Crowd-sourced data competitions provide a natural platform for ensemble learning in that dozens of models developed by participating groups can be combined into one ensemble. DREAM challenge organizers have consistently observed in many of their crowd-sourced competitions that even a simple combination strategy, such as averaging the submitted predictions, has better generalization properties than any of the individual methods (Marbach et al. 2010, 2012, Saez-Rodriguez et al. 2016). We will call this simple averaging strategy the Wisdom of Crowds (WOC) ensemble throughout the current article.
While simple and effective in many applications, the WOC ensemble is by no means the optimal ensemble strategy. In the DREAM Network Inference Challenge (Marbach et al. 2012), e.g., a simple linear ensemble predictor in which each algorithm is assigned a weight proportional to its performance was considerably superior to the WOC ensemble. As another example, in the DREAM Digital Mammography challenge (Schaffter et al. 2020), a logistic regression-based meta-learner ensemble classifier significantly outperformed each constituent classifier as well as the WOC ensemble. Although these ensemble predictor results are promising, they were typically obtained from problems with enough labeled data to train an ensemble classifier. However, for many problems in biology and medicine, creating labeled data is very costly; we therefore usually have to use unlabeled data or, at best, limited labeled data, with which training a supervised ensemble classifier is difficult without running the risk of overfitting.
To bridge this gap, we propose two novel ensemble learning strategies. The first strategy, uMOCA (unsupervised MOCA), works in the context of unsupervised ensemble learning, where there is not a sufficient amount of labeled data to train an ensemble classifier. uMOCA takes as input a matrix of predictions from a set of classifiers and, as a first step, estimates the performance of the individual classifiers from it. It then forms a linear ensemble classifier in which each individual classifier is assigned a weight proportional to its estimated performance. We show theoretically that the uMOCA algorithm is the optimal linear ensemble under the assumption of class-conditioned independence (i.e. the classifiers make independent predictions given the class labels). Although uMOCA works robustly and accurately in many instances, in applications where the class-conditioned independence assumption is strongly violated, the performance of uMOCA suffers: uMOCA no longer accurately estimates base classifier performances and is no longer the optimal ensemble. Our second algorithm, sMOCA (supervised MOCA), addresses these situations when a labeled dataset is available. sMOCA uses a labeled dataset to calculate the performance of individual classifiers and forms an optimal linear ensemble by taking into account the correlation structure of the individual algorithms. Compared to supervised ensemble learning methods such as stacking (Whalen and Pandey 2013), where the ensemble classifier is constructed via meta-training, sMOCA is less prone to overfitting. This is especially important for problems where only a limited number of samples are available for training the ensemble classifier.
The rest of the article is structured as follows. First, we introduce the uMOCA ensemble and discuss its performance using simulation data that satisfy our class-conditioned independence assumption. Next, we introduce the sMOCA algorithm and apply both uMOCA and sMOCA to nine DREAM challenges, in which the degree of violation of the conditional independence assumption varies. Lastly, we present a new application of sMOCA to transfer learning and apply it to a skin cancer detection problem in which we have only a small amount of labeled data for training the individual classifiers.

Simulations
Simulations were prepared by rank-transforming random samples from two Gaussian distributions given user-specified classifier AUC values and a conditional correlation matrix. We set the mean of the distribution modeling the negative class to zero and the variances of the samples from each class to one. The mean μ of the positive class distribution was determined from the AUC using the expression AUC = Φ(μ/√2), with Φ representing the standard normal cumulative distribution (Marzban 2004). While the respective distributions have unit variance, the user-specified correlation matrix is used as the covariance matrix. In Fig. 3I, the covariance matrix is the identity matrix, while in Fig. 3K, the correlations were selected at random from the DREAM 9.5 Prostate Challenge dataset.
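The simulation above can be sketched in a few lines. The function name, defaults, and per-classifier rank convention below are our illustrative choices, not code from the MOCA repository; negative-class scores are N(0, 1) and each positive-class mean is set from the target AUC via AUC = Φ(μ/√2).

```python
import numpy as np
from scipy.stats import norm, rankdata

def simulate_rank_predictions(aucs, corr=None, n_samples=200, prevalence=0.5, seed=0):
    """Simulate rank predictions of base classifiers with target AUCs.

    `corr` is the user-specified class-conditioned correlation matrix
    (identity if None). Pooled scores are rank-transformed per classifier,
    with rank 1 assigned to the highest score.
    """
    rng = np.random.default_rng(seed)
    aucs = np.asarray(aucs, dtype=float)
    m = len(aucs)
    cov = np.eye(m) if corr is None else np.asarray(corr, dtype=float)
    n_pos = int(round(prevalence * n_samples))
    y = np.r_[np.ones(n_pos), np.zeros(n_samples - n_pos)].astype(int)
    mus = np.sqrt(2.0) * norm.ppf(aucs)             # positive-class means
    scores = rng.multivariate_normal(np.zeros(m), cov, size=n_samples).T
    scores[:, y == 1] += mus[:, None]               # shift positive samples
    ranks = np.vstack([rankdata(-s) for s in scores])
    return ranks, y
```

With enough samples, the empirical AUC of each simulated classifier concentrates around its target value.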

Inferring MOCA weights
The MOCA weights determined by the uMOCA algorithm were estimated piecewise. First, the prevalence of the positive class, ρ = P(Y = 1), and Δ were inferred using the techniques presented in Ahsen et al. (2019). Specifically, we implement their iterative procedure for estimating a rank-one tensor from the covariance matrix or third central moment tensor using NumPy's eigenvector decomposition of Hermitian matrices [numpy.linalg.eigh (Oliphant 2006)] and the CANDECOMP/PARAFAC decomposition with one leading factor (Kolda and Bader 2009) made readily available in TensorLy (Kossaifi et al. 2019), respectively (see Supplementary Note 3 for mathematical details).

Transfer learning for skin lesion classification
Skin lesion images were resized using the scikit-image (van der Walt et al. 2014) module transform.resize. For each deep learning model from TensorFlow Hub, we trained an L1-regularized Logistic Regression and a Gaussian Naive Bayes classifier using scikit-learn (Pedregosa et al. 2011). For classifier training and testing, we performed 10 unique runs of 5-fold cross-validation, for a total of 50 tests. The training set was split into two equal partitions. The first was used to train either the L1-regularized Logistic Regression or the Gaussian Naive Bayes classifier. For the regularized Logistic Regression, we automated the selection of the regularization strength by choosing the value that maximized the average AUC of the classifier over 10-fold cross-validation tests. Specifically, we tested 15 regularization values logarithmically spaced in [0.01, 10]. The second training partition was then used to train the sMOCA classifier by estimating the MOCA weights and running a greedy algorithm for ensemble selection.
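The regularization-selection step can be sketched with scikit-learn on synthetic stand-in data. The dataset and the mapping of the 15 log-spaced values onto scikit-learn's inverse-regularization parameter `C` are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Stand-in features/labels for the sketch (the paper uses deep-network
# features extracted from skin lesion images).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 15 logarithmically spaced regularization values on [0.01, 10]; note that
# scikit-learn's C is the *inverse* regularization strength.
Cs = np.logspace(np.log10(0.01), np.log10(10.0), 15)

grid = GridSearchCV(
    LogisticRegression(penalty="l1", solver="liblinear"),
    param_grid={"C": Cs},
    scoring="roc_auc",                     # select by mean cross-validated AUC
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```

`grid.best_score_` holds the mean 10-fold AUC of the selected value.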
The statistical significance of sMOCA outperforming all other methods in terms of AUC, BA, and F1 score was assessed with a t-test for related (paired) samples, the Wilcoxon rank-sum test, the Wilcoxon signed-rank test, and a binomial test.
For all tests, we used the corresponding implementation in the scipy.stats module (Jones et al. 2001-) and the 50 estimates of performance from the 10 independent runs of 5-fold cross-validation. To apply the binomial test, we assigned each test comparing sMOCA with an alternative a binary value: one when the performance of sMOCA was greater than that of the alternative method, and zero otherwise. A threshold of P < .001 was used for statistical significance in all measurements.
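As an illustration of the binomial test, suppose sMOCA had beaten an alternative in 46 of the 50 cross-validation tests (a made-up count for this sketch). Under the null hypothesis of no difference, wins are Bernoulli(0.5), and the one-sided p-value is the binomial survival probability:

```python
import numpy as np
from scipy.stats import binom

wins = np.array([1] * 46 + [0] * 4)            # 1 = sMOCA outperformed alternative
n_wins, n_tests = int(wins.sum()), wins.size

# P(X >= n_wins) for X ~ Binomial(n_tests, 0.5); equivalent to the
# one-sided binomial test used in the text.
p_value = binom.sf(n_wins - 1, n_tests, 0.5)
```

For 46 wins out of 50, the p-value is far below the .001 threshold.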

Results
The MOCA classifier is an optimal way of linearly combining class predictions made by a set of binary classifiers: each classifier is assigned a weight (the MOCA weight) such that the ability of the resulting ensemble classifier to discriminate between the two classes, as measured by the signal-to-noise ratio (SNR) defined below, is maximized. The MOCA ensemble has two variants: (i) uMOCA (Fig. 1A), for applications in which classifiers have been previously trained (supervised or unsupervised) and we do not have labeled data to train an ensemble, and (ii) sMOCA (Fig. 1B), for applications in which we have enough labeled data to train the ensemble classifier. In uMOCA (Fig. 1A), we infer the MOCA weights directly from the second and third statistical moments of classifier predictions on a test dataset, which can be calculated without the use of class labels. The uMOCA ensemble is optimal under the assumption that base classifier predictions are conditionally independent given the class labels. In sMOCA (Fig. 1B), we use the available sample class labels to calculate the MOCA weights.

The SNR and calculation of MOCA weights
Suppose we are given M binary base classifiers and N samples, of which N_1 belong to the positive class (y = 1) and N_0 = N − N_1 belong to the negative class (y = 0). Each of the M base classifiers assigns each sample i a score s_i commensurate with its certainty that the sample belongs to the positive (high score) or negative (low score) class. For example, in the case of a naive Bayes classifier, this score could be the posterior probability that a given sample belongs to the positive class.
In the case of a support vector machine (SVM) classifier, the score could be the distance between the sample and the separating hyperplane in feature space.
Calibration is essential to ensemble learning as different base classifiers may produce scores on different scales (Whalen and Pandey 2013). Take, e.g., the gene network inference problem, where algorithms assign a score indicative of the presence or absence of a direct functional dependence between two genes. These algorithms may produce scores on entirely different scales, such as the Pearson correlation coefficient in WGCNA (Langfelder and Horvath 2008) or mutual information in ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) (Margolin et al. 2006). The Pearson correlation coefficient takes values in the interval [−1, 1], whereas mutual information takes values in [0, ∞) (Bishop 2006). Without calibration, the contribution of a base classifier's score to the ensemble may be biased by the scale of the produced score, resulting in inferior ensemble performance. In this work, we use rank transformations of base classifier scores to calibrate classifier predictions. There are two main reasons for using ranks. Firstly, transforming scores to ranks is easy and can be done in an unsupervised way. Secondly, we can calculate important performance metrics, such as the area under the receiver operating characteristic curve (AUC), using only the ranking of samples rather than the actual scores assigned by an algorithm (Marzban 2004).

Figure 1. The MOCA strategy. The MOCA strategy is an optimal aggregation of rank-ordered predictions by pre-trained binary classifiers on new, never-before-seen data. It has two versions that can be applied in either the absence or presence of labeled data. (A) When labels are not present, the uMOCA algorithm infers the optimal weights without using any labeled examples. (B) When labels are available, the sMOCA algorithm estimates the optimal weights and greedily selects the optimal combination of base classifiers. sMOCA is noteworthy because its training data requirements can be much smaller than those of the pre-trained models it aggregates.
In what follows, we use the convention that a classifier assigns higher scores to samples it predicts to be more likely to belong to the positive class, and the rank transformation maps higher scores to lower ranks. For example, the sample that receives rank 1 is the one for which the classifier produced the largest score and is, therefore, the sample that the classifier deems most likely to be in the positive class. Let r_ik denote the rank of sample k assigned by classifier i. To measure the performance of classifier i, we use the SNR, defined as:

S_i = Δ_i / √(σ²_{i|0} + σ²_{i|1}),   (1)

where μ_{i|y} and σ²_{i|y} are the class-conditioned mean and variance of ranked classifier predictions, respectively, and Δ_i := μ_{i|0} − μ_{i|1} denotes the difference between the means of class-conditioned rank predictions of classifier i. S_i is not without precedent, and in some instances it is closely related to the AUC. Marzban (2004) showed that the AUC of normally distributed class-conditioned scores is given by Φ(S_i/√2), where Φ is the standard normal cumulative distribution and S_i is given by Eq. (1), with the mean μ and standard deviation σ corresponding to the class-conditioned Gaussian score densities. The SNR is a measure of how different the average ranks for class 0 and class 1 are, re-scaled by a measure of the dispersion of the ranks around their means for both classes. Figure 2 shows examples of the SNR for different classifiers. Figure 2A shows the relation between the SNR and the AUC for a family of classifiers characterized by class-conditioned rank distributions ranging from good classifiers (Fig. 2B) to worse-than-random classifiers (Fig. 2E). The AUC is a monotonically increasing sigmoidal function of the classifier's SNR. This relationship can be made more intuitive by inspecting the simulation results depicted in Fig. 2B-E. A high-performing classifier, as depicted in Fig. 2B, will have the highest class-conditioned rank probability mass near the extreme ranks [1, N], meaning that samples belonging to the positive (negative) class are more likely to be ranked low (high). In fact, as the number of samples increases, the AUC asymptotically converges to the probability that a classifier assigns positive samples lower ranks than negative samples. Therefore, a perfect classifier with an AUC of 1 will always rank positive samples below negative ones. This situation corresponds to class 0- and class 1-conditioned probabilities of non-overlapping support, with a class 1-conditioned rank probability equal to 1/(ρN) for ranks between 1 and ρN and 0 otherwise, and a class 0-conditioned rank probability equal to 1/((1 − ρ)N) for ranks between ρN + 1 and N and 0 otherwise, where ρ denotes the proportion of class 1 samples. In this limiting case, the SNR takes its maximum possible value, which can be explicitly computed to be

S_max = √(3 / (ρ² + (1 − ρ)² − 2/N²)) ≤ √(6 / (1 − 4/N²)).   (2)

The last inequality corresponds to the case ρ = 1/2. For ρ = 0.4 and N = 200, the maximum SNR, for which AUC = 1, is 2.4, as can be seen in Fig. 2A. An uninformative classifier, on the other hand, has a uniform conditional probability over all ranks, which implies that the classifier cannot distinguish positive samples from negative ones (Fig. 2D); in this case, the SNR = 0. We also note that a negative SNR indicates a base classifier that either inverts our convention, assigning samples from the positive class high ranks, or has systematic errors that result in the classification of negative samples as positive and vice versa (Fig. 2A and E).
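The SNR of Eq. (1) can be computed directly from rank predictions. The sketch below is our own illustrative code; it also checks the perfect-classifier value of about 2.4 quoted above for ρ = 0.4 and N = 200.

```python
import numpy as np

def snr(ranks, y):
    """SNR of one classifier's rank predictions, Eq. (1):
    S = (mu_{|0} - mu_{|1}) / sqrt(sigma^2_{|0} + sigma^2_{|1})."""
    r0, r1 = ranks[y == 0], ranks[y == 1]
    return (r0.mean() - r1.mean()) / np.sqrt(r0.var() + r1.var())

# A perfect classifier with N = 200 and rho = 0.4 ranks the 80 positive
# samples 1..80 and the 120 negative samples 81..200.
y = np.r_[np.ones(80), np.zeros(120)].astype(int)
perfect_ranks = np.arange(1, 201, dtype=float)
s_max = snr(perfect_ranks, y)
```

The computed `s_max` agrees with the closed form √(3 / (ρ² + (1 − ρ)² − 2/N²)).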
Next, we compute the SNR for an ensemble of M classifiers. Let us assume that classifier i assigns sample k the rank r_ik. If classifier i is given a weight w_i, we define the ensemble score s^w_k for the set of weights w as:

s^w_k = Σ_{i=1..M} w_i r_ik.   (3)

The SNR for the ensemble, denoted S^w_ens, is defined as the difference between the class 0- and class 1-conditioned expected values of the ensemble score, normalized by the square root of the sum of the class 0- and class 1-conditioned variances of the ensemble score. This is a natural generalization of the definition of the SNR for one classifier given in Eq. (1). We show in Supplementary Note 1 that the SNR of the ensemble can be written as:

S^w_ens = (w^T D) / √(w^T C w),   (4)

where D, w ∈ R^{M×1} and C ∈ R^{M×M}. In Eq. (4), D and w are vectors whose ith elements represent the difference in the class-conditioned expected values (Δ_i := μ_{i|0} − μ_{i|1}) and the ensemble weight w_i assigned to the ith base classifier, respectively. The matrix C := C_0 + C_1 is the sum of the class-conditioned covariance matrices (C_y) of the classifiers (see Supplementary Note 1 for details). Equation (4) is of primary interest, as it defines the ensemble SNR in terms of the classifier weights. It would be desirable to choose weights that maximize the SNR of the ensemble. We call such weights the MOCA weights and denote them by w_MOCA. We show in Supplementary Note 2 that, when normalized to have unit norm, the MOCA weights are given by

w_MOCA = C⁻¹D / ‖C⁻¹D‖.   (5)

Note that the MOCA weights given in Eq. (5) are equivalent to the weights assigned to each feature in Fisher's LDA (Fisher 1936, Xanthopoulos et al. 2013). The main difference is that MOCA operates on the rank-transformed predictions.
Using this expression in Eq. (4), we find that the MOCA ensemble SNR can be expressed as:

S^MOCA_ens = √(D^T C⁻¹ D) = ‖C^{−1/2} D‖,   (6)

where C^{−1/2} is the square root of C⁻¹, which exists given that C is symmetric and positive definite. Equation (5) shows that the MOCA weight of classifier i is a linear combination of the Δ_j's of all classifiers, weighted by elements of the inverse of the matrix C.
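Equations (4)-(6) can be sketched in a few lines; the Δ and C values below are invented for illustration.

```python
import numpy as np

def moca_weights(delta, C):
    """Eq. (5): w_MOCA = C^{-1} D / ||C^{-1} D||."""
    w = np.linalg.solve(C, delta)
    return w / np.linalg.norm(w)

def ensemble_snr(w, delta, C):
    """Eq. (4): S_ens = (w^T D) / sqrt(w^T C w)."""
    return (w @ delta) / np.sqrt(w @ C @ w)

# Illustrative values: three classifiers, the first two correlated.
delta = np.array([30.0, 20.0, 5.0])
C = np.array([[100.0, 20.0, 0.0],
              [20.0, 80.0, 0.0],
              [0.0, 0.0, 120.0]])
w_moca = moca_weights(delta, C)
s_moca = ensemble_snr(w_moca, delta, C)

w_woc = np.ones(3) / np.sqrt(3.0)      # equal-weight WOC ensemble for comparison
s_woc = ensemble_snr(w_woc, delta, C)
```

Because the MOCA weights maximize Eq. (4), `s_moca` equals √(DᵀC⁻¹D) of Eq. (6) and is at least as large as the WOC ensemble's SNR.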
It is instructive to consider the case for which the off-diagonal elements of C are 0, which occurs when the base classifiers are conditionally independent. In this case, C is diagonal with diagonal elements C_ii = σ²_{i|0} + σ²_{i|1}, and the ensemble SNR becomes

S^MOCA_ens = √(Σ_{i=1..M} S_i²).   (7)

Note that when M = 1, expression (7) reduces to Eq. (1), the SNR for one classifier, which has the upper bound given by Eq. (2). However, as M increases, the MOCA ensemble SNR grows as √M, provided the SNRs of the base classifiers in the ensemble are greater than 0, even if they are weak classifiers. This suggests that, in the case of class-conditioned independent classifiers, the MOCA ensemble approaches the perfect classifier as the number of classifiers M increases.

However, when the off-diagonal elements of the conditional covariance matrices are not zero, the effective denominator of each Δ_i term in Eq. (6) increases, which in effect down-weights the individual contributions of highly correlated methods to the ensemble, as they do not carry independent information. In consequence, MOCA weights automatically balance the performance of base classifiers and the diversity of their predictions. These two properties together, performance and classifier diversity, allow the ensemble performance to exceed that of its constituents (Whalen and Pandey 2013). In the next section, we show how to calculate the difference in conditional means Δ_i and the covariance matrices C_0 and C_1 required to compute the MOCA weights via Eq. (5) in the case of unlabeled data.

Unsupervised estimation of MOCA weights
We first apply the MOCA ensemble in an unsupervised setting. In the absence of sample class labels, we are unable to calculate the class-conditioned means and covariance matrices of sample rank predictions required to compute the MOCA weights. To make progress, we add the hypothesis that the classifiers in our ensemble are conditionally independent given the class, as also assumed in previous work (Parisi et al. 2014, Jaffe et al. 2015, Ahsen et al. 2019). Under this hypothesis, the matrix C is diagonal, and we only need to estimate the conditional mean and variance of each classifier in order to construct the MOCA ensemble.
It is remarkable that, even in the absence of labeled data, it is possible to estimate the MOCA weights w_MOCA under the assumption of class-conditioned independent classifier predictions. We showed in the previous section that, for conditionally independent classifiers, the MOCA weights are given by

w_i^MOCA = β Δ_i / (σ²_{i|0} + σ²_{i|1}),

where β is a constant chosen so that the vector w_MOCA has unit norm. In Ahsen et al. (2019), the authors show how to estimate Δ_i from unlabeled data, which we use to estimate the MOCA weights under the assumption of conditional independence of classifier predictions. The conditional independence assumption ensures that the only contributing factor to the covariance of predictions is the discriminating ability of the classifiers [for more details, see Ahsen et al. (2019)]. The final piece needed to estimate the MOCA weights, therefore, is σ²_{j|0} + σ²_{j|1}, the sum of class-conditioned variances of each base classifier j. In this work, we provide a novel way to estimate the sum of the conditional variances from the second- and third-order moments of the unconditioned rank predictions (see Supplementary Note 3 for derivation and details).
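A simplified sketch of the rank-one recovery idea, loosely following the alternating scheme of Ahsen et al. (2019): under conditional independence, the off-diagonal entries of the covariance matrix of rank predictions form a rank-one matrix, from which Δ can be recovered up to a scale factor involving the class prevalence. The implementation below is our own illustration, not the reference code.

```python
import numpy as np

def infer_delta_rank_one(Q, n_iter=200):
    """Recover d (up to sign and scale) from a matrix whose off-diagonal
    entries are Q_ij = d_i * d_j but whose diagonal is unknown.

    Alternates between imputing the diagonal from the current rank-one
    estimate and taking the leading factor. numpy.linalg.eigh returns
    eigenvalues in ascending order, so index -1 is the largest.
    """
    R = np.array(Q, dtype=float)
    for _ in range(n_iter):
        lam, V = np.linalg.eigh(R)
        v = V[:, -1] * np.sqrt(lam[-1])   # leading rank-one factor
        np.fill_diagonal(R, v ** 2)       # impute the unknown diagonal
    return v

# Demo: off-diagonals of Q come from a known d; the diagonal is offset,
# mimicking the unknown conditional-variance contributions.
d_true = np.array([3.0, 2.0, 1.5, 1.0])
Q = np.outer(d_true, d_true)
np.fill_diagonal(Q, d_true ** 2 + 1.0)
d_hat = infer_delta_rank_one(Q)
d_hat = d_hat * np.sign(d_hat[0])         # fix the sign ambiguity
```

In the full uMOCA procedure, the analogous third-moment decomposition additionally resolves the prevalence ρ and the scale of Δ.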
We first tested the uMOCA strategy by simulating predictions of conditionally independent base classifiers (Fig. 3I). In this simulated setting, uMOCA was able to accurately infer the MOCA weights, achieving a correlation of 0.99 between the real MOCA weights (calculated using the actual class labels) and the inferred MOCA weights (Fig. 3A). Using the MOCA weights in the ensemble score given in Eq. (3), we can assign a score to each sample and compute the AUC of the ensemble. A good point of comparison for uMOCA is a simple ensembling methodology, which we call the WOC ensemble (Marbach et al. 2012), consisting of setting all weights in the ensemble equal to 1, that is, w_i^WOC = 1. We next compare the uMOCA and WOC ensemble classifiers using the AUC as the performance metric. In Fig. 3E, we see that the uMOCA classifier outperforms the WOC ensemble and the best-performing individual classifier. The inferior performance of the WOC ensemble can be attributed to the fact that good and bad predictions are equally weighted. The MOCA ensemble gives more weight to the better classifiers and less weight to the worse classifiers, making the predictions more accurate.

Next, we applied uMOCA to the problem of gene network inference. Here, we collected predictions from teams participating in the DREAM2 BCL6 Transcription Factor Prediction Challenge (Stolovitzky et al. 2009). In this challenge, participating teams were asked to predict which of 200 genes are transcriptional targets of the transcription factor BCL6 and which ones are decoys. Gold standard labels were experimentally determined for these genes and used by the challenge organizers to benchmark participants' algorithms. When applying uMOCA to the rank-transformed predictions, we found that the inferred MOCA weights and the true weights computed using the gold standard labels had a correlation of 0.81 (Fig. 3B). The inferred MOCA weights were then used to assign a MOCA ensemble score to each sample. The AUC of the MOCA ensemble outperformed the best base prediction as well as the WOC ensemble (Fig. 3F). We attribute the good performance of uMOCA to the fact that our conditional independence assumption is not significantly violated in these rank predictions. This can be qualitatively assessed by inspecting the off-diagonal entries of the normalized conditional covariance matrix C = C_0 + C_1, shown as heat maps in the bottom row of Fig. 3. In Fig. 3I, the off-diagonal entries are close to zero, as is expected for conditionally independent simulation data. In comparison, Fig. 3J shows small deviations from zero correlation for the actual challenge data. We conclude that uMOCA is robust to moderate violations of the conditional independence assumption.

We next investigated the performance of uMOCA in cases for which our assumption of conditional independence is strongly violated. First, we used simulation data with conditionally dependent predictions (Fig. 3K) and observed that the accuracy of inferring the MOCA weights (Fig. 3C) is lower than in the conditionally independent counterpart (Fig. 3A). In effect, the AUC of the uMOCA classifier is less than that of the best individual method (Fig. 3G). We then applied uMOCA to the DREAM Prostate Cancer Prediction Challenge, in which participants were asked to predict survival outcomes for prostate cancer patients based on patients' clinical variables (Guinney et al. 2017). The participants were given clinical covariates from four clinical trials. In applying uMOCA, we see that the MOCA weights are inferred with a correlation of 0.23 (Fig. 3D). Here, the performance of the uMOCA classifier is not statistically significantly different from either WOC or the best individual classifier (Fig. 3H). We attribute this poor performance to the strong conditional covariance between base classifier predictions (Fig. 3L). There are two main reasons for the poor performance of the uMOCA ensemble in this case. First, the elements that go into the computation of the MOCA weights (Δ, ρ, and σ²_{i|1} + σ²_{i|0}) are not estimated accurately. Second, the functional form of the uMOCA weights is the result of optimizing the ensemble SNR under the assumption of conditional independence. Once this assumption is violated, the uMOCA weights are no longer optimal, and we need to account for the correlation between classifiers. In the next section, we show how we can overcome these shortcomings when sample labels are available.

Various unsupervised clustering methods exist in the literature, such as GMM (Gaussian Mixture Models) (Yang et al. 2012) and naive Bayes clustering (Gamez et al. 2006). Similar to uMOCA, the naive Bayes approach assumes class-conditional independence of predictions, and, similar to GMM, it uses Gaussian priors. To see how uMOCA compares to these clustering methods, we compared uMOCA to GMM (see Supplementary Note 5). Our results show that uMOCA performs better than GMM when the conditional independence assumption is not violated. When the conditional independence assumption is violated, uMOCA appears to have no clear advantage over GMM. This could be due to several reasons. For unsupervised clustering methods such as GMM, the decision boundary between modes can be more flexible than uMOCA's. Furthermore, the data need not be exactly Gaussian for a GMM to perform well, sometimes even better than uMOCA. But this cannot be known a priori. What uMOCA provides is a more principled method that is optimal under the appropriate conditions.

Supervised MOCA
We next applied MOCA in settings in which labeled samples for training the ensemble classifier exist. In such settings, the MOCA weights can be estimated directly from the empirical class-conditioned means and covariance matrices, and therefore the conditional independence assumption is not required. We call this use of MOCA supervised MOCA, or sMOCA for short. In cases where we have only a limited number of labeled samples, applying an optimal linear combination such as MOCA is less prone to overfitting. This is because we use the labels only to compute the first and second class-conditioned moments of the rank predictions, rather than training a supervised ensemble algorithm, which typically requires splitting the training data and using cross-validation schemes.
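A sketch of the empirical sMOCA weight computation (the function and demo data are illustrative, not the reference implementation): the labels are used only to form Δ and C = C_0 + C_1, after which Eq. (5) gives the weights.

```python
import numpy as np
from scipy.stats import rankdata

def smoca_weights(ranks, y):
    """Empirical sMOCA weights from labeled rank predictions
    (rows = classifiers, columns = samples): estimate Delta and
    C = C_0 + C_1, then w = C^{-1} D / ||C^{-1} D|| as in Eq. (5)."""
    R0, R1 = ranks[:, y == 0], ranks[:, y == 1]
    delta = R0.mean(axis=1) - R1.mean(axis=1)
    C = np.cov(R0) + np.cov(R1)
    w = np.linalg.solve(C, delta)
    return w / np.linalg.norm(w)

# Demo on synthetic scores: one informative and one uninformative
# classifier; sMOCA should weight the informative one far more heavily.
rng = np.random.default_rng(2)
n = 1000
y = (rng.random(n) < 0.5).astype(int)
s_good = rng.standard_normal(n) + 2.0 * y     # discriminative scores
s_bad = rng.standard_normal(n)                # pure noise
ranks = np.vstack([rankdata(-s_good), rankdata(-s_bad)])
w = smoca_weights(ranks, y)
```

The resulting weight vector has unit norm, with nearly all of its mass on the informative classifier.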
We first ran the sMOCA algorithm on the DREAM BCL6 Challenge and DREAM Prostate Cancer Prediction Challenge data discussed in the previous section. In Table 1, we see that on the BCL6 challenge sMOCA achieves an AUC = 0.915 ± 0.018, while uMOCA has an AUC = 0.901 ± 0.023. In this challenge, the performance improvement of sMOCA over uMOCA is not very dramatic, as the conditional independence assumption necessary for the use of uMOCA is only slightly violated, and therefore the estimates obtained using uMOCA are reasonably accurate. However, the conditional independence assumption is strongly violated in the DREAM Prostate Cancer Prediction Challenge, and this is reflected in the performance comparison between sMOCA and uMOCA: sMOCA achieves an AUC = 0.726, which exceeds the AUCs of uMOCA and WOC by nearly 0.04 and more than 0.05, respectively.
We performed a more extensive comparative analysis by measuring the performance of uMOCA, sMOCA, WOC, and the best individual method in seven additional DREAM challenges. The complete set of challenges represents diverse problems in computational biomedicine, including drug combination synergy and antagonism prediction, network inference, toxicogenomics, and epitope prediction. In all but a single example, we observed that sMOCA is the top-performing method in terms of AUC (Table 1). In the DREAM NCI drug combination antagonism prediction challenge and the HPN Network Inference Challenge, sMOCA outperforms every other method by an AUC of greater than 0.15 and 0.08, respectively (Table 1).

MOCA for transfer learning
So far, we have shown how to use MOCA in settings where the aim is to combine a set of trained base classifiers. Transfer learning is a field of ML in which models pre-trained in one domain are repurposed in similar domains (Pan and Yang 2010). Transfer learning is increasingly used in various biomedical problems, including cancer detection from medical images such as mammograms. For example, in the recent DREAM Digital Mammography challenge (Schaffter et al. 2020), participants used deep learning methods pre-trained on computer vision datasets such as ImageNet (Deng et al. 2009) and repurposed those models to predict breast cancer from mammograms. Given that there are various models available for repurposing, it is not known in advance which model generalizes best to the new domain. Therefore, instead of relying on one pre-trained algorithm, we can use MOCA and combine multiple algorithms.
To test the MOCA strategy for transfer learning, we collected five pre-trained models and 2750 skin lesion thumbnail images. The pre-trained models, obtained from Google's TensorFlow (Abadi et al. 2016) and TensorFlow Hub, are Inception v3 (Szegedy et al. 2016), PNASNet-Large (Liu et al. 2018), NASNet-Mobile (Zoph et al. 2017), ResNet-v2 (He et al. 2016), and MobileNet-v2 (Sandler et al. 2018). The dataset consists of 2750 skin lesion thumbnail images collected from the publicly available ISIC-archive (Gutman et al. 2016). Although 2750 images may seem like a good-sized dataset, when compared with its counterparts in image processing with millions of images, the 2750 skin lesion images represent a very modest number of labeled examples to effectively fine-tune each, or any, of the deep learning models. The pre-trained models could not be blindly applied to the new dataset. Indeed, each model required input images of a specific size, and its output layer produced a different number of features. As such, we used standard image interpolation to resize the images and appended each model's output layer with a binary classification layer (Fig. 4A). For each of the five models, we trained two distinct classification layers, making a total of 10 unique base classifiers. The classifiers we chose were L1-regularized Logistic Regression and Gaussian Naive Bayes, as they represent two distinct training strategies for the same sigmoidal function, whose relative performances were not known a priori (Ng and Jordan 2002). Indeed, the application of these modest additions to each deep learning model for the classification of 2750 images was completed with off-the-shelf tools, on a laptop CPU with modest memory, in less than a day.
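The two classification heads described above can be sketched with scikit-learn. This is a minimal illustration: random features stand in for the output-layer activations of a pre-trained network (a hypothetical stand-in, since the actual feature vectors come from models such as Inception v3 applied to resized images).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-ins for output-layer feature vectors of a pre-trained
# network: positive-class samples get a small per-feature mean shift.
n_samples, n_features = 400, 64
y = rng.integers(0, 2, n_samples)
X = rng.normal(loc=y[:, None] * 0.5, scale=1.0, size=(n_samples, n_features))

# Two classification "heads" per backbone, as in the paper:
heads = {
    "l1_logreg": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
    "gauss_nb": GaussianNB(),
}
scores = {}
for name, clf in heads.items():
    clf.fit(X, y)
    scores[name] = clf.predict_proba(X)[:, 1]  # continuous scores for ranking
```

In practice, each head would be trained on features from one of the five backbones, yielding the 10 base classifiers combined by sMOCA.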
Figure 4B presents the performance of each base classifier, the WOC, and sMOCA in terms of AUC, balanced accuracy (BA) [0.5 × (TPR + TNR)], and the F1 score (the harmonic mean of precision and recall). Here, we see that sMOCA significantly outperforms all other methods in terms of AUC. Moreover, sMOCA outperforms all other strategies in terms of the BA and F1 scores. This is of interest because BA and the F1 score evaluate the ability of a classifier to correctly infer sample class labels, an important ability for computer-aided diagnosis, where the goal is to classify an ailment rather than assign a continuous score. Together, these results show that sMOCA for transfer learning outperforms WOC and its constituent classifiers both in generating sample scores (AUC) and in inferring sample class labels (BA and F1 score).
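The three reported metrics can be computed with scikit-learn; the labels and scores below are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score, f1_score

# Hypothetical true labels and continuous classifier scores
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.3, 0.2, 0.9, 0.6])

# AUC is computed from the continuous scores (ranking quality) ...
auc = roc_auc_score(y_true, y_score)

# ... while BA and F1 require hard class labels, obtained here by thresholding
y_pred = (y_score >= 0.5).astype(int)
ba = balanced_accuracy_score(y_true, y_pred)  # 0.5 * (TPR + TNR)
f1 = f1_score(y_true, y_pred)                 # harmonic mean of precision, recall
```

This also illustrates the distinction drawn in the text: AUC evaluates sample scores, whereas BA and F1 evaluate inferred class labels.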

Conclusion
Many common biomedical tasks, such as cancer diagnostics, differential expression calling, somatic mutation calling, and gene network inference, can be viewed as binary classification problems. For example, in a differential expression analysis, we try to identify genes whose expression significantly differs between two conditions (e.g. treatment versus control). Most differential expression callers output P-values, which can be used to rank genes from the most likely to be differentially expressed between the two conditions of interest to the least likely. Many tools, such as limma (Smyth 2005) and DESeq (Anders and Huber 2012), have been proposed for differential expression analysis. Different algorithms make distinct assumptions about the data being analyzed, which in a real experimental setup are not known a priori, nor can they be estimated, given that most experiments consist of only a few biological replicates. Therefore, given the unlabeled nature of this problem, it is not known which algorithm will work best in a given context. In such problems, an ensemble of different algorithms will perform more robustly than any individual algorithm.
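The P-value ranking described above can be sketched as follows. The P-values are hypothetical, and a simple unweighted rank average stands in for the WOC-style ensemble baseline.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical P-values for 6 genes reported by three differential
# expression callers (rows: callers, columns: genes)
pvals = np.array([
    [0.001, 0.20, 0.03, 0.80, 0.04, 0.50],
    [0.005, 0.15, 0.01, 0.90, 0.20, 0.40],
    [0.002, 0.30, 0.05, 0.70, 0.02, 0.60],
])

# Rank genes within each caller (rank 1 = smallest P-value,
# i.e. most likely to be differentially expressed)
ranks = np.vstack([rankdata(p) for p in pvals])

# Unweighted ensemble (wisdom-of-crowds style): average the ranks
ensemble = ranks.mean(axis=0)
order = np.argsort(ensemble)  # genes sorted most-to-least likely DE
```

uMOCA replaces the unweighted average with performance-proportional weights estimated without labels.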
In this article, we first introduced the uMOCA algorithm, which addresses unsupervised problems such as differential expression calling. uMOCA estimates the performance of each classifier in terms of AUC and produces an ensemble classifier by assigning each classifier a weight proportional to its estimated performance; this ensemble is optimal when the individual classifiers produce conditionally independent predictions. We then showed empirically, in a simulated example and in a DREAM crowdsourcing challenge (the BCL6 DREAM Challenge), that uMOCA performs significantly better than any single algorithm and the WOC strategy. Unlike many traditional supervised learning problems where models are trained using labeled data, in this challenge the participants created models based on first principles as well as a diverse range of external datasets. This resulted in a diversity of individual models that did not significantly violate our conditional independence assumption, leading to the superior performance of uMOCA. We also compared uMOCA to unsupervised learning methods such as the Gaussian mixture model (GMM). Our empirical results show that uMOCA performs better than the GMM in cases where classifiers are conditionally independent, and that there is no significant difference between the GMM and uMOCA when the conditional independence assumption is violated.
Although uMOCA performs robustly across different tasks, it is no longer the optimal ensemble classifier when the classifier predictions are strongly conditionally dependent. The conditional independence assumption may be violated when the base classifiers are similar (typical of supervised learning problems) and/or trained on the same dataset. To deal with such cases, we proposed the sMOCA algorithm, which uses class labels to estimate the MOCA weights and performs optimal ensemble selection. This is especially useful in supervised problems where only a handful of samples are available to create the ensemble classifier. In such cases, training a meta-classifier might lead to overfitting, whereas a simple, theoretically grounded strategy such as sMOCA is more robust. Using nine crowdsourcing challenges, we showed that in eight challenges sMOCA significantly outperformed uMOCA, the best individual method, and the WOC strategy. Finally, we proposed a novel use of sMOCA in a transfer learning problem, where we used deep learning methods pre-trained on the large annotated ImageNet dataset to predict skin cancer from a labeled dataset less than 1% the size of ImageNet. Our analysis showed that there is value in using ensemble strategies in transfer learning problems: any individual model can be significantly improved by the simple WOC strategy, and the application of sMOCA to this dataset further improved on the WOC performance.
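The supervised weighting idea can be sketched on synthetic scores. This is a minimal illustration only: the AUC-to-weight transform used here (√2 · Φ⁻¹(AUC), the mean separation implied by unit-variance Gaussian score distributions) is an assumption for the sketch, not necessarily the exact MOCA weight formula.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, n)

# Synthetic held-out scores from three base classifiers of known quality.
# With unit-variance Gaussian class-conditioned scores, a class-mean
# separation of sqrt(2) * Phi^{-1}(AUC) yields the target AUC.
target_aucs = [0.85, 0.70, 0.55]
S = np.vstack([
    rng.normal(loc=y * np.sqrt(2) * norm.ppf(a), scale=1.0, size=n)
    for a in target_aucs
])

# Empirical per-classifier weights computed from the labeled data
# (illustrative transform of the empirical AUC, as noted above)
w = np.array([np.sqrt(2) * norm.ppf(roc_auc_score(y, s)) for s in S])

# Weighted combination of rank predictions (higher rank = more positive-like)
R = np.vstack([rankdata(s) for s in S])
ensemble = w @ R
```

Because the weights come from labeled data, better classifiers dominate the combination even when the base classifiers are correlated, which is the regime where sMOCA is preferred over uMOCA.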
Our results show that applying ensemble strategies improves the robustness of predictive methods in various supervised and unsupervised biomedical problems. In inherently unlabeled problems, such as gene network inference and differential expression calling, uMOCA can greatly enhance prediction performance by optimally combining predictions from different models. If a labeled dataset is available to create an ensemble classifier, the sMOCA algorithm should be preferred, as it can handle dependence between classifiers. sMOCA is especially useful in cases where only a limited number of samples is available and traditional supervised algorithms are prone to overfitting.

Figure 2. The signal-to-noise score. Simulated rank predictions of 500 samples in which 200 samples (prevalence ρ = 0.4) are from the positive class (y = 1). The simulation assumes unit-variance Gaussian class-conditioned score distributions, with the difference between the score means for classes 1 and 0 chosen such that AUC = Φ((s̄_1 − s̄_0)/√2), where Φ is the standard normal cumulative distribution function and s̄_1 and s̄_0 are the mean scores for classes 1 and 0, respectively. Estimates of the probability of a sample's rank given its class label, P(R = r | Y = y), were computed by averaging the true class labels at a given rank over 1000 replicate simulation experiments. (A) The AUC is related to the signal-to-noise score by a sigmoidal function. (B–E) Plots of the conditional distribution for methods with an AUC of (B) 0.9, (C) 0.6, (D) 0.5, and (E) 0.2.

Figure 3. The unsupervised MOCA algorithm. MOCA was applied to (A, E, I) simulated data where base classifier predictions are conditionally independent, (B, F, J) predictions by teams participating in the DREAM2 BCL6 Transcription Factor Target Prediction Challenge, (C, G, K) simulated data where base classifier predictions are conditionally dependent, and (D, H, L) predictions by teams participating in the DREAM 9.5 Prostate Cancer Prediction Challenge. For each dataset, we demonstrate MOCA's ability to infer the MOCA weights w_i, i = 1, 2, …, M, measure the AUC relative to the wisdom-of-crowds ensemble (WOC) and the best individual base classifier (Best_BC), and measure the empirical conditional correlation matrix C. Error bars represent the SEM computed from 5-fold cross-validation.

Figure 4. Transfer learning with sMOCA was applied to automated melanoma classification using 2750 images from the ISIC-archive and five deep learning models from TensorFlow Hub: (i) inception_v3, (ii) mobile_net_v2_035_224, (iii) resnet_v2, (iv) pnasnet_large, and (v) nasnet_mobile. (A) Each deep learning model was pre-trained on the ImageNet 2012 (ILSVRC-2012-CLS) dataset. To apply the models to melanoma prediction, we resized each image to match the input layer of the respective network and then used the output layer values for each image as a feature vector for binary classification by either L1-regularized Logistic Regression or Gaussian Naive Bayes. We then assessed the performance of each deep learning model paired with a binary classifier, for a total of 10 independent methods, by 10 independent rounds of 5-fold cross-validation. In each fold, we split the training data into two groups, the first for training the classification layer and the second for training sMOCA. (B) The bar chart shows the average performance, measured by AUC, BA, and F1 score ± SEM, for sMOCA, WOC, and the independent methods. sMOCA outperformed all other methods with respect to each performance measure (P < .001).

Table 1. Comparison analysis of uMOCA and sMOCA.a
a Each entry represents the mean AUC ± SEM from 5-fold cross-validation using data from unique scientific challenges.