Predictive and interpretable models via the stacked elastic net

Abstract
Motivation: Machine learning in the biomedical sciences should ideally provide predictive and interpretable models. When predicting outcomes from clinical or molecular features, applied researchers often want to know which features have effects, whether these effects are positive or negative, and how strong these effects are. Regression analysis includes this information in the coefficients but typically renders less predictive models than more advanced machine learning techniques.
Results: Here, we propose an interpretable meta-learning approach for high-dimensional regression. The elastic net provides a compromise between estimating weak effects for many features and strong effects for some features. It has a mixing parameter to weight between ridge and lasso regularization. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. We do this in a way that increases predictivity without sacrificing interpretability.
Availability and implementation: The R package starnet is available on GitHub (https://github.com/rauschenberger/starnet) and CRAN (https://CRAN.R-project.org/package=starnet).


Introduction
High-dimensional regression requires regularization. The elastic net (Zou and Hastie, 2005) generalizes ridge (L2) and lasso (L1) regularization, and overcomes some of their shortcomings. By adapting the sparsity of the model to the sparsity of the signal, it often improves predictions. One issue with the elastic net is that it has two tuning parameters: either two regularization parameters λ1 and λ2 for ridge and lasso, or one regularization parameter λ and one mixing parameter α for moderating between ridge and lasso. Tuning both α and λ is notoriously hard due to the flat cross-validated likelihood landscape (van de Wiel et al., 2019). Alternatively, fixing α close to the lasso might be a good solution, because this introduces stability (Friedman et al., 2010). As an alternative to tuning or fixing α, we propose to combine multiple values of α using stacked generalization (Wolpert, 1992). Each α renders one model with one estimated effect for each feature. Instead of selecting one α for making predictions, stacking combines the predictions from multiple α (Fig. 1). The resulting ensemble model (multiple α) might be more predictive than any of the constituent models (single α) but is less interpretable, due to multiple effects for each feature (one for each α). Rather than combining the predicted values from the base learners, we propose to combine their linear predictors. This allows us to rewrite the complex model (with multiple effects for each feature) as a simple model (with one effect for each feature). The stacked elastic net thereby increases predictivity while maintaining the interpretability of the regression coefficients. Furthermore, feature selection is possible after model fitting (Hahn and Carvalho, 2015). In the following, we introduce the stacked elastic net, analyse simulated and experimental high-dimensional data, and discuss possible extensions.

Base learners
The data consist of one outcome and p features for n samples, possibly in a high-dimensional setting (p ≫ n). For example, the outcome might represent a clinical variable, and the features might represent molecular data. Let the n × 1 vector y denote the outcome, and let the n × p matrix X denote the features. We index samples by i ∈ {1, …, n} and features by j ∈ {1, …, p}. In the generalized linear model framework, we have

h(E[y_i]) = β_0 + β_1 X_i1 + … + β_p X_ip,

where h(·) is a link function, β_0 is the unknown intercept and β = (β_1, …, β_p)ᵀ are the unknown slopes. Penalized maximum-likelihood estimation involves determining

{β̂_0, β̂} = argmax_{β_0, β} { L(y; β_0, β) − q(λ, α, β) },

where L(y; β_0, β) is the likelihood, and q(λ, α, β) is the elastic net penalty. The likelihood depends on the type of regression model (e.g. Gaussian, binomial or Poisson), and the penalty function is

q(λ, α, β) = λ Σ_j { α |β_j| + ½ (1 − α) β_j² },

where λ is the regularization parameter (λ ≥ 0), and α is the elastic net mixing parameter (0 ≤ α ≤ 1). The limits correspond to ridge (α = 0) and lasso (α = 1) regularization. We consider m different values for α, which are equally spaced in the unit interval and indexed by k ∈ {1, …, m}. For each α_k, we use 10-fold cross-validation for tuning λ_k. We consider an exponentially decreasing sequence of values for λ_k, starting with the intercept-only model (λ_k → ∞) and stopping with the (almost) unpenalized model (λ_k → 0). In short, we select the optimal λ*_k for each α_k. We retain the corresponding cross-validated linear predictors in the n × m matrix Ĥ^(cv).
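As a rough illustration of this base-learner stage (a scikit-learn sketch, not the starnet implementation), note that scikit-learn calls the mixing parameter α `l1_ratio` and the regularization parameter λ `alpha`; the mixing grid, the simulated data and the fold setup below are assumptions:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(1)
n, p = 100, 200
X = rng.standard_normal((n, p))
y = X[:, 0] + X[:, 1] + rng.standard_normal(n)

# grid of elastic net mixing parameters (alpha in the text, l1_ratio in sklearn);
# exactly 0 is avoided because coordinate descent handles pure ridge poorly
mixings = [0.05, 0.25, 0.5, 0.75, 1.0]
folds = KFold(n_splits=10, shuffle=True, random_state=1)

# n x m matrix of cross-validated linear predictors (H^(cv) in the text)
H_cv = np.empty((n, len(mixings)))
for k, a in enumerate(mixings):
    base = ElasticNetCV(l1_ratio=a, cv=10)  # tunes lambda for this mixing value
    H_cv[:, k] = cross_val_predict(base, X, y, cv=folds)
```

Each column of `H_cv` holds out-of-fold predictions for one mixing value, so no entry is predicted by a model that saw the corresponding sample during training.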

Meta learner
We then regress the outcome on the cross-validated linear predictors:

h(E[y_i]) = ω_0 + ω_1 Ĥ^(cv)_i1 + … + ω_m Ĥ^(cv)_im,

where ω_0 is the unknown intercept, and ω = (ω_1, …, ω_m)ᵀ are the unknown slopes. The intercept might allow the meta learner to reduce systematic errors from strongly correlated base learners. Since the slopes are weights, we constrain them to the unit interval, i.e. 0 ≤ ω_k ≤ 1 for all k ∈ {1, …, m}. They weight the linear predictors from the different elastic net mixing parameters. Penalized conditional maximum-likelihood estimation involves determining

{ω̂_0, ω̂} = argmax_{ω_0, ω} { L(y; ω_0, ω) − q(λ, ω) },

where L(y; ω_0, ω) is the likelihood conditional on Ĥ^(cv), and q(λ, ω) is the lasso penalty

q(λ, ω) = λ Σ_k |ω_k|.

Using the same cross-validation folds as for the base learners, we select the optimal regularization parameter λ* for the meta learner. Accordingly, in the two consecutive cross-validation loops, we use the same training sets for estimating the base and meta parameters (β_0 and β given α_k for all k; ω_0 and ω), and the same validation sets for tuning the base and meta hyperparameters (λ_k for all k; λ).
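A minimal sketch of the meta-learning step, again with scikit-learn rather than starnet: `LassoCV(positive=True)` enforces the non-negativity ω_k ≥ 0 but not the upper bound ω_k ≤ 1 from the text, so it is only an approximation; the stand-in for Ĥ^(cv) is simulated:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, m = 100, 5
signal = rng.standard_normal(n)
# stand-in for the n x m matrix of cross-validated linear predictors:
# strongly correlated columns, as base learners with similar mixings would give
H_cv = signal[:, None] + 0.3 * rng.standard_normal((n, m))
y = signal + 0.5 * rng.standard_normal(n)

# non-negative lasso as the meta learner (omega_k >= 0 via positive=True;
# the constraint omega_k <= 1 from the text is not enforced here)
meta = LassoCV(cv=10, positive=True).fit(H_cv, y)
weights = meta.coef_  # estimated weights for the m base learners
```

Because the columns of `H_cv` are strongly correlated, the lasso penalty typically sets several weights exactly to zero, so only a few base learners enter the ensemble.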
The tuned elastic net is a special case of the stacked elastic net: if the intercept equals zero (ω_0 = 0), one weight equals one (ω_k = 1), and all other weights equal zero (ω_l = 0 for all l ≠ k), the meta learner simply selects one mixing parameter (α_k). In a broader sense, van der Laan et al. (2007) distinguish between cross-validation selection and super-learning, which consist of selecting one or combining multiple base learners, respectively.

Combination
Given the cross-validated parameters λ* = (λ*_1, …, λ*_m)ᵀ and λ*, we refit the base and meta learners to all folds. For the base learners, let the 1 × m vector β̂_0 and the p × m matrix β̂ denote the estimated intercepts and slopes, respectively. For the meta learner, the estimates are ω̂_0 and ω̂ = (ω̂_1, …, ω̂_m)ᵀ. We then use the estimates from the base and meta learners to predict the outcome of previously unseen samples.
If sample i has the feature vector X_i· = (X_i1, …, X_ip)ᵀ, base learner k returns the linear predictor ĝ_ik = β̂_0k + Σ_j β̂_jk X_ij. The meta learner combines the linear predictors from all base learners:

ĝ_i = ω̂_0 + Σ_k ω̂_k ĝ_ik = (ω̂_0 + Σ_k ω̂_k β̂_0k) + Σ_j (Σ_k ω̂_k β̂_jk) X_ij.

Since the stacked linear predictor is a function of pooled estimates, we perform stacking without loss of interpretability. For each feature, the corresponding pooled estimate represents the estimated effect on the outcome. Due to ridge regularization in one of the base learners, however, all pooled estimates might be different from zero. Stacking worsens the variable selection property of the elastic net, but we still have the option to select variables after model fitting (see below).
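The equivalence between stacking the linear predictors and pooling the coefficients is easy to check numerically; the estimates below are made-up placeholders, not fitted values:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 50, 20, 5
X = rng.standard_normal((n, p))

# hypothetical refitted base-learner estimates: intercepts (m,) and slopes (p, m)
b0 = rng.standard_normal(m)
B = rng.standard_normal((p, m))
# hypothetical meta-learner estimates: intercept and weights
w0, w = 0.1, np.array([0.0, 0.2, 0.5, 0.3, 0.0])

# stacking the base linear predictors ...
G = b0 + X @ B               # n x m matrix of base linear predictors
stacked = w0 + G @ w

# ... equals predicting with one pooled coefficient per feature
beta0_pooled = w0 + b0 @ w
beta_pooled = B @ w          # p pooled slopes: one interpretable effect each
assert np.allclose(stacked, beta0_pooled + X @ beta_pooled)
```

The pooled slope for feature j is simply the ω̂-weighted sum of its m base-learner slopes, which is why the ensemble still has a single coefficient per feature.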

Extension
Decoupling shrinkage and selection (Hahn and Carvalho, 2015) allows us to perform feature selection after model fitting. The idea is to approximate the fitted linear predictor ĝ* = X β̂*. This can be achieved by regressing the fitted linear predictor on the features and estimating a sparse model:

E[ĝ*_i] = γ_0 + γ_1 X_i1 + … + γ_p X_ip,

where γ_0 is the unknown intercept, and γ = (γ_1, …, γ_p)ᵀ are the unknown slopes. Penalized maximum-likelihood estimation involves determining

{γ̂_0, γ̂} = argmax_{γ_0, γ} { L(ĝ*; γ_0, γ) − q(λ, γ) },

where L(ĝ*; γ_0, γ) is the Gaussian likelihood, and q(λ, γ) is the adaptive lasso penalty (Zou, 2006)

q(λ, γ) = λ Σ_j |γ_j| / |β̂*_j|.

The absolute values of the dense estimates (β̂*) operate as weights for the sparse estimates (γ̂). As λ increases from 0 to ∞, the number of non-zero coefficients decreases from min(n, p) to 0. We can cross-validate λ, or adjust λ so that the model includes a specific number of non-zero coefficients (e.g. Σ_j I[γ̂_j ≠ 0] = 10). We expect this approximation to work well when the pooled estimates are relatively sparse, i.e. include few values far from zero and many values close to zero. Such a situation is fairly natural for the stacked elastic net because it pools mainly sparse and strongly correlated models. Nevertheless, post-hoc feature selection might significantly decrease the predictive performance of the stacked elastic net, and should therefore be used with caution.
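One way to sketch this post-hoc selection step: an adaptive lasso with weights 1/|β̂*_j| is equivalent to a plain lasso on features rescaled by |β̂*_j|, followed by a back-transformation. The pooled estimates below are simulated placeholders, not output of the stacked elastic net:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.standard_normal((n, p))

# hypothetical pooled estimates: few large effects, many near-zero effects
beta_pooled = 0.01 * rng.standard_normal(p)
beta_pooled[:3] += np.array([1.0, -0.8, 0.5])
g_fit = X @ beta_pooled  # fitted linear predictor to be approximated

# adaptive lasso via rescaling: penalizing |gamma_j| / |beta_j*| equals
# a plain lasso after multiplying column j of X by |beta_j*|
w = np.abs(beta_pooled)
alphas, coefs, _ = lasso_path(X * w, g_fit)
gammas = coefs * w[:, None]  # back-transform to the original scale

# sparsity along the path: fewer non-zero coefficients at stronger penalties
nonzero = (gammas != 0).sum(axis=0)
```

One can then cross-validate the penalty, or pick the point on the path with the desired number of non-zero coefficients.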

Prediction accuracy
To examine the predictive performance of the stacked elastic net, we conducted a simulation study. We compared ridge, lasso, tuned elastic net and stacked elastic net regularization.
In three different scenarios, we repeatedly simulated high-dimensional data. In each iteration, we sampled several n-dimensional vectors from the standard Gaussian distribution, namely three signal variables (z_1, z_2, z_3) and p noise variables (e_1, …, e_p). We constructed the outcome from the signal variables, and the features from the signal and noise variables. In all scenarios, the n-dimensional outcome vector equals the sum of the three signal variables (y = z_1 + z_2 + z_3). The n × p feature matrix X, however, depends on the scenario (Table 1). Let x_j denote the jth column of X, for any j in {1, …, p}. Each feature equals a weighted sum of one signal variable and one noise variable:

x_j = π z_l + √(1 − π²) e_j,

where the weight π is in the unit interval, and the index l equals 1 or 2. The weight π determines whether the feature is weakly (π = 0.1), moderately (π = 0.5) or strongly (π = 0.9) correlated with the signal variable, and consequently weakly, moderately or highly predictive of the outcome. In the first scenario, one feature is strongly correlated with z_1, and another feature is strongly correlated with z_2. In the second scenario, 50% of the features are weakly correlated with z_1, and the other 50% are weakly correlated with z_2. And in the third scenario, 5% of the features are moderately correlated with z_1, and another 5% of the features are moderately correlated with z_2. The weighting ensures that all features have unit variance:

Var(x_j) = π² Var(z_l) + (1 − π²) Var(e_j) = 1.

In each scenario, we simulated the outcome (n × 1 vector y) and the features (n × p matrix X) each 100 times, where n = 10 000 and p = 500. We assessed the predictive performance using 100 samples for training and validation (internal 10-fold cross-validation) and 9900 samples for testing (hold-out). Figure 2 shows the mean squared error for the test set under different flavours of elastic net regularization (ridge, lasso, tuning, stacking).
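The second scenario, for instance, can be generated in a few lines. The √(1 − π²) weighting shown here is one construction that yields unit variance and correlation π with the signal variable, sketched as an illustration of the design rather than the exact simulation code:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10_000, 500
z = rng.standard_normal((n, 3))   # signal variables z1, z2, z3
e = rng.standard_normal((n, p))   # noise variables e1, ..., ep
y = z.sum(axis=1)                 # outcome y = z1 + z2 + z3

# second scenario: half the features weakly correlated with z1, half with z2
pi = 0.1
l = np.where(np.arange(p) < p // 2, 0, 1)   # signal index for each feature
X = pi * z[:, l] + np.sqrt(1 - pi**2) * e   # unit variance by construction
```

Swapping π and the proportion of correlated features reproduces the other two scenarios.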
These out-of-sample errors are estimates of the predictive performance on previously unseen data, with lower values indicating better predictions. Lasso outperforms ridge if the signal is sparse (1st scenario), but ridge outperforms lasso if the signal is dense (2nd scenario). Approaching the performance of the optimal elastic net mixing parameter, tuning is slightly worse than lasso in the sparse case (1st scenario), slightly worse than ridge in the dense case (2nd scenario), or better than both in the intermediate case (3rd scenario). We observe that stacking outperforms tuning in all three scenarios. Stacking is even slightly better than lasso in the sparse case and slightly better than ridge in the dense case. The most important gains relative to the best competitor occur in the intermediate case. In the three scenarios, stacking is the best approach in 79%, 67% and 88% of the iterations, respectively.
Next, we tested whether stacking leads to significantly better predictions than ridge, lasso and tuning. For this purpose, we calculated the pairwise differences in out-of-sample mean squared error, applied the two-sided Wilcoxon signed-rank test and used the Bonferroni-adjusted 5% significance level (P-value ≤ 0.05/9). Stacking significantly outperforms tuning in all three scenarios. Moreover, stacking is significantly better than ridge and lasso, but not significantly different from ridge if the signal is dense (2nd scenario). In practice, we often do not know whether ridge or lasso is more suitable for the data at hand. An advantage of the elastic net is that it automatically adapts to the sparsity level of the signal.
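The significance test above can be sketched with SciPy; the paired error vectors below are simulated placeholders, not the actual simulation results:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
# hypothetical out-of-sample MSEs over 100 simulation iterations
mse_stack = 1.00 + 0.05 * rng.standard_normal(100)
mse_tune = mse_stack + 0.03 + 0.05 * rng.standard_normal(100)

# two-sided Wilcoxon signed-rank test on the paired differences,
# judged at a Bonferroni-adjusted level (9 comparisons in the text)
stat, pval = wilcoxon(mse_stack, mse_tune)
significant = pval <= 0.05 / 9
```

The signed-rank test is a natural choice here because the paired error differences need not be Gaussian.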
For comparison, we also examined the elastic net with the fixed mixing parameters α = 0.05 (close to ridge) and α = 0.95 (close to lasso). As expected, the ridge-like elastic net performs better than ridge if the signal is sparse (1st scenario) and worse than ridge if the signal is dense (2nd scenario). The results for the lasso-like elastic net are similar to those for the lasso. Indeed, it has previously been found that the elastic net without simultaneous tuning of both penalties can mimic ridge or lasso regression (Waldron et al., 2011).
Some applications require models with a limited number of selected features. We therefore verified how post-hoc feature selection affects the predictive performance of the stacked elastic net. Figure 3 shows the generalization error for different numbers of non-zero coefficients. Models with many selected features tend to be more predictive than models with few selected features. While the stacked elastic net outperforms the lasso given a small number of non-zero coefficients, this difference vanishes for large numbers of non-zero coefficients. Post-hoc feature selection increases predictivity if the signal is sparse (1st scenario) and otherwise decreases predictivity (2nd and 3rd scenarios).

Estimation accuracy
If we knew the effects of the features on the outcome, we could examine not only the prediction accuracy but also the estimation accuracy of the stacked elastic net. We adapted the simulation study to make this possible: (i) simulating the features (n × p matrix X) from the multivariate Gaussian distribution N(μ, Σ) with a constant mean and correlation structure, namely μ_j = 0, Σ_jj = 1 and Σ_jk = 0.1 for all j and k ≠ j in {1, …, p}; (ii) generating the effects (p × 1 vector β) by setting most coefficients to zero and some coefficients to one, namely 5 (sparse scenario), 50 (dense scenario) or 20 (mixed scenario); (iii) obtaining the outcome (n × 1 vector y) by summing up the linear predictor and the residuals (y = Xβ + ε), where the residuals are Gaussian noise with half the (sample) standard deviation of the linear predictor.
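Steps (i)-(iii) translate directly into a short simulation sketch; the sample size below is a placeholder:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 500

# (i) features with mean 0, unit variance and pairwise correlation 0.1
Sigma = np.full((p, p), 0.1)
np.fill_diagonal(Sigma, 1.0)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# (ii) sparse scenario: 5 coefficients equal to one, the rest zero
beta = np.zeros(p)
beta[:5] = 1.0

# (iii) outcome: linear predictor plus Gaussian noise with half its
# (sample) standard deviation
eta = X @ beta
y = eta + rng.normal(scale=eta.std() / 2, size=n)
```

Because the true β is known, estimated coefficients can be compared directly against it, e.g. via mean squared estimation error.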

Benchmark datasets
To further examine the performance of the stacked elastic net, we analysed experimental genomics data. The R package plsgenomics includes three preprocessed gene expression sets for binary or multinomial classification, namely tumour against normal colon tissue (Alon et al., 1999), two kinds of leukaemia (Golub et al., 1999) and four types of small-blue-round-cell tumours (Khan et al., 2001). For the last dataset, we reduced the multinomial problem to four one-versus-rest binary problems. All three datasets are high-dimensional: the first covers 62 samples and 2000 features, the second covers 38 samples and 3051 features, and the third covers 83 samples and 2308 features. We did not perform any further preprocessing to ensure reproducibility and comparability. To obtain robust and almost unbiased estimates of the predictive performance, we used repeated nested cross-validation with 10 repetitions, 10 external folds and 10 internal folds. Table 2 shows the median cross-validated logistic deviance for the six binary classification problems. The stacked elastic net decreases the loss, as compared to ridge, lasso and tuning, except for lasso on the colon dataset. Under post-hoc feature selection with the number of non-zero coefficients determined by cross-validation, stacking remains competitive. Figure 4 shows the median cross-validated loss for different elastic net mixing parameters. For 'leukaemia' and 'SRBCT', the loss decreases between 0 (ridge) and some α, and then increases between this α and 1 (lasso). The optimal elastic net mixing parameter, across all cross-validation repetitions, is α = 0.95 for 'colon', α = 0.2 for 'leukaemia' and α = 0.4 for 'SRBCT'. If we had known these values before the analysis, we would have minimized the cross-validated loss. Searching for the optimal α in each cross-validation iteration, we either find or miss the optimal α. This is why the tuned elastic net never outperforms the elastic net with the optimal α for a single split.
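The repeated nested cross-validation can be sketched as follows, with a lasso-penalized logistic model standing in for the full set of learners; the dataset, repetition count and fold counts are reduced placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# placeholder for a high-dimensional binary classification dataset
X, y = make_classification(n_samples=80, n_features=200,
                           n_informative=10, random_state=1)

losses = []
for rep in range(2):  # 10 repetitions in the text; fewer here for speed
    outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=rep)
    # the inner CV (inside LogisticRegressionCV) tunes the penalty strength;
    # the outer loop yields out-of-sample class probabilities
    model = LogisticRegressionCV(penalty="l1", solver="liblinear", Cs=5, cv=5)
    prob = cross_val_predict(model, X, y, cv=outer, method="predict_proba")[:, 1]
    losses.append(2 * log_loss(y, prob))  # logistic deviance (2x mean NLL)

median_loss = float(np.median(losses))
```

Keeping hyperparameter tuning strictly inside the outer training folds is what makes the outer-loop loss an almost unbiased performance estimate.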
In contrast, the stacked elastic net may outperform the elastic net with the optimal a. We observe this for two out of three applications, namely 'leukaemia' and 'SRBCT'.

Normal/tumour classification
The Cancer Genome Atlas (The Cancer Genome Atlas Research Network et al., 2013) provides genomic data for 33 cancer types. We retrieved the upper quartile normalized RSEM (RNA-Seq by expectation-maximization) TPM (transcript per million) gene expression values (R package curatedTCGAData), merged replicated measurements (R package MultiAssayExperiment) and extracted the sample definitions from the barcodes (R package TCGAutils). We retained 'solid tissue normal' (collected near the tumour) and 'primary solid tumour' samples. For each cancer type, we retained the 2000 most variably expressed genes, and standardized their expression values.
For cancer types with at least five normal and five tumour samples, we repeatedly trained and validated models with approximately 90% of the samples, and tested the models with approximately 10% of the samples. Table 3 shows the cross-validated logistic deviance under different regularization methods. Here, lasso performs better than ridge for 13 out of 15 cancer types, and stacking performs better than tuning for 11 out of 15 cancer types. The mean decrease in cross-validated logistic deviance from tuning to stacking is 7.5%, and the two-sided Wilcoxon signed-rank test returns a P-value of 0.06. Post-hoc feature selection with the number of non-zero coefficients determined by cross-validation leads to competitive models, except for cholangiocarcinoma (CHOL). The problem with this cancer type might be the small sample size together with the fact that normal and tumour samples are derived from the same patients. In any case, results for such small sample sizes are inherently unreliable.

Fig. 3. Median out-of-sample mean squared error against the number of non-zero coefficients, for the lasso (grey) and the stacked elastic net with post-hoc feature selection (black), in the first (left), second (centre) and third (right) scenarios. The dashed lines indicate the medians from the unrestricted versions.

Discussion
The elastic net is the method of choice for many biomedical applications, because it renders predictive and interpretable models. It weights between ridge and lasso regularization, but the optimal weighting is often unknown. Instead of selecting one weighting by tuning, we combine multiple weightings by stacking. According to our empirical analyses, this improves the predictive performance of the elastic net in various settings. The increase in computational cost is negligible, because the only addition is the low-dimensional regression of the outcome on the cross-validated linear predictors. The equivalence between stacking linear predictors and pooling regression coefficients allows us to increase the predictive performance while maintaining the interpretability of the regression coefficients.
In contrast to the lasso, the stacked elastic net might or might not perform feature selection. It selects features unless the meta learner includes the base learner with pure ridge regularization, but it tends to select more features than the tuned elastic net, because it combines multiple base learners. The stacked elastic net selects a feature if and only if the meta learner selects a base learner that selects this feature. It is therefore possible to impose feature selection by excluding the base learner with pure ridge regularization (α > 0). As this might fail to render sufficiently sparse models, we suggest performing post-hoc feature selection (Hahn and Carvalho, 2015) but recommend verifying by cross-validation whether imposing sparsity makes the model much less predictive.
An extension of the stacked elastic net would be to use a fused penalty (Tibshirani et al., 2005) for the meta learner, because the base learners are related through the elastic net mixing parameter. Another extension would be to combine two ensemble techniques, namely stacking and bagging. While stacking involves fitting different models to the same samples and weighting the predictions, bagging involves fitting the same model to different bootstrap samples and averaging the predictions. Since random (bagged) regressions seem to be competitive with random forests (Song et al., 2013), we could potentially combine stacking and bagging to make elastic net regression even more predictive without making it less interpretable.

Table 3. Cross-validated logistic deviance for binary classification problems (rows) under different regularization methods (columns), with the class frequencies (0/1) in the first two columns, and the results for post-hoc feature selection in parentheses.