Abstract

Motivation

High-throughput sequencing technologies generate huge amounts of data, permitting the quantification of microbiome compositions. The obtained data are essentially sparse compositional data vectors, namely vectors of bacterial gene proportions which compose the microbiome. Consequently, the need for statistical and computational methods that account for the special nature of microbiome data has increased. A critical aspect in microbiome research is to identify microbes associated with a clinical outcome. Another crucial aspect of high-dimensional data is the detection of outlying observations, whose presence seriously affects prediction accuracy.

Results

In this article, we connect robustness and sparsity in the context of variable selection in regression with compositional covariates and a continuous response. The compositional character of the covariates is taken into account by a linear log-contrast model, and elastic-net regularization achieves sparsity in the regression coefficient estimates. Robustness is obtained by performing trimming in the objective function of the estimator. A reweighting step increases the efficiency of the estimator, and it also allows for diagnostics in terms of outlier identification. The numerical performance of the proposed method is evaluated via simulation studies, and its usefulness is illustrated by an application to a microbiome study with the aim of predicting caffeine intake based on the human gut microbiome composition.

Availability and implementation

The R-package ‘RobZS’ can be downloaded at https://github.com/giannamonti/RobZS.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Human microbiome studies that make use of high-throughput technologies are a valuable source of information for human health. The human microbiome, constituted by all microorganisms in and on the human body, is associated with health and has an impact on the risk of disease. A microbiome dataset, derived from 16S rRNA sequencing, consists of the abundances of microbial operational taxonomic units (OTUs), or bacterial taxa. Such count data are usually normalized by the total abundance of a sample to relative abundances, to account for differences in sequencing depth. In that case, the values of each observation sum to 1 or 100%, if reported as proportions or percentages, and many statistical methods, such as regression on the normalized data, are no longer applicable because of the resulting data singularity.

In the literature, microbiome data have been considered and successfully treated as compositional data (CODA) (Li, 2015; Gloor et al., 2016; Quinn et al., 2018). CODA convey relative information in the sense that the (log-)ratios of the values between the variables are of major interest for the analysis (Filzmoser et al., 2018). This leads to many appealing properties, one of them being scale invariance: the ratios between the original or between the relative abundances are identical, and thus the statistical analysis does not depend on how the data are normalized, or on whether they are normalized at all. The lack of scale invariance is also a major argument against applying standard regression models to the original data, even if the data have not been normalized to a constant sum.

In many studies, the microbiome composition is used as a covariate in regression models to analyze its association with a clinical outcome. This composition has special features: it consists of many variables and thus forms high-dimensional data. Many abundances are reported as zero, and thus we can speak of sparse data. Typically, only a few variables from the composition are important to model the outcome, and most of them are irrelevant for the model. Thus, an appropriate variable selection is desirable and necessary. A further challenge is posed by data outliers, i.e. observations which deviate from the majority of the samples. Ideally, outlying observations should receive smaller weight in the regression problem in order to reduce their influence on the estimated regression coefficients. Robust regression methods provide such an appropriate weighting scheme (Maronna et al., 2006).

In this article, we propose a unified framework capable of integrating robust techniques into the variable selection and coefficient estimation problem in high-dimensional regression with compositional covariates, leading to parsimonious inferential solutions and models which are easier to interpret.

Several methods to perform regression with compositional explanatory variables have been presented in the literature: for a mixture experiment, Aitchison and Bacon-Shone (1984) introduced a CODA regression model based on linear log-contrasts, namely linear combinations of log-ratios between compositional parts.

In the high-dimensional setting, Lin et al. (2014) considered variable selection and estimation for the log-contrast model. They proposed an $\ell_1$ regularization method for the linear log-contrast regression model with a linear constraint on the coefficients. Their constrained Lasso, also known as ZeroSum regression, incorporates the compositional nature of the data into the model and works well in the high-dimensional setting where the number of available regressors $p$ is much larger than the number of observations $n$. Shi et al. (2016) extended the linear log-contrast regression model by imposing a set of multiple linear constraints on the coefficients in order to achieve subcompositional coherence of the results obtained at the different taxonomic ranks to which the taxa belong. Altenbuchinger et al. (2017) imposed an elastic-net penalty on the ZeroSum regression and developed a coordinate descent algorithm for the estimation. This constrained regularized regression method has been applied to compositional covariates, and more generally to reference-point-insensitive analyses of biological measurements such as the human microbiome.

Bates and Tibshirani (2019) adapted the Lasso for CODA, using the log-ratios of all variable pairs of the components as predictors. They proposed a two-step fitting procedure that combines a convex filtering step with a second non-convex pruning step, yielding highly sparse solutions to cope with the very large dimensionality of the predictor space.

Some penalized robust estimation methods have recently been proposed in the literature. These include an MM-estimator with a ridge penalty (Maronna, 2011), a sparse least trimmed squares (LTS) regression estimator with a Lasso penalty (Alfons et al., 2013) and with an elastic-net penalty (Kurnaz et al., 2018), a regularized S-estimator with an elastic-net penalty (Freue et al., 2019) and bridge MM-estimators (Smucler and Yohai, 2017), among others.

In this article, we propose a robust version of the penalized ZeroSum regression. Robustness is achieved by trimming large residuals, motivated by the fact that outliers in the data affect the inferential results, and also small misspecifications of the underlying parametric model can lead to poor prediction accuracy (Huber and Ronchetti, 2009; Maronna et al., 2006).

The outline of the paper is as follows. Section 2 reviews the regression models with compositional covariates, and presents the Robust ZeroSum regression estimator. Simulation experiments are conducted in Section 3 to evaluate the numerical performance of the proposed method. Section 4 presents an application to gut microbiome data, and the final Section 5 concludes.

2 Regression models for CODA

2.1 Linear log-contrast model

In the seminal work of Aitchison and Bacon-Shone (1984), a regression model for CODA was introduced, which is known as the linear log-contrast model. It is related to the design of experiments with mixtures, called simplex designs. Consider a matrix of compositional covariates $X = [x_{ij}]_{1\le i\le n;\, 1\le j\le p}$, w.l.o.g. expressed with constant sum 1. Thus, each row lies in the unit simplex $S^p = \{(x_1,\dots,x_p) : x_j > 0 \text{ and } \sum_{j=1}^p x_j = 1\}$. The log-transformed values of $X$ are collected in the matrix $Z = [z_{ij} = \log(x_{ij})]_{1\le i\le n;\, 1\le j\le p} \in \mathbb{R}^{n\times p}$. A log-contrast is defined, in a symmetric form, as a linear combination of the columns of $Z$ with coefficients $\beta = (\beta_1,\dots,\beta_p)^T$, thus $Z\beta$, with the constraint $\sum_{j=1}^p \beta_j = 0$ (Lin et al., 2014). The linear log-contrast model considers a response with values $y = (y_1,\dots,y_n)^T$, and the corresponding regression model with the log-contrast as covariates is
$$y = Z\beta + \varepsilon, \quad \text{s.t.} \quad \sum_{j=1}^p \beta_j = 0,$$
(1)
where $\varepsilon$ is the error component, usually assumed normally distributed around zero with constant variance $\sigma^2$. The parameters are usually estimated by the least-squares (LS) method, taking the constraint on the parameters into account. Note that the formulation (1) does not include an intercept in the model, as it can be omitted by centering all the predictor variables and the response.
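To make the construction concrete, the following minimal R sketch builds the log-contrast design of model (1) from a toy count table; the object names (counts, X, Z, y) are illustrative and not part of the RobZS package.

  # Minimal sketch (base R): from counts to the log-contrast design of model (1).
  set.seed(1)
  counts <- matrix(rpois(20 * 5, lambda = 50), nrow = 20, ncol = 5)  # toy count table, assumed free of zeros
  X <- counts / rowSums(counts)                  # closure: each row lies in the unit simplex
  Z <- log(X)                                    # log-transformed compositions
  y <- rnorm(20)                                 # some continuous response
  Zc <- scale(Z, center = TRUE, scale = FALSE)   # centering predictors and response
  yc <- y - mean(y)                              # allows fitting model (1) without an intercept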
In the high-dimensional setting, when the sample size $n$ is smaller than the number of predictors $p$ and the ordinary LS method is not applicable, Lin et al. (2014) proposed a variable selection and estimation procedure for a sparse log-contrast model,
$$\hat\beta_{\text{sparse}} = \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} \left( \frac{1}{n}\|y - Z\beta\|_2^2 + \lambda\|\beta\|_1 \right), \quad \text{s.t.} \quad \sum_{j=1}^p \beta_j = 0,$$
(2)
where $\lambda > 0$ is the regularization parameter, which calibrates the sparseness, and $\|\cdot\|_2$ and $\|\cdot\|_1$ denote the $\ell_2$ and $\ell_1$ norm, respectively. Depending on the choice of $\lambda$, several or many of the components of $\beta$ are zero, and thus this sparse regression coefficient vector corresponds to a variable selection in the model. The authors introduced a coordinate descent method of multipliers to estimate the model parameters. By virtue of the zero-sum constraint, the proposed estimator fulfills desirable compositional properties such as scale invariance, i.e. the regression coefficients are independent of an arbitrary scaling of the basis counts from which a composition is obtained, permutation invariance and selection invariance. The selection invariance property asserts that the estimator is unchanged if one knew in advance which components would be estimated as zero and applied the procedure only to the components associated with non-zero coefficients (Lin et al., 2014).
Altenbuchinger et al. (2017) combined the variable selection problem and estimation for model (1) with the elastic-net regularization (Zou and Hastie, 2005),
$$\hat\beta_{\text{ZeroSum}} = \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} \left( \frac{1}{n}\|y - Z\beta\|_2^2 + \lambda P_\alpha(\beta) \right), \quad \text{s.t.} \quad \sum_{j=1}^p \beta_j = 0,$$
(3)
where $P_\alpha(\beta) = \alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2$ is the elastic-net penalty and $\alpha \in [0,1]$ is a tuning parameter which balances the $\ell_1$ and $\ell_2$ penalties. Model (3) is known as the ZeroSum elastic-net estimator, to emphasize the constraint on the regression coefficients in conjunction with the elastic-net regularization. To fit model (3), a coordinate descent-based algorithm (Friedman et al., 2007) was implemented. Setting $\alpha = 1$, model (3) is the Lasso model, and for $\alpha = 0$ it is Ridge regression (Tibshirani, 1994).
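As a small illustration, the sketch below evaluates the elastic-net penalty $P_\alpha$ and the ZeroSum objective in (3) for a candidate coefficient vector satisfying the zero-sum constraint; it is not the coordinate descent fitting algorithm itself, and all object names are illustrative.

  # Minimal sketch (base R): evaluate the ZeroSum elastic-net objective (3).
  en_penalty <- function(beta, alpha) {
    alpha * sum(abs(beta)) + (1 - alpha) / 2 * sum(beta^2)
  }
  zerosum_objective <- function(beta, y, Z, lambda, alpha) {
    stopifnot(abs(sum(beta)) < 1e-8)               # zero-sum constraint on the coefficients
    mean((y - Z %*% beta)^2) + lambda * en_penalty(beta, alpha)
  }
  set.seed(1)
  Z <- matrix(rnorm(20 * 5), 20, 5)                # stand-in for log-transformed compositions
  y <- rnorm(20)
  zerosum_objective(c(1, -0.5, -0.5, 0, 0), y, Z, lambda = 0.1, alpha = 0.5)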

2.2 Robust and sparse regression models for CODA

It is well known that the ordinary LS estimator for linear regression is very sensitive to the presence of outliers in the space spanned by the dependent variable, namely vertical outliers, and in the space spanned by the regressors, namely leverage points. To overcome this issue, several robust alternatives have been proposed in the literature; among others, we focus here on Rousseeuw's LTS estimator (Rousseeuw, 1984).

Consider a regression of the response $y$ on the design matrix $X \in \mathbb{R}^{n\times p}$, i.e. $y = X\beta + \varepsilon$, where $\beta = (\beta_1,\dots,\beta_p)^T \in \mathbb{R}^p$. For every $\beta \in \mathbb{R}^p$ we denote the corresponding residuals by $r_i(\beta) = y_i - x_i^T\beta$ $(i = 1,\dots,n)$, where $x_i = (x_{i1},\dots,x_{ip})^T$. Denoting the ordered squared residuals by $r_{(1)}^2 \le \dots \le r_{(n)}^2$, the LTS estimator is obtained as
$$\hat\beta_{\text{LTS}} = \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} \sum_{i=1}^h r_{(i)}^2(\beta),$$
(4)
where $h$ may lie between $n/2$ and $n$. The specific choice of $h$ depends on the desired properties of the resulting estimator: a smaller value leads to more robustness, but to less efficiency, and vice versa. Minimizing (4) is equivalent to finding the subset of size $h$ with the smallest LS objective function. As the number of observations and covariates increases, the search for the LTS estimate in (4) becomes computationally more and more expensive, and thus the fast-LTS algorithm (Rousseeuw and Van Driessen, 2006) was proposed. The basic idea of this algorithm is the concentration step (C-step), in which the most promising subsets of size $h$ are used to find a local optimum. These C-steps can be repeated a specified number of times, or iterated until convergence.
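The following minimal R sketch illustrates one concentration step for plain (unpenalized) LTS: fit LS on the current subset, compute squared residuals for all observations, and keep the h smallest. It is a simplified illustration with toy data, not the fast-LTS implementation.

  # Minimal sketch (base R) of the C-step for plain LTS.
  set.seed(2)
  n <- 50; p <- 3; h <- 38
  X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
  y <- as.vector(X %*% c(1, 2, -1) + rnorm(n))
  c_step <- function(H, X, y, h) {
    fit  <- lm.fit(X[H, , drop = FALSE], y[H])            # LS fit on the current subset
    res2 <- as.vector((y - X %*% fit$coefficients)^2)     # squared residuals for all n observations
    sort(order(res2)[seq_len(h)])                         # indices of the h smallest residuals
  }
  H <- sample(n, h)                                        # a random starting subset
  for (k in 1:20) H <- c_step(H, X, y, h)                  # iterate C-steps (until convergence in practice)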
Alfons et al. (2013) proposed an extension of the fast-LTS algorithm for sparse data by adding an $\ell_1$ penalty on the LTS coefficient estimates, leading to the sparse LTS estimator
$$\hat\beta_{\text{SLTS}} = \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} \left( \sum_{i=1}^h r_{(i)}^2(\beta) + h\lambda\|\beta\|_1 \right),$$
(5)
for $h \le n$ and tuning parameter $\lambda \ge 0$. Sparse LTS regression is equivalent to detecting the subset of $h$ observations whose Lasso fit leads to the smallest penalized sum of squared residuals. The sparse LTS estimator can be interpreted as a trimmed version of the Lasso. Due to the $\ell_1$ penalty, some of the estimated regression coefficients are exactly zero, thus performing variable selection. Since potential outliers are trimmed in the objective function, the sparse LTS estimator is robust against vertical outliers and leverage points.
Kurnaz et al. (2018) extended the work of Alfons et al. (2013) by substituting the $\ell_1$ penalization in (5) with an elastic-net (EN) penalty. They proposed the trimmed (EN)LTS estimator, defined by
$$\hat\beta_{\text{(EN)LTS}} = \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} \; \min_{H \subseteq \{1,\dots,n\}:\, |H| = h} \left( \sum_{i\in H} (y_i - x_i^T\beta)^2 + h\lambda P_\alpha(\beta) \right),$$
(6)
where $P_\alpha(\beta)$ is the elastic-net penalty as in (3), $H$ is an outlier-free subset of the set of all indices $\{1,2,\dots,n\}$, and $|H|$ denotes the cardinality of the set $H$. They used an analogue of the iterative fast-LTS algorithm along with a 'warm start' strategy to obtain the optimal choice of the tuning parameters $\alpha$ and $\lambda$ in (6).

Our study extends the trimmed (EN)LTS estimator to a constrained parameter space, to convey the zero-sum constraint typical for compositional covariates. We call our estimator the Robust ZeroSum (RobZS) estimator. The interesting aspect of RobZS is the combination of the zero-sum constraint of the regression coefficients, with the elastic-net regularization in a robust way.

The algorithm to find the solution of RobZS is detailed in Section 2.3. The selection of the tuning parameters α and λ will be discussed in Supplementary Section S1 of Supplementary Material, and an extensive simulation study, reported in Section 3, demonstrates the robustness of the estimator in presence of data outliers.

2.3 Algorithm

A data preprocessing step is required: the response variable and the compositional covariates are robustly centered by the median.

Let $R(H,\beta)$ denote the objective function of the RobZS regression, for a fixed combination of the tuning parameters $\alpha$ and $\lambda$, based on a subsample of observations from the index set $H \subseteq \{1,\dots,n\}$ with $|H| = h \le n$,
$$R(H,\beta) = \sum_{i\in H} (y_i - z_i^T\beta)^2 + h\lambda P_\alpha(\beta), \quad \text{s.t.} \quad \sum_{j=1}^p \beta_j = 0,$$
(7)
where $P_\alpha(\beta)$ is the elastic-net penalty as in (3). For each subsample given by the set $H$ we can obtain $\hat\beta_H$ as
$$\hat\beta_H = \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} R(H,\beta), \quad \text{s.t.} \quad \sum_{j=1}^p \beta_j = 0.$$
Let $H_{\text{opt}}$ be the optimal subset, $H_{\text{opt}} = \operatorname*{arg\,min}_{H \subseteq \{1,\dots,n\}:\, |H| = h} R(H, \hat\beta_H)$, that is, the subset of $h \le n$ observations which leads to the smallest penalized residual sum of squares while preserving the zero-sum constraint. The optimal solution is then
$$\hat\beta_{\text{opt}} = \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} R(H_{\text{opt}}, \beta).$$
(8)
The optimal subset $H_{\text{opt}}$ is obtained using an analogue of the fast-LTS algorithm, based on iterated C-steps on diverse initial subsets. The C-step at iteration $\kappa$ consists of computing the elastic-net solution which preserves the zero-sum constraint, based on the current subset $H_\kappa$ with $|H_\kappa| = h$, and constructing the next subset $H_{\kappa+1}$ from the observations corresponding to the $h$ smallest squared residuals. Let $H_\kappa$ denote a certain subsample derived at iteration $\kappa$ and let $\hat\beta_{H_\kappa}$ be the coefficients of the corresponding ZeroSum fit, see model (3). After computing the squared residuals $r_{i,\kappa}^2 = (y_i - z_i^T\hat\beta_{H_\kappa})^2$, for $i = 1,\dots,n$, the subsample $H_{\kappa+1}$ for iteration $\kappa+1$ is defined as the set of indices corresponding to the $h$ smallest squared residuals at the previous iteration $\kappa$. Let $\hat\beta_{H_{\kappa+1}}$ denote the coefficients of the ZeroSum fit based on the subset $H_{\kappa+1}$. It is straightforward to derive that
$$R(H_{\kappa+1}, \hat\beta_{H_{\kappa+1}}) \le R(H_{\kappa+1}, \hat\beta_{H_\kappa}) \le R(H_\kappa, \hat\beta_{H_\kappa}).$$

We can see that the C-steps result in a decrease of the objective function, and that the algorithm converges to a local optimum in a finite number of steps. In order to increase the chance of approximating the global optimum, a large number of random initial subsets $H_0$ of size $h$ should be used as starting points for the sequences of C-steps. Each initial subset $H_0$ is obtained through a search with elemental subsets of size 3.

For a fixed combination of the tuning parameters λ0 and α[0,1], the implemented algorithm, which is similar to the fast-LTS, is as follows:

  1. Draw $s = 500$ random initial elemental subsamples $H_{\text{sel}}$ of size 3 and let $\hat\beta_{H_{\text{sel}}}$ be the corresponding estimated coefficients.

  2. For all $s$ subsets, compute the squared residuals for all $n$ observations, $r_{i,s}^2 = (y_i - z_i^T\hat\beta_{H_{\text{sel}}})^2$ for $i = 1,\dots,n$, and consider the indices of the $h$ smallest of them, $\{r_{(1),s}^2,\dots,r_{(h),s}^2\}$, as starting points to compute only two C-steps.

  3. Retain only the $s_1 = 10$ subsets of size $h$ with the smallest objective function (7) and for each subsample perform C-steps until convergence. The resulting best subset corresponds to the one with the smallest value of the objective function.

The choice of the parameters for the algorithm has been discussed in the literature (Alfons et al., 2013; Kurnaz et al., 2018; Rousseeuw and Van Driessen, 2006). For example, a large number $s$ increases the likelihood of approximating the global optimum, and a small number $s_1$ decreases the computation time.
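As a schematic illustration of the C-step loop in steps 2 and 3, the sketch below iterates subsets until they stabilize. The helper fit_zerosum_en() is only a crude stand-in (a ridge-type fit whose coefficients are projected onto the zero-sum hyperplane) so that the sketch runs; it is not the constrained elastic-net solver used by RobZS, and all names are illustrative.

  # Schematic sketch of iterated C-steps; fit_zerosum_en() is a crude placeholder.
  fit_zerosum_en <- function(Z, y, lambda, alpha) {
    p <- ncol(Z)
    beta <- solve(crossprod(Z) + lambda * diag(p), crossprod(Z, y))  # ridge-type fit (placeholder)
    as.vector(beta - mean(beta))                                     # project onto sum(beta) = 0
  }
  robzs_c_steps <- function(H0, Z, y, h, lambda, alpha, max_iter = 100) {
    H <- sort(H0)
    for (k in seq_len(max_iter)) {
      beta  <- fit_zerosum_en(Z[H, , drop = FALSE], y[H], lambda, alpha)
      res2  <- as.vector((y - Z %*% beta)^2)       # squared residuals on all n observations
      H_new <- sort(order(res2)[seq_len(h)])       # next subset: the h smallest residuals
      if (identical(H_new, H)) break               # subset unchanged: local optimum reached
      H <- H_new
    }
    list(subset = H, beta = beta)
  }
  set.seed(3)
  Z <- matrix(rnorm(60 * 10), 60, 10); y <- rnorm(60)
  out <- robzs_c_steps(sample(60, 45), Z, y, h = 45, lambda = 0.5, alpha = 1)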

To reduce the computational cost of this 3-step sequential algorithm, which ideally should be computed for each possible combination of the tuning parameters, we considered a ‘warm-start’ strategy (Friedman et al., 2010). The idea is that for a particular combination of α and λ, the resulting best h-size subset from step 3 might also be an appropriate subset for a combination in the neighborhood of this α and/or λ, and thus step 1 can be omitted.

To select the optimal combination $(\alpha_{\text{opt}}, \lambda_{\text{opt}})$ of the tuning parameters $\alpha \in [0,1]$ and $\lambda \in [\epsilon\cdot\lambda_{\text{Max}}, \lambda_{\text{Max}}]$, with small $\epsilon > 0$, leading to the optimal subset $H_{\text{opt}}$, a repeated K-fold CV procedure (Hastie et al., 2001) is applied on these best h-size subsets over the two-dimensional grid of tuning parameters. (Details are reported in Supplementary Section S1 of Supplementary Material.)

Furthermore, we apply a reweighting step, which downweights outliers detected by the solution $\hat\beta_{\text{opt}}$, to increase the efficiency of the proposed estimator. We consider as outliers those observations with standardized residuals larger than a certain quantile of the standard normal distribution. Since the RobZS estimator is biased due to regularization, it is necessary to center the residuals. Denote by $r_i^s$ the standardized residuals, where the residual scale is derived from the $h$ observations in the final subset. Then the binary weights are defined by
$$w_i = \begin{cases} 1 & \text{if } |r_i^s| \le \Phi^{-1}(1-\delta) \\ 0 & \text{if } |r_i^s| > \Phi^{-1}(1-\delta) \end{cases}, \qquad i = 1,\dots,n,$$
(9)
where Φ is the cumulative distribution function of the standard normal distribution. The typical choice for δ is 0.0125, so that 2.5% of the observations are expected to be flagged as outliers in the normal model.
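A minimal sketch of the reweighting step (9) in R is given below; res and scale_h are illustrative stand-ins for the centered raw residuals and the residual scale estimated from the h-subset.

  # Minimal sketch of the binary weights in (9).
  set.seed(4)
  res     <- rnorm(100); res[1:5] <- res[1:5] + 6       # a few artificial vertical outliers
  scale_h <- sd(res[6:100])                             # stand-in for the scale from the h-subset
  delta   <- 0.0125
  r_std   <- (res - median(res)) / scale_h              # centered and standardized residuals
  w       <- as.integer(abs(r_std) <= qnorm(1 - delta)) # weight 0 flags an outlier
  table(w)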
Finally, the RobZS estimator is defined as
$$\hat\beta_{\text{RobZS}} = \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p} \left( \sum_{i=1}^n w_i (y_i - z_i^T\beta)^2 + n_w \tilde\lambda P_{\alpha_{\text{opt}}}(\beta) \right), \quad \text{s.t.} \quad \sum_{j=1}^p \beta_j = 0,$$
(10)
where $n_w = \sum_{i=1}^n w_i$ is the sum of the weights, $\alpha_{\text{opt}}$ is the optimal parameter obtained from the optimal subset $H_{\text{opt}}$, and the tuning parameter $\tilde\lambda$ is obtained by a 5-fold cross-validation (CV) procedure. Note that it is necessary to update the parameter $\lambda$ because typically $n_w > h$, as $h$ was based on an initial conservative guess of the amount of outliers in the data, and thus the penalty may act in a somewhat different way than for (8).

The algorithm to compute the RobZS estimator is implemented in the software environment R. The source code files are hosted in the GitHub repository of the first author (https://github.com/giannamonti/RobZS).

Estimation of an intercept. Simulation experiments have shown that data preprocessing by robustly centering the response and the covariates by the median, and by applying the model without intercept does not yield the best results. An improvement is possible by additionally centering and scaling (with arithmetic means and empirical standard deviations) the input data for the ZeroSum elastic-net estimator. This has been done in all steps of the previously outlined algorithm where this estimator is involved. An exception is the repeated K-fold CV procedure to determine the tuning parameters, where additional centering and scaling of folds would lead to biased results.

Given the optimal RobZS solution $\hat\beta_{\text{RobZS}}$ on the robustly centered data, we can recover the estimate of the intercept $\hat\beta_0$ by simply computing $\hat\beta_0 = \bar y - \sum_{j=1}^p \bar x_j \hat\beta_j^{\text{RobZS}}$, where $\bar y$ and $\{\bar x_j\}_{j=1}^p$ are the medians of the original response and covariates. This intercept is added to the intercept which results from classically centering and scaling the response and the explanatory variables to compute the final RobZS estimator in (10).
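A minimal sketch of this intercept recovery is given below; y, X and beta_robzs are illustrative placeholders for the original (uncentered) data and the RobZS slope estimates obtained on the robustly centered data.

  # Minimal sketch: recover the intercept from the medians of the original data.
  set.seed(5)
  X <- matrix(rexp(30 * 4), 30, 4); X <- X / rowSums(X)   # toy compositional covariates
  y <- rnorm(30)
  beta_robzs <- c(0.4, -0.4, 0.2, -0.2)                   # respects the zero-sum constraint
  beta0 <- median(y) - sum(apply(X, 2, median) * beta_robzs)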

Debiasing strategies. RobZS suffers from a bias due to the double penalization resulting from the elastic-net penalty. To overcome this shortcoming we suggest three debiasing strategies: a rescaled RobZS solution, following the approach of Zou and Hastie (2005), a relaxed RobZS (Meinshausen, 2007) and a hybrid RobZS. Let $\hat\beta_{\text{RobZS}}$ be the RobZS estimate of $\beta \in \mathbb{R}^p$. Given the pair of estimated tuning parameters $(\alpha_{\text{opt}}, \tilde\lambda)$, the rescaled RobZS solution is defined as
$$\hat\beta_{\text{RobZS}}^{\text{(rescaled)}} = \left( 1 + \frac{\tilde\lambda}{2}\,(1 - \alpha_{\text{opt}}) \right) \hat\beta_{\text{RobZS}}.$$
(11)

This simple rescaling mitigates the effect of shrinkage and leads to an estimator with less bias, at the price of more variance. A valid alternative to rescaling is the relaxed RobZS, which consists of a two-step procedure. First, RobZS is applied to identify the set of non-zero coefficients, say $\mathcal{A}_{\alpha_{\text{opt}},\tilde\lambda}$; then RobZS is performed again on the active predictors selected in the first step, $Z_{\mathcal{A}_{\alpha_{\text{opt}},\tilde\lambda}}$, fixing $\alpha = \alpha_{\text{opt}}$. The active set of predictors $\mathcal{A}_{\alpha_{\text{opt}},\tilde\lambda}$ presumably does not include 'noise' variables and collects variables that are effective competitors for being part of the model, thus the shrinkage in the second step is less marked. We also considered a hybrid RobZS solution, a two-step procedure where, in the first stage, RobZS is applied to perform variable selection, and in the second stage, RobZS with a Lasso penalty ($\alpha = 1$) is performed again on the predictors selected in the first stage to reduce the excessive number of false positives (see Tibshirani, 2011, Peter Bühlmann's comments). In all cases the intercept should be re-estimated. The effect of these debiasing strategies is reported in Supplementary Section S2.2 of Supplementary Material. All other results refer to the direct RobZS solution.
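As a small illustration of the first strategy, the one-line computation below follows Equation (11) as reconstructed above; beta_robzs, lambda_tilde and alpha_opt are illustrative stand-ins for the fitted quantities.

  # Minimal sketch of the rescaled RobZS solution (11).
  beta_robzs    <- c(0.4, -0.4, 0.2, -0.2)
  lambda_tilde  <- 0.1
  alpha_opt     <- 0.5
  beta_rescaled <- (1 + lambda_tilde / 2 * (1 - alpha_opt)) * beta_robzs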

3 Simulation

The aim of this section is to compare the performance of the RobZS estimator with competing estimators by means of a Monte Carlo study. We make a comparison with the Lasso (the regular least absolute shrinkage and selection operator) (Tibshirani, 1994), the ZeroSum (ZS) estimator (Altenbuchinger et al., 2017), the sparse LTS estimator (SLTS) of Alfons et al. (2013) and the robust (EN)LTS estimator (Kurnaz et al., 2018), denoted by RobL in the following. We also provide a comparison with the algorithm of Bates and Tibshirani (2019), here abbreviated by 'ZS (B&T)'. For their log-ratio Lasso estimator they propose a fast approximate algorithm, which is used here for comparison. Note that this algorithm does not return an optimized value of the tuning parameter $\lambda$, and thus we cannot report loss values. In order to compare with the Lasso solution, we have set the parameter $\alpha$ equal to 1 for the methods involving elastic-net penalties.

3.1 Sampling schemes

We generated the covariate data, corresponding to the relative bacterial abundances in a microbiome analysis, following Lin et al. (2014). We first generated an $n\times p$ data matrix $W = [w_{ij}]_{1\le i\le n;\, 1\le j\le p}$ from a multivariate normal distribution $N_p(\theta, \Sigma)$, and then obtained the design matrix $X = [x_{ij}]_{1\le i\le n;\, 1\le j\le p}$ by the transformation
$$x_{ij} = \exp(w_{ij}) \Big/ \sum_{k=1}^p \exp(w_{ik}),$$
so that each row is a random sample from a logistic normal distribution (Aitchison and Shen, 1980). The correlation structure of the predictors is defined by $\Sigma = [\Sigma_{ij}]_{1\le i,j\le p}$ with $\Sigma_{ij} = \rho^{|i-j|}$, where $\rho = 0.2$ or $0.5$, to consider different levels of correlation. To reflect the fact that the components of a composition in metagenomic data often differ by orders of magnitude, the components of $\theta = (\theta_1,\dots,\theta_p)^T$ are defined as $\theta_j = \log(0.5p)$ for $j = 1,\dots,5$, and $\theta_j = 0$ otherwise.

The two robust estimators are calculated taking the subset size $h = \lfloor 3(n+1)/4 \rfloor$ for an easy comparison. This means that $n/4$ is an initial guess of the maximal proportion of outliers in the data. For each replication, we choose the optimal tuning parameter $\lambda_{\text{opt}}$ as described in Supplementary Section S1 of Supplementary Material, with a repeated 5-fold CV procedure and a suitable sequence of 41 values between $\epsilon\cdot\lambda_{\text{Max}}$ (with small $\epsilon > 0$) and $\lambda_{\text{Max}}$, where $\lambda_{\text{Max}}$ is chosen in order to achieve full sparsity in the coefficient vector.

The values of the response were generated according to model (1), with coefficient vector $\beta = (\beta_j)_{1\le j\le p}$ with $\beta_1 = 1$, $\beta_2 = -0.8$, $\beta_3 = 0.6$, $\beta_6 = -1.5$, $\beta_7 = -0.5$, $\beta_8 = 1.2$ and $\beta_j = 0$ for $j \in \{1,\dots,p\}\setminus\{1,2,3,6,7,8\}$ (so that the coefficients sum to zero), and $\sigma = 0.5$, so that three of the six non-zero coefficients were among the five major components and the rest were among the minor components.
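A minimal R sketch of this data-generating process is given below (MASS provides mvrnorm); the parameter values follow the text, and all object names are illustrative.

  # Minimal sketch of the simulation design: logistic-normal compositions and response.
  library(MASS)
  set.seed(6)
  n <- 50; p <- 30; rho <- 0.2; sigma <- 0.5
  Sigma <- rho^abs(outer(1:p, 1:p, "-"))                 # Sigma_ij = rho^|i-j|
  theta <- c(rep(log(0.5 * p), 5), rep(0, p - 5))        # five 'major' components
  W <- mvrnorm(n, mu = theta, Sigma = Sigma)
  X <- exp(W) / rowSums(exp(W))                          # rows follow a logistic-normal distribution
  Z <- log(X)
  beta <- numeric(p)
  beta[c(1, 2, 3, 6, 7, 8)] <- c(1, -0.8, 0.6, -1.5, -0.5, 1.2)   # non-zero coefficients, summing to zero
  y <- as.vector(Z %*% beta + rnorm(n, sd = sigma))      # response according to model (1)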

Different sample size/dimension combinations $(n, p) = (50, 30)$, $(100, 200)$ and $(100, 1000)$ are considered, i.e. a low-dimensional setting ($n > p$), a moderately high-dimensional setting ($n < p$) and a high-dimensional setting ($p \gg n$), and the simulations are repeated 100 times for each setting.

For each of the three simulation settings we applied the following contamination schemes:

  • Scenario A. (Clean) No contamination.

  • Scenario B. (Vert) Vertical outliers: we add to the first $\gamma\%$ (with $\gamma = 10$ or 20) of the observations of the response variable a random error term coming from a normal distribution $N(10, 1)$.

  • Scenario C. (Both) Outliers in both the response and the predictors: this is a more extreme situation in which we consider vertical outliers but also leverage points. Vertical outliers are generated by adding to the first $\gamma\%$ (with $\gamma = 10$ or 20) of the observations of the response variable a random error term coming from a normal distribution $N(20, 1)$. To get leverage points we replace the first $\gamma\%$ (with $\gamma = 10$ or 20) of the observations of the block of informative variables by values coming from a $p$-dimensional logistic-normal distribution with mean vector $(50,\dots,50)^T$ and a correlation equal to 0.9 for each pair of variable components.

We do not consider a scenario with exclusively leverage points, as the contaminated design matrix $X$ is constructed to have row sums of 1; consequently, the effect of leverage points is by construction always limited.
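For illustration, the sketch below applies the vertical-outlier contamination of scenario (B) to a clean response vector; y is an illustrative stand-in for a response generated as in the data-generation sketch above.

  # Minimal sketch of scenario (B): shift the first gamma% of the responses by N(10, 1).
  set.seed(7)
  n <- 50
  y <- rnorm(n)                                  # stand-in for a clean response
  gamma <- 0.10
  n_out <- ceiling(gamma * n)
  y[seq_len(n_out)] <- y[seq_len(n_out)] + rnorm(n_out, mean = 10, sd = 1)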

We present here the simulation results for γ = 10%. The conclusions that can be drawn for γ = 20% follow the same tendency, and the related simulation results are reported in Supplementary Section S2.1 of Supplementary Material for the sake of completeness.

3.2 Performance measures

To evaluate the prediction performance of the proposed sparse method in comparison to the other models we use the prediction error (PE) and a trimmed prediction error. For this purpose, two independent test samples, a clean and a contaminated one, of size n for each contamination scheme were generated in each simulation run. The prediction error is computed as
$$\text{PE} = \frac{1}{n}\|y - Z\hat\beta\|_2^2,$$
(12)
where $y$ and $Z$ denote the response vector and the design matrix of the test set data, respectively, and $\hat\beta$ is the parameter estimate derived from the training data. The trimmed prediction error is the trimmed version of the measure defined in (12). In the simulations we used a trimming level of 0.1.
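The two accuracy measures can be computed as in the short sketch below; trimming is implemented here as averaging the smallest 90% of the squared test residuals, which is one natural reading of the trimmed PE, and all object names are illustrative.

  # Minimal sketch of the prediction error (12) and a 10% trimmed version.
  pe <- function(y_test, Z_test, beta_hat) {
    mean((y_test - Z_test %*% beta_hat)^2)
  }
  trimmed_pe <- function(y_test, Z_test, beta_hat, trim = 0.1) {
    sq_res <- sort(as.vector((y_test - Z_test %*% beta_hat)^2))
    mean(sq_res[seq_len(floor((1 - trim) * length(sq_res)))])   # drop the largest 10%
  }
  set.seed(8)
  Z_test <- matrix(rnorm(40 * 6), 40, 6); y_test <- rnorm(40); beta_hat <- rnorm(6)
  pe(y_test, Z_test, beta_hat); trimmed_pe(y_test, Z_test, beta_hat)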
Concerning sparsity, the estimated models are evaluated by the number of false positives (FP) and the number of false negatives (FN), defined as
$$\text{FP}(\hat\beta) = \left|\left\{ j \in \{1,\dots,p\} : \hat\beta_j \neq 0 \wedge \beta_j = 0 \right\}\right|, \qquad \text{FN}(\hat\beta) = \left|\left\{ j \in \{1,\dots,p\} : \hat\beta_j = 0 \wedge \beta_j \neq 0 \right\}\right|,$$
(13)
where positives and negatives refer to non-zero and zero coefficients, respectively.
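The counts in (13) translate directly into R, as in the small sketch below with illustrative true and estimated coefficient vectors.

  # Minimal sketch of the FP/FN counts in (13).
  beta_true <- c(1, -0.8, 0.6, 0, 0, -1.5, -0.5, 1.2, rep(0, 22))
  beta_hat  <- replace(beta_true, c(2, 10), c(0, 0.3))   # one missed variable, one spurious one
  fp <- sum(beta_hat != 0 & beta_true == 0)              # false positives
  fn <- sum(beta_hat == 0 & beta_true != 0)              # false negatives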

The estimation accuracy is assessed by the $\ell_q$ losses $\|\hat\beta - \beta\|_q$, with $q = 1, 2$ and $\infty$. The lower the values of these criteria, the better the models perform.

3.3 Simulation results

Tables 1-3 report averages ('mean') and standard deviations ('SD') of the performance measures defined in the previous section over all 100 simulation runs, for each method and for the different contamination schemes. The prediction error is computed on the clean test set, while the trimmed prediction error refers to a test set generated according to the same structure as the training set. Table 4 reports the comparison of the selection performance, FP and FN. The results presented refer to a parameter configuration with $\rho = 0.2$, a contamination level of 10%, and to the low- (Table 1), moderate- (Table 2) and high-dimensional data configurations (Table 3).

Table 1.

Means and standard deviations of various performance measures among different methods, based on 100 simulations

Scenario  Method      PE                PE (10%)          Loss ℓ1           Loss ℓ2           Loss ℓ∞
                      Mean     SD       Mean     SD       Mean     SD       Mean     SD       Mean     SD
(A)       Lasso       0.415    0.104    0.300    0.080    1.282    0.445    0.177    0.101    0.227    0.074
          ZS          0.393    0.107    0.283    0.080    1.182    0.427    0.145    0.078    0.200    0.059
          RobL        0.615    0.342    0.443    0.247    1.869    0.784    0.427    0.420    0.338    0.142
          RobZS       1.031    0.487    0.739    0.364    3.115    1.077    0.841    0.555    0.449    0.171
          SLTS        1.423    0.775    1.033    0.582    3.019    1.107    1.327    0.800    0.640    0.205
          ZS (B&T)    0.476    0.237    0.347    0.187    -        -        -        -        -        -
(B)       Lasso       5.212    2.192    5.850    2.433    5.725    1.987    4.190    1.765    1.143    0.254
          ZS          5.129    2.219    5.716    2.405    5.555    1.805    3.936    1.720    1.084    0.258
          RobL        1.156    0.954    1.318    1.104    2.623    1.429    1.020    1.095    0.506    0.276
          RobZS       0.760    0.327    0.861    0.380    2.383    0.813    0.525    0.290    0.377    0.114
          SLTS        0.932    0.475    1.057    0.551    2.257    0.827    0.759    0.476    0.497    0.153
          ZS (B&T)    5.699    2.416    6.175    2.453    -        -        -        -        -        -
(C)       Lasso       17.911   5.668    14.600   3.586    13.183   2.824    15.463   5.921    2.021    0.511
          ZS          17.115   6.136    14.300   3.889    12.818   2.833    14.435   5.325    1.941    0.392
          RobL        0.822    0.604    0.927    0.684    2.304    1.158    0.650    0.629    0.414    0.177
          RobZS       0.762    0.403    0.860    0.458    2.416    1.012    0.541    0.417    0.367    0.129
          SLTS        0.931    0.489    1.050    0.564    2.294    0.839    0.749    0.487    0.490    0.174
          ZS (B&T)    20.836   6.259    17.015   4.070    -        -        -        -        -        -

Note: Parameter configuration: (n, p) = (50, 30), ρ = 0.2. The best values (of 'mean') among the different methods are presented in bold.

PE, prediction error.

Table 2.

Means and standard deviations of various performance measures among different methods, based on 100 simulations

Scenario  Method      PE                PE (10%)          Loss ℓ1           Loss ℓ2           Loss ℓ∞
                      Mean     SD       Mean     SD       Mean     SD       Mean     SD       Mean     SD
(A)       Lasso       0.390    0.073    0.275    0.054    1.602    0.518    0.164    0.066    0.207    0.059
          ZS          0.380    0.072    0.268    0.053    1.561    0.601    0.147    0.064    0.185    0.051
          RobL        0.772    0.804    0.548    0.577    2.474    1.474    0.581    0.856    0.350    0.236
          RobZS       1.179    0.483    0.836    0.353    3.582    0.875    1.031    0.517    0.521    0.146
          SLTS        1.367    0.813    0.974    0.601    3.377    1.262    1.280    0.798    0.603    0.191
          ZS (B&T)    0.377    0.176    0.267    0.129    -        -        -        -        -        -
(B)       Lasso       4.549    1.637    4.999    1.660    6.468    2.662    3.609    1.197    1.000    0.195
          ZS          4.366    1.366    4.827    1.417    6.202    1.952    3.443    1.017    0.975    0.210
          RobL        1.514    0.957    1.680    1.044    3.700    1.671    1.395    1.017    0.600    0.253
          RobZS       0.771    0.383    0.865    0.437    2.709    0.958    0.575    0.405    0.388    0.135
          SLTS        0.790    0.398    0.885    0.451    2.380    0.853    0.609    0.416    0.417    0.130
          ZS (B&T)    4.736    1.863    5.191    1.907    -        -        -        -        -        -
(C)       Lasso       5.163    1.182    3.834    0.837    10.414   1.780    10.477   1.271    1.674    0.172
          ZS          8.821    1.727    6.519    1.168    10.935   1.865    7.405    1.165    1.339    0.197
          RobL        0.953    0.691    1.073    0.788    2.978    1.395    0.806    0.779    0.461    0.213
          RobZS       0.672    0.318    0.752    0.357    2.526    0.936    0.480    0.377    0.346    0.116
          SLTS        0.733    0.345    0.822    0.384    2.304    0.887    0.580    0.376    0.406    0.131
          ZS (B&T)    12.928   3.297    9.247    2.243    -        -        -        -        -        -

Note: Parameter configuration: (n, p) = (100, 200), ρ = 0.2. The best values (of 'mean') among the different methods are presented in bold.

PE, prediction error.

Table 3.

Means and standard deviations of various performance measures among different methods, based on 100 simulations

Scenario  Method      PE                PE (10%)          Loss ℓ1           Loss ℓ2           Loss ℓ∞
                      Mean     SD       Mean     SD       Mean     SD       Mean     SD       Mean     SD
(A)       Lasso       0.500    0.117    0.352    0.084    2.280    0.618    0.293    0.111    0.284    0.073
          ZS          0.480    0.103    0.339    0.074    2.150    0.603    0.259    0.101    0.261    0.067
          RobL        3.354    1.890    2.368    1.343    5.633    1.845    3.356    1.894    0.934    0.369
          RobZS       1.919    1.289    1.362    0.916    4.462    1.606    1.859    1.409    0.669    0.291
          SLTS        2.730    1.088    1.928    0.771    5.573    1.274    2.712    1.032    0.873    0.192
          ZS (B&T)    0.470    0.358    0.334    0.262    -        -        -        -        -        -
(B)       Lasso       5.481    1.190    5.946    1.172    7.537    2.912    4.709    1.068    1.133    0.185
          ZS          5.444    1.171    5.902    1.177    7.575    2.830    4.647    1.108    1.128    0.209
          RobL        3.157    1.157    3.544    1.333    6.037    1.371    3.287    1.240    0.946    0.233
          RobZS       1.513    1.013    1.701    1.143    3.999    1.408    1.434    1.156    0.591    0.252
          SLTS        1.966    0.998    2.214    1.141    4.875    1.486    1.950    1.079    0.726    0.223
          ZS (B&T)    5.827    2.178    6.337    2.287    -        -        -        -        -        -
(C)       Lasso       2.996    0.722    2.324    0.580    8.105    1.175    6.224    0.643    1.258    0.120
          ZS          6.356    1.105    4.655    0.900    9.787    1.461    5.477    0.862    1.135    0.166
          RobL        2.843    1.527    3.162    1.712    5.552    1.594    2.911    1.438    0.892    0.260
          RobZS       1.342    0.819    1.490    0.914    3.986    1.326    1.282    0.981    0.556    0.222
          SLTS        1.818    0.877    2.026    0.960    4.604    1.386    1.833    0.910    0.707    0.197
          ZS (B&T)    10.260   2.300    7.197    1.632    -        -        -        -        -        -

Note: Parameter configuration: (n, p) = (100, 1000), ρ = 0.2. The best values (of 'mean') among the different methods are presented in bold.

PE, prediction error.

Table 4.

Comparison of selective performance among different methods, scenarios and parameter configurations

(n, p) = (50, 30), FP
  Method      Scenario (A)        Scenario (B)        Scenario (C)
              Mean      SD        Mean      SD        Mean      SD
  Lasso       10.63     4.037     6.07      5.109     12.91     2.771
  ZS          10.37     4.012     6.19      4.896     12.35     2.672
  RobL        10.81     4.282     9.28      5.121     10.71     4.728
  RobZS       17.37     4.894     14.61     4.878     14.95     4.723
  SLTS        7.63      2.581     7.38      2.662     8.24      2.523
  ZS (B&T)    1.46      1.167     1.42      1.505     4.13      1.942

(n, p) = (50, 30), FN
  Method      Scenario (A)        Scenario (B)        Scenario (C)
              Mean      SD        Mean      SD        Mean      SD
  Lasso       0.00      0.000     2.84      1.680     2.20      1.326
  ZS          0.00      0.000     2.53      1.678     2.05      1.132
  RobL        0.06      0.278     0.54      1.105     0.15      0.458
  RobZS       0.10      0.414     0.04      0.197     0.02      0.141
  SLTS        0.53      0.674     0.25      0.479     0.16      0.420
  ZS (B&T)    0.05      0.219     3.47      1.226     3.66      0.987

(n, p) = (100, 200), FP
  Method      Scenario (A)        Scenario (B)        Scenario (C)
              Mean      SD        Mean      SD        Mean      SD
  Lasso       29.54     12.851    16.16     13.925    23.89     11.573
  ZS          29.60     13.999    16.19     12.054    30.37     8.684
  RobL        31.27     13.111    23.27     12.984    27.34     13.192
  RobZS       35.39     10.549    31.30     11.079    32.26     12.554
  SLTS        22.10     6.505     20.87     5.417     20.23     5.683
  ZS (B&T)    0.88      0.087     1.48      1.867     4.88      1.677

(n, p) = (100, 200), FN
  Method      Scenario (A)        Scenario (B)        Scenario (C)
              Mean      SD        Mean      SD        Mean      SD
  Lasso       0.00      0.000     2.67      1.303     0.51      0.577
  ZS          0.00      0.000     2.45      1.250     1.80      1.025
  RobL        0.05      0.219     0.65      0.957     0.12      0.433
  RobZS       0.13      0.418     0.05      0.261     0.04      0.243
  SLTS        0.37      0.525     0.14      0.472     0.09      0.321
  ZS (B&T)    0.01      0.100     3.66      1.066     3.51      1.068

(n, p) = (100, 1000), FP
  Method      Scenario (A)        Scenario (B)        Scenario (C)
              Mean      SD        Mean      SD        Mean      SD
  Lasso       47.28     18.593    21.28     21.592    28.29     14.163
  ZS          45.02     17.448    22.11     19.950    41.77     11.610
  RobL        37.65     20.594    38.09     23.701    34.99     19.350
  RobZS       42.10     17.210    39.91     16.076    43.74     14.912
  SLTS        40.52     6.850     40.49     7.770     36.40     8.452
  ZS (B&T)    1.11      1.230     1.77      1.958     4.78      1.962

(n, p) = (100, 1000), FN
  Method      Scenario (A)        Scenario (B)        Scenario (C)
              Mean      SD        Mean      SD        Mean      SD
  Lasso       0.00      0.000     3.91      1.065     0.91      0.818
  ZS          0.00      0.000     3.52      1.275     2.08      0.849
  RobL        2.33      1.798     2.28      1.371     1.90      1.453
  RobZS       0.90      1.193     0.59      1.045     0.42      0.768
  SLTS        1.76      1.026     0.94      0.993     1.04      0.777
  ZS (B&T)    0.05      0.261     4.11      1.024     3.91      0.911

FP, number of false positives; FN, number of false negatives. The best values (of 'mean') among the different methods are presented in bold.

In Scenario (A), with no outliers (Clean), ZS and ZS (B&T) show the best performance in terms of the mean prediction error. Of course, the Lasso estimator (and its robust version), as well as the SLTS estimator, only perform variable selection, but they do not fulfill the condition that the sum of the estimated regression coefficients should be zero, and thus miss the desirable properties of CODA analysis mentioned in Section 2; these results are reported here only for benchmarking purposes. The algorithm of Bates and Tibshirani (2019), ZS (B&T), slightly improves on the ZS prediction error results. A big difference is its excellent performance for the false positives (FP), but a (much) poorer performance for the false negatives (FN), see Table 4; the latter might be more important in applications.

All robust methods lose efficiency, which is reflected by a somewhat higher prediction error, and the gap to the non-robust estimators increases with the dimension. However, this gap is smaller for the mean 10% trimmed PE, which means that although no outliers have been generated, there are test set observations which clearly deviate from the data majority. All methods perform well in terms of the average number of false positives and false negatives. In high dimension, the robust methods produce a higher number of FN. The estimation accuracy in terms of the $\ell_q$ losses is quite comparable for all methods, but the values increase for the robust methods with increasing dimensionality.

The second scenario (B), with outliers in the response, or vertical outliers (Vert), shows quite different results: the Lasso and ZeroSum estimators are strongly influenced by the outliers. The prediction errors increase dramatically, and the same is true for the $\ell_q$ losses. The reason for this can be seen in the high number of FN (recall that 6 non-zero coefficients have been generated). The robust estimators achieve similar results as in the case of non-contaminated data. RobZS shows an excellent behavior, and it is the clear winner especially in the high-dimensional situation. Since the non-trimmed and the trimmed prediction errors are very similar for the robust estimators, they are able to correctly identify the model and thus the generated outliers. The variable selection performance of the proposed estimator is comparable to that of SLTS, but it tends to select fewer FN at the cost of slightly more FP. We note that the FN have a substantially stronger negative effect on the prediction error than the FP, as important variables are incorrectly ignored.

In the third scenario (C), with outliers in both the response and the predictors (Both), the RobZS estimator shows the best performance in terms of prediction error, especially in the high-dimensional setting. As observed before, RobZS leads to the smallest FN at the cost of a higher FP.

An interesting observation is that, although SLTS shows comparable performance to the RobZS estimator in several settings, it performs poorly in terms of the FN for the uncontaminated setting, particularly in lower dimension.

Overall, we observe a general decrease in prediction accuracy for the Lasso and the ZeroSum estimators in the presence of vertical outliers as well as with both vertical outliers and bad leverage points, underlining the need for robust methods. Moreover, in the contaminated scenarios, the standard deviations of the RobZS estimator for the various performance measures are among the smallest, suggesting stability of the estimation and of the prediction performance in all considered settings. These simulation results also emphasize that the RobZS estimator has an excellent prediction performance in contaminated scenarios.

Supplementary Section S2.1 of Supplementary Material reports the analogous results for $\rho = 0.5$ and 10% contamination, as well as results for a contamination level of 20%. These results follow the same tendency, and thus a change in the correlation structure of the explanatory variables or an increase of the amount of outliers has no essential impact on the overall conclusions. An exception is that for 20% contamination, SLTS is very competitive with RobZS in scenario (B) in lower dimension, but this advantage disappears in the high-dimensional setting.

3.4 Simulations with increasing proportion of zeros in the covariates

We compare the predictive accuracy of the ZS and RobZS estimators as a function of the proportion of zeros in the training and test data, because this setting is relevant in various real data applications. We first generate the matrix of count data from which, after normalization and logarithmic transformation, we compute the response vector $y$ according to the linear model. Then we replace a fixed proportion of the existing counts by 0, chosen uniformly at random, and subsequently the zeros are replaced by the value 0.5 before converting the data to compositional form, to allow for the logarithmic transformation. We use contamination setting (B) and increase the zero proportion in the covariates from 0.1 to 0.8 in steps of 0.1. Note that we only contaminated the training sample and not the test sample. Figure 1 shows the resulting prediction error averaged over 100 replications for each fixed proportion of zeros (solid lines). The dashed lines are the means plus/minus two times the standard errors from the replications. The red lines are for ZS, the blue lines for RobZS.

Fig. 1.

Prediction performance of the ZS (red) and RobZS (blue) estimators in scenario (B) by increasing the proportions of zeros in training and test data from 0.1 to 0.8 in steps of 0.1. Parameter configuration: (a) n=50,p=30,ρ=0.2, (b) n=100,p=200,ρ=0.2, (c) n=100,p=1000,ρ=0.2. Shown are means (solid lines) plus/minus two standard errors averaged over 100 replications for each fixed proportions of zeros

As expected, the performance of both estimators deteriorates approximately linearly as the proportion of zero counts in the covariates increases. However, RobZS shows the best overall behavior even when the proportion of zero counts in the covariates is very high, as ZS is strongly affected by the outliers.

3.5 Simulations with increasing outlier proportion

The simulations in this section investigate the behavior of the estimators ZS and RobZS for increasing levels of contamination. We use contamination setting (C) and increase the outlier proportion from zero to 0.5 in steps of 0.02. In each step, 50 replications are carried out, and the means plus/minus two standard errors of the results are presented in Figure 2. The red lines are for ZS, the blue lines for RobZS. The simulations are conducted for the parameters n = 50, p = 30 and $\rho = 0.2$. The results reveal that ZS deteriorates as the outlier proportion increases. In particular, FN quickly increases to a value of about 2, and thus 2 out of the 6 active variables are (on average) not identified. RobZS shows stable performance up to about 25% of contamination. This is explained by the trimming proportion of the procedure, which we set to 25% in all experiments. The evaluation with a 10% trimmed prediction error (upper right plot) is clearly not appropriate in a setting with high proportions of outliers. It is interesting that FN is very stable (up to about 20% contamination), and that FP decreases for an outlier proportion of up to 0.25. This means that the estimated regression parameters get sparser with higher contamination, and true noise variables are more accurately identified.

Fig. 2.

Performance of the ZS (red) and RobZS (blue) estimators by increasing the contamination level, using scenario (C). Here, n =50, p =30, ρ=0.2; shown are means (solid lines) plus/minus two standard errors derived from 50 simulation replications at each step

Supplementary Section S2.3 of Supplementary Material also contains simulation studies which investigate the effect of varying sparsity. A general conclusion from these results is that less sparsity of the model reduces the advantage of the robust method. In other words, RobZS performs much better than ZS in the presence of outliers and if the true underlying model is very sparse.

Finally, a further simulation study is presented in Supplementary Section S2.4 of Supplementary Material which focuses on the use of the elastic-net penalty (both Ridge and Lasso). The comparison of ZS and RobZS reveals that RobZS tends to select on average a much smaller value for the tuning parameter α than ZS, which comes closer to a ridge penalty, and thus contains more variables in the model. Consequently, FP is generally higher for RobZS compared to ZS, but FN is considerably lower, even in the uncontaminated case. As expected, the prediction error is much smaller for the robust method in a contaminated scenario.

4 An application to human gut microbiome data

We applied the proposed RobZS model to a cross-sectional study of the association between diet and gut microbiome composition (Wu et al., 2011). In this study, fecal samples from 98 healthy individuals were collected, and the microbiome dataset was produced by high-throughput sequencing of 16S rRNA, yielding 6674 OTUs, i.e. normalized counts of clustered sequences representing bacterial types. We aim to predict caffeine intake, the continuous outcome of interest, based on the OTU abundances (Jaquet et al., 2009; Xiao et al., 2018). The microbiome dataset had previously been preprocessed by Xiao et al. (2018) by removing rare OTUs with prevalence below 10%. Due to the high proportion of zero counts, we further retained only OTUs that appeared in at least 25 samples, resulting in a matrix of dimension n × p = 98 × 240. Additionally, we applied a quantile transformation to the caffeine intake to fulfill the underlying assumption of normality, as done in Xiao et al. (2018). Zero counts were replaced by the maximum rounding error of 0.5 to allow for the logarithmic transformation, which is common practice when analyzing microbiome data (Aitchison, 1986, Section 11.5). Note that more sophisticated zero-replacement methods exist in CODA analysis (Lubbe et al., 2021), but since this is not the focus of this paper, and because the proportion of zeros is quite high (49%), we stick to this simple replacement strategy.
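A minimal R sketch of these preprocessing steps is given below, assuming a count matrix otu (samples in rows) and a response vector caffeine as placeholder objects; the normal-score version of the quantile transformation shown here is one common implementation and not necessarily the exact one used by Xiao et al. (2018).

## Sketch of the preprocessing steps (otu and caffeine are placeholders).
keep <- colSums(otu > 0) >= 25               # retain OTUs present in at least 25 samples
otu  <- otu[, keep]                          # 98 x 240 count matrix
otu[otu == 0] <- 0.5                         # replace zero counts by the rounding error 0.5
X <- log(sweep(otu, 1, rowSums(otu), "/"))   # compositions, then log-transformed

## normal-score (quantile) transformation of the response
y <- qnorm(rank(caffeine) / (length(caffeine) + 1))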

For a fair investigation of the prediction performance of the compared sparse estimators, a 5-fold CV procedure was repeated 50 times, resulting in 250 fitted models for each sparse regression method; repeating the CV is a common strategy in machine learning to reduce the variability of the estimated mean model performance. Within each training set, parameter selection follows the scheme described in the simulation section. Prediction error and trimmed prediction error were used to assess the prediction accuracy of the different methods. Note that instead of a trimmed prediction error, one could also use other robust error measures if the outlier proportion is unknown, such as the robust τ scale estimator of Maronna and Zamar (2002).
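Schematically, the repeated CV evaluation can be written as in the following R sketch, where fit_method() is a placeholder for any of the compared estimators and trimming the 10% largest squared residuals gives the trimmed prediction error; this is an illustrative sketch, not code from the RobZS package.

## Sketch of repeated 5-fold CV with plain and 10% trimmed prediction errors.
trimmed_pe <- function(res, trim = 0.1) {
  r2 <- sort(res^2)
  mean(r2[seq_len(floor((1 - trim) * length(r2)))])  # drop the largest 10% squared residuals
}

pe <- replicate(50, {                                # 50 CV replications
  folds <- sample(rep(1:5, length.out = nrow(X)))    # random 5-fold partition
  sapply(1:5, function(k) {
    fit <- fit_method(X[folds != k, ], y[folds != k])        # placeholder fitting call
    res <- y[folds == k] - predict(fit, X[folds == k, ])     # residuals on the left-out fold
    c(PE = mean(res^2), tPE = trimmed_pe(res))
  })
})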

Figure 3 shows the boxplots of the CV PEs (left) and the 10% trimmed CV PEs (right) over all replications for the estimators Lasso, ZeroSum, RobL, RobZS and SLTS. The estimators RobZS and SLTS show somewhat smaller prediction errors, and SLTS yields the smallest 10% trimmed PEs, at the cost of a larger variability. This suggests that outliers indeed have an effect on the model estimation.

Fig. 3.

Analysis of gut microbiome data. Boxplots of CV PEs (left) and 10% trimmed CV PEs (right) over all replications for Lasso, ZeroSum, RobL, RobZS and SLTS

In order to investigate the impact of potential outliers in more detail, we apply an outlier analysis to the scaled residuals. For each model fit within the CV scheme, the scale of the residuals on the CV training data can be estimated. This is done with the classical standard deviation, but for the robust fits we only include residuals from observations with weight 1 in the reweighting step; see Equation (9). Thus, outliers according to this weighting scheme do not affect the estimation of the residual scale. The residuals from the left-out folds are then divided by this scale estimate, and the CV PEs now include only the observations whose scaled residuals lie within the interval [−2.5, 2.5]. The results are shown in the boxplots of the left panel of Figure 4, and in Figure 5. Figure 5 shows, for each model and over all CV replications, the mean of the scaled residuals for each observation. The residual scale was estimated from the model fit, and the scaled residuals are computed from the CV predictions. Since there are 50 CV replications, we can show the averages over 50 scaled residuals for each observation and for each estimator. The sorting of the observations on the horizontal axis is according to the RobZS mean. With the cutoff values ±2.5, shown as dashed lines, we see that SLTS identifies a huge number of outliers compared to the other estimators: more than 74% of the predicted values are flagged as anomalous, and their range is extremely large, suggesting that this model is inadequate for the data at hand. Consequently, the smaller CV PE without outliers for SLTS in the boxplot of Figure 4 (left) is not comparable to the others, as it is based on only few observations. RobZS, in contrast, identifies a more plausible number of outliers in the predictions, namely less than 30%.
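The residual-scaling step of this outlier screening can be summarized by the following R sketch, in which all object names are placeholders: res_train and w_train denote training residuals and reweighting weights of one CV fit, and res_test the residuals of the corresponding left-out fold.

## Sketch of the residual scaling and outlier screening for one CV fit.
sigma_hat  <- sd(res_train[w_train == 1])   # scale from observations with weight 1 only
res_scaled <- res_test / sigma_hat          # scaled residuals of the left-out fold
keep       <- abs(res_scaled) <= 2.5        # retain observations within [-2.5, 2.5]
pe_clean   <- mean(res_test[keep]^2)        # CV PE without the flagged outliers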

Fig. 4.

Analysis of gut microbiome data. Left panel: Boxplot of CV PEs over all replications by Lasso, ZeroSum, RobL and RobZS. Only the observations whose corresponding scaled residuals are within the interval [−2.5, 2.5] were considered. Right panel: Fitted versus measured values of the transformed caffeine intake. The green points correspond to observations detected as outliers by the RobZS estimator

Fig. 5.

Analysis of gut microbiome data. Mean of the scaled CV prediction residuals for each observation. The sorting of the observations on the x-axis is according to the mean for RobZS

We can see that the RobZS estimator achieves the best performance, outperforming the other methods by a large margin. This is due to the robustness and precision of the RobZS estimator, which allow for reliable outlier diagnostics for the predictions. This can also be seen in the right panel of Figure 4, where RobZS was applied to the complete dataset. The plot shows the fitted versus the measured response, and the color corresponds to the weights from Equation (9). The green points mark observations detected as outliers by the RobZS estimator, namely observations with binary weight wi = 0. Based on Figure 4 (right) we can also investigate which observations are potential (prediction) outliers.

The 250 models for each method, resulting from the described CV procedure, can be further analyzed with respect to variable selection. Figure 6 shows on the vertical axis the proportion of models containing at least the number of zeros in the regression parameter estimates indicated on the horizontal axis (241 variables in total). A substantial proportion of models leads to a fully sparse solution; for RobL we obtain full or almost full sparsity in about 60% of the models. In contrast, SLTS yields much less sparsity, and its models are also very similar in terms of sparsity. RobZS seems to be best tunable: it leads to high sparsity, while its proportion of fully sparse models is the lowest.
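The sparsity curve of Figure 6 can be computed, for one method, as in the following R sketch, where coef_list is a placeholder list containing the 250 estimated coefficient vectors of that method.

## Sketch of the sparsity curve: proportion of models with at least k zero coefficients.
n_zero  <- sapply(coef_list, function(b) sum(b == 0))
ks      <- 0:length(coef_list[[1]])
prop_ge <- sapply(ks, function(k) mean(n_zero >= k))
plot(ks, prop_ge, type = "s",
     xlab = "number of zero coefficients", ylab = "proportion of models")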

Fig. 6.

Analysis of gut microbiome data. Proportion of models (out of all 250 = 50 × 5) containing at least the number of zeros shown on the horizontal axis, over all CV replications, for Lasso, ZeroSum, RobL, RobZS and SLTS

Further investigations of the application results are reported in Supplementary Section S3 of the Supplementary Material.

5 Conclusions

We have proposed the robust ZeroSum (RobZS) regression estimator as a trimmed version of the ZeroSum (ZS) estimator for high-dimensional settings with compositional covariates. This model can be applied in microbiome analysis to identify bacterial taxa associated with a continuous response. As in Lasso or elastic-net regression, the estimated regression coefficient vector is typically sparse. Additionally, however, the non-zero coefficients sum up to zero, a constraint that is appropriate for linear log-contrast models as used in the context of CODA analysis. In other words, the estimator performs variable selection among compositional explanatory variables in an appropriate way and allows for an interpretation of the selected compositional parts.

The estimation procedure of the RobZS estimator is based on an analogue of the fast-LTS algorithm from robust regression. For the computation, a robust elastic-net regression procedure has been adapted and implemented. The simulation studies reveal that the RobZS estimator performs similarly to the non-robust ZS estimator if there are no outliers, but in case of contamination (vertical outliers, leverage points) the robust version leads to a clear advantage in terms of prediction error, precision of the estimated regression coefficients and ability to correctly identify the relevant variables (partly at the cost of a slightly increased false positive rate). Also in comparison with other robust estimators such as sparse LTS (Alfons et al., 2013) or elastic-net LTS (Kurnaz et al., 2018), which do not incorporate the compositional aspect of the data, RobZS is superior according to the evaluation measures in almost all settings. Simulations with zeros in the data, with varying sparsity and with varying outlier proportions have further underlined the excellent performance of RobZS. The application to microbiome data has demonstrated that RobZS is capable of balancing the sparsity of the solution with good prediction accuracy. A further benefit is that outliers in the training data can be identified, and outlyingness can also be indicated for new data, i.e. values of the explanatory variables or the response which do not match the training data.

In future work, this model will be extended to the generalized linear model framework for high-dimensional compositional covariates.

Acknowledgements

The authors wish to thank the Editor, the Associate Editor and the reviewers for their valuable comments and suggestions. They gratefully acknowledge the DEMS Data Science Lab for supporting this work by providing computational resources.

Funding

This research was supported by local funds from University of Milano-Bicocca [FAR 2018 to G.S.M.].

Conflict of Interest: none declared.

References

Aitchison J. (1986) The Statistical Analysis of Compositional Data. Chapman & Hall, London.

Aitchison J., Bacon-Shone J. (1984) Log contrast models for experiments with mixtures. Biometrika, 71, 323–330.

Aitchison J., Shen S.M. (1980) Logistic-normal distributions: some properties and uses. Biometrika, 67, 261–272.

Alfons A. et al. (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann. Appl. Stat., 7, 226–248.

Altenbuchinger M. et al. (2017) Reference point insensitive molecular data analysis. Bioinformatics, 33, 219–226.

Bates S., Tibshirani R. (2019) Log-ratio lasso: scalable, sparse estimation for log-ratio models. Biometrics, 75, 613–624.

Filzmoser P. et al. (2018) Applied Compositional Data Analysis. With Worked Examples in R. Springer Series in Statistics, Springer, Cham, Switzerland.

Freue G.V.C. et al. (2019) Robust elastic net estimators for variable selection and identification of proteomic biomarkers. Ann. Appl. Stat., 13, 2065–2090.

Friedman J. et al. (2007) Pathwise coordinate optimization. Ann. Appl. Stat., 1, 302–332.

Friedman J. et al. (2010) Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33, 1–22.

Gloor G.B. et al. (2016) It’s all relative: analyzing microbiome data as compositions. Ann. Epidemiol., 26, 322–329.

Hastie T. et al. (2001) The Elements of Statistical Learning. Springer Series in Statistics, Springer New York Inc., Berlin.

Huber P., Ronchetti E. (2009) Robust Statistics, 2nd edn. Wiley, New York.

Jaquet M. et al. (2009) Impact of coffee consumption on the gut microbiota: a human volunteer study. Int. J. Food Microbiol., 130, 117–121.

Kurnaz F.S. et al. (2018) Robust and sparse estimation methods for high-dimensional linear and logistic regression. Chemometr. Intell. Lab., 172, 211–222.

Li H. (2015) Microbiome, metagenomics, and high-dimensional compositional data analysis. Annu. Rev. Stat. Appl., 2, 73–94.

Lin W. et al. (2014) Variable selection in regression with compositional covariates. Biometrika, 101, 785–797.

Lubbe S. et al. (2021) Comparison of zero replacement strategies for compositional data with large numbers of zeros. Chemometr. Intell. Lab., 210, 104248.

Maronna R., Zamar R. (2002) Robust estimates of location and dispersion for high-dimensional datasets. Technometrics, 44, 307–317.

Maronna R. et al. (2006) Robust Statistics. John Wiley & Sons, Ltd., Hoboken, NJ.

Maronna R.A. (2011) Robust ridge regression for high-dimensional data. Technometrics, 53, 44–53.

Meinshausen N. (2007) Relaxed lasso. Comput. Stat. Data Anal., 52, 374–393.

Quinn T.P. et al. (2018) Understanding sequencing data as compositions: an outlook and review. Bioinformatics, 34, 2870–2878.

Rousseeuw P.J. (1984) Least median of squares regression. J. Am. Stat. Assoc., 79, 871–880.

Rousseeuw P.J., Van Driessen K. (2006) Computing LTS regression for large data sets. Data Min. Knowl. Disc., 12, 29–45.

Shi P. et al. (2016) Regression analysis for microbiome compositional data. Ann. Appl. Stat., 10, 1019–1040.

Smucler E., Yohai V.J. (2017) Robust and sparse estimators for linear regression models. Comput. Stat. Data Anal., 111, 116–130.

Tibshirani R. (1994) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58, 267–288.

Tibshirani R. (2011) Regression shrinkage and selection via the lasso: a retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol., 73, 273–282.

Wu G.D. et al. (2011) Linking long-term dietary patterns with gut microbial enterotypes. Science, 334, 105–108.

Xiao J. et al. (2018) A phylogeny-regularized sparse regression model for predictive modeling of microbial community data. Front. Microbiol., 9, 3112.

Zou H., Hastie T. (2005) Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol., 67, 301–320.
