Xuming He, Kean Ming Tan, Wen-Xin Zhou, Robust estimation and inference for expected shortfall regression with many regressors, Journal of the Royal Statistical Society Series B: Statistical Methodology, Volume 85, Issue 4, September 2023, Pages 1223–1246, https://doi.org/10.1093/jrsssb/qkad063
Abstract
Expected shortfall (ES), also known as superquantile or conditional value-at-risk, is an important measure in risk analysis and stochastic optimisation and has applications beyond these fields. In finance, it refers to the conditional expected return of an asset given that the return is below some quantile of its distribution. In this paper, we consider a joint regression framework recently proposed to model the quantile and ES of a response variable simultaneously, given a set of covariates. The current state-of-the-art approach to this problem involves minimising a non-differentiable and non-convex joint loss function, which poses numerical challenges and limits its applicability to large-scale data. Motivated by the idea of using Neyman-orthogonal scores to reduce sensitivity to nuisance parameters, we propose a statistically robust and computationally efficient two-step procedure for fitting joint quantile and ES regression models that can handle highly skewed and heavy-tailed data. We establish explicit non-asymptotic bounds on estimation and Gaussian approximation errors that lay the foundation for statistical inference, even with increasing covariate dimensions. Finally, through numerical experiments and two data applications, we demonstrate that our approach well balances robustness, statistical, and numerical efficiencies for expected shortfall regression.
1 Introduction
Expected shortfall (ES), also known as superquantile or conditional value-at-risk (VaR), has been recognised as an important risk measure with versatile applications in finance (Acerbi & Tasche, 2002; Rockafellar & Uryasev, 2002), management science (Ben-Tal & Teboulle, 1986; Du & Escanciano, 2017), operations research (Rockafellar et al., 2014; Rockafellar & Uryasev, 2000), and clinical studies (He et al., 2010). For example, in finance, expected shortfall refers to the expected return of an asset or investment portfolio conditional on the return being below a lower quantile of its distribution, namely its VaR. In their Fundamental Review of the Trading Book (Basel Committee, 2016, 2019), the Basel Committee on Banking Supervision confirmed the replacement of VaR with ES as the standard risk measure in banking and insurance.
Let $Y$ be a real-valued random variable with finite first-order absolute moment, $\mathbb{E}|Y| < \infty$, and let $F_Y$ be its cumulative distribution function (CDF). For any $\alpha \in (0, 1)$, the quantile and ES at level $\alpha$ are defined as
$$Q_\alpha(Y) = \inf\{y \in \mathbb{R} : F_Y(y) \ge \alpha\} \quad \text{and} \quad \mathrm{ES}_\alpha(Y) = \frac{1}{\alpha}\int_0^\alpha Q_t(Y)\,\mathrm{d}t, \qquad (1)$$
respectively. If $F_Y$ is continuous, the $\alpha$-level ES is equivalently given by (see, e.g. Lemma 2.16 of McNeil et al., 2015)
$$\mathrm{ES}_\alpha(Y) = \mathbb{E}\{Y \mid Y \le Q_\alpha(Y)\}. \qquad (2)$$
For instance, in socio-economic applications, $Y$ is the income and $\mathrm{ES}_\alpha(Y)$ can be interpreted as the average income for the sub-population whose income falls below the $\alpha$-quantile of the entire population. We refer the reader to Chapter 6 of Shapiro et al. (2014) and Rockafellar and Royset (2014) for a thorough discussion of ES and its mathematical properties.
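To make the definitions concrete, here is a minimal numpy sketch of the empirical lower-tail quantile and ES of a sample; the distribution, seed, and sample size are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_t(df=3, size=100_000)   # heavy-tailed sample (illustrative)

alpha = 0.05
q_hat = np.quantile(y, alpha)            # empirical alpha-level quantile
es_hat = y[y <= q_hat].mean()            # empirical alpha-level ES (lower tail)
print(q_hat, es_hat)                     # the ES lies below the quantile
```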
With the increasing focus on ES and its desired properties as a risk measure, it is natural to examine the impact of a $p$-dimensional explanatory vector $X$ on the tail behaviour of $Y$ through ES. One motivating example is the Job Training Partnership Act (JTPA), a large publicly funded training programme that provides training for adults with barriers to employment and out-of-school youths. The goal is to examine whether the training programme improves future income for adults with low income earnings (Bloom et al., 1997), for which quantile regression (QR)-based approaches have been proposed (Abadie et al., 2002; Chernozhukov & Hansen, 2008). For example, the 0.05-quantile of the post-programme income is the highest income among those with the 5% lowest incomes in the entire population, while the 0.05-ES concerns the average income within this sub-population and is therefore more scientifically relevant in the JTPA study.
Compared to the substantial body of literature on QR, extant works on ES estimation and inference in the presence of covariates are scarce. We refer the reader to Scaillet (2004), Cai and Wang (2008), Kato (2012), Linton and Xiao (2013), and Martins-Filho et al. (2018) for non-parametric conditional ES estimation, and more recently to Dimitriadis and Bayer (2019), Patton et al. (2019), and Barendse (2020) in the context of (semi-)parametric models. As suggested in Patton et al. (2019), this is partly because regulatory interest in ES as a risk measure is only recent, and also due to the fact that this measure is not elicitable (Gneiting, 2011). Let $\mathcal{F}$ be a class of distributions on $\mathbb{R}$. We say that a statistical functional $T : \mathcal{F} \to \mathbb{R}^k$ ($k \ge 1$) is elicitable relative to the class $\mathcal{F}$ if there exists a loss function $L : \mathbb{R}^k \times \mathbb{R} \to \mathbb{R}$ such that $T(F) = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^k} \mathbb{E}_{Z\sim F}\,L(\theta, Z)$ for any $F \in \mathcal{F}$. Here, $\mathbb{E}_{Z\sim F}$ means that the expectation is taken with respect to the random variable $Z$ that follows the distribution $F$. For example, the mean is elicitable using the squared loss $L(\theta, z) = (\theta - z)^2$, and the median is elicitable using the absolute loss $L(\theta, z) = |\theta - z|$. Although the ES is not elicitable on its own, it is jointly elicitable with the quantile using a class of strictly consistent joint loss functions (Fissler & Ziegel, 2016). Based on this result, Dimitriadis and Bayer (2019) and Patton et al. (2019) proposed a joint regression model for the conditional α-level quantile and ES of $Y$, given the covariates $X$. In this work, we focus on (conditional) linear joint quantile-ES models:
$$Q_\alpha(Y \mid X) = X^\top\beta^*(\alpha) \quad \text{and} \quad \mathrm{ES}_\alpha(Y \mid X) = X^\top\theta^*(\alpha). \qquad (3)$$
Equivalently, we have $Y = X^\top\beta^*(\alpha) + \varepsilon$ and $Y = X^\top\theta^*(\alpha) + e$, where the conditional $\alpha$-quantile of $\varepsilon$ and the conditional $\alpha$-level expected shortfall of $e$, given $X$, are zero. More generally, one may allow the quantile and the ES models to depend on different covariate vectors $X_{\mathrm{q}}$ and $X_{\mathrm{e}}$, respectively. In this case, the conditional $\alpha$-quantile of $\varepsilon$ and the conditional $\alpha$-ES of $e$, given $(X_{\mathrm{q}}, X_{\mathrm{e}})$, are assumed to be zero.
To jointly estimate $\beta^*$ and $\theta^*$, Dimitriadis and Bayer (2019) and Patton et al. (2019) considered an M-estimator, defined as the global minimum of any member of a class of strictly consistent joint loss functions over some compact parameter set (Fissler & Ziegel, 2016). The joint loss function, which will be specified in equation (5), is non-differentiable and non-convex. Dimitriadis and Bayer (2019) employed the derivative-free Nelder–Mead algorithm to minimise the resulting non-convex loss, which is a heuristic search method that may converge to non-stationary points. From a statistical perspective, they further established consistency and asymptotic normality for the global minima. However, from a computational perspective, finding the global minimum of a non-convex function is generally intractable: approximating the global minimum of a $k$-times continuously differentiable function to $\epsilon$-accuracy requires at least of the order $(1/\epsilon)^{p/k}$ evaluations (ignoring problem-dependent constants) of the function and its first $k$ derivatives (Nemirovski & Yudin, 1983). The lack of differentiability makes this problem even more challenging numerically. To mitigate the computational effort, Barendse (2020) proposed a two-step procedure by first estimating the quantile parameters via standard QR, followed by least squares regression with generated response variables. Although computationally efficient, the ensuing estimator is sensitive to heavy-tailed error distributions due to the use of the least squares loss for fitting possibly highly skewed data in the second step; see Section 3.1 for a rigorous statement.
In this paper, we propose a new two-stage method for robust estimation and inference under a joint quantile and expected shortfall regression model (3), with a particular focus on the latter. Compared to existing approaches, our proposed method is robust against heavy-tailed errors without compromising statistical efficiency under light-tailed distributions. Computationally, our method can be implemented via fast and scalable gradient-based algorithms. The main contributions of this work are summarised as follows:
Our method builds upon a recent approach to joint quantile and expected shortfall regression via a two-step procedure (Barendse, 2020). However, a general non-asymptotic theory for this approach has yet to be established. To fill this gap, we establish a finite-sample theoretical framework for the two-step ES estimator when the dimension of the model, p, increases with the number of observations, n. Specifically, we provide explicit upper bounds, as a function of $(n, p)$, on the estimation error (under $\ell_2$-risk) and (uniform) Gaussian approximation errors; see Online Supplementary Section A. We also construct asymptotically valid (entrywise) confidence intervals for the ES parameters. The main computational effort of this two-step procedure is the QR fit in stage one. Therefore, we recommend using the convolution-smoothed QR method (Fernandes et al., 2021), which can be solved using fast first-order algorithms that are scalable to very large-scale problems (He et al., 2023). Our non-asymptotic theory allows the dimension p to grow with the sample size, which paves the way for analysing series/projection estimators under joint non-parametric quantile-ES models (Belloni et al., 2019) and penalised estimators under high-dimensional sparse models.
The standard two-step estimator is a least squares estimator (LSE) with generated response variables. As a result, it is sensitive to the tails of the distribution of Y. We propose a robust ES regression method that applies adaptive Huber regression (Zhou et al., 2018) in the second step to address this issue. The resulting estimator achieves sub-Gaussian deviation bounds even when the (conditional) distribution of $Y$ only has Pareto-like tails. To achieve a trade-off between bias and robustness, we propose using a diverging robustification parameter $\tau = \tau_n \to \infty$. In practice, we choose this hyper-parameter using a data-driven mechanism (L. Wang et al., 2021), guided by the non-asymptotic results presented in Section 4 and inspired by the censored equation approach introduced in Hahn et al. (1990). We have also developed efficient algorithms to compute standard and robust two-step ES estimators under additional constraints. These constraints ensure that the fitted ES does not exceed the fitted quantile at each observation. We refer the reader to the Online Supplementary Section D for more details.
We conduct thorough numerical comparisons between the two-step estimator and the proposed robust variant with the joint M-estimator of Dimitriadis and Bayer (2019) on large synthetic data sets generated from a location-scale model, with both light- and heavy-tailed error distributions. To compute the joint M-estimator, we use the R package esreg, which is available at https://cran.r-project.org/package=esreg. To implement the proposed robust two-step procedure, we use a combination of R packages, quantreg or conquer and adaHuber. Our results show that the proposed robust ES regression approach achieves satisfactory statistical performance, a higher degree of robustness against heavy-tailed error distributions, and superior computational efficiency and stability. We also demonstrate the effectiveness of our approach through numerical experiments and two real data examples.
In this work, the term ‘robustness’ specifically refers to the robustness against heavy-tailed distributions, as revealed by non-asymptotic deviation analysis dating back to Catoni (2012). In Catoni (2012)’s study of univariate mean estimation, it was found that while the sample mean has the optimal minimax mean squared error among all mean estimators, its deviation is worse for non-Gaussian samples than for Gaussian ones. Moreover, the worst-case deviation is sub-optimal when the sampling distribution has heavy tails. To be more specific, let $X_1, \ldots, X_n$ be independent copies of $X$ with mean $\mu$ and variance $\sigma^2$. Applying Chebyshev’s inequality to the empirical mean $\bar{X}_n = (1/n)\sum_{i=1}^n X_i$ yields a polynomial-type deviation bound: $|\bar{X}_n - \mu| \le \sigma/\sqrt{n\delta}$ holds with probability at least $1 - \delta$ for any $\delta \in (0, 1)$. Furthermore, if $X$ is sub-Gaussian, meaning that $\mathbb{E}\,e^{\lambda(X-\mu)} \le e^{\lambda^2\sigma^2/2}$ for all $\lambda \in \mathbb{R}$, then $\bar{X}_n$ can be referred to as a sub-Gaussian estimator, as it satisfies $|\bar{X}_n - \mu| \le \sigma\sqrt{2\log(2/\delta)/n}$ with probability at least $1 - \delta$. In order to obtain sub-Gaussian deviations under a condition of bounded second moments, Fan et al. (2017) considered the Huber mean estimator $\hat{\mu}_\tau = \operatorname*{arg\,min}_{\theta\in\mathbb{R}} \sum_{i=1}^n \ell_\tau(X_i - \theta)$, which is closely related to the method proposed by Catoni (2012). Here $\ell_\tau$ denotes the Huber loss; see definition (18). Theorem 5 in Fan et al. (2017) establishes that for any $\delta \in (0, 1)$, $\hat{\mu}_\tau$ with $\tau \asymp \sigma\sqrt{n/\log(1/\delta)}$ satisfies $|\hat{\mu}_\tau - \mu| \lesssim \sigma\sqrt{\log(1/\delta)/n}$ with probability at least $1 - 2\delta$ as long as $n \gtrsim \log(1/\delta)$. While Fan et al. (2017) does not explicitly state this, the divergence of τ in this context is also intended to strike a balance between bias and robustness. In comparison to (univariate) mean estimation, the problem of regression with growing dimensions and generated response variables presents new technical challenges and requires a more nuanced analysis. Nevertheless, the underlying phenomenon is quite similar.
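The following numpy sketch illustrates the Huber mean estimator with a diverging robustification parameter; the fixed-point solver, the data-generating choices, and the plug-in use of the sample standard deviation are assumptions made for illustration, not the exact procedure of Fan et al. (2017).

```python
import numpy as np

def huber_mean(x, tau, n_iter=100, tol=1e-10):
    """Huber M-estimator of the mean: solves sum_i psi_tau(x_i - mu) = 0."""
    mu = np.median(x)                            # robust starting value
    for _ in range(n_iter):
        step = np.clip(x - mu, -tau, tau).mean() # psi_tau applied to residuals
        mu += step
        if abs(step) < tol:
            break
    return mu

rng = np.random.default_rng(1)
x = rng.pareto(2.5, size=5_000) - 1 / 1.5        # heavy-tailed, mean zero
n, delta = x.size, 0.01
tau = x.std() * np.sqrt(n / np.log(1 / delta))   # diverging tau, of the order above
print(x.mean(), huber_mean(x, tau))
```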
1.1 Notation
For any two vectors $u = (u_1, \ldots, u_p)^\top$ and $v = (v_1, \ldots, v_p)^\top$, we define their inner product as $\langle u, v \rangle = u^\top v = \sum_{i=1}^p u_i v_i$. We use $\|\cdot\|_q$ to denote the $\ell_q$-norm in $\mathbb{R}^p$: $\|u\|_q = (\sum_{i=1}^p |u_i|^q)^{1/q}$ for $q \ge 1$ and $\|u\|_\infty = \max_{1 \le i \le p} |u_i|$. Let $\mathbb{S}^{p-1} = \{u \in \mathbb{R}^p : \|u\|_2 = 1\}$ be the unit sphere in $\mathbb{R}^p$ under the $\ell_2$-norm. Given a positive semi-definite matrix $A \in \mathbb{R}^{p \times p}$ and $u \in \mathbb{R}^p$, let $\|u\|_A = \|A^{1/2}u\|_2$. Given an event/subset $\mathcal{A}$, $\mathbb{1}(\mathcal{A})$ or $\mathbb{1}_{\mathcal{A}}$ denotes the zero-one indicator function for $\mathcal{A}$. For two real numbers a and b, we write $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. For two sequences $\{a_n\}$ and $\{b_n\}$ of non-negative numbers, we write $a_n \lesssim b_n$ if $a_n \le C b_n$ for some constant $C > 0$ independent of n, $a_n \gtrsim b_n$ if $b_n \lesssim a_n$, and $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$.
2 Preliminaries and background
2.1 The joint regression framework
Assume we observe a sequence of data vectors $\{(Y_i, X_i)\}_{i=1}^n$, where $Y_i \in \mathbb{R}$ is the response variable, and $X_i \in \mathbb{R}^p$ is a p-dimensional vector of explanatory variables (covariates). For some fixed probability level $\alpha \in (0, 1)$, denote the conditional α-level quantile and ES of $Y_i$ given the covariates $X_i$ as $Q_\alpha(Y_i \mid X_i)$ and $\mathrm{ES}_\alpha(Y_i \mid X_i)$, respectively. For the latter, we adhere to the definition $\mathrm{ES}_\alpha(Y_i \mid X_i) = \mathbb{E}\{Y_i \mid Y_i \le Q_\alpha(Y_i \mid X_i), X_i\}$.
We consider the joint regression framework introduced in Dimitriadis and Bayer (2019) for modelling the conditional quantile and expected shortfall. For some probability level $\alpha \in (0, 1)$, assume that
$$Q_\alpha(Y \mid X) = X^\top\beta^* \quad \text{and} \quad \mathrm{ES}_\alpha(Y \mid X) = X^\top\theta^*, \qquad (4)$$
where $\beta^* = \beta^*(\alpha) \in \mathbb{R}^p$ and $\theta^* = \theta^*(\alpha) \in \mathbb{R}^p$ are the unknown true underlying parameters for quantile and ES, respectively. Fissler and Ziegel (2016) showed that the quantile and ES are jointly elicitable and proposed a class of strictly consistent joint loss functions for quantile and ES estimation. Let $G_1$ be an increasing and integrable function, and let $\mathcal{G}_2$ be a three times continuously differentiable function such that both $\mathcal{G}_2$ and its derivative $G_2 = \mathcal{G}_2'$ are strictly positive. The proposed joint loss function in Fissler and Ziegel (2016) takes the form
$$S(q, e; Y) = (\mathbb{1}\{Y \le q\} - \alpha)\,G_1(q) - \mathbb{1}\{Y \le q\}\,G_1(Y) + G_2(e)\Big\{e - q + \frac{(q - Y)\,\mathbb{1}\{Y \le q\}}{\alpha}\Big\} - \mathcal{G}_2(e). \qquad (5)$$
This general form also includes the joint loss function proposed by Acerbi and Székely (2014), which corresponds to particular choices of $G_1$ and $\mathcal{G}_2$.
In the regression framework with a fixed number of covariates, Dimitriadis and Bayer (2019) established the consistency and asymptotic normality of the M-estimator $(\hat\beta, \hat\theta)$, defined as
$$(\hat\beta, \hat\theta) \in \operatorname*{arg\,min}_{(\beta, \theta) \in \Theta} \frac{1}{n}\sum_{i=1}^n S(X_i^\top\beta, X_i^\top\theta; Y_i), \qquad (6)$$
where $\Theta \subseteq \mathbb{R}^{2p}$ is the parameter space, assumed to be compact and convex with a non-empty interior. The main challenge of the aforementioned approach is that the objective function in equation (6) is non-differentiable and non-convex for any feasible choice of the functions $G_1$ and $\mathcal{G}_2$ (Fissler & Ziegel, 2016). Note from definition (1) that the expected shortfall depends on the quantile, not vice versa. The estimation and inference of $\theta^*$ is thus the main challenge. It is, however, infeasible to estimate a single regression model for ES through M-estimation, that is, by minimising some strictly consistent loss function (Dimitriadis & Bayer, 2019).
In the joint regression framework, if the main goal is to estimate and forecast ES, then $\beta^*$ can be naturally viewed as a nuisance parameter. Motivated by the idea of using Neyman-orthogonal scores to reduce sensitivity with respect to nuisance parameters (Chernozhukov et al., 2018; Neyman, 1979), Barendse (2020) proposed a two-stage procedure that bypasses non-convex optimisation problems. In the first stage, an estimate of $\beta^*$ is obtained via standard QR. The second step employs an orthogonal score with fitted thresholding quantiles to estimate $\theta^*$. The key observation is as follows. Define the score function
$$S_0(\beta, \theta; Y, X) = X\Big\{X^\top\theta - X^\top\beta + \frac{1}{\alpha}(X^\top\beta - Y)\,\mathbb{1}(Y \le X^\top\beta)\Big\}. \qquad (7)$$
Under model (4), we have $\mathbb{E}\{S_0(\beta^*, \theta^*; Y, X) \mid X\} = 0$ almost surely over X. Let $F_{Y\mid X}$ be the conditional distribution function of Y given X. Provided that $F_{Y\mid X}$ is continuously differentiable, taking the gradient with respect to β on both sides of the above equality yields
$$\nabla_\beta\,\mathbb{E}\{S_0(\beta, \theta^*; Y, X) \mid X\}\big|_{\beta = \beta^*} = XX^\top\Big\{\frac{1}{\alpha}F_{Y\mid X}(X^\top\beta^*) - 1\Big\} = 0 \quad \text{almost surely}.$$
We hence refer to the following property:
$$\nabla_\beta\,\mathbb{E}\{S_0(\beta, \theta^*; Y, X)\}\big|_{\beta = \beta^*} = 0 \qquad (8)$$
as Neyman orthogonality.
2.2 Two-step ES estimation via Neyman-orthogonal score
We start with a detailed overview of the two-step approach proposed by Barendse (2020) using the Neyman-orthogonal score (7) under the joint model (4). In Section 3.1, we will develop a non-asymptotic (finite-sample) theory for the two-step ES estimator, $\hat\theta$, under the regime in which p is allowed to increase with the sample size n. We further develop asymptotic normality results for individual coordinates, or more generally linear projections, of $\hat\theta$, in this increasing-dimension regime. Our non-asymptotic results and techniques pave the way for analysing high-dimensional sparse quantile-ES models.
The first step involves computing the standard QR estimator of $\beta^*$:
$$\hat\beta \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \rho_\alpha(Y_i - X_i^\top\beta), \qquad (9)$$
where $\rho_\alpha(u) = u\{\alpha - \mathbb{1}(u < 0)\}$ is the check function (Koenker & Bassett, 1978). The second step is motivated by the orthogonal score in equation (7). Specifically, let $\hat{L}_n(\beta, \theta) = (2n)^{-1}\sum_{i=1}^n \{Z_i(\beta) - X_i^\top\theta\}^2$ be the joint empirical loss with
$$Z_i(\beta) = X_i^\top\beta + \frac{1}{\alpha}(Y_i - X_i^\top\beta)\,\mathbb{1}(Y_i \le X_i^\top\beta). \qquad (10)$$
Given $\hat\beta$ obtained from the first step, the ES estimator of $\theta^*$ is computed as
$$\hat\theta \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^p} \hat{L}_n(\hat\beta, \theta). \qquad (11)$$
For any β fixed, the function $\theta \mapsto \hat{L}_n(\beta, \theta)$ is convex with gradient and Hessian given by
$$\nabla_\theta \hat{L}_n(\beta, \theta) = \frac{1}{n}\sum_{i=1}^n \{X_i^\top\theta - Z_i(\beta)\}X_i \quad \text{and} \quad \nabla_\theta^2 \hat{L}_n(\beta, \theta) = \frac{1}{n}\sum_{i=1}^n X_iX_i^\top,$$
respectively. By the first-order condition, the ES regression estimator $\hat\theta$ satisfies the moment condition $\sum_{i=1}^n \{Z_i(\hat\beta) - X_i^\top\hat\theta\}X_i = 0$, and has the closed-form expression
$$\hat\theta = \Big(\sum_{i=1}^n X_iX_i^\top\Big)^{-1} \sum_{i=1}^n Z_i(\hat\beta)\,X_i, \qquad (12)$$
provided that $\sum_{i=1}^n X_iX_i^\top$ is full-rank.
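A compact sketch of this two-step estimator is given below. It assumes statsmodels' QuantReg for the first-stage QR fit (any QR solver would do), assumes the design matrix contains an intercept column, and all function and variable names are illustrative.

```python
import numpy as np
import statsmodels.api as sm

def two_step_es(X, y, alpha):
    """Two-step ES regression: standard QR, then OLS on generated responses."""
    beta = np.asarray(sm.QuantReg(y, X).fit(q=alpha).params)   # step 1: QR (9)
    q_fit = X @ beta
    # Generated response Z_i = x_i'beta + (Y_i - x_i'beta) 1{Y_i <= x_i'beta} / alpha
    z = q_fit + np.minimum(y - q_fit, 0.0) / alpha
    theta, *_ = np.linalg.lstsq(X, z, rcond=None)              # step 2: closed form (12)
    return beta, theta
```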
When p is large, we suggest using the convolution-smoothed QR (conquer) estimator (Fernandes et al., 2021; He et al., 2023) in the first step, which can be computed by fast and scalable gradient-based algorithms. Given a smoothing parameter/bandwidth $h > 0$, the conquer estimator $\hat\beta_h$ minimises the convolution-smoothed loss function $\hat{Q}_h(\beta) = (1/n)\sum_{i=1}^n (\rho_\alpha * K_h)(Y_i - X_i^\top\beta)$, where $K_h(u) = (1/h)K(u/h)$ for some symmetric, non-negative kernel function K, and $*$ is the convolution operator. We refer to Fernandes et al. (2021) and He et al. (2023) for more details, including both asymptotic and finite-sample properties of $\hat\beta_h$ when p is fixed and growing, as well as the bandwidth selection.
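The sketch below illustrates the idea of convolution smoothing with a Gaussian kernel via plain gradient descent; the bandwidth default, step size, and stopping rule are illustrative assumptions rather than the conquer package's actual implementation.

```python
import numpy as np
from scipy.stats import norm

def smoothed_qr(X, y, alpha, h=None, n_iter=1000, tol=1e-8):
    """Convolution-smoothed QR with a Gaussian kernel, fitted by gradient descent."""
    n, p = X.shape
    if h is None:
        h = ((p + np.log(n)) / n) ** 0.4         # a bandwidth of a commonly used order
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS warm start
    for _ in range(n_iter):
        # Gradient of the smoothed check loss: (1/n) sum_i x_i {Phi((x_i'b - y_i)/h) - alpha}
        grad = X.T @ (norm.cdf((X @ beta - y) / h) - alpha) / n
        beta -= h * grad                         # conservative step size of order h
        if np.linalg.norm(grad) < tol:
            break
    return beta
```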
Define the matrices $\Sigma = \mathbb{E}(XX^\top)$ and $\Omega = \mathbb{E}(\varepsilon^2 XX^\top)$ with $\varepsilon = Z(\beta^*) - X^\top\theta^*$ satisfying $\mathbb{E}(\varepsilon \mid X) = 0$ under model (4). Provided that $p = p_n$ satisfies a suitable growth condition relative to n, we will show in the Online Supplementary Theorem A.3 that any fixed linear projection of $\hat\theta$ is asymptotically normal:
$$\sqrt{n}\,\frac{a^\top(\hat\theta - \theta^*)}{(a^\top\Sigma^{-1}\Omega\Sigma^{-1}a)^{1/2}} \to N(0, 1) \quad \text{for any fixed } a \in \mathbb{R}^p.$$
As a direct implication, an asymptotically valid entrywise confidence interval for $\theta^*$ can be constructed as follows. Recall that $(\hat\beta, \hat\theta)$ are the joint quantile-ES regression estimators given in equations (9) and (11), respectively. Define the estimated ‘residuals’ as
$$\hat\varepsilon_i = \frac{1}{\alpha}(Y_i - X_i^\top\hat\beta)\,\mathbb{1}(Y_i \le X_i^\top\hat\beta) + X_i^\top\hat\beta - X_i^\top\hat\theta, \quad i = 1, \ldots, n. \qquad (13)$$
We then use the sample analogue of Σ and a plug-in estimator of Ω, namely, $\hat\Sigma = (1/n)\sum_{i=1}^n X_iX_i^\top$ and $\hat\Omega = (1/n)\sum_{i=1}^n \hat\varepsilon_i^2 X_iX_i^\top$. Consequently, we construct an (approximate) 95% confidence interval for each coefficient $\theta_j^*$ as
$$\hat\theta_j \pm 1.96\sqrt{(\hat\Sigma^{-1}\hat\Omega\hat\Sigma^{-1})_{jj}/n}, \quad j = 1, \ldots, p. \qquad (14)$$
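In code, the plug-in confidence intervals (13)–(14) amount to a few matrix operations; the sketch below assumes the estimates from the two-step fit above and uses illustrative names.

```python
import numpy as np

def es_confidence_intervals(X, y, alpha, beta, theta, z=1.96):
    """Entrywise (approximate) 95% CIs for the two-step LSE via equations (13)-(14)."""
    n = X.shape[0]
    q_fit = X @ beta
    # Estimated ES 'residuals' as in equation (13)
    eps = np.minimum(y - q_fit, 0.0) / alpha + q_fit - X @ theta
    Sigma = X.T @ X / n
    Omega = (X * (eps**2)[:, None]).T @ X / n
    Sinv = np.linalg.inv(Sigma)
    se = np.sqrt(np.diag(Sinv @ Omega @ Sinv) / n)
    return np.column_stack((theta - z * se, theta + z * se))
```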
3 Robust expected shortfall regression
3.1 Motivation
The two-step estimator $\hat\theta$ given in equation (12) is essentially an LSE with generated response variables. While the two-step procedure is computationally efficient and enjoys nice asymptotic properties, due to the use of the least squares type loss, it is sensitive to outliers or heavy-tailed data that are ubiquitous in various areas such as climate, insurance claims, and genomics data. In particular, heavy-tailedness has become a well-known stylised fact of financial returns and stock-level predictor variables (Cont, 2001). Since the expected shortfall describes the tail behaviour of a distribution, it is important to construct an estimator that is robust to power-law or Pareto-like tails.
To motivate the need for a robust ES estimator, we start with the non-regression setting in which $X \equiv 1$. The two-step ES estimator (12) can then be simplified as
$$\widehat{\mathrm{ES}}_\alpha = \hat{Q}_\alpha + \frac{1}{n\alpha}\sum_{i=1}^n \min(Y_i - \hat{Q}_\alpha, 0), \qquad (15)$$
where $\hat{Q}_\alpha$ is the sample α-quantile of $Y_1, \ldots, Y_n$. The estimator (15) coincides with the ES estimate (4) in Bassett et al. (2004), although the latter is motivated differently by the following property:
$$\mathrm{ES}_\alpha(Y) = Q_\alpha(Y) + \frac{1}{\alpha}\,\mathbb{E}\min\{Y - Q_\alpha(Y), 0\}.$$
Since $\hat{Q}_\alpha$ is consistent for $Q_\alpha(Y)$, up to higher order terms, $\widehat{\mathrm{ES}}_\alpha$ equals $Q_\alpha(Y) + (n\alpha)^{-1}\sum_{i=1}^n \min\{Y_i - Q_\alpha(Y), 0\}$, which is first-order equivalent to the ‘oracle’ ES estimator, namely the empirical mean of the truncated variables $\xi_i = Q_\alpha(Y) + \alpha^{-1}\min\{Y_i - Q_\alpha(Y), 0\}$.
Since the truncated variable $\xi$ can be highly left-skewed with heavy tails, the corresponding empirical mean is sensitive to the (left) tails of the distribution of Y, and hence lacks robustness against heavy-tailed data. Specifically, let $Y_1, \ldots, Y_n$ be i.i.d. random variables with mean μ and variance $\sigma^2$. When $Y$ is sub-Gaussian (i.e. $\mathbb{E}\,e^{\lambda(Y-\mu)} \le e^{\lambda^2\sigma^2/2}$ for any $\lambda \in \mathbb{R}$), it follows from the Chernoff bound (Chernoff, 1952) that
$$\mathbb{P}\Big\{|\bar{Y}_n - \mu| \ge \sigma\sqrt{2\log(2/\delta)/n}\Big\} \le \delta \quad \text{for any } \delta \in (0, 1). \qquad (16)$$
In other words, the sample mean satisfies a sub-Gaussian deviation bound. On the other hand, the following proposition provides a lower bound for the deviations of the empirical mean when the distribution of Y is the least favourable among all heavy-tailed distributions with mean zero and variance $\sigma^2$.
Together, the upper and lower bounds (16) and (17) show that the worst-case deviations of the empirical mean are sub-optimal when the underlying distribution is heavy-tailed (as opposed to having Gaussian-like thin tails). If Y follows a heavy-tailed distribution, such as the t- or Pareto distributions, then the left-truncated variables have not only heavy but also asymmetric tails. In this case, the empirical mean can be a sub-optimal estimator of the expected shortfall.
3.2 Robust estimation and inference via adaptive Huber regression
To robustify the ES regression estimator (12) in the presence of skewed heavy-tailed observations, we utilise the idea of adaptive Huber regression in Zhou et al. (2018). For some $\tau > 0$, the Huber loss (Huber, 1973) takes the form
$$\ell_\tau(u) = \begin{cases} u^2/2, & |u| \le \tau, \\ \tau|u| - \tau^2/2, & |u| > \tau. \end{cases} \qquad (18)$$
We propose a robust/Huberised ES regression estimator defined as
$$\hat\theta_\tau \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell_\tau\{Z_i(\hat\beta) - X_i^\top\theta\}, \qquad (19)$$
where $Z_i(\hat\beta)$ is as defined in equation (10), and $\tau = \tau_n > 0$ is a robustification parameter that should be calibrated adaptively from data.
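Since the second-stage objective in equation (19) is a convex Huber regression with generated responses, it can be solved by standard iteratively reweighted least squares; the following sketch is one such solver, with assumed convergence settings.

```python
import numpy as np

def huber_es_step(X, z, tau, n_iter=100, tol=1e-8):
    """Huberised ES regression (19): z holds the generated responses Z_i(beta_hat)."""
    theta, *_ = np.linalg.lstsq(X, z, rcond=None)   # warm start at the LSE
    for _ in range(n_iter):
        r = z - X @ theta
        # IRLS weights w = psi_tau(r)/r: 1 inside [-tau, tau], tau/|r| outside
        w = np.where(np.abs(r) <= tau, 1.0, tau / np.maximum(np.abs(r), 1e-12))
        WX = X * w[:, None]
        theta_new = np.linalg.solve(X.T @ WX, WX.T @ z)
        if np.linalg.norm(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta
```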
To see this, we consider the oracle Huber ES estimator defined as
$$\tilde\theta_\tau \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \ell_\tau\{Z_i(\beta^*) - X_i^\top\theta\}, \qquad (20)$$
where $Z_i(\beta^*)$ is given by equation (10) evaluated at the true quantile parameter $\beta^*$. For any $\tau > 0$, $\tilde\theta_\tau$ is an M-estimator of its population counterpart
$$\theta_\tau^* := \operatorname*{arg\,min}_{\theta \in \mathbb{R}^p} \mathbb{E}\,\ell_\tau\{Z(\beta^*) - X^\top\theta\}.$$
Let $\psi_\tau(u) = \min\{\max(u, -\tau), \tau\}$ be the derivative of the Huber loss. By the convexity of the Huber loss, $\theta_\tau^*$ must satisfy the first-order condition $\mathbb{E}[\psi_\tau\{Z(\beta^*) - X^\top\theta_\tau^*\}X] = 0$. On the other hand, define the ES deviations $\varepsilon = Z(\beta^*) - X^\top\theta^*$, satisfying $\mathbb{E}(\varepsilon \mid X) = 0$. Since the conditional distribution of ε given $X$ is asymmetric, in general we have $\mathbb{E}\{\psi_\tau(\varepsilon) \mid X\} \ne 0$, which in turn implies that $\theta_\tau^* \ne \theta^*$. We thus refer to their difference under the $\ell_2$-norm, $\|\theta_\tau^* - \theta^*\|_2$, as the robustification bias. Proposition 2 provides an upper bound for the robustification bias, which depends on τ and some moment parameters. In particular, τ needs to diverge for the robustification bias to diminish.
Proposition 2. Assume that ε satisfies $\mathbb{E}(\varepsilon^2 \mid X) \le \sigma^2$ almost surely for some constant $\sigma > 0$, and that $\Sigma = \mathbb{E}(XX^\top)$ is positive definite. Then, for any $\tau \ge \sigma$, we have $\|\theta_\tau^* - \theta^*\|_\Sigma \lesssim \sigma^2/\tau$.
In Section 4, we investigate the finite-sample properties of the robust ES estimator $\hat\theta_\tau$ obtained via equations (9) and (19): our results include a deviation inequality for $\hat\theta_\tau$ (Theorem 1), the Bahadur representation (Theorem 2), and a Berry–Esseen bound for linear projections of $\hat\theta_\tau$ and $\tilde\theta_\tau$ (Theorem 3). With a properly chosen τ that is of order $\sigma\sqrt{n/(p + \log n)}$, we will show that $\|\hat\theta_\tau - \theta^*\|_2 \lesssim \sigma\sqrt{(p + \log n)/n}$ with high probability. Moreover, for any deterministic vector $a \in \mathbb{R}^p$, the standardised statistic $\sqrt{n}\,a^\top(\hat\theta_\tau - \theta^*)/\sigma_a$ converges in distribution to $N(0, 1)$, where $\sigma_a^2 = a^\top\Sigma^{-1}\Omega\Sigma^{-1}a$ with Σ and Ω as defined in Section 2.2. Our theoretical analysis reveals two attractive properties of the adaptive Huberised ES estimator $\hat\theta_\tau$: (i) the non-asymptotic deviation upper bounds for $\hat\theta_\tau$ are much smaller in order than those for the two-step LSE $\hat\theta$ at any given confidence level and (ii) the asymptotic relative efficiency of $\hat\theta_\tau$ to $\hat\theta$ is one. Moreover, Theorem 3 shows that the two-step robust estimator (with estimated conditional quantiles) is asymptotically equivalent to the oracle Huberised estimator (20) (assuming $\beta^*$ were known). This further justifies the usefulness of the Neyman-orthogonal score, which makes the QR estimation error first-order negligible.
Consistent estimators of Σ and Ω are useful for statistical inference. Given the pair of quantile-ES regression estimators $(\hat\beta, \hat\theta_\tau)$, with a slight abuse of notation we use $\hat\varepsilon_i$ to denote the fitted ES residuals as in equation (13) except with $\hat\theta$ replaced by $\hat\theta_\tau$. As discussed in Section 2.2, a naive estimate of Ω is $(1/n)\sum_{i=1}^n \hat\varepsilon_i^2 X_iX_i^\top$. In the presence of heavy-tailed errors, even the ‘oracle’ estimate $(1/n)\sum_{i=1}^n \varepsilon_i^2 X_iX_i^\top$ performs poorly and tends to overestimate. Motivated by Huber regression, we further propose a simple truncated estimator of Ω given by
$$\hat\Omega_\gamma = \frac{1}{n}\sum_{i=1}^n \min(\hat\varepsilon_i^2, \gamma^2)\,X_iX_i^\top, \qquad (21)$$
where $\gamma > 0$ is a second robustification parameter. Consequently, we construct approximate 95% robust confidence intervals for the $\theta_j^*$’s as
$$\hat\theta_{\tau, j} \pm 1.96\sqrt{(\hat\Sigma^{-1}\hat\Omega_\gamma\hat\Sigma^{-1})_{jj}/n}, \quad j = 1, \ldots, p. \qquad (22)$$
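A sketch of the robust intervals (21)–(22) follows: the only change from the non-robust version is the truncation of the squared residuals at $\gamma^2$; names are illustrative.

```python
import numpy as np

def robust_es_cis(X, eps_hat, theta_tau, gamma, z=1.96):
    """Robust CIs (22) based on the truncated covariance estimator (21)."""
    n = X.shape[0]
    e2 = np.minimum(eps_hat**2, gamma**2)     # truncate squared ES residuals at gamma^2
    Omega_g = (X * e2[:, None]).T @ X / n
    Sinv = np.linalg.inv(X.T @ X / n)
    se = np.sqrt(np.diag(Sinv @ Omega_g @ Sinv) / n)
    return np.column_stack((theta_tau - z * se, theta_tau + z * se))
```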
The convergence rate of $\hat\Omega_\gamma$ with a suitably chosen γ will be discussed in Section 4.
As previously discussed, the robustification parameter τ plays a crucial role in achieving a balance between bias and robustness against heavy-tailed error distributions. This balance is necessary because of the asymmetric nature of the ES residual ε with $\mathbb{E}(\varepsilon \mid X) = 0$. Assuming that the (conditional) variance of ε is bounded, i.e. $\mathbb{E}(\varepsilon^2 \mid X) \le \sigma^2$ (almost surely) for some $\sigma > 0$, Theorem 1 suggests that to achieve a tight deviation bound at the confidence level $1 - \delta$ for any given $\delta \in (0, 1)$, the robustification parameter should be of order $\sigma\sqrt{n/(p + \log\delta^{-1})}$. In practice, the scale of ε is typically unknown. A useful heuristic is to substitute it with the sample standard deviation of the negative QR residuals $\{\min(Y_i - X_i^\top\hat\beta, 0)\}_{i=1}^n$, where $\hat\beta$ refers to the first-stage QR estimator. Using this data-driven proxy for the scale in τ, the resulting estimator is also location and scale equivariant.
In the following, we present a refined data-driven approach for selecting τ that consistently outperforms the previously mentioned rule of thumb in the numerical experiments conducted in Section 6. This approach is adapted from the method proposed in L. Wang et al. (2021) and draws inspiration from the censored equation approach originally introduced by Hahn et al. (1990) as a proof technique for deriving robust weak convergence theory for self-normalised sums. Note that for each $\tau > 0$, the Huber ES estimator $\hat\theta_\tau$ can be defined equivalently as the solution to the estimating equation $\sum_{i=1}^n \psi_\tau(Z_i - X_i^\top\theta)X_i = 0$, where $Z_i = Z_i(\hat\beta)$ are the generated response variables, and $\hat\beta$ denotes the initial QR estimator. Since the optimal choice of τ is proportional to the noise scale, we propose to estimate $\theta^*$ and the unknown noise scale simultaneously by solving the following system of equations for $(\theta, \tau) \in \mathbb{R}^p \times (0, \infty)$:
$$\frac{1}{n}\sum_{i=1}^n \psi_\tau(Z_i - X_i^\top\theta)X_i = 0 \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^n \frac{\min\{(Z_i - X_i^\top\theta)^2, \tau^2\}}{\tau^2} = \frac{z_n}{n}, \qquad (23)$$
where $z_n = p + \log n$. Since the Huber loss is convex, the first (vector) equation in θ with τ fixed can be solved using either the iteratively reweighted least squares algorithm or the Broyden–Fletcher–Goldfarb–Shanno algorithm. For the second equation with θ fixed, it can be shown that the function $g(\tau) = (1/n)\sum_{i=1}^n \min\{(Z_i - X_i^\top\theta)^2, \tau^2\}/\tau^2$ is non-increasing, as demonstrated by its derivative
$$g'(\tau) = -\frac{2}{n\tau^3}\sum_{i=1}^n (Z_i - X_i^\top\theta)^2\,\mathbb{1}(|Z_i - X_i^\top\theta| \le \tau) \le 0.$$
Proposition 3 in L. Wang et al. (2021) further guarantees that the second equation in (23) with θ fixed has a unique solution, provided that $\sum_{i=1}^n \mathbb{1}(Z_i \ne X_i^\top\theta) > z_n$. Based on these observations, we propose the following alternating algorithm, which begins at iteration 0 with an initial estimate $\hat\theta^{(0)} = \hat\theta$, the two-step LSE given in equation (11), or equivalently equation (12). At each iteration $t = 1, 2, \ldots$, the procedure involves two steps (see the sketch after this list):
1. Compute the ES ‘residuals’ $\hat{w}_i^{(t-1)} = Z_i - X_i^\top\hat\theta^{(t-1)}$ using the previous estimate $\hat\theta^{(t-1)}$. Let $\tau^{(t)}$ be the solution to the equation $(1/n)\sum_{i=1}^n \min\{(\hat{w}_i^{(t-1)})^2, \tau^2\}/\tau^2 = z_n/n$, $\tau > 0$.
2. Compute the updated estimate $\hat\theta^{(t)} \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^p} \sum_{i=1}^n \ell_{\tau^{(t)}}(Z_i - X_i^\top\theta)$.
Given a prespecified tolerance $\epsilon > 0$, the algorithm terminates at the t-th iteration if $\|\hat\theta^{(t)} - \hat\theta^{(t-1)}\|_2 \le \epsilon$, or if the maximum number of iterations is reached. Our numerical experiments in Section 6 show that this algorithm generally achieves convergence after only a small number of iterations. Intuitively, we attribute the algorithm’s fast convergence to the observation that $\hat\theta_\tau$ changes gradually as τ varies. This gradual change causes the residuals to behave similarly over a range of τ values. We refer the reader to the Online Supplementary Section B for a detailed elaboration on the motivations behind our proposed data-driven method.
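The sketch below implements the alternating scheme under the censored-equation form written above; the censoring level $z_n = p + \log n$, the bisection solver, and the IRLS inner loop are assumptions made for illustration.

```python
import numpy as np

def solve_tau(w, z_n):
    """Solve (1/n) sum_i min(w_i^2, tau^2)/tau^2 = z_n/n for tau by bisection."""
    n = w.size
    f = lambda tau: np.minimum(w**2, tau**2).sum() / (n * tau**2) - z_n / n
    lo, hi = 1e-8, np.abs(w).max() + 1.0
    while f(hi) > 0:          # expand until the root is bracketed (f is non-increasing)
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def huber_fit(X, z, tau, n_iter=100):
    """Inner Huber regression step via iteratively reweighted least squares."""
    theta, *_ = np.linalg.lstsq(X, z, rcond=None)
    for _ in range(n_iter):
        r = z - X @ theta
        w = np.where(np.abs(r) <= tau, 1.0, tau / np.maximum(np.abs(r), 1e-12))
        theta = np.linalg.solve(X.T @ (X * w[:, None]), (X * w[:, None]).T @ z)
    return theta

def alternating_huber_es(X, z, max_iter=50, tol=1e-5):
    """Alternate between tuning tau (step 1) and refitting theta (step 2)."""
    n, p = X.shape
    z_n = p + np.log(n)                              # assumed censoring level
    theta, *_ = np.linalg.lstsq(X, z, rcond=None)    # initialise at the two-step LSE
    for _ in range(max_iter):
        tau = solve_tau(z - X @ theta, z_n)
        theta_new = huber_fit(X, z, tau)
        if np.linalg.norm(theta_new - theta) <= tol:
            return theta_new, tau
        theta = theta_new
    return theta, tau
```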
4 Statistical theory
This section presents non-asymptotic high probability bounds for the estimation error of the Huberised two-step ES estimator $\hat\theta_\tau$, as defined in equation (19). Additionally, we establish a non-asymptotic Bahadur representation for $\hat\theta_\tau$, which is a crucial step towards obtaining a Berry–Esseen-type bound for Gaussian approximation. Throughout this section, we write $Z_i = Z_i(\beta^*)$ and $\varepsilon_i = Z_i - X_i^\top\theta^*$. Without loss of generality, we assume that the random predictors have zero means; this makes the later sub-Gaussian assumption more reasonable (see Condition 2 below). Otherwise, the analysis applies after centring the predictors at their means and adjusting the intercept accordingly, with the sub-Gaussian assumption imposed on the centred predictors.
In the context of a joint (linear) quantile and ES regression model, we begin by establishing a high probability bound, explicitly dependent on n and p, for the QR estimator (9). To this end, we impose the following conditions on the covariates and the conditional distribution of Y given X.
Condition 1. The conditional density function of Y given X, denoted by $f_{Y\mid X}$, exists and is continuous on its support. Moreover, there exist constants $\bar{f} \ge \underline{f} > 0$ such that $f_{Y\mid X}(u) \le \bar{f}$ for all $u \in \mathbb{R}$ and $f_{Y\mid X}(u) \ge \underline{f}$ for all $u$ in a neighbourhood of $X^\top\beta^*$, almost surely (over X).
Condition 2. The random covariate vector $X \in \mathbb{R}^p$ is sub-Gaussian, that is, there exists some (dimension-free) constant $\upsilon_0 \ge 1$ such that $\mathbb{P}(|u^\top\Sigma^{-1/2}X| \ge \upsilon_0 t) \le 2e^{-t^2/2}$ for all $u \in \mathbb{S}^{p-1}$ and $t \ge 0$, where $\Sigma = \mathbb{E}(XX^\top)$ is positive definite.
Condition 1 imposes regularity conditions on the random error distributions, accommodating heteroskedastic error distributions and not requiring the existence of any moment. Condition 2 is used to guarantee that population and empirical quantities (e.g. the objective function or the gradient function) are uniformly close to each other in a compact region. It can be replaced by a boundedness assumption, which will lead to similar results: for example, that X is compactly supported with $\|\Sigma^{-1/2}X\|_2$ bounded almost surely by a quantity that is usually proportional to $\sqrt{p}$.
Proposition 3. Under Conditions 1 and 2, the QR estimator $\hat\beta$ given in equation (9) satisfies, for any $t > 0$, that $\|\hat\beta - \beta^*\|_\Sigma \le C_1\sqrt{(p + t)/n}$ holds with probability at least $1 - 2e^{-t}$ as long as $n \ge C_2(p + t)$, where $C_1, C_2 > 0$ are constants depending only on α, $\upsilon_0$, and the constants in Condition 1.
While QR has been extensively studied since the seminal work of Koenker and Bassett (1978), there remains a paucity of literature that addresses its finite-sample properties, particularly in terms of high probability bounds. Proposition 3 revisits Theorem 2.1 originally presented in Pan and Zhou (2021). For the sake of completeness, we provide a self-contained and simplified proof in the Online Supplementary Section G.9. Shifting our focus to ES regression, which involves conditional expectations, we additionally impose the following moment condition on the random QR error $u = Y - X^\top\beta^*$.
Condition 3. The conditional CDF of $u$ given X, denoted by $F_{u\mid X}$, is continuously differentiable and satisfies $F_{u\mid X}(0) = \alpha$ almost surely (over X). Moreover, the negative part of $u$, denoted by $u^- = \max(-u, 0)$, satisfies $\mathrm{var}(u^- \mid X) \le \sigma^2$ almost surely (over X), where $\mathrm{var}(\cdot \mid X)$ denotes the conditional variance given X.
Condition 3 asserts that the conditional variance of the negative part of the QR residual is bounded. In our theoretical analysis, we assume σ to be a constant for convenience. More generally, one can assume a form of $u^- = \sigma(X)\,\eta$, where $\sigma(\cdot)$ is a positive function on $\mathbb{R}^p$ (not necessarily bounded), and η is independent of X satisfying $\mathbb{E}(\eta^2) < \infty$. In this case, an additional moment assumption on $\sigma(X)$, such as boundedness of its higher moments, would suffice.
Our next result establishes high probability bounds for the estimation error of ES regression, conditioning on the event that the first-stage QR estimator $\hat\beta$ falls within a local neighbourhood of $\beta^*$.
Theorem 1. For any $t > 0$, conditioned on the event that $\|\hat\beta - \beta^*\|_\Sigma \lesssim \sqrt{(p + t)/n}$, the robust estimator $\hat\theta_\tau$ with $\tau \gtrsim \sigma$ satisfies, with probability at least $1 - Ce^{-t}$,
$$\|\hat\theta_\tau - \theta^*\|_\Sigma \lesssim \sigma\sqrt{\frac{p + t}{n}} + \frac{\sigma^2}{\tau}. \qquad (24)$$
The above bound depends on the confidence level $1 - \delta$ (with $\delta \asymp e^{-t}$) only through $\sqrt{\log(1/\delta)}$, in contrast to the bound for the two-step LSE, which scales as $\sqrt{1/\delta}$, as demonstrated in the Online Supplementary Theorem A.1 in Section A. This observation suggests that the Huberised estimator is much more robust to heavy tails from a non-asymptotic perspective, compared to the two-step LSE. Specifically, in cases where the error variables only have finite variance, the worst-case deviations of the two-step LSE $\hat\theta$ are considerably larger than those of $\hat\theta_\tau$.
Bias-robustness trade-off
A uniform bound over τ
If, in addition to Condition 3, some higher order moment of $u^-$ is bounded, namely, $\mathbb{E}\{(u^-)^{2+\kappa} \mid X\}$ is bounded almost surely (over X) for some $\kappa > 0$, the second term on the right-hand side of equation (24) will become $O(\tau^{-(1+\kappa)})$. In order to attain tight (finite-sample) concentration bounds, the robustification parameter τ should not exceed the order $\sigma\sqrt{n/(p + t)}$ in magnitude. Conversely, τ should demonstrate sufficiently rapid growth in order for the bias term, controlled by $\sigma^2/\tau$ or $\tau^{-(1+\kappa)}$ (in case higher order moments of $u^-$ are bounded), to decay at a comparable rate to the stochastic error.
Unlike the two-step LSE $\hat\theta$, the robust counterpart $\hat\theta_\tau$ does not possess a closed-form expression. As a pivotal step in deriving Gaussian approximation results, the following theorem furnishes a non-asymptotic Bahadur representation for $\hat\theta_\tau$, complete with explicit error bounds depending on $(n, p, \tau)$ and the first-stage QR estimation error.
Lastly, we present the following Gaussian approximation result that bounds the Kolmogorov distance between the distribution of the standardised statistic $\sqrt{n}\,a^\top(\hat\theta_\tau - \theta^*)/\sigma_a$ and the standard normal distribution, uniformly over all deterministic vectors $a \in \mathbb{R}^p$, where $\sigma_a^2 = a^\top\Sigma^{-1}\Omega\Sigma^{-1}a$. A similar conclusion applies to the oracle robust estimate (20). The following theorem shows that the two-step robust estimator obtained via equations (9) and (19) is asymptotically equivalent to the oracle Huberised estimator (20), assuming $\beta^*$ is known.
The above Gaussian approximation result lays the theoretical foundation for the statistical inference problems of testing the linear hypothesis $H_0: a^\top\theta^* = c_0$ vs. $H_1: a^\top\theta^* \ne c_0$ and constructing confidence intervals for $a^\top\theta^*$, where $a \in \mathbb{R}^p$ and $c_0 \in \mathbb{R}$ are predetermined. Given the joint quantile and ES regression estimates $(\hat\beta, \hat\theta_\tau)$, let $\hat\Omega_\gamma$ be the truncated estimator of Ω defined in equation (21), with $\gamma > 0$ denoting a second robustification parameter. Then, we consider the robust test statistic $\sqrt{n}\,(a^\top\hat\theta_\tau - c_0)/\hat\sigma_{a,\gamma}$ for testing $H_0$, and the (approximate) $100(1-\delta)\%$ confidence interval $a^\top\hat\theta_\tau \pm z_{\delta/2}\,\hat\sigma_{a,\gamma}/\sqrt{n}$ for $a^\top\theta^*$, where $\hat\sigma_{a,\gamma}^2 = a^\top\hat\Sigma^{-1}\hat\Omega_\gamma\hat\Sigma^{-1}a$ is a robust variance estimator and $z_{\delta/2}$ is the upper $(\delta/2)$-percentile of $N(0, 1)$. The consistency of $\hat\Omega_\gamma$ with a properly chosen γ is investigated in the Online Supplementary Section C.
5 Non-parametric expected shortfall regression
In this section, we consider non-parametric models for joint quantile and expected shortfall regression. For a predetermined quantile level $\alpha \in (0, 1)$, the goal is to estimate the unknown (conditional) quantile and expected shortfall functions $q_\alpha(x) = Q_\alpha(Y \mid X = x)$ and $e_\alpha(x) = \mathrm{ES}_\alpha(Y \mid X = x)$, with an emphasis on the latter. By equation (1), $q_\alpha$ and $e_\alpha$ can be identified as
$$q_\alpha(x) = \operatorname*{arg\,min}_{q \in \mathbb{R}} \mathbb{E}\{\rho_\alpha(Y - q) \mid X = x\} \quad \text{and} \quad e_\alpha(x) = \mathbb{E}\Big[q_\alpha(X) + \frac{1}{\alpha}\{Y - q_\alpha(X)\}\,\mathbb{1}\{Y \le q_\alpha(X)\} \,\Big|\, X = x\Big].$$
Motivated by the two-step procedure developed under joint linear models, in the following we propose a non-parametric ES estimator using the series regression method (Andrews, 1991; Eubank & Spiegelman, 1990; Newey, 1997). Such a non-parametric estimate is carried out by regressing the dependent variable on an asymptotically growing number of approximating functions of the covariates, and therefore is closely related to the estimator defined in equation (11) under the so-called many regressors model (Belloni et al., 2019), that is, the dimension is allowed to grow with n. The idea of series estimation is to first approximate $q_\alpha(\cdot)$ and $e_\alpha(\cdot)$ by their ‘projections’ on the linear spans of $m_1$ and $m_2$ series/basis functions, respectively, and then fit the coefficients using the observed data. Specifically, we approximate the functions $q_\alpha(x)$ and $e_\alpha(x)$ by the linear forms $U(x)^\top\beta_\alpha$ and $V(x)^\top\theta_\alpha$, where
$$U(x) = \{U_1(x), \ldots, U_{m_1}(x)\}^\top \quad \text{and} \quad V(x) = \{V_1(x), \ldots, V_{m_2}(x)\}^\top$$
are two vectors of series approximating functions of dimensions $m_1$ and $m_2$, respectively. Here both $m_1$ and $m_2$ may increase with n. We thus define the vectors of quantile and ES series approximation coefficients as
$$\beta_\alpha^* \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{m_1}} \mathbb{E}\,\rho_\alpha\{Y - U(X)^\top\beta\} \quad \text{and} \quad \theta_\alpha^* \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^{m_2}} \mathbb{E}\big[Z\{U(X)^\top\beta_\alpha^*\} - V(X)^\top\theta\big]^2,$$
where $Z(q) = q + \alpha^{-1}(Y - q)\,\mathbb{1}(Y \le q)$ denotes the generated response with threshold $q$.
Given independent observations $\{(Y_i, X_i)\}_{i=1}^n$ from $(Y, X)$, with X taking values in a compact subset $\mathcal{X}$ of $\mathbb{R}^d$, we write $U_i = U(X_i)$ and $V_i = V(X_i)$. Extending the two-step approach described in Section 2.2, we first define the (conditional) quantile series estimator of $q_\alpha(\cdot)$ (Belloni et al., 2019):
$$\hat{q}_\alpha(x) = U(x)^\top\hat\beta_\alpha \quad \text{with} \quad \hat\beta_\alpha \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{m_1}} \frac{1}{n}\sum_{i=1}^n \rho_\alpha(Y_i - U_i^\top\beta). \qquad (30)$$
With generated response variables $\hat{Z}_i = U_i^\top\hat\beta_\alpha + \alpha^{-1}(Y_i - U_i^\top\hat\beta_\alpha)\,\mathbb{1}(Y_i \le U_i^\top\hat\beta_\alpha)$, the second-stage ES series estimator is given by
$$\hat{e}_\alpha(x) = V(x)^\top\hat\theta_\alpha \quad \text{with} \quad \hat\theta_\alpha \in \operatorname*{arg\,min}_{\theta \in \mathbb{R}^{m_2}} \frac{1}{n}\sum_{i=1}^n (\hat{Z}_i - V_i^\top\theta)^2. \qquad (31)$$
Commonly used series functions with good approximation properties include B-splines, polynomials, Fourier series, and compactly supported wavelets. We refer to Newey (1997) and Chen (2007) for a detailed description of these series functions. In the context of QR, Chen (2007) established the consistency and rate of convergence at a single quantile index. More recently, Belloni et al. (2019) developed a large sample theory for the quantile series coefficient process, including convergence rates and uniform strong approximations. The choice of the parameters $m_1$ and $m_2$, also known as the orders of the series estimators, is crucial for establishing the balance between bias and variance; a sketch of this two-step series procedure is given below.
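Here is a compact sketch of the two-step series estimator (30)–(31) for a univariate covariate, using a polynomial basis for simplicity (B-splines would typically be preferred in practice); the basis orders and solver choices are illustrative.

```python
import numpy as np
import statsmodels.api as sm

def series_es(x, y, alpha, m1=8, m2=8):
    """Two-step series ES regression with polynomial bases of orders m1 and m2."""
    U = np.vander(x, m1, increasing=True)            # quantile basis U(x)
    V = np.vander(x, m2, increasing=True)            # ES basis V(x)
    beta = np.asarray(sm.QuantReg(y, U).fit(q=alpha).params)   # series QR (30)
    q_fit = U @ beta
    z = q_fit + np.minimum(y - q_fit, 0.0) / alpha   # generated responses
    theta, *_ = np.linalg.lstsq(V, z, rcond=None)    # ES series estimator (31)
    return lambda x0: np.vander(np.atleast_1d(x0), m2, increasing=True) @ theta
```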
Note that the quantile series estimator in equation (30) has been well studied by Belloni et al. (2019). Because the number of regressors increases with the sample size, conventional central limit theorems are no longer applicable to capture the joint asymptotic normality of the regression coefficients. The growing dimensionality is the primary source of technical complications. Our theoretical analysis under the joint linear model (4), which leads to novel non-asymptotic high probability bounds, can be used as a starting point for studying the two-step non-parametric ES series estimator defined in equation (31). Of particular interest is to develop a uniform inference procedure for the conditional ES function and its theory. That is, at a given confidence level $1 - \delta \in (0, 1)$, we aim to construct a pair of functional estimates $(\hat{e}_\alpha^{\,\mathrm{L}}, \hat{e}_\alpha^{\,\mathrm{U}})$ from $\{(Y_i, X_i)\}_{i=1}^n$ such that
$$\mathbb{P}\big\{\hat{e}_\alpha^{\,\mathrm{L}}(x) \le e_\alpha(x) \le \hat{e}_\alpha^{\,\mathrm{U}}(x) \ \text{for all } x \in \mathcal{X}\big\} \ge 1 - \delta \quad \text{asymptotically}.$$
Since a significant amount of additional work is still needed, including explicit characterisations of the ES series approximation error and the impact of the first-stage non-parametric QR estimation error, we leave a rigorous theoretical investigation of such uniform inference procedures to future work. Although we have only focussed on series methods, there are other non-parametric techniques that offer superior empirical and theoretical performance. Among those, deep neural networks have stood out as a promising tool for non-parametric estimation, from least squares, logistic to QR (Farrell et al., 2021; Schmidt-Hieber, 2020; Shen et al., 2021). It is practically useful to construct deep learning implementations of two-step estimators and statistically important to deliver valid inferences on finite-dimensional parameters following first-step estimation (of both quantile and ES functions) using deep learning. A detailed investigation of these problems is beyond the present scope but of future interest.
6 Numerical studies and real data examples
6.1 Monte Carlo experiments
In this section, we assess the numerical performance of the proposed method for fitting expected shortfall regression. For its R implementation, we first obtain a QR estimate via the quantreg library, and in step two use the adaHuber library to solve (19) with the robustification parameter selected adaptively as described in Section 3.2.
We compare the proposed two-step adaptive Huber ES estimator (2S-AH) to several competitors: (i) the joint regression estimate (joint) via FZ loss minimisation, implemented via the R library esreg with the default option; (ii) the two-step LSE (12) (2S-LS); and (iii) the oracle two-step ‘estimator’ (2S-oracle). Recall that the two-step procedure first obtains a QR estimator via either standard (Koenker & Bassett, 1978) or smoothed QR regression (He et al., 2023), and subsequently computes the ES estimator based on the fitted quantile thresholds $X_i^\top\hat\beta$. The oracle method refers to the two-step ES estimate based on the true quantile thresholds $X_i^\top\beta^*$.
In our simulation studies, we first generate the covariates and the random noise independently, with the covariate coordinates drawn as independent Rademacher random variables. Data are then generated from the heteroscedastic model
$$Y = X^\top\beta^* + (X^\top\gamma)\,\varepsilon,$$
where the vector γ is chosen so that $X^\top\gamma > 0$, and the random noise ε follows one of the following two distributions: (i) the standard normal distribution and (ii) a Student's t-distribution with a small number of degrees of freedom. Given $\beta^*$ and γ, the true quantile and expected shortfall regression coefficients are $\beta^*(\alpha) = \beta^* + q_{\alpha,\varepsilon}\,\gamma$ and $\theta^*(\alpha) = \beta^* + e_{\alpha,\varepsilon}\,\gamma$, where $q_{\alpha,\varepsilon}$ and $e_{\alpha,\varepsilon}$ are the α-level quantile and expected shortfall of ε, respectively.
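For concreteness, the following sketch generates data from a location-scale design of the kind described; the particular coefficient values, dimensions, and the Monte Carlo approximation of $e_{\alpha,\varepsilon}$ are hypothetical stand-ins, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(2023)
n, p, alpha = 2000, 20, 0.1
beta_star = np.ones(p + 1)                    # includes an intercept (hypothetical values)
gamma = np.r_[1.0, 0.04 * np.ones(p)]         # keeps the scale X'gamma strictly positive
X = np.c_[np.ones(n), rng.choice([-1.0, 1.0], size=(n, p))]   # Rademacher covariates
eps = rng.standard_normal(n)                  # or a heavy-tailed draw, e.g. rng.standard_t(3, n)
y = X @ beta_star + (X @ gamma) * eps
# True ES coefficients: theta*(alpha) = beta* + e_{alpha,eps} * gamma, where
# e_{alpha,eps} is approximated by Monte Carlo below.
draws = rng.standard_normal(1_000_000)
q_eps = np.quantile(draws, alpha)
theta_star = beta_star + draws[draws <= q_eps].mean() * gamma
```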
We first fix the dimension p and the sample size n, and let the quantile level α take three values. Simulation results on the relative $\ell_2$-error $\|\hat\theta - \theta^*(\alpha)\|_2/\|\theta^*(\alpha)\|_2$, averaged over 200 replications, are reported in Tables 1 and 2 under the normal and t noise models, respectively. All four methods have very similar performance across different quantile levels in the normal model, while in the presence of heavy-tailed errors, the proposed robust method achieves consistently more favourable performance. This demonstrates that the use of adaptive Huber regression (in stage two) gains robustness against heavy-tailed errors without compromising statistical efficiency when the error distribution is light-tailed.
Table 1. Mean relative $\ell_2$-error (and standard error), averaged over 200 replications, under the normal noise model; columns correspond to the three quantile levels $\alpha_1, \alpha_2, \alpha_3$ considered in the text.

| Method | $\alpha_1$ | $\alpha_2$ | $\alpha_3$ |
|---|---|---|---|
| 2S-AH | 0.130 (0.003) | 0.150 (0.003) | 0.171 (0.004) |
| 2S-LS | 0.130 (0.003) | 0.150 (0.003) | 0.171 (0.004) |
| joint | 0.130 (0.003) | 0.151 (0.003) | 0.177 (0.004) |
| 2S-oracle | 0.129 (0.003) | 0.149 (0.003) | 0.171 (0.004) |
Table 2. Mean relative $\ell_2$-error (and standard error), averaged over 200 replications, under the t noise model; columns correspond to the three quantile levels $\alpha_1, \alpha_2, \alpha_3$ considered in the text.

| Method | $\alpha_1$ | $\alpha_2$ | $\alpha_3$ |
|---|---|---|---|
| 2S-AH | 0.484 (0.008) | 0.470 (0.009) | 0.429 (0.008) |
| 2S-LS | 0.612 (0.013) | 0.606 (0.016) | 0.532 (0.013) |
| joint | 0.581 (0.012) | 0.567 (0.014) | 0.511 (0.013) |
| 2S-oracle | 0.612 (0.013) | 0.607 (0.016) | 0.532 (0.013) |
In a more extreme setting, Figure 1 shows the boxplots of squared $\ell_2$-errors for three ES estimates (2S-LS, 2S-AH, and joint) under the normal and t models. Although the 2S-LS estimator is easy to compute, it is more sensitive to heavy-tailedness than the joint estimator obtained via FZ loss minimisation. We further compare the proposed method with the joint regression approach in terms of computational efficiency. The computation times in seconds, averaged over 50 independent replications, for the two methods with growing $(n, p)$ are reported in Figure 2. These numerical results show evidence that our R implementation of the robust two-step method can be faster than the esreg library for the joint regression approach by several orders of magnitude.
Figure 1. Boxplots of squared total $\ell_2$-errors (including the intercept, whose true value is 2), based on 500 replications, for three ES regression estimators (2S-LS, 2S-AH, and joint) at the quantile level considered. The mean squared errors of these three estimators are 0.1219, 0.0983, and 0.1119 in the normal model, and 1.7401, 0.8017, and 1.2542 in the t model.
Figure 2. Average elapsed time (in seconds) over 50 replications for the proposed method implemented by a combination of quantreg and adaHuber and the joint regression approach implemented by esreg under the normal and t error models. The sample size is set to grow with the dimension as described in the text. The solid and dashed lines correspond to the proposed method and the joint regression approach, respectively.
To shed some light on the drastic difference in numerical efficiency between the two methods, note that the joint regression approach (Dimitriadis & Bayer, 2019) relies on the Nelder–Mead simplex method, which is sensitive to starting values and not guaranteed to converge to a local minimum. The convergence of the Nelder–Mead method is already very slow for large-scale problems because it is a direct search method based on function comparisons. Moreover, due to its sensitivity to starting values, Dimitriadis and Bayer (2019) proposed to re-optimise the model (several times) with the perturbed parameter estimates as new starting values. This explains, to some extent, the fast increase in the runtime of esreg as both n and p grow. The function in quantreg that fits linear QR is coded in Fortran, and thus is very fast even in larger problems. The computation of adaptive Huber regression is based on the Barzilai–Borwein gradient descent method (Barzilai & Borwein, 1988), implemented via RcppArmadillo in adaHuber.
Next, we construct entrywise (approximate) 95% confidence intervals (CIs) for the expected shortfall regression parameter $\theta^*(\alpha)$. The CIs for the two-step estimators are based on equation (14) (non-robust) and equation (22) (robust), and we use the default option in the esreg package to implement Dimitriadis and Bayer (2019)’s method. To evaluate the accuracy and reliability of the CIs, we compute the empirical coverage probability and interval width based on 500 independent replications, then average over the p slope coefficients. Results under the normal and t noise models are reported in Tables 3 and 4, respectively.
Table 3. Empirical coverage probability and mean width (based on 500 replications) of 95% confidence intervals, averaged over the slope coefficients, under the normal noise model; column pairs correspond to the three quantile levels $\alpha_1, \alpha_2, \alpha_3$ considered in the text.

| Method | Coverage ($\alpha_1$) | Width ($\alpha_1$) | Coverage ($\alpha_2$) | Width ($\alpha_2$) | Coverage ($\alpha_3$) | Width ($\alpha_3$) |
|---|---|---|---|---|---|---|
| 2S-AH | 0.950 | 0.595 | 0.949 | 0.660 | 0.948 | 0.744 |
| joint | 0.946 | 0.584 | 0.944 | 0.651 | 0.942 | 0.740 |
| 2S-LS | 0.950 | 0.595 | 0.949 | 0.661 | 0.948 | 0.745 |
Table 4. Empirical coverage probability and mean width (based on 500 replications) of 95% confidence intervals, averaged over the slope coefficients, under the t noise model; column pairs correspond to the three quantile levels $\alpha_1, \alpha_2, \alpha_3$ considered in the text.

| Method | Coverage ($\alpha_1$) | Width ($\alpha_1$) | Coverage ($\alpha_2$) | Width ($\alpha_2$) | Coverage ($\alpha_3$) | Width ($\alpha_3$) |
|---|---|---|---|---|---|---|
| 2S-AH | 0.947 | 3.633 | 0.946 | 2.790 | 0.948 | 2.243 |
| joint | 0.959 | 5.771 | 0.959 | 3.571 | 0.954 | 2.872 |
| 2S-LS | 0.952 | 4.521 | 0.950 | 3.397 | 0.953 | 2.687 |
Once again, all three methods perform similarly under normal errors, while the robust approach gives the narrowest CIs while maintaining the desired coverage level under the heavy-tailed t errors. Together, the results in Tables 2 and 4 demonstrate the robustness of the proposed method, as indicated by the theoretical investigations in Section 4.
6.2 Data application I: health disparity
Iron deficiency is one of the most common nutritional deficiencies worldwide and is one of the leading causes of anaemia (Camaschella, 2015). Being able to detect iron deficiency is essential in medical care for patients with inflammation, infection, or chronic disease. It is also important in preventive care, since iron deficiency can be a sign of a more serious illness such as gastrointestinal malignancy (Rockey & Cello, 1993). One measure of iron deficiency that has proven to be useful is the soluble transferrin receptor (sTRP), a carrier protein for transferrin (Mast et al., 1998). A high value of sTRP indicates iron deficiency.
The scientific goal here is to assess whether there is any disparity in sTRP levels among four different ethnic groups: Asian, Black, Mexican American, and White. To this end, we analyse a data set obtained from the National Health and Nutrition Examination Survey from 2017 to 2020 (pre-covid). In this data set, the response variable sTRP was measured for female participants ranging in age from 20 to 49 years. The covariates of interest are three dummy variables that correspond to Asian, Mexican American, and Black, using White as the baseline. We adjust for demographic variables such as age, education level, and healthy diet throughout our analysis. For simplicity, we remove all participants with missing values on the covariates when forming the final data set.
As an exploratory analysis, in Figure 3 we plot the quantile curves of the sTRP measurements at levels from 50% to 99% for each of the four ethnic groups. In this data set, the sTRP values range from 1.24 to 35.1 mg/L. We note that the normal range for females is between 1.9 and 4.4 mg/L (Kratovil et al., 2007), and values much higher than 4.4 mg/L indicate severe iron deficiency. We see from Figure 3 that the majority of the population have sTRP levels within the normal range. However, there are large disparities between the Black group and the other three ethnic groups, reflected in the higher quantiles of the marginal distributions of sTRP.
Figure 3. The soluble transferrin receptor levels (mg/L) vs. quantile levels (ranging from 0.5 to 0.99) for the female population in four different ethnic groups: Asian, Black, Mexican American, and White. The orange horizontal dashed line indicates the upper bound of the normal range (1.9–4.4 mg/L) for transferrin receptors among females.
To quantify the statistical significance of the aforementioned disparity, we fit a robust expected shortfall regression at level 0.75 (upper tail), with the robustification parameter tuned by the procedure described in Section 3.2. This is equivalent to fitting the proposed 2S-AH method at level 0.25 (see Section 3) after flipping the signs of both the response and the covariates. We also implement standard QR at the same level.
Table 5 reports the estimated coefficients and the associated 95% confidence intervals for the three indicator covariates for the ethnic groups Asian, Mexican American, and Black, using White as a baseline. We see that both the quantile and robust expected shortfall regression methods are able to detect a health disparity between Black and White. Specifically, the estimated robust ES regression coefficient and 95% CI (in parentheses) is 3.03 (1.88, 4.19) vs. its QR counterpart’s 0.86 (0.37, 1.35). With the use of QR (at level 0.75), we do not observe a statistically significant health disparity between Asian and White. In contrast, 2S-AH detects a health disparity between Asian and White with an estimated coefficient of 2.34 (0.59, 4.09). We also see that QR detects a health disparity between Mexican American and White, but the effect size is close to zero. In summary, ES regression complements QR, and can be more effective, as a tool to detect health disparity, especially when the disparity only occurs in the tail of the conditional distribution.
Table 5. The estimated regression coefficients (and 95% confidence intervals) for three dummy variables, Asian, Black, and Mexican American, using White as a baseline.

| | Asian | Black | Mexican American |
|---|---|---|---|
| QR | 0.31 (−0.02, 0.64) | 0.86 (0.37, 1.35) | 0.22 (0.01, 0.42) |
| ES regression (2S-AH) | 2.34 (0.59, 4.09) | 3.03 (1.88, 4.19) | 0.13 (−0.76, 1.03) |

Note. Results of the upper-tail robust ES regression method 2S-AH and standard QR at quantile level 0.75 are reported.
6.3 Data application II: JTPA
We consider the JTPA study, a publicly funded training programme that provides training for adults with the goal of improving their earnings. Specifically, we focus on the Title II sub-programme of the JTPA study, which is mainly offered to adults with barriers to employment and to out-of-school youths. This data set was previously analysed in Bloom et al. (1997). It consists of 30 months of accumulated earnings for 6,102 females and 5,102 males, with 16 covariates related to the demographics of the individuals, such as age and race, together with an indicator variable for whether the individual received JTPA training. After removing individuals with zero income, there are 4,576 males and 5,296 females. Our goal is to assess the effect of JTPA training on participants’ earnings, with an emphasis on the low-income population that is employed, for both the male and female sub-groups.
To this end, we fit an expected shortfall regression model using the proposed robust method at three small quantile levels α. The robustification parameter τ is selected automatically via the procedure described in Section 3.2. Specifically, we regress the 30-month accumulated earnings on the JTPA training indicator to assess the effect of JTPA training on low-income individuals, adjusting for whether individuals completed high school, race, Hispanic/non-Hispanic origin, marital status, whether the individual worked less than 13 weeks in the past year, and age. We report the estimated regression coefficient for the binary variable JTPA training and its associated 95% confidence intervals. The results are summarised in Table 6.
Table 6. The estimated regression coefficient of the binary predictor JTPA training (and its 95% confidence interval) for the proposed robust method and the standard QR at the three quantile levels $\alpha_1 < \alpha_2 < \alpha_3$ considered in the text.

| Male sub-group | $\alpha_1$ | $\alpha_2$ | $\alpha_3$ |
|---|---|---|---|
| QR | 465 (255, 675) | 882 (603, 1,161) | 2,031 (1,431, 2,603) |
| ES regression (2S-AH) | 283 (149, 418) | 552 (333, 771) | 1,093 (641, 1,546) |
| Female sub-group | | | |
| QR | 202 (76, 328) | 480 (307, 653) | 1,086 (719, 1,452) |
| ES regression (2S-AH) | 123 (41, 205) | 300 (146, 453) | 672 (385, 958) |

Note. Results are rounded to the closest integer.
From Table 6, we see that the 95% confidence intervals for the robust method do not contain zero at all three quantile levels considered. This indicates that the JTPA training is statistically effective in improving earnings for the low-income population. Specifically, for the male sub-population, the estimated ES effects of JTPA training are 283, 552, and 1,093 dollars at the three quantile levels (from smallest to largest), respectively. To further assess whether the estimated effects are scientifically meaningful, we compute the average 30-month accumulated earnings below each of the three quantile levels for the male sub-group, which are 214, 566, and 1,496 dollars, respectively. We find that the JTPA training more than doubles the average income for individuals with income below the quantile levels 0.05 and 0.1, and becomes less effective for individuals with higher income. Similar findings are also observed for the female sub-group.
7 Conclusion and discussions
This paper considers expected shortfall regression under a joint quantile and ES model recently proposed in Dimitriadis and Bayer (2019) and Patton et al. (2019). The existing approach is based on a joint M-estimator, defined as the global minimum of any member of a class of strictly consistent joint loss functions (Fissler & Ziegel, 2016) over some compact set. Since the loss function is non-differentiable and non-convex, the computation of such a joint M-estimator is intrinsically difficult, especially when the dimensionality is large. To circumvent the aforementioned challenge, Barendse (2020) proposed a two-step procedure for estimating the joint quantile and ES model based on Neyman orthogonalisation: the first step involves fitting the QR, and the second step employs the Neyman-orthogonal scores to estimate the ES parameters. Due to the use of the least squares loss in the second step, the resulting estimator is sensitive to heavy-tailed error distributions.
To address the robustness and computation concerns simultaneously, we propose a robust two-step method that applies adaptive Huber regression (Zhou et al., 2018) in the second step. The key is the use of a diverging robustification parameter for the bias-robustness trade-off, tuned by a convenient data-driven mechanism. The proposed method can be efficiently implemented by a combination of the R packages quantreg/conquer and adaHuber. The Python code that implements both our proposed methods and the existing non-convex optimisation-based methods (Dimitriadis & Bayer, 2019; Peng & Wang, 2022) is now publicly available at https://github.com/WenxinZhou/conquer. We establish a finite-sample theoretical framework for this two-step method, including deviation bounds, Bahadur representations and (uniform) Gaussian approximations, in which the dimension of the model, p, may depend on and increase with the sample size, n. Robust confidence intervals/sets are also constructed. Numerical experiments further demonstrate that the proposed robust ES regression approach achieves satisfactory statistical performance, a high degree of robustness (against heavy-tailed data), and superior computational efficiency and stability. Through two data applications on health disparity and the JTPA study, we illustrate that ES regression complements QR as a useful tool to explore heterogeneous covariate effects on the average tail behaviour of the outcome.
Although we restrict attention to (joint) linear models in this work, our non-asymptotic theory and the underpinning techniques pave the way for analysing (i) series/projection estimators under joint non-parametric quantile-ES models and (ii) penalised estimators under high-dimensional sparse quantile-ES models. We leave these extensions to future research. One limitation in our data analysis for the JTPA study is that we do not account for potential selection bias. Specifically, as pointed out by Abadie et al. (2002), out of all subjects that were assigned to participate in the training programme, only approximately 60% of them (compliers) actually committed to the training programme. These individuals may simply have higher motivation in improving their earnings, and thus, the training status is likely positively correlated with potential income earnings. Generalising the proposed method to estimate the complier expected shortfall treatment effect, using an instrumental variable approach previously considered in Abadie et al. (2002), is another direction for future research.
The ES regression methods considered in this paper are suited for a fixed quantile level $\alpha \in (0, 1)$, independent of the sample size. For extreme quantile levels $\alpha = \alpha_n \to 0$ or $1$ as $n \to \infty$, both the FZ loss minimisation method (see equations (5) and (6)) and two-step procedures perform poorly because observations become scarce at that level, i.e. $n\alpha_n$ is not large enough. In fact, if the dimension p is fixed, Online Supplementary Theorem A.1 and Theorem 1 imply that the two-step ES regression estimates, robust and non-robust, are consistent provided $n\alpha_n \to \infty$ as $n \to \infty$. In the case where $n\alpha_n$ is bounded, these methods are no longer useful and one may need to resort to extreme value theory (de Haan & Ferreira, 2006; H. J. Wang et al., 2012), which provides the statistical tools for a feasible extrapolation into the tail of the variable of interest. A more detailed discussion on modelling the extremes is deferred to the Online Supplementary Section E.
Acknowledgments
We sincerely thank the Joint Editor, Qiwei Yao, as well as the Associate Editor and the three reviewers for their numerous constructive comments and valuable suggestions, which have significantly helped us improve the quality of this work.
Funding
X.H. is supported by NSF Grants DMS-1914496 and DMS-1951980. K.M.T. is supported by NSF DMS-2113346 and NSF CAREER DMS-2238428. W.-X.Z. acknowledges the support from NSF DMS-2113409.
Data availability
The two data sets used in Sections 6.2 and 6.3 are publicly available at https://www.cdc.gov/nchs/nhanes/index.htm and https://economics.mit.edu/people/faculty/josh-angrist/angrist-data-archive, respectively.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series B.
Author notes
Conflict of interest: None declared.