Abstract

Expected shortfall (ES), also known as superquantile or conditional value-at-risk, is an important measure in risk analysis and stochastic optimisation and has applications beyond these fields. In finance, it refers to the conditional expected return of an asset given that the return is below some quantile of its distribution. In this paper, we consider a joint regression framework recently proposed to model the quantile and ES of a response variable simultaneously, given a set of covariates. The current state-of-the-art approach to this problem involves minimising a non-differentiable and non-convex joint loss function, which poses numerical challenges and limits its applicability to large-scale data. Motivated by the idea of using Neyman-orthogonal scores to reduce sensitivity to nuisance parameters, we propose a statistically robust and computationally efficient two-step procedure for fitting joint quantile and ES regression models that can handle highly skewed and heavy-tailed data. We establish explicit non-asymptotic bounds on estimation and Gaussian approximation errors that lay the foundation for statistical inference, even with increasing covariate dimensions. Finally, through numerical experiments and two data applications, we demonstrate that our approach well balances robustness, statistical, and numerical efficiencies for expected shortfall regression.

1 Introduction

Expected shortfall (ES), also known as superquantile or conditional value-at-risk (VaR), has been recognised as an important risk measure with versatile applications in finance (Acerbi & Tasche, 2002; Rockafellar & Uryasev, 2002), management science (Ben-Tal & Teboulle, 1986; Du & Escanciano, 2017), operations research (Rockafellar et al., 2014; Rockafellar & Uryasev, 2000), and clinical studies (He et al., 2010). For example, in finance, expected shortfall refers to the expected return of an asset or investment portfolio conditional on the return being below a lower quantile of its distribution, namely its VaR. In their Fundamental Review of the Trading Book (Basel Committee, 2016, 2019), the Basel Committee on Banking Supervision confirmed the replacement of VaR with ES as the standard risk measure in banking and insurance.

Let Y be a real-valued random variable with finite first-order absolute moment, E|Y|<, and let FY be its cumulative distribution function (CDF). For any α(0,1), the quantile and ES at level α are defined as

(1)

respectively. If FY is continuous, the α-level ES is equivalently given by (see, e.g. Lemma 2.16 of McNeil et al., 2015)

(2)

For instance, in socio-economics applications, Y is the income and ESα(Y) can be interpreted as the average income for the sub-population whose income falls below the α-quantile of the entire population. We refer the reader to Chapter 6 of Shapiro et al. (2014) and Rockafellar and Royset (2014) for a thorough discussion of ES and its mathematical properties.

With the increasing focus on ES and its desired properties as a risk measure, it is natural to examine the impact of a p-dimensional explanatory vector X, on the tail behaviour of Y through ES. One motivating example is the Job Training Partnership Act (JTPA), a large publicly funded training programme that provides training for adults with barriers to employment and out-of-school youths. The goal is to examine whether the training programme improves future income for adults with low-income earnings (Bloom et al., 1997), for which quantile regression (QR)-based approaches have been proposed (Abadie et al., 2002; Chernozhukov & Hansen, 2008). For example, the 0.05-quantile of the post-programme income refers to the highest income earning of those who have the 5% lowest income among the entire population, while the 0.05-ES concerns the average income earning within this sub-population and therefore is more scientifically relevant in the JTPA study.

Compared to the substantial body of literature on QR, extant works on ES estimation and inference in the presence of covariates are scarce. We refer the reader to Scaillet (2004), Cai and Wang (2008), Kato (2012), Linton and Xiao (2013), and Martins-Filho et al. (2018) for non-parametric conditional ES estimation, and more recently to Dimitriadis and Bayer (2019), Patton et al. (2019), and Barendse (2020) in the context of (semi-)parametric models. As suggested in Patton et al. (2019), this is partly because regulatory interest in ES as a risk measure is only recent, and also due to the fact that this measure is not elicitable (Gneiting, 2011). Let P be a class of distributions on Rd. We say that a statistical functional θ:PD with DRp (p1) is elicitable relative to the class P if there exists a loss function ρ:Rd×RpR such that θ(F)=argminθDEZFρ(Z,θ) for any FP. Here, EZF means that the expectation is taken with respect to the random variable Z that follows the distribution F. For example, the mean is elicitable using the L2-loss, and the median is elicitable using the L1-loss. Although the ES is not elicitable on its own, it is jointly elicitable with the quantile using a class of strictly consistent joint loss functions (Fissler & Ziegel, 2016). Based on this result, Dimitriadis and Bayer (2019) and Patton et al. (2019) proposed a joint regression model for the conditional α-level quantile and ES of Y, given the covariates XRp. In this work, we focus on (conditional) linear joint quantile-ES models:

(3)

Equivalently, we have ε=YXTβ* and ξ=YXTθ*, where the conditional α-quantile of ε and the conditional α-level expected shortfall of ε, given XRp, are zero. More generally, one may allow the quantile and the ES models to depend on different covariate vectors Xq and Xe, respectively. In this case, the conditional α-quantile and α-ES of ε and ξ, respectively, given X=(XqT,XeT)T, are assumed to be zero.

To jointly estimate β* and θ*, Dimitriadis and Bayer (2019) and Patton et al. (2019) considered an M-estimator, defined as the global minimum of any member of a class of strictly consistent joint loss functions over some compact set (Fissler & Ziegel, 2016). The joint loss function, which will be specified in equation (5), is non-differentiable and non-convex. Dimitriadis and Bayer (2019) employed the derivative-free Nelder–Mead algorithm to minimise the resulting non-convex loss, which is a heuristic search method that may converge to non-stationary points. From a statistical perspective, they further established consistency and asymptotic normality for the global minima. However, from a computational perspective, finding the global minimum of a non-convex function is generally intractable: approximating the global minimum of a k-times continuously differentiable function f:RpR to ϵ-accuracy requires at least as many as (1/ϵ)p/k evaluations (ignoring problem-dependent constants) of the function and its first k-derivatives (Nemirovski & Yudin, 1983). The lack of differentiability makes this problem even more challenging numerically. To mitigate the computational effort, Barendse (2020) proposed a two-step procedure by first estimating the quantile parameters via standard QR, followed by least squares regression with generated response variables. Although computationally efficient, the ensuing estimator is sensitive to heavy-tailed error distributions due to the use of L2-loss for fitting possibly highly skewed data in the second step; see Section 3.1 for a rigorous statement.

In this paper, we propose a new two-stage method for robust estimation and inference under a joint quantile and expected shortfall regression model (3), with a particular focus on the latter. Compared to existing approaches, our proposed method is robust against heavy-tailed errors without compromising statistical efficiency under light-tailed distributions. Computationally, our method can be implemented via fast and scalable gradient-based algorithms. The main contributions of this work are summarised as follows:

  • Our method builds upon a recent approach to joint quantile and expected shortfall regression via a two-step procedure (Barendse, 2020). However, a general non-asymptotic theory for this approach has yet to be established. To fill this gap, we establish a finite-sample theoretical framework for the two-step ES estimator when the dimension of the model, p, increases with the number of observations, n. Specifically, we provide explicit upper bounds, as a function of (n,p), on the estimation error (under L2-risk) and (uniform) Gaussian approximation errors; see Online Supplementary Section A. We also construct asymptotically valid (entrywise) confidence intervals for the ES parameters. The main computational effort of this two-step procedure is the QR fit in stage one. Therefore, we recommend using the convolution-smoothed QR method (Fernandes et al., 2021), which can be solved using fast first-order algorithms that are scalable to very large-scale problems (He et al., 2023). Our non-asymptotic theory allows the dimension p to grow with the sample size, which paves the way for analysing series/projection estimators under joint non-parametric quantile-ES models (Belloni et al., 2019) and penalised estimators under high-dimensional sparse models.

  • The standard two-step estimator is a least squares estimator (LSE) with generated response variables. As a result, it is sensitive to the tails of the distribution of Y. We propose a robust ES regression method that applies adaptive Huber regression (Zhou et al., 2018) in the second step to address this issue. The resulting estimator achieves sub-Gaussian deviation bounds even when the (conditional) distribution of Y|X only has Pareto-like tails. To achieve a trade-off between bias and robustness, we propose using a diverging robustification parameter τ=τ(n,p)>0. In practice, we choose this hyper-parameter using a data-driven mechanism (L. Wang et al., 2021), guided by the non-asymptotic results presented in Section 4 and inspired by the censored equation approach introduced in Hahn et al. (1990). We have also developed efficient algorithms to compute standard and robust two-step ES estimators under additional constraints. These constraints ensure that the fitted ES does not exceed the fitted quantile at each observation. We refer the reader to the Online Supplementary Section D for more details.

  • We conduct thorough numerical comparisons between the two-step estimator and the proposed robust variant with the joint M-estimator of Dimitriadis and Bayer (2019) on large synthetic data sets generated from a location-scale model, with both light- and heavy-tailed error distributions. To compute the joint M-estimator, we use the R package esreg, which is available at https://cran.r-project.org/package=esreg. To implement the proposed robust two-step procedure, we use a combination of R packages, quantreg or conquer and adaHuber. Our results show that the proposed robust ES regression approach achieves satisfying statistical performance, a higher degree of robustness against heavy-tailed error distributions, and superior computational efficiency and stability. We also demonstrate the effectiveness of our approach through numerical experiments and two real data examples.

In this work, the term ‘robustness’ specifically refers to the robustness against heavy-tailed distributions, as revealed by non-asymptotic deviation analysis dating back to Catoni (2012). In Catoni (2012)’s study of univariate mean estimation, it was found that while the sample mean has the optimal minimax mean squared error among all mean estimators, its deviation is worse for non-Gaussian samples than for Gaussian ones. Moreover, the worst-case deviation is sub-optimal when the sampling distribution has heavy tails. To be more specific, let X1,,Xn be independent copies of X with mean μ and variance σ2>0. Applying Chebyshev’s inequality to the empirical mean X¯n=(1/n)i=1nXi yields a polynomial-type deviation bound: |X¯nμ|σ1/(nδ) holds with probability at least 1δ for any δ(0,1). Furthermore, if X is sub-Gaussian, meaning that Eeλ(Xμ)eσ2λ2/2 for all λR, then X¯n can be referred to as a sub-Gaussian estimator, as it satisfies with probability 1δ that |X¯nμ|σ2log(2/δ)/n. In order to obtain sub-Gaussian deviations under a condition of bounded second moments, Fan et al. (2017) considered the Huber mean estimator μ^τ=argminθRi=1nτ(Xiθ), which is closely related to the method proposed by Catoni (2012). Here τ() denotes the Huber loss; see definition (18). Theorem 5 in Fan et al. (2017) establishes that for any vσ, μ^τ with τ=vn/log(1/δ) and δ(0,1) satisfies |μ^τμ|4vlog(1/δ)/n with probability at least 12δ as long as n8log(1/δ). While Fan et al. (2017) does not explicitly state this, the divergence of τ in this context is also intended to strike a balance between bias and robustness. In comparison to (univariate) mean estimation, the problem of regression with growing dimensions and generated response variables present new technical challenges and requires more nuanced analysis. Nevertheless, the underlying phenomenon is quite similar.

1.1 Notation

For any two vectors u=(u1,,uk)T and v=(v1,,vk)TRk, we define their inner product as uTv=u,v=j=1kujvj. We use p(1p) to denote the p-norm in Rk: up=(i=1k|ui|p)1/p for p1 and u=max1ik|ui|. Let Sk1={uRk:u2=1} be the unit sphere in Rk under 2-norm. Given a positive semi-definite matrix ARk×k and uRk, let uA:=A1/2u2. Given an event/subset A, 1(A) or 1A denotes the zero-one indicator function for A. For two real numbers a and b, we write ab=min{a,b} and ab=max{a,b}. For two sequences {an}n1 and {bn}n1 of non-negative numbers, we write anbn if anCbn for some constant C>0 independent of n, anbn if bnan, and anbn if anbn and anbn.

2 Preliminaries and background

2.1 The joint regression framework

Assume we observe a sequence of data vectors {(Yi,Xi)}i=1n, where YiR is the response variable, and XiRp is a p-dimensional vector of explanatory variables (covariates). For some fixed probability level α(0,1), denote the conditional α-level quantile and ES of Yi given the covariates Xi as Qα(Yi|Xi) and ESα(Yi|Xi), respectively. For the latter, we adhere to the definition ESα(Yi|Xi)=E{Yi|YiQα(Yi|Xi),Xi}.

We consider the joint regression framework introduced in Dimitriadis and Bayer (2019) for modelling the conditional quantile and expected shortfall. For some probability level α(0,1), assume that

(4)

where β*,θ*Rp are the unknown true underlying parameters for quantile and ES, respectively. Fissler and Ziegel (2016) explained that quantile and ES are jointly elicitable and proposed a class of strictly consistent joint loss functions for quantile and ES estimation. Let G1 be an increasing and integrable function, and let G2 be a three times continuously differentiable function such that both G2 and its derivative G2=G2 are strictly positive. The proposed joint loss function in Fissler and Ziegel (2016) takes the form

(5)

This general form also includes the joint loss function proposed by Acerbi and Székely (2014) by taking G1(x)=(W/2)x2 for some WR and G2(x)=αx2/2.

In the regression framework with a fixed number of covariates, Dimitriadis and Bayer (2019) established the consistency and asymptotic normality of the M-estimator (β~T,θ~T)T, defined as

(6)

where ΘRp is the parameter space, assumed to be compact, convex, and has non-empty interior. The main challenge of the aforementioned approach is that the objective function in equation (6) is non-differentiable and non-convex for any feasible choice of the functions G1 and G2 (Fissler & Ziegel, 2016). Note from definition (1) that the expected shortfall depends on the quantile, not vice versa. The estimation and inference of θ* is thus the main challenge. It is, however, infeasible to estimate a single regression model for ES through M-estimation, that is, by minimising some strictly consistent loss function (Dimitriadis & Bayer, 2019).

In the joint regression framework, if the main goal is to estimate and forecast ES, then β* can be naturally viewed as a nuisance parameter. Motivated by the idea of using Neyman-orthogonal scores to reduce sensitivity with respect to nuisance parameters (Barendse, 2020; Chernozhukov et al., 2018; Neyman, 1979) proposed a two-stage procedure that bypasses non-convex optimisation problems. In the first stage, an estimate β^ of β* is obtained via standard QR. The second step employs an orthogonal score with fitted thresholding quantiles to estimate θ*. The key observation is as follows. Define the function

(7)

where s0 is given in equation (5). Under model (4), we have ψ0(β*,θ*;X)=0 almost surely over X. Let FY|X be the conditional distribution function of Y given X. Provided that FY|X is continuously differentiable, taking the gradient with respect to β on both sides of the above equality yields

We hence refer to the following property:

(8)

as Neyman orthogonality.

2.2 Two-step ES estimation via Neyman-orthogonal score

We start with a detailed overview of the two-step approach proposed by Barendse (2020) using the Neyman-orthogonal score (7) under the joint model (4). In Section 3.1, we will develop a non-asymptotic (finite-sample) theory for the two-step ES estimator, θ^, under the regime in which p is allowed to increase with the sample size n. We further develop asymptotic normality results for individual coordinates, or more generally linear projections, of θ^, in the increasing-dimension regime ‘p2/n=o(1)’. Our non-asymptotic results and techniques pave the way for analysing high-dimensional sparse quantile-ES models.

The first step involves computing the standard QR estimator of β*:

(9)

where ρα(u)={α1(u<0)}u is the check function (Koenker & Bassett, 1978). The second step is motivated by the orthogonal score s0 in equation (5). Specifically, let L^(β,θ)=(1/n)i=1nsi2(β,θ) be the joint empirical loss with

(10)

Given β^ obtained from the first step, the ES estimator θ^ of θ* is computed as

(11)

For any β fixed, the function θL^(β,θ) is convex with gradient and Hessian given by

respectively. By the first-order condition, the ES regression estimator θ^ satisfies the moment condition θL^(β^,θ^)=0, and has a closed-form expression

(12)

provided that X=(X1,,Xn)TRn×p is full-rank.

 
Remark 1

When p is large, we suggest using the convolution-smoothed QR (conquer) estimator (Fernandes et al., 2021; He et al., 2023) in the first step, which can be computed by fast and scalable gradient-based algorithms. Given a smoothing parameter/bandwidth h>0, the conquer estimator β^h minimises the convolution-smoothed loss function βi=1nρα,h(YiXiTβ) with ρα,h(u)=(ρα*Kh)(u)=ρα(u)Kh(vu)dv, where Kh(u):=(1/h)K(u/h) for some symmetric, non-negative kernel function K, and * is the convolution operator. We refer to Fernandes et al. (2021) and He et al. (2023) for more details, including both asymptotic and finite-sample properties of β^h when p is fixed and growing as well as the bandwidth selection.

Define p×p matrices Σ=E(XXT) and Ω=E(ω2XXT) with ω:=(YXTβ*)1(YXTβ*)+αXT(β*θ*) satisfying E(ω|X)=0 under model (4). Provided that p=pn satisfies p2/n0, we will show in the Online Supplementary Theorem A.3 that θ^j is asymptotically normal:

As a direct implication, an asymptotically valid entrywise confidence interval for θ* can be constructed as follows. Recall that (β^,θ^) is the joint quantile-ES regression estimators given in equations (9) and (11), respectively. Define the estimated ‘residuals’ as

(13)

We then use the sample analogue of Σ and a plug-in estimator of Ω, namely, Σ^=i=1nXiXiT/n and Ω^=i=1nω^i2XiXiT/n. Consequently, we construct (approximate) 95% confidence interval for each coefficient as

(14)

3 Robust expected shortfall regression

3.1 Motivation

The two-step estimator θ^ given in equation (12) is essentially an LSE with generated response variables. While the two-step procedure is computationally efficient and enjoys nice asymptotic properties, due to the use of the least squares type loss, it is sensitive to outliers or heavy-tailed data that is ubiquitous in various areas such as climate, insurance claims, and genomics data. In particular, heavy-tailedness has become a well-known stylised fact of financial returns and stock-level predictor variables (Cont, 2001). Since the expected shortfall is a quantity that describes the tail behaviour of a distribution, it is important to construct an estimator that is robust to the power-law or Pareto-like tails.

To motivate the need for a robust ES estimator, we start with the non-regression setting in which Xi1. The two-step ES estimator (12) can then be simplified as

(15)

where F^ is the empirical CDF of Y and Q^α=F^1(α) is the sample quantile. The estimator ES^α (15) coincides with the ES estimate (4) in Bassett et al. (2004), although the latter is motivated differently by the following property:

Since |F^(Q^α)α|1/n, up to higher order terms, ES^α equals (αn)1i=1nYi1{YiQ^α} which, by the consistency of sample quantiles, is first-order equivalent to the ‘oracle’ ES estimator ESαora^:=(αn)1i=1nYi1{YiQα(Y)}.

Since the truncated variable Yi1{YiQα(Y)} can be highly left-skewed with heavy tails, the corresponding empirical mean is sensitive to the (left) tails of the distribution of Y, and hence lacks robustness against heavy-tailed data. Specifically, let X1,,Xn be i.i.d. random variables with mean μ and variance σ2>0. When Xi is sub-Gaussian (i.e. E(eλXi)eλ2σ2/2 for any λR), it follows from the Chernoff bound (Chernoff, 1952) that

(16)

In other words, the sample mean X¯n=(1/n)i=1nXi satisfies the sub-Gaussian deviation bound. On the other hand, the following proposition provides a lower bound for the deviations of the empirical mean (1/n)i=1nYi1{YiQα(Y)} when the distribution of Y is the least favourable among all heavy-tailed distributions with mean zero and variance σ2.

 
Proposition 1
For any value of the standard deviation σ>0 and any probability level δ(0,e1], there exists some distribution with mean zero and variance σ2 such that for any α(0,1), the i.i.d. sample {Yi}i=1n of size n drawn from it satisfies
(17)
as long as neδ/α, where Qα=Qα(Y) is the αth quantile of Y.

Together, the upper and lower bounds (16) and (17) show that the worst case deviations of the empirical mean are sub-optimal when the underlying distribution is heavy-tailed (as opposed to having Gaussian-like thin tails). If Y follows a heavy-tailed distribution, such as the t- or Pareto distributions, then the left-truncated variables Zi:=Yi1{YiQα(Y)} have not only heavy but also asymmetric tails. In this case, the empirical mean (αn)1i=1nZi can be a sub-optimal estimator of ESα(Y).

3.2 Robust estimation and inference via adaptive Huber regression

To robustify the ES regression estimator (12) in the presence of skewed heavy-tailed observations, we utilise the idea of adaptive Huber regression in Zhou et al. (2018). For some τ>0, the Huber loss (Huber, 1973) takes the form

(18)

We propose a robust/Huberised ES regression estimator defined as

(19)

where si(β^,θ) is as defined in equation (10), and τ>0 is a robustification parameter that should be calibrated adaptively from data.

To see this, we consider the oracle Huber ES estimator defined as

(20)

where Zi=(YiXiTβ*)1(YiXiTβ*)+αXiTβ*. For any τ>0, θ^τora is an M-estimator of its population counterpart

Let ψτ(t)=τ(t)=sign(t)min(|t|,τ) be the derivative of the Huber loss. By the convexity of the Huber loss, θτ* must satisfy the first-order condition E{ψτ(ZiαXiTθτ*)Xi}=0. On the other hand, define the ES deviations ωi=ZiαXiTθ*, satisfying E(ωi|Xi)=0 and E(ωi)=0. Since the conditional distribution of ωi given Xi is asymmetric, in general we have E{ψτ(ZiαXiTθ*)Xi}=E{ψτ(ωi)Xi}0, which in turn implies that θτ*θ*. We thus refer to their difference under the 2-norm, θτ*θ*2, as the robustification bias. Proposition 2 provides an upper bound for the robustification bias, which depends on τ and some moment parameters. In particular, τ needs to diverge for the robustification bias to diminish.

 
Proposition 2

Assume that ε:=YXTβ* satisfies varX(ε0)σ¯2 almost surely for some constant σ¯>0, and that κ4=supuSp1Eu,Σ1/2X4<, where Σ=E(XXT) is positive definite. Then, for any τ2κ41/4σ¯, we have θτ*θ*Σ2σ¯2/(ατ).

In Section 4, we investigate the finite-sample properties of the robust ES estimator θ^τ obtained via equations (9) and (19): our results include a deviation inequality for θ^τθ*Σ (Theorem 1), the Bahadur representation (Theorem 2), and a Berry–Esseen bound for linear projections of θ^τ and θ^τora (Theorem 3). With a properly chosen τ that is of order τσ¯n/p, we will show that αθ^τθ*Σσ¯p/n with high probability. Moreover, for any deterministic vector aRp, the standardised statistic αna,θ^τθ*/ϱa converges in distribution to N(0,1), where ϱa2=aTΣ1ΩΣ1a and ω=(YXTβ*)1(YXTβ*)+αXT(β*θ*). Our theoretical analysis reveals two attractive properties of the adaptive Huberised ES estimator θ^τ: (i) the non-asymptotic deviation upper bounds for θ^τ are much smaller in order than those for θ^ at any given confidence level and (ii) the asymptotic relative efficiency of θ^τ to θ^ is one. Moreover, Theorem 3 shows that the two-step robust estimator (with estimated conditional quantiles) is asymptotically equivalent to the oracle Huberised estimator (20) (assuming β* were known). This further justifies the usefulness of the Neyman-orthogonal score, which makes the QR estimation error first-order negligible.

Consistent estimators of Σ and Ω=E(ω2XXT) are useful for statistical inference. Given the pair of quantile-ES regression estimators (β^,θ^τ), with a slight abuse of notation we use ε^i and ω^i to denote the fitted QR and ES residuals as in equation (13) except with θ^ replaced by θ^τ. As discussed in Section 2.2, a naive estimate of Ω is Ω^=(1/n)i=1nw^i2XiXiT. In the presence of heavy-tailed errors εi, even the ‘oracle’ estimate Ω~=(1/n)i=1nwi2XiXiT performs poorly and tends to overestimate. Motivated by Huber regression, we further propose a simple truncated estimator of Ω given by

(21)

where γ=γ(n,p)>0 is a second robustification parameter. Consequently, we construct approximate 95% robust confidence intervals for θj*’s as

(22)

The convergence rate of Ω^γ with a suitably chosen γ will be discussed in Section 4.

As previously discussed, the robustification parameter τ plays a crucial role in achieving a balance between bias and robustness against heavy-tailed error distributions. This balance is necessary because of the asymmetric nature of the ES residual ω=ε0+αXT(β*θ*) with ε=YXTβ*. Assuming that the (conditional) variance of ε=ε0 is bounded, i.e. varX(ε)σ¯2 (almost surely) for some σ¯>0, Theorem 1 suggests that to achieve a tight deviation bound at the 1δ confidence level for any given δ(0,1), the robustification parameter τ=τ(n,p) should be of order σ¯n/(p+logδ1). In practice, the scale of σ¯ is typically unknown. A useful heuristic is to substitute it with the sample standard deviation of the negative QR residuals {ε^i,=min(YiXiTβ^,0)}i=1n, which we denote by σ^. Here, β^ refers to the first-stage QR estimator. Using τ^=σ^n/(p+logδ1) as a data-driven proxy for τ, the resulting estimator is also location and scale equivariant.

In the following, we present a refined data-driven approach for selecting τ that consistently outperforms the previously mentioned rule of thumb in the numerical experiments conducted in Section 6. This approach is adapted from the method proposed in L. Wang et al. (2021) and draws inspiration from the censored equation approach originally introduced by Hahn et al. (1990) as a proof technique for deriving robust weak convergence theory for self-normalised sums. Note that for each τ>0, the Huber ES estimator θ^τ can be defined equivalently as the solution to the estimating equation i=1nψτ(Z^iαXiTθ)Xi=0, θRd, where Z^i=ε^i,+αXiTβ^ are the generated response variables, and β^ denotes the initial QR estimator. Since the optimal choice of τ is proportional to the noise scale, we propose to estimate θ* and the unknown noise scale simultaneously by solving the following system of equations for (θ,s)Rp×(0,):

where k=k(n,p)=n/(p+logn). Since the Huber loss is convex, the first (vector) equation in θ with s fixed can be solved using either the iteratively reweighted least squares algorithm or the Broyden–Fletcher–Goldfarb–Shanno algorithm. For the second equation with θ fixed, it can be shown that the function sg2(θ,s) is non-increasing, as demonstrated by its derivative

Proposition 3 in L. Wang et al. (2021) further guarantees that the equation g2(θ,s)=0 with θ fixed has a unique solution, provided that i=1n1(|Z^iαXiTθ|>0)>n/k2=p+logn. Based on these observations, we propose the following alternating algorithm, which begins at iteration 0 with an initial estimate θ0=θ^, the two-step LSE given in equation (11), or equivalently equation (12). At each iteration t=1,2,, the procedure involves two steps:

  • Compute the ES ‘residuals’ ωit=Z^iαXiTθt1 using the previous estimate θt1. Let st be the solution to the equation (1/n)i=1n(|ωit/s|k)2=1, s>0.

  • Compute the updated estimate θtargminθRpi=1nτt(Z^iαXiTθ), where τt=stk.

Given a prespecified tolerance ϵ>0 (e.g. ϵ=105), the algorithm will terminate at the tth iteration if max{g1(θt,st)2,|g2(θt,st)|}ϵ, or if the maximum number of iterations is reached. Our numerical experiments in Section 6 show that this algorithm generally achieves convergence after only a small number of iterations. Intuitively, we attribute the algorithm’s fast convergence to the observation that θ^τ changes gradually as τ varies. This graduate change cause the residuals to behave similarly over a range of τ values. We refer the reader to the Online Supplementary Section B for a detailed elaboration on the motivations behind our proposed data-driven method.

4 Statistical theory

This section presents non-asymptotic high probability bounds for the error θ^τθ*2 of the Huberised two-step ES estimator θ^τ, as defined in equation (19). Additionally, we establish a non-asymptotic Bahadur representation for θ^τ, which is a crucial step towards obtaining a Berry–Esseen-type bound for Gaussian approximation. Throughout this section, we write X=(x1,,xp)TRp with x11. Without loss of generality, we assume that the random predictors x2,,xp have zero means, that is, μj=E(xj)=0 for j=2,,p. This makes the later sub-Gaussian assumption more reasonable; see Condition 2 below. Otherwise, we set Z=(1,z2,,zp)T=(1,x2μ2,,xpμp)T. With this notation, the joint model (4) becomes Qα(Y|Z)=β0+j=2pzjβj* and ESα(Y|Z)=θ0+j=2pzjθj*, where β1=μTβ* and θ1=μTθ* with μ=(1,μ2,,μp)T. The sub-Gaussian assumption can then be imposed on Z, and our analysis naturally applies to {(Yi,Zi)}i=1n.

In the context of a joint (linear) quantile and ES regression model, we initiate by establishing a high probability bound, explicitly dependent on n and p, for the QR estimator β^ (9). To this end, we impose the following conditions on the covariates and the conditional distribution of Y given X.

 
Condition 1

The conditional density function of ε:=YXTβ* given X, denoted by fε|X, exists and is continuous on its support. Moreover, there exist constants f_,l0>0 such that fε|X(0)f_ and |fε|X(t)fε|X(0)|l0|t| for all tR almost surely (over X).

 
Condition 2

The random covariate vector XRp is sub-Gaussian, that is, there exists some (dimension-free) constant υ11 such that P(|uTW|υ1t)2et2/2 for all t0 and uSp1, where W=Σ1/2X and Σ=E(XXT) is positive definite. Let κl=supuSp1E|uTW|l for l1.

Condition 1 imposes regularity conditions on the random error distributions, accommodating heteroskedastic error distributions and not requiring the existence of any moment. Condition 2 is used to guarantee that population and empirical quantities (e.g. the objective or gradient function or the gradient function) are uniformly close to each other in a compact region. It can be replaced by a boundedness assumption, which will lead to similar results. For example, X=(x1,,xp)T is compactly supported with either XCX or Σ1/2X2BX, where CX is an absolute constant and BX is usually proportional to p.

 
Proposition 3

Under Conditions 1 and 2, the QR estimator β^ given in equation (9) satisfies, for any t0, that β^β*ΣC1f_1(p+t)/n holds with probability at least 1et as long as nC2l02f_4(p+t), where C1,C2>0 are constants depending only on υ1.

While QR has been extensively studied since the seminal work of Koenker and Bassett (1978), there remains a paucity of literature that addresses its finite-sample properties, particularly in terms of high probability bounds. Proposition 3 revisits Theorem 2.1 originally presented in Pan and Zhou (2021). For the sake of completeness, we provide a self-contained and simplified proof in the Online Supplementary Section G.9. Shifting our focus to ES regression, which involves conditional expectations, we additionally impose the following moment condition on the random error ε.

 
Condition 3

The conditional CDF Fε|X of ε given X is continuously differentiable and satisfies |Fε|X(t)Fε|X(0)|f¯|t| for all tR. Moreover, the negative part of ε, denoted by ε=ε0, satisfies varX(ε)σ¯2 almost surely (over X), where varX denotes the conditional variance given X.

Condition 3 asserts that the conditional variance of the negative part of the QR residual ε=YXTβ* is bounded. In our theoretical analysis, we assume σ¯ to be a constant for convenience. More generally, one can assume a form of ε=σ(X)η, where σ:Rp(0,) is a positive function on Rp (not necessarily bounded), and η is independent of X satisfying var(η1(η0))σ¯2. In this case, an additional moment assumption on σ(X), such as boundedness E{σ(X)4}, would suffice.

Our next result establishes high probability bounds for the estimation error of ES regression, conditioning on the event that β^ falls within a local neighbourhood of β*.

 
Theorem 1
Assume Conditions 2 and 3 hold. For any t>0, let r0>0 be such that r0σ¯ and f¯r02σ¯(p+t)/n. Then, the two-step robust α-ES (0<α1/2) estimator θ^τ with τ=c0σ¯n/(p+t) (for any c01) satisfies that, with probability at least 13et conditioned on the event {β^β*Σr0},
(23)
provided that the sample size obeys nC3(p+t), where C1>0 is a constant depending on (υ1,c0) and C2,C3>0 depend only on υ1.

For any δ(0,1), the robust estimator θ^τ with τσ¯n/(p+log(1/δ)) satisfies with probability at least 1δ conditioned on {β^β*Σr0} that

The above bound is proportional to log(1/δ), in contrast to the bound for the two-step LSE, which is proportional to 1/δ, as demonstrated in the Online Supplementary Theorem A.1 in Section A. This observation suggests that the Huberised estimator is much more robust to heavy tails from a non-asymptotic perspective, compared to the two-step LSE. Specifically, in cases where the error variables only have finite variance, the worst-case deviations of θ^ are considerably larger than those of θ^τ.

Bias-robustness trade-off

 
Remark 2
The choice of τ stated in Theorem 1 is a reflection of the bias-robustness trade-off. As discussed in Section 3.2, the robust estimator θ^τ can be viewed as an M-estimator of θτ*=argminθE{τ(ZiαXiTθ)}, which differs from the true ES regression coefficient θ* due to the asymmetry of ES ‘residuals’ ωi=ZiαXiTθ*. Consider the decomposition
As long as τσ¯ under Condition 3, Proposition 2 ensures that αθ^τθτ*Σ2σ¯2/τ. Examining the proof of Theorem 1, we see that
with high probability conditioned on the event {β^β*Σr0}. We therefore select τσ¯n/(p+t) in order to minimise the upper bound as a function of τ.

A uniform bound over τ

 
Remark 3
Recall from Proposition 3 that with probability at least 1n1, β^β*Σf_1(p+logn)/n as long as np+logn. Complementing the proof of Theorem 1 with a discretisation argument, we can obtain a more general result that holds for a range of τ values. Specifically, let τ¯τ_>0 be such that σ¯τ_τ¯σ¯n/(p+logn). Then, with probability at least 1Cn1 for some absolute constant C1,
(24)
as long as np+logn. The proof of the uniform upper bound in equation (24) is provided in the Online Supplementary Section G.9. As ensured by this uniform bound, a data-driven choice of τ within the aforementioned range can be used.

If, in addition to Condition 3, some higher order moment of ε is bounded, namely, EX{|εEX(ε)|k}αk almost surely (over X) for some k>2, the second term on the right-hand side of equation (24) will become αkτ_1k. In order to attain tight (finite-sample) concentration bounds, the robustification parameter τ=τ(n,p) should not exceed n/(p+logn) in magnitude. Conversely, τ should demonstrate sufficiently rapid growth in order for the bias term, controlled by σ¯2τ_1 or αkτ_1k (in case higher order moments of ε are bounded), to decay at a comparable rate to the stochastic error.

Unlike the two-step LSE θ^, the robust counterpart θ^τ does not possess a closed-form expression. As a pivotal step in deriving Gaussian approximation results, the following theorem furnishes a non-asymptotic Bahadur representation for θ^τ, complete with explicit error bounds depending on (n,p) and the first-stage QR estimation error.

 
Theorem 2
Assume the same conditions as in Theorem 1. For any t>0, the α-ES estimator θ^τ with τσ¯n/(p+t) satisfies that, with probability at least 16et conditioned on {β^β*Σr0},
(25)
as long as np+t, where ωi=εi0+αXiT(β*θ*).

Lastly, we present the following Gaussian approximation result that bounds the Kolmogorov distance between the distribution of the standardised statistic αnaT(θ^τθ*)/ϱa and the standard normal distribution, uniformly over all deterministic vectors aRp, where ϱa2=aTΣ1ΩΣ1a. A similar conclusion applies to the oracle robust estimate θ^τora (20). The following theorem shows that the two-step robust estimator obtained via equations (9) and (19) is asymptotically equivalent to the oracle Huberised estimator (20), assuming β* is known.

 
Theorem 3
In addition to Conditions 13, assume that there exist constants σ¯,α3>0 such that
(26)
Then, the robust α-level (α(0,1/2]) ES estimator θ^τ with τσ¯n/(p+logn) satisfies
(27)
Moreover, the oracle Huberised ES estimator θ^τora (20) with the same τ satisfies
(28)

The above Gaussian approximation result lays the theoretical foundation for the statistical inference problems of testing the linear hypothesis H0:aTθ*=c0 vs. H1:aTθ*c0 and constructing confidence intervals for aTθ*, where aRp and c0R are predetermined. Given the joint quantile and ES regression estimates (β^,θ^τ), let Ω^γ be the truncated estimator of Ω=E(ω2XXT) defined in equation (21) with γ=γ(n,p)>0 denoting a second robustification parameter. Then, we consider the robust test statistic Ta=αn(aTθ^τc0)/ϱ^a,γ for testing H0:aTθ*=c0, and the (approximate) 100(1c)% confidence interval aTθ^τ±zc/2ϱ^a,γ/(αn) for aTθ*, where ϱ^a,γ2:=aTΣ^1Ω^γΣ^1a is a robust variance estimator and zc/2 is the upper (c/2)-percentile of N(0,1). The consistency of ϱ^a,γ2 with a properly chosen γ is investigated in the Online Supplementary Section C.

5 Non-parametric expected shortfall regression

In this section, we consider non-parametric models for joint quantile and expected shortfall regression. For a predetermined quantile level α(0,1), the goal is to estimate the unknown (conditional) quantile and expected shortfall functions fq*(x)=Qα(Y|X=x) and fe*(x)=ESα(Y|X=x), with an emphasis on the latter. By equation (1), fq* and fe* can be identified as

Motivated by the two-step procedure developed under joint linear models, in the following we propose a non-parametric ES estimator using the series regression method (Andrews, 1991; Eubank & Spiegelman, 1990; Newey, 1997). Such a non-parametric estimate is carried out by regressing the dependent variable on an asymptotically growing number of approximating functions of the covariates, and therefore is closely related to the estimator define in equation (11) under the so-called many regressors model (Belloni et al., 2019), that is, the dimension p=pn is allowed to grow with n. The idea of series estimation is to first approximate fq* and fe* by their ‘projections’ on the linear spans of m1 and m2 series/basis functions, respectively, and then fit the coefficients using the observed data. Specifically, we approximate functions fq* and fe* by linear forms U(x)Tβ and V(x)Tθ, where

are two vectors of series approximating functions of dimensions m1 and m2. Here both m1 and m2 may increase with n. We thus define the vectors of quantile and ES series approximation coefficients as

(29)

Given independent observations (Yi,Xi), 1in from (Y,X)R×X with X denoting a compact subset of Rp, we write Ui=U(Xi)Rm1 and Vi=Vi(Xi)Rm2. Extending the two-step approach described in Section 2.2, we first define the (conditional) quantile series estimator of fq*(x)=Qα(Y|X=x) (Belloni et al., 2019):

(30)

With generated response variables Z^i=αf^q(Xi)+{Yif^q(Xi)}1{Yif^q(Xi)}, the second-stage ES series estimator is given by

(31)

Commonly used series functions with good approximation properties include B-splines, polynomials, Fourier series and compactly supported wavelets. We refer to Newey (1997) and Chen (2007) for a detailed description of these series functions. In the context of QR, Chen (2007) established the consistency and rate of convergence at a single quantile index. More recently, Belloni et al. (2019) developed a large sample theory for the quantile series coefficient process, including convergence rate and uniform strong approximations. The choice of the parameter m1, also known as the order of the series estimator, is crucial for establishing the balance between bias and variance.

Note that the quantile series estimator f^q in equation (30) has been well studied by Belloni et al. (2019). Because the number of regressors increases with the sample size, conventional central limit theorems are no longer applicable to capture the joint asymptotic normality of the regression coefficients. The growing dimensionality is the primary source of technical complications. Our theoretical analysis under the joint linear model (4), which leads to novel non-asymptotic high probability bounds, can be used as a starting point for studying the two-step non-parametric ES series estimator f^e defined in equation (31). Of particular interest is to develop a uniform inference procedure for the conditional ES function fe* and its theory. That is, at a given confidence level 1γ, we aim to construct a pair of functional estimates [f^eL,f^eU] from {(Yi,Xi)}i=1n such that

Since a significant amount of additional work is still needed, including explicit characterisations of the ES series approximation error and the impact of first-stage non-parametric QR estimation error, we leave a rigorous theoretical investigation of f^e to future work. Although we have only focussed on series methods, there are other non-parametric techniques that offer superior empirical and theoretical performance. Among those, deep neural networks have stood out as a promising tool for non-parametric estimation, from least squares, logistic to QR (Farrell et al., 2021; Schmidt-Hieber, 2020; Shen et al., 2021). It is practically useful to construct deep learning implementations of two-step estimators and statistically important to deliver valid inferences on finite-dimensional parameters following first-step estimation (of both quantile and ES functions) using deep learning. A detailed investigation of these problems is beyond the present scope but of future interest.

6 Numerical studies and real data examples

6.1 Monte Carlo experiments

In this section, we assess the numerical performance of the proposed method for fitting expected shortfall regression. For its R implementation, we first obtain a QR estimate via the quantreg library, and in step two use the adaHuber library to solve (19) with the robustification parameter selected adaptively as described in Section 3.2.

We compare the proposed two-step adaptive Huber ES estimator (2S-AH) to several competitors: (i) the joint regression estimate (joint) via FZ loss minimisation, implemented via the R library esreg with the default option; (ii) the two-step LSE (12) (2S-LS); and (iii) the oracle two-step ‘estimator’ (2S-oracle). Recall that the two-step procedure first obtains a QR estimator β^ via either standard (Koenker & Bassett, 1978) or smoothed QR regression (He et al., 2023), and subsequently computes the ES estimator based on fitted quantile thresholds {XiTβ^}i=1n. The oracle method refers to the two-step ES estimate based on the true quantile thresholds {XiTβ*}i=1n.

In our simulation studies, we first generate γ*=(γ1*,,γp*)T and η*=(η1*,,ηp*)T independently, where γj*s are independent Rademacher random variables and ηj*i.i.d.0.5Bernoulli(1/2). Data are then generated from the heteroscedastic model

(32)

where Xi=(Xi1,,Xip)T with Xiji.i.d.Unif(0,1.5), and the random noise εi follows one of the following two distributions: (i) standard normal distribution and (ii) t-distribution with ν>2 degrees of freedom (tν). Given γ* and η*, the true quantile and expected shortfall regression coefficients are β*=γ*+Qα(ε)η* and θ*=γ*+ESα(ε)η*, where Qα(ε) and ESα(ε) are the α-level quantile and expected shortfall of ε, respectively.

We first set the dimension p=20 and sample size n=50p/α, where the quantile level α takes values in {0.05,0.1,0.2}. Simulation results on the relative 2-error θ^θ*2/θ*2, averaged over 200 replications, are reported in Tables 1 and 2 under the N(0,1) and t2.5 noise model, respectively. All four methods have very similar performance across different quantile levels in the normal model, while in the presence of heavy-tailed errors, the proposed robust method achieves consistently more favourable performance. This demonstrates that the use of adaptive Huber regression (in stage two) gains robustness against heavy-tailed errors without compromising statistical efficiency when the error distribution is light-tailed.

Table 1.

Mean relative 2-error θ^θ*2/θ*2 (and standard error), averaged over 200 replications, when εiN(0,1), p=20, n=50p/α, and α={0.05,0.1,0.2}

N(0,1) noise
Methodα=0.05α=0.1α=0.2
2S-AH0.130 (0.003)0.150 (0.003)0.171 (0.004)
2S-LS0.130 (0.003)0.150 (0.003)0.171 (0.004)
joint0.130 (0.003)0.151 (0.003)0.177 (0.004)
2S-oracle0.129 (0.003)0.149 (0.003)0.171 (0.004)
N(0,1) noise
Methodα=0.05α=0.1α=0.2
2S-AH0.130 (0.003)0.150 (0.003)0.171 (0.004)
2S-LS0.130 (0.003)0.150 (0.003)0.171 (0.004)
joint0.130 (0.003)0.151 (0.003)0.177 (0.004)
2S-oracle0.129 (0.003)0.149 (0.003)0.171 (0.004)
Table 1.

Mean relative 2-error θ^θ*2/θ*2 (and standard error), averaged over 200 replications, when εiN(0,1), p=20, n=50p/α, and α={0.05,0.1,0.2}

N(0,1) noise
Methodα=0.05α=0.1α=0.2
2S-AH0.130 (0.003)0.150 (0.003)0.171 (0.004)
2S-LS0.130 (0.003)0.150 (0.003)0.171 (0.004)
joint0.130 (0.003)0.151 (0.003)0.177 (0.004)
2S-oracle0.129 (0.003)0.149 (0.003)0.171 (0.004)
N(0,1) noise
Methodα=0.05α=0.1α=0.2
2S-AH0.130 (0.003)0.150 (0.003)0.171 (0.004)
2S-LS0.130 (0.003)0.150 (0.003)0.171 (0.004)
joint0.130 (0.003)0.151 (0.003)0.177 (0.004)
2S-oracle0.129 (0.003)0.149 (0.003)0.171 (0.004)
Table 2.

Mean relative 2-error θ^θ*2/θ*2 (and standard error), averaged over 200 replications, when εit2.5, p=20, n=50p/α and α={0.05,0.1,0.2}

t2.5 noise
Methodα=0.05α=0.1α=0.2
2S-AH0.484 (0.008)0.470 (0.009)0.429 (0.008)
2S-LS0.612 (0.013)0.606 (0.016)0.532 (0.013)
joint0.581 (0.012)0.567 (0.014)0.511 (0.013)
2S-oracle0.612 (0.013)0.607 (0.016)0.532 (0.013)
t2.5 noise
Methodα=0.05α=0.1α=0.2
2S-AH0.484 (0.008)0.470 (0.009)0.429 (0.008)
2S-LS0.612 (0.013)0.606 (0.016)0.532 (0.013)
joint0.581 (0.012)0.567 (0.014)0.511 (0.013)
2S-oracle0.612 (0.013)0.607 (0.016)0.532 (0.013)
Table 2.

Mean relative 2-error θ^θ*2/θ*2 (and standard error), averaged over 200 replications, when εit2.5, p=20, n=50p/α and α={0.05,0.1,0.2}

t2.5 noise
Methodα=0.05α=0.1α=0.2
2S-AH0.484 (0.008)0.470 (0.009)0.429 (0.008)
2S-LS0.612 (0.013)0.606 (0.016)0.532 (0.013)
joint0.581 (0.012)0.567 (0.014)0.511 (0.013)
2S-oracle0.612 (0.013)0.607 (0.016)0.532 (0.013)
t2.5 noise
Methodα=0.05α=0.1α=0.2
2S-AH0.484 (0.008)0.470 (0.009)0.429 (0.008)
2S-LS0.612 (0.013)0.606 (0.016)0.532 (0.013)
joint0.581 (0.012)0.567 (0.014)0.511 (0.013)
2S-oracle0.612 (0.013)0.607 (0.016)0.532 (0.013)

In a more extreme setting where α=0.01, Figure 1 shows the boxplots of squared 2-errors for three ES estimates (2S-LS, 2S-AH, and joint) under the normal and t3 models. Although the 2S-LS estimator is easy-to-compute, it is more sensitive to heavy-tailedness than the joint estimator obtained via FZ loss minimisation. We further compare the proposed method with the joint regression approach in terms of computational efficiency. The computational time in seconds averaged over 50 independent replications, for the two methods with growing (n,p) subject to n=50p/α (α{0.05,0.1,0.2}) are reported in Figure 2. These numerical results show evidence that our R implementation of the robust two-step method can be faster than the esreg library for the joint regression approach by several orders of magnitude.

Boxplots of squared total ℓ2-errors (including the intercept when its true value is 2), based on 500 replications, for three ES regression estimators (2S-LS, 2S-AH, and joint) at quantile level α=0.01. The mean squared errors of these three estimators are 0.1219, 0.0983, and 0.1119 in the normal model, and 1.7401, 0.8017, and 1.2542 in the t3 model.
Figure 1.

Boxplots of squared total 2-errors (including the intercept when its true value is 2), based on 500 replications, for three ES regression estimators (2S-LS, 2S-AH, and joint) at quantile level α=0.01. The mean squared errors of these three estimators are 0.1219, 0.0983, and 0.1119 in the normal model, and 1.7401, 0.8017, and 1.2542 in the t3 model.

Average elapsed time (in seconds) over 50 replications for the proposed method implemented by a combination of quantreg and adaHuber and the joint regression approach implemented by esreg under N(0,1) and t2.5 error models when α∈{0.05,0.1,0.2}. The sample size is set to be n=⌈50p/α⌉. The solid and dashed lines correspond to the proposed method and the joint regression approach, respectively.
Figure 2.

Average elapsed time (in seconds) over 50 replications for the proposed method implemented by a combination of quantreg and adaHuber and the joint regression approach implemented by esreg under N(0,1) and t2.5 error models when α{0.05,0.1,0.2}. The sample size is set to be n=50p/α. The solid and dashed lines correspond to the proposed method and the joint regression approach, respectively.

To shed some light on the drastic difference in numerical efficiency between the two methods, note that the joint regression approach (Dimitriadis & Bayer, 2019) relies on the Nelder–Mead simplex method, which is sensitive to the starting values and not guaranteed to converge to a local minimum. The convergence of the Nelder–Mead method is already very slow for large-scale problems because it is a direct search method based on function comparison. And due to its sensitivity to starting values, Dimitriadis and Bayer (2019) proposed to re-optimise the model (several times) with the perturbed parameter estimates as new starting values. This explains, to some extent, the fast increase in the runtime of esreg as both n and p grow. The function in quantreg that fits linear QR is coded in fortran, and thus is very fast in larger problems. The computation of adaptive Huber regression is based on the Barzilai-Borwein gradient descent method (Barzilai & Borwein, 1988), implemented via RcppArmadillo in adaHuber.

Next, we construct entrywise (approximate) 95% confidence intervals (CIs) for the expected shortfall regression parameter θ*. The CI for the two-step estimator is based on equation (14) (non-robust) and equation (22) (robust), and we use the default option in the esreg package to implement Dimitriadis and Bayer (2019)’s method. To evaluate the accuracy and reliability of the CIs, we compute the empirical coverage probability and interval width based on 500 independent replications, then averaged over the p slope coefficients. Results for p=20 and n=50p/α (α{0.05,0.1,0.2}) are reported in Tables 3 and 4.

Table 3.

Empirical coverage probability and mean width (based on 500 replications) of 95% confidence intervals averaged over p=20 variables when n=50p/α, α={0.05,0.1,0.2}, and εiN(0,1)

N(0,1)α=0.05α=0.1α=0.2
MethodCoverageWidthCoverageWidthCoverageWidth
2S-AH0.9500.5950.9490.6600.9480.744
joint0.9460.5840.9440.6510.9420.740
2S-LS0.9500.5950.9490.6610.9480.745
N(0,1)α=0.05α=0.1α=0.2
MethodCoverageWidthCoverageWidthCoverageWidth
2S-AH0.9500.5950.9490.6600.9480.744
joint0.9460.5840.9440.6510.9420.740
2S-LS0.9500.5950.9490.6610.9480.745
Table 3.

Empirical coverage probability and mean width (based on 500 replications) of 95% confidence intervals averaged over p=20 variables when n=50p/α, α={0.05,0.1,0.2}, and εiN(0,1)

N(0,1)α=0.05α=0.1α=0.2
MethodCoverageWidthCoverageWidthCoverageWidth
2S-AH0.9500.5950.9490.6600.9480.744
joint0.9460.5840.9440.6510.9420.740
2S-LS0.9500.5950.9490.6610.9480.745
N(0,1)α=0.05α=0.1α=0.2
MethodCoverageWidthCoverageWidthCoverageWidth
2S-AH0.9500.5950.9490.6600.9480.744
joint0.9460.5840.9440.6510.9420.740
2S-LS0.9500.5950.9490.6610.9480.745
Table 4.

Empirical coverage probability and mean width (based on 500 replications) of 95% confidence intervals averaged over p=20 variables when n=50p/α, α={0.05,0.1,0.2}, and εit2.5

t2.5α=0.05α=0.1α=0.2
MethodCoverageWidthCoverageWidthCoverageWidth
2S-AH0.9473.6330.9462.7900.9482.243
joint0.9595.7710.9593.5710.9542.872
2S-LS0.9524.5210.9503.3970.9532.687
t2.5α=0.05α=0.1α=0.2
MethodCoverageWidthCoverageWidthCoverageWidth
2S-AH0.9473.6330.9462.7900.9482.243
joint0.9595.7710.9593.5710.9542.872
2S-LS0.9524.5210.9503.3970.9532.687
Table 4.

Empirical coverage probability and mean width (based on 500 replications) of 95% confidence intervals averaged over p=20 variables when n=50p/α, α={0.05,0.1,0.2}, and εit2.5

t2.5α=0.05α=0.1α=0.2
MethodCoverageWidthCoverageWidthCoverageWidth
2S-AH0.9473.6330.9462.7900.9482.243
joint0.9595.7710.9593.5710.9542.872
2S-LS0.9524.5210.9503.3970.9532.687
t2.5α=0.05α=0.1α=0.2
MethodCoverageWidthCoverageWidthCoverageWidth
2S-AH0.9473.6330.9462.7900.9482.243
joint0.9595.7710.9593.5710.9542.872
2S-LS0.9524.5210.9503.3970.9532.687

Once again, all three methods perform similarly under normal errors, while the robust approach gives the narrowest CIs while maintaining the desired coverage level under t2.5 errors. Together, the results in Tables 2 and 4 demonstrate the robustness of the proposed method, as indicated by the theoretical investigations in Section 4.

6.2 Data application I: health disparity

Iron deficiency is one of the most common nutritional deficiency worldwide and is one of the leading cause of anaemia (Camaschella, 2015). Being able to detect iron deficiency is essential in medical care for patients with inflammation, infection, or chronic disease. It is also important in preventive care since iron deficiency tends to present signs of a more serious illness such as gastrointestinal malignancy (Rockey & Cello, 1993). One measure of iron deficiency that has proven to be useful is the soluble transferrin receptor (sTRP), a carrier protein for transferrin (Mast et al., 1998). A high value of sTRP indicates iron deficiency.

The scientific goal here is to assess whether there is any disparity in sTRP levels among four different ethnic groups: Asian, Black, Mexican American, and White. To this end, we analyse a data set obtained from the National Health and Nutrition Examination Survey from 2017 to 2020 (pre-covid). In this data set, the response variable sTRP was measured for female participants who range in age from 20 to 49 years. The covariates of interest are three dummy variables that correspond to Asian, Mexican American, and Black, using White as the baseline. We adjust for demographic variables such as age, education level, and healthy diet throughout our analysis. For simplicity, we remove all participants with missing values on the covariates and the final data set consists of n=1,689 observations and p=7 covariates.

As an exploratory analysis, in Figure 3 we plot the quantile curves of sTRP measurements at levels from 50% to 99% for each of the four different ethnic groups. In this data set, the sTRP values range from 1.24 to 35.1 mg/L. We note that the normal range for females is between 1.9 and 4.4 mg/L (Kratovil et al., 2007), and values that are much higher than 4.4 mg/L indicate severe iron deficiency. We see from Figure 3 that the majority of the population have sTRP levels within the normal range. However, there are large disparities between Black and the other three ethnic groups, reflected in higher quantiles of the marginal distributions of sTRP.

Figure 3. The soluble transferrin receptor levels (mg/L) vs. quantile levels (ranging from 0.5 to 0.99) for the female population in four ethnic groups: Asian, Black, Mexican American, and White. The orange horizontal dashed line indicates the upper bound of the normal range (1.9–4.4 mg/L) for transferrin receptors among females.
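The curves in Figure 3 are empirical quantile functions computed group by group; a minimal matplotlib sketch, continuing the hypothetical data frame above, might look as follows.

```python
import numpy as np
import matplotlib.pyplot as plt

levels = np.linspace(0.5, 0.99, 50)
for group in ["Asian", "Black", "Mexican American", "White"]:
    values = df.loc[df["ethnicity"] == group, "sTRP"].to_numpy()
    plt.plot(levels, np.quantile(values, levels), label=group)
plt.axhline(4.4, color="orange", linestyle="--")  # upper normal limit, mg/L
plt.xlabel("Quantile level")
plt.ylabel("sTRP (mg/L)")
plt.legend()
plt.show()
```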

To quantify the statistical significance of the aforementioned disparity, we fit a robust expected shortfall regression at α = 0.75 (upper tail), with the robustification parameter tuned by the procedure described in Section 3.2. This is equivalent to fitting the proposed 2S-AH method at level 1 − α (see Section 3) after flipping the signs of both the response and the covariates. We also implement the standard QR at level α.
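To make the sign-flip reduction concrete, here is a hedged sketch; fit_es_lower stands for any lower-tail ES regression fitter and is a hypothetical interface, not the paper's API.

```python
def fit_es_upper(X, y, alpha, fit_es_lower):
    """Upper-tail ES regression at level alpha via the sign-flip reduction.

    `fit_es_lower(X, y, a)` is a hypothetical lower-tail fitter returning
    a coefficient vector. The upper alpha-tail of Y given X corresponds to
    the lower (1 - alpha)-tail of -Y given -X, so negating both the
    response and the design recovers the upper-tail coefficients directly.
    """
    return fit_es_lower(-X, -y, 1.0 - alpha)
```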

Table 5 reports the estimated coefficients and the associated 95% confidence intervals for the three indicator covariates for the ethnic groups Asian, Mexican American, and Black, using White as the baseline. We see that both the quantile and robust expected shortfall regression methods detect a health disparity between Black and White. Specifically, the estimated robust ES regression coefficient and 95% CI (in parentheses) is 3.03 (1.88, 4.19), vs. the QR counterpart's 0.86 (0.37, 1.35). With the use of QR (at level 0.75), we do not observe a statistically significant health disparity between Asian and White. In contrast, 2S-AH detects a health disparity between Asian and White, with an estimated coefficient of 2.34 (0.59, 4.09). We also see that QR detects a health disparity between Mexican American and White, but the effect size is close to zero. In summary, ES regression complements QR, and can be more effective, as a tool to detect health disparity, especially when the disparity occurs only in the tail of the conditional distribution.

Table 5.

The estimated regression coefficients (and 95% confidence intervals) for three dummy variables, Asian, Black, and Mexican American, using White as the baseline

                          Asian                 Black                Mexican American
QR                        0.31 (−0.02, 0.64)    0.86 (0.37, 1.35)    −0.22 (−0.42, −0.01)
ES regression (2S-AH)     2.34 (0.59, 4.09)     3.03 (1.88, 4.19)    0.13 (−0.76, 1.03)

Note. Results of the upper-tail robust ES regression method 2S-AH and the standard QR at quantile level α = 0.75 are reported.

6.3 Data application II: JTPA

We consider the JTPA study, a publicly funded training programme that provides training for adults with the goal of improving their earnings. Specifically, we focus on the Title II sub-programme of the JTPA study, which is mainly offered to adults with barriers to employment and to out-of-school youths. This data set was previously analysed in Bloom et al. (1997). It consists of 30 months of accumulated earnings for 6,102 females and 5,102 males, with 16 covariates related to the demographics of the individuals, such as age, race, and an indicator of whether the individual received JTPA training. After removing individuals with zero income, there remain 4,576 males and 5,296 females. Our goal is to assess the effect of JTPA training on participants' earnings, with an emphasis on the employed low-income population, for both the male and female sub-groups.

To this end, we fit an expected shortfall regression model using the proposed robust method with α ∈ {0.05, 0.1, 0.2}. The robustification parameter τ is selected automatically via the procedure described in Section 3.2. Specifically, we regress the 30-month accumulated earnings on JTPA training to assess its effect on low-income individuals, adjusting for whether the individual completed high school, race, Hispanic/non-Hispanic, marital status, whether the individual worked fewer than 13 weeks in the past year, and age. We report the estimated regression coefficient for the binary variable JTPA training and its associated 95% confidence interval. The results are summarised in Table 6.
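A sketch of this analysis pipeline is below; the file and column names are invented for illustration (the archive's actual variable names differ), and fit_es_with_se stands for a hypothetical wrapper returning 2S-AH point estimates with standard errors.

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names for the JTPA extract.
df = pd.read_csv("jtpa_title2.csv")
df = df[df["earnings_30m"] > 0]            # keep employed individuals only
males = df[df["sex"] == "male"]

covars = ["training", "high_school", "race", "hispanic",
          "married", "worked_lt13wk", "age"]
X = np.column_stack([np.ones(len(males)), males[covars].to_numpy(float)])
y = males["earnings_30m"].to_numpy(float)

for alpha in (0.05, 0.1, 0.2):
    theta, se = fit_es_with_se(X, y, alpha)  # hypothetical 2S-AH wrapper
    j = 1 + covars.index("training")         # position of the training dummy
    lo, hi = theta[j] - 1.96 * se[j], theta[j] + 1.96 * se[j]
    print(f"alpha={alpha}: {theta[j]:.0f} ({lo:.0f}, {hi:.0f})")
```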

Table 6.

The estimated regression coefficient of the binary predictor JTPA training (and its 95% confidence interval) for the proposed robust method and the standard QR at quantile level α ∈ {0.05, 0.1, 0.2}

Male sub-group            α = 0.05           α = 0.1             α = 0.2
QR                        465 (255, 675)     882 (603, 1,161)    2,031 (1,431, 2,603)
ES regression (2S-AH)     283 (149, 418)     552 (333, 771)      1,093 (641, 1,546)

Female sub-group          α = 0.05           α = 0.1             α = 0.2
QR                        202 (76, 328)      480 (307, 653)      1,086 (719, 1,452)
ES regression (2S-AH)     123 (41, 205)      300 (146, 453)      672 (385, 958)

Note. Results are rounded to the closest integer.

From Table 6, we see that the 95% confidence intervals for the robust method do not contain zero for any α ∈ {0.05, 0.1, 0.2}. This indicates that JTPA training significantly improves earnings for the low-income population. Specifically, for the male sub-population, the estimated ES effects of JTPA training are 283, 552, and 1,093 dollars at levels 0.05, 0.1, and 0.2, respectively. To assess whether these estimated effects are scientifically meaningful, we compute the average 30-month accumulated earnings below the quantile levels 0.05, 0.1, and 0.2 for the male sub-group, which are 214, 566, and 1,496 dollars, respectively. We find that JTPA training approximately doubles the average income for individuals with income below the 0.05- and 0.1-quantile levels, and becomes less effective for individuals with higher incomes. Similar findings are observed for the female sub-group.
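The baseline figures above are plug-in estimates of an unconditional tail average; a one-function numpy sketch is given below (earnings_male is a hypothetical array of the male sub-group's 30-month earnings).

```python
import numpy as np

def tail_average(y, alpha):
    """Mean of the observations at or below the empirical alpha-quantile,
    i.e. a plug-in estimate of the unconditional alpha-level ES."""
    y = np.asarray(y, dtype=float)
    return y[y <= np.quantile(y, alpha)].mean()

# e.g. tail_average(earnings_male, 0.05) for the alpha = 0.05 baseline.
```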

7 Conclusion and discussions

This paper considers expected shortfall regression under a joint quantile and ES model recently proposed in Dimitriadis and Bayer (2019) and Patton et al. (2019). The existing approach is based on a joint M-estimator, defined as the global minimiser of a member of a class of strictly consistent joint loss functions (Fissler & Ziegel, 2016) over some compact set. Since the loss function is non-differentiable and non-convex, computing such a joint M-estimator is intrinsically difficult, especially when the dimensionality is large. To circumvent this challenge, Barendse (2020) proposed a two-step procedure for estimating the joint quantile and ES model based on Neyman orthogonalisation: the first step fits the QR, and the second step employs the Neyman-orthogonal score to estimate the ES parameters. Due to the use of the L2 loss in the second step, however, the resulting estimator is sensitive to heavy-tailed error distributions.
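For readers who want the mechanics, a minimal Python sketch of this two-step construction follows, using numpy and statsmodels rather than the paper's own implementation; the surrogate-response form of the orthogonal score below is our reading of the procedure.

```python
import numpy as np
import statsmodels.api as sm

def two_step_es_ls(X, y, alpha):
    """Two-step ES regression with an L2 second step (2S-LS sketch).

    X: (n, p) design with an intercept column; y: (n,) response;
    alpha: tail probability in (0, 1).
    """
    # Step 1: quantile regression at level alpha.
    beta_hat = np.asarray(sm.QuantReg(y, X).fit(q=alpha).params)
    q_hat = X @ beta_hat
    # Step 2: Neyman-orthogonal surrogate response
    #   Z_i = q_hat_i + min(y_i - q_hat_i, 0) / alpha,
    # whose conditional mean is the alpha-level ES; regress Z on X by OLS.
    Z = q_hat + np.minimum(y - q_hat, 0.0) / alpha
    theta_hat = np.asarray(sm.OLS(Z, X).fit().params)
    return beta_hat, theta_hat
```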

To address the robustness and computation concerns simultaneously, we propose a robust two-step method that applies adaptive Huber regression (Zhou et al., 2018) in the second step. The key is the use of a diverging robustification parameter to navigate the bias-robustness trade-off, tuned by a convenient data-driven mechanism. The proposed method can be efficiently implemented by a combination of the R packages quantreg/conquer and adaHuber. Python code implementing both our proposed methods and the existing non-convex optimisation-based methods (Dimitriadis & Bayer, 2019; Peng & Wang, 2022) is publicly available at https://github.com/WenxinZhou/conquer. We establish a finite-sample theoretical framework for this two-step method, including deviation bounds, Bahadur representations, and (uniform) Gaussian approximations, in which the dimension of the model, p, may depend on and increase with the sample size, n. Robust confidence intervals/sets are also constructed. Numerical experiments further demonstrate that the proposed robust ES regression approach achieves satisfactory statistical performance, a high degree of robustness against heavy-tailed data, and superior computational efficiency and stability. Through two data applications, on health disparity and the JTPA study, we illustrate that ES regression complements QR as a useful tool to explore heterogeneous covariate effects on the average tail behaviour of the outcome.
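A sketch of the robust second step follows: the L2 regression above is replaced by a Huber M-estimate computed via iteratively reweighted least squares. The moment-based choice of τ, diverging at rate sqrt(n/(p + log n)), is only a crude stand-in for the data-driven calibration of Section 3.2, which we do not reproduce here.

```python
import numpy as np
import statsmodels.api as sm

def huber_regression(X, Z, tau, n_iter=200, tol=1e-8):
    """Huber M-estimator via iteratively reweighted least squares."""
    theta = np.linalg.lstsq(X, Z, rcond=None)[0]      # OLS warm start
    for _ in range(n_iter):
        r = Z - X @ theta
        # Huber weight: 1 for |r| <= tau, tau/|r| beyond the threshold.
        w = np.where(np.abs(r) <= tau, 1.0,
                     tau / np.maximum(np.abs(r), 1e-12))
        XW = X * w[:, None]
        theta_new = np.linalg.solve(X.T @ XW, XW.T @ Z)
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta

def two_step_es_ah(X, y, alpha):
    """2S-AH sketch: QR first step, Huberised orthogonal-score second step."""
    n, p = X.shape
    beta_hat = np.asarray(sm.QuantReg(y, X).fit(q=alpha).params)
    q_hat = X @ beta_hat
    Z = q_hat + np.minimum(y - q_hat, 0.0) / alpha    # orthogonal surrogate
    # Crude scale estimate and a diverging robustification parameter.
    resid = Z - X @ np.linalg.lstsq(X, Z, rcond=None)[0]
    tau = np.std(resid) * np.sqrt(n / (p + np.log(n)))
    return huber_regression(X, Z, tau)
```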

Although we restrict attention to (joint) linear models in this work, our non-asymptotic theory and the underpinning techniques pave the way for analysing (i) series/projection estimators under joint non-parametric quantile-ES models and (ii) penalised estimators under high-dimensional sparse quantile-ES models. We leave these extensions to future research. One limitation of our data analysis for the JTPA study is that we do not account for potential selection bias. Specifically, as pointed out by Abadie et al. (2002), of all subjects assigned to participate in the training programme, only approximately 60% (the compliers) actually committed to it. These individuals may simply be more motivated to improve their earnings, and thus the training status is likely positively correlated with potential income earnings. Generalising the proposed method to estimate the complier expected shortfall treatment effect, using the instrumental variable approach previously considered in Abadie et al. (2002), is another direction for future research.

The ES regression methods considered in this paper are suited for a fixed quantile level α ∈ (0, 1), independent of the sample size. For extreme quantiles satisfying α = α_n → 0 (or 1) as n → ∞, both the FZ loss minimisation method (see equations (5) and (6)) and the two-step procedures perform poorly because observations become scarce at that level, i.e. nα_n is not large enough. In fact, if the dimension p is fixed, Online Supplementary Theorem A.1 and Theorem 1 imply that the two-step ES regression estimates, robust and non-robust, are consistent provided that α_n² n → ∞ as n → ∞. In the case where α_n² n = O(1), these methods are no longer useful, and one may need to resort to extreme value theory (de Haan & Ferreira, 2006; H. J. Wang et al., 2012), which provides the statistical tools for a feasible extrapolation into the tail of the variable of interest. A more detailed discussion on modelling the extremes is deferred to Online Supplementary Section E.
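To spell out the scaling behind this condition (a heuristic only: constants and the exact bound are suppressed, and the rate below is our reading of the fixed-p case of Theorem 1):

```latex
% With p fixed, a deviation bound of order \alpha_n^{-1}\sqrt{p/n}
% vanishes exactly when \alpha_n^2 n \to \infty:
\[
  \bigl\|\widehat{\theta}_{\mathrm{ES}} - \theta^{*}\bigr\|_{2}
    = O_{\mathbb{P}}\!\left(\frac{1}{\alpha_n}\sqrt{\frac{p}{n}}\right)
    \longrightarrow 0
  \quad\Longleftrightarrow\quad
  \alpha_n^{2}\, n \to \infty .
\]
```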

Acknowledgments

We sincerely thank the Joint Editor, Qiwei Yao, as well as the Associate Editor and the three reviewers for their numerous constructive comments and valuable suggestions, which have significantly helped us improve the quality of this work.

Funding

X.H. is supported by NSF Grants DMS-1914496 and DMS-1951980. K.M.T. is supported by NSF DMS-2113346 and NSF CAREER DMS-2238428. W.-X.Z. acknowledges the support from NSF DMS-2113409.

Data availability

The two data sets used in Sections 6.2 and 6.3 are publicly available at https://www.cdc.gov/nchs/nhanes/index.htm and https://economics.mit.edu/people/faculty/josh-angrist/angrist-data-archive, respectively.

Supplementary material

Supplementary material is available online at Journal of the Royal Statistical Society: Series B.

References

Abadie, A., Angrist, J., & Imbens, G. (2002). Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70(1), 91–117. https://doi.org/10.1111/1468-0262.00270

Acerbi, C., & Székely, B. (2014). Back-testing expected shortfall. Risk, 27(11), 76–81.

Acerbi, C., & Tasche, D. (2002). On the coherence of expected shortfall. Journal of Banking & Finance, 26(7), 1487–1503. https://doi.org/10.1016/S0378-4266(02)00283-2

Andrews, D. W. K. (1991). Asymptotic normality of series estimators for nonparametric and semiparametric regression models. Econometrica, 59(2), 307–345. https://doi.org/10.2307/2938259

Barendse, S. (2020). Efficiently weighted estimation of tail and interquantile expectations. Preprint. https://doi.org/10.2139/ssrn.2937665

Barzilai, J., & Borwein, J. M. (1988). Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1), 141–148. https://doi.org/10.1093/imanum/8.1.141

Basel Committee (2016). Minimum capital requirements for market risk (Technical Report). Bank for International Settlements. https://www.bis.org/bcbs/publ/d352.pdf

Basel Committee (2019). Minimum capital requirements for market risk (Technical Report). Bank for International Settlements. https://www.bis.org/bcbs/publ/d457.pdf

Bassett, G., Koenker, R., & Kordas, G. (2004). Pessimistic portfolio allocation and Choquet expected utility. Journal of Financial Econometrics, 2(4), 477–492. https://doi.org/10.1093/jjfinec/nbh023

Belloni, A., Chernozhukov, V., Chetverikov, D., & Fernández-Val, I. (2019). Conditional quantile processes based on series or many regressors. Journal of Econometrics, 213(1), 4–29. https://doi.org/10.1016/j.jeconom.2019.04.003

Ben-Tal, A., & Teboulle, M. (1986). Expected utility, penalty functions, and duality in stochastic nonlinear programming. Management Science, 32(11), 1445–1466. https://doi.org/10.1287/mnsc.32.11.1445

Bloom, H., Orr, L., Bell, S., Cave, G., Doolittle, F., Lin, W., & Bos, J. (1997). The benefits and costs of JTPA Title II-A programs: Key findings from the National Job Training Partnership Act study. The Journal of Human Resources, 32(3), 549–576. https://doi.org/10.2307/146183

Cai, Z., & Wang, X. (2008). Nonparametric estimation of conditional VaR and expected shortfall. Journal of Econometrics, 147(1), 120–130. https://doi.org/10.1016/j.jeconom.2008.09.005

Camaschella, C. (2015). Iron-deficiency anemia. New England Journal of Medicine, 372(19), 1832–1843. https://doi.org/10.1056/NEJMra1401038

Catoni, O. (2012). Challenging the empirical mean and empirical variance: A deviation study. Annales de l'Institut Henri Poincaré Probabilités et Statistiques, 48(4), 1148–1185. https://doi.org/10.1214/11-AIHP454

Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6, 5549–5632. https://doi.org/10.1016/S1573-4412(07)06076-X

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4), 493–507. https://doi.org/10.1214/aoms/1177729330

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1), C1–C68. https://doi.org/10.1111/ectj.12097

Chernozhukov, V., & Hansen, C. (2008). Instrumental variable quantile regression: A robust inference approach. Journal of Econometrics, 142(1), 379–398. https://doi.org/10.1016/j.jeconom.2007.06.005

Cont, R. (2001). Empirical properties of asset returns: Stylized facts and statistical issues. Quantitative Finance, 1(2), 223–236. https://doi.org/10.1080/713665670

de Haan, L., & Ferreira, A. (2006). Extreme value theory: An introduction. Springer-Verlag.

Dimitriadis, T., & Bayer, S. (2019). A joint quantile and expected shortfall regression framework. Electronic Journal of Statistics, 13(1), 1823–1871. https://doi.org/10.1214/19-EJS1560

Du, Z., & Escanciano, J. C. (2017). Backtesting expected shortfall: Accounting for tail risk. Management Science, 63(4), 940–958. https://doi.org/10.1287/mnsc.2015.2342

Eubank, R. L., & Spiegelman, C. H. (1990). Testing the goodness of fit of a linear model via nonparametric regression techniques. Journal of the American Statistical Association, 85(410), 387–392. https://doi.org/10.1080/01621459.1990.10476211

Fan, J., Li, Q., & Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(1), 247–265. https://doi.org/10.1111/rssb.12166

Farrell, M. H., Liang, T., & Misra, S. (2021). Deep neural networks for estimation and inference. Econometrica, 89(1), 181–213. https://doi.org/10.3982/ECTA16901

Fernandes, M., Guerre, E., & Horta, E. (2021). Smoothing quantile regressions. Journal of Business & Economic Statistics, 39(1), 338–357. https://doi.org/10.1080/07350015.2019.1660177

Fissler, T., & Ziegel, J. F. (2016). Higher order elicitability and Osband's principle. The Annals of Statistics, 44(4), 1680–1707. https://doi.org/10.1214/16-AOS1439

Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494), 746–762. https://doi.org/10.1198/jasa.2011.r10138

Hahn, M. G., Kuelbs, J., & Weiner, D. C. (1990). The asymptotic joint distribution of self-normalized censored sums and sums of squares. The Annals of Probability, 18(3), 1284–1341. https://doi.org/10.1214/aop/1176990747

He, X., Hsu, Y.-H., & Hu, M. (2010). Detection of treatment effects by covariate-adjusted expected shortfall. The Annals of Applied Statistics, 4(4), 2114–2125. https://doi.org/10.1214/10-AOAS347

He, X., Pan, X., Tan, K. M., & Zhou, W.-X. (2023). Smoothed quantile regression with large-scale inference. Journal of Econometrics, 232(2), 367–388. https://doi.org/10.1016/j.jeconom.2021.07.010

Huber, P. J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1(5), 799–821. https://doi.org/10.1214/aos/1176342503

Kato, K. (2012). Weighted Nadaraya–Watson estimation of conditional expected shortfall. Journal of Financial Econometrics, 10(2), 265–291. https://doi.org/10.1093/jjfinec/nbs002

Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46(1), 33–50. https://doi.org/10.2307/1913643

Kratovil, T., DeBerardinis, J., Gallagher, N., Luban, N., Soldin, S., & Wong, E. (2007). Age specific reference intervals for soluble transferrin receptor (sTfR). Clinica Chimica Acta, 380(1–2), 222–224. https://doi.org/10.1016/j.cca.2007.02.012

Linton, O., & Xiao, Z. (2013). Estimation and inference about the expected shortfall for time series with infinite variance. Econometric Theory, 29(4), 771–807. https://doi.org/10.1017/S0266466612000692

Martins-Filho, C., Yao, F., & Torero, M. (2018). Nonparametric estimation of conditional value-at-risk and expected shortfall based on extreme value theory. Econometric Theory, 34(1), 23–67. https://doi.org/10.1017/S0266466616000517

Mast, A., Blinder, M., Gronowski, A., Chumley, C., & Scott, M. (1998). Clinical utility of the soluble transferrin receptor and comparison with serum ferritin in several populations. Clinical Chemistry, 44(1), 45–51. https://doi.org/10.1093/clinchem/44.1.45

McNeil, A. J., Frey, R., & Embrechts, P. (2015). Quantitative risk management: Concepts, techniques and tools (2nd ed.). Princeton University Press.

Nemirovski, A., & Yudin, D. (1983). Problem complexity and method efficiency in optimization. Wiley.

Newey, W. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79(1), 147–168. https://doi.org/10.1016/S0304-4076(97)00011-0

Neyman, J. (1979). C(α) tests and their use. Sankhyā, 41(1/2), 1–21. http://www.jstor.org/stable/25050174

Pan, X., & Zhou, W.-X. (2021). Multiplier bootstrap for quantile regression: Non-asymptotic theory under random design. Information and Inference: A Journal of the IMA, 10(3), 813–861. https://doi.org/10.1093/imaiai/iaaa006

Patton, A. J., Ziegel, J. F., & Chen, R. (2019). Dynamic semiparametric models for expected shortfall (and value-at-risk). Journal of Econometrics, 211(2), 388–413. https://doi.org/10.1016/j.jeconom.2018.10.008

Peng, X., & Wang, H. J. (2022). Inference for joint quantile and expected shortfall regression. arXiv, arXiv:2208.10586, preprint: not peer reviewed.

Rockafellar, R. T., & Royset, J. O. (2014). Superquantiles and their applications to risk, random variables, and regression. INFORMS TutORials in Operations Research, 151–167. https://doi.org/10.1287/educ.2013.0111

Rockafellar, R. T., Royset, J. O., & Miranda, S. I. (2014). Superquantile regression with applications to buffered reliability, uncertainty quantification, and conditional value-at-risk. European Journal of Operational Research, 234(1), 140–154. https://doi.org/10.1016/j.ejor.2013.10.046

Rockafellar, R. T., & Uryasev, S. (2000). Optimization of conditional value-at-risk. The Journal of Risk, 2(3), 21–41. https://doi.org/10.21314/JOR.2000.038

Rockafellar, R. T., & Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7), 1443–1471. https://doi.org/10.1016/S0378-4266(02)00271-6

Rockey, D., & Cello, J. (1993). Evaluation of the gastrointestinal tract in patients with iron-deficiency anemia. New England Journal of Medicine, 329(23), 1691–1695. https://doi.org/10.1056/NEJM199312023292303

Scaillet, O. (2004). Nonparametric estimation and sensitivity analysis of expected shortfall. Mathematical Finance, 14(1), 115–129. https://doi.org/10.1111/j.0960-1627.2004.00184.x

Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4), 1875–1897. https://doi.org/10.1214/19-AOS1875

Shapiro, A., Dentcheva, D., & Ruszczynski, A. (2014). Lectures on stochastic programming: Modeling and theory (2nd ed.). SIAM.

Shen, G., Jiao, Y., Lin, Y., Horowitz, J. L., & Huang, J. (2021). Deep quantile regression: Mitigating the curse of dimensionality through composition. arXiv, arXiv:2107.04907, preprint: not peer reviewed.

Wang, H. J., Li, D., & He, X. (2012). Estimation of high conditional quantiles for heavy-tailed distributions. Journal of the American Statistical Association, 107(500), 1453–1464. https://doi.org/10.1080/01621459.2012.716382

Wang, L., Zheng, C., Zhou, W., & Zhou, W.-X. (2021). A new principle for tuning-free Huber regression. Statistica Sinica, 31(4), 2153–2177. https://doi.org/10.5705/ss.202019.0045

Zhou, W.-X., Bose, K., Fan, J., & Liu, H. (2018). A new perspective on robust M-estimation: Finite sample theory and applications to dependence-adjusted multiple testing. The Annals of Statistics, 46(5), 1904–1931. https://doi.org/10.1214/17-AOS1606

Author notes

Conflict of interest: None declared.
