Abstract

Motivated by the Basel III regulations, recent studies have considered joint forecasts of Value-at-Risk and Expected Shortfall. A large family of scoring functions can be used to evaluate forecast performance in this context. However, little intuitive or empirical guidance is currently available, which renders the choice of scoring function awkward in practice. We therefore develop graphical checks of whether one forecast method dominates another under a relevant class of scoring functions, and propose an associated hypothesis test. We illustrate these tools with simulation examples and an empirical analysis of S&P 500 and DAX returns.

The Basel III standard on minimum capital requirements for market risk (Basel Committee on Banking Supervision, 2016) uses Expected Shortfall (ES), rather than Value-at-Risk (VaR), to quantify the risk of a bank’s portfolio. As described by McNeil et al. (2015, Chapter 8), ES possesses several desirable theoretical properties. However, it also has a major drawback: it is not elicitable, that is, there is no scoring function that sets the incentive to report ES honestly, or that can be used to compare the accuracy of ES forecasts.1 As a partial remedy to this problem, Fissler and Ziegel (2016, henceforth FZ) show that ES is jointly elicitable with VaR and characterize the class of scoring functions that can be used to evaluate forecasts of type (VaR, ES). Fissler et al. (2016) provide a nontechnical introduction and discuss regulatory implications.

In applied work, it may be challenging to select a specific member function from the FZ family on either economic or statistical grounds. Furthermore, different choices of scoring functions may yield different forecast rankings in the case of imperfect forecasters or non-nested information sets; see, for example, Patton (2016). Motivated by this problem, we present a mixture representation using elementary members of the FZ family, which is mathematically similar to recent results by Ehm et al. (2016) for quantiles and expectiles. The mixture representation gives rise to Murphy diagrams, which allow one to check whether one forecast dominates another under a relevant class of scoring functions.2 While this class could be the entire FZ family, we argue that a subfamily that emphasizes ES, rather than VaR, is economically more desirable in the light of the Basel III standard. Analyzing the robustness of forecast rankings across this class of scoring functions is relevant both conceptually and practically, and is referred to as forecast dominance in the following.

Forecast dominance holds at the population level—that is, it is defined in terms of expected performance, which is unobservable. Statistical tests are based on observable performance statistics, and are designed to detect significant deviations from hypotheses about expected performance; see for example Diebold and Mariano (1995) and Clark and McCracken (2013). In the present context, such tests are complicated by the fact that the null hypothesis refers to performance under all elementary members of the mixture representation, that is, for all values of an auxiliary parameter. To tackle the resulting testing problem, we propose a variant of the test by Hansen (2005). The test was originally developed to conduct multiple comparisons among a finite set of forecast models. In contrast, our situation involves an infinite family of elementary functions which enter the comparison. We provide theoretical and simulation evidence that the test has good size and power properties in the present situation. Our test complements recent work by Ehm and Krüger (2018) and Yen and Yen (2018) who consider tests of forecast dominance for quantiles and expectiles. These papers differ from ours along several dimensions. First, they consider different forecast types (functionals) than we do. Second, their theoretical justification is based on Gaussian processes, whereas we use arguments from the multiple testing literature (Westfall and Young, 1993; Cox and Lee, 2008). Finally, Ehm and Krüger (2018) use independent permutation (rather than dependent bootstrap) methods to implement the null hypothesis.

In an empirical case study, we evaluate forecasts for daily log returns of the S&P 500 and DAX stock market indices. Three models with varying degrees of sophistication are considered: the HEAVY model (Shephard and Sheppard, 2010), which has access to past intra-daily data, competes against two models using merely end-of-day data, namely a GARCH(1, 1) model (Bollerslev, 1986) and a naive “historical simulation” model. Our results suggest that both the HEAVY and the GARCH model dominate simple historical simulation; in contrast, we find no dominance relation between HEAVY and GARCH.

We emphasize that our interest lies in comparative forecast evaluation—that is, we seek to compare the (VaR, ES) forecasts of two competing methods.3 Comparative evaluation is important to select a suitable forecasting method in practice, especially given the wealth of data sources and statistical techniques that could plausibly be used to generate forecasts. Comparative forecast evaluation is different from absolute evaluation which aims to determine whether a given forecast method possesses certain desirable optimality properties. The Basel II procedure of counting VaR “violations,” that is, the number of times the actual return fell below the VaR forecast, is an example of absolute forecast evaluation. See Nolde and Ziegel (2017) for a detailed discussion of comparative versus absolute evaluation of financial forecasts.

The contributions of the present article include a mixture representation of the FZ family in Section 1, which yields the Murphy diagrams, and a test for the hypothesis of forecast dominance. Section 2 introduces the test and provides a theoretical justification; Section 3 presents simulation evidence on the test’s size and power. We illustrate the mixture representation and the test in an empirical case study in Section 4. A discussion in Section 5 concludes. Three appendices contain proofs and technical details.

1 Consistent Scoring Functions for VaR and ES

To keep notation light, we start with a single-period outcome and move on to time-series considerations in the next section. Let Y ∈ ℝ be a random variable describing the single-period return of a financial asset, where a negative return, Y < 0, corresponds to a loss. VaR and ES are popular measures of tail risk. Let F denote the distribution of Y, and assume that Y has finite mean. Then for a given level α ∈ (0, 1), the VaR and ES are defined as

\[ \mathrm{VaR}_\alpha(F) = \inf\{y \in \mathbb{R} : F(y) \ge \alpha\} \]

and

\[ \mathrm{ES}_\alpha(F) = \frac{1}{\alpha} \int_0^\alpha \mathrm{VaR}_u(F)\, \mathrm{d}u. \]

If VaRα(F) is unique, then ESα(F) can also be written as

\[ \mathrm{ES}_\alpha(F) = \mathbb{E}\big[Y \mid Y \le \mathrm{VaR}_\alpha(F)\big]. \]

The latter representation explains the name “expected shortfall”: ESα(F) is the expectation of Y, given that Y is below its VaR. We are interested in small values of α, in particular α = 0.025, which is the level that the Basel Committee on Banking Supervision (2016) requests for ES predictions. Then, VaRα and ESα will typically have negative values. Our sign convention corresponds to the sign convention of utility functions as used in Delbaen (2012), and it implies that VaRα ≥ ESα always holds.
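The definitions above translate directly into a short empirical sketch. The function below (our own naming, not the authors' code) estimates VaRα and ESα from a return sample under the paper's sign convention, where both risk measures are typically negative and ESα ≤ VaRα:

```python
import numpy as np

def var_es(returns, alpha=0.025):
    # Empirical VaR (the alpha-quantile of returns) and ES (the mean of
    # the alpha-tail), in the sign convention of the text.
    y = np.sort(np.asarray(returns, dtype=float))
    k = int(np.ceil(alpha * len(y)))   # number of tail observations
    return y[k - 1], y[:k].mean()      # (VaR, ES)
```

For a large standard normal sample, the estimates should be close to the theoretical values VaR₀.₀₂₅ ≈ −1.96 and ES₀.₀₂₅ ≈ −2.34.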
Following Gneiting (2011a), Ehm et al. (2016), Patton (2016), and others, it is now widely recognized that consistent scoring functions are essential for comparing point forecasts. Consistency implies that, on average, a misspecified model cannot outperform a correct model. As discussed in Fissler and Ziegel (2016), ESα cannot be evaluated consistently without joint consideration of VaRα, so we stack the two functionals to obtain the two-dimensional functional

\[ T_\alpha(F) = \big(\mathrm{VaR}_\alpha(F), \mathrm{ES}_\alpha(F)\big). \]
As return distributions, we consider members of the class F1 of distributions with finite mean and unique quantiles. The latter assumption allows us to simplify our presentation and does not seem restrictive in the context of financial returns. For example, the HEAVY and GARCH models used in our case study (Section 4) clearly satisfy the assumption. As forecasts of type Tα, where v is a forecast of VaRα and e is a forecast of ESα, we consider elements of the action domain A0 = {(v, e) ∈ ℝ² : v ≥ e}, thereby ruling out irrational forecasts that violate VaRα ≥ ESα. The following definition formalizes the notion of a consistent scoring function for Tα.
Definition 1.1
A scoring function S : A0 × ℝ → ℝ is a function such that ∫ S(v, e, y) dF(y) exists for all F ∈ F1 and (v, e) ∈ A0. The scoring function S is called consistent for Tα if

\[ \mathbb{E}\, S\big(T_\alpha(F), Y\big) \le \mathbb{E}\, S(v, e, Y) \tag{1} \]

for all (v, e) ∈ A0 and all random variables Y with distribution in F1. The scoring function S is strictly consistent if equality in (1) implies (v, e) = Tα(F).
Equation (1) says that, in expectation, it is a forecaster’s best possible action to state the forecast Tα(F), rather than an arbitrary alternative (v,e)A0. In this sense, a consistent scoring function sets the incentive for honest and accurate forecasting of Tα. Importantly, there is not only one scoring function that is consistent for Tα. Instead, there is a whole family of scoring functions with this property.4 Here, we consider normalized scores for which S(y,y,y)=0 holds true. This normalization is in line with much of the existing literature (e.g., Gneiting, 2011a); other normalizations can easily be accommodated. Corollary 5.5 of Fissler and Ziegel (2016) implies that all scoring functions S of the form
\[
\begin{aligned}
S(v, e, y) = {}& \big(1\{y \le v\} - \alpha\big)\big(G_1(v) - G_1(y)\big) \\
&+ G_2(e)\Big(e - v + \tfrac{1}{\alpha}\, 1\{y \le v\}\,(v - y)\Big) \\
&- \mathcal{G}_2(e) + \mathcal{G}_2(y)
\end{aligned}
\tag{2}
\]

are consistent scoring functions for Tα, where G1, G2, and 𝒢2 are functions from ℝ to ℝ with 𝒢2′ = G2 (i.e., G2 is the derivative of 𝒢2), G1 and G2 are increasing, G2 ≥ 0, and ∫ G1(y) dF(y) and ∫ G2(y) dF(y) exist and are finite for all F ∈ F1. If G2 is strictly increasing, we obtain strict consistency. For example, the choice G1(z) = 0, G2(z) = exp(z)/(1 + exp(z)) satisfies all of these requirements, but there are many alternatives. Subject to regularity conditions, all normalized consistent scoring functions on the action domain A0 are of the form (2).
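As an illustration, the example choice G1(z) = 0 and G2(z) = exp(z)/(1 + exp(z)) mentioned in the text can be coded directly; the corresponding antiderivative is 𝒢2(z) = log(1 + exp(z)). The sketch below assumes our reading of the form (2) (function names are ours) and is normalized so that S(y, y, y) = 0:

```python
import numpy as np

def fz_score(v, e, y, alpha=0.025):
    # One consistent FZ scoring function for (VaR_alpha, ES_alpha), using
    # G1 = 0, G2(z) = exp(z)/(1+exp(z)), curly-G2(z) = log(1+exp(z)).
    v, e, y = (np.asarray(a, dtype=float) for a in (v, e, y))
    hit = (y <= v).astype(float)
    g2 = 1.0 / (1.0 + np.exp(-e))        # G2(e): increasing and nonnegative
    return (g2 * (e - v + hit * (v - y) / alpha)
            - np.logaddexp(0.0, e)       # curly-G2(e) = log(1 + exp(e))
            + np.logaddexp(0.0, y))      # curly-G2(y)
```

By consistency, the average score over a large sample should be smallest at the true (VaRα, ESα); for standard normal returns, the forecast (−1.96, −2.34) should beat a distorted forecast such as (−1.5, −2.0) on average.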

Patton (2011), Nolde and Ziegel (2017), and Patton et al. (2018) have argued for the use of homogeneous scoring functions for forecast comparison. Such additional requirements on the scoring functions narrow down the possible choices for G1, G2, and 𝒢2 in (2). On a restricted action domain A = ℝ × (−∞, 0) ⊂ A0, homogeneous scoring functions for Tα exist (Nolde and Ziegel, 2017, Theorem C.3), and under some additional assumptions there is even a unique zero-homogeneous choice (Patton et al., 2018, Proposition 1). Nevertheless, choosing one single scoring function for forecast evaluation implicitly imposes an order of preference on all sequences of forecasts, which is usually hard and sometimes impossible to justify. Indeed, the results in Nolde and Ziegel (2017) show no clear preference between a zero-homogeneous and a (1/2)-homogeneous choice of S with respect to performance in forecast comparison. In their simulation study, these two choices give emphasis to different aspects of model misspecification.

More generally, Patton (2016) and others have demonstrated that the choice of scoring function is relevant for the ranking of two competing forecasts in the presence of model misspecification and non-nested information sets, both of which are common in practice. The methods we consider in this article are robust with respect to the choice of scoring function, in the sense that we compare forecasts under a class of scoring functions. We therefore make the following definition of forecast dominance which is analogous to Ehm et al. (2016, Definition 1). 

Definition 1.2
Let α ∈ (0, 1) and let S be a class of consistent scoring functions for Tα. For two (possibly random) forecasts (VA, EA) and (VB, EB) made by methods A and B, respectively, we say that method A weakly dominates method B with respect to S if

\[ \mathbb{E}\, S(V_A, E_A, Y) \le \mathbb{E}\, S(V_B, E_B, Y) \quad \text{for all } S \in \mathcal{S}, \]

where the expectations are with respect to the joint distribution of (VA, EA, VB, EB, Y).

Once dominance has been established for a given class S, it can be translated to the extension including all mixtures; for example, dominance with respect to {S1, S2} implies dominance with respect to {aS1 + bS2 : a, b ≥ 0}. This simple observation is the basis for so-called Murphy diagrams, which are graphical tools to check for forecast dominance empirically with respect to all consistent scoring functions. Ehm et al. (2016) provide mixture representations of the families of consistent scoring functions for quantiles and expectiles. In order to derive similar methodology for (VaRα(F), ESα(F)), the following result presents a mixture representation for consistent scoring functions of the form given in (2).

Proposition 1.1
Let α ∈ (0, 1). For η, y ∈ ℝ and (v, e) ∈ A0, we define the elementary scores

\[ S_{\eta,1}(v, y) = \big(1\{y \le v\} - \alpha\big)\big(1\{\eta \le v\} - 1\{\eta \le y\}\big), \]
\[ S_{\eta,2}(v, e, y) = 1\{\eta \le e\}\Big(\tfrac{1}{\alpha}\, 1\{y \le v\}\,(v - y) - v + \eta\Big) + (y - \eta)^{+}. \]

Let H1 be a locally finite measure and H2 a measure that is finite on all intervals of the form (−∞, x], x ∈ ℝ. Then all scoring functions S : A0 × ℝ → ℝ that are of the form (2) can be written as

\[ S(v, e, y) = \int_{\mathbb{R}} S_{\eta,1}(v, y)\, \mathrm{d}H_1(\eta) + \int_{\mathbb{R}} S_{\eta,2}(v, e, y)\, \mathrm{d}H_2(\eta). \tag{3} \]

The scores at (3) are consistent for Tα. They are strictly consistent if H2 puts positive mass on all open intervals.

The first integral in Equation (3) represents the first line of Equation (2), whereas the second integral in (3) represents the second and third line of (2). In fact, for x1 ≤ x2, we have H1((x1, x2]) = G1(x2) − G1(x1), and for x ∈ ℝ, we have H2((−∞, x]) = G2(x). The scores Sη,1 and Sη,2 are themselves consistent scoring functions for Tα, which follows immediately by choosing Dirac measures for H1 or H2 in (3). The score Sη,1(v, y) goes to zero as η → ±∞, whereas the score Sη,2(v, e, y) goes to zero as η → +∞, and converges to (1/α)(1{y ≤ v} − α)(v − y) as η → −∞. This explains the different restrictions on the corresponding mixing measures H1 and H2 in Proposition 1.1.
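The limits just described can be checked numerically. The sketch below encodes our reading of the elementary scores in Proposition 1.1 (the indicator conventions are our assumption): Sη,1 depends only on the VaR forecast, while Sη,2 has jumps in η at the ES forecast.

```python
import numpy as np

def ind(b):
    # 0/1 indicator as float, for scalars or arrays
    return np.asarray(b, dtype=float)

def s_eta_1(eta, v, y, alpha=0.025):
    # Elementary quantile score; vanishes as eta -> +/- infinity.
    return (ind(y <= v) - alpha) * (ind(eta <= v) - ind(eta <= y))

def s_eta_2(eta, v, e, y, alpha=0.025):
    # Elementary (VaR, ES) score; -> 0 as eta -> +infinity and
    # -> (1/alpha)(1{y <= v} - alpha)(v - y) as eta -> -infinity.
    tail = ind(y <= v) * (v - y) / alpha
    return ind(eta <= e) * (tail - v + eta) + np.maximum(y - eta, 0.0)
```

Both scores are normalized, so that Sη,2(y, y, y) = 0 for any η, and both vanish once η exceeds all relevant forecast and observation values.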

We identify a subclass of consistent scoring functions for Tα whose members emphasize the evaluation of the ESα component. The first integral in (3) corresponds to the mixture representation of consistent scoring functions for quantiles (Ehm et al., 2016, Theorem 1a), a class that in our context only evaluates the VaRα forecast and ignores ESα. Hence, choosing anything but a constant H1 puts unnecessary emphasis on the VaRα component of the forecast. The second integral corresponds to the evaluation of ESα, conditional on VaRα, where we cannot completely eliminate VaRα from the evaluation due to the results on the (non-)elicitability of ESα. Hence, we define S2 as the class of all consistent scoring functions for Tα as given at (3) with a constant H1 (such that the first integral is zero), and focus on this class in the following.5 In the context of the class S2, we denote the elementary scores Sη,2 simply by Sη, since the scores Sη,1 have been excluded.

Our focus on S2 is motivated by the aim to maximize the impact of the ESα component in evaluation, which is in line with the emphasis set in Basel III. Focusing on S2 also seems justified from a statistical perspective: Dimitriadis and Bayer (2017) investigate several members of S2 in a regression framework. They argue that moving beyond S2 (i.e., considering nonconstant choices of H1 in Equation (3)) does not improve the numerical performance of their estimators. Furthermore, S2 contains positively homogeneous scoring functions for all possible degrees of homogeneity; see Nolde and Ziegel (2017, Section 2.3.1 and Theorem 6). As discussed there, positively homogeneous scoring functions enjoy a number of attractive properties.

The mixture representation at (3) allows graphical displays of the performance of Tα forecasts with respect to the elementary scores of S2,

\[ \eta \mapsto \mathbb{E}\, S_\eta(V, E, Y), \qquad \eta \in \mathbb{R}, \]

where the expectation is with respect to the joint distribution of (V, E, Y). In practice, the expectation is estimated by the average observed score. Examples of these displays, called Murphy diagrams (Ehm et al., 2016), are given in Figure 2 in Section 4. The diagrams provide simple graphical checks of whether one forecast dominates another under all scoring functions in S2. Specifically, Proposition 1.1 implies that the forecast of method A dominates that of method B with respect to S2 if and only if

\[ \mathbb{E}\, S_\eta(V_A, E_A, Y) \le \mathbb{E}\, S_\eta(V_B, E_B, Y) \quad \text{for all } \eta \in \mathbb{R}; \]

this condition is analogous to the one in Ehm et al. (2016, Corollary 1) for quantiles and expectiles.
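Empirically, the expectations are replaced by sample averages over a grid of η values, which is all a Murphy diagram plots. A minimal sketch, using our reading of the elementary scores Sη (names are ours):

```python
import numpy as np

def murphy_curve(v, e, y, etas, alpha=0.025):
    # Average elementary score S_eta(v, e, y) for each eta in 'etas';
    # plotting 'etas' against the result gives one Murphy diagram curve.
    v, e, y = np.broadcast_arrays(*map(np.asarray, (v, e, y)))
    hit = (y <= v).astype(float)
    curve = []
    for eta in etas:
        s = (eta <= e) * (hit * (v - y) / alpha - v + eta) + np.maximum(y - eta, 0.0)
        curve.append(s.mean())
    return np.array(curve)
```

Method A dominates method B on the plotted range exactly when A's curve lies weakly below B's everywhere.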

Clearly, one could also consider forecast dominance for Tα with respect to all consistent scoring functions. The procedures described in the following can be adapted to this case, an extension that is conceptually simple yet tedious in practice: one needs to check inequalities across two grids of parameters, one for Sη,1 and one for Sη,2. When focusing on S2 instead, it suffices to check inequalities along a single grid for η.

2 Testing Forecast Dominance

Here we first translate the methodology from Section 1 into a time series context, and then introduce a test of forecast dominance based on the elementary scores.

2.1 Comparing Time Series Forecasts

So far, we have only considered a one-period forecasting problem. In most financial applications, however, the goal is to predict a time series {Yt}t∈ℕ, such as a sequence of asset returns observed at trading days t = 1, 2, …. Furthermore, let (Vt, Et) ∈ A0 denote the (VaRα, ESα) forecast of Yt, based on an appropriate information set Wt−1 generated by data available at time t − 1. In applications, we seek to make forecasts and realizations comparable across time. We therefore require the following assumption.

Assumption 2.1

The time series {Zt}t∈ℕ with Zt = (Vt, Et, Yt) ∈ A0 × ℝ is stationary with distribution FZ.

This assumption rules out deterministic time trends, structural breaks, and seasonalities. Nevertheless, many multivariate autoregressive models (e.g., Lütkepohl, 2005) or stochastic volatility models (e.g., Harvey et al., 1994) are stationary.

Consider any consistent scoring function S for Tα. Assumption 2.1 implies that the distribution of the random variable S(Vt, Et, Yt) does not depend on time t. In particular, this holds when S equals an elementary score Sη in the class S2. We can thus define the notion of an expected elementary score, as follows. Consider a sequence of forecasts {Vt, Et}t∈ℕ and corresponding realizations {Yt}t∈ℕ which jointly define a stationary time series as in Assumption 2.1. The expected elementary scores for this process are given by

\[ \mathbb{E}_{F_Z}\, S_\eta(V_t, E_t, Y_t) = \int S_\eta(v, e, y)\, \mathrm{d}F_Z(v, e, y), \qquad \eta \in \mathbb{R}, \tag{4} \]
where FZ is defined in Assumption 2.1. Based on this definition, a notion of forecast dominance “on average over time” follows naturally: 
Definition 2.1
Let {VtA, EtA}t∈ℕ and {VtB, EtB}t∈ℕ denote two competing sequences of forecasts of (VaRα, ESα), and let {Yt}t∈ℕ denote the corresponding realizations, such that {(VtA, EtA, Yt)}t∈ℕ and {(VtB, EtB, Yt)}t∈ℕ both satisfy Assumption 2.1 with stationary distributions FZA and FZB, respectively. We say that method A weakly dominates method B with respect to S2 if

\[ \mathbb{E}_{F_Z^A}\, S_\eta\big(V_t^A, E_t^A, Y_t\big) \le \mathbb{E}_{F_Z^B}\, S_\eta\big(V_t^B, E_t^B, Y_t\big) \quad \text{for all } \eta \in \mathbb{R}, \]

where the expectations are as at (4) with respect to the corresponding stationary distribution.
If {Sη(VtA, EtA, Yt)}t∈ℕ and {Sη(VtB, EtB, Yt)}t∈ℕ are ergodic, the expectations in Definition 2.1 can be consistently estimated by empirical averages over observed forecasts and realizations at dates t = 1, …, n; for example, as n → ∞ it holds that

\[ \frac{1}{n} \sum_{t=1}^{n} S_\eta\big(V_t^A, E_t^A, Y_t\big) \longrightarrow \mathbb{E}_{F_Z^A}\, S_\eta\big(V_t^A, E_t^A, Y_t\big) \quad \text{almost surely}, \]

and analogously for method B.

2.2 Testing for Forecast Dominance

We are interested in the following null hypothesis:

H0: Method A weakly dominates method B with respect to S2.

Definition 2.1 gives a formal statement of the hypothesis. Importantly, we consider tests of finite-sample predictive ability (Giacomini and White, 2006; Clark and McCracken, 2013, Section 3.2) throughout this article. That is, we ask whether method A outperforms method B, taking into account that both methods are based on imperfect parameter estimates. This setup is closely aligned with practical forecast comparisons where parameter uncertainty is a relevant concern.6 Our proposed testing procedure employs the test by Hansen (2005). Whereas the test was originally designed to handle a finite number of comparisons, we adapt it to our problem, which involves an infinite number of comparisons. We describe the procedure in the present section. In Sections 2.3 and 2.4, we present theory and connections to the multiple testing literature.
For each η ∈ ℝ and t ∈ ℕ, consider the score difference between methods A and B,

\[ \delta_t(\eta) = S_\eta\big(V_t^A, E_t^A, Y_t\big) - S_\eta\big(V_t^B, E_t^B, Y_t\big), \]

with expectation μ(η) and standard deviation σ(η). This translates to a null hypothesis of

\[ H_0: \mu(\eta) \le 0 \quad \text{for all } \eta \in \mathbb{R}. \]
To construct test statistics, we use the finite sample estimates of μ(η) and σ(η),

\[ \mu_n(\eta) = \frac{1}{n} \sum_{t=1}^{n} \delta_t(\eta), \qquad \sigma_n^2(\eta) = \hat\gamma_0(\eta) + 2 \sum_{i=1}^{n-1} \kappa(n, i)\, \hat\gamma_i(\eta), \]

where μn(η) is the average of the empirical score differences, and σn(η) is an autocorrelation-consistent estimator based on the empirical autocovariances γ̂i(η) and kernel weights κ(n, i). Politis and Romano (1994) derive the estimator σn(η) in the context of the stationary bootstrap under the requirement that qn → 0 as n → ∞, with optimality considerations leading to qn = c·n^(−1/3). We choose c ≈ 1.36 as explained below, and compute the usual t-statistic given by

\[ t_n(\eta) = \sqrt{n}\, \frac{\mu_n(\eta)}{\sigma_n(\eta)}. \]
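For concreteness, the studentization step can be sketched as follows. The kernel weights below are those implied by the stationary bootstrap as used in Hansen (2005); treating κ(n, i) = ((n−i)/n)(1−q)^i + (i/n)(1−q)^(n−i) as the exact form is our assumption, and the function names are ours:

```python
import numpy as np

def sigma2_n(d, q):
    # Autocorrelation-consistent variance estimate for sqrt(n) * mean(d),
    # built from empirical autocovariances and stationary-bootstrap weights
    # kappa(n, i) = ((n - i)/n)(1 - q)^i + (i/n)(1 - q)^(n - i).
    d = np.asarray(d, dtype=float)
    n = len(d)
    dc = d - d.mean()
    gamma = np.array([dc[: n - i] @ dc[i:] / n for i in range(n)])
    i = np.arange(1, n)
    kappa = (n - i) / n * (1 - q) ** i + i / n * (1 - q) ** (n - i)
    return gamma[0] + 2.0 * (kappa * gamma[1:]).sum()

def t_stat(d, q):
    # The usual t-statistic for one value of eta.
    d = np.asarray(d, dtype=float)
    return np.sqrt(len(d)) * d.mean() / np.sqrt(sigma2_n(d, q))
```

For serially uncorrelated score differences, the estimate is close to the plain sample variance; its value lies in handling autocorrelated time series.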
Politis and Romano (1994) and Hansen (2005) justify the studentized bootstrap test statistic

\[ t_n^{*}(\eta) = \sqrt{n}\, \frac{\mu_n^{*}(\eta) - \mu_n(\eta)}{\sigma_n(\eta)}, \]

where μn*(η) = (1/n) Σ_{t=1}^{n} δt*(η) is the average of a bootstrap sample of score differences {(δt*(η))η∈ℝ : t = 1, …, n}. We use the stationary bootstrap of Politis and Romano (1994), where blocks of data are sampled at random. The block length is also random, following a geometric distribution with parameter qn, leading to an average block length of 1/qn. As mentioned above, we let qn = c·n^(−1/3), where we set the constant c ≈ 1.36, so that we obtain an average block length of ten in our empirical study (for which n is around 2500). Choosing μn(η) to center the distribution of μn*(η) corresponds to the conservative estimator μ̂u in Hansen (2005, p. 372). It is a convenient property of the stationary bootstrap that the formula for the studentizing factor σn(η) does not depend on the specific observations drawn in a bootstrap iteration, and hence it must be computed only once. Following Hansen (2005), we use the same variance estimator for the bootstrap iterations and the original sample.
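The resampling itself is straightforward. A sketch of the stationary bootstrap index draw (circular wrap-around at the sample end is one common convention and our assumption here):

```python
import numpy as np

def sb_indices(n, q, rng):
    # One stationary-bootstrap index sample: each step either continues the
    # current block (prob. 1 - q) or restarts at a uniform position (prob. q),
    # giving geometric block lengths with mean 1/q.
    idx = np.empty(n, dtype=np.int64)
    t = int(rng.integers(n))
    for i in range(n):
        idx[i] = t
        t = int(rng.integers(n)) if rng.random() < q else (t + 1) % n
    return idx
```

Applying the same index vector to every series δt(η) yields the bootstrap averages μn*(η) for all η at once.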
We then compute the supremal test statistics on ℝ for the original data and all bootstrap iterations b = 1, …, B,

\[ T^{\max} = \sup_{\eta \in \mathbb{R}} t_n(\eta), \qquad T_b^{*,\max} = \sup_{\eta \in \mathbb{R}} t_{n,b}^{*}(\eta). \]
Due to the test statistics’ structure, these suprema can be computed exactly, as explained in Appendix B, but the computational cost increases quickly in the sample size. The p-value for the joint null hypothesis that μ(η) ≤ 0 for all η ∈ ℝ is given by

\[ p_H = P_B\big(T^{*,\max} > T^{\max}\big), \tag{5} \]

where PB denotes the probability measure induced by bootstrap sampling; in practice, pH is estimated by the fraction of bootstrap iterations b for which Tb*,max exceeds Tmax.

In Appendix B, we show that it is sufficient to consider the test statistics as piecewise functions with break points at all η ∈ {EtA, EtB}t=1,…,n, that is, at the set of ES forecasts. The supremal test statistics are computed as the maximum of the left-sided and right-sided limits at all break points, and of all remaining critical points that fall in their respective interval of the piecewise partition.

As the sample size and the number of break points grow, computations become increasingly expensive, so we also explore whether computational shortcuts are attainable without compromising the test’s size and power properties. To that end, we evaluate the test statistics at a grid of values G = {η[1], η[2], …, η[m]} ⊂ ℝ, with η[1] < η[2] < ⋯ < η[m]. We consider the following choices for G:
  1. The grid Gn that consists of the (ordered) elements of the set {EtA, EtB}t=1,…,n. This set is a natural choice in that it coincides with the jump points of the elementary scores Sη; see Proposition 1.1. We refer to this choice as “jumps” in the following.

  2. A thinned version of Gn, considering only every tenth element (“jumps/10”).

  3. An equally spaced grid ranging from the minimum of Gn to the maximum of Gn, containing as many elements as the thinned version in 2 (“equidistant”).
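The three grid choices above can be constructed in a few lines; a sketch (function and label names are ours):

```python
import numpy as np

def eta_grid(es_a, es_b, kind):
    # Grids of eta values built from the pooled ES forecasts of both methods.
    jumps = np.sort(np.concatenate([np.asarray(es_a), np.asarray(es_b)]))
    if kind == "jumps":        # all jump points of the elementary scores
        return jumps
    thinned = jumps[::10]      # every tenth jump point
    if kind == "jumps/10":
        return thinned
    if kind == "equidistant":  # same number of points, equally spaced
        return np.linspace(jumps[0], jumps[-1], len(thinned))
    raise ValueError(f"unknown grid type: {kind}")
```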

See Sections 2.3 and 3 for theory and simulation evidence on the choice of grid points. We compute the maximal test statistics (original and bootstrap iterations) across G,

\[ \tilde T^{\max} = \max_{\eta \in G} t_n(\eta), \qquad \tilde T_b^{*,\max} = \max_{\eta \in G} t_{n,b}^{*}(\eta), \]

leading to a p-value for the joint null hypothesis that μ(η) ≤ 0 for all η ∈ G,

\[ \tilde p_H = P_B\big(\tilde T^{*,\max} > \tilde T^{\max}\big), \tag{6} \]
the empirical probability that the bootstrap test statistic exceeds its sample counterpart.
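Putting the pieces together, the grid-based procedure can be sketched end to end. This toy version uses our reading of the elementary scores, a plain sample standard deviation in place of the autocorrelation-consistent σn(η), and the stationary bootstrap; it is an illustration under these assumptions, not the authors' implementation:

```python
import numpy as np

def dominance_pvalue(va, ea, vb, eb, y, etas, alpha=0.025, B=200, seed=0):
    # Bootstrap p-value for H0: mu(eta) <= 0 for all eta in the grid, where
    # mu(eta) = E[S_eta(A)] - E[S_eta(B)]; small p-values speak against
    # "method A weakly dominates method B".
    rng = np.random.default_rng(seed)
    n = len(y)
    q = 1.36 * n ** (-1 / 3)                  # average block length 1/q

    def s2(eta, v, e, yy):
        hit = (yy <= v).astype(float)
        return (eta <= e) * (hit * (v - yy) / alpha - v + eta) + np.maximum(yy - eta, 0.0)

    # score differences, one column per eta
    delta = np.column_stack([s2(eta, va, ea, y) - s2(eta, vb, eb, y) for eta in etas])
    mu = delta.mean(axis=0)
    sig = delta.std(axis=0, ddof=1) + 1e-12   # simplified studentization
    t_max = np.sqrt(n) * (mu / sig).max()

    t_boot = np.empty(B)
    for b in range(B):
        idx = np.empty(n, dtype=np.int64)     # stationary bootstrap draw
        t = int(rng.integers(n))
        for i in range(n):
            idx[i] = t
            t = int(rng.integers(n)) if rng.random() < q else (t + 1) % n
        mu_b = delta[idx].mean(axis=0)
        t_boot[b] = (np.sqrt(n) * (mu_b - mu) / sig).max()
    return float(np.mean(t_boot > t_max))
```

In a quick check with constant forecasts, the p-value should be small when method A is clearly worse than method B, and large when A is clearly better.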

2.3 Theoretical Justifications

Following Cox and Lee (2008, Section 3.1), this section provides a theoretical justification for using the Hansen (2005) test in the present context. We first define the theoretical, supremal test statistic

\[ T_0^{\max} = \sup_{\eta \in \mathbb{R}} \sqrt{n}\, \frac{\mu_n(\eta) - \mu(\eta)}{\sigma_n(\eta)}. \]

This quantity cannot be computed in practice, since the expected score difference μ(η) is unknown even under the null hypothesis (the latter imposes no specific value but only that μ(η) ≤ 0). Nevertheless, T0max is useful for the theoretical justification of the testing procedure described in Section 2.2. Let F* denote the bootstrap distribution function of T0*,max and F*−1 its quantile function. The following high-level assumption states that the bootstrap procedure is consistent for the supremum T0max.
Assumption 2.2
Let α ∈ (0, 1). The bootstrap procedure is such that

\[ P\big(T_0^{\max} > F^{*-1}(1 - \alpha)\big) \le \alpha + a_n \]

for some sequence an → 0 as n → ∞.

White (2000, Proposition 2.2 and Corollary 2.7) and Hansen (2005, Corollary 3) establish results similar to our Assumption 2.2 in the context of comparing multiple forecasting methods. In both of these studies, the test statistic of interest is the maximal element of a finite-dimensional vector. In contrast, our procedure is based on functional data, in that our test statistics are suprema over an uncountable set. While Assumption 2.2 seems plausible, we are not aware of a formal justification, but this issue is beyond the scope of the present article. 

Theorem 2.1
Suppose that Assumption 2.2 holds and that μ(η) ≤ 0 for all η ∈ ℝ. Then, for any α ∈ (0, 1),

\[ P\big(p_H \le \alpha\big) \le \alpha + a_n, \]

where the sequence (an)n∈ℕ is as in Assumption 2.2.

Theorem 2.1 shows that the size of the test is under control for all elements of the null hypothesis (i.e., both on its boundary and in its interior, corresponding to equal performance and strict dominance, respectively). This type of control is often hard to achieve; cf. the comments and references in Ehm and Krüger (2018, Section 7).

Note that Theorem 2.1 refers to the p-value pH in (5), which is based on analytical calculations for the relevant suprema. For computational reasons, we might not want to compute the p-value pH but only the grid-based approximation p̃H given at (6). The following result states conditions under which this approximation is justified. For a grid G = {η[1], η[2], …, η[m]} ⊂ ℝ, with η[1] < η[2] < ⋯ < η[m] as in Section 2.2, we define for i = 1, …, m + 1 the intervals

\[ I_i = \big(\eta_{[i-1]}, \eta_{[i]}\big), \qquad \text{where } \eta_{[0]} = -\infty \text{ and } \eta_{[m+1]} = +\infty. \]
We next assume that there are upper bounds on the variation of the test statistics within each interval: 
Assumption 2.3
There exist constants τG ≥ 0 and τG* ≥ 0 such that

\[ \sup_{\eta \in \mathbb{R}} t_n(\eta) \le \max_{\eta \in G} t_n(\eta) + \tau_G, \tag{7} \]

\[ \sup_{\eta \in \mathbb{R}} t_{n,b}^{*}(\eta) \le \max_{\eta \in G} t_{n,b}^{*}(\eta) + \tau_G^{*} \quad \text{for } b = 1, \ldots, B. \tag{8} \]
 
Proposition 2.2
Suppose that Assumption 2.3 holds. Then,

\[ \tilde p_H(\tau_G) \le p_H \le \tilde p_H(-\tau_G^{*}), \]

where p̃H(ε) = PB(T̃*,max > T̃max + ε).

The proposition derives lower and upper bounds on the analytical p-value pH. In particular, it states that the impact of the grid approximation is small if the test statistics display little variation within the intervals Ii. In the Monte Carlo study of Section 3, we follow the recommendation of Cox and Lee (2008, p. 626) and use the “raw” p-values resulting from the grid approximation. In doing so, we essentially assume that τG=τG*=0 which seems reasonable when the grid is sufficiently dense, for example for large sample sizes and a continuous population distribution for the ES forecasts.7

Taking a slightly different perspective, the grid-based approximation can be seen as testing the following, restricted notion of forecast dominance: 

Definition 2.1’
Method A weakly dominates method B with respect to S2 on the finite grid G ⊂ ℝ if

\[ \mathbb{E}_{F_Z^A}\, S_\eta\big(V_t^A, E_t^A, Y_t\big) \le \mathbb{E}_{F_Z^B}\, S_\eta\big(V_t^B, E_t^B, Y_t\big) \quad \text{for all } \eta \in G, \]

where G has been defined before Assumption 2.3, and all other objects are as in Definition 2.1.

Dominance in the sense of Definition 2.1’ is a necessary condition for dominance in the sense of Definition 2.1. Furthermore, if the grid G is deterministic, then the Hansen (2005) test is valid for Definition 2.1’ without further adjustment.8 Proposition 2.2 quantifies the difference between pH (the p-value for Definition 2.1) and p̃H (the p-value for Definition 2.1’). Its result is in line with the intuition that both p-values are similar if the grid is dense enough. Our simulation results in Table 1 provide direct numerical evidence on the quality of the grid-based approximation.

Table 1.

Simulation results on size and power (grid-focused)

Size scenarios (ζ1 = ζ2 = 1)

ES level    Grid          β = 0.0            β = 0.5            β = 0.7            β = 0.9
                       ν = 10    6    4   ν = 10    6    4   ν = 10    6    4   ν = 10    6    4
α = 0.010   exact        3.9   4.2  4.1     3.4   2.9  3.3     1.4   2.3  2.4     1.1   0.4  0.4
            jumps        2.0   2.4  2.4     2.1   2.4  1.2     1.9   1.2  1.0     0.4   0.5  0.1
            jumps/10     3.1   3.0  3.8     2.4   2.3  2.3     2.1   1.5  2.1     0.6   0.5  0.1
            equidist.    3.4   4.2  4.4     3.0   2.2  2.6     2.6   2.4  1.8     0.6   0.3  0.3
α = 0.025   exact        3.7   4.0  4.4     3.1   3.4  3.3     3.7   3.4  2.7     2.1   1.0  1.3
            jumps        3.2   3.2  3.2     3.0   2.5  2.1     1.9   1.8  1.8     0.8   0.7  0.5
            jumps/10     2.9   2.9  4.1     2.3   2.3  2.8     2.6   2.0  1.5     1.0   0.8  0.9
            equidist.    3.2   3.9  2.9     2.7   3.5  2.1     3.0   3.2  1.9     1.5   1.0  0.6
α = 0.050   exact        4.8   4.9  4.4     4.3   3.2  4.5     3.6   3.4  3.4     1.5   1.8  2.1
            jumps        3.6   3.6  4.7     3.6   3.4  2.7     1.1   2.7  2.6     1.3   1.2  1.1
            jumps/10     3.7   4.3  3.0     2.6   2.9  2.9     2.1   2.5  3.1     0.9   1.2  0.7
            equidist.    3.8   3.6  3.2     2.8   4.4  4.0     2.0   2.9  2.9     2.0   1.7  1.0

Power scenarios (ζ1 = 0.1, ζ2 = 0)

α = 0.010   exact       64.2  45.8 29.8     8.1   4.5  2.7     1.4   1.2  0.5     0.1   0.0  0.1
            jumps       55.8  38.1 20.4     6.6   2.2  1.2     0.8   0.3  0.4     0.0   0.0  0.0
            jumps/10    55.7  37.4 22.4     4.9   1.6  1.4     1.1   0.1  0.3     0.1   0.1  0.1
            equidist.   57.5  38.3 25.3     5.1   3.1  1.2     1.0   0.4  0.2     0.2   0.3  0.0
α = 0.025   exact       92.6  85.2 81.7    33.2  25.0 21.7    11.8   8.1  4.6     1.2   1.1  1.0
            jumps       85.7  80.9 73.6    27.5  21.1 15.2     7.6   6.3  3.5     0.6   0.7  0.5
            jumps/10    86.3  82.2 78.2    25.6  18.3 13.7     7.8   5.3  3.1     1.0   0.6  0.5
            equidist.   87.1  84.1 78.4    26.3  20.5 15.8     6.4   5.2  4.5     0.7   0.2  0.2
α = 0.050   exact       98.7  98.2 97.6    58.9  57.4 53.1    26.6  21.6 18.6     4.5   3.0  3.0
            jumps       96.7  96.0 96.9    56.0  48.1 46.4    21.8  20.4 16.4     2.2   2.9  2.5
            jumps/10    96.9  95.6 96.4    56.6  49.5 48.8    22.2  17.5 16.3     4.3   2.7  2.9
            equidist.   97.8  96.8 96.9    54.6  50.6 49.7    23.2  20.7 15.6     2.7   3.6  2.6

Notes: Size and power results (in percentage points) of the Monte Carlo investigation of the dominance test for a 5% significance level, with focus on the effect of the grid type. “exact” denotes exact analytical computation of the test statistic; the three grid types (“jumps,” “jumps/10,” and “equidistant”) are introduced at the end of Section 2.2. The sample size is fixed at 500 observations, p-values are generated using 500 bootstrap iterations, and 1000 p-values are drawn.

Table 1.

Simulation results on size and power (grid-focused)

ES level   Grid        β = 0.0              β = 0.5              β = 0.7              β = 0.9
                       ν=10   ν=6   ν=4     ν=10   ν=6   ν=4     ν=10   ν=6   ν=4     ν=10   ν=6   ν=4

Size scenarios (ζ1 = ζ2 = 1)
α = 0.010  exact        3.9   4.2   4.1      3.4   2.9   3.3      1.4   2.3   2.4      1.1   0.4   0.4
           jumps        2.0   2.4   2.4      2.1   2.4   1.2      1.9   1.2   1.0      0.4   0.5   0.1
           jumps/10     3.1   3.0   3.8      2.4   2.3   2.3      2.1   1.5   2.1      0.6   0.5   0.1
           equidist.    3.4   4.2   4.4      3.0   2.2   2.6      2.6   2.4   1.8      0.6   0.3   0.3
α = 0.025  exact        3.7   4.0   4.4      3.1   3.4   3.3      3.7   3.4   2.7      2.1   1.0   1.3
           jumps        3.2   3.2   3.2      3.0   2.5   2.1      1.9   1.8   1.8      0.8   0.7   0.5
           jumps/10     2.9   2.9   4.1      2.3   2.3   2.8      2.6   2.0   1.5      1.0   0.8   0.9
           equidist.    3.2   3.9   2.9      2.7   3.5   2.1      3.0   3.2   1.9      1.5   1.0   0.6
α = 0.050  exact        4.8   4.9   4.4      4.3   3.2   4.5      3.6   3.4   3.4      1.5   1.8   2.1
           jumps        3.6   3.6   4.7      3.6   3.4   2.7      1.1   2.7   2.6      1.3   1.2   1.1
           jumps/10     3.7   4.3   3.0      2.6   2.9   2.9      2.1   2.5   3.1      0.9   1.2   0.7
           equidist.    3.8   3.6   3.2      2.8   4.4   4.0      2.0   2.9   2.9      2.0   1.7   1.0

Power scenarios (ζ1 = 0.1, ζ2 = 0)
α = 0.010  exact       64.2  45.8  29.8      8.1   4.5   2.7      1.4   1.2   0.5      0.1   0.0   0.1
           jumps       55.8  38.1  20.4      6.6   2.2   1.2      0.8   0.3   0.4      0.0   0.0   0.0
           jumps/10    55.7  37.4  22.4      4.9   1.6   1.4      1.1   0.1   0.3      0.1   0.1   0.1
           equidist.   57.5  38.3  25.3      5.1   3.1   1.2      1.0   0.4   0.2      0.2   0.3   0.0
α = 0.025  exact       92.6  85.2  81.7     33.2  25.0  21.7     11.8   8.1   4.6      1.2   1.1   1.0
           jumps       85.7  80.9  73.6     27.5  21.1  15.2      7.6   6.3   3.5      0.6   0.7   0.5
           jumps/10    86.3  82.2  78.2     25.6  18.3  13.7      7.8   5.3   3.1      1.0   0.6   0.5
           equidist.   87.1  84.1  78.4     26.3  20.5  15.8      6.4   5.2   4.5      0.7   0.2   0.2
α = 0.050  exact       98.7  98.2  97.6     58.9  57.4  53.1     26.6  21.6  18.6      4.5   3.0   3.0
           jumps       96.7  96.0  96.9     56.0  48.1  46.4     21.8  20.4  16.4      2.2   2.9   2.5
           jumps/10    96.9  95.6  96.4     56.6  49.5  48.8     22.2  17.5  16.3      4.3   2.7   2.9
           equidist.   97.8  96.8  96.9     54.6  50.6  49.7     23.2  20.7  15.6      2.7   3.6   2.6

Notes: Size and power results (in percentage points) of the Monte Carlo investigation of the dominance test for a 5% significance level, with focus on the effect of the grid type. “exact” denotes exact analytical computation of the test statistic; the three grid types (“jumps,” “jumps/10,” and “equidistant”) are introduced at the end of Section 2.2. The sample size is fixed at 500 observations, p-values are generated using 500 bootstrap iterations, and 1000 p-values are drawn.

2.4 Related Procedures

The testing procedure described in Sections 2.2 and 2.3 can be viewed as conducting pointwise tests for each η, adjusting the resulting test statistics or p-values for multiple testing, and then taking the minimal adjusted p-value as a p-value for the joint hypothesis μ(η) ≤ 0 for all η ∈ ℝ. More specifically, the p-value adjustment implicit in the method of Hansen (2005) is a simplified ("one-step" or "single-step") variant of the Westfall and Young (1993) step-down procedure for multiple testing; see Cox and Lee (2008, Section 3.2) and Meinshausen et al. (2011). Cox and Lee (2008) analyze the properties of applying Westfall and Young (1993) to functional data. In Appendix C, we describe the Westfall–Young procedure for our testing problem. The resulting p-value p_WY for the null hypothesis μ(η) ≤ 0 for all η ∈ ℝ always fulfills p_WY ≤ p_H, implying that the Westfall–Young procedure is more powerful. There are situations where the difference between the two procedures is noticeable; see Cox and Lee (2008, Section 3.2). However, in our Monte Carlo study both approaches typically imply the same test decisions at conventional levels (see Appendix C), such that the difference between the two procedures is negligible in practice. We therefore focus on the simpler approach of Hansen (2005).

3 Monte Carlo Evidence on the Dominance Test

While we have provided a partial justification for the dominance test in Section 2.3, two important issues remain. First, does Assumption 2.2 provide a realistic description of the bootstrap in the present situation? Second, if we replace the analytical p-value p_H at (5) with the computational shortcut p̃_H at (6), does the choice of grid points for η matter in practice? In view of these open issues, we next investigate the testing procedure by simulation. The data generating process (DGP) is intended to be similar to the HEAVY forecasting model which we use in the empirical analysis of Section 4. To this end, we first create data from the following process:
where the series R̃K_t mimics the "realized kernel" measure of intra-day volatility (Barndorff-Nielsen et al., 2008, 2009), and the coefficient β determines the persistence of the series σ_t². To generate the series R̃K_{t−1}, we simulate from a Gaussian AR(1) model that we fit to the logarithmic values of the empirical series RK_t corresponding to the S&P 500 data used in our empirical analysis.9 We further set σ_0² = 0.35, and assume that the return at day t is given by
where {X_t}_{t∈ℕ} is a sequence of i.i.d. random variables that are t-distributed with ν degrees of freedom. The factor √((ν − 2)/ν) accounts for the fact that the variance of a t-distributed variable equals ν/(ν − 2). Hence, the factor ensures that the conditional variance of R_t is given by σ_t². Given knowledge of the process and the sequence {RK_j}_{j ≤ t−1}, the perfect T_α forecast for R_t consists of
where Q_{α,ν} is the α-quantile of the t-distribution with ν degrees of freedom. We consider two forecasting models m ∈ {1, 2} with perturbed ideal forecasts:
where ε_{t,m} ~ N(0, ζ_m), independently across t and m.10 The variance term ζ_m ≥ 0 is a measure of the expected deviation from optimality. The limiting case ζ_m = 0 means that ε_{t,m} = 0 almost surely, and corresponds to perfect forecasts. Note that model m incurs the same error in both components of its T_α forecast; this ensures that q_{t,α,m} ≥ e_{t,α,m} always holds, as required. When ζ_1 = ζ_2 > 0, both models have equal expected scores, which is consistent with weak dominance. To generate a difference in forecast quality, we consider the case of ζ_1 > 0 and ζ_2 = 0, which means that the second model issues perfect forecasts while the first model deviates from optimality to a degree measured by ζ_1. This setup implies a dominance relationship in favor of the second model following results from Tsyplakov (2014),11 and a violation of the null hypothesis that m = 1 weakly dominates m = 2. We note that a reversed dominance relation is the strongest type of null hypothesis violation: Reversed dominance implies that μ(η) ≥ 0 for all η ∈ ℝ, whereas μ(η) > 0 for a small range of η would already suffice to violate the null.
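The perturbation setup above can be sketched in code. This is a minimal illustration rather than the authors' implementation: the AR(1) volatility dynamics are suppressed (σ_t is passed in directly), the t quantile Q_{α,ν} is supplied as a constant (for ν = 6, Q_{0.025,6} ≈ −2.447 from standard tables), and the closed-form lower-tail ES of the t-distribution is a textbook formula, not taken from the article.

```python
import math
import random

def perturbed_forecasts(sigma, alpha, nu, q_alpha, zeta, rng):
    """Perfect (VaR_alpha, ES_alpha) forecast for R = sigma * sqrt((nu-2)/nu) * X, X ~ t_nu,
    perturbed by adding the same N(0, zeta) draw to both components (zeta is a variance)."""
    scale = sigma * math.sqrt((nu - 2) / nu)   # ensures Var(R) = sigma^2
    var_perfect = scale * q_alpha              # VaR: scaled alpha-quantile (negative number)
    # Closed-form lower-tail ES of the t distribution:
    # E[X | X <= q] = -g(q) * (nu + q^2) / ((nu - 1) * alpha), with g the t_nu density
    g_q = (math.gamma((nu + 1) / 2)
           / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
           * (1 + q_alpha ** 2 / nu) ** (-(nu + 1) / 2))
    es_perfect = -scale * g_q * (nu + q_alpha ** 2) / ((nu - 1) * alpha)
    eps = rng.gauss(0.0, math.sqrt(zeta)) if zeta > 0 else 0.0
    return var_perfect + eps, es_perfect + eps

rng = random.Random(1)
v, e = perturbed_forecasts(sigma=1.0, alpha=0.025, nu=6, q_alpha=-2.447, zeta=0.0, rng=rng)
```

Because the same ε is added to both components, the gap between the VaR and ES forecasts is preserved, so the required ordering of the two components is never violated by the perturbation.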

In the following investigation, we consider various values for the two DGP parameters (i.e., the persistence parameter β and the degrees of freedom ν), for the three hyperparameters (i.e., the functional level α, the type of grid for η, the number of observations n), and for the two parameters controlling forecast quality (i.e., the perturbation parameters ζ1 and ζ2). For the entire investigation we choose a significance level of 5%. The calculation of a single p-value uses 500 bootstrap iterations, and we draw 1000 p-values per scenario.

Tables 1 and 2 both show Monte Carlo simulation results for size and power. Table 1 addresses the question of whether the grid specification for η is important at a sample size of 500. In Section 2.2, we discussed exact computation of the supremal test statistic and three variants of grid approximation ("jumps," "jumps/10," and "equidistant"). We observe no exceedance of the nominal 5% level for the scenarios with equal predictive quality, regardless of whether we use the exact computation or any of the three grid approximations. For the power scenarios, the results group closely by parameter combination, with evidence of a generally minor loss of power when using any of the grid approximation options. As the computational cost increases noticeably beyond a sample size of 500 and the power properties are similar, our further simulation results are based on the thinned grid of ES forecasts ("jumps/10").

Table 2.

Simulation results on size and power (focus on sample size)

ES level   Obs.        β = 0.5                β = 0.7                β = 0.9
                       ν=10    ν=6    ν=4     ν=10    ν=6    ν=4     ν=10   ν=6   ν=4

Size scenarios (ζ1 = ζ2 = 1)
α = 0.010  n = 500      2.4    2.3    2.3      2.1    1.5    2.1      0.6   0.5   0.1
           n = 1000     2.7    2.6    2.7      2.7    2.1    2.1      0.6   0.6   0.6
           n = 2500     4.2    3.9    2.4      1.5    2.8    1.6      1.0   1.0   0.6
α = 0.025  n = 500      2.3    2.3    2.8      2.6    2.0    1.5      1.0   0.8   0.9
           n = 1000     2.4    3.3    3.9      2.4    2.8    1.9      1.3   1.1   1.1
           n = 2500     4.0    4.1    2.8      2.8    3.5    3.3      2.3   1.7   0.9
α = 0.050  n = 500      2.6    2.9    2.9      2.1    2.5    3.1      0.9   1.2   0.7
           n = 1000     2.6    2.7    2.9      2.8    2.6    2.0      1.8   2.1   1.1
           n = 2500     2.7    2.5    4.5      2.9    2.5    2.9      3.2   2.8   1.7

Power scenarios (ζ1 = 0.1, ζ2 = 0)
α = 0.010  n = 500      4.9    1.6    1.4      1.1    0.1    0.3      0.1   0.1   0.1
           n = 1000    19.2    9.3    4.2      3.5    1.8    0.3      0.4   0.1   0.3
           n = 2500    74.1   49.0   23.6     20.7    9.1    5.8      1.1   0.8   0.5
α = 0.025  n = 500     25.6   18.3   13.7      7.8    5.3    3.1      1.0   0.6   0.5
           n = 1000    63.8   51.1   40.7     19.3   17.8   10.8      2.3   1.0   1.7
           n = 2500    99.1   95.6   90.1     64.8   51.8   35.1      6.8   3.6   2.7
α = 0.050  n = 500     56.6   49.5   48.8     22.2   17.5   16.3      4.3   2.7   2.9
           n = 1000    89.0   87.0   84.4     49.1   42.2   36.9      6.9   4.1   4.7
           n = 2500   100.0  100.0  100.0     94.3   89.8   84.6     19.0  12.5  12.0

Power scenarios (ζ1 = 0.2, ζ2 = 0)
α = 0.010  n = 500     31.7   18.6   10.2      6.3    2.4    1.9      0.3   0.2   0.1
           n = 1000    75.9   56.3   33.1     18.2   10.9    4.4      0.3   0.3   0.4
           n = 2500    99.8   97.8   86.9     72.7   46.6   20.0      4.7   2.0   1.3
α = 0.025  n = 500     72.7   64.3   56.1     25.2   20.4   12.4      2.1   1.7   0.6
           n = 1000    98.3   95.8   92.3     65.1   52.9   39.3      5.4   3.8   3.3
           n = 2500   100.0  100.0  100.0     98.8   94.3   87.9     24.1  16.2   9.8
α = 0.050  n = 500     92.5   92.3   90.6     56.8   50.5   49.6      7.7   8.3   6.5
           n = 1000   100.0   99.9   99.7     90.0   86.7   86.0     17.8  13.1  14.0
           n = 2500   100.0  100.0  100.0     99.9  100.0  100.0     53.5  44.7  41.1

Notes: Size and power results (in percentage points) of the Monte Carlo investigation of the dominance test for a 5% significance level, with focus on the effect of the number of observations. The grid of points η is thinned by a factor of 10 (“jumps/10,” see end of Section 2.2), p-values are generated using 500 bootstrap iterations, and 1000 p-values are drawn.


Table 2 gives a more comprehensive summary of the effects that different parameter values have on size and power. Again, size remains controlled below the 5% level throughout, and we observe that both higher persistence and heavier tails lead to a decrease in power. Similarly, forecasts at a functional level of 0.01 are much harder to evaluate than forecasts at a level of 0.05. In combination, high persistence paired with a low ES forecast level can prevent the test from attaining power above the nominal size, even at a sample size of 2500. However, for moderate values of persistence and ES forecast level, forecast dominance can be rejected reliably.

4 Empirical Results for S&P 500 and DAX Returns

In this section, we apply our methodology to compare forecasts for the returns of two stock indices, the S&P 500 and the DAX. The return of the index (S&P 500 or DAX) is defined as the percentage log return, R_t = 100 log(P_t / P_{t−1}),
where P_t is the level of the index at the end of trading day t. As before, let W_{t−1} denote the information set generated by data up to day t − 1. We consider three models for daily log returns with corresponding (VaR_α, ES_α) forecasts at level α = 0.025.
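Assuming percentage log returns (which matches the forecast magnitudes of roughly −2 to −4 reported in Table 4), the return series can be computed from a price series as follows; a minimal sketch:

```python
import math

def log_returns(prices):
    """Percentage log returns: R_t = 100 * log(P_t / P_{t-1})."""
    return [100 * math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

returns = log_returns([100.0, 101.0, 99.0])
```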
The first specification is a HEAVY model (Shephard and Sheppard, 2010) that uses intra-daily realized measures to model the time-varying variance of financial returns. The model posits that
V(R_t | W_{t−1}) = σ_t² = ω + γ RK_{t−1} + β σ²_{t−1},  (9)
where V denotes variance, and RK_{t−1} is the realized kernel measure computed from intra-daily price movements on day t − 1. We further assume that, conditional on W_{t−1}, (√((ν − 2)/ν) σ_t)^(−1) R_t follows a t-distribution with ν degrees of freedom. We jointly estimate the model parameters (ω, γ, β, and ν) via maximum likelihood. We re-fit the model only on the first trading day of each month, using a rolling window of 1500 observations, that is, roughly six years of daily data. The model yields an estimate of VaR_α and ES_α of R_t, conditional on W_{t−1}. Second, we consider a GARCH(1, 1) model as proposed by Bollerslev (1986). The variance specification coincides with Equation (9), except that the squared daily return, R²_{t−1}, is used in place of RK_{t−1}. As for the HEAVY model, we assume a t-distribution and jointly estimate all parameters via maximum likelihood. As a third, simplistic benchmark, we also consider the empirical unconditional VaR_α and ES_α computed from the returns in the 1500 observations up until day t − 1. This approach resembles "historical simulation" (HS) methods which are popular in practice (see, e.g., McNeil et al., 2015, Section 9.2.3).
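The variance recursion of Equation (9) and the implied (VaR_α, ES_α) forecasts can be sketched as below. This is an illustration, not the authors' estimation code: the parameters are fixed near the median S&P 500 GARCH estimates from Table 3 instead of being fitted by maximum likelihood, and the t quantile Q_{α,ν} is supplied as a constant (for ν = 7, Q_{0.025,7} ≈ −2.365 from standard tables); the closed-form t-distribution ES is a textbook formula.

```python
import math

def variance_recursion(x, omega, gamma, beta, sigma2_init):
    """Equation (9)-style recursion: sigma2_t = omega + gamma * x_{t-1} + beta * sigma2_{t-1}.
    x holds the realized kernel (HEAVY) or the squared return (GARCH)."""
    sigma2 = [sigma2_init]
    for x_lag in x:
        sigma2.append(omega + gamma * x_lag + beta * sigma2[-1])
    return sigma2[1:]

def var_es_t(sigma2, alpha, nu, q_alpha):
    """(VaR_alpha, ES_alpha) of R_t when (sqrt((nu-2)/nu) * sigma_t)^(-1) R_t ~ t_nu."""
    scale = math.sqrt((nu - 2) / nu * sigma2)
    g_q = (math.gamma((nu + 1) / 2)
           / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
           * (1 + q_alpha ** 2 / nu) ** (-(nu + 1) / 2))
    var = scale * q_alpha
    es = -scale * g_q * (nu + q_alpha ** 2) / ((nu - 1) * alpha)
    return var, es

# GARCH(1,1) variant: feed lagged squared returns, parameters near the Table 3 medians
returns_sq = [r ** 2 for r in [-1.2, 0.4, 2.1, -0.8, 0.3]]
sig2 = variance_recursion(returns_sq, omega=0.009, gamma=0.085, beta=0.912, sigma2_init=1.0)
forecasts = [var_es_t(s2, alpha=0.025, nu=7, q_alpha=-2.365) for s2 in sig2]
```

Swapping `returns_sq` for a realized kernel series gives the HEAVY variant of the same recursion.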

Our analysis is based on data from http://realized.oxford-man.ox.ac.uk/; this source covers both daily closing prices and realized measures computed from intra-daily data. We construct forecasts for the period from January 2006 to January 2016.12 The entire analysis is out-of-sample, that is, we evaluate the forecasts against realizations that were not used for model fitting. Table 3 gives a brief summary of the parameter estimates. For both the S&P 500 and DAX series, the HEAVY model features a larger γ parameter (weight on the realized measure) than does GARCH. Furthermore, HEAVY is less persistent than GARCH, as reflected in a smaller estimate of β. These findings are qualitatively in line with empirical results by Shephard and Sheppard (2010). Finally, the estimated degrees of freedom are larger (i.e., closer to normality) for HEAVY than for GARCH. This suggests that the realized kernel measure may be more informative as a volatility proxy than squared returns, in the sense that conditioning on the realized kernel leads to a lighter-tailed return distribution than does conditioning on squared returns.

Table 3.

Parameter estimates for HEAVY and GARCH models as presented in Equation (9)


S&P 500
          ω       γ       β       ν
HEAVY     0.000   0.344   0.742   10.684
GARCH     0.009   0.085   0.912    6.984

DAX
          ω       γ       β       ν
HEAVY     0.000   0.606   0.558   13.712
GARCH     0.018   0.083   0.910    8.256


Notes: All parameters are re-estimated each month using rolling windows. Numbers in the table are medians across rolling windows.


Figure 1 presents time series plots of the HEAVY and HS forecasts (the GARCH forecasts are visually similar to the HEAVY ones, and are thus omitted for better display). The HEAVY forecasts display much more time variation than the forecasts of the simple HS method, suggesting that the HEAVY model reacts much more quickly to changes in the market environment. Table 4 presents summary statistics on the forecasts. On average, the HS model produces lower forecasts than the other two methods. For the S&P 500 data set, the average VaR_α forecast is −2.065 for HEAVY, compared to −2.213 for GARCH and −2.761 for HS. The violation rates of the VaR_α forecasts are 4.2% (HEAVY), 4.0% (GARCH), and 2.9% (HS), with all three methods exceeding the nominal level of 2.5%, partly due to the negative returns around the 2007–2009 financial crisis.

Table 4.

Summary statistics for empirical forecasts

           Avg. VaR_α   Avg. ES_α   VaR_α "violation" rate

S&P 500
HEAVY      −2.065       −2.594      0.042
GARCH      −2.213       −2.852      0.040
HS         −2.761       −4.028      0.029

DAX
HEAVY      −2.499       −3.095      0.038
GARCH      −2.606       −3.322      0.038
HS         −3.130       −4.493      0.025

Notes: The sample period ranges from January 2006 to January 2016 (daily data). The VaR_α "violation" rate is the fraction of days for which the actual return falls below the VaR_α forecast (nominal rate: α = 0.025).


Figure 1. Time series plots of empirical forecasts for VaR and ES. The sample period ranges from January 2006 to January 2016. See text for details.

Figure 2. Murphy diagrams for empirical forecasts. Smaller scores are better.

Figures 2 and 3 and Table 5 contain our main forecast evaluation results. Figure 2 presents Murphy diagrams for all three methods, with the S&P 500 display at left and the DAX results at right. For both data sets, the HEAVY model attains the lowest average elementary score for the vast majority of thresholds η. Forecasts based on the GARCH(1, 1) model perform slightly worse, and the HS method's performance trails by a considerable margin. This pattern is emphasized in Figure 3, where the HEAVY forecasts are compared directly against GARCH(1, 1) and HS, respectively. Examining the difference in elementary scores makes it easier to detect which of two models is better at a given threshold, especially when the difference is small. Pointwise confidence intervals at the 95% level give an impression of the significance of the HEAVY model's outperformance. For the S&P 500 data, HEAVY seems to perform significantly better than GARCH for a majority of thresholds η; in contrast, the visual comparison for DAX returns does not indicate any clear dominance relation. Table 5 reports the p-values of the formal dominance test presented in Section 2: There is ample evidence against the null hypothesis that HS dominates HEAVY, but no evidence against dominance of HEAVY over HS. In the comparison of HEAVY and GARCH(1, 1), we do not find enough evidence to reject either direction of weak dominance, at least at the 5% level. These results hold for both the S&P 500 and the DAX data.13 Note that the results in Table 5 are based on a mean block length of ten in the block bootstrap implementation. Using a mean block length of twenty leads to the same test decisions at the 5% level. Furthermore, the results in Table 5 are based on exact calculation of the supremal test statistic; grid-based approximations yield very similar p-values.
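The article reports a mean block length of ten but does not restate the bootstrap variant here. One common choice is a stationary bootstrap in the spirit of Politis and Romano (1994), where block lengths are geometric with mean 1/p; a minimal sketch of the index-resampling step, under that assumption:

```python
import random

def stationary_bootstrap_indices(n, mean_block_length, rng):
    """Resampled time indices with geometric block lengths (expected length = mean_block_length)."""
    p = 1.0 / mean_block_length
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p:
            idx.append(rng.randrange(n))     # start a new block at a random position
        else:
            idx.append((idx[-1] + 1) % n)    # continue the current block (wrap around)
    return idx

rng = random.Random(42)
idx = stationary_bootstrap_indices(n=2500, mean_block_length=10, rng=rng)
```

Applying `idx` to the series of elementary score differences yields one bootstrap replicate of the test statistic.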

Table 5.

Test results for empirical forecasts


S&P 500
Hypothesis                      p-value
HS weakly dominates HEAVY       0.000
HEAVY weakly dominates HS       0.386
GARCH weakly dominates HEAVY    0.082
HEAVY weakly dominates GARCH    0.554

DAX
Hypothesis                      p-value
HS weakly dominates HEAVY       0.000
HEAVY weakly dominates HS       0.488
GARCH weakly dominates HEAVY    0.172
HEAVY weakly dominates GARCH    0.732


Notes: The table presents p-values for several hypotheses related to forecast dominance (see Definition 2.1). The results are based on exact calculation of the supremal test statistic (see Section 2.2), and the bootstrap implementation is based on a mean block length of ten.


Figure 3. Score differences for empirical forecasts. A negative difference means that HEAVY outperforms its competitor. Confidence intervals are pointwise at the 95% level.

The fact that both HEAVY and GARCH dominate HS can perhaps be explained by their use of conditioning information, in contrast to the unconditional distribution estimate implicit in HS. Holzmann and Eulert (2014) show that larger information sets lead to better scores under correct specification. While the latter assumption is unlikely to be satisfied in practice, one might expect similar results to hold under moderate degrees of misspecification.

5 Discussion

In this article, we provide a mixture representation for the consistent scoring functions for the pair (VaRα,ESα). This mixture representation facilitates assessments of whether one sequence of predictions for (VaRα,ESα) dominates another across a suitable, user-specified class of scoring functions. As we are primarily interested in the comparison of the ES forecasts, we focus on a class that puts as much emphasis on ES as possible.

We also propose a formal statistical test for forecast dominance in this context. Theoretical arguments are provided to show that the size of the test is controlled asymptotically, which is supported by the results of a detailed simulation study. This study also investigates the power properties of the test for a broad range of parameter choices in a practically relevant model for the data generating process. For the ES level of 0.025 recommended in the Basel III standard and a degree of volatility persistence that is similar to our empirical estimates, we observe good power properties for reasonably large sample sizes.

When comparing forecast performance in terms of forecast dominance, it is not necessary to select a specific scoring function prior to forecast evaluation. In the presence of possibly misspecified forecasts and non-nested information sets, this is an advantage: any choice of a particular consistent scoring function induces a preference ordering on all possible sequences of forecasts, and that ordering is usually difficult or impossible to justify, or even to describe; see Patton (2016). On the other hand, Murphy diagrams may lead to inconclusive situations in which neither of the two forecast methods dominates the other. This may be undesirable in decision-making contexts. Ideally, future work should develop a better understanding of Murphy diagrams, so that they can not only be used to check for forecast dominance but also guide the choice of a consistent scoring function appropriate for a specific application when a total order on forecasting methods is needed.

Appendix

A Proofs 

Proposition 1.1

 

Proof
The F_1-consistency of S_η^1 and S_η^2 follows directly from Fissler and Ziegel (2016, Corollary 5.5). This implies the F_1-consistency of S at (3) by a small modification of Gneiting (2011a, Theorem 2). To see that all scoring functions at (2) can be written as at (3), observe that an increasing function G can always be written as
where H is a locally finite measure and z ∈ ℝ. As G_2 ≥ 0, we can assume that the measure H_2 puts finite mass on all intervals of the form (−∞, x] and choose z = −∞. Finally, G_2 is strictly increasing if and only if H_2 puts positive mass on all open intervals. □

 
Theorem 2.1

 

Proof
We have T^max ≤ T_0^max, hence
Therefore, using Assumption 2.2, we obtain

 
Proposition 2.2

 

Proof
By Assumption 2.3, we obtain
Therefore,

B Test Statistic Behavior
The elementary score difference δ(η) = S_η(v^A, e^A, y) − S_η(v^B, e^B, y) in S_2 for a single case of forecasts and observation (v^A, e^A, v^B, e^B, y) is given by
As a function of η, the difference δ(η) is a discontinuous, piece-wise linear function with break points at e^A and e^B. Hence, the empirical mean difference μ_n(η) has breaks for all η ∈ {e_t^A, e_t^B}_{t=1}^n. The average μ_n^*(η) from a bootstrap sample exhibits only a subset of the breaks in μ_n(η).
We now consider the structure of the test statistics T and T_0^*. As the numerator of T_0^* is the centered version μ_n^*(η) − μ_n(η), and the denominator is the same for T_0^* and T, there is no difference in the break structure. Furthermore, both numerators are piece-wise linear functions, and the denominator is the square root of a quadratic function. Hence, both types of test statistics can be parameterized in the same way,
where a, b, c, d, e, a_0^*, b_0^* are piece-wise constant functions in η, with the argument suppressed for readability. For the sake of completeness, the functions are given below, where a_0^*(η) and b_0^*(η) are computed analogously to a(η) and b(η), respectively,
On any of the intervals induced by the ordered set of break points, we can think of either test statistic as a function
with constant parameters a, b, c, d, e ∈ ℝ. If f(η) is constant or linear, only the endpoints of the interval are of interest. Otherwise, we check whether the only remaining critical point for an extremum lies in the interval,
The composite null hypothesis is checked at all critical points that fall in their respective intervals, and at the left-sided and right-sided limits of the break points.

Finally, we can use these results to calculate sup_{η∈I} T(η) and sup_{η,ν∈I} |T(η) − T(ν)| for any interval I, and for either type of test statistic.
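The interval-wise maximization described above can be sketched as follows: for f(η) = (aη + b)/√(cη² + dη + e), setting the numerator of f′(η), namely 2a(cη² + dη + e) − (aη + b)(2cη + d), to zero yields the single candidate critical point η* = (bd − 2ae)/(ad − 2bc), so the supremum over an interval is attained either there or at the endpoints. The parameter values below are arbitrary illustrations, not quantities from the article.

```python
import math

def sup_on_interval(a, b, c, d, e, lo, hi):
    """Maximize f(eta) = (a*eta + b) / sqrt(c*eta**2 + d*eta + e) on [lo, hi],
    assuming the quadratic under the root is positive on the interval."""
    f = lambda eta: (a * eta + b) / math.sqrt(c * eta ** 2 + d * eta + e)
    candidates = [lo, hi]
    denom = a * d - 2 * b * c
    if denom != 0:                      # otherwise f has no interior critical point
        eta_star = (b * d - 2 * a * e) / denom
        if lo < eta_star < hi:
            candidates.append(eta_star)
    return max(f(eta) for eta in candidates)

# sanity check against a brute-force grid search
a, b, c, d, e, lo, hi = 1.0, 0.5, 2.0, 0.3, 1.0, -1.0, 3.0
exact = sup_on_interval(a, b, c, d, e, lo, hi)
grid = max((a * x + b) / math.sqrt(c * x ** 2 + d * x + e)
           for x in [lo + i * (hi - lo) / 10000 for i in range(10001)])
```

Since the candidate set always includes both endpoints, the same routine is correct when the interior critical point happens to be a minimum.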

C Relation of Hansen’s Test to Westfall and Young (1993)

We also considered the method of Westfall and Young (1993), which controls the familywise error rate (i.e., the probability of making at least one false rejection) in multiple testing problems. To describe the procedure, let m be the number of points at which the test statistic is evaluated, and let π be the permutation of {1, …, m} such that T(η[π(1)]) ≤ T(η[π(2)]) ≤ ⋯ ≤ T(η[π(m)]); that is, the permutation π arranges the sample t-statistics in ascending order. Define U_k^* = max{T_0^*(η[π(s)]) : s ≤ k}. For example, U_m^* is the largest of all bootstrapped t-statistics, and U_5^* is the largest bootstrap t-statistic across the grid points η[π(1)], η[π(2)], …, η[π(5)].

We generate B sets of bootstrap t-statistics, obtaining values U_{k,b}^* for 1 ≤ k ≤ m and 1 ≤ b ≤ B. The adjusted p-value for the pointwise test at η = η[k] is then given by
note that π^(−1)(k) indicates the rank of T(η[k]) among {T(η[1]), …, T(η[m])}. Finally, the p-value for the joint hypothesis of interest is given by
There is an interesting connection between Hansen's test and the Westfall–Young method (see Cox and Lee, 2008, Section 3.2): The Westfall–Young p-value is always weakly smaller than the p-value of Hansen's test (in the variant described in Section 2.2 above), such that the Westfall–Young method is potentially more powerful. To see this, let η[k*] be the grid point associated with the largest t-statistic, that is, π^(−1)(k*) = m. Then, Hansen's p-value is given by p_H = r_{k*}, and it holds that r_{k*} ≥ p_WY by construction. In some cases, the difference can be practically relevant; see Cox and Lee (2008, Section 4, Figure 4).
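The relation p_WY ≤ p_H can be illustrated with a small sketch that follows the description above (without the additional monotonicity enforcement of the full step-down procedure); the t-statistics below are hypothetical toy values, not output of the paper's bootstrap.

```python
def westfall_young_and_hansen(t_obs, t_boot):
    """t_obs: list of m observed t-statistics; t_boot: B x m matrix of centered
    bootstrap t-statistics. Returns (p_WY, p_H)."""
    m, B = len(t_obs), len(t_boot)
    pi = sorted(range(m), key=lambda k: t_obs[k])   # ascending order of T(eta[k])
    rank = {pi[s]: s for s in range(m)}             # 0-based version of pi^{-1}(k)
    adjusted = []
    for k in range(m):
        s = rank[k]
        # U*_{pi^{-1}(k), b}: max bootstrap statistic over grid points pi[0..s]
        u = [max(t_boot[b][pi[j]] for j in range(s + 1)) for b in range(B)]
        adjusted.append(sum(ub >= t_obs[k] for ub in u) / B)
    p_wy = min(adjusted)
    # Hansen's one-step p-value: compare the maximal observed and bootstrap statistics
    t_max = max(t_obs)
    p_h = sum(max(row) >= t_max for row in t_boot) / B
    return p_wy, p_h

t_obs = [0.4, 2.1, 1.3]
t_boot = [[0.1, 0.9, 0.5], [1.5, 0.2, 2.2], [0.7, 1.1, 0.3], [2.4, 0.6, 1.0]]
p_wy, p_h = westfall_young_and_hansen(t_obs, t_boot)
```

By construction, the adjusted p-value at the grid point with the largest observed statistic equals Hansen's p-value, so taking the minimum can only make p_WY weakly smaller.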

However, in our setup, the difference between the two procedures seems to be unimportant. Table 6 summarizes the maximum number of disagreements between the two procedures at significance levels of integer-valued percentage points from 1% to 10%. Across all 594 parameter combinations in our simulation study from Section 3, the largest power difference is 1 percentage point, and the largest size difference is a tenth of the respective nominal level. Additionally, these differences appear to decrease as the number of observations grows. These findings suggest that the results of Meinshausen et al. (2011), which establish a certain asymptotic optimality property of Hansen's procedure in some settings, may hold more generally.

Table 6.

Maximum number of disagreements (per 1000)

Observations     Significance levels
                 1%   2%   3%   4%   5%   6%   7%   8%   9%   10%
All scenarios
n = 500           5    3    6    6    5    8    7    6    8    10
n = 1000          1    2    1    0    3    4    3    4    3     3
n = 2500          2    1    1    0    2    1    1    2    0     1
Size scenarios (ζ1 = ζ2 = 0)
n = 500           1    1    1    2    4    4    6    4    8     9
n = 1000          0    0    0    0    1    1    1    1    1     1
n = 2500          1    0    0    0    0    0    0    1    0     0

Notes: Maximum number of disagreements (per 1000 p-value replications) between Hansen’s test and the Westfall and Young correction among all simulated parameter combinations (scenarios) for various levels of significance. We consider 432 distinct parameter combinations for n = 500 and 81 distinct parameter combinations for n = 1000 and n = 2500, with one third of each category being size scenarios.
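The disagreement counts reported in Table 6 are straightforward to compute from the two tests' p-values across Monte Carlo replications. A minimal sketch (function name and inputs are illustrative, not from the paper's code):

```python
import numpy as np

def count_disagreements(p_hansen, p_wy, levels=None):
    """Number of replications in which Hansen's test and the Westfall-Young
    correction reach different reject/accept decisions, per significance level.

    p_hansen, p_wy : arrays of p-values from the two procedures,
                     one entry per simulation replication.
    """
    if levels is None:
        levels = [i / 100 for i in range(1, 11)]  # 1%, 2%, ..., 10%
    p_hansen = np.asarray(p_hansen)
    p_wy = np.asarray(p_wy)
    return {a: int(np.sum((p_hansen <= a) != (p_wy <= a))) for a in levels}
```

With 1000 replications per scenario, the dictionary values correspond directly to the "per 1000" entries of Table 6, maximized over scenarios.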

Footnotes

1 As detailed below, a scoring function (or loss function) assigns a real-valued score, given a forecast and a realizing observation.

2 The name of the diagrams alludes to the meteorologist Allan H. Murphy (1931–1997) who pioneered similar diagrams in the context of a binary dependent variable (see Murphy, 1977, as well as Ehm et al., 2016, p. 519).

3 In financial jargon, the word “backtesting” is sometimes used as a synonym for “forecast evaluation.”

4 The situation is similar for other functionals, that is, there is typically a whole family of scoring functions that are consistent for a given functional. For example, Savage (1971) identifies a family of scoring functions that are consistent for the mean, and Gneiting (2011b) describes the family of scoring functions that are consistent for a quantile.

5 Another representation of the class S2, which does not make use of elementary scores, can be obtained by setting the function G1 to zero in Equation (2).

6 In contrast, comparisons of population-level predictive ability (Clark and McCracken, 2013, Section 3.1) ask whether model A would outperform model B if both models were estimated without error. They are useful for discriminating between alternative theories or assessing the possible impact of a certain regressor, but are less in line with the practical forecasting situations we consider here.

7 In principle, one might try to estimate $\tau_G$ and $\tau_G^*$ in order to arrive at bounds for $p^H$. However, this procedure is likely to be computationally demanding, which contradicts the original motivation for using the grid approximation. It hence seems preferable to either set $\tau_G = \tau_G^* = 0$, or to compute $p^H$ via the analytical supremum calculations detailed in Appendix B. We provide simulation results on both approaches in Section 4.

8 Hansen’s test conducts a comparison between a benchmark method and finitely many competitors. In the case of Definition 2.1’, the comparison is between two methods at a finite number of fixed grid points. From a technical perspective, both of these comparisons boil down to testing whether all elements of a random vector have nonnegative expectation. Note that Hansen (2005) allows for cross-sectional dependence among the vector elements, as well as for certain forms of time series dependence, both of which are likely present in our setup.
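Hansen (2005) accommodates the time-series dependence mentioned in footnote 8 via the stationary bootstrap of Politis and Romano (1994). A minimal sketch of that resampling scheme (function name is ours):

```python
import numpy as np

def stationary_bootstrap_indices(n, mean_block_length, rng):
    """Politis-Romano (1994) stationary bootstrap: resampling indices are
    built from blocks of geometric length with the given mean, wrapping
    around the sample circularly."""
    p = 1.0 / mean_block_length
    idx = np.empty(n, dtype=np.int64)
    idx[0] = rng.integers(n)
    for t in range(1, n):
        if rng.random() < p:               # start a new block
            idx[t] = rng.integers(n)
        else:                              # continue the current block
            idx[t] = (idx[t - 1] + 1) % n
    return idx

# a bootstrap sample of a series y would be y[stationary_bootstrap_indices(len(y), 10, rng)]
```

Resampling in blocks preserves the serial dependence of the score differences within each block, which is what makes the resulting bootstrap t-statistics valid under the dependence described in footnote 8.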

9 The resulting AR(1) model implies that the log of RKt has a mean of –0.62, a first-order autoregressive coefficient of 0.83, and a residual variance of 0.38.
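The AR(1) parameters in footnote 9 pin down the model's intercept and unconditional variance. A quick check (the derived values below are computed here, not quoted from the paper):

```python
# log RK_t = c + phi * log RK_{t-1} + eps_t, with unconditional mean mu
# and residual variance sigma2, as reported in footnote 9
phi, mu, sigma2 = 0.83, -0.62, 0.38

intercept = mu * (1 - phi)            # implied intercept c, roughly -0.105
stat_var = sigma2 / (1 - phi ** 2)    # unconditional variance, roughly 1.22
```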

10 We evaluate $ES_{t|t-1,\alpha}$ using the function esT of the R package VaRES (Nadarajah et al., 2013), which employs numerical integration. The values for $ES_{t|t-1,\alpha}$ thus obtained are very similar to an analytical expression (see Dobrev et al., 2017, Section 4, and the references therein) of which we became aware after completing this work. For example, for $\alpha \in \{0.01, 0.025, 0.05\}$ and a t-distribution with $\nu \in \{4, 6, 10\}$ degrees of freedom, the absolute difference is smaller than $2 \times 10^{-9}$.
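The comparison in footnote 10 can be replicated without the VaRES package. Here is a hedged sketch using scipy (function names are ours), computing the lower-tail ES of the Student-t distribution both by numerical integration and via the closed form referenced from Dobrev et al. (2017):

```python
import numpy as np
from scipy import integrate, stats

def es_t_numeric(alpha, nu):
    """Lower-tail ES of the unit-scale t-distribution via numerical
    integration: (1/alpha) * integral of x * f(x) over (-inf, q_alpha]."""
    q = stats.t.ppf(alpha, df=nu)
    val, _ = integrate.quad(lambda x: x * stats.t.pdf(x, df=nu), -np.inf, q)
    return val / alpha

def es_t_analytic(alpha, nu):
    """Closed-form lower-tail ES of the t-distribution."""
    q = stats.t.ppf(alpha, df=nu)
    return -stats.t.pdf(q, df=nu) * (nu + q ** 2) / (alpha * (nu - 1))

# the two routes agree to high accuracy for alpha in {0.01, 0.025, 0.05}
# and nu in {4, 6, 10}, mirroring the comparison in footnote 10
```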

11 Both models have access to the same information base, which is used optimally by the second model, but suboptimally by the first model. Tsyplakov (2014) shows that this setup implies dominance of the second model under all proper scoring rules.

12 More precisely, the S&P 500 sample comprises 2420 observations from January 6, 2006 to January 25, 2016; the DAX sample comprises 2494 observations from January 4, 2006 to January 25, 2016.

13 At the 10% level, the null that GARCH dominates HEAVY is rejected for the S&P 500 data, in line with the visual impression conveyed by Figure 3.

* We thank seminar and conference participants in Heidelberg, Augsburg (Statistische Woche 2016), Karlsruhe (HeiKaMEtrics 2018), and Freiburg (GPSD 2018) for helpful comments. J. F. Z. gratefully acknowledges financial support of the Swiss National Science Foundation. The work of F.K. and A.J. has been funded by the European Union Seventh Framework Programme under grant agreement 290976. They also thank the Klaus Tschira Foundation for infrastructural support at the Heidelberg Institute for Theoretical Studies (HITS). The opinions expressed in this article are those of the authors and do not necessarily reflect the views of Raiffeisen Schweiz. Calculations were performed on the HPC cluster at HITS, and UBELIX (http://www.id.unibe.ch/hpc), the HPC cluster at the University of Bern.

References

Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde, and N. Shephard. 2008. Designing Realized Kernels to Measure the Ex Post Variation of Equity Prices in the Presence of Noise. Econometrica 76: 1481–1536.

Barndorff-Nielsen, O. E., P. R. Hansen, A. Lunde, and N. Shephard. 2009. Realized Kernels in Practice: Trades and Quotes. Econometrics Journal 12: C1–C32.

Basel Committee on Banking Supervision. 2016. Minimum Capital Requirements for Market Risk. Available at http://www.bis.org/bcbs/publ/d352.htm (accessed January 15, 2019).

Bollerslev, T. 1986. Generalized Autoregressive Conditional Heteroskedasticity. Journal of Econometrics 31: 307–327.

Clark, T., and M. McCracken. 2013. “Advances in Forecast Evaluation.” In G. Elliott and A. Timmermann (eds.), Handbook of Economic Forecasting, vol. 2, Amsterdam: Elsevier, pp. 1107–1201.

Cox, D. D., and J. S. Lee. 2008. Pointwise Testing with Functional Data Using the Westfall–Young Randomization Method. Biometrika 95: 621–634.

Delbaen, F. 2012. Monetary Utility Functions. Osaka: Osaka University Press.

Diebold, F. X., and R. S. Mariano. 1995. Comparing Predictive Accuracy. Journal of Business & Economic Statistics 13: 253–263.

Dimitriadis, T., and S. Bayer. 2017. A Joint Quantile and Expected Shortfall Regression Framework. Preprint, arXiv:1704.02213.

Dobrev, D., T. D. Nesmith, and D. H. Oh. 2017. Accurate Evaluation of Expected Shortfall for Linear Portfolios with Elliptically Distributed Risk Factors. Journal of Risk and Financial Management 10: 5.

Ehm, W., and F. Krüger. 2018. Forecast Dominance Testing via Sign Randomization. Electronic Journal of Statistics 12: 3758–3793.

Ehm, W., T. Gneiting, A. Jordan, and F. Krüger. 2016. Of Quantiles and Expectiles: Consistent Scoring Functions, Choquet Representations, and Forecast Rankings. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78: 505–562.

Fissler, T., and J. F. Ziegel. 2016. Higher Order Elicitability and Osband’s Principle. The Annals of Statistics 44: 1680–1707.

Fissler, T., J. F. Ziegel, and T. Gneiting. 2016. Expected Shortfall Is Jointly Elicitable with Value at Risk – Implications for Backtesting. Risk Magazine, January issue.

Giacomini, R., and H. White. 2006. Tests of Conditional Predictive Ability. Econometrica 74: 1545–1578.

Gneiting, T. 2011a. Making and Evaluating Point Forecasts. Journal of the American Statistical Association 106: 746–762.

Gneiting, T. 2011b. Quantiles as Optimal Point Forecasts. International Journal of Forecasting 27: 197–207.

Hansen, P. R. 2005. A Test for Superior Predictive Ability. Journal of Business & Economic Statistics 23: 365–380.

Harvey, A., E. Ruiz, and N. Shephard. 1994. Multivariate Stochastic Variance Models. Review of Economic Studies 61: 247–264.

Holzmann, H., and M. Eulert. 2014. The Role of the Information Set for Forecasting – with Applications to Risk Management. The Annals of Applied Statistics 8: 595–621.

Lütkepohl, H. 2005. New Introduction to Multiple Time Series Analysis. Berlin: Springer.

McNeil, A. J., R. Frey, and P. Embrechts. 2015. Quantitative Risk Management: Concepts, Techniques and Tools, 2nd edn. Princeton, NJ: Princeton University Press.

Meinshausen, N., M. H. Maathuis, and P. Bühlmann. 2011. Asymptotic Optimality of the Westfall–Young Permutation Procedure for Multiple Testing under Dependence. The Annals of Statistics 39: 3369–3391.

Murphy, A. H. 1977. The Value of Climatological, Categorical and Probabilistic Forecasts in the Cost-Loss Ratio Situation. Monthly Weather Review 105: 803–816.

Nadarajah, S., S. Chan, and E. Afuecheta. 2013. VaRES: Computes Value at Risk and Expected Shortfall for over 100 Parametric Distributions. Available at https://CRAN.R-project.org/package=VaRES (accessed January 15, 2019).

Nolde, N., and J. F. Ziegel. 2017. Elicitability and Backtesting: Perspectives for Banking Regulation. The Annals of Applied Statistics 11: 1833–1874.

Patton, A. J. 2011. Volatility Forecast Comparison Using Imperfect Volatility Proxies. Journal of Econometrics 160: 246–256.

Patton, A. J. 2016. “Comparing Possibly Misspecified Forecasts.” Working paper, Duke University. Available at http://public.econ.duke.edu/∼ap172/Patton_bregman_paper_nov17.pdf (accessed January 15, 2019).

Patton, A. J., J. F. Ziegel, and R. Chen. 2018. Dynamic Semiparametric Models for Expected Shortfall (and Value-at-Risk). Journal of Econometrics, forthcoming.

Politis, D. N., and J. P. Romano. 1994. The Stationary Bootstrap. Journal of the American Statistical Association 89: 1303–1313.

Savage, L. J. 1971. Elicitation of Personal Probabilities and Expectations. Journal of the American Statistical Association 66: 783–801.

Shephard, N., and K. Sheppard. 2010. Realising the Future: Forecasting with High-Frequency-Based Volatility (HEAVY) Models. Journal of Applied Econometrics 25: 197–231.

Tsyplakov, A. 2014. “Theoretical Guidelines for a Partially Informed Forecast Examiner.” Working paper, Munich Personal RePEc Archive. Available at https://mpra.ub.uni-muenchen.de/67333/ (accessed January 15, 2019).

Westfall, P., and S. S. Young. 1993. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Hoboken, NJ: Wiley.

White, H. 2000. A Reality Check for Data Snooping. Econometrica 68: 1097–1126.

Yen, T.-J., and Y.-M. Yen. 2018. Testing Forecast Accuracy of Expectiles and Quantiles with the Extremal Consistent Loss Functions. Preprint, arXiv:1707.02048v3.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)