Johanna F Ziegel, Fabian Krüger, Alexander Jordan, Fernando Fasciati, Robust Forecast Evaluation of Expected Shortfall, Journal of Financial Econometrics, Volume 18, Issue 1, Winter 2020, Pages 95–120, https://doi.org/10.1093/jjfinec/nby035
Abstract
Motivated by the Basel III regulations, recent studies have considered joint forecasts of Value-at-Risk and Expected Shortfall. A large family of scoring functions can be used to evaluate forecast performance in this context. However, little intuitive or empirical guidance is currently available, which renders the choice of scoring function awkward in practice. We therefore develop graphical checks of whether one forecast method dominates another under a relevant class of scoring functions, and propose an associated hypothesis test. We illustrate these tools with simulation examples and an empirical analysis of S&P 500 and DAX returns.
The Basel III standard on minimum capital requirements for market risk (Basel Committee on Banking Supervision, 2016) uses Expected Shortfall (ES), rather than Value-at-Risk (VaR), to quantify the risk of a bank’s portfolio. As described by McNeil et al. (2015, Chapter 8), ES possesses several desirable theoretical properties. However, it also has a major drawback: It is not elicitable, that is, there is no scoring function that sets the incentive to report ES honestly, or that can be used to compare the accuracy of ES forecasts.1 As a partial remedy to this problem, Fissler and Ziegel (2016, henceforth FZ) show that ES is jointly elicitable with VaR and characterize the class of scoring functions that can be used to evaluate forecasts of type (VaR, ES). Fissler et al. (2016) provide a nontechnical introduction and discuss regulatory implications.
In applied work, it may be challenging to select a specific member function from the FZ family on either economic or statistical grounds. Furthermore, different choices of scoring functions may yield different forecast rankings in the case of imperfect forecasters or non-nested information sets; see, for example, Patton (2016). Motivated by this problem, we present a mixture representation using elementary members of the FZ family, which is mathematically similar to recent results by Ehm et al. (2016) for quantiles and expectiles. The mixture representation gives rise to Murphy diagrams, which allow one to check whether one forecast dominates another under a relevant class of scoring functions.2 While this class could be the entire FZ family, we argue that a subfamily that emphasizes ES—rather than VaR—is economically more desirable in the light of the Basel III standard. Analyzing the robustness of forecast rankings across this class of scoring functions is relevant both conceptually and practically, and is referred to as forecast dominance in the following.
Forecast dominance holds at the population level—that is, it is defined in terms of expected performance, which is unobservable. Statistical tests are based on observable performance statistics, and are designed to detect significant deviations from hypotheses about expected performance; see for example Diebold and Mariano (1995) and Clark and McCracken (2013). In the present context, such tests are complicated by the fact that the null hypothesis refers to performance under all elementary members of the mixture representation, that is, for all values of an auxiliary parameter. To tackle the resulting testing problem, we propose a variant of the test by Hansen (2005). The test was originally developed to conduct multiple comparisons among a finite set of forecast models. In contrast, our situation involves an infinite family of elementary functions which enter the comparison. We provide theoretical and simulation evidence that the test has good size and power properties in the present situation. Our test complements recent work by Ehm and Krüger (2018) and Yen and Yen (2018) who consider tests of forecast dominance for quantiles and expectiles. These papers differ from ours along several dimensions. First, they consider different forecast types (functionals) than we do. Second, their theoretical justification is based on Gaussian processes, whereas we use arguments from the multiple testing literature (Westfall and Young, 1993; Cox and Lee, 2008). Finally, Ehm and Krüger (2018) use independent permutation (rather than dependent bootstrap) methods to implement the null hypothesis.
In an empirical case study, we evaluate forecasts for daily log returns of the S&P 500 and DAX stock market indices. Three models with varying degrees of sophistication are considered: the HEAVY model (Shephard and Sheppard, 2010), which has access to past intra-daily data, competes against two models that use only end-of-day data, namely a GARCH(1, 1) model (Bollerslev, 1986) and a naive “historical simulation” model. Our results suggest that both the HEAVY and the GARCH model dominate simple historical simulation; in contrast, we find no dominance relation between HEAVY and GARCH.
We emphasize that our interest lies in comparative forecast evaluation—that is, we seek to compare the (VaR, ES) forecasts of two competing methods.3 Comparative evaluation is important to select a suitable forecasting method in practice, especially given the wealth of data sources and statistical techniques that could plausibly be used to generate forecasts. Comparative forecast evaluation is different from absolute evaluation which aims to determine whether a given forecast method possesses certain desirable optimality properties. The Basel II procedure of counting VaR “violations,” that is, the number of times the actual return fell below the VaR forecast, is an example of absolute forecast evaluation. See Nolde and Ziegel (2017) for a detailed discussion of comparative versus absolute evaluation of financial forecasts.
The contributions of the present article include a mixture representation of the FZ family in Section 1, which yields the Murphy diagrams, and a test for the hypothesis of forecast dominance. Section 2 introduces the test and provides a theoretical justification; Section 3 presents simulation evidence on the test’s size and power. We illustrate the mixture representation and the test in an empirical case study in Section 4. A discussion in Section 5 concludes. Three appendices contain proofs and technical details.
1 Consistent Scoring Functions for VaR and ES
Patton (2011), Nolde and Ziegel (2017), and Patton et al. (2018) have argued for the use of homogeneous scoring functions for forecast comparison. Such additional requirements on the scoring functions narrow down the possible choices of the functions in (2). On a restricted action domain, homogeneous scoring functions for (VaR, ES) exist (Nolde and Ziegel, 2017, Theorem C.3), and under some additional assumptions there is even a unique zero-homogeneous choice (Patton et al., 2018, Proposition 1). Nevertheless, choosing one single scoring function for forecast evaluation implicitly imposes an order of preference on all sequences of forecasts, which is usually hard and sometimes impossible to justify. Indeed, the results in Nolde and Ziegel (2017) show no clear preference between a zero-homogeneous and a (1/2)-homogeneous choice of scoring function with respect to performance in forecast comparison. In their simulation study, these two choices give emphasis to different aspects of model misspecification.
More generally, Patton (2016) and others have demonstrated that the choice of scoring function is relevant for the ranking of two competing forecasts in the presence of model misspecification and non-nested information sets, both of which are common in practice. The methods we consider in this article are robust with respect to the choice of scoring function, in the sense that we compare forecasts under a class of scoring functions. We therefore make the following definition of forecast dominance which is analogous to Ehm et al. (2016, Definition 1).
Once dominance has been established for a given class of scoring functions, it carries over to the extension of that class containing all mixtures of its members; for example, dominance with respect to the elementary scores implies dominance with respect to any scoring function in their mixture representation. This simple observation is the basis for so-called Murphy diagrams, which are graphical tools to check for forecast dominance empirically with respect to all consistent scoring functions. Ehm et al. (2016) provide mixture representations of the families of consistent scoring functions for quantiles and expectiles. In order to derive similar methodology for (VaR, ES), the following result presents a mixture representation for consistent scoring functions of the form given in (2).
The first integral in Equation (3) represents the first line of Equation (2), whereas the second integral in (3) represents the second and third line of (2). The two families of elementary scores appearing in these integrals are themselves consistent scoring functions for (VaR, ES), which follows immediately by choosing Dirac measures for H1 or H2 in (3). The two families differ in their limiting behavior as the respective threshold parameter approaches the boundary of its range, which explains the different restrictions on the corresponding mixing measures H1 and H2 in Proposition 1.1.
We identify a subclass of consistent scoring functions for (VaR, ES) whose members emphasize the evaluation of the ES component. The first integral in (3) corresponds to the mixture representation of consistent scoring functions for quantiles (Ehm et al., 2016, Theorem 1a), a class that in our context only evaluates the VaR forecast and ignores ES. Hence, choosing anything but a constant H1 puts unnecessary emphasis on the VaR component of the forecast. The second integral corresponds to the evaluation of ES, conditional on VaR, where VaR cannot be removed from the evaluation entirely due to the results on the (non-)elicitability of ES. Hence, we define the class of interest as all consistent scoring functions for (VaR, ES) of the form given at (3) with a constant H1 (such that the first integral is zero), and focus on this class in the following.5 In the context of this class, we denote the elementary scores simply by their threshold parameter η, since the scores corresponding to the first integral have been excluded.
Our focus on this class is motivated by the aim to maximize the impact of the ES component in the evaluation, which is in line with the emphasis set in Basel III. Focusing on it also seems justified from a statistical perspective: Dimitriadis and Bayer (2017) investigate several members of the class in a regression framework. They argue that moving beyond it (i.e., considering nonconstant choices of H1 in Equation (3)) does not improve the numerical performance of their estimators. Furthermore, the class contains in particular positively homogeneous scoring functions for all possible degrees of homogeneity; see Nolde and Ziegel (2017, Section 2.3.1 and Theorem 6). As discussed there, positively homogeneous scoring functions enjoy a number of attractive properties.
Clearly, one could also consider forecast dominance for (VaR, ES) with respect to all consistent scoring functions. The procedures described in the following can be adapted to this case; the extension is conceptually simple yet tedious in practice, because one would need to check inequalities across two grids of parameters, one for each of the two families of elementary scores. Instead, when focusing on the ES-emphasizing class, it suffices to check inequalities along a single grid for η.
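To make the single-grid check concrete, the following sketch (array names and shapes are our own) assumes that the per-observation elementary scores have already been computed for two competing forecast methods on a common grid of thresholds η, and simply compares their grid-wise averages. It illustrates the empirical dominance check underlying a Murphy diagram, not the formal test introduced in Section 2.

```python
import numpy as np

def murphy_diagram_check(scores_a, scores_b):
    """Empirical check of forecast dominance of method A over method B.

    scores_a, scores_b: arrays of shape (n_obs, n_grid) containing the
    elementary scores for each observation (rows) and each grid point
    eta (columns); smaller scores are better.
    Returns the mean score differences per eta (the quantity plotted in a
    Murphy diagram of score differences) and a flag indicating whether A
    attains a weakly smaller average score at every grid point.
    """
    diff = scores_a.mean(axis=0) - scores_b.mean(axis=0)
    return diff, bool(np.all(diff <= 0.0))

# Hypothetical usage with precomputed elementary scores for two methods:
# diff, a_dominates_b = murphy_diagram_check(scores_heavy, scores_garch)
```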
2 Testing Forecast Dominance
Here we first translate the methodology from Section 1 into a time series context, and then introduce a test of forecast dominance based on the elementary scores.
2.1 Comparing Time Series Forecasts
So far, we have only considered a one-period forecasting problem. In most financial applications, however, the goal is to predict a time series, such as a sequence of asset returns observed on consecutive trading days. Furthermore, the forecast of Yt is based on an appropriate information set generated by data available prior to time t. In applications, we seek to make forecasts and realizations comparable across time. We therefore require the following assumption.
The time series Zt, consisting of the forecasts and the realization Yt, is stationary with distribution FZ.
This assumption rules out deterministic time trends, structural breaks, and seasonalities. Nevertheless, many multivariate autoregressive models (e.g., Lütkepohl, 2005) or stochastic volatility models (e.g., Harvey et al., 1994) are stationary.
2.2 Testing for Forecast Dominance
In Appendix B, we show that it is sufficient to consider the test statistics as piecewise functions of η whose break points are given by the set of ES forecasts. The supremal test statistics are computed as the maximum of the left- and right-sided limits at all break points, and over all remaining critical points that fall within the respective intervals of the piecewise partition. In addition to this exact computation, we consider the following grid-based approximations (a code sketch of their construction follows the list):
1. The grid Gn consisting of the (ordered) elements of the set of ES forecasts. This set is a natural choice in that it coincides with the jump points of the elementary scores; see Proposition 1.1. We refer to this choice as “jumps” in the following.
2. A thinned version of Gn, considering only every tenth element (“jumps/10”).
3. An equally spaced grid ranging from the minimum of Gn to the maximum of Gn, containing as many elements as the thinned version in 2 (“equidistant”).
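A minimal sketch of the three grid constructions, assuming (as an illustrative simplification) that the ES forecasts of both methods have been pooled into a single array:

```python
import numpy as np

def eta_grids(es_forecasts, thin=10):
    """Construct the three eta grids discussed above.

    es_forecasts: 1-d array of ES forecasts whose values serve as the
    jump points of the elementary scores (assumed here to be pooled
    across both forecast methods).
    """
    jumps = np.unique(es_forecasts)          # ordered jump points ("jumps")
    thinned = jumps[::thin]                  # every tenth element ("jumps/10")
    equidistant = np.linspace(jumps.min(), jumps.max(), thinned.size)  # "equidistant"
    return jumps, thinned, equidistant
```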
2.3 Theoretical Justifications
White (2000, Proposition 2.2 and Corollary 2.7) and Hansen (2005, Corollary 3) establish results similar to our Assumption 2.2 in the context of comparing multiple forecasting methods. In both of these studies, the test statistic of interest is the maximal element of a finite-dimensional vector. In contrast, our procedure is based on functional data, in that our test statistics are suprema over an uncountable set. While Assumption 2.2 seems plausible, we are not aware of a formal justification, but this issue is beyond the scope of the present article.
Theorem 2.1 shows that the size of the test is under control for all elements of the null hypothesis (i.e., both on its boundary and in its interior, corresponding to equal performance and strict dominance, respectively). This type of control is often hard to achieve; cf. the comments and references in Ehm and Krüger (2018, Section 7).
The proposition derives lower and upper bounds on the analytical p-value pH. In particular, it states that the impact of the grid approximation is small if the test statistics display little variation within the intervals Ii. In the Monte Carlo study of Section 3, we follow the recommendation of Cox and Lee (2008, p. 626) and use the “raw” p-values resulting from the grid approximation. In doing so, we essentially assume that the grid-based p-value coincides with pH, which seems reasonable when the grid is sufficiently dense, for example for large sample sizes and a continuous population distribution of the ES forecasts.7
Taking a slightly different perspective, the grid-based approximation can be seen as testing the following, restricted notion of forecast dominance:
Definition 2.1’ is a necessary condition for Definition 2.1. Furthermore, if the grid G is deterministic, then the Hansen (2005) test is valid for Definition 2.1’ without further adjustment.8 Proposition 2.2 quantifies the difference between pH (the p-value for Definition 2.1) and the p-value for Definition 2.1’. Its result is in line with the intuition that both p-values are similar if the grid is dense enough. Our simulation results in Table 1 provide direct numerical evidence on the quality of the grid-based approximation.
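The sketch below outlines a simplified, grid-based variant of the testing procedure: studentized mean score differences are computed at each grid point, the test statistic is their maximum, and its null distribution is approximated by a circular block bootstrap that recenters the score differences. It is a stylized stand-in for the Hansen (2005) procedure rather than a faithful reimplementation; in particular, the studentization and the block-length handling are simplified, and all names are illustrative.

```python
import numpy as np

def sup_t_dominance_test(d, block_len=10, n_boot=500, rng=None):
    """Simplified sup-t test of H0: method A weakly dominates method B.

    d: array of shape (n_obs, n_grid) with per-observation score
       differences d[t, j] = S_eta_j(A_t, y_t) - S_eta_j(B_t, y_t).
    Under H0 every column of d has nonpositive expectation, so large
    positive studentized means are evidence against dominance.
    """
    rng = np.random.default_rng(rng)
    n, m = d.shape
    mean = d.mean(axis=0)
    se = d.std(axis=0, ddof=1) / np.sqrt(n)
    t_obs = np.max(mean / se)                       # supremal t-statistic

    d_centered = d - mean                           # impose the null
    n_blocks = int(np.ceil(n / block_len))
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)  # circular block bootstrap
        idx = (starts[:, None] + np.arange(block_len)[None, :]).ravel()[:n] % n
        db = d_centered[idx]
        t_boot[b] = np.max(db.mean(axis=0) / se)    # same studentization as t_obs
    return float(np.mean(t_boot >= t_obs))          # bootstrap p-value

# Hypothetical usage: to test "HEAVY weakly dominates GARCH", pass the
# differences scores_heavy - scores_garch (illustrative array names).
```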
Table 1. Size and power of the dominance test: effect of the grid type.

| ES level | Grid | β=0.0, ν=10 | ν=6 | ν=4 | β=0.5, ν=10 | ν=6 | ν=4 | β=0.7, ν=10 | ν=6 | ν=4 | β=0.9, ν=10 | ν=6 | ν=4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Size scenarios | | | | | | | | | | | | | |
| 0.010 | exact | 3.9 | 4.2 | 4.1 | 3.4 | 2.9 | 3.3 | 1.4 | 2.3 | 2.4 | 1.1 | 0.4 | 0.4 |
| | jumps | 2.0 | 2.4 | 2.4 | 2.1 | 2.4 | 1.2 | 1.9 | 1.2 | 1.0 | 0.4 | 0.5 | 0.1 |
| | jumps/10 | 3.1 | 3.0 | 3.8 | 2.4 | 2.3 | 2.3 | 2.1 | 1.5 | 2.1 | 0.6 | 0.5 | 0.1 |
| | equidist. | 3.4 | 4.2 | 4.4 | 3.0 | 2.2 | 2.6 | 2.6 | 2.4 | 1.8 | 0.6 | 0.3 | 0.3 |
| 0.025 | exact | 3.7 | 4.0 | 4.4 | 3.1 | 3.4 | 3.3 | 3.7 | 3.4 | 2.7 | 2.1 | 1.0 | 1.3 |
| | jumps | 3.2 | 3.2 | 3.2 | 3.0 | 2.5 | 2.1 | 1.9 | 1.8 | 1.8 | 0.8 | 0.7 | 0.5 |
| | jumps/10 | 2.9 | 2.9 | 4.1 | 2.3 | 2.3 | 2.8 | 2.6 | 2.0 | 1.5 | 1.0 | 0.8 | 0.9 |
| | equidist. | 3.2 | 3.9 | 2.9 | 2.7 | 3.5 | 2.1 | 3.0 | 3.2 | 1.9 | 1.5 | 1.0 | 0.6 |
| 0.050 | exact | 4.8 | 4.9 | 4.4 | 4.3 | 3.2 | 4.5 | 3.6 | 3.4 | 3.4 | 1.5 | 1.8 | 2.1 |
| | jumps | 3.6 | 3.6 | 4.7 | 3.6 | 3.4 | 2.7 | 1.1 | 2.7 | 2.6 | 1.3 | 1.2 | 1.1 |
| | jumps/10 | 3.7 | 4.3 | 3.0 | 2.6 | 2.9 | 2.9 | 2.1 | 2.5 | 3.1 | 0.9 | 1.2 | 0.7 |
| | equidist. | 3.8 | 3.6 | 3.2 | 2.8 | 4.4 | 4.0 | 2.0 | 2.9 | 2.9 | 2.0 | 1.7 | 1.0 |
| Power scenarios | | | | | | | | | | | | | |
| 0.010 | exact | 64.2 | 45.8 | 29.8 | 8.1 | 4.5 | 2.7 | 1.4 | 1.2 | 0.5 | 0.1 | 0.0 | 0.1 |
| | jumps | 55.8 | 38.1 | 20.4 | 6.6 | 2.2 | 1.2 | 0.8 | 0.3 | 0.4 | 0.0 | 0.0 | 0.0 |
| | jumps/10 | 55.7 | 37.4 | 22.4 | 4.9 | 1.6 | 1.4 | 1.1 | 0.1 | 0.3 | 0.1 | 0.1 | 0.1 |
| | equidist. | 57.5 | 38.3 | 25.3 | 5.1 | 3.1 | 1.2 | 1.0 | 0.4 | 0.2 | 0.2 | 0.3 | 0.0 |
| 0.025 | exact | 92.6 | 85.2 | 81.7 | 33.2 | 25.0 | 21.7 | 11.8 | 8.1 | 4.6 | 1.2 | 1.1 | 1.0 |
| | jumps | 85.7 | 80.9 | 73.6 | 27.5 | 21.1 | 15.2 | 7.6 | 6.3 | 3.5 | 0.6 | 0.7 | 0.5 |
| | jumps/10 | 86.3 | 82.2 | 78.2 | 25.6 | 18.3 | 13.7 | 7.8 | 5.3 | 3.1 | 1.0 | 0.6 | 0.5 |
| | equidist. | 87.1 | 84.1 | 78.4 | 26.3 | 20.5 | 15.8 | 6.4 | 5.2 | 4.5 | 0.7 | 0.2 | 0.2 |
| 0.050 | exact | 98.7 | 98.2 | 97.6 | 58.9 | 57.4 | 53.1 | 26.6 | 21.6 | 18.6 | 4.5 | 3.0 | 3.0 |
| | jumps | 96.7 | 96.0 | 96.9 | 56.0 | 48.1 | 46.4 | 21.8 | 20.4 | 16.4 | 2.2 | 2.9 | 2.5 |
| | jumps/10 | 96.9 | 95.6 | 96.4 | 56.6 | 49.5 | 48.8 | 22.2 | 17.5 | 16.3 | 4.3 | 2.7 | 2.9 |
| | equidist. | 97.8 | 96.8 | 96.9 | 54.6 | 50.6 | 49.7 | 23.2 | 20.7 | 15.6 | 2.7 | 3.6 | 2.6 |
Notes: Size and power results (in percentage points) of the Monte Carlo investigation of the dominance test for a 5% significance level, with focus on the effect of the grid type. “exact” denotes exact analytical computation of the test statistic; the three grid types (“jumps,” “jumps/10,” and “equidistant”) are introduced at the end of Section 2.2. The sample size is fixed at 500 observations, p-values are generated using 500 bootstrap iterations, and 1000 p-values are drawn.
2.4 Related Procedures
The testing procedure described in Sections 2.2 and 2.3 can be viewed as conducting pointwise tests for each η, adjusting the resulting test statistics or p-values for multiple testing, and then taking the minimal adjusted p-value as a p-value for the joint hypothesis across all η. More specifically, the p-value adjustment implicit in the method of Hansen (2005) is a simplified (“one-step” or “single-step”) variant of the Westfall and Young (1993) step-down procedure for multiple testing; see Cox and Lee (2008, Section 3.2) and Meinshausen et al. (2011). Cox and Lee (2008) analyze the properties of applying Westfall and Young (1993) to functional data. In Appendix C, we describe the Westfall–Young procedure for our testing problem. The resulting p-value pWY for the joint null hypothesis always fulfills pWY ≤ pH, implying that the Westfall–Young procedure is more powerful. There are situations where the difference between both procedures is noticeable; see Cox and Lee (2008, Section 3.2). However, in our Monte Carlo study both approaches typically imply the same test decisions at conventional levels (see Appendix C), such that the difference between the two procedures is negligible in practice. We therefore focus on the simpler approach of Hansen (2005).
3 Monte Carlo Evidence on the Dominance Test
In the following investigation, we consider various values for the two DGP parameters (i.e., the persistence parameter β and the degrees of freedom ν), for the three hyperparameters (i.e., the functional level α, the type of grid for η, the number of observations n), and for the two parameters controlling forecast quality (i.e., the perturbation parameters ζ1 and ζ2). For the entire investigation we choose a significance level of 5%. The calculation of a single p-value uses 500 bootstrap iterations, and we draw 1000 p-values per scenario.
Tables 1 and 2 both show Monte Carlo simulation results for size and power. Table 1 addresses the question whether the grid specification for η is important at a sample size of 500. In Section 2.2, we discussed exact computation of the supremal test statistics and three variants of grid approximation (“jumps,” “jumps/10,” and “equidistant”). We observe no exceedance of the nominal 5% level in the scenarios with equal predictive quality, regardless of whether we use the exact computation or any of the three grid approximations. For the power scenarios, the results group clearly by parameter combination, with a generally minor loss of power when using any of the grid approximations. As the computational cost increases noticeably beyond a sample size of 500 while the power properties are similar, our further simulation results are based on the thinned grid of ES forecasts (“jumps/10”).
Table 2. Size and power of the dominance test: effect of the number of observations.

| ES level | Obs. | β=0.5, ν=10 | ν=6 | ν=4 | β=0.7, ν=10 | ν=6 | ν=4 | β=0.9, ν=10 | ν=6 | ν=4 |
|---|---|---|---|---|---|---|---|---|---|---|
| Size scenarios | | | | | | | | | | |
| 0.010 | 500 | 2.4 | 2.3 | 2.3 | 2.1 | 1.5 | 2.1 | 0.6 | 0.5 | 0.1 |
| | 1000 | 2.7 | 2.6 | 2.7 | 2.7 | 2.1 | 2.1 | 0.6 | 0.6 | 0.6 |
| | 2500 | 4.2 | 3.9 | 2.4 | 1.5 | 2.8 | 1.6 | 1.0 | 1.0 | 0.6 |
| 0.025 | 500 | 2.3 | 2.3 | 2.8 | 2.6 | 2.0 | 1.5 | 1.0 | 0.8 | 0.9 |
| | 1000 | 2.4 | 3.3 | 3.9 | 2.4 | 2.8 | 1.9 | 1.3 | 1.1 | 1.1 |
| | 2500 | 4.0 | 4.1 | 2.8 | 2.8 | 3.5 | 3.3 | 2.3 | 1.7 | 0.9 |
| 0.050 | 500 | 2.6 | 2.9 | 2.9 | 2.1 | 2.5 | 3.1 | 0.9 | 1.2 | 0.7 |
| | 1000 | 2.6 | 2.7 | 2.9 | 2.8 | 2.6 | 2.0 | 1.8 | 2.1 | 1.1 |
| | 2500 | 2.7 | 2.5 | 4.5 | 2.9 | 2.5 | 2.9 | 3.2 | 2.8 | 1.7 |
| Power scenarios | | | | | | | | | | |
| 0.010 | 500 | 4.9 | 1.6 | 1.4 | 1.1 | 0.1 | 0.3 | 0.1 | 0.1 | 0.1 |
| | 1000 | 19.2 | 9.3 | 4.2 | 3.5 | 1.8 | 0.3 | 0.4 | 0.1 | 0.3 |
| | 2500 | 74.1 | 49.0 | 23.6 | 20.7 | 9.1 | 5.8 | 1.1 | 0.8 | 0.5 |
| 0.025 | 500 | 25.6 | 18.3 | 13.7 | 7.8 | 5.3 | 3.1 | 1.0 | 0.6 | 0.5 |
| | 1000 | 63.8 | 51.1 | 40.7 | 19.3 | 17.8 | 10.8 | 2.3 | 1.0 | 1.7 |
| | 2500 | 99.1 | 95.6 | 90.1 | 64.8 | 51.8 | 35.1 | 6.8 | 3.6 | 2.7 |
| 0.050 | 500 | 56.6 | 49.5 | 48.8 | 22.2 | 17.5 | 16.3 | 4.3 | 2.7 | 2.9 |
| | 1000 | 89.0 | 87.0 | 84.4 | 49.1 | 42.2 | 36.9 | 6.9 | 4.1 | 4.7 |
| | 2500 | 100.0 | 100.0 | 100.0 | 94.3 | 89.8 | 84.6 | 19.0 | 12.5 | 12.0 |
| Power scenarios | | | | | | | | | | |
| 0.010 | 500 | 31.7 | 18.6 | 10.2 | 6.3 | 2.4 | 1.9 | 0.3 | 0.2 | 0.1 |
| | 1000 | 75.9 | 56.3 | 33.1 | 18.2 | 10.9 | 4.4 | 0.3 | 0.3 | 0.4 |
| | 2500 | 99.8 | 97.8 | 86.9 | 72.7 | 46.6 | 20.0 | 4.7 | 2.0 | 1.3 |
| 0.025 | 500 | 72.7 | 64.3 | 56.1 | 25.2 | 20.4 | 12.4 | 2.1 | 1.7 | 0.6 |
| | 1000 | 98.3 | 95.8 | 92.3 | 65.1 | 52.9 | 39.3 | 5.4 | 3.8 | 3.3 |
| | 2500 | 100.0 | 100.0 | 100.0 | 98.8 | 94.3 | 87.9 | 24.1 | 16.2 | 9.8 |
| 0.050 | 500 | 92.5 | 92.3 | 90.6 | 56.8 | 50.5 | 49.6 | 7.7 | 8.3 | 6.5 |
| | 1000 | 100.0 | 99.9 | 99.7 | 90.0 | 86.7 | 86.0 | 17.8 | 13.1 | 14.0 |
| | 2500 | 100.0 | 100.0 | 100.0 | 99.9 | 100.0 | 100.0 | 53.5 | 44.7 | 41.1 |
Notes: Size and power results (in percentage points) of the Monte Carlo investigation of the dominance test for a 5% significance level, with focus on the effect of the number of observations. The grid of points η is thinned by a factor of 10 (“jumps/10,” see end of Section 2.2), p-values are generated using 500 bootstrap iterations, and 1000 p-values are drawn.
Table 2 gives a more comprehensive summary of the effects that different parameter values have on size and power. Again, while the size remains controlled below the 5% level, we observe that both higher persistence and heavier tails lead to a decrease in power. Similarly, forecasts at a functional level of 0.01 are much harder to evaluate than forecasts at a level of 0.05. In combination, high persistence together with a low ES level can make it impossible to reach a power higher than the nominal size, even for sample sizes of 2500. However, for moderate values of persistence and ES forecast level, the null hypothesis of forecast dominance can be rejected reliably.
4 Empirical Results for S&P 500 and DAX Returns
Our analysis is based on data from http://realized.oxford-man.ox.ac.uk/; this source covers both daily closing prices and realized measures computed from intra-daily data. We construct forecasts for the period from January 2006 to January 2016.12 The entire analysis is out-of-sample, that is, we evaluate the forecasts against realizations that were not used for model fitting. Table 3 gives a brief summary of the parameter estimates. For both the S&P 500 and DAX series, the HEAVY model features a larger γ parameter (weight on the realized measure) than does GARCH. Furthermore, HEAVY is less persistent than GARCH, as reflected in a smaller estimate of β. These findings are qualitatively in line with empirical results by Shephard and Sheppard (2010). Finally, the estimated degrees of freedom are larger (i.e., closer to normality) for HEAVY than for GARCH. This suggests that the realized kernel measure may be more informative as a volatility proxy than squared returns, in the sense that conditioning on the realized kernel leads to a lighter-tailed return distribution than does conditioning on squared returns.
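For concreteness, the sketch below shows one common textbook form of the two variance recursions and how a variance forecast can be turned into a (VaR, ES) pair under a standardized Student-t distribution. The exact specification of Equation (9) is not reproduced here, so the recursions, the initialization, and the scaling should be read as illustrative assumptions; the closed-form lower-tail ES of the t-distribution is the standard textbook expression.

```python
import numpy as np
from scipy import stats

def heavy_garch_variance(returns, rm, params_heavy, params_garch):
    """One-step-ahead conditional variances from illustrative HEAVY and
    GARCH(1,1) recursions (a common textbook form, not necessarily the
    exact Equation (9) of the text).

    returns: array of daily returns; rm: array with a realized measure
    (e.g., the realized kernel). params_*: dicts with keys 'omega',
    'gamma', 'beta'.
    """
    n = len(returns)
    h_heavy = np.empty(n)
    h_garch = np.empty(n)
    h_heavy[0] = h_garch[0] = np.var(returns)  # simple, ad hoc initialization
    for t in range(1, n):
        p, q = params_heavy, params_garch
        h_heavy[t] = p["omega"] + p["gamma"] * rm[t - 1] + p["beta"] * h_heavy[t - 1]
        h_garch[t] = q["omega"] + q["gamma"] * returns[t - 1] ** 2 + q["beta"] * h_garch[t - 1]
    return h_heavy, h_garch

def var_es_from_variance(h, nu, alpha=0.025):
    """(VaR, ES) forecasts for r_t = sigma_t * eps_t, where eps_t follows a
    standardized (unit-variance) Student-t distribution with nu > 2 degrees
    of freedom."""
    scale = np.sqrt(h * (nu - 2.0) / nu)            # maps variance to the t 'scale'
    q = stats.t.ppf(alpha, df=nu)                   # standard-t alpha-quantile
    es_std = -stats.t.pdf(q, df=nu) * (nu + q ** 2) / ((nu - 1.0) * alpha)
    return scale * q, scale * es_std                # (VaR_t, ES_t), both negative
```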
Table 3. Parameter estimates for HEAVY and GARCH models as presented in Equation (9)
| | ω | γ | β | ν |
|---|---|---|---|---|
| S&P 500 | | | | |
| HEAVY | 0.000 | 0.344 | 0.742 | 10.684 |
| GARCH | 0.009 | 0.085 | 0.912 | 6.984 |
| DAX | | | | |
| HEAVY | 0.000 | 0.606 | 0.558 | 13.712 |
| GARCH | 0.018 | 0.083 | 0.910 | 8.256 |
Notes: All parameters are re-estimated each month using rolling windows. Numbers in the table are medians across rolling windows.
Figure 1 presents time series plots of the HEAVY and HS forecasts (the GARCH forecasts are visually similar to the HEAVY ones, and are thus omitted for better display). The figure shows that the HEAVY forecasts display much more time variation than the forecasts of the simple HS method, suggesting that the HEAVY model is much quicker to react to changes in the market environment than the HS method. Table 4 presents some summary statistics on the forecasts. On average, the HS model produces lower forecasts than the other two methods. For the S&P 500 data set, the average forecast is –2.065 for HEAVY, compared to –2.213 for GARCH and –2.761 for HS. The violation rates of the forecasts are 4.2% (HEAVY), 4% (GARCH), and 2.9% (HS), with all three methods exceeding the nominal level of 2.5%, partially due to the negative returns around the 2007–2009 financial crisis.
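The summary statistics reported in Table 4 can be computed along the following lines (a minimal sketch; array names are ours):

```python
import numpy as np

def forecast_summary(var_forecasts, es_forecasts, returns):
    """Average VaR and ES forecasts plus the VaR 'violation' rate, i.e.,
    the fraction of days on which the actual return falls below the VaR
    forecast (nominal rate 0.025 in our setting)."""
    return {
        "avg_var": float(np.mean(var_forecasts)),
        "avg_es": float(np.mean(es_forecasts)),
        "violation_rate": float(np.mean(returns < var_forecasts)),
    }
```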
Table 4. Summary statistics of the VaR and ES forecasts.

| | Avg. VaR forecast | Avg. ES forecast | “Violation” rate |
|---|---|---|---|
| S&P 500 | | | |
| HEAVY | −2.065 | −2.594 | 0.042 |
| GARCH | −2.213 | −2.852 | 0.040 |
| HS | −2.761 | −4.028 | 0.029 |
| DAX | | | |
| HEAVY | −2.499 | −3.095 | 0.038 |
| GARCH | −2.606 | −3.322 | 0.038 |
| HS | −3.130 | −4.493 | 0.025 |
Notes: Sample period ranges from January 2006 to January 2016 (daily data). The “violation” rate is the fraction of days for which the actual return falls below the VaR forecast (nominal rate: 0.025).

Figure 1. Time series plots of empirical forecasts for VaR and ES. The sample period ranges from January 2006 to January 2016. See text for details.

Figure 2. Murphy diagrams for empirical forecasts. Smaller scores are better.
Figures 2 and 3 and Table 5 contain our main forecast evaluation results. Figure 2 presents Murphy diagrams for all three methods, with the S&P 500 results at left and the DAX results at right. For both data sets, the HEAVY model seems to attain the lowest average elementary score for the vast majority of thresholds η. Forecasts based on the GARCH(1, 1) model perform slightly worse, and the HS method’s performance trails by a considerable margin. This pattern is emphasized in Figure 3, where the HEAVY forecasts are compared directly against GARCH(1, 1) and HS, respectively. Examining the difference in elementary scores makes it easier to detect which of two models is better at a certain threshold, especially when the difference is small. Pointwise confidence intervals at the 95% level give an impression of the significance of the outperformance exhibited by the HEAVY model. For the S&P 500 data, HEAVY seems to perform significantly better than GARCH for a majority of thresholds η; in contrast, the visual comparison for DAX returns does not indicate any clear dominance relation. Table 5 reports the p-values of the formal dominance test presented in Section 2: There is ample support against the null hypothesis that HS dominates HEAVY, but no evidence against dominance of HEAVY over HS. In the comparison of HEAVY and GARCH(1, 1), we do not find enough evidence to reject either direction of weak dominance, at least at the 5% level. These results are found for both the S&P 500 and the DAX data.13 Note that the results in Table 5 are based on a mean block length of ten in the block bootstrap implementation. Using a mean block length of twenty leads to the same test decisions at the 5% level. Furthermore, the results in Table 5 are based on exact calculation of the supremal test statistic; grid-based approximations yield very similar p-values.
Table 5. p-Values of the forecast dominance tests.

| Hypothesis | p-value |
|---|---|
| S&P 500 | |
| HS weakly dominates HEAVY | 0.000 |
| HEAVY weakly dominates HS | 0.386 |
| GARCH weakly dominates HEAVY | 0.082 |
| HEAVY weakly dominates GARCH | 0.554 |
| DAX | |
| HS weakly dominates HEAVY | 0.000 |
| HEAVY weakly dominates HS | 0.488 |
| GARCH weakly dominates HEAVY | 0.172 |
| HEAVY weakly dominates GARCH | 0.732 |
Notes: The table presents p-values for several hypotheses related to forecast dominance (see Definition 2.1). The results are based on exact calculation of the supremal test statistic (see Section 2.2), and the bootstrap implementation is based on a mean block length of ten.

Figure 3. Score differences for empirical forecasts. Negative differences mean that HEAVY outperforms its competitor. Confidence intervals are pointwise at the 95% level.
The fact that both HEAVY and GARCH dominate HS can perhaps be explained by their use of conditioning information, in contrast to the unconditional distribution estimate implicit in HS. Holzmann and Eulert (2014) show that larger information sets lead to better scores under correct specification. While the latter assumption is unlikely to be satisfied in practice, one might expect similar results to hold under moderate degrees of misspecification.
5 Discussion
In this article, we provide a mixture representation for the consistent scoring functions for the pair (VaR, ES). This mixture representation facilitates assessments of whether one sequence of (VaR, ES) predictions dominates another across a suitable, user-specified class of scoring functions. As we are primarily interested in the comparison of the ES forecasts, we focus on a class that puts as much emphasis on ES as possible.
We also propose a formal statistical test for forecast dominance in this context. Theoretical arguments are provided to show that the size of the test is controlled asymptotically, which is supported by the results of a detailed simulation study. This study also investigates the power properties of the test for a broad range of parameter choices in a practically relevant model for the data generating process. For the ES level of 0.025 recommended in the Basel III standard and a degree of volatility persistence that is similar to our empirical estimates, we observe good power properties for reasonably large sample sizes.
When comparing forecast performance in terms of forecast dominance, it is not necessary to select a specific scoring function prior to forecast evaluation. In the presence of possibly misspecified forecasts and non-nested information sets, this is an advantage, as any choice of a particular consistent scoring function induces a preference ordering on all possible sequences of forecasts that is usually difficult or impossible to justify, or even to describe; see Patton (2016). On the other hand, Murphy diagrams may lead to inconclusive situations in which neither of the two forecast methods dominates the other. This may be undesirable in decision-making contexts. Ideally, future work should develop a better understanding of Murphy diagrams, so that they can not only be used to check for forecast dominance but also to guide the choice of a consistent scoring function appropriate for a specific application when a total order on forecasting methods is needed.
Appendix
B Test Statistic Behavior
Finally, we can use these results to calculate the supremum and infimum over any interval I, for either type of test statistic.
C Relation of Hansen’s Test to Westfall and Young (1993)
We also considered the method of Westfall and Young (1993), which controls the familywise error rate (i.e., the probability of making at least one false rejection) in multiple testing problems. To describe the procedure, let m be the number of grid points at which the test statistic is evaluated, and let π be the permutation of {1, …, m} that arranges the sample t-statistics in ascending order. Along this ordering, successive maxima of the bootstrap t-statistics are formed: the first is the largest of all bootstrapped t-statistics, and the j-th is the largest bootstrap t-statistic across the grid points π(j), …, π(m).
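For reference, here is a generic sketch of the free step-down (max-T) adjustment of Westfall and Young (1993), computed from a matrix of bootstrap t-statistics; the ordering convention and the way the null hypothesis is imposed in our specific setting may differ from this textbook version, so treat it as an illustration rather than a reproduction of the procedure above.

```python
import numpy as np

def westfall_young_stepdown(t_obs, t_boot):
    """Free step-down (max-T) adjusted p-values of Westfall and Young (1993).

    t_obs:  array of shape (m,)   -- sample t-statistics per grid point.
    t_boot: array of shape (B, m) -- bootstrap t-statistics under the null.
    Returns adjusted p-values in the original grid order.
    """
    m = t_obs.size
    order = np.argsort(-t_obs)                 # most significant hypothesis first
    # successive maxima over the remaining (less significant) hypotheses
    succ_max = np.maximum.accumulate(t_boot[:, order][:, ::-1], axis=1)[:, ::-1]
    padj = np.empty(m)
    for j in range(m):
        padj[j] = np.mean(succ_max[:, j] >= t_obs[order[j]])
    padj = np.maximum.accumulate(padj)         # enforce monotonicity
    out = np.empty(m)
    out[order] = padj                          # map back to original order
    return out
```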
However, in our setup, the difference between the two procedures seems to be unimportant. Table 6 summarizes the maximum number of disagreements between the two procedures at significance levels of integer-valued percentage points from 1% to 10%. We observe that, across all 594 parameter combinations in our simulation study from Section 3, the largest power difference lies at 1% and the largest size difference lies at a tenth of the respective nominal level. Additionally, these differences seem to decrease as the number of observations grows. These findings suggest that the results of Meinshausen et al. (2011), which show a certain asymptotic optimality property of Hansen’s procedure in some settings, may hold more generally.
Table 6. Maximum number of disagreements between Hansen’s test and the Westfall–Young correction, by significance level.

| Observations | 1% | 2% | 3% | 4% | 5% | 6% | 7% | 8% | 9% | 10% |
|---|---|---|---|---|---|---|---|---|---|---|
| All scenarios | | | | | | | | | | |
| n = 500 | 5 | 3 | 6 | 6 | 5 | 8 | 7 | 6 | 8 | 10 |
| n = 1000 | 1 | 2 | 1 | 0 | 3 | 4 | 3 | 4 | 3 | 3 |
| n = 2500 | 2 | 1 | 1 | 0 | 2 | 1 | 1 | 2 | 0 | 1 |
| Size scenarios | | | | | | | | | | |
| n = 500 | 1 | 1 | 1 | 2 | 4 | 4 | 6 | 4 | 8 | 9 |
| n = 1000 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| n = 2500 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Notes: Maximum number of disagreements (per 1000 p-value replications) between Hansen’s test and the Westfall and Young correction among all simulated parameter combinations (scenarios) for various levels of significance. We consider 432 distinct parameter combinations for n = 500 and 81 distinct parameter combinations for n = 1000 and n = 2500, with one third of each category being size scenarios.
Footnotes
1 As detailed below, a scoring function (or loss function) assigns a real-valued score, given a forecast and a realizing observation.
2 The name of the diagrams alludes to the meteorologist Allan H. Murphy (1931–1997) who pioneered similar diagrams in the context of a binary dependent variable (see Murphy, 1977, as well as Ehm et al., 2016, p. 519).
3 In financial jargon, the word “backtesting” is sometimes used as a synonym for “forecast evaluation.”
4 The situation is similar for other functionals, that is, there is typically a whole family of scoring functions that are consistent for a given functional. For example, Savage (1971) identifies a family of scoring functions that are consistent for the mean, and Gneiting (2011b) describes the family of scoring functions that are consistent for a quantile.
5 Another representation of this class, which does not make use of elementary scores, can be obtained by setting the function G1 to zero in Equation (2).
6 In contrast, comparisons of population-level predictive ability (Clark and McCracken, 2013, Section 3.1) ask whether model A would outperform model B if both models were estimated without error. They are useful to discriminate between alternative theories or assess the possible impact of a certain regressor, but are less in line with practical forecast situations which we consider here.
7 In principle, one might try to estimate τG and its counterpart in order to arrive at bounds for pH. However, this procedure is likely to be computationally demanding, which contradicts the original motivation for using the grid approximation. It hence seems preferable to either treat the approximation error as negligible, or to compute pH via the analytical supremum calculations detailed in Appendix B. We provide simulation results on both approaches in Section 3.
8 Hansen’s test conducts a comparison between a benchmark method and finitely many competitors. In the case of Definition 2.1’, the comparison is between two methods at a finite number of fixed grid points. From a technical perspective, both of these comparisons boil down to testing whether all elements of a random vector have nonnegative expectation. Note that Hansen (2005) allows for cross-sectional dependence among the vector elements, as well as for certain forms of time series dependence, both of which are likely present in our setup.
9 The resulting AR(1) model implies that the log of has a mean of –0.62, a first-order autoregressive coefficient of 0.83, and a residual variance of 0.38.
10 We evaluate the ES of the t-distribution using the function esT of the R package VaRES (Nadarajah et al., 2013), which employs numerical integration. The values thus obtained are very similar to those from an analytical expression (see Dobrev et al., 2017, Section 4, and the references therein) of which we became aware after completing this work; for the parameter values considered here, the absolute differences are negligible.
11 Both models have access to the same information base, which is used optimally by the second model, but suboptimally by the first model. Tsyplakov (2014) shows that this setup implies dominance of the second model under all proper scoring rules.
12 More precisely, the S&P 500 sample comprises 2420 observations from January 6, 2006 to January 25, 2016; the DAX sample comprises 2494 observations from January 4, 2006 to January 25, 2016.
13 At the 10% level, the null that GARCH dominates HEAVY is rejected for the S&P 500 data, in line with the visual impression conveyed by Figure 3.
* We thank seminar and conference participants in Heidelberg, Augsburg (Statistische Woche 2016), Karlsruhe (HeiKaMEtrics 2018), and Freiburg (GPSD 2018) for helpful comments. J. F. Z. gratefully acknowledges financial support of the Swiss National Science Foundation. The work of F.K. and A.J. has been funded by the European Union Seventh Framework Programme under grant agreement 290976. They also thank the Klaus Tschira Foundation for infrastructural support at the Heidelberg Institute for Theoretical Studies (HITS). The opinions expressed in this article are those of the authors and do not necessarily reflect the views of Raiffeisen Schweiz. Calculations were performed on the HPC cluster at HITS, and UBELIX (http://www.id.unibe.ch/hpc), the HPC cluster at the University of Bern.
References
Basel Committee on Banking Supervision.