Bias-corrected score decomposition for generalized quantiles

Decompositions of the score of a forecast are useful tools for assessing its performance. We consider local score decompositions permitting detailed forecast assessments across a spectrum of conditions of interest. We derive corrections to the bias of the decomposition components in the framework of point forecasts of quantile-type functionals, and illustrate their performance by simulation. Related bias corrections have thus far only been known for squared error criteria.


INTRODUCTION
The difficulty of making predictions about future data has led to a large literature on the assessment of forecasts. The squared distance between predicted and realized values provides a simple measure of accuracy, and much existing work relies on quadratic error criteria. Our focus here is on point forecasts of some characteristic of the predictive distribution, such as the mean or a quantile. Proper error quantification in this case depends on the concept of a loss-type scoring function that is consistent for the given characteristic (Gneiting, 2011). Given such a consistent scoring function, forecaster X_1 would be ranked better than forecaster X_2 if the average score of X_1 is less than that of X_2. Additional criteria include skill scores (Murphy & Winkler, 1987), calibration and sharpness measures (DeGroot & Fienberg, 1983; Gneiting et al., 2007), as well as decompositions of the average score into three components commonly referred to as reliability, resolution and uncertainty (Murphy & Winkler, 1987; Bröcker, 2009, 2012; Weijs et al., 2010; Christensen, 2015). Comprehensive assessment of forecast schemes requires further information not provided by such summary statistics. Graphical tools such as verification rank histograms are particularly useful for checking the correct calibration of the predictions, but accuracy is also an issue (Gneiting et al., 2007).
We consider score decompositions permitting an assessment of both these aspects. To that end, the data are split into strata according to the values of a variable W. Usually, only the special case W = X, where X is the forecast, is considered. A. Tsyplakov, in the 2014 working paper 'Theoretical guidelines for a partially informed forecast examiner' posted at https://mpra.ub.uni-muenchen.de/67333/, points out that the information available to forecasters and assessors may differ, and assessments may be carried out on the basis of any relevant information, encoded here by W. We specifically consider the case where X is W-measurable. Then W is best thought of as a pair W = (X, A), where A represents auxiliary information or indexes domains calling for a separate forecast evaluation, such as season and latitude in weather forecasts (Murphy, 1995). In any case, conditionally on W the score decomposes into components measuring calibration and entropy locally, in contrast with the common three-term decomposition, which involves the intrinsically global resolution and uncertainty terms. Taking up previous findings of a potentially serious bias in estimators of these criteria (Bröcker, 2012; Bentzien & Friederichs, 2014), we derive corrections removing this bias to a first order of approximation. Related results have so far been obtained only for mean-value-type functionals and squared error (Ferro & Fricker, 2012). The latter permits algebraic calculations that are not feasible with the scoring functions relevant to quantiles or expectiles.
We instead build on a recently established mixture representation of such scoring functions (Ehm et al., 2016) and on the local behaviour of empirical distribution functions. Our bias corrections apply to a broad class of generalized quantiles and the associated consistent scoring functions, making it possible to treat quantiles and expectiles in a unified manner. Likewise, they yield immediate corrections to the bias of global score components such as the resolution.

SCORE DECOMPOSITIONS
A characteristic such as a quantile or mean value corresponds to a functional F → T(F) on some class F of right-continuous distribution functions on the real line, the predictive distributions (Gneiting, 2011). Given T, a nonnegative scoring function S = S(x, y) of the point forecast x of T(F) and the future observation y is consistent for T if for every F ∈ F the expected score S(x, F) = ∫ S(x, y) dF(y) satisfies S{T(F), F} ≤ S(x, F) for all x, and strictly consistent if T(F) is the unique minimizer of S(x, F). For example, the piecewise linear score S(x, y) = |1_{y<x} − α| |x − y| is strictly consistent for the lower α-quantile, T(F) = inf{t : F(t) ≥ α}, and the asymmetric quadratic score S(x, y) = |1_{y<x} − α| (x − y)^2 is strictly consistent for the α-expectile (Newey & Powell, 1987). Here 1_A denotes the indicator function of the event A, and the respective moments are supposed to be finite for F ∈ F. The case α = 1/2 retrieves the median and the mean, which minimize the expected absolute and squared error, respectively.
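As a quick numerical illustration of consistency, the following sketch (variable names are ours, not from the paper) verifies that the average piecewise linear score over a large sample is minimized near the α-quantile; it assumes nothing beyond NumPy.

```python
import numpy as np

def pinball(x, y, alpha):
    """Piecewise linear score |1{y < x} - alpha| |x - y|, consistent for the alpha-quantile."""
    return np.where(y < x, (1 - alpha) * (x - y), alpha * (y - x))

rng = np.random.default_rng(0)
y = rng.normal(size=100_000)          # sample from F = N(0, 1)
alpha = 0.9

grid = np.linspace(-3, 3, 601)        # candidate point forecasts x
mean_scores = [pinball(x, y, alpha).mean() for x in grid]
x_star = grid[int(np.argmin(mean_scores))]

# x_star should be close to the true 0.9-quantile of N(0, 1), about 1.2816
print(x_star)
```

The minimizing grid point approaches T(F) as the sample size grows, which is exactly the defining property of a strictly consistent scoring function.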
Throughout the following, S stands for a fixed scoring function that is consistent for the given functional T on F. The minimum expected score, inf x S(x, F) = S{T (F), F}, is called the entropy of F, and the overshoot d(x, F) = S(x, F) − S{T (F), F} is called the divergence between x and T (F). In the median and the mean value cases, the entropy reduces to one half times the mean absolute deviation and the variance, respectively, and quite generally it makes sense to think of the entropy as a generalized variance and the divergence as a bias term.
We now consider the point forecast x, the verifying observation y, and the third variable w as a triplet of random variables (X, Y, W) defined on some probability space (Ω, B, Q). Let G denote the unconditional, and G_w the conditional, distribution of Y given W = w. All these distributions are supposed to be members of F. The random variable S{T(G_W), G_W} quantifies the conditional uncertainty in Y given W, to be called the local entropy, ENT_W. If X is not W-measurable, the conditional score decomposition involves a cross term that may assume negative values, which does not thwart the technical developments but makes interpretations difficult. Thus for simplicity, we assume that X is W-measurable. Then

E{S(X, Y) | W} = S(X, G_W) = S{T(G_W), G_W} + d(X, G_W),    (1)

and rearranging (1) yields a conditional bias-variance-type decomposition essentially identical to those of Murphy (1995), Bröcker (2009), and Tsyplakov in the paper cited above, namely

E{S(X, Y) | W} = MCB_W + ENT_W,    (2)

where MCB_W = d(X, G_W) is the local miscalibration. For comparison, the unconditional score decomposition derived in various guises by Brier (1950), Murphy & Winkler (1987), Bröcker (2009) and others reads

E{S(X, Y)} = MCB − RES + UNC,    (3)

where MCB = E{d(X, G_W)}, RES = E[d{T(G), G_W}] and UNC = S{T(G), G}. This results from (2) by taking expectations. The resolution RES measures the average variability of T(G_W), and the global miscalibration MCB commonly is referred to as the reliability or conditional bias. We deviate from this terminology because we feel that miscalibration is more appropriate than reliability, and because in our setting the term conditional bias is needed for a different purpose. We assume throughout that the triplets (x_i, y_i, w_i) (i = 1, …, n) are independent and identically distributed copies of (X, Y, W). For simplicity, W is supposed to be discrete, as is effectively the case when the range of W is partitioned into finitely many bins. The numerical values of the w_i do not matter in most of the following, and simply are labelled as k = 1, …, m. Accordingly, G_k denotes the conditional distribution of Y given W = k. We estimate G_k by the empirical distribution Ĝ_{n,k} of the y_i such that w_i = k.
Let n_k = #{i : w_i = k} denote their number, and define the local empirical score, entropy, and calibration terms as

S̄_k = n_k^{-1} Σ_{i : w_i = k} S(x_i, y_i),  ÊNT_k = S{T(Ĝ_{n,k}), Ĝ_{n,k}},  M̂CB_k = S̄_k − ÊNT_k.

The empirical analog of (2) then obtains as

S̄_k = M̂CB_k + ÊNT_k,    (4)

and the empirical counterparts of the uncertainty and resolution terms are

ÛNC = S{T(Ĝ_n), Ĝ_n},  R̂ES = Σ_k (n_k/n) d{T(Ĝ_n), Ĝ_{n,k}},

with Ĝ_n the empirical distribution of all y_i. Writing S̄ = n^{-1} Σ_i S(x_i, y_i) for the average score, one obtains the empirical analog of (3),

S̄ = M̂CB − R̂ES + ÛNC,    (5)

by defining the empirical miscalibration so as to satisfy (5), i.e., as M̂CB = S̄ + R̂ES − ÛNC. There is obviously an analogy with the theoretical decompositions: averaging (4) with weights n_k/n exactly reproduces the global empirical decomposition (5). As a specific feature of (4), all three terms depend only on the forecast-observation pairs (x_i, y_i) with w_i = k, and so are strictly local in this sense. By contrast, the empirical resolution and uncertainty terms involve T(Ĝ_n), and so are global in nature.
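The empirical decomposition is easiest to see for the mean functional with the squared error score, where the entropy is the variance. The following sketch (synthetic data and variable names are ours) computes the local terms of (4) for binned data and verifies that their weighted average reproduces the global identity (5) exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_k = 8, 50
# synthetic forecast-observation pairs in m bins (the bin index plays the role of w_i)
w = np.repeat(np.arange(m), n_k)
x = rng.normal(size=m)[w]                              # one forecast per bin
y = x + 0.3 + rng.normal(size=m * n_k)                 # observations, deliberately biased by 0.3

S_bar = ((x - y) ** 2).mean()                          # average squared error score
UNC = y.var()                                          # S{T(G_n), G_n}: variance of all y_i
MCB = RES = 0.0
for k in range(m):
    xk, yk = x[w == k], y[w == k]
    S_k = ((xk - yk) ** 2).mean()                      # local average score
    ENT_k = yk.var()                                   # local empirical entropy (1/n norming)
    MCB_k = S_k - ENT_k                                # local calibration term of (4)
    MCB += (len(yk) / len(y)) * MCB_k
    RES += (len(yk) / len(y)) * (y.mean() - yk.mean()) ** 2   # d{T(G_n), G_{n,k}}
# global decomposition (5): the two numbers below agree up to rounding
print(S_bar, MCB - RES + UNC)
```

The exact agreement rests on the empirical law of total variance: the overall variance ÛNC splits into the weighted within-bin variances plus the weighted squared deviations of the bin means, which is precisely ÛNC − R̂ES = Σ_k (n_k/n) ÊNT_k here.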

CORRECTING THE BIAS OF THE DECOMPOSITION COMPONENTS
If the empirical distributions Ĝ_n and Ĝ_{n,k} are close to G and G_k, the empirical calibration, resolution, and uncertainty or entropy terms should give useful approximations to their theoretical counterparts. However, there could be a finite-sample bias. To study conditional biases given the w_i, it is convenient to introduce the σ-algebra W generated by all indicator variables 1_{w_i = k} (i = 1, …, n; k = 1, …, m). Frequent use will be made of the following fact.
LEMMA 1. Conditionally on W the random variables y_i with w_i = k are independent and identically distributed with distribution G_k.
We now show that in order to obtain bias-corrected estimates of the single decomposition components, it suffices to correct for the conditional biases β_k of the local empirical entropies, conditionally on W. If E_W denotes conditional expectation given W, those biases are given by

β_k = E_W[S{T(Ĝ_{n,k}), Ĝ_{n,k}}] − S{T(G_k), G_k}.    (6)

Extensive simplifications result from the local average scores S̄_k being unbiased estimates of the respective conditional expectations, E_W(S̄_k) = E{S(X, Y) | W = k}. For by (4), the conditional biases of ÊNT_k and M̂CB_k add to zero, so the bias of the latter term equals −β_k. A similar argument applies to the global miscalibration term M̂CB, whose conditional bias is −Σ_k (n_k/n) β_k. Given W, the numbers n_k are known and can be considered as fixed. No such reduction is available for the conditional bias β = E_W[S{T(Ĝ_n), Ĝ_n}] − S{T(G), G} of the empirical uncertainty ÛNC = S{T(Ĝ_n), Ĝ_n}. However, β is again the conditional bias of an empirical entropy, which allows us to determine it in complete analogy to that of ÊNT_k. Finally, writing R̂ES = ÛNC − (ÛNC − R̂ES) and recalling ÛNC − R̂ES = Σ_k (n_k/n) S{T(Ĝ_{n,k}), Ĝ_{n,k}} shows that the conditional bias of the empirical resolution, too, can be expressed in terms of the other biases.
To summarize, suppose that suitable estimates β̂_k and β̂ of the local and global conditional biases β_k and β are available. Our bias-corrected estimates for the single terms in the decompositions (2), (3) are then constructed as

ÊNT*_k = ÊNT_k − β̂_k,  M̂CB*_k = M̂CB_k + β̂_k,    (7)

ÛNC* = ÛNC − β̂,  M̂CB* = M̂CB + Σ_k (n_k/n) β̂_k,  R̂ES* = R̂ES − β̂ + Σ_k (n_k/n) β̂_k.    (8)

Clearly, a correction for the conditional bias corrects for the unconditional bias.
Remark 1. The sign of β_k is immediate from (6). Indeed, consistency of S implies that E_W[S{T(Ĝ_{n,k}), Ĝ_{n,k}}] ≤ E_W{S(x, Ĝ_{n,k})} = S(x, G_k) for every x. Since we may set x = T(G_k), it follows that E_W[S{T(Ĝ_{n,k}), Ĝ_{n,k}}] ≤ S{T(G_k), G_k}, that is, β_k ≤ 0. Analogously for the global conditional bias, β ≤ 0. One notices the similarity with the fact that the empirical variance, with norming 1/n, is biased downwards.
Bias corrections for score decompositions have mainly been studied for mean values. Functionals such as quantiles or expectiles have recently received interest as well (Bentzien & Friederichs, 2014). Their consistent scoring functions share a common form allowing a unified treatment: every such scoring function admits a pointwise valid representation S(x, y) = ∫ S_θ(x, y) dM(θ), where {S_θ, θ ∈ R} is a family of elementary scoring functions each of which is consistent for the given functional, and M is a nonnegative measure on R (Ehm et al., 2016). One common form of the elementary scoring functions is (Dawid, 2016; Ziegel, 2016)

S_θ(x, y) = 1_{x>θ} I_θ(y) + L_θ(y),    (9)

where the function I_θ(y) is nondecreasing in θ, and both I and L are right-continuous in θ for every y. Specifically, in the case of an α-quantile the functions are I_θ(y) = 1_{y≤θ} − α, L_θ(y) = α 1_{y>θ}, while for an α-expectile I_θ(y) = (1 − α)(θ − y)_+ − α(y − θ)_+, L_θ(y) = α(y − θ)_+, with a_+ = a ∨ 0, a_− = −(a ∧ 0) the positive and negative parts of a ∈ R, respectively. The I_θ form a family of so-called identification functions. The name derives from the fact that they can be used to identify the value T(F) of the relevant functional at F. Indeed, I_θ(F) < 0 ⇐⇒ θ < T(F), which follows from the respective definition of T(F) and the fact that θ → I_θ(y), hence θ → I_θ(F) ≡ ∫ I_θ(y) dF(y), is nondecreasing and right-continuous.
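The mixture representation can be made concrete for the quantile case. The sketch below (our code; the grid-based integration is only an illustration) numerically integrates the elementary α-quantile scores of (9) over θ with constant mixture density m ≡ 1 and compares the result with the closed-form piecewise linear score.

```python
import numpy as np

alpha, x, y = 0.7, 1.3, -0.4          # an arbitrary forecast-observation pair

def S_theta(theta, x, y, alpha):
    """Elementary score S_theta(x, y) = 1{x > theta} I_theta(y) + L_theta(y)
    for the alpha-quantile: I_theta(y) = 1{y <= theta} - alpha, L_theta(y) = alpha 1{y > theta}."""
    I = (y <= theta) - alpha
    L = alpha * (y > theta)
    return (x > theta) * I + L

theta = np.linspace(-5, 5, 200_001)   # integration grid; dM(theta) = d(theta), i.e. m = 1
dtheta = theta[1] - theta[0]
mixture = S_theta(theta, x, y, alpha).sum() * dtheta   # Riemann sum of the mixture integral
pinball = (1 - alpha) * (x - y) if y < x else alpha * (y - x)
print(mixture, pinball)               # the two values agree up to the grid resolution
```

The elementary score is (1 − α) on the interval between y and x when y < x, and α on that interval otherwise, so its integral over θ recovers the piecewise linear score exactly.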
PROPOSITION 1 (Dawid, 2016). Suppose that {S_θ} is a family of nonnegative scoring functions of the form (9) with functions I_θ such that the mapping θ → I_θ(F) is well-defined, nondecreasing, and right-continuous for every F ∈ F. Let S(x, y) = ∫ S_θ(x, y) dM(θ), where M is a nonnegative measure on R such that S(x, F) < ∞ for F ∈ F, x ∈ R. Then S and each S_θ are consistent scoring functions for the functional T on F defined as

T(F) = inf{θ ∈ R : I_θ(F) ≥ 0}.

Henceforth we consider score decompositions and bias corrections within the setting of Proposition 1. Functionals T of the specified type will be referred to as generalized quantiles. For our main result we need to suppose the following.
Assumption 1. The class F contains each conditional distribution G_k, and for any F ∈ F: (i) the functional value T(F) is finite; (ii) F admits a continuous density f = F′; (iii) if F_n is the empirical distribution function of a sample of size n from F, then as n → ∞ the process θ → U_{θ,n} = n^{1/2}{I_θ(F_n) − I_θ(F)} converges weakly in the Skorohod space D(−∞, ∞) to a mean zero Gaussian process {U_θ} with continuous sample paths (van der Vaart, 1998, Sect. 18.3).
The above assumptions are weak. For α-quantiles, for example, U_{θ,n} converges weakly to the Brownian bridge process B{F(θ)} (van der Vaart, 1998, p. 266), which has continuous sample paths since θ → F(θ) = I_θ(F) + α is continuous, by (ii).

THEOREM 1. Suppose that the mixture measure M has a continuous density m. Then under Assumption 1, the random variable Δ̂_k = S{T(Ĝ_{n,k}), Ĝ_{n,k}} − S{T(G_k), G_k} from (6) admits a stochastic approximation by another random variable Δ_k such that Δ̂_k = Δ_k + o_p(n_k^{-1}) as n_k → ∞ and

E_W(Δ_k) = −m(t_k) var_{G_k}{I_{t_k}(Y)} / {2 n_k I′_{t_k}(G_k)},  where t_k = T(G_k),    (10)

and I′_θ(G_k) denotes the derivative of the mapping θ → I_θ(G_k).

As in McCullagh (1987, p. 209) we take the expression (10) for E_W(Δ_k) as our approximation to β_k = E_W(Δ̂_k). Estimates β̂_k of the local conditional biases to be used with (7), (8) are obtained on substituting the unknown distributions G_k in (10) by their empirical counterparts Ĝ_{n,k}. The estimate β̂ of the global conditional bias is of the same form, only that G_k, Ĝ_{n,k} have to be replaced by G, Ĝ_n, and n_k by n. We therefore focus on the local biases.
Most prominent among the mixture scores are those with a constant mixture density m. For quantiles, m ≡ 1 yields the piecewise linear score, S(x, y) = |1_{y<x} − α| |x − y|, while for expectiles m ≡ 2 yields the asymmetric quadratic score, S(x, y) = |1_{y<x} − α| (x − y)^2. These standard choices are presupposed in the following. We first consider the quantile case. Since by condition (ii) every F ∈ F has a continuous density f = F′, the identification function I_θ(y) = 1_{y≤θ} − α has variance α(1 − α) under G_k at θ = t_k, and I′_{t_k}(G_k) = g_k(t_k), whence our bias correction assumes the form β̂_k = −α(1 − α)/{2n_k ĝ_k(q̂_k)}, where ĝ_k and q̂_k are estimates of the density g_k of G_k and of its α-quantile, respectively. This is unfortunate in two respects: first, density estimates tend to be unstable; secondly, the crucial term ĝ_k(q̂_k) appears in the denominator. Thus unless n_k is large enough for a kernel estimate to be useful, it appears wise to consider a bias correction only in connection with a parametric model, where ĝ_k can be estimated by plugging in the parameter estimates. The expectile case presents no such problems. One simply may replace G_k by Ĝ_{n,k} and t_k by the empirical α-expectile η̂_k of Ĝ_{n,k} in (10). The correction is fully nonparametric as well as local, as it depends only on the empirical distribution of the verifying observations in the respective bin. For α = 1/2 the correction simplifies to β̂_k = −s_k^2/(2n_k), with s_k^2 the variance of Ĝ_{n,k}. Related results for this case were obtained by Bröcker (2012) and Ferro & Fricker (2012).
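For the α = 1/2 expectile the correction can be checked directly by simulation: under the 1/2-expectile score the entropy of G_k is one half its variance, and the empirical entropy is biased downwards by about σ^2/(2n_k). The sketch below (our code, with illustrative sample sizes) shows that subtracting the estimated bias removes most of the error; a small second-order bias remains, consistent with a correction that is valid to first order.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 5, 200_000
y = rng.normal(size=(reps, n))             # many samples of size n from G_k = N(0, 1)

v = y.var(axis=1)                          # variance of each empirical distribution (1/n norming)
ent_hat = v / 2                            # empirical entropy under the 1/2-expectile score
ent_corr = ent_hat + v / (2 * n)           # ENT*_k = ENT_k - beta_k with beta_k = -v/(2 n)

true_ent = 0.5                             # entropy of N(0, 1): one half times the variance
print(ent_hat.mean() - true_ent, ent_corr.mean() - true_ent)   # raw bias vs corrected bias
```

With n = 5 the raw empirical entropy is biased by roughly −σ^2/(2n) = −0.1, while the corrected estimate retains only the second-order bias of roughly −σ^2/(2n^2).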

SIMULATIONS
To illustrate the finite-sample performance of the bias correction we report a small simulation study. For simplicity, we took W = X and adopted a bivariate normal model,

X ~ N(0, σ_x^2),  Y = γ X + ε,  ε ~ N(0, σ^2) independent of X,    (11)

for some γ > 0. Then E(Y | X) = γX, so that in the case α = 1/2 of the mean value or the median, the forecast X is perfectly calibrated if and only if γ = 1. However, for α ≠ 1/2 the forecast is miscalibrated for α-quantiles and α-expectiles even if γ = 1. Parameters used in the simulations were σ_x = 2, σ = 1, and α ∈ {0·5, 0·9}, γ ∈ {0·8, 1}. We generated 5000 random samples, each of n = 200 data pairs (x_i, y_i), according to the model (11). The data were grouped into m = 40 bins B_k according to the order statistics of the x_i, so that every bin contained n_k = 5 data points. In this context, w_i may be thought of as the mean value, denoted x̄_k, of the x_i belonging to B_k. Accordingly, we approximated the theoretical conditional distribution of Y given X ∈ B_k by N(x̄_k γ, σ^2), which is adequate if the x̄_k-values are sufficiently dense. The approximation is not applicable to the two boundary bins, which were ignored. Figure 1 presents simulation results for a normalized miscalibration measure we call deviation, D̂EV_k = (1 + M̂CB_k/ÊNT_k)^{1/2} − 1, and its bias-corrected analogue D̂EV*_k, compared against the theoretical quantities DEV_k = (1 + MCB_k/ENT_k)^{1/2} − 1. Evidently, the local biases were quite large, and the bias reduction moderate to substantial. Similar results hold for the global biases. The bias of the simulated global miscalibration estimates M̂CB, normalized by their standard deviation, ranged from 2·4 to 5 in the important special case α = 0·5. With bias correction, the corresponding values were between 0·25 and 0·87. They were never larger than one half times the uncorrected ones and often substantially smaller.
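A minimal sketch of the study's α = 1/2 expectile case is given below. The code is ours and simplifies the setup: γ = 1 so the mean forecast is calibrated, the boundary bins are not treated separately, and only the global miscalibration is tracked, with and without the local corrections β̂_k = −s_k^2/(2n_k).

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, sigma_x, sigma = 1.0, 2.0, 1.0      # gamma = 1: calibrated mean forecast
n, m = 200, 40                             # n data pairs per replicate, m bins of n_k = 5

def global_mcb(corrected):
    """One replicate: empirical global miscalibration for the 1/2-expectile
    (score (x - y)^2 / 2), with or without the local bias correction."""
    x = rng.normal(scale=sigma_x, size=n)
    y = gamma * x + rng.normal(scale=sigma, size=n)
    mcb = 0.0
    for b in np.array_split(np.argsort(x), m):     # bins by the order statistics of the x_i
        xb, yb = x[b], y[b]
        s_k = 0.5 * ((xb - yb) ** 2).mean()        # local average score
        ent_k = 0.5 * yb.var()                     # local empirical entropy
        beta_k = -yb.var() / (2 * len(b))          # estimated local bias, -s_k^2/(2 n_k)
        mcb += (s_k - ent_k + (beta_k if corrected else 0.0)) / m
    return mcb

reps = 500
raw = np.mean([global_mcb(False) for _ in range(reps)])
cor = np.mean([global_mcb(True) for _ in range(reps)])
# the correction should remove most of the upward bias of the raw estimate
print(raw, cor)
```

With n_k = 5 the raw global miscalibration carries an upward bias of roughly σ^2/(2n_k) = 0·1, while the corrected average is close to zero, mirroring the bias reduction reported above.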

DISCUSSION
Average scores are widely used to assess forecasts. Conditioning on a third variable yields a decomposition of the average score into local calibration and entropy terms permitting refined assessments. For example, graphical displays of the local decomposition terms can serve as diagnostic tools. The components of the empirical decomposition suffer from a systematic, potentially serious bias. Interestingly, Bentzien & Friederichs (2014) found that these components depend less on the binning when corrected for bias. Our bias correction works binwise, but it can also be used to reduce the bias of the global decomposition terms, where the bias can considerably exceed the dispersion. Throughout the paper it was assumed that the data triplets (x i , y i , w i ) are independent and identically distributed. In fact, the proof of Theorem 1 only requires such an assumption conditionally given W, which is weaker.