Valid sequential inference on probability forecast performance

Probability forecasts for binary events play a central role in many applications. Their quality is commonly assessed with proper scoring rules, which assign forecasts a numerical score such that a correct forecast achieves a minimal expected score. In this paper, we construct e-values for testing the statistical significance of score differences of competing forecasts in sequential settings. E-values have been proposed as an alternative to p-values for hypothesis testing, and they can easily be transformed into conservative p-values by taking the multiplicative inverse. The e-values proposed in this article are valid in finite samples without any assumptions on the data generating processes. They also allow optional stopping, so a forecast user may decide to interrupt evaluation taking into account the available data at any time and still draw statistically valid inference, which is generally not true for classical p-value based tests. In a case study on postprocessing of precipitation forecasts, state-of-the-art forecast dominance tests and e-values lead to the same conclusions.


Introduction
Consider a forecast user who compares probability predictions p t , q t ∈ [0, 1], t ∈ N, for a binary event Y t+h ∈ {0, 1}, where h ≥ 1 is the time lag between the forecasts and observations. At time t, the forecasts p t , q t as well as any predictions and observations before t are known. This setting encompasses many practical situations such as probability of precipitation forecasts h days ahead or predictions of negative economic growth in the next quarter. The forecast user wants to draw conclusions on the relative performance of p t and q t , that is, identify the better of the two forecasts.
Probability forecasts for binary events are arguably the simplest and best understood type of probabilistic forecasts; see Winkler (1996) for an early overview and more recent reviews in Ranjan and Gneiting (2010) and Lai et al. (2011). The key requirements for probability forecasts are calibration, meaning that events with a predicted probability of p should occur at a frequency of p, and sharpness, which requires the forecast probabilities to be as informative as possible, i.e. close to 0 or 1. These properties are simultaneously assessed with proper scoring rules, which coincide with consistent scoring functions for the mean (Gneiting, 2011) in the case of probability forecasts and will simply be referred to as scoring functions in this article. A scoring function S = S(p, y) maps a forecast probability p and an observation y to a numerical score, with smaller scores indicating a better forecast. More precisely, S satisfies

E π {S(π, Y )} ≤ E π {S(p, Y )} for all p, π ∈ [0, 1], (1)

where E π (·) denotes the expected value under the assumption that Y = 1 with probability π. That is, the true event probability attains a minimal expected score, and S is strictly consistent if equality in (1) holds only for p = π. Well-known examples are the Brier score (y − p)² and the logarithmic score −log(|1 − y − p|). For comparing the predictions p t and q t , the forecast user would therefore collect a sample (y t+h , p t , q t ), t = 1, . . . , T , and compute the empirical score difference

T^{−1} Σ_{t=1}^{T} {S(p t , y t+h ) − S(q t , y t+h )}.

To take the sampling uncertainty into account, such score differences are accompanied by p-values indicating whether the mean score difference significantly differs from zero.
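As a concrete illustration, the following sketch computes the two example scores and an empirical mean score difference on simulated data; the data generating setup and all numerical values are hypothetical and serve only to fix ideas.

```python
import numpy as np

def brier_score(p, y):
    """Brier score (y - p)^2; smaller is better."""
    return (y - p) ** 2

def log_score(p, y):
    """Logarithmic score -log(|1 - y - p|): -log(p) if y = 1, -log(1 - p) if y = 0."""
    return -np.log(np.abs(1 - y - p))

def mean_score_diff(score, p, q, y):
    """Empirical mean score difference; negative values favour forecast p."""
    return np.mean(score(p, y) - score(q, y))

rng = np.random.default_rng(1)
pi = rng.uniform(size=1000)                                   # true event probabilities
y = rng.binomial(1, pi)                                       # binary observations
p = pi                                                        # an ideal forecaster
q = np.clip(pi + rng.normal(0, 0.1, size=1000), 0.01, 0.99)   # a noisy competitor

d = mean_score_diff(brier_score, p, q, y)   # negative: p outperforms q on average
```

Such a raw score difference says nothing about sampling uncertainty; quantifying that uncertainty is precisely the role of the p-values and e-values discussed in this article.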
If the observations are not independent, as is usual in sequential settings, a number of asymptotic tests are available to compute p-values, with prominent ones being the Diebold-Mariano test (Diebold and Mariano, 1995) and the test of conditional predictive ability by Giacomini and White (2006). Further examples are the martingale-based approaches by Seillier-Moiseiwitsch and Dawid (1993) or Lai et al. (2011), and more recent tests of forecast dominance (Ehm and Krüger, 2018; Yen and Yen, 2021).
In this article, we expand the tools for drawing inference on probability forecast performance by e-values. E-values, with 'e' referring to 'expectation', have been introduced as an alternative to p-values for testing. The term e-value was first used in the literature by Vovk and Wang (2021), but the concept also appears in Shafer (2021), under the name 'betting score', and in Grünwald et al. (2020); see also the series of working papers on http://alrw.net/e/. In brief, an e-value is a random variable E ≥ 0 satisfying E(E) ≤ 1 under a given null hypothesis. By Markov's inequality, this implies P(E > 1/α) ≤ α for any α ∈ (0, 1), i.e. large realizations of an e-value can be considered as evidence against the null hypothesis, and the value 1/E is a conservative p-value. A main motivation for using e-values instead of p-values, explained in more detail in Shafer (2021), Grünwald et al. (2020) and Wang and Ramdas (2020), is their simple behaviour under combinations. The arithmetic average of e-values is again an e-value, and so is the product of independent or sequential e-values. E-values also have advantages over p-values with respect to false discovery rate control (Wang and Ramdas, 2020), which may be beneficial for the comparison of forecasts over many locations such as a fine latitude-longitude grid around the globe. The central property for this article is that e-values are valid under optional stopping and continuation, that is, the collection of data for computing an e-value may be stopped or continued based on seeing the past observations and e-values. It is well known that p-values in general do not satisfy these properties.
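These definitions are easy to verify numerically. The sketch below uses a simple Bernoulli null and alternative (both probabilities are hypothetical choices for illustration): a likelihood ratio is an e-value for a simple null, its realizations average to one under the null, products and averages of e-values are again e-values, and Markov's inequality turns 1/E into a conservative p-value.

```python
import numpy as np

def e_value(y, null_p=0.5, alt_p=0.7):
    """Likelihood-ratio e-value for the simple null Y ~ Bernoulli(null_p)
    against the alternative Y ~ Bernoulli(alt_p); E(E) = 1 under the null."""
    return alt_p / null_p if y == 1 else (1 - alt_p) / (1 - null_p)

rng = np.random.default_rng(0)

# Under the null, the sample mean of the e-values is close to 1:
ys = rng.binomial(1, 0.5, size=100_000)
es = np.where(ys == 1, e_value(1), e_value(0))
mean_e = es.mean()

# The product of sequential e-values is again an e-value, and so is the
# arithmetic average of e-values:
e_prod = np.prod(es[:10])
e_avg = es[:100].mean()

# By Markov's inequality, P(E > 1/alpha) <= alpha, so 1/E is a conservative p-value:
p_conservative = min(1.0, 1.0 / e_prod)
```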
Our main contribution is the result that for any scoring rule S and forecasts p, q for Y ∈ {0, 1}, there exists an e-value which satisfies E π (E) ≤ 1 if and only if E π {S(p, Y ) − S(q, Y )} ≤ 0. This e-value allows one to draw inference on the relative performance of the forecasts p and q with respect to S with only a single observation. In a sequential setting, e-values from different time points can be merged by products into a nonnegative supermartingale, or test martingale. This gives a statistical test of forecast dominance which is valid in finite samples without any further assumptions on the data generating process. Moreover, the constructed e-values are valid under optional stopping, so a forecast user may decide to continue or stop forecast comparison based on only a part of the data. These advantages are inherent to any e-value, but we believe that they make e-values a particularly attractive tool in sequential forecast evaluation. The above-mentioned tests for comparing probability forecasts are all only asymptotically valid, and the underlying assumptions are often difficult or impossible to verify. In the case of tests with asymptotic normality, the selection of the variance estimator for the test statistic may have a dramatic impact on the test validity (see for example Lazarus et al., 2018, Table 1). More serious is the problem of optional stopping. In a simple but realistic simulation example in this article, we demonstrate that commonly used tests for forecast superiority at the level of 0.05 may yield rejection rates of up to 0.15 under optional stopping, grossly misleading and invalidating statistical inference.
Although statisticians and practitioners should know that the sample size for classical tests must be determined in advance, we believe that optional stopping is quite common in forecast evaluation, where data arrives sequentially and it might be tempting to stop, or continue, an expensive or time consuming experiment upon seeing enough, or just not enough, evidence against a hypothesis. Moreover, optional continuation may also occur implicitly in the analysis of past datasets, in that methods are often first evaluated on a smaller, manageable part of the data and the analysis is continued only if the results are promising. Last but not least, even for a statistician fully aware of the problem of optional stopping, it may be desirable to have a tool that allows stopping an evaluation once enough evidence has been collected, without having to worry about the implications for inference.
The advantages of e-values for forecast comparison relative to the currently available methods come at a price, namely, lower power. This is well known not only for e-values, but is a general phenomenon when tools for anytime-valid inference are compared to methods for inference with a fixed sample size; see for example Figure 1 in Waudby-Smith and Ramdas (2021), which displays the widths of time-uniform and fixed-time confidence intervals for a mean. However, in the case study in this article, p-values from classical tests and e-values lead to qualitatively the same results.

Scoring functions for probabilities
Throughout the article, E Q (·) denotes the expected value of the quantity in parentheses under the probability distribution Q. If the measure Q is the probability π ∈ [0, 1] of a binary event, we simply write E π (·).
When comparing probability forecasts with scoring functions, the choice of the scoring function plays a crucial role. While (1) guarantees that the true event probability always achieves a minimal expected score, different scoring functions may yield different rankings when misspecified forecasts are compared (Patton, 2020). This problem can be avoided by basing forecast comparison on several or all scoring rules simultaneously. For probabilities of binary events, under mild regularity conditions stated in Gneiting et al. (2007, Theorem 2.3), all consistent scoring functions are of the form

S(p, y) = ∫_(0,1) S θ (p, y) ν(dθ), (2)

where ν is a locally finite Borel measure on (0, 1) and S θ denotes the elementary score

S θ (p, y) = (1 − θ) 1{y = 1, p ≤ θ} + θ 1{y = 0, p > θ}. (3)

In the equation above, 1 denotes the indicator function. This representation originally dates back to Schervish (1989); see also Ehm et al. (2016). The scoring function S is strictly consistent if and only if ν assigns positive mass to all non-degenerate intervals in (0, 1).
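This representation is easy to check numerically. The sketch below assumes the standard Schervish form of the elementary score, which charges cost 1 − θ to a miss (y = 1 but p ≤ θ) and cost θ to a false alarm (y = 0 but p > θ); integrating it against the constant mixing density h(θ) = 2 then recovers the Brier score.

```python
import numpy as np

def elementary_score(theta, p, y):
    """Elementary score S_theta(p, y): cost 1 - theta if y = 1 but p <= theta
    (a 'miss'), cost theta if y = 0 but p > theta (a 'false alarm')."""
    return (1 - theta) * y * (p <= theta) + theta * (1 - y) * (p > theta)

def mixture_score(p, y, density, n=200_000):
    """Midpoint-rule approximation of S(p, y) = int_(0,1) S_theta(p, y) h(theta) dtheta."""
    theta = (np.arange(n) + 0.5) / n
    return np.mean(elementary_score(theta, p, y) * density(theta))

# Constant density h(theta) = 2 yields the Brier score (y - p)^2:
s1 = mixture_score(0.3, 1, lambda t: 2.0 * np.ones_like(t))   # ~ (1 - 0.3)^2 = 0.49
s0 = mixture_score(0.3, 0, lambda t: 2.0 * np.ones_like(t))   # ~ (0 - 0.3)^2 = 0.09
```

Other mixing densities generate other consistent scoring functions in the same way, which is what makes the elementary scores a convenient basis for testing dominance with respect to all scores at once.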

Forecast dominance and hypotheses
Let (Ω, F, Q) be a probability space with a filtration F t , t ∈ N. We assume that the competing forecasts p t , q t and the observation Y t form a random vector (Y t , p t , q t ) adapted to F t , and that (p t , q t ) are forecasts for Y t+h for some integer lag h ≥ 1. The measure Q describes the joint dynamics of the forecasts and the observations. When comparing forecasts using a given scoring function S, the quantity of interest is often not the unconditional expected score difference E Q {S(p t , Y t+h ) − S(q t , Y t+h )}, which describes the average relative performance of p t and q t . More interesting is the question whether, given the information F t at the time of forecasting, the conditional event probability is closer to p t than to q t , i.e. whether

E Q {S(p t , Y t+h ) | F t } ≤ E Q {S(q t , Y t+h ) | F t } almost surely.

This notion of forecast dominance is called conditional forecast dominance and has been introduced by Giacomini and White (2006).
The definition of forecast dominance used here does not require knowledge about the processes generating (Y t , p t , q t ), which are often unknown or not well enough understood to formulate a suitable stochastic model. The relative performance of the forecasts p t , q t is governed by the underlying distribution Q, and hypotheses about forecast dominance are hypotheses about the data generating process. Denoting by P the set of probability measures on (Ω, F), we will construct tests for the following hypotheses:

H S;c = {Q ∈ P : E Q [c t {S(p t , Y t+h ) − S(q t , Y t+h )} | F t ] ≤ 0 for all t ∈ N}, (4)

H c = {Q ∈ P : E Q [c t {S θ (p t , Y t+h ) − S θ (q t , Y t+h )} | F t ] ≤ 0 for all t ∈ N and all θ ∈ (0, 1)}. (5)

Here, (c t ) t∈N is a sequence of F t -measurable random variables c t ∈ {0, 1}. If c t = 1 for all t, we write H S;c = H S and H c = H. In this case, Hypothesis (4) states that at all times t, forecast p t is at least as good as q t under the scoring rule S, given the information available at the time of forecasting. Hypothesis (5) is stronger and states that p t is preferred over q t under all elementary scores (3); it corresponds to what is denoted by H s − in Ehm and Krüger (2018, formula (2.5)). Recently, hypotheses of the type of H or H S have been put into question by Zhu and Timmermann (2020), who demonstrate that the null hypothesis of equal conditional predictive accuracy is essentially never satisfied in realistic settings. Their criticism does not directly apply to one-sided hypotheses, but we emphasize that the null hypotheses H S and H are rather strong in that they require conditional dominance at all time points. Tests for these hypotheses are therefore most suitable for the comparison of a new method to an established benchmark or state-of-the-art method, where rejecting the null means that the new method outperforms the benchmark at least in some situations, a minimal requirement.
The classical example for a situation with P ∈ H is p t = P(Y t+h = 1 | F t ), i.e. p t is the ideal forecast in the sense of Gneiting and Ranjan (2013). For the hypotheses H S , one may easily construct situations with dominance relations also among non-calibrated forecasts; see the simulation examples in Section 4.
In many practical situations, it cannot be expected that one forecast method always outperforms another, and forecast users are rather interested in the question under what conditions a particular forecast should be preferred. Choosing the sequence (c t ) t∈N such that c t = 1 if the condition holds and c t = 0 otherwise allows one to formalize this question. Here the variables c t must be F t -measurable, that is, known at the time of forecasting. In practice this is not a severe limitation, since the information that one forecast is more accurate than another under a given condition is only useful if this condition is known at the time of forecasting, and not ex post. But also from a theoretical point of view, forecast evaluation should only be conditioned on the forecasts themselves, and not on the observations or on information not available at the time of forecasting; see Lerch et al. (2017) for a detailed analysis of this issue in the case of extreme events.

One-period setting
We first construct e-values for the comparison of probability forecasts in a one-period setting where Y = 1 with probability π and the forecasts p, q are assumed to be fixed numbers in (0, 1). These e-values give an absolute and valid interpretation of predictive performance with only a single observation, e.g. for a single time point in the sequential setting of Section 2.2, or in binary classification problems with independent forecast-observation pairs, where the competing forecasts are based on covariates and π is the probability that Y = 1 conditional on the covariate values. The null hypotheses that p is a better forecast than q with respect to a given score S, or with respect to all scoring functions simultaneously, here correspond to sets of admissible values for π. In the case p < q, the null hypothesis for the score S is the interval [0, κ ν {[p, q)}], where

κ ν {[p, q)} = ∫_[p,q) θ ν(dθ) / ν{[p, q)}

is the ν-weighted mean of the interval [p, q). The stronger null hypothesis H is the intersection of these intervals over all mixing measures ν, that is, [0, p]. In the case q < p, the intervals take the form [κ ν {[q, p)}, 1] or [p, 1], respectively. For a set P of probability measures and disjoint H, H′ ⊂ P, we say that an e-value E has null hypothesis H and alternative H′ if E P (E) ≤ 1 for all P ∈ H and E Q (E) > 1 for all Q ∈ H′. The following theorem characterizes e-values for testing H S .
Theorem 3.1. Let S be a consistent scoring function and p, q ∈ (0, 1), p ≠ q. Assume that the mixing measure ν of S satisfies ν{[min(p, q), max(p, q))} > 0, and write κ = κ ν {[min(p, q), max(p, q))}. Then a function E = E(y) is an e-value with null hypothesis H S and alternative [0, 1] \ H S if and only if, for some λ ∈ (0, 1],

E p,q;λ (y) = 1 − λ + λ y/κ if p < q, and E p,q;λ (y) = 1 − λ + λ (1 − y)/(1 − κ) if p > q. (6)

Theorem 3.1 gives a family of e-values for testing forecast dominance with a given score S, and in a next step, we tune the parameter λ in (6) such that the corresponding e-value has maximal 'power' against a given alternative. The notion of power for e-values differs from the classical power of p-value based tests, and it is motivated in detail by Shafer (2021) and Grünwald et al. (2020). An e-value can be interpreted as a bet against the null hypothesis, and a product ∏_{t=1}^{T} E t of e-values represents the accumulated capital at time T if the initial capital is 1 and all money is invested in the bet at each step. Maximizing the gains is equivalent to maximizing the 'growth rate' E{log(E)}, a strategy which is sometimes called Kelly betting, in reference to Kelly Jr (1956). If an e-value maximizes E P {log(E)} under a measure P representing an alternative hypothesis, it is called growth rate optimal or simply GROW (Grünwald et al., 2020). One such alternative could be that Y = 1 with probability q, but one can maximize the power under any other alternative π 1 ∈ [0, 1] \ H S .

Theorem 3.2. Under the assumptions of Theorem 3.1, for any π 1 ∈ [0, 1] \ H S , the growth rate E π 1 {log(E p,q;λ )} is maximized by λ = (π 1 − κ)/(1 − κ) if p < q and by λ = (κ − π 1 )/κ if p > q. The corresponding e-value equals

E π 1 p,q (y) = π 1 /κ if y = 1, and E π 1 p,q (y) = (1 − π 1 )/(1 − κ) if y = 0.

Theorem 3.2 shows that the GROW e-values for the comparison of probability forecasts take the form of likelihood ratios with the alternative probability in the numerator and the integral of the mixing measure ν (suitably normalized) over the interval [min(p, q), max(p, q)) in the denominator. It is possible to obtain this result directly by applying Theorem 1 in Grünwald et al. (2020), since κ ν {[min(p, q), max(p, q))} is the boundary of the null hypothesis H S .
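For the Brier score, the normalized mixing-measure integral reduces to κ = (p + q)/2, so the GROW e-value can be written down explicitly. The following sketch (the forecast values and the alternative π1 are hypothetical) also verifies numerically that the expected e-value equals 1 at the boundary of the null hypothesis and exceeds 1 under the alternative.

```python
def grow_e_value_brier(p, q, pi1, y):
    """GROW e-value for the null 'p is at least as good as q under the Brier
    score': a likelihood ratio with the alternative pi1 in the numerator and
    kappa = (p + q) / 2, the Brier-score boundary, in the denominator."""
    kappa = (p + q) / 2
    return pi1 / kappa if y == 1 else (1 - pi1) / (1 - kappa)

p, q, pi1 = 0.2, 0.6, 0.5          # hypothetical forecasts; pi1 > kappa = 0.4
kappa = (p + q) / 2

# Expected e-value at the boundary of the null (true probability = kappa): exactly 1.
e_boundary = (kappa * grow_e_value_brier(p, q, pi1, 1)
              + (1 - kappa) * grow_e_value_brier(p, q, pi1, 0))

# Expected e-value under the alternative (true probability = pi1): greater than 1.
e_alt = (pi1 * grow_e_value_brier(p, q, pi1, 1)
         + (1 - pi1) * grow_e_value_brier(p, q, pi1, 0))
```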
We chose the indirect but more instructive approach via Theorem 3.1 since, to the best of our knowledge, this is the first application of e-values to forecast comparison, and similar approaches might be used to construct e-values for score differences in more general settings than the evaluation of binary event forecasts. In fact, Waudby-Smith and Ramdas (2021, Proposition 2) obtain a representation similar to (6) for e-values for testing hypotheses about a constant mean.
For the test of the null hypothesis H, applying Grünwald et al. (2020, Theorem 1) shows that the GROW e-value is a likelihood ratio.

Theorem 3.3. Let p, q ∈ (0, 1). Then the GROW e-value with null hypothesis H and alternative hypothesis that Y = 1 with probability π 1 ∈ [0, 1] \ H is given by

E(y) = π 1 /p if y = 1, and E(y) = (1 − π 1 )/(1 − p) if y = 0.

Table 1: Commonly used scoring rules and the corresponding denominator in the GROW e-value under the assumption p < q. The case p > q is obtained by interchanging the roles of p and q. The mixing measure ν is given in the form of its Lebesgue density h(θ), θ ∈ (0, 1). For the spherical score, ‖p‖ := (2p² − 2p + 1)^{1/2} denotes the Euclidean norm of the vector (p, 1 − p).
In testing with e-values, the GROW e-value for testing the point null hypothesis {p} against the alternative π 1 is exactly the likelihood ratio, and Theorem 3.3 states that this is equivalent to testing forecast dominance with respect to all scoring functions. Dominance with respect to all scoring functions is a very strong requirement on p, since the null hypothesis is false as soon as the true probability π is on the same side of p as q, that is, in (p, 1] for p < q or in [0, p) for q < p, and the choice of π 1 is restricted to these sets. Unlike the e-values E π 1 p,q , the e-value of Theorem 3.3 does not depend directly on q, but only indirectly via the admissible values for π 1 .
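A minimal sketch of the Theorem 3.3 e-value (the numerical values are hypothetical); the check confirms that the expectation equals 1 exactly at the boundary of the null hypothesis, where the true probability equals p.

```python
def grow_e_value_all_scores(p, pi1, y):
    """GROW e-value for the null that p dominates under ALL scoring functions
    (Theorem 3.3): a plain likelihood ratio of the alternative pi1 against p."""
    return pi1 / p if y == 1 else (1 - pi1) / (1 - p)

# At the boundary of the null (true probability = p) the expectation is exactly 1:
p, pi1 = 0.3, 0.5                   # hypothetical values; pi1 must lie outside the null
e_boundary = (p * grow_e_value_all_scores(p, pi1, 1)
              + (1 - p) * grow_e_value_all_scores(p, pi1, 0))
```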

Sequential inference
We now turn to the sequential model with observations Y t and forecasts p t , q t defined on a probability space (Ω, F, Q) with a filtration F t , t ∈ N. In the case h = 1, for any Q ∈ H S;c and any adapted sequence λ t ∈ [0, 1], t ∈ N, with E p t ,q t ;λ t as defined in (6),

E Q {E p t ,q t ;λ t (Y t+1 ) | F t } ≤ 1.

If c t = 0, then there is no hypothesis about p t and q t . For these cases, the definition at (6) may be extended to λ = 0, so that E p t ,q t ;0 ≡ 1. Iterating this argument shows that the product ∏_{t=1}^{T −1} E p t ,q t ;λ t (Y t+1 ) is an e-value for H S;c ; more precisely, the process e t = ∏_{j=1}^{t−1} E p j ,q j ;λ j (Y j+1 ), t = 2, 3, . . . , is a non-negative supermartingale with respect to (F t ) t∈N . For a general lag h, sequential conditioning at time steps of 1 is not possible, and one option is to average the products of all e-values with a time difference of h, in the spirit of the U-statistics merging functions suggested by Vovk and Wang (2021). We summarize this in the following proposition.
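For h = 1 the running product can be computed directly by a cumulative product. The sketch below uses the Brier score with hypothetical forecast, alternative and observation sequences; the one-period GROW e-value is restated so the block is self-contained.

```python
import numpy as np

def brier_grow_e(p, q, eta, y):
    """One-period GROW e-value under the Brier score with alternative eta;
    kappa = (p + q) / 2 is the boundary of the null 'p at least as good as q'."""
    kappa = (p + q) / 2
    return eta / kappa if y == 1 else (1 - eta) / (1 - kappa)

def e_process(p, q, eta, y):
    """Running product of one-period e-values (lag h = 1); under the null this
    is a non-negative supermartingale started at 1."""
    factors = [brier_grow_e(pt, qt, et, yt) for pt, qt, et, yt in zip(p, q, eta, y)]
    return np.cumprod(factors)

# A tiny worked example with hypothetical values:
e = e_process(p=[0.2, 0.2, 0.2], q=[0.6, 0.6, 0.6], eta=[0.9, 0.9, 0.9], y=[1, 1, 0])
# factors: 0.9/0.4 = 2.25, 2.25, 0.1/0.6 -> e = [2.25, 5.0625, 0.84375]
```

A wrong observation shrinks the capital again, which is the betting interpretation of the supermartingale property.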
Proposition 3.4. Let (Y t , p t , q t , c t , λ t ), t ∈ N, be defined on a measurable space (Ω, F) and adapted to the filtration F t , t ∈ N, and assume λ t = 0 if c t = 0. Let further S be a strictly consistent scoring function. Then for all T ≥ h + 1, the averages

e T = h^{−1} Σ_{i=1}^{h} ∏ {E p t ,q t ;λ t (Y t+h ) : t ∈ {i, i + h, i + 2h, . . .}, t ≤ T − h}

are F T -measurable and are e-values under H S;c .
Proposition 3.4 is analogous to Theorem 3.1 in the sense that it only characterizes possible e-values for testing forecast dominance; the parameters λ t could be any adapted sequence (λ t ) t∈N ⊂ [0, 1]. E-values for dominance testing under the conditions (c t ) t∈N are obtained by setting all e-values where the condition is not satisfied to 1. The forecast user may, and in fact has to, tune the (λ t ) t∈N in order to attain good power against a given alternative. Recall that at any t, the parameter λ t may be a function of all the forecasts and observations before time t. Instead of the parameters λ t , it is usually more intuitive to think of an alternative probability η t for the event Y t+h = 1, and then directly use the GROW e-values E η t p t ,q t constructed in Theorem 3.2. In that respect, testing forecast dominance with e-values differs from p-value based tests of a zero score difference, which do not require the user to explicitly specify an alternative hypothesis. In the applications in Sections 4 and 5, we give guidance on the selection of alternative hypotheses, and show that reasonable power can be attained with simple heuristic methods.
As a side remark, choosing an alternative hypothesis for e-values in sequential forecast dominance testing is similar to the conditional predictive ability tests by Giacomini and White (2006), where F t -measurable test functions are used to weight score differences and improve power. While the selection of the test functions in the Giacomini-White test is delicate, because they may have an impact on the variance estimates and the finite-sample validity of the tests, e-values remain valid under any choice of adapted weights (λ t ) t∈N .
Our last theoretical result states that the e-values e T constructed above are also valid when T is replaced by a stopping time τ . For h = 1, this is a consequence of the fact that (e t ) t≥2 is a non-negative supermartingale. To understand validity under optional stopping intuitively, recall that at time t, the forecast user has to determine the parameter λ t in the e-value E p t ,q t ;λ t (Y t+h ). Optional stopping at t 0 corresponds to setting λ t ≡ 0, or equivalently E p t ,q t ;λ t (Y t+h ) ≡ 1, for t ≥ t 0 , i.e. ignoring all observations from time t 0 + h onwards. In the case h = 1, this allows the forecast user to stop evaluation at any time, since λ t in E p t ,q t ;λ t (Y t+1 ) is determined at the same time as Y t is observed. However, when h > 1, the coefficients λ t in E p t ,q t ;λ t (Y t+h ) for t = t 0 − h + 1, . . . , t 0 − 1 have already been determined in the past and may not be set to zero at t 0 , since the (λ t ) t∈N must be adapted. This implies that the stopped e-value depends on the unknown future observations Y t 0 +1 , . . . , Y t 0 +h−1 , so it is not deterministic at time t 0 .
In the case h = 1, optional stopping is a powerful strategy when the goal is to assess forecast superiority at a significance level α ∈ (0, 1), since the stopping time τ α = min{T, inf(t ≥ 2 : e t ≥ 1/α)} allows one to reject the null hypothesis as soon as the sequential e-value e t exceeds 1/α. If h > 1, one may similarly define a stopping time τ α;h which guarantees that when stopping at t 0 , the level 1/α is exceeded no matter what values Y t 0 +1 , . . . , Y t 0 +h−1 take; see Appendix A.2. Instead of specifying a significance level α in advance, one may as well transform the sequence (e t ) t∈N into so-called anytime-valid p-values, which are valid simultaneously for all t 0 ≥ h + 1. For h = 1, an anytime-valid p-value is given by p t 0 = min{1, min s=2,...,t 0 1/e s }.
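Both devices are straightforward to implement for h = 1. In the sketch below the e-process values are hypothetical numbers chosen for illustration.

```python
import numpy as np

def stopping_index(e_process, alpha=0.05):
    """Index of the first e-value >= 1/alpha (the stopping time tau_alpha for
    h = 1), or None if the threshold is never reached."""
    hits = np.nonzero(np.asarray(e_process) >= 1 / alpha)[0]
    return int(hits[0]) if hits.size else None

def anytime_p_values(e_process):
    """Anytime-valid p-values p_t = min(1, min_{s <= t} 1 / e_s) for h = 1."""
    running_max = np.maximum.accumulate(np.asarray(e_process, dtype=float))
    return np.minimum(1.0, 1.0 / running_max)

e = [0.8, 1.5, 4.0, 12.0, 25.0, 18.0]     # a hypothetical e-process
tau = stopping_index(e)                   # index 4, where e first reaches 1/0.05 = 20
p = anytime_p_values(e)                   # nonincreasing; p[4] = 1/25 = 0.04
```

Note that the anytime-valid p-values are nonincreasing by construction: evidence against the null, once collected, is never given back.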
For h > 1, one may apply the same correction as in the stopping time τ α;h .

Simulation examples

In the first example, for varying µ ∈ (0, 1), we simulate independent forecasts p t , q t ∼ Unif(0, 1), define π t = µq t + (1 − µ)p t , and generate independent Bernoulli observations Y t+1 with mean π t conditional on p t , q t . This represents a situation where forecasters only have access to partial information and neither forecast is calibrated, i.e. P(Y t+1 = 1 | p t ) ≠ p t and P(Y t+1 = 1 | q t ) ≠ q t . We choose S to be the Brier score, so that p t outperforms q t if and only if π t ∈ [0, (p t +q t )/2] if p t < q t or π t ∈ [(p t +q t )/2, 1] if p t > q t , i.e. if and only if µ ≤ 0.5. When µ > 0.5, the GROW e-value is obtained by choosing π t as the alternative hypothesis probability, but in practice, π t is not known. The forecast user might assume that the true probability of Y t+1 = 1 lies somewhere in between (p t +q t )/2 and q t , and choose a convex mixture η t (ξ) = ξ(p t + q t )/2 + (1 − ξ)q t with some ξ ∈ (0, 1) as the alternative. Proposition 3.4 implies that for k ∈ N and ξ 1 , . . . , ξ k ∈ (0, 1), the average of the e-values computed under the alternatives η t (ξ 1 ), . . . , η t (ξ k ) is again an e-value. In Figure 1, we compare the rejection rates at the 5% level, corresponding to e-values greater than or equal to 20, when the ξ j are k equispaced weights in (0, 1) for k = 1 and k = 5, i.e. ξ 1 = 0.5 if k = 1 and ξ l = l/6, l = 1, . . . , 5, in the case k = 5. We computed both the unstopped e-value e T and the stopped variant e τ 0.05 , as well as the e-values under the alternatives η t = π t and η t = q t . The rejection rates are compared to those of one-sided t-tests of the null hypothesis that the mean Brier score difference is non-positive. Additionally, we show the rejection rates when the p-value is used for optional stopping at given time points upon seeing a significant difference. Our simulations illustrate the known fact that classical statistical tests are not valid under optional stopping.
At the boundary of the null hypothesis, the rejection rate of the t-test amounts to 0.12 for T = 600 and optional stops at times 150, 300 and 450; given the number of optional stops, this phenomenon occurs independently of the sample size. As for the e-values, stopping (e τ 0.05 ) is always a more powerful, yet still valid, strategy compared to the unstopped e-value e T . While the heuristic alternatives achieve a power close to the power under the correct alternative hypothesis, the misspecified alternative η t = q t is clearly weaker. Interestingly, the correct alternative η t = π t has lower power than the heuristic alternatives close to the boundary of the null hypothesis. This is not an error: specifying η t = π t yields the optimal growth rate for the e-value, but this does not necessarily give optimal power for the stopped e-value at the threshold 1/α = 20 in finite samples. The t-test generally achieves higher power than the e-values, which is to be expected given that the e-values require no assumptions on the data generating process and remain valid under optional stopping; see also Waudby-Smith and Ramdas (2021).
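The inflation of the type-I error under optional stopping is easy to reproduce in miniature. The following sketch runs the first simulation example at the boundary µ = 0.5; the sample size, peeking times, and number of runs are hypothetical choices, a normal approximation stands in for the exact t-test, and the heuristic alternative uses ξ = 0.5.

```python
import numpy as np

rng = np.random.default_rng(7)

def one_run(T=600, mu=0.5, stops=(150, 300, 450, 600), alpha=0.05):
    """One comparison at the boundary of the null (mu = 0.5). Returns whether
    (i) a repeatedly peeked one-sided test and (ii) the stopped e-value reject."""
    p, q = rng.uniform(size=T), rng.uniform(size=T)
    pi = mu * q + (1 - mu) * p
    y = rng.binomial(1, pi)
    d = (y - p) ** 2 - (y - q) ** 2                    # Brier score differences
    # Classical test, peeking at the interim sample sizes (normal approximation):
    z = [d[:s].mean() / (d[:s].std(ddof=1) / np.sqrt(s)) for s in stops]
    test_reject = any(zi > 1.645 for zi in z)
    # Stopped e-value with heuristic alternative eta = 0.5 * kappa + 0.5 * q:
    kappa = (p + q) / 2
    eta = 0.5 * kappa + 0.5 * q
    factors = np.where(y == 1, eta / kappa, (1 - eta) / (1 - kappa))
    e_reject = bool(np.any(np.cumprod(factors) >= 1 / alpha))
    return test_reject, e_reject

results = np.array([one_run() for _ in range(500)])
test_rate, e_rate = results.mean(axis=0)
# The peeked classical test rejects well above the nominal 5% level,
# while the stopped e-value's rejection rate stays below it.
```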
The probability π t;h corresponds to the ideal forecast at lag h. We compare q t;h = π t;h and p t;h = π t;h+1 for lags h = 1, . . . , 3, so that q t;h always outperforms p t;h . With decreasing parameter θ, serial dependence decreases and the forecast skill of p t;h and q t;h becomes similar. The alternative hypothesis for the e-values is the correct alternative η t;h = q t;h , so that the effect of a higher lag can be analyzed in isolation from the question of how to choose the alternative hypothesis. Rejection rates are compared to the Diebold-Mariano test at the 5% level. Figure 2 shows the rejection rates depending on the parameter θ for different sample sizes T . The e-values use the stopping time τ 0.05 for lag 1 and τ 0.05;h for lags h = 2 and h = 3. As in the previous simulations, the power of the e-values is below that of the p-value based tests for the lag 1 forecasts, where the Diebold-Mariano test essentially coincides with the t-test. For lags 2 and 3, this difference increases, since the combination method for e-values becomes less powerful. With increasing lag, the rejection rates of both methods decrease, but the difference to lag 1 is smaller for the Diebold-Mariano test than for the e-values. In this example, the Diebold-Mariano test is valid because the forecasts are ideal and the data generating process is stationary. For the e-values, validity is guaranteed without such assumptions, which may be of great advantage in applications.

Case study

Henzi et al. (2021) have compared postprocessing methods for precipitation forecasts with lags of 1 to 5 days at the airports of Brussels (BRU), Frankfurt (FRA), London Heathrow (LHR) and Zurich (ZRH). In their case study, probability of precipitation (PoP) forecasts have been evaluated with the Brier score, but no tests for the significance of score differences have been performed. We illustrate here how to apply e-values to probability forecasts, and compare the results to state-of-the-art forecast dominance tests.

Data and methods
A detailed description of the dataset and methods is given in Section 5 of Henzi et al. (2021), and we only summarise the key information here. The dataset covers the period from January 06, 2007 to January 01, 2017, and accounting for missing values, the numbers of available observations are 3406 for Brussels, 3617 for Frankfurt, 2256 for London and 3241 for Zurich airport. Postprocessing is applied to the ensemble forecasts of the European Centre for Medium-Range Weather Forecasts (ECMWF; Molteni et al., 1996; Buizza et al., 2005), which are issued on a latitude-longitude grid and consist of a high resolution forecast, 50 perturbed ensemble forecasts at a lower resolution, and the control run for the perturbed forecasts. In simple terms, ensemble forecasts account for uncertainty by running a numerical weather prediction (NWP) model several times, each time under slightly perturbed initial conditions; each run of the model yields a different forecast, which together form a so-called ensemble (Leutbecher and Palmer, 2008). Ensemble forecasts are usually subject to biases and dispersion errors, which can be corrected by estimating the conditional distribution of the weather variable given the NWP ensemble. This statistical procedure is known as postprocessing of ensemble forecasts (Vannitsem et al., 2018). Henzi et al. (2021) propose isotonic distributional regression (IDR) as a benchmark for such postprocessing methods. IDR estimates conditional distributions nonparametrically and without any tuning parameters. The method is not specifically tailored to forecasting precipitation, and one would expect that a parametric model designed for this purpose gives more precise forecasts. One such method is heteroscedastic censored logistic regression (HCLR; Messner et al., 2014), which assumes that the square root of the precipitation follows a logistic distribution censored at zero. The implementation is as in Henzi et al. (2021).
While the covariates in IDR are only the high resolution forecast, the control forecast, and the ensemble mean, the HCLR model additionally includes a scale parameter depending on the ensemble standard deviation.
Unlike the study in Henzi et al. (2021), which uses an expanding training window for the postprocessing, we estimate both postprocessed forecasts on half of the data for each airport for simplicity, and keep the remaining half for validation.

Hypothesis tests
We illustrate the usage of e-values in the following hypothesis tests. Firstly, we try to reject the null hypothesis that IDR probability of precipitation forecasts are better than the HCLR PoP forecasts with respect to the Brier score. Secondly, we modify HCLR by dropping the scale parameter. It is expected that this variant, denoted by HCLR − in the following, is outperformed by HCLR including the ensemble-dependent scale parameter, and also by IDR, since both IDR and HCLR − assume a monotone relationship between the covariates and the PoP, but the nonparametric IDR can estimate a broader class of functions. Finally, we further investigate the effect of the scale parameter on HCLR predictions for high precipitation. Suppose a weather forecaster issues a warning if the probability that the precipitation exceeds a high threshold is more than 50%. As thresholds, we choose the empirical 90% quantile of precipitation in the training data for each airport. Intuitively, the HCLR model should yield more accurate warnings than HCLR − , because it includes the ensemble standard deviation as an uncertainty measure.
The first and second sets of hypotheses are tested with the Brier score and the corresponding e-values. As alternative probability, we take the convex mixtures η_t = 0.25p_t + 0.75q_t, which have been explored in Section 4, where p_t denotes the forecasting method that is expected to perform better than q_t under the null hypothesis. The hypothesis about the extreme precipitation warnings is a conditional comparison with the conditions c_t = 1{max(p_t, q_t) ≥ 0.5}. For this hypothesis, instead of dominance with respect to the Brier score, we test the stronger hypothesis of forecast dominance with respect to all scoring rules. The rationale is that the forecast dominance hypothesis should be easily rejected if the HCLR model truly issues the better tail forecasts; on the other hand, failing to reject may indicate either that even with ten years of data it is not possible to clearly discriminate the quality of such warnings, or that the ensemble standard deviation does not bring a benefit. For this hypothesis we define η_t = q_t, assuming that the conditional event probabilities should be much closer to the ones issued by HCLR than by HCLR−. Optional stopping is not applied for any of the e-values.
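As a small illustrative sketch (the forecast values are hypothetical; only the formulas for the mixture η_t and the condition c_t come from the text), these quantities can be computed as follows:

```python
import numpy as np

# Hypothetical PoP forecast series from two competing methods;
# p is the method expected to be better under the null hypothesis.
p = np.array([0.10, 0.55, 0.80, 0.30])
q = np.array([0.20, 0.45, 0.60, 0.35])

# Convex mixture used as alternative probability: eta_t = 0.25 p_t + 0.75 q_t
eta = 0.25 * p + 0.75 * q

# Conditioning events for the warning comparison: evaluate only at times
# where at least one forecast triggers a warning, c_t = 1{max(p_t, q_t) >= 0.5}
c = (np.maximum(p, q) >= 0.5).astype(int)
```

The conditional comparison then simply restricts the e-value updates to the time points with c_t = 1.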
For comparison, we also compute p-values for the significance of score differences. The first two hypotheses are tested with one-sided Diebold-Mariano tests (Diebold and Mariano, 1995; see also Giacomini and White, 2006). To estimate the variance of the test statistics, we use the heteroskedasticity and autocorrelation consistent estimator with Bartlett weights; see Lerch et al. (2017, Equation 2.18). For testing dominance of the tail probability forecasts, the test by Yen and Yen (2021) would allow arbitrary forecast lags, but it assumes strict stationarity. Since the sequence c_t selects only particular instances, with possibly strongly varying time gaps in between, stationarity is highly questionable. We therefore apply the dominance test by Ehm and Krüger (2018), which is valid under weaker assumptions but limited to lag 1. Strictly speaking, both the Diebold-Mariano test and the forecast dominance test are valid under larger null hypotheses than the e-values, as they only require the average score difference between p_t and q_t to be nonpositive, whereas the null hypothesis for the e-values asks for conditional superiority at each time point. A comparison is nevertheless interesting, since these two tests represent commonly used methods for testing the significance of score differences. Tables 2 and 3 show the e-values and one-sided p-values for the hypotheses described above, computed separately for each airport and forecast lag. The e-values are not transformed to p-values here. For interpretation, Vovk and Wang (2020, Section 3) suggest a discrete scale on which e-values in (0, 1], (1, 3.16], (3.16, 10], (10, 31.6], (31.6, 100], and (100, ∞) represent no, poor, substantial, strong, very strong, and decisive evidence against the null hypothesis, respectively.
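A minimal sketch of the one-sided Diebold-Mariano test with a Bartlett-weighted HAC variance estimator, as used above, might look as follows (the function name and the choice of truncation lag h − 1 are ours, not taken from the cited references):

```python
import numpy as np
from math import erf, sqrt

def dm_test_one_sided(d, h=1):
    """One-sided Diebold-Mariano test of H0: E[d_t] <= 0, where
    d_t = S(p_t, y_t) - S(q_t, y_t) are score differences.
    The variance is estimated with Bartlett-weighted autocovariances
    up to lag h - 1 (h = forecast lag)."""
    d = np.asarray(d, dtype=float)
    n = len(d)
    dbar = d.mean()
    dc = d - dbar
    # Lag-0 term of the HAC variance estimate
    var = np.sum(dc ** 2) / n
    # Bartlett weights w_k = 1 - k/h for lags k = 1, ..., h - 1
    for k in range(1, h):
        w = 1.0 - k / h
        var += 2.0 * w * np.sum(dc[k:] * dc[:-k]) / n
    t_stat = dbar / sqrt(var / n)
    # One-sided p-value from the standard normal distribution
    p_value = 1.0 - 0.5 * (1.0 + erf(t_stat / sqrt(2.0)))
    return t_stat, p_value
```

A positive mean score difference of p_t relative to q_t then yields a small p-value, i.e. evidence against the hypothesis that p_t is at least as good as q_t.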
E-values larger than 100 are not displayed to improve readability, but an untruncated variant of Table 2 is contained in Appendix C, so that it is possible to update the e-values with more recent data. For all hypotheses, the p-values and e-values largely lead to the same conclusions. HCLR does not outperform IDR for PoP forecasts at lags 1 to 3, but for the airports Brussels and Zurich there is substantial to strong evidence that it achieves lower Brier scores at lags 4 and 5. HCLR− is clearly outperformed by the more complex variant with the ensemble-dependent scale parameter at short lags, and also for the longer lead times there is some evidence that including the scale parameter improves the forecasts, except for London airport. As for the difference between IDR and HCLR−, both the e-values and the p-values suggest that IDR yields the better forecasts at lags 1 to 3, but at lags 4 and 5 there are no rejections of the null hypothesis. Figure 3 shows how the cumulative products of the e-values for the hypothesis tests at lag 1 evolve over time. If the goal was to accumulate strong evidence against the hypotheses, say exceeding the level 10, then the hypothesis that IDR outperforms HCLR− could already be rejected with only 9% or 27% of the data, respectively, which is where the corresponding lines first cross the level 10. For Zurich airport, rejection happens at 85% of the total sample size. Interestingly, in the comparison of HCLR and HCLR− for Brussels, lag 1, the p-value is non-significant (0.07) but the e-value gives decisive evidence (> 100). We attribute this to the different null hypotheses of the tests: the mean difference in Brier score is only 0.001 with an estimated standard deviation of 0.03, giving only little evidence against the null hypothesis of the Diebold-Mariano test. However, the null hypothesis for the e-value is smaller, requiring that HCLR− outperforms HCLR at all time points.
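The discrete evidence scale of Vovk and Wang (2020, Section 3) used for interpretation translates directly into code (a convenience sketch; the function and constant names are ours):

```python
import bisect

# Interval endpoints and labels of the discrete evidence scale of
# Vovk and Wang (2020, Section 3): e-values in (0, 1], (1, 3.16],
# (3.16, 10], (10, 31.6], (31.6, 100], (100, inf) correspond to
# no, poor, substantial, strong, very strong, decisive evidence.
THRESHOLDS = [1.0, 3.16, 10.0, 31.6, 100.0]
LABELS = ["no", "poor", "substantial", "strong", "very strong", "decisive"]

def evidence_category(e_value):
    """Map an e-value to its evidence category against the null
    (intervals are closed on the right, matching the scale above)."""
    return LABELS[bisect.bisect_left(THRESHOLDS, e_value)]
```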
Even if the score differences are only small, evidence eventually accumulates over the whole time period; see the rightmost panel of Figure 3. The fact that the e-values in the comparison HCLR/HCLR− decrease with the forecast lag is an effect of the less powerful merging method for e-values at higher lags.
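The "first crossing" rule underlying Figure 3 is a simple running product of per-round e-values; a sketch (illustrative input values; by Ville's inequality, rejecting when the product exceeds a threshold 1/α yields a sequential test at level α, so level 10 corresponds to α = 0.1):

```python
import numpy as np

def first_crossing(e_rounds, level=10.0):
    """Return the 1-based index at which the cumulative product of
    per-round e-values first exceeds `level`, or None if it never does.
    Stopping at this index is valid at level 1/level by Ville's
    inequality, since the cumulative product is a test supermartingale
    under the null hypothesis."""
    cumprod = np.cumprod(np.asarray(e_rounds, dtype=float))
    idx = int(np.argmax(cumprod > level))
    if cumprod[idx] > level:
        return idx + 1
    return None
```

For instance, `first_crossing([1.5, 2.0, 1.2, 3.0, 1.1])` reports the round in which the running product 1.5, 3.0, 3.6, 10.8, 11.88 first exceeds 10.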
In the comparisons of extreme precipitation warnings, the p-value gives some evidence against the null hypothesis for Brussels airport, and the corresponding e-value is decisive, E = 3703. For the other lag 1 forecasts, neither the p-values nor the e-values indicate that including the ensemble standard deviation brings a benefit. As for the higher lags, for London and Zurich airport there is no evidence that HCLR outperforms HCLR−, and for Brussels and Frankfurt airport there is evidence only at lags 2 and 3. Overall, the evidence in favour of the HCLR model for issuing extreme precipitation warnings, as compared to HCLR−, is surprisingly weak.

B Simulation examples: Additional figures
The simulation example in Section 4.1 in the article has been tested for robustness with respect to various parameters. For the spherical and the logarithmic score, the probability π_t was computed in such a way that µ = 0.5 corresponds to a score difference of zero: with r_t = E_ν(θ | θ ∈ [min(p_t, q_t), max(p_t, q_t))), we set π_t = p_t for µ = 0, π_t = r_t for µ = 0.5, π_t = q_t for µ = 1, and interpolate linearly between these three points for the other values of µ. Figure 4 demonstrates that the rejection rates of the e-values are almost the same for all scoring functions. Figure 5 shows how the rejection rates vary with the alternative hypothesis for the e-value. In particular, it can be seen that the alternative π_t is superior and q_t is inferior for all sample sizes and significance levels. As for the alternatives with the parameter k, smaller k give higher rejection rates for small sample sizes and lower rejection rates for larger samples. Figure 6 shows that the rejection rates of Student's t-test are likewise essentially equal for the different scoring functions.
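The linear interpolation of π_t described above can be sketched as follows (r_t is taken as given, since it depends on the prior ν of the simulation; the function name is ours):

```python
import numpy as np

def event_probability(mu, p, r, q):
    """Linearly interpolate the true event probability pi_t between
    pi = p at mu = 0, pi = r at mu = 0.5 and pi = q at mu = 1, so that
    mu = 0.5 corresponds to a zero expected score difference."""
    return float(np.interp(mu, [0.0, 0.5, 1.0], [p, r, q]))
```

For example, with p_t = 0.2, r_t = 0.4 and q_t = 0.9, the value µ = 0.25 gives π_t = 0.3, halfway between p_t and r_t.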
In Figure 7, it can be seen that the rejection rates of Student's t-test and Wilcoxon's signed rank test are almost equal in this simulation. Figure 8 shows that, close to µ = 0.5, Student's t-test under optional stopping has too high rejection rates, independent of the significance level and the sample size.
The simulation example in Section 4.2 was tested with different significance levels and scoring functions. Figure 9 shows that the choice of the scoring function has a minor influence on the rejection rates for the sample sizes 300 and 600, and almost no effect for 1200 and 2400. Table 4 contains the e-values and p-values of Table 2 in scientific notation.

Figure 4: Rejection rate of the stopped e-value (alternative hypothesis with k = 1 as explained in the article) for the Brier score (dots), spherical score (squares), and logarithmic score (triangles), for different significance levels (columns) and sample sizes (rows).

Figure 6: Rejection rate of Student's t-test for Brier score (dots), spherical score (squares), and logarithmic score (triangles) differences, for different significance levels and sample sizes.