Extreme value laws and mean squared error growth in dynamical systems

Background. Extreme value theory for chaotic deterministic dynamical systems is a rapidly expanding area of research. Given a system and a scalar observable defined on its state space, extreme value theory studies the asymptotic probability distributions of large values attained by the observable along evolutions of the system. Objective. The aim of this paper is to study the relation between the statistics and predictability of extremes. Methods. Predictability is measured by the mean squared error (MSE), which is estimated from the difference of pairs of forecasts conditional on one of the forecasts exceeding a threshold. Results. Under the assumption that pairs of forecast variables satisfy a linear regression model, we show that the MSE can be decomposed into the sum of three terms: a threshold-independent constant, a mean term that always increases with threshold, and a variance term that can either increase, decrease, or stay constant with threshold. Using the generalised Pareto distribution to model excesses over a threshold, we show that the MSE always increases with threshold at sufficiently high threshold. However, when the forecasts have a negative tail index the MSE can be a decreasing function of threshold at lower thresholds. Conclusions. Our method is illustrated by means of four examples: the tent map, the cusp map, and two low-order models for atmospheric regime transitions and the El Ni\~{n}o-Southern Oscillation phenomenon. These examples clearly show that predictability depends on the observable and the invariant measure of the system.


Introduction
An important problem in meteorology and climatology is understanding the statistics and predictability of extreme events such as storms, floods or heat waves. The statistical theory of extremes in stochastic processes, both with and without serial dependence, is well developed (Galambos, 1978;Leadbetter et al., 1983;Resnick, 1987;Castillo, 1988;Embrechts et al., 1997;Coles, 2001;Beirlant et al., 2004). Typically, the aim is to estimate the probability of the occurrence of future large values of a quantity, given a sample of past measurements of that quantity. To this end, the block maximum (BM) and peak-over-threshold (POT) methods are often used. The BM method amounts to dividing a time series in sufficiently long blocks and recording the largest observation of each block. The resulting block maxima are then fitted to the generalized extreme value distribution by methods such as maximum likelihood or L-moments. The POT method fits excesses over a selected threshold to a generalized Pareto distribution (GPD). Estimates of extreme value distributions can then be used to derive other quantities of interest such as return periods.
The extension of extreme value theory to the setting of chaotic, deterministic systems has become a rapidly expanding field in the last two decades (Collet, 2001;Freitas and Freitas, 2008;Freitas, 2009;Freitas et al., 2010Freitas et al., , 2011Freitas et al., , 2012Gupta, 2011;Holland et al., 2012aHolland et al., , 2016Chazottes and Collet, 2013). A comprehensive overview of the most important developments up to 2016 can be found in Lucarini et al. (2016). The main idea is to study time series obtained by evaluating a scalar observable along an evolution of the system. If the system is equipped with an invariant measure, then this time series can be interpreted as a stochastic process with serial dependence. Under suitable mixing conditions, the extreme value laws for dynamical systems are the same as for i.i.d. stochastic processes. Geophysical applications, in which dynamical systems arise as prognostic models and observables are physical quantities such as wind speed or temperature, form an important motivation for the development of the theory.
The predictability of extreme events is of great importance, and this topic has been studied by various authors. Hallerberg et al. (2007) studied the predictability of extreme increments in first-order autoregressive processes, wind speed recordings and long-range correlated autoregressive moving averages. In all their examples, extremes become better predictable with increasing event magnitude. The results in (Hallerberg and Kantz, 2008b) showed that for i.i.d. stochastic processes large increments are better predictable if the process is Gaussian, whereas large increments become less predictable if the underlying distribution has a power law tail. However, in the follow-up study (Hallerberg and Kantz, 2008a), which is concerned with threshold crossings instead of increments, it was found again that extremes are always better predictable. The work of Franzke (2012) and Franzke (2017) shows that also in the context of dynamic-stochastic models, extremes are better predictable than non-extremes. Bodai (2015) argues that in dynamical systems stronger predictability of extremes may be typical but not universal. Examples given by Sterk et al. (2012) and Sterk and Van Kekem (2017), however, show that the predictability of extreme values in deterministic dynamical systems can strongly depend on the observable, the attractor of the system and the prediction lead time.
The aim of this article is to explore the relation between statistics and predictability of extremes in dynamical systems by building on ideas presented in . Predictability is measured by the mean squared error (MSE), which is estimated from the difference of a pair of forecasts conditional on one of the forecasts exceeding a threshold. This approach has the advantage that we can consider finite-size errors in the initial condition, rather than infinitesimally small errors that are implicitly assumed when using predictability indicators such as finite-time Lyapunov exponents (Sterk et al., 2012). Using a linear regression model, we show that the MSE can be decomposed into the sum of three terms: a threshold-independent constant, a mean term that always increases with threshold and a variance term that can either increase, decrease, or stay constant with threshold. Using the GPD to model excesses over a threshold, we show that the MSE always increases with threshold at sufficiently high threshold. However, the MSE can be a decreasing function of threshold at lower thresholds when the forecasts have finite upper bounds. The aim of this article is to illustrate the key ideas by means of engaging examples. A more rigorous mathematical approach including formal proofs will be presented elsewhere. Our examples clearly show that the predictability of extremes depends on the regularity properties of the invariant measure and the observable.
This article is structured as follows. In Section 2, we first summarize the theory of extreme value statistics in deterministic systems. Next, we explain how the MSE can be decomposed as a sum of three terms, and we use the GPD to study the behaviour of these terms on asymptotically high thresholds. In Section 3.1, we apply our methodology to the tent map, which is a basic model in dynamical systems exhibiting sensitive dependence on initial conditions and which has an explicit invariant measure; this example illustrates all possible behaviours of the MSE decomposition terms. In Section 3.2, we study the cusp map, which exhibits intermittent dynamics and also has an explicit invariant measure; this example shows how intermittency can lead to clustered extremes for which the methodology of Section 2.2 may no longer be applicable. As geophysical examples we consider wind speeds in a low-order barotropic vorticity equation in Section 3.3 and sea surface temperatures in a low-order model for El Niño-Southern Oscillation (ENSO) in Section 3.4. Both models have a Shilnikov-like strange attractor that leads to irregular alternations between different physical regimes and slow decay of correlations. The article is concluded in Section 4 with a discussion.

Theoretical background
In this article, we consider a dynamical system, where  is a d-dimensional Riemannian manifold, a measurable transformation, and is an f-invariant probability measure. In applications, the map f can also be a time-shift map associated with the flow of a differential equation. Assume that there is a compact invariant set   ⊂ that supports the measure. Given an observable :   φ → , we consider the stationary process defined by The aim of this article is to explore the interplay between the statistics and the predictability of large values of the random variables X τ .

Statistics of extremes
Classical extreme value theory concerns the probability distribution of unlikely large events (Galambos, 1978;Leadbetter et al., 1983;Resnick, 1987;Castillo, 1988;Embrechts et al., 1997;Coles, 2001;Beirlant et al., 2004). Consider a time series ∞ = X ( ) i i 0 of independent identically distributed random variables and define the partial maximum over the first n occurrences: ) . … This variable has a degenerate limit as n → ∞, and therefore, it is necessary to consider a rescaling. Assume that there exist sequences a 0 n > and b n  ∈ such that the rescaled variable a M b ( ) n n n − converges to a non-degenerate distribution: Then, the extremal types theorem asserts that G is the generalized extreme value (GEV) distribution given by Leadbetter et al., 1983;Coles, 2001). The parameters λ ∞ ∞ ∈ − ( , ) and ∞ σ ∈ (0, ) are called location and scale. The tail index ξ ∈ − ( , ) ∞ ∞ is the most important parameter because it determines the tail behaviour of the distribution and therefore the likelihood of extreme values. In the cases 0 ξ = , 0 ξ > or 0 ξ < , the distribution (3) is also referred to as Gumbel (type I), Fréchet (type II) or Weibull (type III), respectively. The parameters ( , , ) λ σ ξ can be estimated from a time series using block-maximum or peak-over-threshold methods (Embrechts et al., 1997;Coles, 2001).
Recent work has extended extreme value theory to the setting of chaotic deterministic dynamical systems (see Lucarini et al., 2016 for a comprehensive overview). Under suitable mixing conditions, the process M n obtained from (1) satisfies (2). Most theoretical work focuses on observables of the form is a density point of the invariant measure ν and  ∞ ∞ g : [0, ) { } + → ∪ + is a strictly decreasing bijection in a neighbourhood of 0 and has a global maximum at 0. However, typical observables used in applications are not of this form. For example, in physical applications one might be interested in energy, wind speed or vorticity. Such observables are typically of the form The parameters of the GEV distribution (3) strongly depend on the form of the observable ϕ. For example, observables of the form (4) with g y y g y y g y y y ( ) log( ), ( ) , ( ) , , is the local dimension of the measure ν at ζ (see Lucarini et al., 2016, Section 4.2.1). Note that 0 ξ → as ζ → ∞ d( ) , which means that for systems with high-dimensional attractors, the distribution of extremes is approximately a Gumbel law. For systems with measures that are not absolutely continuous, equations (2) and (3) do not hold for linear scaling sequences, as outlined in Holland et al. (2016). However, numerical fitting schemes might still give good estimates on the tail behaviour (see also Faranda et al., 2012).
For observables of the form (5), however, determining the GEV distribution becomes a much more delicate problem. In this case, the GEV distribution is no longer determined by the local dimension of the measure ν, as stated in equation (7). The asymptotic behaviour of { } φ is the maximum of φ , still plays a crucial role. In fact, this critically depends on the geometry of the attractor   ⊂ and the level sets of φ . Problems arise, for instance, when the level set is a strip or a fractal set (see Holland et al., 2012 for an extensive discussion). Note that the observables (5) are continuous functions, and hence, they are bounded when the system has an attractor which is compact. This implies that the time series (1) has an upper bound, and in this case, we expect a priori that 0 ξ < for observables of the form (5).

Predictability of extremes
In this article, we study predictability in the context of a perfect model scenario: instead of comparing forecasts with observations we compare predictions with each other. Given two initial conditions x y , x y dist( , ) 0 0 is sufficiently small, we can make the twin predictions We will interpret X τ as an 'observation' and Y τ as a 'forecast'. We can now measure the predictability of extremes by the MSE between the twin predictions given that the observation exceeds a threshold: The aim of this article is to understand how the MSE depends on the threshold u. The following result provides a decomposition of the MSE for fixed τ as a function of u under the assumption that the 'forecast' and 'observation' are related through a linear model. In Sterk et al. (2016), a similar methodology was applied to a data set of wind speed forecasts produced by an operational weather forecasting model. The aim of this article is to apply the methodology to simple dynamical systems examples to investigate the relation between the predictability of extremes and certain recurrence properties of the system such as intermittency.
Lemma 1. Assume that the 'observation' X τ and 'forecast' Y τ are identically distributed at 0 τ = and that for 0 τ ≥ they are related by the linear regression model where the error term ε is independent of X τ and E( ) 0 ε = . Then X τ and Y τ are identically distributed for all 0 τ ≥ , and the coefficients of the linear model are given by Moreover, we have the following decomposition: Proof. If X τ and Y τ are identically distributed at 0 τ = , then they are identically distributed for 0 τ ≥ by invariance of the measure ν. Equation (11) follows from a straightforward computation. Note that µ is independent of τ since ν is invariant under the map f.
Combining the standard formula for conditional variance (Ross, 1997) with the linear model of equation (10) gives The assumption that the error ε is independent of X τ gives ◻ Recall the GPD, which is given by where 0 β > is the scale parameter and  ξ ∈ is the tail index (which is also referred to as the shape parameter). If X is any random variable, then under widely applicable conditions (Balkema and de Haan, 1974;Pickands, 1975), the distribution of the excess X u − conditional on X u > for sufficiently large u is well approximated by the GPD in which the scale parameter becomes a linear function of threshold, namely where λ σ ξ ( , , ) are the parameters of the GEV distribution (3) (see Coles, 2001). The next result will be useful to analyse the MSE at high thresholds u.
Lemma 2. Assume that the random variable X has a GPD with scale parameter 0 β > and tail index  ξ ∈ . If 1 2 / ξ < , then the first and second moment of X are finite and and The first two conditional moments are given by can be approximated by a GPD for which the scale parameter is given by (14). Combining this with Lemma 1 gives is a strictly monotonically increasing function of threshold. For 0 ξ < , a decreasing V u ( ) may compensate for an increasing E u ( ) and create a minimum S u ( ) at In summary, the MSE always increases with threshold at sufficiently high threshold, although it can be a decreasing function of threshold at lower thresholds but only if the 'observations' X τ have a negative tail index. Computing the minimum u min requires an accurate estimate of the parameters ( , , , ) µ λ σ ξ . However, in practical applications, it may be difficult to obtain a precise estimate for the tail index ξ (Holland et al., 2012).
In concrete applications, the expectation in (9) is replaced by its empirical counterpart that is computed from a sufficiently large sample. This amounts to drawing samples from the invariant measure for the pairs of initial conditions in (8). Given an arbitrary point in  , we first iterate the map f sufficiently many times to get an initial point x 0 on the attractor. Next, we construct a perturbation y 0 of x 0 for the twin prediction in (8). Note that by simply adding a random perturbation to x 0 we would get a point that is almost surely not on the attractor. In this case, the points x 0 and y 0 would not have the same distribution. Instead, we construct perturbations by finding a so-called analogue of the point x 0 . For a specified distance 0 δ > , we find the first j ≥ 100 such that y f x ( ) 0 0 δ < . By requiring that j ≥ 100 we ensure that the short-time evolutions starting at x 0 and y 0 do not overlap. We then replace the point x 0 by f(x 0 ) and repeat this procedure until a chosen sample size N is reached. In the experiments below, we have used the parameters N = 10 5 , δ = 0.1 (tent map), δ = 0.05 (cusp map), δ = 0.025 (low-order barotropic model) and δ = 0.1 (low-order ENSO model).

Examples
We first present two examples of one-dimensional dynamical systems. These examples have the advantage that the invariant measure is known. In particular, the extreme value law is also known when we restrict to observables of the form (4). This allows us to check to what extent the conclusions of Lemma (2) hold. Next, we present two geophysical models; for these examples, it is unknown whether the invariant measure has a smooth density or whether the extreme value law is of the form (3).

The tent map
Our first example is a uniformly expanding map. We consider the tent map given by [ , ]. The invariant measure ν is the Lebesgue measure on  rescaled by 1 2π / . We consider the observables in equation (4) in which the distance between two points is given by x y x y dist( , ) = − . It is easy to verify that the local dimension in equation (7) is given by d( ) 1 ζ = for all  ζ ∈ . In particular, this implies that the tail index for the process (1) with observables of the form (4) is given by 0 In the following, we will take 1 ζ = .  (4) with g = g 1 and ζ = 1.
We first discuss the case g g 1 = . Figure 1 shows the invariant density and a scatter plot of X τ versus Y τ for the lead time 1 τ = . Clearly, the forecast and observation have the same distribution. Points in the scatter plot for which X 2 > τ or Y 2 > τ deviate from the linear model, but these points form less than 5% of the total number of points in the scatter plot. In addition, for 1 τ = , the correlation coefficient is 0.95 τ ≈ ρ . Figure 1 also shows that the coefficient µ is a constant function of the prediction lead time τ , which can be explained by the fact that the measure is invariant under the map f . The correlation coefficient ρ τ decays very fast with lead time and we have ρ τ ≈ 0 for 6 τ ≥ . This can be explained by the fact that the doubling map has exponential decay of correlations for a large class of observables (Boyarsky and Góra, 1997). Figure 2 shows that the MSE increases as a function of the prediction lead time τ and saturates for 5 τ ≥ . For fixed lead times, the MSE monotonically increases as a function of threshold. Note that the MSE is plotted against the exceedance probability rather than the actual threshold. The decomposition terms of the MSE in equation (12) are also shown in Figure 2. Clearly, the conditional variance is constant at high thresholds whereas the squared excess term increases with threshold. This behaviour is explained by the asymptotic expressions derived in Lemma 2 since the tail index 0 ξ = .
For the observable (4) with g g 2 = and 10 α = , the coefficients µ and ρ of the linear model behave in qualitatively the same way, and hence, their graphs are omitted. The MSE and its decomposition terms are shown in Figure 3. In this case, 1 10 0 / ξ = > , and hence, Lemma 2 explains why the conditional variance term is increasing. For the observable (4)  (9) as functions of lead time and threshold for the tent map and the observable given by equation (4) with g = g 1 and ζ = 1. The conditional variance and square excess refer to the decomposition terms defined in equation (12).
with g g 3 = and 2 α = , the MSE and its decomposition terms are shown in Figure 4. In this case, we have 1 2 0 / ξ = − < , and the conditional variance term is decreasing. Note that a minimum in the MSE as a function of threshold is not clearly visible.
(which corresponds to the 80th percentile) to a GPD with the peaks over threshold method (Coles, 2001) gives the estimates    ( , , ) (0.031, 0.012, 0.452) λ σ ξ = − . Note that the estimated tail index is close to the true value 1 2 / ξ = − . Together with the estimate µ  1.21 = , formula (17) gives The estimated value of u in m lies far below thresholds for which the GPD is applicable, and this may explain why no clear minimum is observed in Figure 4.
The tent map clearly illustrates that all types of behaviour of the MSE as discussed in Section 2.2 can occur, depending on the choice of the observable. However, the tent map is a rather simple example. In more complicated examples, the local dimension varies across the attractor of the dynamical system. In this case, it is expected that the results also depend on the point  ζ ∈ where the observable attains its maximum. In the next section, we discuss an example for which this is indeed the case.
We now compare the results described above with conclusions that are obtained from finite-time Lyapunov exponents (FTLEs). The FTLE is the finite-time counterpart of the traditional infinite-time Lyapunov exponent, which measures the average exponential growth rate of infinitesimally small errors in the initial condition. For an interval map f I I : → , the time-τ FTLE is defined as Typically, λ τ strongly depends on both the initial condition x 0 and the prediction lead time τ . However, for the tent map we have f x ( ) 2 0 = ′ for almost every x 0  ∈ and this implies that x ( ) log(2) 0 λ = τ is independent of both the initial condition and the prediction lead time. Hence, measuring predictability with FTLEs implies that for the tent map all events are equally unpredictable and that predictability does not depend on lead time.
An alternative finite-time predictability indicator can be defined by which has been used in (Harle et al., 2006;Sterk and Van Kekem, 2017). Clearly, this indicator is related to the FTLE by the relation . Indeed, for a fixed value of τ one can interpret σ τ and λ τ as the same predictability indicators expressed in different units. For the tent map we have x ( ) 2 0 σ = τ τ . Unlike λ τ , the indicator σ τ does show that predictability decreases with increasing lead time τ as one would expect. However, σ τ suggests that prediction errors keep growing without bound, whereas in practice prediction errors often saturate for τ sufficiently large (see, e.g. , Figs 2-4). Note that for the tent map, σ τ is still independent of the initial condition and therefore not suitable as a predictability indicator for specific events such as extreme events.

The cusp map
Our next example is a non-uniformly expanding map. We consider the cusp map f x x ( ) 1 2 = − on [ 1, 1]  = − . This map is a qualitative model of a Poincaré return map of the Lorenz-63 system (Lorenz, 1963). The map has an invariant measure that is absolutely continuous with respect to the Lebesgue measure with density d dx Hemmer, 1984). The point x 1 = − is a neutral fixed point, and time series of the cusp map show intermittent behaviour near x 1 = − . This is also reflected by the invariant measure: at x 1 = − , the density d dx / ν is maximal, so that Birkhoff's ergodic theorem implies that iterates of the map spend longer times near x 1 = − . We consider the observable (4) with g g 3 = and 2 α = and two choices for  ζ ∈ . Due to the intermittent dynamics of the map f , the correlation coefficient of the linear model decays much slower than in the case of the tent map (see Fig. 5 for the case 1 ζ = ). In fact, the cusp map only has polynomial decay of correlations (Bhansali and Holland, 2007). A straightforward computation shows that the local dimension of the measure ν is given by d( ) 1 ζ = for all [ 1, 1) ζ ∈ − and d( ) 2 ζ = for 1 ζ = . In particular, we expect that the tail index is given by for for 1 2 [ 1, 1), 1 4 1.
The behaviour of the MSE and its decomposition terms strongly depend on the choice of ζ . For 1 ζ = , the MSE increases with threshold as the calculations in Section 2.2 suggest (see Fig. 6). However, for 1 ζ = − , the MSE decreases very fast with high thresholds (see Fig. 7). This cannot be explained by the results in Section 2.2. Note that for 1 ζ = − , the observable is maximized at a fixed point of f . Figure 8 shows time series of the observable (4) with g g 3 = and 2 α = for 1 ζ = − and 0.6 ζ = − . In the former case, the extremes form clusters: once the observable exceeds a high threshold, the next values in the time series also exceed that threshold.
The statistics of clustered extremes can still be described by equation (3) when the location and scale parameters ( , ) λ σ are replaced with the modified parameters (see Coles, 2001) The new parameter (0, 1] θ ∈ is referred to as the extremal index that can be interpreted as the reciprocal of the limiting mean cluster size, where the limiting is in the sense of clusters of exceedances of increasingly high threshold. Clustering in time series of random variables is discussed in Embrechts et al. (1997 andColes (2001). A comprehensive discussion on recurrence properties of dynamical systems and clustering of extremes can be found in Lucarini et al. (2016).  (10) for the cusp map with observable given by equation (4) with α = 2 and ζ = 1. Note that the decay of correlations is much slower than in the case of the tent map.
which implies that (0, 1) θ ∈ . In particular, this relation also shows that when the fixed point becomes less repelling, the extremal index decreases and hence the limiting mean cluster size of extremes becomes larger. However, for the cusp map, the point 1 ζ = − is an indifferent fixed point, which means that f ( ) 1 ζ = ′ . In Freitas et al. (2016), it is proven for the Manneville-Pomeau map with the observable x x ( ) log φ ζ = − − that the extremal index is zero when ζ is an indifferent fixed point. In this case, the scaling sequence in equation (2) has to be modified. However, the proof techniques in Freitas et al. (2016) only work for a suitable range of parameter values and are not directly applicable to the cusp map.
For the cusp map, we do not have an analytical expression for the extremal index, but we can estimate θ from the time series shown in Figure 8 using the algorithm described in Ferro and Segers (2003). For 1 ζ = − , we have θ ≈ 0.002 that lends support to the conjecture that θ = 0. This numerical estimate is consistent with the result proved in Freitas et al. (2016) for the Manneville-Pomeau map. Also observe that in Figure 8, the clusters are very large. The predictability of extremes in the presence of clusters will be studied in more detail in forthcoming work. In particular, we will study how the results presented in Section 2.2 need to be modified when 0 < θ < 1 and also in the extreme case 0 θ = for which formula (19) clearly no longer holds.

A low-order barotropic model
In this section, we consider a low-order model for atmospheric regime transitions. The existence and properties of an invariant measure are unknown. The model exhibits slow decay of correlations due to Shilnikov-like intermittency, and this implies that linear model in Section 2.2 remains valid for long lead times. The barotropic vorticity equation on a β-plane channel for atmospheric flow over orography is given by where ψ is the stream function, ψ * is the forcing, βy is the Coriolis force, h is the orography and Δ is the Laplace operator. In addition, J is the Jacobian operator which is defined as × π π so that b/2 is the aspect ratio of the β-plane channel. We take periodic boundary conditions in x and for y b 0,π = we require and y dx 0 0 , 0 2π ∫ which, respectively, mean that on the lateral boundaries the northward component of the velocity vanishes and that the net zonal flow is zero. Equation (21) can be spatially discretized by expanding the stream function as a Fourier series: where the state space variables z 1 and the Fourier coefficients  n m , ψ are related as follows: The parameters are chosen as in Crommelin et al. (2004) and are listed in Table 1. With this choice, the β-plane channel × π π b [0, 2 ] [0, ] has dimensions 5000 1250 × km, the orography h x y x y b ( , ) cos( ) sin( ) / = has a dimensional amplitude of 200 m, and one unit of time corresponds to 1 day.
The low-order model (23) was used in Crommelin et al. (2004) to explain intermittent transitions between two regions in state space corresponding to zonal and blocked flow regimes (see Fig. 9). The dynamics consists of three recurrent episodes: (i) transitions from zonal to blocked flows, typically taking 30 days, (ii) transitions from blocked to zonal flows, typically taking 40-80 days, and (iii) spiralling behaviour around the zonal regime, typically lasting >200 days. This intermittent behaviour can be explained by the presence of Shilnikov-like strange attractors due to homoclinic bifurcations taking place near a Hopf-saddle-node bifurcation (Broer and Vegter, 1984;Broer and Vitolo, 2008).
Sampling the flow of (23) at multiples of a time step t s gives a map f :   → with 6   = . In the following, we will take t 1 s = , which corresponds to one day. We will consider observables that measure wind speed at particular points in the β -plane channel. For a given state vector z z ( , , ) respectively. Evaluating the wind speed at a geographical location x y ( , ) in the β-plane channel gives observables : 6   φ → , which are indeed of the form W φ in equation (5), but their explicit expressions will be omitted. We will study wind speeds at three locations in the β-plane channel: the side of the orography 2) and the valley . The corresponding observables are denoted by sid φ , top φ and val φ , respectively. The intermittent dynamics of (23) causes the correlation coefficient ρ τ to decrease slowly with prediction lead time τ. This also implies that the linear model between the forecasts and observations remains valid for long lead times. For the observable sid φ , this is illustrated in Figure 10. The decay of ρ τ for the observables top φ and val φ is similar (not shown). The low-order model (23) belongs to a class of forced dissipative systems in which the quadratic nonlinearities preserve the total energy. It is well known that such systems have a trapping region (Lorenz, 1980), which means that the orbits remain in some ball around the origin for all t ≥ 0. Since the wind speed observables are continuous functions, their evolutions along orbits of the system have an upper bound. Therefore, we expect a priori that the process (1) has a negative tail index. In particular, we expect that the conditional variance term of the MSE always decreases as a function of threshold, but the rate of decrease might be different for the three observables. Figure 11 shows the MSE and its decomposition for the time-1 map of (23) with the observable sid φ . For lead time 20 τ = , the MSE slowly decreases with threshold exceedance probability p but shows a steep increase after p = 0.8. The behaviour of the MSE decomposition terms is consistent with the fact that 0 ξ < : the conditional variance term decreases and the squared excess term increases. A clear minimum in the MSE is not visible in this case. For the observable top φ (see Fig. 12), the MSE always seems to increase with p. The conditional variance term stays relatively constant up to p = 0.8 and then suddenly decreases. The squared excess term increases fast after p = 0.8. For the observable val φ (see Fig. 13), the MSE is nearly constant as a function of threshold up to p = 0.7 and increases fast thereafter. The conditional variance term decreases fast for low thresholds, whereas the squared excess term increases fast for high thresholds. This might explain why no minimum in the MSE is visible. Figure 14 shows the attractor of the low-order model (23) in which colours indicate the error between the forecast and observation for the observables sid φ and val φ and lead time 10 τ = . These figures show that the prediction error is large for only a small fraction of points on the attractor and that most of these points are located near the zonal regime. The initial conditions leading to an extreme event at a specific lead time are precisely located in the unpredictable region of the attractor. This qualitatively explains why the MSE increases for large thresholds. The results presented in this article are consistent with results presented in our earlier article (Sterk et al., 2012), which were obtained by means of FTLEs. However, in the latter case, the unpredictable part of the attractor was significantly larger. This might be a consequence of the fact that FTLEs do not take into account that prediction errors saturate with lead time and hence overestimate the growth of prediction errors.

A low-order model for ENSO
As a final example we consider a low-order model for the ENSO phenomenon that was presented in Timmermann and Jin (2002). The model consists of the three prognostic equations The relation between the subsurface temperature and the thermocline depth is given by where h h bL 2 1 τ = + is the depth deviation of the eastern equatorial thermocline from the reference depth H and z 0 is the depth at which w takes its characteristic value. The parameter h* controls the sharpness of the thermocline. For more details and physical justifications, see Timmermann and Jin (2002) and references therein. Together with these diagnostic relations, the model (24) is a closed system. The values of the parameters are listed in Table 2.
For the parameters listed in Table 2, the low-order model (24) has a Shilnikov-like chaotic attractor in which orbits spiral around an equilibrium of saddle-focus type (see Fig. 15). The equilibrium represents the tropical climate mean state of the model, and the spiralling motion corresponds to irregular temperature oscillations which lead to an ENSO event when T 2 reaches a maximum. This regime behaviour is also reflected in the probability density of T 2 shown in Figure 16 which is clearly bimodal. The two modes can be interpreted as different regimes: the dominant mode T 2 = 21 corresponds to the tropical climate mean, whereas the much weaker mode T 2 = 28.5 corresponds to an ENSO event. As for the low-order atmosphere model presented in the previous section, Figure 17 shows that correlations decay slowly in a non-monotonic manner.
The flow of (24) is sampled at intervals of 0.25 years and as observable we use the eastern equatorial sea surface temperature T 2 . Figure 18 shows the MSE and its decomposition terms as functions of lead time and threshold. The MSE is a non-monotonic function of τ , and this behaviour becomes more pronounced for larger thresholds u. Note that the MSE has a local maximum at lead times for which the correlation coefficient ρ τ has a local minimum and vice versa. Non-monotonic behaviour of prediction errors as a function of lead time has observed earlier in the context of an operational weather forecasting model  and toy-models (Sterk and Van Kekem, 2017). This phenomenon might be a consequence of the intermittent dynamics of the system. Perhaps the simplest possible explanation can be given in terms of the cusp map in Section 3.2. Figure 19 shows two predictions produced by the cusp map starting at two nearby initial conditions. The evolution starting at the initial condition x 0.9 0 = − drifts away from For fixed lead times, the MSE as a function of threshold is increasing. The conditional variance term shows a sharp decrease at large thresholds, whereas the squared excess term shows a fast increase at high thresholds. This explains why no minimum is observed. Figure 20 shows the attractor of the low-order model (24) in which colours indicate the error between the forecast and observation for the lead time 20 τ = (which corresponds to 5 years) and the observable T 2 . As in Figure 14, the unpredictable part of the attractor is located near the saddle-focus equilibrium and initial conditions leading to an extreme event are located in the unpredictable region of the attractor.

Discussion and conclusion
In this article, we have explored the relation between statistics and the predictability of extremes in dynamical systems. We have adopted a perfect model scenario by comparing forecasts with each other rather than comparing forecasts with observations. We measured predictability in terms of an MSE computed from the difference of a pair of forecasts conditional on one of the forecasts exceeding a threshold. We showed that the MSE can be written as a sum of three terms: a threshold-independent constant, a mean term that always increases with threshold and a variance term that can either increase, decrease or stay constant with threshold. Using the GPD to model excesses over a threshold, we showed that the MSE always increases with threshold at sufficiently high threshold. However, when the forecasts have a negative tail index, the MSE can be a decreasing function of threshold at lower thresholds. In this article, we have only considered twin forecasts, but the methodology can be extended to ensemble forecasts .   (24)  In earlier articles (Sterk et al., 2012;Sterk and Van Kekem, 2017), we have used FTLE-like predictability indicators to assess the predictability of extremes. However, these indicators have at least two disadvantages. First, these indicators are based on the assumption that errors in the initial condition are infinitesimally small, which is a very strong idealization. Second, their computation requires the explicit availability of a prediction model and its variational equations (also referred to as a tangent linear model). If the prediction model has a high-dimensional state space, then computationally expensive algorithms are needed to obtain FTLEs from singular value decompositions. With the method presented in this article, one can study the effect of finite size errors. More importantly, our method can be used in a data-driven framework in which a prediction model is not explicitly available. Indeed, only a data set of forecasts and observations is required.
Our methodology is applicable under the condition that forecast/observation pairs satisfy a linear relation for sufficiently large prediction lead times. This condition is met when the correlations decay sufficiently slow. For example, this is the case for the cusp map and the low-order atmosphere and ocean models discussed in Section 3. All these case studies exhibit a form of intermittent dynamics, i.e. alternating phases of regular and chaotic dynamics. Intermittency plays an important role in the dynamics of climate models and has been proposed as an explanation for atmospheric low-frequency variability due to homo-and heteroclinic connections (Crommelin, 2002(Crommelin, , 2003 and bifurcations of periodic orbits (Sterk et al., 2010), atmospheric regime dynamics (Crommelin et al., 2004) and active and passive ocean regimes in coupled atmosphere-ocean models (Van Veen et al., 2001). We expect that our methodology can be applied in a wide range of climate contexts in which correlations decay sufficiently slow.
The examples presented in this article show that predictability strongly depends on the invariant measure of the system and the regularity of the observable. For observables of the form (4), the forecasts can have a non-negative tail index in which case the MSE always increases with threshold. In climate models, however, observables are typically of the form (5), which are continuous functions on the state space d   = . Climate models often have dissipation terms, such as linear damping or diffusion, which implies that their evolutions are confined to attractors which are compact sets in state space. This means that observables of the form (5) attain a maximum value, which implies that the forecasts have a negative tail index. In this case, the MSE might decrease to a minimum before eventually increasing at high thresholds. Whether the minimum is observed or not, however, depends also on the location and scale parameters.
We now point out two directions for further research. In this article, we have studied predictability in the perfect model scenario that amounts to comparing forecasts with each other rather than comparing forecasts with real observations. An interesting direction topic for future research is the extension of our method to an imperfect model scenario. In that case, the forecast and observation do not have the same distribution, which implies that the assumptions of Lemma 1 are not satisfied and equation (12) is no longer applicable. Perhaps a more refined statistical model can be constructed to relate forecasts and observations to obtain a decomposition of the MSE. Such a statistical model should take into account the bias between the predictions and the observations. A second important direction for future research is to adapt the methodology in Section 2.2 to the setting of clustered extremes. The cusp map example shows that when an observable is maximized at a fixed point, then the extremes form clusters. In this case, the MSE can decrease at high thresholds. When the observable is maximized at a point sufficiently close to this fixed point, the extremes may still show a moderate degree of clustering. To describe the statistics of clustered extremes, the location and scale parameters in extreme value law (3) have to be modified with an additional parameter known as the extremal index. The latter parameter can be interpreted as the reciprocal of the mean cluster size. We believe that detailed studies of one-dimensional maps can help to obtain a better mathematical understanding of the predictability of clustered extremes. Of particular interest are the so-called intermittency maps that exhibit slow decay of correlations (Holland, 2005;Bhansali and Holland, 2007).
Clustering of extremes is not only relevant in the setting of low-dimensional dynamical systems, but also for meteo-climatic phenomena. For example, a heat wave or a cold spell might persist for a couple of days, and one might be interested in predicting the duration of an extreme event. In this context, it is also interesting to look at observables different from those in equation (5). For persistent extremes, one might be interested in observables that exceed a threshold for a subsequent number of elements in a time series, and for such observables, the extreme value laws are probably different from equation (3). Another example is recent observational work that shows that large-scale flow patterns, such as the North Atlantic Oscillation, which itself might be a manifestation of intermittency, cause temporal clustering of storms (Mailier et al., 2006;Vitolo et al., 2009). A better understanding of clustered extremes in dynamical systems may therefore also lead to a better understanding of real-world extremes.