High-dimensional peaks-over-threshold inference

Max-stable processes are increasingly widely used for modelling complex extreme events, but existing fitting methods are computationally demanding, limiting applications to a few dozen variables. r-Pareto processes are mathematically simpler and have the potential advantage of incorporating all relevant extreme events, by generalizing the notion of a univariate exceedance. In this paper we investigate the use of proper scoring rules for high-dimensional peaks-over-threshold inference, focusing on extreme-value processes associated with log-Gaussian random functions, and compare gradient score estimators with the spectral and censored likelihood estimators for regularly varying distributions with normalized marginals, using data with several hundred locations. When simulating from the true model, the spectral estimator performs best, closely followed by the gradient score estimator, but censored likelihood estimation performs better with simulations from the domain of attraction, though it is outperformed by the gradient score in cases of weak extremal dependence. We illustrate the potential and flexibility of our ideas by modelling extreme rainfall on a grid with 3600 locations, based on exceedances for locally intense and for spatially accumulated rainfall, and discuss diagnostics of model fit. The differences between the two fitted models highlight how the definition of rare events affects the estimated dependence structure.


INTRODUCTION
Recent contributions to extreme value theory describe models capable of handling spatiotemporal phenomena (e.g., Kabluchko et al., 2009) and provide a flexible framework for modelling rare events, but their complexity makes inference difficult, if not intractable, for high-dimensional data. For instance, the number of terms in the block maximum likelihood for a Brown-Resnick process grows with the dimension D like the Bell numbers (Huser & Davison, 2013), so computationally cheaper methods such as composite likelihood (Padoan et al., 2010) or the inclusion of partition information (Stephenson & Tawn, 2005; Thibaud et al., 2016) have been advocated. The first is slow and the second, though more efficient, is prone to bias if the partition is incorrect (Wadsworth, 2015).
An attractive alternative to the use of block maxima is peaks-over-threshold analysis, which includes more information by focusing on single extreme events. In the multivariate case, specific definitions of exceedances have been used (e.g., Rootzén & Tajvidi, 2006; Ferreira & de Haan, 2014; Engelke et al., 2015), which can be unified within the framework of r-Pareto processes (Dombry & Ribatet, 2015). For this approach, a full likelihood is often available in closed form, thus increasing the maximum number of variables that can be jointly modelled from a handful to a few dozen, but biased estimation may occur if nonextreme components are included. Censored likelihood, proposed in this context by Wadsworth & Tawn (2014), is more robust with regard to nonextreme observations, but it involves multivariate normal distribution functions, which can be computationally expensive. Nevertheless, inference is feasible for D ≈ 30.
© 2018 Biometrika Trust. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Nonparametric alternatives to full likelihood inference developed using the tail dependence coefficient (Davis & Mikosch, 2009; Davis et al., 2013) or the stable tail dependence function (Einmahl et al., 2016) rely on pairwise estimators and allow peaks-over-threshold inference for D ≈ 100, but they are potentially inefficient and may be limited by combinatorial considerations.
Applications of max-stable processes (e.g., Asadi et al., 2015) or Pareto processes (Thibaud & Opitz, 2015) have focused on small regions and have used at most a few dozen locations with particular types of exceedance, but exploitation of much larger gridded datasets, along with more complex definitions of risk, is needed for a better understanding of extreme events and to reduce model uncertainties. The goals of this paper are to highlight the advantages of functional peaks-over-threshold modelling using r-Pareto processes, to show the feasibility of high-dimensional inference for the Brown-Resnick model with hundreds of locations, and to compare the robustness of different procedures with regard to finite thresholds. We develop an estimation method based on the gradient score (Hyvärinen, 2005) for a general notion of exceedances, for which the computation of multivariate normal probabilities is not needed and computational complexity is driven by matrix inversion, as with classical Gaussian likelihood inference. This method focuses on single extreme events and a general notion of exceedance, modelled by Pareto processes, instead of the max-stable approach.

2•1. Univariate model
The statistical analysis of extremes was first developed for block maxima (Gumbel, 1958, § 5.1). This approach is widely used and can give good results, but the reduction of a complex dataset to maxima can lead to a significant loss of information (Madsen et al., 1997), so the modelling of exceedances over a threshold is often preferred (Davison & Smith, 1990). Let X be a random variable for which there exist sequences of constants a_n > 0 and b_n such that pr{(X − b_n)/a_n ≤ x} → G(x) as n → ∞ (1), where G is a nondegenerate distribution function. Then X is said to belong to the max-domain of attraction of G and, for a large enough threshold u < inf{x : pr(X ≤ x) = 1}, pr{(X − u)/σ > x | X > u} ≈ (1 + ξx)_+^{−1/ξ} (2), where σ = σ(u) > 0 and a_+ = max(a, 0). If the shape parameter ξ is negative, then x must lie in the interval [0, −σ/ξ], whereas x can take any positive value with positive or zero ξ. The implication is that the distribution over a high threshold u of any random variable X satisfying the conditions for (2) can be approximated by pr(X > u + σx) ≈ ζ_u (1 + ξx)_+^{−1/ξ}, where ζ_u is the probability that X exceeds the threshold u. In its simplest form, this model for univariate exceedances applies to independent and identically distributed variables, but it has also been used for time series, nonstationary and spatial data. The modelling of exceedances can be generalized to a multivariate setting (Rootzén & Tajvidi, 2006) and to continuous processes (Ferreira & de Haan, 2014; Dombry & Ribatet, 2015) within the functional regular variation framework.
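The univariate model above can be sketched numerically. The following is a minimal illustration, not the paper's implementation: it treats the excesses of a simulated exponential sample over a high threshold as generalized Pareto, so the fitted shape should be near ξ = 0 and the fitted scale near 1; scipy's shape parameter c plays the role of ξ.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)

# simulated data with a known tail: exponential, so the limiting shape is xi = 0
x = rng.standard_exponential(100_000)

u = np.quantile(x, 0.95)      # high threshold
excess = x[x > u] - u         # exceedances, shifted to start at zero
zeta_u = np.mean(x > u)       # empirical exceedance probability pr(X > u)

# fit the generalized Pareto distribution to the excesses (location fixed at 0)
xi_hat, _, sigma_hat = genpareto.fit(excess, floc=0)
```

With ζ̂_u and the fitted (σ̂, ξ̂), the unconditional tail pr(X > u + σx) can then be approximated as in the display above.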

2•2. Functional regular variation
Let S be a compact metric space, such as [0, 1]² for spatial applications. We write F_+ = C{S, [0, ∞)} for the closed subset of the Banach space of continuous functions x : S → R endowed with the uniform norm ‖x‖ = sup_{s∈S} |x(s)|, write F for F_+ with the singleton {0} excluded, and write B(E) for the Borel σ-algebra associated with a metric space E. Let M_F denote the class of Borel measures on B(F); we say that a sequence of measures Λ_n ∈ M_F converges to Λ ∈ M_F (Hult & Lindskog, 2005) if lim_{n→∞} Λ_n(A) = Λ(A) for all A ∈ B(F) bounded away from {0} with Λ(∂A) = 0, where ∂A denotes the boundary of A. For equivalent definitions of this so-called ŵ-convergence, see Lindskog et al. (2014, Theorem 2.1).
Regular variation provides a flexible mathematical setting in which to characterize the tail behaviour of random processes in terms of ŵ-convergence of measures. A stochastic process X with sample paths in F is regularly varying (Hult & Lindskog, 2005) if there exists a sequence of positive real numbers a_1, a_2, … with lim_{n→∞} a_n = ∞ and a measure Λ ∈ M_F such that n pr(a_n^{−1} X ∈ ·) → Λ(·) as n → ∞ (3); then we write X ∈ RV(F, a_n, Λ); note the link to (1). For a normalized process X*, obtained for instance by standardizing the margins of X to unit Fréchet (Coles & Tawn, 1991, § 5) or unit Pareto (Klüppelberg & Resnick, 2008), regular variation is equivalent to the convergence of the renormalized pointwise maximum n^{−1} max_{i=1,…,n} X*_i of independent replicates of X* to a nondegenerate process Z* with unit Fréchet margins and exponent measure Λ* (de Haan & Lin, 2001). The process Z* is called a simple max-stable process, and X* is said to lie in the max-domain of attraction of Z*.
Regular variation also affects the properties of exceedances over high thresholds. For any nonnegative measurable functional r : F → [0, ∞) and stochastic process {X(s)}_{s∈S}, an r-exceedance is defined to be an event {r(X) > u_n}, where the threshold u_n is such that pr{r(X) > u_n} → 0 as n → ∞. We further require that r be homogeneous, i.e., that there exist α > 0 such that r(ax) = a^α r(x) for a > 0 and x ∈ F. As r(·) could be replaced by r(·)^{1/α} without loss of generality, in what follows we assume that α = 1. Dombry & Ribatet (2015) called r a cost functional, and in his 2013 Université de Montpellier II PhD thesis Thomas Opitz called it a radial aggregation function; but we prefer the term risk functional, because r determines the type of extreme event whose risk is to be studied.
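To make the notion of a risk functional concrete, here is a small sketch in the finite-dimensional setting, with hypothetical functionals mirroring those discussed below; each is homogeneous of order α = 1, which the loop checks numerically.

```python
import numpy as np

# Hypothetical risk functionals on a vector x of values at D gauged sites;
# each is nonnegative and homogeneous of order one: r(a * x) = a * r(x), a > 0
risk_functionals = {
    "sup":     lambda x: np.max(x),             # extreme anywhere
    "mean":    lambda x: np.mean(x),            # spatially accumulated (areal) risk
    "min_rel": lambda x, u=1.0: np.min(x / u),  # all sites jointly high
    "site_0":  lambda x: x[0],                  # risk at a specific location s_0
}

x = np.array([0.5, 2.0, 1.5, 4.0])
a = 3.7
for name, r in risk_functionals.items():
    assert np.isclose(r(a * x), a * r(x)), name   # homogeneity, alpha = 1
```

Which observations count as r-exceedances {r(X) > u} therefore changes with the functional, even for a fixed dataset.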
A natural formulation of subsequent results on r-exceedances uses a pseudo-polar decomposition. For a norm ‖·‖_ang on F, called the angular norm, and a risk functional r, a pseudo-polar transformation T_r is the map T_r : x ↦ {r(x), x/‖x‖_ang}, where S_ang is the unit sphere {x ∈ F : ‖x‖_ang = 1}. Theorem 2.1 in Lindskog et al. (2014) provides an equivalent pseudo-polar formulation of (3): for any X ∈ RV(F, a_n, Λ) and any uniformly continuous risk functional r not vanishing Λ-almost everywhere, there exist β > 0 and a measure σ_r on B(S_ang) such that Λ{x ∈ F : r(x) > v, x/‖x‖_ang ∈ B} = v^{−β} σ_r(B) for v > 0 and B ∈ B(S_ang) (4) (Lindskog et al., 2014, Corollary 4.4). The functional r(x) = sup_{s∈S} x(s), used by Rootzén & Tajvidi (2006) in a multivariate setting and by Ferreira & de Haan (2014) for continuous processes, implies that realizations of X(s) exceeding the threshold at any location s ∈ S will be labelled extreme, but this functional can only be used in applications where X(s) is observed throughout S. Therefore it may be preferable to use functions such as max_{s∈S̃} X(s) or max_{s∈S̃} X(s)/u(s), where S̃ ⊂ S is a finite set of gauged sites. Other risk functionals include ∫_S X(s) ds for the study of areal rainfall (Coles & Tawn, 1996), min_{s∈S̃} X(s)/u(s), and X(s_0) for risks impacting a specific location s_0. Although the choice of risk functional allows one to focus on particular types of extreme event, the choice of the angular norm ‖·‖_ang has no effect and is usually made for convenience.
Finally, for a common angular norm ‖·‖_ang, the angular measures of two risk functionals r_1 and r_2 that are strictly positive Λ-almost everywhere are linked through the underlying measure by σ_{r_i}(B) = Λ{x ∈ F : r_i(x) > 1, x/‖x‖_ang ∈ B} (i = 1, 2) (5). Equation (5) is useful for simulation and when we are interested in r_2-exceedances but inference has been performed based on r_1. All the previous definitions and results hold in finite dimensions, i.e., for D-dimensional random vectors, upon replacing ŵ-convergence by vague convergence (Resnick, 2007, § 3.3.5) on M_{R^D_+ \ {0}}, the class of Borel measures on B(R^D_+ \ {0}) endowed with the norm ‖·‖; see the PhD thesis of Thomas Opitz mentioned above.

2•3. r-Pareto processes
In this section, r denotes a functional that is nonnegative and homogeneous with α = 1. The r-Pareto processes (Dombry & Ribatet, 2015) are important for modelling exceedances and may be constructed as P = UQ, where U is a univariate Pareto random variable with pr(U > v) = 1/v^β (v ≥ 1) and Q is a random process, independent of U, with sample paths in S^r_ang = {x ∈ F : r(x) ≤ 1, ‖x‖_ang = 1} and probability measure σ_ang. The process P is then called an r-Pareto process with tail index β > 0 and angular measure σ_ang; to distinguish different Pareto processes, below we use the notation P^r_{β,σ_r} for P. An important property of r-Pareto processes is threshold invariance: for all A ∈ B({x ∈ F : r(x) ≥ 1}) and all u ≥ 1 such that pr{r(P) ≥ u} > 0, pr{u^{−1}P ∈ A | r(P) ≥ u} = pr(P ∈ A).
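The construction and the threshold-invariance property can be checked empirically in a toy finite-dimensional case. The sketch below is illustrative only: it takes r(x) = Σ_d x_d, draws the angular part Q uniformly on the unit simplex (an arbitrary choice of angular measure, not one arising from a particular limit), and verifies that r(P)/u given r(P) ≥ u is again unit Pareto.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, beta = 5, 200_000, 1.0

# angular part Q: drawn uniformly on the unit simplex, a purely illustrative
# choice of angular measure; with r(x) = sum(x), r(Q) = 1 exactly
q = rng.dirichlet(np.ones(D), size=N)

# radial part U: unit Pareto with tail index beta, pr(U > v) = v**(-beta)
u_rad = rng.uniform(size=N) ** (-1.0 / beta)

p = u_rad[:, None] * q        # P = U Q
r = p.sum(axis=1)             # r(P) = U here, since r(Q) = 1

# threshold invariance: conditionally on r(P) >= u, r(P)/u is again unit Pareto
u = 5.0
rescaled = r[r >= u] / u
emp = {v: float(np.mean(rescaled > v)) for v in (1.5, 2.0, 4.0)}
```

The empirical survival probabilities in `emp` should match the unit Pareto values 1/v, illustrating why exceedances over any higher threshold can be modelled by the same process.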
Furthermore, for X ∈ RV(F, a_n, Λ) with index β > 0 and for a risk functional r that is continuous at the origin and does not vanish Λ-almost everywhere, the distribution of the r-exceedances converges weakly to that of an r-Pareto process with tail index β and probability measure σ_r as defined in (4), i.e., pr{u_n^{−1}X ∈ · | r(X) ≥ u_n} → pr(P^r_{β,σ_r} ∈ ·) as n → ∞ (Dombry & Ribatet, 2015, Theorem 2). When working with a normalized process X*, the exponent measure Λ* of the limiting max-stable process Z* and the measure of the Pareto process are the same up to a coordinate transformation, as suggested by (4).
For two different risk functionals r_1 and r_2 with angular measures σ_{r_1} and σ_{r_2} derived from a common measure Λ ∈ M_F, the associated Pareto processes P^{r_1}_{β,σ_{r_1}} and P^{r_2}_{β,σ_{r_2}} are defined on different subsets of F, but, as suggested by (5), if there exists a threshold u_min such that sufficiently deep r_1-exceedances contain all r_2-exceedances, then samples of one process can be obtained from those of the other (6). Simulation of r-Pareto processes is feasible only for a few risk functionals, such as r_1(x) = ‖x‖_1, but (6) can be used to obtain samples of one process by transforming independent replicates x_1, …, x_N of the other. Finally, let σ_r be a probability measure on S^r_ang, and define the process max_{n∈N} U_n Q_n, where {U_n : n ∈ N} is a Poisson process on (0, ∞) with intensity βu^{−β−1} du and the Q_n are independent replicates from σ_r; this process is max-stable (7). Thus equation (7) connects an r-Pareto process and its max-stable counterpart.

2•4. Extreme value processes associated with log-Gaussian random functions
We focus on a class of r-Pareto processes based on log-Gaussian stochastic processes, whose max-stable counterparts are Brown-Resnick processes. This class is particularly useful, not only for its flexibility but also because it is based on Gaussian models widely used in applications. Chilès & Delfiner (1999, pp. 84-108) review these classical models.
Let Z be a zero-mean Gaussian process with stationary increments, i.e., a process whose semivariogram γ(s_1, s_2) = var{Z(s_1) − Z(s_2)}/2 (Chilès & Delfiner, 1999, p. 30) depends only on s_1 − s_2, and let σ²(s) = var{Z(s)}. If Z_1, Z_2, … are independent replicates of Z and {U_n : n ∈ N} is a Poisson process on (0, ∞) with intensity u^{−2} du, independent of the Z_n, then max_{n∈N} U_n exp{Z_n(s) − σ²(s)/2} is a stationary Brown-Resnick process with standard Fréchet margins, whose distribution depends only on γ (Kabluchko et al., 2009); such processes are max-stable. Let γ_θ denote a parameterized semivariogram whose parameter θ lies in a compact set Θ, and let σ²_θ denote the corresponding variance function.
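A minimal simulation sketch of the ingredients of this construction, under the power semivariogram used later in the paper: it builds the covariance of Z pinned to vanish at the origin, cov(Z_i, Z_j) = γ(s_i) + γ(s_j) − γ(s_i − s_j), and forms the spectral function exp{Z(s) − σ²(s)/2}, which has unit mean at every site; the grid and the jitter are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def gamma(h, tau=2.5, kappa=1.0):
    # isotropic power semivariogram, (||h|| / tau)^kappa / 2, as in Section 4
    return (np.linalg.norm(h, axis=-1) / tau) ** kappa / 2.0

# D locations on a small illustrative grid, with the origin o = (0, 0) first
s = np.stack(np.meshgrid(np.arange(5.0), np.arange(5.0)), axis=-1).reshape(-1, 2)
D = len(s)

# covariance of Z conditioned to vanish at the origin:
# cov(Z_i, Z_j) = gamma(s_i) + gamma(s_j) - gamma(s_i - s_j)
g_o = gamma(s)
cov = g_o[:, None] + g_o[None, :] - gamma(s[:, None, :] - s[None, :, :])
cov += 1e-9 * np.eye(D)          # jitter: the matrix is singular at the origin

z = rng.multivariate_normal(np.zeros(D), cov)
w = np.exp(z - np.diag(cov) / 2.0)   # spectral function, E{w(s)} = 1 at each s
```

Taking pointwise maxima of many such spectral functions weighted by Poisson points U_n would approximate the Brown-Resnick process itself.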
Let s_1, …, s_D be locations of interest in S. In the rest of the paper, x will denote an element of R^D_+, and Λ_θ(·) the finite-dimensional projection onto {s_1, …, s_D} of the measure Λ defined in (3). Then Λ_θ{y ∈ R^D_+ : y ≰ x} can be written as a sum of multivariate normal distribution functions (Huser & Davison, 2013) (9). The r-Pareto processes associated with log-Gaussian random functions are closely related to the intensity function λ_θ corresponding to the measure Λ_θ, which can be found by taking partial derivatives of Λ_θ with respect to x_1, …, x_D, yielding (Engelke et al., 2015) λ_θ(x) = (x_1² x_2 ⋯ x_D)^{−1} φ_{D−1}(x̃; Σ_θ) (10), where x̃ is the (D − 1)-dimensional vector with components {log(x_i/x_1) + γ_{i,1} : i = 2, …, D} and Σ_θ is the (D − 1) × (D − 1) matrix with elements {γ_{i,1} + γ_{j,1} − γ_{i,j}}_{i,j∈{2,…,D}}. Wadsworth & Tawn (2014) derived an alternative symmetric expression for (10) that will be useful in § 3•3, but (10) is more readily interpreted. Similar expressions exist for extremal-t processes (Thibaud & Opitz, 2015).
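The intensity in the form of Engelke et al. (2015) is straightforward to code. The sketch below assumes the power semivariogram of § 4 with illustrative parameters and one-dimensional locations, and checks the homogeneity λ_θ(tx) = t^{−(D+1)} λ_θ(x) implied by the −1-homogeneity of Λ_θ.

```python
import numpy as np
from scipy.stats import multivariate_normal

def power_gamma(h, tau=2.5, kappa=1.0):
    # power semivariogram evaluated at a (possibly scalar) lag h
    return (np.linalg.norm(np.atleast_1d(h)) / tau) ** kappa / 2.0

def log_intensity(x, coords):
    # log lambda_theta(x) = -2 log x_1 - sum_{i >= 2} log x_i
    #                       + log phi_{D-1}(x_tilde; Sigma_theta),
    # with x_tilde_i = log(x_i / x_1) + gamma_{i,1} and
    # Sigma_{ij} = gamma_{i,1} + gamma_{j,1} - gamma_{i,j}
    D = len(x)
    G = np.array([[power_gamma(coords[i] - coords[j]) for j in range(D)]
                  for i in range(D)])
    x_t = np.log(x[1:] / x[0]) + G[1:, 0]
    sigma = G[1:, 0][:, None] + G[0, 1:][None, :] - G[1:, 1:]
    return (multivariate_normal.logpdf(x_t, mean=np.zeros(D - 1), cov=sigma)
            - 2.0 * np.log(x[0]) - np.log(x[1:]).sum())

coords = np.array([0.0, 1.0, 2.5, 4.0])   # illustrative 1-D locations
x = np.array([1.2, 0.7, 2.0, 1.1])
log_lam = log_intensity(x, coords)
```

The only expensive ingredient is the (D − 1)-dimensional Gaussian density, i.e., a matrix inversion, which is what makes intensity-based inference cheap relative to censored likelihoods.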

INFERENCE FOR r-PARETO PROCESSES
3•1. Generalities
In this section, x_1, …, x_N are independent replicates of a D-dimensional r-Pareto random vector P with tail index β = 1, and y_1, …, y_N are independent replicates of a regularly varying D-dimensional random vector Y* with normalized margins.
As in the univariate setting, statistical inference based on block maxima and the max-stable framework discards information by focusing on maxima instead of on single events. Models for maxima are difficult to fit not only due to the small number of replicates, but also because the likelihood is usually too complex to compute in high dimensions (Castruccio et al., 2016). For the Brown-Resnick process, the full likelihood cannot be computed for D greater than around 10 (Huser & Davison, 2013), except in special cases. When the occurrence times of maxima are available, inference is usually possible up to D ≈ 30 (Stephenson & Tawn, 2005; Thibaud et al., 2016).
A useful alternative is composite likelihood inference (Padoan et al., 2010; Varin et al., 2011) based on subsets of observations of sizes smaller than D, which trades off a gain in computational efficiency against a loss of statistical efficiency. The number of possible subsets increases very rapidly with D, and their selection can be troublesome, though some statistical efficiency can be retrieved by taking higher-dimensional subsets. Castruccio et al. (2016) found higher-order composite likelihoods to be more robust than the spectral estimator, but in realistic cases these methods are limited to fairly small dimensions.
Estimation based on threshold exceedances and the Pareto process has the advantages that individual events are used, the likelihood function is usually simpler, and the choice of risk functional can be a means of tailoring the definition of an exceedance to the application. Equation (4) suggests that the choice of risk functional should not affect the estimates, but this is not entirely true, because the threshold cannot be taken arbitrarily high and the events selected depend on the risk functional. A careful choice of risk functional enables the detection of mixtures in the extremes and can improve subasymptotic behaviour, since the model is then fitted using only those observations closest to the chosen type of extreme event. For example, we might expect the extremal dependence of intense local rainfall events to differ from that of heavy large-scale precipitation, even in the same geographical region.
The probability density function of a Pareto process for r-exceedances over the threshold vector u ∈ R^D_+ can be found by rescaling the intensity function λ_θ by Λ_θ{A_r(u)}, yielding λ^r_{θ,u}(x) = λ_θ(x)/Λ_θ{A_r(u)} for x ∈ A_r(u) (11), where Λ_θ{A_r(u)} = ∫_{A_r(u)} λ_θ(y) dy (12) and A_r(u) is the exceedance region {x ∈ R^D_+ : r(x/u) ≥ 1}. Equation (11) yields the loglikelihood ℓ_r(θ) = Σ_{n=1}^N 1{r(x_n/u) ≥ 1}[log λ_θ(x_n) − log Λ_θ{A_r(u)}] (13), where division of vectors is componentwise and 1 denotes the indicator function. Maximization of (13) gives an estimator θ̂_r(x_1, …, x_N) that is consistent, asymptotically normal and efficient under mild conditions. Numerical evaluation of the D-dimensional integral Λ_θ{A_r(u)} is generally intractable for large D, though it simplifies for certain risk functionals; an example is r(x) = max_d x_d, for which the integral is a sum of multivariate probability functions; see (9). Similarly, Λ_θ{A_r(u)} does not depend upon θ when r(x) = D^{−1} Σ_d x_d (Coles & Tawn, 1991); we call the corresponding version of (13) the spectral loglikelihood and its maximizer the spectral estimator.
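As a univariate stand-in for the spectral trick, in which the normalizing constant cancels for rescaled exceedances, the following sketch fits the tail index of unit Pareto data by maximizing the loglikelihood of the rescaled r-exceedances; it is a toy analogue, not the Brown-Resnick fit.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)

# unit Pareto sample: the D = 1 analogue of a normalized process
x = rng.uniform(size=50_000) ** (-1.0)
u = np.quantile(x, 0.99)              # high threshold
exc = x[x >= u] / u                   # rescaled r-exceedances, with r(x) = x

def neg_loglik(theta):
    # exceedance density theta * z**(-theta - 1) on z >= 1: after rescaling
    # by u, no theta-dependent normalizing constant remains
    return -np.sum(np.log(theta) - (theta + 1.0) * np.log(exc))

theta_hat = minimize_scalar(neg_loglik, bounds=(0.1, 10.0), method="bounded").x
```

By threshold invariance, the rescaled exceedances are again unit Pareto, so the estimate should be near the true tail index 1 whatever the threshold chosen.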
In practice observations cannot be assumed to be exactly Pareto distributed; it is usually more plausible that they lie in the domain of attraction of some extremal process. As a consequence of Theorem 3.1 of de Haan & Resnick (1993), the asymptotic properties of θ̂_r(x_1, …, x_N) also hold for θ̂_r(y_1, …, y_N) as N → ∞ and u → ∞ with the number of exceedances N_u = o(N) → ∞; see § 3•3. However, the threshold u is finite, and so low components of y_n ∈ A_r(u) may lead to biased estimation. As it is due to model misspecification, this bias is unavoidable; moreover, it grows with D, so these methods can perform poorly, especially if the extremal dependence is weak, because it is then more likely that at least one component of y_n will be small (Engelke et al., 2015; Thibaud & Opitz, 2015; Huser et al., 2016). The bias can be reduced by a censored likelihood, proposed in the multivariate setting by Joe et al. (1992) and used for the Brown-Resnick model by Wadsworth & Tawn (2014) and for the extremal-t process by Thibaud & Opitz (2015). This method works well in practice but typically requires the computation of multivariate normal and t probabilities, which can be challenging in realistic cases if standard code is used. Some modest changes to the code to perform quasi-Monte Carlo maximum likelihood estimation with hundreds of locations are described in § 3•2.
For spatiotemporal applications, inference for r-Pareto processes must be performed using data from thousands of locations, and in § 3•3 we discuss an approach that applies to a wide range of risk functionals and is computationally fast, statistically efficient and robust with regard to finite thresholds.

3•2. Efficient censored likelihood inference
Censored likelihood estimation for extreme value processes associated with log-Gaussian random functions was developed by Wadsworth & Tawn (2014); it is based on (13) with max_d x_d/u_d as the risk functional, where any component lying below the threshold vector (u_1, …, u_D) > 0 is treated as censored. The corresponding estimator has a higher variance but a lower bias than the spectral estimator. When k components, say the first k, exceed their thresholds, the censored likelihood density function for the Brown-Resnick process is (Asadi et al., 2015) x_1^{−2}(x_2 ⋯ x_k)^{−1} φ_{k−1}(x̃_{2:k}; Σ̃_{2:k}) Φ_{D−k}(·) / Λ_θ{A_max(u)} (14), where x̃_{2:k} and Σ̃_{2:k} are subsets of the variables x̃ and Σ_θ in equation (10), and φ_{k−1} and Φ_{D−k} are the multivariate Gaussian density and distribution functions with mean zero. The argument and covariance matrix for Φ_{D−k} are those of the conditional Gaussian distribution of the censored components given the exceedances; Wadsworth & Tawn (2014) derived similar expressions. The estimator θ̂_cens is also consistent and asymptotically normal as N_u → ∞. For finite thresholds, θ̂_cens has been found to be more robust with respect to the presence of low components than the spectral estimator (Engelke et al., 2015; Huser et al., 2016), but it is awkward because of the potentially large number of multivariate normal integrals involved, thus far limiting its application to D ≤ 30 (Wadsworth & Tawn, 2014; Thibaud et al., 2016).
When maximizing the right-hand side of (14), the normalizing constant Λ_θ{A_max(u)} described in (8) and the multivariate normal distribution functions require the computation of multidimensional integrals. Theorem 7 of Geyer (1994) suggests that we approximate θ̂_cens by maximizing a Monte Carlo version ℓ̂^p_cens(θ) of the censored loglikelihood, in which Φ_{D−k} and Λ_θ{A_max(u)} are replaced by Monte Carlo estimates of the corresponding integrals based on p simulated samples, yielding a maximizer θ̂^p_cens that converges almost surely to θ̂_cens as p → ∞. Classical Monte Carlo estimation for multivariate integrals yields a probabilistic error bound that is O(ωp^{−1/2}), where ω = ω(φ) is the square root of the variance of the integrand φ. Quasi-Monte Carlo methods can achieve higher rates of convergence and thus improve computational efficiency while preserving the consistency of θ̂^p_cens. For estimation of multivariate normal distribution functions, Genz & Bretz (2009, § 4.2.2) advocate the use of randomly shifted deterministic lattice rules, which can achieve a convergence rate of order O(p^{−2+ε}) for some ε > 0. Lattice rules rely on regular sampling of the hypercube [0, 1]^D, taking points z_i = frac(i v/p + Δ) (i = 1, …, p), where frac(z) denotes the componentwise fractional part of z ∈ R^D, p is a prime number of samples in the hypercube [0, 1]^D, v ∈ {1, …, p}^D is a carefully chosen generating vector, and Δ ∈ [0, 1]^D is a uniform random shift. Fast construction rules have been developed to find an optimal v for given numbers of dimensions D and samples p (Nuyens & Cools, 2004). The existence of generating vectors achieving a nearly optimal convergence rate, with integration error independent of the dimension, has been proved, and methods for their construction are available (Dick & Pillichshammer, 2010).
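A randomly shifted rank-1 lattice rule is only a few lines of code. The sketch below uses an illustrative, non-optimized generating vector and a smooth periodic test integrand with known integral 1, a setting in which lattice rules are extremely accurate; it is a demonstration of the sampling scheme, not the multivariate-normal estimator of Genz & Bretz (2009).

```python
import numpy as np

rng = np.random.default_rng(4)

def shifted_lattice(p, v, shift):
    # rank-1 lattice rule: z_i = frac(i * v / p + shift), i = 1, ..., p
    i = np.arange(1, p + 1)[:, None]
    return np.mod(i * v / p + shift, 1.0)

# smooth periodic test integrand on [0, 1]^3 with known integral 1:
# f(z) = prod_d {1 + sin(2 * pi * z_d)}
f = lambda z: np.prod(1.0 + np.sin(2.0 * np.pi * z), axis=1)

p = 1021                        # prime number of lattice points
v = np.array([1, 364, 571])     # generating vector (illustrative, not optimized)
estimates = [f(shifted_lattice(p, v, rng.uniform(size=3))).mean()
             for _ in range(10)]           # 10 independent random shifts
qmc_estimate = float(np.mean(estimates))
```

Because the shift Δ is uniform, each point is marginally uniform on the cube, so the estimator is unbiased for any v; a well-constructed v then controls its variance.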
Our implementation of this approach applied to (14) and coupled with parallel computing is tractable for D of the order of a few hundred; see the Supplementary Material for details.Our algorithm can be extended to the extremal-t model by writing multivariate t probabilities in terms of the multivariate normal distribution function; see Genz & Bretz (2009) for more details.

3•3. Score matching
Classical likelihood inference requires either evaluation or simplification of the scaling constant θ {A r (u)}, whose complexity increases with the number of dimensions.Hence we seek alternatives that do not require its computation.
Let A be a sample space such as R^D_+, let P be a convex class of probability measures on A, and let X be a random variable with distribution F ∈ P. A proper scoring rule (Gneiting & Raftery, 2007) is a functional δ : P × A → R such that the expected score S(G, F) = E_F{δ(G, X)} satisfies S(G, F) ≥ S(F, F) for all G, F ∈ P, so that the true distribution minimizes the expected score. The scoring rule is said to be strictly proper if the divergence d_δ(G, F) = S(G, F) − S(F, F) equals 0 if and only if G = F, and under this hypothesis d_δ defines a divergence measure on P (Thorarinsdottir et al., 2013).
Let δ denote a strictly proper scoring rule, let {F_θ : θ ∈ Θ} ⊂ P be a parametric family of distributions, and let X_1, …, X_N be independent observations from F_θ_0. The first term of the divergence between F_θ and F_θ_0 can be estimated by N^{−1} Σ_{n=1}^N δ(F_θ, X_n), whose minimization over θ defines a consistent and asymptotically normal estimator of θ_0 under suitable regularity conditions (Dawid et al., 2016, Theorem 4.1). Consequently, for a risk functional r, the estimator obtained by minimizing the mean score over the r-exceedances, i.e., over the observations X_n with r(X_n/u) ≥ 1, is also consistent and asymptotically normal. Owing to de Haan & Resnick (1993, Propositions 3.1 and 3.2), these asymptotic properties can be generalized to samples from a regularly varying random vector with normalized marginals; see the Supplementary Material for the proof.
PROPOSITION 1. Let Y_1, …, Y_N be independent replicates of a regularly varying random vector Y* with normalized marginals and limiting measure Λ_θ_0, and let δ be a strictly proper scoring rule satisfying the conditions of Theorem 4.1 of Dawid et al. (2016). Then the minimizer of the mean score over the r-exceedances of Y_1, …, Y_N is consistent and asymptotically normal for θ_0 as N → ∞ and u → ∞ with N_u = o(N) → ∞.
Estimates of the Godambe information matrix K J^{−1} K can be used for inference, and the scoring rule ratio statistic, properly calibrated, can be used to compare nested models (Dawid et al., 2016, § 4.1).
The loglikelihood is the proper scoring rule associated with the Kullback-Leibler divergence. Although efficient, it is not robust, which is problematic for fitting asymptotic models such as Pareto processes, and a closed form for the normalizing coefficient Λ_θ{A_r(u)} defined in (12) is available only in special cases. The gradient scoring rule (Hyvärinen, 2005) uses only the derivative ∇_x log λ^r_{θ,u} and thus does not require computation of Λ_θ{A_r(u)}. Hyvärinen (2007) adapted this scoring rule for strictly positive variables, and we propose to extend it to any domain of the form A_r(u) = {x ∈ R^D_+ : r(x/u) ≥ 1}, using the divergence measure Δ(λ_θ, λ_θ_0) = ∫_{A_r(u)} ‖w(x) ⊗ {∇_x log λ_θ(x) − ∇_x log λ_θ_0(x)}‖² dF_θ_0(x) (15), where λ_θ is differentiable for all θ ∈ Θ on A_r(u) \ ∂A_r(u), with ∂A denoting the boundary of A, ∇_x is the gradient operator, w : A_r(u) → R^D_+ is a positive weight function, and ⊗ denotes the Hadamard product. If w is differentiable on A_r(u) and satisfies certain boundary conditions discussed in the Supplementary Material, then the scoring rule is strictly proper. The gradient score for a log-Gaussian Pareto process satisfies the regularity conditions of Theorem 4.1 in Dawid et al. (2016), so the resulting estimator θ̂_w is asymptotically normal.
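The mechanics of gradient scoring are easiest to see in one dimension. The sketch below applies the unweighted Hyvärinen score on the whole real line, not the weighted version on A_r(u) proposed here, to a Gaussian scale family: the empirical objective involves only ∂ log p and ∂² log p, so the normalizing constant of the density never appears, and its minimizer recovers the second moment.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.normal(scale=2.0, size=100_000)   # true variance sigma2 = 4

def grad_score(sigma2):
    # empirical Hyvarinen objective: mean of s'(x) + s(x)**2 / 2, where
    # s(x) = d/dx log p(x) = -x / sigma2 for the N(0, sigma2) density;
    # the normalizing constant of p never enters
    s = -x / sigma2
    ds = -1.0 / sigma2
    return np.mean(ds + 0.5 * s ** 2)

sigma2_hat = minimize_scalar(grad_score, bounds=(0.1, 50.0), method="bounded").x
```

For this family the minimizer has the closed form σ̂² = N^{−1} Σ_n x_n², which the numerical optimum should match; the weighted score of (15) adds the factor w to handle the boundary of A_r(u) and to downweight low components.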
For the Brown-Resnick model, two possible weight functions, w_1 and w_2, are given in (16), where r is a risk functional differentiable on R^D_+ and the threshold vector u lies in R^D_+. The weights w_1 are derived from Hyvärinen (2007), whereas w_2 is designed to approximate the effect of censoring by downweighting components of x_n near u. These weighting functions are well suited to extremes: a vector x ∈ A_r(u) is penalized if r(x/u) is close to 1, and low components of x induce low weights for the associated partial derivatives. For these reasons, inference using δ_w with the weighting functions in (16) should be more robust with respect to low components than is the spectral estimator. The estimator θ̂_w is much cheaper to compute than θ̂_cens and can be obtained for any risk functional differentiable on R^D_+. Expressions for the gradient score for the Brown-Resnick model can be found in the Supplementary Material, and the performances of these inference procedures are compared in § 4.
The gradient score can be applied to any extremal model with a multivariate density function whose logarithm is twice differentiable away from the boundaries of its support, and if discontinuities are present on this support, then a carefully chosen weighting function w ensures the existence and the consistency of the score.Indeed, similar expressions can be derived for the extremal-t model, though choices for the weight functions are more restricted: w 2 satisfies the boundary conditions, but w 1 does not ensure that the score is proper.

4•1. Exact simulation
The inference procedures and simulation algorithms described herein are contained in an R package, mvPot (de Fondeville, 2017; R Development Core Team, 2018).
We first illustrate the feasibility of high-dimensional inference by simulating r-Pareto processes associated with log-Gaussian random functions at D locations in S = [0, 100]². Details of the algorithm can be found in the Supplementary Material.
We use an isotropic power semivariogram, γ(s, s′) = (‖s − s′‖/τ)^κ / 2, shape parameters κ = 0•5, 1, 1•3, 1•8, and scale parameter τ = 2•5, chosen such that the dependence models defined on S cover strong to weak extremal dependence. For this simulation, the dependence model with κ = 1•8 requires us to work on the log scale to avoid rounding errors. For each simulation, N = 10 000 r-Pareto processes, with r(x) = D^{−1} Σ_d x_d, were simulated on regular 10 × 10, 20 × 10 and 20 × 15 grids. The grid size was restricted to at most 300 locations for ease of comparison with the second simulation study. For the gradient score, we use the weight functions w_1 and w_2 of (16). The components of the threshold vector u are set equal to the empirical 0•99 quantile of r(x_1), …, r(x_N), giving N_u = 100. For censored likelihood inference, we use the approach described in the Supplementary Material with p = 10 and threshold u equal to the empirical 0•99 quantile of max_d x_d^1, …, max_d x_d^N, so that the conditions for (6) are satisfied. One hundred replicates are used in each case.
Table 1 displays the relative root mean squared error for estimation based on the censored loglikelihood and the gradient score with weights w_1 and w_2, compared to that based on the spectral estimator.
Table 1. Relative root mean squared error (%) for comparison of estimates based on the censored loglikelihood, left, and the gradient score with weights w_1, middle, and w_2, right, relative to spectral estimates, for the parameters κ and τ = 2•5. Efficiency of 100% would correspond to the spectral estimator, and smaller values to less efficient estimators. Inference is performed using the top 1% of 10 000 Pareto processes with semivariogram γ(s, s′) = (‖s − s′‖/τ)^κ / 2 simulated on regular 10 × 10, 20 × 10 and 20 × 15 grids
For all the methods and parameter combinations, bias is negligible and performance is driven mainly by the variance. As expected, efficiency is lower than 100%, because when simulating and fitting from the true model the spectral estimator performs best. The gradient score and censored likelihood estimators deteriorate as the extremal dependence weakens and the number of low components in the simulated vectors increases. The gradient score outperforms the censored likelihood except when censoring is low, i.e., when κ = 0•5. The performance of censored likelihood estimation deteriorates as D increases, suggesting that the gradient score will be preferable in high dimensions. These results are not realistic, however, since the data are simulated from the model being fitted, whereas in practice the model is used as a high-threshold approximation to the data distribution.
The optimization of the likelihood based on the spectral density and gradient score functions takes only a dozen seconds, even for the finest grid. The same random starting point is used for each optimization to ensure fair comparison. Estimation using the censored approach takes several minutes and slows greatly as the dimension increases; see the Supplementary Material.

4•2. Domain of attraction
As the asymptotic regime is never reached in practice, we now compare the robustness of the different inference procedures for finite thresholds. The Brown-Resnick process belongs to its own max-domain of attraction, so its peaks-over-threshold distribution converges to a generalized Pareto process with log-Gaussian random function. We repeat the simulation study of § 4•1 with 10 000 Brown-Resnick processes and the same parameter values. This simulation uses the algorithm of Dombry et al. (2016) and is computationally expensive, so we used only 300 variables. It took around three hours with 16 cores to generate N = 10 000 samples on the finest grid.
Table 2 shows the results. As expected when the model is misspecified, the relative root mean squared error is mainly driven by bias, which increases with the shape κ and the dimension D. Spectral estimation is on the whole outperformed by both of the other methods. For κ = 0•5, the three methods show fairly similar overall performance, with the censored likelihood better at capturing the shape parameter, while the gradient score does better for the scale parameter.
The moderate extremal dependence cases, with κ = 1 and 1•3, are dominated by the censored likelihood, whereas for weak extremal dependence, κ = 1•8, the gradient score performs best, because too much information is lost by censoring. For the 100-point grid, the optimization procedures do not converge when the extremal dependence is too weak. The choice of the weighting function w affects the robustness of the gradient score. Computation times are similar to those in § 4•1.
Quantile-quantile plots show that the score-matching estimators are very close to being normally distributed, but censored likelihood estimates can deviate somewhat from normality due to the quasi-Monte Carlo approximation; this can be remedied by increasing the value of p.
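A normal quantile-quantile check of the kind described above can be sketched as follows (the function name and the plotting-position convention (i − 0•5)/n are our choices, not the paper's):

```python
import numpy as np
from statistics import NormalDist  # standard-library normal quantiles

def qq_normal_points(estimates):
    """Return (theoretical, empirical) quantile pairs for a normal
    Q-Q plot of standardized estimates; points near the line y = x
    indicate approximate normality of the estimator."""
    z = np.sort(np.asarray(estimates, dtype=float))
    z = (z - z.mean()) / z.std()              # standardize the estimates
    n = len(z)
    q = np.array([NormalDist().inv_cdf((i - 0.5) / n)
                  for i in range(1, n + 1)])   # plotting positions
    return q, z
```

A strong linear correlation between the two returned vectors indicates approximate normality.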
To summarize: for strong extremal dependence, the three types of estimator are roughly equivalent. For moderate extremal dependence, we recommend using the censored likelihood if the number of variables permits; this is about D ≤ 500 with our computational capabilities, although if extremal independence is reached at large distances and the grid is dense, the gradient score is an excellent substitute. Owing to its robustness and lack of dimensionality limitations, the gradient score appears to be the best choice for gridded applications with fine resolution. Empirical work suggests that it can be robustified by careful design of the weight function.

5•1. Real-data application
We fit an r-Pareto process based on the Brown-Resnick model to radar measurements of rainfall taken every 15 minutes during the wet season, June-September, from 1999 to 2004 on a regular 2 km grid in a 120 km × 120 km region of east Florida; see Fig. 1. There are 3600 spatial observations in each radar image, and 70 272 images in all. The region was chosen to repeat the application of Buhl & Klüppelberg (2016), but in a spatial setting only; a spatiotemporal model is beyond the scope of the present paper. Buhl & Klüppelberg (2016) analysed daily maxima for 10 km × 10 km squares, but we use nonaggregated data to fit a nonseparable parametric model for spatial extremal dependence, using single extreme events instead of daily maxima.
The marginal distributions for each grid cell were first locally transformed to unit Pareto using their empirical distribution functions.For general application, where we wish to extrapolate the distribution above observed intensities, a model for the marginal distributions of exceedances is needed, but since our goal here is to illustrate the feasibility of dependence model estimation on dense grids, we regard marginal modelling as outside the scope of this study.
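A minimal sketch of the rank-based transformation to unit Pareto margins (the function name and the n + 1 denominator, used to keep transformed values finite, are our choices):

```python
import numpy as np

def to_unit_pareto(x):
    """Transform one grid cell's observations to approximately unit
    Pareto margins via the empirical distribution function: if
    U = F(X) is uniform on (0, 1), then 1/(1 - U) is unit Pareto."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1   # ranks 1..n (ties broken by order)
    u = ranks / (n + 1.0)                   # empirical c.d.f. values in (0, 1)
    return 1.0 / (1.0 - u)                  # unit Pareto scale
```

Applied independently to each of the 3600 grid cells, this yields margins on a common scale without any parametric assumption.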

5•2. Multivariate extremal dependence model
The spatial model of Buhl & Klüppelberg (2016) is fully separable, i.e., it is a sum of two separate semivariograms. This has the advantage that inference for each direction can be performed separately, but it cannot capture anisotropy that does not follow the axes of the grid, i.e., that is not in the south-north or east-west directions. Furthermore, their pairwise likelihood approach focuses on short-distance pairs, and so could mis-estimate dependence at longer distances. To better capture possible anisotropy, we use the nonseparable semivariogram

γ(s, s′) = {‖Ω(s − s′)‖/τ}^κ,    (9)

where Ω is a 2 × 2 anisotropy matrix determined by a rotation angle η and a scaling factor a. The semivariogram γ yields asymptotic extremal independence as the distance between sites tends to infinity, i.e., the pairwise extremal coefficient increases to 2 as ‖s − s′‖ → ∞.
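As a numerical sketch of such a semivariogram (the explicit rotation-plus-scaling form of the anisotropy matrix below is one common parameterization, assumed here for illustration; η = 0 and a = 1 recover the isotropic model):

```python
import numpy as np

def semivariogram(s1, s2, tau, kappa, eta=0.0, a=1.0):
    """Power semivariogram gamma(s, s') = (||Omega (s - s')|| / tau)^kappa,
    with Omega a rotation by angle eta composed with a scaling of the
    second row by a -- an illustrative anisotropy parameterization."""
    h = np.asarray(s1, float) - np.asarray(s2, float)
    omega = np.array([[np.cos(eta), -np.sin(eta)],
                      [a * np.sin(eta), a * np.cos(eta)]])
    return (np.linalg.norm(omega @ h) / tau) ** kappa
```

When a = 1 the matrix is a pure rotation, which leaves the norm unchanged, so the model is then isotropic whatever the value of η, consistent with the difficulty of identifying η when the estimate of a is close to 1.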
To apply the peaks-over-threshold approach, we must define exceedances by choosing risk functionals. We focus on two types of extremes: locally very intense rainfall at any point of the region, and high cumulative rainfall over the whole region. We therefore take the risk functionals r_max and r_sum. The function r_max is a differentiable approximation to max_d X(s_d), which cannot itself be used with the gradient score because it is not differentiable; censored likelihood is computationally out of reach with so many locations. Directly summing the normalized observations X* makes no physical sense, so our function r_sum, which selects extreme events with large spatial extent, first transforms the data back towards the original scale; we take ξ0 = 0•114, the average of independent local estimates of the shape of a generalized Pareto distribution.
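The paper's exact definitions are not reproduced above, so the following is a hedged sketch of functionals with the stated properties: an ℓ_ν-type smooth approximation to the maximum (the exponent ν is our illustrative choice) and a sum after a tail back-transform through the shape ξ0:

```python
import numpy as np

XI0 = 0.114  # average of independent local generalized Pareto shape estimates

def r_max(x, nu=20.0):
    """Differentiable approximation to max_d x_d: the l_nu functional
    (sum of nu-th powers, nu-th root) tends to the maximum as
    nu -> infinity and is differentiable for finite nu."""
    x = np.asarray(x, dtype=float)
    return np.sum(x ** nu) ** (1.0 / nu)

def r_sum(x_star, xi0=XI0):
    """Sum of unit Pareto values mapped back towards the original
    rainfall scale via the power transform x -> x**xi0 (illustrative);
    this selects events with large spatial accumulation."""
    x_star = np.asarray(x_star, dtype=float)
    return np.sum(x_star ** xi0)
```

Both functionals are homogeneous of some positive order, as the theory of r-Pareto processes requires.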
We fitted univariate generalized Pareto distributions to r_sum(x*_n) and r_max(x*_n) (n = 1, . . ., 70 272) with increasing thresholds. The estimated shape parameters are stable around the 99•9 percentile, which we used for event selection, giving 59 exceedances for each risk functional; only two events were extreme with respect to both. This threshold may appear rather high, but it corresponds to around 10 events per year, which seems reasonable in light of the timeframe. Since we merely illustrate the feasibility of high-dimensional inference, we treat the exceedances as independent, but in practice temporal declustering should be considered.
Optimization of the gradient score with the w_1 weighting function on a 16-core cluster took 1-6 hours, depending on the initial point. Different initial points must be considered because of the possibility of local maxima. Results are shown in Table 3, where standard deviations are obtained using a jackknife procedure with 20 blocks. Both the estimated bias and the variance are fairly low. For r_sum(x*_n), we obtain a model similar to that of Buhl & Klüppelberg (2016).
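A delete-one-block jackknife of the kind used for the standard deviations can be sketched as follows (a one-dimensional stand-in: `data` plays the role of the selected events and `estimator` the role of the gradient score fit; both names are ours):

```python
import numpy as np

def block_jackknife_se(data, estimator, n_blocks=20):
    """Delete-one-block jackknife standard error: split the data into
    contiguous blocks, re-estimate with each block left out, and
    combine the leave-one-out estimates in the usual jackknife way."""
    blocks = np.array_split(np.asarray(data, dtype=float), n_blocks)
    theta_loo = np.array([
        estimator(np.concatenate([b for j, b in enumerate(blocks) if j != i]))
        for i in range(n_blocks)
    ])
    theta_bar = theta_loo.mean()
    return np.sqrt((n_blocks - 1) / n_blocks
                   * np.sum((theta_loo - theta_bar) ** 2))
```

Each leave-one-out fit reuses most of the data, so the cost is roughly n_blocks times that of a single fit, which matters when one fit already takes hours.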
The estimated parameters differ appreciably for the two risk functionals, suggesting the presence of a mixture of different types of extreme events. The structure for r_max is consistent with the database, in which the most intense events tend to be spatially concentrated. Our model suggests higher dependence at middle distances than was found by Buhl & Klüppelberg (2016), but they did note that their model underestimates dependence, especially for high quantiles. The estimated smoothness parameters are very close, and neither estimate of η differs significantly from zero, as imposed by Buhl & Klüppelberg (2016). For r_sum, the estimated parameters show strong extremal dependence even at long distances, corresponding to exceedances of accumulated rainfall with large spatial cover. As â ≈ 1, there is no evidence of anisotropy.
As a diagnostic for the dependence structure, we consider the conditional exceedance probabilities π_ij = pr{X(s_j) > u | X(s_i) > u}, which under the fitted model are determined by the semivariogram γ_ij for sites s_i and s_j (i, j = 1, . . ., 3600), as defined in (9), and by the threshold u > 0. An empirical estimator π̂_ij of π_ij is the proportion of events exceeding u at s_i that also exceed u at s_j; its asymptotic behaviour derives from Davis & Mikosch (2009). For both risk functionals, the fitted model, represented by the solid black lines in Fig. 2, follows the cloud of estimated conditional exceedance probabilities reasonably well and captures the general trend, but fails to represent some local variation, perhaps owing to a lack of flexibility of the power model.

Finally, we use the models fitted in § 5•2 to simulate events with intensities equivalent to the weakest of the 59 events found by our risk functionals. Simulation is performed by generating the corresponding r-Pareto process with the fitted dependence structure, as in § 4•1. Figure 1 compares observations from the database with representative simulations, which appear to reproduce both the spatial dependence and the intensity of the selected observations. Closer examination suggests that both models produce oversmooth rainfall fields. This could be addressed by improving event selection using risk functionals r that characterize particular spatial structures or physical processes. Although we fail to detect anisotropy, more complex dependence models that allow stochasticity of the spatial patterns might be worth considering.
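The empirical estimator of the conditional exceedance probability can be sketched as (the function name is ours):

```python
import numpy as np

def cond_exceed_prob(xi, xj, u):
    """Empirical estimator of pi_ij = pr{X(s_j) > u | X(s_i) > u}:
    among events exceeding u at site i, the proportion that also
    exceed u at site j. Returns nan if site i never exceeds u."""
    xi = np.asarray(xi, dtype=float)
    xj = np.asarray(xj, dtype=float)
    above_i = xi > u
    if not above_i.any():
        return float("nan")
    return float(np.mean(xj[above_i] > u))
```

Computing this for every pair and plotting it against the distance ‖s_i − s_j‖ yields the clouds of points compared with the fitted curves in Fig. 2.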
These models can reproduce both the spatial patterns and the extreme intensities of spatially accumulated and locally heavy rainfall. In both cases the fitted dependence model provides a reasonable fit, and simulations seem broadly consistent with observations. However, the presence of two contrasting dependence structures highlights the complexity of extreme rainfall and suggests that a mixture model for both dependence and margins might be considered. Marginal and dependence parameters are often estimated separately, but in the presence of mixtures, which can be detected using different risk functionals, joint estimation is required; this is beyond the scope of this paper. For this reason, and because we have neglected temporal dependence, our model should be viewed as merely a first step towards a spatiotemporal rainfall generator.
Fig. 1. Fifteen-minute accumulated rainfall in inches, observed (first row) and simulated (second row) for the risk functionals r_sum and r_max, with an intensity equivalent to the 59th most intense event.

Fig. 2. Estimated conditional exceedance probabilities π_ij for the risk functionals (a) r_sum and (b) r_max as functions of the distance between locations s_i and s_j (i, j = 1, . . ., 3600). In each panel the solid black curve represents the model fitted using gradient score estimation.

Table 2. As Table 1 but with inference based on the top 1% of 10 000 simulated Brown-Resnick processes

Table 3. Parameter estimates (with standard errors in parentheses) for an r-Pareto process derived from log-Gaussian random functions with the semivariogram γ(s, s′) = {‖Ω(s − s′)‖/τ}^κ, obtained by maximization of the gradient score for events corresponding to the 59 highest exceedances of the risk functionals r_sum and r_max for the Florida radar rainfall data