Learning Stochastic Closures Using Ensemble Kalman Inversion

Although the governing equations of many systems, when derived from ﬁrst principles, may be viewed as known, it is often too expensive to numerically simulate all the interactions within the ﬁrst principles description. Therefore researchers often seek simpler descriptions that describe complex phenomena without numerically resolving all the interacting components. Stochastic differential equations (SDEs) arise naturally as models in this context. The growth in data acquisition, both through experiment and through limited ﬁrst principles simulations, provides an opportunity for the systematic derivation of SDE models in many disciplines. However, inconsistencies between SDEs and real data at small time scales often cause problems, when standard statistical methodology is applied to parameter estimation. The incompatibility between SDEs and real data can be addressed by deriving sufﬁcient statistics from the time-series data and learning parameters of SDEs based on these. Following this approach, we formulate the ﬁtting of SDEs to sufﬁcient statistics from real data as an inverse problem and demonstrate that this inverse problem can be solved by using ensemble Kalman inversion (EKI). Furthermore, we create a framework for non-parametric learning of drift and diffusion terms by introducing hierarchical, reﬁne-able parameterizations of unknown functions, using Gaussian process regression. We demonstrate the proposed methodology for the ﬁtting of SDE models, ﬁrst in a simulation study with a noisy Lorenz ’63 model, and then in other applications, including dimension reduction starting from various deterministic chaotic systems arising in the atmospheric sciences, large-scale pattern modeling in climate dynamics, and simpliﬁed models for key observables arising in molecular dynamics. The results conﬁrm that the proposed methodology provides a robust and systematic approach to ﬁtting SDE models to real data.


Overview and Literature Review
The goal of this paper is to describe a straightforward ensemble based methodology that facilitates parameter estimation in ergodic stochastic differential equations (SDEs), using statistics derived from time-series data.SDEs arise naturally as models in many disciplines, and the wish to describe complex phenomena without explicitly representing all of the interacting components within the system make them of widespread interest.Additionally, the increasing acquisition of data provides opportunities for the learning of stochastic models in previously unforseen application domains.However, SDE models, while often accurate at providing statistical predictions, may not be compatible with available data at the small scales where the Itô calculus model asserts very strong almost sure properties rarely found in real data.For this reason, standard statistical methodology for parameter estimation in SDEs is often not suited to fitting models to real data.On the other hand, practical experience in the applied sciences demonstrates the effectiveness of fitting Markovian stochastic processes to sufficient statistics, typically derived from a long time series by averaging procedures.Ensemble methods have demonstrable success in the solution of inverse problems, based on the use of interacting particle systems driven by the parameter-to-observable map; furthermore, empirically they are robust to noisy evaluations of the map.Choosing ergodic averages as observables and viewing finite-time averages, or averages over different initializations, as noisy evaluations of the ergodic average, puts us in a setting where we may apply ensemble methods to effectively estimate parameters in SDEs.These ensemble methods are derivativefree, side-stepping thorny technical and computational issues arising in the parameter-to-solution map for SDEs; and they are inherently parallelizable, and scale well to the learning of high dimensional parameters, making them a very attractive computational tool.
The statistics literature concerning parameter estimation from continuous time series often starts from the premise that the diffusion coefficient may be read off from small increments (see, e.g., Kutoyants, 2013), a property that only holds if the data are consistent with an SDE at arbitrarily fine scales.The fact that data may be inconsistent with the SDE at small scales was understood in the finance literature in the early part of this century, and new models were introduced to address this incompatibility (see, e.g., Zhang et al., 2005).In the papers (see, e.g., Pavliotis & Stuart, 2007;Papavasiliou et al., 2009) the setting of multiscale SDEs was used to elucidate similar phenomena arising from models in the physical sciences.Subsequent to these works, new methods were introduced to tackle the inconsistency at small scales, based around subsampling and other multiscale uses of the data (see, e.g., Zhang et al., 2006;Papaspiliopoulos et al., 2012;Pokern et al., 2009); see Pavliotis et al. (2012) for an overview of this work.The potential applicability of problems of this type goes beyond the applications in econometrics, finance and molecular dynamics, highlighted in the preceding references, and into problems in the geophysical sciences (Cotter et al., 2009;Kwasniok & Lohmann, 2009;Ying et al., 2019).A completely different approach to circumvent the use of fine-scale properties of the time series is to compute sufficient statistics from the time-series, and use these to learn parameters.This approach was pioneered as a systematic methodology, and then applied, in the series of papers (see, e.g., Krumscheid et al., 2013Krumscheid et al., , 2015;;Kalliadasis et al., 2015), based on statistics computed from multiple realizations and multiple initial conditions; alternatively one can use a single very long time-series.The use of statistics of time series (e.g., moments of autocorrelation functions) for estimating model parameters is common for discrete time series models, such as autoregressive (AR) or autoregressive moving average (ARMA) models (see, e.g., Lütkepohl, 2013;Brockwell et al., 1991).Another approach to parameter estimation in stochastic dynamical systems, which avoids the incompatiblity at small scales, is to augment the state to include parameters, and then use state estimation techniques such as the particle filter (see, e.g., Smith, 2013) or the unscented Kalman filter (see, e.g., Julier et al., 2000;Albers et al., 2017); however, this approach is notoriously sensitive in many practical settings (see, e.g., Doucet et al., 2001) and we do not pursue it here.In this paper we build on the approach pioneered in Krumscheid et al. (2013) and combine it with the ensemble Kalman approach to inverse problems, a methodology introduced in the oil industry (see, e.g., Chen & Oliver, 2002;Emerick & Reynolds, 2013) and formulated as a general methodology for inverse problems in Iglesias et al. (2013) and Albers et al. (2019); we refer to the methodology as ensemble Kalman inversion (EKI).Within this parameter estimation methodology, we employ ideas from Gaussian process regression (GPR) (see, e.g., Rasmussen & Williams, 2006) to parameterize unknown functions; this refineable, hierarchical approach to function representation leads to novel nonparametric methods for function learning.Our approach builds on preceding, nonhierarchical approaches to inversion using GPR such as that described in Xiao et al. (2016).The concept of learning the values at some fixed nodes of a Gaussian process, known as pilot point method (see, e.g., Doherty et al., 2010), has also been extensively explored by the groundwater modeling community in the context of inverse problems.

Our Contributions
Our contributions in this paper are as follows: • we formulate parameter estimation in ergodic SDEs as a classical inverse problem for the parameterto-data map, which is available to us through noisy evaluations, and we apply the EKI to this formulation; • within the EKI we demonstrate the utility of hierarchical parameterizations of unknown functions, using GPR; • we demonstrate the methodology on a simulation study that employs a noisy (SDE) version of the Lorenz '63 model; • we study reduction of the Lorenz '63 ODE model to a two dimensional SDE in co-ordinates computed by applying PCA to the ODE data; • we study the multiscale Lorenz '96 ODE model, seeking an SDE closure in the slow variables alone; • we apply the methodology to fit a stochastic delay differential equation (SDDE) to El NioSouthern Oscillation data; • we apply the methodology to fit an SDE that describes fluctuations in the dihedral angle of a butane molecule.
In considering these simulation studies and real-data examples, we demonstrate the effectiveness of the methodology to find parameters, and we evaluate the accuracy, or lack of fidelity, of various stochastic models derived from data.In section 2, we formulate the inverse problem of interest, and introduce four example problems to which we will apply our methodology.Section 3 describes the ensemble Kalman based methodology we employ to solve the inverse problem, as well as a discussion of the novel hierarchical Gaussian process based representation that we employ to represent, and learn, unknown functions.In section 4, we describe numerical results relating to each of the four example problems.We conclude in section 5.

Problem Formulation
The aim of this work is to estimate parameters θ ∈ Θ in the SDE where x ∈ R n , f : R n × Θ → R n and Σ : R n × Θ → R n×n .For all of the examples Θ ⊆ R p , but the algorithms we use extend to include the nonparametric setting in which p = ∞.We showcase a GPR based method for hierarchical function representation which is refineable.We assume that the SDE is ergodic and let E denote expectation with respect to the stationary process resulting from this ergodicity.If x(•; θ ) ∈ X := C(R + ; R n ) denotes a solution of the SDE started in a statistical stationary state and F : X → R J is a function on the space of solution trajectories, then define G : Θ → R J by We wish to solve the inverse problem of finding θ from noisy evaluations of G (θ ).The noisy evaluations arise from the fact that we will use finite-length trajectories to approximate the expectation defining G (θ ); averages over initial conditions and/or realizations of the noise could also be used.The following remark highlights the various observables F that we will use in this paper.
REMARK 2.1 We will use m th −moments of vector x at time t = 0: where M is a subset of cardinality m comprising indices (repetition allowed) from {1, • • • , n}, leading to the ergodic average G m .We will also use F ac (x(•)) = x(t) ⊗ x(0), leading through ergodic averaging to the auto-correlation function G ac of the stationary process.And finally we will use G psd to denote parameters of a polynomial fit to the logarithm of the power spectral density; recall that the power spectral density is the Fourier transform of the auto-correlation function of the stationary process.All of the functions G m , G ac and G psd can be approximated by time-averaging.
We now describe four examples that will be used to illustrate the methodology.
EXAMPLE 2.1 (Lorenz 63 System) The Lorenz equations (see, e.g., Lorenz, 1963) are a system of three ordinary differential equations taking the form where x = [x 1 , x 2 , x 3 ] and f : R 3 → R 3 is given by (2.3) We are interested in the noisy version of these equations, in the form This SDE will be used in a simulation study to illustrate our methodology.
We will also use the Lorenz 63 model (2.2), (2.3) written in a new coordinate system computed by means of PCA, and introduced in Selten (1995) and Palmer (2001).This amounts to introducing new coordinates a = Ax, in which we obtain ȧ = g(a), (2.5) where x = [x 1 , x 2 , x 3 ] and g : R 3 → R 3 is given by (2.6) The co-ordinate a 3 contains around 4% of the total variance of the system.In Palmer (2001), this was used as an argument to seek a stochastic dimension reduction in the model, in discrete time, in the variables a 1 , a 2 alone.We reinterpret this idea in continuous time and use our methodology to evaluate the idea, using data from (2.5), (2.6) to study the fidelity possible when fitting an SDE of the form where functions are represented by Gaussian process regression as described in detail subsection 3.2.EXAMPLE 2.2 (Lorenz 96 System) The Lorenz 96 multiscale system (see, e.g., Lorenz, 1996) describes the evolution of two sets of variables, denoted by x k (slow variables) and y j,k (fast variables): x k+K = x k , y j,k+K = y j,k , y j+J,k = y j,k+1 . (2.8) The coupling term hcy k describes the impact of fast dynamics on the slow dynamics with y k being the average (2.9) We work with the parameter choices as in Schneider et al. (2017).Specifically, we choose K = 36, J = 10, h = 1, and F = c = b = 10.By assuming a spatially homogeneous (with respect to j) equilibrium in the fast dynamics, fixing k and x k as in Fatkullin & Vanden-Eijnden ( 2004), we obtain (2.10) We will seek a stochastic closure for the slow variables {x k } encapsulating systematic deviations from, and random fluctuations around, this simple balance: (2.11) Using data from (2.8) we will fit a parameterized ψ(•) and the noise level σ .The aim is to fit a stochastic delay differential equation (SDDE) (see, e.g., Buckwar, 2000;Erneux, 2009) to the ENSO data shown in Figure 1 (time-series and histogram).To be concrete we will fit a model of the following form: (2.12) Here delay τ 1 represents the effect of the eastward traveling Kelvin wave, and delay τ 2 represents the effect of the westward traveling Rossby wave.Since good estimates for τ 1 and τ 2 exist, we will simply fit the four parameters a, b, c, σ to data from the left-hand panel in Figure 1.
EXAMPLE 2.4 (Butane Molecule Dihedral Angle) In many problems arising in molecular dynamics, it is of interest to determine reduced models describing the behaviour of certain functionals derived from the molecular conformation.An example of such a functional is the dihedral angle in a simple model of a butane molecule (see, e.g., Schlick, 2010).Fig. 2 shows how the dihedral angle is defined from the four atoms which comprise the butane molecule.Fig. 3 shows the time series and histrogram for this angle, derived from a molecular dynamics model which we now describe.The model comprises a system of twelve second order SDEs for the 4 atomic positions in R 3 , of Langevin type: (2.13) The potential V , mass m 0 , damping γ 0 and inverse temperature β 0 characterize the molecule.From the time series of x ∈ C(R + ; R 12 ) generated by this model we fit a simplified model for the the dihedral angle.This simplified model takes the form (2.14) We will fit parameterized versions of γ and Ψ to data for φ generated by studying the time-series for the dihedral angle defined by x from (2.13).Related work, fitting SDE models for the dihedral angle, may be found in

Algorithms
In subsection 3.1, we describe the ensemble-based derivative-free optimization method that we use to fit parameters, the EKI algorithm.In subsection 3.2, we discuss how we use GPR to design hierarchical, refineable parameterizations of unknown functions that we wish to learn from data.

Ensemble Kalman Inversion (EKI)
Recall that the data we are given, y ∈ R J , is a noisy evaluation of the function G (θ ).We use the notation G(θ ) to denote this noisy evaluation; recall that we typically envision the noisy evaluation coming from time-averaging.Appealing to central limit theorem type results that quantify rates of convergence towards ergodic averages, we assume that G (θ ) − G(θ ) is Gaussian.Then the inverse problem can be formulated as follows: given y ∈ R J , find θ ∈ Θ so that Estimating Γ may achieved by using time-series of different lengths, as explained in Cleary et al. (2020) in the specific case of G derived from moments, F m .
The ensemble Kalman inversion (EKI) algorithm we employ is as described in Iglesias et al. (2013) and Albers et al. (2019).The algorithm propagates a set of J parameter estimates {θ ( j) n } J j=1 through algorithmic time n, using the update formula The matrix C GG n is the empirical covariance of {G(θ The EKI algorithm as stated preserves the linear span of the initial ensemble {θ ( j) 0 } J j=1 for each n and thus operates in a finite dimensional vector space, even if Θ is an infinite dimensionsal vector space.
The natural objective function associated with the inverse problem (3.1) is The two primary reasons for using EKI to solve the SDE-based inverse problem that is ou focus here are: (a) it does not require derivatives, which are difficult to compute in this setting of SDEs, and nonetheless provably, in the linear case (see, e.g., Schillings & Stuart, 2017), and approximately, in the nonlinear case (see, e.g., Garbuno-Inigo et al., 2020) behaves like a projected gradient descent with respect to objective (3.3);(b) it is robust to noisy evaluations of the forward map as shown in Duncan et al. (2020).
More generally the EKI approach also has the advantage of being inherently parallelizable and of scaling well to high dimensional problems.

Gaussian Process Regression (GPR)
Throughout this paper we apply the EKI algorithm in a finite-dimensional vector space Θ ⊆ R p .To estimate an unknown function, we use Gaussian process regression.More precisely, we will use the mean m(x) of a Gaussian process and optimize over the mean values of the process at a fixed set of design points, over the covariance of the noise in the observations used to define the Gaussian process, and over the parameters describing the covariance function of the Gaussian process.This leads to a parameter dependent function m(x; θ ), the construction of which we now detail.Note that the construction involves probabilistic considerations but, once completed, leads to a class of deterministic functions m(x; θ ) over θ ∈ Θ .The key advanatge of the GPR construction is that it leads to a hierarchical parameterization that has proved very useful in many machine learning tasks.Adapting it to statistical estimation more generally is potentially very fruitful, and is one of the ideas we use here.We are essentially proposing to use Gaussian process parameterizations beyond the simple regression setting, into more general function learning problems.
The set of parameters θ contains the following elements: (i) noisily observed values of m = {m(x (i) )} at some fixed nodes x (i) , denoted θ = {θ i }; (ii) observation errors Σ obs = diag(σ (i) ) at the fixed nodes x (i) , based on the assumption that θ = m + √ Σ obs ξ for some ξ ∼ N(0, I); (iii) hyper-parameters (σ , ), respectively an amplitude and a length-scale, of the kernel k(x, x ) used in the GP regression.
The mean function m(x) minimizes over K, the reproducing kernel Hilbert space associated with covariance function k(x, x ; σ , ).By the representer theorem (see, e.g., Rasmussen & Williams, 2006), it follows that the minmizer lies in the linear span of {k(x, x (i) ; σ , )} I i=1 } and may be found by solving a linear system of dimension equal to the number of design points (the dimension of θ ).The result is, however, a complex nonlinear function of θ = (θ , Σ obs , σ , ).It is these parameters that are then learnt from data.Throughout this paper, we take Σ obs to be a constant (which enters the vector θ ) multiple of the identity.

Numerical Results
To demonstrate the capability of the proposed methodology, we apply it to four different examples, including classical chaotic systems (Lorenz 63 system in subsection 4.1, Lorenz 96 system in subsection 4.2), climate dynamics (ENSO in subsection 4.3), and molecular dynamics (butane molecule dihedral angle in subsection 4.4).All these numerical studies confirm that the proposed methodology serves well as a systematic approach to the fitting of SDE models to data.In all cases, the data are initially presented in two figures, one showing the ability of the EKI method to fit the data, and a second showing how well the fitted SDE performs in terms of reproducing the invariant measure.We then show other figures which differ between the different cases and are designed to illustrate the nature of the fitted SDE model.

Lorenz 63 System
Unlike the other examples in this paper, this initial simulation study involves data that come directly from an SDE within the model class being fitted.Specifically, the data are obtained by simulating the model with a given set of parameters: α = 10, ρ = 28, β = 8/3, and σ = 10, as well as the choice g L (x 2 ) = x 2 .The purpose of this example is to illustrate our computational methodology in a perfect test setting.Using EKI we will fit θ defined in two different ways: (i) Fix g L (x 2 ) = x 2 and learn θ = (α, σ ).Results are presented in Figs. 4 to 5.
(ii) Parameterize g L (x) by θ 1 and learn θ = (θ 1 , σ ).Results are presented in Figs. 6 to 8. The data in this case are a finite-time average approximation of {G 1 (x), G 2 (x)}, i.e., the first and second moments of the state vector x are used as observational data.Therefore, the data vector y has 9 elements in total.In case (i) we fit both an ODE (constrain σ = 0) and an SDE (learn σ ) to the given data.Thus, we are fitting 1 parameter in the ODE case and 2 parameters in the SDE setting, in both cases using a data vector y of dimension 9.The comparison of the output of the EKI algorithm with the true data, in the case of both ODE and SDE fits, is presented in Fig. 4. In the ODE case the algorithm fails to match the second moment (left panel), while in the SDE case all moments are well-matched.This has a significant effect on the ability to reproduce the invariant measure (Fig. 5).In the left panel (ODE), the fit is very poor whereas in the right panel (SDE) it is excellent.The fit of the SDE is not surprising since the data is generated directly from within the model class to be fit; the behaviour of the fit to an ODE gives insight into the method when the model is misspecified.We now perform exactly the same set of experiments, fitting both an ODE and an SDE, but allowing the function g L to be learnt as well.The function g L is parameterized by the mean of a GP (with unknown values specified at 5 fixed nodes, and constant observation error and hyper-parameters).Thus we are fitting 8 parameters for the ODE and 9 parameters for the SDE, using a data vector y of dimension 9.When we do this, we are able to substantially improve the fit of the ODE in both data space, seen in Fig. 6a, and as measured by the fit to the invariant measure, as see in Fig. 7a-c.Nonetheless the fit achieved by using an SDE remains substantially better, as seen in Fig. 6b and Fig. 7d-f.The ensemble mean of Gaussian process learnt in the ODE and SDE models are presented in Fig. 8.Both GPs successfully capture the linear trend in the middle range of x 2 , while the GP learnt in the ODE model exhibits more oscillations around the middle part and more rapid deviation from the linear trend at both ends.
Truth EKI  We fit a two-dimensional SDE model of the form (2.7) to data generated from the first two components of the three-dimensional ODE model (2.5), (2.6).The four unknown functions (ψ 1 (•), ψ 2 (•), σ 1 (•), σ 2 (•)) are parameterized by the mean of Gaussian process regression.Each GPR involves the mean values at five fixed nodes, a stationary observation error, and the stationary hyper-parameters (σ G P , ).The precise form of the data in this case is a finite time average approximation of {G m (a), G ac (a)}.Specifically, the first four moments of the state vector a are used in this case.In addition, nine equally spaced data points are used from a finite time average approximation of the autocorrelation function G ac (a) for both states a 1 and a 2 , from an interval of 10 time units, and we obtain 18 data points in total from the autocorrelation function.Thus we are fitting 32 parameters for the SDE by using a data vector y of dimension 27.The true moment data and autocorrelation data are presented in Fig. 9, alongside the results obtained by using EKI to fit the SDE model to this data, showing a relatively good match to the data.When measured by the ability to reproduce the invariant measure, the results demonstrate a fairly strong match between the marginal invariant densities on a 1 and a 2 from the original three-dimensional ODE and from the fitted two-dimensional SDE (Fig. 10).The fit to the autocorrelation functions of a 1 and a 2 is also quite good (Fig. 11).Fig. 12 compares the trajectories of the three-dimensional ODE (2.5), (2.6), projected on a 1 and a 2 , with those of the fitted SDE (2.7); the difference in smoothness of the solutions of ODEs and SDEs is apparent at that level, although it can be seen that the fitted SDE model shows a similar pattern of irregular switching between the two distinct components of the attractor in the true system.
Despite the seemingly simple formulation of this example as outlined in section 2, it is a rather difficult problem.On the one hand, the standard Lorenz 63 system (2.5), (2.6) is well known as a classical chaotic system.On the other hand, chaos cannot arise in the reduced-order 2D model of the form (2.7) in a deterministic setting where the noise levels σ i are set to zero, by the Poincaré-Bendixson theorem.Palmer (see, e.g., Palmer, 2001) showed that a reduced-order 2D model of the Lorenz 63 system in discrete time could exhibit qualitative features of the original 3D system, if subjected to additive noise, but it remained an unsolved problem to fit a stochastic 2D model that quantitatively captures detailed chaotic behavior of the original 3D chaotic system.This motivated us to introduce both more unknown parameters and more observed statistics as data, compared to the previous simulation study (4.1.1).In so doing, we have demonstrated some success in fitting an SDE model to data from the ODE.We face similar challenges in the remaining examples in this section, where the data are generated from a more complex model than that to which it is fitted, or indeed is simply observed data.

Lorenz 96 System
We generate data from the slow variable {x k } K k=1 in (2.8), and use it to fit a parameterized function ψ(•) and parameter σ appearing in (2.11).In the experiments presented below, we take K = 36.The form of the data in this case is finite-time averaged approximations of {G 1 (x), G 2 (x)}, i.e., observations of the first and second moments of the state vector x = {x k } n k=1 , for some n K.The model discrepancy term ψ(X k ) in (2.11) is parameterized by the mean of a GP with mean values at 7 fixed nodes, and constant observation error and hyper-parameters.Thus, we are fitting 10 parameters for the ODE and 11 for the SDE, using a data vector y of dimension 44 (when only observing the first 8 slow variables) or of dimension 702 (when observing all 36 slow variables).Our numerical results illustrate several interesting properties of stochastic closures for the Lorenz 96 multiscale system: a) we show that for the relatively large time scale separation of c = 10 it is possible to fit both an accurate ODE (σ = 0) and SDE (σ > 0), in the sense that both ODE and SDE fitted models (2.8) accurately reproduce the invariant measure of the full system (2.8), using only n = 8; b) we show that with weaker scale separation of c = 3, using n = 8 leads to an ODE fit which is very  poor, as measured by reproduction of the invariant measure, whereas the SDE fit, whilst not as good as for c = 10, is passable; c) we show that by increasing n to equal K = 36, the ODE fit does not improve, but for the SDE it becomes excellent.
Results for case a) are presented in Figs 13 to 14.It can be seen in Fig. 13 that both the fitted ODE model and SDE model show almost perfect agreements with the true system in data space.Furthermore, both fitted ODE model and SDE model agree well with the true system when we compare the invariant measure with that of the underlying data-generating model, in Fig. 14.Although the agreement is slightly better for the fitted SDE model in Fig. 14b, the performances of fitted ODE and SDE models are quantitatively close to each other.The good performance of the fitted ODE model can be explained by invoking the averaging hypothesis to propose a closed ODE model of the form (2.11).The data implying the form of the closure ψ is as presented in Fig. 15a.Specifically, the scattering of the true closure term is moderate for c = 10, indicating that a deterministic closure of slow variables may well achieve good agreement with the true system.Results for case b) are presented in Figs 15 to 17.The averaging hypothesis does not apply so cleanly in this case where c = 3, as can be seen by comparing the moderate scattering in the data when c = 10 (Fig. 15a), with that obtained when c = 3 (Fig. 15b).In the second setting, it turns out that the fitted ODE model is far less satisfactory than the fitted SDE model.As seen in Fig. 16, the agreement with the true system in data space is still comparable between the fitted ODE and SDE models.However, studying the ability of the two models to fit invariant measures in Fig. 17, we see that the fitted ODE model concentrates around several discrete values, indicating a lack of chaotic behavior in long time.On the contrary, the fitted SDE model better captures the pattern of the invariant measure of the true system, although there is still room for improvement in this case.Results for case c) are presented in Figs.18 to 20.With the goal of improving the model fit, we augment the data to include all first and second moments of the slow variables.Figure 18 shows that the SDE fit achieves better agreement than the ODE fit in data space.As presented in Fig. 19, the fitted ODE model still tends to concentrate toward a small number of discrete values in the long-time behavior, while the invariant measure of fitted SDE model demonstrates excellent agreement with the true system.The issue of the fitted ODE/SDE models in failing to maintain, or to maintain, correct chaotic behavior for a long time can be more clearly seen in the time series presented in Fig. 20.

El NiñoSouthern Oscillation (ENSO)
In this subsection we use the time-series data for sea surface temperature (SST) T shown in Fig. 1a to fit an SDDE of the form (2.12), learning the four parameters a, b, c, σ .Recall that the delays τ 1 and τ 2 may be viewed as known properties of Rossby and Kelvin waves.The precise form of the data in this case is a finite-time average approximation of {{G m (T )} 4 m=1 , G ac (T ), G psd (T )}.Thus we are fitting 4 parameters in the SDDE by using a data vector y of dimension 16.The true moment data are presented in Fig. 21a.To use the autocorrelation function G ac (T ) as data, we sample 9 points from it with an interval of six months as presented in Fig. 21b.We also use the coefficients of a second-order polynomial fit to the logarithm of the power spectral density; the three coefficients are presented in Fig. 21c It can be seen in Fig. 21 that the fitted SDE model shows a good agreement with true ENSO statistics, with the exception of the two highest moments.Looking at the invariant measures presented in Fig. 22, we see evidence that the fitted SDE model can potentially capture the long time behavior of ENSO and thus help us better understand the ENSO mechanism as well.The satisfactory performance of the fitted SDE model is also confirmed by further studying the time series in Fig. 23, where we can see that the simulated time series from the fitted SDE model captures well the statistical behavior of ENSO data.

Butane Molecule Dihedral Angle
We fit the second-order Langevin equation (2.14) model to data derived from (2.13).The precise form of the data in this case is a finite-time average approximation of {{G m (φ )} 4 m=1 , G ac (φ ), G psd (φ )}, where φ denotes the dihedral angle.Thus, similarly to the previous subsection, we use the first four moments of φ , and the true moment data are presented in Fig. 24a.Furthermore, G ac (φ ) is approximated using 9 points sampled from the autocorrelation function with an interval of 50 fs, and these sampled data points are presented in Fig. 24b.Finally, we also use the coefficients of a second-order polynomial that fits to the logarithm of the power spectral density, and the three coefficients are presented in Fig. 24c.On the other hand, we are learning scalars γ and σ in (2.13), together with the potential Ψ constructed from Gaussian basis functions (with length scale fixed as 0.5) centered at 9 points evenly distributed in [−π, π].Thus, we are fitting 11 parameters for the SDE by using a data vector y of dimension 16. Results showing the fitted SDE are presented in Figs.24 to 27.
Figure 24 shows that the fitted SDE model achieves very good agreement with all the statistics of true dihedral angle data.More importantly, we can see in Fig. 25 that the fitted SDE model also leads to an invariant measure that agrees well with that of the true dihedral angle data.
In order to further validate the fitted SDE model, and in particular to show that it successfully captures the transition behavior and frequency information of the true data, we further study the time series (Fig. 26) and autocorrelation function (Fig. 27) by simulating the fitted SDE model for a long

Conclusions
Although computing power has increased dramatically in the past several decades, so too has the complexity of models that we'd like to simulate.It is still infeasible to resolve all the interactions within the true system in many applications.SDEs arise naturally as models in many disciplines, even if the first principles governing equations are deterministic, because stochasticity can effectively model unresolved variables.However, standard statistical methodology for parameter estimation is not always suitable for fitting SDE models to real data, due to the frequently observed inconsistency between SDEs and real data at small time scales.In this work, we exploit the idea of using sufficient statistics found by finite-time averaging time series data.Using these statistics, we demonstrated that an SDE model, with the unknown functions being parameterized by Gaussian process regression (GPR), can be fitted to the data using enemble Kalman inversion (EKI).
Ensemble methods are particularly well-suited to this problem for several reasons: they are derivativefree thereby sidestepping computational and technical issues that arise from differentiating SDEs with respect to parameters; they are inherently parallelizable; and they scale well to high-dimensional problems.The novel hierarchical GPR-based function approximation that we use meshes well with the EKI methdology.
Future directions of interest in this area include the derivation of a systematic approach to the determination of sufficient statistics, analysis of the EKI algorithm for learning in the context of these problems, and analysis of the use of GPR-based function representation in nonlinear inverse problems and the further use of the methodology to new problems arising in applications.
FIG. 1: Illustration of ENSO data from year 1870 to 2019 (the time interval between two adjacent data points is one month).
FIG.3: Illustration of the true butane dihedral angle data (the time interval between two adjacent data points is 10 −6 ns).
4.1.1Noisy Lorenz 63: Simulation Study.The first set of experiments are simulation studies in which data from (2.4) is used to fit parameters within the following model: FIG.4: First two moments of state x for the Lorenz 63 system found by using EKI to estimate α (ODE case) and σ , α (SDE case).
FIG.5: Invariant measures of fitted models compared with those of the true noisy Lorenz 63 system: ODE case, fit α; SDE case, fit σ , α.
FIG.6: First two moments of state x for the Lorenz 63 system by using EKI to estimate σ and the linear function g L (x 2 ).
FIG. 8: The linear function g L (x 2 ) learnt in (a) ODE model and (b) SDE model.
FIG. 9: Comparison of observation data between the true system and the EKI-fitted SDE model for the deterministic Lorenz 63 system.
FIG. 10: Invariant measures of the deterministic Lorenz 63 system and the fitted reduced-order SDE model.
FIG. 11: Autocorrelation of the deterministic Lorenz 63 system and the fitted reduced-order SDE model.
FIG.12: Time series of the deterministic Lorenz 63 system and the fitted reduced-order SDE model.
FIG.13: Comparison of observation data between the true Lorenz 96 system (c = 10) and the fitted ODE and SDE models.
FIG. 14: Comparison of invariant measures between the true Lorenz 96 system (c = 10) and the fitted ODE and SDE models.
FIG. 15: True closure term of slow variables with different temporal scale separations between slow and fast variables.
FIG. 16: Comparison of data between the true Lorenz 96 system (c = 3) and the fitted ODE and SDE models.Only first 8 slow variables are used to compute the statistics.
FIG. 18: Comparison of data between the true Lorenz 96 system (c = 3) and the fitted ODE and SDE models.All 36 slow variables are used to compute the first and second moments as data (only the first moment data is presented here).
FIG. 20: Comparisons of time series of slow variable X 1 by using (a) 8 observed slow variables with ODE, (b) 8 observed slow variables with SDE, and (c) 36 observed slow variables with SDE.The time range over which we collect data for EKI is T = 100.
FIG. 21: The comparison of observed quantities, including (a) the first four moments, (b) the autocorrelation function, and (c) the coefficients of fitted second-order polynomial of power spectrum density.
FIG. 22:The comparison of the invariant measures of SST between true ENSO data and fitted SDE model.
FIG. 23: The comparison of time series of SST between true ENSO data and one realization of the fitted SDE model.
FIG. 26: Time series of butane molecule dihedral angle.