Evolutionary full-waveform inversion

SUMMARY We present a new approach to full-waveform inversion (FWI) that enables the assimilation of data sets that grow over time without the need to reinvert all data. This evolutionary inversion rests on a reinterpretation of stochastic Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS), which randomly exploits redundancies to achieve convergence without ever considering the data set as a whole. Specifically for seismological applications, we consider a dynamic mini-batch stochastic L-BFGS, where the size of the mini-batches adapts to the number of sources needed to approximate the complete gradient. As an illustration we present an evolutionary FWI for upper-mantle structure beneath Africa. Starting from a 1-D model and data recorded until 1995, we sequentially add contemporary data into an ongoing inversion, showing how (i) new events can be added without compromising convergence, (ii) a consistent measure of misfit can be maintained and (iii) the model evolves over time as a function of data coverage. Though applied retrospectively in this example, our method constitutes a possible approach to the continuous assimilation of seismic data volumes that often tend to grow exponentially.


INTRODUCTION
Seismic tomography has come a long way since its early applications (e.g. Aki & Lee 1976; Dziewonski et al. 1977). Increasing computational resources enable increasingly sophisticated methods, for instance, finite-frequency tomography (e.g. Dahlen et al. 2000; Friederich 2003; Yoshizawa & Kennett 2005), joint inversions of multiple wave types (e.g. Ritsema et al. 1999; Chang et al. 2010; Koelemeijer et al. 2017) or Bayesian approaches (e.g. Trampert et al. 2004; Bodin & Sambridge 2009; Mosca et al. 2012; Gebraad et al. 2020). Recently, it has become increasingly feasible to perform 3-D full-waveform inversion (FWI) at various scales (e.g. Chen et al. 2007; Tape et al. 2010; French & Romanowicz 2014; Bozdağ et al. 2016; Simutė et al. 2016; Fichtner et al. 2018). While FWI may eliminate most simplifications of wave physics and thereby improve resolution, it also comes with significant computational cost because each iterative model update requires forward and adjoint simulations, at a cost scaling with frequency to the power of 4 (Virieux & Operto 2009; Pirli 2017). Additionally, Fresnel zones of higher-frequency waves are narrower, and as a result more data are needed to constrain the earth model sufficiently.
As increasing computational power allows us to invert higher-frequency data, the amount of data also increases roughly exponentially. For example, the IRIS data archive grows ∼25 per cent per year (http://ds.iris.edu/data/distribution/). Hence, earth models should ideally be updated regularly to incorporate this new wealth of information. For ray tomography, such evolutionary approaches were pioneered by Debayle et al. (2016) at global scale and Tong et al. (2017) at regional scale. These authors expand their data set sequentially and then reinvert the larger data set to obtain an update. While evolutionary inversion would also be desirable for FWI, periodic reinversions of a growing data set are computationally out of reach.
Here we propose a strategy to efficiently incorporate information from new seismic recordings into an evolving FWI model. This is based on a stochastic L-BFGS method where newly acquired data contribute to quasi-random mini-batches, driving the next update. We demonstrate this strategy with an example, where we start from a 1-D model and a small data set from the early 1990s, and show how our approach can be used to dynamically expand this data set as new recordings become available. The result is an evolving and up-to-date model of the Earth which retains previous information while consistently incorporating new data.

METHODOLOGY
Primarily developed for machine learning applications where data sets are too large to be assimilated all at once (e.g. Bottou 2010; Byrd et al. 2011), stochastic gradient descent (SGD) has recently been adapted for seismic data inversion with the main goal to reduce computational cost (e.g. van Leeuwen & Herrmann 2013). Stochastic L-BFGS is an extension of SGD which utilizes curvature information and has been applied in medical imaging (Boehm et al. 2018). In the following paragraphs we demonstrate how stochastic L-BFGS can be reinterpreted as a tool that enables evolutionary model updating. Because of the unique challenges encountered with regional- to continental-scale data sets, such as irregular data coverage, we will focus on the dynamic mini-batch stochastic L-BFGS proposed by van Herwaarden et al. (2020), which is specifically tuned to harness redundancies in seismological data sets, and which we briefly recapitulate for completeness in the following paragraphs.

Stochastic L-BFGS for evolutionary model construction
In most FWI approaches, a master data set of events is carefully curated and kept constant over the course of an inversion. In contrast, dynamic mini-batch FWI quasi-randomly selects, for each iteration, a smaller subset (a mini-batch) of events that contribute significant information to the gradient direction. This is illustrated schematically in Fig. 1. To maximize coverage, the first mini-batch is constructed with Mitchell's best-candidate algorithm (Mitchell 1991), which chooses the next event to be the one furthest from the previously selected ones.
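The best-candidate selection can be sketched as follows. This is a minimal, hypothetical implementation that assumes events are given as 2-D coordinates compared with a squared Euclidean distance; a real seismological implementation would use epicentral (great-circle) distances, and the authors' actual code may differ in its details.

```python
import random

def best_candidate_selection(events, batch_size, n_candidates=32, seed=0):
    """Mitchell-style best-candidate sampling: repeatedly pick, from a random
    pool of candidates, the event farthest from everything chosen so far."""
    rng = random.Random(seed)
    remaining = list(events)
    # Seed the selection with one random event.
    chosen = [remaining.pop(rng.randrange(len(remaining)))]
    while len(chosen) < batch_size and remaining:
        candidates = rng.sample(remaining, min(n_candidates, len(remaining)))
        # Distance from a candidate to the selected set = distance to its
        # nearest already-chosen member (squared Euclidean, for illustration).
        def dist_to_chosen(e):
            return min((e[0] - c[0])**2 + (e[1] - c[1])**2 for c in chosen)
        best = max(candidates, key=dist_to_chosen)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Because each new event maximizes its distance to the existing selection, the resulting mini-batch spreads out over the source region, which is the coverage-maximizing behaviour the text describes.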
To consistently monitor waveform misfit over the course of an inversion in which the mini-batch composition changes continuously, a control group subset is carried over to the next iteration. Control group events are selected among the mini-batch events based on the requirement that the angle between the control group and the mini-batch gradient, defined through the scalar product of the two, must be below a defined threshold, empirically set to 22.5° (van Herwaarden et al. 2020). The number of control group events needed to meet the threshold may vary, thereby indicating whether the mini-batch is too small or too large. The size of the subsequent mini-batch is empirically set to twice the size of the current control group, meaning that the mini-batch size adapts dynamically. For subsequent iterations, events not in the current control group are replaced, again using Mitchell's best-candidate algorithm to incorporate new data. If no unseen data exist, events are selected quasi-randomly (van Herwaarden et al. 2020), whereby events that made a significant contribution to a past mini-batch's gradient direction have a higher probability of being selected again. This quasi-random event selection essentially penalizes frequent iterations with events that either have comparatively few recordings or have already converged.
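The angle criterion can be made concrete with a short sketch. The angle computation via the scalar product follows directly from the text; the greedy growth of the control group (largest gradient norm first) is an illustrative assumption of ours, not necessarily the ordering used by van Herwaarden et al. (2020).

```python
import math

def gradient_angle(g_a, g_b):
    """Angle in degrees between two gradient vectors, via their scalar product."""
    dot = sum(a * b for a, b in zip(g_a, g_b))
    norm_a = math.sqrt(sum(a * a for a in g_a))
    norm_b = math.sqrt(sum(b * b for b in g_b))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (norm_a * norm_b)))))

def select_control_group(event_gradients, g_batch, threshold_deg=22.5):
    """Grow the control group until its summed gradient points within
    threshold_deg of the full mini-batch gradient."""
    # Illustrative heuristic: consider events with large gradient norms first.
    order = sorted(event_gradients, key=lambda g: -sum(x * x for x in g))
    control_sum = [0.0] * len(g_batch)
    group = []
    for g in order:
        control_sum = [c + x for c, x in zip(control_sum, g)]
        group.append(g)
        if gradient_angle(control_sum, g_batch) <= threshold_deg:
            break
    return group
```

The number of events the loop consumes before meeting the threshold is exactly the diagnostic described above: a large control group suggests the mini-batch carries little redundant information, a small one suggests it could shrink.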
Once a mini-batch and a control group have been selected, the model updating proceeds with the help of a trust-region L-BFGS method, as illustrated in the lower panel of Fig. 1. The original L-BFGS method (Nocedal 1980; Nocedal & Wright 1999) approximates the inverse Hessian for the complete data set in each iteration using the gradients of the previous m iterations, where m is typically below 10. The inverse Hessian approximation is then used for a quasi-Newton iteration. While the original L-BFGS sequence of gradients is deterministic, that is, fully controlled by the initial model, the dynamic mini-batch approach produces a random sequence of gradients for the approximation of the inverse Hessian, because quasi-random subsets of the complete data set are used to compute gradient approximations.
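The core of the L-BFGS update can be sketched with the standard two-loop recursion. The snippet below is a generic textbook version operating on plain lists; it omits the trust-region safeguards and the stochastic gradient bookkeeping of the actual method, and all names are ours.

```python
def lbfgs_direction(grad, s_list, y_list):
    """Standard L-BFGS two-loop recursion: apply the implicit inverse-Hessian
    approximation, built from the last m pairs (s, y) = (model step, gradient
    change), to the current gradient."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    rhos = [1.0 / dot(s, y) for s, y in zip(s_list, y_list)]
    q = list(grad)
    alphas = []
    # First loop: newest to oldest pair.
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * dot(s, q)
        alphas.append(alpha)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    # Initial Hessian guess: scale by gamma from the most recent pair.
    gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    r = [gamma * qi for qi in q]
    # Second loop: oldest to newest pair.
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r  # quasi-Newton direction approximating H^{-1} grad
```

In the deterministic setting each (s, y) pair comes from the same full-data-set misfit; in the dynamic mini-batch variant the gradients entering these pairs are stochastic approximations, which is precisely the reinterpretation the paper builds on.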
Although the dynamic mini-batch approach was originally designed to improve computational efficiency, it also allows us to implement evolutionary FWI. In fact, within the dynamic mini-batch framework there is no difference between an event held out of previous mini-batch iterations and an event that had not yet been included in the master data set. Hence, as data from new events become available, we can simply add the events to the collection of candidate data from which the dynamic mini-batch algorithm can sample. By generally using newly added events as candidates for the next mini-batch, the inclusion of previously unseen data is prioritized. In this sense, the quasi-random acquisition of new seismic data, for instance, through the installation of new stations, becomes part of the quasi-random selection of the mini-batch for the next iteration.
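The prioritization of unseen events can be sketched in a few lines. This is our own minimal illustration: unseen events fill the mini-batch first, and the remainder is drawn with probabilities weighted by past gradient contribution, as described above. The exact weighting used by van Herwaarden et al. (2020) may differ.

```python
import random

def draw_mini_batch(unseen, seen_weights, batch_size, seed=0):
    """Fill the next mini-batch with previously unseen events first, then top
    up with a weighted random draw over already-used events, where a larger
    past gradient contribution gives a higher selection probability."""
    rng = random.Random(seed)
    batch = list(unseen)[:batch_size]
    pool = [e for e in seen_weights if e not in batch]
    while len(batch) < batch_size and pool:
        weights = [seen_weights[e] for e in pool]
        pick = rng.choices(pool, weights=weights, k=1)[0]
        batch.append(pick)
        pool.remove(pick)
    return batch
```

Newly acquired events enter simply by appearing in `unseen` for the next iteration, which is all the bookkeeping evolutionary updating requires in this framework.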

Example problem and setup
To illustrate how dynamic mini-batch methods can perform on an expanding data set, we use the example of a continental-scale FWI of the African continent, consider a time frame from the early 1990s to the present, and show how the model and data set evolve over time. The initial data set contains 46 events and 1452 unique source–receiver pairs. It results from the selection of events that occurred between 1990 and 1995 and have magnitudes between 5.5 and 6.7, a range that provides useful signal-to-noise ratios while avoiding finite-source effects. Adhering to the same magnitude range, we gradually expand the data set over the course of the inversion to finally include 428 events and 147 903 unique source–receiver pairs for events between 1990 and 2017. The majority of the data has been recorded with three-component seismometers. All of the data have been acquired through the publicly available International Federation of Digital Seismograph Networks (FDSN; Romanowicz & Dziewonski 1986). Both the initial and final data sets are shown in Fig. 2.
We perform forward and adjoint modelling using the spectral-element wave propagation solver Salvus (Afanasiev et al. 2019). The starting model is the transversely isotropic Preliminary Reference Earth Model (PREM) (Dziewonski & Anderson 1981), and topography and ellipticity are accounted for. We apply Gaussian smoothing to the model in the radial direction with a standard deviation of 50 km. This is done for two reasons: (1) we do not wish to impose a strong prior on the presence and location of discontinuities, and (2) it reduces the computational cost by eliminating small crustal elements, without introducing cycle-skipping problems for the long-period (65–120 s) waveforms that we model and invert. While our method is not tied to a specific misfit function, the inversion presented here employs time-frequency phase misfits (Fichtner et al. 2008). To prevent subwavelength structure from entering the model, the gradients are smoothed by convolving them with a Gaussian with a standard deviation of 200 km.
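For a regularly gridded gradient, this smoothing step could be sketched as below. This is an assumption-laden simplification: Salvus works on unstructured spectral-element meshes, where smoothing is typically realized differently (e.g. via a diffusion equation), so the snippet only illustrates the principle of converting a physical smoothing length into a filter width.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_gradient(grad, grid_spacing_km, sigma_km=200.0):
    """Convolve a gridded gradient with a Gaussian of physical width sigma_km,
    suppressing sub-wavelength structure; sigma is converted to grid cells."""
    return gaussian_filter(grad, sigma=sigma_km / grid_spacing_km)
```

Because the filter only redistributes amplitude, the integrated gradient is preserved while short-wavelength oscillations that the 65–120 s data cannot constrain are removed.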
The stochastic gradient inversion begins with the selection of 10 out of 46 events for the first mini-batch. We then add data at several stages, as indicated by the vertical lines in Fig. 3. The number of source–receiver pairs contributing to the model grows with the amount of available data. Furthermore, as shown by the blue dots in Fig. 3, the composition of the mini-batches changes in each iteration, with priority given to new data, which are more likely to bring independent information. Towards the end of the inversion, the average mini-batch size has grown to around 50 events, indicating that more events are needed to approximate the gradient of the whole data set.

RESULTS
Since we have simulated a journey of an evolutionary inversion through time, we show the model at several stages in Fig. 4. Each stage has been influenced by a unique number of seismic recordings available until that point in time, through a varying number of mini-batch iterations. As the inversion progresses and more data are incorporated, several geologic features, such as the West African Craton and the Cameroon Volcanic Line, become more pronounced.

Figure 1. In passing from iteration n to n + 1 the data set may be extended, and a new mini-batch is selected, giving preference to new data. After computing misfits and gradients for the mini-batch, a new control group is selected to monitor the misfit reduction of updates. In the application presented in the following sections, we empirically chose the mini-batch to be twice as large as the control group. Other choices are possible, provided they produce satisfactory results.

Figure 2. Data summary in the form of great-circle paths connecting earthquakes (beach balls) and stations (black triangles). Left: the initial data set with events from 1990-1995 contains 1452 unique source–receiver pairs. Right: the final data set with events from 1990-2017 has grown to 147 903 unique source–receiver pairs over the course of the inversion.
The blue line in Fig. 5 shows the L2-norm difference between the SV velocity (VSV) in the current and initial models. This difference increases quickly as large-scale velocity structure is recovered during the first few iterations. Afterwards, there are fewer changes, except around iteration 200, when significantly more stations started to record earthquakes in Africa. The inversion is carried out in a fixed frequency band. This was a deliberate choice, intended to emphasize the impact of data volumes on the resulting model.
Stochastic gradient descent in general, and the dynamic mini-batch approach in particular, are primarily designed to assimilate data sets that are too large to be considered at once. By design, therefore, they do not permit an analysis of the total misfit of the entire data set. Instead, to assess convergence, we study the behaviour of the control group, that is, the subset of a mini-batch that is kept for the next iteration to enable a direct misfit comparison between iterations. The red line in Fig. 5 shows the relative control group misfit reduction as a function of iteration count. After the control group misfit initially decreases by nearly 10 per cent per iteration, it plateaus around 3 per cent. However, when new events are added, control group misfit reductions can be more significant because their waveforms may still need to be matched.

Figure 3. Exploited data as a function of time and iteration count. Blue dots represent active events during an iteration, that is, members of the current mini-batch. Newly added events are given priority. The red line shows the total number of source–receiver pairs used over the course of the iterations, and black lines indicate when new events were added. The total amount of data grows at a similar rate as the IRIS DMC archive, that is, around 25 per cent per year.

DISCUSSION AND CONCLUSIONS
We have shown results of a workflow that enables an evolutionary mode of inversion without reinversion of a complete data set. The workflow enables an integration of new data, thereby adding information to constrain the model. For the purpose of illustration, the example shown in the previous sections is retrospective, successively assimilating data from the early 1990s to 2017. The resulting suite of models provides snapshots of our knowledge of African upper-mantle structure as a function of time and station coverage. Yet, the potential of the method primarily lies in the continued incorporation of future data volumes that tend to grow exponentially.
In principle, the data set could be expanded in other ways too. One could, for example, simply expand the data set and restart from the latest model. While straightforward, this is unlikely to be efficient, because previously matched data would be modelled without contributing significantly to the new search direction. Alternatively, one may perform a few iterations with the new data only and then iterate again with all the previous data. This strategy, too, poses its own challenges: how many iterations should be performed, and which data should be selected?
A major advantage of the stochastic gradient approach is that none of these choices need to be made, and that, by design, a large data set can be assimilated without having to consider the data set as a whole. The stochastic gradient approximation combined with the dynamically chosen mini-batches automatically 'learns' which events are relevant and how many are required to make a decent approximation of the larger data set.
The redundancy that is typical for seismological data sets, where events often occur in similar locations, allows the dynamic minibatch approach to achieve significant computational savings, compared to inversions of a complete data set (up to 80 per cent in the examples shown by van Herwaarden et al. 2020). As more data are incorporated into an inversion, it becomes more likely that the data set will contain redundant information. As a result, it seems plausible that the aforementioned computational savings will increase as the data sets become larger.
As computational power increases, it becomes possible to exploit higher-frequency data. Our workflow in principle allows model updates using events modelled at varying frequencies. In practice, however, this adds additional complexities. Higher frequencies must be modelled on more finely discretized meshes, and otherwise simple operations, such as summing eventwise contributions to the total gradient, become non-trivial. In addition, higher frequencies are sensitive to smaller-scale structures, thus requiring changes in regularization and parametrization. The exact point at which these changes become important is a problem that still requires further research. In this study, we deliberately chose to keep the regularization and the frequency content constant, in order to isolate and show the effect that an evolving data set has on the model. As such a model continues to incorporate more and higher-frequency data, it will be important to change the regularization accordingly.

Figure 5. Convergence proxies. The blue line indicates the L2 distance between the initial model and later models. The model diverges quickly from the initial model as large-scale velocity structure is recovered during the first iterations. At roughly iteration 200, when the data set is extended substantially, the models diverge further. The relative misfit decrease of the control groups, averaged over 10 iterations for better visibility, is shown in red. Black lines represent moments where new data was incorporated into the inversions.

ACKNOWLEDGEMENTS
We gratefully acknowledge editor Joerg Renner, as well as François Lavoué and an anonymous reviewer for their constructive comments that helped to improve the paper. In addition, we would like to thank Christian Boehm for the helpful discussions and advice. This work was supported by the European Research Council (ERC) under the EU's Horizon 2020 Framework Programme (grant no. 714069). We also gratefully acknowledge support from the Swiss National Supercomputing Centre (CSCS) in the form of computing time grants c13 and s868. In addition, we would like to thank IRIS DMC, whose FDSN webservices provided the majority of the data.