A machine-learning-based surrogate model of Mars’ thermal evolution

S U M M A R Y Constraining initial conditions and parameters of mantle convection for a planet often requires running several hundred computationally expensive simulations in order to find those matching certain ‘observables’, such as crustal thickness, duration of volcanism, or radial contraction. A lower fidelity alternative is to use 1-D evolution models based on scaling laws that parametrize convective heat transfer. However, this approach is often limited in the amount of physics that scaling laws can accurately represent (e.g. temperature- and pressure-dependent rheologies or mineralogical phase transitions can only be marginally simulated). We leverage neural networks to build a surrogate model that can predict the entire evolution (0–4.5 Gyr) of the 1-D temperature profile of a Mars-like planet for a wide range of values of five different parameters: reference viscosity, activation energy and activation volume of diffusion creep, enrichment factor of heat-producing elements in the crust and initial temperature of the mantle. The neural network we evaluate and present here has been trained on a subset of ∼10 000 evolution simulations of Mars run on a 2-D quarter-cylindrical grid, from which we extracted laterally averaged 1-D temperature profiles. The temperature profiles predicted by this trained network match those of an unseen batch of 2-D simulations with an average accuracy of 99.7 per cent.


I N T RO D U C T I O N
The evolution of terrestrial planets is governed by subsolidus mantle convection (e.g. Breuer & Moore 2015). The physics of mantle convection can be quantified by solving the conservation equations of mass, momentum and energy for a fluid with an extremely high viscosity and negligible inertia. These are coupled nonlinear partial differential equations that are typically solved numerically using dedicated fluid dynamics codes (see e.g. the review of Zhong et al. 2015).
The initial conditions and large number of parameters required to run mantle convection simulations are often poorly known and/or largely unconstrained. However, certain outputs of the simulations can be related to 'observables' that can be inferred through planetary space missions using camera data, remote sensing, or in situ measurements (e.g. radial contraction, surface heat flux, surface magnetization, duration and timing of volcanism, crustal thickness and elastic lithosphere thickness). These observables can be used as constraints to infer key model parameters and initial conditions, with the goal of learning about the basic physics and evolution of planets. However, it can be computationally prohibitive to thoroughly scan the relevant parameter space through the solution of the full set of mantle convection equations in 2-D or 3-D. Hence, it is desirable to have a low-dimensional mapping that can rapidly predict the evolution for several parameters. The development of parametrized evolution models in the last few decades goes in this direction. These are essentially based on stacking several steady-state convective solutions (derived from experiments or numerical convection models), which are then advanced in time according to an energy balance equation (e.g. Stevenson et al. 1983; Gurnis 1989). The steady-state solutions are mostly expressed as scaling laws relating the vigour of convection, quantified through the non-dimensional Rayleigh number (Ra), to the non-dimensional ratio of convective to total heat flux at the surface, quantified through the non-dimensional Nusselt number (Nu, e.g. Reese et al. 1998; Dumoulin et al. 1999; Solomatov & Moresi 2000; Deschamps & Sotin 2001). Such scaling laws are obtained using a linear one-to-one (Ra-to-Nu) regression approach.
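To make this one-to-one regression concrete, the following sketch fits a power-law scaling Nu = a Ra^β by linear regression in log-log space; the coefficients are invented for the example, not values from the literature:

```python
import numpy as np

# Synthetic, noise-free data assuming a hypothetical scaling Nu = 0.2 * Ra^(1/3)
rng = np.random.default_rng(0)
Ra = 10 ** rng.uniform(5, 8, size=50)   # Rayleigh numbers
Nu = 0.2 * Ra ** (1.0 / 3.0)            # Nusselt numbers

# Linear one-to-one regression in log-log space: log Nu = log a + beta * log Ra
beta, log_a = np.polyfit(np.log10(Ra), np.log10(Nu), 1)
a = 10 ** log_a                          # recovers a ~ 0.2, beta ~ 1/3
```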
However, the scaling laws obtained using this low-dimensional regression method, and thereby the resulting parametrized evolution models, have the disadvantage of being limited to relatively simple flows, mostly with constant material properties. For example, expanding on previous studies based on incompressible convection, Čížková et al. (2017), using a Cartesian 2-D convection model, investigated the impact of compressibility. They parametrized its influence through an additional non-dimensional number, the so-called dissipation number Di (e.g. King et al. 2010), and derived different Nu-to-Ra scaling relationships for different values of Di. This approach, however, becomes impractical as the number of parameters to test, such as Di, begins to grow.

Figure 1 (caption, in part): (d) these temperature fields are then laterally averaged to arrive at a sequence of 1-D temperature profiles; (e) we train our network G(x, w) using these profiles. A trained surrogate G(x, w*) can then use the optimized weights w* to instantaneously predict temperature profiles for the given parameters.

Neural networks (NNs) have been increasingly used for studying multivariate problems by approximating unknown high-dimensional functions, from image classification to text recognition, all the way to geodynamics. For example, Baumann & Kaus (2015) showed that Markov Chain Monte Carlo methods can be used to constrain the rheology and dynamics of the lithosphere in collision zones. In a different, yet related work, Baumann (2016) used NNs to study the same geodynamic inversion problem using an unsupervised classification algorithm called the self-organizing map (Vesanto & Alhoniemi 2000). Another notable work is by Atkins et al. (2016), who used a specific type of NN, the Mixture Density Network (MDN, Bishop 1994), to study mantle convection as a pattern recognition problem. They inverted reduced representations of temperature fields to constrain parameters such as reference viscosity, yield stress and initial temperature. Shahnas et al. (2018) used support vector machines to estimate the magnitude of density anomalies from snapshots of mantle temperature fields. Recently, Baumeister et al. (2020) used MDNs to predict the distribution of the possible interior structures of extrasolar planets given observations of their mass and radius. All the above works are examples of inverse problems, where machine learning (ML) is used to infer parameters from observables. However, there have been fewer studies on NN-based forward surrogates. Atkins (2017) proposed a forward surrogate model for the Earth capable of predicting the mean mantle temperature and the degree of lateral heterogeneity using MDNs. Gillooly et al.
(2019) used convection simulations with plate-like behaviour together with Generative Adversarial Networks to complement plate reconstructions with an algorithm able to interpolate plate boundaries in unresolved regions.
With the goal of providing a tool to rapidly explore the thermal evolution of a Mars-like planet using the relevant physics, in this paper we build a 1-D surrogate model. As shown in Fig. 1, we use NNs to find nonlinear mappings from five parameters (plus time) to the temporal evolution of the 1-D temperature profile of the silicate mantle. A trained network that can be used to model the thermal evolution of a simplified Mars-like planet for a given set of parameters is freely available at https://github.com/agsiddhant/ForwardSurrogate_Mars_1D.
The paper is organized as follows. We begin by outlining the basics of NNs and the specific algorithms that we use to train the networks (Section 2). We then present the setup of the numerical simulations used to generate a data set of thermal evolutions of Mars calculated with our finite-volume code GAIA (Hüttig et al. 2013) (Section 3). Then, in Section 4, we present the data set, the results from training the NNs, and a comparison of the thermal evolutions predicted by the network with an independent set of GAIA simulations not used in training or evaluating the NNs. We then conclude by discussing some future avenues for the application of surrogate modelling in mantle convection research. Two appendices containing the more technical details of the NNs (Appendix A) and of the mantle convection model (Appendix B) complete the work.

N E U R A L N E T W O R K S F O R H I G H -D I M E N S I O N A L R E G R E S S I O N
In this section, we outline the basics of NNs. For a more detailed, yet accessible, introduction we refer to Bishop (1997). Consider a simple NN like the one illustrated in Fig. 2, where only one hidden layer is shown (typically, NNs will have more than one). The NN connects input nodes to output nodes via a hidden layer h with n neurons. Each neuron j receives n inputs from the previous layer and outputs h_j = g(Σ_{i=0}^{n} w_{ji} x_i), where g() is the activation function, which allows modelling nonlinearities. In this study, we use tanh() as the activation function. Furthermore, x_0 = 1 is a 'bias' neuron added to each layer in the network; it serves to shift the activation function to the left or to the right so that its origin is not fixed at zero.

Figure 2. Schematic of how a basic NN is used to build a forward surrogate model. The input nodes are connected to the output nodes via neurons in so-called 'hidden layers'. Each connection is quantified by an adjustable weight, which is optimized over several iterations by backpropagating the error in the NN prediction. Typically, NNs will have more than one hidden layer. The trained network takes the inputs t, η_ref, Λ, T_ini, E and V and predicts the temperature profile at time t. In this way, the network can be evaluated at multiple values of t to produce an entire evolution.

In a
fully connected NN, each neuron is connected with all the neurons in the previous layer. In this way, NNs provide a structure capable of approximating highly complex nonlinear maps (Baum & Haussler 1989). A mapping, say G(x), can be modelled by an NN as G(x, w), where w are adjustable weights. Once the structure of the NN has been defined, one needs to optimize the weights w. This can be done by defining a cost function that depends on w. One of the most commonly used cost functions is the mean-squared error (MSE); its derivation is available in Appendix A. The standard approach to optimizing this cost function is error backpropagation (e.g. Werbos 1982; Rumelhart et al. 1986). For an NN like the one illustrated in Fig. 2, backpropagation works as follows: the error in the network's prediction is propagated backwards through all the hidden layers using the chain rule of differentiation, and the derivatives of the errors with respect to the weights are used to update the adjustable parameters in each hidden layer at each iteration. This process is called gradient descent. There are several variants of gradient descent. We use a popular stochastic gradient descent optimizer called Adam (adaptive moment estimation, Kingma & Ba 2014) on mini-batches of the training set, which improves computational efficiency.
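As an illustration of the neuron output described above, here is a minimal NumPy sketch of one fully connected tanh layer; the layer size and weights are arbitrary placeholders:

```python
import numpy as np

def dense_tanh(x, W, b):
    """One fully connected layer: h_j = tanh(sum_i w_ji * x_i + b_j)."""
    return np.tanh(W @ x + b)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=6)            # e.g. normalized [t, eta_ref, E, V, Lambda, T_ini]
W = rng.normal(scale=0.1, size=(4, 6))   # 4 neurons, each receiving 6 inputs
b = np.zeros(4)                          # weights of the 'bias' neuron x_0 = 1
h = dense_tanh(x, W, b)                  # hidden-layer activations in (-1, 1)
```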
The derivatives needed during gradient descent can be calculated analytically or by automatic differentiation (AD), now offered by several ML libraries. We use TensorFlow (Abadi et al. 2015), where one only needs to set up the computational graph by defining the NN architecture and specifying the cost function. TensorFlow uses AD and one of the several included optimizers (Adam in our case) to minimize the cost function. To systematically train and evaluate the performance of our networks, we split the data into three parts by first randomly shuffling the entire data set and then taking the desired number of samples. Training set: the subset of the data that is used to train the network. Validation set: half of the remaining data, used to fine-tune the hyperparameters of the NN and to make sure the network is not overfitting. Test set: the second half of the remaining data, used to evaluate the results. This last subset is needed for assessing how well the NN performs because it is not seen by the network at any point during training or validation. In this study, we maintain a training/validation/test split of 80 per cent/10 per cent/10 per cent. We employ two different techniques to prevent overfitting. First, we modify the error function to include L2-regularization (see Appendix A). Second, we use early stopping (Prechelt 2012), that is, we only let the network train until the error function evaluated on the validation set starts increasing beyond a certain threshold.
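The shuffling-and-splitting procedure can be sketched as follows (a minimal NumPy version; the sample count is a placeholder):

```python
import numpy as np

def train_val_test_split(n_samples, seed=0):
    """Shuffle sample indices, then split roughly 80/10/10 into
    training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = int(0.8 * n_samples)
    n_val = (n_samples - n_train) // 2
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = train_val_test_split(1000)   # placeholder sample count
```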

S I M U L AT I O N S S E T U P
To train our ML algorithm, we generate a data set of simulations of the thermal evolution of Mars based on a setup similar to that used by Plesa et al. (2015). We consider a fluid with Newtonian rheology and infinite Prandtl number under the extended Boussinesq approximation (EBA, e.g. King et al. 2010). The viscosity is calculated using the Arrhenius law of diffusion creep (Hirth & Kohlstedt 2003). The thermal expansivity and conductivity are also temperature- and pressure-dependent. Assuming that a crust of fixed thickness d_cr formed early (Nimmo & Tanaka 2005), we adjust the bulk abundance of all heat-producing elements in the whole mantle to a new bulk composition according to a given crustal enrichment factor Λ. The model includes the effects of partial melting on the energy balance as well as on the depletion of heat-producing elements (Padovan et al. 2017). Two phase transitions in the olivine system are included using the standard approach of Christensen & Yuen (1985). The model is completed by a cooling boundary condition to treat the evolution of the core temperature. A detailed explanation of the thermal evolution model along with the corresponding equations is available in Appendix B.

Data set of Mars simulations
We built a data set of 10 453 evolution simulations using the setup described in Section 3, generating 2 TB of data over approximately 200 000 CPU hours. In this data set, we vary five parameters: the reference viscosity, the enrichment factor, the initial temperature, the activation energy and the activation volume, which, as shown by previous thermal evolution models of Mars (e.g. Grott & Wieczorek 2012; Plesa et al. 2015, 2018), strongly influence the thermal evolution but are not well constrained. The parameters are drawn randomly from uniform distributions over the ranges shown in Fig. 1: η_ref ∈ [10^19, 10^22] Pa s, Λ ∈ [1, 50], T_ini ∈ [1600, 1800] K, E ∈ [10^5, 5 × 10^5] J mol^−1 and V ∈ [4 × 10^−6, 10 × 10^−6] m^3 mol^−1. For each combination of the parameters, we ran a thermal evolution over 4.5 Gyr. However, not all simulations reached 4.5 Gyr. For certain combinations of parameters, convection can be extremely vigorous, which dramatically restricts the size of the time-steps while rendering the systems of linear equations to solve particularly stiff. For certain simulations, the linear solver did not converge, invalidating the numerical solution. We filtered out such simulations by considering the root mean square of the magnitude of the velocity in the mantle, u_rms. We empirically chose an upper bound of 20 000 for u_rms to ensure sufficient accuracy without losing too many simulations. Overall, 9524 out of 10 453 simulations satisfied the criterion u_rms ≤ 20 000. Of these 9524 simulations, we used all the available time-steps, even from simulations that did not finish, because our input vector x to the NN includes time and equals [t, η_ref, E, V, Λ, T_ini].
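A hedged sketch of how such a parameter set could be drawn and filtered; the u_rms values below are random placeholders, since in practice they are diagnosed from each GAIA run:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_453

# Draw the five parameters from the stated uniform ranges
# (uniform in the exponent for the reference viscosity)
params = {
    "eta_ref": 10 ** rng.uniform(19, 22, n),  # Pa s
    "Lambda":  rng.uniform(1, 50, n),         # crustal enrichment factor
    "T_ini":   rng.uniform(1600, 1800, n),    # K
    "E":       rng.uniform(1e5, 5e5, n),      # J/mol
    "V":       rng.uniform(4e-6, 10e-6, n),   # m^3/mol
}

# Filtering step: keep only runs with u_rms <= 20 000. The u_rms values
# here are random placeholders standing in for the GAIA diagnostics.
u_rms = rng.uniform(0, 40_000, n)
keep = u_rms <= 20_000
```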
The number of time-steps available for each simulation can vary greatly because while running the simulations, we chose to save an output every 4000th flow solver iteration as well as every 90 Myr. This was done to ensure that even for numerically stiff simulations, at least some time-steps were available. In total, we stored 337 848 time-steps from the filtered data, averaging 35 per simulation. Fig. 3 shows the evolution of the mean mantle and core-mantle boundary (CMB) temperatures, and of the surface and CMB heat fluxes, coloured according to the reference viscosity, from high (blue) to low (red). Temperature profiles from finished evolution simulations after 4.5 Gyr are also shown. It is clear that there exists some pattern in the outcome of the simulations. For example, lower values of the reference viscosity (hence higher values of the Rayleigh number), signifying vigorous convection, show more efficient heat transfer out of the mantle, thus a more rapid cooling. Therefore, profiles corresponding to high Rayleigh numbers exhibit a steep thermal gradient at the surface and an overall cooler profile. We demonstrate that these trends can be captured by our NN.
In order to accelerate the training of the NNs, we reduced the size of the 1-D temperature profiles (200 points) by roughly two-thirds, while still capturing their shape. The temperature profiles are coarsened by taking every third point, except at the surface and at the CMB: the temperature at the surface and at the next two points corresponds to that of the numerical grid, ensuring the same precision as that of the numerical simulations. The same is done at the CMB.
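The coarsening scheme described above can be sketched as follows (assuming, for illustration, that the first and last three grid points are retained exactly):

```python
import numpy as np

def coarsen_profile(T):
    """Keep every third point of a 1-D temperature profile, but retain the
    first and last three grid points so that the surface and CMB values
    keep the precision of the numerical grid."""
    n = len(T)
    interior = np.arange(3, n - 3, 3)
    idx = np.unique(np.concatenate([[0, 1, 2], interior, [n - 3, n - 2, n - 1]]))
    return T[idx]

T_full = np.linspace(250.0, 2000.0, 200)   # placeholder 200-point profile
T_coarse = coarsen_profile(T_full)         # roughly one third of the original size
```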
We further normalize all the training inputs to lie between 0 and 1 using the maximum and minimum values of each parameter. For η_ref, we first take its logarithm and then normalize the exponents to lie between 0 and 1 before using them as input to the network. This produces the parameter distribution from the training set shown in Fig. 4, which is what the network sees during training. The parameter space is well covered, except in some 'corners' of the data. This is particularly true for simulations with low reference viscosity and high activation energy: some such simulations failed to reach convergence and were discarded under the filtering criterion u_rms ≤ 20 000. Hence, one can expect lower prediction accuracy at later time-steps, where the data are scarcer.
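A minimal sketch of this normalization (the helper name is ours, not from the original code):

```python
import numpy as np

def normalize(x, x_min, x_max, take_log=False):
    """Min-max scale a parameter to [0, 1]; for eta_ref, the exponent
    (log10) is normalized instead of the raw value."""
    if take_log:
        x, x_min, x_max = np.log10(x), np.log10(x_min), np.log10(x_max)
    return (x - x_min) / (x_max - x_min)

eta_scaled = normalize(1e20, 1e19, 1e22, take_log=True)   # (20 - 19) / (22 - 19)
T_scaled = normalize(1700.0, 1600.0, 1800.0)              # 0.5
```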

Training of neural networks
We train our surrogate model G(x, w) on 80 per cent of the entire data set. We then use 10 per cent of the data to test different network architectures and prevent overfitting, and the remaining 10 per cent to evaluate the accuracy of the trained surrogate G(x, w*). For 337 848 samples (simulations × time-steps), this results in a train-validation-test split of 270 278/33 785/33 785.
After trial and error with NN architectures of different numbers of hidden layers and neurons per layer, we found that relatively small architectures, with a total number of neurons under 200 distributed across 2-3 hidden layers, performed best.
In Fig. 5 we present, as an example, results from a network with three hidden layers of 90, 60 and 30 neurons, trained for 4.4 million epochs. Fig. 5(a) shows some randomly selected temperature profiles from the test set plotted against the ones predicted by the NN. These can correspond to a temperature profile of any simulation at any time. Fig. 5(b) then shows the average absolute error and the average relative error in predicting all the temperature profiles in the test set. On average, the prediction errors are low, peaking at about 6 K near the surface. One possible explanation for this behaviour is that the temperatures near the surface are more degenerate: upon inspecting Fig. 3(e), one can see that the top part of the temperature profiles shows a less obvious colouring pattern than the rest. This could be a hint that the surface heat flux is more ill-conditioned, that is, broader ranges of parameters can lead to the same heat flux. A second possibility is that numerical precision is smeared by the act of averaging the 2-D temperature field to a 1-D profile of points connected by linear elements and/or by further reducing the size of the temperature profiles through linear coarsening. Finally, since the lateral temperature variations are typically largest at the base of the lithosphere, this can also introduce a higher uncertainty and ultimately a larger prediction error. However, the exact cause of this radial distribution of the error remains a subject for future investigation.
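For illustration, the 90-60-30 architecture can be sketched as a NumPy forward pass; the weights are random and the output length (71 points) is a placeholder for the coarsened profile size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Input: 6 values [t, eta_ref, E, V, Lambda, T_ini]; output: a profile
# of 71 points (a placeholder for the coarsened profile length)
layer_sizes = [6, 90, 60, 30, 71]
weights = [rng.normal(scale=0.1, size=(m, n))
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(m) for m in layer_sizes[1:]]

def forward(x):
    """tanh on the hidden layers, linear output layer (a common choice for
    regression; trained weights would replace the random ones here)."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.tanh(W @ x + b)
    return weights[-1] @ x + biases[-1]

T_pred = forward(rng.uniform(0, 1, size=6))
```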
Training for millions of epochs on a data set of this size can take days on a Tesla V100 GPU with an early-stopping criterion of

train while MSE_validation(epoch) ≤ MSE_validation(epoch − 0.05 epoch). (2)

Here, one epoch is when a stochastic gradient descent algorithm (Adam in our case) has trained over all the mini-batches once, that is, one iteration over the entire training set. The early-stopping criterion terminates training when the validation loss starts increasing beyond a certain threshold. The day(s)-long training time arises because regression is typically more demanding than classification, especially when fitting ∼270 000 data points up to near-optimal prediction accuracy. Nevertheless, the long training time is not a particular concern in this study, since the network only needs to be trained once. However, in case one wanted to train several networks, further optimization tricks, for example thermometer coding (e.g. Jeon & Choi 1999; Montavon et al. 2013), could be used to speed up training. After training is completed, any point in the temperature profiles of the test set is predicted with an average error of 0.2604 per cent. Comparing this to the average prediction error of 0.2609 per cent on the training set indicates that there are no obvious under- or overfitting problems. However, a comprehensive analysis of fitting for NNs is non-trivial (e.g. Jin et al. 2019; Bottou & Bousquet 2008) and beyond the scope of this paper.
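The early-stopping criterion of eq. (2) can be sketched as a simple check on the validation-loss history (a hypothetical helper, not the authors' implementation):

```python
def should_stop(val_mse, epoch):
    """Early stopping in the spirit of eq. (2): stop once the validation
    MSE exceeds its value 5 per cent of the epochs earlier."""
    return val_mse[epoch] > val_mse[int(0.95 * epoch)]

# A steadily decreasing loss keeps training; a late rebound triggers a stop
decreasing = [1.0 / (1 + e) for e in range(100)]
rebounding = decreasing[:90] + [0.1 + 0.01 * i for i in range(10)]
```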

Predicting evolution using trained neural networks
In this subsection, we evaluate the temperature profiles produced by the trained surrogate G(x, w*) over the course of the entire thermal evolution from 0 to 4.5 Gyr. In order to see how well the trained NN performs, we created a fourth batch of 20 GAIA simulations against which we compare the NN predictions. This new small batch was created to demonstrate that the predictions from the NN capture the expected geophysical trends well. This requires manually setting the input parameters so that only one parameter is varied within each sub-batch of 4-5 simulations while the others remain fixed, which cannot be achieved by randomly drawing from a joint uniform distribution of five parameters. In other words, these particular combinations do not exist in the data set. All the values of the input parameters for the 20 GAIA simulations are listed in Table 1. In Fig. 6, we compare the NN predictions with 19 of the 20 GAIA simulations; simulation 8, with a high activation energy, crashed and could not be used in this comparison.
The network is able to accurately capture the trends and match the GAIA simulations well. For different η_ref, for example, one observes the expected trends in Fig. 6: a lower viscosity leads to an overall cooler profile because of more vigorous convection. The initial temperature, on the other hand, seems to have little impact on the final temperature profiles, demonstrating the 'thermostat effect'.
However, the predictions are not perfect, and there are various possible reasons for the small mismatches. For example, errors in the predicted temperature profiles can increase for cases having both a high Rayleigh number and a high activation energy, for which the data are somewhat scarce. In fact, the availability of data for training purposes is important to the accuracy of the predictions from G(x, w*). As a test, we trained networks on different subsets of the entire data set (i.e. using different numbers of simulations) and calculated the average relative error on the fourth GAIA set. For consistency, we maintained the same split of the data for training/validation/testing (80 per cent/10 per cent/10 per cent), as well as the same hyperparameters, such as the size of the NN, the stopping criterion and the L2-regularization parameter. Fig. 7 shows the average relative error in predicting any point in the final temperature profiles of the fourth GAIA set (Table 1) for different sizes of the data set. The error drops dramatically up to a data set with 1000 simulations; after that point, only asymptotic improvements are seen. In other words, there is less and less to be gained from more simulations.

A further test for correlation between the training and the test set
In Fig. 7, we plotted the impact of the number of simulations on the prediction accuracy. One 'simulation' on the x-axis corresponds to one sample from the five input parameters; for each simulation, the number of available time-steps may vary. To be able to perform such a study, one ideally needs to maintain the same training-validation-test split of samples. Since the number of time-steps available per simulation can vary, because of how we chose to store them and depending on how far along a simulation got, the only way to ensure that the resulting number of training-validation-test samples remains a uniform percentage of the total size of the data set is to first randomly shuffle all the available time-steps of all the filtered simulations and then take the desired percentages out of them (e.g. 80 per cent, 10 per cent and 10 per cent). In other words, if one splits the simulations by the desired percentages first and then takes the available time-steps, there is no guarantee that the resulting samples maintain these proportions. Shuffling time-steps, however, means that profiles from the same simulation can end up in both the training and the test sets, which raises the question of whether the measured test accuracy merely reflects interpolation in time.

Table 1. Values of the input parameters of the 20 GAIA simulations used for comparing evolutions. These simulations are completely independent of the training-validation-test sets. Simulation 8 crashed and was discarded.
To address this question, it is worth revisiting the 19 GAIA simulations that we ran independently and whose entire evolutions we assessed in Fig. 6. Since these are completely independent of the 9524 simulations from which we drew our training-validation-test sets, the accuracy on the new 19 simulations is a good indicator that G(x, w*) is doing more than just interpolating in time.
As a further check to ensure that the selection of the parameters for the 19 simulations was not simply a matter of good fortune, we took an NN from Fig. 7 that was trained with only 3000 simulations (maintaining the 80/10/10 split of this relatively small set) and used it to evaluate the remaining ∼6000 simulations, which were not part of the training-validation-test distribution. The results are plotted in Fig. 8. In Fig. 8(a), we plot some randomly selected temperature profiles as solid blue lines and the corresponding NN predictions as dashed red lines, as well as the radial distribution of the absolute error in grey (indicated on the top x-axis). Furthermore, in Fig. 8(b), at each radial location, we plot the average absolute error and the average relative error across all the profiles; the results show similar trends to Fig. 5(b), albeit with higher average errors owing to the smaller number of simulations available to train the network. Finally, in Fig. 8(c), we plot the mean relative error in predicting any point in the temperature profile at different times. The mean relative error increases slightly with time because at later times there are fewer time-steps available, meaning fewer data points to train the network with. Overall, any point in the temperature profiles at any time from any of the ∼6000 simulations is predicted with an average error of 0.33 per cent.
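The error metric used throughout, the mean relative error over all profile points, can be sketched as:

```python
import numpy as np

def mean_relative_error(T_true, T_pred):
    """Average of |T_pred - T_true| / T_true over all points, in per cent."""
    return 100.0 * np.mean(np.abs(T_pred - T_true) / np.abs(T_true))

T_true = np.full((10, 71), 1700.0)    # placeholder profiles, in K
T_pred = T_true + 5.1                 # a uniform 5.1 K offset -> 0.3 per cent
err = mean_relative_error(T_true, T_pred)
```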

Rapid evaluation of the parameter space using a trained surrogate
After establishing that the trained surrogate G(x, w*) reliably predicts the temperature profiles T(r, t) for different parameters, we demonstrate that it can be used to evaluate the parameter space. Several quantities can be calculated from the temperature profile that can then be related, directly or indirectly, to specific observables.
In this subsection, we plot as an example two such quantities: the CMB temperature (T_cmb) and the upper-mantle temperature beneath the stagnant lid (T_lid). The former can be used to assess the thermal state of the core, with implications for its mode of solidification and for the tidal response of the planet (e.g. Plesa et al. 2018; Khan et al. 2018). The latter can be compared with inferences based on petrological studies predicting the source conditions at which Martian meteorites formed (e.g. Filiberto & Dasgupta 2015).
In Fig. 9, we plot the present-day values of T_lid (upper right) and T_cmb (lower left) for the five parameters η_ref, E, V, Λ and T_ini, always varying two at a time. In each panel where two variables (say η_ref and E) are varied, the others (in this case V, Λ and T_ini) are kept constant. Unless varied, the parameters remain fixed at the following values: η_ref = 10^20 Pa s, Λ = 20, T_ini = 1700 K, E = 2 × 10^5 J mol^−1 and V = 6 × 10^−6 m^3 mol^−1.
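A hedged sketch of how a trained surrogate could be swept over a two-parameter slice of the space; `surrogate` below is a stand-in with placeholder physics, not the trained network:

```python
import numpy as np

def surrogate(x):
    """Stand-in for the trained network G(x, w*): maps the normalized input
    [t, eta_ref, E, V, Lambda, T_ini] to a 71-point temperature profile.
    The linear dependence on x[1] is placeholder physics."""
    return np.full(71, 1700.0 + 100.0 * x[1])

# Sweep eta_ref and E at 50 values each; the other parameters stay mid-range
eta_grid = np.linspace(0.0, 1.0, 50)   # normalized log10(eta_ref)
E_grid = np.linspace(0.0, 1.0, 50)     # normalized activation energy
T_cmb = np.empty((50, 50))
for i, eta in enumerate(eta_grid):
    for j, E in enumerate(E_grid):
        x = np.array([1.0, eta, E, 0.5, 0.5, 0.5])   # t = present day
        T_cmb[i, j] = surrogate(x)[-1]               # profile bottom = CMB
```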
With the aid of such plots, one can also study the correlation between different parameters. For example, for this model of a Mars-like planet, for the core to cool, one would need a combination of a low reference viscosity and a high activation energy: a higher activation energy, in other words a stronger temperature dependence of the viscosity, leads to more vigorous convection and more efficient cooling. Furthermore, such plots can provide better insight into how parameters impact the thermal evolution relative to other parameters. For example, from the contour plot of T_lid as a function of η_ref and Λ, one sees that η_ref is the more decisive parameter in determining the lid temperature.

D I S C U S S I O N A N D C O N C L U S I O N S
We trained an NN on ∼10 000 mantle convection simulations and developed a forward surrogate model capable of capturing the thermal evolution of Mars. The NN we provide on GitHub has been trained on 80 per cent of the full data set and is capable of instantaneously calculating the entire evolution of 1-D temperature profiles over 4.5 Gyr. Upon comparing the predictions of this trained surrogate with an unseen batch of GAIA simulations, we conclude that the network captures the trends accurately and matches the GAIA simulations well, with an average accuracy of 99.7 per cent.
We present this study as a proof of concept showing that high-dimensional regression algorithms like NNs can help us study non-trivial mantle convection problems involving multiple parameters and physical processes. There are several advantages to this approach. First, we can bypass the need for devising complicated scaling laws that may need dedicated fine-tuning to match the outcomes of numerical simulations (e.g. Thiriet et al. 2019), while at the same time capturing physical processes, like temperature- and pressure-dependent thermodynamic and transport properties, which are normally not easily incorporated into scaling laws. Second, our network is able to predict the entire 1-D temperature profile, including the shapes and locations of phase transitions, in contrast to parametrized models that typically operate under the assumption of a theoretical adiabatic temperature profile. Third, by training directly in time, we can also circumvent constructing evolution models with energy balance equations; our NN implicitly learns the relations between the initial values of the parameters and their evolution with time. Fourth, trained surrogates like the one we provide on GitHub can be easily downloaded and used to conduct parameter studies any number of times, without having to repeatedly perform the simulations on a supercomputer. The plot in Fig. 9 is a good example of that: it took only a few seconds for the trained NN to produce the present-day temperature profiles from which T_cmb and T_lid were calculated. With 50 different values of each parameter, each plot of two parameters would instead require 2500 simulations, and one would need ∼25 000 simulations to go through all ten pairs of parameters. Furthermore, one could use the same trained NN to extend plots like Fig. 9 to 3-D (or higher-dimensional) combinations by varying three parameters in each plot instead of just two. With trained NNs, this would come at a fraction of the cost.
Using full convection simulations in this case would become intractable even on a supercomputer because of the exponential scaling of the number of required simulations (1.25 million) with the number of parameters to be varied. Fifth, this can be an efficient way for different research groups to share and compare their results. Researchers could, for example, exchange such trained forward surrogates (which are only a few kB in size) and compare their models by checking the temperature profiles predicted for any arbitrary combination of parameters. The results of this work are a first step towards high-dimensional surrogate modelling in mantle convection. Several challenges remain open. For example, in training on and predicting only 1-D temperature profiles, some information is lost. One could address this issue by considering instead the entire 2-D temperature field. There has been some progress in the broader fluid dynamics community on high-dimensional ML from small data sets. Two different approaches can be adopted to tackle this challenge. One is based on a class of physics-informed algorithms, where the partial differential equations are embedded 'softly' in the loss function using AD, which has a regularizing effect on the optimization (e.g. Raissi et al. 2019, 2020). The other approach combines advanced ML techniques like convolutional NNs and recurrent NNs to learn in space and time, respectively. Mohan et al. (2019), for example, demonstrated the effectiveness of these techniques in capturing the dynamics of 3-D turbulent flows. In this paper, we saw that approximately 1000 simulations are sufficient for training a 1-D forward surrogate that can model the evolution of a planet in time for five unknown parameters.
This minimum number of simulations required to train a higher dimensional surrogate, for example in 2-D, is likely to be higher.
Furthermore, the predictions of this trained surrogate are restricted to the ranges of the parameters over which the NN was trained. A good extrapolation beyond these ranges is not to be expected. In that sense, one must also keep in mind that the NN trained on this data set is limited to the physics included in the convection simulations which were used as training data. Hence, another potential avenue of research would be the inclusion of additional parameters, such as the radius of the core, thickness of the mantle and number and type of phase transitions. That would allow the investigation of different physical models for the same planet as well as for bodies of different sizes, both in the solar system (Mercury, Venus and the Moon) and around other stars (the class of super-Earths).
The field of high-dimensional surrogate modelling in the mantle convection community might just be getting started, but we believe it has great potential to improve our understanding of how the terrestrial planets evolve.