An autonomous petrological database for geodynamic simulations of magmatic systems


 Self-consistent modelling of magmatic systems is challenging as the melt continuously changes its chemical composition upon crystallization, which may affect the mechanical behaviour of the system. Melt extraction and subsequent crystallization create new rocks while depleting the source region. As the chemistry of the source rocks changes locally due to melt extraction, new calculations of the stable phase assemblages are required to track the rock evolution and the accompanied change in density. As a consequence, a large number of isochemical sections of stable phase assemblages are required to study the evolution of magmatic systems in detail. As the state-of-the-art melting diagrams may depend on nine oxides as well as pressure and temperature, this is a 10-D computational problem. Since computing a single isochemical section (as a function of pressure and temperature) may take several hours, computing new sections of stable phase assemblages during an ongoing geodynamic simulation is currently computationally intractable. One strategy to avoid this problem is to pre-compute these stable phase assemblages and to create a comprehensive database as a hyperdimensional phase diagram, which contains all bulk compositions that may emerge during petro-thermomechanical simulations. Establishing such a database would require repeating geodynamic simulations many times while collecting all requested compositions that may occur during a typical simulation and continuously updating the database until no additional compositions are required. Here, we describe an alternative method that is better suited for implementation on large-scale parallel computers. Our method uses the entries of an existing preliminary database to estimate future required chemical compositions. Bulk compositions are determined within boundaries that are defined manually or through principal component analysis in a parameter space consisting of clustered database entries. We have implemented both methods within a massively parallel computational framework while utilizing the Gibbs free energy minimization program Perple_X. Results show that our autonomous approach increases the resolution of the thermodynamic database in compositional regions that are most likely required for geodynamic models of magmatic systems.

Database for geodynamic simulations 1821 Magni et al. 2014;Rummel et al. 2018), or phase relations can be determined on the fly during an ongoing geodynamic simulation (e.g. Hebert et al. 2009;Tirone et al. 2009;Duesterhoeft et al. 2014;Oliveira et al. 2017;Riel et al. 2018). In case the stable phase assemblages are pre-computed as isochemical sections, they are normally limited by their compositional ranges, as they are often only designed to estimate density changes and seismic velocities (e.g. Nakagawa et al. 2009;Faccenda & Dal Zilio 2017) or to investigate the influence of water release in subduction zones (e.g. Rüpke et al. 2004;Magni et al. 2014). However, to fully understand magmatic processes, including the feedback between chemistry and thermomechanics is crucial. It requires a detailed consideration of the compositional evolution, which can be modelled with a large thermodynamic database that contains a few thousand up to several ten thousands of isochemical sections describing the stable phase assemblages. Recently, Rummel et al. (2018) showed that physical processes such as the melting source of a mantle plume can be better understood when the chemistry of extracted melt is taken into account during a geodynamic simulation. This result also demonstrated that it is practical to track the evolving chemical composition, even though only a limited database consisting of 625 isochemical sections for different BCs was used.
Existing melting models (e.g. Ghiorso & Sack 1995;Green et al. 2016;Holland et al. 2018) allow phase equilibrium calculations for a broad compositional range and are used normally only in a simplified manner. Implementing them as part of a thermomechanical simulation of a magmatic system is an important future step towards creating predictive models of magmatic systems. Several software packages exist to compute such stable phase assemblages, for example, MELTS/pMELTS/rhyolite-MELTS (Ghiorso & Sack 1995;Ghiorso et al. 2002;Gualda et al. 2012), THERMOCALC (Powell & Holland 1988), THERIAK-DOMINO (de Capitani & Petrakakis 2010) or Perple X (Connolly 2005(Connolly , 2009. We use Perple X in this study and created an MPI-parallel (Gropp et al. 1999) framework that invokes multiple Perple X sessions simultaneously. In this manner, many thousands of isochemical sections for the different BCs can be computed within a few days depending on the size of the parallel computer available. Perple X is suitable for such an automatized computational framework, as it calculates the stable phase assemblages without much manual interaction and prior knowledge about stable phases by minimizing the Gibbs free energy (Connolly 2017).
Here, we present a method that enlarges a database of isochemical sections of stable phase assemblages in an automated way based on an initially small database constructed from requested compositions of geodynamic forward models. Different sampling techniques are presented along with their suitability to estimate rock compositions that will emerge during geodynamic simulations. Differences between the required BCs and available ones in the database are investigated while enlarging the size of the database. We apply machine learning methods for dimensionality reduction such that compositional trends become visible on a lower dimensional map. Groups of similar BCs are extracted to which different sampling techniques are applied. The success of a reasonable enlargement of a database is dependent on the sampling method and whether a pre-subdivision of the sampling space is executed.
We distinguish here three different kinds of BCs: (1) initial BCs are those used in the preliminary (initial) database, (2) new BCs are created with the autonomous approach and (3) requested BCs are those requested during geodynamic forward models. The new BCs are compared with the requested BCs to evaluate their quality.

Modelling magmatic systems with petro-thermomechanical models
The chemical evolution of magmatic systems is a multifaceted process, which is associated with many melt extraction events. This process not only changes the chemical and mineralogical evolution but may also have a feedback on the thermomechanical behaviour of the dynamic system. In addition to the purely kinematic evolution of the magmatic system, which can be studied using, for example, Rcrust (Mayne et al. 2016), Perple X (Connolly 2009) or MELTS (Ghiorso & Sack 1995), we present here the integration of such models into a thermomechanical simulation such that the interaction between the chemistry and the mechanics of the system can be studied. In the following, we briefly describe the thermomechanical model that we employ in this study and provide a detailed description of the coupling with petrological approximations to model the compositional evolution of magmatic systems.
The petrological modelling approach is integrated in the thermomechanical simulation code MVEP2 1 (Kaus 2010;Thielmann & Kaus 2012;Rummel et al. 2018). MVEP2 employs finite elements to discretize the model domain and solves the conservation equations of mass, momentum and energy of slowly creeping fluids on geological timescales. The rheology is handled with constitutive relationships that define the visco-elasto-plastic deformation behaviour of rocks. To model diking, the numerical approach is combined with a semi-analytical fracture opening algorithm that is described in Rummel et al. (2020) where the modelling results are presented in more detail. The first dike formation is triggered by basaltic magma from the mantle that is injected as sills into the crust or uppermost mantle. The dike formation and thus the amount of extracted melt depends on the stress field above the partially molten or crystallized source. We employ a marker-in-cell technique to track rock properties during the geodynamic simulation. These markers play a central role in our approach as we also use them as tracers of the mineralogical and chemical evolution of the magmatic system. In our approximation, the chemical system is restricted to nine oxides, that is, SiO 2 -TiO 2 -Al 2 O 3 -FeO-MgO-CaO-Na 2 O-K 2 O-H 2 O, from which diverse mineral and melt compositions can be formed. Free water that is not stored in minerals is removed from the system. We use the Holland & Powell (2011) internally consistent thermodynamic data for endmembers and the activity-composition models for solid-solution phases are given in Supporting Information Table A1 (see Supporting Information Data A) including the melting model for metabasic rocks from Green et al. (2016). The system is buffered along the quartz-fayalite-magnetite oxygen buffer. We neglect the change of the redox state due to melt extraction and assume a homogeneously distributed redox state evolution with pressure and temperature independent of the local BC.
The information about the local petrological composition is tracked by assigning an individual isochemical section of stable phase assemblages to each marker employed in the geodynamic simulation. Our model has a lateral extent of 50 km and is 40 km deep, which is resolved with 400 and 350 finite elements, respectively. We start the simulation with ∼1.3 million markers, but the number increases during the simulations as new dikes/sills are generated, to ensure that the spatial evolution of rock and melt chemistry is tracked with sufficient accuracy. We neglect reaction kinetics and assume thermodynamic equilibrium. However, despite these simplifications, an immense thermodynamic database is required to cover all possible liquid and residuum compositions generated in the evolving magmatic system. The high number of requested isochemical sections as a consequence of only one newly generated dike can be illustrated with the following example. As the local chemical composition changes due to melt extraction, the chemistry of the system is split into a liquid part forming a dike and into a residuum part. The dike is described by only one isochemical section (with an average liquid chemistry), assuming homogeneous mixing of all melt prior to extraction. However, the remaining residuum chemistry changes strongly depending on the local amount and composition of extracted melt. Due to the different rock depletion stages, the number of isochemical sections with distinguishable BCs can be high. Our tests show that a few thousands of these sections of stable phase assemblages are required for a typical complete simulation.
The isochemical sections are computed with Perple X (Perple X 6.7.9; Connolly 2005Connolly , 2009 and are evaluated for discrete P-T conditions (257 × 257 points) for the given ranges and describe how solid and liquid oxide compositions as well as densities, melt fractions and stable mineral assemblages will evolve. With a pressure range of 0.1-3000 MPa and a temperature range of 290-1900 K, thermodynamic properties are determined every 11.7 MPa and 6.3 K for a specific BC. At this resolution, the computation time is reasonable (∼10 hr on average) and all phase boundaries are resolved with a sufficient precision. The thermodynamic data files employed are provided in the Supporting Information.
In general, there are two different strategies to combine geodynamic simulations of magma migration with thermodynamic properties which can be computed for a whole P-T section (on which we focus here) or for specific P-T conditions. We call the first method the dynamic method (1) in which a new isochemical section of stable phase assemblages (or alternatively the stable phase assemblage for a specific P-T condition) is computed on the fly during an ongoing simulation. The geodynamic simulation pauses while thermodynamic calculations are performed. The alternative approach is the static method (2) in which a look-up table with existing stable phase assemblages for different BCs is used. Here, an isochemical section that is acceptably similar to the requested one is used to immediately continue with the geodynamic simulation after updating the new phase properties on the marker.
(1) In the dynamic method, the computation of isochemical sections of stable phase assemblages is performed during an ongoing geodynamic simulation through which the number of required sections is reduced to a minimum. After each melt extraction event, the residuum or melt compositions can be directly used to compute new sections of stable phase assemblages, provided they do not exist yet in the database. The thermodynamic properties are directly updated on the marker, from which melt is extracted, and the chemical system develops based on this specifically updated isochemical section. The disadvantage of this method is that the respective geodynamic simulation has to pause until all necessary sections are computed. This pause massively increases the computation time for each geodynamical time step as the computation of an individual section may take several hours. From a computational perspective, the process is a sequential task, which is hard to parallelize with a multiprocessor architecture, because the workload can differ a lot due to a different number of newly requested compositions. For example, a single isochemical section request would force all other tasks to wait.
(2) For the static method, a database of existing isochemical sections is used to estimate the future rock evolution after each melt extraction event. Here, the most similar section of stable phase assemblages in terms of bulk chemistry is used and replaces the previously associated section of the marker. As a similarity measure, we use the Euclidean distance in space spanned by the components of bulk chemistry. Clearly, the closer the query is to the existing database entries, the more accurate the model becomes. For this reason, we store all requested residuum and melt compositions during a geodynamic simulation and compute new sections of stable phase assemblages separately from the simulations. The advantage is that the job can efficiently be performed in parallel depending on the size of the computer available. In our case, we use 256 tasks in parallel and compute ∼1000 isochemical sections in 3-4 d. However, it was initially unclear how many different sections are required for a typical simulation, and what the error is that is made by employing sections with the approximate, but not the exactly correct, chemical composition during the simulations. The obvious advantage of this approach is that the computation time for all sections is reduced to a minimum. The disadvantage, however, is that the respective geodynamic simulation has to be repeated many times until the differences between the requested compositions and BCs of the database become negligible. In principle, such a database can become large, such that technical issues arise when loading the full database during a geodynamic simulation, as well as problems with storing the data. We note that there are techniques to compress the rock properties without losing much in accuracy (Afonso et al. 2015), which we have not explored here.
Here, we investigate the second strategy. An interpolation between the results of the different isochemical sections is theoretically possible, in case the BCs are well distributed over that part of the parameter space that is most likely required for the geodynamic simulations. In the following, we show how such a database can be established and efficiently enlarged with an autonomous approach.

Establishing the database
From a computer science point of view, we deal with a parameter space that is spanned by nine oxides, bounded by upper and lower limits, respectively. There is the additional requirement that the oxide concentrations of each BC must add up to exactly 100 wt per cent, effectively reducing the parameter space to an 8-D manifold to which all BCs in the database (called here 'initial BCs') are restricted. We assume that the parameter space is discretized and only compositions with offsets of Ox. = 0.1 wt per cent can be realized. Thus, each composition differs from each other at least by 0.1 wt per cent in one of its oxides. Although this limitation dramatically reduces the possible number of petrological compositions, it would be inefficient to establish the database by randomly creating new BCs between the maximum and minimum concentrations of the respective oxides while respecting the constraints. Instead, we make use of expert knowledge, which provides good initial guesses about the interesting regions of the oxide space or about BCs most likely being required by future simulations. As a starting point, we compute five initial isochemical sections of stable phase assemblages representing the upper, middle and lower crusts as well as the mantle and the mafic basaltic intrusion. Using these initial sections, we perform a series of iterative geodynamic simulations (see Section 2.1). As soon as a melt extraction event occurs, the existing phase assemblages cannot sufficiently describe the future rock evolution and new isochemical sections are requested. We compute the sections with the requested BCs and repeat the simulation to produce additional composition requests from the evolving magmatic system. To further improve our initial guess, we vary the configurations of the simulations by changing tectonic boundary conditions, intrusion depths and temperatures of magma influx from the mantle, geothermal gradients and other properties. With this strategy, we created an initial database with thousands of isochemical sections that ideally cover the possible oxide concentration limits of requested BCs in our petro-thermomechanical models. This procedure may already result in a relatively fine-mesh sampling space producing similar compositions, especially in regions that are preferentially requested from the initial geodynamic models. Theoretically, one could further improve the database with more simulations and new requests that result from them. Such an improvement of the database is necessary for simulations that cannot describe the variations in melt and residuum chemistry with a sufficient accuracy. However, such a sampling requires significant manual interaction. In the following, we describe our approach to enlarge the database in a more autonomous manner.

Enlarging the database
To further enlarge the database in an autonomous manner, we make use of our initial guesses and guide the sampling of new BCs with the location of the initial BCs from the database. Visualizing the BCs along different oxides illustrates that existing BCs are correlated with each other, visible as they gather in smaller clouds or align along mostly nonlinear pathways that usually split up in multiple branches ( Fig. 1). Such a distribution of initial BCs means that the approach of uniformly sampling the entire initial BC space between their extreme concentrations ( Fig. 1a) is unsuitable to create exclusively new BCs that are similar in chemistry to the initial BCs and thus potentially requested. A robust method for linearly correlated data is the principal component analysis (PCA; Hotelling 1933). A large number of approaches exist for nonlinear data, including manifold learning and nonlinear PCA and kernel-PCA methods. By using PCA (PCA-CS, Section 2.3.2) or by constraining the oxide space manually (M-CS, Section 2.3.3), the sampling space for new BCs is reduced (Fig. 1b). However, even though corners are already neglected, in which no initial BCs are located, the sampling space is still relatively large and new BCs can be created between compositional branches of initial BCs. To further reduce the sampling space, a combination of robust dimensionality reduction techniques can be used by applying PCA or manually constrained boundaries to individual compositional clusters (Section 2.3.1) in the mapping space (Fig. 1c).

Simplifying the parameter space with t-SNE and DBSCAN
In order to deal with the complex BC relationships, we break down the distributions in smaller fractions of similar compositions using a combination of a dimensionality reduction technique and a clustering algorithm. We first apply the Barnes-Hut t-SNE (Van Der Maaten 2014) to map the high-dimensional data to a lower dimensional manifold such that clusters of similar BCs become visible. To quantitatively distinguish those clusters, a method is required that utilizes the coordinates of the data points to find densely distributed regions and thus clusters. For this purpose, the clustering technique 'density-based spatial clustering of applications with noise' (DB-SCAN; Ester et al. 1996) is applied. Both algorithms, Barnes-Hut t-SNE and DBSCAN, are briefly described below.
The Barnes-Hut t-SNE is based on the previously developed algorithms, stochastic neighbour embedding (SNE; Hinton & Roweis 2003) and t-distributed stochastic neighbour embedding (t-SNE; Maaten & Hinton 2008). It is different compared to the previous t-SNE version as it includes the Barnes-Hut algorithm (Barnes & Hut 1986) that massively accelerates the t-SNE procedure (Van Der Maaten 2014), which is necessary to analyse large data sets. In general, t-SNE projects high-dimensional data (in our case 8 dimensions) on a 2-D map (Fig. 2a), such that each data point (BC) has a 2-D representation. Nearby points correspond to similar BCs and distant points correspond to dissimilar BCs. t-SNE reduces the dimensionality while preserving the essential structures (local and global ones such as clusters at multiple scales) of the high-dimensional data in the lowdimensional map (Maaten & Hinton 2008). The Barnes-Hut t-SNE algorithm is available as MATLAB library (http://lvdmaate n.github.io/tsne/), which we employ in this study with default values.
In a second step, we identify clusters ( Fig. 2a) by applying DB-SCAN to the projected data of the 2-D map space. Each cluster is characterized by specific BCs that are chemically similar and distinguishable to those of other clusters. However, especially at the margin of a dense region, compositions of neighbouring clusters may be more similar than compositions located far away in the same cluster. DBSCAN is a density-based data clustering algorithm where clusters are defined as dense regions that contain a certain number of core points. Core points are data points that have a minimum number of neighbours within a pre-defined 'neighbourhood' distance. The clusters also contain points that are not defined as core points but locate within the neighbourhood distance of a core point. All other points are classified as noise. We employed DBSCAN to the initial data set with default values (minimum number of neighbours is 15 and 'neighbourhood' distance is 3). Using these values, our starting initial data set was separated into groups that can describe the individual compositional branches of BCs in the sampling space (see cluster distribution in Fig. 2b). However, with an increasing size of the database, the usage of such default values may result in a merging of several clusters, if the compositions change gradually rather than having distinguishable properties (cluster 27, Supporting Information Fig. D1, Data D). If this gradual change is the case, the default values have to be adjusted to split the big cluster into smaller ones. Furthermore, the information about the size of each cluster can be used such that more BCs are produced for clusters which have a wider distribution in the parameter space.
In comparison to k-means (Steinhaus 1956;MacQueen et al. 1967), DBSCAN has the advantage that the number of clusters do not have to be defined in advance, but the chosen input values control indirectly the numbers of clusters (see above). In this study, we employ a MATLAB implementation of DBSCAN that is available at http://yarpiz.com/255/ypml110-dbscan-clustering. As a result, we now deal with small fractions of the BC ensemble of the initial database that align on lower dimensional manifolds in a much less complex manner. This structure allows us to adapt simplified models of the local distribution and to make estimations about requested BCs from geodynamic simulations. We employ two different approaches to which we will refer to as PCA-constrained sampling (PCA-CS, Section 2.3.2) and manually constrained sampling (M-CS, Section 2.3.3).

PCA-constrained sampling
For each individual cluster, we determine the principal components and use them to parametrize the spatial distribution (Fig. 3). Based on the parametrization with the eigenvectors, we perform a guided sampling, which is we focus the sampling on a region that is interesting in the sense that the determined BCs likely become relevant for future geodynamic simulations. In the following four steps, we explain the details of the algorithm: (1) Find the principal components, that is, eigenvalues and eigenvectors, of the individual cluster distribution.
(2) In the coordinate system of the principal components, we determine the centre of mass and the extreme values along each component that define the ranges for each axis.
(3) We draw new samples by selecting new BCs: (i) either from a uniform distribution between the defined ranges (uniform PCA-CS) (ii) or from a normal distribution that is centred in the centre of mass and the standard deviation observed along the current dimension (normal PCA-CS) (iii) or from a normal distribution that is centred in the centre of the range along the current dimension (normal-centring PCA-CS).
(4) The samples are transformed back into the original BC coordinate system. As the sum of all oxides has to be 100 wt per cent to represent a BC, we repeat the sampling procedure until this requirement is fulfilled and a desired number of new BCs is reached.

Manually constrained sampling
Sometimes a linear description of the distribution within a cluster may not be appropriate and PCA might not be a good choice to parametrize the distribution. We, therefore, present an alternative approach that requires a less intuitive guidance but may result in a more accurate representation of the distribution. We consider the marginal distributions of all oxide combinations independently and determine polynomial fits and use them to define boundaries for a guided sampling. The exact procedure is presented below in steps 1-3 and illustrated in Fig. 4. For simplicity, we use 'Ox1' and 'Ox2' to describe the two components of an oxide combination.
(1) For each oxide combination, polynomial regression is performed and depending on the coefficients of determination, a firstor a second-degree polynomial is used. To set boundaries for the guided sampling, we translate this function along the coordinate axes until all initial BCs are included in the interval (Fig. 4).
(2) Starting from SiO 2 as Ox1, a second oxide (Ox2) is randomly chosen. The behaviour of the polynomial fit is crucial to determine the oxide concentrations and two different cases must be considered.
Case 1:Both boundary lines have the same x-coordinates. Within their horizontal range, a concentration for Ox1 is randomly determined (x, Case 1, Fig. 4). The resulting range in the vertical direction, between both boundary lines (y1 and y2, Case 1, Fig. 4), is used to determine a random concentration for Ox2 (blue star, Case 1, Fig. 4). Case 2:The boundary lines have different x-coordinates (e.g. for a banana-like shape, Case 2, Fig. 4). The axes are flipped to get the same x-coordinates for both boundary lines. Within their vertical Here, the ensemble contains 8457 BCs from the database, which can be summarized into 11 clusters that gather BCs with similar properties. 16 BCs could not be included in particular clusters and are classified as noise.
range, a concentration for Ox1 is randomly determined (y, Case 2, Fig. 4). The resulting range in the horizontal direction, between both boundary lines (x1 and x2, Case 2, Fig. 4), is used to determine a random concentration for Ox2 (red star, Case 2, Fig. 4). For Case 2, several exception rules must be considered to determine the Ox2 concentration, examples are shown in Fig. 4 at the bottom.
(3) It is ensured that the newly determined oxide concentrations are within the boundaries of other oxide combinations that include one of these oxides. If other oxide concentrations were already determined before, they have to be considered as well in the determination of the new oxide concentrations. The sum of all oxides has to be 100 wt per cent to represent a BC. The sampling procedure is repeated until this requirement is fulfilled and a desired number of new BCs is reached.
A further reduction of the parameter space is in principle possible, if the upper and lower boundaries do not have to cover necessarily all initial BCs and if the new oxide concentrations are determined along a normal rather than along an uniform distribution. This adjustment may avoid a possible oversampling of low-probability regions (Afonso et al. 2013).
All compositions obtained from PCA-CS and MCS are rounded on one decimal place and are used as input compositions in Perple X. Thus, isochemical sections in the database are distinguishable from each other in their BCs at least with 0.1 wt per cent (accuracy of an oxide concentration).

Different sampling approaches in comparison
The sampling can be done by applying the different sampling techniques (M-CS or PCA-CS) either to the non-clustered initial BCs or to the clustered initial BC data set (results from both approaches are presented below). Depending on which of the different sampling techniques is used, newly created BCs can be similar or dissimilar to already existing compositions of the database and are distributed in the full parameter space of the initial data set ensemble or only in parts of it.
The manually constrained boundaries include all initial BCs (grey points, Fig. 5  defines its ranges by shifting the polynomial fit through the initial data upwards and downwards by the same amount. This amount is determined by the maximum distance between the polynomial fit and the extreme values. If far located data points only exist in one direction of the polynomial fit (i.e. only above the parametrization, FeO versus MgO, cluster 11, Supporting Information Fig. B1, Data B), the sampling space is large compared to other sampling methods and may include regions without the support of initial BCs. The uniform PCA-CS constrains the sampling range of new BCs to minimal and maximal initial concentrations. Thus, if only one outlier of an initial BC exists (e.g. in FeO versus MgO, cluster 11, Supporting Information Fig. B1, Data B), all new oxide concentrations are sampled uniformly between this outlier and the centre of mass that limits the range towards the other extremum (Supporting Information Fig. B1, Data B). The influence of the outliers on the new BCs is smaller for the normal PCA-CS. Here, we use a normal distribution along the respective principal component to sample a new BC, which implies a low impact of distant samples. This impact is especially notable for cases in which only a few initial data points are located far away from the centre of mass. The normal PCA-CS method is maybe the most suitable one to determine new BCs that are similar to already existing compositions (initial BCs).
The choice of the most suitable sampling method to enlarge a database of stable phase assemblages with new BCs used in Perple X is dependent on the model expectation. If the requested compositions of the initial database consider all relevant processes that may occur in the geodynamic models, it can also be expected that future composition requests are similar. In contrast, if processes are not considered (e.g. crustal partial melting), the results can be different and a much wider parameter space is required to account for such processes. We avoid this problem by taking into account a wide spectrum of different processes in the models whose composition requests were used for the initial database. Generally for the non-clustered case (Fig. 5), if new BCs should be similar to initial BCs, a PCA-CS method, in which the concentrations are determined along a normal distribution, is most suitable. However, large parts of the parameter space are not well (or not at all) sampled with new BCs. In this case, the sampling is not explorative. In other cases, where all initial BCs are considered to be equally important, weighted sampling is not appropriate. Here, M-CS or uniform PCA-CS are better choices, as they sample the parameter space without weighting but within ranges. However, the sampling space is branched in our case (Fig. 1) and therefore all sampling approaches can produce compositions that are dissimilar to the initial data set. One can avoid this behaviour by subdividing the sampling space into smaller units using a clustering algorithm (Barnes-Hut t-SNE and DBSCAN). Each of the clusters can be treated independently in terms of sampling (Supporting Information Fig. B1, Data B) and represents a different compositional branch (Fig. 1c). This procedure allows the overall sampling to be similar for all methods applied (M-CS and PCA-CS; Fig. 6) and the disadvantages of the respective sampling methods become negligible. New BCs are well located around the initial BCs in the parameter space (Fig. 6) and thus existing branches of initial BCs are covered and still maintained by the new BCs.
A combination of the sampling approaches may result in a more complete database to cover the required chemical parameter space for geodynamic models of magmatic systems. Thus, the sampling guided through the location of non-clustered initial BCs with M-CS or uniform PCA-CS produces new BCs with a wider range in Figure 4. Finding new BCs within manually pre-defined boundaries. A polynomial fit through the initial data is used to create the upper and lower boundaries of an oxide combination, within which new oxide concentrations can be determined. Case 1: Boundary lines have the same x-coordinates. The first oxide (Ox1) concentration is determined within the horizontal range of the boundaries (red section). The related range in the vertical direction (blue section) is used to determine the corresponding concentration for the second oxide (Ox2). Case 2: Boundary lines have different x-coordinates. The axes are flipped (90 • rotation) so that the vertical range of the boundaries is used to determine the Ox1 concentration. In that special case shown here, two horizontal ranges are possible for the corresponding Ox2 concentration; one range has to be chosen. Different exception rules have to be considered to determine the Ox2 concentration for Case 2 and are presented at the bottom. Within the respective ranges, a uniform random number generator is used to specify the oxide concentrations. chemistry as of those available in the initial database. This may increase the probability that even requested BCs from geodynamic simulations with an unusual chemistry can roughly be explained by the isochemical sections of the enlarged database. For most of the requested BCs, the initial database may already broadly cover the field of all relevant compositions such that new BCs can be created based on the clustered initial BC ensemble to improve the results. This procedure refines the sampling space such that differences between existing compositions decrease.

Evaluation of new BCs
In the previous section, we demonstrated several sampling approaches to produce BCs that may be requested by future simulations. To quantitatively evaluate the success of various sampling methods, we now define a similarity index as follows: where 'ox' represents the respective oxide and N 1 , N 2 is the number of either initial BCs from the database or new BCs created with the sampling approaches.
Our similarity index D min is a distance-based measure with units of wt per cent and denotes the Euclidean distance between every new BC and its closest neighbour among all initial BCs. This calculation is executed in both directions by either starting from the new BCs of the sampling or from the initial BCs (Table 1). In addition to the minimal Euclidean distances (D min ), the average D min (D min ) is also computed for each sampling method. The D min indicates the most suitable sampling technique to locate new BCs close to initial BCs in the parameter space: the smaller D min , the more similar new BCs are to already existing compositions (see Table 1).
Without using the clustering algorithm, many of the new BCs are located far away from the neighbouring initial BCs, especially for the M-CS and the uniform PCA-CS (see D min , Table 1). These two methods also have the largest standard deviation of their minimal Euclidean distances indicating a large scattering of their data. Defining new compositions within clusters of the initial data ensemble strongly improves the result, as D min is reduced by at least a factor of four. With a clustering applied prior to sampling, all methods produce equally well distributed new BCs that are similar in chemistry to the initial BCs. PCA-CS in combination with a normal distribution slightly outperforms the other methods. Starting from an enlarged database (27 049 instead of 17 088 BCs) produces nearly the same results, although D min decreases moderately.

D I S C U S S I O N
Ideally, we would like to enlarge the database such that no additional isochemical sections of stable phase assemblages are required to describe the chemical evolution of magmatic systems in geodynamic models. We have presented several approaches to create and enlarge a petrological database. Whether or not a new section of stable phase assemblages is required depends on the geodynamic model in which new compositions are formed due to melt extraction. In these models, each petrological composition is described by an isochemical section with the closest BC. If a suitable section is not available, it must be computed. Consequently, after computing the missing sections of stable phase assemblages, the geodynamic simulation has to be repeated until no further sections are required. In the following, we discuss the extent to which our independently enlarged database helps to avoid such time-consuming repetitions and how the database can be applied to geodynamic simulations of magmatic systems.  Table 1. Minimal Euclidean distances (D min ) computed between new BCs obtained from the sampling and initial BCs from the database. Presented are the average D min (D min ) and the standard deviation (σ Dmin ). Sampling is applied to the non-clustered initial BCs (top) or to the clustered initial BC data set (bottom). D min are computed from the newly created to the initial BCs (left), and from the initial to the newly created BCs (right). 1350 new BCs are used from each sampling method. The number of initial BCs is 17 088.

M-CS/PCA-CS to initial BCs
Initial BCs to M-CS/PCA-CS

New samples and their relevance
The estimation of required BCs is tested by comparing the requested compositions during a geodynamic simulation with compositions obtained from the different sampling approaches. With an automatic improvement of the sampling density, the likelihood is increased that future requested compositions can be explained by the new samples. The sampling is based on either the non-clustered initial BCs or the compositional clusters to demonstrate the importance of the clustering algorithm. The minimal Euclidean distances (D min ) are computed (eq. 1) between new BCs created with the sampling methods and those requested from geodynamic simulations. For the respective geodynamic simulations, the number of available isochemical sections is equal to the number of initial BCs (17 088) used to determine new BCs with one of the sampling approaches. The requested compositions exclusively represent BCs that are not used already in the database. The D min between requested BCs and BCs created in the clustered sampling space indicate that all sampling methods are equally well suited to determine new compositions that are similar in chemistry to those requested in the geodynamic simulations ( Table 2). The large D min values between BCs created in the non-clustered sampling space and those requested during geodynamic simulations (Table 2) result from the fact that new BCs are determined within the whole range of initial BCs. Not all of these initial BCs are relevant for the current geodynamic simulation. Consequently, new composition requests are most likely never distributed over the full parameter space of initial BCs.

Evolution of requested BCs with an increasing number of available isochemical sections of stable phase assemblages
To ensure that the existing sections of stable phase assemblages cover the range in BCs required for geodynamic simulations, we investigate the evolution of requested compositions. These compositions have been requested during geodynamic simulations which have used a database consisting of either 17 088 or 27 049 isochemical sections. Only if the differences between requested and available sections of stable phase assemblages are insignificant, the chemical evolution of magmatic systems can be tracked in detail. To quantify these differences, the D min and the maximum D min are computed between the requested and the existing BCs in the database. The maximum D min specifies the maximum Euclidean distance that any requested composition can have to the closest already existing composition in the database. Rocks are tracked during geodynamic simulations only by the isochemical sections with the closest BCs, such that differences between requested and available sections become smaller as the database gets larger. Both the D min and the maximum D min decrease by 50 per cent if the database is enlarged from 17 088 to 27 049 sections. With the enlarged database, a D min of 0.58 is reached, which seems to be good enough for our geodynamic simulations given the possible thermodynamic and geodynamic inaccuracies and uncertainties caused by, for example, interpolation, resolution and calibration. In case the initial rock composition changes in the geodynamic model, for example, to a more hydrous one, this database may become insufficient and further isochemical sections of stable phase assemblages are required.
The number of required isochemical sections and the degree of accuracy with which rock properties have to be tracked in geodynamic models can strongly vary depending on the problem that is addressed. Thus, if one is mostly interested in estimating the different rock types, larger differences between requested and available sections of stable phase assemblages may be acceptable (with D min of a few wt per cent). For other problems, this accuracy may not be sufficient, especially if elements with low weight per cent affect the mineral assemblage. Here, we suggest that D min should not be larger than 1 wt per cent on average. However, two aspects must be kept in mind: (i) the Euclidean distance represents a sum of all distances between the respective oxides which may compensate each other and (ii) the Euclidean distance considers all oxides equally. Oxides with a high concentration (e.g. SiO 2 ) can have a major contribution to the Euclidean distance, but its change in concentration may not have a big influence on the phase stability. Thus, differences in rock properties between requested and available isochemical sections can be small, although the Euclidean distances are relatively large. Whereas, small changes in concentration for oxides that have a low concentration can have a big effect.

Applicability of the database
The applicability of the database is demonstrated using a TAS (total alkali-silica) diagram for plutonic rocks (Fig. 7). The 27 049 BCs of the database cover nearly all of the relevant igneous rock types, and especially along magma differentiation trends their density is high (Fig. 7). Melt is extracted in our geodynamic models (Fig. 8) from a source that is basaltic in composition (gabbro) to form higher differentiated rocks such as monzonite, quartz monzonite or granite (Fig. 7). The related cumulates form a reverse trend towards lower SiO 2 concentrations (peridotgabbro or even more depleted). The BCs of the database originate either from requested compositions of the geodynamic simulations or from our autonomous approach, and are well located on liquid lines of descent of natural igneous rock suites (green arrows, Fig. 7). The liquid line of descent describes the way the liquid composition changes as crystallization proceeds. This comparison demonstrates that the database is suitable to model magmatic processes with a high precision, while even small changes within the same rock type can be tracked. Repeating geodynamic simulations with this enlarged database consisting of 27 049 BCs, indicates that the newly requested compositions are almost perfectly explained by the existing ones in the database (red and black symbols, Fig. 7). Therefore, although the number of newly requested compositions is high (>10 000), they are not required to extend the database, as the compositional evolution can be tracked already with sufficient precision. Applying the autonomous approach to this database shows that most of the new BCs have similar compositions like the ones in the database, with a wider scattering for a few samples (Fig. 7). Comparing them with requested compositions indicates that approximately 50 per cent of the new BCs may be irrelevant for the current geodynamic forward models, as requested BCs occur in a much narrower range (Fig. 7). Thus, applying the sampling approaches generates new BCs that were potentially relevant in older geodynamic simulations, but not all of them in the current ones which were based on 27 049 BCs in the database. This effect results from the fact that the database was initially constructed with requested compositions which were mostly not based on sufficiently accurate sections of stable phase assemblages. With the enlargement of the database, the differences between requested and available isochemical sections have been decreased. Therefore, one could guide the sampling through the location of the requested BCs to focus the sampling on the more relevant part of the oxide parameter space, but it would strongly limit the compositional range of future composition requests.
The pre-computation of stable phase assemblages enables us to model the evolution of phase properties in magmatic systems over geologically relevant timescales. The large database allows us to investigate different kinds of magmatic processes, for example, mantle melting, fractional crystallization, crustal assimilation or even magma mixing. Even if not all isochemical sections from the database were finally required in our models, they might be useful for other thermomechanical models that are taking chemical processes into account. An example geodynamic simulation that Downloaded from https://academic.oup.com/gji/article/223/3/1820/5900142 by university of winnipeg user on 16 February 2022 Table 2. Minimal Euclidean distances (D min ) computed between new BCs obtained from the sampling and requested BCs from geodynamic simulations. Presented are the average D min (D min ) and the standard deviation (σ Dmin ). Sampling is applied to the non-clustered initial BCs (top) or to the clustered initial BC data set (bottom). D min are computed from the requested to the newly created BCs (left) and from the newly created to the requested BCs (right  Middlemost (1994). tracks the interaction between mechanical and chemical processes by using the stable phase assemblages of the database, is shown in Fig. 8, more results and information are provided in Rummel et al. (2020). In this model, basaltic sills are injected in the crust and trigger dike formation. The extraction of melt via tensile fractures is creating a heterogeneous system forming diverse rock types and locally changing cumulate/residuum chemistry. Melt that is extracted from already existing dikes evolves towards highly differentiated rocks. These highly evolved rocks are also produced in the model by partial melting of host rocks forming a parallel differentiation trend to the magmatic products originated from the basaltic sills (Fig. 8). Due to the large database, these different rock

Distribution of different rock units in the oxide parameter space
To visualize the magmatic evolution in the geodynamic model, we differentiate different rock units based on their genesis and/or location rather than their primary composition and assume that they are fully crystallized (a rock unit is defined here as, e.g. dike, sill, host rock or residuum, see Fig. 8). They are mostly distinguishable regarding their oxide concentrations and thus they are assignable to different compositional branches (Fig. 9). We can take advantage of this effect and can refine the sampling only for specific rock units for which a higher precision is desired. Applying the clustering algorithm (Section 2.3.1) to the different BCs shows that the resulting clusters or a collection of them are able to reflect the different rock units. The results demonstrate the advantage of using individual clusters of the entire ensemble, as it allows to sample specific rock units (Fig. 9).

C O N C L U S I O N
An efficient method to generate a database of stable phase assemblages is crucial to investigate the chemical evolution of magmatic systems in petro-thermomechanical models. To establish such a comprehensive database, bulk compositions (BCs) must be collected to compute sections of stable phase assemblages for each chemical composition. To avoid such a collection exclusively from geodynamic simulations, a new approach is developed that creates new BCs automatically. BCs are determined using either M-CS (manually-constrained sampling) or PCA-CS (principal component analysis-constrained sampling). M-CS and uniform PCA-CS create new BCs that are uniformly distributed over the sampling space considering its extreme oxide concentrations from the initial database that was constructed by a series of iterative geodynamic simulations. Both sampling approaches are strongly affected by outliers of the initial BC ensemble. PCA-CS with a normal distribution locates most of the new BCs in the centre of mass of initial BCs and thus gives less weight to the marginal parameter space. To better focus the sampling on individual compositional branches of the initial data set, the whole sampling space is split into smaller fractions of similar compositions (clusters). This split is done by applying the methods Barnes-Hut t-SNE in combination with DBSCAN. BCs that are requested during geodynamic simulations are shown to be well covered by BCs determined with the sampling approaches. New sections of stable phase assemblages for specific rock types can be produced if the sampling is guided by the clusters. The database is able to track the chemical evolution in magmatic systems in detail over geologically relevant timescales. Produced igneous rocks vary between peridotgabbro (or even more depleted) for cumulates/residuum and highly differentiated rocks like syenite, quartz monzonite or granite for dikes.

A C K N O W L E D G E M E N T S
We would like to thank the ZDV at the University of Mainz to provide us computational time on MOGON I cluster to run Perple X in parallel. We would like to thank the reviewers Mohit Melwani Daswani and James Connolly for the useful comments which strongly improved the paper.

F U N D I N G
Funding was provided by the VAMOS Research Center, University of Mainz (Germany) and by the ERC Consolidator Grant MAGMA (project #771143).

S U P P O RT I N G I N F O R M AT I O N
Supplementary data are available at GJ I online. Figure B1. New BCs generated with different sampling methods, shown for two different clusters (cluster 1 and cluster 11, Figs 2 and 6). The initial BCs of the respective clusters are shown as grey points. 8457 initial BCs are used for the clustering. 5078 of them are in cluster 1 and 226 of them in cluster 11, the rest are allocated to the other clusters. New oxide concentrations are determined (green points) within the manually constrained boundaries (white lines). The lines are the respective shifted polynomial fitting line through the initial BCs (here, second-degree polynomial) and are limited to their maximum and minimum concentrations. New BCs created with one of the PCA-CS methods are shown with the respective colours: turquoise for compositions determined uniformly along each principal component, pink for those determined with a normal distribution (centred in the mean) and yellow for compositions determined with a normal distribution (centred between the maximum and minimum values) along the principal component.  ∼10 000 requested BCs from both the first set of runs (orange points) and from the second set of runs (yellow points). In grey, BCs of the database (∼17 000 isochemical sections are used for the first round and ∼27 000 ones for the second round). For each requested composition, melt extraction conditions are presented, with pressure (b), temperature (c) and stable melt fraction (d). The rock units declaring the origin of the requested BCs (crustal host rocks, dikes or basaltic sills) are shown in (e). 'Dike 1' represents the first generation of dikes (or their respective cumulates), from which a second generation ('Dike 2') can form. 'a' defines dikes originating from crustal host rocks, and 'b' those originating from basaltic injected sills. The information about the type of the composition (liquid or solid fraction) is provided in (f). Figure D1. (a) Requested BCs (∼22 500 in total) from 14 geodynamic simulations are clustered into 28 groups, visualized as 2-D representation of the Barnes-Hut t-SNE projection showing the relative coordinates of the data points. Colours show clustering results using DBSCAN applied to the ensemble in the projection space. b) Each group contains specific BCs, shown for the oxide combination CaO versus SiO 2 . Each BC is related to an equilibrium pressure, temperature and melt fraction condition (c-e) at which it was generated. The information about the type of the composition (liquid or solid fraction) is provided in (f). The rock units declaring the origin of the requested compositions (crustal host rocks, dikes or basaltic sills) are shown in (g). 'Dike 1' represents the first generation of dikes (or their respective cumulates), from which a second generation ('Dike 2') can form. 'a' defines dikes originating from the crustal host rocks, and 'b' those originating from basaltic injected sills. Figure E1. Around 1350 new BCs are created between manually constrained boundaries (white lines). The initial data (grey points, 17 088 BCs) are not clustered here. Within the manually constrained boundaries, new oxide concentrations are randomly determined, indicated as green points. Oxides are given in wt per cent. The frequency distribution of points is shown as histogram plots for each oxide. This figure is an extension of Fig. 5. Figure E2. Around 1350 new BCs are randomly determined using the uniform PCA-CS. The initial data (grey points, 17 088 BCs) are not clustered here. Oxides are given in wt per cent. The frequency distribution of points is shown as histogram plots for each oxide. This figure is an extension of Fig. 5. Figure E3. Around 1350 new BCs are randomly determined using the normal PCA-CS. The initial data (grey points, 17 088 BCs) are not clustered here. Oxides are given in wt per cent. The frequency distribution of points is shown as histogram plots for each oxide. This figure is an extension of Fig. 5. Figure E4. Around 1350 new BCs are randomly determined using the normal-centring PCA-CS. The initial data (grey points, 17 088 BCs) are not clustered here. Oxides are given in wt per cent. The frequency distribution of points is shown as histogram plots for each oxide. This figure is an extension of Fig. 5. Figure E5. Around 150 new BCs are generated for cluster 11 ( Fig.  2 and Supporting Information Fig. B1, Data B) and are highlighted here. 17 088 initial BCs are used for the clustering (grey points), 226 of them are in cluster 11, the rest are allocated to other clusters (not highlighted). New oxide concentrations are randomly created (green points) within the manually constrained boundaries (white lines). The lines are the shifted polynomial fitting line through the initial data (here, second degree polynomial) and are limited to the maximum and minimum concentrations of the initial BCs. Oxides are given in wt per cent. The frequency distribution of points is shown as histogram plots for each oxide. This figure is an extension of Fig. 6. Figure E6. Around 150 new BCs are generated for cluster 11 (Fig. 2 and Supporting Information Fig. B1, Data B) and are highlighted here. 17 088 initial BCs are used for the clustering (grey points), 226 of them are in cluster 11, the rest are allocated to other clusters (not highlighted). New oxide concentrations are randomly created using uniform PCA-CS. Oxides are given in wt per cent. The frequency distribution of points is shown as histogram plots for each oxide. This figure is an extension of Fig. 6. Figure E7. Around 150 new BCs are generated for cluster 11 (Fig. 2 and Supporting Information Fig. B1, Data B) and are highlighted here. 17 088 initial BCs are used for the clustering (grey points), 226 of them are in cluster 11, the rest is allocated to other clusters (not highlighted). New oxide concentrations are randomly created using normal PCA-CS. Oxides are given in wt per cent. The frequency distribution of points is shown as histogram plots for each oxide. This figure is an extension of Fig. 6. Figure E8. Around 150 new BCs are generated for cluster 11 (Fig. 2 and Supporting Information Fig. B1, Data B) and are highlighted here. 17 088 initial BCs are used for the clustering (grey points), 226 of them are in cluster 11, the rest are allocated to other clusters (not highlighted). New oxide concentrations are randomly created using normal-centring PCA-CS. Oxides are given in wt per cent. The frequency distribution of points is shown as histogram plots for each oxide. This figure is an extension of Fig. 6. Figure E9. Requested BCs are shown for two sets of geodynamic simulations (each of them consists of seven simulations). ∼10 000 requested BCs from both the first set of runs (orange points) and from the second set of runs (yellow points). Grey points indicate the initial BCs of the database. The compositions are shown for all SiO 2 oxide combinations. Oxides are given in wt per cent. This figure is an extension of Supporting Information Fig. C1 (Supporting Information Data C). Figure E10. Melt extraction conditions of requested BCs from geodynamic simulations, for all SiO 2 oxide combinations. For each liquid (extracted melt) or solid composition (remaining residual material), the conditions are stored at which their compositions are stable, with pressure, temperature and stable melt fraction. The rock units declaring the origin of the requested compositions (crustal host rocks, dikes or basaltic sills) are shown. 'Dike 1' represents the first generation of dikes (or their respective cumulates), from which a second generation ('Dike 2') can form. 'a' defines dikes originating from the crustal host rocks, and 'b' those originating from basaltic injected sills. The information about the type of the composition (liquid or solid fraction) is provided in the last column. Oxides are given in wt per cent. This figure is an extension of Supporting Information Fig. C1 (Supporting Information Data C). Table A1. Excluded and solution phases (with activity-composition models and references) used in Perple X 6.7.9 (Connolly 2005(Connolly , 2009).
Please note: Oxford University Press is not responsible for the content or functionality of any supporting materials supplied by the authors. Any queries (other than missing material) should be directed to the corresponding author for the paper.