Transdimensional inference of archeomagnetic intensity change

conﬁrm agreement of our method with an integrated likelihood approach. We then apply the method to three archeomagnetic data sets all reduced to a single location: one temporally well-sampled within 700 km from Paris (here referred to as Paris700), one that is temporally sparse centred on Hawaii, and a third (from L¨ubeck, Germany and Paris700) that has additional ordering constraints on age from stratiﬁcation. Compared with other methods, our average posterior distributions largely agree, however our credible intervals appear to much better reﬂect the uncertainty during periods of sparse data coverage. Because each ensemble member of the posterior distribution is piecewise linear, we only ﬁt oscillations when required by the data. As an example, we show that an oscillatory signal, associated with temporally localized intensity maxima reported for a sparse Hawaiian data set, is not required by the data. However, we do recover the previously reported oscillation of period 260 yr for the Paris700 data set and compute the probability distribution of the period of oscillation. We further demonstrate that such an oscillation is unresolved when accounting for age uncertainty by using a ﬁxed age and with an artiﬁcially inﬂated error budget on intensity.


Transdimensional archeomagnetic analysis
2009 certain periods (sometimes referred to as dark ages), a reflection of either lack of data collection or simply that there are few remaining artefacts suitable for analysis. Scarcity is a particular problem for analysis of localized transient phenomena, which appear to punctuate the quasi-steady ancient magnetic field, such as archeomagnetic jerks: sharp changes in the local direction of the field over timescales of about one century associated with an intensity peak (Gallet et al. 2003), and geomagnetic spikes: large, rapid changes in the local intensity over a few decades (e.g. Shaar et al. 2011). In rare cases, some data sets offer a well-dated stratified sequence (Schnepp et al. 2009;Shaar et al. 2011;Kostadinova-Avramova et al. 2014) which add significant constraints to the behaviour of the archeomagnetic field through time.
Models of the archeomagnetic field may be global, regional or reduced to a single location (in the latter case, by using the virtual axial dipole approximation). All studies to date have used some form of prescribed smooth dependence in time to accommodate temporal changes of the field. The most widely used framework adopts two means of smoothing: first by using a temporal expansion in terms of cubic spline functions centred on a set of knot points, and secondly by reducing the temporal complexity of the model by a penalization method (Korte & Constable 2003;Lanos 2004;Thébault & Gallet 2010;Hervé & Lanos 2017;Tema et al. 2017). A prescription of the knot points in advance places an a priori constraint on the solution, for dynamics on timescales more rapid than the closest spacing cannot be accommodated. Furthermore, the penalization of temporal complexity (often the second time derivative of the radial field at the core-mantle boundary) is applied uniformly over the model, which means that some periods will be oversmoothed and some undersmoothed. A sliding-window with linear regression (Lanos et al. 2005) is an alternative to these methods, where the window duration may be a function of the temporal distribution of data (Le Goff et al. 2002). It is possible to remove temporal smoothing all together by adopting a stochastic process representation of the geomagnetic field (Hellio et al. 2014), although this procedure is still subjective as the parameters of this process (the covariances) need to be prescribed.
Transdimensional Bayesian methods offer an attractive alternative, in which the models are not regularized but allow the data to self-select model complexity, here the number and position of internal knot points (also known as internal vertices or change points). Despite the freedom of this procedure to choose an unlimited number of degrees of freedom leading to overfitting and highly complex models, in fact owing to the natural parsimony of the Bayesian framework the models are inherently smooth and large scale, and admit fine-structure only where required by the data (Sambridge et al. 2013). Rather than resulting in a single time-dependent solution, the Bayesian modelling produces a posterior distribution, numerically approximated by an ensemble of models sampled using a Markov chain, whose shape and variance gives important information about the behaviour of the solution and confidence that should be placed in it. Such transdimensional methods have been widely applied across the geosciences, in applications such as Earthquake rupture (Dettmer et al. 2014), geomagnetic reversals (Ingham et al. 2014), seismic imaging (Bodin et al. 2012), climate studies (Hopcroft et al. 2007), 2-D seafloor resistivity (Ray et al. 2014), inversion for gravitational anomalies (Luo 2010) and inference of abrupt changes in geochemical records (Gallagher et al. 2011).
An additional benefit of using a Bayesian procedure is that poorly known quantities (on which the posterior distribution depends), that would otherwise need to be fixed at some arbitrary level in a traditional inversion approach, can be co-estimated (Malinverno & Briggs 2004;Bodin et al. 2012;Hervé & Lanos 2017). This has a particular benefit for archeomagnetism for which data ages can be difficult to assess. Archeological data are typically assumed uniformly distributed between two dates determined by typology or historical reference; in contrast, volcanic data are dated using radiocarbon techniques and in this work their errors are assumed normally distributed with zero mean and a given standard deviation, conforming to the databases from which they are extracted. By including the data ages as additional or hyperparameters, a Bayesian formulation of the problem allows us to sample both the data ages and the model description in the same way, producing marginal posterior distributions for the data ages themselves (Lanos 2004;Hellio et al. 2014;Schnepp et al. 2015;Hervé & Lanos 2017) alongside that of the archeomagnetic intensity model. In this way, data with different intensities but the same mean estimated age may be judged by the posterior distribution to be associated with distinct ages. Such a scheme differs fundamentally from a typical global modelling approach in which apparently contemporaneous data, rather than being treated as distinct, are averaged to a single value (Korte & Constable 2018). Furthermore, in terms of archeomagnetic dating, this fully integrated approach that we adopt here should be contrasted with typical methods of dating in which the magnetic signature of a datum is matched against a separately produced timedependent model of the geomagnetic field (e.g. Pavón-Carrasco et al. 2011;Hervé & Lanos 2017;Lanos & Philippe 2017): here the age distributions are a component of the model output.
In this paper, we apply the transdimensional Bayesian framework to three time-series of archeomagnetic intensity, chosen to represent very different situations regarding the nature and quality of the archeomagnetic data sets available at a regional scale (in particular, their age distribution and uncertainties, consistency between the data, number of data), which are described in Section 2. In Section 3, we summarize the Bayesian method; more detail is included in the Appendices. Section 4 describes important benchmarks of the method against synthetic data sets. Sections 5-7 show the results of our method applied to the three archeomagnetic data sets, whose results are further discussed in Section 8 with reference to results from other methods.

T H R E E E X E M P L A R DATA S E T S
Archeomagnetic data includes measurements of either (or both of) intensity (i.e. magnitude) and direction (inclination, declination). For this initial exploration of the method, we will focus only on intensity variations. Below, we give an overview of the three chosen data sets; more details (e.g. on the specifics of data selection) can be found in Appendix A. These data sets are shown in Fig. 1, which shows not only the individual data (blue points) but the temporal data density (green histogram). For each data set, the magnetic intensity of each datum is assumed to be normally distributed, with a mean and standard deviation estimated from laboratory analysis as shown by the vertical error bars; the interpretation of the horizontal error bars is described in turn for each data set.
(1) Paris700: The first data set, Paris700, relies on stringent selection criteria (see Genevey et al. 2013), has quasi-uniform temporal resolution from about 1000 BC to 2000 AD and comprises data collected from within 700 km from Paris. Its 154 entries are all reduced to the latitude of Paris (48.9 • N) using the virtual axial dipole moment (VADM) approximation. The data age errors are assumed uniformly distributed within given ranges (shown by the horizontal error bars in Fig. 1(top panel). The three exemplar data sets that will be analysed by the Bayesian method in this paper: Paris700 (top panel); Hawaii (middle panel) and Lübeck-Paris700 (bottom panel). The data points are shown as blue circles with error bars indicating the uncertainty as described in the text. The red data points obey a strict ordering in age. The data-density is shown by the green histograms using 20 bins of equal width for each data set.
(2) Hawaii: To examine the effect of data sparsity and nonuniform sampling on our method, we consider a second data set of Hawaiian intensity measurements taken from the Geomagia.v3 database (Brown et al. 2015) over the past 4500 yr. Here, we adopted minimalist (i.e. loose) selection criteria and this case therefore corresponds to extracting a local data set from the global database in a relatively blind manner. All 134 retained entries are reduced to the latitude of Kilauea volcano (19.42 • N). The data are shown in Downloaded from https://academic.oup.com/gji/article-abstract/215/3/2008/5101441 by CNRS user on 30 September 2019 Fig. 1(middle panel), where the horizontal error bars indicate one standard deviation in radiocarbon-determined age. Note that recent 'historical' lava flows are assigned a normally distributed age uncertainty of 0.5 yr. The density of data is heavily weighted towards the most recent times.
(3) Lübeck-Paris700: Finally, some archeomagnetic data sets contain stratified data whose ages obey a strict ordering in time. We investigate a third data set, comprising the (unstratified) Paris700 data (restricted between 900 and 2000 AD) with 22 stratified data with otherwise poorly constrained ages from Lübeck, Germany (Schnepp et al. 2009), reduced to Paris using the VADM approximation, for the period ∼1300 AD to ∼1750 AD. The data (unstratified: blue; stratified: red) are shown in Fig. 1(bottom panel) where all ages are uniformly distributed in time with additional strict ordering constraints where relevant.

Bayesian inference
Bayesian inference is a procedure by which the knowledge of model parameters is improved from the prior information, expressed in terms of probability distributions, by the introduction of data. The result is not simply a single optimized model that fits the data, but rather the posterior distribution, characterized by an ensemble of models, which describes the probability of the unknown model parameters given the observed data. In this way, the posterior distribution not only describes the most likely value (i.e. the mode of the distribution), but also diagnostics such as credible intervals (the analogue of confidence intervals). The analysis rests upon Bayes' theorem: where m is the model and d is the collection of data. Expressed in words, the posterior (on the left-hand side) for the model given the data, is proportional to the product of the likelihood (the first term in the numerator) and the prior; the term in the denominator is termed the evidence (which does not depend on the model m) and although useful for choosing between physical models, will not enter into our analysis. The posterior distribution then depends on several key elements: the model parametrization, the likelihood and the prior. We will describe each in turn.

The model
The principal part of any individual model is the representation of intensity change with time, which we parametrize using continuous piecewise-linear functions defined over a period [t start , t end ]. The reason for our choice of linear, rather than higher order temporal dependence (e.g. cubic splines) is that we wish to assert minimal smoothness on the posterior. Any curvature within the posterior must be required by the data and cannot be an artefact of our parametrization. The piecewise linear regression function, g, is interpolated from the value of the intensity at the endpoints F start and F end , along with the age t j and intensity F j with 1 ≤ j ≤ k of any of the k internal change points. The number of internal change points, k ≥ 0, is a free parameter of this transdimensional model. We note that although every model individually is piecewise linear, taking ensemble statistics yields a smoothed evolution that gives a good description of magnetic field variability.
In this study, we will assume that the errors on both the data intensities and ages are known. However, because outliers are common in archeomagnetic data sets, a fuller treatment could include these errors as hyperparameters (Malinverno & Briggs 2004). Key to our algorithm is an appropriate handling of the data ages, on which the model depends. There are multiple possibilities of how we can treat them. First, we could treat the ages as known and error-free, and simply increase the error budget in the intensity uncertainty as a proxy for a combined error estimate (e.g. Ingham et al. 2014). Second, we could use a bootstrap method to repeatedly draw possible ages from their given distributions (e.g. Thébault & Gallet 2010). Third, we could incorporate the unknown data ages into an 'integrated likelihood', in which the data ages are entirely absent from the model vector (Sambridge 2016). Fourth, we can include the data ages as hyperparameters into our model vector. In this work, we adopt the last approach in our age-hyperparameter (or AH) method: not only does this handle the age uncertainties in an identical manner to the uncertainty in the other model parameters, but after recovery of our posterior distribution, the marginal distributions of the data ages can be readily computed. For the sake of completeness, however, we will confirm some of our calculations using an integrated likelihood approach (IL; see Section 4.3).
It is worth remarking that our incorporation of data ages as hyperparameters conflicts with the usual division between model-space and data. To an archeomagnetist, the data are the set of measurements of intensity along with the data ages. Yet, in isolation, the data ages are not informative about the distribution of the temporal variation of intensity. Therefore, we partition the intensity and ages-absorbing the ages into the unknown model vector and treating the 'data' (d of eq. 1) as simply the set of N data intensity values, denoted here as We write the model vector as where T indicates transpose, and m is a column vector of size 2k + 3 + N data , where f = (F start , F end , F 1 , F 2 , . . . , F k , t 1 , t 2 , . . . , t k ) describes the 2k + 2 vertices of the piecewise linear time-dependence, and a = (a 1 , a 2 , . . . , a N data ) are the ages of the N data data. When adopting the integrated likelihood, the model is simply m = [f, k] T . Fig. 2 shows a cartoon of a low-dimensional model with two data points and three internal vertices. Two aspects of notation are worthy of comment. The first is that although we have used both t and a to denote an age, this allows us to distinguish between the ages of the change points t j and the data a j . Second is that we have used the calligraphic symbol F j to denote the data (laboratory determined intensity) and F to denote the intensity value of a model vertex.

The likelihood
Any specific realization of the model m prescribes the data ages a along with the set of internal vertices required to define the piecewise-linear regression function, g, that describes intensity variation with time. To define a likelihood, for the jth datum we compute the difference between its intensity F j and the value of the regression function g(a j ) evaluated at its age. Because of the assumed normally distributed errors on the data F j , and because we assume errors independent between measurements, the likelihood of all the 2012 P.W. Livermore et al.

Intensity
Time Figure 2. Cartoon of a low-dimensional model over a fixed period [t start , t end ] constrained by two archeomagnetic data, with given intensities F 1 , F 2 and unknown ages a 1 and a 2 . The intensity variation is described by a piecewise-linear interpolation between the three internal vertices (t i , F i ) and the two endpoint intensities F start and F end . The arrows indicate how the Monte Carlo method allows each parameter to change. data intensities being realized is proportional to e −φ , where where σ j is the given estimate of the standard deviation. We note that, because only the ratio of likelihoods is relevant, the constant of proportionality never enters the analysis. In Sections 5 and 6, for the Paris700 and Hawaii data sets, the distribution of residuals will confirm the assumption of normally distributed intensity errors. The integrated likelihood approach of Sambridge (2016) is a generalization of the formula above to the case where the data ages are formally unknown but have a known distribution. By integrating (2) over all its possible values, it is possible to calculate a likelihood weighted by the age distribution; see Section 4.3 for further details.

The choice of prior distributions
A keystone of Bayesian methods is the description of the prior information as a probability distribution that characterizes what is known or supposed about the model before the introduction of data. The data ages are independent of the rest of the model, and by conditioning on the value of k, we can write the prior as We assume that p(k) is a uniform distribution over the interval [0, k max ], where k max is prescribed (we typically take k max = 50), that is Given a value of k, the intensity at each vertex F j and its corresponding age t j is assumed mutually independent, thus where j = 1, 2, . . . k indexes the internal intensity coefficients, and we have adopted the notational convenience F k + 1 = F start and F k + 2 = F end . We assume that the prior distribution of each intensity value F j is uniform: whereF max andF min are prescribed values (typically taken to be 100 μT and 30 μT).
For the purposes of model development, it is expedient to assume that the vertex ages are distributed uniformly from a choice of N equally spaced ages between t start and t end ). This means that the joint probability distribution of the vertex ages (in which their order is irrelevant) is However, for infinitely large N (that will describe our final model, see Appendix A), we convert from a discrete to a continuous variable; the prior distribution of each vertex age, t j , becomes independent, uniformly and identically distributed: For unstratified data, each datum age is supposed independent so that For a datum of index j that is archeologically dated, its age is uniformly distributed between given dates [a min j , a max j ] and has a prior distribution of whereas for normally distributed ages (e.g. from radiocarbon dating) the prior distribution is where μ j and λ j are the given mean and standard deviation. For a subset of data that is stratified, the corresponding set of ages are not independent and must obey a constraint of the form a j < Downloaded from https://academic.oup.com/gji/article-abstract/215/3/2008/5101441 by CNRS user on 30 September 2019 a j + 1 < · · ·. Formally, such a constraint will alter the mathematical structure of the prior distribution as given above, but because of the methodology by which we sample the posterior distribution this structure never actually enters our analysis. For the data ages, we only actually need to be able to draw from their prior distribution; practically, after drawing ages according to (7) or (8), we simply discard any set of ages that violates any required age ordering, and redraw. It is further noteworthy that although we will focus on data whose ages are solely either uniformly or normally distributed, it is straightforward to consider a mixture of age distributions. We note that despite specifying each model vertex intensity F j to have a uniform distribution, in general the associated prior for the linearly interpolated intensity g(t) (for given arbitrary t) is not uniformly distributed. The reason why the uniform characteristic does not carry over to g(t) is because it only attains values close to its bounds ofF max andF min when both of the vertices used in determining g(t) by linear interpolation also have extreme intensity, a situation which has small probability compared to the probability of a single vertex being extreme. This can be seen more formally in the simple case of k = 0 (no interior vertices). The distribution of intensity at the middle of the age range t mid = (t end + t start )/2 is simply the average of the two endpoint intensity distributions, that is, one half of the sum of two identically uniformly distributed random variables, which takes the form of a triangular distribution that has a peak value of (F max +F min )/2. Thus g(t mid ) is not uniform, in contrast to both g(t start ) and g(t end ), which are uniform. For arbitrary k this picture is more complicated, but a similar reasoning applies. In this way, our mean, median, mode, and credible intervals in the figures below are not exactly the mean, and median of the posterior model p(m|d), but rather of the posterior projected in the space of g(t).
Finally, it is of note that several authors (e.g. Green 1995;Hopcroft et al. 2007) use order statistics in their prior, in order that the change points are spread out. Here, we do not want to assume this, as in fact we are interested in features for which there is rapid time-dependence and therefore we want to allow clustering of vertices. Many of our prior distributions are uniform and may not in fact encode zero information despite assigning all values within their range the same probability (Jaynes 2003).

The AH-RJMCMC sampling algorithm
Having defined the prior and likelihood, all that remains is to characterize the posterior distribution, which we undertake numerically by drawing a sufficiently large set of samples using the reverse-jump Monte Carlo Markov Chain algorithm (RJMCMC; Green 1995). The resulting Markov chain of models, in which each model depends only on its predecessor, represents a random walk through the space of all permissible models; its distribution converges to the posterior distribution. The chain is built iteratively by either adding a duplicate of the current last model m or adding a proposed model m , a perturbation to m, which is drawn from a distribution of alternative models q(m |m). The decision of whether to add m or m to the chain is based on an acceptance test. The underlying methodology does not require any particular choice of q, but rather different choices of q will simply alter the speed of convergence to the posterior distribution (Green 1995).
In our algorithm, which is based on that of  and Gallagher et al. (2011), the model m differs from m by one of several possible perturbations, as depicted in Fig. 3 (see also Fig. 2) grouped by type. These perturbations depend on a set of user-specified parameters: σ change , σ birth , σ move and β that are discussed later.
For perturbations of type 1, the intensity value of a randomly chosen internal vertex is perturbed from its current value by a random amount distributed as N (0, σ 2 change ). Perturbations of type 2 alter the temporal arrangement of the vertices. For a perturbation 2a, the age of a randomly chosen vertex is altered by the addition of a normally distributed perturbation, N (0, σ 2 move ). For a perturbation 2b (vertex birth), a new vertex is proposed randomly (according to a uniform distribution) within the temporal limits of the model, with an intensity that is distributed N (0, σ 2 birth ) relative to its linearly interpolated value based on the current vertex distribution. Perturbation 2c describes vertex death, where a vertex is removed from the model. Lastly, perturbation 3 describes the resampling of N data /β ages within their given prior distributions where x denotes the floor (i.e. the integer part of) x. The specific details of each move type and their associated acceptance criteria mirror those of  and are described in Appendix B.
Each run is initialized with a model with a random number of vertices, described by coefficients f and a randomly chosen from their given prior distributions. Diagnostic statistics of the distribution of the posterior intensity and the marginal distribution of the data ages are computed 'on-the-fly', usually adopting thinning (e.g. analysing only every 100th chain member in order to minimize the effect of any temporary localized confinement in model space of the chain). The algorithm is run until the posterior distribution (as indicated by the computed diagnostics) converges, which typically occurs after about 1-5 × 10 6 model proposal iterations; the compute time for this is a matter of a few seconds on a single-core desktop computer. In accord with standard practice, we discard the first 50 000 iterations as 'burn-in' in order to remove dependency on the initial model. We note that our method, in which the model vector contains the data ages, is considerably (i.e. thousands of times) faster than an implementation (which we tried in a development phase) of the two-stage approach of Hellio et al. (2014): in which the ages were drawn in an outer loop by a Monte Carlo method, and the posterior distribution of intensity was then determined within an inner loop assuming fixed ages.
The rate of convergence of the Markov chain (and hence that of the posterior distribution) is affected by the choice of model proposal. Slow convergence will occur if either the proposed models are too different from any current model (they are unlikely to be accepted and the Markov chain will change only infrequently), or if they are too close to the current model (the chain will never effectively explore distant parts of model space). The proposal q depends on σ change , σ move , σ birth , β, which are adjusted to attain suitable acceptance ratios of about 15-50 per cent (Roberts 1996). Typical values are, respectively, 20 μT, 200 yr, 8 μT, 20, which give acceptance ratios (for model perturbation types of 1, 2a, 2b, 2c, 3) of 19, 11, 14, 14, 21 per cent for the Paris700 data set reported in Section 5.
In addition to reporting diagnostics such as histograms of vertex position, we also compute the relative entropy, or Kullback-Leibler divergence (e.g. Press et al. 2001), a quantification of the difference between the posterior p(g(t)|d) and prior distributions p(g(t)) of intensity variation (or in other words, a measure of information gain from the data): As we remarked earlier, the prior on the intensity evolution is not immediate from the prior information on the internal vertices; here,  (1), the intensity at a vertex is changed; in (2a) the age of a vertex is changed; in (2b) a vertex is born; in (2c) a vertex is removed; in (3) a subset of the data ages are resampled. The equal spacing of the black symbols is for graphical purposes only: in general they are spaced unequally.
we compute it by running the AH-RJMCMC algorithm with the likelihood set to 1. As a side note, this procedure of amending the likelihood should (and does) recover the prior distribution on each of the variables that we have specified.
If D KL is close to zero, it means that the posterior and prior distributions are very similar, so that the data add little information. Conversely, a high value of D KL means that the prior and posterior are significantly different. The relevance of this quantity here is that all Bayesian inferences are dependent on the choice of prior. If the KL divergence is low, then we might have little confidence in the robustness of the posterior distribution for it would likely alter if we changed the prior. On the other hand, a high value of D KL would render the posterior largely invariant of the choice of prior. As shown in Sections 5-7, high values of D KL occur when the posterior distribution is tightly focused, signifying strong constraints from the data relative to the assumed broad uniform prior.

Recovery of known underlying behaviour
In order to test the AH-RJMCMC methodology, we created synthetic versions of the Paris700 and the Hawaii data sets, termed Synt-Paris700 and Synt-Hawaii, both based on the smooth intensity variation of the coupled-Earth (CE) dynamo model (Aubert et al. 2013). The particular time window in the CE model was chosen such that the magnitude of variation of intensity was comparable to the observations, and is time-shifted in order that it is defined over the same age ranges as the observational data sets. It is worth remarking that the CE model time is scaled to terrestrial ages through considerations of the secular variation (Lhuillier et al. 2011), and any resemblance to existing intensity changes at either location is fortuitous. The CE model was sampled at the appropriate latitude/longitude location and, crucially, according to the same age distributions as the Paris700 and Hawaii data sets themselves. Pseudorandom noise is added in both intensity and age (according to their assumed forms) to mimic errors in the real data sets. We are now in a position to test the recovery of intensity evolution from noisy data, from either well-sampled or sparsely sampled data sets. Fig. 4 shows diagnostics of the AH-RJMCMC method applied to the Synt-Paris700 data set. The top row shows the posterior distribution of intensity characterized by its (time-dependent) average, median and modal values, alongside the 95 per cent credible intervals in filled orange; in the middle row is a density plot of the intensity distribution. Of primary importance is that for almost all the time window (except a small deviation around 1000 AD) the 'true' evolution (shown in black) largely agrees with the mode, median and average of the posterior distribution. Indeed, the true evolution is contained within the 95 per cent credible intervals (and the regions of highest intensity density), which gives us great confidence that the method can recover the underlying evolution. Fig. 4(bottom panel) shows the vertex position histogram with the KL divergence overlaid in red. The histogram shows strong evidence for a change in linear slope around 1400 AD and 1700 AD which occurs during a period of densely sampled data. There are several other ages which favour a change in slope, although their probability is smeared out and consequently not so high due to lower confidence from more sparsely sampled data. High values of the KL divergence in 500-1800 AD correspond to periods when the posterior distribution differs markedly from the prior and here takes the form of a tightly focused intensity density. It is worth noting, however, that high KL does not necessarily mean that the posterior better describes the true evolution, for in this example the deviation of 1000 AD is contained within this high KL period. Fig. 5 shows similar diagnostics but for the Synt-Hawaii data set. It is notable that despite the poor data coverage at early times, the method generally returns a good recovery of the intensity evolution. The asymmetric distribution of data (sparse at earlier times, dense at later times) means that the posterior distribution (shown for example either by the 95 per cent credible intervals or the intensity density) becomes very focused post 500 AD and has a high KL divergence. As with the previous synthetic data set, the posterior distribution does not follow the true evolution exactly: there are some discrepancies during short-period oscillations of the true evolution. Here, these are at 1400 BC, 800 AD and 1700 AD, and are caused by either poor sampling or because there is insufficient evidence to exclude a linear fit. Also of note is that, for the discrepancy around 1400 BC, the uncertainty bounds are relatively narrow compared with the Hawaiian data set (Fig. 13) despite being sampled at the same times and with the same assumed errors on age and intensity. This is because the synthetic data happen to be consistent with a single linear segment (which in our framework will always be preferred if the data allow), whereas for the Hawaiian data set any linear evolution fits poorly. Although the absence of the true solution feature within the 95 per cent credible intervals may be viewed as a limitation of the method, it is really a reflection of the fact that the non-linear  feature is not sampled at all so it is unsurprising that it is not present in the posterior.
Overall, for both data sets, the method recovers the intensity evolution of the CE-model very well. It is important to note that, by design, the method fits a posterior distribution of minimum curvature (being based on piecewise-linear segments). Thus any oscillatory behaviour in the posterior necessarily must be required from the data, and cannot be an artefact of the model. Conversely, minor and/or rapid oscillations of the true evolution which may either be poorly temporally sampled or lie within the error tolerance of a best-fitting linear-interpolation will not be present in the AH-RJMCMC posterior.

Marginal age distributions
Since the AH-RJMCMC method samples the joint distribution of the model space, we can derive the (joint) posterior distribution of any single variable, or subset of variables, by marginalization. Of particular interest is to examine the joint posterior distribution of the age and intensity of any given datum, which is simply achieved by collecting 'on the fly' the age a j (specified within each model) and its associated intensity, g(a j ), interpolated from the vertex information of the model. Fig. 6 shows two such joint distributions for the two data 119 (left) and 154 (right) of the Synt-Paris700 data set that were chosen to represent two extremes; these data are highlighted with green boxes in Fig. 4. The joint distribution is shown as green hexagons [dark(light) shading means higher(lower) values]. The marginal distributions of both age and intensity for each datum are shown on the top and right axes; for reference, the uniform prior distribution on age, and the assumed normally distribution of intensity, are shown in transparent orange. The noisy "observed" datapoint is shown in purple, while the true values are shown by the red triangle.
On the left the marginal posterior age distribution is comparable to the prior, so the Synt-Paris700 data offers no new information about the datum's age beyond what is already assumed in the prior. ; semi-transparent orange shows the prior on the data age, and also the assumed normal distribution of intensity. The true age and intensity of the data is shown in red while the synthetic 'observation' is shown by the purple triangles (with error bars).
Although the posterior distribution of intensity is centred on approximately the same value as the datum, it has a much smaller standard deviation.
By contrast, on the right the marginal posterior age distribution is heavily skewed, the datum being much more likely to have an age at the earlier end of the range [400, 500] BC. Indeed, the true value falls close to the peak of the joint distribution, at around 500 BC. Thus other data within the data set add significant constraints on the likely age of this particular datum because we fit the data set as a whole. In this example, the distribution of posterior intensity is comparable to that of the observed datum but shifted upwards in value.

Model consistency
In this section, we briefly address two aspects of model consistency. First is to check that our diagnostics of the posterior distribution (average, mode, median) do actually reflect the properties of the posterior distribution itself. Fig. 7 shows 1000 individual models that are spaced equally along the Markov chain (red lines) for the Synt-Paris700 example, plotted alongside the average model (black line) that is comparable to the other diagnostics (Fig. 4). It is apparent that the average variation does indeed follow the evolution of the ensemble.
The second issue is more subtle. In our method, we have accommodated the imprecise knowledge of the archeomagnetic data ages by including them as hyper-parameters, a, in our model vector m = [f, k, a] T . For our focus on archeomagnetism, this is a natural way to proceed because the marginalized posterior distributions of the data ages (which are of significant scientific value) are straightforward to compute. However, it is worth noting that this is not the only way of setting up the model. For example, rather than including the data ages in the model vector and using a likelihood based on a weighted misfit in intensity, we could swap these around: an alternative would have been to adopt the model vectorm = [f, k, F] T where F = (F 1 , F 2 , . . . ) are the data intensities, and use a likelihood based on the misfit of the uniformly distributed data ages. The fact that these two model setups are different is a manifestation of the asymmetry of the way in which the data is handled. Ideally, the ages and intensities of the archeomagnetic data should be handled in the same way (i.e. symmetrically), reflecting the fact that there is no objective reason to treat errors in age any differently than those in intensity. Of course, because the uncertainties in the ages and intensities do not necessarily obey the same distribution type, we cannot expect the implementation of the algorithm to be symmetric. Sambridge (2016) describes an 'integrated likelihood' method, handling 2-D data with known error distributions symmetrically by a piecewise-linear MCMC algorithm, in which the uncertainties in age are integrated out. He considers only the case where both variables are described by a joint normal distribution; in Appendix C we include details of the mathematical extension of his methodology to our focus on uniformly distributed ages and normally distributed intensities.
Using the integrated likelihood method has the advantages that (i) it treats uncertainty in both intensity and age in a symmetric way, and (ii) the model vector m IL = [f, k] T does not contain the data ages: therefore the model space that must be sampled has many fewer dimensions and convergence (in terms of the length of the Markov chain) may be more rapid although each forward model evaluation may be slower because of the additional marginalization over age uncertainty. Fig. 8 compares the converged posterior distributions from the age-hyperparameter and integrated-likelihood methods, using Markov chains of length 10 6 and 10 5 , respectively beyond their burn-in period of 5 × 10 4 . The fact that the methods closely agree gives us significant confidence in our methodology; it is apparent that the asymmetry in our method does not affect the final result. Computationally, for a given chain length, the integrated likelihood method is about a factor of 10 slower than the age-hyperparameter model. However, in this example this is balanced by requiring a Markov chain about 10 times shorter and so both examples shown  above take about the same amount of time to run: approximately 5 seconds on a modern desktop computer.

A P P L I C AT I O N T O T H E PA R I S 7 0 0 D ATA S E T
We now apply the age-hyperparameter methodology to the Paris700 data set; the key results are shown in Fig. 9. As for synt-Paris700, the average, median and mode follow each other closely, and post 0 AD have a focused intensity density giving confidence in the posterior. Other diagnostic plots are given in Appendix D, which includes confirmation of the assumption of normally distributed intensity errors. The time variation after 500 AD is broadly similar to Genevey et al. (2016), which is not surprising given the densesampling of the secular variation and the small scatter.
There are several data points which do not fit the general trend, which include a low intensity value (about 40 μT) around 100 BC. To gauge the effect of outlying results, Fig. 10 shows the effect of removing data (including the result around 100 BC) whose mean intensity lies outside the 95 per cent credible interval (evaluated at each datum's mid-point age) and re-running the AH-RJMCMC algorithm. With the outliers removed, at ages 200-300 BC, the posterior distribution differs markedly: the temporary dip in the intensity is now removed and values extend no lower than about 50 μT during this period.
One important feature of the posterior shown in Fig. 9 is a periodic signal contained within the envelope of relatively narrow 95 per cent credible intervals, most evident post 500 AD. Importantly, as noted in Section 4, because our model favours minimal curvature, the existence of such a signal cannot be attributed to anything other than the influence of the data themselves. This signal was identified in a similar data set by Genevey et al. (2016) who showed, by analysing the power spectral density (PSD) of their best-fitting curve, maximum power at approximately a 256-yr period. Here, we apply a similar procedure but to the entire posterior distribution to produce a probabilistic assessment of frequency content.
From the ensemble of 10 6 models that we generate from the AH-RJMCMC method, we take 1000 models (equally spaced along the Markov chain) and for each we perform the following analysis in order to determine the PSD as a function of period: (i) Isolation of the time series for the window 500-1900 AD (note that the models themselves are defined over a broader range, but this age range is strongly constrained by the data).
(ii) Removal of the linear trend.
(iii) Application of a forward-backward Butterworth filter using cut-off frequencies of 400 yr −1 and 40 yr −1 to isolate the periods of interest.
(iv) Computation of the PSD using the method of Welch (1967). Fig. 11(left-hand panel) shows the PSD as a function of period for each ensemble member (thin black line), with the PSD for the median, modal and average model shown in colour. The average of the set of 1000 individual PSD curves is shown in orange. We note that the average PSD agrees well with the PSD of the average (blue), a result that is not expected to be true in general unless all curves share a similar frequency content. There is a strong indication of a dominant period of 260-280 yr. Fig. 11(right-hand panel) shows a normalized histogram of the period corresponding to the maximum PSD for each of the 1000 representative models, with the bestfitting normal distribution shown in black (mean 269 yr, standard deviation 9 yr) that fits the general behaviour very well. We note that our preferred period of 269 yr is within 1.5 standard deviations (and thereby consistent with) the approximate 256 yr signal found  . Power spectral density as a function of period for the AH-RJMCMC method applied to the Paris700 data set, restricted to the period 500-1900 AD. Left-hand panel: shown in thin black is the PSD for each of the 1000 representative ensemble members; the PSD for the median, mode and average are shown as green, red and blue, respectively, while orange shows the average of the 1000 individual PSD curves. Right-hand panel: a normalized histogram of the period corresponding to the maximum PSD, with the best-fitting normal distribution (using only periods 200-300 yr) shown in black (mean 269 yr, standard deviation 9 yr). by Genevey et al. (2016). It is worth mentioning that, for their computations, Genevey et al. (2016) subtracted a non-linear trend (that differs from our use of a linear trend), which may explain the small difference in the period. We note that the shape of the histogram is converged using 1000 models, and that the plot of PSD is another example of projecting the model m onto a different space.
It is also of interest to compare our method of handling errors both in intensity and age with more common methods of treating the ages as known and increasing the error budget for the intensity. Fig. 12 shows our AH-RJMCMC method applied to the Paris700 data set in the top panel, compared with the same method applied to three other variants of the same data set: assumed age errors of zero (with the age being defined by the mid-point age of the interval), age errors Downloaded from https://academic.oup.com/gji/article-abstract/215/3/2008/5101441 by CNRS user on 30 September 2019 Figure 12. A comparison of posterior distributions of intensity using variants of the Paris700 database. From top to bottom: a reproduction of our AH-RJMCMC method applied to Paris700 (Section 5); the same as above but with assuming the ages are known to be exactly their midpoint values; as above with twice the intensity error budget; assuming known ages with a minimum intensity error of 5 μT.
of zero with twice the error budget on intensities, and age errors of zero with a minimum intensity error of 5 μT (as assigned by  to all intensity data for the construction of global field models). There are two key aspects of note. First, comparing the top two panels of the figure shows that assuming fixed ages causes rapid changes in the posterior intensity (which are absent when the ages are allowed to be free parameters), particularly for relatively sparsely sampled epochs (here clearly identified pre 0 AD). Second, either in doubling the intensity error budget or setting a minimum threshold of 5 μT results in the smoothing out of most fluctuations: of importance is that the periodic signal previously identified post 500 AD is absent in the bottom two panels. A similar result would likely hold were we to draw fixed ages from the given uniform distribution as in a bootstrap method, rather than using their midpoint values. This corroborates other studies (e.g. Genevey et al. 2016;Hellio & Gillet 2018) that have highlighted the mismatch in time variability between regional and global models, because of the necessity to include additional smoothing for the global case.

A P P L I C AT I O N T O T H E H AWA I I DATA S E T
We now turn our attention to the data set of the Hawaiian area extracted from Geomagia.v3 (Brown et al. 2015) that is much more sparse in time, and which has data ages and intensities both normally distributed. We adopt an extended prior for the intensity to be uniform in the range 10-100 μT because of the greater range of intensities in the data, and also increased the maximum number of internal vertices to 100. The increased scatter in the data (compared with Paris700) means that a converged posterior requires a longer Markov-chain. The results are summarized in Fig. 13, with a chain length (before thinning) of 5 × 10 6 after burn-in. Other diagnostic plots are given in Appendix E, which includes confirmation of the assumption of normally distributed intensity errors.
As with previous models, the average, median and mode have a very similar time dependence. However, as expected, the more scattered data set gives less confidence in the posterior distribution, reflected in its much broader credible intervals and less focused density. Where the data is particularly sparse or scattered, the 95 per cent credible interval range is about 30-80 μT (for example, at very early times), similar to the range of the prior on the intensity of the model vertices. Apart from for the very recent data dated to the 19th and 20th centuries thanks to direct observations, the width of the credible interval range does not extend much below 20 μT. This should be contrasted with the posterior from the Paris700 data set, which has a typical credible interval width of just 5 μT. In terms of KL divergence, typical values for the Hawaiian data set are between 0 and 1 for much of the age window, which is much less than the comparable diagnostic of the Paris700 data set (of about 2, for the period post 0 AD). Thus compared with Paris700, the Hawaiian data do not add as much information to the prior. One aspect of interest is the quasi-periodic behaviour identifiable in the top panel of Fig. 13 post 500 AD. Fig. 14 shows the same power spectral density analysis of Section 5 applied here, for the period 500-2000 AD. The scatter of the lines in the left-hand plot is a reflection of the poorly constrained ensemble of model curves (owing to the sparse data distribution), and in this case the average of the spectra is not similar to the spectra of the average model. Nevertheless, the spectra of the average, modal and median diagnostics for the posterior all have a maximum at around 325 yr. The right plot shows a histogram of the periods of the PSD maxima, with a very broad but bimodal structure with peaks loosely defined at around 210 and 330 yr. Thus the sparsity of the data set has an associated very broad spread in the possible models that fit the data, and this analysis does not identify any specific period. Furthermore, we note that the Hawaiian curves established by de Groot et al. (2013) and Tema et al. (2017) using different and more severe selection criteria do not show the same quasi-periodic features (see discussion).

A P P L I C AT I O N T O T H E L U B E C K -PA R I S 0 0 DATA S E T
Our final application of the AH-RJMCMC methodology is to the Lübeck-Paris700 data set that contains data with ordered age constraints. Fig. 15 shows a summary of the posterior intensity varia- tion with time. For the majority of the time window, the posterior is highly focused and well constrained by the data (with a KL divergence of at least 2).
Of primary interest is the inclusion of stratified data, in particular, the 10 archeomagnetic data all of age 1665 ± 85 AD but which obey strict ordering constraints. Although each of these data are consistent with (within error tolerance) the quasi-periodic fit to the unstratified data (Fig. 16), the posterior age distributions will be highly skewed as, for example, the datum of least intensity is most likely to have an age at the extreme upper end of the interval of [1580,1750]. Fig. 16(a) shows the stratified sequence of posterior ages for the 10 data each coloured differently, with the nominal midpoint age of 1665 AD marked as the vertical dashed line. The posterior ages differ markedly from the uniform prior (shown by the flat coloured rectangles). The posterior estimates of the age and intensity of each datum (given by the average of the relevant marginal posterior distributions) are shown in Fig. 16(b) using the same colour scheme. Each of the original data are marked at their midpoint age of 1665 AD (with error bars) with the posterior estimate marked as a solid square of the same colour. Thus, for example, the topmost datum (shown as blue) when treated as part of the entire data set, has an age which is most likely to be shifted earlier in time than 1665 AD and to a smaller value of intensity than its original estimated value. Such shifts are also shown in Fig. 17 that shows the joint ageintensity posterior probability distribution for the two end-members (red, sky-blue) of Fig. 16(a). On the left we see that the posterior age is on the lower-most extreme of the prior age interval, and has an associated posterior intensity which is more focused than the distribution describing the intensity error but centred on a similar value. On the right, the posterior age is at the extreme upper end of the prior interval, and has an intensity distribution which has a greater mean and a much smaller spread than the laboratory-determined intensity error.
It is worth remarking that, although all the data have a highest probability that is shifted relative to their nominal original age and intensity estimates, the stratified data are not passive in the process of model determination, but rather themselves contribute to the overall posterior distribution. This process therefore differs fundamentally from using a given secular variation curve as a means to estimate age (Le Goff et al. 2002;Pavón-Carrasco et al. 2011). Although applied here to stratified archeological data, the same technique could also be used to improve knowledge of the dating of sedimentary sequences, probably with interesting consequences for global field modelling.

D I S C U S S I O N A N D C O M PA R I S O N T O O T H E R M E T H O D S
In this paper, we have presented a transdimensional Bayesian method that can infer the time dependence of intensity evolution; it has four key elements. First, we co-estimate marginal distributions for data ages as well by treating these as unknown model parameters. Second, by averaging over a large ensemble of model realizations with differing parameter values and complexity, diagnostics (such as the average and credible intervals) of the posterior distribution are temporally smooth, yet formally unregularized. Third, because our method is based on linear interpolation, oscillations in the posterior distribution are only present when absolutely required by the data, and thus our method presents a powerful tool to probe oscillatory behaviour. Finally, our method is very fast (taking a matter of seconds) and the code is publicly available.
With these points in mind, it is now instructive to compare the results of our method to those obtained from two well-known and independently developed approaches of determining archeomagnetic intensity evolution. The first we consider is the sliding window method of Le Goff et al. (2002), in which the intensity evolution is defined within windows of variable duration (M. Le Goff, personal communication). The second is the Bayesian scheme developed by Lanos (Hervé & Lanos 2017;Hervé et al. 2017;Lanos & Philippe 2017). Like our scheme, it co-estimates data ages using a similar Monte-Carlo approach, but assumes a cubic spline (rather than linear) dependence of intensity with time, and achieves a smooth evolution by regularizing using a 'shrinkage' parameter (Tema et al. 2017) rather than (transdimensional) model averaging. The differences between the results from all these methods reflect not only the different methodologies, but also specific user choices. For Bayesian methods (here, our AH-RJMCMC method and that of Lanos and colleagues) the posterior depends on the chosen prior distributions, although for our model this dependence can be quantified (see the Kullback Leibler divergence described in Section 3.5). For the sliding window method, a choice needs to be made on the factors controlling the duration of the window. Fig. 18(a) shows our AH-RJMCMC method compared to the sliding window scheme for the Paris700 data set. Although the average curves largely agree, there are distinct differences, particularly pre 0 AD and around 750 AD where either the data are relatively scattered or are outlying to the general trend. In particular, the single data point of 40 μT around 200-300 BC causes a large oscillation in our AH-RJMCMC average, yet the sliding window results are more similar to our AH-RJMCMC method where we have rejected outliers (see Fig. 10) where the oscillation is absent. In both methods, the uncertainties are mainly smaller during times of dense data, but larger during times of data sparsity. Fig. 18(b) shows our AH-RJMCMC method compared to the Bayesian method of Lanos using a French data set described in Hervé et al. (2017) (which is almost identical to our Paris700 data set within the common time interval). During times of dense data coverage (50 BC onwards) the two methods agree in both their average and credible intervals. However, pre 50 BC there are some significant differences in the average models, in particular at 600 BC, 300 BC and 100 BC, probably due to the rejection of the nearby data points in the method of Lanos and colleagues. Furthermore, pre 50 BC the 95 per cent credible intervals are not in agreement. In our method, periods of large uncertainty generally correspond to periods of sparse data coverage. In contrast, the curve of Hervé et al. (2017) appears to have a credible interval width that varies only relatively slightly in time, even during periods of sparse data (see, in particular, the period close to 1000 BC). Thus their credible interval width does not appear to have a simple link with data sparsity. Similar conclusions can be drawn when comparing the two Bayesian methods but on the Hawaiian data set of Tema et al.  Figure 16. (a) Posterior age distribution for the 10 data with identical midpoint age but which obey stratification constraints; the associated prior distributions are shown as the coloured rectangles. (b) For each of these 10 data, a comparison of the midpoint age estimate (coloured circle with error bars) with their corresponding posterior (average) estimates (coloured squares). Both plots use the same colour scheme. Figure 17. Joint posterior probability distribution with marginals of the data age and intensity for the two end-member cases of the 10 stratified data all with midpoint age 1665 AD. The left-hand plot shows the distribution of the datum plotted in red in Fig. 16, while the right-hand plot shows the datum plotted in sky blue in Fig. 16. Green denotes the posterior, orange the prior distribution on age and the normally distributed likelihood for intensity.
(2017), shown in Fig. 19(a). In this case, comparable data gaps of 1500 yr between [4500,6000] BC and [1000, 2500] BC have very different credible interval widths: respectively, enormous (off scale) and narrow. For the same periods, our AH-RJMCMC method gives broad credible intervals whose widths are comparable with each other.
An important issue concerns treatment of data points that do not fit with the general trend: of particular note in the Paris700 data set (e.g. Fig. 18a) are those defining an intensity peak that may have occurred around 250 BC, and the single point defining an intensity low around 100 BC. Although it is not possible to know definitively whether such data represent the true geomagnetic field evolution or simply have a greater error than assumed, there are two methods by which such data can be handled. First, we can define and remove outlying data; we followed this approach in Section 5, where we pragmatically defined an outlier by assessing if the single point defined by the datum's midpoint age and intensity lay outside the 95 per cent credible interval of the posterior distribution calculated using the full data set. According to this definition, the single point having a low intensity value around 100 BC is outlying, whereas the data cluster around 250 BC are not. We remark that a small difference in age would place the datum around 100 BC inside the 95 per cent credible interval. Indeed, perhaps a more robust definition of outliers might be to proceed probabilistically, by calculating the probability, p, of any given datum (with its assumed distribution of errors in both age and intensity) falling outside the 95 per cent posterior credible interval bounds. Data for which p is greater than some given threshold value (say) 95 per cent (so that there would be only a 5 per cent probability that the true intensity and age values of the datum came from within the 95 per cent credible interval) could be labelled as outliers; such a procedure would likely result in only a few, if any, data being marked as outlying in the Paris700 data set. A second approach that proceeds within the Bayesian framework is to keep all the data but allow the data intensity errors to 2026 P.W. Livermore et al. be hyperparameters, whose posterior distribution is an output of the algorithm (e.g. Malinverno & Briggs 2004). Data that do not follow the general trend will then be associated with larger values of intensity error and therefore given less weight (e.g. Thébault & Gallet 2010;Licht et al. 2013).
The cluster of one or two data points around 250 BC in our study gives a local maximum in the posterior intensity. Interestingly, in both the methods of Lanos and Le Goff, this peak is totally missing, likely the consequence of inherent smoothing since the timescale of this feature is roughly the same as the one characterizing the different peaks over the past 1500 yr (see Fig. 18a and Genevey et al. 2013Genevey et al. , 2016. Although more data are obviously required to better constrain its existence, there is no reason to believe that such a feature could not have occurred around 250 BC. In passing, we note that our approach, which relies on a piecewise-linear temporal interpolation, is well suited to detect rapid fluctuations of the geomagnetic field, such as archeomagnetic jerks (Gallet et al. 2003) or geomagnetic spikes (Shaar et al. 2011).
Next, we comment on the power of our method to assess periodic behaviour, which is only permitted as part of the solution where required by the data. For the Paris700 data set, the fact that our method shows a dominant periodic signal (see Section 5) of about 270 yr is therefore a strong conclusion. Indeed, it is notable that this period is in agreement with the period of about 260 yr deduced by Genevey et al. (2016), despite the differences between approaches (in particular, in the long-term trend assumed), adding weight to the robustness of this observation.
Conversely, our method can also be used to assess whether a data set is compatible with the absence of a periodic signal. We explore one such example of the sparse Hawaiian data set (Tema et al. 2017) that is similar to that presented in this paper but with more stringent selection criteria. It is worth recalling that our own data set can be viewed as a blind (though reasonable) use of a global database or a data compilation. Tema et al. (2017) identified four consecutive peaks in the intensity evolution from 2000 BC to 2000 AD. Fig. 19(a) compares results from our method with that of Lanos (see Fig. 6c of Tema et al. 2017). Although our method fits a quasi-oscillatory signal post 0 AD, we do not find evidence for the final oscillation around 1500 AD: this is because the data are actually consistent with linear behaviour and do not require an oscillation. In this respect, the curve derived from the AH-RJMCMC method appears quite similar to that proposed by de Groot et al. (2013) constructed using another Hawaiian data selection. Following an application of our method, Fig. 19(b) shows the mean posterior value of each data age and corresponding intensity, and shows that a linear dependence is compatible by shifting the data within their error bounds. In a parallel study, a similar assessment of testing whether data can be shifted (within error bounds) to fit an overall trend was recently performed by Korte & Constable (2018) to assess geomagnetic spikes, although in a more ad hoc and non-probabilistic fashion. Overall, we find that this Hawaiian data set is entirely consistent with a much less oscillatory signal with no obvious period (our red curve).
We end by commenting briefly on the use of our AH-RJMCMC tool for archeomagnetic dating of a datum of some given intensity. In our approach, we would add this datum (with broad uniform uncertainty on age) to the data set, and run the AH-RJMCMC model. The posterior marginal age distribution is then part of the model output. This is to be contrasted with a more standard approach, in which the datum is compared to a reference curve. Both the AH-RJMCMC method described here, and the method of Lanos and colleagues, can perform either calculation. Although our approach of 'all data together' differs from the standard method of distinguishing a reference data set from a comparative datum, it has two advantages. First is that, within a Bayesian formulation, noisy data should not be removed from the problem but instead accounted for with their large uncertainties. In our method, this is implemented by using a very broad prior distribution on the datum age. Second is that separating the problem into two steps requires an arbitrary choice. How do we decide what is too noisy to be accounted as 'reference data'? Taking everything together is an integrated approach that combines all pieces of information, and that avoids this arbitrary distinction.

A C K N O W L E D G E M E N T S
PWL would like to thank Université Paris Diderot and the Institut de Physique du Globe de Paris for funding two research visits in 2015 and 2016, during which much of this work was begun. PWL was partially supported by the Natural Environment Research Council grant NE/G0140431. TB is supported by the European Union's Horizon 2020 research and innovation program under Grant Agreement No. 716542. We acknowledge support from the INSU (PNP) programme. We are very grateful to Maxime Le Goff for running his sliding window algorithm on our data, and for helpful comments. We also thank Gabrielle Hellio and an anonymous reviewer for constructive comments on the manuscript. All figures were created in Python using Matplotlib (Hunter 2007). All authors contributed to the ideas behind the approach taken in the manuscript; YG provided the data sets and background on archeomagnetic secular variation curves; AF, TB and PL developed the methodology and wrote the code; PL generated the results and wrote the paper. All authors provided comments on the manuscript. This is IPGP Contribution no. 3967. Code availability: The code and data sets are available at https://github.com/plivermore/AH-RJMCMC, which has DOI:10.5281/zenodo.1442345.
To examine the effect of data sparsity on our method, we consider a data set of Hawaiian intensity measurements taken from the Geomagia.v3 database (Brown et al. 2015). Each data point has both an intensity and age that are assumed normally distributed; the age estimate comes from radiocarbon analysis except for the most recent data whose ages are constrained by historical observations. For the latter data, we arbitrarily fixed the age standard deviation σ = 0.5 yr. Our approach here relied on much less selective criteria, which nevertheless led to a data set much more sparse than the Paris700 compilation. We retained the results whose quality is supported by an alteration test, regardless of the nature of this test, and which possess uncertainties in both intensity and in age. Between about 9000 BC and 4500 BC most available data lack age uncertainties, and we have therefore focused on the past 4500 yr. We complemented the compilation with the data sets recently obtained by de Groot et al. (2013) and Cromwell et al. (2015). All 134 data points are reduced to the latitude of Kilauea volcano (19.42 • N).

A3 Lübeck-Paris700: an example of stratification
We combined the Paris700 data set (between 900 AD and 2000 AD) with the stratified data set from Lübeck, Germany (Schnepp et al. 2009), whose data ages are not independent and must obey a strict ordering in time. These archeointensity data were obtained from a long series of superimposed bread oven floors in a bakery that was in activity from ∼1300 AD to ∼1750 AD. Although Lübeck is 800 km (marginally more than 700 km from Paris), we reduce these measurements to the latitude of Paris by assuming that the non-dipole field effect was negligible between the two regions.

A P P E N D I X B : T H E A H -R J M C M C A L G O R I T H M B1 Model overview
Our aim is to sample the joint posterior probability distribution of the model (the intensity as a function of time, alongside the data ages), given the data (the intensity values), which is described in Section 3 in the main text. Here we present in detail the method we use to sample from (and therefore characterize) the posterior distribution. For completeness, however, we first briefly summarize the modelling framework.
The model is defined within the age interval [t start , t end ], which needs to be large enough to span the temporal distributions (with uncertainty) of all data; otherwise some realizations of the data ages will be outside the model space. In the case of uniformly distributed ages, this is straightforward; for normally distributed ages, we ensure that the model boundaries lie at least 2 standard deviations from each datum mean-age.
The posterior is parametrized by the model vector m = [f, k, a] T , where f is the vector containing the endpoint intensities and the k internal vertex intensities and ages: F end , F 1 , F 2 , . . . , F k , t 1 , t 2 , . . . , t k ), and a are the data ages. The posterior requires a prescription of the likelihood (assumed normal, see Section 3.3) and prior distributions.
The prior distributions on the vertex intensities and number of vertices k are assumed independent and uniform, F j ∼ U(F min ,F max ), k ∼ U(0, k max ). Without stratigraphy constraints, the ages are assumed independent and either uniformly or normally distributed a j ∼ U(a min j , a max j ), or a j ∼ N (μ j , λ 2 j ), where μ j and λ j are the given mean and standard deviation. If a subset of the ages have stratigraphy constraints, then they are assumed to be independently distributed yet ordered: a j < a j + 1 < . . . . In practice, this is handled by drawing the ages independently and discarding a set of draws if the stratification constraints are violated. For ease of model development, the vertex ages are assumed confined to an equallyspaced discrete set of N values between the temporal limits of the model [t start , t end ]; the joint prior distribution of the vertex ages (in which their order is irrelevant) is then Our algorithm is to assemble a Markov chain, whose distribution converges to the posterior distribution we seek. To add a new model to the Markov chain, we propose a new model m which differs from the current model m by virtue of one of several possible moves with given probability: (1) Change the value of the intensity (F j ) of a randomly selected vertex j (prob 1/3) (2) Alter the age-distribution of vertices (prob 1/3) by either (a) Move in age: change the age (t j ) of a randomly selected vertex j (prob 1/9) (b) Birth: create a new vertex (prob 1/9) (c) Death: remove a vertex (prob 1/9) (3) Resample a randomly chosen subset of ages in a (prob 1/3)

B2 Proposals
Each of these proposals is given in more detail below.

B2.1 Change the value of intensity
For move type (1), we perturb where z ∼ N (0, σ 2 change ), where σ change is a user-specified standard deviation. Note that which is equal to p(F k |F k ). Thus this move is symmetric.

B2.2 Move in age
For move type (2a), we alter the age t j of vertex j where j is randomly selected where y represents a temporal shift to one of the unoccupied discrete set of ages (of which there are N − k). However, in the limit (N → ∞) that we will eventually take, all positions are unoccupied with probability 1 and so we may model age as a continuous variable and draw y ∼ N (0, σ 2 move ) where σ 2 move is user-specified. Again, because of the symmetry of the distribution, p(t j |t j ) = p(t j |t j ) and the move is symmetric.

B2.3 Birth
For move (2b), we give birth to a new vertex by randomly selecting an unoccupied epoch and assigning an intensity value; we also increase k by 1. In the discrete case, there are N − k available choices of age, which in the continuous limit becomes a uniform distribution. The value of intensity we give the point is not arbitrary, but a perturbation away from its interpolated value from the current model. Suppose we choose t j which lies between t j and t j + 1 . Then we set where r ∼ N (0, σ 2 birth ), and F * j is the value found using a linear interpolant between t j and t j + 1 . We then need to augment the model vector by F j and t j . The proposed model vector differs then from the current model by just two additional parameters: all other values (aside from k) remain the same.

B2.4 Death
For move (2c), we remove at random one of the vertices (with its associated intensity value) from the current model, and reduce k by 1. The proposed model therefore has two fewer parameters than the current model-all remaining parameters (aside from k) are identical.

B2.5 Resample ages
A subset of the ages in a of size N data /β (with β a user-specified parameter) is resampled according to the prior distributions (uniform or normal). This move is symmetric.

B3 Acceptance probabilities
Having defined a new proposal m , we need to decide on whether to accept it or not. For fixed-dimension MCMC modelling (i.e. k constant) we define that is the ratio of the products of the likelihood, prior, and proposal probability. This ratio is then compared with a randomly drawn  Green (1995) showed that the proposal ratio can be written as Here, j(m|m ) is the jump probability: the probability that the algorithm chooses to select proposal m from model m . The pdfs q m →m (u ) and q m→m (u) give the probability of the random variables needed to make the proposal; note that, for example, q m →m (u ) can depend only on the current state m of which u is a perturbation. Finally, J is the Jacobian of transformation. When making a proposal, if any of the random variables drawn lie outside their prior distribution then p(m ) vanishes and the acceptance ratio is 0: the proposal is never accepted.

B3.1 Moves of fixed dimension
If k is unaltered, then since both moves (1), (2a) and (3) are symmetric i.e. q(m|m ) = q(m |m), the prior distributions cancel out and α is simply the minimum of 1 and the ratio of the likelihoods: e −(φ −φ) , where φ is defined in eq. (2).

B3.2 Birth
For a birth move, suppose that model m has dimension k and m dimension k + 1. Then the ratio of priors is We now need to consider the ratio of the proposal probabilities. For birth, the jump probability is 1/3, and we further need to draw a specific additional age from the unoccupied ages (probability (N − k) −1 ) and then draw an intensity from its assumed normal distribution centred on the interpolated value with probability q(F j ). The model vector is simply augmented with the new age and intensity, and the transformation (described by the Jacobian J) is simply a relabelling of indices; therefore J = 1. The ratio of transition probabilities is then where The acceptance ratio is then Note that if instead of drawing the new value from a Gaussian distribution q(F j ) we proposed a new value from the uniform prior distribution U [F min ,F max ], we would have terms that would cancel out and α would then depend only on the ratio of likelihoods.

B3.3 Death
For the acceptance probability of the death of vertex j, suppose that model m has dimension k and m dimension k − 1. The jump probability is 1/3 and the doomed vertex (from k equally probable choices) needs to be removed; there is no requirement to generate any new variables. Each of the ratio of the priors and the proposals are simply the reciprocal of those for birth with k replaced by k − 1.
where q(F j ) is given by (B3). We finally note that none of the acceptance ratios depend on N, or on the choice of prior distribution for the data ages (which cancels out in all cases). We therefore take N to be infinitely large, converting each vertex age into an independent, continuous, uniformly distributed random variable.

A P P E N D I X C : A N I N T E G R AT E D L I K E L I H O O D F U N C T I O N
At the heart of the Bayesian method is a likelihood function, which gives the probability of an observed intensity value F given an assumed underlying model. If the datum's age is assumed errorfree and the true intensity value is y, the likelihood is However, because the datum's age itself is not known precisely, we need a method of incorporating its uncertainty into the modelling process. In the principal method we consider in this paper, these effects are accommodated by including the data ages into the model vector m = [f, k, a] T , and then successively resampling a according to the specified prior distribution. For our problem, this is a natural way to proceed, because the marginalized posterior distributions of the data ages (which are of significant scientific value) are trivial to compute. However, it is worth noting that this is not the only way of setting up the model: rather than including the data ages in the model vector and using the intensities in the likelihood, we could swap these around. Thus in an alternative model, we might use the model vector m = [f, k, F] T where F = (F 1 , F 2 , . . . ) are the data intensities, and use for the likelihood the uniform distribution on data age. The fact that these two model setups are different is a manifestation of the asymmetry of the way in which the data is handled. Ideally, the data ages and intensities, both accompanied with a specific error distribution, should be treated symmetrically as they enter on the same footing.
Sambridge (2016) describes a method in which 2-D data with known error distributions are treated symmetrically by a piecewise linear MCMC algorithm. He considers only the case where both variables are described by a joint normal distribution: here, we extend his methodology to our focus on uniformly distributed ages and normally distributed intensities.
The keystone of Sambridge (2016) is a definition of the likelihood that expresses the probability of realizing a particular datum d = (F, a), where F is its intensity and a is its age, given a model curve c (here, the piecewise linear description of intensity with time). This can be written as an integral over the whole curve in intensitytime space, considering a general source point q on the curve c parametrized by arc length:

[d|q(s)] p[q(s)]ds.
(C2) Each choice of q represents a different and equally probable [and thus p(q(s)) is uniform] realization of the datum's true intensity and age. It is worth noting that the choices of (i) parametrising c in terms of arclength, rather than age or intensity (as we could have done in our case) and (ii) assuming that all values of q are equally probable, produces a method which is symmetric in the variables.
There are two ingredients required in the above. First, because each choice of q is assumed equally probable, we can calculate p[q(s)] = C by integrating over the whole curve: where L is the total arclength of the curve of interest; hence C = 1/L. Second, we need to be able to evaluate the likelihood of realizing the data d assuming the true value is at q(s). We assume a normally distributed intensity F ∼ N (F q , σ 2 ), where F q is the intensity at q, and uniform in age a ∼ U[T 0 , T 1 ]. We will always consider a model curve c that includes the entire age interval [T 0 , T 1 ], and because the datum age probability is zero outside this interval, in (C2) we can restrict attention to ages within [T 0 , T 1 ]. Suppose there are j line segments that we need to consider, each with a start and terminal age. We will consider each in isolation, and add their contributions together.
We can parametrize the jth linear segment of the model by a new variable 0 ≤ θ ≤ 1, where s = l j θ and l j is the arclength of the segment. Hence ds = l j dθ and the contribution to (C2) is where F q (θ ) is the intensity at q(s(θ)). Now, where F 0 and F 1 are the start and end values of the intensity in the segment, a = F − F 0 and b = F 1 − F 0 . Hence the contribution can be written where using the notation of Sambridge (2016),θ = a/b and σ θ = σ b −1 . Note that in Sambridge (2016), his eq. (A9) the definition of σ θ has a typographic error.
It is interesting to consider the limit of small error in age in eq. (C3), corresponding to the case of F 0 = F 1 = F q (θ), in which there is only a single line segment (as the age interval is too small to contain more than one). In this limit the likelihood collapses to which recovers the usual normal likelihood (C1), albeit normalized by the (infinitesimal) time duration. Finally, we note that numerically, it is easiest to use the loglikelihood, rather than the likelihood itself. In cases where t 1 and t 2 are large (where the model is a very poor fit to the data), the log likelihood is numerically challenging to compute since both error functions are −1 to machine precision with a difference of zero. In this case, we can set the likelihood (for this datum) to be a prescribed small number (for example, 10 −50 ). Note that this case only arises when the model is a very poor fit to the data, which can occur on initializing the algorithm with a randomized model, but practically will never enter the MCMC sampling beyond 'burn-in'. Fig. D1 shows other diagnostics of the AH-RJMCMC method applied to the Paris700 data set that supplement those given in Section 5. The top row shows the misfit against iteration (i.e. the index of the Markov-chain) and a normalized histogram of the number of internal change points. The burn-in period (marked by the red bar) is sufficiently long for the misfit to drop to close to its lowest value, and the range of change points (fixed between 1 and 50) is large enough to capture the distribution of vertices, which peaks at around 40. Finally, the bottom panel of Fig. D1 shows a normalized histogram of the difference between the intensity values from Paris700 and the average posterior evaluated at the data mid-point ages weighted by the data intensity errors: F/σ = g(a j ) − F j /σ j for each datum j (see Section 3.3). The close similarity of the histogram and the standard N (0, 1) distribution gives us confidence in the assumed normal distribution of intensity error.  Fig. E1 shows other diagnostics of the AH-RJMCMC method applied to the Hawaii data set that supplement those given in Section 6. The top row shows the misfit against iteration (i.e. the index of the Markov-chain) and a normalized histogram of the number of internal change points. The burn-in period (marked by the red bar) is sufficiently long for the misfit to drop to close to its lowest value, and the range of change points (fixed between 1 and 100) is large enough to capture the distribution of vertices, which peaks at around 50. Finally, the bottom figure shows a normalized histogram of the difference between the intensity values from Hawaii and the average posterior evaluated at the mid-point ages weighted by the data set intensity errors: F/σ = g(a j ) − F j /σ j for each datum j (see Section 3.3). The close similarity of the histogram and the standard N (0, 1) distribution gives us confidence in the assumed normal distribution of intensity error.