Kernel-, mean- and noise-marginalised Gaussian processes for exoplanet transits and 𝐻0 inference

Using a fully Bayesian approach, Gaussian Process regression is extended to include marginalisation over the kernel choice and kernel hyperparameters. In addition, Bayesian model comparison via the evidence enables direct kernel comparison. The calculation of the joint posterior was implemented with a transdimensional sampler, which simultaneously samples over the discrete kernel choice and the kernel hyperparameters by embedding them in a higher-dimensional space, from which samples are taken using nested sampling. This method was explored on synthetic data from exoplanet transit light curve simulations. The true kernel was recovered in the low noise region, while no kernel was preferred for larger noise. Furthermore, inference of the physical exoplanet hyperparameters was conducted. In the high noise region, kernel marginalisation either removed the bias in the posteriors, broadened the posteriors or increased the accuracy of the inference. In addition, the uncertainty in the mean function predictive distribution increased due to the uncertainty in the kernel choice. Subsequently, the method was extended to marginalisation over mean functions and noise models and applied to the inference of the present-day Hubble parameter, 𝐻0, from real measurements of the Hubble parameter as a function of redshift, derived from the cosmologically model-independent cosmic chronometer and ΛCDM-dependent baryon acoustic oscillation observations. The inferred 𝐻0 values from the cosmic chronometer, baryon acoustic oscillation and combined datasets are 𝐻0 = 66 ± 6 km s⁻¹ Mpc⁻¹, 𝐻0 = 67 ± 10 km s⁻¹ Mpc⁻¹ and 𝐻0 = 69 ± 6 km s⁻¹ Mpc⁻¹, respectively. The kernel posterior of the cosmic chronometer dataset prefers a non-stationary linear kernel. Finally, the datasets are shown not to be in tension, with ln 𝑅 = 12.17 ± 0.02.


INTRODUCTION
Gaussian Processes (GPs) have found widespread application as a method for regression of time-series data in astrophysics, such as the estimation of underlying physical parameters and the modelling of correlated noise, as they offer robust Bayesian non-parametric regression (Roberts et al. 2013; Aigrain & Foreman-Mackey 2022). A GP requires the choice of a kernel, which quantifies the strength of correlation between neighbouring data points. An existing approach to kernel selection is to write the kernel as a linear combination of several simpler kernels (Rasmussen & Williams 2005) and let the magnitude of the coefficients determine the relevance of each kernel. However, in cases where the joint posterior of the coefficients is multimodal or contains curving degeneracies in high dimensions, this lacks interpretability and information can be lost by projecting the posterior down to one- and two-dimensional marginal distributions. Other approaches approximate the kernel with families that are dense in the set of stationary kernels (Wilson & Adams 2013) or perform a greedy search over kernel structures (Duvenaud et al. 2013). In astrophysical and cosmological datasets, however, the choice of kernels is often governed and restricted by the physical model and by the interpretability of the hyperparameters in the physical context. Examples are the selection between physically motivated noise models (Scheutwinkel et al. 2023) or the selection between GPs whose forms are constrained by solutions to stochastic differential equations (Kelly et al. 2014). In this sense, kernel selection can be viewed as a form of Bayesian hypothesis testing, with the Bayes factors determining the kernel to be selected (Zhang et al. 2023). Another approach is deep kernel learning, which transforms the kernel inputs with a deep neural network (Wilson et al. 2016), thus drastically increasing the flexibility of the model. However, the lack of interpretability of the latent variables of a neural network can be prohibitive. Moreover, sparse and noisy datasets, such as cosmic chronometers and baryon acoustic oscillations, do not provide the volume of data demanded by such neural networks. In this paper, kernel selection is approached from a Bayesian perspective by computing the posterior probability of each kernel and sampling from this distribution. This has the advantage of direct interpretability of the kernel posterior probabilities, circumvents the drawbacks outlined above and generalises automatic relevance determination, as any linear combination of the investigated kernels can also be included.
A Bayesian approach to model selection requires sampling of the GP hyperparameters to make predictions and calculate marginalised posterior distributions (Rasmussen & Williams 2005; Higson et al. 2019b; Dhawan et al. 2021; Simpson et al. 2021). Most of the current literature approximates this step with a maximum likelihood (ML-II) estimate, which is only accurate for a uniform prior and a strongly peaked unimodal likelihood. Thus, in this paper, we investigate the method of marginalising over hyperparameters and the kernel choice and perform model selection of kernels using evidences. To the knowledge of the authors, this paper is the first to systematically investigate the ability of this method to infer a kernel ground truth and the conditions under which kernel recovery is possible depending on the signal-to-noise ratio and the number of data points.
To sample from the resulting hierarchical model, transdimensional sampling is employed, which samples from the joint posterior of the model choice and the hyperparameter vector of each model. The required sampling of the hyperparameters in high dimensions is conventionally done with Markov chain Monte Carlo (MCMC) methods (MacKay 2003). A review of transdimensional sampling based on MCMC is given in Sisson (2005). MCMC generates correlated samples and can thus become trapped at a local maximum of the posterior. Hence, these methods can fail to give representative samples and evidence estimates when the posterior distribution is multimodal, contains plateaus or exhibits a phase transition (MacKay 2003; Hackett et al. 2021). An alternative method which does not suffer from these shortcomings is nested sampling (NS; Skilling 2006; Sivia & Skilling 2006; Ashton et al. 2022; Buchner 2023), which systematically compresses a set of samples in the parameter space towards the global maximum of the likelihood, continuously generating posterior samples and providing an estimate of the evidence. For these reasons, nested sampling is used as the transdimensional sampler implemented in this work (Brewer 2014; Hee et al. 2016).
From the perspective of cosmology and astrophysics, the inference of parameters from data is of primary interest. Thus, kernel inference, corresponding to the inference of the correct noise model, and hyperparameter inference are investigated in exoplanet simulations with correlated noise. These were chosen as exoplanet detection is an active field of research (Feroz et al. 2011), simulation packages exist (Foreman-Mackey et al. 2017; Parviainen 2015) and the shape of the mean function facilitates its separation from noise. In cosmology, one of the most prominent examples for the importance of parameter inference is the discrepancy between early- and late-time measurements of 𝐻0, the Hubble tension (Di Valentino et al. 2021; Shah et al. 2021; Dainotti et al. 2022; Abdalla et al. 2022; Poulin et al. 2023). Gaussian Process regression (GPR) with a specific kernel and mean function has been used to infer 𝐻0 directly from Hubble parameter measurements (Busti et al. 2014; Ó Colgáin & Sheikh-Jabbari 2021; Gómez-Valent & Amendola 2018; Bernardo & Said 2021; Yu et al. 2018). Here, these methods are extended to kernel, mean function and noise model marginalisation.
The background on GPR and Bayesian model selection is summarised in Sections 2 and 3, respectively. Section 4 details the implementation of the transdimensional sampler. In Section 5, the implementation is explored and validated on synthetic data of exoplanet transits. The method is applied to 𝐻0 inference from real observations of cosmic chronometers and baryon acoustic oscillations in Section 6. The paper concludes with Section 7. Appendices A and B contain derivations of closed-form expressions used to speed up computations, Appendix C shows that a GP with a linear kernel is equivalent to linear regression, Appendix D illustrates the evidence calculation of a uniform categorical prior, Appendix E describes the tests conducted on the implementation of the transdimensional sampler, Appendix F contains the priors used for the kernel hyperparameters and Appendix G details the implementation of a novel kernel similarity metric.

GAUSSIAN PROCESS REGRESSION
A GP is a set of random variables {f(x) | x ∈ X}, where X is an index set, any finite subset of which have a joint Gaussian probability distribution (Rasmussen & Williams 2005). It thus describes a probability distribution over functions. A GP is specified by a mean function m(x) and a positive definite kernel k(x, x′), which measures the correlation between the function values at x and x′, such that

m(x) = E[f(x)],  k(x, x′) = E[(f(x) − m(x)) (f(x′) − m(x′))],   (1)

for all x and x′, where E denotes the expectation value. A kernel is stationary if it depends only on x − x′, otherwise it is non-stationary.
The combination of a constant mean function and a stationary kernel defines a weakly stationary GP.
The marginalisation property of a GP ensures that inference over any finite subset of the random variables is consistent with the full GP, so that it is not necessary to work with an infinite number of random variables, making GPs computationally tractable.
In GPR, a GP is used as a non-parametric method to perform the regression of a model, f : X → ℝ, to training inputs, X = (x_1, . . ., x_{N_data}), and targets, y = (y_1, . . ., y_{N_data}), to make predictions on test inputs, X★ = (x★_1, . . ., x★_{N★}). The targets, y, are assumed to be realisations of the model output, f = (f(x_1), . . ., f(x_{N_data})), with Gaussian noise with covariance matrix, N, so that y | f ∼ 𝒩(f, N), where 𝒩 is the normal distribution and "a ∼ p(a)" denotes that a is sampled from p(a). Thus, it can be shown that the targets, y, and test outputs, f★ = (f(x★_1), . . ., f(x★_{N★})), are jointly Gaussian,

(y, f★) ∼ 𝒩((m, m★), [K + N, K★; K★ᵀ, K★★]),   (3)

where m and m★ are the mean function evaluated at the training and test inputs, K = k(X, X), K★ = k(X, X★) and K★★ = k(X★, X★). By conditioning Equation 3 on the targets, the predictive distribution is obtained, with which predictions at the test inputs, X★, are computed.
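To make the conditioning step above concrete, the following is a minimal sketch in jax.numpy; the Matérn-3/2 kernel, white-noise term and placeholder mean function are choices made for this example only, whereas the paper's implementation relies on tinygp (Section 4).

```python
import jax.numpy as jnp

def matern32(x1, x2, amp, ell):
    # Matérn-3/2 kernel evaluated between two 1D input arrays.
    r = jnp.abs(x1[:, None] - x2[None, :]) / ell
    return amp**2 * (1.0 + jnp.sqrt(3.0) * r) * jnp.exp(-jnp.sqrt(3.0) * r)

def gp_predict(x, y, x_star, mean_fn, amp, ell, sigma):
    """Mean and covariance of f(x_star) obtained by conditioning the joint
    Gaussian of (y, f_star) on the observed targets y, with noise N = sigma^2 I."""
    K = matern32(x, x, amp, ell) + sigma**2 * jnp.eye(x.shape[0])
    K_s = matern32(x, x_star, amp, ell)
    K_ss = matern32(x_star, x_star, amp, ell)
    alpha = jnp.linalg.solve(K, y - mean_fn(x))
    mean = mean_fn(x_star) + K_s.T @ alpha
    cov = K_ss - K_s.T @ jnp.linalg.solve(K, K_s)
    return mean, cov
```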

BAYESIAN MODEL SELECTION
It is common in the machine learning literature (Rasmussen & Williams 2005) to restrict inference to stationary kernels and m(x) = 0. This choice may not give optimal predictive performance in the low data limit (Fortuin et al. 2019). In fact, the mean function, m, and noise covariance, N, are often set by a physical model (Aigrain & Foreman-Mackey 2022), which have physical hyperparameters, φ and η, respectively. Hence, this paper focuses on the selection of a kernel from a discrete set of functions, k_i ∈ {k_i}_{i=1}^{N_kernel}, each of which depends on hyperparameters, θ_i ∈ {θ_i}_{i=1}^{N_kernel}, which may have different dimensions, with a fixed mean function and noise model. The approach is applied recursively in Section 6 to include marginalisation over mean functions and noise models. In the following, the inference of the kernel k_i and the model hyperparameters Θ_i = (θ_i, φ, η) is described in terms of a hierarchical model.

Table 1. The kernels used throughout this paper (Rasmussen & Williams 2005); x and x′ are real numbers and r = |x − x′|.

Hyperparameter selection
For a given choice of the kernel, k_i, the hyperparameters are subject to a posterior, which is the probability distribution conditioned on the observation of training data, (y, X). This is given by Bayes theorem as

P_i(θ_i) = π_i(θ_i) L_i(θ_i) / Z_i,   (5)

where π_i(θ_i) is the hyperparameter prior, which encodes any information on θ_i prior to the observation of data. The single-kernel likelihood, L_i, is given by

L_i(θ_i) = 𝒩(y | m, K_i + N),   (6)

where K_i is the covariance matrix built from k_i, and the kernel evidence, Z_i, in Equation 5 is

Z_i = ∫ π_i(θ_i) L_i(θ_i) dθ_i.   (7)

Unless stated otherwise, π_i(θ_i) is assumed to be a weakly informative uniform prior, U(a, b), where [a, b] is an interval.
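A short sketch of how the single-kernel log-likelihood of Equation 6 can be evaluated stably via a Cholesky factorisation; the kernel, mean and noise constructions are passed in as placeholders.

```python
import jax.numpy as jnp

def single_kernel_loglike(theta, x, y, mean_fn, kernel_fn, noise_cov):
    """ln L_i(theta) = ln N(y | m, K_i + N), evaluated with a Cholesky factor."""
    r = y - mean_fn(x)
    C = kernel_fn(x, x, theta) + noise_cov          # K_i + N
    chol = jnp.linalg.cholesky(C)
    z = jnp.linalg.solve(chol, r)                   # z = chol^{-1} r
    log_det = 2.0 * jnp.sum(jnp.log(jnp.diag(chol)))
    n = y.shape[0]
    return -0.5 * (z @ z + log_det + n * jnp.log(2.0 * jnp.pi))
```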

Kernel selection
The kernel posterior for the choice from N_kernel kernels is given by Bayes theorem as

P_i = Π_i Z_i / Z,   (8)

where Π_i is the prior over the kernel choice and the evidence, Z, is

Z = Σ_{i=1}^{N_kernel} Π_i Z_i.   (9)

The prior Π_i is taken to be uniform, Π_i = 1/N_kernel. An expression for the propagated uncertainty in P_i is given in Appendix A.
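A sketch of Equation 8 in code, with the log evidences ln Z_i taken as inputs (in this paper they come from the transdimensional nested sampling run of Section 4); the uniform kernel prior is the default.

```python
import numpy as np
from scipy.special import logsumexp

def kernel_posterior(log_Z, log_prior=None):
    """Posterior probabilities P_i over the discrete kernel choice from the
    log evidences ln Z_i and (optionally non-uniform) log prior weights ln Pi_i."""
    log_Z = np.asarray(log_Z, dtype=float)
    if log_prior is None:
        log_prior = np.full_like(log_Z, -np.log(len(log_Z)))  # uniform prior
    log_post = log_prior + log_Z
    return np.exp(log_post - logsumexp(log_post))
```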

Inference
Thus, given a dataset, X and y, and priors {π_i(θ_i)}, the kernel and hyperparameters can be inferred from Equations 5 and 8. The posterior of any quantity, q, which depends on the kernel and its hyperparameters is obtained by marginalising over them (Simpson et al. 2021):

p(q | y, X) = Σ_i P_i ∫ p(q | k_i, θ_i, y, X) P_i(θ_i) dθ_i.   (10)

Since it is impossible to calculate this in general, one instead samples, so that

p(q | y, X) ≈ (1/N_s) Σ_{s=1}^{N_s} p(q | k^(s), θ^(s), y, X),   (11)

where {(k^(s), θ^(s))} are N_s samples from the joint posterior, P_i P_i(θ_i). If the distributions p(q | k^(s), θ^(s), y, X) are Gaussian, closed-form expressions for the mean and covariance of p(q | y, X) can be derived (Appendix B).
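The sampling approximation of Equation 11 can be implemented directly once joint posterior samples are available; a minimal sketch, where joint_samples and sample_q are placeholders for the output of the transdimensional run (Section 4) and a user-supplied draw from p(q | k, θ, y, X).

```python
import numpy as np

def marginalised_q_samples(joint_samples, sample_q, n_draws=1000, seed=0):
    """Draw n_draws realisations of q marginalised over the kernel choice and
    hyperparameters: pick equally weighted joint posterior samples (k, theta)
    and draw q from the conditional distribution for each."""
    rng = np.random.default_rng(seed)
    picks = rng.integers(len(joint_samples), size=n_draws)
    return np.array([sample_q(*joint_samples[s]) for s in picks])
```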

Common kernel choices
Table 1 lists the set of kernels used throughout this paper. The exponential (E), Matérn-3/2 (M32), Matérn-5/2 (M52), Matérn-7/2 (M72) and squared exponential (SE) kernels are specific instances of the Matérn kernel family. Functions sampled from their corresponding GPs are increasingly smoother in the order listed, with the E kernel being zero times and the SE kernel infinitely many times differentiable in the mean square sense (Rasmussen & Williams 2005). Each of them has an amplitude and a length scale hyperparameter, the latter interpretable as a decorrelation length on the input space. The rational quadratic (RQ) kernel corresponds to a mixture of SE kernels with different length scales, where the hyperparameter α_RQ controls the contributions of the length scales. Functions sampled from the GPs of the exponential sine squared (ESS) and cosine (Cos) kernels are periodic with periods p_ESS and p_Cos, respectively. Finally, using the non-stationary linear (L) kernel is equivalent to performing linear regression with Gaussian priors on the offset and slope with zero mean and variances σ²_L,1 and σ²_L,2, respectively, which in turn is equivalent to a GP with a linear mean function (Appendix C).

Kullback-Leibler divergence and Bayesian model dimensionality
The Kullback-Leibler (KL) divergence of the hierarchical model above is defined as an average over the joint posterior of the kernel choice and the kernel hyperparameters,

D_KL = Σ_i ∫ P_i P_i(θ_i) ln[ P_i P_i(θ_i) / (Π_i π_i(θ_i)) ] dθ_i,   (12)

which decomposes into contributions from a KL divergence arising purely from the constraint on the kernel choice and an averaged KL divergence of the individual kernel hyperparameters,

D_KL = Σ_i P_i ln(P_i / Π_i) + Σ_i P_i D_KL,i,   (13)

where D_KL,i is the KL divergence of the hyperparameters of kernel k_i alone. The Bayesian model dimensionality d̃ (BMD; Handley & Lemos 2019a) is defined as

d̃/2 = Var[ ln( P_i P_i(θ_i) / (Π_i π_i(θ_i)) ) ],   (14)

where Var[•] denotes the variance, taken over the joint posterior. It decomposes (Equation 15) into the sum of BMDs arising from the constraint of the kernel posterior and the hyperparameter posteriors, with an additional term increasing (decreasing) the BMD if the kernel posterior compression is correlated (anti-correlated) with the compression of the corresponding hyperparameter posterior, where Cov[•, •] denotes the covariance; a sketch of this decomposition is given below. The above expressions for D_KL and d̃ apply to any hierarchical model and will be applied recursively in Section 6 to marginalise over the mean function and noise model.
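For concreteness, the BMD decomposition follows from the law of total variance applied to the logarithm of the posterior-to-prior ratio of the hierarchical model; the grouping of terms below is a sketch in this spirit and need not match the exact arrangement of Equation 15.

```latex
\frac{\tilde{d}}{2}
  = \mathrm{Var}_i\!\left[\ln\frac{P_i}{\Pi_i}\right]
  + \sum_i P_i \frac{\tilde{d}_i}{2}
  + \mathrm{Var}_i\!\left[\mathcal{D}_{\mathrm{KL},i}\right]
  + 2\,\mathrm{Cov}_i\!\left[\ln\frac{P_i}{\Pi_i},\; \mathcal{D}_{\mathrm{KL},i}\right],
```

where Var_i and Cov_i are taken over the kernel posterior P_i, d̃_i is the BMD of the hyperparameters of kernel k_i alone and D_KL,i is as in Equation 13.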

TRANSDIMENSIONAL SAMPLING
A valid approach to compute the posterior probabilities P_i is to fit each GP model sequentially to the data and to compute the evidence Z_i for each. A subsequent calculation of q requires two-step sampling, firstly of a model choice from P_i and secondly of the hyperparameters from P_i(θ_i) for the given choice.
A computationally more efficient method is to use transdimensional sampling (Brewer 2014; Hee et al. 2016), in which the hyperparameters for each model are embedded in a higher dimensional space such that the dimensions of the models match and the model choice is promoted to a categorical hyperparameter, c. In the following, the method of Hee et al. (2016) is adopted. There are two parts to the transdimensional sampling algorithm. Firstly, the appropriate hyperparameter vector must be prepared. Secondly, NS is used as a subroutine to sample from this hyperparameter vector, for which the PolyChord implementation is used (Handley et al. 2015). Specifically, for the investigated datasets, the models corresponding to different kernels share the same mean function, m, and noise model, N. The combined hyperparameter vector then takes the form

Φ = (c, φ, η, θ_1, . . ., θ_{N_kernel}),   (16)

where the categorical hyperparameter, c, takes values in {1, . . ., N_kernel} and φ, η and {θ_i} are the hyperparameters of the mean function, noise and kernels, respectively. The likelihood, L, reduces to the single-kernel likelihood, L_c, (Equation 6) for the kernel k_c selected by c, L(Φ) = L_c(φ, η, θ_c). The prior on Φ is the product of uniform priors on each of the entries of Φ, unless stated otherwise. The priors are specified in PolyChord as inverse transforms on the unit hypercube (Handley et al. 2015). For the discrete hyperparameter c, the function u ↦ ⌈u N_kernel⌉ achieves the required uniform categorical distribution, where u ∈ [0, 1] and ⌈•⌉ is the ceiling function. The effect is that the hypercube coordinate of c is partitioned into N_kernel intervals of equal length, each interval corresponding to a value of c. This is illustrated in Appendix D. As a result, the likelihood in unit hypercube coordinates is piecewise constant on each interval, assuming that all other hyperparameters are held fixed, which necessitates a modified NS algorithm or correction by sample post-processing in the anesthetic library (Fowlie et al. 2021). At a single NS iteration, the value of c selects the entries θ_c in Φ which the likelihood L(Φ) depends on. The likelihood is degenerate in the remaining entries, θ_{−c} = (θ_1, . . ., θ_{c−1}, θ_{c+1}, . . ., θ_{N_kernel}), so that NS does not compress the live points in the subspace of θ_{−c}. As a consequence, the distribution of θ_{−c} remains equal to the prior, which in our case is uniform. Hence, a single NS run in the space of Φ produces samples from the joint posterior over the model choice and the hyperparameters, P(Φ) = P_c P_c(θ_c) π(θ_{−c}), and the evidence Z. The marginal distribution of c thus corresponds directly to the kernel posterior. To obtain the hyperparameter posterior of a kernel, P(Φ) is conditioned on a value of c and marginalised over θ_{−c}. Computationally, this corresponds to discarding all samples which have a different value of c and computing the histogram of only the entries in Φ corresponding to θ_c.
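A minimal sketch of the two ingredients just described, written in the PolyChord style of a unit-hypercube prior transform plus a likelihood that dispatches on the categorical index; the number of kernels, per-kernel dimensions and log-uniform prior ranges below are placeholders.

```python
import math
import numpy as np

N_KERNEL = 3                    # e.g. {M32, SE, L}; placeholder choice
KERNEL_DIMS = [2, 2, 2]         # dimensions of theta_1, ..., theta_{N_kernel}

def prior_transform(u):
    """Map unit-hypercube coordinates u to Phi = (c, phi, eta, theta_1, ..., theta_Nk).
    The first coordinate is partitioned into N_KERNEL equal intervals (ceiling map)."""
    c = max(1, math.ceil(u[0] * N_KERNEL))            # uniform categorical kernel index
    rest = 10.0 ** (-3.0 + 6.0 * np.asarray(u[1:]))   # placeholder log-uniform priors
    return np.concatenate([[c], rest])

def transdimensional_loglike(Phi, single_kernel_loglikes):
    """Evaluate L(Phi) = L_c(phi, eta, theta_c); entries theta_j with j != c are
    ignored, so their posterior remains equal to the prior."""
    c = int(Phi[0])
    phi, eta = Phi[1], Phi[2]        # one mean-function and one noise hyperparameter assumed here
    start = 3 + sum(KERNEL_DIMS[: c - 1])             # offset of theta_c within Phi
    theta_c = Phi[start : start + KERNEL_DIMS[c - 1]]
    return single_kernel_loglikes[c - 1](phi, eta, theta_c)
```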
An alternative approach to compute P_i is to split the samples according to the value of c. Each set of samples constitutes a separate NS run with a potentially variable number of live points, which must be reweighted (Higson et al. 2019a) to calculate the kernel evidence Z_i from which the kernel posterior is obtained. We note that the entries θ_{−c} do not affect the evidence calculation, as the single-kernel likelihood L_c is independent of them.
There are in principle two other ways of combining the individual hyperparameter vectors for transdimensional sampling. The first is in keeping with taking the disjoint union of the model parameter spaces, such as in reversible-jump MCMC (Green 1995). That is, we could define the combined vector as (c, φ, η, θ′), where θ′ is a vector with the maximum dimensionality of the vectors {θ_i}_{i=1}^{N_kernel}. This has the advantage of a reduced dimensionality of the hyperparameter space, although one cannot apply different priors for the hyperparameters of different kernels when using NS. Secondly, a product space approach could be taken (Carlin & Chib 1995). That is, the combined vector could be defined as (c, Φ_1, . . ., Φ_{N_kernel}), with which each hyperparameter can be assigned a different prior; however, the dimensionality is increased by having separate vector entries for the mean functions and noise models of each GP model. The approach that was taken in this paper corresponds to an adaptive approach between the two (Godsill 2001; Hee et al. 2016). To wit, we take the first approach for the shared hyperparameters, φ and η, and the second one for the hyperparameters not shared between the models, {θ_i}_{i=1}^{N_kernel}, thus resulting in the combined vector of Equation 16. This method is advantageous for the applications in this paper as each hyperparameter can be assigned a different prior and the number of dimensions of the sampling space used for the mean function and the noise hyperparameters does not scale with the number of kernels. On the downside, the dimension of the sampling space scales with the total number of kernel hyperparameters.
We note that regardless of the implementation of transdimensional sampling, Equations 13 and 15 remain valid ways of calculating the KL divergence and BMD, respectively, because the posterior and prior of the unused entries, π(θ_{−c}), cancel in the posterior-to-prior ratio. However, it is computationally more efficient to calculate these directly from the definitions, Equations 12 and 14, as one has direct access to samples from the joint posterior by means of P(Φ). As a corollary, the effective (Bayesian) dimensionality of the space of Φ is independent of the transdimensional sampling method.
The transdimensional sampler (TS) was implemented using jax (Frostig et al. 2018), leveraging the speedup of just-in-time compilation when evaluated repeatedly. The implementation of Equation 6 in tinygp (Foreman-Mackey 2023) was used due to its compatibility with jax. The anesthetic library (Handley 2019) was used for NS sample post-processing and evidence, KL divergence and BMD calculations. Since the live points are initially equally partitioned between the subspaces of fixed c, the default number of live points in PolyChord was increased by a factor of N_kernel.
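One practical detail when jit-compiling such a likelihood is that the categorical index cannot be used for Python-level branching inside a traced function; a common jax pattern is lax.switch over branches with a shared signature. The toy branches below are placeholders, not the paper's tinygp likelihoods.

```python
import jax
import jax.numpy as jnp

# Toy per-kernel log-likelihoods sharing one signature (a padded hyperparameter
# vector), standing in for the tinygp-based single-kernel likelihoods.
def loglike_kernel_a(theta):
    return -0.5 * jnp.sum(theta**2)

def loglike_kernel_b(theta):
    return -jnp.sum(jnp.abs(theta))

@jax.jit
def loglike_switch(c, theta):
    """Dispatch on the zero-based kernel index c without leaving jit,
    using lax.switch over branches with identical signatures."""
    return jax.lax.switch(c, (loglike_kernel_a, loglike_kernel_b), theta)

print(loglike_switch(0, jnp.ones(3)), loglike_switch(1, jnp.ones(3)))
```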
The time complexity of a single evaluation of L(Φ) is O(N_data³) (Rasmussen & Williams 2005), so that a full PolyChord run has a time complexity of O(N_dims³ N_data³) (Handley et al. 2015), where N_dims is the number of dimensions of Φ. The computations were parallelised with mpi4py (Dalcin & Fang 2021). Tests for the TS are described in Appendix E.

SYNTHETIC EXOPLANET TRANSIT DATA
In this Section, the method is validated on synthetic datasets simulated from a known mean function and kernel. Firstly, we test under which conditions the true kernel can be inferred given the mean function. Secondly, we fit for both the mean function and the kernels in Table 1 and investigate the accuracy of the mean function hyperparameter inference, which is of greater interest from a physics perspective.
The mean function is a light curve simulated from an exoplanet transit (Winn 2010), which was chosen as the shape thereof facilitates its separation from the noise of an M32 kernel. Moreover, inference of the transit light curve parameters is of greater interest as it requires that correlated noise is accounted for explicitly (Aigrain & Foreman-Mackey 2022), for which GPs have been employed (Gibson et al. 2012). Exoplanet transit light curves were simulated using the jaxoplanet library (Foreman-Mackey & Garcia 2023; Aigrain & Foreman-Mackey 2022). In the following, the creation of the datasets for kernel inference and mean function hyperparameter inference is described.

Method
For kernel inference, a quadratically limb darkened light curve on the time interval [−0.2 days, 0.2 days] was used as the mean function for a GP with correlated noise from an M32 kernel and a white noise term, N = σ²I, where σ is a hyperparameter and I is the identity matrix, so that the synthetic data is sampled from a Gaussian 𝒩(m, K + σ²I). The simulation hyperparameters are shown in Table 2. Next, we define the signal-to-noise ratio as the ratio of the kernel amplitude and the white noise, SNR = σ_M32/σ. By varying the number of data points N_data ∈ {75, . . ., 750} in steps of 75 and the signal-to-noise ratio log10(SNR) ∈ {0, . . ., 1} in steps of 0.05, keeping σ_M32 fixed and varying σ, a series of datasets was created. For each of these, GPR was performed for every kernel in Table 1 with the mean function set to the true mean function. When the mean function was fitted to the data, the hyperparameter priors of the mean function were chosen to be sufficiently narrow to achieve fast convergence to the posterior bulk during a NS run, yet wide enough to prevent significant truncation thereof. The kernel hyperparameter priors are described in Appendix F. The prior on σ is the same as on the amplitude of any kernel. Finally, for each dataset the kernel posterior P_i and the hyperparameter posteriors are calculated using the TS (Section 4).
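A sketch of how such a synthetic dataset can be drawn; the box-shaped dip below is a stand-in for the jaxoplanet transit light curve, and the default hyperparameter values are placeholders rather than the Table 2 values.

```python
import numpy as np

def matern32_cov(t, amp, ell):
    # M32 covariance matrix on a 1D time grid.
    r = np.abs(t[:, None] - t[None, :]) / ell
    return amp**2 * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def make_dataset(n_data, log10_snr, amp_m32=1e-3, ell_m32=0.02, seed=0):
    """Draw y ~ N(m, K_M32 + sigma^2 I) on [-0.2, 0.2] days, with the white-noise
    level set by SNR = amp_m32 / sigma."""
    rng = np.random.default_rng(seed)
    t = np.linspace(-0.2, 0.2, n_data)
    mean = 1.0 - 0.01 * ((t > -0.05) & (t < 0.05))   # toy occultation dip
    sigma = amp_m32 / 10.0**log10_snr
    cov = matern32_cov(t, amp_m32, ell_m32) + sigma**2 * np.eye(n_data)
    return t, rng.multivariate_normal(mean, cov), sigma
```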
For mean function hyperparameter inference, the datasets with N_data = 75 and log10(SNR) ∈ {0, . . ., 2} in steps of 0.5 are created. GPR is performed with the orbital period P_orb fixed to the true value because the datasets only contain a single occultation, so that P_orb cannot be inferred. Performing GPR with the M32, M52, E and SE kernels showed that the u_2 hyperparameter and a further transit-shape hyperparameter remain unconstrained. Hence, in addition to the datasets described above, GPR is re-performed with these two hyperparameters fixed to their true values. An example dataset and the resulting mean function predictive distributions are shown in Figure 1, calculated with the two hyperparameters fitted freely.

Kernel inference results
We define a measure of the sharpness of the kernel posterior as Q = P_M32 − ⟨P_i⟩, where ⟨P_i⟩ = 1/N_kernel is the average kernel posterior probability. This is large when the kernel posterior is unimodal and sharply peaked at the true kernel and close to zero when the kernel posterior is flat. An alternative such measure is the entropy, −Σ_i P_i ln P_i, which attains a minimum value of zero when P_i = 1 for a single kernel and a maximum value of ln N_kernel when the kernel posterior is flat. However, we intend to isolate the behaviour of the posterior at the true kernel, so that Q is used. The dependence of Q on N_data and log10(SNR) is shown in Figure 2, in which the region in which the M32 kernel is the maximum a posteriori (MAP) kernel is demarcated with a red border. Overall, the M32 kernel is recovered at low noise (log10(SNR) ≳ 0.5) but not for high noise (log10(SNR) ≲ 0.5), and this is correlated with the sharpness Q of the kernel posterior. Thus, Q is a measure for recovering the true kernel in the sense that a sharply peaked kernel posterior indicates that the true kernel is recovered and, conversely, a flat posterior indicates the opposite. For closer investigation, the kernel posteriors of the datasets marked (a), (b) and (c) in Figure 2 are shown in Figure 3. Furthermore, it is observed that, in the low SNR region (large σ), the M32 kernel cannot be recovered despite N_data being increased. Instead, the kernel posterior is flat with an entropy near ln N_kernel, indicating that the dataset does not contain sufficient information to constrain the kernel prior. The reason for the flatness is that the covariance matrix, K + σ²I, of the normal distribution the data is sampled from receives a significant contribution from the diagonal white noise term, σ²I, which decreases the correlation of the samples. Hence, the true kernel cannot be inferred for high noise even if N_data is increased. Visually, log10(SNR) = 0 corresponds to a dataset with almost no discernible pattern in the scatter of the data points, which is reflected in the flat kernel posterior.

Table 2. True values of the hyperparameters used for the creation of the synthetic datasets and the priors used for hyperparameter inference (columns: hyperparameter, true value, prior). The priors marked with (*) are given in Appendix F.
It is expected that the plot in Figure 2 roughly separates into two regions. Firstly, a region in which the true kernel cannot be inferred and Q is low, corresponding to low SNR and N_data, and secondly a region in which the true kernel is recovered and Q is high, corresponding to high SNR and N_data. Moreover, these regions should be separated by a diagonal border. Figure 2 shows multiple deviations from this expectation: (i) For N_data ∈ {225, 450, 675}, at high SNR, the true kernel is not correctly inferred. Instead, the kernel posterior shows a preference for the M52 and M72 kernels.
(ii) For N_data ∈ {375, 750}, up to log10(SNR) ≈ 0.70, Q remains low. Inspection of the kernel posterior shows that the E kernel is favoured.
These deviations are removed by extending the dataset to include multiple occultations, yielding Figure 5, which is discussed further below. We therefore conclude that these outliers are statistical fluctuations arising from the finite size of the dataset, which do not occur in a real application that typically includes multiple occultations.
In the following, we show that deviations (i) and (ii) persist under the injection and removal of different types of noise, which validates this conclusion. Firstly, we rule out that the deviations are caused by the cadence of the sampling of the light curve. For this, the observation times of the light curve are randomly shifted by adding independent and identically distributed uniform noise on the inputs. The deviations persist and are thus a feature intrinsic to the shape of the light curve. Secondly, we calculate Q in the extreme case of infinite SNR, realised by setting σ = 0. In this limit, the M32 kernel is the MAP kernel for all N_data except for N_data = 225, for which the posterior probability is finite for the M32 and M52 kernels. From this, we validate that the ambiguity between the M32 and M52 kernels is a feature of the shape of the light curve for N_data = 225, whereas for all other values of N_data, it is caused by the noise term, σ²I. Thirdly, by changing the random number generator seed to generate a different dataset and replotting Q, we conclude that if Q is averaged over different realisations of the dataset, the deviations disappear. As the average over several realisations of the same light curve incorporates statistically independent examples of occultations, this is consistent with the finite size of the dataset causing the deviations.
In addition to the kernel posterior, we investigate how well the reconstructed kernel approximates the true kernel. For this, we define a similarity metric Δ which depends on a positive upper cutoff r_max. The calculation thereof is illustrated in Figure 4. Firstly, we draw N_s = 1000 hyperparameter posterior samples for a given kernel k_i and plot the corresponding kernel functions k_i(0, r) against r ∈ [0, r_max]. Next, we discretise the r-axis at finely spaced cuts {r_j}. At a given cut r_j, the plotted kernel functions constitute a probability distribution with density ρ_{i,j}, which we compute with a kernel density estimate (KDE; Appendix G). This probability density is evaluated at the value of the true M32 kernel, k_M32(0, r_j). The value of Δ is obtained by averaging these probability densities over the cuts {r_j}. Therefore, if the probability density of the reconstructed kernel is large along the path traced out by the graph of the true kernel function, corresponding to a higher similarity between the reconstructed and true kernel, the value of Δ is large. Figure 4 shows Δ as a function of r_max for N_data = 75 and log10(SNR) ≈ 0.30. It is seen that for all r_max, the similarity between the inferred and true M32 kernels is lower than for other kernels. Instead, the RQ kernel approximates the M32 kernel more closely. This is consistent with the results of the kernel posterior, in which the M32 kernel is not the MAP kernel at N_data = 75 and log10(SNR) ≈ 0.30, and demonstrates that preselecting the M32 kernel does not necessarily reconstruct the ground truth most accurately in comparison to kernel marginalisation.
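A sketch of this similarity calculation; the scipy Gaussian KDE, the placeholder kernel_fn and true_fn interfaces, and the normalisation by the number of cuts are assumptions of this sketch rather than the exact choices of Appendix G.

```python
import numpy as np
from scipy.stats import gaussian_kde

def similarity(theta_samples, kernel_fn, true_fn, r_max, n_cuts=100):
    """Average, over cuts r_j in [0, r_max], of the KDE density of the reconstructed
    kernel values k_i(0, r_j) evaluated at the true kernel value k_M32(0, r_j).
    theta_samples: posterior hyperparameter samples; kernel_fn(theta, r) -> k(0, r)."""
    r = np.linspace(0.0, r_max, n_cuts)
    curves = np.array([kernel_fn(theta, r) for theta in theta_samples])  # (N_s, n_cuts)
    densities = [gaussian_kde(curves[:, j])(true_fn(r[j]))[0] for j in range(n_cuts)]
    return float(np.mean(densities))
```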
Finally, we investigate the applicability of these results for the kernel posterior when a mean function is included in the inference. This not only demonstrates consistency with the method in Section 5.2, in which the mean function was artificially held fixed, but also shows applicability to real datasets which contain multiple occultations. For this, the dataset is extended to include up to three occultations. It now consists of multiple copies of the occultation shown in Figure 1 placed next to each other. Figure 5 shows the analogue of Figure 2 when the mean function hyperparameters are fitted to the data in addition to the kernels. Note that the number of data points is increased in proportion to the number of occultations contained in the dataset. This keeps the data density on the input domain, i.e. the number of data points per time interval, constant while increasing the size of the domain. This is important as a fit to a free mean function, which increases the dimensionality of the hyperparameter space, requires multiple statistically independent realisations of the same occultation without diluting the data in order to increase the information contained in the dataset.
With a single occultation in the dataset, corresponding to the dataset of this Section, the M32 kernel is correctly inferred at large values of SNR. However, the added complexity from extending the hyperparameter space to include the mean function hyperparameters causes the inference at low SNR to be prior-dominated, verified by inspecting the KL divergence, which specifically results in a preference for the RQ kernel for certain low-SNR datasets. By including a second occultation while fixing the occultation duration, the preference for the RQ kernel disappears, and the separation into low- and high-entropy kernel posteriors depending on the value of SNR emerges, in agreement with Figure 2. Additionally, the broad kernel posteriors are confined to the lower left triangular region in the plot of Q against N_data and log10(SNR), which shrinks further as a third occultation is added to the dataset. The red border demarcating that the M32 kernel is the MAP kernel thus extends diagonally from the top left to the bottom right of such a plot. This indicates that increasing the number of data points at a fixed SNR, or increasing SNR at a fixed number of data points, improves kernel inference if sufficient information is included in the dataset, which in this case means that the domain of inputs is enlarged. This is distinct from keeping the domain of the inputs fixed and increasing the density of data points, which can lead to no constraints on the kernel posterior even in the large N_data limit, as discussed above. This is also in agreement with the rough notion that more statistically independent data and lower noise generally improve inference. Moreover, the kernel inference remains correct as one moves further into the upper triangular region beyond the red border, indicating that there is a threshold in the number of data points or SNR beyond which there is little improvement in terms of kernel inference, which in practice may serve as a critical number of data points and SNR at which the inference becomes correct. In summary, we conclude that the results described in this Section are robust if a mean function is included, under the condition that the dataset contains multiple occultations.
We remark that the only remaining difference to Figure 2 at low SNR lies in the occasional preference for kernels similar to M32, such as the M52 kernel. This is explained by noting that the mean function inference barely improves as more occultations are included, which was verified by inspection of the hyperparameter posteriors. This is a result of the occultation period, P_orb = 0.12 days, being larger than the kernel length scale, ℓ_M32 = 0.02 days, which means that the number of independent samples of the kernel increases more rapidly than the number of independent samples of the mean function as the number of occultations is increased. This implies that the inference of the kernel is overconfident despite wrong mean function inference. A possible way of fixing this would be to increase the number of occultations but simultaneously increase the kernel length scale such that the number of independent kernel samples, approximated by the ratio of the input range to ℓ_M32, remains constant. This approach was not implemented here as it requires one to entirely change the datasets investigated. However, this situation is not unrealistic as noise length scales can be larger than the occultation period in real applications. By including a significantly larger number of occultations, while increasing the kernel length scale, it is expected that we approach the limit of a fixed mean function and thus the case investigated in this Section.
In the next Section, mean function hyperparameter inference is investigated on this problem, focussing on a dataset with a single occultation, as the hyperparameter inference is relatively unchanged when up to three occultations are added.

Mean function hyperparameter inference results
The agreement of the inferred mean function hyperparameters with their true values is within one sigma. However, the posteriors of u_2 and of the further transit-shape hyperparameter, which control the precise shape of the occultation, and of σ remain systematically broad for different N_data and SNR values due to the lack of constraint from the data, as outlined in the previous Section.
Kernel marginalisation shows three favourable effects. These are illustrated in Figure 6 for datasets with N_data = 75. Firstly, for the posterior of the transit-shape hyperparameter in the high noise region (log10(SNR) = 0), the MAP kernel is biased to the wrong value, 0. However, contributions from the other kernels, which are significant because the kernel posterior is not sharply peaked, remove this bias, resulting in a marginalised posterior peaked closer to the true value, 0.1. Secondly, in the low noise region (log10(SNR) = 1), inference with the M52 kernel moves posterior mass away from the true value of this hyperparameter, whereas the M32 posterior shows little change from the uniform prior, as the fluctuations in the posterior density are around 8% of the maximum value. Marginalisation has the effect of smoothing out the M52 posterior and moving posterior mass back to the true value. Moreover, the E kernel contributes to this as it is peaked at the true value. The marginalised posterior thus provides a more faithful representation of the lack of information in the data than the M52, E or SE kernels would suggest by themselves. Finally, for the u_1 posterior in the intermediate noise region (log10(SNR) = 0.5), all kernels are peaked at distinct u_1 values around the true value. Marginalisation results in a unimodal posterior close to the true value. In contrast to this, inference with the M32 kernel results in a MAP estimate further away from the true u_1 value. Lastly, we investigate the effect of kernel marginalisation on the width of the one sigma error band of the mean function predictive distribution. For this, we take the difference between the error bands at each input between the marginalised and M32 predictive distributions and average over all inputs. This defines the uncertainty increase, U, which is shown for a series of datasets in Figure 7. It is observed that there is a transition from U ≳ 0 to U ≈ 0 at the value of log10(SNR) at the red border in Figure 2. This means that the additional uncertainty in the kernel choice is captured when the kernel posterior is flat, whereas the predictive distribution of the true kernel is reproduced when the true kernel maximises the kernel posterior.
In summary, kernel marginalisation performs the inference at least as accurately as using the M32 kernel. This is because in the low noise region, inference results from the true kernel are reproduced, whereas in the high noise region, the contributions from other kernels remove bias in the M32 kernel. The time complexity, O(N_data³ N_dims³), warrants improvement when compared to that of ML-II estimates; therefore, the extension to approximate GP methods (Rasmussen & Williams 2005) should be investigated.

HUBBLE PARAMETER INFERENCE
In this Section, the method of kernel marginalisation is applied to 𝐻0 inference from cosmic chronometer (CC; Renzi & Silvestri 2023) and baryon acoustic oscillation (BAO; Li et al. 2021) datasets, which consist of measurements of the Hubble parameter 𝐻 against redshift 𝑧.

Method
The 𝐻0 posterior is obtained by fitting a GP to an 𝐻(𝑧) dataset, extrapolating it to 𝑧 = 0 and calculating the marginalised predictive distribution at 𝑧 = 0. All amplitude, σ, and length scale, ℓ, hyperparameter priors are set to the conservative ranges [0 km s⁻¹ Mpc⁻¹, 500 km s⁻¹ Mpc⁻¹] and [0, 20], respectively. Additionally, the mean function is set to a constant, 𝑚_𝐻0, for which three priors are investigated: a fixed value 𝑚_𝐻0 = 0 km s⁻¹ Mpc⁻¹ (M.i) and two increasingly wide priors on 𝑚_𝐻0 (M.ii and M.iii). These induce the 𝐻0 priors shown in Figure 8. Option (M.i) corresponds to a zero mean function, as done in Gómez-Valent & Amendola (2018), Bernardo & Said (2021) and Yu et al. (2018), and results in an 𝐻0 prior biased towards 𝐻0 = 0 km s⁻¹ Mpc⁻¹. Options (M.ii) and (M.iii) induce an 𝐻0 prior which is the convolution of the 𝐻0 prior of (M.i) and the prior on 𝑚_𝐻0, thus broadening the 𝐻0 prior and removing the bias. The ranges of the 𝑚_𝐻0 priors were chosen such that the induced 𝐻0 prior is centred on 𝐻0 = 50 km s⁻¹ Mpc⁻¹ and flat within the range [0 km s⁻¹ Mpc⁻¹, 100 km s⁻¹ Mpc⁻¹], which encompasses all 𝐻0 measurements to date (Shah et al. 2021).

In addition, three noise models are investigated (a code sketch of these options follows the list):

(N.i) N = diag({σ_i²}_{i=1}^{N_data}), where diag places its argument along the diagonal of a matrix and {σ_i} are the measurement error bars provided in the dataset (Gómez-Valent & Amendola 2018; Bernardo & Said 2021; Yu et al. 2018).

(N.ii) N = β² diag({σ_i²}_{i=1}^{N_data}), where β is a hyperparameter fitted to the dataset with the prior β ∼ U(0, 5). This option assesses whether the sizes of the error bars provided in the dataset are correctly estimated, in which case β = 1 is expected.

(N.iii) N = σ²I, where σ is a hyperparameter fitted to the dataset with the prior σ/(km s⁻¹ Mpc⁻¹) ∼ U(0, 500). This option ignores the measurement error bars in the dataset and fits homoscedastic white noise. It is expected that σ is consistent with the size of the error bars {σ_i} of option (N.i).
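A sketch of the three noise covariances in code; the symbol β and the exact placement of the rescaling in (N.ii) follow the reconstruction above and are assumptions of this sketch.

```python
import numpy as np

def noise_cov(option, sigma_i, beta=1.0, sigma_w=0.0):
    """Noise covariance N for the three options: (N.i) the quoted error bars,
    (N.ii) error bars rescaled by a fitted factor beta, and (N.iii) fitted
    homoscedastic white noise that ignores the quoted error bars."""
    sigma_i = np.asarray(sigma_i, dtype=float)
    if option == "N.i":
        return np.diag(sigma_i**2)
    if option == "N.ii":
        return beta**2 * np.diag(sigma_i**2)
    if option == "N.iii":
        return sigma_w**2 * np.eye(len(sigma_i))
    raise ValueError(f"unknown noise model: {option}")
```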
In the following, the 𝐻0 values for each of the nine choices of mean function priors and noise terms are calculated and, using the evidence ln Z, we marginalise 𝐻0 over these modelling choices (Scheutwinkel et al. 2023). This calculation is performed for the CC and BAO datasets and their combination. Finally, to assess whether the CC and BAO datasets are consistent, a tension analysis (Handley & Lemos 2019b) is performed from the entire model consisting of the nine choices and the kernel choice therein.
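The tension analysis compares the evidence of the joint fit with the product of the single-dataset evidences via the R statistic of Handley & Lemos (2019b), R = Z_CC+BAO / (Z_CC Z_BAO); a minimal sketch of this bookkeeping follows, with the quadrature error propagation an assumption of the sketch.

```python
import numpy as np

def log_R(logZ_joint, logZ_A, logZ_B, err_joint=0.0, err_A=0.0, err_B=0.0):
    """ln R = ln Z_AB - ln Z_A - ln Z_B; ln R > 0 indicates the two datasets
    are consistent (not in tension). Uncertainties combined in quadrature."""
    lnR = logZ_joint - logZ_A - logZ_B
    err = float(np.sqrt(err_joint**2 + err_A**2 + err_B**2))
    return lnR, err
```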

Results
The 𝐻0 means and one-sigma credible intervals corresponding to the GP fitting options above are shown in Figure 9. Within each of the nine subplots, the mean 𝐻0 value marginalised over all kernels is shown. Computationally, this value is produced by picking a kernel according to the kernel posterior, represented by the histogram. Given this kernel, a value of 𝐻0 is sampled from the 𝐻0 posterior conditioned on the kernel choice, shown vertically below the selected kernel. Finally, the mean and one-sigma credible interval are computed from the samples. Additionally, we report the values of ln Z and the BMD d̃.
For both datasets, all nine fitting options produce 𝐻0 values consistent within one sigma. Marginalising over these gives the final values 𝐻0 = 66 ± 6 km s⁻¹ Mpc⁻¹, 𝐻0 = 67 ± 10 km s⁻¹ Mpc⁻¹ and 𝐻0 = 69 ± 6 km s⁻¹ Mpc⁻¹ for the CC, BAO and combined CC and BAO datasets, respectively. The evidences of the nine models for each dataset were converted into model posterior probabilities. For all three datasets, the noise model (N.iii) has negligible posterior probability, contributing at most 9.3% in the CC dataset. This low posterior probability is expected because a homoscedastic model is fitted to heteroscedastic data. For the CC and BAO datasets, the noise model (N.i) has the largest contribution, accounting for 49.7% and 70.3%, respectively, followed by the noise model (N.ii). However, for the combined dataset, noise model (N.ii) contributes dominantly with 91.6%. This preference for (N.ii) causes the shift to larger 𝐻0 for the combined dataset, as explained below when (N.i) and (N.ii) are compared. In essence, (N.ii) reduces the size of the measurement error bars, which causes the fit to take on a more sigmoid shape in order to follow the data points more closely. This is achieved by a smaller decorrelation length of the kernel, which in turn entails a quicker decay to the mean function, so that 𝐻0, the intercept at 𝑧 = 0, becomes larger. As shown in Figure 9, the 𝐻0 values conditioned on the kernels {E, M32, M52, M72, SE, RQ, ESS, L} are consistent within one sigma.

Figure 6. Examples of the effect of kernel marginalisation on hyperparameter inference, showing that when the true hyperparameters cannot be inferred by both the M32 kernel and the maximum a posteriori kernel, kernel marginalisation still increases the accuracy. Posterior distributions for two of the exoplanet mean function hyperparameters are shown for synthetic datasets with N_data = 75. Left: While the SE kernel, which has the maximum posterior kernel probability 0.31 ± 0.05 among the kernels, prefers a value of 0, the E kernel is peaked close to the true value despite its smaller posterior kernel probability. In total, the contributions from the M32, M52 and E kernels remove the bias of the SE kernel to low values, giving a marginalised posterior peaked around the true value. Middle: In the low noise region, the M52 kernel moves posterior mass away from the true value while the marginalised posterior is more uniform, thus representing the lack of information in the data more faithfully. Right: Neither the MAP kernel nor the M32 kernel are peaked at the true value. However, the contributions from other kernels give a marginalised posterior peaked close to the true value.

Figure 7. [...] The error bars on each data point measure the variation of this uncertainty over the inputs in a dataset. The orange and green lines are the mean of the blue data points in their respective log10(SNR) ranges and their error bands are the respective one-sigma ranges. There is a transition to an average increase in the uncertainty at a critical noise level, roughly where the kernel posterior transitions from being flat to peaked at the true kernel (red border in Figure 2).
For 𝑚_𝐻0 = 0 km s⁻¹ Mpc⁻¹, the inference from the Cos kernel shows a larger discrepancy. However, the Cos kernel contribution to the marginalised 𝐻0 value is suppressed by the kernel posterior, P_Cos ≈ 0. For the two other 𝐻0 priors, the Cos kernel agrees with the other kernels within one sigma. This is because the posterior for the period hyperparameter, p_Cos, increases up to p_Cos = 5, which is larger than the range of redshifts in the dataset, so that the effect of the periodicity becomes negligible. For the ESS kernel, the uncertainty σ_𝐻0 is large, 27 < σ_𝐻0 [km s⁻¹ Mpc⁻¹] < 125, compared to the other kernels, and the kernel posterior probability, P_ESS ≈ 0, suppresses its contribution. These results are consistent with the expectation that periodic kernels are not necessarily suitable to describe non-periodic data. Kernel marginalisation automatically accounts for this. Overall, marginalisation thus removes the inductive bias of the kernel choice.
For the CC data with noise model (N.iii), the MAP kernel is consistently the linear kernel. Since functions sampled from a GP with a linear kernel are linear functions, this indicates that linear regression suffices to describe the data. This is validated by plotting the predictive distribution for the M32 kernel separately for σ_M32 < 400 km s⁻¹ Mpc⁻¹ and σ_M32 > 400 km s⁻¹ Mpc⁻¹, showing that the fits through the data points are nearly identical and well approximated by a linear function (Figure 10). Furthermore, this shows that samples with σ_M32 > 400 km s⁻¹ Mpc⁻¹ do not change the inferred 𝐻0 value and that the chosen prior range is sufficiently wide to infer 𝐻0 accurately. When the BAO measurements are added, the posterior probability of the linear kernel decreases in favour of the M32 kernel. This is because the BAO dataset provides additional 𝐻(𝑧) measurements in the ranges 0 ≲ 𝑧 ≲ 0.6 and 2.33 < 𝑧 < 2.36, which are close to the boundary of the CC dataset's 𝑧-range and have smaller error bars, thus preferring a more sigmoid function for which kernels other than the linear kernel are more suited.
Using noise model (N.i) has four effects when compared to (N.iii). Firstly, it leads to an increase in 𝐻0 (Figure 9) for the CC and combined datasets. This is because the variation of 𝐻(𝑧) near 𝑧 = 0 is fitted as uncorrelated noise when (N.iii) is used, whereas the included error bars in (N.i) constrain the fit to curve upwards around 𝑧 = 0, thus lowering the probability of the linear kernel and shifting 𝐻0 to higher values. The same effect causes the addition of the BAO to the CC data to increase 𝐻0, as sigmoid functions become preferred. For the BAO dataset alone, however, changing noise model (N.i) to (N.iii) increases 𝐻0, which again correlates with the decrease of the posterior probability of the linear kernel. Secondly, the uncertainty on 𝐻0 decreases (increases) for the CC and combined (BAO) datasets when changing the noise model from (N.iii) to (N.i). In this case, if the posterior probabilities of the linear and M32 kernels are comparable, the error on 𝐻0 increases because the linear kernel produces an 𝐻0 value smaller than the M32 kernel, albeit with smaller error bars, thus leading to an overall increase in the variance on 𝐻0. Thirdly, the mean of the 𝐻0 value increases with the width of the 𝐻0 prior, i.e. in the order (M.i), (M.ii) and (M.iii), albeit within one sigma. This shows that the prior choice of 𝑚_𝐻0 = 0 km s⁻¹ Mpc⁻¹ contains a bias to low 𝐻0 values, as expected from Figure 8. If the constraint on 𝐻0 from these datasets were tighter, the bias would be more apparent. Finally, the increase in prior volume causes the evidence to decrease significantly, making it desirable to find alternative parametrisations of the GP model which are unbiased (Handley & Millea 2019).
Comparing noise models (N.i) and (N.ii), it is observed that (N.ii) has both larger 𝐻0 values and a larger uncertainty σ_𝐻0. Despite the differences in 𝐻0 being within the error, this is accounted for by the combination of two effects. Firstly, we note that the noise model (N.ii) fits a value of β smaller than 1. In particular, this is β = 0.70 ± 0.13, β = 0.72 ± 0.16 and β = 0.67 ± 0.09 for the CC, BAO and combined datasets, respectively. Therefore, the scatter in the individual data points is smoothed less by the noise term N, which is proportional to β², and must instead be fitted by the kernel. As a result, the decorrelation length scale of the kernel decreases. For example, the M32 kernel length scale ℓ_M32 decreases from 10.3 ± 5.2 to 7.7 ± 4.4 when fitting the combined dataset with mean function (M.i) with noise models (N.i) and (N.ii), respectively. This leads to a strong swelling of the error band of the GP predictive distribution in regions where data is sparse (Figure 11), which in turn leads to the larger error bars on the extrapolated 𝐻0 values. In addition, the smaller decorrelation length causes the GP to decay to the mean function over a shorter redshift range, so that the fit with noise model (N.ii) curves up more strongly in the vicinity of 𝑧 = 0, leading to a larger 𝐻0 value (Figure 11). Secondly, the posterior probability of the linear kernel decreases when the noise model is changed from (N.i) to (N.ii), because β being less than 1 favours fits which follow all data points more closely. We note that the linear kernel produces slightly smaller 𝐻0 values compared to the Matérn kernels (E, M32, M52, M72, SE) because the linear functions grow without bound as a consequence of non-stationarity, whereas the Matérn kernels revert to the mean function. Thus, the disappearance of the preference for the linear kernel also shifts 𝐻0 to larger values.
We note that the hyperparameter σ in noise model (N.iii) attains an average value of 13 ± 2 km s⁻¹ Mpc⁻¹ and 9.4 ± 1.1 km s⁻¹ Mpc⁻¹ in the CC and combined datasets, respectively, which is less than the average error bar size of these datasets, 20 km s⁻¹ Mpc⁻¹ and 14 km s⁻¹ Mpc⁻¹, suggesting that the measurement errors are overestimated. This conclusion is consistent with the average values of β in noise model (N.ii) being less than 1. Furthermore, if the white noise term, σ²I, had been included in the covariance matrix of the extrapolation, the error bar size on the inferred 𝐻0 values would increase to 14 km s⁻¹ Mpc⁻¹ and 11 km s⁻¹ Mpc⁻¹ in the two datasets, respectively, which is approximately twice the size of the error bars shown in Figure 9. This indicates that a significant amount of the variation in the dataset is absorbed in the white noise term.
Among the Matérn kernels (E, M32, M52, M72, SE), Figure 9 shows that the posterior probability for the M32 kernel is largest, followed by monotonically decreasing posterior probabilities in the order of the M52, M72 and SE kernels, and lastly the E kernel, which has zero posterior probability for most fits. This is investigated by decomposing the kernel evidence, ln Z_i, into the goodness of fit, ⟨ln L⟩_P, and the KL divergence, D_KL, according to the identity ln Z_i = ⟨ln L⟩_P − D_KL (Hergt et al. 2021). Firstly, the E kernel has the lowest goodness of fit and largest KL divergence, thus explaining its low posterior probability. Secondly, the minor differences in posterior probabilities between the M32, M52, M72 and SE kernels are accounted for by small differences between the goodness of fits and KL divergences (these differences being approximately between 0.2 and 2). For almost all fits, the goodness of fit decreases and the KL divergence increases in the order of the kernels listed, thus explaining the trend in posterior probabilities. This increase in the KL divergence may be interpreted as a complexity penalty arising from the increased smoothness of the corresponding GPs. In this sense, the GP fit with the M32 kernel provides the simplest model and the best fit among the Matérn kernels.
Moreover, the BMDs, which are strictly less than the dimensionality of the fitted models, show that not all hyperparameters are tightly constrained. The lack of constraint is also reflected in the uncertainties of the inferred 𝐻0 values, which are approximately an order of magnitude larger than those of the Planck (𝐻0 = 67.4 ± 0.5 km s⁻¹ Mpc⁻¹; Aghanim et al. 2020) and SH0ES (𝐻0 = 73.2 ± 1.3 km s⁻¹ Mpc⁻¹; Riess et al. 2021) Collaborations, with which the agreement is within two sigma due to the size of the error bars. Hence, more CC measurements are needed to constrain 𝐻0, in agreement with the conclusions in Zhang et al. (2010). Given more measurements, 𝐻0 inference using kernel marginalisation should be repeated in the future. The decreased uncertainty of 𝐻0 will likely result in a stronger dependence on the 𝐻0 prior choice. Lastly, we note that the tendency towards lower values in the cosmologically model-independent CC dataset, 𝐻0 ≲ 68 km s⁻¹ Mpc⁻¹, is consistent with Cimatti & Moresco (2023).

Figure 10. [...] σ_M32 = 400 km s⁻¹ Mpc⁻¹. The tail of the distribution does not agree with the identical posterior calculated in Gómez-Valent & Amendola (2018) despite re-calculation with MCMC methods, which we suspect to be due to the prior choice. Bottom: Mean of the predictive distributions corresponding to samples taken from the blue and orange regions of the left subplot. The CC dataset is shown in the lower plot. The orange and blue curves give similar linear fits within the data range, but when extrapolated, the orange curves have a wider range, motivating the use of the linear kernel.
For further investigation, the datasets can be augmented with additional synthetic data points. This way, the dependence of kernel inference on the number of data points, the range of redshifts covered in the dataset and the noise level can be investigated, similarly to Section 5, to assess the conditions under which the kernel posterior becomes strongly peaked at a single kernel. This would also enable the definitive selection between other non-stationary kernels, such as polynomial kernels of higher order.

Figure 11. [...] 𝑧 ∈ [0.1, 0.9]. However, the error band broadens more strongly than for the other two GP fits in the range 𝑧 ∈ [0, 0.1], thus resulting in larger error bars on 𝐻0, as shown in the inset. This plot was produced for the combined CC and BAO datasets and the mean function option 𝑚_𝐻0 = 0 km s⁻¹ Mpc⁻¹.

CONCLUSIONS
A novel, principled, fully Bayesian approach to kernel comparison and kernel marginalisation in Gaussian Process regression is introduced and applied to hyperparameter inference. The key quantity to calculate for inference from a dataset is the posterior over the kernel choice. This was accomplished by implementing a transdimensional sampler which samples the joint posterior of the kernel choice and the kernel hyperparameters in a single run. The approach is applied to synthetic exoplanet transit data and to the inference of the Hubble parameter from observational cosmic chronometer and baryon acoustic oscillation data.
The method infers the true kernel in the low noise region of synthetic data from exoplanet light curve simulations and removes kernel preference in the high noise region. The contributions from multiple kernels increase the accuracy of the physical hyperparameter posteriors and remove their bias. Furthermore, single-kernel inference underestimates the uncertainty on the mean function; under kernel marginalisation, this uncertainty increases when transitioning from the low to the high noise region.
The method was applied to the inference of the present-day Hubble parameter, H_0, from measurements of the Hubble parameter against redshift from cosmic chronometers and baryon acoustic oscillations, yielding the values H_0 = 66 ± 6 km s−1 Mpc−1 and H_0 = 67 ± 10 km s−1 Mpc−1, consistent with the Planck and SH0ES values within two sigma. Additionally, the contributions from individual kernels agree within one sigma, and kernels whose inferred H_0 values deviate more strongly are suppressed by the kernel posterior.
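The marginalised H_0 constraints correspond to mixing the per-kernel H_0 posteriors with weights given by the kernel posterior probabilities, which are proportional to the kernel evidences under equal kernel priors; the sketch below illustrates this mixing with hypothetical inputs.

```python
import numpy as np

def marginalise_over_kernels(h0_samples, logZ, n_draw=10_000, seed=0):
    """Draw from the kernel-marginalised H0 posterior.

    h0_samples : dict mapping kernel name -> array of H0 posterior samples
    logZ       : dict mapping kernel name -> log-evidence (equal kernel priors assumed)
    """
    rng = np.random.default_rng(seed)
    kernels = list(h0_samples)
    logz = np.array([logZ[k] for k in kernels])
    p_kernel = np.exp(logz - logz.max())
    p_kernel /= p_kernel.sum()                  # kernel posterior probabilities
    counts = rng.multinomial(n_draw, p_kernel)  # draws allocated to each kernel
    draws = [rng.choice(h0_samples[k], size=n) for k, n in zip(kernels, counts)]
    return p_kernel, np.concatenate(draws)
```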
From a machine learning perspective, these investigations clearly show the characteristics of the method. Specifically, the ability to infer the true kernel correlates strongly with the noise level and the number of data points; the method captures the additional uncertainty arising from the kernel choice, reproduces and improves upon existing analyses, automatically selects the most probable kernel and removes the inductive bias introduced by pre-selecting a kernel, thereby improving over the conventional maximum-likelihood approach. Owing to these characteristics, strong conclusions can be drawn from the application of the method to astrophysical and cosmological data. In particular, pre-selecting a kernel is insufficient for inferring H_0 from datasets with few data points and high noise, such as the cosmic chronometers. Existing analyses that pre-select a kernel therefore need to be verified with kernel marginalisation.
Future explorations of the method are encouraged and include lowering the computational runtime through approximate GP methods. Additionally, the approach can be extended to marginalisation over mean functions and noise terms. From a physical perspective, kernel selection facilitates the identification of the most probable model, which in turn guides the physical modelling. The method is therefore suitable for any dataset analysed with Gaussian Process regression in which there is only a weak modelling preference.

Figure 1. Example synthetic dataset for N_data = 750 and log10(SNR) = 1. The green curve is the true mean function calculated from an exoplanet transit light curve simulation. The data points are obtained by adding noise drawn from an M32 kernel. The red and blue curves are the mean function predictive distributions, marginalised over the kernel posterior and conditioned on the M32 kernel, respectively. The shaded regions are one-sigma error bands.

Figure 2. Plot showing that the inference of the true kernel is correlated with the noise level and the number of data points. A measure of the sharpness of the kernel posterior at the M32 kernel, defined as the M32 posterior probability minus the mean posterior probability over all kernels, is plotted against the signal-to-noise ratio, log10(SNR), and the number of data points, N_data, of synthetic datasets created from an exoplanet simulation (Section 5.1) with correlated noise from an M32 kernel. Each coloured square corresponds to a dataset; the inscribed upper value is the plotted sharpness and the lower text is the maximum a posteriori (MAP) kernel. In the region within the red border, the true M32 kernel maximises the posterior. For N_data = 225, 450 and 675, the M52 kernel has a similar or larger posterior probability than the M32 kernel even for log10(SNR) ≈ 1. The complete kernel posteriors for the datasets marked (a), (b) and (c) are shown in Figure 3.
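Written out explicitly, this sharpness measure is simply the posterior probability of the true kernel minus the mean posterior probability over all kernels; a minimal illustrative snippet follows, in which the example probabilities are hypothetical.

```python
import numpy as np

def sharpness(kernel_posterior, true_kernel="M32"):
    """Posterior probability of the true kernel minus the mean over all kernels."""
    probs = np.array(list(kernel_posterior.values()))
    return kernel_posterior[true_kernel] - probs.mean()

# e.g. sharpness({"E": 0.02, "M32": 0.55, "M52": 0.25, "M72": 0.12, "SE": 0.06})
```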

Figure 4. Top: Plot of the true kernel, which is an M32 kernel with amplitude 0.002 and length scale ℓ_M32 = 0.02 days, together with M52 kernels whose hyperparameters are sampled from the posterior. The similarity metric Δ is calculated by summing the probability density of the M52 kernel samples along the curve of the true kernel. The settings for the plot are N_data = 75 and log10(SNR) ≈ 0.30. Bottom: ln Δ is shown for each kernel. Larger Δ corresponds to higher similarity between the inferred and true kernels. It is seen that, for any choice of the maximum, the true M32 kernel is not approximated better by the M32 kernel posterior than by the other kernels, leading to a flat kernel posterior.

Figure 5. Plot of the sharpness measure when multiple occultations are included in the dataset and the mean function and kernels are fitted to the data simultaneously. As the number of occultations in the dataset increases, the lower-left triangular region in which the kernel cannot be inferred shrinks, while the upper-right region in which the true kernel can be inferred grows. This is consistent with the results for a fixed mean function (Figure 2) and shows that the method is robust for a free mean function when statistically independent realisations of an occultation are present in the dataset.
Figure 6. Examples of the effect of kernel marginalisation on hyperparameter inference, showing that when the true hyperparameters cannot be inferred by either the M32 kernel or the maximum a posteriori kernel, kernel marginalisation still increases the accuracy. Posterior distributions for two of the exoplanet mean function hyperparameters are shown for synthetic datasets with N_data = 75. Left: While the SE kernel, which has the maximum posterior kernel probability of 0.31 ± 0.05 among the kernels, prefers a value of zero for this hyperparameter, the E kernel is peaked close to the true value despite its smaller posterior kernel probability. In total, the contributions from the M32, M52 and E kernels remove the bias of the SE kernel towards low values, giving a marginalised posterior peaked around the true value. Middle: In the low noise region, the M52 kernel moves posterior mass away from the true value while the marginalised posterior is more uniform, thus representing the lack of information in the data more faithfully. Right: Neither the MAP kernel nor the M32 kernel is peaked at the true value. However, the contributions from other kernels give a marginalised posterior peaked close to the true value.


Figure 7. Uncertainty increase D in the mean function predictive distribution against the signal-to-noise ratio, log10(SNR), for N_data = 75 and 150. The error bars on each data point measure the variation of this uncertainty over the inputs in a dataset. The orange and green lines are the means of the blue data points in their respective log10(SNR) ranges and their error bands are the respective one-sigma ranges. There is a transition to an average increase in the uncertainty at a critical noise level, roughly where the kernel posterior transitions from being flat to peaked at the true kernel (red border in Figure 2).

Figure 8. Prior probability distributions for H_0 induced by the hyperparameter priors for each of the mean function priors in Section 6.1. The distributions are kernel density estimates with one-sigma bootstrap error bands. The uniform priors on the mean-function H_0 parameter broaden the H_0 prior and centre it at H_0 = 50 km s−1 Mpc−1, thus removing the bias towards H_0 = 0 km s−1 Mpc−1. The priors shown are calculated for the M32 kernel but have the same shape for all other kernels.

Figure 9. Means and one-sigma credible intervals of H_0, evidences ln Z and Bayesian model dimensionalities (Handley & Lemos 2019a) obtained by marginalising over the kernel choice. The columns of the grid of nine subplots correspond to the choice of dataset on which GPR is performed: the CC, BAO and combined CC and BAO datasets. The rows correspond to three different noise models. Within each subplot, each colour corresponds to a choice of the H_0 prior (Figure 8). The right column shows the kernel probabilities and the H_0 values inferred from each kernel. The left column shows the final H_0 values obtained by marginalising over these probabilities. It should be noted that, owing to the width of the error bars of some H_0 values of the ESS kernel, these values display as vertical lines on the plot.

Figure 10. Top: Posterior of the amplitude of the M32 kernel when GPR is performed with the M32 kernel, the mean function set to zero and the noise model given by the diagonal matrix of the squared measurement uncertainties. The blue and orange regions are separated by an amplitude of 400 km s−1 Mpc−1. The tail of the distribution does not agree with the identical posterior calculated in Gómez-Valent & Amendola (2018), despite re-calculation with MCMC methods, which we suspect is due to the prior choice. Bottom: Mean of the predictive distributions corresponding to samples taken from the blue and orange regions of the top subplot. The CC dataset is shown in the lower plot. The orange and blue curves give similar linear fits within the data range but, when extrapolated, the orange curves span a wider range of values, motivating the use of the linear kernel.

Figure 11. Kernel-marginalised GP fits for the three options of the noise term. The option in which the diagonal matrix of squared measurement uncertainties is scaled by a free parameter effectively fits a dataset with smaller measurement errors. Hence, the GP error band is narrower in the range z ∈ [0.1, 0.9]. However, the error band broadens more strongly than for the other two GP fits in the range z ∈ [0, 0.1], resulting in larger error bars on H_0, as shown in the inset. This plot was produced for the combined CC and BAO datasets and the mean function option fixed to 0 km s−1 Mpc−1.

Table 1. Kernel families commonly used for Gaussian Processes.