Eryn : A multi-purpose sampler for Bayesian inference

In recent years, methods for Bayesian inference have been widely used in many different problems in physics where detection and characterization are necessary. Data analysis in gravitational-wave astronomy is a prime example of such a case. Bayesian inference has been very successful because this technique provides a representation of the parameters as a posterior probability distribution, with uncertainties informed by the precision of the experimental measurements. During the last couple of decades, many specific advances have been proposed and employed in order to solve a large variety of different problems. In this work, we present a Markov Chain Monte Carlo (MCMC) algorithm that integrates many of those concepts into a single MCMC package. For this purpose, we have built {\tt Eryn}, a user-friendly and multipurpose toolbox for Bayesian inference, which can be utilized for solving parameter estimation and model selection problems, ranging from simple inference questions, to those with large-scale model variation requiring trans-dimensional MCMC methods, like the LISA global fit problem. In this paper, we describe this sampler package and illustrate its capabilities on a variety of use cases.


INTRODUCTION
In physics, and in science in general, one of the most encountered problems is the one of model calibration and comparison.We test our models of the physical world against the measured data, to estimate their parameters and to robustly determine the most suitable model that describes our observations.A crucial first step in this direction, is to efficiently explore the posterior distribution of the parameters given the measured data.Markov Chain Monte Carlo (MCMC) algorithms have proven to be very successful in this regard (Metropolis et al. 1953;Hastings 1970;Hitchcock 2003;Gilks et al. 1995), being one of the few methods which can efficiently perform Bayesian inference when the posterior is not analytically tractable and without solving exactly for the marginal likelihood.This is compared to, for example, grid methods, which are often computationally unfeasible.This is especially true in the field of Gravitational-Wave (GW) astronomy, where MCMC methods have been extensively used in order to find physical parameters for signals buried in the data (see e.g.Ashton & Talbot (2021); Biwer et al. (2019); Abbott et al. (2021a); Collaboration & the Virgo Collaboration (2020); Shawhan (2003); Abbott et al. (2021b)), as well as to hierarchically infer the properties of the underlying astrophysical populations (e.g.Ashton & Prix (2018); Thrane & Talbot (2019); Isi et al. (2022)).MCMC approaches can also compute the marginal likelihood or evidence (see section 2), by using techniques such as Thermodynamic Integration (see section 2.4).In a Bayesian framework, the evidence difference between two models can be used to compute the Bayes Factor, which is used to select between different models that could describe the observations.Thermodynamic integration (or other approximations, see section 2.5 or Cornish & Littenberg (2007); Gelman et al. (1996)), is ideal for cases where the number of competing models is small.However, in situations where the number of potential models becomes too large, the task of iteratively and hierarchically computing the marginal likelihood can become computationally inefficient, or even practically unachievable.Such is the case for future signal-dominated GW observatories, such as the Laser Interferometer Space Antenna (LISA) (Amaro-Seoane et al. 2017) or other proposed space-borne GW observatories (Luo et al. 2016;Kawamura et al. 2008;Ren et al. 2023).LISA will observe different types of GW sources, the most numerous of them being the Ultra Compact Binaries (UCBs) within the Milky Way (Amaro-Seoane et al. 2017, 2012;Littenberg et al. 2020;Littenberg & Cornish 2023;Strub et al. 2022;Zhang et al. 2021;Karnesis et al. 2021;Crowder & Cornish 2007).Those are mostly comprised of a population of Double White Dwarfs (DWD), although a small fraction of neutron star -white dwarf (NS-WD) or double neutron stars (NS-NS) binaries are expected (Breivik et al. 2020;Nelemans et al. 2001).In fact, LISA is going to detect GWs from the complete population of O (10 7 ) sources simultaneously, but only a small fraction of them are going to be individually resolvable (O (10 4 )).The large majority of signals will generate an anisotropic and non-stationary "confusion" type of signal, which will dominate the LISA band between 0.05 and ∼ 0.2 mHz.
In the above context, computing the marginal likelihood for such a large parameter space and for all possible numbers of events that could be in the population becomes computationally prohibitive.Instead, we can employ dynamical trans-dimensional MCMC meth-ods (Green 1995).This family of methods can be quite challenging to tune, but it has proven to yield satisfactory results, even for such demanding problems as the LISA data (Littenberg et al. 2020;Littenberg & Cornish 2023).There are also implementation challenges, which arise from technical aspects of the algorithm; one example being the dimension matching requirement when proposing moves between models with different dimensionality.In terms of algorithm efficiency, it is also crucial to choose proposal distributions that allow smooth transitions on a dynamical parameter space, a task which in many cases requires substantial effort.For these technical reasons, all the available software tools of this kind have been specifically developed for the particular problem they intend to solve.
In this work, we present Eryn, a reversible jump MCMC algorithm, capable of efficiently sampling dynamical parameter spaces, while remaining generic and usable by a large community.We build upon various ideas from statistics, astronomy, etc., in order to develop an efficient statistical toolbox that can be applied to the majority of problems involving detection and characterization of signals.Our primary goal, however, is to utilize Eryn as a basic ingredient for a data analysis pipeline to perform the LISA Global Fit (Littenberg et al. 2020;Littenberg & Cornish 2023).The Global Fit is a data analysis strategy required to tackle the problem of multiple source detection, separation, and characterization in LISA data.For demonstration purposes in this work, we use Eryn to analyze a "reduced" scenario of the LISA data in section 4.
This paper is organized as follows: in section 2, we begin with explaining the foundations of the MCMC algorithm, as well as some of the relevant methods that we have adopted for our implementation.In section 3, we describe how the methods introduced in section 2 are combined into the actual toolbox implemented in Eryn.In section 3.3, we demonstrate the capabilities of Eryn through some toy examples, while in section 4 we apply our machinery to more demanding applications in Gravitational-Wave astronomy.In particular, we demonstrate Eryn by performing model selection on a simulated population of Ultra Compact Galactic Binaries (UCBs) as measured by the future LISA observatory.Finally, in section 5, we summarize our work and discuss future applications.We should state again here, that Eryn is available as open source software in https://github.com/mikekatz04/Eryn.

MARKOV CHAIN MONTE CARLO ALGORITHMS
Nowadays, MCMC methods are considered to be a cornerstone of Bayesian Inference, being very effective in finding solutions to problems encountered across wide-ranging disciplines (e.g.Hobson et al. 2009;Ashton & Talbot 2021;Biwer et al. 2019;Hogg & Foreman-Mackey 2018;Sharma 2017;Johannes & Polson 2009).These include the sampling of the posterior densities of parameters of interest, the numerical marginalisation over nuisance parameters, and providing a framework to compute the marginal posterior distributions (or evidences) that can be used for model selection.The Bayesian framework is based around Bayes' Theorem: where  is the measured data and M our chosen model of analysis.
The ( ì |) term is the posterior distribution of the parameter set ì , which is related to the likelihood function of the data (| ì , M) and the prior densities of the parameters ( ì |M).The evidence (|M) is the marginal posterior over the parameter space ì  ∈ Θ: (2) For parameter estimation purposes, the evidence acts as a normalization constant and can be ignored.However, it is really important if one wants to perform model selection over the measured data.We shall describe in detail how one can numerically approximate the integral of eq. ( 2) in section 2.5.MCMC algorithms work by constructing a Markov Chain sequence, whose elements, ì  (  ), for  = 0, 1, . .., are (asymptotically) independent samples from the target distribution,  ( ì ).Under fairly general assumptions, the distribution of samples in the chain will converge to the target distribution provided the algorithm satisfies detailed balance: Here ( ì  → ì  ′ ) is the probability that the Markov chain moves from point ì  to point ì  ′ .The most widely-used MCMC algorithm is Metropolis-Hastings (Metropolis et al. 1953;Hastings 1970), which is explained in algorithm box 1.The first step of the algorithm is to define an initial state, ì  ( 0 ).Then, at each subsequent step , a new state is proposed, by randomly drawing from a given proposal distribution ( ì The newly proposed state is then accepted with a certain probability, given by eq. ( 4).If the move is accepted we set ì Any reasonable choice of the proposal density will generate a Markov chain with the correct stationary distribution.However, a good choice of  is critical for its efficiency, i.e. achieving the convergence of the MCMC chains within a reasonable computational time.For the special case of a symmetric proposal distribution1 , such as the widely used multivariate Gaussian distribution centered around ì  (  ), the ratio of eq. ( 4) in algorithm box 1 becomes simply the ratio of the target densities at the current ì  (  ) and proposed ì  ′ points.For high-dimensional problems, the multivariate Gaussian proposal can be tuned during the burn-in period of sampling to improve efficiency (Haario et al. 2001;Frenkel 2006;Andrieu & Thoms 2008;Gelman et al. 1996;Roberts & Rosenthal 2009;Christensen & Meyer 2022), or even scaled according to the Cramer-Rao bound, estimated from the Information matrix (Vallisneri 2008).
Algorithm 1: The Metropolis-Hastings algorithm (Metropolis et al. 1953;Hastings 1970). is the target density to be sampled. , where 4: Go to 2 and repeat until equilibrium is reached, and enough independent samples have been drawn from the target distribution.
Although the MH algorithm has been quite successful in tackling inference problems, there are practical implementation issues to overcome.Improving acceptance rate is crucial for convergence, and sometimes improvements in the proposal distribution are not sufficient to efficiently sample the parameter space.To tackle these issues, various MCMC enhancements have been proposed.A prime example is the Hamiltonian Monte Carlo (HMC) algorithms that utilize local gradients in order to generate proposal points (Neal 2011;Betancourt & Girolami 2013).One variant of HMC is the No-U-Turn sampler which automates part of the required tuning of the HMC Hoffman & Gelman (2011) sampler.Another alternative to the "standard" MH is the Gibbs sampling algorithm, which is particularly useful if the conditional distributions of the parameters of the model are known (Ritter & Tanner 1992;Muller 1991;Gilks & Wild 1992).All of the above developments have been shown to be useful in various disciplines (Gelman et al. 2004;Hogg & Foreman-Mackey 2018;Brooks 1998;Robert & Casella 2011;Joseph 2019;Sharma 2017;Kendall et al. 2005;Sivia & Skilling 2006;Sorensen & Gianola 2007;Johannes & Polson 2010;Baio 2012).Finally, there have recently been numerous proposals that aim to enhance sampling with machine learning techniques.At their core, many of these methods optimize the exploration of the likelihood surface, either by learning it directly (see for example Hermans et al. (2019)) or by sampling it in a simpler latent space (for example Wu et al. (2020)).
In this work, we introduce Eryn, which is built around the emcee package (Foreman-Mackey et al. 2013), enhanced with a variety of sampling mechanisms that allow us to perform inference on dynamical parameter spaces with minimal tuning.We expand on the most important features in the sections below.

Affine-invariant samplers
An affine transformation is one of the form ì  → ì  =  ì  + , where  and  are a constant matrix and vector respectively.Under an affine transformation a probability density ( ì |) transforms to Such transformations can help to transform difficult-to-sample distributions into easier-to-sample ones.A simple example is a multivariate Normal distribution.If the dynamical range of the eigenvalues of the covariance matrix is very large, then sampling can be difficult, but any multivariate Normal distribution can be transformed into a spherical distribution via an affine transformation.Affine-invariant MCMC is a class of samplers that are designed to have equal sampling efficiency for all distributions that are related by an affine transformation (Foreman-Mackey et al. 2013;Goodman & Weare 2010).The sequence of samples in a Markov chain, { ()}, can be written deterministically as a function of a sequence of random variables,  (), which represent the random draws used to propose new points and evaluate the accept/reject decision.Specifically we can always write where  denotes the target density.An affine-invariant sampler has the property i.e., the sequence of points visited when sampling an affinetransformed density are the affine transformations of the states visited when sampling the original density.If an affine transformation exists that maps the given target density to one which is more straightforward to sample from, an affine-invariant sampler should sample it as efficiently as it could the simpler distribution, so the convergence of affine-invariant samplers is less affected by correlations between the parameters (Foreman-Mackey et al. 2013).
In practice, this goal is achieved by following an ensemble of points, called walkers, and basing proposed moves on the distribution of other points in the ensemble.In Foreman-Mackey et al. (2013), the primary update move is the so-called stretch-move proposal.Each walker at state   () is updated by randomly selecting another walker  and proposing a new value  =   () +  [  () −   ()], where  is a random variable drawn from the distribution (Goodman & Weare 2010) The parameter  can be tuned to improve convergence, but  = 2 works well in the majority of applications (Foreman-Mackey et al. 2013).The proposed point is accepted with probability where  is the target density and  is the dimension of the parameter space.This acceptance probability is specific to the stretch proposal distribution given by Eq. ( 8).For other stretch proposals, the term  −1 must be replaced by  −2 (1/)/().Following this scheme, detailed balance is maintained, and it can be proven that affine-invariant samplers converge faster to their target distribution (Foreman-Mackey et al. 2013).Below in section 3 we discuss the extension of the stretch-move proposal to Reversible-Jump MCMC methods.The benefits of running MCMC chains in parallel, combined with a proposal distribution that requires almost no tuning, have contributed to an increasing popularity of affine-invariant samplers.

Delayed Rejection
The Delayed Rejection (DR) scheme of sampling was devised in order to improve two aspects of MCMC algorithms.First, it allows for improvements in the acceptance rate of the proposals, yielding "healthier" parameter chains, with better mixing.Secondly, it is more robust against becoming trapped in local maxima of the posterior surface (Trias et al. 2009;Mira 2001;Green & Mira 2001;Haario et al. 2006).The strategy, as the name suggests, is, at each iteration, instead of immediately rejecting the newly proposed point based on algorithm 1, we keep proposing new points while maintaining detailed balance by computing both the forward and backward transition probabilities.Suppose we are at a point ì  0 and use a proposal ( ì The usual acceptance probability, following the notation of eq.(1), is as per eq.( 4).If ì  1 is rejected, then instead of going back to step 1 of algorithm 1, we propose a new point, ì  2 , drawn from a proposal distribution ( ì  2 | ì  1 , ì  0 ).This proposal distribution may depend only on ì  1 , but we write it more generally here to allow for the case that the proposal is adapted based on the sequence of steps that have been rejected.The acceptance probability for ì If ì  2 is rejected, further steps can be included and each step adds additional proposal and rejection-probability terms to the numerator and denominator of the acceptance probability.For example, the three step acceptance probability,  3 ( ì is the minimum of one and The proposal  can be different at each step, as long as the relevant proposal density is used in eq. ( 11).For example, in Trias et al. (2009) the proposal is built upon a Gaussian mixture model that tries further points in the parameter space with the aim of efficiently exploring multiple modes of the posterior distribution.As the number of steps in the DR scheme becomes arbitrarily large the acceptance probability slowly approaches zero.This algorithm is also limited in practice by high computational requirements, since at every delayed rejection step we need to evaluate a new likelihood and compute the backwards probability (the  1 ( ì  2 , ì  1 ) from eq. ( 11)).Nevertheless, the DR scheme offers many advantages, and despite the computational cost, it is very useful when the posterior surface exhibits high dimensionality, and when acceleration techniques are available.These, for example, might include the use of Graphical Processing Units (GPUs), and/or heterodyned likelihoods (Cornish 2021).In our implementation here, we follow closely the one in Trias et al. (2009), for improving the acceptance rate of the between-model step of the Reversible Jump algorithm (see section 2.6).As already mentioned, the Reversible Jump MCMC allows for sampling dynamical parameter spaces.In the special case of nested models, such as the case of searching multiple signals in the LISA data, proposing the 'birth' of a signal out of a very wide prior can be very inefficient.A delayed rejection scheme alleviates this problem, by effectively performing a small search around the first set of rejections, increasing the chances of finding a good signal candidate, and thus improving the mixing of the chains.

Multiple Try Metropolis
The Multiple Try Metropolis (MTM) (Martino 2018;Liu et al. 2000;Martino et al. 2011;Bédard et al. 2012) is a sub-class of the implementation of the MH algorithm, which is based on the idea of generating a number of proposals for each individual current state, and then selecting one of them based on their importance weight.In proposing a move from ì   −1 , a set of  possible new points, {  }, are drawn from a proposal distribution (| ì   −1 ) and are assigned weights   = (  | ì   −1 ) using a weight function (•| ì   −1 ).One of these proposed points,   , is selected with probability given by the normalised weight To compute the acceptance probability we need to draw  − 1 points, {  ,  = 1, . . .,  − 1}, for the reverse move from the pro-posal (|  ), and assign weights (|  ).We then set ì and set ì   = ì   −1 otherwise (Martino 2018).This procedure will satisfy detailed balance if the weight function is chosen such that This will be satisfied by a weight function of the form where  ( ì with  being the dimensionality of the problem at hand.The detailed balance condition can also be satisfied by a weight function of the form Making this choice and additionally using a proposal function that is independent of the current point, ( ì ) only, we obtain the Independent MTM algorithm (Martino 2018).When using the independent MTM algorithm detailed balance is maintained when the same set of points is used for the reverse proposal as for the forward proposal, which saves the evaluation of  − 1 posterior densities.
The base MTM is currently implemented in Eryn with options for the Independent MTM algorithm and symmetric proposals.For a symmetric proposal distribution, ( ì , eq. ( 15) can be satisfied using the weight function ( ì In this case, we still need to draw separate samples for the reverse step (unlike in the Independent MTM case).
Generating a large number of candidate points yields certain advantages.As expected, the first advantage is the fact that there is usually very good coverage of the parameter space.The second is that the implementation of the MTM usually results in chain states with very low correlation between them.Nevertheless, as for Delayed Rejection, this algorithm requires increased computational resources, since multiple likelihoods/posterior densities have to be evaluated at each iteration of the chain.This cost can be offset in cases where the computations can be parallelized, for example using either CPU or GPU acceleration.

Adaptive Parallel Tempering
The concept of Parallel Tempering was introduced in order to efficiently sample surfaces with high multi-modality (Swendsen & Wang 1986;Hukushima & Nemoto 1996;Vousden et al. 2016).The idea is based on a transformation of the posterior density to a density with a different temperature, , defined by For  = 1 this is the target posterior density.In the limit  → ∞ it is the prior density.Intermediate temperatures "smooth out" the posterior by reducing the contrast between areas of high and low likelihood.
In parallel tempering, a set of Markov chains are constructed in parallel, each one sampling the transformed posterior for a different temperature .These chains periodically exchange information.The idea is that the hottest chains explore the parameter space more widely, and information about areas of high likelihood that they encounter propagate to the colder chains.Information is exchanged by proposing swaps of the states between the different chains.If two chains are sampling from target densities  1 ( ì ) and  2 ( ì ) respectively, then the transition probability for chain 1 in the swap is ). Detailed balance is thus maintained by accepting the swap with probability which for the specific case of swapping between two tempered chains  and  when doing parallel tempering is with   = 1/  being the inverse temperature, and ì   the given parameter state for the -th chain.
The temperature ladder   should be chosen in order to maximize the information flow between chains of different temperatures, so as to encourage the efficient exploration of the complete parameter space.Typically, this ladder can be static or dynamically adjusted during the sampling procedure.In Eryn we have adopted the procedure of Vousden et al. (2016), which adapts the temperature ladder based on the swap acceptance rate calculated directly from the chains.Ideally, one should aim for equal acceptance ratio between every pair of neighboring tempered chains, thus tuning their log-temperaturedifference   ≡ log(  − −1 ), according to the swap acceptance rate from eq. ( 20): where () tunes the timescale of the evolution of the temperatures.
The function () can be chosen depending on the desired behavior of the procedure.In Vousden et al. (2016) a hyperbolic dependence on the  state is chosen, in order to suppress large dynamic adjustments on long timescales.This setup is the default option in Eryn, but it can be customized.This process is more straightforward for ensemble samplers, where multiple walkers are used, simply because one can get an estimate of the acceptance rate directly from the particular state of the walkers at any given time step .Otherwise, the acceptance rate is computed after iterating for a predefined number of steps, which can be chosen by the user for the given problem at hand.It can be proven (Vousden et al. 2016), that the temperature ladder will converge to a particular stable configuration.One should only use this scheme for the initial burn-in stage of sampling, and then continue with a stationary ladder for the rest of the analysis.

Marginal posterior calculation for model selection
One of the most frequently encountered problems in physics, and in science in general, is that of model or variable selection, i.e., identifying the model best supported by the observed data.Working in a Bayesian framework, comparison between different hypotheses may be done by computing their evidences or marginal posteriors (Gelman et al. 2004) and evaluating the Bayes Factor: where the term (M  ), is the prior probability assigned to the model M  .
The marginal posterior density, or evidence, is given by the integral of eq. ( 2) and is in general quite challenging to compute.For some high signal-to-noise ratio (SNR) cases it can be approximated if the covariance matrix Σ of the parameters for all candidate models M are known.This approach is called the Laplace approximation (Kass & Raftery 1995;Gelman et al. 2004).However, this is only an approximation, and it sometimes fails for models with weak support from the data (Cornish & Littenberg 2007) (in particular when the posterior cannot be approximated by a multivariate Gaussian at ì  MAP ).When using parallel tempering 2.4, it is possible to compute the evidence by a procedure known as thermodynamic integration (Lartillot & Philippe 2006).We define a continuous distribution of evidences based on the target distribution for a chain with inverse temperature  = 1/ via For  = 0 the chain is sampling the prior and therefore log Z ,0 = 0.
For  = 1 we are sampling the target distribution and log Z ,1 = log Z  .Additionally we have From this we deduce log The expectation value is over the distribution being sampled by the chain at temperature  and so can be computed by averaging over the posterior samples (Lartillot & Philippe 2006;Goggans & Chi 2004;Vousden et al. 2016).The integral can then be evaluated using standard methods, for example the trapezium rule, using the full ladder of temperatures.This approach generates reliable evidences, with accuracy limited only by the number of temperatures being sampled, and the efficiency of the sampling of the parameter space Θ by the chains.Since its first introduction, there have been many applications of this approach, and in particular, there is extensive usage in GW astronomy (Vousden et al. 2016;Katz et al. 2022;Maturana-Russel et al. 2019;Littenberg & Cornish 2009, 2010).The thermodynamic integral in Eq. ( 25) can be thought of as computing the evidence as a sum of differences between the evidences at different temperatures.An alternative approach, called the steppingstone algorithm (Xie et al. 2010), writes the evidence as a product of the ratios of evidences at different temperatures where   denotes the inverse temperature of chain ,   is the number of different temperatures being sampled, and we assume  1 = 0 and    = 1.Each evidence ratio can be written as a posterior integral where  is the number of posterior samples in each chain, and ì    denotes the 'th sample at temperature   .This leads to the final expression log Z = In challenging situations, e.g., where the number of tempered chains is relatively small, the stepping-stone algorithm has been shown to produce more accurate estimates of the marginal likelihood than methods that use thermodynamic integration (Xie et al. 2010).This was also demonstrated with practical examples drawn from Gravitational Wave Astronomy in Maturana-Russel et al. ( 2019).

Reversible Jump
Another approach to the model selection problem is to follow a Reversible Jump (RJ) MCMC strategy, which can dynamically estimate the most probable hypotheses given the data (Green 1995).
The RJ-MCMC is a generalization of the MH algorithm that allows trans-dimensional proposals.Thus, the model order is considered a free parameter which is fitted together with the parameters of the individual models.The most widespread variation of the algorithm uses a two-stage procedure.(Godsill 2001) where with (| ì   , ) the likelihood for model , ( ì   |) the prior on the parameters ì   in model , and () the prior for the model state .The term  ′ (  )|J|/(  ) arises because of the need for dimension matching between the different model states.In general, we can define a move between model states in terms of a deterministic invertible mapping, ì , that is a function of the parameters and two sets of random variables,   and   , drawn from distributions (  ) and  ′ (  ) (Godsill 2001;Green 1995).The term |J| is the Jacobian defined by this invertible mapping and the term  ′ (  )/(  ) plays the role of the proposal ratio in the standard MH acceptance probability.Dimension mapping means that dim( ì Using RJ-MCMC introduces additional computational cost at each MCMC iteration, as well as technical challenges in implementation.Luckily, implementation can be easier when sampling nested models.This refers to problems where more complicated models contain their simpler counterparts.Examples of such cases are fitting polynomial models, which differ only in the highest order to be determined, or detection problems where multiple similar signals are potentially present the data.In such cases, the between-model step can always be formulated such that the Jacobian of eq. ( 31) becomes unity, and eq.( 29) simplifies to the ratio of posteriors accounting for any differences in prior and proposal volumes (Dellaportas et al. 2002;Lopes & West 2004;Littenberg et al. 2020).
After running RJ-MCMC, the Bayes Factor can be approximated by the ratio of the number of iterations spent within each model: This algorithm has proven to be robust for evaluating highdimensional competing models, and has been quite successful in tackling data analysis problems in GW astronomy (Littenberg et al. 2020;Cornish & Littenberg 2007;Karnesis et al. 2014) as well as areas spanning physics and signal processing (e.g., Marrs 1997;Khan et al. 2005;Yu et al. 2021).However, designing an efficient RJ-MCMC algorithm can be quite challenging.The first challenge is to choose suitable proposal distributions, which can greatly affect the convergence of the algorithm.In situations where the models are nested, it is both tempting and convenient to take the proposal to be the same as the prior distribution of the parameters.As an illustrative example, we refer to section 3.3.1,which describes a toy problem of searching for Gaussian pulses in noisy data.There, the parameters of the individual pulses are the amplitude and location of the pulse described by their (, ) coordinates.In order to search for those signals, the prior on their location must be wide enough to include the complete data set (see figure 1a).A birth proposal based on the prior would inevitably be quite inefficient, simply because the chance of proposing a good source candidate is small, especially if the proposal distribution is flat across (, ).We treat the above problem as a motivation to adopt efficient proposals with minimal tuning in Eryn, which we further discuss in section 3. The second major challenge, which of course depends on the given problem at hand, arises from the samplers' capability to explore a multi-modal dynamical parameter space.We discuss our strategy to overcome that challenge in section 3.

Eryn: GATHERING ALL THE PIECES TOGETHER
All the different algorithms described in previous sections can be extremely useful in tackling different kinds of problems that require sampling.In Gravitational Wave Astronomy, we encounter such problems far too often, where dynamical parameter spaces require vast computational resources in order to be explored efficiently.Motivated by those problems, we have implemented a new toolbox that combines all these techniques to enhance the capabilities of an MCMC sampler.We have named this package Eryn 2 , borrowing the name from the Tolkien mythos (TolkienGateway).The analogy has its basis in the idea of a forest: within a forest you have trees which correspond to different walkers, a.k.a.Ents, in an ensemble MCMC sampler.On each tree there are branches that represent the various types of models used to fit the data.For example, in the case of GW global fitting for LISA, we can imagine using the Galactic binaries as one branch and massive black hole binaries as another branch.Each branch has leaves, which represent the individual instances of each model.In the LISA example, leaves would represent the individual Galactic binaries or massive black hole binaries.And finally, to zoom out, when adding tempering capabilities, we can think of groups of walkers in each temperature taking the form of many forests (of walkers) located within different temperature climates.
We adopt the architecture of ensemble samplers, and in particular the one of emcee (Foreman-Mackey et al. 2013).Having multiple walkers running in parallel is ideal for efficiently sampling the parameter spaces using techniques such as parallel tempering, as described in 2.4 (also see section 2.1).In this setting, we evolve   walkers per temperature   , where each walker follows a Reversible Jump MCMC (see section 2.6), mapping a parameter space of altering dimensionality.In practice, walkers in higher temperatures sample the dynamic parameter space with fewer model components as the penalty from higher prior volume is not compensated by the smoother annealed likelihood.In other words higher temperatures have a sharper Occam's razor: the data can be explained with models that are simpler, or lower-dimensional.The highest temperature chain samples the prior on the model space (provided that  max = ∞).More details will be given in section 3.3.
As already mentioned in section 2.6, Reversible Jump algorithms are extremely challenging to tune, even for simpler classes of problems.One of the major challenges is the low acceptance rate for the between-model proposal, i.e., when we propose a new state where the parameter dimensionality differs.In cases of signal search and detection (which is a nested model situation), it is convenient to set the proposal corresponding to a "birth" move to be the same as the prior distribution.In order to accommodate all possibilities for the signals present, the prior densities are usually quite wide, and thus accepting a new higher dimensional state becomes quite improbable.For that reason, within Eryn, we have implemented a Delayed Rejection scheme with the aim of improving this acceptance ratio.When proposing ì   for a higher-dimensional model , we do not reject immediately, but rather make new delayed rejection proposals around the first rejected point ì   , using the given in-model step update proposal.This, in principle, allows the sampler to explore around ì   before rejecting the new state (Trias et al. 2009), which in turn improves the between-model step acceptance rate and produces healthier MCMC chains.
The Delayed Rejection scheme, as described in section 2.2, requires a serial computation of the delayed rejection acceptance ratio for walkers where the newly proposed state has been rejected.This scheme of calculating costly likelihoods sequentially in a loop during the between-model step, can lead to a computational bottleneck of the MCMC process.This is especially true for the LISA Global Fit problem, where multiple binary waveform signals are present in the data-stream.Then, the computational time for each RJ-MCMC iteration is significantly increased, since the progress will be halted until all walkers have gone through their respective Delayed Rejection process, which requires evaluation of new waveforms at each step.For the reasons summarized here, we have not used the Delayed-Rejection scheme for our analysis in section 4, and have resorted to the Multiple Try scheme.However, the Delayed-Rejection scheme, as explained in section 2.2, has been implemented in the Eryn package.
The Multiply Try scheme was essentially implemented in order to facilitate use of a parallelized likelihood framework.Parallelization is naturally compatible with Multiple Try MCMC as multiple proposals are made for each individual walker, which allows for the parallelized evaluation of proposal distributions, likelihood functions, and acceptance probabilities.Under these parallelized settings, one proposal can act as many when compared to the usual serial evaluation of proposals, allowing for better chain mixing in situations where proposals are infrequently accepted.That being said, it is still important to choose a good proposal distribution, for both the in-model and between-model RJ-MCMC steps, which we discuss further below.

Choosing efficient Proposal distributions
In the previous sections, we briefly discussed some of the challenges in choosing efficient proposal distributions for both the in-model and between-model steps of the RJ algorithm.For the in-model case, the challenge arises from the fact that it is sometimes impractical, or even unfeasible, to define a well-tuned proposal for each of the possible models that could represent the data.Using again the example of LISA data, one would need to tediously design an effective proposal distribution for the thousands of overlapping binary signals in the data.On the other hand, for the case of the between-model step, choosing proposals from the prior distribution, especially if it is highly uninformative, can be very inefficient for Reversible Jump sampling.For Eryn, in order to tackle those issues, we have implemented the Group Proposals explained below for addressing the within-model proposals in RJ, as well as a scheme to design an efficient proposal for birth moves during the between-model step, which is based on normalizing flows.

Group Proposals
In section 2.1, the stretch-move proposal was introduced and discussed.One of the obvious advantages of such a scheme of proposing new MCMC samples is that it requires minimal tuning (Foreman-Mackey et al. 2013).However, it does not extend well in its simplest form to the generalized Reversible Jump MCMC.The stretch proposal is based on the idea that the ensemble of points (  ) is sitting on the same posterior mode as the current point (  ).In a nested model situation where both the model count and the individual model parameters change, each point may lie on a different posterior mode representing a different set of leaves (sources) in the data.This can be alleviated by applying the stretch move to individual leaves within each branch of each walker, but there is still an issue of identifying leaves in different walkers that lie on the same posterior mode.The stretch proposal will technically still work when mixing leaves in different posterior modes, but the acceptance fraction will be negatively affected.However, within the stretch proposal formalism, the choice of   is customizable.The key to maintaining detailed balance is that   cannot depend on   , and   cannot be updated in the same iterative step as   (Goodman & Weare 2010).
We leverage this property to design a new type of stretch move that can handle Reversible Jump setups while maintaining a small number of tuning parameters.We call this proposal the "group" proposal.The mathematics that govern the group proposal are equivalent to that of the original stretch proposal.The key difference is that the group proposal chooses   (see Section 2.1) from a stationary group that is fixed for many proposed updates.This is in contrast to the original stretch proposal that always uses the current set of points in the ensemble sampler to draw   .The stationary group is updated after a large number of sampler iterations and we make sure that detailed balance is maintained during the update.We update every leaf within every branch of every walker at each iteration and repeat many iterations between updates of the stationary group.
The appropriate stationary group varies from problem to problem.The goal is to set a group that resembles as best as possible the posterior modes of the current leaves and then draw from it strategically so that the drawn point is likely (but not guaranteed) to lie on the same mode as the leaf that is currently being updated,   .In the example of the LISA Galactic binaries analysis, we set our stationary group to the full set of leaves (binaries) across all walkers at a specific temperature of the sampler at the end of a given iteration.Then, at proposal time, we efficiently locate the ∼   points in the stationary group that are closest to   from based on their initial frequency parameter.We then draw   from this group.The hope is that some percentage of the   drawn points will lie on the posterior mode on which   sits.The exact percentage will vary depending on how close the posterior modes are to each other and how many model instances exist in the sampler that include this specific mode.For low SNR binaries, for example, a source may exist in some walkers and not others, making it harder to map its posterior mode with the current group of stationary points.
The performance of group proposals is highly situation-and/or model-dependent.With individual source posterior modes that are well separated and easy to define in terms of separation, the performance will approach the performance of the base stretch proposal in non-Reversible Jump MCMC because the stationary group will well represent the specific posterior mode on which   is located.As the parameter space becomes more crowded and/or separation (distance) metrics become harder to define, the performance of group proposals will worsen.

Learning from the data
The second improvement concerns the between-model step of the RJ-MCMC.As mentioned earlier, for the case of nested models, it is often convenient to draw "birth" candidates directly from the prior distribution of parameters of the given model.This practice can be quite ineffective in terms of acceptance rate.As an example we can again use the LISA data-set case.The UCBs are distributed within the Galactic disk, congregated mostly around its Center (Amaro-Seoane et al. 2022), therefore, adopting a proposal based on an uninformative uniform prior across the sky, would waste computational resources exploring a part of the parameter space with low probability mass.A proposed solution is to use an informative prior derived from the spatial distribution of binaries in the Galaxy (Littenberg et al. 2020).In our work here, we have chosen an alternative data-driven route, based on the actual residual data after a burn-in period of the RJ-MCMC, which we describe below.After a sufficient number of RJ-MCMC iterations, we can extract a subset of sources from nested models which are constantly present in almost all walkers of our cold chain.In other words, we can find and subtract the brightest sources from the data, and then allow for another burn-in period on the resulting residuals.This allows the sampler to explore the remaining parameter space more easily, thus providing a good initial estimate for the weak signals possibly buried in the noisy residual data.We can then use those samples to construct a proposal density which will help us search for good candidates for those weak signals, without excluding the rest of the parameter space.This can be accomplished by fitting the distribution to the residual data described above.The most efficient way to fit to the generic distribution, is to use an invertible transformation, such as a normalizing flow (for example Dinh et al. 2016;Durkan et al. 2019).The method works in the following way: we sample from the base distribution (which is usually chosen to be Normal N (z; 0, 1)) and transform samples to the desired distribution () by applying the change of variable equation Here function  (z) is a bĳection which we fit by optimizing the Kullback-Leibler divergence,  KL [ (  ())||N (z; 0, 1)], between a normal distribution and the inverse transform of the distribution that we want to estimate.After the fit has converged we can draw samples from the Normal distribution and transform these to samples from the distribution fitted to all residuals and use it as a proposal.We will cover this method in more detail in a separate paper which is being prepared.

Convergence Diagnostics
Many standard approaches for assessing convergence of MCMC chains can be applied to the RJMCMC chains generated by Eryn.
Trace plots, both for individual parameters within models, and the indices labelling different models, will show that the chains are mixing well and that the initial burn-in phase has finished.Repeating runs with different numbers of walkers, different starting positions, different numbers of parallel chains or different choices of temperatures and comparing results can be used to assess convergence.Similarly, posteriors produced from randomly selected subsets of chain points can be compared (using for example statistical tools such as the Jensen-Shannon divergence (Menéndez et al. 1997)).Additionally, Eryn computes and outputs the Potential Scale Reduction Factor (PSRF) for each tempered chain, which provides a quantitative diagnostic of the convergence of the result (Gelman & Rubin 1992;Brooks & Gelman 1998).For a single parameter, the PSRF index is given by where  the mean of the empirical variance within each chain, and  is the estimated variance of the all chains, assuming that the target distribution is Gaussian.The degrees of freedom are estimated by the method of moments as  ≈ 2 / var( ) (Brooks & Gelman 1998).
The PSRF index takes values  ≥ 1, with values close to unity indicating converged MCMC chains.In the end, the quantity that really matters is the PSRF value for the cold chain, but converged chains across for all temperatures is a good indicators for healthy mixing and efficient temperature swaps.

Implementation
In this section, we discuss the main implementation details of Eryn.We refer the interested reader, or user, to the Eryn documentation for more exhaustive information and examples (Documentation).
The goal of Eryn is to produce a sampler that can handle all (or most) cases of MCMC sampling ranging from basic, non-tempered, single-model type, single-model instance posterior estimation to the full reversible jump MCMC with tempering, multiple model types, and adjustable model counts, as well as everywhere in between.In the basic case, Eryn aims to be a close replica of emcee trying to maintain as much simplicity as possible.At the complicated end of the spectrum, Eryn attempts to provide a common interface and underlying infrastructure for the variety of problems that may arise, allowing the user to maintain usage of the majority of the code from project to project, focusing on changing only the specific parts of the code that are difficult to implement or require special treatment for each specific problem.Since Eryn is effectively an enhanced version of the emcee package, the overall structure of emcee is strongly maintained.Like in emcee, "State" objects move coordinate and likelihood information around the ensemble sampler, storing information in a similar back end object either in memory or HDF5 files.Additionally, the interface used for adding proposals has remained.The various enhancements discussed in this work, including tempering, reversible jump moves, multiply try MCMC, etc., are all implemented within the emcee-like structure.This involved two main changes.First, the State objects have been scaled to hold information necessary for reversible jump MCMC: temperature information, prior information, and efficient and concise containers for multiple types of models with an adjustable number of individual model instances.Second, the reversible-jump proposal has been added as a proposal base, similar to the use of the 'MH' or 'RedBlue' moves within emcee.Beyond these main enhancements, there are also a variety of smaller, but useful, additions to Eryn that help the user build a variety of analysis pipelines.These include stopping or convergence functions, functions to periodically update the sampler setup while running, objects to carry special information through the sampler, and aids for coordinate transformation.

Toy Examples
In this section, we present a series of working examples for Eryn.We begin with simple problems, such as searching for simple signals in noisy data, with the aim of demonstrating the performance of this toolbox in a dynamical parameter space.The impact of the different enhancements discussed in section 3 will be assessed and discussed.Finally, in section 4, we will apply this machinery to more realistic problems in Gravitational-Wave astronomy.

Searching for pulse signals in Gaussian noise
In this first example, we explore the capabilities of Eryn in a simplified application, commonly encountered in physical sciences.We perform a search for Gaussian pulses in a simulated 2D data-set, in the presence of Gaussian noise with variance   = 0.2.We generate 25 pulses randomly distributed on the  −  plane with all pulses contained within ,  ∈ [−10, 10] (see figure 1a), and amplitude uniformly drawn from U [0.7, 1.5].The amplitude   of each pulse, labelled by , is considered a free parameter to be estimated, in addition to the Cartesian coordinates of their centres.The pulses' width was kept fixed to   ×    , with   = 0.2, for the sake of simplicity.Thus, we are required to estimate   , the total number of pulses in the data, and also estimate the parameters for each individual signal : ì   = {   ,   ,   }.The noise variance   is estimated as part of the fit.The analysis of this problem is performed using the adaptive parallel tempering scheme of section 2.4 and the Reversible Jump MCMC proposals (section 2.6).The in-model proposals for each model component are Gaussians, with a diagonal covariance matrix Σ = 10 −4    .This proposal is not tuned during sampling.The priors for the parameters are quite wide, covering the entire range of the data, while the prior on the number of pulses  is set to  ∼ U [0, 50].With the above settings, we obtain the results summarized in figures 1 and 2. In figure 1b we plot the most probable number of Gaussian pulses present in the data, or in other words, the most probable model given this particular data-set.It is clear that for the given level of noise, it is straightforward to recover the true number of signals.The noise variance is also estimated accurately as   = 0.2 ± 2 × 10 −3 .In figure 1c we plot the posterior densities for the parameters of all pulses recovered, while we also mark the true injected values.Figure 1c shows the trans-dimensional MCMC chains "stacked" over all samples of both model order and model-parameters.As already mentioned, in this simplified scenario all signals have similar value for the amplitude, thus the almost uni-modal marginal on   .This illustrative example is useful as an introductory application to the more complicated case of detection in Gravitational Wave astronomy presented below, in section 4.
In figure 2, three diagnostic quantities for this run are shown.In the top panel, the evolution of temperatures is presented.Following the recipe of Vousden et al. (2016), we control the distances between each temperature chain based on their in-between swap acceptance rate, computed from eq. ( 21).The tuning term () is set to () =  0 /(( +  0 )), with the adaptation lag  0 = 10 4 and the adaptation time  = 10 2 .The middle panel shows the evolution of the swap acceptance rate per number of walkers between the chains running at different temperatures.After ∼ 10 5 sampler iterations, the system converges to an equilibrium, where the rate of swapping states reaches a single value across the temperature range.In the bottom panel we show the acceptance rate for the in-model step of the algorithm, for all temperatures.As expected, after temperature equilibrium at ∼ 10 5 samples, the acceptance rate converges to a (different) value for each temperature, which is higher for higher temperatures (smoother posterior surfaces are easier to explore).(Vousden et al. 2016), the temperature ladder is tuned according to eq. ( 21), and the chains start to converge after ∼ 10 4 iterations.Middle panel: The evolution of the swap acceptance rate  ,  described in eq. ( 20), per number of walkers   .For this run we have used   = 10 walkers.After 10 5 iterations, the swap acceptance rate converges to a single (different) value for every temperature chain.Bottom panel: The "in-model" acceptance rate per temperature chain, given by eq. ( 4).Finally, it is interesting to investigate how the sampled dimensionality of the problem varies at at different temperatures.In figure 3, we plot the posterior on the number of pulses at each temperature.As expected, higher temperature chains tend to favour lower model dimensionality and the  ∞ chain samples the prior on .This can be attributed to the choice of priors and "birth proposal" distributions for both the signal parameters and .The likelihood is down-weighted at higher temperatures, making it harder to overcome the Occam penalty from including extra parameters in the model.This means quieter sources are less likely to be added and the preferred models have fewer sources.

Modelling power spectra: Searching for the optimal number
of B-spline knots.
One of the most common problems in signal processing is the characterization of the spectra of the data.This is often done by adopting spectral models and fitting the spectra directly in the frequency domain.This methodology is used when the signal of interest has stochastic properties.Examples from GW astronomy, include the measurement of stochastic signals with astrophysical, or cosmological origin (Abbott et al. 2021c;Auclair et al. 2023).There are many examples of possible stochastic signals for LISA (Amaro-Seoane et al. 2017, 2012;Auclair et al. 2023;Amaro-Seoane et al. 2022).
Searching for signals with stochastic properties requires flexible spectral models, both for the observatory instrumental noise, and the measured stochastic signal.For these reasons, it is sometimes convenient to adopt a versatile model, such as one that is based on B-spline interpolation schemes.B-splines are a geometrical modeling tool, and have proven to be very useful for modelling or generating smoother representations of data.They are piece wise polynomial curves with a certain number of continuous derivatives, and can be parameterised in various ways (Piegl & Tiller 1996).For this application, we follow Baghi et al. (2023), and choose to work with cubic-spline interpolation, using the corresponding SciPy library (Virtanen et al. 2020).The procedure starts by selecting a number of control points, or knots, with a given position and amplitude, which the smooth polynomial curve crosses and at which there is a change in the first non-continuous derivative.One of the challenging problems using such methods, is to choose the optimal number of spline knots for fitting the data, without overfitting.This is a model selection problem that can be easily solved with dynamical algorithms such as the one presented here.
For our next example, we generate time-series data directly from a theoretical model PSD.The simulated data are represented with the solid gray line in figure 4a.We then use the machinery of Eryn to find the optimal dimensionality for the problem, together with the best-fit parameters for the knots.To ease the computational complexity, we compute the PSD of the time series using the methodology developed by Tröbs & Heinzel (2006).In more detail, we begin by choosing a new frequency grid, on which we compute the PSD using the optimal number of averaged segments for each given frequency.Following this method, we essentially split the time-series data at the maximum number of segments that the given choice of window and percentage of data overlap permits, which will allow us to estimate the PSD at each frequency bin with minimal variance.By carefully choosing the window function and distance between the data points, one can compute a power spectrum with minimal correlations between frequencies.The estimated spectrum,   , at each frequency,   , is then used in the likelihood function given in Eq. ( 35) below (the spectrum   is represented by the red data-points in figure 4).
For more details about this method of computing the PSD, we refer the reader to (Armano et al. 2018;Tröbs & Heinzel 2006) and for a similar application to the work of Baghi et al. (2023).Finally, we also keep two knots fixed at the edges of the spectra, allowing the sampler to estimate only their amplitude, while the rest of the knot parameters (and their number) are left to be estimated from the data.For the spline knot positions, {log  , }, and amplitudes, {log  , }, we adopt uniform priors that cover the complete parameter space.Here, the  index corresponds to the knot number for the given model order .We also use a ladder of 10 temperatures, with 10 walkers each, while maintaining the same settings for the adaptivity of the temperatures as in section 3.3.1.Each walker is initialized at a random point on the parameter space, after drawing the dimensionality  of the model from  ∼ U [1,20].We adopt a Gaussian likelihood, with its logarithm written as where   is the PSD data value for the given frequency   , as computed by the method presented in (Tröbs & Heinzel 2006;Armano et al. 2018), using   averaged segments.The N , ( ì   ) is the spline noise model of order  evaluated at   , that depends on a parameter set in which the log  0 and log  +1 parameters refer to the logarithm of the PSD amplitude of the two fixed knots at the "edges" of the spectrum.Those two parameters correspond to our zeroth model order ( = 0), thus they are always being explored by the walkers of Eryn.
The results are shown in figure 4. In particular, in figure 4b we show the histogram of the recovered number of knots for the particular data-set.It is clear that 8 spline knots are preferred, two of them being fixed at the edges of the spectrum, and the other six knots free to take any position in the given frequency range.In figure 4c we show the 2D sliced posteriors for the spline parameters, {log  , } and {log  , }.In this figure, we again stack all the MCMC samples across model orders.The true spectrum is indicated by the orange solid line.There is an interesting outcome of this toy investigation; while there is a preferable dimensionality of the model, there is a weak constraint on the actual positions of the knots.We find that the sampler is virtually "scanning" the PSD data, showing slightly higher preference for locations between −6 and −4 in log-frequency, where the spectrum follows a more complicated shape.Finally, in figure 4a, the data (gray solid line and red data points), is shown together with model evaluations drawn from the posterior samples (pink solid lines).

EXAMPLES FROM GRAVITATIONAL WAVE ASTRONOMY
In recent years, we have witnessed the beginning of Gravitational Wave Astronomy.Since the first detection (Abbott et al. 2016) dozens of waveform signatures have been measured with the current network of observatories.At the time of the writing of this paper, more than 90 events have been recorded (Abbott et al. 2021a), the vast majority of them are black hole (BH) binary mergers, with a few of them being binary neutron star (NS) and BH-NS mergers.At the same time, detector networks are being improved (The LIGO Scientific Collaboration 2019; Abbott et al. 2020) and there are plans to expand them with the addition of new observatories, such as the Einstein Telescope (Maggiore et al. 2020;Punturo et al. 2010) or Cosmic Explorer (Evans et al. 2021;Abbott et al. 2017).Those detectors will unlock the sky to larger redshifts , allowing access to a vast number of potential sources.In addition, space missions, such as LISA (Amaro-Seoane et al. 2017, 2012), are predicted to be signal-dominated observatories, with many types of sources populating their data streams.In fact, we expect that source confusion will be one of the primary challenges in future data analysis efforts in gravitational wave astronomy.In a typical data-set, we expect an unknown number of signals, originating from sources that generate waveforms with different characteristics.Those range from the stellar-mass BH binaries now frequently observed by ground-basd detectors, to the supermassive BH binaries, extreme mass ratio inspirals, ultra compact Galactic binaries (UCB), and stochastic GW signals from both astrophysical and possibly cosmological origin (Amaro-Seoane et al. 2017, 2012;Auclair et al. 2023).For this final example, we will focus on the LISA mission, and in particular on the case of discriminating UCB signals.

Application to LISA data and the Ultra Compact Galactic Binaries
LISA is going to measure GW signals in the mHz regime, accessing sources of all the aforementioned types.As already discussed, the most numerous of them are going to be the UCBs, which will be almost monochromatic in the LISA band.Out of the millions of sources, only ∼ O (10 4 ) will be individually resolvable, and the rest will generate a confusion signal.As a consequence, for the duration of the mission, we will need to disentangle tens of thousands of sources which will be overlapping in both time and frequency domains.This is no trivial task, but various different strategies have already been proposed for analyzing such challenging data-sets.For example, Gaussian Processes can be employed (Strub et al. 2022), or Swarm Optimization techniques (Zhang et al. 2021), or hybrid swarm-based algorithms (Bouffanais & Porter 2016).Pipelines based on MCMC methods have been tested extensively (Crowder & Cornish 2007;Littenberg 2011;Littenberg et al. 2020;Littenberg & Cornish 2023), and have been demonstrated to be able to tackle complex cases where signals are overlapping.
Here, we will focus on the same problem, employing Eryn to solve a down-scaled version of the UCB challenge.It is down-scaled because we focus only on a single narrow frequency band, containing several overlapping signals, in the presence of instrumental noise only3 .In addition, we focus solely on demonstrating the capabilities of Eryn on dynamic parameter estimation for UCB type sources and The simulated data (gray), generated from the theoretical model (dashed black line).The PSD computed on an equally-spaced logarithmic grid with the method of (Tröbs & Heinzel 2006;Armano et al. 2018), which is used for inference, is represented with the red data points.The pink solid lines represent models drawn randomly from the posterior chains.(b) The optimal B-spline knots estimated by the dynamical parameter estimation procedure.As shown from this histogram, the optimal interior knot count for this data converges to six, corresponding to eight total knots including the two edge knots In ideal conditions (equal noises across spacecrafts, and equal LISA arms), the noise in  and  is independent, while the  data stream can be used as a signal-insensitive null channel, useful for instrument noise calibration.Since we perform analysis on a noise-free injection, we will be neglecting the  channel altogether.We simulated the injection data for an observation time of T obs = 1 year.The optimal SNR for each injected source,  opt , is given in table 1.The  opt quantity refers to the SNR of each source in isolation, with respect only to the instrumental noise, and can be calculated as with  ∈ { ,  } the noise-orthogonal TDI channels of eq. ( 37), while the (•|•) notation represents the noise weighted inner product expressed for two time series  and  as The tilde represents the data in the Fourier frequency domain, and the asterisk indicates complex conjugate.The S (  ) is the one-sided PSD of the noise for a given TDI channel.Under our assumptions  ,  (  ) =  , (  ).
For our investigation we chose to analyze noiseless data (no noise realization), while in the likelihood we are using the PSD noise levels taken from the LISA design studies (LISA Science Study Team 2018).For the signals, we utilize the fast frequency-domain UCB waveform model of Cornish & Littenberg (2007).Then, the two polarizations where M is the chirp mass,  gw is the instantaneous gravitational wave frequency,   is the luminosity distance,  is the inclination of the binary orbit, and () is the gravitational wave phase.The phase  can be expressed as  =  0 + 2 ∫   gw ( ′ ) ′ , with  0 being an initial arbitrary phase shift.The LISA constellation is assumed to be rigid and with equal arms, while the spacecrafts are assumed to follow analytic Keplerian orbits (Cornish & Rubbo 2003).Under these assumptions, it is straightforward to compute its response to the almost monochromatic waveforms of eqs.(40) (Cornish & Rubbo 2003;Babak et al. 2021).For more details about the waveform model, the response of the instrument, and the orbits of the constellation, we refer the reader to Cornish & Littenberg (2007); Robson et al. (2018); Katz et al. (2022); Cornish & Rubbo (2003); Babak et al. (2021).
In our simplified scenario, each binary signal in the Solar System Barycenter is determined by a set of eight parameters.Those are the ì  = {A,  gw [mHz],  gw [Hz/s],  0 , cos , , , sin }, where A is the overall amplitude,  gw is the first derivative of the gravitationalwave frequency,  the polarization,  is the ecliptic longitude, and  the ecliptic latitude of the binary.The amplitude of the signal is calculated as which can be used to obtain a rough SNR estimate, via (Littenberg et al. 2020) with   (  gw ) being the instrumental noise power spectral density at frequency  gw , and  * = 1/(2), where  the LISA arm length.Given eq. ( 41) and ( 42), we find it convenient to directly sample on  instead of A, which also yields a more illustrative measure of the amplitude of each binary.Then, if  is the measured TDI data and h the given GW signal after applying the response of the instrument, the logarithm of the likelihood for an arbitrary number  of UCB signals can be written as where for the sake of convenience, we have defined h  =  h( ì   ).We use wide uniform priors for all the rest of the binary parameters, covering essentially the complete parameter space.The exception is again the amplitude (SNR), , where we adopt a prior which was first introduced in Cornish & Littenberg (2015) and then adapted in Littenberg et al. (2020).The prior density can be expressed as where  * is a given constant that specifies the peak of the above distribution.This distribution is designed to prevent the proposal of sources with very small SNR in the model, as it drops sharply for  → 0. Those weak sources do not significantly affect the likelihood, and so their inclusion must be penalised by the prior4 .This prior choice forces the sampler to explore only potential sources with non-zero SNR, avoiding populating the chains with numerous undetectable signals.This prior performs adequately in this problem, but there are other solutions one could adopt in order to keep control of the number of very weak sources.This discussion, which sets the grounds for a global-fit analysis pipeline for the LISA data (Littenberg et al. 2020), is out of the scope of this paper, but a more detailed description will be presented in a future work.
4.1.0.1 Search Phase: Before parameter estimation, we initiate a search phase of our analysis, with the aim of getting the walkers to a better starting point on the posterior surface.This phase consists of an iterative brute force procedure, based on drawing a very large number of proposals, then maximizing the likelihood over the initial phase  0 , and finally perform a rapid MCMC sampling over the parameter space, using only a one-source model (therefore there is no dynamical parameter spaces).In particular, we draw 5×10 6 points in the parameter space, and after phase maximization, we use them as starting points to a parallel tempered MCMC run with   = 10 temperatures, each running with   = 500 walkers.When this step concludes, we keep the 100 best samples in terms of likelihood value and use their corresponding parameter estimates as starting points for the parameter estimation portion of the analysis.We also use the maximum Likelihood solution to subtract the source found from the data.We then use the residual data to search for another source, and this process repeats until there is no signal found with SNR  > 5.
In between successive iterations of the single-binary search, we run another MCMC over all sources found so far in order to adjust the parameters to account for correlations and overlap between sources.
After convergence, we found eight sources in our data-set with an optimal SNR greater than 5.We take these found sources and add them to all walkers in the sampler at the beginning of the full MCMC run described below.[6,20].For the sake of convenience in this simple application, we keep the six loudest binaries found during the search phase as fixed.This means that we still sample their waveform parameters, but they are not allowed to be removed by the Reversible Jump process.We chose this setup in order to accelerate the convergence of the algorithm, being confident that these sources are part of the solution.In future work, this will be adjusted to deal with the much larger complexity of the full problem.
Concerning the sampler settings, we use the adaptive parallel tempering scheme of section 2.4, building a temperature ladder of   = 10 temperatures, with 100 walkers for each temperature.For this run, we have also utilized the Multiple Try Metropolis algorithm (see section 2.3) in order to improve the acceptance rate in the reversible jump proposal.We have also tried the Delayed-Rejection scheme which is implemented for Eryn, but we found that the Multiple-Try strategy yields more efficient sampling.Finally, we have utilized the basic stretch and group stretch proposals that were described in section 2.1.
After convergence, the result is shown in figures 5 and 6.In figure 6a, the sampled posterior on the number of sources  is presented.In this histogram, we have added the six fixed binaries to the actual number of signals being sampled via the Reversible Jump algorithm.It is fairly obvious that we have managed to confidently resolve eight out of the ten injected binary signals.This fact that we do not favour 10 sources can be explained partly by the low SNR of the signals (see table 1) and partly by confusion from source overlap (also shown in figure 5).Additionally, the result of figure 6a depends on the given observation duration.The greater the  obs , the better our ability to resolve the confused sources.Thus, in that case, we should expect more Reversible Jump iterations across the higher dimensional models.
On the right panel, in figure 6b, the ensemble 2D posterior slice is shown, for two selected parameters.We call it ensemble because we are again "stacking" all the chains for these two parameters for all sources for all model orders .We chose to show only the amplitude (the  parameter explained in eq. ( 42) above) and the dominant emission frequency  gw , which illustrates the number of sources resolved, and how they overlap in frequency.A corner plot for more parameters is shown in figure A.1 in Appendix A. We also show the true injection values, marked as crosses, on top of the 2D posterior.From this plot alone, one can see that the sampler is exploring efficiently the parameter space, converging to the true values of the resolvable binaries that were injected.

DISCUSSION
We have implemented Eryn, a Bayesian sampling package capable of performing efficient trans-dimensional inference, by employing different techniques that improve its acceptance rate.These techniques are the affine invariant sampling, the adaptive parallel tempering, the delayed rejection, and multiple try metropolis, in combination with the construction of informative proposal distributions for the parameters of the models.The structure of Eryn is based on the widely used software emcee (Foreman-Mackey et al. 2013), enhanced with the ability of performing Reversible Jumps Green (1995) between different model spaces.The sampler capabilities have been demonstrated with toy models that are commonly encountered in different data analysis problems.We have begun with an application to signal detection, and in particular to searching for simple signals in the form of Gaussian pulse signals in the presence of Gaussian noise (see section 3.3.1).In section 3.3.2,we applied our algorithm to a problem of modeling power spectra with arbitrary shapes in frequency domain.In such cases, it is convenient to define models based on B-splines, which are able to faithfully capture the shape of any spectral data.However, in order to avoid over-fitting situations, the optimal order of the model (i.e. the optimal number of spline knots), needs to be estimated from the data.This can be done either sequentially, by trying models of different dimensionality and then comparing their performance, or dynamically, by using trans-dimensional algorithms such as Eryn.
This class of problems is often encountered in cosmology (Planck Collaboration et al. 2020a,b), where the signal of interest is stochastic in nature, and sometimes the prior knowledge on its shape is very limited.As already discussed, this is especially true for future GW observatories, which open the possibility of detecting such signals from both astrophysical and cosmological origin (Amaro-Seoane et al. 2017;Auclair et al. 2023;Baghi et al. 2023).The different theoretical models produce spectra with distinct shapes, increasing the need for shape-agnostic spectral models, such as the B-spline used here.
Finally, in section 4.1, we demonstrated Eryn in a more complicated problem, that of the analysis of ultra compact binary signals measured by the future LISA detector.These objects are going to produce the majority of the signals in the LISA data, each emitting almost monochromatic radiation.Their vast number will generate a confusion foreground, while only a few thousand of them will be resolvable from the data.We employ our tools described in this work, together with a search phase that is based on iteratively running the sampler on "static models" (no trans-dimensional moves) with phase-maximized likelihoods.We do these runs on the residuals of each iteration, with the aim of extracting all bright sources.In order to achieve faster convergence of our parameter estimation run, we choose to keep the brightest sources found during search as fixed (minimum number of model order is  = 6), while the Reversible Jump algorithm is used to search for weaker signals in the data.This is purely a choice that allows quick convergence in this fully-controlled and simplified LISA data-set.
We perform this analysis for a mission duration of  obs = 1 year and only on a single narrow frequency band around 4 mHz, which contains a total of 10 binary signals.It is worth noting here, that the synthetic data were produced assuming idealized conditions.This means that we do not consider any data irregularities, such as data gaps and glitches and spectral lines, or any other contamination originating from the mixing of signals of different types (such as supermassive BH binaries).In the end, as shown in figure 6a, we manage to recover eight out of ten injected signals.This result makes sense given the relative strength of the injections, and their waveform overlap.Many of the injected sources have an optimal SNR in isolation which is rather low (see table 1), so these are more susceptible to deterioration when we account for signal overlap.
The above investigations demonstrate that the dynamical parameter estimation capabilities of Eryn are suitable for these types of problems.This feature is missing from already existing libraries such as bilby Ashton & Talbot (2021), which are used by the Gravitational Wave community.bilby offers a wide selection of tools which are necessary for the Bayesian analysis of Gravitational Wave data.These include implementations of likelihood and prior functions, instrument response models and spectral densities, waveform models, and a number of samplers to choose from (both MCMC and nested samplers).On the other hand, Eryn is based on parallel tempering MCMC enhanced with Reversible Jump, which allows Bayesian analyses for a wider set of problems, in which sampling of a dynamical parameter space is needed.
Eryn has already been used in several works that have been already published (Katz et al. 2022;Katz 2022;Baghi et al. 2023;Sasli et al. 2023), or are going to appear soon.The work presented in this paper is the initial part of our efforts toward implementing a data analysis pipeline for LISA data.This pipeline will be demonstrated on the LDC2 data-set (LISA Data Challenges Working Group 2022), which contains multiple types of signals overlapping in both time and frequency domains.That being said, Eryn is a generic and versatile sampler, which can be used in any investigation that requires Reversible Jump sampling, and to our knowledge is one of the very few statistical tools of this kind that is not specialized to a single type of analysis (see discussion in section 2.6).

Figure 1 .
Figure 1.Searching for 2D Gaussian pulses in the presence of Gaussian noise.Panel (a): the simulated data, which consists of injections of 25 pulses in Gaussian noise with   = 0.2.Panel (b): the distribution of the model order, obtained by exploring the dynamical parameter space with Eryn.The true value is marked with a dashed red line.For this toy investigation, the correct number of simulated components is recovered.Panel (c): the 2D posterior densities for the parameters of the  Gaussian peaks (see text for details).

Figure 2 .
Figure 2. Top panel:The evolution of the temperature chains running in parallel for the toy problem of searching for 2D Gaussian pulses in Gaussian noise.The different colors indicate the the initial temperature chain index.Following the parallel tempering recipe of(Vousden et al. 2016), the temperature ladder is tuned according to eq. (21), and the chains start to converge after ∼ 10 4 iterations.Middle panel: The evolution of the swap acceptance rate  ,  described in eq.(20), per number of walkers   .For this run we have used   = 10 walkers.After 10 5 iterations, the swap acceptance rate converges to a single (different) value for every temperature chain.Bottom panel: The "in-model" acceptance rate per temperature chain, given by eq.(4).

Figure 3 .
Figure 3. Posterior on the number of Gaussians, , at each temperature   , for the toy problem of section 3.3.1.The different colors indicate the the initial temperature chain index.Darker colors corresponds to colder chains and vice versa.

Figure 4 .
Figure 4. Results for power spectra modelling with a shape agnostic model.(a) The simulated data (gray), generated from the theoretical model (dashed black line).The PSD computed on an equally-spaced logarithmic grid with the method of(Tröbs & Heinzel 2006;Armano et al. 2018), which is used for inference, is represented with the red data points.The pink solid lines represent models drawn randomly from the posterior chains.(b) The optimal B-spline knots estimated by the dynamical parameter estimation procedure.As shown from this histogram, the optimal interior knot count for this data converges to six, corresponding to eight total knots including the two edge knots.(c) Posterior slice for the knot parameters, (log  , , log  , ), after stacking the MCMC chains across all model dimensions, .This illustrates where the model prefers to place spline knots, which clearly corresponds to where the spectral density is changing most rapidly.It is also evident from this figure that we essentially "scan" the true noise shape (pink solid line), by placing knots across the frequency range (see text for more details).

Figure 5 .
Figure 5. Top panel: The simulated data used for demonstrating the capabilities of Eryn in tackling a high-dimensional problem.A total of 10 Ultra Compact Binaries in the vicinity of our Galaxy emitting Gravitational Wave signals at the mHz range.Here we plot the power spectral density of the  data channel of LISA.The catalogue of sources is taken from the second LISA Data Challenge (LISA Data Challenges Working Group 2022).Each signal is represented by a different colour.Bottom panel: The same data-set, now comparing the injected signal against the solution yielded by Eryn (see text for more details).We have plotted the shaded area by sampling the joint posterior on model order  and the corresponding parameters.

Figure 6 .
Figure 6.Left panel: In this figure, we show the posterior on the number of UCB sources in the data.The true injected number is shown with the red dashed line.It is clear that, for the given measurement duration of the particular data set, we manage to confidently resolve eight binaries out of a total of ten.Right panel: Corner plot for two of the eight parameters characterising each UCB source.These are the amplitude, expressed as an SNR , and the dominant emission (or initial) frequency,  gw [mHz] (see text for more details).The violet crosses represent the injected parameter values.A corner plot for more parameters is shown in figure A.1 in the Appendix.
During this step, we perform hybrid MCMC sampling, where we both update the found sources (in-model) and dynamically search for new and weaker sources in the data employing Reversible Jump sampling.For the number of signals , we adopt a uniform prior  ∼ U Figure A.1.A triangle plot showing the 2D posterior slices for the application of section 4.1, but for a greater selection of parameters than figure 6b.The rest of the parameters, if plotted stacked in the same manner, result in surfaces that cannot be so easily interpreted, and therefore have been left out.The true injected values are marked with crosses.
The first stage or in-model step, uses the standard MH algorithm to update all the parameters ì   for the given model .The second stage or between-model step proposes to update the model state  to a new model state .Parameters ì   for the new model are also proposed.The newly proposed state , is accepted with a probability defined by

Table 1 .
(Tinto & Dhurandhar 2005;Prince et al. 2002;Baghi et al. 2021), ), after stacking the MCMC chains across all model dimensions, .This illustrates where the model prefers to place spline knots, which clearly corresponds to where the spectral density is changing most rapidly.It is also evident from this figure that we essentially "scan" the true noise shape (pink solid line), by placing knots across the frequency range (see text for more details).The optimal SNR  opt for each of the 10 injected sources, computed for the given duration of the mission (see eq. (42)).The dominant emission frequency  gw is also given for reference.noothertypes of signals are contained in the data (e.g., chirping signatures from supermassive BH binaries).At the same time, we have access to the level of instrumental noise, which is shown in both panels of figure 5. Searching for the UCB signals across the complete LISA band requires a more elaborate implementation of this simplified pipeline.This pipeline will be focusing on solving the complete second LISA Data Challenge (LISA Data Challenges Working Group 2022), and is going to be presented in future work.We choose to work on the frequency segment between 3.997 and 4 mHz, which contains 10 UCB objects, drawn directly from the LDC2 catalogue (LISA Data Challenges Working Group 2022).Those are shown in the top panel of figure 5 which shows the power spectrum of the  data channel of LISA.We use the two noise-orthogonal , , and  Time Delay Interferometry variables(Tinto & Dhurandhar 2005;Prince et al. 2002;Baghi et al. 2021), which are linear combinations of the LISA relative frequency TDI Michelson measurements ,  , and