Fast Bayesian gravitational wave parameter estimation using convolutional neural networks

The determination of the physical parameters of gravitational wave events is a fundamental pillar in the analysis of the signals observed by the current ground-based interferometers. Typically, this is done using Bayesian inference approaches which, albeit very accurate, are very computationally expensive. We propose a convolutional neural network approach to perform this task. The convolutional neural network is trained using simulated signals injected into Gaussian noise. We verify the correctness of the neural network's output distribution and compare its estimates with the posterior distributions obtained from traditional Bayesian inference methods for some real events. The results demonstrate the convolutional neural network's ability to produce posterior distributions that are compatible with the traditional methods. Moreover, it achieves a remarkable inference speed, reducing the runtime of Bayesian inference methods by orders of magnitude and enabling real-time analysis of gravitational wave signals. Despite the reduced accuracy of the estimated parameters, the neural network provides valuable initial indications of key parameters of the event, such as the sky location, facilitating a multi-messenger approach.


INTRODUCTION
Gravitational waves (GWs) are ripples in the fabric of spacetime produced by moving massive objects such as binary black holes or rotating neutron stars. These waves were not directly detected until the discovery by the LIGO and Virgo collaborations in 2015 (Abbott et al. 2016). Since then, a total of 90 events have been detected in the three publicly released catalogs (Abbott et al. 2019, 2020, 2021a), corresponding to the O1-O3 observing runs of Advanced LIGO (Aasi et al. 2015) and Advanced Virgo (Acernese et al. 2015). The fourth observing run started in Spring 2023 and has already detected tens of new events. The majority of them are binary black hole mergers, but binary neutron star and black hole-neutron star mergers have also been detected. The discovery of GWs has opened up a new window to study the universe, as they can be used, for example, to infer the population of merging compact objects (Abbott et al. 2023b), test the theory of General Relativity (Abbott et al. 2021b) or understand the composition of neutron stars (Abbott et al. 2022a,b,c).
In order to extract useful information from the detected GW signals, it is necessary to accurately estimate the parameters of the source that produced the observed signal. Typically, this is done within a Bayesian inference framework, in which a probability distribution is constructed over the parameter space based on the observed data and prior knowledge (Veitch et al. 2015; Singer & Price 2016; Abbott et al. 2019; Ashton et al. 2019; Romero-Shaw et al. 2020; Smith et al. 2020). Markov chain Monte Carlo (MCMC) or Nested Sampling (NS) algorithms are commonly used to explore the parameter space and sample the posterior probability distribution (Gilks 2005; Skilling 2006; Veitch et al. 2015; Romero-Shaw et al. 2020). These methods, albeit powerful and precise, can be computationally intensive and do not scale appropriately with the increasing number of events. Due to the large volume of detections expected in future runs, in particular in next-generation experiments, new tools and implementations are being studied (Berry et al. 2015; Pankow et al. 2015; Singer & Price 2016; Cuoco et al. 2020).
⋆ E-mail: mandres@ifae.es
The best performing machine learning algorithms applied to the parameter estimation problem are those described by Gabbard et al. (2022); Dax et al. (2023); Green et al. (2020); Green & Gair (2021). Gabbard et al. (2022) use a variational autoencoder to predict the posterior probability distribution typically produced by a Bayesian inference approach. Their results are almost identical to those obtained using a Bayesian inference technique, but generating 8,000 samples takes only 0.1 s, an improvement of several orders of magnitude. A different approach is followed by Dax et al. (2023), who estimate the Bayesian posterior using a neural network and then correct the distribution using importance sampling. This method takes between one and ten hours (depending on the waveform used) to perform the parameter estimation running on a GPU and multiple cores, a great improvement over the several hours to days that MCMC samplers typically take to explore the parameter space. Finally, Green et al. (2020); Green & Gair (2021) use normalizing flows to produce accurate posterior distributions, comparable to those produced by traditional methods.
In this paper, we present a new and fast method for GW parameter estimation using a convolutional neural network (CNN). This article is organized as follows. The training dataset and the preprocessing of the data are explained in Sec. 2. In Sec. 3 the CNN architecture is presented. The training procedure is described in Sec. 4. Finally, Sec. 5 shows the results obtained for the test set and for two real events, comparing them to the results obtained with the traditional MCMC approach.

DATASET AND PREPROCESSING
To generate the training data for the CNN we first take realizations of Gaussian noise that follow the power spectral density (PSD) of the strain measured during the O3b run in the three interferometers (Abbott et al. 2021a). Then, we simulate GW signals using the IMRPhenomD waveform (Husa et al. 2016; Khan et al. 2016) from the PyCBC library (Nitz et al. 2020; Usman et al. 2016) and inject them into this noise. The parameters of the injected signals are described in Tab. 1. The delays between the different interferometers and their antenna patterns are also taken into account.
The strain is then whitened and normalized by dividing it by its standard deviation. We choose a sampling frequency of 4096 Hz and include the data from LIGO Hanford, LIGO Livingston and Virgo simultaneously. The frequency is restricted to the (30 Hz, 1024 Hz) range, as the signals fit well within these values. The signal is then cropped to 1 s, containing the merger between 0.5 s and 0.9 s (the exact coalescence time is drawn from a uniform distribution). Therefore, the input matrix for the CNN has a shape of (4096, 3), each column corresponding to the data of one interferometer.
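The whitening and normalization step can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions (a one-sided PSD sampled on the rFFT frequency grid, a flat toy PSD), not our actual pipeline code; the function name `whiten` is purely illustrative.

```python
import numpy as np

def whiten(strain, psd, fs=4096):
    """Whiten a strain time series given a one-sided PSD sampled on the
    rFFT frequency grid, then normalize to unit standard deviation."""
    hf = np.fft.rfft(strain)
    # Divide by the ASD so every frequency bin carries equal noise power
    white_hf = hf / np.sqrt(psd * fs / 2.0)
    white = np.fft.irfft(white_hf, n=len(strain))
    return white / np.std(white)

# Toy example: 1 s of white Gaussian noise, whose one-sided PSD is flat (2/fs)
fs = 4096
rng = np.random.default_rng(0)
strain = rng.normal(size=fs)
psd = np.full(fs // 2 + 1, 2.0 / fs)
out = whiten(strain, psd, fs)
# Stacking the three detectors gives the (4096, 3) input matrix of the CNN
x = np.stack([out, out, out], axis=-1)
```

In the real pipeline each column would of course hold a different detector's whitened strain; the stack here only illustrates the input shape.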
The signal-to-noise ratio (SNR) is defined as
$$\mathrm{SNR}^2 = 4 \int_{f_{\min}}^{f_{\max}} \frac{|\tilde{h}(f)|^2}{S_n(f)} \, \mathrm{d}f,$$
where $\tilde{h}(f)$ is the strain in the frequency domain and $S_n(f)$ is the PSD of the strain noise. The network SNR is then computed as
$$\mathrm{SNR}_{\mathrm{net}} = \sqrt{\sum_i \mathrm{SNR}_i^2},$$
where $i$ runs over the three interferometers. We restrict ourselves to signals that fulfil $\mathrm{SNR}_{\mathrm{net}} \geq 10$. A total of 288,000 signals are used for the training stage, 32,000 for validation and 100 are kept for testing the performance afterwards.
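Under these definitions, the optimal SNR and the network SNR can be sketched as follows; the flat toy spectrum is purely illustrative and the frequency integral is approximated by a simple Riemann sum on a uniform grid.

```python
import numpy as np

def optimal_snr(hf, psd, freqs):
    """Optimal SNR: SNR^2 = 4 * integral of |h(f)|^2 / Sn(f) df,
    approximated as a Riemann sum on a uniform frequency grid."""
    df = freqs[1] - freqs[0]
    integrand = 4.0 * np.abs(hf) ** 2 / psd
    return np.sqrt(np.sum(integrand) * df)

def network_snr(snrs):
    """Network SNR: quadrature sum of the per-detector SNRs."""
    return np.sqrt(np.sum(np.square(snrs)))

# Toy example over the (30 Hz, 1024 Hz) band used in this work
freqs = np.linspace(30.0, 1024.0, 1000)
hf = np.full_like(freqs, 1e-2)   # illustrative flat signal spectrum
psd = np.ones_like(freqs)        # illustrative flat noise PSD
snr_net = network_snr([optimal_snr(hf, psd, freqs)] * 3)
keep = snr_net >= 10  # the training-set cut on the network SNR
```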
An improved efficiency has been observed when fitting some derived quantities rather than the original ones presented in Tab. 1. For instance, instead of directly estimating the individual masses of the black holes, we use the chirp mass,
$$\mathcal{M}_c = \frac{(m_1 m_2)^{3/5}}{(m_1 + m_2)^{1/5}},$$
alongside the mass ratio,
$$q = \frac{m_2}{m_1},$$
where $m_1 \geq m_2$.
Similarly, instead of predicting the individual spins we use the effective spin, defined as
$$\chi_{\mathrm{eff}} = \frac{m_1 \chi_1 + m_2 \chi_2}{m_1 + m_2}.$$
Furthermore, we choose not to fit the polarization and inclination angles, as they are not among the key parameters required for real-time analyses. As a result, the CNN will estimate the set of parameters $\boldsymbol{\theta} = \{\mathcal{M}_c, q, d, t_c, \chi_{\mathrm{eff}}, \alpha, \delta\}$.
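The derived quantities above are straightforward to compute from the parameters of Tab. 1; a minimal sketch:

```python
import numpy as np

def chirp_mass(m1, m2):
    """Chirp mass Mc = (m1 m2)^(3/5) / (m1 + m2)^(1/5)."""
    return (m1 * m2) ** (3.0 / 5.0) / (m1 + m2) ** (1.0 / 5.0)

def mass_ratio(m1, m2):
    """Mass ratio q = m2 / m1 with the convention m1 >= m2, so q <= 1."""
    return m2 / m1

def chi_eff(m1, m2, chi1, chi2):
    """Mass-weighted effective spin (m1 chi1 + m2 chi2) / (m1 + m2)."""
    return (m1 * chi1 + m2 * chi2) / (m1 + m2)
```

For an equal-mass binary the chirp mass reduces to $m \, 2^{-1/5}$, a handy sanity check of the implementation.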

CNN ARCHITECTURE
The CNN architecture used in this work is displayed in Fig. 1. The first set of layers are one-dimensional convolutions and one-dimensional max pooling layers, followed by a set of fully connected dense layers. Between each pair of dense layers a 20% dropout is applied to prevent overfitting during the training stage.
To capture the uncertainty in the final estimation of the parameters, we model the full distribution, as the posterior in Bayesian statistics would, instead of producing point estimates. Therefore, the last dense layer, which contains 14 neurons, is connected to a TensorFlow Probability layer with a Kumaraswamy distribution for each parameter.
The Kumaraswamy distribution has a probability density function defined by (Kumaraswamy 1980)
$$f(x; \alpha, \beta) = \alpha \beta \, x^{\alpha - 1} \left(1 - x^{\alpha}\right)^{\beta - 1}$$
for $x \in [0, 1]$, which is flexible enough for our purposes. It is very similar to the beta distribution, but the simple expression of its probability density function yields a closed-form quantile function,
$$F^{-1}(\xi; \alpha, \beta) = \left(1 - (1 - \xi)^{1/\beta}\right)^{1/\alpha}.$$
Sampling from this distribution is then trivial via inverse transform sampling: if we draw $\xi \sim \mathrm{Uniform}(0, 1)$ and compute $F^{-1}(\xi; \alpha, \beta)$ for each realization, we obtain the desired samples. This procedure is very fast compared to sampling from a distribution whose quantile function has no analytical expression or is computationally expensive to evaluate.
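Inverse transform sampling from the Kumaraswamy distribution can be sketched in a few lines; the parameter values and the sample count below are illustrative, not those of any fitted event.

```python
import numpy as np

def kumaraswamy_ppf(xi, a, b):
    """Closed-form quantile function F^{-1}(xi) = (1 - (1 - xi)^(1/b))^(1/a)."""
    return (1.0 - (1.0 - xi) ** (1.0 / b)) ** (1.0 / a)

def sample_kumaraswamy(a, b, n, rng):
    """Inverse transform sampling: push uniform draws through the quantile."""
    xi = rng.uniform(size=n)
    return kumaraswamy_ppf(xi, a, b)

samples = sample_kumaraswamy(2.0, 5.0, 10_000, np.random.default_rng(1))
```

Since every operation is a vectorized power, drawing tens of thousands of samples takes a negligible fraction of the quoted inference time.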
Since our variables are not in the range $[0, 1]$, the estimated parameters need to be transformed as
$$\theta' = \frac{\theta - \theta_{\min}}{\theta_{\max} - \theta_{\min}},$$
where $\theta \in \boldsymbol{\theta}$.
The values of $\alpha$ and $\beta$ for the different variables differ by orders of magnitude and, therefore, we estimate $\log \alpha$ and $\log \beta$ instead. This motivates taking the exponential of the 14 numbers generated by the last dense layer before evaluating the probability distribution.
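The mapping from the 14 outputs of the last dense layer to the positive $(\alpha, \beta)$ pairs can be sketched in numpy. The reshape convention (one $(\log\alpha, \log\beta)$ pair per parameter) is an illustrative assumption, not necessarily the exact ordering of our implementation.

```python
import numpy as np

def head_to_params(logits):
    """Map the 14 unconstrained outputs of the last dense layer to the
    positive (alpha, beta) pairs of the 7 Kumaraswamy distributions."""
    pairs = np.asarray(logits).reshape(7, 2)  # one (log a, log b) row per parameter
    ab = np.exp(pairs)                        # the exponential enforces a, b > 0
    return ab[:, 0], ab[:, 1]

# Zero logits map to alpha = beta = 1, i.e. a uniform distribution on [0, 1]
alphas, betas = head_to_params(np.zeros(14))
```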
The loss function used is the negative log-likelihood, i.e. $L = -\log \mathcal{L}(\alpha, \beta | X)$. For the Kumaraswamy distribution and $n$ observations, the log-likelihood equals
$$\log \mathcal{L}(\alpha, \beta | X) = n \log \alpha + n \log \beta + (\alpha - 1) \sum_{i=1}^{n} \log x_i + (\beta - 1) \sum_{i=1}^{n} \log\left(1 - x_i^{\alpha}\right).$$
Choosing this loss function is equivalent to maximizing the log-likelihood, meaning that the neural network is actually learning to find the parameters of the distribution that best fit the data.
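A direct numpy transcription of this loss, useful for checking the expression; the toy observations are illustrative.

```python
import numpy as np

def kumaraswamy_nll(x, a, b):
    """Negative log-likelihood of n observations x in (0, 1):
    -[n log a + n log b + (a-1) sum log x + (b-1) sum log(1 - x^a)]."""
    x = np.asarray(x)
    n = x.size
    loglik = (n * np.log(a) + n * np.log(b)
              + (a - 1.0) * np.sum(np.log(x))
              + (b - 1.0) * np.sum(np.log1p(-x ** a)))
    return -loglik

# For a = b = 1 the Kumaraswamy density is uniform, so the NLL vanishes
x = np.array([0.2, 0.35, 0.4, 0.55, 0.6])
```

In the network this expression is evaluated per batch, with $(\alpha, \beta)$ produced by the probabilistic head rather than fixed.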

TRAINING
To build and train the CNN we use Keras with the TensorFlow backend and its GPU implementation (Abadi et al. 2016). For the probabilistic layers we use distribution layers (Dillon et al. 2017) from TensorFlow Probability.
We feed the data in batches of 32 signals, as was done in Menéndez-Vázquez et al. (2021); Andrés-Carcasona et al. (2023). The learning rate is set to $10^{-5}$. The metrics tracked during this stage are the loss and the validation loss. The former is computed on the training set and represents the function being minimized, while the latter is computed on the validation set. This allows us to observe the appearance of overfitting or underfitting during the training procedure. The training lasts for 8 epochs and the epoch yielding the minimum validation loss provides the neural network used for inference.
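The epoch-selection rule (keep the weights of the epoch with the lowest validation loss) amounts to an argmin over the tracked history; the loss values below are made up for illustration and are not the ones shown in Fig. 2.

```python
import numpy as np

# Hypothetical validation-loss history over the 8 training epochs
val_loss = np.array([-0.20, -0.55, -0.80, -0.95, -1.05, -1.10, -1.12, -1.13])

# Keep the checkpoint of the epoch with the minimum validation loss (1-indexed)
best_epoch = int(np.argmin(val_loss)) + 1
```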
The evolution of the tracked metrics is displayed in Fig. 2. The loss, as expected, decreases steadily, indicating that the CNN is continuously learning to fit the data. This indicates a good behavior over the training set. The validation loss also decreases during training, indicating that the CNN is also learning to generalize to data it has not seen before.
The final loss and validation loss obtained for the chosen CNN, the one of the eighth epoch, are both −1.13. This CNN is put to the test with the held-out data set and two real events in the following section.
It is worth noting that the CNN's performance could be further improved by expanding the training set to include a wider range of signals and noise conditions. Moreover, further fine-tuning of the model's hyperparameters, such as adjusting the learning rate or the architecture, could potentially enhance its accuracy and further reduce the uncertainties in the estimated parameters. This is out of the scope of this paper, which is intended as a proof of concept.

RESULTS
In this final section, the main results of the CNN are presented. The first test that can be performed to verify the proper behavior is constructing the probability-probability (or P-P for short) plot. This plot is constructed by performing the inference of the CNN over the test set and evaluating the p-value of the true parameter. This is computed as
$$p = \int_0^{\theta'_{\mathrm{true}}} f(x; \alpha, \beta) \, \mathrm{d}x = F(\theta'_{\mathrm{true}}; \alpha, \beta) = 1 - \left(1 - (\theta'_{\mathrm{true}})^{\alpha}\right)^{\beta},$$
i.e. the CDF of the predicted distribution evaluated at the true (rescaled) parameter. If the distribution output by the CNN is a well-calibrated statistical distribution, the cumulative distribution of these p-values should follow the 45° line between the points (0, 0) and (1, 1). This plot is shown in Fig. 3. These cross-validation results demonstrate the robustness of our CNN approach, as it consistently produces reliable parameter estimates on a new subset of data.
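The P-P construction can be sketched as follows. As a self-consistency illustration the 'true' values are drawn from the model itself, so the p-values are uniform by construction; in the actual test they come from the injected parameters of the test set, and the $(\alpha, \beta)$ values here are illustrative.

```python
import numpy as np

def kumaraswamy_cdf(x, a, b):
    """CDF of the Kumaraswamy distribution: F(x) = 1 - (1 - x^a)^b."""
    return 1.0 - (1.0 - x ** a) ** b

rng = np.random.default_rng(2)
a, b = 2.0, 5.0

# Draw 'true' rescaled parameters from the model via the quantile function
xi = rng.uniform(size=5000)
truths = (1.0 - (1.0 - xi) ** (1.0 / b)) ** (1.0 / a)

# The p-value of each truth is its CDF value under the predicted distribution
pvals = kumaraswamy_cdf(truths, a, b)

# The empirical CDF of the p-values should hug the diagonal of the P-P plot
grid = np.linspace(0.0, 1.0, 11)
ecdf = np.array([(pvals <= g).mean() for g in grid])
```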
Since the CNN has been trained on simulated signals injected in Gaussian noise, it is interesting to test it on real data. The GWTC-3 catalog (Abbott et al. 2021a) contains the 35 confident detections of O3b. From this collection, events must be selected following certain criteria. Namely, the chosen events must have been detected during periods when all three interferometers were actively online and the data quality recorded across them was designated as optimal. Furthermore, compatibility with the CNN's training requires that the mass estimates derived with conventional methods lie within the range used to train the neural network. Finally, only events with an SNR exceeding 10 are considered viable candidates, as this cut was applied during training. Under these stringent criteria, the events GW200224_222234 and GW200129_065458 emerge as good choices for this validation, fulfilling the requirements while having different parameters.
For GW200129_065458, we apply the CNN to the data containing it and compare the result to the posterior distributions published in the GWTC-3 catalog by the LVK collaboration using traditional Bayesian inference methods (Abbott et al. 2021a, 2023a). The result is displayed in Fig. 4. To generate 10,000 samples of the posterior, the CNN, running on an NVIDIA GeForce RTX 2080 Ti GPU, takes 0.05 s. This implies an improvement of several orders of magnitude with respect to the traditional method.
Our CNN produces posterior distributions that are in all cases compatible with the published results. The parameter exhibiting the worst agreement for this particular event is the distance, which the CNN overestimates. The sky position has some overlap with the MCMC estimate, but has a larger uncertainty and a larger estimated declination.
The results for the event GW200224_222234 are displayed in Fig. 5. In this case, all the estimated parameters are also in accordance with those published in the GWTC-3 catalog. The parameter exhibiting the worst performance is, in this case, the effective spin. Regarding the sky position, the uncertainty yielded by the CNN is too large to accurately pinpoint the event, but the estimate is still unbiased and could produce a first alert for instruments that search for electromagnetic counterparts, allowing them to start pointing towards a given patch of the sky while more accurate pipelines refine the sky localization.
A typical short gamma-ray burst (GRB), such as the one observed alongside GW170817 (Abbott et al. 2017), lasts less than 2 s and, taking into account that the instruments may need time to point towards a given direction, the inference speed should ideally be a fraction of a second to avoid becoming the bottleneck. This is achieved by our neural network.
Overall, the results indicate a good response of the CNN. One important advantage of our approach is its computational efficiency. The CNN model provides thousands of posterior samples in a fraction of the time required by Bayesian inference methods. This capability enables real-time analysis of GW signals and facilitates prompt alerts for follow-up observations and multi-messenger astronomy collaborations.
Despite the high inference speed, the CNN approach introduces a trade-off in the form of higher parameter uncertainties. These can mainly be attributed to the simplified assumptions made by the CNN architecture and the limited training data used. However, even with these uncertainties, the CNN model can serve as an initial indication of the sky position and other key parameters, enabling a swift response and triggering more accurate analyses by slower, yet more precise, pipelines.
In summary, these results show that our CNN model has promising capabilities for fast parameter estimation of GWs. It provides a reliable posterior distribution, exhibits competitive performance compared to traditional Bayesian sampling methods, and offers real-time inference. This is only a proof of concept, and future research can focus on refining the CNN architecture, incorporating additional data sources, and exploring ensembling techniques to further improve the accuracy and robustness of gravitational wave parameter estimation.

CONCLUSIONS
We have presented a neural network to perform the typically computationally expensive task of estimating the parameters of gravitational wave events. The chosen architecture is a one-dimensional convolutional neural network with three channels (one per available interferometer) that predicts the parameters of a Kumaraswamy distribution for each estimated quantity. These parameters then define what can be regarded as the posterior distribution. The training shows that this architecture can correctly learn to estimate them and that the output distribution behaves like a probability density function. Finally, it has been applied to real events, and the results show that, although our CNN has a larger uncertainty than traditional Bayesian inference approaches, the actual parameters are generally in agreement with the output distribution. Our approach offers real-time inference speeds that could be used for possible multi-messenger follow-ups.

Figure 1. Architecture of the convolutional neural network used in this work. (a) Size of the input layer. (b) Filters and kernel size of the 1D convolution layers. (c) Pool size of the 1D max pooling layers. (d) Number of neurons of the fully connected dense layers. (e) Layer that includes the 7 Kumaraswamy distributions (one per estimated parameter). Before this layer, the exponential of the numbers output by the last fully connected dense layer is taken.

Figure 2. Metrics tracked during the training procedure.

Figure 3. P-P plot for the different variables fitted by the CNN. The individual masses are obtained by post-processing the chirp mass and mass ratio. The shaded regions indicate the ±1σ, ±2σ and ±3σ confidence levels in decreasing color intensity, respectively.

Figure 4. Full posterior distribution obtained with the CNN and from the public parameter estimation release obtained with an MCMC approach for the GW200129_065458 event.

Figure 5. Full posterior distribution obtained with the CNN and from the public parameter estimation release obtained with an MCMC approach for the GW200224_222234 event.

Table 1. Parameter space of the training set. We additionally assume that $m_1 \geq m_2$.