A Uniformly Derived Catalogue of Exoplanets from Radial Velocities

A new catalogue of extrasolar planets is presented by re-analysing a selection of published radial velocity data sets using EXOFIT (Balan&Lahav 2009). All objects are treated on an equal footing within a Bayesian framework, to give orbital parameters for 94 exoplanetary systems. Model selection (between one- and two-planet solutions) is then performed, using both a visual flagging method and a standard chi-square analysis, with agreement between the two methods for 99% of the systems considered. The catalogue is to be made available online, and this 'proof of concept' study may be maintained and extended in the future to incorporate all systems with publicly available radial velocity data, as well as transit and microlensing data.


INTRODUCTION
Since the discovery of the first extrasolar planet in 1995 (Mayor & Queloz 1995), the research on extrasolar planets has undergone exponential expansion. A wide range of search methods have been developed during this period, resulting in the discovery of more than 700 planets to date, the majority of which have been from the radial velocity method. Traditional data reduction methods use a periodogram (Lomb 1976;Scargle 1982) to fix the orbital period and then the Levenberg-Marquardt minimisation (Levenberg 1944;Marquardt 1963) to fit the other orbital parameters. A catalogue of exoplanets has already been published by Butler et al. (2006) using this method to extract the orbital parameters of exoplanets. Recently, Bayesian MCMC methods have been introduced by Gregory (2005); Ford (2005); Ford & Gregory (2007) as a replacement for the traditional data reduction pipeline. exofit (Balan & Lahav 2009) is a freely available tool for estimating orbital parameters of extrasolar planets from radial velocity data using a Bayesian framework. Here are analysed 94 previously published data sets using exofit, forming a new, uniformly derived catalogue of exoplanets from a Bayesian perspective.
Statistical properties of the distribution of orbital parameters are critical for explaining the planetary formation process. It has been argued that there is now a statistically significant number of known companions to make inferences about the correlation between orbital elements. Early discussions on this subject can be found in a series of articles on the statistical properties of exoplanets by Udry et al. (2003); Santos et al. (2003); Eggenberger et al. (2004) and Halbwachs et al. (2005). The statistical discussion in this article is informed by the comparison to the published catalogues at http://www.exoplanet.eu (Schneider et al. 2011) and http://exoplanets.org (Wright et al. 2011).
The rest of this article is structured as follows: in Sections 2 and 3 the Bayesian framework and exofit software package are introduced. The data analysis pipeline is described in Section 4, model selection is discussed in Section 5, and the catalogue is presented in Section 6, and in the tables at the end of the article. The statistical properties of the distributions of various orbital parameters are discussed in Section 7 and the results are summarised in Section 8.

BAYESIAN FRAMEWORK
The Bayesian framework provides a transparent way of making probabilistic inferences from data. It is based on Bayes' theorem, which states that for a given model H with a set of parameters Θ and data D, the posterior probability distribution of parameters Pr(Θ|D, H) is proportional to the prior probability distribution Pr(Θ|H) times the likelihood of data, Pr(D|Θ, H). Using standard mathematical notation one can write, Pr(Θ|D, H) = Pr(D|Θ, H) Pr(Θ|H) Pr(D|H) . (1) The denominator of the right hand side of the above equation is called the Bayesian Evidence. Since it is the estimation of parameters that is of interest here, this term can be considered as a normalising constant and Equation 1 can be written as, Pr(Θ|D, H) ∝ Pr(D|Θ, H) Pr(Θ|H). (2) The key step in the Bayesian approach is to obtain the posterior distribution of parameters accurately. The inferences are then derived from the posterior distribution. The Markov Chain Monte Carlo (MCMC) method is a widely employed technique for simulating the posterior distribution (the left hand side of Equation 2). The basic steps in Bayesian parameter estimation can be summarised as follows: (i) model the observed data, i.e. construct the likelihood function, (ii) choose the prior probability distributions of parameters, (iii) obtain the posterior probability distribution, (iv) make inferences based on the posterior probability distribution.
exofit is a software package that estimates the orbital parameters of extrasolar planets, following the steps outlined above. It should be noted that exofit does not perform any Bayesian model selection -for a discussion of the relation of this aspect of the Bayesian framework to this study, the reader is directed to Section 5.

EXOFIT
exofit (Balan & Lahav 2009) is a publicly available tool for extracting orbital parameters of exoplanets from radial velocity measurements. It uses the MCMC method to simulate the posterior probability distribution of the orbital parameters. The likelihood of data Pr (D|Θ, H) in Equation 2 connects the mathematical model to the observed data. The radial velocity model and the corresponding Gaussian likelihood function are given in Balan & Lahav (2009), where the choice of likelihood is based on Gregory (2005). The prior probabilities are as used by Ford & Gregory (2007). exofit then generates samples from the posterior distributions of the orbital parameters in the mathematical model, which can be analysed with the aid of any statistical software. Details of the algorithmic structure of the code, including methods of controlling chain mixing and assessing convergence, can be found in Balan & Lahav (2009). For further information, the reader is directed to the exofit user's guide.

DATA ANALYSIS
As of 2009 August 21, when the data sets were extracted from the literature (Butler et al. (2006), and references therein), there were 295 planetary systems detected using the radial velocity method according to http://www.exoplanet.eu, with 346 individual planets in total. Some radial velocity data sets from the literature have fewer than ten entries and as such are not appropriate for use with Bayesian inference methods, as a radial velocity data set will need to include at least half an orbital period of a potential planetary companion. Hence data sets with not enough measurements to give accurate orbital solutions were not included. Also, the radial velocity data of any systems with more than 2 confirmed planets were ignored since at present, exofit can only search for either one or two planets.
Many more and different radial velocity data sets and stellar mass estimates are now available (though some not publicly), but for the sake of uniformity the original radial velocity data were used (i.e. those publicly available, frozen as of 2009 August 21 when the original data were collected). At a later stage the results can be improved by updating the original data sets to those which are now available, as well as incorporating the data from the many hundreds of additional planets that have been detected since the start of this study.
To enable accurate calculation of the derivable orbital parameters, the masses as well as the radial velocities of the associated stars were needed. These values were all taken from the published literature at http://www.exoplanet.eu, frozen as of 2011 March 01. The input for exofit is in the form of a simple text file with radial velocity, uncertainty, and the time of observation (in Julian Date format), where the radial velocity values must be in ms −1 . The Julian Date of the observation is offset to zero within exofit.
The publicly available statistical data analysis package, R, from the R Project for Statistical Computing (http://www.r-project.org), was used to analyse the output of exofit. The output from R includes the mean, median and standard deviation of the orbital parameters extracted from the posterior distribution samples produced by exofit.
The modal values are also produced, but will only have significance in the event of the posterior having more than one peak. Posterior distribution plots can also be produced with R, and the marginal distributions of each parameter can be found by plotting a histogram of the samples from the posterior. The full posterior distribution is helpful in analysing correlations between various parameters. Even though parameter degeneracy is present in the orbital solutions, highly degenerate solutions are less common.
The calculation time required for exofit depends on computational resources available to the user. It scales linearly with the number of radial velocity entries input to the code, and also depends significantly on the ease with which exofit can converge the data. If exofit is presented with data from a non-converging posterior distribution, it will take much longer than a larger data set with convergent orbital solutions. In technical terms, the mean average calculation time using exofit on 26 radial velocity data sets ranging from 10 to 50 data entries for a 1planet search was 44 seconds per data entry. For a 2-planet search using 30 data sets with between 11 and 256 entries, the calculation time increased to 3 minutes and 40 seconds per data entry. These times were achieved on a 2.80 GHz dual core linux system. Multiple runs were performed in order to confirm the orbital solutions for each system -these analyses were carried out using the UCL Legion High Performance Computing Facility, details of which can be found at http://www.ucl.ac.uk/isd/common/researchcomputing/services/legion-upgrade.

MODEL SELECTION
One of the most challenging aspects of the statistical inference procedure is the model selection problem. For the analysis of the radial velocity data the question of model selection refers to the selection of the correct number of planets to fit the observed data. Ford & Gregory (2007); Gregory (2007) employed thermodynamic integration for calculating the Evidence and selecting the optimal number of planets that fit the data. On the other hand, Feroz et al. (2011b,a) approached the situation as an object detection problem.
One of the most commonly employed model selection procedures makes use of the chi-square statistic. This is one of the most prominent methods for estimating the goodness of fit and it has been applied to many astronomical problems including the analysis of radial velocity data (see e.g. Butler et al. (2006)). Bayesian inference also offers a straightforward way of performing statistical model selection, based on Equation 1 and the Evidence. Even though this approach is conceptually simple, its implementation is in general computationally expensive, and exofit does not currently have the functionality to perform such Bayesian model selection.
Hence, in the present analysis, we make use of the traditional chi-square statistic as well as a visual flagging approach, discussing the relationship between the two in Section 8. We limit ourselves to 1 or 2 planets, as per the current capabilities of the code, but this may of course be extended in later studies. The rationale behind the visual flagging is that one can identify the poor fits to the data by comparing the predicted radial velocity curves for the 1-planet and the 2-planet solutions. The method involves assigning a 'visual quality flag' by eye to each system, where '1' signifies that the 1-planet fit is best, '2' means that the 2-planet fit is best, and '3' means that both 1-and 2-planet solutions provide equally good (or equally bad) fits. The results of this classification are shown in Table 6, next to the number of planets currently confirmed to exist in that system, taken by comparing the values on both http://www.exoplanet.eu and http://exoplanets.org (as these catalogues do not agree with each other in some cases). Table 6 also shows the log likelihood ratio of the reduced chi-square value of the 1-planet fit to that of the 2-planet fit, where we define the log likelihood ratio, Hence a value of R > 0 indicates that the 1-planet fit was best (had a smaller reduced chi-square value), and R < 0 indicates that the 2-planet model provided the best fit to the data. For all bar one system (HD8574), every '1' and '2' quality flag assigned to the fits by eye was in agreement with the calculated value of R, endorsing our method of assignation by visual inspection (see Figure 5). For only a few systems there were not sufficient degrees of freedom to calculate a value for R (due to e.g. only having 11 datapoints for the 12 parameters), denoted by '-' in the table.

CATALOGUE OF EXTRASOLAR PLANETS
In this paper the catalogue of extrasolar planets generated using exofit is presented in Tables 4 and 5. These contain the best estimates of the orbital parameters for both 1-and 2-planet fits for all systems analysed. The orbital parameters used to fit to the model were the systematic velocity offset of the data V , the orbital period of the planet T , the radial velocity semi-amplitude K, the orbital eccentricity e, the argument of periastron ω, and χ, parameterising the fraction of the orbit at the start time of the data along which the planet has travelled from the point of periastron passage. The final parameter, s, is a measure of all extra signal in the data after the planetary fits have been accounted for, and hence a high value could indicate the presence of an additional planet, or noisy data due to stellar activity, or the combined noise from all sources. The reader is referred to Balan & Lahav (2009) for a more complete description of this parameter, which is not considered in any more detail in this study. It should also be noted that the orbital parameter χ is not related to the statistical measure χ 2 used for determining the log likelihood ratio in Equation 3.
The direct output values from exofit shown in the tables are the medians of the parameter posterior distributions, and the associated 68.3% confidence regions. The other displayed derivable parameters of the systems (mass and semi-major axis) were calculated by transforming the orbital parameter posteriors using the standard relations, and, assuming mp m * . The final values for these derivable quantities were again taken to be the medians with 68.3% confidence regions, and are also displayed in the tables.

Choice of priors
The prior distributions and ranges used for the initial analysis were as shown in Tables 1 and 2. The prior for the systematic velocity was dependent on the system -the mean of the input RV data was calculated and used as the initial value, and the allowed range was 10 kms −1 symmetrically about this. For some systems different sets of prior boundaries were used in a second round of analysis -these stars and the prior boundaries applied are listed in Table 3. Systems which did not return good fits using the normal prior boundaries were re-run with these 'tight' priors, where the period of the planet was constrained to be within a range given by, where T pub is the published value of the period and σ pub is the published error on the period, both taken from http://www.exoplanet.eu on 2011 August 01. Further constraints may also be applied, for example, systems with near zero eccentricities require tight priors on the orbital parameter χ in order to avoid multimodal distributions (see Section 8), whilst systems with eccentricities close to unity need tight priors on the orbital period T in order to achieve convergence of the MCMC chains. Examples of the output of exofit are shown in Figures 1 and 2. Table 1. The assumed orbital parameter prior distributions and their boundaries for a 1-planet model. The min and max values for the systematic velocity parameter were the mean value of the raw radial velocities for that data file minus and plus 5000ms −1 respectively.

Parameter
Prior Mathematical form Min Max

Ambiguous systems
For some of the systems analysed there is a clear trend in the radial velocities indicating the possibility of a second planet, but the data are not informative enough to properly constrain the orbital parameters of such an object. Plotting the resulting radial velocity curve and judging by eye can help to assess and distinguish between the 1-and 2-planet fits and evaluate the validity of the orbital solution, though such poorly-constrained orbits will lead to large error bars on the estimates of the orbital parameters. This did not always even lead to a clear classification Table 3. Radial velocity data sets analysed with different period prior boundaries, for reasons explained in Section 6.2. The initial value is set to the published value of the period, the maximum value is the initial value plus twice the published error, and the minimum is the initial value minus twice the published error. This approach was generally necessary for those systems (e.g. WASP and XO data sets) where the number of datapoints available at the time of selecting the data was low, thus requiring tighter priors to adequately constrain the solution. though, and many systems were then re-run with tight priors on the period, given in Table 3. Some of these ambiguities were caused by data which were poor, or less accurate due to age, or too noisy due to stellar jitter. Others were simply due to the correlation between ω and χ, or data not good enough to constrain these two parameters. This resulted in near-uniform posteriors for ω and χ, and hence fits that match few of the measured datapoints as a result of being shifted in time. Estimates for masses and semi-major axes derived from these results are still valid however (providing reliable estimates for T , K and e are obtained, which was generally the case), as these values have no dependence on mean anomaly at epoch and the time evolution of the Keplerian orbit. The class 3 (both 1-and 2-planet fits equally good, or equally bad) systems, as introduced in Section 5, are those which were considered to be somewhat ambiguous even after being analysed with tighter priors. This category was subdivided further -in some cases these are distinct radial velocity solutions which provide plausible fits for both 1and 2-planet models, and are classified as '3a'. However, there are also systems where the 'second planet' fit just produces small-amplitude variations on the 1-planet solution (see Figure 3 for an example), or where the 1-and 2-planet fits are identical but the 'second planet' posteriors are peaks at very small values for T and K, and uniform for e, ω and χ (i.e. there is no single solution for a second planet from these data). From this an 'Occam's Razor' approach could be taken and the assumption made that the correct model for most of these '3b' class systems is in fact the single planet one. In a few cases though there may truly be a second planet present, and the data used are simply not good enough to change the likelihoods of the parameters from the initial 'no-knowledge' (uniform prior) situation. So for all class 3 systems, better (or at least more up-to-date) data and more complete analyses (such as using the log likelihood ratio in more detail to narrow down the classification) are required to accurately determine the correct orbital solution. . The resulting radial velocity curve fits using the derived orbital parameters for HD89307. Imposing a 2-planet model on data with only one planet can have the result shown here, where the 2-planet fit is the same as the 1-planet fit with a superimposed artificial small-amplitude periodic variation.   Figure 4 shows values for specified orbital elements from the literature against values yielded using exofit for each system. The mass, semi-major axis and period values all exhibit good correlations in general between the independently derived values and those in the published literature -this is unsurprising for the period as it has not been derived from other quantities. The eccentricity however shows a greater spread than expected -whereas our uniformly derived catalogue is consistent in taking the median, the published values use, in general, many different statistical measures to extract the final parameter values from varying analysis techniques. This highlights the value of using a consistent technique to build up reliable databases of orbital parameters. As mass and semi-major axis values are themselves derived from period and eccentricity, any inaccuracies in algorithms used will propagate, and also present discrepancies in the values yielded using exofit and are likely to amplify outliers in these plots. These outliers will be investigated in the future in order to assess the validity of the solutions.

COMPARISON OF ORBITAL PARAMETERS
There are some discrepancies in the global distribution of parameter values between this catalogue and the published literature, especially for the eccentricity parameter. This may be partly due to poor or out-dated data, and is almost certainly affected by the ubiquitous effects of certain parameter correlations (as explained in Section 8). These should be analysed in more detail in the future, and techniques developed to explore the parameter space more efficiently and minimize or eradicate such dependencies.

DISCUSSION
The primary objective of this article is to analyse radial velocity data sets uniformly, using a single platform for the data analysis. Butler et al. (2006) produced a catalogue of extrasolar planets using traditional methods (using periodograms and Levenberg-Marquardt minimisation). We have analysed a selection of radial velocity data sets using a Bayesian parameter estimation program for extrasolar planets. However, a model selection criteria is required for completion of the statistical inference process, and for this purpose, as described in Section 5, a chi-square statistic was employed as well as a visual flagging technique. Inconclusive results are obtained for a few data sets, but from those analysed here it can be seen that both model selection methods perform well, agreeing in 99% of cases, as demonstrated in Figure 5.
Investigating further, we find that the chi-square values depend on the point estimates of the orbital parameters used to construct the predicted radial velocity curve. If the posterior distribution is unimodal such an approach will work flawlessly. However, posterior distributions of the orbital parameters exhibit multi-modality on many occasions. For example the parameters ω and χ are extremely correlated and their posterior distributions are bimodal for many data sets (an example of this is shown in Figure 6), especially for planets with e ≈ 0. This problem has also been noted by Gregory (2007), who proposed re-parameterisation of the problem as a possible way of dealing with this situation.
Many data sets contain planetary signals whose period is greater than the span of the observations, and so obtaining constraints on the orbital parameters of these objects is an extremely difficult task. There are several data sets where it was possible to obtain estimates for the orbital parameters for one of the planets, but then the second signal could not be constrained due to weak signal-to-noise. In most cases these signals appear to be a linear or quadratic trend in the radial velocity data. Therefore, it becomes extremely difficult to classify these objects as planets, and this is one of the reasons a visual flagging method was employed. One example of this is shown in Figure 6, the results of the 2planet fit to the data of the system HD190228. The strongest  signal is picked up and well-constrained, as can be seen from the error bars in Table 5, and in addition the values for this planet match well those from the fit for a single planet. Thus we can be reasonably sure of the parameters of the first planet, but those of the second, shown as HD190228b in Table 5, are significantly less secure. Additionally, sharp prior boundaries were used on the orbital period for several data sets. In these cases we have found that either the planetary signal is very weak or the signal from a systematic trend from an additional companion in the radial velocities masks the weaker planetary signal.
We have also found that the aliasing effects (see e.g. Dawson & Fabrycky (2010)) in observations can produce additional peaks in the posterior distributions, necessitating the use of the sharp prior on the period.
In summary, a brief overview of the Bayesian theory has been given here, along with a description of the MCMC approach used in order to estimate the orbital parameters of extrasolar planets, more details of which can be found in Balan & Lahav (2009). A new catalogue of extrasolar planets is presented from the re-analysis of published radial velocity data sets, giving both 1-and 2-planet orbital solutions for 94 systems derived on a uniform basis. An attempt is made to distinguish between the solutions for each system by using both a visual categorisation method and a standard reduced chi-square technique, giving good agreement in 99% of the cases presented here. Improvements in this 'model selection' area of the analysis may be made by taking into account Bayesian Evidence, as seen in Gregory (2007) and Feroz et al. (2011b,a); more rigorous approaches such as these are outside the scope of this 'proof of concept' study, but may be looked into in the future. Other further work will include updating this catalogue to incorporate the most up-to-date data, as well as extending exofit to be able to use transit and microlensing results, to search for an arbitrary number of planets, and to look into the possibility of accounting for interactions between planetary bodies.

ACKNOWLEDGMENTS
The authors would like to thank Paul Gorman, Robert Michaelides, Lisa Menahem, Andrew Strang, Robert C. Clouth, and Milroy Travasso for their help in analysing planetary data sets. MH is supported by an Impact/Perren studentship, STB acknowledges support from the Isaac Newton Studentship and a studentship from the Astrophysics Group, Cavendish Laboratory, and OL acknowledges a Royal Society Wolfson Research Merit Award. The authors acknowledge the use of the UCL Legion High Performance Computing Facility, and associated support services, in the completion of this work. This research has made use of the Exoplanet Orbit Database and the Exoplanet Data Explorer at http://exoplanets.org.
The catalogue will be made available for public viewing at http://www.ucl.ac.uk/exoplanets/exocat. Table 4. Table of the orbital parameters for a 1-planet fit, both directly output from exofit and thence derived. The values of the parameters T , K, e and s (generated from exofit) are the medians of the parameter posterior distributions, with the associated 68.3% confidence regions. The other parameters were calculated using these values and stellar masses taken from the published literature. Note that some parameters are extremely well-constrained, hence the errors on the parameter estimates are so small as to appear to be zero to the two decimal places shown in this table. A full table in machine-readable format will be provided on the website, and the reader is directed there if such data are required.  Table 5. Table of the orbital parameters for a 2-planet fit, both directly output from exofit and thence derived. The values of the parameters T , K, e and s (generated from exofit) are the medians of the parameter posterior distributions, with the associated 68.3% confidence regions. The other parameters were calculated using these values and stellar masses taken from the published literature. Note that some parameters are extremely well-constrained, hence the errors on the parameter estimates are so small as to appear to be zero to the two decimal places shown in this table. A full table in machine-readable format will be provided on the website, and the reader is directed there if such data are required.  Table 6. The number of published planets compared with the best-fit model from this analysis (i.e. the flags and reduced chi-square ratios for the results for each system). The 'candidates' column shows the current number of confirmed planets (from http://www.exoplanet.eu and http://exoplanets.org, as of 2011 August 01), and the 'visual quality flag' (assigned by eye) is the best exofit model, where '1' signifies that the 1-planet fit is best, and '2' means that the 2-planet fit is best. '3' means that both 1-and 2-planet solutions provide equally good or bad fits, and this class is again subdivided into '3a' and '3b', as explained in Section 6.2. Also shown is the log-likelihood ratio of the chi-square values, R, as defined in Section 5. Our visual flag assignments are validated somewhat by noting that in 99% of systems the visual flag and chi-square results agree (or at least are not contradictory, for the class 3 cases). Those systems denoted by '-' are those where there were not sufficient degrees of freedom to calculate a value for the log likelihood ratio. The prior flag is also shown, where flag N indicates that the analysis was performed using the normal priors shown in Tables 1 and 2, and flag D indicates an analysis with different priors as shown in Table 3.