A Computationally Efficient Approach for Calculating Galaxy Two-Point Correlations

We have developed a modification to the calculation of the two-point correlation function commonly used in the analysis of large-scale structure in cosmology. An estimator of the two-point correlation function is constructed by contrasting the observed distribution of galaxies with that of a uniformly populated random catalog. The assumption that the distribution of random galaxies in redshift is independent of angular position allows us to replace pairwise combinatorics with fast integration over probability maps. The new method significantly reduces the computation time while simultaneously increasing the precision of the calculation. It also allows cosmological parameters to be introduced only at the last and least computationally expensive stage, which is helpful when exploring various choices for these parameters.


INTRODUCTION
The two-point spatial correlation function (2pcf) is a key tool used in astrophysics for the study of large-scale structure. Given an object such as a galaxy, the probability of finding another galaxy within a volume element δV at a distance s from the first galaxy is δP = n(1 + ξ(s))δV, where n is the mean number density and ξ(s) is the two-point correlation function characterizing the deviation from a uniform distribution of galaxy separations (Peebles 1973). The correlation function ξ(s) and its Fourier transform, the galaxy power spectrum P(k), have been used to describe the distribution of matter in galaxy surveys (Yu & Peebles 1969; Peebles & Hauser 1974; Groth & Peebles 1977; Feldman et al. 1994). During the past decade, ξ(s) has become a popular tool for the reconstruction of the clustering signal known as baryon acoustic oscillations, or BAO (Eisenstein et al. 2005). The BAO signal is the signature of the density differences that arose in the early universe before the thermal decoupling of photons and baryons (Sunyaev & Zeldovich 1970; Peebles & Yu 1970). It is detectable today as a characteristic peak in the galaxy spatial correlation function at roughly s = 110 h⁻¹ Mpc, where h defines the Hubble parameter H0 = 100h km s⁻¹ Mpc⁻¹.

E-mail: regina@pas.rochester.edu; † scheong@u.rochester.edu; ‡ sybenzvi@pas.rochester.edu; § otto.heinz.hindrichs@cern.ch

Given a spectroscopic survey containing the 3D coordinates of each galaxy, there exist several possible ways to estimate ξ(s). The most popular estimator is due to Landy & Szalay (1993); it is constructed by combining pairs of galaxies from a catalog of observed objects D ("data") and a randomly generated catalog R with galaxies distributed uniformly over the fiducial volume of the survey but with the same selection function as the data.
The Landy-Szalay (LS) estimator is

ξ̂(s) = [DD(s) − 2 DR(s) + RR(s)] / RR(s),    (2)

where DD, RR, and DR are the normalized distributions of the pairwise combinations of galaxies from the data and random catalogs (plus cross terms) at a given distance s from each other. There exist many 2pcf estimators other than the LS estimator, and they have different advantages and limitations. However, all commonly known estimators are functions of the DR and/or RR distributions (see Hamilton (1993); Kerscher et al. (2000); Vargas-Magana et al. (2013)) and therefore depend on the random catalog R. To limit statistical fluctuations in ξ̂(s), it is typical to generate random catalogs with one to two orders of magnitude more galaxies than the survey under investigation. Unfortunately, the "brute-force" computation of ξ̂(s), in which all possible pair combinations are counted, is an O(N²) calculation due to RR, where N is the number of galaxies in the random catalog R; this results in a trade-off between statistical uncertainties and computation time. This trade-off has consequences beyond reducing uncertainties. For example, researchers often simulate a variety of cosmological parameters when studying the BAO, but the computational overhead required to calculate ξ̂(s) may limit the number and type of possible analyses that can be carried out. The overhead also increases the effort needed to compute the covariance of the 2pcf.
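In outline, combining already-normalized pair-count histograms into the LS estimator of eq. 2 is a one-line array operation. The following sketch (Python with NumPy, not the paper's implementation; the function and array names are ours) also guards against empty RR bins:

```python
import numpy as np

def landy_szalay(dd, dr, rr):
    """Landy-Szalay estimator from normalized pair-count
    histograms DD(s), DR(s), RR(s) binned in separation s."""
    dd, dr, rr = (np.asarray(a, dtype=float) for a in (dd, dr, rr))
    # Guard against empty RR bins to avoid division by zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        xi = np.where(rr > 0, (dd - 2.0 * dr + rr) / rr, 0.0)
    return xi

# When DD, DR and RR coincide (statistically uniform "data"),
# the estimator vanishes in every bin.
uniform = np.array([0.1, 0.2, 0.4, 0.3])
print(landy_szalay(uniform, uniform, uniform))  # -> [0. 0. 0. 0.]
```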
In this paper, we suggest a method which substantially reduces the time needed to compute RR and DR and hence is applicable to any estimator ξ̂(s). The method is based on the assumption that the probability for a galaxy to be observed at a particular location in the random catalog can be factorized into separate angular and redshift components (Ross et al. 2012; Anderson et al. 2013). This assumption allows us to replace pairwise combinatorics in the calculations of RR and DR with fast integration over probability maps. We find that the estimation of ξ̂(s) speeds up by factors > 100 with respect to the brute-force calculation. In practical terms, this means an estimate of ξ̂(s) which is typically carried out on large computing clusters can be performed on a modern notebook computer.
The paper is structured as follows. In Section 2 we describe the mathematical justification of the calculation for random-random (RR) and data-random (DR) galaxy pairs. In Section 3 we describe in detail the resulting algorithm used to compute ξ̂(s). The performance of the algorithm is studied in Section 4 using mock catalogs and spectroscopic data from the CMASS portion of the Sloan Digital Sky Survey (SDSS) DR9 catalog (Anderson et al. 2013; Ross et al. 2012; Sanchez et al. 2012; Percival et al. 2014). We then conclude in Section 5.

Random-random distribution
The Hubble distance is defined as DH = c/H0, where c is the speed of light. Given a set of cosmological parameters, the comoving distance of a given galaxy from the Earth, DC, can be calculated from its observed redshift z as

DC = DH ∫₀^z dz′ / √(ΩM (1 + z′)³ + Ωk (1 + z′)² + ΩΛ),    (3)

where ΩM is the present-day relative matter density of the universe, Ωk measures the curvature of space, and ΩΛ is the relative density due to the cosmological constant. The transverse comoving distance DM is defined as:

DM = (DH/√Ωk) sinh(√Ωk DC/DH)      for Ωk > 0,
DM = DC                            for Ωk = 0,    (4)
DM = (DH/√|Ωk|) sin(√|Ωk| DC/DH)   for Ωk < 0.

As suggested by current cosmological constraints (see e.g. Aubourg et al. (2015); Ade et al. (2016)), Ωk is small and thus eq. 4 can be approximated by

DM ≈ DC (1 + Ωk DC² / (6 DH²)).    (5)

The comoving coordinates of any galaxy can be described by its right ascension α, declination δ, and comoving radial distance r = DC(z). In the case of small but non-zero Ωk, the distance s12 between two points with coordinates r⃗1 and r⃗2 (illustrated in Fig. 1) can be approximated by:

s12 = √(σ12² + π12²),    (6)

where σ12 and π12, the distances transverse and parallel to the line of sight (LOS) respectively, are defined as:

σ12 = (r1 + r2) sin(θ12/2),    (7)
π12 = |r1 − r2| cos(θ12/2).    (8)

The angular separation θ12 between the two galaxies is given by

cos θ12 = sin δ1 sin δ2 + cos δ1 cos δ2 cos(α1 − α2).    (9)

Since random catalogs are typically generated by uniformly populating the fiducial volume of the survey, the galaxy distribution over z, and thus over r, is factorizable from the angular distribution. In other words, any angular region of the sky has the same distribution of galaxies in r (see Ross et al. (2012) and Fig. 2). This means that the expected count of random galaxies, R(r⃗), can be factorized into the product of the expected count Rang(α, δ) at a given angular position and a radial probability density function (PDF) Pr(r):

R(r⃗) = Rang(α, δ) Pr(r).    (10)

R(r⃗) evaluated this way has a smaller statistical uncertainty, since the precision of Rang(α, δ) depends on the number of points in the 2D angular cell, and the precision of Pr(r) depends on the number of points in the 1D range in r. In contrast, the statistical precision without the factorizability assumption is determined by the number of points in a 3D cell (r, α, δ).
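For illustration, the quantities defined above can be evaluated numerically as follows. This is a minimal sketch (our own function names; a simple midpoint rule for the redshift integral), not the paper's C++ implementation:

```python
import numpy as np

def comoving_distance(z, H0=70.0, Om=0.274, Ok=0.0, OL=0.726, n=10000):
    """Comoving distance DC = DH * integral of dz'/E(z'),
    with E = sqrt(Om(1+z)^3 + Ok(1+z)^2 + OL); midpoint rule, Mpc."""
    DH = 299792.458 / H0                     # Hubble distance c/H0 in Mpc
    zp = (np.arange(n) + 0.5) * z / n        # midpoints of n sub-intervals
    E = np.sqrt(Om * (1 + zp)**3 + Ok * (1 + zp)**2 + OL)
    return DH * np.sum(1.0 / E) * (z / n)

def separation(r1, r2, theta):
    """s12 from eqs. 6-8: transverse and parallel components
    built from the radial distances and the opening angle."""
    sigma = (r1 + r2) * np.sin(theta / 2.0)   # transverse to the LOS
    pi = abs(r1 - r2) * np.cos(theta / 2.0)   # parallel to the LOS
    return np.hypot(sigma, pi)

def opening_angle(a1, d1, a2, d2):
    """Opening angle between two sky positions (alpha, delta) in
    radians, via the spherical law of cosines."""
    c = (np.sin(d1) * np.sin(d2) +
         np.cos(d1) * np.cos(d2) * np.cos(a1 - a2))
    return np.arccos(np.clip(c, -1.0, 1.0))
```

In flat space the decomposition of eqs. 6-8 reproduces the exact chord length, e.g. separation(3, 4, π/2) = 5.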
We will show in Section 4 that the uncertainties in ξ̂(s) evaluated using the suggested method are indeed smaller. The PDF of random-random galaxy pairs separated by a distance s can be expressed as:

RR(s) = (1/NRR) ∫ dΩ1 dΩ2 dr1 dr2 Rang(α1, δ1) Rang(α2, δ2) Pr(r1) Pr(r2) δ(s − s12),    (11)

where Ω represents a solid angle (and dΩ = cos δ dδ dα) and NRR is used to normalize the random-random distribution.
For the unweighted calculation (all galaxies have weight 1), the normalization constant is simply NRR = NR(NR − 1)/2, where NR is the total number of galaxies in the random catalog. However, for the weighted calculation, the normalization constant NRR is:

NRR = Σ_{i=1}^{NR} Σ_{j=i+1}^{NR} w_i^R w_j^R,    (12)

where w_i^R is the weight of the i-th galaxy in the random catalog. In this specific form, the calculation is O(NR²) and requires a double loop over indices. However, it can be rewritten as:

NRR = ½ [ (Σ_{i=1}^{NR} w_i^R)² − Σ_{i=1}^{NR} (w_i^R)² ],    (13)

which is now an O(NR) calculation. The same approach can be applied when normalizing the DD histogram. In eq. 11, the integral is taken over the entire fiducial volume of the random catalog R, and the Dirac δ function is introduced to ensure that the distance between two galaxies s12 is equal to the distance of interest s. To isolate the angular variables, we can write the δ function in eq. 11 as an integral of the product of two δ functions:

δ(s − s12) = ∫ dθ δ(s − s12(r1, r2, θ)) δ(θ − θ12).    (14)

Note that the second δ function is independent of the radial positions r1, r2 of the galaxies. Thus, RR can be rewritten as

RR(s) = (1/NRR) ∫ dθ dr1 dr2 f(θ) Pr(r1) Pr(r2) δ(s − s12(r1, r2, θ)),    (15)

where

f(θ) = ∫ dΩ1 dΩ2 Rang(α1, δ1) Rang(α2, δ2) δ(θ − θ12)    (16)

is the count of RR galaxy pairs whose angular separation is θ.
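The equivalence of the double loop and the rewritten O(NR) form of the normalization (eq. 13) is easy to verify numerically, e.g. with made-up weights (an illustrative sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 2.0, size=500)   # hypothetical random-catalog weights

# O(N^2) double loop over ordered pairs i < j.
n_rr_slow = sum(w[i] * w[j]
                for i in range(len(w))
                for j in range(i + 1, len(w)))

# O(N) rewriting, eq. 13: NRR = ((sum w)^2 - sum w^2) / 2.
n_rr_fast = (w.sum()**2 - (w**2).sum()) / 2.0

print(np.isclose(n_rr_slow, n_rr_fast))  # -> True
```

The identity follows from expanding (Σ w_i)², which counts every ordered pair once and every self-pair w_i² once.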
The probability distribution of random galaxy pairs is constructed in two steps: first histogramming, where we construct the distribution f (θ) over angular separation using the count of random galaxies Rang(α, δ) according to eq. 16, and then integration, where we convolve the angular distribution with the radial PDF Pr(r) to obtain the distribution of random pairs over s.
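The histogramming step can be sketched as follows: a naive construction of f(θ) of eq. 16 from an angular map, looping over all pairs of cells (illustrative only; the actual implementation restricts the pairing to neighbouring angular regions, as described in Section 3):

```python
import numpy as np

def angular_pair_hist(alpha, delta, counts, n_theta=64):
    """Histogram f(theta): weighted count of angular-cell pairs
    versus angular separation. alpha, delta are cell-center
    coordinates in radians; counts holds the Rang cell contents."""
    f = np.zeros(n_theta)
    edges = np.linspace(0.0, np.pi, n_theta + 1)
    for i in range(len(alpha)):
        # Opening angle between cell i and every cell at once.
        c = (np.sin(delta[i]) * np.sin(delta) +
             np.cos(delta[i]) * np.cos(delta) * np.cos(alpha[i] - alpha))
        theta = np.arccos(np.clip(c, -1.0, 1.0))
        w = counts[i] * counts          # product of cell contents
        w[i] = 0.0                      # drop the self-pair
        f += np.histogram(theta, bins=edges, weights=w)[0]
    return f / 2.0                      # each pair was counted twice
```

For two cells holding 2 and 3 galaxies, the total histogram content is 2 × 3 = 6 pairs, as expected.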

Data-random and data-data distributions
Unlike in the random catalog, in the data catalog the distribution of galaxies over r cannot be factorized from their angular distribution. Thus, the probability distribution over the opening angle between galaxies in the data and random catalogs is also a function of r1, the radial position of the data galaxy:

g(r1, θ) = Σ_{i=1}^{ND} w_i^D δ(r1 − r_i) ∫ dΩ2 Rang(α2, δ2) δ(θ − θ_i2),    (17)

where θ_i2 is the opening angle between data galaxy i and the angular position (α2, δ2). The DR distribution can then be calculated as

DR(s) = (1/NDR) ∫ dθ dr1 dr2 g(r1, θ) Pr(r2) δ(s − s12(r1, r2, θ)),    (18)

where NDR is used to normalize the data-random distribution. For the unweighted calculation, the normalization constant is simply NDR = ND × NR, where ND is the total number of galaxies in the data catalog. For the weighted calculation, the normalization constant NDR is:

NDR = (Σ_{i=1}^{ND} w_i^D) (Σ_{j=1}^{NR} w_j^R),    (19)

where w_i^D is the weight of the i-th galaxy in the data catalog. The normalization constant NDR only requires an O(NR) computation and hence can be computed quickly even in the weighted calculation.
Finally, the data-data count is estimated using a brute-force iteration over all possible data galaxy pairs.

Generalization of the 2pcf to the anisotropic case
The discussion up to this point has dealt with only a spherically symmetric 2pcf. However, the algorithm is easily generalized to study anisotropy in the 2pcf (Davis & Peebles 1983).
The histogramming stage proceeds in the same fashion as for the isotropic case. At the integration stage, we compute the distances transverse and parallel to LOS according to eqs. 7 and 8 and fill in RR, DR and DD histograms in two dimensions (σ, π). The BAO signal is expected to manifest itself as an ellipse in this 2D histogram. In the isotropic case the ellipse is reduced to a circle.

DESCRIPTION OF THE ALGORITHM
We compute DD by summing over all pairs of galaxies i and j separated by the distance s, with each pair weighted by the product of the weights of the individual galaxies, w_i^D × w_j^D, in the survey. To the data we apply weights according to the prescription of Ross et al. (2012) to deal with the issues of close-pair corrections (wcp), redshift-failure corrections (wrf), systematic targeting effects (wsys), and shot noise and cosmic variance (wFKP) (Feldman et al. 1994), such that the total weight of a given data galaxy is:

w^D = (wcp + wrf − 1) wsys wFKP.    (20)

RR and DR are calculated using finely binned probability densities in (α, δ) and r defined from the random catalog. The choice of the bin sizes is important and is determined by the final bin size, ∆s, desired in ξ̂(s). In practice, this means that the bin sizes in α and δ must be smaller than the angle subtended by ∆s at the outermost radius of the data set by a factor greater than 4. The bin size in r should also be smaller than ∆s by the same factor. Examples of binned angular histograms are shown in Fig. 3. It is clear from the zoomed-in view that the statistical fluctuations, determined by the finite size of the random catalog, are significant. However, it is important to note that these fluctuations are much smaller than the fluctuations in 3D cells with full coordinates (r, α, δ), which is effectively what is used in the brute-force approach or in any other approach that relies on a 3D number density distribution. In this paper, we generate the 2D angular and 1D radial probability maps from an existing random catalog (published along with the observational data), but ideally these maps should be generated directly from the observational conditions. That approach would produce smoother probability maps and minimize the shot noise associated with the use of random catalogs.
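Under the weighting prescription quoted above, the close-pair and redshift-failure corrections enter additively (each defaults to 1 when no correction applies) while the systematic and FKP weights multiply. A one-function sketch (argument names are ours):

```python
def total_weight(w_cp, w_rf, w_sys, w_fkp):
    """Total per-galaxy weight following the Ross et al. (2012)
    prescription: additive close-pair / redshift-failure
    corrections, multiplicative systematic and FKP weights."""
    return (w_cp + w_rf - 1.0) * w_sys * w_fkp

print(total_weight(1.0, 1.0, 1.0, 1.0))  # no corrections -> 1.0
```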
Once the bin sizes are chosen, the algorithm proceeds as follows. First, using the random catalog, we produce the histograms Rang(α, δ) and Pr(r). Each galaxy in the random catalog has an r-dependent weight w^R(r), and these weights are used to fill the histograms: for each galaxy, Pr(r) is incremented by w^R(r)/NR, and Rang(α, δ) is incremented by 1. This can be generalized to the case where the weights also depend on angular position and are factorizable into angular and radial components, w^R = w^R(r) w^R(α, δ); then Rang(α, δ) must instead be incremented by w^R(α, δ).
Then, we calculate the distribution g(r1, θ) using galaxies from the data catalog and Rang(α, δ). Here, we consider every pair consisting of a data galaxy and an angular cell of Rang. The angular separation θ12 is calculated using the position (α1, δ1) of the data galaxy and the angular position (α2, δ2) of the barycenter of the Rang cell. For each such pair, the distribution g(r1, θ) is incremented by w^D Rang(α2, δ2), where w^D is the weight of the data galaxy.
Next, we perform an integration over distances r1 and r2 and produce the distributions RR and DR. This is achieved in three nested loops iterating over θ, r1 and r2. Given these three variables, the distance separation s12 is calculated using eq. 6. The distribution RR is incremented by f (θ)Pr(r1)Pr(r2) according to eq. 15, while DR is incremented by g(r1, θ)Pr(r2) according to eq. 18. Then, the two histograms are normalized according to eqs. 13 and 19 respectively.
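The integration stage can be sketched as follows (a simplified version, shown for RR only; the innermost loop over r2 is vectorized, and the binning conventions are ours):

```python
import numpy as np

def integrate_rr(f_theta, theta_centers, p_r, r_centers, s_edges):
    """Convolve the angular pair histogram f(theta) with the radial
    PDF Pr(r) to obtain the RR distribution over separation s,
    in the spirit of eq. 15."""
    rr = np.zeros(len(s_edges) - 1)
    for it, theta in enumerate(theta_centers):
        if f_theta[it] == 0.0:
            continue                       # skip empty theta bins
        for i1, r1 in enumerate(r_centers):
            # s12 for this (theta, r1) against all r2 at once (eq. 6).
            sigma = (r1 + r_centers) * np.sin(theta / 2.0)
            pi = np.abs(r1 - r_centers) * np.cos(theta / 2.0)
            s12 = np.hypot(sigma, pi)
            w = f_theta[it] * p_r[i1] * p_r
            rr += np.histogram(s12, bins=s_edges, weights=w)[0]
    return rr
```

Since the total weight entering the histogram is Σθ f(θ) (Σr Pr(r))², the content of the RR histogram is conserved when all separations fall inside the binning range, which provides a simple closure check.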
In the calculation presented in this paper, we only consider galaxy pairs separated by less than a maximum distance scale of interest, lmax. For the estimate of DD, the whole space is divided into cubes with edge length lmax; each data galaxy is then paired only with galaxies in the same cube or in the neighboring cubes. For RR and DR, the (α, δ) plane is divided into rectangular regions with the angular size subtended by lmax at the smallest value of r, and the algorithm calculates f(θ) and g(r1, θ) only within one angular region and its neighbors. Finally, the LS estimator is calculated according to eq. 2. All other estimators can be calculated just as easily from the RR and DR distributions computed above.
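The neighbor-cube scheme for DD can be illustrated with a short sketch (our own simplified version, counting unweighted pairs closer than lmax; the 3x3x3 neighborhood guarantees that no pair under the cutoff is missed):

```python
import numpy as np
from collections import defaultdict
from itertools import product

def count_pairs_grid(xyz, l_max):
    """Count pairs with separation < l_max using cubic cells of edge
    l_max: each galaxy is compared only against galaxies in its own
    cell and the 26 neighbouring cells."""
    cells = defaultdict(list)
    for i, p in enumerate(xyz):
        cells[tuple((p // l_max).astype(int))].append(i)
    n_pairs = 0
    for (cx, cy, cz), members in cells.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            other = cells.get((cx + dx, cy + dy, cz + dz), [])
            for i in members:
                for j in other:
                    # i < j counts each unordered pair exactly once.
                    if i < j and np.linalg.norm(xyz[i] - xyz[j]) < l_max:
                        n_pairs += 1
    return n_pairs
```

Any pair with separation below lmax has all three coordinate differences below lmax, so its two galaxies land in the same or adjacent cells and the pair is never skipped.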

PERFORMANCE OF THE ALGORITHM
The performance of the new algorithm is evaluated by calculating the runtime and the uncertainty of the LS estimator of the 2pcf.
Though the method described in this paper can be applied under any cosmology, for definiteness we assume a flat ΛCDM+GR cosmology with parameters consistent with those used in the analysis of the SDSS-III DR9 data set (Anderson et al. 2013): H0 = 70 km s⁻¹ Mpc⁻¹, ΩM = 0.274, ΩΛ = 0.726, and Ωk = 0.
We choose the following algorithm settings. The maximum galaxy pair separation lmax is 400 h⁻¹ Mpc. The number of angular bins in the (α, δ) plane is Nα × Nδ = 3500 × 1600, corresponding to an angular bin size of 1 mrad, or a resolved galaxy separation of 2.55 h⁻¹ Mpc at the outermost radius. The number of bins in θ is Nθ = 3142, corresponding to an angular bin size of 1 mrad. The number of bins in r is Nr = 900, corresponding to a bin size of 1 h⁻¹ Mpc.
The running time scales as (Nα × Nδ)² for the part that fills the histogram in θ, plus Nθ × Nr² for the integration. The number of bins scales inversely with the bin size ∆s of the final histogram and linearly with the distance scale of interest lmax. The time performance of the histogramming therefore scales as the number of 2D cells squared, or O((lmax/∆s)⁴). The integration time scales as the number of bins in θ times the number of bins in r squared, or O((lmax/∆s)³). Thus, the histogramming part dominates the computational time. Other fast computational techniques based on number density rather than actual pair counting rely on the data distribution in 3D cells (Moore et al. 2001) and thus scale more strongly with the number of cells.
Usually, the calculations of RR and DR are the most computationally expensive since the size of the random catalog is typically much larger than the size of the data catalog. Therefore, we report the results of RR and DR for different sizes of random catalogs, while we do not change the data catalog.
To generate random catalogs of any desired size, we begin with the northern sky of the existing SDSS-III DR9 random catalog which contains ∼3.5M galaxies and extract the distributions Rang(α, δ) and Pr(r). We then randomly generate new galaxies according to the product of these distributions. While the new catalogs may amplify existing statistical fluctuations in the DR9 random catalog, our interest is in testing the runtime of the algorithms and estimating the variance inξ(s) as a function of random catalog size. Therefore, biases introduced by the generation procedure are unimportant.
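The generation of new random catalogs from the product of the extracted distributions can be sketched as follows (a simplified, binned stand-in for the actual procedure; all argument names are ours):

```python
import numpy as np

def sample_random_catalog(n, ang_cells, ang_counts, r_edges, p_r, seed=0):
    """Draw a random catalog of n galaxies from the product
    Rang(alpha, delta) x Pr(r): pick an angular cell with probability
    proportional to its count, then draw r from the binned radial PDF."""
    rng = np.random.default_rng(seed)
    p_ang = np.asarray(ang_counts, float)
    p_ang = p_ang / p_ang.sum()
    cell_idx = rng.choice(len(ang_cells), size=n, p=p_ang)
    alpha, delta = np.asarray(ang_cells)[cell_idx].T
    p_rad = np.asarray(p_r, float)
    p_rad = p_rad / p_rad.sum()
    r_edges = np.asarray(r_edges, float)
    r_idx = rng.choice(len(p_rad), size=n, p=p_rad)
    # Place each galaxy uniformly inside its radial bin.
    lo, hi = r_edges[r_idx], r_edges[r_idx + 1]
    r = rng.uniform(lo, hi)
    return alpha, delta, r
```

Because the angular and radial draws are independent, the generated catalog factorizes by construction, exactly as assumed in Section 2.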
For random catalog sizes of NR = 1M, 3M, and 10M galaxies, we generate 20 realizations each and calculate RR and DR using both the brute-force technique and the fast algorithm. We estimate the root mean square (RMS) of ξ̂(s) using the 20 random catalogs and present it in Fig. 4. Note that although the bin size used in the computation is small, as specified above, the result is plotted with a much larger bin size, simply for visual purposes. In both the brute-force and fast algorithms, the uncertainties in ξ̂(s) decrease as the size of the random catalog increases, but the uncertainties in the results based on the fast algorithm are smaller than those in the respective brute-force results, in agreement with the qualitative argument presented in Section 2.

Before discussing the runtime, it is worth noting that the brute-force method allows for parallelization (Alonso 2012), as does the suggested fast method. Hence, the runtime is measured in CPU hours rather than elapsed wall-clock time. As in the fast algorithm, the brute-force pair counting was done using a grid scheme, as suggested in Alonso (2012), up to a maximum distance of lmax = 400 h⁻¹ Mpc. Hence, the two sets of results include the same statistics.
In Fig. 5, we show the runtime of the algorithms, measured for the purely algorithmic parts of the executable, which is based on a C++ implementation running on modern CPUs. As expected, the runtime of the brute-force calculation is proportional to NR² for RR and to NR for DR. In contrast, the runtime of the fast algorithm plateaus at a constant value because it depends on the fiducial volume and the number of angular and radial bins rather than on the size of the random catalog NR. The maximum runtime of the fast algorithm is reached when every bin of the Rang(α, δ) distribution is populated and hence enters the calculation of ξ̂(s).
To check the fast algorithm for bias, we compute ξ̂(s) for 20 mock catalogs representing D, each with 200k galaxies Poisson-distributed within the survey volume. These mock catalogs are produced using the same procedure used to generate the random catalogs, described previously. We find no bias; that is, the mean of ξ̂(s) is centered at zero, as expected for a truly random distribution, and its RMS is consistent with Poisson statistics (Fig. 6). Note that the Poisson error is dominated by DD, since DR and RR are much larger and hence contribute smaller relative errors.
Finally, the algorithm is also applied to the SDSS-III DR9 BOSS data catalog. The results obtained from the suggested algorithm and the brute-force algorithm, and their difference, are presented in Fig. 7. The BAO peak is clearly visible in both, and the two distributions are consistent.

CONCLUSION
We have presented a new computational method for the galaxy 2pcf that replaces summation over all possible galaxy pairs with integration over probability maps. The method provides a significant reduction in calculation time and improves the precision of the calculation. In the future, the method could be used for fast evaluation of galaxy correlations in large spectroscopic surveys; in this case, the generation of large random catalogs can be replaced by weighted probability maps determined directly by the observational conditions. The minimum pair separation for which the method is applicable is set by the fiber size, as it is for the brute-force method. The reduced calculation time may also allow for the evaluation of larger numbers of mock scenarios and thus more extensive tests of cosmological parameters. Finally, we point out that the approach can easily be extended to evaluate higher-order correlations.