Deep searches for broadband extended gravitational-wave emission bursts by heterogeneous computing

We present a heterogeneous search algorithm for broadband extended gravitational-wave emission (BEGE), expected from gamma-ray bursts and energetic core-collapse supernovae. It searches the $(f,\dot{f})$-plane for long-duration bursts from inner engines slowly exhausting their energy reservoir, by matched filtering on a {\em Graphics Processor Unit} (GPU) over a template bank of millions of one-second duration chirps. Parseval's Theorem is used to predict the standard deviation $\sigma$ of the filter output, taking advantage of near-Gaussian noise in LIGO S6 data over 350-2000 Hz. Only tails exceeding a multiple of $\sigma$ are communicated back to a {\em Central Processing Unit} (CPU). This algorithm attains about 65\% efficiency overall, normalized to the Fast Fourier Transform (FFT). At about one million correlations per second over data segments of 16 s duration ($N=2^{16}$ samples), better than real-time analysis is achieved on a cluster of about a dozen GPUs. We demonstrate its application to the capture of high-frequency hardware LIGO injections. This algorithm serves as a starting point for deep all-sky searches in both archive data and real-time analysis in current observational runs.


INTRODUCTION
Gravitational radiation offers a potentially powerful new channel to discover the physical nature and population statistics of core-collapse supernovae and their association with neutron stars and black holes. Recently, LIGO identified a black hole binary progenitor of GW150914 (LIGO-Virgo 2016) with remarkably low spin. Stellar mass black holes are believed to be remnants of extreme transient events such as gamma-ray bursts and core-collapse supernovae. Shortly after birth in the latter, possibly including the superluminous variety, black holes may encounter strong interactions with high density matter. This interaction opens a window to release their angular momentum in gravitational radiation, leaving slowly spinning remnants with $a/M \simeq 0.3$ in dimensionless spin (van Putten & Della Valle 2017). Future detection of similar events may reveal whether GW150914 is typical or the tail of a broad distribution in black hole mass and spin.
Neutron stars and black holes born in core-collapse of massive stars are of great interest as candidate sources of gravitational waves, especially for their potential to be visible also in the electromagnetic spectrum. Searches for these events may be triggered in either radiation channel (Sathyaprakash & Schutz 2009; van Putten 2016), and a combined detection would enable identification of source and host environment, in the footsteps of multi-wavelength observations of gamma-ray bursts pioneered by BeppoSAX (Frontera et al. 2009). EM-triggers obtained from transient surveys allow off-line GW-analysis of LIGO-Virgo archive data. On the other hand, GW-triggers require relatively low latency in EM follow-up, which may be challenging given the modest localisation of LIGO-Virgo detections.
While core-collapse supernovae are relatively numerous, only a small fraction is known to be associated with extreme events. For instance, the true event rate (corrected for beaming) of long GRBs is about 1 per year within a distance of 100 Mpc. Achieving sensitivity out to tens of Mpc for emissions limited to $E_{\rm GW} = \mathcal{O}\left(1\,M_\odot c^2\right)$, where $c$ denotes the velocity of light, poses a challenge for deep searches in gravitational wave data.
Broadband extended gravitational-wave emission (BEGE) from the aforementioned extreme events may be produced with durations lasting up to tens of seconds. In chirp-based spectrograms, these may appear as trajectories marked by frequencies slowly wandering in time, featuring ascending and descending chirps (Levinson et al. 2015). To search for these signatures in the $(f,\dot f)$-plane, we recently devised a dedicated butterfly filtering using chirp templates of intermediate duration $\tau$ of, e.g., one second, targeting a time scale of phase coherence that may capture tens to hundreds of wave periods associated with non-axisymmetric accretion flows. Using millions of chirp templates, it can detect complex signals such as Kolmogorov scaling in noisy time-series, recently in BeppoSAX gamma-ray light curves with an average photon count down to 1.26 photons per 0.5 ms bin (van Putten et al. 2014a). This kind of sensitivity suggests exploring further applications to strain amplitude gravitational wave data (van Putten 2016).
Deep searches covering a complete science run of LIGO require considerable computing resources in the application of butterfly filtering with a dense bank of templates. We here report on a novel algorithm by heterogeneous computing comprising both Graphics and Central Processing Units (GPUs, respectively CPUs), using the Open Compute Language (OpenCL) (Gaster et al. 2011; Khronos 2015).
A primary challenge in heterogeneous computing is circumventing GPU-CPU bottlenecks arising from potentially vast discrepancies in data throughput over the Peripheral Component Interface (PCI). Our algorithm exploits near-Gaussian noise in the high frequency bandwidth of 350-2000 Hz in LIGO data, whereby the output of matched filtering is essentially Gaussian as well. Near-optimal efficiency is obtained by communicating only tails of relatively high signal-to-noise ratio from the GPU back to the CPU, with a cut-off predicted by Parseval's Theorem. Including the overhead of the latter, our algorithm achieves about 65% efficiency normalized to GPU-accelerated Fast Fourier Transform (FFT) in complex-to-complex (C2C), single precision (SP) and interleaved out-of-place memory allocation.
Our choice of chirp templates is guided by inner engines involving black holes described by the Kerr metric, interacting with high density matter, expected in core-collapse of massive stars and mergers involving neutron stars, the latter envisioned in association with short GRBs with Extended Emission (SGRBEE) and long GRBs with no supernovae (LGRBN) such as GRB060614 (van Putten et al. 2014b).
In the application to LIGO S6, we give a detailed description of our data-base, which comprises bandpass filtering to the aforementioned 350-2000 Hz (over 64 s segments of data, $N = 2^{18}$ samples) and a restriction to simultaneous H1-L1 detector output (29.4% of total S6 data). On a modern GPU, we realize approximately 80,000 correlations per second over 16 s segments of data ($N = 2^{16}$ samples). On a cluster of about a dozen GPUs, about 1 million correlations per second realizes better than real-time analysis. As such, the presented method is applicable to both archive analysis and low-latency searches in essentially real-time, pioneered following GW150914 (Abbott et al. 2016a) and for current Advanced LIGO runs (e.g. Guo et al. 2017). For the present archive analysis of LIGO S6, however, our focus is on an exhaustive deep search, all-sky and blind without triggers from electromagnetic data.
Existing comprehensive, blind searches for bursts include Abbott et al. (2004, 2007, 2009a), Abadie et al. (2010, 2012), Ando et al. (2005) and Aasi et al. (2013). Late-time emission following black hole spin down reaches an asymptotic frequency satisfying (1), where the range in frequency refers to the dependency on initial black hole spin. This motivates our present focus on the high-frequency bandwidth 350-2000 Hz in LIGO S6. In this frequency bandwidth, LIGO noise is essentially Gaussian, which shall be exploited in our GPU-CPU method of analysis. In §2 we review chirp-based spectrograms by butterfly filtering. In §3, our heterogeneous computing algorithm is described, with use of pre- and post-callback functions. Benchmark results are given in §4. §5 reports on a detection of some illustrative LIGO S6 hardware burst and calibration injections. We summarize our findings and outlook in §6.

LIGO S6 covers the period July 7, 2009 through October 20, 2010. Strain noise in H1 and L1 was better than $10^{-20}$ over 89%, respectively 64%, of the time. Performance of H1 and L1 became somewhat more consistent after the first 3000 hours. In our analysis of LIGO S6, we focus on epochs when H1 and L1 are both taking data. These H1∧L1 data represent 29.4% of the data when either H1 or L1 was taking data (H1∨L1), measured over 64 second segments (Fig. 1).
In our search for gravitational wave emission from core-collapse supernovae associated with stellar mass black holes, we focus on the frequency bandwidth of 350-2000 Hz. After bandpass filtering (over 64 s segments of data, $N = 2^{18}$ samples), LIGO noise is essentially Gaussian (e.g. van Putten 2016). This bandwidth may contain gravitational wave emission from non-axisymmetric mass motion about the Innermost Stable Circular Orbit (ISCO) around stellar mass black holes (van Putten et al. 2011; van Putten 2016).
To search for slowly evolving trajectories in time-frequency space, we consider matched filtering over a large bank of chirp templates covering a range in $f$ and time rate-of-change of frequency $\dot f$, i.e., a butterfly

$0 < \delta_1 < |\dot f| < \delta_2$   (2)

for some $\delta_{1,2} > 0$. Over a finite bandwidth of frequencies, the resulting output is a chirp-based spectrogram. The chirps are generated from a long duration template, produced by solving a pair of ordinary differential equations modeling black hole spin down against high density matter at the ISCO (van Putten et al. 2014a), the results of which are illustrated in Fig. 2. Matched filtering of a time series $y(t)$ against chirp templates $w(t)$ is defined by the correlation

$\rho(t) = \int w(t'+t)\,y(t')\,dt'.$   (3)

In the present application to LIGO strain data, $y(t)$ and $w(t)$ have zero mean. This integral is conveniently evaluated in the Fourier domain as

$\tilde\rho(k) = \tilde w^*(k)\,\tilde y(k),$   (4)

where $\tilde y(k)$ denotes the Fourier transform of $y(t)$ and the asterisk denotes complex conjugation. Discretizing (3) to samples at equidistant instances $t_n$ ($n = 0, 1, \cdots, N$), we evaluate (4) by FFT. This is more efficient than direct evaluation of (3) in the time domain whenever the number of samples $N$ exceeds a few hundred, as may be readily observed by comparing compute times when convolving two vectors $u$ and $v$ by FFT versus direct evaluation in, e.g., MatLab; see also Smith (2016). Retaining tails $\rho(t) > \kappa\sigma$ for a threshold $\kappa$ realizes near-optimal heterogeneous GPU-CPU computing, effectively circumventing PCI bandwidth limitations when $\kappa$ is a few.
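For illustration, the Fourier-domain evaluation of the correlation can be sketched in a few lines of NumPy. The chirp parameters (`f0`, `fdot`) below are hypothetical, and white noise stands in for band-passed strain data; this is a sketch, not the production OpenCL pipeline.

```python
# Sketch of FFT-based matched filtering of one 16 s segment against a
# single zero-padded chirp template; chirp parameters are illustrative.
import numpy as np

fs = 4096                  # LIGO S6 (downsampled) sampling rate, s^-1
N = 2**16                  # 16 s segment
t = np.arange(N) / fs

# hypothetical one-second chirp, extended by zeros to length N
tau, f0, fdot = 1.0, 700.0, -40.0       # duration (s), f (Hz), df/dt (Hz/s)
tc = t[t < tau]
w = np.zeros(N)
w[:tc.size] = np.sin(2 * np.pi * (f0 * tc + 0.5 * fdot * tc**2))

rng = np.random.default_rng(0)
y = rng.standard_normal(N)              # white-noise stand-in for strain data
y -= y.mean()                           # zero mean, as assumed above

# rho = FFT^-1[ conj(w~) y~ ]: one inverse FFT instead of N time-domain lags
rho = np.fft.ifft(np.conj(np.fft.fft(w)) * np.fft.fft(y)).real
```

Here `rho[n]` is the circular correlation at lag $n$; at zero lag it reduces to the inner product of `w` and `y`.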
For reference, recall that correlating vectors $y$ and $w$ comprises three steps: two forward FFTs, pointwise products $\tilde\rho$ involving complex conjugation, and one inverse FFT,

$\rho = \mathrm{FFT}^{-1}\left[\tilde w^* \tilde y\right].$   (5)

For LIGO S6, the (downsampled) sampling rate is 4096 s$^{-1}$, whence $N = 2^{16}$ for 16 s data segments.
With vanishing mean values, the standard deviation $\sigma$ of $\rho(t_n)$ satisfies Parseval's Theorem,

$\sigma^2 = \frac{1}{N^2}\sum_{n=0}^{N-1} |c_n|^2,$   (7)

where the $c_n$ denote the Fourier coefficients of $\rho(t)$ according to the FFT pair

$c_n = \sum_{m=0}^{N-1} \rho(t_m)\, e^{-2\pi i mn/N}, \quad \rho(t_m) = \frac{1}{N}\sum_{n=0}^{N-1} c_n\, e^{2\pi i mn/N}.$   (8)

Bandpass filtered to 350-2000 Hz, H1∧L1 (Table 1) has noise which is essentially Gaussian (e.g. van Putten 2016). This property is inherited by $\rho(t)$ in (3). Hence, $\rho(t)$ is effectively described by $\sigma$ in (7) for a given pair of data segment and template, and (7) provides a predictive step to the output of (5). In processing (5) on a GPU, a threshold in a post-callback function can be used to retain only the tails (Fig. 3)

$\rho(t_n) > \kappa\sigma$   (9)

for feedback to the CPU over the PCI. In (9), we implicitly apply the inequality to the absolute value of $\rho_n = \rho(t_n)$. Thus, (9) circumvents vast discrepancies in throughput of GPUs and CPUs whenever $\kappa$ is on the order of a few. This step is essential for an optimal heterogeneous computing algorithm, to be benchmarked further below. It should be mentioned that below 350 Hz, LIGO data is non-Gaussian, giving rise to distributions of $\rho(t)$ that occasionally show multiple peaks, depending on the pair of data segment and template. In this event, $\sigma$ inadequately describes $\rho(t_n)$, whereby tails defined by (9) become less meaningful in defining candidate detections.
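The predictive use of Parseval's Theorem can be sketched as follows, with white noise standing in for filtered strain; note that $\sigma$ is available from the Fourier coefficients before the inverse FFT is taken.

```python
# Sketch of the Parseval step: predict sigma of the correlation output
# from the Fourier coefficients c_k, then retain only kappa*sigma tails.
import numpy as np

N = 2**16
rng = np.random.default_rng(1)
y = rng.standard_normal(N); y -= y.mean()   # stand-in for a data segment
w = rng.standard_normal(N); w -= w.mean()   # stand-in for a chirp template

c = np.conj(np.fft.fft(w)) * np.fft.fft(y)  # Fourier coefficients of rho
sigma = np.sqrt(np.sum(np.abs(c)**2)) / N   # Parseval: sigma^2 = sum|c_k|^2 / N^2

rho = np.fft.ifft(c).real                   # inverse FFT (on the GPU)
kappa = 5.5
tails = np.flatnonzero(np.abs(rho) > kappa * sigma)   # candidate events only
```

For Gaussian noise, a $5.5\sigma$ threshold leaves an expected tail count of well under one sample per segment, which is what keeps the GPU-to-CPU traffic negligible.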
Processing is applied to batches of $M = 2048$ segments of H1∧L1 16 s data. Such a block, about 9 hours of data comprising about 1 GByte, is suitable for allocation in Global Memory of a typical GPU. Chirp templates are extracted by time slicing from a model of black hole spin down (van Putten et al. 2014a). While these emissions are of relatively high frequency when the black hole spins rapidly, late-time emission following spin down reaches an asymptotic frequency satisfying (1). Analysis is performed in groups of $M$ such templates by FFT in batch mode. Batch mode operation is essential to reaching optimal FFT performance on a GPU.

Teraflops compute requirements
Sensitivity to arbitrary, slowly varying transients is realised by banks sufficiently large to densely cover the $(f, df/dt)$-parameter space. For matched filtering with a bank of chirps of one second duration covering $f = O(N)$ Hz with frequency changes $O(f)$, a dense bank has step sizes of order $1/N$ Hz in $f$ and $df/dt$, setting a minimum bank size of order $O(N^2)$. For $f$ on the order of one kHz, the minimum bank size is $O(1\mathrm{M})$, needed to ensure a reasonable probability of matching a signal (a "hit" when $\rho > \kappa\sigma$).
For a better than real-time analysis by butterfly filtering of data segments of duration $T$ over a template bank of size $K_1 M$, the required compute performance is

$\dot{C} = \frac{5 N \log_2 N}{T}\, K_1 M \simeq 2.7\;\mathrm{Tflop\,s^{-1}},$   (10)

where the right hand side refers to our choice of $T = 16$ seconds and a template bank of $\alpha = 1, 2, \cdots, K_1$ sets of size $M = 2048$ each. Hardware requirements are considerably higher, since FFTs tend to be memory limited (not compute limited) on GPUs, especially when FFT array sizes exceed the size of Local Memory private to individual Compute Units (CU). At typical efficiencies of $\eta \simeq 7\%$ in these cases, (10) points to a minimum requirement of about 50 teraflops of maximal GPU compute-performance, assuming (10) is realized at approximately optimal efficiency normalized to FFT. In what follows, we consider partitioning the template bank into sets

$W_\alpha \;(\alpha = 1, 2, \cdots, K_1)$ of size $M$ and, respectively, the data into blocks $Y_\beta \;(\beta = 1, 2, \cdots, K_2)$ of size $M$.   (11)

In our application, $K_1 M = 2^{23}$ (up to 8 million templates) and $K_2 = 288$ for LIGO S6. The total number of correlations for a full LIGO S6 analysis is

$N_{\rm corr} = (K_1 M) \times (K_2 M) \simeq 5 \times 10^{12}.$   (12)

For our choice of 16 second segments ($N = 2^{16}$), (12) defines a compute requirement of $2.5 \times 10^{19}$ floating point operations for a complete LIGO S6 analysis over a bank of 8M templates.

Batch mode with pre- and post-callback functions

(i) A batch of $M$ data segments is transformed by forward $\mathrm{FFT}_{N,M}$, loading the $M$ transforms $\tilde{Z}^{(k)}$ ($k = 1, 2, \cdots, M$) into Global Memory;
(ii) A chirp template $w$ of duration $\tau = 1$ s is extended by zeros to length $N$ and its transform $\tilde{w}$ is loaded into Global Memory. A pre-callback function computes $M$ transforms $\tilde\rho^{(k)}$ from $M$ pointwise array multiplications $\tilde\rho^{(k)} = \tilde{Z}^{(k)} \cdot \tilde{w}^*$ ($k = 1, 2, \cdots, M$);
(iii) Inverse $\mathrm{FFT}_{N,M}$ applied to the $\tilde\rho^{(k)}$ produces $M$ correlations $\rho_k(t_n)$ over $N$ samples, representing the computationally most intensive (but memory limited) step on the GPU;
(iv) Steps (ii) and (iii) are repeated $M$ times, once for each of $M$ chirp templates $w$, at a total computational effort of $M^2$ inverse $\mathrm{FFT}_N$.
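The batched steps above can be mocked up serially in NumPy, with sizes scaled far down from the production values of $N = 2^{16}$ and $M = 2048$; the pre- and post-callback stages appear here as the pointwise product and the threshold, respectively.

```python
# Scaled-down serial mock-up of batch-mode butterfly filtering with
# thresholded output; array sizes are illustrative, not the production ones.
import numpy as np

N, M = 2**10, 8                       # stand-ins for 2**16 and 2048
kappa = 5.5
rng = np.random.default_rng(2)

Y = rng.standard_normal((M, N))
Y -= Y.mean(axis=1, keepdims=True)    # zero-mean data segments
Z = np.fft.fft(Y, axis=1)             # step (i): batched forward FFT

templates = rng.standard_normal((M, N))
hits = []
for m in range(M):                    # step (iv): loop over chirp templates
    W = np.fft.fft(templates[m] - templates[m].mean())
    C = np.conj(W) * Z                # step (ii): "pre-callback" products
    sig = np.sqrt(np.sum(np.abs(C)**2, axis=1)) / N   # Parseval per segment
    rho = np.fft.ifft(C, axis=1).real                 # step (iii): inverse FFT
    k, n = np.nonzero(np.abs(rho) > kappa * sig[:, None])  # "post-callback"
    hits += list(zip(k, [m] * k.size, n))             # only tails leave the GPU
```

For pure noise, `hits` is (almost always) empty: this is the output reduction that makes the GPU-CPU split viable.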
At $5N\log_2 N$ flops per $\mathrm{FFT}_N$, these combined steps for the above mentioned $N$ and $M^2$ involve ∼20 Tflop, producing 2 TByte of output. The latter shows the need to retain only tails of the $M \times M$ correlations $\rho_{km}$, i.e., candidate events exceeding a multiple of $\sigma_{km}$, one for each 16 s segment $k$ of data and chirp template $m$ ($k, m = 1, 2, \cdots, M$). The $\sigma_{km}$ are pre-computed by Parseval's Theorem (7). As norms of complex Fourier coefficients, (7) is computationally demanding, requiring off-loading to the GPU as well (Fig. 3). For $\kappa = 5.5$, for instance, tails are limited to the order of $10^4$ byte s$^{-1}$, well below the PCI bandwidth of several GByte s$^{-1}$, allowing near-optimal computing at about 65% efficiency overall (including Parseval's step), normalised to FFT alone. Retaining tails over the PCI by the CPU is realized as follows.
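The flop and output figures quoted in this section follow from elementary arithmetic, which may be checked directly:

```python
# Arithmetic check of the compute and output figures quoted in the text,
# at 5 N log2 N flops per length-N FFT.
M, N = 2048, 2**16
flops_per_fft = 5 * N * 16                  # log2(2**16) = 16
tflop_batch = M * M * flops_per_fft / 1e12  # M^2 inverse FFTs: about 22 Tflop
tbyte_out = M * M * N * 8 / 1e12            # complex SP output: about 2.2 TByte
total_corr = 2**23 * 288 * 2048             # full S6 bank times data segments
total_flops = total_corr * flops_per_fft    # about 2.6e19 flops for all of S6
```

These reproduce the ∼20 Tflop and 2 TByte per batch quoted above, and the $\sim 2.5\times 10^{19}$ flops for a complete S6 analysis.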

Gathering GPU-tails over the PCI
The tails of correlations satisfying (9) are gathered in two steps (Figs. 3-4). In correlations of a template $w_k \in W_\alpha$ and a segment $y_m \in Y_\beta$ ($1 \le \alpha \le K_1$, $1 \le \beta \le K_2$, $k, m = 1, 2, \cdots, M$), (9) is obtained for each $\sigma_{km}$. Thus, $w_k$ gives $M$ tails in correlation with $Y_\beta$, referenced by the time $t_m$ of maximal correlation satisfying (13-14), where $\rho_m = \rho(t_m)$. To circumvent limited PCI bandwidth, (13-14) is converted to pointers projected into an array $A_{\alpha k}$ of size $N$ (15). We evaluate (15) by a post-callback function on the GPU, updating $(t_m, \rho_m)$ with $(t_{m'}, \rho_{m'})$ whenever $\rho_{m'} > \rho_m$ and $\rho_{m'} > \kappa\sigma_{km'}$. As an asynchronous read/write by pointwise index on Global Memory, this may lead to indeterministic behavior when two processors operate concurrently on the same index. When $\kappa$ is appreciable, however, $A_{\alpha k}$ is sparse, and this anomalous behavior is exceedingly rare.
Repeating (15) for all $w_k \in W_\alpha$ obtains $M^2$ tails by pointers (16). Collecting all pointers in $A_\alpha$ is evaluated by the CPU. Gathering results over the complete template bank obtains by repeating (16) for all $1 \le \alpha \le K_1$, each time dereferencing $A_\alpha$ into an array $B$ of block size $NM$ on the host (17), evaluated by the CPU. In collecting $B$, we select data with maximal $\rho$ values at $t_{n+mN}$ from the $A_\alpha$. Gathering all hits, by removing the selection of maximal $\rho$ in collecting $B$ in (17), produces extended output with up to two orders of magnitude more output in case of a signal. For a burst injection discussed below (Fig. 7), for instance, this increases output to tens of GByte for a bank of 8M templates. Such extended output may be of interest to second runs, following up on selected data segments covering candidate events, but less so to first runs through all data such as LIGO S6.

(Table 2: clFFT operates on blocks of filtered H1∧L1 data in 1 GByte blocks allocatable in Global Memory for $\mathrm{clFFT}_{N,M}$ (C2C, SP) with interleaved out-of-place memory storage.)
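The post-callback update behind (15) amounts to a thresholded running maximum per sample index. A minimal serial sketch follows; on the GPU this runs as asynchronous pointwise updates on Global Memory, and the function name is illustrative.

```python
# Serial sketch of the pointer-array update: keep, per sample index, the
# maximal rho exceeding its threshold; zeros mean "no hit".
import numpy as np

N = 1024
A = np.zeros(N)                 # tail array for one template w_k

def update(A, n, rho, sigma, kappa=5.5):
    """Post-callback style update at sample index n (illustrative helper)."""
    if rho > kappa * sigma and rho > A[n]:
        A[n] = rho

update(A, 100, 7.0, 1.0)        # accepted: 7.0 exceeds 5.5*sigma and stored value
update(A, 100, 6.0, 1.0)        # ignored: below the stored 7.0
update(A, 200, 3.0, 1.0)        # ignored: below the 5.5*sigma threshold
```

With $\kappa$ appreciable, almost all entries of `A` remain zero, so only a sparse array crosses the PCI.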
Under OpenCL, a GPU is partitioned into CUs with fast but private Local Memory and registers. Only Global Memory is shared across all CUs. Performance hereby critically depends on efficient use of Local Memory and minimal use of Global Memory, since access to the latter is relatively slow. With a Local Memory size of typically 32 kByte, clFFT performance for C2C SP will be essentially maximal for $N \le 2^{12}$. In our application, $N = 2^{16}$, whereby clFFT performance is practically memory limited. Fig. 5 shows clFFT performance on GPUs with varying numbers of Stream Processors, CUs and Global Memory bus bandwidth (GByte s$^{-1}$), namely the R9 Nano (4096, 64, 512), the R9 390 (2560, 40, 384) and the D700 (2048, 32, 264). For the first, performance is over 600 Gflop s$^{-1}$ for $N > 2^{12}$ (about 1000 Gflop s$^{-1}$ for $N \le 2^{12}$). This is a direct result of the 32 kByte Local Memory size, the $8N$ bytes of complex single precision storage, and the consequent need to access Global Memory when $N > 2^{12}$. For $N = 2^{16}$, the net result is an overall efficiency of about 7% of the peak floating point compute-performance of the Stream Processors alone.
We implement Parseval's Theorem by partial sums off-loaded to the GPU, the results of which are summed by the CPU. At the few hundred Gflop s$^{-1}$ performance thus achieved, wall clock compute time is about 25% of that of clFFT on the GPU. Including overhead in (i)-(iv) of §3.2 and gathering tails (§3.3), the net result (including Parseval's step) is an overall efficiency of about 65%, normalised to clFFT alone as shown in Fig. 4, or about $8 \times 10^4$ correlations per second per GPU. On a cluster of about a dozen GPUs, we hereby realise about 1 million correlations per second, sufficient for a real-time analysis by up to 16 million templates according to (10). Filter output stored to disk is listed by block in files $B_n$, $n = 1, 2, \cdots, 288$, illustrated in Table 3.
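The quoted throughput translates into real-time capacity as follows; this is a back-of-envelope check, not a benchmark.

```python
# Real-time capacity implied by ~8e4 correlations per second per GPU on a
# cluster of about a dozen GPUs, with data arriving in 16 s segments.
per_gpu = 8.0e4                        # correlations per second per GPU
cluster = 12 * per_gpu                 # about 1e6 correlations per second
templates_realtime = cluster * 16      # templates searchable per 16 s segment
```

The result is roughly 16 million templates per 16 s segment, consistent with the real-time capacity stated above.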

TAILS AND LIGO BURST INJECTIONS
To illustrate a full analysis, Fig. 6 shows a pseudo-spectrum of the tails (9) of H1∧L1 LIGO S6, obtained by averaging results of blocks using a template bank of intermediate size of 0.5M chirps. The detailed structure shown represents the non-Gaussian features that carry any potentially relevant information, visible only by zooming in on the tails of an otherwise near-Gaussian PDF of the internal GPU output $\rho$ (Fig. 3). This has been verified numerically, by obtaining completely smooth spectra of tails of $\rho$ following time-randomisation of H1 or L1 data (Fig. 6). Fig. 6 shows various pronounced features, some of which are probably associated with unsteady behaviour in various instrumental lines familiar from conventional Fourier spectra of S6 strain noise (LOSC 2017a). These details remain to be understood, especially so given the non-trivial residual spectrum of simultaneous hits with frequency pairs $(f_1, f_2)$ of H1 and L1 that are relatively close, here shown with $|f_1 - f_2| < \Delta$, $\Delta = 50$ Hz. (A similar spectrum obtains for $\Delta = 100$ Hz.) For the analysis with a bank of 0.5M templates shown, the total counts per block for H1 and L1 are given in Table 4.

Table 3. Butterfly filtering output $B_n$ of a block $n$ ($n = 1, 2, \cdots, 288$) of hits $\rho_i > \kappa\sigma$ ($i = 1, 2$) lists the data sample offset $i \in \{1, 2, \cdots, 2^{27}\}$, $\rho_i$ and $f_i$, the latter being the initial frequency of the associated chirp template. Multiplication of $\rho_i$ by 1000 allows storage of all entries in 4 byte integers. The sample shown of B161 (6,388,647 rows, produced by a bank of 4M templates) highlights some simultaneous hits. Zeros represent no hit.

[Fig. 7 caption fragment: injection parameters (Table 4), seen in simultaneous hits in H1 and L1 (left panels; $\rho$ and frequency refer to geometric means of those of H1 and L1). Hits in H1 (red) and L1 (blue) practically overlap (right panel), shown as a function of time based on the data sample offset in output file B161 (Table 3).]
LIGO detectors are routinely given a variety of hardware injections to test the detectors and various signal detection pipelines. Of interest to the present analysis are burst injections covering the relatively high frequency range 350-2000 Hz. The following uses some LIGO injections for a formal test and validation of our software implementation. Fig. 7 shows an injection to both H1 and L1 captured by our algorithm at large injection SNR (LOSC), detected using a bank of 4M templates in a partial analysis of LIGO S6.
For high-confidence detections, correlated H1-L1 output such as illustrated in Fig. 7 is essential. While hardware signals are often injected at the same GPS time, astrophysical sources will impact H1 and L1 along some finite viewing angle. In the time domain, this is commonly accommodated by maximising correlations over some finite time shift, here 0-10 ms given the distance between H1 and L1. Here, we make use of the fact that a difference in arrival time between H1 and L1 from a putative astrophysical source with finite time rate-of-change in $f(t)$ is equivalent to a frequency shift, allowing searches in simultaneous H1-L1 filter output such as plotted in Fig. 7.

[Fig. 9 caption fragment: hit counts about the injection frequencies in Hz in H1 and L1 show a generic trend with SNR in the injection process. On average, counts improve by a factor of 1.69 and $\rho$ increments by 0.29 with the template bank of 8M compared to 0.5M chirps. Hits are counted within a frequency margin of ±50 Hz about these injection frequencies.]
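The time-shift/frequency-shift equivalence can be illustrated numerically; the drift rate below is hypothetical.

```python
# A chirp arriving with time offset dt between detectors appears frequency
# shifted by df = fdot * dt; the drift value here is illustrative only.
fdot = -1.0e3                                            # frequency drift, Hz/s
offsets = {dt: fdot * dt for dt in (0.0, 5e-3, 10e-3)}   # H1-L1 delays, s
# for delays up to ~10 ms, |df| stays within about 10 Hz at this drift rate
```

Such frequency offsets are small compared to the ±50 Hz margins used in counting simultaneous hits.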
Figs. 8-9 show a validation of sensitivity (see also the earlier analyses of van Putten et al. (2014a); van Putten (2016)), here a priori limited to $\rho$ exceeding $5.5\sigma$ by the choice of $\kappa$ in (9), obtained in a partial LIGO S6 analysis using 8M templates. Overall, it appears that sensitivity in H1 is slightly better than in L1 when signals are small. Searches for signals fainter than those shown would require a re-run of the analysis with $\kappa < 5.5$ in (9). For such extremely deep searches, excess tail sizes can conceivably be curtailed by generalising (9) to a finite band, $\kappa_1\sigma > \rho(t_n) > \kappa_2\sigma$ with $\kappa_1 - \kappa_2 \lesssim 1$. Fig. 9 quantifies the gain in using bank sizes beyond the minimal requirements (§3.1), showing an increase in hit counts and $\rho$ in a detection of a sample of high frequency burst injections.

CONCLUSIONS AND OUTLOOK
Probing inner engines of gamma-ray bursts and core-collapse supernovae requires deep searches in LIGO data. Taking full advantage of modern GPU hardware, we present a GPU-CPU implementation of butterfly filtering to search for broadband extended emission in gravitational waves from accretion flows around black holes, potentially relevant to the most extreme transient events.
Our benchmarks demonstrate near-optimal performance using banks of up to millions of chirp templates at better than real-time analysis, facilitating deep searches in LIGO archive data such as S6, Advanced LIGO O1 and the currently ongoing O2 run.
Specific applications of the proposed method include correlation analysis of the H1 and L1 detectors and identification of mysterious or peculiar events of interest for further analysis. A leading order indication of correlations may derive, for instance, from counting statistics of hits, comparing simultaneous hit counts with total hit counts in H1 and L1. Specific events of interest may be followed up by second runs, gathering all hits by removing the selection of maxima in collecting $B$ in (17).
In butterfly filtering, signal detection typically comprises a large number of hits, representing approximate matches with no single template providing a perfect match to the full signal at hand, as illustrated in Fig. 8. This combined output can in principle recover essentially maximal sensitivity (van Putten 2016). For automated searches of candidate events, clustering algorithms might apply (e.g. George et al. 2017), which may also facilitate quantifying the level of confidence for such complex detection output.