A nonparametric method to assess significance of events in search for gravitational waves with false discovery rate

In this paper, we present a consistent procedure to assess the significance of gravitational wave events observed by laser interferometric gravitational wave detectors, based on the background distribution of the detection statistic. We propose a nonparametric method to estimate the $p$-value. Based on the estimated $p$-values, we propose a new procedure to assess the significance of a particular event with the $q$-value, which is the minimum false discovery rate that can be attained when calling the event significant. The $q$-value gives a criterion for the significance of events that differs from $P_{\rm astro}$, which is used in the LIGO-Virgo analysis and in other analyses. The proposed procedure is applied to the 1-OGC and 2-OGC catalogs [2][3]. For most of the events that were claimed significant in [2] and [3], we obtain the same results. However, the significance differs for several marginal events. Since the proposed procedure does not require any assumptions on signal and noise, it is very simple and straightforward. The procedure is also applicable to other searches for gravitational waves whose background distribution of the detection statistic is difficult to know.


Introduction
The first gravitational wave event from a binary black hole coalescence, GW150914, was observed by the advanced LIGO detectors in the first observing run (O1) [1]. After the first detection, tens of gravitational wave events were reported [4]. During the second observing run (O2), the first gravitational waves from a binary neutron star coalescence, GW170817 [5], were observed by LIGO [6] and Virgo [7]. The follow-up observations by electromagnetic telescopes identified the host galaxy as NGC4993. The event strongly suggests the occurrence of radioactive decay of rapid neutron-capture process nuclei [8]. The discovery of these events has opened the field of gravitational wave astronomy. During the third observing run (O3), many candidate events were reported [9], and four events have been published individually [10][11][12][13]. Very recently, the GWTC-2 catalog, which reports the gravitational wave signals from compact binary coalescences during the first half of the O3 observation, was released [14]. In the coming years, the network of gravitational wave detectors consisting of the two LIGO detectors, Virgo and KAGRA [15] plans to perform coincident observation runs. As the detectors' sensitivities improve and the observation time becomes longer, we expect to observe more and more gravitational wave events.
In compact binary coalescence searches, we search for gravitational wave signals by maximizing the detection statistic over the template bank in a short time window. When the value of the detection statistic exceeds a given threshold, we record it as a trigger. Accordingly, for a given threshold, as the observing time and the template bank become larger, the probability that false triggers are produced by noise (the false alarm probability) becomes larger. This is called the multiple comparisons problem. Several methods have been proposed to control the false alarm probability. The Bonferroni correction is one such method (see Chapter 9 of [16]). However, these methods generally reduce the detection probability while controlling the false alarm probability.
Recently, the false discovery rate (FDR) was proposed to treat these problems (see Section 3 for the formal definition of the FDR). To the authors' knowledge, the FDR was first introduced to the gravitational wave community by Baggio and Prodi [17], but that paper did not discuss applications to actual problems. Recently, $P_{\rm astro}$ was introduced as a measure of true discovery for a particular event [18]. In the recent catalog of gravitational waves from compact binary mergers [4], a candidate event is considered to be of gravitational wave origin if the false alarm rate is less than one per 30 days and $P_{\rm astro}$ is larger than 0.5.
In this paper, we propose the use of the q-value, which is a measure of the FDR. We present a consistent procedure to assess the significance of candidate events by using the q-value. We first introduce a definition of the p-value based on the background distribution of the detection statistic. Then, we propose a new procedure to evaluate the q-value of each event by extending the procedure proposed by Storey and Tibshirani [19]. The original procedure by Storey and Tibshirani [19] is not applicable to a search for gravitational waves from compact binary coalescences, because it requires a complete list of p-values. In gravitational wave searches, a complete list of p-values is usually not available because we store only triggers whose detection statistic is larger than a certain threshold. We apply this procedure to the publicly available results of the 1-OGC and 2-OGC catalogs by Nitz et al. [2,3], and evaluate the q-value of each candidate event. We compare it with the significance of each candidate event evaluated by using $P_{\rm astro}$. We find that the two measures give almost consistent results on the significance of each candidate event. Although the conclusion depends on the thresholds chosen for the q-value and $P_{\rm astro}$, we also find that the two measures can lead to different conclusions for marginally significant events. We find one such event in the 2-OGC catalog.
The main advantage of our procedure is that it is completely nonparametric, namely, we do not assume any parametric model behind the data. Our procedure can be applied to other gravitational wave searches. The nonparametric evaluation of the p-value, the procedure to evaluate the q-value, and the estimation of the q-values of the LIGO-Virgo O1 and O2 candidate events with this procedure are all new in this paper. This paper is organized as follows. In Section 2, we discuss statistical hypothesis testing in the search for gravitational waves from compact binary coalescences. In Section 3, we present a procedure to assess the significance of a particular event with the false discovery rate. In Section 4, the proposed procedure is applied to the 1-OGC and 2-OGC catalogs. Section 5 is devoted to a summary and discussion.

Estimation of p-value
We first introduce the statistical terminology used in this paper. The definitions can be found in a standard textbook, such as [16]. By analyzing the data from gravitational wave detectors, we obtain events which have a larger signal-to-noise ratio than a threshold. Each event is classified as either signal or noise. If the event originates from a gravitational wave, it is called a signal. Otherwise, it is called noise. In the statistical literature, the noise model is called the null hypothesis (in this paper, also called the background) and the signal model is called the alternative hypothesis.
In the analysis of gravitational waves from compact binary coalescences, the event search is done by maximizing the detection statistic over the templates. The detection statistic is also maximized over time within a certain time window.
In statistical hypothesis testing, the p-value of an event is a measure of the significance of the event. It is the probability that the event or rarer events occur under the null hypothesis. If the p-value of the event is sufficiently small, the null hypothesis is rejected. Let us consider statistical hypothesis testing of each event based on the background distribution of the detection statistic.

A conventional p-value
In the LIGO-Virgo O1 analysis, the following p-value was used [20,21] (see Appendix A for a discussion of the derivation):
$$p_{\rm conv}(\rho) = 1 - e^{-\mu(\rho)}, \qquad \mu(\rho) = \frac{n_{\rm bg}(\rho)\, t_{\rm obs}}{t_{\rm bg}}, \tag{1}$$
where $\rho$ is the detection statistic of an event. In this paper, we call this the conventional p-value. Here, $t_{\rm obs}$ and $t_{\rm bg}$ are the time length of the analyzed data and the time length used for the estimation of the background distribution, respectively. The background data are usually generated by time-shifting the data of different detectors [21]. Moreover, $n_{\rm bg}(\rho)$ is the number of noise events in the background data whose detection statistics are equal to or larger than $\rho$. It is
$$n_{\rm bg}(\rho) = \sum_{i=1}^{n_{\rm bg}(0)} 1_{\{r_i \ge \rho\}}, \tag{2}$$
where $r_i$ is the detection statistic of the $i$-th event in the background data, and $1_{\{\cdot\}} = 1$ if $\{\cdot\}$ is true and 0 otherwise. From the definition, $n_{\rm bg}(0)$ is the total number of noise events in the background data. Therefore, $\mu(\rho)$ in Eq. (1) is the expected number of noise events during the observation time whose detection statistics are equal to or larger than $\rho$. The ratio $n_{\rm bg}(\rho)/t_{\rm bg}$ is usually called the false alarm rate of the event whose detection statistic is $\rho$.
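The conventional p-value can be illustrated with a minimal Python sketch. It assumes the Poisson form described in Appendix A, namely the probability of one or more noise events with mean $\mu(\rho) = n_{\rm bg}(\rho)\,t_{\rm obs}/t_{\rm bg}$. The function names and the toy numbers are hypothetical and are not taken from any search pipeline.

```python
import math

def n_bg(rho, background_stats):
    """Count background events whose detection statistic is >= rho."""
    return sum(1 for r in background_stats if r >= rho)

def p_conv(rho, background_stats, t_obs, t_bg):
    """Probability of one or more Poisson noise events at least as loud as rho."""
    # Expected number of noise events above rho during the observation time.
    mu = n_bg(rho, background_stats) * t_obs / t_bg
    # 1 - exp(-mu), written with expm1 to stay accurate for very small mu.
    return -math.expm1(-mu)

# Toy background sample (illustrative values only).
background = [5.1, 6.3, 7.8, 8.2, 9.5]
print(p_conv(8.0, background, t_obs=1.0, t_bg=100.0))
```

The times only enter through the ratio t_obs / t_bg, so any common unit can be used.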

Nonparametric estimation of p-value
Now, we introduce a nonparametric method to estimate the p-value. Let us assume that the background distribution is continuous. If we know the probability density function of the detection statistic under the null hypothesis, $f(r)$, the p-value of an event whose detection statistic is $\rho$ is given by
$$p(\rho) = \int_{\rho}^{\infty} f(r)\, dr = 1 - F(\rho), \tag{3}$$
where $F$ is the distribution function of the detection statistic under the null hypothesis. In reality the background distribution is unknown; nevertheless, it can be estimated nonparametrically (free from the assumption of a parameterized distribution) by using simulated background data. An estimator of the null distribution $F$ is given by
$$\hat{F}(\rho) = \frac{1}{n_{\rm bg}(0)} \sum_{i=1}^{n_{\rm bg}(0)} 1_{\{r_i \le \rho\}}.$$
It is important to distinguish $F$ and $\hat{F}$. The former is the (unknown) true background distribution, while the latter is an estimator of the background distribution. By the Glivenko-Cantelli theorem, $\hat{F}$ converges to $F$ almost surely and uniformly in $\rho$ [16]. Therefore, an estimator of the p-value of an event whose detection statistic is $\rho$ is given by
$$\hat{p}(\rho) = 1 - \hat{F}(\rho) = \frac{n_{\rm bg}(\rho)}{n_{\rm bg}(0)}, \tag{4}$$
where we used the fact that $\rho \neq r_i$ for $i = 1, \ldots, n_{\rm bg}(0)$. Note that $\hat{p}(\rho)$ is the empirical probability of obtaining an event whose detection statistic is larger than $\rho$ in the background data, and has been called (an estimator of) the false alarm probability in the gravitational wave community [23]. In addition, $\hat{p}(\rho)$ is proportional to the mean $\mu(\rho)$ in (1). The estimator (4) is a consistent estimator of the p-value (3), namely, $\hat{p}(\rho)$ converges to $p(\rho)$ almost surely for each $\rho$ by the strong law of large numbers. For later discussion, let us recall a basic property of a p-value. The p-value of a statistic $\rho$ following any continuous null distribution $F(\rho)$ follows the uniform distribution, because
$$P(p(\rho) \le u) = P(F(\rho) \ge 1 - u) = 1 - F(F^{-1}(1 - u)) = u$$
is the distribution function of the uniform distribution, where $0 \le u \le 1$ and $P(x)$ is the probability of $x$. It is worthwhile to mention that we cannot expect the conventional p-value given by (1), with $\rho$ following $F(\rho)$, to follow the uniform distribution (see Appendix A). In the discussion that follows, we discuss the p-value defined by (3).
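The nonparametric estimator described above is simply the fraction of background events at least as loud as the event in question. A minimal Python sketch follows; the helper name make_p_hat and the toy background sample are hypothetical. Sorting once and using a binary search keeps each evaluation fast when the background sample is large.

```python
from bisect import bisect_left

def make_p_hat(background_stats):
    """Return a function rho -> n_bg(rho) / n_bg(0) built from background data."""
    sorted_bg = sorted(background_stats)
    n_total = len(sorted_bg)
    if n_total == 0:
        raise ValueError("background sample must not be empty")

    def p_hat(rho):
        # bisect_left gives the number of background events strictly below rho,
        # so the remainder is the number of events with statistic >= rho.
        n_above = n_total - bisect_left(sorted_bg, rho)
        return n_above / n_total

    return p_hat

# Toy background sample (illustrative values only).
p_hat = make_p_hat([1.0, 2.0, 3.0, 4.0])
print(p_hat(2.5))
```

With a finite background sample the smallest nonzero estimate is 1 / n_bg(0), so an event louder than every background event only yields an upper limit on its p-value.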

Assessment of significance with false discovery rate
In this section, we describe statistical hypothesis testing using detection statistics and how to assess significance with the false discovery rate. When we perform the statistical test, each event falls into one of four possible outcomes, which are summarized in Table 1. There are two kinds of truth (noise or signal) and two kinds of claim (called significant or called not significant). $F$ and $T$ are the numbers of noise and signal events called significant, respectively, and $S$ is the total number of events called significant. $n_0$ and $n_1$ are the numbers of noise and signal events, respectively, and $n_{\rm obs}$ is the total number of events in the observed data.
In statistical hypothesis testing, a p-value threshold is selected to keep the number of false positives $F$ small. When we select the threshold $\alpha$, the expected number of false positives is $\alpha n_{\rm obs}$. If $n_{\rm obs}$ is very large, $\alpha$ should be selected to be very small.
Here, the probability $P(F \ge 1)$ is called the familywise error rate. The familywise error rate is simply called the false alarm probability in the gravitational wave community, but in this paper we refer to it as the familywise false alarm probability (or familywise error rate) to avoid confusion. The word "family" means that we test a hypothesis by using $n_{\rm obs}$ tests. To control the familywise error rate such that $P(F \ge 1) \le \alpha$, that is, so that the probability that at least one noise event is called significant is at most $\alpha$, one solution is to change the threshold $\alpha$ to $\alpha/n_{\rm obs}$. This method is called Bonferroni's procedure (see Chapter 9 of [16]).
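The effect of Bonferroni's procedure can be checked numerically with a short Python sketch. The closed form used for $P(F \ge 1)$ assumes that the $n_{\rm obs}$ tests are independent and that every event is noise; this independence is an assumption made here for illustration and is not required by Bonferroni's procedure itself. The function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the familywise error rate <= alpha."""
    return alpha / n_tests

def familywise_error_rate(alpha_per_test, n_tests):
    """P(F >= 1) for n_tests independent tests when every event is noise."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests

n_obs = 1000
# Without correction, at least one false positive is almost certain.
print(familywise_error_rate(0.05, n_obs))
# With the Bonferroni threshold, the familywise error rate stays below 0.05.
print(familywise_error_rate(bonferroni_threshold(0.05, n_obs), n_obs))
```

The corrected per-test threshold shrinks in proportion to 1 / n_obs, which is why the detection probability suffers when the number of tests is large.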
Unfortunately, controlling the familywise error rate is practical only when extremely few events are expected to be signal. Otherwise, controlling the familywise error rate is too conservative and the statistical power of the test procedure is too poor. Benjamini and Hochberg [24] introduced the false discovery rate, which is defined as the expected value of $F/S$, $E(F/S, S > 0)$, where $F$ and $S$ are introduced in Table 1, and gave a test procedure to keep the FDR below a threshold. A fairly recent survey of the FDR is [25]. Note that the false positive rate and the FDR are quite different measures. A false positive rate of 5% means that 5% of noise events are called significant. On the other hand, an FDR of 5% means that 5% of the events called significant are noise events. Controlling the FDR should be more powerful than controlling the familywise error rate, since the FDR is less than or equal to the familywise error rate [24].
Storey and Tibshirani [19] introduced the q-value for a particular event, which is the expected proportion of false positives incurred when calling the event significant. Let us define ${\rm FDR}(u)$, which is the FDR when calling significant all events whose p-value is less than or equal to a threshold $u$, where $0 < u \le 1$, namely,
$${\rm FDR}(u) = E\left(\frac{F(u)}{S(u)},\ S(u) > 0\right), \tag{5}$$
where $E(x, y > 0)$ is the expectation of $x$ given $y > 0$. Here, $F(u)$ is the number of noise events whose p-value is smaller than or equal to the threshold $u$, and $S(u)$ is the number of both noise and signal events whose p-value is smaller than or equal to the threshold $u$. The q-value is defined as the minimum FDR that can be attained when calling the event significant, namely,
$$q(p_i) = \min_{u \ge p_i} {\rm FDR}(u), \tag{6}$$
where $i = 1, \ldots, n_{\rm obs}$ and the p-value given by (3) of the $i$-th event is denoted by $p_i$. Note that ${\rm FDR}(u)$ is not always monotonically increasing in the threshold $u$. Taking the minimum guarantees that the q-value is non-decreasing in the p-value.

Let us recall the procedure for estimating the q-value proposed by Storey and Tibshirani [19]. Their estimator of ${\rm FDR}(u)$ is
$$\widehat{\rm FDR}(u) = \frac{\hat{\pi}_0\, n_{\rm obs}\, u}{S(u)} = \frac{\hat{\pi}_0\, n_{\rm obs}\, u}{\#\{p_i \le u;\ i = 1, \ldots, n_{\rm obs}\}}, \tag{7}$$
where $\hat{\pi}_0$ is an estimator of $\pi_0 = n_0/n_{\rm obs}$, which indicates the overall proportion of noise events in the data. Roughly speaking, (7) is a sample mean whose population mean is (5). Since the p-value of a statistic follows the uniform distribution under the null hypothesis (see Section 2), the numerator of (7) is an estimator of $F(u)$. How to estimate $\pi_0$ is the central issue. In gravitational wave searches, very few events are expected to be signal. In such a case, we can assume $\hat{\pi}_0 \simeq 1$. In Appendix B, we show that this assumption is justified by using the 1-OGC and 2-OGC catalogs. We thus set $\hat{\pi}_0 = 1$.
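We can construct an estimator of the q-value by plugging the estimator of the p-value (4) and the estimator of the FDR (7) into the expression (6) and setting $\hat{\pi}_0 = 1$. The result is
$$\hat{q}(\hat{p}_i) = \min_{u \ge \hat{p}_i} \frac{n_{\rm obs}\, u}{\#\{\hat{p}_j \le u;\ j = 1, \ldots, n_{\rm obs}\}}, \tag{8}$$
where $\hat{p}_i = \hat{p}(r_i)$.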
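For a complete list of p-values, the plug-in estimator of the q-value with $\hat{\pi}_0 = 1$ can be sketched in Python as below. Because the estimated FDR grows with the threshold between two consecutive p-values, the minimum over the threshold only needs to be checked at the observed p-values themselves, which is done here by sweeping from the largest p-value downward. The function name q_values and the toy input are hypothetical, and the sketch assumes that every tested event is in the list.

```python
def q_values(p_values):
    """Estimate q-values from a complete list of p-values, with pi_0 set to 1."""
    n = len(p_values)
    # Indices of the events ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    q = [0.0] * n
    # Threshold u = 1 calls every event significant and gives an estimate of 1.
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        # Estimated FDR when the threshold equals this event's p-value:
        # expected number of noise events (n * p) over the number called (rank).
        fdr = n * p_values[i] / rank
        running_min = min(running_min, fdr)
        q[i] = running_min
    return q

# Toy p-values (illustrative values only).
print(q_values([0.01, 0.04, 0.03, 0.5]))
```

In this toy input the event with p = 0.03 receives the same q-value as the event with p = 0.04, which shows how taking the minimum keeps the q-values ordered like the p-values.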

Application to 1-OGC and 2-OGC results
In this section, we evaluate the q-values of events in the 1-OGC catalog [2] and in the 2-OGC catalog [3]. We use the data available at https://github.com/gwastro/1-ogc and https://github.com/gwastro/2-ogc. The available data sets contain information on each event such as the time, the false alarm rate in units of yr$^{-1}$, the value of the ranking statistic, the two masses, the dimensionless spin component of each star perpendicular to the orbital plane, etc. Each catalog consists of the complete and bbh data sets. There are 146,214 and 12,741 events in the complete and bbh data sets of 1-OGC, and 733,231 and 502,994 events in the complete and bbh data sets of 2-OGC, respectively. The complete data set contains all candidate events from the full analysis, and the bbh data set contains the candidate events from the analysis targeted at the BBH region [2,3].
Since the p-values of events are not available in these catalogs, we need to evaluate them from the false alarm rates (FAR). An estimate of the FAR is given by $n_{\rm bg}(\rho)/t_{\rm bg}$, where $t_{\rm bg}$ is the length of the data used for background estimation, and $n_{\rm bg}(\rho)$ is defined by Eq. (2). The events in the catalog are defined by taking the event which gives the maximum detection statistic within a certain time window $\Delta t$ and over the template bank used in the analysis. Thus, the total number of background events, $n_{\rm bg}(0)$, is given as $t_{\rm bg}/\Delta t$. In both 1-OGC and 2-OGC, $\Delta t = 10$ seconds is used. Then, from Eq. (4), we obtain an estimate of the p-value of an event as
$$\hat{p}(\rho) = \frac{n_{\rm bg}(\rho)}{n_{\rm bg}(0)} = \frac{n_{\rm bg}(\rho)}{t_{\rm bg}}\, \Delta t = {\rm FAR} \times \Delta t. \tag{9}$$
We note that the candidate events in these data sets are not all events, in the sense that only events with relatively low false alarm rates are recorded. This is done for a practical reason, to reduce the computation time of the analysis. This is a typical situation in gravitational wave analysis.
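The conversion from a catalog false alarm rate to an estimated p-value is a single multiplication by the time window, as the following Python sketch shows. The function name is hypothetical, and the length of a year (365.25 days) is an assumption made here for the unit conversion; the catalogs may use a slightly different convention.

```python
SECONDS_PER_DAY = 86400.0
# Assumed length of a year for converting FAR from yr^-1 to s^-1.
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

def p_value_from_far(far_per_year, window_seconds=10.0):
    """Estimated p-value: FAR times the time window, capped at 1."""
    return min(1.0, far_per_year * window_seconds / SECONDS_PER_YEAR)

# A FAR of one per 30 days with a 10 second window.
far_one_per_30_days = 365.25 / 30.0  # in yr^-1
print(p_value_from_far(far_one_per_30_days))  # about 3.9e-6
```

The example reproduces the arithmetic 10 s / 30 days, which is about 3.9e-6.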
Since not all candidate events are available, we cannot use the algorithm originally proposed in [19], which is explained as Algorithm 2 in Appendix C. Instead, we propose an alternative procedure for estimating the q-value, which is a modified version of Algorithm 2. Appendix C explains why Algorithm 1 yields estimates of the q-value defined in (8).
Algorithm 1. We compute estimates of the q-values defined in (8). Let $m$ be the number of events whose false alarm rates are less than some value. Assume that the events with p-values around and larger than $\hat{p}_{(m)}$ are noise.
1. Compute estimates of p-value.
5. The estimated q-values for the i-th most significant event isq (i) .
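The intermediate steps of Algorithm 1 are not reproduced in this text, so the following Python code is only a hypothetical sketch of how q-values might be estimated from a truncated list, and it may differ from the authors' algorithm. It assumes the normalization described in Appendix B: the $m$ recorded events play the role of $n_{\rm obs}$, and noise p-values are treated as uniform on the interval from 0 to the largest recorded p-value $\hat{p}_{(m)}$, with $\hat{\pi}_0 = 1$. The function name truncated_q_values is an illustrative choice.

```python
def truncated_q_values(p_sorted):
    """Sketch: q-values for the m smallest p-values, given in ascending order.

    Assumes noise p-values are uniform on (0, p_sorted[-1]) and pi_0 = 1.
    """
    m = len(p_sorted)
    if m == 0:
        return []
    p_max = p_sorted[-1]
    q = [0.0] * m
    running_min = 1.0
    for k in range(m, 0, -1):
        # Expected number of noise events below p_(k), divided by the number
        # of events actually found below p_(k).
        fdr = m * p_sorted[k - 1] / (k * p_max)
        running_min = min(running_min, fdr)
        q[k - 1] = running_min
    return q

# Toy truncated list (illustrative values only).
print(truncated_q_values([1e-6, 0.1, 0.2, 0.3]))
```

Under this assumption the least significant recorded event always receives a q-value of 1, and the q-values of the most significant events are insensitive to where the list is truncated as long as the events near the cut are noise.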

1-OGC results
In the 1-OGC catalog [2], the True Discovery Rate (TDR) and $P_{\rm astro}$ are given to evaluate the significance of events. The TDR is the complement of the FDR, ${\rm FDR} = 1 - {\rm TDR}$. Note, however, that the evaluation of the TDR in [2] is a very conservative estimate. In [2], an estimate of the TDR is defined as
$${\rm TDR}(\rho_c) = \frac{\mathcal{T}(\rho_c)}{\mathcal{T}(\rho_c) + \mathcal{F}(\rho_c)}, \tag{10}$$
where $\mathcal{T}(\rho_c)$ is the rate at which signals of astrophysical origin are observed with a ranking statistic $\ge \rho_c$, and $\mathcal{F}(\rho_c)$ is the FAR. In [2], to estimate $\mathcal{T}(\rho_c)$, the two significant events GW150914 and GW151226 are assumed to be real astrophysical signals, and $\mathcal{T} \sim 15\,{\rm yr}^{-1}$ is obtained. In order to take into account the uncertainty in an estimate based on only two events, the Poisson distribution is assumed for the observed number, and $\mathcal{T} \sim 2.7\,{\rm yr}^{-1}$ is obtained as a lower 95% bound. In [2], this value is used in (10) for all events other than GW150914 and GW151226. On the other hand, $P_{\rm astro}$ is the posterior probability that a particular event has an astrophysical origin. In the 1-OGC catalog [2], it is estimated as
$$P_{\rm astro}(\rho_c) = \frac{\Lambda_S P_S(\rho_c)}{\Lambda_S P_S(\rho_c) + \Lambda_N P_N(\rho_c)}, \tag{11}$$
where $P_S(\rho_c)$ and $P_N(\rho_c)$ are the probability densities of an event having ranking statistic $\rho_c$ given that the event is signal or noise, respectively, and $\Lambda_S$ and $\Lambda_N$ are the rates of signal and noise events. In order to estimate $\Lambda_S P_S(\rho_c)$, an analytic model of the signal distribution and a fixed conservative rate of mergers are used, assuming that two events (GW150914 and GW151226) are of astrophysical origin.
Figure 1 shows the q-values computed using Algorithm 1 from the p-values of events in the complete data set. Table 2 summarizes the estimated p-values and q-values for the 10 most significant events. Figure 2 shows the q-values computed using Algorithm 1 from the p-values of events in the bbh data set. Table 3 summarizes the estimated p-values and q-values for the 10 most significant events, together with the inverse of the false alarm rate, $1 - {\rm TDR}$ and $P_{\rm astro}$ given in the 1-OGC catalog.
For the first two events, since only an upper limit on the false alarm rate was evaluated in [2], the estimated p-values of these events should be considered upper limits. $1 - {\rm TDR}$ and $1 - P_{\rm astro}$ are not given for the top two events in [2], since these events are used to estimate the TDR and $P_{\rm astro}$ of the other events.
Following [2], we discuss the significance of events for the bbh case. In Table 3, if we call significant the events whose q-value is smaller than 0.05, the top three events are significant. The expected proportion of false discoveries among the three events is less than 0.05. Since the q-value of GW151012 (151012+09:54:43) is $9.83 \times 10^{-5}$, it is significant enough to be regarded as a true signal. In [2], since $P_{\rm astro}$ for GW151012 is $9.76 \times 10^{-1}$, which is larger than 0.5, GW151012 is called significant. Thus, the results of the q-value and $P_{\rm astro}$ are consistent for this event.
In Table 3, we find two marginally not significant events, 160103+05:48:36 and 151213+00:12:20, whose q-values are $8.31 \times 10^{-2}$ and $8.53 \times 10^{-2}$, respectively. On the other hand, $P_{\rm astro}$ for these events is small, $6.07 \times 10^{-2}$ and $4.66 \times 10^{-2}$, respectively. So in [2], these two events are called not significant. Although the conclusions are the same, the significance differs slightly between the q-value and $P_{\rm astro}$ in [2], and this difference might be interesting. However, since these two events do not appear in the 2-OGC catalog discussed in the next subsection, we do not investigate them further.
The value of $1 - {\rm TDR}$ is about one order of magnitude larger than the q-value for all events. Since the TDR in [2] is a very conservative estimate, this difference is not surprising. Even in this case, $1 - {\rm TDR}$ for GW151012 is $8.29 \times 10^{-4}$. Thus, this event can be called significant. But the TDR for 160103+05:48:36 and 151213+00:12:20 is 0.483 and 0.545. Thus, these cannot be called marginal events.
In the LIGO-Virgo GWTC-1 catalog of gravitational waves from compact binary mergers during O1 and O2 [4], a necessary condition for an event to be considered a gravitational wave signal is that the FAR of the event is less than one per 30 days, which corresponds to a p-value of $10\,({\rm sec})/30\,({\rm days}) = 3.9 \times 10^{-6}$. By linearly fitting the data in Figs. 1 and 2, we find that this p-value corresponds to q-values of 0.411 and 0.240, respectively. A q-value of 0.05 corresponds to a FAR of one per 271 days and one per 246 days, respectively. The q-value threshold of 0.05 is thus more stringent than the FAR threshold of one per 30 days.
When we compare the q-values of the same event, the q-value in Table 3 is smaller than that in Table 2. The reason for this difference is that the events in the two data sets are computed from different numbers of templates. A smaller number of templates decreases the false alarm rate and the p-value. Accordingly, it produces a different q-value.

2-OGC results
Figure 3 shows the q-values as a function of the p-values in the complete data set. Table 4 summarizes the estimated q-values of the 30 most significant events. Figure 4 shows the q-values as a function of the p-values in the bbh data set. Table 5 summarizes the estimated q-values for the top 30 events. $P_{\rm astro}$ computed in the 2-OGC paper [3] is also shown in this table (see footnote 3).
We discuss the significance of events for the bbh case. In Table 5, if we call significant the events whose q-value is smaller than 0.05, the top 13 events are called significant. In [3], these 13 events are called significant since $P_{\rm astro}$ is larger than 0.5. Thus, the results of the q-value and $P_{\rm astro}$ are consistent with each other. On the other hand, we obtain a different result for 151205+19:55:25. The q-value of this event is 0.07, while $P_{\rm astro}$ is 0.525. Thus, this is definitely
Footnote 3: The method to estimate $P_{\rm astro}$ in [3] is based on a mixture model developed in Farr et al. [18] and employed in the GWTC-1 catalog by the LIGO-Virgo collaboration [4].
Table 3 The same as Table 2, but obtained from the bbh data set. FAR, $1 - {\rm TDR}$ and $1 - P_{\rm astro}$ are obtained from the 1-OGC catalog.

Summary and Discussion
In this paper, we presented a consistent procedure to assess the significance of each event. We proposed an estimator of the p-values (4) of a particular event in the statistical hypothesis
Table 4 Estimated p-values and q-values of the events of the complete data set of 2-OGC. Events are sorted by false alarm rate and the top 30 events are shown. The inverse false alarm rates (FAR$^{-1}$) are obtained from the data set. p-values are computed by (4). q-values are computed by Algorithm 1.
In this procedure, we use the property that p-values follow the uniform distribution under the null hypothesis, and we do not need any assumptions on the distribution of signals. We applied this procedure to the 1-OGC and 2-OGC catalog data [2] [3]. A procedure to evaluate the q-value already exists in the literature [19]. However, since not all events in the analysis are available in the catalogs, we proposed a new procedure to evaluate the q-value, which is a modified version of the original one.
For the bbh case of 2-OGC, we have 13 significant events. All of them are also identified as significant based on $P_{\rm astro}$ in [3]. There is one marginal event, 151205+19:55:25. The q-value of this event is 0.07, but the $P_{\rm astro}$ computed in [3] is 0.525. Thus, the q-value suggests that this event is marginally not significant, while $P_{\rm astro}$ suggests that it is marginally significant. It is not easy to conclude from these results alone whether this signal is of astrophysical origin.
The method for estimating the q-value presented in this paper is very simple because we do not need any assumptions on the distribution of noise and signal. Note that the q-value and $P_{\rm astro}$ are based on fundamentally distinct statistical disciplines. The q-value is a frequentist measure, which is devised to estimate the FDR of events over some threshold of significance without any assumptions on signals. In contrast, $P_{\rm astro}$ is a Bayesian measure, which is devised to estimate the posterior probability of astrophysical origin of a particular event, relying on prior assumptions on signals. Nevertheless, from the results discussed above, we found that both approaches provide almost the same conclusions. This agreement is not at all trivial. It would suggest that the prior assumptions on signals used in the computation of $P_{\rm astro}$ are close to reality. It should be useful to estimate the q-value as well as $P_{\rm astro}$ in gravitational wave searches. This should be true especially for marginal events like 151205+19:55:25 in this paper. We can obtain additional information on the significance of an event from a different criterion.
We also note that the procedure for estimating the q-value presented in this paper is applicable to other searches for gravitational waves. Our procedure for estimating the q-value is not restricted to specific searches, and covers gravitational wave searches whose true background distribution of the detection statistic is difficult to know, because our procedure is based on the empirical distribution, which is always available by time-shifting the time-series data of different detectors.
A. Derivation and meaning of $p_{\rm conv}$
As in various scientific research fields [33], there might be some confusion in the use of the p-value in the gravitational wave community. In the recent American Statistical Association statement on p-values [33], the first principle is "P-values can indicate how incompatible the data are with a specified statistical model". Therefore, whenever we speak of a p-value, we must make clear which statistical model we are referring to. In this appendix, we discuss the derivation and meaning of the conventional p-value $p_{\rm conv}$ defined by (1), which is the probability of observing one or more noise events as strong as a signal whose detection statistic is $\rho$ under the noise model. In the analysis paper of the event GW150914 [20], Abbott et al. called $p_{\rm conv}$ a p-value; however, in the main text we have not called it simply a p-value, to avoid possible confusion with the p-value defined by (3).
Let us look in more detail at the probability (1), which was proposed by Usman et al. in the Appendix of [21]. The total number of noise events in the observed data, $N$, is modeled parametrically with the Poisson distribution of mean $\mu$:
$$P(N = n) = \frac{\mu^n}{n!} e^{-\mu}, \qquad n = 0, 1, 2, \ldots,$$
where $\mu = \mu(\rho)$. The slight difference between the expression of $\mu(\rho)$ in (1) and the expression $(1 + n_{\rm bg}(\rho))\, t_{\rm obs}/t_{\rm bg}$ in Eq. 17 of [21] (the unity in the numerator) comes from the fact that the model used by Usman et al. [21] involves observed events. In contrast, (1) is based only on the noise events in simulated background data, because the authors of the present paper believe that the noise model is better constructed from noise events only. In addition, Usman et al. [21] considered the randomness in the number of candidate events and then marginalized it out. However, these steps have no influence on the final expression if $n_{\rm bg}(\rho) \ll n_{\rm bg}(0)$ (compare Equations A.4 and A.12 in [21]). Then, the probability of observing one or more noise events as strong as a signal whose detection statistic is $\rho$ under the noise model during the observation time, $P(N \ge 1)$, is given by (1). In the same manner, if we consider the probability of observing $n_0$ or more noise events as strong as a signal whose detection statistic is $\rho$ under the noise model during the observation time, the p-value is
$$p_{\rm conv}(\rho; n_0) := P(N \ge n_0) = \sum_{n \ge n_0} \frac{\mu^n}{n!} e^{-\mu}.$$
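The Poisson tail probability $P(N \ge n_0)$ above can be evaluated with a few lines of Python. This is a minimal sketch with a hypothetical function name; it computes the tail as one minus the finite sum over $n < n_0$, which avoids truncating an infinite series.

```python
import math

def p_conv_n(mu, n0):
    """P(N >= n0) for a Poisson-distributed count N with mean mu."""
    term = math.exp(-mu)  # P(N = 0)
    cdf_below = 0.0       # accumulates P(N < n0)
    for n in range(n0):
        cdf_below += term
        term *= mu / (n + 1)  # P(N = n + 1) from P(N = n)
    return 1.0 - cdf_below

# For n0 = 1 this is the probability of one or more noise events.
print(p_conv_n(0.5, 1))
print(p_conv_n(0.5, 2))
```

For very small mu the subtraction from 1 loses precision, so a production implementation would likely use a dedicated survival function instead.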
B. Discussion on $\hat{\pi}_0$
In this Appendix, we show that $\hat{\pi}_0$ in Eq. (7) can be approximated as $\hat{\pi}_0 \simeq 1$. $\hat{\pi}_0$ is an estimator of $\pi_0 = n_0/n_{\rm obs}$, which indicates the overall proportion of noise events in the data. Setting $\hat{\pi}_0 = 1$ is reasonable when very few events are expected to be signal, as in gravitational wave searches. In fact, Benjamini and Hochberg's proposal [24] was to set $\hat{\pi}_0 = 1$. On the other hand, for data in which some portion of the events is expected to be signal, such as in genomewide studies, Storey and Tibshirani [19] proposed $\hat{\pi}_0 = \hat{f}(1)$, where $\hat{f}(\lambda)$, $\lambda \in (0, 1)$, is an estimate of the function $f(\lambda)$ discussed in (B2). We consider a list of p-values which contains $m$ p-values less than a certain value, and set $n_{\rm obs} = m$. We assume that the maximum p-value in this list is $\hat{p}_{(m)}$. In this case, $n_0$ is the number of noise events whose p-values are between 0 and $\hat{p}_{(m)}$. We consider a function
$$n(\lambda) = \frac{\#\{p_i > \lambda;\ i \in \{1, \ldots, m\}\}}{1 - \lambda/\hat{p}_{(m)}}, \tag{B1}$$
where $0 < \lambda < \hat{p}_{(m)}$. If all p-values larger than $\lambda_0$ are noise, $E(n(\lambda)) = n_0$ for $\lambda > \lambda_0$, since the p-values of noise events follow the uniform distribution.
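The function $n(\lambda)$ described above counts the p-values above $\lambda$ and rescales the count by the fraction of the interval from 0 to $\hat{p}_{(m)}$ that lies above $\lambda$. A minimal Python sketch is given below; the function name and the synthetic, evenly spaced p-values are hypothetical and serve only to show that $n(\lambda)/m$ stays close to 1 when every event is noise.

```python
def n_lambda(p_values, lam):
    """Rescaled count of p-values above lam, for a list truncated at max(p_values)."""
    p_max = max(p_values)
    if not 0.0 < lam < p_max:
        raise ValueError("lam must lie strictly between 0 and the largest p-value")
    count_above = sum(1 for p in p_values if p > lam)
    # Uniform noise on (0, p_max) puts a fraction 1 - lam / p_max above lam.
    return count_above / (1.0 - lam / p_max)

# Synthetic noise-only list: 1000 evenly spaced p-values up to 0.3.
m = 1000
p_noise = [0.3 * (i + 1) / m for i in range(m)]
print(n_lambda(p_noise, 0.1) / m)
```

Since n_obs is set to m, the ratio n(lambda) / m plays the role of an estimate of the noise proportion n_0 / n_obs.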
As an estimator of $\pi_0$, let us consider a function
$$f(\lambda) = \frac{n(\lambda)}{m} = \frac{\#\{p_i > \lambda;\ i \in \{1, \ldots, m\}\}}{m\,(1 - \lambda/\hat{p}_{(m)})}, \tag{B2}$$
where $0 < \lambda < \hat{p}_{(m)}$. If all p-values larger than $\lambda_0$ are noise, $E(f(\lambda)) = \pi_0$ for $\lambda > \lambda_0$. In particular, if all p-values are noise, $E(f(\lambda)) = 1$ for $\lambda \in (0, \hat{p}_{(m)})$. Figure B1 is the plot of $f(\lambda)$ for the complete data set of 1-OGC. In this plot, we use $m = 124{,}524$ events whose p-value is less than 0.3. We can see that $f(\lambda)$ in Eq. (B2) is almost unity for $0 < \lambda < 0.25$. We have $0.99 < f(\lambda) < 1.01$ in this region. This means that almost all p-values are noise except for a very few p-values around zero. The larger scatter in $0.25 < \lambda < 0.3$ is due to the statistical fluctuation caused by the smaller number in the numerator of Eq. (B2). Since we are mainly interested in the events with small p-values, less than $10^{-2}$, we set $\hat{\pi}_0 = 1$. The situation is similar for the bbh case. Figure B2 is the plot of $f(\lambda)$ for $0 < \lambda < 0.025$ for the bbh data set of 1-OGC. In this plot, we use $m = 10{,}429$ events whose p-value is less than 0.025. We have $0.98 < f(\lambda) < 1.02$ for $0 < \lambda < 0.020$. We have a larger deviation from unity,