## Abstract

The resampling-based test, which often relies on permutation or bootstrap procedures, has been widely used for statistical hypothesis testing when the asymptotic distribution of the test statistic is unavailable or unreliable. It requires repeated calculations of the test statistic on a large number of simulated data sets for its significance level assessment, and thus it could become very computationally intensive. Here, we propose an efficient *p*-value evaluation procedure by adapting the stochastic approximation Markov chain Monte Carlo algorithm. The new procedure can be used easily for estimating the *p*-value for any resampling-based test. We show through numeric simulations that the proposed procedure can be 100–500 000 times as efficient (in term of computing time) as the standard resampling-based procedure when evaluating a test statistic with a small *p*-value (e.g. less than 10^{ − 6}). With its computational burden reduced by this proposed procedure, the versatile resampling-based test would become computationally feasible for a much wider range of applications. We demonstrate the application of the new method by applying it to a large-scale genetic association study of prostate cancer.

*p*-value, Resampling-based tests, Stochastic approximation Markov chain Monte Carlo

## INTRODUCTION

*p*-value evaluation is a crucial ingredient in statistical hypothesis testing. Most often, the *p*-value can be obtained according to the asymptotic distribution of the test statistic when the sample size is sufficiently large. But in many applications, the asymptotic distribution is either unreliable due to insufficient sample size or unavailable for complicated test statistics. Resampling-based procedures, such as permutation (Good, 2005) or bootstrap (Efron and Tibshirani, 1994), have been widely used for evaluating the *p*-value in such situations.

The resampling-based procedure is easy to implement, but it requires repeated calculations of the test statistic on a large number of data sets simulated under some particular simulation scheme. In much current biomedical research, such as medical imaging studies (e.g. Naiman and Prebe, 2001) and genetic studies (e.g. Wellcome Trust, 2007), a large number of tests (sometimes over 1 million tests) are conducted in order to identify very few targets of interest. Invariably, when so many tests are conducted, we face the task of evaluating extremely small *p*-values. Also, if the family-wise false-positive rate is to be maintained at an acceptable level, any individual test must have an extremely small *p*-value to be declared globally significant. For instance, in the case of genome-wide association study (GWAS) with half a million genetic markers, a commonly accepted threshold for global significance is 10^{ − 7}. Using standard resampling-based procedures to estimate such a small *p*-value can be a daunting task, as [100/*p*] iterations are needed to reach the precision of , where *p* denotes the true *p*-value and is the standard deviation of the resampling-based estimate (Kimmel and Shamir, 2006 ; Shi *and others*, 2007). Thus, it generally requires about 10^{9} iterations to achieve a reliable estimate for a *p*-value of 10^{ − 7}. This puts a huge computational burden on resampling-based testing procedures.

Our method was motivated by a novel analysis of the Cancer Genetic Markers of Susceptibility (CGEMS) genetic association study of prostate cancer (Yeager *and others*, 2007; Thomas *and others*, 2008). The analysis was performed on stage II follow-up study in which 3941 prostate cancer cases and 3964 controls were genotyped for 27 383 single nucleotide polymorphisms (SNPs) that showed promising association evidence in the CGEMS stage I GWAS. A number of GWAS studies including CGEMS have previously shown that the chromosomal region of 8q24 contains multiple susceptibility SNPs for prostate cancer and some other cancers as well. In spite of its central role in cancer genetics, the biologic function of this region is not well understood because of lack of known genes in this chromosomal region. The goal of our analysis of the CGEMS study was to scan for SNPs that may influence the risk of prostate cancer through interaction with 8q24 SNPs and thus could provide insight into the biology of the 8q24 region. We used the Tukey score test proposed by Chatterjee *and others* (2006) to conduct such a scan. Since the asymptotic distribution for the Tukey score test is not available, Chatterjee *and others* (2006) proposed a resampling-based procedure for the *p*-value evaluation. For detection of a significant SNP among 27 383 SNPs at a family-wise type I error rate of 0.05, the Bonferroni corrected *p*-value threshold for individual SNPs is 1.83×10^{ − 6}. To accurately evaluate *p*-values around or below this level, according to the above argument, the standard resampling-based procedure needs at least 10^{8} iterations. However, due to its complex nature and large sample size, a single run of the Tukey score test for this project takes about 0.02 s of CPU time on a 2.6 GHz Opteron Linux machine. As a result, the *p*-value evaluation for a single SNP using the resampling-based test with 10^{8} iterations requires a total of 556 h of CPU time. Thus, it is computationally very challenging, if not infeasible, to apply the standard resampling-based test to this project.

To alleviate the computational obstacle presented by the standard resampling-based procedure, various alternatives have been proposed. Theoretic approximation formulas have been derived for particular problems (e.g. Siegmund, 1988; Feingold *and others*, 1993; Conneely and Boehnke, 2007; Li *and others*, 2008). Monte Carlo methods based on importance sampling have been developed for some specific applications, such as the scan statistic for genetic linkage studies (Anquist and Hossjer, 2004; Shi *and others*, 2007) and the genetic association tests based on contingency table analysis (Kimmel and Shamir, 2006). Although those alternatives work well for their targeted applications, they are not sufficiently general to be adapted readily for other applications. They achieve computational efficiency at the cost of application generality.

In this paper, we propose a *p*-value evaluation procedure that achieves both computational efficiency and application generality. The procedure is based on the recently developed stochastic approximation–based Monte Carlo algorithm (SAMC) (Liang *and others*, 2007). Once a simulation scheme (e.g. random permutation of the outcome) has been selected for generating the data under the null hypothesis, the SAMC algorithm can be used to evaluate the distribution of the test statistic effectively under that simulation scheme and to estimate the *p*-value in a dramatically reduced number of iteration steps.

## REVIEW OF THE SAMC ALGORITHM

Here, we briefly describe the basic SAMC algorithm. More details are given in Liang *and others* (2007). Suppose that we are interested in sampling the distribution , where *Z* is the normalizing constant and *X* is the sample space. Using a function *λ*(*x*) (e.g. a chosen test statistic), the SAMC algorithm first partitions the sample space into *m* disjoint subregions denoted by *E*_{1} = {*x*:*λ*(*x*) < *λ*_{1}}, *E*_{2} = {*x*:*λ*_{1} ≤ *λ*(*x*) < *λ*_{2}}, …, *E*_{m − 1} = {*x*:*λ*_{m − 2} ≤ *λ*(*x*) < *λ*_{m − 1}}, and *E*_{m} = {*x*:*λ*(*x*) ≥ *λ*_{m − 1}}, where *λ*_{1},…,*λ*_{m − 1} are a set of monotonically increasing real numbers specified by the user. To ensure a thorough exploration of the sample space *X*, SAMC seeks to draw samples from each of the subregions with a prespecified frequency, called the desired sampling distribution, denoted by **π** = (*π*_{1},…,*π*_{m}) with 0 < *π*_{i} < 1 and ∑_{i = 1}^{m}*π*_{i} = 1. If this goal can be achieved, then the difficulties of rare-event sampling encountered by conventional Markov Chain Monte Carlo (MCMC) algorithms and the resampling-based testing procedure can be essentially overcome. In this paper, we define {*γ*^{(t)}}, the gain factor sequence, in the context of stochastic approximation (Robbins and Monro, 1951) as

for a prespecified positive value of *t*_{0}. More discussion of the choice of *t*_{0} will be given in a later section. Other gain factor sequences are discussed in Liang *and others* (2007). We let *J*(*x*) denote the index of the subregion to which the sample *x* belongs.

Assuming that we have the sample *x*^{(t)} at the *t*th iteration of the SAMC algorithm, the next sample *x*^{(t + 1)} is drawn from a Metropolis–Hastings (MH) kernel with the proposal distribution *q*(*x*^{(t)},·) targeting at the stationary distribution

where *θ*^{(t)} = (*θ*_{1}^{(t)},…,*θ*_{m}^{(t)}) is an *m*-vector in a space Θ and is updated at each iteration. In this paper, we consider only the scenario in which the space Θ is compact. This is the general case when the sampling space *X* is generated through a resampling-based procedure, as the number of possible samples in the sampling space is always finite. Thus, as shown in Liang (2010)*θ*^{(t)} can always be kept in a compact set. Below is the summary of steps involved for generating *x*^{(t + 1)} and updating *θ*^{(t)} in the scenario where the space Θ is compact. To start the iteration, we can set *θ*_{k}^{(1)} = 0,*k* = 1,…,*m*.

### The SAMC algorithm

(a) (Sampling) Simulate a sample *x*^{(t + 1)} by a single MH update with the target distribution as defined in (2.2):

(a.1) Generate

*y*according to the proposal distribution*q*(*x*^{(t)},*y*).(a.2) Calculate the ratio .

(a.3) Accept the proposed move with a probability of min(1,

*r*). If it is accepted, set*x*^{(t + 1)}=*y*; otherwise set*x*^{(t + 1)}=*x*^{(t)}.

(b) (Weight updating) For *i* = 1,…,*m*, set *θ*_{i}^{(t + 1)} = *θ*_{i}^{(t)} + *γ*^{(t)}[*I*(*x*^{(t + 1)}∈*E*_{i}) − *π*_{i}].

The self-adjusting mechanism of the SAMC algorithm is obvious: if a proposal is rejected, the weight of the subregion to which the current sample belongs will be adjusted to a larger value, and thus the total rejection probability of moving out of the current subregion will decrease in the next iteration. This adaptive mechanism enables the algorithm to traverse the sample space very quickly. The SAMC algorithm represents a significant advance in sampling of rare events and in simulations of complex systems for which the energy landscape is rugged.

Following Liang *and others* (2007), we know ∫_{Ei}*ψ*(*x*)d*x* is approximately proportional to exp(*θ*_{i}^{(t)})(*π*_{i} + *π*_{0}) when *t* is large enough, where *π*_{0} = ∑_{j∈{i:Ei = 0}}*π*_{j}/(*m* − *m*_{0}), *m*_{0} is the number of empty subregions (an empty subregion is an area with ∫_{Ei}*ψ*(*x*)d*x* = 0). Furthermore, let *P*(*x*^{(t)}∈*E*_{i}) be the probability of sampling from the subregion *E*_{i} at iteration *t*. Liang *and others* (2007) showed that as *t*→*∞*, *P*(*x*^{(t)}∈*E*_{i}) converges to *π*_{i} + *π*_{0}, if *E*_{i} is not empty and 0 otherwise. Thus, the SAMC can control the sampling proportion for any given nonempty subregion.

*P*-VALUE EVALUATION BY SAMC

### The standard resampling-based procedure

Let *x* denote the observed data, and let *λ*(*x*) denote the test statistic for the null hypothesis of interest to us. In many applications, the exact or asymptotic distribution of *λ*(*x*) under the null hypothesis is unknown. To evaluate the *p*-value associated with the observed test statistic value, one usually estimates the empirical distribution of *λ*(*x*) through a resampling-based procedure. As we mentioned before, to achieve a reliable estimate for small *p*-values (e.g. less than 10^{ − 6}), the standard procedure will require a large number of resampling steps in order to generate enough of the more extreme statistic values.

The following rule of thumb can give us some guidance on how many resampling steps are needed to reach a reliable *p*-value estimate. Let *p* be the true *p*-value of *λ*(*x*). Let {*λ*(*x*_{i}^{*}),*i* = 1,…,*n*} be the test statistics calculated on *n* independent resamples. The estimated *p*-value is usually given as . The absolute relative error (ARE) is defined as . Let *C*_{ARE} be the desired ARE threshold. For small *p*, we know , the probability of having an acceptable estimate (with the ARE less than *C*_{ARE}), is about , where Φ(·) is the standard normal distribution function. So, if we want to estimate the (small) *p*-value with the median ARE at the level of *C*_{ARE}, the required number of resamples *n* is given by solving the equation

Thus, to estimate *p* with the ARE median level at *C* = 5%, we must choose an *n* of the order of 180/*p*, for example, *n* = 1.8×10^{9} for *p* = 10^{ − 7}.

### p-value evaluation through SAMC

Suppose that we have chosen a simulation scheme for generating *x* under the null hypothesis and that we let *f*(*x*)∝*ψ*(*x*) be the distribution of the sample *x*∈*X* under the simulation scheme. The observed statistic *λ*^{*} has its *p*-value given by *P*(*λ*(*x*) ≥ *λ*^{*}) = ∫_{λ(x) ≥ λ*}*f*(*x*)d*x*. As we have described above, SAMC can be used to estimate ∫_{Ei}*ψ*(*x*)d*x*. We always let *λ*_{m − 1} = *λ*^{*} for the purpose of *p*-value evaluation. After the *T*th iteration of the SAMC algorithm, with *T* large enough, we can estimate the *p*-value *P*(*λ*(*x*) ≥ *λ*^{*}) by

We can improve the estimate accuracy further by averaging estimates over *h*( ≥ 1) independent SAMC runs; we denote this type of estimate by SAMC(*h*).

For an effective implementation of SAMC for this problem, several practical issues need to be considered.

##### Simulation space partitioning.

The algorithm generally prefers a fine partition. Since the proposal *q*(·,·) is usually a local updating function, a fine partition of the space *X* will reduce the chance that both the current sample and the proposal sample belong to the same subregion and thus will enhance the ability of the algorithm to traverse the entire space. A natural choice for the definition of subregions is to divide the interval [*λ*_{0},*λ*^{*}] evenly into *m* − 1 intervals, assuming that we know *λ*_{0} (e.g. *λ*_{0} = 0), the lower boundary for *λ*(*x*). This strategy usually works well. But in certain situations, some subregions can end up with an extremely small probability, especially when the number of subregions *m* is relatively large (e.g. 300), which prevents the algorithm from converging. To avoid this, in the supplementary material available at *Biostatistics* online, we describe an adaptive procedure for partitioning the sampling space more appropriately.

##### Choice of the desired sampling distribution and gain factor.

We recommend a uniform setting for the desired sampling distribution **π**, that is, *π*_{1} = ⋯ = *π*_{m}, since the estimation of the *p*-value requires us to estimate each integral ∫_{Ei}*ψ*(*x*)d*x* accurately. For the choice of *t*_{0} in the gain factor defined by (2.1), we recommend setting *t*_{0} to a value between 1000 and 5000. In all the following numerical examples, we only present results with *t*_{0} = 1000. We obtain very similar results for other *t*_{0} within the recommended range.

##### Choice of the MH algorithm.

The MH algorithm used in step (a) of SAMC depends on the simulation scheme adopted by the resampling-based test. We will provide examples for the permutation tests and the bootstrap-based test in the following sections. In general, SAMC prefers an MH algorithm that updates only a small proportion of the current version of data (Liang, 2010). Based on our experiments, we recommend an MH algorithm that updates 5–10% of the data.

##### Diagnosis for convergence.

To check for the convergence of the algorithm, we can evaluate the pattern in the histogram , which is the proportion of the samples generated among *T* iterations falling into each subregion. Since we recommend setting the desired sampling distribution **π** to be uniform, the observed sampling proportion should converge to a uniform distribution over those nonempty subregions when the SAMC algorithm converges. Based on this fact, the convergence of the algorithm can be inspected by checking whether or not the histogram is flat. More specifically, we propose the following criterion for the convergence diagnosis. Based on , we calculate the relative sampling frequency error

which measures the deviation of the realized sampling distribution from the desired one, where *m*_{0}^{′} is the number of subregions not visited during the run. Based on our experience, we recommend using max_{i = 1}^{m}|ε_{f}(*E*_{i})| < 20% as an indication for convergence. This diagnosis method is simple and easy to use as it does not require multiple runs. It works especially well when the emptiness/nonemptiness of each subregion is known in prior. This is usually the case in the *p*-value evaluation application since the range of test statistic is usually known. Other diagnostic tools based on multiple runs developed for the standard MCMC algorithm can also be used. For example, we can check for consistence among estimates obtained by different runs.

## NUMERIC EXPERIMENTS

### Two-sample t-test example

We begin with an example for which we know the distribution of the test statistic under the null hypothesis and thus the tail probability (*p*-value), so that we can assess the accuracy of the estimate by SAMC. Suppose we have 2 groups of observations, *x*_{1,1},…,*x*_{1,n1}∼*N*(*μ*_{1},*σ*^{2}) and *x*_{2,1},…,*x*_{2,n2}∼*N*(*μ*_{2},*σ*^{2}). The standard two-sample *t*-test statistic *λ*(*x*) (assuming equal variance) is usually used to test the null hypothesis *H*_{0}:*μ*_{1} = *μ*_{2}, It is known that under the null hypothesis, *λ*(*x*) follows a *t*-distribution with *n*_{1} + *n*_{2} − 2 degrees of freedom (df). For this example, we generate a data set by setting *n*_{1} = *n*_{2} = 1000, *μ*_{1} = *μ*_{2} = 0, and *σ*_{1} = *σ*_{2} = 1.

Suppose we do not know the theoretical distribution of *λ*(*x*) but instead want to rely on a permutation procedure to evaluate the significance level for a given observed statistic value. In this permutation procedure, we randomly shuffle the group IDs among the 2 samples and generate new versions of the data. Under such a simulation scheme, each generated data set has the same probability, that is, *ψ*(*x*) = 1,*x*∈*X*, with *X* being the space of all possible simulated data sets. To apply the SAMC algorithm to evaluate the distribution of *λ*(*x*) under *ψ*(*x*), as defined above, we need to define the MH proposal distribution.

##### The MH proposal distribution.

Let *x*^{(t)} = (*x*_{1}^{(t)},*x*_{2}^{(t)}) denote the current version of the two-sample data obtained at iteration *t*, where *x*_{1}^{(t)} and *x*_{2}^{(t)} denote the assigned first and second groups of observations, respectively. The MH step to generate a new version of the data is essentially a “localized” version of the resampling procedure (in this case, the permutation procedure) adopted by the corresponding resampling-based test. Based on *x*^{(t)} = (*x*_{1}^{(t)},*x*_{2}^{(t)}), the updated data *y* can be generated by exchanging *L* pairs of samples sequentially between the 2 groups, with *L* being around 5% of the sample size in the smaller group. Note that the 2 samples *x*_{1}^{(t)} and *x*_{2}^{(t)} in this process are updated iteratively after each pair exchange, and each pair is drawn at random with replacement from the current version of the 2 samples. The purpose of using the random draw with replacement is to satisfy the minorization condition (Liang *and others*, 2007). For this MH updating procedure, we have because of the exchangeability of all subjects under the permutation scheme. Thus, we have in step (a.2).

A more detailed description for the SAMC algorithm used for this example is given in supplementary material available at *Biostatistics* online.

##### Numeric results.

We examined the performance of SAMC by comparing its estimates with the ones given by the 1998-df *t*-distribution under different numbers of iterations (*T* = 10^{6}or 5×10^{6}) and numbers of subregions (*m* = 101,201, or 301). For each choice of (*T*,*m*) and the targeted tail probability, the SAMC procedure was run 100 times on the same data set, using different random seeds.

To diagnose the convergence of the runs, we examined the relative sampling frequency error, {ε_{f}(*E*_{i})} as defined by (3.3). Based on our experiments, we found that SAMC converges more quickly for smaller *m* and less extreme targeted tail probabilities. SAMC converged with max_{i = 1}^{m}|ε_{f}(*E*_{i})| < 20% under all considered scenarios after 5×10^{6} iterations. In fact, SAMC had already converged even after 10^{6} iterations in all scenarios except for the one with a tail probability of 10^{ − 10}.

Table 1 summarizes the performance of SAMC(1), as compared with the ones given by the *t*-distribution with 1998-df, in terms of the 25th, 50th, and 75th percentile of ARE among 100 replications. The results indicate that SAMC can produce very accurate estimates for all ranges of tail probabilities with all specified parameter configurations, even for a *p*-value as low as 10^{ − 10}. For example, with 5×10^{6} iterations (plus 200 000 steps in the initial run for the purpose of subinterval refinement), the SAMC(1) run with 301 subregions takes about 2.22 h of CPU time on a 2.6 GHz Opteron Linux machine, and it can accurately estimate the tail probabilities of 10^{ − 6} and 10^{ − 10} with the median level of ARE at 1.9% and 3.1%, respectively. On the other hand, according to (3.1), the standard permutation procedure needs 1.3×10^{9} iteration steps (about 354 h of CPU time) to estimate the tail probability of 10^{ − 6} in order to achieve a 1.9% median level of ARE. To estimate the tail probability of 10^{ − 10} with the median level of ARE at 3.1%, the standard permutation-based procedure needs 4.7×10^{12} iterations and 1.28×10^{6} h of CPU time. Thus, the SAMC algorithm can be 100 to 500 000 times as efficient (in term of computing time) as the standard permutation procedure. The running time comparison was based on the algorithm implemented in the R language.

Tail probability | Number of iterations | Number of subregions | ||

101 | 201 | 301 | ||

10^{−6} | 10^{6} | (2.0, 5.7, 8.9) | (1.5, 3.8, 6.5) | (1.9, 4.0, 6.9) |

5 × 10^{6} | (1.4, 2.7, 4.1) | (0.9, 2.0, 2.9) | (1.1, 1.9, 3.5) | |

10^{−7} | 10^{6} | (3.4, 6.9, 10.6) | (2.2, 5.0, 8.4) | (1.9, 4.8, 8.2) |

5 × 10^{6} | (1.5, 2.8, 4.6) | (1.3, 2.3, 3.5) | (1.9, 2.2, 3.7) | |

10^{−8} | 10^{6} | (4.5, 7.8, 13.7) | (3.5, 6.0, 9.2) | (2.6, 5.5, 8.9) |

5 × 10^{6} | (1.4, 3.2, 5.8) | (1.5, 2.9, 4.9) | (1.4, 2.6, 4.4) | |

10^{−9} | 10^{6} | (3.7, 8.2, 12.6) | (3.0, 7.0, 12.0) | (3.3, 6.6, 10.2) |

5 × 10^{6} | (2.3, 4.4, 6.8) | (1.5, 3.2, 4.9) | (1.4, 2.9, 5.0) | |

10^{−10} | 10^{6} | (4.9, 9.7, 17.8) | (4.5, 7.9, 11.9) | (4.2, 7.4, 13.9)† |

5 × 10^{6} | (2.2, 5.5, 9.2) | (2.0, 3.7, 5.4) | (1.4, 3.1, 5.4) |

Tail probability | Number of iterations | Number of subregions | ||

101 | 201 | 301 | ||

10^{−6} | 10^{6} | (2.0, 5.7, 8.9) | (1.5, 3.8, 6.5) | (1.9, 4.0, 6.9) |

5 × 10^{6} | (1.4, 2.7, 4.1) | (0.9, 2.0, 2.9) | (1.1, 1.9, 3.5) | |

10^{−7} | 10^{6} | (3.4, 6.9, 10.6) | (2.2, 5.0, 8.4) | (1.9, 4.8, 8.2) |

5 × 10^{6} | (1.5, 2.8, 4.6) | (1.3, 2.3, 3.5) | (1.9, 2.2, 3.7) | |

10^{−8} | 10^{6} | (4.5, 7.8, 13.7) | (3.5, 6.0, 9.2) | (2.6, 5.5, 8.9) |

5 × 10^{6} | (1.4, 3.2, 5.8) | (1.5, 2.9, 4.9) | (1.4, 2.6, 4.4) | |

10^{−9} | 10^{6} | (3.7, 8.2, 12.6) | (3.0, 7.0, 12.0) | (3.3, 6.6, 10.2) |

5 × 10^{6} | (2.3, 4.4, 6.8) | (1.5, 3.2, 4.9) | (1.4, 2.9, 5.0) | |

10^{−10} | 10^{6} | (4.9, 9.7, 17.8) | (4.5, 7.9, 11.9) | (4.2, 7.4, 13.9)† |

5 × 10^{6} | (2.2, 5.5, 9.2) | (2.0, 3.7, 5.4) | (1.4, 3.1, 5.4) |

Some runs have not converged in this scenario.

We can also see from Table 1 that the accuracy of the SAMC(1) estimate increases as the number of subregions increases. As explained before, a fine partition enhances the ability of SAMC to traverse the whole sample space. However, it would require more iteration steps for SAMC to converge. In the following examples and application, we always choose *m* = 301. We also find that we can improve the accuracy of SAMC by combining 2 independent runs (see supplementary material available at *Biostatistics* online for more examples).

Figure 1 shows the progressive plots of the highest test statistic value for 20 independent runs of SAMC(1) with 120 evenly distributed subregions in the interval of [0, 6] and 20 runs of the standard permutation procedure. It indicates that SAMC can sample extreme statistic values very quickly. In almost all the runs shown in Figure 1, within 10^{4} iterations, SAMC can generate extreme *t*-test statistics that fall in the region with the tail probability less than 10^{ − 9}. On the other hand, even with a total of 10^{8} (i.e. 5 000,000 samples in each of 20 independent runs) permutation steps, the permutation procedure still cannot reach that area. This partially demonstrates why SAMC has such an advantage over the standard resampling-based procedure in term of estimating small *p*-values. The level-off pattern appearing on the SAMC progressive plots after they reach above the value of 6 is simply due to the fact that we did not further partition the region (6, + *∞*) in this experiment. As a result, the SAMC algorithm loses its self adjusting capability in that area and behaves similarly as the standard MH iteration.

### Other numerical experiments

We further demonstrate the accuracy of the SAMC in 2 additional numerical experiments. In supplementary material available at *Biostatistics* online, we evaluate the performance of SAMC in a likelihood ratio test example, where the resamping-based test is based on a parametric bootstrap procedure. Also in the supplementary material available at *Biostatistics* online, we apply SAMC in a MAX test (the test is based on the maximum of several test statistics) example, where the theoretic distribution of the test statistic is unknown and the accuracy of the SAMC is checked by comparing with results from the standard permutation procedure.

## APPLICATION TO PROSTATE CANCER GENETIC ASSOCIATION STUDY

As described in Section 1, we focused on CGEMS stage II follow-up study (Thomas *and others*, 2008). The goal of our analysis was to conduct an omnibus test to evaluate each SNP's association with the disease outcome by capturing both its main effect and its interaction with 7 SNPs in the “conditioning” 8q24 region that were known to be related to prostate cancer. The omnibus test used for this project was adopted from the Tukey score test (Chatterjee *and others*, 2006). The test statistic is defined as *W*_{max} = max_{U ≤ γ ≤ V}*W*(*γ*), where *γ* takes on values from [ − 5,5] at the interval of 0.1, and *W*(*γ*) is a score statistic for each given *γ*.

The calculation of the Tukey score test is not straightforward, as it involves the evaluation of the score test statistic *W*(*γ*) at a dense set of grid points of *γ*. In addition, the study has a large sample size, which makes the evaluation of *W*(*γ*) at a given *γ*computationally expensive. In addition, a resampling-based procedure has to be used for the evaluation of *p*-values since the asymptotic distribution for the Tukey score test is not available. One resampling-based approach is to randomly shuffle the genotype at the testing SNP, while keeping other covariates and the outcome unchanged.

We applied the SAMC algorithm to evaluate the *p*-value for dozens of top-ranked SNPs (according to the observed Tukey test statistic). Results for the top 5 ranked SNPs are given in Table 2. For each SNP, we ran the SAMC algorithm 100 times with different random seeds. Each chain has 2×10^{6} iterations (plus the initial 200 000 iterations for the purpose of subinterval refinement) and takes about 12 h of CPU time on a 2.6 GHz Opteron Linux machine. All SAMC runs converged within 2×10^{6} iterations. According to the Bonferroni threshold (i.e. 1.83×10^{ − 6}), the top 5 SNPs were found to be globally significant. All these 5 SNPs reported in Table 2 have been identified by previous studies to be associated with prostate cancer due to their main effect. In addition to these 5 top-ranked SNPs, a few other novel top-ranked SNPs that showed both main effect and interaction with SNPs in 8q24 region will be described in details elsewhere.

SNP ID | Test statistic† | p-value | |

Mean‡ | Accuracy summary§ | ||

rs4430796 | 41.5 | 8.72 × 10^{−10} | (2.4, 4.1, 6.6) |

rs10993994 | 41.1 | 1.10 × 10^{−9} | (1.9, 4.4, 6.8) |

rs7501939 | 32.0 | 1.05 × 10^{−7} | (1.3, 3.3, 5.1) |

rs10896449 | 30.5 | 2.33 × 10^{−7} | (1.7, 3.9, 5.8) |

rs10486567 | 27.2 | 1.15 × 10^{−6} | (1.7, 3.4, 5.3) |

SNP ID | Test statistic† | p-value | |

Mean‡ | Accuracy summary§ | ||

rs4430796 | 41.5 | 8.72 × 10^{−10} | (2.4, 4.1, 6.6) |

rs10993994 | 41.1 | 1.10 × 10^{−9} | (1.9, 4.4, 6.8) |

rs7501939 | 32.0 | 1.05 × 10^{−7} | (1.3, 3.3, 5.1) |

rs10896449 | 30.5 | 2.33 × 10^{−7} | (1.7, 3.9, 5.8) |

rs10486567 | 27.2 | 1.15 × 10^{−6} | (1.7, 3.4, 5.3) |

Based on the extended Tukey test (Chatterjee *and others*, 2006).

Mean of estimated *p*-values among 100 independent SAMC runs.

Values in the parentheses are the 25th, 50th, and 75th percentiles of the absolute relative difference (%) of *p*-value estimates by SAMC(1) in 100 independent runs. The absolute relative difference is calculated as the absolute percentage difference between an estimate by SAMC in a single run and the one given by (‡).

Although in this application we estimated the final *p*-value for each SNP based on the mean over 100 independent SAMC runs, any estimate from an individual run is not very different from the mean value. In Figure 2, we present the histogram of the 100 *p*-values estimated from individual runs for each of 4 top-ranked SNPs. In fact, for SNPs reported in Table 2, the maximum absolute relative difference between the estimate by SAMC in a single run and the mean value among 100 independent runs is in the range of 11–13%.

## DISCUSSION

We have proposed an efficient *p*-value evaluation procedure. The procedure has the same application scope as any of the standard resampling-based tests, such as the ones based on permutation and bootstrap procedures, and can be 100–500 000 times as efficient as the standard resampling-based procedure when evaluating a small *p*-value. We also proposed an adaptive subregion-partitioning algorithm to facilitate the SAMC algorithm so that the application of this procedure requires very minimum tuning parameters. Aided by this procedure, the already very popular resampling-based tests would become computationally feasible for a much wider range of applications. We have implemented the algorithm into a R package called EXPERT (extremely small *p*-value evaluation for resampling-based tests), which is freely available at first author's homepage (http://dceg.cancer.gov/about/staff-bios/yu-kai).

Like other MCMC procedures, the proposed algorithm has to be run sequentially. However, we have demonstrated through numerical examples that instead of generating a long MCMC chain on a single machine (the SAMC(1) approach), we can share the computational burden among multiple machines by running several relatively shorter chains simultaneously with less stringent convergence criteria and then average the results over all runs as the final estimate (the so-called SAMC(*h*) approach). Through numeric experiments (results not shown), we found that the accuracy of SAMC(1) is generally higher than that of SAMC(*h*) when the same total number of iterations was used. So if time permits, we still recommend using SAMC(1) with more iterations.

## SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

## FUNDING

Intramural Program of the National Institutes of Health and the National Cancer Institute to K.Y. and F.L.; The National Science Foundation (DMS-0607755, CMMI-0926803); and the award (KUS-C1-016-04) made by the King Abdullah University of Science and Technology.

*Conflict of Interest:* None declared.

## References

*p*-value of the maximum of correlated tests, with applications to genome-wide association studies

*p*values using importance sampling, with applications to genetics and medical image analysis

*p*-values