Fast pairwise IBD association testing in genome-wide association studies

Motivation : Recently, investigators have proposed state-of-the-art Identity-by-descent (IBD) mapping methods to detect IBD segments between purportedly unrelated individuals. The IBD information can then be used for association testing in genetic association studies. One approach for this IBD association testing strategy is to test for excessive IBD between pairs of cases (‘pairwise method’). However, this approach is inefficient because it requires a large number of per-mutations. Moreover, a limited number of permutations define a lower bound for P -values, which makes fine-mapping of associated regions difficult because, in practice, a much larger genomic region is implicated than the region that is actually associated. Results: In this article, we introduce a new pairwise method ‘Fast-Pairwise’. Fast-Pairwise uses importance sampling to improve efficiency and enable approximation of extremely small P -values. Fast-Pairwise method takes only days to complete a genome-wide scan. In the application to the WTCCC type 1 diabetes data, Fast-Pairwise successfully fine-maps a known human leukocyte antigen gene that is known to cause the disease. Availability: Fast-Pairwise is


INTRODUCTION
Identity by descent (IBD) is a fundamental concept in genetics.Two individuals are IBD at a locus if they have identical alleles inherited from a common ancestor.Investigators have put tremendous efforts to map the IBD segments between purportedly unrelated individuals (Browning and Browning, 2010;Gusev et al., 2009;Purcell et al., 2007).The current state-of-the-art methods such as GERMLINE (Gusev et al., 2009) and Beagle (Browning andBrowning, 2010, 2011) can detect even small (several megabases) IBD segments shared between individuals.
One promising application of IBD mapping is to use discovered IBD segments in the association testing (Purcell et al., 2007).Investigators usually test single nucleotide polymorphisms (SNPs) for association, but SNPs may not 'tag' low frequency causal variations well (de Bakker et al., 2005).Imputation also performs poorly on rare variants (Browning and Browning, 2009;Marchini et al., 2007).Association testing based on the IBD information, or 'IBD association testing', can complement standard association testing methods (Browning and Browning, 2011).
There are two categories of IBD association testing method.The first method is the pairwise method (Purcell et al., 2007), where one compares the IBD rate of case/case pairs with the background IBD rate to detect excessive IBD between cases.The rationale is that if a rare causal variation has occurred in a relatively recent ancestor, cases will likely share an IBD segment containing the causal variant.The second method is the clustering method (Gusev et al., 2011), where one divides individuals into clusters based on the IBD information and then test each cluster for association assuming the cluster 'tags' a rare causal variation.In this article, we focus on the pairwise method.
The pairwise method has two computational challenges.The first challenge is computational efficiency.In the pairwise method, one uses permutation to approximate P-values because it is difficult to analytically obtain the asymptotic distribution of the statistic.Because the P-value threshold for genome-wide association studies (GWAS) is necessarily low due to multiple testing (Browning and Thompson, 2012), one must perform a large number of permutations, which can be computationally demanding.The second challenge is fine-mapping.After one identifies significant loci, it is important to pinpoint the most significant peak within the loci to further follow up candidate genes.The permutation is limited for this purpose because the smallest P-value it can approximate is constrained by the number of permutations, often resulting in many SNPs having the same minimal P-values in the region.
In this article, we present a new method, 'Fast-Pairwise', to overcome the computational challenges of the traditional pairwise method.Fast-Pairwise uses 'importance sampling' (Wasserman, 2004) to improve efficiency and to enable approximation of extremely small P-value.To devise an importance sampling procedure, we introduce a new statistic that has two properties; it can approximate the pairwise method statistic, and it can be conveniently used for designing a sampling procedure.We show that the new statistic has a close relationship with the pairwise method statistic through the properties of the graph representation of IBD.
Fast-Pairwise is efficient and takes only days to complete a genome-wide scan.To demonstrate the utility in fine-mapping, we apply our method to the type 1 diabetes dataset of the Wellcome Trust Case Control Consortium (2007).In this dataset, the traditional pairwise method can identify a significant region in chromosome 6 (Browning and Thompson, 2012), but it gives the same minimal P-value for a wide region , including all eight classical human leukocyte antigen (HLA) genes.Among these, Fast-Pairwise pinpoints HLA-DQB1, which is known to cause the disease (Todd et al., 1987).

IBD graph
Given N individuals, the IBD information at a genomic locus can be represented as a graph with N vertices (Fig. 1).An edge exists between a pair of vertices if the individuals are IBD.

Pairwise methods for IBD association mapping
We refer to a class of IBD association mapping methods as 'pairwise methods' if they examine the relative number of edges in the IBD graph at each locus.There are three different types of edges: edges that connect two case individuals, edges that connect two control individuals and edges that connect a case and a control individual.Pairwise methods can be performed in two different ways.One way is to compare the number of case/case pairs with control/control pairs.A second way is to compare the number of case/case pairs with non-case/case pairs (union of control/control pairs and case/control pairs).We will call the first method CC and the second method CN.In this article, we mainly focus on CN consistent with previous studies (Browning and Thompson, 2012;Purcell et al., 2007).If we simply refer to the 'pairwise method', we are referring to CN.

Suppose that we have
be the set of case vertices and V À be the set of control vertices.Let E þþ be the set of all possible case/case vertex pairs, E ÀÀ be the set of all possible control/control vertex pairs and E þÀ be the set of all possible case/control vertex pairs.Let e ij be 1 if there exists an edge between vertices i and j, and 0 otherwise.Then, the CC and CN statistics are defined as The asymptotic distributions of these statistics are difficult to obtain analytically.This is because the statistics are based on the edges that depend on each other.For this reason, statistical significance is assessed by permutation.We assume a one-sided test, where IBD segments carry variants that are involved in disease (Browning and Thompson, 2012).
The relationship between CC and CN is worth noting.Under the condition that the background IBD rates of control/control pairs and the non-case/case pairs are equivalent (IBD control=control ¼ IBD nonÀcase=case ), CN will be more powerful than CC owing to the additional N þ N À pairs it considers.We expect that, however, the relative ordering of the two statistics is similar to each other (Fig. 2), as most of the pairs for both CC and CN are the same.As we will show later, we will use this similarity as the basis of increasing the computational efficiency of computing the significance of S CN .

Permutation test
Permutation is the standard approach for assessing the significance of the pairwise method.A single permutation can be thought of as a randomly sampled a vector of case/control disease statuses.Let v ¼ ðv 1 , . . ., v N Þ, 8v i 2 f0, 1g be the vector of disease status of N individuals, where 0 denotes control and 1 denotes case.The test statistic of pairwise method, S CN , is a function of v. Let v be the case/control status that was originally observed in the data.The standard permutation test is equivalent to sampling new v from all possible permutations of v assuming a uniform distribution.Let B be the set of sampled v.The estimated P-value is where is the indicator function.
Fig. 2. High correlation between CC and CN statistics.We simulated 1000 studies under the alternative hypothesis (see Section 3.2).We then permuted phenotypes to simulate the null hypothesis.Spearman is 0.89 and 0.99 under the null and the alternative, respectively Fig. 1.An example of IBD graph.IBD detection method provides IBD information (Table ).Then we build a graph where vertices are individuals and edges are IBD relationships The drawback of the permutation test is that it is computationally inefficient.If the true P-value is small, which is required in genomewide studies, we will need a large number of permutations.For example, to assess a P-value p with standard error p/10, one needs approximately 100/p samples.For the genome-wide threshold of IBD association testing [6 Â 10 À6 , Browning and Thompson (2012)], 410 million permutations are required.

Fast-Pairwise
We develop a new method 'Fast-Pairwise' that uses importance sampling technique to speed up CN method (Wasserman, 2004).Unlike the standard permutation test that samples case/control status v from the uniform distribution, we sample v non-uniformly.Specifically, we aim to sample v from all permutations of v such that, on average, S CN ðvÞ will be similar to S CN ðvÞ.The intuition is that by intentionally sampling v that gives large value of S CN , we can reduce the variance of the P-value estimate.Thus, our goal is to design a sampling procedure that satisfies where the expectation is with respect to f, our sampling distribution for v.
However, designing such a sampling procedure is not straightforward.To this end, we leverage the fact that we can construct a simpler statistic that approximates S CN , which we use for the sampling.

IBD-degreetype
To apply importance sampling, we must identify a statistic that roughly approximates S CN but can be conveniently used for designing a sampling procedure.Because we empirically have observed that S CC approximates S CN (Fig. 2), we want to find a statistic that approximates S CC .Our proposed statistic S SUM is related to S CC through a concept that we introduce, called the IBD-degreetype, which is simply the degree of each individual in the IBD graph.Obtaining the degrees of vertices is equivalent to splitting all edges and counting how many split edges are adjacent to each vertex (Fig. 3).Then we assign these numbers to the vertices.Given this, we define the IBD-degreetype as conceptually similar to a genotype where the allele of each individual equals to the degree of the corresponding vertex in the IBD graph.
The IBD-degreetypes can be used for statistical testing for IBD association testing.The intuition is that if case/case pairs have an excessive number of IBDs, then case vertices will have higher degrees than control vertices.The test based on IBD-degreetype will be comparing the average degrees between cases and controls, where w i is the IBD-degreetype of individual i, or equivalently the degree of vertex i in the graph.We note that this statistic is conceptually similar to the traditional association statistic, which compares the frequency of the genotypes between the cases and controls.Here we instead compute the difference between the case and controls of the IBD-degreetypes (hence the name).We are interested in the monotonic relationship between statistics within the permutation procedure.Let v 1 and v 2 be the permuted case/ control status.We define 'monotonic increasing relationship' (MIR) as follows.
DEFINITION 1. Two statistics S and T are in an MIR if, 8v It is clear that if two statistics are in MIR, they will give the same P-value under permutation.It also follows that MIR is transitive (if S and T are in MIR and T and U are in MIR, then S and U are in MIR).
The IBD-degreetype test has the following relationship to the pairwise CC method.Figure 3 illustrates this relationship with a toy example.
LEMMA 1.In a balanced study design (N þ ¼ N À ), CC and the IBDdegreetype test are in MIR.PROOF.
Because S ID and S CC differ by a constant multiplication factor ðN þ À 1Þ, they are in MIR.
We introduce another simple form of test statistic based on IBDdegreetype.This sum statistic is the sum of IBD-degreetype alleles or the degrees of the vertices in cases, where jej denotes the total count of edges.In addition, the sum of the IBD-degreetypes over all vertices is equal to 2jej because each edge is counted twice. Therefore, N À is a constant, S SUM is a monotonic increasing linear transformation of S ID .Thus, they are in MIR.Fig. 3. Equivalence of pairwise (CC) method and IBD-degreetype test in a balanced study.Note that in pairwise CC method, the edge between (D,F) pair is ignored, which would not be ignored in CN method

Substitution strategy
Here we propose to use S SUM in sampling as a substitution to the pairwise method statistic, S CN .The logical ground for this strategy comes from the relationship between S CN and S SUM .We have empirically shown that S CN and S CC are highly correlated (Fig. 2).Because S CC and S ID are in MIR in a balanced study (Lemma 1), we expect that they will be correlated in general even in an unbalanced study, and we show this property through a simulation experiment (Supplementary Fig. S1).S ID is in MIR with S SUM (Lemma 2).Thus, S SUM can be an approximation to S CN (Fig. 4).
Given this relationship between S CN and S SUM , our strategy is to sample v such that S SUM ðvÞ will be similar to S SUM ðvÞ on average.Our new goal can be described as It turns out that this new goal is much easier to achieve.Note that S SUM is used only for sampling.After the sampling is done, the sampled v is used to approximate the P-value of CN method.This substitution approach works because in importance sampling, the sampling distribution need not guarantee optimality (the smallest variance of P-value estimate).Instead, a reasonably similar distribution to the optimal distribution suffices.It is clear that this strategy will perform the best if the balanced study condition is met.However, even if the condition is not met, only the variance of P-value estimate is affected and not the mean.The P-value estimate will still be unbiased, and it only means that we will need a larger number of samples to obtain the same accuracy.

Sampling with replacement
In this section, we devise a sampling procedure satisfying Equation (4).Such a sampling procedure will be the core part in our importance sampling framework for speeding up CN method.
Sampling a random v from all permutations of v can be thought of as sampling N þ of N individuals that will be assigned case status, or equivalently, sampling N þ case indices among 1, . . ., N. Let a 1 , . . ., a N þ be the sampled case indices.These are the indices in v that will be assigned '1'.Sampling case indices is without-replacement sampling procedure; we cannot sample the same index twice, so that exactly N þ distinct indices will be sampled at the end (8i 6 ¼ j, a i 6 ¼ a j ).This way, we can restrict sample space of v to the set of all permutations of v.
However, the design of sampling procedure satisfying Equation ( 4) is considerably easier if we assume sampling case indices with replacement.That is, we allow the same index can be sampled multiple times.Although this assumption is not valid for our purpose, our strategy is to devise a sampling approach satisfying Equation (4), assuming sampling with replacement first, and then extend the approach to the sampling without replacement.
Suppose that we pick a 1 among 1, . . ., N with probability Pða 1 ¼ kÞ ¼ gðkÞ, P N k¼1 gðkÞ ¼ 1.Because we assume sampling with replacement, sampling a 2 is no different from sampling a 1 ; in fact, for any 1 i N þ , a i is independent and identically distributed (IID) with distribution g.Now consider w a1 , the IBD-degreetype allele of a 1 .Let E g ðw a1 Þ be the expected value of w a1 with respect to g.Again, because we assume sampling with replacement, Then, by the definition of S SUM , we can easily see that Thus, equation (4) can be described as where the left side is the expected case IBD-degreetype allele in our distribution g and the right side is the average case IBD-degreetype allele in the observation v.This shows that, if the P-value is highly significant (e.g. the right side is large), we should pick a 1 (and all a i ) such that the expected value of IBD-degreetype allele can be large.
Here we propose a new sampling procedure that satisfies condition (5).We define distribution g as follows It is easy to show that this sampling procedure meets condition (5), as condition (5) can be described as and by solving this for , we exactly have equation ( 6).If 50, then we set to 0 to prevent negative t k .Such a case is not of our interest at any rate, as we focus on the one-sided test for detecting excess of case/case IBD.We choose the most simple linear function for t k , which enables us to calculate easily.As a result, for any v, we can calculate and completely define the distribution g.So far, we have successfully defined a sampling procedure satisfying Equation (4), assuming sampling with replacement.

Sampling without replacement
In this section, we extend the sampling procedure from the previous section to the 'sampling without replacement'.Here we propose to heuristically apply the same sampling scheme based on t 1 , . . ., t N to the without-replacement context.When we pick a i (ith case index), we pick index k among f1, . . ., Ngnfa 1 , . . ., a iÀ1 g with probability That is, we assume the same sampling probability proportional to t k , but we exclude indices previously picked as cases from our consideration.However, this sampling procedure does not exactly satisfy Equation (4) in the without-replacement sampling context.The indices with larger IBD-degreetype alleles are likely to be picked as cases earlier in the procedure and removed.Thus, if we use calculated assuming sampling with replacement, the expected case IBD-degreetype allele [the left side of Equation ( 5)] will be smaller than what we would obtain in the withreplacement context.To compensate for this difference, we use the following heuristic.In the middle of sampling, we empirically assess the left side of Equation ( 4) by examining the currently obtained samples.Then,Fig. 4. Relationship between different statistics we dynamically increment until the left side of Equation ( 4) is close to the right side.This simple heuristic is sufficient because, again, in importance sampling, we only need to approximately satisfy Equation (4).

P-value calculation
By using the sampling procedure developed in the previous section, we can obtain many sample v that approximately satisfy Equation (2).The final step is to use these samples to assess the P-value of pairwise (CN) method.In importance sampling, we must account for the fact that we used a sampling distribution that is different from the original distribution.Our original distribution is the uniform distribution defined by the standard permutation approach.The sampling distribution is defined by the sampling procedure that we developed in the previous section.
For a given v, the probability of sampling v differs in the two distributions as follows.To sample a v, we sample the case indices a 1 , . . ., a N þ .The probability of sampling case indices in the standard uniform distribution is On the other hand, the probability of sampling case indices in our sampling procedure described in the previous section is This is because at each step we pick ith case index a i , we sample the index with probability proportional to t ai , where the previously picked indices a 1 , . . ., a N þ are excluded from consideration.Using the standard formula of importance sampling, we approximate the P-value Note that we use the pairwise CN statistic in this formula.We use S SUM only to facilitate the sampling of v, and then use the obtained samples for CN method at the final step.We can approximate the variance of p with the following formula 1 jBj

Adjusting for population structure
A simple correction for population structure has been previously proposed (Browning and Thompson, 2012;Purcell et al., 2007) for the pairwise method.In this simple approach, the genomic average is subtracted from each of the two contrasting terms of the statistic.For example, in CN method, the genomic average of case/case IBD rate is subtracted from the observed case/case IBD rate, and the genomic average of non-case/case IBD rate is subtracted from the observed non-case/case IBD rate before calculating the statistic.The same approach can be applied to our Fast-Pairwise method.

Efficiency
To assess the efficiency gain of our Fast-Pairwise method, we use the Wellcome Trust Case Control Consortium (WTCCC) data (Wellcome Trust Case Control Consortium, 2007).We first run Beagle FastIBD to detect IBD between individuals (Browning and Browning, 2011).Then we test individual IBD segments for associations.We perform IBD association testing using both the traditional pairwise method based on permutation and our Fast-Pairwise method.We perform 10 million permutations for the traditional pairwise method.For our Fast-Pairwise method, we perform importance sampling with 1000 samples and 10 000 samples.We implemented both methods in the Java programming language.
Table 1 shows the estimated running time of both methods for analyzing the whole genome data (500 000 SNPs) of a single disease.The time is extrapolated from the estimated time for chromosome 22.Our Fast-Pairwise method is several orders of magnitude faster than the traditional pairwise method.It takes only 4 days for the whole genome, whereas the traditional method can take 13 000 days or 35 years of CPU time.
We can reduce the computation time for the traditional pairwise method by using an adaptive permutation approach.We can terminate the permutation earlier if the P-value approximates to a non-significant value.Given a P-value p, we need 100/p permutations to obtain the standard error of $ p=10.Suppose that we sample 100/p permutations for each P-value with upper limit of 10 millions.When we apply this adaptive approach to the WTCCC type 1 diabetes data, the estimated computation time is 474 days.Thus, Fast-Pairwise is still an order of magnitude faster than the traditional pairwise method with an adaptive permutation approach.

Accuracy
To assess the accuracy of our importance sampling, we use the simulation framework similar to Browning and Thompson (2012).Using the HapMap ENCODE regions (International HapMap Consortium, 2005), we run HapGen2 to simulate 10 000 individuals (Su et al., 2011).These individuals define our founder population.Then we simulate the first generation by sampling 100 000 individuals from the founders.Next we simulate the second generation by sampling 100 000 individuals from the first generation.We repeat until we obtain the 25th generation.Finally, we use the 25th generation to simulate a case/control study.Within the ENCODE region, we randomly select five causal variants among all rare variants (minor allele frequency 51%).If a haplotype contains one or more causal variants, it confers risk with relative risk selected from uniform (3,10).We assume the disease prevalence of 0.1.Given this disease model, using the standard formula of the case and control minor allele frequencies (Han et al., 2009), we sample 1500 cases and 1500 controls from the 25th generation.The IBD information between a pair of individuals is determined by tracking whether they are descendants of the same founder.We repeat this simulation 100 times per each of the 10 ENCODE regions to generate 1000 sets of case/control studies.Given these case/control study sets, we assess the P-values of pairwise (CN) method using both the standard permutation and the importance sampling of Fast-Pairwise.We use 10 000 samples for importance sampling and compare with 10 4 , 10 5 and 10 6 permutations.Figure 5 shows that the P-values of two methods track well within the P-value range that permutation can approximate (up to P-values of 10 À4 , 10 À5 and 10 À6 , respectively).Within this range, the Pearson correlation r 2 of two log P-values are 0.98, 0.94 and 0.99, respectively.This shows that our importance sampling procedure obtains accurate P-values.
Moreover, Figure 5 emphasizes a fundamental difference between the two methods.In permutation, the range of P-values one can obtain is limited by the number of permutations.Given jBj permutations, if none of the permutations exceeds the observed statistic, a conservative approximation of P-value is 1=jBj.In contrast, in Fast-Pairwise, the P-value range is not bounded by the number of samples.With a relatively small number of samples (10 000), Fast-Pairwise can obtain accurate P-values comparable with the permutation test for a wide range of P-values.We also performed extra permutations (410 6 ) to estimate P-values between 10 À6 and 10 À8 .The P-values of the two methods are still consistent within this P-value range (triangles in Fig. 5C).Browning and Thompson (2012) applied the pairwise (CN) method to the WTCCC type 1 diabetes (T1D) data based on the Beagle FastIBD IBD mapping results (Browning and Browning, 2011).Using 5 million permutations, they found that the major histocompatibility complex (MHC) region in the chromosome 6 is statistically significant given the genomewide threshold 6 Â 10 À6 .Because the MHC association to T1D has been historically known (Todd et al., 1987), this result was a validation that IBD association testing can detect the true association signal.

Application to WTCCC type 1 diabetes data
The limitation of Browning and Thompson's (2012) permutation approach is that although it is possible to determine whether each test is significant (p56 Â 10 À6 ) using 5 million permutations, it is not possible to approximate much smaller P-values.Given 5 million permutations, the smallest P-value one can approximate is bounded to 2 Â 10 À7 .In the MHC region, because of the strong signal and the long linkage disequilibrium, the location of the top peak of P-value is important for interpreting and fine-mapping the results.Figure 6A shows that the top peak of P-value is stretched over a wide region (48 Mb) including all eight classical HLA genes.Therefore, it is difficult to interpret which HLA gene is likely to be involved in the association.
We applied our Fast-Pairwise to the same dataset.Because Fast-Pairwise is the same pairwise method with increased efficiency, we expected to see the similar results as Browning and Thompson (2012).We discovered significant associations within the MHC region.However, the difference is that because our method can approximate small P-values well beyond the genome-wide threshold, it is possible to localize the statistical signal to a single marker (Fig. 6B).The top hit is at SNP rs241432 (p ¼ 7 Â 10 À45 ) at the intron of TAP2 gene.Among all major class I and II HLA genes, the closest gene to this peak is HLA-DQB1 (150 kb upstream from the peak).It is historically known that the main contributing gene for the MHC association to T1D is HLA-DQB1 (Todd et al., 1987).Thus, this result demonstrates that our Fast-Pairwise method can pinpoint the causal gene among many HLA genes within the MHC region, while the traditional pairwise method cannot.
One interesting observation is that the peak association of our IBD association test is on the TAP2 gene that encodes antigen peptide transporter 2 and has been shown to confer independent risk to the T1D when conditioned on the DQ haplotypes (Qu et al., 2007).Thus, the peak revealed by our Fast-Pairwise method may imply the added effect of TAP2 in addition to the primary effect of HLA-DQB1, which is in linkage disequilibrium.

DISCUSSION
We have developed a new efficient method for pairwise IBD association testing called 'Fast-Pairwise'.Fast-Pairwise uses importance sampling and can perform the pairwise method more efficiently than the traditional method based on permutation.Moreover, unlike permutation, Fast-Pairwise can approximate extremely small P-values beyond the genome-wide threshold.Using the WTCCC type 1 diabetes data, we show that Fast-Pairwise can successfully pinpoint a gene known to be associated to the disease within the MHC region.
The true utility of the IBD association testing is on finding novel loci where there are potentially multiple rare variants that cannot be found using single SNP tests (Browning and Thompson, 2012).An important advantage of IBD association testing is its wide applicability.The analysis can be performed using the same genotype data collected for single SNP tests without incurring additional cost.For this reason, we feel that many investigators will apply our method to search for these additional loci bearing rare causal variants.What is preventing researchers from applying this approach is an efficient method for IBD association testing, which we provide in this article.We expect that A B C Fig. 5. Accuracy of importance sampling.In simulations using the HapMap ENCODE region, we assess the P-values of pairwise (CN) method using both the standard permutation and Fast-Pairwise (importance sampling).We compare 10 000 samples of Fast-Pairwise to (A) 10 000, (B) 100 000, and (C) 1 000 000 permutations.The vertical dashed line denotes the lower bound of P-value that permutation can approximate given the number of permutations.The dots along the vertical line denote the simulations where none of the permutations exceeds the observed statistic, and therefore the lower bound of P-value is reported by the permutation test.The triangles in (C) are the P-values that we performed extra permutations (410 6 )

Table 1 .
Running time for pairwise IBD association testing for the WTCCC whole genome data