Reducing Pervasive False-Positive Identical-by-Descent Segments Detected by Large-Scale Pedigree Analysis

Analysis of genomic segments shared identical-by-descent (IBD) between individuals is fundamental to many genetic applications, from demographic inference to estimating the heritability of diseases, but IBD detection accuracy in nonsimulated data is largely unknown. In principle, it can be evaluated using known pedigrees, as IBD segments are by definition inherited without recombination down a family tree. We extracted 25,432 genotyped European individuals containing 2,952 father–mother–child trios from the 23andMe, Inc. data set. We then used GERMLINE, a widely used IBD detection method, to detect IBD segments within this cohort. Exploiting known familial relationships, we identified a false-positive rate over 67% for 2–4 centiMorgan (cM) segments, in sharp contrast with accuracies reported in simulated data at these sizes. Nearly all false positives arose from the allowance of haplotype switch errors when detecting IBD, a necessity for retrieving long (>6 cM) segments in the presence of imperfect phasing. We introduce HaploScore, a novel, computationally efficient metric that scores IBD segments proportional to the number of switch errors they contain. Applying HaploScore filtering to the IBD data at a precision of 0.8 produced a 13-fold increase in recall when compared with length-based filtering. We replicate the false IBD findings and demonstrate the generalizability of HaploScore to alternative data sources using an independent cohort of 555 European individuals from the 1000 Genomes project. HaploScore can improve the accuracy of segments reported by any IBD detection method, provided that estimates of the genotyping error rate and switch error rate are available.


Supplementary Note
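Logarithm of Odds (LOD) segment scoring
In this section, we describe an alternative scoring for potential IBD segments that is similar in spirit to the LOD score used in RefinedIBD (Browning BL and Browning SR 2013). Specifically, for a given segment $S$ shared between two individuals $i_1$ and $i_2$, we compute its LODscore as follows:

$$\mathrm{LODscore}(S) = \log \frac{\Pr\left(G^{(S)}_{\mathrm{obs}1}, G^{(S)}_{\mathrm{obs}2} \mid \mathrm{IBD}\right)}{\Pr\left(G^{(S)}_{\mathrm{obs}1}, G^{(S)}_{\mathrm{obs}2} \mid \text{not IBD}\right)}$$

where $G^{(S)}_{\mathrm{obs}1}$ (resp. $G^{(S)}_{\mathrm{obs}2}$) is the observed genotype of individual $i_1$ (resp. $i_2$) over segment $S$, and the numerator (resp. denominator) is the pseudo-likelihood of the observed genotypes conditioned on individuals $i_1$ and $i_2$ being IBD over segment $S$ (resp. not being IBD).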
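The pseudo-likelihood is computed as follows:

$$\Pr\left(G^{(S)}_{\mathrm{obs}1}, G^{(S)}_{\mathrm{obs}2} \mid Z\right) = \prod_{i=1}^{\#S} \; \sum_{G^{(i)}_{\mathrm{true}1},\, G^{(i)}_{\mathrm{true}2}} \Pr\left(G^{(i)}_{\mathrm{obs}1} \mid G^{(i)}_{\mathrm{true}1}, \epsilon\right) \Pr\left(G^{(i)}_{\mathrm{obs}2} \mid G^{(i)}_{\mathrm{true}2}, \epsilon\right) \Pr\left(G^{(i)}_{\mathrm{true}1}, G^{(i)}_{\mathrm{true}2} \mid Z\right)$$

where $Z$ is the conditioning state (IBD or not IBD), $\epsilon$ is the genotyping error rate, $\#S$ is the number of markers in the IBD segment, and $G^{(i)}_{\mathrm{true}j}$ is the true genotype of individual $j$ at position $i$. The probability of the true genotypes $(G^{(i)}_{\mathrm{true}1}, G^{(i)}_{\mathrm{true}2})$ as a function of the IBD state (0, 1 or 2 alleles shared IBD at position $i$) is given in Supplementary Table S3. We note that Supplementary Table S3 was derived elsewhere (Albrechtsen et al. 2009). The probability of observing a genotype given the true genotype and the genotyping error rate is given in Supplementary Table S4. Two genotypes are considered IBD if they share either one or two alleles IBD (IBD1 and IBD2 in Supplementary Table S4), and we give equal prior probabilities to the two configurations.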
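The per-site pseudo-likelihood described above can be illustrated with a short Python sketch. It is not the authors' implementation. Supplementary Tables S3 and S4 are not reproduced here, so the sketch substitutes assumptions: biallelic sites with genotypes coded as 0, 1 or 2 copies of the alternate allele, Hardy-Weinberg genotype frequencies, a symmetric error model in which a wrong genotype call is equally likely to be either of the other two genotypes, a base-10 logarithm, and the equal IBD1/IBD2 prior applied at each site. All function names and the `freqs` argument (per-site alternate allele frequencies) are hypothetical.

```python
import math
from itertools import product

GENOTYPES = (0, 1, 2)  # copies of the alternate allele (assumed coding)


def hwe(g, p):
    """Hardy-Weinberg probability of genotype g at alternate allele frequency p."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[g]


def joint_true(g1, g2, p, state):
    """Pr(true genotypes g1, g2 | IBD state), state = alleles shared IBD (0, 1, 2)."""
    if state == 0:
        return hwe(g1, p) * hwe(g2, p)
    if state == 2:
        return hwe(g1, p) if g1 == g2 else 0.0
    # state == 1: one shared allele s plus one independent allele per individual
    allele = (1.0 - p, p)
    total = 0.0
    for s in (0, 1):
        a1, a2 = g1 - s, g2 - s
        if a1 in (0, 1) and a2 in (0, 1):
            total += allele[s] * allele[a1] * allele[a2]
    return total


def obs_given_true(obs, true, eps):
    """Assumed error model: correct call with prob. 1 - eps, else eps / 2 each."""
    return 1.0 - eps if obs == true else eps / 2.0


def site_likelihood(o1, o2, p, eps, ibd):
    """Sum over true genotype pairs of the observation and IBD-state terms."""
    total = 0.0
    for g1, g2 in product(GENOTYPES, repeat=2):
        if ibd:  # equal prior on IBD1 and IBD2, applied per site (assumption)
            prior = 0.5 * joint_true(g1, g2, p, 1) + 0.5 * joint_true(g1, g2, p, 2)
        else:
            prior = joint_true(g1, g2, p, 0)
        total += obs_given_true(o1, g1, eps) * obs_given_true(o2, g2, eps) * prior
    return total


def lod_score(obs1, obs2, freqs, eps=0.01):
    """Sum of per-site log10 likelihood ratios, treating sites as independent."""
    score = 0.0
    for o1, o2, p in zip(obs1, obs2, freqs):
        score += math.log10(site_likelihood(o1, o2, p, eps, True))
        score -= math.log10(site_likelihood(o1, o2, p, eps, False))
    return score
```

In this sketch `eps` must be strictly positive and every frequency strictly between 0 and 1; otherwise a site with opposite homozygotes has zero likelihood under IBD and the logarithm is undefined.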
We assessed the performance of LODscore by computing the area under its receiver operating characteristic curve (AUC) for various segment sizes. Although the LODscore has power to filter out false IBD segments, its AUC is generally lower than that of the HaploScore detailed in the main text (Supplementary Figure S10). The lower power of LODscore may arise in part from two issues: 1) LODscore assumes each site is independent and thus ignores correlation between adjacent markers, and 2) LODscore ignores available phase information. Both issues could be alleviated by explicitly incorporating linkage disequilibrium between adjacent sites and switch errors into the model. However, because of the strong performance of HaploScore, we did not explore these research avenues further.
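As an illustration of the evaluation described above, the following Python sketch computes an AUC per segment-length bin. It uses the rank interpretation of AUC (the probability that a randomly chosen true segment scores higher than a randomly chosen false one, with ties counted as one half), which is a standard equivalent of the area under the ROC curve rather than a procedure stated in the text. The function names, the `(length_cm, score, is_true)` tuple layout, and the bin edges are assumptions, and higher scores are assumed to indicate IBD.

```python
def auc(scores, labels):
    """AUC as the fraction of (true, false) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one true and one false segment")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_by_length_bin(segments, edges=(2, 3, 4, 5, 6)):
    """segments: iterable of (length_cm, score, is_true); returns {(lo, hi): AUC}."""
    segments = list(segments)
    result = {}
    for lo, hi in zip(edges, edges[1:]):
        in_bin = [(s, y) for length, s, y in segments if lo <= length < hi]
        labels = [y for _, y in in_bin]
        # Skip bins that are empty or contain only one class.
        if any(labels) and not all(labels):
            result[(lo, hi)] = auc([s for s, _ in in_bin], labels)
    return result
```

The pairwise loop is quadratic in the number of segments; for a cohort-scale data set a sort-based rank computation would be preferable, but the quadratic form keeps the definition explicit.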
References
Albrechtsen A, et al. 2009. Relatedness mapping and tracts of relatedness for genome-wide data in the presence of linkage disequilibrium. Genetic Epidemiology. 33:266-274.
Browning BL, Browning SR. 2013. Improving the accuracy and efficiency of identity-by-descent detection in population data. Genetics. 194:459-471.

Supplementary Figures
Figure S1. Choosing the parent through which child-other IBD segments have been transmitted. The genome is represented as a horizontal gray line. Assayed sites compatible with IBD between the listed individual and a hypothetical other individual (not pictured) are indicated as vertical black lines. Assayed sites incompatible with IBD (e.g., opposite homozygote sites) are indicated as red crosses. Orange boxes indicate reported IBD segments between the listed individual and the hypothetical other individual (not pictured). A. The unambiguous case in which one parent has a corresponding IBD segment and the other parent does not. Here, the father would be selected as the parent for analysis. B. The case where each parent has an IBD segment that partially overlaps the child segment. The parent selected for analysis is determined by the fraction of sites shared IBD. In this case, despite the longer physical length of the father's segment, the mother would be selected since her segment overlap (5 of 9 sites) is larger than the father's (3 of 9 sites). C. The case where neither parent has a reported IBD segment. The father would be selected as the parent for analysis, since his genotype contains fewer opposite homozygote sites in the child IBD region.

[...] Figure 2A,B and Figure 4A,B, respectively, using trio-phased data for all 2,952 trios. The similarity of this figure and the main text figure panels indicates that haplotype phasing errors do not contribute substantially to the estimates of IBD accuracy.

Figure S6. IBD segment overlap and HaploScore performance on chromosome 10. A. Heat map of the mean fraction of reported IBD segments found in parents, binned by two measures of segment length. B. The fraction of child-other segments that are true IBD as a function of segment length. True IBD segments are defined as having at least 80% of their sites encompassed by a parent-other segment. C. Heat map of the mean fraction of reported IBD segments found in parents, binned by segment genetic length and HaploScore. D. Receiver operating characteristic for reported IBD segments of various lengths, discriminating by HaploScore. The four panels are analogous to Figure 2A,B and Figure 4A,B, respectively, calculated on chromosome 10 here.

Figure S8. Accuracy of child-other IBD segments reported by GERMLINE in the 1000 Genomes cohort. This figure is analogous to Figure 2 but performed on the 1000 Genomes cohort. A. Heat map of the mean fraction of reported child-other IBD segments contained in a corresponding parent-other segment, binned by two measures of segment length as described in Figure 2A. B. The fraction of child-other segments that are true IBD as a function of segment length. True IBD segments are defined as having at least 80% of their sites encompassed by a parent-other segment as in Figure 2B. C-F. Histograms of child-other segment counts binned by segment overlap for segments of 2-3 cM (C), 3-4 cM (D), 4-5 cM (E), and 5-6 cM (F). Note the scale changes on the y-axes: though the fraction of true segments of length < 3 cM is smallest, this range contains over 5-fold more true segments than all other length ranges combined.

Figure S9. Improving detection of true IBD segments using HaploScore in the 1000 Genomes cohort. This figure is analogous to Figure 4 but performed on the 1000 Genomes cohort. A. Heat map of the mean fraction of reported IBD segments found in parents, binned by segment genetic length and HaploScore. Calculations are performed as in Figure 2A. B. Receiver operating characteristic for reported IBD segments of various lengths, discriminating by HaploScore. True IBD is defined as in Figure 2B. The dashed black line indicates the no-discrimination line. The area under each curve is parenthesized in its legend entry. C. Precision-recall plot for child-other segments binned by segment length.

Figure S10. Receiver operating characteristic for reported IBD segments of various lengths, discriminating by LODscore. True positive IBD segments are defined as having at least 80% of their sites encompassed by a parent-other segment. The area under each curve is parenthesized in its legend entry.
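The parent-selection rule from Figure S1 and the 80% overlap criterion used throughout these captions can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. It assumes that segments are represented as inclusive `(first_site, last_site)` index pairs over the assayed sites, that overlap is measured against the single best-overlapping parent-other segment, and that opposite homozygote counts are supplied by the caller. All function and parameter names are hypothetical, and breaking remaining ties in favor of the father is an arbitrary choice not specified in the source.

```python
def overlap_fraction(child, parent_segments):
    """Largest fraction of the child's sites covered by one parent-other segment."""
    start, end = child
    n_sites = end - start + 1
    best = 0
    for p_start, p_end in parent_segments:
        shared = min(end, p_end) - max(start, p_start) + 1
        best = max(best, shared)  # negative values (no overlap) never win
    return best / n_sites


def select_parent(child, father_segs, mother_segs, father_opp_hom, mother_opp_hom):
    """Pick the parent with the larger overlap; on a tie (including the case where
    neither parent has a segment), pick the one with fewer opposite homozygote
    sites in the child IBD region."""
    f = overlap_fraction(child, father_segs)
    m = overlap_fraction(child, mother_segs)
    if f != m:
        return "father" if f > m else "mother"
    return "father" if father_opp_hom <= mother_opp_hom else "mother"


def is_true_ibd(child, parent_segments, threshold=0.8):
    """True IBD: at least `threshold` of the child's sites lie in a parent segment."""
    return overlap_fraction(child, parent_segments) >= threshold


def precision_recall(labels, kept):
    """labels: true-IBD flags; kept: flags for segments passing a score filter."""
    tp = sum(1 for y, k in zip(labels, kept) if y and k)
    n_kept = sum(1 for k in kept if k)
    n_true = sum(1 for y in labels if y)
    precision = tp / n_kept if n_kept else float("nan")
    recall = tp / n_true if n_true else float("nan")
    return precision, recall
```

For the example in Figure S1B, a nine-site child segment `(0, 8)` with a father segment covering three of its sites and a mother segment covering five yields overlap fractions of 3/9 and 5/9, so `select_parent` returns the mother.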