The missing indels: an estimate of indel variation in a human genome and analysis of factors that impede detection

With the development of High-Throughput Sequencing (HTS) thousands of human genomes have now been sequenced. Whenever different studies analyze the same genome they usually agree on the amount of single-nucleotide polymorphisms, but differ dramatically on the number of insertion and deletion variants (indels). Furthermore, there is evidence that indels are often severely under-reported. In this manuscript we derive the total number of indel variants in a human genome by combining data from different sequencing technologies, while assessing the indel detection accuracy. Our estimate of approximately 1 million indels in a Yoruban genome is much higher than the results reported in several recent HTS studies. We identify two key sources of difficulties in indel detection: the insufficient coverage, read length or alignment quality; and the presence of repeats, including short interspersed elements and homopolymers/dimers. We quantify the effect of these factors on indel detection. The quality of sequencing data plays a major role in improving indel detection by HTS methods. However, many indels exist in long homopolymers and repeats, where their detection is severely impeded. The true number of indel events is likely even higher than our current estimates, and new techniques and technologies will be required to detect them.

The indel sets in NA18507 were generated from Illumina 100bp reads and Sanger traces, and used to estimate the total indel number in NA18507. We first used Sanger traces to validate Illumina indels. Our calculation was based on counting Illumina indels with Sanger coverage (i.e. covered by a Sanger read). For each Sanger read which covers an Illumina indel there are three situations: a) the Sanger read supports the indel i.e. either has exactly the same indel or has a indel that can be shifted to the Illumina indel by introducing at most one mismatch in the alignment; b) the Sanger read rejects the indel i.e. has a continuous segment covering the Illumina indel with at least 5bp flanking sequences on both sides of the indel; c) the Sanger read supports a different indel i.e. has a different indel within 5bp of the indel which cannot be shifted as in a).
We refer to a reported Illumina homozygous indel as true positive if all covering Sanger reads support it (situation a) and as false positive if all covering Sanger reads reject it (situation b). Define hom as total detected homozygous indels, hom a and hom b as the number of homozygous indels detected in Illumina reads with all covering Sanger reads supporting or rejecting the indel, respectively. Then the false discovery rate of homozygous indel detection is defined as the proportion of the rejected indels among all detected homozygous indels with high Sanger coverage: (1) The number of true positive homozygous indels is estimated as For heterozygous indels the Sanger reads should ideally come from both the variant allele (supporting the indel) and the reference allele (rejecting the indel). However, due to the very low Sanger coverage, requiring that both alleles be supported by Sanger data is too stringent for the majority of heterozygous indels. Instead we chose a pragmatic approach of gauging the FDR by examining the imbalance between the indel coverage and the reference coverage. Specifically, we focused on the numbers of heterozygous indels with all covering Sanger reads supporting/rejecting the indel. Define het as total detected heterozygous indels, and het a and het b as the numbers of heterozygous indels detected in Illumina data with all covering Sanger reads supporting/rejecting the indel, respectively. For true positive heterozygous indels the probability of full support and full rejection is 50:50 and so het a and het b should be close to each other. Therefore the difference between het b and het a can be used to estimate FDR for heterozygous indel detection as follows: The number of true positive heterozygous indels is estimated as Then we estimated the sensitivity of indel detection, i.e. the fraction of true indels from the reference annotation set that were detected correctly. Since most annotations lack heterozygosity information we compared homozygous and heterozygous indels to the same annotation and used the same sensitivity estimate for both indel types. Define s as the sensitivity of indel detection on an annotation. Then the preliminary estimate for the total indel number is estimated as However, this estimate does not yet take into account two additional sources of indel loss. First, although most of the existing indel annotations are validated, they may still contain false positives. Second, due to different sequencing technologies and analysis methods certain read sets may not be able to detect some of the real indels in an annotation.
To evaluate the FDR of a Sanger-based indel annotation we built a new reference with the indel sequences and reference genome, as described in the Materials and Methods section of the main text, and aligned high coverage Illumina sequencing data to it. Indels with coverage but no supporting reads for the indel variant can be considered false positives. Define fp as the number of these false positive indels and n as the number of all indels with Illumina reads. Then the false discovery rate of the annotation is Next, to quantify the incompleteness of Illumina coverage, define no_cov as the number of indels without Illumina coverage. Depending on the previously estimated FDR ann , a fraction of them may not be true indels, but the remaining ones should be considered true indels that were missed due to incomplete Illumina coverage (i.e. false negatives). The number of such false negatives, fn, is estimates as This suggests that true Sanger-annotation indels comprise (n-fp) indels that have Illumina coverage (true positives), and fn indels that were missed by Illumina reads (false negatives): Therefore the false negative rate of the Illumina sequencing data is defines as FNR data = fn / true_ann = fn / ( n -fp + fn ).
Using the false discovery rate of the annotation, and adjusting further for the incompleteness of Illumina sequencing coverage, the adjusted sensitivity s' is defined as: The preliminary estimate in equation (5) is then replaced with the adjusted estimate of the total number of indels: Section 2: Analysis tools and parameters BWA (v0.6.1) was used to align the Illumina reads and BWA-SW (v0.6.1) to align the Sanger reads with default settings.
BFAST (v0.7.0a) was used with the default parameters, similarly to the steps described in (1). BFAST is a highly sensitive aligner that generates a large number of alignments of lower mapping quality. During the post-processing step, low quality alignments (MAPQ < 20) were removed to ensure that all pipelines using BFAST alignments finish the indel detection in a reasonable amount of time (within 72 hours on a high-performance computing cluster).
Picard was used to remove duplicate reads from Illumina reads alignment with "VALIDATION_STRINGENCY = LENIENT REMOVE_DUPLICATES = True".
Dindel (v1.01) was used with the default parameters, similarly to the steps described in (1).
PRISM (v1.1.5): we set minimum discordant pair number for a cluster to 2. We call an indel when it is supported by at least 5 reads and the best alignment has no more than 2 mismatches.

Section 3: Concordance among indel detection tools
We found only a modest degree of overlap between the different methods: 369,641 indels were detected by all five pipeline combinations used in our analysis ( Figure S3). The relatively low level of concordance is consistent with the results from literature.
A low concordance among different indel detection methods has been observed in a number of studies, suggesting that indel detection in human populations is likely to be rather incomplete (2). A recent study reported a mere 26.8% agreement among three indel-calling pipelines, which is substantially lower than the concordance for SNP calls (3). These results indicate a potentially high numbers of false positives and/or false negatives. Another recent study found an even lower concordance of only 14.3% among three different indel detection pipelines, and noted different biases among the tools towards detecting short versus long indels (4). The concordance tended to improve, however, after requiring a higher read coverage for one of the pipelines.  C. Indels more than 5Mbp away from centromeres and telomeres. D. Indels outside UCSC segDup annotations. E. Indels within 50bp of Alus. F. Indels more than 50bp away from Alu elements. In GC content region of 36%-40%, PRISM sensitivity and saturation decline, although the coverage rises. This situation holds for B-D but not for E. F shows that PRISM detection performance outside of Alu regions is superior to A-E, which together with E indicates that the presence of Alus is the main reason for the loss of detection accuracy.

Figure S2. Dependence of coverage and GATK indel-detection metrics on the GC content.
The reference genome was cut into 200bp pieces and binned by GC content. GATK indel detection sensitivity (green curve) and saturation (red curve) are shown for each bin along with the genome coverage (black curve). Also shown is the distribution of the full reference genome across the same GC bins (blue semi-transparent histogram), as well as the distribution of Alu elements (red semi-transparent histogram). The two histograms demonstrate that Alus are generally overrepresented in areas with higher GC content, and also that the noticeable dip in the GATK sensitivity corresponds well with the presence of Alu elements (pink and magenta areas of the Alu histogram). The main results in this study were obtained using the PRISM+BWA pipeline. They were validated using 4 other pipelines, which combined either BWA or BFAST mappers with either GATK or Dindel detection algorithms. A comparable degree of overlap exists between different setups of indel detection tools. Figure S4. Estimation of the total number of 1-10bp indels in the Yoruban genome NA18507 using GATK in combination with BWA read aligner. The workflow involves the estimation of four sets of values: GATK FDR and the number of true positive indels detected by GATK (green boxes); the reliability of the reference indel annotation combined from Kidd and Mills sets (via false discovery rate, FDR; blue boxes); the incompleteness of the Illumina read coverage of the reference indels (via false negative rate, FNR; yellow boxes); and the computation of the adjusted sensitivity of indel detection in GATK, which is used to estimate the overall number of indels in the genome (orange boxes). Each box shows the initial indel counts or pipeline sensitivity (red numbers) as well as the computed estimates (blue numbers) based on the equations indicated in parentheses. The detailed explanation of the equations and the workflow is presented in Section 1 of this document. Figure S5. Estimation of the total number of 1-10bp indels in the Yoruban genome NA18507 using GATK in combination with BFAST read aligner. Figure S6. Estimation of the total number of 1-10bp indels in the Yoruban genome NA18507 using Dindel in combination with BWA read aligner. Figure S7. Estimation of the total number of 1-10bp indels in the Yoruban genome NA18507 using Dindel in combination with BFAST read aligner.

Figure S8. Distribution of indel length in human and four other primates in non-homopolymers.
The counts of indels of each length are normalized by the number of 1 bp indels. The fractions of longer indels are very stable across all the five species. Indels of even length are not included due to the existence of dimers whose variation rate can be significantly different from odd length indels.