ABRA: improved coding indel detection via assembly-based realignment

Motivation: Variant detection from next-generation sequencing (NGS) data is an increasingly vital aspect of disease diagnosis, treatment and research. Commonly used NGS-variant analysis tools generally rely on accurately mapped short reads to identify somatic variants and germ-line genotypes. Existing NGS read mappers have difficulty accurately mapping short reads containing complex variation (i.e. more than a single base change), thus making identification of such variants difficult or impossible. Insertions and deletions (indels) in particular have been an area of great difficulty. Indels are frequent and can have substantial impact on function, which makes their detection all the more imperative.

Results: We present ABRA, an assembly-based realigner, which uses an efficient and flexible localized de novo assembly followed by global realignment to more accurately remap reads. This results in enhanced performance for indel detection as well as improved accuracy in variant allele frequency estimation.

Availability and implementation: ABRA is implemented in a combination of Java and C/C++ and is freely available for download at https://github.com/mozack/abra.

Contact: lmose@unc.edu; parkerjs@email.unc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


TCGA BRCA Somatic Variants
We applied ABRA to 750 BRCA tumor/normal pairs. Tumor and normal samples were assembled together, resulting in a single shared alternate reference for both samples. Variants were called with Strelka on the original BAMs (pre-ABRA) and on the ABRA-realigned BAMs (post-ABRA). Predicted indels were passed to TIGRA for assembly, and the resulting contigs were aligned with BLAT. A called indel is considered concordant if the TIGRA/BLAT result shows evidence of the variant in the tumor but not in the normal. Indel detection performance improves in the post-ABRA results. The total number of single nucleotide variants (SNVs) called is 619,911 pre-ABRA versus 617,872 post-ABRA. The figure below reflects results of indel calls using Strelka. The numbers in the figure represent cutoff points for the quality scores reported by Strelka.
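The concordance rule and the quality cutoffs described above can be illustrated with a minimal sketch. The `IndelCall` record, its field names and the use of `>=` at the cutoff are assumptions made for illustration; they are not taken from Strelka, TIGRA or BLAT output formats.

```python
from typing import List, NamedTuple, Sequence, Tuple


class IndelCall(NamedTuple):
    """Hypothetical record for one called indel after TIGRA/BLAT validation."""
    quality: float    # quality score reported by the variant caller
    in_tumor: bool    # validation shows evidence of the variant in the tumor
    in_normal: bool   # validation shows evidence of the variant in the normal


def is_concordant(call: IndelCall) -> bool:
    # Concordant: evidence in the tumor, but not in the normal.
    return call.in_tumor and not call.in_normal


def tally_by_cutoff(calls: Sequence[IndelCall],
                    cutoffs: Sequence[float]) -> List[Tuple[float, int, int]]:
    """Return (cutoff, concordant, discordant) for calls at or above each cutoff."""
    rows = []
    for cutoff in cutoffs:
        kept = [c for c in calls if c.quality >= cutoff]  # assumed inclusive cutoff
        concordant = sum(1 for c in kept if is_concordant(c))
        rows.append((cutoff, concordant, len(kept) - concordant))
    return rows
```

Running the same tally on pre-ABRA and post-ABRA call sets would give the paired counts that a plot of this kind compares at each cutoff.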

TCGA BRCA Germline Alignment Quality
Noisy read alignments containing disparate gaps, mismatches to the reference or high base quality soft clipping may indicate the presence of a more complex variant. An example of noisy pre-ABRA alignments (top panel) and the more parsimonious post-ABRA alignments (bottom panel) is shown below. We introduce Mismatch Density (MMD) as a measure of alignment quality for reads near called indels. MMD for a given indel locus is the number of high-quality bases not matching the reference genome (inclusive of soft-clipped bases) summed with the number of indels (not including the called indel) within a read length of the position of interest. In the 750-sample BRCA cohort, we measure MMD at the loci of germline indels called only in pre-ABRA alignments, only in post-ABRA alignments, and in both sets of alignments (common calls). For each of the three sets of calls, alignments in both the pre-ABRA and post-ABRA BAMs are examined. In all three groups, MMD is lower in post-ABRA alignments than in pre-ABRA alignments. A total of 6,055,130 simple SNPs (SNPs at least a read length away from another variant) are called pre-ABRA and 6,051,391 are called post-ABRA.
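The MMD definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ABRA's implementation: the `Read` record, the base quality threshold of 20, the choice to anchor a soft clip at the adjacent aligned reference position, and returning a raw count without further normalization are all assumptions.

```python
import re
from typing import List, NamedTuple, Optional, Sequence, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


class Read(NamedTuple):
    """Hypothetical minimal alignment record."""
    pos: int            # 0-based reference position of the first aligned base
    cigar: str          # SAM CIGAR string
    seq: str            # read bases, including soft-clipped bases
    quals: List[int]    # per-base Phred qualities


def mismatch_density(reads: Sequence[Read], ref: str, locus: int,
                     called_indel: Optional[Tuple[int, str, int]],
                     read_len: int, min_qual: int = 20) -> int:
    """Count high-quality non-reference bases plus indels near `locus`.

    `called_indel` is (ref_pos, op, length) with op 'I' or 'D'; that indel
    is excluded from the count. `min_qual` is an assumed threshold.
    """
    lo, hi = locus - read_len, locus + read_len
    total = 0
    for r in reads:
        ref_i, seq_i = r.pos, 0
        for length, op in CIGAR_RE.findall(r.cigar):
            n = int(length)
            if op in "M=X":
                # Aligned bases: count high-quality mismatches in the window.
                for k in range(n):
                    rp = ref_i + k
                    if (lo <= rp <= hi and 0 <= rp < len(ref)
                            and r.quals[seq_i + k] >= min_qual
                            and r.seq[seq_i + k] != ref[rp]):
                        total += 1
                ref_i += n
                seq_i += n
            elif op == "S":
                # Soft-clipped bases count as non-matching when high quality.
                if lo <= ref_i <= hi:
                    total += sum(1 for q in r.quals[seq_i:seq_i + n]
                                 if q >= min_qual)
                seq_i += n
            elif op in "ID":
                # Every indel other than the called one adds one to the count.
                if lo <= ref_i <= hi and (ref_i, op, n) != called_indel:
                    total += 1
                if op == "I":
                    seq_i += n
                else:
                    ref_i += n
            elif op == "N":
                ref_i += n
            # H and P consume neither read nor reference bases.
    return total
```

Evaluating the same locus against the pre-ABRA and post-ABRA alignments would then give the paired values compared in the text.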

TCGA BRCA Germline Allele Frequency
Working from indel variant calls common to both the pre-ABRA and post-ABRA results in the BRCA cohort, we compared the allele frequencies of post-ABRA variant calls with those of pre-ABRA variant calls. Post-ABRA variant calls generally have allele frequencies closer to the 50% (heterozygous) and 100% (homozygous) rates expected in a diploid individual.
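One simple way to quantify "closer to the expected rates" is the distance from each observed allele frequency to the nearest of 0.5 and 1.0. The sketch below assumes that metric; it is an illustrative choice, not necessarily the comparison used to produce the reported results, and the function names are hypothetical.

```python
from statistics import mean
from typing import Sequence

# Allele frequencies expected for heterozygous and homozygous germline variants.
EXPECTED_DIPLOID_AF = (0.5, 1.0)


def deviation_from_expected(af: float) -> float:
    """Distance from an observed allele frequency to the nearest expected value."""
    return min(abs(af - expected) for expected in EXPECTED_DIPLOID_AF)


def mean_deviation(afs: Sequence[float]) -> float:
    """Average deviation over a set of variant calls."""
    return mean(deviation_from_expected(af) for af in afs)
```

Applied to the same set of common calls, a lower `mean_deviation` for the post-ABRA frequencies than for the pre-ABRA frequencies would correspond to the improvement described above.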

Simulated Somatic Evaluation
To evaluate ABRA's impact on somatic variant calling, we simulated over 10,000 variants in approximately 200 cancer genes of interest. SNVs and indels ranging in length from 1 to 100 bases were spiked into test datasets at frequencies ranging from 5% to 50%. The simulated reads are 100 bp long. Initial alignments were performed using BWA-MEM. We compare the results of Strelka variant calling with no realignment, with GATK Local Realignment around Indels and with ABRA realignment. ABRA enables improved performance for indels of length 10 or more. The numbers on the plot represent variant quality scores reported by Strelka.

Figure 5. Evaluation of ABRA's somatic variant calling performance on simulated data
In the same simulated dataset, we examine mutant allele frequency (MAF) estimation. ABRA-realigned results display improved MAF estimation across all indel lengths.