Summary: Assemblytics is a web app for detecting and analyzing variants from a de novo genome assembly aligned to a reference genome. It incorporates a unique anchor filtering approach to increase robustness to repetitive elements, and identifies six classes of variants based on their distinct alignment signatures. Assemblytics can be applied both to comparing aberrant genomes, such as human cancers, to a reference, or to identify differences between related species. Multiple interactive visualizations enable in-depth explorations of the genomic distributions of variants.
Supplementary information:Supplementary data are available at Bioinformatics online.
De novo genome assembly is becoming increasingly tractable on large genomes due to advances in long-read sequencing and mapping. This is leading to a greater quality and quantity of reference genomes across the tree of life ( Lee et al. , 2014 ; Roberts et al. , 2013 ). Researchers can now sequence and assemble the genomes of several related strains or species in order to compare them. This is a vast improvement over more common resequencing approaches where sequencing reads are aligned to a single reference genome, often allowing only SNPs or short indels to be identified. Now that increasing numbers of high-quality genome assemblies are available, there is a need to detect the large structural variants that mark important differences between these genomes. For example, there may be more than 10 000 structural variations representing megabases of genetic diversity present per human genome (see below). Assemblytics builds on the innovations of the whole genome alignment suite MUMmer ( Kurtz et al. , 2004 ) in order to detect and analyze these variants.
Assemblytics analyzes the alignments from MUMmer’s nucmer program to identify high-confidence structural variants in each sequence (contig) in the sample relative to a reference or another de novo assembly. It begins by using a plane-sweep algorithm to quickly identify all overlapping alignments with respect to the sample. It then filters the alignments to report those with at least a minimum amount of unique contig sequence anchor (default: 10 kbp) contained in no other alignments of that contig. This is similar to the filtering performed by delta-filter component of dnadiff ( Phillippy et al. , 2008 ), although guarantees uniqueness of the alignments while dnadiff may select equally matching repetitive alignments arbitrarily ( Supplementary Note 1).
The variant identification algorithm then considers each pair of consecutive alignments along a sample contig, determining variant presence and class by the spacing and orientation between these alignments. Assemblytics identifies all insertion and deletion variants as small as 1 bp up to a maximum of 10 kbp in size, with this maximum adjusted to match the size of the unique sequence anchor. This prevents translocations and complex variants from being interpreted as indels. Variants from multiple alleles are reported independently and tagged with contig of origin. Figure 1A illustrates the differences between variant classes. Tandem and repeat expansions are insertions taking place within tandem and repeat sequences, respectively, while tandem and repeat contractions are deletions within these elements. Insertions and deletions are characterized by a defined breakpoint (less than 50bp overlap or gap) on one side. Tandem variants occur between overlapping alignments (over 50 bp). Repeat variants occur within unmappable gaps between alignments. The individual alignments are also scanned to detect insertions or deletions spanned by the alignment. Finally variant classes, size distributions, and genomic coordinates of all variants are summarized through plots and tables, as well as an interactive dot plot of alignments ( Fig. 1B and C ). Supplementary Note 3 provides more details on the web interface.
We first evaluated the accuracy of the variation detection by applying Assemblytics to analyze simulated structural variations within the human genome (Supplemental Note 2). To do so, we created a modified version of the human genome with simulated insertions and deletions embedded at known positions, and then aligned both the original unmodified reference genome and the human genome assembled by MHAP presented by Berlin et al. (2015) to this modified reference. Our results show that Assemblytics is able to correctly identify the vast majority of variants present (90.9–99.9% recall) and with very low false positive rates (0.29–0.40%).
We next applied Assemblytics to all five de novo assemblies (human, D.melanogaster , A.thaliana , S.cerevisiae and E.coli ) presented by Berlin et al. (2015) to their respective reference genomes ( Supplementary Note 4). In each case, the genome was analyzed in less than 5 min and the full interactive results are available on the Assemblytics website. All assemblies showed a large number of structural variations, proportional to the size of the genome and the sizes of the events approximated a log-normal distribution with a long tail of large events. Drosophila was a notable exception with a relatively small number of variants identified, with only 204 compared to 2501 for A.thaliana of similar size. This was because the assembly was derived from the same inbred population as the original reference (ISO1). In the case of the human analysis, Assemblytics reported 11 206 structural variations spanning over 7.0 Mb. This included a noticeable enrichment for ∼320 bp insertions and deletions which we characterized as novel Alu variants.
Assemblytics allows researchers to take advantage of high-quality genome assemblies for detecting structural variation between species or even between aberrant and normal genomes in human disease. Here we applied Assemblytics to study 5 de novo assemblies produced using MHAP from PacBio reads, although it is capable of exploring assemblies produced by any algorithm and sequencing technology, including local assemblies. The accuracy of variant calls depends on the qualities of both assembly and reference genome as well as the accuracy of the whole genome alignments. A key advantage of Assemblytics is unique length filtering which disregards alignments that are not anchored in a significant amount of unique sequence. This provides a conservative filter, similar to requiring high mapping quality for read alignment, something that was not previously available for genome-genome alignments.
We thank Jason Chin, Adam Phillippy, Art Delcher, W. Richard McCombie, Fritz Sedlazeck and James Gurtowski for their helpful discussions and testing.
National Science Foundation DBI-1350041 and National Human Genome Research Institute R01-HG006677.
Conflict of Interest : none declared.