Population-scale detection of non-reference sequence variants using colored de Bruijn graphs

Abstract Motivation With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes. Results We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets. Availability and implementation The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2. Supplementary information Supplementary data are available at Bioinformatics online.


Simulation and read alignment of the 100 simulated human samples
For each sample we used ART_illumina [Huang et al., 2012] to simulate paired short-reads (Illumina HiSeq 2500, 150bp length, 15X read coverage per haplotype, 300bp fragment size with a standard deviation σ=50bp) from the simulated haplotypes of chromosome 21. The simulated reads in the resulting FASTQ files were aligned to the modified human genome (HGCh38 with missing sequences I sim , see section 3.1) with bwa mem [Li, 2013]. Therefore, our simulation workflow considers alignment artifacts, i.e., some reads are not aligned to their original location on chromosome 21 but elsewhere in the genome. We call these reads off-target reads. On average each sample contains 7.49 million reads, 7.38 million of them are aligned or partially aligned to chromosome 21 and 12.7 thousand reads are labelled as unmapped.

Assembly metrics using QUAST
Supplementary Table 4 supports the simulated data analysis in section 3.1 of the main text. We used QUAST [Gurevich et al., 2013], a quality assessment tool for genome assemblies, to assess the consensus of the supercontigs with the truthset T . The results in Supplementary Table 4 show that the chosen default value for the ASF parameter (0.67) is a reasonable middle ground between assembly continuity, number of misassemblies and covered sequence fraction by the callset. For PopIns2, the N50, the NGA50 but also the number of misassemblies grows with an increasing ASF. The continuity and accordance with the truthset of PopIns' callset ranks between the results of PopIns2 with ASF=0.50 and ASF=0.67. However, directly compared to the results of PopIns2 with matching ASF=0.50, the supercontigs of PopIns contain more misassemblies, misassembled contigs and mismatches. PopIns2 has less mismatches in its assemblies and covers a higher fraction of the truthset than Pamir with all tested ASF settings. However, Pamir seems to create less misassemblies. Note that the number of misassemblies in this analysis is not trivial to compare as all callsets (except for the manually cropped one; marked with "*") have flanking reference sequences. The comparison of the two callsets of Pamir shows that QUAST reports many more misassemblies if flanking regions are present.
Interestingly, both variant caller report a largest alignment of 9074 bp which is the length of the second largest sequence in T .

Bridging low entropy sequences
Many genomic sequences contain sub regions of low entropy (e.g. short tandem repeats, homopolymers, poly-A tails). Those sub regions have a prominently reduced variety in dinucleotides. Sampling k-mers from low entropy sequences typically results in many highly similar k-mers. Subsequently, when a dBG is built from highly similar k-mers they tend to form characteristic topologies (Supplementary Figure 4a,b). We observed that these topologies can grow and cluster into large, complex and densely interconnected components (Supplementary Figure 4c) which we term Low Entropy Connected Components (LECC). The traversal scheme of PopIns2 merge can make poor neighbor decisions if highly abundant k-mers across many samples form complex LECCs and potentially returns incorrect supercontigs, i.e. the supercontigs do not contain accurately assembled non-reference sequence. When PopIns was developed the same problem with low entropy sequences was discovered even though the merge algorithm is entirely different. The solution of PopIns to this problem is a threshold y that leaves contigs below an entropy of y ignored within the merge algorithm. For PopIns2 we designed a mechanism that masks uncertainty in the supercontig sequences due to low entropy during the generation of a sequence.
Finding partners around LECCs. In the merge module prior to the traversal all unitigs with a sequence entropy below a user defined threshold (parameter -e with default=0, no consideration of LECCs) are annotated. Next, connected components of annotated unitigs (hence LECCs) are identified. For each LECC L all unitigs U L which border L are stored together with the information which k-mer k L (head or tail) connects to L. Then, for every u ∈ U L a DFS is initiated and searches through L until all accessible unitigs P ⊆ U L (called "Partners") are found. The color vector of k-mer k L ∈ u is compared to the color vector of every k L ∈ p, where p ∈ P, using the Jaccard Index of k-mer colors. As a result, a map M L stores the association between every u and its best color matching partner p.
Traversal with jumps. Once the best color matching partners around the LECCs are computed the merge module continues with the main traversal method. If the traversal detects a LECC L at some unitig u during its recursive progression the algorithm makes a lookup in M L to determine the partner of u. Then, instead of accessing a direct neighbor of u the traversal jumps across L directly to p. Later when the concatenation function ω generates the supercontig sequence from a given path, a jump over a LECC is encoded as a sequence of 'N' characters of length k.
Adjustment of sequence masking. The decision and extend of masking sequence to avoid potential unnoticed errors in supercontig sequences to left the user. By default PopIns2 does not apply masking and returns only complete sequences. However, insertions sequences that were generated from a path that led through a LECC tend to be less reliable in terms of sequence content and length. This becomes particularly severe if the true sequence contains STRs. As the merge module of PopIns2 is not optimized for accurate repeat resolution a false amount of copies of the repeat pattern can go unnoticed if the STR is longer than k and low entropy is not masked. By comparison, the merge process of PopIns ignores all input contigs below an entropy threshold of 0.75 by default.
PopIns2 merge offers an option (command line parameter -l) to explore the masking process by writing a CSV file that contains the IDs of low entropy unitigs according to the entropy parameter -e. This CSV file can be loaded into Bandage [Wick et al., 2015] to visualize the extend of LECCs in the graph.
Complexity of finding partners around LECCs. The mechanism to find partners around LECCs introduces additional computations to the merge algorithm but, in practise, its cost in run time is negligible compared to the main routine. The worst-case run time of our implementation to find all partners for all LECCs is where V is the amount of vertices in the ccdBG, E is the amount of edges in the ccdBG, N is the amount of LECCs, V LECC is the amount of vertices in a LECC, E LECC is the amount of edges in a LECC, B is the amount of vertices that border a particular LECC and C is the amount of colors in the ccdBG. Supplementary Table 7 shows a low level running time analysis for the methods involved into the partner finding mechanism.

Set overlap of joint NRS callsets from 147 samples
Supplementary Figure 7 shows the diagrams for the set overlaps of the joint NRS callsets from PopIns (here set P 1 ) and PopIns (here set P 2 ). To determine a set overlap of P 1 and P 2 we first computed an all-vs-all sequence alignment (using STELLAR [Kehr et al., 2011]; --minLength 50; maximal error rate --epsilon 0.05). With the given parameters we commonly observed a fragmentation of the pairwise alignments into multiple shorter local alignments. Hence, we decided to compute the set overlap separately for each callset as the amount of NRS that are covered by sequence of the orthogonal callset. In this context, a NRS p 1 ∈ P 1 (p 2 ∈ P 2 ) is covered if the fraction of base pairs that has no alignments to a p 2 ∈ P 2 (p 1 ∈ P 1 ) is below 1 − t.
We computed the set overlap of each callset for multiple minimum coverage thresholds t.

Set overlap of NRS callsets from trio pedigrees
The NRS sequences in Supplementary Figure 9 were taken from the final VCF file of each caller and every trio pedigree. Hence, every NRS in this analysis was already (partially or fully) placed to the reference genome and genotyped. At least one individual of the trio pedigree needs to contain the variant in at least one allele for the NRS to be extracted from the VCF. Next, we trimmed the flanking reference sequences from the NRS of Pamir's callsets. We further excluded all remaining sequences of less than 50bp length from all callsets. The remaining callsets of Popins, PopIns2 and Pamir individually had a median value of 1650, 1898 and 6661 NRS sequences per trio, respectively. Finally, we used STELLAR [Kehr et al., 2011] to compute the pairwise alignments.

Mendelian inheritance error rate and transmission rate
In Supplementary Figure 10