riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions

Abstract The vast majority of bacterial genome sequencing has been performed using Illumina short reads. Because of the inherent difficulty of resolving repeated regions with short reads alone, only ∼10% of sequencing projects have resulted in a closed genome. The most common repeated regions are those coding for ribosomal operons (rDNAs), which occur in a bacterial genome between 1 and 15 times, and are typically used as sequence markers to classify and identify bacteria. Here, we exploit the genomic context in which rDNAs occur across taxa to improve assembly of these regions relative to de novo sequencing by using the conserved nature of rDNAs across taxa and the uniqueness of their flanking regions within a genome. We describe a method to construct targeted pseudocontigs generated by iteratively assembling reads that map to a reference genome’s rDNAs. These pseudocontigs are then used to more accurately assemble the newly sequenced chromosome. We show that this method, implemented as riboSeed, correctly bridges across adjacent contigs in bacterial genome assembly and, when used in conjunction with other genome polishing tools, can assist in closure of a genome.

This script was run 100 times, using a different random seed each time. As pseudo random number generation may differ between operating systems, comparable but not identical results can be expected.

--ref_as_contig
The assembly that results from including riboSeed's "long reads" is sensitive to the manner in which they are incorporated into the de novo assembly. Here, for our analyses, we used the SPAdes assembler [3], as it has built-in ways to include contigs (using the "-trusted-contigs" or "-untrusted-contigs") in FASTA format. Other assemblers could be used, but most require long reads to have a quality score associated with them, preventing direct use of riboSeed's long reads.
As mentioned in the Methods section, riboSeed uses the reference rDNA region in the initial subassembly; in subsequent subassemblies, the longest contig of the previous subassembly is used. These regions can be treated one of four ways using the --ref_as_contig argument: trusted, untrusted, infer, or ignore. Additionally, if the user is worried that the reference rDNA will too heavily influence the initial subassembly, they can enable the --initial_consensus flag to use a mapping consensus assembly instead of the de Bruijn graph based assembly from SPAdes.
The default manner in which rDNA regions (either from the reference or from the previous iteration's subassembly) behaviour is to infer (--ref_as_contig infer): if the percent of reads mapping to the (whole) reference sequence is over 80%, than the rDNA region will be included as a trusted contig. If below 80%, the reads will be treated as untrusted.
If a user wishes to have the subassemblies only using the reads (true de novo assembly), they can use the ignore option. We only recommend this with very close references.
Further, if the user wishes to explicitly define the behaviour, trusted or untrusted can be provided to the --ref_as_contig argument.

--score_min
By default, the accepted alignment score for BWA mapping is 1 2 the read length. If needed, this can be increased for greater stringency when dealing with more divergent references, or decreased to include more reads, which may be advantageous when assembling a low coverage dataset.     Table S6: Assembling S. aureus UAMS-1 with BugBuilder. We did not have access to critical information about the pipeline parameters used in the original assembly. This prevented exact recapitulation of the published results. Therefore, we approximated the settings based on notes from the publication. The performance of Pilon [5], GapFiller [6], or no finishing software was assessed with both the de fere novo and de novo assemblies. rDNA counts were visually determined using Mauve; all other metrics were generated with QUAST, using the scaffolds from the assemblies and the S. aureus MRSA252 reference. Misassembly/mismatch stats were removed, as the reference and sequenced strain have an average nucleotide identity of 97.62%, and the misassemblies cannot be differentiated from strain differences with QUAST. Blue and red highlight the best and worst results, respectively. Performance Across Prokaryotic Phyla

Performance on Archaeal Data
We assessed the effectiveness of riboSeed in assembling archaeal genomes. Most (~55%) archaeal genomes have only a single rDNA, and none has been observed to have more than four. As riboSeed requires a sequencing dataset and a reference genome, our ability to benchmark was limited; of the 104 entries in rrnDB with multiple rDNAs, only 7 had multiple entries at the species level. Among those, only 2 had publicly available short read data. We used were the only isolates that were suitable for riboSeed, in that there was publicly available short read data, more than a single rDNA operon, and an appropriate complete reference genome at the species level. Results are shown in Table   S7A. M formicicum BRM9 (CP006933.1) was used as a reference. While riboSeed with default parameters did not resolve any of the assembly gaps (final assembly k-mers 21, 33, 55, and 77), re-running the final assembly with k-mers of 21, 33, 55, 77, and 99 resulted in closing 2 of 2 rDNA gaps. We are unsure why the addition of 99-mers improved assembly with 89-bp reads, but we are actively investigating this. This shows that riboSeed is not limitted to Illumina short read data, and can be applied to Ion Torrent data.

Methanosarcina barkeri
Taken together, we show that given appropriate datasets and parameters, archaeal datasets can be processed in the same manner used for bacteria.

Performance Across Bacterial Phyla
In order to assess riboSeed's wider applicability, we selected additional datasets representing major bacterial phyla for those not already present in our analysis. In all cases, riboSeed improved the assemblies compared to the de novo with no missassemblies introduced (Table S7B). Thus, we conclude that riboSeed can be applied to a wide range of organisms.   Figure S1: Pseudocode of riboSeed algorithm q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q No rDNAs are correctly assembled using K. pneumoniae as the reference, or with de novo assembly. Scored with riboScore.py. N=8.  Figure S4: riboScan.py, riboSelect.py, and riboSnag.py were run on all the genomes used as references for de fere novo assemblies. Consensus alignment depth (grey bars) and Shannon entropy (black points, smoothed entropy as red line) for aligned rDNA regions. Similar to Figure 3 in the main text, for each genome, a gene neighboring the first rDNA operon was identified, and used to extract homologous rDNA operons from up to 25 other isolates at the species level. In most cases, the entropy is lower in homologous rDNAs than across all the rDNAs in a given genome. For strains with a low number of complete genomes for comparison available, entropy may be artificially increased (see Mycoplasma hominis) or decreased (Helicobacter cinaedi). A baseline entropy of greater than 0 may indicate equal distribution of two alleles of the operon either within a genome or across genomes. The GAGE-B paper [7] notes that the B. cereus HiSeq dataset proved particularly difficult to assemble. After noticing this irregularity, we re-assembled the trimmed reads downloaded from the GAGE-B website with metaSPAdes [8] using default parameters. Then, blastn was used to search the resulting contigs against NCBI's nt database (May, 2017) to get a list of hits according to the blobtools [9] specifications. Blobtools was then used to plot the hit coverage, taxonomy, and GC-content of the contigs. This revealed what appears to be a contamination. S5A. As the GC content of the contaminating organisms did not differ from B. cereus, we believe that many tools that use GC-skew to detect contamination would not have detected the problem with this dataset.
To further show the contamination, we split reads into those read pairs mapping to the B. cereus ATCC 10987 reference genome and those unmapped. BWA-MEM http://bio-bwa.sourceforge.net/ was used to map the 12039737 reads to the reference genome; samtools was used to separate the 7500534 reads (62%) that mapped from the 3984200 reads (33%) that failed to map with default parameters 1 .Each of these sets of reads was then assembled, BLASTed against the nt database, and plotted with blobtools S5B and C.
Further, MaxBin [10], Kraken, and MBBC [11] also supported the hypothesis that the sample is contaminated with approximately one third of reads originating from a non-B. cereus strain.

Atypical rDNA operon structure
Bacterial (and many archaeal) ribosomal RNA coding regions are commonly arranged into operons consisting of a 16S rRNA, 23S rRNA, and one or more 5S rRNAs, often with various tRNAs interspersed. In the course of this study, we observed some taxa lacking this typical 16S-23S-5S rRNA operon. When rDNAs are not structured into operons, assemblies from short reads do not suffer from the issue of long repeats and so do not require specialised approaches to assembly, such as riboSeed. We developed a module called structure for plotting rDNAs across a collection of genomes; this is available for riboSeed as of version 0.4.50. Figure S6 show the operon arrangement of a few examples of organisms exhibiting atypical operon structure. For comparison, the rDNAs in the reference strains used in this study are shown in Figure S7.  Figure S6: Atypical rDNA operon structure in select taxa. rRNA lengths are not shown to scale. Note that the NCBI record for Helicobacter pylori (NC_000915.1) shows a 5S rRNA not detected by Barrnap.