Plassembler: an automated bacterial plasmid assembly tool

Abstract

Summary
With recent advances in sequencing technologies, it is now possible to obtain near-perfect complete bacterial chromosome assemblies cheaply and efficiently by combining a long-read-first assembly approach with short-read polishing. However, existing methods for assembling bacterial plasmids from long-read-first assemblies often misassemble or even miss bacterial plasmids entirely and accordingly require manual curation. Plassembler was developed to provide a tool that automatically assembles and outputs bacterial plasmids using a hybrid assembly approach. It achieves increased accuracy and computational efficiency compared to the existing gold standard tool Unicycler by removing chromosomal reads from the input read sets using a mapping approach.

Availability and implementation
Plassembler is implemented in Python and is installable as a bioconda package using ‘conda install -c bioconda plassembler’. The source code is available on GitHub at https://github.com/gbouras13/plassembler. The full benchmarking pipeline can be found at https://github.com/gbouras13/plassembler_simulation_benchmarking, while the benchmarking input FASTQ and output files can be found at https://doi.org/10.5281/zenodo.7996690.


Introduction
Advances in the accuracy of long-read sequencing have made near-perfect bacterial genome assemblies attainable by combining long- and short-read sequencing technologies (Wick et al. 2023). Until recently, short-read-first hybrid assembly methods were favoured, using tools such as Unicycler (Wick et al. 2017), which implements short-read assembly using SPAdes (Bankevich et al. 2012). As long-read sequencing accuracy has continued to improve, current best practice favours long-read-first assemblies supplemented with short-read polishing using tools such as Trycycler (Wick et al. 2021b), Dragonflye (https://github.com/rpetit3/dragonflye), or MicroPIPE (Murigneux et al. 2021).
A limitation of long-read-first assemblies is that small (<20 kb) plasmids are often missed, especially when ligation-based library preparation methods are used (Wick et al. 2021a). This may result in an incomplete picture of a sample's plasmid mobilization and virulence potential, particularly for samples with plasmids carrying antimicrobial resistance genes (Barry et al. 2019). In addition, long-read-first assemblies often misassemble small plasmids by doubling or tripling their length (Wick and Holt 2019, Johnson et al. 2023), requiring manual intervention and curation. Accordingly, current best practice recommends hybrid short-read-first assembly to recover small plasmids (Johnson et al. 2023, Wick et al. 2023). However, this method is computationally inefficient, as all input reads are assembled, including the majority that constitute the bacterial chromosome.
To improve computational efficiency, to increase accuracy, and to provide plasmid-only output that can be integrated with chromosomal assemblies from long-read-first pipelines, we created Plassembler as a one-line tool that automatically outputs bacterial plasmid assemblies. Its increase in computational efficiency results from removing all reads that map to a quick draft bacterial chromosome assembly, created using Flye (Kolmogorov et al. 2019) by default or optionally with Raven (Vaser and Šikić 2021), before conducting hybrid assembly using Unicycler (Wick et al. 2017). Plassembler then matches each assembled plasmid contig to the PLSDB (Galata et al. 2019) and outputs plasmid copy-number statistics for both long and short reads. Plassembler can also be used as a fast quality control tool to check that long and short reads are derived from the same bacterial isolate, which may be particularly useful for users conducting long-read re-sequencing to complete the genomes of isolates previously sequenced with short reads only.

Materials and methods
The Plassembler workflow is outlined in Fig. 1.

Input
Plassembler requires hybrid short paired-end and long read single-end FASTQ sequencing reads from the same bacterial isolate, along with a minimum size threshold for classifying chromosomal contigs specified using the '-c' parameter as input (Fig. 1A). Sufficient long-read sequencing depth is required to assemble chromosomal contigs that are larger than the provided threshold (see Section 2.3).
Plassembler provides the option of filtering the long reads by minimum read length using the '-m' parameter (defaults to 500 bp) and by minimum quality using the '-q' parameter (defaults to a Q-score of 9). Quality control can be skipped using the '-skip_qc' parameter.
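The two long-read filters can be illustrated with a minimal Python sketch. It is not Plassembler's implementation: the function names are hypothetical, and it assumes Phred+33 quality strings and a mean quality computed by averaging per-base error probabilities (a common convention for long reads) rather than averaging raw Q-scores.

```python
import math


def mean_qscore(quality_string, phred_offset=33):
    """Mean read quality, averaged over error probabilities (assumed convention)."""
    probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def passes_filter(sequence, quality_string, min_length=500, min_quality=9):
    """Return True if a read meets both thresholds.

    The defaults mirror the documented defaults of '-m' (500 bp) and '-q' (Q9).
    """
    if len(sequence) < min_length:
        return False
    return mean_qscore(quality_string) >= min_quality


# A 600 bp read with every base at Q20 ('5' in Phred+33) is kept;
# a 400 bp read of the same quality is dropped for length.
print(passes_filter("A" * 600, "5" * 600))
print(passes_filter("A" * 400, "5" * 400))
```

Averaging error probabilities penalizes reads with a few very poor stretches more strongly than averaging Q-scores would, which is why the two conventions can disagree for the same read.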

Long-read-only assembly
By default, Plassembler uses Flye (Kolmogorov et al. 2019) to conduct a long-read-only assembly of the filtered long reads (Fig. 1C). Flye was chosen as the default long-read assembler due to its high chromosome and plasmid recovery, accuracy, and fast runtime (Wick and Holt 2019). If the resulting assembly has at least one contig that is longer than the provided '-c' chromosome length, then all such contigs are denoted as chromosomal and Plassembler continues. Otherwise, Plassembler will exit, asking the user to check the '-c' parameter value or to increase long-read sequencing depth to ensure a complete chromosome is assembled. If additional contigs are assembled that are smaller than the provided '-c' chromosome length, Plassembler denotes these as putative plasmid contigs. The '-c' parameter defaults to 1 megabase, allowing for some assembly fragmentation while retaining even large plasmids.
Alternatively, the long-read assembler Raven (Vaser and Šikić 2021) can be used instead of Flye using the '-use_raven' parameter, which will likely decrease runtime at the potential cost of accuracy (Tables 2 and 3). By default, Plassembler expects Oxford Nanopore Technologies long reads as input, but can also be used with Pacific Biosciences long reads using the '-pacbio_model' parameter.
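The contig classification rule described above can be sketched in a few lines of Python. This is an illustration only: the function name and contig names are hypothetical, and the treatment of a contig whose length exactly equals the threshold (here, putative plasmid) is an assumption, since the text only specifies "longer than" and "smaller than".

```python
def classify_contigs(contig_lengths, chromosome_threshold=1_000_000):
    """Split long-read contigs into chromosomal and putative plasmid contigs.

    contig_lengths: dict mapping contig name -> length in bp.
    chromosome_threshold: stands in for the '-c' parameter (default 1 Mb).
    """
    chromosomal, putative_plasmids = [], []
    for name, length in contig_lengths.items():
        if length > chromosome_threshold:
            chromosomal.append(name)
        else:
            # Assumption: a contig exactly at the threshold is not chromosomal.
            putative_plasmids.append(name)
    if not chromosomal:
        # Mirrors the described behaviour of exiting with advice to the user.
        raise ValueError(
            "No contig exceeds the chromosome threshold; check the threshold "
            "or increase long-read sequencing depth."
        )
    return chromosomal, putative_plasmids


chrom, plasmids = classify_contigs(
    {"contig_1": 2_800_000, "contig_2": 45_000, "contig_3": 2_473}
)
print(chrom, plasmids)
```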

Read mapping
Plassembler then maps all long and short reads to the long-read-only assembly using Minimap2 (Li 2018) (Fig. 1D). All unmapped reads and all reads that map to putative plasmid contigs are then extracted using SAMtools (Li et al. 2009) and combined.
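The read selection rule can be expressed as a short Python sketch. It is a simplification under stated assumptions: each read is represented by a single (read name, contig) pair for its primary alignment, with None standing for an unmapped read, and the function name is hypothetical. In practice this step is performed on Minimap2 alignments with SAMtools.

```python
def select_reads_for_hybrid_assembly(alignments, plasmid_contigs):
    """Keep reads that are unmapped or that map to a putative plasmid contig.

    alignments: iterable of (read_name, contig_name or None) pairs,
                one per read (primary alignment only; an assumption).
    plasmid_contigs: set of putative plasmid contig names.
    """
    kept = []
    for read_name, contig in alignments:
        if contig is None or contig in plasmid_contigs:
            kept.append(read_name)
    return kept


alignments = [
    ("read_1", "chromosome"),  # chromosomal read: discarded
    ("read_2", None),          # unmapped read: kept
    ("read_3", "contig_2"),    # maps to a putative plasmid contig: kept
    ("read_4", "chromosome"),
]
print(select_reads_for_hybrid_assembly(alignments, {"contig_2"}))
# prints ['read_2', 'read_3']
```

Because most reads in a typical isolate derive from the chromosome, this selection leaves only a small fraction of the input for the subsequent hybrid assembly.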

Hybrid assembly and depth estimation
Hybrid assembly is then conducted with Unicycler (Wick et al. 2017) to generate final plasmid contigs and assembly graphs (Fig. 1E). Long- and short-read plasmid copy numbers and associated statistics are estimated by mapping all reads to the chromosome and final plasmid assemblies using Minimap2 (Li 2018) and the SAMtools depth function (Li et al. 2009, Wick et al. 2021a) (Fig. 1F).

Figure 1 (caption, partially recovered). (C) Long-read-first assembly is conducted with Flye by default or optionally with Raven. (D) All long and short reads are mapped to the long-read-first assembly. All reads that are unmapped and all reads that map to putative plasmid contigs are extracted. (E) These reads are then assembled using Unicycler. (F) Plasmid copy number is estimated for each assembled plasmid contig. (G) Each plasmid contig is matched against the PLSDB using mash.
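As an illustration of depth-based copy-number estimation, the sketch below assumes that copy number is taken as the ratio of mean plasmid depth to mean chromosome depth, computed from per-base depth values such as those reported by the SAMtools depth function. The exact statistic Plassembler uses is not restated here, and the function name is hypothetical.

```python
from statistics import mean


def estimate_copy_number(plasmid_depths, chromosome_depths):
    """Estimate plasmid copy number relative to the chromosome.

    Both arguments are sequences of per-base read depths.
    Assumption: copy number = mean plasmid depth / mean chromosome depth.
    """
    chromosome_mean = mean(chromosome_depths)
    if chromosome_mean == 0:
        raise ValueError("Chromosome depth is zero; copy number is undefined.")
    return mean(plasmid_depths) / chromosome_mean


# A plasmid covered at 120x in an isolate whose chromosome is covered at 60x
# has an estimated copy number of 2.
print(estimate_copy_number([120, 120, 120], [60, 60]))
```

Running the same calculation separately on long-read and short-read depths gives the two sets of statistics, which can differ markedly for small plasmids that are under-represented in long-read libraries.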

PLSDB mash distance calculation
Finally, each assembled plasmid contig is compared to the 34 513 plasmids contained in PLSDB (Galata et al. 2019) using mash (Fig. 1G). All matches below the maximum threshold of a mash distance of 0.1 are considered. For each contig, the PLSDB match with the lowest mash distance is kept as the top hit. Contigs that do not have a PLSDB match are denoted as such and are less likely to be true plasmid assemblies, particularly if they are not circular.
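The top-hit selection can be sketched as follows. This is a minimal illustration with a hypothetical function name and placeholder accession labels; it assumes that hits are already available as (accession, mash distance) pairs and that the 0.1 threshold is inclusive, which the text does not specify.

```python
def top_plsdb_hit(hits, max_distance=0.1):
    """Return the (accession, distance) pair with the lowest mash distance.

    hits: list of (plsdb_accession, mash_distance) pairs for one contig.
    Returns None when no hit is within the threshold (assumed inclusive).
    """
    eligible = [hit for hit in hits if hit[1] <= max_distance]
    if not eligible:
        return None
    return min(eligible, key=lambda hit: hit[1])


# Placeholder accession labels, for illustration only.
hits = [("plasmid_A", 0.08), ("plasmid_B", 0.01), ("plasmid_C", 0.35)]
print(top_plsdb_hit(hits))                  # prints ('plasmid_B', 0.01)
print(top_plsdb_hit([("plasmid_C", 0.35)])) # prints None
```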

Output
Plassembler's output files are outlined in Table 1. The primary outputs are a '_plasmids.fasta' file and a '_plasmids.gfa' file. The '_plasmids.fasta' file is taken from the output of Unicycler and contains the final plasmid assemblies in FASTA format, suitable for downstream analysis using tools such as MOB-suite (Robertson and Nash 2018) and mge-cluster (Arredondo-Alonso et al. 2022). The '_plasmids.gfa' file contains the Unicycler assembly graphs, which can be visualized using tools like Bandage (Wick et al. 2015). Plassembler also provides a '_summary.tsv' file. For each plasmid, this file includes its length; the estimated mean, first quartile, third quartile, and standard deviation of its short-read and long-read depths; a column indicating whether the contig is circular; and a column indicating whether the contig has a match in PLSDB under the maximum mash distance threshold of 0.1. If there is a hit, the '_summary.tsv' file will also contain all available PLSDB information about the top hit.

Benchmarking
Benchmarking, implemented using a reproducible Snakemake pipeline (Mölder et al. 2021) powered by Snaketool (Roach et al. 2022), was conducted on an Intel® Core™ i7-10700K CPU @ 3.80 GHz on a machine running Ubuntu 20.04.6 LTS.
To test the performance of Plassembler, we used simulated reads from 20 isolate assemblies from four different datasets. We used Badread v0.3.0 (Wick 2019) and InSilicoSeq v1.5.4 (Gourlé et al. 2019) to generate simulated read sets from all ground truth assemblies. Long reads were simulated with the Nanopore 2020 error model, while short reads were simulated with the 'novaseq' error model. Both long- and short-read sets were simulated to a genome coverage of 60×.
In addition to the simulated read sets, we tested the performance of Plassembler on real reads from the six isolates from Wick et al. (2021a). Because these genomes were assembled using a highly accurate approach independent of that used by Plassembler [Trycycler (Wick et al. 2021b) with manual curation], we considered that these assemblies could also be used as ground truth for testing the accuracy of Plassembler on the corresponding real read sets. Wick, Judd, Wyers, et al. have made all the details of their methodology available at https://github.com/rrwick/Small-plasmid-Nanopore/blob/main/method.md. These isolates were sequenced in two technical replicates with two long-read sequencing methods. For our study, reads for both technical replicates and both sequencing chemistries were combined and subsampled to a depth of 60× using rasusa v0.7.0 (Hall 2022).
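Subsampling to a target depth rests on simple arithmetic: the number of bases to keep is the genome size multiplied by the desired coverage. The Python sketch below illustrates the idea with a random read order and a running base total. It is a simplified stand-in, not rasusa's code, and the function name, seed handling, and stopping rule are assumptions.

```python
import random


def subsample_to_coverage(read_lengths, genome_size, coverage=60, seed=0):
    """Randomly keep reads until their total length reaches the target.

    read_lengths: dict mapping read name -> read length in bp.
    genome_size: genome length in bp.
    Target bases = genome_size * coverage (60x by default, as in the text).
    """
    target_bases = genome_size * coverage
    names = sorted(read_lengths)        # fixed starting order for reproducibility
    random.Random(seed).shuffle(names)  # random sampling order
    kept, total = [], 0
    for name in names:
        if total >= target_bases:
            break
        kept.append(name)
        total += read_lengths[name]
    return kept


# Ten 10 kb reads over a 1 kb toy genome: 60x needs 60 kb, i.e. six reads.
reads = {f"read_{i}": 10_000 for i in range(10)}
print(len(subsample_to_coverage(reads, genome_size=1_000)))
```

If the input holds fewer bases than the target, every read is returned, so the achieved depth is then lower than requested.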

Results
Plassembler was faster than Unicycler for every one of the 20 simulated isolates and six real read sets at all thread counts, yielding a 3- to 10-fold speed improvement (Tables 2 and 3) depending on the sample, thread count, and long-read assembler used. The decrease in wall-clock runtime was largest when running single-threaded. Plassembler and Unicycler had comparable maximum memory usage.
Plassembler was more accurate than Unicycler overall, recovering a higher average QUAST genome fraction than Unicycler against the simulated ground truth (Table 4). For the simulated reads, Plassembler missed fewer plasmids (one versus seven for Unicycler), but had a higher number of fragmented assemblies (four for Plassembler with Flye, five for Plassembler with Raven versus one for Unicycler). Unicycler also had one misassembly, while Plassembler did not have any. Rates of indels and mismatches were comparable and low for all three assembly methods.
The difference in genome fraction is explained by Plassembler's ability to recover small plasmids under 10 kb. In the simulated read sets, Plassembler was able to recover small plasmids in Staphylococcus aureus C222 (2473 bp; Supplementary Table S1).
For the real read sets, Plassembler and Unicycler had identical genome fractions and low indel and mismatch rates (Table 5). Similar to the simulated dataset, Plassembler recovered two additional small plasmids missed by Unicycler (Table 5 and Supplementary Table S6), of lengths 1934 bp (K. variicola INF345) and 10 697 bp (K. oxytoca MSB1 2C). The 10 697 bp plasmid recovered in K. oxytoca MSB1 2C was not recovered using the long-read-first assembly method by Wick et al. (2021a). Annotation with Bakta v1.7.0 (Schwengers et al. 2021) revealed that this plasmid contains a Type III toxin-antitoxin system and other plasmid replication genes (Supplementary Table S10).
Plassembler with Raven was consistently faster than Plassembler with Flye (Tables 2 and 3). However, Plassembler with Raven had more fragmented assemblies in the simulated dataset (Table 4), due to worse performance of Raven in recovering draft assemblies of some plasmids compared to Flye (Wick and Holt 2019) (Supplementary Table S1).

Discussion
It has previously been shown that subsampling hybrid sequencing read sets leads to increased plasmid recovery (De Maio et al. 2019). Plassembler's removal of chromosomal reads before short-read-first assembly has similar benefits in terms of small plasmid recovery, as small plasmid reads constitute a larger proportion of the overall read set.
Flye and especially Raven assemblies commonly miss small plasmids (Supplementary Table S1), emphasizing that a long-read-first assembly approach is inappropriate for recovering small plasmids, as reported in other studies (Wick et al. 2023, Johnson et al. 2023). Long-read-first assemblies with Flye (run as a part of Plassembler) in the real read datasets multiplied many small plasmids (Supplementary Table S6). Multiplication was also present, though less common, in the simulated datasets (Supplementary Table S2). As reported previously (Wick and Holt 2019, Johnson et al. 2023), this indicates that multiplication in long-read-only plasmid assemblies may reflect either assembly errors or true plasmid multimerization (Crozat et al. 2014), but it is difficult to distinguish between the two.

Other use cases and features
Plassembler can be used to recover small plasmids from bacteria with multiple chromosomes, megaplasmids, or chromids. Plassembler will treat all long-read assembled contigs larger than the provided '-c' parameter as chromosomal. As an example, Plassembler v1.1.0 was used to recover plasmids from Vibrio campbellii DS40M4, which has two chromosomes of sizes 3.33 and 1.88 Mb and a 77 353 bp plasmid (Colston et al. 2019). Illumina and ONT sequencing reads for V. campbellii were downloaded using fastq-dl (Petit III and Hall; https://github.com/rpetit3/fastq-dl). Plassembler recovered the known 77 353 bp plasmid and an additional 5386 bp replicon (Supplementary Table S9), which blastn (Sayers et al. 2022) revealed to be Enterobacteria phage phiX174. This phage is commonly used as a positive control in short-read sequencing runs, so its presence likely reflects contamination in this sample.
Plassembler can also be used as a fast quality control tool to detect differences between long- and short-read sets, even from closely related isolates. Given read sets from different isolates of the same species, Plassembler will extract all short and long reads that are unmapped to the long-read-only assembly. The hybrid assembly of these reads will then contain sections of chromosomal sequence that are present in the short-read set genome but not the long-read set genome. These will be represented as noncircular contigs in the Plassembler output, likely without a PLSDB mash hit. Therefore, if five or more such contigs are assembled, Plassembler will warn the user that their long- and short-read sets may not match. Examples of Plassembler output for mismatched read sets from two closely related but distinct S. aureus isolates (same sequence type), and from two more distantly related S. aureus isolates (different sequence types), can be found in Supplementary Tables S11 and S12 (Enright et al. 2000, Houtak et al. 2023).
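The mismatch warning reduces to a simple count, sketched below in Python. The function name and the dictionary layout are hypothetical. The sketch counts noncircular contigs only; whether the absence of a PLSDB hit is also required is not stated explicitly in the text, so treating circularity alone as the criterion is an assumption.

```python
def read_sets_may_mismatch(contigs, min_noncircular=5):
    """Flag a possible long-read/short-read mismatch.

    contigs: list of dicts with a boolean 'circular' key, one per contig
             in the hybrid assembly output.
    Returns True when at least `min_noncircular` contigs are noncircular
    (five or more, following the text).
    """
    noncircular = [contig for contig in contigs if not contig["circular"]]
    return len(noncircular) >= min_noncircular


# One circular plasmid plus five linear fragments triggers the warning.
contigs = [{"circular": True}] + [{"circular": False}] * 5
if read_sets_may_mismatch(contigs):
    print("Warning: long- and short-read sets may not be from the same isolate.")
```

A stricter variant could additionally require that each counted contig lacks a PLSDB hit, which would reduce false warnings for isolates that carry several genuine linear or fragmented plasmids.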
In addition, users with existing plasmid and chromosome assemblies who wish to estimate long and short read plasmid copy numbers and match each plasmid to the PLSDB can use Plassembler. This is enabled using 'plassembler assembled', along with specifying the assembled chromosome using the '-input_chromosome' and the plasmids using '-input_plasmids'.
Plassembler can also be used to assemble other small extrachromosomal replicons in hybrid sequencing data, such as bacteriophages (Shen and Millard 2021) or phage-plasmids (Pfeifer et al. 2022), assuming they have not integrated into the chromosome. An example is the 5386 bp Enterobacteria phage phiX174 genome that Plassembler recovered from the Colston et al. (2019) read sets.

Limitations
Plassembler is nondeterministic between thread counts, which is caused by long-read assembler nondeterminism (Supplementary Table S1). This leads to different read sets being recovered in Plassembler's mapping process, which occasionally produces differing plasmid assemblies. With Flye, nondeterminism persisted even where the '-deterministic' parameter was used (Supplementary Table S5).
Plassembler requires sufficient long-read depth such that Flye or Raven can assemble complete chromosome-sized contigs. Plassembler therefore cannot be used with isolates with extremely low read depth. Unicycler should be used in this scenario.
The known linear plasmid in K. variicola INF345 reported by Wick et al. (2021a) was incorrectly assembled by both Plassembler and Unicycler in simulated and real read sets, due to a terminal inverted repeat that is characteristic of linear plasmids (Hawkey et al. 2022) (Supplementary Table S1). It is likely that linear plasmids are better assembled using a long-read-first approach.
Another possible limitation of Plassembler concerns small plasmids that contain a mobile genetic element (MGE) shared with the chromosome. If the long-read assembler fails to assemble the small plasmid, then the Plassembler assembly will be incomplete. This is because reads that map to the MGE on the plasmid will neither be unmapped to the chromosome nor map to plasmid contigs in the Plassembler mapping process. Based on our benchmarking, this is unlikely to be an issue for plasmids larger than 10 kb, as the long-read assembler is likely to recover them. Plassembler was able to accurately recover the 44 kb plasmid harbouring a 16 kb MGE shared by both the chromosome and plasmid for K. pneumoniae CAV 1217 (Supplementary Table S1).

Conclusion
Plassembler assembles bacterial plasmids from hybrid sequencing datasets faster and more accurately than existing approaches. It recovers more small plasmids that other assemblers miss and can be easily combined with long-read-first chromosomal assembly workflows to generate accurate bacterial genome assemblies.

[...] the idea of including 'plassembler assembled' functionality. This work was supported with supercomputing resources provided by the Phoenix HPC service at the University of Adelaide.