Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim

Abstract

Background: Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, nonuniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical algorithms. The use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to assess the performance of bioinformatics tools with the ground truth in a controlled environment.

Results: Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task.

Conclusions: The Meta-NanoSim characterization module investigates read features, including chimeric information and abundance levels, while the simulation module simulates large and complex multisample microbial communities with different abundance profiles. All trained models and the software are freely accessible at GitHub: https://github.com/bcgsc/NanoSim.

With regard to software registration in scientific resources, Meta-NanoSim has already been registered in the SciCrunch database, and the RRID information has been included with the FTP submission. All three NanoSim tools (NanoSim, Trans-NanoSim, and Meta-NanoSim) have also been registered in the bio.tools database with their corresponding identifiers. We have included this information in the "availability" section of the manuscript. Lastly, regarding workflowhub.eu, we note that the Meta-NanoSim (and NanoSim) code is implemented as a suite of modules, and no workflow is introduced in our manuscript.
Sincerely,
Saber Hafezqorani, Chen Yang, and Inanc Birol
Canada's Michael Smith Genome Sciences Centre
British Columbia Cancer Agency

REVIEWER 1:
• On line 64. Given the range of samples that can currently be sequenced using Nanopore sequencing and the recent focus on short reads as opposed to the previous highlight given to long reads, this statement is out-of-date. I would recommend describing instead the current range of sizes that can be sequenced via Nanopore.
Response: We clarified in the main text that ONT reads can have variable lengths. We also expanded on the N50 length metric definition. (Lines 63-65)

• How the authors would approach the different error rates namely of different types of flowcell (R9 vs R10). While I don't think this should be central to the development of the tool, I think this should be addressed in the manuscript in some way.
Response: Thank you for your comment about the error rates of different flowcells, which is an important point to consider as the nanopore technology matures. Indeed, Nanopore sequencing is an evolving technology and the error rates have improved tremendously since its emergence. Therefore, we agree that a simulator should be flexible to future advancements in this technology.
NanoSim is designed specifically to address this concern by learning and characterizing features of the input training data. Given a set of nanopore reads from a new flowcell (e.g. R10), the characterization stage in NanoSim (the "read_analysis.py" script) can learn the features that are native to the platform. If users have any specific flowcell chemistry that they want to simulate, they can use the characterization module to train their own models for simulation.
For the convenience of our users, we have been providing ready-to-use pre-trained models for the past 5 years. We will continue to provide pre-trained models for more recent flowcells and respective chemistries in future releases of NanoSim.
We had initially addressed this concern when explaining the NanoSim workflow (Implementation section, Meta-NanoSim general design sub-section and Fig. 1) as well as in the Conclusion section. We have further expanded on this important point and added the following sentence for clarification (Lines 558-564): "Considering the evolving Nanopore sequencing technology, with base accuracy improvements afforded by newer flowcells and updated chemistries, it is imperative to factor in those changes when simulating data with characteristics that are as close as possible to experimental data. The NanoSim suite of tools has this ability, which is accomplished by re-training new models on the latest available sequencing data. Pretrained models are available, and will be supported along with future NanoSim releases to account for nanopore technology advancements."

• I would also have liked to see distributions of PHRED quality scores in the simulated reads in the analyses conducted in the manuscript.
Response: The simulated reads for our metagenome assembly benchmarking were generated in FASTA format, which carries no quality scores. Therefore, quality scores do not contribute to the assembly performance of meta-Flye shown in Figure 5. To address this comment, we investigated the distribution of quality scores in 1 million simulated reads based on the trained model for the Log dataset (Supplementary Figure S5). Overall, aligned reads tend to have higher quality scores than their unaligned counterparts. This is expected because a higher Phred quality score corresponds to a lower probability of base-calling errors (Lines 429-431).
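For reference, the relationship between a Phred score and the error probability mentioned above is Q = -10 log10(P). The short Python sketch below illustrates this conversion only; it is not part of NanoSim. The function names are hypothetical, and the default offset of 33 assumes the standard Phred+33 FASTQ encoding.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to a base-call error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert a base-call error probability P to a Phred quality score Q."""
    return -10 * math.log10(p)


def mean_error_prob(quality_string: str, offset: int = 33) -> float:
    """Mean per-base error probability of one FASTQ quality string.

    Assumes Phred+33 encoding (hypothetical default for this sketch).
    Averaging is done on probabilities, not on the Q values themselves.
    """
    probs = [phred_to_error_prob(ord(c) - offset) for c in quality_string]
    return sum(probs) / len(probs)
```

Under this definition, each increase of 10 in Q corresponds to a tenfold lower error probability, which is consistent with aligned reads carrying higher scores than unaligned ones.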

REVIEWER 2:
• Line 166-168: the clause about structural variant is unclear to me, and perhaps to the reader. Please consider rephrasing.
Response: Thanks for your comment. We added clarity to the sentence and rephrased it as follows (Lines 167-171): "They may arise because of sequencing artifacts, or may appear like structural variants when the source genome is absent from the reference metagenome. More specifically, sequencing reads may be mapped to a similar but structurally different genome (e.g. a different strain or subspecies), causing segments of the read to align to different regions of the reference genome."

• Line 209-210: I understand that the number of mapped reads is unadequate for abundance estimation of ONT data, but k-mers should not suffer from the same problem, shouldn't they? the number of k-mers matching to a genomic region (or a genome) will scale appropriately with read length. I have therefore a hard time understanding why k-mers are presented as problematic in the first sentence of the Abundance Estimation paragraph.
Response: There are two main reasons for this. First, as mentioned in the first sentence of that paragraph, the existing abundance estimation methods generally quantify the number of k-mers under the presumption that all reads have equal lengths, which is not a feature of ONT reads. Therefore, these algorithms, originally designed for short reads, are not directly applicable to ONT reads. Second, erroneous bases (due to mismatch or indel errors) in ONT reads contribute many low-multiplicity k-mers that do not belong to the reference. In other words, the k-mer multiplicities in long reads will very likely underestimate the underlying microbial abundance levels. However, we acknowledge that this issue may be alleviated as the technology improves and reads attain much lower error rates.
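The two effects described above can be illustrated with a toy Python sketch. It is not the Meta-NanoSim quantification algorithm: the genome names, read lengths, reference sequence, and helper functions are all hypothetical and chosen only for illustration.

```python
def relative_abundance(weights):
    """Normalize per-genome weights so that they sum to 1."""
    total = sum(weights.values())
    return {genome: w / total for genome, w in weights.items()}


# Effect 1: unequal read lengths (hypothetical lengths, in bases).
read_lengths = {
    "genome_A": [10_000, 12_000, 8_000],
    "genome_B": [1_000, 1_200, 800],
}
# Counting reads treats every read as equal evidence.
by_read_count = relative_abundance({g: len(l) for g, l in read_lengths.items()})
# Counting bases weights each read by its length.
by_base_count = relative_abundance({g: sum(l) for g, l in read_lengths.items()})


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def matching_kmers(read, reference, k):
    """Count the k-mers of the read that also occur in the reference."""
    ref_kmers = set(kmers(reference, k))
    return sum(1 for km in kmers(read, k) if km in ref_kmers)


# Effect 2: a base error removes reference-matching k-mers.
k = 5
reference = "ACGTTGCAAGGCTTACCGATAGGCTA"  # hypothetical toy reference
perfect_read = reference[3:23]  # error-free 20-base read
# Single T -> A substitution at read position 10; it overlaps up to k k-mers.
noisy_read = perfect_read[:10] + "A" + perfect_read[11:]

perfect_hits = matching_kmers(perfect_read, reference, k)
noisy_hits = matching_kmers(noisy_read, reference, k)
```

In this toy setting, the two genomes look equally abundant by read count, while the base count assigns most of the sequenced material to genome_A. Likewise, one substitution lowers the number of reference-matching k-mers in the read by at most k, so higher error rates progressively deflate k-mer-based counts.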