Ultra-deep, long-read nanopore sequencing of mock microbial community standards

Abstract Background Long sequencing reads are information-rich: aiding de novo assembly and reference mapping, and consequently have great potential for the study of microbial communities. However, the best approaches for analysis of long-read metagenomic data are unknown. Additionally, rigorous evaluation of bioinformatics tools is hindered by a lack of long-read data from validated samples with known composition. Findings We sequenced 2 commercially available mock communities containing 10 microbial species (ZymoBIOMICS Microbial Community Standards) with Oxford Nanopore GridION and PromethION. Both communities and the 10 individual species isolates were also sequenced with Illumina technology. We generated 14 and 16 gigabase pairs from 2 GridION flowcells and 150 and 153 gigabase pairs from 2 PromethION flowcells for the evenly distributed and log-distributed communities, respectively. Read length N50 ranged between 5.3 and 5.4 kilobase pairs over the 4 sequencing runs. Basecalls and corresponding signal data are made available (4.2 TB in total). Alignment to Illumina-sequenced isolates demonstrated the expected microbial species at anticipated abundances, with the limit of detection for the lowest abundance species below 50 cells (GridION). De novo assembly of metagenomes recovered long contiguous sequences without the need for pre-processing techniques such as binning. Conclusions We present ultra-deep, long-read nanopore datasets from a well-defined mock community. These datasets will be useful for those developing bioinformatics methods for long-read metagenomics and for the validation and comparison of current laboratory and software pipelines.


Data Description
Whole-genome sequencing of microbial communities (metagenomics) has revolutionised our view of microbial evolution and diversity, with numerous potential applications for microbial ecology, clinical microbiology and industrial biotechnology [1,2]. Typically, metagenomic studies use high-throughput sequencing platforms (e.g. Illumina) [3] which generate very high yield, but of limited read length (100-300 bp).
In contrast, single-molecule sequencing platforms such as the Oxford Nanopore MinION, GridION and PromethION are able to sequence very long fragments of DNA (>10 Kbp, with over 2 Mbp reported) [4,5] and with recent improvements to the platform making metagenomic studies using nanopore more viable, such studies are increasing in frequency [6,7,8,9]. Long reads help with alignment-based assignment of taxonomy and function due to their increased infor-
Mock community standards are useful for the development of genomics methods [14], and for the validation of existing laboratory, software and bioinformatics approaches. For example, validating the accuracy of a taxonomic identi cation pipeline is important, because the consequences of erroneous taxonomic identi cation from a metagenomic analysis may be severe, e.g. in public health microbiology [15,16] or incorrect diagnoses in clinical microbiology diagnostics. Mock community standards can also be used as positive controls during laboratory work, for example to validate that DNA extraction methods will yield expected representation of a sampled community [14].
Here, we present four nanopore sequencing datasets of two microbial community standards, providing a state-of-theart benchmark to accelerate the development of methods for analysing long-read metagenomics data.

Background Information
The ZymoBIOMICS Microbial Community Standards (CS and CSII) are each composed of ten microbial species: eight bacteria and two yeasts ( Table 1). The organisms in CS (hereafter referred to as 'Even') are distributed equally (12%), with the exception of the two yeasts which are each present at 2%. Cell counts from organisms in CSII ('Log') community are distributed on a log scale, ranging from 89.1% (Listeria monocytogenes), down to 0.000089% (Staphylococcus aureus).

DNA extraction
DNA was extracted from 75 µl ZymoBIOMICS Microbial Community Standard (Product D6300, Lot ZRC190633) and 375 µl ZymoBIOMICS Microbial Community Standard II (Product D6310, Lot ZRC190842) using the ZymoBIOMICS DNA Miniprep extraction kit according to manufacturer's instructions, with the following modi cations to increase fragment length and maintain the expected representation of the Gram-negative species which are already lysed in the DNA/RNA Shield storage solution. The standard was centrifuged at 8,000×g for 5 minutes before removing the supernatant and retaining. The cell pellet was resuspended in 750 µl lysis bu er and added to the ZR BashingBead lysis tube. Bead-beating was performed on a FastPrep-24 (MP Biomedicals) instrument for 2 cycles of 40 seconds at 6.0 m s -1 , with 5 minutes sitting on ice between cycles. The bead tubes were centrifuged at 10,000×g for 1 minute and 450 µl of supernatant was transferred to a Zymo Spin III-F lter before being centrifuged again at 8000×g for 1 minute. 45 µl (Even) and 225 µl (Log) of the supernatant retained earlier was combined with 450 µl ltrate before adding 1485 µl (Even) or 2025 µl (Log) Binding Bu er and mixing before loading onto the column. Methods are available online via protocols.io [18].

Nanopore sequencing library preparation
Quanti cation steps were performed using the dsDNA HS assay for Qubit. DNA was size-selected by cleaning up with 0.45× volume of Ampure XP (Beckman Coulter) and eluted in 100 µl EB (Qiagen). Libraries were prepared from 1400 ng input DNA using the SQK-LSK109 kit (Oxford Nanopore Technologies) as  per manufacturer's protocol, except incubation times for endrepair, dA-tailing and ligation were increased to 30 minutes to improve ligation e ciency. The even and log libraries were split and used on both the GridION and PromethION owcells.

Sequencing
Sequencing libraries were quanti ed and two aliquots of 50 ng and 400 ng were prepared for GridION and PromethION sequencing respectively. The GridION sequencing was performed using FLO-MIN106 (rev.C) owcells, MinKNOW 1.15.1 and standard 48-hour run script with active channel selection enabled. The PromethION sequencing was performed using FLO-PRO002 owcells, MinKNOW 1.14.2 and standard 64-hour run script with active channel selection enabled.

Illumina sequencing
DNA was extracted from pure cultures of each species using the ZymoBIOMICS DNA Miniprep Kit. Library preparation was performed using the Kapa HyperPlus Kit with 100 ng DNA as input and TruSeq Y-adapters. The puri ed library derived from each sample was quanti ed by TapeStation (Agilent 4200) and pooled together in an equimolar fashion. The multiplexed isolates were sequenced on an Illumina HiSeq 1500 instrument using 2×101 bp (paired-end) sequencing, over four lanes. Raw reads were demultiplexed using bcl2fastq v2.17. Shotgun sequencing of the even and log communities was performed with the same protocol, with the exception that the log community was sequenced individually on two owcell lanes, and the even community was instead sequenced on an Illumina MiSeq using 2×151 bp (paired-end) sequencing.

PacBio draft assembly
A recently released orthogonal data set from McIntyre et al. (2019) includes individual PacBio sequencing of eight of the ten organisms that compose the two Zymo communities [17]. Assemblies for the eight isolates that passed quality control (excluding L. fermentum and C. neoformans) were generated with HGAP v2 [20]. Assemblies have been made available by the authors and were downloaded from https://github.com/al-mcintyre/mCaller_analysis_scripts/ tree/master/assemblies (Git commit dba494d) for the purposes of assessing metagenomic assembly accuracy for the 7 bacterial species where complete genomes were available.

Sequencing coverage estimation
Nanopore reads were aligned to the Illumina draft assembly using minimap2 [21] v2.14-r883 with parameters -ax map-ont -t 12 and converted to a sorted BAM le using samtools [22]. To reduce erroneous mappings, alignment BAM les were ltered using a script bamstats.py according to the following criteria; reference mapping length ≥500 bp, map quality (MAPQ) > 0, there are no supplementary alignments for this read and read is not a secondary alignment. Per-species coverage summary statistics were generated using the summariseStats.R Rscript.
Assemblies were conducted under a variety of parameter values for homopolymer-compressed k-mer size (-p), minimum graph edge weight support (-e) and read length threshold (-L). Global parameters for all runs (-S1 -K10000 --node-max 6000) were used to turn-o k-mer subsampling (to remove assembly stochasticity) and increase the coverage thresholds applied to k-mers and constructed nodes.
Assembled contigs were assigned to taxa with kraken2 [24] (--use-names -t12) using a database containing all of the archaeal, bacterial, fungal, protozoal and viral sequences from RefSeq, and UniVec_Core (database download links are in our repository). The kraken2 output was parsed with extracken.py and plotted with contiguity.R to visually assess contiguity. Following assignment, contigs can be extracted into separate FASTA with extract_contigs_with_kraken.py.
Nicholls and Quick et al. | 5 Figure 3. Bar plots demonstrating total length and contiguity of genomic assemblies obtained with wtdbg2 from each of the long-read nanopore data sets. For each organism in the community (coloured columns), contigs longer than 10 kbp are horizontally stacked along the x-axis. Each row represents a run of wtdbg2, with the parameters for edge support, read length threshold and homopolymer-compressed k-mer size labelled on the left. Assemblies are grouped by the data set on which they were run (row facets). Additionally, assemblies may be compared to the estimated true genome size, the available McIntyre et al. PacBio assemblies, and per-isolate Illumina SPAdes assembly. Estimated genomes sizes are the same as those found in Table 1, however to display approximate chromosomes, the two yeasts were replaced by their corresponding canonical NCBI references for visualisation purposes only. The C. neoformans strain used by the Zymo standards is a diploid genetic cross, which may explain the larger assemblies, compared to the represented estimated haploid size.

Assembly polishing
After inspection of the contiguity.R plot, eight high-contiguity assemblies were selected for polishing. Polishing consisted of two iterations of racon [25], followed by medaka (https:// github.com/nanoporetech/medaka) and two iterations of pilon [26]. racon v1.3.2 was used to polish contigs with the the nanopore reads. medaka v0.5.0 was used to polish the racon polished contigs, with the nanopore reads specifying the r941_flip model. The PromethION assemblies were polished using the same seqtk-derived 25% subset from which the assemblies were constructed. pilon v1.23 was used to polish the medaka polished contigs, with the CS (Even) community Illumina reads.

Estimation of genome completeness
To estimate accuracy of the polished assemblies, contigs were rst assigned to taxa and extracted into separate FASTA using kraken2 as previously described. For the seven bacteria for which corresponding PacBio draft assemblies were available, sequence identity dotplots were generated using a modi ed version of minidot (https://github.com/SamStudio8/minidot) which uses minimap2 (-x asm10 --no-long-join --dual=yes -P) to align the polished contigs binned by kraken2, to the corresponding PacBio draft. Genome completeness was estimated with CheckM v1.0.13 [27] using the taxonomy_wf subcommand, after each phase of the polishing pipeline. CheckM was executed separately for each kraken2 bin that had a corresponding PacBio reference, specifying the appropriate species for the bin to taxonomy_wf. We report the CheckM "Completeness" score, which estimates completeness by identifying collocated marker gene sets on the assembled contigs as a proportion of the total collection of marker gene sets expected for a speci c taxon.

Nanopore sequencing metrics
We generated a total of 335.1 Gbp of sequence from the four nanopore sequencing runs (Table 2, Figure 1a). PromethION owcells generated approximately ten-times more sequencing data than the comparative GridION runs and showed equivalent read length N50 and read accuracy (Figure 1b). We observe a di erence in sequencing speed between the PromethION (mean 419 bps and 437 bps for even and log) and the GridION (mean speed 352 and 372 bps for even and log) (Figure 1c).

Nanopore mapping statistics
We identify the presence of all 10 microbial species in the community, for both even and log samples, in expected proportions ( Figure 2). For the even community, the GridION results provide su cient depth (i.e. 30× coverage) to potentially assemble all eight of the bacteria. The coverage of the yeast genomes were lower (10× and 17×), potentially su cient for assembly sca olding. On the PromethION all genomes had >100× mean coverage (Tables 4 and 5).   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  Table 6. Sequence identity dotplots and CheckM genome completeness scores for each of the seven bacteria for which there was a corresponding PacBio assembly from McIntyre et al. (2019). Four wtdbg2 assembly conditions are represented, varying the homopolymer-compressed k-mer parameter 'p' and the graph minimum edge weight threshold 'e'. The read length threshold 'L' was xed at 5000 bp. The left and right halves of the table correspond to the same assembly condition for the GridION and 25% PromethION sequencing data, respectively. The L50/L95 refers to the number of assembled contigs required to span at least 50% and 95% of the estimated genome size (see Table 1). An -indicates that the set of assembled contigs assigned to a taxon were not of su cient total length to cover 95% of the estimated size. CheckM genome completeness scores are expressed as a percentage and were calculated per-organism at the end of each polishing phase. For the log-distributed community, three taxa have sucient coverage for assembly on GridION, compared with four on PromethION. On PromethION, a further two genomes (S. enterica and E. coli) have su cient coverage for assembly sca olding. We are able to detect S. aureus, the lowest abundance organism on both platforms, with 19 reads from PromethION (from 400 cell input) and 4 reads from GridION (from 50 cell input).

Nanopore metagenomic assemblies
We assessed the contiguity of our nanopore metagenomic assemblies for each run with di erent assembly parameters.
For the even community, genomes of the expected size were present for each of the bacterial species, contained in small numbers of large contigs (Figure 3). However, the two yeasts are highly fragmented, consistent with their low read depth.
L. monocytogenes is poorly assembled in the log dataset despite being the most abundant organism, indicating very high sequence coverage may be detrimental to the performance of wtdbg2. We note that assembling the entire PromethION dataset resulted in less complete and more fragmented assemblies. This led us to random subsample the PromethION data to 25% of the total dataset which improved the assembly results.
After subsampling, assemblies of the even community from the GridION and PromethION are similar. However, the assemblies from PromethION data had better representation of the yeasts in terms of size and contiguity (particularly for C. neoformans), likely due to the higher coverage of these species.
We also assessed the completeness of polished genomes for a selection of our highly-contiguous metagenomic assemblies.
For the GridION, we observe for at least one of the polished assemblies, four bacterial genomes are reconstructed to at least 95% of their length (L95) in a single contig. For PromethION, we observe that for seven bacteria, at least half the genome (L50) is reconstructed on a single contig, for at least one assembly condition (Table 6).
Genome completeness as estimated by CheckM averaged 73.95% and 70.98% over the four unpolished assemblies, for the GridION and PromethION respectively. We observed each phase of the polishing pipeline improved completeness. For the GridION assemblies, completeness was incrementally improved by 11.57 pp, 10.14 pp and 1.25 pp for two iterations of racon, one iteration of medaka and two iterations of short read polishing with pilon, respectively. For the PromethION, the three polishing phases incrementally improved assemblies by an average of 11.92 pp, 12.69 pp and 1.77 pp. In almost all cases, polishing yields near complete genomes (≥90%) genomes.

Discussion
There are several noteworthy aspects of this dataset: We generated over 300 Gbp of sequence data from the Oxford Nanopore PromethION and 30 gigabases from the Oxford Nanopore Grid-ION, on a well-characterised mock community sample and we have made basecalls and electrical signal data for each of the four runs presented here available: a combined dataset size of over four terabytes. The availability of the raw signal permits future basecalling of the data (an area under rapid development), as well as signal-level polishing and the detection of methylated bases [28].
Individual sequencing libraries were split between the Grid-ION and PromethION, permitting direct comparisons of the instruments to be made. We observed high concordance between the datasets from each platform. We note the sequencing speed of the PromethION is faster than the GridION, which we attribute to di erent running temperatures on these instruments (39 • C versus 34 • C, respectively).
Con dent detection of S. aureus was demonstrated for the GridION run to <50-cells using the log community. The Prome-thION generated around ve times more S. aureus reads as the GridION, however we loaded eight times as much library, making it appear less sensitive. It may be possible to reduce the input to PromethION owcells, but we have not attempted this.
Early results of metagenomic assembly show promise for reconstruction of whole microbial genomes from mixed samples without a binning step. We focused on the developing wtdbg2 software as the established minimap2 and miniasm method resulted in excessively large intermediate les (tens of terabases per analysis) which were impractical to store and analyse.
For the even community, using wtdgb2 with varying parameter choices, we were able to assemble four of the bacteria into single contigs. However, no single parameter set was found to be optimum for both total genome size and contig length. Increasing -e improved contiguity for the even community, however this resulted in the loss of yeasts from the assembly. Increasing the read length threshold (-L) improved contiguity for all sample and platform combinations, at the cost of genome size. Increasing the homopolymer-compressed k-mer size (-p) from the default of 21 to 23 also appears to improve contiguity.
We found that wtdbg2 expects a maximum of 200× sample coverage, and discards sequence k-mers and de Bruijn graph nodes with more than 200× support. Although these limits can be lifted by specifying higher -K and -node-max, we still observe more fragmented assemblies on the PromethION data (especially for the 100% PromethION data [not shown]) potentially indicating a need to further tune the algorithm to account for the large di erences in coverage between genomes. It should be noted that wtdbg2 is still under active development, making it di cult to make concrete recommendations for parameters.
We found that any form of polishing improves the completeness of assemblies, likely due to the correction of frameshifts caused by indels. Short read polishing with pilon also improves the assemblies, despite low coverage of the Illumina even community data and the results might be expected to improve further with increased coverage.
The availability of this dataset should help with further improvements to long-read assembly techniques.
Other mock microbial samples are available which we did not test here. A notable alternative mock community sample is from the Human Microbiome Project (HMP) and consists of 20 microbial samples (available from BEI Resources). This mock community have been sequenced as part of other studies, although the datasets are much smaller than the ones presented here [9,29]. Bertrand et al. presented a synthetic mock community of their own construction to demonstrate hybrid nanopore-Illumina metagenome assemblies [12].

Re-use potential
The provision of Illumina reads for each isolate permits a ground-truth to be obtained for the individual species contained in the mock community. This will be useful for training new nanopore basecalling and polishing models, long-read aligners, variant callers, and validating taxonomic assignment and assembly software and pipelines.

Availability of source code and requirements
Python and R scripts used to generate the summary information and analyses are open source and freely available via our repository (https://github.com/LomanLab/mockcommunity), under the MIT license. Our pipeline was orchestrated with Snakemake [30], the work ow is available from our repository.

Availability of supporting data and materials
This manuscript, and its supporting data are available under a Creative Commons Attribution 4.0 International license.
Unprocessed FASTQ from the Illumina sequencing of the ten isolates are available at the European Nucleotide Archive, via the identi ers listed in Table 1, identi ers for the even and log community Illumina sequencing can be found in Table 3.
Both the raw signal, and basecalled FASTQ for our nanopore sequencing experiments are available at the European Nucleotide Archive, via the identi ers listed in Table 2.
The SPAdes-assembled Illumina draft reference, and the collection of nanopore assemblies for each wtdbg2 condition are linked to from our GitHub repository (https://github.com/ LomanLab/mockcommunity), along with the kraken2 database used for taxonomic classi cation of the assembled contigs.
Further updates (such as updated references, or new assemblies) will be made available through our project website https://lomanlab.github.io/mockcommunity/.
An archival snapshot of our GitHub repository and associated assembly FASTA les are also available via GigaDB [31].