NanoAmpli-Seq: a workflow for amplicon sequencing for mixed microbial communities on the nanopore sequencing platform

Abstract Background Amplicon sequencing on Illumina sequencing platforms leverages their deep sequencing and multiplexing capacity but is limited in genetic resolution due to short read lengths. While Oxford Nanopore or Pacific Biosciences sequencing platforms overcome this limitation, their application has been limited due to higher error rates or lower data output. Results In this study, we introduce an amplicon sequencing workflow, i.e., NanoAmpli-Seq, that builds on the intramolecular-ligated nanopore consensus sequencing (INC-Seq) approach and demonstrate its application for full-length 16S rRNA gene sequencing. NanoAmpli-Seq includes vital improvements to the INC-Seq protocol that reduces sample processing time while significantly improving sequence accuracy. The developed protocol includes chopSeq software for fragmentation and read orientation correction of INC-Seq consensus reads while nanoClust algorithm was designed for read partitioning-based de novo clustering and within cluster consensus calling to obtain accurate full-length 16S rRNA gene sequences. Conclusions NanoAmpli-Seq accurately estimates the diversity of tested mock communities with average consensus sequence accuracy of 99.5% for 2D and 1D2 sequencing on the nanopore sequencing platform. Nearly all residual errors in NanoAmpli-Seq sequences originate from deletions in homopolymer regions, indicating that homopolymer aware base calling or error correction may allow for sequencing accuracy comparable to short-read sequencing platforms.


Experimental design and statistics
Click here to access/download;Manuscript;Main_Manuscript_R1_clean.pdf Click here to view linked References  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 Background 56 Amplicon sequencing, particularly sequencing of the small subunit rRNA (SSU rRNA) gene and 57 internal transcribed spacer (ITS) regions, is widely used for profiling of microbial community 58 structure and membership [1][2][3][4]. The wide-scale application of amplicon sequencing has been 59 driven mainly by the ability to multiplex 100's of samples on a single sequencing run and obtain 60 millions of sequences of target communities on high-throughput sequencing platforms [1,4]. The 61 primary limitation of these commonly used technologies (i.e., Illumina's MiSeq, Ion Torrent PGM, 62 etc.) is that their read lengths are short, ranging from 150-400 bp [5]. While excellent at bulk 63 profiling of microbial communities through multiplexed deep sequencing, short read lengths are 64 limited in the taxonomic resolution of sequenced reads and more so, are not amenable to robust 65 phylogenetic analyses to assess the relationship between sequences originating from unknown 66 microbes with those in publicly available databases. An important effect of the proliferation in 67 short read sequencing applications has been a decrease in the rate at which long higher quality 68 sequences, particularly of SSU rRNA genes, are being deposited in public databases. This effect 69 is to some extent being mitigated through assembly and curation of near full-length SSU rRNA 70 genes from metagenomic datasets [6][7][8][9] and will continue to be mitigated with novel approaches 71 for SSU rRNA sequencing using synthetic long read approaches [10]. 72

73
The introduction of long-read single molecule sequencing platforms, such as PacBio's single-74 molecule real-time sequencing (SMRT) and single molecule sensing technologies on the Oxford 75 Nanopore Technologies (ONT) MinION TM platform, has opened the possibility of obtaining ultra-76 long reads [5,11]. While sequencing throughput and raw data quality of long read single molecule 77 sequencing approaches are yet to rival that of short read platforms, the ability to obtain ultra-long 78   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 reads can overcome several limitations of the latter [12]. For instance, long-read sequencing 79 combined with various error correction approaches [13,14] has been used to obtain high-quality 80 single contig microbial genomes [14] or increase assembly quality of previously sequenced but 81 fragmented eukaryotic genomes assemblies [15,16], which was not feasible using short-read 82 approaches. Long-read sequencing capabilities have also been recently leveraged to sequence near 83 full-length SSU rRNA genes (e.g., 16S rRNA) [17][18][19][20][21] or even the entire rrn operon [20,22]. 84 85 A majority of the studies utilizing either the SMRT or nanopore sequencing platforms have limited 86 data their analyses efforts to sequence classification, due to the fact that widely used sequence 87 classifiers are tolerant of high sequencing error rates [23,24]. However, these classification-only 88 approaches are limited in their ability to differentiate between closely related sequences, risks false 89 detections (i.e., read incorrectly classified at the family or genus levels due to high error rates), 90 and are unable to identify organisms that are not represented in the reference databases. In contrast, 91 some studies have gone beyond sequence classification by using consensus sequence construction 92 to improve overall sequence accuracy. The consensus sequence creation efforts thus far can be 93 categorized into two approaches. The first approach involves mapping raw, noisy reads to custom 94 or publicly available reference databases (i.e., SILVA) [25]. Subsequently, reads mapping to the 95 same reference sequence are then used for the semi-automated or manual construction of a 96 consensus sequence using overlapping alignments [20,22]. While this approach does result in 97 improved accuracy of the consensus sequence, clustering of reads based on mapping of noisy reads 98 to reference databases has significant limitations. First, incorrect read mapping to a reference is 99 prevalent due to high error rates of raw reads. Second, the reliance on a reference database ensures 100 that reads originating from organisms not represented in the reference database are ignored. The 101 more robust alternative towards high accuracy consensus sequence generation would be a 102 completely de novo approach, i.e., generation of a consensus sequence without the use of any 103 reference database. 104 105 To our knowledge, there are three reports of de novo data processing to reduce error rate from long 106 read sequencing of amplicons from mixed microbial communities [17,21,26] Seq) protocol for consensus-based error correction of nanopore sequencing reads with a median 117 accuracy of 97-98%. The INC-Seq workflow involves amplicon concatermization to link multiple 118 identical copies of the same amplicon on a single DNA molecule, sequencing of the 119 concatamerized molecules using 2D sequencing chemistry on the nanopore sequencing platform, 120 followed by consensus-based error correction after aligning the physically linked concatemers on 121 each sequenced DNA strand. By using this approach, Li et al [21] were able to increase the median 122 sequence accuracy of processed reads to 97-98%. While this significant improvement allowed for 123   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 taxonomic classification of sequences to the species level, it did not allow for sequence clustering 124 for diversity estimation due to residual median error rates of approximately 2-3%. 125

126
In this study, we leverage and expand on the INC-Seq protocol developed by Li et al [21] to 127 provide a complete workflow for amplicon sequencing and de novo data processing, NanoAmpli-128 Seq, applied to near full-length 16S rRNA gene of mock communities that results in high-quality 129 sequences with a mean sequence accuracy of 99.5±0.08%. The current version of NanoAmpli-130 Seq includes modifications to the library preparation protocol for INC-Seq and fixes a key issue 131 with INC-Seq consensus sequences while adding a novel read partitioning-based sequence 132 clustering approach which results in an accurate estimation of diversity of mixed microbial 133 communities and results in higher sequence accuracy by allowing within OTU sequence alignment 134 and consensus calling. Further, we demonstrate that NanoAmpli-Seq works equally well on the 135 (now obsolete) 2D sequencing chemistry and the recently released 1D2 sequencing chemistry on 136 the MinION TM device. While important limitations such as suboptimal re-construction of 137 community structure and error rate of ~0.5% remain, the proposed approach may be used for 138 sequencing of long amplicons from complex microbial communities to assess community 139 membership with cautious utilization of sequences from low abundance OTUs due to likely lower 140 sequence accuracy ranging from 99-99.5% accuracy. 141 142

143
Experimental design and workflow. The NanoAmpli-Seq protocol was developed and validated 144 using amplicon pools consisting of near full-length 16S rRNA gene of a single organism (Listeria 145 monocytogens) or an equimolar amplicon pool of near full-length 16S rRNA genes from 10 146 organisms (Supplementary Table S1). The amplicon pools were generated by PCR amplifying near 147 full-length 16S rRNA genes from genomic DNA of the target organism(s) using primers and PCR 148 reaction conditions as described in the materials and methods section. The respective amplicon 149 pools were subsequently prepared for sequencing using the INC-Seq workflow as outlined in 150 Figure 1, with a few significant modifications. Briefly, the amplicon pools were self-ligated to 151 form plasmid-like structures which was followed by digestion with plasmid-safe DNAse to remove 152 the remaining non-ligated linear amplicons. The DNA pool consisting of plasmid-like structures 153 was subject to rolling circle amplification (RCA) using random hexamer-free protocol using a 154 combination of primase/polymerase (PrimPol) and hi-fidelity Phi29 DNA polymerase [27]. The 155 RCA product was subject to two rounds of T7 endonuclease I debranching and g-TUBE 156 fragmentation followed by gap filling and DNA damage repair. Description of the protocol 157 including reagent volumes and incubation conditions is provided in the materials and methods and 158 a step-by-step protocol is provided in the supplementary text. The prepared amplicon pools for 159 both single organism and 10 organism mock community samples were then subject to library 160 preparation using the standard 2D (SQK-LSK208) (Runs 1 and 2) and 1D2 (SQK-LSK308) (Runs 161 3 and 4) kits using ONT specifications and sequenced on the MinION TM MK1b device followed 162 by basecalling using Albacore 1. Each resulting read consisted of multiple concatamerized physically linked amplicons from the 171 one original 16S rRNA gene amplicon. The long concamtermized amplicon reads were subject to 172 INC-Seq's anchor based alignment and consensus error correction using three different alignment 173 options (i.e., blastn, Graphmap, and partial order alignment (POA)) and followed by iteratively 174 running PBDAGCON on the consensus for error correction (INC-Seq flag "iterative"). Reads with 175 irregular segment length, unmappable anchors, and potentially chimeric molecules (i.e., 176 concatemers from more than one original 16S rRNA gene amplicon) were removed during the 177 generation of the INC-Seq consensus read. Manual inspection of INC-Seq consensus reads 178 revealed that a vast majority had an incorrect orientation of primers ( Figure 2A). Specifically, the 179  forward and reverse primers did not occur at the ends of the INC-Seq consensus reads but rather 180 were co-located at varying positions along the length of each read. Efforts to manually split INC-181 Seq reads and re-orient the forward and reverse splits based on primer orientation revealed the 182 presence of tandem repeats of nearly identical sequences, which affected efforts to merge the 183 forward and reverse read splits ( Figure 2B, 2C). consensus read with the highest mean score are re-oriented and any overhang is removed.  orienting reads using primer orientation resulted in the identification of insertions consisting of 193 repeated sequence patterns, i.e., tandem repeats. These tandem repeats were identified using 194  were delineated, i.e. tandem minimum repeat, tandem maximum repeat, and mismatch rate. The 197 percent identify between tandem repeat is estimated iteratively measuring the sequence similarity 198 between co-occurring segments using window size ranging from 10 bp to 350 bp with diminishing 199 sequence similarity threshold with increasing window size. The sequence similarity threshold with 200 increasing window size was applied as longer tandem repeats tend to have lower similarity to each 201 other compared to shorter. After completing re-orientation of reads in the removal of tandem 202 repeats, the forward and reverse splits are merged into a single read, and any read that does not 203 match prescribed length threshold (i.e., 1300-1450 bp) is discarded. This process of primer 204 identification and tandem repeat removal can also be visualized by turning on verbosity mode (flag 205 = -v) and the results can be exported in fasta format. 206

207
To enable fully reference-free analyses, we developed the nanoClust algorithm which takes the 208 fasta file of chopSeq corrected reads as input and then performs read-partitioning based de novo 209 clustering using VSEARCH [28] to delineate OTUs at a user-specified sequence similarity 210 threshold (i.e., 97% in this study) followed by within OTU read alignment and consensus calling 211 for each OTU. The nanoClust algorithm is written in python and requires Biopython packages 212 such as Seq, SeqIO, AlignIO, aligninfo and pairwise2. This algorithm was explicitly designed for 213 de novo clustering because standard de novo clustering approaches such as VSEARCH [28] and 214 the clustering approaches available in mothur [29,30] vastly overestimated the richness of the 215 mock community when using chopSeq corrected reads (see details below). The nanoClust 216 algorithm takes chopSeq corrected reads in fasta format, splits the reads into partitions based on 217 user-defined partition size, implements VSEARCH [28] for dereplication, chimera detection and 218 removal in each partition, and clustering for each partition to identify the partition category with 219 optimal (i.e., maximum) number of OTUs (not counting singleton OTUs) and (3) discards 220 singletons. Following this, nanoClust extracts read IDs for each OTU bin from the best performing 221 partition, the extracted read IDs for each OTU bin are then used to obtain full-length chopSeq 222 corrected reads, a subset of reads that fall within 10% of the average full-length read distribution 223 within each OTU bin are aligned using Multiple Alignment using Fast Fourier Transform 224 (MAFFT) [31, 32] with G-INS-i option, followed by consensus calling to obtain full-length 225 representative sequence for each OTU. The entire data processing workflow is shown in Figure 3. 226

Figure 3: (A)
Overview of the INC-Seq based anchor alignment and iterative consensus calling using PBDAGCON. (B) INC-Seq consensus reads were subject to chopSeq based read reorientation followed by tandem repeat removal and size selection to retain reads between 1300-1450bp. (C) chopSeq corrected reads are subject to partitioning followed by VSEARCH based binning to identify optimal binning results using partition that generates maximum number of OTUs (without singletons). MAFTT-G-INS-i was then used for sequence alignment of a subset of full length reads from each OTU bin for the best performing partition and the alignment was used to create the OTU consensus read .   13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  in high-quality data post-processing. 243 244 NanoAmpli-Seq data yield for 2D and 1D2 experiments. Runs 1, 2, 3, and 4 resulted in 29420, 245 59490, 142233, and 301432 raw records with post basecalling read lengths ranging from 5bp to 246 43kbp and 5bp to 234kbp for 2D and 1D2 sequencing protocols (Supplementary Figure S1). The 247 pass reads to total raw reads ratio ranged from 28% for 2D and 7-9% for the 1D2 experiments 248 (Table 1). It is unclear if the low yield of pass reads, particularly for the 1D2 experiments, were 249 due to the concatemerization process, DNA damage during enzymatic debranching and 250   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 64 65 mechanical fragmentation that was unrepaired in the subsequent steps, or due to basecalling issues. 251 All pass reads were subjected to INC-Seq processing to allow for consensus-based error correction 252 using reads with a minimum of three concatemers per read (i.e., reads with less than three 253 concatemers were excluded for any subsequent analyses) as compared to six concatemer threshold 254 used by Li et al [21]. The number of concatemers per read passing INC-Seq threshold ranged from 255 3 to 21 and 3 to 42 for 2D and 1D2 data (Supplementary Figure S1). The total number of reads 256 passing the three concatemer threshold ranged from 36-75% of the base called reads depending on 257 the experiment, sequencing protocol, and alignment approach during INC-Seq processing ( Table  258 1). This was significantly higher than those reported by Li et al [21], primarily due to the use of 259 three compared to six concatemer threshold recommended previously. 260

INC-Seq processed reads demonstrated incorrect read orientation and presence of tandem 264
repeats. While the median read lengths for post INC-Seq were generally in the expected range 265 (i.e., 1350-1450 bp) (Table 1)  primers not located at the ends of the reads. As a result, the reads were chopped at the primer sites 276 and re-oriented to allow for the forward and reverse primers to be correctly oriented. However, 277 during the process of read re-orientation, we also discovered the presence of inserts in the form of 278 tandem repeats. Additional inspection of these inserts revealed that they were composed of 279 multiple repetitive sequences, with the length of these inserts ranging from 10 bp to in excess of 280 1500 bp (for rare cases) with median tandem repeat size ranging from 12 to 62 bp. The proportion 281 of INC-Seq processed reads with tandem repeats varied from 60-75% but did not reveal any 282 significant effect of type of aligner used during INC-Seq consensus calling or the sequencing 283 chemistry itself. Interestingly, however, the length distribution for the tandem repeats was strongly 284 associated with the sequencing chemistry ( Figure 4). 285   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63 64 65

349
We tested the effect of the choice of partition length on the estimation of the richness of the mock 350 communities (i.e., number of observed OTUs) and overall sequence accuracy post within-OTU 351   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 MAFFT-G-INS-i alignment and consensus sequence construction. To this effect, we varied the 352 number of partitions from one (i.e., partition length of 1300 bp) to seven (i.e., partition length 180 353 bp). With increasing number of partitions (i.e., decreasing partition length), the number of OTUs 354 being detected were significantly inflated above the theoretical threshold while at the same time 355 the average sequence accuracy decreased (Figure 7).  Further, using a single partition approach also resulted in discarding a significant number of 362 sequences that were deemed singletons prior to OTU clustering. Specifically, while three less 363 OTUs were detected in the single partition approach the total number of sequences retained post 364 clustering were 20% lower as compared to when two or three partition approach was used. 365 Considering the tradeoff between sequencing depth (and the resultant impact on detection of lower 366 abundance OTUs), the extent of deviation from the theoretical number of OTUs and overall 367 sequence accuracy, we recommend using either the two or three partition approach which result in 368 similar outcomes on all three metrics. 369 370

373
The nanoClust approach with three partitions was far superior to the direct clustering of full-length 374 reads and resulted in an accurate determination of the number of OTUs, with one-two spurious 375 OTUs (when using 3 concatemers threshold for INC-Seq) and no false negatives (Table 2)  with one-two mismatches associated with the spurious OTUs (Figure 8). While accurate OTU 382 estimation allowed for single OTUs to be detected in the one organism experiment, the overall 383 community structure deviated from the theoretical community structure for the ten organism 384 experiments ( Figure S4) and thus additional protocol optimization is essential to ensure levels of 385 deviation from theoretical community structure does not exceed what may be seen from PCR 386 biases [33]. Phylogenetic analyses of the consensus sequences demonstrated close placement of 387 the OTU consensus sequences with their corresponding references, with excellent pairwise 388 alignment between the two (Supplementary Figure S3). 389   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63    The nanoClust implementation in this study included a specified threshold of a maximum of 50 395 reads per OTU to generate OTU consensus sequences. This was feasible because our study 396 focusses on a single organism and even mock community of ten organisms. Thus, the process of 397 consensus construction was not limited by the number of reads that could be recruited. However, 398 it would be critical to determine the potential for poor quality consensus sequence due to fewer 399 reads with an OTU in naturally derived mixed microbial communities. To this end, we varied the 400 number of reads used for consensus sequence construction from 5 to 100 for 2D and 1D2 data 401 from the one organism experiment (Figure 9).

Discussion and conclusions 471
The current study uses mock communities to develop and validate the NanoAmpli-Seq workflow 472 for long amplicon sequencing on the nanopore sequencing platform. While this study focusses on 473 the near full-length 16S rRNA gene, in principle the approach outlined by the NanoAmpli-Seq 474 workflow should be amenable to amplicons generated from PCR amplification of any target gene 475 irrespective of target gene length. While leveraging the previously described INC-Seq protocol, 476 NanoAmpli-Seq adds several novel components which significantly enhances the amplicon 477 sequencing workflow for the nanopore platform. The improvements over the previously described 478 INC-Seq protocol involve modifications to both library preparation (i.e. PrimPol based primer 479 synthesis for RCA, debranching and fragmentation, shorter protocol length) and the data analyses. The correction of INC-Seq consensus reads using chopSeq did not allow for sequences with high 491 enough quality for direct OTU clustering using VSEARCH. However, the read partitioning-based 492 sequence clustering allowed for accurate determination of the number of OTUs in the mock 493   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 community. Further, de novo sequence clustering using nanoClust provided the opportunity to 494 significantly increase the number of sequences used for consensus calling. In this study, we used 495 50 reads for consensus calling (i.e. 150x coverage considering three concatemer threshold set for 496 INC-Seq) which resulted in average sequence accuracy of 99.5%. The use of more than 50 reads 497 for consensus calling in nanoClust did not improve sequence accuracy while reducing the number 498 of reads resulted in reduced precision. This threshold of 50 reads for both the 2D and 1D2 499 sequencing data suggests that the OTUs with fewer than 50 reads are likely to have sequence 500 quality lower than those OTUs with greater than 50 reads. It should however be noted that using a 501 10 read threshold (i.e., 30x coverage when including three concatemer threshold for INC-Seq) 502 consistently allowed for sequence accuracy consistently higher than 99% and thus sequence 503 classification to species level using any of the current sequence classification approaches [35, 36] 504 (i.e., RDP classifier) would be reliable even for lower abundance OTUs. 505 506 While the NanoAmpli-Seq workflow represents a significant improvement in amplicon 507 sequencing on the nanopore platform, some fundamental limitations remain. For instance, the 508 NanoAmpli-Seq sequence accuracy is still lower than those reported for short amplicons [3] or 509 those generated from the assembly of SSU rRNA from metagenomic sequencing on the Illumina 510 Platform [7,9] , and the full-length 16S rRNA sequencing on the PacBio platform using the 511 approach described previously [17]. Our analysis shows that the sequence accuracy does not 512 improve with more than 50 sequences used in the nanoClust based consensus calling process. 513 Nearly all the errors in the OTU consensus sequences originate from single deletions at 514 homopolymers regions, specifically for homopolymers greater than 4 bp. This homopolymer error 515 issue on the nanopore platform is well known [14,37] and is likely to best resolved during the base 516 calling process rather than subsequent data processing or by processing signal data (rather than 517 base called data) all the way through clustering, followed by base calling as the final step. The 518 second limitation of our approach is the low data yield at the base calling step, i.e. base called 519 reads represents only a small portion (i.e., 7-9% for 1D2 data) of the raw records. This data loss is 520 significant and could potentially deter the widespread use of the nanopore platform for amplicon 521 sequencing. While the precise cause of the low yield of pass reads post base calling is unclear, the 522 proportion of pass reads in our study are not significantly different from those reported elsewhere. 523 One current option would be to directly work with 1D rather than 1D2 data. However, the 524 maximum sequence accuracy of 1D reads post INC-Seq consensus construction was only 94% and 525 unsuitable for processing with chopSeq and nanoClust. The NanoAmpli-Seq workflow includes a 526 de novo clustering step, and as long as the sequence accuracy post chopSeq is ~97% (3-10 527 concatemers required), the binning process should provide for sufficient coverage for consensus-528 based sequence correction to accuracies in excess of 99%. The final limitation of our approach is 529 that the nanoClust relies on generating consensus sequences from multiple DNA sequences and 530 thus there is the likelihood of clustering and generating a multi-species consensus from closely 531 related species, i.e. those within 97% sequence similarity to each other. While we did not find 532 evidence for this "multispecies consensus sequence" while analyzing data from Li et al. [21] which 533 included closely related organisms, this possibility cannot be ignored. And thus, we recommend 534 that researchers refrain from depositing NanoAmpli-Seq processed sequences in publicly available 535 references databases, but utilize this approach for rapid screening of mixed microbial communities 536 and limit the use of NanoAmpli-Seq processed data for within-study sample microbial community 537 comparisons. Future improvement to avoid the likelihood of "multispecies consensus sequence" 538 would be to utilize primers with barcodes consisting of random N bases (i.e., unique molecular 539 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 tags), similar to that used by Karst et al. This could allow clustering of reads originating from the 540 same original sequence using the unique molecular tags. 541 542

Methods 543
Mock community description and preparation 544 Two different mock communities were constructed for the experiments outlined in this study. First, 545 a single organism mock community was constructed by amplifying the near full-length of the 16S 546 rRNA gene from genomic DNA of Listeria monocytogens using primers sets 8F (5'-547 AGRGTTTGATCMTGGCTCAG-3') and 1387R (5'-GGGCGGWGTGTACAAG-3'), both with 548 5' phosphorylated primers (Eurofins Genomics) 26 . Phosphorylated ends are essential for the 549 subsequent self-ligation step. PCR reaction mix was prepared in 25µl volumes with use of 12.5µl 550 of Q5® High-Fidelity 2X Master Mix (New England BioLabs Inc., M0492L), 0.8µl of 10pmol of 551 each primer, 9.9µl of nuclease-free water, (Roche Ltd.) and 1ng of bacterial DNA in total followed 552 by PCR amplification as described previously 26 . PCR amplicons from replicate PCR reactions 553 were combined and purified with use of HighPrep™ PCR magnetic beads (MagBio, AC-60050) 554 at 0.45x ratio. The ten organism mock community was constructed from purified near full-length 555 16S rRNA amplicons of 10 organisms. Briefly, genomic DNA from 10 bacteria were obtained 556 from DSMZ, Germany (Supplementary Table S1) and the aforementioned primers, PCR reaction 557 mix and thermocycling conditions were used to independently PCR amplify the near full-length 558 16S rRNA gene, followed by purification using HighPrep™ PCR magnetic beads as detailed 559 above. The purified amplicons from each organism were quantified on the Qubit using dsDNA HS 560 kit, normalized to 4ng/µl, and combined to generate an amplicon pool consisting of an equimolar 561 proportion of the 16S rRNA gene amplicons of the 10 organisms. 562 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 DNA sequencing library preparation. 563 To circularize the linear amplicons into plasmid-like structures, 5µl of Blunt/TA Ligase Master 564 Mix (New England Biolabs, M0367L) was added to 55 µl of amplicon pool at a concentration of 565 1ng/µl and incubated for 10min at 15°C then 10min at room temperature to (total time = 20 566 minutes). Not all linear amplicons self-ligate into plasmid-like structures, but some are likely to 567 cause long chimeric linear amplicons. These long chimeric structures were removed using 568 magnetic bead-based purification with the following modifications. HighPrep™ PCR magnetic 569 beads were homogenized by vortexing followed by aliquoting 50µl into a sterile 2ml tube and 570 placed on a magnetic rack for 3min. A total of 25µl of the supernatant was carefully removed using 571 a sterile pipette to concentrate the beads to 2x its original concentration. The tube was removed 572 from the magnetic rack and vortexed vigorously to resuspend the beads. This concentrated bead 573 solution was used at a ratio of 0.35x to remove any amplicons greater than 2000 bp in the post-574 ligation reaction mix. Briefly, the post-ligation product was mixed with concentrated bead solution 575 at the 0.35x ratio by vortexing followed by incubation for three minutes at room temperature. The 576 tube was placed on the magnetic rack to separate the beads from solution, followed by transferring 577 of clear liquid containing DNA structures less than 2000 bp into new sterile tubes. Sample 578 containing short self-ligated molecules was subject to another round of concentration using 579 standard magnetic beads at 0.5x ratios according to manufacturer instructions and eluted in 15µl 580 of warm nuclease-free water. Concentrated and cleaned DNA pool consisting of plasmid-like 581 structures and remaining linear amplicons was then processed with Plasmid-Safe™ ATP-582 Dependent DNase (Epicentre, E3101K) reagents to digest linear amplicons using the mini-prep 583 protocol according to manufacturer instructions and was followed by another round of cleanup 584   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 with magnetic beads at 0.45x ratio as described before and then eluted in 15µl of warm nuclease-585 free water. 586

587
The pool containing plasmid-like structures was subject to RCA with use of TruPrime TM RCA Kit 588 (Sygnis, 390100) random hexamer-free protocol. Samples were prepared in triplicate and 589 processed according to manufacturer protocol with all incubations performed in triplicate for 120-590 Replicates RCA products were combined (~4.5µg of DNA in total) and subject to de-branching 599 and fragmentation of post-RCA molecules to remove hyperbranching structures generated during 600 RCA. The RCA product was first treated with T7 endonuclease I enzyme (New England BioLabs, 601 M0302S) by adding 2µl of the reagent to the 65µl of RCA product followed by vortexing and 602 incubation as recommended by the manufacturer. Subsequently, the reaction mix was transferred 603 into a g-TUBE (Covaris, 520079) and centrifuged at 1800 rpm for 4min or until the entire reaction 604 mix passed through the fragmentation hole. The g-TUBE was reversed, and centrifugation process 605 was repeated. Post debranching and fragmentation, short fragments were removed using the 606 modified bead-based cleanup step using concentrated bead solution (see above for concentration 607 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64 procedure). Concentrated beads were mixed with fragmented RCA product at a 0.35x ratio, 608 vortexed for 15sec, and incubated at room temperature for 3min then placed on a magnetic rack 609 until the beads separated and the supernatant was removed. The beads were subsequently washed 610 with 70% freshly prepared ethanol according to manufacturer protocols. Size-selected amplicons 611 bound to the beads were eluted in 41µl of warm nuclease-free water. Preliminary experiments 612 indicated that one round of de-branching did not completely resolve the hyperbranching structure, 613 which was inferred based on poor sequencing yield likely caused due to pore blocking by 614 hyperbranched DNA. As a result, a second round of enzymatic de-branching using T7 615 endonuclease I was added and the de-branched product was cleaned a second time using the bead-616 based clean-up step. Figure S4 shows an example BioAnalyzer traces of the RCA product post-617 debranching/fragmentation and post-cleanup using magnetic bead-based protocol. 618 619 Finally, the de-branched RCA product was treated with NEBNext® FFPE DNA Repair Mix (New 620 England BioLabs, M6630S) for gap filling and DNA damage repair caused during g-TUBE 621 fragmentation and T7 endonuclease I enzyme. All reagent components were combined with de-622 branched RCA product according to manufacturer recommendations and incubated at 12°C for 623 10min then at 20°C for another 10min. Post incubation, the repaired RCA product was cleaned 624 using standard magnetic beads at 0.5x ratio, washed with 70% ethanol, and eluted in 46µl of warm 625 nuclease-free water. The concentration of the DNA product was measured using Qubit and was 626 approximately 20-25 ng/µl with a total yield of ~1000 ng of DNA with product size typically 627 ranging from 1500bp to 20,000 bp. A total of 45µl DNA pool of concatamerized amplicons was 628 prepared for sequencing using the standard 2D and 1D2 library preparation protocol by Oxford 629 Nanopore Technologies (SQK-LSK208, SQK-LSK308) according to manufacturer specifications 630   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 to obtain pre-sequencing mix. Moreover, the final concentration for prepared libraries was 631 determined using dsDNA HS kit on the Qubit instrument. A detailed step-by-step protocol is 632 provided in the supplementary text. 633 634 DNA sequencing. 635 The MinION MkIB was connected to Windows personal computer compatible with Oxford 636 Nanopore Technologies requirements. R9.4 (FLO-MIN106) and R9.5 (FLO-MIN107) flow cells 637 were placed onto the MinION Mk1B (Oxford Nanopore Technologies). Platform quality control 638 was performed using MinKNOW software (v1.4.2 for 2D and v1.6.11 for 1D2 libraries). Only 639 flow cells containing above 1100 active pores were used in this study. Each flow cell was primed 640 twice according to ONT specifications using priming buffer consisting of equal parts of running 641 buffer (RBF1) and nuclease-free water with 10 min breaks between subsequent primes. The 642 loading mix was prepared with a 12µL pre-sequencing mix, 75µL of RBF1, and 63µL of nuclease-643 free water. Loading mix was sequenced with MinKNOW settings appropriate for 2D or 1D2 644 options and standard 48h processing time for every run. Albacore 1.2.4 was used to convert raw 645 signals into HDF5 file format using switch options FLO-MIN106 and SQK-LSK208 for 2D data 646 and FLO-MIN107 and SQK-LSK308 for 1D2 data. 647 648 Data processing 649 HDF5 raw signals from each sequencing run were converted to FASTQ format using Fast5-to-650 Fastq (https://github.com/rrwick/Fast5-to-Fastq) and then from FASTQ to FASTA with seqtk 651 (https://github.com/lh3/seqtk).