Multi-locus and long amplicon sequencing approach to study microbial diversity at species level using the MinION™ portable nanopore sequencer

Abstract The miniaturized and portable DNA sequencer MinION™ has demonstrated great potential in different analyses such as genome-wide sequencing, pathogen outbreak detection and surveillance, human genome variability, and microbial diversity. In this study, we tested the ability of the MinION™ platform to perform long amplicon sequencing in order to design new approaches to study microbial diversity using a multi-locus approach. After compiling a robust database by parsing and extracting the rrn bacterial region from more than 67,000 complete or draft bacterial genomes, we demonstrated that the data obtained from sequencing the long amplicon in the MinION™ device using R9 and R9.4 chemistries were sufficient to study two mock microbial communities in a multiplex manner and to almost completely reconstruct the microbial diversity contained in the HM782D and D6305 mock communities. Although nanopore-based sequencing produces reads with lower per-base accuracy than other platforms, we present a novel approach, consisting of multi-locus and long amplicon sequencing using the MinION™ MkIb DNA sequencer and R9 and R9.4 chemistries, that helps to overcome the main disadvantage of this portable sequencing platform. Furthermore, the nanopore sequencing library constructed with the latest releases of pore chemistry (R9.4) and sequencing kit (SQK-LSK108) yielded a higher level of 1D read accuracy, sufficient to characterize the microbial species present in each mock community analysed. Improvements in nanopore chemistry, such as minimizing base-calling errors, and new library protocols able to produce rapid 1D libraries will provide more reliable information in the near future. Such data will be useful for more comprehensive and faster specific detection of microbial species and strains in complex ecosystems.


During the last two years, DNA sequencing based on single-molecule technology has completely changed the perception of genomics for scientists working in a wide range of scientific fields. This new perspective is supported not only by the technology itself but also by the affordability of these sequencing instruments. In fact, in an unprecedented move, Oxford Nanopore Technologies (ONT) released the first miniaturised and portable DNA sequencer in early 2014, within the framework of the MinION™ Access Programme. Recently, the MARC consortium (MinION Analysis and Reference Consortium) published results on the reproducibility and global performance of the MinION™ platform. These results indicate that the platform is susceptible to large stochastic variation, essentially derived from the wet-lab and MinION™ operating methods, but also that this variability has minimal impact on data quality [1].

The coordinated and collaborative work and mutual feedback between industry and the scientific community have enabled ONT to rapidly improve its portable platform for DNA sequencing, minimizing the stochastic variation during DNA library preparation. Consequently, in late autumn 2015, ONT released MkIb, the latest version of MinION™, and in April 2016 the fast mode chemistry (R9) was released, increasing the rate of sensing DNA strands from 30-70 to 280-500 bp/sec and reaching up to 95% per-base accuracy in 2D reads (Clive G. Brown, CTO ONT, personal communication).

One of the most attractive capabilities of the MinION™ platform is the sequencing and assembly of complete bacterial genomes using exclusively nanopore reads [2] or through hybrid approaches [3,4]. Moreover, the MinION™ platform has also proved useful in other relevant areas including human genetic variant discovery [5,6], detection of human pathogens [7,8], detection of antibiotic resistance [9,10], and microbial diversity [11,12]. Regarding the latter, microbial diversity and taxonomic approaches are common and in high demand for analysing the microbiota associated with a wide variety of environment- and human-derived samples. However, these analyses are greatly limited by the short-read strategies commonly employed. Thanks to improvements in the chemistry of the most popular sequencing platforms in recent years, it is now possible to characterise microbial communities in detail, down to the family or even genus level, using genetic information derived from roughly 30% (~500 nt) of the full 16S rRNA gene. Despite the massive coverage achieved with short-read methods, the limitation in terms of read length means taxonomic assignment at the species level is still unfeasible.
For instance, taxonomy strategies based on short reads from the Illumina MiSeq platform offer limited information that underestimates the microbial diversity of complex samples when compared with alternative approaches based on long DNA reads [13]. Consequently, implementation of long-read sequencing approaches to study larger fragments of marker genes will permit the design of new studies to provide evidence for the central role of precise bacterial species/strains in a great variety of microbial consortia. Recent studies in this regard have shown important advances in taxonomy analysis using long reads generated by single-molecule technologies [11,14,15], indicating that the expansion or inclusion of more hypervariable regions in the analysis overcomes the disadvantage of working with error-prone DNA reads. With respect to the above, we have recently explored the performance of the MinION™ device. That study demonstrated that sequencing nearly full-length 16S rRNA gene amplicons is a feasible way to study microbial communities through nanopore technology [11]. We wanted to take this type of strategy a step further, gaining more specificity by including several hypervariable markers in the analysis, at the sequence and structural level, by designing a multi-locus and long amplicon sequencing method to study microbial diversity. At the same time, we also wanted to explore the affordability of the MinION™ technology for microbial diversity analyses by multiplexing several samples in a single MinION™ flowcell.
Accordingly, here we present a study of the 16S, 23S, and the internal transcribed spacer

After normalization of cluster numbers against the median size of the respective regions analyzed, and referenced against the numbers obtained for the 16S region at 97% sequence identity, we found that the rrn region comprising the 16S, ITS, and 23S coding regions exhibits more than 4-fold more variation than that observed for the 16S molecule alone (at 100% sequence identity). As expected, the 23S region exhibited more diversity than the 16S region, almost 2-fold more, because it contains more hypervariable regions. Strikingly, the ITS regions showed similar levels of genetic diversity despite being, on average, almost one fourth of the size of the 16S region. When parsing the genetic information of over 67,000 bacterial genomes, we observed that the ITS region frequently encodes one or several tRNA genes and is highly variable in length as well. Consequently, the variability observed in the rrn was the largest observed and was thought to be meaningful for the aims of this study. We obtained data supporting this notion by counting the rrn clusters (at 100% identity) matching the most predominant species in the database, thus retrieving 1,713, 1,276, and 1,273 rrn clusters annotated for Escherichia coli, Streptococcus pneumoniae, and Staphylococcus aureus, respectively. In consequence, the rrn is able to accumulate enough sequence variability to discern taxonomy even at strain level.

Performance of the R9 chemistry. Once we had compiled a reference database for comparison, we proceeded with the amplicon library construction and sequencing run, obtaining raw data consisting of 17,038 reads, almost all of which were classified as 1D reads.
For reference, the DNA reads derived from the MinION™ device can be classified into three types: '1D template', '1D complement', and '2D' reads. The latter, 2D reads, are products of aligning and merging sequences from the template (read from the leader adapter) and complement reads (a second adapter, called the hairpin or HP adapter, must be generated), produced from the same DNA fragment. These contain a lower error rate, owing to strand comparison and mismatch correction. In addition to the technical issues indicative of a bad ligation of the HP adapter, we obtained 93% of reads (~15,900 reads) during the first 16 h of the run; thus, we obtained lower sequencing performance after re-loading with the second aliquot of the sequencing library and extending the run for another 24 h (40 h in total). The fasta sequences were filtered by retaining those between 1,500 and 7,000 nt in length, thus obtaining at least enough sequence information to compare a DNA sequence equivalent to the 16S rRNA gene length. After this filtering step, we retained 72% of sequences (12,278) and then performed the respective barcode splitting. For this purpose, we modified the default parameters of the "split_barcodes.pl" perl script (Oxford Nanopore Technologies) by incorporating the information of the extended barcodes (Table 1), rather than the barcode information alone, and simultaneously increased the stringency parameter to 25 (14 by default). After concatenating the reads obtained from the respective forward and reverse extended barcodes, we retrieved a total of 2,019 (52% from forward and 48% from reverse barcodes) and 1,519 (53% from forward and 47% from reverse barcodes) 1D reads for the HM782D and D6305 mock communities, respectively.
Read-mapping was performed against the rrn database, which compiles more than 22,000 rrn regions retrieved from more than 67,000 genomes available in GenBank (see Availability of supporting data). The taxonomy associated with the best hit, based on the competitive alignment score followed by filtering steps (see Methods), was used to determine the structure of each mock community. The MinION™ sequencing data produced the microbial structure presented in Figure 3 for the mock communities HM782D and D6305.

Figure 3 shows the bacterial species and their respective relative proportions retrieved from the analysis of the mock communities HM782D and D6305. With respect to the HM782D mock community, we were able to recover 20 representative species, accounting for 16 out of 20 species present in that artificial community (Figure 3A). However, the remaining four species, which are apparently absent from this community, are closely related to others detected correctly, namely Bacillus subtilis, Bacillus thuringiensis, Bacillus anthracis, and Propionibacterium sp. Furthermore, we were unable to report the presence of just four species present in HM782D; in the case of Rhodobacter sphaeroides and Actinomyces odontolyticus, their proportions were below the predominance threshold (1%), at 0.25 and 0.12%, respectively. Similarly, a further 40 species, distinct from but closely related to those present in the HM782D mock community (Bacillus spp., Streptococcus spp., Clostridium spp., Neisseria spp., Staphylococcus spp., and Listeria spp.), had minor representation in data derived from rrn sequencing. With respect to the lower proportions of Rhodobacter sphaeroides and Actinomyces odontolyticus, we have previously demonstrated that the low levels of 16S reads are a consequence of amplification bias derived from the PCR reaction and not from sequencing itself [11].
In this case, the new primer pair used to generate the long amplicons seems to work more efficiently than those previously used, but apparently still presents issues at the level of bacterial coverage. When we revised the whole taxonomy contained in our rrn database, the fact that no rrn regions had been compiled for Deinococcus radiodurans and Helicobacter pylori partially explained the lack of these species in HM782D as analysed by the present approach. However, a new alignment process using individual 16S and 23S rRNA sequences obtained from GenBank, including those for D. radiodurans and H. pylori, demonstrated that at least D. radiodurans could be identified in a higher proportion than A. odontolyticus and R. sphaeroides, albeit in a lower proportion than our predominance threshold. Regarding the results obtained from the D6305 mock community, we found a total of 10 bacterial species present in this mixed DNA sample, eight of which matched the expected structure of the community; additionally, 18 closely related species had minor representation (Bacillus spp., Enterococcus spp., Klebsiella spp., Lactobacillus spp., Streptococcus spp., and Staphylococcus spp.). Using the MinION™ data we were able to recover 100% of the species present in this sample, and the two additional members identified also have a close relationship within the Bacillus genus, as observed in the HM782D sample (Figure 3B). We determined that the coverage needed to retrieve all expected species with an abundance above 1% in a non-even mock community is ~13X in terms of the number of species of that community.

When compared to reference values and proportions theoretically expected for the species present in the two mock communities, we observed some deviations that were greater in certain species.
Particularly, in the HM782D sample the lowest coverage biases were observed for Actinomyces odontolyticus (-5.36), Rhodobacter sphaeroides (-4.36), and Enterococcus faecalis (-2.04). This indicates that such species, in addition to D. radiodurans and H. pylori, are more difficult to detect with the primers and PCR used here. By contrast, Escherichia coli (1.79) seems to be preferentially amplified, given that this species exhibited the highest positive coverage bias value (Figure 3C). We again found that coverage bias is linearly correlated with the PCR products generated, as quantified for E. coli, L. gasseri, and B. vulgatus amplicons (Pearson's r = 0.82, p = 0.047), indicating that there are no major issues during taxonomy assignment caused by over-representation of certain species in the reference database. The values obtained for D6305 were more homogeneous, and the lowest coverage bias was observed for Lactobacillus fermentum (-2.18) (Figure 3D). Additional analysis indicated that there was no significant correlation between coverage bias and GC content in rrn. Although the low coverage bias for some species can be solved by selecting another pair of primers, the ability to recover almost all of them, at least in a low proportion, in itself represents an important attribute of this approach for inter-sample comparisons. Interestingly, we observed a similar pattern of overrepresentation of Bacillus spp. sequences (>50%) in the D6305 sample, but not of Escherichia spp. sequences (~4%) in the HM782D mock community, when Illumina MiSeq data were assessed (Figure 3C-D).
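
The coverage-bias values quoted above are log2 fold changes of observed against expected read counts. The following is a minimal sketch of that calculation, assuming an expected proportion is known for every species in the community. The function name, the input layout and the example counts are hypothetical and are not data from this study.

```python
import math


def coverage_bias(observed_counts, expected_fractions):
    """Return log2(observed proportion / expected proportion) per species.

    observed_counts: dict mapping species name to number of assigned reads.
    expected_fractions: dict mapping species name to its theoretical
        proportion in the community (values between 0 and 1).
    Species with no reads get None, because log2(0) is undefined.
    """
    total = sum(observed_counts.values())
    if total == 0:
        raise ValueError("no reads were assigned to any species")
    bias = {}
    for species, expected in expected_fractions.items():
        observed = observed_counts.get(species, 0) / total
        if observed == 0 or expected <= 0:
            bias[species] = None
        else:
            bias[species] = math.log2(observed / expected)
    return bias


if __name__ == "__main__":
    # Hypothetical three-species community, for illustration only.
    counts = {"species_a": 50, "species_b": 25, "species_c": 25}
    expected = {"species_a": 0.25, "species_b": 0.25, "species_c": 0.5}
    for name, value in coverage_bias(counts, expected).items():
        print(name, value)
```

A negative value means the species is under-represented relative to the theoretical composition, and a positive value means it is over-represented.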

The high error rate of the 1D reads (ranging between 70 and 87% sequence identity, according to high quality alignments) makes barcode de-multiplexing a difficult task in nanopore data. However, our results indicate that with the configuration and parameters presented here we could efficiently distinguish the reads generated from HM782D and D6305 amplicons. As a consequence, the ability of this long amplicon approach to properly assign microbial communities to samples was efficiently supported by the de-multiplexing parameters, which were central to discerning the reads obtained from the respective samples multiplexed in the MinION flowcell. For instance, the distribution of reads matching closely related species such as Lactobacillus gasseri and Lactobacillus fermentum, contained distinctively in the HM782D and D6305 samples, was indicative of the adequate execution of the de-multiplexing pipeline. The above was also exemplified by Salmonella enterica sequences, which were detected only in D6305 despite the close relationship of this species with E. coli at the 16S and 23S sequence level (close to 100%). species was inspected directly distinguishing the ITS as the major source of variation between the two species. Indeed, this was corroborated by the comparative analysis performed during the clustering step of the reference samples to create our rrn database.

Performance of R9.4 chemistry. During the course of the present work, the MinION R9.4 chemistry was released in autumn 2016. Therefore, we wanted to perform a replicate experiment using this chemistry in order to determine how much improvement our approach would gain in terms of sensitivity and specificity.
With only a 3 h run we observed a notable improvement in throughput and per-base accuracy: the MinION™ produced almost 40,000 reads with a predominant QScore distribution between 8 and 12, suggesting a theoretical read error rate between 0.15 and 0.06, respectively, lower than that obtained from R9 reads (0.25 to 0.15). After compiling all sequences in a fasta file, we filtered them in the same manner as for the R9 data. Consequently, we retained more than 33,000 reads (86%) for further processing and taxonomy assignment. The major results from the comparison between the R9 and R9.4 runs are summarized in Table 2. As expected, the R9.4 dataset was more accurate and its reads showed a lower per-base error rate; therefore, the taxonomy analysis based on these reads would be more precise than that observed with R9 reads. Globally, the results obtained from R9.4 chemistry are very similar to those observed with R9 chemistry, but the level of uncertainty was diminished by reducing the number of species closely related to those contained in the respective mock communities and exhibiting very low abundance (<1%), which decreased from 40 species to 15 for HM782D and from 18 to 16 for D6305. We were again unable to recover D. radiodurans and H. pylori reads, but we improved the sensitivity for A. odontolyticus and R. sphaeroides (Figure 3C and 3D), whose relative proportions almost doubled in R9.4 data (R. sphaeroides = 0.44%, A. odontolyticus = 0.31%). We compared the respective proportions obtained from R9 and R9.4 chemistries, obtaining consistent results (Figure 3E), indicating that our approach is reproducible with no major changes despite the different chemistry and kits for library preparation used during both sequencing runs.

OTUs level, we retrieved a total of 14 sequences whose taxonomy identification is presented in the Supplementary Material 3. In this case, only S.
enterica could be identified at species level. Given that data derived from this short-read approach normally cannot reach a reliable taxonomy assignment down to species level, we proceeded to make comparisons with R9 and R9.4 data by compiling the latter information at genus level in order to evaluate the performance of our approach against a commonly used procedure. In the and coverage bias is depicted. We observed no larger deviations in data retrieved with MinION relative to those obtained with conventional approaches such as the study of V4-V5 regions with the MiSeq platform. Interestingly, we observed a similar pattern of important negative coverage bias in all three approaches for Actinomyces spp., Enterococcus spp., and Rhodobacter spp. in the HM782D community and for Lactobacillus spp. and Listeria spp. in the D6305 community, then suggesting that species

improvement at this regard with no major differences when compared with data from MiSeq platform (Table 3).

read-length issues inherent to second-generation sequencing methods. These advances allow researchers to infer taxonomy and analyse diversity from the almost full-length bacterial 16S rRNA sequence [11,14,15,17]. Particularly, the ONT platform deserves special attention given its portability and its fast development since the MinION™ became available in 2014. Nevertheless, this technology is susceptible to large stochastic variation, essentially derived from the wet-lab methods [1]. We corroborated this issue by obtaining a sequencing run where the raw data predominantly consisted of 1D reads as a consequence of the HP adapter ligation failure, despite following the manufacturer's instructions.
However, we were able to develop an efficient analysis protocol where the higher read quality offered by R9 chemistry and the updated Metrichor basecaller protocol proved pivotal to obtain 1D reads with a range of identity between 70 and 86%, with sufficient per-base accuracy to successfully perform the taxonomic analyses described herein. Moreover, during the course of this study the R9.4 flowcells were released and we were able to replicate our approach using this improved pore chemistry and the SQK-LSK108 kit for 1D libraries, obtaining reads with sequence identity up to 92%.

Our preliminary results indicated that the rrn region in bacteria preferentially has a unique conformation (with the transcriptional arrangement of 16S-ITS-23S) and that we could amplify this ~4.5 kbp region with the selected S-D-Bact-0008-c-S-20 and 23S-2241R primer pair. Once we had established the feasibility of amplifying the rrn, our approach comprised the study of two different mock communities in a multiplex manner, combined in a single MinION™ flowcell. By designing the respective forward and reverse primers tagged with specific barcodes recommended by ONT, we were able to retrieve extended barcode-associated reads, in spite of the large proportion of per-base errors contained in these types of reads. Using MinION™ data based on multi-locus markers and long amplicon sequencing, we could reconstruct the structure of two commercially available mock communities. Although the expected proportions of some species in each community exhibited an important coverage bias, we were able to recover 80% (HM782D) and 100% (D6305) of bacterial species from the respective mock communities. Consequently, future analyses should be conducted to find an appropriate PCR approach using primers with a higher coverage for bacterial species.

We have analysed a great amount of genetic information with the aim of compiling a valuable database containing the genetic information for the rrn present in over 67,000 draft and complete bacterial genomes. The global length distributions in the region indicated that the rrn was 4,993 ± 187 bp in length, whereas the 16S, ITS, and 23S sub-regions were 1,612 ± 75, 488 ± 186, and 3,036 ± 160 bp in length, respectively. Using this genetic information of the rrn, clustered at 100% sequence identity, enabled us to establish a multi-locus marker able to discriminate the taxonomy of two mock communities containing very close species. This was possible given that simultaneous analysis of the 16S, ITS, and 23S molecules offered almost 40-fold more diversity than studying the 16S, ITS, or 23S sequences separately and at 97% sequence identity. Moreover, the ITS was distinguished individually as an important variable genetic region in terms of sequence and length. Furthermore, it contributes notably to the higher variability observed in the rrn region, a fact evidenced in previous studies [18-21]. The accumulation of a larger number of variable sites in the rrn region, together with the particular structural variation of the ITS to potentially accommodate and encode tRNA genes, is thought to be central to discriminating bacterial species, despite the large proportion of per-base errors contained in MinION™ reads. Our data indicate that our MinION reads produce alignments with an average length of 2,463 and 3,191 bases for HM782D and D6305, respectively, using R9 chemistry and 4,173 and 4,115 bases for HM782D and D6305, respectively, using R9.4 chemistry. Consequently, the taxonomy assignment was predominantly based on the variability of more than two out of the three markers included in the rrn, regardless of whether reads were produced from the 16S or 23S edges of rrn amplicons.
We expect this type of analysis to become more accurate over time as nanopore chemistry improves in the near future, with the concomitant increase in throughput, which is pivotal to disclose the hundreds of species present in complex microbial communities for analysis in human or environmental studies. Therefore, the multi-locus, long amplicon and multiplex methods described here represent a promising analysis routine for microbial and pathogen identification, relying on the sequence variation accumulated in approximately 5 kbp of DNA, roughly accounting for the assessment of 0.125% of an average bacterial genome (~4 Mbp). Nevertheless, we cannot ignore that the current state of this approach presents some limitations in terms of the completeness of the rrn database created as well as the efficiency of the primers used to generate the long amplicons, which have to be revisited in order to improve and increase the coverage of bacterial species. To date, our database includes rrn sequences from 2,479 different species grouped into 918 different genera. In consequence, studies must urgently be undertaken to generate a more complete database including the rrn genomic information from species inhabiting complex and real samples such as those derived from the human body.

The rrn database
We built a database containing the genetic information for the 16S and 23S rRNA genes and the ITS sequence in all the complete and draft bacterial genomes available in the NCBI database (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria). A total of 67,199 genomes were analysed by downloading the "fna" files and parsing for rRNA genes in the respective "gff" annotation file. Chromosome coordinates for rrn regions were parsed and used to extract these DNA sequences from complete chromosomes or assembled DNA contigs. The resulting rrn sequences were analysed and the length distribution was assessed.
We retrieved a total of 47,698 rrn sequences with an average length of 4,993 nt. By selecting the size distribution equal to the 99th percentile (two-sided), we discarded potentially incomplete or aberrantly annotated rrn sequences and observed that rrn sequences can be found between 4,196 and 5,790 nt; within these boundaries, our rrn database finally accounted for a total of 46,920 sequences. Equivalent databases were built by parsing the respective rrn sequences with the RNAmmer tool to discriminate the 16S, ITS, and 23S rRNA sequences [24]. To reduce the redundancy of our rrn database while maintaining the potential discriminatory power at strain level, we performed clustering analysis using the USEARCH v8 tool for sequence analysis with the option -otu_radius_pct equal to 0 [25], thus obtaining a total of 22,350 reference sequences. For comparison, the rrn database and the 16S, ITS, and 23S databases were also analysed using the option -otu_radius_pct with values ranging from 1 to 3. For access to the rrn database and the respective species annotation, see Availability of supporting data.
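
The length-based trimming of the rrn database described above can be sketched as follows. This is an illustration only, not the code used in the study: it assumes that "two-sided" means cutting an equal tail from each end of the length distribution, and the function names, the `coverage` parameter and the percentile interpolation rule are hypothetical choices.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a pre-sorted list."""
    if not sorted_values:
        raise ValueError("cannot compute a percentile of an empty list")
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def filter_by_central_length(records, coverage=99.0):
    """Keep sequences whose length lies within the central `coverage` percent.

    records: list of (name, sequence) tuples.
    Returns the retained records and the (low, high) length bounds used.
    """
    lengths = sorted(len(sequence) for _, sequence in records)
    tail = (100.0 - coverage) / 2.0  # assumed: equal tail trimmed at each end
    low = percentile(lengths, tail)
    high = percentile(lengths, 100.0 - tail)
    kept = [(name, seq) for name, seq in records if low <= len(seq) <= high]
    return kept, (low, high)


if __name__ == "__main__":
    # Synthetic records: 200 plausible rrn lengths plus two aberrant entries.
    demo = [("rrn_%d" % i, "A" * 5000) for i in range(200)]
    demo += [("truncated", "A" * 100), ("chimeric", "A" * 9000)]
    kept, bounds = filter_by_central_length(demo)
    print(len(kept), bounds)
```

On real data the bounds would be read off the observed distribution, as with the 4,196 and 5,790 nt limits reported above.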

MinION data analysis
Read-mapping was performed using the LAST aligner v.189 [26] with parameters -q1 -b1 -Q0 -a1 -r1. Each 1D read was compared in a competitive way against the entire rrn database and the best hit was selected as that with the highest alignment score. Alignment length as well as alignment coordinates in target and query sequences were parsed from the LAST output and the sequence identity between matched regions was calculated using the python Levenshtein distance package. Iterative processing was used to determine thresholds for detection by evaluating the taxonomy distribution with read subsampling and different levels of sequence identity in top-scored alignments. High quality alignments were selected by filtering out those with identity values up to the 50th percentile of the distribution of identity values of all reads per sample (~69%) in the R9 run. Therefore, taxonomy assignment was based exclusively on alignments with ≥ 70% identity. For data derived from R9.4 chemistry, high quality alignments were selected by filtering out those with identity values up to the 25th percentile of the distribution, thus retaining alignments with ≥ 81% identity. Basic stats, distributions, filtering, and comparisons were performed in R v3.2.0 (https://cran.r-project.org). For relative quantification of species, singletons were removed and the microbial species considered to be predominantly present in the mock communities were those with a relative proportion ≥ 1%, a value that proved discriminative enough to always obtain the expected microbial diversity during the iterative processing of alignments. The coverage bias was calculated as the fold-change (log2) of species-specific read counts against the expected (theoretical) average for the entire community according to information provided by the manufacturers.
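
The identity filter described above can be illustrated with a short sketch. It assumes that identity is computed as one minus the edit distance divided by the length of the longer matched region, which is one common convention and is not stated in the text. The edit distance is implemented directly rather than calling the Levenshtein package, and all function names and example sequences are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def identity(query_region, target_region):
    """Assumed definition: 1 - edit distance / length of the longer region."""
    longest = max(len(query_region), len(target_region))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(query_region, target_region) / longest


def keep_high_quality(alignments, min_identity=0.70):
    """Retain (read_id, query_region, target_region) tuples above the cutoff."""
    return [aln for aln in alignments if identity(aln[1], aln[2]) >= min_identity]


if __name__ == "__main__":
    # Toy matched regions, far shorter than real rrn alignments.
    demo = [("read_1", "ACGTACGTAC", "ACGTTCGTAC"),
            ("read_2", "AAAAAAAAAA", "CCCCCCCCCC")]
    print([aln[0] for aln in keep_high_quality(demo)])
```

The default cutoff of 0.70 mirrors the R9 threshold given in the text; for R9.4 data the corresponding value would be 0.81.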
arrangements expected for rrn and tested experimentally using two sets of primer pairs (see small arrows drawn in each configuration). B - Agarose gel electrophoresis of PCR reactions performed under the two hypothetical arrangements of rrn; lanes: 1) 1kb ruler (Fermentas), 2) PCR reaction from the top configuration in panel A, 3) PCR reaction from the bottom configuration in panel A. The GelAnalyser Java application was used to perform the band size analysis of the 1kb ruler standard (C) and the amplicons obtained from human faecal DNA (D).

Figure 2. Variability of the rrn region and its functional domains. The rrn database compiled after parsing more than 67,000 draft and complete bacterial genomes was assessed by clustering analysis at different levels of sequence identity: 97 (white bars), 98 (light grey bars), 99 (dark grey bars), and 100% (black bars). For comparison, the functional DNA sequences encoded in the rrn region were also studied individually. The normalized diversity (y axis) was obtained by calculating the number of clusters from each analysis, normalized by the median size of the respective region in kb, and referenced against the value obtained for 16S sequences clustered at 97%, the canonical threshold for species assignment.

Figure 3. Microbial structure of the mock communities. A and B - microbial species and respective relative proportions determined to be present in the HM782D and D6305 mock communities, respectively, following the analysis of raw data obtained from rrn amplicon